Bug 334106

Summary: Gzip-encoded site delivers no content
Product: z_Archived
Component: Smila
Reporter: nils.thieme
Assignee: Project Inbox <smila.irms-inbox>
Status: CLOSED WONTFIX
QA Contact:
Severity: enhancement
Priority: P3
CC: andreas.schank, daniel.stucky
Version: unspecified
Target Milestone: ---
Hardware: PC
OS: Linux
Whiteboard:

Description nils.thieme CLA 2011-01-12 09:18:17 EST
Hello,

If a gzip-compressed website is crawled with the SMILA web crawler, no content is received. This is due to a bug in the GZIPUtils class (method "unzipBestEffort(byte[], int)").

The important line is 98:
      if ((written + size) > sizeLimit) {
          outStream.write(buf, 0, sizeLimit - written);
          ...
      }

Both "sizeLimit" and "written" are 0, so zero bytes are written. We crawled a single site: www.wanderkompass.de.

The "sizeLimit" comes from a static property file (not the web.xml), which is strange because all properties of the crawler should be configurable via the web.xml file.
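
To make the arithmetic of the quoted branch concrete, here is a minimal sketch. It is not the SMILA source: the class and method names are hypothetical, and the "chunk fits" path is an assumption, since the report only quotes the truncating branch.

```java
// Hypothetical helper, not part of SMILA: it mirrors only the arithmetic
// of the branch quoted in the report.
final class SizeLimitArithmetic {

    /** Number of bytes the quoted branch would hand to outStream.write(...). */
    static int bytesWrittenByQuotedBranch(int written, int size, int sizeLimit) {
        if ((written + size) > sizeLimit) {
            // Truncate the chunk to whatever is left of the limit.
            return sizeLimit - written;
        }
        // Assumed behaviour when the chunk fits: write it completely.
        return size;
    }

    public static void main(String[] args) {
        // sizeLimit = 0, nothing written yet, first chunk of 1024 bytes.
        System.out.println(bytesWrittenByQuotedBranch(0, 1024, 0)); // prints 0
    }
}
```

With a limit of 0 the condition is true for any non-empty chunk, and the number of bytes to write is 0 - 0 = 0, which matches the empty content we see.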
Comment 1 nils.thieme CLA 2011-01-12 09:34:48 EST
We have noticed that we set the property "MaxLengthBytes" in the web.xml file to 0. We assumed this meant that there is no restriction on the size. It would be nice if the current behaviour could be changed.
Comment 2 Daniel Stucky CLA 2011-01-18 11:00:53 EST
Hi Nils,

I added a check that sizeLimit is > 0 to GZIPUtils.unzipBestEffort(byte[] in, int sizeLimit). Could you please check whether this solves your problem?

Thanks,
Daniel
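
For readers without access to the repository, a check of this kind could look roughly like the sketch below. This is an assumption about the shape of the fix, not the committed change: the class name, the gzip helper, the buffer size and the loop structure are all hypothetical. Only the method signature and the truncating branch come from this report.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch, not the actual GZIPUtils class.
final class GuardedGzipSketch {

    /** Decompresses gzip data; a sizeLimit <= 0 is treated as "no limit". */
    static byte[] unzipBestEffort(byte[] in, int sizeLimit) {
        ByteArrayOutputStream outStream = new ByteArrayOutputStream();
        try (GZIPInputStream inStream = new GZIPInputStream(new ByteArrayInputStream(in))) {
            byte[] buf = new byte[1024];
            int written = 0;
            int size;
            while ((size = inStream.read(buf)) != -1) {
                // The guard: only truncate when a positive limit is configured.
                if (sizeLimit > 0 && (written + size) > sizeLimit) {
                    outStream.write(buf, 0, sizeLimit - written);
                    break;
                }
                outStream.write(buf, 0, size);
                written += size;
            }
        } catch (IOException e) {
            // Best effort: fall through and return what was decompressed so far.
        }
        return outStream.toByteArray();
    }

    /** Test helper: gzip-compresses a byte array in memory. */
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) {
        byte[] gz = gzip("hello gzip world".getBytes(StandardCharsets.UTF_8));
        System.out.println(unzipBestEffort(gz, 0).length); // full content, limit disabled
        System.out.println(unzipBestEffort(gz, 5).length); // truncated to 5 bytes
    }
}
```

With the guard in place, MaxLengthBytes = 0 behaves as "unlimited", which is what the reporter expected, while positive limits still truncate.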
Comment 3 Andreas Schank CLA 2012-12-19 08:11:17 EST
The Connectivity framework was replaced by the new Importing framework.

The new importing framework uses java.util.zip.GZIPInputStream directly. So IMHO, this bug entry is no longer relevant.

Bye,
Andreas
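
For reference, reading a gzip stream directly with java.util.zip.GZIPInputStream, without any size-limit logic, can be sketched as follows. This is a generic illustration and not code from the Importing framework; the class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

// Hypothetical illustration, not taken from SMILA.
final class DirectGzipRead {

    /** Reads the whole gzip stream and returns the decompressed bytes. */
    static byte[] gunzip(InputStream compressed) throws IOException {
        try (GZIPInputStream gz = new GZIPInputStream(compressed);
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buf = new byte[4096];
            int n;
            // read(...) returns -1 once the end of the compressed data is reached.
            while ((n = gz.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        }
    }
}
```

Because nothing is truncated here, a limit of 0 cannot produce empty content. Unlike a best-effort variant, malformed input surfaces as an IOException.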