Hello, when a website that is served gzip-compressed is crawled with the SMILA web crawler, no content is received. This is due to a bug in the GZIPUtils class, method unzipBestEffort(byte[], int). The important line is 98:

    if ((written + size) > sizeLimit) { outStream.write(buf, 0, sizeLimit - written); ... }

Here "sizeLimit" is set to 0 and "written" is also 0, so zero bytes are written. We crawled the single site www.wanderkompass.de. The "sizeLimit" comes from a static property file (not the web.xml), which is strange because all properties of the crawler should be configurable via the web.xml file.
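To make the failure concrete, here is a minimal sketch of the decompression loop (the loop structure is my reconstruction; only the quoted condition from line 98 and the values sizeLimit = 0, written = 0 come from the actual class):

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.util.zip.GZIPInputStream;

    public class UnzipSketch {
        public static byte[] unzipBestEffort(byte[] in, int sizeLimit) throws IOException {
            GZIPInputStream inStream = new GZIPInputStream(new ByteArrayInputStream(in));
            ByteArrayOutputStream outStream = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int written = 0;
            int size;
            while ((size = inStream.read(buf)) != -1) {
                if ((written + size) > sizeLimit) {
                    // With sizeLimit == 0 and written == 0 this writes
                    // sizeLimit - written == 0 bytes and then stops: no content at all.
                    outStream.write(buf, 0, sizeLimit - written);
                    break;
                }
                outStream.write(buf, 0, size);
                written += size;
            }
            return outStream.toByteArray();
        }
    }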
We noticed that we had set the property "MaxLengthBytes" in the web.xml file to 0, assuming that this means there is no restriction on the content size. It would be nice if the actual behaviour could be changed to match that interpretation.
Hi Nils, I added a check that sizeLimit is > 0 to GZIPUtils.unzipBestEffort(byte[] in, int sizeLimit). Could you please check if this solves your problem? Thanks, Daniel
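The change amounts to something like the following guard (a sketch only; the committed code may differ). With sizeLimit <= 0 the limit is now ignored and the full content is written:

    if (sizeLimit > 0 && (written + size) > sizeLimit) {
        outStream.write(buf, 0, sizeLimit - written);
        break;
    }
    outStream.write(buf, 0, size);
    written += size;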
The Connectivity framework was replaced by the new Importing framework, which uses java.util.zip.GZIPInputStream directly. So IMHO this bug entry is no longer relevant. Bye, Andreas
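For reference, reading gzip content with java.util.zip.GZIPInputStream directly looks roughly like this (illustrative only, not taken from the SMILA sources):

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.util.zip.GZIPInputStream;

    public class GunzipExample {
        public static byte[] gunzip(byte[] compressed) throws IOException {
            try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed));
                 ByteArrayOutputStream out = new ByteArrayOutputStream()) {
                byte[] buf = new byte[8192];
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n); // no size limit is applied here
                }
                return out.toByteArray();
            }
        }
    }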