If the web crawler encounters a web site whose encoding meta tag is not valid, for example the value "none", it throws an UnsupportedCharsetException. The exception occurs in line 656 of the WebCrawler class.
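For reference, the behaviour can be reproduced outside the crawler with plain JDK calls. This is only a minimal sketch of what the JDK does with such a declared value, not the actual code at line 656 of the WebCrawler class:

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.UnsupportedCharsetException;

public class CharsetMetaTagDemo {
    public static void main(String[] args) {
        // Value as it might be taken from a meta tag such as <meta charset="none">.
        String declaredEncoding = "none";
        try {
            Charset charset = Charset.forName(declaredEncoding);
            System.out.println("Resolved charset: " + charset);
        } catch (UnsupportedCharsetException e) {
            // "none" is a syntactically legal charset name that no JVM supports,
            // so this is the exception the crawler runs into.
            System.out.println("Unsupported charset: " + e.getCharsetName());
        } catch (IllegalCharsetNameException e) {
            // Thrown instead when the declared value is not even a legal charset name.
            System.out.println("Illegal charset name: " + e.getCharsetName());
        }
    }
}
```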
Hi Nils, thanks for your bug report. Obviously the problem resides in the data to be crawled: in your example the HTML file declares an invalid encoding. From my point of view, throwing an UnsupportedCharsetException is a reasonable response to such input. Even if the declared encoding were valid, the same exception would be thrown if the underlying platform did not support it. What behavior do you expect when an invalid encoding is specified in the HTML? Ignore the meta information and proceed without it, trying to detect the encoding from the bytes and falling back to the default encoding UTF-8 if that fails? In some cases this may work out fine, in others it will lead to incorrectly decoded characters. Bye, Daniel
Hi Daniel, we want our crawler to behave like a normal web browser such as Firefox. Firefox can determine the encoding from the raw bytes with good probability. The detection library is also available as a Java library: http://code.google.com/p/juniversalchardet/ . We have integrated this library to detect the encoding automatically; if detection fails we assume UTF-8.
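To illustrate the idea, here is a minimal sketch of such a detect-then-fall-back step, assuming juniversalchardet is on the classpath. The class and method names (EncodingGuesser, detectOrDefault) are only placeholders for this example, not SMILA code:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

import org.mozilla.universalchardet.UniversalDetector;

public class EncodingGuesser {

    /**
     * Tries to detect the charset of the fetched bytes with juniversalchardet
     * and falls back to UTF-8 when nothing usable is detected.
     */
    public static Charset detectOrDefault(byte[] content) {
        UniversalDetector detector = new UniversalDetector(null);
        detector.handleData(content, 0, content.length);
        detector.dataEnd();
        String detected = detector.getDetectedCharset();
        detector.reset();
        if (detected != null && Charset.isSupported(detected)) {
            return Charset.forName(detected);
        }
        // Detection failed (or the detected name is unknown to the JVM): use UTF-8.
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        byte[] page = "<html><head><meta charset=\"none\"></head><body>ä ö ü</body></html>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println("Using charset: " + detectOrDefault(page));
    }
}
```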
Hi Nils, your request sounds reasonable. I changed the importance to "enhancement". I don't know when we will have time to integrate this library, but it sounds like a great addition to SMILA, not only for the WebCrawler but also for our MimeTypeIdentifier. Bye, Daniel
This bug will not be fixed as it occurred in the old connectivity implementation, which was replaced by the new importing implementation. The mentioned problem should not be an issue in the new web crawler implementation.