Bug 330991 - UnsupportedCharsetException while crawling illegal encoding meta data
Status: CLOSED WONTFIX
Alias: None
Product: z_Archived
Classification: Eclipse Foundation
Component: Smila
Version: unspecified
Hardware: PC Linux
Importance: P3 enhancement
Target Milestone: ---
Assignee: Project Inbox
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-11-24 02:18 EST by nils.thieme
Modified: 2022-07-07 11:31 EDT
Description nils.thieme 2010-11-24 02:18:59 EST
If the web crawler encounters a web site whose encoding meta tag has an invalid value, for example "none", it throws an UnsupportedCharsetException. The exception occurs at line 656 of the WebCrawler class.
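
For illustration, the failure can be reproduced outside the crawler with the standard java.nio.charset API alone. This is a minimal sketch, not the WebCrawler code itself: the class name CharsetLookupDemo and the helper lookupFails are hypothetical, and it assumes the crawler passes the meta tag value to a charset lookup such as Charset.forName.

```java
import java.nio.charset.Charset;
import java.nio.charset.UnsupportedCharsetException;

public class CharsetLookupDemo {

    /** Returns true if looking up the given charset name throws UnsupportedCharsetException. */
    static boolean lookupFails(String name) {
        try {
            Charset.forName(name);
            return false;
        } catch (UnsupportedCharsetException e) {
            // "none" is a syntactically legal charset name, but no such charset exists.
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("none fails:  " + lookupFails("none"));
        System.out.println("UTF-8 fails: " + lookupFails("UTF-8"));
    }
}
```

A syntactically malformed name would raise IllegalCharsetNameException instead, so code that wants to tolerate bad meta data would likely need to handle both exceptions.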
Comment 1 Daniel Stucky 2010-11-29 08:46:02 EST
Hi Nils,

thanks for your bug report. Obviously the problem lies in the data being crawled. In your example the HTML file declares an invalid encoding. From my point of view, throwing an UnsupportedCharsetException is a reasonable response to invalid input data. The same exception would also be thrown if the encoding were valid but not supported by the underlying platform.

What behavior do you expect when an invalid encoding is specified in the HTML? Ignore the meta information and proceed without it, hoping to detect the encoding from the bytes and, if that fails, fall back to the default encoding UTF-8? In some cases this may work out fine; in others it will lead to incorrectly decoded characters.

Bye,
Daniel
Comment 2 nils.thieme 2010-11-30 09:39:02 EST
Hi Daniel,

we want our crawler to behave like a normal web browser such as Firefox. Firefox can determine the encoding from the bytes themselves with good probability. The detection library is also available as a Java library: http://code.google.com/p/juniversalchardet/ . We have integrated this library to detect the encoding automatically. If detection fails, we assume UTF-8.
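
The fallback order discussed in this thread could be sketched roughly as follows. This is only an illustration using the standard library: the class CharsetFallback and its methods are hypothetical, and the detectedName argument stands in for whatever a byte-based detector such as juniversalchardet would report (null if detection fails). The sketch also assumes that a valid declared encoding takes precedence over the detected one, which the thread does not specify.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;

public class CharsetFallback {

    /** Returns the named charset, or null if the name is missing, malformed, or unsupported. */
    static Charset lookupOrNull(String name) {
        if (name == null || name.trim().isEmpty()) {
            return null;
        }
        try {
            return Charset.forName(name.trim());
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            return null;
        }
    }

    /**
     * Picks a charset for decoding a page (assumed order, see lead-in):
     * 1. the encoding declared in the meta tag, if it is usable,
     * 2. the encoding name reported by a byte-based detector, if usable,
     * 3. UTF-8 as the last resort.
     */
    static Charset resolve(String declaredName, String detectedName) {
        Charset declared = lookupOrNull(declaredName);
        if (declared != null) {
            return declared;
        }
        Charset detected = lookupOrNull(detectedName);
        return detected != null ? detected : StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        System.out.println(resolve("none", "ISO-8859-1"));
        System.out.println(resolve("none", null));
    }
}
```

With this approach an invalid meta value such as "none" no longer aborts the crawl, at the cost that a wrong guess may produce incorrectly decoded characters, as noted in Comment 1.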
Comment 3 Daniel Stucky 2010-11-30 10:17:45 EST
Hi Nils,

Your request sounds reasonable. I changed the importance to "enhancement". I don't know when we will have time to integrate this library, but it sounds like a great addition to SMILA, not only for the WebCrawler but also for our MimeTypeIdentifier.

Bye,
Daniel
Comment 4 Daniel Stucky 2012-12-19 07:39:41 EST
This bug will not be fixed, as it occurred in the old connectivity implementation, which has been replaced by the new importing implementation. The problem described should not be an issue in the new web crawler implementation.