Bug 330731

Summary: Encoding extraction fails
Product: z_Archived Reporter: nils.thieme
Component: Smila Assignee: Project Inbox <smila.irms-inbox>
Status: CLOSED FIXED QA Contact:
Severity: normal    
Priority: P3 CC: daniel.stucky
Version: unspecified   
Target Milestone: ---   
Hardware: PC   
OS: Linux   
Whiteboard:

Description nils.thieme CLA 2010-11-20 05:41:41 EST
If you crawl the following site: http://hitech.newsru.com/article/19nov2010/rugeoportal, you get an exception like this:

java.io.UnsupportedEncodingException: windows-1251;

This happens because the trailing semicolon is passed to the extraction function as part of the charset name. To solve this, change the regular expression in org.eclipse.smila.connectivity.framework.crawler.web.WebCrawler at line 197 to the following:

 private final Pattern _contentTypePattern = Pattern.compile("^CONTENT-TYPE\\s*:\\s*(?:.|\\s)*CHARSET\\s*=\\s*([\\w-]*)", Pattern.CASE_INSENSITIVE);
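
For illustration, the proposed pattern can be exercised in isolation. This is a minimal sketch, not the actual WebCrawler code: the class name CharsetExtractionSketch, the method extractCharset, and the sample header string are assumptions made for this example. Only the regular expression is taken from the report above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration; not part of SMILA.
class CharsetExtractionSketch {

  // Pattern copied from the report. The capturing group [\w-]* accepts only
  // word characters and hyphens, so a trailing ';' is not captured.
  private static final Pattern CONTENT_TYPE_PATTERN = Pattern.compile(
      "^CONTENT-TYPE\\s*:\\s*(?:.|\\s)*CHARSET\\s*=\\s*([\\w-]*)", Pattern.CASE_INSENSITIVE);

  // Returns the charset name found in the header line, or null if nothing matches.
  static String extractCharset(String header) {
    Matcher matcher = CONTENT_TYPE_PATTERN.matcher(header);
    return matcher.find() ? matcher.group(1) : null;
  }

  public static void main(String[] args) {
    // Assumed sample input with a trailing semicolon, similar to the reported case.
    System.out.println(extractCharset("Content-Type: text/html; charset=windows-1251;"));
    // prints "windows-1251"
  }
}
```

Because the group stops at the first character that is neither a word character nor a hyphen, the extracted name no longer carries the semicolon that triggered the UnsupportedEncodingException.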
Comment 1 Daniel Stucky CLA 2010-11-29 08:19:33 EST
Hi Nils,

thanks for your bug report and the suggested solution. I checked in your fix and added a JUnit test for this issue. It's all checked in with revision 711.

Bye,
Daniel
Comment 2 Andreas Weber CLA 2013-04-15 11:48:17 EDT
Closing this