Bug 330731 - Encoding extraction fail
Summary: Encoding extraction fail
Status: CLOSED FIXED
Alias: None
Product: z_Archived
Classification: Eclipse Foundation
Component: Smila
Version: unspecified
Hardware: PC Linux
Importance: P3 normal
Target Milestone: ---
Assignee: Project Inbox CLA
Reported: 2010-11-20 05:41 EST by nils.thieme CLA
Modified: 2022-07-07 11:31 EDT

Description nils.thieme CLA 2010-11-20 05:41:41 EST
If you crawl the following site: http://hitech.newsru.com/article/19nov2010/rugeoportal, you get an exception like this:

java.io.UnsupportedEncodingException: windows-1251;

This happens because the trailing semicolon is passed to the encoding extraction function along with the charset name. To solve this, change the regular expression in org.eclipse.smila.connectivity.framework.crawler.web.WebCrawler (line 197) to the following:

 private final Pattern _contentTypePattern = Pattern.compile("^CONTENT-TYPE\\s*:\\s*(?:.|\\s)*CHARSET\\s*=\\s*([\\w-]*)", Pattern.CASE_INSENSITIVE);
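
For illustration, here is a minimal standalone sketch of how the pattern above behaves on a header value with a trailing semicolon. The class name CharsetExtractionSketch, the method extractCharset, and the sample header string are assumptions made for this example; they are not taken from WebCrawler, and the actual input passed to the pattern there may look different.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of SMILA.
public class CharsetExtractionSketch {

    // Same regular expression as proposed in this report.
    private static final Pattern CONTENT_TYPE_PATTERN = Pattern.compile(
        "^CONTENT-TYPE\\s*:\\s*(?:.|\\s)*CHARSET\\s*=\\s*([\\w-]*)",
        Pattern.CASE_INSENSITIVE);

    /**
     * Returns the charset name found in the given header line,
     * or null if the line contains no usable charset declaration.
     */
    static String extractCharset(String header) {
        Matcher matcher = CONTENT_TYPE_PATTERN.matcher(header);
        if (matcher.find()) {
            // Group 1 only accepts word characters and hyphens,
            // so a trailing ';' is left out of the captured name.
            String name = matcher.group(1);
            return name.isEmpty() ? null : name;
        }
        return null;
    }

    public static void main(String[] args) {
        // Assumed sample input, modeled on the exception message above.
        String header = "Content-Type: text/html; charset=windows-1251;";
        System.out.println(extractCharset(header)); // prints "windows-1251"
    }
}
```

Because the capturing group is limited to [\w-], the match stops before the semicolon, so the name handed on to the charset lookup is "windows-1251" rather than "windows-1251;".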
Comment 1 Daniel Stucky CLA 2010-11-29 08:19:33 EST
Hi Nils,

thanks for your bug report and the suggested solution. I checked in your fix and added a JUnit test for this issue. It's all checked in with revision 711.

Bye,
Daniel
Comment 2 Andreas Weber CLA 2013-04-15 11:48:17 EDT
Closing this