If you crawl the following site: http://hitech.newsru.com/article/19nov2010/rugeoportal, you get an exception like this: java.io.UnsupportedEncodingException: windows-1251; This happens because the trailing semicolon is passed to the extraction function as well, so it ends up in the charset name. To solve this, change the regular expression in org.eclipse.smila.connectivity.framework.crawler.web.WebCrawler at line 197 to the following: private final Pattern _contentTypePattern = Pattern.compile("^CONTENT-TYPE\\s*:\\s*(?:.|\\s)*CHARSET\\s*=\\s*([\\w-]*)", Pattern.CASE_INSENSITIVE);
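For illustration, here is a minimal standalone sketch of how the suggested pattern behaves. The class name CharsetExtractor, the method extractCharset, and the sample header line are assumptions made up for this example; they are not taken from the SMILA code base, and the actual header sent by the site may differ.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of SMILA: it only demonstrates the pattern.
public class CharsetExtractor {

    // The pattern suggested in the report: the capture group accepts only
    // word characters and hyphens, so a trailing ';' is left out.
    private static final Pattern CONTENT_TYPE_PATTERN = Pattern.compile(
        "^CONTENT-TYPE\\s*:\\s*(?:.|\\s)*CHARSET\\s*=\\s*([\\w-]*)",
        Pattern.CASE_INSENSITIVE);

    // Returns the charset name found in the header line, or null if there is none.
    static String extractCharset(String headerLine) {
        Matcher matcher = CONTENT_TYPE_PATTERN.matcher(headerLine);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        // Assumed sample header with a trailing semicolon after the charset.
        String header = "Content-Type: text/html; charset=windows-1251;";
        System.out.println(extractCharset(header)); // prints "windows-1251"
    }
}
```

Because the capture group stops at the first character that is neither a word character nor a hyphen, the semicolon never reaches the code that looks up the encoding.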
Hi Nils, thanks for your bug report and the suggested solution. I checked in your fix and added a JUnit test for this issue. It's all checked in with revision 711. Bye, Daniel
Closing this