I have encountered an IllegalStateException in line 223 in the class HttpResponse (org.eclipse.smila.connectivity.framework.crawler.web) with the following error message: "unsupported protocol: 'file'".
Hi Nils, thanks for your bug report. The web crawler implementation is based on Apache Commons HttpClient, which does not support the "file" protocol, so URLs with the file protocol cannot be crawled. If you want to crawl data in your filesystem, use the FilesystemCrawler instead. Did you start crawling with a file URL, or was the file URL found during the crawl? In the latter case we may have to add a check that filters out any non-HTTP URLs. Bye, Daniel
The "file" prefix was part of a link of a crawled web site. I only crawl web sites with the WebCrawler of SMILA.
Ok, I have also set the importance of this issue to enhancement. The WebCrawler should either ignore links with protocol "file" or add support for them (e.g. by using the JDK URLConnection instead of Apache Commons HttpClient). Bye, Daniel
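For illustration, the second option (reading through the JDK's URLConnection) could look roughly like the sketch below. It is not SMILA code: the class name UrlConnectionSketch and the method fetch are hypothetical, and it assumes Java 11 or later and UTF-8 content. It only shows that the JDK can open a "file" URL through the same call that is used for "http" URLs.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch, not part of SMILA.
class UrlConnectionSketch {

    // Reads the whole resource behind the URL and decodes it as UTF-8 (an assumption).
    static String fetch(URL url) throws IOException {
        URLConnection connection = url.openConnection();
        try (InputStream in = connection.getInputStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        // Create a temporary file so the example does not depend on any existing path.
        Path tmp = Files.createTempFile("crawl-sketch", ".html");
        Files.writeString(tmp, "<html>hello</html>");
        try {
            // tmp.toUri().toURL() yields a "file:" URL.
            System.out.println(fetch(tmp.toUri().toURL()));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

Note that this would not help with links such as "mailto", which do not point to a fetchable document at all, so a filter for unsupported protocols would still be needed.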
I have a link to a web site that contains a "mailto" link. The web crawler crashes with an IllegalStateException (message: "unsupported protocol: 'mailto'"). The link is: http://www.bratbar.de/
I agree that this is an enhancement. But the web crawler still should not simply crash if the URL prefix is not supported. It should just ignore that link for now. @Daniel: Can you please fix this?
The Connectivity framework was replaced by the new Importing framework. With the current Web Crawler, links with protocols that cannot be handled (file, mailto, ftp) are ignored, and a warning is written to the log. So there's no crash anymore; the Web Crawler just continues its work after ignoring the link.
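The behaviour described here (skip the link, log a warning, keep going) can be sketched as follows. This is only an illustration, not the actual Importing framework code: the class LinkFilterSketch, its methods, and the set of supported schemes are assumptions, and the links are assumed to be absolute already (a real crawler would first resolve relative links against the page URL).

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.logging.Logger;

// Hypothetical sketch, not part of SMILA.
class LinkFilterSketch {

    private static final Logger LOG = Logger.getLogger(LinkFilterSketch.class.getName());

    // Assumption: only these schemes can be fetched by the crawler.
    private static final Set<String> SUPPORTED_SCHEMES = Set.of("http", "https");

    // Returns true if the link parses as a URI and has a supported scheme.
    static boolean isCrawlable(String link) {
        try {
            String scheme = new URI(link).getScheme();
            return scheme != null
                && SUPPORTED_SCHEMES.contains(scheme.toLowerCase(Locale.ROOT));
        } catch (URISyntaxException e) {
            return false; // malformed links are skipped as well
        }
    }

    // Keeps crawlable links; logs a warning for every other link instead of throwing.
    static List<String> filterLinks(List<String> links) {
        List<String> accepted = new ArrayList<>();
        for (String link : links) {
            if (isCrawlable(link)) {
                accepted.add(link);
            } else {
                LOG.warning("Ignoring link with unsupported protocol: " + link);
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        List<String> found = List.of(
            "http://www.bratbar.de/",
            "mailto:info@example.org",
            "file:///tmp/page.html",
            "ftp://example.org/file.txt");
        System.out.println(filterLinks(found));
    }
}
```

With the four sample links in main, only the http link is kept; the other three each produce a warning and the loop continues.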
Closing this