Bug 331180

Summary: IllegalStateException while crawling
Product: z_Archived Reporter: nils.thieme
Component: Smila Assignee: Andreas Weber <Andreas.Weber>
Status: CLOSED FIXED QA Contact:
Severity: enhancement    
Priority: P3 CC: daniel.stucky, igor.novakovic
Version: unspecified   
Target Milestone: ---   
Hardware: PC   
OS: Linux   
Whiteboard:

Description nils.thieme CLA 2010-11-26 03:28:55 EST
I have encountered an IllegalStateException in line 223 in the class HttpResponse (org.eclipse.smila.connectivity.framework.crawler.web) with the following error message: "unsupported protocol: 'file'".
Comment 1 Daniel Stucky CLA 2010-11-29 09:10:44 EST
Hi Nils,

thanks for your bug report. The web crawler implementation is based on Apache Commons HttpClient, which does not support the "file" protocol. Therefore it is not possible to use URLs with the file protocol for crawling.
If you want to crawl data in your filesystem, use the FilesystemCrawler instead.

Did you start crawling with a file URL, or was the file URL found during the crawl? In the latter case we may have to add a check and filter out any non-HTTP URLs.

Bye,
Daniel
Comment 2 nils.thieme CLA 2010-11-30 09:14:55 EST
The "file" prefix was part of a link of a crawled web site. I only crawl web sites with the WebCrawler of SMILA.
Comment 3 Daniel Stucky CLA 2010-12-01 05:35:52 EST
Ok, I also set the importance of this issue to enhancement.

The WebCrawler should either ignore links with protocol "file" or add support for them (e.g. by using the JDK URLConnection instead of Apache Commons HttpClient).

Bye,
Daniel
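
To illustrate the second option mentioned above: the JDK's URLConnection can open "file" URLs out of the box. The following is only a minimal sketch, not SMILA code; the class name FileUrlFetcher and the method fetch are hypothetical, and it assumes Java 9 or later (for InputStream.readAllBytes).

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URLConnection;

// Hypothetical helper, not part of SMILA: reads the content behind a URL
// through the JDK's built-in protocol handlers (which include "file").
final class FileUrlFetcher {

    static byte[] fetch(String url) throws IOException {
        // URI.toURL() throws MalformedURLException (an IOException) if the JDK
        // has no handler for the protocol.
        URLConnection connection = URI.create(url).toURL().openConnection();
        try (InputStream in = connection.getInputStream()) {
            return in.readAllBytes();
        }
    }
}
```

A call such as FileUrlFetcher.fetch("file:///some/local/page.html") would return the raw bytes of that file, provided it exists and is readable. Whether a web crawler should follow such links at all is a separate design question.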
Comment 4 nils.thieme CLA 2010-12-08 05:58:28 EST
I have a link to a web site that contains a "mailto" link. The web crawler crashes with an IllegalStateException (message: "unsupported protocol: 'mailto'").

The link is: http://www.bratbar.de/
Comment 5 Igor Novakovic CLA 2010-12-17 08:24:17 EST
I agree that this is an enhancement.
Still, the web crawler should not simply crash if the URL protocol is not supported. For now, it should just ignore that link.

@Daniel:
Can you please fix this?
Comment 6 Andreas Weber CLA 2012-12-18 12:19:27 EST
The Connectivity framework was replaced by the new Importing framework. With the current Web Crawler, links with protocols that cannot be handled (file, mailto, ftp) are ignored and a warning is written to the log. So there is no crash anymore; the Web Crawler simply continues its work after ignoring the link.
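
The behaviour described here (skip the link, log a warning, continue) could look roughly like the sketch below. This is not the actual Web Crawler code; the class LinkSchemeFilter, its methods, and the assumption that only "http" and "https" are supported are made up for illustration. It requires Java 9 or later (for Set.of).

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.logging.Logger;

// Hypothetical helper, not part of SMILA: drops links whose protocol
// the crawler cannot fetch instead of throwing an exception.
final class LinkSchemeFilter {

    private static final Logger LOG = Logger.getLogger(LinkSchemeFilter.class.getName());

    // Assumption: the crawler can only fetch these schemes.
    private static final Set<String> SUPPORTED = Set.of("http", "https");

    // Expects absolute links; relative links have no scheme and are rejected,
    // so they should be resolved against the page URL before calling this.
    static boolean isCrawlable(String link) {
        try {
            String scheme = new URI(link).getScheme();
            return scheme != null && SUPPORTED.contains(scheme.toLowerCase(Locale.ROOT));
        } catch (URISyntaxException e) {
            return false; // a malformed link is treated like an unsupported one
        }
    }

    // Returns the crawlable links and logs a warning for each ignored one.
    static List<String> filter(List<String> links) {
        List<String> kept = new ArrayList<>();
        for (String link : links) {
            if (isCrawlable(link)) {
                kept.add(link);
            } else {
                LOG.warning("Ignoring link with unsupported protocol: " + link);
            }
        }
        return kept;
    }
}
```

With this kind of filter, a page containing both an http link and a "mailto" or "file" link yields only the http link for further crawling, and the crawl continues.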
Comment 7 Andreas Weber CLA 2013-04-15 11:48:40 EDT
Closing this