Bug 334396

Summary: Web crawler does not send all fetched resources to connectivity
Product: z_Archived
Component: Smila
Reporter: Igor Novakovic <igor.novakovic>
Assignee: Juergen Schumacher <juergen.schumacher>
Status: CLOSED FIXED
Severity: normal
Priority: P3
CC: daniel.stucky, tmenzel
Version: unspecified
Target Milestone: ---
Hardware: All
OS: All

Description Igor Novakovic CLA 2011-01-14 11:50:57 EST
The web crawler seems to have some kind of built-in filtering for resources (images are a typical example) that have not been explicitly filtered out in the DataSourceConnectionConfig. This prevents it from sending those resources to Connectivity, where they should be persisted (via the blackboard) as records in the record and binary store.
Comment 1 Daniel Stucky CLA 2011-01-18 11:59:13 EST
I debugged this issue and found the reason why images are downloaded but not returned by the WebCrawler hasNext() and next() methods:

WebCrawler uses a ParserManager where Parsers for certain mimetypes can be registered. Currently there are two Parsers available: an HtmlParser (for "text/html", "text/plain") and a JavaScriptParser (for "application/x-javascript", "text/javascript").
Inside the Fetcher, every fetched resource is handed to a registered Parser. This means that any resource whose content type does not match one of the two registered parsers cannot be parsed and is therefore discarded.
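
For illustration, here is a minimal sketch of what such a content-type lookup with discard could look like. All class and method names are hypothetical, not taken from the actual SMILA sources:

  // Hypothetical sketch of the lookup-and-discard behavior described above.
  // Names are illustrative, not the actual SMILA classes.
  import java.util.HashMap;
  import java.util.Map;

  interface Parser {
      void parse(byte[] content);
  }

  class ParserManagerSketch {
      private final Map<String, Parser> parsersByContentType = new HashMap<String, Parser>();

      void register(String contentType, Parser parser) {
          parsersByContentType.put(contentType, parser);
      }

      // The problematic pattern: a fetched resource whose content type has no
      // registered parser is logged and dropped, so it never reaches Connectivity.
      void handleFetchedResource(String contentType, byte[] content) {
          Parser parser = parsersByContentType.get(contentType);
          if (parser == null) {
              System.err.println("Parser for content-type: " + contentType + " not found");
              return; // resource is silently discarded here
          }
          parser.parse(content);
      }
  }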

There should be log entries like this:
2011-01-18 17:54:21,069 WARN  [Thread-38                                    ]  fetcher.Fetcher                               - Parser for content-type: application/pdf not found
2011-01-18 17:54:52,581 WARN  [Thread-38                                    ]  fetcher.Fetcher                               - Parser for content-type: application/pdf not found
2011-01-18 17:54:52,848 WARN  [Thread-38                                    ]  fetcher.Fetcher                               - Parser for content-type: application/pdf not found
2011-01-18 17:54:53,025 WARN  [Thread-38                                    ]  fetcher.Fetcher                               - Parser for content-type: application/pdf not found
2011-01-18 17:54:53,138 WARN  [Thread-38                                    ]  fetcher.Fetcher                               - Parser for content-type: text/css not found
2011-01-18 17:54:53,301 WARN  [Thread-38                                    ]  fetcher.Fetcher                               - Parser for content-type: image/x-icon not found
2011-01-18 17:54:53,516 WARN  [Thread-38                                    ]  fetcher.Fetcher                               - Parser for content-type: application/pdf not found


I don't know the reasoning behind the ParserManager and the Parsers; we have to discuss this behavior with the guys from Brox who did the initial implementation.

Bye,
Daniel
Comment 2 Igor Novakovic CLA 2011-09-21 06:55:32 EDT
Thomas, can you please provide any input on why this decision was made back then?
I suggest that we change the implementation so that all crawled resources (that have not been filtered out by the crawler filter configuration) are sent to Connectivity.
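
As a rough sketch of what this change could look like, reusing the hypothetical names from the sketch in comment 1: parsing becomes optional, while forwarding to Connectivity happens unconditionally. The sendToConnectivity helper here is a made-up stand-in, not the real SMILA API:

  // Hypothetical sketch of the proposal: never discard a fetched resource,
  // only skip the parsing step when no parser matches its content type.
  class ForwardingFetcherSketch {
      interface Parser {
          void parse(byte[] content);
      }

      private final java.util.Map<String, Parser> parsersByContentType =
          new java.util.HashMap<String, Parser>();

      void handleFetchedResource(String contentType, byte[] content) {
          Parser parser = parsersByContentType.get(contentType);
          if (parser != null) {
              parser.parse(content); // extract links/text where a parser exists
          }
          sendToConnectivity(contentType, content); // always forward the record
      }

      // Hypothetical stand-in for the real handover to Connectivity.
      void sendToConnectivity(String contentType, byte[] content) {
          // e.g. build a record and push it to the Connectivity queue
      }
  }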
Comment 3 thomas menzel CLA 2011-09-21 07:42:29 EDT
Unfortunately I don't know for sure, but I would think this dates back to the early days, when the focus was mainly on indexing text and saving the whole page for later viewing in full wasn't a concern.

+1 for Igor's proposal, as it makes sense.
Comment 4 Igor Novakovic CLA 2011-09-21 08:06:29 EDT
Jürgen, could you please take a look at this and - if easily doable - remove the discarding of resources?
Comment 5 Juergen Schumacher CLA 2011-09-26 10:21:40 EDT
I've committed a fix, see rev. 1709. We'll see if that's good enough (-;
I'm keeping this issue open for a while longer to watch for possible problems caused by the fix.
Comment 6 Juergen Schumacher CLA 2011-09-26 10:43:24 EDT
Ok, it seems that web-crawled PDFs are damaged somehow, so the fix is not yet final ... I will check what's going on there.
Comment 7 Juergen Schumacher CLA 2011-09-26 11:10:18 EDT
The class org.eclipse.smila.connectivity.framework.crawler.web.http.HttpResponse contains some quite complex code for reading the HTTP response stream into a byte[]. When I replace it with a simple IOUtils.toByteArray(InputStream) call, it works and the PDFs are fine. Does anyone know what the old code was there for? I'll replace it with the IOUtils call for now. If somebody objects, we will have to rewrite it.
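
For reference, the simplification described above amounts to something like the following. IOUtils.toByteArray is the real Apache Commons IO call; the surrounding class and method are a hypothetical paraphrase, not the actual HttpResponse code:

  import java.io.IOException;
  import java.io.InputStream;
  import org.apache.commons.io.IOUtils;

  class ResponseBodyReader {
      // Reads the HTTP response stream fully into a byte[]. This replaces the
      // hand-rolled buffering code that corrupted binary content such as PDFs.
      static byte[] readBody(InputStream responseStream) throws IOException {
          return IOUtils.toByteArray(responseStream);
      }
  }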
Comment 8 Igor Novakovic CLA 2011-10-05 08:45:27 EDT
I've just tested the bugfix and can confirm that it works, so I am closing this bug.
Comment 9 Andreas Weber CLA 2013-04-15 11:49:01 EDT
Closing this