| Summary: | Web crawler does not send all fetched resources to connectivity | | |
|---|---|---|---|
| Product: | z_Archived | Reporter: | Igor Novakovic <igor.novakovic> |
| Component: | Smila | Assignee: | Juergen Schumacher <juergen.schumacher> |
| Status: | CLOSED FIXED | QA Contact: | |
| Severity: | normal | | |
| Priority: | P3 | CC: | daniel.stucky, tmenzel |
| Version: | unspecified | | |
| Target Milestone: | --- | | |
| Hardware: | All | | |
| OS: | All | | |
| Whiteboard: | | | |
Description
Igor Novakovic
I debugged this issue and found the reason why images are downloaded but not returned by the WebCrawler's hasNext() and next() methods: the WebCrawler uses a ParserManager where Parsers for certain mime types can be registered. Currently there are two Parsers available, an HtmlParser (for "text/html", "text/plain") and a JavaScriptParser (for "application/x-javascript", "text/javascript"). Inside the Fetcher, every resource is handed to a registered Parser for parsing. This means that any resource whose content type does not match one of the two registered parsers cannot be parsed and is therefore discarded. There should be log entries like this:

    2011-01-18 17:54:21,069 WARN [Thread-38 ] fetcher.Fetcher - Parser for content-type: application/pdf not found
    2011-01-18 17:54:52,581 WARN [Thread-38 ] fetcher.Fetcher - Parser for content-type: application/pdf not found
    2011-01-18 17:54:52,848 WARN [Thread-38 ] fetcher.Fetcher - Parser for content-type: application/pdf not found
    2011-01-18 17:54:53,025 WARN [Thread-38 ] fetcher.Fetcher - Parser for content-type: application/pdf not found
    2011-01-18 17:54:53,138 WARN [Thread-38 ] fetcher.Fetcher - Parser for content-type: text/css not found
    2011-01-18 17:54:53,301 WARN [Thread-38 ] fetcher.Fetcher - Parser for content-type: image/x-icon not found
    2011-01-18 17:54:53,516 WARN [Thread-38 ] fetcher.Fetcher - Parser for content-type: application/pdf not found

I don't know the reasoning behind the ParserManager and the Parsers; we have to discuss this behavior with the people from brox who did the initial implementation.
Bye, Daniel

Thomas, can you please provide any input on why this decision was made back then? I suggest that we change the implementation so that all crawled resources (that have not been filtered out by the crawler filter configuration) are sent to Connectivity.

Unfortunately I don't know for sure, but I would think that this dates back to the early days when the focus was mainly on indexing text and saving the whole page for later viewing in full was not a concern. +1 for Igor's proposal, as it makes sense.

Jürgen, could you please take a look at this and, if easily doable, remove the discarding of resources?

I've committed a fix, see rev. 1709. We'll see if that's good enough (-; Keeping this issue open for some more time to wait for possible problems caused by the fix.

Ok, it seems that web-crawled PDFs are damaged somehow, so the fix is not yet final ... I will check what's going on there.

The class org.eclipse.smila.connectivity.framework.crawler.web.http.HttpResponse contains some quite complex code to read the HTTP response stream into a byte[]. When I replace it with a simple IOUtils.toByteArray(InputStream) call, it works and the PDFs are fine. Does anyone have an idea why the old code was there? I'll replace it with the IOUtils call for now; if somebody objects, we can rewrite it.

I've just tested the bugfix and can confirm that it works, so I am closing this bug.
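For readers unfamiliar with the crawler internals, the sketch below illustrates the discarding behavior described at the top of this bug and the change the fix introduces. All names here (FetcherSketch, Parser, handleResource) are hypothetical stand-ins, not SMILA's actual ParserManager/Fetcher API; only the before/after control flow is meant to match the discussion.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch, not SMILA's real API: old vs. new handling of fetched resources. */
public final class FetcherSketch {

    /** Stand-in for a registered content parser. */
    interface Parser {
        void parse(byte[] content);
    }

    /** Parsers registered per content type, e.g. "text/html" or "text/javascript". */
    private final Map<String, Parser> parsers = new HashMap<>();

    void register(String contentType, Parser parser) {
        parsers.put(contentType, parser);
    }

    /** Returns the resource that should be sent on to Connectivity. */
    byte[] handleResource(String contentType, byte[] content) {
        Parser parser = parsers.get(contentType);
        if (parser != null) {
            parser.parse(content); // extract text and outgoing links as before
        } else {
            System.out.println("Parser for content-type: " + contentType + " not found");
            // Before the fix the resource was dropped here, so PDFs, CSS files
            // and icons never reached Connectivity.
        }
        return content; // After the fix every fetched resource is handed on.
    }
}
```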
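The second part of the fix replaces hand-written stream reading with Apache Commons IO. Below is a minimal sketch of why that matters: readNaively is a hypothetical illustration of the kind of fragile code that can truncate responses, not the actual HttpResponse implementation, while IOUtils.toByteArray(InputStream) is the Commons IO call mentioned above, which reads until end of stream and therefore preserves binary payloads such as PDFs.

```java
import java.io.IOException;
import java.io.InputStream;

import org.apache.commons.io.IOUtils;

public final class ResponseBodyReader {

    /**
     * Hypothetical example of a fragile hand-rolled pattern (not the actual
     * HttpResponse code): a single read() may return before the whole body
     * has arrived, truncating binary content such as PDFs.
     */
    static byte[] readNaively(InputStream in, int contentLength) throws IOException {
        byte[] buffer = new byte[contentLength];
        in.read(buffer); // may fill only part of the buffer
        return buffer;
    }

    /** The simplification used in the fix: keep reading until end of stream. */
    static byte[] readCompletely(InputStream in) throws IOException {
        return IOUtils.toByteArray(in);
    }
}
```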