Reproduce:
- edit data source web.xml: disable delta indexing, set seed to http://www.heise.de/, remove filters.
- start job and crawler as usual.
- after a while, in the statistics for this crawler (/smila/crawlers/web) the counter for "records" will be higher than the one for "pages".
Analysis: This happens when some links on the crawled site are redirected to another site. In this case, there are links on www.heise.de that redirect to www.heise-marktplatz.de. These links are not crawled, but the Fetcher fails to clear the _output produced for the previously crawled page, so that output is used again. With delta indexing enabled, the duplicate is excluded from the crawl later. Without it, the same record is pushed to processing twice.

I think I have already fixed it in my workspace. I will commit the fix after doing some more tests, and maybe a bit of refactoring (:
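
To make the analysis above concrete, here is a minimal sketch of the stale-output pattern. It is not the SMILA Fetcher code: the class, the `clear_output_on_skip` flag, the `SITE` table and the `/some-link` path are all hypothetical and only model the behaviour described above (a skipped external redirect leaves the previous `_output` in place).

```python
# Hypothetical model of the bug; none of these names come from the SMILA code base.
# Each URL maps either to a page with content or to a redirect target.
SITE = {
    "http://www.heise.de/": ("page", "heise start page"),
    # Assumed link that redirects to an external host and is therefore not crawled.
    "http://www.heise.de/some-link": ("redirect", "http://www.heise-marktplatz.de/"),
}
URLS = ["http://www.heise.de/", "http://www.heise.de/some-link"]


class Fetcher:
    def __init__(self, clear_output_on_skip):
        self.clear_output_on_skip = clear_output_on_skip
        self._output = None  # output produced by the most recent fetch

    def fetch(self, url):
        kind, value = SITE[url]
        if kind == "redirect":
            # External redirect: the link is skipped. Without the reset below,
            # _output still holds the previously crawled page.
            if self.clear_output_on_skip:
                self._output = None
            return self._output
        self._output = value
        return self._output


def crawl(fetcher, urls):
    """Return (pages, records): pages really fetched and records pushed on."""
    pages = 0
    records = []
    for url in urls:
        if SITE[url][0] == "page":
            pages += 1
        output = fetcher.fetch(url)
        if output is not None:
            records.append(output)
    return pages, records
```

With `Fetcher(clear_output_on_skip=False)`, `crawl` over `URLS` counts one page but yields two identical records, which matches the "records" higher than "pages" symptom. With `clear_output_on_skip=True` the skipped link produces no record and both counts are one.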
Fixed in rev 1774
Closing this