
Bug 361885

Summary: WebCrawler produces duplicates if DeltaIndexing is disabled
Product: z_Archived
Component: Smila
Status: CLOSED FIXED
Severity: normal
Priority: P3
Version: unspecified
Target Milestone: ---
Hardware: PC
OS: Linux
Reporter: Juergen Schumacher <juergen.schumacher>
Assignee: Juergen Schumacher <juergen.schumacher>
CC: igor.novakovic

Description Juergen Schumacher 2011-10-25 03:01:10 EDT
To reproduce:
- edit the data source web.xml: disable delta indexing, set the seed to http://www.heise.de/, and remove all filters.
- start the job and the crawler as usual.
- after a while, in the statistics for this crawler (/smila/crawlers/web), the counter for "records" will be higher than the one for "pages". (The statistics resource can also be queried directly; see the sketch below.)
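
A minimal sketch for querying the statistics resource from the last step without a browser. The host and port are assumptions (localhost:8080, SMILA's usual default); the path is the one given above. The raw response is printed so the "records" and "pages" counters can be compared.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CrawlerStats {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8080/smila/crawlers/web"))
                .GET()
                .build();
        // Print the raw statistics; compare the "records" and "pages"
        // counters in the output.
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}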
Comment 1 Juergen Schumacher 2011-10-25 03:08:02 EDT
Analysis: This happens when some links on the crawled site redirect to another site. In this case, there are links on www.heise.de that redirect to www.heise-marktplatz.de. These links are not crawled, but the Fetcher fails to clear the _output produced for the previously crawled page, so that output is used again. With delta indexing, the duplicate page would be excluded from the crawl later; without it, the record is pushed to processing twice.
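
A sketch of this failure mode in Java. The class, variable, and method names are invented for illustration and are not the actual SMILA WebCrawler code; only the control flow matters: the fetch loop keeps one output map per record, and a link whose redirect leaves the site is skipped without clearing that map.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FetcherSketch {

    static final String SITE = "http://www.heise.de/";

    // Stand-in for real redirect resolution; one link leaves the site.
    static String resolveRedirect(String url) {
        return url.contains("marktplatz")
                ? "http://www.heise-marktplatz.de/angebot" : url;
    }

    public static void main(String[] args) {
        List<String> links = List.of(
                "http://www.heise.de/news/1.html",
                "http://www.heise.de/marktplatz/42",   // redirects off-site
                "http://www.heise.de/news/2.html");

        List<Map<String, Object>> records = new ArrayList<>();
        Map<String, Object> output = new HashMap<>();  // the record's _output
        int pages = 0;

        for (String link : links) {
            String target = resolveRedirect(link);
            if (target.startsWith(SITE)) {
                output = new HashMap<>();
                output.put("url", target);  // output produced for this page
                pages++;
            }
            // BUG: no else-branch clearing 'output', so the skipped
            // redirect pushes the previous page's output a second time.
            // One fix matching the analysis above would be:
            //   else { output = new HashMap<>(); }
            // so that only a freshly fetched page yields a record.
            if (!output.isEmpty()) {
                records.add(output);
            }
        }
        // Prints "pages=2 records=3": "records" exceeds "pages", the
        // exact symptom reported for /smila/crawlers/web.
        System.out.println("pages=" + pages + " records=" + records.size());
    }
}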

I think I have fixed it already in my workspace. Will commit the fix after doing some more tests. And maybe a bit of refactoring (:
Comment 2 Juergen Schumacher 2011-10-25 04:04:28 EDT
Fixed in rev 1774.
Comment 3 Andreas Weber 2013-04-15 11:50:15 EDT
Closing this.