Bug 361885 - WebCrawler produces duplicates if DeltaIndexing is disabled
Summary: WebCrawler produces duplicates if DeltaIndexing is disabled
Status: CLOSED FIXED
Alias: None
Product: z_Archived
Classification: Eclipse Foundation
Component: Smila
Version: unspecified
Hardware: PC Linux
Importance: P3 normal
Target Milestone: ---
Assignee: Juergen Schumacher
Reported: 2011-10-25 03:01 EDT by Juergen Schumacher
Modified: 2022-07-07 11:31 EDT

Description Juergen Schumacher 2011-10-25 03:01:10 EDT
Steps to reproduce:
- Edit the data source configuration web.xml: disable delta indexing, set the seed to http://www.heise.de/, and remove all filters (see the config sketch after these steps).
- Start the job and the crawler as usual.
- After a while, in the statistics for this crawler (/smila/crawlers/web), the counter for "records" will be higher than the one for "pages".
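
A minimal sketch of the kind of configuration change meant in the first step. The element names and values below (DataSourceConnectionConfig, DeltaIndexing, Seeds, Filters) are illustrative assumptions about the web crawler's data source format, not copied from the actual web.xml:

  <DataSourceConnectionConfig>
    <DataSourceID>web</DataSourceID>
    <!-- assumption: this is how delta indexing is switched off entirely -->
    <DeltaIndexing>disabled</DeltaIndexing>
    <Seeds>
      <Seed>http://www.heise.de/</Seed>
    </Seeds>
    <!-- all filters removed, so every discovered link is considered -->
    <Filters/>
  </DataSourceConnectionConfig>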
Comment 1 Juergen Schumacher 2011-10-25 03:08:02 EDT
Analysis: This happens when some links on the crawled site redirect to another site. In this case, some links on www.heise.de redirect to www.heise-marktplatz.de. These links are not crawled, but the Fetcher fails to clear the _output produced for the previously crawled page, so that stale output is reused. With delta indexing enabled, the duplicate page is excluded from the crawl later; without it, the record is pushed to processing twice. A sketch of the defect pattern is below.
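
To illustrate the defect pattern: a worker loop that keeps its result in a field reused across iterations will re-emit the previous page's record whenever a link is skipped, unless the field is reset first. The following is a minimal self-contained sketch of that pattern and its fix; the class, field, and method names (FetcherSketch, output, redirectsOffSite, pushToProcessing) are hypothetical, not SMILA's actual Fetcher API, and whether the fix committed in rev 1774 takes exactly this shape is an assumption:

  import java.util.List;

  class FetcherSketch {
      static final class Record {
          final String url;
          Record(String url) { this.url = url; }
      }

      private Record output; // reused field holding the last fetched page

      void crawl(List<String> links) {
          for (String link : links) {
              // The fix: clear the previous iteration's result. Without this
              // line, a skipped link (e.g. one redirecting to another site)
              // leaves the old record in 'output' and it is pushed again.
              output = null;

              if (!redirectsOffSite(link)) {
                  output = fetch(link);
              }
              if (output != null) {
                  pushToProcessing(output); // duplicates showed up here
              }
          }
      }

      // hypothetical stand-ins for the real crawler's collaborators
      private boolean redirectsOffSite(String link) { return link.contains("heise-marktplatz"); }
      private Record fetch(String link) { return new Record(link); }
      private void pushToProcessing(Record r) { System.out.println("pushed: " + r.url); }
  }

With delta indexing enabled, the stale record is filtered out in a later step anyway, which is why the duplicates only become visible when delta indexing is disabled.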

I think I have already fixed it in my workspace. I will commit the fix after doing some more tests, and maybe a bit of refactoring (:
Comment 2 Juergen Schumacher 2011-10-25 04:04:28 EDT
Fixed in rev 1774
Comment 3 Andreas Weber 2013-04-15 11:50:15 EDT
Closing this