| Summary: | WebCrawler produces duplicates if DeltaIndexing is disabled | | |
|---|---|---|---|
| Product: | z_Archived | Reporter: | Juergen Schumacher <juergen.schumacher> |
| Component: | Smila | Assignee: | Juergen Schumacher <juergen.schumacher> |
| Status: | CLOSED FIXED | QA Contact: | |
| Severity: | normal | | |
| Priority: | P3 | CC: | igor.novakovic |
| Version: | unspecified | | |
| Target Milestone: | --- | | |
| Hardware: | PC | | |
| OS: | Linux | | |
| Whiteboard: | | | |
Description
Juergen Schumacher
Analysis: This happens when some links on the crawled site redirect to another site. In this case, some links on www.heise.de redirect to www.heise-marktplatz.de. These links are not crawled, but the Fetcher fails to clear the _output produced for the previously crawled page, so that stale output is reused. With delta indexing enabled, the duplicate page is excluded from the crawl later; without it, the record is pushed to processing twice.

I think I have already fixed this in my workspace. I will commit the fix after doing some more tests, and maybe a bit of refactoring (:

Fixed in rev 1774. Closing this.
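The root cause is stale per-record state: the Fetcher carries its output across loop iterations and, before the fix, did not reset it when a link was skipped. Below is a minimal sketch of that pattern and the fix, assuming a simplified fetch loop; all class and method names (`FetcherSketch`, `fetch`, `redirectsOffSite`) are hypothetical illustrations, not SMILA's actual API.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Minimal sketch of the stale-output bug described in the analysis.
 * All names here are hypothetical, not SMILA's real API.
 */
public class FetcherSketch {

    /** Per-record fetch result; stands in for the "_output" in the report. */
    private String output;

    public List<String> crawl(List<String> links) {
        List<String> processed = new ArrayList<>();
        for (String link : links) {
            // FIX: clear the previous iteration's output unconditionally.
            // Before the fix this reset was missing, so a skipped link
            // kept the previous page's output alive.
            output = null;

            if (!redirectsOffSite(link)) {
                output = fetch(link);
            }
            // Without the reset above, a skipped (redirected) link reaches
            // this point with the *previous* page's content still in
            // 'output', pushing the same record to processing twice.
            if (output != null) {
                processed.add(output);
            }
        }
        return processed;
    }

    private boolean redirectsOffSite(String link) {
        // Hypothetical check: a real crawler would follow the HTTP
        // redirect and compare the target host to the configured site.
        return link.contains("heise-marktplatz.de");
    }

    private String fetch(String link) {
        return "content of " + link; // placeholder for the real HTTP fetch
    }

    public static void main(String[] args) {
        List<String> links = List.of(
            "http://www.heise.de/page1",
            "http://www.heise-marktplatz.de/offer", // redirected, skipped
            "http://www.heise.de/page2");
        // With the fix, page1's content appears once; without the
        // 'output = null' reset it would appear twice.
        System.out.println(new FetcherSketch().crawl(links));
    }
}
```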