
Bug 322322

Summary: Cannot crawl a single page
Product: z_Archived
Component: Smila
Reporter: Andrej Rosenheinrich <andrej.rosenheinrich>
Assignee: Andreas Weber <Andreas.Weber>
Status: CLOSED FIXED
Severity: minor
Priority: P4
CC: igor.novakovic, svoigt.brox, tmenzel
Version: unspecified
Target Milestone: ---
Hardware: PC
OS: Linux

Attachments:
webcrawler configuration

Description Andrej Rosenheinrich CLA 2010-08-11 04:19:16 EDT
Build Identifier: I20100608-0911

I am trying to crawl a single website: just the given seed, without following any links. When I set maxdepth = 1, the crawl is aborted with the claim that the maximum depth is exceeded, and I get no record as a result.
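
For context, a minimal sketch of the kind of webcrawler configuration involved is shown below. The element names are assumptions modeled on the standard SMILA web crawler configuration schema, not a verbatim copy; the real file is in the attachment ("webcrawler configuration"):

  <DataSourceConnectionConfig>
    <DataSourceID>web</DataSourceID>
    <DataConnectionID>
      <Crawler>WebCrawler</Crawler>
    </DataConnectionID>
    <Process>
      <WebSite ProjectName="Example Crawler Configuration">
        <!-- intent: fetch only the seed page, follow no links -->
        <CrawlingModel Type="MaxDepth" Value="1"/>
        <Seeds>
          <Seed>http://example.org/</Seed>
        </Seeds>
      </WebSite>
    </Process>
  </DataSourceConnectionConfig>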

I am using the standard webcrawler; SMILA.log says the following:
2010-08-11 10:07:54,014 INFO  [Thread-9                                     ]  web.WebCrawler                                - Initializing WebCrawler...
 2010-08-11 10:07:54,021 INFO  [Thread-9                                     ]  management.ManagementRegistration             - [Management Registration] Registering new agent [Crawlers/Web/Total]
 2010-08-11 10:07:54,023 INFO  [Thread-9                                     ]  management.ManagementRegistration             - [Management Registration] Registering new agent [Crawlers/Web/Total] in old controller [org.eclipse.smila.management.jmx.JmxManagementController]
 2010-08-11 10:07:54,025 INFO  [Thread-9                                     ]  jmx.JmxManagementController                   - SMILA:C0=Crawlers,C1=Web,Agent=Total
 2010-08-11 10:07:54,047 INFO  [Thread-9                                     ]  management.ManagementRegistration             - [Management Registration] Registering new agent [Crawlers/Web/web - 264504460]
 2010-08-11 10:07:54,049 INFO  [Thread-9                                     ]  management.ManagementRegistration             - [Management Registration] Registering new agent [Crawlers/Web/web - 264504460] in old controller [org.eclipse.smila.management.jmx.JmxManagementController]
 2010-08-11 10:07:54,051 INFO  [Thread-9                                     ]  jmx.JmxManagementController                   - SMILA:C0=Crawlers,C1=Web,Agent=web - 264504460
 2010-08-11 10:07:54,080 INFO  [Thread-9                                     ]  net.UrlNormalizerFactory                      - Using URL normalizer: org.eclipse.smila.connectivity.framework.crawler.web.net.BasicUrlNormalizer
 2010-08-11 10:07:54,101 INFO  [Thread-9                                     ]  filter.FilterFactory                          - Using URL filter: org.eclipse.smila.connectivity.framework.crawler.web.filter.impl.RegExpFilter
 2010-08-11 10:07:54,143 INFO  [Thread-9                                     ]  filter.FilterFactory                          - Using URL filter: org.eclipse.smila.connectivity.framework.crawler.web.filter.impl.MetaTagFilter
 2010-08-11 10:07:54,146 INFO  [Thread-9                                     ]  filter.FilterFactory                          - Using URL filter: org.eclipse.smila.connectivity.framework.crawler.web.filter.impl.CrawlScopeFilter
 2010-08-11 10:07:58,402 INFO  [Thread-9                                     ]  web.WebSiteIterator                           - Maximum depth exceeded!
 2010-08-11 10:07:58,407 ERROR [Thread-9                                     ]  impl.CrawlThread                              - org.eclipse.smila.connectivity.framework.CrawlerCriticalException: Unable to connect to web site specified in project Example Crawler Configuration
 2010-08-11 10:07:58,408 INFO  [Thread-9                                     ]  web.WebCrawler                                - Closing WebCrawler...
 2010-08-11 10:07:58,407 INFO  [Thread-11                                    ]  web.WebCrawler                                - Producer finished by forcing close procedure
 2010-08-11 10:07:58,429 INFO  [Thread-9                                     ]  impl.CrawlThread                              - Removing deltaindexing lock on datasource web
 2010-08-11 10:07:58,456 INFO  [Thread-9                                     ]  impl.CrawlThread                              - Finished session 9911a7d1-37a4-429a-ab5e-2dc83bc134ef and removed Deltaindexing lock on datasource web
 2010-08-11 10:07:58,457 INFO  [Thread-9                                     ]  impl.CrawlThread                              - Unregistering crawling thread web
 2010-08-11 10:07:58,458 INFO  [Thread-9                                     ]  impl.CrawlThread                              - Crawling thread web unregistered
 2010-08-11 10:07:58,459 INFO  [Thread-9                                     ]  impl.CrawlThread                              - Crawling thread web stopped.

Reproducible: Always

Steps to Reproduce:
1. Replace web.xml
2. Run SMILA
3. Start crawling by invoking startCrawlerTask(web) in jconsole
4. See the crawler get aborted (getCrawlerTasksState())
5. Read SMILA.log
Comment 1 Andrej Rosenheinrich CLA 2010-08-11 04:20:49 EDT
Created attachment 176302
webcrawler configuration

The configuration I use for the webcrawler.
Comment 2 Andrej Rosenheinrich CLA 2010-08-11 05:57:01 EDT
The same effect occurs, by the way, when I use MaxIterations with value 1 instead of MaxDepth.
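
Assuming the same CrawlingModel element as in the sketch above, that variant would be:

  <CrawlingModel Type="MaxIterations" Value="1"/>  <!-- assumed syntax; aborts just like MaxDepth=1 -->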
Comment 3 Sebastian Voigt CLA 2010-08-16 09:31:28 EDT
I can reproduce this problem.

I will fix that in the coming days.
Comment 4 Igor Novakovic CLA 2010-11-30 06:24:53 EST
Sebastian, have you managed to fix this already?
Comment 5 Sebastian Voigt CLA 2010-12-17 06:04:41 EST
I've tested my fix and sent it to SVN.
MaxDepth=1 will crawl the seed page.
MaxDepth=2 will crawl the seed page and the pages linked from the seed page.
Etc.
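
In terms of the (assumed) CrawlingModel element sketched above, the fixed semantics would read:

  <CrawlingModel Type="MaxDepth" Value="1"/>  <!-- crawls the seed page only -->
  <CrawlingModel Type="MaxDepth" Value="2"/>  <!-- crawls the seed page plus the pages it links to -->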
Comment 6 Andrej Rosenheinrich CLA 2011-01-31 05:04:43 EST
As of rev. 719, it seems to me the problem is solved only for MaxDepth. With MaxIterations it is still not possible to crawl a single page by providing one seed and setting MaxIterations to 1. Is this the correct and intended behavior?

Greets
Andrej
Comment 7 Igor Novakovic CLA 2011-02-07 09:36:49 EST
I am reopening this bug.

@Thomas:
Since Sebastian is unfortunately no longer involved in our project, and this piece of code was initially contributed by brox, could you please take a look at it?

Cheers
Igor
Comment 8 Igor Novakovic CLA 2011-03-09 05:12:27 EST
Hi Tom,

any chance of looking at this soon?

Cheers
Igor
Comment 9 thomas menzel CLA 2011-03-09 05:28:30 EST
Not really, unfortunately.
Since I have no clue about this code either, there is probably a substantial learning cost involved too, and I have more pressing issues at the moment.

Since nobody on the remaining team has any prior knowledge of the code, I am un-assigning this issue, so anybody willing, with some time or need on their hands, may pitch in.
Comment 10 Andreas Weber CLA 2012-12-18 09:27:14 EST
The Connectivity framework was replaced by the new Importing framework. With the current web crawler, a single (root) URL can be crawled without following any links (parameter: "maxCrawlDepth" : "0").
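
For illustration, a sketch of how that parameter might appear in a crawl job definition for the new Importing framework. Apart from "maxCrawlDepth", every name here (job name, workflow, seed parameter) is an assumption, not something confirmed by this bug:

  {
    "name": "crawlExampleWeb",
    "workflow": "webCrawling",
    "parameters": {
      "startUrl": "http://example.org/",
      "maxCrawlDepth": "0"
    }
  }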
Comment 11 Andreas Weber CLA 2013-04-15 11:51:33 EDT
Closing this.