Build Identifier: I20100608-0911

I am trying to crawl a single website, just the given seed, without following any links. When I set MaxDepth = 1, the crawl is aborted with a message claiming the maximum depth has been exceeded, and I get no records as a result. I am using the standard web crawler; SMILA.log says the following:

2010-08-11 10:07:54,014 INFO [Thread-9 ] web.WebCrawler - Initializing WebCrawler...
2010-08-11 10:07:54,021 INFO [Thread-9 ] management.ManagementRegistration - [Management Registration] Registering new agent [Crawlers/Web/Total]
2010-08-11 10:07:54,023 INFO [Thread-9 ] management.ManagementRegistration - [Management Registration] Registering new agent [Crawlers/Web/Total] in old controller [org.eclipse.smila.management.jmx.JmxManagementController]
2010-08-11 10:07:54,025 INFO [Thread-9 ] jmx.JmxManagementController - SMILA:C0=Crawlers,C1=Web,Agent=Total
2010-08-11 10:07:54,047 INFO [Thread-9 ] management.ManagementRegistration - [Management Registration] Registering new agent [Crawlers/Web/web - 264504460]
2010-08-11 10:07:54,049 INFO [Thread-9 ] management.ManagementRegistration - [Management Registration] Registering new agent [Crawlers/Web/web - 264504460] in old controller [org.eclipse.smila.management.jmx.JmxManagementController]
2010-08-11 10:07:54,051 INFO [Thread-9 ] jmx.JmxManagementController - SMILA:C0=Crawlers,C1=Web,Agent=web - 264504460
2010-08-11 10:07:54,080 INFO [Thread-9 ] net.UrlNormalizerFactory - Using URL normalizer: org.eclipse.smila.connectivity.framework.crawler.web.net.BasicUrlNormalizer
2010-08-11 10:07:54,101 INFO [Thread-9 ] filter.FilterFactory - Using URL filter: org.eclipse.smila.connectivity.framework.crawler.web.filter.impl.RegExpFilter
2010-08-11 10:07:54,143 INFO [Thread-9 ] filter.FilterFactory - Using URL filter: org.eclipse.smila.connectivity.framework.crawler.web.filter.impl.MetaTagFilter
2010-08-11 10:07:54,146 INFO [Thread-9 ] filter.FilterFactory - Using URL filter: org.eclipse.smila.connectivity.framework.crawler.web.filter.impl.CrawlScopeFilter
2010-08-11 10:07:58,402 INFO [Thread-9 ] web.WebSiteIterator - Maximum depth exceeded!
2010-08-11 10:07:58,407 ERROR [Thread-9 ] impl.CrawlThread - org.eclipse.smila.connectivity.framework.CrawlerCriticalException: Unable to connect to web site specified in project Example Crawler Configuration
2010-08-11 10:07:58,408 INFO [Thread-9 ] web.WebCrawler - Closing WebCrawler...
2010-08-11 10:07:58,407 INFO [Thread-11 ] web.WebCrawler - Producer finished by forcing close procedure
2010-08-11 10:07:58,429 INFO [Thread-9 ] impl.CrawlThread - Removing deltaindexing lock on datasource web
2010-08-11 10:07:58,456 INFO [Thread-9 ] impl.CrawlThread - Finished session 9911a7d1-37a4-429a-ab5e-2dc83bc134ef and removed Deltaindexing lock on datasource web
2010-08-11 10:07:58,457 INFO [Thread-9 ] impl.CrawlThread - Unregistering crawling thread web
2010-08-11 10:07:58,458 INFO [Thread-9 ] impl.CrawlThread - Crawling thread web unregistered
2010-08-11 10:07:58,459 INFO [Thread-9 ] impl.CrawlThread - Crawling thread web stopped.

Reproducible: Always

Steps to Reproduce:
1. Replace web.xml.
2. Run SMILA.
3. Start crawling by invoking startCrawlerTask(web) in jconsole.
4. See the crawler get aborted (getCrawlerTasksState()).
5. Read SMILA.log.
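For reference, the same crawl can also be triggered programmatically over JMX instead of through jconsole. The following is only a minimal sketch: the JMX port and the MBean object name of the crawler controller are assumptions for illustration (check the names jconsole shows for your installation); the operation names startCrawlerTask and getCrawlerTasksState are the ones used in the steps above.

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class StartWebCrawl {
    public static void main(String[] args) throws Exception {
        // Assumed JMX endpoint of the running SMILA instance.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:9004/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbsc = connector.getMBeanServerConnection();
            // Hypothetical object name of the crawler controller agent.
            ObjectName controller = new ObjectName("SMILA:Agent=CrawlerController");
            // Start the crawl for the "web" data source, then poll its state.
            mbsc.invoke(controller, "startCrawlerTask",
                    new Object[] { "web" }, new String[] { String.class.getName() });
            Object state = mbsc.invoke(controller, "getCrawlerTasksState",
                    new Object[0], new String[0]);
            System.out.println("Crawler task state: " + state);
        }
    }
}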
Created attachment 176302: webcrawler configuration. This is the configuration I use for the web crawler.
The same effect occurs, by the way, when I use MaxIterations with a value of 1 instead of MaxDepth.
I can reproduce this problem. I will fix it in the coming days.
Sebastian, have you managed to fix this already?
I've tested my fix and committed it to SVN. MaxDepth=1 will crawl only the seed page; MaxDepth=2 will crawl the seed page plus the pages linked from the seed page, and so on.
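To make the fixed semantics explicit, here is an illustrative depth-limited crawl sketch (not the actual SMILA code): the seed counts as depth 1, pages linked from the seed count as depth 2, and so on, and anything beyond MaxDepth is skipped.

import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

public class DepthLimitedCrawl {

    // A URL together with the depth at which it was discovered.
    record Entry(String url, int depth) {}

    static void crawl(String seed, int maxDepth) {
        Queue<Entry> queue = new ArrayDeque<>();
        Set<String> visited = new HashSet<>();
        queue.add(new Entry(seed, 1)); // the seed itself is depth 1
        while (!queue.isEmpty()) {
            Entry current = queue.poll();
            if (current.depth() > maxDepth || !visited.add(current.url())) {
                continue; // beyond the limit, or already crawled
            }
            fetchAndIndex(current.url());
            for (String link : extractLinks(current.url())) {
                queue.add(new Entry(link, current.depth() + 1));
            }
        }
    }

    // Placeholders standing in for the real fetch, parse and index steps.
    static void fetchAndIndex(String url) { System.out.println("crawled " + url); }
    static List<String> extractLinks(String url) { return List.of(); }
}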
Regarding rev. 719: it seems to me that the problem is solved only for MaxDepth. With MaxIterations it is still not possible to crawl a single page by providing one seed and setting MaxIterations to 1. Is this the correct and intended behavior? Greets, Andrej
I am reopening this bug. @Thomas: Since Sebastian is unfortunately no longer involved in our project and this piece of code was originally contributed by brox, could you please take a look at it? Cheers, Igor
Hi Tom, any chance of looking at this soon? Cheers, Igor
Not really, unfortunately. Since I have no clue about this code either, there is probably a substantial learning cost involved, and I have more pressing issues at the moment. Since nobody else on the remaining team has any prior knowledge of the code, I am unassigning this issue, so anybody with some time or need on their hands is welcome to pitch in.
The Connectivity framework has been replaced by the new Importing framework. With the current Web Crawler, a single (root) URL can be crawled without following any links by setting the parameter "maxCrawlDepth" : "0".
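For anyone hitting this with the new framework, the sketch below shows how such a single-page crawl job might be defined and started via the SMILA REST API. Only "maxCrawlDepth" : "0" is taken from the comment above; the endpoint paths, workflow name, and the remaining parameter names are assumptions and should be checked against the current SMILA importing documentation.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class DefineSingleUrlCrawlJob {
    public static void main(String[] args) throws Exception {
        // Hypothetical job definition; only maxCrawlDepth comes from the comment above.
        String jobJson = """
                {
                  "name": "crawlSinglePageJob",
                  "workflow": "webCrawling",
                  "parameters": {
                    "dataSource": "web",
                    "startUrl": "http://www.example.org/",
                    "maxCrawlDepth": "0",
                    "jobToPushTo": "indexUpdateJob"
                  }
                }
                """;
        HttpClient client = HttpClient.newHttpClient();
        // Assumed default SMILA REST endpoint for job definitions.
        HttpRequest defineJob = HttpRequest.newBuilder(
                        URI.create("http://localhost:8080/smila/jobmanager/jobs/"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(jobJson))
                .build();
        System.out.println(client.send(defineJob, HttpResponse.BodyHandlers.ofString()).body());
        // Starting a run of the job, assumed to be a POST to the job's own URL.
        HttpRequest startJob = HttpRequest.newBuilder(
                        URI.create("http://localhost:8080/smila/jobmanager/jobs/crawlSinglePageJob/"))
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();
        System.out.println(client.send(startJob, HttpResponse.BodyHandlers.ofString()).body());
    }
}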
Closing this.