Bug 322322 - Cannot crawl a single page
Summary: Cannot crawl a single page
Status: CLOSED FIXED
Alias: None
Product: z_Archived
Classification: Eclipse Foundation
Component: Smila
Version: unspecified
Hardware: PC Linux
Importance: P4 minor
Target Milestone: ---
Assignee: Andreas Weber CLA
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-08-11 04:19 EDT by Andrej Rosenheinrich CLA
Modified: 2022-07-07 11:31 EDT
CC List: 3 users

See Also:


Attachments
webcrawler configuration (3.38 KB, text/xml)
2010-08-11 04:20 EDT, Andrej Rosenheinrich CLA

Description Andrej Rosenheinrich CLA 2010-08-11 04:19:16 EDT
Build Identifier: I20100608-0911

I am trying to crawl a single website, just the given seed, without following any links. When I set MaxDepth = 1, the crawl is aborted with the claim that the maximum depth is exceeded, and I get no record as a result.

I am using the standard webcrawler; SMILA.log says the following:
2010-08-11 10:07:54,014 INFO  [Thread-9                                     ]  web.WebCrawler                                - Initializing WebCrawler...
 2010-08-11 10:07:54,021 INFO  [Thread-9                                     ]  management.ManagementRegistration             - [Management Registration] Registering new agent [Crawlers/Web/Total]
 2010-08-11 10:07:54,023 INFO  [Thread-9                                     ]  management.ManagementRegistration             - [Management Registration] Registering new agent [Crawlers/Web/Total] in old controller [org.eclipse.smila.management.jmx.JmxManagementController]
 2010-08-11 10:07:54,025 INFO  [Thread-9                                     ]  jmx.JmxManagementController                   - SMILA:C0=Crawlers,C1=Web,Agent=Total
 2010-08-11 10:07:54,047 INFO  [Thread-9                                     ]  management.ManagementRegistration             - [Management Registration] Registering new agent [Crawlers/Web/web - 264504460]
 2010-08-11 10:07:54,049 INFO  [Thread-9                                     ]  management.ManagementRegistration             - [Management Registration] Registering new agent [Crawlers/Web/web - 264504460] in old controller [org.eclipse.smila.management.jmx.JmxManagementController]
 2010-08-11 10:07:54,051 INFO  [Thread-9                                     ]  jmx.JmxManagementController                   - SMILA:C0=Crawlers,C1=Web,Agent=web - 264504460
 2010-08-11 10:07:54,080 INFO  [Thread-9                                     ]  net.UrlNormalizerFactory                      - Using URL normalizer: org.eclipse.smila.connectivity.framework.crawler.web.net.BasicUrlNormalizer
 2010-08-11 10:07:54,101 INFO  [Thread-9                                     ]  filter.FilterFactory                          - Using URL filter: org.eclipse.smila.connectivity.framework.crawler.web.filter.impl.RegExpFilter
 2010-08-11 10:07:54,143 INFO  [Thread-9                                     ]  filter.FilterFactory                          - Using URL filter: org.eclipse.smila.connectivity.framework.crawler.web.filter.impl.MetaTagFilter
 2010-08-11 10:07:54,146 INFO  [Thread-9                                     ]  filter.FilterFactory                          - Using URL filter: org.eclipse.smila.connectivity.framework.crawler.web.filter.impl.CrawlScopeFilter
 2010-08-11 10:07:58,402 INFO  [Thread-9                                     ]  web.WebSiteIterator                           - Maximum depth exceeded!
 2010-08-11 10:07:58,407 ERROR [Thread-9                                     ]  impl.CrawlThread                              - org.eclipse.smila.connectivity.framework.CrawlerCriticalException: Unable to connect to web site specified in project Example Crawler Configuration
 2010-08-11 10:07:58,408 INFO  [Thread-9                                     ]  web.WebCrawler                                - Closing WebCrawler...
 2010-08-11 10:07:58,407 INFO  [Thread-11                                    ]  web.WebCrawler                                - Producer finished by forcing close procedure
 2010-08-11 10:07:58,429 INFO  [Thread-9                                     ]  impl.CrawlThread                              - Removing deltaindexing lock on datasource web
 2010-08-11 10:07:58,456 INFO  [Thread-9                                     ]  impl.CrawlThread                              - Finished session 9911a7d1-37a4-429a-ab5e-2dc83bc134ef and removed Deltaindexing lock on datasource web
 2010-08-11 10:07:58,457 INFO  [Thread-9                                     ]  impl.CrawlThread                              - Unregistering crawling thread web
 2010-08-11 10:07:58,458 INFO  [Thread-9                                     ]  impl.CrawlThread                              - Crawling thread web unregistered
 2010-08-11 10:07:58,459 INFO  [Thread-9                                     ]  impl.CrawlThread                              - Crawling thread web stopped.

Reproducible: Always

Steps to Reproduce:
1. Replace web.xml with the attached webcrawler configuration
2. Run SMILA
3. Start crawling by invoking startCrawlerTask(web) in JConsole (see the JMX sketch after these steps)
4. See the crawl get aborted (getCrawlerTasksState())
5. Read SMILA.log
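
For reference, the JConsole step can also be scripted. Below is a minimal JMX client sketch, assuming a locally running SMILA: the JMX service URL (port 9004), the MBean ObjectName and the String signature of the operations are assumptions and should be replaced with the values shown in the JConsole MBeans tree.

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class StartWebCrawl {
    public static void main(String[] args) throws Exception {
        // Assumed JMX endpoint of a locally running SMILA; adjust host and port as needed.
        JMXServiceURL url =
                new JMXServiceURL("service:jmx:rmi:///jndi/rmi://localhost:9004/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbeans = connector.getMBeanServerConnection();

            // Assumed ObjectName of the crawler controller MBean; copy the exact name
            // from the JConsole MBeans tree of the running instance.
            ObjectName controller = new ObjectName("SMILA:Agent=CrawlerController");

            // Equivalent of clicking startCrawlerTask(web) in JConsole (step 3).
            Object started = mbeans.invoke(controller, "startCrawlerTask",
                    new Object[] { "web" }, new String[] { String.class.getName() });
            System.out.println("startCrawlerTask returned: " + started);

            // Poll the crawl state, as in step 4.
            Object state = mbeans.invoke(controller, "getCrawlerTasksState",
                    new Object[0], new String[0]);
            System.out.println("getCrawlerTasksState returned: " + state);
        } finally {
            connector.close();
        }
    }
}

This only mirrors what JConsole does when the two operations are invoked by hand; it does not change the crawler behaviour described above.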
Comment 1 Andrej Rosenheinrich CLA 2010-08-11 04:20:49 EDT
Created attachment 176302 [details]
webcrawler configuration

The configuration I use for the webcrawler.
Comment 2 Andrej Rosenheinrich CLA 2010-08-11 05:57:01 EDT
The same effect happens, by the way, when I use MaxIterations with a value of 1 instead of MaxDepth.
Comment 3 Sebastian Voigt CLA 2010-08-16 09:31:28 EDT
I can reproduce this problem.

I will fix it in the upcoming days.
Comment 4 Igor Novakovic CLA 2010-11-30 06:24:53 EST
Sebastian, have you managed to fix this already?
Comment 5 Sebastian Voigt CLA 2010-12-17 06:04:41 EST
I've tested my fix and committed it to SVN.
MaxDepth=1 will crawl the seed page
MaxDepth=2 will crawl the seed page and the pages linked from the seed page
etc.
Comment 6 Andrej Rosenheinrich CLA 2011-01-31 05:04:43 EST
As of rev. 719, it seems to me that the problem is solved only for MaxDepth. With MaxIterations it is still not possible to crawl a single page by providing one seed and setting MaxIterations to 1. Is this the correct and intended behavior?

Greets,
Andrej
Comment 7 Igor Novakovic CLA 2011-02-07 09:36:49 EST
I am reopening this bug.

@Thomas:
Since Sebastian is unfortunately no longer involved in our project and this piece of code was initially contributed by brox, could you please take a look at it?

Cheers
Igor
Comment 8 Igor Novakovic CLA 2011-03-09 05:12:27 EST
Hi Tom,

Any chance of looking at this soon?

Cheers
Igor
Comment 9 thomas menzel CLA 2011-03-09 05:28:30 EST
Not really, unfortunately.
Since I have no clue about this code either, there is probably a substantial learning cost involved, and I have more pressing issues at the moment.

Since no one on the remaining team has any prior knowledge of the code, I am un-assigning this issue, so anybody with some time or need on their hands is welcome to pitch in.
Comment 10 Andreas Weber CLA 2012-12-18 09:27:14 EST
The Connectivity framework was replaced by the new Importing framework. With the current Web Crawler, a single (root) URL can be crawled (parameter: "maxCrawlDepth" : "0") without following any links.
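
For anyone who wants to do that with the Importing framework, here is a hedged sketch of defining a crawl job via the job manager REST API. Only the "maxCrawlDepth" : "0" parameter is taken from the previous comment; the endpoint URL, job and workflow names, start URL and the remaining parameter names are placeholders for illustration and need to be checked against the SMILA documentation of the release in use.

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class DefineSinglePageCrawlJob {
    public static void main(String[] args) throws Exception {
        // Only "maxCrawlDepth" : "0" comes from the comment above; everything else in
        // this job definition is a placeholder to be adapted to the actual setup.
        String jobDefinition = "{"
                + "\"name\":\"crawlSinglePage\","
                + "\"workflow\":\"webCrawling\","
                + "\"parameters\":{"
                +   "\"startUrl\":\"http://www.example.com/\","
                +   "\"maxCrawlDepth\":\"0\""   // crawl only the root URL, follow no links
                + "}}";

        // Assumed default location of the SMILA job manager REST API.
        URL url = new URL("http://localhost:8080/smila/jobmanager/jobs/");
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();
        connection.setRequestMethod("POST");
        connection.setRequestProperty("Content-Type", "application/json");
        connection.setDoOutput(true);
        try (OutputStream out = connection.getOutputStream()) {
            out.write(jobDefinition.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("Job manager responded with HTTP " + connection.getResponseCode());
    }
}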
Comment 11 Andreas Weber CLA 2013-04-15 11:51:33 EDT
Closing this