Created attachment 185440 [details]
Memory overview done with Eclipse MAT

As mentioned in the SMILA forum (http://www.eclipse.org/forums/index.php?t=msg&th=201819&start=0), when I start a deep crawl the memory fills up until it reaches MaxHeapSize. SMILA then often crashes, e.g. when another crawl job is started. A screenshot of the memory profiling done with MAT is attached.
Daniel, could you please take a short look at this?
We have to take a look at this.
I think I have found the issue: The member variable _webSiteIterator of class WebCrawler was not set to null when the internal CrawlingProducerThread finished. Therefore the last assigned WebSiteIterator instance could not be garbage collected as it was still referenced, and its internal set _linksDone (where ALL the crawled websites are stored) also remained in memory. I added the line _webSiteIterator = null; to the finally block of CrawlingProducerThread. With this fix the heap memory is freed by garbage collection as soon as a crawl finishes.
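For illustration, here is a minimal, self-contained sketch of the fix under a strongly simplified crawler structure. Only the names WebCrawler, CrawlingProducerThread, WebSiteIterator, _webSiteIterator and _linksDone come from the comment above; everything else is made up for this sketch and is not the actual SMILA code.

// Minimal, self-contained sketch of the fix. Only the names WebCrawler,
// CrawlingProducerThread, WebSiteIterator, _webSiteIterator and _linksDone come
// from the comment above; the rest is simplified for illustration.
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

public class WebCrawler {

  /** Simplified stand-in for the real iterator; it remembers every crawled link in _linksDone. */
  static class WebSiteIterator implements Iterator<String> {
    private final Set<String> _linksDone = new HashSet<String>();
    private final Iterator<String> _seeds;

    WebSiteIterator(final Iterable<String> seeds) {
      _seeds = seeds.iterator();
    }

    public boolean hasNext() {
      return _seeds.hasNext();
    }

    public String next() {
      final String link = _seeds.next();
      _linksDone.add(link); // this set grows with every crawled site
      return link;
    }

    public void remove() {
      throw new UnsupportedOperationException();
    }
  }

  private volatile WebSiteIterator _webSiteIterator;

  class CrawlingProducerThread extends Thread {
    private final Iterable<String> _seeds;

    CrawlingProducerThread(final Iterable<String> seeds) {
      _seeds = seeds;
    }

    public void run() {
      try {
        _webSiteIterator = new WebSiteIterator(_seeds);
        while (_webSiteIterator.hasNext()) {
          _webSiteIterator.next();
          // ... fetch and process the page here ...
        }
      } finally {
        // the fix: drop the reference so the iterator and its _linksDone set
        // become eligible for garbage collection once the crawl has finished
        _webSiteIterator = null;
      }
    }
  }
}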
Hi,

thanks for having a look at this issue. But I have to admit that the issue is not fixed by your code. The problem is that, as long as the crawl is running (as you stated correctly), _linksToDo, _linksNextLevel and _linksDone grow steadily. So if you have a long-running crawl (i.e. "crawling in the wild" with a few seeds and a large depth), these variables cause the OOM exception, because you will find many new links on every site you crawl. Your fix only helps for limited crawls, I guess. As Igor stated, this is more a design problem.

And as I promised Igor: we made a hack for this ourselves, because it was a blocking issue for our project. _linksToDo, _linksNextLevel and _linksDone are now stored in a database instead of being held in memory. This enables us to run very large and long crawls without growing memory requirements. But with this solution we are no longer able to count correctly (as seen in JConsole).

Hope this helps.
Martin
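For illustration only, a rough sketch of what such a database-backed link store could look like (assuming an embedded database such as H2; the class name, schema and SQL below are assumptions made for this sketch, not the actual workaround):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Hypothetical database-backed replacement for the in-memory link sets.
// The state column plays the role of _linksToDo ('TODO'), _linksNextLevel
// ('NEXT_LEVEL') and _linksDone ('DONE').
public final class DatabaseLinkStore implements AutoCloseable {

  private final Connection _connection;

  public DatabaseLinkStore(final String jdbcUrl) throws SQLException {
    _connection = DriverManager.getConnection(jdbcUrl);
    try (PreparedStatement create = _connection.prepareStatement(
        "CREATE TABLE IF NOT EXISTS links (url VARCHAR(2048) PRIMARY KEY, state VARCHAR(10))")) {
      create.execute();
    }
  }

  /** Adds a link in state TODO unless it has been seen before in any state. */
  public void addIfNew(final String url) throws SQLException {
    try (PreparedStatement select = _connection.prepareStatement(
        "SELECT 1 FROM links WHERE url = ?")) {
      select.setString(1, url);
      try (ResultSet rs = select.executeQuery()) {
        if (rs.next()) {
          return; // already known (TODO, NEXT_LEVEL or DONE)
        }
      }
    }
    try (PreparedStatement insert = _connection.prepareStatement(
        "INSERT INTO links (url, state) VALUES (?, 'TODO')")) {
      insert.setString(1, url);
      insert.executeUpdate();
    }
  }

  /** Returns the next TODO link and marks it DONE, or null if nothing is left. */
  public String nextTodo() throws SQLException {
    final String url;
    try (PreparedStatement select = _connection.prepareStatement(
        "SELECT url FROM links WHERE state = 'TODO' LIMIT 1");
        ResultSet rs = select.executeQuery()) {
      if (!rs.next()) {
        return null;
      }
      url = rs.getString(1);
    }
    try (PreparedStatement update = _connection.prepareStatement(
        "UPDATE links SET state = 'DONE' WHERE url = ?")) {
      update.setString(1, url);
      update.executeUpdate();
    }
    return url;
  }

  public void close() throws SQLException {
    _connection.close();
  }
}

The key point is that duplicate detection and the TODO/DONE bookkeeping happen in the database, so memory usage stays flat no matter how many links the crawl discovers.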
Well, the problem you describe is certainly not a memory leak but a design problem, because all links are stored in memory, which eventually leads to an OOM exception. What I fixed was a real memory leak, because the resources were not freed when a crawl finished. At least this issue is fixed now and helps when multiple crawls (whose total list of visited links fits into memory) are executed repeatedly. I suggest closing this bug and creating a change request for a WebSiteIterator that does not keep all links in memory.

Daniel
I agree with Daniel and will therefore close this issue. Martin, it would be really great if you could open a new enhancement issue and describe your workaround and the pitfalls you've experienced, so that we can use it as input for redesigning connectivity in the next release.

Cheers,
Igor
Closing this.