Bug 332864 - Possible memory leak in Crawler/WebsiteIterator
Summary: Possible memory leak in Crawler/WebsiteIterator
Status: CLOSED FIXED
Alias: None
Product: z_Archived
Classification: Eclipse Foundation
Component: Smila
Version: unspecified
Hardware: PC Linux
Importance: P2 critical
Target Milestone: ---
Assignee: Daniel Stucky CLA
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-12-17 11:35 EST by Martin Röbert CLA
Modified: 2022-07-07 11:31 EDT

See Also:


Attachments
Memory overview done with Eclipse MAT (96.66 KB, image/png)
2010-12-17 11:35 EST, Martin Röbert CLA

Description Martin Röbert CLA 2010-12-17 11:35:15 EST
Created attachment 185440
Memory overview done with Eclipse MAT

As mentioned in the SMILA forum (http://www.eclipse.org/forums/index.php?t=msg&th=201819&start=0), if I start a deep crawl, the memory fills up until it reaches MaxHeapSize. SMILA often crashes, e.g. when starting another crawl job.

A screenshot of the memory profiling done with MAT is attached.
Comment 1 Igor Novakovic CLA 2010-12-20 10:02:38 EST
Daniel, could you please take a short look at this?
Comment 2 Igor Novakovic CLA 2011-03-09 05:08:21 EST
We have to take a look at this.
Comment 3 Daniel Stucky CLA 2011-03-10 09:06:01 EST
I think I have found the issue:

The problem was that the member variable _webSiteIterator of class WebCrawler was not set to null when the internal CrawlingProducerThread finished. Therefore the last assigned instance of a WebSiteIterator could not be garbage collected, as it was still referenced, and so its internal set _linksDone (where ALL the crawled websites were stored) also remained in memory.

I added the line
_webSiteIterator = null;
to the finally block of CrawlingProducerThread. With this fix, the heap memory is freed by garbage collection as soon as a crawl finishes.
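
The fix described above can be sketched as follows. This is a minimal illustration under assumptions, not the SMILA source: WebCrawlerSketch, WebSiteIteratorSketch, crawl, visit, linksDone and holdsIterator are hypothetical names. Only _webSiteIterator, _linksDone and the CrawlingProducerThread name are taken from this report.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for WebSiteIterator: it only models the
// _linksDone set that keeps every crawled URL on the heap.
class WebSiteIteratorSketch {
    private final Set<String> _linksDone = new HashSet<>();

    void visit(String url) {
        _linksDone.add(url);
    }

    int linksDone() {
        return _linksDone.size();
    }
}

// Hypothetical stand-in for WebCrawler: only the field and the
// finally block matter for this illustration.
class WebCrawlerSketch {
    // volatile so the reset done by the producer thread is visible elsewhere
    private volatile WebSiteIteratorSketch _webSiteIterator;

    // Runs one crawl on a producer thread and waits for it to finish.
    void crawl(String... urls) {
        Thread producer = new Thread(() -> {
            try {
                WebSiteIteratorSketch it = new WebSiteIteratorSketch();
                _webSiteIterator = it;
                for (String url : urls) {
                    it.visit(url);
                }
            } finally {
                // Drop the reference so the iterator and its _linksDone set
                // become eligible for garbage collection after the crawl.
                _webSiteIterator = null;
            }
        }, "CrawlingProducerThread");
        producer.start();
        try {
            producer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // True while the crawler still references an iterator.
    boolean holdsIterator() {
        return _webSiteIterator != null;
    }
}
```

Without the assignment in the finally block, the field would keep pointing at the last iterator for as long as the crawler object lives, which matches the retained set visible in the MAT overview.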
Comment 4 Martin Röbert CLA 2011-03-11 02:51:01 EST
Hi,

Thanks for having a look at this issue.

But I have to say that the issue is not fixed by your change.
The problem is that, as long as the crawl is running, _linksToDo, _linksNextLevel and _linksDone grow steadily, as you stated correctly.
So in a long-running crawl (i.e. "crawling in the wild" with a few seeds and a large depth), these variables cause the OOM exception, because you find many links on every site you crawl.
Your solution only helps for a limited crawl, I guess.

As Igor stated, this is more of a design problem. And as I promised Igor: we made a hack for this ourselves, because this was a blocking issue for our project. _linksToDo, _linksNextLevel and _linksDone are now stored in a database and no longer held in memory. This enables us to do very large and long crawls without rising memory requirements. But with this solution we are no longer able to count correctly (as seen in JConsole).

Hope this helps.

Martin
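
The idea of keeping pending links out of the heap, as described in this comment, can be sketched as follows. This is not the code of the workaround, which is not shown in this report: a temporary file stands in for the database, only a _linksToDo-style queue is modelled, and DiskLinkQueue with its methods is a hypothetical name.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;

// Hypothetical disk-backed FIFO of URLs still to crawl. The URLs live in a
// temporary file; only two file offsets are kept on the heap.
final class DiskLinkQueue implements AutoCloseable {
    private final File file;
    private final RandomAccessFile raf;
    private long readPos;   // offset of the next URL to hand out
    private long writePos;  // offset at which the next URL is appended

    DiskLinkQueue() {
        try {
            file = File.createTempFile("linksToDo", ".bin");
            raf = new RandomAccessFile(file, "rw");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Appends a URL to the end of the queue.
    void add(String url) {
        try {
            raf.seek(writePos);
            raf.writeUTF(url); // length-prefixed, limited to 65535 encoded bytes
            writePos = raf.getFilePointer();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns the oldest URL not yet handed out, or null if none is left.
    String poll() {
        if (readPos == writePos) {
            return null;
        }
        try {
            raf.seek(readPos);
            String url = raf.readUTF();
            readPos = raf.getFilePointer();
            return url;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Closes the file handle and removes the temporary file.
    @Override
    public void close() {
        try {
            raf.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            file.delete();
        }
    }
}
```

A set such as _linksDone also needs a fast membership test, which a plain file does not provide. A database table with a unique index on the URL would be one way to get that.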
Comment 5 Daniel Stucky CLA 2011-03-11 03:36:37 EST
Well,
the problem you describe is certainly not a memory leak but a design problem: all links are stored in memory, which eventually leads to an OOM exception.

What I fixed was a real memory leak, because the resources were not freed when a crawl finished. At least this issue is fixed and helps when multiple crawls (whose total list of visited links fits in memory) are executed repeatedly.

I suggest that this bug be closed and that a ChangeRequest for a NonMemory-WebsiteIterator be created.

Daniel
Comment 6 Igor Novakovic CLA 2011-03-11 08:34:26 EST
I agree with Daniel and will therefore close this issue.

Martin, it would be really great if you could open a new enhancement issue and describe your workaround and the pitfalls you've experienced so that we can use it as an input while redesigning connectivity for the next release.

Cheers
Igor
Comment 7 Andreas Weber CLA 2013-04-15 11:27:51 EDT
Closing this