Bug 339315

Summary: Massive temporary files accumulate during long crawls
Product: z_Archived
Component: Smila
Reporter: nils.thieme
Assignee: Andreas Weber <Andreas.Weber>
Status: CLOSED FIXED
QA Contact:
Severity: enhancement
Priority: P3
CC: daniel.stucky, marco.strack
Version: unspecified
Target Milestone: ---
Hardware: PC
OS: Linux
Whiteboard:

Description nils.thieme CLA 2011-03-09 00:42:07 EST
Hello,

The crawler stores downloaded objects in the temporary folder org.eclipse.smila.connectivity.framework.web in the workspace directory. These files are deleted only after the crawl has finished. During a long crawl the temporary files therefore remain for the whole run, and the free space on the hard drive keeps shrinking.

It would be better if the temporary files were deleted as soon as they have been deserialized. However, we have also noticed that deleting the files immediately after deserialization leads to a massive performance decline.
Comment 1 Igor Novakovic CLA 2011-03-11 04:21:05 EST
Yes, this is an important issue. We will address it in the next release when we redesign the connectivity concept.
Comment 2 Andreas Weber CLA 2013-01-08 03:25:37 EST
Connectivity has been replaced by the new Importing framework, but we still have problems with massive temporary files during import. So I am leaving this issue open and have only changed its component setting.
Comment 3 Andreas Weber CLA 2013-07-18 07:25:48 EDT
We have the following problem: "run once" jobs (e.g. crawl jobs) have only one workflow run. Temp objects in the job management are not removed until the whole workflow run has completed, so they are not removed after each successful task. The reason is that an input object could be shared between workers in the workflow.

The idea: we try to identify whether a workflow has workers (or rather actions) that share the same input bucket. If that is not the case, we call the workflow "non-forking". For non-forking workflows, we change the cleanup of temp objects in the job management: after each successful task, the worker's input object can be removed (because no other worker will be working on the same object).

Typically, crawl workflows are non-forking.
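
The non-forking condition described above could be checked roughly as follows. This is only a minimal sketch: the class NonForkingCheck, the method isNonForking and the map from action name to input bucket names are hypothetical and are not taken from the SMILA job management API.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of SMILA: a workflow is modelled as a map
// from action name to the names of the buckets that action reads from.
final class NonForkingCheck {

  /** Returns true if no bucket is used as input by more than one action. */
  static boolean isNonForking(Map<String, List<String>> inputBucketsByAction) {
    Set<String> seen = new HashSet<>();
    for (List<String> buckets : inputBucketsByAction.values()) {
      // De-duplicate first so that one action listing a bucket twice
      // does not count as a fork.
      for (String bucket : new HashSet<>(buckets)) {
        if (!seen.add(bucket)) {
          return false; // a second action reads the same bucket
        }
      }
    }
    return true;
  }

  public static void main(String[] args) {
    // Linear crawl-like workflow: every bucket has exactly one consumer.
    Map<String, List<String>> crawl = Map.of(
        "fetcher", List.of("linksToFetch"),
        "updater", List.of("fetchedRecords"));
    // Forking workflow: two actions read the same bucket.
    Map<String, List<String>> forked = Map.of(
        "indexer", List.of("fetchedRecords"),
        "archiver", List.of("fetchedRecords"));
    System.out.println(isNonForking(crawl));  // prints "true"
    System.out.println(isNonForking(forked)); // prints "false"
  }
}
```

If the check returns true, the input object of a worker can be removed as soon as its task has succeeded, because no other action will read it.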
Comment 4 Andreas Weber CLA 2013-07-19 05:54:39 EDT
The above is implemented now: (Non-persistent) input objects of workers from non-forking workflows will be removed after the worker has successfully completed its task. For most cases (especially typical crawl workflows) this should be sufficient to avoid the massive accumulation of temp objects in the objectstore.

However, it could be further improved (a) by checking the non-forking condition not for the whole workflow but at a finer granularity, for each workflow bucket, and (b) by implementing cleanup logic for forking buckets too (i.e. clean up once all workers using that input bucket have successfully finished their task).
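
Improvement (b) amounts to counting, per temp object, how many consumers of its bucket still have to finish. The following is a rough sketch of that idea only, not the implemented SMILA logic. The class TempObjectCleaner and its methods are hypothetical, the sketch is not thread-safe, and persistent objects would have to be excluded by the caller.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not SMILA code: removes a temp object once every
// worker that reads its bucket has successfully finished its task.
final class TempObjectCleaner {
  // bucket name -> number of workers using that bucket as input
  private final Map<String, Integer> consumersPerBucket;
  // object id -> number of workers that still have to process the object
  private final Map<String, Integer> pending = new HashMap<>();

  TempObjectCleaner(Map<String, Integer> consumersPerBucket) {
    this.consumersPerBucket = new HashMap<>(consumersPerBucket);
  }

  /** Registers a new temp object written to the given bucket. */
  void objectCreated(String bucket, String objectId) {
    pending.put(objectId, consumersPerBucket.getOrDefault(bucket, 0));
  }

  /**
   * Called after a worker has successfully processed the object.
   * Returns true if this was the last consumer, i.e. the object can be
   * removed from the object store now.
   */
  boolean taskSucceeded(String objectId) {
    Integer remaining = pending.get(objectId);
    if (remaining == null) {
      return false; // unknown or already removed
    }
    if (remaining <= 1) {
      pending.remove(objectId);
      return true; // last consumer finished: safe to delete
    }
    pending.put(objectId, remaining - 1);
    return false; // other workers still need this object
  }
}
```

For a non-forking bucket the count is 1, so the object is removed right after the first successful task, which matches the behaviour implemented above. For a forking bucket the removal is deferred until the last consumer has finished.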
Comment 5 Andreas Weber CLA 2013-07-19 07:25:29 EDT
Added SMILA documentation: http://wiki.eclipse.org/SMILA/Documentation/WorkerAndWorkflows#Non-forking_workflows
Comment 6 Andreas Weber CLA 2015-03-19 06:58:57 EDT
closed