Hello, the crawler stores the downloaded objects in the temporary folder org.eclipse.smila.connectivity.framework.web in the workspace directory. These files are deleted only after the crawl has finished. During a long crawl the temporary files remain for a long time, so the free space on the hard drive keeps shrinking. It would be better if the temporary files were deleted right after they have been deserialized. However, we have also noticed that deleting the files immediately after deserialization leads to a massive performance decline.
Yes, this is an important issue. We will address it in the next release when we redesign the connectivity concept.
Connectivity has been replaced by the new Importing framework, but we still have problems with massive amounts of temporary files during import. So I am leaving this issue open and have only changed its component setting.
We have the following problem: "run once" jobs (e.g. crawl jobs) have only one workflow run. Temp objects in the job management are not removed until the whole workflow run is completed, so they are not removed after a successful task. The reason for this is that the input object could be shared between workers in the workflow. The idea: we try to identify whether a workflow has workers (resp. actions) that share the same input bucket. If that is not the case, we call the workflow "non-forking". For non-forking workflows, we change the cleanup of the temp objects in the job management: after each successful task, the input object of the worker can be removed (because no other worker will be working on the same object). Typically, crawl workflows are non-forking.
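To make the idea concrete, here is a minimal sketch of such a non-forking check. It is not SMILA code: the class NonForkingCheck, the method isNonForking, and the representation of a workflow as a map from action name to input bucket names are assumptions made for illustration only.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative only: hypothetical names, not part of the SMILA API. */
final class NonForkingCheck {

    /**
     * Assumed workflow model: action name -> names of its input buckets.
     * A workflow is non-forking if no bucket is an input of more than one action.
     */
    static boolean isNonForking(Map<String, List<String>> inputBucketsByAction) {
        Set<String> seenBuckets = new HashSet<>();
        for (List<String> inputBuckets : inputBucketsByAction.values()) {
            // Deduplicate within one action so only sharing between actions counts.
            for (String bucket : new HashSet<>(inputBuckets)) {
                if (!seenBuckets.add(bucket)) {
                    return false; // a second action reads the same bucket
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Hypothetical crawl-like workflow: every bucket has exactly one consumer.
        Map<String, List<String>> crawl = new LinkedHashMap<>();
        crawl.put("fetcher", List.of("linksToFetch"));
        crawl.put("indexer", List.of("fetchedRecords"));
        System.out.println("crawl non-forking: " + isNonForking(crawl));

        // Hypothetical forking workflow: two actions read the same bucket.
        Map<String, List<String>> forking = new LinkedHashMap<>();
        forking.put("indexer", List.of("fetchedRecords"));
        forking.put("archiver", List.of("fetchedRecords"));
        System.out.println("forking non-forking: " + isNonForking(forking));
    }
}
```

With a check like this evaluated once per workflow definition, the job management could decide up front whether input objects may be deleted after each successful task.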
The above is implemented now: (non-persistent) input objects of workers in non-forking workflows are removed after the worker has successfully completed its task. For most cases (especially typical crawl workflows) this should be sufficient to avoid the massive accumulation of temp objects in the objectstore. However, it could be improved further (a) by checking the non-forking condition not per workflow but at a finer granularity, for each workflow bucket, and (b) by implementing cleanup logic for forking buckets too (i.e. clean up once all workers using that input bucket have successfully finished their tasks).
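Improvement (b) could look roughly like the following sketch. It is not implemented in SMILA; the class SharedInputCleanup and its methods are hypothetical. It simply tracks, per input object, which workers still have to finish successfully before the object may be deleted.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Illustrative only: hypothetical names, not part of the SMILA API. */
final class SharedInputCleanup {

    // Object id -> workers that have not yet successfully finished their task on it.
    private final Map<String, Set<String>> pendingWorkers = new HashMap<>();

    /** Registers a temp object together with all workers that consume its bucket. */
    void register(String objectId, Set<String> consumingWorkers) {
        pendingWorkers.put(objectId, new HashSet<>(consumingWorkers));
    }

    /**
     * Records a successful task. Returns true only when the last pending worker
     * has finished, i.e. when the caller may delete the object.
     */
    boolean taskSucceeded(String objectId, String worker) {
        Set<String> remaining = pendingWorkers.get(objectId);
        if (remaining == null) {
            return false; // unknown or already cleaned up: never signal deletion
        }
        remaining.remove(worker);
        if (remaining.isEmpty()) {
            pendingWorkers.remove(objectId);
            return true;
        }
        return false;
    }
}
```

A failed task would simply not call taskSucceeded, so the object stays available for the retry. A real implementation would also need to be thread-safe and to survive restarts, which this sketch ignores.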
added SMILA Documentation: http://wiki.eclipse.org/SMILA/Documentation/WorkerAndWorkflows#Non-forking_workflows
closed