Bug 356799

Summary: Job state shows still RUNNING when worker node crashed
Product: z_Archived Reporter: Gunnar Wagenknecht <gunnar>
Component: gyrex Assignee: Gunnar Wagenknecht <gunnar>
Status: RESOLVED FIXED QA Contact:
Severity: enhancement    
Priority: P3 CC: andreas.mihm, mike.tschierschke
Version: unspecified   
Target Milestone: ---   
Hardware: All   
OS: All   
Whiteboard:

Description Gunnar Wagenknecht CLA 2011-09-06 08:01:33 EDT
When a worker engine shuts down uncleanly (e.g. after a crash), a job's state is not reset. The job might still show up as RUNNING.

The reason is that the job state is currently persisted within the cloud preferences. However, the persisted state can get out of sync with what is actually happening on the worker engine.

WAITING == is in queue
RUNNING == is executing

ZooKeeper has the concept of ephemeral entries. They are removed automatically when the session of the client that created them ends, e.g. when the worker node disconnects. However, they are only useful for the RUNNING state, not for the WAITING or ABORTING state.
Comment 1 Gunnar Wagenknecht CLA 2011-09-19 08:01:09 EDT
I think it may be possible to separate the "RUNNING" status from the rest. We could refactor status handling into a status manager. The status information might still be stored in the preferences for lookup purposes.
Comment 2 Gunnar Wagenknecht CLA 2011-09-20 09:27:42 EDT
Solved by implementing "active/inactive" management using ephemeral nodes in ZooKeeper. When a job is scheduled on a worker engine, it is marked active. When it has finished, it is marked inactive. If the worker engine dies or is shut down while the job is running, the ephemeral node is removed by ZooKeeper. The JobManager then discovers such bogus states the next time the job is queued or when the cleanup triggers. There is also a new console command for detecting and cleaning up bogus jobs.
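
For readers who want to see the detection idea in isolation, here is a minimal in-memory sketch. It is not the Gyrex implementation and does not talk to ZooKeeper. The class ActiveJobTracker, its methods, and the NONE state are hypothetical names chosen for illustration. One map stands in for the state persisted in the cloud preferences, and a second map stands in for the ephemeral "active" nodes, which disappear when the owning worker's session is lost.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch; names are illustrative and not taken from Gyrex.
class ActiveJobTracker {

    // NONE is an assumed "idle" state; the report only names WAITING, RUNNING and ABORTING.
    enum JobState { NONE, WAITING, RUNNING, ABORTING }

    // Stands in for the job state persisted in the cloud preferences.
    private final Map<String, JobState> persistedState = new TreeMap<>();

    // Stands in for ephemeral ZooKeeper nodes: job id -> id of the owning worker session.
    private final Map<String, String> activeMarkers = new HashMap<>();

    void queue(String jobId) {
        persistedState.put(jobId, JobState.WAITING);
    }

    // A worker picks up the job: persist RUNNING and create the "active" marker.
    void start(String jobId, String workerId) {
        persistedState.put(jobId, JobState.RUNNING);
        activeMarkers.put(jobId, workerId);
    }

    // Clean finish: the worker removes its marker and resets the persisted state.
    void finish(String jobId) {
        activeMarkers.remove(jobId);
        persistedState.put(jobId, JobState.NONE);
    }

    // Simulates ZooKeeper dropping all ephemeral nodes of a lost session.
    // The persisted state is deliberately left untouched, as in a real crash.
    void workerSessionLost(String workerId) {
        activeMarkers.values().removeIf(workerId::equals);
    }

    // A job is bogus if it is persisted as RUNNING but has no active marker.
    // Resets such jobs to NONE and returns their ids in sorted order.
    List<String> cleanupBogusJobs() {
        List<String> reset = new ArrayList<>();
        for (Map.Entry<String, JobState> entry : persistedState.entrySet()) {
            boolean active = activeMarkers.containsKey(entry.getKey());
            if (entry.getValue() == JobState.RUNNING && !active) {
                entry.setValue(JobState.NONE);
                reset.add(entry.getKey());
            }
        }
        return reset;
    }

    JobState stateOf(String jobId) {
        return persistedState.getOrDefault(jobId, JobState.NONE);
    }
}
```

In this sketch, a job started on a worker whose session is later lost keeps reporting RUNNING until cleanupBogusJobs() runs. That mirrors the behavior described above, where the bogus state is only discovered on the next queue attempt or cleanup run. With real ZooKeeper, the marker would presumably be a node created in ephemeral mode, so no explicit workerSessionLost step would be needed.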