When a worker engine shuts down uncleanly (e.g. a crash), a job's state is not reset; it might still show up as RUNNING. The reason is that the job state is currently persisted within the cloud preferences, which can get out of sync.

WAITING == is in queue
RUNNING == is executing

ZooKeeper has the concept of ephemeral entries, which are removed when the node disconnects. However, they are only useful for the RUNNING state, not for the WAITING or ABORTING states.
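For illustration, a minimal sketch of how an ephemeral entry behaves with the plain ZooKeeper Java client; the connection string, session timeout, and the /running-job-42 path are placeholders, not the actual Gyrex layout:

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class EphemeralNodeDemo {
    public static void main(String[] args) throws Exception {
        // connect to ZooKeeper (placeholder connect string and timeout)
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> {});

        // an ephemeral node lives only as long as the session that created it
        zk.create("/running-job-42", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        // when the session closes (or the process crashes), ZooKeeper removes
        // the node automatically; no explicit delete is required
        zk.close();
    }
}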
I think it may be possible to separate the "RUNNING" status from the rest. We could refactor status handling into a status manager. The status information might still be stored in the preferences for lookup purposes.
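A possible shape for such a status manager (purely hypothetical; the interface and method names are not existing Gyrex API):

public interface IJobStatusManager {

    enum JobState { NONE, WAITING, RUNNING, ABORTING }

    // persisted states (WAITING, ABORTING) still read from the preferences
    JobState getPersistedState(String jobId);

    // volatile RUNNING state, backed by an ephemeral ZooKeeper node
    boolean isRunning(String jobId);

    // effective state: RUNNING only counts while the ephemeral node exists
    default JobState getEffectiveState(String jobId) {
        return isRunning(jobId) ? JobState.RUNNING : getPersistedState(jobId);
    }
}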
Solved by implementing "active/inactive" management using ephemeral nodes in ZooKeeper. When a job is scheduled on a worker engine it is marked active. When it finishes it is marked inactive. If the worker engine dies or is shut down while the job is running, the ephemeral node is removed by ZooKeeper. The JobManager then discovers such bogus states the next time the job is queued or when the cleanup triggers. There is also a new console command for detecting and cleaning up bogus jobs.
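A rough sketch of this bookkeeping; the /jobs/active/<id> path layout, the class name and the method names are assumptions, not the actual implementation:

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ActiveJobTracker {

    private final ZooKeeper zk;

    public ActiveJobTracker(ZooKeeper zk) {
        this.zk = zk;
    }

    // called when a job is scheduled on a worker engine;
    // assumes the parent path /jobs/active already exists
    public void markActive(String jobId) throws KeeperException, InterruptedException {
        // ephemeral: removed automatically if the worker engine's session dies
        zk.create("/jobs/active/" + jobId, new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
    }

    // called when a job finishes normally
    public void markInactive(String jobId) throws KeeperException, InterruptedException {
        zk.delete("/jobs/active/" + jobId, -1); // -1 = any version
    }

    // a "bogus" job: the persisted state says RUNNING but the ephemeral
    // active node is gone, i.e. the worker engine died mid-run
    public boolean isBogus(String jobId, boolean persistedStateSaysRunning)
            throws KeeperException, InterruptedException {
        return persistedStateSaysRunning
                && zk.exists("/jobs/active/" + jobId, false) == null;
    }
}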