| Summary: | Job state shows still RUNNING when worker node crashed | | |
|---|---|---|---|
| Product: | z_Archived | Reporter: | Gunnar Wagenknecht <gunnar> |
| Component: | gyrex | Assignee: | Gunnar Wagenknecht <gunnar> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | enhancement | | |
| Priority: | P3 | CC: | andreas.mihm, mike.tschierschke |
| Version: | unspecified | | |
| Target Milestone: | --- | | |
| Hardware: | All | | |
| OS: | All | | |
| Whiteboard: | | | |
|
Description
Gunnar Wagenknecht
I think it may be possible to separate the "RUNNING" status from the rest. We could refactor status handling into a status manager. The status information might still be stored in the preferences for lookup purposes.

Solved by implementing "active/inactive" management using ephemeral nodes in ZooKeeper. When a job is scheduled on a worker engine, it is marked active. When it finishes, it is marked inactive. If the worker engine dies or is shut down while the job is running, ZooKeeper removes the ephemeral node. The JobManager then discovers such bogus states the next time the job is queued or when the cleanup triggers. There is also a new console command for detecting and cleaning up bogus jobs.
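
The sketch below illustrates the general ephemeral-node pattern described above using the plain Apache ZooKeeper Java client. The class name, node paths, and method names are hypothetical and not the actual Gyrex JobManager API; it only shows how an ephemeral node tied to the worker's session can serve as the "active" marker that disappears automatically when the worker engine crashes.

```java
import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

/**
 * Illustrative sketch (not Gyrex code): tracks whether a job is actively
 * running by creating an ephemeral ZooKeeper node bound to the worker's
 * session. The parent path is assumed to exist already.
 */
public class ActiveJobTracker {

    private static final String ACTIVE_JOBS_PATH = "/jobs/active"; // assumed node layout

    private final ZooKeeper zk;

    public ActiveJobTracker(ZooKeeper zk) {
        this.zk = zk;
    }

    /** Mark a job active: create an ephemeral node tied to this worker's session. */
    public void markActive(String jobId, String workerId)
            throws KeeperException, InterruptedException {
        zk.create(ACTIVE_JOBS_PATH + "/" + jobId,
                workerId.getBytes(StandardCharsets.UTF_8),
                ZooDefs.Ids.OPEN_ACL_UNSAFE,
                CreateMode.EPHEMERAL);
    }

    /** Mark a job inactive: remove its ephemeral node explicitly on normal completion. */
    public void markInactive(String jobId)
            throws KeeperException, InterruptedException {
        zk.delete(ACTIVE_JOBS_PATH + "/" + jobId, -1);
    }

    /**
     * A job is active only if its ephemeral node still exists. If the worker
     * engine died, ZooKeeper expired its session and removed the node, so a
     * job persisted as RUNNING without a node is a bogus state to clean up.
     */
    public boolean isActive(String jobId)
            throws KeeperException, InterruptedException {
        return zk.exists(ACTIVE_JOBS_PATH + "/" + jobId, false) != null;
    }
}
```

The key design point is that liveness is delegated to the ZooKeeper session: no explicit heartbeat or crash handler is needed on the worker, because the server removes the ephemeral node as soon as the session expires, and the JobManager can detect the stale RUNNING state by a simple existence check.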