All the slaves for the Releng JIPP are down. The 4.13 build on the 24th did not run due to infrastructure issues.
I'm investigating.
I've restarted the JIPP, agents are coming back online. Did you try to restart your instance via committer toolbox during the weekend?
(In reply to Mikaël Barbero from comment #2)
> I've restarted the JIPP, agents are coming back online.
>
> Did you try to restart your instance via committer toolbox during the
> weekend?

No, we haven't done anything from our side. The Windows test machine is still down, can you take a look?
(In reply to Mikaël Barbero from comment #2)
> Did you try to restart your instance via committer toolbox during the
> weekend?

How? AFAIK this is no longer possible since the Releng project was moved into the Platform project.
You're right, Dani. I forgot about the move from Releng to Platform. The Windows test agent is back online.
All slaves are down again. This is a blocker for us.
Also sent an e-mail to webmaster.
We noticed OutOfMemoryErrors. Memory has been increased. Slaves are up again.
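(A quick, generic way to double-check that a heap bump actually took effect is to ask the JVM itself for its ceiling. The class below is only an illustrative standalone check run with plain java, not part of the actual JIPP setup:)

    // Illustrative standalone check, not part of the JIPP setup:
    // print the max heap the running JVM effectively picked up (~ -Xmx).
    public class HeapCheck {
        public static void main(String[] args) {
            long maxBytes = Runtime.getRuntime().maxMemory();
            System.out.printf("Effective max heap: %d MiB%n", maxBytes / (1024 * 1024));
        }
    }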
For some reason, the Windows test machine was not able to re-connect either (while still reporting as connected from the agent side). I've restarted the instance and re-connected the Windows agent. I'm monitoring the instance.
All agents are back online.
What's the reason for these recent issues? We're in the RC* phase and this is not acceptable.
Plus this: We are in 2019, Cloud era. Every system I know has watchdogs that monitor the system and either restart automatically or inform the admins. That we have to open bugs for this and wait for a resolution sounds ridiculous to me.
(In reply to Dani Megert from comment #12)
> Plus this: We are in 2019, Cloud era. Every system I know has watchdogs that
> monitor the system and either restart automatically or inform the admins.
> That we have to open bugs for this and wait for a resolution sounds
> ridiculous to me.

Mikaël, IIRC you said that with JIRO we will get some self healing, right? That would be great!
(In reply to Dani Megert from comment #11)
> What's the reason for these recent issues?

Regarding comment 8: Could a job be using more memory than before?
(In reply to Dani Megert from comment #11)
> What's the reason for these recent issues? We're in the RC* phase and this is
> not acceptable.

We understand that outages like that are frustrating. Rest assured that we also understand very well that any outage happening late in a release schedule is highly stressful, and that we prioritize those issues accordingly.

Our current reasoning is that your JIPP faced consecutive OOMEs because it has reached a threshold where the default heap that we assigned to it (the same value as for 99% of the JIPPs on the old infra) is not high enough anymore. The Xmx setting has not changed for at least a couple of years and still works well for 99% of the JIPPs (Releng's included, until recently). Our mitigation has been to increase this value. So far, the JIPP looks stable again.

(In reply to Dani Megert from comment #12)
> Plus this: We are in 2019, Cloud era. Every system I know has watchdogs that
> monitor the system and either restart automatically or inform the admins.
> That we have to open bugs for this and wait for a resolution sounds
> ridiculous to me.

You're preaching to the choir. This is the second main reason why we're migrating to the new clustered infra (the main one being scalability). We already have a small set of better tools / watchdogs on this new infra than we ever had on the old one (e.g., self restart of instances when they become unavailable/unresponsive). We also plan to improve the tooling/monitoring once the migration (of all our 250 JIPPs) is complete.
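(To make "self restart of instances when they become unavailable/unresponsive" concrete, here is a toy watchdog sketch. The URL and restart command are made-up placeholders, and the real infra does this at the cluster level rather than with a loop like this:)

    // Toy watchdog sketch, NOT the actual JIRO mechanism: poll the Jenkins
    // root URL and trigger a restart if it stops answering.
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class JippWatchdog {
        public static void main(String[] args) throws Exception {
            URL url = new URL("https://ci.example.org/releng/login"); // placeholder URL
            while (true) {
                try {
                    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
                    conn.setConnectTimeout(10_000);
                    conn.setReadTimeout(10_000);
                    if (conn.getResponseCode() >= 500) {
                        restart();
                    }
                } catch (Exception e) {
                    restart(); // unreachable or unresponsive -> restart
                }
                Thread.sleep(60_000); // check once a minute
            }
        }

        private static void restart() throws Exception {
            // Placeholder command for illustration; on the real infra the
            // cluster restarts the instance, not a shell command.
            new ProcessBuilder("systemctl", "restart", "jenkins").inheritIO().start();
        }
    }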
(In reply to Mikaël Barbero from comment #15)
> You're preaching to the choir.

;-) Thanks for the detailed reply!
(In reply to Dani Megert from comment #13)
> (In reply to Dani Megert from comment #12)
> > Plus this: We are in 2019, Cloud era. Every system I know has watchdogs that
> > monitor the system and either restart automatically or inform the admins.
> > That we have to open bugs for this and wait for a resolution sounds
> > ridiculous to me.
> Mikaël, IIRC you said that with JIRO we will get some self healing, right?
> That would be great!

Yes, it does!

(In reply to Dani Megert from comment #14)
> (In reply to Dani Megert from comment #11)
> > What's the reason for these recent issues?
> Regarding comment 8: Could a job be using more memory than before?

The OOM happened only in the agent connections and did not crash the full JIPP, so I don't think a build job (which is a JVM subprocess) could affect that. Also, the machine itself had a ton of free (RSS) memory left, so that is excluded. I tend to think that this could be a GC issue where all agents try to connect simultaneously while multiple jobs are being started at the same time. With the restricted Xmx that was used, this could very likely OOM the agent connections without crashing the whole system.

FYI, I've been experimenting for quite some time with https://wiki.jenkins.io/display/JENKINS/Monitoring and we will probably make it a default on the infra.
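(On the heap-pressure theory: a generic way to watch heap headroom from inside a JVM looks roughly like the snippet below. It is a standalone illustration in the same spirit as the charts the Monitoring plugin provides, not the plugin itself:)

    // Standalone illustration of sampling heap headroom, similar in spirit
    // to what the Monitoring plugin charts; not the plugin itself.
    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryUsage;

    public class HeapWatch {
        public static void main(String[] args) throws InterruptedException {
            while (true) {
                MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
                long usedMiB = heap.getUsed() / (1024 * 1024);
                long maxMiB = heap.getMax() / (1024 * 1024);
                System.out.printf("heap: %d / %d MiB%n", usedMiB, maxMiB);
                Thread.sleep(10_000); // sample every 10 seconds
            }
        }
    }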