Bug 550421 - All the slaves for Releng JIPP are down
Summary: All the slaves for Releng JIPP are down
Status: CLOSED FIXED
Alias: None
Product: Community
Classification: Eclipse Foundation
Component: Servers
Version: unspecified
Hardware: All
OS: All
Importance: P1 blocker
Target Milestone: ---
Assignee: Eclipse Webmaster CLA
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: 548564
Reported: 2019-08-25 09:44 EDT by Sarika Sinha CLA
Modified: 2019-08-27 03:55 EDT
CC List: 9 users

See Also:


Attachments

Description Sarika Sinha CLA 2019-08-25 09:44:22 EDT
All the slaves for Releng JIPP are down.

The 4.13 build on the 24th did not run due to infrastructure issues.
Comment 1 Mikaël Barbero CLA 2019-08-26 03:25:29 EDT
I'm investigating.
Comment 2 Mikaël Barbero CLA 2019-08-26 03:31:53 EDT
I've restarted the JIPP, agents are coming back online. 

Did you try to restart your instance via committer toolbox during the weekend?
Comment 3 Sravan Kumar Lakkimsetti CLA 2019-08-26 04:03:35 EDT
(In reply to Mikaël Barbero from comment #2)
> I've restarted the JIPP, agents are coming back online. 
> 
> Did you try to restart your instance via committer toolbox during the
> weekend?

No, we haven't done anything from our side.

The Windows test machine is still down. Can you take a look?
Comment 4 Dani Megert CLA 2019-08-26 05:14:37 EDT
(In reply to Mikaël Barbero from comment #2)
> Did you try to restart your instance via committer toolbox during the
> weekend?
How? AFAIK this is no longer possible since the Releng project was moved into the Platform project.
Comment 5 Mikaël Barbero CLA 2019-08-26 05:29:51 EDT
You're right, Dani. I forgot about the move from Releng to Platform.

The Windows test agent is back online.
Comment 6 Sravan Kumar Lakkimsetti CLA 2019-08-26 11:56:28 EDT
All slaves are down again. This is a blocker for us.
Comment 7 Dani Megert CLA 2019-08-26 12:21:38 EDT
Also sent an e-mail to webmaster.
Comment 8 Frederic Gurr CLA 2019-08-26 13:06:48 EDT
We noticed OutOfMemoryErrors. Memory has been increased. Slaves are up again.
Comment 9 Mikaël Barbero CLA 2019-08-26 14:40:09 EDT
For some reason, the Windows test machine was not able to reconnect either (while still reporting as connected from the agent side).

I've restarted the instance and reconnected the Windows agent. I'm monitoring the instance.
Comment 10 Mikaël Barbero CLA 2019-08-26 14:46:02 EDT
All agents are back online.
Comment 11 Dani Megert CLA 2019-08-26 16:37:14 EDT
What's the reason for these recent issues? We're in the RC* phase and this is not acceptable.
Comment 12 Dani Megert CLA 2019-08-26 16:47:44 EDT
Plus this: We are in 2019, Cloud era. Every system I know has watchdogs that monitor the system and either restart automatically or inform the admins. That we have to open bugs for this and wait for a resolution sounds ridiculous to me.
Comment 13 Dani Megert CLA 2019-08-27 03:23:38 EDT
(In reply to Dani Megert from comment #12)
> Plus this: We are in 2019, Cloud era. Every system I know has watchdogs that
> monitor the system and either restart automatically or inform the admins.
> That we have to open bugs for this and wait for a resolution sounds
> ridiculous to me.
Mikaël, IIRC you said that with JIRO we will get some self healing, right? That would be great!
Comment 14 Dani Megert CLA 2019-08-27 03:34:49 EDT
(In reply to Dani Megert from comment #11)
> What's the reason for these recent issues?
Regarding comment 8: Is a job maybe using more memory than before?
Comment 15 Mikaël Barbero CLA 2019-08-27 03:36:11 EDT
(In reply to Dani Megert from comment #11)
> What's the reason for these recent issues? We're in the RC* phase and this is
> not acceptable.

We understand that outages like this are frustrating. Rest assured that we fully understand that any outage happening late in a release schedule is highly stressful, and we prioritize those issues accordingly.

Our current reasoning is that your JIPP faced consecutive OOMEs because it has reached a threshold where the default heap we assigned to it (the same value as for 99% of the JIPPs on the old infra) is no longer high enough. The Xmx setting has not changed for at least a couple of years and still works well for 99% of the JIPPs (Releng's included, until recently). Our mitigation has been to increase this value. So far, the JIPP looks stable again.
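For illustration only (the values and startup command here are assumptions, not the Foundation's actual configuration): the heap ceiling in question is the standard JVM -Xmx option passed when the Jenkins controller is started (e.g. java -Xmx2g -jar jenkins.war), and the effective ceiling can be verified from inside the running JVM with a minimal Java sketch:

// Minimal sketch: check the effective heap limits of a running JVM.
// The printed "max" corresponds roughly to the -Xmx value.
public class HeapCheck {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long max = rt.maxMemory();     // upper bound the heap may grow to (~ -Xmx)
        long total = rt.totalMemory(); // heap currently committed by the JVM
        long free = rt.freeMemory();   // unused portion of the committed heap
        System.out.printf("max=%d MiB, committed=%d MiB, free=%d MiB%n",
                max >> 20, total >> 20, free >> 20);
    }
}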


(In reply to Dani Megert from comment #12)
> Plus this: We are in 2019, Cloud era. Every system I know has watchdogs that
> monitor the system and either restart automatically or inform the admins.
> That we have to open bugs for this and wait for a resolution sounds
> ridiculous to me.

You're preaching to the choir. This is the second main reason why we're migrating to the new clustered infra (the main one being scalability). We already have a small set of better tools and watchdogs on this new infra than we ever had on the old one (e.g., self-restart of instances when they become unavailable or unresponsive). We also plan to improve the tooling and monitoring once the migration (of all of our 250 JIPPs) is complete.
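To make the watchdog idea concrete, here is a minimal stand-alone sketch (the URL and the restart hook are hypothetical; this is not the actual Jiro tooling): poll an instance and trigger whatever restart mechanism the infrastructure provides when it stops answering.

// Minimal watchdog sketch; the instance URL and restart hook are hypothetical.
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class JippWatchdog {
    public static void main(String[] args) throws Exception {
        while (true) {
            if (!isAlive("https://ci.example.org/releng/login")) { // hypothetical instance URL
                System.err.println("Instance unresponsive, requesting restart");
                // A real watchdog would call the orchestrator's restart API here.
            }
            Thread.sleep(60_000); // check once per minute
        }
    }

    static boolean isAlive(String url) {
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setConnectTimeout(5_000);
            conn.setReadTimeout(5_000);
            int code = conn.getResponseCode();
            conn.disconnect();
            return code >= 200 && code < 400;
        } catch (IOException e) {
            return false;
        }
    }
}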
Comment 16 Dani Megert CLA 2019-08-27 03:45:30 EDT Comment hidden (obsolete)
Comment 17 Dani Megert CLA 2019-08-27 03:48:10 EDT
(In reply to Mikaël Barbero from comment #15)

You're preaching to the choir. ;-)

Thanks for the detailed reply!
Comment 18 Mikaël Barbero CLA 2019-08-27 03:55:22 EDT
(In reply to Dani Megert from comment #13)
> (In reply to Dani Megert from comment #12)
> > Plus this: We are in 2019, Cloud era. Every system I know has watchdogs that
> > monitor the system and either restart automatically or inform the admins.
> > That we have to open bugs for this and wait for a resolution sounds
> > ridiculous to me.
> Mikaël, IIRC you said that with JIRO we will get some self healing, right?
> That would be great!

Yes, it does!


(In reply to Dani Megert from comment #14)
> (In reply to Dani Megert from comment #11)
> > What's the reason for these recent issues?
> Regarding comment 8: Is a job maybe using more memory than before?

The OOM happened only in the agent connections and did not crash the full JIPP, so I don't think a build job (which is a JVM subprocess) could affect that.

Also, the machine itself had plenty of free (RSS) memory left, so that can be excluded.

I tend to think that this could be a GC issue where all agents try to connect simultaneously and multiple jobs are started at the same time. With the restricted Xmx that was used, this could very likely cause the agent connections to OOM without crashing the whole system.

FYI, I've been experimenting for quite some time with https://wiki.jenkins.io/display/JENKINS/Monitoring and we will probably make it a default on the infra.
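As a rough illustration of the kind of data such monitoring exposes (a generic sketch using the standard java.lang.management API, not the plugin's own code), one can snapshot heap usage and GC activity like this:

// Sketch: snapshot heap usage and garbage-collector statistics of the current JVM.
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

public class GcSnapshot {
    public static void main(String[] args) {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        System.out.printf("heap used=%d MiB, max=%d MiB%n",
                heap.getUsed() >> 20, heap.getMax() >> 20);
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: collections=%d, time=%d ms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}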