Bug 549714 - HIPP6 is down
Summary: HIPP6 is down
Status: CLOSED FIXED
Alias: None
Product: Community
Classification: Eclipse Foundation
Component: CI-Jenkins
Version: unspecified
Hardware: PC Mac OS X
Importance: P3 major
Target Milestone: ---
Assignee: CI Admin Inbox
QA Contact:
URL:
Whiteboard:
Keywords:
Duplicates: 549740
Depends on:
Blocks: 548564
Reported: 2019-08-01 07:49 EDT by Daryl Maier
Modified: 2019-08-08 02:45 EDT
CC: 10 users

See Also:



Description Daryl Maier 2019-08-01 07:49:49 EDT
We have not been able to access ci.eclipse.org/openj9 since last evening (no connection at all).  Perhaps HIPP6 is down?  Could someone investigate please?  Thanks.
Comment 1 Mikaël Barbero 2019-08-01 08:03:35 EDT
Indeed, hipp6 is down. I'm restarting it.
Comment 2 Mikaël Barbero 2019-08-01 08:23:56 EDT
hipp6 has been restarted and all JIPPs are running/starting.
Comment 3 Mikaël Barbero 2019-08-01 08:24:07 EDT
Thanks for the heads up btw.
Comment 4 Daryl Maier 2019-08-01 09:08:13 EDT
This was working for a short period after you restarted it, but it seems to be down again. Could you have another look, please? Do you need a separate bug report, or shall I just re-open this one?
Comment 5 Daryl Maier 2019-08-01 09:16:15 EDT
ci.eclipse.org/openj9 seems to be down/unresponsive again.  It was working for a short period of time after HIPP6 was restarted about an hour ago.
Comment 6 Mikaël Barbero 2019-08-01 09:21:27 EDT
hipp6 is restarting.
Comment 7 Keith W. Campbell 2019-08-01 12:36:00 EDT
It seems to be down yet again. Perhaps it needs more attention than just a reboot?
Comment 8 Mikaël Barbero 2019-08-01 13:00:21 EDT
Yes, I'm investigating.
Comment 9 Mikaël Barbero 2019-08-01 13:44:00 EDT
I've rebooted the machine without starting the JIPPs. I'm still investigating, but the first thing that strikes me is the kernel's RAID sync process, which is lagging. I'll wait for it to complete before starting any JIPP; it should finish in about 2 hours (says /proc/mdstat).

In the meantime I'm continuing my investigation.
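For reference, a minimal sketch (not part of the original comment) of how that sync progress can be read programmatically; the progress line format is assumed from typical mdstat output:

import re

# Typical mdstat progress line:
#   [=>...]  resync = 8.5% (83120704/976630336) finish=123.4min speed=120642K/sec
with open("/proc/mdstat") as f:
    for line in f:
        m = re.search(r"(resync|recovery)\s*=\s*([\d.]+)%.*finish=([\d.]+)min", line)
        if m:
            action, pct, minutes = m.groups()
            print(f"{action}: {pct}% done, about {float(minutes):.0f} min left")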
Comment 10 Adam Brousseau 2019-08-01 13:46:57 EDT
Thanks for the update, Mikaël.
Comment 11 Mikaël Barbero 2019-08-01 16:02:38 EDT
RAID sync is complete and I've restarted the JIPPs. Let's see how the machine behaves now.
Comment 12 Mikaël Barbero 2019-08-02 02:33:55 EDT
HIPP6 was again unreachable this morning. I've rebooted it and I'm restarting some JIPPs selectively while monitoring the machine.

I'll move openj9 to another machine, as it is, AFAICT, the biggest one on this machine. It will take a couple of hours to move the workspaces over the network. I'll keep you posted here. Thanks for your patience.
Comment 13 Mikaël Barbero 2019-08-02 03:40:16 EDT
btw, the most probable cause of this outage is a highly fragmented BTRFS, which causes a *lot* of IO, which in turn makes the whole system barely responsive. I could de-fragment the FS, but that may backfire by considerably increasing space usage, depending on how many reflinks get broken up (see https://btrfs.wiki.kernel.org/index.php/Manpage/btrfs-filesystem for more details).
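As a rough illustration (assumed tooling, not part of the original comment), per-file fragmentation can be estimated with filefrag(8), which reports the number of extents per file; the scanned path below is a placeholder, not the actual HIPP6 layout:

import subprocess
from pathlib import Path

root = Path("/var/jenkins")  # placeholder path
counts = []
for p in root.rglob("*"):
    if p.is_file():
        out = subprocess.run(["filefrag", str(p)],
                             capture_output=True, text=True).stdout
        # filefrag prints e.g. "/path/file: 57 extents found"
        if "extents found" in out:
            counts.append((int(out.rsplit(":", 1)[1].split()[0]), str(p)))

# The most fragmented files are the best defragmentation candidates, keeping
# in mind that defragmenting breaks reflinks and can grow space usage.
for extents, path in sorted(counts, reverse=True)[:20]:
    print(extents, path)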
Comment 14 Mikaël Barbero 2019-08-02 03:47:59 EDT
*** Bug 549740 has been marked as a duplicate of this bug. ***
Comment 15 Mikaël Barbero 2019-08-02 05:35:25 EDT
The OpenJ9 move is still ongoing. In the meantime, the following JIPPs have been restarted and look healthy:

datatools
ditto
ecoretools
emf
emfstore
jdt
oomph
xpect
Comment 16 Alexander Kurtakov 2019-08-02 07:27:11 EDT
If HIPP6 is giving so much trouble, wouldn't the time be better spent migrating its JIPPs to JIRO?
Comment 17 Mikaël Barbero 2019-08-02 07:30:52 EDT
Migration to JIRO requires involvement from the projects' teams. It's not something we can ask for without prior notice.

Note, however, that we have already made switching JIPPs from HIPP6 to JIRO a top priority. The first one will be openj9, as it is our prime suspect for bringing flakiness to HIPP machines (it used to run on hipp5, where we saw a lot of flakiness at the time; it has been on hipp6 for a year and history is repeating itself).
Comment 18 Mikaël Barbero 2019-08-02 08:55:05 EDT
All JIPPs except openj9 have been restarted.
Comment 19 Mikaël Barbero 2019-08-02 11:19:18 EDT
I've identified at least one more potential culprit in the lack of responsiveness (see bug 549754).
Comment 20 Mikaël Barbero 2019-08-02 12:09:25 EDT
Another potential source of disk churn is the number of builds being kept (some jobs have more than 1000 builds; the highest is mpc's epp-mpc-ci job, with 4537 builds kept). Please check that all jobs are properly configured to keep only a reasonable number of builds. See the link below for more details.

https://support.cloudbees.com/hc/en-us/articles/115000237071-How-do-I-set-discard-old-builds-for-a-Multi-Branch-Pipeline-Job-

Please see below all the jobs having more than 100 builds stored on disk (format is "project job buildcount"). I've already removed all but the 50 most recent builds on disk for most of the projects/jobs below (a sketch of how such a listing can be produced follows the table). You will still have to configure your jobs to make sure this number does not grow again. Thanks.

ditto ditto-ci 390
geogig boundless-trigger-master 221
geogig geogig-master 275
geogig geogig-master-deploy 240
geomesa GeoMesaRelease 120
leshan leshan 523
mpc epp-mpc-ci 4537
mpc epp-mpc-maintenance 1064
mpc epp-mpc-release 134
mpc epp-mpc-rest-tests 1666
oomph integration 111
oomph integration-nightly 107
oomph uss-integration 107
oomph uss-integration-nightly 107
openj9 adam_pipeline 300
openj9 Check_Artifactory_Disk_Usage 158
openj9 Cleanup_Artifactory 207
openj9 Pipeline_Build_Test_JDK11_ppc64_aix 111
openj9 Pipeline_Build_Test_JDK11_ppc64le_linux 110
openj9 Pipeline_Build_Test_JDK11_s390x_linux 112
openj9 Pipeline_Build_Test_JDK11_x86-64_linux 113
openj9 Pipeline_Build_Test_JDK11_x86-64_linux_cm 110
openj9 Pipeline_Build_Test_JDK11_x86-64_linux_xl 111
openj9 Pipeline_Build_Test_JDK11_x86-64_mac 110
openj9 Pipeline_Build_Test_JDK11_x86-64_windows 111
openj9 Pipeline_Build_Test_JDK12_ppc64_aix 107
openj9 Pipeline_Build_Test_JDK12_ppc64le_linux 107
openj9 Pipeline_Build_Test_JDK12_s390x_linux 107
openj9 Pipeline_Build_Test_JDK12_x86-64_linux 107
openj9 Pipeline_Build_Test_JDK12_x86-64_linux_xl 107
openj9 Pipeline_Build_Test_JDK12_x86-64_mac 107
openj9 Pipeline_Build_Test_JDK12_x86-64_windows 106
openj9 Pipeline_Build_Test_JDK8_ppc64_aix 109
openj9 Pipeline_Build_Test_JDK8_ppc64le_linux 111
openj9 Pipeline_Build_Test_JDK8_s390x_linux 111
openj9 Pipeline_Build_Test_JDK8_x86-32_windows 111
openj9 Pipeline_Build_Test_JDK8_x86-64_linux 112
openj9 Pipeline_Build_Test_JDK8_x86-64_linux_cm 110
openj9 Pipeline_Build_Test_JDK8_x86-64_linux_xl 112
openj9 Pipeline_Build_Test_JDK8_x86-64_mac 111
openj9 Pipeline_Build_Test_JDK8_x86-64_windows 113
openj9 Pipeline_Build_Test_JDKnext_ppc64le_linux 106
openj9 Pipeline_Build_Test_JDKnext_s390x_linux 106
openj9 Pipeline_Build_Test_JDKnext_x86-64_linux 106
openj9 Pipeline_Build_Test_JDKnext_x86-64_linux_xl 106
openj9 Pipeline_Build_Test_JDKnext_x86-64_mac 106
openj9 Pipeline_Build_Test_JDKnext_x86-64_windows 106
openj9 PullRequest-OpenJ9 108
scout org.eclipse.scout_maven-master_snapshotBuild 634
scout org.eclipse.scout.rt_deploy_from_tag 120
scout publish_staged_builds 1071
virgo recipe-accessing-data-mongodb 380
virgo recipe-custom-virgo 411
virgo recipe-messaging-with-rabbitmq 368
virgo recipe-rest-service.snapshot 435
virgo recipe-serving-web-content 377
virgo recipe-template 409
virgo recipe-uploading-files.snapshot 771
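
As an aside, a minimal sketch of how a listing like the one above can be produced from a Jenkins home directory; the root path is a placeholder, and the jobs/<name>/builds layout is standard Jenkins on-disk structure:

from pathlib import Path

jenkins_home = Path("/var/jenkins")  # placeholder path
for job in sorted((jenkins_home / "jobs").iterdir()):
    builds = job / "builds"
    if builds.is_dir():
        # Numeric directory names are individual builds; entries such as
        # the lastSuccessfulBuild symlink are skipped.
        count = sum(1 for b in builds.iterdir() if b.name.isdigit())
        if count > 100:
            print(job.name, count)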
Comment 21 Mikaël Barbero 2019-08-02 12:31:33 EDT
I've restarted the openj9 JIPP (still on hipp6). Let's see how it behaves with the cleanup I did today.
Comment 22 Mikaël Barbero 2019-08-02 12:53:44 EDT
(note that I've increased the openj9 JVM Xmx to 8GB, as I suspect it was also using a lot of CPU because of GC)
Comment 23 Mikaël Barbero 2019-08-02 14:23:36 EDT
I just had to restart the machine again because it became unresponsive. I did not start openj9, but all other JIPPs are starting.
Comment 24 Mikaël Barbero 2019-08-02 17:44:32 EDT
Migration of openj9 to hipp5 is over. It is up and running over there from now on. 

hipp6 looks very healthy without openj9, so I declare it fixed.

Nevertheless, I'll be monitoring both hipp5 and hipp6 over the next couple of days to see if some patterns emerge. Thanks again for your patience.
Comment 25 Mikaël Barbero 2019-08-03 01:59:46 EDT
Everything has been running smoothly for 8 hours. Closing.
Comment 26 Arthur van Dorp 2019-08-04 16:06:52 EDT
"Please see below all the jobs having more than 100 builds stored on disk (format is "project job buildcount"). I've already removed all but the 50 most recent builds on disk for most of the projects/jobs below. You will still have to configure your jobs to make sure this number does not grow again. Thanks.
[...]
scout org.eclipse.scout_maven-master_snapshotBuild 634
scout org.eclipse.scout.rt_deploy_from_tag 120
scout publish_staged_builds 1071
[...]"

Could you check those job runs on disk again? The jobs are configured to keep 10 or fewer builds. Jenkins probably kept some old builds because of a permissions or other problem; the Jenkins UI only shows the configured number of builds. Thanks.
Comment 27 Mikaël Barbero 2019-08-05 03:29:24 EDT
(In reply to Arthur van Dorp from comment #26)
> "Please see below all the jobs having more than 100 builds store on disk
> (format is "project job buildscount"). I've already removed all but the 50
> more recent builds on disk for most of the projects/jobs below. You will
> still have to configure your jobs to make sure this number does not grow
> again. Thanks.
> [...]
> scout org.eclipse.scout_maven-master_snapshotBuild 634
> scout org.eclipse.scout.rt_deploy_from_tag 120
> scout publish_staged_builds 1071
> [...]"
> 
> Could you check those job runs on the disk again? The jobs have 10 or less
> builds configured to be kept. Probably Jenkins kept some old jobs because of
> some permission or other problem. The Jenkins UI only shows the configured
> amount of builds. Thanks.

I actually made the change myself for a couple of jobs, including yours. Sorry for the misunderstanding.
Comment 28 Jay Arthanareeswaran 2019-08-07 23:31:19 EDT
Looks like another outage. I am unable to access jdt core gerrit jobs.
Comment 29 Mikaël Barbero 2019-08-08 02:25:32 EDT
hipp6 is restarting.
Comment 30 Mikaël Barbero 2019-08-08 02:45:05 EDT
All JIPP instances on hipp6 are up and running.