We have not been able to access ci.eclipse.org/openj9 since last evening (no connection at all). Perhaps HIPP6 is down? Could someone investigate please? Thanks.
Indeed, hipp6 is down. I'm restarting it.
hipp6 has been restarted and all JIPPs are running/starting.
Thanks for the heads up btw.
This was working for a short period after you restarted it, but it seems to be down again? Can you have a look again, please? Do you need a separate bug report or shall I just re-open?
ci.eclipse.org/openj9 seems to be down/unresponsive again. It was working for a short period of time after HIPP6 was restarted about an hour ago.
hipp6 is restarting.
It seems to be down yet again. Perhaps it needs more attention than just a reboot?
Yes, I'm investigating.
I've rebooted the machine without starting the JIPPs. I'm investigating, but the first thing that strikes me is the kernel's RAID sync process, which is lagging. I'll wait for it to complete before starting any JIPP; it should complete in about 2 hours (says /proc/mdstat). In the meantime I'm continuing my investigation.
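For reference, the sync progress can be watched from a shell with something like this:

    # Show the state of all md RAID arrays, including resync/recovery progress
    cat /proc/mdstat

    # Re-check every 60 seconds until the sync completes
    watch -n 60 cat /proc/mdstat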
Thanks for the update, Mikaël.
RAID sync is complete, I've restarted the JIPPs. Let's see how the machine behaves now.
HIPP6 was again unreachable this morning. I've rebooted it and I'm restarting some JIPPs selectively while monitoring the machine. I'll move openj9 to another machine as it is, AFAICT, the biggest one on this machine. It will take a couple of hours to move the workspaces over the network. I'll keep you posted here. Thanks for your patience.
BTW, the most probable cause of this outage is a highly fragmented BTRFS, which causes a *lot* of IO, which in turn makes the whole system barely responsive. I could de-fragment the FS, but it may backfire by causing a considerable increase in space usage depending on the broken-up reflinks (see https://btrfs.wiki.kernel.org/index.php/Manpage/btrfs-filesystem for more details).
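For the record, a defragmentation pass would look something like the sketch below (the mount point is just a placeholder); as said above, it's not something I'll run blindly:

    # Recursively defragment files under the (hypothetical) JIPP data mount,
    # printing each file as it is processed
    btrfs filesystem defragment -r -v /opt/jipps

    # Check space usage afterwards; defragmentation can break reflinks and
    # duplicate previously shared extents
    btrfs filesystem df /opt/jipps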
*** Bug 549740 has been marked as a duplicate of this bug. ***
The OpenJ9 move is still underway. In the meantime the following JIPPs have been restarted and look healthy: datatools, ditto, ecoretools, emf, emfstore, jdt, oomph, xpect.
If HIPP6 is giving so much trouble, wouldn't time be better spent migrating the JIPPs on it to JIRO?
Migration to JIRO requires involvement from the projects' teams; it's not something that we can ask for without prior notice. Note, however, that we have already made switching JIPPs from HIPP6 to JIRO our highest priority. The first one will be openj9, as it is our prime suspect for bringing flakiness to hipp machines (it used to run on hipp5, where we saw a lot of flakiness at the time; it has been on hipp6 for a year and history is repeating itself).
All JIPPs except openj9 have been restarted.
I've identified at least one more potential culprit for the lack of responsiveness (see bug 549754).
Another potential source of disk churn is the number of builds being kept: some jobs have more than 1000 builds, the highest being mpc's epp-mpc-ci job with a total of 4537 builds kept. Please check that all jobs are properly configured to keep only a reasonable number of builds. See this link for more details: https://support.cloudbees.com/hc/en-us/articles/115000237071-How-do-I-set-discard-old-builds-for-a-Multi-Branch-Pipeline-Job-

Please see below all the jobs having more than 100 builds stored on disk (format is "project job buildcount"). I've already removed all but the 50 most recent builds on disk for most of the projects/jobs below. You will still have to configure your jobs to make sure this number does not grow again. Thanks. (A sketch of how such counts can be gathered follows the list.)

ditto ditto-ci 390
geogig boundless-trigger-master 221
geogig geogig-master 275
geogig geogig-master-deploy 240
geomesa GeoMesaRelease 120
leshan leshan 523
mpc epp-mpc-ci 4537
mpc epp-mpc-maintenance 1064
mpc epp-mpc-release 134
mpc epp-mpc-rest-tests 1666
oomph integration 111
oomph integration-nightly 107
oomph uss-integration 107
oomph uss-integration-nightly 107
openj9 adam_pipeline 300
openj9 Check_Artifactory_Disk_Usage 158
openj9 Cleanup_Artifactory 207
openj9 Pipeline_Build_Test_JDK11_ppc64_aix 111
openj9 Pipeline_Build_Test_JDK11_ppc64le_linux 110
openj9 Pipeline_Build_Test_JDK11_s390x_linux 112
openj9 Pipeline_Build_Test_JDK11_x86-64_linux 113
openj9 Pipeline_Build_Test_JDK11_x86-64_linux_cm 110
openj9 Pipeline_Build_Test_JDK11_x86-64_linux_xl 111
openj9 Pipeline_Build_Test_JDK11_x86-64_mac 110
openj9 Pipeline_Build_Test_JDK11_x86-64_windows 111
openj9 Pipeline_Build_Test_JDK12_ppc64_aix 107
openj9 Pipeline_Build_Test_JDK12_ppc64le_linux 107
openj9 Pipeline_Build_Test_JDK12_s390x_linux 107
openj9 Pipeline_Build_Test_JDK12_x86-64_linux 107
openj9 Pipeline_Build_Test_JDK12_x86-64_linux_xl 107
openj9 Pipeline_Build_Test_JDK12_x86-64_mac 107
openj9 Pipeline_Build_Test_JDK12_x86-64_windows 106
openj9 Pipeline_Build_Test_JDK8_ppc64_aix 109
openj9 Pipeline_Build_Test_JDK8_ppc64le_linux 111
openj9 Pipeline_Build_Test_JDK8_s390x_linux 111
openj9 Pipeline_Build_Test_JDK8_x86-32_windows 111
openj9 Pipeline_Build_Test_JDK8_x86-64_linux 112
openj9 Pipeline_Build_Test_JDK8_x86-64_linux_cm 110
openj9 Pipeline_Build_Test_JDK8_x86-64_linux_xl 112
openj9 Pipeline_Build_Test_JDK8_x86-64_mac 111
openj9 Pipeline_Build_Test_JDK8_x86-64_windows 113
openj9 Pipeline_Build_Test_JDKnext_ppc64le_linux 106
openj9 Pipeline_Build_Test_JDKnext_s390x_linux 106
openj9 Pipeline_Build_Test_JDKnext_x86-64_linux 106
openj9 Pipeline_Build_Test_JDKnext_x86-64_linux_xl 106
openj9 Pipeline_Build_Test_JDKnext_x86-64_mac 106
openj9 Pipeline_Build_Test_JDKnext_x86-64_windows 106
openj9 PullRequest-OpenJ9 108
scout org.eclipse.scout_maven-master_snapshotBuild 634
scout org.eclipse.scout.rt_deploy_from_tag 120
scout publish_staged_builds 1071
virgo recipe-accessing-data-mongodb 380
virgo recipe-custom-virgo 411
virgo recipe-messaging-with-rabbitmq 368
virgo recipe-rest-service.snapshot 435
virgo recipe-serving-web-content 377
virgo recipe-template 409
virgo recipe-uploading-files.snapshot 771
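For the curious, counts like these can be gathered with something along the following lines. This is only a sketch assuming a stock Jenkins layout (JENKINS_HOME below is an assumed path; the actual layout on the HIPP machines differs):

    # Report every job keeping more than 100 numbered build directories on disk
    JENKINS_HOME=/var/lib/jenkins
    for job in "$JENKINS_HOME"/jobs/*/; do
      count=$(find "$job/builds" -maxdepth 1 -type d -regex '.*/[0-9]+' 2>/dev/null | wc -l)
      if [ "$count" -gt 100 ]; then
        echo "$(basename "$job") $count"
      fi
    done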
I've restarted the openj9 JIPP (still on hipp6). Let's see how it behaves with the cleanup I did today.
(Note that I've increased the openj9 JVM's Xmx to 8 GB, as I suspect it was also using a lot of CPU because of GC.)
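(For illustration, with a plain jenkins.war launcher the change amounts to something like the fragment below; the actual service wrapper on the HIPP machines is different:)

    # Hypothetical launcher fragment: raise the Jenkins master heap to 8 GB
    JAVA_OPTS="-Xmx8g"
    java $JAVA_OPTS -jar jenkins.war --httpPort=8080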
I just had to restart the machine again because it became unresponsive. I did not start openj9, but all other JIPPs are starting.
Migration of openj9 to hipp5 is complete; it is up and running there from now on. hipp6 looks very healthy without openj9, so I declare this fixed. Nevertheless, I'll be monitoring both hipp5 and hipp6 over the next couple of days to see if any patterns emerge. Thanks again for your patience.
Everything has been running smoothly for 8 hours. Closing.
"Please see below all the jobs having more than 100 builds store on disk (format is "project job buildscount"). I've already removed all but the 50 more recent builds on disk for most of the projects/jobs below. You will still have to configure your jobs to make sure this number does not grow again. Thanks. [...] scout org.eclipse.scout_maven-master_snapshotBuild 634 scout org.eclipse.scout.rt_deploy_from_tag 120 scout publish_staged_builds 1071 [...]" Could you check those job runs on the disk again? The jobs have 10 or less builds configured to be kept. Probably Jenkins kept some old jobs because of some permission or other problem. The Jenkins UI only shows the configured amount of builds. Thanks.
(In reply to Arthur van Dorp from comment #26)
> "Please see below all the jobs having more than 100 builds stored on disk
> (format is "project job buildcount"). I've already removed all but the 50
> most recent builds on disk for most of the projects/jobs below. You will
> still have to configure your jobs to make sure this number does not grow
> again. Thanks.
> [...]
> scout org.eclipse.scout_maven-master_snapshotBuild 634
> scout org.eclipse.scout.rt_deploy_from_tag 120
> scout publish_staged_builds 1071
> [...]"
>
> Could you check those jobs' builds on disk again? The jobs are configured
> to keep 10 or fewer builds. Jenkins probably kept some old builds because
> of a permission or other problem; the Jenkins UI only shows the configured
> number of builds. Thanks.

I actually made the change myself for a couple of jobs, including yours. Sorry for the misunderstanding.
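(If you want to cross-check a job's configured retention against what's actually on disk, something along these lines works; the paths again assume a stock Jenkins layout rather than the real HIPP one:)

    # Configured retention: when "discard old builds" is set, the job's
    # config.xml carries a logRotator section with numToKeep
    grep -A2 'logRotator' "$JENKINS_HOME/jobs/publish_staged_builds/config.xml"

    # Builds actually present on disk for the same job
    ls -d "$JENKINS_HOME"/jobs/publish_staged_builds/builds/[0-9]* | wc -l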
Looks like another outage. I am unable to access the jdt core Gerrit jobs.
All JIPP instances on hipp6 are up and running.