
Bug 563645

Summary: Some jobs on the new infra have inconsistent build times (sometimes *really* slow)
Product: Community
Component: CI-Jenkins
Status: CLOSED FIXED
Severity: major
Priority: P3
Version: unspecified
Target Milestone: ---
Hardware: PC
OS: Windows 10
Reporter: Sravan Kumar Lakkimsetti <sravankumarl>
Assignee: CI Admin Inbox <ci.admin-inbox>
QA Contact:
CC: agarcdomi, christian.dietrich.opensource, daniel_megert, denis.roy, frederic.gurr, jonah, kalyan_prasad, mikael.barbero, sarika.sinha, twolf, webmaster
See Also: https://bugs.eclipse.org/bugs/show_bug.cgi?id=563221
https://bugs.eclipse.org/bugs/show_bug.cgi?id=563816
https://bugs.eclipse.org/bugs/show_bug.cgi?id=563785
Whiteboard:
Bug Depends on: 563223
Bug Blocks: 564012

Description Sravan Kumar Lakkimsetti CLA 2020-05-27 12:15:47 EDT
Our I-build job normally takes about 2 hours and 15 minutes on the old infrastructure. On the new infrastructure the build time varies between 2 and 5 hours.

Today it took almost 5 hours. Could you please investigate the reason for this kind of variation?
Comment 1 Frederic Gurr CLA 2020-05-27 13:24:57 EDT
Pinpointing a single reason for such variations is not an easy task, given that the I-build is a complex, long-running build in a complex build environment.

To get a better sense of average build times and outliers, I have increased the "number of builds to keep" to 25 (instead of 5).

I would recommend trying to increase the CPU limits and requests to 4000m and seeing how that affects the build time.

Part of the problem might also be lower I/O performance, but I'd start by adjusting the CPU limits.
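
For reference, a minimal sketch of how such per-container CPU requests and limits can be set with the Jenkins Kubernetes plugin's podTemplate step. The container name, image, memory values, and build command below are placeholders, not the actual Releng configuration:

// Sketch only: placeholder container name, image, memory values and build step.
podTemplate(containers: [
    containerTemplate(
        name: 'build',                   // placeholder container name
        image: 'maven:3.6-jdk-8',        // placeholder image
        command: 'cat', ttyEnabled: true,
        resourceRequestCpu: '4000m',     // CPU request suggested above
        resourceLimitCpu: '4000m',       // CPU limit suggested above
        resourceRequestMemory: '4Gi',    // placeholder memory values
        resourceLimitMemory: '4Gi')
]) {
    node(POD_LABEL) {
        container('build') {
            sh 'mvn -B clean verify'     // placeholder build step
        }
    }
}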
Comment 2 Sravan Kumar Lakkimsetti CLA 2020-06-02 01:01:41 EDT
We had a timeout today with https://ci-staging.eclipse.org/releng/job/I-build/148/

Our build normally takes about 2 to 2.5 hours, so our timeout is set to 5 hours.

The above build hit the timeout.
Comment 3 Sravan Kumar Lakkimsetti CLA 2020-06-02 05:10:14 EDT
Hit the timeout again in https://ci-staging.eclipse.org/releng/job/I-build/149/
Comment 4 Sravan Kumar Lakkimsetti CLA 2020-06-02 05:19:03 EDT
Raising to blocker
Comment 5 Christian Dietrich CLA 2020-06-02 05:22:04 EDT
We at Xtext are seeing notable build time increases too.
Comment 6 Frederic Gurr CLA 2020-06-02 06:15:42 EDT
We are investigating.
Comment 7 Dani Megert CLA 2020-06-02 06:58:09 EDT
Latest build attempt failed again. For details see https://ci-staging.eclipse.org/releng/job/I-build/151/console

This looks like what made the build fail:

12:09:02  !ENTRY org.eclipse.equinox.p2.transport.ecf 2 0 2020-06-02 10:09:02.445
12:09:02  !MESSAGE Connection to https://download.eclipse.org/eclipse/updates/4.16-I-builds/I20200531-0600/plugins/org.eclipse.equinox.security.linux.x86_64_1.1.300.v20190830-1238.jar.pack.gz failed on download.eclipse.org:443 failed to respond. Retry attempt 0 started
12:09:02  !STACK 0
12:09:02  org.apache.http.NoHttpResponseException: download.eclipse.org:443 failed to respond
12:09:02  	at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:141)
12:09:02  	at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:56)
12:09:02  	at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:259)
12:09:02  	at org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:163)
12:09:02  	at org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:157)
12:09:02  	at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:273)
12:09:02  	at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125)
12:09:02  	at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:272)
12:09:02  	at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
12:09:02  	at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
12:09:02  	at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
12:09:02  	at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
12:09:02  	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
12:09:02  	at org.eclipse.ecf.provider.filetransfer.httpclient45.HttpClientRetrieveFileTransfer.performConnect(HttpClientRetrieveFileTransfer.java:1000)
12:09:02  	at org.eclipse.ecf.provider.filetransfer.httpclient45.HttpClientRetrieveFileTransfer.access$0(HttpClientRetrieveFileTransfer.java:992)
12:09:02  	at org.eclipse.ecf.provider.filetransfer.httpclient45.HttpClientRetrieveFileTransfer$1.performFileTransfer(HttpClientRetrieveFileTransfer.java:988)
12:09:02  	at org.eclipse.ecf.filetransfer.FileTransferJob.run(FileTransferJob.java:74)
12:09:02  	at org.eclipse.core.internal.jobs.Worker.run(Worker.java:63)
Comment 8 Sravan Kumar Lakkimsetti CLA 2020-06-02 07:28:15 EDT
Build 150 completed successfully in 2 hours and 4 minutes.

Here is the status of the last 5 builds:

151: Build failed with the stack trace mentioned in comment 7
150: Build succeeded, took 2 hours and 4 minutes
149: Build failed, hit the 5-hour timeout
148: Build failed, hit the 5-hour timeout
147: Build succeeded, took 4 hours and 11 minutes

In our build we have a dependent job. It was not scheduled in a timely manner in builds 147 and 148; see bug 563785.

The delay in scheduling that job was 1 hour 15 minutes for build 147 and 2 hours 4 minutes for build 148.

I hope this data is good enough.

At https://ci-staging.eclipse.org/releng/job/I-build/ you can see the stage view. I can explain, for each of the tasks, what kind of disk access, network access, etc. is involved. Please let me know if you need any details.
Comment 9 Sravan Kumar Lakkimsetti CLA 2020-06-02 07:41:54 EDT
We have had a couple of build failures (builds 143 and 151) with a timeout during the changeover from one stage to another. Here is the stack trace:

java.io.IOException: Timed out waiting for websocket connection. You should increase the value of system property org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator.websocketConnectionTimeout currently set at 30 seconds
	at org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator$1.doLaunch(ContainerExecDecorator.java:441)
	at org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator$1.launch(ContainerExecDecorator.java:336)
	at hudson.Launcher$ProcStarter.start(Launcher.java:454)
	at org.jenkinsci.plugins.durabletask.BourneShellScript.launchWithCookie(BourneShellScript.java:234)
	at org.jenkinsci.plugins.durabletask.FileMonitoringTask.launch(FileMonitoringTask.java:103)
	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.start(DurableTaskStep.java:317)
	at org.jenkinsci.plugins.workflow.cps.DSL.invokeStep(DSL.java:286)
	at org.jenkinsci.plugins.workflow.cps.DSL.invokeMethod(DSL.java:179)
	at org.jenkinsci.plugins.workflow.cps.CpsScript.invokeMethod(CpsScript.java:122)
	at sun.reflect.GeneratedMethodAccessor580.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
	at java.lang.reflect.Method.invoke(Unknown Source)
	at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:93)
	at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:325)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1213)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1022)
	at org.codehaus.groovy.runtime.callsite.PogoMetaClassSite.call(PogoMetaClassSite.java:42)
	at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:48)
	at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:113)
	at org.kohsuke.groovy.sandbox.impl.Checker$1.call(Checker.java:163)
	at org.kohsuke.groovy.sandbox.GroovyInterceptor.onMethodCall(GroovyInterceptor.java:23)
	at org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.SandboxInterceptor.onMethodCall(SandboxInterceptor.java:157)
	at org.kohsuke.groovy.sandbox.impl.Checker$1.call(Checker.java:161)
	at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:165)
	at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:135)
	at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:135)
	at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:135)
	at com.cloudbees.groovy.cps.sandbox.SandboxInvoker.methodCall(SandboxInvoker.java:17)
	at WorkflowScript.run(WorkflowScript:266)
	at ___cps.transform___(Native Method)
	at com.cloudbees.groovy.cps.impl.ContinuationGroup.methodCall(ContinuationGroup.java:86)
	at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.dispatchOrArg(FunctionCallBlock.java:113)
	at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.fixArg(FunctionCallBlock.java:83)
	at sun.reflect.GeneratedMethodAccessor296.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
	at java.lang.reflect.Method.invoke(Unknown Source)
	at com.cloudbees.groovy.cps.impl.ContinuationPtr$ContinuationImpl.receive(ContinuationPtr.java:72)
	at com.cloudbees.groovy.cps.impl.ConstantBlock.eval(ConstantBlock.java:21)
	at com.cloudbees.groovy.cps.Next.step(Next.java:83)
	at com.cloudbees.groovy.cps.Continuable$1.call(Continuable.java:174)
	at com.cloudbees.groovy.cps.Continuable$1.call(Continuable.java:163)
	at org.codehaus.groovy.runtime.GroovyCategorySupport$ThreadCategoryInfo.use(GroovyCategorySupport.java:129)
	at org.codehaus.groovy.runtime.GroovyCategorySupport.use(GroovyCategorySupport.java:268)
	at com.cloudbees.groovy.cps.Continuable.run0(Continuable.java:163)
	at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.access$001(SandboxContinuable.java:18)
	at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.run0(SandboxContinuable.java:51)
	at org.jenkinsci.plugins.workflow.cps.CpsThread.runNextChunk(CpsThread.java:185)
	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.run(CpsThreadGroup.java:400)
	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.access$400(CpsThreadGroup.java:96)
	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:312)
	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:276)
	at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$2.call(CpsVmExecutorService.java:67)
	at java.util.concurrent.FutureTask.run(Unknown Source)
	at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:131)
	at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
	at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:59)
	at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
	at java.util.concurrent.FutureTask.run(Unknown Source)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.lang.Thread.run(Unknown Source)
Comment 10 Frederic Gurr CLA 2020-06-02 07:55:48 EDT
I just set the following system property:
-Dorg.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator.websocketConnectionTimeout=60
to increase the timeout. Let's see if that helps.
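
For anyone who needs to replicate this on another instance: a property like this is typically passed as a -D argument in the Jenkins master's JVM options; it can also be applied at runtime (until the next restart) from the script console, roughly like so:

// Jenkins script console (Manage Jenkins > Script Console); the setting is
// lost on restart, so the -D JVM argument is the persistent variant.
System.setProperty(
    'org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator.websocketConnectionTimeout',
    '60')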
Comment 11 Thomas Wolf CLA 2020-06-02 09:24:26 EDT
(In reply to Frederic Gurr from comment #10)
> I just set the following system property:
> -Dorg.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator.
> websocketConnectionTimeout=60
> to increase the timeout. Let's see if that helps.

Doubling that connection timeout sounds more like a work-around for a particular symptom, not like something that might solve the underlying issue reported here and in the linked bug reports: builds appear to be generally very slow -- much slower than usual.
Comment 12 Denis Roy CLA 2020-06-02 09:32:36 EDT
Mikael, Fred: We really, really need something like bug 563223. Something that can be a repeatable performance benchmark for the various infra components.

Otherwise, where do we look?  Storage backend? Network uplinks? k8s scheduling? CPU? What else was running at the time?

In this case, I've looked at storage backends, git/gerrit, and network uplinks. Nothing overloaded or unusual.
Comment 13 Denis Roy CLA 2020-06-02 09:47:03 EDT
*** Bug 563221 has been marked as a duplicate of this bug. ***
Comment 14 Christian Dietrich CLA 2020-06-02 10:09:10 EDT
@denis: please use our master branches to test, not the release branches.
Comment 15 Frederic Gurr CLA 2020-06-02 10:21:11 EDT
(In reply to Thomas Wolf from comment #11)
> (In reply to Frederic Gurr from comment #10)
> > I just set the following system property:
> > -Dorg.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator.
> > websocketConnectionTimeout=60
> > to increase the timeout. Let's see if that helps.
> 
> Doubling that connection timeout sounds more like a work-around for a
> particular symptom, not like something that might solve the underlying issue
> reported here and in the linked bug reports: builds appear to be generally
> very slow -- much slower than usual.
I should have made that clearer. My change addresses a particular issue of the Releng JIPP, since they use websocket connections for some of their agents. This is indeed not a fix for the intermittent slowness of the build cluster that affected multiple projects.
Comment 16 Frederic Gurr CLA 2020-06-02 10:21:47 EDT
(In reply to Denis Roy from comment #12)
> Mikael, Fred: We really, really need something like bug 563223. Something
> that can be a repeatable performance benchmark for the various infra
> components.
> 
> Otherwise, where do we look?  Storage backend? Network uplinks? k8s
> scheduling? CPU? What else was running at the time?
> 
> In this case, I've looked at storage backends, git/gerrit, and network
> uplinks. Nothing overloaded or unusual.
Let's discuss this in bug 563223.
Comment 17 Sravan Kumar Lakkimsetti CLA 2020-06-02 10:40:31 EDT
(In reply to Frederic Gurr from comment #15)
> (In reply to Thomas Wolf from comment #11)
> > (In reply to Frederic Gurr from comment #10)
> > > I just set the following system property:
> > > -Dorg.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator.
> > > websocketConnectionTimeout=60
> > > to increase the timeout. Let's see if that helps.
> > 
> > Doubling that connection timeout sounds more like a work-around for a
> > particular symptom, not like something that might solve the underlying issue
> > reported here and in the linked bug reports: builds appear to be generally
> > very slow -- much slower than usual.
> I should have made that clearer. My change addresses a particular issue of
> the Releng JIPP, since they use websocket connections for some of their
> agents. This is indeed not a fix for the intermittent slowness of the build
> cluster that affected multiple projects.

Actually, we are not using any websocket-based agents during the I-build job https://ci-staging.eclipse.org/releng/job/I-build/.

The failures occurred when we were moving from one stage to another using the same agent.
Comment 18 Christian Dietrich CLA 2020-06-02 16:34:23 EDT
It looks like Jenkins decides to start the builds on the slowest and most broken machine it can find. Our workaround for Xtext builds: start an alibi build that will be scheduled on the slow machine, and then the build you actually want.

Result: the build is blazing fast.

https://ci.eclipse.org/xtext/job/xtext-lib/job/master/26/console
vs
https://ci.eclipse.org/xtext/job/xtext-lib/job/master/25/console
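
A hypothetical sketch of that trick in a scripted pipeline, using the standard build step (the job names are placeholders, not the actual Xtext jobs):

// Fire-and-forget "alibi" build so it occupies the slow node, then run the
// build we actually care about. Job names are placeholders.
build job: 'alibi-warmup', wait: false
build job: 'xtext-lib-master', wait: true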
Comment 19 Antonio Garcia-Dominguez CLA 2020-06-04 12:40:04 EDT
We're experiencing a similar problem in Hawk. We have some long integration tests that can take between 30 minutes on a good day and 6 hours (!) on a bad day. One build will take 35 minutes, the next one 25 minutes, and the next one 2 hours. Admittedly, these are quite I/O-bound, but on our previous infrastructure before moving to Eclipse (Travis CI) we had much more consistent times.

For instance, see builds #258 - #264 in our main job, and compare the variation in the "Tycho" stage:

https://ci.eclipse.org/hawk/job/hawk/job/master/

There haven't been any major changes in those builds that would explain the change in build times: most of those commits are about minor tweaks in the releng scripts / copyright headers.
Comment 20 Denis Roy CLA 2020-06-04 13:36:47 EDT
From 25 minutes to 6 hours is wild; I don't see anything that stays busy for that long and could cause such a performance issue. All the storage backends are fine -- sure, they have peaks lasting several minutes where some performance decrease may be observed, but it would be slight.

Looking at some of the logs, I see wasted minutes on this:
INFO: I/O exception (org.apache.http.NoHttpResponseException) caught when processing request to {}->http://download.eclipse.org:80: The target server failed to respond

These 10 wasted minutes are a documented bug with ECF (I believe): https://bugs.eclipse.org/bugs/show_bug.cgi?id=560808

The rest is just a really long log file that doesn't provide much. Hard to know where the performance issue is.
Comment 21 Mikaël Barbero CLA 2020-06-09 04:52:52 EDT
Just a heads-up that we're working on this. We will pin the Jenkins masters to the slower nodes so that the faster nodes are used more exclusively for actual builds. That does not mean that discrepancies won't happen again, but it will limit them as best we can with the resources we have. The change will be deployed during the next upgrade week (see bug 563057 and https://status.eclipse.org/incidents/1jvztmyy9jnl for details).

The other solution would be to block builds from running on slower hardware and queue them until faster nodes are available. I don't think that's desirable; it should only happen when the cluster runs at capacity (which already happens from time to time). See https://wiki.eclipse.org/Jenkins#Do_you_do_overbooking.3F for more details.
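
For illustration only (the actual pinning of the Jenkins masters is a cluster-side change made by the infra team, not something projects configure): the Kubernetes plugin exposes the same scheduling mechanism to pipelines via a node selector on the pod template, assuming hypothetical node labels such as speed=fast:

// Illustrative sketch; 'speed=fast' is a hypothetical node label and the
// default jnlp container is assumed. Steers the build pod onto that node class.
podTemplate(nodeSelector: 'speed=fast') {
    node(POD_LABEL) {
        sh 'echo running on a node labelled speed=fast'
    }
}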
Comment 22 Mikaël Barbero CLA 2020-06-09 05:21:15 EDT
*** Bug 563816 has been marked as a duplicate of this bug. ***
Comment 23 Mikaël Barbero CLA 2020-06-09 05:23:05 EDT
Changed the title, as it is not only about the Releng JIPP.
Comment 24 Mikaël Barbero CLA 2020-06-12 09:17:25 EDT
Lowering importance, as it's not a blocker.
Comment 25 Sravan Kumar Lakkimsetti CLA 2020-06-18 02:18:20 EDT
Since yesterday we have had 2 build failures because they hit the timeout.

For both builds the pod was assigned to osnode-5.eclipse.org.

Here is the log: https://ci-staging.eclipse.org/releng/job/Y-build-4.17/3/ (timed out after 5 hours).
I triggered the job again today; this time the pod was assigned to osnode-4.eclipse.org. That job is running fast and is expected to complete in less than 3 hours.
Comment 26 Mikaël Barbero CLA 2020-06-29 11:02:06 EDT
The workaround described in comment #21 has been implemented. I'm leaving this bug open for a week and will then close it. If you see any large build time variations in the meantime, feel free to add a comment.
Comment 27 Sravan Kumar Lakkimsetti CLA 2020-06-29 11:13:37 EDT
For the past 3 days, builds have been completing in 2 hours (really fast).

I will update this bug if I see any variations in the times.
Comment 28 Mikaël Barbero CLA 2020-07-09 05:12:00 EDT
Closing as builds seem to consistently complete in about 2 hours.