| Summary: | Some jobs on the new infra have inconsistent build times (sometimes *really* slow) | | |
|---|---|---|---|
| Product: | Community | Reporter: | Sravan Kumar Lakkimsetti <sravankumarl> |
| Component: | CI-Jenkins | Assignee: | CI Admin Inbox <ci.admin-inbox> |
| Status: | CLOSED FIXED | QA Contact: | |
| Severity: | major | | |
| Priority: | P3 | CC: | agarcdomi, christian.dietrich.opensource, daniel_megert, denis.roy, frederic.gurr, jonah, kalyan_prasad, mikael.barbero, sarika.sinha, twolf, webmaster |
| Version: | unspecified | | |
| Target Milestone: | --- | | |
| Hardware: | PC | | |
| OS: | Windows 10 | | |
| See Also: | https://bugs.eclipse.org/bugs/show_bug.cgi?id=563221, https://bugs.eclipse.org/bugs/show_bug.cgi?id=563816, https://bugs.eclipse.org/bugs/show_bug.cgi?id=563785 | | |
| Whiteboard: | | | |
| Bug Depends on: | 563223 | | |
| Bug Blocks: | 564012 | | |
Description
Sravan Kumar Lakkimsetti
Pinpointing a single reason for such variations is not an easy task, given that the I-Build is a complex and long build running on a complex build environment. To get a better sense of average build times and outliers, I have increased the "number of builds to keep" to 25 (instead of 5). I would recommend trying to increase the CPU limits and requests to 4000m and seeing how that affects the build time (a hedged sketch of where such values go in a pod template follows this comment block). Part of the problem might also be lower I/O performance, but I'd start with adjusting the CPU limits.

We had a timeout today with https://ci-staging.eclipse.org/releng/job/I-build/148/. Our build should normally take about 2 to 2.5 hours, so our timeout is set to 5 hours. The above build hit the timeout.

Hit the timeout again in https://ci-staging.eclipse.org/releng/job/I-build/149/. Raising to blocker.

We at Xtext see notable build time increases too. We are investigating.

The latest build attempt failed again. For details see https://ci-staging.eclipse.org/releng/job/I-build/151/console. This looks like it made the build fail:

12:09:02 !ENTRY org.eclipse.equinox.p2.transport.ecf 2 0 2020-06-02 10:09:02.445
12:09:02 !MESSAGE Connection to https://download.eclipse.org/eclipse/updates/4.16-I-builds/I20200531-0600/plugins/org.eclipse.equinox.security.linux.x86_64_1.1.300.v20190830-1238.jar.pack.gz failed on download.eclipse.org:443 failed to respond. Retry attempt 0 started
12:09:02 !STACK 0
12:09:02 org.apache.http.NoHttpResponseException: download.eclipse.org:443 failed to respond
12:09:02     at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:141)
12:09:02     at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:56)
12:09:02     at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:259)
12:09:02     at org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:163)
12:09:02     at org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:157)
12:09:02     at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:273)
12:09:02     at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125)
12:09:02     at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:272)
12:09:02     at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
12:09:02     at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
12:09:02     at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
12:09:02     at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
12:09:02     at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
12:09:02     at org.eclipse.ecf.provider.filetransfer.httpclient45.HttpClientRetrieveFileTransfer.performConnect(HttpClientRetrieveFileTransfer.java:1000)
12:09:02     at org.eclipse.ecf.provider.filetransfer.httpclient45.HttpClientRetrieveFileTransfer.access$0(HttpClientRetrieveFileTransfer.java:992)
12:09:02     at org.eclipse.ecf.provider.filetransfer.httpclient45.HttpClientRetrieveFileTransfer$1.performFileTransfer(HttpClientRetrieveFileTransfer.java:988)
12:09:02     at org.eclipse.ecf.filetransfer.FileTransferJob.run(FileTransferJob.java:74)
12:09:02     at org.eclipse.core.internal.jobs.Worker.run(Worker.java:63)

Build 150 successfully completed in 2 hours and 4 minutes.
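For illustration only, here is a minimal, hypothetical sketch of where the suggested CPU request/limit of 4000m would go when a job defines its agent pod with the Jenkins Kubernetes plugin's scripted-pipeline podTemplate/containerTemplate steps. The container name, image, memory values, and build command are assumptions, not the actual Releng I-build pod template.

```groovy
// Hypothetical sketch only -- not the actual Releng I-build pod template.
// Shows where the CPU request/limit of 4000m suggested above would be set
// when using the Jenkins Kubernetes plugin's podTemplate/containerTemplate steps.
podTemplate(containers: [
    containerTemplate(
        name: 'builder',                   // assumed container name
        image: 'maven:3-jdk-8',            // assumed image
        ttyEnabled: true,
        command: 'cat',
        resourceRequestCpu: '4000m',       // CPU request raised as suggested
        resourceLimitCpu: '4000m',         // CPU limit raised as suggested
        resourceRequestMemory: '4Gi',      // memory values are assumptions
        resourceLimitMemory: '4Gi'
    )
]) {
    node(POD_LABEL) {
        container('builder') {
            sh 'mvn -V -B verify'          // placeholder for the real build steps
        }
    }
}
```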
Here is the status of the last 5 builds:

- 151: Build failed with the stack trace mentioned in comment 7
- 150: Build succeeded, took 2 hrs and 4 minutes
- 149: Build failed, hit the timeout of 5 hours
- 148: Build failed, hit the timeout of 5 hours
- 147: Build succeeded, took 4 hours and 11 minutes

In our build we have a dependent job. This was not scheduled in a timely manner in builds 148 and 147; see Bug 563785. There is a delay in scheduling the job of 1 hr 15 min for build 147 and 2 hr 4 min for build 148. Hope this data is good enough.

If you open https://ci-staging.eclipse.org/releng/job/I-build/ you can see the stage view. I can explain, for each of the tasks, what kind of disk access, network access, etc. it involves. Please let me know if you need any details.

We do have a couple of build failures (build 143 and build 151) with a timeout during the changeover from one stage to another. Here is the stack trace:

java.io.IOException: Timed out waiting for websocket connection. You should increase the value of system property org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator.websocketConnectionTimeout currently set at 30 seconds
    at org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator$1.doLaunch(ContainerExecDecorator.java:441)
    at org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator$1.launch(ContainerExecDecorator.java:336)
    at hudson.Launcher$ProcStarter.start(Launcher.java:454)
    at org.jenkinsci.plugins.durabletask.BourneShellScript.launchWithCookie(BourneShellScript.java:234)
    at org.jenkinsci.plugins.durabletask.FileMonitoringTask.launch(FileMonitoringTask.java:103)
    at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.start(DurableTaskStep.java:317)
    at org.jenkinsci.plugins.workflow.cps.DSL.invokeStep(DSL.java:286)
    at org.jenkinsci.plugins.workflow.cps.DSL.invokeMethod(DSL.java:179)
    at org.jenkinsci.plugins.workflow.cps.CpsScript.invokeMethod(CpsScript.java:122)
    at sun.reflect.GeneratedMethodAccessor580.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:93)
    at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:325)
    at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1213)
    at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1022)
    at org.codehaus.groovy.runtime.callsite.PogoMetaClassSite.call(PogoMetaClassSite.java:42)
    at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:48)
    at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:113)
    at org.kohsuke.groovy.sandbox.impl.Checker$1.call(Checker.java:163)
    at org.kohsuke.groovy.sandbox.GroovyInterceptor.onMethodCall(GroovyInterceptor.java:23)
    at org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.SandboxInterceptor.onMethodCall(SandboxInterceptor.java:157)
    at org.kohsuke.groovy.sandbox.impl.Checker$1.call(Checker.java:161)
    at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:165)
    at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:135)
    at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:135)
    at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:135)
    at com.cloudbees.groovy.cps.sandbox.SandboxInvoker.methodCall(SandboxInvoker.java:17)
    at WorkflowScript.run(WorkflowScript:266)
    at ___cps.transform___(Native Method)
    at com.cloudbees.groovy.cps.impl.ContinuationGroup.methodCall(ContinuationGroup.java:86)
    at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.dispatchOrArg(FunctionCallBlock.java:113)
    at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.fixArg(FunctionCallBlock.java:83)
    at sun.reflect.GeneratedMethodAccessor296.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at com.cloudbees.groovy.cps.impl.ContinuationPtr$ContinuationImpl.receive(ContinuationPtr.java:72)
    at com.cloudbees.groovy.cps.impl.ConstantBlock.eval(ConstantBlock.java:21)
    at com.cloudbees.groovy.cps.Next.step(Next.java:83)
    at com.cloudbees.groovy.cps.Continuable$1.call(Continuable.java:174)
    at com.cloudbees.groovy.cps.Continuable$1.call(Continuable.java:163)
    at org.codehaus.groovy.runtime.GroovyCategorySupport$ThreadCategoryInfo.use(GroovyCategorySupport.java:129)
    at org.codehaus.groovy.runtime.GroovyCategorySupport.use(GroovyCategorySupport.java:268)
    at com.cloudbees.groovy.cps.Continuable.run0(Continuable.java:163)
    at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.access$001(SandboxContinuable.java:18)
    at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.run0(SandboxContinuable.java:51)
    at org.jenkinsci.plugins.workflow.cps.CpsThread.runNextChunk(CpsThread.java:185)
    at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.run(CpsThreadGroup.java:400)
    at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.access$400(CpsThreadGroup.java:96)
    at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:312)
    at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:276)
    at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$2.call(CpsVmExecutorService.java:67)
    at java.util.concurrent.FutureTask.run(Unknown Source)
    at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:131)
    at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
    at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:59)
    at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
    at java.util.concurrent.FutureTask.run(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)

I just set the following system variable:
-Dorg.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator.websocketConnectionTimeout=60
to increase the timeout. Let's see if that helps. (A sketch of setting such a property at runtime follows below.)

(In reply to Frederic Gurr from comment #10)
> I just set the following system variable:
> -Dorg.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator.websocketConnectionTimeout=60
> to increase the timeout. Let's see if that helps.

Doubling that connection timeout sounds more like a work-around for a particular symptom, not like something that might solve the underlying issue reported here and in the linked bug reports: builds appear to be generally very slow -- much slower than usual.

Mikael, Fred: We really, really need something like bug 563223. Something that can be a repeatable performance benchmark for the various infra components.

Otherwise, where do we look? Storage backend? Network uplinks? k8s scheduling? CPU? What else was running at the time?

In this case, I've looked at storage backends, git/gerrit, and network uplinks. Nothing overloaded or unusual.

*** Bug 563221 has been marked as a duplicate of this bug. ***

@denis, please take our master branches to test, not the release branches.
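For reference regarding the websocketConnectionTimeout property discussed above: the persistent way to change it is the -D JVM argument on the Jenkins controller, as described in the comment. A minimal Groovy sketch of the runtime alternative, assuming script-console access, is shown below; such a change does not survive a controller restart.

```groovy
// Hedged sketch: adjusting the Kubernetes plugin's websocket connection timeout
// at runtime via the Jenkins script console. The property name comes from the
// stack trace above; the value of 60 seconds matches the change described here.
// Note: script-console changes are lost on restart; the persistent route is the
// -D JVM argument quoted in the comments.
String key = 'org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator.websocketConnectionTimeout'
System.setProperty(key, '60')
println "${key} = ${System.getProperty(key)}"   // confirm the new value
```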
(In reply to Thomas Wolf from comment #11)
> (In reply to Frederic Gurr from comment #10)
> > I just set the following system variable:
> > -Dorg.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator.websocketConnectionTimeout=60
> > to increase the timeout. Let's see if that helps.
>
> Doubling that connection timeout sounds more like a work-around for a particular symptom, not like something that might solve the underlying issue reported here and in the linked bug reports: builds appear to be generally very slow -- much slower than usual.

I should have made that clearer. My change addresses a particular issue of the Releng JIPP, since they use websocket connections for some of their agents. This is indeed not a fix for the intermittent slowness of the build cluster that affected multiple projects.

(In reply to Denis Roy from comment #12)
> Mikael, Fred: We really, really need something like bug 563223. Something that can be a repeatable performance benchmark for the various infra components.
>
> Otherwise, where do we look? Storage backend? Network uplinks? k8s scheduling? CPU? What else was running at the time?
>
> In this case, I've looked at storage backends, git/gerrit, and network uplinks. Nothing overloaded or unusual.

Let's discuss this in bug 563223.

(In reply to Frederic Gurr from comment #15)
> I should have made that clearer. My change addresses a particular issue of the Releng JIPP, since they use websocket connections for some of their agents. This is indeed not a fix for the intermittent slowness of the build cluster that affected multiple projects.

Actually, we are not using any websocket-based agents during the I-build job https://ci-staging.eclipse.org/releng/job/I-build/. The failures occurred when moving from one stage to another using the same agent.

It looks like Jenkins decides to start the builds on the slowest and most broken machine it can find. Solution for Xtext builds: start an alibi build that will be scheduled on the slow machine, and then the build you actually want to have. Result: the build is blazing fast. Compare https://ci.eclipse.org/xtext/job/xtext-lib/job/master/26/console vs https://ci.eclipse.org/xtext/job/xtext-lib/job/master/25/console.

We're experiencing a similar problem in Hawk. We have some long integration tests that can take between 30 min on a good day and 6 hours (!) on a bad day. We will have one build take 35 minutes, the next one 25 min, and the next one 2 hours. Admittedly, these are quite I/O bound, but on our previous infra before moving to Eclipse (Travis CI) we had much more consistent times.
For instance, see builds #258 - #264 in our main job, and compare the variation in the "Tycho" stage: https://ci.eclipse.org/hawk/job/hawk/job/master/ There haven't been any major changes in those builds that would explain the change in build times: most of those commits are minor tweaks to the releng scripts / copyright headers.

From 25 minutes to 6 hours is wild; I don't see anything that is busy for long enough to cause such a performance issue. All the storage backends are fine -- sure, they have peaks that last several minutes where some performance decrease may be observed, but the impact would be light.
Looking at some of the logs, I see wasted minutes on this:
INFO: I/O exception (org.apache.http.NoHttpResponseException) caught when processing request to {}->http://download.eclipse.org:80: The target server failed to respond
These 10 wasted minutes are a documented bug with ECF (I believe): https://bugs.eclipse.org/bugs/show_bug.cgi?id=560808
The rest is just a really long log file that doesn't provide much. Hard to know where the performance issue is.
Just a heads up that we're working on this. We will pin the Jenkins masters to the slower nodes so that the faster nodes will be used more exclusively for actual builds. It does not mean that discrepancies won't happen again, but it will limit them as best we can with the resources we have. The change will be deployed during the next upgrade week (see bug 563057 and https://status.eclipse.org/incidents/1jvztmyy9jnl for details).

The other solution would be to block builds from running on slower hardware and queue them until faster nodes are available. I don't think that's desirable. This should only happen when the cluster runs at capacity (which already happens from time to time). See https://wiki.eclipse.org/Jenkins#Do_you_do_overbooking.3F for more details. (An illustrative sketch of node selection from the pipeline side follows at the end of this report.)

*** Bug 563816 has been marked as a duplicate of this bug. ***

Changed the title, as it is not only about Releng's JIPP. Lowering importance, as it's not a blocker.

Since yesterday we had two build failures because they hit the timeout. For both builds the pod was assigned to osnode-5.eclipse.org. Here is the log: https://ci-staging.eclipse.org/releng/job/Y-build-4.17/3/ timed out after 5 hours. I triggered the job again today; this time the pod is assigned to osnode-4.eclipse.org. This job is running fast and is expected to complete in less than 3 hours.

The workaround described in comment #21 has been implemented. I'm leaving this one open for a week and will then close it. If you see any large build time variations in the meantime, feel free to add a comment.

Builds have been completing in 2 hours (really fast) for the past 3 days. I will update this bug if I see any variations in the times.

Closing, as builds seem to consistently complete in about 2 hours.
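For illustration only: the pinning mentioned above is done in the cluster's own deployment configuration, which is not shown here. From the pipeline side, the same general idea of steering pods to particular nodes can be expressed through the Jenkins Kubernetes plugin's node selector; the node label and image below are hypothetical, not labels that exist on the Eclipse cluster.

```groovy
// Illustration only -- not the change made by the infra team.
// A pipeline pod template can request scheduling on nodes carrying a given
// label via a node selector; 'speed=fast' is a hypothetical label.
podTemplate(
    nodeSelector: 'speed=fast',   // hypothetical node label distinguishing faster nodes
    containers: [
        containerTemplate(name: 'builder', image: 'maven:3-jdk-8', ttyEnabled: true, command: 'cat')
    ]
) {
    node(POD_LABEL) {
        container('builder') {
            sh 'mvn -v'   // placeholder for the real build steps
        }
    }
}
```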