| Summary: | Builds (on build2) cannot access git repositories | ||
|---|---|---|---|
| Product: | Community | Reporter: | Steve Powell <zteve.powell> |
| Component: | CI-Jenkins | Assignee: | CI Admin Inbox <ci.admin-inbox> |
| Status: | CLOSED FIXED | QA Contact: | |
| Severity: | major | ||
| Priority: | P3 | CC: | d_a_carver, glyn.normington, webmaster |
| Version: | unspecified | ||
| Target Milestone: | --- | ||
| Hardware: | PC | ||
| OS: | Mac OS X - Carbon (unsup.) | ||
| Whiteboard: | |||
|
Description
Steve Powell
Cancelling the builds appears to produce:

SCM check out aborted
Archiving artifacts

Is the build2 environment not correctly set after restart?

...and then, eventually:

ERROR: Publisher hudson.tasks.ArtifactArchiver aborted due to exception
java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at java.lang.Object.wait(Object.java:196)
at hudson.remoting.Request.call(Request.java:122)
at hudson.remoting.Channel.call(Channel.java:551)
at hudson.EnvVars.getRemote(EnvVars.java:196)
at hudson.model.Computer.getEnvironment(Computer.java:736)
at hudson.model.Run.getEnvironment(Run.java:1643)
at hudson.model.AbstractBuild.getEnvironment(AbstractBuild.java:663)
at hudson.tasks.ArtifactArchiver.perform(ArtifactArchiver.java:116)
at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:19)
at hudson.model.AbstractBuild$AbstractRunner.perform(AbstractBuild.java:582)
at hudson.model.AbstractBuild$AbstractRunner.performAllBuildStep(AbstractBuild.java:563)
at hudson.model.AbstractBuild$AbstractRunner.performAllBuildStep(AbstractBuild.java:550)
at hudson.model.Build$RunnerImpl.post2(Build.java:152)
at hudson.model.AbstractBuild$AbstractRunner.post(AbstractBuild.java:528)
at hudson.model.Run.run(Run.java:1267)
at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46)
at hudson.model.ResourceController.execute(ResourceController.java:88)
at hudson.model.Executor.run(Executor.java:122)
Recording test results

...and another hang. OK, each abort request (a click on the red cross icon) produces another step.
Here is the next one:

ERROR: Publisher hudson.tasks.junit.JUnitResultArchiver aborted due to exception
java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at java.lang.Object.wait(Object.java:196)
at hudson.remoting.Request.call(Request.java:122)
at hudson.remoting.Channel.call(Channel.java:551)
at hudson.EnvVars.getRemote(EnvVars.java:196)
at hudson.model.Computer.getEnvironment(Computer.java:736)
at hudson.model.Run.getEnvironment(Run.java:1643)
at hudson.model.AbstractBuild.getEnvironment(AbstractBuild.java:663)
at hudson.tasks.junit.JUnitResultArchiver.perform(JUnitResultArchiver.java:117)
at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:19)
at hudson.model.AbstractBuild$AbstractRunner.perform(AbstractBuild.java:582)
at hudson.model.AbstractBuild$AbstractRunner.performAllBuildStep(AbstractBuild.java:563)
at hudson.model.AbstractBuild$AbstractRunner.performAllBuildStep(AbstractBuild.java:550)
at hudson.model.Build$RunnerImpl.post2(Build.java:152)
at hudson.model.AbstractBuild$AbstractRunner.post(AbstractBuild.java:528)
at hudson.model.Run.run(Run.java:1267)
at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46)
at hudson.model.ResourceController.execute(ResourceController.java:88)
at hudson.model.Executor.run(Executor.java:122)
Failed to send e-mail to Christopher Frost because no e-mail address is known, and no default e-mail domain is configured
Failed to send e-mail to Glyn Normington because no e-mail address is known, and no default e-mail domain is configured

(more hanging...)
The last one is:

ERROR: Publisher hudson.tasks.Mailer aborted due to exception
java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at java.lang.Object.wait(Object.java:196)
at hudson.remoting.Request.call(Request.java:122)
at hudson.remoting.Channel.call(Channel.java:551)
at hudson.FilePath.act(FilePath.java:736)
at hudson.FilePath.act(FilePath.java:729)
at hudson.FilePath.toURI(FilePath.java:784)
at hudson.tasks.MailSender.createFailureMail(MailSender.java:259)
at hudson.tasks.MailSender.getMail(MailSender.java:134)
at hudson.tasks.MailSender.execute(MailSender.java:82)
at hudson.tasks.Mailer.perform(Mailer.java:101)
at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:19)
at hudson.model.AbstractBuild$AbstractRunner.perform(AbstractBuild.java:582)
at hudson.model.AbstractBuild$AbstractRunner.performAllBuildStep(AbstractBuild.java:563)
at hudson.model.AbstractBuild$AbstractRunner.performAllBuildStep(AbstractBuild.java:550)
at hudson.model.Build$RunnerImpl.post2(Build.java:152)
at hudson.model.AbstractBuild$AbstractRunner.post(AbstractBuild.java:528)
at hudson.model.Run.run(Run.java:1267)
at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46)
at hudson.model.ResourceController.execute(ResourceController.java:88)
at hudson.model.Executor.run(Executor.java:122)
Finished: FAILURE

...and we finish at last. What is going on here?

Upped to major since it affects all my build jobs.

I restarted the slave to clear the apparent dangling build, and now the virgo.test.snapshot build seems to run without issue. Perhaps there is an issue if the slave gets stuck and the master is subsequently restarted, but the slave is not.
-M.

Are you checking out code via anonymous git (git://) or are you using ssh+git? If it's ssh, perhaps it's waiting for a password?

Well, the problem seems to have cleared on virgo.test.snapshot.
We use git://, but the strategy hasn't changed since the last time we ran the build successfully, so I don't think it is relevant, unless the server or slave configuration has changed in this area. The problem is still apparent on virgo.web.snapshot (I've just kicked off another build and it is hanging in the same place). Killing it manually requires FOUR attempts (it clicks through the hang points as before). I'm going to resubmit, but this is still an open bug.

(In reply to comment #7)
> Are you checking out code via anonymous git? (git://) or are you using
> ssh+git? If it's ssh, perhaps it's waiting for a password?

There is a little (circumstantial) evidence that the emf-graphiti-nighly job (which is currently hung in the emma coverage step on the slave) might be causing all other jobs to hang while getting git source from the server.
The emf- jobs were cancelled (or timed out) yesterday before the test job was retried; the emf- job was restarted and is hung in the same place again today, and the other jobs (including the test job) are hanging as before.
My guess:
I think server/slave communication is hanging in the emma/coverage plugin.
Witness the FastPipedInputStream exception I see at the end of the log of the first emf- job:
BUILD SUCCESSFUL
Total time: 10 minutes 53 seconds
Archiving artifacts
Recording test results
Emma: looking for coverage reports in the provided path: build/result/test/output/*-coverageReport.xml
Emma: found 2 report files:
/opt/users/hudsonbuild/workspace/emf-graphiti-nighly/build/result/test/output/JUnit-graphiti-coverageReport.xml
/opt/users/hudsonbuild/workspace/emf-graphiti-nighly/build/result/test/output/JUnit-graphitiUI-coverageReport.xml
Emma: stored 2 report files in the build folder: /opt/users/hudsonbuild/.hudson/jobs/emf-graphiti-nighly/builds/2010-07-20_11-06-04/emma
ERROR: Publisher hudson.plugins.emma.EmmaPublisher aborted due to exception
java.io.IOException
at hudson.remoting.FastPipedInputStream.read(FastPipedInputStream.java:173)
at sun.nio.cs.StreamDecoder$CharsetSD.readBytes(StreamDecoder.java:452)
at sun.nio.cs.StreamDecoder$CharsetSD.implRead(StreamDecoder.java:494)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:222)
at java.io.InputStreamReader.read(InputStreamReader.java:177)
at org.xmlpull.mxp1.MXParser.fillBuf(MXParser.java:2992)
at org.xmlpull.mxp1.MXParser.more(MXParser.java:3046)
at org.xmlpull.mxp1.MXParser.parseProlog(MXParser.java:1410)
at org.xmlpull.mxp1.MXParser.nextImpl(MXParser.java:1395)
at org.xmlpull.mxp1.MXParser.next(MXParser.java:1093)
at org.xmlpull.mxp1.MXParser.nextTag(MXParser.java:1078)
at hudson.plugins.emma.EmmaBuildAction.loadRatios(EmmaBuildAction.java:260)
at hudson.plugins.emma.EmmaBuildAction.load(EmmaBuildAction.java:233)
at hudson.plugins.emma.EmmaPublisher.perform(EmmaPublisher.java:126)
at hudson.tasks.BuildStepMonitor$3.perform(BuildStepMonitor.java:36)
at hudson.model.AbstractBuild$AbstractRunner.perform(AbstractBuild.java:582)
at hudson.model.AbstractBuild$AbstractRunner.performAllBuildStep(AbstractBuild.java:563)
at hudson.model.AbstractBuild$AbstractRunner.performAllBuildStep(AbstractBuild.java:550)
at hudson.model.Build$RunnerImpl.post2(Build.java:152)
at hudson.model.AbstractBuild$AbstractRunner.post(AbstractBuild.java:528)
at hudson.model.Run.run(Run.java:1267)
at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46)
at hudson.model.ResourceController.execute(ResourceController.java:88)
at hudson.model.Executor.run(Executor.java:122)
ln -s builds/2010-07-20_11-06-04 /opt/users/hudsonbuild/.hudson/jobs/emf-graphiti-nighly/builds/../lastSuccessful failed: -1
ln -s builds/2010-07-20_11-06-04 /opt/users/hudsonbuild/.hudson/jobs/emf-graphiti-nighly/builds/../lastStable failed: -1
Finished: SUCCESS
Note that this is reported as SUCCESS when it has clearly failed.
Do I understand correctly that the emma/coverage plug-in is a recent thing? I read some talk of "re-instating" it on another bug.
Well, my theory got a knock. The emf- job has just failed (in the same way as before), but the test job still hangs at the start. Unless the slave/server communication was permanently disrupted by the emf- job problem, it might have nothing to do with it. Evidence FOR my theory: virgo.medic.snapshot (which runs on the master and NOT on the build2 slave) runs fine, but virgo.test.snapshot (which runs on build2) is still hanging...

> Note that this is reported as SUCCESS when it has clearly failed.
While the emma plugin does seem to have exploded, that doesn't mean the build itself was a failure. Or is there something I'm missing?
And I just kicked off a virgo-test build on the slave and it ran without issue.
-M.
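The exchange above turns on a build whose console ends in `Finished: SUCCESS` even though a publisher (here the Emma one) aborted, which is the pattern this report suspects of preceding the build2 hangs. A minimal sketch for spotting such runs, assuming only the console-log lines quoted in this report (`ERROR: Publisher ... aborted due to exception` and `Finished: ...`); the helper names `aborted_publishers` and `silently_broken` are hypothetical, and this is not how Hudson itself decides a build's result.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical helpers: the patterns mirror console lines quoted in this report, e.g.
#   ERROR: Publisher hudson.plugins.emma.EmmaPublisher aborted due to exception
#   Finished: SUCCESS
_ABORTED = re.compile(r"^ERROR: Publisher (\S+) aborted due to exception\s*$")
_FINISHED = re.compile(r"^Finished: (\w+)\s*$")


def aborted_publishers(console_log: str) -> Tuple[List[str], Optional[str]]:
    """Return the publisher classes that aborted and the final result (or None)."""
    aborted: List[str] = []
    result: Optional[str] = None
    for line in console_log.splitlines():
        match = _ABORTED.match(line)
        if match:
            aborted.append(match.group(1))
            continue
        match = _FINISHED.match(line)
        if match:
            result = match.group(1)  # keep the last "Finished:" line seen
    return aborted, result


def silently_broken(console_log: str) -> bool:
    """True when the log reports SUCCESS although at least one publisher aborted."""
    aborted, result = aborted_publishers(console_log)
    return result == "SUCCESS" and bool(aborted)


if __name__ == "__main__":
    # Trimmed excerpt of the emf-graphiti-nighly log quoted above.
    excerpt = "\n".join([
        "Emma: found 2 report files:",
        "ERROR: Publisher hudson.plugins.emma.EmmaPublisher aborted due to exception",
        "java.io.IOException",
        "Finished: SUCCESS",
    ])
    print(aborted_publishers(excerpt))  # prints: (['hudson.plugins.emma.EmmaPublisher'], 'SUCCESS')
    print(silently_broken(excerpt))     # prints: True
```

Matching whole lines keeps the check from firing on stack-trace frames. It only inspects the final status line, so a build that hangs before printing `Finished:` yields `None` as its result rather than a verdict.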
Build #78 seems to have worked, thank you, and my other builds on build2 have kicked off. Was the build2 slave restarted before your 'test', after the last failure of the emf- build?

The past behaviour of emf-graphiti-nigh(t)ly is consistent with my suspicion that an emma plug-in failure breaks the piped communication between the slave and the server. The emf- job can run on either master or build2. After it failed on build2 (succeeded, but hung in emma), my jobs hang doing any piped transfer from the server (e.g. source transfer). After emf- runs successfully (I don't know whether the slave was restarted or not; I can see no evidence of this in the build history), my jobs run fine. [By success here I mean that it runs the emma plugin without hanging.] This correlation remains highly suspicious.

virgo.gemini-web-container.snapshot is sticking at the start again, like the ones before. So is virgo.kernel.snapshot. This started to happen very recently: the kernel job was respun after hanging in tests before. Lo and behold, almost immediately before the failure the emf-graphiti-nightly (sic) job ran (apparently successfully) with the following errors in its log:
BUILD SUCCESSFUL
Total time: 11 minutes 46 seconds
Archiving artifacts
Recording test results
Emma: looking for coverage reports in the provided path: build/result/test/output/*-coverageReport.xml
Emma: found 2 report files:
/opt/users/hudsonbuild/workspace/emf-graphiti-nightly/build/result/test/output/JUnit-graphiti-coverageReport.xml
/opt/users/hudsonbuild/workspace/emf-graphiti-nightly/build/result/test/output/JUnit-graphitiUI-coverageReport.xml
Emma: stored 2 report files in the build folder: /opt/users/hudsonbuild/.hudson/jobs/emf-graphiti-nightly/builds/2010-07-28_05-02-56/emma
ERROR: Publisher hudson.plugins.emma.EmmaPublisher aborted due to exception
java.io.IOException
at hudson.remoting.FastPipedInputStream.read(FastPipedInputStream.java:173)
at sun.nio.cs.StreamDecoder$CharsetSD.readBytes(StreamDecoder.java:452)
at sun.nio.cs.StreamDecoder$CharsetSD.implRead(StreamDecoder.java:494)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:222)
at java.io.InputStreamReader.read(InputStreamReader.java:177)
at org.xmlpull.mxp1.MXParser.fillBuf(MXParser.java:2992)
at org.xmlpull.mxp1.MXParser.more(MXParser.java:3046)
at org.xmlpull.mxp1.MXParser.parseProlog(MXParser.java:1410)
at org.xmlpull.mxp1.MXParser.nextImpl(MXParser.java:1395)
at org.xmlpull.mxp1.MXParser.next(MXParser.java:1093)
at org.xmlpull.mxp1.MXParser.nextTag(MXParser.java:1078)
at hudson.plugins.emma.EmmaBuildAction.loadRatios(EmmaBuildAction.java:260)
at hudson.plugins.emma.EmmaBuildAction.load(EmmaBuildAction.java:233)
at hudson.plugins.emma.EmmaPublisher.perform(EmmaPublisher.java:126)
at hudson.tasks.BuildStepMonitor$3.perform(BuildStepMonitor.java:36)
at hudson.model.AbstractBuild$AbstractRunner.perform(AbstractBuild.java:582)
at hudson.model.AbstractBuild$AbstractRunner.performAllBuildStep(AbstractBuild.java:563)
at hudson.model.AbstractBuild$AbstractRunner.performAllBuildStep(AbstractBuild.java:550)
at hudson.model.Build$RunnerImpl.post2(Build.java:152)
at hudson.model.AbstractBuild$AbstractRunner.post(AbstractBuild.java:528)
at hudson.model.Run.run(Run.java:1267)
at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46)
at hudson.model.ResourceController.execute(ResourceController.java:88)
at hudson.model.Executor.run(Executor.java:122)
ln -s builds/2010-07-28_05-02-56 /opt/users/hudsonbuild/.hudson/jobs/emf-graphiti-nightly/builds/../lastSuccessful failed: -1
ln -s builds/2010-07-28_05-02-56 /opt/users/hudsonbuild/.hudson/jobs/emf-graphiti-nightly/builds/../lastStable failed: -1
Finished: SUCCESS
Incidentally, I left the two jobs running so you can see the problem. Kill them if you want to recycle build2. Please note any actions here.

I restarted the slave, but the master seems to be stuck (not clearing) on the virgo slave jobs. Once the helios build is done I'll restart the master and see if that clears the job lists.
-M.

I killed my jobs; Hudson was shutting down and no-one seemed to have tried killing them (they take a lot of killing; see the comments on this bug).

(In reply to comment #19)
> I restarted the slave, but the master seems to be stuck(not clearing) on the
> virgo slave jobs. Once the helios build is done I'll restart the master and
> see if that clears the job lists.
>
> -M.

I killed the EPP packaging job and restarted Hudson. That particular job was 2 hrs into an 8 hr run. They have said that it is safe to kill and just restart the job after Hudson is restarted, so I'll restart the EPP packaging job when Hudson is back up.

Gemini-web-container worked fine, thanks.

Can't tell if virgo.web.snapshot is working yet (the console output gives:

Status Code: 500
Exception:
Stacktrace:
java.lang.StringIndexOutOfBoundsException
at java.lang.String.substring(String.java:1092)
at hudson.MarkupText$SubText.getText(MarkupText.java:106)
at hudson.console.UrlAnnotator$UrlConsoleAnnotator.annotate(UrlAnnotator.java:30)
at hudson.console.ConsoleAnnotationOutputStream.eol(ConsoleAnnotationOutputStream.java:145)
at hudson.console.LineTransformationOutputStream.eol(LineTransformationOutputStream.java:60)
...

exception when I try to open it in the Hudson UI. This is another bug.)

Thanks.

This is a bug that has been fixed in the latest versions of Hudson. Also, don't watch the console too long, as eventually you'll get the error below as well.

(In reply to comment #22)
> Gemini-web-container worked fine -- thanks.
>
> Can't tell if virgo.web.snapshot is working yet (the console output gives:
>
> Status Code: 500
>
> Exception:
> Stacktrace:
> java.lang.StringIndexOutOfBoundsException
> at java.lang.String.substring(String.java:1092)
> at hudson.MarkupText$SubText.getText(MarkupText.java:106)
> at
> hudson.console.UrlAnnotator$UrlConsoleAnnotator.annotate(UrlAnnotator.java:30)
> at
> hudson.console.ConsoleAnnotationOutputStream.eol(ConsoleAnnotationOutputStream.java:145)
> at
> hudson.console.LineTransformationOutputStream.eol(LineTransformationOutputStream.java:60)
> ...
>
> exception when I try to open it in hudson ui. This is another bug.)
>
> Thanks.

15 hours ago this job was kicked off on build2: modisco-nightly, and this is the console log:

---------8<------------
Started by user gbarbier
ln -s 2010-07-29_12-27-21 /opt/users/hudsonbuild/.hudson/jobs/modisco-nightly/builds/73 failed: -1
Building remotely on build2
-----------------------

build2 is offline... (apparently it has timed out).

Oh look what job ran on build2 (apparently successfully) before this stopped working!!!

emf-graphiti-nightly #72

Wot a coincidence!
-----------------------

In any case, my virgo build jobs were all failing git accesses before that anyhow.

I've restarted the build2 slave process.
-M.

build2 is offline...

emf-graphiti-nightly #102 1 min 36 sec broken since build #72
emf-graphiti-nightly #101 2 min 28 sec broken since build #72
emf-graphiti-nightly #100 2 min 55 sec broken since build #72
emf-graphiti-nightly #99 9 min 47 sec broken since build #72
emf-graphiti-nightly #98 15 min broken since build #72
emf-graphiti-nightly #97 23 min broken since build #72
emf-graphiti-nightly #96 31 min broken since build #72
emf-graphiti-nightly #95 33 min broken since build #72
emf-graphiti-nightly #94 36 min broken since build #72
emf-graphiti-nightly #93 38 min broken since this build

Seems broken. This is master. All build2 builds are queued.
Since fixing the git connections issue in the other bug (I don't have the number handy), how have your builds been running?

Looks like build2 lost connection again. Seeing remote issues. I think it is pretty obvious that the Emma plugin isn't working right across slaves, so if they want to use that plugin, they should tie the job to master instead of build2.

(In reply to comment #28)
My builds have not recovered.

Fixed elsewhere. |