Bug 320415 - Builds (on build2) cannot access git repositories
Summary: Builds (on build2) cannot access git repositories
Status: CLOSED FIXED
Alias: None
Product: Community
Classification: Eclipse Foundation
Component: CI-Jenkins
Version: unspecified
Hardware: PC Mac OS X - Carbon (unsup.)
Importance: P3 major
Target Milestone: ---
Assignee: CI Admin Inbox CLA
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-07-20 12:15 EDT by Steve Powell CLA
Modified: 2010-08-06 04:17 EDT
CC List: 3 users

See Also:


Attachments

Description Steve Powell CLA 2010-07-20 12:15:21 EDT
virgo.test.snapshot build hangs here:

Started by user spowell
ln -s 2010-07-20_11-52-50 /opt/users/hudsonbuild/.hudson/jobs/virgo.test.snapshot/builds/72 failed: -1
Building remotely on build2

virgo.web-server.snapshot build hangs likewise:

Started by user spowell
ln -s 2010-07-20_11-48-34 /opt/users/hudsonbuild/.hudson/jobs/virgo.web-server.snapshot/builds/48 failed: -1
Building remotely on build2

And virgo.web.snapshot likewise:

Started by user spowell
ln -s 2010-07-20_11-46-05 /opt/users/hudsonbuild/.hudson/jobs/virgo.web.snapshot/builds/53 failed: -1
Building remotely on build2

The Git polling log doesn't reveal a problem (apart from the usual submodule errors), so I don't know why the build doesn't progress to downloading the source. About an hour ago, some builds were working.
Comment 1 Steve Powell CLA 2010-07-20 12:18:27 EDT
Cancelling the builds appears to produce:

SCM check out aborted
Archiving artifacts

Is the build2 environment not set up correctly after the restart?
Comment 2 Steve Powell CLA 2010-07-20 12:19:54 EDT
...and then, eventually:

ERROR: Publisher hudson.tasks.ArtifactArchiver aborted due to exception
java.lang.InterruptedException
	at java.lang.Object.wait(Native Method)
	at java.lang.Object.wait(Object.java:196)
	at hudson.remoting.Request.call(Request.java:122)
	at hudson.remoting.Channel.call(Channel.java:551)
	at hudson.EnvVars.getRemote(EnvVars.java:196)
	at hudson.model.Computer.getEnvironment(Computer.java:736)
	at hudson.model.Run.getEnvironment(Run.java:1643)
	at hudson.model.AbstractBuild.getEnvironment(AbstractBuild.java:663)
	at hudson.tasks.ArtifactArchiver.perform(ArtifactArchiver.java:116)
	at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:19)
	at hudson.model.AbstractBuild$AbstractRunner.perform(AbstractBuild.java:582)
	at hudson.model.AbstractBuild$AbstractRunner.performAllBuildStep(AbstractBuild.java:563)
	at hudson.model.AbstractBuild$AbstractRunner.performAllBuildStep(AbstractBuild.java:550)
	at hudson.model.Build$RunnerImpl.post2(Build.java:152)
	at hudson.model.AbstractBuild$AbstractRunner.post(AbstractBuild.java:528)
	at hudson.model.Run.run(Run.java:1267)
	at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46)
	at hudson.model.ResourceController.execute(ResourceController.java:88)
	at hudson.model.Executor.run(Executor.java:122)
Recording test results

...and then another hang.
Comment 3 Steve Powell CLA 2010-07-20 12:21:32 EDT
OK, each abort request (clicking the red cross icon) advances the build by one more step. Here is the next one:


ERROR: Publisher hudson.tasks.junit.JUnitResultArchiver aborted due to exception
java.lang.InterruptedException
	at java.lang.Object.wait(Native Method)
	at java.lang.Object.wait(Object.java:196)
	at hudson.remoting.Request.call(Request.java:122)
	at hudson.remoting.Channel.call(Channel.java:551)
	at hudson.EnvVars.getRemote(EnvVars.java:196)
	at hudson.model.Computer.getEnvironment(Computer.java:736)
	at hudson.model.Run.getEnvironment(Run.java:1643)
	at hudson.model.AbstractBuild.getEnvironment(AbstractBuild.java:663)
	at hudson.tasks.junit.JUnitResultArchiver.perform(JUnitResultArchiver.java:117)
	at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:19)
	at hudson.model.AbstractBuild$AbstractRunner.perform(AbstractBuild.java:582)
	at hudson.model.AbstractBuild$AbstractRunner.performAllBuildStep(AbstractBuild.java:563)
	at hudson.model.AbstractBuild$AbstractRunner.performAllBuildStep(AbstractBuild.java:550)
	at hudson.model.Build$RunnerImpl.post2(Build.java:152)
	at hudson.model.AbstractBuild$AbstractRunner.post(AbstractBuild.java:528)
	at hudson.model.Run.run(Run.java:1267)
	at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46)
	at hudson.model.ResourceController.execute(ResourceController.java:88)
	at hudson.model.Executor.run(Executor.java:122)
Failed to send e-mail to Christopher Frost because no e-mail address is known, and no default e-mail domain is configured
Failed to send e-mail to Glyn Normington because no e-mail address is known, and no default e-mail domain is configured

(more hanging...)
Comment 4 Steve Powell CLA 2010-07-20 12:22:18 EDT
Last one is:

ERROR: Publisher hudson.tasks.Mailer aborted due to exception
java.lang.InterruptedException
	at java.lang.Object.wait(Native Method)
	at java.lang.Object.wait(Object.java:196)
	at hudson.remoting.Request.call(Request.java:122)
	at hudson.remoting.Channel.call(Channel.java:551)
	at hudson.FilePath.act(FilePath.java:736)
	at hudson.FilePath.act(FilePath.java:729)
	at hudson.FilePath.toURI(FilePath.java:784)
	at hudson.tasks.MailSender.createFailureMail(MailSender.java:259)
	at hudson.tasks.MailSender.getMail(MailSender.java:134)
	at hudson.tasks.MailSender.execute(MailSender.java:82)
	at hudson.tasks.Mailer.perform(Mailer.java:101)
	at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:19)
	at hudson.model.AbstractBuild$AbstractRunner.perform(AbstractBuild.java:582)
	at hudson.model.AbstractBuild$AbstractRunner.performAllBuildStep(AbstractBuild.java:563)
	at hudson.model.AbstractBuild$AbstractRunner.performAllBuildStep(AbstractBuild.java:550)
	at hudson.model.Build$RunnerImpl.post2(Build.java:152)
	at hudson.model.AbstractBuild$AbstractRunner.post(AbstractBuild.java:528)
	at hudson.model.Run.run(Run.java:1267)
	at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46)
	at hudson.model.ResourceController.execute(ResourceController.java:88)
	at hudson.model.Executor.run(Executor.java:122)
Finished: FAILURE

And we finish at last. What is going on here?
Comment 5 Steve Powell CLA 2010-07-20 12:24:01 EDT
Raised to major, since it affects all my build jobs.
Comment 6 Eclipse Webmaster CLA 2010-07-20 13:01:38 EDT
I restarted the slave to clear the apparently dangling build, and now the virgo.test.snapshot build seems to run without issue.

Perhaps there is an issue if the slave gets stuck and the master is subsequently restarted, but the slave is not.

-M.
Comment 7 Denis Roy CLA 2010-07-20 13:46:34 EDT
Are you checking out code via anonymous git (git://) or via ssh+git? If it's ssh, perhaps it's waiting for a password.
Comment 8 Steve Powell CLA 2010-07-21 03:26:25 EDT
Well, the problem seems to have cleared on virgo.test.snapshot.

We use git://, but that hasn't changed since the last successful build, so I don't think it is relevant, unless the server or slave configuration has changed in this area.

The problem is still apparent on virgo.web.snapshot (I've just kicked off another build and it is hanging in the same place). Killing it manually requires FOUR attempts (each click advances it through the hang points, as before).

I'm going to resubmit, but this is still an open bug.

(In reply to comment #7)
> Are you checking out code via anonymous git?  (git://) or are you using
> ssh+git?  If it's ssh, perhaps it's waiting for a password?
Comment 9 Steve Powell CLA 2010-07-21 03:43:28 EDT
There is a little (circumstantial) evidence that the emf-graphiti-nighly job (which is currently hung in the Emma coverage step on the slave) might be causing all other jobs to hang while fetching git source from the server.

The emf- jobs were cancelled (or failed with a time-out) yesterday before the test job was retried. The emf- job was restarted, is hung in the same place again today, and the other jobs (including the test job) are hanging as before.

My guess: server/slave communication is hanging in the Emma coverage plugin.
Witness the FastPipedInputStream exception at the end of the first emf- job's log:

BUILD SUCCESSFUL
Total time: 10 minutes 53 seconds
Archiving artifacts
Recording test results
Emma: looking for coverage reports in the provided path: build/result/test/output/*-coverageReport.xml
Emma: found 2 report files: 
          /opt/users/hudsonbuild/workspace/emf-graphiti-nighly/build/result/test/output/JUnit-graphiti-coverageReport.xml
          /opt/users/hudsonbuild/workspace/emf-graphiti-nighly/build/result/test/output/JUnit-graphitiUI-coverageReport.xml
Emma: stored 2 report files in the build folder: /opt/users/hudsonbuild/.hudson/jobs/emf-graphiti-nighly/builds/2010-07-20_11-06-04/emma
ERROR: Publisher hudson.plugins.emma.EmmaPublisher aborted due to exception
java.io.IOException
	at hudson.remoting.FastPipedInputStream.read(FastPipedInputStream.java:173)
	at sun.nio.cs.StreamDecoder$CharsetSD.readBytes(StreamDecoder.java:452)
	at sun.nio.cs.StreamDecoder$CharsetSD.implRead(StreamDecoder.java:494)
	at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:222)
	at java.io.InputStreamReader.read(InputStreamReader.java:177)
	at org.xmlpull.mxp1.MXParser.fillBuf(MXParser.java:2992)
	at org.xmlpull.mxp1.MXParser.more(MXParser.java:3046)
	at org.xmlpull.mxp1.MXParser.parseProlog(MXParser.java:1410)
	at org.xmlpull.mxp1.MXParser.nextImpl(MXParser.java:1395)
	at org.xmlpull.mxp1.MXParser.next(MXParser.java:1093)
	at org.xmlpull.mxp1.MXParser.nextTag(MXParser.java:1078)
	at hudson.plugins.emma.EmmaBuildAction.loadRatios(EmmaBuildAction.java:260)
	at hudson.plugins.emma.EmmaBuildAction.load(EmmaBuildAction.java:233)
	at hudson.plugins.emma.EmmaPublisher.perform(EmmaPublisher.java:126)
	at hudson.tasks.BuildStepMonitor$3.perform(BuildStepMonitor.java:36)
	at hudson.model.AbstractBuild$AbstractRunner.perform(AbstractBuild.java:582)
	at hudson.model.AbstractBuild$AbstractRunner.performAllBuildStep(AbstractBuild.java:563)
	at hudson.model.AbstractBuild$AbstractRunner.performAllBuildStep(AbstractBuild.java:550)
	at hudson.model.Build$RunnerImpl.post2(Build.java:152)
	at hudson.model.AbstractBuild$AbstractRunner.post(AbstractBuild.java:528)
	at hudson.model.Run.run(Run.java:1267)
	at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46)
	at hudson.model.ResourceController.execute(ResourceController.java:88)
	at hudson.model.Executor.run(Executor.java:122)
ln -s builds/2010-07-20_11-06-04 /opt/users/hudsonbuild/.hudson/jobs/emf-graphiti-nighly/builds/../lastSuccessful failed: -1
ln -s builds/2010-07-20_11-06-04 /opt/users/hudsonbuild/.hudson/jobs/emf-graphiti-nighly/builds/../lastStable failed: -1
Finished: SUCCESS

Note that this is reported as SUCCESS when it has clearly failed.

Am I right that the Emma coverage plug-in is a recent addition? I read some talk of "reinstating" it on another bug.
Comment 10 Steve Powell CLA 2010-07-21 03:46:23 EDT
Well, my theory has taken a knock. The emf- job has just failed (in the same way as before), but the test job still hangs at the start.

Unless the emf- job problem permanently disrupted slave/server communication, it might have nothing to do with it.
Comment 11 Steve Powell CLA 2010-07-21 03:51:45 EDT
Evidence FOR my theory: virgo.medic.snapshot (which runs on the master and NOT on the build2 slave) runs fine.

But virgo.test.snapshot (which runs on build2) is still hanging.
Comment 12 Eclipse Webmaster CLA 2010-07-21 10:40:10 EDT
> Note that this is reported as SUCCESS when it has clearly failed.

While the Emma plugin does seem to have exploded, that doesn't mean the build itself failed. Or is there something I'm missing?

And I just kicked off a virgo-test build on the slave and it ran without issue.

-M.
Comment 13 Steve Powell CLA 2010-07-22 07:50:24 EDT
Build #78 seems to have worked, thank you, and my other builds on build2 have started.

Was the build2 slave restarted before your test run, after the last failure of the emf- build?
Comment 14 Steve Powell CLA 2010-07-22 09:21:00 EDT
The past behaviour of emf-graphiti-nigh(t)ly is consistent with my suspicion that an Emma plug-in failure breaks piped communication between the slave and the server.

The emf- job can run on either master or build2. After it failed on build2 (reporting success but hanging in Emma), my jobs hung on any piped transfer from the server (e.g. source transfer). After emf- ran successfully, my jobs ran fine. (By success here I mean that it ran the Emma plugin without hanging. I don't know whether the slave was restarted; I can see no evidence of this in the build history.) This correlation remains highly suspicious.
Comment 15 Steve Powell CLA 2010-07-28 09:24:09 EDT
virgo.gemini-web-container.snapshot is stuck at the start again, like the ones before.
Comment 16 Steve Powell CLA 2010-07-28 09:26:39 EDT
So is virgo.kernel.snapshot. This started very recently: the kernel job was respun after previously hanging in tests.
Comment 17 Steve Powell CLA 2010-07-28 09:30:43 EDT
Lo and behold: almost immediately before the failure, the emf-graphiti-nightly (sic) job ran (apparently successfully) with the following errors in its log:

BUILD SUCCESSFUL
Total time: 11 minutes 46 seconds
Archiving artifacts
Recording test results
Emma: looking for coverage reports in the provided path: build/result/test/output/*-coverageReport.xml
Emma: found 2 report files: 
          /opt/users/hudsonbuild/workspace/emf-graphiti-nightly/build/result/test/output/JUnit-graphiti-coverageReport.xml
          /opt/users/hudsonbuild/workspace/emf-graphiti-nightly/build/result/test/output/JUnit-graphitiUI-coverageReport.xml
Emma: stored 2 report files in the build folder: /opt/users/hudsonbuild/.hudson/jobs/emf-graphiti-nightly/builds/2010-07-28_05-02-56/emma
ERROR: Publisher hudson.plugins.emma.EmmaPublisher aborted due to exception
java.io.IOException
	at hudson.remoting.FastPipedInputStream.read(FastPipedInputStream.java:173)
	at sun.nio.cs.StreamDecoder$CharsetSD.readBytes(StreamDecoder.java:452)
	at sun.nio.cs.StreamDecoder$CharsetSD.implRead(StreamDecoder.java:494)
	at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:222)
	at java.io.InputStreamReader.read(InputStreamReader.java:177)
	at org.xmlpull.mxp1.MXParser.fillBuf(MXParser.java:2992)
	at org.xmlpull.mxp1.MXParser.more(MXParser.java:3046)
	at org.xmlpull.mxp1.MXParser.parseProlog(MXParser.java:1410)
	at org.xmlpull.mxp1.MXParser.nextImpl(MXParser.java:1395)
	at org.xmlpull.mxp1.MXParser.next(MXParser.java:1093)
	at org.xmlpull.mxp1.MXParser.nextTag(MXParser.java:1078)
	at hudson.plugins.emma.EmmaBuildAction.loadRatios(EmmaBuildAction.java:260)
	at hudson.plugins.emma.EmmaBuildAction.load(EmmaBuildAction.java:233)
	at hudson.plugins.emma.EmmaPublisher.perform(EmmaPublisher.java:126)
	at hudson.tasks.BuildStepMonitor$3.perform(BuildStepMonitor.java:36)
	at hudson.model.AbstractBuild$AbstractRunner.perform(AbstractBuild.java:582)
	at hudson.model.AbstractBuild$AbstractRunner.performAllBuildStep(AbstractBuild.java:563)
	at hudson.model.AbstractBuild$AbstractRunner.performAllBuildStep(AbstractBuild.java:550)
	at hudson.model.Build$RunnerImpl.post2(Build.java:152)
	at hudson.model.AbstractBuild$AbstractRunner.post(AbstractBuild.java:528)
	at hudson.model.Run.run(Run.java:1267)
	at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46)
	at hudson.model.ResourceController.execute(ResourceController.java:88)
	at hudson.model.Executor.run(Executor.java:122)
ln -s builds/2010-07-28_05-02-56 /opt/users/hudsonbuild/.hudson/jobs/emf-graphiti-nightly/builds/../lastSuccessful failed: -1
ln -s builds/2010-07-28_05-02-56 /opt/users/hudsonbuild/.hudson/jobs/emf-graphiti-nightly/builds/../lastStable failed: -1
Finished: SUCCESS
Comment 18 Steve Powell CLA 2010-07-28 09:58:12 EDT
Incidentally, I left the two jobs running so you can see the problem. Kill them if you want to recycle build2. Please record any actions here.
Comment 19 Eclipse Webmaster CLA 2010-07-28 10:11:20 EDT
I restarted the slave, but the master seems to be stuck (not clearing) on the
virgo slave jobs. Once the helios build is done, I'll restart the master and
see if that clears the job lists.

-M.
Comment 20 Steve Powell CLA 2010-07-28 10:27:50 EDT
I killed my jobs; Hudson was shutting down and no one seemed to have tried killing them (they take a lot of killing; see the comments on this bug).
Comment 21 David Carver CLA 2010-07-28 10:33:11 EDT
(In reply to comment #19)
> I restarted the slave, but the master seems to be stuck(not clearing) on the
> virgo slave jobs.  Once the helios build is done I'll restart the master and
> see if that clears the job lists.
> 
> -M.

I killed the EPP packaging job and restarted Hudson. That job was 2 hours into an 8-hour run. They have said it is safe to kill it and simply restart it after Hudson is restarted, so I'll restart the EPP packaging job when Hudson is back up.
Comment 22 Steve Powell CLA 2010-07-28 11:21:42 EDT
Gemini-web-container worked fine, thanks.

I can't tell yet whether virgo.web.snapshot is working: when I try to open its console output in the Hudson UI, I get:

Status Code: 500

Exception: 
Stacktrace: 
java.lang.StringIndexOutOfBoundsException
	at java.lang.String.substring(String.java:1092)
	at hudson.MarkupText$SubText.getText(MarkupText.java:106)
	at hudson.console.UrlAnnotator$UrlConsoleAnnotator.annotate(UrlAnnotator.java:30)
	at hudson.console.ConsoleAnnotationOutputStream.eol(ConsoleAnnotationOutputStream.java:145)
	at hudson.console.LineTransformationOutputStream.eol(LineTransformationOutputStream.java:60)
...

This is a separate bug.

Thanks.
Comment 23 David Carver CLA 2010-07-28 12:02:46 EDT
This is a bug that has been fixed in the latest versions of Hudson. Also, don't watch the console too long, as you'll eventually get the error quoted below as well.



(In reply to comment #22)
> Gemini-web-container worked fine -- thanks.
> 
> Can't tell if virgo.web.snapshot is working yet (the console output gives:
> 
> Status Code: 500
> 
> Exception: 
> Stacktrace: 
> java.lang.StringIndexOutOfBoundsException
>     at java.lang.String.substring(String.java:1092)
>     at hudson.MarkupText$SubText.getText(MarkupText.java:106)
>     at hudson.console.UrlAnnotator$UrlConsoleAnnotator.annotate(UrlAnnotator.java:30)
>     at hudson.console.ConsoleAnnotationOutputStream.eol(ConsoleAnnotationOutputStream.java:145)
>     at hudson.console.LineTransformationOutputStream.eol(LineTransformationOutputStream.java:60)
> ...
> 
> exception when I try to open it in hudson ui.  This is another bug.)
> 
> Thanks.
Comment 24 Steve Powell CLA 2010-07-30 04:05:58 EDT
15 hours ago this job was kicked off on build2:

modisco-nightly

and this is the Console Log:
---------8<------------
Started by user gbarbier
ln -s 2010-07-29_12-27-21 /opt/users/hudsonbuild/.hudson/jobs/modisco-nightly/builds/73 failed: -1
Building remotely on build2
-----------------------

build2 is offline (apparently it has timed out).

Oh, look which job ran on build2 (apparently successfully) before this stopped working!

	emf-graphiti-nightly #72

What a coincidence!
-----------------------

In any case, my virgo build jobs were all failing on git access before that anyway.
Comment 25 Eclipse Webmaster CLA 2010-07-30 14:57:12 EDT
I've restarted the build2 slave process.

-M.
Comment 26 Steve Powell CLA 2010-08-02 09:05:29 EDT
build2 is offline.
Comment 27 Steve Powell CLA 2010-08-02 09:10:45 EDT
	emf-graphiti-nightly #102	1 min 36 sec	broken since build #72	
	emf-graphiti-nightly #101	2 min 28 sec	broken since build #72	
	emf-graphiti-nightly #100	2 min 55 sec	broken since build #72	
	emf-graphiti-nightly #99	9 min 47 sec	broken since build #72	
	emf-graphiti-nightly #98	15 min	broken since build #72	
	emf-graphiti-nightly #97	23 min	broken since build #72	
	emf-graphiti-nightly #96	31 min	broken since build #72	
	emf-graphiti-nightly #95	33 min	broken since build #72	
	emf-graphiti-nightly #94	36 min	broken since build #72	
	emf-graphiti-nightly #93	38 min	broken since this build	

This job seems broken; it is running on master. All build2 builds are queued.
Comment 28 Denis Roy CLA 2010-08-04 11:39:30 EDT
Since we fixed the git connection issue in the other bug (I don't have the number handy), how have your builds been running?
Comment 29 David Carver CLA 2010-08-04 11:41:38 EDT
It looks like build2 lost its connection again. I'm seeing remote issues.

I think it is pretty obvious that the Emma plugin isn't working right across slaves. So if they want to use that plugin, they should tie the job to master instead of build2.
Comment 30 Steve Powell CLA 2010-08-04 12:59:52 EDT
(In reply to comment #28)
My builds have not recovered.
Comment 31 Steve Powell CLA 2010-08-06 04:17:21 EDT
Fixed elsewhere.