Bug 326366

Summary: Hudson slaves appear "hung"
Product: Community
Component: CI-Jenkins
Status: RESOLVED FIXED
Severity: blocker
Priority: P3
Version: unspecified
Target Milestone: ---
Hardware: PC
OS: Windows 7
Reporter: David Williams <david_williams>
Assignee: Eclipse Webmaster <webmaster>
QA Contact:
CC: d_a_carver, gunnar, jacek.pospychala, nicolas.bros
Whiteboard:

Description David Williams CLA 2010-09-27 20:32:42 EDT
Slaves 1 and 2, at least. 

Not sure what I'm seeing, but my "run aggregator" job has not made any apparent progress for 20-30 minutes:

https://hudson.eclipse.org/hudson/view/Repository%20Aggregation/job/indigo.runAggregator/43/consoleFull

And judging from the "image" of jobs in progress, no jobs are finishing/continuing. 

And I'm not sure what to do to help diagnose this, or what else to investigate. 

I'll soon just try to cancel my job and restart it ... but I thought I'd report this apparent hang first.
Comment 1 Denis Roy CLA 2010-09-27 21:30:12 EDT
I've restarted both slaves.  I was updating the kernel on the dev.eclipse.org servers, but despite changing the IP address in the Hudson /etc/hosts file, it would appear Hudson keeps the old address cached or something.
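If it really is being cached at the JVM level, one thing worth checking (just a guess on my part; the java.security path below depends on the JRE layout) is Java's own InetAddress cache, which also applies to /etc/hosts lookups:

    # TTL (seconds) for cached successful host lookups; -1 means "cache forever"
    grep networkaddress.cache.ttl $JAVA_HOME/jre/lib/security/java.security
    # or cap it when starting the Hudson master, e.g.
    java -Dsun.net.inetaddr.ttl=60 -jar hudson.war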
Comment 2 David Williams CLA 2010-09-27 21:39:11 EDT
It may be that already-running jobs cache the value? I'm not sure of the exact timing (or of your restarting the slaves), but I restarted my 'run aggregator' job and it appears to have started fine ... so I'll assume fixed ... and will re-open if it actually hangs again later (which I am not expecting).
Comment 3 David Williams CLA 2010-09-28 00:21:22 EDT
Well ... now Hudson itself isn't responding ... probably something else ... but I can't confirm the slaves are working :/ 

https://hudson.eclipse.org/hudson/view/Repository%20Aggregation/
Comment 4 Jacek Pospychala CLA 2010-09-28 04:30:41 EDT
I believe I was seeing hung jobs yesterday, before the kernel update.
E.g. mine was hanging at "archiving artifacts". Aborting the job resulted in the same error as in bug 326283 ("ssh channel is closed").
Comment 5 David Williams CLA 2010-09-28 13:45:13 EDT
I'm seeing this again (I think) on Slave 2, so I am re-opening. 
I'm specifically watching 
	indigo.runAggregator
but it seems several jobs are running without noticeable progress for at least 20 minutes (that's only the time I've "timed" it ... my runAggregator job should have finished an hour ago). 

I'll leave it running another 20 minutes, in case an admin gets a chance to look at processes or whatever while it's in this state. Then I'll cancel and try to restart it on Slave 1.
Comment 6 Denis Roy CLA 2010-09-28 13:47:58 EDT
The downloads & build disk array (it's shared) is absolutely slammed.  Perhaps wait a bit more for the congestion to clear?
Comment 7 David Williams CLA 2010-09-28 13:55:36 EDT
(In reply to comment #6)
> The downloads & build disk array (it's shared) is absolutely slammed.  Perhaps
> wait a bit more for the congestion to clear?

Sure, glad to. Is there some way I (and others) can "query" things like this? I know on the old setup I used to run "top" and could tell if the average load was abnormally high ... just curious if there's something like that I can do to check myself in the future.
Comment 8 David Williams CLA 2010-09-28 14:06:48 EDT
(In reply to comment #6)
> The downloads & build disk array (it's shared) is absolutely slammed.  Perhaps
> wait a bit more for the congestion to clear?

Oh ... and ... why would they be so busy? Is there some limiting bottleneck now that we have so much CPU? Or does it appear to be some temporary condition or operation?
Comment 9 David Williams CLA 2010-09-28 14:13:42 EDT
BTW, I'm seeing some odd JUnit test failures in webtools (which is independent of Hudson, but in /shared space) that fail with an unexplainable "file not found" sort of message ... even though the file is obviously there. Admittedly it _might_ still be some bug on our part ... but I'm sort of hoping it's this same "congestion" problem. Tracking it in bug 326442 just to cross-reference.
Comment 10 Denis Roy CLA 2010-09-28 14:38:15 EDT
> Oh ... and ... why would they be so busy?

Too many files being shuffled left and right?

> Is there some limiting bottleneck now we have so much CPU?

It appears everyone needs to build all at the same time, then do nothing for the rest of the day.  Just like rush-hour traffic  :)  Right now everything is pretty much idle.  Let's just say there is always a bottleneck, and right now it is definitely not CPU.
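To answer the earlier question about checking this yourself: on any box where you have shell access, something along these lines shows whether the pressure is CPU or disk (assuming the sysstat package is installed for iostat):

    uptime                  # load averages
    top -b -n 1 | head -20  # one-shot snapshot of CPU use and I/O wait
    iostat -x 5 3           # per-device stats; high %util with mostly idle CPU points at the disks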
Comment 11 David Williams CLA 2010-09-28 14:51:36 EDT
(In reply to comment #10)
>  Right now everything is
> pretty much idle. 

You mean on the disks? If so, Hudson Slave 2 still appears stalled. 

I'll try another ... and probably make the disk problem worse :) 
But that job hasn't progressed in hours.
Comment 12 David Williams CLA 2010-09-29 13:39:14 EDT
I do believe the slaves (I'm beginning to dislike that term) are once again in a bad state? My "aggregator job" seems hung ... and many jobs appear to be running, but not making progress? 

Anything I can do to help diagnose or fix this? It makes me sad to think we've already hit "maximum capacity" or something due to disk bottlenecks (if that's what it is). 
 
And not to add (too much) pressure ... we are supposed to declare Indigo M2 on Friday, and I am sure we'll need quite a few more builds in the next two days to do it.
Comment 13 Eclipse Webmaster CLA 2010-09-29 14:16:17 EDT
We've restarted the slaves and turned the number of executors down to see if this is simply an issue of too many jobs running at once.  As I understand it, the Apache folks have a pile of slaves, each with just a couple of executors.

If this doesn't help I think the only option left to try is JNLP slaves.
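For reference, a JNLP slave connects out from the slave machine itself instead of being reached over SSH. Launching one looks roughly like this (the node name in the URL is only an example; the real command is shown on the slave's node page):

    # run on the slave machine; slave.jar is served by the master
    wget https://hudson.eclipse.org/hudson/jnlpJars/slave.jar
    java -jar slave.jar -jnlpUrl https://hudson.eclipse.org/hudson/computer/hudson-slave2/slave-agent.jnlp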

-M.
Comment 14 David Williams CLA 2010-09-29 15:24:26 EDT
Maybe I'm watching too much ... but it seems hung again already.
Comment 15 Denis Roy CLA 2010-09-29 16:06:25 EDT
The slaves themselves are fine.  But your job (and the Egit job that was also running on the same slave) simply stopped doing stuff.  The java process is there, but it doesn't seem to be doing anything.

In the *nix world I can use strace -p (pid) to see what a process is doing.  I can't do this with a Java process since it's a virtual machine. Does anyone know how I can interrogate a java process to see what it's currently doing?
Comment 16 David Williams CLA 2010-09-29 16:10:43 EDT
(In reply to comment #15)
> The slaves themselves are fine.  But your job (and the Egit job that was also
> running on the same slave) simply stopped doing stuff.  The java process is
> there, but it doesn't seem to be doing anything.
> 
> In the *nix world I can use strace -p (pid) to see what a process is doing.  I
> can't do this with a Java process since it's a virtual machine. Does anyone
> know how I can interrogate a java process to see what it's currently doing?

Well, I usually get a stack trace from a dump by using kill -3 (pid) ... it's hard for me to predict where it puts it ... sometimes under the Java executable's directory, sometimes under the working directory (depending on how it was started, I guess). Is that what you mean?
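Roughly what I do, for reference (this assumes a HotSpot JDK on the slave, and the job name is only an example pattern; with kill -3 the dump goes to the JVM's stdout, which for a Hudson job usually means the build's console log):

    PID=$(pgrep -f indigo.runAggregator)   # find the build's java process
    kill -3 $PID                           # JVM prints a full thread dump to its stdout
    jstack $PID > /tmp/threaddump.txt      # or capture it directly with the JDK's jstack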
Comment 17 Denis Roy CLA 2010-09-29 16:18:26 EDT
That's what I did the last time I saw one of your hung processes, but the kill -3 simply killed it without saying a word.
Comment 18 Eclipse Webmaster CLA 2010-09-29 16:24:00 EDT
I've converted slave2 into a JNLP slave for comparison purposes.

-M.
Comment 19 Denis Roy CLA 2010-09-29 16:53:47 EDT
Could this have anything to do with xvnc?  The hung egit job I was referring to (when your job was also hung) invoked xvnc.  There is currently a hung gmp job, and it too use(d) xvnc.
Comment 20 David Williams CLA 2010-09-29 17:44:52 EDT
I switched to Slave 2, but it appears to be "hung" already :( 

> but the kill -3 simply killed it without saying a word.
Not sure what would cause that ... be sure the pid is directly for the java process if there appears to be more than one. 

> Could this have anything to do with xvnc? 
Don't know. I was going to say "sure", but my aggregation job just hung and it doesn't use it. 

Mind if I move back to "master" temporarily, to get one good build?
Comment 21 Denis Roy CLA 2010-09-29 22:19:15 EDT
Go right ahead...
Comment 22 David Williams CLA 2010-09-29 22:39:16 EDT
(In reply to comment #21)
> Go right ahead...

I did switch to master, and have gotten 4 or so aggregation builds on it ... in the meantime, Slave 1 appears to have some stalled jobs from others :(
Comment 23 Denis Roy CLA 2010-09-29 23:12:03 EDT
> I did switch to master, and have gotten 4 or so aggregation builds on it ... in
> the meantime, Slave 1 appears to have some stalled jobs from others :(

I'll try to talk to the folks who run Hudson at Apache to see what their thoughts and experiences are.

Sometimes I wonder if we wouldn't be better served by just having two independent masters (one on each physical hudson server) that don't even talk to each other, but that store jobs and artifacts on /shared/hudson-master1 and /shared/hudson-master2 or something like that.
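Purely as a sketch of the idea (the paths and port here are hypothetical), each master would just get its own HUDSON_HOME on the shared storage:

    # master 1 on the first build server
    HUDSON_HOME=/shared/hudson-master1 java -jar hudson.war --httpPort=8080
    # master 2 on the second build server, with its own jobs and artifacts
    HUDSON_HOME=/shared/hudson-master2 java -jar hudson.war --httpPort=8080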
Comment 24 David Williams CLA 2010-09-30 03:12:30 EDT
Maybe it's clogged pipes. :) 

I googled "slave jobs appear to hang on hudson" and was quickly led to Hudson bug 
http://issues.hudson-ci.org/browse/HUDSON-5934
which led to 
http://issues.hudson-ci.org/browse/HUDSON-5977 titled "Dead lock condition due to pipe clogging"
 
I'll admit I skim-read the technical parts ... but symptoms sounded exactly the same ... and sounded like it might result from a variety of IO operations ... and I couldn't help noticing that 5977 concludes with "Reverting back from 1.378 to 1.377 resolves it." And I see we are at 1.378 ... how long has that been the case? Did we go to that level for some "must have" reason? 
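For anyone wanting to check the running version without admin access: it is shown in the page footer, and, assuming Hudson still sends its X-Hudson response header, something like this works too:

    curl -sI https://hudson.eclipse.org/hudson/ | grep -i '^x-hudson'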
 
But, please read the technical details and see if relevant ... I'm just trying to provide a pointer. (I started off reading about strace, and quickly gave that up! :) 

Thanks,
Comment 25 Nicolas Bros CLA 2010-09-30 03:30:43 EDT
Yesterday, as I was trying to build for Indigo M2, Hudson was shut down and restarted with 1.378.
After that, each build I did on the slaves hung after a few minutes.
I tried moving the job to master, and I got a good build.

Today, I get errors:
hudson.util.IOException2: Failed to join the process
Caused by: java.util.concurrent.ExecutionException: hudson.remoting.RequestAbortedException: java.io.StreamCorruptedException: invalid type code: 32

hudson.util.IOException2: remote file operation failed: /tmp/hudson576336936282741195.sh at hudson.remoting.Channel@1f4ace71:hudson-slave1
Caused by: hudson.remoting.ChannelClosedException: channel is already closed
Caused by: java.io.StreamCorruptedException: invalid type code: 32

May I suggest downgrading back to 1.377?
Comment 26 Jacek Pospychala CLA 2010-09-30 10:32:33 EDT
BTW, this problem seems to be tracked by both bug 326283 and bug 326366.
Leaving it up to the webmasters to dupe whichever one they prefer, because now they seem to enjoy commenting on both :-)
Comment 27 David Carver CLA 2010-09-30 10:50:02 EDT
(In reply to comment #24)
> Maybe it's clogged pipes. :) 
> 
> I googled "slave jobs appear to hang on hudson" and was quickly led to Hudson
> bug 
> http://issues.hudson-ci.org/browse/HUDSON-5934
> which led to 
> http://issues.hudson-ci.org/browse/HUDSON-5977 titled "Dead lock condition due
> to pipe clogging"
> 
> I'll admit I skim-read the technical parts ... but symptoms sounded exactly the
> same ... and sounded like it might result from a variety of IO operations ...
> and I couldn't help noticing that 5977 concludes with "Reverting back from 1.378 to
> 1.377 resolves it." And I see we are at 1.378 ... how long has that been the
> case? Did we go to that level for some "must have" reason? 
> 
> But, please read the technical details and see if relevant ... I'm just trying
> to provide a pointer. (I started off reading about strace, and quickly gave
> that up! :) 

Hudson 1.378 was installed yesterday, which is supposed to take care of HUDSON-5977.
Comment 28 David Carver CLA 2010-09-30 10:51:50 EDT
In addition, to help hunt down why builds are hanging and what the possible cause is, this wiki page may help:

http://wiki.hudson-ci.org/display/HUDSON/Build+is+hanging
Comment 29 David Williams CLA 2010-09-30 11:06:59 EDT
(In reply to comment #27)
> (In reply to comment #24)
> > Maybe it's clogged pipes. :) 

> 
> Hudson 1.378 was installed yesterday which is supposed to take care of
> Hudson-5977.

I think that bug was re-opened with comments similar to "This issue made 1.378 unusable."

I notice that the Apache Hudson server is at 1.376.
Maybe we should update only when they do :)
Comment 30 David Williams CLA 2010-09-30 11:27:37 EDT
I'm only vaguely aware of the timeline here ... but it appears that bug was reopened on 9/26 with the comment "Issue seems to have worsened after the new patch."

http://issues.hudson-ci.org/browse/HUDSON-5977?focusedCommentId=142130&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_142130

And I get the impression we updated to 1.378 after 9/26. What's up with that? I thought this new, supported hudson server was no longer going to be a bleeding edge tester? 

[And, you all know I'm the last to be judgmental ... but the first to encourage stability :) ]
Comment 31 Denis Roy CLA 2010-09-30 11:39:30 EDT
> And I get the impression we updated to 1.378 after 9/26. What's up with that? I
> thought this new, supported hudson server was no longer going to be a bleeding
> edge tester? 

We upgraded to 1.378 because we were having SSH issues between the master and the slaves.

I think I am confused with the Hudson 'release' schedule.  It seems that new Hudson 'releases' (1.376, 377, 378) are nothing more than Nightly builds?  I thought these were actual stable releases...

At this point, I agree that going to 1.376 seems to be a good option.
Comment 32 David Carver CLA 2010-09-30 11:42:37 EDT
(In reply to comment #31)
> > And I get the impression we updated to 1.378 after 9/26. What's up with that? I
> > thought this new, supported hudson server was no longer going to be a bleeding
> > edge tester? 
> 
> We upgraded to 1.378 because we were having SSH issues between the master and
> the slaves.
> 
> I think I am confused with the Hudson 'release' schedule.  It seems that new
> Hudson 'releases' (1.376, 377, 378) are nothing more than Nightly builds?  I
> thought these were actual stable releases...
> 
> At this point, I agree that going to 1.376 seems to be a good option.

Before upgrading to 1.378 to fix Remoting issues, my understanding is that Matt tested on the Sandbox instance first.

Hudson now has an option to easily downgrade to a prior version, so they can do that as well.
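Failing that, a manual fallback is just to swap the war back by hand; purely an illustration (the paths and init script are hypothetical and depend on how the master is deployed), assuming a copy of the older war was kept:

    sudo /etc/init.d/hudson stop
    sudo cp /opt/hudson/hudson-1.376.war /opt/hudson/hudson.war
    sudo /etc/init.d/hudson start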
Comment 33 David Carver CLA 2010-09-30 11:53:29 EDT
(In reply to comment #31)
> > And I get the impression we updated to 1.378 after 9/26. What's up with that? I
> > thought this new, supported hudson server was no longer going to be a bleeding
> > edge tester? 
> 
> We upgraded to 1.378 because we were having SSH issues between the master and
> the slaves.
> 
> I think I am confused with the Hudson 'release' schedule.  It seems that new
> Hudson 'releases' (1.376, 377, 378) are nothing more than Nightly builds?  I
> thought these were actual stable releases...
> 
> At this point, I agree that going to 1.376 seems to be a good option.

Hudson does weekly releases: every Friday they do a release.

With Hudson there will never be a version that is "blessed" in the official sense of an Eclipse release.  It is up to users to review the changelog from time to time to see what has changed and which bugs have been fixed, and to decide whether they want to upgrade or not.

More and more open source software is following this pattern, and it may make some people uncomfortable.

Personally I think the Hudson release schedule is a good thing: it gets critical functionality and bug fixes out sooner rather than later.  1.379 should be out this Friday.
Comment 34 David Williams CLA 2010-09-30 16:27:30 EDT
I notice you did revert to 1.376, so I re-configured my job to run on the slaves, and at least one finished fine ... unlike the other day on 1.378, when none would finish. 

So ... as far as I'm concerned, this blocker can be resolved ... I'm not sure about the communication/SSH problem, but I guess that can be tracked in bug 326283. 

Much thanks.