| Summary: | Hudson slaves appear "hung" | | |
|---|---|---|---|
| Product: | Community | Reporter: | David Williams <david_williams> |
| Component: | CI-Jenkins | Assignee: | Eclipse Webmaster <webmaster> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | blocker | | |
| Priority: | P3 | CC: | d_a_carver, gunnar, jacek.pospychala, nicolas.bros |
| Version: | unspecified | | |
| Target Milestone: | --- | | |
| Hardware: | PC | | |
| OS: | Windows 7 | | |
| Whiteboard: | | | |
|
Description
David Williams
I've restarted both slaves. I was updating the kernel on the dev.eclipse.org servers, but despite changing the IP address in the Hudson /etc/hosts file, it would appear Hudson keeps it cached or something.

It may be that already-running jobs cache the value? I'm not sure of the exact timing (or of you restarting the slaves), but I restarted my 'run aggregator' job and it appears to have started fine .... so I'll assume fixed ... and will re-open if it actually hangs again later (which I am not expecting).

Well ... now Hudson itself isn't responding ... probably something else ... but I can't confirm the slaves are working :/ https://hudson.eclipse.org/hudson/view/Repository%20Aggregation/

I believe I've been seeing hung jobs yesterday, before the kernel update. E.g. mine was hanging at "archiving artifacts". Aborting the job resulted in the same error as in bug 326283 ("ssh channel is closed").

I'm seeing this again (I think) on Slave 2, so am re-opening. I'm specifically watching indigo.runAggregator, but it seems several jobs are running without noticeable progress for at least 20 minutes (that's only the time I've "timed" it ... my runAggregator job should have finished an hour ago). I'll leave it running another 20 minutes, in case an admin gets a chance to look at processes or whatever while it's in this state. Then I'll cancel and try to restart it on Slave 1.

The downloads & build disk array (it's shared) is absolutely slammed. Perhaps wait a bit more for the congestion to clear?

(In reply to comment #6)
> The downloads & build disk array (it's shared) is absolutely slammed. Perhaps
> wait a bit more for the congestion to clear?

Sure, glad to. Is there some way I (and others) can "query" things like this? I know on the old setup I used to do "top" and I could tell if the average load was abnormally high ... just curious if there's something like that I can do to check myself in future.

(In reply to comment #6)
> The downloads & build disk array (it's shared) is absolutely slammed. Perhaps
> wait a bit more for the congestion to clear?

Oh ... and ... why would they be so busy? Is there some limiting bottleneck now that we have so much CPU? Or does it appear to be some temporary condition or operation? BTW, I'm seeing some odd JUnit test failures in webtools (which is independent of Hudson, but in /shared space) that fail with an unexplainable "file not found" sort of message ... even though the file is obviously there. Admittedly it _might_ still be some bug on our part ... but I'm sort of hoping it's this same "congestion" problem. Tracking it in bug 326442 just to cross-reference.

> Oh ... and ... why would they be so busy?

Too many files being shuffled left and right?

> Is there some limiting bottleneck now we have so much CPU?

It appears everyone needs to build all at the same time, then do nothing for the rest of the day. Just like rush-hour traffic :) Right now everything is pretty much idle. Let's just say there is always a bottleneck, and right now it is definitely not CPU.
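
For the earlier question about how to "query" load the way "top" was used on the old setup, here is a minimal sketch of checks that would distinguish a CPU bottleneck from congestion on the shared disk array. It assumes shell access to the build machines (which committers may not have on the shared infrastructure) and that the sysstat package is installed for iostat.

```sh
# Load average and top CPU consumers -- the old "top" check:
uptime
top -b -n 1 | head -20

# Extended disk stats; high %util and await with mostly idle CPU points at
# congestion on the shared disk array rather than a CPU bottleneck
# (requires the sysstat package):
iostat -x 5 3

# Processes stuck in uninterruptible I/O wait show up in state "D":
ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /D/'
```
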
(In reply to comment #10)
> Right now everything is pretty much idle.

You mean on the disks? If so, Hudson Slave 2 still appears stalled. I'll try another ... and probably make the disk problem worse :) But that job hasn't progressed in hours.

I do believe the slaves (I'm beginning to dislike that term) are once again in a bad state. My "aggregator job" seems hung ... and many jobs appear to be running but not making progress. Anything I can do to help diagnose or fix this?

It makes me sad to think we've already hit "maximum capacity" or something due to disk bottlenecks (if that's what it is). And not to add (too much) pressure ... we are supposed to declare Indigo M2 on Friday, and I am sure we'll need quite a few more builds in the next two days to do it.

We've restarted the slaves and turned the number of executors down to see if this is simply an issue of too many jobs running at once. As I understand it, the Apache folks have a pile of slaves, all with just a couple of executors. If this doesn't help, I think the only option left to try is JNLP slaves. -M.

Maybe I'm watching too much ... but it seems hung again already.

The slaves themselves are fine. But your job (and the Egit job that was also running on the same slave) simply stopped doing stuff. The Java process is there, but it doesn't seem to be doing anything.

In the *nix world I can use strace -p (pid) to see what a process is doing. I can't do this with a Java process since it's a virtual machine. Does anyone know how I can interrogate a Java process to see what it's currently doing?

(In reply to comment #15)
> The slaves themselves are fine. But your job (and the Egit job that was also
> running on the same slave) simply stopped doing stuff. The Java process is
> there, but it doesn't seem to be doing anything.
>
> In the *nix world I can use strace -p (pid) to see what a process is doing. I
> can't do this with a Java process since it's a virtual machine. Does anyone
> know how I can interrogate a Java process to see what it's currently doing?

Well, I usually get a stack trace from a dump by using kill -3 (pid) ... it's hard for me to predict where it puts it ... sometimes under the Java executable directory, sometimes under the Java executable (depending on how it was started, I guess). Is that what you mean?

That's what I did the last time I saw one of your hung processes, but the kill -3 simply killed it without saying a word. I've converted slave2 into a JNLP slave for comparison purposes. -M.
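
On interrogating a hung Java process, a minimal sketch, assuming shell access to the slave and a JDK (not just a JRE) of that era; the pid below is hypothetical. Note that kill -3 (SIGQUIT) does not normally terminate a HotSpot JVM: it prints a thread dump to the process's stdout, which for a Hudson slave typically lands in the slave or build log on the master rather than in a local file, so it can look as if nothing happened.

```sh
# A diagnostic sketch, not the project's actual procedure.
PID=12345   # hypothetical pid -- substitute the real one

# SIGQUIT asks a HotSpot JVM for a thread dump; the process keeps running,
# and the dump goes to the JVM's stdout (often the slave/build log):
kill -3 "$PID"

# With a JDK on the PATH, jstack captures the same dump directly, which is
# easier to save and attach to a bug:
jstack "$PID" > /tmp/threaddump-"$PID".txt

# strace still works on a JVM at the syscall level, e.g. to see whether the
# process is blocked writing to a clogged pipe; -f follows its threads:
strace -f -p "$PID" -e trace=read,write
```
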
Could this have anything to do with xvnc? The hung Egit job I was referring to (when your job was also hung) invoked xvnc. There is currently a hung gmp job, and it too use(d) xvnc.

I switched to Slave 2, but it appears to be "hung" already :(

> but the kill -3 simply killed it without saying a word.

Not sure what would cause that ... be sure the pid is directly for the Java process, if there appears to be more than one.

> Could this have anything to do with xvnc?

Don't know. I was going to say "sure", but my aggregation job just hung and it doesn't use it. Mind if I move back to "master" temporarily? Get one good build?

Go right ahead...

(In reply to comment #21)
> Go right ahead...

I did switch to master, and have gotten 4 or so aggregation builds on it ... in the meantime, Slave 1 appears to have some stalled jobs from others :(

> I did switch to master, and have gotten 4 or so aggregation builds on it ...
> in the meantime, Slave 1 appears to have some stalled jobs from others :(

I'll try to talk to the folks who run Hudson at Apache to see what their thoughts and experiences are.
Sometimes I wonder if we wouldn't be better served by just having two independent masters (one on each physical hudson server) that don't even talk to each other, but that store jobs and artifacts on /shared/hudson-master1 and /shared/hudson-master2 or something like that.
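
A purely hypothetical sketch of what the "two independent masters" idea might look like with the stock hudson.war launcher; the ports are invented, hudson.war is assumed to be in the current directory, and this is not how the Eclipse instance is actually set up.

```sh
# Two self-contained Hudson masters, each keeping its jobs and artifacts in
# its own HUDSON_HOME on the shared filesystem (ports are illustrative only):
HUDSON_HOME=/shared/hudson-master1 nohup java -jar hudson.war --httpPort=8081 &
HUDSON_HOME=/shared/hudson-master2 nohup java -jar hudson.war --httpPort=8082 &
```
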
Maybe it's clogged pipes. :)

I googled "slave jobs appear to hang on hudson" and was quickly led to Hudson bug http://issues.hudson-ci.org/browse/HUDSON-5934, which led to http://issues.hudson-ci.org/browse/HUDSON-5977, titled "Dead lock condition due to pipe clogging".

I'll admit I skim-read the technical parts ... but the symptoms sounded exactly the same ... and it sounded like it might result from a variety of IO operations ... and I couldn't help notice that 5977 concludes with "Reverting back from 1.378 to 1.377 resolves it." And I see we are at 1.378 ... how long has that been the case? Did we go to that level for some "must have" reason?

But, please read the technical details and see if relevant ... I'm just trying to provide a pointer. (I started off reading about strace, and quickly gave that up! :) Thanks.

Yesterday, as I was trying to build for Indigo M2, Hudson was shut down and restarted with 1.378. After that, each build I did on the slaves hung after a few minutes. I tried moving the job to master, and I got a good build. Today, I get errors:

hudson.util.IOException2: Failed to join the process
Caused by: java.util.concurrent.ExecutionException: hudson.remoting.RequestAbortedException: java.io.StreamCorruptedException: invalid type code: 32

hudson.util.IOException2: remote file operation failed: /tmp/hudson576336936282741195.sh at hudson.remoting.Channel@1f4ace71:hudson-slave1
Caused by: hudson.remoting.ChannelClosedException: channel is already closed
Caused by: java.io.StreamCorruptedException: invalid type code: 32

May I suggest downgrading back to 1.377?

btw. this problem seems to be tracked both by bug 326283 and bug 326366. Leaving it up to the webmasters to dupe whichever one they prefer, because right now they seem to enjoy commenting on both :-)

(In reply to comment #24)
> Maybe it's clogged pipes. :)
>
> I googled "slave jobs appear to hang on hudson" and was quickly led to Hudson
> bug http://issues.hudson-ci.org/browse/HUDSON-5934, which led to
> http://issues.hudson-ci.org/browse/HUDSON-5977, titled "Dead lock condition
> due to pipe clogging".

Hudson 1.378 was installed yesterday, which is supposed to take care of HUDSON-5977. In addition, to help hunt down why builds are hanging and the possible cause, this wiki page may help: http://wiki.hudson-ci.org/display/HUDSON/Build+is+hanging

(In reply to comment #27)
> Hudson 1.378 was installed yesterday, which is supposed to take care of
> HUDSON-5977.

I think that bug was re-opened, with comments similar to "This issue made 1.378 unusable." I notice that the Apache Hudson server is at 1.376. Maybe we should update only when they do :)

I'm only vaguely aware of the timeline here ... but it appears that bug was reopened on 9/26 with the comment "Issue seems to have worsened after the new patch."
http://issues.hudson-ci.org/browse/HUDSON-5977?focusedCommentId=142130&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_142130

And I get the impression we updated to 1.378 after 9/26. What's up with that? I thought this new, supported Hudson server was no longer going to be a bleeding-edge tester? [And, you all know I'm the last to be judgmental ... but the first to encourage stability :) ]

> And I get the impression we updated to 1.378 after 9/26. What's up with that? I
> thought this new, supported Hudson server was no longer going to be a bleeding-
> edge tester?

We upgraded to 1.378 because we were having SSH issues between the master and the slaves.
I think I am confused with the Hudson 'release' schedule. It seems that new Hudson 'releases' (1.376, 377, 378) are nothing more than Nightly builds? I thought these were actual stable releases...
At this point, I agree that going to 1.376 seems to be a good option.
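
For confirming which version a given instance is actually running (relevant to the 1.376/1.377/1.378 back-and-forth above), a small sketch that assumes the server exposes the X-Hudson response header, as Hudson instances of that era generally did:

```sh
# Hudson reports its version in an HTTP response header, so this works
# without shell access to the server (header name assumed; adjust if the
# instance reports it differently):
curl -sI https://hudson.eclipse.org/hudson/ | grep -i '^x-hudson'
```
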
(In reply to comment #31)
> We upgraded to 1.378 because we were having SSH issues between the master and
> the slaves.

Before upgrading to 1.378 to fix the remoting issues, my understanding is that Matt tested on the Sandbox instance first. Hudson now has an option to easily downgrade to a prior version, so they can do that as well.

(In reply to comment #31)
> I think I am confused with the Hudson 'release' schedule. It seems that new
> Hudson 'releases' (1.376, 377, 378) are nothing more than Nightly builds? I
> thought these were actual stable releases...

Hudson does weekly releases: every Friday they do a release. With Hudson there will never be a version that is "blessed" in the official sense of an Eclipse release. It is up to users to view the change log from time to time to see what has changed and whether any new bugs have been fixed, and to make the determination whether they want to upgrade or not.

More and more open source software is following the above pattern, and it may make some people uncomfortable. Personally I think the Hudson release schedule is a good thing: it gets critical functionality and bug fixes out sooner rather than later. 1.379 should be out this Friday.

I notice you did revert to 1.376, so I re-configured my job to run on the slaves, and at least one finished fine ... unlike the other day on 1.378, where none would finish. So ... as far as I'm concerned, this blocker can be resolved ... I'm not sure about the communication/ssh problem, but I guess that can be tracked in bug 326283. Much thanks.