| Summary: | job failing to start due to "SSH channel is closed" when archiving previous result on hudson-slave2 | | |
|---|---|---|---|
| Product: | Community | Reporter: | Jacek Pospychala <jacek.pospychala> |
| Component: | CI-Jenkins | Assignee: | Eclipse Webmaster <webmaster> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | major | | |
| Priority: | P3 | CC: | david_williams, dennis.huebner, d_a_carver, ed, michal.ruza, mtaal, nicolas.bros, rschnekenburger, sbernard, sbouchet, zteve.powell |
| Version: | unspecified | | |
| Target Milestone: | --- | | |
| Hardware: | PC | | |
| OS: | Mac OS X - Carbon (unsup.) | | |
| Whiteboard: | | | |
Description
Jacek Pospychala (reporter)
I've restarted the slave thread on slave-2. Can you try your job now?

(In reply to comment #1)
> I've restarted the slave thread on slave-2. Can you try your job now?

Hi Denis, I just had the same problem on slave1, so I moved the job to slave2 and it seems to work. Thanks for your work! Rémi

Yes, I ran cbi-pdt-2.2-helios again on hudson-slave2 and there's no such error any more. Thanks Denis.

I've kicked the agent on slave1 as well.

(In reply to comment #4)
> I've kicked the agent on slave1 as well.

Just an FYI, some remoting issues that could be related have been fixed in the latest Hudson. http://hudson-labs.org/changelog

Folks are reporting this is fixed on both slaves.

I don't like to reopen things, but this seems to still occur. Now I hit it when trying to see what's in the job workspace: https://hudson.eclipse.org/hudson/view/Athena%20CBI/job/cbi-pdt-2.2-helios/ws/

```
Status Code: 500
Exception:
Stacktrace:
hudson.util.IOException2: remote file operation failed: /opt/users/hudsonbuild/workspace/cbi-pdt-2.2-helios at hudson.remoting.Channel@27bf6624:hudson-slave2
    at hudson.FilePath.act(FilePath.java:749)
    at hudson.FilePath.act(FilePath.java:735)
    at hudson.FilePath.exists(FilePath.java:1003)
    at hudson.model.AbstractProject.doWs(AbstractProject.java:1543)
    at sun.reflect.GeneratedMethodAccessor462.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.kohsuke.stapler.Function$InstanceFunction.invoke(Function.java:259)
    at org.kohsuke.stapler.Function.bindAndInvoke(Function.java:126)
    at org.kohsuke.stapler.Function.bindAndInvokeAndServeResponse(Function.java:65)
    at org.kohsuke.stapler.MetaClass$1.doDispatch(MetaClass.java:79)
    at org.kohsuke.stapler.NameBasedDispatcher.dispatch(NameBasedDispatcher.java:30)
    at org.kohsuke.stapler.Stapler.invoke(Stapler.java:534)
    at org.kohsuke.stapler.MetaClass$7.doDispatch(MetaClass.java:219)
    at org.kohsuke.stapler.NameBasedDispatcher.dispatch(NameBasedDispatcher.java:30)
    at org.kohsuke.stapler.Stapler.invoke(Stapler.java:534)
    at org.kohsuke.stapler.MetaClass$7.doDispatch(MetaClass.java:219)
    at org.kohsuke.stapler.NameBasedDispatcher.dispatch(NameBasedDispatcher.java:30)
    at org.kohsuke.stapler.Stapler.invoke(Stapler.java:534)
    at org.kohsuke.stapler.Stapler.invoke(Stapler.java:450)
    at org.kohsuke.stapler.Stapler.service(Stapler.java:132)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:45)
    at winstone.ServletConfiguration.execute(ServletConfiguration.java:249)
    at winstone.RequestDispatcher.forward(RequestDispatcher.java:335)
    at winstone.RequestDispatcher.doFilter(RequestDispatcher.java:378)
    at hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:94)
    at hudson.plugins.audit_trail.AuditTrailFilter.doFilter(AuditTrailFilter.java:64)
    at hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:97)
    at hudson.util.PluginServletFilter.doFilter(PluginServletFilter.java:86)
    at winstone.FilterConfiguration.execute(FilterConfiguration.java:195)
    at winstone.RequestDispatcher.doFilter(RequestDispatcher.java:368)
    at hudson.security.csrf.CrumbFilter.doFilter(CrumbFilter.java:47)
    at winstone.FilterConfiguration.execute(FilterConfiguration.java:195)
    at winstone.RequestDispatcher.doFilter(RequestDispatcher.java:368)
    at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:84)
    at hudson.security.UnwrapSecurityExceptionFilter.doFilter(UnwrapSecurityExceptionFilter.java:51)
    at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87)
    at org.acegisecurity.ui.ExceptionTranslationFilter.doFilter(ExceptionTranslationFilter.java:166)
    at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87)
    at org.acegisecurity.providers.anonymous.AnonymousProcessingFilter.doFilter(AnonymousProcessingFilter.java:125)
    at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87)
    at org.acegisecurity.ui.rememberme.RememberMeProcessingFilter.doFilter(RememberMeProcessingFilter.java:142)
    at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87)
    at org.acegisecurity.ui.AbstractProcessingFilter.doFilter(AbstractProcessingFilter.java:271)
    at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87)
    at org.acegisecurity.ui.basicauth.BasicProcessingFilter.doFilter(BasicProcessingFilter.java:173)
    at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87)
    at org.acegisecurity.context.HttpSessionContextIntegrationFilter.doFilter(HttpSessionContextIntegrationFilter.java:249)
    at hudson.security.HttpSessionContextIntegrationFilter2.doFilter(HttpSessionContextIntegrationFilter2.java:66)
    at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87)
    at hudson.security.ChainedServletFilter.doFilter(ChainedServletFilter.java:76)
    at hudson.security.HudsonFilter.doFilter(HudsonFilter.java:164)
    at winstone.FilterConfiguration.execute(FilterConfiguration.java:195)
    at winstone.RequestDispatcher.doFilter(RequestDispatcher.java:368)
    at winstone.RequestDispatcher.forward(RequestDispatcher.java:333)
    at winstone.RequestHandlerThread.processRequest(RequestHandlerThread.java:244)
    at winstone.RequestHandlerThread.run(RequestHandlerThread.java:150)
    at java.lang.Thread.run(Thread.java:619)
Caused by: java.io.IOException: SSH channel is closed. (Close requested by remote)
    at com.trilead.ssh2.channel.ChannelManager.sendData(ChannelManager.java:383)
    at com.trilead.ssh2.channel.ChannelOutputStream.write(ChannelOutputStream.java:63)
    at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1838)
    at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1747)
    at java.io.ObjectOutputStream.writeNonProxyDesc(ObjectOutputStream.java:1249)
    at java.io.ObjectOutputStream.writeClassDesc(ObjectOutputStream.java:1203)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1387)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1150)
    at java.io.ObjectOutputStream.writeFatalException(ObjectOutputStream.java:1538)
    at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:329)
    at hudson.remoting.Channel.send(Channel.java:419)
    at hudson.remoting.Request.call(Request.java:105)
    at hudson.remoting.Channel.call(Channel.java:557)
    at hudson.FilePath.act(FilePath.java:742)
    ... 57 more
```

(Only on slave-2; after switching to slave1 the link works.)

I'm upping the priority, because earlier at least one slave was working correctly, but now the job is failing on both of them. For now I have resorted to building on master, but that's tough, given the length of its queue... sometimes :-) I'm wondering why other jobs are running while mine is failing. I tried another job (cbi-atf), which was always green (blue), but it failed as well now.

It is happening to the emf-cdo-integration job on both slaves too. Building on master with the currently configured number of builders is not very flexible.
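For context on why a dead SSH connection surfaces as "remote file operation failed": Hudson runs workspace operations by sending a serialized callable from the master to the slave over the remoting channel, and on these slaves that channel rides on a trilead-ssh2 SSH session. The sketch below is hypothetical illustration code against the Hudson-era FilePath/FileCallable API (class and method names are from that API; the class name, method name, and parameters around them are made up for the example). It is not code from this bug or from Hudson itself.

```java
import hudson.FilePath;
import hudson.remoting.VirtualChannel;
import java.io.File;
import java.io.IOException;

/**
 * Minimal, hypothetical sketch (not Eclipse.org or Hudson source code) of how a
 * master-side workspace check reaches a slave. The FileCallable below is
 * serialized and shipped over the remoting channel (which here sits on top of a
 * trilead-ssh2 SSH session) and runs against the slave's file system. If the
 * remote side has already closed that SSH channel, the send in
 * hudson.remoting.Channel fails and the caller sees it wrapped as
 * "remote file operation failed" (hudson.util.IOException2), as in the traces above.
 */
public class RemoteWorkspaceCheck {

    static boolean workspaceExists(VirtualChannel channelToSlave, String remotePath)
            throws IOException, InterruptedException {
        // FilePath pairs a path string with the channel of the node that owns it.
        FilePath workspace = new FilePath(channelToSlave, remotePath);

        // act() sends the callable to the slave JVM and waits for the result.
        return workspace.act(new FilePath.FileCallable<Boolean>() {
            public Boolean invoke(File f, VirtualChannel channel) {
                // This part executes on the slave, not on the master.
                return f.exists();
            }
        });
    }
}
```

In the trace above this path is hit via FilePath.exists() from the workspace web page; the later m2m-atl trace hits it via FilePath.mkdirs() during checkout. Both fail at hudson.remoting.Channel.send once the SSH channel has been closed by the remote side.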
I've added ServerAliveInterval 5 to the SSH client config on both slaves, and re-launched the slave. *sigh*

(In reply to comment #11)
It works now, thanks!

Closing as fixed. If you have other slave SSH issues, please reopen.

It's happened on
https://hudson.eclipse.org/hudson/job/cbi-mdt-ocl-3.0/315/
https://hudson.eclipse.org/hudson/job/cbi-mdt-ocl-3.0/314/
after https://hudson.eclipse.org/hudson/job/cbi-mdt-ocl-3.0/313/ had to be killed because it was doing nothing. Looks like there are stuck jobs on hudson-slave2, and I saw the remoting issue on slave2.

Again, latest hudson build addresses some remoting issues that can cause deadlocks. In particular this issue: http://issues.hudson-ci.org/browse/HUDSON-5977

> Again, latest hudson build addresses some remoting issues that can cause
> deadlocks.

My dreams of a stable Hudson are always just an upgrade away :)
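A note on the ServerAliveInterval workaround mentioned a few comments above: that is a standard OpenSSH client keepalive option. The bug does not say exactly which file was edited on the slaves, so the snippet below is only a hypothetical sketch of what such an entry usually looks like; the Host names are taken from this bug, and ServerAliveCountMax is an optional companion setting that nobody here reported changing.

```
# Hypothetical ssh_config sketch (e.g. ~/.ssh/config or /etc/ssh/ssh_config);
# the actual file and scope used on the Eclipse slaves are not stated in this bug.
Host hudson-slave1 hudson-slave2
    # Send a keepalive probe after 5 seconds of inactivity so that half-dead
    # connections are noticed instead of the channel silently going away.
    ServerAliveInterval 5
    # Optional companion setting: give up after this many unanswered probes.
    # ServerAliveCountMax 3
```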
Matt? Do you have any cycles for this?
Matt == Hudson Slave
I'm testing the upgrade on the sandbox now. -M.

Ok, the update has been applied and everything has restarted. -M.

BTW, this problem seems to be tracked by both bug 326283 and bug 326366. Leaving it up to the webmasters to dupe whichever one they prefer, because right now they seem to enjoy commenting on both :-)

These are two different issues. In this issue, the slave totally loses communication with the master. In the 'hung' bug, the jobs running on slaves simply stop doing what they are supposed to, despite the fact that communication between master and slave is OK.

OK, after the downgrade to 1.376, this one is still happening today: https://hudson.eclipse.org/hudson/job/cbi-atf-0.3-nightly/666/console
Note that the job is not stopped in the middle; it doesn't even start. I wonder why it started happening on 27 September. Was there a Hudson, SSH, network or configuration upgrade? Is it because of that kernel update (which afaik was later)? If the error says "Close requested by remote", maybe the "remote" (this means the slave, yes?) has something more in its logs? It'd also be interesting to know whether this is happening on the Windows slave, assuming that the Windows slave also uses SSH and not e.g. JNLP. Just random guesses.

Seems Hudson-Slave1 is having the remoting issues again.

More info on SSH and the different ways to launch and manage slaves here: http://wiki.hudson-ci.org/display/HUDSON/Distributed+builds

Hi, I am encountering this issue also on slave 1 and slave 2 when doing a Texo build. gr. Martin

I've kicked both slaves.

I'm getting "SSH channel is closed." again on hudson-slave1.

I've restarted the slave agent. -M.

Seeing these symptoms on all virgo* jobs which are run on hudson-slave-2. In general, when an executor fails in this way, it might be nice if Hudson disabled it -- otherwise all subsequent jobs that are dispatched to it are likely to fail (rather quickly), even if this isn't the only executor they might use. If there is ever any load-balancing employed in job scheduling, failures of this form will form a black hole into which every eligible job will be thrown; this executor is clearly superior to all others: its service time is almost instantaneous. :-) Pro tem I'm converting all virgo-* jobs to run on hudson-slave-1 explicitly -- but I will want to put this back to build2 as soon as the problem is fixed. I don't want to have to do this. (NB: the last job on slave-2 that actually worked was apparently two days and 20 hours ago. This is a long time to have CI jobs failing.)

I've restarted the slave agent. This really seems to be a Hudson bug. -M.

Hi, this happens again today on slave1:

```
Building remotely on hudson-slave1
hudson.util.IOException2: remote file operation failed: /opt/users/hudsonbuild/workspace/m2m-atl-3.2 at hudson.remoting.Channel@4f57d0b7:hudson-slave1
    at hudson.FilePath.act(FilePath.java:749)
    at hudson.FilePath.act(FilePath.java:735)
    at hudson.FilePath.mkdirs(FilePath.java:801)
    at hudson.model.AbstractProject.checkout(AbstractProject.java:1059)
    at hudson.model.AbstractBuild$AbstractRunner.checkout(AbstractBuild.java:479)
    at hudson.model.AbstractBuild$AbstractRunner.run(AbstractBuild.java:411)
    at hudson.model.Run.run(Run.java:1273)
    at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46)
    at hudson.model.ResourceController.execute(ResourceController.java:88)
    at hudson.model.Executor.run(Executor.java:129)
Caused by: java.io.IOException: SSH channel is closed. (Close requested by remote)
    at com.trilead.ssh2.channel.ChannelManager.sendData(ChannelManager.java:383)
    at com.trilead.ssh2.channel.ChannelOutputStream.write(ChannelOutputStream.java:63)
    at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1838)
    at java.io.ObjectOutputStream$BlockDataOutputStream.writeByte(ObjectOutputStream.java:1876)
    at java.io.ObjectOutputStream.writeFatalException(ObjectOutputStream.java:1537)
    at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:329)
    at hudson.remoting.Channel.send(Channel.java:419)
    at hudson.remoting.Request.call(Request.java:105)
    at hudson.remoting.Channel.call(Channel.java:557)
    at hudson.FilePath.act(FilePath.java:742)
    ... 9 more
```

The slave agent has been restarted. -M.

hudson-slave1 appears hung; the Helios aggregation job has been running for 4 hr 30 min with no apparent progress. And I saw "SSH channel is closed" when killing a frozen MoDisco job. So, I think slave1 needs to be restarted.

I've restarted the slave. -M.

Hi, we have an "SSH channel is closed" for the koneki.ldt job, which builds on hudson-slave1. I see the 3 current jobs on hudson-slave1 have been running for a long time (5h, 20h, 20h). Perhaps it needs to be restarted.

> Perhaps it needs to be restarted.

It has been restarted.
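On the earlier suggestion that Hudson should disable an executor whose channel has died instead of letting it become an instant-failure black hole: Hudson does expose an API for taking a node out of rotation, and a hypothetical automation of that idea could look roughly like the sketch below. This is illustrative only; the method names are from the Hudson-era core API as best recalled and should be checked against the installed version, and nothing like this was actually put in place on the Eclipse servers as far as this bug records.

```java
import hudson.model.Computer;
import hudson.model.Hudson;

/**
 * Hypothetical sketch only (not deployed anywhere, not a confirmed Hudson
 * feature of that era): take a slave out of rotation once its remoting channel
 * is known to be dead, so the scheduler stops dispatching builds that would
 * fail almost instantly.
 */
public class TakeDeadSlaveOffline {

    static void disableIfChannelDead(String nodeName) {
        Computer c = Hudson.getInstance().getComputer(nodeName);
        // getChannel() returns null when the master no longer has a live
        // remoting connection to the slave.
        if (c != null && c.getChannel() == null && !c.isTemporarilyOffline()) {
            // Keep new builds off this node until an administrator restarts
            // the slave agent and brings it back online.
            c.setTemporarilyOffline(true);
        }
    }
}
```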