Some Eclipse Foundation services are deprecated, or will be soon. Please ensure you've read this important communication.
Bug 326283 - job failing to start due to "SSH channel is closed" when archiving previous result on hudson-slave2
Summary: job failing to start due to "SSH channel is closed" when archiving previous r...
Status: RESOLVED FIXED
Alias: None
Product: Community
Classification: Eclipse Foundation
Component: CI-Jenkins (show other bugs)
Version: unspecified   Edit
Hardware: PC Mac OS X - Carbon (unsup.)
: P3 major (vote)
Target Milestone: ---   Edit
Assignee: Eclipse Webmaster CLA
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-09-27 05:44 EDT by Jacek Pospychala CLA
Modified: 2013-05-03 09:14 EDT (History)
11 users (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Jacek Pospychala CLA 2010-09-27 05:44:55 EDT
Hi,
it happend with job:
https://hudson.eclipse.org/hudson/view/Athena%20CBI/job/cbi-pdt-2.2-helios/

builds #484-487

When my job was running pn hudson-slave2 it was failing immediately with following log (below). I noticed that this happened only on hudson-slave2, after explicitly switching to hudson-slave1, this was not happening.

Started by user jpospycha
Building remotely on hudson-slave2
hudson.util.IOException2: remote file operation failed: /opt/users/hudsonbuild/workspace/cbi-pdt-2.2-helios at hudson.remoting.Channel@665292e6:hudson-slave2
	at hudson.FilePath.act(FilePath.java:749)
	at hudson.FilePath.act(FilePath.java:735)
	at hudson.FilePath.mkdirs(FilePath.java:801)
	at hudson.model.AbstractProject.checkout(AbstractProject.java:1036)
	at hudson.model.AbstractBuild$AbstractRunner.checkout(AbstractBuild.java:479)
	at hudson.model.AbstractBuild$AbstractRunner.run(AbstractBuild.java:411)
	at hudson.model.Run.run(Run.java:1248)
	at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46)
	at hudson.model.ResourceController.execute(ResourceController.java:88)
	at hudson.model.Executor.run(Executor.java:129)
Caused by: java.io.IOException: SSH channel is closed. (Close requested by remote)
	at com.trilead.ssh2.channel.ChannelManager.sendData(ChannelManager.java:383)
	at com.trilead.ssh2.channel.ChannelOutputStream.write(ChannelOutputStream.java:63)
	at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1838)
	at java.io.ObjectOutputStream$BlockDataOutputStream.writeByte(ObjectOutputStream.java:1876)
	at java.io.ObjectOutputStream.writeFatalException(ObjectOutputStream.java:1537)
	at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:329)
	at hudson.remoting.Channel.send(Channel.java:419)
	at hudson.remoting.Request.call(Request.java:105)
	at hudson.remoting.Channel.call(Channel.java:557)
	at hudson.FilePath.act(FilePath.java:742)
	... 9 more
Archiving artifacts
ERROR: Failed to archive artifacts: build/?20*/**, build/?-SNAPSHOT/**,
java.io.IOException: SSH channel is closed. (Close requested by remote)
	at com.trilead.ssh2.channel.ChannelManager.sendData(ChannelManager.java:383)
	at com.trilead.ssh2.channel.ChannelOutputStream.write(ChannelOutputStream.java:63)
	at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1838)
	at java.io.ObjectOutputStream$BlockDataOutputStream.writeByte(ObjectOutputStream.java:1876)
	at java.io.ObjectOutputStream.writeFatalException(ObjectOutputStream.java:1537)
	at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:329)
	at hudson.remoting.Channel.send(Channel.java:419)
	at hudson.remoting.Request.call(Request.java:105)
	at hudson.remoting.Channel.call(Channel.java:557)
	at hudson.EnvVars.getRemote(EnvVars.java:202)
	at hudson.model.Computer.getEnvironment(Computer.java:736)
	at hudson.model.Run.getEnvironment(Run.java:1646)
	at hudson.model.AbstractBuild.getEnvironment(AbstractBuild.java:682)
	at hudson.tasks.ArtifactArchiver.perform(ArtifactArchiver.java:116)
	at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:19)
	at hudson.model.AbstractBuild$AbstractRunner.perform(AbstractBuild.java:601)
	at hudson.model.AbstractBuild$AbstractRunner.performAllBuildSteps(AbstractBuild.java:580)
	at hudson.model.AbstractBuild$AbstractRunner.performAllBuildSteps(AbstractBuild.java:558)
	at hudson.model.Build$RunnerImpl.post2(Build.java:158)
	at hudson.model.AbstractBuild$AbstractRunner.post(AbstractBuild.java:528)
	at hudson.model.Run.run(Run.java:1271)
	at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46)
	at hudson.model.ResourceController.execute(ResourceController.java:88)
	at hudson.model.Executor.run(Executor.java:129)
Recording test results
ERROR: Publisher hudson.tasks.junit.JUnitResultArchiver aborted due to exception
java.io.IOException: SSH channel is closed. (Close requested by remote)
	at com.trilead.ssh2.channel.ChannelManager.sendData(ChannelManager.java:383)
	at com.trilead.ssh2.channel.ChannelOutputStream.write(ChannelOutputStream.java:63)
	at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1838)
	at java.io.ObjectOutputStream$BlockDataOutputStream.writeByte(ObjectOutputStream.java:1876)
	at java.io.ObjectOutputStream.writeFatalException(ObjectOutputStream.java:1537)
	at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:329)
	at hudson.remoting.Channel.send(Channel.java:419)
	at hudson.remoting.Request.call(Request.java:105)
	at hudson.remoting.Channel.call(Channel.java:557)
	at hudson.EnvVars.getRemote(EnvVars.java:202)
	at hudson.model.Computer.getEnvironment(Computer.java:736)
	at hudson.model.Run.getEnvironment(Run.java:1646)
	at hudson.model.AbstractBuild.getEnvironment(AbstractBuild.java:682)
	at hudson.tasks.junit.JUnitResultArchiver.perform(JUnitResultArchiver.java:130)
	at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:19)
	at hudson.model.AbstractBuild$AbstractRunner.perform(AbstractBuild.java:601)
	at hudson.model.AbstractBuild$AbstractRunner.performAllBuildSteps(AbstractBuild.java:580)
	at hudson.model.AbstractBuild$AbstractRunner.performAllBuildSteps(AbstractBuild.java:558)
	at hudson.model.Build$RunnerImpl.post2(Build.java:158)
	at hudson.model.AbstractBuild$AbstractRunner.post(AbstractBuild.java:528)
	at hudson.model.Run.run(Run.java:1271)
	at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46)
	at hudson.model.ResourceController.execute(ResourceController.java:88)
	at hudson.model.Executor.run(Executor.java:129)
Finished: FAILURE
Comment 1 Denis Roy CLA 2010-09-27 09:00:39 EDT
I've restarted the slave thread on slave-2.  Can you try your job now?
Comment 2 Remi Schnekenburger CLA 2010-09-27 09:05:48 EDT
(In reply to comment #1)
> I've restarted the slave thread on slave-2.  Can you try your job now?

Hi Denis,

I just had the same problem on slave1. So I moved the job to the slave2, it seems to work.

Thanks for your job!
Rémi
Comment 3 Jacek Pospychala CLA 2010-09-27 09:08:29 EDT
yes, I ran cbi-pdt-2.2-helios again on hudson-slave2 and there's no such error any more.
Thanks Denis
Comment 4 Denis Roy CLA 2010-09-27 09:10:53 EDT
I've kicked the agent on slave1 as well.
Comment 5 David Carver CLA 2010-09-27 09:19:52 EDT
(In reply to comment #4)
> I've kicked the agent on slave1 as well.

Just an FYI, some remoting issues that could be related have been fixed in the latest Hudson.

http://hudson-labs.org/changelog
Comment 6 Denis Roy CLA 2010-09-27 09:30:04 EDT
Folks are reporting this is fixed on both slaves.
Comment 7 Jacek Pospychala CLA 2010-09-27 09:46:50 EDT
I don't like to reopen things, but this seems to still occur.
Now I hit it when trying to see what's in job workspace:
https://hudson.eclipse.org/hudson/view/Athena%20CBI/job/cbi-pdt-2.2-helios/ws/

Status Code: 500
Exception:
Stacktrace:

hudson.util.IOException2: remote file operation failed: /opt/users/hudsonbuild/workspace/cbi-pdt-2.2-helios at hudson.remoting.Channel@27bf6624:hudson-slave2
	at hudson.FilePath.act(FilePath.java:749)
	at hudson.FilePath.act(FilePath.java:735)
	at hudson.FilePath.exists(FilePath.java:1003)
	at hudson.model.AbstractProject.doWs(AbstractProject.java:1543)
	at sun.reflect.GeneratedMethodAccessor462.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.kohsuke.stapler.Function$InstanceFunction.invoke(Function.java:259)
	at org.kohsuke.stapler.Function.bindAndInvoke(Function.java:126)
	at org.kohsuke.stapler.Function.bindAndInvokeAndServeResponse(Function.java:65)
	at org.kohsuke.stapler.MetaClass$1.doDispatch(MetaClass.java:79)
	at org.kohsuke.stapler.NameBasedDispatcher.dispatch(NameBasedDispatcher.java:30)
	at org.kohsuke.stapler.Stapler.invoke(Stapler.java:534)
	at org.kohsuke.stapler.MetaClass$7.doDispatch(MetaClass.java:219)
	at org.kohsuke.stapler.NameBasedDispatcher.dispatch(NameBasedDispatcher.java:30)
	at org.kohsuke.stapler.Stapler.invoke(Stapler.java:534)
	at org.kohsuke.stapler.MetaClass$7.doDispatch(MetaClass.java:219)
	at org.kohsuke.stapler.NameBasedDispatcher.dispatch(NameBasedDispatcher.java:30)
	at org.kohsuke.stapler.Stapler.invoke(Stapler.java:534)
	at org.kohsuke.stapler.Stapler.invoke(Stapler.java:450)
	at org.kohsuke.stapler.Stapler.service(Stapler.java:132)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:45)
	at winstone.ServletConfiguration.execute(ServletConfiguration.java:249)
	at winstone.RequestDispatcher.forward(RequestDispatcher.java:335)
	at winstone.RequestDispatcher.doFilter(RequestDispatcher.java:378)
	at hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:94)
	at hudson.plugins.audit_trail.AuditTrailFilter.doFilter(AuditTrailFilter.java:64)
	at hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:97)
	at hudson.util.PluginServletFilter.doFilter(PluginServletFilter.java:86)
	at winstone.FilterConfiguration.execute(FilterConfiguration.java:195)
	at winstone.RequestDispatcher.doFilter(RequestDispatcher.java:368)
	at hudson.security.csrf.CrumbFilter.doFilter(CrumbFilter.java:47)
	at winstone.FilterConfiguration.execute(FilterConfiguration.java:195)
	at winstone.RequestDispatcher.doFilter(RequestDispatcher.java:368)
	at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:84)
	at hudson.security.UnwrapSecurityExceptionFilter.doFilter(UnwrapSecurityExceptionFilter.java:51)
	at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87)
	at org.acegisecurity.ui.ExceptionTranslationFilter.doFilter(ExceptionTranslationFilter.java:166)
	at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87)
	at org.acegisecurity.providers.anonymous.AnonymousProcessingFilter.doFilter(AnonymousProcessingFilter.java:125)
	at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87)
	at org.acegisecurity.ui.rememberme.RememberMeProcessingFilter.doFilter(RememberMeProcessingFilter.java:142)
	at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87)
	at org.acegisecurity.ui.AbstractProcessingFilter.doFilter(AbstractProcessingFilter.java:271)
	at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87)
	at org.acegisecurity.ui.basicauth.BasicProcessingFilter.doFilter(BasicProcessingFilter.java:173)
	at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87)
	at org.acegisecurity.context.HttpSessionContextIntegrationFilter.doFilter(HttpSessionContextIntegrationFilter.java:249)
	at hudson.security.HttpSessionContextIntegrationFilter2.doFilter(HttpSessionContextIntegrationFilter2.java:66)
	at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87)
	at hudson.security.ChainedServletFilter.doFilter(ChainedServletFilter.java:76)
	at hudson.security.HudsonFilter.doFilter(HudsonFilter.java:164)
	at winstone.FilterConfiguration.execute(FilterConfiguration.java:195)
	at winstone.RequestDispatcher.doFilter(RequestDispatcher.java:368)
	at winstone.RequestDispatcher.forward(RequestDispatcher.java:333)
	at winstone.RequestHandlerThread.processRequest(RequestHandlerThread.java:244)
	at winstone.RequestHandlerThread.run(RequestHandlerThread.java:150)
	at java.lang.Thread.run(Thread.java:619)
Caused by: java.io.IOException: SSH channel is closed. (Close requested by remote)
	at com.trilead.ssh2.channel.ChannelManager.sendData(ChannelManager.java:383)
	at com.trilead.ssh2.channel.ChannelOutputStream.write(ChannelOutputStream.java:63)
	at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1838)
	at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1747)
	at java.io.ObjectOutputStream.writeNonProxyDesc(ObjectOutputStream.java:1249)
	at java.io.ObjectOutputStream.writeClassDesc(ObjectOutputStream.java:1203)
	at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1387)
	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1150)
	at java.io.ObjectOutputStream.writeFatalException(ObjectOutputStream.java:1538)
	at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:329)
	at hudson.remoting.Channel.send(Channel.java:419)
	at hudson.remoting.Request.call(Request.java:105)
	at hudson.remoting.Channel.call(Channel.java:557)
	at hudson.FilePath.act(FilePath.java:742)
	... 57 more
Comment 8 Jacek Pospychala CLA 2010-09-27 09:50:08 EDT
(only on slave-2. after switching to slave1 link works)
Comment 9 Jacek Pospychala CLA 2010-09-27 13:08:14 EDT
I'm upping priority, because earlier at least one slave was working correctly, but now on both of them the job is failing.
For now I resorted to building on master, but it's tough, given the length of it's queue.. sometimes :-)
I'm wondering why others jobs are running and my is failing. I tried another job (cbi-atf), that was always green (blue), but it failed as well now
Comment 10 Michal Ruzicka CLA 2010-09-27 13:40:18 EDT
It is happening to the emf-cdo-integration job on both slaves too.
Building on master with the currently configured number of builders is not very flexible.
Comment 11 Denis Roy CLA 2010-09-27 13:47:51 EDT
I've added ServerAliveInterval 5 to the SSH client config on both slaves, and re-launched the slave.

*sigh*
Comment 12 Michal Ruzicka CLA 2010-09-27 14:45:40 EDT
(In reply to comment #11)

It works now, thanks!
Comment 13 Denis Roy CLA 2010-09-28 13:20:01 EDT
Closing as fixed.  If you have other slave SSH issues, please reopen.
Comment 15 David Carver CLA 2010-09-28 18:27:51 EDT
Looks like there are stuck jobs on hudson-slave2, and I saw the remoting issue on slave2.

Again, latest hudson build addresses some remoting issues that can cause deadlocks.
Comment 16 David Carver CLA 2010-09-28 18:28:58 EDT
In particular this issue:

http://issues.hudson-ci.org/browse/HUDSON-5977
Comment 17 Denis Roy CLA 2010-09-28 18:43:24 EDT
> Again, latest hudson build addresses some remoting issues that can cause
> deadlocks.

My dreams of a stable Hudson are always just an upgrade away  :)

Matt?  Do you have any cycles for this?

Matt == Hudson Slave
Comment 18 Eclipse Webmaster CLA 2010-09-29 10:29:16 EDT
I'm testing the upgrade on the sandbox now.

-M.
Comment 19 Eclipse Webmaster CLA 2010-09-29 11:03:52 EDT
Ok, the update has been applied and everything has restarted.

-M.
Comment 20 Jacek Pospychala CLA 2010-09-30 10:32:16 EDT
btw. this problem seems to be tracked both by bug 326283 and bug 326366
Leaving it up to webmasters to dupe which one they prefer, because now they seem to enjoy commenting on both :-)
Comment 21 Denis Roy CLA 2010-09-30 10:36:31 EDT
These are two different issues.  In this issue, the slave totally loses communication with the master.  In the 'hung' bug, the jobs running on slaves simply stop doing what they are supposed to, despite the fact that communication between master and slave is OK.
Comment 22 Jacek Pospychala CLA 2010-10-01 06:24:50 EDT
ok, after downgrade to 1.376, this one is still happening today:
https://hudson.eclipse.org/hudson/job/cbi-atf-0.3-nightly/666/console

Note that job is not stopped in the middle, but it doesn't even start.
I wonder why it started happening on 27th/Sept. Was there a hudson, ssh, network or configuration upgrade? Is it because of that kernel update (which afaik was later)? 
If error says "Close requested by remote" maybe the "remote" (this means slave, yes?) has something more in it's logs?
It'd be also interesting to know if this is also happening on windows slave, assuming that windows slave also uses ssh, and not e.g. jnlp.
Just random guesses
Comment 23 David Carver CLA 2010-10-01 10:38:58 EDT
Seems Hudson-Slave1 is having the remoting issues again.

More info on SSH and different ways to launch manage slaves here:

http://wiki.hudson-ci.org/display/HUDSON/Distributed+builds
Comment 24 Martin Taal CLA 2010-10-01 16:32:05 EDT
Hi,
I am encountering this issue also on slave 1 and slave 2 when doing a Texo build.

gr. Martin
Comment 25 Denis Roy CLA 2010-10-01 16:48:51 EDT
I've kicked both slaves.
Comment 26 Nicolas Bros CLA 2010-11-09 08:25:35 EST
I'm getting "SSH channel is closed." again on hudson-slave1.
Comment 27 Eclipse Webmaster CLA 2010-11-09 09:09:25 EST
I've restarted the slave agent.

-M.
Comment 28 Steve Powell CLA 2010-11-15 06:41:14 EST
Seeing these symptoms on all virgo* jobs which are run on hudson-slave-2.

In general, when an executor fails in this way, it might be nice if hudson disabled it -- otherwise all subsequent jobs that are dispatched to it are likely to fail (rather quickly) even if this isn't the only executor they might use.

If there is ever any load-balancing employed in job scheduling, failures of this form will form a black-hole into which every eligible job will be thrown; this executor is clearly superior to all others: its service time is almost instantaneous. :-)

Pro-tem I'm converting all virgo-* jobs to run on hudson-slave-1 explicitly -- but I will want to put this back to build2 as soon as the problem is fixed.  I don't want to have to do this.
Comment 29 Steve Powell CLA 2010-11-15 06:45:43 EST
(NB: Last job on slave-2 that actually worked was apparently two days and 20 hours ago.  This is a long time to have ci jobs failing.)
Comment 30 Eclipse Webmaster CLA 2010-11-15 10:03:10 EST
I've restarted the slave agent.  This really seems to be a hudson bug.

-M.
Comment 31 Bouchet Stéphane CLA 2010-12-21 08:40:56 EST
Hi,

this happens again today on slave1 : 
Building remotely on hudson-slave1
hudson.util.IOException2: remote file operation failed: /opt/users/hudsonbuild/workspace/m2m-atl-3.2 at hudson.remoting.Channel@4f57d0b7:hudson-slave1
	at hudson.FilePath.act(FilePath.java:749)
	at hudson.FilePath.act(FilePath.java:735)
	at hudson.FilePath.mkdirs(FilePath.java:801)
	at hudson.model.AbstractProject.checkout(AbstractProject.java:1059)
	at hudson.model.AbstractBuild$AbstractRunner.checkout(AbstractBuild.java:479)
	at hudson.model.AbstractBuild$AbstractRunner.run(AbstractBuild.java:411)
	at hudson.model.Run.run(Run.java:1273)
	at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46)
	at hudson.model.ResourceController.execute(ResourceController.java:88)
	at hudson.model.Executor.run(Executor.java:129)
Caused by: java.io.IOException: SSH channel is closed. (Close requested by remote)
	at com.trilead.ssh2.channel.ChannelManager.sendData(ChannelManager.java:383)
	at com.trilead.ssh2.channel.ChannelOutputStream.write(ChannelOutputStream.java:63)
	at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1838)
	at java.io.ObjectOutputStream$BlockDataOutputStream.writeByte(ObjectOutputStream.java:1876)
	at java.io.ObjectOutputStream.writeFatalException(ObjectOutputStream.java:1537)
	at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:329)
	at hudson.remoting.Channel.send(Channel.java:419)
	at hudson.remoting.Request.call(Request.java:105)
	at hudson.remoting.Channel.call(Channel.java:557)
	at hudson.FilePath.act(FilePath.java:742)
	... 9 more
Comment 32 Eclipse Webmaster CLA 2010-12-21 10:02:42 EST
The slave agent has been restarted.

-M.
Comment 33 Nicolas Bros CLA 2011-01-20 09:17:56 EST
hudson-slave1 appears hung; The Helios aggregation job has been running for 4 hr 30 with no apparent progress.
And I saw "SSH channel is closed" when killing a frozen MoDisco job.
So, I think slave1 needs to be restarted.
Comment 34 Eclipse Webmaster CLA 2011-01-20 09:43:00 EST
I've restarted the slave.

-M.
Comment 35 Simon Bernard CLA 2013-05-03 04:41:51 EDT
Hi,

We have a "SSH channel is closed" for koneki.ldt job which build on hudson-slave1.

I see the 3 current jobs on hudson-slave1 has be running for a long time (5h, 20h, 20h)

Perhaps it need to be restarted.
Comment 36 Denis Roy CLA 2013-05-03 09:14:47 EDT
> Perhaps it need to be restarted.

It has been restarted.