
Bug 474076

Summary: Shared instance (actually its slaves) are all offline
Product: Community
Reporter: David Williams <david_williams>
Component: CI-Jenkins
Assignee: CI Admin Inbox <ci.admin-inbox>
Status: RESOLVED FIXED
QA Contact:
Severity: critical
Priority: P3
CC: daniel_megert, webmaster
Version: unspecified
Target Milestone: ---
Hardware: PC
OS: Linux
Whiteboard:

Description David Williams CLA 2015-07-31 23:47:35 EDT
Something appears to be wrong with the Hudson shared instance, for example at

https://hudson.eclipse.org/hudson/view/Eclipse%20and%20Equinox/

It's showing me 'master' is idle, but all the slaves are off-line. 

Actually my web browser seems to never finish loading that view ... still says "Reading hudson.eclipse.org" for 10 or 15 minutes. 

But, specific jobs, such as 

https://hudson.eclipse.org/hudson/view/Eclipse%20and%20Equinox/job/ep46N-unit-lin64/

say "pending, hudson-slave4 is offline". 

Does this shared instance still reboot every Sunday morning? Even if so, I have a feeling that won't fix whatever is ailing it.
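
The offline state described above can also be checked without waiting for the view to finish loading. What follows is only a minimal sketch, assuming the instance exposes the usual Hudson/Jenkins remote API JSON for nodes (a `computer` array whose entries carry `displayName` and `offline` fields). The sample payload and the `offline_nodes` helper are hypothetical and are not taken from hudson.eclipse.org.

```python
import json

# Hypothetical sample, shaped like the JSON that a Hudson/Jenkins
# node listing is assumed to return. This is not real data.
SAMPLE = """
{"computer": [
  {"displayName": "master", "offline": false, "idle": true},
  {"displayName": "hudson-slave4", "offline": true, "idle": true},
  {"displayName": "windows7tests", "offline": true, "idle": true}
]}
"""


def offline_nodes(payload):
    """Return the display names of all nodes whose 'offline' flag is true."""
    return [
        node.get("displayName", "?")
        for node in payload.get("computer", [])
        if node.get("offline")
    ]


if __name__ == "__main__":
    # Print one line per offline node found in the sample payload.
    for name in offline_nodes(json.loads(SAMPLE)):
        print(name + " is offline")
```

In practice the payload would come from the server rather than from an inline string; the parsing step stays the same.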
Comment 1 David Williams CLA 2015-07-31 23:52:10 EDT
With my semi-admin privileges, I tried to start hudson-slave4, but it simply said the following in the log:

java.io.IOException: Unexpected termination of the channel
	at hudson.remoting.Channel$ReaderThread.run(Channel.java:1030)
Caused by: java.io.EOFException
	at java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.java:2554)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1297)
	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:351)
	at hudson.remoting.Channel$ReaderThread.run(Channel.java:1024)

(Well, it said more than that, but I've "lost" it already, and that's what it ended with.)
Comment 2 David Williams CLA 2015-07-31 23:56:30 EDT
Note: I've marked this as a blocker since we've already lost tests for two "nightly builds" ... and our Neon M1 stabilization builds are starting with Sunday night's build. Even the nightlies would be blocker worthy, but the stabilization I-builds are "emergency blockers" ... if there is such a thing.
Comment 3 Eclipse Webmaster CLA 2015-08-02 14:08:32 EDT
I'm not seeing any delays in the response of the main page, and as of right now (2pm EST) almost all the slaves are reporting 'idle'.

My guess is that the weekly slave restarts that ran this morning probably 'unblocked' anything that was stuck.

-M.
Comment 4 Dani Megert CLA 2015-08-02 14:28:46 EDT
(In reply to Eclipse Webmaster from comment #3)
> I'm not seeing any delays in the response of the main page, and as of right
> now (2pm EST) almost all the slaves are reporting 'idle'.

If they are idle then something is still wrong since we are waiting on further test results. See e.g.
http://download.eclipse.org/eclipse/downloads/drops4/N20150801-1500/
Comment 5 Eclipse Webmaster CLA 2015-08-02 14:34:19 EDT
Well the Mac slave is running a build and I did have to restart the windows slave, due to a crash of the slave process.

-M.
Comment 6 Dani Megert CLA 2015-08-02 14:52:43 EDT
(In reply to Eclipse Webmaster from comment #5)
> Well the Mac slave is running a build and I did have to restart the windows
> slave, due to a crash of the slave process.
> 
> -M.

k, thanks Matt! Let's see how it goes.
Comment 7 David Williams CLA 2015-08-02 17:55:38 EDT
Mac and Linux tests ran for N20150731-2000 and N20150801-1500, but not Windows, since it failed quickly with the typical error:

hudson.remoting.Channel@35322e51:windows7tests
hudson.util.IOException2: remote file operation failed: <https://hudson.eclipse.org/hudson/job/ep46N-unit-win32/ws/> at hudson.remoting.Channel@35322e51:windows7tests
		 at hudson.FilePath.act(FilePath.java:754)

(And, the Windows machine is not part of the "auto restart", AFAIK)

Since the Windows slave was restarted, I've restarted the tests on the Windows machine for N20150731-2000 and N20150801-1500, and they appear to be running normally, so I will declare this "fixed" (even though the tests won't be complete for some time ... on Monday).
Comment 8 Dani Megert CLA 2015-08-03 03:54:49 EDT
(In reply to David Williams from comment #7)
> Mac and Linux tests ran for N20150731-2000, and N20150801-1500 (but not
> Windows, since it failed quickly with the typical 
> 
> hudson.remoting.Channel@35322e51:windows7tests
> hudson.util.IOException2: remote file operation failed:
> <https://hudson.eclipse.org/hudson/job/ep46N-unit-win32/ws/> at
> hudson.remoting.Channel@35322e51:windows7tests
> 		 at hudson.FilePath.act(FilePath.java:754)
> 
> (And, the Windows machine is not part of the "auto restart", AFAIK)
> 
> Since Windows slave was restarted, I've restarted the tests for Windows
> machine for N20150731-2000 and N20150801-1500, and appears to be running
> normally, so, will declare this "fixed" (even though tests won't be complete
> for some time ... on Monday).

Update: still no Windows test results for the mentioned builds, and so far only Linux test results for I20150802-2000. So either it is still (or again) broken, or it takes a very long time, which is also bad for us.
Comment 9 Dani Megert CLA 2015-08-03 04:35:05 EDT
Looks like the results are slowly arriving:
Windows test results for N20150731-2000
Mac test results for I20150802-2000

Still missing are Windows test results for N20150801-1500 and I20150802-2000.
Comment 10 David Williams CLA 2015-08-04 22:06:15 EDT
(In reply to Dani Megert from comment #9)
> Looks like the results slowly arrive:
> Windows test results for N20150731-2000
> Mac test results for I20150802-2000
> 
> Still missing are Windows test results for N20150801-1500 and I20150802-2000.

I commented on these on the platform-releng-dev list ... it just takes Windows a long time to catch up, since a) it is a slow machine, and b) it can only run one test build at a time, because we need a dedicated display on Windows. Those might be another problem :) but not this bug.

Thanks all,