
Bug 335274

Summary: build is dead
Product: Community
Component: Servers
Status: RESOLVED FIXED
Severity: blocker
Priority: P3
Version: unspecified
Target Milestone: ---
Hardware: PC
OS: Windows 7
Reporter: David Williams <david_williams>
Assignee: Eclipse Webmaster <webmaster>
QA Contact:
CC: aniefer, christian.campo, nicolas.bros, pwebster, remy.suen
Whiteboard:
Attachments: Build server backend interface (no flags)

Description David Williams CLA 2011-01-24 23:13:00 EST
Well, that's what the "infrastructure status" report says. 

= = = = = 
build

DEAD, offline or not reporting

= = = = = 

It is still up and running ... sort of ... a shell I happened to have open reports the three "top" load averages in the hundreds! 

[ 214.10 187.25 126.20 ]

But ... there seemed to be no particular job using a lot of CPU or memory? 


You'll probably see other symptoms or reports before this bugzilla ... but I thought I'd report here in case it helps. 

I first noticed this when trying to log in with a build id to kill a job ... and it wouldn't let me log in ... it gave me the welcome message, but no prompt.
Comment 1 Denis Roy CLA 2011-01-25 09:44:52 EST
Yep, it's dead, and I cannot even reach its management interface.  I've opened a ticket with the Data Center to have it restarted.  That should happen soon-ish.
Comment 2 Denis Roy CLA 2011-01-25 13:26:32 EST
Created attachment 187556 [details]
Build server backend interface

Well, this is interesting.  It appears all network communication with the build server stopped just before 11pm ET yesterday, as witnessed by the server's switch activity graph.  The private and backend interfaces both stopped transmitting at the same time.

Strangely, neither the build server nor the switches it is connected to reported anything wrong with the connection.  Normally, if someone disconnects a cable, or the interface goes down for some reason, both the switch and the server would log the event, like this entry from when the server was rebooted yesterday at noon:

Jan 24 12:02:50 switch1 10747: 3y49w: %LINK-3-UPDOWN: Interface GigabitEthernet0/7, changed state to down

The server happily continued to process its cron jobs all night and into this morning, when it was restarted.  By looking at the server's log, you'd never imagine anything went wrong, except for this one entry:

Jan 24 22:41:02 build kernel: [38170.651876] nfs: server nfsslave not responding, still trying

I'm not sure what prevented the build server from talking to one of its NFS servers, but that seems to have had a severe enough impact for it to lose some kind of "state" and simply space out.

We did restart the server yesterday for a new kernel update.  I'll keep my eyes open in case there was a regression with network drivers/IP/NFS/whatever.

For now, the server is back up, although it never actually died...
Comment 3 David Williams CLA 2011-01-27 12:19:21 EST
Well, it's been running for a few days now ... guess the reboot "fixed" it :) 

Thanks,
Comment 4 Andrew Niefer CLA 2011-01-28 13:27:31 EST
Denis,
Since Monday, Orion builds have been failing intermittently when attempting to delete an empty directory.

I can see no reason for the failure; after the failure I can go in and manually delete the folder with no problem.

Can you see if there is any indication of this being an nfs problem or something?
Comment 5 Denis Roy CLA 2011-01-28 13:40:11 EST
There is nothing indicating a problem.

However, looking at your log in bug 335723, it is possible that after deleting all the files inside the directory, one by one, the servers haven't yet flushed their buffers, making it impossible to delete the directory because the file deletion changes have not yet synced to disk.

Do you not have a way to delete the directory and all its contents with a single call, like an rm -rf dir, rather than rm *; rmdir (dirname) ?
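
For illustration, a minimal sketch of the two approaches being contrasted here (the directory path is hypothetical, not one of the actual build locations):

    # Deleting the contents file by file and then removing the directory --
    # if the NFS server hasn't flushed the deletions yet, the final rmdir
    # can still see the directory as non-empty:
    rm /path/to/tmpdir/*        # hypothetical path
    rmdir /path/to/tmpdir       # may fail with "Directory not empty"

    # The single recursive call being suggested instead:
    rm -rf /path/to/tmpdir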
Comment 6 Andrew Niefer CLA 2011-01-28 13:59:30 EST
I will see if I can do that.  The tricky part is that, since the delete is done inside the Ant scripts generated by pde.build, there is no direct way to change how the delete is done.  

I think I see a way I can override this, but if that doesn't work out then I will have to modify pde.build directly, in which case I could do whatever I wanted.
Comment 7 Andrew Niefer CLA 2011-01-28 15:50:45 EST
Here is the error message when I do an exec of the native "rm -rf ...":
cleanup.assembly:
     [exec] rm: cannot remove `/shared/eclipse/e4/orion/I201101281514/tmp/eclipse/plugins/org.eclipse.equinox.launcher.win32.win32.x86_64_1.1.100.v20101220/META-INF/.nfs0000000058da4958000005b9': Device or resource busy
     [exec] rm: cannot remove `/shared/eclipse/e4/orion/I201101281514/tmp/eclipse/plugins/org.eclipse.equinox.launcher.win32.win32.x86_64_1.1.100.v20101220/META-INF/.nfs0000000058da495b000005ba': Device or resource busy
     [exec] rm: cannot remove `/shared/eclipse/e4/orion/I201101281514/tmp/eclipse/plugins/org.eclipse.equinox.launcher.win32.win32.x86_64_1.1.100.v20101220/META-INF/.nfs0000000058da4955000005b8': Device or resource busy
     [exec] Result: 1


I do now have a workaround via this override mechanism I came up with to do the native "rm -r".  I should be able to move the folder to some other location first, so that if the delete fails it won't interfere with the stuff coming later.
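
As a rough sketch of that move-then-delete idea (paths are hypothetical, not the real build tree): the folder is renamed out of the way first, so that even if lingering .nfs placeholder files keep the recursive delete from completing, the leftovers no longer sit where the next build step expects to write.

    # Hypothetical sketch of the workaround described above:
    mv /path/to/build/tmp /path/to/build/tmp.delete-me   # get it out of the build's way
    rm -rf /path/to/build/tmp.delete-me || true          # best-effort cleanup afterwards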
Comment 8 Denis Roy CLA 2011-01-28 16:07:32 EST
The .nfs files are consistent with a file handle that is in the process of being deleted.  The NFS server (or its disk arrays) may be lagging when you see this.

There's really not much I can do about them...  

But an rm -rf (dir) will not exhibit this.  rm'ing individual files, then trying to remove the directory, will.

Can't you just make a native call to rm -rf (dir) ?
Comment 9 Andrew Niefer CLA 2011-01-28 18:12:13 EST
(In reply to comment #8)
> Can't you just make a native call to rm -rf (dir) ?
The output in comment #7 was from a native rm -rf (dir).

I'll close this bug again, as the change I made to attempt the native call also allows a "mv" first, so the build is working again.