| Summary: | build is dead | | |
|---|---|---|---|
| Product: | Community | Reporter: | David Williams <david_williams> |
| Component: | Servers | Assignee: | Eclipse Webmaster <webmaster> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | blocker | | |
| Priority: | P3 | CC: | aniefer, christian.campo, nicolas.bros, pwebster, remy.suen |
| Version: | unspecified | | |
| Target Milestone: | --- | | |
| Hardware: | PC | | |
| OS: | Windows 7 | | |
| Whiteboard: | | | |
| Attachments: | | | |

Description
David Williams
Yep, it's dead, and I cannot even reach its management interface. I've opened a ticket with the Data Center to have it restarted. That should happen soon-ish.
Created attachment 187556
Build server backend interface
Well, this is interesting. It appears all network communication with the build server stopped just before 11pm ET yesterday, as witnessed by the server's switch activity graph. The private and backend interfaces both stopped transmitting at the same time.
Strangely, neither the build server nor the switches it is connected to reported anything wrong with the connection. Normally, if someone disconnects a cable, or the interface goes down for some reason, both the switch and the server log the event, as they did when the server was rebooted yesterday at noon:
Jan 24 12:02:50 switch1 10747: 3y49w: %LINK-3-UPDOWN: Interface GigabitEthernet0/7, changed state to down
The server happily continued to process its cron jobs all night and into this morning, when it was restarted. Looking at the server's log, you'd never imagine anything went wrong, except for this one entry:
Jan 24 22:41:02 build kernel: [38170.651876] nfs: server nfsslave not responding, still trying
I'm not sure what prevented the build server from talking to one of its NFS servers, but that seems to have had a severe enough impact for it to lose some kind of "state" and simply space out.
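For anyone wanting to check other logs for the same symptom, here is a minimal sketch that scans syslog-style lines for the kernel's NFS timeout message quoted above. The function name `find_nfs_timeouts` and the regular expression are assumptions for illustration, based only on the single entry shown here; other kernels or syslog formats may word the message differently.

```python
import re

# Pattern modelled on the one entry quoted in this bug, e.g.
#   Jan 24 22:41:02 build kernel: [38170.651876] nfs: server nfsslave not responding, still trying
NFS_TIMEOUT = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+[\d:]{8})\s+(?P<host>\S+)\s+kernel:.*"
    r"nfs: server (?P<server>\S+) not responding"
)


def find_nfs_timeouts(lines):
    """Return (timestamp, client host, NFS server) for each timeout entry."""
    hits = []
    for line in lines:
        match = NFS_TIMEOUT.search(line)
        if match:
            hits.append(
                (match.group("stamp"), match.group("host"), match.group("server"))
            )
    return hits


if __name__ == "__main__":
    sample = [
        "Jan 24 22:41:02 build kernel: [38170.651876] nfs: server nfsslave "
        "not responding, still trying",
        "Jan 24 12:02:50 switch1 10747: 3y49w: %LINK-3-UPDOWN: Interface "
        "GigabitEthernet0/7, changed state to down",
    ]
    for hit in find_nfs_timeouts(sample):
        print(hit)
```

Only the first sample line matches; the switch's link-state entry is ignored because it is not a kernel NFS message.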
We did restart the server yesterday for a new kernel update. I'll keep my eyes open in case there was a regression with network drivers/IP/NFS/whatever.
For now, the server is back up, although it never actually died...
Well, it's been running for a few days now ... guess the reboot "fixed" it :)
Thanks, Denis.
Since Monday, Orion builds have been failing intermittently when attempting to delete an empty directory. I can see no reason for the failure; after the failure I can go in and manually delete the folder with no problem. Can you see if there is any indication of this being an NFS problem or something?
There is nothing that is indicating a problem. However, if I examine your log at bug 335723 comment, it is possible that after deleting all files inside the directory, one after another, the servers haven't yet flushed their buffers, making it impossible to delete the directory because the file deletion changes have not yet synced to disk. Do you not have a way to delete the directory and all its contents with a single call, like an rm -rf dir, rather than rm *; rmdir (dirname) ?
I will see if I can do that. The tricky part is that, since the delete is done inside the ant scripts generated by pde.build, there is no direct way to change how the delete is done. I think I see a way I can override this, but if that doesn't work out then I will have to modify pde.build directly, in which case I could do whatever I wanted.
Here is the error message when I do an exec of the native "rm -rf ..."
cleanup.assembly:
[exec] rm: cannot remove `/shared/eclipse/e4/orion/I201101281514/tmp/eclipse/plugins/org.eclipse.equinox.launcher.win32.win32.x86_64_1.1.100.v20101220/META-INF/.nfs0000000058da4958000005b9': Device or resource busy
[exec] rm: cannot remove `/shared/eclipse/e4/orion/I201101281514/tmp/eclipse/plugins/org.eclipse.equinox.launcher.win32.win32.x86_64_1.1.100.v20101220/META-INF/.nfs0000000058da495b000005ba': Device or resource busy
[exec] rm: cannot remove `/shared/eclipse/e4/orion/I201101281514/tmp/eclipse/plugins/org.eclipse.equinox.launcher.win32.win32.x86_64_1.1.100.v20101220/META-INF/.nfs0000000058da4955000005b8': Device or resource busy
[exec] Result: 1
I do now have a workaround via this override mechanism I came up with to do the native "rm -r". I should be able to move the folder to some other location first, and then if the delete fails it won't interfere with the stuff coming later.
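The move-then-delete idea can be sketched as follows. This is an illustration in Python, not the ant/pde.build override actually used in the build; the function name `park_and_remove` and the `parking_root` directory are hypothetical, and the sketch assumes the parking location is on the same filesystem as the target so that a plain rename works.

```python
import os
import shutil
import tempfile


def park_and_remove(target, parking_root):
    """Move `target` out of the way, then try to delete it.

    Returns True if the tree was fully deleted, False if leftovers
    (for example busy .nfs* files) remain in the parking area.
    Either way, `target` no longer exists at its original path.
    """
    os.makedirs(parking_root, exist_ok=True)
    # A unique holding directory avoids name clashes between builds.
    holding = tempfile.mkdtemp(prefix="stale-", dir=parking_root)
    name = os.path.basename(os.path.normpath(target))
    # Assumption: same filesystem, so this is a rename rather than a copy.
    os.rename(target, os.path.join(holding, name))
    try:
        shutil.rmtree(holding)
        return True
    except OSError:
        # Deletion failed part-way; later build steps are unaffected
        # because the original path is already free.
        return False
```

The point of the ordering is that a failed delete no longer blocks whatever comes next at the original path; anything left in the parking area can be cleaned up later, once the lingering file handles have been released.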
The .nfs files are consistent with a file handle that is in the process of being deleted. The NFS server (or its disk arrays) may be lagging when you see this. There's really not much I can do about them... But an rm -rf (dir) will not exhibit this. rm'ing individual files, then trying to remove the directory, will. Can't you just make a native call to rm -rf (dir) ?
(In reply to comment #8)
> Can't you just make a native call to rm -rf (dir) ?
The output in comment #7 was from a native rm -rf (dir). I'll close this bug again, as the change I made to attempt the native call also allows a "mv" first, so the build is working again.
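To see whether a build directory still contains the `.nfs*` placeholders shown in the error output, a small check along these lines may help. It is a sketch only: the function name `find_nfs_placeholders` is made up for illustration, and it relies on nothing more than the `.nfs` name prefix visible in the log above.

```python
import os


def find_nfs_placeholders(root):
    """Return sorted paths of files under `root` whose names start with '.nfs'."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.startswith(".nfs"):
                found.append(os.path.join(dirpath, name))
    return sorted(found)


if __name__ == "__main__":
    # Hypothetical usage: report placeholders under the current directory.
    for path in find_nfs_placeholders("."):
        print(path)
```

An empty result suggests the directory can be removed normally; a non-empty one suggests some process still holds those files open, which matches the "Device or resource busy" errors reported here.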