
Bug 401975

Summary: Investigate alternatives to NFS for /shared
Product: Community
Reporter: Dennis Huebner <dennis.huebner>
Component: CI-Jenkins
Assignee: CI Admin Inbox <ci.admin-inbox>
Status: RESOLVED FIXED
QA Contact:
Severity: normal
Priority: P3
CC: denis.roy, gunnar, mknauer, thanh.ha, webmaster
Version: unspecified
Target Milestone: ---
Hardware: PC
OS: Mac OS X
Whiteboard:
Bug Depends on: 403843
Bug Blocks:

Description Dennis Huebner CLA 2013-02-28 04:07:30 EST
Sometimes our jobs fail with "java.io.IOException: Stale NFS file handle".
I saw the same error on the sandbox Hudson; in both cases the /opt/public/common/buckminster-4.2/configuration/org.eclipse.core.runtime/ directory was involved.

Please see: https://hudson.eclipse.org/hudson/job/MWE-Language-nightly-HEAD/2388/console
... and: https://hudson.eclipse.org/sandbox/job/xtext.gerrit/291/console



INFO:  System property https.nonProxyHosts has been set to *.eclipse.org by an external source. This value will be overwritten using the values from the preferences
ERROR: IOException encountered while reading "/opt/public/common/buckminster-4.2/configuration/org.eclipse.core.runtime/.mainData.145".
java.io.IOException: Stale NFS file handle
	at java.io.RandomAccessFile.readBytes(Native Method)

ERROR: IOException encountered while reading "/opt/public/common/buckminster-4.2/configuration/org.eclipse.core.runtime/.extraData.133".
java.io.IOException: Stale NFS file handle
	at java.io.RandomAccessFile.readBytes(Native Method)
Comment 1 Markus Knauer CLA 2013-02-28 04:25:39 EST
Same problem here with the RAP builds, e.g. these builds:

  https://hudson.eclipse.org/hudson/job/rap-head-runtime/12/console
  https://hudson.eclipse.org/hudson/job/rap-head-runtime/14/console

In our case there is no connection to a Buckminster installation, because we build with Tycho/Maven. We always see this error in combination with a Hudson workspace delete and some .git/... files.

This error happens every few days, but only since the crash and rebuild of /shared. My guess is that the restored file system has a slightly different configuration (NFS or some extended ACLs).
Comment 2 Denis Roy CLA 2013-02-28 08:37:53 EST
> Please see:
> https://hudson.eclipse.org/hudson/job/MWE-Language-nightly-HEAD/2388/console
> ... and: https://hudson.eclipse.org/sandbox/job/xtext.gerrit/291/console

When you get a stale NFS handle on a read operation, chances are some other job or process deleted the file while another job was still using it. If you're using the same working directory on slave2 and sandbox, you're definitely asking for trouble. Hudson does not lock files.

ls: cannot access /opt/public/common/buckminster-4.2/configuration/org.eclipse.core.runtime/.mainData.145: No such file or directory 


(In reply to comment #1)
> We see this error always in combination with
> a Hudson workspace delete and some .git/... files.

I believe this is a known issue with either Hudson or the Java API that recursively deletes a directory. One project's workaround was to shell out to the OS ("rm -rf"), since that behaves correctly over NFS-mounted filesystems.
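A minimal sketch of that workaround (paths and the .nfs placeholder are made up for illustration; this is not the project's actual build script). Over NFS, files that are still open get silly-renamed to .nfsXXXX placeholders, which is what the recursive Java delete reportedly stumbles over, while the system "rm" copes:

```shell
# Hypothetical cleanup step: delete the workspace with the OS-level "rm"
# instead of a recursive in-process Java delete.
WORKSPACE=/tmp/demo-workspace          # stands in for the real Hudson workspace
mkdir -p "$WORKSPACE/old/.git"
touch "$WORKSPACE/old/.git/.nfs000000001"   # simulated NFS silly-rename leftover
rm -rf "$WORKSPACE"                    # OS call, works over NFS mounts
[ -d "$WORKSPACE" ] || echo "workspace removed"
```

In a Hudson job this would run as a pre- or post-build shell step rather than relying on the built-in workspace wipe.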

I'm not sure what alternatives we have for shared filesystems that are neither inordinately complex to manage nor painfully slow.

 
> This error happens every few days, but only since the crash and rebuild of
> /shared. My guess is that the restored file system has a slightly different
> configuration (NFS or some extended ACLs).

It was happening before; perhaps you were just lucky.
Comment 3 Denis Roy CLA 2013-02-28 08:38:46 EST
Let's rename this bug.  I know Matt has played with ocfs2 in the past, but I think it would be worthwhile to examine what's out there.
Comment 4 Denis Roy CLA 2013-03-22 11:34:32 EDT
In informal discussions, as part of bug 403843, we considered putting the workspace on local (non-NFS) disks.
Comment 5 Denis Roy CLA 2013-06-05 10:11:36 EDT
I'd like to look at GlusterFS http://www.gluster.org/

I don't think it's part of the SLES repo though.
Comment 6 Gunnar Wagenknecht CLA 2013-07-05 12:11:26 EDT
Denis, we also run a small Hudson cluster, but we do not use NFS for our builds. Every Hudson has its own file system on local disks. After a build, Hudson has some support for copying artifacts from a slave back to the master (configurable).

I think that builds should happen on local disks only. If there are interesting build results to share, projects should be encouraged to copy/move them to "/shared" *after* a build. Thus, "/shared" can remain on NFS.
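The build-locally-then-publish idea could look roughly like this as a post-build shell step (all paths here are illustrative stand-ins, not the actual Eclipse layout):

```shell
# Hypothetical post-build publish step: build on a fast local disk, then
# copy the finished artifact to the NFS-backed shared area afterwards.
BUILD_DIR=/tmp/local-build             # stands in for a local, non-NFS workspace
SHARED_DIR=/tmp/shared/myproject       # stands in for a directory under /shared
mkdir -p "$BUILD_DIR" "$SHARED_DIR"
echo demo > "$BUILD_DIR/artifact.jar"  # pretend this is the build output

# Copy to a temporary name first, then rename: on the same filesystem the
# rename is atomic, so readers on NFS never see a half-copied artifact.
cp "$BUILD_DIR/artifact.jar" "$SHARED_DIR/artifact.jar.tmp"
mv "$SHARED_DIR/artifact.jar.tmp" "$SHARED_DIR/artifact.jar"
```

The copy-then-rename step matters here because other jobs may be reading from /shared while a publish is in progress.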
Comment 7 Denis Roy CLA 2013-07-05 13:31:19 EDT
Yep, with HIPP we won't need NFS for the workspace.  The Hudson app files will reside on NFS, in each of the HIPP user's $HOME, to facilitate launch from any HIPP server.
Comment 8 Denis Roy CLA 2013-07-05 13:34:36 EDT
(In reply to comment #6)
> I think that builds should happen on local discs only.

FWIW, one problem we have with the monolithic Hudson is that local disks are fairly small (sometimes 200G or less), so there's not much room even for temporary storage when you consider the number of jobs. But under HIPP this shouldn't be a problem.

On vhosts like LocationTech and Polarsys, where local disk space is at a premium, we created a large image file on an NFS server and mounted it locally via the loopback device. This "tricks" Linux into treating it as a local block device, enabling buffering and kernel-level caching. Performance is much better than NFS-only, and only slightly slower than a virtualized block device.
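The loopback trick could be sketched roughly as follows. All paths are hypothetical, the commands need root, and this is an illustration of the general technique rather than the actual vhost setup:

```shell
# Create a sparse 20 GB image file on the NFS mount (hypothetical path)
dd if=/dev/zero of=/nfs/images/build.img bs=1M count=0 seek=20480

# Put a local filesystem inside the image
mkfs.ext4 -q /nfs/images/build.img

# Loop-mount it: the kernel now sees a local block device and applies its
# normal buffering and page cache, even though the bytes live on NFS
mount -o loop /nfs/images/build.img /opt/buildspace
```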

HIPP will have the added advantage of running on bare iron with bare-iron disks, not virtualized.