| Summary: | Investigate alternatives to NFS for /shared | | |
|---|---|---|---|
| Product: | Community | Reporter: | Dennis Huebner <dennis.huebner> |
| Component: | CI-Jenkins | Assignee: | CI Admin Inbox <ci.admin-inbox> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | normal | | |
| Priority: | P3 | CC: | denis.roy, gunnar, mknauer, thanh.ha, webmaster |
| Version: | unspecified | | |
| Target Milestone: | --- | | |
| Hardware: | PC | | |
| OS: | Mac OS X | | |
| Whiteboard: | | | |
| Bug Depends on: | 403843 | | |
| Bug Blocks: | | | |
Description
Dennis Huebner
Same problem here with the RAP builds, e.g. these builds:
https://hudson.eclipse.org/hudson/job/rap-head-runtime/12/console
https://hudson.eclipse.org/hudson/job/rap-head-runtime/14/console

In our case there is no connection to a Buckminster installation, because we are building with Tycho/Maven. We see this error always in combination with a Hudson workspace delete and some .git/... files. This error happens every few days, but only since the crash and rebuild of /shared. My guess is that the restored file system has a slightly different configuration (NFS or some extended ACLs).

> Plz see:
> https://hudson.eclipse.org/hudson/job/MWE-Language-nightly-HEAD/2388/console
> ... and: https://hudson.eclipse.org/sandbox/job/xtext.gerrit/291/console

When you get a stale NFS handle on a read operation, chances are some other job or process deleted the file while another job was still using it. If you're using the same working directory on slave2 and sandbox, you're definitely looking for trouble. Hudson does not lock files.

ls: cannot access /opt/public/common/buckminster-4.2/configuration/org.eclipse.core.runtime/.mainData.145: No such file or directory

(In reply to comment #1)
> We see this error always in combination with
> a Hudson workspace delete and some .git/... files.

I believe this is a known issue with either Hudson or the Java API that recursively deletes a directory. One project's workaround was to issue an OS system call ("rm -rf"), since that behaves correctly over NFS-mounted filesystems. I'm not sure what alternatives we have for shared filesystems that are neither inordinately complex to manage nor painfully slow.

> This error happens every few days, but only since the crash and rebuild of
> /shared. My guess is that the restored file system has a slightly different
> configuration (NFS or some extended ACLs).

It was happening before; perhaps you were just lucky. Let's rename this bug. I know Matt has played with ocfs2 in the past, but I think it would be worthwhile to examine what's out there. In informal discussions, as part of bug 403843, we could imagine the workspace being on local (non-NFS) disks.

I'd like to look at GlusterFS (http://www.gluster.org/). I don't think it's part of the SLES repo, though.

Denis, we also run a small Hudson cluster, but we do not use NFS for our builds. Every Hudson has its own filesystem on local discs. After a build, Hudson has some support for copying artifacts from a slave back to the master (configurable). I think that builds should happen on local discs only. If there are interesting build results to share, projects should be encouraged to copy/move them to "/shared" *after* a build. Thus, "/shared" can still remain on NFS.

Yep, with HIPP we won't need NFS for the workspace. The Hudson app files will reside on NFS, in each HIPP user's $HOME, to facilitate launch from any HIPP server.
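For reference, a minimal sketch of the "rm -rf" workaround mentioned above, written as a pre-build shell step in a job. This is illustrative only; $WORKSPACE is the variable Hudson sets to the job's working directory, and the globs are just one way to catch dot-entries such as .git:

```bash
# Clear the workspace with an OS-level delete instead of relying on
# Hudson's Java-based recursive delete, which has been seen to trip over
# NFS-mounted workspaces (stale handles, leftover .nfs* placeholder files).
# -f keeps the command quiet if a glob matches nothing.
rm -rf "$WORKSPACE"/* "$WORKSPACE"/.[!.]* "$WORKSPACE"/..?*
```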
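If we do evaluate GlusterFS, a rough sketch of what a replicated volume backing /shared could look like is shown below. Hostnames (node1/node2), brick paths, and the volume name are made up for illustration, and the glusterfs packages would have to come from outside the stock SLES repositories:

```bash
# On node1: join the second storage node to the trusted pool.
gluster peer probe node2

# Create and start a two-way replicated volume backed by a local
# brick directory on each storage node.
gluster volume create shared replica 2 node1:/bricks/shared node2:/bricks/shared
gluster volume start shared

# On a build slave: mount the volume with the GlusterFS FUSE client.
mount -t glusterfs node1:/shared /shared
```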
(In reply to comment #6)
> I think that builds should happen on local discs only.

FWIW, one problem we have with the monolithic Hudson is that the local disks are fairly small (sometimes 200G or less), so there's not much room even for temporary storage when you consider the number of jobs. But under HIPP this shouldn't be a problem.

On vhosts like LocationTech and Polarsys, where local disk space is at a premium, we created a large image file on an NFS server and mounted it locally via the loopback device. This "tricks" Linux into treating it as a local block device, enabling buffers and kernel-level caching. Performance is much better than NFS-only, and only slightly slower than a virtualized block device. HIPP will have the added advantage of running on bare iron with bare-iron disks, not virtualized.
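A rough sketch of that loopback-image trick, with made-up paths and sizes. The image file itself lives on the NFS-backed path, but the filesystem inside it is mounted through the loop driver, so the kernel caches it like a local block device:

```bash
# Create a ~100G sparse image file on the NFS-backed path (path is illustrative).
dd if=/dev/zero of=/shared/images/workspace.img bs=1M count=0 seek=102400

# Put a local filesystem inside the image (-F: proceed even though the
# target is a regular file, not a block device).
mkfs.ext4 -F /shared/images/workspace.img

# Mount it via the loopback device; normal page cache and buffering apply.
mount -o loop /shared/images/workspace.img /opt/hudson/workspaces
```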