Since Aug 10, JGit tests have been failing on hudson.eclipse.org because they seem to be unable to delete files in the test cleanup phase [1]. The error messages all look like this:

Error Message
ERROR: Failed to delete target/trash/test1313053715639_563/.nfs0000000006a5c59f00000356 in org.eclipse.jgit.api.MergeCommandTest.582

Stacktrace
java.lang.AssertionError: ERROR: Failed to delete target/trash/test1313053715639_563/.nfs0000000006a5c59f00000356 in org.eclipse.jgit.api.MergeCommandTest.582
    at org.junit.Assert.fail(Assert.java:91)
    at org.eclipse.jgit.junit.LocalDiskRepositoryTestCase.reportDeleteFailure(LocalDiskRepositoryTestCase.java:253)
    at org.eclipse.jgit.junit.LocalDiskRepositoryTestCase.recursiveDelete(LocalDiskRepositoryTestCase.java:230)
    at org.eclipse.jgit.junit.LocalDiskRepositoryTestCase.recursiveDelete(LocalDiskRepositoryTestCase.java:227)
    at org.eclipse.jgit.junit.LocalDiskRepositoryTestCase.tearDown(LocalDiskRepositoryTestCase.java:193)
    ...

Could you check if this is caused by some NFS problem?

[1] https://hudson.eclipse.org/hudson/job/jgit/809/#showFailuresLink
The .nfs files are typically created when a file that is still open in another process is deleted. Each one is removed automatically when the process holding the file open releases it. I'm not sure what your cleanup phase looks like, but it appears something is still using files while it's cleaning up.
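To illustrate the point about open files, here is a minimal sketch (not taken from the JGit sources) of a helper that closes its stream before deleting the file, so that no handle is left open at delete time. The class name `CloseBeforeDelete` and the method `writeThenDelete` are made up for this example.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class CloseBeforeDelete {
    // Hypothetical helper: writes one byte, closes the stream, then deletes.
    // Returns true if the file is gone afterwards.
    static boolean writeThenDelete(File file) throws IOException {
        try (OutputStream out = new FileOutputStream(file)) {
            out.write(42);
        } // the stream is closed here, before delete() is called
        return file.delete() && !file.exists();
    }

    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("nfs-demo", ".bin");
        System.out.println("deleted cleanly: " + writeThenDelete(tmp));
    }
}
```

If the stream were still open when delete() ran on an NFS mount, the client would be expected to rename the file to a .nfs name rather than remove it, which matches the file names in the error message above.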
I failed to reproduce the problem on Mac OS X 10.6.8, Windows 7 and Ubuntu 11.04. I'll inspect the source changes made since the last successful build; maybe that rings a bell. I will also try building the recent commits on hudson to find out which change may have introduced the problem.
(In reply to comment #2)
> I failed to reproduce the problem on Mac OS X 10.6.8, Windows 7 and Ubuntu
> 11.04.

You likely won't. Any networked file system will introduce more lag than a local disk. I remember reading similar reports in the past, where the problem was solved by using a Linux "rm -rf" instead of a Java call. I'll try to dig up the bug.
I tried to rebuild a couple of older JGit versions on hudson, and it turns out that all of them now fail with the file deletion errors mentioned above [1]. I also tried to rebuild JGit 1.0, which was released with Indigo in June [2]. This build fails as well. So it looks like something changed in the server infrastructure.

In general, many JGit JUnit tests create files, which are then deleted in the JUnit tear-down phase that runs after each test. If a test fails to delete the files it created, it is also reported as failing, to let us know that it doesn't completely clean up the resources it created during the test.

[1] https://hudson.eclipse.org/hudson/job/jgit/810/console
    https://hudson.eclipse.org/hudson/job/jgit/811/console
[2] https://hudson.eclipse.org/hudson/job/jgit/812/console
> build fails. So it looks like something changed in the server infrastructure.

Something definitely has -- we've put in our new NFS servers. I get the feeling that the new servers are so fast that delete calls are being returned successfully without the actual contents being written to disk. The Java 'rm' call likely deletes the directory contents first, then the directory itself. For performance reasons, we export the /shared filesystem in "async" mode, which means writes (and deletes) return "OK" to the client while the server buffers the actual write. This is not a problem for a Linux 'rm -rf' call, but Java's implementation differs as described above. What calls are you using to erase directories?
Many tests (JGit has almost 2000 unit tests) use LocalDiskRepositoryTestCase.recursiveDelete(final String testName, final File dir, boolean silent, boolean failOnError) [1] to delete the files and folders created during the test. In the end this boils down to a depth-first walk over the directory tree to be deleted, which deletes the files and folders bottom-up using java.io.File.delete().

[1] line 217 in http://egit.eclipse.org/w/?p=jgit.git;a=blob;f=org.eclipse.jgit.junit/src/org/eclipse/jgit/junit/LocalDiskRepositoryTestCase.java;h=0c7ae7de72085fef8c48b5897f0a810957178af3;hb=HEAD
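For readers who don't want to open the link, the general shape of such a bottom-up delete can be sketched as follows. This is a simplified illustration, not the actual JGit implementation: the class `BottomUpDelete` and its method are invented for this example, and the real recursiveDelete() takes additional parameters (testName, silent, failOnError) that are omitted here.

```java
import java.io.File;

public class BottomUpDelete {
    // Deletes dir and everything below it, children first.
    // Returns true only if every delete() call succeeded.
    static boolean deleteRecursively(File dir) {
        boolean ok = true;
        File[] children = dir.listFiles(); // null if dir is not a directory
        if (children != null) {
            for (File child : children) {
                ok &= deleteRecursively(child);
            }
        }
        // File.delete() signals failure by returning false, not by throwing,
        // and it refuses to remove a directory that still has entries.
        return dir.delete() && ok;
    }
}
```

Because a directory can only be removed once it is empty, a single leftover entry (such as a .nfs file) makes the delete of every enclosing directory fail too.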
http://jenkins.361315.n4.nabble.com/xunit-plugin-error-hudson-try-to-remove-temp-used-dir-td956983.html

Looks familiar. Is there any way you can catch the failure, wait a bit, then try again?

The easy way out here is to not export the NFS location async, but the performance hit could be huge (for everyone) if Java performs a java.io.File.delete() on each and every file, since Java will need to wait for the server's confirmation that the file was actually deleted.

Another option would be to replace your recursiveDelete() call with a call to the OS's rm -rf.
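A rough sketch of the rm -rf option, assuming a Unix-like build slave with rm on the PATH. The class `ShellDelete` and the method `rmRf` are hypothetical names for this example, and nothing in this thread confirms that this would actually avoid the failure on the NFS mount.

```java
import java.io.File;
import java.io.IOException;

public class ShellDelete {
    // Runs "rm -rf -- <dir>" and returns true if rm exited with status 0.
    // Assumption: a Unix-like system; this will not work on Windows.
    static boolean rmRf(File dir) throws IOException, InterruptedException {
        Process p = new ProcessBuilder("rm", "-rf", "--", dir.getAbsolutePath())
                .inheritIO() // let rm's error messages show up in the build log
                .start();
        return p.waitFor() == 0;
    }
}
```

The obvious cost is portability: tests that currently run on Windows would need a fallback to the Java-based delete.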
For pure entertainment, I can switch from async to sync on-the-fly so that we can determine if this indeed fixes the problem, and also to see the impact on performance. Interested in giving it a try?
We could use delete-check-retry logic in the tests, but JGit itself also needs to delete files in some places, and I am not sure we want to have this logic in those places as well. I would be interested to try the tests with the file system switched to "sync". Could you switch that on and then start the jgit build job [1]?

[1] https://hudson.eclipse.org/hudson/job/jgit/
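For reference, delete-check-retry logic could look roughly like the sketch below. The class name, method name and parameters are all assumptions made for this example; suitable values for the attempt count and the sleep interval would have to be found by experiment.

```java
import java.io.File;

public class RetryingDelete {
    // Tries file.delete() up to maxAttempts times, sleeping between attempts.
    // Returns true as soon as the file no longer exists, false if it is
    // still there after the last attempt.
    static boolean deleteWithRetry(File file, int maxAttempts, long sleepMillis)
            throws InterruptedException {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (file.delete() || !file.exists()) {
                return true;
            }
            if (attempt < maxAttempts) {
                Thread.sleep(sleepMillis); // give the file system time to catch up
            }
        }
        return false;
    }
}
```

A call such as `deleteWithRetry(f, 5, 100)` would wait at most about 400 ms in total before giving up, so a test that genuinely leaks a file would still be reported as failing.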
> build fails. So it looks like something changed in the server infrastructure.

FWIW, your build #806 succeeded last Tuesday. The new NFS servers were put in place the Saturday prior.

https://hudson.eclipse.org/hudson/job/jgit/806/

Are you sure nothing has changed on _your_ end?
Since Hudson was exhibiting many other problems (SCM polling, jobs never ending) we've completely restarted Hudson. Since then your builds have been succeeding. I tried with sync and async, and they both succeed. We get a lower I/O wait in async mode, so we'll leave it as-is.
(In reply to comment #10)
> > build fails. So it looks like something changed in the server infrastructure.
>
> FWIW, your build #806 succeeded last Tuesday. The new NFS servers were put in
> place the Saturday prior.
>
> https://hudson.eclipse.org/hudson/job/jgit/806/
>
> Are you sure nothing has changed on _your_ end?

We didn't touch the build job configuration, and the rebuild of JGit 1.0, which had succeeded earlier, also failed. So maybe this was caused by side effects of the Hudson hiccup.