Bug 354481 - file deletion failures caused by NFS problem ?
Summary: file deletion failures caused by NFS problem ?
Status: RESOLVED FIXED
Alias: None
Product: Community
Classification: Eclipse Foundation
Component: CI-Jenkins
Version: unspecified
Hardware: PC Linux
Importance: P3 normal
Target Milestone: ---
Assignee: Eclipse Webmaster CLA
Reported: 2011-08-11 07:41 EDT by Matthias Sohn CLA
Modified: 2011-08-12 10:34 EDT
Description Matthias Sohn CLA 2011-08-11 07:41:12 EDT
Since Aug 10, JGit tests have been failing on hudson.eclipse.org because they seem unable to delete files in the test cleanup phase [1]. The error messages all look like:

Error Message

ERROR: Failed to delete target/trash/test1313053715639_563/.nfs0000000006a5c59f00000356 in org.eclipse.jgit.api.MergeCommandTest.582
Stacktrace

java.lang.AssertionError: ERROR: Failed to delete target/trash/test1313053715639_563/.nfs0000000006a5c59f00000356 in org.eclipse.jgit.api.MergeCommandTest.582
	at org.junit.Assert.fail(Assert.java:91)
	at org.eclipse.jgit.junit.LocalDiskRepositoryTestCase.reportDeleteFailure(LocalDiskRepositoryTestCase.java:253)
	at org.eclipse.jgit.junit.LocalDiskRepositoryTestCase.recursiveDelete(LocalDiskRepositoryTestCase.java:230)
	at org.eclipse.jgit.junit.LocalDiskRepositoryTestCase.recursiveDelete(LocalDiskRepositoryTestCase.java:227)
	at org.eclipse.jgit.junit.LocalDiskRepositoryTestCase.tearDown(LocalDiskRepositoryTestCase.java:193)
	...

Could you check whether this is caused by some NFS problem?

[1] https://hudson.eclipse.org/hudson/job/jgit/809/#showFailuresLink
Comment 1 Denis Roy CLA 2011-08-11 08:10:02 EDT
The .nfs files are typically created when a file, opened by another process, is deleted. Each one is removed automatically when the process that is holding the file open releases it.

I'm not sure what your cleanup phase looks like, but it appears something is still using files while it's cleaning up.
Comment 2 Matthias Sohn CLA 2011-08-11 08:59:26 EDT
I failed to reproduce the problem on Mac OS X 10.6.8, Windows 7 and Ubuntu 11.04.

I'll inspect the recent source changes since the last successful build, maybe this rings some bells. I will also try building the recent commits on hudson to find out which change may have introduced the problem.
Comment 3 Denis Roy CLA 2011-08-11 09:23:35 EDT
(In reply to comment #2)
> I failed to reproduce the problem on Mac OS X 10.6.8, Windows 7 and Ubuntu
> 11.04.

You likely won't.  Any networked file system will introduce more lag than a local disk.

I remember reading similar reports in the past, where the problem was solved by using a linux "rm -rf" instead of a Java call.  I'll try to dig up the bug.
Comment 4 Matthias Sohn CLA 2011-08-11 10:14:18 EDT
I tried to rebuild a couple of older JGit versions on hudson and it turns out
that all of them now fail with the mentioned file deletion errors [1]. I also
tried to rebuild JGit 1.0 which was released with Indigo in June [2]. Also this
build fails. So it looks like something changed in the server infrastructure.

In general, many JGit JUnit tests create files, which are then deleted in the
JUnit tear-down phase that runs after each test. If a test fails to delete the
files it created, the test itself fails, so that we know it did not completely
clean up the resources it created during the test.

[1] https://hudson.eclipse.org/hudson/job/jgit/810/console
     https://hudson.eclipse.org/hudson/job/jgit/811/console
[2] https://hudson.eclipse.org/hudson/job/jgit/812/console
Comment 5 Denis Roy CLA 2011-08-11 10:51:47 EDT
> build fails. So it looks like something changed in the server infrastructure.

Something definitely has -- we've put in our new NFS servers.  I get the feeling that the new servers are so fast that delete calls are being returned successfully without the actual contents being written to disk.  The Java 'rm' call likely deleted directory contents first, then the actual directory.

For performance reasons, we export the /shared filesystem in "async" mode, which means writes (and deletes) return "OK" to the client while the server buffers the actual write.  This is not a problem for a Linux 'rm -rf' call, but Java's implementation differs, as described above.

What calls are you using to erase directories?
Comment 6 Matthias Sohn CLA 2011-08-11 11:36:29 EDT
Many tests (JGit has almost 2000 unit tests) use LocalDiskRepositoryTestCase.recursiveDelete(final String testName, final File dir, boolean silent, boolean failOnError) [1] to delete the files and folders created during the test. In the end this boils down to a depth-first walk over the directory tree to be deleted, which deletes the files and folders bottom-up using java.io.File.delete().

[1] line 217 in http://egit.eclipse.org/w/?p=jgit.git;a=blob;f=org.eclipse.jgit.junit/src/org/eclipse/jgit/junit/LocalDiskRepositoryTestCase.java;h=0c7ae7de72085fef8c48b5897f0a810957178af3;hb=HEAD
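
For readers without the source at hand, the bottom-up deletion described in this comment can be sketched roughly as follows. This is a minimal illustration, not the actual JGit code: the class name RecursiveDeleteSketch and the method deleteBottomUp are hypothetical, and symbolic links are not treated specially.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not LocalDiskRepositoryTestCase.recursiveDelete().
final class RecursiveDeleteSketch {

    /**
     * Deletes dir and everything below it, children before parents.
     * Returns the paths that java.io.File.delete() could not remove.
     */
    static List<File> deleteBottomUp(File dir) {
        List<File> failures = new ArrayList<>();
        delete(dir, failures);
        return failures;
    }

    private static void delete(File f, List<File> failures) {
        File[] children = f.listFiles(); // null when f is not a directory
        if (children != null) {
            for (File child : children) {
                delete(child, failures);
            }
        }
        // File.delete() signals failure through its return value,
        // not through an exception.
        if (!f.delete() && f.exists()) {
            failures.add(f);
        }
    }
}
```

A directory can only be removed once it is empty, so a single leftover entry (such as a .nfs file) makes the delete of every enclosing directory fail as well.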
Comment 7 Denis Roy CLA 2011-08-11 11:54:47 EDT
http://jenkins.361315.n4.nabble.com/xunit-plugin-error-hudson-try-to-remove-temp-used-dir-td956983.html

Looks familiar.

Is there any way you can catch the exception, wait a bit, then try again?

The easy way out here is to not export the NFS location async, but the performance hit could be huge (for everyone) if Java performs a java.io.File.delete() on each and every file, since Java will need to wait for the server's confirmation that the file was actually deleted.

Another option would be to replace your recursiveDelete() call with a call to the OS's rm -rf.
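
As a rough illustration of the rm -rf option, a test could shell out through ProcessBuilder. This is only a sketch under assumptions: the class RmRfSketch and the method rmRf are hypothetical names, it requires a Unix-like system with rm on the PATH, and this thread does not establish whether it would actually avoid the .nfs failures.

```java
import java.io.File;
import java.io.IOException;

// Hypothetical sketch; assumes a Unix-like system with "rm" on the PATH.
final class RmRfSketch {

    /** Runs "rm -rf -- <dir>" and returns true if rm exited with status 0. */
    static boolean rmRf(File dir) {
        try {
            Process p = new ProcessBuilder("rm", "-rf", "--", dir.getAbsolutePath())
                    .inheritIO() // let rm's error messages reach the build log
                    .start();
            return p.waitFor() == 0;
        } catch (IOException e) {
            return false; // rm could not be started
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return false;
        }
    }
}
```

The "--" argument keeps a path that begins with "-" from being parsed as an option.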
Comment 8 Denis Roy CLA 2011-08-11 12:09:17 EDT
For pure entertainment, I can switch from async to sync on-the-fly so that we can determine if this indeed fixes the problem, and also to see the impact on performance.  Interested in giving it a try?
Comment 9 Matthias Sohn CLA 2011-08-12 02:44:40 EDT
We could use delete-check-retry logic in the tests, but JGit itself also needs to delete files in some places, and I am not sure we want to have this logic in those places as well.

I would be interested in trying the tests with the file system switched to "sync". Could you switch that on and then start the jgit build job [1]?

[1] https://hudson.eclipse.org/hudson/job/jgit/
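
The delete-check-retry logic mentioned above might look roughly like this. It is a sketch only: RetryDeleteSketch, deleteWithRetry, and both parameters are hypothetical, and waiting can only help if whatever still holds the file releases it in the meantime.

```java
import java.io.File;

// Hypothetical sketch of delete-check-retry; not code from JGit.
final class RetryDeleteSketch {

    /**
     * Tries to delete f up to "attempts" times, sleeping waitMillis
     * between attempts. Returns true once f no longer exists.
     */
    static boolean deleteWithRetry(File f, int attempts, long waitMillis) {
        for (int i = 0; i < attempts; i++) {
            // delete: returns false on failure; check: is the path gone anyway?
            if (f.delete() || !f.exists()) {
                return true;
            }
            if (i < attempts - 1) {
                try {
                    Thread.sleep(waitMillis); // wait before the next attempt
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        return false;
    }
}
```

Because java.io.File.delete() reports failure by returning false rather than throwing, the retry loop checks the return value instead of catching an exception.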
Comment 10 Denis Roy CLA 2011-08-12 08:31:05 EDT
> build fails. So it looks like something changed in the server infrastructure.

FWIW, your build #806 succeeded last Tuesday.  The new NFS servers were put in place the Saturday prior.

https://hudson.eclipse.org/hudson/job/jgit/806/

Are you sure nothing has changed on _your_ end?
Comment 11 Denis Roy CLA 2011-08-12 09:38:07 EDT
Since Hudson was exhibiting many other problems (SCM polling, jobs never ending) we've completely restarted Hudson.  Since then your builds have been succeeding.  I tried with sync and async, and they both succeed.  We get a lower I/O wait in async mode, so we'll leave it as-is.
Comment 12 Matthias Sohn CLA 2011-08-12 10:34:50 EDT
(In reply to comment #10)
> > build fails. So it looks like something changed in the server infrastructure.
> 
> FWIW, your build #806 succeeded last Tuesday.  The new NFS servers were put in
> place the Saturday prior.
> 
> https://hudson.eclipse.org/hudson/job/jgit/806/
> 
> Are you sure nothing has changed on _your_ end?

We didn't touch the build job configuration, and the rebuild of JGit 1.0, which had succeeded earlier, also failed, so maybe this was caused by side effects of the Hudson hiccup.