Some Eclipse Foundation services are deprecated, or will be soon. Please ensure you've read this important communication.
Bug 177400 - 4 failures in org.eclipse.core.tests.resources
Summary: 4 failures in org.eclipse.core.tests.resources
Status: RESOLVED FIXED
Alias: None
Product: Platform
Classification: Eclipse Project
Component: Resources (show other bugs)
Version: 3.3   Edit
Hardware: Power PC Linux-GTK
: P2 major (vote)
Target Milestone: 3.3 M7   Edit
Assignee: John Arthorne CLA
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2007-03-14 14:55 EDT by Corey Ashford CLA
Modified: 2008-10-06 21:55 EDT (History)
2 users (show)

See Also:


Attachments
liblocalfile compiled on SLES 10 (41.57 KB, application/octet-stream)
2007-03-14 20:44 EDT, Corey Ashford CLA
no flags Details
Fixes problem with clsConvert object being garbage collected between calls to getString (938 bytes, patch)
2007-04-05 19:59 EDT, Corey Ashford CLA
john.arthorne: iplog+
Details | Diff
Liblocalfile shared object compiled on Linux ppc RHEL3, tested on SLES 10 (41.58 KB, application/octet-stream)
2007-04-05 21:02 EDT, Corey Ashford CLA
no flags Details
Updated patch (1.27 KB, patch)
2007-04-09 10:54 EDT, John Arthorne CLA
no flags Details | Diff
Updated patch (1.22 KB, patch)
2007-04-13 12:55 EDT, John Arthorne CLA
no flags Details | Diff
Updated patch (1.39 KB, patch)
2007-04-13 16:53 EDT, John Arthorne CLA
no flags Details | Diff
liblocalfile for linux-gtk-ppc (41.52 KB, application/octet-stream)
2007-04-13 17:43 EDT, Corey Ashford CLA
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Corey Ashford CLA 2007-03-14 14:55:46 EDT
Build ID: I20070222-0951

Steps To Reproduce:
1. Run the Eclipse automated test suite on the target "coreresources".
2. Note failures in the following four tests
testBrokenSymlinkAttributes
testRecursiveSymlink
testSymlinkAttributes
testSymlinkPutInfo

More information:
I believe these failures are occurring on linux-ppc-gtk only, and started occurring in 3.3M5 and have continued in 3.3M5eh.  I haven't checked, but these tests could be new, and therefore may not be regressions.

From a small amount of debugging I've done so far, it appears that when the Java code accesses the file info, it is getting back values that are all zeroes, especially for the attribute value, which is what the test is checking for its assertions.
Comment 1 Corey Ashford CLA 2007-03-14 20:44:18 EDT
Created attachment 60882 [details]
liblocalfile compiled on SLES 10

I have found that if I re-compile the source for liblocalfile in the org.eclipse.core.filesystem project on SLES 10, and then use that liblocalfile.so in the org.eclipse.core.filesystem.linux.ppc plug-in that all of the tests in org.eclipse.core.tests.resources now pass.

The problem here is that the liblocalfile that's in the Eclipse release was created on SLES 9, but doesn't work completely right on SLES 10.  A liblocalfile created under SLES 10 works on SLES 10, but causes core dumps when used on earlier Linux releases.

It seems like there are two avenues to fix this:

1) Create a new release specifically for SLES 10 / RHEL 5 and newer Linux ppc releases.
2) Create a version of liblocalfile that is compatible with both OS's.

2 would be preferable if it was easy.  I suspect that there are differences in the layout of the data structures in the system .h files that are used by the liblocalfile code, and that is creating a binary incompatibility.

If anyone has suggestions about how to create a liblocalfile that is compatible with both OS releases, without intensely ugly code, I'd be interested.  Are there other OS's where this same problem has arisen?

Otherwise, how hard is it to release two different versions of Eclipse, one with  a version of liblocal file for older Linux ppc releases and one for newer releases?

For completeness, I have attached a copy of the liblocalfile compiled on SLES 10.
Comment 2 Kim Moir CLA 2007-04-03 09:09:14 EDT
John, Corey asked me to look at this bug but I'm not sure what could be causing this problem. Corey is running linux-gtk-ppc tests and contributing them back to us for inclusion on the web page. Do you have any ideas on how to solve the issue described above?
Comment 3 John Arthorne CLA 2007-04-03 10:50:02 EDT
Obviously 2) would be the ideal solution, but I know very little about the different flavours of PPC linux and what might be causing the problem here.  The native C code used by resources is pretty basic stuff.  Do you get the same compatibility problems when compiling on REL 4 vs. REL 5? I'm just wondering if there is one platform that will produce a compiled library that will run on all the others.  Our plan target platforms are SLES 10 and REL 4, so our most important task is to ensure we have a library that runs on those two flavours.  Does the attached patch also work ok on REL 4?

The only way I can think of doing 1) would be to define a new architecture constant for SLES10 PPC to distinguish from other PPC distros.  This seems pretty drastic.
Comment 4 Corey Ashford CLA 2007-04-03 14:15:43 EDT
I haven't tested on RHEL 4, actually.  I initially built the liblocalfile on SLES 10, but then had complaints that the resulting .so was causing crashes on FC9.  So I rebuilt it there and tried it on SLES 10, and it still worked.  But later, tests were added that failed on SLES 10.

This information about RHEL 4 and SLES 10 being the targets of Eclipse is interesting.  I was unaware of this, and I will try building and testing on RHEL 4 and see if it works on SLES 10 as well.

Thanks for your help.


Comment 5 John Arthorne CLA 2007-04-03 16:57:03 EDT
Sorry, re-reading this bug report fully, I think I should give more background info.

New support was added in the Linux native code to support symbolic links (see bug 170317).  This change requires recompilation of the native code.  The tests that are failing are new tests added to exercise this new functionality, so it's no surprise that they are failing on PPC, since we never recompiled the native code on that platform.  

So, if you can provide a library based on the latest native source, I can put it in the builds.  Ideally this should run on as many distros as possible, so I'll leave you to investigate that (I don't even have a PPC machine).  
Comment 6 John Arthorne CLA 2007-04-03 17:03:26 EDT
I realized we will have the same problem for 64-bit Linux - I have entered bug 180806 to track the fact that the natives need to be recompiled there too.
Comment 7 Corey Ashford CLA 2007-04-05 13:38:52 EDT
Thank you for the information about the new symlink functionality.  I had made an assumption that these were just new added tests.

I think something I have done has regressed my set-up as I am having some problems getting even the SLES10-compiled version of liblocalfile not to crash in these symlink tests.  It is crashing inside getString (stack trace shown below).  If anyone has some tips on debugging these shared objects, I'd appreciate seeing them.

3HPNATIVESTACK         Native Stack of "main" PID 18464
NULL                   -------------------------
3HPSTACKLINE            clIsSubclassOf at FA722D8 in libjvm.so
3HPSTACKLINE            101EE688
3HPSTACKLINE            xeRunJniMethod at FB61DA8 in libjvm.so
3HPSTACKLINE            ?? at FA2AC64 in libjvm.so
3HPSTACKLINE            ?? at FA30584 in libjvm.so
3HPSTACKLINE            getString at BE3AFD0 in liblocalfile_1_0_0.so
3HPSTACKLINE            Java_org_eclipse_core_internal_filesystem_local_LocalFileNatives_internalGetFileInfo at BE3B6E0 in liblocalfile_1_0_0.so
3HPSTACKLINE            sysInvokeNative at FB4848C in libjvm.so
3HPSTACKLINE            mmipInvokeJniMethod at FB56F6C in libjvm.so
3HPSTACKLINE            mmipExecuteJava at FB526BC in libjvm.so
3HPSTACKLINE            xeRunJvmMethod at FB62084 in libjvm.so
3HPSTACKLINE            JVM_InvokeMethod at FA562C4 in libjvm.so
3HPSTACKLINE            JVM_InvokeMethod at FDCE848 in libjvm.so
3HPSTACKLINE            Java_sun_reflect_NativeMethodAccessorImpl_invoke0 at F988A00 in libjava.so
3HPSTACKLINE            sysInvokeNative at FB4848C in libjvm.so
3HPSTACKLINE            mmipInvokeJniMethod at FB56F6C in libjvm.so
3HPSTACKLINE            mmipExecuteJava at FB526BC in libjvm.so
3HPSTACKLINE            xeRunJvmMethod at FB62084 in libjvm.so
3HPSTACKLINE            JVM_InvokeMethod at FA562C4 in libjvm.so
3HPSTACKLINE            JVM_InvokeMethod at FDCE848 in libjvm.so
3HPSTACKLINE            Java_sun_reflect_NativeMethodAccessorImpl_invoke0 at F988A00 in libjava.so
3HPSTACKLINE            sysInvokeNative at FB4848C in libjvm.so
3HPSTACKLINE            mmipInvokeJniMethod at FB56F6C in libjvm.so
3HPSTACKLINE            mmipInvokeLazyJniMethod at FB57A90 in libjvm.so
3HPSTACKLINE            mmipExecuteJava at FB526BC in libjvm.so
3HPSTACKLINE            xeRunJniMethod at FB61B34 in libjvm.so
Comment 8 Corey Ashford CLA 2007-04-05 18:21:02 EDT
Upon further investigation, by turning on "-Xcheck:jni:noadvice,pedantic" to the IBM JVM, which turns on some checking in the JNI calls, I found the following:

In the C file in org.eclipse.core.filesystem/natives/unix/localfile.c, the second parameter, clsConvert, to CallStaticObjectMethod in getString is coming up as an invalid jobject.  Here is the relevant code excerpt:

jstring getString(JNIEnv *env, jbyteArray source) {
        static jclass clsConvert = 0;
        static jmethodID midFromPlatformBytes = 0;
    if (midFromPlatformBytes == 0) {
        clsConvert = (*env)->FindClass(env, "org/eclipse/core/internal/filesystem/local/Convert");
        if (clsConvert == 0) return NULL;
        midFromPlatformBytes = (*env)->GetStaticMethodID(env, clsConvert, "fromPlatformBytes", "([B)Ljava/lang/String;");
        if (midFromPlatformBytes == 0) return NULL;
    }
    return (*env)->CallStaticObjectMethod(env, clsConvert, midFromPlatformBytes, source);
}


This bad jobject happens only on the second call to getString where the function assumes that both clsConvert and midFromPlatformBytes are already initialized.

I created a local copy of this C file and modified it to dump out the values of clsConvert before calling CallStaticObjectMethod.  In both cases, the first call, and subsequent calls, the value of clsConvert was identical, meaning that this object pointer itself had not become corrupted.  However, the object it points to appears to have changed between the two calls, and my suspicion is that the object has been disposed of because the JVM doesn't know that there's an existing pointer to the object anymore.

I modified the line:

if (midPlatformBytes == 0) {

to 

if (1) {

which forces getString to look up the Convert class everytime.  This fixed the failure I am seeing.  I should note that I had seen intermittant failures before too, but figured I just had an inconsistant test environment set up.  However, recently, for some reason, the failures in these tests have been easily reproducible, failing on every run.

I don't think this is the best solution.  Is there a way to tell the JVM not to dispose of an object that the shared object wants to hang onto?
Comment 9 Corey Ashford CLA 2007-04-05 19:59:21 EDT
Created attachment 63130 [details]
Fixes problem with clsConvert object being garbage collected between calls to getString

Adds a call to NewGlobalRef to fix a problem with clsConvert being garbage collected between calls to getString, which was causing a crash on linux-gtk-ppc running the IBM JVM (1.4.2, 5.0, 6.0)
Comment 10 Corey Ashford CLA 2007-04-05 21:02:18 EDT
Created attachment 63135 [details]
Liblocalfile shared object compiled on Linux ppc RHEL3, tested on SLES 10

This shared object was built on RHEL3 (for backward compatibility) using the source file fix to localfile.c in this same bugzilla.

The Symlink tests pass with this shared object.  Please consider updating this shared object into the release for 3.3M7.
Comment 11 John Arthorne CLA 2007-04-08 13:12:22 EDT
Good find, Corey.

I'm not an expert on JNI, but I think we should just be looking up the class and method ids each time we call it.  This is what the rest of the native code in localfile.c does.  Presumably this creates something like a java.lang.reflect.Method object, which can certainly be garbage collected when no longer used.  I'm increasing severity/priority because it seems like there is a risk of this causing errors on all Linux platforms.

CCing Martin who wrote this method for any comments on why it might have been done this way.
Comment 12 John Arthorne CLA 2007-04-09 10:54:54 EDT
Created attachment 63263 [details]
Updated patch

Won't we need to create a global reference to the method id as well? Here is a suggested updated patch.

Useful doc: http://java.sun.com/docs/books/jni/html/refs.html
Comment 13 Corey Ashford CLA 2007-04-09 15:09:13 EDT
I was under the impression that jmethodID was just an integer, but I just looked and you are right, it's another opaque structure pointer, implying that it's pointing to some Java object that could be garbage collected.

So this updated patch looks good, thanks.
Comment 14 John Arthorne CLA 2007-04-09 16:16:15 EDT
Corey, I don't have a PPC machine on hand, so if you could attach a compiled library based on the latest source it would be appreciated.
Comment 15 Corey Ashford CLA 2007-04-09 17:27:05 EDT
The Updated Patch does not compile.  The jmethodID type is not derived from, nor another name for, a jobject, even though it is an opaque pointer.  Therefore, it cannot be passed to NewGlobalRef which requires a jobject, at least according to the jni.h from IBM Java 1.4.2, 5.0, and 6.0.

From this I surmise that the ID may actually be an integer which can be held in the shared object's space without fear of garbage collection.  I don't have direct evidence for this, but the fact that it cannot be passed to NewGlobalRef is a hint.

If this is the case, and I have no evidence to the contrary, the original patch I submitted, along with the liblocalfile binary are correct.

If you are ok with reverting to the previous patch, I think I can "unobsolete" it :)
Comment 16 Martin Oberhuber CLA 2007-04-10 07:10:59 EDT
As John said in comment #5, the original problem appeared because new functionality and new tests were added in M5, and the native lib had not been recompiled for all these platforms.

For Platforms other than Linux this would not matter, since the new features would only be advertised on Linux so they were ignored otherwise. All Linux variants, however, should be equipped with the new native lib. I do not think that different versions for RHEL4 and RHEL5 are necessary.

The patch looks good. Trying to persistently cache the object IDs and method IDs just came out of my lack of experience with JNI. I had thought that since the "Convert" class was all-static it would survive, but apparently I was wrong. I later discovered that such references could get lost along the way, and adding a global Ref seems to be the best possible fix for this.

Whether method IDs can also get stale I don't know, but if tests run OK I'd agree with Corey to keep the code as-is.
Comment 17 John Arthorne CLA 2007-04-13 12:55:48 EDT
Created attachment 63776 [details]
Updated patch

Here is another update in light of the new information. I'd rather not hang onto the method id in a static since we don't know much about how this is represented.  The rest of our native code does method lookup as needed and the performance seems to be good. The doc I have read on caching seems to use static globals to hold onto classes, but not methods.
Comment 18 Corey Ashford CLA 2007-04-13 16:01:29 EDT
That patch does not look right because the method id is only going to be looked up on the first call, and then discarded.  On the second call, it will be uninitialized when CallStaticObjectMethod is called.

If the method id is not going to be static, the initialization for it needs to be moved outside of the "if" statement.
Comment 19 John Arthorne CLA 2007-04-13 16:53:49 EDT
Created attachment 63796 [details]
Updated patch

Sigh... this would be a lot easier if I had a Linux machine available to test it.
Comment 20 Corey Ashford CLA 2007-04-13 17:43:20 EDT
Created attachment 63803 [details]
liblocalfile for linux-gtk-ppc

Liblocalfile shared object compiled on Linux ppc RHEL3, tested on SLES 10, based on latest updated patch.
Comment 21 John Arthorne CLA 2007-04-16 11:27:26 EDT
Thanks Corey, I have released your fix for Linux PPC.  Leaving this open until I get a chance to release a corresponding fix for Linux x86.
Comment 22 John Arthorne CLA 2007-04-16 13:09:00 EDT
Fix released on Linux x86.
Comment 23 Martin Oberhuber CLA 2007-04-18 12:24:15 EDT
John - May I ask on what version of Linux x86 and with what compiler you are building? If you are interested, I have built a version on this:

platform:         i686-pc-linux-gnu
OS Name:          Red Hat Linux release 6.2 (Zoot)
OS Version:       Linux redhat 2.2.14-5.0 #1 (glibc-2.1.3-15)
Compiler version: gcc-2.95.2 19991024 (release)

I built it with -O -g rather than just the -g switch which currently is in the Makefile.
Comment 24 John Arthorne CLA 2007-04-18 13:16:41 EDT
Here is what I used:

gcc v2.96 20000731 
red hat 7.3 2.96-12

I'm not familiar with how compatible the various RedHat versions are. Are there LInux versions that won't work with the library I compiled? Is there an advantage to compiling on a different version? Our main goal is to be compatible with the OS versions listed on the Eclipse project 3.3 plan.
Comment 25 Martin Oberhuber CLA 2007-04-19 06:18:44 EDT
RH 7.3 should be ancient enough to be compatible with any platform on which the Eclipse Resources / Local File System is practically run.

However, as part of my work on bug 183137 (contributing Solaris Support) I also built it on Redhat 6.2 - just to verify that the code is still good after my changes. Feel free to get it from the attachment on bug 183137, though keeping the RH 7.3 build may be the better idea ("never change a running system" :-)