Some Eclipse Foundation services are deprecated, or will be soon. Please ensure you've read this important communication.
Bug 338697 - NullPointerException in BundleLoader.getImportedSources
Summary: NullPointerException in BundleLoader.getImportedSources
Status: RESOLVED FIXED
Alias: None
Product: Equinox
Classification: Eclipse Project
Component: Framework (show other bugs)
Version: 3.5.2   Edit
Hardware: All All
: P3 major (vote)
Target Milestone: 3.7 M7   Edit
Assignee: Thomas Watson CLA
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2011-03-02 11:09 EST by Matt Lavin CLA
Modified: 2011-03-29 17:04 EDT (History)
1 user (show)

See Also:


Attachments
A project with a testcase and testdata (105.06 KB, application/force-download)
2011-03-02 11:09 EST, Matt Lavin CLA
no flags Details
work in progress 3.5.x patch (6.49 KB, patch)
2011-03-25 12:27 EDT, Thomas Watson CLA
no flags Details | Diff
updated patch for 3.7 only (13.09 KB, patch)
2011-03-29 16:01 EDT, Thomas Watson CLA
no flags Details | Diff
test case (4.72 KB, patch)
2011-03-29 17:02 EDT, Thomas Watson CLA
no flags Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Matt Lavin CLA 2011-03-02 11:09:18 EST
Created attachment 190175 [details]
A project with a testcase and testdata

The Jazz Foundation Server from jazz.net uses OSGi on the server, and we've seen the following exception a number of times in various installs:

Caused by: java.lang.NullPointerException
	at org.eclipse.osgi.internal.loader.BundleLoader.getImportedSources(BundleLoader.java:239)
	at org.eclipse.osgi.internal.loader.BundleLoader.findImportedSource(BundleLoader.java:1040)
	at org.eclipse.osgi.internal.loader.BundleLoader.findClassInternal(BundleLoader.java:436)
	at org.eclipse.osgi.internal.loader.BundleLoader.findClass(BundleLoader.java:404)
	at org.eclipse.osgi.internal.loader.BundleLoader.findClass(BundleLoader.java:392)
	at org.eclipse.osgi.internal.baseadaptor.DefaultClassLoader.loadClass(DefaultClassLoader.java:105)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:619)
	at org.eclipse.osgi.internal.loader.BundleLoader.loadClass(BundleLoader.java:320)
	at org.eclipse.osgi.framework.internal.core.BundleHost.loadClass(BundleHost.java:231)
	at org.eclipse.osgi.framework.internal.core.AbstractBundle.loadBundleActivator(AbstractBundle.java:150)
	... 38 more


I recently spent some time debugging the problem to find the root cause and here is what I've found so far.  The reason for the NPE is that the .state or .lazy files seem to be invalid.  The files themselves are readable, but the data inside isn't consistent.  Specifically, when using the StateReader.fullyLoad((BundleDescriptionImpl)bundleDescription) with the state files in my environment, the persisted object references, and tries to load, an indexed object that hasn't been loaded yet.  

The method StateReader.getFromObjectTable(int) does not check that the given index is in the map, and so a null is returned which ultimately leads to our NPE.  Looking at the code, it would appear that the bundle dependencies used to load dependent indexed objects are not enough cause the index object to be loaded.

I've attached a test project that includes my .state files and a testcase that demonstrates the problem
Comment 1 Thomas Watson CLA 2011-03-03 13:44:31 EST
Can you please verify the version of org.eclipse.osgi you are using.  You opened the bug against 3.5.2.  I just want to make sure that is what you are using.

I am having a difficult time figuring out how the persistent resolver cache got into this state.  It appears we have duplicate ExportPackageDescription objects flying around in the state.  I suspect this has something to do with how we unload state objects from memory after a period of no use.  Perhaps there is some bug that is letting the objects be flushed from the lazy data from each bundle but there are still dangling references to the objects from other bundles.  Something like that could lead to duplicates getting loaded back into memory, this would lead to duplicate objects getting persisted to .stata/.lazy but I am not seeing a way that can happen with the code in 3.5.2.
Comment 2 Matt Lavin CLA 2011-03-03 16:49:35 EST
The version of org.eclipse.osgi that I'm using is 3.5.2.R35x_v20100126.
Comment 3 Thomas Watson CLA 2011-03-10 12:37:30 EST
I still have no idea how this is happening.  My theory is that we are flushing the objects from memory and then not sharing them properly when we reload them, causing duplicates in memory.  This causes issues when persisting the resolution state.  If that is what is going on, then setting the following configuration option should be a workaround:

osgi.lazyStateUnloadingTime=0

I have spent quite a bit of time analyzing the code and I still cannot see a window where something like this could happen.  But it would be very useful if you could try the workaround and let me know if it helps fix the issue.  Keep in mind that you need to set this value from the very start before the corruption occurs.  It will not help to set it after the first failure since the resolver cache will be corrupted still.  Performing a -clean startup should fix a corrupted cache.
Comment 4 Thomas Watson CLA 2011-03-25 12:27:28 EDT
Created attachment 191921 [details]
work in progress 3.5.x patch

There is a window where the State can be modified while persisting the data.  This could lead to some strange sorting of the bundles in the persistent state which could lead to null values when reading from persistent state.

This patch locks the State monitor before persistent the state to disk to prevent another thread from modifying the state while we are saving.  I still need to analyze the locking here to make sure it is correct.
Comment 5 Thomas Watson CLA 2011-03-28 11:16:05 EDT
The attached patch definitely plugs a hole in the writing of the resolver data, but I am not convinced it is the root cause behind the corrupted data from comment 0.  The reason I say this is because the type of corruption that can be caused by this issue would result in a cache where and exporting bundle would have null packages in their list of selected package exports.  This is not what is happening in the corrupted state cache attached to this bug report.  The attached cache contains duplicate export package objects which indicates that the resolver state had duplicate export package objects for the same package in memory at the time the cache was persisted.  This would result in the incorrect indexing we are seeing in the corrupted cache from this bug report and the ultimate resulting nulls for the imported packages.

It would really help if we could set osgi.lazyStateUnloadingTime=0 and see if the problem reproduces in the environment where this issue is occurring.  If this helps it would indicate an issue with the unloading/loading state code.  I have reviewed this code but still have not found a hole that would cause the duplicate objects in memory, but it is quite clear there were duplicates when the state was persisted.
Comment 6 Matt Lavin CLA 2011-03-28 13:45:28 EDT
I looked at the code again, and here is something I noticed.

StateManager.run() is called to unload the lazy bundle state.  Statemanager.run() calls StateImpl.unloadLazyData() while holding the StateImpl instance as a lock.  

StateManager.writeState(StateImpl, File, File) is called to persist the current state, and it has a comment "make sure we are fully loaded before saving", but there doesn't any locking to ensure that the StateImpl stays fully loaded for the duration of writeState().  

It appears that it's possible for the state to be flushed of lazy data during a write.  If that happened, then the lazy data would be re-read to acquire the data that's needed.  Would that cause any problems?
Comment 7 Thomas Watson CLA 2011-03-28 14:30:01 EDT
(In reply to comment #6)
> I looked at the code again, and here is something I noticed.
> 
> StateManager.run() is called to unload the lazy bundle state. 
> Statemanager.run() calls StateImpl.unloadLazyData() while holding the StateImpl
> instance as a lock.  
> 
> StateManager.writeState(StateImpl, File, File) is called to persist the current
> state, and it has a comment "make sure we are fully loaded before saving", but
> there doesn't any locking to ensure that the StateImpl stays fully loaded for
> the duration of writeState().  
> 
> It appears that it's possible for the state to be flushed of lazy data during a
> write.  If that happened, then the lazy data would be re-read to acquire the
> data that's needed.  Would that cause any problems?

We should only be writing the resolution state if the state has changed in some way.  For example, if a bundle was added/removed or (re-)resolved.  We also should only be unloading state data if it has NOT changed since we read it from the cached state we are reading from.

If a thread is attempting to write the state while another thread is attempting to unload the state then there would have to be some problem with how we are check to see if the state really can be unloaded.

Basically something like this would have to occur:

1) The StateManager.run thread checks to see if it can unload the state and starts the call to StateImpl.unloadLazyData()
2) Another thread installs and starts a resolve process, causing StateManager.run thread to block
3) The system is shutdown causing persistence of the state immediately.  At this point perhaps the StateManager.run thread is unloading while we are persisting.
Comment 8 Thomas Watson CLA 2011-03-29 11:55:23 EDT
I have found a much more likely way this is happening.  Here is the scenario:

1) Launch the framework from a cached state where at least one bundle is unresolvable (in the INSTALLED state).
2) Attempt a resolve operation on the unresolvable bundle.  The bundle cannot resolve.
3) Install additional bundles.
4) Try to resolve some bundles.
5) Shutdown to save the new state
6) Relaunch and find the strange cached state and possible NPEs.

The root cause of the issue can happen between steps 2 and 3.  

During step 2 the State is fully loaded and the ExportPackageDescription objects are populated with the Resolver during the attempt to resolve the bundle.  No new bundle can be resolved so the State content (timestamp) is not changed.  This means the State lazy data is still eligible for unloading.  The issue happens if unloading of State lazy data happens before step 3.

The issue is that when we unload the lazy data from the State we neglect to flush the resolver.  So the resolver still has references to the old ExportPackageDescription objects.  Now when new bundles are installed and we attempt to resolve them we fully load the State again which creates new ExportPackageDescription Objects.  If newly installed bundles end up getting resolved they may get resolved to the old ExportPackageDescription objects which are still contained in the Resolver.  This ultimately leads to the duplicate ExportPackageDescription objects we see in the cached state attached to this bug report.


The workaround is to set:

osgi.lazyStateUnloadingTime=0

I think we should fix this in 3.7, but I would suggest we stick with the work around on older releases instead of trying to fix this.  I think the fix should simply involve flushing the resolver when we unload the lazydata, but I need to look into that more.
Comment 9 Thomas Watson CLA 2011-03-29 16:01:58 EDT
Created attachment 192126 [details]
updated patch for 3.7 only

This patch removes the separate locking done on the reader during load/unloading of lazy data.  Instead it uses the StateImpl.monitor lock.  The StateWriter also uses the StateImpl.monitor lock while writing.

The issue in comment 4 is handled by locking the StateImpl.monitor lock while writing.  This prevents any other thread from unloading data while we are writing.

The StateImpl.unloadLoazyData method now obtains the StateImpl.monitor lock while unloading.  This ensures no other thread is performing a resolve operation and allows us to safely flush the resolver.  That solves the main issue in this bug report as described in comment 8.  This also helps solve the issue in comment 6 because now we cannot write from one thread while unloading lazydata in another.

The lock on the reader was always problematic because that lock along with the StateImpl.monitor lock and the SystemState lock used in the StateManager.run method made for lots of challenges for detecting out of order locks and deadlocks.  Doing away with the reader lock should simplify that, but we still need to be careful that the StateImpl.monitor lock and the BundleDescriptionImpl.monitor locks do not be obtained out of order.  If both locks are needed by a thread then the StateImpl.monitor lock must be obtained first.
Comment 10 Thomas Watson CLA 2011-03-29 17:02:11 EDT
Created attachment 192141 [details]
test case

Test case that tests the scenario in comment 8.
Comment 11 Thomas Watson CLA 2011-03-29 17:04:57 EDT
I have released this patch to head.