Some Eclipse Foundation services are deprecated, or will be soon. Please ensure you've read this important communication.
Bug 337538 - Equinox StateImpl thread safety bug
Summary: Equinox StateImpl thread safety bug
Status: RESOLVED FIXED
Alias: None
Product: Equinox
Classification: Eclipse Project
Component: Framework (show other bugs)
Version: 3.7   Edit
Hardware: PC Mac OS X - Carbon (unsup.)
: P3 blocker (vote)
Target Milestone: 3.7 M6   Edit
Assignee: Thomas Watson CLA
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: 438169
  Show dependency tree
 
Reported: 2011-02-18 05:05 EST by Chris Frost CLA
Modified: 2014-06-25 11:43 EDT (History)
3 users (show)

See Also:


Attachments
Log file for a startup and shutdown of a failed app deployment (415.15 KB, application/octet-stream)
2011-02-18 05:05 EST, Chris Frost CLA
no flags Details
Dump file - thread (27.39 KB, text/plain)
2011-02-18 05:06 EST, Chris Frost CLA
no flags Details
Dump file - summary (4.92 KB, text/plain)
2011-02-18 05:06 EST, Chris Frost CLA
no flags Details
Dump file - configuration (5.55 KB, application/octet-stream)
2011-02-18 05:07 EST, Chris Frost CLA
no flags Details
Possible fix (2.62 KB, patch)
2011-02-18 08:37 EST, Thomas Watson CLA
no flags Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Chris Frost CLA 2011-02-18 05:05:59 EST
Created attachment 189261 [details]
Log file for a startup and shutdown of a failed app deployment

To reproduce:
1. Get the 3.0.0.M01 shipping of Virgo Web Server and start it many times until one of the web apps fails to deploys. I have managed it in anything from 4 runs to 12 runs to get the failure.

I have attached the serviceability files for the problem to this issue.
Comment 1 Chris Frost CLA 2011-02-18 05:06:28 EST
Created attachment 189262 [details]
Dump file - thread
Comment 2 Chris Frost CLA 2011-02-18 05:06:47 EST
Created attachment 189263 [details]
Dump file - summary
Comment 3 Chris Frost CLA 2011-02-18 05:07:20 EST
Created attachment 189264 [details]
Dump file - configuration
Comment 4 Chris Frost CLA 2011-02-18 05:10:14 EST
Here is a summary of the log file for quick reference.

[2011-02-18 09:58:08.981] start-signalling-1           org.eclipse.virgo.medic.eventlog.default                         ME0003I Dump 'serviceability/dump/2011-02-18-09-58-518' generated 
[2011-02-18 09:58:08.982] start-signalling-1           System.err                                                        Feb 18, 2011 9:58:08 AM org.apache.tomcat.util.digester.Digester getParser 
[2011-02-18 09:58:08.982] start-signalling-1           System.err                                                        SEVERE: Digester.getParser:  
[2011-02-18 09:58:08.982] start-signalling-1           System.err                                                        java.util.NoSuchElementException 
[2011-02-18 09:58:08.982] start-signalling-1           System.err                                                        	at org.eclipse.osgi.framework.util.KeyedHashSet$EquinoxSetIterator.next(KeyedHashSet.java:474) 
[2011-02-18 09:58:08.983] start-signalling-1           System.err                                                        	at org.eclipse.osgi.framework.util.KeyedHashSet$EquinoxSetIterator.next(KeyedHashSet.java:1) 
[2011-02-18 09:58:08.983] start-signalling-1           System.err                                                        	at org.eclipse.osgi.internal.resolver.StateImpl.getExportedPackages(StateImpl.java:657) 


.....


[2011-02-18 09:58:08.987] start-signalling-1           System.err                                                        	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) 
[2011-02-18 09:58:08.987] start-signalling-1           System.err                                                        	at java.lang.Thread.run(Thread.java:680) 
[2011-02-18 09:58:08.988] start-signalling-1           System.err                                                        Feb 18, 2011 9:58:08 AM org.apache.catalina.startup.ContextConfig processDefaultWebConfig 
[2011-02-18 09:58:08.989] start-signalling-1           System.err                                                        SEVERE: Parse error in default web.xml 
[2011-02-18 09:58:08.989] start-signalling-1           System.err                                                        java.lang.NullPointerException 
[2011-02-18 09:58:08.989] start-signalling-1           System.err                                                        	at org.apache.tomcat.util.digester.Digester.getXMLReader(Digester.java:977) 
[2011-02-18 09:58:08.989] start-signalling-1           System.err                                                        	at org.apache.tomcat.util.digester.Digester.parse(Digester.java:1642) 
[2011-02-18 09:58:08.989] start-signalling-1           System.err                                                        	at org.apache.catalina.startup.ContextConfig.processDefaultWebConfig(ContextConfig.java:693)
Comment 5 Glyn Normington CLA 2011-02-18 07:29:20 EST
The root cause appears to be a thread safety bug in Equinox (3.7M5). I saw the problem more clearly by triggering a VM-wide breakpoint on NoSuchElementException being thrown (on the 19th attempt!).

The problem is that the file system watcher thread spots a WAB (Web Application Bundle) in pickup and starts it. This resolves the WAB under the fs-watcher thread. A collection of resolved bundles in Equinox (StateImpl.resolvedBundles) is updated inside a synchronisation block. Then a start-signalling-thread-2 gets going and uses package admin to iterate over the collection of resolved bundles without any synchronisation and it hits a broken invariant and throws NoSuchElementException unexpectedly, rather than returning false from hasNext.

There is a comment at the top of KeyedHashSet which makes things clear:

"This class in not thread safe, clients must ensure synchronization when modifying an object of this type."

This is, of course, false.  Clients also need to ensure synchronisation when reading the object so that the correct "happens before" relationships are in place.

Interestingly the state of KeyedHashSet and the relevant EquinoxSetIterator does not show the broken invariant, presumably because of the interaction of the debugger after the breakpoint has triggered. 

The state shown in the debugger is:
    currentIndex=224
    found=102
    elementCount=103
and the elements array consists of 224 elements, the first 103 (indices 0..102) of which are non-null and the rest of which are null.

So I guess that the code sequence on the first thread:
    elements[i] = element;
    elementCount++;
is only partially visible on the second thread.

Rather surprisingly, it seems that the updated elementCount is visible but elements[i] is still showing as null. This is permissible according to the Java Memory Model.
Comment 6 Thomas Watson CLA 2011-02-18 08:33:01 EST
Thanks for finding this bug Christopher and thanks for the analysis Glyn!

AFAICT this has been an issue ever since we all did all the work in bug 247522 to make the State thread safe.  The method clearly needs to be protected by a lock on the state monitor.
Comment 7 Glyn Normington CLA 2011-02-18 08:36:57 EST
Yes, I think we have probably seen it before, but we were sensitised by the fact we were testing in preparation to publicise the newly build Virgo 3.0.0.M01.
Comment 8 Thomas Watson CLA 2011-02-18 08:37:11 EST
Created attachment 189271 [details]
Possible fix

Christopher or Glyn, could you give this patch a try?
Comment 9 Thomas Watson CLA 2011-02-18 10:26:05 EST
Glyn has tested the patch and confirmed that the patch has fixed the issue for virgo.  Thanks Glyn!

I have released the patch to HEAD.