| Summary: | Equinox StateImpl thread safety bug | ||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Product: | [Eclipse Project] Equinox | Reporter: | Chris Frost <eclipse> | ||||||||||||
| Component: | Framework | Assignee: | Thomas Watson <tjwatson> | ||||||||||||
| Status: | RESOLVED FIXED | QA Contact: | |||||||||||||
| Severity: | blocker | ||||||||||||||
| Priority: | P3 | CC: | glyn.normington, milesg78, tjwatson | ||||||||||||
| Version: | 3.7 | ||||||||||||||
| Target Milestone: | 3.7 M6 | ||||||||||||||
| Hardware: | PC | ||||||||||||||
| OS: | Mac OS X - Carbon (unsup.) | ||||||||||||||
| Whiteboard: | |||||||||||||||
| Bug Depends on: | |||||||||||||||
| Bug Blocks: | 438169 | ||||||||||||||
| Attachments: |
|
||||||||||||||
Created attachment 189262 [details]
Dump file - thread
Created attachment 189263 [details]
Dump file - summary
Created attachment 189264 [details]
Dump file - configuration
Here is a summary of the log file for quick reference. [2011-02-18 09:58:08.981] start-signalling-1 org.eclipse.virgo.medic.eventlog.default ME0003I Dump 'serviceability/dump/2011-02-18-09-58-518' generated [2011-02-18 09:58:08.982] start-signalling-1 System.err Feb 18, 2011 9:58:08 AM org.apache.tomcat.util.digester.Digester getParser [2011-02-18 09:58:08.982] start-signalling-1 System.err SEVERE: Digester.getParser: [2011-02-18 09:58:08.982] start-signalling-1 System.err java.util.NoSuchElementException [2011-02-18 09:58:08.982] start-signalling-1 System.err at org.eclipse.osgi.framework.util.KeyedHashSet$EquinoxSetIterator.next(KeyedHashSet.java:474) [2011-02-18 09:58:08.983] start-signalling-1 System.err at org.eclipse.osgi.framework.util.KeyedHashSet$EquinoxSetIterator.next(KeyedHashSet.java:1) [2011-02-18 09:58:08.983] start-signalling-1 System.err at org.eclipse.osgi.internal.resolver.StateImpl.getExportedPackages(StateImpl.java:657) ..... [2011-02-18 09:58:08.987] start-signalling-1 System.err at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) [2011-02-18 09:58:08.987] start-signalling-1 System.err at java.lang.Thread.run(Thread.java:680) [2011-02-18 09:58:08.988] start-signalling-1 System.err Feb 18, 2011 9:58:08 AM org.apache.catalina.startup.ContextConfig processDefaultWebConfig [2011-02-18 09:58:08.989] start-signalling-1 System.err SEVERE: Parse error in default web.xml [2011-02-18 09:58:08.989] start-signalling-1 System.err java.lang.NullPointerException [2011-02-18 09:58:08.989] start-signalling-1 System.err at org.apache.tomcat.util.digester.Digester.getXMLReader(Digester.java:977) [2011-02-18 09:58:08.989] start-signalling-1 System.err at org.apache.tomcat.util.digester.Digester.parse(Digester.java:1642) [2011-02-18 09:58:08.989] start-signalling-1 System.err at org.apache.catalina.startup.ContextConfig.processDefaultWebConfig(ContextConfig.java:693) The root cause appears to be a thread safety bug in Equinox (3.7M5). I saw the problem more clearly by triggering a VM-wide breakpoint on NoSuchElementException being thrown (on the 19th attempt!).
The problem is that the file system watcher thread spots a WAB (Web Application Bundle) in pickup and starts it. This resolves the WAB under the fs-watcher thread. A collection of resolved bundles in Equinox (StateImpl.resolvedBundles) is updated inside a synchronisation block. Then a start-signalling-thread-2 gets going and uses package admin to iterate over the collection of resolved bundles without any synchronisation and it hits a broken invariant and throws NoSuchElementException unexpectedly, rather than returning false from hasNext.
There is a comment at the top of KeyedHashSet which makes things clear:
"This class in not thread safe, clients must ensure synchronization when modifying an object of this type."
This is, of course, false. Clients also need to ensure synchronisation when reading the object so that the correct "happens before" relationships are in place.
Interestingly the state of KeyedHashSet and the relevant EquinoxSetIterator does not show the broken invariant, presumably because of the interaction of the debugger after the breakpoint has triggered.
The state shown in the debugger is:
currentIndex=224
found=102
elementCount=103
and the elements array consists of 224 elements, the first 103 (indices 0..102) of which are non-null and the rest of which are null.
So I guess that the code sequence on the first thread:
elements[i] = element;
elementCount++;
is only partially visible on the second thread.
Rather surprisingly, it seems that the updated elementCount is visible but elements[i] is still showing as null. This is permissible according to the Java Memory Model.
Thanks for finding this bug Christopher and thanks for the analysis Glyn! AFAICT this has been an issue ever since we all did all the work in bug 247522 to make the State thread safe. The method clearly needs to be protected by a lock on the state monitor. Yes, I think we have probably seen it before, but we were sensitised by the fact we were testing in preparation to publicise the newly build Virgo 3.0.0.M01. Created attachment 189271 [details]
Possible fix
Christopher or Glyn, could you give this patch a try?
Glyn has tested the patch and confirmed that the patch has fixed the issue for virgo. Thanks Glyn! I have released the patch to HEAD. |
Created attachment 189261 [details] Log file for a startup and shutdown of a failed app deployment To reproduce: 1. Get the 3.0.0.M01 shipping of Virgo Web Server and start it many times until one of the web apps fails to deploys. I have managed it in anything from 4 runs to 12 runs to get the failure. I have attached the serviceability files for the problem to this issue.