Some Eclipse Foundation services are deprecated, or will be soon. Please ensure you've read this important communication.
Bug 306668 - IProvisioningAgent.getService is NOT thread safe
Summary: IProvisioningAgent.getService is NOT thread safe
Status: RESOLVED FIXED
Alias: None
Product: Equinox
Classification: Eclipse Project
Component: p2 (show other bugs)
Version: 3.6   Edit
Hardware: PC All
: P3 normal (vote)
Target Milestone: 3.6 RC1   Edit
Assignee: John Arthorne CLA
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-03-22 02:21 EDT by Meng Xin Zhu CLA
Modified: 2010-05-06 17:54 EDT (History)
3 users (show)

See Also:
irbull: review+


Attachments
Fix v01 (2.96 KB, patch)
2010-05-03 10:58 EDT, John Arthorne CLA
no flags Details | Diff
Fix v02 (4.58 KB, patch)
2010-05-06 15:08 EDT, John Arthorne CLA
no flags Details | Diff
Fix v03 (4.57 KB, patch)
2010-05-06 16:26 EDT, John Arthorne CLA
no flags Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Meng Xin Zhu CLA 2010-03-22 02:21:53 EDT
p2 supports multiple agents now, and the agent is in charge of providing other services.
If a service is provided by IAgentServiceFactory, agent instance will use factory to create it when the service is got first time. However two different threads might get different service instances from one agent if they're concurrent.
Comment 1 John Arthorne CLA 2010-05-03 10:58:47 EDT
Created attachment 166788 [details]
Fix v01
Comment 2 John Arthorne CLA 2010-05-03 11:24:26 EDT
Pascal or Ian, can you review this fix? The set of instantiated services is protected by the lock on the "agentServices" field. The fix is to lock in the getService method until the point where the service is stored in the map (so both the "test" and "set" are within the same synchronized block). I have run all the p2 tests with this fix.
Comment 3 John Arthorne CLA 2010-05-04 09:33:32 EDT
Ian, did you intend to mark this as "review+"? It is still showing as "review?".
Comment 4 Ian Bull CLA 2010-05-04 11:31:06 EDT
No, I just meant to mark it for review so I can to do today.
Comment 5 Ian Bull CLA 2010-05-06 12:34:13 EDT
This patch looks good. My only concern is that there are other methods in the ProvisioningAgent class which are not synchronized and access the agentServices.  But since the agent service is actually a syncronized map, and we use this same object as our lock, we should be good. 

Actually, on closer inspection, we might get into a race condition with shutdown. The iteration through the agent services is not synchronized in the stop method.  This means that we could have callers who have called getService, and moved passed the checkRunning() and is waiting to enter the syncronized block. At the same time, someone could call stop() and start to unregister services.  When the first caller awakes, it will be using an invalid provisioning agent.

I've given a +1 to this patch as it fixes the problem outlined in the bug.  John, do you think the shutdown race condition is a problem?  Should we just file a second bug for that?
Comment 6 Ian Bull CLA 2010-05-06 12:42:44 EDT
Looking again, the caller to stop() will block on getValues(), but once they get the values (services) and being to stop them, the other caller is free to register other services.  This case is likely very rare, but still possible.
Comment 7 John Arthorne CLA 2010-05-06 15:08:32 EDT
Created attachment 167370 [details]
Fix v02

You're right, there was also a race condition in the stop method. I have moved the checkRunning() call in getService into the same synchronized block, and in stop() I have put stopped=true inside the same synchronized block as stopping the services. This means you can never access a service that is currently being stopped or that has completed stopping.
Comment 8 John Arthorne CLA 2010-05-06 15:08:58 EDT
Setting back to review? because there is a new fix.
Comment 9 John Arthorne CLA 2010-05-06 15:10:56 EDT
Note I also didn't like mixing two locks - one for "stopped" and another for "agent services". I have marked stopped as volatile instead of explicitly synchronizing on it. Where relevant, it is accessed/modified from within a synchronized block on agentServices to avoid accessing or changing services whipe stopped.
Comment 10 John Arthorne CLA 2010-05-06 16:08:45 EDT
After more testing, I found that second fix caused deadlocks. I don't think we can safely lock that aggressively during shutdown. The failure case is:

- A service tries to join a job during its shutdown process to tidy up after itself
- That job attempts to get a service

-> Deadlock because we are now locking during stopping of services.
Comment 11 John Arthorne CLA 2010-05-06 16:26:09 EDT
Created attachment 167389 [details]
Fix v03

This fixes the deadlock in the previous patch by removing synchronization while stopping services. This means that a client can now access a service while it is being stopped, but this is not a new problem (current code in HEAD is the same). 

However I did fix the problem in stop() that we are iterating over a collection without synchronizing on it, which isn't thread safe. I now grab a copy of the collection to iterate over for thread safety.

I ran all tests on this and it found no problems. Ian, can you give this new patch a review?
Comment 12 Ian Bull CLA 2010-05-06 16:40:13 EDT
John, what if we put the stop=true near the top of the stop method (inside the synchronized block.  Since we protect the get service with the checkRunning, this should gaurd against the case of getting a service while it's being stopped.

Also, by putting stop=true at the top, if sometime fails during the stop method, we don't leave the agent provider in a broken state.
Comment 13 John Arthorne CLA 2010-05-06 17:17:51 EDT
(In reply to comment #12)
> Also, by putting stop=true at the top, if sometime fails during the stop
> method, we don't leave the agent provider in a broken state.

This doesn't work because some services need to be able to access other services while they are stopping (for example while stopping I join a job that I scheduled, and that job may be making use of other services). Ideally you would want to shut down services in prerequisite order, like the framework does for bundles, but we have no way to determine that here. Arguably we should say referencing other services from your stop() method is not allowed but that's a big change at this point in the release and I'm not sure it is feasible to implement.
Comment 14 Ian Bull CLA 2010-05-06 17:22:13 EDT
(In reply to comment #13)
> (In reply to comment #12)
> > Also, by putting stop=true at the top, if sometime fails during the stop
> > method, we don't leave the agent provider in a broken state.
> 
> This doesn't work because some services need to be able to access other
> services while they are stopping (for example while stopping I join a job that
> I scheduled, and that job may be making use of other services). Ideally you
> would want to shut down services in prerequisite order, like the framework does
> for bundles, but we have no way to determine that here. Arguably we should say
> referencing other services from your stop() method is not allowed but that's a
> big change at this point in the release and I'm not sure it is feasible to
> implement.

That explains why when I tried to make this change a number of test cases failed. Ok, thanks for the explanation John.  I'm happy with this patch.
Comment 15 John Arthorne CLA 2010-05-06 17:54:02 EDT
Thanks Ian. Fix released.