Community
Participate
Working Groups
p2 supports multiple agents now, and the agent is in charge of providing other services. If a service is provided by IAgentServiceFactory, agent instance will use factory to create it when the service is got first time. However two different threads might get different service instances from one agent if they're concurrent.
Created attachment 166788 [details] Fix v01
Pascal or Ian, can you review this fix? The set of instantiated services is protected by the lock on the "agentServices" field. The fix is to lock in the getService method until the point where the service is stored in the map (so both the "test" and "set" are within the same synchronized block). I have run all the p2 tests with this fix.
Ian, did you intend to mark this as "review+"? It is still showing as "review?".
No, I just meant to mark it for review so I can to do today.
This patch looks good. My only concern is that there are other methods in the ProvisioningAgent class which are not synchronized and access the agentServices. But since the agent service is actually a syncronized map, and we use this same object as our lock, we should be good. Actually, on closer inspection, we might get into a race condition with shutdown. The iteration through the agent services is not synchronized in the stop method. This means that we could have callers who have called getService, and moved passed the checkRunning() and is waiting to enter the syncronized block. At the same time, someone could call stop() and start to unregister services. When the first caller awakes, it will be using an invalid provisioning agent. I've given a +1 to this patch as it fixes the problem outlined in the bug. John, do you think the shutdown race condition is a problem? Should we just file a second bug for that?
Looking again, the caller to stop() will block on getValues(), but once they get the values (services) and being to stop them, the other caller is free to register other services. This case is likely very rare, but still possible.
Created attachment 167370 [details] Fix v02 You're right, there was also a race condition in the stop method. I have moved the checkRunning() call in getService into the same synchronized block, and in stop() I have put stopped=true inside the same synchronized block as stopping the services. This means you can never access a service that is currently being stopped or that has completed stopping.
Setting back to review? because there is a new fix.
Note I also didn't like mixing two locks - one for "stopped" and another for "agent services". I have marked stopped as volatile instead of explicitly synchronizing on it. Where relevant, it is accessed/modified from within a synchronized block on agentServices to avoid accessing or changing services whipe stopped.
After more testing, I found that second fix caused deadlocks. I don't think we can safely lock that aggressively during shutdown. The failure case is: - A service tries to join a job during its shutdown process to tidy up after itself - That job attempts to get a service -> Deadlock because we are now locking during stopping of services.
Created attachment 167389 [details] Fix v03 This fixes the deadlock in the previous patch by removing synchronization while stopping services. This means that a client can now access a service while it is being stopped, but this is not a new problem (current code in HEAD is the same). However I did fix the problem in stop() that we are iterating over a collection without synchronizing on it, which isn't thread safe. I now grab a copy of the collection to iterate over for thread safety. I ran all tests on this and it found no problems. Ian, can you give this new patch a review?
John, what if we put the stop=true near the top of the stop method (inside the synchronized block. Since we protect the get service with the checkRunning, this should gaurd against the case of getting a service while it's being stopped. Also, by putting stop=true at the top, if sometime fails during the stop method, we don't leave the agent provider in a broken state.
(In reply to comment #12) > Also, by putting stop=true at the top, if sometime fails during the stop > method, we don't leave the agent provider in a broken state. This doesn't work because some services need to be able to access other services while they are stopping (for example while stopping I join a job that I scheduled, and that job may be making use of other services). Ideally you would want to shut down services in prerequisite order, like the framework does for bundles, but we have no way to determine that here. Arguably we should say referencing other services from your stop() method is not allowed but that's a big change at this point in the release and I'm not sure it is feasible to implement.
(In reply to comment #13) > (In reply to comment #12) > > Also, by putting stop=true at the top, if sometime fails during the stop > > method, we don't leave the agent provider in a broken state. > > This doesn't work because some services need to be able to access other > services while they are stopping (for example while stopping I join a job that > I scheduled, and that job may be making use of other services). Ideally you > would want to shut down services in prerequisite order, like the framework does > for bundles, but we have no way to determine that here. Arguably we should say > referencing other services from your stop() method is not allowed but that's a > big change at this point in the release and I'm not sure it is feasible to > implement. That explains why when I tried to make this change a number of test cases failed. Ok, thanks for the explanation John. I'm happy with this patch.
Thanks Ian. Fix released.