Created attachment 137961 [details] Proposed patch to fix issue

Build ID: 3.5 RC3

Suppose ComponentA has two static references with cardinality "1..1", one to ServiceA and one to ServiceB, and exposes a ServiceC. The problematic sequence of events is as follows:

* [Thread 1] a ServiceA is registered - the Resolver proceeds to find static bindings for this service (the selectStaticBind() method) under a synchronized lock, result = { ComponentA }
* [Thread 1] the lock is released ...
* [Thread 2] a ServiceB is registered - the Resolver proceeds to find static references to this service (the selectStaticBind() method) under a synchronized lock, result = { ComponentA }
* [Thread 2] the lock is released
* [Thread 2] ComponentA is disposed of (needs to update static bindings) - this effectively does nothing, as ComponentA is not built at this point
* [Thread 2] the components to be built are determined, result = { ComponentA }
* [Thread 2] ComponentA is built ...
>>>* [Thread 1] ComponentA is disposed of (needs to update static bindings) - this destroys the component created just above
* [Thread 1] the components to build are determined, result = { ComponentA } - due to the destroy above
* [Thread 1] ComponentA is built ...

The ">>>" line is superfluous and results in a component being deactivated and reactivated unnecessarily. Depending on the component, this may mean expensive operations are performed multiple times, and other dependent components may then also have to be disposed of and recreated.

The attached patch mostly avoids the problem by calculating, while still in possession of the sync lock, which components need to be disposed of - only those components that are already building or built at that time. If a component is built after this point, the binding is already satisfied and disposing it is unnecessary. If a component was in the building process, it will be disposed of and recreated, but this is necessary to satisfy the static reference.
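The core idea of the patch can be sketched as follows. This is a minimal, standalone sketch, not the actual Equinox DS code: the names (Component, State, ResolverSketch, selectComponentsToDispose) are all hypothetical. The point it illustrates is that the dispose set is computed while the lock is held and is restricted to components already building or built, so a component that only gets built after the lock is released is left alone.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of a DS component's build state; not the real Equinox classes.
class Component {
    enum State { NOT_BUILT, BUILDING, BUILT }
    final String name;
    State state;
    Component(String name, State state) { this.name = name; this.state = state; }
}

class ResolverSketch {
    private final Object lock = new Object();

    // While still holding the resolver lock, select only components that are
    // already BUILDING or BUILT for disposal. A NOT_BUILT component will pick
    // up the new binding when it is eventually built, so disposing it would
    // only cause a superfluous deactivate/activate cycle later.
    List<Component> selectComponentsToDispose(List<Component> boundComponents) {
        synchronized (lock) {
            List<Component> toDispose = new ArrayList<>();
            for (Component c : boundComponents) {
                if (c.state != Component.State.NOT_BUILT) {
                    toDispose.add(c);
                }
            }
            return toDispose;
        }
    }
}
```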
Created attachment 137987 [details] optimized patch

This patch achieves the same result but is simpler.
Created attachment 138268 [details] sample bundle that exhibits concurrency issues

In the past I had encountered some strange service deactivations that I could not easily explain. When I saw this bug I decided to delve into things a little more and see whether the proposed change(s) would in fact solve the issues I had. To investigate further I developed a sample bundle that tries to unmask any concurrency issue around multiple register events happening concurrently.

The attached sample bundle is composed of three declarative service components. OSGI-INF/Consumer.xml depends on three services (IServiceA, IServiceB and IServiceC) and is immediate and enabled. OSGI-INF/Controller.xml is an immediate component that spawns several threads and attempts to let them all go at the same time: three threads register an instance of IServiceA, IServiceB and IServiceC respectively, and a fourth thread enables the OSGI-INF/Null.xml component, which is effectively a component that provides/consumes no services.

As all threads register the services required by OSGI-INF/Consumer.xml, I would expect the component to be activated; the expected output of starting the bundle would be as follows:

osgi> >>> Released the lock
+++ binding (class=class net.sample.bundle.services.ServiceA)
+++ binding (class=class net.sample.bundle.services.ServiceB)
+++ binding (class=class net.sample.bundle.services.ServiceC)
+++ activate

I noticed after several runs on my Linux quad-core amd64 that I would get the following output instead, which looks suspiciously like my unexplained service deactivation. Not all runs were the same; sometimes I got more deactivations.
osgi> >>> Released the lock
+++ binding (class=class net.sample.bundle.services.ServiceA)
+++ binding (class=class net.sample.bundle.services.ServiceB)
+++ binding (class=class net.sample.bundle.services.ServiceC)
+++ activate
--- deactivate
--- unbinding (class=class net.sample.bundle.services.ServiceC)
--- unbinding (class=class net.sample.bundle.services.ServiceB)
--- unbinding (class=class net.sample.bundle.services.ServiceA)
+++ binding (class=class net.sample.bundle.services.ServiceA)
+++ binding (class=class net.sample.bundle.services.ServiceB)
+++ binding (class=class net.sample.bundle.services.ServiceC)
+++ activate

After delving further I discovered that although the provided patch decreases the likelihood of this unexplained deactivation, there is still another case: a thread may be building (or have built) the component because all the OSGi service references are satisfied in the OSGi service registry, even though not all the register events have been fully processed. As a result, one thread (Thread A), while processing a register event for a service reference that was used to satisfy the construction of the component on another thread (Thread B), determines that it has to destroy the component that Thread B constructed.
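The "let them all go at the same time" behaviour of the Controller component can be approximated with a CountDownLatch acting as a start gate. This is a self-contained sketch under stated assumptions: ControllerSketch and its registered list are stand-ins I have invented, and where the real bundle would call BundleContext.registerService(...), the sketch only records the service name.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;

// Hypothetical sketch of the Controller's thread-release pattern: all worker
// threads block on a single latch so their service registrations begin as
// close to simultaneously as possible, maximising the chance of exposing
// races in the event-handling code that processes the registrations.
class ControllerSketch {
    static List<String> registered = new CopyOnWriteArrayList<>();

    public static void main(String[] args) throws InterruptedException {
        CountDownLatch startGate = new CountDownLatch(1);
        String[] services = { "IServiceA", "IServiceB", "IServiceC" };
        Thread[] workers = new Thread[services.length];
        for (int i = 0; i < services.length; i++) {
            final String svc = services[i];
            workers[i] = new Thread(() -> {
                try {
                    startGate.await();    // every worker blocks here until released
                    registered.add(svc);  // stand-in for bundleContext.registerService(...)
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            workers[i].start();
        }
        System.out.println(">>> Released the lock");
        startGate.countDown();            // release all workers at once
        for (Thread t : workers) {
            t.join();
        }
    }
}
```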
Created attachment 138269 [details] experimental patch

From the investigation it became apparent that multiple threads handling service events can cause problems. Although the proposed patch reduces the likelihood of hitting a problem, it is still possible for an "early" activation (relative to the service registration events) to cause the handling of a service event to deactivate a component that does not itself need to be deactivated. To get around this I have attached an experimental patch that tracks which ServiceReferences exist in the OSGi registry; when determining whether a Reference is satisfied, it checks both the OSGi service registry and what DS thinks is registered (tracked through events). Doing this prevents an early activation of a component that could result in an unnecessary deactivation.
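The idea behind the experimental patch might be sketched like this. All names here are illustrative rather than the real Equinox DS internals: a reference counts as satisfied only when the service is visible both in the OSGi registry view and in DS's own event-tracked table, so an activation cannot run ahead of register events that are still being processed.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the experimental patch's satisfaction check. The two
// sets model two views of the same services: what the OSGi registry currently
// reports, and which register events DS has actually finished processing.
class SatisfactionTracker {
    private final Set<String> registryView = new HashSet<>(); // OSGi registry state
    private final Set<String> eventView = new HashSet<>();    // events DS has processed

    synchronized void serviceAppearedInRegistry(String svc) { registryView.add(svc); }
    synchronized void registerEventProcessed(String svc)    { eventView.add(svc); }
    synchronized void serviceGone(String svc) {
        registryView.remove(svc);
        eventView.remove(svc);
    }

    // Satisfied only when both views agree: this blocks the "early" activation
    // that happens when the registry already has the service but its register
    // event is still queued on another thread.
    synchronized boolean isSatisfied(String svc) {
        return registryView.contains(svc) && eventView.contains(svc);
    }
}
```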
Created attachment 138270 [details] experimental patch v2

An update to (hopefully) correctly synchronize the ServiceReference table after the service listener has been registered.
Marking for 3.6. Stoyan, can you investigate this more closely? Once we are confident in a fix we can do additional review and assess the risk of including a fix in 3.5.1.
The best way to show this multiple deactivate/activate is to start the sample bundle with a breakpoint on org.eclipse.equinox.internal.ds.Resolver at line 277. This results in the OSGI-INF/Consumer.xml component being created via the component enable invocation; however, the breakpoint will then be hit three times, once for each OSGi service registration call. Each continue from the breakpoint will result in the component being deactivated/activated.
I was able to reproduce and analyze the situation. The proposed patch looks good and solves the problem.
Patch released in HEAD. Thanks for the patch, Andrew. Do you think it should be released in 3.5.2 as well? In my opinion the problem is quite rare and does not actually seem like a bug; I rather think of it as an enhancement. So, I think it is not necessary to release it in 3.5.x.
Comment on attachment 138270 [details] experimental patch v2 Adding contribution to the iplog. Thanks!
The frequency at which this problem occurs in some of my Equinox applications is quite high during startup; generally it's not a problem given the dynamic nature of things, although it does depend on which services are affected. Some components that I consider "big" use a worker thread to fully initialize the component, and once fully initialized they programmatically register the component against several more interfaces; it's generally these registrations that trigger the problem. If such an integral service gets marked to be destroyed, it brings down a whole load of other services, which then effectively all have to be recreated.

During runtime it doesn't appear to be an issue as such, but that is generally because service components coming and going is very infrequent at that point and there are unlikely to be any services registered programmatically. If declarative services were used for everything (with some way to enable specific service references once a component is ready), then I am sure it wouldn't be a problem; it's more of a problem when you have interoperability with other ways of using the OSGi service registry and as a result can end up hitting this case. How big a problem it is depends on the component being destroyed. That being said, I now just run a patched version of the OSGi declarative services, so I don't encounter the problem anymore.
I recommend we get more testing with the fix in 3.6. If we get more requests to backport a fix to 3.5.x then we can evaluate doing so for 3.5.2.