| Summary: | [publisher] NPE running headless app on non-main thread | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| Product: | [Eclipse Project] Equinox | Reporter: | DJ Houghton <dj.houghton> | ||||||||
| Component: | Compendium | Assignee: | equinox.compendium-inbox <equinox.compendium-inbox> | ||||||||
| Status: | RESOLVED FIXED | QA Contact: | |||||||||
| Severity: | normal | ||||||||||
| Priority: | P3 | CC: | eric.gwin, irbull, john.arthorne, konstantin, michael.sacarny, pascal, s.boshev, stephan.herrmann, tjwatson | ||||||||
| Version: | 3.6 | ||||||||||
| Target Milestone: | 3.6 M6 | ||||||||||
| Hardware: | PC | ||||||||||
| OS: | Linux | ||||||||||
| Whiteboard: | |||||||||||
| Attachments: |
|
||||||||||
This is also reproducible in Windows XP. Using the same command line arguments and the same update site, I get the NPE in about 20% of the times I run the application.

This bug seems not 100% deterministic:
- even with -nosplash the NPE may occur
- running on a headless machine causes almost 100% failure, independent of -nosplash
- running on a desktop with DISPLAY unset gives similar probabilities to success vs. failure
I assume something else must be accessing the display besides the splash screen.
Unfortunately this bug makes it impossible for me to run my build on top of I20100128-1731.

(In reply to comment #2)
> This bug seems not 100% deterministic:
> - even with -nosplash the NPE may occur
> - running on a headless machine causes almost 100% failure, independent
> of -nosplash
> - running on a desktop with DISPLAY unset gives similar probabilities
> to success vs. failure
> I assume something else must be accessing the display besides the
> splash screen.
>
Can you launch with -console, and use
ss p2.ui
See if the p2.ui bundle is started. This is a shot in the dark, but in 3.5.1 I was tracking a case where the p2.ui bundle was registering a service (in that case to warn users about unsigned content). However, the platform wasn't actually running, so if the service got invoked, an exception was thrown (because the window could not be opened). Could be something similar here.
Also, if you poke around with the ss command, see if any other UI bundles are started.
-- As I said, just a shot in the dark though.

Seeing the same NPE with build I20100128-1731 on Windows. Let me know if you'd like to get some debug output or logs or something. This is a pretty major issue as the command line build is toast.

Update: I managed to get a build/test running on the server by performing some tasks on the desktop and sending files back to the build server. It seems I don't strictly need a patch for releasing something based on M5. Thus I personally needn't raise severity to blocker any more :) But I wouldn't want to do this procedure for future builds ...
Interestingly, while the build/test is running, I cannot reproduce the problem on the server, presumably because timing is different when the machine is under load...
I'll try the steps in comment 3 when the test run has finished.
BTW: what's the option to keep the console after the application has finished?

I confirm that the problem occurs again on I20100129-1300.

It seems to only be happening on the first startup of the installation. For example, I have been able to get successful metadata generation with the command line DJ mentioned by just running the command twice in a row.
I believe that this is a friend of the service registration problem that we have been having lately, because if I delete the org.eclipse.osgi folder from the configuration, then I can consistently reproduce the problem.

Created attachment 157787 [details]
bundles startup debug output
Trying to help with some debug here.
I have attached the debug output of the order of the started bundles.
It seems the application thread starts doing its job before the necessary bundles are started.
In detail:
1. The app bundle starts and launches the application thread.
2. Bundle org.eclipse.equinox.p2.artifact.repository, which provides DS components needed by the p2.publisher bundle, starts after the application thread has started.
Here we have a race condition. If the app thread is slowed down, SCR will process the components in the p2.artifact.repository bundle just in time for the application to use them.
(In reply to comment #7)
> Here we have a race condition. If the app thread is slowed down, SCR will process the
> components in the p2.artifact.repository bundle just in time for the application
> to use them.

Thanks Stoyan, in this case we have p2.artifact.repository which has DS components but no API. So it will never have a class loaded out of it causing a lazy activation trigger. Once the LAZY_ACTIVATION event is fired then DS will process it and get its service registered. But currently this will only happen on the start-level thread. Meanwhile the application thread could attempt to look up the service and not find it.

I see two issues:

1) To be consistent with main-threaded applications, the application container should try to delay the execution of the any-threaded default application until after the start-level has changed. This would allow any-threaded applications to wait until after all the LAZY_ACTIVATION events have been fired by the start-level change while launching the platform.

2) The application needs to be more resilient when the services are not yet available. Perhaps org.eclipse.equinox.internal.p2.core.ProvisioningAgent.getService(String) should wait for the service for a set amount of time if the service is not yet available. I'm not sure if clients are expecting only non-null values here or not. If they are expecting null in some cases then perhaps a getService(String name, long timeout) method should be introduced for when non-null is a must.

Created attachment 157822 [details]
Fix at publisher level
Here is a potential localized fix within the publisher. It waits for certain essential services, and fails with an informative exception rather than NPE if the service is never obtained. I suspect we will run across other examples of this, so some kind of getService with timeout as suggested by Tom might be needed if there isn't a framework/app service level fix for this.
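For illustration only, the "wait for the service, then fail with an informative exception" idea could be sketched as below. This is not the attached patch. WaitingRegistry, its methods, and the service name are hypothetical stand-ins written in plain Java so the example runs without an OSGi framework. In a real bundle, org.osgi.util.tracker.ServiceTracker.waitForService(long) provides a comparable timed wait.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory stand-in for a service registry; not the p2 or OSGi API. */
final class WaitingRegistry {
    private final Map<String, Object> services = new HashMap<>();

    /** Registers a service and wakes up any threads waiting for it. */
    synchronized void register(String name, Object service) {
        services.put(name, service);
        notifyAll();
    }

    /**
     * Returns the named service, waiting up to timeoutMillis for it to appear.
     * Throws an informative exception instead of returning null.
     */
    synchronized Object getService(String name, long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (!services.containsKey(name)) {
            long remainingNanos = deadline - System.nanoTime();
            if (remainingNanos <= 0) {
                throw new IllegalStateException(
                        "Service not available after " + timeoutMillis + " ms: " + name);
            }
            try {
                // Round up so a sub-millisecond remainder never becomes wait(0),
                // which would block indefinitely.
                wait(remainingNanos / 1_000_000L + 1);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting for " + name, e);
            }
        }
        return services.get(name);
    }
}

final class WaitingRegistryDemo {
    /** Simulates a late registration and returns what the application thread sees. */
    static Object lookupWithLateRegistration() {
        WaitingRegistry registry = new WaitingRegistry();
        // Plays the role of the start-level thread that registers the service late.
        Thread startLevelThread = new Thread(() -> {
            try {
                Thread.sleep(100);
            } catch (InterruptedException e) {
                return;
            }
            registry.register("artifact.repository.manager", "manager");
        });
        startLevelThread.start();
        // Plays the role of the application thread: it waits instead of seeing null.
        return registry.getService("artifact.repository.manager", 5_000);
    }

    public static void main(String[] args) {
        System.out.println(lookupWithLateRegistration());
    }
}
```

A lookup for a service that never appears ends in an IllegalStateException naming the missing service, which is easier to diagnose than the bare NPE in the original log.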
Created attachment 157844 [details]
app container patch
Here is a potential app container patch. It moves the call to launch the any threaded application (AnyThreadAppLauncher.launchEclipseApplication(anyThreadedDefaultApp)) until after the main thread has gotten control back. The main thread gets control back after the platform is fully initialized, the start-level has been set to its final value etc.
This has one unfortunate side-effect. It will prevent any-threaded applications from running as the default application (set with the eclipse.application property) when running on a framework that does not have control over the main thread. When Equinox is launched we have control of the main thread and are able to register an ApplicationLauncher service for running main-threaded applications. This patch uses this fact to delay the start of the "default" any-threaded application until the ApplicationLauncher is available.
The other option is to delay the launch of the default any threaded application until after the start-level has been set to its final value. Unfortunately there is no standard and reliable way to figure out when the final start level has been reached unless you are the actual agent setting the framework start-level. In fact there is no standard way to even know that the framework start-level is currently being changed to a new value. Anything we put in place to react to start-level changes will be a hack or Equinox specific.
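As a rough illustration of the approach taken by the patch (hold back the default any-threaded application until startup has finished), the following plain-Java sketch uses a latch as the "platform is ready" signal. DeferredAppLauncher and its methods are hypothetical names, not the Equinox app container classes, and the latch only stands in for the main thread getting control back.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;

/** Hypothetical sketch; names do not correspond to the Equinox app container classes. */
final class DeferredAppLauncher {
    private final CountDownLatch platformReady = new CountDownLatch(1);

    /** Called once startup has finished (in the patch: the main thread has control back). */
    void signalPlatformReady() {
        platformReady.countDown();
    }

    /** Starts a thread that runs the application only after the platform is ready. */
    Thread launchWhenReady(Runnable application) {
        Thread appThread = new Thread(() -> {
            try {
                platformReady.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            application.run();
        }, "any-thread-app");
        appThread.start();
        return appThread;
    }
}

final class DeferredAppLauncherDemo {
    /** Returns the order in which the startup work and the application ran. */
    static List<String> runDemo() {
        List<String> events = Collections.synchronizedList(new ArrayList<>());
        DeferredAppLauncher launcher = new DeferredAppLauncher();
        Thread app = launcher.launchWhenReady(() -> events.add("application ran"));
        events.add("bundles started"); // stands in for the start-level work
        launcher.signalPlatformReady();
        try {
            app.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return events;
    }

    public static void main(String[] args) {
        System.out.println(runDemo());
    }
}
```

Because the application thread blocks until the signal, the startup work is always recorded first, whichever thread the scheduler favors.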
(In reply to comment #10)
> Created an attachment (id=157844) [details]
> app container patch

I tested the patch and it works fine.

> The other option is to delay the launch of the default any threaded application
> until after the start-level has been set to its final value. Unfortunately
> there is no standard and reliable way to figure out when the final start level
> has been reached unless you are the actual agent setting the framework
> start-level. In fact there is no standard way to even know that the framework
> start-level is currently being changed to a new value. Anything we put in
> place to react to start-level changes will be a hack or Equinox specific.

Have you considered the option to start the default application after the framework has started? I think this will be the appropriate time for launching the default application. This would be possible by:
- listening for a framework started event
- checking the state of the system bundle and listening for a change to ACTIVE state, perhaps via a BundleTracker

(In reply to comment #11)
> Have you considered the option to start the default application after the
> framework has started? I think this will be the appropriate time for launching
> the default application. This would be possible by:
> - listening for a framework started event
> - checking the state of the system bundle and listening for a change to ACTIVE
> state perhaps via a BundleTracker

I did consider that but it still has issues depending on how the framework is launched. The framework could be launched to the default start-level (1) and then the start-level could be increased to the final start-level (6). In this case the FrameworkEvent.STARTED would have already been fired. The issue is we do not know the context of why the app container is starting.

I released the app container patch. John, do you want to keep this bug open in p2 for additional publisher changes? Or should we move and close this bug in Equinox->Compendium and open a separate bug for the publisher changes?

(In reply to comment #13)
> I released the app container patch. John, do you want to keep this bug open in
> p2 for additional publisher changes? Or should we move and close this bug in
> Equinox->Compendium and open a separate bug for the publisher changes?

I released the fix in the publisher already. I'll just move this to Equinox Compendium for the app container fix.

Fixed.

Which platform build number includes these fixes?

(In reply to comment #16)
> Which platform build number includes these fixes?

This was just released yesterday so it is not included in an I-Build yet. The first nightly build to include the fix is N20100203-2000.

I can confirm that this issue is resolved in N20100203-2000. Thanks!

*** Bug 303622 has been marked as a duplicate of this bug. ***
*** Bug 303802 has been marked as a duplicate of this bug. ***

build i0128-1731

- linux
- unset DISPLAY
- run the FeaturesAndBundlesPublisher headless app
- get an NPE

If you run the app with -nosplash then it works OK.

This is the command-line I used to get the NPE:

/home/dj/java/ibm-60/jre/bin/java -classpath /home/dj/downloads/eclipse.test/plugins/org.eclipse.equinox.launcher_1.1.0.v20100118.jar org.eclipse.equinox.launcher.Main -consoleLog -application org.eclipse.equinox.p2.publisher.FeaturesAndBundlesPublisher -source /home/dj/downloads/updateSite -metadataRepository file:/home/dj/downloads/updateSite -metadataRepositoryName "Object Teams Updates" -artifactRepository file:/home/dj/downloads/updateSite

Log file:

!SESSION 2010-01-29 10:50:58.270 -----------------------------------------------
eclipse.buildId=I20100128-1731
java.fullversion=J2RE 1.6.0 IBM J9 2.4 Linux x86-32 jvmxi3260-20080816_22093 (JIT enabled, AOT enabled)
J9VM - 20080816_022093_lHdSMr
JIT - r9_20080721_1330ifx2
GC - 20080724_AA
BootLoader constants: OS=linux, ARCH=x86, WS=gtk, NL=en_US
Framework arguments: -application org.eclipse.equinox.p2.publisher.FeaturesAndBundlesPublisher -source /home/dj/downloads/updateSite -metadataRepository file:/home/dj/downloads/updateSite -metadataRepositoryName Object Teams Updates -artifactRepository file:/home/dj/downloads/updateSite
Command-line arguments: -consoleLog -application org.eclipse.equinox.p2.publisher.FeaturesAndBundlesPublisher -source /home/dj/downloads/updateSite -metadataRepository file:/home/dj/downloads/updateSite -metadataRepositoryName Object Teams Updates -artifactRepository file:/home/dj/downloads/updateSite

!ENTRY org.eclipse.equinox.app 4 0 2010-01-29 10:50:59.245
!MESSAGE
!STACK 0
java.lang.NullPointerException
    at org.eclipse.equinox.p2.publisher.Publisher.loadArtifactRepository(Publisher.java:142)
    at org.eclipse.equinox.p2.publisher.Publisher.createArtifactRepository(Publisher.java:104)
    at org.eclipse.equinox.p2.publisher.AbstractPublisherApplication.initializeRepositories(AbstractPublisherApplication.java:89)
    at org.eclipse.equinox.p2.publisher.AbstractPublisherApplication.initialize(AbstractPublisherApplication.java:80)
    at org.eclipse.equinox.p2.publisher.AbstractPublisherApplication.run(AbstractPublisherApplication.java:271)
    at org.eclipse.equinox.p2.publisher.AbstractPublisherApplication.run(AbstractPublisherApplication.java:249)
    at org.eclipse.equinox.p2.publisher.AbstractPublisherApplication.start(AbstractPublisherApplication.java:301)
    at org.eclipse.equinox.internal.app.EclipseAppHandle.run(EclipseAppHandle.java:194)
    at org.eclipse.equinox.internal.app.AnyThreadAppLauncher.run(AnyThreadAppLauncher.java:26)
    at java.lang.Thread.run(Thread.java:735)