Community
Participate
Working Groups
Created attachment 197831 [details] workspace log found after trying to "install everything" in common repo While stess testing Indigo RC4 packages (Java EE IDE) and while trying to reproduce bug 349073 (which I could not) I did once see a seemingly bad error, that left my installation in some sort of state where it would not restart. Worse, there was no error log where it told be to go check for errors. On restart after installing a bunch of stuff (and subsequent attempts) I get an error dialog that said "An error has occurred. See the log file /home/davidw/.eclipse/org.eclipse.platform_3.7.0_692754952/configuration/1307756571714.log" but no log there. And ... the main reason I'm opening this bug ... I was surprised by some of the error messages left in workspace log during the "install everything" phase of the test ... and they might be related to the error state? Many messages are similar to !ENTRY org.eclipse.equinox.p2.touchpoint.natives 4 0 2011-06-10 21:36:30.274 !MESSAGE org.eclipse.equinox.internal.p2.touchpoint.natives.actions.UnzipAction error unzipping zipfile: /home/davidw/.eclipse/o rg.eclipse.platform_3.7.0_692754952/p2/org.eclipse.equinox.p2.core/cache/binary/org.eclipse.xtext.ui_root_2.0.0.v201106070531des tination: /usr/local/eclipseIndigoRC4 I think that is saying it is trying to use as destination /usr/local/eclipseIndigoRC4, which, as you might guess, is my "shared installed" and indeed me as "user" does not have write access there ... so, I'm surprised anything is trying to unzip something there. (and, unzip failures might even be related to the "too many files open" problem I was initially investigating, if so wanted to be sure to capture the log, and info, in case it helps.
Here's some detail on the VM I'm using ... I mention it since, in the past, some VMs did not correctly handle closing zip files (I think only after an error) so wanted to document the detail. But, for cases I know about, the zip file bugs were fixed long ago. But ... just in case they regressed :) ... java version "1.6.0" Java(TM) SE Runtime Environment (build pxa6460sr9fp1-20110208_03(SR9 FP1)) IBM J9 VM (build 2.4, JRE 1.6.0 IBM J9 2.4 Linux amd64-64 jvmxa6460sr9-20110203_74623 (JIT enabled, AOT enabled) J9VM - 20110203_074623 JIT - r9_20101028_17488ifx3 GC - 20101027_AA) JCL - 20110203_01
DJ could you please investigate this?
I was able to reproduce this (details in next entry) but I think the "installing into non-writable area" is unrelated, and some separate issue, as the log messages are reproducible in much simpler scenarios. I've opened bug 349126 for that "touch point" issue.
Created attachment 197841 [details] config logs (and artifacts.xml) after "install everything" I was able to get in this "unstartable state" again, and this time, I did find logs where it it told me they were. I've not looked too closely, but did see some "too many files open" message in there ... so, ... maybe I just really do need to increase file limit? (Seems odd ... though, since I've done similar install tests in the past, and never had too ... but, admit, I do not do it systematically each and every milestone, or anything, so not sure if poor previous testing on my part, or some sort of regression.
Could you also attach the bundles.info and the content of the profile after the isntallation? What were you installing and from which package and which repo. Thx.
Created attachment 197843 [details] workspace log from latest failed "install everyting" In my latest attempt to recreate, I started with the EPP package for JavaScript ID. Untaring it in to /usr/local as root, then running as normal user. I added /releases/staging to software sites, then tried to select and install as much as I could. I had to unselect runtime targets (naturally) but also I deselected ObjectTeam stuff (conflicts with JDK) and 'eRCP' under mobile tools and all WindowBuilder features (may not have had to deselect all that, but got some messages about platform incompatible, so just wanted to be sure I had a relatively valid selection set ... even though, I hope, no one would really select as much as I had selected! The "install" seemed to go ok (took a long time, 30 minutes maybe?) but no error messages. Gave the normal "you need to restart" prompt, and when I did, is when I got the "see config.log" msg".
Created attachment 197844 [details] profile directory and bundles.info Here's the requested bundles.info and profile directory. I should, BTW, emphasize that my main concern is not the ability to "install everything" ... that'd be nice, but ... my main concern is that I managed to select something that was hosing up the install all by itself.
I'm worried that we are not failing the install and leaving the install unchanged in such circumstances. Assuming that this is a file limit problem, does a piece meal installation resolve the issue? Also does the same issue occurs when eclipse is installed as normal user?
Created attachment 197846 [details] another config log I tried to start again, after adding ulimit -n 2048 to my script. Still failed, said to see log, but it was very brief, and seemed to have no error msgs in it. just "warning" about equinox hook. But, I've attached. AND THEN, I tried ulimit -n 4048 and that time it did startup! Yeah! Not sure if coincidence, or it really needs that many handles (which, if it does, would probably indicate a bug, or something unusual ... web servers don't usually need that many! :) .... but, at least indicates there's not some run-away viral process or anything.
At this point I would not be surprised if the issue you ran into was bug #349121.
(In reply to comment #10) > At this point I would not be surprised if the issue you ran into was bug > #349121. Maybe. Partially. I did just try again. This time using eclipse SDK installed into non-shared area, normal user area. Same "install everything" attempt. Again, install seemed to succeed. Failed on restart. I had reset ulimit to 1028. Again not much in config log (just warning about equinox hook), BUT had interesting console log (from shell) which I'll attach next. No mention anywhere about "out of handles". But, I then set ulimit -n back to 4048 and it started up ok. Ahh, the joys of stress testing :)
Created attachment 197847 [details] console log captured from shell invocation, from eclipseSDK non-shared scenario Doubt this console log (from shell) is interesting, unless it helps explains why the "real" errors are not logged. I agree the "root" problem seems to be excessive handles required. I will add, I think ulimit -n applies to both file handles and "ports" used in http connections. I mention this, because it seems upon successful restart, when ulimit set to 4048, I am immediately presented with some p2-like "discovery" dialogs to install more stuff ... one from SVN, and one from "Modeling Projects" ... so, maybe, wild guess ... someone is doing something funny with http connections on restart. When in this "first restarted" state, lsof reports 2632 handles are in use (by the launcher process).
I think this is bug 349121 or some other one that's simply holding on to the "jars" in plugins directory. Once I could get it to come up, recently, there were 2577 handles being held, and 2342 of those were specifically to "jars" files in plugins. Not sure what the normal "percentage is, but that seems extreme. There 3459 total jar files in my plugins directory in this stress test, so not sure why some are dangling and some not if just has to do with "checking signature"? There are also, in this weird scenario 3 or 4 "framework hooks" and probably 40 bundles that show up in "startup" preferences ... so plenty of opportunity for someone to be doing something to all those jars. In my primary working environment ... mostly/roughly Java EE IDE from around Indigo M6 ... there are, once up, 358 handles in use, 139 of those point to jars in plugins directory, and there are 1109 total plugins jars in installation. If that helps anyone. My heads spinning :)
Well one more ... mostly because I also meant to warn, when running this "stressed" environment, even when I get it to come up with ulimit -n 4048 it will not shut down cleanly. I need to use kill -9 on it. :( The numbers for my Helios SR2 environment are similar to my Indigo working environment: total handles in use 315, number of those that are to jars in plugins folder: 99 total number of plugins: 779 $ lsof -p 7595 | wc -l 315 $ lsof -p 7595 | grep "plugin.*jar" | wc -l 99 $ ll plugins | wc -l 779
I did try some runs where I'd start with Java EE IDE package, and instead of "installing everything", just install a hand full of feature ... select 5 or 15 features, and install those. Whenever the workbench would restart, after such additional installs, the number of handles in use would be quite high ... such as 1300. If I shut down the workbench (cleanly) and restarted it again (with no new installations) then the number of handles at that starting point would be relatively normal such as 385. All of the "leaked handles" do not seem directly related to how much was installed during the most recent install, but the delta may be. For example, when I started paying attention to above "caching" effect, I installed one new feature, that added ~50 jar files to /plugins, and again, upon restart, the number of handles in use was 1343 (similar to the previous high value of 1306, but in the neighborhood of 50 higher). Then, upon restart (with no new installation) the number of handles was again much lower, 425 ... but previous "resting" value was 385 ... again, in neighborhood of 50 higher. So ... my guess is there are several sources of file handle leaks. And, it won't take "installing everything" before people hit it ... maybe a couple, maybe a dozen new features? at the end of my most recent "few things at a time" testing, I had only 1500 jar files in /plugins. Of course, whether people "crash" on restart will depend on their ulimit -n setting ... but 1024 is quiet common default. And ... I forget ... do other operating systems limit number of file handles? I don't think so, but could be wrong.
David I would like to confirm something, when you say "I tried to start again, after adding ulimit -n 2048 to my script.". What does the script do? Is it installing or simply trying to start eclipse? It just occurred to me that the install may well be completing fine but it is the restart that causes problems.
*** Bug 336435 has been marked as a duplicate of this bug. ***
(In reply to comment #16) > David I would like to confirm something, when you say "I tried to start again, > after adding ulimit -n 2048 to my script.". What does the script do? Is it > installing or simply trying to start eclipse? > > It just occurred to me that the install may well be completing fine but it is > the restart that causes problems. The script is to just start eclipse. and, yes, it appears to me, on the surface, that the install completes (at least relatively ok). And, on that first restart, there are 3 or 5 times the number of file handles required, on "first start after install" (so, that is where users will see a failure to start ... on the semi-automatic restart after install) but even the "resting" file handles in use (on second start after successful shutdown with no install) seem a little higher than normal).
I'd like to focus this bug on the excessive number of file handles used, especially on "first startup". I'll rename bug, and I'll obsolete many of the attachments, which are probably just distracting. This problem may only apply to linux, and will be an issue only if users default "file handle limits" are set "low", such as 1024, the default on most fresh installations, but, I think major, for that user group. At least, the portion of them that installs lots of stuff. :) [And, _might_ be VM specific? I've only tried one VM, and only on 64 bit arch]. I've tried several "install new stuff" use cases, and I think the issue it simply with the first time starting, or restarting. But, the "major" part, is some of our packages (and some adopter products) start off with lot of bundles, so they may have problems, in this environment. Luckily there is at least a work-around ... user can set their ulimit -n higher and likely continue. FWIW, I tried running scenarios with the patch in bug 349121 and it did not help reduce number of "leaked handles", in the use cases I tried. I'll attach some "statistics" from some test runs I did that I think show the root problem well. Essentially, the "handles in use" is always around one or two hundred more than the number of bundles installed ... on first startup (but less and relatively constant on subsequent startups). So, while I suspect there is still several sources of "leaked handles", I think these high numbers on first startup, correlated to number of bundles installed, is a bad one.
Created attachment 197911 [details] small tables, showing "handles in use" at varioius points Hope these numbers, and bug report, helps. In some cases, it seemed "leaked handles" was perfectly correlated to number of jars (not bundles) installed but that didn't hold up with I installed "Birt examples", ... so ... might still be related to some kind of specific jar ... but, figure the next steps would be some debug sessions, not more installing :)
Good timing ... :) ... Max just posted on equinox-dev list pointing to an old bug 138182 that is probably related?
I'm going to try and debug, but I have a feeling something else is crawling through the installed jar files and leaking the files (PDE etc). First step is to confirm that the framework is not leaking these files.
(In reply to comment #21) > Good timing ... :) ... Max just posted on equinox-dev list > pointing to an old bug 138182 that is probably related? FWIW, I could not see osgi.bundlefile.limit set to any default, as implied by posts ... but, I also tried setting it explicitly to 100 from eclipse.ini and command line, and made no difference on large number of "leaked" file handles. (So, yes, ... maybe be leaking from somewhere else).
(In reply to comment #22) > I'm going to try and debug, but I have a feeling something else is crawling > through the installed jar files and leaking the files (PDE etc). First step is > to confirm that the framework is not leaking these files. I asked Snjezana about what she found regarding zipentry and here is her answer: "I have started Eclipse in PDE with the -clean option. SignatureBlockProcessor.readIntoArray is called only once. ZipBundleFile.basicOpen is called for every jarred plugin. ZipBundleFile.close is called on Eclipse shutdown. lsof -p <PID> shows the plugin's jar files are opened." So from what we see it seems like a very old bug but something that only recently started affecting Eclipse after many more plugins are now jarred + there are more bundles in general.
(In reply to comment #23) > (In reply to comment #21) > > Good timing ... :) ... Max just posted on equinox-dev list > > pointing to an old bug 138182 that is probably related? > > FWIW, I could not see osgi.bundlefile.limit set to any default, as implied by > posts ... but, I also tried setting it explicitly to 100 from eclipse.ini and > command line, and made no difference on large number of "leaked" file handles. > (So, yes, ... maybe be leaking from somewhere else). We don't set it because the default (when not set) is 100. (In reply to comment #24) > > "I have started Eclipse in PDE with the -clean option. > SignatureBlockProcessor.readIntoArray is called only once. > ZipBundleFile.basicOpen is called for every jarred plugin. ZipBundleFile.close > is called on Eclipse shutdown. lsof -p <PID> shows the plugin's jar files are > opened." > > So from what we see it seems like a very old bug but something that only > recently started affecting Eclipse after many more plugins are now jarred + > there are more bundles in general. This would be surprising, because such behavior would prevent any eclipse base product with over 1000 jarred bundles from installing on a default linux installation. I know of several products that have upwards of 3000 bundles, most of which are jars. I will try to reproduce.
The problem happens when Eclipse is started with the -clean option or is restarted after installing/updating. For more details see https://issues.jboss.org/browse/JBIDE-8854?focusedCommentId=12608111&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12608111 A reproducible test case is included in this comment.
> > So from what we see it seems like a very old bug but something that only > > recently started affecting Eclipse after many more plugins are now jarred + > > there are more bundles in general. > > This would be surprising, because such behavior would prevent any eclipse base > product with over 1000 jarred bundles from installing on a default linux > installation. I know of several products that have upwards of 3000 bundles, > most of which are jars. That is what we are seeing right now...many jarred plugins => massive jar refs in lsof when calling with -clean or after reinstall. > I will try to reproduce.
OK, I have found the issue. This is a regression in Indigo. The fix for bug332771 has introduced the issue: org.eclipse.osgi.internal.baseadaptor.BaseStorage.getMRUList() is recreating a new MRUBundlefileList for each bundle zip file. This is about the worst thing we could have done for limiting zip files opened.
CC Kim, since this is a rather critical bug in 3.7. Need to discuss if we can defer fixing until SR1 or consider respinning RC4.
Created attachment 197924 [details] patch Patch to limit the number of MRUBundleFileList to one for the lifetime of the framework.
I have reviewed the fix and I'm in favour of a 3.7 stream rebuild if downstream teams are ok with it (release train participants who would need to respin). Effectively this bug disables the MRU cache that was introduced in 3.4. It means Equinox holds one open file handle for each bundle that is accessed in a given session. Note the "install 1000 bundles at once" case isn't the only failure case for this. Bundles that have files loaded for any reason will keep a file handle open. Before we introduced that MRU cache, this was a constant source of problems for the consumer community, so *not* fixing it would require all of that consumer community to manually patch this until 3.7.1 comes out. We should have a regression stress test for this case to avoid future regressions.
Created attachment 197930 [details] unofficial, unsigned, temporary binary patch for confirmation testing To be sure this was the only/main thing I was seeing, I created a patch feature, applied to some of my "test installations" and can conclude that yes, it was the main problem I was encountering. The number of handles were not perfectly stable from one run to the next, installing one thing then another, but nearly so. I never say "used handles" above 380 with the patch applied ... much better than 1010 :) I've attached my installable test patch, if others want to test ... usual caveats apply ... unofficial, not signed, temporary, to be used only to confirm if observed issues are due to this bug or something else.
As for a respin, the EPP packages would, of course, have to be re-built. But, hopefully not the "common repository". I say "hopefully not", since I would not be very confident that we could. Of the many projects, a few might have changed URLs, have changed content, or something ... not saying they did, just saying there's risk there (there's a lot of projects, not all of whom have as mature a releng process as the platform :) . But, if _no one_ in the common repository (such as RAP runtime?) had a direct dependency on the exact version of org.eclipse.osgi, then we'd not have to rebuild the common repository ... just update one of it's "trusted repo" URLs to point to the new platform site, if/when it was rebuilt. Any one feel confident enough with p2 metadata to "search" the content.jar, at releases/staging to see if hard and fast dependencies on org.eclipse.osgi?
DJ, please review.
Pascal please review.
Jeff please review and approve. Note that we need input from the community on if they can react in time, but need to get our approvals and builds done just in case we decide to recontribute for Indigo.
Regarding comment #33, David since the bundle in question is osgi which resides at the very bottom of the eclipse stack, many of the Eclipse feature suffixes will be changed which would change my build submission to Indigo. Yes, the build points to a child repo for the Eclipse project. However, every time I release a new milestone to the Indigo build, it seems that this this requires a new RAP submission because it's looking for an old version of osgi when the platform has just submitted a new one.
After applying the patch, I can't reproduce the issue described in https://issues.jboss.org/browse/JBIDE-8854?focusedCommentId=12608111&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12608111
Awesome; we weren't crazy after all ;) Here is my vote (not that it counts ;) for making this into Eclipse 3.7.0 as we have already had more than enough bugreports from users of JBoss Tools on this issue and since the bug occurs mostly after you have installed something new then users blame the new thing coming in instead of the actual core problem in the base eclipse equinox resulting in tainting the whole ecosystem unjust.
The issue seems severe enough, and the patch safe enough, to try and get this into Indigo. If rebuilt feature qualifiers are a problem then technically, what about having the patch feature released with Indigo and built into the packages? Commercially, there's many products which release patch features from day 1.
I reviewed the fix and it looks good to me. I'm in favor of fixing this for 3.7.
+1
Marking as fixed for RC4 (even though the build will likely be called RC5).
Sorry for the late response, was out all day yesterday. Yes, this is something we should fix for Indigo. Thanks all for finding/reporting/fixing.
FWIW, in case there's any doubt, with the fix in EPP Packages, I was able to "install everything". Not that I could use everything ... but is a good stress test. The operation worked and the IDE restarted. I never saw "file handles in use" above 400. Thanks!
*** Bug 349554 has been marked as a duplicate of this bug. ***
FYI, the build that includes this fix (to be renamed RC5 later) is: http://download.eclipse.org/eclipse/downloads/drops/I20110613-1736/
(In reply to comment #47) > FYI, the build that includes this fix (to be renamed RC5 later) is: > > http://download.eclipse.org/eclipse/downloads/drops/I20110613-1736/ Any to let everyone know when we use that site we can install JBoss Tools completely without a glitch - so considered fix by us. Great work Thanks!
Thanks to all that have helped verify this bug fix.