Today I accessed the Hudson "performance" machine:

https://hudson.eclipse.org/perftests/view/Eclipse%20and%20Equinox/

And got this message:

Error
Hudson detected that you appear to be running more than one instance of Hudson that share the same home directory '/home/hudson/hudsonbuild/.hudson'. This greatly confuses Hudson and you will likely experience strange behaviors, so please correct the situation.
This Hudson: 1087747993 contextPath="/perftests" at 32160@hudson-perf1master
Other Hudson: 134800844 contextPath="/perftests" at 11925@hudson-perf1master

Odd that this would start now? I did, in the past week, "update plugins", and doing so requires a "Hudson restart" ... if you think that's related?

There was an option to "ignore this problem and use Hudson anyway" which I will probably pick, for now, but I suspect it needs to be fixed and restarted.

Marked as "critical", since I suspect the situation would result in "lost data".
In fact, I am wondering if this "same home directory" problem accounts for bug 454736 -- where the "locks" in Hudson did not seem to be working (allowing two jobs to run, each of which should have had to "get a lock" first). And, that did result in lost and "bad data". [I recall now, it was adding locks that initially required this machine to be restarted.]

Even before this previous week, I've sometimes had trouble using "cascading jobs", where I change the "parent job's" value ... and it appears to "take" in the child when looked at immediately in the child, but then at runtime, the "child" behaves as though the value was not set. That has resulted in a lot of wasted time, and possibly "bad data", but I'm not positive. (The variable had to do with "cleaning the workspace" at the start of a test run, so if not "clean", not sure what the impact was, exactly, since most things were obviously being overwritten, but I am not sure all were.)

And, not sure if this hurts anything, but

uname -a

gives one name to the machine, which matches what 'hostname' gives (when executed), but

echo $HOSTNAME

gives "build".

Seems that environment variable is configured? Incorrectly? Not sure what trouble that would cause, if any, but seems like it could?

While I no longer get the message I did in comment 0, I hope the performance machine can be set up to be a bit more independent, and have its own "HUDSON_HOME", etc. And, of course, for you to double check your start up scripts to make sure there are not two of them being started up.
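For anyone wanting to repeat the hostname comparison described above, here is a minimal sketch. The helper name hostnames_match is made up for illustration; it takes both values as arguments so it can be tried without touching the real environment.

```shell
#!/bin/sh
# hostnames_match is a hypothetical helper (not part of Hudson or the OS).
# It compares the name the system reports with the HOSTNAME environment
# variable and prints which of the two cases applies.
hostnames_match() {
    actual=$1      # e.g. output of `uname -n` or `hostname`
    from_env=$2    # e.g. value of $HOSTNAME
    if [ "$actual" = "$from_env" ]; then
        echo "match"
    else
        echo "mismatch: hostname='$actual' HOSTNAME='$from_env'"
        return 1
    fi
}

# Typical use on the machine itself (left as a comment, not run here):
#   hostnames_match "$(uname -n)" "${HOSTNAME:-}"
```

A mismatch only confirms that the variable is set somewhere in the login or service environment; it does not say where, or whether Hudson is affected by it.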
And, while I'm thinking of it ... after seeing the confusing "hostname" variable ... I want to confirm ... this machine has its own dedicated disks, right? Not "shared" with "build" or any part of NFS, right?

I thought to ask thinking of those times recently when the "build" machine seemed locked up (such as bug 454272) and it seemed much of that was due to a backlog of disk operations ... and, I'm just now thinking to say ... some of our performance tests are extremely disk intensive (they bog down my personal, non-shared Linux server! -- and it's not CPU or memory load ... it's disk operations).

So, just wanted to confirm.
As further evidence the "performance test machine" is configured wrong, if you were to go look right now, the main screen shows one build running, and one in the queue. But, if you go to the specific job (the one that is in the queue) you'd see it is already running.

I'll attach screen shots ... but ... I sure would like some response to this bug!
Created attachment 249534 [details] main web page
Created attachment 249535 [details] page for a specific build.
Changing this to "blocker". As you can see, if you look at the machine, one job is "frozen" in the queue, and the "hidden one" does not seem to be doing anything. Sounds like a clear case of "two instances" to me.

If I don't hear anything soon via this bug, I'll try canceling the hidden job (though I did try that once before), and if I cannot cancel it, I will force a restart.
It looks like your restart did manage to start a second job:

>ps -aef | grep hudsonbuild
11925 11920 0 Dec14 ? 00:09:19 /shared/common/jdk1.7.0-latest//bin/java -Xms1000m -Xmx2500m -XX:MaxPermSize=256m -XX:+HeapDumpOnOutOfMemoryError -Djava.io.tmpdir=/tmp/hudsonbuild -jar /home/hudson/hudsonbuild/hudson.war --httpPort=8443 --prefix=/perftests
32160 32155 0 Dec12 ? 00:52:13 /shared/common/jdk1.7.0-latest//bin/java -Xms1000m -Xmx2500m -XX:MaxPermSize=256m -XX:+HeapDumpOnOutOfMemoryError -Djava.io.tmpdir=/tmp/hudsonbuild -jar /home/hudson/hudsonbuild/hudson.war --httpPort=8443 --prefix=/perftests

I've killed them both and cleanly restarted hudson.

-M.
Thank you for fixing the immediate serious problem reported 4 days ago :/ But, it would be nice to understand how to configure Hudson so that it could restart itself as it is designed to do. But will change to "major" (lost function) to cover that part.
(In reply to David Williams from comment #8)
> Thank you for fixing the immediate serious problem reported 4 days ago :/
>
> But, it would be nice to understand how to configure Hudson so that it could
> restart itself as it is designed to do. But will change to "major" (lost
> function) to cover that part.

I'll also note, it seems like the "weekly restart" could use some improvement to be sure to kill "all instances" before starting "the right" one. Or, at least as a "second step" perhaps

pgrep -f /home/hudson/hudsonbuild/hudson.war
pkill -f /home/hudson/hudsonbuild/hudson.war
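To make the "kill all instances first" idea above a bit more concrete, here is a rough sketch built on the same pgrep/pkill commands. The function names and the 30-second wait are my own assumptions. How the weekly restart actually launches Hudson is not shown in this bug, so the start step is left out.

```shell
#!/bin/sh
# count_instances and stop_all_instances are hypothetical helpers.

# Print how many running processes have a command line matching the pattern.
count_instances() {
    { pgrep -f "$1" || true; } | wc -l | tr -d ' '
}

# Send SIGTERM (pkill's default) to every match, then wait up to ~30 seconds
# for them to exit. Succeeds only if nothing matching is left.
stop_all_instances() {
    pattern=$1
    pkill -f "$pattern" 2>/dev/null || true
    i=0
    while [ "$(count_instances "$pattern")" -gt 0 ] && [ "$i" -lt 30 ]; do
        sleep 1
        i=$((i + 1))
    done
    [ "$(count_instances "$pattern")" -eq 0 ]
}

# Intended use in a restart script (comment only, not run here):
#   stop_all_instances /home/hudson/hudsonbuild/hudson.war || exit 1
#   ... start exactly one Hudson here ...
```

Refusing to start a new instance while an old one is still alive is the point of the final check. Whether a stubborn process should then get a SIGKILL is a judgment call this sketch leaves open.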
I don't think your "clean start" was too clean. Builds are now not able to execute XVNC:

https://hudson.eclipse.org/perftests/view/Eclipse%20and%20Equinox/job/ep44MLR-perf-lin64/3/console

Started by user david_williams
FATAL: null
java.lang.NullPointerException
    at hudson.plugins.xvnc.Xvnc.doSetUp(Xvnc.java:83)
    at hudson.plugins.xvnc.Xvnc.setUp(Xvnc.java:73)
    at hudson.model.Build$RunnerImpl.doRun(Build.java:129)
    at hudson.model.AbstractBuild$AbstractRunner.run(AbstractBuild.java:524)
    at hudson.model.Run.run(Run.java:1450)
    at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:44)
    at hudson.model.ResourceController.execute(ResourceController.java:82)
    at hudson.model.Executor.run(Executor.java:137)

My guess is that some plugins need to be installed into "this" instance. (IMHO, this wasn't due to me "restarting" ... I think it's been mis-configured for a while.)

Assuming this is part of the "weekly restart" ... I bet on Sunday, there will be two instances running again.
> Builds are now not able to execute XVNC:

I just restarted hudson and ran your build, and saw this in the console:

Started by user droy
$ pkill Xvnc
$ pkill Xrealvnc
$ sh -c "rm -f /tmp/.X*-lock /tmp/.X11-unix/X*"
FATAL: null
java.lang.NullPointerException
    at hudson.plugins.xvnc.Xvnc.doSetUp(Xvnc.java:83)
    at hudson.plugins.xvnc.Xvnc.setUp(Xvnc.java:73)
    at hudson.model.Build$RunnerImpl.doRun(Build.java:129)
    at hudson.model.AbstractBuild$AbstractRunner.run(AbstractBuild.java:524)
    at hudson.model.Run.run(Run.java:1450)
    at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:44)
    at hudson.model.ResourceController.execute(ResourceController.java:82)
    at hudson.model.Executor.run(Executor.java:137)

pkill Xvnc?
(In reply to Denis Roy from comment #11)
> > Builds are now not able to execute XVNC:
>
> I just restarted hudson and ran your build, and saw this in the console:
>
> Started by user droy
> $ pkill Xvnc
> $ pkill Xrealvnc
> $ sh -c "rm -f /tmp/.X*-lock /tmp/.X11-unix/X*"
> FATAL: null
> java.lang.NullPointerException
>     at hudson.plugins.xvnc.Xvnc.doSetUp(Xvnc.java:83)
>     at hudson.plugins.xvnc.Xvnc.setUp(Xvnc.java:73)
>     at hudson.model.Build$RunnerImpl.doRun(Build.java:129)
>     at hudson.model.AbstractBuild$AbstractRunner.run(AbstractBuild.java:524)
>     at hudson.model.Run.run(Run.java:1450)
>     at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:44)
>     at hudson.model.ResourceController.execute(ResourceController.java:82)
>     at hudson.model.Executor.run(Executor.java:137)
>
>
>
> pkill Xvnc?

I think those kills and removes are part of a standard "Clean up before start" option in the "Manage Hudson" configuration. And, I suspect they are a red herring in this case. But, just FYI, the "help" for that clean up option says:

Try to clean up any stale locks or processes before running Xvnc for the first time in a given session on a given node. Any processes named Xvnc or Xrealvnc will be killed, and files /tmp/.X*-lock or /tmp/.X11-unix/X* deleted.

So, I think that's relatively normal.

As for the null pointer, I'd check to be sure there is a vncserver running. I think normally that is "independent" of Hudson (i.e. it must be set up, and running, before Hudson can make use of the Xvnc plugin).

And, (running or not) another thing to check is whether "you" (hudsonbuild user) have a password set up, so it can automatically invoke the Xvnc command. On this machine (and on the shared instance too), that command is

Xvnc :$DISPLAY_NUMBER -geometry 1024x768 -depth 24 -ac

(where Hudson fills in the $DISPLAY_NUMBER from its "pool").

So, as a normal "hudsonbuild user" I'd see if you can execute, say,

Xvnc :50 -geometry 1024x768 -depth 24 -ac

from the command line.
My guess is that'd simply tell you there is no vncserver running, so that would be the "next step" (perhaps "reinstall" it, or see if there is anything in the logs that prevented it from starting up?). Or ... it might say there was no password?

It is possible, I would guess, that with two instances of Hudson running, some configuration file may have gotten corrupted. So, for example, if vncserver is running, and you can run Xvnc from the command line, then the next thing I'd try is to remove the Xvnc plugin. Restart Hudson. Then add the Xvnc plugin back in, and again restart Hudson.

That's how I would approach debugging this issue.
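Two of the checks above can be done before launching anything: is Xvnc on the PATH, and is there a leftover lock file for the display being tried. This is only a sketch. The helper names are made up, and the lock-file location is the /tmp/.X*-lock pattern quoted from the plugin's help text.

```shell
#!/bin/sh
# binary_on_path, display_lock_exists and xvnc_preflight are hypothetical helpers.

# Succeeds if the named program resolves on the current PATH.
binary_on_path() {
    command -v "$1" >/dev/null 2>&1
}

# Succeeds if an X lock file exists for the given display number.
display_lock_exists() {
    [ -e "/tmp/.X$1-lock" ]
}

# Print a one-line diagnosis for the given display number.
xvnc_preflight() {
    display=$1
    if ! binary_on_path Xvnc; then
        echo "Xvnc not found on PATH"
        return 1
    fi
    if display_lock_exists "$display"; then
        echo "display :$display has a lock file, so it may be in use or stale"
        return 1
    fi
    echo "ok: next try 'Xvnc :$display -geometry 1024x768 -depth 24 -ac'"
}

# Example, run as the hudsonbuild user (comment only):
#   xvnc_preflight 50
```

None of this explains the NullPointerException by itself. It only rules out the two simplest environment problems before looking at the plugin.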
If I launch Xvnc from the commandline I get this:

hudson-perf1master:~> Xvnc
19/12/2014 11:33:45 Xvnc version X.org/xf4vnc custom version
[snip]
19/12/2014 11:33:45 Protocol versions supported: 3.7, 3.3
Fatal server error: Couldn't add screen

If I launch it with a display number I get a running Xvnc

@hudson-perf1master:~> Xvnc :1000
19/12/2014 11:33:41 Xvnc version X.org/xf4vnc custom version
(EE) config/hal: NewInputDeviceRequest failed (2)
^C

outage4@hudson-perf1master:~> vncserver
You will require a password to access your desktops.
Password:

Perhaps the Hudson environment is missing a display number. I'll look at the HIPPs to see what we do there.
(In reply to David Williams from comment #12)
> So, as a normal "hudsonbuild user" I'd see if you can execute, say,
>
> Xvnc :50 -geometry 1024x768 -depth 24 -ac
>
> from the command line.

As the hudsonbuild user, that works just fine.

I compared the perf hudson config (https://hudson.eclipse.org/perftests/configure) with the CBI hipp, and the config looks OK -- both have a minimum and a maximum range. Since Perf is the only hudson running on the server, there's no need to worry about display overlap.

> My guess is that'd simply tell you there is no vncserver running, so that
> would be the "next step"

I don't see a vncserver running on the HIPP machines.
(In reply to Denis Roy from comment #14)
>
> > My guess is that'd simply tell you there is no vncserver running, so that
> > would be the "next step"
>
> I don't see a vncserver running on the HIPP machines.

Yes, my mistake. I've read "man Xvnc" now, and it explains that "vncserver" is just a script to start Xvnc (pasted below).

Could it be as simple as "Xvnc" needing to be "on the path"? (Or in /usr/bin? Is it installed anywhere funny?) ... I guess your command line tests cover that.

I've also confirmed on my local test machine that there is no "sign" of "vnc" running unless the Hudson job is running. When a Hudson job *is* running, there is a clear process associated with "the job". And I in no way know if I have mine set up in "the best way", but will send you what that process line looks like in email (it's not *real* sensitive ... but ... I've no idea what a hacker might be able to take advantage of :) as well as my Hudson "configuration line".

I do notice just now that I do *not* have "Clean up before start" checked on my local test machine. Do not recall if there was a reason? But ... guess we could try that? (Nothing like "trial and error" debugging, eh? :)

= = = = = = =

Xvnc is the X VNC (Virtual Network Computing) server. It is based on a standard X server, but it has a "virtual" screen rather than a physical one. X applications display themselves on it as if it were a normal X display, but they can only be accessed via a VNC viewer - see vncviewer(1).

So Xvnc is really two servers in one. To the applications it is an X server, and to the remote VNC users it is a VNC server. By convention we have arranged that the VNC server display number will be the same as the X server display number, which means you can use eg. snoopy:2 to refer to display 2 on machine "snoopy" in both the X world and the VNC world.

The best way of starting Xvnc is via the vncserver script. This sets up the environment appropriately and runs some X applications to get you going. See the manual page for vncserver(1) for more information.
For what it's worth, I was able to get our performance tests running again by handling the "display" stuff myself (using xvfb from a shell script in the Hudson job). Since "we" are the only ones running there, and we only run one job at a time, I believe this will work for as long as we need it to.

My guess is that the configuration files for "plugins" are corrupt, and this problem with Xvnc can be addressed when we upgrade to a newer version of Hudson -- and at that time, I'd pretty much recommend to "start fresh", with a whole new install.

Naturally, others' advice welcome. (But, it's not "blocking" any longer, so will change back to "major".)
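The actual script used in the job is not attached to this bug, so the following is only a guess at the general shape of such a workaround. The helper name run_with_xvfb, the display number, the screen size and the one-second wait are all assumptions. Xvfb's -screen and -ac options are standard.

```shell
#!/bin/sh
# run_with_xvfb is a hypothetical helper: start Xvfb on the given display,
# run the remaining arguments as a command with DISPLAY pointing at it,
# then stop Xvfb and return the command's exit status.
run_with_xvfb() {
    display=$1
    shift
    Xvfb ":$display" -screen 0 1024x768x24 -ac >/dev/null 2>&1 &
    xvfb_pid=$!
    sleep 1    # crude wait for the server to come up (assumption)

    status=0
    DISPLAY=":$display" "$@" || status=$?

    kill "$xvfb_pid" 2>/dev/null || true
    wait "$xvfb_pid" 2>/dev/null || true
    return "$status"
}

# Example from a Hudson "Execute shell" step (comment only; the test script
# name is a placeholder):
#   run_with_xvfb 99 ./run-perf-tests.sh
```

A fixed display number is only safe under the condition stated above, i.e. one job at a time on a machine nobody else uses.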
FWIW, I think there is an issue with Xvnc plugin version 1.13-h-2, perhaps on "master only" installations?

See bug 450388.

At some point, we should "back level" the installed plugin to 1.13-h-1.

But ... since my jobs are working ... and since it would require Hudson to "restart itself" ... something this version/installation appears to have trouble doing? ... I won't try this myself, and will at least wait for webmasters to "be around" in January ... if not have them do it themselves, to see if the "dual instance" problem comes back.
(In reply to David Williams from comment #17)
> FWIW, I think there is an issue with Xvnc plugin version 1.13-h-2, perhaps
> on "master only" installations?
>
> See bug 450388.
>
> At some point, we should "back level" the installed plugin to 1.13-h-1.
>
> But .. since my jobs are working ... and since it would require Hudson to
> "restart itself" ... something this version/installation appears to have
> trouble doing? ... I won't try this myself, and will at least wait for
> webmasters to "be around" in January ... if not have them do themselves, to
> see if the "dual instance" problem comes back.

So Hudson plugins have been releasing with 2 versions for a while now, h-1 and h-2 versions. I believe h-2 versions only work in Hudson 3.1.0 or newer and h-1 versions only work in versions before that, such as 3.0.x. Unfortunately the version upgrading code used by Hudson 3.0.x often picks the incompatible h-2 version by default. So for plugins that have these 2 version releases we have to be careful and hand pick their upgrade paths, at least until Eclipse starts deploying a newer Hudson release.
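The h-1/h-2 rule as stated above ("I believe") can be written down as a small check to use when hand picking plugin versions. This is only a sketch of that stated rule, not anything Hudson provides. The function name is made up, and it assumes a version string of the form major.minor or major.minor.patch.

```shell
#!/bin/sh
# plugin_line_for is a hypothetical helper. Given a Hudson version, it prints
# which plugin release line the rule above says to use:
#   h-2 for Hudson 3.1.0 or newer, h-1 for anything older.
plugin_line_for() {
    version=$1
    major=${version%%.*}     # text before the first dot
    rest=${version#*.}       # text after the first dot
    minor=${rest%%.*}        # next component
    if [ "$major" -gt 3 ] || { [ "$major" -eq 3 ] && [ "$minor" -ge 1 ]; }; then
        echo "h-2"
    else
        echo "h-1"
    fi
}

# Example (comment only):
#   echo "use Xvnc plugin 1.13-$(plugin_line_for 3.0.1)"
```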
I agree with Thanh, our aging Hudson and HIPP deployments are becoming a liability, and we'll work on upgrades come the new year. Also, we have a release engineer starting Feb. 2. That will help accelerate the resolution of all the Hudson/CBI/build bugs.
The immediate problem was fixed long ago. I suspect even some of the "long term fixes" have been fixed or improved? Thanks for everyone's help.