Bug 455161 - Error in perftest Hudson config?
Status: RESOLVED FIXED
Alias: None
Product: Community
Classification: Eclipse Foundation
Component: CI-Jenkins
Version: unspecified
Hardware: PC Linux
Importance: P3 major
Target Milestone: ---
Assignee: CI Admin Inbox CLA
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: 454736 454921
Reported: 2014-12-14 12:59 EST by David Williams CLA
Modified: 2015-11-02 14:00 EST (History)
CC List: 3 users

See Also:


Attachments
main web page (13.61 KB, image/png)
2014-12-18 12:51 EST, David Williams CLA
page for a specific build. (22.45 KB, image/png)
2014-12-18 12:52 EST, David Williams CLA

Description David Williams CLA 2014-12-14 12:59:13 EST
Today I accessed the Hudson "performance" machine: 

https://hudson.eclipse.org/perftests/view/Eclipse%20and%20Equinox/

And, got this message: 

	
Error

Hudson detected that you appear to be running more than one instance of Hudson that share the same home directory '/home/hudson/hudsonbuild/.hudson'. This greatly confuses Hudson and you will likely experience strange behaviors, so please correct the situation.
This Hudson:	1087747993 contextPath="/perftests" at 32160@hudson-perf1master
Other Hudson:	134800844 contextPath="/perftests" at 11925@hudson-perf1master

Odd this would start now? 
I did, in the past week, "update plugins", and doing so requires a "Hudson restart" ... if you think that's related? 

There was an option to "ignore this problem and use Hudson anyway"
which, I will probably pick, for now, but would suspect it needs to be fixed and restarted. 

Marked as "critical", since I suspect the situation would result in "lost data".
Comment 1 David Williams CLA 2014-12-14 19:01:18 EST
In fact, I am wondering if this "same home directory" problem accounts for bug 454736 -- where the "locks" in Hudson did not seem to be working (allowing two jobs to run, each of which should have had to "get a lock" first). And, that did result in lost and "bad data". [I recall now, it was adding locks, that initially required this machine to be restarted.] 

Even before this previous week, I've sometimes had trouble using "cascading jobs", where I change a value in the "parent job" ... and it appears to "take" in the child when looked at immediately in the child, but then at runtime, the "child" behaves as though the value was not set. That's resulted in a lot of wasted time, and possibly "bad data", but I'm not positive. (The variable had to do with "cleaning the workspace" at the start of a test run, so if not "clean", not sure what the impact was, exactly, since most things were obviously being overwritten, but I am not sure all were.) 

And, not sure if this hurts anything, but 
uname -a 
gives one name to the machine, which matches what 
'hostname' gives, (when executed)
but 
echo $HOSTNAME 
gives "build"

Seems that environment variable is configured? Incorrectly? 
Not sure what trouble that would cause, if any, but seems like it could? 
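
A quick way to make the comparison above repeatable is sketched below. It only assumes a POSIX shell and `uname -n`; the helper names `names_match` and `report_hostname` are made up for illustration and are not part of Hudson.

```shell
#!/bin/sh
# names_match SYSTEM_NAME ENV_NAME  (hypothetical helper for this sketch)
# Succeeds only when ENV_NAME is non-empty and equal to SYSTEM_NAME.
names_match() {
    [ -n "$2" ] && [ "$1" = "$2" ]
}

# report_hostname  (hypothetical helper)
# Print one line comparing what the system reports (uname -n) with
# whatever the HOSTNAME environment variable currently holds.
report_hostname() {
    actual=$(uname -n)
    if names_match "$actual" "${HOSTNAME:-}"; then
        echo "OK: HOSTNAME matches $actual"
    else
        echo "MISMATCH: system reports '$actual', HOSTNAME is '${HOSTNAME:-}'"
    fi
}
```

Calling `report_hostname` both from a login shell and from a Hudson shell build step might show whether the "build" value comes from the account's profile or from the environment Hudson was started with.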

While I no longer get the message I did in comment 0, I hope the performance machine can be set up to be a bit more independent, and have its own "HUDSON_HOME", etc.


And, of course, for you to double check your start up scripts to make sure there are not two of them being started up.
Comment 2 David Williams CLA 2014-12-15 21:16:27 EST
And, while I'm thinking of it ... after seeing the confusing "hostname" variable ... I want to confirm ... this machine has its own dedicated disks, right? Not "shared" with "build" or any part of NFS, right? 

I thought to ask thinking of those times recently when the "build" machine seemed locked up (such as bug 454272) and it seemed much of that was due to a backlog of disk operations ... and, I'm just now thinking to say ... some of our performance tests are extremely disk intensive (they bog down my personal, non-shared Linux server! -- and it's not CPU or memory load ... it's disk operations). 

So, just wanted to confirm.
Comment 3 David Williams CLA 2014-12-18 12:50:59 EST
As further evidence the "performance test machine" is configured wrong, if you were to go look right now, the main screen shows one build running, and one in the queue. 

But, if you go to the specific job (that is in the queue) you'd see it is already running. 

I'll attach screen shots ... but ... I sure would like some response to this bug!
Comment 4 David Williams CLA 2014-12-18 12:51:36 EST
Created attachment 249534 [details]
main web page
Comment 5 David Williams CLA 2014-12-18 12:52:02 EST
Created attachment 249535 [details]
page for a specific build.
Comment 6 David Williams CLA 2014-12-18 14:56:35 EST
Changing this to "blocker". 

As you can see, if you look at the machine, one job is "frozen" in the queue, and the "hidden one" does not seem to be doing anything. 

Sounds like a clear case of "two instances" to me. 

If I don't hear anything soon, via this bug, I'll try canceling the hidden job (though I did try that once before) and, if I cannot cancel it, will force a restart.
Comment 7 Eclipse Webmaster CLA 2014-12-18 16:27:47 EST
It looks like your restart did manage to start a second job:

>ps -aef | grep hudsonbuild
11925 11920  0 Dec14 ?        00:09:19 /shared/common/jdk1.7.0-latest//bin/java -Xms1000m -Xmx2500m -XX:MaxPermSize=256m -XX:+HeapDumpOnOutOfMemoryError -Djava.io.tmpdir=/tmp/hudsonbuild -jar /home/hudson/hudsonbuild/hudson.war --httpPort=8443 --prefix=/perftests
32160 32155  0 Dec12 ?        00:52:13 /shared/common/jdk1.7.0-latest//bin/java -Xms1000m -Xmx2500m -XX:MaxPermSize=256m -XX:+HeapDumpOnOutOfMemoryError -Djava.io.tmpdir=/tmp/hudsonbuild -jar /home/hudson/hudsonbuild/hudson.war --httpPort=8443 --prefix=/perftests

I've killed them both and cleanly restarted hudson.

-M.
Comment 8 David Williams CLA 2014-12-18 18:23:49 EST
Thank you for fixing the immediate serious problem reported 4 days ago :/ 

But, it would be nice to understand how to configure Hudson so that it could restart itself as it is designed to do. But will change to "major" (lost function) to cover that part.
Comment 9 David Williams CLA 2014-12-18 18:27:24 EST
(In reply to David Williams from comment #8)
> Thank you for fixing the immediate serious problem reported 4 days ago :/ 
> 
> But, it would be nice to understand how to configure Hudson so that it could
> restart itself as it is designed to do. But will change to "major" (lost
> function) to cover that part.

I'll also note, it seems like the "weekly restart" could use some improvement to be sure to kill "all instances" before starting "the right" one. Or, at least as a "second step" perhaps 

pgrep -f /home/hudson/hudsonbuild/hudson.war
pkill -f /home/hudson/hudsonbuild/hudson.war
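
As a rough sketch of that "second step", the two commands above could be wrapped so the restart first stops every matching process and only then starts a single new one. The WAR path comes from the `ps` output in comment 7; the function names, the 30-second grace period, and the TERM-then-KILL policy are assumptions for illustration, not how the actual weekly restart is implemented.

```shell
#!/bin/sh
# Path taken from the ps output in comment 7; override via the environment.
WAR="${WAR:-/home/hudson/hudsonbuild/hudson.war}"

# stop_pids TIMEOUT PID...  (hypothetical helper)
# Send TERM to every PID, poll once per second for up to TIMEOUT seconds,
# then send KILL to whatever is still there.
stop_pids() {
    timeout=$1
    shift
    [ "$#" -gt 0 ] || return 0          # nothing to stop
    kill "$@" 2>/dev/null || true
    while [ "$timeout" -gt 0 ]; do
        alive=0
        for p in "$@"; do
            if kill -0 "$p" 2>/dev/null; then alive=1; fi
        done
        if [ "$alive" -eq 0 ]; then return 0; fi
        sleep 1
        timeout=$((timeout - 1))
    done
    kill -9 "$@" 2>/dev/null || true
}

# stop_all_hudson  (hypothetical wrapper)
# Find every process whose command line mentions the WAR and stop it.
stop_all_hudson() {
    # Unquoted on purpose: pgrep prints one PID per line and each one
    # must become a separate argument.
    stop_pids 30 $(pgrep -f "$WAR")
}
```

A restart script would call `stop_all_hudson` and only afterwards launch the single `java ... -jar "$WAR"` command line shown in comment 7, so that a leftover instance cannot end up sharing the same home directory.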
Comment 10 David Williams CLA 2014-12-19 00:27:43 EST
I don't think your "clean start" was too clean. 

Builds are now not able to execute XVNC: 

https://hudson.eclipse.org/perftests/view/Eclipse%20and%20Equinox/job/ep44MLR-perf-lin64/3/console

Started by user david_williams
FATAL: null
java.lang.NullPointerException
	at hudson.plugins.xvnc.Xvnc.doSetUp(Xvnc.java:83)
	at hudson.plugins.xvnc.Xvnc.setUp(Xvnc.java:73)
	at hudson.model.Build$RunnerImpl.doRun(Build.java:129)
	at hudson.model.AbstractBuild$AbstractRunner.run(AbstractBuild.java:524)
	at hudson.model.Run.run(Run.java:1450)
	at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:44)
	at hudson.model.ResourceController.execute(ResourceController.java:82)
	at hudson.model.Executor.run(Executor.java:137)

My guess is that some plugins need to be installed, into "this" instance. 
(IMHO, this wasn't due to me "restarting" .. I think it's been mis-configured for a while.) 

Assuming this is part of the "weekly restart" ... I bet on Sunday, there will be two instances running again.
Comment 11 Denis Roy CLA 2014-12-19 09:15:54 EST
> Builds are now not able to execute XVNC: 

I just restarted hudson and ran your build, and saw this in the console:

Started by user droy
$ pkill Xvnc
$ pkill Xrealvnc
$ sh -c "rm -f /tmp/.X*-lock /tmp/.X11-unix/X*"
FATAL: null
java.lang.NullPointerException
	at hudson.plugins.xvnc.Xvnc.doSetUp(Xvnc.java:83)
	at hudson.plugins.xvnc.Xvnc.setUp(Xvnc.java:73)
	at hudson.model.Build$RunnerImpl.doRun(Build.java:129)
	at hudson.model.AbstractBuild$AbstractRunner.run(AbstractBuild.java:524)
	at hudson.model.Run.run(Run.java:1450)
	at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:44)
	at hudson.model.ResourceController.execute(ResourceController.java:82)
	at hudson.model.Executor.run(Executor.java:137)



pkill Xvnc?
Comment 12 David Williams CLA 2014-12-19 11:32:32 EST
(In reply to Denis Roy from comment #11)
> > Builds are now not able to execute XVNC: 
> 
> I just restarted hudson and ran your build, and saw this in the console:
> 
> Started by user droy
> $ pkill Xvnc
> $ pkill Xrealvnc
> $ sh -c "rm -f /tmp/.X*-lock /tmp/.X11-unix/X*"
> FATAL: null
> java.lang.NullPointerException
> 	at hudson.plugins.xvnc.Xvnc.doSetUp(Xvnc.java:83)
> 	at hudson.plugins.xvnc.Xvnc.setUp(Xvnc.java:73)
> 	at hudson.model.Build$RunnerImpl.doRun(Build.java:129)
> 	at hudson.model.AbstractBuild$AbstractRunner.run(AbstractBuild.java:524)
> 	at hudson.model.Run.run(Run.java:1450)
> 	at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:44)
> 	at hudson.model.ResourceController.execute(ResourceController.java:82)
> 	at hudson.model.Executor.run(Executor.java:137)
> 
> 
> 
> pkill Xvnc?

I think those kills and removes are part of a standard "Clean up before start" option in the "Manage Hudson" configuration. And, suspect they are a red-herring in this case. But, just FYI, the "help" for that clean up option says: 

Try to clean up any stale locks or processes before running Xvnc for the first time in a given session on a given node. Any processes named Xvnc or Xrealvnc will be killed, and files /tmp/.X*-lock or /tmp/.X11-unix/X* deleted. 

So, I think that's relatively normal. 

As for the null pointer, 

I'd check to be sure there is a vncserver running. I think normally that is "independent" of Hudson (i.e. must be set up, and running, before Hudson can make use of the Xvnc plugin). 

And, (running or not) another thing to check is if "you" (hudsonbuild user) have a password set up, so it can automatically invoke the Xvnc command. On this machine (and, on shared instance too), that command is

Xvnc :$DISPLAY_NUMBER -geometry 1024x768 -depth 24 -ac

(where Hudson fills in the $DISPLAY_NUMBER, from its "pool"). 

So, as a normal "hudsonbuild user" I'd see if you can execute, say,  

Xvnc :50 -geometry 1024x768 -depth 24 -ac

from the command line. 

My guess is that'd simply tell you there is no vncserver running, so that would be the "next step" (perhaps "reinstall" it or see if anything in logs that prevented it from starting up?) 

Or ... it might say there is no password? 

It is possible, I would guess, that with two instances of Hudson running, some configuration file may have gotten corrupted. So, for example, if vncserver is running, and you can run Xvnc from the command line, then the next thing I'd try is to remove the Xvnc plugin. Restart Hudson. Then add the Xvnc plugin back in, and again restart Hudson. 

That's how I would approach debugging this issue.
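
The first checks described above (is Xvnc reachable at all, and does it stay up on a given display) could be scripted roughly as follows. This is only a sketch: the Xvnc options are the ones quoted earlier in this comment, while the helper names and the three-second wait are arbitrary choices for illustration. It would need to be run as the hudsonbuild user to be meaningful.

```shell
#!/bin/sh
# on_path NAME  (hypothetical helper)
# Succeed if NAME resolves to something runnable in the current shell.
on_path() {
    command -v "$1" >/dev/null 2>&1
}

# try_xvnc DISPLAY_NUMBER  (hypothetical helper)
# Start Xvnc with the options Hudson uses, wait briefly, and report
# whether the server is still running. The server is stopped afterwards.
try_xvnc() {
    if ! on_path Xvnc; then
        echo "Xvnc is not on PATH" >&2
        return 1
    fi
    Xvnc ":$1" -geometry 1024x768 -depth 24 -ac &
    xvnc_pid=$!
    sleep 3
    if kill -0 "$xvnc_pid" 2>/dev/null; then
        echo "Xvnc is running on display :$1"
        kill "$xvnc_pid" 2>/dev/null || true
        wait "$xvnc_pid" 2>/dev/null || true
        return 0
    fi
    echo "Xvnc exited early on display :$1" >&2
    return 1
}
```

For example, `try_xvnc 50` mirrors the manual test suggested above; any error text Xvnc prints before exiting is left on the terminal for inspection.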
Comment 13 Denis Roy CLA 2014-12-19 11:36:46 EST
If I launch Xvnc from the commandline I get this:

hudson-perf1master:~> Xvnc
19/12/2014 11:33:45 Xvnc version X.org/xf4vnc custom version
[snip]
19/12/2014 11:33:45 Protocol versions supported: 3.7, 3.3

Fatal server error:
Couldn't add screen



If I launch it with a display number I get a running Xvnc

@hudson-perf1master:~> Xvnc :1000
19/12/2014 11:33:41 Xvnc version X.org/xf4vnc custom version
(EE) config/hal: NewInputDeviceRequest failed (2)
^C




outage4@hudson-perf1master:~> vncserver 

You will require a password to access your desktops.

Password: 




Perhaps the Hudson environment is missing a display number.  I'll look at the HIPPs to see what we do there.
Comment 14 Denis Roy CLA 2014-12-19 12:01:30 EST
(In reply to David Williams from comment #12)
> So, as a normal "hudsonbuild user" I'd see if you can execute, say,  
> 
> Xvnc :50 -geometry 1024x768 -depth 24 -ac
> 
> from the command line. 

As the hudsonbuild user, that works just fine.

I compared the perf hudson config (https://hudson.eclipse.org/perftests/configure) with the CBI hipp, and the config looks OK -- both have a minimum and a maximum range.  Since Perf is the only hudson running on the server, there's no need to worry about display overlap.

 
> My guess is that'd simply tell you there is no vncserver running, so that
> would be the "next step"

I don't see a vncserver running on the HIPP machines.
Comment 15 David Williams CLA 2014-12-19 13:11:02 EST
(In reply to Denis Roy from comment #14)

>  
> > My guess is that'd simply tell you there is no vncserver running, so that
> > would be the "next step"
> 
> I don't see a vncserver running on the HIPP machines.

Yes, my mistake. I've read "man Xvnc" now, and it explains that "vncserver" is just a script to start Xvnc. (pasted below). 

Could it be as simple as "Xvnc" needing to be "on the path"? (or, in /usr/bin? is it installed anywhere funny?) ... I guess your command line tests cover that. 

I've also confirmed on my local test machine, there is no "sign" of "vnc" running unless the Hudson job is running. When a Hudson job *is* running, there is a clear process associated with "the job". And I in no way know if I have mine set up in "the best way", but will send you what that process line looks like in email (it's not *real* sensitive .. but .. I've no idea what a hacker might be able to take advantage of :) as well as my Hudson "configuration line". 


I do notice just now, that I do *not* have "Clean up before start" checked on my local test machine. Do not recall if there was a reason? But ... guess we could try that? (Nothing like "trial and error" debugging, eh? :) 

 


= = = = = = = 

       Xvnc is the X VNC (Virtual Network Computing) server. It is based on a standard X server, but it has a "virtual" screen
       rather than a physical one. X applications display themselves on it as if it were a normal X display, but they can only
       be accessed via a VNC viewer - see vncviewer(1).

       So Xvnc is really two servers in one. To the applications it is an X server, and to the remote VNC users it is a VNC
       server. By convention we have arranged that the VNC server display number will be the same as the X server display
       number, which means you can use eg. snoopy:2 to refer to display 2 on machine "snoopy" in both the X world and the VNC
       world.

       The best way of starting Xvnc is via the vncserver script. This sets up the environment appropriately and runs some X
       applications to get you going. See the manual page for vncserver(1) for more information.
Comment 16 David Williams CLA 2014-12-21 17:21:28 EST
For what it's worth, I was able to get our performance tests running again, by handling the "display" stuff myself (using xvfb from a shell script in the Hudson job). Since "we" are the only ones running there, and we only run one job at a time, I believe this will work for as long as we need it to. 

My guess is that the configuration files for "plugins" are corrupt, and this problem with Xvnc can be addressed when we upgrade to a newer version of Hudson -- and at that time, I'd pretty much recommend to "start fresh". With a whole new install. 

Naturally, others' advice is welcome. (But, it's not "blocking" any longer, so will change back to "major").
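
For anyone wanting to do something similar, a workaround of this kind might look like the sketch below. It is not the actual script used in the job: the function name, the two-second start-up wait, and the optional `XVFB_LOG` variable are assumptions, and the screen size simply mirrors the 1024x768, depth 24 settings used for Xvnc above. Since it uses one fixed display number, it only suits the one-job-at-a-time situation described in this comment.

```shell
#!/bin/sh
# run_with_xvfb DISPLAY_NUMBER COMMAND [ARG...]  (hypothetical helper)
# Start a private Xvfb, run COMMAND with DISPLAY pointing at it, stop the
# server again, and return COMMAND's exit status.
run_with_xvfb() {
    display=":$1"
    shift
    # XVFB_LOG is an assumed, optional variable naming a log file.
    Xvfb "$display" -screen 0 1024x768x24 -ac >"${XVFB_LOG:-/dev/null}" 2>&1 &
    xvfb_pid=$!
    sleep 2                      # crude: give the server time to come up
    status=0
    DISPLAY="$display" "$@" || status=$?
    kill "$xvfb_pid" 2>/dev/null || true
    wait "$xvfb_pid" 2>/dev/null || true
    return "$status"
}
```

In a Hudson "Execute shell" step this could be used as `run_with_xvfb 50 ./run-perf-tests.sh`, where the script name is only a placeholder for whatever launches the tests.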
Comment 17 David Williams CLA 2014-12-23 17:11:12 EST
FWIW, I think there is an issue with Xvnc plugin version 1.13-h-2, perhaps on "master only" installations? 

See bug 450388. 

At some point, we should "back level" the installed plugin to 1.13-h-1.

But .. since my jobs are working ... and since it would require Hudson to "restart itself" ... something this version/installation appears to have trouble doing? ... I won't try this myself, and will at least wait for webmasters to "be around" in January ... if not have them do themselves, to see if the "dual instance" problem comes back.
Comment 18 Thanh Ha CLA 2014-12-24 01:23:07 EST
(In reply to David Williams from comment #17)
> FWIW, I think there is an issue with Xvnc plugin version 1.13-h-2, perhaps
> on "master only" installations? 
> 
> See bug 450388. 
> 
> At some point, we should "back level" the installed plugin to 1.13-h-1.
> 
> But .. since my jobs are working ... and since it would require Hudson to
> "restart itself" ... something this version/installation appears to have
> trouble doing? ... I won't try this myself, and will at least wait for
> webmasters to "be around" in January ... if not have them do themselves, to
> see if the "dual instance" problem comes back.

So Hudson plugins have been releasing with 2 versions for a while now, h-1 and h-2 versions. I believe h-2 versions only work in Hudson 3.1.0 or newer and h-1 versions only work in versions before that, such as 3.0.x.

Unfortunately the version upgrading code used by Hudson 3.0.x often picks the incompatible h-2 version by default. So for plugins that have these 2 version releases we have to be careful and hand pick their upgrade paths, at least until Eclipse starts deploying a newer Hudson release.
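
To make the hand-picking less error prone, the rule of thumb above could be written down as a small check. This is only a sketch of that rule as stated here (h-2 for Hudson 3.1.0 or newer, h-1 before that), which is itself a belief rather than a documented guarantee. The function names are invented for illustration, and the version comparison assumes a `sort` that supports `-V` (GNU coreutils does).

```shell
#!/bin/sh
# version_ge A B  (hypothetical helper)
# Succeed if dotted version A is greater than or equal to B.
# Relies on "sort -V" (version sort), available in GNU coreutils.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# plugin_suffix HUDSON_VERSION  (hypothetical helper)
# Print which plugin flavour to choose under the rule of thumb above.
plugin_suffix() {
    if version_ge "$1" 3.1.0; then
        echo h-2
    else
        echo h-1
    fi
}
```

For a 3.0.x installation `plugin_suffix` prints `h-1`, which matches the suggestion in comment 17 to "back level" the Xvnc plugin from 1.13-h-2 to 1.13-h-1.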
Comment 19 Denis Roy CLA 2014-12-24 09:19:16 EST
I agree with Thanh, our aging Hudson and HIPP deployments are becoming a liability, and we'll work on upgrades come the new year.  Also, we have a release engineer starting Feb. 2.  That will help accelerate the resolution of all the Hudson/CBI/build bugs.
Comment 20 David Williams CLA 2015-11-02 14:00:07 EST
The immediate problem was fixed long ago. I suspect even some of the "long term fixes" have been fixed or improved? 

Thanks for everyone's help.