Ever since I closed bug 377344 as "all is ok now", I've seen Hudson come and go, oscillating between extremely slow and downright unresponsive (e.g. 3 minutes to go from one click to the next ... pretty unusable if you have 3 or 4 steps to do).
I'm investigating this.
One problem I'm seeing is that the tycho-its job is somehow spawning tons of java processes, each consuming tons of resources.

10349 hudsonbu 20 0 958m 105m 9168 S 64 1.0 0:05.17 /opt/public/common/sun-jdk1.6.0_21_x64/jre/bin/java -Xmx512m -XX:MaxPermSize=256m -classpath /opt/users/hudsonbuild/workspace/tycho-its-linux-nigh
9991 hudsonbu 20 0 1031m 181m 9m S 39 1.8 0:11.33 /opt/public/common/sun-jdk1.6.0_21_x64/jre/bin/java -Xmx512m -XX:MaxPermSize=256m -classpath /opt/users/hudsonbuild/workspace/tycho-its-linux-nigh
9712 hudsonbu 20 0 968m 203m 9832 S 26 2.0 0:14.48 /opt/public/common/sun-jdk1.6.0_21_x64/jre/bin/java -Xmx512m -XX:MaxPermSize=256m -classpath /opt/users/hudsonbuild/workspace/tycho-its-linux-nigh
9874 hudsonbu 20 0 969m 195m 9832 S 15 1.9 0:13.62 /opt/public/common/sun-jdk1.6.0_21_x64/jre/bin/java -Xmx512m -XX:MaxPermSize=256m -classpath /opt/users/hudsonbuild/workspace/tycho-its-linux-nigh
9825 hudsonbu 20 0 973m 197m 9m S 13 1.9 0:13.32 /opt/public/common/sun-jdk1.6.0_21_x64/jre/bin/java -Xmx512m -XX:MaxPermSize=256m -classpath /opt/users/hudsonbuild/workspace/tycho-its-linux-nigh
9091 hudsonbu 20 0 973m 251m 10m S 12 2.5 0:16.10 /opt/public/common/sun-jdk1.6.0_21_x64/jre/bin/java -Xmx512m -XX:MaxPermSize=256m -classpath /opt/users/hudsonbuild/workspace/tycho-its-linux-nigh
This has seemed fixed since your comment 2 ... so, I'm assuming you killed all those? And they haven't come back? Feel free to re-open if there is more work to do here to track down the root problem, but I'm closing as "fixed" to signify I no longer see any "unusable" sluggishness ... not exactly snappy :) but it never has been. Thanks for attending to it.
Actually, I will reopen this.

FWIW, I ran OS updates on all the Hudsons and completely rebooted the master and slave6. I think that did it a lot of good.

I think we should -- and I hate to say this, because the concept is so foreign to us -- schedule regular reboots of the Hudson master & slaves (perhaps on the weekends). At the very least, stop the Hudson service, killall the java processes that could be dangling, and restart it.
(In reply to comment #4)
> Actually, I will reopen this.
>
> FWIW, I ran OS updates on all the Hudsons and completely rebooted the master
> and slave6. I think that did it a lot of good.
>
> I think we should -- and I hate to say this, because the concept is so foreign
> to us -- schedule regular reboots of the Hudson master & slaves (perhaps on the
> weekends). At the very least, stop the Hudson service, killall the java
> processes that could be dangling, and restart it.

As soon as I closed it, I started to have a few sluggish spots again. :)

> ... hate to say this, because the concept is so foreign to us

And I hate to admit, I wouldn't complain, under the circumstances :)
[I recall we used to do that in the early days of using CruiseControl ... sure glad that stabilized over the years.]

Perhaps we could list/track dangling java processes? See if there's a fixable pattern? Review with your "hudson experts contact list"?
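To make the "list/track dangling java processes" idea concrete, here is a rough Python sketch that parses lines in the same column layout as the top output pasted earlier in this bug and counts java processes per Hudson workspace. The function names (`parse_top_line`, `workspace_of`, `count_java_by_workspace`) are made up for illustration, and the column positions are assumed to match that paste. A single snapshot cannot say whether a process is actually dangling; comparing counts over time would be needed for that.

```python
from collections import Counter


def parse_top_line(line):
    """Parse one process row: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND."""
    parts = line.split(None, 11)
    if len(parts) < 12 or not parts[0].isdigit():
        return None  # header, blank, or otherwise unrecognised line
    try:
        cpu, mem = float(parts[8]), float(parts[9])
    except ValueError:
        return None
    return {
        "pid": int(parts[0]),
        "user": parts[1],
        "res": parts[5],
        "cpu": cpu,
        "mem": mem,
        "time": parts[10],
        "command": parts[11],
    }


def workspace_of(command):
    """Return the path component after '/workspace/', or None if absent."""
    marker = "/workspace/"
    idx = command.find(marker)
    if idx == -1:
        return None
    tokens = command[idx + len(marker):].split("/", 1)[0].split()
    return tokens[0] if tokens else None


def count_java_by_workspace(lines):
    """Count java processes per workspace name across the given rows."""
    counts = Counter()
    for line in lines:
        proc = parse_top_line(line)
        if proc is None:
            continue
        executable = proc["command"].split()[0]
        if executable != "java" and not executable.endswith("/java"):
            continue
        counts[workspace_of(proc["command"]) or "(unknown)"] += 1
    return counts
```

Logging these counts at intervals, and flagging workspaces whose count stays high while no job is running, would be one way to look for a fixable pattern.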
(In reply to comment #4)
> schedule regular reboots of the Hudson master & slaves

I'll add this to Matt's bucket :-D
I've crafted a script to restart Hudson, and I've set it to run at 3:30am on Sundays. -M.
(In reply to comment #7)
> I've crafted a script to restart Hudson, and I've set it to run at 3:30am on
> Sundays.
>
> -M.

Will it "hard restart" or ... wait for current jobs to finish? I don't particularly care, as long as everyone knows what to expect.

I guess the deluxe solution would be to start at 3:30 and, if jobs are running, set the "shutdown" flag so no new ones start, then wait for those running to finish, but if they are not finished by 4:30, go ahead and restart. (I know we in Platform currently have some "unit test jobs" that run for 10 or 15 hours (overnight), and while it is unfortunate to "lose" them, I do not think you (or anyone) should be held up for THAT long waiting for Hudson to restart.)

[And, FYI, we have a work item to "break the tests up into smaller chunks" ... not sure when we'll be there ... but this might give us more motivation :)]
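The "deluxe" flow described above could be sketched roughly as follows in Python. This is only an illustration under assumptions: `quiet_down`, `busy_count`, and `restart` are hypothetical callables that the caller would supply (for example, wrappers around whatever mechanism sets the shutdown flag, reports running jobs, and restarts the service), and the one-hour limit stands in for the 3:30 to 4:30 window.

```python
import time


def drain_then_restart(quiet_down, busy_count, restart,
                       max_wait_seconds=3600.0, poll_seconds=30.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Stop new jobs, wait for running ones up to a deadline, then restart.

    quiet_down, busy_count and restart are hypothetical hooks supplied by
    the caller. Returns the number of jobs still running at restart time.
    """
    quiet_down()  # set the "shutdown" flag so no new jobs start
    deadline = clock() + max_wait_seconds
    while busy_count() > 0 and clock() < deadline:
        sleep(poll_seconds)
    abandoned = busy_count()  # jobs that will be lost by restarting now
    restart()
    return abandoned
```

With this shape, a short job finishing at 3:40 lets the restart happen right away, while a 15-hour test run is cut off at the deadline and reported in the return value.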
It's a 'hard' restart. Since part of the issue (for the slaves) seems to be 'java processes that won't die', the script gives Hudson a chance to shut down 'nicely', waits for a couple of minutes, and then invokes killproc on all outstanding java processes.

The only way I could think of to use the web interface to do the 'really nice shutdown' was to create a specific user with full admin access (or give the script webmaster's creds), and that seemed like a bad idea.

-M.
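For readers who want to see the general "ask nicely, wait, then kill" pattern, here is a minimal Python sketch. It is not Matt's script, which is not shown in this bug: it assumes a POSIX system, uses SIGTERM as the 'nice' request and SIGKILL in place of killproc, and the function names are made up for illustration.

```python
import os
import signal
import time


def pid_alive(pid):
    """Return True if a process with this PID exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 only checks existence, sends nothing
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists but belongs to another user
    return True


def _send_quietly(send, pid, sig):
    """Send a signal, ignoring processes that exited in the meantime."""
    try:
        send(pid, sig)
    except ProcessLookupError:
        pass


def stop_then_kill(pids, grace_seconds=120.0, poll_seconds=1.0,
                   send=os.kill, alive=pid_alive,
                   sleep=time.sleep, clock=time.monotonic):
    """SIGTERM each PID, wait up to grace_seconds, SIGKILL the survivors.

    Returns the list of PIDs that had to be force-killed.
    """
    for pid in pids:
        if alive(pid):
            _send_quietly(send, pid, signal.SIGTERM)

    deadline = clock() + grace_seconds
    remaining = [p for p in pids if alive(p)]
    while remaining and clock() < deadline:
        sleep(poll_seconds)
        remaining = [p for p in remaining if alive(p)]

    for pid in remaining:
        _send_quietly(send, pid, signal.SIGKILL)
    return remaining
```

The returned list is worth logging: PIDs that regularly need the forced kill are the 'java processes that won't die' and point at which jobs to investigate.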
I think for the most part this was resolved with the work Matt did over the summer.