Bug 377453 - Hudson is unusably sluggish
Summary: Hudson is unusably sluggish
Status: RESOLVED FIXED
Alias: None
Product: Community
Classification: Eclipse Foundation
Component: CI-Jenkins
Version: unspecified
Hardware: PC Linux
Importance: P3 enhancement
Target Milestone: ---
Assignee: Nobody - feel free to take it
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: 377365
Reported: 2012-04-23 16:50 EDT by David Williams
Modified: 2012-09-26 14:22 EDT
CC List: 3 users

See Also:


Description David Williams 2012-04-23 16:50:16 EDT
Ever since I closed bug 377344 as "all is OK now", I've seen Hudson sort of come and go, oscillating between extremely slow and downright unresponsive (e.g., 3 minutes to get from one click to the next, which is pretty unusable if you have 3 or 4 steps to do).
Comment 1 Denis Roy 2012-04-25 08:58:09 EDT
I'm investigating this.
Comment 2 Denis Roy 2012-04-25 10:54:17 EDT
One problem I'm seeing is that the tycho-its job is somehow spawning tons of java processes, each consuming a lot of resources:

10349 hudsonbu  20   0  958m 105m 9168 S   64  1.0   0:05.17 /opt/public/common/sun-jdk1.6.0_21_x64/jre/bin/java -Xmx512m -XX:MaxPermSize=256m -classpath /opt/users/hudsonbuild/workspace/tycho-its-linux-nigh
 9991 hudsonbu  20   0 1031m 181m   9m S   39  1.8   0:11.33 /opt/public/common/sun-jdk1.6.0_21_x64/jre/bin/java -Xmx512m -XX:MaxPermSize=256m -classpath /opt/users/hudsonbuild/workspace/tycho-its-linux-nigh
 9712 hudsonbu  20   0  968m 203m 9832 S   26  2.0   0:14.48 /opt/public/common/sun-jdk1.6.0_21_x64/jre/bin/java -Xmx512m -XX:MaxPermSize=256m -classpath /opt/users/hudsonbuild/workspace/tycho-its-linux-nigh
 9874 hudsonbu  20   0  969m 195m 9832 S   15  1.9   0:13.62 /opt/public/common/sun-jdk1.6.0_21_x64/jre/bin/java -Xmx512m -XX:MaxPermSize=256m -classpath /opt/users/hudsonbuild/workspace/tycho-its-linux-nigh
 9825 hudsonbu  20   0  973m 197m   9m S   13  1.9   0:13.32 /opt/public/common/sun-jdk1.6.0_21_x64/jre/bin/java -Xmx512m -XX:MaxPermSize=256m -classpath /opt/users/hudsonbuild/workspace/tycho-its-linux-nigh
 9091 hudsonbu  20   0  973m 251m  10m S   12  2.5   0:16.10 /opt/public/common/sun-jdk1.6.0_21_x64/jre/bin/java -Xmx512m -XX:MaxPermSize=256m -classpath /opt/users/hudsonbuild/workspace/tycho-its-linux-nigh
Comment 3 David Williams 2012-04-27 11:11:20 EDT
This has seemed fixed since your comment 2, so I assume you killed all those, and they haven't come back?

Feel free to reopen if there is more work to do here to track down the root problem, but I'm closing this as "fixed" to signify that I no longer see any "unusable" sluggishness. It's not exactly snappy :) but it never has been.

Thanks for attending to it.
Comment 4 Denis Roy 2012-04-27 11:53:10 EDT
Actually, I will reopen this.

FWIW, I ran OS updates on all the Hudsons and completely rebooted the master and slave6. I think that did it a lot of good.

I think we should -- and I hate to say this, because the concept is so foreign to us -- schedule regular reboots of the Hudson master & slaves (perhaps on the weekends). At the very least, stop the Hudson service, killall the java processes that could be dangling, and restart it.
Comment 5 David Williams 2012-04-27 12:03:24 EDT
(In reply to comment #4)
> Actually, I will reopen this.
> 
> FWIW, I ran OS updates on all the Hudsons and completely rebooted the master
> and slave6.  I think that did it a lot of good.
> 
> I think we should -- and I hate to say this, because the concept is so foreign
> to us -- schedule regular reboots of the Hudson master & slaves (perhaps on the
> weekends).  At the very least, stop the Hudson service, killall the java
> processes that could be dangling, and restart it.

As soon as I closed it, it started to have a few sluggish spots again. :)

> ... hate to say this, because the concept is so foreign to us

And I hate to admit it, but I wouldn't complain, under the circumstances :) [I recall we used to do that in the early days of using CruiseControl; I'm sure glad that stabilized over the years.] Perhaps you could list/track the dangling java processes? See if there's a fixable pattern? Review it with your "Hudson experts contact list"?
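One way to act on the "list/track dangling java processes" idea is to group them by the Hudson job whose workspace appears on their command line, which is how the tycho-its processes stood out in comment 2. A minimal sketch, assuming input lines in the `PID ARGS` shape printed by `ps -eo pid,args` and the `/workspace/<job>/` layout visible in the top output; `java_processes_by_job` is a hypothetical helper, not an existing Hudson tool.

```python
import re
from collections import Counter

# Job directory that follows ".../workspace/" on a command line. The layout
# /opt/users/hudsonbuild/workspace/<job>/... comes from the top output above;
# treat it as an assumption for other installations.
WORKSPACE_RE = re.compile(r"/workspace/([^/\s:]+)")


def java_processes_by_job(ps_lines):
    """Count java processes per Hudson workspace (hypothetical helper).

    Each line is expected to look like "PID ARGS...", e.g. the output of
    `ps -eo pid,args`; the header line and non-java processes are skipped.
    """
    counts = Counter()
    for line in ps_lines:
        parts = line.split(None, 1)
        if len(parts) < 2:
            continue
        args = parts[1]
        command = args.split(None, 1)[0]
        # Only count processes whose executable is a java binary.
        if command != "java" and not command.endswith("/java"):
            continue
        match = WORKSPACE_RE.search(args)
        counts[match.group(1) if match else "<no workspace>"] += 1
    return counts
```

On a live host you could feed it `subprocess.run(["ps", "-eo", "pid,args"], capture_output=True, text=True).stdout.splitlines()` and print `.most_common()`; logging that periodically would show whether one job keeps leaving processes behind.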
Comment 6 Denis Roy 2012-04-27 14:15:26 EDT
(In reply to comment #4)
> schedule regular reboots of the Hudson master & slaves

I'll add this to Matt's bucket  :-D
Comment 7 Eclipse Webmaster 2012-05-11 14:22:04 EDT
I've crafted a script to restart Hudson, and I've set it to run at 3:30am on Sundays.

-M.
Comment 8 David Williams 2012-05-11 14:40:41 EDT
(In reply to comment #7)
> I've crafted a script to restart Hudson, and I've set it to run at 3:30am on
> Sundays.
> 
> -M.

Will it "hard restart" or ... wait for current jobs to finish? I don't particularly care, as long as everyone knows what to expect. I guess the deluxe solution would be to start at 3:30: if jobs are running, set the "shutdown" flag so no new ones start, wait for those running to finish, but if they haven't finished by 4:30, go ahead and restart anyway. (I know we in Platform currently have some "unit test jobs" that run for 10 or 15 hours (overnight), and while it's unfortunate to "lose" them, I do not think you (or anyone) should be held up for THAT long waiting for Hudson to restart.) [And, FYI, we have a work item to "break the tests up into smaller chunks" ... not sure when we'll be there ... but this might give us more motivation :)]
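The "deluxe" restart described in comment 8 is a drain-with-deadline loop. The sketch below is hypothetical and is not the script the webmaster actually wrote (that one does a hard restart). `set_shutdown_flag`, `running_job_count`, and `restart` are placeholder hooks you would have to wire to Hudson yourself, for example via its "Prepare for Shutdown" mode. The one-hour default matches the 3:30 to 4:30 window suggested in the comment.

```python
import time
from typing import Callable


def drain_then_restart(
    set_shutdown_flag: Callable[[], None],
    running_job_count: Callable[[], int],
    restart: Callable[[], None],
    deadline_seconds: float = 60 * 60.0,  # start at 3:30, give up at 4:30
    poll_seconds: float = 60.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Stop new jobs, wait for running ones, then restart (hypothetical hooks).

    Returns True if every job finished before the deadline and False if the
    restart was forced while jobs were still running.
    """
    set_shutdown_flag()  # no new jobs start from here on
    start = clock()
    while running_job_count() > 0:
        if clock() - start >= deadline_seconds:
            restart()  # deadline reached: still-running jobs are lost
            return False
        sleep(poll_seconds)
    restart()
    return True
```

Injecting `clock` and `sleep` keeps the loop testable without waiting an hour; in production the `time` defaults apply.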
Comment 9 Eclipse Webmaster 2012-05-11 14:52:25 EDT
It's a 'hard' restart. Since part of the issue (for the slaves) seems to be 'java processes that won't die', the script gives Hudson a chance to shut down 'nicely', waits for a couple of minutes, and then invokes killproc on all outstanding java processes.

The only way I could think of to use the web interface for the 'really nice shutdown' was to create a specific user with full admin access (or give the script the webmasters' credentials), and that seemed like a bad idea.

-M.
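The stop-nicely, wait, then kill sequence in comment 9 is the usual SIGTERM-then-SIGKILL pattern. This is a minimal Python sketch under stated assumptions, not the actual webmaster script (which used `killproc`). `terminate_then_kill` and its parameters are hypothetical, and finding the pids of the leftover java processes is left to the caller. The two-minute default mirrors "a couple of minutes".

```python
import os
import signal
import time
from typing import Callable, Iterable, List


def _pid_alive(pid: int) -> bool:
    """True if a process with this pid exists (signal 0 only checks)."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, but owned by another user
    return True


def terminate_then_kill(
    pids: Iterable[int],
    grace_seconds: float = 120.0,  # "waits for a couple of minutes"
    poll_seconds: float = 1.0,
    is_alive: Callable[[int], bool] = _pid_alive,
) -> List[int]:
    """SIGTERM every pid, wait up to grace_seconds, then SIGKILL survivors.

    Returns the pids that had to be force-killed.
    """
    pids = list(pids)
    for pid in pids:
        try:
            os.kill(pid, signal.SIGTERM)  # ask nicely first
        except ProcessLookupError:
            pass  # already gone
    deadline = time.monotonic() + grace_seconds
    while time.monotonic() < deadline and any(is_alive(p) for p in pids):
        time.sleep(poll_seconds)
    survivors = [p for p in pids if is_alive(p)]
    for pid in survivors:
        try:
            os.kill(pid, signal.SIGKILL)  # the processes "that won't die"
        except ProcessLookupError:
            pass
    return survivors
```

The default liveness check uses `os.kill(pid, 0)`, which suits processes the script did not start. If the targets are the script's own children, pass a check based on `Popen.poll()` instead, because an unreaped child still counts as existing.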
Comment 10 Denis Roy 2012-09-26 14:22:50 EDT
I think for the most part this was resolved with the work Matt did over the summer.