Bug 367772

Summary: Restructure slave1 resources.
Product: Community
Component: CI-Jenkins
Version: unspecified
Target Milestone: ---
Hardware: PC
OS: Linux
Status: RESOLVED FIXED
Severity: normal
Priority: P3
Reporter: Eclipse Webmaster <webmaster>
Assignee: Eclipse Webmaster <webmaster>
QA Contact:
CC: glyn.normington, kim.moir, matthias.sohn, pwebster, sbouchet
Whiteboard:
Bug Depends on:
Bug Blocks: 367238

Description Eclipse Webmaster CLA 2012-01-03 11:03:45 EST

    
Comment 1 Eclipse Webmaster CLA 2012-01-03 11:07:13 EST
I'd like to take the slave offline and create several smaller slaves from the recovered resources.

This seems to be consistent both with how the ASF appears to handle its slaves and with what the JBoss folks do (based on a conversation with them).

-M.
Comment 2 Bouchet Stéphane CLA 2012-01-04 09:40:49 EST
+1 to having slave2 and slave3 back :D

But be careful to have a common shared filesystem between all of them, to avoid promotion problems and excessive disk usage.
Comment 3 Matthias Sohn CLA 2012-01-04 17:17:15 EST
+1
Comment 4 Glyn Normington CLA 2012-01-05 03:21:54 EST
If this will increase throughput and reduce the probability of port clashes between jobs in unrelated projects, then it sounds like a good thing.

However, we in Virgo are sensitive to any change in Hudson because we try to keep our CI builds clean at all times (and review their status daily), and we have configured our jobs carefully to achieve that. So we would not be in favour if this restructuring was likely to lead to any corruption or loss of job configuration.
Comment 5 Eclipse Webmaster CLA 2012-01-05 10:44:06 EST
(In reply to comment #2)
> But be careful to have a common shared filesystem between all of them, to
> avoid promotion problems and excessive disk usage.

Well, the idea is to 'shrink' slave1 to about the same size as Fastlane (about 1/3 its size), then clone it twice, bring those new instances up, and add them in. So basically we'd end up with three slave1s (any job that can currently run on slave1 should be able to run on any of the new nodes).

(In reply to comment #4)
> so we would not be in
> favour if this restructuring was likely to lead to any corruption or loss of
> job configuration.

Well, the job data is all stored on the master Hudson instance (and on an NFS mount), so I don't expect this will have any impact on your configuration.
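
To illustrate why this shouldn't touch job configuration: each job's config.xml lives under the master's home directory (HUDSON_HOME/jobs/<job>/config.xml), not on any slave, so a before/after snapshot of those files is enough to confirm nothing changed. A rough sketch -- the /shared/hudson path is only an assumed example of where the master's home might be, not our actual layout:

    #!/usr/bin/env python
    # Sketch: snapshot the job configs stored under the master's home directory.
    # Compare the output before and after the slave work to confirm nothing moved.
    # HUDSON_HOME below is an assumed path, not the real location on our master.
    import glob
    import hashlib
    import os

    HUDSON_HOME = "/shared/hudson"  # assumption

    def snapshot_job_configs(home):
        """Return {job name: md5 of config.xml} for every job on the master."""
        configs = {}
        for path in glob.glob(os.path.join(home, "jobs", "*", "config.xml")):
            job = os.path.basename(os.path.dirname(path))
            with open(path, "rb") as f:
                configs[job] = hashlib.md5(f.read()).hexdigest()
        return configs

    if __name__ == "__main__":
        for job, digest in sorted(snapshot_job_configs(HUDSON_HOME).items()):
            print("%s  %s" % (digest, job))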

-M.
Comment 6 Denis Roy CLA 2012-01-05 10:51:28 EST
(In reply to comment #5)
> (In reply to comment #2)
> > But be careful to have a common shared filesystem between all of them, to
> > avoid promotion problems and excessive disk usage.
> 
> Well, the idea is to 'shrink' slave1 to about the same size as Fastlane (about
> 1/3 its size),


When we used to have slave1, slave2 and slave3 (each with 50G of local space), we'd constantly run out, so we decided to aggregate all the slaves into one big slave with lots of local space.

Unfortunately, we just don't have the disk space to have multiple slaves with 300+ GB each, so if we return to the smaller-slaves scenario, projects will either have to keep their workspaces clean, or we'll have to live with out-of-disk-space errors. Having /shared accessible on each slave is not enough -- people need to move their build artifacts from the workspace onto /shared, and we cannot control that.
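
The discipline this asks of projects amounts to a post-build step along these lines. This is a sketch only: WORKSPACE and BUILD_ID are the standard Hudson build environment variables, but the /shared target directory and the "target" artifact directory are hypothetical examples, not a convention we have defined:

    #!/usr/bin/env python
    # Sketch of a post-build step: copy the build output out of the workspace
    # onto /shared, then delete it from the workspace so the slave's local disk
    # stays small.  Paths below are hypothetical examples.
    import os
    import shutil

    WORKSPACE = os.environ.get("WORKSPACE", ".")    # set by Hudson for each build
    BUILD_ID = os.environ.get("BUILD_ID", "local")  # set by Hudson for each build
    SHARED = "/shared/myproject/builds"             # hypothetical target directory

    def publish_and_clean(artifact_dir="target"):
        """Move the artifact directory from the workspace onto /shared."""
        src = os.path.join(WORKSPACE, artifact_dir)
        dest = os.path.join(SHARED, BUILD_ID)
        shutil.copytree(src, dest)  # publish to the shared filesystem
        shutil.rmtree(src)          # free the slave's local disk
        print("published %s -> %s" % (src, dest))

    if __name__ == "__main__":
        publish_and_clean()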
Comment 7 Bouchet Stéphane CLA 2012-01-06 04:01:00 EST
(In reply to comment #6)
> (In reply to comment #5)
> > (In reply to comment #2)
> > > But be careful to have a common shared filesystem between all of them, to
> > > avoid promotion problems and excessive disk usage.
> > 
> > Well, the idea is to 'shrink' slave1 to about the same size as Fastlane (about
> > 1/3 its size),
> 
> 
> When we used to have slave1, slave2 and slave3 (each with 50G of local space),
> we'd constantly run out, so we decided to aggregate all the slaves into one
> big slave with lots of local space.
> 
> Unfortunately, we just don't have the disk space to have multiple slaves with
> 300+ GB each, so if we return to the smaller-slaves scenario, projects will
> either have to keep their workspaces clean, or we'll have to live with
> out-of-disk-space errors. Having /shared accessible on each slave is not
> enough -- people need to move their build artifacts from the workspace onto
> /shared, and we cannot control that.

This is exactly what I was warning about, and it is what the disk report mail was created for, too.

AFAIK, /shared is accessible from every slave, because the last built artifact is always at /shared/...

So I may be missing something, but why not configure each slave to have its HUDSON_HOME pointing to /shared?
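
(For reference, the disk report mail mentioned above boils down to something like the following sketch. It is an illustration with an assumed workspace root, not the actual script the webmasters run:)

    #!/usr/bin/env python
    # Sketch of a per-job disk usage report, in the spirit of the disk report
    # mail.  The workspace root below is an assumed path.
    import os

    WORKSPACE_ROOT = "/opt/hudson/workspace"  # assumption: a slave's workspace root

    def dir_size(path):
        """Total size in bytes of everything under path."""
        total = 0
        for dirpath, _dirnames, filenames in os.walk(path):
            for name in filenames:
                try:
                    total += os.path.getsize(os.path.join(dirpath, name))
                except OSError:
                    pass  # file vanished mid-walk; skip it
        return total

    if __name__ == "__main__":
        report = []
        for job in sorted(os.listdir(WORKSPACE_ROOT)):
            path = os.path.join(WORKSPACE_ROOT, job)
            if os.path.isdir(path):
                report.append((dir_size(path), job))
        for size, job in sorted(report, reverse=True):
            print("%8.1f MB  %s" % (size / (1024.0 * 1024.0), job))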
Comment 8 Eclipse Webmaster CLA 2012-01-18 16:01:52 EST
(In reply to comment #7)
> 
> So I may be missing something, but why not configure each slave to have its
> HUDSON_HOME pointing to /shared?

Well, when Hudson was set up, the docs indicated that 'sharing' the home directory in this way was problematic. I'm unable to find the specific doc right now, but perhaps this is better discussed as a 'separate' Hudson upgrade?

Right now I'm working on adding a 'new' slave (one of our old dev nodes), and once I have that in and running (should be Friday-ish) I'll look to take slave1 out of service later next week so it can be split in two (which will result in three slaves with 100G+ of disk space each).

-M.
Comment 9 Eclipse Webmaster CLA 2012-07-04 11:00:15 EDT
OK, now that Juno is out the door I'd like to pick this up again. Right now my plan is to start this work on July 16th.

-M.
Comment 10 Eclipse Webmaster CLA 2012-07-26 10:44:24 EDT
OK, it took longer than I anticipated, but it's done. We now have two slaves, each with 14G of RAM, 4 CPUs, and a 170G HD. I've put hudson-slave1 back into service and added hudson-slave4.

-M.