https://hudson.eclipse.org/hudson/job/emf-cdo-integration/1100/console :

[java] [ant] You can check signing status by tailing /home/data/httpd/download-staging.priv/arch/signer.log
[java] [ant] Waiting for signing to complete. This may take more then 60 minutes
[java] [ant] Obtaining signed file from staging area
[java] An exception occurred while writing to the platform log:
[java] java.io.IOException: No space left on device
Current disk hog is eclipse-equinox:

==== hudson-slave1.eclipse.org ====
/dev/xvda1  55G  55G  92M  100%  /

-> Usage exceeding 1GB for: Hudson workspace on hudson-slave1 (50G capacity) (2011-01-31T08:21)
15.3G  eclipse-equinox-test-N
1.8G   cbi-papyrus-integration
1.8G   cbi-papyrus-0.7-nightly
1.4G   virgo.kernel.snapshot
1.3G   eclipse-JUnit-Linux
1.3G   cbi-mat-nightly
1.1G   cdt-nightly
1.1G   cdt-release
1.0G   cbi-wtp-wst.xml
==== END: hudson-slave1.eclipse.org ====
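For reference, a summary like the one above can be produced with a short shell sketch along these lines; the workspace root path is an assumption, since the real slaves may keep workspaces elsewhere:

#!/bin/sh
# Sketch: summarize per-job workspace disk usage on a Hudson slave.
# WS_ROOT is an assumed path -- adjust to the slave's actual workspace root.
WS_ROOT=${WS_ROOT:-/opt/hudson/workspace}

# Overall usage of the system volume.
df -h /

# Largest workspaces first, top ten only.
du -sh "$WS_ROOT"/* 2>/dev/null | sort -rh | head -n 10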
I only keep three builds from each of my streams. I just deleted some, so they are down to two kept builds. I think we need more disk space: during milestone weeks we will have several builds running simultaneously, and each of our builds consumes at least 6GB. I can't store the build artifacts in /shared/eclipse because the Mac and Windows test machines don't have access to these volumes.
(In reply to comment #2)
> I can't store the build artifacts in /shared/eclipse
> because the Mac and Windows test machines don't have access to these volumes.

Sure they can: http://build.eclipse.org/eclipse/
> Each of our builds consumes at least 6GB

I thought that was for the Platform... This is Equinox we're talking about.
Eclipse and Equinox are the same build. I didn't know about http://build.eclipse.org/eclipse/. Last time I asked about it, I was told there wasn't a way to access /shared/eclipse from the non-Linux slaves :-) https://bugs.eclipse.org/bugs/show_bug.cgi?id=329830#c21
> I didn't know about http://build.eclipse.org/eclipse/. Last time I asked
> about it, I was told there wasn't a way to access /shared/eclipse from the
> non-Linux slaves :-)

The docs always take precedence over what we say ;)
http://wiki.eclipse.org/IT_Infrastructure_Doc#Builds

Marking as fixed... for now.
Blocker: It's not fixed: https://hudson.eclipse.org/hudson/job/emf-cdo-integration/1108/console
I confirm the issue. I also get:

java.io.IOException: No space left on device

This is a blocker for both Indigo M5 and Helios SR2 RC2. Today (Tuesday) is +2 and tomorrow +3.
Same here for ATL..
Which disk exactly is full? When I type "df" while connected to build.eclipse.org over ssh, I don't see anything near 100%. Is it not mounted on build.eclipse.org? Anyway, I wiped my workspace and removed old builds, and I could start a build afterwards. So the problem seems solved, at least temporarily.
> Which disk exactly is full? When I type "df" while connected to
> build.eclipse.org over ssh, I don't see anything near 100%. Is it not mounted
> on build.eclipse.org?

Each Hudson slave has its own storage for the workspace. This storage is only meant to be temporary, since build artifacts should either be moved to /shared or promoted to download. Please see this diagram for a better view: http://wiki.eclipse.org/Image:Build_infra_layout.png
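As a rough illustration of that workflow, a job on a Linux slave could end with a step along the following lines. The destination path and artifact locations are hypothetical; non-Linux slaves would need a different transfer mechanism, as discussed above.

#!/bin/sh
# Hypothetical end-of-build step: publish artifacts to /shared and keep the
# slave's local workspace lean. All paths below are assumptions; $WORKSPACE
# and $BUILD_ID are environment variables Hudson sets for every build.
DEST=/shared/myproject/nightly/$BUILD_ID

mkdir -p "$DEST"
cp -p "$WORKSPACE"/build/artifacts/*.zip "$DEST"/

# Drop bulky intermediates so the slave's local disk is not the archive.
rm -rf "$WORKSPACE"/build/tmp "$WORKSPACE"/build/artifacts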
I'll rename this bug to capture the bigger problem: now that more and more projects are jumping on Hudson, what can we do to ensure the place stays clean?

Here are the three problems we'll encounter:

1. Large workspaces spanning multiple slaves.
2. Look at how many jobs are on the Hudson home page. I have no way of knowing if any jobs are orphaned.
3. Multiple jobs for the same project, all with "small" workspaces, add up.

Potential solutions for 1):
- I wipe _all_ workspaces clean every Sunday. No exceptions.
- We get a 10TB disk array and have no problems for a couple of... months? years?

Potential solutions for 2):
- I automatically delete jobs that have not run in 60 days.

Potential solutions for 3):
- We limit the number of jobs one project can have.
- We get a 10TB disk array and have no problems for a couple of... months? years?
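As a concrete sketch of what an automated variant of the wipe-the-workspaces option could look like, assuming workspaces live under one per-slave directory; the path and the 14-day threshold are assumptions, not an agreed policy:

#!/bin/sh
# Sketch: wipe slave workspaces that nothing has touched in N days.
# WS_ROOT and the 14-day threshold are assumptions, not current practice.
WS_ROOT=${WS_ROOT:-/opt/hudson/workspace}
DAYS=14

for ws in "$WS_ROOT"/*; do
    [ -d "$ws" ] || continue
    # Skip any workspace with a file modified within the last $DAYS days;
    # this also avoids wiping a workspace with a build in progress.
    if [ -z "$(find "$ws" -mtime -"$DAYS" -print -quit)" ]; then
        echo "wiping idle workspace: $ws"
        rm -rf "$ws"
    fi
done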
(In reply to comment #12)

I am all for removing artifacts (not the jobs) after 30 days: +1.
And let's start collecting money for that 10TB disk: +1.
Wiping all workspaces on Sunday is, I believe, not so good.

- christian
> - We limit the number of jobs one project can have

I don't think this is a solution; one can occupy enough space with only one job running.

> - I automatically delete jobs that have not run in 60 days

Delete the job? ...or just older builds and maybe the workspace? Imagine someone has a maintenance job for Helios: it could happen that this job gets deleted because there weren't any commits between SR1 and the first SR2 milestone.

> - We get a 10TB disk array and have no problems

+1
> Potential solutions for 1)
>
> - I wipe _all_ workspaces clean every Sunday. No exceptions.

The projects can configure their jobs to only keep artifacts for X days and/or the last X builds. Only manually locked builds are kept forever (but I am not sure it is a good idea to keep artifacts on Hudson rather than somewhere else, like the download or shared space).

> - We get a 10TB disk array and have no problems for a couple of... months?
> years?

+0. Giving more space will push people to use it and not clean their workspaces.

> Potential solutions for 2)
>
> - I automatically delete jobs that have not run in 60 days

Same remark as above: Hudson will delete any old builds automatically according to the configuration.

> Potential solutions for 3)
>
> - We limit the number of jobs one project can have
> - We get a 10TB disk array and have no problems for a couple of... months?
> years?
(In reply to comment #12)

> - I wipe _all_ workspaces clean every Sunday. No exceptions.

+1, this shouldn't cause any problems since each job should be able to work from an empty workspace. But we should avoid wiping workspaces while a job is running.

> - We get a 10TB disk array

+1 for a bigger disk (though 10TB is probably overkill); 50GB doesn't seem like much, given the number of jobs.

> - I automatically delete jobs that have not run in 60 days

"Automatically" seems dangerous. But maybe send a mail to the person or team that requested the job to ask them whether it can be deleted. Also, be aware that Hudson seems to lose the "last run" information when deleting all builds from a project.

> - We limit the number of jobs one project can have

The problem is not so much the number of jobs as the space they occupy. Maybe add a disk quota per project.
> - I wipe _all_ workspaces clean every Sunday. No exceptions.

No thank you. I run test builds on Sunday because traffic is light :-). Once we have resolved all the issues with the Windows and Mac slaves and can run our builds on eclipse.org hardware, we will run builds on Sunday too.

> - We get a 10TB disk array and have no problems for a couple of... months?
> years?

I can't estimate how much space you need, but it's not realistic to expect everyone to build at Eclipse on Hudson while the amount of disk space doesn't increase.

> - I automatically delete jobs that have not run in 60 days

No thank you. As someone has already mentioned, there are old jobs that are kept around for maintenance purposes. If they are configured to delete old builds, they won't consume much space. If I'm not using a build for a while, I flush the workspace too so it doesn't consume disk space.

> - We limit the number of jobs one project can have

I don't expect that most people have extra jobs lying around just for fun :-) I only have jobs that exist for a specific purpose. I think the key is to encourage projects to keep only a few builds on Hudson to reduce the disk space footprint.
(In reply to comment #16)
> (In reply to comment #12)
> > - I wipe _all_ workspaces clean every Sunday. No exceptions.
> +1, this shouldn't cause any problems since each job should be able to work
> from an empty workspace. But we should avoid wiping workspaces while a job is
> running.

After the next job run the workspace will be full again, so you will only free space for a couple of hours.

> > - We get a 10TB disk array
> +1 for a bigger disk (though 10TB is probably overkill); 50GB doesn't seem
> like much, given the number of jobs.
>
> > - I automatically delete jobs that have not run in 60 days
> "Automatically" seems dangerous. But maybe send a mail to the person or team
> that requested the job to ask them whether it can be deleted.
> Also, be aware that Hudson seems to lose the "last run" information when
> deleting all builds from a project.

There are jobs that are scheduled by Hudson itself using cron, to run once a day or week without checking SCM. Such jobs will not be recognized as unused. I think we should have a kind of script that scans all the job configs for important settings like:

- kept builds (max count | date)
- scheduling via SCM polling or a change URL (plain cron is not allowed)
- and maybe "Abort the build if it's stuck" (does not save space, but should be set)

> > - We limit the number of jobs one project can have
> The problem is not so much the number of jobs as the space they occupy.
> Maybe add a disk quota per project.
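A sketch of the kind of audit script suggested in the list above, assuming the usual Hudson layout where each job keeps its settings in jobs/&lt;name&gt;/config.xml. The HUDSON_HOME path and the exact XML element names should be treated as assumptions to verify against the real installation:

#!/bin/sh
# Sketch: flag Hudson jobs whose config.xml lacks sensible housekeeping.
# HUDSON_HOME is an assumption -- point it at the real Hudson home directory.
HUDSON_HOME=${HUDSON_HOME:-/shared/hudson/home}

for cfg in "$HUDSON_HOME"/jobs/*/config.xml; do
    job=$(basename "$(dirname "$cfg")")

    # 1. Jobs that never discard old builds keep artifacts forever.
    grep -q '<logRotator>' "$cfg" || \
        echo "$job: no build/artifact retention configured"

    # 2. Jobs triggered purely by cron keep running (and filling disk)
    #    even when nothing changed; SCM polling is preferred.
    if grep -q '<hudson.triggers.TimerTrigger>' "$cfg" && \
       ! grep -q '<hudson.triggers.SCMTrigger>' "$cfg"; then
        echo "$job: scheduled by cron without SCM polling"
    fi

    # 3. A build timeout keeps stuck builds from hogging executors.
    grep -q 'BuildTimeoutWrapper' "$cfg" || \
        echo "$job: no 'abort the build if it's stuck' timeout"
done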
> Wiping all workspaces on Sunday is, I believe, not so good.

Why not? Jobs are supposed to run cleanly with an empty workspace. This happens when we add slaves.

(In reply to comment #15)
> The projects can configure their jobs to only keep artifacts for X days

Yes, but they don't. Then we run out of disk space.

> +0. Giving more space will push people to use it and not clean their workspaces.

Exactly.

(In reply to comment #16)
> > - I wipe _all_ workspaces clean every Sunday. No exceptions.
> +1, this shouldn't cause any problems since each job should be able to work
> from an empty workspace. But we should avoid wiping workspaces while a job is
> running.

Exactly. And agreed.

> > - We get a 10TB disk array
> +1 for a bigger disk (though 10TB is probably overkill); 50GB doesn't seem
> like much, given the number of jobs.

At this point, there's not much price difference between 5T and 10T.

> > - We limit the number of jobs one project can have
> The problem is not so much the number of jobs as the space they occupy.
> Maybe add a disk quota per project.

Unless Hudson itself has a quota mechanism, that's difficult: at the file system level, all files belong to 'hudsonBuild', so it's hard to map a job to a project.

(In reply to comment #17)
> I can't estimate how much space you need, but it's not realistic to expect
> everyone to build at Eclipse on Hudson while the amount of disk space doesn't
> increase.

In the last year to year-and-a-half, we've added 3TB for downloads/archives, 1TB for /shared and 500G for Hudson. The 4TB downloads volume is almost 30% full, /shared is over 70% full, and the 500G for Hudson... well, there's not a ton of that left either.

In short, I'm not adding another megabyte of disk space until we can come up with a concrete way to ensure we're cleaning up. Wasting disk space is just too easy, and too costly for the Foundation.

> I don't expect that most people have extra jobs lying around just for fun :-)

I wouldn't expect projects to maintain gigabytes of files just for fun either, but /shared is 700GB strong.

> I think the key is to encourage projects

For years I've been encouraging projects to clean up and to avoid wasting disk space. And yet here we are.

(In reply to comment #18)
> After the next job run the workspace will be full again, so you will only
> free space for a couple of hours.

Not as full -- some (many?) projects have workspaces several GB in size, containing build artifacts from months ago. Those 'maintenance' jobs -- the ones that only run a few times a year -- are their workspaces cleaned up, or do they occupy several GB for several months?
(In reply to comment #12)
> wipe _all_ workspaces clean

I only need my workspaces during and for "some time" after actual build runs. The time after a build is needed to investigate possible build problems. If a build is triggered by an SCM change, our world-wide team distribution may cause delays of some hours until I have the chance to investigate problems. But I could also just kick off a fresh build if the gap is larger. I could live with my workspaces being wiped some hours after the last build; my jobs wipe them at the start anyway.

> delete jobs that have not run in 60 days

I don't see how this adds value if the workspaces are being wiped periodically.

> limit the number of jobs one project can have

I don't see how this adds value if the workspaces are being wiped periodically.

> limit the number of kept builds per job

Depends on the number. My jobs keep less than 30MB per build and I prefer to keep a maximum of, let's say, 10 builds (some stable builds that have been promoted plus "volatile" CI builds). That makes a total of 600MB for my two jobs (given that the workspaces are wiped frequently). Would that be too much?
This is bad. And there are too many comments for me to follow it all... but I can confirm this is a blocking problem. I just waited a few hours to (finally) get a green build and got a failed build instead, with

java.io.IOException: No space left on device

near the end of the log. At that point, I looked explicitly from a shell, and

df /shared -h

reports about 30% free... so I assume someone (or a cron job?) cleaned up something, and will try the build again. (My jobs run on a 'slave', but mostly use disk space on 'shared'.)

The solution seems so simple: get more disk space. If it were me, I'd see how much is in use now, and plan on 5 times that much for the next 6 months or so. Then, if you are worried about too much "waste" over time, I think the measurement of waste needs to be based on time. If a job/project grows larger by 1000 megs on 'shared' week after week after week, then that is a fairly obvious waste of resources (as far as I know). If, on the other hand, a project uses a whole lot more, say 30G, but pretty much always uses that much (or less), then there is probably a real need there. (I picked 30G, since that's about the limit that WTP uses on 'shared'... and has for years and years.)

So my suggestion is to get more disk space immediately, since this is blocking... and then measure each project over time and spot the problem projects that way. I'm sure I'm oversimplifying things and we all have a lot to learn (over time)... but something needs to be done and some plan put in place. I don't think "let things fail until someone cleans something up" is working. How can we help? Need a credit card number? :)

Long term, I'd also suggest checking with some other open source projects (Apache?) to see if they'd tell you how much disk they have per Hudson job, etc. ... get some idea of whether we are in the same neighborhood. I tried to look myself, and didn't see any mention of disk space at http://wiki.apache.org/general/Hudson, but it sounds like they do have some scripts that _enforce_ setting timeouts, etc. (And, yes, yes, they also have 5 Hudson admins :) ... maybe a good measure is "disk space per admin" :)
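One low-tech way to do the time-based measurement suggested above would be to snapshot per-job usage periodically and diff the snapshots. A sketch, with both paths as assumptions:

#!/bin/sh
# Sketch: weekly per-job disk usage snapshots so growth can be compared
# week over week. WS_ROOT and SNAP_DIR are assumed paths.
WS_ROOT=${WS_ROOT:-/opt/hudson/workspace}
SNAP_DIR=${SNAP_DIR:-/shared/hudson/usage-snapshots}

mkdir -p "$SNAP_DIR"
today=$(date +%Y-%m-%d)

# Record usage in kilobytes, one line per job workspace.
du -sk "$WS_ROOT"/* | sort -k2 > "$SNAP_DIR/usage-$today.txt"

# Show the diff against the previous snapshot, if there is one.
prev=$(ls "$SNAP_DIR"/usage-*.txt 2>/dev/null | sort | tail -n 2 | head -n 1)
[ -n "$prev" ] && [ "$prev" != "$SNAP_DIR/usage-$today.txt" ] && \
    diff "$prev" "$SNAP_DIR/usage-$today.txt"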
I don't think the outage warning has anything to do with /shared; it is an error because slave1 has run out of space on the 'system' drive. I say that because I've started getting errors out of postfix to that effect. If you bind the job to 'fastlane' (since nobody should be using it, except for train-type builds), does it still fail?

-M.
> df /shared -h
>
> reports about 30% free... so I assume someone (or a cron job?) cleaned up
> something, and will try the build again.

Just to be clear: /shared has never been full. Ever. But it is getting there. That's 1 Terabyte for 'temporary storage'.
Since I last posted, the job ran fine, but now it has failed again. I guess the error message is kind of deceiving then (no fault of yours). When you say "system", do you mean "/tmp" by any chance? I could imagine the "unpack" operation is trying to use /tmp.

I'll try fastlane (but the problem isn't exactly reproducible, so I'm not sure a success there proves anything... unless there's more that you can see). Greatest thanks for your help.

= = = =

The following errors occured when building Helios:
org.eclipse.core.runtime.CoreException: Unable to unpack artifact osgi.bundle,org.eclipse.gmf.bridge.ui,1.3.0.v20101217-1532 in repository file:/shared/helios/aggregation/final/aggregate: Unable to read repository at file:/shared/helios/aggregation/final/aggregate/plugins/org.eclipse.gmf.bridge.ui_1.3.0.v20101217-1532.jar.pack.gz.
Caused by: java.io.IOException: No space left on device
> I guess the error message is kind of deceiving then (no fault of yours). When
> you say "system", do you mean "/tmp" by any chance? I could imagine the
> "unpack" operation is trying to use /tmp.

Yes, it's having a hard time unpacking because the Hudson slave's disk is full... be it /tmp or its local workspace. I've cleaned up old files in /tmp, but that hasn't really freed up much space.

Filesystem   Size  Used  Avail  Use%  Mounted on
/dev/xvda1   55G   54G   1.9G   97%   /
One solution that we could easily implement is to host each slave's workspace area on /shared. The master's workspace is already there:

/shared/hudson/ws-slave1/
/shared/hudson/ws-slave2/
/shared/hudson/ws-fastlane/

The second benefit is that everyone could browse the Hudson workspaces directly from build.eclipse.org. Currently, your only way of accessing your workspace is through the Hudson UI. The problem is that /shared only has ~300G left, so it would be full in no time.
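A minimal sketch of what that relocation could look like on one slave, assuming the slave's local workspace root can simply be replaced by a symlink while the slave is taken offline in Hudson; both paths are assumptions:

#!/bin/sh
# Sketch: move a slave's local workspace area onto /shared and leave a
# symlink behind. Run with the slave offline in Hudson. Paths are assumed.
LOCAL_WS=/opt/hudson/workspace
SHARED_WS=/shared/hudson/ws-slave1

mkdir -p "$SHARED_WS"
rsync -a "$LOCAL_WS"/ "$SHARED_WS"/     # copy the existing workspaces
mv "$LOCAL_WS" "$LOCAL_WS.old"          # keep a fallback copy for now
ln -s "$SHARED_WS" "$LOCAL_WS"          # the slave now writes to /shared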
As a temporary measure, I'm in the process of creating two 750G virtual hard drives inside the 3T of space we have for download.eclipse.org. I'll mount these drives inside hudson-slave1 and hudson-slave2 and move the workspaces to the new "drives". This should give us some breathing room until we can figure this out.
I think both of the previous ideas are pretty good as temporary emergency "breathing room". Recent aggregation builds took twice as long as normal (4 hours instead of 2), but 1) even that's better than "out of space" errors once or twice a day, and 2) Denis assures me that's just one data point in time, and it won't always be that slow :)

I'm quite willing to help somehow if I can, but I'm still not sure exactly what the problem is. I suspect there will be several prongs to the solution, and it will take some time to figure out a manageable process. Thanks so much for your help and the temporary solution, especially during these last few weeks of Helios SR2, when the impact of failed builds is so disruptive.
I'm assigning this to "webmaster" primarily to avoid so much Bugzilla mail going out, as is our convention, but I don't mean to imply the problem is entirely the responsibility of our webmaster. We will obviously need his guidance as we improve our practices, though.
> As a temporary measure, I'm in the process of creating two 750G virtual hard
> drives inside the 3T of space we have for download.eclipse.org.

Both drives are created, and I'm rsync'ing the local Hudson workspaces from slave1 and slave2 to them. When it's done, I'll take the slaves offline to resync, then mount the new workspace areas in their place. That will take care of the disk issue in the immediate term and, although it will be slower than local disk access, it will provide us with the opportunity to observe the growth patterns of local workspaces.
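The exact mechanics aren't spelled out here; purely as an illustration of the copy-then-swap approach (not necessarily how the virtual drives were actually created), a file-backed volume plus a two-pass rsync could look roughly like this. Every name, size and path below is an assumption:

#!/bin/sh
# Illustration only: carve a file-backed volume out of existing storage,
# pre-seed it with a first rsync, then swap it in during a short outage.
IMG=/downloads-volume/hudson-slave1-ws.img
MNT=/mnt/new-ws
LOCAL_WS=/opt/hudson/workspace

dd if=/dev/zero of="$IMG" bs=1M count=0 seek=768000   # sparse ~750G image
mkfs.ext3 -F "$IMG"
mkdir -p "$MNT"
mount -o loop "$IMG" "$MNT"

rsync -a "$LOCAL_WS"/ "$MNT"/          # first pass, slave still online
# ... take the slave offline in Hudson, then resync only the delta ...
rsync -a --delete "$LOCAL_WS"/ "$MNT"/
umount "$MNT"
mount -o loop "$IMG" "$LOCAL_WS"       # the new volume now backs the workspace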
slave1 and slave2 now have 750G (each, 1.5T total) of usable workspace. As mentioned, this disk space is 'borrowed' from the downloads/archives area and is on a remote server, which will be slower than the local disk array.
What can we do to get you more space, Denis?
I need money to purchase hardware, so I'm not sure what you can do in that area. But before doing that, I'm interested in knowing how we can ensure the storage we have is used efficiently.
(In reply to comment #31)
> which will be slower than the local disk array

Indeed! The MoDisco nightly build started on Feb 8 on hudson-slave1 finished in 3h30. It used to complete in about an hour on hudson-slave1 using the local hard disk. And the build started on Feb 9 on hudson-slave1 is still not finished today, *23 hours later* (it looks almost finished, although that could still be a few hours at this rate...). Each operation is extremely slow, taking 1 or 2 orders of magnitude longer:

BUCKMINSTER SETPREF took           : 175s   (×10)
BUCKMINSTER IMPORT took            : 16315s (×10)
BUCKMINSTER RESOLVE took           : 23295s (×156)
BUCKMINSTER BUILD took             : 5241s  (×81)
BUCKMINSTER JUNIT took             : 2006s  (×3.5)
BUCKMINSTER PERFORM 'site.p2' took : 26135s (×47)

Compare this to the previous "normal" times:

BUCKMINSTER SETPREF took           : 17s
BUCKMINSTER IMPORT took            : 1563s
BUCKMINSTER RESOLVE took           : 149s
BUCKMINSTER BUILD took             : 64s
BUCKMINSTER JUNIT took             : 567s
BUCKMINSTER PERFORM 'site.p2' took : 551s

And the other builds currently running on hudson-slave1 don't seem to be faring better:

cbi-wtp-inc-xquery : started 18 hr ago
cdt-nightly        : started 17 hr ago
jetty              : started 22 hr ago
Please see bug 336864
I fully agree we can and should do a better job of "cleaning up", but I've learned it is not as easy as it sounds, for many reasons. I thought I would document how the current "shortage" of space, besides the outright failures it causes, has caused other problems or delays.

As people have mentioned, one of the reasons for using lots of space is to "cache" others' builds/repos, etc. While it seems to some that this should not be required, I've run across a few issues lately that show that it is, and that illustrate how it costs us "down time" when cached prerequisites are cleaned up too aggressively.

One was documented in bug 336897#c5. In that case, it sounds like the "old prereq" really did "disappear". Tsk tsk.

Another I hit when deleting an old GEF 3.4.2 prereq from our WTP cache. Shortly after, someone needed a patch build with that old code, and that build failed. Sure, GEF had been moved to "archive"... easy enough to know that, but then I discovered that GEF apparently doesn't use the same URL conventions in 'archive' as it does in 'downloads'... it is flatter in 'archive'. So we had to change our build scripts to use the archive URL for that old version. No big deal... but it cost us about a day with failing builds, debugging, and rebuilding... all to save 5 megabytes of "extra" storage.

The other complicated case is that we sometimes purposely use an "interim" build from one of our prereqs... say, an M build from January... so by February those M builds have been deleted (not even moved to archives), yet we may not be ready to move up to a new M build. Sure, maybe we could, but some times are better than others, so it's best if we can pick the time, instead of suddenly discovering that a month-old M build was deleted by another project. I'm just saying cleaning up is complicated. Not that we should not make efforts to stay cleaner.

Long term, one thing that might help in the Hudson workspace case is an "opt-out" (or opt-in?) plugin that "cleans workspace"... with some variable field, such as "after n days". So, if someone knows they cannot or should not clean their workspace, they can "opt out", whereas others would have their workspaces cleaned after n days with no modifications... so it would only affect those not very active. Just an idea... and I know, I know... someone would have to write it :) (if it doesn't already exist).

Given comments elsewhere that not everyone has their own project space on /shared, maybe it would be easier to have a special space on /shared with similar permissions as /tmp, so any Hudson job could have its own writable "cached prereqs" space on /shared without webmasters having to explicitly define a /shared directory for each and every project or job that needs it. Say /shared/cachedprereqs, and encourage people to use a distinct subdirectory, such as their job name? Then, unlike /tmp, it would not be cleaned up every 14 days. And... the cache subdirectories could be monitored or reviewed from time to time to see if anyone was getting carried away or using it for things other than caching build requirements.

Just some thoughts. Just trying to help.
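The /shared cache idea could be as simple as a sticky, world-writable directory plus a gentler expiry than /tmp's 14 days. A sketch, with the path and the 90-day window as assumptions rather than existing policy:

#!/bin/sh
# Sketch: a shared, /tmp-like area for cached build prerequisites.
# The path and the 90-day expiry are assumptions, not existing policy.
CACHE=/shared/cachedprereqs

mkdir -p "$CACHE"
chmod 1777 "$CACHE"        # world-writable with the sticky bit, like /tmp

# Periodic cleanup: prune only caches nobody has touched in 90 days
# (crude heuristic: looks at the access time of each top-level entry).
find "$CACHE" -mindepth 1 -maxdepth 1 -atime +90 -exec rm -rf {} +

# Per-cache report so unusually large caches can be reviewed by hand.
du -sh "$CACHE"/* 2>/dev/null | sort -rh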
What's the current status here?

- Do people feel that Denis' "nag e-mails" helped, or are we still at risk?
- Do we need a "how to keep Hudson clean" HOWTO?
- Anything else?

Having Hudson run out of disk space causes lots of disruption and waste, so we should really get this fixed. From a high level, it looks like there are 3 possible measures: (a) educate folk, (b) create software/tools to help stay clean, (c) just add disks. Have we done all the easy steps?

Adding AC for consideration.
Hi, I've done my best to reduce the disk space used by the Hudson jobs I'm responsible for (MoDisco & EMF Facet). I'd like to remove the built zip once it's archived by Hudson ("Archive the artifacts"). The problem is that archiving is a "Post-build action", so I can't execute anything after it. I've found that there is a plug-in that allows this: http://wiki.hudson-ci.org/display/HUDSON/Post+build+task

Would it be possible to install it? Or is there another way?
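If the Post build task plugin were installed, the step itself would presumably just be a small shell snippet like the one below. The artifact paths are assumptions, and the ordering relative to "Archive the artifacts" would still need to be verified for that plugin:

#!/bin/sh
# Hypothetical "Post build task" script body: once Hudson has archived the
# zip (it keeps its own copy under the job's builds/ directory), the copy in
# the workspace is no longer needed. Paths below are assumptions.
rm -f "$WORKSPACE"/site/target/*.zip

# Optionally also prune other bulky intermediates left by the build.
rm -rf "$WORKSPACE"/buckminster.output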
Almost three years later, I think we're in a happy place. Of course, additional terabytes of storage and HIPP make this much easier.