Worldwide download mirror sites enable the Eclipse Foundation to efficiently distribute Eclipse bits with moderate costs. However, despite my many efforts at constructing an exclusion list to filter Nightly and temporary builds, many mirrors are beginning to consider the disk space requirements for an Eclipse mirror to be excessive, as witnessed on the eclipse-mirrors mailing list:

http://dev.eclipse.org/mhonarc/lists/eclipse-mirrors/msg00569.html

There may be many factors influencing the amount of disk space we occupy on mirrors:

-> All-in-one ZIPs -- seemingly useless bit duplication
-> .jar and .jar.pack.gz -- bit duplication that saves bandwidth at the expense of disk space
-> Stable/Release Candidate builds that linger in the mirrored area for weeks after they are pertinent
-> Lack of naming convention for directories means it's difficult to construct an effective exclusion list for Nightly builds that covers everyone's nightly builds
-> Committer indifference towards cleaning up disk space after releases

So I call on the AC to help me prevent a collapse of our network of download mirrors.
I'm going to move some of our older 3.6 stable builds to archive.eclipse.org. Each of our builds is now about 6GB in size which is ridiculous but that's what you get when you build 15 platforms.
EPP is just another project that eats up lots of space on the mirrors. A single package build means about 10 GBytes... Usually my own policy is to distribute
* the last stable release (currently Galileo SR2) and
* the last two milestone / release candidate builds
to the mirrors. All other builds should be available from archive.eclipse.org.

I am not sure where to save space... any suggestions?
Personally I'm not too concerned about EPP's (or the Platform's) disk space usage. It is what it is unless you can revolutionize your builds. However, here is one example of perhaps useless bit duplication: knowing that a package exists for PDT, does the PDT team really need to produce All-In-Ones? http://eclipse.org/pdt/downloads/
(In reply to comment #1)
> I'm going to move some of our older 3.6 stable builds to archive.eclipse.org.
> Each of our builds is now about 6GB in size which is ridiculous but that's what
> you get when you build 15 platforms.

Kim, don't we delete all 3.x stable builds once the release is done anyway? I.e., once 3.6 is shipped we remove all 3.6 milestones?
I usually archive them for a few months and then delete them. Some people may still refer to them.
As a point of reference, being an Apache mirror only requires 28G. I realize Eclipse is very different conceptually, but it seems to be too easy for a project to put five 200M files in the download area and call it a product.
Kim, Markus, is the full build taking that space or just the distributable pieces?

Denis, AFAIK you get those "all-in-ones" out-of-the-box with Athena. PDT might think they are special because their users typically don't know Eclipse well enough to pick the individual pieces. However, their users should be converted to EPP consumers! One way might be to remove those all-in-ones.

BTW, I also think that only release builds should be mirrored, so as not to stress our mirror network.
> I'm going to move some of our older 3.6 stable builds to archive.eclipse.org.
> Each of our builds is now about 6GB in size which is ridiculous but that's what
> you get when you build 15 platforms.

Perhaps not all of these need to be mirrored?

eclipse-SDK-3.5.2-macosx-carbon.tar.gz gets about 45 downloads/day
Anything %solaris% gets about 37 downloads/day
/eclipse/downloads/drops/R-%/eclipse-platform-SDK- gets about 15/day
/eclipse/downloads/drops/%hpux% got 4500 downloads ... in 2 years.
/eclipse/downloads/drops/%aix% got 1300 downloads in 2 years
/eclipse/downloads/drops/%/org.eclipse.platform-3 has a whopping 37 downloads
eclipse-Automated-Tests and org.eclipse.rcp.source 2 per day each
org.eclipse.platform.source-%.zip has 78 downloads ever, and it's a 280M file.

And then there's all this stuff, which is not meant to be 'downloaded' by mirrors:

47M ./compilelogs
143M ./testresults
650K ./checksum
652K ./apitools
179K ./clickThroughs
1.4M ./buildnotes
28K ./buildlogs
144M ./performance

So my proposal here would be to create a subdirectory in each build to contain all the low-volume content, which I can exclude from the mirror sync:

R-3.7-20110623 (6GB)
|--- X-other (3GB ?)
|--- |--- compilelogs
|--- |--- testresults
|--- |--- performance
|--- `--- (zips, gz and org.* files with few downloads, listed above)
`--- (zips, gz and files that get many, many downloads)
(In reply to comment #7)
> Kim, Markus is the full build taking that space or just the distributable
> pieces?

The 11 packages, each of them available for 7 platforms, take most of the space (~10 GB), and then there are a few kB of XML data... so there is not really anything worth removing.

> BTW, I also think that only release builds should be mirrored to not stress our
> mirror network.

I did only upload the Helios milestones and release candidates this year (i.e. I didn't upload M2...~M4). I think it is good to upload them because they usually get more than 5000 downloads.

Would it be possible to define a per-directory exclude file for rsync, if rsync supports that at all? I had something similar to a .htaccess in mind. This would enable the committers/releng people to manage what gets distributed to the mirrors.
Created attachment 171734 [details]
Bandwidth graph - 24h period

The other unfortunate side-effect of the bit duplication is the network congestion that occurs during the release train. As many projects simultaneously upload new bits for RC and Releases, our bandwidth becomes heavily taxed when the mirrors pull the new bits for several hours. In the graph, you can see us at our limit from 22:00 till about 08:00 the next day, despite the fact that I allowed two bursts of traffic in between.

In this case, not only are mirrors pulling RC4 bits for Helios EPP packages and Helios aggregate p2 repos, but also RC bits of Webtools (zips and p2), PDT (zips and p2), CDT (zips and p2), TPTP (zips and p2) and so on.
Denis, I just downloaded stats for our 3.5.2 release so I would know what to move to a subdirectory for low volume downloads. The file is here https://bugs.eclipse.org/bugs/show_bug.cgi?id=316620#c1 What should be the threshold for removing stuff from the mirrors? A thousand total downloads? I don't know, so if you have any insight this would be appreciated :-)
(In reply to comment #8)
> R-3.7-20110623 (6GB)
> |--- X-other (3GB ?)
> |--- |--- compilelogs
> |--- |--- testresults
> |--- |--- performance
> |--- `--- (zips, gz and org.* files with few downloads, listed above)
> `--- (zips, gz and files that get many, many downloads)

This sounds like a good idea. Could this be made transparent to people using download.php? I.e., "X-other" would be stripped from the path before looking for the file? That would allow dynamically moving things around based on download stats without breaking links. Projects can certainly change their download paths for future releases, but not for Helios or any past release, without breaking people's links.
> What should be the threshold for removing stuff from the mirrors?

Consider that we have about 40 mirrors. If in the last 30 days a large file (20+ MB) hasn't had 100-or-so downloads, then it may be worth putting it in X-other to exclude it from mirrors.

This query looks for files downloaded in the last 30 days:
https://dev.eclipse.org/committers/committertools/stats.php?filename=/eclipse/downloads/drops/R-3.5.2-201002111343/&view_date=L30

(In reply to comment #12)
> I.e., "X-other" would be stripped from the path before looking for the file?

Absolutely. We currently look for /path/to/file.zip in download, then archive.eclipse.org before declaring failure. Looking for /path/to/X-other/file.zip is trivial to implement as a third check.

In the stats, you would see /path/to/file.zip and /path/to/X-other/file.zip as two different files, though. Is that OK?
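For reference, the threshold suggested above (a large file of 20+ MB with fewer than 100-or-so downloads in the last 30 days) could be applied to a stats export roughly as follows. This is only a sketch: the `FileStat` record and its field names are hypothetical and do not reflect the actual output of stats.php, and the cut-off values are the approximate ones from the comment.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Approximate thresholds taken from the comment above; tune as needed.
MIN_SIZE_BYTES = 20 * 1024 * 1024   # "large file (20+ MB)"
MIN_DOWNLOADS_30D = 100             # "100-or-so downloads" in the last 30 days


@dataclass
class FileStat:
    """Hypothetical record: one file with its size and 30-day download count."""
    path: str
    size_bytes: int
    downloads_last_30_days: int


def is_low_volume(stat: FileStat) -> bool:
    """True if the file is large but rarely downloaded."""
    return (stat.size_bytes >= MIN_SIZE_BYTES
            and stat.downloads_last_30_days < MIN_DOWNLOADS_30D)


def candidates_for_x_other(stats: Iterable[FileStat]) -> List[str]:
    """Return the paths that could be moved into an X-other subdirectory."""
    return [s.path for s in stats if is_low_volume(s)]
```

Small files are deliberately left alone here, since moving them saves little mirror space.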
(In reply to comment #13)
> Absolutely. We currently look for /path/to/file.zip in download, then
> archive.eclipse.org before declaring failure. Looking for
> /path/to/X-other/file.zip is trivial to implement as a third check.
>
> In the stats, you would see /path/to/file.zip and /path/to/X-other/file.zip as
> two different files, though. Is that OK?

Sounds fine to me. I think we generally do stats queries using patterns like /path/to/* anyway.
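The lookup order described in comment #13 (download first, then archive.eclipse.org, then the X-other subdirectory) might look something like the sketch below. This is not the actual download.php logic: the function names, the `"download"`/`"archive"` host labels and the `exists` callback are all hypothetical, and it assumes the X-other check is made on the download server.

```python
from pathlib import PurePosixPath
from typing import Callable, List, Optional, Tuple


def candidate_locations(requested: str) -> List[Tuple[str, str]]:
    """List (host, path) pairs to try, in order, for a requested file path."""
    p = PurePosixPath(requested)
    # Same directory as the request, with an X-other level inserted.
    x_other = str(p.parent / "X-other" / p.name)
    return [
        ("download", requested),  # 1. the regular download area
        ("archive", requested),   # 2. archive.eclipse.org
        ("download", x_other),    # 3. assumed: X-other under the same directory
    ]


def resolve(requested: str,
            exists: Callable[[str, str], bool]) -> Optional[Tuple[str, str]]:
    """Return the first (host, path) where the file exists, or None on failure."""
    for host, path in candidate_locations(requested):
        if exists(host, path):
            return host, path
    return None
```

Because the link itself never contains X-other, a file can be moved into or out of that subdirectory without breaking existing URLs, which is the transparency asked for in comment #12.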
The frequency of the dissatisfaction expressed on the eclipse-mirrors list is beginning to bother me. A full Eclipse.org mirror is expected to have 350G of disk space for us, which I believe is way too much.

One easy solution that I can think of is to change our inclusion policy. Right now, everything is mirrored except a small exclusion list to filter nightly builds, etc. If we change that to exclude everything except a specific set of directories, representing the highest volume of downloads, then we can easily cut our size by an order of magnitude without any need for releng teams to restructure their directories.

I do worry about what this will do to our bandwidth, but we do have Amazon AWS to back us up, and we would have the flexibility to simply include more mirrored content to match heavy traffic directories.
FWIW: Removing Orbit from syncing across mirrors has broken all ECF builds (they run at OSU and thus use the local mirror directly via Buckminster). While we can work around this, it would make things easier if we got a heads-up when essential bundles such as Orbit are going to be removed from syncing across mirrors.
I was a little surprised to see the Orbit directory come off the list, and would recommend it be added back, since it does have a P2 repository in it. I'm not sure what "stale" means, but Orbit should only be using a couple of gigabytes on mirrors, maybe 4 or 5 G at most. Many of the meaty directories (in 'committers') start with "I" and I was under the impression those are not mirrored? Are these types of numbers prohibitive?

[12:23:37] david_williams@build:~/downloads/tools/orbit $ du ./* -sh
5.2G ./committers
40K ./commonFiles
4.0K ./displayBuildMachine.php
1.7G ./downloads
4.0K ./index.php
4.0K ./parseProperties.php
4.0K ./runIndexer.sh
I've moved all archive related stuff for Buckminster. Please reinstate it as a mirrored project.
Orbit and Buckminster have been added back to the mirror pool. Buckminster uses 147M of mirror space, and Orbit uses 1.7G. Thanks for the cleanup!
Denis, I noticed that for technology projects I builds get mirrored. Can we prevent mirroring for nightly and integration builds in general?

*integration*
*/I-*
*/N-*
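To get a rough feel for which paths the three patterns above would catch, they can be tried against sample paths with shell-style wildcard matching. This is only an approximation: Python's `fnmatch` lets `*` match across `/`, and the matching rules of the actual mirror sync tool may differ, so treat the results as indicative only. The helper name `is_excluded` is made up for this sketch.

```python
from fnmatch import fnmatchcase
from typing import Sequence

# The three patterns proposed in the comment above.
EXCLUDE_PATTERNS = ("*integration*", "*/I-*", "*/N-*")


def is_excluded(path: str, patterns: Sequence[str] = EXCLUDE_PATTERNS) -> bool:
    """True if the path matches any exclusion pattern (case-sensitive)."""
    return any(fnmatchcase(path, pattern) for pattern in patterns)
```

Release drops such as `R-...` directories do not match any of the three patterns, so they would stay in the mirrored set.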
Created attachment 192123 [details]
Directory tree view of a full mirror

Gunnar, I've added the patterns you've mentioned. I've also added many patterns of javadoc, apidocs and test/compile logs that needlessly get sent to mirrors. With this exclusion list, the size of a Full Eclipse mirror is down to 277G (the actual size of download.eclipse.org is about double that).

Sadly, another US university has discontinued being an Eclipse mirror today -- and the cited reason was our large footprint.

At this point, I can keep trimming away at the exclusion list, but those entries that I find are relatively small potatoes. So I'm attaching two things:
- A directory tree view of a Full Eclipse mirror
- The size of each top-level directory >1G

If anyone has any ideas as to what we can do next, I'm all ears.

scanserv:/home/data/download.eclipse.org # du * -sh
11G birt
1.1G datatools
1.8G dsdp
12G e4
66G eclipse
2.2G equinox
1.1G gyrex
1.9G jetty
2.2G mat
20G modeling
7.5G releases
8.6G rt
1.9G stem
1.2G stp
89G technology
1.5G tm
24G tools
5.3G tptp
1.2G virgo
16G webtools
One thing I've noticed is the increasing number of Maven repositories in project download areas. I'm not sure if Maven can use our network of mirrors or not; if it cannot, perhaps excluding Maven repos from the mirror list would also help trim content.
I've just archived 3.7 milestones 1-5 as well as 3.6 and 3.6.1, which should reduce the amount sent to mirrors by 42GB. I'll also take another look at bug 315073 to reduce the number of platforms sent to the mirrors.
This bug hasn't had any activity in quite some time. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. If you have further information on the current state of the bug, please add it. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. -- The automated Eclipse Genie.
There's probably not much we can do here, other than occasionally reminding everyone that disk space is finite. Closing as WONTFIX.