Bug 315073

Summary: Overall mirror size stressing mirror sites
Product: Community
Reporter: Denis Roy <denis.roy>
Component: Architecture Council
Assignee: eclipse.org-architecture-council
Status: RESOLVED WONTFIX
QA Contact:
Severity: normal
Priority: P3
CC: bugs.eclipse.org, david_williams, gunnar, kim.moir, mknauer, pwebster, thomas
Version: unspecified
Target Milestone: ---
Hardware: PC
OS: Linux
Whiteboard: stalebug
Bug Depends on: 316620
Bug Blocks:
Attachments (description: flags):
Bandwidth graph - 24h period: none
Directory tree view of a full mirror: none

Description Denis Roy CLA 2010-05-31 09:25:33 EDT
Worldwide download mirror sites enable the Eclipse Foundation to distribute Eclipse bits efficiently and at moderate cost.  However, despite my many efforts at constructing an exclusion list to filter nightly and temporary builds, many mirrors are beginning to consider the disk space requirements for an Eclipse mirror excessive, as witnessed on the eclipse-mirrors mailing list:

http://dev.eclipse.org/mhonarc/lists/eclipse-mirrors/msg00569.html


There may be many factors influencing the amount of disk space we occupy on mirrors:


-> All-in-one ZIPs -- seemingly useless bit duplication

-> .jar and .jar.pack.gz -- bit duplication that saves bandwidth at the expense of disk space

-> Stable/Release Candidate builds that linger in the mirrored area for weeks after they have stopped being pertinent

-> Lack of naming convention for directories means it's difficult to construct an effective exclusion list for Nightly builds that covers everyone's nightly builds

-> Committer indifference towards cleaning up disk space after releases


So I call on the AC to help me prevent a collapse of our network of download mirrors.
Comment 1 Kim Moir CLA 2010-05-31 09:36:07 EDT
I'm going to move some of our older 3.6 stable builds to archive.eclipse.org.  Each of our builds is now about 6GB in size which is ridiculous but that's what you get when you build 15 platforms.
Comment 2 Markus Knauer CLA 2010-05-31 10:02:11 EDT
EPP is just another project that eats up lots of space on the mirrors. A single package build means about 10 GBytes...

Usually my own policy is to distribute 

  * the last stable release (currently Galileo SR2) and
  * the last two milestone / release candidate builds

to the mirrors. All other builds should be available from archive.eclipse.org. I am not sure where to save space... any suggestions?
Comment 3 Denis Roy CLA 2010-05-31 10:24:45 EDT
Personally I'm not too concerned about EPP's (or the Platform's) disk space usage.  It is what it is unless you can revolutionize your builds.

However, here is one example of perhaps useless bit duplication: knowing that a package exists for PDT, does the PDT team really need to produce All-In-Ones?

http://eclipse.org/pdt/downloads/
Comment 4 John Arthorne CLA 2010-05-31 10:50:45 EDT
(In reply to comment #1)
> I'm going to move some of our older 3.6 stable builds to archive.eclipse.org. 
> Each of our builds is now about 6GB in size which is ridiculous but that's what
> you get when you build 15 platforms.

Kim, don't we delete all 3.x stable builds once the release is done anyway? I.e., once 3.6 is shipped we remove all 3.6 milestones?
Comment 5 Kim Moir CLA 2010-05-31 10:56:21 EDT
I usually archive them for a few months and then delete them.  Some people may still refer to them.
Comment 6 Denis Roy CLA 2010-05-31 11:14:05 EDT
As a point of reference, being an Apache mirror only requires 28G.  I realize Eclipse is very different conceptually, but it seems to be too easy for a project to put five 200M files in the download area and call it a product.
Comment 7 Gunnar Wagenknecht CLA 2010-05-31 14:25:25 EDT
Kim, Markus: is the full build taking that space, or just the distributable pieces?

Denis, AFAIK you get those "all-in-ones" out-of-the-box with Athena. PDT might think they are special because their users typically don't know Eclipse well enough to pick the individual pieces. However, their users should be converted to EPP consumers! One way might be to remove those all-in-ones.

BTW, I also think that only release builds should be mirrored to not stress our mirror network.
Comment 8 Denis Roy CLA 2010-06-01 10:18:26 EDT
> I'm going to move some of our older 3.6 stable builds to archive.eclipse.org. 
> Each of our builds is now about 6GB in size which is ridiculous but that's what
> you get when you build 15 platforms.

Perhaps not all of these need to be mirrored?

eclipse-SDK-3.5.2-macosx-carbon.tar.gz gets about 45 downloads/day

Anything %solaris% gets about 37 downloads/day

/eclipse/downloads/drops/R-%/eclipse-platform-SDK- gets about 15/day

/eclipse/downloads/drops/%hpux% got 4500 downloads ... in 2 years.

/eclipse/downloads/drops/%aix% got 1300 downloads in 2 years

/eclipse/downloads/drops/%/org.eclipse.platform-3 has a whopping 37 downloads

eclipse-Automated-Tests and org.eclipse.rcp.source 2 per day each

org.eclipse.platform.source-%.zip has 78 downloads ever, and it's a 280M file.

And then there's all this stuff, which is not meant to be 'downloaded' by mirrors:
47M     ./compilelogs
143M    ./testresults
650K    ./checksum
652K    ./apitools
179K    ./clickThroughs
1.4M    ./buildnotes
28K     ./buildlogs
144M    ./performance



So my proposal here would be to create a subdirectory in each build to contain all the low-volume content, which I can exclude from the mirror sync:

R-3.7-20110623 (6GB)
|--- X-other (3GB ?)
|--- |--- compilelogs
|--- |--- testresults
|--- |--- performance
|--- `--- (zips, gz and org.* files with few downloads, listed above)
`--- (zips, gz and files that get many, many downloads)
Comment 9 Markus Knauer CLA 2010-06-02 02:35:27 EDT
(In reply to comment #7)
> Kim, Markus is the full build taking that space or just the distributable
> pieces?

The 11 packages, each of them available for 7 platforms, take most of the space (~10 GB), and then there are a few kB of XML data... so there is not really anything worth removing.

> BTW, I also think that only release builds should be mirrored to not stress our
> mirror network.

I only uploaded the Helios milestones and release candidates this year (i.e. I didn't upload M2...~M4). I think it is good to upload them because they usually get more than 5000 downloads.

Would it be possible to define a per-directory exclude file for rsync, assuming rsync supports such a thing at all? I had something similar to a .htaccess in mind. This would enable the committers/releng people to manage what gets distributed to the mirrors.
Comment 10 Denis Roy CLA 2010-06-11 11:16:30 EDT
Created attachment 171734
Bandwidth graph - 24h period

The other unfortunate side-effect of the bit duplication is the network congestion that occurs during the release train.  As many projects simultaneously upload new bits for RC and Releases, our bandwidth becomes heavily taxed when the mirrors pull the new bits for several hours.

In the graph, you can see us at our limit from 22:00 till about 08:00 the next day, despite the fact that I allowed two bursts of traffic in between. 

In this case, not only are mirrors pulling RC4 bits for Helios EPP packages and Helios aggregate p2 repos, but also RC bits of Webtools (zips and p2), PDT (zips and p2), CDT (zips and p2), TPTP (zips and p2) and so on.
Comment 11 Kim Moir CLA 2010-06-11 11:29:44 EDT
Denis, I just downloaded stats for our 3.5.2 release so I would know what to move to a subdirectory for low volume downloads.  

The file is here 
https://bugs.eclipse.org/bugs/show_bug.cgi?id=316620#c1

What should be the threshold for removing stuff from the mirrors?  A thousand total downloads?  I don't know, so if you have any insight this would be appreciated :-)
Comment 12 John Arthorne CLA 2010-06-11 11:46:19 EDT
(In reply to comment #8)
> R-3.7-20110623 (6GB)
> |--- X-other (3GB ?)
> |--- |--- compilelogs
> |--- |--- testresults
> |--- |--- performance
> |--- `--- (zips, gz and org.* files with few downloads, listed above)
> `--- (zips, gz and files that get many, many downloads)

This sounds like a good idea. Could this be made transparent to people using download.php? I.e., "X-other" would be stripped from the path before looking for the file? That would allow dynamically moving things around based on download stats without breaking links. Projects can certainly change their download paths for future releases, but not for Helios or any past release, without breaking people's links.
Comment 13 Denis Roy CLA 2010-06-11 13:24:21 EDT
> What should be the threshold for removing stuff from the mirrors?

Consider that we have about 40 mirrors.  If a large file (20+ MB) hasn't had 100-or-so downloads in the last 30 days, then it may be worth putting it in X-other to keep it off the mirrors.

This query looks for files downloaded in the last 30 days:
https://dev.eclipse.org/committers/committertools/stats.php?filename=/eclipse/downloads/drops/R-3.5.2-201002111343/&view_date=L30
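
The rule of thumb above (a large file with fewer than about 100 downloads in 30 days) could be applied to a stats export roughly as follows. This is only a sketch: the `FileStat` record, the function names and the idea of feeding it a list of per-file stats are assumptions made for illustration, not part of the stats tool or of any existing releng script.

```python
from dataclasses import dataclass

# Thresholds taken from the comment above; both are rough rules of thumb.
LARGE_FILE_BYTES = 20 * 1024 * 1024   # "large file (20+ MB)"
MIN_DOWNLOADS_30D = 100               # "100-or-so downloads" in the last 30 days


@dataclass
class FileStat:
    """One file in a build directory (hypothetical record layout)."""
    path: str            # path relative to the build directory
    size_bytes: int
    downloads_30d: int   # taken from the download stats, not computed here


def belongs_in_x_other(stat: FileStat) -> bool:
    """Return True when a file is large but rarely downloaded."""
    return (stat.size_bytes >= LARGE_FILE_BYTES
            and stat.downloads_30d < MIN_DOWNLOADS_30D)


def split_build(stats):
    """Partition paths into (mirrored, x_other) lists, preserving input order."""
    mirrored, x_other = [], []
    for stat in stats:
        (x_other if belongs_in_x_other(stat) else mirrored).append(stat.path)
    return mirrored, x_other
```

Small files stay in the mirrored set regardless of download count, since moving them saves almost no disk space.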




(In reply to comment #12)
> I.e., "X-other" would be stripped from the path before looking for the file?

Absolutely.  We currently look for /path/to/file.zip in download, then archive.eclipse.org before declaring failure.  Looking for /path/to/X-other/file.zip is trivial to implement as a third check.

In the stats, you would see /path/to/file.zip and /path/to/X-other/file.zip as two different files, though.  Is that OK?
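
For reference, the three-step lookup described above can be sketched like this. It is not the actual download.php logic: the `exists` callback, the area names and the function names are placeholders introduced here, and the real script may order or implement the checks differently.

```python
import posixpath


def candidate_locations(requested_path: str):
    """Yield (area, path) pairs in the order described in the comment above."""
    directory, filename = posixpath.split(requested_path)
    yield "download", requested_path                      # 1. download area
    yield "archive", requested_path                       # 2. archive.eclipse.org
    yield "download", posixpath.join(directory, "X-other", filename)  # 3. X-other


def resolve(requested_path: str, exists):
    """Return the first (area, path) for which exists(area, path) is true.

    `exists` is a caller-supplied function (hypothetical) that checks whether
    a path is present in the given area. Returns None if every check fails.
    """
    for area, path in candidate_locations(requested_path):
        if exists(area, path):
            return area, path
    return None
```

Because the requested path and the X-other path are distinct strings, the stats would record them separately, which is the behaviour asked about above.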
Comment 14 John Arthorne CLA 2010-06-11 13:47:09 EDT
(In reply to comment #13)
> Absolutely.  We currently look for /path/to/file.zip in download, then
> archive.eclipse.org before declaring failure.  Looking for
> /path/to/X-other/file.zip is trivial to implement as a third check.
> 
> In the stats, you would see /path/to/file.zip and /path/to/X-other/file.zip as
> two different files, though.  Is that OK?

Sounds fine to me. I think we generally do stats queries using patterns like /path/to/* anyway.
Comment 15 Denis Roy CLA 2010-07-09 13:52:51 EDT
The frequency of the dissatisfaction expressed on the eclipse-mirrors list is beginning to bother me.  A full Eclipse.org mirror is expected to set aside 350G of disk space for us, which I believe is way too much.

One easy solution that I can think of is to change our inclusion policy.  Right now, everything is mirrored except a small exclusion list to filter nightly builds, etc.

If we change that to exclude everything except a specific set of directories, representing the highest volume of downloads, then we can easily cut our size by an order of magnitude without any need for releng teams to restructure their directories.  I do worry about what this will do to our bandwidth, but we do have Amazon AWS to back us up, and we would have the flexibility to simply include more mirrored content to match heavy traffic directories.
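
As an illustration of the "exclude everything except a specific set of directories" idea, the sketch below turns a list of included directories into rsync-style filter rules. The rule syntax (`+ /dir/`, `+ /dir/***`, `- *`) is standard rsync filter syntax; whether the mirror sync would actually be configured this way is an assumption, and the function name and the example directories in the usage note are chosen only for illustration.

```python
def allowlist_filter_rules(included_dirs):
    """Build rsync-style filter rules that keep only the listed directories."""
    rules = []
    seen = set()

    def add(rule):
        if rule not in seen:
            seen.add(rule)
            rules.append(rule)

    for directory in included_dirs:
        parts = [p for p in directory.strip("/").split("/") if p]
        if not parts:
            continue  # ignore empty entries such as "/"
        # Parent directories must be included explicitly; otherwise the final
        # "- *" rule excludes them and rsync never descends into them.
        for depth in range(1, len(parts)):
            add("+ /" + "/".join(parts[:depth]) + "/")
        # "dir/***" matches the directory itself and everything below it.
        add("+ /" + "/".join(parts) + "/***")

    rules.append("- *")  # everything not matched above is excluded
    return rules
```

Calling `allowlist_filter_rules(["/eclipse/downloads/drops", "/releases"])` produces one rule per line of a filter file. Adding a heavy-traffic directory later is then a one-line change to the input list.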
Comment 16 Markus Kuppe CLA 2010-07-14 09:01:47 EDT
FWIW: Removing Orbit from syncing across mirrors has broken all ECF builds (they run at OSU and thus use the local mirror directly via Buckminster). While we can work around this, it would make things easier if we got an advance warning when essential bundles such as Orbit are going to be removed from syncing across mirrors.
Comment 17 David Williams CLA 2010-07-14 12:32:31 EDT
I was a little surprised to see the Orbit directory come off the list, and would recommend it be added back, since it does have a p2 repository in it.

I'm not sure what "stale" means, but Orbit should only be using a couple of gigabytes on mirrors, maybe 4 or 5 G at most. Many of the meaty directories (in 'committers') start with "I" and I was under the impression those are not mirrored? Are these types of numbers prohibitive? 

       [12:23:37] david_williams@build:~/downloads/tools/orbit

$ du ./* -sh
5.2G    ./committers
40K     ./commonFiles
4.0K    ./displayBuildMachine.php
1.7G    ./downloads
4.0K    ./index.php
4.0K    ./parseProperties.php
4.0K    ./runIndexer.sh
Comment 18 Thomas Hallgren CLA 2010-07-16 08:56:54 EDT
I've moved all archive related stuff for Buckminster. Please reinstate it as a mirrored project.
Comment 19 Denis Roy CLA 2010-07-19 10:15:52 EDT
Orbit and Buckminster have been added back to the mirror pool.  Buckminster uses 147M of mirror space, and Orbit uses 1.7G.  Thanks for the cleanup!
Comment 20 Gunnar Wagenknecht CLA 2011-02-26 06:57:18 EST
Denis, I noticed that for technology projects, I-builds get mirrored. Can we prevent mirroring of nightly and integration builds in general?

*integration*
*/I-*
*/N-*
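
To get a feel for what these three patterns would catch, they can be tried against sample paths with Python's `fnmatch`. This is only an approximation: `fnmatch` lets `*` match across `/`, whereas rsync's `*` stops at a slash, so the real exclusion list may behave differently. The function name is an assumption introduced for this sketch.

```python
from fnmatch import fnmatchcase

# The three patterns proposed above.
EXCLUDE_PATTERNS = ["*integration*", "*/I-*", "*/N-*"]


def is_excluded(path: str, patterns=EXCLUDE_PATTERNS) -> bool:
    """Return True if any pattern matches the whole path (case-sensitive)."""
    return any(fnmatchcase(path, pattern) for pattern in patterns)
```

Note that `*/I-*` and `*/N-*` catch any path component starting with `I-` or `N-`, so a project that uses those prefixes for something other than builds would be excluded as well.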
Comment 21 Denis Roy CLA 2011-03-29 15:25:25 EDT
Created attachment 192123
Directory tree view of a full mirror

Gunnar, I've added the patterns you've mentioned.

I've also added many patterns of javadoc, apidocs and test/compile logs that needlessly get sent to mirrors.  With this exclusion list, the size of a Full Eclipse mirror is down to 277G (the actual size of download.eclipse.org is about double that).

Sadly, another US university has discontinued being an Eclipse mirror today -- and the cited reason was our large footprint.

At this point, I can keep trimming away at the exclusion list, but those entries that I find are relatively small potatoes.

So I'm attaching two things:
- A directory tree view of a Full Eclipse mirror
- The size of each top-level directory >1G

If anyone has any ideas as to what we can do next, I'm all ears.


scanserv:/home/data/download.eclipse.org # du * -sh
11G     birt
1.1G    datatools
1.8G    dsdp
12G     e4
66G     eclipse
2.2G    equinox
1.1G    gyrex
1.9G    jetty
2.2G    mat
20G     modeling
7.5G    releases
8.6G    rt
1.9G    stem
1.2G    stp
89G     technology
1.5G    tm
24G     tools
5.3G    tptp
1.2G    virgo
16G     webtools
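
To see where the space goes, the listing above can be totalled and ranked with a few lines of Python. The sizes are copied verbatim from the `du` output; the parsing helper is a sketch that only handles the `G` suffix used in this listing.

```python
DU_OUTPUT = """\
11G     birt
1.1G    datatools
1.8G    dsdp
12G     e4
66G     eclipse
2.2G    equinox
1.1G    gyrex
1.9G    jetty
2.2G    mat
20G     modeling
7.5G    releases
8.6G    rt
1.9G    stem
1.2G    stp
89G     technology
1.5G    tm
24G     tools
5.3G    tptp
1.2G    virgo
16G     webtools
"""


def parse_sizes(text):
    """Map directory name -> size in GB for lines like '1.1G    datatools'."""
    sizes = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        size, name = line.split()
        sizes[name] = float(size.rstrip("G"))  # only the 'G' suffix is handled
    return sizes


sizes = parse_sizes(DU_OUTPUT)
total = sum(sizes.values())
top5 = sorted(sizes.items(), key=lambda kv: kv[1], reverse=True)[:5]

print(f"total of listed directories: {total:.1f}G")
for name, gb in top5:
    print(f"{name:12s} {gb:6.1f}G  {100 * gb / total:5.1f}%")
```

The listed directories add up to about 275.5G, close to the 277G quoted above, and technology and eclipse together account for a bit over half of that.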
Comment 22 Denis Roy CLA 2011-03-29 15:27:36 EDT
One thing I've noticed is the increasing number of Maven repositories in project download areas.  I'm not sure if Maven can use our network of mirrors or not; if it cannot, perhaps excluding Maven repos from the mirror list would also help trim content.
Comment 23 Kim Moir CLA 2011-03-29 21:20:23 EDT
I've just archived 3.7 milestones 1-5, 3.6 and 3.6.1, which should reduce the amount sent to mirrors by 42GB.  I'll also take another look at bug 315073 to reduce the number of platforms sent to the mirrors.
Comment 24 Eclipse Genie CLA 2014-10-18 07:41:27 EDT
This bug hasn't had any activity in quite some time. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet.

If you have further information on the current state of the bug, please add it. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant.

--
The automated Eclipse Genie.
Comment 25 Denis Roy CLA 2014-10-21 15:37:34 EDT
There's probably not much we can do here, other than occasionally reminding everyone that disk space is finite.

Closing as WONTFIX