This is a follow-up on p2 bug 337022. While investigating performance issues in tycho builds using the helios p2 repo http://download.eclipse.org/releases/helios/ we found that, in general, each (child) p2 repository costs the client 4 HTTP roundtrips (even just to find out that nothing changed with respect to the local p2 cache). Helios has about 11 child repositories, resulting in > 40 HTTP roundtrips and typical lag times of > 10 sec before any p2 operation can proceed. This is a general issue, not only for tycho but for any eclipse user trying to install/update. We suggest reducing the number of child repos for the (future) eclipse release train to a minimum, or even collapsing the release train repo into one non-composite repository.
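As a rough illustration of the arithmetic in this report, the sketch below multiplies the number of child repositories by the roundtrips each one costs. The function names and the 250 ms per-roundtrip figure are assumptions chosen for illustration; only the 11 children and the 4 roundtrips per child come from the report.

```python
def metadata_round_trips(child_repos: int, trips_per_child: int = 4) -> int:
    """Roundtrips spent on child repositories before a p2 operation can proceed."""
    return child_repos * trips_per_child


def estimated_wait_ms(child_repos: int, round_trip_ms: int, trips_per_child: int = 4) -> int:
    """Total wait if every roundtrip is sequential and costs round_trip_ms."""
    return metadata_round_trips(child_repos, trips_per_child) * round_trip_ms


# 11 children and 4 roundtrips each are the figures quoted in the report.
print(metadata_round_trips(11))    # 44
# 250 ms per roundtrip is a hypothetical latency, not a measured value.
print(estimated_wait_ms(11, 250))  # 11000
```

Under these assumptions the wait scales linearly with the number of children, which is why collapsing the composite would help even without any change to p2.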
+1 for reducing the number of child repos if that can speed up the initial loading of repositories. I've seen recently the Helios and Indigo repositories take up to 15 minutes to load, with download.eclipse.org serving the metadata jars at ~10KiB/s
+1 Much in favor of any performance improvement on the indigo repo. 4 roundtrips per child seem a real problem. I'm wondering why the children can't be retrieved from any mirrors? Couldn't the master already have a mirror list including the children?
The mirrors are not consulted when reading the meta-data. They are only used when downloading the found artifacts. The reason for that is that the mirrorsURL must be obtained in order to compute the mirrors list and that URL resides in the meta-data.
> 4 roundtrips per child seem a real problem. I'm wondering why the children
> can't be retrieved from any mirrors? Couldn't the master already have a mirror
> list including the children?
Using mirrors to obtain metadata burnt us in the past where the mirrors were providing bad files or outdated files, so I'm really not inclined to try something like that. Failing the download of a file or two is something that ppl can deal with / understand, but failure to load a repo is something that ppl will blame on the tool.
Created attachment 196757: Request summary on helios repository
For reference, here is a summary of requests when doing a "check for updates" on only the helios release repository. There were 60 requests in this case, although 25 are HEAD requests. About 40% were requests for p2.index files that didn't exist.
(In reply to comment #4)
> something like that. Failing the download of a file or two is something that
> ppl can deal with / understand, but failure to load a repo is something that
> ppl will blame on the tool.
I see your point - but having to wait minutes (!) just for the UI to show up so I can select what I want to install or update is also a problem that people blame on the tool. And that's not a theoretical scenario, it's happening today and people hate Eclipse for this problem. Take any other downloaded app (Skype, iTunes, Java JRE, ...), and "check for updates" is almost instantaneous. So every effort to improve this performance is appreciated. People can understand when downloading the artifacts is slow. But downloading the metadata must be fast.
Conceptually, given that children need to be registered with their composite parent already, that parent should be able to know MD5 sums of the children so it can check correctness of any child downloaded from a mirror. Also, the composite parent could know the mirror list of its children.
Should we relocate that discussion into a better bug / component?
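The idea of a parent that knows the MD5 sums of its children could be sketched roughly as follows. This is not how p2 works today; the function names and the notion of passing in a list of fetch callables (mirrors first, home site last) are assumptions made purely for illustration.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Hex MD5 digest of a downloaded metadata file."""
    return hashlib.md5(data).hexdigest()


def fetch_verified(name, expected_md5, sources):
    """Try each source in order; return the first payload whose MD5 matches.

    `sources` is a list of callables taking a file name and returning bytes,
    e.g. one per mirror followed by the home site (hypothetical arrangement).
    """
    for fetch in sources:
        try:
            data = fetch(name)
        except OSError:
            continue  # source unreachable: try the next one
        if md5_hex(data) == expected_md5:
            return data
        # Checksum mismatch: the source served a bad or outdated file.
    raise ValueError(f"no source delivered a valid copy of {name}")
```

With such a check, a stale mirror would cost one extra download instead of producing a broken repository load, which is the failure mode described in comment #4.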
Created attachment 196775: console log on second try "check for updates"
I measured once again by starting eclipse 3.6.2 classic with arguments
-consoleLog \
-Dorg.apache.commons.logging.Log=org.apache.commons.logging.impl.SimpleLog \
-Dorg.apache.commons.logging.simplelog.showdatetime=true \
-Dorg.apache.commons.logging.simplelog.log.httpclient.wire=debug \
-Dorg.apache.commons.logging.simplelog.log.org.apache.commons.httpclient=debug
and removing all p2 repos except http://download.eclipse.org/releases/helios/, then clicking Help > Check for updates.
There is a caching effect. The first time, it does the 60 requests mentioned by John. Subsequent "check for updates" runs (on a restarted IDE) only trigger a total of 16 requests. There is one GET for p2.index and one HEAD for (composite)content.jar per child repo (no more requests for artifacts.jar). Still, I think it would make sense to reduce the number of child repos.
(In reply to comment #7) This is an excellent point: It is important to only have very few metadata repositories (ideally a non-composite) because these repositories always need to be checked for modifications for every update operation. The number of artifact repository composites doesn't really matter at all in the typical case, i.e. when there are no updates. (Except in Tycho, but we can fix this -> see tycho bug 347477). This should also make this issue much easier to resolve: merging the metadata shouldn't be a big thing, whereas copying all artifacts may.
While anything we can do to improve performance is worth looking at, somehow the numbers don't add up here. 16 HEAD requests should not take 10 seconds. For example, when I reload (Ctrl+F5) the http://eclipse.org main page in my browser, it does 40+ GET operations but the page loads within a second. The discussion of whether metadata comes from mirrors seems unrelated - going to mirrors doesn't reduce round trips, it only causes the round trips to go to different servers. Whether the mirror is going to be faster than the main eclipse.org is very hard to predict. Generally the main eclipse.org server always has this data cached in memory, so it should be able to turn that data around quite quickly.
Response time can vary greatly depending on internet connection speed, firewalls, etc. For example, I asked somebody from Germany to do "ping download.eclipse.org" and average response time was ~145 ms, or ~2.3 seconds for 16 requests. For comparison, my ping from Toronto is ~20 ms or ~0.3 seconds for 16 requests.
(In reply to comment #10)
> Response time can vary greatly depending on internet connection speed,
> firewalls, etc. For example, I asked somebody from Germany to do "ping
> download.eclipse.org" and average response time was ~145 ms, or ~2.3 seconds
> for 16 requests. For comparison, my ping from Toronto is ~20 ms or ~0.3 seconds
> for 16 requests.
ICMP ping is one thing, what I did now is a small java HTTP ping that does HEAD requests (attached).
From within our corporate network in Germany, I am getting on average
~450 ms for HEAD request to http://download.eclipse.org/releases/helios/p2.index
and more interestingly, ~120 ms for HEAD request to http://eclipse.org/home/images/bullet.png (which is part of the eclipse.org main page)
This means you cannot compare the http://eclipse.org main page with the release train p2 repo. There is almost a factor of 4 in terms of ping time between them. While the bullet.png gives a quite flat line of 120 ms measurements, the p2.index measurement has several peaks, in one extreme case a single roundtrip took 17 secs.
So I end up with ~7 secs for 16 requests to the p2 repo from Germany. As I already stated in the original bug, in my opinion this is a latency problem, not a bandwidth problem. If you are close enough to the server, you don't see the problem.
Whether 7 secs is too much for a last-modified check on a p2 install/update operation may be subject to debate. For sure it gets annoying for tycho users if there is a 7 sec wait on the start of every single build.
Created attachment 196818: HTTP ping measurement
You need apache commons-httpclient 3.1 for this.
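The attached Java program is not reproduced in this thread. For readers who want to repeat the measurement without commons-httpclient, a minimal sketch of the same idea using only the Python standard library might look like this. The function names, sample count and pause are assumptions, not taken from the attachment.

```python
import statistics
import time
import urllib.error
import urllib.request


def time_head(url: str, timeout: float = 30.0) -> float:
    """Return the wall-clock seconds one HEAD request to `url` takes."""
    request = urllib.request.Request(url, method="HEAD")
    start = time.perf_counter()
    try:
        with urllib.request.urlopen(request, timeout=timeout):
            pass
    except urllib.error.HTTPError:
        # A 404 (e.g. a missing p2.index) still completes the roundtrip.
        pass
    return time.perf_counter() - start


def summarize(samples):
    """Mean, median and worst case of a list of timings in seconds."""
    return {
        "mean": statistics.mean(samples),
        "median": statistics.median(samples),
        "max": max(samples),
    }


def http_ping(url: str, count: int = 10, pause: float = 1.0):
    """Time `count` HEAD requests to `url`, pausing between them."""
    samples = []
    for _ in range(count):
        samples.append(time_head(url))
        time.sleep(pause)
    return summarize(samples)
```

Calling `http_ping("http://download.eclipse.org/releases/helios/p2.index")` and then the same for the bullet.png URL would give two summaries to compare. Note that `urllib.request` opens a new connection for every request, so each sample includes the TCP connect time.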
I'm sure the webmasters read ALL of our bugs ... but in case this one slipped past them, thought I'd add them to CC. Webmasters, I assume the network latency differences mentioned in comment #11 sound "normal"? That is, requests for *png files to "www.eclipse.org" would always be expected to be responded to faster (and more consistently) than requests to a nonexistent file on download.eclipse.org? I'd assume the www servers would have more "edge servers" than "download"? (just to guess at correct terminology). Just wanted to draw it to your attention in case it indicates any issues on your end (which I doubt). ... and so you'd know why the servers were getting all those pings from Germany :) ... I kept getting "reset connection" exceptions when I tried. :( I better quit before I get locked out.
I tried timing HEAD requests from Sweden. It's even worse with > 500 ms on average. Here's a sample curl output.

[thhal@tada ~]$ curl --head --write-out 'Namelookup: %{time_namelookup}, Connect: %{time_connect}, Starttransfer %{time_starttransfer}\n' http://download.eclipse.org/releases/helios/p2.index
HTTP/1.1 404 Not Found
Date: Sat, 28 May 2011 05:34:04 GMT
Server: Apache
Connection: close
Content-Type: text/html

Namelookup: 0.028, Connect: 0.184, Starttransfer 0.514

I don't see any difference between download.eclipse.org and eclipse.org and I don't see any difference for files that are missing or existing.
Even worse from France: Namelookup: 0.011, Connect: 0.215, Starttransfer 0.719
> While anything we can do to improve performance is worth looking at, somehow
> the numbers don't add up here. 16 HEAD requests should not take 10 seconds. For
> example I reload (Ctrl+F5) the main http://eclipse.org main page in my browser
> it does 40+ GET operations but the page loads within a second.
That is an interesting comparison to make, and I will attempt to illustrate why p2's access to download.eclipse.org is far slower than your browser's access to www.eclipse.org.
TCP clients use a three-way handshake to establish one connection: SYN, SYN-ACK, ACK.
Client -> Server: SYN "I want to talk to you"
Server -> Client: SYN-ACK "Sure." (round trip #1)
Client -> Server: ACK "Well, Ok then"
Client -> Server: "GET /releases/..... HTTP/1.1"
Server -> Client: ACK "I got your request" (round trip #2)
Server -> Client: "404 not found"
Client -> Server: ACK "Roger that" (round trip #3)
It's all rather inefficient since the client and the server waste lots of time waiting on each other's response, but it guarantees delivery. That's what TCP is all about.
I will attach 2 screenshots -- one of a browser contacting www.eclipse.org and the other of p2 talking to download.eclipse.org.
Created attachment 196894: Wireshark of p2 "Check For Updates"
This is a packet capture of my laptop performing a "Check For Updates". I've omitted the DNS frames to concentrate on the conversation between my laptop and download.eclipse.org. Notice that for each and every GET and HEAD request, the seven-packet, three-round-trip three-way handshake is established.
Connect, request, disconnect.
Created attachment 196896: Wireshark of Firefox "Shift Reload" on www.eclipse.org
Contrast to a forced refresh of my browser and www.eclipse.org. One connection is established, and multiple documents are requested within the same connection. This is much more efficient, and greatly reduces the amount of chatter on the network.
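The difference between the two captures can be put into numbers with a deliberately simplified model. It follows the accounting used in the handshake walkthrough above (one round trip to connect, two more per request) and borrows the 16 requests and the ~145 ms ping from earlier comments. The function names are made up for this sketch, and real TCP stacks do not wait on every ACK, so treat the results as an illustration rather than a prediction.

```python
def per_request_connections_ms(requests: int, rtt_ms: int, trips_per_exchange: int = 3) -> int:
    """Connect, request, disconnect for every single request."""
    return requests * trips_per_exchange * rtt_ms


def reused_connection_ms(requests: int, rtt_ms: int,
                         handshake_trips: int = 1, trips_per_request: int = 2) -> int:
    """One handshake up front, then every request shares the connection."""
    return (handshake_trips + requests * trips_per_request) * rtt_ms


# 16 requests at ~145 ms per round trip (figures quoted earlier in this bug).
print(per_request_connections_ms(16, 145))  # 6960
print(reused_connection_ms(16, 145))        # 4785
```

Even under this pessimistic accounting, reusing the connection removes 15 of the 48 round trips; the saving grows with every additional request.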
Created attachment 196897: Screenshot of Firefox' simultaneous network connections
To make matters worse (for p2/downloads), Firefox can establish multiple connections simultaneously and request multiple files/images from the server at the same time. It reassembles the out-of-sequence data on-the-fly to render the page. p2 likely fetches one document at a time, waiting for the previous transaction to complete before continuing onto the next request.
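The browser behaviour described here, several transfers in flight at once, can be sketched with a small thread pool. This is only an illustration of the technique, not p2 or Firefox code; `fetch_all`, its `fetch` callable and the default of 4 connections are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_connections: int = 4):
    """Run `fetch(url)` for every URL, up to `max_connections` at a time.

    Results come back in the same order as `urls`, even though the
    individual transfers may finish out of sequence.
    """
    with ThreadPoolExecutor(max_workers=max_connections) as pool:
        return list(pool.map(fetch, urls))
```

With `fetch` set to a function that issues one HEAD request, the wall-clock time for the whole batch approaches that of the slowest group of concurrent requests rather than the sum of all of them.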
> From Germany to do "ping
> download.eclipse.org" and average response time was ~145 ms, or ~2.3 seconds
> for 16 requests. For comparison, my ping from Toronto is ~20 ms or ~0.3 seconds
> for 16 requests.
Indeed, there's nothing like physical proximity to the servers. Anything outside of North America requires crossing the ocean to reach Eclipse.org's Canadian servers.
(In reply to comment #11)
> (In reply to comment #10)
> ICMP ping is one thing, what I did now is a small java HTTP ping that does HEAD
> requests (attached).
>
> From within our corporate network in Germany, I am getting on average
>
> ~450 ms for HEAD request to
> http://download.eclipse.org/releases/helios/p2.index
>
> and more interestingly, ~120 ms for HEAD request to
> http://eclipse.org/home/images/bullet.png (which is part of the eclipse.org
> main page)
>
> This means you cannot compare the http://eclipse.org main page with the release
> train p2 repo.
This is the next variable. Since eclipse.org does not have unlimited, unmetered bandwidth, our outgoing packets must be queued for delivery. Packets SENT from www.eclipse.org (and bugzilla, and Wiki, and just about everything except download.eclipse.org) are the first ones out. Even RSYNC data sent to our mirrors gets out *before* http downloads from download.eclipse.org, so when everyone uploads 15 Gigabytes of Release Candidate goodness, http downloads from download.eclipse.org get queued even further back.
Fun stuff, eh?
(In reply to comment #13)
> I'd assume the www servers would have more "edge servers" than "download"?
> (just to guess at correct terminology).
We're not Google. Google has servers on every street corner in every city around the world. Ok, I am exaggerating, but placing edge servers close (physically) to the clients is the only way you'll get super-low latency. Our servers are all in Canada.
To conclude this lengthy set of posts, the physical location (Canada) makes for higher latency for Europeans, Asians, Africans and South Americans. The latency is greatly amplified by the inefficiency of p2's conversation with download.eclipse.org. If p2 behaved more like a browser (ie, connection keepalives, multiple connections, simultaneous file transfers) the perceived "slowness" would largely disappear.
(In reply to comment #21) thanks Denis for this enlightening analysis. So in summary p2 should be changed to behave more like a browser. The fix in p2 will not happen for the Indigo timeframe, so it will come earliest with Juno. Can the original proposal to reduce number of child repos (which would not require fixes in p2) still be considered for indigo or is it too late already? E.g. if the p2 repo would be collapsed into one non-composite we would go down from 16 HTTP requests to 2. At least it would make the problem less severe for another year to wait until the root cause is fixed.
(In reply to comment #17)
> Notice that for each and every GET and HEAD request, the seven-packet,
> three-round-trip three-way handshake is established.
>
> Connect, request, disconnect.
I have entered bug 347669 against p2 for this. p2's transport layer (Apache HTTP client, ECF), should be keeping those connections alive across HTTP requests.
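To make "keeping those connections alive" concrete, here is a minimal sketch using Python's standard `http.client` rather than p2's actual transport (Apache HTTP client / ECF). The helper names are hypothetical. Whether the socket is really reused also depends on the server: the curl output earlier in this bug shows `Connection: close`, in which case the client reconnects for each request.

```python
import http.client


def head_many(conn, paths):
    """Issue one HEAD request per path over the same connection object."""
    statuses = {}
    for path in paths:
        conn.request("HEAD", path)
        response = conn.getresponse()
        response.read()  # finish this response before sending the next request
        statuses[path] = response.status
    return statuses


def check_repo(host, paths, timeout=30.0):
    """Open one connection to `host`, check all paths, then close it."""
    conn = http.client.HTTPConnection(host, timeout=timeout)
    try:
        return head_many(conn, paths)
    finally:
        conn.close()
```

For example, `check_repo("download.eclipse.org", ["/releases/helios/p2.index", "/releases/helios/compositeContent.jar"])` would return a mapping from each path to its HTTP status; the second path is only an illustrative guess at a file name.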
(In reply to comment #22)
> Can the original proposal to reduce number of child repos (which would not
> require fixes in p2) still be considered for indigo or is it too late already?
> E.g. if the p2 repo would be collapsed into one non-composite we would go down
> from 16 HTTP requests to 2.
> At least it would make the problem less severe for another year to wait until
> the root cause is fixed.
At Indigo release time the structure currently looks like this:
/releases/indigo - composite master repository
/releases/indigo/I2011xxxx - the repository for the indigo SR0 release
/epp/packages/indigo - composite master repository for EPP
/epp/packages/indigo/R/ - the repository for the indigo SR0 release
Markus is looking at merging the last two into a single repository in bug 347455. So the extra step we could take is merging the indigo SR0 repository with the EPP SR0 repository to create a single non-composite repository. This is theoretically possible but sounds risky - it's the kind of change we would generally want to make in a much earlier milestone to iron out any issues. Note that the extra children only come up when we release SR1 and SR2, each of which adds two extra repositories (EPP and master metadata repositories).
> So in summary p2 should be changed
Well, I didn't mean to imply that p2 should be changed. The problem we're facing is compounded and amplified by three things: latency, p2 connects/disconnects, and the number of repos. Reducing any of those three things will help the situation. Pulling metadata from mirrors, then falling back to the main site, would also be a good option, and one that would relieve traffic from the home site.
> Well, I didn't mean to imply that p2 should be changed.
I've re-read my posts in comment 16-19 and since I have a hard time understanding them myself, I felt like I should summarize some of the reasons p2 is less efficient than a browser:
- browser can establish multiple simultaneous connections to a server, download several files at the same time. p2 seems to download 1 file at a time (comment 19)
- browser can issue multiple requests in one connection. p2 seems to connect, request, disconnect, which adds a _ton_ of latency. (comment 17 & 18).
- eclipse.org's firewalls treat network traffic differently. www/bugs/wiki/cvs/svn/git traffic is sent before downloads since we have limited bandwidth, so comparing download.eclipse.org to X is not an apples-to-apples comparison
I _could_ do some packet mangling to force higher priority for metadata files out of download.eclipse.org but before doing this, two things would be really nice to have:
- p2 should reduce the overall number of trips to eclipse.org, as is suggested by this bug
- p2 should support reusing an http connection to avoid the connect/disconnect overhead.
FWIW, I cannot guarantee our firewall can even support this -- the number of simultaneous open connections may be too high -- but that is my problem :)
(In reply to comment #26)
> - p2 should support reusing an http connection to avoid the connect/disconnect
> overhead.
see bug 297742
> I _could_ do some packet mangling to force higher priority for metadata files
> out of download.eclipse.org but before doing this, two things would be really
> nice to have:
I have opened bug 358340 to track that issue/task separately.
I'd actually like to close this bug as "won't fix", mainly since there are reasons we have the main two or three composites for a release (SR0, SR1, and SR2), and getting rid of them seems to be the solution originally proposed. While some improvements could be made with "no loss of function", such as bug 347455, I don't think we can get rid of the SR0, SR1, and SR2 repositories ... at least, not easily, without a lot of work and a moderate amount of risk. The main reasons for having the separate repos for each release (and service release) mostly come down to "we are not perfect". That is, we re-create staging repositories "fresh" each time, gradually getting closer and closer to what we eventually release. Part of the reason we do those "fresh" is a fear that versioning rules are not perfectly applied. That is, sometimes, a bundle may actually have changed, but its version (and qualifier) did not. There are many ways of that happening, and to guard against it is not easy. So ... to start "combining releases", I think, there would need to be a lot of releng improvements (from all projects) to have more perfect repos and more perfectly reproducible aggregations. I'm not sure this bug was opened as that sort of massive work request ... was it? Another reason we have separate repos is so that we can easily archive them at different times (such as, we might archive SR0 leaving only SR1 and SR2 mirrored, or similar) ... I think if we had just "one big repo" we'd essentially not be able to archive anything (we'd have to leave everything mirrored). (Well ... let me be clear ... I know we _could_ put various artifacts in different places ... I mean we could not _easily_ do so.) Lastly, I'll comment, I think this bug was opened "with a solution" instead of stating the problem. The problem, as I read it, is that in many cases "parts of the p2 update process are too slow".
And, for that problem, we have actually identified a number of improvements that could be made and opened separate bugs for those ... at least those in the list I'll paste below ... hence, I think this specific bugzilla has served its purpose and can be closed as "won't fix". I'll leave it open, for right now, because I do not want to give the impression I am not open to clarifying discussion ... but, if it comes down to "someone should do more release engineering work", then someone will need to step up to the tasks.
Bugs to improve p2 performance:
bug 297742 [transport] Investigate how to maintain HTTP session
bug 347448 Add p2.index in Juno repos
bug 347455 Reduce number of child repositories of the EPP p2 repository
bug 347669 [transport] Reuse TCP connections
bug 358340 Serve up p2 metadata at high (or, normal) priority
Please clarify if I have misunderstood anything.
I have added a dependency on bug 297742 and bug 347455 and changed the title to state the actual problem. I'm fine with resolving this bug once dependent bugs are resolved.
with bug 297742 and bug 347455 fixed, we can resolve this one