| Summary: | Callisto updates fail with odd network related errors | | |
|---|---|---|---|
| Product: | [Eclipse Project] Platform | Reporter: | David Williams <david_williams> |
| Component: | Update (deprecated - use Eclipse>Equinox>p2) | Assignee: | Platform-Update-Inbox <platform-update-inbox> |
| Status: | RESOLVED DUPLICATE | QA Contact: | |
| Severity: | major | | |
| Priority: | P3 | CC: | bjorn.freeman-benson, cdtdoug, ed.burnette, francois, gunnar, jeff.myers, kim.moir, mike.milinkovich, pascal, pombredanne, webmaster |
| Version: | 3.1.2 | | |
| Target Milestone: | --- | | |
| Hardware: | PC | | |
| OS: | Windows XP | | |
| Whiteboard: | | | |
| Attachments: | | | |

Description
David Williams
Denis, I'm CCing you on this bug for your awareness of these tests. If my guess is correct, and connections are being left open, that might look odd (or even be harmful?) to Eclipse infrastructure if keep-alive connections are being left open on that end ... so, please let us know if you are seeing anything detrimental (and we could at least scale back the test if required). Kim, CCing you as I think you are doing similar, smaller scale tests, just for platform.

Thanks for adding me to this. KeepAlives are enabled on both www.eclipse.org and download/update, but their idle timeout is fairly low (6 seconds). After that, Apache kills the connection. I haven't seen anything unusual at my end. One thing to check is whether UM creates a new "internal" connection when it gets a redirect. If that had been the case, I think we would have seen it sooner than today, because all of the ibm.com Update Manager accesses have been using redirects for years now. The only difference is that your site.xml is amazingly huge.

D.

David, if you go through our VPN you are not redirected to the IBM mirrors. At least for us in Canada, this is the case.

Thanks Kim. I've tried it both ways (internal network and completely external ... unless Eclipse has some special mirroring with Time-Warner Cable :) Same symptoms in both cases. Oh, and a reboot didn't help, but I also didn't see the handles in the svchost.exe process increase above a couple of thousand ... so, I guess that 35,000 number was a red herring ... from some other buggy software :( and not related to this problem.

(In reply to comment #3)
> The only difference is that your site.xml is
> amazingly huge.

Yes, and just to clarify, this site.xml is a quickly-cobbled-together listing of all features and plugins of the 3.1 related updates (where the "listing" is archive tags associating a "request path" with a "redirect URL", which is redirected to the nearest mirror that contains the file) ...
I don't know if that size is related to this problem, but I hope not. I figured it would be a good "stress test" if nothing else ... since I suspect the final Callisto site.xml might be smaller, say, 1/4th of this one ... but ... not an order of magnitude smaller.

Note: I've now moved this test to http://download.eclipse.org/callisto/testUpdates/ (since it now exists, and I need the other site for other work). Also, I've removed some features, so the current set installs for me (at least, it usually does, but I've only tried a few times). I can easily add more features back, if needed for testing/diagnosis, but I intend to prepare a more "selectable" set before long, since this performance problem is blocking forward progress on any sort of large update plan.

This is a pretty big site; I don't think we ever tested on such a big site. How many features were selected for installation? Can you tell us more about the server?

David, I have an idea that might be worth investigating if you find that UM doesn't deal with redirects gracefully.

Assume site.xml, currently a static file, contains 300 url="" entries. For each of those URLs, we're hitting eclipse.org to get the best mirror, then redirecting UM to the mirror site for the download (for a total of 601 actual HTTP requests, including site.xml).

Why don't we simply send the client a dynamically-generated "site.xml" when site.xml is requested? Say I transparently intercept an Update Manager request for site.xml, and pass that on to a PHP script. The static site.xml can be processed, sending it as-is back to the browser, but replacing each url="...?r=1&file=..." with url="http://bestmirror.com/thefile.jar" as it's parsed. The result is one request for site.xml containing a bunch of url="" statements that UM can access directly - no redirects.

What do you think? The major drawback is that I wouldn't track the actual 300 downloads for stats... But the big plus is that it's more efficient for large UM sites.

D.
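
The URL rewriting proposed in the comment above can be illustrated with a short sketch. This is not the eclipse.org implementation (the comment suggests a PHP script); it is a minimal Python illustration under stated assumptions: `rewrite_site_xml` and `pick_mirror` are hypothetical names, and `pick_mirror` stands in for whatever mirror-selection logic the server already has.

```python
import re
from urllib.parse import urlparse, parse_qs
from xml.sax.saxutils import quoteattr

# Matches url="...download.php?..." attributes, i.e. the per-file redirect form.
_REDIRECT_URL = re.compile(r'url="([^"]*download\.php\?[^"]*)"')


def rewrite_site_xml(xml_text, pick_mirror):
    """Return xml_text with every download.php redirect URL replaced by a
    direct URL, so the client needs no per-file redirect.

    pick_mirror is a hypothetical callable: it takes the value of the
    'file' query parameter and returns the full URL on the chosen mirror.
    """
    def replace(match):
        # Inside an XML attribute the '&' is normally written as '&amp;'.
        url = match.group(1).replace("&amp;", "&")
        query = parse_qs(urlparse(url).query)
        if "file" not in query:
            return match.group(0)  # not a redirect we understand; leave as-is
        direct_url = pick_mirror(query["file"][0])
        return "url=" + quoteattr(direct_url)

    return _REDIRECT_URL.sub(replace, xml_text)
```

A call such as `rewrite_site_xml(text, lambda path: "http://mirror.example.org/eclipse" + path)` leaves relative `url` attributes untouched and rewrites only the redirect form. As the comment notes, the trade-off is that the individual downloads are no longer visible to the server for statistics.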
That's a great idea. However, it might also be worth checking that the site.xml is only obtained once.

(In reply to comment #8)
> This is a pretty big site; I don't think we ever tested on such a big site.
> How many features were selected for installation?
> Can you tell us more about the server?

I've nearly forgotten what's at that test site, but I do recall getting JDT, PDE, WTP, GMF, EMF, GEF, JEM to install fairly well, but whenever I tried to "grow" it to include anything from DTP, BIRT, TPTP, VE I would start getting the "funny errors". So, I'd say that's about "half" of what we need for Callisto.

The server? I can't say much ... Eclipse.org ... I tried from within IBM (which I think always goes to a fullmoon mirror) and from my "personal" ISP (Time-Warner Cable, 3Mbs) with basically the same results.

(In reply to comment #10)
> That's a great idea. However, it might also be worth checking that the site.xml is
> only obtained once.

Not only is that a great idea, it has always been supported by Update :-). Update has support for custom site types, of which the XML-backed one is just the default. It is possible to register a new site type (and an ISite object factory) so that features and other entries of the ISite object are obtained dynamically instead of from a static XML file.

(In reply to comment #9)
>
> Why don't we simply send the client a dynamically-generated "site.xml" when
> site.xml is requested?
>

Well, the stats are pretty important! I've always assumed that at some point this would all be done more dynamically, but assumed it was better to start off statically and go from there. Also, there were two advantages to the "semi-static" approach, in addition to better stats. 1. It could fit into the mirrors system better (since the site.xml itself could be replicated) and 2. There's a certain comfort to a static file, in that you can have it, test it, feel somewhat warm and fuzzy with it, and then deploy when ready.
None of these rule out your ideas, but that's just to explain my thinking. Perhaps you could "prototype" a dynamic site.xml system ... if you were volunteering :) And ... recall ... my intent was not to test frequent redirects, my (original) intent was to just start testing some big honking updates! :)

If you are interested, there was another, more "selectable" prototype, still implemented via the same XML format and individual redirects, but at
http://download.eclipse.org/callisto/testUpdates/proto2.xml
If I recall, it has various pieces of JDT, PDE, WTP, EMF, GEF, and JEM. The few times I tried it, it always succeeded (but its intent was more for high level UI review ... not so much "load testing").

So, how can we make progress on this bug? Can others reproduce it?

If it helps, I've done another small test, one with "direct" http references, another using the "nearest mirrored" form. It shows that one has the problem and the other does not. This test uses only EMF "pre-reqs" for a 3.1.1 platform. I was hoping to use this for our WTP 1.0.1 update site, so, since I can not, it truly is blocking us on the 3.1.x streams and I'll have to find another solution ... I guess point everyone directly to eclipse.org? Since not all WTP pre-req'd products are on all mirrors?

So, the two test sites can be seen using
http://download.eclipse.org/callisto/testUpdates/site-emf-direct.xml
or
http://download.eclipse.org/callisto/testUpdates/site-emf-mirrored.xml

It seems to me the first question is: does this fail truly due to an issue in update manager, or could it be that the "nearest mirror" script sometimes fails, or takes too long to return a response? Anyone know how "we" can tell? Any debug flags to turn on, or anything?
The "nearest mirror" form fails at different points, with errors such as:

    !ENTRY org.eclipse.update.core 4 0 2006-02-21 14:05:51.310
    !MESSAGE Unable to retrieve remote reference "http://www.eclipse.org/downloads/download.php?r=1&file=/tools/emf/updates/plugins/org.eclipse.emf.ecore.sdo.doc_2.1.1.jar". [Operation timed out: connect: could be due to invalid address]
    !STACK 0
    java.net.SocketException: Operation timed out: connect:could be due to invalid address
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:372)

To help troubleshoot, may I recommend the following:
- add a reference to a jar file that you know for a fact doesn't exist, on a server that is known to respond (for instance, download.eclipse.org/somedarnupdate.jar), to see how UM handles 404 - file not found
- add a reference to a jar file on some bogus site (for instance, alakazam.eclipse.org/someupdate.jar), to see how UM handles a host not being there
- put a jar file in a location that typically returns "forbidden", to see how UM handles a typical 4xx forbidden code - you may have to poke around for this

D.

And, do you mean these for "nearest mirror" form? Or direct http form?

Granted, I was not intentionally testing "not found", but due to typos I've often seen the "not found" case (first in your list) ... and UM doesn't handle it too well, but it does print a 404 not found error in the console. (Guess maybe you wanted to compare exact errors?) I do know, too, that if there's absolutely nothing valid, UM does not even get to the "select" dialog ... it gives an error that "no valid features found on site".

The "nearest mirror" errors I am seeing with the provided tests occur after the selection list: UM does indeed start validating and downloading some updates, and then fails later due to what's in an archive tag.

(In reply to comment #17)
> And, do you mean these for "nearest mirror" form? Or direct http form?
I meant directly, so you can see how UM deals with errors.

The fun part of dealing with mirrors is that they are wildly unpredictable. There was a time when download.php used to attempt to fetch the remote file as a sanity check before sending the redirect to the client (making the process 100% accurate) but the cost of doing so (in time and cycles) was astronomical.

If UM doesn't gracefully deal with http errors, then I would recommend putting all the callisto files in the same directory and using the traditional <site mirrorsURL=""> as this has been proven to work so far - but even that is not foolproof. This does mean duplicate files, but whatcha gonna do.

D.

Just FYI ... I see the "copy all to Callisto" to be a good last-resort solution that solves the Callisto specific problem, but does not really solve (one of) the problems that Eclipse.org has in general, that's how "other" projects (even non Callisto projects) can refer to other pre-req'd projects in their own update sites ... to get pre-reqs more automatically, without them telling their users to simply go through several, multiple sites and several, multiple installs. Currently, those "third party" adopters have no choice but to do that, or to refer directly to eclipse.org ... not making use of mirrors. AFAIK.

I was hoping Callisto would become a "best practice" way of doing update sites ... and everyone making copies of everything is not a best practice.

So, I'm still holding out that UM and the "nearest mirror" script can be made more robust.

(In reply to comment #19)
> Just FYI ... I see the "copy all to Callisto" to be a good last-resort solution
> that solves the Callisto specific problem, but does not really solve (one of) the
> problems that Eclipse.org has in general, that's how "other" projects (even non
> Callisto projects) can refer to other pre-req'd projects in their own update
> sites ...
> to get pre-reqs more automatically, without them telling their users
> to simply go through several, multiple sites and several, multiple installs.

I completely agree with this. I've already opened bug 111730 and bug 106281 a while back which request such capabilities. I think bug 115042 would also be a nice enhancement.

(In reply to comment #18)
> There was a time when download.php used to attempt to fetch the remote file as
> a sanity check before sending the redirect to the client (making the process
> 100% accurate) but the cost of doing so (in time and cycles) was astronomical.

BTW, did you mean "get the remote file" completely? I can see why that would be expensive! But what about smaller, quicker requests ... such as making a HEAD request instead of a GET request, and just verifying that the size reported is > 0, or similar? Think this would improve accuracy? (If, indeed, this is even the problem in this case.)

> BTW, did you mean "get the remote file" completely?
No, I worded that really badly. Sorry.
We did what you said. It was absolutely accurate, but the extra overhead (small, but multiplied by the number of downloads we get per minute) made our site slow and sluggish so I stopped.
D.
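
For readers unfamiliar with the HEAD-request check discussed above, here is a minimal sketch of what such a probe could look like. It is not the code download.php used; `mirror_has_file` is a hypothetical name, the 5-second timeout is an arbitrary assumption, and the `opener` parameter exists only so the function can be exercised without a network.

```python
import urllib.request


def mirror_has_file(url, timeout=5.0, opener=urllib.request.urlopen):
    """Probe a mirror with a HEAD request instead of a full GET.

    Returns True only if the mirror answers 200 within the timeout and
    reports a Content-Length greater than zero. Any connection failure,
    timeout, HTTP error, or malformed length counts as "not available".
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request, timeout=timeout) as response:
            length = response.headers.get("Content-Length")
            return response.status == 200 and length is not None and int(length) > 0
    except (OSError, ValueError):
        # urllib.error.URLError, HTTPError and socket timeouts are all OSError.
        return False
```

Even a probe this small costs one extra round trip per download, which matches the overhead described in the comment above, and a successful probe says nothing about whether the mirror is still reachable a moment later.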
BTW, a bit more "data". After seeing this problem in update manager, I tried simply pasting a specific request in my browser ... and it gave a little better error message :)

"The server at mirrors.cat.pdx.edu is taking too long to respond."

The URL I was attempting to use was
http://www.eclipse.org/downloads/download.php?r=1&file=/eclipse/updates/3.2milestones/plugins/org.apache.ant_1.6.5.200602171115.jar

The URL I was being redirected to was
http://mirrors.cat.pdx.edu/eclipse/eclipse/updates/3.2milestones/plugins/org.apache.ant_1.6.5.200602171115.jar

I tried a number of times, over and over; sometimes it worked and I got a jar (from another site) but sometimes it failed ... always at this same site!

I hope this whole problem isn't just due to one or few "bad mirrors" out there.

Is there anything "we" do to ensure the mirror servers match the data returned from the nearest mirror script? You know, a daily check or something? Seems like a "HEAD" request for file size would catch this sort of unresponsive problem. But, if done in "real time", the script would have to be "smart" enough to still give a fast response if the first HEAD request didn't respond quickly ... I guess just move down the list and try the next? Or, I'm wondering if the nature of this problem is such that mirror sites sometimes go down for a while, so we would not have to literally check during each request ... but once every hour or something?

Thanks for any info.

> I hope this whole problem isn't just due to one or few "bad mirrors" out there.

Likely. Like I said in Comment 18, the fun part of dealing with mirrors is that they are wildly unpredictable. The server can, and does, weed out the bulk of the problems by polling each mirror every hour to make sure they're up to date.

> Is there anything "we" do to ensure the mirror servers match the data
> returned from the nearest mirror script?

Absolutely not.
In fact, any one mirror could introduce backdoors and worms into the code and repackage the JAR files. That's why you run md5 checksums. Don't expect the server to run md5 sums against mirror files - we might as well just serve the files ourselves, and even then it wouldn't account for any transmission errors from us to you.

> You know, a daily check or something?
> Seems like a "HEAD" request for file size would catch this sort of unresponsive
> problem.

This solves nothing, because any mirror could drop off the planet at any time. They can delete any file at any time. Determining correct transfer and transfer integrity is a task best suited for the client.

> But, if done in "real time", the script would have to be "smart"
> enough to still give a fast response, if first HEAD request didn't respond
> quickly

How can eclipse.org servers, in Canada, possibly determine that transfer between a mirror in Germany and a client in the UK is fast enough?

The server is currently doing everything it realistically can to ensure you get a good mirror, but like any distributed system, it's not guaranteed to work 100% of the time. I believe the client needs to handle connection timeouts and File Not Found errors gracefully. Verifying data integrity is another bag of beans.

D.

(In reply to comment #24)
> > Is there anything "we" do to ensure the mirror servers match the data
> > returned from the nearest mirror script?
>
> Absolutely not. In fact, any one mirror could introduce backdoors and worms
> into the code and repackage the JAR files. That's why you run md5 checksums.

Sorry, I wasn't clear. I didn't mean that type of data. I meant the redirection host name (as 'data') ... the "nearest neighbor" script is telling us to go to "mirrors.cat.pdx.edu" ... but it is "obvious" that it can not even be connected to, so I'd call that "bad data".

> > > Seems like a "HEAD" request for file size would catch this sort of
> > > unresponsive problem.
> >
> > This solves nothing, because any mirror could drop off the planet at any time.
> > They can delete any file at any time. Determining correct transfer and transfer
> > integrity is a task best suited for the client.
>

I do not think it's all one or the other. I suspect both the mirror scripts AND UM need to be made more robust. You can imagine that if UM keeps asking for a URL, and the "nearest mirror" script keeps returning an unresponsive host, e.g. "mirrors.cat.pdx.edu", then no progress can ever be made by UM. It's one thing to handle some occasional odd file deletion, etc., but it seems our problem is one of probability over many attempts ... and, for the number of requests that Callisto will need (under the current plan), the probability of failure during any one of the hundreds of requests (for one whole install) is near 1.

But ... you do say the server is "polling each mirror every hour to make sure they're up to date" ... so, maybe there is no more robustness that can be achieved in the mirroring system? If, hypothetically, someone were to write a simple program, independent of UM, to repeatedly ask for the nearest mirror and then ask that mirror for a file, and we checked if the file was retrievable, what would you expect the error (failure) rate to be? That is, at what level of failure would you agree there was a problem with the mirroring system? (I'm just asking since I have no experience in this area.)

Similarly ... UM should retry a certain number of times and if the connection is bad ... then what? Or ... if the file has been deleted from the mirror, then what? Default back to eclipse.org? That might work. Then a high failure rate in mirrors just gets translated to higher bandwidth on eclipse.org ... but that is kind of the correct fallback, I think. Right?

> what would you
> expect the error (failure) rate to be?

2.2% failure rate, measured in January 2005.

> Or ... if the file has been deleted from the mirror, then what?
> Default back to eclipse.org?

+1 !!!
I think this is bang on - totally acceptable, and most likely the simplest solution of them all. UM could even have a retry threshold before falling back to the main server, but that's getting fancy.

D.

So, as we near the critical point for Callisto M5, the alternatives are
1. Do not use a "nearest mirror" URL for each jar file, but instead use only one mirror list at the beginning, and let the user pick one.
2. Do not use mirrors at all. (I'm sure this sounds outrageous to some, but ... for a milestone? Think there's a chance? If extra bandwidth is purchased?)

I do not much like option 1, since on the surface it appears to require a big mass copy of all the projects' individual update sites to the callisto directory. Not much of an achievement, if you ask me.

But, maybe there's an alternative to the mass copy. If we had something like bug 123009 implemented by the download.php script, then we could handle it "internally" with Callisto's site.xml file -- that is, still one mirror choice at the beginning, but we'd know that mirror contained every project needed by Callisto, and then archive tags could use simple relative URLs, as we do in the WebTools project.

Or, similarly, we could send a note to our beloved mirror partners and ask them (or specify to them) that "mirroring /download.eclipse.org/callisto/...." MEANS THAT they have mirrored the 10 projects of Callisto, in a matching (relative) directory structure.

Sound doable?

This might be a good time (Tuesday morning) for core teams to observe updating from the Callisto update site.
http://download.eclipse.org/callisto/releases/

Perhaps the servers are busy ... but status doesn't seem to show it.
https://dev.eclipse.org/committers/help/status.php

Every time I try to install something it fails (well, 80% of the time). The error is always either "access forbidden" or "not found" ... and always a different jar.

Changed title to reflect that this is no longer a simulation. (Just pre-release.)

Created attachment 35857
Bjorn's Mar 7 Callisto install problems log
I'm having the same seemingly random problems with installing. I've attached some notes of the problems I found along with screenshots of what I saw.
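
Seemingly random failures like these are what the earlier "probability over many attempts" argument predicts. As a rough illustration only, the sketch below combines the 2.2% mirror failure rate quoted earlier in this thread with the hypothetical 300 archive URLs from the dynamic site.xml comment. The function name is made up for this example, and it assumes each request fails independently, which real mirror outages may not satisfy.

```python
def install_failure_probability(per_request_failure_rate, request_count):
    """Probability that at least one of request_count requests fails,
    assuming every request fails independently at the same rate."""
    return 1.0 - (1.0 - per_request_failure_rate) ** request_count


# Figures taken from this thread: 2.2% mirror failure rate, about 300 jars.
p = install_failure_probability(0.022, 300)
print("chance that at least one download fails: %.4f" % p)
```

Under these assumptions the result is above 0.99, so an install that needs every one of several hundred mirrored downloads to succeed on the first attempt will almost always hit at least one failure, even though each individual request usually works.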
Given the randomness of the failures, I would think that if the Update Manager did a retry on an IO failure it would finally download the whole thing. Or does it retry already?

From what it looks like to me, the server is throttling back the client (for some reason).

Doug, DJ and I are actively looking into this area. There are already multiple bug reports about that. This looks to me like two problems:
1. UM is not reporting errors correctly, or not at all, which we should fix.
2. The server is not working correctly. In one of the attachments (Bjorn's) I saw that we were getting HTTP 403, which should not happen, and somewhere else I saw other similar status codes. If you try to download each one of these technologies from its own update site you have no problem, which, from what Bjorn showed, is not true with the Callisto site.

What UM should do in the second case is read the HTTP status code and, for most 4XX and 5XX codes, log it and present it to the user. There is nothing else we can do, since this is some kind of internal error on the server. If UM gets a connection time-out then we should give the user an option to restart downloads in the future, not right there. I say in the future, not right there, since the likely cause of a time-out is either a) the server being too busy, which means that if we were to restart the download instantly we would just keep throttling the server, or b) a network problem. In this case chances are that network problems will go away in the future, but not right then.

In general UM should be more descriptive about problems it encounters. This is important since users are now left in the dark, which breeds frustration on their side.

I have opened bug #131025 to ask the assistance of the eclipse webmaster on the 403 errors.

I'm moving this to "major" instead of "blocker" since we have worked around this issue in Callisto, partially by having copies of all jars in a common directory, so users are able to pick one mirror, and as long as that's a good mirror, things should work.
I think the focus of this bug should now be that update manager needs some relatively user friendly way to recover from "not found" or "access forbidden" errors ... such as, perhaps, "pick another mirror and retry", or similar. Without such a fix, a "bad mirror" does become a blocking problem for that user, and currently there are few diagnostics or hints to the user as to what the problem might be or what an appropriate remedy might be.

*** This bug has been marked as a duplicate of 144876 ***
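
The recovery behaviour asked for in this thread (retry, try another mirror, then default back to eclipse.org) can be sketched roughly as follows. This is an illustration only, not Update Manager code: `fetch_with_fallback` is a hypothetical name, `fetch` is an injected callable assumed to raise `OSError` on any network or HTTP failure, and the default of two attempts per host is an arbitrary choice.

```python
def fetch_with_fallback(path, mirrors, origin, fetch, retries_per_host=2):
    """Try each mirror in order, then the origin server as a last resort.

    path     -- file path relative to each host, e.g. "/plugins/a.jar"
    mirrors  -- base URLs of candidate mirrors, best first
    origin   -- base URL of the main server (the final fallback)
    fetch    -- hypothetical callable: takes a full URL, returns the data,
                raises OSError on timeouts, 403/404 responses and the like
    Returns (host, data) for the first host that succeeds.
    """
    errors = []
    for host in list(mirrors) + [origin]:
        for attempt in range(retries_per_host):
            try:
                return host, fetch(host + path)
            except OSError as exc:
                # Keep the error so the user can be told what went wrong.
                errors.append((host, attempt, exc))
    raise OSError("all hosts failed for %s: %r" % (path, errors))
```

The collected `errors` list is what would let a client show a descriptive message instead of leaving the user in the dark. Note that this sketch retries immediately; as pointed out earlier in the thread, an instant retry against a busy server may only add load, so a real client might prefer to delay or to let the user restart the download later.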