| Summary: | [transport] intermittent timeout during downloading artifacts should be forgivable | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| Product: | [Eclipse Project] Equinox | Reporter: | Meng Xin Zhu <kane.zhu> | ||||||||
| Component: | p2 | Assignee: | Meng Xin Zhu <kane.zhu> | ||||||||
| Status: | RESOLVED FIXED | QA Contact: | |||||||||
| Severity: | normal | ||||||||||
| Priority: | P3 | CC: | davegarratt, helmut.haigermoser, henrik.lindberg, irbull, kane.mx, mober.at+eclipse, pascal, wolfgang.haug | ||||||||
| Version: | 3.7 | ||||||||||
| Target Milestone: | Juno M2 | ||||||||||
| Hardware: | PC | ||||||||||
| OS: | All | ||||||||||
| Whiteboard: | |||||||||||
| Bug Depends on: | |||||||||||
| Bug Blocks: | 360493 | ||||||||||
| Attachments: |
|
||||||||||
|
Description
Meng Xin Zhu
Created attachment 198377 [details]
test case
add an additional informal case to demonstrate the wget's behavior
The idea is to fail (reasonably) early at the lowest level and let the higher order logic do the work - perhaps selecting a better mirror. Way back the transport layer did perform several retries - the result was slower overall throughput as it continued to try bad mirrors when there were better options. It could be that some higher order logic needs to retry rather than just rely on the lower (transport) level. Maybe there is only one source (no mirrors), and it should try harder (i.e. call transport several times). (In reply to comment #2) > The idea is to fail (reasonably) early at the lowest level and let the higher > order logic do the work - perhaps selecting a better mirror. Way back the > transport layer did perform several retries - the result was slower overall > throughput as it continued to try bad mirrors when there were better options. > > It could be that some higher order logic needs to retry rather than just rely on > the lower (transport) level. > > Maybe there is only one source (no mirrors), and it should try harder (i.e. call > transport several times). It is a good suggestion that MirrorRequest tries other mirror sites. I think transport also should try the downloading several times(could be configurable) if socket timeout happened. Because that site almost has been the best mirror picked by higher level, the timeout probably is accidental case during a long time downloading. FWIW, I've hit this in production and it's particularly bad when you only have one mirror. I've also hit a similar problem where the download is corrupt. Since we don't retry the mirror again -- and we have no more mirrors -- we simply give up (and fail the install). I did a small experiment -- always retry at least 3 times (even if we only have 1 mirror). This improved things in practice. I did this at the mirror request level. Assigning to Juno since there seems to have multiple ppl interested in seeing this being addressed. Patch welcome. (In reply to comment #4) > I did a small experiment -- always retry at least 3 times (even if we only have > 1 mirror). This improved things in practice. I did this at the mirror request > level. That sound like the right place. The mirror logic should try very hard when there is only one mirror. When there are several mirrors however, I was thinking that the mirror layer would like to have a shorter timeout to be able to determine which one is best quicker than waiting for the quite long timeout from the first (or it should try several in parallel) . (Also: I posted this in some other issue regarding mirrors earlier; it would be very beneficial if the transport layer could be given a low speed threshold - when it hits that it would give up and let the mirror layer try another mirror). (For a single mirror it would naturally give transport 'no treshold'). The speed is already calculated, so the changes are mostly passing the information around in addition to the improvements in the mirror layer. Unfortunately I have to time available to work on this - I can help someone else if there are questions though. I have been searching for something which matches my experience with timeouts - and this is reasonably close - but I would welcome confirmation. I use Eclipse 3.6 and 3.7 on OSX Snow Leopard. When at home and I try to download new plugins or check for updates I invariably get timeouts and it's very very slow. This is despite the fact that I have a 50mb optical fibre internet link which is lightening fast. The same laptop doing the same update at work over a 3mb ADSL line works perfectly every time. We have the same router in both locations. Now - here is the crunch. If when at home I establish a VPN connection to the office and configure the link to send all my TCP/IP traffic over the link - it works ok again. To me this implies that the routing/mirrors used are different for between ISPs. I've no idea how Eclipse determines what mirrors to use. This has been the case for some months now and there was another bug 325299 which seemed to represent the problem - but it's been closed now. It's infuriating as I don't know if this is an Eclipse problem (local or remote ?) or an ISP problem or whatever ? Until I can understand what is timing out - exactly what Eclipse is trying to do at the time I wont be able to make any progress. Is this relevant to this ticket - if not can you please let me know which category of Eclipse forums/groups/projects I should log this in. Thanks Dave (In reply to comment #7) > Is this relevant to this ticket - if not can you please let me know which > category of Eclipse forums/groups/projects I should log this in. > It is not relevant to this particular issue. The problem has to do with mirror selection. There are several issues logged against p2 mirror handling. There may also be issues with the "download" script at eclipse which is involved in computing the set of mirrors that p2 gets to pick from. IIRC It is possible to get the mirrors as XML with a http reuest to see which mirrors eclipse download thinks your p2 should be contacting. I would start there, and if that looks off, contact webmaster @ eclipse. You can also try with option "nomirrors" when running from at home. Created attachment 202124 [details]
a patch
Pls help review this patch. I created it via git command 'git format-patch -M -1 <revision>', it looks like eclipse can't fully recognize it.
If downloading is failed, p2 will try the mirror sites to download that artifact if it has mirrors. So I add a retry flag(default value is off) in transport layer in case no mirror exists.
And I add two test cases, first one to verify p2 would try to use mirror site to download artifact if the first attempt meets socket timeout.
Another case is turning the flag on, it would retry given times if the timeout happened.
(In reply to comment #9) > Created attachment 202124 [details] > a patch CQ:WIND00295420 +1 :) In our environment this patch will help reduce problems in slow environments tremendously, please let us know if and when this patch can be used! :) Helmut (In reply to comment #7) > > To me this implies that the routing/mirrors used are different for between ISPs. > I've no idea how Eclipse determines what mirrors to use. This has been the case > for some months now and there was another bug 325299 which seemed to represent > the problem - but it's been closed now. > When installing plug-ins from Eclipse official site, p2 would load the the metadata of repository firstly. Then downing the binary/jars from repository that has a lot of mirror sites. It means that the metadata of repository is always downloaded from eclipse.org. If timeout happened on downloading metadata of repository, p2 would lost a lot of information and don't know the content in the repository. I think it's another point we can improve stability. I only briefly looked at the patch (and I may have missed it, but) I wonder where the RETRY on the IStatus that is expected in order to inform the higher layer about the condition comes from? Created attachment 202144 [details]
can be applied to workspace
update one line
(In reply to comment #12) > I only briefly looked at the patch (and I may have missed it, but) I wonder > where the RETRY on the IStatus that is expected in order to inform the higher > layer about the condition comes from? upload a new patch can be applied to workspace, and copy some snippet of RepositoryTransport.java to here, + private static boolean isForgiveableException(Throwable t) { + if (t instanceof SocketTimeoutException) + return true; + else if (t instanceof SocketException) + return true; + return false; + } + public static DownloadStatus forStatus(IStatus original, URI toDownload) { Throwable t = original.getException(); + if (isForgiveableException(t) && original.getCode() == IArtifactRepository.CODE_RETRY) + return new DownloadStatus(original.getSeverity(), Activator.ID, original.getCode(), original.getMessage(), t); return forException(t, toDownload); } public static DownloadStatus forException(Throwable t, URI toDownload) { + if (isForgiveableException(t)) { + String value = System.getProperty(TIMEOUT_RETRY); + if (value != null) { + try { + int retry = Integer.valueOf(value).intValue(); + if (retry > 0) { + Integer retryCount = null; + if (socketExceptionRetry == null) { + socketExceptionRetry = new HashMap<URI, Integer>(); + retryCount = new Integer(1); + } else { + Integer alreadyRetryCount = socketExceptionRetry.get(toDownload); + if (alreadyRetryCount == null) + retryCount = new Integer(1); + else if (alreadyRetryCount.intValue() < retry) { + retryCount = new Integer(alreadyRetryCount.intValue() + 1); + } + } + if (retryCount != null) { + socketExceptionRetry.put(toDownload, retryCount); + return new DownloadStatus(IStatus.ERROR, Activator.ID, IArtifactRepository.CODE_RETRY, + NLS.bind(Messages.connection_to_0_failed_on_1_retry_attempt_2, new String[] {toDownload.toString(), t.getMessage(), retryCount.toString()}), t); + } + } + } catch (NumberFormatException e) { + // ignore + } + } + } Thanks for the follow up, now I see where it comes from. That looks like a decent approach (it is in good company with logic of similar kind). So +1 for me on the approach. (In reply to comment #15) > Thanks for the follow up, now I see where it comes from. That looks like a > decent approach (it is in good company with logic of similar kind). > > So +1 for me on the approach. Henrik, thanks for your review. I'll commit it if there is no more objections. fixed in origin/master Just checking ... could this be backported to 3.7.2 ? In my own experience, installing things via P2 is a real problem with Indigo when I'm on a very slow or unreliable network. I've repeately seen the whole install break due to timeouts ... and there is no workaround available. It has been backported to 3.7.2 via bug 360493. (In reply to comment #18) > Just checking ... could this be backported to 3.7.2 ? > > It has been backported to 3.7.2 via bug 360493. Cool ... thanks ! I suggest adding bug 360493 in the "blocks" field of this bug then, in order to link the original fix and the backport. Many projects do it like this and I've always found that link very helpful. Thanks Martin |