| Summary: | [repository] Download does not try hard enough |
|---|---|
| Product: | [Eclipse Project] Equinox |
| Component: | p2 |
| Status: | RESOLVED FIXED |
| Severity: | major |
| Priority: | P3 |
| Version: | 3.7 |
| Target Milestone: | 3.7 RC2 |
| Hardware: | PC |
| OS: | Mac OS X - Carbon (unsup.) |
| Reporter: | Pascal Rapicault <pascal> |
| Assignee: | Pascal Rapicault <pascal> |
| CC: | david_williams, irbull, jeffmcaffer, tjwatson |
| Flags: | irbull: review+, tjwatson: review+ |
Description

Pascal Rapicault:

I investigated a bit, and what is killing us here is the counter we added to stop the infinite download. The situation goes as follows: there is a list of mirrors (more than MAX_DOWNLOAD) and none of them (except the last one, which is the canonical site) contain the artifact we are looking for. Because we stop the download after MAX_DOWNLOAD attempts, we never get a chance to try the original site.

Ah, here is a test (that probably needs to be made to work with local data) and a patch. The patch consists of using the sourceArtifactDescriptor to communicate to the artifact repository that we want the artifact to be returned from the main repository and to ignore mirrors. I feel that we are abusing the role of the ArtifactDescriptor, but I could argue either way :)

Created attachment 195016 [details]
Test and patch
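The failure mode described above can be sketched in a few lines. This is an illustrative reconstruction, not the real p2 code: the class, method, and mirror names are invented, and only the interaction between the attempt cap and the position of the canonical site is taken from the report.

```java
import java.util.List;

// Hypothetical sketch of the bug: a per-request attempt counter (added to
// guard against an infinite retry loop) gives up before the canonical site,
// which is always the last entry in the mirror list.
class MirrorSelection {
    static final int MAX_DOWNLOAD = 10; // the guard discussed in this bug

    // Returns the location the artifact was fetched from, or null if the
    // attempt cap was hit first.
    static String download(List<String> mirrors, String artifactId) {
        int attempts = 0;
        for (String mirror : mirrors) {
            if (attempts >= MAX_DOWNLOAD) {
                return null; // cap reached: the canonical site is never tried
            }
            attempts++;
            if (hasArtifact(mirror, artifactId)) {
                return mirror;
            }
        }
        return null;
    }

    // Stand-in for a real transfer: only the canonical site has the file.
    static boolean hasArtifact(String mirror, String artifactId) {
        return mirror.equals("canonical");
    }
}
```

With more than MAX_DOWNLOAD mirrors ahead of the canonical site, the loop always returns null even though the artifact is available at the last location.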
*** Bug 345086 has been marked as a duplicate of this bug. ***

FWIW, IMHO, I'd count this as "major" severity ... loss of function ... of course, it is "my" XML Editor ... :)

Just FYI, the "infrastructure bug" which made this bug noticeable in the first place is now fixed; see bug 345078. This means the bug may not be as easily reproducible in the originally reported scenario ... but it is still a bug and, I think, still a major regression that could show up in other scenarios and/or on other sites.

The patch looks reasonable from a functional point of view. "root_only" as a boolean/existence flag is awkward. This property will end up being SPI or some such that repo implementors need to know about. So what are the semantics that we are after here? AFAIK there is no concept of "root" currently. Are we saying "no mirrors" (I don't like negative statements)? Or "this is the last try, make it good"? Or is this a "fetch mode" or "QoS" attribute? Personally I like the last option, as it is the most flexible. Other repos may not have mirrors/roots/base locations per se, but may still have variable reliability in terms of their ability to supply the desired artifact. We want to tell them: get it now if you can. For now we would have one fetch mode or QoS setting, something like "reliability". The logic would be basically the same: if the QoS property is not set, do whatever you want in the repo; if it is set to "reliability", then try your hardest to actually get the bytes reliably. Other than that, having some actual constants and a code comment saying why we are setting the descriptor property would be good.

Created attachment 195898 [details]
New patch
Attached is a new patch with a test. I've already committed the test data to CVS since it contained some binaries.
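The "QoS" idea from the review comment above could look something like the following. This is a hedged sketch under assumed names: the property key, constant, and class are all invented for illustration, not part of any real p2 API.

```java
import java.util.Map;

// Hypothetical sketch of a quality-of-service request property: rather than
// a bare "root_only" flag, the request carries a QoS hint that a repository
// may interpret however suits it. All names here are invented.
class QosAwareRepository {
    static final String PROP_QOS = "download.qos";   // assumed property key
    static final String QOS_RELIABILITY = "reliability";

    static String chooseLocation(Map<String, String> requestProperties,
                                 String canonicalUri, String mirrorUri) {
        // Caller asked for reliability: try hardest, i.e. skip mirrors.
        if (QOS_RELIABILITY.equals(requestProperties.get(PROP_QOS))) {
            return canonicalUri;
        }
        return mirrorUri; // default: whatever the repository prefers
    }
}
```

The appeal, as the comment notes, is that repositories without mirrors can still honor the hint in whatever way makes sense for them.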
I've looked at the patch and fixed up the test case so it now works. The patch solves the problem; however, I'm worried about using the descriptor's properties in this way. In particular, I'm worried about two things:

1. We might end up copying this property to a target descriptor.
2. We might end up saving this property in our source descriptor (in the local cached version of the artifact repo).

However, I can't actually reproduce a test case that does either of these things, but if this happened, it's not clear what sort of side effects we would see. I'm wondering if we can somehow introduce the notion of 'transient' descriptors, ones which will never be saved. Let me see if I can do that in a NON-API way.

Created attachment 196054 [details]
Patch v3
After some more testing, I realized that if you mirror an artifact, and end up on the last mirror, the SKIP_MIRROR flag will now be set on the source descriptor. If you query the repo again, this property is still set.
Now, I'm not sure that this will have any negative consequences, but I'm still not comfortable with the side effect.
I've taken a slightly different approach, I've added the concept of 'transientProperties' to the ArtifactDescriptors. These can be used for exactly this purpose (setting properties that you never want to persist). ArtifactDescriptors are SPI, so I'm not sure what implications this will have in terms of 'adding new API post API-Freeze'.
Also, I can be convinced that we should clear this property in a finally block.
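The transient-properties idea from patch v3 amounts to a second property map that is consulted for in-memory lookups but excluded from persistence. A minimal sketch, assuming invented names (this is not the real p2 SPI):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of "transient properties" on a descriptor: they behave
// like normal properties for lookups but are never written to disk, so a
// flag like SKIP_MIRROR cannot leak into a saved repository index.
class DescriptorSketch {
    private final Map<String, String> properties = new HashMap<>();
    private final Map<String, String> transientProperties = new HashMap<>();

    void setProperty(String key, String value) {
        properties.put(key, value);
    }

    void setTransientProperty(String key, String value) {
        transientProperties.put(key, value);
    }

    // Transient values shadow persistent ones for in-memory lookups.
    String getProperty(String key) {
        String v = transientProperties.get(key);
        return v != null ? v : properties.get(key);
    }

    // Only the persistent map is handed to the serializer.
    Map<String, String> persistableProperties() {
        return new HashMap<>(properties);
    }
}
```

This addresses both worries above: a copy or save that goes through the persistable view never sees the transient flag.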
I agree with the concern that the property is kept in the descriptor. BTW, does it end up being written to disk?

My only concern with the new approach is the introduction of new SPI at this stage. What I would suggest is that instead of making the new methods public, we make them protected or private and use reflection to access them. This is really ugly, but at least it means we don't have to cope with an API addition, an API which I'm not sure I like either, but we don't have much time to review it in depth. After this change, we are good to go for RC2.

Created attachment 196073 [details]
Patch v4
I've talked to Pascal and we want to do something here. However, this patch adds new public methods to SPI (API?) classes. Since the classes are technically in an 'exported' package, they are (by definition) API.
Here are our options:
1. Add the methods as private and access them using reflection -- I really don't like this approach because our software design is being based on end game 'rules' and not sound engineering principles. I will do this if there is no other option.
2. Add public methods and mark them @noreference / @nooverride. Does this make them "non API"? I think so, Pascal wasn't so sure.
3. Use a different approach to this problem -- increase our retry limit to something stupid (like 1000). That way we protect against the infinite loop, but we try all mirrors. (except when we have 1001 mirrors :P).
4. Go back to the original idea and use the descriptors. I'm not comfortable with that, but if others are happy with that approach then I'll back down.
5. Do nothing! This was actually a problem with our mirrors and that's been fixed.
This patch uses #2 -- the public methods with the API annotations. I've also reset the flag in a finally block so we don't leave the transient property lying around.
BTW, the end-game rules state that we also need another reviewer on this. Maybe Jeff can help.
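Option #2 together with the finally-block reset can be sketched as follows. The `@noreference` Javadoc tag is a real Eclipse API Tools convention for discouraging client references to public members; everything else here (class, property key, method names) is invented for illustration.

```java
// Hypothetical sketch of patch v4's shape: a public-but-@noreference setter
// on the descriptor, and a try/finally that guarantees the transient flag
// never outlives the download attempt.
class TransientFlagUse {
    static final String SKIP_MIRROR = "download.skipMirror"; // assumed key

    interface Descriptor {
        /**
         * @noreference This method is not intended to be referenced by clients.
         */
        void setTransientProperty(String key, String value);

        String getTransientProperty(String key);
    }

    static void downloadFromCanonical(Descriptor descriptor, Runnable transfer) {
        descriptor.setTransientProperty(SKIP_MIRROR, "true");
        try {
            transfer.run(); // the actual transfer against the canonical site
        } finally {
            // Reset even if the transfer throws, so the transient property
            // is never left lying around on the source descriptor.
            descriptor.setTransientProperty(SKIP_MIRROR, null);
        }
    }
}
```

The finally block is what addresses the earlier observation that SKIP_MIRROR could remain set on the source descriptor after a query.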
Here's some data, and yet another opinion, if you find it helpful ...

The current MAX_RETRY_REQUEST appears to be 10, which is ridiculously low, IMHO. The number of http mirrors for "eclipse.org" I've seen for our major sites numbers around 70 or 80 max (I think p2 uses only http, right?); if you add ftp mirrors, there might be about 120 max. So, speaking roughly, if 10% of mirrors were down or inaccessible, it'd fail, as currently is? While 10% down isn't exactly common, I suspect it is also not uncommon.

Observation: you say it's to prevent an "infinite loop" ... which I don't get. Maybe you mean an "apparent" infinite loop? But really just a long, long time for something assumed to fail eventually, and you want it to fail faster? The reason I ask: are these failures distinguished by why they fail? A "timeout" sort of failure usually takes a LONG time (like 90 seconds), and I'd agree that mirror is likely bad and shouldn't be tried again for a long while ... but simple 404s (the main problem in this case) fail pretty quickly, and you could afford to go through a couple of hundred, I'd guess, without anyone noticing. Well, some might notice ... maybe 1 second per try. Where did a limit of 10 come from anyway?

So ... my advice is to keep it simple for a late change to Indigo: increase the limit to 200 and worry about a proper fix in Juno. I know that kind of "undoes" the gains from putting in the limit to begin with, but I think it should be redone in Juno as part of a larger effort ... to quote a comment in the code:

// TODO this needs to be redone with a much better mirror management scheme.

But ... I'm obviously no expert in the context of this code, and the ideas being discussed are certainly close to the right solution (other than changing SPI this late in a release, etc.) ... so don't give up if you really feel that's the right approach. My main goal was to give you these "typical numbers" of 50 to 80 http mirrors for the major sites, at least at eclipse.org.
And I hope that helps, at least a little.

The infinite loop that Ian is talking about was actually a real one (see bug #327256). Since none of us is really happy with any patch, and since the problem with files not being mirrored has been fixed, I'm of the opinion of simply bumping up MAX_TRY_REQUEST to 200, as suggested by David.

(In reply to comment #14)
> The infinite loop that Ian is talking about was actually a real one (see bug
> #327256).
> Since none of us is really happy with any patch and since the pb with files not
> being mirrorred has been fixed, I'm of the opinion of simply bumping up the
> MAX_TRY_REQUEST to 200 as suggested by David.

Yes, the infinite loop happened if some mirrors had bad content (we noticed it with MD5s). They weren't down, but they weren't right, and we would keep retrying (forever). The real fix for all this is better mirror management. Actually, I think we can do "the right thing" with transient properties, but certainly not in time for Indigo. I'm fine with a count of 200, but going back to the original problem (bad MD5s), we will actually download the artifact 200 times (and test the MD5) before finally failing... That won't be quick, but it's still better than an infinite loop.

Actually, Pascal has released this fix. The fix was to increase the limit from 10 to 200. We should get the right +1's in place, though. Since Pascal fixed this, I'll mark him as the assignee. I'll +1 it, and we need one more reviewer.

Adding Tom as another reviewer since I talked to him about this on IRC.

Thanks everyone! This is all in place for RC2.
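The shape of the fix as released, per the thread, is simply a raised cap: the loop guard stays (so a pathological mirror set still terminates) but the limit now exceeds any realistic eclipse.org mirror count. A sketch under assumed names, not the actual p2 source:

```java
import java.util.List;

// Sketch of the released fix: keep the retry guard but raise the cap from
// 10 to 200, well above the ~120 mirrors cited in the discussion, so the
// canonical site (last in the list) is always reachable in practice.
// Constant and method names are illustrative only.
class RaisedLimitDownloader {
    static final int MAX_RETRY_REQUEST = 200; // raised from 10

    static String download(List<String> mirrors) {
        int attempts = 0;
        for (String mirror : mirrors) {
            if (attempts++ >= MAX_RETRY_REQUEST) {
                return null; // still guards against an unbounded retry loop
            }
            if (mirror.equals("canonical")) {
                return mirror; // with the higher cap, the last entry is reached
            }
        }
        return null;
    }
}
```

Note the trade-off acknowledged above: for the bad-MD5 case, up to 200 full downloads can now happen before the operation fails, which is slow but bounded.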