Some Eclipse Foundation services are deprecated, or will be soon. Please ensure you've read this important communication.
Bug 327256 - Infinite loop in p2
Summary: Infinite loop in p2
Status: CLOSED FIXED
Alias: None
Product: Equinox
Classification: Eclipse Project
Component: p2 (show other bugs)
Version: unspecified   Edit
Hardware: PC Linux
: P3 normal (vote)
Target Milestone: 3.7 M7   Edit
Assignee: Ian Bull CLA
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-10-07 12:51 EDT by Ian Bull CLA
Modified: 2011-03-30 14:10 EDT (History)
8 users (show)

See Also:


Attachments
Patch v1 (1.65 KB, patch)
2010-10-07 16:53 EDT, Ian Bull CLA
no flags Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Ian Bull CLA 2010-10-07 12:51:20 EDT
If the MD5 hash is bad, then we get into an Infinite loop. This is because the repo doesn't get marked as 'bad' so we just keep retrying.
Comment 1 Ian Bull CLA 2010-10-07 16:49:31 EDT
I spent some time on this, and I'm not sure there is an easy fix here.  We use a mirror list (and mirror selector). Each mirror has a 'failure' count, which goes up and down depending on how well the mirror responds to certain requests. For example, if you download a file with a decent connection speed the mirror failure count goes to zero.  If something fails (MD5 hash for example) the failure count goes up.  

However, as long as there exists at least 1 mirror whose failure count != 2, we will keep retrying.  In the case of the MD5 problem, we will try the mirror, download the artifact (set failure count to 0), check the MD5 (which fails, set the failure count to 1), and repeat.  

While we can modify this algorithm a bit,  it turns out there is a lot of things that contribute to the failure count, and in a system with lots of mirrors, it's hard to know for sure if we will ever terminate.

For Eclipse 3.6.2, I propose we put a hard limit on the number of times we will try to mirror a particluar artifact. For example, if we have tried 50 consecutive times without success (to download a single artifact) then just give up.  This will at least ensure we don't end up in an infinite loop.  For 3.7 maybe we can look at a better mirrors selection / failure count solution.
Comment 2 Ian Bull CLA 2010-10-07 16:53:05 EDT
Created attachment 180456 [details]
Patch v1

This puts a limit of 50 on the number of times a particular mirror request will be tried.
Comment 3 Ian Bull CLA 2010-11-12 15:24:45 EST
John,

I wanted to bring this bug to your attention. Pascal mentioned that you're probably the best person to architect a solution. If you want to chat about the problem, ping me on IRC.
Comment 4 Thomas Hallgren CLA 2010-11-13 03:43:00 EST
A perhaps bold opinion, but why don't we remove the whole mirroring concept from p2 altogether? There are plenty of solutions out there that deals with this in much better ways.

As a side note, I'm not even able to find documentation on how to set up a server farm that supports p2's mirroring scheme. Is anyone using it besides Eclipse.org?
Comment 5 Ian Bull CLA 2010-11-13 11:40:05 EST
(In reply to comment #4)
> A perhaps bold opinion, but why don't we remove the whole mirroring concept
> from p2 altogether? There are plenty of solutions out there that deals with
> this in much better ways.

While this might be a possibility, this is not a 3.6.2 solution.  We need both a 3.6.2 and 3.7 solution here.
Comment 6 Jeff McAffer CLA 2010-11-13 15:13:33 EST
(In reply to comment #4)
> A perhaps bold opinion, but why don't we remove the whole mirroring concept
> from p2 altogether? There are plenty of solutions out there that deals with
> this in much better ways.

Suggestions?

> As a side note, I'm not even able to find documentation on how to set up a
> server farm that supports p2's mirroring scheme. Is anyone using it besides
> Eclipse.org?

As far as I recall the mirroring scheme in p2 is there to support the Eclipse.org setup not the other way around.  They supply a xml doc that lists the mirrors thought to contain interesting stuff and we have to cycle through them trying to find what we need.

Ian's approach seems like a reasonable way to address the immediate problem though personally I'd go for a smaller number of tries.
Comment 7 Thomas Hallgren CLA 2010-11-13 16:52:38 EST
(In reply to comment #6)
> (In reply to comment #4)
> > A perhaps bold opinion, but why don't we remove the whole mirroring concept
> > from p2 altogether? There are plenty of solutions out there that deals with
> > this in much better ways.
> 
> Suggestions?
> 
Use "Warp speed from the amazon cloud" ;-)

> > As a side note, I'm not even able to find documentation on how to set up a
> > server farm that supports p2's mirroring scheme. Is anyone using it besides
> > Eclipse.org?
> 
> As far as I recall the mirroring scheme in p2 is there to support the
> Eclipse.org setup not the other way around.

Yes. But since p2 is a generic provisioning system, I would assume a built in mirror mechanism to be generic as well. I would also assume documentation on par with other API's. At present, the mirror system is tied to Eclipse.org and useless to other provisioning organizations unless they first engage in advanced reverse engineering.
Comment 8 John Arthorne CLA 2010-11-15 09:32:21 EST
(In reply to comment #7)
> Yes. But since p2 is a generic provisioning system, I would assume a built in
> mirror mechanism to be generic as well. I would also assume documentation on
> par with other API's. At present, the mirror system is tied to Eclipse.org and
> useless to other provisioning organizations unless they first engage in
> advanced reverse engineering.

In case this is just an indirect way of requesting improvement to the doc, I have entered bug 330247 to improve the documentation. There is nothing in the p2 mirroring mechanism that ties it specifically to eclipse.org.
Comment 9 Ian Bull CLA 2010-11-15 12:01:09 EST
(In reply to comment #6)
> Ian's approach seems like a reasonable way to address the immediate problem
> though personally I'd go for a smaller number of tries.

Sure, 50 is likely a little high. But I wouldn't make it too small (say 25-30).  Remember, this check is really just a fail safe.  99.9% of the time, this limit is never reached.
Comment 10 Pascal Rapicault CLA 2010-12-23 23:18:18 EST
The problem is real, the patch almost complete, if not complete. We should strive to release that and not let the patch rot. Assigning to Ian to drive to completion.
Comment 11 DJ Houghton CLA 2011-03-16 12:11:59 EDT
Ian, this seems similar to bug 340165. We are successfully downloading the artifact but then one of the processing steps is failing so we continue to re-download, etc.  Thoughts?
Comment 12 DJ Houghton CLA 2011-03-16 15:04:44 EDT
*** Bug 340165 has been marked as a duplicate of this bug. ***
Comment 13 DJ Houghton CLA 2011-03-16 16:33:58 EDT
Ian, is there a related issue here in that we are setting the status code to RETRY=true?

I think this is roughly the order in which we do things:
- download artifact successfully
- set retry=true
- run md5, unpack, jar verifier, etc
- return error (note retry is still true)
- code says:
   do {
      ...
   } while (code == RETRY);
Comment 14 Ian Bull CLA 2011-03-17 01:51:30 EDT
(In reply to comment #13)
> Ian, is there a related issue here in that we are setting the status code to
> RETRY=true?
> 
> I think this is roughly the order in which we do things:
> - download artifact successfully
> - set retry=true
> - run md5, unpack, jar verifier, etc
> - return error (note retry is still true)
> - code says:
>    do {
>       ...
>    } while (code == RETRY);

Yes, that's roughly the steps.  IIRC, if you have one 'good' mirror left, then we don't set retry to false.  Now, you might think that the mirror we just tried (and failed to validate the MD5 hash from) would be 'bad', but here's the catch, each mirror gets a second chance. So this 'almost bad' mirror is tested again and artifact is fetched and p2 marks the mirror 'good' again.  Of course, the Md5 fails, so the mirror is 'almost bad' and the loop continue (retry the artifact, everyone is happy, fail the md5, wash rinse repeat).

Let's at least get the retry limit solution in for 3.7
Comment 15 Ian Bull CLA 2011-03-29 20:06:05 EDT
I've set the max retry count to 10 and committed the fix to head.

FWIW, simply by reading the comments you can see the problem.  When we fail to download an artifact from a mirror we write: don't punish the mirror too much just because we failed to download one artifact.  Higher up in the stack we write: keep trying to download the artifact as long as we have one good mirror left.

This is not great long term solution, but at least we will stop trying after 10 attempts.  The reason this problem is hard is because there is no exposed API for mirror selection.  Mirror selection is done by the repository itself, however, we while downloading the artifacts we make some assumptions about how those repositories will behave (wrt to mirror selection).  To fix this properly, we will likely need to make mirror selection part of the artifact repository API.

To make this even more confusing, we have mirror requests (mirror as a verb) and mirrors (mirror as a noun). ;-)
Comment 16 Ian Bull CLA 2011-03-30 14:10:00 EDT
I've opened bug 341405 to discuss enhancing the mirror concept to support 'bad artifacts in good mirrors'.