| Summary: | [repository] Composite repositories ignore errors when loading children (errors are treated like missing children) | | |
|---|---|---|---|
| Product: | [Eclipse Project] Equinox | Reporter: | Tobias Oberlies <t-oberlies> |
| Component: | p2 | Assignee: | P2 Inbox <equinox.p2-inbox> |
| Status: | RESOLVED WONTFIX | QA Contact: | |
| Severity: | normal | | |
| Priority: | P3 | CC: | david_williams, Ed.Merks, henrik.lindberg, irbull, john.arthorne, katya.stoycheva, mn, nboldt, pascal.rapicault, pascal, thomas, tjwatson |
| Version: | 3.7 | | |
| Target Milestone: | --- | | |
| Hardware: | PC | | |
| OS: | Windows 7 | | |
| Whiteboard: | | | |
| Bug Depends on: | | | |
| Bug Blocks: | 391956 | | |
|
Description
Tobias Oberlies
For reference for others, the support for the atomic loading of composite repositories was done in bug 312984.

---

I don't agree with this change of the default behaviour. If there is a case where we cannot load *all* the children of a composite repository, then I agree we should throw an exception. Otherwise I think it is valid to have a composite repository with invalid children. Consider the use case where a particular child repository is off-line but the rest of the children are ok.

I think that changing the default behaviour will have the not-so-subtle side effect of forcing repository owners to ensure all the repositories listed are available all the time. I think we'd get an earful from release engineers especially.

---

(In reply to comment #1)
> If there is a case where we cannot load *all* the children of a composite repository, then I agree we should throw an exception. Otherwise I think it is valid to have a composite repository with invalid children. Consider the use case where a particular child repository is off-line but the rest of the children are ok.

I don't doubt that this is a valid use case: aggregate repositories could be used to list multiple children, each with the same content, for redundancy. People who want to use composite repositories in this way should be required to set the property p2.atomic.composite.loading to false. (But as you already noted, this use case isn't even fully supported - a failure of all children is not detected.)

The by far more common use case is that composite repositories are used to join distinct content, but no-one seems to be aware that they need to set p2.atomic.composite.loading to get a reasonable result (e.g. see repositories linked from [1]). Therefore the only realistic solution is to change the default.

> I think that changing the default behaviour will have the not-so-subtle side effect of forcing repository owners to ensure all the repositories listed are available all the time.
> I think we'd get an earful from release engineers especially.

This is good. I like problems to become obvious right when they occur. With the current state of affairs, p2 just thinks it is okay to omit half of the content of a repository and hopes that no-one notices. IMHO this is not acceptable.

[1] http://wiki.eclipse.org/Eclipse_Project_Update_Sites

---

The composite repository was never intended as a fail-over mechanism, so of course all children must be available. It must be either all or nothing. Changing this default behaviour will not be subtle and will result in new surprising behaviours.

---

I concur with DJ. However we should recommend the use of the new flag for things like the release repository (e.g. Juno) since this is where the request initially came from (and in this case the repo is intended to be all or nothing), but it should be a user decision that we don't want to see forced onto users. For example I'm pretty sure that switching the SDK repo would break things.

---

(In reply to comment #4)
> However we should recommend the use of the new flag for things like the release repository (e.g. Juno) since this is where the request initially came from (and in this case the repo is intended to be all or nothing), but it should be a user decision that we don't want to see forced onto users.

Nothing is forced on the users: if anyone wants the fail-over behaviour, they can set p2.atomic.composite.loading=false. For everyone else, changing the default is a bugfix.

If you seriously insist on keeping the broken behaviour for the few people who rely on the bug, I may implement some kind of switch in the p2 code so that at least Tycho can change the default. Tycho will also need to load repositories older than Juno reliably - the non-deterministic behaviour is fatal for a build system that aims for reproducible results. Nevertheless I would prefer that p2 is also fixed.
I don't believe in the success of "recommending the use of the new flag", so unless we fix it in p2, I suppose that the bug will be around for a long time...

> For example I'm pretty sure that switching the SDK repo would break things.

Which URL do you mean by the SDK repo?

---

Claiming that the default behaviour is broken is simply wrong. It all depends on the semantics you want to give to the loading of a composite repo. From your point of view it needs to be a unit (e.g. a release repo); from our initial point of view it was just a convenience for grouping, and not all pieces were required (e.g. the I-build repos for the Eclipse SDK - http://wiki.eclipse.org/Eclipse_Project_Update_Sites). IMO, the two semantics are both useful. Over the years some people have been burnt by one or helped by the other. It is a trade-off, and in this case p2 provides a mechanism and a default policy that made sense at the time. Changing the default, which is what you are proposing, forces a change onto users who may or may not have any understanding of it. You see this as a bug fix because this is the behaviour you desire; it may not be the case for everybody.

But let's focus on your use case (btw, you should have started with this :) ). Even if we were to change the code that creates the composite repos to require the children to be loaded successfully, this would not address all your problems wrt builds, since some people would still be able to publish composite repos with failing children. In addition to that, you would still have to cope with legacy repos (e.g. Indigo) which would never change.

IMO, there are two discussions to have here:
1) Can p2 add a flag such that any composite repo must have its children loaded?
2) Should the default behaviour of the tasks creating the repos be changed?
To #1, I would say yes; to #2, let's weigh the pros and cons.

---

I see a lot of problems emerge if people are using a composite as a fail-over.
Let's say someone has 5 servers where he makes Indigo available and reflects this with a composite using 5 children, one for each server. If all servers are green, this will result in 5 copies of Indigo and vast memory consumption. Query performance will be degraded since the evaluator has no idea it's looking at exact replicas.

The next problem is that the children, also being composites, might be read partially. This is fine as long as at least one is read completely, but how will you know? There's no indication of "success". What set of repositories must be read in order to satisfy the meta-data request? Being able to answer that question is crucial if we're planning to support fail-over. It gets really messy when the children are pointing to different repositories, i.e. the user starts mixing fail-over with what I would consider the true purpose of the composite: bringing sets of things together.

I don't understand why this is an issue. To me it goes without saying that a repository must be read as a whole or not read at all. Non-deterministic behavior in this area is not desirable, and if people really use it that way today, then the sooner we can rectify that, the better.

---

(In reply to comment #6)
> even if we were to change the code that create the composite repos to require the children to be loaded successfully, this would not address all your problems wrt builds since some people would still be able to publish composite repos with failing children.

But this would reveal the problem: in such a case the build would fail. With composite repositories returning success even if parts of their content are missing, the build may pass, but the build result may be different (e.g. if there are optional dependencies involved). This is a nightmare. Hiding problems is bad, and whoever wants this should be forced to state it explicitly with p2.atomic.composite.loading=false.
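For concreteness: the flag debated throughout this report is declared in the composite repository's own index file. A hypothetical compositeContent.xml that explicitly opts out of atomic loading might look like the sketch below - the repository name, child URLs, and timestamp are placeholders, while the property name and the overall file structure follow p2's composite repository format:

```xml
<?xml version='1.0' encoding='UTF-8'?>
<?compositeMetadataRepository version='1.0.0'?>
<repository name='Example Composite'
    type='org.eclipse.equinox.internal.p2.metadata.repository.CompositeMetadataRepository'
    version='1.0.0'>
  <properties size='2'>
    <property name='p2.timestamp' value='1234567890000'/>
    <!-- 'false' = keep the lenient fail-over behaviour (children that fail
         to load are skipped); 'true' = fail the whole repository load if
         any child fails to load -->
    <property name='p2.atomic.composite.loading' value='false'/>
  </properties>
  <children size='2'>
    <child location='http://example.org/releases/sr0'/>
    <child location='http://example.org/releases/sr1'/>
  </children>
</repository>
```

A compositeArtifact.xml would carry the same property; whichever default the thread settles on, a repository owner can pin the behaviour they actually want by stating it here explicitly.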

---

The title of this bug could just as easily be, "p2 repositories are fault-tolerant by default" ;)

For the sake of argument, imagine there are two distinct repositories rather than a composite with two children. One repository is not currently available - should we still allow an install to occur using only the other repository? If so, then that install is also "not deterministic", because perhaps the other repository had some optional extra piece that would have been installed if available.

If we are talking about a build-time use case, then I agree with the idea Pascal suggests of a flag in p2 to treat any missing repository as a build failure (either children of a composite, or any single repository). p2 is also commonly used outside of a build environment, for end-user installs. In that use case I think the fault-tolerant semantics are better. If the unavailable child contains a dependency that is required, then the install will fail anyway. If the child doesn't have anything that I needed, why should my install fail? If the child contained something optional, then as a user I would be annoyed if the system said, "Sorry, I cannot install A, because there is a repository I can't reach that *might* contain something *optional* that goes with A". The web is inherently unpredictable, and fault-tolerance is an extremely important property in such an environment.

---

Let's say I'm updating my IDE from Indigo SR1. I run into some weird problem, so I start a discussion on a newsgroup to try and figure out what's going on. Many hours later, after I've sworn countless times that I really did install from Indigo after the release was out, it's discovered that the probable cause of my problems was that not all children of Indigo were available and I was never made aware of that. Wouldn't it be far better if the installer had told me that up front?

AFAIK, the installer does tell me if it for some reason cannot contact a repository that has been explicitly added, which is good.
Why should this be any different? The plan of what gets installed is based on the set of things that is available to the planner. Silent, non-deterministic removal of things from that set may change the plan completely and give the user results that are indeed possible to install but far from optimal. IMO, that's quite horrible. I don't see how that could benefit the user, regardless of whether it's an end-user install or a build environment.

---

(In reply to comment #10)
> Let's say I'm updating my IDE from Indigo SR1. I run into some weird problem so I start a discussion on a newsgroup to try and figure out what's going on. Many hours later, after I've sworn countless times that I really did install from Indigo after the release was out, it's discovered that the probable cause of my problems was that not all children of Indigo were available and I was never made aware of that.

I don't understand what weird problem could happen here. If the child contains a required part of the thing you are trying to install, then the install will fail today. If the child contains something you don't need, then the install succeeds. Great for the end user - why fail if the installer can't reach something it doesn't need?

> AFAIK, the installer does tell me if it for some reason cannot contact a repository that has been explicitly added which is good. Why should this be any different?

I agree this case shouldn't be any different. If I have a repository in my list that is not available, I can still install things from another repository (I'm talking about the end user UI as seen in the Eclipse SDK). I thought you were arguing that these cases should be different. It's not so bad if the installer "tells me" when there is something unreachable, but it shouldn't prevent me from installing (perhaps we could have a message like this: "The installer can't find something that you don't need. Proceed with the install anyway?").
> Silent, non-deterministic removal of things from that set may change the plan completely and give the user results that are indeed possible to install but far from optimal. IMO, that's quite horrible.

Let's be clear on what this horrible case is. Something that is optional might not get installed if it is not available. p2 will only install valid configurations as specified by the metadata, regardless of what repositories are present. If you have an optional dependency, then that entire feature is "non-deterministic" by design... i.e., you can't be certain that the optional thing will be installed unless you have complete control over the entire set of repositories available to the user (which you don't in the typical end-user scenario). If you want a 100% predictable install, the only way to accomplish that is to avoid optional dependencies or feature requires in your metadata (use feature includes only).

> I don't see how that could benefit the user regardless of if it's an end-user install or a build environment.

If the installer fails because it can't reach something the end user doesn't need, that doesn't benefit anyone.

---

(In reply to comment #11)
> I don't understand what weird problem could happen here.

Indigo is a bit special, since it contains children where each child is a coherent release verified with the b3 aggregator, so I admit that the chances for really weird problems to occur are fairly small. Other repositories are constituted differently though.

Here's another, not so weird, example for Indigo: I know that many besides me prefer to wait until SR1 before they upgrade their current IDE. They are sitting with a stable 3.6 SR2, biding their time until 3.7 SR1 is officially released. They will update when that happens. Now, let's assume that the child representing SR1 is temporarily inaccessible during the update. The user won't notice that the June release that they really wanted to avoid is the one that gets installed.
I think it's safe to say that the user would have appreciated a failure from Indigo as a whole so that he could try again later.

> why fail if the installer can't reach something it doesn't need?

Because it reaches other versions than the desired one.

> Let's be clear on what this horrible case is. Something that is optional might not get installed if it is not available.

I think the fact that the plan will select older versions is worse than not getting the optional stuff. If I install from Indigo once SR1 is out, then I expect to install the SR1 features. I would prefer to get an error, regardless of whether things are optional or not. That way I would know what happened and why the expected optional functionality isn't available in my IDE. That knowledge would save me the time it would otherwise take to figure out why the optional thing didn't get installed although I've been told that it should be.

Let's assume that I have a support function in a larger company. I write recommendations for how the developers should configure their IDEs. People start calling me, telling me that optional stuff that I'm 100% certain is present in the repository that I've told them to use doesn't get installed. If the repository as a whole were reported as temporarily unavailable, I wouldn't have that problem. How long will it take to sort out that kind of problem?

---

(In reply to comment #12)

I couldn't agree more. If I as an end user tell p2 to install from a certain repository, I want p2 to do exactly that or fail, and not to install only from a random subset of the referenced repositories. p2 shouldn't be guessing what the user wants, but just do as it is told. Otherwise people learn to distrust p2, and IMHO for good reasons, as long as the behaviour of p2 is defined in terms of variables (like network availability) that cannot be seen or understood by the users.

---

This is fixed for Kepler M3 [1].
p2 repository owners who rely on the old, lenient fail-over behaviour (see comment #1) may re-enable it by adding the property 'p2.atomic.composite.loading' with value 'false' in their compositeContent.xml/compositeArtifact.xml. The change [1] will be included in the next Tycho release, so Tycho users will help to find p2 repositories that need to have this property set.

p2 clients who need to communicate with broken composite repositories which are out of maintenance can (temporarily) re-enable the old behaviour by setting the system property -Declipse.p2.atomic.composite.loading.default=false.

[1] http://git.eclipse.org/c/equinox/rt.equinox.p2.git/commit/?h=integration&id=1c6723a02e376f32c75daada80b9acc46bd1cd01

---

I'm a bit concerned by this fix because it changes the behavior in the runtime code in a way that causes repositories to stop loading, whereas in the past some content may have been presented to the user and he could have been able to do something with the repo. I can think of two cases:
1 - the user does not know that he has put bad references in his repository
2 - the repo being referred to is offline but the content from this repo is already installed (for example the user refers to Juno but it is offline)
I think a different approach could be to change the defaults used by the publisher. Ian, what do you think?

---

I also disagree with this change. I know for certain this will break usage of some of my own repositories. The original use case for composite repositories was to allow publishing new builds as individual repositories, but have a single stable URL for end users to point at. Combine that with a cleanup script that removes old builds, and these repositories will now be unusable. I know you can argue that someone *shouldn't* do that, but this was a well-known semantic property of composite repositories.
I can see the use cases for "fault tolerant" versus "atomic", but I think this should be a property of the repository rather than a property of the p2 runtime that affects the behaviour of existing repositories. The fact that we already had a way to enable the atomic behaviour, and nobody was using it, suggests to me this is a solution to a very particular problem (maybe in Tycho?) that isn't necessarily a problem for anyone else.

---

I'm also curious to hear the root problem this is trying to solve. I still can't figure out the "bad things will happen if some children are missing" case. The way p2 works, the metadata is all completely validated before any change is made. If there is a case where something bad happens when one child is missing, then dependency resolution should fail. It suggests to me the real solution is that there should be a dependency between the IUs of those sibling child repositories, so that the planner will disallow the install if one piece is missing. It seems like you're trying to use the mechanism of sibling child repositories as a kind of implicit dependency link (don't allow X to be installed if Y is missing). I also think changing repository topology shouldn't affect install behaviour - if those child repositories were removed from the composite and made independent top-level repositories, it should have the same behaviour.

---

(In reply to comment #16)
> The fact that we already had a way to enable the atomic behaviour, and nobody was using it, suggests to me this is a solution to a very particular problem (maybe in Tycho?) that isn't necessarily a problem for anyone else.

I think it suggests that very few people knew that this option existed. I sure didn't.

---

(In reply to comment #17)
> I still can't figure out the "bad things will happen if some children are missing" case.

Please read my "not so weird" example in comment #12.

---

(In reply to comment #19)
> > Please read my "not so weird" example in comment #12.
Ok, so the case is that after SR1 comes out, you don't want to be offered to install SR0. Right now Markus has disabled SR1 updates because of bug 390756, so therefore we shouldn't allow SR0 upgrades either? If we wanted these semantics we would just not use a composite repository for the release - we would just swap out the old repository and insert the new one.

---

(In reply to comment #20)
> Ok, so the case is that after SR1 comes out, you don't want to be offered to install SR0. Right now Markus has disabled SR1 updates because of bug 390756, so therefore we shouldn't allow SR0 upgrades either?

There's a not so subtle difference between a publisher making a decision about what the repository should or shouldn't contain versus temporary partial outages. Are you seriously suggesting that the current problem with the EPP packages should be used as a precedent for how p2 handles atomicity?

---

The reason why the property was introduced (back in April 2011) is that there is a need for people to treat composite repos as an atomic unit, whereby the failure to load one child should cause the complete repo to fail. For example, this could be desirable for someone shipping their own plugins that rely on some other plugins from another repo. However, I think this is a producer decision (the producer of the repo), not a consumer one (the p2 runtime, the end user), though I'm open to introducing a flag to control that through a system property or preference for Tycho use.

---

(In reply to comment #22)
> The reason why the property has been introduced (back in April 2011) is because there is a need for ppl to treat composite repos as an atomic unit whereby the failure to load one child should cause the complete repo to fail.

Good to know that this property was introduced. Personally, I think the notion of using the p2 composite mechanism as a way to achieve an invisible and unpredictable fail-over or a fault-tolerant service is scary. I hope the default setting prevents this.
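The system-property escape hatch floated here is, per the Kepler M3 fix described earlier in this report, the property eclipse.p2.atomic.composite.loading.default. A hedged sketch of how a client could set it - for example as a fragment at the end of eclipse.ini, where everything after -vmargs is passed to the JVM (eclipse.ini itself does not support comment lines, so the explanation stays out of the file):

```ini
-vmargs
-Declipse.p2.atomic.composite.loading.default=false
```

Setting the default to false restores the pre-Kepler lenient behaviour for composite repositories whose owners have not declared p2.atomic.composite.loading explicitly; a repository's own explicit setting would still win.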

---

I'm not sure what you mean, Thomas - could you please elaborate?

---

(In reply to comment #20)
> Right now Markus has disabled SR1 updates because of bug 390756, so therefore we shouldn't allow SR0 upgrades either?

If you want to remove a child from a composite, just remove the reference from the compositeContent.xml.

---

(In reply to comment #16)
> I know for certain this will break usage of some of my own repositories.

Just add <property name='p2.atomic.composite.loading' value='false'/> in your composite*.xml, and you'll get the old fault-tolerant behaviour.

---

(In reply to comment #24)
> I'm not sure what you mean Thomas, could you please elaborate?

Semantics defined in terms of network availability are perceived as random (see my comment #13). IMHO we shouldn't want p2 to be considered as behaving randomly. This is why the default behaviour had to be changed to be deterministic.

---

(In reply to comment #27)
> Semantics defined in terms of network availability is perceived as random (see my comment #13). IMHO we shouldn't want that p2 is considered as behaving randomly. This is why the default behaviour had to be changed to be deterministic.

I don't think your change has much impact on whether p2 is deterministic. As I keep saying, the way you establish determinism in p2 is to ensure the metadata expresses all the dependencies correctly. If the network is unreliable, then whether each single composite repository is "all or nothing" is not going to give p2 the transaction semantics you are looking for (the network could fail in the middle of the operation, for example). It really doesn't buy you much benefit. The compatibility problem is that there are many composite repositories out there today that have some children missing (for example, old builds were deleted). These repositories behave in a deterministic way today - you can install stuff from the children that are available.
With your change it is still deterministic - it will always fail. This has nothing to do with determinism, randomness, etc. It is a significant behaviour change that will cause real problems for consumers.

---

(In reply to comment #28)
> As I keep saying, the way you establish determinism in p2 is to ensure the metadata expresses all the dependencies correctly. If the network is unreliable, then whether each single composite repository is "all or nothing" is not going to make p2 have the transaction semantics you are looking for

How do you prevent the problem Thomas described in comment #12 with dependencies? The semantics of "latest version" change with the availability of child repositories, and "latest version" is used in the p2 UI.

> The compatibility problem is that there are many composite repositories out there today that have some children missing (for example old builds were deleted). These repositories behave in a deterministic way today - you can install stuff from the children that are available.

Loading such a repository consists of multiple HTTP requests over a network. Any subset of these requests may fail in a quasi-random way, so the content of the composite is random.

> It is a significant behaviour change that will cause real problems for consumers.

As a consumer, I want p2 to fail fast in case of network problems. The price for this is that p2 repository providers have to maintain their composite*.xml file to reflect the list of children that are meant to be available (or explicitly state that they don't do this by setting the repository option). There are a lot more users than repository providers, so the trade-off should be in favor of the users.

---

There are many ways to ensure safe delivery of content. One very proven method is to validate that the content is consistent and will not break things. p2 is good at this, there's no doubt about that.
But let's move that point aside for a little while and ask ourselves: what if the content is indeed consistent, but still isn't what the publisher wanted you to have? Simple scenario: I, the publisher, decide to publish a new version of my tool where some bugs have been fixed. I tell the users who reported the bugs to update and verify. After updating, they still report the same problems. After several hours with at least two people involved searching for the cause, it's discovered that a renewed update attempt indeed solves the problem, although nothing has changed.

The described situation is utterly horrible. Now we don't even know what caused the problem in the first place. Did the user really update the first time? Has something else changed on his side? Why is it suddenly working? Surely a behavior that leads to that kind of problems should be avoided if possible.

As a publisher, I want a reliable tool. When I publish, my users must get the latest stuff. If they can't, due to unforeseen circumstances, then I want the user to be notified about that by p2 so that he can inform me about the problem. As a user, I want the same thing (as explained earlier).

---

I think the existing behavior (ignore failure) is ok as the default, and it is good that there is a flag that can be set in the repository to allow a publisher to declare that this is indeed one atomic repository - it just happens to be a composition. In practice, when all content is on the same machine/disk and is served from one server, the likelihood of one child failing while the others still work is very small (given that the content exists). Hence, "ignore failure" is basically good for those who are lazy and do not clean up their composite index, and in practice does not make it bad for others. Those who rely on a composition of several remote sites and at the same time keep the ignore-failure behavior are simply not offering their users the best experience.
The choice here is either possible mysterious failures to get what was expected with no failure reported, or a shaky repository with things almost never used always destroying the overall install. Neither of these is good, and flipping the flag does not really help - it is just bad in two different ways. So IMO, changing the default to atomic will not make many users happy - on the contrary. Just my 2c.

---

(In reply to comment #31)
> In practice, when all content is on the same machine/disk and is served from one server the likelihood of one failing and others still working is very small.

Given that this discussion is about when it actually does fail, likelihood is a somewhat moot point.

The composite mechanism is supposed to be opaque and hidden beneath the p2 APIs. The ability to hide the actual implementation has been a very prominent factor in the whole design. It shouldn't matter to the publisher nor the consumer how the repository is assembled.

---

(In reply to comment #32)
> > In practice, when all content is on the same machine/disk and is served from one server the likelihood of one failing and others still working is very small.
>
> Given that this discussion is about when it actually does fail, likelihood is a somewhat moot point.
>
> The composite mechanism is supposed to be opaque and hidden beneath the p2 API's. The ability to hide the actual implementation has been a very prominent factor in the whole design. It shouldn't matter to the publisher nor the consumer how the repository is assembled.
> Nor should it matter if it's provided by an intelligent service on-the-fly from a database or in some other fashion. What does matter a great deal though, is that the content intended by the publisher is what is consumed by the consumer. At all times. And if that contract is broken, some flag is raised to that fact.

Yes. The publisher should set the atomic flag or not. It also makes sense to be able to set "atomic=true" as a user when you know better than the publisher. Then complain to publishers that do not maintain their index. I am pretty sure that the repositories where it actually counts would be hunted down and fixed pretty quickly. How about doing that, and then deciding on changing the default?

---

(In reply to comment #33)
> How about doing that, and then decide on changing the default?

How about changing the default, and then letting the publishers who have no problem hiding failures from their consumers (and potentially running into the problems described here) flip the bit?

---

(In reply to comment #34)
> How about changing the default and then let the publishers who have no problem hiding failures from their consumers and potentially run into the problems described here flip the bit?

Others were of the opinion that's a worse option, and I think I agree. IMO those who care turn it on in their products or builds, and I think the problematic repositories - i.e. those with missing child repositories, which is really what we are talking about here - are quickly identified and cleaned up. I am all for atomicity in other cases. It will need to be people who really understand the issues that drive the change anyway. Ordinary users really have no clue and would at most just complain in general about Eclipse and/or p2.

Can you see a warning being output/logged? "Repository x seems to have missing child repositories a, b, c - please report to repository owner."
Not sure ordinary users really pay attention to such things, and they probably wouldn't report it anyway. A quick way to reach the experts is to turn on atomicity in Buckminster and Tycho. All the projects at Eclipse that have bad repositories would see things failing.

---

(In reply to comment #15)
> Ian, what do you think?

I think that composite repositories should fail (by default) if *all* their children fail to load. However, I also think that many people have come to rely on the current behaviour. Unfortunately the current behaviour is practically API, and expecting others to change their repositories (or their habits) is a message we can't deliver. There is no practical way we can get everyone else to change their repositories, and breaking users (who are currently working fine) is not an option.

I'd be happy to have Tycho (or the publisher) change the default behaviour to include the flag. I think a p2 runtime flag makes sense. We would need to decide on the precedence order of these flags (if the repo is set to *NOT* fault tolerant and p2 is set to *MUST_BE_FAULT_TOLERANT*, who wins?).

---

I think Henrik's suggestion, adding a warning that will make the user aware of problems, is something to consider. What would the wording be? "Unable to reach all repositories appointed by the composite xxx. This means that the resolution might contain unexpected results although it was successful. Accept / Retry / Cancel" - or something like that. What would you do if that dialog popped up? My strongest objection to the defaults used today is that these failures are allowed to go unnoticed. Unfortunately, all IDEs out there lack this warning, so adding it doesn't solve anything until everyone has updated their IDEs.

I disagree with Ian that the current behavior is practically API. In my view the current behavior is a serious bug. Its nature (silently ignoring errors) means that most people are not aware of it. My recommendation is that we start by changing the default.
That fixes the bug instead of telling all publishers to create their own workaround for it. We must of course inform everyone about the behavioral change and provide instructions on how to retain the old behavior. Once the default is changed, we should consider adding a client setting that would allow the resolution to continue even though some repositories cannot be reached (with the above warning, of course).

Some time later, we should deprecate the publisher's ability to make composites "fault tolerant". That was not a problem that the composite was meant to solve, and if we really want to retain that behavior, then there are a lot of additional things to consider, such as: which children are optional and which are mandatory? Which children are actually mirrors of the same repo? Etc. Fault tolerance is a whole different ball game that, IMO, p2 shouldn't try to engage in. We have enough problems with the home-brewed mirroring.

(In reply to comment #36)
> (In reply to comment #15)
> > Ian, what do you think?
>
> I think that composite repositories should fail (by default) if *all* their
> children fail to load. However, I also think that many people have come to
> rely on the current behaviour. Unfortunately the current behaviour is
> practically API, and expecting others to change their repositories (or their
> habits) is a message we can't deliver. There is no practical way we can get
> everyone else to change their repositories, and breaking users (who are
> currently working fine) is not an option.
>
> I'd be happy to have Tycho (or the publisher) change the default behaviour
> to include the flag.

Agree.

> I think a p2 runtime flag makes sense. We would need
> to decide on the precedence order of these flags (repo set to *NOT* fault
> tolerant and p2 set to *MUST_BE_FAULT_TOLERANT* - who wins?)
It is actually not FAULT_TOLERANT so much as FAULT_IGNORANT :) Being "fault tolerant" implies doing the right thing when there are errors - the current behavior just ignores errors. Although it can be used as a "poor man's fault tolerance implementation", it seems far better to implement such solutions using other technologies. Hence, I like that publishers of repos create atomic repos by default. Someone that wants the "poor man's fault tolerance" will then need to make a decision about what they want, and whether the p2 option of ignoring errors is good enough for their needs.

The client flag should be "load all composites atomically". If it is set to true, it wins; if it is set to false, the behavior is defined by the repo. Making a repository that declares itself "atomic" ignore errors via a client flag when read does not seem to be very useful (it is not the right thing to do).

Currently, there are no client flags that override explicit configuration from the repositories. The only client flag available allows changing the default configuration. This should only be needed while there are still repository providers which use the "poor man's fault tolerance implementation" and haven't heard that they need to set p2.atomic.composite.loading=false for this.

I thought of a compromise over the weekend that I think supports the main cases we all care about. We should be able to distinguish between "missing child" and "communication error contacting child". The missing child case is the one that I believe is common, and will cause lots of problems for end users if we change the default. However, there is nothing random or intermittent about a child that does not exist - every install will have the same effect. On the other hand, it looks like the main concern of Tobias and Thomas is the "network glitch" mentioned in comment #0, i.e., some other kind of communication problem.
What if we propagated that failure up to the parent, but kept the current behaviour of ignoring a missing child (perhaps still configurable by a repository setting if needed)?

(In reply to comment #40)
> I thought of a compromise over the weekend that I think supports the main
> cases we all care about. We should be able to distinguish between "missing
> child" and "communication error contacting child". The missing child case is
> the one that I believe is common, and will cause lots of problems for end
> users if we change the default. However there is nothing random or
> intermittent about a child that does not exist - every install will have the
> same effect. On the other hand it looks like the main concern of Tobias and
> Thomas is the "network glitch" mentioned in comment #0. I.e., some other
> kind of communication problem. What if we propagated that failure up to the
> parent, but kept the current behaviour of ignoring a missing child (perhaps
> still configurable by a repository setting if needed).

+1

(In reply to comment #40)
> ... What if we propagated that failure up to the
> parent, but kept the current behaviour of ignoring a missing child (perhaps
> still configurable by a repository setting if needed).

A missing child is very likely a sign of a mistake made by the publisher. Why have a default behavior that allows such mistakes to go unnoticed? Likely causes are a misspelled URL, a child that ended up at a location that doesn't match, or a child that was unintentionally removed. If a publisher really wanted to remove it, then why not simply remove the child entry in the composite? Perhaps the suggestion is an improvement, but I can't really see the benefits. The more stringent and predictable the behavior, the better. We should strive to catch problems early and to have a behavior that is easily understood. That builds trust for the technology.
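For concreteness, the compromise proposed in comment #40 - in non-atomic mode, skip a child only when it is plainly missing (a 404), and propagate any other failure to the parent - could be sketched as below. This is a hypothetical illustration, not actual p2 API; `ChildResult` and the helper names are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

public class ChildLoadPolicy {

    /** Simplified stand-in for the outcome of contacting one child repository. */
    record ChildResult(String location, int httpStatus) {}

    /**
     * Returns the children that loaded successfully. A 404 child is silently
     * skipped when atomicLoading is false (the "missing child" case); any other
     * failure - and any failure at all in atomic mode - aborts the whole load.
     */
    static List<String> loadedChildren(List<ChildResult> results, boolean atomicLoading) {
        List<String> loaded = new ArrayList<>();
        for (ChildResult r : results) {
            if (r.httpStatus() == 200) {
                loaded.add(r.location());
            } else if (r.httpStatus() == 404 && !atomicLoading) {
                continue; // missing child: tolerated in non-atomic mode
            } else {
                // network glitch, server error, or atomic mode: propagate
                throw new IllegalStateException(
                    "Failed to load child " + r.location() + " (HTTP " + r.httpStatus() + ")");
            }
        }
        return loaded;
    }

    public static void main(String[] args) {
        List<ChildResult> results = List.of(
            new ChildResult("http://example.org/child1", 200),
            new ChildResult("http://example.org/child2", 404));
        // Non-atomic: the 404 child is dropped, the rest of the composite loads.
        System.out.println(loadedChildren(results, false));
    }
}
```

Note that under this rule a 503 or a DNS failure would still fail the composite even with `p2.atomic.composite.loading=false`, which is exactly the distinction debated above.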
(In reply to comment #40)
This is good: if we get a 404 for a child, treat it as "missing child" and silently omit it (if atomic.loading=false). In all other cases, propagate the error. As a first step towards this, I'll change the default of atomic.loading back to false.

(In reply to comment #43)
> As a first step towards this, I'll change the default of atomic.loading back
> to false.

Why? Consider DJ Houghton's example in comment #1. The server for a particular repository is off-line. How is that different from a particular volume not being mounted (which might well be the case if you get a 404)? Both cases can be caused by spelling mistakes. Both cases can be caused by network failures. Both cases can be caused by unintentional actions. What's the significant difference that we're trying to nail down here, and why is it important to the user?

I'm with you that most composite repositories should be loaded in an atomic way. This means that all tools should by default produce composite*.xml files with the <property name='p2.atomic.composite.loading' value='true'/>. (Can you help with this? Tycho doesn't have support for building composite repositories.) Although I also find the use cases for non-atomic loading obscure, John convinced me that non-atomic loading is not broken beyond repair. Therefore, IMHO, we no longer have the justification for an incompatible change in p2.

This bug is about making the loading of composite repositories fail in case of network errors. This doesn't absolutely require a change of the atomic loading default. If you still want to pursue this (e.g. to cover other error cases), I'd propose to open a new bug.

I've been thinking more about this. I think there are two things to consider. One is the reason why something is missing, the other is when this is discovered.
*Why something is missing:* Consider a URL like http://download.elcipse.org/tools/downloads/drops/R20120526062928 - since the hostname is misspelled, this URL will yield a network failure. This URL, however: http://download.eclipse.org/tools/downloads/drops/R20120526062929 will yield a 404, since the timestamp is wrong. As I wrote earlier, there might be many other problems, including network failures in the server infrastructure, that will present themselves as 404s. The only thing that a p2 client can be sure of is that the repository, for one reason or another, cannot be reached. I think it would be a mistake to put semantic significance into the reason for the failure. If we do, we should at least have a clearly formulated statement that explains why we're doing it. After all, I'm not your average p2 user, and even I have a hard time understanding why we would do this.

*When a problem is discovered:* So far, we have just been talking about metadata. What about composite artifact repositories? Should they be treated the same way? If we decide not to, then we need to explain the inconsistency. If we on the other hand decide to allow failures here, then we have a serious problem, since the plan for what's to be downloaded has already been computed. It's very likely that a missing artifact repository will cause an incomplete install. Should that be silently ignored? I don't think so. And that's regardless of whether the problem is caused by a 404 or a network glitch. It's actually impossible to retain a consistent "tolerate some errors" behavior.

My conclusion: the only sane thing to do, in my opinion, is to treat the repositories (metadata as well as artifact) as consistent descriptions of a state. If some parts of that state cannot be reached, for whatever reason, then the state is corrupted and the install must fail. That behavior is easy to document and explain. It doesn't raise any questions about what types of errors should be silently tolerated and why.
And it allows for a consistent treatment of all types of repositories.

(In reply to comment #45)
> ... Can you help with this?

Yes, I think I have some cycles to spare. What do you have in mind? An improvement to the p2.composite.repository ant task?

(In reply to comment #47)
> Yes, I think I have some cycles to spare. What do you have in mind? An
> improvement to the p2.composite.repository ant task?

If this is how people are creating composite repositories, yes. Is this also the tool used for creating the Eclipse release train composites?

(In reply to comment #48)
> If this is how people are creating composite repositories, yes. Is this also
> the tool used for creating the Eclipse release train composites?

The release is created using the b3 aggregator. I can fix that. But then the releases are aggregated into a composite. I don't think the aggregator is in charge of that. Not even sure if it's automated. David?

(In reply to comment #49)
> The release is created using the b3 aggregator. I can fix that. But then the
> releases are aggregated into a composite. I don't think the aggregator is in
> charge of that. Not even sure if it's automated. David?

Sometimes I use an ant task, sometimes I hand edit.
Usually the latter, since I can create the composite jar ahead of time, with a different name, and then on "make visible" day just rename the jar file.

(In reply to comment #49)
> The release is created using the b3 aggregator. I can fix that.

Thank you! This is an important step forward.

(In reply to comment #50)
> Sometimes I use an ant task, sometimes I hand edit. Usually the latter, since I
> can create the composite jar ahead of time, with a different name, and then on
> "make visible" day, just rename the jar file.

@David: Could you make sure that these composites have the atomic option set to true? See here [1] for an example from the p2 tests.

[1] http://git.eclipse.org/c/equinox/rt.equinox.p2.git/tree/bundles/org.eclipse.equinox.p2.tests/testData/metadataRepo/composite/missingChild/atomicLoading/compositeContent.xml
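For reference, an atomic compositeContent.xml along the lines of the linked test data looks roughly like this (the repository name and child locations here are illustrative, not taken from any real repository):

```xml
<?xml version='1.0' encoding='UTF-8'?>
<?compositeMetadataRepository version='1.0.0'?>
<repository name='Example Composite'
    type='org.eclipse.equinox.internal.p2.metadata.repository.CompositeMetadataRepository'
    version='1.0.0'>
  <properties size='3'>
    <property name='p2.timestamp' value='1234'/>
    <property name='p2.compressed' value='false'/>
    <!-- the flag discussed in this bug: fail the whole load if any child fails -->
    <property name='p2.atomic.composite.loading' value='true'/>
  </properties>
  <children size='2'>
    <child location='child1'/>
    <child location='child2'/>
  </children>
</repository>
```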
>
> @David: Could you make sure that these composites have the atomic option set
> to true? See here [1] for an example from the p2 tests.
>
> [1]
> http://git.eclipse.org/c/equinox/rt.equinox.p2.git/tree/bundles/org.eclipse.
> equinox.p2.tests/testData/metadataRepo/composite/missingChild/atomicLoading/
> compositeContent.xml
I'll clone this bug to the cross-project list for proper visibility of the change.
In my usual cautious way, I'll do it for Kepler now and, if no problems surprise us, add it to Juno for SR2.
That example file seems slightly wrong, though, since it says size='2' on the properties element - I think it should be '3'?
<properties size='2'>
<property name='p2.compressed' value='false'/>
<property name='p2.timestamp' value='1234'/>
<property name='p2.atomic.composite.loading' value='true'/>
</properties>
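With the size attribute corrected to match the three properties, that block would presumably read:

```xml
<properties size='3'>
  <property name='p2.compressed' value='false'/>
  <property name='p2.timestamp' value='1234'/>
  <property name='p2.atomic.composite.loading' value='true'/>
</properties>
```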
To improve cross-references for the common repo, I've opened bug 391956 - Use p2.atomic.composite.loading in common repos.

I've opened bug 391959 to track the b3 aggregator change.

I've opened bug 391962 to track the enhancement to the p2.composite.repository ant task.

Where are we at with this bug? It's marked for M3, but it looks like we are far from agreement on the fix.

Conceptually, the fix is clear: when a non-atomic composite repository is loaded, p2 should distinguish between 404s (-> omit child) and other network responses (-> fail loading of the composite). However, I don't think that the implementation is easy. In any case, we also agreed (and implementations are under way) that all composite repositories should explicitly be marked as atomic. Therefore I hope that non-atomic composites will eventually become rare, and hence fixing this bug will become less important.

FWIW, I added a note about this gotcha / behaviour change to the Tycho 0.16 release notes page: http://wiki.eclipse.org/Tycho/Release_Notes/0.16#Behaviour_Changes

I know this is an old bug, but just to be explicit: this "atomic" property only needs to be in "compositeContent" files, right? It has no effect in composite artifact files? All the examples I see mention compositeContent files, but I am not sure why it would not also apply to compositeArtifacts files.

This issue seems to have died. (And I know I'm often thankful that someone's partially broken composite still works.)
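For reference, the ant-task enhancement tracked in bug 391962 might end up looking something like the sketch below. This is an assumption only: the atomic attribute and the exact nesting are not confirmed by this thread, and the locations are illustrative.

```xml
<!-- Sketch only: the atomic attribute is the enhancement proposed in
     bug 391962 and may not match the final syntax. -->
<p2.composite.repository>
  <repository location="file:/path/to/composite" name="Example Composite" atomic="true"/>
  <add>
    <repository location="http://example.org/child1"/>
    <repository location="http://example.org/child2"/>
  </add>
</p2.composite.repository>
```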