| Summary: | Reconsider removing jars from Helios common repository | | |
|---|---|---|---|
| Product: | Community | Reporter: | David Williams <david_williams> |
| Component: | Cross-Project | Assignee: | David Williams <david_williams> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | normal | | |
| Priority: | P3 | CC: | ahunter.eclipse, alex.blewitt, filip.hrbek, jeffmcaffer, john.arthorne, mknauer, nboldt, oisin.hurley, pwebster, sbouchet, thomas |
| Version: | unspecified | | |
| Target Milestone: | --- | | |
| Hardware: | PC | | |
| OS: | All | | |
| Whiteboard: | | | |
|
Description
David Williams
+1 for keeping both jars and pack.gzs - the jar is the canonical artifact and is always usable by Java, whereas there can be issues with the packed items, as David has pointed out. I do think there should be some concept of 'sunsetting' for repository content, but I think that is not related to the issues at hand in this bug.

While I think that the pack200 backward compatibility concern is a bit far-fetched, I see another reason for keeping the jars, namely hybrid repositories (both Maven and p2). So +1 for that. Keeping both highlights another issue: not all jars benefit from the packing, or benefit very little. For those jars, we should consider dropping the .pack.gz file and only retaining the jar.

Below is some information from an IBM internal team responsible for consuming the repositories produced by the Eclipse teams. Their group has done some extensive investigation, and decided that they cannot use pack200 for products consuming the Helios release. For this reason they are asking me to make sure any p2 sites I want them to use provide jars.
------------------------
Adding Pack200 support has been evaluated with the goal of supporting it [during Helios].
A basic assumption was that the packed jar would replace the unpacked jar. The result was that the performance data does not justify the introduced complexity. Here are highlights of the thinking behind this:
First, pack200 can yield tremendous savings for artifacts containing Java code. Some jars have been compressed to only 7% of the original jar's size. However, the totals for the [product X] update and [product Y] were only reduced to 73.5% and 79.8% of the original size.
This could easily lead to a longer overall install time. For example, for [product Y] the unpack rate seen was 0.4 MB/s based on the compressed size (pack.gz). The install time will be longer once the artifact retrieval data rate is higher than 0.22 MB/s. This would likely affect installs from physical media or intranets negatively.
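The break-even claim above can be illustrated with a simple model: total install time with packed artifacts is download time plus unpack time, so packing only pays off while the network is slower than a threshold derived from the unpack rate and compression ratio. Note this is a sketch under assumed arithmetic, not the team's actual measurement method; under this particular model, the quoted 0.22 MB/s threshold corresponds to a compression ratio of roughly 65%, not the 79.8% figure above, so their model evidently differed in detail.

```python
def break_even_rate(unpack_rate_mbs, packed_ratio):
    """Network rate (MB/s) above which installing from packed artifacts
    becomes slower than just downloading plain jars.

    Hypothetical model (not taken from the original report):
      packed:   time = packed_size / net_rate + packed_size / unpack_rate
      unpacked: time = unpacked_size / net_rate
    The times are equal when
      net_rate = unpack_rate * (1 - ratio) / ratio,
    where ratio = packed_size / unpacked_size.
    """
    return unpack_rate_mbs * (1 - packed_ratio) / packed_ratio

# With the 0.4 MB/s unpack rate (measured against pack.gz size) and a
# repository compressed to ~65% of original size, packing stops paying
# off once the network delivers more than ~0.22 MB/s.
print(round(break_even_rate(0.4, 0.645), 2))  # → 0.22
```

The qualitative point survives any choice of constants: the better the compression ratio and the faster the unpacker, the faster the network must be before packing becomes a net loss.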
Under the assumption that only packed artifacts would be released, all customers would be affected, possibly lowering customer satisfaction.
Build times would increase 10-20%. That's not a show-stopper, but it would be noticeable. With long build times being one of the major complaints of [IBM build] customers, this is not something we want to do lightly.
Highlights of the risk and complexity introduced by pack200:
* pack200 alters the binary content of jars. This invalidates jar signatures unless jars are normalized before signing, which introduces substantial complexity and a need for coordination between the build and install handling of the produced artifacts.
  - The normalization step must use the same parameters as the packing step. The JREs should be aligned.
* The pack200 format is complex (the JSR 200 spec would be 82 pages printed out, and it is a dense read). It introduces risk in that normalized jars could behave differently. While the belief is that this would not happen often, each incident would incur high costs. For example, one of the normalized jars broke a build that had been modified to gather performance data.
* In addition to the previous case, some jars are not packable.
* In addition to the previous cases, some jars which did not produce errors when packing may not be unpackable. A validation step would have to be introduced.
* Handling pack200 recursively on jars within jars introduces even more complexity. The tool that handles this would have to be carefully versioned, and/or multiple versions kept.
* There are two pack200 formats: one which supports packing Java 1.6 language constructs, and one for 1.5 and lower. The former cannot be unpacked by JRE 1.5 tools. However, [product X] ships with 1.5 for size reasons.
* There is the potential that a newer JRE cannot unpack all previously released repositories. This places particular constraints on the Java versions that can be required.
Providing packed jars in addition to unpacked ones would improve the experience for customers with low network transfer rates. However, all of the complexity and risk would be introduced for the benefit of a narrow segment of customers.
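The normalize-before-signing point in the first bullet can be illustrated with a toy analogy (this is not pack200 itself): pack200's pack/unpack cycle rewrites class files into a canonical but different byte layout, so a signature over the original bytes no longer matches after a round trip. Normalizing first makes further round trips byte-identical no-ops, so signatures survive. Here the "repack" is JSON canonicalization standing in for pack200's repack step, and `sign` is a plain digest, purely for illustration.

```python
import hashlib
import json

# Stand-in for an un-normalized .class file: same information, non-canonical layout.
original = '{"b": 2, "a": 1}'

def repack(text):
    """Semantic no-op that canonicalizes byte layout (analogy for pack200 --repack)."""
    return json.dumps(json.loads(text), sort_keys=True, separators=(",", ":"))

def sign(data):
    """Stand-in for a signature: any digest over the raw bytes."""
    return hashlib.sha256(data.encode()).hexdigest()

# Sign before normalizing: one pack/unpack round trip invalidates the signature.
assert sign(original) != sign(repack(original))

# Normalize first, then sign: further round trips change nothing.
normalized = repack(original)
assert sign(normalized) == sign(repack(normalized))
print("normalize-then-sign survives the round trip")
```

This is why the normalization step must use the same parameters as the later packing step: if the two canonical forms differed, the signature would break again at install time.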
Great summary of the pros and cons of using pack200. The very minor savings on some jars relates to what I said in comment 2. It would not be hard to add functionality to the aggregator that would discard the packed artifact when the savings are below a certain threshold.

(In reply to comment #3)
> This could easily lead to an longer overall install time. For example for
> [product Y] the unpack rate seen was 0.4 MB/s based on the compressed size
> (pack.gz). The install time will be longer once artifact retrieval data rate
> is higher then 0.22 MB/s. This would likely affect installs from physical
> media or intranets negatively.

I just wanted to point out that p2 makes the decision on whether to use the packed or canonical (unpacked) artifact based on the repository location. If the repository is local (a file: URL), then it will always use the canonical artifact and never use the packed one. So, physical media installs should not be affected by the presence of both packed and unpacked artifacts in the repository.

> Highlights of the risk and complexity introduced by pack200:
>
> * pack200 alters the binary content of jars. This invalidates jar
> signatures unless jars are normalized before signing. This introduces
> substantial complexity and need for coordination between the build and install
> handling of the produced artifacts.
> - The normalization step must use the same parameters as the packing
> step. The JREs should be aligned.
> * The pack200 format is complex (JSR 200 spec would be 82 pages printed out
> and it is a dense read). It introduces risk by the potential that normalized
> jars behave differently. While the belief is that this would not happen too
> often it would cause high costs in each incident. For example one of the
> normalized jars broke a build modified to gather performance data.

These are all valid points for why someone would choose not to use pack200. However, I don't think they apply to any projects on the release train.
All projects on the release train normalize their jars prior to signing, so whether you install the pack200 or jar artifact out of the Helios repository, you will obtain the identical jar at install time in both cases. This removes any risk of the end user having different jars at runtime. This last point also alleviates David's concern that we are somehow losing data by discarding unpacked jars: we could always recreate the identical unpacked artifacts at a later date from the packed artifacts, using the same version of pack200 that was used to normalize the jars in the first place. In any case, I don't have a strong feeling either way on this one. If the webmasters and mirrors think disk space is not a problem, then there is certainly no harm in keeping the unpacked artifacts around.

Assigning to myself, mostly to avoid widespread notifications (so add yourself to CC if interested).

(In reply to comment #2)
> ..., I see another reason for keeping the jars, namely hybrid repositories
> (both Maven and p2). So +1 for that.

What does this mean, Thomas? A Maven repository doesn't use pack.gz files, I assume? And Maven can have "metadata" (or whatever they call it) someplace else, and point to these jars?

> Keeping both highlights another issue. Not all jars benefit from the packing or
> benefit very little. For those jars, we should consider dropping the .pack.gz
> file and only retain the jar.

I agree we should be "smart" about packing only Java code, but I do think it should be based on the presence or absence of Java code in the jar, not just some "amount of savings". And I'm not sure that's really under our common-repo control ... it's something projects (or the "jar processor") would have to do better. But if you feel strongly about it, and others agree, I wouldn't block the effort. Perhaps a first step would be to find out how much savings there would be?
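The two heuristics debated here (drop the .pack.gz when savings fall below a threshold, or when the jar contains no Java code) could be sketched like this. The function name, the 10% default, and the combined policy are all illustrative assumptions, not actual aggregator options:

```python
import zipfile
from pathlib import Path

def keep_packed(jar: Path, packed: Path, min_savings: float = 0.10) -> bool:
    """Decide whether a .pack.gz sibling is worth keeping alongside a jar.

    Hypothetical policy combining both suggestions in this bug: keep the
    packed form only if the jar actually contains Java class files AND
    packing saves at least `min_savings` of the jar's size.
    """
    with zipfile.ZipFile(jar) as z:
        has_classes = any(name.endswith(".class") for name in z.namelist())
    savings = 1 - packed.stat().st_size / jar.stat().st_size
    return has_classes and savings >= min_savings
```

Under this sketch a source bundle (no .class entries) or a jar that barely compresses would ship as a plain jar only, matching the "smart about packing" idea either way the threshold debate is settled.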
(In reply to comment #5)
> (In reply to comment #3)
>
> This last point also alleviates David's concern that we are somehow losing data
> by discarding unpacked jars. We could always recreate the identical unpacked
> artifacts at a later date from the packaged artifacts, using the same version
> of pack200 that was used to normalize the jars in the first place.

Thanks John. I didn't believe you at first, so I went back and re-read some of the docs. I thought there were different options that could be specified and result in subtle differences, but upon re-reading I see that the same parameters must be used for the repack step as for the pack step, so if any information is thrown away, it is already thrown away in the repack step. This makes me think we may eventually (if not now) want to better specify what Simultaneous Release projects can specify. For example, I'm not sure it's "ok" for an individual project to decide to strip out line numbers or debug information. As far as I know, everyone uses the defaults (or possibly the E4 effort level) but ... best to be explicit, I'm finding.

(In reply to comment #7)
> (In reply to comment #2)
> > ..., I see another reason for keeping the jars, namely hybrid repositories
> > (both Maven and p2). So +1 for that.
>
> What does this mean, Thomas? A Maven repository doesn't use pack.gz files, I
> assume? And Maven can have "metadata" (or what ever they call it) someplace
> else, and point to these jars?

Yes, that sums it up. When we provide Helios as a repository that is consumable by both p2 and Maven, we create a hybrid that contains metadata for both. The actual artifacts can be shared as long as they are not packed.

> > Keeping both highlights another issue. Not all jars benefit from the packing or
> > benefit very little. For those jars, we should consider dropping the .pack.gz
> > file and only retain the jar.
>
> I agree we should be "smart" about packing only Java code, but I do think it
> should be based on presence or absence of Java code in the jar, not just some
> "amount of savings".

Why not? To me it sounds like that's the right thing to focus on. The presence of Java code is inferior to the actual savings.

> Perhaps a first step would be to find out how much savings there would be?

I think this is trivial, but I'll confer some more with Filip, who has done the major portion of the work on the aggregator.

(In reply to comment #2)
> While I think that the pack200 backward compatibility concern is a bit far
> fetched,

This made me have a funny thought, if you don't mind a small attempt at humor ... I bet programmers during the 1980s said the same thing when questioned about the use of 2-digit fields for 'year'. :)

(In reply to comment #9)
> (In reply to comment #7)
> > (In reply to comment #2)
> >
> > I agree we should be "smart" about packing only Java code, but I do think it
> > should be based on presence or absence of Java code in the jar, not just some
> > "amount of savings".
>
> Why not? To me it sounds like that's the right thing to focus on. The presence
> of java code is inferior to the actual savings.

Just that for the non-Java case, there is a (well-known?) technical reason not to use pack200. Just looking at "amount of savings" seems like it'd be an arbitrary cutoff, and the question of "what is enough savings to justify the extra storage" would depend on so many variables as to be not worth the effort to discuss, decide, and implement. Just my view.

Another factor, which I admit to not understanding very well, is that long, long ago, we used to run the OSGi Indexer on update sites so the site could act as one big bundle repository. Not sure if that was ever used much, but I have an intuition that it requires jars, not pack.gz files (but I'm not sure).
(In reply to comment #10)
> (In reply to comment #2)
> > While I think that the pack200 backward compatibility concern is a bit far
> > fetched,
>
> This made me have a funny thought, if you don't mind a small attempt at humor
> ... I bet programmers during the 1980's said the same thing, when questioned
> about the use of 2 digit fields for 'year'. :)

:-) I see your point. The reason I see this particular issue as a bit far-fetched is that if it ever becomes a problem in the future, then we can always unpack everything at that time with older software. Another reason is that I can't see why the Java community would break backward compatibility and hurt its own users. A third is that Java is now an open community: we can participate in the community ourselves and influence its direction. I'm convinced we'll have some say in the matter, especially if we have major repositories that break. All moot points though, since we agree that the jars should be present in Helios.

(In reply to comment #11)
> Just that for the non-java case, there is a (well known?) technical reason not
> to use pack200. For just looking as "amount of savings", it seems like it'd be
> an arbitrary cutoff, and the question of "what is enough savings to justify the
> extra storage" would depend on so many variables as to be not worth the effort
> to discuss, decide, and implement. Just my view.

Aha. OK, I can understand the "simple rule" argument. The reason I'm advocating the "amount of savings" is that when I look at source bundles I sometimes see a significant saving; I was surprised by cases where it's almost 50%. This folder is a good example: ~/downloads/modeling/emf/emf/updates/2.6/plugins

One reason seems to be that gzip alone compresses the jar, and perhaps that's something we should consider for p2: a third format that gzips the jar but doesn't involve pack200 (extension .jar.gz), suitable for features, source bundles, documentation, etc.
(In reply to comment #12)
> Another factor, which I admit to not understanding very well, is that long long
> ago, we used to run OSGi Indexer on update sites so the site could act as one
> big bundle repository. Not sure if that was ever used much, but I have an
> intuition that it requires jars, not packed.gz files. (but, not sure).

Sounds like a tool that should be rewritten to use the metadata, not the jars themselves.

Given the amount of flak I got for forcing Athena builds to produce ONLY packed jars (unless specific instructions were included to not pack individual plugins using pack.properties files) [bug 306300], I don't see why packing is really helpful. I've also seen a trend toward using zipped p2 repos [1] rather than unzipped sites, as the zip provides all the plugins, features, and metadata in a consumable state which doesn't get piecemeal mirrored and suffer from the "I got everything except a single plugin jar so I had to restart the update process 2-3 times" syndrome that I've seen many times over with some Eclipse.org mirrors. The zip also provides a snapshot that doesn't move unexpectedly (unless you're consuming a *SNAPSHOT.zip from within a Hudson job's workspace or artifacts, in which case buyer beware).
[1] http://download.eclipse.org/athena/repos/

Taking the EMF 2.6M6 update site as an example:

10M emf-xsd-Update-2.6.0M6.zip
15M emf-xsd-Update-2.6.0M6-unpack.zip (repacked to include unpacked jars ONLY)
11M emf-xsd-Update-2.6.0M6.zip on disk, including packed jars ONLY
16M emf-xsd-Update-2.6.0M6.zip on disk, including unpacked jars ONLY
25M emf-xsd-Update-2.6.0M6.zip on disk, including both packed/unpacked jars

From this we can see:

* zipping a repo saves less than 10% (11/10 or 16/15), but download is easier
* keeping both packed/unpacked jars increases disk use by >50% (25/16 or 25/11)
* using packed jars reduces disk use by 33% (1 - 11/16 or 1 - 10/15)

So, assuming you want an unzipped repo with both packed/unpacked jars (to accommodate everyone), and a repo zip with only packed jars (to make the large download faster/smaller), your footprint would be 25 + 10 = 35M. Or, if you wanted to skip packing entirely, you'd have 16 + 15 = 31M. One might therefore argue that given the problems inherent in the packing/unpacking process, and the fact that because of those problems one needs to keep both packed and unpacked artifacts, packed jars simply aren't worth using at all. But I'm a little verklempt, so I'll let you discuss. :)

Nick,

Consider that the primary objective for a repository is to serve the clients that install software. These clients account for the vast majority of the downloads. Because of that, disk size becomes less important, since a) the client cares about speed and stability, and b) disk is extremely cheap compared to network bandwidth. The two main concerns are therefore to reduce network bandwidth and increase stability. We achieve this by:

1. Only downloading exactly the bits that are needed, i.e. finer granularity.
2. Packing those bits as much as possible.
3. Providing unpacked bits as a fallback for some scenarios (unpack failure, or local access).
4. Extensive mirroring to geographically close locations.
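Nick's percentages can be checked directly from the listed sizes. The following sketch just replays the arithmetic from the EMF 2.6M6 figures quoted above:

```python
# Sizes in MB, copied from the EMF 2.6M6 example in the comment.
zipped_packed, zipped_unpacked = 10, 15
disk_packed, disk_unpacked, disk_both = 11, 16, 25

# Zipping a repo saves under 10%.
assert 1 - zipped_packed / disk_packed < 0.10      # ~9%
assert 1 - zipped_unpacked / disk_unpacked < 0.10  # ~6%

# Keeping both forms costs >50% more disk than either alone.
assert disk_both / disk_unpacked > 1.5
assert disk_both / disk_packed > 1.5

# Packed jars use roughly a third less disk.
assert round(1 - disk_packed / disk_unpacked, 2) == 0.31
assert round(1 - zipped_packed / zipped_unpacked, 2) == 0.33

# The two serving strategies compared in the comment:
print(disk_both + zipped_packed, "MB vs", disk_unpacked + zipped_unpacked, "MB")  # → 35 MB vs 31 MB
```

The 35M-vs-31M comparison is the crux of the argument: once you must keep both forms on disk anyway, skipping pack200 entirely is the smaller total footprint.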
Unlike you, I think the trend is moving towards using p2 for provisioning, and that includes the build use-cases. PDE now provisions a target platform quite nicely using p2. Buckminster does too (including the Helios Aggregator). The much-wanted Maven access is likely to affect this trend positively as well. Another complicating factor when zipping is that Helios now preserves old artifacts forever. Like the Maven repository at Maven Central, it grows over time (last time I checked, the Maven repo was well beyond 5GB in size).

I'm a little late to the party, but I scanned the posts and feel there is one issue that, from a Helios point of view, dictates the answer. If only packed JARs are available in the repo, then people using JREs that do not support unpack cannot use Helios at all. Sure, the tooling basically requires Java 5, but we do more than tooling. There is a separate topic of what the contributing projects produce. AFAICT they can (at least in theory) produce whatever form they want, as long as the Helios build that consumes their repos can create both the packed and unpacked JARs from their input.

I think I've discovered part of the reason for some of the confusion in some of the discussions around this topic, at least on my part. I went to turn this "on" for our M7 repository, by changing "-packedStrategy", and discovered there is no option to "copy both". Is that right? The only option is to "unpackAsSibling"?

I'm resolving this now, since I threw the switch, and the increase in size was as expected ... approximately doubled, to 1G for "a release". I do, BTW, still think it important (required, even) for projects to provide both jars and pack.gz files, at least for now. Given time and discussion, we (and adopters) might feel comfortable with other processes ... but we need to take that time and discussion to be sure we meet expectations and are as consistent as we should be.
I hope many of you test the repo, currently at /releases/staging, to make sure it is being created correctly (metadata and all). Lastly, I do still think there is merit to a "try and copy both files" option, so if one of them "fails", or fails to verify as they now do from the 'subversive' project, we'd still end up with a usable repository, instead of just failing to create one and waiting for them to make their fix. And, sorry Denis ... guess you'll have to sell back all those Rolexes you got last year? :)

(In reply to comment #18)
> Lastly, I do still think there is merit to a "try and copy both files" option,
> so if one of them "fails" or fails to verify as they now are from 'subversive'
> project, we'd still end up with a usable repository, instead of just failing to
> create one, and waiting for them to make their fix.

Historically, the reason we don't approve of corrupt files is that we want them fixed a.s.a.p. We have no mechanism to say: "Your Helios contribution is corrupt but we are able to make sense of it anyway. We'll approve it for now but please consider fixing it in time for the next milestone". Do you think we should add that to the automated aggregation process? I'm a bit concerned if we make it the aggregator's business to try to make sense of broken artifacts and "fix problems", at least in this context where it might lessen the contributors' motivation to fix the source of the problem.
> Do you think we should add that to the automated aggregation process? I'm a bit
> concerned if we make it the aggregators business to try and make sense of
> broken artifacts and "fix problems". At least in this context where it might
> lessen the contributors motivation to fix the source of the problem.
It's hard to always know the right balancing point, but in my experience, it's best not to hold up everyone else if it is just a matter of treating errors as warnings and continuing as best as possible ... conceptually like setting 'failOnError=false'. But I agree, the aggregator should not "jump through hoops" to fix broken repos ... and should, BTW, still have the option to, conceptually, set 'failOnError=true' when desired.
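The treat-errors-as-warnings idea could look something like this in an aggregation loop. This is only a sketch; `fail_on_error` is a hypothetical flag standing in for the conceptual 'failOnError' setting, not an actual aggregator option:

```python
def aggregate(contributions, fail_on_error=True):
    """Copy each contribution into the composite repository, optionally
    downgrading per-contribution failures to warnings so that one broken
    artifact does not block the whole repository build."""
    repo, warnings = [], []
    for name, artifact in contributions:
        try:
            if artifact is None:  # stand-in for an artifact that fails verification
                raise ValueError(f"{name}: artifact failed verification")
            repo.append(artifact)
        except ValueError as err:
            if fail_on_error:     # 'failOnError=true': stop the whole build
                raise
            warnings.append(str(err))  # 'failOnError=false': note it, move on

    return repo, warnings

repo, warns = aggregate([("emf", "emf.jar"), ("subversive", None)],
                        fail_on_error=False)
print(len(repo), len(warns))  # → 1 1
```

Either mode remains available, matching the point above: lenient by default when you just want a usable repository, strict when you want contributors forced to fix the source of the problem.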