Bug 326383 - Automate IP checks of downloads
Summary: Automate IP checks of downloads
Status: RESOLVED FIXED
Alias: None
Product: Community
Classification: Eclipse Foundation
Component: IP Log Tool
Version: unspecified
Hardware: PC Mac OS X - Carbon (unsup.)
Importance: P3 enhancement
Target Milestone: ---
Assignee: Wayne Beaton CLA
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-09-28 05:06 EDT by Glyn Normington CLA
Modified: 2014-04-04 15:57 EDT
CC List: 2 users

See Also:


Attachments
output of zipinfo -l for Virgo web server milestone 4 zip (140.77 KB, text/plain)
2010-09-30 06:43 EDT, Glyn Normington CLA

Description Glyn Normington CLA 2010-09-28 05:06:47 EDT
There has been discussion in the RT PMC of what constitutes "released content" as this is crucial to applying the IP policy correctly. The term is not currently crisply defined. There is also a general issue of manual, and therefore time-consuming and error-prone, checks being necessary to compare the contents of downloads to CQs.

One way to formalise released content would be to focus on the download area and produce a tool which automatically guarantees all archive content and other files in the download area match CQs for the relevant project. Content destined for the download area would be placed there using the tool and would need to pass an automated check first. A failed check would return a report listing the problem files and would not update the download area.

Apart from formalising released content, such an approach would remove a source of human error and reduce the manual checking needed when a project is released. Admittedly this would take some investment, as such a tool would be a significant addition to the IP log tool, but I think it would pay for itself over time.

The tool could be implemented by introducing secure hashes into IPZilla. Each JAR covered by a CQ would be attached to the CQ and a secure hash would be calculated and recorded against that CQ. To check content being placed in the download area, any archives would be exploded and then each JAR would have its secure hash calculated and mapped to a CQ. If there was no matching CQ for the relevant project, then the check would fail.
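
A minimal sketch of the check described above, assuming SHA-256 as the secure hash and a plain dictionary standing in for the hash-to-CQ records that IPZilla would hold. The function names and the `known_hashes` structure are hypothetical; nothing here reflects an actual IPZilla API.

```python
import hashlib
import io
import zipfile


def jar_hashes(archive_bytes):
    """Return {entry name: SHA-256 hex digest} for every JAR inside a zip archive."""
    hashes = {}
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        for name in archive.namelist():
            if name.lower().endswith(".jar"):
                # Hash the raw bytes of the JAR exactly as shipped.
                hashes[name] = hashlib.sha256(archive.read(name)).hexdigest()
    return hashes


def unmatched_jars(archive_bytes, known_hashes):
    """List JARs whose hash has no CQ recorded for the project.

    known_hashes is a hypothetical {digest: CQ number} mapping.
    """
    return sorted(
        name
        for name, digest in jar_hashes(archive_bytes).items()
        if digest not in known_hashes
    )
```

An empty result from `unmatched_jars` would mean the check passes. The sketch only looks one level deep; JARs nested inside other archives would need the same treatment applied recursively.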

Such a check would need to be introduced "gently" to allow the set of secure hashes to be built up gradually. So, for an introductory period of perhaps 6 months, the process should allow content through to the download server without insisting on matching CQs, but report back to the user which JARs were not matched to CQs and warn that this will later prevent the content from reaching the download area. After the introductory period, when all projects should have ensured they are clean, the check could be made hard.
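
The soft-then-hard rollout could be sketched as a single gate function. This is only an illustration: the name `gate_upload`, the `hard_from` cut-over date and the message wording are all assumptions.

```python
from datetime import date


def gate_upload(unmatched, today, hard_from):
    """Decide whether content may go to the download area.

    Returns (allowed, message). Before hard_from, unmatched JARs only
    produce a warning; from hard_from onward they block the upload.
    """
    if not unmatched:
        return True, "All JARs matched a CQ."
    listing = ", ".join(sorted(unmatched))
    if today < hard_from:
        return True, (
            f"WARNING: no matching CQ for: {listing}. "
            f"This will block uploads from {hard_from.isoformat()}."
        )
    return False, f"REJECTED: no matching CQ for: {listing}."
```

Keeping the cut-over date as a parameter means the same code path serves both the introductory period and the hard check, so the report a committer sees stays the same when the switch happens.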
Comment 1 Wayne Beaton CLA 2010-09-28 07:48:24 EDT
I've started implementing some of this. I didn't get as far as secure hashes, but I like that idea.

As I've been doing the manual checks, I've been building up a mapping between CQs and corresponding bundle names. You can see the results here:

http://eclipse.org/projects/tools/ip_cq_overview.php

FWIW, this page is generated from a combination of a query against IPZilla, and a text file that just maps a CQ number to a JAR file (ultimately, the contents of this text file should probably go into a db). I've been holding off on the complete implementation while I wait for the mapping to be more complete. We've probably already reached a tipping point and it's time to put more work into this.
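
For illustration only, here is how such a text file might be loaded. The actual file format is not given in this bug; the sketch assumes one `<CQ number> <JAR name>` pair per line and keeps lookups in both directions, so a CQ can cover several JARs and a JAR can appear under several CQs.

```python
from collections import defaultdict


def parse_mapping(text):
    """Parse lines of '<CQ number> <JAR name>' into two lookup tables.

    The line format is an assumption made for this sketch.
    """
    jars_by_cq = defaultdict(set)
    cqs_by_jar = defaultdict(set)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        cq, jar = line.split(None, 1)
        jars_by_cq[int(cq)].add(jar)
        cqs_by_jar[jar].add(int(cq))
    return jars_by_cq, cqs_by_jar
```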

I am always concerned about adding more work for committers. My intent has been to incorporate this into the IP Log tool (Woolsey) to make it easy to identify the bundles that we haven't mapped yet, and report those mappings so that we can make our data more complete.

I think that Bug 298358 is related.
Comment 2 Glyn Normington CLA 2010-09-28 07:58:03 EDT
Sounds good Wayne! If you update this bug as you make progress, we'll be happy to provide feedback.

One suggestion. You are thinking of mapping a CQ number to a JAR, but I thought some CQs covered more than one JAR, in which case it would be better to map a JAR to a CQ to allow for many-to-one relationships.
Comment 3 Wayne Beaton CLA 2010-09-28 08:06:08 EDT
Take a harder look at the link. It supports multiple mappings. n-to-m in fact.
Comment 4 Glyn Normington CLA 2010-09-28 08:18:58 EDT
(In reply to comment #3)
> Take a harder look at the link. It supports multiple mappings. n-to-m in fact.

My mistake. Sorry.
Comment 5 Wayne Beaton CLA 2010-09-29 14:12:34 EDT
Check this out:

http://eclipse.org/projects/tools/bundle_scanner.php

It compares a provided list of bundles against our database. It uses the same information used to generate the IP Log, but doesn't involve the actual IP Log itself. I should be able to provide that integration as part of Woolsey.

It's a start. I'll be using this myself so it should evolve rather quickly.
Comment 6 Glyn Normington CLA 2010-09-30 06:42:36 EDT
Thanks Wayne. I tried Virgo milestone 4 and the output wasn't a great success. I used a product id of virgo, which may have been wrong, although the tool didn't complain. It seemed to have trouble parsing the output of zipinfo -l (run on Mac OS X). I'll attach the output for you to try. Here's a snip of some of the output:

Other Bundles
We're not sure what's going on with these bundles.

Archive:
virgo-web-server-2.1.0.M04-incubation.zip
40592869
944
drwxr-xr-x
2.0
unx
...
Comment 7 Glyn Normington CLA 2010-09-30 06:43:17 EDT
Created attachment 179940
output of zipinfo -l for Virgo web server milestone 4 zip
Comment 8 Wayne Beaton CLA 2010-09-30 08:38:12 EDT
I see the first problem: more documentation/help is required.

Use the full project id: rt.virgo.

Virgo was one of my test cases, it should work pretty well...

use zipinfo -1 (i.e. "one")

It's not smart enough to disregard all the file info stuff.
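
`zipinfo -1` prints one path per line with no other columns. A rough sketch of turning such a listing into bundle file names, assuming bundles are the entries ending in `.jar`. The function name is made up for illustration and this is not the scanner's actual code.

```python
import posixpath


def bundle_names(listing):
    """Extract JAR file names from `zipinfo -1` style output (one path per line)."""
    names = []
    for line in listing.splitlines():
        path = line.strip()
        if path.lower().endswith(".jar"):
            # Zip entries use forward slashes, so strip the directory part.
            names.append(posixpath.basename(path))
    return names
```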
Comment 9 Wayne Beaton CLA 2010-09-30 08:42:56 EDT
I've been thinking more about using secure hashes. I wonder if we can just add them as attachments with some kind of naming pattern. It should be easy enough to write a query to extract them. We can mark old or incorrect ones as obsolete. We should be able to make that work without modifying our bugzilla instance.
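
As a sketch of the attachment idea, assuming each hash is stored as an attachment named `<jar file name>.sha256` whose body is the digest. Both the naming pattern and the record keys used below are hypothetical; they are not Bugzilla field names.

```python
import re

# Hypothetical naming pattern: "<jar file name>.sha256"
HASH_ATTACHMENT = re.compile(r"^(?P<jar>.+\.jar)\.sha256$")


def current_hashes(attachments):
    """Map JAR name -> digest from non-obsolete hash attachments.

    attachments is a list of dicts with the hypothetical keys
    'filename', 'obsolete' and 'content'.
    """
    hashes = {}
    for att in attachments:
        match = HASH_ATTACHMENT.match(att["filename"])
        if match and not att["obsolete"]:
            hashes[match.group("jar")] = att["content"].strip()
    return hashes
```

Attachments marked obsolete are skipped, which matches the suggestion of retiring old or incorrect hashes without deleting them.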
Comment 10 Glyn Normington CLA 2010-10-01 06:49:32 EDT
(In reply to comment #8)
> I see the first problem: more documentation/help is required.
> 
> Use the full project id: rt.virgo.
> 
> Virgo was one of my test cases, it should work pretty well...
> 
> use zipinfo -1 (i.e. "one")
> 
> It's not smart enough to disregard all the file info stuff.

That's much better. Worked a treat. :-)
Comment 11 Glyn Normington CLA 2010-10-01 06:51:58 EDT
(In reply to comment #9)
> I've been thinking more about using secure hashes. I wonder if we can just add
> them as attachments with some kind of naming pattern. It should be easy enough
> to write a query to extract them. We can mark old or incorrect ones as
> obsolete. We should be able to make that work without modifying our bugzilla
> instance.

Sounds reasonable. I wonder if the committer should attach the binary JAR and the genie should calculate the secure hash and store it in the details of the attachment? Otherwise we need some way for the committer to relate the binary JAR to the CQ. I'm sure you have this figured out already...
Comment 12 Wayne Beaton CLA 2010-10-15 16:51:32 EDT
I spoke with a university professor today about "JAR fingerprinting" work that he's done. I'll be speaking with some graduate students next week about the problem; I'm hopeful that we'll be able to attract some academic involvement.

Glyn, do you have any links or information about what constitutes a reasonable "secure hash" for a bundle JAR?
Comment 13 Glyn Normington CLA 2010-10-16 06:59:52 EDT
Virgo uses a SHA hash when caching bundle JARs, which seemed a reasonable choice given the extremely low probability of two (slightly) different bundles yielding the same hash. This is the approach taken by Git.

I think going beyond SHA is overkill. Perhaps the academics are trying to solve a different problem?

You can extract the code from here if you like:

http://git.eclipse.org/c/virgo/org.eclipse.virgo.artifact-repository.git/tree/org.eclipse.virgo.repository/src/main/java/org/eclipse/virgo/repository/internal/ShaHashGenerator.java
Comment 14 Wayne Beaton CLA 2010-10-16 07:31:35 EDT
It occurred to me as I was driving home that md5 or sha would probably do it. I'm still hopeful that I can get some students to do the work :-)
Comment 15 Glyn Normington CLA 2010-10-16 08:18:34 EDT
(In reply to comment #14)
> It occurred to me as I was driving home that md5 or sha would probably do it.
> I'm still hopeful that I can get some students to do the work :-)

Fair enough, but have you ever maintained code written by students? ;-)
Comment 16 Wayne Beaton CLA 2014-04-04 15:57:04 EDT
I've been using the download scanner tool for a few years now to help with the technical assessment of downloads. It does a very good (though not 100%) job on OSGi-based downloads.

https://www.eclipse.org/projects/tools/downloads.php

Note that only committers can access this page.

I believe that the original intent of this bug has been addressed and am marking it FIXED. We may need to open new bugs to address deficiencies, or support for other languages/technologies.