Bug 302160

Summary: Simple download statistics
Product: [Eclipse Project] Equinox
Reporter: John Arthorne <john.arthorne>
Component: p2
Assignee: John Arthorne <john.arthorne>
Status: RESOLVED FIXED
QA Contact:
Severity: normal
Priority: P3
CC: beth, bluesoldier, david_williams, g.watson, irbull, Kenn.Hussey, kim.moir, Mike_Wilson, mober.at+eclipse, nboldt, overholt, pascal, sbouchet, webmaster
Version: 3.6
Target Milestone: 3.6 M7
Hardware: PC
OS: Windows XP
Whiteboard:
Attachments:
Description Flags
Download statistics implementation none

Description John Arthorne CLA 2010-02-08 11:27:03 EST
I had a thought this morning about how we could implement simple collection of download statistics. We have (at least) the following requirements:

1) Collecting stats must be "best effort", and not cause transfers to fail if stats could not be collected

2) The repository must be able to control how/if this collection is performed, so that someone redistributing the content can "turn it off" or redirect statistics elsewhere.

3) No personal information should be collected

4) Stats collection must be coarse-grained. I.e., once per artifact would be too many round trips.

Here is a simple solution that I think satisfies these requirements. Any given artifact descriptor could specify a property for stats collection. If this property is absent then no statistics would be gathered. Example:

    <artifact classifier='osgi.bundle' id='org.eclipse.osgi' version='3.6.0.v20100128-1430'>
      <properties size='2'>
        <property name='download.stats' value='org.eclipse.osgi'/>
        ...

The artifact repository itself would also have an optional property with a "stats URL" as a destination for reporting download statistics to:

<repository name='My Repo' type='org.eclipse.equinox.p2.artifact.repository.simpleRepository' version='1.0.0'>
  <properties size='3'>
    <property name='p2.statsURI' value='http://download.eclipse.org/stats'/>
    ...

After a successful artifact download, we would check for the property on the artifact descriptor and the property on the repository. If both properties are present, we would construct a URI by combining the two:

statsURI = http://download.eclipse.org/stats/org.eclipse.osgi

We would then perform a simple HTTP HEAD request on this URI. Any failure would simply be logged. On the server side someone could use the server logs to count all the HEAD requests to obtain the aggregate download stats.
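
A minimal Java sketch of the client side of this proposal, for illustration only. It is not the attached implementation; the class and method names (`DownloadStatsSketch`, `buildStatsUri`, `report`) and the five-second timeouts are assumptions. It shows the two points that matter: no URI is built unless both properties are present, and any failure of the HEAD request is logged and swallowed so the transfer is never affected.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

final class DownloadStatsSketch {

    /**
     * Joins the repository's stats root (p2.statsURI) and the artifact's
     * stats key (download.stats). Returns null if either property is absent,
     * in which case no statistics are reported.
     */
    static URI buildStatsUri(String repoStatsRoot, String artifactStatsKey) {
        if (repoStatsRoot == null || artifactStatsKey == null) {
            return null;
        }
        String root = repoStatsRoot.endsWith("/") ? repoStatsRoot : repoStatsRoot + "/";
        return URI.create(root + artifactStatsKey);
    }

    /**
     * Best-effort HEAD request. Returns true if the server answered with a
     * status below 400; never throws, so a stats failure cannot fail a download.
     */
    static boolean report(URI statsUri) {
        if (statsUri == null) {
            return false;
        }
        try {
            HttpURLConnection conn = (HttpURLConnection) statsUri.toURL().openConnection();
            conn.setRequestMethod("HEAD");
            conn.setConnectTimeout(5000); // assumed timeout, not from the bug report
            conn.setReadTimeout(5000);
            int code = conn.getResponseCode();
            conn.disconnect();
            return code < 400;
        } catch (IOException | RuntimeException e) {
            // Requirement 1: failures are only logged.
            System.err.println("Stats reporting failed (ignored): " + e);
            return false;
        }
    }
}
```

With the example values above, `buildStatsUri("http://download.eclipse.org/stats", "org.eclipse.osgi")` yields `http://download.eclipse.org/stats/org.eclipse.osgi`.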
Comment 1 John Arthorne CLA 2010-02-08 11:28:27 EST
Created attachment 158483 [details]
Download statistics implementation
Comment 2 John Arthorne CLA 2010-02-08 16:58:00 EST
Denis, what do you think of this general approach? In particular, is a HEAD request on some unique URL sufficient for you to be able to gather useful statistics on your end?
Comment 3 Denis Roy CLA 2010-02-08 20:17:40 EST
Sounds acceptable to me.  In layman terms, what is an artifact?
Comment 4 Ian Bull CLA 2010-02-09 01:17:11 EST
(In reply to comment #3)
> Sounds acceptable to me.  In layman terms, what is an artifact?

An artifact is the actual bytes that you need to download, put on disk, and run. For example, a plug-in (bundle) is an artifact.  So are the launchers.  

For a bit of background, p2 separates things into Metadata and Artifacts.  The metadata describe "what" can be installed. Metadata might describe that plug-in A "requires" plug-in B.  p2 uses this to "plan" what needs to be installed (and what the user already has).  Once this is all figured out, the "artifacts" are downloaded (if they are needed).  -- This is a bit of a rough description of p2, but it should suffice :-).

I missed the p2 call today, but I think what John is suggesting is to tie the stats to the description of the actual artifact. That way, we only track things that are actually downloaded.
Comment 5 David Williams CLA 2010-02-09 01:34:00 EST
Seems to me one drawback of the proposal is that it would be pretty easy (if not tempting) to spoof. That is, someone could write a kiddie script to do nothing but head requests to that URL.  Granted, it would be kind of malicious ... from which I guess there can never be complete protection. Just thought I'd mention it so rudimentary safeguards could be put in place, if possible.
Comment 6 Ian Bull CLA 2010-02-09 12:05:16 EST
(In reply to comment #5)
> Seems to me one drawback of the proposal is that it would be pretty easy (if
> not tempting) to spoof. That is, someone could write a kiddie script to do
> nothing but head requests to that URL.  Granted, it would be kind of malicious
> ... from which I guess there can never be complete protection. Just thought I'd
> mention it so rudimentary safeguards could be put in place, if possible.

On the server we could (and probably should) set up something that looks for malicious behaviour (same IP address hitting the same URL over and over again, for example).

However, I wonder if we can do better.  Is it possible to generate a unique key that gets appended to the URL (as a query parameter)?  Our stats processing tool would use this key to determine if it really came from "Eclipse".  Of course, being open source, it would be easy for someone to copy any algorithm we came up with.  Does anybody know of techniques for dealing with this?
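
As a rough illustration of the server-side safeguard mentioned above, here is a Java sketch that counts HEAD requests per stats URL from access-log lines and counts each (IP, URL) pair only once. Everything in it is an assumption: the log lines are taken to be in Apache common log format, the `/stats/` path prefix is just the example from this bug, and `StatsLogCounter` is a hypothetical name. Counting one hit per IP is a deliberately crude policy (it undercounts users behind a shared address), so treat it as a starting point rather than a recommendation.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class StatsLogCounter {

    // Assumed Apache common log format: client IP first, request line in quotes.
    private static final Pattern HEAD_LINE =
            Pattern.compile("^(\\S+) .*\"HEAD (/stats/\\S+) HTTP/[0-9.]+\"");

    /** Counts HEAD requests per stats path, ignoring repeats from the same IP. */
    static Map<String, Integer> countUnique(List<String> logLines) {
        Set<String> seen = new HashSet<>();
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : logLines) {
            Matcher m = HEAD_LINE.matcher(line);
            if (!m.find()) {
                continue; // not a HEAD request on a stats URL
            }
            String ip = m.group(1);
            String path = m.group(2);
            // Set.add returns false if this (IP, path) pair was already counted.
            if (seen.add(ip + " " + path)) {
                counts.merge(path, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```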
Comment 7 Denis Roy CLA 2010-02-09 15:24:34 EST
> Seems to me one drawback of the proposal is that it would be pretty easy (if
> not tempting) to spoof. 

We get _tons_ of that today -- people repeatedly hitting www.eclipse.org/downloads/download.php?file= and download.eclipse.org/somefile.zip zillions of times for a purpose not known to me.  The person analyzing the trend data should be able to spot these data anomalies.

As long as the Internet exists we will have idiots.  Just look at SMTP  :)
Comment 8 David Williams CLA 2010-02-09 15:41:21 EST
(In reply to comment #7)
> ...  The person analyzing the trend data should be able to spot these data
> anomalies.
>

Fine by me. I guess it is no different than the download.php?file= type solution. 

And the bug title does say 'simple'. :) [That is, I'm not advocating a more complex solution ... just wanted to discuss the issue.]

thanks,
Comment 9 David Williams CLA 2010-02-09 15:49:43 EST
I will make another comment about an implementation detail ... about the statement: 

<quote>
The artifact repository itself would also have an optional property with a
"stats URL" as a destination for reporting download statistics
</quote>

Should this literally be part of the artifact repository file? Or a separate file that's at the same location? It might be easier to "mirror" or duplicate repositories (without change), and change only the reporting file ... for cases where it was desired to change the reporting URL. I made a similar suggestion about the "p2mirrorsURL" and that went nowhere ... just thought I'd mention it again. Seems odd to tweak a repository file for "policy" information ... but perhaps I'm missing the bigger picture of what a repository is.
Comment 10 John Arthorne CLA 2010-02-10 14:59:04 EST
(In reply to comment #9)
> should this literally be part of the artifact repository file? Or a separate
> file, that's at the same location? It might be easier to "mirror" or duplicate
> repositories (without change), and change only the reporting file ... for cases
> where it was desired to change the reporting URL. I made a similar suggestion
> about the "p2mirrorsURL" and that went nowhere ... just thought I'd mention it
> again. Seems odd to tweak a repository file for "policy" information ...
> but, perhaps I'm missing the bigger picture of what a repository is.

It was certainly intentional to put this "policy" info at the repository level. Someone hosting a repository elsewhere with the same contents may quite reasonably want to alter the mirror URL and stats URL, so putting that information as a repository property allows them to do that. We don't really have any other separate place to put this data at the moment. I think in general this "stats" URL will be less sensitive than the mirror URL, which needs updating today whenever the repository is moved (which is a problem but we just don't have a solution for it). I imagine eclipse.org for example would have a single stats URL, and a given artifact downloaded from any number of different repos at eclipse.org would report stats to the same place (this could be viewed as a feature since it simplifies collection and aggregation of the download data).
Comment 11 John Arthorne CLA 2010-02-11 09:22:32 EST
I have released this to HEAD to allow further testing. If the required property is not set on both the artifact descriptor and the repository this code has no effect. 

During the p2 call it was mentioned that the artifact descriptor is perhaps not the best location for the stats property. However after thinking about it more, I can't think of another place that would have the same flexibility. Properties on artifact descriptors are not part of the "identity" or equality of descriptors, so there is no problem if other repositories have the same descriptor but with a missing or different stats collection property. 

In any case I think the general approach is quite flexible, and we can change where the statistics properties are stored later on if need be.
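
To illustrate the point about identity, here is a toy Java class, explicitly not p2's own descriptor implementation, in which equality depends only on classifier, id and version. `ToyDescriptor` and its fields are hypothetical names chosen for this sketch. Two descriptors that differ only in whether they carry a `download.stats` property still compare equal, which is why a redistributor can drop or change the property without creating a "different" artifact.

```java
import java.util.Map;
import java.util.Objects;

final class ToyDescriptor {
    final String classifier;
    final String id;
    final String version;
    final Map<String, String> properties; // e.g. download.stats; not part of identity

    ToyDescriptor(String classifier, String id, String version, Map<String, String> properties) {
        this.classifier = classifier;
        this.id = id;
        this.version = version;
        this.properties = properties;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (!(o instanceof ToyDescriptor)) {
            return false;
        }
        ToyDescriptor d = (ToyDescriptor) o;
        // Properties are deliberately left out of the comparison.
        return classifier.equals(d.classifier) && id.equals(d.id) && version.equals(d.version);
    }

    @Override
    public int hashCode() {
        return Objects.hash(classifier, id, version);
    }
}
```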
Comment 12 Denis Roy CLA 2010-02-23 15:50:36 EST
> I have released this to HEAD to allow further testing.

How can I help test this further, John?
Comment 13 Denis Roy CLA 2010-04-19 13:41:23 EDT
> How can I help test this further, John?

How can I help test this further, John?  Where can interested parties find some basic way of enabling this for testing?  I need to pipe these URLs into our database to allow for querying, so the sooner I can start seeing what they look like, the better.
Comment 14 John Arthorne CLA 2010-04-19 16:52:21 EDT
(In reply to comment #13)
> > How can I help test this further, John?
> How can I help test this further, John?

I'll let you know. I did some testing this afternoon and found some problems that will be fixed for tomorrow's build. By M7 we should be able to test on the M7 version of the release repository itself. I need to create some more documentation for it as well. The main question on your end is what URL should we use as the statistics gathering root URL. I.e., if it was something like "http://eclipse.org/stats" then you would be getting HEAD requests logged like this:

http://eclipse.org/stats/org.eclipse.platform
http://eclipse.org/stats/org.eclipse.cdt.core
...
Comment 15 Denis Roy CLA 2010-04-21 10:09:13 EDT
If these requests are sent in the background (i.e., the user is not waiting for them to return anything) I would much prefer they be sent to download.eclipse.org:

http://download.eclipse.org/stats/org.eclipse.platform
http://download.eclipse.org/stats/org.eclipse.cdt.core
...
Comment 16 Nick Boldt CLA 2010-04-21 14:54:40 EDT
So what do end users do to test this? Wait for M7, install and use it, then do an update/install and check some stats page somewhere to see what we logged?

Is there something developers need to do to "stats-enable" their repos' files to be logged/tracked, akin to adding the Google Analytics tracking code into webpages?
Comment 17 John Arthorne CLA 2010-04-21 15:08:25 EDT
(In reply to comment #16)
> So what do end users do to test this? Wait for M7, install and use it, then do
> an update/install and check some stats page somewhere to see what we logged?
> 
> Is there something developers need to do to "stats-enable" their repos' files
> to be logged/tracked, akin to adding the Google Analytics tracking code into
> webpages?

What information is logged, and where the information goes, is entirely controlled by the repository. There are two steps to enable it:

1) In the artifact repository that you want to track downloads from, add a "p2.statsURI" property specifying the statistics URL (in artifacts.jar):

<repository name='Update Site' type='org.eclipse.equinox.p2.artifact.repository.simpleRepository' version='1'>
  <properties size='3'>
    <property name='p2.timestamp' value='1269575706171'/>
    <property name='p2.compressed' value='true'/>
    <property name='p2.statsURI' value='http://arthorne.com/bogusstats'/>

(please don't use arthorne.com, this is just what I was using to test because I have access to the server logs for that site ;))

2) In the same repository, add a "download.stats" property to each artifact that you want to gather stats for. You can pick one plugin in your feature for example:

    <artifact classifier='osgi.bundle' id='test.plugin1' version='1.0.0.201003261255'>
      <properties size='3'>
        <property name='artifact.size' value='0'/>
        <property name='download.size' value='1757'/>
        <property name='download.stats' value='test.plugin1.bundle'/>
      </properties>
    </artifact>

In this example, after a successful download a HEAD request will be issued to:

http://arthorne.com/bogusstats/test.plugin1.bundle

(value of the "download.stats" property appended to the value of the "p2.statsURI").

You can test this yourself using a platform integration build from *this* week, using any repository you like.
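
One way to sanity-check a hand-edited repository before publishing it is to list the stats URIs it would produce. The following Java sketch does that with the JDK's DOM parser. It is an assumption-laden helper, not part of p2: `StatsUriLister` is a hypothetical name, the input is the uncompressed artifacts.xml content as a string, and the URI is formed exactly as described in the two steps above (the "download.stats" value appended to the "p2.statsURI" value).

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

final class StatsUriLister {

    /** Value of the named property in the owner's direct <properties> child, or null. */
    static String property(Element owner, String name) {
        for (Node c = owner.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c instanceof Element && "properties".equals(c.getNodeName())) {
                NodeList props = ((Element) c).getElementsByTagName("property");
                for (int i = 0; i < props.getLength(); i++) {
                    Element p = (Element) props.item(i);
                    if (name.equals(p.getAttribute("name"))) {
                        return p.getAttribute("value");
                    }
                }
            }
        }
        return null;
    }

    /** Lists the stats URIs that downloads from this repository would report to. */
    static List<String> statsUris(String artifactsXml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(artifactsXml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse artifacts.xml content", e);
        }
        Element repo = doc.getDocumentElement();
        List<String> uris = new ArrayList<>();
        String root = property(repo, "p2.statsURI");
        if (root == null) {
            return uris; // step 1 missing: nothing is reported
        }
        String base = root.endsWith("/") ? root.substring(0, root.length() - 1) : root;
        NodeList artifacts = repo.getElementsByTagName("artifact");
        for (int i = 0; i < artifacts.getLength(); i++) {
            String key = property((Element) artifacts.item(i), "download.stats");
            if (key != null) { // step 2: only artifacts that opt in
                uris.add(base + "/" + key);
            }
        }
        return uris;
    }
}
```

An empty result means either the repository has no "p2.statsURI" property or no artifact carries "download.stats", so no HEAD requests would be sent.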
Comment 18 Nick Boldt CLA 2010-04-21 20:15:16 EDT
So if this was something I was going to add to Athena's generation of metadata, I would have to create artifacts.xml via the publisher, then manually shoehorn this information into it?

Will there at some point be publisher [1] support for this?

e.g., using 

  -p2.statsURI http://arthorne.com/bogusstats

or via ant task [2], p2.statsURI="http://arthorne.com/bogusstats"

I expect p2 could generate everything else by suffixing .bundle for a bundle, .feature for a feature, etc.

[1] http://wiki.eclipse.org/Equinox/p2/Publisher#Features_And_Bundles_Publisher_Application

[2] http://wiki.eclipse.org/Equinox/p2/Publisher#Features_and_Bundles_Publisher_Task
Comment 19 John Arthorne CLA 2010-04-21 22:11:51 EDT
Publisher integration is possible, but we wouldn't want to add that property on every bundle. The point here is to just put it on one or two key artifacts to avoid an extra round trip for each artifact. This was really just intended as a replacement for the old "single file hack" rather than a more elaborate solution. In any case, I suggest opening a bug about publisher integration. I don't have any plans to work on that myself at this point.
Comment 20 Nick Boldt CLA 2010-04-22 11:18:44 EDT
(In reply to comment #19)
> Publisher integration is possible, but we wouldn't want to add that property on
> every bundle. The point here is to just put it on one or two key artifacts to
> avoid an extra round trip for each artifact. This was really just intended as a
> replacement for the old "single file hack" rather than a more elaborate
> solution. In any case, I suggest opening a bug about publisher integration. I
> don't have any plans to work on that myself at this point.

My concern is that unless there's a scriptable way of doing this, no one's going to use it for their weekly/monthly builds. And manually hacking metadata is for most people Very Scary Indeed, even with the awesome support from #equinox-dev IRC channel and p2-dev@ mailing list.

Anyway, as requested, see bug 310132.
Comment 21 John Arthorne CLA 2010-04-28 15:18:09 EDT
I have added some documentation here:

http://wiki.eclipse.org/Equinox_p2_download_stats
Comment 22 Denis Roy CLA 2010-04-28 16:10:55 EDT
Thanks much for this simple implementation!