| Summary: | Simple download statistics | | |
|---|---|---|---|
| Product: | [Eclipse Project] Equinox | Reporter: | John Arthorne <john.arthorne> |
| Component: | p2 | Assignee: | John Arthorne <john.arthorne> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | normal | | |
| Priority: | P3 | CC: | beth, bluesoldier, david_williams, g.watson, irbull, Kenn.Hussey, kim.moir, Mike_Wilson, mober.at+eclipse, nboldt, overholt, pascal, sbouchet, webmaster |
| Version: | 3.6 | | |
| Target Milestone: | 3.6 M7 | | |
| Hardware: | PC | | |
| OS: | Windows XP | | |
| Whiteboard: | | | |
| Attachments: | | | |

Created attachment 158483
Download statistics implementation
Denis, what do you think of this general approach? In particular, is a HEAD request on some unique URL sufficient for you to be able to gather useful statistics on your end?

Sounds acceptable to me. In layman's terms, what is an artifact?

(In reply to comment #3)
> Sounds acceptable to me. In layman's terms, what is an artifact?

An artifact is the actual bytes that you need to download, put on disk, and run. For example, a plug-in (bundle) is an artifact. So are the launchers.

For a bit of background, p2 separates things into Metadata and Artifacts. The metadata describe "what" can be installed. Metadata might describe that plug-in A "requires" plug-in B. p2 uses this to "plan" what needs to be installed (and what the user already has). Once this is all figured out, the "artifacts" are downloaded (if they are needed). -- This is a bit of a rough description of p2, but it should suffice :-).

I missed the p2 call today, but I think what John is suggesting is to tie the stats to the description of the actual artifact. That way, we only track things that are actually downloaded.

Seems to me one drawback of the proposal is that it would be pretty easy (if not tempting) to spoof. That is, someone could write a kiddie script to do nothing but HEAD requests to that URL. Granted, it would be kind of malicious ... from which I guess there can never be complete protection. Just thought I'd mention it so rudimentary safeguards could be put in place, if possible.

(In reply to comment #5)
> Seems to me one drawback of the proposal is that it would be pretty easy (if
> not tempting) to spoof. That is, someone could write a kiddie script to do
> nothing but head requests to that URL. Granted, it would be kind of malicious
> ... from which I guess there can never be complete protection. Just thought I'd
> mention it so rudimentary safeguards could be put in place, if possible.

On the server we could (and probably should) set up something that looks for malicious behaviour (the same IP address hitting the same URL over and over again, for example). However, I wonder if we can do better. Is it possible to generate a unique key that gets appended to the URL (as a query parameter)? Our stats processing tool would use this key to determine if it really came from "Eclipse". Of course, being open source, it would be easy for someone to copy any algorithm we came up with. Does anybody know of techniques for dealing with this?

> Seems to me one drawback of the proposal is that it would be pretty easy (if
> not tempting) to spoof.
We get _tons_ of that today -- people repeatedly hitting www.eclipse.org/downloads/download.php?file= and download.eclipse.org/somefile.zip zillions of times for a purpose not known to me. The person analyzing the trend data should be able to spot these data anomalies.
As long as the Internet exists we will have idiots. Just look at SMTP :)
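
For what it's worth, the server-side counting and the repeated-hit filtering suggested in the comments above could be sketched roughly as follows. This is only an illustration, not anything running at eclipse.org. It assumes Apache-style common log format lines, a hypothetical `/stats/` path prefix, and a hypothetical helper named `count_stats_hits`, and it counts each (IP address, URL) pair only once.

```python
import re
from collections import Counter

# Assumption: Apache "common" log format, e.g.
# 10.0.0.1 - - [26/Mar/2010:12:55:00 -0400] "HEAD /stats/foo HTTP/1.1" 404 -
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] "(?P<method>[A-Z]+) (?P<path>\S+)[^"]*"'
)


def count_stats_hits(lines, prefix="/stats/"):
    """Count HEAD requests per stats key, once per (IP, path) pair."""
    seen = set()
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None or match.group("method") != "HEAD":
            continue  # unparsable line, or not a stats ping
        path = match.group("path")
        if not path.startswith(prefix):
            continue
        key = (match.group("ip"), path)
        if key in seen:
            continue  # same address hitting the same URL again
        seen.add(key)
        counts[path[len(prefix):]] += 1
    return counts
```

Counting once per address is a crude filter: it dampens a script hammering one URL, but it also undercounts legitimate users who share an address, so whoever analyzes the data would still need to look for anomalies.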
(In reply to comment #7)
> ... The person analyzing the trend data should be able to spot these data
> anomalies.

Fine by me. I guess it is no different than the download.php?file= type solution. And the bug title does say 'simple'. :) [That is, I'm not advocating a more complex solution ... just wanted to discuss the issue.] Thanks,

I will make another comment about an implementation detail ... about the statement:

<quote> The artifact repository itself would also have an optional property with a "stats URL" as a destination for reporting download statistics </quote>

Should this literally be part of the artifact repository file? Or a separate file that's at the same location? It might be easier to "mirror" or duplicate repositories (without change), and change only the reporting file ... for cases where it was desired to change the reporting URL. I made a similar suggestion about the "p2mirrorsURL" and that went nowhere ... just thought I'd mention it again. Seems odd to tweak a repository file for "policy" information ... but, perhaps I'm missing the bigger picture of what a repository is.

(In reply to comment #9)
> should this literally be part of the artifact repository file? Or a separate
> file, that's at the same location? It might be easier to "mirror" or duplicate
> repositories (without change), and change only the reporting file ... for cases
> where it was desired to change the reporting URL. I made a similar suggestion
> about the "p2mirrorsURL" and that went nowhere ... just thought I'd mention it
> again. Seems odd to tweak a repository file for "policy" information ...
> but, perhaps I'm missing the bigger picture of what a repository is.

It was certainly intentional to put this "policy" info at the repository level. Someone hosting a repository elsewhere with the same contents may quite reasonably want to alter the mirror URL and stats URL, so putting that information as a repository property allows them to do that.
We don't really have any other separate place to put this data at the moment. I think in general this "stats" URL will be less sensitive than the mirror URL, which needs updating today whenever the repository is moved (which is a problem but we just don't have a solution for it). I imagine eclipse.org for example would have a single stats URL, and a given artifact downloaded from any number of different repos at eclipse.org would report stats to the same place (this could be viewed as a feature since it simplifies collection and aggregation of the download data).

I have released this to HEAD to allow further testing. Unless the required property is set on both the artifact descriptor and the repository, this code has no effect.

During the p2 call it was mentioned that the artifact descriptor is perhaps not the best location for the stats property. However, after thinking about it more, I can't think of another place that would have the same flexibility. Properties on artifact descriptors are not part of the "identity" or equality of descriptors, so there is no problem if other repositories have the same descriptor but with a missing or different stats collection property. In any case I think the general approach is quite flexible, and we can change where the statistics properties are stored later on if need be.

> I have released this to HEAD to allow further testing.
How can I help test this further, John?
> How can I help test this further, John?
How can I help test this further, John? Where can interested parties find some basic way of enabling this for testing? I need to pipe these URLs into our database to allow for querying, so the sooner I can start seeing what they look like, the better.
(In reply to comment #13)
> > How can I help test this further, John?
> How can I help test this further, John?

I'll let you know. I did some testing this afternoon and found some problems that will be fixed for tomorrow's build. By M7 we should be able to test on the M7 version of the release repository itself. I need to create some more documentation for it as well. The main question on your end is what URL we should use as the statistics gathering root URL. I.e., if it was something like "http://eclipse.org/stats" then you would be getting HEAD requests logged like this:

    http://eclipse.org/stats/org.eclipse.platform
    http://eclipse.org/stats/org.eclipse.cdt.core
    ...

If these requests are sent in the background (i.e., the user is not waiting for them to return anything) I would much prefer they be sent to download.eclipse.org:

    http://download.eclipse.org/stats/org.eclipse.platform
    http://download.eclipse.org/stats/org.eclipse.cdt.core
    ...

So what do end users do to test this? Wait for M7, install and use it, then do an update/install and check some stats page somewhere to see what we logged? Is there something developers need to do to "stats-enable" their repos' files to be logged/tracked, akin to adding the Google Analytics tracking code into webpages?

(In reply to comment #16)
> So what do end users do to test this? Wait for M7, install and use it, then do
> an update/install and check some stats page somewhere to see what we logged?
>
> Is there something developers need to do to "stats-enable" their repos' files
> to be logged/tracked, akin to adding the Google Analytics tracking code into
> webpages?

What information is logged, and where the information goes, is entirely controlled by the repository.
There are two steps to enable it:

1) In the artifact repository that you want to track downloads from, add a "p2.statsURI" property specifying the statistics URL (in artifacts.jar):

    <repository name='Update Site' type='org.eclipse.equinox.p2.artifact.repository.simpleRepository' version='1'>
      <properties size='3'>
        <property name='p2.timestamp' value='1269575706171'/>
        <property name='p2.compressed' value='true'/>
        <property name='p2.statsURI' value='http://arthorne.com/bogusstats'/>

(please don't use arthorne.com, this is just what I was using to test because I have access to the server logs for that site ;))

2) In the same repository, add a "download.stats" property for each IU that you want to gather stats for. You can pick one plugin in your feature for example:

    <artifact classifier='osgi.bundle' id='test.plugin1' version='1.0.0.201003261255'>
      <properties size='3'>
        <property name='artifact.size' value='0'/>
        <property name='download.size' value='1757'/>
        <property name='download.stats' value='test.plugin1.bundle'/>
      </properties>
    </artifact>

In this example, after a successful download a HEAD request will be issued to: http://arthorne.com/bogusstats/test.plugin1.bundle (the value of the "download.stats" property appended to the value of "p2.statsURI"). You can test this yourself using a platform integration build from *this* week, using any repository you like.

So if this was something I was going to add to Athena's generation of metadata, I would have to create artifacts.xml via the publisher, then manually shoehorn this information into it? Will there at some point be publisher [1] support for this? E.g., using -p2.statsURI http://arthorne.com/bogusstats or via ant task [2], p2.statsURI="http://arthorne.com/bogusstats". I expect p2 could generate everything else by suffixing .bundle for a bundle, .feature for a feature, etc.
[1] http://wiki.eclipse.org/Equinox/p2/Publisher#Features_And_Bundles_Publisher_Application
[2] http://wiki.eclipse.org/Equinox/p2/Publisher#Features_and_Bundles_Publisher_Task

Publisher integration is possible, but we wouldn't want to add that property on every bundle. The point here is to just put it on one or two key artifacts to avoid an extra round trip for each artifact. This was really just intended as a replacement for the old "single file hack" rather than a more elaborate solution. In any case, I suggest opening a bug about publisher integration. I don't have any plans to work on that myself at this point.

(In reply to comment #19)
> Publisher integration is possible, but we wouldn't want to add that property on
> every bundle. The point here is to just put it on one or two key artifacts to
> avoid an extra round trip for each artifact. This was really just intended as a
> replacement for the old "single file hack" rather than a more elaborate
> solution. In any case, I suggest opening a bug about publisher integration. I
> don't have any plans to work on that myself at this point.

My concern is that unless there's a scriptable way of doing this, no one's going to use it for their weekly/monthly builds. And manually hacking metadata is for most people Very Scary Indeed, even with the awesome support from the #equinox-dev IRC channel and the p2-dev@ mailing list. Anyway, as requested, see bug 310132.

I have added some documentation here: http://wiki.eclipse.org/Equinox_p2_download_stats

Thanks much for this simple implementation!
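
On the concern about a scriptable path: until publisher support exists, the two manual steps could in principle be scripted against an unpacked artifacts.xml. The following is a minimal sketch, not an official tool. The function name `enable_stats` and its parameters are hypothetical, and it assumes the element layout shown in the examples in this bug (a `properties` child with a `size` attribute on both the repository and the artifact). Note that Python's ElementTree drops comments and processing instructions when parsing, so anything of that kind in a real file would need to be preserved separately.

```python
import xml.etree.ElementTree as ET


def _set_property(parent, name, value):
    """Add or update <property name=... value=.../> and refresh 'size'."""
    props = parent.find("properties")
    if props is None:
        props = ET.SubElement(parent, "properties")
    for prop in props.findall("property"):
        if prop.get("name") == name:
            prop.set("value", value)
            break
    else:
        ET.SubElement(props, "property", {"name": name, "value": value})
    props.set("size", str(len(props.findall("property"))))


def enable_stats(xml_text, stats_uri, artifact_id, stats_value):
    """Return repository XML with p2.statsURI and download.stats set."""
    root = ET.fromstring(xml_text)
    _set_property(root, "p2.statsURI", stats_uri)
    matched = 0
    for artifact in root.iter("artifact"):
        if artifact.get("id") == artifact_id:
            _set_property(artifact, "download.stats", stats_value)
            matched += 1
    if matched == 0:
        raise ValueError("no artifact with id %r" % artifact_id)
    return ET.tostring(root, encoding="unicode")
```

Tagging only the artifacts named explicitly keeps to the intent stated above of marking one or two key artifacts rather than every bundle.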
I had a thought this morning about how we could implement simple collection of download statistics. We have (at least) the following requirements:

1) Collecting stats must be "best effort", and not cause transfers to fail if stats could not be collected
2) The repository must be able to control how/if this collection is performed, so that someone redistributing the content can "turn it off" or redirect statistics elsewhere.
3) No personal information should be collected
4) Stats collection must be coarse-grained. I.e., once per artifact would be too many round trips.

Here is a simple solution that I think satisfies these requirements. Any given artifact descriptor could specify a property for stats collection. If this property is absent then no statistics would be gathered. Example:

    <artifact classifier='osgi.bundle' id='org.eclipse.osgi' version='3.6.0.v20100128-1430'>
      <properties size='2'>
        <property name='download.stats' value='org.eclipse.osgi'/>
        ...

The artifact repository itself would also have an optional property with a "stats URL" as a destination for reporting download statistics to:

    <repository name='My Repo' type='org.eclipse.equinox.p2.artifact.repository.simpleRepository' version='1.0.0'>
      <properties size='3'>
        <property name='p2.statsURI' value='http://download.eclipse.org/stats'/>
        ...

After a successful artifact download, we would check for the property on the artifact descriptor and the property on the repository. If both properties are present, we would construct a URI by combining the two:

    statsURI = http://download.eclipse.org/stats/org.eclipse.osgi

We would then perform a simple HTTP HEAD request on this URI. Any failure would simply be logged. On the server side someone could use the server logs to count all the HEAD requests to obtain the aggregate download stats.
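
To make the proposed flow concrete, here is a rough sketch of the client-side logic. It is written in Python purely for illustration and is not the p2 implementation. The function names `build_stats_uri` and `report_download`, the dictionary representation of properties, and the timeout value are all assumptions.

```python
import http.client
import logging
import urllib.error
import urllib.request

log = logging.getLogger("p2.stats.sketch")


def build_stats_uri(repo_props, artifact_props):
    """Combine the two properties; return None if either is absent."""
    base = repo_props.get("p2.statsURI")
    suffix = artifact_props.get("download.stats")
    if not base or not suffix:
        return None  # stats are only reported when both are set
    return base.rstrip("/") + "/" + suffix.lstrip("/")


def report_download(repo_props, artifact_props, timeout=5.0):
    """Best effort: send a HEAD request, log and swallow any failure."""
    uri = build_stats_uri(repo_props, artifact_props)
    if uri is None:
        return False
    try:
        request = urllib.request.Request(uri, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout):
            return True
    except (urllib.error.URLError, http.client.HTTPException,
            OSError, ValueError) as exc:
        # Requirement 1: a failed report must never fail the transfer.
        log.warning("stats report to %s failed: %s", uri, exc)
        return False
```

A caller would invoke `report_download` only after the artifact transfer has succeeded, and would ignore the return value apart from diagnostics.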