| Summary: | Extract contribution information from Git | ||||||
|---|---|---|---|---|---|---|---|
| Product: | Community | Reporter: | Wayne Beaton <wayne.beaton> | ||||
| Component: | IP Log Tool | Assignee: | Wayne Beaton <wayne.beaton> | ||||
| Status: | RESOLVED FIXED | QA Contact: | |||||
| Severity: | normal | ||||||
| Priority: | P3 | CC: | alex.blewitt, bentmann, bugs.eclipse.org, denis.roy, gunnar, john.arthorne, kaloyan, matthias.sohn, mober.at+eclipse, overholt, pwebster, t-oberlies, uwe.st | ||||
| Version: | unspecified | ||||||
| Target Milestone: | --- | ||||||
| Hardware: | PC | ||||||
| OS: | Linux | ||||||
| Whiteboard: | |||||||
| Attachments: |
|
||||||
It's all in the jgit.iplog bundle. Find the details on how to use org.eclipse.jgit.iplog to generate the iplog for a given repository here:
http://egit.eclipse.org/w/?p=jgit.git;a=blob;f=org.eclipse.jgit.iplog/README;h=4015c7da9adf0ea3aee37e75067e921d9923878e;hb=HEAD

This has gerrit in the title, but is independent of gerrit itself. It only really depends on the committer/author being present in the commits, which I think we have?

*** Bug 341289 has been marked as a duplicate of this bug. ***

Any idea on when this will be added? I have a release coming up next month for Orion and just wondering if I'll need to manually craft the contributor section of the IP log.

(In reply to comment #5)
> Any idea on when this will be added? I have a release coming up next month for
> Orion and just wondering if I'll need to manually craft the contributor section
> of the IP log.

Targeting end of September.

I've generalized the functions that I use to extract information for Dash. Unfortunately, the extraction process runs on a different server than the IP Log code, so the next step is to assemble the data and stuff it where the IP Log generator code can find it. The extraction process is going to have to run in batch mode since it's pretty long running stuff anyway.

How often should the extraction process run? Hourly? Daily? Weekly? It's a hard call since there will be cases where last minute additions will have to be included in the log. Maybe a git trigger. Hmmm... I wonder if webmaster will let me set up one of those?

I'll run a few tests to see how time consuming it really is to generate the information on the fly (i.e. I'll test my assumption that this is too time consuming to run directly).

Created attachment 204239: mylyn/context/zip
I decided to take the most direct route to get this operating. I've added a script to Dash that will extract the contribution information from a single git repo and return it in CSV format. The script can only be invoked from an eclipse.org server. That script runs directly on the repo (specified as /gitroot/project/repo.git) and so can be a little time consuming for large repos.

The new version can be tested here:
http://www.eclipse.org/projects/ip_log2.php?projectid=eclipse.orion

The usability is quite terrible due to the latency, but it does work.

Projects that do not have Git repositories will work as normal with one change: I've used this opportunity to collapse multiple comments in a bug marked iplog+ attributed to an individual into a single entry in the log.

As Git repos get bigger, the performance will get worse. I did implement a solution that uses a batch process to gather the information and shove it into a database. This solution, however, makes it difficult to keep the data current (e.g. a commit pushed minutes before the log is generated would not appear). This solution could be made workable with a Git hook that pushes changes into the database as pushes occur; however, the server that accepts the pushes is different from the server that's running Dash, so there is some connectivity weirdness that needs to be overcome to make this sort of solution work.

A stopgap solution might be to use a little JavaScript to populate the contributors section of the page after the page is loaded. This would give us an opportunity to let people know that something is happening.

Also, due to the long running nature of the script, there is a real possibility that it could be used to overwhelm the server. I may need to restrict access to committers only to mitigate this.

Another possibility is to cache the data and use the --since option on git log to limit the amount of time spent rummaging around in commit records. I will explore that.
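The caching idea above can be sketched roughly as follows. This is only an illustration, not the Dash script: the function names, the choice to print one author e-mail per commit, and the in-memory dictionary standing in for the cache are all assumptions. The only git features relied on are `git log --since=<date>` and the `%ae` (author e-mail) format placeholder.

```python
from datetime import datetime, timezone


def build_log_command(repo_path, last_scan=None):
    """Build a `git log` argument list printing one author e-mail per commit.

    repo_path is a bare repository such as /gitroot/project/repo.git.
    last_scan, if given, must be a timezone-aware datetime; only commits
    newer than it are requested, which is what keeps the scan short.
    """
    cmd = ["git", "--git-dir=" + repo_path, "log", "--format=%ae"]
    if last_scan is not None:
        stamp = last_scan.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S +0000")
        cmd.append("--since=" + stamp)
    return cmd


def merge_counts(cache, emails):
    """Add newly seen author e-mails to a cached {email: commit_count} map."""
    for email in emails:
        cache[email] = cache.get(email, 0) + 1
    return cache
```

The command list could be handed to `subprocess.run(..., capture_output=True, text=True)` and its output split into lines before calling `merge_counts`. Since `--since` filters on the date stored in the commit rather than on the time of the push, an occasional full rescan would still be needed to catch older-dated commits that arrive late.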
FWIW, I ran the modified IP log tool on the eclipse.platform project and--while the performance is terrible--it did come back. More to do...

With some minor modifications, I've been able to cache the author information in a database. The Dash script that provides the author information caches any information that it finds so that subsequent calls can avoid the expensive scan of the repository. Every time the script is asked for the author data, it scans the corresponding repository for commits that occurred after the timestamp of the last scan.

The danger with this is that some commits with dates that occur before the date of the last scan may be pushed later and missed (since the scan only looks for commits dated after the last scan). To mitigate this, a batch script periodically initiates full scans of the repositories and updates the cache. I intend to schedule this as a low priority scan that occurs daily, or perhaps bi-weekly.

The changes can be tested here:
http://www.eclipse.org/projects/ip_log2.php?projectid=rt.virgo,rt.ecf

Note that you can change the value of the projectid parameter to the id of any project. Note also that submit won't do what you expect. Unless you expect it to submit the results of running the original script...

Can somebody point me to some good instructions on setting the "Author" field in a commit?

You can set GIT_AUTHOR_NAME, GIT_AUTHOR_EMAIL, GIT_COMMITTER_NAME, etc. as environment variables prior to a commit. See e.g. http://cworth.org/hgbook-git/tour/

You can also use git commit --author on the command line to override these variables. If you need to change an existing commit you would use git commit --amend --author. Note that like any other amend this will change the commit hash.
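For anyone scripting this, the two ways of setting the author described above might be wrapped like so. This is a minimal sketch: the helper names are made up for illustration, and the name and e-mail in the usage note are placeholders. It only builds the environment and argument list; nothing is executed.

```python
import os


def author_env(name, email, base_env=None):
    """Return a copy of the environment with the Git author variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env["GIT_AUTHOR_NAME"] = name
    env["GIT_AUTHOR_EMAIL"] = email
    return env


def commit_args(message, author=None, amend=False):
    """Build a `git commit` argument list.

    author uses the "Name <email>" form expected by --author.
    amend=True rewrites the latest commit, which changes its hash.
    """
    args = ["git", "commit"]
    if amend:
        args.append("--amend")
    if author is not None:
        args.append("--author=" + author)
    args += ["-m", message]
    return args
```

For example, `subprocess.run(commit_args("Fix typo", author="Jane Doe <jane@example.org>"), check=True)` would record Jane Doe as author while the person running the command stays the committer.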
(In reply to comment #10)
> The changes can be tested here:
>
> http://www.eclipse.org/projects/ip_log2.php?projectid=rt.virgo,rt.ecf

From a brief look over the new IP log for Tycho (http://www.eclipse.org/projects/ip_log2.php?projectid=technology.tycho) that you mentioned in your mail, it seems something is wrong with the committer table. The IP log lists Jan Sievers as the only active committer, although
http://git.eclipse.org/c/tycho/org.eclipse.tycho.git/commit/?id=4dda65a568c25bd47a3ccd2f08bde2df6a2b3791
and
http://git.eclipse.org/c/tycho/org.eclipse.tycho.git/commit/?id=9a1660bea8ff2592a6ed473d1e95a4d15f4c37dd
suggest that at least Tobias Oberlies and Igor Fedorenko are also active on the project.

(In reply to comment #13)
> From a brief look over the new IP log for Tycho
> (http://www.eclipse.org/projects/ip_log2.php?projectid=technology.tycho) that
> you mentioned in your mail, it seems something is wrong with the committer
> table. The IP log lists Jan Sievers as the only active committer, although
> http://git.eclipse.org/c/tycho/org.eclipse.tycho.git/commit/?id=4dda65a568c25bd47a3ccd2f08bde2df6a2b3791
> and
> http://git.eclipse.org/c/tycho/org.eclipse.tycho.git/commit/?id=9a1660bea8ff2592a6ed473d1e95a4d15f4c37dd
> suggest that at least Tobias Oberlies and Igor Fedorenko are also active on the
> project.

The information is correct: The org.eclipse.tycho repository has only received content very recently (because of CQs), and hence everything up to commit e31b836 was rebased from initial contribution state in the old GitHub repository to the new eclipse.org repository. Jan has done that rebase, hence he is marked committer for all rebased commits.

(In reply to comment #14)
> [...] everything up to commit
> e31b836 was rebased from initial contribution state in the old GitHub
> repository to the new eclipse.org repository. Jan has done that rebase, hence
> he is marked committer for all rebased commits.

I see.
Still, the first commit I mentioned (4dda65a568c25bd47a3ccd2f08bde2df6a2b3791) has author and committer being "Tobias Oberlies", yet the IP log didn't list you as active committer until I rechecked today, so the log appears to be lagging 24+ hours behind.

To conclude this, if the technical challenges mentioned by Wayne prevent the IP log from reflecting the latest commits at the time of its generation, it might be helpful to include a date/commit that indicates the origin/age of the data in the log.

At first glance, Linux Tools' log looks good.

(In reply to comment #13)
> (In reply to comment #10)
> > The changes can be tested here:
> >
> > http://www.eclipse.org/projects/ip_log2.php?projectid=rt.virgo,rt.ecf
>
> From a brief look over the new IP log for Tycho
> (http://www.eclipse.org/projects/ip_log2.php?projectid=technology.tycho) that
> you mentioned in your mail, it seems something is wrong with the committer
> table. The IP log lists Jan Sievers as the only active committer, although
> http://git.eclipse.org/c/tycho/org.eclipse.tycho.git/commit/?id=4dda65a568c25bd47a3ccd2f08bde2df6a2b3791
> and
> http://git.eclipse.org/c/tycho/org.eclipse.tycho.git/commit/?id=9a1660bea8ff2592a6ed473d1e95a4d15f4c37dd
> suggest that at least Tobias Oberlies and Igor Fedorenko are also active on the
> project.

There are two different things at play here. The first link points to a commit that Dash hadn't actually scanned yet (Dash runs once a week on Sundays to capture the committer activity information that's used in this log). The other thing is that the Dash harvesting scripts harvest the committer information, not the author information. Perhaps we should be attributing the commit to the author rather than the committer (assuming that the author is a committer).

This issue, however, is unrelated to this bug. Can you please open a new bug to address this?
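The committer/author distinction discussed above can be seen directly in `git log` output. A rough sketch, assuming the log is produced with a tab-separated format of hash, author name and committer name (the `%H`, `%an`, `%cn` and `%x09` placeholders are standard git pretty-format codes; the function names and any sample data are made up for illustration):

```python
# Pass this to `git log` to get "<hash>\t<author name>\t<committer name>" per commit.
LOG_FORMAT = "--format=%H%x09%an%x09%cn"


def parse_line(line):
    """Split one line of LOG_FORMAT output into its three fields."""
    sha, author, committer = line.rstrip("\n").split("\t")
    return {"sha": sha, "author": author, "committer": committer}


def committed_by_someone_else(records):
    """Return the commits whose committer differs from the author.

    After a rebase done by one person, every rebased commit shows up here,
    because the rebase rewrites the committer but keeps the original author.
    """
    return [r for r in records if r["author"] != r["committer"]]
```

A tool that counts only the `committer` field would credit all rebased commits to whoever performed the rebase, which matches the behaviour reported for the Tycho log.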
(In reply to comment #17)
> The other
> thing is that the Dash harvesting scripts harvest the committer information,
> not the author information. Perhaps we should be attributing the commit to the
> author rather than the committer (assuming that the author is a committer).
>
> This issue, however, is unrelated to this bug. Can you please open a new bug to
> address this?

Hm, I guess this is a dup of #346898?

(In reply to comment #18)
> Hm, I guess this is a dup of #346898?

I think they're different but related. Dash currently only reports on committers, not authors. I believe that is the focus of Bug 346898.

(In reply to comment #9)
> The new version can be tested here:
> http://www.eclipse.org/projects/ip_log2.php?projectid=eclipse.orion

This currently just generates a 404 ...

I'm interested in mining git contributions for TCF:
http://www.eclipse.org/projects/ip_log2.php?projectid=tools.cdt.tcf

How should I proceed?

(In reply to comment #20)
> (In reply to comment #9)
> > The new version can be tested here:
> > http://www.eclipse.org/projects/ip_log2.php?projectid=eclipse.orion
>
> This currently just generates a 404 ...
>
> I'm interested in mining git contributions for TCF:
> http://www.eclipse.org/projects/ip_log2.php?projectid=tools.cdt.tcf
>
> How should I proceed?

That was the URL I used for testing. I received enough feedback to move it live. Try this URL instead:
http://www.eclipse.org/projects/ip_log.php?projectid=tools.cdt.tcf

(In reply to comment #20)
> I'm interested in mining git contributions for TCF:
> http://www.eclipse.org/projects/ip_log2.php?projectid=tools.cdt.tcf

TCF doesn't specify any Git repositories in its metadata, so there will be no git contributions recorded in the log.

Wayne, AFAIK this can be closed, can't it?

(In reply to comment #23)
> Wayne, AFAIK this can be closed, can't it?

Yup |
We should be able to pull contribution information directly from a Git repository (Git allows us to record the author of a contribution separately from the committer). The EGit project currently does this. Here's what they do:

---
1. Pull a list of all CQs for a project out of IPzilla.
2. Dump the list to a local file.
---

---
3. Pull a list of committers out of the Gerrit database.
4. Dump the list to a local file (as CSV).
---

---
5. Read additional IP data from a file in the Git repo. This reads things like projects, consumed projects, etc.
6. Read the file with committer info from step 4.
7. Scan a Git repository's history from a starting tag (version). Each commit is analyzed based on "author" name. If the "author" email matches an *active* committer, the committer record is updated with a "hasCommits" flag. If the "author" email does not match an *active* committer, the commit is collected as a contribution by a non-committer.
8. Generate the IP log XML.
---

I assume that we can get the code from them when we're ready to actually implement this.
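Step 7 above is the core of the process and can be sketched in a few lines. This is not the EGit code, just an illustration of the matching rule as described: the function name, the input shape (pairs of commit hash and author e-mail) and the case-insensitive e-mail comparison are all assumptions.

```python
def classify_commits(commits, active_committers):
    """Split commits into committer activity and non-committer contributions.

    commits: iterable of (sha, author_email) pairs, e.g. parsed from git log.
    active_committers: iterable of e-mail addresses of *active* committers.
    Returns (has_commits, contributions): the set of active committers that
    authored at least one commit, and the list of (sha, author_email) pairs
    authored by anyone else.
    """
    # Assumption: e-mail addresses are compared case-insensitively.
    active = {email.lower() for email in active_committers}
    has_commits = set()
    contributions = []
    for sha, author_email in commits:
        key = author_email.lower()
        if key in active:
            has_commits.add(key)  # corresponds to setting the "hasCommits" flag
        else:
            contributions.append((sha, author_email))
    return has_commits, contributions
```

The `contributions` list would feed the non-committer section of the generated log, while `has_commits` marks which committer records were active in the scanned range.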