Bug 327594 - Extract contribution information from Git
Summary: Extract contribution information from Git
Status: RESOLVED FIXED
Alias: None
Product: Community
Classification: Eclipse Foundation
Component: IP Log Tool
Version: unspecified
Hardware: PC Linux
Importance: P3 normal
Target Milestone: ---
Assignee: Wayne Beaton CLA
QA Contact:
URL:
Whiteboard:
Keywords:
Duplicates: 341289
Depends on:
Blocks:
 
Reported: 2010-10-12 13:36 EDT by Wayne Beaton CLA
Modified: 2013-06-24 11:24 EDT
CC List: 13 users

See Also:


Attachments
mylyn/context/zip (1.09 KB, application/octet-stream)
2011-09-28 23:02 EDT, Wayne Beaton CLA
no flags

Description Wayne Beaton CLA 2010-10-12 13:36:48 EDT
We should be able to pull contribution information directly from a Git repository (Git allows us to record the author of a contribution separately from the committer). The EGit project currently does this.

Here's what they do:

---
1. Pull a list of all CQs for a project out of IPzilla.
2. Dump the list to a local file.
---

---
3. Pull a list of committers out of Gerrit database.
4. Dump the list to a local file (as CSV).
---

---
5. Read additional IP data from a file in the Git repo.
    This reads things like projects, consumed projects, etc.
6. Read file with committer info from 4.
7. Scan a Git repository's history from a starting tag (version):
    Each commit is analyzed based on "author" name. If the
    "author" email matches to an *active* committer the
    committer record is updated with a "hasCommits" flag.
    If the "author" email does not match an *active*
    committer, collect the commit as a contribution by a
    non-committer.
8. Generate IP log XML
---
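
For illustration, step 7 might look roughly like the sketch below. This is not the jgit.iplog code; the Committer class, the classify_commits function, and the tuple layout are all hypothetical names chosen for this example, and the email comparison is assumed to be case-insensitive.

```python
from dataclasses import dataclass


@dataclass
class Committer:
    # Hypothetical record, standing in for one row of the committer CSV from step 4.
    name: str
    email: str
    active: bool
    has_commits: bool = False


def classify_commits(commits, committers):
    """Flag active committers that authored commits; return everything else.

    commits is an iterable of (sha, author_name, author_email) tuples.
    """
    # Only *active* committers can match; inactive ones fall through below.
    by_email = {c.email.lower(): c for c in committers if c.active}
    contributions = []
    for sha, author_name, author_email in commits:
        committer = by_email.get(author_email.lower())
        if committer is not None:
            committer.has_commits = True
        else:
            # Author is not an active committer: record as a contribution.
            contributions.append((sha, author_name, author_email))
    return contributions
```

Note that in this sketch a commit authored by an inactive committer ends up in the contributions list, which is how I read the description above.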

I assume that we can get the code from them when we're ready to actually implement this.
Comment 1 Gunnar Wagenknecht CLA 2010-10-12 13:57:38 EDT
It's all in the jgit.iplog bundle.
Comment 2 Matthias Sohn CLA 2011-03-20 19:55:28 EDT
Details on how to use org.eclipse.jgit.iplog to generate the IP log for a given repository are here: http://egit.eclipse.org/w/?p=jgit.git;a=blob;f=org.eclipse.jgit.iplog/README;h=4015c7da9adf0ea3aee37e75067e921d9923878e;hb=HEAD
Comment 3 Alex Blewitt CLA 2011-05-27 09:56:51 EDT
This has gerrit in the title, but is independent of gerrit itself. It only really depends on the committer/author being present in the commits, which I think we have?
Comment 4 Wayne Beaton CLA 2011-06-22 23:38:01 EDT
*** Bug 341289 has been marked as a duplicate of this bug. ***
Comment 5 John Arthorne CLA 2011-09-20 16:11:52 EDT
Any idea on when this will be added? I have a release coming up next month for Orion and just wondering if I'll need to manually craft the contributor section of the IP log.
Comment 6 Wayne Beaton CLA 2011-09-20 19:36:26 EDT
(In reply to comment #5)
> Any idea on when this will be added? I have a release coming up next month for
> Orion and just wondering if I'll need to manually craft the contributor section
> of the IP log.

Targeting end of September.
Comment 7 Wayne Beaton CLA 2011-09-28 23:02:22 EDT
I've generalized the functions that I use to extract information for Dash. Unfortunately, the extraction process runs on a different server than the IP Log code, so the next step is to assemble the data and stuff it where the IP Log generator code can find it. The extraction process is going to have to run in batch mode since it's pretty long running stuff anyway. 

How often should the extraction process run? Hourly? Daily? Weekly? It's a hard call since there will be cases where last minute additions will have to be included in the log. Maybe a git trigger. Hmmm... I wonder if webmaster will let me set up one of those?

I'll run a few tests to see how time consuming it really is to generate the information on the fly (i.e. I'll test my assumption that this is too time consuming to run directly).
Comment 8 Wayne Beaton CLA 2011-09-28 23:02:24 EDT
Created attachment 204239
mylyn/context/zip
Comment 9 Wayne Beaton CLA 2011-10-05 15:30:05 EDT
I decided to take the most direct route to get this operating. I've added a script to Dash that will extract the contribution information from a single git repo and return it in CSV format. The script can only be invoked from an eclipse.org server. That script runs directly on the repo (specified as /gitroot/project/repo.git) and so can be a little time consuming for large repos.
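
A minimal sketch of what such an extraction could look like, for anyone curious. This is not the actual Dash script; the function names, the CSV column names, and the choice of git log format string are assumptions made for this example. Only the placeholders themselves (%H, %an, %ae, %at, %x09) are standard git pretty-format placeholders.

```python
import csv
import io

# Assumed output format: hash, author name, author email, and author date
# (Unix timestamp), separated by tab characters (%x09).
GIT_LOG_FORMAT = "--pretty=format:%H%x09%an%x09%ae%x09%at"


def git_log_command(repo_path):
    """Build the git invocation for a bare repo such as /gitroot/project/repo.git."""
    return ["git", f"--git-dir={repo_path}", "log", GIT_LOG_FORMAT]


def log_output_to_csv(log_output):
    """Convert the tab-separated git log output into CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["commit", "author_name", "author_email", "author_time"])
    for line in log_output.splitlines():
        if not line.strip():
            continue
        sha, name, email, timestamp = line.split("\t")
        writer.writerow([sha, name, email, timestamp])
    return buffer.getvalue()
```

The command list could be passed to subprocess.run(..., capture_output=True, text=True) and its stdout fed to log_output_to_csv. The csv module takes care of quoting author names that contain commas.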

The new version can be tested here:

http://www.eclipse.org/projects/ip_log2.php?projectid=eclipse.orion

The usability is quite terrible due to the latency, but it does work. 

Projects that do not have Git repositories will work as normal with one change: I've used this opportunity to collapse multiple comments in a bug marked iplog+ attributed to an individual into a single entry in the log.
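
The collapsing amounts to grouping by bug and individual. A rough sketch follows; the function name, the tuple layout, and the comment_count field are made up for this example and do not reflect the real IP Log code.

```python
def collapse_contributions(comments):
    """Collapse iplog+ comments into one entry per (bug, contributor) pair.

    comments is an iterable of (bug_id, contributor, comment_text) tuples.
    """
    collapsed = {}
    for bug_id, contributor, _text in comments:
        entry = collapsed.setdefault(
            (bug_id, contributor),
            {"bug_id": bug_id, "contributor": contributor, "comment_count": 0},
        )
        entry["comment_count"] += 1
    # Dicts preserve insertion order, so entries appear in first-seen order.
    return list(collapsed.values())
```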

As Git repos get bigger, the performance will get worse.

I did implement a solution that uses a batch process to gather the information and shove it into a database. This solution, however, makes it difficult to keep the data current (e.g. a commit pushed minutes before the log is generated would not appear). This solution could be made workable with a Git hook that pushes changes into the database as pushes occur; however, the server that accepts the pushes is different from the server that's running Dash, so there is some connectivity weirdness that needs to be overcome to make this sort of solution work.

A stopgap solution might be to use a little JavaScript to populate the contributors section of the page after the page is loaded. This would give us an opportunity to let people know that something is happening.

Also, due to the long running nature of the script, there is a real possibility that it could be used to overwhelm the server. I may need to restrict access to committers only to mitigate this.

Another possibility is to cache the data and use the --since option on git log to limit the amount of time spent rummaging around in commit records. I will explore that.

FWIW, I ran the modified IP log tool on the eclipse.platform project and--while the performance is terrible--it did come back.

More to do...
Comment 10 Wayne Beaton CLA 2011-10-07 14:13:15 EDT
With some minor modifications, I've been able to cache the author information in a database. 

The Dash script that provides the author information caches any information that it finds so that subsequent calls can avoid the expensive scan of the repository. Every time the script is asked for the author data, it scans the corresponding repository for commits that occurred after the timestamp of the last scan.

The danger with this is that commits dated before the last scan may be pushed after that scan and then missed (since each scan only looks for commits dated after the previous scan).

To mitigate this, a batch script periodically initiates full scans of the repositories and updates the cache. I intend to schedule this as a low priority scan that occurs daily, or perhaps bi-weekly.
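
To make the hazard concrete, here is a toy model of the cache. It is only a sketch: the AuthorCache class and its methods are hypothetical, commits are plain (sha, author_email, commit_time) tuples rather than real repository data, and times are simple integers.

```python
class AuthorCache:
    """Toy model of a cache that is refreshed by timestamp-based scans."""

    def __init__(self):
        self.commits = {}   # sha -> (author_email, commit_time)
        self.last_scan = 0  # time at which the most recent scan ran

    def incremental_scan(self, repo_commits, now):
        # Only look at commits dated after the previous scan.
        for sha, email, when in repo_commits:
            if when > self.last_scan:
                self.commits[sha] = (email, when)
        self.last_scan = now

    def full_scan(self, repo_commits, now):
        # Rebuild the cache from every commit, regardless of date.
        self.commits = {sha: (email, when) for sha, email, when in repo_commits}
        self.last_scan = now


cache = AuthorCache()
repo = [("a", "x@example.org", 100)]
cache.incremental_scan(repo, now=200)

# A commit dated 150 is pushed after the scan at time 200.
repo.append(("b", "y@example.org", 150))
cache.incremental_scan(repo, now=300)
missed = "b" not in cache.commits   # True: the incremental scan skips it

cache.full_scan(repo, now=400)
recovered = "b" in cache.commits    # True: the full scan picks it up
```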

The changes can be tested here:

http://www.eclipse.org/projects/ip_log2.php?projectid=rt.virgo,rt.ecf

Note that you can change the value of the projectid parameter to the id of any project.

Note also that submit won't do what you expect. Unless you expect it to submit the results of running the original script...
Comment 11 Wayne Beaton CLA 2011-10-07 23:24:01 EDT
Can somebody point me to some good instructions on setting the "Author" field in a commit?
Comment 12 Alex Blewitt CLA 2011-10-08 03:46:19 EDT
You can set GIT_AUTHOR_NAME, GIT_AUTHOR_EMAIL, GIT_COMMITTER_NAME, etc. as environment variables prior to a commit. See e.g.

http://cworth.org/hgbook-git/tour/

You can also use git commit --author on the command line to override these variables.

If you need to change an existing commit, you would use git commit --amend --author.

Note that like any other amend this will change the commit hash.
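
If you want to drive this from a script, a small sketch follows. The helper names author_environment and commit_command are made up for this example; the environment variables and the --author, --amend, --no-edit and -m options are the standard git ones. The helpers only build the environment and argument list, they do not run git themselves.

```python
import os


def author_environment(name, email, base_env=None):
    """Return a copy of the environment with the Git author variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env["GIT_AUTHOR_NAME"] = name
    env["GIT_AUTHOR_EMAIL"] = email
    return env


def commit_command(message=None, author=None, amend=False):
    """Build a git commit argument list, optionally overriding the author.

    author uses the usual "Name <email>" form.
    """
    command = ["git", "commit"]
    if amend:
        command.append("--amend")
    if author is not None:
        command.append(f"--author={author}")
    if message is not None:
        command += ["-m", message]
    elif amend:
        # Keep the existing commit message when only the author changes.
        command.append("--no-edit")
    return command
```

Either result can be handed to subprocess.run, e.g. subprocess.run(commit_command("Fix typo"), env=author_environment("Jane Doe", "jane@example.org"), check=True), where the name and address are placeholders.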
Comment 13 Benjamin Bentmann CLA 2011-10-09 10:51:50 EDT
(In reply to comment #10)
> The changes can be tested here:
> 
> http://www.eclipse.org/projects/ip_log2.php?projectid=rt.virgo,rt.ecf

From a brief look over the new IP log for Tycho (http://www.eclipse.org/projects/ip_log2.php?projectid=technology.tycho) that you mentioned in your mail, it seems something is wrong with the committer table. The IP log lists Jan Sievers as the only active committer, although http://git.eclipse.org/c/tycho/org.eclipse.tycho.git/commit/?id=4dda65a568c25bd47a3ccd2f08bde2df6a2b3791 and http://git.eclipse.org/c/tycho/org.eclipse.tycho.git/commit/?id=9a1660bea8ff2592a6ed473d1e95a4d15f4c37dd suggest that at least Tobias Oberlies and Igor Fedorenko are also active on the project.
Comment 14 Tobias Oberlies CLA 2011-10-10 04:36:04 EDT
(In reply to comment #13)
> From a brief look over the new IP log for Tycho
> (http://www.eclipse.org/projects/ip_log2.php?projectid=technology.tycho) that
> you mentioned in your mail, it seems something is wrong with the committer
> table. The IP log lists Jan Sievers as the only active committer, although
> http://git.eclipse.org/c/tycho/org.eclipse.tycho.git/commit/?id=4dda65a568c25bd47a3ccd2f08bde2df6a2b3791
> and
> http://git.eclipse.org/c/tycho/org.eclipse.tycho.git/commit/?id=9a1660bea8ff2592a6ed473d1e95a4d15f4c37dd
> suggest that at least Tobias Oberlies and Igor Fedorenko are also active on the
> project.

The information is correct: The org.eclipse.tycho repository has only received content very recently (because of CQs), and hence everything up to commit e31b836 was rebased from initial contribution state in the old GitHub repository to the new eclipse.org repository. Jan has done that rebase, hence he is marked committer for all rebased commits.
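
To illustrate why a rebase produces this picture: Git stores the author and the committer separately on each commit (%an/%ae versus %cn/%ce in git log formats), and a rebase keeps the author while recording whoever ran it as the committer. A small sketch with made-up data follows; the names and the tally helper are hypothetical.

```python
from collections import Counter


def tally(commits, field):
    """Count commits per person, keyed on either "author" or "committer"."""
    return Counter(commit[field] for commit in commits)


# Three commits written by different people, all rebased by one person.
rebased = [
    {"sha": "c1", "author": "author-a", "committer": "rebaser"},
    {"sha": "c2", "author": "author-b", "committer": "rebaser"},
    {"sha": "c3", "author": "author-a", "committer": "rebaser"},
]

by_committer = tally(rebased, "committer")
by_author = tally(rebased, "author")
```

Counting by committer credits everything to the person who did the rebase, while counting by author spreads the commits across the people who wrote them.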
Comment 15 Benjamin Bentmann CLA 2011-10-10 06:17:43 EDT
(In reply to comment #14)
> [...] everything up to commit
> e31b836 was rebased from initial contribution state in the old GitHub
> repository to the new eclipse.org repository. Jan has done that rebase, hence
> he is marked committer for all rebased commits.

I see. Still, the first commit I mentioned (4dda65a568c25bd47a3ccd2f08bde2df6a2b3791) has "Tobias Oberlies" as both author and committer, yet the IP log didn't list you as an active committer until I rechecked today, so the log appears to be lagging 24+ hours behind. To conclude: if the technical challenges mentioned by Wayne prevent the IP log from reflecting the latest commits at the time of its generation, it might be helpful to include a date/commit that indicates the origin/age of the data in the log.
Comment 16 Andrew Overholt CLA 2011-10-11 09:31:47 EDT
At first glance, Linux Tools' log looks good.
Comment 17 Wayne Beaton CLA 2011-10-11 11:14:33 EDT
(In reply to comment #13)
> (In reply to comment #10)
> > The changes can be tested here:
> > 
> > http://www.eclipse.org/projects/ip_log2.php?projectid=rt.virgo,rt.ecf
> 
> From a brief look over the new IP log for Tycho
> (http://www.eclipse.org/projects/ip_log2.php?projectid=technology.tycho) that
> you mentioned in your mail, it seems something is wrong with the committer
> table. The IP log lists Jan Sievers as the only active committer, although
> http://git.eclipse.org/c/tycho/org.eclipse.tycho.git/commit/?id=4dda65a568c25bd47a3ccd2f08bde2df6a2b3791
> and
> http://git.eclipse.org/c/tycho/org.eclipse.tycho.git/commit/?id=9a1660bea8ff2592a6ed473d1e95a4d15f4c37dd
> suggest that at least Tobias Oberlies and Igor Fedorenko are also active on the
> project.

There are two different things at play here. The first link points to a commit that Dash hadn't actually scanned yet (Dash runs once a week on Sundays to capture the committer activity information that's used in this log). The other thing is that the Dash harvesting scripts harvest the committer information, not the author information. Perhaps we should be attributing the commit to the author rather than the committer (assuming that the author is a committer).

This issue, however, is unrelated to this bug. Can you please open a new bug to address this?
Comment 18 Benjamin Bentmann CLA 2011-10-11 13:15:55 EDT
(In reply to comment #17)
> The other
> thing is that the Dash harvesting scripts harvest the committer information,
> not the author information. Perhaps we should be attributing the commit to the
> author rather than the the committer (assuming that the author is a committer).
> 
> This issue, however, is unrelated to this bug. Can you please open a new bug to
> address this?

Hm, I guess this is a dup of #346898?
Comment 19 Wayne Beaton CLA 2011-10-19 15:03:17 EDT
(In reply to comment #18)

> Hm, I guess this is a dup of #346898?

I think they're different but related.

Dash currently only reports on committers, not authors. I believe that is the focus of Bug 346898.
Comment 20 Martin Oberhuber CLA 2011-10-20 12:06:55 EDT
(In reply to comment #9)
> The new version can be tested here:
> http://www.eclipse.org/projects/ip_log2.php?projectid=eclipse.orion

This currently just generates a 404 ...

I'm interested in mining git contributions for TCF:
http://www.eclipse.org/projects/ip_log2.php?projectid=tools.cdt.tcf

How should I proceed?
Comment 21 Wayne Beaton CLA 2011-10-20 12:21:05 EDT
(In reply to comment #20)
> (In reply to comment #9)
> > The new version can be tested here:
> > http://www.eclipse.org/projects/ip_log2.php?projectid=eclipse.orion
> 
> This currently just generates a 404 ...
> 
> I'm interested in mining git contributions for TCF:
> http://www.eclipse.org/projects/ip_log2.php?projectid=tools.cdt.tcf
> 
> How should I proceed ?

That was the URL I used for testing. I received enough feedback to move it live. Try this URL instead:

http://www.eclipse.org/projects/ip_log.php?projectid=tools.cdt.tcf
Comment 22 Wayne Beaton CLA 2011-10-20 12:26:03 EDT
(In reply to comment #20)
> I'm interested in mining git contributions for TCF:
> http://www.eclipse.org/projects/ip_log2.php?projectid=tools.cdt.tcf

TCF doesn't specify any Git repositories in its metadata, so there will be no git contributions recorded in the log.
Comment 23 Gunnar Wagenknecht CLA 2013-06-24 05:31:22 EDT
Wayne, AFAIK this can be closed, can't it?
Comment 24 Wayne Beaton CLA 2013-06-24 11:24:11 EDT
(In reply to comment #23)
> Wayne, AFAIK this can be closed, can't it?

Yup