Bug 360692 - Is git gc run on the git.eclipse.org git repos
Summary: Is git gc run on the git.eclipse.org git repos
Status: RESOLVED FIXED
Alias: None
Product: Community
Classification: Eclipse Foundation
Component: Git
Version: unspecified
Hardware: PC Mac OS X - Carbon (unsup.)
Importance: P2 major
Target Milestone: ---
Assignee: Eclipse Webmaster CLA
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: 389101
 
Reported: 2011-10-12 12:10 EDT by Thomas Watson CLA
Modified: 2013-11-21 15:44 EST
CC List: 7 users

See Also:


Attachments
some "stats" for platform repositories (1.14 KB, text/plain)
2012-08-15 17:54 EDT, David Williams CLA
count-objects stats before gc (2.07 KB, text/plain)
2012-08-15 19:09 EDT, David Williams CLA
stats after gc (2.20 KB, text/plain)
2012-08-15 20:21 EDT, David Williams CLA

Description Thomas Watson CLA 2011-10-12 12:10:44 EDT
Does Eclipse gc the bare repos on git.eclipse.org on a regular basis? For example, a cron job?

If not, how do we gc our own repos to clean up unreachable objects and pack the loose objects?

Thanks.
Comment 1 Thomas Watson CLA 2011-10-12 14:51:47 EDT
I think this can be done with the following ssh command:

ssh git.eclipse.org git --git-dir=/gitroot/equinox/rt.equinox.framework.git --bare gc

Is this a good idea?  Should that be the recommended way a committer can GC their git repos?
Comment 2 Denis Roy CLA 2011-10-12 15:01:05 EDT
ssh user@git.eclipse.org "git --git-dir=/gitroot/equinox/rt.equinox.framework.git gc --aggressive"

... should work.
Comment 3 Kim Moir CLA 2011-10-12 15:19:55 EDT
Tom, do you want me to set up a cron job on eclipse.org to gc the repos on, say, a monthly basis?  This would be easier than having individual committers remember to do it.
Comment 4 Thomas Watson CLA 2011-10-12 15:55:31 EDT
(In reply to comment #3)
> Tom, do you want me to set up a cron job on eclipse.org to gc the repos on say,
> a monthly basis.  This would be easier than having individual committers having
> to remember to do it.

That is probably a good idea.  I was not sure if the foundation wanted to make a general cron job that does this for all repos or to leave it up to each project to decide how and when their upstream repos are GC'ed.
Comment 5 BJ Hargrave CLA 2011-10-12 16:06:32 EDT
(In reply to comment #2)
> ssh user@git.eclipse.org "git
> --git-dir=/gitroot/equinox/rt.equinox.framework.git gc --aggressive"
> 
> ... should work.

I am not sure I would use --aggressive.
Comment 6 BJ Hargrave CLA 2011-10-12 16:09:16 EDT
(In reply to comment #4)
> (In reply to comment #3)
> > Tom, do you want me to set up a cron job on eclipse.org to gc the repos on say,
> > a monthly basis.  This would be easier than having individual committers having
> > to remember to do it.
> 
> That is probably a good idea.  I was not sure if the foundation wanted to make
> a general cron job that does this for all repos or to leave it up to each
> project to decide how and when their upstream repos are GC'ed.

I can't imagine most project leads have any idea about git repo maintenance needs. The foundation should do this for all hosted repos.

Over time, without regular gc, the hosted repos will end up with many unreachable objects, many loose objects, and many unpacked refs. This will slowly degrade the performance of accessing the hosted repos for developers and the build team.
Comment 7 Kim Moir CLA 2011-10-12 16:26:56 EDT
I agree with BJ that this should be a cron job at the foundation.  Something that could ls /gitroot and dynamically build the list of repos to garbage collect.  This would avoid maintaining a static shell script that would get stale as repos are added and deleted.
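
A minimal sketch of what such a job might look like, assuming bare repositories live in per-project directories under /gitroot and that the job runs as a user with write access to them; the layout and options are illustrative, not an actual Foundation script:

#!/bin/sh
# Hypothetical scheduled gc job: enumerate every bare repo under /gitroot
# and garbage collect it (no --aggressive, per the comments above).
for repo in /gitroot/*/*.git; do
    git --git-dir="$repo" gc --quiet
done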
Comment 8 Thomas Watson CLA 2011-10-12 16:57:06 EDT
(In reply to comment #7)
> I agree with BJ that this should be a cron job at the foundation.  Something
> that could ls /gitroot and interactively create a script to garbage collect all
> the repos.  This would avoid maintaining a static shell script that would get
> stale as repos are added and deleted.

+1

Also, I don't think we should use --aggressive.
Comment 9 Denis Roy CLA 2011-10-26 09:54:50 EDT
> I can't imagine most project leads have any idea about git repo maintenance
> needs. The foundation should do this for all hosted repos.

By the same token, your post on eclipse-dev[1] leads me to believe that the projects should know when best to perform maintenance, and when gc is best run.  

"In any case, a branch is just a pointer to a commit. If deleted, they can easily be recreated assuming the commit has not been gc'd."

I would hate to have a cron job run a gc while a team is in full swing and perhaps some garbage may need to be salvaged.

Historically, the foundation has never meddled with project source repositories.  I would very much prefer to stay out of it.


[1] http://dev.eclipse.org/mhonarc/lists/eclipse-dev/msg09277.html
Comment 10 BJ Hargrave CLA 2011-10-26 10:20:34 EDT
(In reply to comment #9)
> "In any case, a branch is just a pointer to a commit. If deleted, they can
> easily be recreated assuming the commit has not been gc'd."

It is not nice to attempt to use my own words against me. Don't you know anything about politics in the US? :-)

A pre-receive hook that prevents committers (at least non-PMC-lead committers) from deleting the main branches (e.g. any branch whose name does not contain a /), coupled with the foundation managing gc hygiene, would work well.
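
A minimal sketch of such a pre-receive hook, assuming the convention described here (branches whose short name contains no "/" are protected); this is illustrative only, not an actual git.eclipse.org hook:

#!/bin/sh
# Hypothetical pre-receive hook: reject deletion of top-level branches.
# Each stdin line is "<old-sha> <new-sha> <refname>"; a new sha of all
# zeros means the ref is being deleted.
zero=0000000000000000000000000000000000000000
while read old new ref; do
    case "$ref" in
        refs/heads/*/*) ;;   # namespaced branches (name contains /) may be deleted
        refs/heads/*)
            if [ "$new" = "$zero" ]; then
                echo "Deleting branch ${ref#refs/heads/} is not allowed" >&2
                exit 1
            fi ;;
    esac
done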

And given the distributed nature of git repos, there will always be some repo which has the interesting commit. For example, mirroring to github means the github mirror of the repo would likely have the commit.

This sort of mistake (e.g. deleting the master branch) is generally caught very quickly. git gc generally does not discard objects until they have aged for a while: "(default is 2 weeks ago, overridable by the config variable gc.pruneExpire)". So there is a window of at least two weeks to discover and fix the problem before the commit would be discarded. We can even extend the expire time if we wish.
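
For reference, the expiry window can be lengthened per repository through the gc.pruneExpire setting mentioned above; a hedged example (the six-week value is just an illustration):

git --git-dir=/gitroot/equinox/rt.equinox.framework.git config gc.pruneExpire "6.weeks.ago"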

My main interest in running gc is not really to discard unreachable objects so much as to pack loose objects and refs for efficiency.
Comment 11 Denis Roy CLA 2011-10-26 11:21:34 EDT
> It is not nice to attempt to use my own words against me. Don't you know
> anything about politics in the US? :-)

No, I live in Canada and I am in no way exposed to US politics  </sarcasm>  but brace yourself for I shall do it again.

> My main interests in running gc is not really to discard unreachable objects as
> much as it is to pack loose objects and refs for efficiency.

According to your bug 362076 comment 4:

Sometimes, gc will be run automatically:

"Some git commands run git gc --auto after performing operations that could
create many loose objects."

At this point, if the purpose of running gc is to clean up loose objects, and git-gc is run automatically after performing operations that could create loose objects, why would we disable that and run it manually?
Comment 12 BJ Hargrave CLA 2011-10-26 11:38:37 EDT
(In reply to comment #11)
> At this point, if the purpose of running gc is to clean up loose objects, and
> git-gc is run automatically after performing operations that could create loose
> objects, why would we disable that and run it manually?

In the case from that bug, old objects were dereferenced by the (mistaken) deletion of branches and tags. The auto gc then pruned the old objects immediately. Disabling auto gc would allow the mistake maker to restore the branches and tags to the still available objects.

A gc cron job would likely run at a time distant in the future, unless the mistake maker was unlucky enough to make his mistake just before the cron job ran. :-)
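
If disabling auto gc on a hosted repo were the chosen approach, the mechanism would be the gc.auto setting (0 turns the automatic run off); a hedged example against one of the repos mentioned above:

git --git-dir=/gitroot/equinox/rt.equinox.framework.git config gc.auto 0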
Comment 13 BJ Hargrave CLA 2011-10-26 11:41:33 EDT
Note: The gc cron job could also run with --no-prune if we are very concerned about the kind of mistake in bug 362076. However, a pre-receive hook is a better way to prevent that kind of error.
Comment 14 David Williams CLA 2012-08-15 12:51:57 EDT
I was excited when I saw the title of this bug, because I was curious too ... then disappointed when I saw no conclusion and that the last comment was pretty old.

I guess the recommendation from knowledgeable committers is that the Foundation run a cron job ... but, Denis, if your conclusion is "no", then being explicit and closing this as "won't fix" would at least let us know, and we'd document our procedure and set up our own cron jobs. [And I'm not trying to encourage you to close as "won't fix" if you need more time to research :) ]

I wrote a small script to check "unreachable objects" on some of our platform repositories. No idea if these counts (listed below) represent large numbers or small ones compared to the "git world of repos" (but they seem relatively small to me). Assuming no one has ever run gc (explicitly) on the platform repos, I'd think this cleanup cron job would only have to run rarely, say ... once a month?

Count of unreachable objects:
(using git fsck --unreachable | wc -l ; a sketch of the loop follows the list)

/gitroot/platform/eclipse.platform.common.git: 56
/gitroot/platform/eclipse.platform.debug.git: 6
/gitroot/platform/eclipse.platform.git: 8
/gitroot/platform/eclipse.platform.news.git: 25
/gitroot/platform/eclipse.platform.releng.eclipsebuilder.git: 0
/gitroot/platform/eclipse.platform.releng.git: 1
/gitroot/platform/eclipse.platform.releng.maps.git: 10
/gitroot/platform/eclipse.platform.resources.git: 5
/gitroot/platform/eclipse.platform.runtime.git: 11
/gitroot/platform/eclipse.platform.swt.binaries.git: 27
/gitroot/platform/eclipse.platform.swt.git: 90
/gitroot/platform/eclipse.platform.team.git: 5
/gitroot/platform/eclipse.platform.text.git: 3
/gitroot/platform/eclipse.platform.ua.git: 53
/gitroot/platform/eclipse.platform.ui.git: 496
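
A sketch of the kind of loop that produces the counts above when run on the server; the glob and output format are illustrative:

#!/bin/sh
# Count unreachable objects in each bare platform repository.
for repo in /gitroot/platform/*.git; do
    echo "$repo: $(git --git-dir="$repo" fsck --unreachable | wc -l)"
done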
Comment 15 BJ Hargrave CLA 2012-08-15 14:58:40 EDT
(In reply to comment #14)
> Count of unreachable objects: 

It is not the unreachable objects that are interesting here (although they should be cleaned up). It is the loose (or unpacked) objects which reduce performance over time. git gc will call git repack[1] to pack the loose objects.

[1] https://git-htmldocs.googlecode.com/git/git-repack.html
Comment 16 David Williams CLA 2012-08-15 15:33:18 EDT
(In reply to comment #15)
> (In reply to comment #14)
> > Count of unreachable objects: 
> 
> It is not the unreachable objects that are interesting here (although they
> should be cleaned up). It is the loose (or unpacked) objects which reduce
> performance over time. git gc will call git repack[1] to pack the loose
> objects.
> 
> [1] https://git-htmldocs.googlecode.com/git/git-repack.html

Ok, thanks for the clarification. I don't see any easy way to "measure" looseness :) and if no one knows how to do that, is there a way to do some sort of "before/after" measurement? That is, before and then after I "manually" run "git gc --auto" on each of these platform repos, to see if there is much improvement? Would the difference show up in "disk use" size? Or is it more complicated than that? Perhaps we are worrying over nothing ... measurements might help determine the best course of action?

[FWIW, I got to looking into this because the webtools map-file repo takes 60 MB when cloned and takes a long time, while the platform's map-file repo takes only 30 MB and seems more than twice as fast (though I didn't literally measure each). I was hoping some cleanup/gc would "cure" that webtools repo, but it didn't.]
Comment 17 BJ Hargrave CLA 2012-08-15 15:49:07 EDT
(In reply to comment #16)
> Ok, thanks for the clarification. I don't see any easy way to "measure"
> looseness :) 

Run

find objects/?? -type f | wc -l

in the git repo.

find objects/pack -ls

will show the number and size of the packs.

After a git gc, the number of loose objects should go way down (it probably won't go to zero, since some loose objects are unreachable and will only become garbage after some time elapses) and the packs will be larger and/or more numerous.

> and if no one knows how to do that, is there a way to do some
> sort of "before/after" measurement. That is, before and then after I would
> "manually" run git gc -auto" on each of these platform repos to see if there
> is very much improvement? Would the difference show up with "disk use" size?
> Or is it more complicated than that? Perhaps we are worrying over nothing,
> ... measurements might help know the best course of action? 
> 
> [FWIW, I got to looking into this, because webtools maps file repo take 60
> Megs when cloned and takes a long time, the platform's maps files take only
> 30 Meg ans seems faster than "half as fast" (though, didn't literally
> measure each). I was hoping some cleanup/gc stuff would "cure" that webtool
> repo, but it didn't].

When you have many loose objects, it takes git upload-pack much longer to organize all the needed objects into a pack for the fetcher. If they are already packed up into a pack, then you just need to transmit the existing pack.
Comment 18 David Williams CLA 2012-08-15 17:54:38 EDT
Created attachment 219929 [details]
some "stats" for platform repositories

Attached are some stats for the 15 platform repositories -- "before" stats. 

I assume there's no harm that could come from running git gc --auto on each? (I think the default "prune" time is 2 weeks, so it would not prune anything less than 2 weeks old).

I also cloned all the repos (in order to time it), and (over my network) it took about 30 minutes. Is that something we'd expect to be faster once gc has run (assuming it reduces loose objects and improves the packs, as expected)? Just want to confirm I'm not chasing the wind here.
Comment 19 BJ Hargrave CLA 2012-08-15 18:15:44 EDT
(In reply to comment #18)
> Attached are some stats for the 15 platform repositories -- "before" stats. 

Wow! That is a lot of loose objects on most of the repos. Many also have a lot of packs. All these repos would benefit from a gc.

Also, I learned that "git count-objects -v" is the proper way to get loose object and pack stats for a repo :-)

> 
> I assume there's no harm that could come from running git gc --auto on each?

You don't want --auto, since gc may not do anything depending upon how the repo is configured. Just run "git gc".
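
For clarity, the plain form (without --auto) over ssh, following the pattern from comment 2 but without --aggressive; the specific repo path is just one of the platform repos listed above:

ssh user@git.eclipse.org "git --git-dir=/gitroot/platform/eclipse.platform.ui.git gc"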

> (I think the default "prune" time is 2 weeks, so would not prune anything if
> less that 2 weeks old). 

Yup. Also depends upon repo config.

> 
> I also cloned all the repos (in order to time it) and (over my network) took
> about 30 minutes. Is that something we'd expect to be faster once gc ran?
> (assuming it reduces loose objects, improves packs, as expected?). Just want
> to confirm I'm not chasing the wind here.

It will reduce the time it takes the server to assemble the pack (git upload-pack) for transmission to the fetcher (git fetch-pack). Download time won't really change.
Comment 20 David Williams CLA 2012-08-15 18:19:27 EDT
> 
> Also, I learned that "git count-objects -v" is the proper way to get loose
> object and packs stats for a repo :-)
> 

There's a git command for everything, I'm learning :)
Comment 21 David Williams CLA 2012-08-15 19:09:24 EDT
Created attachment 219931 [details]
count-objects stats before gc

It's hard for me to know what all this means ... but this attachment has the stats from

 git count-objects -v

for each repo (before running git gc, which I'll now try).
Comment 22 David Williams CLA 2012-08-15 20:21:13 EDT
Created attachment 219932 [details]
stats after gc

Everything looks as expected in the stats (based on what BJ said and what I've read). 

But I'm not sure how to demonstrate what difference it makes (that is, what benefit it gives to the infrastructure or to users).

As BJ predicted, when I re-timed cloning all repos, it was about the same (actually, two minutes longer, but that could easily be variations in my wireless service). 

To respond to the last sentence in comment 11 (Denis's question about why we would disable auto gc and run it manually via a cron job): I think we should not disable what normally runs automatically ... but, from the stats, it appears to me a weekly or monthly gc cron job wouldn't be a bad idea. It's hard for me to know what the "real" benefit would be, though. Perhaps some "production services" (e.g. GitHub) document what they do? Perhaps this is analogous to "defragmenting your disk" (on Windows) ... some people swear it really improves things, but others claim it is rarely needed, especially these days?

But, to be explicit, I fail to see why a project would be motivated to run their own cron jobs ... but, obviously, I could be missing something ... so feel free to continue educating me :)
Comment 23 David Williams CLA 2012-08-16 00:35:00 EDT
Ah, I forgot: one measure of "benefit" is disk space used. After going through all that, the "du" on the server (for /gitroot/platform) went down to 870 MB (down from 1300 MB prior to gc). So, nothing to sneeze at.

Once I clone locally, the working directory is (still) 1.2 G, not too surprisingly.
Comment 24 Denis Roy CLA 2012-09-10 11:57:58 EDT
Bug 389101 is a good example of git gc --auto being helpful.

I'm not sure if Gerrit runs gc.
Comment 25 Denis Roy CLA 2013-11-15 15:49:16 EST
(In reply to Denis Roy from comment #24)
> Bug 389101 is a good example of git gc --auto being helpful.
> 
> I'm not sure if Gerrit runs gc.

It doesn't.  See bug 421648.

We need to do this.
Comment 26 Denis Roy CLA 2013-11-21 15:44:04 EST
We're now running gerrit gc --all on all the Gerrit repos every Saturday, since JGit doesn't do automatic gc.  The nice thing is that since Gerrit already "owns" all its repos, we don't have to muck with permissions or user switching.
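
For reference, a hypothetical crontab entry for such a weekly job; the ssh port, account name, and Saturday 03:00 time are assumptions, not the actual Foundation configuration:

# Hypothetical: run Gerrit's built-in gc over all projects every Saturday at 03:00.
0 3 * * 6 ssh -p 29418 gerritadmin@git.eclipse.org gerrit gc --all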

http://stackoverflow.com/questions/9938215/should-git-gc-be-run-periodically-on-gerrit-managed-git-repositories

We have no plans for running git gc on non-Gerrit project repos.