Does Eclipse gc the bare repos on git.eclipse.org on a regular basis, for example via a cron job? If not, how do we gc our own repos to clean up unreachable objects and pack the loose objects? Thanks.
I think this can be done with the following ssh command:

    ssh git.eclipse.org git --git-dir=/gitroot/equinox/rt.equinox.framework.git --bare gc

Is this a good idea? Should that be the recommended way for a committer to GC their git repos?
    ssh user@git.eclipse.org "git --git-dir=/gitroot/equinox/rt.equinox.framework.git gc --aggressive"

... should work.
Tom, do you want me to set up a cron job on eclipse.org to gc the repos on, say, a monthly basis? This would be easier than having individual committers have to remember to do it.
(In reply to comment #3)
> Tom, do you want me to set up a cron job on eclipse.org to gc the repos on,
> say, a monthly basis? This would be easier than having individual committers
> have to remember to do it.

That is probably a good idea. I was not sure if the foundation wanted to make a general cron job that does this for all repos or to leave it up to each project to decide how and when their upstream repos are GC'ed.
(In reply to comment #2)
> ssh user@git.eclipse.org "git --git-dir=/gitroot/equinox/rt.equinox.framework.git gc --aggressive"
>
> ... should work.

I am not sure I would use --aggressive.
(In reply to comment #4)
> (In reply to comment #3)
> > Tom, do you want me to set up a cron job on eclipse.org to gc the repos
> > on, say, a monthly basis? This would be easier than having individual
> > committers have to remember to do it.
>
> That is probably a good idea. I was not sure if the foundation wanted to
> make a general cron job that does this for all repos or to leave it up to
> each project to decide how and when their upstream repos are GC'ed.

I can't imagine most project leads have any idea about git repo maintenance needs. The foundation should do this for all hosted repos. Over time, without regular gc, the hosted repos will end up with many unreachable objects, many loose objects and many unpacked refs. This will slowly degrade the performance of accessing the hosted repos for developers and the build team.
I agree with BJ that this should be a cron job at the foundation. Something that could ls /gitroot and dynamically create a script to garbage collect all the repos. This would avoid maintaining a static shell script that would get stale as repos are added and deleted.
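A minimal sketch of such a cron job (assuming all hosted repos are bare directories named *.git under /gitroot; the paths and options are illustrative, not what the foundation actually runs):

    #!/bin/sh
    # Enumerate every bare repo under /gitroot and gc it in place.
    # The repo list is computed on each run, so it never goes stale
    # as repos are added and deleted.
    find /gitroot -type d -name '*.git' -prune -print | while read -r repo; do
        git --git-dir="$repo" gc --quiet
    done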
(In reply to comment #7)
> I agree with BJ that this should be a cron job at the foundation. Something
> that could ls /gitroot and dynamically create a script to garbage collect
> all the repos. This would avoid maintaining a static shell script that would
> get stale as repos are added and deleted.

+1

Also, I don't think we should use --aggressive.
> I can't imagine most project leads have any idea about git repo maintenance
> needs. The foundation should do this for all hosted repos.

By the same token, your post on eclipse-dev[1] leads me to believe that the projects should know best when to perform maintenance and when gc is best run:

"In any case, a branch is just a pointer to a commit. If deleted, they can easily be recreated assuming the commit has not been gc'd."

I would hate to have a cron job run a gc while a team is in full swing and some garbage may still need to be salvaged. Historically, the foundation has never meddled with project source repositories. I would very much prefer to stay out of it.

[1] http://dev.eclipse.org/mhonarc/lists/eclipse-dev/msg09277.html
(In reply to comment #9)
> "In any case, a branch is just a pointer to a commit. If deleted, they can
> easily be recreated assuming the commit has not been gc'd."

It is not nice to attempt to use my own words against me. Don't you know anything about politics in the US? :-)

A pre-receive hook which prevents committers (at least non-PMC lead committers) from deleting the main branches (e.g. branch name does not contain /), coupled with the foundation managing gc hygiene, would work well. And given the distributed nature of git repos, there will always be some repo which has the interesting commit. For example, mirroring to github means the github mirror of the repo would likely have the commit. This sort of mistake (e.g. deleting the master branch) is generally caught very quickly.

git gc generally does not discard objects until they have aged for a while: "(default is 2 weeks ago, overridable by the config variable gc.pruneExpire)". So there are at least 2 weeks to discover and fix the problem before the commit would be discarded. We can even extend the expire time if we wish.

My main interest in running gc is not really to discard unreachable objects as much as it is to pack loose objects and refs for efficiency.
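For illustration, a pre-receive hook along those lines could look roughly like this (a sketch only; the exact policy, such as exempting PMC leads, is an assumption):

    #!/bin/sh
    # pre-receive: reject deletion of top-level branches (names with no '/').
    # Each line on stdin is: <old-sha> <new-sha> <refname>
    zero=0000000000000000000000000000000000000000
    while read -r old new ref; do
        case "$ref" in
            refs/heads/*/*)
                ;;  # branch name contains '/': deletion allowed
            refs/heads/*)
                if [ "$new" = "$zero" ]; then
                    echo "Deleting top-level branch $ref is not allowed." >&2
                    exit 1
                fi
                ;;
        esac
    done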
> It is not nice to attempt to use my own words against me. Don't you know
> anything about politics in the US? :-)

No, I live in Canada and I am in no way exposed to US politics </sarcasm> but brace yourself, for I shall do it again.

> My main interest in running gc is not really to discard unreachable objects
> as much as it is to pack loose objects and refs for efficiency.

According to your bug 362076 comment 4, gc will sometimes be run automatically: "Some git commands run git gc --auto after performing operations that could create many loose objects."

At this point, if the purpose of running gc is to clean up loose objects, and git-gc is run automatically after performing operations that could create loose objects, why would we disable that and run it manually?
(In reply to comment #11)
> At this point, if the purpose of running gc is to clean up loose objects,
> and git-gc is run automatically after performing operations that could
> create loose objects, why would we disable that and run it manually?

In the case from that bug, old objects were dereferenced by the (mistaken) deletion of branches and tags. The auto gc then pruned the old objects immediately. Disabling auto gc would allow the mistake maker to restore the branches and tags to the still-available objects. A gc cron job would likely run at a time distant in the future, unless the mistake maker was unlucky enough to make his mistake just before the cron job ran. :-)
Note: The gc cron job could also run with --no-prune if we are very concerned about the kind of mistake in bug 362076. However, a pre-receive hook (as sketched above) is a better way to prevent that kind of error.
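Concretely, the knobs discussed in the last few comments would look something like this (illustrative values, run in each bare repo; not a statement of how the servers are actually configured):

    # Disable automatic gc entirely, so a mistaken deletion can be
    # undone before any scheduled gc runs
    git config gc.auto 0

    # Extend how long unreachable objects are kept before being pruned
    # (the default is 2 weeks, per gc.pruneExpire)
    git config gc.pruneExpire 1.month.ago

    # Or pack loose objects and refs without pruning anything at all
    git gc --no-prune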
I was excited when I saw the title of this bug, because I was curious too ... then disappointed when I saw no conclusion and that the last comment was pretty old. I guess the recommendation from knowledgeable committers is that the Foundation run a cron job ... but, Denis, if your conclusion is "no" ... then if you were explicit and closed this as "won't fix", at least we'd know, and we'd document our procedure and have our own cron jobs. [And I'm not trying to encourage you to close as "won't fix", if you need more time to research :)]

I wrote a small script to check "unreachable objects" on some of our platform repositories. No idea if these counts (listed below) represent large numbers or small compared to the "git world of repos" (but they seem relatively small to me). Assuming no one has ever run gc (explicitly) on the platform's repos, I'd think this clean-up cron job would only have to run rarely, say ... once a month?

Count of unreachable objects (using git fsck --unreachable | wc -l):

/gitroot/platform/eclipse.platform.common.git 56
/gitroot/platform/eclipse.platform.debug.git 6
/gitroot/platform/eclipse.platform.git 8
/gitroot/platform/eclipse.platform.news.git 25
/gitroot/platform/eclipse.platform.releng.eclipsebuilder.git 0
/gitroot/platform/eclipse.platform.releng.git 1
/gitroot/platform/eclipse.platform.releng.maps.git 10
/gitroot/platform/eclipse.platform.resources.git 5
/gitroot/platform/eclipse.platform.runtime.git 11
/gitroot/platform/eclipse.platform.swt.binaries.git 27
/gitroot/platform/eclipse.platform.swt.git 90
/gitroot/platform/eclipse.platform.team.git 5
/gitroot/platform/eclipse.platform.text.git 3
/gitroot/platform/eclipse.platform.ua.git 53
/gitroot/platform/eclipse.platform.ui.git 496
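For reference, the check amounted to a loop like this (a hedged reconstruction of the script described above, assuming shell access on the server):

    #!/bin/sh
    # Count unreachable objects in each platform repo.
    for repo in /gitroot/platform/*.git; do
        count=$(git --git-dir="$repo" fsck --unreachable | wc -l)
        echo "$repo $count"
    done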
(In reply to comment #14)
> Count of unreachable objects:

It is not the unreachable objects that are interesting here (although they should be cleaned up). It is the loose (or unpacked) objects which reduce performance over time. git gc will call git repack[1] to pack the loose objects.

[1] https://git-htmldocs.googlecode.com/git/git-repack.html
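For reference, the packing step that gc delegates to can also be run on its own; roughly (an illustrative invocation, not necessarily how gc parameterizes it):

    # Pack all reachable objects into a single pack and delete the
    # now-redundant loose copies; unreachable objects are left alone.
    git repack -a -d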
(In reply to comment #15)
> It is not the unreachable objects that are interesting here (although they
> should be cleaned up). It is the loose (or unpacked) objects which reduce
> performance over time. git gc will call git repack[1] to pack the loose
> objects.
>
> [1] https://git-htmldocs.googlecode.com/git/git-repack.html

Ok, thanks for the clarification. I don't see any easy way to "measure" looseness :) and if no one knows how to do that, is there a way to do some sort of "before/after" measurement? That is, before and then after, I would "manually" run "git gc --auto" on each of these platform repos to see if there is very much improvement? Would the difference show up in "disk use" size? Or is it more complicated than that? Perhaps we are worrying over nothing ... measurements might help us know the best course of action.

[FWIW, I got to looking into this because the webtools map-file repo takes 60 MB when cloned and takes a long time, while the platform's map files take only 30 MB and seem more than proportionally faster (though I didn't literally measure each). I was hoping some cleanup/gc stuff would "cure" that webtools repo, but it didn't.]
(In reply to comment #16)
> Ok, thanks for the clarification. I don't see any easy way to "measure"
> looseness :)

Run

    find objects/?? -type f | wc -l

in the git repo. And

    find objects/pack -ls

will show the number and size of the packs. After a git gc, the number of loose objects should go way down (it probably won't go to zero, as some loose objects are unreachable and will only become garbage after some time elapses) and the packs will be larger or more numerous.

> and if no one knows how to do that, is there a way to do some sort of
> "before/after" measurement? That is, before and then after, I would
> "manually" run "git gc --auto" on each of these platform repos to see if
> there is very much improvement? Would the difference show up in "disk use"
> size? Or is it more complicated than that? Perhaps we are worrying over
> nothing ... measurements might help us know the best course of action.
>
> [FWIW, I got to looking into this because the webtools map-file repo takes
> 60 MB when cloned and takes a long time, while the platform's map files
> take only 30 MB and seem more than proportionally faster (though I didn't
> literally measure each). I was hoping some cleanup/gc stuff would "cure"
> that webtools repo, but it didn't.]

When you have many loose objects, it takes git upload-pack much longer to organize all the needed objects into a pack for the fetcher. If they are already packed up into a pack, then you just need to transmit the existing pack.
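Putting that together, a before/after measurement for one repo could look like this (a sketch; the repo path is just an example from the list above):

    #!/bin/sh
    # Compare loose-object counts and pack layout around a gc.
    cd /gitroot/platform/eclipse.platform.ui.git || exit 1
    echo "loose objects before: $(find objects/?? -type f | wc -l)"
    find objects/pack -ls          # pack count and sizes before
    git gc
    echo "loose objects after:  $(find objects/?? -type f | wc -l)"
    find objects/pack -ls          # pack count and sizes after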
Created attachment 219929 [details]
some "stats" for platform repositories

Attached are some stats for the 15 platform repositories -- "before" stats.

I assume there's no harm that could come from running git gc --auto on each? (I think the default "prune" time is 2 weeks, so it would not prune anything less than 2 weeks old.)

I also cloned all the repos (in order to time it) and (over my network) it took about 30 minutes. Is that something we'd expect to be faster once gc has run? (assuming it reduces loose objects and improves packs, as expected?) Just want to confirm I'm not chasing the wind here.
(In reply to comment #18)
> Attached are some stats for the 15 platform repositories -- "before" stats.

Wow! That is a lot of loose objects on most of the repos. Many also have a lot of packs. All these repos would benefit from a gc.

Also, I learned that "git count-objects -v" is the proper way to get loose object and pack stats for a repo :-)

> I assume there's no harm that could come from running git gc --auto on each?

You don't want --auto, since gc may then not do anything, depending upon how the repo is configured. Just "git gc".

> (I think the default "prune" time is 2 weeks, so it would not prune anything
> less than 2 weeks old.)

Yup. That also depends upon the repo config.

> I also cloned all the repos (in order to time it) and (over my network) it
> took about 30 minutes. Is that something we'd expect to be faster once gc
> has run?

It will reduce the time it takes the server to assemble the pack (git upload-pack) for transmission to the fetcher (git fetch-pack). Download time won't really change.
> Also, I learned that "git count-objects -v" is the proper way to get loose
> object and pack stats for a repo :-)

There's a git command for everything, I'm learning :)
Created attachment 219931 [details]
count-objects stats before gc

Hard for me to know what all this means ... but this attachment is the stats from git count-objects -v for each repo (before running git gc, which I'll now try).
Created attachment 219932 [details]
stats after gc

Everything looks as expected in the stats (based on what BJ said and what I've read). But I am not sure how to demonstrate what difference it makes (that is, what benefit it gives to infrastructure or users). As BJ predicted, when I re-timed cloning all repos, it was about the same (actually, two minutes longer, but that could easily be variations in my wireless service).

To respond to the last sentence in comment 11 (Denis' question about why we would disable it and run it manually via cron job): I think we should not disable what normally runs automatically ... but, from the stats, it appears to me some weekly or monthly gc cron job wouldn't be a bad idea. It is hard for me to know what the "real" benefit would be, though. Perhaps some "production services" (e.g. GitHub) document what they do? Perhaps this is analogous to "defragmenting your disk" (on Windows) ... some people swear it really improves things ... but others claim it is rarely needed, especially these days?

But, to be explicit, I fail to see why a project would be motivated to run their own cron jobs ... but, obviously, I could be missing something ... so feel free to continue educating me :)
Ah, forgot: one measure of "benefit" is disk space used. After going through all that, the "du" on the server (for /gitroot/platform) went down to 870 MB (down from 1300 MB prior to gc). So, nothing to sneeze at. Once I clone locally, the working directory is (still) 1.2 GB, not too surprisingly.
Bug 389101 is a good example of git gc --auto being helpful. I'm not sure if Gerrit runs gc.
(In reply to Denis Roy from comment #24)
> Bug 389101 is a good example of git gc --auto being helpful.
>
> I'm not sure if Gerrit runs gc.

It doesn't. See bug 421648. We need to do this.
We're now running "gerrit gc --all" on all the Gerrit repos every Saturday, since JGit doesn't do automatic gc. The nice thing is that since Gerrit already "owns" all its repos, we don't have to muck with permissions or user switching.

http://stackoverflow.com/questions/9938215/should-git-gc-be-run-periodically-on-gerrit-managed-git-repositories

We have no plans for running git gc on non-Gerrit project repos.
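For the curious, the weekly job amounts to something like this crontab entry (the exact time, user, and ssh port shown are assumptions, not the actual configuration):

    # Every Saturday at 03:00, ask Gerrit to gc all of its repositories
    # via its ssh admin interface (29418 is Gerrit's default ssh port).
    0 3 * * 6  ssh -p 29418 admin@git.eclipse.org gerrit gc --all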