Some Eclipse Foundation services are deprecated, or will be soon. Please ensure you've read this important communication.

Bug 331273

Summary: Commit operations opens and reads almost all files
Product: [Technology] EGit Reporter: Robin Rosenberg <robin.rosenberg>
Component: CoreAssignee: Project Inbox <egit.core-inbox>
Status: RESOLVED FIXED QA Contact:
Severity: normal    
Priority: P3 CC: angvoz.dev, christian.halstrick, daniel_megert, ilya.ivanov, leachbj, robert.munteanu
Version: 0.10.0   
Target Milestone: ---   
Hardware: PC   
OS: Mac OS X - Carbon (unsup.)   
Whiteboard:
Bug Depends on:    
Bug Blocks: 325393, 327913, 333204    

Description Robin Rosenberg CLA 2010-11-27 11:55:45 EST
Here is what I posted on the mailing list

Re: [jgit-dev] Avoiding stat calls (again)
13 nov 2010 kl. 16:36 skrev Robin Rosenberg:

This issue pops up again and again, I guess when one stops being tired of it. I got an idea that does not involve JNI.

If the time stamp hasn't changed, couldn't we assume size didn't change either? I believe
all OS:es update the timestamp when we write to a file, not just on open/close.

Given some definition of old, we could avoid retrieving the file size if the (old) timestamp
in the index equals the one on disk when checking for a possible modification. 

What would old be? 2 seconds (FAT) or much more?

See the Gerrit http://egit.eclipse.org/r/#change,1905. The main thing I wanted, was to make the Commit
dialog pop up faster. Now, what I found is that it actually opens and read every file, No wonder it
takes forever. Git is supposed to be fast.

Daemon Thread [ModalContext] (Suspended)	
	ContainerTreeIterator(WorkingTreeIterator).idBuffer() line: 213	
	ContainerTreeIterator(AbstractTreeIterator).idEqual(AbstractTreeIterator) line: 368	
	TreeWalk.idEqual(int, int) line: 696	
	TreeFilter$2.include(TreeWalk) line: 133	
	AndTreeFilter$List.include(TreeWalk) line: 162	
	TreeWalk.next() line: 551	
	IndexDiff.diff() line: 184	
	CommitActionHandler.buildIndexHeadDiffList(IProject[], IProgressMonitor) line: 324	
	CommitActionHandler.access$0(CommitActionHandler, IProject[], IProgressMonitor) line: 292	
	CommitActionHandler$1.run(IProgressMonitor) line: 100	
	ModalContext$ModalContextThread.run() line: 121	

The TreeWalk needs a better interface to the TreeIterators to support this cleanly and TreeWalk
must be able to receive parameters regarding how to compare with regards to filemode and crlf etc.

I think we need to break the TreeWalk API. Let's suprise Shawn when he gets back from vacation }:>

-- robin
Comment 1 Robin Rosenberg CLA 2010-11-27 11:57:13 EST
Shawns comments:

On Thu, Nov 18, 2010 at 11:56 AM, Shawn Pearce <spearce@spearce.org> wrote:
On Tue, Nov 16, 2010 at 2:57 PM, Robin Rosenberg
<robin.rosenberg@dewire.com> wrote:
See the Gerrit http://egit.eclipse.org/r/#change,1905. The main thing I wanted, was to make the Commit
dialog pop up faster. Now, what I found is that it actually opens and read every file, No wonder it
takes forever. Git is supposed to be fast.

Daemon Thread [ModalContext] (Suspended)
       ContainerTreeIterator(WorkingTreeIterator).idBuffer() line: 213
       ContainerTreeIterator(AbstractTreeIterator).idEqual(AbstractTreeIterator) line: 368
       TreeWalk.idEqual(int, int) line: 696
       TreeFilter$2.include(TreeWalk) line: 133

Ouch.  Someone is using the default ANY_DIFF with an iterator that
cannot provide fast access to the ObjectId of the current entry (aka a
WorkingTreeIterator).  No wonder its taking ages, we have to SHA-1
checksum the entire working directory.  That's a bad use of the
ANY_DIFF filter.
...
Idea 2)  Add knowledge of the TreeWalk and the position of the
DirCacheIterator to the WorkingTreeIterator.  That is, maybe we add a
method like:

 public void setDirCacheIterator(TreeWalk tw, int pos);

This is proposed in http://egit.eclipse.org/r/1927

But, I'm yet not convinced we should commit this.

If we rely on only this change, ANY_DIFF will still call idBuffer()
for working files that are modified, because they don't match the
DirCacheIterator.  If the files are sufficiently large, JGit will
still crawl because idBuffer() is spending a lot of time to compute a
SHA-1 checksum that doesn't matter.

I never meant for TreeFilter.ANY_DIFF to be used with a
WorkingTreeIterator.  It can be used... but its performance results
are disastrous because it relies solely on idBuffer(), and idBuffer()
is an inherently slow operation for working tree files.  I intended to
use something like the WorkingTreeIterator's isModified() method,
through a different TreeFilter implementation.  The earlier rewrite of
the resource decorator in EGit was the first such example (and
predates the isModified method by a lot).

It may simply be the case that the offending application code should
stop using ANY_DIFF when there is a WorkingTreeIterator involved, and
instead use a different TreeFilter implementation that can more
effectively take advantage of the isModified() method.  This filter
probably should be defined as a stock filter in JGit, and take the
various relevant indexes as constructor parameters.

An alternative approach is to add some sort of "compare" method to
AbstractTreeIterator that TreeWalk can use, so that instead of
idEqual(int,int) we have entryEqual(int, int) that delegates down into
that compare routine.  Then WorkingTreeIterator and DirCacheIterator
can override the compare routine to try and special case a compare of
those two iterator types, relying on isModified() instead of
idBuffer().  This does get ugly implementation wise, as both iterator
types need to override the method with identical delegation to a
common compare routine.  If we ever grew additional iterator types,
the permutations spread out through the various implementations quite
quickly, making for a very bug-prone system.

-- 
Shawn.
Comment 2 Robin Rosenberg CLA 2010-11-27 11:57:47 EST
Christians Halstrick's comments

Hi,

An alternative approach is to add some sort of "compare" method to

I wanted to suggest this approach of having an explicit comparison
even before I read your mail. The fixes you suggested intend to make
idBuffer() faster. But in the use case of ANY_DIFF we are not
interested in the content IDs. We want to compare to TreeIterators
whether they currently point to the same content or not (I guess we
are only interested in content equality not in greater, lesser,...).

I see also problems in implementating this because
DirCacheIterator.compare(WorkTreeIterator) and
WorkTreeIterator.compare(DirCacheIterator) should both use an
implementation probably implemented in WorkTreeIterator  (that
implementations should be clever enough to call isModified() instead
of blindly calling idBuffer()). We could solve this by additionally
introducing a TreeIteratorComparator which implements a
compare(TreeIterator a, TreeIterator b). That compare method would be
public and the compare methods on the iterators itself would be
protected, so we enforce that the clever TreeIteratorComparator.
TeeIteratorComparator would delegate to the compare method of either
"a" or "b" depending on the types of "a" and "b". The knowledge which
comparison is better is completely hidding in the
TeeIteratorComparator and the Iteratores themselfes don't have to
care. E.g. He would take care to call the WorkingTreeIterators compare
method (which would use isModified()) if he see's thath one of the
iterators is a WorkingTreeIterator.

AbstractTreeIterator that TreeWalk can use, so that instead of
idEqual(int,int) we have entryEqual(int, int) that delegates down into
that compare routine.  Then WorkingTreeIterator and DirCacheIterator
can override the compare routine to try and special case a compare of
those two iterator types, relying on isModified() instead of
idBuffer().  This does get ugly implementation wise, as both iterator
types need to override the method with identical delegation to a
common compare routine.  If we ever grew additional iterator types,
the permutations spread out through the various implementations quite
quickly, making for a very bug-prone system.

-- Chris
Comment 3 Ilya Ivanov CLA 2011-07-12 07:00:00 EDT
Hi guys! We have a strange situation with this issue. 

One of developers has very low performance in operations with WorkingTreeIterator.
I have added some logging and found that isModified() method always computes SHA.
Methods entry.getLength() and entry.getLastModified() return 0 in most cases. It results in checking content of every file.
I have no idea how such wrong index entries can be created. When I copy his .git/index file into my repo I can reproduce the problem. But after reset index backs to normal and performance is ok. 

The question is how does the index become broken?

In my environment I never had such problem. We both use Ubuntu with ext4 file system.
Comment 4 Christian Halstrick CLA 2011-08-10 08:54:47 EDT
Hi Ilya,

which version of JGit are you using? Are you working on a shared repo so I can have a look at the repo where it happens? If I can't have access to
the real repo and index file then please attach the output "debug-show-dir-cache" command from jgit command line. (How to use the jgit command line is described here: http://wiki.eclipse.org/JGit/User_Guide).
Comment 5 Robin Rosenberg CLA 2012-12-30 12:31:36 EST
I think it is safe to say that 45edb41a9ed1248964cb4bf0d635f2643406d123, 342de38e57e3052d3d2f12e93d629bdc63f123be and possibly other commits have fixed this issue.