Some Eclipse Foundation services are deprecated, or will be soon. Please ensure you've read this important communication.
Bug 323398 - new RevWalk(db).parseBody(rc) is bad
Summary: new RevWalk(db).parseBody(rc) is bad
Status: CLOSED FIXED
Alias: None
Product: EGit
Classification: Technology
Component: UI (show other bugs)
Version: 0.9.0   Edit
Hardware: PC Mac OS X - Carbon (unsup.)
: P3 normal (vote)
Target Milestone: ---   Edit
Assignee: Project Inbox CLA
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-08-23 11:06 EDT by Chris Aniszczyk CLA
Modified: 2010-09-18 15:55 EDT (History)
2 users (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Chris Aniszczyk CLA 2010-08-23 11:06:46 EDT
File org.eclipse.egit.core/src/org/eclipse/egit/core/internal/storage/CommitFileRevision.java
Line 61:                                new RevWalk(db).parseBody(rc);
You shouldn't do this.  I can't promise you that parsing the body of another RevWalk's RevCommit with a new RevWalk will continue to work in the future.

If you need to ensure the buffer is parsed, you should be passing around the RevWalk that created the RevCommit you passed around, and call parseBody() on that.

parseBody() knows whether or not the buffer is loaded.  Its essentially a no-op if the buffer is already loaded.  Therefore you don't need the rc.getRawBuffer() conditional check around it.

I know we don't promise that getRawBuffer() implies the other things are accessible.  That is, just because getRawBuffer() returns non-null does not imply that getFullMessage() or getAuthorIdent() will work.  Only parseBody() promises that.
Comment 1 Chris Aniszczyk CLA 2010-08-23 11:07:54 EDT
Dariusz, can you look at Shawn's concerns?
Comment 2 Dariusz Luksza CLA 2010-08-23 19:24:56 EDT
Shawn, is there any way to 'disconnect' RevCommit object from RevWalk?

As far as I understood how RevWalk works, it disposes 'buffer' array when it isn't usable any more for walking. I'm wondering does we, on application level, cache complete RevObject instance. If not ... the only solution for this problem would be passing RevWalk and ObjectId over all 'abstraction levels' in EGit.
Comment 3 Shawn Pearce CLA 2010-08-23 19:30:29 EDT
(In reply to comment #2)
> Shawn, is there any way to 'disconnect' RevCommit object from RevWalk?

Nope.  Well, you can get one by parsing the commit and then never using that
in a traversal.  I.e.:

  RevWalk rw = new RevWalk(db);
  RevCommit c = rw.parseCommit(objectId);
  ... now never use rw ...


Another option is to use rw.setRetainBody(true) so the walker won't discard the body its parsed.  But if you do big history operations involving hundreds of commits, and most of them you don't really need the body of, this can be a big waste of memory.

> As far as I understood how RevWalk works, it disposes 'buffer' array when it
> isn't usable any more for walking. I'm wondering does we, on application level,
> cache complete RevObject instance. If not ... the only solution for this
> problem would be passing RevWalk and ObjectId over all 'abstraction levels' in
> EGit.

Yes.  It is a bit common to pass around both.  Or, do a rw.parseBody(c) before you pass the RevCommit and then just ensure you never perform another traversal on that RevWalk.
Comment 4 Dariusz Luksza CLA 2010-08-23 19:44:38 EDT
(In reply to comment #3)
> Yes.  It is a bit common to pass around both.  Or, do a rw.parseBody(c) before
> you pass the RevCommit and then just ensure you never perform another traversal
> on that RevWalk.

This will be quite tricky since in GitModelRepository we use LogCommand to obtain list of commit's. It is even more complicated because we use two log's:
* common ancestor - source commit
* common ancestor - destination commit
this solution was suggested by Stefan Lay (AFAIR) to obtain complete list of commits between source and destination. LogCommad internally use RevWalk, so actually we have two separate instances of RevWalk ... maybe we could merge them into one? And then pass it to RevModelCommit constructor.
Comment 5 Shawn Pearce CLA 2010-08-23 19:47:34 EDT
(In reply to comment #4)
> 
> This will be quite tricky since in GitModelRepository we use LogCommand to
> obtain list of commit's. It is even more complicated because we use two log's:
> * common ancestor - source commit
> * common ancestor - destination commit
> this solution was suggested by Stefan Lay (AFAIR) to obtain complete list of
> commits between source and destination. LogCommad internally use RevWalk, so
> actually we have two separate instances of RevWalk ... maybe we could merge
> them into one? And then pass it to RevModelCommit constructor.

Yes.  If you use the left/right approach I described you can use just one RevWalk instance to get both sets of commits and the common ancestor in a single traversal.  This will be faster, as its only one walk and not 3.

Since you are interested in all of the commits returned, their bodies would be saved by the RevWalk, and you'd be OK to use them without needing parseBody(), so long as you don't start a new traversal on that RevWalk.  Which you probably don't have to do, since you got everything in 1 traversal.

:-)
Comment 6 Dariusz Luksza CLA 2010-08-23 19:55:40 EDT
(In reply to comment #5)
> Since you are interested in all of the commits returned, their bodies would be
> saved by the RevWalk, and you'd be OK to use them without needing parseBody(),
> so long as you don't start a new traversal on that RevWalk.  Which you probably
> don't have to do, since you got everything in 1 traversal.

OK, but what about memory usage if we would have more then 1k commits in traversal?

We already had problem with NPE during parsing RevCommit on oldest commit in list, I'm afraid that we can once again hit this issue.
Comment 7 Shawn Pearce CLA 2010-08-23 20:08:02 EDT
(In reply to comment #6)
> (In reply to comment #5)
> > Since you are interested in all of the commits returned, their bodies would be
> > saved by the RevWalk, and you'd be OK to use them without needing parseBody(),
> > so long as you don't start a new traversal on that RevWalk.  Which you probably
> > don't have to do, since you got everything in 1 traversal.
> 
> OK, but what about memory usage if we would have more then 1k commits in
> traversal?

But that implies there are 1k commits different between the two sides.  So no matter what you do, you'd want 1k commits loaded so you can present them to the user, no?

If you abort the RevWalk when you find the common ancestor (the commit with both local and remote flags on it), the walker won't go too far back in history.  So even a project with 888k commits will only need to parse, load, and cache the commits that are different between the two sides, which is probably under 1k.  The remainder of the old history of the project won't even be looked at.

> We already had problem with NPE during parsing RevCommit on oldest commit in
> list, I'm afraid that we can once again hit this issue.

I'm not sure I understand.  Do you know what the root cause of that was?  JGit throws NPE for a lot of reasons, but the most common one during parseCommit() or parseBody() was the parameter was null.  :-)
Comment 8 Dariusz Luksza CLA 2010-08-23 20:13:35 EDT
(In reply to comment #7)
> I'm not sure I understand.  Do you know what the root cause of that was?  JGit
> throws NPE for a lot of reasons, but the most common one during parseCommit()
> or parseBody() was the parameter was null.  :-)

Yes I know the cause of NPE, the 'buffer' was null, because it was disposed by RevWalk. But I don't know why it was disposed.
Comment 9 Shawn Pearce CLA 2010-08-23 20:18:05 EDT
(In reply to comment #8)
> 
> Yes I know the cause of NPE, the 'buffer' was null, because it was disposed by
> RevWalk. But I don't know why it was disposed.

We dispose uninteresting commits, unless you ask for boundary sorting.

Was that "last" commit not actually returned by the RevWalk, but was instead someone else's parent?  In such a case we don't promise that the buffer is parsed.

Did you markUninteresting anything on that traversal?  Without topo sorting its possible for a commit to be output, but then later be determined to be uninteresting, and we might have discarded the buffer then, after you had already saved it.  Its the price you pay for not enforcing topo sorting on your result, you may get slightly inaccurate results, but its far faster, and usually correct.
Comment 10 Dariusz Luksza CLA 2010-08-23 20:33:36 EDT
(In reply to comment #9)
> Was that "last" commit not actually returned by the RevWalk, but was instead
> someone else's parent?

All commits that are displayed in git change set comes from LogCommand, with use RevWalk internally. Since there were used two separate LogCommand's it is possible that this "last commit" was created by different RevWalk ... and this seams to be the root cause of this NPE. If we use only one LogCommand with single RevWalk this problem will occur, am I right?
Comment 11 Dariusz Luksza CLA 2010-08-27 19:21:19 EDT
After merging commit 8a77b0e we can remove "new RevWalk(db).parseBody(rc)"

Change set with does that is here:
http://egit.eclipse.org/r/#change,1437
Comment 12 Dariusz Luksza CLA 2010-09-18 15:55:48 EDT
Fixed with commit: 75a371e156977f26f72a620880584e6332db0f42