
Bug 312868

Summary: java.lang.OutOfMemoryError when fetching objects bigger than heap size
Product: [Technology] JGit
Component: JGit
Reporter: Dmitry Neverov <dmitry.neverov>
Assignee: Shawn Pearce <sop>
Status: REOPENED ---
QA Contact:
Severity: normal
Priority: P3
CC: caniszczyk, dmitry.neverov, igor, jeblen, matthias.sohn, remy.suen, roberto.tyley, robin.rosenberg, roland, sop, stefanb, thomas.mey, tratar
Version: 0.7.1
Target Milestone: ---
Hardware: All
OS: All
Whiteboard:
Bug Depends on:
Bug Blocks: 390689

Description Dmitry Neverov CLA 2010-05-14 04:03:25 EDT
JGit loads fetched objects into memory, so we get a java.lang.OutOfMemoryError when trying to fetch an object bigger than the heap size.
Comment 1 Shawn Pearce CLA 2010-05-14 11:45:55 EDT
There are several ways objects get handled in JGit.  You'll need to look at all of them to really fix this bug.

The first spot to start poking at is org.eclipse.jgit.transport.IndexPack.  In this class the indexOneObject() method calls whole(int,long,long) to allocate a byte[] to hold the entire object and inflates it into memory.  You'll need to rework this code path to stream through the object instead.
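For the whole-object path, the streaming shape might look roughly like this (a minimal sketch; the class, method, and parameter names are placeholders, not actual IndexPack code): inflate and hash the content through a small fixed buffer instead of allocating one byte[] per object.

  import java.io.IOException;
  import java.io.InputStream;
  import java.nio.charset.StandardCharsets;
  import java.security.MessageDigest;
  import java.security.NoSuchAlgorithmException;
  import java.util.zip.InflaterInputStream;

  class StreamingWholeObjectSketch {
    // 'packIn' is positioned at the object's compressed data; 'type' and 'size'
    // come from the already-parsed object header. All three are placeholders.
    static byte[] hashWholeObject(InputStream packIn, String type, long size)
        throws IOException, NoSuchAlgorithmException {
      MessageDigest md = MessageDigest.getInstance("SHA-1");
      // Git object ids hash "<type> <size>\0" followed by the inflated content.
      md.update((type + " " + size + "\0").getBytes(StandardCharsets.US_ASCII));
      InputStream inflated = new InflaterInputStream(packIn);
      byte[] buf = new byte[8192];
      long remaining = size;
      while (remaining > 0) {
        int n = inflated.read(buf, 0, (int) Math.min(buf.length, remaining));
        if (n < 0)
          throw new IOException("unexpected end of object");
        md.update(buf, 0, n);
        remaining -= n;
      }
      return md.digest(); // the 20-byte object id, computed without a full byte[]
    }
  }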

The next spot is much more difficult.  During delta resolution, resolveDeltas(int,int,int,byte[],PackedObjectInfo) is called recursively to inflate each object and then any deltas that depend on it.  Each layer of that call stack holds a byte[] with the complete contents of one object, so the longer the delta chain, the more byte arrays we hold onto.  For a 50-deep chain, we hang onto 50 byte[]s at once.
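To make the memory cost concrete, the recursion has roughly this shape (illustrative only; the class and helper names are mine, not the real resolveDeltas() signature):

  import java.util.List;

  // Each level keeps its inflated byte[] reachable until every delta built
  // on top of it has been resolved.
  abstract class DeltaChainSketch {
    // Stand-ins for pack access; the real code reads these from the pack file.
    abstract List<byte[]> deltasBasedOn(byte[] base);
    abstract byte[] applyDelta(byte[] base, byte[] delta);

    void resolve(byte[] base) {
      for (byte[] delta : deltasBasedOn(base)) {
        byte[] result = applyDelta(base, delta); // a full copy of the child object
        // ... compute result's SHA-1 and record it ...
        resolve(result); // 'base' and 'result' both stay live for the whole subtree
      }
    }
  }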

If you fix that, you should be able to fetch/clone a repository with really big file contents.  But you might not be able to actually perform a checkout of those files, because the random access code paths also allocate a single byte[] for the entire blob.


The other work we need to do is modify the ObjectLoader API to add a method:

  public InputStream getInputStream();

and actually implement the various ObjectLoaders such that an InputStream can stream through the contents rather than pulling it all into memory at once.  For larger blobs we then need to modify any code that uses an ObjectLoader to use the InputStream rather than the byte[].  (However, for the smaller commits, trees, and tags we should keep using the byte[], as it's significantly faster.)
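As a sketch of how calling code might then pick between the two paths (assumptions: the 1 MiB cut-over and the class/method names are hypothetical, and getInputStream() is only the method proposed above, not an existing ObjectLoader API):

  import java.io.IOException;
  import java.io.InputStream;
  import java.io.OutputStream;
  import org.eclipse.jgit.lib.Constants;
  import org.eclipse.jgit.lib.ObjectLoader;

  class StreamingLoaderSketch {
    static void copyTo(ObjectLoader loader, OutputStream out) throws IOException {
      final long streamThreshold = 1 << 20; // hypothetical 1 MiB cut-over
      if (loader.getType() != Constants.OBJ_BLOB || loader.getSize() < streamThreshold) {
        // Commits, trees, tags and small blobs stay on the fast byte[] path.
        out.write(loader.getCachedBytes());
        return;
      }
      // Large blobs go through the proposed stream instead of one big byte[].
      InputStream in = loader.getInputStream(); // method proposed in this comment
      try {
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1)
          out.write(buf, 0, n);
      } finally {
        in.close();
      }
    }
  }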
Comment 2 Shawn Pearce CLA 2010-06-30 13:33:15 EDT
I'm starting to work on this.
Comment 3 Shawn Pearce CLA 2010-12-03 16:34:33 EST
*** Bug 321097 has been marked as a duplicate of this bug. ***
Comment 4 Roberto Tyley CLA 2010-12-04 21:01:55 EST
Regarding the first spot Shawn highlighted - IndexPack.whole(int,long,long) allocating a large byte array - I've submitted a patch which addresses this issue to Gerrit:

http://egit.eclipse.org/r/#change,2040

I hadn't read Shawn's comment before encountering and then patching this issue, but what I did is exactly what he described :) The key change is that a DigestOutputStream is now used to calculate the object's digest: the entire object is loaded into memory *only* if it can be structurally checked. Structure checks apply only to the (relatively tiny) Tag, Commit & Tree objects, so they pose no problem; Blobs are only ever *streamed* to calculate their digest, removing the memory problem in that part of the code.
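In rough outline, the DigestOutputStream idea looks like this (a sketch only, with a made-up class name and a no-op sink; it is the same streaming idea as the sketch under comment 1, just plumbed through DigestOutputStream):

  import java.io.IOException;
  import java.io.InputStream;
  import java.io.OutputStream;
  import java.nio.charset.StandardCharsets;
  import java.security.DigestOutputStream;
  import java.security.MessageDigest;
  import java.security.NoSuchAlgorithmException;

  class BlobDigestSketch {
    static byte[] digestBlob(InputStream content, long size)
        throws IOException, NoSuchAlgorithmException {
      MessageDigest md = MessageDigest.getInstance("SHA-1");
      md.update(("blob " + size + "\0").getBytes(StandardCharsets.US_ASCII));
      OutputStream sink = new OutputStream() { // discard the bytes, keep the digest
        @Override public void write(int b) {}
        @Override public void write(byte[] b, int off, int len) {}
      };
      DigestOutputStream out = new DigestOutputStream(sink, md);
      byte[] buf = new byte[8192];
      int n;
      while ((n = content.read(buf)) != -1)
        out.write(buf, 0, n);
      out.flush();
      return md.digest(); // only an 8 KiB buffer was ever live
    }
  }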

The way I've written the code does take a liberty, in that it also removes a different kind of check: Blob objects are no longer checked by byte-for-byte comparison against already-known blobs. As the blob will have had its SHA checked, you could argue that a byte-by-byte comparison is overly paranoid. It might make sense to make this check optional, and implement it as a streaming comparison, depending on how strongly it's needed.
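If that optional check were added back as a streaming comparison, it could look roughly like this (the helper below is hypothetical, not part of the submitted patch):

  import java.io.BufferedInputStream;
  import java.io.IOException;
  import java.io.InputStream;

  class StreamCompareSketch {
    // Returns true if both streams deliver exactly the same bytes.
    static boolean sameContent(InputStream a, InputStream b) throws IOException {
      InputStream inA = new BufferedInputStream(a);
      InputStream inB = new BufferedInputStream(b);
      for (;;) {
        int x = inA.read();
        int y = inB.read();
        if (x != y)
          return false; // contents differ, or one stream ended before the other
        if (x < 0)
          return true;  // both ended together: identical content
      }
    }
  }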

best regards,
Roberto

P.S. Apologies for the lack of manual line-breaking in my commit messages :-(
Comment 5 Chris Aniszczyk CLA 2010-12-08 12:32:17 EST
http://egit.eclipse.org/r/#change,2040 is now MERGED
Comment 6 Robin Rosenberg CLA 2012-04-14 14:54:00 EDT
Fixed a long time ago.
Comment 7 John Eblen CLA 2012-10-17 14:58:31 EDT
It seems like this bug was never completely fixed. The patch only addresses the first spot. The other spots, specifically the problems with recursive calls to resolveDeltas(), have not been addressed. Bug 390689 is one example where it seems this bug is still occurring. Could someone on the JGit team clarify the status of this bug?
Comment 8 John Eblen CLA 2014-10-06 12:55:05 EDT
It has been two years since my question... :)

Could someone clarify whether this bug has been entirely fixed, either by the patch provided or elsewhere, or whether only part of it has been fixed? It seems that comment 1 lists three separate issues, and only the first one was addressed. What problems remain?
Comment 9 Shawn Pearce CLA 2014-12-10 11:21:00 EST
(In reply to John Eblen from comment #8)
> It has been two years since my question... :)
> 
> Could someone clarify whether this bug has been entirely fixed, either by
> the patch provided or elsewhere, or whether only part of it has been fixed?
> It seems that comment 1 lists three separate issues, and only the first one
> was addressed. What problems remain?

Large delta-compressed objects are still a problem. They can only be efficiently inflated by loading the entire base object into a random-access byte[], which must fit in the heap. There is no other option available.

git-core (the C version) and JGit both now avoid creating a delta-compressed object if the object is >50 MiB (configurable). Really big files are instead stored without delta compression, using only zlib deflate. Those go through the "whole object" paths, which were patched years ago to stream, avoiding the need to allocate the full object in the heap.
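For anyone who needs to adjust that limit from JGit code, a sketch (assumptions: the core.bigFileThreshold key, which is the key C git uses for this limit; verify that the JGit version in use reads the same key before relying on it):

  import java.io.IOException;
  import org.eclipse.jgit.lib.Repository;
  import org.eclipse.jgit.lib.StoredConfig;

  class BigFileThresholdSketch {
    static void raiseBigFileThreshold(Repository repo) throws IOException {
      StoredConfig cfg = repo.getConfig();
      // Store objects larger than 100 MiB without delta compression.
      cfg.setInt("core", null, "bigFileThreshold", 100 * 1024 * 1024);
      cfg.save();
    }
  }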
Comment 10 Matthias Sohn CLA 2014-12-17 18:53:23 EST
*** Bug 455401 has been marked as a duplicate of this bug. ***