| Summary: | java.lang.OutOfMemoryError when fetching objects bigger than heap size | | |
|---|---|---|---|
| Product: | [Technology] JGit | Reporter: | Dmitry Neverov <dmitry.neverov> |
| Component: | JGit | Assignee: | Shawn Pearce <sop> |
| Status: | REOPENED | QA Contact: | |
| Severity: | normal | | |
| Priority: | P3 | CC: | caniszczyk, dmitry.neverov, igor, jeblen, matthias.sohn, remy.suen, roberto.tyley, robin.rosenberg, roland, sop, stefanb, thomas.mey, tratar |
| Version: | 0.7.1 | | |
| Target Milestone: | --- | | |
| Hardware: | All | | |
| OS: | All | | |
| Whiteboard: | | | |
| Bug Depends on: | | | |
| Bug Blocks: | 390689 | | |
Description
Dmitry Neverov
There are several ways objects get handled in JGit. You'll need to look at all of them to really fix this bug.

The first spot to start poking at is org.eclipse.jgit.transport.IndexPack. In this class the indexOneObject() method calls whole(int,long,long) to allocate a byte[] holding the entire object and inflates it into memory. You'll need to rework this code path to stream through the object instead.

The next spot is much more difficult. During delta resolution, resolveDeltas(int,int,int,byte[],PackedObjectInfo) is called in a recursive call stack to inflate each object and then any deltas that depend on it. Each layer of that stack is a byte[] holding the complete contents of the object at that level. The longer the delta chain, the more byte arrays we hold onto. For a 50-deep chain, we hang onto 50 byte[]s on the stack.

If you fix that, you should be able to fetch/clone a repository with really big file contents. But you might not be able to actually perform a checkout of those files, because the random access code paths also allocate a single byte[] for the entire blob.

The other work we need to do is modify the ObjectLoader API to permit an additional method, public InputStream getInputStream(), and actually implement the various ObjectLoaders such that an InputStream can stream through the contents rather than pulling it all into memory at once. For larger blobs we then need to modify any code that uses an ObjectLoader to use the InputStream rather than the byte[]. (However, for smaller commits, trees and tags we should keep using the byte[] as it's significantly faster.)

I'm starting to work on this.

*** Bug 321097 has been marked as a duplicate of this bug. ***

Regarding the first spot Shawn highlighted (IndexPack.whole(int,long,long) allocating a large byte array), I've submitted a patch to Gerrit which addresses this issue: http://egit.eclipse.org/r/#change,2040

I hadn't read Shawn's comment before encountering and then patching this issue, but what I did is exactly what he described :) The key change is that a DigestOutputStream is now used to calculate the object's digest: the entire object is loaded into memory *only* if it can be structurally checked. Structure checks apply only to the (relatively tiny) Tag, Commit & Tree objects, so they pose no problem. Blobs, by contrast, are solely *streamed* to calculate their digest, removing the memory problem in that part of the code.

The way I've written the code does take a liberty, in that it also removes a different kind of check: Blob objects are no longer checked byte-for-byte against already-known blobs. As the blob will have had its SHA checked, you could argue that a byte-by-byte comparison is overly paranoid. It might make sense to make this checking optional, and to implement it as a streaming comparison, depending on how seriously it's required.

best regards, Roberto

P.S. Apologies for the lack of manual line-breaking in my commit messages :-(

http://egit.eclipse.org/r/#change,2040 is now MERGED

Fixed a long time ago.
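To make the streaming-digest idea above concrete: a Git object id is the SHA-1 of a short "<type> <length>\0" header followed by the raw content, so a blob's id can be computed from an InputStream in constant memory. The helper below is only an illustrative sketch of that principle; the class and method names are hypothetical, and this is not the code of the merged change, which wires a DigestOutputStream into IndexPack.

```java
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;

import org.eclipse.jgit.lib.Constants;
import org.eclipse.jgit.lib.ObjectId;

/**
 * Hypothetical helper: computes a blob's object id from a stream.
 * The id is SHA-1("blob <length>\0" + content), so no byte[] of the
 * whole blob is ever required.
 */
class StreamingBlobDigest {
	static ObjectId idFor(InputStream content, long length) throws IOException {
		MessageDigest md = Constants.newMessageDigest();
		md.update(Constants.encodedTypeString(Constants.OBJ_BLOB));
		md.update((byte) ' ');
		md.update(Constants.encodeASCII(length));
		md.update((byte) 0);
		byte[] buf = new byte[8192];
		int n;
		while ((n = content.read(buf)) > 0)
			md.update(buf, 0, n); // digest a chunk at a time, constant memory
		return ObjectId.fromRaw(md.digest());
	}
}
```

Later JGit releases expose an equivalent computation as ObjectInserter.idFor(type, length, in), which callers should generally prefer over hand-rolling the digest.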
It seems like this bug was never completely fixed. The patch only addresses the first spot. The other spots, specifically the problems with the recursive calls to resolveDeltas(), have not been addressed. Bug 390689 is one example where it seems this bug is still occurring. Could someone on the JGit team clarify the status of this bug?

It has been two years since my question... :)

Could someone clarify whether this bug has been entirely fixed, either by the patch provided or elsewhere, or whether only part of it has been fixed? It seems that comment 1 lists three separate issues, and only the first one was addressed. What problems remain?

(In reply to John Eblen from comment #8)
> It has been two years since my question... :)
>
> Could someone clarify whether this bug has been entirely fixed, either by
> the patch provided or elsewhere, or whether only part of it has been fixed?
> It seems that comment 1 lists three separate issues, and only the first one
> was addressed. What problems remain?

Large delta compressed objects are a problem. They can only be efficiently inflated by loading the entire base object into a random access byte[], which must fit into the heap. There is no other option available.

git-core (the C version) and JGit both now avoid creating a delta compressed object if the object is >50 MiB (configurable). These really big files are stored without delta compression, using only zlib deflate. They go through the "whole object" paths, which were patched years ago to stream, avoiding the need to fully allocate them in the heap.

*** Bug 455401 has been marked as a duplicate of this bug. ***
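For completeness, the cutoff Shawn mentions is configurable: C git documents it as core.bigFileThreshold, and JGit's PackConfig carries the same knob. The sketch below shows one way to raise it programmatically; the class name and the 100 MiB value are illustrative assumptions, not project code, and the exact default should be checked against your JGit version.

```java
import org.eclipse.jgit.lib.Repository;
import org.eclipse.jgit.storage.pack.PackConfig;

// Sketch: objects larger than the threshold are stored whole (zlib deflate
// only) rather than delta compressed, so they stay on the streaming
// "whole object" paths instead of needing a fully inflated byte[] delta base.
class BigFileThresholdExample {
	static PackConfig relaxedThreshold(Repository repo) {
		PackConfig pc = new PackConfig(repo); // seeded from the repository's config
		pc.setBigFileThreshold(100 * 1024 * 1024); // e.g. 100 MiB
		return pc; // hand this to a PackWriter, or persist core.bigFileThreshold instead
	}
}
```

Raising the threshold trades pack size for memory: more objects are stored whole, so fewer need a fully materialized delta base, at the cost of weaker compression for those files.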