We would benefit from a small JNI layer providing helper methods for file system operations that are not available in Java until Java 7 brings us NIO2. This was discussed in the mail thread [1].

List of operations in the JNI layer that JGit would benefit from (cited from [2] to ease tracking progress):

1) symlinks: Provide readlink() and symlink() so JGit can actually process symlinks like native C Git does, assuming the JNI library can be loaded and that you are on a POSIX system where symlinks work like they should.

2) lstat: Provide the majority of the lstat() structure up to the Java level. The first important point is that it is lstat() and not stat(), because then we read the status of a symlink itself and not the target of the link. The second is being able to get the st_mode, st_mtime and st_size fields in a single operating system call. If we want to be more compatible with the C implementation, we would honor tv_nsec, the nanosecond component of the time field, so we can get more accurate times than just milliseconds. We might also want to honor the st_ctime, st_dev, st_ino, st_uid and st_gid fields that C Git stores into the index record (see DirCacheEntry's commented-out static constants).

3) readdir: Provide a Linux-like readdir to replace File.listFiles(). Some C libraries provide a d_type field in the struct dirent that is returned by readdir. This field can have values like DT_DIR, DT_REG, and DT_LNK to hint at what the item is. If we have this data, we don't need to perform a stat on a path in order to know how to work with it.

4) mmap and lookup of entries in pack index files: The PackIndexV[12] classes are brutal in Java. If we had native forms of these that use mmap() to open the file and a C implementation of the binary search algorithm, we might be able to do more efficient object lookup. We don't use NIO's mapped ByteBuffer code because it's slower than what we have, and the GC isn't able to release the mmap region fast enough when we decide we don't want that file anymore. If we do this in native code ourselves, we can also provide explicit unmapping.

5) mmap and inflation of objects from pack files: PackFile is pretty brutal, needing to load slices of a pack into byte[] and then inflate those byte[] chunks through the Inflater class. If we do all of this in C and use explicit mmap calls, we can avoid the allocation of JVM heap memory and just use the operating system's buffer cache directly to read from the packs. We can also shovel the chunks of data into libz inflate() more efficiently, which means we can probably read small objects more quickly, resulting in faster processing of commits and trees.

The last one might be harder now that we are trying to support large objects, but it could still be a good performance improvement. A lot of our resource cost right now is tied up in WindowCache and the byte[] we have to allocate in order to copy the data in from the file. If we can just convert those over to mmap() slices that are accessible only from JNI, and expose in JNI just two small methods like:

  readObjectHeader() -- first half of the load() method in PackFile
  inflateRegion() -- the incremental decompression in WindowCursor

we might be able to do almost everything else at the Java level with much lower overhead.

I think the above list is already sorted in priority order. 1 (symlink) and 2 (lstat) are needed just for good compatibility with C Git and are fairly simple to implement. 3 (readdir with d_type) would give us a small performance boost, but isn't that important. 4 and 5 are likely to provide some real performance improvements, but are a lot more work.

[1] http://dev.eclipse.org/mhonarc/lists/jgit-dev/msg00722.html
[2] http://dev.eclipse.org/mhonarc/lists/jgit-dev/msg00734.html
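To make the lookup in item 4 concrete, here is a minimal sketch of a binary search over a sorted table of fixed-width object IDs. It is only an illustration: the class SortedIdTable and its methods find(), compare() and idOf() are hypothetical names, the flat table of 20-byte IDs is a simplification and does not reproduce the actual pack index layout, and it is written in Java rather than C so it runs without a native library. A native version would presumably run the same loop over an mmap()ed region.

```java
import java.nio.ByteBuffer;

/** Hypothetical sketch: binary search over a sorted table of fixed-width object IDs. */
public class SortedIdTable {
    /** Length in bytes of a SHA-1 object ID. */
    public static final int ID_LEN = 20;

    /**
     * Returns the position of id in the table, or -1 if it is absent.
     * The table holds count IDs of ID_LEN bytes each, sorted as unsigned bytes.
     */
    public static int find(ByteBuffer table, int count, byte[] id) {
        int low = 0;
        int high = count - 1;
        while (low <= high) {
            int mid = (low + high) >>> 1;
            int cmp = compare(table, mid * ID_LEN, id);
            if (cmp < 0) {
                low = mid + 1;   // entry at mid sorts before id
            } else if (cmp > 0) {
                high = mid - 1;  // entry at mid sorts after id
            } else {
                return mid;
            }
        }
        return -1;
    }

    /** Compares the entry at the given byte offset with id, treating bytes as unsigned. */
    public static int compare(ByteBuffer table, int offset, byte[] id) {
        for (int i = 0; i < ID_LEN; i++) {
            int a = table.get(offset + i) & 0xff; // absolute get, ignores buffer position
            int b = id[i] & 0xff;
            if (a != b) {
                return a - b;
            }
        }
        return 0;
    }

    /** Builds an illustrative ID whose first byte is firstByte and whose other bytes are zero. */
    public static byte[] idOf(int firstByte) {
        byte[] id = new byte[ID_LEN];
        id[0] = (byte) firstByte;
        return id;
    }

    public static void main(String[] args) {
        ByteBuffer table = ByteBuffer.allocate(3 * ID_LEN);
        table.put(idOf(0x10)).put(idOf(0x80)).put(idOf(0xf0));
        System.out.println(find(table, 3, idOf(0x80))); // prints 1
        System.out.println(find(table, 3, idOf(0x20))); // prints -1
    }
}
```

The unsigned comparison matters because Java bytes are signed: without the & 0xff mask, an ID starting with 0x80 would sort before one starting with 0x10 and the search would miss entries.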