See bug 262791 for the motivation and research behind this report. HashtableOfObject.removeKey() calls rehash() essentially every time it is called, but rehash() creates a new buffer of twice the size of the existing buffer and copies all the existing entries over with a put(). The result is that calling removeKey() is very expensive in both performance and memory. This is an issue because when large projects are compiled with annotation processing turned on, APT calls CompilationUnitResolver.resolve() for all the files being compiled. All the files get added to a HashtableOfObject, and then each is removed individually by calling removeKey(). With the existing implementation, APT performance (CPU time) scales as O(N^1.5). Replacing the HashtableOfObject with a java.util.HashMap as an experiment made APT scale as O(N), improved the absolute performance, and consumed much less heap. I'm giving this a severity of "Major" because the APT performance issue is currently a high priority for several major users; see bug 262791.
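As a back-of-the-envelope check of how costly this usage pattern is, here is a tiny standalone sketch (not JDT code; the 20k figure matches the test case attached in a later comment): if every removeKey() rebuilds the table by re-putting all surviving entries, removing N keys re-puts roughly (N-1) + (N-2) + ... + 0 = N*(N-1)/2 entries.

public class RemoveCostEstimate {
    public static void main(String[] args) {
        int n = 20000;
        long totalCopies = 0;
        for (int remaining = n - 1; remaining >= 0; remaining--) {
            totalCopies += remaining;                // entries re-put by this particular removal
        }
        System.out.println(totalCopies);             // 199990000 re-puts for 20000 removals
        System.out.println((long) n * (n - 1) / 2);  // the closed form gives the same number
    }
}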
Satyam, can you please investigate this?
If the bug is confirmed, you need to review and possibly fix all other HashtableOf* implementations.
I was mistaken about the memory impact: it sizes the new buffer to be twice the number of actual elements, not twice the size of the existing buffer. So while this is unnecessary, it is not causing an explosion. The performance impact is quite real, though.
Created attachment 143843 [details]
Test for HashtableOfObject

I've attached a test case that exercises some of the HashtableOfObject code, to facilitate work on it. This test case is not complete, but it does include some basic verifications as well as a performance comparison of key removal time against HashMap. With 20k keys, on my machine it takes HashtableOfObject 60 seconds to remove them all, and it takes HashMap 16 milliseconds.

Based on code review, there are some other problems with HashtableOfObject as well:

1. Numerous methods in HashtableOfObject contain a redundant length check. In the following line:

   if (currentKey.length == keyLength && CharOperation.equals(currentKey, key))

   the currentKey.length == keyLength comparison is needless, because CharOperation.equals() already checks the length. I imagine this was someone's idea of an optimization, but it is actually more expensive, and should be removed.

2. In HashtableOfObject.toString(), an O(N^2) string concatenation is performed. Instead, a StringBuffer should be used. (I would say StringBuilder, but this code must compile under Java 1.4.)

3. HashtableOfObject.put() returns the new value, not the old value. This is not an error as such, because it is undocumented, but it is useless and inconsistent with the java.util collections behavior. Thankfully the return value does not appear to ever be used within JDT Core code.

Fixing the removeKey() performance problem will not be easy without changing the design (or just getting rid of the class altogether and using HashMap instead). The data structure of HashtableOfObject doesn't have an array of bucket lists; rather, it uses a circular array of entries and inserts new items at the first empty slot after the hash code. This is arguably more memory-efficient than HashMap, but it means that an entry is often not located where its hash code would suggest, which necessitates rehashing and copying the entire table after each removal. The bottom line seems to be that this data structure may be inappropriate for cases where items are removed frequently.
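To make that layout concrete, here is a minimal toy sketch of such an open-addressed table. It is not the real HashtableOfObject code: the String key type, the names, and the sizing factors are simplifications. Keys live in one circular array, a new entry goes into the first free slot at or after its hash index, and the simplest correct removal drops the key and rebuilds the whole table.

public class OpenAddressingSketch {
    private String[] keys;
    private Object[] values;
    private int size;

    public OpenAddressingSketch(int expectedSize) {
        int capacity = Math.max(4, expectedSize * 2); // illustrative sizing, not the real 1.75f factor
        this.keys = new String[capacity];
        this.values = new Object[capacity];
    }

    public void put(String key, Object value) {
        int index = (key.hashCode() & 0x7FFFFFFF) % keys.length;
        while (keys[index] != null) {                 // linear probe for a free slot
            if (keys[index].equals(key)) {
                values[index] = value;                // key already present: overwrite
                return;
            }
            index = (index + 1) % keys.length;        // circular array: wrap around
        }
        keys[index] = key;
        values[index] = value;
        if (++size * 2 > keys.length) {
            rebuild(keys.length * 2);                 // grow once the table gets too full
        }
    }

    public Object get(String key) {
        int index = (key.hashCode() & 0x7FFFFFFF) % keys.length;
        while (keys[index] != null) {                 // probe until an empty slot ends the chain
            if (keys[index].equals(key)) {
                return values[index];
            }
            index = (index + 1) % keys.length;
        }
        return null;                                  // hit a null slot: key is absent
    }

    // Removal as described above: entries are often not stored at their hash index,
    // so after nulling a slot the whole table is rebuilt to keep every remaining
    // entry reachable. That rebuild is what makes each removal so expensive.
    public Object removeKey(String key) {
        int index = (key.hashCode() & 0x7FFFFFFF) % keys.length;
        while (keys[index] != null) {
            if (keys[index].equals(key)) {
                Object removed = values[index];
                keys[index] = null;
                values[index] = null;
                size--;
                rebuild(keys.length);                 // "rehash": re-put every surviving entry
                return removed;
            }
            index = (index + 1) % keys.length;
        }
        return null;
    }

    private void rebuild(int newCapacity) {
        String[] oldKeys = keys;
        Object[] oldValues = values;
        keys = new String[newCapacity];
        values = new Object[newCapacity];
        size = 0;
        for (int i = 0; i < oldKeys.length; i++) {
            if (oldKeys[i] != null) {
                put(oldKeys[i], oldValues[i]);        // every surviving entry is copied
            }
        }
    }
}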
See also bug 258366; there seems to be a copy of the same code in another bundle, with memory issues. Adding Pascal, who created the class in core.runtime.
Satyam, we should investigate a solution that doesn't necessarily force us to use HashMap or Hashtable, as these two structures are memory intensive. Could you please try to change the implementation of HashtableOfObject to make it more efficient for key removal? The current implementation is not designed to be used with a lot of removal calls. We should definitely not replace all usages of HashtableOfObject with a HashMap or a Hashtable.
I agree that we absolutely should not replace all uses of HashtableOfXxx with HashMap, for memory reasons. However, I think it is appropriate to replace the one map in CompilationUnitResolver. Memory is not a significant issue here: there's only ever one of these maps at a time, and it is released at the end of compilation.

The HashtableOfXxx data structure puts the collision chain inline with the hash buckets; that is, the insertion algorithm is approximately "go to the index specified by the hash function, then search forward until you find an empty slot, then insert the new entry there." Get() then looks like "go to the index specified by the hash function, then search forward until you find an entry whose key is equal to the specified key or until you find a null." Because entries are therefore intermingled, removing a key requires rehashing the entire table.

One way around this might be to insert a sentinel value into the key table, rather than removing the entries, when removing values; the table could then be rehashed once the number of entries drops below, say, half the size of the key table. I didn't spend much time on this because it seemed that it would potentially destabilize a lot of code. It might be the right approach for 3.6, but for 3.5 at least, I think it makes more sense to acknowledge that HashtableOfXxx is not a good solution for maps where removeKey() is going to be called a lot, and to replace the one known offender with a more appropriate data structure, namely HashMap.
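As a rough illustration of that sentinel (tombstone) idea, here is how removal and lookup might look on the toy OpenAddressingSketch above. Again, this is a sketch and not a proposed patch: the real HashtableOfXxx classes work on char[] keys, and the rebuild trigger here is just one plausible choice.

private static final String REMOVED = new String("<removed>"); // sentinel, compared by identity
private int removedCount;

public Object removeKeyWithSentinel(String key) {
    int index = (key.hashCode() & 0x7FFFFFFF) % keys.length;
    while (keys[index] != null) {
        if (keys[index] != REMOVED && keys[index].equals(key)) {
            Object removed = values[index];
            keys[index] = REMOVED;                 // leave a tombstone; probe chains stay intact
            values[index] = null;
            size--;
            if (++removedCount * 4 > keys.length) {
                removedCount = 0;
                rebuild(keys.length);              // rehash only once a quarter of the slots are dead,
            }                                      // so the table always keeps some empty slots
            return removed;
        }
        index = (index + 1) % keys.length;
    }
    return null;
}

public Object getWithSentinel(String key) {
    int index = (key.hashCode() & 0x7FFFFFFF) % keys.length;
    while (keys[index] != null) {                  // a tombstone does not end the probe chain
        if (keys[index] != REMOVED && keys[index].equals(key)) {
            return values[index];
        }
        index = (index + 1) % keys.length;
    }
    return null;
}

// Note: put() and rebuild() would need matching checks; in particular rebuild() should
// skip tombstones when re-inserting:
//     if (oldKeys[i] != null && oldKeys[i] != REMOVED) put(oldKeys[i], oldValues[i]);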
I suggest we first speed up the current implementation. We only need to rehash if there is an element at index + 1. In a sparsely populated table, there is no need to rehash if there was no collision at the index of the removed key.
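In terms of the toy removeKey() sketched earlier, the suggested guard would look roughly like this (a sketch of the idea, not a patch):

// Inside removeKey(), once the matching slot at 'index' has been found:
Object removed = values[index];
keys[index] = null;
values[index] = null;
size--;
if (keys[(index + 1) % keys.length] != null) {
    rebuild(keys.length);   // a probe chain may run through the freed slot, so rehash
}
// If the next slot is already empty, no probe chain passed through this one,
// so nothing can become unreachable and the rehash can be skipped.
return removed;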
Worth trying the experiment, but I wonder whether this will really be effective. The current implementation keeps the table between 50% and 100% full, so there will be an element at n+1 fairly often. I am concerned about changing the implementation of a widely used data structure in an x.y.1 release. Of course, it will be up to you guys to decide whether to approve the patch :-)
"The current implementation keeps the table between 50% and 100% full, so there will be an element at n+1 fairly often." Actually the threshold which triggers growing the table is size * 1.75f So if you expect 100 elements, the array is created with 175 and will grow when the 100 element is added. It all comes down how random the hash key is. When all the keys end up close to each other, then the change won't improve the performance of remove at all. BUT removing elements was never meant to be a common task with any of our helper hashtables... but that doesn't mean we cannot speed them up, as well as rethink what data structure would best fit CompilationUnitResolver.resolve()
There are only three places in the JDT code where removal of hash elements is called: one in CompilationUnitResolver.resolve(), and the other two are for HashtableOfObjectToInt in SourceElementParser.convertToMethodDeclaration and CompletionParser.convertToMethodDeclaration. I tried removing the hash elements only sometimes (by using a sentinel on the remove), and it definitely gave very good performance. Though HashMap is still better, I believe this performance should be acceptable. However, we would have to add these sentinel checks to all the gets and puts. In addition, there are many places where we access the keyTable directly (13 to be precise) -- we would need to add the sentinel checks there too. Considering the above, I feel we can just replace these three places with HashMap.
(In reply to comment #11)
> There are only three places in the JDT code where removal of hash elements is
> called: one in CompilationUnitResolver.resolve(), and the other two are for
> HashtableOfObjectToInt in SourceElementParser.convertToMethodDeclaration and
> CompletionParser.convertToMethodDeclaration [...]
>
> Considering the above, I feel we can just replace these three places with
> HashMap.

I would recommend against changing the hash map implementation in SourceElementParser and CompletionParser unless we specifically know there is a performance problem there that outweighs any memory considerations. I haven't seen those methods turn up on any of my profiles.
Satyam, please look at why removeKey() is even called in CompilationUnitResolver.resolve(). Is it not possible to change the value associated with the key to indicate that it's removed, instead of removing the key?
I concur with Kent. We should investigate replacing the value with null instead of calling removeKey, and then check the value: if it is null, this means it has already been processed and can be ignored. Then we could keep the HashtableOfObject for the compilation unit resolver. I haven't yet checked the other case where removeKey is used.
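A rough sketch of that approach, written against the toy OpenAddressingSketch from earlier rather than the real CompilationUnitResolver (the names and flow are placeholders, not the actual patch): the key stays in the table, and a null value means "already processed", so removeKey() and its rehash are never needed.

String[] units = { "A.java", "B.java", "C.java" };
OpenAddressingSketch requestedSources = new OpenAddressingSketch(units.length);
for (int i = 0; i < units.length; i++) {
    requestedSources.put(units[i], "source of " + units[i]);   // all units added up front
}
for (int i = 0; i < units.length; i++) {
    Object source = requestedSources.get(units[i]);
    if (source != null) {                        // a null value means "already processed"
        // ... resolve 'source' here ...
        requestedSources.put(units[i], null);    // mark as done; the key stays, no rehash
    }
}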
Satyam, since we need this under control before the end of the week, I'll reassign to Kent.
Targeting 3.5.1 as this is a major problem for APT.
Created attachment 145033 [details]
Proposed patch for 3.5.1

This patch does not remove the keys but instead replaces the values with null.
Created attachment 145133 [details]
Proposed patch for 3.5.1

Walter - please give this a try.
Released to R_3_5_maintenance for 3.5.1.
Released to HEAD for 3.6M2.

This is a performance change, so to verify please examine the code changes.
+1. With that patch, performance on a workspace with 4000 files containing nonprocessed annotations (@SuppressWarnings) is comparable to or even slightly faster than my previous experiment replacing the map with a HashMap; and all APT tests still pass.
Verified the code for 3.5.1 RC2 using M20090826-1100
Verified.
Verified for 3.6M2