Bug 285799 - HashtableOfObject rehashes and grows buffer on removeKey()
Summary: HashtableOfObject rehashes and grows buffer on removeKey()
Status: VERIFIED FIXED
Alias: None
Product: JDT
Classification: Eclipse Project
Component: Core
Version: 3.5
Hardware: PC Windows XP
Importance: P3 major
Target Milestone: 3.5.1
Assignee: Kent Johnson CLA
Blocks: 262791
Reported: 2009-08-05 19:50 EDT by Walter Harley CLA
Modified: 2012-10-23 07:46 EDT
CC: 7 users



Attachments
Test for HashtableOfObject (6.66 KB, patch)
2009-08-09 01:15 EDT, Walter Harley CLA
Proposed patch for 3.5.1 (2.50 KB, patch)
2009-08-19 16:47 EDT, Kent Johnson CLA
Proposed patch for 3.5.1 (4.36 KB, patch)
2009-08-20 10:45 EDT, Kent Johnson CLA

Description Walter Harley CLA 2009-08-05 19:50:36 EDT
See bug 262791 for the motivation and research behind this report.

HashtableOfObject.removeKey() calls rehash() on essentially every invocation.  But rehash() creates a new buffer of twice the size of the existing one, and copies all the existing entries over with a put().  The result is that calling removeKey() is very expensive in both performance and memory.
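
For readers unfamiliar with the class, here is a minimal sketch of the pattern being described; this is a simplified reconstruction, not the actual JDT source, and the hashIndex() and rehash() helpers are assumed:

  // Simplified reconstruction, not the actual JDT source: removeKey()
  // clears the matching slot and then unconditionally rebuilds the table.
  public Object removeKey(char[] key) {
      int index = hashIndex(key);          // assumed hash helper
      while (keyTable[index] != null) {
          if (CharOperation.equals(keyTable[index], key)) {
              Object value = valueTable[index];
              elementSize--;
              keyTable[index] = null;
              valueTable[index] = null;
              rehash(); // allocates a fresh buffer and re-puts every entry
              return value;
          }
          if (++index == keyTable.length) index = 0; // circular probe
      }
      return null;
  }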

This is an issue because when large projects are compiled with annotation processing turned on, APT calls CompilationUnitResolver.resolve() for all the files being compiled.  All the files get added to a HashtableOfObject, and then each is removed individually by calling removeKey().

With the existing implementation, APT performance (CPU time) scales at O(N^1.5).  Replacing the HashtableOfObject with a java.util.HashMap as an experiment made APT scale at O(N), improved the absolute performance, and consumed much less heap.

I'm giving this a severity of "Major" because the APT perf issue is currently a high priority for several major users; see bug 262791.
Comment 1 Srikanth Sankaran CLA 2009-08-06 01:04:30 EDT
Satyam, can you please investigate this?
Comment 2 Dani Megert CLA 2009-08-06 05:17:31 EDT
If the bug is confirmed, you need to review and possibly fix all other HashtableOf* implementations.
Comment 3 Walter Harley CLA 2009-08-06 17:23:55 EDT
I was mistaken about the memory impact: rehash() sizes the new buffer to be twice the number of actual elements, not twice the size of the existing buffer.  So while the rehash is unnecessary, it is not causing a memory explosion.

The performance impact is quite real, though.
Comment 4 Walter Harley CLA 2009-08-09 01:15:09 EDT
Created attachment 143843 [details]
Test for HashtableOfObject

I've attached a test case that exercises some of the HashtableOfObject code, to facilitate work on it.  This test case is not complete, but it does include some basic verifications as well as a performance comparison of key removal time against HashMap.  With 20k keys, on my machine it takes HashtableOfObject 60 seconds to remove them all, and it takes HashMap 16 milliseconds.
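
For a feel of what that comparison looks like, here is a rough harness in the spirit of the attached test; the key shapes and counts are illustrative, not the attachment's exact code:

  // Fill a HashMap with 20k entries, then time removing them all.
  java.util.HashMap map = new java.util.HashMap();
  for (int i = 0; i < 20000; i++)
      map.put("key" + i, Boolean.TRUE);
  long start = System.currentTimeMillis();
  for (int i = 0; i < 20000; i++)
      map.remove("key" + i);
  System.out.println((System.currentTimeMillis() - start) + " ms");
  // The same removal loop against HashtableOfObject (char[] keys,
  // removeKey()) is what takes ~60 seconds at 20k keys on my machine.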

Based on code review, there are some other problems with HashtableOfObject as well:

1. Numerous methods in HashtableOfObject contain a redundant length check.  In the following line:

  if (currentKey.length == keyLength && CharOperation.equals(currentKey, key))

the currentKey.length == keyLength comparison is needless, because CharOperation.equals() already checks the length.  I imagine this was someone's idea of an optimization, but it actually makes the call more expensive, and it should be removed (see the sketch after this list).

2. In HashtableOfObject.toString(), an O(N^2) string concatenation is performed. Instead, a StringBuffer should be used.  (I would say StringBuilder, but this code must compile under Java 1.4.)

3. HashtableOfObject.put returns the new value, not the old value.  This is not an error as such, because it is undocumented, but it is useless and inconsistent with the java.util collections behavior.  Thankfully the return value does not appear to ever be used within JDT Core code.
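
To illustrate points 1 and 2, a hedged sketch (CharOperation.equals() is the existing JDT helper; keyTable and valueTable are the table's internal arrays):

  // 1. Drop the redundant length check; CharOperation.equals() already
  //    compares lengths before comparing contents:
  if (CharOperation.equals(currentKey, key)) {
      // found the matching entry
  }

  // 2. Build toString() with a StringBuffer (not StringBuilder, since the
  //    code must compile under Java 1.4) instead of O(N^2) concatenation:
  public String toString() {
      StringBuffer buffer = new StringBuffer();
      for (int i = 0; i < valueTable.length; i++) {
          if (valueTable[i] != null) {
              buffer.append(keyTable[i]);  // append(char[]) appends the chars
              buffer.append(" -> ");
              buffer.append(valueTable[i]);
              buffer.append('\n');
          }
      }
      return buffer.toString();
  }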


Fixing the removeKey() perf problem will not be easy without changing the design (or just getting rid of it altogether and using HashMap instead).  The data structure of HashtableOfObject doesn't have an array of bucket lists; rather, it uses a circular array of entries and inserts new items at the first empty slot after the hash code.  This is arguably more memory-efficient than HashMap, but it means that an entry is often not located where its hash code would suggest, which necessitates rehashing and copying the entire table after each removal.
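
To make that layout concrete, here is a schematic, self-contained sketch of the insert and lookup scheme just described (a hypothetical class, not the JDT source; growth logic is omitted):

  // Entries live directly in one circular array; there are no bucket lists.
  public class ProbingTableSketch {
      char[][] keyTable = new char[8][];
      Object[] valueTable = new Object[8];

      int hashIndex(char[] key) {
          int h = 0;
          for (int i = 0; i < key.length; i++)
              h = 31 * h + key[i];
          return (h & 0x7FFFFFFF) % keyTable.length;
      }

      public void put(char[] key, Object value) {
          int index = hashIndex(key);
          // walk forward (wrapping) until the key or an empty slot is found
          while (keyTable[index] != null
                  && !java.util.Arrays.equals(keyTable[index], key)) {
              if (++index == keyTable.length) index = 0;
          }
          keyTable[index] = key;    // may land far from hashIndex(key)
          valueTable[index] = value;
      }

      public Object get(char[] key) {
          int index = hashIndex(key);
          while (keyTable[index] != null) {
              if (java.util.Arrays.equals(keyTable[index], key))
                  return valueTable[index];
              if (++index == keyTable.length) index = 0;
          }
          return null; // a null slot terminates the probe
      }
  }

Because get() stops at the first null slot, simply nulling out a removed entry would break lookups for any key whose probe chain passed through that slot; that is why the current implementation rehashes after a removal.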

Bottom line seems to be that this data structure may be inappropriate for cases where items are to be removed frequently.
Comment 5 Dani Megert CLA 2009-08-11 03:49:29 EDT
See also bug 258366; there seems to be a copy of the same code, with memory issues, in another bundle.

Adding Pascal, who created the class in core.runtime.
Comment 6 Olivier Thomann CLA 2009-08-17 09:47:05 EDT
Satyam,

We should investigate a solution that doesn't necessarily force us to use HashMap or Hashtable, as these two structures are memory-intensive. Could you please try to change the implementation of HashtableOfObject to make it more efficient for key removal?

The current implementation is not designed to be used with a lot of removal calls.

We should definitely not replace all usages of HashtableOfObject with a HashMap or a Hashtable.
Comment 7 Walter Harley CLA 2009-08-17 12:55:21 EDT
I agree that we absolutely should not replace all uses of HashtableOfXxx with HashMap, for memory reasons.  However, I think it is appropriate to replace the one map in CompilationUnitResolver.  Memory is not a significant issue here: there's only ever one of these maps at a time, and it is released at the end of compilation.

The HashtableOfXxx data structure puts the probe chain inline with the hash buckets; that is, the insertion algorithm is approximately "go to the index specified by the hash function, then search forward until you find an empty slot, then insert the new entry there."  Get() then looks like "go to the index specified by the hash function, then search forward until you find an entry whose key is equal to the specified key, or until you find a null."  Because entries are therefore intermingled, removing a key requires rehashing the entire table.  One way around this might be to insert a sentinel value into the key table, rather than removing the entries, when removing values; then the table could be rehashed only when the number of entries drops below, say, half the size of the key table.
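
A hedged sketch of that sentinel idea, reusing the layout sketched in comment 4; the REMOVED marker and the rehash threshold are assumptions, not a proposed patch:

  // A tombstone keeps probe chains intact; get() and put() must test for
  // REMOVED before comparing keys, so they skip over removed slots.
  static final char[] REMOVED = new char[0]; // sentinel, never a real key

  public Object removeKey(char[] key) {
      int index = hashIndex(key);
      while (keyTable[index] != null) {
          if (keyTable[index] != REMOVED
                  && java.util.Arrays.equals(keyTable[index], key)) {
              Object value = valueTable[index];
              keyTable[index] = REMOVED; // keep the probe chain unbroken
              valueTable[index] = null;
              if (--elementSize < keyTable.length / 2)
                  rehash(); // compact occasionally, not on every removal
              return value;
          }
          if (++index == keyTable.length) index = 0;
      }
      return null;
  }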

I didn't spend much time on this because it seemed that it would potentially destabilize a lot of code.  It might be the right approach for 3.6 but for 3.5 at least, I think it makes more sense to just acknowledge that HashtableOfXxx is not a good solution for maps where removeKey() is going to be called a lot, and replace the one known offender with a more appropriate data structure, namely HashMap.
Comment 8 Kent Johnson CLA 2009-08-17 13:27:04 EDT
I suggest we first speed up the current implementation.

We only need to rehash if there is an element at the index + 1.

In a sparsely populated table, there is no need to rehash if there was no collision at the index of the removed key.
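
A minimal sketch of that check, assuming the open-addressing layout discussed above (a fragment, not a patch):

  // After clearing the removed slot, rehash only when the next slot is
  // occupied, i.e. when another key's probe chain may have passed through
  // the slot we just emptied.
  keyTable[index] = null;
  valueTable[index] = null;
  elementSize--;
  int next = (index + 1 == keyTable.length) ? 0 : index + 1;
  if (keyTable[next] != null)
      rehash();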
Comment 9 Walter Harley CLA 2009-08-17 13:59:36 EDT
Worth trying the experiment, but I wonder whether this will really be effective.  The current implementation keeps the table between 50% and 100% full, so there will be an element at n+1 fairly often.

I am concerned about changing the implementation of a widely used data structure in a x.y.1 release.  Of course, it will be up to you guys to decide whether to approve the patch :-)
Comment 10 Kent Johnson CLA 2009-08-17 14:13:24 EDT
"The current implementation keeps the table between 50% and 100% full, so there
will be an element at n+1 fairly often."

Actually, the threshold that triggers growing the table is size * 1.75f.

So if you expect 100 elements, the array is created with 175 slots and will grow when the 100th element is added.

It all comes down to how random the hash keys are. When all the keys end up close to each other, the change won't improve the performance of remove at all.


BUT removing elements was never meant to be a common task with any of our helper hashtables... though that doesn't mean we cannot speed them up, as well as rethink what data structure would best fit CompilationUnitResolver.resolve().
Comment 11 Satyam Kandula CLA 2009-08-18 05:50:16 EDT
There are only three places in the JDT code where removal of hash elements is called: one in CompilationUnitResolver.resolve(), and the other two are for HashtableOfObjectToInt, in SourceElementParser.convertToMethodDeclaration and CompletionParser.convertToMethodDeclaration.

I tried removing the hash elements only sometimes (by using a sentinel on remove), and it definitely gave very good performance. Though HashMap is still better, I believe this performance should be acceptable.

However, we would have to add these sentinel checks to all the gets and puts. In addition, there are many places where we access the key table directly (13 to be precise); we would need to add the sentinel checks there too.

Considering the above, I feel we can just replace the table with HashMap in these three places.
Comment 12 Walter Harley CLA 2009-08-18 10:57:18 EDT
(In reply to comment #11)
> There are only three places in the JDT code where removal of hash elements is
> called: one in CompilationUnitResolver.resolve(), and the other two are for
> HashtableOfObjectToInt, in SourceElementParser.convertToMethodDeclaration
> and CompletionParser.convertToMethodDeclaration [...]
> 
> Considering the above, I feel we can just replace the table with HashMap in
> these three places.


I would recommend against changing the hash map implementation in SourceElementParser and CompletionParser unless we specifically know there is a performance problem there that outweighs any memory considerations.  I haven't seen those methods turn up on any of my profiles.
Comment 13 Kent Johnson CLA 2009-08-18 11:42:56 EDT
Satyam, please look at why removeKey() is even called in CompilationUnitResolver.resolve()

Is it not possible to change the value associated with the key to indicate that it's removed, instead of removing the key?
Comment 14 Olivier Thomann CLA 2009-08-18 12:57:49 EDT
I concur with Kent. We should investigate replacing the value with null instead of calling removeKey, and then checking the value. If null, this means the entry has already been processed and can be ignored.
Then we could keep the HashtableOfObject for the compilation unit resolver.

I haven't yet checked the other case where removeKey is used.
Comment 15 Olivier Thomann CLA 2009-08-19 15:28:36 EDT
Satyam, since we need this under control before the end of the week, I'll reassign to Kent.
Comment 16 Olivier Thomann CLA 2009-08-19 15:29:01 EDT
Targeting 3.5.1, as this is a major problem for APT.
Comment 17 Kent Johnson CLA 2009-08-19 16:47:17 EDT
Created attachment 145033 [details]
Proposed patch for 3.5.1

This patch does not remove the keys but instead replaces the values with null.
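
The actual change is in the attachment; as a hedged illustration of the approach, the caller-side pattern might look like this (the requestedSources and path names are hypothetical):

  // Instead of removing the key (which used to trigger a rehash), the
  // value is overwritten with null; a null value means "already processed".

  // before: requestedSources.removeKey(path);  // rehashes on every call
  requestedSources.put(path, null);             // cheap slot overwrite

  // wherever entries are consumed:
  Object source = requestedSources.get(path);
  if (source != null) {
      // process it, then mark it as done
      requestedSources.put(path, null);
  }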
Comment 18 Kent Johnson CLA 2009-08-20 10:45:49 EDT
Created attachment 145133 [details]
Proposed patch for 3.5.1

Walter - please give this a try.
Comment 19 Kent Johnson CLA 2009-08-21 13:35:46 EDT
Released to R_3_5_maintenance for 3.5.1

Released to HEAD for 3.6M2


This is a performance change, so to verify it, please examine the code changes.
Comment 20 Walter Harley CLA 2009-08-22 01:34:19 EDT
+1.  With that patch, performance on a workspace with 4000 files containing non-processed annotations (@SuppressWarnings) is comparable to, or even slightly faster than, my previous experiment replacing the map with a HashMap, and all APT tests still pass.
Comment 21 Satyam Kandula CLA 2009-08-28 03:10:31 EDT
Verified the code for 3.5.1 RC2 using M20090826-1100
Comment 22 Olivier Thomann CLA 2009-08-28 11:35:12 EDT
Verified.
Comment 23 Ayushman Jain CLA 2009-09-15 07:57:14 EDT
Verified for 3.6M2