Build ID: I20080502-0100

Steps To Reproduce:
1. Start Eclipse
2. Do a heap dump
3. Use the Eclipse Memory Analyzer to analyse the heap dump

For details see my blog at http://kohlerm.blogspot.com/2008/05/analyzing-memory-consumption-of-eclipse.html. The spell checking engine component accounts for more than 20% of the overall memory consumption. More information:
> 1. Start Eclipse
> 2. Do a heap dump
> 3. Use the Eclipse Memory Analyzer to analyse the heap dump
Are you 100% sure about these steps? The dictionary is only loaded when the first editor is opened, hence I doubt that those steps are correct. Can you attach some numbers here (the link might become unavailable in the future)?
Created attachment 101695 [details]
Dictionary memory consumption

Dominator tree screenshot of the spell checker component. 24.73% of the overall memory consumption is in SpellCheckEngine.
(In reply to comment #1)
> > 1. Start Eclipse
> > 2. Do a heap dump
> > 3. Use the Eclipse Memory Analyzer to analyse the heap dump
> Are you 100% sure about these steps? The dictionary is only loaded when the
> first editor is opened, hence I doubt that those steps are correct.
Sorry for being imprecise. One file was open in an editor (http://bp1.blogger.com/_MFv_s-Hc-mQ/SDHq9Etyx6I/AAAAAAAAATQ/tCc-sGHqZSY/s1600-h/eclipseSetup.png).

> Can you attach some numbers here (the link might be unavailable in the future).
Done.

One comment from http://javahispano.org/contenidos/es/analisis_del_consumo_de_memoria_de_eclipse/ made sense to me. Someone there points out that the dictionary is English only and that not everybody speaks English. Maybe the dictionary should not be turned on by default, at least if it can be detected that Eclipse is running on a non-English OS.

Regards,
Markus
What do the numbers in the columns mean? E.g. is the middle one showing bits or bytes or...? Note that when using language packs you might get dictionaries for your locale.
(In reply to comment #4)
> What do the numbers in the columns mean? E.g. is the middle one showing bits
> or bytes or...?
The numbers are all bytes. "Shallow heap" means the size of the object itself, including the object header (8 bytes on a 32-bit Sun VM) and the fields. "Retained size" is the amount of memory, in bytes, that would be freed by the garbage collector if the object were removed from memory.

> Note that when using language packs you might get dictionaries for your locale.
Ok. I don't use any language packs.
My findings:
- The hash map used for the dictionary consumes about 7 MB of memory, which is about 23% of heap memory when you just start Eclipse and open an editor. But there is no memory leak or waste as such. (Words present in the dictionary and the phonetic hash strings consume about 5.3 MB.)
- However, we could save about 1.7 MB of memory by using char[] (in place of String) to store words in the hash map.
- Currently the spell check engine uses a phonetic hash algorithm (an adapted version of the double metaphone algorithm); we could investigate other algorithms/data structures for spell checking, but this would involve considerable effort.
(In reply to comment #6)
> My findings
> - The hashmap used for dictionary consumes about 7MB of memory, which is
> about 23% of heap memory when you just start eclipse and open an editor. But
> there is no memory leak or wastage as such. [...]
Yes indeed, there's no leak. But wouldn't using a trie [http://en.wikipedia.org/wiki/Trie] or a similar data structure help to reduce the memory usage?

A very simple way to reduce memory usage would be to turn off the spell checker by default; I know many people who don't really like it.
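For reference, a minimal sketch of the trie idea suggested above (not the Eclipse implementation; class and method names are illustrative). Note that whether a trie actually saves memory depends on per-node overhead: a HashMap per node can easily cost more than the shared-prefix savings, which matches the later finding in this bug that the trie variant came out slightly larger.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal trie: each node maps one character to a child node; a flag
// marks nodes that terminate a dictionary word.
class Trie {
	private final Map<Character, Trie> children = new HashMap<Character, Trie>();
	private boolean isWord;

	void add(String word) {
		Trie node = this;
		for (int i = 0; i < word.length(); i++) {
			Character c = Character.valueOf(word.charAt(i));
			Trie child = node.children.get(c);
			if (child == null) {
				child = new Trie();
				node.children.put(c, child);
			}
			node = child;
		}
		node.isWord = true;
	}

	boolean contains(String word) {
		Trie node = this;
		for (int i = 0; i < word.length(); i++) {
			node = node.children.get(Character.valueOf(word.charAt(i)));
			if (node == null)
				return false;
		}
		return node.isWord;
	}

	public static void main(String[] args) {
		Trie dictionary = new Trie();
		dictionary.add("eclipse");
		dictionary.add("eclipses");
		System.out.println(dictionary.contains("eclipse")); // true
		System.out.println(dictionary.contains("eclip"));   // false
	}
}
```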
Created attachment 152839 [details]
spell checking performance test

I have created a performance test for spell checking to benchmark the performance before we change the implementation of the spell check engine.
Created attachment 153068 [details]
performance test

Updated the test to spell check the entire Java file as a text file.
Created attachment 153343 [details]
performance test

Removed the big setup involving editors. Also, the problem collector now counts the number of spelling problems in the document. (The patch just prints this value; I couldn't think of anything better to do with it.)
Feedback on the perf test:
- org.eclipse.jdt.ui.PreferenceConstants.SPELLING_PROBLEMS_THRESHOLD must be set to max int in setUp and setToDefault in tearDown
- 'fDoc': don't prefix local variables with 'f': this is our field prefix
- I don't understand why the document is created twice in measure instead of being prepared in setUp
- int problemCount should be private (with very few exceptions a field should always be private)
- member class 'SpellingProblemCollector' should be static
- copyright date is wrong (should be "2009" and not "2000, 2009")
Ah, and I forgot:
- don't print stuff to the console during the real measurement
- increase the measured runs to 50
Created attachment 153553 [details]
performance test + fix

- Updated the performance test as per the suggestions
- Included the fix as well

The fix is to use UTF-8 encoded bytes to store the dictionary words as well as the phonetic hash keys. I had also implemented a trie for the phonetic hash keys, but the overall memory consumption for the trie came out slightly greater than simply using UTF-8 encoded bytes.

Original memory consumption: 6.9 MB
Memory consumption with this fix: 4.3 MB

The CPU time as per the performance test was around 330 ms in both cases.

Also, fHashBuckets is given a default size of 32769; do we really need that many buckets (as a default value) for JavaDocTagDictionary, TaskTagDictionary and HTMLTagDictionary?
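A quick illustration of why UTF-8 encoded bytes roughly halve the payload for an ASCII-dominated English dictionary (this is just the character data, not the object-header overhead; the word choice is arbitrary): a Java String stores UTF-16 code units at 2 bytes each, while ASCII characters need only 1 byte in UTF-8.

```java
import java.io.UnsupportedEncodingException;

public class Utf8StorageDemo {
	public static void main(String[] args) throws UnsupportedEncodingException {
		String word = "consumption";
		byte[] utf8 = word.getBytes("UTF-8");
		// Each char in a Java String occupies 2 bytes (a UTF-16 code unit);
		// ASCII characters need only 1 byte in UTF-8.
		System.out.println("UTF-16 payload: " + word.length() * 2 + " bytes"); // 22
		System.out.println("UTF-8 payload:  " + utf8.length + " bytes");       // 11
	}
}
```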
Created attachment 153554 [details] Memory consumption values with the fix
Your patch does not work for me as it throws tons of:

!ENTRY org.eclipse.jdt.ui 4 10001 2009-12-14 16:30:12.070
!MESSAGE Internal Error
!STACK 0
java.io.UnsupportedEncodingException: UTF_8
	at java.lang.StringCoding.encode(StringCoding.java:340)
	at java.lang.String.getBytes(String.java:954)
	at org.eclipse.jdt.internal.ui.text.spelling.engine.AbstractSpellDictionary$UTF8Bytes.convertToBytes(AbstractSpellDictionary.java:70)

This is expected, since "UTF_8" is not the correct name for that encoding and also not a supported alias on most VMs. Can you explain why this works for you? What VM are you using? Is "UTF_8" available as an alias on your machine?

Comments on the patch itself:
- I don't like the AbstractSpellDictionary.UTF8Bytes.convertToBytes(String) helper ==> it's better to do the error handling at the places where the conversion happens; e.g. currently you go through the loop in isCorrect(...) instead of returning false immediately after failure
- the UTF8Bytes class needs Javadoc (and an @since tag)
- UTF8Bytes.hashCode should be inlined
- UTF8Bytes is a bad name, as the instances aren't tied to UTF-8. Also, the field name 'key' is badly chosen: looking at the class itself, it's not understandable what the field has to do with a key
- isCorrect(...) computes (byte[]) candidateList.get(index) twice
- on what findings is the new value for the load factor based? Data?
- hashWord(...) has several occurrences of new UTF8Bytes(hashBytes)
- given that our hash is now a byte[], it would probably simplify the code if IPhoneticHashProvider were changed accordingly

Trivial:
- copyright date is not updated in the PerformanceTestSuite
- AbstractSpellDictionary.LOAD_FACTOR is missing an @since tag
(In reply to comment #15)
> This is expected, since "UTF_8" is not the correct name for that encoding
> and also not a supported alias on most VMs. Can you explain why this works
> for you? What VM are you using? Is "UTF_8" available as an alias on your
> machine?
This is a bit strange. I have tried this against IBM JRE 1.6, 1.5 and 1.4.2. http://java.sun.com/j2se/1.3/docs/guide/intl/encoding.doc.html lists the correct name as UTF8 or UTF-8, so UTF_8 should not work. IBM JRE 1.4.2 contains sun.io.CharacterEncoding in core.jar; the alias table in it also contains UTF8 and UTF-8. However, this class also contains a private method 'replaceDash(String)' which seems to replace '_' with '-' (since I am looking at the class file, I can't be too sure). All three of 'UTF8', 'UTF-8' and 'UTF_8' work on my machine, but since 'UTF8' is the standard name I will change UTF_8 to UTF8.
It looks like the IBM JRE adds additional aliases, which of course is very dangerous: as soon as someone starts to use them, one is locked into that VM, since others don't support them.

> I will change UTF_8 to UTF8.
Correct.

> but since 'UTF8' is the standard name I will change UTF_8 to UTF8.
It's a bit more complicated ;-). The name one normally uses (not just in the Java world) is 'UTF-8', which corresponds to the immutable name under which the encoding is registered with the IANA registry. That same IANA name is the canonical name used with the (newer) nio APIs (see Charset for details), whereas for the java.io and java.lang APIs the canonical name is 'UTF8'.
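The naming relationship described above can be checked directly on a standard JDK (a sketch; behavior for "UTF_8" varies by VM, as this bug shows, so it is only probed defensively here):

```java
import java.nio.charset.Charset;
import java.nio.charset.UnsupportedCharsetException;

public class CharsetNames {
	public static void main(String[] args) {
		// The canonical nio name is the registered IANA name:
		Charset nio = Charset.forName("UTF-8");
		System.out.println(nio.name()); // UTF-8
		// "UTF8" (the historical java.io/java.lang canonical name) is a
		// registered alias, so both lookups resolve to the same charset:
		System.out.println(Charset.forName("UTF8").equals(nio)); // true
		// "UTF_8" is not a guaranteed alias; on most VMs the lookup fails:
		try {
			Charset.forName("UTF_8");
			System.out.println("UTF_8 is an alias on this VM");
		} catch (UnsupportedCharsetException e) {
			System.out.println("UTF_8 is not supported on this VM");
		}
	}
}
```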
Created attachment 155204 [details]
fix

> - I don't like the AbstractSpellDictionary.UTF8Bytes.convertToBytes(String)
> helper [...]
> - UTF8Bytes is a bad name, as the instances aren't tied to UTF-8. Also, the
> field name 'key' is badly chosen [...]
Agree; renamed the class to ByteArrayWrapper and removed the convertToBytes helper.

> - UTF8Bytes.hashCode should be inlined
There are 2 hashCode methods; which one should be inlined? By the way, I had used 'Source -> Generate hashCode() and equals()', so if there is a problem with this code shouldn't we create a bug for it?

> - on what findings is the new value for the load factor based? Data?
There are around 49000 words in the default English dictionary. With the default load factor the number of buckets in the hash map becomes 64k; with the new value the number of buckets remains 32k.
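For context, a hedged sketch of what such a byte-array key wrapper might look like (illustrative only; not the code from the attached patch): a raw byte[] inherits identity-based equals/hashCode from Object, so it cannot serve directly as a hash map key, and the wrapper delegates to the value-based Arrays methods instead.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// byte[] uses identity equals/hashCode, so it cannot be a HashMap key
// directly; this wrapper delegates to Arrays.equals/Arrays.hashCode.
final class ByteArrayWrapper {
	private final byte[] bytes;

	ByteArrayWrapper(byte[] bytes) {
		this.bytes = bytes;
	}

	public boolean equals(Object obj) {
		if (!(obj instanceof ByteArrayWrapper))
			return false;
		return Arrays.equals(bytes, ((ByteArrayWrapper) obj).bytes);
	}

	public int hashCode() {
		return Arrays.hashCode(bytes);
	}

	public static void main(String[] args) throws Exception {
		Map<ByteArrayWrapper, String> buckets = new HashMap<ByteArrayWrapper, String>();
		// "KLKS" stands in for a phonetic hash key here:
		buckets.put(new ByteArrayWrapper("KLKS".getBytes("UTF-8")), "eclipse");
		// A distinct byte[] with the same content finds the same entry:
		System.out.println(buckets.get(new ByteArrayWrapper("KLKS".getBytes("UTF-8")))); // eclipse
	}
}
```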
Created attachment 155205 [details] performance test
Final review for the test:
- didn't see this before: problemCount is a wrong name for a field, since fields must be prefixed with an 'f' ==> fProblemCount
- I don't like the static field 'printed'
- please check the code I committed to HEAD
- I renamed 'fDoc' to 'fDocument'
- switched to using a normal performance meter so that it doesn't show up in the summary graph

Committed to HEAD and the perf_35x branch, and released into the perf_35x map file.
Feedback on the patch as discussed with Deepak over Sametime:
- the canonical name for the java.io and java.lang APIs is "UTF8"
- the places where we create strings from the byte[] are wrong, because we use the default encoding instead of UTF-8
- we should add a static 'ByteArrayWrapper getHash(String)' helper and use that value as the hash. This avoids code duplication of "new ByteArrayWrapper(hash.getBytes(UTF8))"

> > - UTF8Bytes.hashCode should be inlined
> There are 2 hashCode methods; which one should be inlined? By the way, I had
> used 'Source -> Generate hashCode() and equals()', so if there is a problem
> with this code shouldn't we create a bug for it?
You're right. Personally I wouldn't use the same method name for a static method, but let's leave that for now.
> Original memory consumption: 6.9 MB
> Memory consumption with this fix: 4.3 MB
I get slightly different numbers, but still a win of around 2 MB (37% less memory):
Retained size in HEAD is 5225 KB, reachable size is 16 MB
Retained size with the patch is 3266 KB, reachable size is 14 MB
Cool!
Created attachment 156085 [details]
fix

(In reply to comment #21)
> - the canonical name for the java.io and java.lang APIs is "UTF8"
Done.

> - the places where we create strings from the byte[] are wrong, because we
> use the default encoding instead of UTF-8
Done.

> - we should add a static 'ByteArrayWrapper getHash(String)' helper and use
> that value as the hash. This avoids code duplication of
> "new ByteArrayWrapper(hash.getBytes(UTF8))"
Not done, for these reasons:
- "new ByteArrayWrapper(hash.getBytes(UTF8))" occurs in only 2 places
- A new getHash method which in turn calls DefaultPhoneticHashProvider.getHash() would look like this:

private ByteArrayWrapper getHash(String word) throws UnsupportedEncodingException {
	return new ByteArrayWrapper(fHashProvider.getHash(word).getBytes(UTF8));
}

Now getCandidates(final String hash) doesn't include the calls to both getHash() and new ByteArrayWrapper(), hence the new getHash() would have to be called from all the places where fHashProvider.getHash(word) is called in the getCandidates() call hierarchy, with the error handling done there. This leads to more changes compared to HEAD and defeats the purpose of creating the new method.

I have also added a getInitialSize() method to provide different default sizes for different dictionary types (LocaleSensitiveSpellDictionary, JavaDocTagDictionary, TaskTagDictionary, HTMLTagDictionary etc.). This leads to a further reduction in memory consumption, as JavaDocTagDictionary, TaskTagDictionary and HTMLTagDictionary were each consuming 131 KB of memory while being mostly empty.
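The getInitialSize() idea can be sketched as follows (class names and sizes are illustrative, not the actual patch): each dictionary sizes its bucket table via an overridable hook, so a tiny fixed dictionary such as the task tag dictionary no longer pays for a 32k-bucket table.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: the base class asks an overridable hook for the initial bucket
// count, so small dictionaries can use a small table. (Note: the hook is
// called during construction, so overrides must not touch subclass state.)
abstract class SketchDictionary {
	protected final Map<String, String> fHashBuckets =
			new HashMap<String, String>(getInitialSize(), 0.75f);

	// Default suited to the ~49000-word English dictionary.
	protected int getInitialSize() {
		return 32 * 1024;
	}
}

class SketchTaskTagDictionary extends SketchDictionary {
	// A handful of task tags (TODO, FIXME, XXX) needs far fewer buckets.
	protected int getInitialSize() {
		return 16;
	}

	public static void main(String[] args) {
		System.out.println(new SketchTaskTagDictionary().getInitialSize()); // 16
		System.out.println(new SketchDictionary() {}.getInitialSize());     // 32768
	}
}
```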
Agree. Thanks, Deepak! Next time please also add the @since tag to the overridden method's Javadoc comment. Patch committed to HEAD. Available in builds > N20100114-2000.
Re "UTF8" vs. "UTF-8": we should just use the version that is mentioned in the Javadoc. From Java 1.4 on, String#getBytes(String) links to java.nio.charset.Charset, which uses "UTF-8" as the canonical name. http://java.sun.com/j2se/1.4.2/docs/guide/intl/encoding.doc.html is a bit misleading, since the column title "Canonical Name for java.io and java.lang API" looks like it applies to all APIs in java.lang, but that's not true, since many APIs have been retrofitted to use nio's Charset. I've changed it in HEAD to "UTF-8" (like everywhere else in Text and JDT/UI).
Verified for 3.6 M5 with I20100126-0800.