| Summary: | API Visibility in StatisticBasedSimilarityChecker |
|---|---|
| Product: | [Modeling] EMFCompare |
| Component: | Core |
| Reporter: | Victor Roldan Betancort <vroldanbet> |
| Assignee: | EMF Compare <emf.compare-inbox> |
| Status: | CLOSED FIXED |
| Severity: | enhancement |
| Priority: | P3 |
| CC: | cedric.brun, laurent.goubet, vroldanbet |
| Version: | 1.3 |
| Target Milestone: | --- |
| Hardware: | PC |
| OS: | Windows 7 |
Hi guys, I could copy/paste from the newsgroup, but I'll try to summarize better instead.

Recent tests showed me it's common to specialize StatisticBasedSimilarityChecker, as the similarity indices may fail depending on the shape of, and information in, the model. In our case, EMF Compare returned many false positives, mainly due to wrong matches. Specializing this class always led me to the same issue: the inability to access the similarity cache. That's one thing. The second issue is that being able to manage that cache seems critical to me, as it is the main performance point of the matching process.

IMHO, the current cache suffers from several design problems:

a) Different kinds of similarity information are put into the same cache (TYPE_SIMILARITY, NAME_SIMILARITY, RELATION_SIMILARITY, VALUE_SIMILARITY). I suspect (I have no evidence) that lookups and inserts become more expensive as this cache grows. Having a separate cache for each kind would probably alleviate this.

b) The cache grows indefinitely, so it doesn't scale. That led some of our tests into GC bottleneck situations, where the JVM spent more time GC'ing than doing actual EMF Compare work. In such scenarios, the comparison simply never finished. Increasing the heap isn't an option for us (our app is 32-bit), but even when I did increase it, it didn't help (tests with a 2GB heap). Some experiments showed me that using a LinkedHashMap: 1) dramatically reduces memory needs, and 2) doesn't affect performance (I suspect that, once elements are matched, the similarity information is not used anymore). I didn't try separating the different kinds into different caches, but I'll give it a try!

So my proposal is to change the visibility of the cache or, even better, do some refactoring so that the implementation of this cache can be swapped. Of course, I'll be happy to contribute patches. Thanks for your time :)

---

Hi Victor,

The whole statistic-based code suffers from a lot of issues: it was clearly not designed as an API, and it scales poorly. I did start to implement a new approach for EMF Compare, based on the idea of "fingerprints". A fingerprint is a set of weighted values and references computed/extracted from a model element. The goals behind this approach are the following:

- customizing the match process is easy: just create an EMF switch class to efficiently provide fingerprints from EObjects;
- we are able to compute a *distance* between two fingerprints, allowing us to sort them and retrieve similar elements efficiently;
- fingerprints might become EObjects at some point (not right now), allowing us, *if* we see that memory consumption is a problem, to store them on CDO.

So far it's very much a draft implementation; I can push it to a specific branch if you're interested in having a look at it. That was a digression, but it may be worth it for you to work on this approach, which will scale way better in the end and which is more likely to be supported long term. That said, we're open to patches allowing customization, while clearly stating that this is kind of internal and the API might break fairly soon.

---

Hi Cedric, thanks for your comments.

A generalization of the StatisticBasedSimilarityChecker is definitely a great idea. Extenders could improve match accuracy by providing their own fingerprints, which is something I'm actually achieving in the end by extending StatisticBasedSimilarityChecker, though of course not as cleanly. However convenient that would be, I'm at a point in the current project's development where a) I cannot afford the time investment of a deep redesign/refactoring, and b) I cannot risk the current degree of reliability of my comparison component, which is so far not doing badly (including some minor EMF Compare code changes). I guess this could become feasible once the project is finished and we get into the maintenance stage. I'm of course interested in this initiative, and I'd be happy to take a look at the new fingerprint-based similarity analysis. I'm sure that, as an extender of the StatisticBasedSimilarityChecker, I could be helpful.

All that being said, my priority now is to "get rid of my EMF Compare modifications" in my workspace. I can live with them (it has some legal implications of course, but no big deal), but if those changes could be contributed and are mostly innocuous to users, I think it would be a good contribution. So far I only have some changes related to the metrics cache and the Similarity Kind concept. I'd like to provide the patch; give me some time to fight with Git :)

---

I just realized there is a new package import in the latest org.eclipse.emf.compare.match, from Google Guava. Just to be sure... does adding a new import break API? And where can I find that bundle? The downloads page does not say anything about this library, and the integration build does not include it.
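To make the "fingerprint" approach Cedric describes above a bit more concrete, here is a minimal sketch of a weighted fingerprint with a distance function. All names here are hypothetical illustrations, not actual EMF Compare API; a real implementation would extract the features from an EObject through an EMF switch class:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a fingerprint: weighted features extracted from a
// model element, plus a distance usable to rank candidate matches.
public class Fingerprint {
    /** Weighted values/references of a model element, e.g. "name" -> 0.6. */
    private final Map<String, Double> weightedFeatures = new HashMap<String, Double>();

    public void put(String feature, double weight) {
        weightedFeatures.put(feature, weight);
    }

    /** Distance as the sum of absolute weight differences over all features. */
    public double distanceTo(Fingerprint other) {
        // Union of the feature names of both fingerprints.
        Map<String, Double> all = new HashMap<String, Double>(weightedFeatures);
        all.putAll(other.weightedFeatures);
        double distance = 0d;
        for (String feature : all.keySet()) {
            Double a = weightedFeatures.get(feature);
            Double b = other.weightedFeatures.get(feature);
            distance += Math.abs((a != null ? a : 0d) - (b != null ? b : 0d));
        }
        return distance;
    }
}
```

A smaller distance would mean two elements are more similar, which is what allows sorting fingerprints and retrieving likely matches efficiently.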
---

Created attachment 205641 [details]
proposed patch v1
---

The patch looks good to me. However, instead of doing two iterations, I'd like to take the idea all the way and separate the four distinct caches once and for all. I'll start with your patch, as we need to maintain API, and propose a second one with the changes I think will address your needs.

---

Sounds good to me, Laurent! Let me know if you need any feedback.
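As a rough illustration of what "separating the four distinct caches" could look like (a hypothetical sketch based on the details discussed below, not the actual patch; the String keys stand in for cached element pairs):

```java
import java.util.HashMap;
import java.util.Map;

// One dedicated, protected cache per similarity kind, so that subclasses can
// access them directly and no "kind" switch is needed on every lookup.
public class SimilarityCaches {
    protected Map<String, Double> typeSimilarityCache;
    protected Map<String, Double> nameSimilarityCache;
    protected Map<String, Double> relationSimilarityCache;
    protected Map<String, Double> valueSimilarityCache;

    protected void initMetricsCaches() {
        typeSimilarityCache = new HashMap<String, Double>();
        nameSimilarityCache = new HashMap<String, Double>();
        relationSimilarityCache = new HashMap<String, Double>();
        valueSimilarityCache = new HashMap<String, Double>();
    }
}
```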
---

Created attachment 205917 [details]
Proposed patch v2
---

Victor,

Does the attached patch look good to you in terms of API, or would you still be missing bits with this?
---

Laurent, it looks good, except for the missing setMetricsCache() method. I use it to customize the implementation of the cache; in this case, I use a LinkedHashMap to avoid indefinite growth of those caches and therefore avoid OOMEs. I thought it wasn't a problem to add that method, since the class is declared internal (and yet, I'm extending it :P). Could you figure out some mechanism to allow customizing the cache implementations?

Oh, also, maybe those IF (type) ELSE IF (type) blocks could be replaced by a SWITCH statement? It may look clearer. Not sure if one approach is more optimal than the other.

---

I just realized nameSimilarityCache is protected. Was that intended to substitute my setMetricsCache() method? Any reason why the other caches aren't protected?

---

Victor,

The four caches are protected, all initialized to an empty map in initMetricsCaches... or so it should be :p. That was indeed meant to replace the "setter" method (as we would need four of them, or yet another switch). As for the "if-else if"... yes, that should be a switch. I overlooked that one.

With the four caches protected, do you still need the "set*" method? As a side note... do you really see a major improvement in performance with a LinkedHashMap? If so, I see no reason not to switch the implementation in the "main" code instead of forcing clients to override it.

---

Laurent,

I see. The 4 protected caches should be fine; in that case I won't need the setters. I didn't realize it because the code still looks to me as if it has 1 protected cache and 3 private ones :P

Regarding LinkedHashMap, it does not provide better performance, but it reduces memory consumption considerably while retaining the same performance. Of course, if the heap is running out of memory, the GC will start spending more and more time doing its duty. That eventually turns into either an OOME or a dramatic performance decrease. So I think clients could benefit from LinkedHashMaps. The problem is that I'm not sure *which* size limit those caches should have. In my case, I use an arbitrary number, identified through experimentation. I would suggest you perform tests with your workspace and see if things keep running the same. In my case, it wasn't an option: huge models would just make the comparison impossible to finish on machines with < 768MB of heap.

---

(In reply to comment #13)
> I see. The 4 protected caches should be fine; in that case I won't need the
> setters. I didn't realize it because the code still looks to me as if it has
> 1 protected cache and 3 private ones :P

Yup, the patch is wrong ^^. I don't really know how I managed that with copy/pastes ;). There will be four protected caches in the final code, though.

> Regarding LinkedHashMap, it does not provide better performance, but it
> reduces memory consumption considerably while retaining the same performance.

That was "memory" performance I was talking about, yes.

> So I think clients could benefit from LinkedHashMaps. The problem is that I'm
> not sure *which* size limit those caches should have. In my case, I use an
> arbitrary number, identified through experimentation.

What I don't really understand here is why there is such a difference between the standard HashMap and the LinkedHashMap in this case. LinkedHashMaps still need resizing when the number of keys grows too large, and hash collisions are handled the same way, with a linked list of the colliding entries. The only thing a LinkedHashMap does that HashMap does not... is maintain an additional linked list for the iteration order. That should amount to greater memory consumption than HashMap, and slightly lower all-around time performance (except for iteration).

What you _do_ have that the generic code has not is an initial capacity for the cache. Chances are you would get the same result with a plain HashMap by giving it an accurate initial capacity (thus avoiding resizing to a capacity way greater than the total amount you need, which cannot be GC'ed). (I'm interested to know if I am wrong or totally off track here :p.)

I'll leave HashMaps for the four caches in the generic code, and initialize them to 256 buckets (which is IMO way less than necessary for a "big" comparison, but should avoid resizing in most "small" cases). Could you tell me the number you used for your initial capacity?

The changes of this second patch have been pushed to master.
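For illustration, the initial-capacity point in code form (a generic sketch; the expected-entries number is an assumption made up for this example, not a value from EMF Compare):

```java
import java.util.HashMap;
import java.util.Map;

public class CacheSizing {
    public static void main(String[] args) {
        // A HashMap doubles its bucket array and rehashes every entry each
        // time size exceeds capacity * loadFactor. Sizing the map up front
        // avoids that work and the over-allocation that doubling leaves behind.
        int expectedEntries = 100000; // assumption; measure per workspace
        float loadFactor = 0.75f;
        Map<String, Double> similarityCache =
                new HashMap<String, Double>((int) (expectedEntries / loadFactor) + 1, loadFactor);
        similarityCache.put("left#right", 0.92); // illustrative cached similarity
    }
}
```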
---

> Yup, the patch is wrong ^^. I don't really know how I managed that with
> copy/pastes ;). There will be four protected caches in the final code, though.

Great :D

> What I don't really understand here is why there is such a difference between
> the standard HashMap and the LinkedHashMap in this case. [...]
> Could you tell me the number you used for your initial capacity?

Sorry, I wasn't clear enough here.
LinkedHashMaps are good for caches because they can act as an LRU cache if you subclass them:

```java
new LinkedHashMap<EObject, Integer>() {
    private static final long serialVersionUID = 1L;

    @Override
    protected boolean removeEldestEntry(Map.Entry<EObject, Integer> eldest) {
        return size() > 1000;
    }
};
```

The snippet above will remove the eldest entry once the map grows beyond 1000 elements. Of course, I agree that a LinkedHashMap by itself shouldn't be more optimal than a HashMap, although I'm afraid I don't have evidence for that claim; you seem to have more knowledge about it than I do, though :D

Anyway, the patch is now enough for me. If you still think users could benefit from an LRU cache based on LinkedHashMap, I can give feedback at any time :)

Thanks for the work!
Víctor

---

> Sorry, I wasn't clear enough here. LinkedHashMaps are good for caches because
> they can act as an LRU cache if you subclass them:

Ha, I understand better then :). Indeed, in such cases the memory consumption of the cache should be way lower than that of "off-the-shelf" Maps.

> Anyway, the patch is now enough for me. If you still think users could benefit
> from an LRU cache based on LinkedHashMap, I can give feedback at any time :)

Unfortunately, there are a number of other caches in the code that would benefit from LRU caches... and I wouldn't know where to start on that :s. If I were to create caches though, I'd take a look at what guava can give us (the "new dependency" you've seen on EMF Compare). I think you too could benefit from the utilities they have as far as collections are concerned (on that account, please note that recent developments on EMF Compare greatly reduced the time and memory footprints of "big" comparisons... you may benefit from using one of the 1.3 versions).

For example, you could build a cache such as this with guava (the code mainly speaks for itself):

```java
ConcurrentMap<Key, Graph> graphs = new MapMaker()
    .concurrencyLevel(32)
    .softKeys()
    .weakValues()
    .expiration(30, TimeUnit.MINUTES)
    .makeComputingMap(
        new Function<Key, Graph>() {
            public Graph apply(Key key) {
                return createExpensiveGraph(key);
            }
        });
```

> Thanks for the work!

Thanks for the feedback ;)
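A side note for readers on more recent Guava versions: MapMaker's softKeys(), expiration() and makeComputingMap() were later deprecated in favor of CacheBuilder. A roughly equivalent bounded, expiring cache, reusing the hypothetical Key, Graph and createExpensiveGraph names from the snippet above, might look like this:

```java
import java.util.concurrent.TimeUnit;

import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;

LoadingCache<Key, Graph> graphs = CacheBuilder.newBuilder()
        .maximumSize(1000)                        // size bound with LRU-like eviction
        .softValues()                             // let the GC reclaim values under pressure
        .expireAfterAccess(30, TimeUnit.MINUTES)  // drop entries unused for 30 minutes
        .build(new CacheLoader<Key, Graph>() {
            @Override
            public Graph load(Key key) {
                return createExpensiveGraph(key);
            }
        });

// graphs.getUnchecked(key) computes the value on first access and caches it.
```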
> (the "new dependency" you've seen on EMF Compare. I think you too could benefit
> from the utilities they have as far as collections are concerned (on that
> account, please note that recent developments on EMF Compare greatly reduced
> the time and memory footprints of "big" comparisons ... you may benefit from
> using one of the 1.3 versions).
>
> For example, you could build a cache such as this with guava (the code mainly
> speaks for itself) :
>
> ConcurrentMap<Key, Graph> graphs = new MapMaker()
> .concurrencyLevel(32)
> .softKeys()
> .weakValues()
> .expiration(30, TimeUnit.MINUTES)
> .makeComputingMap(
> new Function<Key, Graph>() {
> public Graph apply(Key key) {
> return createExpensiveGraph(key);
> }
> });
I see. Indeed, I noticed an awesome performance boost after the Guava dependency came into play: I saw the time to compare two huge models drop by half!!! I was very impressed by that. So if Guava provided such a general improvement, I'm sure our code can benefit from it too.

Thanks again!
---

> I see. Indeed, I noticed an awesome performance boost after the Guava
> dependency came into play: I saw the time to compare two huge models drop by
> half!!! I was very impressed by that. So if Guava provided such a general
> improvement, I'm sure our code can benefit from it too.

Introducing Guava does not magically improve performance (alas), but it does allow you to rethink your algorithms to make better use of the collections.

For example,

`Sets.union(set1, set2)`

replaces

`set1.addAll(set2)`

with an unmodifiable, lazy view that "joins" the end of set1 to the beginning of set2, thus avoiding the whole process of iterating over set2 and adding its values to set1 (with all the work that operation involves).

After that, it is a matter of reviewing the algorithms and trying to rethink them with "lazy" collection operations :).
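A small, self-contained illustration of that difference (the set contents are made-up sample data):

```java
import java.util.Set;

import com.google.common.collect.Sets;

public class LazyUnionDemo {
    public static void main(String[] args) {
        Set<String> set1 = Sets.newHashSet("a", "b");
        Set<String> set2 = Sets.newHashSet("b", "c");

        // Eager: set1.addAll(set2) would copy every element of set2 into
        // set1, possibly resizing set1's backing table along the way.

        // Lazy: a read-only view backed by both sets; nothing is copied, and
        // membership checks simply consult the two backing sets.
        Set<String> union = Sets.union(set1, set2);

        System.out.println(union.contains("c")); // true
        System.out.println(union.size());        // 3 -- "b" is counted once
    }
}
```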
---

(In reply to comment #18)
> Introducing Guava does not magically improve performance (alas), but it does
> allow you to rethink your algorithms to make better use of the collections.
> [...]
> After that, it is a matter of reviewing the algorithms and trying to rethink
> them with "lazy" collection operations :).

That's interesting. There is also a nice presentation on YouTube explaining the details of the library and the possible benefits it provides:

Using the Google Collections Library for Java (1 of 2)
http://www.youtube.com/watch?v=ZeO_J2OcHYM

Using the Google Collections Library for Java (2 of 2)
http://www.youtube.com/watch?v=9ni_KEkHfto

Looks like it requires a bit of rethinking, but it could improve performance (CPU and memory) and concurrency! Thanks for the info, Laurent, I'll definitely look into it :)

---

Batch-closing a bunch of "RESOLVED" bugs.