Some Eclipse Foundation services are deprecated, or will be soon. Please ensure you've read this important communication.

Bug 353414

Summary: [Performance] NameSimilarity.pairs should use Set implementation instead of List
Product: [Modeling] EMFCompare Reporter: Cedric Brun <cedric.brun>
Component: CoreAssignee: EMF Compare <emf.compare-inbox>
Status: CLOSED FIXED QA Contact:
Severity: normal    
Priority: P3    
Version: 1.2   
Target Milestone: ---   
Hardware: PC   
OS: Linux   
Whiteboard:
Attachments:
Description Flags
mylyn/context/zip none

Description Cedric Brun CLA 2011-07-29 10:04:01 EDT
the code

final double union = pairs1.size() + pairs2.size();
				pairs1.retainAll(pairs2);

is using Lists whereas it should use sets.				
				
This represents roughly 90% of the time spent during the match now that other optimisations have been done. Using sets divides the time spent in this method per a factor 10.

The original research paper (UMLDiff) was usings Sets, but doing this changeis not without effect .

1. NameSimilarity(a,b) should be equals NameSimilarity(b,a) now (which looks like a better option ;) )
2. The overall score of similarity will change for each match when some pairs can be found several times in a given String. Not sure if it's going to be significant enough to change the global result though.
Comment 1 Cedric Brun CLA 2011-07-29 12:38:01 EDT
Hum, just launching the tests it looks like this tiny change, as I was frightened of, has huge effect and lead to 4 failling tests in the matching set.
Comment 2 Cedric Brun CLA 2011-07-29 13:11:10 EDT
Checking through it, it was only "thresolds" effects. Indeed the change affect the absolute values returned by the matchs (and so the way elements are matched at some point), but *probably* for the better.
Comment 3 Cedric Brun CLA 2011-08-04 04:47:00 EDT
Created attachment 200890 [details]
mylyn/context/zip

adding the mylyn context
Comment 4 Cedric Brun CLA 2011-08-04 05:23:26 EDT
commited and pushed on branches 1.2 and master.
Comment 5 Cedric Brun CLA 2011-08-04 06:10:34 EDT
Reopening, other tests we did add in the meantime are failling. Other option : keeping the lists (and the former behavior) but making sure we will do the retainAll faster !
Comment 6 Cedric Brun CLA 2011-08-04 08:46:09 EDT
I have a fix keeping the exact same behavior as before but 5 times faster. (Coming soon in your git repository ;) )

While profiling I also saw : org.eclipse.emf.compare.util.ModelUtils.contains(Resource, EObject) which looked, at first, a quite dumb implementation (trying to find if an EObject is in a Resource by iterating over all the resource content) . Thinking a bit more about it, it's in fact done this way to handle fragmented resources. Clearly not the most efficient way to do but providing a more efficient way is not that trivial. Anyway this method can be speeded up  with an extra check at the beginning : 

public static boolean contains(Resource resource, EObject eObject) {
		if (eObject.eResource() == null)
			return false;
		final TreeIterator<EObject> contentsIterator = resource
				.getAllContents();
				
In case the eObject has no resource, we have very little chance to find it back in a Resource ;)

Doing so I did improve the performance of using EMF compare as a framework (It's not going to happen when using the Workspace actions)
Comment 7 Cedric Brun CLA 2011-08-09 04:38:05 EDT
pushed and fixed in both 1.2 and master.