Some Eclipse Foundation services are deprecated, or will be soon. Please ensure you've read this important communication.
Bug 353414 - [Performance] NameSimilarity.pairs should use Set implementation instead of List
Summary: [Performance] NameSimilarity.pairs should use Set implementation instead of List
Status: CLOSED FIXED
Alias: None
Product: EMFCompare
Classification: Modeling
Component: Core (show other bugs)
Version: 1.2   Edit
Hardware: PC Linux
: P3 normal (vote)
Target Milestone: ---   Edit
Assignee: EMF Compare CLA
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2011-07-29 10:04 EDT by Cedric Brun CLA
Modified: 2011-08-09 04:38 EDT (History)
0 users

See Also:


Attachments
mylyn/context/zip (6.49 KB, application/octet-stream)
2011-08-04 04:47 EDT, Cedric Brun CLA
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Cedric Brun CLA 2011-07-29 10:04:01 EDT
the code

final double union = pairs1.size() + pairs2.size();
				pairs1.retainAll(pairs2);

is using Lists whereas it should use sets.				
				
This represents roughly 90% of the time spent during the match now that other optimisations have been done. Using sets divides the time spent in this method per a factor 10.

The original research paper (UMLDiff) was usings Sets, but doing this changeis not without effect .

1. NameSimilarity(a,b) should be equals NameSimilarity(b,a) now (which looks like a better option ;) )
2. The overall score of similarity will change for each match when some pairs can be found several times in a given String. Not sure if it's going to be significant enough to change the global result though.
Comment 1 Cedric Brun CLA 2011-07-29 12:38:01 EDT
Hum, just launching the tests it looks like this tiny change, as I was frightened of, has huge effect and lead to 4 failling tests in the matching set.
Comment 2 Cedric Brun CLA 2011-07-29 13:11:10 EDT
Checking through it, it was only "thresolds" effects. Indeed the change affect the absolute values returned by the matchs (and so the way elements are matched at some point), but *probably* for the better.
Comment 3 Cedric Brun CLA 2011-08-04 04:47:00 EDT
Created attachment 200890 [details]
mylyn/context/zip

adding the mylyn context
Comment 4 Cedric Brun CLA 2011-08-04 05:23:26 EDT
commited and pushed on branches 1.2 and master.
Comment 5 Cedric Brun CLA 2011-08-04 06:10:34 EDT
Reopening, other tests we did add in the meantime are failling. Other option : keeping the lists (and the former behavior) but making sure we will do the retainAll faster !
Comment 6 Cedric Brun CLA 2011-08-04 08:46:09 EDT
I have a fix keeping the exact same behavior as before but 5 times faster. (Coming soon in your git repository ;) )

While profiling I also saw : org.eclipse.emf.compare.util.ModelUtils.contains(Resource, EObject) which looked, at first, a quite dumb implementation (trying to find if an EObject is in a Resource by iterating over all the resource content) . Thinking a bit more about it, it's in fact done this way to handle fragmented resources. Clearly not the most efficient way to do but providing a more efficient way is not that trivial. Anyway this method can be speeded up  with an extra check at the beginning : 

public static boolean contains(Resource resource, EObject eObject) {
		if (eObject.eResource() == null)
			return false;
		final TreeIterator<EObject> contentsIterator = resource
				.getAllContents();
				
In case the eObject has no resource, we have very little chance to find it back in a Resource ;)

Doing so I did improve the performance of using EMF compare as a framework (It's not going to happen when using the Workspace actions)
Comment 7 Cedric Brun CLA 2011-08-09 04:38:05 EDT
pushed and fixed in both 1.2 and master.