Bug 336967 - Provide ability to adjust default projector timeout value
Summary: Provide ability to adjust default projector timeout value
Status: RESOLVED FIXED
Alias: None
Product: Equinox
Classification: Eclipse Project
Component: p2
Version: 3.7
Hardware: PC Mac OS X - Carbon (unsup.)
Importance: P3 normal
Target Milestone: 3.7 M6
Assignee: DJ Houghton CLA
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: 336968 363963
Reported: 2011-02-11 11:41 EST by DJ Houghton CLA
Modified: 2011-11-16 14:24 EST
CC List: 3 users

See Also:


Attachments
patch (2.95 KB, patch)
2011-02-11 14:59 EST, DJ Houghton CLA
New version of sat4j core with a fix for the timeout during optimization (189.00 KB, application/x-java-archive)
2011-03-05 07:25 EST, Daniel Le Berre CLA
New version of sat4j pb with a fix for the timeout during optimization (126.65 KB, application/x-java-archive)
2011-03-05 07:29 EST, Daniel Le Berre CLA

Description DJ Houghton CLA 2011-02-11 11:41:07 EST
There have been a couple of reports involving huge systems with several thousand IUs, where people add several IUs (same ids but multiple versions) via the reconciler and the solution that we get back from SAT4J isn't always the optimal one. As described in Bug 301446 comment 5, this is expected and the solutions are *not* incorrect.

We should provide a mechanism to adjust the default timeout value so clients are able to use this in their problem investigations. I recommend a System property (key: eclipse.p2.projector.timeout) which would default to 1000 (current value) if the property doesn't exist or the property value can't be parsed.
Comment 1 Daniel Le Berre CLA 2011-02-11 12:00:37 EST
Yes, we can do this. It is not a big deal.

It would be nice to collect those use cases, in order to see if we can speed up that resolution process.
Comment 2 Pascal Rapicault CLA 2011-02-11 13:07:18 EST
Same as Daniel. I'm all for adding the new support, but we also need a system that gathers the information so we can more easily recreate the problem.
Note that I suspect that this is a dropins-style install, where a complete product is installed through dropins, which means that the search space is much less restricted than in typical p2 use cases.
Comment 3 DJ Houghton CLA 2011-02-11 13:39:42 EST
Yes, this is another case where everything (~3000 plug-ins) is installed via the dropins mechanism so everything is considered optional, etc.

I've collected the data (profile registry before and after the execution, content.xml of the IUs we are trying to install) and will try and put together a test case.
Comment 4 DJ Houghton CLA 2011-02-11 14:59:27 EST
Created attachment 188812
patch

Patch. Only sets the timeout to be the user-specified value if it is a positive integer larger than the default. (currently 1000)
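The rule described here can be sketched roughly as follows. This is not the attached patch, only a minimal illustration of the stated behaviour: the property key and the default of 1000 come from this bug, while the class and method names are hypothetical.

```java
final class ProjectorTimeoutSketch {
    // Property key proposed in this bug's description.
    static final String TIMEOUT_PROPERTY = "eclipse.p2.projector.timeout";
    // Current default value, per this bug.
    static final int DEFAULT_TIMEOUT = 1000;

    /**
     * Returns the user-specified value only if it parses as an integer
     * larger than the default; otherwise falls back to the default.
     * (Hypothetical helper, not the actual p2 code.)
     */
    static int resolveTimeout(String rawValue) {
        if (rawValue == null) {
            return DEFAULT_TIMEOUT; // property not set
        }
        try {
            int parsed = Integer.parseInt(rawValue.trim());
            return parsed > DEFAULT_TIMEOUT ? parsed : DEFAULT_TIMEOUT;
        } catch (NumberFormatException e) {
            return DEFAULT_TIMEOUT; // value can't be parsed
        }
    }

    /** Reads the value from the JVM system properties. */
    static int resolveTimeout() {
        return resolveTimeout(System.getProperty(TIMEOUT_PROPERTY));
    }
}
```

Under these assumptions, a client would raise the value by passing a JVM argument such as -Declipse.p2.projector.timeout=10000 (for example after -vmargs on the Eclipse command line). Values of 1000 or less, negative values, and non-numeric values all leave the default in place.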
Comment 5 DJ Houghton CLA 2011-02-11 15:00:27 EST
Patch released to HEAD.
Comment 6 DJ Houghton CLA 2011-03-04 10:02:34 EST
I had a good chat with Pascal (thanks!) and he explained how this value is really used. I'll paste some information here just so we have it recorded for others to read and reference.

-------------

Rather than referring to a 1 second timeout, a "1000 timeout" refers to the number of times to retry when there are conflicts.  So when the value is 1000, it tries to find an optimal solution and if it doesn't find one by the end of 1000 tries, it returns the best one it has found so far. By increasing the number to 10,000 it means it will try to find a solution 10,000 times at the most. 

This was done this way so it will produce consistent results across multiple machines. If the timeout value was a real timeout, then the result for the same call on multiple machines would be highly dependent on processor, etc and most likely be different in cases where there are a lot of conflicts.

Another subtle aspect of the "timeout on conflict" value is that it is reset each time a better solution is found. So it really means:
- found a solution
- try up to 1000 times to find a better one
- found a better one
- count is reset to 0
- try up to 1000 times to find a better one

This should explain (because I know you are all asking) why it takes longer than 1 second (or 10 seconds) to present the solution to the installation problem when trying to install new software. Each attempt to find a solution could take an arbitrary amount of time so it is hard to predict how much longer installs will take if you increase this value by too much. 
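To make the counting scheme above concrete, here is a toy model of a conflict budget. It is not SAT4J code and does not reflect SAT4J's internals; the class, the method, and the encoding of the search as an int array are all hypothetical. The resetOnImprovement flag models the reset behaviour as described in this comment, and later comments in this bug question whether that reset actually happens.

```java
final class ConflictBudgetSketch {
    // Marker for a search step that ended in a conflict (hypothetical encoding).
    static final int CONFLICT = -1;

    /**
     * Replays a fixed trace of search steps. Each entry is either CONFLICT
     * or the cost of a solution found at that step (lower is better).
     * Returns the best cost seen before the conflict budget is used up,
     * or Integer.MAX_VALUE if no solution was found in time.
     */
    static int bestWithinBudget(int[] trace, int budget, boolean resetOnImprovement) {
        int conflicts = 0;
        int best = Integer.MAX_VALUE;
        for (int step : trace) {
            if (step == CONFLICT) {
                conflicts++;
                if (conflicts >= budget) {
                    break; // budget exhausted: keep the best solution so far
                }
            } else if (step < best) {
                best = step; // found a better solution
                if (resetOnImprovement) {
                    conflicts = 0; // count is reset to 0
                }
            }
        }
        return best;
    }
}
```

With the trace {10, CONFLICT, CONFLICT, 7, CONFLICT, CONFLICT, 5} and a budget of 3, the model returns 5 when the count is reset on each improvement and 7 when it is not. Because the budget counts conflicts rather than elapsed time, the same trace gives the same answer on any machine.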

In the general non-dropins-install case it shouldn't matter much because everything is installed via the UI or API calls and the dependencies and requirements are considered strict.

Because of the way things are installed through the drop-ins, everything is installed optionally, so when we try to compute what needs to be installed, everything (including all 3500 previously installed bundles) is considered optional and we try to recalculate the best solution. That is why we hit so many conflicts and why it takes so many tries to get the optimal solution.
Comment 7 Daniel Le Berre CLA 2011-03-04 10:20:26 EST
This should not be the case.

If it is indeed the case, then it is a bug in SAT4J. There is a notion of grouped calls to the solver in which the timeout should not be reset.

It is true that I usually do it on time, not on conflicts. I will check that ASAP.
Comment 8 Pascal Rapicault CLA 2011-03-04 21:52:38 EST
Daniel, do not worry. I did not check the SAT4J code. I was telling DJ about the restart behaviour from memory and it seems that I have misled him. Apologies to you both.
Comment 9 Daniel Le Berre CLA 2011-03-05 03:40:49 EST
I opened the following bug for SAT4J:
http://jira.ow2.org/browse/SAT-5

I noticed that there is a possible issue when the timeout in seconds is reached between two calls to the isSatisfiable() method.

I need to investigate further to see if it can happen also with conflict based timeout.
Comment 10 Daniel Le Berre CLA 2011-03-05 07:25:41 EST
Created attachment 190468
New version of sat4j core with a fix for the timeout during optimization
Comment 11 Daniel Le Berre CLA 2011-03-05 07:29:13 EST
Created attachment 190469
New version of sat4j pb with a fix for the timeout during optimization

DJ, could you try your test cases with those new jars for sat4j?
Their version number is 2.3.0.v20110305.

It should fix the issues you encountered when changing the value of the timeout.
Comment 12 DJ Houghton CLA 2011-03-07 15:55:28 EST
Unfortunately I no longer have access to the machine which exhibited the problem, but I do have a copy of the profile, repo, etc. that I was trying to put together to get a reproducible stand-alone test case.
Comment 13 DJ Houghton CLA 2011-03-08 16:45:44 EST
I got access to the test machine and tested the new JARs and they worked 7 out of 8 times. During the 6th invocation, lower versions of the bundles were installed. I was just using all default values and not passing in any special System properties, etc.

Also, I've released a new (currently disabled) test to the p2.tests called Bug301446. It has a copy of the profile along with the content.xml from the metadata repository of the dropins. I cannot get the test to fail consistently yet but wanted to capture the data so we have it on-hand.
Comment 14 Daniel Le Berre CLA 2011-03-08 17:36:55 EST
Thanks DJ!

It is strange that the behavior is not exactly the same each time, with a conflict based timeout.
We must be feeding the solver slightly differently each time (i.e. the order of the IUs must be changing).
Comment 15 DJ Houghton CLA 2011-03-09 07:37:51 EST
Yes, I'm not sure the input is the same every time. We are relying on the reconciler to discover what needs to be installed. And we are running "eclipse -clean" each time, so if something was installed the first time, then it wouldn't be included in the "potential IUs to install" the second time. That is, it wasn't a clean run each time; it was based on the previous results.

Also note that the test is being run on a VMWare image so there are some more constraints there.