As discussed on the status call, it may be helpful, for a number of reasons, to have a "short list" of unit tests that can run with each build. Ideally, when those tests all pass, it would signify the build deserved further testing; if there were failures in this "short list", it would signify something was wrong and should be investigated before continuing testing. So, I am opening this bug to begin to document that discussion. I'll paste in the current "long list" below, but will say up front that I think the goal should be to find a set of tests that runs in about an hour, maybe two. Another criterion is that it be "representative" of the health of the whole build ... something for each major component, ideally, plus version checking perhaps, comparator logs perhaps, but ideally not things that are essentially "stress tests" that run a lot of long-running operations, or specific corner cases. Doing this in the best possible way would not be an easy task and would take a lot of time to "do right", but perhaps we could get started with a subset of our existing tests to get a feel for it? Just pick the ones that run the quickest, and leave out any that take more than a few minutes to finish.

Here is the file that, through a dozen layers of redirection, ends up running the specific tests, as far as I can tell:

/org.eclipse.releng.eclipsebuilder/eclipse/buildConfigs/sdk.tests/testScripts/test.xml

The default is to run "all" (or, if I could figure out how to do it, just one suite specified on the command line). What I'd like to do is redefine "all" to be a short list of about 10 of these, and then define a new target "runAll" that would contain all the tests (I can't decide if a second test run should really re-run all, or only those not run the first time). A rough sketch of that split follows the current list below.

At any rate, opening this for discussion. Any opinions about which tests should be run for each build, and which could be run, say, once a week?
<target name="all"> <antcall target="equinoxp2" /> <antcall target="equinoxp2ui" /> <antcall target="pdeui" /> <antcall target="jdtcompilertool" /> <antcall target="jdtcompilerapt" /> <antcall target="jdttext" /> <antcall target="ant" /> <antcall target="compare" /> <antcall target="coreruntime" /> <antcall target="coreresources" /> <antcall target="osgi" /> <antcall target="coreexpressions" /> <antcall target="teamcore" /> <antcall target="jdtcoreperf" /> <antcall target="jdtcorebuilder" /> <antcall target="jdtcorecompiler" /> <antcall target="uiperformance" /> <antcall target="uieditors" /> <antcall target="uinavigator" /> <antcall target="uiworkbenchtexteditor" /> <antcall target="uircp" /> <antcall target="uiviews" /> <antcall target="jdtdebug" /> <antcall target="jdtui" /> <antcall target="jdtuirefactoring" /> <antcall target="ltkuirefactoringtests" /> <antcall target="ltkcorerefactoringtests" /> <antcall target="text" /> <antcall target="jface" /> <antcall target="jfacedatabinding" /> <antcall target="filebuffers" /> <antcall target="antui" /> <antcall target="coretestsnet" /> <antcall target="jdtapt" /> <antcall target="pdebuild" /> <antcall target="jdtaptpluggable" /> <antcall target="ua" /> <antcall target="uiforms" /> <antcall target="pdeapitooling" /> <antcall target="equinoxsecurity" /> <antcall target="search" /> <antcall target="pdeds" /> <antcall target="jdtcoremodel" /> <antcall target="uadoc" /> <antcall target="debug" /> <antcall target="ui" /> <antcall target="relEng" /> <antcall target="swt" /> <antcall target="teamcvs" /> <antcall target="equinoxds" /> <antcall target="equinoxp2discovery" /> <antcall target="bidi" />
One quick way to get a subset would be removing JDT and Team/CVS tests. These are known to be longer running tests and if there are "lower level" failures it might not matter what happened in higher level tests.
(In reply to comment #1)
> One quick way to get a subset would be removing JDT and Team/CVS tests. These
> are known to be longer running tests and if there are "lower level" failures it
> might not matter what happened in higher level tests.

Ok, the more specific, the better. :) I commented out a few based on Kim's suggestion (in another bug) but, based on yours, will comment out more. AND ... the important question ... is there any reason or logic behind the order of the tests listed in the "all" target? It seems it would make more sense to group them into "platform", "jdt" and "pde" subsets? Or something. But I hate to do that blindly in case there was some logic behind the current order.

<target name="all">
    <antcall target="equinoxp2" />
    <antcall target="equinoxp2ui" />
    <antcall target="pdeui" />
    <!-- temp remove long running tests
    <antcall target="jdtcompilertool" />
    <antcall target="jdtcompilerapt" />
    <antcall target="jdttext" />
    -->
    <antcall target="ant" />
    <antcall target="compare" />
    <antcall target="coreruntime" />
    <antcall target="coreresources" />
    <antcall target="osgi" />
    <antcall target="coreexpressions" />
    <!-- temp remove long running tests
    <antcall target="teamcore" />
    -->
    <!-- temp remove long running tests
    <antcall target="jdtcoreperf" />
    <antcall target="jdtcorebuilder" />
    <antcall target="jdtcorecompiler" />
    -->
    <antcall target="uiperformance" />
    <antcall target="uieditors" />
    <antcall target="uinavigator" />
    <antcall target="uiworkbenchtexteditor" />
    <antcall target="uircp" />
    <antcall target="uiviews" />
    <!--
    <antcall target="jdtdebug" />
    <antcall target="jdtui" />
    <antcall target="jdtuirefactoring" />
    <antcall target="ltkuirefactoringtests" />
    <antcall target="ltkcorerefactoringtests" />
    -->
    <antcall target="text" />
    <antcall target="jface" />
    <antcall target="jfacedatabinding" />
    <antcall target="filebuffers" />
    <antcall target="antui" />
    <antcall target="coretestsnet" />
    <!-- temp remove long running tests
    <antcall target="jdtapt" />
    -->
    <antcall target="pdebuild" />
    <antcall target="jdtaptpluggable" />
    <antcall target="ua" />
    <antcall target="uiforms" />
    <antcall target="pdeapitooling" />
    <antcall target="equinoxsecurity" />
    <antcall target="search" />
    <antcall target="pdeds" />
    <!-- temp remove long running tests
    <antcall target="jdtcoremodel" />
    -->
    <antcall target="uadoc" />
    <antcall target="debug" />
    <antcall target="ui" />
    <antcall target="relEng" />
    <antcall target="swt" />
    <!-- temp remove long running tests
    <antcall target="teamcvs" />
    -->
    <antcall target="equinoxds" />
    <antcall target="equinoxp2discovery" />
    <antcall target="bidi" />
Given bug 377453, where Hudson will be rebooted once a week, we may want to give this issue more attention, so we don't lose all unit tests one night every week. I need to study the actual times more to know what to suggest, but I'll write down what I'm thinking of, to see if it spurs any improved ideas from others.

I'm thinking of asking for a new set of Hudson jobs and breaking our unit tests up to match. Something like the list below, where the "normal" ones would be expected to finish in under an hour, and the "long running" ones would take longer, 5 or 8 hours? Maybe we should break those into three parts as well ... but this is the starting idea:

eclipse-sdk-3x-platform-tests
eclipse-sdk-3x-jdt-tests
eclipse-sdk-3x-pde-tests
eclipse-sdk-3x-longrunning
eclipse-sdk-4x-platform-tests
eclipse-sdk-4x-jdt-tests
eclipse-sdk-4x-pde-tests
eclipse-sdk-4x-longrunning

The idea would be to run them "in order" (so, platform runs before jdt, etc.). If easy to do, I'd like to set things up so the expected short-running ones have a short time limit ... say 20 or 30 minutes ... so if they do happen to hang, they time out quickly, instead of waiting the current two hours for all test suites. Not sure how much work this will be, but it _might_ be worth considering for Juno.
I've noticed, from the JUnit output reports, that there are no individual tests that take very long; the tests themselves are relatively quick. So I'm assuming it is the "setup" for some tests that takes a long time, and I do not know how to find the overall length of a test suite. So, I'll invent a "logging" macro that will write out the "current time" as each test starts and ends. This is relatively easy to do in test.xml since (as far as I can see) each and every test (bundle) is "started" by the "runtests" macro. So there is just that one place to put

<markTime msg="start @{testPlugin}"/>
...
<markTime msg="end @{testPlugin}"/>

This will result in a file, to be analyzed later, that will look similar to

start <testPluginName1> 1349545005
end <testPluginName1> 1349546666
start <testPluginName2> 1349547777
end <testPluginName2> 1349549999

where the times are the seconds since 1970-01-01 00:00:00 UTC. From there, it should be relatively easy to write a small utility to calculate elapsed time. This relies on running the OS executable

date +%s

which, from my tests here at home, works on Linux, Windows, and Macs. I'm not sure if/when the "elapsed time" portion will be integrated with the main output of running the tests; initially I just want to see a) if it works, and b) if it helps to identify the really long tests. FYI, from everything I could find online, a pure Ant method to track and compute "elapsed time" seemed pretty complicated (due to Ant's non-procedural nature), so I thought a simple quick solution was the place to start.
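
For the record, a minimal sketch of what I have in mind for that macro follows; the log file name and property naming here are illustrative only, not necessarily what will be committed:

<!-- Sketch of the markTime idea. "timestamps.txt" is a placeholder name.
     Restricted to the unix OS family, since it relies on the "date" executable. -->
<macrodef name="markTime">
    <attribute name="msg"/>
    <sequential>
        <!-- capture seconds since 1970-01-01 00:00:00 UTC; on non-unix hosts the
             exec is skipped and the property stays unset -->
        <exec executable="date" osfamily="unix" failonerror="false"
              outputproperty="epoch.@{msg}">
            <arg value="+%s"/>
        </exec>
        <!-- append one "msg seconds" line for later elapsed-time analysis -->
        <echo file="timestamps.txt" append="true"
              message="@{msg} ${epoch.@{msg}}${line.separator}"/>
    </sequential>
</macrodef>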
>
> This relies on running the OS executable
>
> date +%s
>
> which, from my tests here at home, works on Linux, Windows, and Macs.

Just to correct myself: I guess on MY Windows systems I've installed "unix like" utilities. This executable failed on build.eclipse.org, which I assume is (mostly) out-of-the-box Windows, so I've restricted this logging to the "unix" family for now. (We could ask the webmasters to install something special, but that doesn't seem required, since I'm just looking for two or three categories of "short", "medium" and "long" ... I doubt there is that much difference between platforms for nearly all tests.)
Created attachment 222097 [details]
Summary of Linux overall unit test times
Created attachment 222098 [details]
Summary of Mac overall unit test times

Just dumping some data obtained so far. I think this sort of summary data will be useful. From it, I think we can define 6 groups: "platform", "jdt", "pde", "platformLR", "jdtLR", "pdeLR" (where 'LR' stands for "Long Running"). Now ... what to do with that? One thought is to run only the "short tests" for nightly builds ... I suspect they (both nightlies and the shorter running tests) are good "sanity checks", and the longer running ones are more functional tests that may not be that important for nightly builds. A second possibility is that we may be able to submit them as 6 Hudson jobs (short ones first) and receive back the data and update the DL page in 6 steps, so there are incremental updates during a build. To do this, though, we'll need to improve the way we watch for and collect back unit test results.
I have now used the data to define a "quickTests" group of unit tests that runs in about 1/4 of the time of all the tests. From a quick glance, this is really about 1/4 of the specific unit tests (not test suites; it's roughly 3/4 of the test suites). You can see how they are grouped by reading the main test.xml file, in master:

http://git.eclipse.org/c/platform/eclipse.platform.releng.maps.git/tree/org.eclipse.releng/configuration/eclipseBuilderOverlays/eclipse/buildConfigs/sdk.tests/testScripts/test.xml#n1403

Or, in the next comment I'll attach a "data report" from a run on my local machine, which shows the grouping also. A rough sketch of the target structure is at the end of this comment.

Here is my specific proposal: for nightly builds, let's run just the "quickTests". These should complete within a couple of hours (about the time it takes to actually build) instead of the 6 or 8 hours we are seeing now. For I and M builds, we would still run the full set.

[Side point: another important thing to keep in mind is that some of the divisions I've made were based on preliminary performance test results. Some of the tests don't run, or take a long time to run, when the "performance tests" flag is set. The ultimate goal is to refine or evolve the list and grouping so that it applies to both unit tests and performance tests. Put another way, for our initial performance tests, we will likely be running a subset of all possible performance tests.]

I think this is a more responsible use of the eclipse.org resources ... and in theory, it could allow us to do more builds, if/when that's required (as one example, we could do 2 N-build/test cycles each day in the time we are currently doing one). I think this still satisfies the main purpose of the N-builds ... a quick sanity check ... while for the more important I and M builds, we'd still get the full set of tests run.

One complication I've not looked into yet is that, with this proposal as is, the summary of tests for nightly builds will simply show a lot of tests as "didn't run" (they will not show up as DNFs), which will just look funny ... at a glance it'd be hard to tell why a test didn't run. Of course, we already have limitations here: we have 5 suites that never run, since they have been "disabled" due to problems ... so maybe some overall improvements in these summaries can be made?

I have committed the test.xml and will let it run in full tonight (Thursday), since that will be a good check that I have not accidentally dropped (left out) a suite with all the copy/pasting I did. Then, I propose, starting Friday night, we'd run only the "quickTests" for nightlies. [If nothing else, this has the side effect that our Saturday night N-builds/tests will have time to complete before the weekly reboot of Hudson!]

So, let me know if I'm off-base and looking at this all wrong. We can certainly refine the exact grouping over time, but as far as I can tell, having the ability to run smaller sets of tests will be a great improvement (especially as we try to restore some performance tests). I'll look at improving the "did not run" list, but that may not be implemented until later.
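
For those skimming, the structure boils down to something like the following sketch. The group names follow what I described above, but the suites listed per group here are just illustrative placeholders, not the committed grouping (see the linked test.xml for that):

<!-- Sketch only: which suite goes into which group is defined in test.xml. -->
<target name="quickTests">
    <antcall target="platform" />
    <antcall target="jdt" />
    <antcall target="pde" />
</target>

<target name="longRunningTests">
    <antcall target="platformLR" />
    <antcall target="jdtLR" />
    <antcall target="pdeLR" />
</target>

<!-- "all" remains the full set: quick groups first, then the long running ones -->
<target name="all">
    <antcall target="quickTests" />
    <antcall target="longRunningTests" />
</target>

<!-- each group is simply a list of the existing suite targets, e.g. -->
<target name="platform">
    <antcall target="compare" />
    <antcall target="coreruntime" />
    <antcall target="osgi" />
    <!-- ... -->
</target>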
Created attachment 222185 [details]
Summary of "quickTests" time-to-run

This summary is from a unit-test run on my local machine, which overall takes half the time of the build.eclipse.org Linux tests ... which itself is problematic :( But it shows that the "quickTests" take about 45 minutes, while the total time is 4 hours. (So, on build.eclipse.org, I'd expect that to be about 1.5 hours vs. 8 hours total.) We'll find out tonight, unless someone objects to this whole idea. You can also see, by reading through the summary, which suites are included in "quickTests" and which are included in the "LR" (long running) group. [Keep in mind, some of the LR tests, such as 'uiperformance', are quick as unit tests but take forever, or do not run at all, as performance tests ... so far, I am trying to capture both unit tests and performance tests in one set of groups.]
(In reply to comment #8)
> Here is my specific proposal: For nightly builds, let's run just the
> "quickTests". These should complete within a couple of hours (about the time
> it takes to actually build) instead of the 6 or 8 hours we are seeing now.
>
> For I and M builds, we would still run the full set.

-1 from me for that: I definitely want to detect, and be able to fix, failures before the I-build, not discover them on the I-build and have to request a rebuild. I'm also against a short list; in that case I'd rather just not run "my" tests on every second N-build (or so). But honestly, 5 years ago the IT infrastructure was able to handle them, and now, 5 years later, it's not working anymore. Sounds broken to me.
From the JDT/Core perspective, a short list of tests without JDT on it doesn't sound very helpful to me. Rather than coarse-grained disabling of test suites, I'd propose filtering individual tests, so that each component gets some testing, always. E.g., when running org.eclipse.jdt.core.tests.compiler locally and I'm short on time, I simply pass -Dcompliance=1.7 to the VM, and this cuts out all tests for compliance levels 1.3, 1.4, 1.5 and 1.6. Do other components have similar means to slice their test suites into something significantly faster?

(In reply to comment #10)
> second N-build (or so). But honestly, 5 years ago the IT infrastructure was
> able to handle them and now 5 years later it's not working anymore. Sounds
> broken to me.

Well, if we had faster tests, we could do continuous builds and builds triggered by Gerrit, no?
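
For illustration, roughly how that slicing could look when wired into an Ant-driven run. The target name, classpath id, and suite class below are placeholders I made up, and the SDK test.xml actually drives suites through the Eclipse test framework rather than a bare junit call, so treat this as a sketch only:

<!-- Sketch: pass the compliance system property to the forked test VM. -->
<target name="jdtcorecompiler-quick">
    <junit fork="yes" printsummary="on">
        <classpath refid="jdt.core.test.classpath"/>
        <!-- run only the 1.7 compliance level, cutting out the 1.3-1.6 passes -->
        <sysproperty key="compliance" value="1.7"/>
        <formatter type="xml"/>
        <test name="org.eclipse.jdt.core.tests.compiler.regression.TestAll"/>
    </junit>
</target>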
(In reply to comment #11)
> (In reply to comment #10)
> > second N-build (or so). But honestly, 5 years ago the IT infrastructure was
> > able to handle them and now 5 years later it's not working anymore. Sounds
> > broken to me.
>
> Well, if we had faster tests, we could do continuous builds and builds
> triggered by gerrit, no?

Sure, if the current tests ran faster ;-). But this bug here is about leaving tests out because they run slowly on Hudson. If we really want to run some tests via Gerrit, then I would not want to run a fixed small set of tests, but rather the tests that correspond to the changed code (package or type granularity). Also, I would only want to run tests that never give false positives. Currently, I tend to simply ignore the EGit test outcome when I make a fix for EGit via Gerrit, because the tests almost always fail, either because of some unstable unrelated test(s) or because Hudson is in a weird state.
Thanks for everyone's comments thus far. I won't go into a long editorial on the issues or my mis-perceptions and mis-conceptions, but I will change the focus of my efforts to making the tests more "modular", with more flexibility in the way they are run. For example, if a re-build was required, we might decide to cancel the longer running tests that hadn't finished yet (since technically they would no longer be very valid, and we might need the limited server resources for the new re-build's tests).

But, more important, in discussing some performance test anomalies with Denis, it appears that on today's many-processor machines, each processor is slower (in clock frequency) than the single or dual core processors of years past. This means some types of things will be slower (on one processor, with a slower frequency), and the more we can do in parallel, the better. This is especially true of a system such as Hudson, where some of the slaves are literally different machines, and some slaves are merely different virtual machines defined on a subset of the processors of the same physical machine.

While we are weeks away from putting this into practice, some initial "local" tests I've done look promising. I checked this using my local version of a Hudson master, with 4 executors (on a machine with 8 processors, an Intel i7, running Linux). Running the unit tests as one job on my local setup takes almost 5 hours to complete. Running them as 6 jobs takes a little over 2.5 hours! So for a while 4 are running "in parallel", with the other two waiting in the queue until some finish. The moral of the story is that being explicit about what to run in parallel makes better use of all those processors, rather than letting the machine do its best to take advantage of them on its own. This would be even more true on Hudson, with its many slaves. (Of course, if all the slaves and executors were busy with other jobs, then we'd be back to the same time as one big long job ... but that's seldom the case.)

One complication is that we need some improved scripts to collect the results of multiple Hudson jobs back onto one build page. But we need to do that sort of improvement anyway, for bug 389048. A more difficult problem is that on Windows I don't think we can run some of the UI tests "in parallel" on the same machine, since there is literally only one display (with Hudson). That is, on Windows the jobs may essentially have to run serially, hence no faster than they run as one big job. But I think on Linux and the Mac they will typically run faster, overall. Perhaps one approach (for the Windows issue) is having the "UI sensitive" suites all in one job that we'd make sure to run by itself.

I think the right starting point for this modularity is still the 6 sets I've defined so far, but we can evolve that in the future, of course. I'll work on bug 389048 first, plus a few other changes to the scripts to make this "parallel processing" easier to implement. FYI, this "parallel processing" will apply to performance tests, same as unit tests, if/when we restore running them on a regular basis.
From my experience, the problem is that testing some components is very hard in Eclipse because they have too many (heavy) dependencies. I don't have a simple example handy but usually, it means you need to start the platform with X plugins just to test a method call. My hope is that these hard dependencies can be fixed with DI/IoC and e4. The goal should be that I can test code completion for an editor plugin without having to load the platform preferences and the spell checker, so to speak. Or to put it another way: It's possible to run 1000 unit tests per second unless the code is designed to prevent this. So I'd prefer if you'd isolate the long running tests and then open bugs to cut their dependencies.
(In reply to comment #14)
> From my experience, the problem is that testing some components is very hard
> in Eclipse because they have too many (heavy) dependencies.
> ...
> So I'd prefer if you'd isolate the long running tests and then open bugs to
> cut their dependencies.

Do you mean dependencies in the sense of "what's required to be installed"? I have seen some inefficiencies, such as (I think) downloading and unzipping the "previous build" up front, even though it's required by just a few tests. But I'm not sure there are any huge improvements to make for most cases (i.e., we'd always use the whole SDK, download the whole unit tests package, etc., even if not all of it is needed when testing "one method"). If you have any concrete proposals (preferably as patches :) I'd like to hear more. But if you mean the normal "dependencies" specified by features or bundle manifests, those are not likely to change (unless you are saying some of our test bundles really have dependencies they don't need?). Thanks for your comments.
(In reply to comment #15)
> > From my experience, the problem is that testing some components is very hard
> > in Eclipse because they have too many (heavy) dependencies.
> > ...
> > So I'd prefer if you'd isolate the long running tests and then open bugs to
> > cut their dependencies.
>
> Do you mean dependencies in "what's required to be installed"?

I mean any kind of dependency. When a test runs longer than 10 seconds, it probably tests things that it shouldn't.

To get my idea, think how you would test making changes to a user object. There are two approaches. The first goes like this:

* Install and set up a database server
* Create tables, set constraints, fill tables with data
* Start the application, log in with correct credentials, search for the user object
* Modify the user object
* Save the modified user in the database
* Validate the changes in the database

This kind of test tests too much. It tests login, which we don't really care about; we just log in because there is no other way to execute the test. Because of bad design, the login has become a dependency of the test. Same with the database server: I think (hope?) it's safe to assume that database vendors test their products. There is no point in "testing" whether they will execute SQL correctly. So here is another approach:

* Wrap all SQL-related code in a helper class
* Mock this helper class for the test so it only records SQL instead of sending it to a database. Some methods can return predefined results, etc.
* Modify the user object
* Save using the mock
* Validate the generated SQL

The first test easily breaks (network problems, password changes), and it's slow (you need to stop the database, clean it, load data, ...). The second form of test is fast, compact and independent. Changes to the login (code changes or password changes in the database) can't break it. It's not always easy to see in which category a test belongs, but "test runs longer than 10 seconds" is usually a good clue.

So my argument is that the JDT test suite is slow because it often has to start the platform. Is it really important for JDT to test OSGi? Platform startup?
(In reply to comment #16)
> I mean any kind of dependency. When a test runs longer than 10 seconds, then
> it probably tests things that it shouldn't.
>
> To get my idea, think how you would test making changes to a user object.

WRT the theory of unit tests, I fully agree, but ... JDT tests are not unit tests in that sense. Nobody cares about the effect of MethodDeclaration.analyseCode(..) in isolation; all useful tests in this area are integration tests at some level, because we need to find out whether a given project of Java files produces correct class files (looking at the compiler as an example).

> So my argument is that the JDT test suite is slow because it often has to
> start the platform. Is it really important for JDT to test OSGi? Platform
> startup?

As for JDT/Core tests, no, they don't restart the platform; the "heavy" setup is only creating and initializing new Java projects as test objects. We're beating a dead horse anyway: the JDT/Core test suites execute 50000+ tests in 45 minutes on my machine, corresponding to an average of around 50 milliseconds per test. Wow, we have integration tests well within your threshold for unit tests. So in JDT/Core the pain point is the sheer number of tests (the pain of luxury). Other suites may differ; maybe you're referring to JDT/UI tests (which have to fire up a UI, for obvious reasons ...).

My point is: the theory of "mocking everything else" is not a silver bullet; in many situations it is even useless.
Doing a mass "reset to default assignee" of 52 bugs to help make clear it will (very likely) not be me working on things I had previously planned to work on. I hope this will help prevent the bugs from "getting lost" in other people's queries. Feel free to "take" a bug if appropriate.
This bug is too broad to be actionable. Closing now. We need better targeted and actionable items.