| Summary: | run performance tests on eclipse.org hardware - one build, over and over | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| Product: | [Eclipse Project] Platform | Reporter: | David Williams <david_williams> | ||||||||
| Component: | Releng | Assignee: | David Williams <david_williams> | ||||||||
| Status: | VERIFIED FIXED | QA Contact: | |||||||||
| Severity: | normal | ||||||||||
| Priority: | P3 | CC: | akurtakov, daniel_megert, david_williams, deepakazad, denis.roy, john.arthorne, kim.moir, mike.milinkovich, pwebster, satyam.kandula, sbouchet, wayne.beaton | ||||||||
| Version: | 4.2 | ||||||||||
| Target Milestone: | 4.3 M2 | ||||||||||
| Hardware: | PC | ||||||||||
| OS: | Windows 7 | ||||||||||
| Whiteboard: | |||||||||||
| Bug Depends on: | 389377, 389422 | ||||||||||
| Bug Blocks: | 362718, 374441, 389857 | ||||||||||
| Attachments: | generated test properties; some sample summary data | ||||||||||
Description
David Williams
Just to capture it somewhere, here is the meaty part of the "configuration" of the Hudson job that Kim was using before she left. Some of it I don't understand, and we do not accommodate it currently. Some of it is stuff I know has changed as we got the builds working on build.eclipse.org. But some of the "principles" (copying special property files to special locations) will serve us at least temporarily.

    echo "Starting..."
    cd test/eclipse-testing
    echo "extraVMargs=-Declipse.perf.config=config=@testMachine@;build=@buildType@@timestamp@;jvm=sun -Declipse.perf.assertAgainst=config=@testMachine@;build=@reference@;jvm=sun" >vm.properties
    ./runtests -os linux -ws gtk -arch x86_64 -properties vm.properties -Dtest.target=performance -Dorg.eclipse.equinox.p2.reconciler.tests.lastrelease.platform.archive.linux-x86_64=eclipse-SDK-I20120314-1800-linux-gtk.tar.gz

I just didn't want it to get "lost", since I am about to change that job configuration.

Status: Here's what I've done and observed so far.

A. Got the "antui" performance test running by itself and ran it 5 times. Captured in jobs 44 to 48 on Hudson, such as https://hudson.eclipse.org/hudson/job/eclipse-sdk-perf-test/44/

B. Tried to expand the test to a longer list of 12 "platform" tests which, as far as I can tell, should all have "performance targets" defined (according to test.properties), but ... only 5 of the tests produced any output. Not sure why the others did not.

C. The test.properties file is (apparently) generated during the build, and according to it, there should be 41 suites with "performance targets". Does anyone recall if that generated list was inaccurate and there was always a smaller subset that ran? I'll attach test.properties.

The 12 I defined to run were

    <antcall target="ant" />
    <antcall target="antui" />
    <antcall target="compare" />
    <antcall target="coreruntime" />
    <antcall target="coreresources" />
    <antcall target="osgi" />
    <antcall target="coreexpressions" />
    <antcall target="coretestsnet" />
    <antcall target="text" />
    <antcall target="jface" />
    <antcall target="jfacedatabinding" />
    <antcall target="filebuffers" />

but only these 5 produced results:

    osgi
    coreresources
    compare
    coreruntime
    antui

I may have changed the order of some, while defining a subset of "platform tests" ... but ... I would think that would not matter? Does anyone recall if there were order-sensitive effects? I may try re-arranging the order, but am not optimistic that would change much (if it did, I'd prefer the tests be fixed not to be order sensitive).

I'll also try adding more tests, since those 5 tests that ran did not take very long. I did not see any obvious error in the logs, but if anyone else wants to look at details, the "larger runs" start with job 49: https://hudson.eclipse.org/hudson/job/eclipse-sdk-perf-test/49/

Created attachment 221008 [details]
generated test properties, describing which tests have performance targets
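
In case the placeholder syntax in the job script above isn't familiar: a minimal sketch of how the @...@ tokens might look once expanded for a single run. The config name "eclipse-perf-vm" and reference build "R3_8" are made-up illustration values, not the real job settings; only the build id I20120911-1000 is taken from later in this bug.

```sh
# Hypothetical expansion of the @testMachine@ / @buildType@@timestamp@ / @reference@
# placeholders; in the real job these are filled in by the build machinery, not by hand.
echo "extraVMargs=-Declipse.perf.config=config=eclipse-perf-vm;build=I20120911-1000;jvm=sun -Declipse.perf.assertAgainst=config=eclipse-perf-vm;build=R3_8;jvm=sun" > vm.properties

./runtests -os linux -ws gtk -arch x86_64 -properties vm.properties -Dtest.target=performance
```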
Ok, just those five like in runs 49-50 is great. Those have a good mix of CPU intensive, disk intensive, etc. Can I just launch that again, or have you been making more changes? About three more runs like 49-50 would be great.

I've started run #51 ... we'll see what happens.

(In reply to comment #5)
> I've started run #51... we'll see what happens.

Should be the same thing as 50. 49 was the case where 12 tests produced 5 results. For 50 (and 51, ...) I added more to the "try and test" set, 32 total. I did not check how many of these actually had performance test targets defined, but it should be nearly the complete "platform set" minus org.eclipse.equinox.p2.tests [which I thought was the really long running one, but looking again in the light of morning, I am not sure]. But they produced 11 output files (with the detailed timing results) instead of 5.

    <target name="platform">
        <antcall target="ant" />
        <antcall target="antui" />
        <antcall target="compare" />
        <antcall target="coreruntime" />
        <antcall target="coreresources" />
        <antcall target="osgi" />
        <antcall target="coreexpressions" />
        <antcall target="coretestsnet" />
        <antcall target="text" />
        <antcall target="jface" />
        <antcall target="jfacedatabinding" />
        <antcall target="filebuffers" />
        <antcall target="teamcore" />
        <antcall target="uadoc" />
        <antcall target="uiperformance" />
        <antcall target="uieditors" />
        <antcall target="uinavigator" />
        <antcall target="uiworkbenchtexteditor" />
        <!-- don't run now, for 4.2. See bug 380553.
        <antcall target="uircp" />
        -->
        <antcall target="uiviews" />
        <antcall target="ua" />
        <antcall target="uiforms" />
        <antcall target="equinoxp2ui" />
        <antcall target="equinoxsecurity" />
        <antcall target="search" />
        <antcall target="debug" />
        <antcall target="ui" />
        <antcall target="teamcvs" />
        <antcall target="equinoxds" />
        <antcall target="equinoxp2discovery" />
        <antcall target="bidi" />
        <antcall target="ltkuirefactoringtests" />
        <antcall target="ltkcorerefactoringtests" />
    </target>

The 11 output files were:

    org.eclipse.osgi.tests.perf.AllTests.txt
    org.eclipse.core.tests.resources.perf.AllTests.txt
    org.eclipse.team.tests.ccvs.ui.benchmark.WorkflowTests.txt
    org.eclipse.jface.tests.performance.JFacePerformanceSuite.txt
    org.eclipse.team.tests.ccvs.ui.benchmark.SyncTests.txt
    org.eclipse.compare.tests.performance.PerformanceTestSuite.txt
    org.eclipse.core.tests.runtime.perf.AllTests.txt
    org.eclipse.ant.tests.ui.testplugin.AntUIPerformanceTests.txt
    org.eclipse.ua.tests.AllPerformanceTests.txt
    org.eclipse.ui.tests.forms.AllFormsPerformanceTests.txt
    org.eclipse.ui.tests.performance.UIPerformanceTestSuite.txt

So, as far as I know, this is a good set to run over and over. After this second run, I'll set a timer so one is kicked off about every 8 hours (the first run took 5 hours 20 minutes). But wouldn't we want to make sure that whatever else is virtualized on the same hardware is also either busy or not busy? Such as, if it shares hardware with "hudson 1", perhaps we should re-run some tests there, occasionally?

Also, I noticed two failing tests, with "out of memory" errors, but didn't seem to be heap size, seemed to be insufficient machine memory?

    java.lang.OutOfMemoryError: unable to create new native thread

> Also, I noticed two failing tests, with "out of memory" errors, but didn't
> seem to be heap size, seemed to be insufficient machine memory?
I think that VM only has 1G ... I can likely steal a GB or two from the Windows slave and bump it up if a restart can be tolerated.
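
For what it's worth, "unable to create new native thread" usually points at overall machine memory or the per-user thread limit rather than the Java heap, which matches the suspicion above. A minimal sketch of standard Linux checks one could run on the slave (nothing here is specific to this job):

```sh
# How much physical memory and swap the slave actually has and uses
free -m

# The per-user process/thread limit; hitting it also produces
# "unable to create new native thread"
ulimit -u

# Rough count of threads currently alive on the machine
ps -eLf | wc -l
```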
(In reply to comment #8)
> > Also, I noticed two failing tests, with "out of memory" errors, but didn't
> > seem to be heap size, seemed to be insufficient machine memory?
>
> I think that VM only has 1G ... I can likely steal a GB or two from the
> Windows slave and bump it up if a restart can be tolerated.

Another GB would likely be good, but not urgent. Fine if you can set it and let it take effect with the Saturday (Sunday AM) restart?

Our old perf test machines had 3GB on a 32-bit machine, so I expect it would need at least that much on a 64-bit machine, if not more. I realize on a virtualized machine it's not quite the same comparison, but it's safe to assume it needs more.

(In reply to comment #5)
> I've started run #51... we'll see what happens.

And in case you are wondering what happened ... :) run 51 is "missing completely". Hudson itself "hung" (well ... became unresponsive), so Denis restarted it; hence, no run 51. Run 52 failed because the slave had to be restarted, so now we are waiting for run 53. I fear our performance tests "hung" something in Hudson; in particular, I've seen one case where our "uiperformance" test hung (on my local machine) and kill -9 would not kill it, even though luckily ctrl-c did (bug 389365). IMing with Denis, we said we'd "try again" and see if it happens again ... you know Hudson ... it hangs itself often enough that we can't be too quick to blame any one thing. Just wanted to keep you informed.

I will keep updating this, but here is a spreadsheet with some of the results extracted: https://docs.google.com/spreadsheet/ccc?key=0ArqzCH6xAv4wdG4tdy1odm5GbGJ1dFdEaWY5R0trc1E#gid=0

I picked three tests that I considered representative, that had fairly consistent times when running on our dedicated hardware:

1) A disk intensive test (refresh a project)
2) A CPU intensive test (adding problem markers)
3) A UI intensive test (opening the Ant editor)

So far the results are looking pretty stable. Surprisingly, the purely CPU-based test hasn't been consistent within the same run. Each test runs several times in a loop and calculates the standard deviation within that test run. I was pleased by the disk intensive test's stability, since I thought that would be an area that would be hard to make consistent. Maybe we just got lucky, so I'll keep tracking it. The UI test was reasonably stable so far. If our tests are killing Hudson, that's not a good sign though!

> I picked three tests that I considered representative, that had fairly
> consistent times when running on our dedicated hardware:
Were these tests run under Hudson on your dedicated hardware? I'd be curious to know if Hudson (either the master, the slave, or other) adds overhead or inconsistency.
(In reply to comment #13)
> > I picked three tests that I considered representative, that had fairly
> > consistent times when running on our dedicated hardware:
>
> Were these tests run under Hudson on your dedicated hardware? I'd be
> curious to know if Hudson (either the master, the slave, or other) adds
> overhead or inconsistency.

No, we never used Hudson internally for the performance tests. I believe we just invoked them directly via ssh from the build machine. I'm surely naive, but I thought Hudson was similarly just running a script to kick it off and not adding overhead on the slave where the test is running?

(In reply to comment #14)
> ... I'm surely naive
> but I thought Hudson was similarly just running a script to kick it off and
> not adding overhead on the slave where the test is running?

I am not familiar with how Hudson does it (what technology it uses), but there would always be some amount of "are you done yet", "here's the log so far", etc., though we all hope those are very efficient, low-overhead interactions. (But, honestly, it could well try to hold the whole log in memory for the convenience of the "master", which could impact tests; it could queue up serialized objects, which has the potential to "go wild" if not well managed/tested; etc.) I'm not saying "I know" ... just echoing "none of us know" :) Isn't it fun to experiment? :)

Not that I am a build watcher (well, I confess, I am), but the UI Performance suite got a

    java.lang.OutOfMemoryError: unable to create new native thread

so I guess that won't hang.
https://hudson.eclipse.org/hudson/job/eclipse-sdk-perf-test/ws/workarea/I20120911-1000/eclipse-testing/results/linux.gtk.x86_6.0/org.eclipse.ui.tests.performance.UIPerformanceTestSuite.txt

[So, we might need that memory fix sooner than I thought?]

In the main log, there are also some interesting messages that sound familiar, but I don't recall where I've seen them before.
https://hudson.eclipse.org/hudson/job/eclipse-sdk-perf-test/53/console

    [exec] Window manager warning: Buggy client sent a _NET_ACTIVE_WINDOW message with a timestamp of 0 for 0x8012a8 (Internal E)
    [exec] Window manager warning: meta_window_activate called by a pager with a 0 timestamp; the pager needs to be fixed.
    [exec]     [java] Java Result: -1
    [exec] Window manager warning: Buggy client sent a _NET_ACTIVE_WINDOW message with a timestamp of 0 for 0x80134a ()
    [exec] Window manager warning: meta_window_activate called by a pager with a 0 timestamp; the pager needs to be fixed.
    ...

I believe the "Java Result: -1" is just the UI test failing, likely unrelated to the "Buggy client" messages.

Status: job 53 finished "normally". I am surprised the "unable to create new native thread" didn't show up as a "test failure" ... maybe that message is coming from Hudson itself ... so the tests started, which implies no tests failed? At any rate, I've set it to run 3 times a day, starting at 10 PM: so 10 PM, 6 AM, and 2 PM. Denis, if you ever need to restart Hudson or this "virtual machine", feel free to kill whatever "performance job" is running.

I imagine if the perf slave is completely running out of memory, it could break any Hudson process running on that machine.

Status: build 54 started and ran "normal" (by normal, I just mean it produced the 11 results txt files). Build 55 ... not so good ... it started at 6 PM, ran for about 3.5 hours, then "failed hard", with a message about:

    Looks like the node went offline during the build. Check the slave log for the details.
    FATAL: channel is already closed

Webmasters, was this something you did? Know anything about it? Does some automatic process kick in when it detects a "hang"? The node appears on-line now ... so, I guess we'll just wait to see what happens with the 2 PM build?

Just to note it, there is no "run 59" from Saturday. I didn't check until late, but Saturday night Hudson was completely unresponsive to my browser requests ... so I fear once again that something about our performance tests is affecting Hudson. But Hudson did its weekly restart on Sunday AM just fine, and our next perf test ran "normally". I sent email to the webmasters, in case they see anything that mentions "performance" in the logs for the last half of Saturday.

Created attachment 221134 [details]
some sample summary data
This data shows "system time" only, for 6 runs, 111 scenarios (all from "platform").
Some of them show nearly identical time from run to run, as we expected (hoped).
Some of them, though, seem wildly different.
From a quick peek, these three show wide variation:
org.eclipse.core.tests.resources.perf.BenchWorkspace#testCountResources
org.eclipse.core.tests.resources.perf.BuilderPerformanceTest#testManualBuildWithAutobuildOn()
org.eclipse.core.tests.resources.perf.LocalHistoryPerformanceTest#testClearHistory20x20()
I presume those are disk I/O intensive, from their names.
Note that for some scenarios, such as
org.eclipse.core.tests.runtime.perf.ContentTypePerformanceTest#testLoadCatalog()
the data is very weird, indicating some error in my summarizing code ... perhaps not accounting for error conditions, or something?
In other words, this was some 'quick and dirty' code so others could more easily look at the general pattern and see if any "next steps" seemed obvious.
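To make the 'quick and dirty' summarizing concrete, here is a minimal sketch of that kind of extraction. It assumes the per-suite results .txt files from several runs have been downloaded into local run-NN/ directories and that a scenario's name appears on the line(s) carrying its timing; the real layout and format of the files on Hudson may differ.

```sh
# Pull the lines for one scenario out of each downloaded run, so the
# run-to-run variation can be eyeballed side by side.
for run in run-*/; do
  echo "== $run"
  grep -h "testCountResources" \
    "$run"org.eclipse.core.tests.resources.perf.AllTests.txt
done
```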
I know little about these tests, what is most important to look at, etc., so feel free to say if I'm seeing it all wrong.
[You are welcome to examine all the data ... it's all there on Hudson in "build artifacts".]
I guess we already know we want the VM to be increased from 1 GB to 4 GB ... that might help? [Has that happened yet? Is that possible?]
Also, webmasters [keeping in mind, I know just enough to be dangerous], I've heard you say you've "pegged this VM to 1 CPU". I wonder, if there is a certain amount of "Hudson/slave work" going on in addition to our performance tests, whether allocating two CPUs would make the performance tests more "stable" or predictable from run to run? Perhaps Hudson and the performance tests are nearly time slicing on one CPU? I know this gets complicated with hyper-threading and all ... so maybe there's no big difference between 1 and 2 CPUs ... but I thought I'd mention it might be worth trying (if in fact it is even "just one"; I may have misheard).
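On the CPU question, a quick sketch of commands that would at least confirm how many CPUs the virtualized slave itself reports (standard Linux tools, run on the performance slave):

```sh
# Number of processing units the OS sees
nproc

# Same information counted from /proc
grep -c ^processor /proc/cpuinfo
```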
I have seen enough here to call this experiment a success. Not all of our tests had perfectly stable results in the past either; some are micro-benchmarks or not very realistic scenarios. Sometimes a test is just written to confirm a bug fix, but then time isn't taken to make sure it is stable on all platforms. The three I summarized in the Google doc show pretty stable results over half a dozen runs.

Excellent! Thanks for taking a look and applying your expertise, John. I'll count this as "fixed", though I will keep the job running a while. I changed the configuration so it should run all performance tests now, so we can get an idea how long that takes ... 12 hours? 24 hours? I'll set the schedule so it runs only once per day, at 2 PM.

This is great news!
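For reference, the Hudson "Build periodically" trigger takes cron-style entries, so a once-a-day 2 PM schedule would look roughly like the line below (minute, hour, day of month, month, day of week, in the server's local time):

```
0 14 * * *
```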