| Summary: | run performance tests on eclipse.org hardware - with windows job | ||
|---|---|---|---|
| Product: | [Eclipse Project] Platform | Reporter: | David Williams <david_williams> |
| Component: | Releng | Assignee: | Platform-Releng-Inbox <platform-releng-inbox> |
| Status: | CLOSED DUPLICATE | QA Contact: | |
| Severity: | normal | ||
| Priority: | P3 | CC: | akurtakov, daniel_megert, david_williams, deepakazad, john.arthorne, kim.moir, Lars.Vogel, mike.milinkovich, overholt, pwebster, satyam.kandula, sbouchet, wayne.beaton |
| Version: | 4.2 | ||
| Target Milestone: | --- | ||
| Hardware: | PC | ||
| OS: | Windows 7 | ||
| Whiteboard: | |||
| Bug Depends on: | 389369, 389371, 389834 | ||
| Bug Blocks: | 362718, 374441, 454921 | ||
| Attachments: | |||
|
Description
David Williams
I've finally got a "windows" version of performance tests working, https://hudson.eclipse.org/hudson/view/Eclipse%20and%20Equinox/job/ep4-perf-win32/ had to shorten the name from the planned eclipse-sdk-perf-test-windows to the shorter ep4-perf-win32 since the longer name was enough for some "file path too long errors" to occur :( But, I am ready for a "dedicated" windows virtual machine to be setup now. Either current windows one tweaked for performance tests or the new hudson instance (bug 389834) ... which ever is faster (since we are still just interested in getting a series of test runs to sanity check basic procedure's stability. Hudson was restarted, more memory was allocated to the perf1 slave, and dedicated CPUs were added to windows slave. With one single executor you should be all set. Thanks Denis, I've scheduled our performance test to run at 5 PM 1 AM and 9 AM (8 hour windows). I think they should take about 6 hours ... allow a little time for a few other jobs to run in between. But I have disabled our unit tests for windows, for now, so they won't interfere with the timing (they too take 6 or 8 hours to run). I think in 3 or 5 days we should get enough data to show "stability". I should make a note: the job number of the first valid test build is #26, just so I don't forget. The first job took 13 hours, not "6 or so". But from what I can tell 3 tests "timed out" (which sort of adds 4 or 6 hours to what it might otherwise be (since test is given 2 hours before being killed). I'm judging these three from the screen captures taken (file names below). There'd be stack dumps in logs. But given this is meant to be a "sanity check", of test running, I may disable these three if it happens again. org.eclipse.core.tests.resources.perf.AllTests_screen0.png org.eclipse.core.tests.resources.perf.AllTests_screen5.png org.eclipse.jface.tests.performance.JFacePerformanceSuite_screen0.png org.eclipse.jface.tests.performance.JFacePerformanceSuite_screen5.png org.eclipse.ui.tests.performance.UIPerformanceTestSuite_screen0.png org.eclipse.ui.tests.performance.UIPerformanceTestSuite_screen5.png Do we have any idea why they timed out? Where can I find those screen shots? (In reply to comment #6) > Do we have any idea why they timed out? No, not without study. It does not seem to be "waiting for prompt" type of issue. More likely true hang or .... just takes over two hours! (In reply to comment #7) > Where can I find those screen shots? Easiest, from "build 26", you can select the "build artifacts", download that "archive.zip", unzip, and ... they are in there somewhere. :) https://hudson.eclipse.org/hudson/view/Eclipse%20and%20Equinox/job/ep4-perf-win32/26/ Ok, I did have that window still open, they are specifically in .../archive/workarea/I20120920-1300/eclipse-testing/results/win32.win32.x86_7.0/timeoutScreens/ (In reply to comment #7) > Where can I find those screen shots? https://hudson.eclipse.org/hudson/view/Eclipse%20and%20Equinox/job/ep4-perf-win32/26/artifact/workarea/I20120920-1300/eclipse-testing/results/win32.win32.x86_7.0/timeoutScreens/ Well the core.resources one is still a mystery. This suite takes about 6 minutes on the linux perf slave, but is still running after 2 hours on the windows slave. There is no UI whatsoever in this test so the screen shot is understandably not very interesting. The stack dump shows it is still actively running, deleting files. This test is very I/O intensive, so the only thing that comes to mind would be extremely slow disk access. 
> only thing that comes to mind would be extremely slow disk access.
Write cache was disabled on the RAID controllers on both hosts. I've enabled it. That will make a difference.
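(Aside: for readers unfamiliar with the timeout mechanism being discussed, here is a minimal sketch of a two-hour watchdog of the kind described above, which captures the screen before killing a hung suite. All file and command names here are hypothetical placeholders, not the actual releng scripts.)

```java
import java.awt.Rectangle;
import java.awt.Robot;
import java.awt.Toolkit;
import java.awt.image.BufferedImage;
import java.io.File;
import java.util.concurrent.TimeUnit;
import javax.imageio.ImageIO;

public class SuiteWatchdog {
    public static void main(String[] args) throws Exception {
        // Launch the suite as a child process; the jar name is hypothetical.
        Process suite = new ProcessBuilder("java", "-jar", "run-suite.jar")
                .inheritIO()
                .start();
        // Give the suite the same 2-hour budget mentioned in this bug.
        if (!suite.waitFor(2, TimeUnit.HOURS)) {
            // Capture the screen before killing, producing the kind of
            // *_screen0.png evidence referenced in the comments above.
            Rectangle bounds = new Rectangle(Toolkit.getDefaultToolkit().getScreenSize());
            BufferedImage shot = new Robot().createScreenCapture(bounds);
            ImageIO.write(shot, "png", new File("suite_screen0.png"));
            suite.destroyForcibly();
        }
    }
}
```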
We need to cut down the frequency; it is completely dominating the windows slave. Orion has not been able to run our windows tests in several days, for example. Hopefully it will speed up with Denis' changes, but we need to give other projects a chance to run tests. Maybe we could even run a smaller subset until we get a dedicated slave?

(In reply to comment #12)
> We need to cut down the frequency, it is completely dominating the windows
> slave. Orion has not been able to run our windows tests in several days for
> example. Hopefully it will speed up with Denis' changes but we need to give
> other projects a chance to run tests. Maybe we could even run a smaller
> subset until we get a dedicated slave?

Yeah, the next job that completed (#28) took 14 hours, and 4 tests timed out after 2 hours. I'll try removing those 5 and see if the rest get us back to the 6-hour-ish time frame. We'll have to deal with them at some point (that is, investigate in detail) but ... guess not now.

org.eclipse.core.tests.resources.perf.AllTests_screen0.png
org.eclipse.core.tests.resources.perf.AllTests_screen5.png
org.eclipse.jface.tests.performance.JFacePerformanceSuite_screen0.png
org.eclipse.jface.tests.performance.JFacePerformanceSuite_screen5.png
org.eclipse.swt.tests.junit.performance.PerformanceTests_screen0.png
org.eclipse.swt.tests.junit.performance.PerformanceTests_screen5.png
org.eclipse.ui.tests.performance.UIPerformanceTestSuite_screen0.png
org.eclipse.ui.tests.performance.UIPerformanceTestSuite_screen5.png

Please see bug 390441.

If readers like the blow-by-blow: I removed the 4 problematic tests that apparently hang. The next build completed in 1 hour! BUT it only contained one test ... antui ... which is the one I use to "make sure things work". I double checked, and the setup still says to use "platform". So I "touched" the configuration, in case something was contaminated or out of sync, and started another job.

I should note, I ran the same test on my home machine; it ran in 50 minutes (brag, brag) AND it contained 11 test results, which is about what I'd expect.

So ... we'll see if the next build is any better? If not, I'll study logs more.

> I should note, I ran the same test on my home machine; it ran in 50 minutes
> (brag, brag) AND it contained 11 test results, which is about what I'd
> expect.

I've noted that you'd like to become a Hardware Sponsor for the Eclipse Foundation. =)

>
> So ... we'll see if the next build is any better? If not, I'll study logs more.
Note to self: even though the -Xmx500m specified for the Java VM that starts the ant runner seems excessively large (since all it does is launch antrunner, which in turn launches other instances of Java with the VM parameters needed for the tests), it is not excessive, because there is an XSLT transform to turn the XML files into pretty HTML, which is a notorious memory hog. (In fact, on my home machine, I'd set it to -Xmx1000m, which I'm sure explains why it's faster :) I'll try restoring to -Xmx500m first ... but heck ... since it is "mx", it shouldn't hurt to make it 1000.
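(Aside: the transform step being described is essentially a standard JAXP XSLT invocation; a minimal sketch, with hypothetical file names rather than the actual releng files, looks like this.)

```java
import java.io.File;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class ResultsToHtml {
    public static void main(String[] args) throws Exception {
        // Stylesheet and result file names are placeholders, not the actual
        // files used by the releng scripts.
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new File("results.xsl")));
        // The default JAXP processor builds an in-memory representation of
        // the documents, which is why a generous -Xmx matters for large
        // test-result files.
        t.transform(new StreamSource(new File("test-results.xml")),
                    new StreamResult(new File("test-results.html")));
    }
}
```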
In short, even though the build was ending with "success", there was actually a "BUILD FAILED" when it tried to transform the first xml file from the first test.
> In short, even though the build was ending with "success", there was
> actually a "BUILD FAILED" when it tried to transform the first xml file from
> the first test.

The "transform" failure was more than a memory issue; it was due to a change in VM version. I've opened bug 390494 to cover that issue.

Ahh, windows. Job 50 is the first "correct" run after moving to Java 7u7 (fixing the XSLT issue to get there), and I hopefully added back the "resource" test. But it (still) hits the 2 hour time limit.
https://hudson.eclipse.org/hudson/view/Eclipse%20and%20Equinox/job/ep4-perf-win32/50/artifact/workarea/I20120920-1300/eclipse-testing/results/win32.win32.x86_7.0/timeoutScreens/
Something must be amiss?

Here's a Java 101 question ... should I be using "java" or "javaw" to launch these tests? When I inherited this code, the use seemed to be "mixed" (java on two platforms, javaw on another, I think; I forget which was what). I assumed it was an odd "historical inconsistency" and changed all to "java". Could that account for any of this?

From a quick look at the log, it almost seems like this test might be running normally ... and then just sitting there "waiting" to be shut down? (even though there's no UI)

I will note, though, on my home machine this test completes fine (standalone, though running hudson as master, no slaves, so it runs the tests in a very similar way).

My home tests finished in a little over 2 hours (total), but it's not a one to one comparison. I can't run the CVS tests "outside" the build.eclipse.org infrastructure, apparently. [I assume it was set up this way on purpose ... so as to not introduce too much network time variability ... and anyone who wanted to replicate the cvs performance testing on their own system could still get the cvs "test repo" and make their own "local" version of it?]

(In reply to comment #19)
> Ahh, windows.
[snip]
> Something must be amiss?

FWIW, when I happened to be on the console yesterday, there was one of those "Java (or javaw, can't remember) is requesting permission to do X" prompts.

(In reply to comment #21)
> I will note, though, on my home machine this test completes fine
> (standalone, though running hudson as master, no slaves, so it runs the
> tests in a very similar way).
>
> My home tests finished in a little over 2 hours (total), but it's not a one
> to one comparison.

So why don't we set up a one to one comparison? Let's get bug 389834 out of the way.

(In reply to comment #22)
> FWIW, when I happened to be on the console yesterday, there was one of those
> "Java (or javaw, can't remember) is requesting permission to do X"

A firewall prompt, or a Windows "user doesn't have the right to do this" prompt? I didn't see anything in the screen dumps captured by our tests, but this could be a source of problems.

javaw is used to suppress the Java console. For tests I don't think it matters, because we consume the console output via a file. However, it seems like one of those "if it ain't broke, don't change it" kind of things.

(In reply to comment #22)
> (In reply to comment #19)
> > Ahh, windows.
> [snip]
> > Something must be amiss?
>
> FWIW, when I happened to be on the console yesterday, there was one of those
> "Java (or javaw, can't remember) is requesting permission to do X"

I'm assuming you said "yes, allow java to do whatever it wants"? :) This makes sense, since we had just moved to the new version "7u7"; new "permissions" would have to be "approved" for that executable. But ... job 51 had the same "hang" for resource tests.
https://hudson.eclipse.org/hudson/job/ep4-perf-win32/51/

As has job 52 (though 52 is still running, finishing other tests).
https://hudson.eclipse.org/hudson/job/ep4-perf-win32/ws/workarea/I20120920-1300/eclipse-testing/results/win32.win32.x86_7.0/timeoutScreens/

I left the test in because "resource tests" sounds pretty important for measuring "disk access" type stability, and I still think the tests themselves might be "running right", but just not "ending right". Advice welcome if someone can tell otherwise and would advise me to remove it. [Meanwhile, I've had 6 runs on my home personal machine (minus CVS tests), and it has never hung.]

(In reply to comment #23)
> (In reply to comment #21)
> > I will note, though, on my home machine this test completes fine
> > (standalone, though running hudson as master, no slaves, so it runs the
> > tests in a very similar way).
> >
> > My home tests finished in a little over 2 hours (total), but it's not a one
> > to one comparison.
>
> So why don't we set up a one to one comparison? Let's get bug 389834 out of
> the way.

Not sure what you mean?
a. I should put the cvs test repo on my home machine to make the tests one to one?
b. Remove cvs tests from the current tests to make them one to one?
c. Or you implement bug 389834 and then you'd know?

I think CVS tests are pretty low on the priority list, if on it at all. They weren't very reliable tests at the best of times, and there is very little change going into the CVS component any more.
I am done running "tests of the tests" on the Hudson windows slave, having 11 good runs, and tried to return it to have multiple executors but then it "went down" ... probably related to changing its config ... and that's one I can not restart, so sent note to Webmasters. Data to follow. recall the reason we were doing this test was to run the same build, over and over, and make sure we got consistent, if not identical results from run to run. Basically did 11 runs (not all tests finished normally, for each run, apparently ... I've not looked at why not). I used Kepler M2 as the build to run. We ended up with 7 test suites (from "platform" component) which has about 60 tests all together. [I removed about 4 or 6 suites that either had hanging problems, or simply took a long time]. I left in the "resource tests" even though it seemed to "hang", because the tests that did run seemed to run normally, and (for some reason) seemed to hang while ending, or something (and "resource tests", I think, one of the most mostly likely places of of undue variation since that's the only truly shared resource ... in theory. Others will need to help me interpret "the variation" I thought the tests would all be much more consistent from run to run, but ... apparently not. I am not sure what's a "good" standard deviation in this case. Plus, I found, to me, a more disturbing result. I thought, well, wouldn't it be good to try this on another machine too? Have a machine for comparison? So I used my own home windows machine, but it was set up very similar ... used same VM, same M2 build, same vm settings, etc., but ... a) it was a Hudson "master", instead of slave, and b) it is fairly powerful and doesn't do much else ... an Intel i7 (2nd gen) and some souped up video display cared (but, no SDD drives ... a mere 5.9 on the "windows experience scale" :) BUT, the disturbing part, for most tests, the Elapsed Times on my machine were 10 times faster! An order of magnitude! I am pretty sure my little i7 wouldn't cause that. So ... (as Denis has hypothesized too) ... there's some question if a "slave" can be used as a "performance test machine" ... this is mere hypothesis ... we've not done a good test of that ... but ... a 10 fold difference. I have no idea what would cause that. So, maybe others (especially those familiar with old runs and data) can study the data and know what it means. I don't. I'll attach two forms of data. One, a "summary" of the 11 runs, summarizing Elapsed times only, for both machines. Then I will also attach zips of the complete "results" directories, if anyone wants to dive deeper. Created attachment 221934 [details]
Elapsed Time summaries for build.eclipse.org hudson windows slave
Created attachment 221935 [details]
Elapsed Time summaries for davidw hudson windows master
Created attachment 221936 [details]
results dirs from 11 runs on build.eclipse.org hudson windows7tests slave
Created attachment 221937 [details]
results dirs from 11 runs on davidw hudson master
So there you are ... lots of data.
I should emphasize, I am not "coming to a conclusion" here. First, I could easily be looking at something wrong. Second, I am not that familiar with what to expect. Third ... I am hoping some experts reading this can understand and explain what's going on :)
> BUT, the disturbing part, for most tests, the Elapsed Times on my machine
> were 10 times faster! An order of magnitude!
Running Windows on Xen requires full virtualization, which is known to be slower than our paravirtualized Linux counterparts (which operate at near bare-iron speed).
If we agree that the actual time taken to complete a task is not what's important, but rather the consistent trend, then a fully virtualized Windows may not be so bad.
The other alternative is that we set up a dedicated Windows server for performance tests. It would pain me to dedicate hardware to a single Hudson executor, though.
We should check our hosts to make sure hardware assist for virtualization is enabled. It should be, but it wouldn't hurt to check.
> We should check our hosts

I've opened bug 391246.

(In reply to comment #37)
> > BUT, the disturbing part, for most tests, the Elapsed Times on my machine
> > were 10 times faster! An order of magnitude!
>
> Running Windows on Xen requires full virtualization, which is known to be
> slower than our paravirtualized Linux counterparts (which operate at near
> bare-iron speed).
>
> If we agree that the actual time taken to complete a task is not what's
> important, but rather the consistent trend, then a fully virtualized Windows
> may not be so bad.
>
> The other alternative is that we set up a dedicated Windows server for
> performance tests. It would pain me to dedicate hardware to a single Hudson
> executor, though.
>
> We should check our hosts to make sure hardware assist for virtualization is
> enabled. It should be, but it wouldn't hurt to check.

I agree, it is "differences" we are looking for, from one run to another. But a 10 fold slow down makes me concerned there might be "more variation" due to the virtualization layer. Just my intuition; no idea if theoretically valid or not. But the standard deviations across the 11 runs were higher on build.eclipse.org than on my machine. For one example, such as for "ant open editor":

Average (Mean): 4934.5 (ms) Standard Deviation: 152.3
vs.
Average (Mean): 448.5 (ms) Standard Deviation: 21.0

If the virtualization layer was absolutely constant, statistically you'd expect the exact same standard deviation, even if overall times were longer ... so I'm guessing some of that increased variation is from the virtualization layer? Even if true, the higher standard deviation (more variability) isn't necessarily a deal breaker ... but it would make it harder to spot (small) regressions, since the normal variability would be larger. It would probably suffice to find the largest regressions, though.

To belabor the point, with an overly simple example: if our code got 100 ms slower in opening the editor, in the "davidw" case we'd know that was likely a regression and probably worth investigating, since it is greater than 3 standard deviations. But on the build machine runs, 100 ms is well within the "noise level", so we would not really notice it.

I did check one of our old performance test runs, and the "open ant editor" times were in line with my machine's times ... which reassures me I wasn't making some stupid mistake or something.
http://download.eclipse.org/eclipse/downloads/drops/R-3.7.2-201202080800/performance/eplnx1/Scenario483.html

So, I'm still not promoting a conclusion here ... I think it needs others' comments, to gain some consensus. [And, maybe you'll find something with hardware/bios settings.]

(In reply to comment #39)
> But a 10 fold slow down makes me concerned there might be "more variation"
[snip]
> Average (Mean): 4934.5 (ms) Standard Deviation: 152.3
> vs.
> Average (Mean): 448.5 (ms) Standard Deviation: 21.0
>
> If the virtualization layer was absolutely constant, statistically you'd
> expect the exact same standard deviation, even if overall times were longer
> ... so I'm guessing some of that increased variation is from the
> virtualization layer?

Your numbers are actually showing _less_ variation on the virtualized environment. Tests take 10x longer to run, but std. dev. is less than 10x higher.

21ms / 448.5ms = 4.68% variation from avg
152.3ms / 4934.5ms = 3.08% variation from avg

In fact, with this slower environment, not only is the dispersion closer, it will also amplify performance changes.
Catching a 20ms performance regression (or progression) may be difficult on an i7 processor, but that becomes a 200ms change on our virtualized environment. In my totally unqualified opinion, I think we're onto something. If we can speed up the execution so that it's not 10x slower than your home box, that would be a bonus, so that it doesn't take 6 hours to run the tests.

Created attachment 221970 [details]
Elapsed Time summaries for previously posted linux runs
I had attached the summary data for our Linux version of this "tests of the tests" to one of the associated bugs. Since then I've improved my small extraction utility a little, so I thought I'd re-run the Linux data through that, and post it here for comparison.
Not sure how helpful it is, since it seems nearly "in the middle" of the two windows data sets.
For example, for the "open ant editor" case:
Average (Mean): 1436.7 (ms) Standard Deviation: 78.9 Number of Runs: 6
But, here it is, for comparison.
> But, here it is, for comparison.
Comparison to what?
(In reply to comment #40)
> (In reply to comment #39)
> > But a 10 fold slow down makes me concerned there might be "more variation"
>
> [snip]
>
> > Average (Mean): 4934.5 (ms) Standard Deviation: 152.3
> > vs.
> > Average (Mean): 448.5 (ms) Standard Deviation: 21.0
> >
> > If the virtualization layer was absolutely constant, statistically you'd
> > expect the exact same standard deviation, even if overall times were longer
> > ... so I'm guessing some of that increased variation is from the
> > virtualization layer?
>
> Your numbers are actually showing _less_ variation on the virtualized
> environment. Tests take 10x longer to run, but std. dev. is less than 10x
> higher.
>
> 21ms / 448.5ms = 4.68% variation from avg
>
> 152.3ms / 4934.5ms = 3.08% variation from avg

I guess you can prove anything with statistics :) but that's not what I remember from my probability courses. Standard deviations are already "in units of the data", so in this case you'd expect about 70% of a random sample to be within +/- 1 standard deviation of the mean ... that is, the computation that matters is plus/minus ... not division and "percent from the mean".
http://en.wikipedia.org/wiki/Standard_deviation

And you can't assume a "slow virtual layer" would make small regression changes in our code larger and easier to spot. Imagine if our code just called "sleep(1000)" ... you would not expect that to be translated into 10 seconds on the slow virtual system ... still just one second longer.

(In reply to comment #42)
> > But, here it is, for comparison.
>
> Comparison to what?

The other "Elapsed Time summaries". I changed the title to make that clearer. (I don't know what machine is virtualized how and where, but if nothing else it may support your theory that the windows virtualization is significantly slower than the linux virtualization ... if comparable underlying setup.)
>
> And you can't assume a "slow virtual layer" would make make small
> regressions changes in our code larger and easier to spot. Imagine if our
> code just called "sleep(1000)" ... you would not expect then that to be
> translated into 10 seconds on the slow virtual system ... still just one
> second longer.
And, I meant to say ... right?
If that's not the case, then I really don't understand virtualization.
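(Aside: since the two readings of the numbers above keep coming up, here is a minimal sketch of how the two summary statistics are computed, using the per-job home-machine values listed later in this report. The sample (n - 1) standard-deviation convention is an assumption, chosen because it reproduces the quoted 21.0 figure.)

```java
import java.util.Arrays;

public class ElapsedStats {
    static double mean(double[] xs) {
        return Arrays.stream(xs).average().orElse(0);
    }

    // Sample standard deviation (n - 1 divisor).
    static double sampleStdDev(double[] xs) {
        double m = mean(xs);
        double ss = Arrays.stream(xs).map(x -> (x - m) * (x - m)).sum();
        return Math.sqrt(ss / (xs.length - 1));
    }

    public static void main(String[] args) {
        // Per-job "open ant editor" averages from the 11 home-machine runs
        // listed later in this report.
        double[] home = {425, 431, 436, 444, 466, 465, 418, 462, 440, 487, 460};
        double m = mean(home);           // ~448.5 ms
        double sd = sampleStdDev(home);  // ~21.0 ms
        System.out.printf("mean=%.1f ms, stddev=%.1f ms, cv=%.2f%%%n",
                m, sd, 100 * sd / m);
        // The disagreement above is about which number to compare: in
        // absolute terms, the slave's 152.3 ms stddev is ~7x this 21.0 ms;
        // relative to the mean, 152.3/4934.5 = 3.1% is below 21.0/448.5 = 4.7%.
    }
}
```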
One more point to explain, then I'm letting this topic rest. I want to make sure it's clear that the numbers my utility is averaging (and computing standard deviation for) are _already_ based on multiple runs. That's why we'd expect small variation between them. For the "open ant editor" case, the test opens the ant editor 15 times, and then reports the average time over those 15 openings. So, in the "Elapsed Times" report I've attached, in a section such as

org.eclipse.ant.tests.ui.editor.performance.OpenAntEditorTest#testOpenAntEditor1()
Job Number: 189 Elapsed Process: 425 ms
Job Number: 188 Elapsed Process: 431 ms
Job Number: 186 Elapsed Process: 436 ms
Job Number: 185 Elapsed Process: 444 ms
Job Number: 184 Elapsed Process: 466 ms
Job Number: 183 Elapsed Process: 465 ms
Job Number: 182 Elapsed Process: 418 ms
Job Number: 180 Elapsed Process: 462 ms
Job Number: 179 Elapsed Process: 440 ms
Job Number: 178 Elapsed Process: 487 ms
Job Number: 177 Elapsed Process: 460 ms
Average (Mean): 448.5 (ms) Standard Deviation: 21.0 Number of Runs: 11

for job 189, there are actually 15 numbers going into that average of 425. For job 188, 15 openings which average to 431. Etc. I just wanted to be clear that's why we'd expect small variation between runs of the exact same driver. This doesn't help with any of the other issues/questions raised ... I just thought it was important to explain that as part of the reason we'd expect small differences between runs. Of course, if we opened the editor 100 times, or 1000 times, maybe then there would be less noise in the data ... but ... I think the "number of runs" numbers were arrived at in the past as the right trade-off between total time to run the tests and steady statistics. In theory, that might be another option (to increase the number of times each test is run), but there's no global setting, so it would take a lot of work, changing each test, I'm guessing.

I won't comment on anyone's math, but if, for example, I/O is 10x slower on hudson than what a typical user will experience, then it does make the test less helpful. For example, it could lead to too much focus on optimizing I/O at the expense of greater CPU or heap, resulting in slower performance for the end user even when the hudson result is reporting an improvement. Also, from a purely pragmatic point of view, tests taking longer results in longer turnaround time on verifying fixes, finding regressions, promoting builds, etc. So if there is any way we can close that gap and make the hudson tests run faster, it is worth doing.

Here are a couple more data points. 1) I switched back to using the windows7tests box for our normal unit tests. They now take 13 hours! Roughly twice as long as the tests on the 64 bit Mac and the 64 bit Linux. I know they used to take a long time, before that virtual machine was reconfigured for performance tests, but my memory is that it was more like 8 or 10 or something. So, in other words, something is drastically wrong. IMHO.

Created attachment 222022 [details]
summary stats over 11 runs on "slow" linux box
The second data point (I sure had fun this weekend :) is that I tried running the same tests on an "old and slow" linux box I have lying around [simple dual core, 32 bit, at least 3 or 4 years old]. I was hoping to show that a) even though the overall times were slower, b) the standard deviations would be about the same as on my personal, fast and modern windows machine.
Surprisingly, I found that many (but not all) of the average times were FASTER than on my personal beefed up windows machine AND many (but not all) of the standard deviations were also less!
Here's my conclusion from this little side-test: a) Linux is better than Windows (ha ha, we all know my biases already); b) the exact machine and its exact configuration make a huge difference in how these tests run. (For example, on the Linux box there was no anti-virus running. Could that sort of thing account for the differences? Don't know for sure. Is there anti-virus running on the "virtual machine"? Has "auto updates" been turned off?)
I don't know what the answer is, but I am beginning to think it would be less work to try and find some stand-alone machine that at least started off being representative of what an average user would use. There are enough variables even in that to make valid performance tests difficult.
To get an "older and slower" windows machine, I've have to use my wifes 6 year old laptop. Not even sure she has windows 7 :)
But, bottom line, 13 hours for our normal unit tests is unworkable ... maybe that should be the first problem we solve?
(In reply to comment #49)
> Surprisingly, I found that many (but not all) of the average times were
> FASTER than on my personal beefed up windows machine

Actually, today's CPUs generally have slower clock speeds than CPUs of several years ago. But today's Marketing Machine has placed emphasis on the number of (slower) cores.

> it would be less work

Give us some time to investigate bug 391246 first. Having a fast, reliable virtualized Windows environment is the best course of action in the long run.
>
> > it would be less work
>
> Give us some time to investigate bug 391246 first. Having a fast, reliable
> virtualized Windows environment is the best course of action in the long run.
Ok, I'm assuming this will be complete in the next week or two? If so, I can reduce our windows tests until then, since as things are, our windows unit tests time out after 15 or 16 hours (my self-imposed Hudson time limit).
> Ok, I'm assuming this will be complete in the next week or two?
Well, I can't usually predict how fast I can resolve an issue... If this is so pressing, I can probably truck an old Windows XP box and toss it in the cabinet in the meanwhile. But that can't be a permanent solution.
(In reply to comment #52)
> > Ok, I'm assuming this will be complete in the next week or two?
>
> Well, I can't usually predict how fast I can resolve an issue... If this is
> so pressing, I can probably truck an old Windows XP box and toss it in the
> cabinet in the meanwhile. But that can't be a permanent solution.

It's "pressing" now for our normal unit tests (not the eventual performance tests). I believe when we started experimenting with our performance tests, you made some changes (increased memory allocated, increased number of processors), and I'm thinking maybe that's when it was first hosted on the Xen processor?

In any case, for now we are done with the performance "test of the tests" ... we stopped doing our unit tests during that time, for a few weeks, knowing we would not have the time/ability to do both performance and unit tests ... and now that we have gone back to running our normal JUnit tests, we have discovered they are taking twice as long as they used to (and 2 or 3 times longer than on the mac or linux slave). Roughly.

In a perfect week, we run 10 such tests. Often more (15 or so) if "rebuilds" are required. So we can not run our full suite as is on the Windows slave; there are not enough hours in the day. I'll ask around, and see if others think that's something we can live with, or if it is worth asking you to dust off an old XP box. [I'm sure we could live with it for a week or two ... but longer than that, and I'm pretty sure we'd need somewhere else to run Windows unit tests.]

> discovered they are taking twice as long as they used to (and 2 or 3 times
> longer than on the mac or linux slave). Roughly.
Ah, that is interesting to know. So the machine was faster until I gave it dedicated processors.
I'll revert that change. One thing we may need to test (and Matt, this is FYI) is disabling hyperthreading on the vserver hosts. I believe it does more harm than good.
I've given the Windows slave access to more physical CPU cores.

(In reply to comment #55)
> I've given the Windows slave access to more physical CPU cores.

The Unit tests on windows have returned to 7 or 8 hours instead of 13 to 16. So, thanks for that. Given that huge difference, I believe the performance "tests of the tests" I spent two weeks on was not a valid test of anything. But I'm giving that a rest for a while. It seems to be a moving (black box) target to me.

David, do your Windows performance tests need a VNC client to be watching the screen? I ask because I've got some new hardware here in my office, and I made a startling discovery: while installing Windows on a virtual server, with an active VNC window/connection, the install simply took forever. The DVD drive light would just casually "blink" to show activity. Just closing the VNC connection made the install process much faster, and the DVD drive light seems to indicate much more activity. My educated guess is that, with an active VNC session, the emulator probably spends a considerable amount of time on graphics emulation. Since the current windows slave has an active VNC session for your UI tests to succeed, I'm wondering what the performance impact is.

Created attachment 222202 [details]
vmstat 2 installing windows
Here's a screenshot of the vmstat command running on the host while installing Windows 7 64-bit from DVD media.
From the top, before the first red line, you can see a steady 1112 KB/sec read from a block device (likely the DVD) and a couple of write bursts as the buffers are flushed. During this time, CPU is 99-100% idle.
At the first red line, I've opened a VNC connection. DVD I/O falls to 278-534 KB/sec and output to disk is considerably slower. CPU idle time falls as well.
At the second red line, I have simply closed my VNC connection and performance is restored.
(In reply to comment #57)
> David, do your Windows performance tests need a VNC client to be watching
> the screen?
>
> ...
>
> Since the current windows slave has an active VNC session for your UI tests
> to succeed, I'm wondering what the performance impact is.

Virtual on top of virtual effects, eh? It's probably not _required_ to have that pretend user watching the screen, if that's what you mean. We need "UI" of course, but most of the UI performance tests would probably not be affected. There are also LOTS of VNC settings (usually) that affect how VNC does the emulation (polling vs events only) and "how much" it tries to be high fidelity (low fidelity is fine even for our unit tests, I'm 99% sure). It would still be nice to have for performance tests, since in some failure conditions it allows us to capture more information than we'd otherwise get. But ... we could try it both ways (eventually).

I hate to mix issues here, but since you added the processors to the Windows process, Linux is now our slowest unit test machine (8 or so hours) and the (new) 64 bit Mac and the (more processors) Windows are pretty snappy (5 or 6 hours). This is just based on a few runs, but it makes me wonder if all "virtual machines" need more processors, even the Linux ones?

> I hate to mix issues here, but since you added the processors to the Windows
> process, Linux is now our slowest unit test machine
Which slave do you run the linux tests on?
(In reply to comment #60)
> > I hate to mix issues here, but since you added the processors to the Windows
> > process, Linux is now our slowest unit test machine
>
> Which slave do you run the linux tests on?

6 (or, occasionally, 1).

Slave6 is an old server that we repurposed into a slave. Not the fastest.

Our fastest slave is slave2. It is about 75% faster than slave6.

I'll compile some server performance data and toss it on a wiki page so that projects can better select which slaves to use.

(In reply to comment #62)
> Slave6 is an old server that we repurposed into a slave. Not the fastest.
>
> Our fastest slave is slave2. It is about 75% faster than slave6.
>
> I'll compile some server performance data and toss it on a wiki page so that
> projects can better select which slaves to use.

I'm sure we've been over this before, but I tried putting in "build2" as the "build only on" criteria, and the job started on "Hudson-slave1" ... looking closer, it seems that both "Hudson-slave1" and "Hudson-slave2" have a "label" of "build2". As does "hudson-slave4", now that I look closer. So, above, when you say "slave2", I assume you mean "Hudson-slave2" and not "build2". [Previous to looking closely, I thought "Hudson-slave" was just the Hudson label for "build", but that the numerals still corresponded.] So, I'm just trying to decide if you mean "any of the slaves on the build2 node are 75% faster" or "Hudson-slave2" specifically. 75% of 8 hours is 6 hours ... I'm starting to get excited! Thanks.

> So, above, when you say "slave2", I assume you mean "Hudson-slave2"
Yes
> Surprisingly, I found that many (but not all) of the average times were
> FASTER than on my personal beefed up windows machine AND many (but not all)
> of the standard deviations were also less!

I've compiled some really basic metrics to compare the CPU performance of our servers:
http://wiki.eclipse.org/Hudson_server_performance_metrics

FWIW, my 1 year old laptop is faster than our fastest servers. However, server hardware is optimized for parallel operations, I/O throughput, and stability. With server-class hardware, increasing the CPU clock speed gets costly very quickly. On a related note, Linux vservers perform identically to their bare-iron equivalents, so virtualization is not adding any overhead in that area.

> server hardware is optimized for parallel operations,
David, I'm guessing there's no easy way to break up a 4-hour job into multiple jobs that can all run at once? Our newer servers have 12 & 16 cpu cores... It would be nice to use them all instead of having a single thread run forever.
(In reply to comment #66)
> > server hardware is optimized for parallel operations,
>
> David, I'm guessing there's no easy way to break up a 4-hour job into
> multiple jobs that can all run at once? Our newer servers have 12 & 16 cpu
> cores... It would be nice to use them all instead of having a single thread
> run forever.

I've been investigating that for our regular unit tests, and while not "easy", I think it might be feasible. The same would apply to the performance tests. I'll keep you posted.

David, I've copied your windows performance job (as denis_wt) and I'm running it on a local slave (denistest). I'm trying to understand the purpose of this:
platformIndependentZips:
[sleep] sleeping for 180000 milliseconds
(In reply to comment #68)
> David, I've copied your windows performance job (as denis_wt) and I'm
> running it on a local slave (denistest). I'm trying to understand the
> purpose of this:
>
> platformIndependentZips:
> [sleep] sleeping for 180000 milliseconds

I'll tell you the purpose, though it could obviously be improved (I hate it too, when doing my own local tests). And keep in mind, this is "shared code" used for both performance tests and unit tests.

Purpose: Normally, when we build, we build from cronjob under e4Build ID. Once done, and everything is packaged up on the build machine, we 1) start the Hudson test jobs (under hudsonbuild id of course) and 2) set some flags/files to signify a committer must move something from builds to downloads. So (as do others) I have a cronjob that runs every 15 minutes to see if there is anything to move from builds to downloads.

Hence, there is a period during normal build/test cycle (from 5 to 20 minutes) where the tests have been told to start, but there is nothing yet on downloads for them to fetch. So, we wait, and loop, until there is.

Clear as mud?

(In reply to comment #69)
> (In reply to comment #68)
> > David, I've copied your windows performance job (as denis_wt) and I'm
> > running it on a local slave (denistest). I'm trying to understand the
> > purpose of this:
> >
> > platformIndependentZips:
> > [sleep] sleeping for 180000 milliseconds
>
> ....
>
> Clear as mud?

Probably what you would really like to know is: once you have run the tests once, and have everything already in Hudson's job workspace, assuming you just want to tweak the virtual machine, vnc, etc. (not change tests), you can 1) turn off the "clean workspace" check mark under "advanced options", and 2) add the java parameter -DskipInstall=true to the last command in the last Hudson build step, and it skips all that install stuff and just jumps right into the tests (pretty much), assuming everything is already in place.

HTH

(In reply to comment #70)
> Probably what you would really like to know is: once you have run the tests
> once, and have everything already in Hudson's job workspace, assuming you
> just want to tweak the virtual machine, vnc, etc. (not change tests), you
> can 1) turn off the "clean workspace" check mark under "advanced options",
> and 2) add the java parameter -DskipInstall=true to the last command in the
> last Hudson build step, and it skips all that install stuff and just jumps
> right into the tests (pretty much), assuming everything is already in place.
>
> HTH

Meant to give specifics, to be less ambiguous:

c:\java\jdk1.7.0_07\jre\bin\java.exe -Xmx500m -jar %WORKSPACE%/org.eclipse.releng.basebuilder/plugins/org.eclipse.equinox.launcher.jar -DbuildId=%buildId% -DeclipseStream=%eclipseStream% -Dargs=platform -Dosgi.os=win32 -Dosgi.ws=win32 -Dosgi.arch=x86 -application org.eclipse.ant.core.antRunner -v -f %WORKSPACE%/org.eclipse.releng.eclipsebuilder/runTests2.xml

would become

c:\java\jdk1.7.0_07\jre\bin\java.exe -Xmx500m -jar %WORKSPACE%/org.eclipse.releng.basebuilder/plugins/org.eclipse.equinox.launcher.jar -DbuildId=%buildId% -DeclipseStream=%eclipseStream% -Dargs=platform -Dosgi.os=win32 -Dosgi.ws=win32 -Dosgi.arch=x86 -DskipInstall=true -application org.eclipse.ant.core.antRunner -v -f %WORKSPACE%/org.eclipse.releng.eclipsebuilder/runTests2.xml

But don't forget to uncheck "clean workspace", or ... well, kind of self explanatory.
> Once done, and everything is packaged up on the build machine, we 1) start
> the Hudson tests jobs (under hudsonbuild id of course) and 2) set some
> flags/files to signify a committer must move something from builds to
> downloads. So (as do others) I have a cronjob than runs every 15 minutes to
> see if there is anything to move from builds to downloads.
>
> Hence, there is a period during normal build/test cycle (from 5 to 20
> minutes) where the tests have been told to start, but there is nothing yet
> on downloads for them to fetch. So, we wait, and loop, until there is.
>
> Clear as mud?
Yep, thanks. I was just curious. All this sleeping (there are multiple sleep events) obviously ties up an executor, but you already know that.
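(Aside: the sleeping corresponds to a wait-and-retry loop of roughly this shape; a minimal sketch, with a hypothetical URL, deadline, and check rather than the actual releng values and logic.)

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class WaitForDrop {
    public static void main(String[] args) throws Exception {
        // Hypothetical drop URL; the real jobs look on the downloads server.
        URL drop = new URL("https://download.example.org/eclipse/I20120920-1300/");
        long deadline = System.currentTimeMillis() + 30L * 60 * 1000; // arbitrary 30 min cap
        while (System.currentTimeMillis() < deadline) {
            HttpURLConnection c = (HttpURLConnection) drop.openConnection();
            c.setRequestMethod("HEAD");
            if (c.getResponseCode() == HttpURLConnection.HTTP_OK) {
                System.out.println("Build is on downloads; proceeding with tests.");
                return;
            }
            c.disconnect();
            // Matches the 180000 ms [sleep] seen in the job log above.
            Thread.sleep(180_000);
        }
        throw new IllegalStateException("Timed out waiting for the build to appear");
    }
}
```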
My remote slave here in my office keeps losing its connection to the master, so the tests never complete. But so far, simply opening a VNC connection, even with the lowest settings possible, affects the qemu performance by a large margin. I see noticeably faster log output when I disconnect.
Another thing I'd like to establish is the performance difference when the tests are run on a single standalone master (like you're running at home) versus through a slave instance hanging off a 1Gb network link. In other words, we're no better off if we set up a bare-iron ultra-fast Windows box and it's still slow as molasses because of slave delegation.

(In reply to comment #73)
> ... and it's still slow as molasses because of slave delegation.

Is it obvious, or should I ask ... do we _have_ to run the performance jobs (and all Windows virtual machines) as Hudson slaves? I know there would be some advantages to using Hudson ... but I'm not sure they outweigh the disadvantage of being "slow as molasses". We mostly just need to start a job and collect the data. We don't use the best parts of Hudson (lots of interacting pieces and sources).

I don't think the perf tests need to run as slaves at all. We could theoretically set up a Windows box as a standalone Hudson master if that's what it takes. I am simply trying to pinpoint where all the slowness is coming from. There are so many variables, and I'm narrowing them down.

(In reply to comment #64)
> > So, above, when you say "slave2", I assume you mean "Hudson-slave2"
>
> Yes

I tried using slave2 for a test job ... and it was still running after 11 hours!
https://hudson.eclipse.org/hudson/job/ep4-unit-lin64/281/

I killed it so the next job could start, and we'll see how it goes. [FYI, I could see some VNC "connection" in the log and wanted to be sure it was clear: we only need that "user looking at the console" trick for Windows.] [And even there, if it's causing that much in the way of performance problems, it might be better to live without it ... I'd review with the team first.]

(In reply to comment #76)
> (In reply to comment #64)
> > > So, above, when you say "slave2", I assume you mean "Hudson-slave2"
> >
> > Yes
>
> I tried using slave2 for a test job ... and it was still running after 11
> hours!

The next one was killed after running for 14 hours. It might be faster in some respect ... but not for running our tests! Let me know if/how I could diagnose what's wrong with slave 2 ... I think I'll try another for now.

The next unit test time for linux was a little over 5 hours, on Hudson 1 (on par with the 64 bit mac, and much better than slave 2, and a little better than 6). This was after the Saturday night (Sunday AM) weekly restart.

(In reply to comment #78)
> The next unit test time for linux was a little over 5 hours, on Hudson 1 (on
> par with the 64 bit mac, and much better than slave 2, and a little better
> than 6). This was after the Saturday night (Sunday AM) weekly restart.

To be more specific (or, is it more general?) ... in our last run, Sunday evening, the Linux tests and Mac tests took about 5 hours (5 hours 15 minutes), but the Windows tests (still) took 2 hours longer, about 7 hours. You can see the "last duration" column for the ep4-unit-* tests in
https://hudson.eclipse.org/hudson/view/Eclipse%20and%20Equinox/

(In reply to comment #79)
> To be more specific (or, is it more general?) ... in our last run, Sunday
> evening, the Linux tests and Mac tests took about 5 hours (5 hours 15
> minutes), but the Windows tests (still) took 2 hours longer, about 7 hours.
> You can see the "last duration" column for the ep4-unit-* tests in
> https://hudson.eclipse.org/hudson/view/Eclipse%20and%20Equinox/

For context, on our old dedicated hardware, the Windows tests took over 9 hours.
See for example 3.7.2:
http://download.eclipse.org/eclipse/downloads/drops/R-3.7.2-201202080800/testresults/consolelogs/win32-6.0_consolelog.txt

Also worth keeping in mind is that several components have platform-specific tests that don't run on all platforms. Comparing the number of tests that ran in the most recent build on each platform:

Windows: 80,751 tests
Linux: 70,320 tests
Mac: 68,729 tests

So I wouldn't read too much into comparing the test execution times across the three platforms on Hudson.

Doing a mass "reset to default assignee" of 52 bugs to help make clear it will (very likely) not be me working on things I had previously planned to work on. I hope this will help prevent the bugs from "getting lost" in other people's queries. Feel free to "take" a bug if appropriate.

*** This bug has been marked as a duplicate of bug 548523 ***