| Summary: | cvs timeouts stalling eclipse builds | | |
|---|---|---|---|
| Product: | Community | Reporter: | Kim Moir <kim.moir> |
| Component: | CVS | Assignee: | Eclipse Webmaster <webmaster> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | normal | | |
| Priority: | P3 | CC: | andrew.eisenberg, bokowski, daniel_megert, david_williams, gheorghe, jcayne, john.arthorne, slamers, stephan.herrmann |
| Version: | unspecified | | |
| Target Milestone: | --- | | |
| Hardware: | PC | | |
| OS: | Windows XP | | |
| Whiteboard: | | | |
| Attachments: | log of manual cvs checkout | | |
Description
Kim Moir
That sounds like a plan. -M.

There is a session stalled now:

```
cvs -dkmoir@dev.eclipse.org:/cvsroot/eclipse -q export -r v20100409 org.eclipse.pde-feature/feature.xml
```

I logged into dev.eclipse.org but I can't see the active session. I was on node5 - the connection may be to another node.

Hmmm, I'm not seeing any errors in the logs. As far as I can tell Torolab is connected to nodes 1 and 5 (at this time), but I don't see a session with your name on it on either node. If you use the -t flag, are there any hints in the output? -M.

```
kmoir@node5:~> netstat -t | grep ibm | grep ssh
tcp 0    0 node5.eclipse.org:ssh fw54.torolab.ibm.:28760 ESTABLISHED
tcp 0    0 node5.eclipse.org:ssh fw54.torolab.ibm.:65369 ESTABLISHED
tcp 0    0 node5.eclipse.org:ssh fw54.torolab.ibm.:65369 ESTABLISHED
tcp 0    0 node5.eclipse.org:ssh fw54.torolab.ibm.:39745 ESTABLISHED
tcp 0    0 node5.eclipse.org:ssh fw54.torolab.ibm.:54655 ESTABLISHED
tcp 0    0 node5.eclipse.org:ssh fw54.torolab.ibm.:54655 ESTABLISHED
tcp 0   32 node5.eclipse.org:ssh fw54.torolab.ibm.:21369 ESTABLISHED
tcp 0    0 node5.eclipse.org:ssh fw54.torolab.ibm.:57363 ESTABLISHED
tcp 0    0 node5.eclipse.org:ssh fw54.torolab.ibm.:17180 ESTABLISHED
tcp 0   64 node5.eclipse.org:ssh fw54.torolab.ibm.:22530 ESTABLISHED
tcp 0    0 node5.eclipse.org:ssh fw54.torolab.ibm.:49422 ESTABLISHED
tcp 0 1204 node5.eclipse.org:ssh fw54.torolab.ibm.:64008 ESTABLISHED
tcp 0    0 node5.eclipse.org:ssh fw54.torolab.ibm.:13066 ESTABLISHED
tcp 0    0 node5.eclipse.org:ssh fw54.torolab.ibm.:12342 ESTABLISHED
tcp 0    0 node5.eclipse.org:ssh fw54.torolab.ibm.:44081 ESTABLISHED
tcp 0    0 node5.eclipse.org:ssh fw54.torolab.ibm.:44337 ESTABLISHED
tcp 0    0 node5.eclipse.org:ssh fw54.torolab.ibm.:29757 ESTABLISHED
tcp 0    0 node5.eclipse.org:ssh fw54.torolab.ibm.:18479 ESTABLISHED
tcp 0    0 node5.eclipse.org:ssh fw54.torolab.ibm.:61652 ESTABLISHED
tcp 0    0 node5.eclipse.org:ssh fw54.torolab.ibm.c:8436 ESTABLISHED
tcp 0    0 node5.eclipse.org:ssh fw54.torolab.ibm.:33782 ESTABLISHED
tcp 0    0 node5.eclipse.org:ssh fw54.torolab.ibm.:49124 ESTABLISHED
tcp 0    0 node5.eclipse.org:ssh pat.zurich.ibm.co:65210 ESTABLISHED
tcp 0    0 node5.eclipse.org:ssh pat.zurich.ibm.co:29653 ESTABLISHED
```

I can't tell which connection is mine. Btw, if you want to contact me I'm on gtalk (firstname.lastname@gmail.com) or IRC #eclipse-dev / #equinox-dev. Might be easier that way :-)

Well 'ps -aef | grep oir | grep ssh' was able to find you, even though I couldn't. I've connected to the process with strace but it seems to be 'stuck':

```
node5# strace -p162893 -F -f
Process 162893 attached - interrupt to quit
select(12, [3 6], [], NULL, NULL
```

And it's been sitting there for about ~10 minutes. I've sent it a HUP signal to see if anything falls out. -M.

I have another cvs checkout that is hanging.

```
kmoir 16611 14010 0 12:39 ? 00:00:00 cvs -dkmoir@dev.eclipse.org:/cvsroot/rt export -d org.eclipse.equinox.p2.metadata.repository -rv20100405-2203 org.eclipse.equinox/p2/bundles/org.eclipse.equinox.p2.metadata.repository
kmoir 16614 16611 0 12:39 ? 00:00:00 ssh -l kmoir dev.eclipse.org cvs server
```

```
tcp 0 0 node5.eclipse.org:ssh fw54.torolab.ibm.:48764 ESTABLISHED
tcp 0 0 node5.eclipse.org:ssh fw54.torolab.ibm.:60798 ESTABLISHED
tcp 0 0 node5.eclipse.org:ssh fw54.torolab.ibm.:37229 ESTABLISHED
tcp 0 0 node5.eclipse.org:ssh fw54.torolab.ibm.:13678 ESTABLISHED
tcp 0 0 node5.eclipse.:www-http fw54.torolab.ibm.:33559 TIME_WAIT
tcp 0 0 node5.eclipse.org:ssh fw54.torolab.ibm.:61979 ESTABLISHED
tcp 0 0 node5.eclipse.org:ssh fw54.torolab.ibm.:15619 ESTABLISHED
tcp 0 0 node5.eclipse.org:ssh fw54.torolab.ibm.:39437 ESTABLISHED
tcp 0 0 node5.eclipse.org:ssh fw54.torolab.ibm.:13066 ESTABLISHED
tcp 0 0 node5.eclipse.org:ssh fw54.torolab.ibm.:13066 ESTABLISHED
tcp 0 0 node5.eclipse.org:ssh fw54.torolab.ibm.:12342 ESTABLISHED
tcp 0 0 node5.eclipse.org:ssh fw54.torolab.ibm.:48432 ESTABLISHED
tcp 0 0 node5.eclipse.org:ssh fw54.torolab.ibm.:12342 ESTABLISHED
tcp 0 0 node5.eclipse.org:ssh fw54.torolab.ibm.:48432 ESTABLISHED
tcp 0 0 node5.eclipse.org:ssh fw54.torolab.ibm.:53554 ESTABLISHED
tcp 0 0 node5.eclipse.org:ssh fw54.torolab.ibm.:18479 ESTABLISHED
tcp 0 0 node5.eclipse.org:ssh fw54.torolab.ibm.:26920 ESTABLISHED
```

Can you see any problems?

I'm not seeing anything other than that process waiting for I/O. Here's a couple of questions: If you direct your cvs command at node1 specifically (edit your hosts file), does the same thing happen? If you add the -t flag to your CVS command, what is it doing before it hangs? -M.

I can try editing my hosts file and see what happens. All the cvs commands are actually run by generating ant scripts using pde build, which call the ant cvs task. There currently isn't a way to add a parameter to the generated script to issue the cvs command with -t. If I run the cvs command manually with -t it works fine, but the older process is hung. I'll attach the log. Of course, I can't seem to replicate the hung process this way.

I looked at /var/lock/cvs and there are lock files there for the project, but they are for a later date and owned by anonymous. Not sure if lock file contention is an issue.

Created attachment 164868 [details]
log of manual cvs checkout
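The diagnosis steps above (find the stuck process, then attach a syscall tracer) can be sketched as a terminal session. This is a sketch only: the username is an example from this bug, `<PID>` is a placeholder, and attaching strace requires root or ownership of the process:

```
# find the user's cvs-over-ssh processes on the server
ps -aef | grep kmoir | grep ssh

# attach to the suspect PID; -F/-f follow forked children
strace -p <PID> -F -f
# a select() or read() call that never returns means the process is
# blocked waiting on network I/O, not doing any work
```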
Today I've had some oddness with my connections to eclipse.org servers. I've had a few cvs failures as well as some timeouts when trying to ssh to download1.eclipse.org. All the failures are intermittent, so I'm not able to help you out any more than saying that it's happening to me. This all may be related to the original bug report. Is there any more information that you would like from me?

We are also encountering this issue in TPTP. Let us know if you would like us to provide any information. Thanks

Strangely, from my home connection, I can maintain several SSH connections for hours on end provided they do not become inactive for extended periods of time. Inactive in this case means no bits flowing in either direction. If I maintain a 'top d 60' command on a server, which refreshes every 60 seconds, my SSH terminal stays alive indefinitely. My thinking here is that perhaps CVS is being spaced out (for reasons unknown) and the inactivity is long enough for our firewall to terminate the connection. CVS is obviously confused about having its connection terminated and cannot re-establish a new one to continue. I'll investigate increasing the inactivity timeout for SSH and pserver to see if that makes any difference.

> I'll investigate increasing the inactivity timeout for SSH and pserver to see
> if that makes any difference.

Our inactivity timeout is set to one hour. I left an SSH session unattended for 20 minutes, and it was still alive after that. In reading an email from Joel, another common point seems to be that this happens during a build (which, I assume, means pde/ant/etc). I have yet to see someone able to reproduce this from command-line CVS. Could ant/pde be playing a part in this? Seems that this is not limited to checking out tons of code, but sometimes to small tasks like rdiff. I am running the script below to see if I can get a stall from command-line CVS.

This comes from Joel:

```
#!/bin/bash
while [ 1==1 ]; do
    date >> cvs_test.log
    cvs -d :ext:droy@dev.eclipse.org:/cvsroot/tptp -Q rdiff -s -r v201001151900 -r HEAD monitoring/features/org.eclipse.tptp.monitoring.mx4j.jmx >> cvs_test.log
    sleep 60
done
```

(In reply to comment #13)
> > I'll investigate increasing the inactivity timeout for SSH and pserver to see
> > if that makes any difference.
>
> Our inactivity timeout is set to one hour.

To be clear: I haven't changed anything on our firewall. A 1-hour inactivity timeout should be more than enough.

Ok, Denis and I think we've fixed the issue. Joel and Kim: can you run your builds a couple of times and let us know what happens. -M.

(In reply to comment #15)
> Ok, Denis and I think we've fixed the issue.

What? And you're going to leave us in suspense? Or, just tell us the fix if it worked?

I haven't spoken up before, because I hate to complain (ha ha), but I have had one "local" machine/connection which has never been able to get through a whole big wtp build in the two or three weeks I've been trying ... but it did tonight after I read this post. If you're curious, I have another machine that has always worked ok, but it uses a non-released version of cvs (built from source) which allows a "time out" value so that "hung" connections are closed after 10 minutes. This makes more sense than it may sound, if in fact the connections "hang" after all the data has actually been received ... and the server knows it, but the client does not. Anyway ... seems better from my one little test (and, I really am curious :)

> What? And you're going to leave us in suspense? Or, just tell us the fix if it
> worked?
We finally had a series of events that led us to believe SSH was misbehaving on one server. Matt and I were intrigued as to why, since SSH is one of those daemons that "just works". Matt thought we were hitting some kind of limit, and he was right. The SSH daemon has a MaxStartups parameter:

```
MaxStartups
    Specifies the maximum number of concurrent unauthenticated
    connections to the sshd daemon. Additional connections will be
    dropped until authentication succeeds or the LoginGraceTime
    expires for a connection. The default is 10.
```

On one node, we reduced the value to 2, then opened two invaliduser@node3 sessions and left them sitting at the Password: prompt (in startup state). Sure enough, subsequent SSH connections were not waiting, or hanging -- they were being closed!
So I can see how there could be 10 SSH connections in startup mode, waiting for authentication to complete:
- since pserver was moved to another data set some months ago, more builds are using SSH, thus overall increased usage of SSH
- builds tend to connect/disconnect from the code repo very frequently, and sometimes multiple times per second
- if our LDAP server (or SSH) was slow in any way, the authentication process could take a bit longer and ramp up the number of SSH startups at any given time
- our load balancer was configured to keep SSH connections from a user always on the same node, instead of round-robin'ing them across all the nodes.
- we recently removed one 'dev' node from service, reducing the overall pool of SSH servers
To solve the issue, we have increased the MaxStartups to 20, and have reset the load balancer to round-robin SSH requests across all the nodes. Reducing load on the LDAP server (a work in progress) will also help authentication proceed faster. Since doing all this, my script running at home has not seen a single failure.
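The daemon-side part of the fix above amounts to one line in sshd's configuration; a sketch of the relevant entry (the value 20 is the one stated in this bug, the file path is the standard OpenSSH location):

```
# /etc/ssh/sshd_config (relevant excerpt)
# Allow up to 20 concurrent unauthenticated connections (default 10)
# before sshd starts dropping new ones.
MaxStartups 20
```

Newer OpenSSH releases also accept a three-part form such as `MaxStartups 10:30:60` (start dropping 30% of new unauthenticated connections at 10, all of them at 60), which degrades more gracefully under bursts than a hard cap.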
Thanks Denis, that was very interesting. I'll try some test builds today with the entry removed from my /etc/hosts.

I started a test build and I see that I have a cvs connection that has been stalled for about 30 minutes:

```
kmoir 6907 4950 0 09:43 ? 00:00:00 cvs -dkmoir@dev.eclipse.org:/cvsroot/eclipse export -d org.eclipse.jdt.compiler.apt -rHEAD org.eclipse.jdt.compiler.apt
```

Kim, when you start a build, it would be helpful if you could strace the CVS processes. Otherwise, we're shooting in the dark, since by the time you tell us, your process on our end is gone. If you run this from a linux box, it should be as easy as adding 'strace ' in front of the cvs command, and piping that to a file. The strace will slow your cvs process down considerably, but at least we'll see what happens when it stalls.

Just to add, TPTP also tried a build this morning as well. CVS was hanging for about 30 minutes before we killed the process. We were running the following command:

```
cvs -d tptp:/cvsroot/tptp -Q rdiff -s -r v201003111452 -r HEAD platform/org.eclipse.hyades.probekit.doc.user
```

> cvs -d tptp:/cvsroot/tptp -Q rdiff -s -r v201003111452 -r HEAD
What is that? pserver? ssh? To what server?
Sorry I wasn't clear, it is using ssh (tptp is an alias). The equivalent would be:

```
cvs -d :ext:user@dev.eclipse.org:/cvsroot/tptp -Q rdiff -s -r v201003111452 -r HEAD platform/org.eclipse.hyades.probekit.doc.user
```

Around the same time as Sean and Kim, I saw the same thing from my home test:

```
Wed Apr 21 10:30:50 EDT 2010 [success]
Wed Apr 21 10:31:51 EDT 2010 [success]
Wed Apr 21 10:32:53 EDT 2010
ssh_exchange_identification: read: Connection timed out
cvs [rdiff aborted]: end of file from server (consult above messages if any)
Wed Apr 21 12:45:08 EDT 2010 [success]
```

As you can see, the 10:32:53 job timed out, and my client didn't 'let go' until 12:45. A timeout like that is entirely different from what Matt and I have "fixed". I'll keep investigating.

*** Bug 309994 has been marked as a duplicate of this bug. ***

I've changed my script at home to grab a full strace -Fft of the process. When it dies, I should have an idea where it stalls.

I had a similar issue which could not be resolved/fixed (bug 293416). The only way I could work again was to replace dev.eclipse.org with node1.eclipse.org.

(In reply to comment #27)
> I had a similar issue which could not be resolved/fixed (bug 293416). The only
> way I could work again was to replace dev.eclipse.org with node1.eclipse.org.

Having problems with committing via svn+ssh, I tried replacing dev with node1, but to no avail. Right now the attempt to commit a single java file stalled, which means I have to kill Eclipse due to bug 260743. This happens at approx. one out of four commits.

There was an issue with our Gigabit colo port this morning -- the network cable was replaced. We had general layer 1-2 instability today from 7:52am to 7:56am and earlier, Apr 12 10:14am and again Apr 16 15:34. I doubt any of this would affect TCP exchanges, but it certainly cannot help.

I've been running my tests from home as well as on an Amazon EC2 instance in Ireland. So far I have not seen any timeouts to ssh.
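The "strace -Fft of the process" mentioned above can be sketched as a single wrapped command line. This is a sketch, not the exact invocation used in this bug: the output file name is made up here, and `-f -t` stand in for the follow-fork and timestamp flags mentioned:

```
strace -f -t -o cvs-stall.trace \
    cvs -d :ext:user@dev.eclipse.org:/cvsroot/tptp -Q rdiff -s \
        -r v201003111452 -r HEAD platform/org.eclipse.hyades.probekit.doc.user
# when the command stalls, the tail of cvs-stall.trace shows the last
# syscall, with a timestamp -- typically a select() or read() on the
# ssh socket that never returned
```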
(In reply to comment #28)
> I tried replacing dev with node1, but to no avail.

If you have access to a Linux box, can you run a tcptraceroute to dev.eclipse.org? Please don't use an ICMP traceroute -- those are virtually useless. You should see something like this:

```
# tcptraceroute dev.eclipse.org 22
Selected device eth0, address 10.227.109.236, port 40939 for outgoing packets
Tracing the path to dev.eclipse.org (206.191.52.50) on TCP port 22 (ssh), 30 hops max
 1  10.227.108.3  0.393 ms  16.270 ms  0.252 ms
 2  ec2-79-125-0-135.eu-west-1.compute.amazonaws.com (79.125.0.135)  0.444 ms  0.475 ms  0.456 ms
 3  87.238.81.12  0.906 ms  0.978 ms  0.982 ms
 4  ge-1-1-0-121.dub10.ip4.tinet.net (77.67.75.61)  1.233 ms  1.260 ms  1.321 ms
 5  xe-10-3-0.nyc30.ip4.tinet.net (89.149.184.185)  101.557 ms  83.363 ms  82.806 ms
 6  shaw-gw.ip4.tinet.net (213.200.66.234)  82.794 ms  82.849 ms  82.797 ms
 7  rc2hu-ge3-0-0.ny.shawcable.net (66.163.74.1)  83.018 ms  82.934 ms  82.968 ms
 8  rc1hu-ge0-0-0.ny.shawcable.net (66.163.74.13)  83.025 ms  83.093 ms  82.957 ms
 9  rc1sh-pos14-0-0.mt.shawcable.net (66.163.77.153)  95.448 ms  95.248 ms  95.443 ms
10  66.163.66.86  95.403 ms  95.407 ms  95.558 ms
11  rx0sh-mountain-cable.mt.bigpipeinc.com (66.244.223.238)  95.525 ms  95.566 ms  95.541 ms
12  * * *
13  * * *
14  gateway2-vlan94.magma.ca (209.217.64.49)  104.707 ms  348.231 ms  275.575 ms
15  206.191.0.120  98.426 ms  98.150 ms  98.419 ms
16  dev.eclipse.org (206.191.52.50) [open]  100.325 ms  100.308 ms  100.172 ms
```

(In reply to comment #29)
> If you have access to a Linux box, can you run a tcptraceroute to
> dev.eclipse.org? Please don't use an ICMP traceroute -- those are virtually
> useless.
I'm on Ubuntu, where (tcp)traceroute seems to have slightly different syntax, but here it is:

```
# sudo traceroute dev.eclipse.org -T -p 22
traceroute to dev.eclipse.org (206.191.52.50), 30 hops max, 60 byte packets
 1  fritz.box (192.168.178.1)  0.682 ms  1.061 ms  1.050 ms
 2  * * *
 3  217.237.155.210 (217.237.155.210)  68.875 ms  72.208 ms  79.349 ms
 4  217.239.37.106 (217.239.37.106)  182.843 ms  188.680 ms  194.310 ms
 5  194.25.211.18 (194.25.211.18)  199.723 ms  206.094 ms  *
 6  *  0.ge-5-0-0.XL4.IAD8.ALTER.NET (152.63.41.158)  202.114 ms  207.235 ms
 7  0.so-0-2-0.XT2.MTL1.ALTER.NET (152.63.0.118)  237.718 ms  0.so-7-0-0.XL2.TOR3.ALTER.NET (152.63.2.102)  177.964 ms  0.so-0-2-0.XT2.MTL1.ALTER.NET (152.63.0.118)  179.573 ms
 8  POS1-0.XR2.TOR3.ALTER.NET (152.63.129.101)  178.338 ms  0.so-3-0-0.XR2.MTL1.ALTER.NET (152.63.133.42)  168.995 ms  168.997 ms
 9  194.ATM6-0.GW2.OTT1.ALTER.NET (152.63.131.189)  173.530 ms  192.ATM7-0.GW2.OTT1.ALTER.NET (152.63.131.181)  169.944 ms  194.ATM6-0.GW2.OTT1.ALTER.NET (152.63.131.189)  175.640 ms
10  magma7-u-gw.customer.alter.net (209.167.168.238)  177.035 ms  175.824 ms  173.343 ms
11  gateway1-vlan92.magma.ca (209.217.64.41)  175.896 ms  176.516 ms  170.560 ms
12  206.191.0.87 (206.191.0.87)  174.343 ms  171.883 ms  177.213 ms
13  dev.eclipse.org (206.191.52.50)  176.795 ms  170.442 ms  173.911 ms
```

(tcptraceroute seems to be an alias for "traceroute -T", but tcptraceroute wouldn't let me specify a port ("bind: Address already in use")).

Stephan, you have some major lag occurring at hop #4. Hop #5 actually dropped one of your SYN requests :-( alter.net's montreal-toronto-montreal-toronto-montreal dance looks ridiculous, but doesn't hurt much. Unfortunately, the massive latency you're seeing is well beyond my control.

So my improved script came up empty-handed: 1619/1619 successful tries from my Amazon EC2 server in Dublin, Ireland, and 1649/1649 successful tries from my home PC near Ottawa. I'll let it run all weekend, but I think we're good.
This doesn't mean you will always get 100% connection rates, but this is the Internet, and I'm only in control of the last couple of hops in the whole route. Perhaps another consideration is that connections to SSH don't fail as gracefully as they do with pserver... or timeouts take longer to be considered a timeout. In my earlier example, my script was stuck for 90 minutes, and after a bit of searching, I can't see any way of reducing this.

I am concluding my testing. This is as exhaustive as I'm willing to be. My tests have been running 24/7 all week from two locations:
- my home computer, using a cable modem, close to the servers (geographically, and 10 network hops, 19 ms round-trip)
- an EC2 instance in Ireland (17 network hops, 120 ms round-trip)

From home, 1 out of 11201 connections failed, due to a DNS error. From EC2, 4 out of 10492 connections failed, all due to timing out. Interestingly, those 4 timeouts took _2 hours and 10-15 minutes_ to finally give up. There is no activity in the strace for 2 hours and 15 minutes. A packet capture would have been interesting. Of the four timeouts, I have no indication that the packets even reached the Eclipse.org network's inner bounds (our firewall). I clearly see the SSH activity before and after the failure. Our border firewall and our CSS are logging all dropped sessions for whatever reason (hitting blocked ports, SYN attacks, etc.) and have logged nothing wrt. this.

So my conclusion is this:
- my feeling is that the timeouts occur upon connection. Once connected, you're most likely good.
  --> with this in mind, it seems that since a build connects+disconnects dozens/hundreds of times (I don't know why), you are increasing your chances of timing out.
- SSH handles connection timeouts quite poorly -- 2 hours??
  --> perhaps using pserver to build is a better idea
  --> or finding a way to reduce the timeout (and induce a retry, or fail) would help
- I think Eclipse.org's network is in great shape.
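The "reduce the timeout and induce a retry, or fail" idea above can be approximated on the client side without a patched CVS. A minimal sketch, assuming GNU coreutils' `timeout` is available; the function name and the limits are illustrative, not anything used in this bug:

```shell
#!/bin/bash
# Run a command under a hard time limit, retrying a few times, so a
# stalled cvs/ssh session fails within minutes instead of hours.
run_with_retry() {
    local tries=$1 limit=$2
    shift 2
    local i
    for ((i = 1; i <= tries; i++)); do
        # timeout(1) kills the command if it runs longer than $limit seconds
        timeout "$limit" "$@" && return 0
        echo "attempt $i of $tries failed or timed out" >&2
    done
    return 1
}

# e.g. give each CVS operation 3 tries of at most 10 minutes each:
#   run_with_retry 3 600 cvs -d :ext:user@dev.eclipse.org:/cvsroot/tptp -Q rdiff ...
```

For the connection phase alone, `ssh -o ConnectTimeout=30` (set through a CVS_RSH wrapper script) bounds how long the SSH handshake may take, but a session that stalls after authentication still needs a hard limit like the one above.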
The above DNS error doesn't necessarily mean it was a DNS failure on our part, since from home I use my ISP's DNS servers. .. and the amount of traffic at any given time in the trans-Atlantic pipes is a huge variable.

We have done things to make this better:
- SSH and load balancer server config in comment 17 should reduce the number of times a connection is refused because of too many startup SSH connections
  --> but a refused connection != a timed out connection. A refused connection fails immediately; a timed out connection hangs for hours.

So there you have it. Geographically, the farther away you are from Eclipse.org, the more network hops, the more chance of a timeout. Some options:
-> deal with timeouts gracefully (either get SSH to fail faster, or use pserver)
-> move builds closer to Eclipse.org (Ottawa, Canada)
   ** Large corps (yes, IBM) often love to route stuff in weird ways, so physical proximity is no indication of network proximity
-> move builds to build*.eclipse.org
-> spend zillions of dollars in setting up Eclipse.org SSH points-of-presence around the world :)

Thanks Denis for all your time testing this. I really appreciate it. The reason that our build connects so many times via ssh is as follows. PDE build resolves dependencies for the bundles we are building, and generates Ant fetch scripts to fetch the source of the bundles being built. Each bundle being fetched generates a separate call to CVS. This is useful because we can run a lot of parallel ant tasks to speed up our build. (Even though our build takes a ridiculous amount of time to complete, it could be worse if we didn't have so many tasks running in parallel.) Changing our build to check out everything in one giant ssh connection is non-trivial. This would require a major change to the way pde build generates build scripts. We can't use pserver for builds because there is sometimes a lag updating that copy of the repository, and as well, we would go back to lock file contention issues.
I agree that checking out wads of code to a remote location is not optimal, as we have to deal with multiple network hops. I think the best solution for us is to keep the /etc/hosts entry for node1 for the interim to avoid these problems. I haven't seen any issues with cvs timeouts since I made that change. Yes, we intend to move our build to eclipse.org. However, we will have to wait until there are test servers online at the foundation before we can do this.

(In reply to comment #34)
> I think the best solution for us is to keep the /etc/hosts entry for node1
> for the interim to avoid these problems. I haven't seen any issues with cvs
> timeouts since I made that change.

The configuration changes we made in comment 17 should make your builds as reliable as using node1. More reliable in fact, because if we take node1 down for maintenance or to reboot, you won't be able to connect to CVS at all.

*** Bug 303217 has been marked as a duplicate of this bug. ***