
Bug 309013

Summary: cvs timeouts stalling eclipse builds
Product: Community Reporter: Kim Moir <kim.moir>
Component: CVS Assignee: Eclipse Webmaster <webmaster>
Status: RESOLVED FIXED QA Contact:
Severity: normal    
Priority: P3 CC: andrew.eisenberg, bokowski, daniel_megert, david_williams, gheorghe, jcayne, john.arthorne, slamers, stephan.herrmann
Version: unspecified   
Target Milestone: ---   
Hardware: PC   
OS: Windows XP   
Whiteboard:
Attachments:
log of manual cvs checkout (flags: none)

Description Kim Moir CLA 2010-04-13 11:29:36 EDT
For the last couple of days, I have had CVS timeouts to eclipse.org which have stalled the build.  They seem to happen between 10am and 5pm.  If I run a build at night or in the early morning they don't happen.  Looking at our lab, we have lots of bandwidth to eclipse.org so this isn't an issue.  We use extssh with my userid to check out the code for the builds.

If you would like, I can start a build and show you a stalled connection on eclipse.org.
Comment 1 Eclipse Webmaster CLA 2010-04-13 13:45:42 EDT
That sounds like a plan.

-M.
Comment 2 Kim Moir CLA 2010-04-13 14:29:40 EDT
There is a session stalled now:

cvs -dkmoir@dev.eclipse.org:/cvsroot/eclipse -q export -r v20100409 org.eclipse.pde-feature/feature.xml

I logged into dev.eclipse.org but I can't see the active session. I was on node5 - the connection may be to another node.
Comment 3 Eclipse Webmaster CLA 2010-04-13 14:57:22 EDT
Hmmm, I'm not seeing any errors in the logs. 

As far as I can tell Torolab is connected to nodes 1 and 5 (at this time), but I don't see a session with your name on it on either node.

If you use the -t flag are there any hints in the output?

-M.
Comment 4 Kim Moir CLA 2010-04-13 15:12:46 EDT
kmoir@node5:~> netstat -t | grep ibm | grep ssh

tcp        0      0 node5.eclipse.org:ssh   fw54.torolab.ibm.:28760 ESTABLISHED
tcp        0      0 node5.eclipse.org:ssh   fw54.torolab.ibm.:65369 ESTABLISHED
tcp        0      0 node5.eclipse.org:ssh   fw54.torolab.ibm.:65369 ESTABLISHED
tcp        0      0 node5.eclipse.org:ssh   fw54.torolab.ibm.:39745 ESTABLISHED
tcp        0      0 node5.eclipse.org:ssh   fw54.torolab.ibm.:54655 ESTABLISHED
tcp        0      0 node5.eclipse.org:ssh   fw54.torolab.ibm.:54655 ESTABLISHED
tcp        0     32 node5.eclipse.org:ssh   fw54.torolab.ibm.:21369 ESTABLISHED
tcp        0      0 node5.eclipse.org:ssh   fw54.torolab.ibm.:57363 ESTABLISHED
tcp        0      0 node5.eclipse.org:ssh   fw54.torolab.ibm.:17180 ESTABLISHED
tcp        0     64 node5.eclipse.org:ssh   fw54.torolab.ibm.:22530 ESTABLISHED
tcp        0      0 node5.eclipse.org:ssh   fw54.torolab.ibm.:49422 ESTABLISHED
tcp        0   1204 node5.eclipse.org:ssh   fw54.torolab.ibm.:64008 ESTABLISHED
tcp        0      0 node5.eclipse.org:ssh   fw54.torolab.ibm.:13066 ESTABLISHED
tcp        0      0 node5.eclipse.org:ssh   fw54.torolab.ibm.:12342 ESTABLISHED
tcp        0      0 node5.eclipse.org:ssh   fw54.torolab.ibm.:44081 ESTABLISHED
tcp        0      0 node5.eclipse.org:ssh   fw54.torolab.ibm.:44337 ESTABLISHED
tcp        0      0 node5.eclipse.org:ssh   fw54.torolab.ibm.:29757 ESTABLISHED
tcp        0      0 node5.eclipse.org:ssh   fw54.torolab.ibm.:18479 ESTABLISHED
tcp        0      0 node5.eclipse.org:ssh   fw54.torolab.ibm.:61652 ESTABLISHED
tcp        0      0 node5.eclipse.org:ssh   fw54.torolab.ibm.c:8436 ESTABLISHED
tcp        0      0 node5.eclipse.org:ssh   fw54.torolab.ibm.:33782 ESTABLISHED
tcp        0      0 node5.eclipse.org:ssh   fw54.torolab.ibm.:49124 ESTABLISHED
tcp        0      0 node5.eclipse.org:ssh   pat.zurich.ibm.co:65210 ESTABLISHED
tcp        0      0 node5.eclipse.org:ssh   pat.zurich.ibm.co:29653 ESTABLISHED

I can't tell which connection is mine.  

Btw, if you want to contact me I'm on gtalk (firstname.lastname@gmail.com) or IRC #eclipse-dev #equinox-dev.  Might be easier that way :-)
Comment 5 Eclipse Webmaster CLA 2010-04-13 15:45:42 EDT
Well 'ps -aef | grep oir | grep ssh' was able to find you, even though I couldn't.

I've connected to the process with strace but it seems to be 'stuck'

node5# strace -p162893 -F -f
Process 162893 attached - interrupt to quit
select(12, [3 6], [], NULL, NULL  

And it's been sitting there for about 10 minutes.  I've sent it a HUP signal to see if anything falls out. 

-M.
Comment 6 Kim Moir CLA 2010-04-14 13:33:22 EDT
I have another cvs checkout that is hanging.

kmoir    16611 14010  0 12:39 ?        00:00:00 cvs -dkmoir@dev.eclipse.org:/cvsroot/rt export -d org.eclipse.equinox.p2.metadata.repository -rv20100405-2203 org.eclipse.equinox/p2/bundles/org.eclipse.equinox.p2.metadata.repository
kmoir    16614 16611  0 12:39 ?        00:00:00 ssh -l kmoir dev.eclipse.org cvs server

tcp        0      0 node5.eclipse.org:ssh   fw54.torolab.ibm.:48764 ESTABLISHED
tcp        0      0 node5.eclipse.org:ssh   fw54.torolab.ibm.:60798 ESTABLISHED
tcp        0      0 node5.eclipse.org:ssh   fw54.torolab.ibm.:37229 ESTABLISHED
tcp        0      0 node5.eclipse.org:ssh   fw54.torolab.ibm.:13678 ESTABLISHED
tcp        0      0 node5.eclipse.:www-http fw54.torolab.ibm.:33559 TIME_WAIT
tcp        0      0 node5.eclipse.org:ssh   fw54.torolab.ibm.:61979 ESTABLISHED
tcp        0      0 node5.eclipse.org:ssh   fw54.torolab.ibm.:15619 ESTABLISHED
tcp        0      0 node5.eclipse.org:ssh   fw54.torolab.ibm.:39437 ESTABLISHED
tcp        0      0 node5.eclipse.org:ssh   fw54.torolab.ibm.:13066 ESTABLISHED
tcp        0      0 node5.eclipse.org:ssh   fw54.torolab.ibm.:13066 ESTABLISHED
tcp        0      0 node5.eclipse.org:ssh   fw54.torolab.ibm.:12342 ESTABLISHED
tcp        0      0 node5.eclipse.org:ssh   fw54.torolab.ibm.:48432 ESTABLISHED
tcp        0      0 node5.eclipse.org:ssh   fw54.torolab.ibm.:12342 ESTABLISHED
tcp        0      0 node5.eclipse.org:ssh   fw54.torolab.ibm.:48432 ESTABLISHED
tcp        0      0 node5.eclipse.org:ssh   fw54.torolab.ibm.:53554 ESTABLISHED
tcp        0      0 node5.eclipse.org:ssh   fw54.torolab.ibm.:18479 ESTABLISHED
tcp        0      0 node5.eclipse.org:ssh   fw54.torolab.ibm.:26920 ESTABLISHED

Can you see any problems?
Comment 7 Eclipse Webmaster CLA 2010-04-14 13:50:22 EDT
I'm not seeing anything other than the process waiting for I/O.

Here are a couple of questions:

If you direct your cvs command at node1 specifically (edit your hosts file), does the same thing happen?

If you add the -t flag to your CVS command what is it doing before it hangs?

-M.
Comment 8 Kim Moir CLA 2010-04-14 14:28:27 EDT
I can try editing my hosts file and see what happens. All the cvs commands are actually run by generating ant scripts using pde build, which call the ant cvs task. There currently isn't a way to add a parameter to the generated script to issue the cvs command with -t.

If I run the cvs command manually with -t it works fine, but the older process is hung. I'll attach the log.  Of course, I can't seem to replicate the hung process this way.

I looked at /var/lock/cvs and there are lock files there for the project, but they are for a later date and owned by anonymous. Not sure if lock file contention is an issue.
Comment 9 Kim Moir CLA 2010-04-14 14:31:53 EDT
Created attachment 164868 [details]
log of manual cvs checkout
Comment 10 Andrew Eisenberg CLA 2010-04-14 16:58:42 EDT
Today I've had some oddness with my connections to eclipse.org servers.  I've had a few cvs failures as well as some timeouts when trying to ssh to download1.eclipse.org.  All the failures are intermittent, so I'm not able to help you out any more than saying that it's happening to me.  This all may be related to the original bug report.  

Is there any more information that you would like from me?
Comment 11 Joel Cayne CLA 2010-04-19 09:51:16 EDT
We are also encountering this issue in TPTP. Let us know if you would like us to provide any information.

Thanks
Comment 12 Denis Roy CLA 2010-04-19 14:44:21 EDT
Strangely, from my home connection, I can maintain several SSH connections for hours on end provided they do not become inactive for extended periods of time.  Inactive in this case means no bits flowing in either direction.  If I maintain a 'top d 60' command on a server, which refreshes every 60 seconds, my SSH terminal stays alive indefinitely.

My thinking here is that perhaps CVS is stalling (for reasons unknown) and the inactivity lasts long enough for our firewall to terminate the connection.  CVS is obviously confused about having its connection terminated and cannot establish a new one to continue.

I'll investigate increasing the inactivity timeout for SSH and pserver to see if that makes any difference.
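[A common client-side mitigation for a firewall dropping idle connections (an assumption here, not something the thread reports trying) is OpenSSH's keepalive options, set in the client's ~/.ssh/config so an otherwise-silent extssh session still generates traffic:]

```
# ~/.ssh/config (client side): periodic keepalive probes keep an idle
# session from looking inactive to an intermediate firewall.
Host dev.eclipse.org
    ServerAliveInterval 60   # probe after 60s of silence
    ServerAliveCountMax 3    # disconnect after 3 unanswered probes
```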
Comment 13 Denis Roy CLA 2010-04-19 16:15:57 EDT
> I'll investigate increasing the inactivity timeout for SSH and pserver to see
> if that makes any difference.

Our inactivity timeout is set to one hour.  I left an SSH session unattended for 20 minutes, and it was still alive after that.

In reading an email from Joel, another common point seems to be that this happens during a build (which, I assume, means pde/ant/etc).  I have yet to see someone able to reproduce this from commandline CVS.

Could ant/pde be playing a part in this?  It seems that this is not limited to checking out tons of code; it sometimes happens on small tasks like rdiff.


I am running the script below to see if I can get a stall from commandline CVS.  This comes from Joel:

#!/bin/bash

while true; do
        date >> cvs_test.log
        cvs -d :ext:droy@dev.eclipse.org:/cvsroot/tptp -Q rdiff -s -r v201001151900 -r HEAD monitoring/features/org.eclipse.tptp.monitoring.mx4j.jmx >> cvs_test.log
        sleep 60
done
Comment 14 Denis Roy CLA 2010-04-19 16:16:58 EDT
(In reply to comment #13)
> > I'll investigate increasing the inactivity timeout for SSH and pserver to see
> > if that makes any difference.
> 
> Our inactivity timeout is set to one hour.

To be clear: I haven't changed anything on our firewall.  1 hour inactivity timeout should be more than enough.
Comment 15 Eclipse Webmaster CLA 2010-04-20 16:10:39 EDT
Ok, Denis and I think we've fixed the issue.  

Joel and Kim: can you run your builds a couple of times and let us know what happens.

-M.
Comment 16 David Williams CLA 2010-04-20 20:34:48 EDT
(In reply to comment #15)
> Ok, Denis and I think we've fixed the issue.  
> 

What? And you're going to leave us in suspense? Or, just tell us the fix if it worked? 

I haven't spoken up before, because I hate to complain (ha ha), but I have had one "local" machine/connection which has never been able to get through a whole big wtp build in the two or three weeks I've been trying ... but it did tonight after I read this post. If you're curious, I have another machine that has always worked ok, but it uses a non-released version of cvs (built from source) which allows a "time out" value so that "hung" connections are closed after 10 minutes. This makes more sense than it may sound if, in fact, the connections "hang" after all the data has actually been received ... and the server knows it, but the client does not. 

Anyway ... seems better from my one little test (and, I really am curious :)
Comment 17 Denis Roy CLA 2010-04-21 08:03:54 EDT
> What? And you're going to leave us in suspense? Or, just tell us the fix if it
> worked? 

We finally had a series of events that led us to believe SSH was misbehaving on one server. Matt and I were intrigued as to why, since SSH is one of those daemons that "just works".  Matt thought we were hitting some kind of limit, and he was right. The SSH daemon has a MaxStartups parameter.

MaxStartups
    Specifies the maximum number of concurrent unauthenticated
    connections to the sshd daemon.  Additional connections will be
    dropped until authentication succeeds or the LoginGraceTime
    expires for a connection.  The default is 10.

On one node, we reduced the value to 2, then opened two invaliduser@node3 sessions and left them sitting at the Password: prompt (in startup state).  Sure enough, subsequent SSH connections were not waiting or hanging -- they were being closed!

So I can see how there could be 10 SSH connections in startup mode, waiting for authentication to complete:

- since pserver was moved to another data set some months ago, more builds are using SSH, thus overall increased usage of SSH

- builds tend to connect/disconnect from the code repo very frequently, and sometimes multiple times per second

- if our LDAP server (or SSH) was slow in any way, the authentication process could take a bit longer and ramp up the number of SSH startups at any given time

- our load balancer was configured to keep SSH connections from a user always on the same node, instead of round-robin'ing them across all the nodes.

- we recently removed one 'dev' node from service, reducing the overall pool of SSH servers

To solve the issue, we have increased the MaxStartups to 20, and have reset the load balancer to round-robin SSH requests across all the nodes.  Reducing load on the LDAP server (a work in progress) will also help authentication proceed faster.  Since doing all this, my script running at home has not seen a single failure.
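[In sshd_config terms, the fix described above amounts to a one-line change on each node (the exact file path varies by distribution):]

```
# /etc/ssh/sshd_config (server side): raise the cap on concurrent
# unauthenticated connections, the limit the thread identified as being hit.
MaxStartups 20
```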
Comment 18 Kim Moir CLA 2010-04-21 08:56:44 EDT
Thanks Denis, that was very interesting.  I'll try some test builds today with the entry removed from my /etc/hosts.
Comment 19 Kim Moir CLA 2010-04-21 10:21:20 EDT
I started a test build and I see that I have a cvs connection that has been stalled for about 30 minutes. 

kmoir     6907  4950  0 09:43 ?        00:00:00 cvs -dkmoir@dev.eclipse.org:/cvsroot/eclipse export -d org.eclipse.jdt.compiler.apt -rHEAD org.eclipse.jdt.compiler.apt
Comment 20 Denis Roy CLA 2010-04-21 10:33:52 EDT
Kim, when you start a build, it would be helpful if you could strace the CVS processes.  Otherwise, we're shooting in the dark since by the time you tell us, your process on our end is gone.  If you run this from a linux box, it should be as easy as adding 'strace ' in front of the cvs command, and piping that to a file.  The strace will slow your cvs process down considerably, but at least we'll see what happens when it stalls.
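[Denis's suggestion can be sketched as a small wrapper function. This is a hypothetical helper, not part of the build; the function name and log path are illustrative, and it assumes the standard strace flags -f (follow forks), -tt (timestamps), and -o (output file). It probes whether tracing is actually permitted and falls back to running the command untraced:]

```shell
# Hypothetical wrapper: run a command under strace when tracing is
# available, writing the syscall log to a file; otherwise run it plainly.
trace_cmd() {
    log="$1"; shift
    if strace -o /dev/null true >/dev/null 2>&1; then
        # strace exists and ptrace is permitted: trace with timestamps
        strace -f -tt -o "$log" "$@"
    else
        # strace missing or ptrace blocked: run untraced
        "$@"
    fi
}

# A build would wrap its cvs call like (tag/module names illustrative):
#   trace_cmd cvs-trace.log cvs -d :ext:user@dev.eclipse.org:/cvsroot/eclipse \
#       -q export -r v20100409 org.eclipse.pde-feature/feature.xml
trace_cmd /tmp/demo-trace.log echo "checkout done"
```

[The strace log will slow the checkout down, as Denis notes, but it shows exactly which syscall the process is blocked in when it stalls.]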
Comment 21 Sean Lamers CLA 2010-04-21 10:44:03 EDT
Just to add, TPTP also tried a build this morning as well. CVS was hanging for about 30 minutes before we killed the process. We were running the following command:

cvs -d tptp:/cvsroot/tptp -Q rdiff -s -r v201003111452 -r HEAD platform/org.eclipse.hyades.probekit.doc.user
Comment 22 Denis Roy CLA 2010-04-21 11:21:36 EDT
> cvs -d tptp:/cvsroot/tptp -Q rdiff -s -r v201003111452 -r HEAD

What is that?  pserver?  ssh?  To what server?
Comment 23 Sean Lamers CLA 2010-04-21 12:21:20 EDT
Sorry I wasn't clear, it is using ssh (tptp is an alias). The equivalent would be cvs -d :ext:user@dev.eclipse.org:/cvsroot/tptp -Q rdiff -s -r v201003111452 -r HEAD platform/org.eclipse.hyades.probekit.doc.user
Comment 24 Denis Roy CLA 2010-04-21 13:15:56 EDT
Around the same time as Sean and Kim, I saw the same thing from my home test:

Wed Apr 21 10:30:50 EDT 2010
[success]
Wed Apr 21 10:31:51 EDT 2010
[success]
Wed Apr 21 10:32:53 EDT 2010
ssh_exchange_identification: read: Connection timed out
cvs [rdiff aborted]: end of file from server (consult above messages if any)
Wed Apr 21 12:45:08 EDT 2010
[success]

As you can see, the 10:32:53 job timed out, and my client didn't 'let go' until 12:45.


A timeout like that is entirely different from what Matt and I have "fixed".  I'll keep investigating.
Comment 25 Stephan Herrmann CLA 2010-04-21 14:36:05 EDT
*** Bug 309994 has been marked as a duplicate of this bug. ***
Comment 26 Denis Roy CLA 2010-04-21 15:00:15 EDT
I've changed my script at home to grab a full strace -Fft of the process.  When it dies, I should have an idea where it stalls.
Comment 27 Dani Megert CLA 2010-04-22 03:32:57 EDT
I had a similar issue which could not be resolved/fixed (bug 293416). The only way I could work again was to replace dev.eclipse.org with node1.eclipse.org.
Comment 28 Stephan Herrmann CLA 2010-04-22 10:29:23 EDT
(In reply to comment #27)
> I had a similar issue which could not be resolved/fixed (bug 293416). The only
> way I could work again was to replace dev.eclipse.org with node1.eclipse.org.

Having problems with committing to svn+ssh I tried replacing dev with node1,
but to no avail. Right now the attempt to commit a single java file stalled
which means I have to kill Eclipse due to bug 260743.
This happens at approx. one out of four commits.
Comment 29 Denis Roy CLA 2010-04-22 10:48:55 EDT
There was an issue with our Gigabit colo port this morning -- the network cable was replaced.  We had general layer 1-2 instability today from 7:52am to 7:56am and earlier, Apr 12 10:14am and again Apr 16 15:34.  I doubt any of this would affect TCP exchanges, but they certainly cannot help.

I've been running my tests from home as well as on an Amazon EC2 instance in Ireland.  So far I have not seen any timeouts to ssh.




(In reply to comment #28)
> I tried replacing dev with node1, but to no avail.

If you have access to a Linux box, can you run a tcptraceroute to dev.eclipse.org?  Please don't use an ICMP traceroute -- those are virtually useless.

You should see something like this:

# tcptraceroute dev.eclipse.org 22
Selected device eth0, address 10.227.109.236, port 40939 for outgoing packets
Tracing the path to dev.eclipse.org (206.191.52.50) on TCP port 22 (ssh), 30 hops max
 1  10.227.108.3  0.393 ms  16.270 ms  0.252 ms
 2  ec2-79-125-0-135.eu-west-1.compute.amazonaws.com (79.125.0.135)  0.444 ms  0.475 ms  0.456 ms
 3  87.238.81.12  0.906 ms  0.978 ms  0.982 ms
 4  ge-1-1-0-121.dub10.ip4.tinet.net (77.67.75.61)  1.233 ms  1.260 ms  1.321 ms
 5  xe-10-3-0.nyc30.ip4.tinet.net (89.149.184.185)  101.557 ms  83.363 ms  82.806 ms
 6  shaw-gw.ip4.tinet.net (213.200.66.234)  82.794 ms  82.849 ms  82.797 ms
 7  rc2hu-ge3-0-0.ny.shawcable.net (66.163.74.1)  83.018 ms  82.934 ms  82.968 ms
 8  rc1hu-ge0-0-0.ny.shawcable.net (66.163.74.13)  83.025 ms  83.093 ms  82.957 ms
 9  rc1sh-pos14-0-0.mt.shawcable.net (66.163.77.153)  95.448 ms  95.248 ms  95.443 ms
10  66.163.66.86  95.403 ms  95.407 ms  95.558 ms
11  rx0sh-mountain-cable.mt.bigpipeinc.com (66.244.223.238)  95.525 ms  95.566 ms  95.541 ms
12  * * *
13  * * *
14  gateway2-vlan94.magma.ca (209.217.64.49)  104.707 ms  348.231 ms  275.575 ms
15  206.191.0.120  98.426 ms  98.150 ms  98.419 ms
16  dev.eclipse.org (206.191.52.50) [open]  100.325 ms  100.308 ms  100.172 ms
Comment 30 Stephan Herrmann CLA 2010-04-22 12:16:05 EDT
(In reply to comment #29)
> (In reply to comment #28)
> If you have access to a Linux box, can you run a tcptraceroute to
> dev.eclipse.org?  Please don't use an ICMP traceroute -- those are virtually
> useless.

I'm on Ubuntu, where (tcp)traceroute seems to have slightly different syntax, but here it is:

# sudo traceroute dev.eclipse.org -T -p 22
traceroute to dev.eclipse.org (206.191.52.50), 30 hops max, 60 byte packets
 1  fritz.box (192.168.178.1)  0.682 ms  1.061 ms  1.050 ms
 2  * * *
 3  217.237.155.210 (217.237.155.210)  68.875 ms  72.208 ms  79.349 ms
 4  217.239.37.106 (217.239.37.106)  182.843 ms  188.680 ms  194.310 ms
 5  194.25.211.18 (194.25.211.18)  199.723 ms  206.094 ms *
 6  * 0.ge-5-0-0.XL4.IAD8.ALTER.NET (152.63.41.158)  202.114 ms  207.235 ms
 7  0.so-0-2-0.XT2.MTL1.ALTER.NET (152.63.0.118)  237.718 ms 0.so-7-0-0.XL2.TOR3.ALTER.NET (152.63.2.102)  177.964 ms 0.so-0-2-0.XT2.MTL1.ALTER.NET (152.63.0.118)  179.573 ms
 8  POS1-0.XR2.TOR3.ALTER.NET (152.63.129.101)  178.338 ms 0.so-3-0-0.XR2.MTL1.ALTER.NET (152.63.133.42)  168.995 ms  168.997 ms
 9  194.ATM6-0.GW2.OTT1.ALTER.NET (152.63.131.189)  173.530 ms 192.ATM7-0.GW2.OTT1.ALTER.NET (152.63.131.181)  169.944 ms 194.ATM6-0.GW2.OTT1.ALTER.NET (152.63.131.189)  175.640 ms
10  magma7-u-gw.customer.alter.net (209.167.168.238)  177.035 ms  175.824 ms  173.343 ms
11  gateway1-vlan92.magma.ca (209.217.64.41)  175.896 ms  176.516 ms  170.560 ms
12  206.191.0.87 (206.191.0.87)  174.343 ms  171.883 ms  177.213 ms
13  dev.eclipse.org (206.191.52.50)  176.795 ms  170.442 ms  173.911 ms

(tcptraceroute seems to be an alias to "traceroute -T", but tcptraceroute
wouldn't let me specify a port ("bind: Address already in use")).
Comment 31 Denis Roy CLA 2010-04-22 12:43:30 EDT
Stephan, you have some major lag occurring at hop #4.  Hop #5 actually dropped one of your SYN requests  :-(

alter.net's montreal-toronto-montreal-toronto-montreal dance looks ridiculous, but doesn't hurt much.

Unfortunately, the massive latency you're seeing is well beyond my control.
Comment 32 Denis Roy CLA 2010-04-23 14:32:57 EDT
So my improved script came up empty handed.  1619/1619 successful tries from my Amazon EC2 server in Dublin, Ireland and 1649/1649 successful tries from my home PC near Ottawa.

I'll let it run all weekend, but I think we're good.  This doesn't mean you will always get 100% connection rates, but this is the Internet, and I'm only in control of the last couple of hops in the whole route.

Perhaps another consideration is that connections to SSH don't fail as gracefully as they do with pserver... or timeouts take longer to be recognized as such.  In my earlier example, my script was stuck for 90 minutes and, after a bit of searching, I can't see any way of reducing this.
Comment 33 Denis Roy CLA 2010-04-30 10:54:01 EDT
I am concluding my testing.  This is as exhaustive as I'm willing to be.


My tests have been running 24/7 all week from two locations:
- my home computer, using a cable modem, geographically near the servers (10 network hops, 19ms round-trip)
- an EC2 instance in Ireland (17 network hops, 120ms round-trip)

From home, 1 out of 11201 connections failed due to a DNS error.

From EC2, 4 out of 10492 connections failed, all due to timing out.

Interestingly, those 4 timeouts took _2 hours and 10-15 minutes_ to finally give up.  There is no activity in the strace for 2 hours and 15 minutes.  A packet capture would have been interesting.

Of the four timeouts, I have no indication that the packets even reached the Eclipse.org network's inner bounds (our firewall).  I clearly see the SSH activity before and after the failure.  Our border firewall and our CSS log all dropped sessions, whatever the reason (hitting blocked ports, SYN attacks, etc.), and have logged nothing with respect to this.

So my conclusion is this:
- my feeling is that the timeouts occur upon connection. Once connected, you're most likely good.
--> with this in mind, it seems that since a build connects+disconnects dozens/hundreds of times (I don't know why) you are increasing your chances of timing out.

- SSH handles connection timeouts quite poorly -- 2 hours??
--> Perhaps using pserver to build is a better idea
--> or finding a way to reduce the timeout (and induce a retry, or fail) would help

- I think Eclipse.org's network is in great shape.  The above DNS error doesn't necessarily mean it was a DNS failure on our part since, from home, I use my ISP's DNS servers.  .. and the amount of traffic at any given time in the trans-Atlantic pipes is a huge variable.


We have done things to make this better:
- SSH and load balancer server config in comment 17 should reduce the amount of times a connection is refused because of too many startup SSH connections
--> but a refused connection != a timed out connection.  A refused connection fails immediately, a timed out connection hangs for hours


So there you have it.  Geographically, the farther away you are from Eclipse.org, the more network hops, the more chance of a time out.  Some options:

-> deal with timeouts gracefully (either get SSH to fail faster, or use pserver)

-> move builds closer to Eclipse.org (Ottawa, Canada)
** Large corps (yes, IBM) often love to route stuff in weird ways, so physical proximity is no indication of network proximity

-> move builds to build*.eclipse.org

-> spend zillions of dollars in setting up Eclipse.org SSH points-of-presence around the world  :)
Comment 34 Kim Moir CLA 2010-04-30 11:42:57 EDT
Thanks Denis for all your time testing this. I really appreciate it.

The reason that our build connects so many times via ssh is as follows.  PDE build resolves dependencies for the bundles we are building, and generates Ant fetch scripts to fetch the source of the bundles being built. Each bundle being fetched generates a separate call to CVS.   This is useful because we can run a lot of parallel ant tasks to speed up our build. (Even though our build takes a ridiculous amount of time to complete, it could be worse if we didn't have so many tasks running in parallel)

Changing our build to check out everything in one giant ssh connection is non-trivial.  This would require a major change to the way pde build generates build scripts. We can't use pserver for builds because there is sometimes a lag updating that copy of the repository, and as well, we would go back to lock file contention issues.

I agree that checking out wads of code to a remote location is not optimal, as we have to deal with multiple network hops.

I think the best solution for us is to keep the /etc/hosts entry pointing at node1 for the interim to avoid these problems. I haven't seen any issues with cvs timeouts since I made that change.

Yes, we intend to move our build to eclipse.org.  However, we will have to wait until there are test servers online at the Foundation before we can do this.
Comment 35 Denis Roy CLA 2010-05-02 11:27:37 EDT
(In reply to comment #34)
> I think the best solution for us is to keep the /etc/hosts entry for the node
> one for the interim to avoid these problems. I haven't seen any issues with cvs
> timeouts since I made that change.

The configuration changes we made in comment 17 should make your builds as reliable as using node1.  More reliable, in fact, because if we take node1 down for maintenance or to reboot, you won't be able to connect to CVS at all.
Comment 36 Denis Roy CLA 2010-05-10 10:18:11 EDT
*** Bug 303217 has been marked as a duplicate of this bug. ***