Bug 220330

Summary: [build.eclipse.org] intermittent ssh connection failures during jarsigning
Product: Community
Component: Servers
Status: RESOLVED WONTFIX
Severity: normal
Priority: P3
Version: unspecified
Target Milestone: ---
Hardware: PC
OS: Linux
Whiteboard:
Reporter: Nick Boldt <nboldt>
Assignee: Eclipse Webmaster <webmaster>
QA Contact:
CC: kim.moir

Description Nick Boldt CLA 2008-02-26 01:39:10 EST
The process below pushes a file to build.eclipse.org and waits until jarsigner is done and the signed file reappears. Once it does, the ls command returns something other than "No such file or directory" and the build can continue. Unfortunately, I sometimes get ssh authentication failures instead, as in the 12:30:10 poll below.

Any idea why this happens?

--

waitForChangedAttribs:

-timestamp:
     [echo] 12:24:00

compareAttribs:
     [exec] Result: 2
     [echo] original:  ${originalAttribs}
     [echo] polled:  ls: emf-sdo-xsd-Master-runWithSun.zip: No such file or directory

writeDiffResult:

waitForChangedAttribs:

-timestamp:
     [echo] 12:26:01

compareAttribs:
     [exec] Result: 2
     [echo] original:  ${originalAttribs}
     [echo] polled:  ls: emf-sdo-xsd-Master-runWithSun.zip: No such file or directory

writeDiffResult:

waitForChangedAttribs:

-timestamp:
     [echo] 12:28:03

compareAttribs:
     [exec] Result: 2
     [echo] original:  ${originalAttribs}
     [echo] polled:  ls: emf-sdo-xsd-Master-runWithSun.zip: No such file or directory

writeDiffResult:

waitForChangedAttribs:

-timestamp:
     [echo] 12:30:10

compareAttribs:
     [exec] Result: 255
     [echo] original:  ${originalAttribs}
     [echo] polled:  Permission denied, please try again.
     [echo] Permission denied, please try again.
     [echo] Received disconnect from 206.191.52.57: 2: Too many authentication failures for nickb

writeDiffResult:

waitForChangedAttribs:
     [echo] copy zip back to build machine
     [exec] Result: 1
     [echo] delete temp files on build.eclipse.org

packMasterZip:
Comment 1 Denis Roy CLA 2008-02-26 09:06:57 EST
-timestamp:
     [echo] 12:30:10

Is that 12:30:10 PM local time?
Comment 2 Nick Boldt CLA 2008-02-26 12:10:50 EST
(In reply to comment #1)
> -timestamp:
>      [echo] 12:30:10
> Is that 12:30:10 PM local time?

It's local time, but AM rather than PM: 00:30 EST, about an hour before I opened this bug.
Comment 3 Denis Roy CLA 2008-02-26 15:36:43 EST
That specific instance is the 12:30am crunch.  There's super heavy load as the download stats tables are rotated. I also get some NFS timeout messages around that time:

Feb 26 00:29:07 build kernel: nfs: server nfsmaster not responding, timed out

I also see a noticeable dip in the bandwidth for a few minutes while the servers crunch.

Your best bet is to catch the error and try again, or avoid doing stuff around 12:30.
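
One way to "catch the error and try again" can be sketched as follows. This is only a minimal sketch, not the script used in this build: the helper name run_with_retry, the attempt count, and the delay are assumptions. It relies on ssh exiting with status 255 when the connection or authentication fails, which matches the "Result: 255" in the log above.

```python
import subprocess
import time


def run_with_retry(cmd, attempts=5, delay=60.0, retry_on=(255,), sleep=time.sleep):
    """Run cmd, retrying while its exit status is in retry_on.

    ssh reports its own failures (connection, authentication) as 255;
    any other status comes from the remote command and is returned as-is.
    The attempts/delay defaults are illustrative assumptions.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    result = None
    for attempt in range(1, attempts + 1):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode not in retry_on:
            return result
        if attempt < attempts:
            # Wait a little longer after each failure (linear backoff).
            sleep(delay * attempt)
    # Every attempt failed; hand back the last result so the caller can log it.
    return result
```

A caller might wrap the existing poll, for example run_with_retry(["ssh", "build.eclipse.org", "ls", "-l", "emf-sdo-xsd-Master-runWithSun.zip"]), and only fail the build if the returned status is still 255.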
Comment 4 Nick Boldt CLA 2008-02-26 16:26:50 EST
> Your best bet is to catch the error and try again, or avoid doing stuff around
> 12:30.

I was thinking my 'check for signed zip' test ought to be smarter anyway, so I'll fix that at my end. 
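
A smarter check could separate "file not back yet" from "ssh itself failed" instead of treating any output other than "No such file or directory" as success. The sketch below is an assumption about how that might look, not the actual fix; the function name classify_poll and the state labels are hypothetical. It uses the exit statuses seen in the log above: 2 from ls when the file is missing and 255 from ssh on an authentication failure.

```python
def classify_poll(returncode, output):
    """Classify one poll of `ssh <host> ls <file>` (labels are illustrative).

    "ready"     - ls succeeded, so the signed file has reappeared
    "ssh_error" - ssh itself failed (status 255); worth retrying
    "pending"   - ls could not find the file; signing is still running
    "unknown"   - anything else; safer to stop than to continue
    """
    if returncode == 0:
        return "ready"
    if returncode == 255:
        return "ssh_error"
    if "No such file or directory" in output:
        return "pending"
    return "unknown"
```

With this, the 12:30:10 poll in the description would be classified as "ssh_error" rather than being mistaken for a finished signing job.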

Since this is a scheduled event and not a server snafu, I'll close this as WONTFIX, and avoid the 00:30 crunch in future.

Thanks for the info. 

Comment 5 Kim Moir CLA 2008-03-05 12:20:47 EST
Denis, by dip in bandwidth in comment #3, I assume you mean that there is little available bandwidth at this time for cvs checkouts from eclipse.org.  How long does this last after 12:30am? The reason I'm asking is that our builds have been failing with cvs timeouts the last few nights and if this server maintenance is the culprit, I'll run them earlier.  
Comment 6 Denis Roy CLA 2008-03-05 13:20:46 EST
(In reply to comment #5)
> Denis, by dip in bandwidth in comment #3, I assume you mean that there is
> little available bandwidth at this time for cvs checkouts from eclipse.org. 

No, I mean the NFS server is so ridiculously busy that servers wait for it, causing a dip in bandwidth as everything slows down waiting for disk time.

Midnight to about 12:45am seems to be the absolute busiest time of the day.  If you can have your builds completed before that, that would be great.
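
For builds that cannot easily be rescheduled, a guard along these lines could hold off remote steps during that window. It is a rough sketch under stated assumptions: the name in_busy_window is hypothetical, the 00:00 to 00:45 bounds are taken from the estimate above, and the timestamp passed in is assumed to already be in the server's local time (EST).

```python
from datetime import datetime, time

# Approximate busy window from this comment; the exact bounds are an assumption.
BUSY_START = time(0, 0)
BUSY_END = time(0, 45)


def in_busy_window(now):
    """Return True if `now` (server local time) falls inside the busy window."""
    return BUSY_START <= now.time() < BUSY_END
```

A build script could check in_busy_window(datetime.now()) before pushing files for signing and sleep until the window has passed.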
Comment 7 Kim Moir CLA 2008-03-05 13:51:03 EST
Thanks, I'll change our build time.  You might want to announce this to other committers so that other teams don't run into the same issue.
Comment 8 Nick Boldt CLA 2008-03-05 14:28:04 EST
(In reply to comment #7)
> Thanks, I'll change our build time.  You might want to announce this to other
> committers so that other teams don't run into the same issue.

Denis:

I can blog this along with the multiple-queue signing (bug 220037) & some other signing tips, once I've tested it out. But I don't want to steal your thunder, if you'd prefer an email broadcast or your own blog. Let me know either way.