| Summary: | CVS is getting progressively slower - change pserver to use shadow data | | |
|---|---|---|---|
| Product: | Community | Reporter: | Kim Moir <kim.moir> |
| Component: | CVS | Assignee: | Eclipse Webmaster <webmaster> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | normal | Priority: | P3 |
| Version: | unspecified | Target Milestone: | --- |
| Hardware: | PC | OS: | Windows XP |
| Whiteboard: | | Attachments: | |
| CC: | andrew.eisenberg, angvoz.dev, aniefer, bokowski, contact, daniel_megert, david_williams, d_a_carver, ed, elias, gunnar, irbull, jkindler, john.arthorne, karl.matthias, marc.khouzam, Olivier_Thomann, pwebster, rrausch, sven.efftinge, tmenzel, tomasz.zarna | | |
Description
Kim Moir
The big culprit here is pserver ... There is so much pserver traffic that it's slowing down (in terms of lock files and disk activity) all of CVS. Unfortunately, because many projects also use pserver for building, I can't make any special class of service for it. The best way to solve both issues (disk performance and locked files from anonymous) would be to serve pserver from a mirror copy of CVS. However, builds would need to run off extssh (or from build.eclipse.org, where I could point pserver connections to the 'live' repo)...

I tried tagging jdt.core projects for the next I-build unsuccessfully after 5 attempts. It keeps failing! Any ETA when this is fixed? Thanks.

Even though many projects use pserver for building (myself included), I think it's pretty fundamental that people should be able to tag their projects and release code. Otherwise, we can't work. It took Dani well over an hour to tag his projects today. It took me over 30 minutes and that failed silently. Last week, I waited until later in the evening (7:30pm) to tag for the nightly build because CVS just failed silently during the day. Olivier is also seeing problems. If the issue is lock files, perhaps some of the projects could have their lock files moved to a new partition and disk to reduce the load on the existing /var/lock/cvs:

kmoir@node4:/home/data/cvs> grep LockDir= */CVSROOT/config
birt/CVSROOT/config:LockDir=/var/lock/cvs
callisto/CVSROOT/config:LockDir=/var/lock/cvs
datatools/CVSROOT/config:LockDir=/var/lock/cvs
dsdp/CVSROOT/config:LockDir=/var/lock/cvs
eclipse/CVSROOT/config:LockDir=/var/lock/cvs
modeling/CVSROOT/config:LockDir=/var/lock/cvs
org.eclipse/CVSROOT/config:LockDir=/var/lock/cvs
rt/CVSROOT/config:LockDir=/var/lock/cvs
technology/CVSROOT/config:LockDir=/var/lock/cvs
tools/CVSROOT/config:LockDir=/var/lock/cvs
tptp/CVSROOT/config:LockDir=/var/lock/cvs
webtools/CVSROOT/config:LockDir=/var/lock/cvs

Can you distinguish so that anonymous pserver goes via a mirror, while pserver with extssh credentials goes direct? Might this explain Bug 291635?

(In reply to comment #3)
> Even though many projects use pserver for building (myself included), I think
> it's pretty fundamental that people should be able to tag their projects and
> release code. Otherwise, we can't work.

I agree -- but while pserver and extssh are served from the same dataset, I cannot guarantee that an anonymous user will not block your tag operation.

> If the issue is lock files, perhaps some of the projects could have their
> lock files moved to a new partition and disk to reduce the load on the
> existing /var/lock/cvs

The issue is not the location of the lock files. The issue is that there are so many pserver operations locking files for a long period of time. Why are pserver connections taking so long? a) because the disk array is so busy, and b) people do weird stuff with pserver (for instance, creating git mirrors, which pounds on pserver for hours/days, etc). At this point, I can 'fix' your tagging issue in less than twelve seconds by pointing pserver to our shadow copy. Your builds may not be thrilled.

(In reply to comment #2)
> I tried tagging jdt.core projects for the next I-build unsuccessfully after 5
> attempts. It keeps failing!
> Any ETA when this is fixed?
> Thanks.

You may need to up your CVS connection timeout setting. I actually have mine set currently to about 5 minutes. I have one big project that can take a while to tag, and that at least lets me get things tagged. Still slow, but no more silent timeouts.
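Since lock contention comes up several times in this discussion, here is a minimal sketch of how an administrator could see which parts of the shared lock directory hold the most active locks at a given moment. This is illustrative only: it assumes shell access to the CVS server and the /var/lock/cvs layout shown in the grep above.

  # Count live CVS lock files (#cvs.rfl*, #cvs.wfl*, #cvs.lock) under each
  # top-level directory of the shared LockDir. Purely illustrative -- run on
  # the CVS server itself, not from a client.
  for d in /var/lock/cvs/*; do
    printf '%s: ' "$d"
    find "$d" -name '#cvs.*' 2>/dev/null | wc -l
  done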
I can easily change my build to use extssh by s/pserver:anonymous/extssh:kmoir/g using an Ant search and replace in the maps after they are checked out. I don't want to build from a pserver mirror because we tag the builder and maps with each integration build. Losing that would mean that our build would no longer be reproducible. Ironically, in the early days of eclipse, we did have a cvs mirror because our European labs had limited bandwidth and could only commit to their local repository which was then mirrored to eclipse. It wasn't pretty :-( Also, I recall that a couple of years ago, we had a delay instituted in our builds to check out the code from cvs 10 minutes after the build started to allow the eclipse.org nodes to synchronize. Another suggestion would be to have a pserver mirror for git mirrors etc on a separate hostname with separate lock files. Actually it might be useful to have this for rsync etc too. Content could be synched there only once a day. The git and rsync mirror traffic wouldn't impact the developers. Just a thought. +1 for the solution proposed in comment 1. SourceForge does it as well. All anonymous CVS pserver access goes to a mirror which is just a replica of data. SourceForge actually does sync only every 24 hours. Builds and committers can continue to run from the latest source. I'm pretty sure that this is possible, even if a build happens *outside* of build.eclipse.org. +1 for comment #1 as well. I've been getting frustrated lately because of the CVS slowness, so I welcome this. Builds would need to use ssh if they want to build against the 'live' CVS. As an exception, I could point pserver from build.eclipse.org to the 'live' CVS since it is used by committers only. Otherwise, *all* pserver access to dev.eclipse.org (except from build) will hit a shadow CVS repo which would be updated regularly. (In reply to comment #10) > Builds would need to use ssh if they want to build against the 'live' CVS. As > an exception, I could point pserver from build.eclipse.org to the 'live' CVS > since it is used by committers only. > > Otherwise, *all* pserver access to dev.eclipse.org (except from build) will hit > a shadow CVS repo which would be updated regularly. +1 for mirroring the public external pserver access. The general public doesn't need real time updates, like committers and builds do. Seems to me that "fails silently" is a whole different class of problem than being slow! Wouldn't that indicate a problem in some client software? The solution in comment #1 would work for us in WTP. The thing to avoid is what we used to have, where committers might commit something, and we'd normally have to be sure to wait 5 or 10 minutes to make sure it got mirrored or replicated and then be "pulled" for the next build ... and occasionally, it would still "miss" (not having mirrored or replicated yet) and we'd have to rebuild. Sometimes the mirrors or replication would take 20 to 30 minutes. Doesn't sound like that would be an issue, if we have the two options, pserver on build.eclipse.org, or ssh externally. It would be nice if the mirror-time would be well understood (I started to say "predictable" but know that might be asking too much :) Are we talking this sort of 10 to 30 minute time frame ... or more like the 24 hours I heard someone mention? But, before we solve the problem ... may I ask ... are we sure we know what the problem is? Seems odd that this seems to have "just started happening" the past few days or week ... 
or have I gotten the wrong impression on duration?

One issue to note and be sure people are aware of ... occasionally a committer on one project may need to "see the latest" from another project in which they are not a committer, such as to respond to an API addition, or similar. I doubt in practice this would be much of an issue ... but people should be made aware of it, be aware of the "time lag", and occasionally adjust their work plan to wait for the change to be visible via pserver.

Quick question to Denis: you mentioned both CVS and SVN in your mail pointing to this bug, but I only read of CVS woes here. Is this affecting (or will it affect) SVN as well, or only the CVS repos?

(In reply to comment #13)
> One issue to note and be sure people are aware of ... occasionally a committer
> on one project may need to "see the latest" from another project in which they
> are not a committer,

That's not an issue. You can connect to *any* CVS repo using SSH and your committer credentials. Everything is readable, you just can't commit (i.e. write).

As we are currently also experiencing lots of problems with building based on SVN (we rarely see a build finish, but get http 504 codes on svn export), I would also be interested in a statement about this.

(In reply to comment #15)
> (In reply to comment #13)
>
> That's not an issue. You can connect to *any* CVS repo using SSH and your
> committer credentials. Everything is readable, you just can't commit (i.e.
> write).

Oh, thanks Gunnar, I didn't know that. It does beg the question though ... why wouldn't everyone just use extssh to get the "higher bandwidth"? I guess we are assuming (hoping?) that the bulk of "pserver abuse" comes from people who are not a committer on any eclipse project. Any reason to think that's true? Does that mean we should ask all committers, even for builds, to use extssh and be good to go, without any mirrors and redirection? (I know, not all builds could easily move to extssh .... I'm just asking questions to further my understanding.)

(In reply to comment #17)
> ... why wouldn't everyone just use extssh to get the "higher bandwidth"?

It's not possible in the current architecture. The issue is not "bandwidth" but "disk activity". Both operate on the same set of files on the same set of discs. That's why Denis proposes to move PSERVER to a different set of discs, i.e. create a mirror. Then EXTSSH will not be affected by PSERVER any more.

(In reply to comment #12)
> Seems to me that "fails silently" is a whole different class of problem than
> being slow! Wouldn't that indicate a problem in some client software?

I've not seen it fail silently, but of course I cannot tell if it really fails silently. What I see (depending on the connection timeout I've set) is that I either have to wait incredibly long (in my case 1h30) or get a timeout.

There are actually *two* issues here which we should discuss separately:
1. the CVS setup issue discussed in this bug here
2. finding out why this has been happening for some days now, i.e. there must be some change in the infrastructure or some special state/service/user that is extremely blocking and needs to get terminated. I've filed bug 293416 to track that part.

Regarding this bug here: we should simply follow comment 1, i.e. delegate pserver to a read-only copy. Those builders who can live with that - fine. The others can simply switch to extssh. However, I'm not yet 100% convinced that this is the only root cause for the current problem we're seeing. Gunnar is right in comment 18.
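To make Gunnar's point about SSH access concrete, a minimal command-line sketch of the two access paths (the committer id and module are just the ones already used as examples in this thread):

  # Anonymous pserver -- the traffic that would move to the shadow copy:
  cvs -d :pserver:anonymous@dev.eclipse.org:/cvsroot/eclipse co org.eclipse.releng.basebuilder

  # Committer read access over SSH -- stays on the live data and works for any
  # repo, even without commit rights ('kmoir' is an example committer id):
  export CVS_RSH=ssh
  cvs -d :ext:kmoir@dev.eclipse.org:/cvsroot/eclipse co org.eclipse.releng.basebuilder

Builds that fetch via releng map files can make the same switch with the search and replace described in comment 7 (corrected later in this thread to use :ext: together with CVS_RSH=ssh).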
The current bottleneck right now is "disk activity". During our busiest periods, the activity lights on our disks no longer blink -- they just stay on. FWIW, pserver is not the sole cause of the problem here, but moving it to a separate data source will:
a) allow extssh to breathe easier from fewer disk seeks
b) prevent pserver abusers from impacting extssh

(In reply to comment #7)
> Another suggestion would be to have a pserver mirror for git mirrors etc on a
> separate hostname

We're setting up a git mirror here at Eclipse.org -- see bug 280583. Actually, bug 280583 comment 87 explains the problem I have with pserver -- this person hammered on pserver CVS for "tens of hours" to get their local git mirror. That is a lot of disk access.

(In reply to comment #14)
> Is this affecting (or will it affect) SVN as well, or only the CVS repos?

pserver is the low hanging fruit right now. Anonymous SVN _should_ also be sourced from a read-only mirror copy, but right now we're not getting nearly as much SVN traffic as we do pserver. It will happen eventually, though.

(In reply to comment #19)
> 2. finding out why this has been happening for some days now, i.e. there must
> be some change in the infrastructure or some special state/service/user that
> is extremely blocking and needs to get terminated.

We haven't changed our infrastructure.

> others can simply switch to extssh. However, I'm not yet 100% convinced that
> this is the only root cause for the current problem we're seeing.

I examined all the current pserver connections, and there were two different IP addresses that had about a dozen (or more) connections each, some of them lingering for a few hours. I blocked both those IPs on our firewall -- perhaps they were legitimate, or perhaps they were up to no good.

> I blocked both those IPs on our firewall
This didn't help.
In addition to tagging issues, I've had problems with hung cvs connections on every build since yesterday. This causes our build to hang or fail. This seems very similar to bug 289408. Could you please investigate?

(In reply to comment #7)
> I can easily change my build to use extssh by
> s/pserver:anonymous/extssh:kmoir/g using an Ant search and replace in the maps
> after they are checked out.

Small correction ... extssh is Eclipse specific. So that'd be s/pserver:anonymous/ext:kmoir/g with export CVS_RSH=ssh being set in the environment or an initial script. Just in case future readers of this bugzilla find it helpful.

> Could you please investigate?
We are... But I cannot reproduce the problem. I was able to see a hang when Daniel Megert was doing something (ie, cvs activity, then several minutes of 'hang' time, followed by some activity) but so far I cannot make this happen.
Try cvs -d :pserver:anonymous@dev.eclipse.org:/cvsroot/eclipse co -r v20091028a org.eclipse.releng.basebuilder
It's a big project - it should generate some hangs. This is also the project I have problems tagging.

I just ran that from commandline CVS from our office in Portland, which has no special connection whatsoever to dev.eclipse.org... It performed flawlessly. I'll keep trying it.

(In reply to comment #26)
> I just ran that from commandline CVS from our office in Portland, which has no
> special connection whatsoever to dev.eclipse.org... It performed flawlessly.
>
> I'll keep trying it.

Works for me just fine, too. I've run it a few times now. I can assure you I have as bad a broadband connection to Eclipse.org as anyone in North America. That particular module is taking between 3m9s and 4m05s to check out, according to the unix time command.

Kim, are you doing this from inside IBM? It may be that IBM's firewalls/proxies/etc are limiting your session. Things are slow for reasons that Denis has already specified. If possible I suggest trying this from somewhere other than IBM to determine if it's really accessing our site that is 100% the problem, or if it's an interaction between pserver being slow and timeouts on your network gear.

(In reply to comment #27)
> (In reply to comment #26)
> > I just ran that from commandline CVS from our office in Portland, which has no
> > special connection whatsoever to dev.eclipse.org... It performed flawlessly.
> >
> > I'll keep trying it.
>
> Works for me just fine, too. I've run it a few times now. I can assure you I
> have as bad a broadband connection to Eclipse.org as anyone in North America.
> That particular module is taking between 3m9s and 4m05s to check out, according
> to the unix time command.
>
> Kim, are you doing this from inside IBM? It may be that IBM's
> firewalls/proxies/etc are limiting your session. Things are slow for reasons
> that Denis has already specified. If possible I suggest trying this from
> somewhere other than IBM to determine if it's really accessing our site that is
> 100% the problem, or if it's an interaction between pserver being slow and
> timeouts on your network gear.

Kim, I also have a very large project (it can take up to 8 minutes to check out), org.eclipse.wst.xml.xpath.processor.tests; it can also take up to 5 to 10 minutes to tag. However, this has always been the case, and I do these all remotely. I've had to increase my cvs timeout option up to 6 minutes for this one project. With this said, I have this project set up on Hudson for building (wst.psychopath), and have never experienced a problem with checking out code during the build. Of course I'm using the local Hudson instance on build.eclipse.org, and not trying to run builds from a remote location. Not sure if any of what I said helps.

So we have two bugs, this one and bug 293416. Both talk about ssh, and this one talks about pserver. As for pserver, I cannot reproduce the problem Kim is seeing, either from our Portland office or from my home computer. I've been trying for days now, either a full checkout with nothing on the local side, or an update. The update typically takes from 6 to 10 minutes.

time cvs -Qd :pserver:anonymous@dev.eclipse.org:/cvsroot/eclipse co -r v20091028a org.eclipse.releng.basebuilder
real 8m23.051s
user 0m4.063s
sys 0m19.275s

I will try no more. pserver is working fine. I will examine extssh in the other bug.
We will leave this bug open to discuss the problem that CVS (both ssh and pserver) is getting progressively slower -- this is a fact that I acknowledge. Short of getting new server (which won't happen) the proposed solution is to point pserver to our read-only data source (except for build.eclipse.org). Before doing this, I will expose this proposal to the committers, since those building remotely from pserver will need to change. Yes, I'm going to try running the checkout from my linux box at home. However, I think it's strange that it worked for years without an issue and now it happens. We don't have a proxy on our firewall at IBM. I'll talk to our network team about this but again I doubt it since we can have http connections open for hours to download stuff. Why would pserver connections be any different? I've configured the Eclipse cvs client timeout on my laptop to a much longer timeout than the default. However, when the real build runs we use pde build to checkout the code and you can't specify a different timeout. Even if there was a way to specify it, it wouldn't help. The way that the build works is that pde build generates fetch scripts for all our bundles and features. Each fetch call issues Ant cvs calls which call the cvs executable on the build machine. The behaviour I'm seeing is that the CVS connection simply hangs and doesn't proceed. When this happens, I have to reissue the cvs command manually for that bundle. Timing out the cvs checkout process will only result in missing content for the bundle. Yes, running the build on hudson could possibly reduce the frequency of this issue. However, our build runs tests on Linux, Windows and Mac and the Foundation hudson server doesn't have any slaves to run tests on these platforms. (We have 11 test machines in addition to our build machine) Conversely, we could run tests on VMmage images on the cloud, however, this requires extensive testing and someone to pay for it:-) In #4 I referenced Bug 291635 whereby the cvs diff behind a Create Patch is malfunctioning. My observations are that mdt/ocl CVS has been temperamental for 2 months. I have only been able to create an MDT/OCL patch by piecing together 4 partial patches. The problem I get is connection reset by peer. While I can follow the recommendation to make my timeout 300 rather than 60 I cannot make the CVS server do the same for itself. Kim, if you can craft a file containing the series of cvs commands that I would need to run to simulate your build, I will try that. Perhaps I'm not banging hard enough on CVS. (In reply to comment #30) > Yes, running the build on hudson could possibly reduce the frequency of this > issue. However, our build runs tests on Linux, Windows and Mac and the > Foundation hudson server doesn't have any slaves to run tests on these > platforms. (We have 11 test machines in addition to our build machine) > Conversely, we could run tests on VMmage images on the cloud, however, this > requires extensive testing and someone to pay for it:-) Or another option. Allow your current build machines that run those additional tests, to be build.eclipse.org Hudson Slaves, and then you only have to have separate jobs that run on the appropriate slave machine once the main job is done. You are then using rsync or something else to move your test files around to the machines. Plus as an added advantage everybody can gain from having access to various platform slaves for builds, helping to reduce the overall load on build.eclipse.org. Just a thought. 
In an attempt to improve this situation (the disconnects, not the slowness) I have overridden the load-balancing mechanism on the Cisco CSS for ssh and for pserver. The default should be working, but it seems that something is wrong and this is the only tweak that makes sense to me to try. Please try pserver on dev.eclipse.org and see if this makes any difference. I've put zip called forDenis.tar.gz on my home directory on dev.eclipse.org. This zip has our generated fetch scripts for our last integration build. I've also included the latest integration build to actually run the build fetch. So extract the tarball cd eclipse There's a script called runbuild.sh It looks like this right now /buildtest/I20091029-0840/jdk1.5.0_14/jre/bin/java -Xmx500m -jar plugins/org.eclipse.equinox.launcher_1.1.0.v20091023.jar -Dosgi.os=linux -Dosgi.ws=gtk -Dosgi.arch=x86 -application org.eclipse.ant.core.antRunner -f ../src/fetch_master.xml -DfeaturesRecursively=true -DbuildDirectory=/home/youdir/build/src Update runbuild.sh to point to your java instead of /buildtest/I20091029-0840/jdk1.5.0_14/jre/bin/java Update buildDirectory to point to the directory where you've extracted this zip and then /src ./runbuild.sh This will fetch all of our code that we need to checkout for our build. No compiling. Note: The tar to run the build is linux.gtk.x86, if you're testing on another platform, you'll have to put another eclipse-SDK there, and it doesn't matter which version you use. Thanks Karl for the change, I'm running several builds now and will see if I see any problems. I talked to our IBM network team and we don't have any timeout limitations on TCP/IP connections, including pserver. However, I'm going to ping our network engineer the next time I see one of the timeouts to allow them to see if there is something unusual happening. Kim, any improvement here since I made that change? I haven't seen any cvs timeouts since you made that change. This morning, at 9:10 we had a timeout which killed our build. I happened to be logged into build.eclipse.org at the time and my console session became unresponsive. I then tried to sync with cvs and saw the same issue - couldn't connect to dev.eclipse.org. I was able to connect to non-eclipse external websites. Then it came back. Then about 10 minutes later, eclipse.org was inaccessible again. Dani saw the same thing at this time because he was trying to make a build submission. Our labs go through completely different firewalls so it's not an IBM issue. Did you see a drop in traffic around this time on your network graphs? (In reply to comment #38) > This morning, at 9:10 we had a timeout which killed our build. I happened to > be logged into build.eclipse.org at the time and my console session became > unresponsive. I then tried to sync with cvs and saw the same issue - couldn't > connect to dev.eclipse.org. I was able to connect to non-eclipse external > websites. Then it came back. Then about 10 minutes later, eclipse.org was > inaccessible again. Dani saw the same thing at this time because he was trying > to make a build submission. Our labs go through completely different firewalls > so it's not an IBM issue. Did you see a drop in traffic around this time on > your network graphs? Depending on how the CVS to GIT imports are going and running, we might be running into some contention there. Personally I'm still not experiencing time outs, but I do notice on Hudson that at certain times, builds may take about 10 minutes longer within the check out steps. 
Not seeing any time outs there, and haven't experienced any time outs locally. Which particular projects are timing out during check out. Is it always the extremely large projects (i.e. the basebuilder?) If so, we could just be hitting the general performance issues that CVS has when repositories reach a certain size and certain amount of meta data. The older the project the more meta data that CVS has to search through to get the relevant information. Particularly when tagging or checking out via tags. (In reply to comment #38) > Did you see a drop in traffic around this time on > your network graphs? I did, I even started to ping dev.eclipse.org at that time and the result was that 1 out 4 or 5 pings timed out. No, it's not always the large projects. Today it was checking out our test feature /cvsroot/eclipse, org.eclipse.test-feature which is very small. Just now, it timed out while fetching org.eclipse.swt.win32.wce_ppc.arm. Again, not a large project. Thanks for the feedback, Kim and others. I do see a drop off on our network graphs at 9:10 this morning (noticeable, but not huge). I also don't see anything wrong on our servers or network gear at that time. I have asked our upstream if they can provide any information about what was happening on their network at that time. (In reply to comment #42) > Thanks for the feedback, Kim and others. I do see a drop off on our network > graphs at 9:10 this morning (noticeable, but not huge). I also don't see > anything wrong on our servers or network gear at that time. I have asked our > upstream if they can provide any information about what was happening on their > network at that time. Our ISP has told us that they experienced network gear failure this morning beginning at at least 8:40am Eastern. They replaced a dead card in a router and traffic was marginally interrupted until about 11:15am. Routes are still converging for some locations and the load is higher on the other peer routers, so there may be some residual slowness today. I am working with them now to find out more specifics about what happened and when they first noticed issues. I'll let you know when we have more info. (In reply to comment #40) > I did, I even started to ping dev.eclipse.org at that time and the result was > that 1 out 4 or 5 pings timed out. ping will not work reliably on dev.eclipse.org, since it's the same cluster of servers that handle download.eclipse.org traffic. ICMP echo packets there are about the lowest class of traffic possible, so when our bandwidth is saturated (which it almost always is) those reply packets will simply never leave our network. Note the difference between these two: ping node1.eclipse.org <-- the entire host is high-priority ping dev.eclipse.org <-- ICMP echo is 'garbage' bandwidth, unlike SSH and pserver Actually, Karl, perhaps we should just block all ICMP echo traffic from dev/download, to avoid these types of incorrect interpretations? (In reply to comment #35) > I've put zip called forDenis.tar.gz on my home directory on dev.eclipse.org. Thanks, Kim, I'll give this a try. I'm using SCP at the Portland office to receive this file from dev, at a rate-limited 100 Kbits/sec. ETA is over seven hours. If that computer can maintain an SSH connection open for 7 hours.... (In reply to comment #45) > Actually, Karl, perhaps we should just block all ICMP echo traffic from > dev/download, to avoid these types of incorrect interpretations? Done. Please keep me updated on timeouts status (yesterday's outage at our ISP aside). 
*** Bug 294886 has been marked as a duplicate of this bug. *** I'm planning on making this happen in 30 days: as of Friday, Dec. 11, anonymous pserver from the world wide internet will be served from a shadow copy of CVS. pserver from build.eclipse.org will continue to function as is now, from the live data. > pserver from build.eclipse.org will continue to function as is now, from the
> live data.
Precision: anonymous pserver CVS from within the Eclipse.org firewall will remain unchanged (ie, it will be served from the live data). This includes dev.eclipse.org itself, build.eclipse.org, and all the project vservers.
(In reply to comment #48) > I'm planning on making this happen in 30 days: as of Friday, Dec. 11, anonymous > pserver from the world wide internet will be served from a shadow copy of CVS. > pserver from build.eclipse.org will continue to function as is now, from the > live data. I think this is a good idea ... but, am (still) wondering, what, approximately, would be the "delay" in the shadow copy matching matching the live copy? A few minutes? Hours? Overnight? Just curious at this point. Right now our shadow CVS is updated three times a day: 10:45, 16:45 and 22:45 ET. (In reply to comment #51) > Right now our shadow CVS is updated three times a day: 10:45, 16:45 and 22:45 > ET. What effect will this have on contributors? I'm just thinking of myself last year when I wasn't on the p2 team but I was working on bugs, attending meetings, and submitting patches. If I had to work off a shadow CVS that was only updated 3 times a day, I think this would have severely handicapped my ability to work. This might be a small population of people, but we may need to find a solution to this. (i.e. You can get request access to the main server for example). This might also be a problem for cross component work. For example, I have write access to p2, but read access to the workbench. If I'm waiting on a change in the workbench (so I can update something in the p2 UI), then I may have to wait for the servers to sync. I'm just playing devil's advocate here. (In reply to comment #52) > This might also be a problem for cross component work. For example, I have > write access to p2, but read access to the workbench. I believe if you have ssh access to eclipse.org you can create and use SSH connections to any CVS repository hosted at Eclipse. You don't need CVS write access to have an SSH connection. > This might also be a problem for cross component work. For example, I have
> write access to p2, but read access to the workbench.
If you have a committer account, you can use SSH for all your repos, even the ones you don't have commit access on. You'll always be up to date.
As for contributions, the best I can do is to hook into the commitinfo and (attempt) to sync CVS portions that change as they change. In theory, this means pserver's shadow copy will only be mere minutes behind. In practice, well, we know these types of things may break.
Unfortunately, this is where it's at: I have received enough committer complaints that CVS is not performing well that I need to do something about it. Any anonymous service is bound to be (ab)used for all kinds of reasons -- git cloning entire repos, which can take days, or yanking everything out of CVS one commit at a time for university research -- and currently that is impacting performance for committers.
I'm open to other suggestions, but this seems like low-hanging fruit to me.
(In reply to comment #54) > I'm open to other suggestions, but this seems like low-hanging fruit to me. I vote we try it the easy way. If it turns out that it's making life hard for some people then we try a more complicated solution. My $.02. (In reply to comment #54) > If you have a committer account, you can use SSH for all your repos, even the > ones you don't have commit access on. You'll always be up to date. > John, Denis, thanks... we should just make sure committers know this and they are using their ssh access especially if they consume other components. > I'm open to other suggestions, but this seems like low-hanging fruit to me. I agree, I just wanted to raise this issue. Even in my particular case last year I was already an Eclipse committer, so I could be up-to-date if I used my ssh access on p2. (In reply to comment #54) > As for contributions, the best I can do is to hook into the commitinfo and > (attempt) to sync CVS portions that change as they change. In theory, this > means pserver's shadow copy will only be mere minutes behind. In practice, > well, we know these types of things may break. There are thousands of projects on SourceForge and they have a similar setup. The anonymous CVS is actually up to 24 hours behind, i.e. they only sync once a day. But AFAIK it never really affects contributions. Of course, it doesn't make it easier. > Unfortunately, this is where it's at: I have received enough committer > complaints that CVS is not performing well that I need to do something about > it. Any anonymous service is bound to be (ab)used for all kinds of reasons -- > git cloning entire repos, which can take days, or yanking everything out of CVS > one commit at a time for university research -- and currently that is impacting > performance for committers. I think it's a fair trade-off. We must ensure that any SCM works reliable for committers on a daily base. Full-time contributors is a different story, though. But compared to the majority of activity the other random anonymous traffic is causing I think we have no other option. (In reply to comment #54) > I'm open to other suggestions, but this seems like low-hanging fruit to me. Will you be able to measure the improvement in some way? If so ... who's going to run the pool? I'll put my wager on "less that 10%". Ha ha. Just kidding. I think it will make a noticeable improvement. But, it would be nice to have some sort of test case to know if it helps or not. > Will you be able to measure the improvement in some way? I tend to do that before suggesting a solution, to make sure it's the right solution. For instance, right now When I SIGSTOP all anonymous pserver connections on dev.eclipse.org, IOWait on the nfs server drops. When I SIGCONT, IOWait increases. Likewise, for pserver, yesterday I checked out org.eclipse.babel: 5 minutes. From the shadow copy: 8 seconds. pserver users will be thrilled. As for ssh, something simple like this, run both before and after may do: export CVS_RSH=ssh time cvs -d :ext:droy@dev.eclipse.org:/cvsroot/technology co org.eclipse.babel But our server load fluctuates heavily during the day, making such tests difficult. [Why is this only normal priority?] After using Wireshark to observe why it is very difficult to do multiple file commits, it is clear that the recommendation to use extssh and set timeouts to 300 is just not enough. The current server is unable to handle the TCP protocol stack with the result that there is a replacement load of TCP retransmit traffic. 
If you hit a bad patch you back off from 2 to 5 to 10 to 20 ... seconds between retries, and the server end imposes a 60 second overall timeout. I am finding that committing 100 files, sometimes one at a time, is a mega time waster. CRITICAL, not normal.

Could you please attach your wireshark output or packet dump?

Created attachment 152678 [details]
CVS Compare Timeout
Attached is a slightly simpler scenario. A timeout on a single file Synchronize resulting from a Compare File with Head at 6:59 GMT today.
The timeout is the last packet, where the TCP stack has got fed up with a lack of response, long before the 300 configured connection timeout.
(Eclipse 3.5)
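For anyone who would rather capture such a trace from the command line than from Wireshark, a rough tcpdump equivalent (illustrative only: eth0 is a placeholder interface name, and 206.191.52.50 is the dev.eclipse.org address used as a capture filter later in this bug):

  # Capture all traffic to/from dev.eclipse.org into a pcap for later analysis.
  tcpdump -i eth0 -s 0 -w cvs-session.pcap host 206.191.52.50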
Created attachment 152709 [details]
Screenshot of Ed's pcap in Wireshark
Ed, thanks for the pcap.
Upon opening it with Wireshark, it became apparent why you are experiencing a timeout. All of the protocol header checksums of packets *you* are sending are 0x0000. This is not right. Our firewall will tolerate this for a while, but afterwards it may stop accepting your broken packets for any number of valid reasons.
Most http/mail/other clients can live in this error state forever and never notice, since their connections to us are short-lived, but when you try to sustain a connection by sending broken packets, you can get strange results.
If I were to guess, I would likely blame your Netgear router.
Created attachment 152710 [details]
Capture of Denis' computer
As a point of reference, here is a pcap I did from my computer. I'm also going across a Linksys router, but its firmware is up to date, and my packet checksums are present.
Please check your network stack, update the firmware on your network gear, and make sure your network drivers are up to date.
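If you want to rule out checksum offloading while checking the stack, a quick way to inspect and toggle it; this is a Linux sketch only (eth0 is a placeholder interface), and on Windows Vista it is a per-adapter driver property, as Ed notes below:

  # Show current offload settings, then disable TX checksum offload for a test.
  ethtool -k eth0 | grep -i checksum
  ethtool -K eth0 tx off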
(In reply to comment #63)
> Upon opening it with Wireshark, it became apparent why you are experiencing a
> timeout. All of the protocol header checksums of packets *you* are sending are
> 0x0000. This is not right. Our firewall will tolerate this for a while, but
> afterwards it may stop accepting your broken packets for any number of valid
> reasons.

I'm told that this is a Wireshark feature. It snoops the outgoing packets before the checksum is calculated.

> I'm told that this is a Wireshark feature. It snoops the outgoing packets
> before the checksum is calculated.

It's not so much a Wireshark feature as it is your network driver's ability to defer protocol checksums to the NIC itself. If you have this enabled, it essentially makes your pcap useless, since your capture cannot really capture the fact that you may be transmitting bad packets. Furthermore, if your network stack defers checksumming to your NIC and your NIC is not performing the checksums, you have no way of knowing.

From here, I don't believe our servers or network gear are the problem, otherwise *everyone* would have SSH disconnects and timeouts like you are experiencing. If you'd like, please contact me at webmaster@eclipse.org with your internet-facing IP address (http://whatismyip.org), and I can arrange a pcap of your SSH session at our server. This will help determine if there is a problem with your header checksums or not.

Created attachment 152752 [details]
CVS Create Patch Timeout
Ok. I've learnt to switch off Vista checksum offloading and to enable TCP checksum validation in Wireshark.
At first I thought wow it's cured; but it isn't. The 'bad' checksums at my end were a red herring.
Attached shows at least 3 retransmission timeouts during a Create Patch. The resulting patch appeared to have been correctly created, but applying it in reverse shows that half of the files are missing.
No pop-up dialog.
No messages in the Error Log.
CVS Console has a number of:
The server reported an error while performing the "cvs diff" command which may only indicate that a difference exists. (took 0:15.023)
Error: org.eclipse.ocl.uml.edit: The server did not provide any additional information.
and
failed due to an internal error (took 0:39.936)
Error: java.net.SocketException: Software caused connection abort: socket write error
Ed, thanks for the follow-up. The new pcap is very helpful. I can clearly see at 77.15 our last transmission to you, then the retransmits from you at 78.3, 80.8 and 85.79, 95.6, then 115.3. Finally, at 154.6, your computer gives up and resets the connection. You then issue a SYN, which we immediately ACK, and life carries on momentarily.

It definitely seems like your connection was simply dropped along the way. When it was reset, creating a new connection, it was welcomed with open arms. Strangely, you were able to submit a 1.6M attachment to bugzilla likely without any of this.

Karl, can you find anything in the logs that would indicate what happened? Did we hit a limit somewhere? I'm also a bit concerned with the "TCP segment of a reassembled PDU" messages.

FWIW, Ed, you can re-enable offloading checksums to your NIC. It does free up CPU time on your computer.

Created attachment 152758 [details]
Screenshot of 77-157
Screenshot of the above analysis, in case anyone else wants to see.
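For readers without the screenshots, the same events can be pulled out of a capture on the command line. A small sketch using tshark, Wireshark's command-line front end (capture.pcap is a placeholder file name; older tshark versions use -R instead of -Y):

  # List retransmissions and connection resets in a saved capture.
  tshark -r capture.pcap -Y "tcp.analysis.retransmission || tcp.flags.reset == 1"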
Thanks, Ed for posting that. (In reply to comment #68) > I can clearly see at 77.15 our last transmission to you, then the retransmits > from you at 78.3, 80.8 and 85.79, 95.6 then 115.3. Finally, at 154.6, your > computer gives up and resets the connection. You then issue a SYN, which we > immediately ACK and life carries on momentarily. Yep, there is packet loss. Problem is that with only one half of the capture we can't see where that loss happened. > It definitely seems like your connection was simply dropped along the way. > When it was reset, creating a new connection, it was welcomed with open arms. Well there was packet loss serious enough to cause a reset. It's hard to say where that happened at this point. > Strangely, you were able to submit a 1.6M attachment to bugzilla likely without > any of this. Yes, you're right, that's interesting. Makes it unlikely that it's a general network problem. > Karl, can you find anything in the logs that would indicate what happened? Did > we hit a limit somewhere? No limits were hit today, and no error messages that look at all suspicious. I also don't have Ed's outside IP address, all I have is the NAT address on ths inside of his firewall (Netgear) from the capture, so it makes it hard to search on that address alone. Ed, if you can supply that it might help, but I've been through the logs and I don't see anything, though it's a big haystack. > I'm also a bit concerned with the "TCP segment of a reassembled PDU" messages. Those are normal and are generated by Wireshark to show the exact arrival of pieces of the application level packet. The packet is re-assembled including all the fragments, so without that message you might see them twice. ** Ed, some important questions, if you can help. 1. What is the MTU size on your router? 2. Are you on a DSL line? 3. What is the MTU size on your system? 4. What filter did you use when capturing? (Would it have filtered out ICMP messages that you might have received in relation to this connection?) What I'm trying to figure out is if the fact that the SSH packets have the "Do Not Fragment" flag set is causing a problem because of restricted MTU along the path. On our side MTU is universally 1500 so it's not on our end, but if you're on DSL you might have a smaller MTU and large packets are causing breakdown. Also cable systems which allow greater than 1500 MTU can cause some issues as well. Thanks again for your help. Matt says he can reproduce this, so we'll work on it next week. (I have not been able to do so) > > Strangely, you were able to submit a 1.6M attachment to bugzilla likely without > > any of this. The problems occur persistently only with CVS; my recent 'patch' on Bug 295316 was hand assembled from five sub-patches since the full Create Patch was impossible, and even then some of the sub-patches failed. For any Create Patch I repeat until two attempts create the same largest file size before even bothering to check a reverse Apply Patch. > ** Ed, some important questions, if you can help. > > 1. What is the MTU size on your router? 1458 > 2. Are you on a DSL line? Yes; 80.229.165.239 right now, which is not yesterday. Usually 3-4 MBit/s. > 3. What is the MTU size on your system? 1500 > 4. What filter did you use when capturing? (Would it have filtered out ICMP > messages that you might have received in relation to this connection?) 
ip.src==206.191.52.50 or ip.dst==206.191.52.50 > > > What I'm trying to figure out is if the fact that the SSH packets have the "Do > Not Fragment" flag set is causing a problem because of restricted MTU along the > path. On our side MTU is universally 1500 so it's not on our end, but if > you're on DSL you might have a smaller MTU and large packets are causing > breakdown. Also cable systems which allow greater than 1500 MTU can cause some > issues as well. Thanks again for your help. Matt says he can reproduce this, > so we'll work on it next week. (I have not been able to do so) I've no idea why my router is 1458; it's own help says don't change from 1500 unless you're sure. It has just been that way for maybe five years. Wireshark shows: Outgoing (from me) TCP has settled on 1380. Incoming TCP has settled on 1360. It seems that link negotiation has worked. ping google.com -l 1430 is the largest that doesn't timeout [1472 bytes on wire]. 1472 = 14 (Ethernet) + 20 (IP) + 8 (ping) + 1430 (data) so '1458' is going out through the ISP (the reply is 64 bytes data). (In reply to comment #63) > Created an attachment (id=152709) [details] > Upon opening it with Wireshark, it became apparent why you are experiencing a > timeout. All of the protocol header checksums of packets *you* are sending are > 0x0000. This is not right. Our firewall will tolerate this for a while, but > afterwards it may stop accepting your broken packets for any number of valid > reasons. > > Most http/mail/any clients they can live in this error state forever and never > notice since their connections to us are short-lived, but when you try to > sustain a connection by sending broken packets, you can get strange results. > > If I were to guess, I would likely blame your Netgear router. Probably off topic. My SSH Shell sessions frequently lose their connection to build.eclipse.org. I just assumed, for many years, this was normal or a function of using Putty (from windows) or my ISP dropping service for a few milliseconds. It has been easy enough to reconnect to my 'screen' session so never thought it might actually be something that's fixable. Now you are making me think maybe there is something wrong with the packets I am sending! I have recently noticed it happens much less frequently in my Shell sessions to IBM's network. This wireshark talk sounds over my head ... is there any simple way to comprehensively check the health of my network communication? What is a typical "up time" I should expect from an SSH Shell session? Mine averages several hours. Just ignore this comment if too broad and too off topic for this specific bugzilla. Thanks, > My SSH Shell sessions frequently lose their connection to build.eclipse.org. I > just assumed, for many years, this was normal or a function of using Putty > (from windows) or my ISP dropping service for a few milliseconds. It has been > easy enough to reconnect to my 'screen' session so never thought it might > actually be something that's fixable. Our firewall will terminate your session (of anything, not just SSH) if it is inactive for several minutes. If there is a constant stream of activity, such as tailing a log file, it shouldn't be disconnected. > What is a typical "up time" I should expect from an SSH Shell session? Mine > averages several hours. I think you're doing quite well. From home I usually get disconnected a couple of times/day due to inactivity. I usually just run a top d 30 to keep it alive. 
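A client-side alternative to keeping a command like top running is to have the SSH client send keepalive probes, so the firewall never sees the session as idle. A sketch for OpenSSH (~/.ssh/config); PuTTY exposes the same idea as its "seconds between keepalives" setting:

  # Send a keepalive every 60 seconds; give up after 3 unanswered probes.
  # Host names are the ones discussed in this bug.
  Host dev.eclipse.org build.eclipse.org
      ServerAliveInterval 60
      ServerAliveCountMax 3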
We could (should) probably increase the idle timeout of SSH specifically, since many of us need to maintain an open shell for the entire day. (In reply to comment #71) > The problems occur persistently only with CVS; my recent 'patch' on Bug 295316 > was hand assembled from five sub-patches since the full Create Patch was > impossible, and even then some of the sub-patches failed. For any Create Patch > I repeat until two attempts create the same largest file size before even > bothering to check a reverse Apply Patch. I'd suggest trying a long SCP copy to see if you hit the same issue. That will help determine if it's CVS or SSH that is the issue. For security reasons SSH sets the "Do Not Fragment" flag which many other protocols do not. > > 4. What filter did you use when capturing? (Would it have filtered out ICMP > > messages that you might have received in relation to this connection?) > ip.src==206.191.52.50 or ip.dst==206.191.52.50 Ok, that's not filtering ICMP. Thanks. > I've no idea why my router is 1458; it's own help says don't change from 1500 > unless you're sure. It has just been that way for maybe five years. Well if you're on ADSL and you're running PPPoE (usually, but not always), then the MTU is restricted by the PPP overhead. I believe the router will automatically drop it down if you're running PPPoE. The problem is that the desktop doesn't know about the MTU difference. It can negotiate this down, but ideally you'd want the desktop to match the router. Try first the SCP I mention above to see if you have the same symptoms. Then try setting your system MTU down to 1458 and see if life gets happier. > Wireshark shows: Outgoing (from me) TCP has settled on 1380. Incoming TCP has > settled on 1360. It seems that link negotiation has worked. Yes, that should work. But it could cause some retransmissions at times and given the general slowness of disk access to CVS it might exacerbate the issue. > ping google.com -l 1430 > > is the largest that doesn't timeout [1472 bytes on wire]. > > 1472 = 14 (Ethernet) + 20 (IP) + 8 (ping) + 1430 (data) > > so '1458' is going out through the ISP (the reply is 64 bytes data). Thanks for doing that. Hopefully one of the two tests mentioned above will shed some light. In the meantime I'm going to bug Matt and see if he can get this to happen again. Using SCP. The first long attempt hit a retransmit repeat timeout immediately. A short attempt then worked without problem. Another long attempt worked but exhibited many retransmit interludes; it seems to start doubling from a lower base and so never actually exceeded the connection timeout. Set system MTU to 1458. SCP worked but again with many retransmit interludes. [If you want some 17 MB pcaps, I can send them.] To confirm; there were no ICMP messages during the SCPs. The SCP transfer direction is different, so whereas with the CVS Synchronize the client was retransmitting packets to the client (after failing to receive an ACK), with SCP the server is retransmitting to the client (after failing to process the ACK which Wireshark shows was sent). Looks like periodic packet loss on the server. (In reply to comment #54) > As for contributions, the best I can do is to hook into the commitinfo and > (attempt) to sync CVS portions that change as they change. I've changed all the CVSROOTs to report commit changes to a commit log. I've also crafted the script which will parse the commit log and sync those changes to the shadow copy. 
For now, I'm running the sync process entirely manually so that I can catch any hiccups as they occur. Once this is in place, the pserver data will be in almost perfect sync with the 'live' copy. For what it's worth, I can't reproduce the symptoms Ed is mentioning. I even signed up for an Amazon EC2 instance in Europe, and I can bounce 36MB transfers to Eclipse.org without error. We do our builds of AJDT on ajdt.eclipse.org. Currently, we access CVS anonymously. Will the changes here mean that we need to start accessing CVS from ajdt.eclipse.org via extssh? Nothing will change for anonymous access from within our firewall. That includes project vservers such as ajdt.eclipse.org. You have nothing more to do. > I'm planning on making this happen in 30 days: as of Friday, Dec. 11, anonymous
> pserver from the world wide internet will be served from a shadow copy of CVS.
Just a reminder that this change will occur Friday. Right now, the shadow copy of CVS is being synced immediately after a commit, so its data is very near 'live'.
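For readers curious how an "immediately after a commit" sync like this is typically wired, a rough sketch follows. The actual Eclipse.org hook and script are not shown in this bug, so the loginfo line, log path, source path, and rsync destination below are illustrative assumptions (and loginfo format strings vary by CVS version):

  # CVSROOT/loginfo entry (illustrative): record the repository-relative
  # directory of every commit in a shared log file.
  #   ALL (echo %p >> /var/log/cvs-commit-dirs.log)

  #!/bin/sh
  # Companion sync script (illustrative): push each directory named in the
  # commit log to the shadow copy, then clear the processed log.
  LOG=/var/log/cvs-commit-dirs.log
  SRC=/home/data/cvs                 # repository parent dir, as in the LockDir grep above
  SHADOW=shadowhost:/shadow/cvsroot  # assumed shadow destination
  sort -u "$LOG" | while read dir; do
    rsync -a "$SRC/$dir/" "$SHADOW/$dir/"
  done
  : > "$LOG"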
The change has been made. Overall, CVS should definitely be a happier place. pserver will inherently get a major performance boost from being served from a shadow data source.

Denis, is the pserver mirror supposed to be in sync with extssh as of now? Because the following file
http://dev.eclipse.org/viewcvs/index.cgi/pde-incubator/modeling/plugins/org.eclipse.pde.emfforms/src/org/eclipse/pde/emfforms/editor/EmfFormEditor.java?view=log
should be at version 1.34 (with a commit made on Dec. 8th), but the pserver mirror (and ViewVC, by the way) is still exposing version 1.33. I don't feel comfortable with reopening the bug, but obviously something must be wrong somewhere, no? Thanks! Ben

I'm expecting some glitches like this at first, since this sync routine is not very mature. I'll fix this.

Denis, just wondering where the lock file directories are for pserver vs extssh now. I see in the /cvsroot/*/config files, it's listed as /var/lock/cvs. The reason I'm asking is that Andrew Niefer is seeing permission denied errors on lock files when trying to check out code via pserver.

The lock files are in /var/lock/cvs for both pserver and extssh, but those are each hosted on different servers, so the file systems are not the same. Can you send me (or paste) the error?

I was a little overzealous, I'm actually getting a warning:
cvs checkout: warning: cannot write to history file /cvsroot/rt/CVSROOT/history: Read-only file system
The check out actually succeeds, I just wasn't getting the content I expected because pserver is not in sync with the latest.

I definitely see an improvement on build.eclipse.org. Builds that were taking a long time during the day are now averaging about a 10 to 15 minute performance boost. I have two projects that check out one HUGE bundle (27800+ files), and that would take up to 8 minutes to completely check out or scan. That seems to be getting done now in about half that time, maybe more. If CVS was hit hard, it could easily double that check out time for just that one bundle.

(In reply to comment #86)
> The check out actually succeeds, I just wasn't getting the content I expected
> because pserver is not in sync with the latest.

As it turns out, the loginfo facility I'm using does not 'see' tags and branches, so that data does not propagate immediately. I've opened bug 297750 for this.

In comment 70: "Thanks again for your help. Matt says he can reproduce this, so we'll work on it next week. (I have not been able to do so)" Any progress? If anything the problems are now worse. I have been trying to create a 200-file patch and failing abysmally. One 500kB file just repeatedly times out while comparing.

(In reply to comment #89)
> Any progress?

Unable to replicate that particular issue.

I know this bug is resolved, but thought I'd ask here if others are seeing cvs kind of slow the past few weeks? I think it "comes and goes" but at times, it does seem to take "hours" to commit code that should take a few minutes (maybe like a 10-fold difference in times). I would not reopen (or open a new bug) if only me and another wtp committer were seeing the issue ... might be a fluke ... but if others are also, maybe it deserves some investigation? We are seeing it using extssh, from the "public" internet (not a build machine, or anything). Mostly notice it on commits, but sometimes "compares" also. I don't check out that much code :) but seems to be fairly normal and consistent for checkouts.

(In reply to comment #91)
David, I have seen problems with SVN in the past weeks, too.
But for me it was never a commit problem. Our build checks out the full product sources into a clean work area while building, and it seems like the SVN server was sometimes so loaded that we got timeout errors during propfind and other SVN operations. I somehow expected that, because we are approaching the Galileo SR2. Of course it has been much worse in November 09 and I'm happy to see that performance improved since then, but it does not feel like we are back to normal yet ...

(In reply to comment #91)
> I know this bug is resolved, but thought I'd ask here if others are seeing cvs
> kind of slow the past few weeks? I think it "comes and goes" but at times, it
> does seem to take "hours" to commit code that should take a few minutes (maybe
> like a 10-fold difference in times). I would not reopen (or open a new bug) if
> only me and another wtp committer were seeing the issue ... might be a fluke
> ... but if others are also, maybe it deserves some investigation? We are seeing
> it using extssh, from the "public" internet (not a build machine, or anything).
> Mostly notice it on commits, but sometimes "compares" also. I don't check out
> that much code :) but seems to be fairly normal and consistent for checkouts.

I can second that, although it's synchronization which is slow for us. Synchronizing my workspace takes about 20-40 minutes. I'm not sure if this can be solved at all or whether our project set is just too large for CVS. Anyway, we're considering migrating to git.

I've seen this issue too. Both synchronization and checkouts.

Same here. We just changed our build from anonymous pserver to ssh with a committer id and it has become slower.

Someone reported that they got inconsistent check-outs off the mirror CVS. Is it possible that the CVS mirror would take a snapshot right in the middle of someone's commit, and therefore miss part of it and yield code that won't compile?

A sync is triggered by any 'write' operation -- commit, tag, branch, etc. It is entirely possible that someone tags an entire tree, which will trigger a sync which can take minutes to complete. If you happen to do a checkout or update during that time, you may have an inconsistent state.

(In reply to comment #97)
> A sync is triggered by any 'write' operation -- commit, tag, branch, etc. It
> is entirely possible that someone tags an entire tree, which will trigger a
> sync which can take minutes to complete. If you happen to do a checkout or
> update during that time, you may have an inconsistent state.

So the syncs are done pretty often? Not only at select times, like in comment #51 which said:
> Right now our shadow CVS is updated three times a day:
> 10:45, 16:45 and 22:45 ET.
So, if someone gets a checkout that does not compile, they could simply synchronize with the repo right away and it should fix it, right?
> synchronize with the repo right away and it should fix it, right?
It _should_. Unfortunately, the CVS trigger mechanisms are sometimes flaky, and may not pick up the entire transaction. That's why we do a complete (not partial) sync three times a day.
Just to cross-reference... When build.eclipse.org connects to pserver at over 30 connections per second, it's not surprising that things can slow down a bit... See bug 302572 comment 12 |