| Summary: | Hudson no longer triggering builds via cvs changes | ||
|---|---|---|---|
| Product: | Community | Reporter: | David Williams <david_williams> |
| Component: | CI-Jenkins | Assignee: | Eclipse Webmaster <webmaster> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | major | ||
| Priority: | P3 | CC: | d_a_carver, kim.moir, konstantin, mknauer |
| Version: | unspecified | ||
| Target Milestone: | --- | ||
| Hardware: | PC | ||
| OS: | Linux | ||
| Whiteboard: | |||
|
Description
David Williams
I can't seem to run builds on slave2, but slave1 seems to be working. On slave2, I get "failure to determine" because my cvs checkouts don't work.

I've gone digging through the Hudson logs and I don't see any errors reported for this job. Does it 'work' if you force a build? Slave2 appears to have gotten stuck, so I restarted it.

-M.

> ... Does it 'work' if you force a build?

Yes, it runs just fine when "manually" started.
As some more data points, Dave C. said on cross-project that some of his cvs-triggered jobs are triggered correctly, which led me to look at them. (I realized I didn't mention "recently" ... even my builds used to work fine, even after all the big changes to the build server and hudson.) But it gave me a hint of where to look. Some of the jobs he mentions also show a simple, lonely message of

Started on Oct 29, 2010 5:16:44 PM

For example, see
https://hudson.eclipse.org/hudson/view/WTP/job/cbi-wtp-wst.xsl.psychopath/scmPollLog/?
and
https://hudson.eclipse.org/hudson/view/WTP/job/cbi-wtp-wst.xsl/scmPollLog/?

But one of them has a more normal-looking log: the xml cvs polling log looks "normal". See
https://hudson.eclipse.org/hudson/view/WTP/job/cbi-wtp-wst.xml/scmPollLog/?

It starts with

Started on Nov 1, 2010 12:06:44 PM
[org.eclipse.wst.dtd.core] $ cvs -q -z3 -n update -PdC -D "Monday, November 1, 2010 4:06:45 PM UTC"
cvs update: New directory `src/org/eclipse/wst/dtd/core/builder' -- ignored
cvs update: New directory `src/org/eclipse/wst/dtd/core/content' -- ignored
cvs update: New directory `src/org/eclipse/wst/dtd/core/contenttype' -- ignored
cvs update: New directory `src/org/eclipse/wst/dtd/core/document' -- ignored
cvs update: New directory `src/org/eclipse/wst/dtd/core/encoding' -- ignored
cvs update: New directory `src/org/eclipse/wst/dtd/core/event' -- ignored
cvs update: New directory `src/org/eclipse/wst/dtd/core/internal/builder' -- ignored
cvs update: New directory `src/org/eclipse/wst/dtd/core/internal/rules' -- ignored
cvs update: New directory `src/org/eclipse/wst/dtd/core/modelhandler' -- ignored
cvs update: New directory `src/org/eclipse/wst/dtd/core/parser' -- ignored
cvs update: New directory `src/org/eclipse/wst/dtd/core/rules' -- ignored
cvs update: New directory `src/org/eclipse/wst/dtd/core/text/rules' -- ignored
cvs update: New directory `src/org/eclipse/wst/dtd/core/tokenizer' -- ignored
cvs update: New directory `src/org/eclipse/wst/dtd/core/util' -- ignored
[org.eclipse.wst.dtd.ui] $ cvs -q -z3 -n update -PdC -D "Monday, November 1, 2010 4:06:45 PM UTC"
[...]

So ... I do think something is hung somewhere? I've tried several things: disabling then re-enabling the job, changing the interval, even removing the trigger then re-adding it, all in an attempt to get the job to "reset" itself. But it still doesn't trigger, and still says "Started on Oct 29, 2010 5:16:44 PM".

So, I'd like to ask that the server be restarted, to see if that fixes it.

A couple of other things admins can check/try first, if you'd like.

A. You mentioned nothing in logs; are there any processes running by hudsonBuild with a "cvs" command? I tried "ps -ef | grep hudson" but apparently I'm no longer on the 'build' machine ... so, I can't query the processes.

A2. I know that cvs has some settings that allow only a certain number of cvs processes to be started to check if there are changes, and I know at some point in the past that was set "low" to see if it helped other odd behaviour ... so I'm just wondering if a certain number of those processes are hung, while some continue to run?

B. Well, I was going to suggest deleting /shared/jobs/indigo.runAggregator/scm-polling.log but the more I think about it, I'm pretty sure that's really just a log, not involved with triggering anything.

So, I think a restart is the next thing to try, unless anyone has a better idea?

One other thing I've seen happen with CVS on hudson is that the jobs seem to go to the public pserver instance, instead of pulling from inside Eclipse. It can take up to a minute for changes to be picked up at times. David, you might want to try polling from local: instead of pserver to see if that makes any difference.

(In reply to comment #5)
> So, I'd like to ask that the server be restarted, to see if that fixes it.

Ok. I've flagged the system for shutdown and when the emf-cdo job finishes I'll restart the service.

> A. You mentioned nothing in logs, is there any processes running by hudsonBuild
> with "cvs" command?
No.

-M.

(In reply to comment #7)
> (In reply to comment #5)
>
> > So, I'd like to ask that the server be restarted, to see if that fixes it.
>
> Ok. I've flagged the system for shutdown and when the emf-cdo job finishes
> I'll restart the service.
>
> > A. You mentioned nothing in logs, is there any processes running by hudsonBuild
> > with "cvs" command?
>
> No.
>
> -M.

That CDO job seems stuck. It's been running 15 hrs; a normal build takes 37 mins. We really need that automatic kill plugin for Hudson, for when a job runs longer than a specified period of time.

Could you please restart Hudson so I can run some test builds? As Dave said, the CDO job seems stuck: it has been running for 17 hours, so it could be killed.

The restart seemed to "fix" this immediate issue, so closing as fixed, though I'm not sure what the longer term issue is.

Not only did my builds kick off, but some of the others mentioned in comments now have normal-looking cvs logs, instead of saying just "started". For example,
https://hudson.eclipse.org/hudson/view/WTP/job/cbi-wtp-wst.xsl/scmPollLog/?

starts with

Started on Nov 1, 2010 6:41:27 PM
[org.eclipse.wst.xsl.releng] $ cvs -q -z3 -n update -PdC -D "Monday, November 1, 2010 10:41:27 PM UTC"
? target
[org.eclipse.wst.xsl.repository] $ cvs -q -z3 -n update -PdC -D "Monday, November 1, 2010 10:41:27 PM UTC"
? target
[org.eclipse.wst.xsl.feature] $ cvs -q -z3 -n update -PdC -D "Monday, November 1, 2010 10:41:27 PM UTC"
? target
[org.eclipse.wst.xsl] $ cvs -q -z3 -n update -PdC -D "Monday, November 1, 2010 10:41:27 PM UTC"
? target
[org.eclipse.wst.xsl.core] $ cvs -q -z3 -n update -PdC -D "Monday, November 1, 2010 10:41:27 PM UTC"
? target
cvs update: New directory `bin' -- ignored
cvs update: New directory `models' -- ignored
cvs update: New directory `src/org/eclipse/wst/xsl/core/internal/validation/xalan' -- ignored
cvs update: New directory `src/org/eclipse/wst/xsl/core/internal/xpathfunctions' -- ignored
cvs update: New directory `src_emf_xpath' -- ignored
cvs update: New directory `src_validation' -- ignored
[org.eclipse.wst.xsl.debug.ui] $ cvs -q -z3 -n update -PdC -D "Monday, November 1, 2010 10:41:27 PM UTC"
? target
[...]

So it seems to me some jobs were "hung" in some fashion while checking for cvs changes, but I'm not sure how/where to "see" that.

Plus, I must not understand the ":local:" suggestion. While I'm not worried about a minute or two of accuracy, I tried it in case it was related, but what I used doesn't seem to work:

cvs [checkout aborted]: Bad CVSROOT: `:local:anonymous@dev.eclipse.org:/cvsroot/callisto'.

So, did I get the syntax wrong? Or is this a sign of something else wrong?

But, thanks for investigating, thanks for restarting.

Just before I restarted things last night I noticed this 'message' on Hudson's admin tab:

There are more SCM polling activities scheduled than handled, so the threads are not keeping up with the demands. Check if your polling is hanging, and/or increase the number of threads if necessary.

And I checked what our SCM concurrent polling limit was and it should be 'unlimited', but perhaps there's an internal limit somewhere.

The local syntax should be something like:

cvs -d :local:/cvsroot/webtools co org.eclipse.wtp.incubator

-M.

(In reply to comment #11)
> Just before I restarted things last night I noticed this 'message' on Hudson's
> admin tab:
>
> There are more SCM polling activities scheduled than handled, so the threads
> are not keeping up with the demands. Check if your polling is hanging, and/or
> increase the number of threads if necessary.
>
> And I checked what our SCM concurrent polling limit was and it should be
> 'unlimited', but perhaps there's an internal limit somewhere.
>
> The local syntax should be something like:
>
> cvs -d :local:/cvsroot/webtools co org.eclipse.wtp.incubator
>
> -M.

Very useful, thanks.

I know on my "local" machines, in the past, it has sometimes been a problem that if a connection between cvs client and server gets broken for some reason (such as the server side timing out), the client continues to wait forever and doesn't know the connection has been broken. As a result, I usually get a more recent version of cvs (1.12.13.1), which I have to compile myself, but it offers the ability to set a "timeout" in the ~/.cvsrc file (e.g. -timeout 20m), so if a client side tcp connection is still open after 20 minutes, that attempt is ended. (I'm winging the technical details here, if you can't tell :)

But I think these showed up under "ps -ef | grep cvs" as invocations that were still running, which you say you didn't see. But ... in case it helps. Sounds like it could be related. (And the source is easy enough to configure and compile that even I could do it :) ... in case this continues to be a problem ... maybe it'll only be an occasional thing.)

Thanks for the full "local" syntax. Seems obvious now that you've written it down for me :)

I'm re-opening this bug as it seems to have occurred again ... or, something very similar.

The main problem is with the helios.runAggregator job.

The first clue was that some helios contributors told me they committed some changes to org.eclipse.helios.build but no build triggered for over an hour (that was approx. 5:30 PM today). I looked and saw that no build had been triggered for approx. 24 hours, though several changes had been made. At this point, I manually triggered a build, and it worked fine. Then I committed some changes to see if it'd trigger a build, but it did not. This job had been working fine prior to today.
The "CVS Polling Log" looks pretty normal, with messages such as

Started on Feb 1, 2011 10:30:55 PM
Done. Took 2.6 sec
No changes

(even though it should have detected changes).

Another oddity is that if I try to look at/download a workspace file, nothing happens (I eventually get a "bad gateway" message) ... and if I click on and download a zip of the workspace, the zip is empty.

I then used the job's menu options to "clear the workspace". That did, then, cause a build to automatically be triggered. When that build is finished, I'll try again to see if a build is triggered automatically by a cvs change.

But I wanted to re-open this bug, to see if other jobs are affected, if there is anything in the logs, or if the admin's tab again has a warning about "There are more SCM polling activities scheduled than handled"?

I have seen this before on Sapphire jobs. Haven't seen this recently.

(In reply to comment #13)
> I then used the job's menu options to "clear the workspace". That did, then,
> cause a build to automatically be triggered. When that build is finished, I'll
> try again to see if a build is triggered automatically by a cvs change.

No joy. I made a small change to one of the files and committed it, but no build triggered. Hudson's polling log seemed normal:

CVS Polling Log
Started on Feb 2, 2011 1:00:55 AM
Done. Took 3.2 sec
No changes

(but there should have been changes).

I next tried changing to use the ":local:" protocol in the hudson job config to see if there's some oddity with the pserver shadows such that they are not changing (i.e. accurately reflecting content) ... but that change in configuration itself caused a rebuild ... so, I will have to test "touching" a file in AM (Eastern).

CVS Polling Log
Started on Feb 2, 2011 1:30:55 AM
Workspace is inconsistent with configuration. Scheduling a new build: /opt/users/hudsonbuild/workspace/helios.runAggregator/CVS/Root content mismatch: expected :local:/cvsroot/callisto but found :pserver:anonymous@dev.eclipse.org:/cvsroot/callisto

(In reply to comment #15)
> (In reply to comment #13)
>
>
> I next tried changing to use ":local:" protocol in the hudson's job config to
> see if there's some oddity with the pserver shadows such that they are not
> changing (i.e. accurately reflecting content) ... but that change in
> configuration itself caused a rebuild ... so, will have to test "touching" a
> file in AM (Eastern).
>

Is it AM yet? :)

I realized I didn't have to wait for a build to finish to test ... I could cancel it ... and sure enough, using ":local:" works as expected ... changes in cvs are detected within minutes and trigger a build.

I suspect this means the contents for this helios aggregation job have been incorrect for at least 24 hours ... that using pserver has been retrieving some stale copy of cvs files.

So, I suspect this issue should be some new bug: "pserver on hudson at times retrieves stale content for 24 hours or more"? ... But I'll let webmasters (if not others) investigate and comment from here ... I could be wrong on several counts.

(In reply to comment #16)
>
> So, I suspect this issue should be some new bug: "pserver on hudson at times
> retrieves stale content for 24 hours or more"? ... But, I'll let webmasters (if
> not others) investigate and comment from here ... I could be wrong on several
> counts.

I'm not sure this is a hudson bug. From what I've seen, the pserver protocol on the hudson slaves seems to go to the public pserver address instead of the committer pserver, so items are delayed. Using local is the way around this, and probably should be used regardless, but for those using MAP files I can see it being a pain.

As far as I know all of the hudson machines use the 'live' cvs/pserver data.
I've taken a look at the logs and I don't see any SCM issues reported, and I've been unable to find a logger class in hudson that provides any insight into the SCM activities.

-M.

Fixed with occasional Hudson restarts. |
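
For anyone hitting this again: in this report the stuck jobs showed a polling log containing only an old "Started on ..." line, while healthy jobs followed it with cvs output or "Done." Below is a minimal Python sketch of a check for that pattern. It assumes the polling log text has already been fetched, that timestamps use the English month layout quoted above, and that `now` is expressed in the same time zone as the log. The names `poll_looks_hung` and `max_age` are hypothetical and not part of Hudson.

```python
from datetime import datetime, timedelta

# Timestamp layout of the "Started on ..." lines quoted in this report
# (assumes an English locale for the month abbreviation).
STARTED_FORMAT = "%b %d, %Y %I:%M:%S %p"
STARTED_PREFIX = "Started on "


def poll_looks_hung(log_text, now, max_age=timedelta(hours=1)):
    """Return True if the log ends at an old 'Started on' line with nothing after it."""
    lines = [line.strip() for line in log_text.splitlines() if line.strip()]
    if not lines or not lines[-1].startswith(STARTED_PREFIX):
        # Empty log, or the poll wrote something after it started.
        return False
    started = datetime.strptime(lines[-1][len(STARTED_PREFIX):], STARTED_FORMAT)
    # A poll that started long ago and never logged anything else is suspect.
    return now - started > max_age


if __name__ == "__main__":
    stuck_log = "Started on Oct 29, 2010 5:16:44 PM\n"
    print(poll_looks_hung(stuck_log, datetime(2010, 11, 1, 12, 0, 0)))
```

This only flags the "Started, then silence" case. It would not catch the later pserver problem, where polls completed normally but reported "No changes" against stale content.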