| Summary: | builds frequently triggered due to "content change" when content has not changed | | |
|---|---|---|---|
| Product: | [Technology] Hudson | Reporter: | David Williams <david_williams> |
| Component: | Core | Assignee: | Winston Prakash <winston.prakash> |
| Status: | REOPENED | QA Contact: | |
| Severity: | major | | |
| Priority: | P3 | CC: | denis.roy, mistria, mknauer, thanh.ha |
| Version: | unspecified | | |
| Target Milestone: | --- | | |
| Hardware: | PC | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Attachments: | contents of jobs/indigo.runReports (attachment 207111) | | |
Description
David Williams
Killing one job and wiping its workspace did not seem to help anything ... it has started re-building (nearly) continuously. I'll let it run like that for a bit, in case it helps you spot anomalies in the logs, or something, but otherwise will disable it on Wednesday ... give it a rest for a while.

I get the feeling that Hudson may not be optimized to operate on NFS-mounted filesystems. Bug 363652 is another filesystem issue. Moving to the Hudson team for consideration.

Ok, thanks. I have disabled the job for now ... since it's run 10 times already today. Since this effectively prevents me from using the "automatic" build, I'll change severity to "major". It seems to affect similar jobs in a similar way, only not as badly ... building only a couple of times per day. See, for example, https://hudson.eclipse.org/hudson/view/Repository%20Aggregation/job/juno.runReports/

David, could you add your job configuration file? BTW, I guess this happens only when the files are in an NFS filesystem, right?

(In reply to comment #4)
> David, could you add your job configuration file? BTW, I guess this happens
> only when the files are in an NFS filesystem, right?

I meant attach your job configuration file for me to take a look at it.

Created attachment 207111 [details]
contents of jobs/indigo.runReports
I'll attach the whole contents of /shared/jobs/indigo.runReports, in case it helps, though you did only ask for the config.xml.
Looking in that directory was interesting: there is a file named url-change-trigger-oldmd5, which implies the plugin is using an MD5 checksum to decide if/when the file has changed.
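To double-check that mechanism by hand, here is a minimal sketch (in Java, since Hudson and its plugins are Java) of comparing a freshly computed MD5 against the stored url-change-trigger-oldmd5 value. It assumes, and this is only an assumption from the file name, that the file holds the digest as a plain hex string; the paths are the ones discussed in this bug.

```java
import java.io.IOException;
import java.math.BigInteger;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Md5Check {
    /** Compute the MD5 digest of a file's content as a 32-char lowercase hex string. */
    static String md5Hex(Path file) throws IOException, NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("MD5").digest(Files.readAllBytes(file));
        return String.format("%032x", new BigInteger(1, digest));
    }

    public static void main(String[] args) throws Exception {
        Path content = Path.of("/home/data/httpd/download.eclipse.org/releases/staging/content.jar");
        Path stored  = Path.of("/shared/jobs/indigo.runReports/url-change-trigger-oldmd5");
        // Assumption: the oldmd5 file contains just the hex digest, possibly with a trailing newline.
        String old = Files.readString(stored).trim();
        System.out.println(md5Hex(content).equals(old) ? "unchanged" : "changed");
    }
}
```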
And ... yes, as far as I know, only NFS ... I have a parallel "test system" that does not have this problem ... and on it, all the files are on the same, literal, file system (though, of course, there are several differences ... such as, my test system is not "up and running" nearly as often).
Just to give some observations ... the "juno.runReports" job had been running "continuously" quite a bit the last few days (even though there was no change in content). I changed the content and it triggered a build, but then others were triggered after that, when the content didn't change. I "manually" checked md5sum on /home/data/httpd/download.eclipse.org/releases/staging/content.jar and it matched the value in /shared/jobs/juno.runReports/url-change-trigger-oldmd5. So neither changed, but it kept building as though they had.

I then noticed the "URL" I was using wasn't quite right. In the configuration page it was file:/home/data/httpd/download.eclipse.org/releases/staging/content.jar but I think the proper form is file:///home/data/httpd/download.eclipse.org/releases/staging/content.jar So I changed it, saved it, and that seemed to stop the "continuous builds" (at least in the hour or so I watched it ... not long). So I thought maybe I'd "fixed" it by improving the URL syntax ... but then, to my surprise, I saw that Hudson, somewhere along the line, changed it back to file:/home/data/httpd/download.eclipse.org/releases/staging/content.jar So ... no closer to understanding causes/reasons, but thought I'd report what I'd observed and tried.

FYI, after noticing today that these jobs were still "runaway jobs", building over and over again though there was no change in content, I tried changing the "URL" to the HTTP form, http://download.eclipse.org/releases/staging/content.jar but still no joy. Over and over. I have had to disable the jobs. In case it's not obvious, this did use to work just fine ... something seems to have changed in recent releases?

The issue seems to be due to the Gerrit Plugin, which is triggering the builds incorrectly. Matt disabled the Gerrit trigger plugin, so the spurious builds should not happen. If it happens again, let me know. Meanwhile, I'm looking at the Gerrit Plugin to find out why it triggers like that.

I don't think this was fixed by the Gerrit Plugin removal (or has it been added back in?). I tried to verify by re-enabling "trigger on content change" in the "juno.runReports" job: https://hudson.eclipse.org/hudson/view/Repository%20Aggregation/job/juno.runReports/ You can see it runs numerous times per day, even though the critical file, content.jar, has not changed since the 16th:

    [12:00:57] david_williams@build:/home/data/httpd/download.eclipse.org/releases/staging
    $ stat content.jar
      File: `content.jar'
      Size: 2111379    Blocks: 4136    IO Block: 32768    regular file
    Device: 19h/25d    Inode: 95412227    Links: 1
    Access: (0644/-rw-r--r--)    Uid: ( 7336/david_williams)    Gid: ( 8403/callistoadmin)
    Access: 2012-02-16 23:55:47.000000000 -0500
    Modify: 2012-02-16 23:55:47.000000000 -0500
    Change: 2012-02-16 23:55:47.000000000 -0500

Want me to leave it enabled so you can 'investigate' easier? Soon? Else, I'll switch back to the more "manual" mode so as to not eat up resources.

David, you're using this as the URL for the "Build when a URL's content changes" trigger:

file:/home/data/httpd/download.eclipse.org/releases/staging/content.jar

That filesystem is mounted on the Hudson master, but I'm not sure if it's mounted on all the slaves. In this case, would it make more sense to use http:// ?
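An editor's aside on the two spellings seen above: in Java (the language Hudson is written in), file:/path and file:///path parse to the same local path, so the spelling alone is unlikely to be the cause. What does matter is that a file: URL is always resolved on whichever machine opens it, which is the point about the slaves. A small illustrative sketch, not taken from Hudson's code:

```java
import java.io.InputStream;
import java.net.URL;

public class FileUrlForms {
    public static void main(String[] args) throws Exception {
        // Both spellings are accepted by java.net.URL and yield the same path.
        URL oneSlash   = new URL("file:/home/data/httpd/download.eclipse.org/releases/staging/content.jar");
        URL threeSlash = new URL("file:///home/data/httpd/download.eclipse.org/releases/staging/content.jar");
        System.out.println(oneSlash.getPath().equals(threeSlash.getPath())); // prints: true

        // A file: URL is resolved on the machine that opens it; on a slave
        // without the NFS mount, this openStream() call would simply fail.
        try (InputStream in = oneSlash.openStream()) {
            System.out.println("readable on this host, first byte: " + in.read());
        }
    }
}
```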
(In reply to comment #11)
> David, you're using this as the URL for the "Build when a URL's content
> changes" trigger:
>
> file:/home/data/httpd/download.eclipse.org/releases/staging/content.jar
>
> That filesystem is mounted on the Hudson master, but I'm not sure if it's
> mounted on all the slaves. In this case, would it make more sense to use
> http:// ?

Ok ... it used to work as file:// (months ago), and I have tried http:// in the past (when first having the problem) ... but, given many other changes and complications, it makes sense to try http:// now, so I've changed it to http://download.eclipse.org/releases/staging/content.jar We should know more by end of day (since I don't plan on changing it today) ... it might build once, with the "changed URL" ... but should stop after that. Thanks.

Interesting results ... sort of. When I changed to "http://" I canceled the currently running job (and one that was pending, which I also canceled) ... no new builds were started (for the hour or so that I checked). Good so far. But, being one to not leave well enough alone, I "manually" started a job to see what would happen. That seems to have re-started the build-for-no-reason pattern all over again. I do notice that a "pending" job starts before the current job finishes. I wonder if there is some check or state that goes astray "while it's building" (that is, if it were a "real quick" job, maybe this wouldn't happen). I do not know how often Hudson checks for a change, but this job takes about an hour to complete. (FYI, it does "read" from the content.jar, but would not change it, in case that helps.) I'll switch back to "manual" soon to avoid the "runaway" jobs, but feel free, if/when you want me to re-enable "content change", so it can be "watched" while it's running (by someone who knows what to watch for).

Another (possible) "hint" ... this particular file that's checked is fairly large, about 2 megs. It is used as the "indicator" since it is the file "that matters". I suppose there could be bugs with handling files that large? Which, in my spare time, I could test by creating/using a small "indicator-only" file. Just wanted to mention that size variable.

This functionality is provided by the URL Change plugin, which downloads the file every minute and checks if there is any change. However, this plugin is deprecated in favor of another plugin called the URLTrigger plugin. That plugin is more accurate and gives the option to check the last modification date or the URL content. However, URLTrigger is a Jenkins plugin. If needed, we could fork the plugin and make it work for Hudson. (A sketch of the two checking strategies follows after this exchange.)

(In reply to comment #15)
> This functionality is provided by the URL Change plugin, which downloads the
> file every minute and checks if there is any change. However, this plugin is
> deprecated in favor of another plugin called the URLTrigger plugin. That
> plugin is more accurate and gives the option to check the last modification
> date or the URL content. However, URLTrigger is a Jenkins plugin. If needed,
> we could fork the plugin and make it work for Hudson.

Well, I've moved back to my "semi-manual" method, so I can't say I NEED it. But it sounds like a URLTrigger-style function would be useful, and I am fine carrying that as a "feature request". But I will ask ... should the URL Change plugin be uninstalled/not used? Perhaps it works for some people ... and, recall, it did use to work for me ... for years ... but obviously something's changed. Is there a way to query on the cross-project list, or, better, by grepping the config.xml files? If no one uses it, I'd say remove it and avoid the chance of "runaway jobs" ... but, if some use it currently, and it truly works for them, then I guess there's not too much harm in leaving it there for them. (In short, I'd lean toward removing it ... but don't think it's critical to remove.)
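To make the difference described above concrete: a last-modified check needs only an HTTP HEAD request per poll, while a content check downloads and hashes the entire file (about 2 MB here) every time. A rough sketch of the two strategies; this is illustrative only, not the URLTrigger or URL Change plugin source, and the method names are invented:

```java
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.security.MessageDigest;

public class TriggerStrategies {
    /** Cheap check: ask the server for the Last-Modified timestamp (HEAD request). */
    static long lastModified(String url) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("HEAD");
        try {
            return conn.getLastModified(); // 0 if the server sends no Last-Modified header
        } finally {
            conn.disconnect();
        }
    }

    /** Expensive check: download the full content and hash it. */
    static byte[] contentDigest(String url) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        try (InputStream in = new URL(url).openStream()) {
            byte[] buf = new byte[8192];
            for (int n; (n = in.read(buf)) != -1; ) {
                md5.update(buf, 0, n);
            }
        }
        return md5.digest();
    }
}
```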
I am a little glad that the one Gerrit Plugin wasn't causing all the unrelated job triggers by itself ... that would sound like a bigger bug!

For the record, saw this bug again today ... a couple of jobs "running continuously". Thanh will look into other ways of triggering the jobs ... but "URL Change" should be disabled ... or come with a warning ... or be replaced. How's progress on that Jenkins plugin? :)

I don't know if it was a coincidence or something else the webmasters did, but after disabling that job ... which was "constantly" checking a CGit URL ... suddenly Hudson was responsive again!

(In reply to comment #18)

Hi David, since I did not hear from you for over a year, I thought there was no problem and didn't look into it. I looked at the code now. This is how the URL Change Trigger works:

- Every minute, the file content is downloaded by the plugin.
- An MD5 is computed from the file content, and the content itself is discarded.
- The MD5 is then written to a file, <hudson-home>/jobs/<job-folder>/url-change-trigger-oldmd5.
- If the newly computed MD5 is different from the old one written to the file, then the job is triggered.

The question we need to ask and answer is: why is the MD5 of the file computed differently every minute if the content of the file did not change? Could a network issue have caused the contents to be downloaded incorrectly, causing the MD5 to be computed differently?
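Putting those four steps into code makes the failure mode easier to reason about. The following is a minimal reconstruction from the description above, not the plugin's actual source; in particular, writing the new digest on every poll (rather than only on change) is an assumption:

```java
import java.io.InputStream;
import java.math.BigInteger;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;

public class UrlChangeTriggerSketch {
    /** Steps 1 and 2: download the URL's content, hash it, and discard the content. */
    static String downloadAndHash(URL url) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        try (InputStream in = url.openStream()) {
            byte[] buf = new byte[8192];
            for (int n; (n = in.read(buf)) != -1; ) {
                md5.update(buf, 0, n);
            }
        }
        return String.format("%032x", new BigInteger(1, md5.digest()));
    }

    /** Steps 3 and 4: runs once a minute; returns true when the job should be triggered. */
    static boolean checkOnce(URL url, Path jobFolder) throws Exception {
        Path oldMd5File = jobFolder.resolve("url-change-trigger-oldmd5");
        String newMd5 = downloadAndHash(url);
        String oldMd5 = Files.exists(oldMd5File) ? Files.readString(oldMd5File).trim() : null;
        Files.writeString(oldMd5File, newMd5); // assumption: digest recorded on every poll
        return oldMd5 != null && !oldMd5.equals(newMd5);
    }
}
```

If this matches the real plugin's behavior, it also suggests an answer to the question above: a single truncated or corrupted download would both fire a trigger and overwrite the stored digest, so the next complete download would differ again and fire another build ... which is consistent with the repeated, content-unchanged builds reported throughout this bug.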