Bug 400633

Summary: Need to exit build on (bad) errors and send mail
Product: [Eclipse Project] Platform
Reporter: David Williams <david_williams>
Component: Releng
Assignee: David Williams <david_williams>
Status: VERIFIED FIXED
QA Contact:
Severity: normal
Priority: P2
CC: Lars.Vogel
Version: 4.2.1
Target Milestone: 4.3 M7
Hardware: PC
OS: Linux
Whiteboard:

Description David Williams CLA 2013-02-12 16:57:49 EST
There have been a few occasions when something failed badly in the build, but it proceeded along, eventually trying to "test", even though no zips were produced.

For example, after the "Foundation move" on a recent weekend, signing webservice was completely broken, which caused CBI to output nothing.
Comment 1 David Williams CLA 2013-02-13 10:49:31 EST
This looks harder to do than I thought.

First, due to the nature of our bash scripts and function calls, there will be lots of "little tweaks" to make to get them to return a meaningful return code (rather than, say, the return code of "popd", which would almost always be "0" even if the previous command failed).
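
A minimal sketch of the kind of tweak meant here, assuming bash: capture the command's exit status before "popd" runs, then return that saved value. The function name run_in_dir is hypothetical and is not taken from the aggregator scripts.

```bash
# Hypothetical helper (not from the real scripts): run a command inside a
# directory and return that command's exit code rather than popd's.
run_in_dir() {
    local dir=$1
    shift
    pushd "$dir" >/dev/null || return 1
    "$@"
    local rc=$?        # capture before popd overwrites $?
    popd >/dev/null
    return "$rc"
}

rc=0
run_in_dir /tmp false || rc=$?
echo "return code: $rc"   # status of 'false', not of popd
```

Without the saved rc, the function's return value would be whatever "popd" returned, which is the masking described above.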

But, worse, it does not seem that maven (at least the way we invoke it) always returns a meaningful error code upon error. From searching the internet, this was a known problem that was supposedly improved, but from the little I've tested it seems to return zero, even if there is a "maven build" problem.

I put in some minimal changes, just to echo a message on failure (not to do the eventual "smart processing") and it does not seem to echo anything. 

Will need to investigate deeper (which means it'll have less priority). 

http://git.eclipse.org/c/platform/eclipse.platform.releng.aggregator.git/commit/?id=1b85b4aeb9c4b619d92b5bdd7236ac7141b368aa
Comment 2 David Williams CLA 2013-02-13 18:54:22 EST
I see in some local test runs, the build ends with the messages below. 

The first one may give some insight into what we are returning from our function call. I've added some "guards" and echos to debug. Not sure what the other error messages are caused by.


/shared/eclipse/builds/4I/production/run-maven-build.sh: line 45: exit: DEBUG:: numeric argument required
/shared/eclipse/builds/4I /shared/eclipse/builds/4I
2013-02-13 18:41:35 URL:http://git.eclipse.org/c/platform/eclipse.platform.releng.basebuilder.git/snapshot/eclipse.platform.releng.basebuilder-R38M6PlusRC3G.zip [53635738] -> "basebuilder-R38M6PlusRC3G.zip" [1]
/shared/eclipse/builds/4I
   ERROR: /shared/eclipse/builds/4I/master/gitCache/eclipse.platform.releng.aggregator/eclipse.platform.repository/target/repository did not exist in fn-gather-repo
   ERROR: /shared/eclipse/builds/4I/master/gitCache/eclipse.platform.releng.aggregator/eclipse.platform.releng.tychoeclipsebuilder/sdk/target/products did not exist in fn-gather-sdks
   ERROR: /shared/eclipse/builds/4I/master/gitCache/eclipse.platform.releng.aggregator/eclipse.platform.releng.tychoeclipsebuilder/rcp.sdk/target/products did not exist in fn-gather-platform
/shared/eclipse/builds/4I/master/gitCache/eclipse.platform.releng.aggregator/eclipse.platform.swt.binaries/bundles /shared/eclipse/builds/4I
/shared/eclipse/builds/4I
   ERROR: /shared/eclipse/builds/4I/master/gitCache/eclipse.platform.releng.aggregator/eclipse.platform.releng.tychoeclipsebuilder/eclipse-junit-tests/target did not exist in fn-gather-test-zips.
   ERROR: /shared/eclipse/builds/4I/master/gitCache/eclipse.platform.releng.aggregator/eclipse.platform.repository/target/repository did not exist in fn-slice-repo
/shared/eclipse/builds/4I/siteDir/eclipse/downloads/drops4cbibased/I20130213-1813 /shared/eclipse/builds/4I
Unknown target: verifyCompile
No known target specified.
Comment 3 David Williams CLA 2013-02-13 21:29:59 EST
FWIW, after adding my "guarding" code and echos, it exits with the message that 

exitcode was zero 

(Which, if not obvious, comes from a check for '0', not 'zero'). 
So, not sure what previous "non numeric" error was from. Perhaps a typo? 
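
For illustration only, a guard of this general shape could look like the sketch below, assuming bash. The function name classify_exit_code is hypothetical; the two "exitcode was ..." messages are the ones quoted in this bug, and the "not numeric" message is an assumption.

```bash
# Hypothetical guard: classify a value before it is handed to 'exit'.
classify_exit_code() {
    local code=$1
    if [[ ! "$code" =~ ^[0-9]+$ ]]; then
        # 'exit' would reject this with "numeric argument required"
        echo "exitcode was not numeric: '$code'"
        return 2
    elif (( 10#$code == 0 )); then
        echo "exitcode was zero"
    else
        echo "exitcode was a legal, non-zero numeric return code"
    fi
}

msg_zero=$(classify_exit_code 0)
msg_fail=$(classify_exit_code 1)
msg_bad=$(classify_exit_code "DEBUG:") || true
printf '%s\n' "$msg_zero" "$msg_fail" "$msg_bad"
```

A value such as "DEBUG:" reaching "exit" is what produces the "numeric argument required" message, so checking the value first shows where it came from.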

But, still getting these subsequent messages ... on my home test machine, even after deleting entire working area and getting fresh clones. :\

Production build is still running. 


/shared/eclipse/builds/4I /shared/eclipse/builds/4I
2013-02-13 21:19:15 URL:http://git.eclipse.org/c/platform/eclipse.platform.releng.basebuilder.git/snapshot/eclipse.platform.releng.basebuilder-R38M6PlusRC3G.zip [53635738] -> "basebuilder-R38M6PlusRC3G.zip" [1]
/shared/eclipse/builds/4I
   ERROR: /shared/eclipse/builds/4I/master/gitCache/eclipse.platform.releng.aggregator/eclipse.platform.repository/target/repository did not exist in fn-gather-repo
   ERROR: /shared/eclipse/builds/4I/master/gitCache/eclipse.platform.releng.aggregator/eclipse.platform.releng.tychoeclipsebuilder/sdk/target/products did not exist in fn-gather-sdks
   ERROR: /shared/eclipse/builds/4I/master/gitCache/eclipse.platform.releng.aggregator/eclipse.platform.releng.tychoeclipsebuilder/rcp.sdk/target/products did not exist in fn-gather-platform
/shared/eclipse/builds/4I/master/gitCache/eclipse.platform.releng.aggregator/eclipse.platform.swt.binaries/bundles /shared/eclipse/builds/4I
/shared/eclipse/builds/4I
   ERROR: /shared/eclipse/builds/4I/master/gitCache/eclipse.platform.releng.aggregator/eclipse.platform.releng.tychoeclipsebuilder/eclipse-junit-tests/target did not exist in fn-gather-test-zips.
   ERROR: /shared/eclipse/builds/4I/master/gitCache/eclipse.platform.releng.aggregator/eclipse.platform.repository/target/repository did not exist in fn-slice-repo
/shared/eclipse/builds/4I/siteDir/eclipse/downloads/drops4cbibased/I20130213-2026 /shared/eclipse/builds/4I
Unknown target: verifyCompile
No known target specified.
Comment 4 David Williams CLA 2013-02-18 14:45:51 EST
Just to note it, I think maven is returning non-zero error code on error, but losing it in "master build" ... perhaps due to piping output to "tee" ? (so, 'tee' is the last thing to execute). 

= = = 
$SCRIPT_PATH/run-maven-build.sh $BUILD_ENV_FILE 2>&1 | tee $logsDirectory/mb060_run-maven-build_output.txt
buildrc=$?
# does not seem to be "catching" the error code. Perhaps due to tee?
echo "return code from run-maven-build.sh was: $buildrc"


run-maven-build.sh (in the event of an error) echos:
exitcode was a legal, non-zero numeric return code

but master-build.sh echos:  
return code from run-maven-build.sh was: 0
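
A sketch of one way to keep the build's exit code when piping through "tee", assuming bash: by default "$?" after a pipeline is the status of the last command (here "tee"), while the PIPESTATUS array holds the status of each stage. The failing_build function and the mktemp log file are stand-ins, not the real run-maven-build.sh or log location.

```bash
# Stand-in for run-maven-build.sh that fails with exit code 3 (hypothetical).
failing_build() {
    echo "build output"
    return 3
}

log=$(mktemp)

failing_build 2>&1 | tee "$log"
statuses=("${PIPESTATUS[@]}")   # copy immediately; the next command resets it
buildrc=${statuses[0]}          # exit status of failing_build
teerc=${statuses[1]}            # exit status of tee, which is what $? reports

echo "return code from build was: $buildrc (tee returned $teerc)"
rm -f "$log"
```

An alternative is "set -o pipefail", which makes the pipeline's own status non-zero if any stage fails, at the cost of no longer telling which stage it was.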
Comment 5 David Williams CLA 2013-02-26 21:58:31 EST
One function to do first, fn-git-clean-aggregator. There's some indication, as documented in bug 394831 that it doesn't always get clean.
Comment 6 David Williams CLA 2013-02-26 23:56:54 EST
(In reply to comment #5)
> One function to do first, fn-git-clean-aggregator. There's some indication,
> as documented in bug 394831 that it doesn't always get clean.

Pretty clear error looking at the detailed log at 

http://download.eclipse.org/eclipse/downloads/drops4/I20130226-2000/buildlogs/mb010_get-aggregator_output.txt

/shared/eclipse/builds/4I
/shared/eclipse/builds/4I/master/gitCache/eclipse.platform.releng.aggregator /shared/eclipse/builds/4I
git pull
error: insufficient permission for adding an object to repository database .git/objects

fatal: failed to write object
fatal: unpack-objects failed
git submodule init
git submodule update

I've seen at least one presentation about moving to Tycho/Maven that had as one of their "recommendations based on their experience" ... put in lots of error checking! So, we are learning that same lesson.
Comment 7 David Williams CLA 2013-03-06 01:10:22 EST
bug 402492 demonstrates another common error check we should add throughout.

If a script function expects N arguments, we should check first thing that it has N arguments and exit if not. Will save a lot of silly mistakes from taking time to track down.
Comment 8 David Williams CLA 2013-03-28 11:12:02 EDT
I will update here the status of 'error checking'. I've covered the cases of 
getting/cleaning aggregator
getting submodules to compute "pom updates" 
and running the maven build itself. 

I focused on these three since that is where we've seen errors in the past ... errors that effectively "stop the build" or produce bogus output. 

By "handle the error", I mean the following actions, taken so far, in no particular order.

a) Don't promote equinox at all (or, more exactly, don't create the script that promotes it), since it is largely invalid and just results in a "promotion error".

b) Don't start the tests, since they fail right away with invalid (or no) input to test. 

c) add "Build Failed" to end of subject line in note sent to dev list. 

d) avoid most subsequent steps, if error happens earlier (e.g. makes no sense to run maven build, if there is an error getting aggregator, or if something is so wrong can not run "update POMs" routine). 

e) I still "publish" drop to downloads but this is mostly just to get the logs up there for visibility. 

f) Add appropriate flags to buildproperties.php etc., so that the "download page" knows not to display empty download and tests section, and instead  displays short message that "build failed ... see logs". 
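
As a rough sketch of items (c) and (d), assuming bash: remember the first step that failed, skip the steps after it, and append "Build Failed" to the mail subject. The step functions, the steps_run array and the base subject text are all hypothetical stand-ins for the real build stages.

```bash
# Hypothetical step functions standing in for the real build stages.
get_aggregator()  { return 0; }
update_poms()     { return 2; }   # simulated failure
run_maven_build() { return 0; }

build_failed=""
steps_run=()

for step in get_aggregator update_poms run_maven_build; do
    if [[ -n "$build_failed" ]]; then
        echo "skipping $step (earlier failure in $build_failed)"
        continue
    fi
    steps_run+=("$step")
    "$step" || build_failed=$step
done

subject="Build finished"          # placeholder subject text
if [[ -n "$build_failed" ]]; then
    subject="$subject - Build Failed"
fi
echo "$subject"
```

Keeping the failure in a single variable means later stages (tests, promotion, publishing) can all check the same flag.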

Some obvious (minor) things still to improve: 

display a red X in list of downloads instead of the current "tests are running" icon.  

include data or links directly in email to dev list just to save a mouse click or two and in some cases make it more obvious which teams need to respond, instead of everyone having to look on DL page to find out. 

I think I'll let this sit, as is for a while and see how it works in practice. 

One thing I don't like is that the bash scripts and logic to handle the error checking and processing seem overly complicated and fragile, and deserve some thought on how to simplify/refactor/restructure that bash code so it is easier to extend to other sections that might have errors.
Comment 9 David Williams CLA 2013-03-28 11:50:48 EDT
I forgot to list another (important) "handle error" action: 

g) Don't publish "update site", since it would be invalid (or, empty) and to some extent make the "composite site" invalid.
Comment 10 David Williams CLA 2013-03-29 01:22:08 EDT
I addressed comment 7 with this commit: 

http://git.eclipse.org/c/platform/eclipse.platform.releng.aggregator.git/commit/?id=5599fdcf2c85b71e5ccd2a44d0ff50f5861e8471

Bash functions now check that the number of arguments received is what was expected and, if not, exit immediately with "PROGRAM ERROR".
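
A minimal sketch of that kind of check, assuming bash. This is not the code from the commit; check_arg_count and fn_example are hypothetical names, and only the "PROGRAM ERROR" wording comes from this bug.

```bash
# Hypothetical helper: abort unless the caller received the expected number
# of arguments. Names are illustrative, not taken from the commit.
check_arg_count() {
    local fn_name=$1 expected=$2 actual=$3
    if [[ "$actual" -ne "$expected" ]]; then
        echo "PROGRAM ERROR: $fn_name expected $expected arguments, got $actual" >&2
        exit 1
    fi
}

fn_example() {
    check_arg_count "${FUNCNAME[0]}" 2 $#
    echo "source=$1 target=$2"
}

# Run in subshells so the demonstration itself survives the exit.
rc_ok=0
( fn_example /src /dest ) || rc_ok=$?
rc_bad=0
( fn_example /src ) 2>/dev/null || rc_bad=$?
echo "two args: $rc_ok, one arg: $rc_bad"
```

Checking first thing in the function means a call with a missing argument stops at the call site instead of surfacing later as an odd path or empty variable.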
Comment 11 David Williams CLA 2013-04-14 09:32:41 EDT
I think I've fixed enough to call this fixed. There are still a few spots where errors MIGHT occur that we would not properly capture, but I think the most efficient strategy is to wait and fix them when we do see errors occur there ... otherwise, we are spending time on very rare events.
Comment 12 David Williams CLA 2013-05-30 16:45:34 EDT
mass change to 'verified', as these bugs are either routine or obviously fixed build breaks.