I've run into a race condition developing some Grails server-running support that builds on WTP code. Now... it is entirely possible that this isn't really a problem in WTP, but a problem in my understanding of how things ought to work. Either way, I'm hoping that some positive result will come from filing this bug (either fixing the bug in WTP or the bug in my understanding :-). Anyway... here goes.

Running test code I saw some unpredictable behavior reeking of a race condition. I was able to narrow it down to the stack trace at the end of this bug report.

Summary:
- Near the top (where my code finally gets called) I checked the scheduling rule that is in effect. It is the rule for a 'Server.StartJob'. This rule only holds a lock on the server.
- This seems wrong because:
  - The start job actually triggers publishing of resources from the workspace (TomcatLaunchConfigurationDelegate is calling Server.publish, presumably because it detected that some resources were dirty and needed republishing).
  - Publish jobs (Server.PublishJob) need a scheduling rule that includes a lock on the workspace root.

If it is possible for a start job to trigger publishing work, it seems like the StartJob scheduling rule should at least contain the publish scheduling rule. I guess the assumption made by Server.StartJob is that it doesn't publish anything, it only starts the server. But Tomcat actually publishes when the server starts. Not sure whether we should blame TomcatLaunchConfigurationDelegate or Server.StartJob, but it seems to me that they are making assumptions that are inconsistent with one another.

I think the easiest way to fix this would be to make the Server.StartJob rule the same as the publish job's rule (i.e. both grab a lock on the workspace root), which would make it safe to publish from a start job (see the sketch after the stack trace below).

Kris

PS: I deleted some frames in the stack trace that are STS code, but the problem is really apparent in the bottom portion of the stack trace, which only contains org.eclipse... code.
PS2: If someone has a suggested workaround, I'd be glad to hear it.
PS3: I checked out a version of the wst.server.core plugin source code to debug this. The version is the one from 2010/09/08 (Eclipse 3.6.2 release date).

-----Stacktrace---------
... I get called here and dump out the active scheduling rule...
... intervening irrelevant stuff deleted...
org.eclipse.wst.server.core.internal.ModulePublishInfo.getModuleResources(org.eclipse.wst.server.core.IModule[]) line: 321
org.eclipse.wst.server.core.internal.ServerPublishInfo.getResources(org.eclipse.wst.server.core.IModule[]) line: 361
org.eclipse.wst.server.core.internal.Server.getResources(org.eclipse.wst.server.core.IModule[]) line: 1432
com.springsource.sts.internal.server.tc.core.TcServerBehaviour(org.eclipse.wst.server.core.model.ServerBehaviourDelegate).getResources(org.eclipse.wst.server.core.IModule[]) line: 671
com.springsource.sts.internal.server.tc.core.TcServerBehaviour(org.eclipse.jst.server.tomcat.core.internal.TomcatServerBehaviour).getResources(org.eclipse.wst.server.core.IModule[]) line: 953
com.springsource.sts.internal.server.tc.core.TcServerBehaviour.getResources(org.eclipse.wst.server.core.IModule[]) line: 169
com.springsource.sts.internal.server.tc.core.TcServerBehaviour(org.eclipse.jst.server.tomcat.core.internal.TomcatServerBehaviour).publishDir(int, java.util.Properties, org.eclipse.wst.server.core.IModule[], org.eclipse.wst.server.core.util.PublishHelper, org.eclipse.core.runtime.IProgressMonitor) line: 327
com.springsource.sts.internal.server.tc.core.TcServerBehaviour(org.eclipse.jst.server.tomcat.core.internal.TomcatServerBehaviour).publishModule(int, int, org.eclipse.wst.server.core.IModule[], org.eclipse.core.runtime.IProgressMonitor) line: 268
com.springsource.sts.internal.server.tc.core.TcServerBehaviour.publishModule(int, int, org.eclipse.wst.server.core.IModule[], org.eclipse.core.runtime.IProgressMonitor) line: 440
com.springsource.sts.internal.server.tc.core.TcServerBehaviour(org.eclipse.wst.server.core.model.ServerBehaviourDelegate).publishModule(int, org.eclipse.wst.server.core.IModule[], int, org.eclipse.core.runtime.IProgressMonitor) line: 1024
com.springsource.sts.internal.server.tc.core.TcServerBehaviour(org.eclipse.wst.server.core.model.ServerBehaviourDelegate).publishModules(int, java.util.List, java.util.List, org.eclipse.core.runtime.MultiStatus, org.eclipse.core.runtime.IProgressMonitor) line: 1114
com.springsource.sts.internal.server.tc.core.TcServerBehaviour(org.eclipse.wst.server.core.model.ServerBehaviourDelegate).publish(int, org.eclipse.core.runtime.IProgressMonitor) line: 947
com.springsource.sts.internal.server.tc.core.TcServerBehaviour(org.eclipse.wst.server.core.model.ServerBehaviourDelegate).publish(int, java.util.List<org.eclipse.wst.server.core.IModule[]>, org.eclipse.core.runtime.IProgressMonitor, org.eclipse.core.runtime.IAdaptable) line: 774
org.eclipse.wst.server.core.internal.Server.publishImpl(int, java.util.List<org.eclipse.wst.server.core.IModule[]>, org.eclipse.core.runtime.IAdaptable, org.eclipse.core.runtime.IProgressMonitor) line: 2889
org.eclipse.wst.server.core.internal.Server.publish(int, org.eclipse.core.runtime.IProgressMonitor) line: 1207
org.eclipse.jst.server.tomcat.core.internal.TomcatLaunchConfigurationDelegate.launch(org.eclipse.debug.core.ILaunchConfiguration, java.lang.String, org.eclipse.debug.core.ILaunch, org.eclipse.core.runtime.IProgressMonitor) line: 41
org.eclipse.debug.internal.core.LaunchConfiguration.launch(java.lang.String, org.eclipse.core.runtime.IProgressMonitor, boolean, boolean) line: 853
org.eclipse.debug.internal.core.LaunchConfiguration.launch(java.lang.String, org.eclipse.core.runtime.IProgressMonitor, boolean) line: 702
org.eclipse.debug.internal.core.LaunchConfiguration.launch(java.lang.String, org.eclipse.core.runtime.IProgressMonitor) line: 695
org.eclipse.wst.server.core.internal.Server.startImpl2(java.lang.String, org.eclipse.core.runtime.IProgressMonitor) line: 3209
org.eclipse.wst.server.core.internal.Server.startImpl(java.lang.String, org.eclipse.core.runtime.IProgressMonitor) line: 3159
org.eclipse.wst.server.core.internal.Server$StartJob.run(org.eclipse.core.runtime.IProgressMonitor) line: 359
org.eclipse.core.internal.jobs.Worker.run() line: 54
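For concreteness, here is a minimal sketch (plain Java against the platform Jobs API, not actual WTP code) of the two things described above: how I dump the rule in effect where my code gets called, and what a widened StartJob rule could look like. The 'serverRule' parameter is a hypothetical stand-in for whatever rule Server.StartJob uses today.

import org.eclipse.core.resources.ResourcesPlugin;
import org.eclipse.core.runtime.jobs.ISchedulingRule;
import org.eclipse.core.runtime.jobs.Job;
import org.eclipse.core.runtime.jobs.MultiRule;

public class StartJobRuleSketch {

    // What I did to diagnose this: print the rule of the job currently
    // running on this thread (it turned out to be the Server.StartJob rule,
    // which only covers the server).
    static void dumpCurrentRule() {
        Job current = Job.getJobManager().currentJob();
        ISchedulingRule rule = (current == null) ? null : current.getRule();
        System.out.println("Active scheduling rule: " + rule);
    }

    // What I'm proposing: combine the server-only rule with the workspace
    // root, i.e. the same rule Server.PublishJob already needs, so that
    // publishing triggered from within the start job is covered.
    static ISchedulingRule widenedStartRule(ISchedulingRule serverRule) {
        return MultiRule.combine(serverRule, ResourcesPlugin.getWorkspace().getRoot());
    }
}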
I haven't had much time to look into the code in detail, but I wanted to drop a quick comment to ensure that we keep moving.

In the past we were locking the workspace during startup, but we removed the lock because it shouldn't be necessary to lock the workspace while starting a server. The only real reason to lock the workspace is during the publish. Going further, we shouldn't really lock the entire workspace during the publish, but only those resources that are being published.
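For illustration only (a sketch of the idea, not existing WTP code, and 'resources' is a hypothetical input): a narrower publish rule could be assembled from just the resources being published, using the workspace rule factory, instead of locking the entire workspace root.

import org.eclipse.core.resources.IResource;
import org.eclipse.core.resources.IResourceRuleFactory;
import org.eclipse.core.resources.ResourcesPlugin;
import org.eclipse.core.runtime.jobs.ISchedulingRule;
import org.eclipse.core.runtime.jobs.MultiRule;

public class PublishRuleSketch {

    // Build a combined rule that covers only the given resources. Using
    // modifyRule() here is one possible choice; the point is simply that the
    // rule is scoped to what the publish actually touches.
    static ISchedulingRule publishRuleFor(IResource[] resources) {
        IResourceRuleFactory factory = ResourcesPlugin.getWorkspace().getRuleFactory();
        ISchedulingRule rule = null;
        for (IResource resource : resources) {
            rule = MultiRule.combine(rule, factory.modifyRule(resource));
        }
        return rule;
    }
}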
(In reply to comment #1)
> I haven't had much time to look into the code in detail, but I wanted to drop
> a quick comment to ensure that we keep moving.
>
> In the past we were locking the workspace during startup, but we removed the
> lock because it shouldn't be necessary to lock the workspace while starting a
> server. The only real reason to lock the workspace is during the publish.
> Going further, we shouldn't really lock the entire workspace during the
> publish, but only those resources that are being published.

Not having the lock does put rather strong constraints on what downstream callees can do (safely). In the interest of flexibility for downstream callees I think it would be better to have a larger lock. Downstream callees cannot widen the lock (yes, I've tried :-); a sketch of what happens is below.

I gather that the Server core API is intended to support a wide variety of servers... so can you really know what they are doing exactly?

Also, one of the downstream callees is actually the Tomcat launch configuration delegate in the Eclipse code itself, and that one is apparently already doing something that violates the assumption made by the 'too small' lock in the Server start job.
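To illustrate the 'cannot widen' point (again just a sketch, not the actual code I tried): attempting to acquire a broader rule from code that is already running under the narrower start-job rule is rejected by the job manager.

import org.eclipse.core.resources.ResourcesPlugin;
import org.eclipse.core.runtime.jobs.IJobManager;
import org.eclipse.core.runtime.jobs.ISchedulingRule;
import org.eclipse.core.runtime.jobs.Job;

public class CannotWidenSketch {

    // Called from code already running under the StartJob's server-only rule.
    static void tryToWiden() {
        IJobManager jobManager = Job.getJobManager();
        ISchedulingRule workspaceRoot = ResourcesPlugin.getWorkspace().getRoot();
        // Nested rules must be contained in the rule the thread already owns,
        // so asking for the workspace root here throws IllegalArgumentException
        // instead of widening the lock.
        jobManager.beginRule(workspaceRoot, null);
        try {
            // ... publish-related work would go here ...
        } finally {
            jobManager.endRule(workspaceRoot);
        }
    }
}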
Hmm, interesting... looking into the code I do see that Tomcat is doing a publish during a start, and that seems odd to me. But there was a valid reason for it at some point: bug 125002. I've got to think about what we can do to improve the situation.

Fundamentally it seems unusual to me to do a publish during a start operation, especially when the framework should be taking care of it, but a true solution to the problem might be complex, since the launch configuration delegate is an extension of org.eclipse.debug.core. One solution might be to provide our own implementation and move all the adopters onto this extension point, but that might be a bit drastic.
(In reply to comment #2)
> Not having the lock does put rather strong constraints on what downstream
> callees can do (safely). In the interest of flexibility for downstream callees
> I think it would be better to have a larger lock. Downstream callees cannot
> widen the lock (yes, I've tried :-).

Yes, that is not possible.

> I gather that the Server core API is intended to support a wide variety of
> servers... so can you really know what they are doing exactly?

Correct. In the past we were being safe by taking a lock on the workspace, but for the benefit of usability we reduced the lock. We had reports of servers taking a long time to start, or to detect the server status (in remote server cases); in those cases a lock on the workspace would prevent making any modifications to the projects, unnecessarily (in most cases).

> Also, one of the downstream callees is actually the Tomcat launch configuration
> delegate in the Eclipse code itself, and that one is apparently already doing
> something that violates the assumption made by the 'too small' lock in the
> Server start job.

Tsk, tsk... we have to review the Tomcat code, and possibly change it to remove the publish during the start.
Kris, do you mind explaining your race condition in more detail? What exactly was the race condition?
Essentially, when we publish Grails apps, it looks like we end up doing publish-related work (packaging up content from the workspace into a war, etc., to put onto the server) at the same time the workspace is being built. In some cases this ends up packaging old/corrupt files.

Also, I have to add that I haven't yet been able to get this to happen just playing in the UI. I've only seen it in test runs that directly call the server API from a non-UI thread. Perhaps there is some reason why it can't or doesn't happen when starting and stopping the server from the UI (i.e. by clicking the start button).

It may be more of a problem for us (Grails tooling in STS) than usual, because the internal Groovy-Eclipse compiler and the external process that builds the war file both like to do a compile, and both use the same output folder. I really need some way to ensure that the Eclipse build is finished before we start the external war packaging script (something like the sketch below). Unfortunately my publishing code is being called through several layers of API from above that have already taken a lock which I'm not able to widen.
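What I'd like to be able to do, roughly, is block until the Eclipse build has run its course before kicking off the external war packaging. A sketch of that using the standard build job families follows; whether this is safe from where my code is called is exactly the open question, since joining while holding a conflicting scheduling rule can deadlock.

import org.eclipse.core.resources.ResourcesPlugin;
import org.eclipse.core.runtime.jobs.Job;

public class WaitForBuildSketch {

    // Wait for any pending or running workspace build jobs to finish so the
    // external war packaging script sees a fully built output folder.
    static void waitForBuilds() throws InterruptedException {
        Job.getJobManager().join(ResourcesPlugin.FAMILY_AUTO_BUILD, null);
        Job.getJobManager().join(ResourcesPlugin.FAMILY_MANUAL_BUILD, null);
    }
}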
It was Bug 125002 that added publishing to TomcatLaunchConfigurationDelegate. I don't know if the reason still applies or if the implementation needs more work.
Oops. I need to examine the comments more thoroughly, as Angel had already noted that bug. I'll keep looking and see if I can come up with something useful. :)
Just a little extra tidbit of info. I've just had some of my tests fail even when the 'correct' scheduling rule was applied to my publish code 'from above'. (I'm printing out the rule in effect, and I noted that it tends to vary unpredictably, depending on the sequence of events or the job scheduling order?)

In this case the test failed because, although the correct scheduling rule was applied to the publish job, this only guaranteed that the build job was not scheduled concurrently with the publish job; it did not guarantee that the build job was scheduled *before* the publish job. So what happened here is that the build job was scheduled after the publish and I still ended up publishing stale content to the server.

It may be a good idea if publish jobs would join any scheduled build jobs before starting (see the sketch below). There doesn't seem to be much point in publishing content from the workspace if chances are that this content is about to be changed by a build job.
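A rough sketch of the framework-side version of that idea (PublishJobSketch is hypothetical, not the real Server.PublishJob): join the build family first, while not yet holding the workspace rule, and only then take the rule and publish. The ordering matters; joining while already holding the workspace root rule would deadlock, because the build job needs that same rule.

import org.eclipse.core.resources.IWorkspaceRoot;
import org.eclipse.core.resources.ResourcesPlugin;
import org.eclipse.core.runtime.IProgressMonitor;
import org.eclipse.core.runtime.IStatus;
import org.eclipse.core.runtime.Status;
import org.eclipse.core.runtime.jobs.Job;

public class PublishJobSketch extends Job {

    PublishJobSketch() {
        super("Publish (sketch)");
        // Deliberately no setRule() here: the rule is acquired in run(),
        // after the builds have been joined.
    }

    protected IStatus run(IProgressMonitor monitor) {
        try {
            // Let any scheduled build jobs run their course first, so we
            // never publish content that a build is about to overwrite.
            Job.getJobManager().join(ResourcesPlugin.FAMILY_AUTO_BUILD, monitor);
        } catch (InterruptedException e) {
            return Status.CANCEL_STATUS;
        }
        IWorkspaceRoot root = ResourcesPlugin.getWorkspace().getRoot();
        Job.getJobManager().beginRule(root, monitor);
        try {
            // ... the actual publish would happen here, under the same
            // workspace-root rule the real publish job uses ...
            return Status.OK_STATUS;
        } finally {
            Job.getJobManager().endRule(root);
        }
    }
}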
Are the "build" and "publish" jobs being scheduled automatically? If a resource change triggers a "build" and a "publish", and the "build" job ran after the "publish" job and changed more resources, I would expect another "publish" to be triggered to address those changes.
The test in question is doing something like this:
- create a test server
- create some project, add content to it
- add the project to the server
- start the server
- check whether the server contains the correct content

This causes the following:
- starting the server causes a publish (sometimes called through the start job and sometimes through the publish job)
- creating the project etc. causes a build job

Depending on the order of these two, and which lock ends up being held at the time of the publish, I get three different outcomes (stale publish, corrupt publish, or correct publish).

Yes, you are right, the build job would trigger another publish in the case of the stale publish. It actually doesn't, but that is a problem in my own code (I'm forced to ignore changes to the output folder because I cannot distinguish changes made there by the war build from those made by the Eclipse compilers). I know how to solve that.

Nevertheless, I think it would be much better if the build job were scheduled before the publish job. For one, it would be much more efficient. Also, arguably, a project that has edits which would trigger builds really *should not* be published before the build is run. Why? Because some resources that the user edited are inconsistent with the resources the builder is supposed to regenerate. In essence, the project really isn't in a consistent state until it has been built.

You could argue, practically speaking, that it isn't really all that important. Users working in the UI would probably not see the 'build after publish' case, since it would only happen if they clicked the start button really fast after making a change :-) Anyway, I thought it was interesting enough to make note of.
I'm having trouble thinking of a good way for a server to know when a resource change, and the chain of build job(s) that it may spawn, has reached completion. Locking the workspace would avoid partially built artifacts being published but, as you have discovered, won't prevent stale content if build job(s) run after the publish. The server can't prevent the latter, but it does address that case with auto-publishing.

With generic servers, and to a lesser extent with Tomcat, the auto-published content will get served eventually. With the generic servers, the auto-publish should trigger a redeployment of the webapp. With Tomcat, some resources, such as static content, can change in the running webapp without problems. Other types of resources, like web.xml, will cause Tomcat to reload the webapp, which is the default behavior. It can take up to 5 seconds for Tomcat to detect that the reload is needed. In either case, it will be up to the test program to determine when the server is finally serving the desired content if changes can occur after the server has started. If the test has to figure this out, it might as well wait for the build job(s) to run their course before starting the server.

With respect to Tomcat doing a publish during startup, it's probably still needed to ensure that the server configuration is published to the server prior to startup. If auto-publishing were disabled, a project added to the server, and the server started, should the server serve the added project or not, given that no explicit publish operation has occurred? I'm assuming the added project should be served, and that turning auto-publishing off is simply meant to avoid publishing while the server is running. Of course, I may not completely understand the server framework, since my exposure to it is almost entirely at the adapter level.
(In reply to comment #12)
> I'm having trouble thinking of a good way for a server to know when a resource
> change, and the chain of build job(s) that it may spawn, has reached
> completion.

I think this is true when you are looking at it from the point of view of (the author of) code running inside the server's publishing code. That is in fact precisely the problem I'm struggling with, since that is where my code runs. I don't think it is necessarily true when you look at it from the framework's point of view. I don't see why the framework couldn't add some code to 'join' auto-build jobs prior to starting the server.

> Locking the workspace would avoid partially built artifacts being published
> but, as you have discovered, won't prevent stale content if build job(s) run
> after the publish.

Right. That was my point in my last message. Just enforcing mutual exclusion doesn't seem sufficient (I initially thought it would be).

> The server can't prevent the latter, but it does address that case with
> auto-publishing.

Agreed, auto-publishing can recover from the problem, and that lies within the realm of what can be done in the server code. But regardless, I think it is undesirable for the problem to arise in the first place. (For example, what if auto-publishing is turned off? Then, although I made the changes before starting the server, is it acceptable that I end up with a 'stale' copy being published, just because by accident or chance the build didn't get scheduled soon enough?) I would be inclined to say "no, that is not ok", and I would be inclined to think it is the framework's responsibility to ensure that this doesn't happen. (It seems it would have to be the framework's responsibility, since the server publish code has no control over locks and scheduling etc.)

> With respect to Tomcat doing a publish during startup, it's probably still
> needed to ensure that the server configuration is published to the server
> prior to startup. If auto-publishing were disabled, a project added to the
> server, and the server started, should the server serve the added project or
> not, given that no explicit publish operation has occurred? I'm assuming the
> added project should be served, and that turning auto-publishing off is
> simply meant to avoid publishing while the server is running. Of course, I
> may not completely understand the server framework, since my exposure to it
> is almost entirely at the adapter level.

Everything you say makes sense (but I also admit I don't know the server framework well, almost certainly even less than you). If what you say is true, then it seems we are back to the initial reason for me raising this bug report: the framework seems to assume that 'start' only needs to acquire a lock on the server, but not on the published workspace resources. This seems to be in conflict with doing a publish during start. I argued earlier that perhaps start should take a bigger lock to accommodate what Tomcat is doing, but it seems that could be a performance problem...

Anyway... this stuff is complex. Perhaps it was a bad idea for me to bring up my observation about builds executing after publish. It may be best to treat this as a separate issue (I could raise one if that seems like a good idea). Now, if we focus only on the initial problem (start lock <-> Tomcat publish during start), then I see only three possible choices here:
- Tomcat should change;
- the lock should change;
- leave it as is and live with it, because the cost of a 'fix' is worse than the problem it fixes.

Not sure which is really the best course of action. I hoped you guys could figure that out :-)