Some Eclipse Foundation services are deprecated, or will be soon. Please ensure you've read this important communication.
Bug 353015 - Getting a thread dump of a hanging job?
Summary: Getting a thread dump of a hanging job?
Status: CLOSED FIXED
Alias: None
Product: Community
Classification: Eclipse Foundation
Component: CI-Jenkins (show other bugs)
Version: unspecified   Edit
Hardware: PC Windows Vista
: P3 normal (vote)
Target Milestone: ---   Edit
Assignee: Eclipse Webmaster CLA
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2011-07-25 11:38 EDT by Kaloyan Raev CLA
Modified: 2011-07-26 09:22 EDT (History)
0 users

See Also:


Attachments
hudson thread dump (11.15 KB, text/plain)
2011-07-26 06:30 EDT, Kaloyan Raev CLA
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Kaloyan Raev CLA 2011-07-25 11:38:46 EDT
The libra-indigo job is persistently hanging on the tests. This behavior is not observed when running the build locally. 

I want to have a thread dump to see the reason, but when I call:
https://hudson.eclipse.org/hudson/threadDump

It says that I have no permissions.

How can I get the thread dump?
Comment 1 Denis Roy CLA 2011-07-25 13:11:53 EDT
Let your job hang, let us know and we'll post the thread dump for you.
Comment 2 Eclipse Webmaster CLA 2011-07-25 13:27:10 EDT
Another option is if the job is running on the master you can try starting it with the -Xdebug flag, then you should be able to connect with jdb from build.

-M.
Comment 3 Kaloyan Raev CLA 2011-07-25 15:10:38 EDT
The build is currently hanging hudson-slave1. 
Please, take the thread dump. You may then kill the job to free the slot.
Comment 4 Denis Roy CLA 2011-07-25 16:18:18 EDT
Here's what I'm seeing:

Executor #4 for hudson-slave1 : executing libra-indigo #36

"Executor #4 for hudson-slave1 : executing libra-indigo #36" Id=59 Group=main WAITING on hudson.remoting.UserRequest@2947aa41
	at java.lang.Object.wait(Native Method)
	-  waiting on hudson.remoting.UserRequest@2947aa41
	at java.lang.Object.wait(Object.java:485)
	at hudson.remoting.Request$1.get(Request.java:206)
	at hudson.remoting.Request$1.get(Request.java:172)
	at hudson.remoting.FutureAdapter.get(FutureAdapter.java:55)
	at hudson.Proc$RemoteProc.join(Proc.java:347)
	at hudson.Launcher$ProcStarter.join(Launcher.java:280)
	at hudson.tasks.Maven.perform(Maven.java:259)
	at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:19)
	at hudson.model.AbstractBuild$AbstractRunner.perform(AbstractBuild.java:603)
	at hudson.model.Build$RunnerImpl.build(Build.java:172)
	at hudson.model.Build$RunnerImpl.doRun(Build.java:137)
	at hudson.model.AbstractBuild$AbstractRunner.run(AbstractBuild.java:417)
	at hudson.model.Run.run(Run.java:1337)
	at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46)
	at hudson.model.ResourceController.execute(ResourceController.java:88)
	at hudson.model.Executor.run(Executor.java:140)



SCM polling for hudson.model.FreeStyleProject@67d77351[libra-indigo]

"SCM polling for hudson.model.FreeStyleProject@67d77351[libra-indigo]" Id=182 Group=main WAITING on hudson.slaves.WorkspaceList@38b1a665
	at java.lang.Object.wait(Native Method)
	-  waiting on hudson.slaves.WorkspaceList@38b1a665
	at java.lang.Object.wait(Object.java:485)
	at hudson.slaves.WorkspaceList.acquire(WorkspaceList.java:173)
	at hudson.model.AbstractProject.poll(AbstractProject.java:1230)
	at hudson.triggers.SCMTrigger$Runner.runPolling(SCMTrigger.java:425)
	at hudson.triggers.SCMTrigger$Runner.run(SCMTrigger.java:454)
	at hudson.util.SequentialExecutionQueue$QueueEntry.run(SequentialExecutionQueue.java:118)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
	at java.lang.Thread.run(Thread.java:619)

	Number of locked synchronizers = 1
	- java.util.concurrent.locks.ReentrantLock$NonfairSync@175c7b86
Comment 5 Denis Roy CLA 2011-07-25 16:19:22 EDT
Actually, just about every running job has the same thread:

"Executor #0 for hudson-slave1 : executing cbi-papyrus-0.7-nightly #1623" Id=55 Group=main WAITING on hudson.remoting.UserRequest@689b9911
	at java.lang.Object.wait(Native Method)
	-  waiting on hudson.remoting.UserRequest@689b9911
	at java.lang.Object.wait(Object.java:485)
	at hudson.remoting.Request$1.get(Request.java:206)
	at hudson.remoting.Request$1.get(Request.java:172)
	at hudson.remoting.FutureAdapter.get(FutureAdapter.java:55)
	at hudson.Proc$RemoteProc.join(Proc.java:347)
	at hudson.Launcher$ProcStarter.join(Launcher.java:280)
	at hudson.tasks.Ant.perform(Ant.java:216)
	at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:19)
	at hudson.model.AbstractBuild$AbstractRunner.perform(AbstractBuild.java:603)
	at hudson.model.Build$RunnerImpl.build(Build.java:172)
	at hudson.model.Build$RunnerImpl.doRun(Build.java:137)
	at hudson.model.AbstractBuild$AbstractRunner.run(AbstractBuild.java:417)
	at hudson.model.Run.run(Run.java:1337)
	at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46)
	at hudson.model.ResourceController.execute(ResourceController.java:88)
	at hudson.model.Executor.run(Executor.java:140)
Comment 6 Denis Roy CLA 2011-07-25 16:21:54 EDT
On slave1 I see this:

java -Dbuckminster.output.root=/opt/users/hudsonbuild/workspace/buckminster-mdt-ocl-core-3.2-master/buckminster.output -Dbuckminster.temp.root=/opt/users/hudsonbuild/workspace/buckminster-mdt-ocl-core-3.2-master/buckminster.temp -Dhttp.proxyHost=proxy.eclipse.org -Dhttp.proxyPort=9898 -Dhttps.proxyHost=proxy.eclipse.org -Dhttps.proxyPort=9898 -Dhttp.nonProxyHosts="*.eclipse.org" -Dhttps.nonProxyHosts="*.eclipse.org" -Dcheckout.location=/opt/users/hudsonbuild/workspace/buckminster-mdt-ocl-core-3.2-master -Dreference.repository=https://hudson.eclipse.org/hudson/job/buckminster-mdt-ocl-core-3.2-master/lastSuccessfulBuild/artifact/MDT-OCL.p2.repository/ -XX:PermSize=64M -XX:MaxPermSize=128M -jar /shared/common/buckminster-3.7/plugins/org.eclipse.equinox.launcher_1.2.0.v20110502.jar -application org.eclipse.buckminster.cmdline.headless -data /opt/users/hudsonbuild/workspace/buckminster-mdt-ocl-core-3.2-master/buildroot/buckminster.workspace --loglevel info -S /opt/users/hudsonbuild/workspace/buckminster-mdt-ocl-core-3.2-master/org.eclipse.ocl.git/releng/org.eclipse.ocl.releng.buckminster/scripts/ocl-buckminster.script: stdout copier

"java -Dbuckminster.output.root=/opt/users/hudsonbuild/workspace/buckminster-mdt-ocl-core-3.2-master/buckminster.output -Dbuckminster.temp.root=/opt/users/hudsonbuild/workspace/buckminster-mdt-ocl-core-3.2-master/buckminster.temp -Dhttp.proxyHost=proxy.eclipse.org -Dhttp.proxyPort=9898 -Dhttps.proxyHost=proxy.eclipse.org -Dhttps.proxyPort=9898 -Dhttp.nonProxyHosts="*.eclipse.org" -Dhttps.nonProxyHosts="*.eclipse.org" -Dcheckout.location=/opt/users/hudsonbuild/workspace/buckminster-mdt-ocl-core-3.2-master -Dreference.repository=https://hudson.eclipse.org/hudson/job/buckminster-mdt-ocl-core-3.2-master/lastSuccessfulBuild/artifact/MDT-OCL.p2.repository/ -XX:PermSize=64M -XX:MaxPermSize=128M -jar /shared/common/buckminster-3.7/plugins/org.eclipse.equinox.launcher_1.2.0.v20110502.jar -application org.eclipse.buckminster.cmdline.headless -data /opt/users/hudsonbuild/workspace/buckminster-mdt-ocl-core-3.2-master/buildroot/buckminster.workspace --loglevel info -S /opt/users/hudsonbuild/workspace/buckminster-mdt-ocl-core-3.2-master/org.eclipse.ocl.git/releng/org.eclipse.ocl.releng.buckminster/scripts/ocl-buckminster.script: stdout copier" Id=530663 Group=main RUNNABLE (in native)
	at java.io.FileInputStream.readBytes(Native Method)
	at java.io.FileInputStream.read(FileInputStream.java:199)
	at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
	at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
	-  locked java.io.BufferedInputStream@7a7c82c7
	at java.io.FilterInputStream.read(FilterInputStream.java:90)
	at hudson.util.StreamCopyThread.run(StreamCopyThread.java:60)
Comment 7 Kaloyan Raev CLA 2011-07-26 06:28:37 EDT
Thanks, Denis. However, the thread dump of the job process does not show anything really useful. The tests are executed in a separate java process and that thread dump of that process is the interesting one. 

I managed to run the tests with the -Xdebug option and connect remotely from my IDE. I see that there is a thread hanging on the XML Validator that is blocked on reading an XML Schema from the Internet. Perhaps, this is not possible anymore, after some restrictions for accessing Internet were recently applied to the Hudson server. 

I will try to work this around in the test.
Comment 8 Kaloyan Raev CLA 2011-07-26 06:30:13 EDT
Created attachment 200346 [details]
hudson thread dump

Thread-8 is the problematic one hanging on socket read.
Comment 9 Kaloyan Raev CLA 2011-07-26 07:16:51 EDT
I was able to resolve the problem in the tests. 

However, the remote debug session interrupted unexpectedly and it looks like there are still some file locks in the workspace. This prevents successful execution of the libra-indigo job on 'master'. I also cannot clean the workspace on 'master' from the web front-end - it says that some ..../.nfsxxxxxx file cannot be deleted. The job works successfully on hudson-slave1 now. 

Could you clean the workspace of the libra-indigo job on 'master'?
Comment 10 Denis Roy CLA 2011-07-26 08:27:37 EDT
> I see that there is a thread hanging on the XML Validator that is blocked
> on reading an XML Schema from the Internet. Perhaps, this is not possible
> anymore, after some restrictions for accessing Internet were recently applied
> to the Hudson server. 

We haven't applied any further restrictions on Internet access to the Hudson servers, so I'd like to investigate this.  What URL are you trying to access?

I've wiped the job's workspace.
Comment 11 Kaloyan Raev CLA 2011-07-26 08:35:15 EDT
I am not sure about the exact URL, but the following one is a good guess:

http://java.sun.com/xml/ns/javaee/web-app_2_5.xsd

In the thread dump I recognize the XML validator that is triggered by the project builder. The validator needs to read the corresponding XML schema in order to validate the XML file. In WTP we have the so called XML catalog that contains the popular XML schema locally and avoid the validator accessing them from Internet. I resolved the issue by adding the XML catalog plugins as dependency to the test.
Comment 12 Denis Roy CLA 2011-07-26 08:46:03 EDT
If your tests are run from a separate Java process, you "may" need to redefine the proxy options.  We've defined them in the Shell and in Hudson, but we've seen instances where this is not enough.

We've set JAVA_ARGS and JVM-OPTS to:
-Dhttp.proxyHost=proxy.eclipse.org -Dhttp.proxyPort=9898 -Dhttps.proxyHost=proxy.eclipse.org -Dhttps.proxyPort=9898 -Dhttp.nonProxyHosts="*.eclipse.org|172.30.206.*" -Dhttps.nonProxyHosts="*.eclipse.org" -Dftp.proxyHost=proxy.eclipse.org -Dftp.proxyPort=9898 -Dftp.nonProxyHosts="*.eclipse.org"
Comment 13 Kaloyan Raev CLA 2011-07-26 09:02:19 EDT
Do you have the same proxy settings in the Maven's settings.xml?

Having them there should make all processes executed by Maven to have these proxy configuration.
Comment 14 Denis Roy CLA 2011-07-26 09:09:21 EDT
Yes, this is our maven settings.xml file:

<settings>
  <proxies>
    <proxy>
      <id>httpproxy</id>
      <active>true</active>
      <protocol>http</protocol>
      <host>proxy.eclipse.org</host>
      <port>9898</port>
      <nonProxyHosts>*.eclipse.org|172.30.206.*</nonProxyHosts>
    </proxy>
    <proxy>
      <id>httpsproxy</id>
      <active>true</active>
      <protocol>https</protocol>
      <host>proxy.eclipse.org</host>
      <port>9898</port>
      <nonProxyHosts>*.eclipse.org|172.30.206.*</nonProxyHosts>
    </proxy>
  </proxies>
</settings>
Comment 15 Kaloyan Raev CLA 2011-07-26 09:22:45 EDT
OK, great. 
Next time if I have the same problem and I cannot resolve it in the tests, I will try adding these proxy settings explicitly to the process. 

I am closing this bug. 

Thanks for the help.