Build Identifier: 8.0.1.v20110908

I'm using jetty-servlet to run JUnit tests within a project. I started noticing that sometimes the tests would hang indefinitely while waiting for a response from the Jetty server - but not within any of the code under test. In fact, the issue is reproducible without any custom server code - just by repeatedly requesting a non-existent URL on the server, where Jetty returns an HTTP 404. I originally noticed this under 7.4.5.v20110725, but am still noticing identical symptoms under 8.0.1.v20110908.

A few notes on reproducibility:
1) This happens when using the SelectChannelConnector, but cannot be reproduced using the BlockingChannelConnector.
2) I have reproduced this without using any custom servlet (receiving an HTTP 404 "Not Found" response), as well as using a custom servlet that returns an HTTP 403 "Forbidden" response. I have not been able to reproduce this when receiving an HTTP 200 "OK" response.
3) Calling HttpURLConnection.disconnect() on previous requests greatly helps in reproducing the issue. I have reproduced it without this, but only very rarely.

There is some type of deadlock happening here - but it typically takes a few hundred iterations to reproduce.

The details of my VM are:
java version "1.6.0_27"
Java(TM) SE Runtime Environment (build 1.6.0_27-b07)
Java HotSpot(TM) 64-Bit Server VM (build 20.2-b06, mixed mode)

I am running Windows 7 SP1 x64 on a dual-core CPU.

The closest related bug report I could find is https://bugs.eclipse.org/bugs/show_bug.cgi?id=355048, but it does not appear to be related. In particular, there is no excessive CPU observed when Jetty hangs. Instead, the client requests are simply never fulfilled, and all threads are left in a RUNNABLE, WAITING, or TIMED_WAITING state. (See the attached test.log for a complete thread dump.)

Reproducible: Sometimes

Steps to Reproduce: Run the attached test project.
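[Editorial note: a minimal sketch of the client-side request pattern described above - repeatedly requesting a non-existent URL and calling HttpURLConnection.disconnect(). The JDK's built-in com.sun.net.httpserver.HttpServer is used here purely as a stand-in target, so this illustrates only the request loop, not the actual Jetty setup from the attached test project.]

```java
import com.sun.net.httpserver.HttpServer;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URL;

public class HangRepro {

    // One request/response cycle: expect HTTP 404 for an unmapped path.
    static int requestOnce(int port) throws Exception {
        HttpURLConnection conn = (HttpURLConnection)
                new URL("http://127.0.0.1:" + port + "/no-such-path").openConnection();
        conn.setConnectTimeout(5000);
        conn.setReadTimeout(5000);   // in the reported hang, this read never completes
        int code = conn.getResponseCode();
        conn.disconnect();           // per the report, disconnect() helps trigger the hang
        return code;
    }

    // Start a throwaway local server, make one request, shut down.
    public static int demo() throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(0), 0);
        server.start();
        try {
            return requestOnce(server.getAddress().getPort());
        } finally {
            server.stop(0);
        }
    }

    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(0), 0);
        server.start();
        try {
            // The report says a few hundred iterations are typically needed.
            for (int i = 0; i < 200; i++) {
                int code = requestOnce(server.getAddress().getPort());
                if (code != 404) {
                    throw new AssertionError("unexpected response: " + code);
                }
            }
        } finally {
            server.stop(0);
        }
    }
}
```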
Created attachment 203116 [details] Log file from test case, including 2 complete stack traces / thread dumps. Thread dumps start at lines #43,837 and #54,785.
Created attachment 203117 [details] Test project demonstrating issue.

This is the source bundle for a test project to help demonstrate and reproduce this issue. It is Maven enabled - and can either be run as "Maven test" - or the included "JettyHangTest" can be run directly with JUnit.

This includes 5 JUnit tests: testBlockingChannelConnector403Disconnect, testSelectChannelConnnector200Disconnect, testSelectChannelConnnector403NoDisconnect, testSelectChannelConnector403Disconnect, and testSelectChannelConnector404Disconnect. The first 2 never fail, and the last 2 fail quite consistently. I've only seen testSelectChannelConnnector403NoDisconnect fail very rarely - but it does happen.

Please note that for each test run, all logging output is captured to target/JUnit/test.log. (A sample containing failures of the last 2 tests is also attached separately.)

If you are not able to reproduce, please set JettyHangTest.ITERATIONS to 10,000 or higher, then retry. (This will cause the tests to take significantly longer to complete - including the successful ones.) On a failure, a thread dump (stack traces) will automatically be captured and generated to the log, and then the test will fail, with any remaining tests continuing to execute.

If desired, JettyHangTest.FAIL_ON_TIMEOUT can be set to false, which will call Thread.wait() on the test thread - allowing for easier use of debuggers or profilers, etc.
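[Editorial note: the automatic thread-dump-on-failure behavior mentioned above can be implemented with the JDK's ThreadMXBean. A hedged sketch of one way to do it - the attached test project may do it differently:]

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;

public class ThreadDumper {

    // Capture a textual dump of all live threads, including lock/monitor info,
    // suitable for writing to a log when a test times out.
    public static String captureThreadDump() {
        StringBuilder sb = new StringBuilder();
        for (ThreadInfo info : ManagementFactory.getThreadMXBean()
                .dumpAllThreads(true, true)) {
            sb.append(info.toString()); // note: ThreadInfo.toString() truncates deep stacks
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(captureThreadDump());
    }
}
```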
No luck reproducing so far... even with iterations set to 100000 (which exhausts FD's for disconnect tests). Still trying.
Still no joy reproducing on my linux box. Inspecting the thread dump does not reveal anything of great concern. All jetty threads are idle except selector0, which is runnable, but in a strange discardUrgentData method.

Googling this, I found http://forum.subsonic.org/forum/viewtopic.php?t=7493 so other software is having similar issues... somebody has commented there that they fixed this by upgrading the JVM to 1.6.0_27, but you are on that already. Other googling found similar issues reported, some with jetty, but with other apps as well.

It would be useful if you could capture a TCP/IP dump of the traffic during a failed test. I'd like to know if the problem is caused by jetty not sending a response or by the client not reading it.
Joakim, I cc'd you on this - can you try to reproduce this on your windows env?
Greg - thank you for looking into this! I had meant to run this on Linux as well before submitting the report, but I guess I missed that. I'd be happy to help capture a TCP/IP dump. Do you have any suggestions for capturing this, e.g. something within Jetty or otherwise within the Java layer? Or is Wireshark the best bet? I will get this back to you, but it may be a few weeks. I'm really busy right now, but should have some free weeks coming up soon...
I get the build failure as well on jetty-websocket under windows.

$ mvn --version
Apache Maven 3.0.2 (r1056850; 2011-01-08 17:58:10-0700)
Java version: 1.6.0_18, vendor: Sun Microsystems Inc.
Java home: C:\Program Files (x86)\Java\jdk1.6.0_18\jre
Default locale: en_US, platform encoding: Cp1252
OS name: "windows 7", version: "6.1", arch: "x86", family: "windows"

Running org.eclipse.jetty.websocket.WebSocketClientTest
2011-09-20 08:39:03.343:WARN:oeji.nio:
java.lang.ClassCastException: org.eclipse.jetty.websocket.WebSocketClient$WebSocketFuture cannot be cast to org.eclipse.jetty.io.nio.SelectChannelEndPoint
	at org.eclipse.jetty.io.nio.SelectorManager$SelectSet.doSelect(SelectorManager.java:599)
	at org.eclipse.jetty.io.nio.SelectorManager$1.run(SelectorManager.java:282)
	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:598)
	at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:533)
	at java.lang.Thread.run(Thread.java:619)
Upgrading the JDK on Windows 7 resulted in the ClassCastException going away.

joakim@hyperion /c/code/org.eclipse.jetty.project/jetty-websocket
$ mvn --version
Apache Maven 3.0.2 (r1056850; 2011-01-08 17:58:10-0700)
Java version: 1.6.0_27, vendor: Sun Microsystems Inc.
Java home: C:\Program Files\Java\jdk1.6.0_27\jre
Default locale: en_US, platform encoding: Cp1252
OS name: "windows 7", version: "6.1", arch: "amd64", family: "windows"

But there's still a timeout on WebSocketClientTest:

$ cat target/surefire-reports/org.eclipse.jetty.websocket.WebSocketClientTest.txt
-------------------------------------------------------------------------------
Test set: org.eclipse.jetty.websocket.WebSocketClientTest
-------------------------------------------------------------------------------
Tests run: 12, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 14.902 sec <<< FAILURE!
testAsyncConnectionRefused(org.eclipse.jetty.websocket.WebSocketClientTest)  Time elapsed: 1.085 sec  <<< ERROR!
java.util.concurrent.TimeoutException
	at org.eclipse.jetty.websocket.WebSocketClient$WebSocketFuture.get(WebSocketClient.java:557)
	at org.eclipse.jetty.websocket.WebSocketClient$WebSocketFuture.get(WebSocketClient.java:345)
	at org.eclipse.jetty.websocket.WebSocketClientTest.testAsyncConnectionRefused(WebSocketClientTest.java:160)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
	at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:31)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:76)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193)
	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52)
	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191)
	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42)
	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184)
	at org.junit.runners.ParentRunner.run(ParentRunner.java:236)
	at org.apache.maven.surefire.junit4.JUnit4TestSet.execute(JUnit4TestSet.java:53)
	at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:119)
	at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:101)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.maven.surefire.booter.ProviderFactory$ClassLoaderProxy.invoke(ProviderFactory.java:103)
	at $Proxy0.invoke(Unknown Source)
	at org.apache.maven.surefire.booter.SurefireStarter.invokeProvider(SurefireStarter.java:150)
	at org.apache.maven.surefire.booter.SurefireStarter.runSuitesInProcess(SurefireStarter.java:91)
	at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:69)
Bug #355048 seems related to this one.
I ran into what seems to be this issue today. I can reproduce this issue on 1.6u29 but not on 1.6u21. I'm using Win7 32bit. I have used the test case previously attached to this bug to verify.
Created attachment 205663 [details] tcpdump where server gets unresponsive

An IP dump showing the server (192.168.161.49) and clients (192.168.161.3 and .58) in a REST conversation where the server becomes unresponsive.
I'm having no luck whatsoever reproducing this on linux with jetty-8.0.1, jetty-8.0.4, or jetty-8.0.5-SNAPSHOT using jdk 1.6_26, doing runs of 10,000 repetitions.

For the people experiencing this on a windows platform: have you ensured that you have tweaked the tcp stack on the windows box for high load? Here's the jetty page, which refers to linux, but will give you some idea of what to look for on windows: http://wiki.eclipse.org/Jetty/Howto/High_Load

Jan
Created attachment 208891 [details] Thread-dump of hang triggered by one request at a time

Here is my thread-dump, reproduced using a single connection (so the Windows TCP limit cannot be at fault). Please verify whether this thread-dump corresponds to the same bug; otherwise I will file a separate bug report. Here is what I see:

1. I launch the server and initiate client connections against the server.
2. Only one client connection is allowed at a time (I make a request, get a response, and based on that response I might send a subsequent request).
3. Shortly thereafter (say 500ms) the server stops responding. It looks like it finishes responding to one request and hangs right before accepting the next request.
4. When it is "hung" I can connect using Putty or Firefox. The server will accept the socket connection and allow me to send whatever I want, but it will never respond with an error message, respond to the request, or close the connection.

In my case this problem occurs extremely rarely, so I cannot provide you with reproducible steps, but hopefully this will give you extra clues on top of the testcases posted above. I am using version 8.0.4.v20111024.
Greg, Can you please take a look at this ASAP? We're about to ship a product using an embedded Jetty instance and this hang is a show-stopper. It occurs frequently enough to be a problem and is seemingly random so we can't provide customers with a workaround. I need to locate a version of Jetty that does not have this problem (older or newer) in order to ship the product.
Gili, have you tried 8.1.0.RC3? Looking at the thread dump now.

PS. Urgency in the open source project is not determined by product shipping dates. Such commercial concerns are best communicated via commercial support contracts :)
Guys, I've looked at the thread dumps provided and cannot see any deadlocks. We've had 3 engineers try to repro, with no luck. So I'm closing this issue for now, because we cannot reproduce it and no evidence of a real deadlock has been given.

Deadlocks are easy to spot in a stack trace, so if anybody has a deadlock it should be simple to capture the details and raise another bugzilla with that thread dump. I don't doubt that the multiple commenters on this issue do have real problems, but there is nothing yet to indicate that they are jetty issues; they could just as well be garbage collection or just a slow application. As it is, I think we've had multiple issues piled onto this bugzilla, which is just adding a layer of confusion.

If you have a deadlock - capture a thread dump and it will be clearly visible. Open a new issue with that thread dump.

If you have a slow server, then please profile it and check your garbage collection. If you suspect it is jetty rather than your application, please open a new issue with your evidence and thinking to that effect.

If Jetty becomes unresponsive, then a thread dump is useful, but it is also good to probe the server with some other requests - is it accepting new connections? Can you get a 400 or 404 response on a new connection for a nonsense request? Can you get an image request serviced (or other static content)? Is it just your application requests that are slow? What is the CPU level doing? etc. Gather that evidence and open a new issue with it.
Greg,

I believe we're all talking about the same issue. The title of the bug report is misleading because we're not necessarily talking about a *thread* hang here, but rather about a *request* hang (though obviously one could be caused by the other). The symptoms are:

* The server accepts new socket connections and allows clients to send whatever they want, but never sends a response. You can verify this by opening a putty connection to the server and typing away.
* Because nothing seems to be parsing the request, it doesn't really matter what URL you request; you'll never get a response.
* The server is not slow. CPU usage is nil.

Remember, I've reproduced this issue using a single client using at most one connection at a time. My guess is that this problem isn't caused by a deadlock per se, but rather by server variables ending up in a bad state. If you help us enable extra logs we should be able to help you pinpoint the location of the problem.

Please reopen this issue.
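[Editorial note: the "accepts a connection but never responds" symptom described above can be verified mechanically with a raw-socket probe: if the connect succeeds but a read times out after sending a request, the server is in the hung state. A hedged sketch; the demo method uses the JDK's built-in HttpServer as a stand-in target, and host/port in real use would be the Jetty server's:]

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class ServerProbe {

    // Returns true if the server sends any response bytes within timeoutMs;
    // false if it accepts the connection but never answers (the hang symptom).
    public static boolean probe(String host, int port, int timeoutMs) throws Exception {
        Socket socket = new Socket();
        try {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            socket.setSoTimeout(timeoutMs);
            OutputStream out = socket.getOutputStream();
            out.write(("GET /nonsense HTTP/1.1\r\nHost: " + host
                    + "\r\nConnection: close\r\n\r\n").getBytes("US-ASCII"));
            out.flush();
            InputStream in = socket.getInputStream();
            try {
                return in.read() != -1; // any byte back (even a 404) means "alive"
            } catch (SocketTimeoutException e) {
                return false;           // connected, sent a request, got nothing back
            }
        } finally {
            socket.close();
        }
    }

    // Self-contained demo: probe the JDK's built-in HttpServer as a stand-in target.
    public static boolean demo() throws Exception {
        com.sun.net.httpserver.HttpServer server =
                com.sun.net.httpserver.HttpServer.create(new InetSocketAddress(0), 0);
        server.start();
        try {
            return probe("127.0.0.1", server.getAddress().getPort(), 5000);
        } finally {
            server.stop(0);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("server alive: " + demo());
    }
}
```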
(In reply to comment #16)
> If you have a deadlock - capture a thread dump and it will be clearly visible.
> Open a new issue with that thread dump.

Greg, I think this argument makes it a bit too easy. We are not talking about a classic deadlock here. From my observations it's a race condition where a Jetty acceptor thread waits on something that is never triggered again. The server is then unresponsive with no CPU activity. Browser requests take a very long time to time out; if I kill the server, all waiting browsers time out immediately.
+1 for comments #17 and #18. As of my initial report in early September, I was able to reproduce this quite regularly. I will attempt to do the same with 8.1.0.RC3 later today or tomorrow, then re-test with 8.0.1.v20110908 for comparison. If I'm no longer able to reproduce with this latest version compared to the reported version, then I agree that this bug should be closed as resolved, and any remaining issues should be tracked on new bugs.
I am still able to easily reproduce with 8.0.1.v20110908. This is now with JDK 1.6.0_30, and happens to be on a clean installation of Windows 7. The stack traces are no different than those attached previously.

8.1.0.RC3 is not yet available at http://download.eclipse.org/jetty/ , nor is it available in Maven. I repeated my test project used above (and also attached), only changing the jetty-servlet <version>8.0.1.v20110908</version> to 8.1.0.RC2. I was unable to complete the test because everything is unbelievably slow, taking multiple seconds to run each iteration vs. running multiple iterations within fractions of a second on all previously tested versions.
For my own part, I am running under Windows 7 32-bit, Java7 update 2. Greg and Jan, you've both been unable to reproduce the problem under Linux yet there are plenty of people here who reproduced it under Windows. Please try reproducing it under the same environment as us. The problem could be OS or JVM-specific.
8.1.0.RC3 never got published - sorry about that. I'll probably be running RC4 tomorrow, which will go out; otherwise I would recommend you simply try building the jetty-8 branch and see if your issue still exists. That's about all I can add. We have run across odd issues in windows jvm's over the years; some can be worked around, others cannot... but we can't fix what we can't reproduce.

I would recommend profiling under YourKit or something and reproducing there. That ought to give you guys some more information about what might be going on here, and might help us sort it out as well.
Gili et al, thanks for some extra clarification about this issue. As I said, I don't doubt that you have an issue, but I doubt that all comments here are about the same issue.

Looking at your latest description, the information we need to debug this is unlikely to be in a thread dump. Instead, the output from a call to server.dump() will be very helpful. There are two ways you can generate this output. In jetty.xml there are some options:

  <Set name="dumpAfterStart">false</Set>
  <Set name="dumpBeforeStop">false</Set>

If you set these to true, then you will get a full dump after starting, and then again when ctrl-C is hit to stop the server. If you can reproduce the issue, then the output from the second dump would be very valuable to see.

You can also get a dump at will by using the JMX interface and calling the dump methods with jvisualvm or jconsole. To run with jmx, just uncomment jetty-jmx.xml from start.ini and include -Dcom.sun.management.jmxremote on the command line.

Also, a point of clarification - when a connection becomes unresponsive, is it just that connection, or all connections made after that point?
I've just pushed a snapshot of 8.1.0 that will be forming the basis of the RC4 release. The snapshot can be downloaded here: https://oss.sonatype.org/content/groups/jetty/org/eclipse/jetty/jetty-distribution/8.1.0-SNAPSHOT/jetty-distribution-8.1.0-20120113.002101-7.tar.gz If you guys can try this one out, taking into account Greg's instructions for how to gather extra info, that would be helpful. thanks Jan
Created attachment 209422 [details] server dump after Mark's testcase detects a hang > Also a point of clarification - when a connection becomes unresponsive, is it just that connection - or all connections made after that point? All connections, which is why this issue is so serious. I couldn't get "dumpBeforeStop" working properly so I generated a server dump by invoking server.dump() after a TimeoutException occurred. Hopefully this contains the same information that "dumpBeforeStop" was meant to produce. Please take a look.
I forgot to mention, this dump was generated the 8.1.0 snapshot you posted above.
Bugzilla really needs an edit feature ;) The dump was generated *using* the 8.1.0 snapshot you posted above.
Gili, thanks for that - again the selector is in the discard urgent data method. Can you get a dump from a normal server and see if that method is being called?
Gili, Mark, the other thing that would be interesting to try, is if using JMX you are able to call stop/start on the connector, and if this fixes the issue. If it does, then we could potentially write a detector for this issue that at least resets the connector.
Created attachment 209426 [details] Build / Test results

I copied the testcase over to https://github.com/joakime/jetty-test-hang and tested it on my Windows 7 machine. Version 6.1 (Build 7601: Service Pack 1)

Apache Maven 3.0.2 (r1056850; 2011-01-08 17:58:10-0700)
Java version: 1.6.0_27, vendor: Sun Microsystems Inc.
Java home: C:\Program Files\Java\jdk1.6.0_27\jre
Default locale: en_US, platform encoding: Cp1252
OS name: "windows 7", version: "6.1", arch: "amd64", family: "windows"

java version "1.6.0_27"
Java(TM) SE Runtime Environment (build 1.6.0_27-b07)
Java HotSpot(TM) 64-Bit Server VM (build 20.2-b06, mixed mode)

Result: no hang, no test failures.
Did you mean I should invoke dump() after a request returns *without* a timeout and compare it to dump() with a timeout?

To answer your question: no, in the normal case (non-hang) the selector is in:

  sun.nio.ch.WindowsSelectorImpl$SubSelector.poll0(Native Method)

as opposed to discardUrgentData(Native Method). It's also worth noting that in the normal case the Connector is missing the following two lines, which are present in the hang case:

  | +- sun.nio.ch.WindowsSelectorImpl@62f842c9 keys=1
  | +- SCEP@1799f67b{l(/127.0.0.1:49020)<->r(/127.0.0.1:49017),d=false,open=true,ishut=false,oshut=false,rb=false,wb=false,w=true,i=1r}-{AsyncHttpConnection@7d051157,g=HttpGenerator{s=0,h=-1,b=-1,c=-1},p=HttpParser{s=-14,l=0,c=-3},r=1554} iOps=1 rOps=1

I'm not familiar with using JMX so I'm not sure I can really help there. Besides, your proposal sounds like a workaround more than a fix. I'd rather kill this bug dead :)

For what it's worth, here is the source code for discardUrgentData() from OpenJDK:

  {
      char data[8];
      jboolean discarded = JNI_FALSE;
      int n;
      do {
          n = recv(s, (char*)&data, sizeof(data), MSG_OOB);
          if (n > 0) {
              discarded = JNI_TRUE;
          }
      } while (n > 0);
      return discarded;
  }

and it is invoked by the following Java code:

  // The descriptor may be in the exceptfds set because there is
  // OOB data queued to the socket. If there is OOB data then it
  // is discarded and the key is not added to the selected set.
  if (isExceptFds &&
      (sk.channel() instanceof SocketChannelImpl) &&
      discardUrgentData(desc))
  {
      continue;
  }

I assume we're blocked in recv(). The question is why.

Here is the commit that added the discardUrgentData() function: http://hg.openjdk.java.net/jdk7/jdk7/jdk/rev/5c5fe62d990d

The commit comment says "non-blocking sockets with TCP urgent disabled get still selected for read ops". If the socket is non-blocking, then why is recv() blocking? Are we misconfiguring the socket somehow?

Joakim, I'm running under Windows 64-bit but using a 32-bit JDK.
Can you please retry using this configuration? Also, please note that I am using Java7 update 2 and Mark is using Java6 update 30. Please try either of these versions.
According to http://msdn.microsoft.com/en-us/library/windows/desktop/ms740121%28v=vs.85%29.aspx "If no incoming data is available at the socket, the recv call blocks and waits for data to arrive according to the blocking rules defined for WSARecv with the MSG_PARTIAL flag not set unless the socket is nonblocking. In this case, a value of SOCKET_ERROR is returned with the error code set to WSAEWOULDBLOCK. The select, WSAAsyncSelect, or WSAEventSelect functions can be used to determine when more data arrives." Either the socket is blocking, or we're in some sort of loop that hits this function frequently. Seeing as the CPU usage is 0% I'm guessing the socket is blocking.
I recommend downloading the OpenJDK source code from http://download.java.net/openjdk/jdk7/ and taking a look at WindowsSelectorImpl.c and WindowsSelectorImpl.java. The only way we could end up in discardUrgentData() is if one of the sockets encountered an error. Unfortunately I don't see a way for us to find out what that error was. Another confusing thing is that the Java code seems to assume that discardUrgentData() will never block. So why does it ...?
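[Editorial note: for context on the urgent-data path discussed above - Java's Socket.sendUrgentData() sends one byte with MSG_OOB, and by default the receiver discards it unless SO_OOBINLINE is enabled; this is the same "discard" behavior the selector's discardUrgentData performs. A small self-contained demonstration of the OOB mechanism only, not of the Windows selector bug itself:]

```java
import java.net.ServerSocket;
import java.net.Socket;

public class OobDemo {

    // Send one byte of TCP urgent (OOB) data and receive it inline.
    public static int sendAndReceiveUrgentByte() throws Exception {
        ServerSocket server = new ServerSocket(0);
        try {
            Socket client = new Socket("127.0.0.1", server.getLocalPort());
            Socket accepted = server.accept();
            try {
                // Without this, the receiver silently discards the urgent byte -
                // analogous to what discardUrgentData() does in the selector.
                accepted.setOOBInline(true);
                accepted.setSoTimeout(5000);
                client.sendUrgentData('!');
                return accepted.getInputStream().read();
            } finally {
                accepted.close();
                client.close();
            }
        } finally {
            server.close();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("received: " + (char) sendAndReceiveUrgentByte());
    }
}
```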
Here are some tips for reproducing the problem quicker:

1. Increase ITERATIONS at the top of the file to 10,000 or even 100,000.
2. Comment out testBlockingChannelConnector403Disconnect, testSelectChannelConnnector200Disconnect, and testSelectChannelConnnector403NoDisconnect, since these rarely trigger the problem.
3. Surprisingly, a high logging level *increases* the probability of failure (so leave logback's logging level set to "ALL").
4. I cannot reproduce the problem under a debugger (which prevents me from inspecting variables in WindowsSelectorImpl when a hang occurs). Very annoying!

There is definitely a race condition at play... I'm running on a system with a quad-core i7 processor (hyperthreading disabled). Are your systems also quad core?

PS: http://lucene.472066.n3.nabble.com/Solr-3-3-crashes-after-18-hours-tp3218496p3268232.html seems to indicate this could be triggered by the system running out of handles.
(In reply to comment #34) > There is definitely a race condition at play... I'm running on a system with a > quad-core i7 processor (hyperthreading disabled). Are your systems also quad > core? The system that I've been reproducing this on is (only) a dual core. (Just indicating that a quad core isn't necessary for this...)
Been testing using the code at https://github.com/joakime/jetty-test-hang under a variety of environment setups. All on Win7 (Quad Core Intel i7 / 12GB RAM / Laptop).

Using the following JDKs:
* 6u27 64-bit
* 6u30 32-bit
* 6u30 64-bit
* 7u2 32-bit
* 7u2 64-bit

Tested the following jetty releases:
* 8.0.1.v20110908 (old nio)
* 8.1.0.RC2 (new nio)
* 7.6.0.RC3 (new nio, stable)

Results:
* 8.0.1.v20110908 PASS on all JDKs
* 8.1.0.RC2
  - 6u27 64-bit : HANG (in BlockingChannelConnector / 403 tests)
  - 6u30 32-bit : HANG (in BlockingChannelConnector / 403 tests)
  - 6u30 64-bit : HANG (in BlockingChannelConnector / 403 tests)
  - 7u2 32-bit : HANG (in BlockingChannelConnector / 403 tests)
  - 7u2 64-bit : Assertion Failure (in BlockingChannelConnector / 403 tests)
* 7.6.0.RC3 PASS on all JDKs
Just tested the entire JDK set on Win7 again, but against the Jetty 8.1.0.RC4 release (being staged/pushed to central today).

Results:
* 8.1.0.RC4 PASS on all JDKs
Gili, good investigation, thanks. However it is increasingly making me think this is a JVM+OS bug. This is happening in our call to select, whose contract is that it will poll a set of channels and block for at most a specific time. There is nothing in that contract that says it will block on a specific socket and read data from it (urgent or otherwise).

I don't think it is a race condition per se... because in your setup there is only a single selector thread (and even if there are more, they work on different sets of sockets). It could be something about a socket changing state while selecting... but this is the asynchronous API and it should be protected against that!!! Also, we do queue up all our selector operations and perform them in our selector thread - so it is not jetty messing with it.

The two lines that you see in the dump:

  sun.nio.ch.WindowsSelectorImpl@62f842c9 keys=1
  | +- SCEP@1799f67b{l(/127.0.0.1:49020)<->r(/127.0.0.1:49017),d=false,open=true,ishut=false,oshut=false,rb=false,wb=false,w=true,i=1r}-{AsyncHttpConnection@7d051157,g=HttpGenerator{s=0,h=-1,b=-1,c=-1},p=HttpParser{s=-14,l=0,c=-3},r=1554} iOps=1 rOps=1

The first is the selector, saying it has 1 key. The other is the select channel end point, saying it is not dispatched, it is open and not half closed, it is not read- or write-blocked, and it is writable and interested in read operations. It has a generator and parser in idle state. So it is all normal.

I'm thinking it would be good if you could open a bug on openJDK and see what they have to say about this. I used to have some contacts with the sun NIO developers, but it has all changed now... so openJDK is probably the best way.
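[Editorial note: the select contract described above - poll a set of registered channels and block for at most a given timeout, never on an individual socket - looks like this in plain NIO. A minimal sketch unrelated to Jetty's own selector loop:]

```java
import java.net.InetSocketAddress;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;

public class SelectorContract {

    // Register one listening channel and run a single select pass.
    public static int selectOnce(long timeoutMs) throws Exception {
        Selector selector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        try {
            server.socket().bind(new InetSocketAddress(0));
            server.configureBlocking(false);
            server.register(selector, SelectionKey.OP_ACCEPT);
            // Contract: return within ~timeoutMs even if nothing is ready.
            // The Windows hang reported in this bug shows the selector thread
            // stuck inside this call (in discardUrgentData) instead of returning.
            return selector.select(timeoutMs);
        } finally {
            server.close();
            selector.close();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("ready keys: " + selectOnce(100));
    }
}
```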
Greg, I tend to agree. I need your help in filing a bug report. I see we've got a testcase which leads to a hang, but it's not clear what code Jetty is invoking under the hood. Any chance you or Joakim could replace the calls to Jetty with calls to the core Java API in Mark's testcase? We need the testcase as stripped-down as possible.
I received a private email pointing out: https://bugzilla.redhat.com/show_bug.cgi?id=CVE-2011-0872 which shows when/why the "fix" went in. Maybe start with putting a comment on there. Also, we can't make a non-jetty test case - because we can't reproduce at all.... although I'm starting to think we need to look at the traffic being offered to the server to reproduce. It may need some strange data to trigger. It would be interesting to capture the TCP data that triggers and then try to rerun to see if it triggers it again.
Greg, Didn't Dan already provide a TCP/IP dump?
The TCP dump we have is in text, so it is hard to analyse. I will have another closer look. Ideally we need a full binary tcp dump.
There are no packets in that text tcpdump with the URG flag set (although it is not clear if that is the trigger, it is a good place to start - it may be that the "fix" to the JRE is being falsely triggered). Also it is not clear if this dump was from when the problem occurred or from after the problem. I'm guessing the latter, as there are no responses at all. I'd like to see a tcpdump of a normally working server that suddenly becomes hung.

Note that if somebody does capture that dump - please do not post it here, as it may be the basis of a DoS attack on all async java servers on that JRE! Instead email it privately or post it to a new issue that is committer-only.
Hi

The dump is supposed to be from when the problem occurred. Isn't frame 47444 a response to frame 47431? If I'm not misinterpreting the tcpdump, the last response from the server is frame 47450. The first unanswered request (no HTTP OK received for it) is frame 47451.
Just another data point: still experiencing the hangs with Jetty 8.0.1.v20110908 under the Sun JDK 1.7.0_02 under Windows. Will see what I can do to get some TCP dumps over the next few days...
Dan, you are right. I misread the tcpdump. I will analyse some more.
I filed a bug report and alerted the kind folks at Oracle: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7130796
(In reply to comment #47) > I filed a bug report and alerted the kind folks at Oracle: > http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7130796 Wow - how did you get an actual bug # so fast (vs. an internal review ID)? Regardless, all I can see is "This bug is not available." Was it considered security sensitive? If not, could someone do me a favor and update this bug when the Oracle bug becomes publicly available so that I can subscribe to it there as well? Thanks!
> Wow - how did you get an actual bug # so fast (vs. an internal review ID)? That process went away for me around the time that Oracle bought Sun. I was under the impression that it went away for everyone... Anyway, I don't believe the bug has been flagged as security sensitive (yet?). It should be visible within 2-3 days.
(In reply to comment #49)
> > Wow - how did you get an actual bug # so fast (vs. an internal review ID)?
>
> That process went away for me around the time that Oracle bought Sun. I was
> under the impression that it went away for everyone...

Hmmm - that might explain why another bug I filed on 2010-01-30 still hasn't been converted to a real bug ticket yet. :-)

(In reply to comment #29)
> the other thing that would be interesting to try, is if using JMX you are able
> to call stop/start on the connector, and if this fixes the issue.
>
> If it does, then we could potentially write a detector for this issue that at
> least resets the connector.

Greg - I'd like to help out with this, but I need some guidance as to how to expose this functionality through JMX. I'm running this embedded, so I don't have any XML files to configure. (Is there just a one-line "register" call or such that I can include in my Java test code?) I have no problems accessing the running JVM through JMX, but I simply don't see any Jetty MBeans to access.
Mark, here is the magic source you need to activate dumps and JMX from embedded jetty:

Server server = new Server();
server.setDumpAfterStart(true);
server.setDumpBeforeStop(true);

// Setup JMX
MBeanContainer mbContainer = new MBeanContainer(ManagementFactory.getPlatformMBeanServer());
mbContainer.start();
server.getContainer().addEventListener(mbContainer);
server.addBean(mbContainer, true);
(In reply to comment #29)
> the other thing that would be interesting to try, is if using JMX you are able
> to call stop/start on the connector, and if this fixes the issue.
>
> If it does, then we could potentially write a detector for this issue that at
> least resets the connector.

Done. Shown below are iterations #320 (completed successfully) and #321 (hung). You can also see during #321 where I used JMX to stop then start the connector, which had absolutely no effect - other than causing the additional log entries. I also tried a dump once I realized nothing else happened, expecting to see additional details logged - but nothing else ever became visible.

2012-01-22 16:10:02,772 [main] INFO com.ziesemer.test.jetty.JettyHangTest - === BEGIN testSelectChannelConnector403Disconnect: 320 ===
2012-01-22 16:10:02,772 [qtp438957460-293 Selector0] DEBUG org.eclipse.jetty.io.nio - destroyEndPoint SCEP@746599731java.nio.channels.SocketChannel[closed][o=false d=true,io=1,w=true,rb=false,wb=false]
2012-01-22 16:10:02,772 [qtp438957460-293 Selector0] DEBUG org.eclipse.jetty.server.HttpConnection - closed org.eclipse.jetty.server.nio.SelectChannelConnector$SelectChannelHttpConnection@1524be18@127.0.0.1:51667<->127.0.0.1:51989
2012-01-22 16:10:02,772 [qtp438957460-294 - /test/403] DEBUG org.eclipse.jetty.server.Server - REQUEST /test/403 on org.eclipse.jetty.server.nio.SelectChannelConnector$SelectChannelHttpConnection@eff27ff@127.0.0.1:51667<->127.0.0.1:51990
2012-01-22 16:10:02,772 [qtp438957460-294 - /test/403] DEBUG org.eclipse.jetty.servlet.ServletHandler - servlet null|/test/403|null -> com.ziesemer.test.jetty.Test403Servlet-1225717157
2012-01-22 16:10:02,772 [qtp438957460-294 - /test/403] DEBUG org.eclipse.jetty.servlet.ServletHandler - chain=null
2012-01-22 16:10:02,773 [JettyHangTest] INFO com.ziesemer.test.jetty.JettyHangTest - responseCode: 403
2012-01-22 16:10:02,773 [qtp438957460-294 - /test/403] DEBUG org.eclipse.jetty.server.Server - RESPONSE /test/403 403
2012-01-22 16:10:02,773 [JettyHangTest] INFO com.ziesemer.test.jetty.JettyHangTest - Disconnecting sun.net.www.protocol.http.HttpURLConnection:http://localhost:51667/test/403 ...
2012-01-22 16:10:02,773 [JettyHangTest] INFO com.ziesemer.test.jetty.JettyHangTest - Disconnected.
2012-01-22 16:10:02,773 [qtp438957460-294] DEBUG org.eclipse.jetty.server.AsyncHttpConnection - complete half close org.eclipse.jetty.server.nio.SelectChannelConnector$SelectChannelHttpConnection@eff27ff@127.0.0.1:51667<->127.0.0.1:51990
2012-01-22 16:10:02,773 [main] INFO com.ziesemer.test.jetty.JettyHangTest - === END testSelectChannelConnector403Disconnect: 320 ===
2012-01-22 16:10:02,773 [main] INFO com.ziesemer.test.jetty.JettyHangTest - === BEGIN testSelectChannelConnector403Disconnect: 321 ===
2012-01-22 16:12:32,826 [RMI TCP Connection(4)-192.168.82.101] DEBUG org.eclipse.jetty.jmx.ObjectMBean - invoke stop
2012-01-22 16:12:32,826 [RMI TCP Connection(4)-192.168.82.101] DEBUG org.eclipse.jetty.util.component.AbstractLifeCycle - stopping SelectChannelConnector@127.0.0.1:51667 STARTED
2012-01-22 16:12:32,826 [RMI TCP Connection(4)-192.168.82.101] DEBUG org.eclipse.jetty.util.component.AbstractLifeCycle - stopping org.eclipse.jetty.server.nio.SelectChannelConnector$ConnectorSelectorManager@36c5d117#STARTED
2012-01-22 16:12:35,428 [RMI TCP Connection(3)-192.168.82.101] DEBUG org.eclipse.jetty.jmx.ObjectMBean - invoke start
2012-01-22 16:12:42,541 [RMI TCP Connection(2)-192.168.82.101] DEBUG org.eclipse.jetty.jmx.ObjectMBean - invoke dump
Created attachment 209888 [details] Complete log file from test case, including JMX restart attempt on the connector.
Created attachment 209889 [details] Thread dump for the same JMX restart attempt as the above 2 comments.
Created attachment 209890 [details] Updated primary test class from test project.

Attached is an updated JettyHangTest.java (from the original test project) that I used for the above 3 comments. (This may be useful for correlating the source line #'s from the thread dump, etc.) Per comment #51, it adds the server dumps and the Jetty JMX MBeans. It is also reconfigured to not time out and fail the test, but to instead hang indefinitely - allowing time to connect by JMX, etc.
(In reply to comment #43)
> I'd like to see a tcpdump of a normally working server that suddenly becomes
> hung. Note that if somebody does capture that dump - please do not post it
> here, as it may be the basis of a DoS attack on all async java servers on that
> JRE! Instead email it privately or post it to a new issue that is committer
> only.

Greg - is this still something that would be helpful if I could provide? If so, I'll need some additional tips for how to set this up. I ran Wireshark as I repeated my above tests, but it's not seeing any of this traffic - possibly because the traffic isn't crossing interfaces? (I'm sure I could rig something up to have multiple interfaces available, and it would be trivial to get Jetty to bind to one - but I'm not immediately sure how I would get the HttpURLConnection for the other side of the test to bind to the other interface...)
Sorry - hopefully my last update for today: All of today's previous tests were run with 8.0.4.v20111024. I just now tried 8.1.0.RC4. I seem to be able to confirm Joakim's findings in comment #37 - all my tests now pass - even with 10,000 iterations. So can this be considered resolved? If so, does this mean that it was a Jetty bug, and not a JVM bug?
Mark, I am skeptical. The stack-trace I saw looked very much like a JDK bug. My guess is that code changes in 8.1.0.RC4 are masking the problem but it isn't actually fixed. I believe this bug is a race condition. When I slowed things down (using a debugger) it magically went away, but obviously the underlying problem remained.
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7130796 has been updated. They've managed to reproduce the problem and explain what's causing it.
Gill, Good work getting the bug registered and acknowledged by oracle! I'll vote for it and comment on it, as it is a rather serious remote DoS. We'll leave this issue open here until they have released a fix.
(In reply to comment #60)
> I'll vote for it and comment on it, as it is a rather serious remote DoS.
> We'll leave this issue open here until they have released a fix.

Greg, if you can prove that this bug is exploitable then I'm fairly sure Oracle would jump on it. I can't think of a way a remote attacker could intentionally trigger this bug. If you do, please email me privately and I'll forward the information to my contact at Oracle.
Just to let you guys know, we have the same issue on Windows 7 and Windows Server 2008 R2 boxes with Jetty 7.5.4.v20111024 on Java 6 update 30 and Java 6 update 26.

We can easily reproduce this issue in an Eclipse embedded Jetty (7.5.4.v20111024) setup when using the SelectChannelConnector and letting ApacheBench do its work. When ApacheBench is running and the SelectChannelConnector is used, the other client (let's say our own Firefox web browser) is not getting any response. When switching to the BlockingChannelConnector, Firefox does get responses - a bit slowed down, but that's normal with ApacheBench running.

The ApacheBench command used:

ab -n 10000 -c 10 http://localhost:8080/app

For now we are switching to the BlockingChannelConnector.
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7130796 disappeared briefly (probably while they investigated whether this had DoS potential) and has now returned. According to the latest evaluation: "Note that this is not a remote DoS, but rather something that this server is running into because it closes the SocketChannels while they are in use in a Selector. Fixing it is straight-forward but it means bringing back 4960962. It's not possible to solve all issues without support from the operating system (Windows lacks it unfortunately)." There is also http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7132889 but by the looks of it the evaluator doesn't believe that it will solve the problem.
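For anyone unfamiliar with the pattern the Oracle evaluation describes, here is a minimal, self-contained NIO sketch of closing a channel while it is still registered with a Selector. This is an illustration of the general pattern only - it is not Jetty's actual selector code, and on a healthy JVM it completes cleanly:

```java
import java.net.InetSocketAddress;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;

public class CloseWhileRegistered {
    public static void main(String[] args) throws Exception {
        Selector selector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.socket().bind(new InetSocketAddress("127.0.0.1", 0));
        server.configureBlocking(false);
        SelectionKey key = server.register(selector, SelectionKey.OP_ACCEPT);

        // The problematic pattern: close the channel while its key is still
        // registered with the selector, without cancelling the key first.
        server.close();

        // selectNow() flushes the cancelled-key set; this is where an
        // affected JVM could get into trouble.
        selector.selectNow();
        System.out.println("key valid after close: " + key.isValid());
        selector.close();
    }
}
```

Closing the channel implicitly cancels its keys, so on a fixed JVM the key is simply invalid afterwards; the bug was in how the Windows selector implementation handled the in-use socket during that implicit cancellation.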
The JVM bug/issue filed at http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7130796 has been closed. "This is fixed by the changes via 6346658 for jdk8. Once the changes have baked in jdk8 for a while then we should considering back-porting to a jdk7 update. Posted Date : 2012-02-20 19:24:15.0" I'll need to check the release notes of 6u31 and 7u3 to see if has been backported...
Nope, this has not yet been backported to JDK 6 or JDK 7. JDK 6u31 notes: http://www.oracle.com/technetwork/java/javase/6u31-relnotes-1482342.html JDK 7u3 notes: http://www.oracle.com/technetwork/java/javase/7u3-relnotes-1481928.html
I added a speculative workaround to master (which will be in 7.6.2) that simply cancels the selector key before closing. This has been tested by 1 client, who has reported initial success. So it would be worthwhile to get some wider testing... either of a recent snapshot, or of 7.6.2, which comes out in a few days.
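At the plain-NIO level, cancel-before-close amounts to the following ordering. This is a sketch of the idea only, not the actual Jetty patch (which lives in Jetty's selector internals):

```java
import java.net.InetSocketAddress;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;

public class CancelBeforeClose {
    public static void main(String[] args) throws Exception {
        Selector selector = Selector.open();
        ServerSocketChannel channel = ServerSocketChannel.open();
        channel.socket().bind(new InetSocketAddress("127.0.0.1", 0));
        channel.configureBlocking(false);
        SelectionKey key = channel.register(selector, SelectionKey.OP_ACCEPT);

        // Workaround ordering: cancel the key first, give the selector a
        // chance to process the cancellation, and only then close the channel.
        key.cancel();
        selector.selectNow(); // flushes the cancelled-key set
        channel.close();

        System.out.println("open after workaround close: " + channel.isOpen());
        selector.close();
    }
}
```

The intent is that the channel is no longer known to the selector by the time close() touches the underlying socket, sidestepping the close-while-in-use race on Windows.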
Unfortunately the workaround appears to be causing other problems. We are getting selectors blocked by closing channels in a way that is identical to the problem reported here: http://java.net/jira/browse/GRIZZLY-547

Their fix was to stop doing a cancel before a close. So that kind of snookers us... if we cancel before a close, we block the selector on unix. If we just close without a cancel, we block the selector on windows!

I will make the cancel conditional on operating system - YUCK!!!!
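A hypothetical helper illustrating that OS-conditional choice (the names here are mine, not Jetty's; the real change sits inside Jetty's selector code):

```java
public class SelectorCloseStrategy {

    // Decide whether to cancel the selection key before closing the channel.
    // On Windows, close-without-cancel wedges the selector (this bug);
    // on unix, cancel-before-close wedges it instead (GRIZZLY-547).
    static boolean cancelBeforeClose(String osName) {
        return osName.toLowerCase().startsWith("windows");
    }

    public static void main(String[] args) {
        // In real code the argument would come from System.getProperty("os.name").
        System.out.println(cancelBeforeClose("Windows 7"));
        System.out.println(cancelBeforeClose("Linux"));
    }
}
```

An os.name prefix check is crude, but the point is only that a single close ordering cannot satisfy both platforms' selector implementations.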
What is the status of this bug? We are experiencing severe problems for months with our embedded jetty server which we never had before. We think these problems might be caused by this bug and are looking forward for a fix...
(In reply to comment #68)
> What is the status of this bug?
> We are experiencing severe problems for months with our embedded jetty server
> which we never had before.
> We think these problems might be caused by this bug and are looking forward for
> a fix...

Henzler, this bug is supposed to be fixed in jdk8-b29 (it doesn't look like Oracle plans on back-porting it). Can you try the latest jdk8 build and see if it fixes your problem? http://jdk8.java.net/download.html
We are currently using JRE 1.6.0_30-b12. We went back to JRE 1.6 because we thought the problems came from Java 7. Do you know if the problem already existed in JRE 1.6?
This issue was introduced as of Java6 and prior to that there was another problem. See http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4960962 for more details.
(In reply to comment #67) Greg, I'm interested in testing your workaround against 7.6.3 (on Linux). Do you have a patch?
Note that we've had a report that running with beta Java 8 does fix this issue. The patch appears to have no effect.
Hi Greg. I am also interested in this patch. Is it available? If not, can you point me to the change?

(In reply to comment #67)
> Unfortunately the work around appears to be causing other problems. We are
> getting selectors blocked by closing channels in a way that is identical to the
> problem reported here http://java.net/jira/browse/GRIZZLY-547
>
> Their fix was to stop doing a cancel before a close. So that kind of snookers
> us... if we cancel before a close, we block the selector on unix. If we just
> close without a cancel, we block the selector on windows!
>
> I will make the cancel conditional on operating system - YUCK!!!!
Greg, seems to be right. We changed our application server to run on Java 8 Beta because of this issue. Until now it seems that it helped. Before using Java 8 Jetty could not run more than 3 days without hanging.
Jetty 7.6.4 has the patch included, which makes cancelling the selector key before close conditional on the operating system. It is not a 100% solution on either platform - currently, if you are suffering from this issue, only Java 8 works. Snoracle have closed the issue on Java 7, so you can't comment and ask for a backport. I've submitted a new bug report asking for a backport, but no response yet.
Still no joy getting Oracle to backport this to JDK 7. You can see the fix applied to jdk8 here: http://hg.openjdk.java.net/jdk8/tl/jdk/rev/7326971f09af

So perhaps OpenJDK 7 could be built with that patch applied. Does anybody have any experience with building OpenJDK from source for Windows? Does it result in a usable/stable binary?
trying again to get a backport to 7 http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7187260
*** Bug 387446 has been marked as a duplicate of this bug. ***
All -

(In reply to comment #78)
> trying again to get a backport to 7
>
> http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7187260

Per the linked report, this is apparently fixed in JDK 1.7.0_06 - which was released 2 days ago on 2012-08-14.
It would be great to get some feedback on this one to see if java7u6 fixes this or not?
(In reply to comment #81)
> It would be great to get some feedback on this one to see if java7u6 fixes
> this or not?

In testing with my test case which is attached here, this issue does appear to be resolved under JDK 1.7.0_06 - even after running 10,000 iterations. Looking forward to results from anyone else that has also been experiencing this to confirm - otherwise I'll mark this as resolved after the weekend.
I know someone else who thinks this was resolved as well and is testing it - looks promising.
This is great news. We are planning to run a test over the weekend. Will post back results Monday.
We ran our test over this past weekend and it appears to have fixed our issue too. Thank you Greg for staying on this.
Resolving as per comments 82, 83, 84, and 85.
JDK 1.7.0_06 probably helped, but has not completely resolved the issue for our environment. The problem still occurs using:

karaf@root> info
Karaf
  Karaf version               2.3.0
  Karaf home                  /home/rully/git/bippo-commerce5/karaf
  Karaf base                  /home/rully/git/bippo-commerce5/karaf
  OSGi Framework              org.apache.felix.framework - 4.0.3
JVM
  Java Virtual Machine        OpenJDK 64-Bit Server VM version 23.2-b09
  Version                     1.7.0_09
  Vendor                      Oracle Corporation
  Uptime                      45.943 seconds
  Total compile time          20.963 seconds
Threads
  Live threads                137
  Daemon threads              123
  Peak                        139
  Total started               186
Memory
  Current heap size           172,238 kbytes
  Maximum heap size           466,048 kbytes
  Committed heap size         250,368 kbytes
  Pending objects             0
  Garbage collector           Name = 'PS Scavenge', Collections = 32, Time = 0.220 seconds
  Garbage collector           Name = 'PS MarkSweep', Collections = 0, Time = 0.000 seconds
Classes
  Current classes loaded      11,207
  Total classes loaded        11,207
  Total classes unloaded      0
Operating system
  Name                        Linux version 3.2.0-37-generic
  Architecture                amd64
  Processors                  4

Versions:

karaf@root> list -s -t 0 | grep '(pax.web|jetty)'
[ 58] [Active ] [ ] [ ] [ 30] org.eclipse.jetty.util (7.6.7.v20120910)
[ 59] [Active ] [ ] [ ] [ 30] org.eclipse.jetty.io (7.6.7.v20120910)
[ 60] [Active ] [ ] [ ] [ 30] org.eclipse.jetty.http (7.6.7.v20120910)
[ 61] [Active ] [ ] [ ] [ 30] org.eclipse.jetty.client (7.6.7.v20120910)
[ 62] [Active ] [ ] [ ] [ 30] org.eclipse.jetty.continuation (7.6.7.v20120910)
[ 63] [Active ] [ ] [ ] [ 30] org.eclipse.jetty.jmx (7.6.7.v20120910)
[ 64] [Active ] [ ] [ ] [ 30] org.eclipse.jetty.server (7.6.7.v20120910)
[ 65] [Active ] [ ] [ ] [ 30] org.eclipse.jetty.security (7.6.7.v20120910)
[ 66] [Active ] [ ] [ ] [ 30] org.eclipse.jetty.servlet (7.6.7.v20120910)
[ 67] [Active ] [ ] [ ] [ 30] org.eclipse.jetty.servlets (7.6.7.v20120910)
[ 68] [Active ] [ ] [ ] [ 30] org.eclipse.jetty.xml (7.6.7.v20120910)
[ 69] [Active ] [ ] [ ] [ 30] org.eclipse.jetty.webapp (7.6.7.v20120910)
[ 70] [Active ] [ ] [ ] [ 30] org.eclipse.jetty.jndi (7.6.7.v20120910)
[ 71] [Active ] [ ] [ ] [ 30] org.eclipse.jetty.plus (7.6.7.v20120910)
[ 72] [Active ] [ ] [ ] [ 30] org.eclipse.jetty.websocket (7.6.7.v20120910)
[ 73] [Active ] [ ] [ ] [ 30] org.ops4j.pax.web.pax-web-api (1.1.3)
[ 74] [Active ] [ ] [ ] [ 30] org.ops4j.pax.web.pax-web-spi (1.1.3)
[ 75] [Active ] [ ] [ ] [ 30] org.ops4j.pax.web.pax-web-runtime (1.1.3)
[ 76] [Active ] [ ] [ ] [ 30] org.ops4j.pax.web.pax-web-jetty (1.1.3)
[ 77] [Active ] [ ] [ ] [ 30] org.ops4j.pax.web.pax-web-extender-whiteboard (1.1.3)
[ 80] [Active ] [ ] [ ] [ 30] org.ops4j.pax.web.pax-web-jsp (1.1.3)
[ 81] [Active ] [ ] [ ] [ 30] org.ops4j.pax.web.pax-web-extender-war (1.1.3)
[ 82] [Active ] [ ] [ ] [ 30] org.ops4j.pax.web.pax-web-deployer (1.1.3)

Granted, our environment is rather complex (Karaf 2.3.0 + Pax Web 1.1.3 which includes Jetty + Pax Wicket + our own bundles). Using the BlockingChannelConnector (the default for Karaf 2.3.0) it's fine. Switching to the SelectorChannelConnector, the app won't even boot.

Example non-working karaf etc/jetty.xml config (changing Selector to Blocking makes it work again):

<Call name="addConnector">
  <Arg>
    <New class="org.eclipse.jetty.server.nio.SelectorChannelConnector">
      <Set name="host"><Property name="jetty.host" /></Set>
      <Set name="port"><Property name="jetty.port" default="8181"/></Set>
      <Set name="maxIdleTime">10000</Set>
      <Set name="Acceptors">4</Set>
      <Set name="statsOn">false</Set>
      <Set name="confidentialPort">8443</Set>
    </New>
  </Arg>
</Call>

It hangs right after this point:

2013-02-14 10:52:25,943 | INFO | rint Extender: 1 | HttpServiceFactoryImpl | .internal.HttpServiceFactoryImpl 33 | 75 - org.ops4j.pax.web.pax-web-runtime - 1.1.3 | Binding bundle: [org.apache.karaf.deployer.features [35]] to http service

This is probably not a Jetty bug, but I'll post it here anyway in case someone experiences a similar problem. I'll report this to the Pax Web project also.
Spoke too soon. False alarm. It was due to my typo:

<New class="org.eclipse.jetty.server.nio.SelectorChannelConnector">

should have been:

<New class="org.eclipse.jetty.server.nio.SelectChannelConnector">

and now it works :-) My bad.