| Summary: | Jetty doesn't close sockets that are in CLOSE_WAIT | | |
|---|---|---|---|
| Product: | [RT] Jetty | Reporter: | Matt Painter <matthew.painter> |
| Component: | server | Assignee: | Greg Wilkins <gregw> |
| Status: | RESOLVED WONTFIX | QA Contact: | |
| Severity: | critical | | |
| Priority: | P3 | CC: | eclipse, gregw, janb, jetty-inbox |
| Version: | 9.3.5 | | |
| Target Milestone: | 9.3.x | | |
| Hardware: | PC | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Attachments: | | | |
Verified to happen on 9.3.5 also. We have seen a lot of this in the logs:
2015-11-13 18:32:41.247 [Scheduler-81153283] DEBUG o.eclipse.jetty.io.AbstractEndPoint - onIdleExpired - || - Ignored idle endpoint SelectChannelEndPoint@6ba4dafd{/10.0.0.194:61581<->8888,Open,in,out,-,-,30001/30000,HttpConnection}{io=0/0,kio=0,kro=1}
2015-11-13 18:32:41.253 [Scheduler-81153283] DEBUG o.eclipse.jetty.io.AbstractEndPoint - onIdleExpired - || - Ignored idle endpoint SelectChannelEndPoint@24e35471{/10.0.0.194:62914<->8888,Open,in,out,-,-,30001/30000,HttpConnection}{io=0/0,kio=0,kro=1}
2015-11-13 18:32:41.262 [Scheduler-81153283] DEBUG o.eclipse.jetty.io.AbstractEndPoint - onIdleExpired - || - Ignored idle endpoint SelectChannelEndPoint@61cfce31{/10.0.0.194:62973<->8888,Open,in,out,-,-,30000/30000,HttpConnection}{io=0/0,kio=0,kro=1}
2015-11-13 18:32:41.263 [Scheduler-81153283] DEBUG o.eclipse.jetty.io.AbstractEndPoint - onIdleExpired - || - Ignored idle endpoint SelectChannelEndPoint@453da72{/10.0.0.194:62818<->8888,Open,in,out,-,-,30001/30000,HttpConnection}{io=0/0,kio=0,kro=1}
2015-11-13 18:32:41.263 [Scheduler-81153283] DEBUG o.eclipse.jetty.io.AbstractEndPoint - onIdleExpired - || - Ignored idle endpoint SelectChannelEndPoint@27a4425{/10.0.0.194:61466<->8888,Open,in,out,-,-,30000/30000,HttpConnection}{io=0/0,kio=0,kro=1}
2015-11-13 18:32:41.276 [Scheduler-81153283] DEBUG o.eclipse.jetty.io.AbstractEndPoint - onIdleExpired - || - Ignored idle endpoint SelectChannelEndPoint@79576c07{/10.0.0.194:62804<->8888,Open,in,out,-,-,30000/30000,HttpConnection}{io=0/0,kio=0,kro=1}
2015-11-13 18:32:41.287 [Scheduler-81153283] DEBUG o.eclipse.jetty.io.AbstractEndPoint - onIdleExpired - || - Ignored idle endpoint SelectChannelEndPoint@353db1e3{/10.0.0.194:62150<->8888,Open,in,out,-,-,30001/30000,HttpConnection}{io=0/0,kio=0,kro=1}
2015-11-13 18:32:41.304 [Scheduler-81153283] DEBUG o.eclipse.jetty.io.AbstractEndPoint - onIdleExpired - || - Ignored idle endpoint SelectChannelEndPoint@5b5922ab{/10.0.0.194:63083<->8888,Open,in,out,-,-,30001/30000,HttpConnection}{io=0/0,kio=0,kro=1}
2015-11-13 18:32:41.307 [Scheduler-81153283] DEBUG o.eclipse.jetty.io.AbstractEndPoint - onIdleExpired - || - Ignored idle endpoint SelectChannelEndPoint@49b03a47{/10.0.0.194:62268<->8888,Open,in,out,-,-,30000/30000,HttpConnection}{io=0/0,kio=0,kro=1}
2015-11-13 18:32:41.309 [Scheduler-81153283] DEBUG o.eclipse.jetty.io.AbstractEndPoint - onIdleExpired - || - Ignored idle endpoint SelectChannelEndPoint@306e4257{/10.0.0.194:62764<->8888,Open,in,out,-,-,30001/30000,HttpConnection}{io=0/0,kio=0,kro=1}
2015-11-13 18:32:41.326 [Scheduler-81153283] DEBUG o.eclipse.jetty.io.AbstractEndPoint - onIdleExpired - || - Ignored idle endpoint SelectChannelEndPoint@6a1b1403{/10.0.0.194:63093<->8888,Open,in,out,-,-,30001/30000,HttpConnection}{io=0/0,kio=0,kro=1}
2015-11-13 18:32:41.329 [Scheduler-81153283] DEBUG o.eclipse.jetty.io.AbstractEndPoint - onIdleExpired - || - Ignored idle endpoint SelectChannelEndPoint@d50b02e{/10.0.0.194:62543<->8888,Open,in,out,-,-,30000/30000,HttpConnection}{io=0/0,kio=0,kro=1}
2015-11-13 18:32:41.340 [Scheduler-81153283] DEBUG o.eclipse.jetty.io.AbstractEndPoint - onIdleExpired - || - Ignored idle endpoint SelectChannelEndPoint@34f16be{/10.0.0.194:62663<->8888,Open,in,out,-,-,30000/30000,HttpConnection}{io=0/0,kio=0,kro=1}
2015-11-13 18:32:41.358 [Scheduler-81153283] DEBUG o.eclipse.jetty.io.AbstractEndPoint - onIdleExpired - || - Ignored idle endpoint SelectChannelEndPoint@59d76062{/10.0.0.194:61479<->8888,Open,in,out,-,-,30001/30000,HttpConnection}{io=0/0,kio=0,kro=1}
These connections seem to all map to connections in CLOSE_WAIT.
This is printed from the code:
@Override
protected void onIdleExpired(TimeoutException timeout)
{
    boolean output_shutdown=isOutputShutdown();
    boolean input_shutdown=isInputShutdown();
    boolean fillFailed = _fillInterest.onFail(timeout);
    boolean writeFailed = _writeFlusher.onFail(timeout);

    // If the endpoint is half closed and there was no fill/write handling, then close here.
    // This handles the situation where the connection has completed its close handling
    // and the endpoint is half closed, but the other party does not complete the close.
    // This perhaps should not check for half closed, however the servlet spec case allows
    // for a dispatched servlet or suspended request to extend beyond the connections idle
    // time. So if this test would always close an idle endpoint that is not handled, then
    // we would need a mode to ignore timeouts for some HTTP states
    if (isOpen() && (output_shutdown || input_shutdown) && !(fillFailed || writeFailed))
        close();
    else
        LOG.debug("Ignored idle endpoint {}",this);
}
Perhaps the logic here is incorrect? Or perhaps there's a bug? I'm trying to investigate further and would be grateful for any pointers or assistance :)
Created attachment 257934
Log showing different behaviour for re-used channel
Here we can see the different behaviour between two requests.
The second request reads 423 bytes, but because it never calls ChannelEndPoint.fill() again (which would return -1), it never calls ChannelEndPoint.shutdownInput(). Therefore the input state is never transitioned to ISHUT, which seems to have a knock-on effect: the channel is never able to be closed.
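To illustrate the point being discussed: in plain Java NIO, the only way to observe that the peer has closed its side (a socket sitting in CLOSE_WAIT) is to attempt a read and get -1 back. The following is a minimal standalone sketch, not Jetty code; the class name, port handling and buffer size are illustrative only.

import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

// Sketch: a read returning -1 is the only signal that the peer sent a FIN.
// If a channel is never selected for OP_READ (io=0 in the dumps above),
// that -1 is never observed and the socket stays in CLOSE_WAIT.
public class HalfCloseDetection
{
    public static void main(String[] args) throws IOException
    {
        Selector selector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress(8888)); // port taken from the bug report
        server.configureBlocking(false);
        server.register(selector, SelectionKey.OP_ACCEPT);

        while (true)
        {
            selector.select();
            for (SelectionKey key : selector.selectedKeys())
            {
                if (key.isAcceptable())
                {
                    SocketChannel client = server.accept();
                    client.configureBlocking(false);
                    // Without this OP_READ interest the peer's close is invisible to us.
                    client.register(selector, SelectionKey.OP_READ);
                }
                else if (key.isReadable())
                {
                    SocketChannel client = (SocketChannel)key.channel();
                    int read = client.read(ByteBuffer.allocate(4096)); // content discarded for brevity
                    if (read == -1)
                    {
                        // Peer sent FIN: this is where shutdownInput()/close() would
                        // move the endpoint out of CLOSE_WAIT.
                        client.close();
                    }
                }
            }
            selector.selectedKeys().clear();
        }
    }
}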
This may be due to the AWS ELB issuing a 504 gateway timeout; the number of CLOSE_WAITs seems to correlate directly with the number of 504s the load balancer issues.

Matt, the thing of concern here is that the endpoints have io=0, so they are not interested in any operations - hence they are not seeing a close. This is typically the state when a request has been dispatched, but no IO operation has been attempted. It should not be a state for an idle connection. The shutdownInput issue is unlikely to be the cause. With HTTP the server will never instigate an input-side shutdown. It may shut down output and wait to read a -1, or it may read a -1 and then do a close. But perhaps we have an error in that logic somewhere? It is also probably an issue that such an endpoint is surviving a timeout and not being closed. So we need to find a way to reproduce this issue. Can you capture a TCP/IP trace of the interaction from proxy to server when a 504 occurs?

Matt, can you give us any more information on this one??? It would be good to see the full log, or at least everything done by thread qtp1181693689-125. It is very strange to see the 9s between the last fill on the connection and the interest ops being set to 0. So we really need the full debug trace, and a TCP trace as well.

Hi, sorry for the delay - I have been investigating and patching our production. I need to set up an AWS environment just to try to reproduce it by triggering an ELB timeout. It only happens under load in production, so trying to get a TCP trace for this wouldn't make sense due to excessive noise. I have worked around it for now by ensuring that the continuation is timed out and closed from our side before the ELB times it out. It seems to me like Jetty needs to periodically check whether waiting continuations have been closed from the other side, even if they haven't reached a timeout yet. If I'm correct, though, there's no way in Java to detect a socket close without trying to read or write?

Matt, ah that makes sense - you are using continuations and the requests are suspended (note that you should switch to the AsyncContext API in the servlet spec, as continuations are deprecated). So there is no read interest set, hence the close is not seen and the timeouts are ignored. This is an unsolvable problem with Java, as you can't register for just error/close events. If we read periodically, then we have a DoS problem: if content is sent, we have to put it somewhere. You can periodically send some no-op content while waiting - which depends on your content type. Some clients can accept 102 Processing responses as interim responses. Or you can just use something like CometD and we'll deal with all of the details for you. I'm closing this for now, as I think this is just a known problem.

We've been facing a very similar issue. For reasons too long to explain here we're still running 9.2.2 in our production environment. An Amazon Application Load Balancer (ALB), more specifically one of its target groups, is doing health checks on our Jetty instances. When under heavy load, we recently "lost" two instances to the CLOSE_WAIT issue, with ~30k sockets in that state, running into the limits for file descriptors, stalling the instance.
Maybe worth noting: the lsof output lists those sockets as all IPv6:

java 22664 xxxx 570u IPv6 57747443 0t0 TCP localhost:ddi-tcp-1->localhost:60836 (CLOSE_WAIT)

When we put an Apache reverse proxy between the ALB and the instance and have the ALB run its health check through the reverse proxy, the problem doesn't seem to occur. The Apache server will use IPv4 to connect to the Jetty instance. It seems weird that Apache would be able to handle this properly while Jetty chokes an entire server because of it. I would really hope there are ways to get this fixed, and would be happy if this bug could be re-opened. In our case it poses a severe risk for our production environment in case anyone tries to set up health checks straight to the Jetty port.

Axel, we can't reopen this issue, as the Jetty project moved to using GitHub issues some time ago. The link is here: https://github.com/eclipse/jetty.project/issues. I've opened a fresh issue there and added in your comment: https://github.com/eclipse/jetty.project/issues/2169

Great, thanks! Added another comment there.
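The workaround described above, completing the suspended request from the application side before the ELB idle timeout fires, can be sketched with the AsyncContext API that replaces the deprecated continuations. This is a minimal, hypothetical example: the servlet path, the 25-second timeout and the 204 response are illustrative choices, not taken from the report.

import java.io.IOException;
import javax.servlet.AsyncContext;
import javax.servlet.AsyncEvent;
import javax.servlet.AsyncListener;
import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Hypothetical sketch: complete the suspended request ourselves before the
// load balancer's idle timeout, so the connection is never left suspended
// long enough for the ELB to give up with a 504.
@WebServlet(urlPatterns = "/poll", asyncSupported = true)
public class BoundedSuspendServlet extends HttpServlet
{
    // Illustrative value: must be shorter than the ELB/ALB idle timeout.
    private static final long APP_TIMEOUT_MS = 25_000;

    @Override
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
    {
        AsyncContext async = request.startAsync();
        async.setTimeout(APP_TIMEOUT_MS);
        async.addListener(new AsyncListener()
        {
            @Override
            public void onTimeout(AsyncEvent event) throws IOException
            {
                // Finish the exchange from our side instead of waiting for the ELB timeout.
                HttpServletResponse res = (HttpServletResponse)event.getAsyncContext().getResponse();
                res.setStatus(HttpServletResponse.SC_NO_CONTENT);
                event.getAsyncContext().complete();
            }

            @Override public void onComplete(AsyncEvent event) {}
            @Override public void onError(AsyncEvent event) {}
            @Override public void onStartAsync(AsyncEvent event) {}
        });
        // ... hand the AsyncContext to whatever produces the real response ...
    }
}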
In production we are seeing tens of thousands of connections end up in CLOSE_WAIT, but Jetty thinks they are open. For example:

$ netstat -a | grep 10.0.0.199:34177
tcp 1 0 10.0.0.117:8888 10.0.0.199:34177 CLOSE_WAIT

However, Jetty thinks it is open, and what's more the idle time appears to be resetting, as per these sequential measurements:

+- SelectChannelEndPoint@287eec1{/10.0.0.199:34177<->8888,Open,in,out,-,-,1041/30000,HttpConnection}{io=0,kio=0,kro=1} iOps=0 rOps=1
+- SelectChannelEndPoint@287eec1{/10.0.0.199:34177<->8888,Open,in,out,-,-,10263/30000,HttpConnection}{io=0,kio=0,kro=1} iOps=0 rOps=1
+- SelectChannelEndPoint@287eec1{/10.0.0.199:34177<->8888,Open,in,out,-,-,25662/30000,HttpConnection}{io=0,kio=0,kro=1} iOps=0 rOps=1
(resets the idle time)
+- SelectChannelEndPoint@287eec1{/10.0.0.199:34177<->8888,Open,in,out,-,-,9167/30000,HttpConnection}{io=0,kio=0,kro=1} iOps=0 rOps=1

How can Jetty be resetting the idle time but not closing the connection? It's an ELB on AWS that's connecting in; we've escalated it with them, but it looks like a Jetty issue. We've tried setting -Dorg.eclipse.jetty.io.SelectorManager.submitKeyUpdates=true to no avail :/ Any pointers? This is causing us real issues right now, and we're going to upgrade to 9.3 and see if it's fixed, but in the meantime we're just restarting the Java process when it consumes >40k file handles, which isn't ideal.

Sample dump:

ServerConnector@63de4056{HTTP/1.1}{0.0.0.0:8888} - STARTED
 +~ com.importio.jetty.server.JettyServer@6adfd886 - STARTED
 +~ org.eclipse.jetty.util.thread.ExecutorThreadPool@69e9e025 - STARTED
 += org.eclipse.jetty.util.thread.ScheduledExecutorScheduler@5d201879 - STARTED
 |   +- sun.misc.Unsafe.park(Native Method)
 |   +- java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
 |   +- java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
 |   +- java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1093)
 |   +- java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
 |   +- java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1067)
 |   +- java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)
 |   +- java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 |   +- java.lang.Thread.run(Thread.java:745)
 +- org.eclipse.jetty.io.ArrayByteBufferPool@7cae4f33
 += HttpConnectionFactory@3761ac99{HTTP/1.1} - STARTED
 |   +- HttpConfiguration@34397fae{32768/8192,8192/8192,https://:0,[ForwardedRequestCustomizer@3ff251de]}
 |   +~ org.eclipse.jetty.jmx.MBeanContainer@285b6758
 += org.eclipse.jetty.server.ServerConnector$ServerConnectorManager@36774117 - STARTED
 |   +- org.eclipse.jetty.io.SelectorManager$ManagedSelector@31cc3018 keys=37552 selected=0 id=0
 |   +- org.eclipse.jetty.io.SelectorManager$ManagedSelector.select(SelectorManager.java:596)
 |   +- sun.nio.ch.EPollSelectorImpl@2b9900b1 keys=37552
 |   +- SelectChannelEndPoint@287eec1{/10.0.0.199:34177<->8888,Open,in,out,-,-,22453/30000,HttpConnection}{io=0,kio=0,kro=1} iOps=0 rOps=1
 |   +- SelectChannelEndPoint@4e4b73cb{/10.0.0.46:30936<->8888,Open,in,out,-,-,20393/30000,HttpConnection}{io=0,kio=0,kro=1} iOps=0 rOps=1
 |   +- SelectChannelEndPoint@23d06a17{/10.0.0.46:40572<->8888,Open,in,out,-,-,10927/30000,HttpConnection}{io=0,kio=0,kro=1} iOps=0 rOps=1
 |   +- SelectChannelEndPoint@75ec2687{/10.0.0.199:28910<->8888,Open,in,out,-,-,1294/30000,HttpConnection}{io=0,kio=0,kro=1} iOps=0 rOps=1
 |   +- SelectChannelEndPoint@58d28a46{/10.0.0.46:2935<->8888,Open,in,out,-,-,18179/30000,HttpConnection}{io=0,kio=0,kro=1} iOps=0 rOps=1
 |   +- SelectChannelEndPoint@2141b2b{/10.0.0.46:18804<->8888,Open,in,out,-,-,28533/30000,HttpConnection}{io=0,kio=0,kro=1} iOps=0 rOps=1
 |   +- SelectChannelEndPoint@5c144a62{/10.0.0.46:5639<->8888,Open,in,out,-,-,11059/30000,HttpConnection}{io=0,kio=0,kro=1} iOps=0 rOps=1
 |   +- SelectChannelEndPoint@2a849991{/10.0.0.199:35490<->8888,Open,in,out,-,-,3435/30000,HttpConnection}{io=0,kio=0,kro=1} iOps=0 rOps=1
 |   +- SelectChannelEndPoint@1b52f3f9{/10.0.0.199:45169<->8888,Open,in,out,-,-,3602/30000,HttpConnection}{io=0,kio=0,kro=1} iOps=0 rOps=1
 |   ... 37k more ...
 |   +- SelectChannelEndPoint@2646f57{/10.0.0.46:34617<->8888,Open,in,out,-,-,26394/30000,HttpConnection}{io=0,kio=0,kro=1} iOps=0 rOps=1
 += org.eclipse.jetty.server.ConnectorStatistics@e80a320 - STARTED
 |   +- connections=CounterStatistic@23a7a226{c=37553,m=37554,t=110980}
 |   +- duration=SampleStatistic@73bb1aba{c=73427,m=2007485,t=3869852097,v100=13950161355077854}
 |   +- in=SampleStatistic@1657eed8{c=73427,m=390,t=210400,v100=365611520}
 |   +- out=SampleStatistic@2be8b8d3{c=73427,m=390,t=210400,v100=365611520}
 +~ org.eclipse.jetty.jmx.MBeanContainer@285b6758
 +- sun.nio.ch.ServerSocketChannelImpl[/0.0.0.0:8888]
 +- serverThreadPool-2-acceptor-0@29724111-ServerConnector@63de4056{HTTP/1.1}{0.0.0.0:8888}
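For reference, the 30000 in the x/30000 idle figures above is the connector idle timeout in milliseconds. The following is a minimal sketch of where that value and port 8888 are set in an embedded Jetty 9 server; the class name is made up and the rest of the application wiring from the report (handlers, JMX, statistics) is omitted.

import org.eclipse.jetty.server.HttpConnectionFactory;
import org.eclipse.jetty.server.Server;
import org.eclipse.jetty.server.ServerConnector;

// Sketch only: shows the connector settings that correspond to the
// "0.0.0.0:8888" and ".../30000" values seen in the dump above.
public class EmbeddedServerSketch
{
    public static void main(String[] args) throws Exception
    {
        Server server = new Server();

        ServerConnector connector = new ServerConnector(server, new HttpConnectionFactory());
        connector.setPort(8888);          // matches 0.0.0.0:8888 in the dump
        connector.setIdleTimeout(30_000); // the 30000 ms idle timeout shown per endpoint

        server.addConnector(connector);
        server.start();
        server.join();
    }
}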