| Summary: | FIN WAIT sockets |
|---|---|
| Product: | [RT] Jetty |
| Reporter: | Greg Wilkins <gregw> |
| Component: | server |
| Assignee: | Greg Wilkins <gregw> |
| Status: | RESOLVED FIXED |
| Severity: | normal |
| Priority: | P3 |
| CC: | jetty-inbox, jyri, mark.mielke, mike.mclean |
| Version: | 9.0.0 |
| Target Milestone: | 9.0.x |
| Hardware: | PC |
| OS: | Linux |
| See Also: | https://bugs.eclipse.org/bugs/show_bug.cgi?id=364638 |
Description
Greg Wilkins
This is on Ubuntu server 11.04 (Linux 2.6.35.4). I have explicitly checked (and tried to set lower) parameters that should influence this, such as /proc/sys/net/ipv4/tcp_fin_timeout, but this didn't help. As soon as I stop the Jetty server the connections stay around for about 15 seconds and then all disappear, which led me to believe there's still something in Jetty that's holding on to the connections. My current workaround is to send back "Connection: close" with all requests, which seems to work, but I'd really like to keep connections open for a short while in case more requests are sent by the client, and close them when they're idle for a while (for testing I've set maxIdleTime to 15000). It also doesn't happen for 100% of requests; the relevant debug log for the following connection shows that it's properly closed after 15 seconds:

265436 [qtp161514210-14 - /api/pb?action=UpdateCheckRequest] DEBUG org.eclipse.jetty.server.Server - REQUEST /api/pb on org.eclipse.jetty.server.nio.SelectChannelConnector$SelectChannelHttpConnection@3bc6d11a@204.232.---.---:80<->62.140.---.---:-----
265436 [qtp161514210-14 - /api/pb?action=UpdateCheckRequest] DEBUG org.eclipse.jetty.server.handler.ContextHandler - scope null||/api/pb @ o.e.j.s.ServletContextHandler{/api,null}
265436 [qtp161514210-14 - /api/pb?action=UpdateCheckRequest] DEBUG org.eclipse.jetty.server.handler.ContextHandler - context=/api||/pb @ o.e.j.s.ServletContextHandler{/api,null}
265436 [qtp161514210-14 - /api/pb?action=UpdateCheckRequest] DEBUG org.eclipse.jetty.servlet.ServletHandler - servlet /api|/pb|null -> scm.adserver.ProtobufServiceServlet-928946006
265436 [qtp161514210-14 - /api/pb?action=UpdateCheckRequest] DEBUG org.eclipse.jetty.servlet.ServletHandler - chain=null
265438 [qtp161514210-14 - /api/pb?action=UpdateCheckRequest] DEBUG org.eclipse.jetty.server.Server - RESPONSE /api/pb 200
...
280993 [qtp161514210-11 Selector0] DEBUG org.eclipse.jetty.io.nio - destroyEndPoint SCEP@1217256339java.nio.channels.SocketChannel[closed][o=false d=false,io=1,w=true,rb=false,wb=false]
280994 [qtp161514210-11 Selector0] DEBUG org.eclipse.jetty.server.HttpConnection - closed org.eclipse.jetty.server.nio.SelectChannelConnector$SelectChannelHttpConnection@3bc6d11a@204.232.---.---:80<->62.140.---.---:------

But in other cases just nothing happens, even though the initial request looks exactly the same:

167947 [qtp161514210-16 - /api/pb?action=UpdateCheckRequest] DEBUG org.eclipse.jetty.server.Server - REQUEST /api/pb on org.eclipse.jetty.server.nio.SelectChannelConnector$SelectChannelHttpConnection@5918cb3a@204.232.---.---:80<->68.55.--.--:45290
167947 [qtp161514210-16 - /api/pb?action=UpdateCheckRequest] DEBUG org.eclipse.jetty.server.handler.ContextHandler - scope null||/api/pb @ o.e.j.s.ServletContextHandler{/api,null}
167948 [qtp161514210-16 - /api/pb?action=UpdateCheckRequest] DEBUG org.eclipse.jetty.server.handler.ContextHandler - context=/api||/pb @ o.e.j.s.ServletContextHandler{/api,null}
167948 [qtp161514210-16 - /api/pb?action=UpdateCheckRequest] DEBUG org.eclipse.jetty.servlet.ServletHandler - servlet /api|/pb|null -> scm.adserver.ProtobufServiceServlet-928946006
167948 [qtp161514210-16 - /api/pb?action=UpdateCheckRequest] DEBUG org.eclipse.jetty.servlet.ServletHandler - chain=null
167949 [qtp161514210-16 - /api/pb?action=UpdateCheckRequest] DEBUG org.eclipse.jetty.server.Server - RESPONSE /api/pb 200

(and this connection is in FIN_WAIT2 hours later still)
Any other diagnostic information that would be useful, or things I can try apart from Connection: close?

We had dead FIN_WAIT_2 connections on jetty 7.5.4 and 8.1 snapshot.

The idle timeout needs to be reset and reapplied after attempting a half close. Currently when the timeout expires, a FIN is sent, which, if acknowledged, puts the connection in FIN_WAIT2... but if the connection then goes black, a RST or FIN is never received and the connection is not closed. The change is to ensure that the idle timeout fires a second time and, if the connection is half closed, to then do a real close.

Fix pushed to jetty-7.6 branch @ ddce35a. 7.6.0.RC0 is due in the next few days.

There is a report that SSL still has a problem here, so we need to investigate.

Created attachment 208329 [details]
socket state diagram
Simone, can you have a look at this issue, as I've an email report of 100% CPU associated with this. The situation is that we send a FIN because we have shut down the connection. The client sends an ACK, but then never sends a FIN or a RST. I changed our timeout handling so that we would time out again, and on the second idleExpire we should close rather than shutdown. However, I only tested this for normal sockets. Could you use your SSL proxy technique to make a test harness for this with SSL, to see a) if we spin, and b) if the timeout does eventually close the socket? Thanks.

Greg, I have created another bug for the SSL spin (https://bugs.eclipse.org/bugs/show_bug.cgi?id=367175), which turned out not to be related to this bug. This bug is more related to a problem in idle timeout checking, so I'm assigning it back to you.

SslConnection was suppressing the second call to onIdleExpired and preventing a real close. It now does a real close if the real endpoint is already oshut (output shutdown). The test harness has been improved to check the server side as well as the client seeing broken pipe or similar.

Still some reports of failure, so reopening for now.
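A minimal sketch of the double idle-expiry close described above; the class and its structure are illustrative only (plain NIO, not Jetty's actual EndPoint/Connection code):

import java.io.IOException;
import java.nio.channels.SocketChannel;

// Illustrative only: a stand-in for the endpoint/connection, not Jetty's API.
class HalfCloseOnIdle
{
    private final SocketChannel channel;
    private boolean outputShutdown;

    HalfCloseOnIdle(SocketChannel channel)
    {
        this.channel = channel;
    }

    // Called by the idle-timeout scheduler each time the idle timeout expires.
    void onIdleExpired() throws IOException
    {
        if (!outputShutdown)
        {
            // First expiry: half close. This sends a FIN; a well-behaved peer
            // replies with its own FIN (or a RST) once it notices.
            channel.shutdownOutput();
            outputShutdown = true;
            // The idle timer must be re-armed here, so that we get a second
            // callback if the peer never completes the close (the FIN_WAIT2 case).
        }
        else
        {
            // Second expiry: the peer never answered, so stop being polite and
            // release the socket for real.
            channel.close();
        }
    }
}

The SSL problem described above was that SslConnection swallowed the second onIdleExpired callback, so the equivalent of the else branch was never reached.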
For async HTTP we want to check for idle while the request is not dispatched and not suspended. So the change currently implemented is to move the setCheckIdle calls from customize and persist into AsyncHttpConnection, with the pattern:
handle()
{
    try
    {
        // Stop idle checking while the request is being dispatched.
        setCheckIdle(false);
        ...
    }
    finally
    {
        // Resume idle checking unless the request has been suspended;
        // a suspended request will be re-dispatched to handle() later.
        if (!suspended)
            setCheckIdle(true);
    }
}
Note that suspended lifecycles all end with a dispatch to handle(), so idle checking will be turned back on after the async lifecycle has completed.
No recent problem reports, plus the new implementation in 9.0.x is working well.

Hi Greg: I just started monitoring this issue as I believe I am encountering it. However, I haven't had time to really dig into exactly what the heck is going on. I am using HTTPS as the primary protocol for a custom server. I recently (this past weekend) was convinced of the merit of Jetty, and switched from Grizzly HTTP Server to Embedded Jetty 9.0. These results apply to RHEL 5.5 / 32-bit with Java 7.0u15.

I have a performance measurement script that immediately failed after this switch. Where my previous solutions (first a custom blocking I/O implementation and later Grizzly HTTP Server for the last 12 months) easily handled this load (effectively quick bursts of very short requests: 1000 requests in a row, each with a new connection, running from 10 parallel threads), Embedded Jetty immediately started to fail, running out of file descriptors. Odd, but perhaps it was near the limit. I increased the maximum number of file descriptors to 8192 and then began to monitor using "lsof" and /proc/$PID/fd/, and I discovered that with Embedded Jetty the number of file descriptors allocated during this loop would grow to about 3150 before hitting a steady state. Every new burst (one burst every 5 seconds) would bump the file descriptors up, and then around 3150 it stopped growing. Also odd.

I started to monitor more closely and I determined that:
- Many of the file descriptors were UNIX sockets: 1006 in the case I just reproduced.
- Many of the file descriptors were TCP/IP sockets in some sort of inappropriate state: 2000 in the same sample as above. "lsof" would report these as sockets for which it "can't identify protocol".

I started diffing the output of "lsof" between requests and I found some more odd behaviour. It seems that in many cases the file descriptors would stay flat between the bursts and between the invocations, but if I issued single requests, then some sort of background cleanup would happen. That is, if 3136 file descriptors were allocated they would stay allocated until I made a simple request, at which point the count would drop to 3000, then 2850, ... until finally it would reach around 250 and hit the "low" steady state. I also found from the "lsof" diffs that, as a result of each new request at the low state, it would seem to clean up one "can't identify protocol" descriptor and substitute it with a new UNIX socket, which yielded a steady state where the file descriptors wouldn't necessarily increase or decrease.

At first I thought Jetty must suck. What kind of crappy software must this be? I found lots of links online to people encountering similar problems. However, I did note that the high steady state seemed stable, and I don't actually expect this service to receive bursts of 1000 connections in a row during normal service, so I decided to persist. I took a look at the Jetty source... and I found it is pretty sophisticated and careful code. I was unable to discover the bug, but got lost in the NIO event triggers and mean to get back to this. I also looked at the Java 7 implementation code and was unable to find a problem there.
I ran out of time to look during that interval, and my soft conclusion at this time is that either: 1) file descriptors are being leaked, and the garbage collector is finding them and reclaiming them, or 2) it has something to do with how Jetty calls Java 7 NIO, which is very different from the Grizzly NIO Framework, and that results in many more file descriptors being allocated than might normally be expected, with some sort of lazy clean up. Due to the steady states reached, however, I'm doubtful of 1), as I think the garbage collector would result in much more random behaviour with much worse "worst cases". I can't imagine file descriptors take that much heap space, and I would expect thousands more file descriptors to be allocated before garbage collection kicks in.

Finally, I also found that connections seem to persist until the keep-alive drops. That is, with the default configuration, connections seem to stay around for about 30 seconds after the client has disconnected, at which point I was able to trace that Jetty believes it is closing the file handle, and this is where I found the "can't identify protocol" descriptor converts to a UNIX socket. I believe I found that it is even the same file descriptor in this case! Very confusing behaviour. I think I'm able to reproduce the problem but I'm not clear on how to proceed. If you have any tests you'd like me to do, let me know.

This sounds like jetty is keeping the SSL connections half open for a timeout. Try changing max idle time to say 10s and see if the sockets only wait around for 10s instead of 30s. Also, can you say what version of Jetty and JVM you are using? Running SSL connections asynchronously is heinously complex, especially shutting them down nicely. Note that we have a new SSL implementation in Jetty-9 that we will put a lot of effort into debugging and making perfect.

Greg Wilkins writes:
> This sounds like jetty is keeping the SSL connections half open for a timeout.
>
> Try changing max idle time to say 10s and see if the sockets only wait around for 10s instead of 30s.

This is one part of my problem report... However, please note that it keeps these 3000+ file descriptors open for much longer than 30 seconds (indefinitely from what I can tell, until prompted - say, by a next request coming in...). There are two factors at play: 1) It holds the SSL connections half open for a timeout of 30 seconds. 2) It holds both SSL connections and UNIX sockets open for a much longer period.

> Also can you say what version of Jetty and JVM you are using?

Jetty 9.0.0.v20130308. Java 7u15. RHEL 5.5 / 32-bit.

> Running SSL connections asynchronously is heinously complex, especially shutting them down nicely. Note that we have a new SSL implementation in Jetty-9 that we will put a lot of effort into debugging and making perfect.

Understood. However, also please note that I was successfully using Grizzly HTTP Server before this and it did not exhibit these behaviour characteristics. I will see if I can prepare a smaller test case so that you can reproduce for yourself...

Reopening because reported on Jetty-9.

Ok... I've stripped it down to something a lot more basic and reproduced more specific behaviour on Fedora 18 / x86 / 64-bit with Java 7u17.

With HTTP (no SSL), issuing a simple HTTP GET using "curl", the behaviour I am seeing is:
- Server immediately closes the socket.
- Unix socket left over.

With HTTPS (SSL), issuing a simple HTTP GET using "curl", the behaviour I am seeing is:
- Server immediately does shutdown().
- Socket sits in a half-open state.
- After 30 seconds, server does close().
- Unix socket replaces the half-open socket.

I will attach source and a distribution image that may help you reproduce. Extract the binary image like this:

$ tar xfz jetty-problem-0.1-SNAPSHOT-dist.tar.gz
$ JAVA_HOME=/usr/java/jdk1.7 bin/jetty-problem-server
[2013-04-11@02:16:10.572] DEBUG o.e.jetty.util.log: Logging to Logger[org.eclipse.jetty.util.log] via org.eclipse.jetty.util.log.Slf4jLog
... lots of debug output ...

From another window... first find the process:

$ ps -ef | grep java | grep jetty | grep problem
mark 6935 6896 0 02:16 pts/2 00:00:01 /usr/java/jdk1.7/bin/java -server -Xms256m -Xmx512m -XX:+UseG1GC -Djava.util.logging.config.file=/home/mark/jetty.problem/etc/logging.properties -Dlogback.configurationFile=/home/mark/jetty.problem/etc/logback.xml com.acme.jetty.problem.TestServer

Here is the output from testing HTTP:

$ /usr/sbin/lsof -p 6935 >lsof.1
$ wget http://localhost:10080/
--2013-04-11 02:19:47-- http://localhost:10080/
Resolving localhost (localhost)... ::1, 127.0.0.1
Connecting to localhost (localhost)|::1|:10080... connected.
HTTP request sent, awaiting response... 200 OK
Length: 0
Saving to: ‘index.html.1’
[ <=> ] 0 --.-K/s in 0s
2013-04-11 02:19:47 (0.00 B/s) - ‘index.html.1’ saved [0/0]
$ /usr/sbin/lsof -p 6935 >lsof.2
$ diff lsof.1 lsof.2
98a99
> java 6935 mark 54u unix 0xffff8801f2119040 0t0 472307 socket

Here is the output from testing HTTPS:

$ /usr/sbin/lsof -p 6935 >lsof.2
$ diff lsof.1 lsof.2
98a99
> java 6935 mark 54u unix 0xffff8801f2119040 0t0 472307 socket
$
$
$ /usr/sbin/lsof -p 6935 >lsof.1
$ curl --insecure https://localhost:10443/
$ /usr/sbin/lsof -p 6935 >lsof.2
$ diff lsof.1 lsof.2
98a99
> java 6935 mark 53u sock 0,6 0t0 472158 protocol: TCPv6
$ sleep 30
$ /usr/sbin/lsof -p 6935 >lsof.3
$ diff lsof.2 lsof.3
99c99
< java 6935 mark 53u sock 0,6 0t0 472158 protocol: TCPv6
---
> java 6935 mark 53u unix 0xffff8801f2119040 0t0 472307 socket

My outsider-looking-in guess is:

1) I expect HTTPS to operate the same as HTTP, and I'm surprised that HTTP would close immediately but HTTPS would delay the close for 30 seconds. Something seems inconsistent here.

2) Something is converting half-open TCP/IP sockets to Unix sockets with the same file descriptor. I'm initially doubtful that Jetty is doing this, and I'm wondering if Java 7 is doing this as a consequence of NIO? If so, how bad is this? It seems to quickly consume file descriptors - many more than should be required to do the job - but does this mean we need to make sure that max files per process is higher when using Jetty / Java NIO? Or does this represent a scaling limit of some sort and ultimately a bug in either Jetty or Java?

I will also attach the source if you want to rebuild. Build using Maven 3 connected to Maven Central.

Created attachment 229588 [details]
jetty-problem-0.1-SNAPSHOT-dist.tar.gz
Created attachment 229589 [details]
jetty-problem.tar.gz
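As a side note for anyone reproducing this: Greg's earlier suggestion to lower the max idle time corresponds to the connector idle timeout in embedded Jetty 9. A minimal sketch follows; the port and handler are placeholders, not taken from the attached test server.

import org.eclipse.jetty.server.Server;
import org.eclipse.jetty.server.ServerConnector;
import org.eclipse.jetty.server.handler.DefaultHandler;

public class IdleTimeoutExample
{
    public static void main(String[] args) throws Exception
    {
        Server server = new Server();

        ServerConnector connector = new ServerConnector(server);
        connector.setPort(10080);         // placeholder port
        connector.setIdleTimeout(10000);  // 10s instead of the 30s default
        server.addConnector(connector);

        server.setHandler(new DefaultHandler());
        server.start();
        server.join();
    }
}

With a 10s idle timeout, half-open SSL connections should only linger for about 10s instead of 30s if the timeout is indeed what is holding them.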
Thanks for this extra information. I've not yet tried your server, but I am getting some similar results just using the embedded SpdyServer class that is in the distro. This gives me the Dump servlet, which is useful for doing things like:

curl -k https://localhost:8443/test/dump/info?sleep=10000

which has demonstrated that the unix socket is being created with the TCP/IP connection and not after the close of the connection. You can see this with the following from before, during and after the above curl command:

gregw@Brick: /tmp [613] lsof -Ts -n -p 13360 | egrep -e http-alt -e ' socket' -e 8443
java 13360 gregw 25u IPv6 1089455 0t0 TCP *:8443 (LISTEN)
java 13360 gregw 45u IPv6 1095069 0t0 TCP *:http-alt (LISTEN)
java 13360 gregw 46u unix 0x0000000000000000 0t0 1095070 socket

gregw@Brick: /tmp [614] lsof -Ts -n -p 13360 | egrep -e http-alt -e ' socket' -e 8443
java 13360 gregw 25u IPv6 1089455 0t0 TCP *:8443 (LISTEN)
java 13360 gregw 45u IPv6 1095069 0t0 TCP *:http-alt (LISTEN)
java 13360 gregw 46u unix 0x0000000000000000 0t0 1095070 socket
java 13360 gregw 99u IPv6 1090456 0t0 TCP 127.0.0.1:8443->127.0.0.1:58811 (ESTABLISHED)
java 13360 gregw 101u unix 0x0000000000000000 0t0 1099842 socket

gregw@Brick: /tmp [615] lsof -Ts -n -p 13360 | egrep -e http-alt -e ' socket' -e 8443
java 13360 gregw 25u IPv6 1089455 0t0 TCP *:8443 (LISTEN)
java 13360 gregw 45u IPv6 1095069 0t0 TCP *:http-alt (LISTEN)
java 13360 gregw 46u unix 0x0000000000000000 0t0 1095070 socket
java 13360 gregw 101u unix 0x0000000000000000 0t0 1099842 socket

I get the same thing for http.

Note that I can't reproduce your 30s of half-closed connection with ssl, but I will investigate that with your server after we've worked out what these sockets are for.

Hmmm, I'm seeing the number of sockets level out at 11. I'm thinking they are definitely something to do with the NIO implementation. If from another machine I do

while :; do curl http://192.168.0.4:8080/ > /dev/null; done

then I never see more than

[707] lsof -Ts -n -p 13360 | egrep -e http-alt -e ' socket' -e 8443
java 13360 gregw 25u IPv6 1089455 0t0 TCP *:8443 (LISTEN)
java 13360 gregw 33u unix 0x0000000000000000 0t0 1099842 socket
java 13360 gregw 45u IPv6 1095069 0t0 TCP *:http-alt (LISTEN)
java 13360 gregw 46u unix 0x0000000000000000 0t0 1095070 socket
java 13360 gregw 99u unix 0x0000000000000000 0t0 1099842 socket
java 13360 gregw 101u unix 0x0000000000000000 0t0 1099842 socket
java 13360 gregw 102u unix 0x0000000000000000 0t0 1099842 socket
java 13360 gregw 104u unix 0x0000000000000000 0t0 1099842 socket
java 13360 gregw 107u unix 0x0000000000000000 0t0 1099842 socket
java 13360 gregw 108u unix 0x0000000000000000 0t0 1099842 socket
java 13360 gregw 109u unix 0x0000000000000000 0t0 1099842 socket
java 13360 gregw 110u unix 0x0000000000000000 0t0 1099842 socket
java 13360 gregw 111u unix 0x0000000000000000 0t0 1099842 socket

So I'm thinking they are not an issue... although if you see 3000 of them, then they are. Will look at the 30s delay tomorrow.

I'm running your server now and I'm not seeing any FIN_WAIT sockets with curl. But I did see one for about 10s if I let a browser time out. On analysis, something strange is happening with our SSL close conversations in that, after we exchange close handshakes, curl is just resetting the connection instead of sending a FIN? With a browser, I can see both good FIN-FIN closes and bad FIN-RST closes, and in either case we have to wait an idle timeout before we finish our clean up.
I think we may have stopped looking for read interest after the close handshake, so we are not noticing the final FIN or RST? Investigating more.

I think there are lots of different things going on here and they are not all to do with Jetty. For example, I'm looking at an SSL conversation between Jetty and Firefox where Jetty has a 4s timeout. What I see is:

x        last data sent
x+4      jetty sends close handshake
x+4      jetty sends FIN (shutdownOutput)
x+4.04   FF acks the FIN frame
x+11     FF sends FIN (but no close handshake?!?!?)
x+11.01  Jetty ACKs FIN

So this is a nice TCP/IP close, but a non-perfect SSL close. It is as if FF is ignoring the connection while it is not using it! But then I also see:

x        last data sent
x+4      jetty sends close handshake
x+4      jetty sends FIN (shutdownOutput)
x+4.04   FF acks the FIN frame
x+4.05   FF sends FIN (but no close handshake?!?!?)
x+4.06   Jetty ACKs FIN

So the extra delay from FF does not always happen!!! If I increase jetty's timeout so that FF idles out the connection, then I see:

x          last data sent
x+110      FF sends close handshake
x+110      FF sends FIN
x+110.01   Jetty sends close handshake
x+110.02   FF sends RST!!!

So jetty didn't send a FIN, either because FF sent the RST too quickly or perhaps because it was not going to send it anyway? But none of this is looking hugely wrong, just a little strange. It is similar to what we are seeing in the curl close case. I'm wondering if Jetty should do the same as FF and not send the reply SSL close handshake, and instead just FIN the connection?

So I think the issue is that Jetty is trying to be too nice. When it receives an SSL close handshake, it sends one in reply and then expects the other end to close down the connection. But because jetty is no longer using the connection (it's closed), it's not actively reading from it, so it does not see the FIN or RST from the other end and waits for a timeout. So the simple solution is that we should just act like everybody else and not bother with sending the reply SSL close handshake. If we just close the connection we get nice FIN-FIN closes and no timeouts are needed. I've made this change and am testing it now. However, I'm still a little concerned that we had a state that was not reading and thus ignored the FIN/RST until a timeout - so I will continue to investigate that a little more, just in case the close-early solution can't be used for some other reason.

I decided the curl test (single threaded essentially) wasn't demonstrating the behaviour I was seeing of 3000+ file descriptors. I've included a test client that executes 100 pings from 10 threads. Note that in my case I normally test 1000 pings over 10 threads, but this very quickly leads to "No more files" exceptions, so I've backed it down to 100 to allow better analysis. (In my case, 1000 pings was leaving 3000 file descriptors... so what happens when we do just 100 pings?)

First, testing HTTP (not SSL)...

$ tar xfz jetty-problem-0.2-SNAPSHOT-dist.tar.gz
$ bin/jetty-problem-server
[2013-04-12@00:24:46.568] DEBUG o.e.jetty.util.log: Logging to Logger[org.eclipse.jetty.util.log] via org.eclipse.jetty.util.log.Slf4jLog
... lots of debug output ...

From another window:

$ ls
bin/ jetty-problem-0.2-SNAPSHOT-dist.tar.gz lib/ etc/ jetty-problem-0.2-SNAPSHOT.tar.gz
$ bin/jetty-problem-client http://localhost:10080/
...
ParallelPing x 100 @ 10 threads = 2821 ms elapsed
ParallelPing sleeping...
ParallelPing starting...
ParallelPing x 100 @ 10 threads = 1655 ms elapsed
ParallelPing sleeping...
ParallelPing starting...
ParallelPing x 100 @ 10 threads = 1533 ms elapsed
ParallelPing sleeping...
...

From a third window:

$ ps -ef | grep TestServer
mark 31229 6896 2 00:24 pts/2 00:00:02 /usr/java/jdk/bin/java -server -Xms256m -Xmx512m -XX:+UseG1GC -Djava.util.logging.config.file=/home/mark/jetty.problem/etc/logging.properties -Dlogback.configurationFile=/home/mark/jetty.problem/etc/logback.xml com.acme.jetty.problem.TestServer
mark 31364 31346 0 00:26 pts/1 00:00:00 grep --color=auto TestServer
$ /usr/sbin/lsof -p 31229 | wc -l
115
$ /usr/sbin/lsof -p 31229 | wc -l
116
$ /usr/sbin/lsof -p 31229 | wc -l
112
$ /usr/sbin/lsof -p 31229 | wc -l
112

This value is staying relatively stable. No real problem here that I see. Now, from the second window we will try HTTPS (SSL):

^C (Control-C)
$ bin/jetty-problem-client https://localhost:10443/
...
ParallelPing x 100 @ 10 threads = 7971 ms elapsed
ParallelPing sleeping...
ParallelPing starting...
ParallelPing x 100 @ 10 threads = 7247 ms elapsed
ParallelPing sleeping...
ParallelPing starting...
ParallelPing x 100 @ 10 threads = 8130 ms elapsed
ParallelPing sleeping...
...

And from the third window after giving it a little while to "warm up":

$ ps -ef | grep TestServer
mark 31229 6896 2 00:24 pts/2 00:00:10 /usr/java/jdk/bin/java -server -Xms256m -Xmx512m -XX:+UseG1GC -Djava.util.logging.config.file=/home/mark/jetty.problem/etc/logging.properties -Dlogback.configurationFile=/home/mark/jetty.problem/etc/logback.xml com.acme.jetty.problem.TestServer
mark 31500 31346 0 00:31 pts/1 00:00:00 grep --color=auto TestServer
$ /usr/sbin/lsof -p 31229 | wc -l
312
$ /usr/sbin/lsof -p 31229 | wc -l
312
$ /usr/sbin/lsof -p 31229 | wc -l
312

With 100 pings, it's hovering at +200 file descriptors compared to HTTP. From the second window, I kill the client process, which should clean up all the file descriptors by signalling close to the server:

ParallelPing starting...
ParallelPing x 100 @ 10 threads = 2968 ms elapsed
ParallelPing sleeping...
ParallelPing starting...
^C

From the third window, I check the file descriptors:

$ /usr/sbin/lsof -p 31229 | wc -l
320
$ /usr/sbin/lsof -p 31229 | wc -l
320
$ /usr/sbin/lsof -p 31229 | wc -l
320

From the debug in the first window, I see that it does close the connections eventually - at least it says it does:

[2013-04-12@00:32:23.645] DEBUG o.e.j.i.ChannelEndPoint: close SelectChannelEndPoint@6002a673{localhost/127.0.0.1:56456<r-l>/127.0.0.1:10443,o=false,is=false,os=true,fi=FillInterest@4dfdf943{false,null},wf=WriteFlusher@5cf7c5b5{IDLE},it=30000}{SslConnection@71538614{NOT_HANDSHAKING,eio=-1/-1,di=-1} -> HttpConnection@5387b07b{IDLE},g=HttpGenerator{s=START},p=HttpParser{s=START,0 of -1}}{io=0,kio=0,kro=1}

However:

$ /usr/sbin/lsof -p 31229 | wc -l
320
$ /usr/sbin/lsof -p 31229 | wc -l
320
$ /usr/sbin/lsof -p 31229 | wc -l
320

The file descriptors are not going down! Now, I issue a *single* request...

$ curl --insecure https://localhost:10443/
$ /usr/sbin/lsof -p 31229 | wc -l
269

Weird? One connection and it wakes it up such that it drops from 320 to 269? This continues like so:

$ curl --insecure https://localhost:10443/
$ /usr/sbin/lsof -p 31229 | wc -l
217
$ curl --insecure https://localhost:10443/
$ /usr/sbin/lsof -p 31229 | wc -l
168
$ curl --insecure https://localhost:10443/
$ /usr/sbin/lsof -p 31229 | wc -l
116
$ curl --insecure https://localhost:10443/
$ /usr/sbin/lsof -p 31229 | wc -l
116

Very strange...
If I set the scenario up again to see what kind of file descriptors these might be, I see:

$ /usr/sbin/lsof -p 31229 | wc -l
312
$ /usr/sbin/lsof -p 31229 | grep ' unix ' | wc -l
108
$ /usr/sbin/lsof -p 31229 | grep ' sock ' | wc -l
77

A number of "sock", and a number of "unix". After all the "close" calls happen... give it 30 seconds or so... it ends up looking like this:

$ /usr/sbin/lsof -p 31229 | wc -l
312
$ /usr/sbin/lsof -p 31229 | grep ' sock ' | wc -l
0
$ /usr/sbin/lsof -p 31229 | grep ' unix ' | wc -l
208

All the "sock" have become "unix"? Weird...

Created attachment 229654 [details]
jetty-problem-0.2-SNAPSHOT-dist.tar.gz
Created attachment 229655 [details]
jetty-problem-0.2-SNAPSHOT-src.tar.gz
Created attachment 229656 [details]
aggressive close patch
I've not yet tried your latest client, but I'm hopeful (if less than confident) that the fix I'm working on should avoid the issue, as it is a more aggressive close of the connection.
If you could test this patch against your client/environment, that would be handy.
I can't build yet as I have some tests that I need to fix up for the new close style.
Created attachment 229657 [details]
Gentle close patch
One thing I didn't like about just doing the aggressive close was that, even without it, we should not have needed a timeout to finish the close.
So this patch fixes that problem: even without the aggressive close, the SslConnection is still driven towards reading, looking for a FIN and/or a RST.
I'm hopeful this will resolve the FIN_WAIT issue on its own, but I still think we should go for the aggressive close as well, because that is what other SSL implementations appear to be doing, and if we try to be gentle the clients just RST us anyway.
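To make the two strategies under discussion concrete, here is a standalone SSLEngine sketch of the "aggressive" variant. It is illustrative only (a blocking, heavily simplified helper of my own naming), not the code in the attached patches.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;
import javax.net.ssl.SSLEngine;

class AggressiveCloseSketch
{
    // Hypothetical helper: send our close_notify, then close immediately.
    static void closeAggressively(SSLEngine engine, SocketChannel channel) throws IOException
    {
        // Mark our outbound side closed so wrap() produces a close_notify alert.
        engine.closeOutbound();
        ByteBuffer out = ByteBuffer.allocate(engine.getSession().getPacketBufferSize());
        while (!engine.isOutboundDone())
        {
            engine.wrap(ByteBuffer.allocate(0), out);
            out.flip();
            while (out.hasRemaining())
                channel.write(out);
            out.clear();
        }
        // Do not wait for the peer's reply close_notify (many clients never send
        // one, or just RST). Closing now sends our FIN and frees the descriptor,
        // instead of leaving the connection half open until an idle timeout.
        channel.close();
    }
}

The "gentle" variant instead keeps the connection readable after sending close_notify, so that the peer's FIN or RST is actually observed rather than discovered only when the idle timeout fires.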
With "aggressive close patch" I can't immediately see any bad behaviour with the test client: $ /usr/sbin/lsof -p 5159 | wc -l 115 $ /usr/sbin/lsof -p 5159 | wc -l 115 $ /usr/sbin/lsof -p 5159 | wc -l 115 $ /usr/sbin/lsof -p 5159 | wc -l 115 $ /usr/sbin/lsof -p 5159 | wc -l 115 $ /usr/sbin/lsof -p 5159 | wc -l 115 $ /usr/sbin/lsof -p 5159 | wc -l 117 $ /usr/sbin/lsof -p 5159 | wc -l 120 $ /usr/sbin/lsof -p 5159 | wc -l 117 $ /usr/sbin/lsof -p 5159 | wc -l 118 $ /usr/sbin/lsof -p 5159 | wc -l 117 $ /usr/sbin/lsof -p 5159 | wc -l 117 $ /usr/sbin/lsof -p 5159 | wc -l 117 $ /usr/sbin/lsof -p 5159 | wc -l 117 $ /usr/sbin/lsof -p 5159 | wc -l 117 $ /usr/sbin/lsof -p 5159 | wc -l 117 $ /usr/sbin/lsof -p 5159 | wc -l 117 Fairly stable # of file descriptors given that it's allocated and unallocated 100 every 5 seconds, running 10 at any given time while it's active. I'll try your next patch as well, and try it on my internal product that was seeing the 3000 file descriptors... Oh good news! However, the story is still not complete. I've just noticed that with Chrome, it only checks an SSL connection every 10s to see if it has been closed by the server. If Jetty has an idle time of 3s, then it sends an SSL close and FIN when idle. Chrome acks this, but does not send anything back until the connection has been idle for 10s, when it finally just sends a FIN (no close reply). Meanwhile jetty has timed out again (after another 3s) and appears to completely close the socket. However I'm not seeing a RST like I would expect, so still more investigation needed. With "gentle close patch", I see the debugging showing up with this at the end: [2013-04-12@02:28:01.993] DEBUG o.e.j.i.SelectorManager: Ignoring cancelled key for channel java.nio.channels.SocketChannel[closed] [2013-04-12@02:28:01.993] DEBUG o.e.j.i.SelectorManager: Selector loop waiting on select Both "gentle close patch" and "aggressive close patch" are maintaining stable file descriptor counts... but I do find it odd that simple substituting the jetty-io.jar and I get... aggressive close patch: $ /usr/sbin/lsof -p 6123 | wc -l 119 $ /usr/sbin/lsof -p 6123 | wc -l 117 $ /usr/sbin/lsof -p 6123 | wc -l 117 $ /usr/sbin/lsof -p 6123 | wc -l 116 gentle close patch: $ /usr/sbin/lsof -p 6324 | wc -l 106 $ /usr/sbin/lsof -p 6324 | wc -l 106 $ /usr/sbin/lsof -p 6324 | wc -l 121 $ /usr/sbin/lsof -p 6324 | wc -l 120 $ /usr/sbin/lsof -p 6324 | wc -l 119 $ /usr/sbin/lsof -p 6324 | wc -l 122 $ /usr/sbin/lsof -p 6324 | wc -l 106 It seems like gentle close patch *might* use fewer file descriptors. I don't know if that is good or bad. It may be an artifact of Java NIO? Or coincidence? But it seems reproducible as well... Trying both on my internal product now... Created attachment 229658 [details]
Combined patch
This is a combined patch that has both the gentle close code and the aggressive close in it, which I think is a good combination to handle many different close scenarios.
I've checked the chrome closing and Jetty is definitely calling close on the channel - no idea why the operating system does not send a RST?
On the internal product... Again, with the model of sending 1000 full HTTP POSTs from 10 Java threads simultaneously when active, with five-second breaks between runs, and with JSON input and output of varying length:

With no patch:
- Quickly climbs (with each iteration of 1000 connections) to 3223, and occasionally fluctuates back down to 3097 or some other value, only to return to 3223.

With "aggressive close":
- Hovers between 223 and 227, with no immediately apparent weighting.

With "gentle close":
- Hovers between 223 and 227 as well, but clearly weighted towards exactly 223.

I'm sure one patch is better than the other for some technical reason, and the "Ignoring cancelled key for channel" for "gentle close" deserves analysis, but from an application perspective, neither "aggressive close" nor "gentle close" seems to be exhibiting any poor behaviour.

Mark, thanks for testing this! I'm going to seek some internal code review on both patches and then hope to have a 9.0.2 out tomorrow or next Monday.

For the "both" patch...

Test client / server:
- Behaves much like "aggressive close". 115 to 119 file descriptors. No "Ignoring cancelled key for channel" immediately visible in the debug output.

Internal product:
- 223 to 226 file descriptors in the samples.

So "both" seems ok as well, at least from my perspective... Greg: thanks for the attention and quick fix. This was my only concern about switching from Grizzly to Jetty and it looks like it has been resolved.

Great!

Greg: I think I might have found an explanation in the Java 7 source code, here:
jdk/src/share/classes/java/net/AbstractPlainSocketImpl.java
In the implementation of close(), it has this mysterious passage:
protected void close() throws IOException {
    synchronized(fdLock) {
        if (fd != null) {
            ...
            if (fdUseCount == 0) {
                if (closePending) {
                    return;
                }
                closePending = true;
                /*
                 * We close the FileDescriptor in two-steps - first the
                 * "pre-close" which closes the socket but doesn't
                 * release the underlying file descriptor. This operation
                 * may be lengthy due to untransmitted data and a long
                 * linger interval. Once the pre-close is done we do the
                 * actual socket close to release the fd.
                 */
                try {
                    socketPreClose();
                } finally {
                    socketClose();
                }
                fd = null;
                return;
            } else {
                /*
                 * If a thread has acquired the fd and a close
                 * isn't pending then use a deferred close.
                 * Also decrement fdUseCount to signal the last
                 * thread that releases the fd to close it.
                 */
                if (!closePending) {
                    closePending = true;
                    fdUseCount--;
                    socketPreClose();
                }
            }
        }
    }
}
In the underlying code it appears they use an AF_UNIX socket as a "marker", and they'll sometimes choose to use dup2() instead of close() to close the socket but keep the file descriptor held, at which point it is no longer an AF_INET socket, but is now an AF_UNIX socket.
Based on the above... I think this may mean that the fdUseCount > 0 with however Jetty is managing the socket (pre-patch), and it's doing socketPreClose() which does dup2(marker_fd => fd) instead of close(fd), which then leaves AF_UNIX sockets around...
Once fdUseCount does hit 0, it is cleaned up - so to really understand what is happening we'd have to know why fdUseCount > 0?
Hope this helps...
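As an illustration of the point Greg makes below (the descriptor is only really released once the channel is deregistered from the selector), here is a minimal NIO sketch; the method and parameter names are hypothetical and simplified, not Jetty code.

import java.io.IOException;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

class PromptCloseSketch
{
    // Illustrative: close a selectable channel so its file descriptor is
    // released promptly instead of lingering as a "pre-closed" (dup2'd) fd.
    // In a real server this would normally run on the selector thread.
    static void closePromptly(SelectionKey key, Selector selector) throws IOException
    {
        key.cancel();            // stop selecting on the channel
        key.channel().close();   // while still registered, this may only "pre-close"
        selector.selectNow();    // process the cancelled-key set: the channel is
                                 // deregistered and the real close releases the fd
    }
}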
Interesting! We've had issues with the JRE's preclose before - it used to go 100% CPU on us, or throw exceptions that are not in its signature. The suggestion at the time was that this bug was triggered because we still had the socket registered with the selector - this might explain the fdUseCount > 0. The good news is that if we do our close properly, then we unregister with the selector and that count should go to 0.

Note that we have some variations of the fix that we are trying, because the one you tested gave our HTTP client grief.

9.0.2 doesn't exhibit any bad symptoms that I can see. Thanks, Greg!

Fixed. Thanks for the help debugging and testing!

Hi Greg:
After deployment... I've been encountering some troubling behaviour for the last few days. Can you comment if this seems related to the patch and we should discuss here, or if this looks like an unrelated problem and I should open a new defect? (Or if you want a new defect anyways...)
The reason I suspect it is related... is because I was finally able to reproduce the exact same symptoms on demand, and my steps to reproduce relate to sessions being automatically closed after the read timeout while the Servlet is reading from ServletRequest.getReader(). The Servlet encounters org.eclipse.jetty.io.EofException, and one (or, I believe, possibly even two in some cases) Request.recycle() calls get stuck.
Longer details:
For production symptoms... over several hours, threads would slowly accumulate (no real pattern... but say 1 to 10 every 24 hours?), each consuming 100% of one CPU core. These threads have stack traces like this:
"qtp658178110-91" prio=10 tid=0x00000000056dd000 nid=0x796b runnable [0x0000000042f3a000]
java.lang.Thread.State: RUNNABLE
at org.eclipse.jetty.server.HttpConnection$Input.blockForContent(HttpConnection.java:544)
at org.eclipse.jetty.server.HttpInput.read(HttpInput.java:124)
- locked <0x00000000e0a44238> (a org.eclipse.jetty.util.ArrayQueue)
at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:283)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:325)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:177)
- locked <0x00000000e0a442c0> (a java.io.InputStreamReader)
at java.io.InputStreamReader.read(InputStreamReader.java:184)
at java.io.BufferedReader.fill(BufferedReader.java:154)
at java.io.BufferedReader.read(BufferedReader.java:175)
- locked <0x00000000e0a442c0> (a java.io.InputStreamReader)
at org.eclipse.jetty.server.Request.recycle(Request.java:1496)
at org.eclipse.jetty.server.HttpChannel.reset(HttpChannel.java:211)
at org.eclipse.jetty.server.HttpConnection.reset(HttpConnection.java:147)
at org.eclipse.jetty.server.HttpConnection.completed(HttpConnection.java:480)
at org.eclipse.jetty.server.HttpChannel.run(HttpChannel.java:322)
at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:266)
at org.eclipse.jetty.io.AbstractConnection$ReadCallback.run(AbstractConnection.java:240)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:596)
at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:527)
at java.lang.Thread.run(Thread.java:722)
jstack will catch these threads in various places... but always within Request.recycle(), and always within BufferedReader.read(). HttpInput.read() is usually involved, and it will move around among a few methods inside (at high speed no doubt), but I've caught it as low as HttpParser.shutdownInput().
I racked my brains for a way to trigger this and finally came up with a simple way...
My embedded server is using the Jetty servlet engine. The servlet normally reads JSON using ServletRequest.getReader(). If I stimulate the server in this way: (I've stripped unnecessary information which could be construed as IP... <host>, <port>, <servlet path>, ...)
bash-4.1$ telnet <host> <port>
Trying 10.128.9.41...
Connected to <host>.
Escape character is '^]'.
POST <servlet path> HTTP/1.0
Content-type: application/json
Content-length: 1000
{"requestType"<ENTER>
<WAIT FOR READ TIMEOUT>
Immediately after the read timeout elapsed, I received these results via HTTP:
HTTP/1.1 200 OK
Content-Type: application/json;charset=UTF-8
Content-Length: 5382
Server: Jetty(9.0.2.v20130417)
{"responseType":"exception","exceptionType":"backend","exceptionCode":"internal","exceptionText":"Unexpected exception: org.eclipse.jetty.io.EofException","exceptionTrace":"BackendException: Unexpected exception: org.eclipse.jetty.io.EofException\n...
This means that the servlet noted that it encountered org.eclipse.jetty.io.EofException and it chose to handle this and return the exception to the client for analysis.
I was monitoring the number of threads stuck in Request.recycle() before and after the timeout elapsed, and sure enough:
$ /usr/java/jdk1.7/bin/jstack 28768 | grep Request.recycle | wc -l
2
$ /usr/java/jdk1.7/bin/jstack 28768 | grep Request.recycle | wc -l
3
... and now my server CPU is a lot busier than it should be.
Let me know how you would prefer I proceed. I may tinker and try to find the problem myself... but frankly it is a little intimidating with all the layers and this is the first time I have seen much of this Jetty code.
I *think* that one of the read() calls should be returning -1 to indicate EOF and it isn't. But... I'm not sure. Possibly the EofException is propagating up through the servlet thread without properly recording that the session is at EOF?
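For context, a servlet along the lines Mark describes might look like the following minimal sketch; the class name, buffer handling, and JSON payloads are hypothetical, not his actual code.

import java.io.BufferedReader;
import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import org.eclipse.jetty.io.EofException;

public class JsonServiceServlet extends HttpServlet
{
    @Override
    protected void doPost(HttpServletRequest request, HttpServletResponse response) throws IOException
    {
        StringBuilder body = new StringBuilder();
        try
        {
            // If the client stops sending mid-body (as in the telnet test above),
            // this read blocks until the connection idle timeout expires and then
            // throws org.eclipse.jetty.io.EofException.
            BufferedReader reader = request.getReader();
            char[] buffer = new char[4096];
            int n;
            while ((n = reader.read(buffer)) != -1)
                body.append(buffer, 0, n);
        }
        catch (EofException e)
        {
            // The servlet handles the error itself and returns a JSON error payload;
            // after this, the thread returns into Jetty's Request.recycle().
            response.setContentType("application/json;charset=UTF-8");
            response.getWriter().print("{\"responseType\":\"exception\",\"exceptionText\":\"org.eclipse.jetty.io.EofException\"}");
            return;
        }
        response.setContentType("application/json;charset=UTF-8");
        response.getWriter().print("{\"receivedChars\":" + body.length() + "}");
    }
}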
Mark, this looks like a new issue, so can you open a new bugzilla and we'll have a look. Cheers.

Thanks. Done.