Bug 364921 - FIN WAIT sockets
Summary: FIN WAIT sockets
Status: RESOLVED FIXED
Alias: None
Product: Jetty
Classification: RT
Component: server
Version: 9.0.0
Hardware: PC Linux
Importance: P3 normal
Target Milestone: 9.0.x
Assignee: Greg Wilkins CLA
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2011-11-27 17:31 EST by Greg Wilkins CLA
Modified: 2013-04-26 01:10 EDT
CC List: 4 users

See Also:


Attachments
socket state diagram (19.72 KB, image/gif), 2011-12-13 13:43 EST, Greg Wilkins
jetty-problem-0.1-SNAPSHOT-dist.tar.gz (1.69 MB, application/octet-stream), 2013-04-11 02:39 EDT, Mark Mielke
jetty-problem.tar.gz (5.13 KB, application/octet-stream), 2013-04-11 02:42 EDT, Mark Mielke
jetty-problem-0.2-SNAPSHOT-dist.tar.gz (2.48 MB, application/gzip), 2013-04-12 00:47 EDT, Mark Mielke
jetty-problem-0.2-SNAPSHOT-src.tar.gz (40.00 KB, application/octet-stream), 2013-04-12 00:48 EDT, Mark Mielke
aggressive close patch (1.12 KB, patch), 2013-04-12 01:38 EDT, Greg Wilkins
Gentle close patch (2.68 KB, patch), 2013-04-12 02:07 EDT, Greg Wilkins
Combined patch (2.90 KB, patch), 2013-04-12 02:35 EDT, Greg Wilkins

Description Greg Wilkins CLA 2011-11-27 17:31:29 EST
I'm running an embedded Jetty server which has fairly high numbers of
connections from mobile phones, which often suddenly lose connections
without properly shutting them down. I noticed that on my server, over
time it's building up lots of connections that are in FIN_WAIT2 state.
This was the case with 8.0.0, and also with the latest version 8.0.4. I
tried both with OpenJDK 6 and Sun JDK 6 (all on a Linux 64-bit
machine). Using a thread dump, I see that Jetty is maintaining
SelectChannelEndPoints for these hundreds of idle connections, even
though I'm setting maxIdleTime to 15000 (i.e. 15 seconds). Some of the
_idleTimestamp values are hours old already, but the connections don't
seem to be cleaned up. Is there some setting I'm overlooking, or is
this a bug in Jetty?
Best regards,
Mathijs Vogelzang
Comment 1 Greg Wilkins CLA 2011-11-27 17:32:04 EST
This is on Ubuntu Server 11.04 (Linux 2.6.35.4). I have explicitly
checked (and tried to lower) parameters that should influence this,
such as /proc/sys/net/ipv4/tcp_fin_timeout, but this didn't help. As
soon as I stop the Jetty server, the connections stay around for about
15 seconds and then all disappear, which led me to believe there's
still something in Jetty that's holding on to the connections.
My current workaround is to send back "Connection: close" with all
requests, which seems to work, but I'd really like to keep connections
open for a short while in case more requests are sent by the client,
and close them when they've been idle for a while (for testing I've set
maxIdleTime to 15000).
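
(For reference, the workaround amounts to something like the following sketch; the filter class is made up purely for illustration.)

import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletResponse;

// Illustrative sketch of the "Connection: close" workaround: a filter that
// disables keep-alive on every response (class name and mapping are made up).
public class ConnectionCloseFilter implements Filter
{
    public void init(FilterConfig config) {}

    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
        throws IOException, ServletException
    {
        ((HttpServletResponse)response).setHeader("Connection", "close");
        chain.doFilter(request, response);
    }

    public void destroy() {}
}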

It is also not 100% of requests for which this happens; the relevant
debug log for the following connection shows that it's properly closed
after 15 seconds:

265436 [qtp161514210-14 - /api/pb?action=UpdateCheckRequest] DEBUG
org.eclipse.jetty.server.Server  - REQUEST /api/pb on
org.eclipse.jetty.server.nio.SelectChannelConnector$SelectChannelHttpConnection@3bc6d11a@204.232.---.---:80<->62.140.---.---:-----
265436 [qtp161514210-14 - /api/pb?action=UpdateCheckRequest] DEBUG
org.eclipse.jetty.server.handler.ContextHandler  - scope null||/api/pb
@ o.e.j.s.ServletContextHandler{/api,null}
265436 [qtp161514210-14 - /api/pb?action=UpdateCheckRequest] DEBUG
org.eclipse.jetty.server.handler.ContextHandler  - context=/api||/pb @
o.e.j.s.ServletContextHandler{/api,null}
265436 [qtp161514210-14 - /api/pb?action=UpdateCheckRequest] DEBUG
org.eclipse.jetty.servlet.ServletHandler  - servlet /api|/pb|null ->
scm.adserver.ProtobufServiceServlet-928946006
265436 [qtp161514210-14 - /api/pb?action=UpdateCheckRequest] DEBUG
org.eclipse.jetty.servlet.ServletHandler  - chain=null
265438 [qtp161514210-14 - /api/pb?action=UpdateCheckRequest] DEBUG
org.eclipse.jetty.server.Server  - RESPONSE /api/pb  200
...
280993 [qtp161514210-11 Selector0] DEBUG org.eclipse.jetty.io.nio  -
destroyEndPoint
SCEP@1217256339java.nio.channels.SocketChannel[closed][o=false
d=false,io=1,w=true,rb=false,wb=false]
280994 [qtp161514210-11 Selector0] DEBUG
org.eclipse.jetty.server.HttpConnection  - closed
org.eclipse.jetty.server.nio.SelectChannelConnector$SelectChannelHttpConnection@3bc6d11a@204.232.---.---:80<->62.140.---.---:------

But in other cases just nothing happens, even though the initial
request looks exactly the same:

167947 [qtp161514210-16 - /api/pb?action=UpdateCheckRequest] DEBUG
org.eclipse.jetty.server.Server  - REQUEST /api/pb on
org.eclipse.jetty.server.nio.SelectChannelConnector$SelectChannelHttpConnection@5918cb3a@204.232.---.---:80<->68.55.--.--:45290
167947 [qtp161514210-16 - /api/pb?action=UpdateCheckRequest] DEBUG
org.eclipse.jetty.server.handler.ContextHandler  - scope null||/api/pb
@ o.e.j.s.ServletContextHandler{/api,null}
167948 [qtp161514210-16 - /api/pb?action=UpdateCheckRequest] DEBUG
org.eclipse.jetty.server.handler.ContextHandler  - context=/api||/pb @
o.e.j.s.ServletContextHandler{/api,null}
167948 [qtp161514210-16 - /api/pb?action=UpdateCheckRequest] DEBUG
org.eclipse.jetty.servlet.ServletHandler  - servlet /api|/pb|null ->
scm.adserver.ProtobufServiceServlet-928946006
167948 [qtp161514210-16 - /api/pb?action=UpdateCheckRequest] DEBUG
org.eclipse.jetty.servlet.ServletHandler  - chain=null
167949 [qtp161514210-16 - /api/pb?action=UpdateCheckRequest] DEBUG
org.eclipse.jetty.server.Server  - RESPONSE /api/pb  200

(and this connection is still in FIN_WAIT2 hours later). Is there any
other diagnostic information that would be useful, or things I can try
apart from Connection: close?
Comment 2 Greg Wilkins CLA 2011-11-27 17:44:16 EST
We had dead FIN_WAIT_2 connections on jetty 7.5.4 and 8.1 snapshot.
Comment 3 Greg Wilkins CLA 2011-11-27 19:36:45 EST
The idle timeout needs to be reset and reapplied after attempting a half close.
Currently when the timeout expires, a FIN is sent, which, if acknowledged, puts the connection in FIN_WAIT2... but if the connection then goes dark, a RST or FIN is never received and the connection is never closed.

The change is to ensure that the idle timeout fires a second time, and if the connection is already half closed, to do a real close.
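
In pseudo-Java, the intended behaviour is roughly the following sketch (IdleEndPoint is a made-up stand-in for the real endpoint API, not the actual implementation):

import java.io.IOException;

// Sketch only: a second idle expiry on an already half-closed connection
// forces a real close instead of waiting forever in FIN_WAIT2.
interface IdleEndPoint
{
    boolean isOutputShutdown();
    void shutdownOutput() throws IOException;  // half close: sends a FIN
    void close() throws IOException;           // real close
}

class HalfCloseIdlePolicy
{
    private final IdleEndPoint endPoint;

    HalfCloseIdlePolicy(IdleEndPoint endPoint) { this.endPoint = endPoint; }

    void onIdleExpired() throws IOException
    {
        if (endPoint.isOutputShutdown())
            endPoint.close();            // second expiry: the peer never answered our FIN
        else
            endPoint.shutdownOutput();   // first expiry: half close and wait
    }
}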
Comment 4 Greg Wilkins CLA 2011-11-27 19:55:22 EST
Fix pushed to the jetty-7.6 branch @ ddce35a.
7.6.0.RC0 due in the next few days.
Comment 5 Greg Wilkins CLA 2011-12-12 12:26:57 EST
There is a report that SSL still has a problem here, so it needs investigating.
Comment 6 Greg Wilkins CLA 2011-12-13 13:43:13 EST
Created attachment 208329 [details]
socket state diagram
Comment 7 Greg Wilkins CLA 2011-12-13 13:46:43 EST
Simone, can you have a look at this issue, as I've an email report of 100% CPU associated with this.

The situation is that we send a FIN because we have shutdown the connection. The client sends an ack, but then never sends a FIN or a RST.  I changed our timeout handling so that we would timeout again, and on the second idleExpire we should close rather than shutdown.

However, I only tested this for normal sockets.  Could you use your SSL proxy technique to make a test harness for this with SSL, to see if a) we spin, and b) the timeout does eventually close the socket?

thanks
Comment 8 Simone Bordet CLA 2011-12-20 07:28:12 EST
Greg, I have created another bug for the SSL spin (https://bugs.eclipse.org/bugs/show_bug.cgi?id=367175), that turned out to not be related to this bug.

This bug is more related to a problem in idle timeout checking, so I'm assigning it back to you.
Comment 9 Greg Wilkins CLA 2011-12-20 18:06:07 EST
SslConnection was suppressing the second call to onIdleExpired and preventing a real close.  It now does a real close if the real endpoint is already output-shutdown (oshut).

The test harness was improved to check the server side as well as the client seeing a broken pipe or similar.
Comment 10 Greg Wilkins CLA 2011-12-22 08:15:08 EST
Still some reports of failure, so reopening for now.

For async HTTP we want to check for idle while the request is not dispatched and not suspended.  So the change currently implemented is to move the setCheckIdle calls from customize and persist into AsyncHttpConnection, with the pattern:

handle()
{
   try 
   {
       // stop idle checking while the request is being handled
       setCheckIdle(false);

       ...
   }
   finally
   {
       // re-enable idle checking unless the request was suspended (async)
       if (!suspended)
          setCheckIdle(true);
   }
}

Note that suspended lifecycles all end with a dispatch to handle, so idle will be turned on after the async lifecycle has completed.
Comment 11 Greg Wilkins CLA 2013-04-07 18:33:03 EDT
No recent problem reports, and the new implementation in 9.0.x is working well.
Comment 12 Mark Mielke CLA 2013-04-08 22:47:53 EDT
Hi Greg:

I just started monitoring this issue as I believe I am encountering it. However, I haven't had time to really dig into exactly what the heck is going on.

I am using HTTPS as a primary protocol for a custom server. I recently (this weekend just past) was convinced of the merit of Jetty, and switched from Grizzly HTTP Server to Embedded Jetty 9.0. These results apply to RHEL 5.5 / 32-bit with Java 7.0u15.

I have a performance measurement script that immediately failed after this switch. My previous solutions (first a custom blocking I/O implementation, and for the last 12 months Grizzly HTTP Server) easily handled this load, which is effectively quick bursts of very short requests: 1000 requests in a row, each with a new connection, running from 10 parallel threads. Embedded Jetty immediately started to fail, running out of file descriptors. Odd, but perhaps it was near the limit. I increased the maximum number of file descriptors to 8192 and then began to monitor using "lsof" and /proc/$PID/fd/, and I discovered that with Embedded Jetty the number of file descriptors allocated during this loop would grow to about 3150 before hitting a steady state. Every new burst (a burst every 5 seconds) would bump the file descriptors up, and then around 3150 it stopped growing. Also odd.

I started to monitor more closely and I determined that:

- Many of the file descriptors were UNIX sockets. 1006 in the case I just reproduced.
- Many of the file descriptors were TCP/IP sockets in some sort of inappropriate state. 2000 in the same sample as above. "lsof" would report these as sockets for which it "can't identify protocol".

I started diffing the output of "lsof" between requests and I found some more odd behaviour.  It seems that in many cases the file descriptors would stay flat between the bursts and between the invocations, but if I issued single requests, then some sort of background cleanup would happen. That is, if 3136 file descriptors were allocated they would stay allocated until I made a simple request, at which point it would drop to 3000, then 2850, ... until finally it would reach around 250 and hit the "low" steady state.

I also found from the "lsof" diffs that each new request at the low state seemed to clean up one "can't identify protocol" descriptor and substitute it with a new UNIX socket, which yielded a steady state where the file descriptors wouldn't necessarily increase or decrease.
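
(As an aside, a quick way to watch this count from inside the JVM on Linux; purely a monitoring aid, not part of any fix:)

import java.io.File;

// Counts this process's open file descriptors by listing /proc/self/fd
// (Linux-specific, illustrative only).
public class FdCount
{
    public static void main(String[] args)
    {
        String[] fds = new File("/proc/self/fd").list();
        System.out.println("open fds: " + (fds == null ? 0 : fds.length));
    }
}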

At first I thought Jetty must suck. What kind of crappy software must this be? I found lots of links online to people encountering similar problems. However, I did note that the high steady state seemed stable, and I don't actually expect this service to receive bursts of 1000 connections in a row during normal service, so I decided to persist.

I took a look at the Jetty source... and I found it is pretty sophisticated and careful code. I was unable to discover the bug, but got lost in the NIO event triggers and mean to get back to this. I also looked at the Java 7 implementation code and was unable to find a problem there.

I ran out of time to look during that interval and my soft conclusion at this time is that either:
1) file descriptors are being leaked, and the garbage collector is finding them and reclaiming them
2) it has something to do with how Jetty calls Java 7 NIO, which is very different from the Grizzly NIO Framework, and that results in many more file descriptors being allocated than might normally be expected, with some sort of lazy clean-up.

Due to the steady states reached, however, I'm doubtful of 1), as I think the garbage collector would result in much more random behaviour with much worse "worst cases". I can't imagine file descriptors take that much heap space, and I would expect thousands more file descriptors to be allocated before garbage collection kicks in.

Finally, I also found that connections seem to persist until the keep-alive drops. That is, with the default configuration, connections seem to stay around for about 30 seconds after the client has disconnected, at which point I was able to trace that Jetty believes it is closing the file handle, and this is where I found the "can't identify protocol" socket converts to a UNIX socket. I believe I found that it is even the same file descriptor in this case!

Very confusing behaviour. I think I'm able to reproduce the problem but I'm not clear on how to proceed. If you have any tests you'd like me to do, let me know.
Comment 13 Greg Wilkins CLA 2013-04-10 16:35:59 EDT
This sounds like jetty is keeping the SSL connections half open for a timeout.

Try changing max idle time to say 10s and see if the sockets only wait around for 10s instead of 30s.

Also can you say what version of Jetty and JVM you are using?

Running SSL connections asynchronously is heinously complex, especially shutting them down nicely. Note that we have a new SSL implementation in Jetty-9 that we will put a lot of effort into debugging and making perfect.
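
For reference, lowering it on an embedded Jetty 9 connector looks roughly like this (a minimal sketch; the port is illustrative):

import org.eclipse.jetty.server.Server;
import org.eclipse.jetty.server.ServerConnector;

// Minimal embedded Jetty 9 sketch: drop the connector idle timeout down to
// 10s (value is in milliseconds) so half-open sockets are reaped sooner.
public class IdleTimeoutExample
{
    public static void main(String[] args) throws Exception
    {
        Server server = new Server();
        ServerConnector connector = new ServerConnector(server);
        connector.setPort(10080);          // illustrative port
        connector.setIdleTimeout(10000L);  // 10 seconds
        server.addConnector(connector);
        server.start();
        server.join();
    }
}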
Comment 14 Mark Mielke CLA 2013-04-10 18:54:35 EDT
Greg Wilkins writes:
> This sounds like jetty is keeping the SSL connections half open for a timeout.
>
> Try changing max idle time to say 10s and see if the sockets only wait around for 10s instead of 30s.

This is one part of my problem report... However, please note that it keeps these 3000+ file descriptors open for much longer than 30 seconds (indefinitely from what I can tell, until prompted - say, by the next incoming request...).

There are two factors at play:

1) It holds the SSL connections in a half-open state for a timeout of 30 seconds.

2) It holds both SSL connections and UNIX sockets open for a much longer period.

> Also can you say what version of Jetty and JVM you are using?

Jetty 9.0.0.v20130308. Java 7u15. RHEL 5.5 / 32-bit.

> Running SSL connections asynchronously is heinously complex, especially shutting them down nicely. Note that we have a new SSL implementation in Jetty-9 that we will put a lot of effort into debugging and making perfect.

Understood. However, also please note that I was successfully using Grizzly HTTP Server before this, and it did not exhibit this behaviour.

I will see if I can prepare a smaller test case so that you can reproduce for yourself...
Comment 15 Greg Wilkins CLA 2013-04-10 19:19:39 EDT
Reopening because this has been reported on Jetty-9.
Comment 16 Mark Mielke CLA 2013-04-11 02:32:50 EDT
Ok... I've stripped it down to something a lot more basic and reproduced more specific behaviour on Fedora 18 / x86 / 64-bit with Java 7u17.

With HTTP (no SSL), issuing a simple HTTP GET using "curl", the behaviour I am seeing is:

- Server immediately closes the socket.
- Unix socket left over.

With HTTPS (SSL), issuing a simple HTTP GET using "curl", the behaviour I am seeing is:

- Server immediately does shutdown().
- Socket sits in a half-open state.
- After 30 seconds, server does close().
- Unix socket replaces the half-open socket.

I will attach source and a distribution image that may help you reproduce.

Extract the binary image like this:

$ tar xfz jetty-problem-0.1-SNAPSHOT-dist.tar.gz
$ JAVA_HOME=/usr/java/jdk1.7 bin/jetty-problem-server
[2013-04-11@02:16:10.572] DEBUG o.e.jetty.util.log: Logging to Logger[org.eclipse.jetty.util.log] via org.eclipse.jetty.util.log.Slf4jLog
... lots of debug output ...



From another window... first find the process:

$ ps -ef | grep java | grep jetty | grep problem
mark      6935  6896  0 02:16 pts/2    00:00:01 /usr/java/jdk1.7/bin/java -server -Xms256m -Xmx512m -XX:+UseG1GC -Djava.util.logging.config.file=/home/mark/jetty.problem/etc/logging.properties -Dlogback.configurationFile=/home/mark/jetty.problem/etc/logback.xml com.acme.jetty.problem.TestServer



Here is the output from testing HTTP:

$ /usr/sbin/lsof -p 6935 >lsof.1
$ wget http://localhost:10080/                  
--2013-04-11 02:19:47--  http://localhost:10080/
Resolving localhost (localhost)... ::1, 127.0.0.1
Connecting to localhost (localhost)|::1|:10080... connected.
HTTP request sent, awaiting response... 200 OK
Length: 0
Saving to: ‘index.html.1’

    [ <=>                                   ] 0           --.-K/s   in 0s      

2013-04-11 02:19:47 (0.00 B/s) - ‘index.html.1’ saved [0/0]

$ /usr/sbin/lsof -p 6935 >lsof.2
$ diff lsof.1 lsof.2
98a99
> java    6935 mark   54u     unix 0xffff8801f2119040      0t0 472307 socket



Here is the output from testing HTTPS:

$ /usr/sbin/lsof -p 6935 >lsof.2
$ diff lsof.1 lsof.2
98a99
> java    6935 mark   54u     unix 0xffff8801f2119040      0t0 472307 socket
$ 
$ 
$ /usr/sbin/lsof -p 6935 >lsof.1
$ curl --insecure https://localhost:10443/
$ /usr/sbin/lsof -p 6935 >lsof.2          
$ diff lsof.1 lsof.2
98a99
> java    6935 mark   53u     sock                0,6      0t0 472158 protocol: TCPv6
$ sleep 30
$ /usr/sbin/lsof -p 6935 >lsof.3
$ diff lsof.2 lsof.3
99c99
< java    6935 mark   53u     sock                0,6      0t0 472158 protocol: TCPv6
---
> java    6935 mark   53u     unix 0xffff8801f2119040      0t0 472307 socket



My outsider-looking-in guesses are:

1) I expect HTTPS to operate the same as HTTP, and I'm surprised that HTTP would close immediately, but HTTPS would delay close for 30 seconds. Something seems inconsistent here.

2) Something is converting half open TCP/IP sockets to Unix sockets with the same file descriptor. I'm initially doubtful that Jetty is doing this, and I'm wondering if Java 7 is doing this as a consequence of NIO? If so, how bad is this? It seems to quickly consume file descriptors - many more than should be required to do the job - but does this mean we need to make sure that max files per process is higher when using Jetty / Java NIO? Or does this represent a scaling limit of some sort and ultimately a bug in either Jetty or Java?



I will also attach the source if you want to rebuild. Build using Maven 3 connected to Maven Central.
Comment 17 Mark Mielke CLA 2013-04-11 02:39:36 EDT
Created attachment 229588 [details]
jetty-problem-0.1-SNAPSHOT-dist.tar.gz
Comment 18 Mark Mielke CLA 2013-04-11 02:42:24 EDT
Created attachment 229589 [details]
jetty-problem.tar.gz
Comment 19 Greg Wilkins CLA 2013-04-11 04:22:45 EDT
Thanks for this extra information.   I've not yet tried your server, but I am getting some similar results with just using the embedded SpdyServer class that is in the distro.

This gives me the Dump servlet which is useful for doing things like:

curl -k https://localhost:8443/test/dump/info?sleep=10000

which has demonstrated that the unix socket is being created with the TCP/IP connection and not after the close of the connection.  You can see this with the following output from before, during and after the above curl command:

gregw@Brick: /tmp
[613] lsof -Ts -n -p 13360 | egrep -e http-alt -e ' socket' -e 8443
java    13360 gregw   25u  IPv6            1089455      0t0      TCP *:8443 (LISTEN)
java    13360 gregw   45u  IPv6            1095069      0t0      TCP *:http-alt (LISTEN)
java    13360 gregw   46u  unix 0x0000000000000000      0t0  1095070 socket

gregw@Brick: /tmp
[614] lsof -Ts -n -p 13360 | egrep -e http-alt -e ' socket' -e 8443
java    13360 gregw   25u  IPv6            1089455      0t0      TCP *:8443 (LISTEN)
java    13360 gregw   45u  IPv6            1095069      0t0      TCP *:http-alt (LISTEN)
java    13360 gregw   46u  unix 0x0000000000000000      0t0  1095070 socket
java    13360 gregw   99u  IPv6            1090456      0t0      TCP 127.0.0.1:8443->127.0.0.1:58811 (ESTABLISHED)
java    13360 gregw  101u  unix 0x0000000000000000      0t0  1099842 socket

gregw@Brick: /tmp
[615] lsof -Ts -n -p 13360 | egrep -e http-alt -e ' socket' -e 8443
java    13360 gregw   25u  IPv6            1089455      0t0      TCP *:8443 (LISTEN)
java    13360 gregw   45u  IPv6            1095069      0t0      TCP *:http-alt (LISTEN)
java    13360 gregw   46u  unix 0x0000000000000000      0t0  1095070 socket
java    13360 gregw  101u  unix 0x0000000000000000      0t0  1099842 socket


I get the same thing for HTTP.


Note that I can't reproduce your 30s of half-closed connections with SSL, but I will investigate that with your server after we've worked out what these sockets are for.
Comment 20 Greg Wilkins CLA 2013-04-11 04:32:43 EDT
Hmmm, I'm seeing the number of sockets level out at 11.  I'm thinking they are definitely something to do with the NIO implementation.

If from another machine I do

   while :; do curl http://192.168.0.4:8080/ > /dev/null; done

then I never see more than

[707] lsof -Ts -n -p 13360 | egrep -e http-alt -e ' socket' -e 8443
java    13360 gregw   25u  IPv6            1089455      0t0      TCP *:8443 (LISTEN)
java    13360 gregw   33u  unix 0x0000000000000000      0t0  1099842 socket
java    13360 gregw   45u  IPv6            1095069      0t0      TCP *:http-alt (LISTEN)
java    13360 gregw   46u  unix 0x0000000000000000      0t0  1095070 socket
java    13360 gregw   99u  unix 0x0000000000000000      0t0  1099842 socket
java    13360 gregw  101u  unix 0x0000000000000000      0t0  1099842 socket
java    13360 gregw  102u  unix 0x0000000000000000      0t0  1099842 socket
java    13360 gregw  104u  unix 0x0000000000000000      0t0  1099842 socket
java    13360 gregw  107u  unix 0x0000000000000000      0t0  1099842 socket
java    13360 gregw  108u  unix 0x0000000000000000      0t0  1099842 socket
java    13360 gregw  109u  unix 0x0000000000000000      0t0  1099842 socket
java    13360 gregw  110u  unix 0x0000000000000000      0t0  1099842 socket
java    13360 gregw  111u  unix 0x0000000000000000      0t0  1099842 socket



So I'm thinking they are not an issue.... although if you see 3000 of them, then they are.

Will look at the 30s delay tomorrow.
Comment 21 Greg Wilkins CLA 2013-04-11 20:48:58 EDT
I'm running your server now and I'm not seeing any FIN_WAIT sockets with curl.

But I did see one for about 10s if I let a browser time out.

On analysis, something strange is happening with our SSL close conversations, in that after we exchange close handshakes, curl just resets the connection instead of sending a FIN?

With a browser, I can see both good FIN-FIN closes and bad FIN-RST closes, and in either case we have to wait an idle timeout before we finish our clean-up.  I think we may have stopped looking for read interest after the close handshake, so we are not noticing the final FIN or RST?

investigating more.
Comment 22 Greg Wilkins CLA 2013-04-11 23:35:07 EDT
I think there are lots of different things going on here and they are not all to do with Jetty.

For example I'm looking at an SSL conversation between Jetty and Firefox where Jetty has a 4s timeout.  What I see is:

x       last data sent
x+4     jetty sends close handshake
x+4     jetty sends FIN (shutdownOutput)
x+4.04  FF acks the FIN frame
x+11    FF sends FIN (but no close handshake?!?!?)
x+11.01 Jetty ACKs FIN

So this is a nice TCP/IP close, but a non perfect SSL close.  It is as if FF is ignoring the connection while it is not using it!

But then I also see

x       last data sent
x+4     jetty sends close handshake
x+4     jetty sends FIN (shutdownOutput)
x+4.04  FF acks the FIN frame
x+4.05  FF sends FIN (but no close handshake?!?!?)
x+4.06  Jetty ACKs FIN

So the extra delay from FF does not always happen!!!


If I increase jetty's timeout so that FF idles out the connection, then I see


x        last data sent
x+110    FF sends close handshake
x+110    FF sends FIN
x+110.01 Jetty sends close handshake
x+110.02 FF sends RST!!!

So jetty didn't send a FIN, either because FF sent the RST too quickly or perhaps because it was not going to send it anyway?

But none of this is looking hugely wrong, just a little strange.  It is similar to what we are seeing from the curl close case.   I'm wondering if Jetty should do the same as FF and not send the reply SSL close handshake and instead just FIN the connection?
Comment 23 Greg Wilkins CLA 2013-04-12 00:08:40 EDT
So I think the issue is that Jetty is trying to be too nice.  When it receives an SSL close handshake, it sends one in reply and then expects the other end to close down the connection.  But because Jetty is no longer using the connection (it's closed), it's not actively reading from it, so it does not see the FIN or RST from the other end and waits for a timeout.

So the simple solution is that we should just act like everybody else and not bother with sending the reply SSL close handshake.  If we just close the connection we get nice FIN-FIN closes and no timeouts are needed.  I've made this change and am testing it now.
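
In rough terms the aggressive close looks something like this (a sketch only, not the actual patch; flushCloseNotify() stands in for the real wrap-and-write loop):

import java.io.IOException;
import java.nio.channels.SocketChannel;
import javax.net.ssl.SSLEngine;

// Sketch: once the peer's close_notify has been seen, send ours and close the
// TCP connection immediately instead of waiting for the peer's FIN.
class AggressiveSslClose
{
    void onCloseNotifyReceived(SSLEngine engine, SocketChannel channel) throws IOException
    {
        engine.closeOutbound();            // queue our close_notify
        flushCloseNotify(engine, channel); // hypothetical helper, see below
        channel.close();                   // don't linger waiting for the peer's FIN/RST
    }

    private void flushCloseNotify(SSLEngine engine, SocketChannel channel) throws IOException
    {
        // Left as a stub: a real implementation would loop on engine.wrap()
        // until the close_notify record has been fully written to the channel.
    }
}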


However, I'm still a little concerned that we had a state that was not reading and thus ignored the FIN/RST until a timeout - so I will continue to investigate that a little more, just in case the early-close solution can't be used for some other reason.
Comment 24 Mark Mielke CLA 2013-04-12 00:44:53 EDT
I decided the curl test (essentially single-threaded) wasn't demonstrating
the behaviour I was seeing of 3000+ file descriptors.

I've included a test client that executes 100 pings from 10 threads.
Note that in my case I normally test 1000 pings over 10 threads, but this
very quickly leads to "No more files" exceptions, so I've backed it down
to 100 to allow better analysis. (In my case, 1000 pings was leaving 3000
file descriptors... so what happens when we do just 100 pings?)

First, testing HTTP (not SSL)...

$ tar xfz jetty-problem-0.2-SNAPSHOT-dist.tar.gz
$ bin/jetty-problem-server
[2013-04-12@00:24:46.568] DEBUG o.e.jetty.util.log: Logging to Logger[org.eclipse.jetty.util.log] via org.eclipse.jetty.util.log.Slf4jLog
... lots of debug output ...


From another window:

$ ls
bin/  jetty-problem-0.2-SNAPSHOT-dist.tar.gz  lib/
etc/  jetty-problem-0.2-SNAPSHOT.tar.gz
$ bin/jetty-problem-client http://localhost:10080/
...
ParallelPing x 100 @ 10 threads = 2821 ms elapsed
ParallelPing sleeping...
ParallelPing starting...
ParallelPing x 100 @ 10 threads = 1655 ms elapsed
ParallelPing sleeping...
ParallelPing starting...
ParallelPing x 100 @ 10 threads = 1533 ms elapsed
ParallelPing sleeping...
...

From a third window:

$ ps -ef | grep TestServer
mark     31229  6896  2 00:24 pts/2    00:00:02 /usr/java/jdk/bin/java -server -Xms256m -Xmx512m -XX:+UseG1GC -Djava.util.logging.config.file=/home/mark/jetty.problem/etc/logging.properties -Dlogback.configurationFile=/home/mark/jetty.problem/etc/logback.xml com.acme.jetty.problem.TestServer
mark     31364 31346  0 00:26 pts/1    00:00:00 grep --color=auto TestServer
$ /usr/sbin/lsof -p 31229 | wc -l
115
$ /usr/sbin/lsof -p 31229 | wc -l
116
$ /usr/sbin/lsof -p 31229 | wc -l
112
$ /usr/sbin/lsof -p 31229 | wc -l
112


This value is staying relatively stable. No real problem here that I see.

Now, from the second window we will try HTTPS (SSL):

^C (Control-C)
$ bin/jetty-problem-client https://localhost:10443/
...
ParallelPing x 100 @ 10 threads = 7971 ms elapsed
ParallelPing sleeping...
ParallelPing starting...
ParallelPing x 100 @ 10 threads = 7247 ms elapsed
ParallelPing sleeping...
ParallelPing starting...
ParallelPing x 100 @ 10 threads = 8130 ms elapsed
ParallelPing sleeping...
...


And from the third window after giving it a little while to "warm up":

$ ps -ef | grep TestServer       
mark     31229  6896  2 00:24 pts/2    00:00:10 /usr/java/jdk/bin/java -server -Xms256m -Xmx512m -XX:+UseG1GC -Djava.util.logging.config.file=/home/mark/jetty.problem/etc/logging.properties -Dlogback.configurationFile=/home/mark/jetty.problem/etc/logback.xml com.acme.jetty.problem.TestServer
mark     31500 31346  0 00:31 pts/1    00:00:00 grep --color=auto TestServer
$ /usr/sbin/lsof -p 31229 | wc -l
312
$ /usr/sbin/lsof -p 31229 | wc -l
312
$ /usr/sbin/lsof -p 31229 | wc -l
312


With 100 pings, it's hovering at +200 file descriptors compared to HTTP.

From the second window, I kill the client process, which should clean
up all the file descriptors by signalling close to the server:

ParallelPing starting...
ParallelPing x 100 @ 10 threads = 2968 ms elapsed
ParallelPing sleeping...
ParallelPing starting...
^C


From the third window, I check the file descriptors:

$ /usr/sbin/lsof -p 31229 | wc -l
320
$ /usr/sbin/lsof -p 31229 | wc -l
320
$ /usr/sbin/lsof -p 31229 | wc -l
320


From the debug in the first window, I see that it does close the connections
eventually - at least it says it does:

[2013-04-12@00:32:23.645] DEBUG o.e.j.i.ChannelEndPoint: close SelectChannelEndPoint@6002a673{localhost/127.0.0.1:56456<r-l>/127.0.0.1:10443,o=false,is=false,os=true,fi=FillInterest@4dfdf943{false,null},wf=WriteFlusher@5cf7c5b5{IDLE},it=30000}{SslConnection@71538614{NOT_HANDSHAKING,eio=-1/-1,di=-1} -> HttpConnection@5387b07b{IDLE},g=HttpGenerator{s=START},p=HttpParser{s=START,0 of -1}}{io=0,kio=0,kro=1}


However:

$ /usr/sbin/lsof -p 31229 | wc -l
320
$ /usr/sbin/lsof -p 31229 | wc -l
320
$ /usr/sbin/lsof -p 31229 | wc -l
320


The file descriptors are not going down!

Now, I issue a *single* request...

$ curl --insecure https://localhost:10443/


$ /usr/sbin/lsof -p 31229 | wc -l         
269

Weird? One connection and it wakes it up such that it drops from 320 to 269?

This continues like so:

$ curl --insecure https://localhost:10443/
$ /usr/sbin/lsof -p 31229 | wc -l         
217
$ curl --insecure https://localhost:10443/
$ /usr/sbin/lsof -p 31229 | wc -l         
168
$ curl --insecure https://localhost:10443/
$ /usr/sbin/lsof -p 31229 | wc -l         
116
$ curl --insecure https://localhost:10443/
$ /usr/sbin/lsof -p 31229 | wc -l         
116

Very strange...

If I set the scenario up again to see what kind of file descriptors
these might be, I see:

$ /usr/sbin/lsof -p 31229 | wc -l        
312
$ /usr/sbin/lsof -p 31229 | grep ' unix ' | wc -l
108
$ /usr/sbin/lsof -p 31229 | grep ' sock ' | wc -l
77

A number of "sock", and a number of "unix".

After all the "close" happen... give it 30 seconds or so... it ends up
looking like this:

$ /usr/sbin/lsof -p 31229 | wc -l                
312
$ /usr/sbin/lsof -p 31229 | grep ' sock ' | wc -l
0
$ /usr/sbin/lsof -p 31229 | grep ' unix ' | wc -l
208

All the "sock" have become "unix"?

Weird...
Comment 25 Mark Mielke CLA 2013-04-12 00:47:57 EDT
Created attachment 229654 [details]
jetty-problem-0.2-SNAPSHOT-dist.tar.gz
Comment 26 Mark Mielke CLA 2013-04-12 00:48:26 EDT
Created attachment 229655 [details]
jetty-problem-0.2-SNAPSHOT-src.tar.gz
Comment 27 Greg Wilkins CLA 2013-04-12 01:38:06 EDT
Created attachment 229656 [details]
aggressive close patch

I've not yet tried your latest client, but I'm hopeful (if less than confident) that the fix I'm working on should avoid the issue, as it is a more aggressive close of the connection.

If you could test this patch against your client/environment, that would be handy.

I can't build yet as I have some tests that I need to fix up for the new close style.
Comment 28 Greg Wilkins CLA 2013-04-12 02:07:15 EDT
Created attachment 229657 [details]
Gentle close patch

One thing I didn't like about just doing the aggressive close is that even without it we should not have needed a timeout to finish the close.

So this patch fixes that problem, so that without the aggressive close the SslConnection is still driven towards reading, looking for a FIN and/or RST.  

I'm hopeful this will resolve the FIN_WAIT issue on its own, but I still think we should go for the aggressive close as well, because that is what other SSL implementations appear to be doing, and if we try to be gentle the clients just RST us anyway.
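
The shape of the gentle close is roughly the following sketch (illustrative only; fillInterested() here is just shorthand for re-declaring read interest with the selector):

import javax.net.ssl.SSLEngine;

// Sketch: after our close_notify/FIN has gone out, keep declaring read
// interest so the peer's final FIN or RST is noticed and the channel can be
// closed promptly, rather than only being reaped by the idle timeout.
class GentleSslClose
{
    private final SSLEngine engine;

    GentleSslClose(SSLEngine engine) { this.engine = engine; }

    void afterOutputShutdown()
    {
        if (engine.isOutboundDone() && !engine.isInboundDone())
            fillInterested();  // keep reading until EOF (FIN) or an error (RST)
    }

    private void fillInterested()
    {
        // Stub: in real code this would re-register the channel for read
        // events with the selector.
    }
}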
Comment 29 Mark Mielke CLA 2013-04-12 02:19:57 EDT
With "aggressive close patch" I can't immediately see any bad behaviour with the test client:

$ /usr/sbin/lsof -p 5159 | wc -l
115
$ /usr/sbin/lsof -p 5159 | wc -l
115
$ /usr/sbin/lsof -p 5159 | wc -l
115
$ /usr/sbin/lsof -p 5159 | wc -l
115
$ /usr/sbin/lsof -p 5159 | wc -l
115
$ /usr/sbin/lsof -p 5159 | wc -l
115
$ /usr/sbin/lsof -p 5159 | wc -l
117
$ /usr/sbin/lsof -p 5159 | wc -l
120
$ /usr/sbin/lsof -p 5159 | wc -l
117
$ /usr/sbin/lsof -p 5159 | wc -l
118
$ /usr/sbin/lsof -p 5159 | wc -l
117
$ /usr/sbin/lsof -p 5159 | wc -l
117
$ /usr/sbin/lsof -p 5159 | wc -l
117
$ /usr/sbin/lsof -p 5159 | wc -l
117
$ /usr/sbin/lsof -p 5159 | wc -l
117
$ /usr/sbin/lsof -p 5159 | wc -l
117
$ /usr/sbin/lsof -p 5159 | wc -l
117

A fairly stable number of file descriptors, given that it's allocating and deallocating 100 connections every 5 seconds, running 10 at any given time while it's active.

I'll try your next patch as well, and try it on my internal product that was seeing the 3000 file descriptors...
Comment 30 Greg Wilkins CLA 2013-04-12 02:27:24 EDT
Oh good news!


However, the story is still not complete.  I've just noticed that Chrome only checks an SSL connection every 10s to see if it has been closed by the server.

If Jetty has an idle time of 3s, then it sends an SSL close and FIN when idle.  Chrome acks this, but does not send anything back until the connection has been idle for 10s, when it finally just sends a FIN (no close reply).

Meanwhile jetty has timed out again (after another 3s) and appears to completely close the socket.   However I'm not seeing a RST like I would expect, so still more investigation needed.
Comment 31 Mark Mielke CLA 2013-04-12 02:32:29 EDT
With "gentle close patch", I see the debugging showing up with this at the end:

[2013-04-12@02:28:01.993] DEBUG o.e.j.i.SelectorManager: Ignoring cancelled key for channel java.nio.channels.SocketChannel[closed]
[2013-04-12@02:28:01.993] DEBUG o.e.j.i.SelectorManager: Selector loop waiting on select


Both "gentle close patch" and "aggressive close patch" are maintaining stable file descriptor counts... but I do find it odd that simple substituting the jetty-io.jar and I get...

aggressive close patch:

$ /usr/sbin/lsof -p 6123 | wc -l
119
$ /usr/sbin/lsof -p 6123 | wc -l
117
$ /usr/sbin/lsof -p 6123 | wc -l
117
$ /usr/sbin/lsof -p 6123 | wc -l
116

gentle close patch:

$ /usr/sbin/lsof -p 6324 | wc -l
106
$ /usr/sbin/lsof -p 6324 | wc -l
106
$ /usr/sbin/lsof -p 6324 | wc -l
121
$ /usr/sbin/lsof -p 6324 | wc -l
120
$ /usr/sbin/lsof -p 6324 | wc -l
119
$ /usr/sbin/lsof -p 6324 | wc -l
122
$ /usr/sbin/lsof -p 6324 | wc -l
106


It seems like gentle close patch *might* use fewer file descriptors. I don't know if that is good or bad. It may be an artifact of Java NIO? Or coincidence? But it seems reproducible as well...

Trying both on my internal product now...
Comment 32 Greg Wilkins CLA 2013-04-12 02:35:51 EDT
Created attachment 229658 [details]
Combined patch

This is a combined patch that has both the gentle close code and the aggressive close in it, which I think is a good combination to handle many different close scenarios.

I've checked the Chrome close and Jetty is definitely calling close on the channel - no idea why the operating system does not send a RST?
Comment 33 Mark Mielke CLA 2013-04-12 02:48:25 EDT
On the internal product...

Again, with the model of sending 1000 full HTTP POSTs from 10 Java threads simultaneously when active, with five-second breaks between runs, and JSON input and output of varying lengths...

With no patch:

- Quickly climbs (with each iteration of 1000 connections) to 3223, and occasionally fluctuates back down to 3097 or some other value only to return to 3223.

With "aggressive close":

- Hovers between 223 - 227 with no immediately apparent weighting

With "gentle close":

- Hovers between 223 - 227 as well, but clearly weighted towards exactly 223



I'm sure one patch is better than the other for some technical reason, and the "Ignoring cancelled key for channel" for "gentle close" deserves analysis, but from an application perspective neither "aggressive close" nor "gentle close" seems to be exhibiting any poor behaviour.
Comment 34 Greg Wilkins CLA 2013-04-12 02:57:50 EDT
Mark,

thanks for testing this!

I'm going to seek some internal code review on both patches and then hope to have a 9.0.2 out tomorrow or next Monday.
Comment 35 Mark Mielke CLA 2013-04-12 03:06:07 EDT
For "both" patch...

Test client / server:

- Behaves much like "aggressive close". 115 to 119 file descriptors. No "Ignoring cancelled key for channel" immediately visible in debug output.

Internal product:

- 223 to 226 file descriptors in the samples.

So "both" seems ok as well at least from my perspective...

Greg: Thanks for the attention and quick fix. This was my only concern about switching from Grizzly to Jetty and it looks like it has been resolved. Great!
Comment 36 Mark Mielke CLA 2013-04-12 17:46:19 EDT
Greg: I think I might have found an explanation in the Java 7 source code, here:

jdk/src/share/classes/java/net/AbstractPlainSocketImpl.java

In the implementation of close(), it has this mysterious passage:

    protected void close() throws IOException {
        synchronized(fdLock) {
            if (fd != null) {
                ...
                if (fdUseCount == 0) {
                    if (closePending) {
                        return;
                    }
                    closePending = true;
                    /*
                     * We close the FileDescriptor in two-steps - first the
                     * "pre-close" which closes the socket but doesn't
                     * release the underlying file descriptor. This operation
                     * may be lengthy due to untransmitted data and a long
                     * linger interval. Once the pre-close is done we do the
                     * actual socket to release the fd.
                     */
                    try {
                        socketPreClose();
                    } finally {
                        socketClose();
                    }
                    fd = null;
                    return;
                } else {
                    /*
                     * If a thread has acquired the fd and a close
                     * isn't pending then use a deferred close.
                     * Also decrement fdUseCount to signal the last
                     * thread that releases the fd to close it.
                     */
                    if (!closePending) {
                        closePending = true;
                        fdUseCount--;
                        socketPreClose();
                    }
                }
            }
        }
    }


In the underlying code it appears they use an AF_UNIX socket as a "marker", and they'll sometimes choose to use dup2() instead of close() to close the socket but keep the file descriptor held, at which point it is no longer an AF_INET socket, but is now an AF_UNIX socket.

Based on the above... I think this may mean that fdUseCount > 0 with however Jetty is managing the socket (pre-patch), so it's doing socketPreClose(), which does dup2(marker_fd => fd) instead of close(fd), and that then leaves AF_UNIX sockets around...

Once fdUseCount does hit 0, it is cleaned up - so to really understand what is happening we'd have to know why fdUseCount > 0?

Hope this helps...
Comment 37 Greg Wilkins CLA 2013-04-12 21:22:04 EDT
Interesting!  

We've had issues with the JRE's preclose before - it used to go to 100% CPU on us or throw exceptions that are not in its signature.  The suggestion at the time was that this bug was triggered because we still had the socket registered with the selector - this might explain the fdUseCount > 0.

The good news is that if we do our close properly, then we unregister with the selector and that count should go to 0.

Note that we have some variations of the fix that we are trying, because the one you tested gave our HTTP client grief.
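
That is, the ordering I mean is roughly this (plain NIO sketch, not the Jetty classes):

import java.io.IOException;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;

// Sketch of the close ordering described above: cancel the selector
// registration before closing the channel so that, per the analysis above,
// the use count can drop to zero and close() releases the descriptor instead
// of leaving a dup2()'d marker socket behind.
class OrderedClose
{
    void close(SocketChannel channel, Selector selector) throws IOException
    {
        SelectionKey key = channel.keyFor(selector);
        if (key != null)
            key.cancel();   // unregister from the selector first
        selector.wakeup();  // wake the selector so its next select() processes the cancelled key
        channel.close();    // then release the descriptor
    }
}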
Comment 38 Mark Mielke CLA 2013-04-18 02:44:58 EDT
9.0.2 doesn't exhibit any bad symptoms that I can see. Thanks, Greg!
Comment 39 Greg Wilkins CLA 2013-04-22 00:05:10 EDT
Fixed.  Thanks for help debugging and testing!
Comment 40 Mark Mielke CLA 2013-04-26 00:29:24 EDT
Hi Greg:

After deployment... I've been encountering some troubling behaviour for the last few days. Can you comment if this seems related to the patch and we should discuss here, or if this looks like an unrelated problem and I should open a new defect? (Or if you want a new defect anyways...)

The reason I suspect it is related... is that I was finally able to reproduce the exact same symptoms on demand, and my steps to reproduce relate to sessions being automatically closed after the read timeout while the Servlet is reading from ServletRequest.getReader(). The Servlet encounters org.eclipse.jetty.io.EofException, and one (or in some cases I believe possibly even two) Request.recycle() calls get stuck.

Longer details:

For production symptoms... over several hours, threads would slowly accumulate (no real pattern... but say 1 to 10 every 24 hours?), each consuming 100% of one CPU core. These threads have stack traces like this:

"qtp658178110-91" prio=10 tid=0x00000000056dd000 nid=0x796b runnable [0x0000000042f3a000]
   java.lang.Thread.State: RUNNABLE
        at org.eclipse.jetty.server.HttpConnection$Input.blockForContent(HttpConnection.java:544)
        at org.eclipse.jetty.server.HttpInput.read(HttpInput.java:124)
        - locked <0x00000000e0a44238> (a org.eclipse.jetty.util.ArrayQueue)
        at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:283)
        at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:325)
        at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:177)
        - locked <0x00000000e0a442c0> (a java.io.InputStreamReader)
        at java.io.InputStreamReader.read(InputStreamReader.java:184)
        at java.io.BufferedReader.fill(BufferedReader.java:154)
        at java.io.BufferedReader.read(BufferedReader.java:175)
        - locked <0x00000000e0a442c0> (a java.io.InputStreamReader)
        at org.eclipse.jetty.server.Request.recycle(Request.java:1496)
        at org.eclipse.jetty.server.HttpChannel.reset(HttpChannel.java:211)
        at org.eclipse.jetty.server.HttpConnection.reset(HttpConnection.java:147)
        at org.eclipse.jetty.server.HttpConnection.completed(HttpConnection.java:480)
        at org.eclipse.jetty.server.HttpChannel.run(HttpChannel.java:322)
        at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:266)
        at org.eclipse.jetty.io.AbstractConnection$ReadCallback.run(AbstractConnection.java:240)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:596)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:527)
        at java.lang.Thread.run(Thread.java:722)

jstack will catch these threads in various places... but always within Request.recycle(), and always within BufferedReader.read(). HttpInput.read() is usually involved, and it will move around between a few methods inside (at high speed no doubt), but I've caught it as low as HttpParser.shutdownInput().

I racked my brains for a way to trigger this and finally came up with a simple way...

My embedded server is using the Jetty servlet engine. The servlet normally reads JSON using ServletRequest.getReader(). If I stimulate the server in this way: (I've stripped unnecessary information which could be construed as IP... <host>, <port>, <servlet path>, ...)

bash-4.1$ telnet <host> <port>
Trying 10.128.9.41...
Connected to <host>.
Escape character is '^]'.
POST <servlet path> HTTP/1.0
Content-type: application/json
Content-length: 1000

{"requestType"<ENTER>
<WAIT FOR READ TIMEOUT>


Immediately after the read timeout elapsed, I received these results via HTTP:

HTTP/1.1 200 OK
Content-Type: application/json;charset=UTF-8
Content-Length: 5382
Server: Jetty(9.0.2.v20130417)

{"responseType":"exception","exceptionType":"backend","exceptionCode":"internal","exceptionText":"Unexpected exception: org.eclipse.jetty.io.EofException","exceptionTrace":"BackendException: Unexpected exception: org.eclipse.jetty.io.EofException\n...


This means that the servlet noted that it encountered org.eclipse.jetty.io.EofException and it chose to handle this and return the exception to the client for analysis.


I was monitoring the number of threads stuck in Request.recycle() before and after the timeout elapsed, and sure enough:

$ /usr/java/jdk1.7/bin/jstack 28768 | grep Request.recycle | wc -l
2
$ /usr/java/jdk1.7/bin/jstack 28768 | grep Request.recycle | wc -l
3


... and now my server CPU is a lot busier than it should be.

Let me know how you would prefer I proceed. I may tinker and try to find the problem myself... but frankly it is a little intimidating with all the layers and this is the first time I have seen much of this Jetty code.

I *think* that one of the read() calls should be returning -1 to indicate EOF and it isn't. But... not sure. Possibly the EofException is propagating up through the servlet thread and not properly recording that the session is at EOF?
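
For what it's worth, the kind of guard I would expect when draining leftover content looks roughly like this (a sketch only, not the actual Request.recycle() code):

import java.io.IOException;
import java.io.Reader;

// Illustrative drain loop: read() returning -1 (or throwing) must terminate
// the loop, otherwise a thread can spin forever at 100% CPU.
class DrainExample
{
    static void drain(Reader reader)
    {
        char[] buf = new char[512];
        try
        {
            while (reader.read(buf) != -1)
            {
                // discard leftover request content
            }
        }
        catch (IOException e)
        {
            // treat EofException (or any other I/O failure) as end of stream
        }
    }
}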
Comment 41 Greg Wilkins CLA 2013-04-26 00:58:59 EDT
Mark,

this looks like a new issue, so can you open a new bugzilla and we'll have a look

cheers
Comment 42 Mark Mielke CLA 2013-04-26 01:10:21 EDT
Thanks. Done.