Debugging "clogged" TCP connections
- by Nikratio
I'm having trouble with an internet connection that seems to randomly "freeze" arbitrary tcp connections. The connections stay established, but no data is coming through.
When this happens, netstat still shows the connection status as ESTABLISHED on both the local computer:
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name Timer
tcp        0     53 192.168.0.10:41129      173.255.235.238:143     ESTABLISHED 8219/gnutls-cli  on (79.31/13/0)
..and the remote server:
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name Timer
tcp        0      0 173.255.235.238:143     68.5.174.98:41129       ESTABLISHED 5303/imapd       off (0.00/0/0)
However, it seems that no data at all is transferred. If I run strace on the local and remote process, both just show a repeating sequence of select calls (with different fds of course), e.g.
select(6, [0 5], NULL, NULL, {0, 50000}) = 0 (Timeout)
select(6, [0 5], NULL, NULL, {0, 50000}) = 0 (Timeout)
select(6, [0 5], NULL, NULL, {0, 50000}) = 0 (Timeout)
The internet connection overall does not seem affected, I can still establish new connections to the same service on the same server without any problems. However, the affected local applications seem to be unaware of the problem and just hang.
When I look at a packet capture of this connection on the client side, the last thing that happens is that the client transmits some data, then nothing happens for about 1100 seconds, and then several TCP Retransmission requests go out, with intervals increasing from 4 seconds to 130 seconds. No activity is captured after that.
After about 10 minutes, the connection on the remote end disappears from the netstat (I wasn't able to catch any intermediate state), but still stays ESTABLISHED on the local end.
Finally, after some more minutes, the local application aborts with a timeout and disappears from the local netstat output as well.
Does anyone have a suggestion of how I could debug this further to find out where the problem lies and how to fix it? 
Additionaly and/or as a temporary workaround: is is there some way to globally reduce the timeout on client and/or server to reduce the time before the local application aborts?