user12820842 - Developer IT

DTracing TCP congestion control

- by user12820842

In a previous post, I showed how we can use DTrace to probe TCP receive and send window events. TCP receive and send windows are in effect both about flow-controlling how much data can be received - the receive window reflects how much data the local TCP is prepared to receive, while the send window simply reflects the size of the receive window of the peer TCP. Both then represent flow control as imposed by the receiver. However, consider that without the sender imposing flow control, and a slow link to a peer, TCP will simply fill up it's window with sent segments. Dealing with multiple TCP implementations filling their peer TCP's receive windows in this manner, busy intermediate routers may drop some of these segments, leading to timeout and retransmission, which may again lead to drops. This is termed congestion, and TCP has multiple congestion control strategies. We can see that in this example, we need to have some way of adjusting how much data we send depending on how quickly we receive acknowledgement - if we get ACKs quickly, we can safely send more segments, but if acknowledgements come slowly, we should proceed with more caution. More generally, we need to implement flow control on the send side also. Slow Start and Congestion Avoidance From RFC2581, let's examine the relevant variables: "The congestion window (cwnd) is a sender-side limit on the amount of data the sender can transmit into the network before receiving an acknowledgment (ACK). Another state variable, the slow start threshold (ssthresh), is used to determine whether the slow start or congestion avoidance algorithm is used to control data transmission" Slow start is used to probe the network's ability to handle transmission bursts both when a connection is first created and when retransmission timers fire. The latter case is important, as the fact that we have effectively lost TCP data acts as a motivator for re-probing how much data the network can handle from the sending TCP. The congestion window (cwnd) is initialized to a relatively small value, generally a low multiple of the sending maximum segment size. When slow start kicks in, we will only send that number of bytes before waiting for acknowledgement. When acknowledgements are received, the congestion window is increased in size until cwnd reaches the slow start threshold ssthresh value. For most congestion control algorithms the window increases exponentially under slow start, assuming we receive acknowledgements. We send 1 segment, receive an ACK, increase the cwnd by 1 MSS to 2*MSS, send 2 segments, receive 2 ACKs, increase the cwnd by 2*MSS to 4*MSS, send 4 segments etc. When the congestion window exceeds the slow start threshold, congestion avoidance is used instead of slow start. During congestion avoidance, the congestion window is generally updated by one MSS for each round-trip-time as opposed to each ACK, and so cwnd growth is linear instead of exponential (we may receive multiple ACKs within a single RTT). This continues until congestion is detected. If a retransmit timer fires, congestion is assumed and the ssthresh value is reset. It is reset to a fraction of the number of bytes outstanding (unacknowledged) in the network. At the same time the congestion window is reset to a single max segment size. Thus, we initiate slow start until we start receiving acknowledgements again, at which point we can eventually flip over to congestion avoidance when cwnd ssthresh. Congestion control algorithms differ most in how they handle the other indication of congestion - duplicate ACKs. A duplicate ACK is a strong indication that data has been lost, since they often come from a receiver explicitly asking for a retransmission. In some cases, a duplicate ACK may be generated at the receiver as a result of packets arriving out-of-order, so it is sensible to wait for multiple duplicate ACKs before assuming packet loss rather than out-of-order delivery. This is termed fast retransmit (i.e. retransmit without waiting for the retransmission timer to expire). Note that on Oracle Solaris 11, the congestion control method used can be customized. See here for more details. In general, 3 or more duplicate ACKs indicate packet loss and should trigger fast retransmit . It's best not to revert to slow start in this case, as the fact that the receiver knew it was missing data suggests it has received data with a higher sequence number, so we know traffic is still flowing. Falling back to slow start would be excessive therefore, so fast recovery is used instead. Observing slow start and congestion avoidance The following script counts TCP segments sent when under slow start (cwnd ssthresh). #!/usr/sbin/dtrace -s #pragma D option quiet tcp:::connect-request / start[args[1]-cs_cid] == 0/ { start[args[1]-cs_cid] = 1; } tcp:::send / start[args[1]-cs_cid] == 1 && args[3]-tcps_cwnd tcps_cwnd_ssthresh / { @c["Slow start", args[2]-ip_daddr, args[4]-tcp_dport] = count(); } tcp:::send / start[args[1]-cs_cid] == 1 && args[3]-tcps_cwnd args[3]-tcps_cwnd_ssthresh / { @c["Congestion avoidance", args[2]-ip_daddr, args[4]-tcp_dport] = count(); } As we can see the script only works on connections initiated since it is started (using the start[] associative array with the connection ID as index to set whether it's a new connection (start[cid] = 1). From there we simply differentiate send events where cwnd ssthresh (congestion avoidance). Here's the output taken when I accessed a YouTube video (where rport is 80) and from an FTP session where I put a large file onto a remote system. # dtrace -s tcp_slow_start.d ^C ALGORITHM RADDR RPORT #SEG Slow start 10.153.125.222 20 6 Slow start 138.3.237.7 80 14 Slow start 10.153.125.222 21 18 Congestion avoidance 10.153.125.222 20 1164 We see that in the case of the YouTube video, slow start was exclusively used. Most of the segments we sent in that case were likely ACKs. Compare this case - where 14 segments were sent using slow start - to the FTP case, where only 6 segments were sent before we switched to congestion avoidance for 1164 segments. In the case of the FTP session, the FTP data on port 20 was predominantly sent with congestion avoidance in operation, while the FTP session relied exclusively on slow start. For the default congestion control algorithm - "newreno" - on Solaris 11, slow start will increase the cwnd by 1 MSS for every acknowledgement received, and by 1 MSS for each RTT in congestion avoidance mode. Different pluggable congestion control algorithms operate slightly differently. For example "highspeed" will update the slow start cwnd by the number of bytes ACKed rather than the MSS. And to finish, here's a neat oneliner to visually display the distribution of congestion window values for all TCP connections to a given remote port using a quantization. In this example, only port 80 is in use and we see the majority of cwnd values for that port are in the 4096-8191 range. # dtrace -n 'tcp:::send { @q[args[4]-tcp_dport] = quantize(args[3]-tcps_cwnd); }' dtrace: description 'tcp:::send ' matched 10 probes ^C 80 value ------------- Distribution ------------- count -1 | 0 0 |@@@@@@ 5 1 | 0 2 | 0 4 | 0 8 | 0 16 | 0 32 | 0 64 | 0 128 | 0 256 | 0 512 | 0 1024 | 0 2048 |@@@@@@@@@ 8 4096 |@@@@@@@@@@@@@@@@@@@@@@@@@@ 23 8192 | 0

Read the article

Exploring TCP throughput with DTrace

- by user12820842

One key measure to use when assessing TCP throughput is assessing the amount of unacknowledged data in the pipe. This is sometimes termed the Bandwidth Delay Product (BDP) (note that BDP is often used more generally as the product of the link capacity and the end-to-end delay). In DTrace terms, the amount of unacknowledged data in bytes for the connection is the different between the next sequence number to send and the lowest unacknoweldged sequence number (tcps_snxt - tcps_suna). According to the theory, when the number of unacknowledged bytes for the connection is less than the receive window of the peer, the path bandwidth is the limiting factor for throughput. In other words, if we can fill the pipe without the peer TCP complaining (by virtue of its window size reaching 0), we are purely bandwidth-limited. If the peer's receive window is too small however, the sending TCP has to wait for acknowledgements before it can send more data. In this case the round-trip time (RTT) limits throughput. In such cases the effective throughput limit is the window size divided by the RTT, e.g. if the window size is 64K and the RTT is 0.5sec, the throughput is 128K/s. So a neat way to visually determine if the receive window of clients may be too small should be to compare the distribution of BDP values for the server versus the client's advertised receive window. If the BDP distribution overlaps the send window distribution such that it is to the right (or lower down in DTrace since quantizations are displayed vertically), it indicates that the amount of unacknowledged data regularly exceeds the client's receive window, so that it is possible that the sender may have more data to send but is blocked by a zero-window on the client side. In the following example, we compare the distribution of BDP values to the receive window advertised by the receiver (10.175.96.92) for a large file download via http. # dtrace -s tcp_tput.d ^C BDP(bytes) 10.175.96.92 80 value ------------- Distribution ------------- count -1 | 0 0 | 6 1 | 0 2 | 0 4 | 0 8 | 0 16 | 0 32 | 0 64 | 0 128 | 0 256 | 3 512 | 0 1024 | 0 2048 | 9 4096 | 14 8192 | 27 16384 | 67 32768 |@@ 1464 65536 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 32396 131072 | 0 SWND(bytes) 10.175.96.92 80 value ------------- Distribution ------------- count 16384 | 0 32768 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 17067 65536 | 0 Here we have a puzzle. We can see that the receiver's advertised window is in the 32768-65535 range, while the amount of unacknowledged data in the pipe is largely in the 65536-131071 range. What's going on here? Surely in a case like this we should see zero-window events, since the amount of data in the pipe regularly exceeds the window size of the receiver. We can see that we don't see any zero-window events since the SWND distribution displays no 0 values - it stays within the 32768-65535 range. The explanation is straightforward enough. TCP Window scaling is in operation for this connection - the Window Scale TCP option is used on connection setup to allow a connection to advertise (and have advertised to it) a window greater than 65536 bytes. In this case the scaling shift is 1, so this explains why the SWND values are clustered in the 32768-65535 range rather than the 65536-131071 range - the SWND value needs to be multiplied by two since the reciever is also scaling its window by a shift factor of 1. Here's the simple script that compares BDP and SWND distributions, fixed to take account of window scaling. #!/usr/sbin/dtrace -s #pragma D option quiet tcp:::send / (args[4]-tcp_flags & (TH_SYN|TH_RST|TH_FIN)) == 0 / { @bdp["BDP(bytes)", args[2]-ip_daddr, args[4]-tcp_sport] = quantize(args[3]-tcps_snxt - args[3]-tcps_suna); } tcp:::receive / (args[4]-tcp_flags & (TH_SYN|TH_RST|TH_FIN)) == 0 / { @swnd["SWND(bytes)", args[2]-ip_saddr, args[4]-tcp_dport] = quantize((args[4]-tcp_window)*(1 tcps_snd_ws)); } And here's the fixed output. # dtrace -s tcp_tput_scaled.d ^C BDP(bytes) 10.175.96.92 80 value ------------- Distribution ------------- count -1 | 0 0 | 39 1 | 0 2 | 0 4 | 0 8 | 0 16 | 0 32 | 0 64 | 0 128 | 0 256 | 3 512 | 0 1024 | 0 2048 | 4 4096 | 9 8192 | 22 16384 | 37 32768 |@ 99 65536 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 3858 131072 | 0 SWND(bytes) 10.175.96.92 80 value ------------- Distribution ------------- count 512 | 0 1024 | 1 2048 | 0 4096 | 2 8192 | 4 16384 | 7 32768 | 14 65536 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1956 131072 | 0

Read the article

Observing flow control idle time in TCP

- by user12820842

Previously I described how to observe congestion control strategies during transmission, and here I talked about TCP's sliding window approach for handling flow control on the receive side. A neat trick would now be to put the pieces together and ask the following question - how often is TCP transmission blocked by congestion control (send-side flow control) versus a zero-sized send window (which is the receiver saying it cannot process any more data)? So in effect we are asking whether the size of the receive window of the peer or the congestion control strategy may be sub-optimal. The result of such a problem would be that we have TCP data that we could be transmitting but we are not, potentially effecting throughput. So flow control is in effect: when the congestion window is less than or equal to the amount of bytes outstanding on the connection. We can derive this from args[3]-tcps_snxt - args[3]-tcps_suna, i.e. the difference between the next sequence number to send and the lowest unacknowledged sequence number; and when the window in the TCP segment received is advertised as 0 We time from these events until we send new data (i.e. args[4]-tcp_seq = snxt value when window closes. Here's the script: #!/usr/sbin/dtrace -s #pragma D option quiet tcp:::send / (args[3]-tcps_snxt - args[3]-tcps_suna) = args[3]-tcps_cwnd / { cwndclosed[args[1]-cs_cid] = timestamp; cwndsnxt[args[1]-cs_cid] = args[3]-tcps_snxt; @numclosed["cwnd", args[2]-ip_daddr, args[4]-tcp_dport] = count(); } tcp:::send / cwndclosed[args[1]-cs_cid] && args[4]-tcp_seq = cwndsnxt[args[1]-cs_cid] / { @meantimeclosed["cwnd", args[2]-ip_daddr, args[4]-tcp_dport] = avg(timestamp - cwndclosed[args[1]-cs_cid]); @stddevtimeclosed["cwnd", args[2]-ip_daddr, args[4]-tcp_dport] = stddev(timestamp - cwndclosed[args[1]-cs_cid]); @numclosed["cwnd", args[2]-ip_daddr, args[4]-tcp_dport] = count(); cwndclosed[args[1]-cs_cid] = 0; cwndsnxt[args[1]-cs_cid] = 0; } tcp:::receive / args[4]-tcp_window == 0 && (args[4]-tcp_flags & (TH_SYN|TH_RST|TH_FIN)) == 0 / { swndclosed[args[1]-cs_cid] = timestamp; swndsnxt[args[1]-cs_cid] = args[3]-tcps_snxt; @numclosed["swnd", args[2]-ip_saddr, args[4]-tcp_dport] = count(); } tcp:::send / swndclosed[args[1]-cs_cid] && args[4]-tcp_seq = swndsnxt[args[1]-cs_cid] / { @meantimeclosed["swnd", args[2]-ip_daddr, args[4]-tcp_sport] = avg(timestamp - swndclosed[args[1]-cs_cid]); @stddevtimeclosed["swnd", args[2]-ip_daddr, args[4]-tcp_sport] = stddev(timestamp - swndclosed[args[1]-cs_cid]); swndclosed[args[1]-cs_cid] = 0; swndsnxt[args[1]-cs_cid] = 0; } END { printf("%-6s %-20s %-8s %-25s %-8s %-8s\n", "Window", "Remote host", "Port", "TCP Avg WndClosed(ns)", "StdDev", "Num"); printa("%-6s %-20s %-8d %@-25d %@-8d %@-8d\n", @meantimeclosed, @stddevtimeclosed, @numclosed); } So this script will show us whether the peer's receive window size is preventing flow ("swnd" events) or whether congestion control is limiting flow ("cwnd" events). As an example I traced on a server with a large file transfer in progress via a webserver and with an active ssh connection running "find / -depth -print". Here is the output: ^C Window Remote host Port TCP Avg WndClosed(ns) StdDev Num cwnd 10.175.96.92 80 86064329 77311705 125 cwnd 10.175.96.92 22 122068522 151039669 81 So we see in this case, the congestion window closes 125 times for port 80 connections and 81 times for ssh. The average time the window is closed is 0.086sec for port 80 and 0.12sec for port 22. So if you wish to change congestion control algorithm in Oracle Solaris 11, a useful step may be to see if congestion really is an issue on your network. Scripts like the one posted above can help assess this, but it's worth reiterating that if congestion control is occuring, that's not necessarily a problem that needs fixing. Recall that congestion control is about controlling flow to prevent large-scale drops, so looking at congestion events in isolation doesn't tell us the whole story. For example, are we seeing more congestion events with one control algorithm, but more drops/retransmission with another? As always, it's best to start with measures of throughput and latency before arriving at a specific hypothesis such as "my congestion control algorithm is sub-optimal".

Read the article

Exploring TCP throughput with DTrace (2)

- by user12820842

Last time, I described how we can use the overlap in distributions of unacknowledged byte counts and send window to determine whether the peer's receive window may be too small, limiting throughput. Let's combine that comparison with a comparison of congestion window and slow start threshold, all on a per-port/per-client basis. This will help us Identify whether the congestion window or the receive window are limiting factors on throughput by comparing the distributions of congestion window and send window values to the distribution of outstanding (unacked) bytes. This will allow us to get a visual sense for how often we are thwarted in our attempts to fill the pipe due to congestion control versus the peer not being able to receive any more data. Identify whether slow start or congestion avoidance predominate by comparing the overlap in the congestion window and slow start distributions. If the slow start threshold distribution overlaps with the congestion window, we know that we have switched between slow start and congestion avoidance, possibly multiple times. Identify whether the peer's receive window is too small by comparing the distribution of outstanding unacked bytes with the send window distribution (i.e. the peer's receive window). I discussed this here. # dtrace -s tcp_window.d dtrace: script 'tcp_window.d' matched 10 probes ^C cwnd 80 10.175.96.92 value ------------- Distribution ------------- count 1024 | 0 2048 | 4 4096 | 6 8192 | 18 16384 | 36 32768 |@ 79 65536 |@ 155 131072 |@ 199 262144 |@@@ 400 524288 |@@@@@@ 798 1048576 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 3848 2097152 | 0 ssthresh 80 10.175.96.92 value ------------- Distribution ------------- count 268435456 | 0 536870912 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 5543 1073741824 | 0 unacked 80 10.175.96.92 value ------------- Distribution ------------- count -1 | 0 0 | 1 1 | 0 2 | 0 4 | 0 8 | 0 16 | 0 32 | 0 64 | 0 128 | 0 256 | 3 512 | 0 1024 | 0 2048 | 4 4096 | 9 8192 | 21 16384 | 36 32768 |@ 78 65536 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 5391 131072 | 0 swnd 80 10.175.96.92 value ------------- Distribution ------------- count 32768 | 0 65536 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 5543 131072 | 0 Here we are observing a large file transfer via http on the webserver. Comparing these distributions, we can observe: That slow start congestion control is in operation. The distribution of congestion window values lies below the range of slow start threshold values (which are in the 536870912+ range), so the connection is in slow start mode. Both the unacked byte count and the send window values peak in the 65536-131071 range, but the send window value distribution is narrower. This tells us that the peer TCP's receive window is not closing. The congestion window distribution peaks in the 1048576 - 2097152 range while the receive window distribution is confined to the 65536-131071 range. Since the cwnd distribution ranges as low as 2048-4095, we can see that for some of the time we have been observing the connection, congestion control has been a limiting factor on transfer, but for the majority of the time the receive window of the peer would more likely have been the limiting factor. However, we know the window has never closed as the distribution of swnd values stays within the 65536-131071 range. So all in all we have a connection that has been mildly constrained by congestion control, but for the bulk of the time we have been observing it neither congestion or peer receive window have limited throughput. Here's the script: #!/usr/sbin/dtrace -s tcp:::send / (args[4]-tcp_flags & (TH_SYN|TH_RST|TH_FIN)) == 0 / { @cwnd["cwnd", args[4]-tcp_sport, args[2]-ip_daddr] = quantize(args[3]-tcps_cwnd); @ssthresh["ssthresh", args[4]-tcp_sport, args[2]-ip_daddr] = quantize(args[3]-tcps_cwnd_ssthresh); @unacked["unacked", args[4]-tcp_sport, args[2]-ip_daddr] = quantize(args[3]-tcps_snxt - args[3]-tcps_suna); @swnd["swnd", args[4]-tcp_sport, args[2]-ip_daddr] = quantize((args[4]-tcp_window)*(1 tcps_snd_ws)); } One surprise here is that slow start is still in operation - one would assume that for a large file transfer, acknowledgements would push the congestion window up past the slow start threshold over time. The slow start threshold is in fact still close to it's initial (very high) value, so that would suggest we have not experienced any congestion (the slow start threshold is adjusted when congestion occurs). Also, the above measurements were taken early in the connection lifetime, so the congestion window did not get a changes to get bumped up to the level of the slow start threshold. A good strategy when examining these sorts of measurements for a given service (such as a webserver) would be start by examining the distributions above aggregated by port number only to get an overall feel for service performance, i.e. is congestion control or peer receive window size an issue, or are we unconstrained to fill the pipe? From there, the overlap of distributions will tell us whether to drill down into specific clients. For example if the send window distribution has multiple peaks, we may want to examine if particular clients show issues with their receive window.

Developer IT