linux/net
Eric Dumazet d4589926d7 tcp: refine TSO splits
While investigating performance problems on small RPC workloads,
I noticed linux TCP stack was always splitting the last TSO skb
into two parts (skbs). One being a multiple of MSS, and a small one
with the Push flag. This split is done even if TCP_NODELAY is set,
or if no small packet is in flight.

Example with request/response of 4K/4K

IP A > B: . ack 68432 win 2783 <nop,nop,timestamp 6524593 6525001>
IP A > B: . 65537:68433(2896) ack 69632 win 2783 <nop,nop,timestamp 6524593 6525001>
IP A > B: P 68433:69633(1200) ack 69632 win 2783 <nop,nop,timestamp 6524593 6525001>
IP B > A: . ack 68433 win 2768 <nop,nop,timestamp 6525001 6524593>
IP B > A: . 69632:72528(2896) ack 69633 win 2768 <nop,nop,timestamp 6525001 6524593>
IP B > A: P 72528:73728(1200) ack 69633 win 2768 <nop,nop,timestamp 6525001 6524593>
IP A > B: . ack 72528 win 2783 <nop,nop,timestamp 6524593 6525001>
IP A > B: . 69633:72529(2896) ack 73728 win 2783 <nop,nop,timestamp 6524593 6525001>
IP A > B: P 72529:73729(1200) ack 73728 win 2783 <nop,nop,timestamp 6524593 6525001>

We can avoid this split by including the Nagle tests at the right place.

Note : If some NIC had trouble sending TSO packets with a partial
last segment, we would have hit the problem in GRO/forwarding workload already.

tcp_minshall_update() is moved to tcp_output.c and is updated as we might
feed a TSO packet with a partial last segment.

This patch tremendously improves performance, as the traffic now looks
like :

IP A > B: . ack 98304 win 2783 <nop,nop,timestamp 6834277 6834685>
IP A > B: P 94209:98305(4096) ack 98304 win 2783 <nop,nop,timestamp 6834277 6834685>
IP B > A: . ack 98305 win 2768 <nop,nop,timestamp 6834686 6834277>
IP B > A: P 98304:102400(4096) ack 98305 win 2768 <nop,nop,timestamp 6834686 6834277>
IP A > B: . ack 102400 win 2783 <nop,nop,timestamp 6834279 6834686>
IP A > B: P 98305:102401(4096) ack 102400 win 2783 <nop,nop,timestamp 6834279 6834686>
IP B > A: . ack 102401 win 2768 <nop,nop,timestamp 6834687 6834279>
IP B > A: P 102400:106496(4096) ack 102401 win 2768 <nop,nop,timestamp 6834687 6834279>
IP A > B: . ack 106496 win 2783 <nop,nop,timestamp 6834280 6834687>
IP A > B: P 102401:106497(4096) ack 106496 win 2783 <nop,nop,timestamp 6834280 6834687>
IP B > A: . ack 106497 win 2768 <nop,nop,timestamp 6834688 6834280>
IP B > A: P 106496:110592(4096) ack 106497 win 2768 <nop,nop,timestamp 6834688 6834280>

Before :

lpq83:~# nstat >/dev/null;perf stat ./super_netperf 200 -t TCP_RR -H lpq84 -l 20 -- -r 4K,4K
280774

 Performance counter stats for './super_netperf 200 -t TCP_RR -H lpq84 -l 20 -- -r 4K,4K':

     205719.049006 task-clock                #    9.278 CPUs utilized
         8,449,968 context-switches          #    0.041 M/sec
         1,935,997 CPU-migrations            #    0.009 M/sec
           160,541 page-faults               #    0.780 K/sec
   548,478,722,290 cycles                    #    2.666 GHz                     [83.20%]
   455,240,670,857 stalled-cycles-frontend   #   83.00% frontend cycles idle    [83.48%]
   272,881,454,275 stalled-cycles-backend    #   49.75% backend  cycles idle    [66.73%]
   166,091,460,030 instructions              #    0.30  insns per cycle
                                             #    2.74  stalled cycles per insn [83.39%]
    29,150,229,399 branches                  #  141.699 M/sec                   [83.30%]
     1,943,814,026 branch-misses             #    6.67% of all branches         [83.32%]

      22.173517844 seconds time elapsed

lpq83:~# nstat | egrep "IpOutRequests|IpExtOutOctets"
IpOutRequests                   16851063           0.0
IpExtOutOctets                  23878580777        0.0

After patch :

lpq83:~# nstat >/dev/null;perf stat ./super_netperf 200 -t TCP_RR -H lpq84 -l 20 -- -r 4K,4K
280877

 Performance counter stats for './super_netperf 200 -t TCP_RR -H lpq84 -l 20 -- -r 4K,4K':

     107496.071918 task-clock                #    4.847 CPUs utilized
         5,635,458 context-switches          #    0.052 M/sec
         1,374,707 CPU-migrations            #    0.013 M/sec
           160,920 page-faults               #    0.001 M/sec
   281,500,010,924 cycles                    #    2.619 GHz                     [83.28%]
   228,865,069,307 stalled-cycles-frontend   #   81.30% frontend cycles idle    [83.38%]
   142,462,742,658 stalled-cycles-backend    #   50.61% backend  cycles idle    [66.81%]
    95,227,712,566 instructions              #    0.34  insns per cycle
                                             #    2.40  stalled cycles per insn [83.43%]
    16,209,868,171 branches                  #  150.795 M/sec                   [83.20%]
       874,252,952 branch-misses             #    5.39% of all branches         [83.37%]

      22.175821286 seconds time elapsed

lpq83:~# nstat | egrep "IpOutRequests|IpExtOutOctets"
IpOutRequests                   11239428           0.0
IpExtOutOctets                  23595191035        0.0

Indeed, the occupancy of tx skbs (IpExtOutOctets/IpOutRequests) is higher :
2099 instead of 1417, thus helping GRO to be more efficient when using FQ packet
scheduler.

Many thanks to Neal for review and ideas.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Nandita Dukkipati <nanditad@google.com>
Cc: Van Jacobson <vanj@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Tested-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-17 15:15:25 -05:00
..
9p Nothing really exciting: some groundwork for changing virtio endian, and 2013-11-15 13:28:47 +09:00
802 neigh: convert parms to an array 2013-12-09 20:56:12 -05:00
8021q Merge branch 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2013-11-14 16:30:30 +09:00
appletalk net: rework recvmsg handler msg_name and msg_namelen logic 2013-11-20 21:52:30 -05:00
atm net: rework recvmsg handler msg_name and msg_namelen logic 2013-11-20 21:52:30 -05:00
ax25 net: rework recvmsg handler msg_name and msg_namelen logic 2013-11-20 21:52:30 -05:00
batman-adv batadv: Slight optimization of batadv_compare_eth 2013-12-09 20:58:11 -05:00
bluetooth Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next into for-davem 2013-12-13 13:14:28 -05:00
bridge net: more spelling fixes 2013-12-10 21:57:11 -05:00
caif net: rework recvmsg handler msg_name and msg_namelen logic 2013-11-20 21:52:30 -05:00
can Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2013-11-15 16:47:22 -08:00
ceph net: 8021q/bluetooth/bridge/can/ceph: Remove extern from function prototypes 2013-10-19 19:12:11 -04:00
core net: remove dead code for add/del multiple 2013-12-17 15:14:04 -05:00
dcb net/*: Fix FSF address in file headers 2013-12-06 12:37:57 -05:00
dccp ipv4: introduce new IP_MTU_DISCOVER mode IP_PMTUDISC_INTERFACE 2013-11-05 21:52:27 -05:00
decnet dn_dev: add support for IFA_FLAGS nl attribute 2013-12-10 21:50:00 -05:00
dns_resolver net/*: Fix FSF address in file headers 2013-12-06 12:37:57 -05:00
dsa net: dsa: inherit addr_assign_type along with dev_addr 2013-09-03 20:57:49 -04:00
ethernet ethernet: use likely() for common Ethernet encap 2013-09-30 21:52:53 -07:00
hsr net/hsr: Support iproute print_opt ('ip -details ...') 2013-11-30 12:48:14 -05:00
ieee802154 genetlink: make multicast groups const, prevent abuse 2013-11-19 16:39:06 -05:00
ipv4 tcp: refine TSO splits 2013-12-17 15:15:25 -05:00
ipv6 net-ipv6: Fix alleged compiler warning in ipv6_exthdrs_len() 2013-12-15 23:55:04 -05:00
ipx net: rework recvmsg handler msg_name and msg_namelen logic 2013-11-20 21:52:30 -05:00
irda net/irda: Fix FSF address in file headers 2013-12-06 12:37:57 -05:00
iucv net: rework recvmsg handler msg_name and msg_namelen logic 2013-11-20 21:52:30 -05:00
key net: rework recvmsg handler msg_name and msg_namelen logic 2013-11-20 21:52:30 -05:00
l2tp inet: fix addr_len/msg->msg_namelen assignment in recv_error and rxpmtu functions 2013-11-23 14:46:23 -08:00
lapb net/lapb: re-send packets on timeout 2013-09-23 16:52:45 -04:00
llc net: rework recvmsg handler msg_name and msg_namelen logic 2013-11-20 21:52:30 -05:00
mac80211 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless 2013-12-06 09:50:45 -05:00
mac802154 6lowpan: set and use mac_len for mac header length 2013-10-30 17:18:46 -04:00
mpls ipip: add GSO/TSO support 2013-10-19 19:36:19 -04:00
netfilter netfilter: Fix FSF address in file headers 2013-12-06 12:37:57 -05:00
netlabel netlabel: Fix FSF address in file headers 2013-12-06 12:37:56 -05:00
netlink genetlink/pmcraid: use proper genetlink multicast API 2013-11-28 18:26:30 -05:00
netrom net: rework recvmsg handler msg_name and msg_namelen logic 2013-11-20 21:52:30 -05:00
nfc nfc: Fix FSF address in file headers 2013-12-11 10:56:21 -05:00
openvswitch net: ovs: use CRC32 accelerated flow hash if available 2013-12-17 14:27:17 -05:00
packet packet: fix using smp_processor_id() in preemptible code 2013-12-14 01:04:13 -05:00
phonet Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2013-11-19 15:50:47 -08:00
rds rds: prevent BUG_ON triggered on congestion update to loopback 2013-12-03 11:54:18 -05:00
rfkill rfkill: Fix FSF address in file headers 2013-12-11 10:56:21 -05:00
rose net: rework recvmsg handler msg_name and msg_namelen logic 2013-11-20 21:52:30 -05:00
rxrpc net: rework recvmsg handler msg_name and msg_namelen logic 2013-11-20 21:52:30 -05:00
sched pkt_sched: set root qdisc before change() in attach_default_qdiscs() 2013-12-14 01:20:06 -05:00
sctp sctp: remove redundant null check on asoc 2013-12-11 15:39:34 -05:00
sunrpc Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2013-11-20 14:25:39 -08:00
tipc tipc: change lock_sock order in connect() 2013-12-16 12:48:35 -05:00
unix unix: convert printks to pr_<level> 2013-12-06 16:35:58 -05:00
vmw_vsock net: rework recvmsg handler msg_name and msg_namelen logic 2013-11-20 21:52:30 -05:00
wimax wimax: remove dead code 2013-11-21 13:09:42 -05:00
wireless Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless 2013-12-06 09:50:45 -05:00
x25 x25: convert printks to pr_<level> 2013-12-09 20:24:18 -05:00
xfrm net: move pskb_put() to core code 2013-11-07 19:28:58 -05:00
Kconfig kernel: remove CONFIG_USE_GENERIC_SMP_HELPERS cleanly 2013-11-21 16:42:27 -08:00
Makefile net/hsr: Add support for the High-availability Seamless Redundancy protocol (HSRv0) 2013-11-03 23:20:14 -05:00
compat.c net: clamp ->msg_namelen instead of returning an error 2013-11-29 16:12:52 -05:00
nonet.c
socket.c net: handle error more gracefully in socketpair() 2013-12-10 22:24:13 -05:00
sysctl_net.c net: Update the sysctl permissions handler to test effective uid/gid 2013-10-07 15:57:56 -04:00