linux

Commit Graph

Author	SHA1	Message	Date
Wei Wang	85cb73ff9b	net: ipv6: reset daddr and dport in sk if connect() fails In __ip6_datagram_connect(), reset sk->sk_v6_daddr and inet->dport if error occurs. In udp_v6_early_demux(), check for sk_state to make sure it is in TCP_ESTABLISHED state. Together, it makes sure unconnected UDP socket won't be considered as a valid candidate for early demux. v3: add TCP_ESTABLISHED state check in udp_v6_early_demux() v2: fix compilation error Fixes: `5425077d73` ("net: ipv6: Add early demux handler for UDP unicast") Signed-off-by: Wei Wang <weiwan@google.com> Acked-by: Maciej Żenczykowski <maze@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-06-25 11:46:56 -04:00
Paolo Abeni	4b943faedf	udp/v6: prefetch rmem_alloc in udp6_queue_rcv_skb() very similar to commit `dd99e425be` ("udp: prefetch rmem_alloc in udp_queue_rcv_skb()"), this allows saving a cache miss when the BH is bottle-neck for UDP over ipv6 packet processing, e.g. for small packets when a single RX NIC ingress queue is in use. Performances under flood when multiple NIC RX queues used are unaffected, but when a single NIC rx queue is in use, this gives ~8% performance improvement. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-06-22 13:44:04 -04:00
Wei Wang	d24406c85d	udp: call dst_hold_safe() in udp_sk_rx_set_dst() In udp_v4/6_early_demux() code, we try to hold dst->__refcnt for dst with DST_NOCACHE flag. This is because later in udp_sk_rx_dst_set() function, we will try to cache this dst in sk for connected case. However, a better way to achieve this is to not try to hold dst in early_demux(), but in udp_sk_rx_dst_set(), call dst_hold_safe(). This approach is also more consistant with how tcp is handling it. And it will make later changes simpler. Signed-off-by: Wei Wang <weiwan@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-06-17 22:53:59 -04:00
David S. Miller	c6cd850d65	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net	2017-05-18 16:11:32 -04:00
Paolo Abeni	a3f96c47c8	udp: make udp_queue_rcv_skb() functions static Since the udp memory accounting refactor, we don't need any more to export the udp_queue_rcv_skb(). Make them static and fix a couple of sparse warnings: net/ipv4/udp.c:1615:5: warning: symbol 'udp_queue_rcv_skb' was not declared. Should it be static? net/ipv6/udp.c:572:5: warning: symbol 'udpv6_queue_rcv_skb' was not declared. Should it be static? Fixes: `850cbaddb5` ("udp: use it's own memory accounting schema") Fixes: `c915fe13cb` ("udplite: fix NULL pointer dereference") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-05-18 10:23:33 -04:00
Paolo Abeni	2276f58ac5	udp: use a separate rx queue for packet reception under udp flood the sk_receive_queue spinlock is heavily contended. This patch try to reduce the contention on such lock adding a second receive queue to the udp sockets; recvmsg() looks first in such queue and, only if empty, tries to fetch the data from sk_receive_queue. The latter is spliced into the newly added queue every time the receive path has to acquire the sk_receive_queue lock. The accounting of forward allocated memory is still protected with the sk_receive_queue lock, so udp_rmem_release() needs to acquire both locks when the forward deficit is flushed. On specific scenarios we can end up acquiring and releasing the sk_receive_queue lock multiple times; that will be covered by the next patch Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-05-16 15:41:29 -04:00
subashab@codeaurora.org	0bd84065b1	net: ipv6: Fix UDP early demux lookup with udp_l3mdev_accept=0 David Ahern reported that `5425077d73` ("net: ipv6: Add early demux handler for UDP unicast") breaks udp_l3mdev_accept=0 since early demux for IPv6 UDP was doing a generic socket lookup which does not require an exact match. Fix this by making UDPv6 early demux match connected sockets only. v1->v2: Take reference to socket after match as suggested by Eric v2->v3: Add comment before break Fixes: `5425077d73` ("net: ipv6: Add early demux handler for UDP unicast") Reported-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org> Cc: Eric Dumazet <edumazet@google.com> Acked-by: David Ahern <dsa@cumulusnetworks.com> Tested-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-20 15:50:27 -04:00
subashab@codeaurora.org	dddb64bcb3	net: Add sysctl to toggle early demux for tcp and udp Certain system process significant unconnected UDP workload. It would be preferrable to disable UDP early demux for those systems and enable it for TCP only. By disabling UDP demux, we see these slight gains on an ARM64 system- 782 -> 788Mbps unconnected single stream UDPv4 633 -> 654Mbps unconnected UDPv4 different sources The performance impact can change based on CPU architecure and cache sizes. There will not much difference seen if entire UDP hash table is in cache. Both sysctls are enabled by default to preserve existing behavior. v1->v2: Change function pointer instead of adding conditional as suggested by Stephen. v2->v3: Read once in callers to avoid issues due to compiler optimizations. Also update commit message with the tests. v3->v4: Store and use read once result instead of querying pointer again incorrectly. v4->v5: Refactor to avoid errors due to compilation with IPV6={m,n} Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org> Suggested-by: Eric Dumazet <edumazet@google.com> Cc: Stephen Hemminger <stephen@networkplumber.org> Cc: Tom Herbert <tom@herbertland.com> Cc: David Miller <davem@davemloft.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-03-24 13:17:07 -07:00
David S. Miller	16ae1f2236	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: drivers/net/ethernet/broadcom/genet/bcmmii.c drivers/net/hyperv/netvsc.c kernel/bpf/hashtab.c Almost entirely overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>	2017-03-23 16:41:27 -07:00
Alexander Potapenko	d515684d78	ipv6: make sure to initialize sockc.tsflags before first use In the case udp_sk(sk)->pending is AF_INET6, udpv6_sendmsg() would jump to do_append_data, skipping the initialization of sockc.tsflags. Fix the problem by moving sockc.tsflags initialization earlier. The bug was detected with KMSAN. Fixes: `c14ac9451c` ("sock: enable timestamping using control messages") Signed-off-by: Alexander Potapenko <glider@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-03-22 12:40:22 -07:00
subashab@codeaurora.org	5425077d73	net: ipv6: Add early demux handler for UDP unicast While running a single stream UDPv6 test, we observed that amount of CPU spent in NET_RX softirq was much greater than UDPv4 for an equivalent receive rate. The test here was run on an ARM64 based Android system. On further analysis with perf, we found that UDPv6 was spending significant time in the statistics netfilter targets which did socket lookup per packet. These statistics rules perform a lookup when there is no socket associated with the skb. Since there are multiple instances of these rules based on UID, there will be equal number of lookups per skb. By introducing early demux for UDPv6, we avoid the redundant lookups. This also helped to improve the performance (800Mbps -> 870Mbps) on a CPU limited system in a single stream UDPv6 receive test with 1450 byte sized datagrams using iperf. v1->v2: Use IPv6 cookie to validate dst instead of 0 as suggested by Eric Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-03-12 22:54:17 -07:00
David S. Miller	3f64116a83	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net	2017-02-16 19:34:01 -05:00
Jonathan T. Leighton	052d2369d1	ipv6: Handle IPv4-mapped src to in6addr_any dst. This patch adds a check on the type of the source address for the case where the destination address is in6addr_any. If the source is an IPv4-mapped IPv6 source address, the destination is changed to ::ffff:127.0.0.1, and otherwise the destination is changed to ::1. This is done in three locations to handle UDP calls to either connect() or sendmsg() and TCP calls to connect(). Note that udpv6_sendmsg() delays handling an in6addr_any destination until very late, so the patch only needs to handle the case where the source is an IPv4-mapped IPv6 address. Signed-off-by: Jonathan T. Leighton <jtleight@udel.edu> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-02-14 12:13:51 -05:00
David S. Miller	3efa70d78f	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net The conflict was an interaction between a bug fix in the netvsc driver in 'net' and an optimization of the RX path in 'net-next'. Signed-off-by: David S. Miller <davem@davemloft.net>	2017-02-07 16:29:30 -05:00
Julian Anastasov	0dec879f63	net: use dst_confirm_neigh for UDP, RAW, ICMP, L2TP When same struct dst_entry can be used for many different neighbours we can not use it for pending confirmations. The datagram protocols can use MSG_CONFIRM to confirm the neighbour. When used with MSG_PROBE we do not reach the code where neighbour is confirmed, so we have to do the same slow lookup by using the dst_confirm_neigh() helper. When MSG_PROBE is not used, ip_append_data/ip6_append_data will set the skb flag dst_pending_confirm. Reported-by: YueHaibing <yuehaibing@huawei.com> Fixes: `5110effee8` ("net: Do delayed neigh confirmation.") Fixes: `f2bb4bedf3` ("ipv4: Cache output routes in fib_info nexthops.") Signed-off-by: Julian Anastasov <ja@ssi.bg> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-02-07 13:07:47 -05:00
Eric Dumazet	69629464e0	udp: properly cope with csum errors Dmitry reported that UDP sockets being destroyed would trigger the WARN_ON(atomic_read(&sk->sk_rmem_alloc)); in inet_sock_destruct() It turns out we do not properly destroy skb(s) that have wrong UDP checksum. Thanks again to syzkaller team. Fixes : `7c13f97ffd` ("udp: do fwd memory scheduling on dequeue") Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-02-07 11:19:00 -05:00
Robert Shearman	63a6fff353	net: Avoid receiving packets with an l3mdev on unbound UDP sockets Packets arriving in a VRF currently are delivered to UDP sockets that aren't bound to any interface. TCP defaults to not delivering packets arriving in a VRF to unbound sockets. IP route lookup and socket transmit both assume that unbound means using the default table and UDP applications that haven't been changed to be aware of VRFs may not function correctly in this case since they may not be able to handle overlapping IP address ranges, or be able to send packets back to the original sender if required. So add a sysctl, udp_l3mdev_accept, to control this behaviour with it being analgous to the existing tcp_l3mdev_accept, namely to allow a process to have a VRF-global listen socket. Have this default to off as this is the behaviour that users will expect, given that there is no explicit mechanism to set unmodified VRF-unaware application into a default VRF. Signed-off-by: Robert Shearman <rshearma@brocade.com> Acked-by: David Ahern <dsa@cumulusnetworks.com> Tested-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-01-30 15:00:58 -05:00
Josef Bacik	fe38d2a1c8	inet: collapse ipv4/v6 rcv_saddr_equal functions into one We pass these per-protocol equal functions around in various places, but we can just have one function that checks the sk->sk_family and then do the right comparison function. I've also changed the ipv4 version to not cast to inet_sock since it is unneeded. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-01-18 13:04:28 -05:00
Linus Torvalds	7c0f6ba682	Replace <asm/uaccess.h> with <linux/uaccess.h> globally This was entirely automated, using the script by Al: PATT='^[[:blank:]]#[[:blank:]]include[[:blank:]]*<asm/uaccess.h>' sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \ $(git grep -l "$PATT"\|grep -v ^include/linux/uaccess.h) to do the replacement at the end of the merge window. Requested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-12-24 11:46:01 -08:00
David S. Miller	0b42f25d2f	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net udplite conflict is resolved by taking what 'net-next' did which removed the backlog receive method assignment, since it is no longer necessary. Two entries were added to the non-priv ethtool operations switch statement, one in 'net' and one in 'net-next, so simple overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>	2016-11-26 23:42:21 -05:00
Eric Dumazet	30c7be26fd	udplite: call proper backlog handlers In commits `93821778de` ("udp: Fix rcv socket locking") and `f7ad74fef3` ("net/ipv6/udp: UDP encapsulation: break backlog_rcv into __udpv6_queue_rcv_skb") UDP backlog handlers were renamed, but UDPlite was forgotten. This leads to crashes if UDPlite header is pulled twice, which happens starting from commit `e6afc8ace6` ("udp: remove headers from UDP packets before queueing") Bug found by syzkaller team, thanks a lot guys ! Note that backlog use in UDP/UDPlite is scheduled to be removed starting from linux-4.10, so this patch is only needed up to linux-4.9 Fixes: `93821778de` ("udp: Fix rcv socket locking") Fixes: `f7ad74fef3` ("net/ipv6/udp: UDP encapsulation: break backlog_rcv into __udpv6_queue_rcv_skb") Fixes: `e6afc8ace6` ("udp: remove headers from UDP packets before queueing") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Andrey Konovalov <andreyknvl@google.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-11-24 15:32:14 -05:00
David S. Miller	f9aa9dc7d2	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net All conflicts were simple overlapping changes except perhaps for the Thunder driver. That driver has a change_mtu method explicitly for sending a message to the hardware. If that fails it returns an error. Normally a driver doesn't need an ndo_change_mtu method becuase those are usually just range changes, which are now handled generically. But since this extra operation is needed in the Thunder driver, it has to stay. However, if the message send fails we have to restore the original MTU before the change because the entire call chain expects that if an error is thrown by ndo_change_mtu then the MTU did not change. Therefore code is added to nicvf_change_mtu to remember the original MTU, and to restore it upon nicvf_update_hw_max_frs() failue. Signed-off-by: David S. Miller <davem@davemloft.net>	2016-11-22 13:27:16 -05:00
Eric Dumazet	d21dbdfe0a	udp: avoid one cache line miss in recvmsg() UDP_SKB_CB(skb)->partial_cov is located at offset 66 in skb, requesting a cold cache line being read in cpu cache. We can avoid this cache line miss for UDP sockets, as partial_cov has a meaning only for UDPLite. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-11-21 11:26:59 -05:00
Eric Dumazet	e68b6e50fa	udp: enable busy polling for all sockets UDP busy polling is restricted to connected UDP sockets. This is because sk_busy_loop() only takes care of one NAPI context. There are cases where it could be extended. 1) Some hosts receive traffic on a single NIC, with one RX queue. 2) Some applications use SO_REUSEPORT and associated BPF filter to split the incoming traffic on one UDP socket per RX queue/thread/cpu 3) Some UDP sockets are used to send/receive traffic for one flow, but they do not bother with connect() This patch records the napi_id of first received skb, giving more reach to busy polling. Tested: lpaa23:~# echo 70 >/proc/sys/net/core/busy_read lpaa24:~# echo 70 >/proc/sys/net/core/busy_read lpaa23:~# for f in `seq 1 10`; do ./super_netperf 1 -H lpaa24 -t UDP_RR -l 5; done Before patch : 27867 28870 37324 41060 41215 36764 36838 44455 41282 43843 After patch : 73920 73213 70147 74845 71697 68315 68028 75219 70082 73707 Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-11-18 10:44:31 -05:00
Pablo Neira	73e2d5e34b	udp: restore UDPlite many-cast delivery Honor udptable parameter that is passed to __udp_lib_mcast_deliver(), otherwise udplite broadcast/multicast use the wrong table and it breaks. Fixes: `2dc41cff75` ("udp: Use hash2 for long hash1 chains in __udp_lib_mcast_deliver.") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-11-15 22:14:27 -05:00
David S. Miller	7d384846b9	Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains a second batch of Netfilter updates for your net-next tree. This includes a rework of the core hook infrastructure that improves Netfilter performance by ~15% according to synthetic benchmarks. Then, a large batch with ipset updates, including a new hash:ipmac set type, via Jozsef Kadlecsik. This also includes a couple of assorted updates. Regarding the core hook infrastructure rework to improve performance, using this simple drop-all packets ruleset from ingress: nft add table netdev x nft add chain netdev x y { type filter hook ingress device eth0 priority 0\; } nft add rule netdev x y drop And generating traffic through Jesper Brouer's samples/pktgen/pktgen_bench_xmit_mode_netif_receive.sh script using -i option. perf report shows nf_tables calls in its top 10: 17.30% kpktgend_0 [nf_tables] [k] nft_do_chain 15.75% kpktgend_0 [kernel.vmlinux] [k] __netif_receive_skb_core 10.39% kpktgend_0 [nf_tables_netdev] [k] nft_do_chain_netdev I'm measuring here an improvement of ~15% in performance with this patchset, so we got +2.5Mpps more. I have used my old laptop Intel(R) Core(TM) i5-3320M CPU @ 2.60GHz 4-cores. This rework contains more specifically, in strict order, these patches: 1) Remove compile-time debugging from core. 2) Remove obsolete comments that predate the rcu era. These days it is well known that a Netfilter hook always runs under rcu_read_lock(). 3) Remove threshold handling, this is only used by br_netfilter too. We already have specific code to handle this from br_netfilter, so remove this code from the core path. 4) Deprecate NF_STOP, as this is only used by br_netfilter. 5) Place nf_state_hook pointer into xt_action_param structure, so this structure fits into one single cacheline according to pahole. This also implicit affects nftables since it also relies on the xt_action_param structure. 6) Move state->hook_entries into nf_queue entry. The hook_entries pointer is only required by nf_queue(), so we can store this in the queue entry instead. 7) use switch() statement to handle verdict cases. 8) Remove hook_entries field from nf_hook_state structure, this is only required by nf_queue, so store it in nf_queue_entry structure. 9) Merge nf_iterate() into nf_hook_slow() that results in a much more simple and readable function. 10) Handle NF_REPEAT away from the core, so far the only client is nf_conntrack_in() and we can restart the packet processing using a simple goto to jump back there when the TCP requires it. This update required a second pass to fix fallout, fix from Arnd Bergmann. 11) Set random seed from nft_hash when no seed is specified from userspace. 12) Simplify nf_tables expression registration, in a much smarter way to save lots of boiler plate code, by Liping Zhang. 13) Simplify layer 4 protocol conntrack tracker registration, from Davide Caratti. 14) Missing CONFIG_NF_SOCKET_IPV4 dependency for udp4_lib_lookup, due to recent generalization of the socket infrastructure, from Arnd Bergmann. 15) Then, the ipset batch from Jozsef, he describes it as it follows: * Cleanup: Remove extra whitespaces in ip_set.h * Cleanup: Mark some of the helpers arguments as const in ip_set.h * Cleanup: Group counter helper functions together in ip_set.h * struct ip_set_skbinfo is introduced instead of open coded fields in skbinfo get/init helper funcions. * Use kmalloc() in comment extension helper instead of kzalloc() because it is unnecessary to zero out the area just before explicit initialization. * Cleanup: Split extensions into separate files. * Cleanup: Separate memsize calculation code into dedicated function. * Cleanup: group ip_set_put_extensions() and ip_set_get_extensions() together. * Add element count to hash headers by Eric B Munson. * Add element count to all set types header for uniform output across all set types. * Count non-static extension memory into memsize calculation for userspace. * Cleanup: Remove redundant mtype_expire() arguments, because they can be get from other parameters. * Cleanup: Simplify mtype_expire() for hash types by removing one level of intendation. * Make NLEN compile time constant for hash types. * Make sure element data size is a multiple of u32 for the hash set types. * Optimize hash creation routine, exit as early as possible. * Make struct htype per ipset family so nets array becomes fixed size and thus simplifies the struct htype allocation. * Collapse same condition body into a single one. * Fix reported memory size for hash:* types, base hash bucket structure was not taken into account. * hash:ipmac type support added to ipset by Tomasz Chilinski. * Use setup_timer() and mod_timer() instead of init_timer() by Muhammad Falak R Wani, individually for the set type families. 16) Remove useless connlabel field in struct netns_ct, patch from Florian Westphal. 17) xt_find_table_lock() doesn't return ERR_PTR() anymore, so simplify {ip,ip6,arp}tables code that uses this. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2016-11-13 22:41:25 -05:00
Arnd Bergmann	30f5815848	udp: provide udp{4,6}_lib_lookup for nf_socket_ipv{4,6} Since commit `ca065d0cf8` ("udp: no longer use SLAB_DESTROY_BY_RCU") the udp6_lib_lookup and udp4_lib_lookup functions are only provided when it is actually possible to call them. However, moving the callers now caused a link error: net/built-in.o: In function `nf_sk_lookup_slow_v6': (.text+0x131a39): undefined reference to `udp6_lib_lookup' net/ipv4/netfilter/nf_socket_ipv4.o: In function `nf_sk_lookup_slow_v4': nf_socket_ipv4.c:(.text.nf_sk_lookup_slow_v4+0x114): undefined reference to `udp4_lib_lookup' This extends the #ifdef so we also provide the functions when CONFIG_NF_SOCKET_IPV4 or CONFIG_NF_SOCKET_IPV6, respectively are set. Fixes: `8db4c5be88` ("netfilter: move socket lookup infrastructure to nf_socket_ipv{4,6}.c") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-11-10 00:05:14 +01:00
Paolo Abeni	7c13f97ffd	udp: do fwd memory scheduling on dequeue A new argument is added to __skb_recv_datagram to provide an explicit skb destructor, invoked under the receive queue lock. The UDP protocol uses such argument to perform memory reclaiming on dequeue, so that the UDP protocol does not set anymore skb->desctructor. Instead explicit memory reclaiming is performed at close() time and when skbs are removed from the receive queue. The in kernel UDP protocol users now need to call a skb_recv_udp() variant instead of skb_recv_datagram() to properly perform memory accounting on dequeue. Overall, this allows acquiring only once the receive queue lock on dequeue. Tested using pktgen with random src port, 64 bytes packet, wire-speed on a 10G link as sender and udp_sink as the receiver, using an l4 tuple rxhash to stress the contention, and one or more udp_sink instances with reuseport. nr sinks vanilla patched 1 440 560 3 2150 2300 6 3650 3800 9 4450 4600 12 6250 6450 v1 -> v2: - do rmem and allocated memory scheduling under the receive lock - do bulk scheduling in first_packet_length() and in udp_destruct_sock() - avoid the typdef for the dequeue callback Suggested-by: Eric Dumazet <edumazet@google.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-11-07 13:24:41 -05:00
Paolo Abeni	ad959036a7	net/sock: add an explicit sk argument for ip_cmsg_recv_offset() So that we can use it even after orphaining the skbuff. Suggested-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-11-07 13:24:41 -05:00
Lorenzo Colitti	e2d118a1cb	net: inet: Support UID-based routing in IP protocols. - Use the UID in routing lookups made by protocol connect() and sendmsg() functions. - Make sure that routing lookups triggered by incoming packets (e.g., Path MTU discovery) take the UID of the socket into account. - For packets not associated with a userspace socket, (e.g., ping replies) use UID 0 inside the user namespace corresponding to the network namespace the socket belongs to. This allows all namespaces to apply routing and iptables rules to kernel-originated traffic in that namespaces by matching UID 0. This is better than using the UID of the kernel socket that is sending the traffic, because the UID of kernel sockets created at namespace creation time (e.g., the per-processor ICMP and TCP sockets) is the UID of the user that created the socket, which might not be mapped in the namespace. Tested: compiles allnoconfig, allyesconfig, allmodconfig Tested: https://android-review.googlesource.com/253302 Signed-off-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-11-04 14:45:23 -04:00
David S. Miller	27058af401	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Mostly simple overlapping changes. For example, David Ahern's adjacency list revamp in 'net-next' conflicted with an adjacency list traversal bug fix in 'net'. Signed-off-by: David S. Miller <davem@davemloft.net>	2016-10-30 12:42:58 -04:00
Eric Dumazet	10df8e6152	udp: fix IP_CHECKSUM handling First bug was added in commit `ad6f939ab1` ("ip: Add offset parameter to ip_cmsg_recv") : Tom missed that ipv4 udp messages could be received on AF_INET6 socket. ip_cmsg_recv(msg, skb) should have been replaced by ip_cmsg_recv_offset(msg, skb, sizeof(struct udphdr)); Then commit `e6afc8ace6` ("udp: remove headers from UDP packets before queueing") forgot to adjust the offsets now UDP headers are pulled before skb are put in receive queue. Fixes: `ad6f939ab1` ("ip: Add offset parameter to ip_cmsg_recv") Fixes: `e6afc8ace6` ("udp: remove headers from UDP packets before queueing") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Sam Kumar <samanthakumar@google.com> Cc: Willem de Bruijn <willemb@google.com> Tested-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-10-26 17:33:22 -04:00
Paolo Abeni	850cbaddb5	udp: use it's own memory accounting schema Completely avoid default sock memory accounting and replace it with udp-specific accounting. Since the new memory accounting model encapsulates completely the required locking, remove the socket lock on both enqueue and dequeue, and avoid using the backlog on enqueue. Be sure to clean-up rx queue memory on socket destruction, using udp its own sk_destruct. Tested using pktgen with random src port, 64 bytes packet, wire-speed on a 10G link as sender and udp_sink as the receiver, using an l4 tuple rxhash to stress the contention, and one or more udp_sink instances with reuseport. nr readers Kpps (vanilla) Kpps (patched) 1 170 440 3 1250 2150 6 3000 3650 9 4200 4450 12 5700 6250 v4 -> v5: - avoid unneeded test in first_packet_length v3 -> v4: - remove useless sk_rcvqueues_full() call v2 -> v3: - do not set the now unsed backlog_rcv callback v1 -> v2: - add memory pressure support - fixed dropwatch accounting for ipv6 Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-10-22 17:05:05 -04:00
David S. Miller	6abdd5f593	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net All three conflicts were cases of simple overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>	2016-08-30 00:54:02 -04:00
Eric Dumazet	6a6ad2a4e5	ipv6: udp: remove udp_v6_clear_sk() Now RCU lookups of ipv6 udp sockets no longer dereference pinet6 field, we can get rid of udp_v6_clear_sk() helper. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-08-23 23:23:50 -07:00
David Ahern	5d77dca828	net: diag: support SOCK_DESTROY for UDP sockets This implements SOCK_DESTROY for UDP sockets similar to what was done for TCP with commit `c1e64e298b` ("net: diag: Support destroying TCP sockets.") A process with a UDP socket targeted for destroy is awakened and recvmsg fails with ECONNABORTED. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-08-23 23:12:27 -07:00
Eric Dumazet	75d855a5e9	udp: get rid of SLAB_DESTROY_BY_RCU allocations After commit `ca065d0cf8` ("udp: no longer use SLAB_DESTROY_BY_RCU") we do not need this special allocation mode anymore, even if it is harmless. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-08-23 17:46:17 -07:00
Daniel Borkmann	ba66bbe548	udp: use sk_filter_trim_cap for udp{,6}_queue_rcv_skb After `a612769774` ("udp: prevent bugcheck if filter truncates packet too much"), there followed various other fixes for similar cases such as `f4979fcea7` ("rose: limit sk_filter trim to payload"). Latter introduced a new helper sk_filter_trim_cap(), where we can pass the trim limit directly to the socket filter handling. Make use of it here as well with sizeof(struct udphdr) as lower cap limit and drop the extra skb->len test in UDP's input path. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Willem de Bruijn <willemb@google.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-25 21:40:33 -07:00
David S. Miller	de0ba9a0d8	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Just several instances of overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-24 00:53:32 -04:00
Michal Kubeček	a612769774	udp: prevent bugcheck if filter truncates packet too much If socket filter truncates an udp packet below the length of UDP header in udpv6_queue_rcv_skb() or udp_queue_rcv_skb(), it will trigger a BUG_ON in skb_pull_rcsum(). This BUG_ON (and therefore a system crash if kernel is configured that way) can be easily enforced by an unprivileged user which was reported as CVE-2016-6162. For a reproducer, see http://seclists.org/oss-sec/2016/q3/8 Fixes: `e6afc8ace6` ("udp: remove headers from UDP packets before queueing") Reported-by: Marco Grassi <marco.gra@gmail.com> Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-11 12:43:15 -07:00
David S. Miller	ee58b57100	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Several cases of overlapping changes, except the packet scheduler conflicts which deal with the addition of the free list parameter to qdisc_enqueue(). Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-30 05:03:36 -04:00
Su, Xuemin	d1e37288c9	udp reuseport: fix packet of same flow hashed to different socket There is a corner case in which udp packets belonging to a same flow are hashed to different socket when hslot->count changes from 10 to 11: 1) When hslot->count <= 10, __udp_lib_lookup() searches udp_table->hash, and always passes 'daddr' to udp_ehashfn(). 2) When hslot->count > 10, __udp_lib_lookup() searches udp_table->hash2, but may pass 'INADDR_ANY' to udp_ehashfn() if the sockets are bound to INADDR_ANY instead of some specific addr. That means when hslot->count changes from 10 to 11, the hash calculated by udp_ehashfn() is also changed, and the udp packets belonging to a same flow will be hashed to different socket. This is easily reproduced: 1) Create 10 udp sockets and bind all of them to 0.0.0.0:40000. 2) From the same host send udp packets to 127.0.0.1:40000, record the socket index which receives the packets. 3) Create 1 more udp socket and bind it to 0.0.0.0:44096. The number 44096 is 40000 + UDP_HASH_SIZE(4096), this makes the new socket put into the same hslot as the aformentioned 10 sockets, and makes the hslot->count change from 10 to 11. 4) From the same host send udp packets to 127.0.0.1:40000, and the socket index which receives the packets will be different from the one received in step 2. This should not happen as the socket bound to 0.0.0.0:44096 should not change the behavior of the sockets bound to 0.0.0.0:40000. It's the same case for IPv6, and this patch also fixes that. Signed-off-by: Su, Xuemin <suxm@chinanetcenter.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-14 17:23:09 -04:00
Hannes Frederic Sowa	38b7097b55	ipv6: use TOS marks from sockets for routing decision In IPv6 the ToS values are part of the flowlabel in flowi6 and get extracted during fib rule lookup, but we forgot to correctly initialize the flowlabel before the routing lookup. Reported-by: <liam.mcbirnie@boeing.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-11 15:33:26 -07:00
Eric Dumazet	ce25d66ad5	Possible problem with `e6afc8ac` ("udp: remove headers from UDP packets before queueing") Paul Moore tracked a regression caused by a recent commit, which mistakenly assumed that sk_filter() could be avoided if socket had no current BPF filter. The intent was to avoid udp_lib_checksum_complete() overhead. But sk_filter() also checks skb_pfmemalloc() and security_sock_rcv_skb(), so better call it. Fixes: `e6afc8ace6` ("udp: remove headers from UDP packets before queueing") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Paul Moore <paul@paul-moore.com> Tested-by: Paul Moore <paul@paul-moore.com> Tested-by: Stephen Smalley <sds@tycho.nsa.gov> Cc: samanthakumar <samanthakumar@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-02 18:29:49 -04:00
Hannes Frederic Sowa	e5aed006be	udp: prevent skbs lingering in tunnel socket queues In case we find a socket with encapsulation enabled we should call the encap_recv function even if just a udp header without payload is available. The callbacks are responsible for correctly verifying and dropping the packets. Also, in case the header validation fails for geneve and vxlan we shouldn't put the skb back into the socket queue, no one will pick them up there. Instead we can simply discard them in the respective encap_recv functions. Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-05-20 19:56:02 -04:00
Alexander Duyck	ed7cbbce54	udp: Resolve NULL pointer dereference over flow-based vxlan device While testing an OpenStack configuration using VXLANs I saw the following call trace: RIP: 0010:[<ffffffff815fad49>] udp4_lib_lookup_skb+0x49/0x80 RSP: 0018:ffff88103867bc50 EFLAGS: 00010286 RAX: ffff88103269bf00 RBX: ffff88103269bf00 RCX: 00000000ffffffff RDX: 0000000000004300 RSI: 0000000000000000 RDI: ffff880f2932e780 RBP: ffff88103867bc60 R08: 0000000000000000 R09: 000000009001a8c0 R10: 0000000000004400 R11: ffffffff81333a58 R12: ffff880f2932e794 R13: 0000000000000014 R14: 0000000000000014 R15: ffffe8efbfd89ca0 FS: 0000000000000000(0000) GS:ffff88103fd80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000488 CR3: 0000000001c06000 CR4: 00000000001426e0 Stack: ffffffff81576515 ffffffff815733c0 ffff88103867bc98 ffffffff815fcc17 ffff88103269bf00 ffffe8efbfd89ca0 0000000000000014 0000000000000080 ffffe8efbfd89ca0 ffff88103867bcc8 ffffffff815fcf8b ffff880f2932e794 Call Trace: [<ffffffff81576515>] ? skb_checksum+0x35/0x50 [<ffffffff815733c0>] ? skb_push+0x40/0x40 [<ffffffff815fcc17>] udp_gro_receive+0x57/0x130 [<ffffffff815fcf8b>] udp4_gro_receive+0x10b/0x2c0 [<ffffffff81605863>] inet_gro_receive+0x1d3/0x270 [<ffffffff81589e59>] dev_gro_receive+0x269/0x3b0 [<ffffffff8158a1b8>] napi_gro_receive+0x38/0x120 [<ffffffffa0871297>] gro_cell_poll+0x57/0x80 [vxlan] [<ffffffff815899d0>] net_rx_action+0x160/0x380 [<ffffffff816965c7>] __do_softirq+0xd7/0x2c5 [<ffffffff8107d969>] run_ksoftirqd+0x29/0x50 [<ffffffff8109a50f>] smpboot_thread_fn+0x10f/0x160 [<ffffffff8109a400>] ? sort_range+0x30/0x30 [<ffffffff81096da8>] kthread+0xd8/0xf0 [<ffffffff81693c82>] ret_from_fork+0x22/0x40 [<ffffffff81096cd0>] ? kthread_park+0x60/0x60 The following trace is seen when receiving a DHCP request over a flow-based VXLAN tunnel. I believe this is caused by the metadata dst having a NULL dev value and as a result dev_net(dev) is causing a NULL pointer dereference. To resolve this I am replacing the check for skb_dst(skb)->dev with just skb->dev. This makes sense as the callers of this function are usually in the receive path and as such skb->dev should always be populated. In addition other functions in the area where these are called are already using dev_net(skb->dev) to determine the namespace the UDP packet belongs in. Fixes: `63058308cd` ("udp: Add udp6_lib_lookup_skb and udp4_lib_lookup_skb") Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-05-13 01:56:14 -04:00
Wei Wang	26879da587	ipv6: add new struct ipcm6_cookie In the sendmsg function of UDP, raw, ICMP and l2tp sockets, we use local variables like hlimits, tclass, opt and dontfrag and pass them to corresponding functions like ip6_make_skb, ip6_append_data and xxx_push_pending_frames. This is not a good practice and makes it hard to add new parameters. This fix introduces a new struct ipcm6_cookie similar to ipcm_cookie in ipv4 and include the above mentioned variables. And we only pass the pointer to this structure to corresponding functions. This makes it easier to add new parameters in the future and makes the function cleaner. Signed-off-by: Wei Wang <weiwan@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-05-03 16:08:14 -04:00
Eric Dumazet	e61da9e259	udp: prepare for non BH masking at backlog processing UDP uses the generic socket backlog code, and this will soon be changed to not disable BH when protocol is called back. We need to use appropriate SNMP accessors. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-05-02 17:02:25 -04:00
Eric Dumazet	a16292a0f0	net: rename ICMP6_INC_STATS_BH() Rename ICMP6_INC_STATS_BH() to __ICMP6_INC_STATS() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-27 22:48:24 -04:00
Eric Dumazet	02c223470c	net: udp: rename UDP_INC_STATS_BH() Rename UDP_INC_STATS_BH() to __UDP_INC_STATS(), and UDP6_INC_STATS_BH() to __UDP6_INC_STATS() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-27 22:48:23 -04:00
Eric Dumazet	6aef70a851	net: snmp: kill various STATS_USER() helpers In the old days (before linux-3.0), SNMP counters were duplicated, one for user context, and one for BH context. After commit `8f0ea0fe3a` ("snmp: reduce percpu needs by 50%") we have a single copy, and what really matters is preemption being enabled or disabled, since we use this_cpu_inc() or __this_cpu_inc() respectively. We therefore kill SNMP_INC_STATS_USER(), SNMP_ADD_STATS_USER(), NET_INC_STATS_USER(), NET_ADD_STATS_USER(), SCTP_INC_STATS_USER(), SNMP_INC_STATS64_USER(), SNMP_ADD_STATS64_USER(), TCP_ADD_STATS_USER(), UDP_INC_STATS_USER(), UDP6_INC_STATS_USER(), and XFRM_INC_STATS_USER() Following patches will rename __BH helpers to make clear their usage is not tied to BH being disabled. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-27 22:48:22 -04:00
David S. Miller	1602f49b58	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts were two cases of simple overlapping changes, nothing serious. In the UDP case, we need to add a hlist_add_tail_rcu() to linux/rculist.h, because we've moved UDP socket handling away from using nulls lists. Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-23 18:51:33 -04:00
Martin KaFai Lau	e646b657f6	ipv6: udp: Do a route lookup and update during release_cb This patch adds a release_cb for UDPv6. It does a route lookup and updates sk->sk_dst_cache if it is needed. It picks up the left-over job from ip6_sk_update_pmtu() if the sk was owned by user during the pmtu update. It takes a rcu_read_lock to protect the __sk_dst_get() operations because another thread may do ip6_dst_store() without taking the sk lock (e.g. sendmsg). Fixes: `45e4fd2668` ("ipv6: Only create RTF_CACHE routes after encountering pmtu exception") Signed-off-by: Martin KaFai Lau <kafai@fb.com> Reported-by: Wei Wang <weiwan@google.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Wei Wang <weiwan@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-14 16:29:53 -04:00
Tom Herbert	63058308cd	udp: Add udp6_lib_lookup_skb and udp4_lib_lookup_skb Add externally visible functions to lookup a UDP socket by skb. This will be used for GRO in UDP sockets. These functions also check if skb->dst is set, and if it is not skb->dev is used to get dev_net. This allows calling lookup functions before dst has been set on the skbuff. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-07 16:53:14 -04:00
samanthakumar	627d2d6b55	udp: enable MSG_PEEK at non-zero offset Enable peeking at UDP datagrams at the offset specified with socket option SOL_SOCKET/SO_PEEK_OFF. Peek at any datagram in the queue, up to the end of the given datagram. Implement the SO_PEEK_OFF semantics introduced in commit `ef64a54f6e` ("sock: Introduce the SO_PEEK_OFF sock option"). Increase the offset on peek, decrease it on regular reads. When peeking, always checksum the packet immediately, to avoid recomputation on subsequent peeks and final read. The socket lock is not held for the duration of udp_recvmsg, so peek and read operations can run concurrently. Only the last store to sk_peek_off is preserved. Signed-off-by: Sam Kumar <samanthakumar@google.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-05 16:29:37 -04:00
samanthakumar	e6afc8ace6	udp: remove headers from UDP packets before queueing Remove UDP transport headers before queueing packets for reception. This change simplifies a follow-up patch to add MSG_PEEK support. Signed-off-by: Sam Kumar <samanthakumar@google.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-05 16:29:37 -04:00
Eric Dumazet	ca065d0cf8	udp: no longer use SLAB_DESTROY_BY_RCU Tom Herbert would like not touching UDP socket refcnt for encapsulated traffic. For this to happen, we need to use normal RCU rules, with a grace period before freeing a socket. UDP sockets are not short lived in the high usage case, so the added cost of call_rcu() should not be a concern. This actually removes a lot of complexity in UDP stack. Multicast receives no longer need to hold a bucket spinlock. Note that ip early demux still needs to take a reference on the socket. Same remark for functions used by xt_socket and xt_PROXY netfilter modules, but this might be changed later. Performance for a single UDP socket receiving flood traffic from many RX queues/cpus. Simple udp_rx using simple recvfrom() loop : 438 kpps instead of 374 kpps : 17 % increase of the peak rate. v2: Addressed Willem de Bruijn feedback in multicast handling - keep early demux break in __udp4_lib_demux_lookup() Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Tom Herbert <tom@herbertland.com> Cc: Willem de Bruijn <willemb@google.com> Tested-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-04 22:11:19 -04:00
Soheil Hassas Yeganeh	c14ac9451c	sock: enable timestamping using control messages Currently, SOL_TIMESTAMPING can only be enabled using setsockopt. This is very costly when users want to sample writes to gather tx timestamps. Add support for enabling SO_TIMESTAMPING via control messages by using tsflags added in `struct sockcm_cookie` (added in the previous patches in this series) to set the tx_flags of the last skb created in a sendmsg. With this patch, the timestamp recording bits in tx_flags of the skbuff is overridden if SO_TIMESTAMPING is passed in a cmsg. Please note that this is only effective for overriding the recording timestamps flags. Users should enable timestamp reporting (e.g., SOF_TIMESTAMPING_SOFTWARE \| SOF_TIMESTAMPING_OPT_ID) using socket options and then should ask for SOF_TIMESTAMPING_TX_* using control messages per sendmsg to sample timestamps for each write. Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-04 15:50:30 -04:00
Soheil Hassas Yeganeh	ad1e46a837	ipv6: process socket-level control messages in IPv6 Process socket-level control messages by invoking __sock_cmsg_send in ip6_datagram_send_ctl for control messages on the SOL_SOCKET layer. This makes sure whenever ip6_datagram_send_ctl is called for udp and raw, we also process socket-level control messages. This is a bit uglier than IPv4, since IPv6 does not have something like ipcm_cookie. Perhaps we can later create a control message cookie for IPv6? Note that this commit interprets new control messages that were ignored before. As such, this commit does not change the behavior of IPv6 control messages. Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-04 15:50:30 -04:00
Eric Dumazet	2d4212261f	ipv6: udp: fix UDP_MIB_IGNOREDMULTI updates IPv6 counters updates use a different macro than IPv4. Fixes: `36cbb2452c` ("udp: Increment UDP_MIB_IGNOREDMULTI for arriving unmatched multicasts") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Rick Jones <rick.jones2@hp.com> Cc: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-03-30 19:01:33 -04:00
David S. Miller	810813c47a	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Several cases of overlapping changes, as well as one instance (vxlan) of a bug fix in 'net' overlapping with code movement in 'net-next'. Signed-off-by: David S. Miller <davem@davemloft.net>	2016-03-08 12:34:12 -05:00
Bill Sommerfeld	59dca1d8a6	udp6: fix UDP/IPv6 encap resubmit path IPv4 interprets a negative return value from a protocol handler as a request to redispatch to a new protocol. In contrast, IPv6 interprets a negative value as an error, and interprets a positive value as a request for redispatch. UDP for IPv6 was unaware of this difference. Change __udp6_lib_rcv() to return a positive value for redispatch. Note that the socket's encap_rcv hook still needs to return a negative value to request dispatch, and in the case of IPv6 packets, adjust IP6CB(skb)->nhoff to identify the byte containing the next protocol. Signed-off-by: Bill Sommerfeld <wsommerfeld@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-03-07 15:23:12 -05:00
Wei Wang	e0d8c1b738	ipv6: pass up EMSGSIZE msg for UDP socket in Ipv6 In ipv4, when the machine receives a ICMP_FRAG_NEEDED message, the connected UDP socket will get EMSGSIZE message on its next read from the socket. However, this is not the case for ipv6. This fix modifies the udp err handler in Ipv6 for ICMP6_PKT_TOOBIG to make it similar to ipv4 behavior. That is when the machine gets an ICMP6_PKT_TOOBIG message, the connected UDP socket will get EMSGSIZE message on its next read from the socket. Signed-off-by: Wei Wang <weiwan@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-19 15:46:24 -05:00
Craig Gallek	496611d7b5	inet: create IPv6-equivalent inet_hash function In order to support fast lookups for TCP sockets with SO_REUSEPORT, the function that adds sockets to the listening hash set needs to be able to check receive address equality. Since this equality check is different for IPv4 and IPv6, we will need two different socket hashing functions. This patch adds inet6_hash identical to the existing inet_hash function and updates the appropriate references. A following patch will differentiate the two by passing different comparison functions to __inet_hash. Additionally, in order to use the IPv6 address equality function from inet6_hashtables (which is compiled as a built-in object when IPv6 is enabled) it also needs to be in a built-in object file as well. This moves ipv6_rcv_saddr_equal into inet_hashtables to accomplish this. Signed-off-by: Craig Gallek <kraig@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-02-11 03:54:14 -05:00
Eric Dumazet	ed0dfffd7d	udp: fix potential infinite loop in SO_REUSEPORT logic Using a combination of connected and un-connected sockets, Dmitry was able to trigger soft lockups with his fuzzer. The problem is that sockets in the SO_REUSEPORT array might have different scores. Right after sk2=socket(), setsockopt(sk2,...,SO_REUSEPORT, on) and bind(sk2, ...), but _before_ the connect(sk2) is done, sk2 is added into the soreuseport array, with a score which is smaller than the score of first socket sk1 found in hash table (I am speaking of the regular UDP hash table), if sk1 had the connect() done, giving a +8 to its score. hash bucket [X] -> sk1 -> sk2 -> NULL sk1 score = 14 (because it did a connect()) sk2 score = 6 SO_REUSEPORT fast selection is an optimization. If it turns out the score of the selected socket does not match score of first socket, just fallback to old SO_REUSEPORT logic instead of trying to be too smart. Normal SO_REUSEPORT users do not mix different kind of sockets, as this mechanism is used for load balance traffic. Fixes: `e32ea7e747` ("soreuseport: fast reuseport UDP socket selection") Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Craig Gallek <kraigatgoog@gmail.com> Acked-by: Craig Gallek <kraig@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-01-19 13:52:25 -05:00
Craig Gallek	1134158ba3	soreuseport: pass skb to secondary UDP socket lookup This socket-lookup path did not pass along the skb in question in my original BPF-based socket selection patch. The skb in the udpN_lib_lookup2 path can be used for BPF-based socket selection just like it is in the 'traditional' udpN_lib_lookup path. udpN_lib_lookup2 kicks in when there are greater than 10 sockets in the same hlist slot. Coincidentally, I chose 10 sockets per reuseport group in my functional test, so the lookup2 path was not excersised. This adds an additional set of tests with 20 sockets. Fixes: `538950a1b7` ("soreuseport: setsockopt SO_ATTACH_REUSEPORT_[CE]BPF") Fixes: `3ca8e40299` ("soreuseport: BPF selection functional test") Suggested-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Craig Gallek <kraig@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-01-06 01:28:04 -05:00
Craig Gallek	538950a1b7	soreuseport: setsockopt SO_ATTACH_REUSEPORT_[CE]BPF Expose socket options for setting a classic or extended BPF program for use when selecting sockets in an SO_REUSEPORT group. These options can be used on the first socket to belong to a group before bind or on any socket in the group after bind. This change includes refactoring of the existing sk_filter code to allow reuse of the existing BPF filter validation checks. Signed-off-by: Craig Gallek <kraig@google.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-01-04 22:49:59 -05:00
Craig Gallek	e32ea7e747	soreuseport: fast reuseport UDP socket selection Include a struct sock_reuseport instance when a UDP socket binds to a specific address for the first time with the reuseport flag set. When selecting a socket for an incoming UDP packet, use the information available in sock_reuseport if present. This required adding an additional field to the UDP source address equality function to differentiate between exact and wildcard matches. The original use case allowed wildcard matches when checking for existing port uses during bind. The new use case of adding a socket to a reuseport group requires exact address matching. Performance test (using a machine with 2 CPU sockets and a total of 48 cores): Create reuseport groups of varying size. Use one socket from this group per user thread (pinning each thread to a different core) calling recvmmsg in a tight loop. Record number of messages received per second while saturating a 10G link. 10 sockets: 18% increase (~2.8M -> 3.3M pkts/s) 20 sockets: 14% increase (~2.9M -> 3.3M pkts/s) 40 sockets: 13% increase (~3.0M -> 3.4M pkts/s) This work is based off a similar implementation written by Ying Cai <ycai@google.com> for implementing policy-based reuseport selection. Signed-off-by: Craig Gallek <kraig@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-01-04 22:49:58 -05:00
Eric Dumazet	197c949e77	udp: properly support MSG_PEEK with truncated buffers Backport of this upstream commit into stable kernels : `89c22d8c3b` ("net: Fix skb csum races when peeking") exposed a bug in udp stack vs MSG_PEEK support, when user provides a buffer smaller than skb payload. In this case, skb_copy_and_csum_datagram_iovec(skb, sizeof(struct udphdr), msg->msg_iov); returns -EFAULT. This bug does not happen in upstream kernels since Al Viro did a great job to replace this into : skb_copy_and_csum_datagram_msg(skb, sizeof(struct udphdr), msg); This variant is safe vs short buffers. For the time being, instead reverting Herbert Xu patch and add back skb->ip_summed invalid changes, simply store the result of udp_lib_checksum_complete() so that we avoid computing the checksum a second time, and avoid the problematic skb_copy_and_csum_datagram_iovec() call. This patch can be applied on recent kernels as it avoids a double checksumming, then backported to stable kernels as a bug fix. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-01-04 17:23:36 -05:00
Eric Dumazet	45f6fad84c	ipv6: add complete rcu protection around np->opt This patch addresses multiple problems : UDP/RAW sendmsg() need to get a stable struct ipv6_txoptions while socket is not locked : Other threads can change np->opt concurrently. Dmitry posted a syzkaller (http://github.com/google/syzkaller) program desmonstrating use-after-free. Starting with TCP/DCCP lockless listeners, tcp_v6_syn_recv_sock() and dccp_v6_request_recv_sock() also need to use RCU protection to dereference np->opt once (before calling ipv6_dup_options()) This patch adds full RCU protection to np->opt Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-12-02 23:37:16 -05:00
Eric Dumazet	70da268b56	net: SO_INCOMING_CPU setsockopt() support SO_INCOMING_CPU as added in commit `2c8c56e15d` was a getsockopt() command to fetch incoming cpu handling a particular TCP flow after accept() This commits adds setsockopt() support and extends SO_REUSEPORT selection logic : If a TCP listener or UDP socket has this option set, a packet is delivered to this socket only if CPU handling the packet matches the specified one. This allows to build very efficient TCP servers, using one listener per RX queue, as the associated TCP listener should only accept flows handled in softirq by the same cpu. This provides optimal NUMA behavior and keep cpu caches hot. Note that __inet_lookup_listener() still has to iterate over the list of all listeners. Following patch puts sk_refcnt in a different cache line to let this iteration hit only shared and read mostly cache lines. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-10-12 19:28:20 -07:00
Ian Morris	ec120da6f0	ipv6: trivial whitespace fix Change brace placement to be in line with coding standards Signed-off-by: Ian Morris <ipm@chirality.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-08-17 14:34:48 -07:00
Eric Dumazet	beb39db59d	udp: fix behavior of wrong checksums We have two problems in UDP stack related to bogus checksums : 1) We return -EAGAIN to application even if receive queue is not empty. This breaks applications using edge trigger epoll() 2) Under UDP flood, we can loop forever without yielding to other processes, potentially hanging the host, especially on non SMP. This patch is an attempt to make things better. We might in the future add extra support for rt applications wanting to better control time spent doing a recv() in a hostile environment. For example we could validate checksums before queuing packets in socket receive queue. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-31 21:42:18 -07:00
Henning Rogge	33b4b015e1	net/ipv6/udp: Fix ipv6 multicast socket filter regression Commit <5cf3d46192fc> ("udp: Simplify__udp_lib_mcast_deliver") simplified the filter for incoming IPv6 multicast but removed the check of the local socket address and the UDP destination address. This patch restores the filter to prevent sockets bound to a IPv6 multicast IP to receive other UDP traffic link unicast. Signed-off-by: Henning Rogge <hrogge@gmail.com> Fixes: `5cf3d46192` ("udp: Simplify__udp_lib_mcast_deliver") Cc: "David S. Miller" <davem@davemloft.net> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-19 16:34:43 -04:00
Sheng Yong	8bc0034cf6	net: remove extra newlines Signed-off-by: Sheng Yong <shengyong1@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-04-07 22:24:37 -04:00
Ian Morris	53b24b8f94	ipv6: coding style: comparison for inequality with NULL The ipv6 code uses a mixture of coding styles. In some instances check for NULL pointer is done as x != NULL and sometimes as x. x is preferred according to checkpatch and this patch makes the code consistent by adopting the latter form. No changes detected by objdiff. Signed-off-by: Ian Morris <ipm@chirality.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-31 13:51:54 -04:00
Ian Morris	63159f29be	ipv6: coding style: comparison for equality with NULL The ipv6 code uses a mixture of coding styles. In some instances check for NULL pointer is done as x == NULL and sometimes as !x. !x is preferred according to checkpatch and this patch makes the code consistent by adopting the latter form. No changes detected by objdiff. Signed-off-by: Ian Morris <ipm@chirality.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-31 13:51:54 -04:00
Eric Dumazet	6eada0110c	netns: constify net_hash_mix() and various callers const qualifiers ease code review by making clear which objects are not written in a function. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-18 22:00:34 -04:00
Ying Xue	1b78414047	net: Remove iocb argument from sendmsg and recvmsg After TIPC doesn't depend on iocb argument in its internal implementations of sendmsg() and recvmsg() hooks defined in proto structure, no any user is using iocb argument in them at all now. Then we can drop the redundant iocb argument completely from kinds of implementations of both sendmsg() and recvmsg() in the entire networking stack. Cc: Christoph Hellwig <hch@lst.de> Suggested-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-02 13:06:31 -05:00
Vlad Yasevich	03485f2adc	udpv6: Add lockless sendmsg() support This commit adds the same functionaliy to IPv6 that commit `903ab86d19` Author: Herbert Xu <herbert@gondor.apana.org.au> Date: Tue Mar 1 02:36:48 2011 +0000 udp: Add lockless transmit path added to IPv4. UDP transmit path can now run without a socket lock, thus allowing multiple threads to send to a single socket more efficiently. This is only used when corking/MSG_MORE is not used. Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-02-02 19:28:04 -08:00
Vlad Yasevich	d39d938c82	ipv6: Introduce udpv6_send_skb() Now that we can individually construct IPv6 skbs to send, add a udpv6_send_skb() function to populate the udp header and send the skb. This allows udp_v6_push_pending_frames() to re-use this function as well as enables us to add lockless sendmsg() support. Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-02-02 19:28:04 -08:00
Tom Herbert	224d019c4f	ip: Move checksum convert defines to inet Move convert_csum from udp_sock to inet_sock. This allows the possibility that we can use convert checksum for different types of sockets and also allows convert checksum to be enabled from inet layer (what we'll want to do when enabling IP_CHECKSUM cmsg). Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-01-05 22:44:46 -05:00
Al Viro	f69e6d131f	ip_generic_getfrag, udplite_getfrag: switch to passing msghdr Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-12-09 16:28:22 -05:00
Joe Perches	60c04aecd8	udp: Neaten and reduce size of compute_score functions The compute_score functions are a bit difficult to read. Neaten them a bit to reduce object sizes and make them a bit more intelligible. Return early to avoid indentation and avoid unnecessary initializations. (allyesconfig, but w/ -O2 and no profiling) $ size net/ipv[46]/udp.o.* text data bss dec hex filename 28680 1184 25 29889 74c1 net/ipv4/udp.o.new 28756 1184 25 29965 750d net/ipv4/udp.o.old 17600 1010 2 18612 48b4 net/ipv6/udp.o.new 17632 1010 2 18644 48d4 net/ipv6/udp.o.old Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-12-08 20:28:47 -05:00
Al Viro	227158db16	new helper: skb_copy_and_csum_datagram_msg() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-11-24 04:28:44 -05:00
Ian Morris	e5d08d718a	ipv6: coding style improvements (remove assignment in if statements) This change has no functional impact and simply addresses some coding style issues detected by checkpatch. Specifically this change adjusts "if" statements which also include the assignment of a variable. No changes to the resultant object files result as determined by objdiff. Signed-off-by: Ian Morris <ipm@chirality.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-11-23 21:00:56 -05:00
Joe Perches	ba7a46f16d	net: Convert LIMIT_NETDEBUG to net_dbg_ratelimited Use the more common dynamic_debug capable net_dbg_ratelimited and remove the LIMIT_NETDEBUG macro. All messages are still ratelimited. Some KERN_<LEVEL> uses are changed to KERN_DEBUG. This may have some negative impact on messages that were emitted at KERN_INFO that are not not enabled at all unless DEBUG is defined or dynamic_debug is enabled. Even so, these messages are now _not_ emitted by default. This also eliminates the use of the net_msg_warn sysctl "/proc/sys/net/core/warnings". For backward compatibility, the sysctl is not removed, but it has no function. The extern declaration of net_msg_warn is removed from sock.h and made static in net/core/sysctl_net_core.c Miscellanea: o Update the sysctl documentation o Remove the embedded uses of pr_fmt o Coalesce format fragments o Realign arguments Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-11-11 14:10:31 -05:00
Eric Dumazet	2c8c56e15d	net: introduce SO_INCOMING_CPU Alternative to RPS/RFS is to use hardware support for multiple queues. Then split a set of million of sockets into worker threads, each one using epoll() to manage events on its own socket pool. Ideally, we want one thread per RX/TX queue/cpu, but we have no way to know after accept() or connect() on which queue/cpu a socket is managed. We normally use one cpu per RX queue (IRQ smp_affinity being properly set), so remembering on socket structure which cpu delivered last packet is enough to solve the problem. After accept(), connect(), or even file descriptor passing around processes, applications can use : int cpu; socklen_t len = sizeof(cpu); getsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, &len); And use this information to put the socket into the right silo for optimal performance, as all networking stack should run on the appropriate cpu, without need to send IPI (RPS/RFS). Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-11-11 13:00:06 -05:00
Rick Jones	36cbb2452c	udp: Increment UDP_MIB_IGNOREDMULTI for arriving unmatched multicasts As NIC multicast filtering isn't perfect, and some platforms are quite content to spew broadcasts, we should not trigger an event for skb:kfree_skb when we do not have a match for such an incoming datagram. We do though want to avoid sweeping the matter under the rug entirely, so increment a suitable statistic. This incorporates feedback from David L. Stevens, Karl Neiss and Eric Dumazet. V3 - use bool per David Miller Signed-off-by: Rick Jones <rick.jones2@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-11-07 15:45:50 -05:00
David S. Miller	51f3d02b98	net: Add and use skb_copy_datagram_msg() helper. This encapsulates all of the skb_copy_datagram_iovec() callers with call argument signature "skb, offset, msghdr->msg_iov, length". When we move to iov_iters in the networking, the iov_iter object will sit in the msghdr. Having a helper like this means there will be less places to touch during that transformation. Based upon descriptions and patch from Al Viro. Signed-off-by: David S. Miller <davem@davemloft.net>	2014-11-05 16:46:40 -05:00
Tom Herbert	2abb7cdc0d	udp: Add support for doing checksum unnecessary conversion Add support for doing CHECKSUM_UNNECESSARY to CHECKSUM_COMPLETE conversion in UDP tunneling path. In the normal UDP path, we call skb_checksum_try_convert after locating the UDP socket. The check is that checksum conversion is enabled for the socket (new flag in UDP socket) and that checksum field is non-zero. In the UDP GRO path, we call skb_gro_checksum_try_convert after checksum is validated and checksum field is non-zero. Since this is already in GRO we assume that checksum conversion is always wanted. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-09-01 21:36:28 -07:00
Ian Morris	67ba4152e8	ipv6: White-space cleansing : Line Layouts This patch makes no changes to the logic of the code but simply addresses coding style issues as detected by checkpatch. Both objdump and diff -w show no differences. A number of items are addressed in this patch: * Multiple spaces converted to tabs * Spaces before tabs removed. * Spaces in pointer typing cleansed (char )foo etc. Remove space after sizeof * Ensure spacing around comparators such as if statements. Signed-off-by: Ian Morris <ipm@chirality.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-08-24 22:37:52 -07:00
Daniel Borkmann	8fc54f6891	net: use reciprocal_scale() helper Replace open codings of (((u64) <x> * <y>) >> 32) with reciprocal_scale(). Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-08-23 12:21:21 -07:00
Duan Jiong	4330487acf	net: use inet6_iif instead of IP6CB()->iif Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-07-31 22:37:06 -07:00
Duan Jiong	7304fe4681	net: fix the counter ICMP_MIB_INERRORS/ICMP6_MIB_INERRORS When dealing with ICMPv[46] Error Message, function icmp_socket_deliver() and icmpv6_notify() do some valid checks on packet's length, but then some protocols check packet's length redaudantly. So remove those duplicated statements, and increase counter ICMP_MIB_INERRORS/ICMP6_MIB_INERRORS in function icmp_socket_deliver() and icmpv6_notify() respectively. In addition, add missed counter in udp6/udplite6 when socket is NULL. Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-07-31 22:04:18 -07:00
Sorin Dumitru	274f482d33	sock: remove skb argument from sk_rcvqueues_full It hasn't been used since commit 0fd7bac(net: relax rcvbuf limits). Signed-off-by: Sorin Dumitru <sorin@returnze.ro> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-07-23 13:23:06 -07:00
David Held	2dc41cff75	udp: Use hash2 for long hash1 chains in __udp*_lib_mcast_deliver. Many multicast sources can have the same port which can result in a very large list when hashing by port only. Hash by address and port instead if this is the case. This makes multicast more similar to unicast. On a 24-core machine receiving from 500 multicast sockets on the same port, before this patch 80% of system CPU was used up by spin locking and only ~25% of packets were successfully delivered. With this patch, all packets are delivered and kernel overhead is ~8% system CPU on spinlocks. Signed-off-by: David Held <drheld@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-07-16 23:29:52 -07:00
David Held	5cf3d46192	udp: Simplify __udp*_lib_mcast_deliver. Switch to using sk_nulls_for_each which shortens the code and makes it easier to update. Signed-off-by: David Held <drheld@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-07-16 23:29:52 -07:00
David S. Miller	1a98c69af1	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Signed-off-by: David S. Miller <davem@davemloft.net>	2014-07-16 14:09:34 -07:00
Eric Dumazet	9fe516ba3f	inet: move ipv6only in sock_common When an UDP application switches from AF_INET to AF_INET6 sockets, we have a small performance degradation for IPv4 communications because of extra cache line misses to access ipv6only information. This can also be noticed for TCP listeners, as ipv6_only_sock() is also used from __inet_lookup_listener()->compute_score() This is magnified when SO_REUSEPORT is used. Move ipv6only into struct sock_common so that it is available at no extra cost in lookups. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-07-01 23:46:21 -07:00

1 2 3 4 5 ...

403 Commits