Commit Graph

1052 Commits

Author SHA1 Message Date
hannes@stressinduktion.org c91f81773c ipv6: protect skb->sk accesses from recursive dereference inside the stack
[ Upstream commit f60e5990d9 ]

We should not consult skb->sk for output decisions in xmit recursion
levels > 0 in the stack. Otherwise local socket settings could influence
the result of e.g. tunnel encapsulation process.

ipv6 does not conform with this in three places:

1) ip6_fragment: we do consult ipv6_npinfo for frag_size

2) sk_mc_loop in ipv6 uses skb->sk and checks if we should
   loop the packet back to the local socket

3) ip6_skb_dst_mtu could query the settings from the user socket and
   force a wrong MTU

Furthermore:
In sk_mc_loop we could potentially land in WARN_ON(1) if we use a
PF_PACKET socket ontop of an IPv6-backed vxlan device.

Reuse xmit_recursion as we are currently only interested in protecting
tunnel devices.

Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-04-29 10:31:46 +02:00
Matthew Thode f0bde5010b net: reject creation of netdev names with colons
[ Upstream commit a4176a9391 ]

colons are used as a separator in netdev device lookup in dev_ioctl.c

Specific functions are SIOCGIFTXQLEN SIOCETHTOOL SIOCSIFNAME

Signed-off-by: Matthew Thode <mthode@mthode.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-03-18 13:31:22 +01:00
Eric Dumazet 64d76f569a net: rps: fix cpu unplug
[ Upstream commit ac64da0b83 ]

softnet_data.input_pkt_queue is protected by a spinlock that
we must hold when transferring packets from victim queue to an active
one. This is because other cpus could still be trying to enqueue packets
into victim queue.

A second problem is that when we transfert the NAPI poll_list from
victim to current cpu, we absolutely need to special case the percpu
backlog, because we do not want to add complex locking to protect
process_queue : Only owner cpu is allowed to manipulate it, unless cpu
is offline.

Based on initial patch from Prasad Sodagudi & Subash Abhinov
Kasiviswanathan.

This version is better because we do not slow down packet processing,
only make migration safer.

Reported-by: Prasad Sodagudi <psodagud@codeaurora.org>
Reported-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-02-26 17:50:10 -08:00
Alexander Y. Fomichev 7cc4382590 net: fix creation adjacent device symlinks
commit 7ce64c79c4 upstream.

__netdev_adjacent_dev_insert may add adjust device of different net
namespace, without proper check it leads to emergence of broken
sysfs links from/to devices in another namespace.
Fix: rewrite netdev_adjacent_is_neigh_list macro as a function,
     move net_eq check into netdev_adjacent_is_neigh_list.
     (thanks David)
     related to: 4c75431ac3

Signed-off-by: Alexander Fomichev <git.user@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cc: Miquel van Smoorenburg <mikevs@xs4all.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-27 08:18:59 -08:00
Alexander Y. Fomichev 565318483e net: prevent of emerging cross-namespace symlinks
commit 4c75431ac3 upstream.

Code manipulating sysfs symlinks on adjacent net_devices(s)
currently doesn't take into account that devices potentially
belong to different namespaces.

This patch trying to fix an issue as follows:
- check for net_ns before creating / deleting symlink.
  for now only netdev_adjacent_rename_links and
  __netdev_adjacent_dev_remove are affected, afaics
  __netdev_adjacent_dev_insert implies both net_devs
  belong to the same namespace.
- Drop all existing symlinks to / from all adj_devs before
  switching namespace and recreate them just after.

Signed-off-by: Alexander Y. Fomichev <git.user@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cc: Miquel van Smoorenburg <mikevs@xs4all.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-27 08:18:59 -08:00
Jay Vosburgh a81b0a4f94 net/core: Handle csum for CHECKSUM_COMPLETE VXLAN forwarding
[ Upstream commit 2c26d34bbc ]

When using VXLAN tunnels and a sky2 device, I have experienced
checksum failures of the following type:

[ 4297.761899] eth0: hw csum failure
[...]
[ 4297.765223] Call Trace:
[ 4297.765224]  <IRQ>  [<ffffffff8172f026>] dump_stack+0x46/0x58
[ 4297.765235]  [<ffffffff8162ba52>] netdev_rx_csum_fault+0x42/0x50
[ 4297.765238]  [<ffffffff8161c1a0>] ? skb_push+0x40/0x40
[ 4297.765240]  [<ffffffff8162325c>] __skb_checksum_complete+0xbc/0xd0
[ 4297.765243]  [<ffffffff8168c602>] tcp_v4_rcv+0x2e2/0x950
[ 4297.765246]  [<ffffffff81666ca0>] ? ip_rcv_finish+0x360/0x360

	These are reliably reproduced in a network topology of:

container:eth0 == host(OVS VXLAN on VLAN) == bond0 == eth0 (sky2) -> switch

	When VXLAN encapsulated traffic is received from a similarly
configured peer, the above warning is generated in the receive
processing of the encapsulated packet.  Note that the warning is
associated with the container eth0.

        The skbs from sky2 have ip_summed set to CHECKSUM_COMPLETE, and
because the packet is an encapsulated Ethernet frame, the checksum
generated by the hardware includes the inner protocol and Ethernet
headers.

	The receive code is careful to update the skb->csum, except in
__dev_forward_skb, as called by dev_forward_skb.  __dev_forward_skb
calls eth_type_trans, which in turn calls skb_pull_inline(skb, ETH_HLEN)
to skip over the Ethernet header, but does not update skb->csum when
doing so.

	This patch resolves the problem by adding a call to
skb_postpull_rcsum to update the skb->csum after the call to
eth_type_trans.

Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-27 08:18:53 -08:00
Toshiaki Makita 72111ffa19 net: Fix stacked vlan offload features computation
[ Upstream commit 796f2da81b ]

When vlan tags are stacked, it is very likely that the outer tag is stored
in skb->vlan_tci and skb->protocol shows the inner tag's vlan_proto.
Currently netif_skb_features() first looks at skb->protocol even if there
is the outer tag in vlan_tci, thus it incorrectly retrieves the protocol
encapsulated by the inner vlan instead of the inner vlan protocol.
This allows GSO packets to be passed to HW and they end up being
corrupted.

Fixes: 58e998c6d2 ("offloading: Force software GSO for multiple vlan tags.")
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-27 08:18:53 -08:00
Vlad Yasevich ff81e63f64 net: Always untag vlan-tagged traffic on input.
[ Upstream commit 0d5501c1c8 ]

Currently the functionality to untag traffic on input resides
as part of the vlan module and is build only when VLAN support
is enabled in the kernel.  When VLAN is disabled, the function
vlan_untag() turns into a stub and doesn't really untag the
packets.  This seems to create an interesting interaction
between VMs supporting checksum offloading and some network drivers.

There are some drivers that do not allow the user to change
tx-vlan-offload feature of the driver.  These drivers also seem
to assume that any VLAN-tagged traffic they transmit will
have the vlan information in the vlan_tci and not in the vlan
header already in the skb.  When transmitting skbs that already
have tagged data with partial checksum set, the checksum doesn't
appear to be updated correctly by the card thus resulting in a
failure to establish TCP connections.

The following is a packet trace taken on the receiver where a
sender is a VM with a VLAN configued.  The host VM is running on
doest not have VLAN support and the outging interface on the
host is tg3:
10:12:43.503055 52:54:00:ae:42:3f > 28:d2:44:7d:c2:de, ethertype 802.1Q
(0x8100), length 78: vlan 100, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 27243,
offset 0, flags [DF], proto TCP (6), length 60)
    10.0.100.1.58545 > 10.0.100.10.ircu-2: Flags [S], cksum 0xdc39 (incorrect
-> 0x48d9), seq 1069378582, win 29200, options [mss 1460,sackOK,TS val
4294837885 ecr 0,nop,wscale 7], length 0
10:12:44.505556 52:54:00:ae:42:3f > 28:d2:44:7d:c2:de, ethertype 802.1Q
(0x8100), length 78: vlan 100, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 27244,
offset 0, flags [DF], proto TCP (6), length 60)
    10.0.100.1.58545 > 10.0.100.10.ircu-2: Flags [S], cksum 0xdc39 (incorrect
-> 0x44ee), seq 1069378582, win 29200, options [mss 1460,sackOK,TS val
4294838888 ecr 0,nop,wscale 7], length 0

This connection finally times out.

I've only access to the TG3 hardware in this configuration thus have
only tested this with TG3 driver.  There are a lot of other drivers
that do not permit user changes to vlan acceleration features, and
I don't know if they all suffere from a similar issue.

The patch attempt to fix this another way.  It moves the vlan header
stipping code out of the vlan module and always builds it into the
kernel network core.  This way, even if vlan is not supported on
a virtualizatoin host, the virtual machines running on top of such
host will still work with VLANs enabled.

CC: Patrick McHardy <kaber@trash.net>
CC: Nithin Nayak Sujir <nsujir@broadcom.com>
CC: Michael Chan <mchan@broadcom.com>
CC: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-10-15 08:36:40 +02:00
Jerry Chu 9b6cdedfa3 net-gre-gro: Fix a bug that breaks the forwarding path
[ Upstream commit c3caf1192f ]

Fixed a bug that was introduced by my GRE-GRO patch
(bf5a755f5e net-gre-gro: Add GRE
support to the GRO stack) that breaks the forwarding path
because various GSO related fields were not set. The bug will
cause on the egress path either the GSO code to fail, or a
GRE-TSO capable (NETIF_F_GSO_GRE) NICs to choke. The following
fix has been tested for both cases.

Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-28 08:06:01 -07:00
Loic Prylli 3dc0c70ff2 net: Fix NETDEV_CHANGE notifier usage causing spurious arp flush
[ Upstream commit 5495119465 ]

A bug was introduced in NETDEV_CHANGE notifier sequence causing the
arp table to be sometimes spuriously cleared (including manual arp
entries marked permanent), upon network link carrier changes.

The changed argument for the notifier was applied only to a single
caller of NETDEV_CHANGE, missing among others netdev_state_change().
So upon net_carrier events induced by the network, which are
triggering a call to netdev_state_change(), arp_netdev_event() would
decide whether to clear or not arp cache based on random/junk stack
values (a kind of read buffer overflow).

Fixes: be9efd3653 ("net: pass changed flags along with NETDEV_CHANGE event")
Fixes: 6c8b4e3ff8 ("arp: flush arp cache on IFF_NOARP change")
Signed-off-by: Loic Prylli <loicp@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-28 08:06:00 -07:00
Eric Dumazet 0d0fc9f9e8 net: force a list_del() in unregister_netdevice_many()
[ Upstream commit 87757a917b ]

unregister_netdevice_many() API is error prone and we had too
many bugs because of dangling LIST_HEAD on stacks.

See commit f87e6f4793 ("net: dont leave active on stack LIST_HEAD")

In fact, instead of making sure no caller leaves an active list_head,
just force a list_del() in the callee. No one seems to need to access
the list after unregister_netdevice_many()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-26 15:15:39 -04:00
Eric Dumazet fc2ab6e8f4 net-gro: reset skb->truesize in napi_reuse_skb()
[ Upstream commit e33d0ba804 ]

Recycling skb always had been very tough...

This time it appears GRO layer can accumulate skb->truesize
adjustments made by drivers when they attach a fragment to skb.

skb_gro_receive() can only subtract from skb->truesize the used part
of a fragment.

I spotted this problem seeing TcpExtPruneCalled and
TcpExtTCPRcvCollapsed that were unexpected with a recent kernel, where
TCP receive window should be sized properly to accept traffic coming
from a driver not overshooting skb->truesize.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-31 13:20:39 -07:00
Eric Dumazet d83924d768 net: gro: make sure skb->cb[] initial content has not to be zero
[ Upstream commit 29e9824278 ]

Starting from linux-3.13, GRO attempts to build full size skbs.

Problem is the commit assumed one particular field in skb->cb[]
was clean, but it is not the case on some stacked devices.

Timo reported a crash in case traffic is decrypted before
reaching a GRE device.

Fix this by initializing NAPI_GRO_CB(skb)->last at the right place,
this also removes one conditional.

Thanks a lot to Timo for providing full reports and bisecting this.

Fixes: 8a29111c7c ("net: gro: allow to build full sized skb")
Bisected-by: Timo Teras <timo.teras@iki.fi>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Timo Teräs <timo.teras@iki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-31 13:20:38 -07:00
Cong Wang af8c0e0612 rtnetlink: wait for unregistering devices in rtnl_link_unregister()
[ Upstream commit 200b916f35 ]

From: Cong Wang <cwang@twopensource.com>

commit 50624c934d (net: Delay default_device_exit_batch until no
devices are unregistering) introduced rtnl_lock_unregistering() for
default_device_exit_batch(). Same race could happen we when rmmod a driver
which calls rtnl_link_unregister() as we call dev->destructor without rtnl
lock.

For long term, I think we should clean up the mess of netdev_run_todo()
and net namespce exit code.

Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-31 13:20:38 -07:00
Vlad Yasevich 102e103f76 vlan: Fix lockdep warning with stacked vlan devices.
[ Upstream commit d38569ab2b ]

This reverts commit dc8eaaa006.
	vlan: Fix lockdep warning when vlan dev handle notification

Instead we use the new new API to find the lock subclass of
our vlan device.  This way we can support configurations where
vlans are interspersed with other devices:
  bond -> vlan -> macvlan -> vlan

Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-31 13:20:34 -07:00
Vlad Yasevich 69ab2a8b80 net: Find the nesting level of a given device by type.
[ Upstream commit 4085ebe8c3 ]

Multiple devices in the kernel can be stacked/nested and they
need to know their nesting level for the purposes of lockdep.
This patch provides a generic function that determines a nesting
level of a particular device by its type (ex: vlan, macvlan, etc).
We only care about nesting of the same type of devices.

For example:
  eth0 <- vlan0.10 <- macvlan0 <- vlan1.20

The nesting level of vlan1.20 would be 1, since there is another vlan
in the stack under it.

Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-31 13:20:34 -07:00
dingtianhong e2c7f96858 vlan: Fix lockdep warning when vlan dev handle notification
[ Upstream commit dc8eaaa006 ]

When I open the LOCKDEP config and run these steps:

modprobe 8021q
vconfig add eth2 20
vconfig add eth2.20 30
ifconfig eth2 xx.xx.xx.xx

then the Call Trace happened:

[32524.386288] =============================================
[32524.386293] [ INFO: possible recursive locking detected ]
[32524.386298] 3.14.0-rc2-0.7-default+ #35 Tainted: G           O
[32524.386302] ---------------------------------------------
[32524.386306] ifconfig/3103 is trying to acquire lock:
[32524.386310]  (&vlan_netdev_addr_lock_key/1){+.....}, at: [<ffffffff814275f4>] dev_mc_sync+0x64/0xb0
[32524.386326]
[32524.386326] but task is already holding lock:
[32524.386330]  (&vlan_netdev_addr_lock_key/1){+.....}, at: [<ffffffff8141af83>] dev_set_rx_mode+0x23/0x40
[32524.386341]
[32524.386341] other info that might help us debug this:
[32524.386345]  Possible unsafe locking scenario:
[32524.386345]
[32524.386350]        CPU0
[32524.386352]        ----
[32524.386354]   lock(&vlan_netdev_addr_lock_key/1);
[32524.386359]   lock(&vlan_netdev_addr_lock_key/1);
[32524.386364]
[32524.386364]  *** DEADLOCK ***
[32524.386364]
[32524.386368]  May be due to missing lock nesting notation
[32524.386368]
[32524.386373] 2 locks held by ifconfig/3103:
[32524.386376]  #0:  (rtnl_mutex){+.+.+.}, at: [<ffffffff81431d42>] rtnl_lock+0x12/0x20
[32524.386387]  #1:  (&vlan_netdev_addr_lock_key/1){+.....}, at: [<ffffffff8141af83>] dev_set_rx_mode+0x23/0x40
[32524.386398]
[32524.386398] stack backtrace:
[32524.386403] CPU: 1 PID: 3103 Comm: ifconfig Tainted: G           O 3.14.0-rc2-0.7-default+ #35
[32524.386409] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
[32524.386414]  ffffffff81ffae40 ffff8800d9625ae8 ffffffff814f68a2 ffff8800d9625bc8
[32524.386421]  ffffffff810a35fb ffff8800d8a8d9d0 00000000d9625b28 ffff8800d8a8e5d0
[32524.386428]  000003cc00000000 0000000000000002 ffff8800d8a8e5f8 0000000000000000
[32524.386435] Call Trace:
[32524.386441]  [<ffffffff814f68a2>] dump_stack+0x6a/0x78
[32524.386448]  [<ffffffff810a35fb>] __lock_acquire+0x7ab/0x1940
[32524.386454]  [<ffffffff810a323a>] ? __lock_acquire+0x3ea/0x1940
[32524.386459]  [<ffffffff810a4874>] lock_acquire+0xe4/0x110
[32524.386464]  [<ffffffff814275f4>] ? dev_mc_sync+0x64/0xb0
[32524.386471]  [<ffffffff814fc07a>] _raw_spin_lock_nested+0x2a/0x40
[32524.386476]  [<ffffffff814275f4>] ? dev_mc_sync+0x64/0xb0
[32524.386481]  [<ffffffff814275f4>] dev_mc_sync+0x64/0xb0
[32524.386489]  [<ffffffffa0500cab>] vlan_dev_set_rx_mode+0x2b/0x50 [8021q]
[32524.386495]  [<ffffffff8141addf>] __dev_set_rx_mode+0x5f/0xb0
[32524.386500]  [<ffffffff8141af8b>] dev_set_rx_mode+0x2b/0x40
[32524.386506]  [<ffffffff8141b3cf>] __dev_open+0xef/0x150
[32524.386511]  [<ffffffff8141b177>] __dev_change_flags+0xa7/0x190
[32524.386516]  [<ffffffff8141b292>] dev_change_flags+0x32/0x80
[32524.386524]  [<ffffffff8149ca56>] devinet_ioctl+0x7d6/0x830
[32524.386532]  [<ffffffff81437b0b>] ? dev_ioctl+0x34b/0x660
[32524.386540]  [<ffffffff814a05b0>] inet_ioctl+0x80/0xa0
[32524.386550]  [<ffffffff8140199d>] sock_do_ioctl+0x2d/0x60
[32524.386558]  [<ffffffff81401a52>] sock_ioctl+0x82/0x2a0
[32524.386568]  [<ffffffff811a7123>] do_vfs_ioctl+0x93/0x590
[32524.386578]  [<ffffffff811b2705>] ? rcu_read_lock_held+0x45/0x50
[32524.386586]  [<ffffffff811b39e5>] ? __fget_light+0x105/0x110
[32524.386594]  [<ffffffff811a76b1>] SyS_ioctl+0x91/0xb0
[32524.386604]  [<ffffffff815057e2>] system_call_fastpath+0x16/0x1b

========================================================================

The reason is that all of the addr_lock_key for vlan dev have the same class,
so if we change the status for vlan dev, the vlan dev and its real dev will
hold the same class of addr_lock_key together, so the warning happened.

we should distinguish the lock depth for vlan dev and its real dev.

v1->v2: Convert the vlan_netdev_addr_lock_key to an array of eight elements, which
	could support to add 8 vlan id on a same vlan dev, I think it is enough for current
	scene, because a netdev's name is limited to IFNAMSIZ which could not hold 8 vlan id,
	and the vlan dev would not meet the same class key with its real dev.

	The new function vlan_dev_get_lockdep_subkey() will return the subkey and make the vlan
	dev could get a suitable class key.

v2->v3: According David's suggestion, I use the subclass to distinguish the lock key for vlan dev
	and its real dev, but it make no sense, because the difference for subclass in the
	lock_class_key doesn't mean that the difference class for lock_key, so I use lock_depth
	to distinguish the different depth for every vlan dev, the same depth of the vlan dev
	could have the same lock_class_key, I import the MAX_LOCK_DEPTH from the include/linux/sched.h,
	I think it is enough here, the lockdep should never exceed that value.

v3->v4: Add a huge array of locking keys will waste static kernel memory and is not a appropriate method,
	we could use _nested() variants to fix the problem, calculate the depth for every vlan dev,
	and use the depth as the subclass for addr_lock_key.

Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-31 13:20:34 -07:00
Vlad Yasevich 9038f47699 net: Start with correct mac_len in skb_network_protocol
[ Upstream commit 1e785f48d2 ]

Sometimes, when the packet arrives at skb_mac_gso_segment()
its skb->mac_len already accounts for some of the mac lenght
headers in the packet.  This seems to happen when forwarding
through and OpenSSL tunnel.

When we start looking for any vlan headers in skb_network_protocol()
we seem to ignore any of the already known mac headers and start
with an ETH_HLEN.  This results in an incorrect offset, dropped
TSO frames and general slowness of the connection.

We can start counting from the known skb->mac_len
and return at least that much if all mac level headers
are known and accounted for.

Fixes: 53d6471cef (net: Account for all vlan headers in skb_mac_gso_segment)
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: Daniel Borkman <dborkman@redhat.com>
Tested-by: Martin Filip <nexus+kernel@smoula.net>
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-31 13:20:34 -07:00
Vlad Yasevich 53d6471cef net: Account for all vlan headers in skb_mac_gso_segment
skb_network_protocol() already accounts for multiple vlan
headers that may be present in the skb.  However, skb_mac_gso_segment()
doesn't know anything about it and assumes that skb->mac_len
is set correctly to skip all mac headers.  That may not
always be the case.  If we are simply forwarding the packet (via
bridge or macvtap), all vlan headers may not be accounted for.

A simple solution is to allow skb_network_protocol to return
the vlan depth it has calculated.  This way skb_mac_gso_segment
will correctly skip all mac headers.

Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-28 17:10:36 -04:00
Florian Westphal d206940319 net: core: introduce netif_skb_dev_features
Will be used by upcoming ipv4 forward path change that needs to
determine feature mask using skb->dst->dev instead of skb->dev.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-13 17:17:02 -05:00
Rashika Kheria 0a59f3a9fd net: Mark functions as static in core/dev.c
Mark functions as static in core/dev.c because they are not used outside
this file.

This eliminates the following warning in core/dev.c:
net/core/dev.c:2806:5: warning: no previous prototype for ‘__dev_queue_xmit’ [-Wmissing-prototypes]
net/core/dev.c:4640:5: warning: no previous prototype for ‘netdev_adjacent_sysfs_add’ [-Wmissing-prototypes]
net/core/dev.c:4650:6: warning: no previous prototype for ‘netdev_adjacent_sysfs_del’ [-Wmissing-prototypes]

Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Reviewed-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-09 17:32:49 -08:00
Linus Torvalds 4ba9920e5e Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:

 1) BPF debugger and asm tool by Daniel Borkmann.

 2) Speed up create/bind in AF_PACKET, also from Daniel Borkmann.

 3) Correct reciprocal_divide and update users, from Hannes Frederic
    Sowa and Daniel Borkmann.

 4) Currently we only have a "set" operation for the hw timestamp socket
    ioctl, add a "get" operation to match.  From Ben Hutchings.

 5) Add better trace events for debugging driver datapath problems, also
    from Ben Hutchings.

 6) Implement auto corking in TCP, from Eric Dumazet.  Basically, if we
    have a small send and a previous packet is already in the qdisc or
    device queue, defer until TX completion or we get more data.

 7) Allow userspace to manage ipv6 temporary addresses, from Jiri Pirko.

 8) Add a qdisc bypass option for AF_PACKET sockets, from Daniel
    Borkmann.

 9) Share IP header compression code between Bluetooth and IEEE802154
    layers, from Jukka Rissanen.

10) Fix ipv6 router reachability probing, from Jiri Benc.

11) Allow packets to be captured on macvtap devices, from Vlad Yasevich.

12) Support tunneling in GRO layer, from Jerry Chu.

13) Allow bonding to be configured fully using netlink, from Scott
    Feldman.

14) Allow AF_PACKET users to obtain the VLAN TPID, just like they can
    already get the TCI.  From Atzm Watanabe.

15) New "Heavy Hitter" qdisc, from Terry Lam.

16) Significantly improve the IPSEC support in pktgen, from Fan Du.

17) Allow ipv4 tunnels to cache routes, just like sockets.  From Tom
    Herbert.

18) Add Proportional Integral Enhanced packet scheduler, from Vijay
    Subramanian.

19) Allow openvswitch to mmap'd netlink, from Thomas Graf.

20) Key TCP metrics blobs also by source address, not just destination
    address.  From Christoph Paasch.

21) Support 10G in generic phylib.  From Andy Fleming.

22) Try to short-circuit GRO flow compares using device provided RX
    hash, if provided.  From Tom Herbert.

The wireless and netfilter folks have been busy little bees too.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2064 commits)
  net/cxgb4: Fix referencing freed adapter
  ipv6: reallocate addrconf router for ipv6 address when lo device up
  fib_frontend: fix possible NULL pointer dereference
  rtnetlink: remove IFLA_BOND_SLAVE definition
  rtnetlink: remove check for fill_slave_info in rtnl_have_link_slave_info
  qlcnic: update version to 5.3.55
  qlcnic: Enhance logic to calculate msix vectors.
  qlcnic: Refactor interrupt coalescing code for all adapters.
  qlcnic: Update poll controller code path
  qlcnic: Interrupt code cleanup
  qlcnic: Enhance Tx timeout debugging.
  qlcnic: Use bool for rx_mac_learn.
  bonding: fix u64 division
  rtnetlink: add missing IFLA_BOND_AD_INFO_UNSPEC
  sfc: Use the correct maximum TX DMA ring size for SFC9100
  Add Shradha Shah as the sfc driver maintainer.
  net/vxlan: Share RX skb de-marking and checksum checks with ovs
  tulip: cleanup by using ARRAY_SIZE()
  ip_tunnel: clear IPCB in ip_tunnel_xmit() in case dst_link_failure() is called
  net/cxgb4: Don't retrieve stats during recovery
  ...
2014-01-25 11:17:34 -08:00
Linus Torvalds bb1281f2aa Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial tree updates from Jiri Kosina:
 "Usual rocket science stuff from trivial.git"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (39 commits)
  neighbour.h: fix comment
  sched: Fix warning on make htmldocs caused by wait.h
  slab: struct kmem_cache is protected by slab_mutex
  doc: Fix typo in USB Gadget Documentation
  of/Kconfig: Spelling s/one/once/
  mkregtable: Fix sscanf handling
  lp5523, lp8501: comment improvements
  thermal: rcar: comment spelling
  treewide: fix comments and printk msgs
  IXP4xx: remove '1 &&' from a condition check in ixp4xx_restart()
  Documentation: update /proc/uptime field description
  Documentation: Fix size parameter for snprintf
  arm: fix comment header and macro name
  asm-generic: uaccess: Spelling s/a ny/any/
  mtd: onenand: fix comment header
  doc: driver-model/platform.txt: fix a typo
  drivers: fix typo in DEVTMPFS_MOUNT Kconfig help text
  doc: Fix typo (acces_process_vm -> access_process_vm)
  treewide: Fix typos in printk
  drivers/gpu/drm/qxl/Kconfig: reformat the help text
  ...
2014-01-22 21:21:55 -08:00
Or Gerlitz e27a2f8395 net: Export gro_find_by_type helpers
Export the gro_find_receive/complete_by_type helpers to they can be invoked
by the gro callbacks of encapsulation protocols such as vxlan.

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-21 18:05:04 -08:00
Or Gerlitz b582ef0990 net: Add GRO support for UDP encapsulating protocols
Add GRO handlers for protocols that do UDP encapsulation, with the intent of
being able to coalesce packets which encapsulate packets belonging to
the same TCP session.

For GRO purposes, the destination UDP port takes the role of the ether type
field in the ethernet header or the next protocol in the IP header.

The UDP GRO handler will only attempt to coalesce packets whose destination
port is registered to have gro handler.

Use a mark on the skb GRO CB data to disallow (flush) running the udp gro receive
code twice on a packet. This solves the problem of udp encapsulated packets whose
inner VM packet is udp and happen to carry a port which has registered offloads.

Signed-off-by: Shlomo Pongratz <shlomop@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-21 18:05:04 -08:00
Sabrina Dubroca f14fe8a848 net: remove unnecessary initializations in net_dev_init
softnet_data is already set to 0, no need to use memset or initialize
specific fields to 0 or NULL afterwards.

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-21 16:44:45 -08:00
Jason Wang 9d08dd3d32 net: document accel_priv parameter for __dev_queue_xmit()
To silent "make htmldocs" warning.

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-19 19:55:51 -08:00
Michael Dalton a953be53ce net-sysfs: add support for device-specific rx queue sysfs attributes
Extend existing support for netdevice receive queue sysfs attributes to
permit a device-specific attribute group. Initial use case for this
support will be to allow the virtio-net device to export per-receive
queue mergeable receive buffer size.

Signed-off-by: Michael Dalton <mwdalton@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-16 23:46:06 -08:00
Veaceslav Falico 1d486bfb66 net: add NETDEV_PRECHANGEMTU to notify before mtu change happens
Currently, if a device changes its mtu, first the change happens (invloving
all the side effects), and after that the NETDEV_CHANGEMTU is sent so that
other devices can catch up with the new mtu. However, if they return
NOTIFY_BAD, then the change is reverted and error returned.

This is a really long and costy operation (sometimes). To fix this, add
NETDEV_PRECHANGEMTU notification which is called prior to any change
actually happening, and if any callee returns NOTIFY_BAD - the change is
aborted. This way we're skipping all the playing with apply/revert the mtu.

CC: "David S. Miller" <davem@davemloft.net>
CC: Jiri Pirko <jiri@resnulli.us>
CC: Eric Dumazet <edumazet@google.com>
CC: Nicolas Dichtel <nicolas.dichtel@6wind.com>
CC: Cong Wang <amwang@redhat.com>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-16 17:15:41 -08:00
Tom Herbert 0b4cec8c2e net: Check skb->rxhash in gro_receive
When initializing a gro_list for a packet, first check the rxhash of
the incoming skb against that of the skb's in the list. This should be
a very strong inidicator of whether the flow is going to be matched,
and potentially allows a lot of other checks to be short circuited.
Use skb_hash_raw so that we don't force the hash to be calculated.

Tested by running netperf 200 TCP_STREAMs between two machines with
GRO, HW rxhash, and 1G. Saw no performance degration, slight reduction
of time in dev_gro_receive.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-16 16:22:54 -08:00
Veaceslav Falico 5bb025fae5 net: rename sysfs symlinks on device name change
Currently, we don't rename the upper/lower_ifc symlinks in
/sys/class/net/*/ , which might result stale/duplicate links/names.

Fix this by adding netdev_adjacent_rename_links(dev, oldname) which renames
all the upper/lower interface's links to dev from the upper/lower_oldname
to the new name.

We don't need a rollback because only we control these symlinks and if we
fail to rename them - sysfs will anyway complain.

Reported-by: Ding Tianhong <dingtianhong@huawei.com>
CC: Ding Tianhong <dingtianhong@huawei.com>
CC: "David S. Miller" <davem@davemloft.net>
CC: Eric Dumazet <edumazet@google.com>
CC: Nicolas Dichtel <nicolas.dichtel@6wind.com>
CC: Cong Wang <amwang@redhat.com>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-15 15:16:20 -08:00
Veaceslav Falico 3ee3270756 net: add sysfs helpers for netdev_adjacent logic
They clean up the code a bit and can be used further.

CC: Ding Tianhong <dingtianhong@huawei.com>
CC: "David S. Miller" <davem@davemloft.net>
CC: Eric Dumazet <edumazet@google.com>
CC: Nicolas Dichtel <nicolas.dichtel@6wind.com>
CC: Cong Wang <amwang@redhat.com>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-15 15:16:19 -08:00
Ben Hutchings ae78dbfa40 net: Add trace events for all receive entry points, exposing more skb fields
The existing net/netif_rx and net/netif_receive_skb trace events
provide little information about the skb, nor do they indicate how it
entered the stack.

Add trace events at entry of each of the exported functions, including
most fields that are likely to be interesting for debugging driver
datapath behaviour.  Split netif_rx() and netif_receive_skb() so that
internal calls are not traced.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-14 14:46:02 -08:00
Ben Hutchings d87d04a785 net: Add net_dev_start_xmit trace event, exposing more skb fields
The existing net/net_dev_xmit trace event provides little information
about the skb that has been passed to the driver, and it is not
simple to add more since the skb may already have been freed at
the point the event is emitted.

Add a separate trace event before the skb is passed to the driver,
including most fields that are likely to be interesting for debugging
driver datapath behaviour.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-14 14:46:02 -08:00
Ben Hutchings 20567661a1 net: Fix indentation in dev_hard_start_xmit()
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-14 14:45:42 -08:00
David S. Miller 0a379e21c5 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2014-01-14 14:42:42 -08:00
Veaceslav Falico 2315dc91a5 net: make dev_set_mtu() honor notification return code
Currently, after changing the MTU for a device, dev_set_mtu() calls
NETDEV_CHANGEMTU notification, however doesn't verify it's return code -
which can be NOTIFY_BAD - i.e. some of the net notifier blocks refused this
change, and continues nevertheless.

To fix this, verify the return code, and if it's an error - then revert the
MTU to the original one, notify again and pass the error code.

CC: Jiri Pirko <jiri@resnulli.us>
CC: "David S. Miller" <davem@davemloft.net>
CC: Eric Dumazet <edumazet@google.com>
CC: Alexander Duyck <alexander.h.duyck@intel.com>
CC: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Reviewed-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-13 15:19:26 -08:00
Eric Dumazet 600adc18eb net: gro: change GRO overflow strategy
GRO layer has a limit of 8 flows being held in GRO list,
for performance reason.

When a packet comes for a flow not yet in the list,
and list is full, we immediately give it to upper
stacks, lowering aggregation performance.

With TSO auto sizing and FQ packet scheduler, this situation
happens more often.

This patch changes strategy to simply evict the oldest flow of
the list. This works better because of the nature of packet
trains for which GRO is efficient. This also has the effect
of lowering the GRO latency if many flows are competing.

Tested :

Used a 40Gbps NIC, with 4 RX queues, and 200 concurrent TCP_STREAM
netperf.

Before patch, aggregate rate is 11Gbps (while a single flow can reach
30Gbps)

After patch, line rate is reached.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jerry Chu <hkchu@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-13 11:43:46 -08:00
Jason Wang f663dd9aaf net: core: explicitly select a txq before doing l2 forwarding
Currently, the tx queue were selected implicitly in ndo_dfwd_start_xmit(). The
will cause several issues:

- NETIF_F_LLTX were removed for macvlan, so txq lock were done for macvlan
  instead of lower device which misses the necessary txq synchronization for
  lower device such as txq stopping or frozen required by dev watchdog or
  control path.
- dev_hard_start_xmit() was called with NULL txq which bypasses the net device
  watchdog.
- dev_hard_start_xmit() does not check txq everywhere which will lead a crash
  when tso is disabled for lower device.

Fix this by explicitly introducing a new param for .ndo_select_queue() for just
selecting queues in the case of l2 forwarding offload. netdev_pick_tx() was also
extended to accept this parameter and dev_queue_xmit_accel() was used to do l2
forwarding transmission.

With this fixes, NETIF_F_LLTX could be preserved for macvlan and there's no need
to check txq against NULL in dev_hard_start_xmit(). Also there's no need to keep
a dedicated ndo_dfwd_start_xmit() and we can just reuse the code of
dev_queue_xmit() to do the transmission.

In the future, it was also required for macvtap l2 forwarding support since it
provides a necessary synchronization method.

Cc: John Fastabend <john.r.fastabend@intel.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: e1000-devel@lists.sourceforge.net
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-10 13:23:08 -05:00
Jerry Chu bf5a755f5e net-gre-gro: Add GRE support to the GRO stack
This patch built on top of Commit 299603e837
("net-gro: Prepare GRO stack for the upcoming tunneling support") to add
the support of the standard GRE (RFC1701/RFC2784/RFC2890) to the GRO
stack. It also serves as an example for supporting other encapsulation
protocols in the GRO stack in the future.

The patch supports version 0 and all the flags (key, csum, seq#) but
will flush any pkt with the S (seq#) flag. This is because the S flag
is not support by GSO, and a GRO pkt may end up in the forwarding path,
thus requiring GSO support to break it up correctly.

Currently the "packet_offload" structure only contains L3 (ETH_P_IP/
ETH_P_IPV6) GRO offload support so the encapped pkts are limited to
IP pkts (i.e., w/o L2 hdr). But support for other protocol type can
be easily added, so is the support for GRE variations like NVGRE.

The patch also support csum offload. Specifically if the csum flag is on
and the h/w is capable of checksumming the payload (CHECKSUM_COMPLETE),
the code will take advantage of the csum computed by the h/w when
validating the GRE csum.

Note that commit 60769a5dcd "ipv4: gre:
add GRO capability" already introduces GRO capability to IPv4 GRE
tunnels, using the gro_cells infrastructure. But GRO is done after
GRE hdr has been removed (i.e., decapped). The following patch applies
GRO when pkts first come in (before hitting the GRE tunnel code). There
is some performance advantage for applying GRO as early as possible.
Also this approach is transparent to other subsystem like Open vSwitch
where GRE decap is handled outside of the IP stack hence making it
harder for the gro_cells stuff to apply. On the other hand, some NICs
are still not capable of hashing on the inner hdr of a GRE pkt (RSS).
In that case the GRO processing of pkts from the same remote host will
all happen on the same CPU and the performance may be suboptimal.

I'm including some rough preliminary performance numbers below. Note
that the performance will be highly dependent on traffic load, mix as
usual. Moreover it also depends on NIC offload features hence the
following is by no means a comprehesive study. Local testing and tuning
will be needed to decide the best setting.

All tests spawned 50 copies of netperf TCP_STREAM and ran for 30 secs.
(super_netperf 50 -H 192.168.1.18 -l 30)

An IP GRE tunnel with only the key flag on (e.g., ip tunnel add gre1
mode gre local 10.246.17.18 remote 10.246.17.17 ttl 255 key 123)
is configured.

The GRO support for pkts AFTER decap are controlled through the device
feature of the GRE device (e.g., ethtool -K gre1 gro on/off).

1.1 ethtool -K gre1 gro off; ethtool -K eth0 gro off
thruput: 9.16Gbps
CPU utilization: 19%

1.2 ethtool -K gre1 gro on; ethtool -K eth0 gro off
thruput: 5.9Gbps
CPU utilization: 15%

1.3 ethtool -K gre1 gro off; ethtool -K eth0 gro on
thruput: 9.26Gbps
CPU utilization: 12-13%

1.4 ethtool -K gre1 gro on; ethtool -K eth0 gro on
thruput: 9.26Gbps
CPU utilization: 10%

The following tests were performed on a different NIC that is capable of
csum offload. I.e., the h/w is capable of computing IP payload csum
(CHECKSUM_COMPLETE).

2.1 ethtool -K gre1 gro on (hence will use gro_cells)

2.1.1 ethtool -K eth0 gro off; csum offload disabled
thruput: 8.53Gbps
CPU utilization: 9%

2.1.2 ethtool -K eth0 gro off; csum offload enabled
thruput: 8.97Gbps
CPU utilization: 7-8%

2.1.3 ethtool -K eth0 gro on; csum offload disabled
thruput: 8.83Gbps
CPU utilization: 5-6%

2.1.4 ethtool -K eth0 gro on; csum offload enabled
thruput: 8.98Gbps
CPU utilization: 5%

2.2 ethtool -K gre1 gro off

2.2.1 ethtool -K eth0 gro off; csum offload disabled
thruput: 5.93Gbps
CPU utilization: 9%

2.2.2 ethtool -K eth0 gro off; csum offload enabled
thruput: 5.62Gbps
CPU utilization: 8%

2.2.3 ethtool -K eth0 gro on; csum offload disabled
thruput: 7.69Gbps
CPU utilization: 8%

2.2.4 ethtool -K eth0 gro on; csum offload enabled
thruput: 8.96Gbps
CPU utilization: 5-6%

Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-07 16:21:31 -05:00
Benjamin Poirier cdb3f4a31b net: Do not enable tx-nocache-copy by default
There are many cases where this feature does not improve performance or even
reduces it.

For example, here are the results from tests that I've run using 3.12.6 on one
Intel Xeon W3565 and one i7 920 connected by ixgbe adapters. The results are
from the Xeon, but they're similar on the i7. All numbers report the
mean±stddev over 10 runs of 10s.

1) latency tests similar to what is described in "c6e1a0d net: Allow no-cache
copy from user on transmit"
There is no statistically significant difference between tx-nocache-copy
on/off.
nic irqs spread out (one queue per cpu)

200x netperf -r 1400,1
tx-nocache-copy off
        692000±1000 tps
        50/90/95/99% latency (us): 275±2/643.8±0.4/799±1/2474.4±0.3
tx-nocache-copy on
        693000±1000 tps
        50/90/95/99% latency (us): 274±1/644.1±0.7/800±2/2474.5±0.7

200x netperf -r 14000,14000
tx-nocache-copy off
        86450±80 tps
        50/90/95/99% latency (us): 334.37±0.02/838±1/2100±20/3990±40
tx-nocache-copy on
        86110±60 tps
        50/90/95/99% latency (us): 334.28±0.01/837±2/2110±20/3990±20

2) single stream throughput tests
tx-nocache-copy leads to higher service demand

                        throughput  cpu0        cpu1        demand
                        (Gb/s)      (Gcycle)    (Gcycle)    (cycle/B)

nic irqs and netperf on cpu0 (1x netperf -T0,0 -t omni -- -d send)

tx-nocache-copy off     9402±5      9.4±0.2                 0.80±0.01
tx-nocache-copy on      9403±3      9.85±0.04               0.838±0.004

nic irqs on cpu0, netperf on cpu1 (1x netperf -T1,1 -t omni -- -d send)

tx-nocache-copy off     9401±5      5.83±0.03   5.0±0.1     0.923±0.007
tx-nocache-copy on      9404±2      5.74±0.03   5.523±0.009 0.958±0.002

As a second example, here are some results from Eric Dumazet with latest
net-next.
tx-nocache-copy also leads to higher service demand

(cpu is Intel(R) Xeon(R) CPU X5660  @ 2.80GHz)

lpq83:~# ./ethtool -K eth0 tx-nocache-copy on
lpq83:~# perf stat ./netperf -H lpq84 -c
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to lpq84.prod.google.com () port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB

 87380  16384  16384    10.00      9407.44   2.50     -1.00    0.522   -1.000

 Performance counter stats for './netperf -H lpq84 -c':

       4282.648396 task-clock                #    0.423 CPUs utilized
             9,348 context-switches          #    0.002 M/sec
                88 CPU-migrations            #    0.021 K/sec
               355 page-faults               #    0.083 K/sec
    11,812,797,651 cycles                    #    2.758 GHz                     [82.79%]
     9,020,522,817 stalled-cycles-frontend   #   76.36% frontend cycles idle    [82.54%]
     4,579,889,681 stalled-cycles-backend    #   38.77% backend  cycles idle    [67.33%]
     6,053,172,792 instructions              #    0.51  insns per cycle
                                             #    1.49  stalled cycles per insn [83.64%]
       597,275,583 branches                  #  139.464 M/sec                   [83.70%]
         8,960,541 branch-misses             #    1.50% of all branches         [83.65%]

      10.128990264 seconds time elapsed

lpq83:~# ./ethtool -K eth0 tx-nocache-copy off
lpq83:~# perf stat ./netperf -H lpq84 -c
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to lpq84.prod.google.com () port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB

 87380  16384  16384    10.00      9412.45   2.15     -1.00    0.449   -1.000

 Performance counter stats for './netperf -H lpq84 -c':

       2847.375441 task-clock                #    0.281 CPUs utilized
            11,632 context-switches          #    0.004 M/sec
                49 CPU-migrations            #    0.017 K/sec
               354 page-faults               #    0.124 K/sec
     7,646,889,749 cycles                    #    2.686 GHz                     [83.34%]
     6,115,050,032 stalled-cycles-frontend   #   79.97% frontend cycles idle    [83.31%]
     1,726,460,071 stalled-cycles-backend    #   22.58% backend  cycles idle    [66.55%]
     2,079,702,453 instructions              #    0.27  insns per cycle
                                             #    2.94  stalled cycles per insn [83.22%]
       363,773,213 branches                  #  127.757 M/sec                   [83.29%]
         4,242,732 branch-misses             #    1.17% of all branches         [83.51%]

      10.128449949 seconds time elapsed

CC: Tom Herbert <therbert@google.com>
Signed-off-by: Benjamin Poirier <bpoirier@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-07 16:20:19 -05:00
David S. Miller 56a4342dfe Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/qlogic/qlcnic/qlcnic_sriov_pf.c
	net/ipv6/ip6_tunnel.c
	net/ipv6/ip6_vti.c

ipv6 tunnel statistic bug fixes conflicting with consolidation into
generic sw per-cpu net stats.

qlogic conflict between queue counting bug fix and the addition
of multiple MAC address support.

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-06 17:37:45 -05:00
David S. Miller 855404efae Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
netfilter/IPVS updates for net-next

The following patchset contains Netfilter updates for your net-next tree,
they are:

* Add full port randomization support. Some crazy researchers found a way
  to reconstruct the secure ephemeral ports that are allocated in random mode
  by sending off-path bursts of UDP packets to overrun the socket buffer of
  the DNS resolver to trigger retransmissions, then if the timing for the
  DNS resolution done by a client is larger than usual, then they conclude
  that the port that received the burst of UDP packets is the one that was
  opened. It seems a bit aggressive method to me but it seems to work for
  them. As a result, Daniel Borkmann and Hannes Frederic Sowa came up with a
  new NAT mode to fully randomize ports using prandom.

* Add a new classifier to x_tables based on the socket net_cls set via
  cgroups. These includes two patches to prepare the field as requested by
  Zefan Li. Also from Daniel Borkmann.

* Use prandom instead of get_random_bytes in several locations of the
  netfilter code, from Florian Westphal.

* Allow to use the CTA_MARK_MASK in ctnetlink when mangling the conntrack
  mark, also from Florian Westphal.

* Fix compilation warning due to unused variable in IPVS, from Geert
  Uytterhoeven.

* Add support for UID/GID via nfnetlink_queue, from Valentina Giusti.

* Add IPComp extension to x_tables, from Fan Du.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-05 20:18:50 -05:00
Daniel Borkmann 86f8515f97 net: netprio: rename config to be more consistent with cgroup configs
While we're at it and introduced CGROUP_NET_CLASSID, lets also make
NETPRIO_CGROUP more consistent with the rest of cgroups and rename it
into CONFIG_CGROUP_NET_PRIO so that for networking, we now have
CONFIG_CGROUP_NET_{PRIO,CLASSID}. This not only makes the CONFIG
option consistent among networking cgroups, but also among cgroups
CONFIG conventions in general as the vast majority has a prefix of
CONFIG_CGROUP_<SUBSYS>.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Zefan Li <lizefan@huawei.com>
Cc: cgroups@vger.kernel.org
Acked-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-01-03 23:41:42 +01:00
stephen hemminger 1d143d9f0c net: core functions cleanup
The following functions are not used outside of net/core/dev.c
and should be declared static.

  call_netdevice_notifiers_info
  __dev_remove_offload
  netdev_has_any_upper_dev
  __netdev_adjacent_dev_remove
  __netdev_adjacent_dev_link_lists
  __netdev_adjacent_dev_unlink_lists
  __netdev_adjacent_dev_unlink
  __netdev_adjacent_dev_link_neighbour
  __netdev_adjacent_dev_unlink_neighbour

And the following are never used and should be deleted
  netdev_lower_dev_get_private_rcu
  __netdev_find_adj_rcu

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-01 23:46:09 -05:00
Zhi Yong Wu 855abcf066 net, rps: fix the comment of net_rps_action_and_irq_enable()
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-31 16:44:10 -05:00
Eric Dumazet 289dccbe14 net: use kfree_skb_list() helper
We can use kfree_skb_list() instead of open coding it.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-21 22:28:16 -05:00
Jiri Kosina e23c34bb41 Merge branch 'master' into for-next
Sync with Linus' tree to be able to apply fixes on top of newer things
in tree (efi-stub).

Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2013-12-19 15:08:32 +01:00
John Fastabend 85328240c6 net: allow netdev_all_upper_get_next_dev_rcu with rtnl lock held
It is useful to be able to walk all upper devices when bringing
a device online where the RTNL lock is held. In this case it
is safe to walk the all_adj_list because the RTNL lock is used
to protect the write side as well.

This patch adds a check to see if the rtnl lock is held before
throwing a warning in netdev_all_upper_get_next_dev_rcu().

Also because we now have a call site for lockdep_rtnl_is_held()
outside COFIG_LOCK_PROVING an inline definition returning 1 is
needed. Similar to the rcu_read_lock_is_held().

Fixes: 2a47fa45d4 ("ixgbe: enable l2 forwarding acceleration for macvlans")
CC: Veaceslav Falico <vfalico@redhat.com>
Reported-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2013-12-17 21:19:08 -08:00
Tom Herbert 3958afa1b2 net: Change skb_get_rxhash to skb_get_hash
Changing name of function as part of making the hash in skbuff to be
generic property, not just for receive path.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-17 16:36:21 -05:00