Commit Graph

415637 Commits

Author SHA1 Message Date
Patrick McHardy c9484874e7 netfilter: nf_tables: add hook ops to struct nft_pktinfo
Multi-family tables need the AF from the hook ops. Add a pointer to the
hook ops and replace usage of the hooknum member in struct nft_pktinfo.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-01-07 23:50:43 +01:00
Patrick McHardy 3b088c4bc0 netfilter: nf_tables: make chain types override the default AF functions
Currently the AF-specific hook functions override the chain-type specific
hook functions. That doesn't make too much sense since the chain types
are a special case of the AF-specific hooks.

Make the AF-specific hook functions the default and make the optional
chain type hooks override them.

As a side effect, the necessary code restructuring reduces the code size,
f.i. in case of nf_tables_ipv4.o:

  nf_tables_ipv4_init_net   |  -24
  nft_do_chain_ipv4         | -113
 2 functions changed, 137 bytes removed, diff: -137

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-01-07 23:50:43 +01:00
Pablo Neira Ayuso 688d18636f netfilter: nft_reject: fix compilation warning if NF_TABLES_IPV6 is disabled
net/netfilter/nft_reject.c: In function 'nft_reject_eval':
net/netfilter/nft_reject.c:37:14: warning: unused variable 'net' [-Wunused-variable]

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-01-07 23:50:43 +01:00
Benjamin Poirier cdb3f4a31b net: Do not enable tx-nocache-copy by default
There are many cases where this feature does not improve performance or even
reduces it.

For example, here are the results from tests that I've run using 3.12.6 on one
Intel Xeon W3565 and one i7 920 connected by ixgbe adapters. The results are
from the Xeon, but they're similar on the i7. All numbers report the
mean±stddev over 10 runs of 10s.

1) latency tests similar to what is described in "c6e1a0d net: Allow no-cache
copy from user on transmit"
There is no statistically significant difference between tx-nocache-copy
on/off.
nic irqs spread out (one queue per cpu)

200x netperf -r 1400,1
tx-nocache-copy off
        692000±1000 tps
        50/90/95/99% latency (us): 275±2/643.8±0.4/799±1/2474.4±0.3
tx-nocache-copy on
        693000±1000 tps
        50/90/95/99% latency (us): 274±1/644.1±0.7/800±2/2474.5±0.7

200x netperf -r 14000,14000
tx-nocache-copy off
        86450±80 tps
        50/90/95/99% latency (us): 334.37±0.02/838±1/2100±20/3990±40
tx-nocache-copy on
        86110±60 tps
        50/90/95/99% latency (us): 334.28±0.01/837±2/2110±20/3990±20

2) single stream throughput tests
tx-nocache-copy leads to higher service demand

                        throughput  cpu0        cpu1        demand
                        (Gb/s)      (Gcycle)    (Gcycle)    (cycle/B)

nic irqs and netperf on cpu0 (1x netperf -T0,0 -t omni -- -d send)

tx-nocache-copy off     9402±5      9.4±0.2                 0.80±0.01
tx-nocache-copy on      9403±3      9.85±0.04               0.838±0.004

nic irqs on cpu0, netperf on cpu1 (1x netperf -T1,1 -t omni -- -d send)

tx-nocache-copy off     9401±5      5.83±0.03   5.0±0.1     0.923±0.007
tx-nocache-copy on      9404±2      5.74±0.03   5.523±0.009 0.958±0.002

As a second example, here are some results from Eric Dumazet with latest
net-next.
tx-nocache-copy also leads to higher service demand

(cpu is Intel(R) Xeon(R) CPU X5660  @ 2.80GHz)

lpq83:~# ./ethtool -K eth0 tx-nocache-copy on
lpq83:~# perf stat ./netperf -H lpq84 -c
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to lpq84.prod.google.com () port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB

 87380  16384  16384    10.00      9407.44   2.50     -1.00    0.522   -1.000

 Performance counter stats for './netperf -H lpq84 -c':

       4282.648396 task-clock                #    0.423 CPUs utilized
             9,348 context-switches          #    0.002 M/sec
                88 CPU-migrations            #    0.021 K/sec
               355 page-faults               #    0.083 K/sec
    11,812,797,651 cycles                    #    2.758 GHz                     [82.79%]
     9,020,522,817 stalled-cycles-frontend   #   76.36% frontend cycles idle    [82.54%]
     4,579,889,681 stalled-cycles-backend    #   38.77% backend  cycles idle    [67.33%]
     6,053,172,792 instructions              #    0.51  insns per cycle
                                             #    1.49  stalled cycles per insn [83.64%]
       597,275,583 branches                  #  139.464 M/sec                   [83.70%]
         8,960,541 branch-misses             #    1.50% of all branches         [83.65%]

      10.128990264 seconds time elapsed

lpq83:~# ./ethtool -K eth0 tx-nocache-copy off
lpq83:~# perf stat ./netperf -H lpq84 -c
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to lpq84.prod.google.com () port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB

 87380  16384  16384    10.00      9412.45   2.15     -1.00    0.449   -1.000

 Performance counter stats for './netperf -H lpq84 -c':

       2847.375441 task-clock                #    0.281 CPUs utilized
            11,632 context-switches          #    0.004 M/sec
                49 CPU-migrations            #    0.017 K/sec
               354 page-faults               #    0.124 K/sec
     7,646,889,749 cycles                    #    2.686 GHz                     [83.34%]
     6,115,050,032 stalled-cycles-frontend   #   79.97% frontend cycles idle    [83.31%]
     1,726,460,071 stalled-cycles-backend    #   22.58% backend  cycles idle    [66.55%]
     2,079,702,453 instructions              #    0.27  insns per cycle
                                             #    2.94  stalled cycles per insn [83.22%]
       363,773,213 branches                  #  127.757 M/sec                   [83.29%]
         4,242,732 branch-misses             #    1.17% of all branches         [83.51%]

      10.128449949 seconds time elapsed

CC: Tom Herbert <therbert@google.com>
Signed-off-by: Benjamin Poirier <bpoirier@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-07 16:20:19 -05:00
Jiri Pirko dfd1582d1e ipv4: loopback device: ignore value changes after device is upped
When lo is brought up, new ifa is created. Then, devconf and neigh values
bitfield should be set so later changes of default values would not
affect lo values.

Note that the same behaviour is in ipv6. Also note that this is likely
not an issue in many distros (for example Fedora 19) because userspace
sets address to lo manually before bringing it up.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-07 15:55:17 -05:00
FX Le Bail 509aba3b0d IPv6: add the option to use anycast addresses as source addresses in echo reply
This change allows to follow a recommandation of RFC4942.

- Add "anycast_src_echo_reply" sysctl to control the use of anycast addresses
  as source addresses for ICMPv6 echo reply. This sysctl is false by default
  to preserve existing behavior.
- Add inline check ipv6_anycast_destination().
- Use them in icmpv6_echo_reply().

Reference:
RFC4942 - IPv6 Transition/Coexistence Security Considerations
   (http://tools.ietf.org/html/rfc4942#section-2.1.6)

2.1.6. Anycast Traffic Identification and Security

   [...]
   To avoid exposing knowledge about the internal structure of the
   network, it is recommended that anycast servers now take advantage of
   the ability to return responses with the anycast address as the
   source address if possible.

Signed-off-by: Francois-Xavier Le Bail <fx.lebail@yahoo.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-07 15:51:39 -05:00
Wei Yongjun 9ba75fb0c4 net/mlx4_en: fix error return code in mlx4_en_get_qp()
Fix to return a negative error code from the error handling
case instead of 0.

Fixes: 837052d0cc ('net/mlx4_en: Add netdev support for TCP/IP offloads of vxlan tunneling')
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-07 15:42:52 -05:00
Hayes Wang 4a8deae2f4 r8152: correct some messages
- Replace pr_warn_ratelimited() with net_ratelimit() and netdev_warn().
 - Adjust the algnment of some messages.
 - Remove the peroid.
 - Fix some messages don't have terminating newline.

Signed-off-by: Hayes Wang <hayeswang@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-07 00:41:20 -05:00
David S. Miller ecca6a968f bna: Fix build due to missing use of dma_unmap_len_set()
> as reported for linux-next of Dec.20, 2013
> when CONFIG_NEED_DMA_MAP_STATE is not enabled:
>
> drivers/net/ethernet/brocade/bna/bnad.c: In function 'bnad_start_xmit':
> drivers/net/ethernet/brocade/bna/bnad.c:3074:26: error: 'struct bnad_tx_vector' has no member named 'dma_len'

Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-06 20:37:41 -05:00
Eric Dumazet 438e38fadc gre_offload: statically build GRE offloading support
GRO/GSO layers can be enabled on a node, even if said
node is only forwarding packets.

This patch permits GSO (and upcoming GRO) support for GRE
encapsulated packets, even if the host has no GRE tunnel setup.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: H.K. Jerry Chu <hkchu@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-06 20:28:34 -05:00
Hauke Mehrtens 0e5959346c bgmac: fix typos
This fixes some typos found by Sergei.

Reported-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Signed-off-by: Hauke Mehrtens <hauke@hauke-m.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-06 20:26:46 -05:00
David S. Miller 39b6b2992f Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/jesse/openvswitch
Jesse Gross says:

====================
[GIT net-next] Open vSwitch

Open vSwitch changes for net-next/3.14. Highlights are:
 * Performance improvements in the mechanism to get packets to userspace
   using memory mapped netlink and skb zero copy where appropriate.
 * Per-cpu flow stats in situations where flows are likely to be shared
   across CPUs. Standard flow stats are used in other situations to save
   memory and allocation time.
 * A handful of code cleanups and rationalization.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-06 19:48:38 -05:00
Stephen Hemminger 443cd88c8a ovs: make functions local
Several functions and datastructures could be local
Found with 'make namespacecheck'

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Jesse Gross <jesse@nicira.com>
2014-01-06 15:54:39 -08:00
Thomas Graf 09c5e6054e openvswitch: Compute checksum in skb_gso_segment() if needed
The copy & csum optimization is no longer present with zerocopy
enabled. Compute the checksum in skb_gso_segment() directly by
dropping the HW CSUM capability from the features passed in.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Jesse Gross <jesse@nicira.com>
2014-01-06 15:53:24 -08:00
Thomas Graf bda56f143c openvswitch: Use skb_zerocopy() for upcall
Use of skb_zerocopy() can avoid the expensive call to memcpy()
when copying the packet data into the Netlink skb. Completes
checksum through skb_checksum_help() if not already done in
GSO segmentation.

Zerocopy is only performed if user space supported unaligned
Netlink messages. memory mapped netlink i/o is preferred over
zerocopy if it is set up.

Cost of upcall is significantly reduced from:
+   7.48%       vhost-8471  [k] memcpy
+   5.57%     ovs-vswitchd  [k] memcpy
+   2.81%       vhost-8471  [k] csum_partial_copy_generic

to:
+   5.72%     ovs-vswitchd  [k] memcpy
+   3.32%       vhost-5153  [k] memcpy
+   0.68%       vhost-5153  [k] skb_zerocopy

(megaflows disabled)

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Jesse Gross <jesse@nicira.com>
2014-01-06 15:53:17 -08:00
Thomas Graf 8055a89cfa openvswitch: Pass datapath into userspace queue functions
Allows removing the net and dp_ifindex argument and simplify the
code.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Jesse Gross <jesse@nicira.com>
2014-01-06 15:53:07 -08:00
Thomas Graf 44da5ae5fb openvswitch: Drop user features if old user space attempted to create datapath
Drop user features if an outdated user space instance that does not
understand the concept of user_features attempted to create a new
datapath.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Jesse Gross <jesse@nicira.com>
2014-01-06 15:53:00 -08:00
Thomas Graf 43d4be9cb5 openvswitch: Allow user space to announce ability to accept unaligned Netlink messages
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Reviewed-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
2014-01-06 15:52:53 -08:00
Thomas Graf af2806f8f9 net: Export skb_zerocopy() to zerocopy from one skb to another
Make the skb zerocopy logic written for nfnetlink queue available for
use by other modules.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Reviewed-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jesse Gross <jesse@nicira.com>
2014-01-06 15:52:42 -08:00
Wei Yongjun 5f03f47c9c openvswitch: remove duplicated include from flow_table.c
Remove duplicated include.

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Jesse Gross <jesse@nicira.com>
2014-01-06 15:52:35 -08:00
Daniel Borkmann 11d6c461b3 net: ovs: use kfree_rcu instead of rcu_free_{sw_flow_mask_cb,acts_callback}
As we're only doing a kfree() anyway in the RCU callback, we can
simply use kfree_rcu, which does the same job, and remove the
function rcu_free_sw_flow_mask_cb() and rcu_free_acts_callback().

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
2014-01-06 15:52:30 -08:00
Pravin B Shelar e298e50570 openvswitch: Per cpu flow stats.
With mega flow implementation ovs flow can be shared between
multiple CPUs which makes stats updates highly contended
operation. This patch uses per-CPU stats in cases where a flow
is likely to be shared (if there is a wildcard in the 5-tuple
and therefore likely to be spread by RSS). In other situations,
it uses the current strategy, saving memory and allocation time.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
2014-01-06 15:52:24 -08:00
Thomas Graf 795449d8b8 openvswitch: Enable memory mapped Netlink i/o
Use memory mapped Netlink i/o for all unicast openvswitch
communication if a ring has been set up.

Benchmark
  * pktgen -> ovs internal port
  * 5M pkts, 5M flows
  * 4 threads, 8 cores

Before:
Result: OK: 67418743(c67108212+d310530) usec, 5000000 (9000byte,0frags)
  74163pps 5339Mb/sec (5339736000bps) errors: 0
	+   2.98%     ovs-vswitchd  [k] copy_user_generic_string
	+   2.49%     ovs-vswitchd  [k] memcpy
	+   1.84%       kpktgend_2  [k] memcpy
	+   1.81%       kpktgend_1  [k] memcpy
	+   1.81%       kpktgend_3  [k] memcpy
	+   1.78%       kpktgend_0  [k] memcpy

After:
Result: OK: 24229690(c24127165+d102524) usec, 5000000 (9000byte,0frags)
  206358pps 14857Mb/sec (14857776000bps) errors: 0
	+   2.80%     ovs-vswitchd  [k] memcpy
	+   1.31%       kpktgend_2  [k] memcpy
	+   1.23%       kpktgend_0  [k] memcpy
	+   1.09%       kpktgend_1  [k] memcpy
	+   1.04%       kpktgend_3  [k] memcpy
	+   0.96%     ovs-vswitchd  [k] copy_user_generic_string

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Reviewed-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
2014-01-06 15:52:12 -08:00
Thomas Graf aae9f0e22c netlink: Avoid netlink mmap alloc if msg size exceeds frame size
An insufficent ring frame size configuration can lead to an
unnecessary skb allocation for every Netlink message. Check frame
size before taking the queue lock and allocating the skb and
re-check with lock to be safe.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Reviewed-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
2014-01-06 15:52:06 -08:00
Thomas Graf bb9b18fb55 genl: Add genlmsg_new_unicast() for unicast message allocation
Allocates a new sk_buff large enough to cover the specified payload
plus required Netlink headers. Will check receiving socket for
memory mapped i/o capability and use it if enabled. Will fall back
to non-mapped skb if message size exceeds the frame size of the ring.

Signed-of-by: Thomas Graf <tgraf@suug.ch>
Reviewed-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
2014-01-06 15:51:53 -08:00
Jesse Gross 663efa3696 openvswitch: Silence RCU lockdep checks from flow lookup.
Flow lookup can happen either in packet processing context or userspace
context but it was annotated as requiring RCU read lock to be held. This
also allows OVS mutex to be held without causing warnings.

Reported-by: Justin Pettit <jpettit@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
Reviewed-by: Thomas Graf <tgraf@redhat.com>
2014-01-06 15:51:48 -08:00
Andy Zhou 5bb506324d openvswitch: Change ovs_flow_tbl_lookup_xx() APIs
API changes only for code readability. No functional chnages.

This patch removes the underscored version. Added a new API
ovs_flow_tbl_lookup_stats() that returns the n_mask_hits.

Reported by: Ben Pfaff <blp@nicira.com>
Reviewed-by: Thomas Graf <tgraf@redhat.com>
Signed-off-by: Andy Zhou <azhou@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
2014-01-06 15:51:41 -08:00
Ben Pfaff 8f49ce1135 openvswitch: Shrink sw_flow_mask by 8 bytes (64-bit) or 4 bytes (32-bit).
We won't normally have a ton of flow masks but using a size_t to store
values no bigger than sizeof(struct sw_flow_key) seems excessive.

This reduces sw_flow_key_range and sw_flow_mask by 4 bytes on 32-bit
systems.  On 64-bit systems it shrinks sw_flow_key_range by 12 bytes but
sw_flow_mask only by 8 bytes due to padding.

Compile tested only.

Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Andy Zhou <azhou@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
2014-01-06 15:51:27 -08:00
Ben Pfaff d1211908b9 openvswitch: Correct comment.
Signed-off-by: Ben Pfaff <blp@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
2014-01-06 15:51:21 -08:00
David S. Miller 56a4342dfe Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/qlogic/qlcnic/qlcnic_sriov_pf.c
	net/ipv6/ip6_tunnel.c
	net/ipv6/ip6_vti.c

ipv6 tunnel statistic bug fixes conflicting with consolidation into
generic sw per-cpu net stats.

qlogic conflict between queue counting bug fix and the addition
of multiple MAC address support.

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-06 17:37:45 -05:00
Jamal Hadi Salim 805c1f4aed net_sched: act: action flushing missaccounting
action flushing missaccounting
Account only for deleted actions

Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-06 16:46:56 -05:00
Jamal Hadi Salim 63acd6807c net_sched: Remove unnecessary checks for act->ops
Remove unnecessary checks for act->ops
(suggested by Eric Dumazet).

Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-06 16:46:32 -05:00
sfeldma@cumulusnetworks.com fbf2671bb8 bridge: use DEVICE_ATTR_xx macros
Use DEVICE_ATTR_RO/RW macros to simplify bridge sysfs attribute definitions.

Signed-off-by: Scott Feldman <sfeldma@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-06 16:40:46 -05:00
Curt Brune fe0d692bbc bridge: use spin_lock_bh() in br_multicast_set_hash_max
br_multicast_set_hash_max() is called from process context in
net/bridge/br_sysfs_br.c by the sysfs store_hash_max() function.

br_multicast_set_hash_max() calls spin_lock(&br->multicast_lock),
which can deadlock the CPU if a softirq that also tries to take the
same lock interrupts br_multicast_set_hash_max() while the lock is
held .  This can happen quite easily when any of the bridge multicast
timers expire, which try to take the same lock.

The fix here is to use spin_lock_bh(), preventing other softirqs from
executing on this CPU.

Steps to reproduce:

1. Create a bridge with several interfaces (I used 4).
2. Set the "multicast query interval" to a low number, like 2.
3. Enable the bridge as a multicast querier.
4. Repeatedly set the bridge hash_max parameter via sysfs.

  # brctl addbr br0
  # brctl addif br0 eth1 eth2 eth3 eth4
  # brctl setmcqi br0 2
  # brctl setmcquerier br0 1

  # while true ; do echo 4096 > /sys/class/net/br0/bridge/hash_max; done

Signed-off-by: Curt Brune <curt@cumulusnetworks.com>
Signed-off-by: Scott Feldman <sfeldma@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-06 16:39:47 -05:00
Eric Dumazet 8f646c922d vxlan: keep original skb ownership
Sathya Perla posted a patch trying to address following problem :

<quote>
 The vxlan driver sets itself as the socket owner for all the TX flows
 it encapsulates (using vxlan_set_owner()) and assigns it's own skb
 destructor. This causes all tunneled traffic to land up on only one TXQ
 as all encapsulated skbs refer to the vxlan socket and not the original
 socket.  Also, the vxlan skb destructor breaks some functionality for
 tunneled traffic like wmem accounting and as TCP small queues and
 FQ/pacing packet scheduler.
</quote>

I reworked Sathya patch and added some explanations.

vxlan_xmit() can avoid one skb_clone()/dev_kfree_skb() pair
and gain better drop monitor accuracy, by calling kfree_skb() when
appropriate.

The UDP socket used by vxlan to perform encapsulation of xmit packets
do not need to be alive while packets leave vxlan code. Its better
to keep original socket ownership to get proper feedback from qdisc and
NIC layers.

We use skb->sk to

A) control amount of bytes/packets queued on behalf of a socket, but
prior vxlan code did the skb->sk transfert without any limit/control
on vxlan socket sk_sndbuf.

B) security purposes (as selinux) or netfilter uses, and I do not think
anything is prepared to handle vxlan stacked case in this area.

By not changing ownership, vxlan tunnels behave like other tunnels.
As Stephen mentioned, we might do the same change in L2TP.

Reported-by: Sathya Perla <sathya.perla@emulex.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-06 16:37:09 -05:00
Eric Dumazet 996b175e39 tcp: out_of_order_queue do not use its lock
TCP out_of_order_queue lock is not used, as queue manipulation
happens with socket lock held and we therefore use the lockless
skb queue routines (as __skb_queue_head())

We can use __skb_queue_head_init() instead of skb_queue_head_init()
to make this more consistent.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-06 16:34:34 -05:00
Hannes Frederic Sowa 88ad31491e ipv6: don't install anycast address for /128 addresses on routers
It does not make sense to create an anycast address for an /128-prefix.
Suppress it.

As 32019e651c ("ipv6: Do not leave router anycast address for /127
prefixes.") shows we also may not leave them, because we could accidentally
remove an anycast address the user has allocated or got added via another
prefix.

Cc: François-Xavier Le Bail <fx.lebail@yahoo.com>
Cc: Thomas Haller <thaller@redhat.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-06 16:32:43 -05:00
Dan Williams e5e97ee956 hso: fix handling of modem port SERIAL_STATE notifications
The existing serial state notification handling expected older Option
devices, having a hardcoded assumption that the Modem port was always
USB interface #2.  That isn't true for devices from the past few years.

hso_serial_state_notification is a local cache of a USB Communications
Interface Class SERIAL_STATE notification from the device, and the
USB CDC specification (section 6.3, table 67 "Class-Specific Notifications")
defines wIndex as the USB interface the event applies to.  For hso
devices this will always be the Modem port, as the Modem port is the
only port which is set up to receive them by the driver.

So instead of always expecting USB interface #2, instead validate the
notification with the actual USB interface number of the Modem port.

Signed-off-by: Dan Williams <dcbw@redhat.com>
Tested-by: H. Nikolaus Schaller <hns@goldelico.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-06 16:29:44 -05:00
Veaceslav Falico 0b23810d8c bonding: fix kstrtou8() return value verification in num_peer_notif
It returns 0 in case of success, !0 error otherwise. Fix the improper error
verification.

Fixes: 2c9839c143 ("bonding: add num_grat_arp attribute netlink support")
CC: sfeldma@cumulusnetworks.com
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Acked-by: Scott Feldman <sfeldma@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-06 16:26:00 -05:00
hayeswang 31ca1decfb r8152: replace the return value of rtl_ops_init
Replace the boolean value with the error code for the return value
of the rtl_ops_init().

Signed-off-by: Hayes Wang <hayeswang@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-06 16:24:09 -05:00
hayeswang e3ad412ad8 r8152: move the actions of saving the information of the device
Some information of the device may be used in other functions. Move
the relative code to make sure it would be initialzed correctly
before using it.

Signed-off-by: Hayes Wang <hayeswang@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-06 16:24:09 -05:00
hayeswang 45f4a19f6d r8152: replace some tabs with spaces
Replace the tabs of the variables declaration with the spaces.

Signed-off-by: Hayes Wang <hayeswang@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-06 16:24:09 -05:00
Guenter Roeck 22d3b76ed7 isdn: Drop big endian cpp checks from telespci and hfc_pci drivers
With arm:allmodconfig, building the Teles PCI driver fails with

telespci.c:294:2: error: #error "not running on big endian machines now"

Similar, building the driver for HFC PCI-Bus cards fails with

hfc_pci.c:1647:2: error: #error "not running on big endian machines now"

Remove the big endian cpp check from both drivers to fix the build errors.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-06 15:50:51 -05:00
Vijay Subramanian d4b36210c2 net: pkt_sched: PIE AQM scheme
Proportional Integral controller Enhanced (PIE) is a scheduler to address the
bufferbloat problem.

>From the IETF draft below:
" Bufferbloat is a phenomenon where excess buffers in the network cause high
latency and jitter. As more and more interactive applications (e.g. voice over
IP, real time video streaming and financial transactions) run in the Internet,
high latency and jitter degrade application performance. There is a pressing
need to design intelligent queue management schemes that can control latency and
jitter; and hence provide desirable quality of service to users.

We present here a lightweight design, PIE(Proportional Integral controller
Enhanced) that can effectively control the average queueing latency to a target
value. Simulation results, theoretical analysis and Linux testbed results have
shown that PIE can ensure low latency and achieve high link utilization under
various congestion situations. The design does not require per-packet
timestamp, so it incurs very small overhead and is simple enough to implement
in both hardware and software.  "

Many thanks to Dave Taht for extensive feedback, reviews, testing and
suggestions. Thanks also to Stephen Hemminger and Eric Dumazet for reviews and
suggestions.  Naeem Khademi and Dave Taht independently contributed to ECN
support.

For more information, please see technical paper about PIE in the IEEE
Conference on High Performance Switching and Routing 2013. A copy of the paper
can be found at ftp://ftpeng.cisco.com/pie/.

Please also refer to the IETF draft submission at
http://tools.ietf.org/html/draft-pan-tsvwg-pie-00

All relevant code, documents and test scripts and results can be found at
ftp://ftpeng.cisco.com/pie/.

For problems with the iproute2/tc or Linux kernel code, please contact Vijay
Subramanian (vijaynsu@cisco.com or subramanian.vijay@gmail.com) Mythili Prabhu
(mysuryan@cisco.com)

Signed-off-by: Vijay Subramanian <subramanian.vijay@gmail.com>
Signed-off-by: Mythili Prabhu <mysuryan@cisco.com>
CC: Dave Taht <dave.taht@bufferbloat.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-06 15:13:01 -05:00
David S. Miller 83111e7fe8 netfilter: Fix build failure in nfnetlink_queue_core.c.
net/netfilter/nfnetlink_queue_core.c: In function 'nfqnl_put_sk_uidgid':
net/netfilter/nfnetlink_queue_core.c:304:35: error: 'TCP_TIME_WAIT' undeclared (first use in this function)
net/netfilter/nfnetlink_queue_core.c:304:35: note: each undeclared identifier is reported only once for each function it appears in
make[3]: *** [net/netfilter/nfnetlink_queue_core.o] Error 1

Just a missing include of net/tcp_states.h

Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-06 13:36:06 -05:00
David S. Miller 9aa28f2b71 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nftables
Pablo Neira Ayuso says: <pablo@netfilter.org>

====================
nftables updates for net-next

The following patchset contains nftables updates for your net-next tree,
they are:

* Add set operation to the meta expression by means of the select_ops()
  infrastructure, this allows us to set the packet mark among other things.
  From Arturo Borrero Gonzalez.

* Fix wrong format in sscanf in nf_tables_set_alloc_name(), from Daniel
  Borkmann.

* Add new queue expression to nf_tables. These comes with two previous patches
  to prepare this new feature, one to add mask in nf_tables_core to
  evaluate the queue verdict appropriately and another to refactor common
  code with xt_NFQUEUE, from Eric Leblond.

* Do not hide nftables from Kconfig if nfnetlink is not enabled, also from
  Eric Leblond.

* Add the reject expression to nf_tables, this adds the missing TCP RST
  support. It comes with an initial patch to refactor common code with
  xt_NFQUEUE, again from Eric Leblond.

* Remove an unused variable assignment in nf_tables_dump_set(), from Michal
  Nazarewicz.

* Remove the nft_meta_target code, now that Arturo added the set operation
  to the meta expression, from me.

* Add help information for nf_tables to Kconfig, also from me.

* Allow to dump all sets by specifying NFPROTO_UNSPEC, similar feature is
  available to other nf_tables objects, requested by Arturo, from me.

* Expose the table usage counter, so we can know how many chains are using
  this table without dumping the list of chains, from Tomasz Bursztyka.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-06 13:29:30 -05:00
David S. Miller 6a8c4796df Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-next
Jeff Kirsher says:

====================
Intel Wired LAN Driver Updates

This series contains updates to i40e only.

Majority of this series contains patches from Greg and Mitch to fix
up or add functionality to the PF/VF driver interactions.  Notably,
a fix for SR-IOV VF port VLAN which resolved the problem of port VLAN
configurations not being persistent across VF driver loads and unloads
and enable/disable of the feature.  Also do not enable the default port
on the VEB, which is designed only to bridge the PF to an Open vSwitch
or bridge.  Another fix to resolve a possible memory corruption
condition where ARQ messages are written to random memory locations.
Fix a problem where the 'ip link show' command would display stale
link address information after the link address was set via the 'ip
link set' command.

Anjali provides several patches, one which saves information that can
be used while cleaning the Tx ring and useful in detecting Tx hangs.
Then provides a fixes to the admin queue shutdown function to ensure
we are shutting down the queue in the shutdown path and ensure ASQ is
alive before issuing the admin queue command.

Shannon provides a fix for get/update vsi params where the incorrect
struct was being used.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-06 13:25:58 -05:00
David S. Miller ce088848c2 Merge branch 'be2net'
Sathya Perla says:

====================
be2net: patch set

Pls apply the following bug fixes to the 'net' tree. Thanks.

Suresh Reddy (2):
  be2net: increase the timeout value for loopback-test FW cmd
  be2net: fix max_evt_qs calculation for BE3 in SR-IOV config

Vasundhara Volam (1):
  be2net: disable RSS when number of RXQs is reduced to 1 via
    set-channels
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-06 13:09:26 -05:00
Suresh Reddy e3dc867c17 be2net: fix max_evt_qs calculation for BE3 in SR-IOV config
The driver wrongly assumes 16 EQs/vectors are available for each BE3 PF.
When SR-IOV is enabled, a BE3 PF can support only a max of 8 EQs.

Signed-off-by: Suresh Reddy <suresh.reddy@emulex.com>
Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-06 13:09:21 -05:00
Suresh Reddy 5eeff6354f be2net: increase the timeout value for loopback-test FW cmd
The loopback test FW cmd may need upto 15 seconds to complete on
certain PHYs. This patch also fixes the name of the completion variable
used to synchronize FW cmd completions as it not used by the flashing
cmd alone anymore.

Signed-off-by: Suresh Reddy <suresh.reddy@emulex.com>
Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-06 13:09:21 -05:00