Commit Graph

45125 Commits

Author SHA1 Message Date
Vivien Didelot 88e4f0ca4e net: dsa: move netdevice notifier registration
Move the netdevice notifier block register code in slave.c and provide
helpers for dsa.c to register and unregister it.

At the same time, check for errors since (un)register_netdevice_notifier
may fail.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-06 16:53:28 -05:00
Parav Pandit d0d7b10b05 net-next: treewide use is_vlan_dev() helper function.
This patch makes use of is_vlan_dev() function instead of flag
comparison which is exactly done by is_vlan_dev() helper function.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Acked-by: Jon Maxwell <jmaxwell37@gmail.com>
Acked-by: Johannes Thumshirn <jth@kernel.org>
Acked-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-06 16:33:29 -05:00
Xin Long d15c9ede61 sctp: process fwd tsn chunk only when prsctp is enabled
This patch is to check if asoc->peer.prsctp_capable is set before
processing fwd tsn chunk, if not, it will return an ERROR to the
peer, just as rfc3758 section 3.3.1 demands.

Reported-by: Julian Cordes <julian.cordes@gmail.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-06 11:57:15 -05:00
Ido Schimmel a8eca32615 net: remove ndo_neigh_{construct, destroy} from stacked devices
In commit 18bfb924f0 ("net: introduce default neigh_construct/destroy
ndo calls for L2 upper devices") we added these ndos to stacked devices
such as team and bond, so that calls will be propagated to mlxsw.

However, previous commit removed the reliance on these ndos and no new
users of these ndos have appeared since above mentioned commit. We can
therefore safely remove this dead code.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-06 11:25:57 -05:00
Eric Dumazet 02c1602ee7 net: remove __napi_complete()
All __napi_complete() callers have been converted to
use the more standard napi_complete_done(),
we can now remove this NAPI method for good.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-05 16:11:57 -05:00
David Ahern 7d4d5065ec net: ipv6: Use compressed IPv6 addresses showing route replace error
ip6_print_replace_route_err logs an error if a route replace fails with
IPv6 addresses in the full format. e.g,:

IPv6: IPV6: multipath route replace failed (check consistency of installed routes): 2001:0db8:0200:0000:0000:0000:0000:0000 nexthop 2001:0db8:0001:0000:0000:0000:0000:0016 ifi 0

Change the message to dump the addresses in the compressed format.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-04 19:58:14 -05:00
David Ahern 16a16cd35e net: ipv6: Change notifications for multipath delete to RTA_MULTIPATH
If an entire multipath route is deleted using prefix and len (without any
nexthops), send a single RTM_DELROUTE notification with the full route
using RTA_MULTIPATH. This is done by generating the skb before the route
delete when all of the sibling routes are still present but sending it
after the route has been removed from the FIB. The skip_notify flag
is used to tell the lower fib code not to send notifications for the
individual nexthop routes.

If a route is deleted using RTA_MULTIPATH for any nexthops or a single
nexthop entry is deleted, then the nexthops are deleted one at a time with
notifications sent as each hop is deleted. This is necessary given that
IPv6 allows individual hops within a route to be deleted.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-04 19:58:14 -05:00
David Ahern 3b1137fe74 net: ipv6: Change notifications for multipath add to RTA_MULTIPATH
Change ip6_route_multipath_add to send one notifciation with the full
route encoded with RTA_MULTIPATH instead of a series of individual routes.
This is done by adding a skip_notify flag to the nl_info struct. The
flag is used to skip sending of the notification in the fib code that
actually inserts the route. Once the full route has been added, a
notification is generated with all nexthops.

ip6_route_multipath_add handles 3 use cases: new routes, route replace,
and route append. The multipath notification generated needs to be
consistent with the order of the nexthops and it should be consistent
with the order in a FIB dump which means the route with the first nexthop
needs to be used as the route reference. For the first 2 cases (new and
replace), a reference to the route used to send the notification is
obtained by saving the first route added. For the append case, the last
route added is used to loop back to its first sibling route which is
the first nexthop in the multipath route.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-04 19:58:14 -05:00
David Ahern beb1afac51 net: ipv6: Add support to dump multipath routes via RTA_MULTIPATH attribute
IPv6 returns multipath routes as a series of individual routes making
their display and handling by userspace different and more complicated
than IPv4, putting the burden on the user to see that a route is part of
a multipath route and internally creating a multipath route if desired
(e.g., libnl does this as of commit 29b71371e764). This patch addresses
this difference, allowing multipath routes to be returned using the
RTA_MULTIPATH attribute.

The end result is that IPv6 multipath routes can be treated and displayed
in a format similar to IPv4:

    $ ip -6 ro ls vrf red
    2001:db8:1::/120 dev eth1 proto kernel metric 256  pref medium
    2001:db8:2::/120 dev eth2 proto kernel metric 256  pref medium
    2001:db8:200::/120 metric 1024
	    nexthop via 2001:db8:1::2  dev eth1 weight 1
	    nexthop via 2001:db8:2::2  dev eth2 weight 1

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-04 19:58:14 -05:00
David Ahern 0ae8133586 net: ipv6: Allow shorthand delete of all nexthops in multipath route
IPv4 allows multipath routes to be deleted using just the prefix and
length. For example:
    $ ip ro ls vrf red
    unreachable default metric 8192
    1.1.1.0/24
        nexthop via 10.100.1.254  dev eth1 weight 1
        nexthop via 10.11.200.2  dev eth11.200 weight 1
    10.11.200.0/24 dev eth11.200 proto kernel scope link src 10.11.200.3
    10.100.1.0/24 dev eth1 proto kernel scope link src 10.100.1.3

    $ ip ro del 1.1.1.0/24 vrf red

    $ ip ro ls vrf red
    unreachable default metric 8192
    10.11.200.0/24 dev eth11.200 proto kernel scope link src 10.11.200.3
    10.100.1.0/24 dev eth1 proto kernel scope link src 10.100.1.3

The same notation does not work with IPv6 because of how multipath routes
are implemented for IPv6. For IPv6 only the first nexthop of a multipath
route is deleted if the request contains only a prefix and length. This
leads to unnecessary complexity in userspace dealing with IPv6 multipath
routes.

This patch allows all nexthops to be deleted without specifying each one
in the delete request. Internally, this is done by walking the sibling
list of the route matching the specifications given (prefix, length,
metric, protocol, etc).

    $  ip -6 ro ls vrf red
    2001:db8:1::/120 dev eth1 proto kernel metric 256  pref medium
    2001:db8:2::/120 dev eth2 proto kernel metric 256  pref medium
    2001:db8:200::/120 via 2001:db8:1::2 dev eth1 metric 1024  pref medium
    2001:db8:200::/120 via 2001:db8:2::2 dev eth2 metric 1024  pref medium
    ...

    $ ip -6 ro del vrf red 2001:db8:200::/120

    $ ip -6 ro ls vrf red
    2001:db8:1::/120 dev eth1 proto kernel metric 256  pref medium
    2001:db8:2::/120 dev eth2 proto kernel metric 256  pref medium
    ...

Because IPv6 allows individual nexthops to be deleted without deleting
the entire route, the ip6_route_multipath_del and non-multipath code
path (ip6_route_del) have to be discriminated so that all nexthops are
only deleted for the latter case. This is done by making the existing
fc_type in fib6_config a u16 and then adding a new u16 field with
fc_delete_all_nh as the first bit.

Suggested-by: Dinesh Dutt <ddutt@cumulusnetworks.com>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-04 19:58:14 -05:00
Eric Dumazet 6e7bc478c9 net: skb_needs_check() accepts CHECKSUM_NONE for tx
My recent change missed fact that UFO would perform a complete
UDP checksum before segmenting in frags.

In this case skb->ip_summed is set to CHECKSUM_NONE.

We need to add this valid case to skb_needs_check()

Fixes: b2504a5dbe ("net: reduce skb_warn_bad_offload() noise")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-03 17:33:01 -05:00
Eric Dumazet 79e7fff47b net: remove support for per driver ndo_busy_poll()
We added generic support for busy polling in NAPI layer in linux-4.5

No network driver uses ndo_busy_poll() anymore, we can get rid
of the pointer in struct net_device_ops, and its use in sk_busy_loop()

Saves NETIF_F_BUSY_POLL features bit.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-03 17:28:29 -05:00
David S. Miller 52e01b84a2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for your net-next
tree, they are:

1) Stash ctinfo 3-bit field into pointer to nf_conntrack object from
   sk_buff so we only access one single cacheline in the conntrack
   hotpath. Patchset from Florian Westphal.

2) Don't leak pointer to internal structures when exporting x_tables
   ruleset back to userspace, from Willem DeBruijn. This includes new
   helper functions to copy data to userspace such as xt_data_to_user()
   as well as conversions of our ip_tables, ip6_tables and arp_tables
   clients to use it. Not surprinsingly, ebtables requires an ad-hoc
   update. There is also a new field in x_tables extensions to indicate
   the amount of bytes that we copy to userspace.

3) Add nf_log_all_netns sysctl: This new knob allows you to enable
   logging via nf_log infrastructure for all existing netnamespaces.
   Given the effort to provide pernet syslog has been discontinued,
   let's provide a way to restore logging using netfilter kernel logging
   facilities in trusted environments. Patch from Michal Kubecek.

4) Validate SCTP checksum from conntrack helper, from Davide Caratti.

5) Merge UDPlite conntrack and NAT helpers into UDP, this was mostly
   a copy&paste from the original helper, from Florian Westphal.

6) Reset netfilter state when duplicating packets, also from Florian.

7) Remove unnecessary check for broadcast in IPv6 in pkttype match and
   nft_meta, from Liping Zhang.

8) Add missing code to deal with loopback packets from nft_meta when
   used by the netdev family, also from Liping.

9) Several cleanups on nf_tables, one to remove unnecessary check from
   the netlink control plane path to add table, set and stateful objects
   and code consolidation when unregister chain hooks, from Gao Feng.

10) Fix harmless reference counter underflow in IPVS that, however,
    results in problems with the introduction of the new refcount_t
    type, from David Windsor.

11) Enable LIBCRC32C from nf_ct_sctp instead of nf_nat_sctp,
    from Davide Caratti.

12) Missing documentation on nf_tables uapi header, from Liping Zhang.

13) Use rb_entry() helper in xt_connlimit, from Geliang Tang.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-03 16:58:20 -05:00
Jiri Pirko 69ca05ce9d sched: cls_flower: expose priority to offloading netdevice
The driver that offloads flower rules needs to know with which priority
user inserted the rules. So add this information into offload struct.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-03 16:35:43 -05:00
Eric Dumazet 38ab52e8e1 tcp: clear pfmemalloc on outgoing skb
Josef Bacik diagnosed following problem :

   I was seeing random disconnects while testing NBD over loopback.
   This turned out to be because NBD sets pfmemalloc on it's socket,
   however the receiving side is a user space application so does not
   have pfmemalloc set on its socket. This means that
   sk_filter_trim_cap will simply drop this packet, under the
   assumption that the other side will simply retransmit. Well we do
   retransmit, and then the packet is just dropped again for the same
   reason.

It seems the better way to address this problem is to clear pfmemalloc
in the TCP transmit path. pfmemalloc strict control really makes sense
on the receive path.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-03 16:23:57 -05:00
David Ahern 94b5e0f970 net: ipv6: Set protocol to kernel for local routes
IPv6 stack does not set the protocol for local routes, so those routes show
up with proto "none":
    $ ip -6 ro ls table local
    local ::1 dev lo proto none metric 0  pref medium
    local 2100:3:: dev lo proto none metric 0  pref medium
    local 2100:3::4 dev lo proto none metric 0  pref medium
    local fe80:: dev lo proto none metric 0  pref medium
    ...

Set rt6i_protocol to RTPROT_KERNEL for consistency with IPv4. Now routes
show up with proto "kernel":
    $ ip -6 ro ls table local
    local ::1 dev lo proto kernel metric 0  pref medium
    local 2100:3:: dev lo proto kernel metric 0  pref medium
    local 2100:3::4 dev lo proto kernel metric 0  pref medium
    local fe80:: dev lo proto kernel metric 0  pref medium
    ...

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-03 16:01:44 -05:00
Roopa Prabhu 11538d039a bridge: vlan dst_metadata hooks in ingress and egress paths
- ingress hook:
    - if port is a tunnel port, use tunnel info in
      attached dst_metadata to map it to a local vlan
- egress hook:
    - if port is a tunnel port, use tunnel info attached to
      vlan to set dst_metadata on the skb

CC: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-03 15:21:22 -05:00
Roopa Prabhu efa5356b0d bridge: per vlan dst_metadata netlink support
This patch adds support to attach per vlan tunnel info dst
metadata. This enables bridge driver to map vlan to tunnel_info
at ingress and egress. It uses the kernel dst_metadata infrastructure.

The initial use case is vlan to vni bridging, but the api is generic
to extend to any tunnel_info in the future:
    - Uapi to configure/unconfigure/dump per vlan tunnel data
    - netlink functions to configure vlan and tunnel_info mapping
    - Introduces bridge port flag BR_LWT_VLAN to enable attach/detach
    dst_metadata to bridged packets on ports. off by default.
    - changes to existing code is mainly refactor some existing vlan
    handling netlink code + hooks for new vlan tunnel code
    - I have kept the vlan tunnel code isolated in separate files.
    - most of the netlink vlan tunnel code is handling of vlan-tunid
    ranges (follows the vlan range handling code). To conserve space
    vlan-tunid by default are always dumped in ranges if applicable.

Use case:
example use for this is a vxlan bridging gateway or vtep
which maps vlans to vn-segments (or vnis).

iproute2 example (patched and pruned iproute2 output to just show
relevant fdb entries):
example shows same host mac learnt on two vni's and
vlan 100 maps to vni 1000, vlan 101 maps to vni 1001

before (netdev per vni):
$bridge fdb show | grep "00:02:00:00:00:03"
00:02:00:00:00:03 dev vxlan1001 vlan 101 master bridge
00:02:00:00:00:03 dev vxlan1001 dst 12.0.0.8 self
00:02:00:00:00:03 dev vxlan1000 vlan 100 master bridge
00:02:00:00:00:03 dev vxlan1000 dst 12.0.0.8 self

after this patch with collect metdata in bridged mode (single netdev):
$bridge fdb show | grep "00:02:00:00:00:03"
00:02:00:00:00:03 dev vxlan0 vlan 101 master bridge
00:02:00:00:00:03 dev vxlan0 src_vni 1001 dst 12.0.0.8 self
00:02:00:00:00:03 dev vxlan0 vlan 100 master bridge
00:02:00:00:00:03 dev vxlan0 src_vni 1000 dst 12.0.0.8 self

CC: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-03 15:21:22 -05:00
Yotam Gigi 295a6e06d2 net/sched: act_ife: Change to use ife module
Use the encode/decode functionality from the ife module instead of using
implementation inside the act_ife.

Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Roman Mashak <mrv@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-03 15:16:46 -05:00
Yotam Gigi 1ce8460496 net: Introduce ife encapsulation module
This module is responsible for the ife encapsulation protocol
encode/decode logics. That module can:
 - ife_encode: encode skb and reserve space for the ife meta header
 - ife_decode: decode skb and extract the meta header size
 - ife_tlv_meta_encode - encodes one tlv entry into the reserved ife
   header space.
 - ife_tlv_meta_decode - decodes one tlv entry from the packet
 - ife_tlv_meta_next - advance to the next tlv

Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Roman Mashak <mrv@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-03 15:16:45 -05:00
Yotam Gigi 1d5e7c859e net/sched: act_ife: Unexport ife_tlv_meta_encode
As the function ife_tlv_meta_encode is not used by any other module,
unexport it and make it static for the act_ife module.

Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Roman Mashak <mrv@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-03 15:16:45 -05:00
Eric Dumazet 3541f9e8bd tcp: add tcp_mss_clamp() helper
Small cleanup factorizing code doing the TCP_MAXSEG clamping.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-03 11:19:34 -05:00
Eric Dumazet 8fe809a992 net: add LINUX_MIB_PFMEMALLOCDROP counter
Debugging issues caused by pfmemalloc is often tedious.

Add a new SNMP counter to more easily diagnose these problems.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Josef Bacik <jbacik@fb.com>
Acked-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-02 23:34:19 -05:00
David Ahern 6610910939 net: ipv4: remove fib_lookup.h from devinet.c include list
nothing in devinet.c relies on fib_lookup.h; remove it from the includes

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-02 23:09:08 -05:00
Eric Dumazet b9ea2a7be7 net: remove useless pfmemalloc setting
When __alloc_skb() allocates an skb from fast clone cache,
setting pfmemalloc on the clone is not needed.

Clone will be properly initialized later at skb_clone() time,
including pfmemalloc field, as it is included in the
headers_start/headers_end section which is fully copied.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-02 23:03:05 -05:00
Andrey Vagin ba94f3088b unix: add ioctl to open a unix socket file with O_PATH
This ioctl opens a file to which a socket is bound and
returns a file descriptor. The caller has to have CAP_NET_ADMIN
in the socket network namespace.

Currently it is impossible to get a path and a mount point
for a socket file. socket_diag reports address, device ID and inode
number for unix sockets. An address can contain a relative path or
a file may be moved somewhere. And these properties say nothing about
a mount namespace and a mount point of a socket file.

With the introduced ioctl, we can get a path by reading
/proc/self/fd/X and get mnt_id from /proc/self/fdinfo/X.

In CRIU we are going to use this ioctl to dump and restore unix socket.

Here is an example how it can be used:

$ strace -e socket,bind,ioctl ./test /tmp/test_sock
socket(AF_UNIX, SOCK_STREAM, 0)         = 3
bind(3, {sa_family=AF_UNIX, sun_path="test_sock"}, 11) = 0
ioctl(3, SIOCUNIXFILE, 0)           = 4
^Z

$ ss -a | grep test_sock
u_str  LISTEN     0      1      test_sock 17798                 * 0

$ ls -l /proc/760/fd/{3,4}
lrwx------ 1 root root 64 Feb  1 09:41 3 -> 'socket:[17798]'
l--------- 1 root root 64 Feb  1 09:41 4 -> /tmp/test_sock

$ cat /proc/760/fdinfo/4
pos:	0
flags:	012000000
mnt_id:	40

$ cat /proc/self/mountinfo | grep "^40\s"
40 19 0:37 / /tmp rw shared:23 - tmpfs tmpfs rw

Signed-off-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-02 21:58:02 -05:00
David S. Miller e2160156bf Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
All merge conflicts were simple overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-02 16:54:00 -05:00
Michal Kubeček 2851940ffe netfilter: allow logging from non-init namespaces
Commit 69b34fb996 ("netfilter: xt_LOG: add net namespace support for
xt_LOG") disabled logging packets using the LOG target from non-init
namespaces. The motivation was to prevent containers from flooding
kernel log of the host. The plan was to keep it that way until syslog
namespace implementation allows containers to log in a safe way.

However, the work on syslog namespace seems to have hit a dead end
somewhere in 2013 and there are users who want to use xt_LOG in all
network namespaces. This patch allows to do so by setting

  /proc/sys/net/netfilter/nf_log_all_netns

to a nonzero value. This sysctl is only accessible from init_net so that
one cannot switch the behaviour from inside a container.

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-02-02 14:31:58 +01:00
David Windsor 90c1aff702 ipvs: free ip_vs_dest structs when refcnt=0
Currently, the ip_vs_dest cache frees ip_vs_dest objects when their
reference count becomes < 0.  Aside from not being semantically sound,
this is problematic for the new type refcount_t, which will be introduced
shortly in a separate patch. refcount_t is the new kernel type for
holding reference counts, and provides overflow protection and a
constrained interface relative to atomic_t (the type currently being
used for kernel reference counts).

Per Julian Anastasov: "The problem is that dest_trash currently holds
deleted dests (unlinked from RCU lists) with refcnt=0."  Changing
dest_trash to hold dest with refcnt=1 will allow us to free ip_vs_dest
structs when their refcnt=0, in ip_vs_dest_put_and_free().

Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-02-02 14:31:57 +01:00
Florian Westphal a9e419dc7b netfilter: merge ctinfo into nfct pointer storage area
After this change conntrack operations (lookup, creation, matching from
ruleset) only access one instead of two sk_buff cache lines.

This works for normal conntracks because those are allocated from a slab
that guarantees hw cacheline or 8byte alignment (whatever is larger)
so the 3 bits needed for ctinfo won't overlap with nf_conn addresses.

Template allocation now does manual address alignment (see previous change)
on arches that don't have sufficent kmalloc min alignment.

Some spots intentionally use skb->_nfct instead of skb_nfct() helpers,
this is to avoid undoing the skb_nfct() use when we remove untracked
conntrack object in the future.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-02-02 14:31:56 +01:00
Florian Westphal 3032230920 netfilter: guarantee 8 byte minalign for template addresses
The next change will merge skb->nfct pointer and skb->nfctinfo
status bits into single skb->_nfct (unsigned long) area.

For this to work nf_conn addresses must always be aligned at least on
an 8 byte boundary since we will need the lower 3bits to store nfctinfo.

Conntrack templates are allocated via kmalloc.
kbuild test robot reported
BUILD_BUG_ON failed: NFCT_INFOMASK >= ARCH_KMALLOC_MINALIGN
on v1 of this patchset, so not all platforms meet this requirement.

Do manual alignment if needed,  the alignment offset is stored in the
nf_conn entry protocol area. This works because templates are not
handed off to L4 protocol trackers.

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-02-02 14:31:55 +01:00
Florian Westphal c74454fadd netfilter: add and use nf_ct_set helper
Add a helper to assign a nf_conn entry and the ctinfo bits to an sk_buff.
This avoids changing code in followup patch that merges skb->nfct and
skb->nfctinfo into skb->_nfct.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-02-02 14:31:54 +01:00
Florian Westphal cb9c68363e skbuff: add and use skb_nfct helper
Followup patch renames skb->nfct and changes its type so add a helper to
avoid intrusive rename change later.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-02-02 14:31:53 +01:00
Florian Westphal 97a6ad13de netfilter: reduce direct skb->nfct usage
Next patch makes direct skb->nfct access illegal, reduce noise
in next patch by using accessors we already have.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-02-02 14:31:52 +01:00
Florian Westphal 6e10148c5c netfilter: reset netfilter state when duplicating packet
We should also toss nf_bridge_info, if any -- packet is leaving via
ip_local_out, also, this skb isn't bridged -- it is a locally generated
copy.  Also this avoids the need to touch this later when skb->nfct is
replaced with 'unsigned long _nfct' in followup patch.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-02-02 14:31:51 +01:00
Florian Westphal 11df4b760f netfilter: conntrack: no need to pass ctinfo to error handler
It is never accessed for reading and the only places that write to it
are the icmp(6) handlers, which also set skb->nfct (and skb->nfctinfo).

The conntrack core specifically checks for attached skb->nfct after
->error() invocation and returns early in this case.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-02-02 14:31:51 +01:00
Feng 10435c1192 netfilter: nf_tables: Eliminate duplicated code in nf_tables_table_enable()
If something fails in nf_tables_table_enable(), it unregisters the
chains. But the rollback code is the same as nf_tables_table_disable()
almostly, except there is one counter check.  Now create one wrapper
function to eliminate the duplicated codes.

Signed-off-by: Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-02-02 14:30:19 +01:00
Linus Torvalds 6d04dfc896 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Fix handling of interrupt status in stmmac driver. Just because we
    have masked the event from generating interrupts, doesn't mean the
    bit won't still be set in the interrupt status register. From Alexey
    Brodkin.

 2) Fix DMA API debugging splats in gianfar driver, from Arseny Solokha.

 3) Fix off-by-one error in __ip6_append_data(), from Vlad Yasevich.

 4) cls_flow does not match on icmpv6 codes properly, from Simon Horman.

 5) Initial MAC address can be set incorrectly in some scenerios, from
    Ivan Vecera.

 6) Packet header pointer arithmetic fix in ip6_tnl_parse_tlv_end_lim(),
    from Dan Carpenter.

 7) Fix divide by zero in __tcp_select_window(), from Eric Dumazet.

 8) Fix crash in iwlwifi when unregistering thermal zone, from Jens
    Axboe.

 9) Check for DMA mapping errors in starfire driver, from Alexey
    Khoroshilov.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (31 commits)
  tcp: fix 0 divide in __tcp_select_window()
  ipv6: pointer math error in ip6_tnl_parse_tlv_enc_lim()
  net: fix ndo_features_check/ndo_fix_features comment ordering
  net/sched: matchall: Fix configuration race
  be2net: fix initial MAC setting
  ipv6: fix flow labels when the traffic class is non-0
  net: thunderx: avoid dereferencing xcv when NULL
  net/sched: cls_flower: Correct matching on ICMPv6 code
  ipv6: Paritially checksum full MTU frames
  net/mlx4_core: Avoid command timeouts during VF driver device shutdown
  gianfar: synchronize DMA API usage by free_skb_rx_queue w/ gfar_new_page
  net: ethtool: add support for 2500BaseT and 5000BaseT link modes
  can: bcm: fix hrtimer/tasklet termination in bcm op removal
  net: adaptec: starfire: add checks for dma mapping errors
  net: phy: micrel: KSZ8795 do not set SUPPORTED_[Asym_]Pause
  can: Fix kernel panic at security_sock_rcv_skb
  net: macb: Fix 64 bit addressing support for GEM
  stmmac: Discard masked flags in interrupt status register
  net/mlx5e: Check ets capability before ets query FW command
  net/mlx5e: Fix update of hash function/key via ethtool
  ...
2017-02-01 11:52:27 -08:00
Yotam Gigi f3b20313ae net/sched: act_psample: Remove unnecessary ASSERT_RTNL
The ASSERT_RTNL is not necessary in the init function, as it does not
touch any rtnl protected structures, as opposed to the mirred action which
does have to hold a net device.

Reported-by: Cong Wang <xiyou.wangcong@gmail.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-01 14:10:03 -05:00
Yotam Gigi cadb9c9fdb net/sched: act_sample: Fix error path in init
Fix error path of in sample init, by releasing the tc hash in case of
failure in psample_group creation.

Fixes: 5c5670fae4 ("net/sched: Introduce sample tc action")
Reported-by: Cong Wang <xiyou.wangcong@gmail.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-01 14:10:03 -05:00
Eric Dumazet 06425c308b tcp: fix 0 divide in __tcp_select_window()
syszkaller fuzzer was able to trigger a divide by zero, when
TCP window scaling is not enabled.

SO_RCVBUF can be used not only to increase sk_rcvbuf, also
to decrease it below current receive buffers utilization.

If mss is negative or 0, just return a zero TCP window.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dmitry Vyukov  <dvyukov@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-01 12:55:42 -05:00
Dan Carpenter 63117f09c7 ipv6: pointer math error in ip6_tnl_parse_tlv_enc_lim()
Casting is a high precedence operation but "off" and "i" are in terms of
bytes so we need to have some parenthesis here.

Fixes: fbfa743a9d ("ipv6: fix ip6_tnl_parse_tlv_enc_lim()")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-01 12:27:33 -05:00
David Ahern 1f5e29ce79 net: ipv6: add NLM_F_APPEND in notifications when applicable
IPv6 does not set the NLM_F_APPEND flag in notifications to signal that
a NEWROUTE is an append versus a new route or a replaced one. Add the
flag if the request has it.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-01 12:13:52 -05:00
Eric Dumazet b2504a5dbe net: reduce skb_warn_bad_offload() noise
Dmitry reported warnings occurring in __skb_gso_segment() [1]

All SKB_GSO_DODGY producers can allow user space to feed
packets that trigger the current check.

We could prevent them from doing so, rejecting packets, but
this might add regressions to existing programs.

It turns out our SKB_GSO_DODGY handlers properly set up checksum
information that is needed anyway when packets needs to be segmented.

By checking again skb_needs_check() after skb_mac_gso_segment(),
we should remove these pesky warnings, at a very minor cost.

With help from Willem de Bruijn

[1]
WARNING: CPU: 1 PID: 6768 at net/core/dev.c:2439 skb_warn_bad_offload+0x2af/0x390 net/core/dev.c:2434
lo: caps=(0x000000a2803b7c69, 0x0000000000000000) len=138 data_len=0 gso_size=15883 gso_type=4 ip_summed=0
Kernel panic - not syncing: panic_on_warn set ...

CPU: 1 PID: 6768 Comm: syz-executor1 Not tainted 4.9.0 #5
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
 ffff8801c063ecd8 ffffffff82346bdf ffffffff00000001 1ffff100380c7d2e
 ffffed00380c7d26 0000000041b58ab3 ffffffff84b37e38 ffffffff823468f1
 ffffffff84820740 ffffffff84f289c0 dffffc0000000000 ffff8801c063ee20
Call Trace:
 [<ffffffff82346bdf>] __dump_stack lib/dump_stack.c:15 [inline]
 [<ffffffff82346bdf>] dump_stack+0x2ee/0x3ef lib/dump_stack.c:51
 [<ffffffff81827e34>] panic+0x1fb/0x412 kernel/panic.c:179
 [<ffffffff8141f704>] __warn+0x1c4/0x1e0 kernel/panic.c:542
 [<ffffffff8141f7e5>] warn_slowpath_fmt+0xc5/0x100 kernel/panic.c:565
 [<ffffffff8356cbaf>] skb_warn_bad_offload+0x2af/0x390 net/core/dev.c:2434
 [<ffffffff83585cd2>] __skb_gso_segment+0x482/0x780 net/core/dev.c:2706
 [<ffffffff83586f19>] skb_gso_segment include/linux/netdevice.h:3985 [inline]
 [<ffffffff83586f19>] validate_xmit_skb+0x5c9/0xc20 net/core/dev.c:2969
 [<ffffffff835892bb>] __dev_queue_xmit+0xe6b/0x1e70 net/core/dev.c:3383
 [<ffffffff8358a2d7>] dev_queue_xmit+0x17/0x20 net/core/dev.c:3424
 [<ffffffff83ad161d>] packet_snd net/packet/af_packet.c:2930 [inline]
 [<ffffffff83ad161d>] packet_sendmsg+0x32ed/0x4d30 net/packet/af_packet.c:2955
 [<ffffffff834f0aaa>] sock_sendmsg_nosec net/socket.c:621 [inline]
 [<ffffffff834f0aaa>] sock_sendmsg+0xca/0x110 net/socket.c:631
 [<ffffffff834f329a>] ___sys_sendmsg+0x8fa/0x9f0 net/socket.c:1954
 [<ffffffff834f5e58>] __sys_sendmsg+0x138/0x300 net/socket.c:1988
 [<ffffffff834f604d>] SYSC_sendmsg net/socket.c:1999 [inline]
 [<ffffffff834f604d>] SyS_sendmsg+0x2d/0x50 net/socket.c:1995
 [<ffffffff84371941>] entry_SYSCALL_64_fastpath+0x1f/0xc2

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dmitry Vyukov  <dvyukov@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-01 12:02:48 -05:00
Yotam Gigi fd62d9f5c5 net/sched: matchall: Fix configuration race
In the current version, the matchall internal state is split into two
structs: cls_matchall_head and cls_matchall_filter. This makes little
sense, as matchall instance supports only one filter, and there is no
situation where one exists and the other does not. In addition, that led
to some races when filter was deleted while packet was processed.

Unify that two structs into one, thus simplifying the process of matchall
creation and deletion. As a result, the new, delete and get callbacks have
a dummy implementation where all the work is done in destroy and change
callbacks, as was done in cls_cgroup.

Fixes: bf3994d2ed ("net/sched: introduce Match-all classifier")
Reported-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-01 11:57:33 -05:00
Theuns Verwoerd 160ca01424 rtnetlink: Handle IFLA_MASTER parameter when processing rtnl_newlink
Allow a master interface to be specified as one of the parameters when
creating a new interface via rtnl_newlink.  Previously this would
require invoking interface creation, waiting for it to complete, and
then separately binding that new interface to a master.

In particular, this is used when creating a macvlan child interface for
VRRP in a VRF configuration, allowing the interface creator to specify
directly what master interface should be inherited by the child,
without having to deal with asynchronous complications and potential
race conditions.

Signed-off-by: Theuns Verwoerd <theuns.verwoerd@alliedtelesis.co.nz>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-01 11:53:23 -05:00
David S. Miller 04cdf13e34 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next
Steffen Klassert says:

====================
pull request (net-next): ipsec-next 2017-02-01

1) Some typo fixes, from Alexander Alemayhu.

2) Don't acquire state lock in get_mtu functions.
   The only rece against a dead state does not matter.
   From Florian Westphal.

3) Remove xfrm4_state_fini, it is unused for more than
   10 years. From Florian Westphal.

4) Various rcu usage improvements. From Florian Westphal.

5) Properly handle crypto arrors in ah4/ah6.
   From Gilad Ben-Yossef.

6) Try to avoid skb linearization in esp4 and esp6.

7) The esp trailer is now set up in different places,
   add a helper for this.

8) With the upcomming usage of gro_cells in IPsec,
   a gro merged skb can have a secpath. Drop it
   before freeing or reusing the skb.

9) Add a xfrm dummy network device for napi. With
   this we can use gro_cells from within xfrm,
   it allows IPsec GRO without impact on the generic
   networking code.

Please pull or let me know if there are problems.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-01 11:22:38 -05:00
Alexei Starovoitov 4d1ceea851 net: ethtool: convert large order kmalloc allocations to vzalloc
under memory pressure 'ethtool -S' command may warn:
[ 2374.385195] ethtool: page allocation failure: order:4, mode:0x242c0c0
[ 2374.405573] CPU: 12 PID: 40211 Comm: ethtool Not tainted
[ 2374.423071] Call Trace:
[ 2374.423076]  [<ffffffff8148cb29>] dump_stack+0x4d/0x64
[ 2374.423080]  [<ffffffff811667cb>] warn_alloc_failed+0xeb/0x150
[ 2374.423082]  [<ffffffff81169cd3>] ? __alloc_pages_direct_compact+0x43/0xf0
[ 2374.423084]  [<ffffffff8116a25c>] __alloc_pages_nodemask+0x4dc/0xbf0
[ 2374.423091]  [<ffffffffa0023dc2>] ? cmd_exec+0x722/0xcd0 [mlx5_core]
[ 2374.423095]  [<ffffffff811b3dcc>] alloc_pages_current+0x8c/0x110
[ 2374.423097]  [<ffffffff81168859>] alloc_kmem_pages+0x19/0x90
[ 2374.423099]  [<ffffffff81186e5e>] kmalloc_order_trace+0x2e/0xe0
[ 2374.423101]  [<ffffffff811c0084>] __kmalloc+0x204/0x220
[ 2374.423105]  [<ffffffff816c269e>] dev_ethtool+0xe4e/0x1f80
[ 2374.423106]  [<ffffffff816b967e>] ? dev_get_by_name_rcu+0x5e/0x80
[ 2374.423108]  [<ffffffff816d6926>] dev_ioctl+0x156/0x560
[ 2374.423111]  [<ffffffff811d4c68>] ? mem_cgroup_commit_charge+0x78/0x3c0
[ 2374.423117]  [<ffffffff8169d542>] sock_do_ioctl+0x42/0x50
[ 2374.423119]  [<ffffffff8169d9c3>] sock_ioctl+0x1b3/0x250
[ 2374.423121]  [<ffffffff811f0f42>] do_vfs_ioctl+0x92/0x580
[ 2374.423123]  [<ffffffff8100222b>] ? do_audit_syscall_entry+0x4b/0x70
[ 2374.423124]  [<ffffffff8100287c>] ? syscall_trace_enter_phase1+0xfc/0x120
[ 2374.423126]  [<ffffffff811f14a9>] SyS_ioctl+0x79/0x90
[ 2374.423127]  [<ffffffff81002bb0>] do_syscall_64+0x50/0xa0
[ 2374.423129]  [<ffffffff817e19bc>] entry_SYSCALL64_slow_path+0x25/0x25

~1160 mlx5 counters ~= order 4 allocation which is unlikely to succeed
under memory pressure. Convert them to vzalloc() as ethtool_get_regs() does.
Also take care of drivers without counters similar to
commit 67ae7cf1ee ("ethtool: Allow zero-length register dumps again")
and reduce warn_on to warn_on_once.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-31 13:28:06 -05:00
Simon Horman 040587af31 net/sched: cls_flower: Correct matching on ICMPv6 code
When matching on the ICMPv6 code ICMPV6_CODE rather than
ICMPV4_CODE attributes should be used.

This corrects what appears to be a typo.

Sample usage:

tc qdisc add dev eth0 ingress
tc filter add dev eth0 protocol ipv6 parent ffff: flower \
	indev eth0 ip_proto icmpv6 type 128 code 0 action drop

Without this change the code parameter above is effectively ignored.

Fixes: 7b684884fb ("net/sched: cls_flower: Support matching on ICMP type and code")
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-30 16:42:09 -05:00
David S. Miller 0d29ed28da linux-can-fixes-for-4.10-20170130
-----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCgAxFiEES2FAuYbJvAGobdVQPTuqJaypJWoFAliPEGcTHG1rbEBwZW5n
 dXRyb25peC5kZQAKCRA9O6olrKklakWHB/971KHqyYL5Rhbo8rOl+XKIpTLSFCJi
 tgaN1mOUaBl2UIlevzW/9kNP7DkT+0axaEdFwsfZCm3opLwCaYI9TtxbDLmMx8VW
 p7/djLDDcE9hgH1J6PnHL6w+Mlnr6NcdmVlU7lm8YyiCcALQp6EdYXVh1T1/xD0+
 ytoqO2A8pgvR1VfcD7Q4M3qE0L9JQRrFhDenStn8fql6fpb2RqFT2iogQF6hta1Y
 YNZx0pK24gpCDTWqR06BBuHHDLpS6nGmnMl5Qvft1FBmSsvKpp4qAMN2IVyYA402
 SdEHJHlo9SmqNZmQg2QQg5Hn0KJ4iY7PxLJGc2yPvZowa9ewBRRW2xYG
 =CiTr
 -----END PGP SIGNATURE-----

Merge tag 'linux-can-fixes-for-4.10-20170130' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can

Marc Kleine-Budde says:

====================
pull-request: can 2017-01-30

this is a pull request of one patch.

The patch is by Oliver Hartkopp and fixes the hrtimer/tasklet termination in
bcm op removal.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-30 16:38:39 -05:00