Based on 1 normalized pattern(s):
this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license as published by
the free software foundation either version 2 of the license or at
your option any later version
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-or-later
has been chosen to replace the boilerplate/reference in 3029 file(s).
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Vrf device is not able to change mac address now because lack of
ndo_set_mac_address. Complete this in case some apps need to do
this.
Reported-by: Hui Wang <wanghui104@huawei.com>
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A common theme in the output path is looking up a neigh entry for a
nexthop, either the gateway in an rtable or a fallback to the daddr
in the skb:
nexthop = (__force u32)rt_nexthop(rt, ip_hdr(skb)->daddr);
neigh = __ipv4_neigh_lookup_noref(dev, nexthop);
if (unlikely(!neigh))
neigh = __neigh_create(&arp_tbl, &nexthop, dev, false);
To allow the nexthop to be an IPv6 address we need to consider the
family of the nexthop and then call __ipv{4,6}_neigh_lookup_noref based
on it.
To make this simpler, add a ip_neigh_gw4 helper similar to ip_neigh_gw6
added in an earlier patch which handles:
neigh = __ipv4_neigh_lookup_noref(dev, nexthop);
if (unlikely(!neigh))
neigh = __neigh_create(&arp_tbl, &nexthop, dev, false);
And then add a second one, ip_neigh_for_gw, that calls either
ip_neigh_gw4 or ip_neigh_gw6 based on the address family of the gateway.
Update the output paths in the VRF driver and core v4 code to use
ip_neigh_for_gw simplifying the family based lookup and making both
ready for a v6 nexthop.
ipv4_neigh_lookup has a different need - the potential to resolve a
passed in address in addition to any gateway in the rtable or skb. Since
this is a one-off, add ip_neigh_gw4 and ip_neigh_gw6 diectly. The
difference between __neigh_create used by the helpers and neigh_create
called by ipv4_neigh_lookup is taking a refcount, so add rcu_read_lock_bh
and bump the refcnt on the neigh entry.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A later patch allows an IPv6 gateway with an IPv4 route. The neighbor
entry will exist in the v6 ndisc table and the cached header will contain
the ipv6 protocol which is wrong for an IPv4 packet. For an IPv4 packet to
use the v6 neighbor entry, neigh_output needs to skip the cached header
and just use the output callback for the neigh entry.
A future patchset can look at expanding the hh_cache to handle 2
protocols. For now, IPv6 gateways with an IPv4 route will take the
extra overhead of generating the header.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the mtu of a vrf device is set to 0, it would cause ping
failed. So I think we should limit vrf mtu in a reasonable range
to solve this problem. I set dev->min_mtu to IPV6_MIN_MTU, so it
will works for both ipv4 and ipv6. And if dev->max_mtu still be 0
can be confusing, so I set dev->max_mtu to ETH_MAX_MTU.
Here is the reproduce step:
1.Config vrf interface and set mtu to 0:
3: enp4s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel
master vrf1 state UP mode DEFAULT group default qlen 1000
link/ether 52:54:00:9e:dd:c1 brd ff:ff:ff:ff:ff:ff
2.Ping peer:
3: enp4s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel
master vrf1 state UP group default qlen 1000
link/ether 52:54:00:9e:dd:c1 brd ff:ff:ff:ff:ff:ff
inet 10.0.0.1/16 scope global enp4s0
valid_lft forever preferred_lft forever
connect: Network is unreachable
3.Set mtu to default value, ping works:
PING 10.0.0.2 (10.0.0.2) 56(84) bytes of data.
64 bytes from 10.0.0.2: icmp_seq=1 ttl=64 time=1.88 ms
Fixes: ad49bc6361 ("net: vrf: remove MTU limits for vrf device")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
VRF devices don't work with upper devices. Currently, it's possible to
add a VRF device to a bridge or team, and to create macvlan, macsec, or
ipvlan devices on top of a VRF (bond and vlan are prevented respectively
by the lack of an ndo_set_mac_address op and the NETIF_F_VLAN_CHALLENGED
feature flag).
Fix this by setting the IFF_NO_RX_HANDLER flag (introduced in commit
f5426250a6 ("net: introduce IFF_NO_RX_HANDLER")).
Cc: David Ahern <dsahern@gmail.com>
Fixes: 193125dbd8 ("net: Introduce VRF device driver")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Similiar to commit e94cd8113c ("net: remove MTU limits for dummy and
ifb device"), MTU is irrelevant for VRF device. We init it as 64K while
limit it to [68, 1500] may make users feel confused.
Reported-by: Jianlin Shi <jishi@redhat.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to pass extack together with NETDEV_PRE_UP notifications, it's
necessary to route the extack to __dev_open() from diverse (possibly
indirect) callers. One prominent API through which the notification is
invoked is dev_change_flags().
Therefore extend dev_change_flags() with and extra extack argument and
update all users. Most of the calls end up just encoding NULL, but
several sites (VLAN, ipvlan, VRF, rtnetlink) do have extack available.
Since the function declaration line is changed anyway, name the other
function arguments to placate checkpatch.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A follow-up patch will extend dev_change_flags() with an extack
argument. Extend cycle_netdev() to have that argument available for the
conversion.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The skb for packets that are multicast or to a link-local address are
not marked as being enslaved to a VRF, if they are received on a socket
bound to the VRF. This is needed for ND and it is preferable for the
kernel not to have to deal with the additional use-cases if ll or mcast
packets are handled as enslaved. However, this does not allow service
instances listening on unbound and bound to VRF sockets to distinguish
the VRF used, if packets are sent as multicast or to a link-local
address. The fix is for the VRF driver to also mark these skb as being
enslaved to the VRF.
Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Tested-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The code to obtain the correct table for the incoming interface was
missing for IPv6. This has been added along with the table creation
notification to fib rules for the RTNL_FAMILY_IP6MR address family.
Signed-off-by: Patrick Ruddy <pruddy@vyatta.att-mail.com>
Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
SCTP sockets originated in a VRF can improve their performance if CRC32c
computation is delegated to underlying devices: update device features,
setting NETIF_F_SCTP_CRC. Iterating the following command in the topology
proposed with [1],
# ip vrf exec vrf-h2 netperf -H 192.0.2.1 -t SCTP_STREAM -- -m 10K
the measured throughput in Mbit/s improved from 2395 ± 1% to 2720 ± 1%.
[1] https://www.spinics.net/lists/netdev/msg486007.html
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A later patch removes rt6i_table from rt6_info. Save the ipv6
table for a VRF in net_vrf. fib tables can not be deleted so
no reference counting or locking is required.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Minor conflicts in drivers/net/ethernet/mellanox/mlx5/core/en_rep.c,
we had some overlapping changes:
1) In 'net' MLX5E_PARAMS_LOG_{SQ,RQ}_SIZE -->
MLX5E_REP_PARAMS_LOG_{SQ,RQ}_SIZE
2) In 'net-next' params->log_rq_size is renamed to be
params->log_rq_mtu_frames.
3) In 'net-next' params->hard_mtu is added.
Signed-off-by: David S. Miller <davem@davemloft.net>
Miguel reported an skb use after free / double free in vrf_finish_output
when neigh_output returns an error. The vrf driver should return after
the call to neigh_output as it takes over the skb on error path as well.
Patch is a simplified version of Miguel's patch which was written for 4.9,
and updated to top of tree.
Fixes: 8f58336d3f ("net: Add ethernet header for pass through VRF device")
Signed-off-by: Miguel Fadon Perlines <mfadon@teldat.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Synchronous pernet_operations are not allowed anymore.
All are asynchronous. So, drop the structure member.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
IPv6 does path selection for multipath routes deep in the lookup
functions. The next patch adds L4 hash option and needs the skb
for the forward path. To get the skb to the relevant FIB lookup
functions it needs to go through the fib rules layer, so add a
lookup_data argument to the fib_lookup_arg struct.
Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
These pernet_operations make pretty simple actions
like variable initialization on init, debug checks
on exit, and so on, and they obviously are able
to be executed in parallel with any others:
vrf_net_ops
lockd_net_ops
grace_net_ops
xfrm6_tunnel_net_ops
kcm_net_ops
tcf_net_ops
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For ages iproute2 has used `struct rtmsg` as the ancillary header for
FIB rules and in the process set the protocol value to RTPROT_BOOT.
Until ca56209a66 ("net: Allow a rule to track originating protocol")
the kernel rules code ignored the protocol value sent from userspace
and always returned 0 in notifications. To avoid incompatibility with
existing iproute2, send the protocol as a new attribute.
Fixes: cac56209a6 ("net: Allow a rule to track originating protocol")
Signed-off-by: Donald Sharp <sharpd@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow a rule that is being added/deleted/modified or
dumped to contain the originating protocol's id.
The protocol is handled just like a routes originating
protocol is. This is especially useful because there
is starting to be a plethora of different user space
programs adding rules.
Allow the vrf device to specify that the kernel is the originator
of the rule created for this device.
Signed-off-by: Donald Sharp <sharpd@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove rt_table_id from rtable. It was added for getroute to return the
table id that was hit in the lookup. With the changes for fibmatch the
table id can be extracted from the fib_info returned in the fib_result
so it no longer needs to be in rtable directly.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sukumar reported that sends to the local broadcast address
(255.255.255.255) are broken. Check for the address in vrf driver
and do not redirect to the VRF device - similar to multicast
packets.
With this change sockets can use SO_BINDTODEVICE to specify an
egress interface and receive responses. Note: the egress interface
can not be a VRF device but needs to be the enslaved device.
https://bugzilla.kernel.org/show_bug.cgi?id=198521
Reported-by: Sukumar Gopalakrishnan <sukumarg1973@gmail.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Files removed in 'net-next' had their license header updated
in 'net'. We take the remove from 'net-next'.
Signed-off-by: David S. Miller <davem@davemloft.net>
FRA_L3MDEV is defined as U8, but is being added as a U32 attribute. On
big endian architecture, this results in the l3mdev entry not being
added to the FIB rules.
Fixes: 1aa6c4f6b8 ("net: vrf: Add l3mdev rules on first device create")
Signed-off-by: Jeff Barnhill <0xeffeff@gmail.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add extack arg to netdev_upper_dev_link and netdev_master_upper_dev_link
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pass extack to do_set_master and down to ndo_add_slave
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
skb_dst_set(skb, dst) installs a normal (refcounted) dst, there is no
point using skb_dst_force(skb)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When building an allmodconfig kernel with gcc-4.6, we get a rather
odd warning:
drivers/net/vrf.c: In function ‘vrf_ip6_input_dst’:
drivers/net/vrf.c:964:3: error: initialized field with side-effects overwritten [-Werror]
drivers/net/vrf.c:964:3: error: (near initialization for ‘fl6’) [-Werror]
I have no idea what this warning is even trying to say, but it does
seem like a false positive. Reordering the initialization in to match
the structure definition gets rid of the warning, and might also avoid
whatever gcc thinks is wrong here.
Fixes: 9ff7438460 ("net: vrf: Handle ipv6 multicast and link-local addresses")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
The VRF cached rtable and rt6_info for local traffic are no longer
needed and actually prevent local traffic through enslaved devices.
Remove them.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add extack error messages for failure paths creating vrf devices. Once
extack support is added to iproute2, we go from the unhelpful:
$ ip li add foobar type vrf
RTNETLINK answers: Invalid argument
to:
$ ip li add foobar type vrf
Error: VRF table id is missing
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for extended error reporting.
Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for extended error reporting.
Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
DST_NOCACHE flag check has been removed from dst_release() and
dst_hold_safe() in a previous patch because all the dst are now ref
counted properly and can be released based on refcnt only.
Looking at the rest of the DST_NOCACHE use, all of them can now be
removed or replaced with other checks.
So this patch gets rid of all the DST_NOCACHE usage and remove this flag
completely.
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In IPv6 routing code, struct rt6_info is created for each static route
and RTF_CACHE route and inserted into fib6 tree. In both cases, dst
ref count is not taken.
As explained in the previous patch, this leads to the need of the dst
garbage collector.
This patch holds ref count of dst before inserting the route into fib6
tree and properly releases the dst when deleting it from the fib6 tree
as a preparation in order to fully get rid of dst gc later.
Also, correct fib6_age() logic to check dst->__refcnt to be 1 to indicate
no user is referencing the dst.
And remove dst_hold() in vrf_rt6_create() as ip6_dst_alloc() already puts
dst->__refcnt to 1.
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It seems like a historic accident that these return unsigned char *,
and in many places that means casts are required, more often than not.
Make these functions return void * and remove all the casts across
the tree, adding a (u8 *) cast only where the unsigned char pointer
was used directly, all done with the following spatch:
@@
expression SKB, LEN;
typedef u8;
identifier fn = { skb_push, __skb_push, skb_push_rcsum };
@@
- *(fn(SKB, LEN))
+ *(u8 *)fn(SKB, LEN)
@@
expression E, SKB, LEN;
identifier fn = { skb_push, __skb_push, skb_push_rcsum };
type T;
@@
- E = ((T *)(fn(SKB, LEN)))
+ E = fn(SKB, LEN)
@@
expression SKB, LEN;
identifier fn = { skb_push, __skb_push, skb_push_rcsum };
@@
- fn(SKB, LEN)[0]
+ *(u8 *)fn(SKB, LEN)
Note that the last part there converts from push(...)[0] to the
more idiomatic *(u8 *)push(...).
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 1aa6c4f6b8 ("net: vrf: Add l3mdev rules on first device create")
adds the l3mdev FIB rule the first time a VRF device is created. However,
it only creates the rule once and only in the namespace the first device
is created - which may not be init_net. Fix by using the net_generic
capability to make the add_fib_rules flag per network namespace.
Fixes: 1aa6c4f6b8 ("net: vrf: Add l3mdev rules on first device create")
Reported-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Network devices can allocate reasources and private memory using
netdev_ops->ndo_init(). However, the release of these resources
can occur in one of two different places.
Either netdev_ops->ndo_uninit() or netdev->destructor().
The decision of which operation frees the resources depends upon
whether it is necessary for all netdev refs to be released before it
is safe to perform the freeing.
netdev_ops->ndo_uninit() presumably can occur right after the
NETDEV_UNREGISTER notifier completes and the unicast and multicast
address lists are flushed.
netdev->destructor(), on the other hand, does not run until the
netdev references all go away.
Further complicating the situation is that netdev->destructor()
almost universally does also a free_netdev().
This creates a problem for the logic in register_netdevice().
Because all callers of register_netdevice() manage the freeing
of the netdev, and invoke free_netdev(dev) if register_netdevice()
fails.
If netdev_ops->ndo_init() succeeds, but something else fails inside
of register_netdevice(), it does call ndo_ops->ndo_uninit(). But
it is not able to invoke netdev->destructor().
This is because netdev->destructor() will do a free_netdev() and
then the caller of register_netdevice() will do the same.
However, this means that the resources that would normally be released
by netdev->destructor() will not be.
Over the years drivers have added local hacks to deal with this, by
invoking their destructor parts by hand when register_netdevice()
fails.
Many drivers do not try to deal with this, and instead we have leaks.
Let's close this hole by formalizing the distinction between what
private things need to be freed up by netdev->destructor() and whether
the driver needs unregister_netdevice() to perform the free_netdev().
netdev->priv_destructor() performs all actions to free up the private
resources that used to be freed by netdev->destructor(), except for
free_netdev().
netdev->needs_free_netdev is a boolean that indicates whether
free_netdev() should be done at the end of unregister_netdevice().
Now, register_netdevice() can sanely release all resources after
ndo_ops->ndo_init() succeeds, by invoking both ndo_ops->ndo_uninit()
and netdev->priv_destructor().
And at the end of unregister_netdevice(), we invoke
netdev->priv_destructor() and optionally call free_netdev().
Signed-off-by: David S. Miller <davem@davemloft.net>
The current codes only deal with the case that the skb is dropped, it
may meet one use-after-free issue when NF_HOOK returns 0 that means
the skb is stolen by one netfilter rule or hook.
When one netfilter rule or hook stoles the skb and return NF_STOLEN,
it means the skb is taken by the rule, and other modules should not
touch this skb ever. Maybe the skb is queued or freed directly by the
rule.
Now uses the nf_hook instead of NF_HOOK to get the result of netfilter,
and check the return value of nf_hook. Only when its value equals 1, it
means the skb could go ahead. Or reset the skb as NULL.
BTW, because vrf_rcv_finish is empty function, so needn't invoke it
even though nf_hook returns 1. But we need to modify vrf_rcv_finish
to deal with the NF_STOLEN case.
There are two cases when skb is stolen.
1. The skb is stolen and freed directly.
There is nothing we need to do, and vrf_rcv_finish isn't invoked.
2. The skb is queued and reinjected again.
The vrf_rcv_finish would be invoked as okfn, so need to free the
skb in it.
Signed-off-by: Gao Feng <gfree.wind@vip.163.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Moving the loopback into a VRF breaks networking for the default VRF.
Since the VRF device is the loopback for VRF domains, there is no
reason to move the loopback. Given the repercussions, block attempts
to set lo into a VRF.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Reviewed-by: Greg Rose <gvrose8192@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A function in kernel/bpf/syscall.c which got a bug fix in 'net'
was moved to kernel/bpf/verifier.c in 'net-next'.
Signed-off-by: David S. Miller <davem@davemloft.net>
Add netlink_ext_ack arg to rtnl_doit_func. Pass extack arg to nlmsg_parse
for doit functions that call it directly.
This is the first step to using extended error reporting in rtnetlink.
>From here individual subsystems can be updated to set netlink_ext_ack as
needed.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Only need 1 l3mdev FIB rule. Fix setting NLM_F_EXCL in the nlmsghdr.
Fixes: 1aa6c4f6b8 ("net: vrf: Add l3mdev rules on first device create")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/broadcom/genet/bcmmii.c
drivers/net/hyperv/netvsc.c
kernel/bpf/hashtab.c
Almost entirely overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
The VRF driver allows users to implement device based features for an
entire domain. For example, a qdisc or netfilter rules can be attached
to a VRF device or tcpdump can be used to view packets for all devices
in the L3 domain.
The device-based features come with a performance penalty, most
notably in the Tx path. The VRF driver uses the l3mdev_l3_out hook
to switch the dst on an skb to its private dst. This allows the skb
to traverse the xmit stack with the device set to the VRF device
which in turn enables the netfilter and qdisc features. The VRF
driver then performs the FIB lookup again and reinserts the packet.
This patch avoids the redirect for IPv6 packets if a qdisc has not
been attached to a VRF device which is the default config. In this
case the netfilter hooks and network taps are directly traversed in
the l3mdev_l3_out handler. If a qdisc is attached to a VRF device,
then the redirect using the vrf dst is done.
Additional overhead is removed by only checking packet taps if a
socket is open on the device (vrf_dev->ptype_all list is not empty).
Packet sockets bound to any device will still get a copy of the
packet via the real ingress or egress interface.
The end result of this change is a decrease in the overhead of VRF
for the default, baseline case (ie., no netfilter rules, no packet
sockets, no qdisc) from a +3% improvement for UDP which has a lookup
per packet (VRF being better than no l3mdev) to ~2% loss for TCP_CRR
which connects a socket for each request-response.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The VRF driver allows users to implement device based features for an
entire domain. For example, a qdisc or netfilter rules can be attached
to a VRF device or tcpdump can be used to view packets for all devices
in the L3 domain.
The device-based features come with a performance penalty, most
notably in the Tx path. The VRF driver uses the l3mdev_l3_out hook
to switch the dst on an skb to its private dst. This allows the skb
to traverse the xmit stack with the device set to the VRF device
which in turn enables the netfilter and qdisc features. The VRF
driver then performs the FIB lookup again and reinserts the packet.
This patch avoids the redirect for IPv4 packets if a qdisc has not
been attached to a VRF device which is the default config. In this
case the netfilter hooks and network taps are directly traversed in
the l3mdev_l3_out handler. If a qdisc is attached to a VRF device,
then the redirect using the vrf dst is done.
Additional overhead is removed by only checking packet taps if a
socket is open on the device (vrf_dev->ptype_all list is not empty).
Packet sockets bound to any device will still get a copy of the
packet via the real ingress or egress interface.
The end result of this change is a decrease in the overhead of VRF
for the default, baseline case (ie., no netfilter rules, no packet
sockets, no qdisc) to ~3% for UDP which has a lookup per packet and
< 1% overhead for connected sockets that leverage early demux and
avoid FIB lookups.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The VRF driver takes a reference to the inet6_dev on the VRF device for
its rt6_local dst when handling local traffic through the VRF device as
a loopback. When the device is deleted the driver does a put on the idev
but does not reset rt6i_idev in the rt6_info struct. When the dst is
destroyed, dst_destroy calls ip6_dst_destroy which does a second put for
what is essentially the same reference causing it to be prematurely freed.
Reset rt6i_idev after the put in the vrf driver.
Fixes: b4869aa2f8 ("net: vrf: ipv6 support for local traffic to
local addresses")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>