Commit 2efd4fca70 ("ip: in cmsg IP(V6)_ORIGDSTADDR call
pskb_may_pull") avoided a read beyond the end of the skb linear
segment by calling pskb_may_pull.
That function can trigger a BUG_ON in pskb_expand_head if the skb is
shared, which it is when when peeking. It can also return ENOMEM.
Avoid both by switching to safer skb_header_pointer.
Fixes: 2efd4fca70 ("ip: in cmsg IP(V6)_ORIGDSTADDR call pskb_may_pull")
Reported-by: syzbot <syzkaller@googlegroups.com>
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
pull-request: bpf 2019-01-08
The following pull-request contains BPF updates for your *net* tree.
The main changes are:
1) Fix BSD'ism in sendmsg(2) to rewrite unspecified IPv6 dst for
unconnected UDP sockets with [::1] _after_ cgroup BPF invocation,
from Andrey.
2) Follow-up fix to the speculation fix where we need to reject a
corner case for sanitation when ptr and scalars are mixed in the
same alu op. Also, some unrelated minor doc fixes, from Daniel.
3) Fix BPF kselftest's incorrect uses of create_and_get_cgroup()
by not assuming fd of zero value to be the result of an error
case, from Stanislav.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
I realized the last patch calls dev_get_by_index_rcu in a branch not
holding the rcu lock. Add the calls to rcu_read_lock and rcu_read_unlock.
Fixes: ec90ad3349 ("ipv6: Consider sk_bound_dev_if when binding a socket to a v4 mapped address")
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sys_sendmsg has supported unspecified destination IPv6 (wildcard) for
unconnected UDP sockets since 876c7f41. When [::] is passed by user as
destination, sys_sendmsg rewrites it with [::1] to be consistent with
BSD (see "BSD'ism" comment in the code).
This didn't work when cgroup-bpf was enabled though since the rewrite
[::] -> [::1] happened before passing control to cgroup-bpf block where
fl6.daddr was updated with passed by user sockaddr_in6.sin6_addr (that
might or might not be changed by BPF program). That way if user passed
[::] as dst IPv6 it was first rewritten with [::1] by original code from
876c7f41, but then rewritten back with [::] by cgroup-bpf block.
It happened even when BPF_CGROUP_UDP6_SENDMSG program was not present
(CONFIG_CGROUP_BPF=y was enough).
The fix is to apply BSD'ism after cgroup-bpf block so that [::] is
replaced with [::1] no matter where it came from: passed by user to
sys_sendmsg or set by BPF_CGROUP_UDP6_SENDMSG program.
Fixes: 1cedee13d2 ("bpf: Hooks for sys_sendmsg")
Reported-by: Nitin Rawat <nitin.rawat@intel.com>
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Similar to c5ee066333 ("ipv6: Consider sk_bound_dev_if when binding a
socket to an address"), binding a socket to v4 mapped addresses needs to
consider if the socket is bound to a device.
This problem also exists from the beginning of git history.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
syzbot was able to crash one host with the following stack trace :
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] PREEMPT SMP KASAN
CPU: 0 PID: 8625 Comm: syz-executor4 Not tainted 4.20.0+ #8
RIP: 0010:dev_net include/linux/netdevice.h:2169 [inline]
RIP: 0010:icmp6_send+0x116/0x2d30 net/ipv6/icmp.c:426
icmpv6_send
smack_socket_sock_rcv_skb
security_sock_rcv_skb
sk_filter_trim_cap
__sk_receive_skb
dccp_v6_do_rcv
release_sock
This is because a RX packet found socket owned by user and
was stored into socket backlog. Before leaving RCU protected section,
skb->dev was cleared in __sk_receive_skb(). When socket backlog
was finally handled at release_sock() time, skb was fed to
smack_socket_sock_rcv_skb() then icmp6_send()
We could fix the bug in smack_socket_sock_rcv_skb(), or simply
make icmp6_send() more robust against such possibility.
In the future we might provide to icmp6_send() the net pointer
instead of infering it.
Fixes: d66a8acbda ("Smack: Inform peer that IPv6 traffic has been blocked")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Piotr Sawicki <p.sawicki2@partner.samsung.com>
Cc: Casey Schaufler <casey@schaufler-ca.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
I forgot to deal with IPv6 in commit 11789039da ("fou: Prevent unbounded
recursion in GUE error handler").
Now syzbot reported what might be the same type of issue, caused by
gue6_err(), that is, handling exceptions for direct UDP encapsulation in
GUE (UDP-in-UDP) leads to unbounded recursion in the GUE exception
handler.
As it probably doesn't make sense to set up GUE this way, and it's
currently not even possible to configure this, skip exception handling for
UDP (or UDP-Lite) packets encapsulated in UDP (or UDP-Lite) packets with
GUE on IPv6.
Reported-by: syzbot+4ad25edc7a33e4ab91e0@syzkaller.appspotmail.com
Reported-by: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Fixes: b8a51b38e4 ("fou, fou6: ICMP error handlers for FoU and GUE")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This commit fixes a regression in AF_INET/RTM_GETADDR and
AF_INET6/RTM_GETADDR.
Before this commit, the kernel would stop dumping addresses once the first
skb was full and end the stream with NLMSG_DONE(-EMSGSIZE). The error
shouldn't be sent back to netlink_dump so the callback is kept alive. The
userspace is expected to call back with a new empty skb.
Changes from V1:
- The error is not handled in netlink_dump anymore but rather in
inet_dump_ifaddr and inet6_dump_addr directly as suggested by
David Ahern.
Fixes: d7e38611b8 ("net/ipv4: Put target net when address dump fails due to bad attributes")
Fixes: 242afaa696 ("net/ipv6: Put target net when address dump fails due to bad attributes")
Cc: David Ahern <dsahern@gmail.com>
Cc: "David S . Miller" <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: Arthur Gautier <baloo@gandi.net>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
IPv6 does not consider if the socket is bound to a device when binding
to an address. The result is that a socket can be bound to eth0 and then
bound to the address of eth1. If the device is a VRF, the result is that
a socket can only be bound to an address in the default VRF.
Resolve by considering the device if sk_bound_dev_if is set.
This problem exists from the beginning of git history.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dump of a specific table with strict checking enabled is looping. The
problem is that the end of the table dump is not marked in the cb. When
dumping a specific table, cb args 0 and 1 are not used (they are the hash
index and entry with an hash table index when dumping all tables). Re-use
args[0] to hold a 'done' flag for the specific table dump.
Fixes: 13e38901d4 ("net/ipv6: Plumb support for filtering route dumps")
Reported-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In ip6_neigh_lookup(), we must not return errors coming from
neigh_create(): if creation of a neighbour entry fails, the lookup should
return NULL, in the same way as it's done in __neigh_lookup().
Otherwise, callers legitimately checking for a non-NULL return value of
the lookup function might dereference an invalid pointer.
For instance, on neighbour table overflow, ndisc_router_discovery()
crashes ndisc_update() by passing ERR_PTR(-ENOBUFS) as 'neigh' argument.
Reported-by: Jianlin Shi <jishi@redhat.com>
Fixes: f8a1b43b70 ("net/ipv6: Create a neigh_lookup for FIB entries")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
KMSAN detected read beyond end of buffer in vti and sit devices when
passing truncated packets with PF_PACKET. The issue affects additional
ip tunnel devices.
Extend commit 76c0ddd8c3 ("ip6_tunnel: be careful when accessing the
inner header") and commit ccfec9e5cb ("ip_tunnel: be careful when
accessing the inner header").
Move the check to a separate helper and call at the start of each
ndo_start_xmit function in net/ipv4 and net/ipv6.
Minor changes:
- convert dev_kfree_skb to kfree_skb on error path,
as dev_kfree_skb calls consume_skb which is not for error paths.
- use pskb_network_may_pull even though that is pedantic here,
as the same as pskb_may_pull for devices without llheaders.
- do not cache ipv6 hdrs if used only once
(unsafe across pskb_may_pull, was more relevant to earlier patch)
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
'ipv6_find_idev()' returns NULL on error, not an error pointer.
Update the test accordingly and return -ENOBUFS, as already done in
'addrconf_add_dev()', if NULL is returned.
Fixes: ("ipv6: allow userspace to add IFA_F_OPTIMISTIC addresses")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
While flushing the cache via ipv6_sysctl_rtcache_flush(), the call
to proc_dointvec() may fail. The fix adds a check that returns the
error, on failure.
Signed-off-by: Aditya Pakki <pakki001@umn.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for net-next:
1) Support for destination MAC in ipset, from Stefano Brivio.
2) Disallow all-zeroes MAC address in ipset, also from Stefano.
3) Add IPSET_CMD_GET_BYNAME and IPSET_CMD_GET_BYINDEX commands,
introduce protocol version number 7, from Jozsef Kadlecsik.
A follow up patch to fix ip_set_byindex() is also included
in this batch.
4) Honor CTA_MARK_MASK from ctnetlink, from Andreas Jaggi.
5) Statify nf_flow_table_iterate(), from Taehee Yoo.
6) Use nf_flow_table_iterate() to simplify garbage collection in
nf_flow_table logic, also from Taehee Yoo.
7) Don't use _bh variants of call_rcu(), rcu_barrier() and
synchronize_rcu_bh() in Netfilter, from Paul E. McKenney.
8) Remove NFC_* cache definition from the old caching
infrastructure.
9) Remove layer 4 port rover in NAT helpers, use random port
instead, from Florian Westphal.
10) Use strscpy() in ipset, from Qian Cai.
11) Remove NF_NAT_RANGE_PROTO_RANDOM_FULLY branch now that
random port is allocated by default, from Xiaozhou Liu.
12) Ignore NF_NAT_RANGE_PROTO_RANDOM too, from Florian Westphal.
13) Limit port allocation selection routine in NAT to avoid
softlockup splats when most ports are in use, from Florian.
14) Remove unused parameters in nf_ct_l4proto_unregister_sysctl()
from Yafang Shao.
15) Direct call to nf_nat_l4proto_unique_tuple() instead of
indirection, from Florian Westphal.
16) Several patches to remove all layer 4 NAT indirections,
remove nf_nat_l4proto struct, from Florian Westphal.
17) Fix RTP/RTCP source port translation when SNAT is in place,
from Alin Nastac.
18) Selective rule dump per chain, from Phil Sutter.
19) Revisit CLUSTERIP target, this includes a deadlock fix from
netns path, sleep in atomic, remove bogus WARN_ON_ONCE()
and disallow mismatching IP address and MAC address.
Patchset from Taehee Yoo.
20) Update UDP timeout to stream after 2 seconds, from Florian.
21) Shrink UDP established timeout to 120 seconds like TCP timewait.
22) Sysctl knobs to set GRE timeouts, from Yafang Shao.
23) Move seq_print_acct() to conntrack core file, from Florian.
24) Add enum for conntrack sysctl knobs, also from Florian.
25) Place nf_conntrack_acct, nf_conntrack_helper, nf_conntrack_events
and nf_conntrack_timestamp knobs in the core, from Florian Westphal.
As a side effect, shrink netns_ct structure by removing obsolete
sysctl anchors, also from Florian.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
It was reported that IPsec would crash when it encounters an IPv6
reassembled packet because skb->sk is non-zero and not a valid
pointer.
This is because skb->sk is now a union with ip_defrag_offset.
This patch fixes this by resetting skb->sk when exiting from
the reassembly code.
Reported-by: Xiumei Mu <xmu@redhat.com>
Fixes: 219badfaad ("ipv6: frags: get rid of ip6frag_skb_cb/...")
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Lots of conflicts, by happily all cases of overlapping
changes, parallel adds, things of that nature.
Thanks to Stephen Rothwell, Saeed Mahameed, and others
for their guidance in these resolutions.
Signed-off-by: David S. Miller <davem@davemloft.net>
Steffen Klassert says:
====================
pull request (net-next): ipsec-next 2018-12-20
Two last patches for this release cycle:
1) Remove an unused variable in xfrm_policy_lookup_bytype().
From YueHaibing.
2) Fix possible infinite loop in __xfrm6_tunnel_alloc_spi().
Also from YueHaibing.
Please pull or let me know if there are problems.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
syzbot reported the use of uninitialized udp6_addr::sin6_scope_id.
We can just set ::sin6_scope_id to zero, as tunnels are unlikely
to use an IPv6 address that needs a scope id and there is no
interface to bind in this context.
For net-next, it looks different as we have cfg->bind_ifindex there
so we can probably call ipv6_iface_scope_id().
Same for ::sin6_flowinfo, tunnels don't use it.
Fixes: 8024e02879 ("udp: Add udp_sock_create for UDP tunnels to open listener socket")
Reported-by: syzbot+c56449ed3652e6720f30@syzkaller.appspotmail.com
Cc: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
skb_sec_path gains 'const' qualifier to avoid
xt_policy.c: 'skb_sec_path' discards 'const' qualifier from pointer target type
same reasoning as previous conversions: Won't need to touch these
spots anymore when skb->sp is removed.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
It can only return 0 (success) or -ENOMEM.
Change return value to a pointer to secpath struct.
This avoids direct access to skb->sp:
err = secpath_set(skb);
if (!err) ..
skb->sp-> ...
Becomes:
sp = secpath_set(skb)
if (!sp) ..
sp-> ..
This reduces noise in followup patch which is going to remove skb->sp.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This adds an optional extension infrastructure, with ispec (xfrm) and
bridge netfilter as first users.
objdiff shows no changes if kernel is built without xfrm and br_netfilter
support.
The third (planned future) user is Multipath TCP which is still
out-of-tree.
MPTCP needs to map logical mptcp sequence numbers to the tcp sequence
numbers used by individual subflows.
This DSS mapping is read/written from tcp option space on receive and
written to tcp option space on transmitted tcp packets that are part of
and MPTCP connection.
Extending skb_shared_info or adding a private data field to skb fclones
doesn't work for incoming skb, so a different DSS propagation method would
be required for the receive side.
mptcp has same requirements as secpath/bridge netfilter:
1. extension memory is released when the sk_buff is free'd.
2. data is shared after cloning an skb (clone inherits extension)
3. adding extension to an skb will COW the extension buffer if needed.
The "MPTCP upstreaming" effort adds SKB_EXT_MPTCP extension to store the
mapping for tx and rx processing.
Two new members are added to sk_buff:
1. 'active_extensions' byte (filling a hole), telling which extensions
are available for this skb.
This has two purposes.
a) avoids the need to initialize the pointer.
b) allows to "delete" an extension by clearing its bit
value in ->active_extensions.
While it would be possible to store the active_extensions byte
in the extension struct instead of sk_buff, there is one problem
with this:
When an extension has to be disabled, we can always clear the
bit in skb->active_extensions. But in case it would be stored in the
extension buffer itself, we might have to COW it first, if
we are dealing with a cloned skb. On kmalloc failure we would
be unable to turn an extension off.
2. extension pointer, located at the end of the sk_buff.
If the active_extensions byte is 0, the pointer is undefined,
it is not initialized on skb allocation.
This adds extra code to skb clone and free paths (to deal with
refcount/free of extension area) but this replaces similar code that
manages skb->nf_bridge and skb->sp structs in the followup patches of
the series.
It is possible to add support for extensions that are not preseved on
clones/copies.
To do this, it would be needed to define a bitmask of all extensions that
need copy/cow semantics, and change __skb_ext_copy() to check
->active_extensions & SKB_EXT_PRESERVE_ON_CLONE, then just set
->active_extensions to 0 on the new clone.
This isn't done here because all extensions that get added here
need the copy/cow semantics.
v2:
Allocate entire extension space using kmem_cache.
Upside is that this allows better tracking of used memory,
downside is that we will allocate more space than strictly needed in
most cases (its unlikely that all extensions are active/needed at same
time for same skb).
The allocated memory (except the small extension header) is not cleared,
so no additonal overhead aside from memory usage.
Avoid atomic_dec_and_test operation on skb_ext_put()
by using similar trick as kfree_skbmem() does with fclone_ref:
If recount is 1, there is no concurrent user and we can free right away.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This pointer is going to be removed soon, so use the existing helpers in
more places to avoid noise when the removal happens.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
gcc warn this:
net/ipv6/xfrm6_tunnel.c:143 __xfrm6_tunnel_alloc_spi() warn:
always true condition '(spi <= 4294967295) => (0-u32max <= u32max)'
'spi' is u32, which always not greater than XFRM6_TUNNEL_SPI_MAX
because of wrap around. So the second forloop will never reach.
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Steffen Klassert says:
====================
pull request (net-next): ipsec-next 2018-12-18
1) Add xfrm policy selftest scripts.
From Florian Westphal.
2) Split inexact policies into four different search list
classes and use the rbtree infrastructure to store/lookup
the policies. This is to improve the policy lookup
performance after the flowcache removal.
Patches from Florian Westphal.
3) Various coding style fixes, from Colin Ian King.
4) Fix policy lookup logic after adding the inexact policy
search tree infrastructure. From Florian Westphal.
5) Remove a useless remove BUG_ON from xfrm6_dst_ifdown.
From Li RongQing.
6) Use the correct policy direction for lookups on hash
rebuilding. From Florian Westphal.
Please pull or let me know if there are problems.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
mfc6_cache is not needed by ip6mr_forward2 so drop it from the input
argument list.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
SOF_TIMESTAMPING_OPT_ID is supported on TCP, UDP and RAW sockets.
But it was missing on RAW with IPPROTO_IP, PF_PACKET and CAN.
Add skb_setup_tx_timestamp that configures both tx_flags and tskey
for these paths that do not need corking or use bytestream keys.
Fixes: 09c2d251b7 ("net-timestamp: add key to disambiguate concurrent datagrams")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Raw sockets support tx timestamping, but one case is missing.
IPPROTO_RAW takes a separate packet construction path. raw_send_hdrinc
has an explicit call to sock_tx_timestamp, but rawv6_send_hdrinc does
not. Add it.
Fixes: 11878b40ed ("net-timestamp: SOCK_RAW and PING timestamping")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This removes the (now empty) nf_nat_l4proto struct, all its instances
and all the no longer needed runtime (un)register functionality.
nf_nat_need_gre() can be axed as well: the module that calls it (to
load the no-longer-existing nat_gre module) also calls other nat core
functions. GRE nat is now always available if kernel is built with it.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This removes the last l4proto indirection, the two callers, the l3proto
packet mangling helpers for ipv4 and ipv6, now call the
nf_nat_l4proto_manip_pkt() helper.
nf_nat_proto_{dccp,tcp,sctp,gre,icmp,icmpv6} are left behind, even though
they contain no functionality anymore to not clutter this patch.
Next patch will remove the empty files and the nf_nat_l4proto
struct.
nf_nat_proto_udp.c is renamed to nf_nat_proto.c, as it now contains the
other nat manip functionality as well, not just udp and udplite.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
all protocols did set this to nf_nat_l4proto_nlattr_to_range, so
just call it directly.
The important difference is that we'll now also call it for
protocols that we don't support (i.e., nf_nat_proto_unknown did
not provide .nlattr_to_range).
However, there should be no harm, even icmp provided this callback.
If we don't implement a specific l4nat for this, nothing would make
use of this information, so adding a big switch/case construct listing
all supported l4protocols seems a bit pointless.
This change leaves a single function pointer in the l4proto struct.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
With exception of icmp, all of the l4 nat protocols set this to
nf_nat_l4proto_in_range.
Get rid of this and just check the l4proto in the caller.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
No need for indirections here, we only support ipv4 and ipv6
and the called functions are very small.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
fold remaining users (icmp, icmpv6, gre) into nf_nat_l4proto_unique_tuple.
The static-save of old incarnation of resolved key in gre and icmp is
removed as well, just use the prandom based offset like the others.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Historically this was net_random() based, and was then converted to
a hash based algorithm (private boot seed + hash of endpoint addresses)
due to concerns of leaking net_random() bits.
RANDOM_FULLY mode was added later to avoid problems with hash
based mode (see commit 34ce324019,
"netfilter: nf_nat: add full port randomization support" for details).
Just make prandom_u32() the default search starting point and get rid of
->secure_port() altogether.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Sergey reported that forwarding was no longer working
if fq packet scheduler was used.
This is caused by the recent switch to EDT model, since incoming
packets might have been timestamped by __net_timestamp()
__net_timestamp() uses ktime_get_real(), while fq expects packets
using CLOCK_MONOTONIC base.
The fix is to clear skb->tstamp in forwarding paths.
Fixes: 80b14dee2b ("net: Add a new socket option for a future transmit time.")
Fixes: fb420d5d91 ("tcp/fq: move back to CLOCK_MONOTONIC")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Sergey Matyukevich <geomatsi@gmail.com>
Tested-by: Sergey Matyukevich <geomatsi@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This avoids an indirect call in the receive path for TCP and UDP
packets. TCP takes precedence on UDP, so that we have a single
additional conditional in the common case.
When IPV6 is build as module, all gro symbols except UDPv6 are
builtin, while the latter belong to the ipv6 module, so we
need some special care.
v1 -> v2:
- adapted to INDIRECT_CALL_ changes
v2 -> v3:
- fix build issue with CONFIG_IPV6=m
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This avoids an indirect calls for L3 GRO receive path, both
for ipv4 and ipv6, if the latter is not compiled as a module.
Note that when IPv6 is compiled as builtin, it will be checked first,
so we have a single additional compare for the more common path.
v1 -> v2:
- adapted to INDIRECT_CALL_ changes
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A recent commit allows sockets bound to a VRF to receive ipv6 link local
packets. However, it only works for UDP and worse TCP connection attempts
to the LLA with the only listener bound to the VRF just hang where as
before the client gets a reset and connection refused. Fix by adjusting
ir_iif for LL addresses and packets received through a device enslaved
to a VRF.
Fixes: 6f12fa7755 ("vrf: mark skb for multicast or link-local as enslaved to VRF")
Reported-by: Donald Sharp <sharpd@cumulusnetworks.com>
Cc: Mike Manning <mmanning@vyatta.att-mail.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A relatively common use case is to have several IPs configured
on a host, and have different listeners for each of them. We would
like to add a "catch all" listener on addr_any, to match incoming
connections not served by any of the listeners bound to a specific
address.
However, port-only lookups can match addr_any sockets when sockets
listening on specific addresses are present if so_reuseport flag
is set. This patch eliminates lookups into port-only hashtable,
as lookups by (addr,port) tuple are easily available.
In addition, compute_score() is tweaked to _not_ match
addr_any sockets to specific addresses, as hash collisions
could result in the unwanted behavior described above.
Tested: the patch compiles; full test in the last patch in this
patchset. Existing reuseport_* selftests also pass.
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Peter Oskolkov <posk@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A relatively common use case is to have several IPs configured
on a host, and have different listeners for each of them. We would
like to add a "catch all" listener on addr_any, to match incoming
connections not served by any of the listeners bound to a specific
address.
However, port-only lookups can match addr_any sockets when sockets
listening on specific addresses are present if so_reuseport flag
is set. This patch eliminates lookups into port-only hashtable,
as lookups by (addr,port) tuple are easily available.
In addition, compute_score() is tweaked to _not_ match
addr_any sockets to specific addresses, as hash collisions
could result in the unwanted behavior described above.
Tested: the patch compiles; full test in the last patch in this
patchset. Existing reuseport_* selftests also pass.
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Peter Oskolkov <posk@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
vr.mifi is indirectly controlled by user-space, hence leading to
a potential exploitation of the Spectre variant 1 vulnerability.
This issue was detected with the help of Smatch:
net/ipv6/ip6mr.c:1845 ip6mr_ioctl() warn: potential spectre issue 'mrt->vif_table' [r] (local cap)
net/ipv6/ip6mr.c:1919 ip6mr_compat_ioctl() warn: potential spectre issue 'mrt->vif_table' [r] (local cap)
Fix this by sanitizing vr.mifi before using it to index mrt->vif_table'
Notice that given that speculation windows are large, the policy is
to kill the speculation on the first load and not worry if it can be
completed with a dependent load/store [1].
[1] https://marc.info/?l=linux-kernel&m=152449131114778&w=2
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Changed the is_gretap_dev and is_ip6gretap_dev logic from structure
comparison to string comparison of the rtnl_link_ops kind field.
This approach aligns with the current identification methods and function
names of vxlan and geneve network devices.
Convert mlxsw to use these helpers and use them in downstream mlx5 patch.
Signed-off-by: Oz Shlomo <ozsh@mellanox.com>
Reviewed-by: Eli Britstein <elibr@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Several conflicts, seemingly all over the place.
I used Stephen Rothwell's sample resolutions for many of these, if not
just to double check my own work, so definitely the credit largely
goes to him.
The NFP conflict consisted of a bug fix (moving operations
past the rhashtable operation) while chaning the initial
argument in the function call in the moved code.
The net/dsa/master.c conflict had to do with a bug fix intermixing of
making dsa_master_set_mtu() static with the fixing of the tagging
attribute location.
cls_flower had a conflict because the dup reject fix from Or
overlapped with the addition of port range classifiction.
__set_phy_supported()'s conflict was relatively easy to resolve
because Andrew fixed it in both trees, so it was just a matter
of taking the net-next copy. Or at least I think it was :-)
Joe Stringer's fix to the handling of netns id 0 in bpf_sk_lookup()
intermixed with changes on how the sdif and caller_net are calculated
in these code paths in net-next.
The remaining BPF conflicts were largely about the addition of the
__bpf_md_ptr stuff in 'net' overlapping with adjustments and additions
to the relevant data structure where the MD pointer macros are used.
Signed-off-by: David S. Miller <davem@davemloft.net>
extra_uref is used in __ip(6)_append_data only if uarg is set.
Smatch sees that the variable is passed to sock_zerocopy_put_abort.
This function accesses it only when uarg is set, but smatch cannot
infer this.
Make this dependency explicit.
Fixes: 52900d2228 ("udp: elide zerocopy operation in hot path")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Even if we send an IPv6 packet without options, MAX_HEADER might not be
enough to account for the additional headroom required by alignment of
hardware headers.
On a configuration without HYPERV_NET, WLAN, AX25, and with IPV6_TUNNEL,
sending short SCTP packets over IPv4 over L2TP over IPv6, we start with
100 bytes of allocated headroom in sctp_packet_transmit(), end up with 54
bytes after l2tp_xmit_skb(), and 14 bytes in ip6_finish_output2().
Those would be enough to append our 14 bytes header, but we're going to
align that to 16 bytes, and write 2 bytes out of the allocated slab in
neigh_hh_output().
KASan says:
[ 264.967848] ==================================================================
[ 264.967861] BUG: KASAN: slab-out-of-bounds in ip6_finish_output2+0x1aec/0x1c70
[ 264.967866] Write of size 16 at addr 000000006af1c7fe by task netperf/6201
[ 264.967870]
[ 264.967876] CPU: 0 PID: 6201 Comm: netperf Not tainted 4.20.0-rc4+ #1
[ 264.967881] Hardware name: IBM 2827 H43 400 (z/VM 6.4.0)
[ 264.967887] Call Trace:
[ 264.967896] ([<00000000001347d6>] show_stack+0x56/0xa0)
[ 264.967903] [<00000000017e379c>] dump_stack+0x23c/0x290
[ 264.967912] [<00000000007bc594>] print_address_description+0xf4/0x290
[ 264.967919] [<00000000007bc8fc>] kasan_report+0x13c/0x240
[ 264.967927] [<000000000162f5e4>] ip6_finish_output2+0x1aec/0x1c70
[ 264.967935] [<000000000163f890>] ip6_finish_output+0x430/0x7f0
[ 264.967943] [<000000000163fe44>] ip6_output+0x1f4/0x580
[ 264.967953] [<000000000163882a>] ip6_xmit+0xfea/0x1ce8
[ 264.967963] [<00000000017396e2>] inet6_csk_xmit+0x282/0x3f8
[ 264.968033] [<000003ff805fb0ba>] l2tp_xmit_skb+0xe02/0x13e0 [l2tp_core]
[ 264.968037] [<000003ff80631192>] l2tp_eth_dev_xmit+0xda/0x150 [l2tp_eth]
[ 264.968041] [<0000000001220020>] dev_hard_start_xmit+0x268/0x928
[ 264.968069] [<0000000001330e8e>] sch_direct_xmit+0x7ae/0x1350
[ 264.968071] [<000000000122359c>] __dev_queue_xmit+0x2b7c/0x3478
[ 264.968075] [<00000000013d2862>] ip_finish_output2+0xce2/0x11a0
[ 264.968078] [<00000000013d9b14>] ip_finish_output+0x56c/0x8c8
[ 264.968081] [<00000000013ddd1e>] ip_output+0x226/0x4c0
[ 264.968083] [<00000000013dbd6c>] __ip_queue_xmit+0x894/0x1938
[ 264.968100] [<000003ff80bc3a5c>] sctp_packet_transmit+0x29d4/0x3648 [sctp]
[ 264.968116] [<000003ff80b7bf68>] sctp_outq_flush_ctrl.constprop.5+0x8d0/0xe50 [sctp]
[ 264.968131] [<000003ff80b7c716>] sctp_outq_flush+0x22e/0x7d8 [sctp]
[ 264.968146] [<000003ff80b35c68>] sctp_cmd_interpreter.isra.16+0x530/0x6800 [sctp]
[ 264.968161] [<000003ff80b3410a>] sctp_do_sm+0x222/0x648 [sctp]
[ 264.968177] [<000003ff80bbddac>] sctp_primitive_ASSOCIATE+0xbc/0xf8 [sctp]
[ 264.968192] [<000003ff80b93328>] __sctp_connect+0x830/0xc20 [sctp]
[ 264.968208] [<000003ff80bb11ce>] sctp_inet_connect+0x2e6/0x378 [sctp]
[ 264.968212] [<0000000001197942>] __sys_connect+0x21a/0x450
[ 264.968215] [<000000000119aff8>] sys_socketcall+0x3d0/0xb08
[ 264.968218] [<000000000184ea7a>] system_call+0x2a2/0x2c0
[...]
Just like ip_finish_output2() does for IPv4, check that we have enough
headroom in ip6_xmit(), and reallocate it if we don't.
This issue is older than git history.
Reported-by: Jianlin Shi <jishi@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>