Commit Graph

44086 Commits

Author SHA1 Message Date
Marc Kleine-Budde 332b05ca7a can: raw: raw_setsockopt: limit number of can_filter that can be set
This patch adds a check to limit the number of can_filters that can be
set via setsockopt on CAN_RAW sockets. Otherwise allocations > MAX_ORDER
are not prevented resulting in a warning.

Reference: https://lkml.org/lkml/2016/12/2/230

Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Cc: linux-stable <stable@vger.kernel.org>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2016-12-07 10:45:57 +01:00
Marcelo Ricardo Leitner dcb17d22e1 tcp: warn on bogus MSS and try to amend it
There have been some reports lately about TCP connection stalls caused
by NIC drivers that aren't setting gso_size on aggregated packets on rx
path. This causes TCP to assume that the MSS is actually the size of the
aggregated packet, which is invalid.

Although the proper fix is to be done at each driver, it's often hard
and cumbersome for one to debug, come to such root cause and report/fix
it.

This patch amends this situation in two ways. First, it adds a warning
on when this situation occurs, so it gives a hint to those trying to
debug this. It also limit the maximum probed MSS to the adverised MSS,
as it should never be any higher than that.

The result is that the connection may not have the best performance ever
but it shouldn't stall, and the admin will have a hint on what to look
for.

Tested with virtio by forcing gso_size to 0.

v2: updated msg per David's suggestion
v3: use skb_iif to find the interface and also log its name, per Eric
    Dumazet's suggestion. As the skb may be backlogged and the interface
    gone by then, we need to check if the number still has a meaning.
v4: use helper tcp_gro_dev_warn() and avoid pr_warn_once inside __once, per
    David's suggestion

Cc: Jonathan Maxwell <jmaxwell37@gmail.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-06 11:01:19 -05:00
Herbert Xu ed5d7788a9 netlink: Do not schedule work from sk_destruct
It is wrong to schedule a work from sk_destruct using the socket
as the memory reserve because the socket will be freed immediately
after the return from sk_destruct.

Instead we should do the deferral prior to sk_free.

This patch does just that.

Fixes: 707693c8a4 ("netlink: Call cb->done from a worker thread")
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-05 19:43:42 -05:00
Pan Bian c79e167c3c net: caif: remove ineffective check
The check of the return value of sock_register() is ineffective.
"if(!err)" seems to be a typo. It is better to propagate the error code
to the callers of caif_sktinit_module(). This patch removes the check
statment and directly returns the result of sock_register().

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=188751
Signed-off-by: Pan Bian <bianpan2016@163.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-05 14:48:48 -05:00
Kees Cook 0eab121ef8 net: ping: check minimum size on ICMP header length
Prior to commit c0371da604 ("put iov_iter into msghdr") in v3.19, there
was no check that the iovec contained enough bytes for an ICMP header,
and the read loop would walk across neighboring stack contents. Since the
iov_iter conversion, bad arguments are noticed, but the returned error is
EFAULT. Returning EINVAL is a clearer error and also solves the problem
prior to v3.19.

This was found using trinity with KASAN on v3.18:

BUG: KASAN: stack-out-of-bounds in memcpy_fromiovec+0x60/0x114 at addr ffffffc071077da0
Read of size 8 by task trinity-c2/9623
page:ffffffbe034b9a08 count:0 mapcount:0 mapping:          (null) index:0x0
flags: 0x0()
page dumped because: kasan: bad access detected
CPU: 0 PID: 9623 Comm: trinity-c2 Tainted: G    BU         3.18.0-dirty #15
Hardware name: Google Tegra210 Smaug Rev 1,3+ (DT)
Call trace:
[<ffffffc000209c98>] dump_backtrace+0x0/0x1ac arch/arm64/kernel/traps.c:90
[<ffffffc000209e54>] show_stack+0x10/0x1c arch/arm64/kernel/traps.c:171
[<     inline     >] __dump_stack lib/dump_stack.c:15
[<ffffffc000f18dc4>] dump_stack+0x7c/0xd0 lib/dump_stack.c:50
[<     inline     >] print_address_description mm/kasan/report.c:147
[<     inline     >] kasan_report_error mm/kasan/report.c:236
[<ffffffc000373dcc>] kasan_report+0x380/0x4b8 mm/kasan/report.c:259
[<     inline     >] check_memory_region mm/kasan/kasan.c:264
[<ffffffc00037352c>] __asan_load8+0x20/0x70 mm/kasan/kasan.c:507
[<ffffffc0005b9624>] memcpy_fromiovec+0x5c/0x114 lib/iovec.c:15
[<     inline     >] memcpy_from_msg include/linux/skbuff.h:2667
[<ffffffc000ddeba0>] ping_common_sendmsg+0x50/0x108 net/ipv4/ping.c:674
[<ffffffc000dded30>] ping_v4_sendmsg+0xd8/0x698 net/ipv4/ping.c:714
[<ffffffc000dc91dc>] inet_sendmsg+0xe0/0x12c net/ipv4/af_inet.c:749
[<     inline     >] __sock_sendmsg_nosec net/socket.c:624
[<     inline     >] __sock_sendmsg net/socket.c:632
[<ffffffc000cab61c>] sock_sendmsg+0x124/0x164 net/socket.c:643
[<     inline     >] SYSC_sendto net/socket.c:1797
[<ffffffc000cad270>] SyS_sendto+0x178/0x1d8 net/socket.c:1761

CVE-2016-8399

Reported-by: Qidan He <i@flanker017.me>
Fixes: c319b4d76b ("net: ipv4: add IPPROTO_ICMP socket kind")
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-05 13:35:38 -05:00
Pan Bian b59589635f net: bridge: set error code on failure
Function br_sysfs_addbr() does not set error code when the call
kobject_create_and_add() returns a NULL pointer. It may be better to
return "-ENOMEM" when kobject_create_and_add() fails.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=188781

Signed-off-by: Pan Bian <bianpan2016@163.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-05 13:26:22 -05:00
Suraj Deshmukh 14dd3e1b97 net: af_mpls.c add space before open parenthesis
Adding space after switch keyword before open
parenthesis for readability purpose.

This patch fixes the checkpatch.pl warning:
space required before the open parenthesis '('

Signed-off-by: Suraj Deshmukh <surajssd009005@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-05 13:25:55 -05:00
Alexander Duyck a52ca62c4a ipv4: Drop suffix update from resize code
It has been reported that update_suffix can be expensive when it is called
on a large node in which most of the suffix lengths are the same.  The time
required to add 200K entries had increased from around 3 seconds to almost
49 seconds.

In order to address this we need to move the code for updating the suffix
out of resize and instead just have it handled in the cases where we are
pushing a node that increases the suffix length, or will decrease the
suffix length.

Fixes: 5405afd1a3 ("fib_trie: Add tracking value for suffix length")
Reported-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Reviewed-by: Robert Shearman <rshearma@brocade.com>
Tested-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-05 13:15:58 -05:00
Alexander Duyck 1a239173cc ipv4: Drop leaf from suffix pull/push functions
It wasn't necessary to pass a leaf in when doing the suffix updates so just
drop it.  Instead just pass the suffix and work with that.

Since we dropped the leaf there is no need to include that in the name so
the names are updated to node_push_suffix and node_pull_suffix.

Finally I noticed that the logic for pulling the suffix length back
actually had some issues.  Specifically it would stop prematurely if there
was a longer suffix, but it was not as long as the original suffix.  I
updated the code to address that in node_pull_suffix.

Fixes: 5405afd1a3 ("fib_trie: Add tracking value for suffix length")
Suggested-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Reviewed-by: Robert Shearman <rshearma@brocade.com>
Tested-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-05 13:15:58 -05:00
Pan Bian c66ebf2db5 net: dcb: set error code on failures
In function dcbnl_cee_fill(), returns the value of variable err on
errors. However, on some error paths (e.g. nla put fails), its value may
be 0. It may be better to explicitly set a negative errno to variable
err before returning.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=188881

Signed-off-by: Pan Bian <bianpan2016@163.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-03 23:54:25 -05:00
David S. Miller a38b610094 Here is another batman-adv bugfix:
- fix checking for failed allocation of TVLV blocks in TT local data,
    by Sven Eckelmann
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEE1ilQI7G+y+fdhnrfoSvjmEKSnqEFAlhBnMAWHHN3QHNpbW9u
 d3VuZGVybGljaC5kZQAKCRChK+OYQpKeoQjfD/4i/eHZjGNwS9aTxh0QCfzM45JP
 xq5DmvCZrJXzGIt/g3vDI9WqpdJrnJsfUhCwzcn2NwCqBt8BKlGFT/NDXv5rxjBu
 5wHV9gZ5r6whMVyXiDgkYNW0B1M71rjgh9rxvH61Lyrdaq5RSwGVZb9dohRffGpo
 zTU+l+GiMJXzszA3Hi0ga+8/3xyEdwjMzSyBSK2xLdyHFBn8gg4EfAf1kJhS4rk8
 dbJUHFEvtg0wB7LkMlXe3DHkX/bRAi+PGU/YNk1x0x877ejfejIn7pp/GoweFgAW
 2cn4SxVIsN7sycZCQYar5iqjbQ/mO0v5fNTkxo8EGiNB7wHTr0RVFI/UPOP6KeS6
 2JR9qubg4kd6nU9yWMC8osbzye+GWmbPkvkFBrcHRkZpZDwsuaLJH2SbmWGt62kT
 mjpF2Uhsz+hn6AF5YhdDPAnBNP0iJWu25jWymCtHqCjieJsK0QjvAbXLMz45VWoW
 5rrvOBy1gI47/3gs2jitEXER9/6onai3vzoJ3cX7YYUh2ntiAuK3Lg5wGnrODm/W
 fyjpDcnwURBhiP4H+uoeiRvxFXpNeUBJiTolw+tl8n//adlLNepclA0OnGunx4GP
 qc5LPksYp1KWV6rIy3Nl+62XGLZkSGHcYzVb+uX/3Bng8HNmva7ceUggsVwIFsgi
 Isw24Dckx4bSB8P86w==
 =Y2ED
 -----END PGP SIGNATURE-----

Merge tag 'batadv-net-for-davem-20161202' of git://git.open-mesh.org/linux-merge

Simon Wunderlich says:

====================
Here is another batman-adv bugfix:

 - fix checking for failed allocation of TVLV blocks in TT local data,
   by Sven Eckelmann
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-03 16:13:27 -05:00
Eric Dumazet b98b0bc8c4 net: avoid signed overflows for SO_{SND|RCV}BUFFORCE
CAP_NET_ADMIN users should not be allowed to set negative
sk_sndbuf or sk_rcvbuf values, as it can lead to various memory
corruptions, crashes, OOM...

Note that before commit 8298193012 ("net: cleanups in
sock_setsockopt()"), the bug was even more serious, since SO_SNDBUF
and SO_RCVBUF were vulnerable.

This needs to be backported to all known linux kernels.

Again, many thanks to syzkaller team for discovering this gem.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02 14:10:14 -05:00
Michal Kubeček 3de81b7588 tipc: check minimum bearer MTU
Qian Zhang (张谦) reported a potential socket buffer overflow in
tipc_msg_build() which is also known as CVE-2016-8632: due to
insufficient checks, a buffer overflow can occur if MTU is too short for
even tipc headers. As anyone can set device MTU in a user/net namespace,
this issue can be abused by a regular user.

As agreed in the discussion on Ben Hutchings' original patch, we should
check the MTU at the moment a bearer is attached rather than for each
processed packet. We also need to repeat the check when bearer MTU is
adjusted to new device MTU. UDP case also needs a check to avoid
overflow when calculating bearer MTU.

Fixes: b97bf3fd8f ("[TIPC] Initial merge")
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Reported-by: Qian Zhang (张谦) <zhangqian-c@360.cn>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02 14:03:20 -05:00
Artem Savkov 6b6ebb6b01 ip6_offload: check segs for NULL in ipv6_gso_segment.
segs needs to be checked for being NULL in ipv6_gso_segment() before calling
skb_shinfo(segs), otherwise kernel can run into a NULL-pointer dereference:

[   97.811262] BUG: unable to handle kernel NULL pointer dereference at 00000000000000cc
[   97.819112] IP: [<ffffffff816e52f9>] ipv6_gso_segment+0x119/0x2f0
[   97.825214] PGD 0 [   97.827047]
[   97.828540] Oops: 0000 [#1] SMP
[   97.831678] Modules linked in: vhost_net vhost macvtap macvlan nfsv3 rpcsec_gss_krb5
nfsv4 dns_resolver nfs fscache xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4
iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack
ipt_REJECT nf_reject_ipv4 tun ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter
bridge stp llc snd_hda_codec_realtek snd_hda_codec_hdmi snd_hda_codec_generic snd_hda_intel
snd_hda_codec edac_mce_amd snd_hda_core edac_core snd_hwdep kvm_amd snd_seq kvm snd_seq_device
snd_pcm irqbypass snd_timer ppdev parport_serial snd parport_pc k10temp pcspkr soundcore parport
sp5100_tco shpchp sg wmi i2c_piix4 acpi_cpufreq nfsd auth_rpcgss nfs_acl lockd grace sunrpc
ip_tables xfs libcrc32c sr_mod cdrom sd_mod ata_generic pata_acpi amdkfd amd_iommu_v2 radeon
broadcom bcm_phy_lib i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops
ttm ahci serio_raw tg3 firewire_ohci libahci pata_atiixp drm ptp libata firewire_core pps_core
i2c_core crc_itu_t fjes dm_mirror dm_region_hash dm_log dm_mod
[   97.927721] CPU: 1 PID: 3504 Comm: vhost-3495 Not tainted 4.9.0-7.el7.test.x86_64 #1
[   97.935457] Hardware name: AMD Snook/Snook, BIOS ESK0726A 07/26/2010
[   97.941806] task: ffff880129a1c080 task.stack: ffffc90001bcc000
[   97.947720] RIP: 0010:[<ffffffff816e52f9>]  [<ffffffff816e52f9>] ipv6_gso_segment+0x119/0x2f0
[   97.956251] RSP: 0018:ffff88012fc43a10  EFLAGS: 00010207
[   97.961557] RAX: 0000000000000000 RBX: ffff8801292c8700 RCX: 0000000000000594
[   97.968687] RDX: 0000000000000593 RSI: ffff880129a846c0 RDI: 0000000000240000
[   97.975814] RBP: ffff88012fc43a68 R08: ffff880129a8404e R09: 0000000000000000
[   97.982942] R10: 0000000000000000 R11: ffff880129a84076 R12: 00000020002949b3
[   97.990070] R13: ffff88012a580000 R14: 0000000000000000 R15: ffff88012a580000
[   97.997198] FS:  0000000000000000(0000) GS:ffff88012fc40000(0000) knlGS:0000000000000000
[   98.005280] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   98.011021] CR2: 00000000000000cc CR3: 0000000126c5d000 CR4: 00000000000006e0
[   98.018149] Stack:
[   98.020157]  00000000ffffffff ffff88012fc43ac8 ffffffffa017ad0a 000000000000000e
[   98.027584]  0000001300000000 0000000077d59998 ffff8801292c8700 00000020002949b3
[   98.035010]  ffff88012a580000 0000000000000000 ffff88012a580000 ffff88012fc43a98
[   98.042437] Call Trace:
[   98.044879]  <IRQ> [   98.046803]  [<ffffffffa017ad0a>] ? tg3_start_xmit+0x84a/0xd60 [tg3]
[   98.053156]  [<ffffffff815eeee0>] skb_mac_gso_segment+0xb0/0x130
[   98.059158]  [<ffffffff815eefd3>] __skb_gso_segment+0x73/0x110
[   98.064985]  [<ffffffff815ef40d>] validate_xmit_skb+0x12d/0x2b0
[   98.070899]  [<ffffffff815ef5d2>] validate_xmit_skb_list+0x42/0x70
[   98.077073]  [<ffffffff81618560>] sch_direct_xmit+0xd0/0x1b0
[   98.082726]  [<ffffffff815efd86>] __dev_queue_xmit+0x486/0x690
[   98.088554]  [<ffffffff8135c135>] ? cpumask_next_and+0x35/0x50
[   98.094380]  [<ffffffff815effa0>] dev_queue_xmit+0x10/0x20
[   98.099863]  [<ffffffffa09ce057>] br_dev_queue_push_xmit+0xa7/0x170 [bridge]
[   98.106907]  [<ffffffffa09ce161>] br_forward_finish+0x41/0xc0 [bridge]
[   98.113430]  [<ffffffff81627cf2>] ? nf_iterate+0x52/0x60
[   98.118735]  [<ffffffff81627d6b>] ? nf_hook_slow+0x6b/0xc0
[   98.124216]  [<ffffffffa09ce32c>] __br_forward+0x14c/0x1e0 [bridge]
[   98.130480]  [<ffffffffa09ce120>] ? br_dev_queue_push_xmit+0x170/0x170 [bridge]
[   98.137785]  [<ffffffffa09ce4bd>] br_forward+0x9d/0xb0 [bridge]
[   98.143701]  [<ffffffffa09cfbb7>] br_handle_frame_finish+0x267/0x560 [bridge]
[   98.150834]  [<ffffffffa09d0064>] br_handle_frame+0x174/0x2f0 [bridge]
[   98.157355]  [<ffffffff8102fb89>] ? sched_clock+0x9/0x10
[   98.162662]  [<ffffffff810b63b2>] ? sched_clock_cpu+0x72/0xa0
[   98.168403]  [<ffffffff815eccf5>] __netif_receive_skb_core+0x1e5/0xa20
[   98.174926]  [<ffffffff813659f9>] ? timerqueue_add+0x59/0xb0
[   98.180580]  [<ffffffff815ed548>] __netif_receive_skb+0x18/0x60
[   98.186494]  [<ffffffff815ee625>] process_backlog+0x95/0x140
[   98.192145]  [<ffffffff815edccd>] net_rx_action+0x16d/0x380
[   98.197713]  [<ffffffff8170cff1>] __do_softirq+0xd1/0x283
[   98.203106]  [<ffffffff8170b2bc>] do_softirq_own_stack+0x1c/0x30
[   98.209107]  <EOI> [   98.211029]  [<ffffffff8108a5c0>] do_softirq+0x50/0x60
[   98.216166]  [<ffffffff815ec853>] netif_rx_ni+0x33/0x80
[   98.221386]  [<ffffffffa09eeff7>] tun_get_user+0x487/0x7f0 [tun]
[   98.227388]  [<ffffffffa09ef3ab>] tun_sendmsg+0x4b/0x60 [tun]
[   98.233129]  [<ffffffffa0b68932>] handle_tx+0x282/0x540 [vhost_net]
[   98.239392]  [<ffffffffa0b68c25>] handle_tx_kick+0x15/0x20 [vhost_net]
[   98.245916]  [<ffffffffa0abacfe>] vhost_worker+0x9e/0xf0 [vhost]
[   98.251919]  [<ffffffffa0abac60>] ? vhost_umem_alloc+0x40/0x40 [vhost]
[   98.258440]  [<ffffffff81003a47>] ? do_syscall_64+0x67/0x180
[   98.264094]  [<ffffffff810a44d9>] kthread+0xd9/0xf0
[   98.268965]  [<ffffffff810a4400>] ? kthread_park+0x60/0x60
[   98.274444]  [<ffffffff8170a4d5>] ret_from_fork+0x25/0x30
[   98.279836] Code: 8b 93 d8 00 00 00 48 2b 93 d0 00 00 00 4c 89 e6 48 89 df 66 89 93 c2 00 00 00 ff 10 48 3d 00 f0 ff ff 49 89 c2 0f 87 52 01 00 00 <41> 8b 92 cc 00 00 00 48 8b 80 d0 00 00 00 44 0f b7 74 10 06 66
[   98.299425] RIP  [<ffffffff816e52f9>] ipv6_gso_segment+0x119/0x2f0
[   98.305612]  RSP <ffff88012fc43a10>
[   98.309094] CR2: 00000000000000cc
[   98.312406] ---[ end trace 726a2c7a2d2d78d0 ]---

Signed-off-by: Artem Savkov <asavkov@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02 13:34:58 -05:00
Sowmini Varadhan 721c7443dc RDS: TCP: unregister_netdevice_notifier() in error path of rds_tcp_init_net
If some error is encountered in rds_tcp_init_net, make sure to
unregister_netdevice_notifier(), else we could trigger a panic
later on, when the modprobe from a netns fails.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02 13:29:26 -05:00
Eli Cooper 80d1106aea Revert: "ip6_tunnel: Update skb->protocol to ETH_P_IPV6 in ip6_tnl_xmit()"
This reverts commit ae148b0858
("ip6_tunnel: Update skb->protocol to ETH_P_IPV6 in ip6_tnl_xmit()").

skb->protocol is now set in __ip_local_out() and __ip6_local_out() before
dst_output() is called. It is no longer necessary to do it for each tunnel.

Cc: stable@vger.kernel.org
Signed-off-by: Eli Cooper <elicooper@gmx.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02 12:34:22 -05:00
Eli Cooper b4e479a96f ipv6: Set skb->protocol properly for local output
When xfrm is applied to TSO/GSO packets, it follows this path:

    xfrm_output() -> xfrm_output_gso() -> skb_gso_segment()

where skb_gso_segment() relies on skb->protocol to function properly.

This patch sets skb->protocol to ETH_P_IPV6 before dst_output() is called,
fixing a bug where GSO packets sent through an ipip6 tunnel are dropped
when xfrm is involved.

Cc: stable@vger.kernel.org
Signed-off-by: Eli Cooper <elicooper@gmx.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02 12:34:22 -05:00
Eli Cooper f418043910 ipv4: Set skb->protocol properly for local output
When xfrm is applied to TSO/GSO packets, it follows this path:

    xfrm_output() -> xfrm_output_gso() -> skb_gso_segment()

where skb_gso_segment() relies on skb->protocol to function properly.

This patch sets skb->protocol to ETH_P_IP before dst_output() is called,
fixing a bug where GSO packets sent through a sit tunnel are dropped
when xfrm is involved.

Cc: stable@vger.kernel.org
Signed-off-by: Eli Cooper <elicooper@gmx.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02 12:34:22 -05:00
Philip Pettersson 84ac726023 packet: fix race condition in packet_set_ring
When packet_set_ring creates a ring buffer it will initialize a
struct timer_list if the packet version is TPACKET_V3. This value
can then be raced by a different thread calling setsockopt to
set the version to TPACKET_V1 before packet_set_ring has finished.

This leads to a use-after-free on a function pointer in the
struct timer_list when the socket is closed as the previously
initialized timer will not be deleted.

The bug is fixed by taking lock_sock(sk) in packet_setsockopt when
changing the packet version while also taking the lock at the start
of packet_set_ring.

Fixes: f6fb8f100b ("af-packet: TPACKET_V3 flexible buffer implementation.")
Signed-off-by: Philip Pettersson <philip.pettersson@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02 12:16:49 -05:00
Tobias Klauser 6919756caa net/rtnetlink: fix attribute name in nlmsg_size() comments
Use the correct attribute constant names IFLA_GSO_MAX_{SEGS,SIZE}
instead of IFLA_MAX_GSO_{SEGS,SIZE} for the comments int nlmsg_size().

Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02 10:34:59 -05:00
Sven Eckelmann c2d0f48a13 batman-adv: Check for alloc errors when preparing TT local data
batadv_tt_prepare_tvlv_local_data can fail to allocate the memory for the
new TVLV block. The caller is informed about this problem with the returned
length of 0. Not checking this value results in an invalid memory access
when either tt_data or tt_change is accessed.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Fixes: 7ea7b4a142 ("batman-adv: make the TT CRC logic VLAN specific")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-12-02 10:46:59 +01:00
David S. Miller 7bbf91ce27 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec
Steffen Klassert says:

====================
pull request (net): ipsec 2016-12-01

1) Change the error value when someone tries to run 32bit
   userspace on a 64bit host from -ENOTSUPP to the userspace
   exported -EOPNOTSUPP. Fix from Yi Zhao.

2) On inbound, ESN sequence numbers are already in network
   byte order. So don't try to convert it again, this fixes
   integrity verification for ESN. Fixes from Tobias Brunner.

Please pull or let me know if there are problems.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-01 11:35:49 -05:00
David S. Miller 3d2dd617fb Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

This is a large batch of Netfilter fixes for net, they are:

1) Three patches to fix NAT conversion to rhashtable: Switch to rhlist
   structure that allows to have several objects with the same key.
   Moreover, fix wrong comparison logic in nf_nat_bysource_cmp() as this is
   expecting a return value similar to memcmp(). Change location of
   the nat_bysource field in the nf_conn structure to avoid zeroing
   this as it breaks interaction with SLAB_DESTROY_BY_RCU and lead us
   to crashes. From Florian Westphal.

2) Don't allow malformed fragments go through in IPv6, drop them,
   otherwise we hit GPF, patch from Florian Westphal.

3) Fix crash if attributes are missing in nft_range, from Liping Zhang.

4) Fix arptables 32-bits userspace 64-bits kernel compat, from Hongxu Jia.

5) Two patches from David Ahern to fix netfilter interaction with vrf.
   From David Ahern.

6) Fix element timeout calculation in nf_tables, we take milliseconds
   from userspace, but we use jiffies from kernelspace. Patch from
   Anders K.  Pedersen.

7) Missing validation length netlink attribute for nft_hash, from
   Laura Garcia.

8) Fix nf_conntrack_helper documentation, we don't default to off
   anymore for a bit of time so let's get this in sync with the code.

I know is late but I think these are important, specifically the NAT
bits, as they are mostly addressing fallout from recent changes. I also
read there are chances to have -rc8, if that is the case, that would
also give us a bit more time to test this.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-01 11:04:41 -05:00
Hongxu Jia 17a49cd549 netfilter: arp_tables: fix invoking 32bit "iptable -P INPUT ACCEPT" failed in 64bit kernel
Since 09d9686047 ("netfilter: x_tables: do compat validation via
translate_table"), it used compatr structure to assign newinfo
structure.  In translate_compat_table of ip_tables.c and ip6_tables.c,
it used compatr->hook_entry to replace info->hook_entry and
compatr->underflow to replace info->underflow, but not do the same
replacement in arp_tables.c.

It caused invoking 32-bit "arptbale -P INPUT ACCEPT" failed in 64bit
kernel.
--------------------------------------
root@qemux86-64:~# arptables -P INPUT ACCEPT
root@qemux86-64:~# arptables -P INPUT ACCEPT
ERROR: Policy for `INPUT' offset 448 != underflow 0
arptables: Incompatible with this kernel
--------------------------------------

Fixes: 09d9686047 ("netfilter: x_tables: do compat validation via translate_table")
Signed-off-by: Hongxu Jia <hongxu.jia@windriver.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-30 20:50:23 +01:00
Guillaume Nault 31e2f21fb3 l2tp: fix address test in __l2tp_ip6_bind_lookup()
The '!(addr && ipv6_addr_equal(addr, laddr))' part of the conditional
matches if addr is NULL or if addr != laddr.
But the intend of __l2tp_ip6_bind_lookup() is to find a sockets with
the same address, so the ipv6_addr_equal() condition needs to be
inverted.

For better clarity and consistency with the rest of the expression, the
(!X || X == Y) notation is used instead of !(X && X != Y).

Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-30 14:14:08 -05:00
Guillaume Nault df90e68861 l2tp: fix lookup for sockets not bound to a device in l2tp_ip
When looking up an l2tp socket, we must consider a null netdevice id as
wild card. There are currently two problems caused by
__l2tp_ip_bind_lookup() not considering 'dif' as wild card when set to 0:

  * A socket bound to a device (i.e. with sk->sk_bound_dev_if != 0)
    never receives any packet. Since __l2tp_ip_bind_lookup() is called
    with dif == 0 in l2tp_ip_recv(), sk->sk_bound_dev_if is always
    different from 'dif' so the socket doesn't match.

  * Two sockets, one bound to a device but not the other, can be bound
    to the same address. If the first socket binding to the address is
    the one that is also bound to a device, the second socket can bind
    to the same address without __l2tp_ip_bind_lookup() noticing the
    overlap.

To fix this issue, we need to consider that any null device index, be
it 'sk->sk_bound_dev_if' or 'dif', matches with any other value.
We also need to pass the input device index to __l2tp_ip_bind_lookup()
on reception so that sockets bound to a device never receive packets
from other devices.

This patch fixes l2tp_ip6 in the same way.

Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-30 14:14:08 -05:00
Guillaume Nault d5e3a19093 l2tp: fix racy socket lookup in l2tp_ip and l2tp_ip6 bind()
It's not enough to check for sockets bound to same address at the
beginning of l2tp_ip{,6}_bind(): even if no socket is found at that
time, a socket with the same address could be bound before we take
the l2tp lock again.

This patch moves the lookup right before inserting the new socket, so
that no change can ever happen to the list between address lookup and
socket insertion.

Care is taken to avoid side effects on the socket in case of failure.
That is, modifications of the socket are done after the lookup, when
binding is guaranteed to succeed, and before releasing the l2tp lock,
so that concurrent lookups will always see fully initialised sockets.

For l2tp_ip, 'ret' is set to -EINVAL before checking the SOCK_ZAPPED
bit. Error code was mistakenly set to -EADDRINUSE on error by commit
32c231164b ("l2tp: fix racy SOCK_ZAPPED flag check in l2tp_ip{,6}_bind()").
Using -EINVAL restores original behaviour.

For l2tp_ip6, the lookup is now always done with the correct bound
device. Before this patch, when binding to a link-local address, the
lookup was done with the original sk->sk_bound_dev_if, which was later
overwritten with addr->l2tp_scope_id. Lookup is now performed with the
final sk->sk_bound_dev_if value.

Finally, the (addr_len >= sizeof(struct sockaddr_in6)) check has been
dropped: addr is a sockaddr_l2tpip6 not sockaddr_in6 and addr_len has
already been checked at this point (this part of the code seems to have
been copy-pasted from net/ipv6/raw.c).

Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-30 14:14:08 -05:00
Guillaume Nault a3c18422a4 l2tp: hold socket before dropping lock in l2tp_ip{, 6}_recv()
Socket must be held while under the protection of the l2tp lock; there
is no guarantee that sk remains valid after the read_unlock_bh() call.

Same issue for l2tp_ip and l2tp_ip6.

Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-30 14:14:07 -05:00
Guillaume Nault 0382a25af3 l2tp: lock socket before checking flags in connect()
Socket flags aren't updated atomically, so the socket must be locked
while reading the SOCK_ZAPPED flag.

This issue exists for both l2tp_ip and l2tp_ip6. For IPv6, this patch
also brings error handling for __ip6_datagram_connect() failures.

Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-30 14:14:07 -05:00
Daniele Di Proietto f92a80a997 openvswitch: Fix skb leak in IPv6 reassembly.
If nf_ct_frag6_gather() returns an error other than -EINPROGRESS, it
means that we still have a reference to the skb.  We should free it
before returning from handle_fragments, as stated in the comment above.

Fixes: daaa7d647f ("netfilter: ipv6: avoid nf_iterate recursion")
CC: Florian Westphal <fw@strlen.de>
CC: Pravin B Shelar <pshelar@ovn.org>
CC: Joe Stringer <joe@ovn.org>
Signed-off-by: Daniele Di Proietto <diproiettod@ovn.org>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-30 11:00:45 -05:00
Tobias Brunner a55e23864d esp6: Fix integrity verification when ESN are used
When handling inbound packets, the two halves of the sequence number
stored on the skb are already in network order.

Fixes: 000ae7b269 ("esp6: Switch to new AEAD interface")
Signed-off-by: Tobias Brunner <tobias@strongswan.org>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2016-11-30 11:10:16 +01:00
Tobias Brunner 7c7fedd51c esp4: Fix integrity verification when ESN are used
When handling inbound packets, the two halves of the sequence number
stored on the skb are already in network order.

Fixes: 7021b2e1cd ("esp4: Switch to new AEAD interface")
Signed-off-by: Tobias Brunner <tobias@strongswan.org>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2016-11-30 11:09:39 +01:00
Yi Zhao 83e2d0587a xfrm_user: fix return value from xfrm_user_rcv_msg
It doesn't support to run 32bit 'ip' to set xfrm objdect on 64bit host.
But the return value is unknown for user program:

ip xfrm policy list
RTNETLINK answers: Unknown error 524

Replace ENOTSUPP with EOPNOTSUPP:

ip xfrm policy list
RTNETLINK answers: Operation not supported

Signed-off-by: Yi Zhao <yi.zhao@windriver.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2016-11-30 10:58:53 +01:00
Johan Hovold 881eadabe7 net: dsa: slave: fix fixed-link phydev leaks
Make sure to deregister and free any fixed-link PHY registered using
of_phy_register_fixed_link() on slave-setup errors and on slave destroy.

Fixes: 0d8bcdd383 ("net: dsa: allow for more complex PHY setups")
Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-29 23:17:02 -05:00
Johan Hovold 3f65047c85 of_mdio: add helper to deregister fixed-link PHYs
Add helper to deregister fixed-link PHYs registered using
of_phy_register_fixed_link().

Convert the two drivers that care to deregister their fixed-link PHYs to
use the new helper, but note that most drivers currently fail to do so.

Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-29 23:17:02 -05:00
Johan Hovold 0d8f3c6715 net: dsa: slave: fix of-node leak and phy priority
Make sure to drop the reference taken by of_parse_phandle() before
returning from dsa_slave_phy_setup().

Note that this also modifies the PHY priority so that any fixed-link
node is only parsed when no phy-handle is given, which is in accordance
with the common scheme for this.

Fixes: 0d8bcdd383 ("net: dsa: allow for more complex PHY setups")
Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-29 23:17:02 -05:00
Arnaldo Carvalho de Melo a510887824 GSO: Reload iph after pskb_may_pull
As it may get stale and lead to use after free.

Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Alexander Duyck <aduyck@mirantis.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Fixes: cbc53e08a7 ("GSO: Add GSO type for fixed IPv4 ID")
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-29 20:45:54 -05:00
Jiri Pirko 725cbb62e7 sched: cls_flower: remove from hashtable only in case skip sw flag is not set
Be symmetric to hashtable insert and remove filter from hashtable only
in case skip sw flag is not set.

Fixes: e69985c67c ("net/sched: cls_flower: Introduce support in SKIP SW flag")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Amir Vadai <amir@vadai.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-29 20:44:38 -05:00
Eric Dumazet 648f0c28df net/dccp: fix use-after-free in dccp_invalid_packet
pskb_may_pull() can reallocate skb->head, we need to reload dh pointer
in dccp_invalid_packet() or risk use after free.

Bug found by Andrey Konovalov using syzkaller.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-29 20:37:26 -05:00
Herbert Xu 707693c8a4 netlink: Call cb->done from a worker thread
The cb->done interface expects to be called in process context.
This was broken by the netlink RCU conversion.  This patch fixes
it by adding a worker struct to make the cb->done call where
necessary.

Fixes: 21e4902aea ("netlink: Lockless lookup with RCU grace...")
Reported-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-29 19:48:38 -05:00
Amir Vadai 95c2027bfe net/sched: pedit: make sure that offset is valid
Add a validation function to make sure offset is valid:
1. Not below skb head (could happen when offset is negative).
2. Validate both 'offset' and 'at'.

Signed-off-by: Amir Vadai <amir@vadai.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-29 19:46:00 -05:00
Florian Westphal 9b57da0630 netfilter: ipv6: nf_defrag: drop mangled skb on ream error
Dmitry Vyukov reported GPF in network stack that Andrey traced down to
negative nh offset in nf_ct_frag6_queue().

Problem is that all network headers before fragment header are pulled.
Normal ipv6 reassembly will drop the skb when errors occur further down
the line.

netfilter doesn't do this, and instead passed the original fragment
along.  That was also fine back when netfilter ipv6 defrag worked with
cloned fragments, as the original, pristine fragment was passed on.

So we either have to undo the pull op, or discard such fragments.
Since they're malformed after all (e.g. overlapping fragment) it seems
preferrable to just drop them.

Same for temporary errors -- it doesn't make sense to accept (and
perhaps forward!) only some fragments of same datagram.

Fixes: 029f7f3b87 ("netfilter: ipv6: nf_defrag: avoid/free clone operations")
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Debugged-by: Andrey Konovalov <andreyknvl@google.com>
Diagnosed-by: Eric Dumazet <Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-29 20:23:58 +01:00
Nikita Yushchenko 7a99cd6e21 net: dsa: fix unbalanced dsa_switch_tree reference counting
_dsa_register_switch() gets a dsa_switch_tree object either via
dsa_get_dst() or via dsa_add_dst(). Former path does not increase kref
in returned object (resulting into caller not owning a reference),
while later path does create a new object (resulting into caller owning
a reference).

The rest of _dsa_register_switch() assumes that it owns a reference, and
calls dsa_put_dst().

This causes a memory breakage if first switch in the tree initialized
successfully, but second failed to initialize. In particular, freed
dsa_swith_tree object is left referenced by switch that was initialized,
and later access to sysfs attributes of that switch cause OOPS.

To fix, need to add kref_get() call to dsa_get_dst().

Fixes: 83c0afaec7 ("net: dsa: Add new binding implementation")
Signed-off-by: Nikita Yushchenko <nikita.yoush@cogentembedded.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-28 16:15:45 -05:00
David Ahern 79dc7e3f1c net: handle no dst on skb in icmp6_send
Andrey reported the following while fuzzing the kernel with syzkaller:

kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] SMP KASAN
Modules linked in:
CPU: 0 PID: 3859 Comm: a.out Not tainted 4.9.0-rc6+ #429
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
task: ffff8800666d4200 task.stack: ffff880067348000
RIP: 0010:[<ffffffff833617ec>]  [<ffffffff833617ec>]
icmp6_send+0x5fc/0x1e30 net/ipv6/icmp.c:451
RSP: 0018:ffff88006734f2c0  EFLAGS: 00010206
RAX: ffff8800666d4200 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: dffffc0000000000 RDI: 0000000000000018
RBP: ffff88006734f630 R08: ffff880064138418 R09: 0000000000000003
R10: dffffc0000000000 R11: 0000000000000005 R12: 0000000000000000
R13: ffffffff84e7e200 R14: ffff880064138484 R15: ffff8800641383c0
FS:  00007fb3887a07c0(0000) GS:ffff88006cc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020000000 CR3: 000000006b040000 CR4: 00000000000006f0
Stack:
 ffff8800666d4200 ffff8800666d49f8 ffff8800666d4200 ffffffff84c02460
 ffff8800666d4a1a 1ffff1000ccdaa2f ffff88006734f498 0000000000000046
 ffff88006734f440 ffffffff832f4269 ffff880064ba7456 0000000000000000
Call Trace:
 [<ffffffff83364ddc>] icmpv6_param_prob+0x2c/0x40 net/ipv6/icmp.c:557
 [<     inline     >] ip6_tlvopt_unknown net/ipv6/exthdrs.c:88
 [<ffffffff83394405>] ip6_parse_tlv+0x555/0x670 net/ipv6/exthdrs.c:157
 [<ffffffff8339a759>] ipv6_parse_hopopts+0x199/0x460 net/ipv6/exthdrs.c:663
 [<ffffffff832ee773>] ipv6_rcv+0xfa3/0x1dc0 net/ipv6/ip6_input.c:191
 ...

icmp6_send / icmpv6_send is invoked for both rx and tx paths. In both
cases the dst->dev should be preferred for determining the L3 domain
if the dst has been set on the skb. Fallback to the skb->dev if it has
not. This covers the case reported here where icmp6_send is invoked on
Rx before the route lookup.

Fixes: 5d41ce29e ("net: icmp6_send should use dst dev to determine L3 domain")
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-28 16:13:01 -05:00
Julian Wollrath 4df21dfcf2 tcp: Set DEFAULT_TCP_CONG to bbr if DEFAULT_BBR is set
Signed-off-by: Julian Wollrath <jwollrath@web.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-28 12:15:00 -05:00
Daniel Borkmann d936377414 net, sched: respect rcu grace period on cls destruction
Roi reported a crash in flower where tp->root was NULL in ->classify()
callbacks. Reason is that in ->destroy() tp->root is set to NULL via
RCU_INIT_POINTER(). It's problematic for some of the classifiers, because
this doesn't respect RCU grace period for them, and as a result, still
outstanding readers from tc_classify() will try to blindly dereference
a NULL tp->root.

The tp->root object is strictly private to the classifier implementation
and holds internal data the core such as tc_ctl_tfilter() doesn't know
about. Within some classifiers, such as cls_bpf, cls_basic, etc, tp->root
is only checked for NULL in ->get() callback, but nowhere else. This is
misleading and seemed to be copied from old classifier code that was not
cleaned up properly. For example, d3fa76ee6b ("[NET_SCHED]: cls_basic:
fix NULL pointer dereference") moved tp->root initialization into ->init()
routine, where before it was part of ->change(), so ->get() had to deal
with tp->root being NULL back then, so that was indeed a valid case, after
d3fa76ee6b, not really anymore. We used to set tp->root to NULL long
ago in ->destroy(), see 47a1a1d4be ("pkt_sched: remove unnecessary xchg()
in packet classifiers"); but the NULLifying was reintroduced with the
RCUification, but it's not correct for every classifier implementation.

In the cases that are fixed here with one exception of cls_cgroup, tp->root
object is allocated and initialized inside ->init() callback, which is always
performed at a point in time after we allocate a new tp, which means tp and
thus tp->root was not globally visible in the tp chain yet (see tc_ctl_tfilter()).
Also, on destruction tp->root is strictly kfree_rcu()'ed in ->destroy()
handler, same for the tp which is kfree_rcu()'ed right when we return
from ->destroy() in tcf_destroy(). This means, the head object's lifetime
for such classifiers is always tied to the tp lifetime. The RCU callback
invocation for the two kfree_rcu() could be out of order, but that's fine
since both are independent.

Dropping the RCU_INIT_POINTER(tp->root, NULL) for these classifiers here
means that 1) we don't need a useless NULL check in fast-path and, 2) that
outstanding readers of that tp in tc_classify() can still execute under
respect with RCU grace period as it is actually expected.

Things that haven't been touched here: cls_fw and cls_route. They each
handle tp->root being NULL in ->classify() path for historic reasons, so
their ->destroy() implementation can stay as is. If someone actually
cares, they could get cleaned up at some point to avoid the test in fast
path. cls_u32 doesn't set tp->root to NULL. For cls_rsvp, I just added a
!head should anyone actually be using/testing it, so it at least aligns with
cls_fw and cls_route. For cls_flower we additionally need to defer rhashtable
destruction (to a sleepable context) after RCU grace period as concurrent
readers might still access it. (Note that in this case we need to hold module
reference to keep work callback address intact, since we only wait on module
unload for all call_rcu()s to finish.)

This fixes one race to bring RCU grace period guarantees back. Next step
as worked on by Cong however is to fix 1e052be69d ("net_sched: destroy
proto tp when all filters are gone") to get the order of unlinking the tp
in tc_ctl_tfilter() for the RTM_DELTFILTER case right by moving
RCU_INIT_POINTER() before tcf_destroy() and let the notification for
removal be done through the prior ->delete() callback. Both are independant
issues. Once we have that right, we can then clean tp->root up for a number
of classifiers by not making them RCU pointers, which requires a new callback
(->uninit) that is triggered from tp's RCU callback, where we just kfree()
tp->root from there.

Fixes: 1f947bf151 ("net: sched: rcu'ify cls_bpf")
Fixes: 9888faefe1 ("net: sched: cls_basic use RCU")
Fixes: 70da9f0bf9 ("net: sched: cls_flow use RCU")
Fixes: 77b9900ef5 ("tc: introduce Flower classifier")
Fixes: bf3994d2ed ("net/sched: introduce Match-all classifier")
Fixes: 952313bd62 ("net: sched: cls_cgroup use RCU")
Reported-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Roi Dayan <roid@mellanox.com>
Cc: Jiri Pirko <jiri@mellanox.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-28 10:47:35 -05:00
Jon Paul Maloy 9590112241 tipc: fix link statistics counter errors
In commit e4bf4f7696 ("tipc: simplify packet sequence number
handling") we changed the internal representation of the packet
sequence number counters from u32 to u16, reflecting what is really
sent over the wire.

Since then some link statistics counters have been displaying incorrect
values, partially because the counters meant to be used as sequence
number snapshots are now used as direct counters, stored as u32, and
partially because some counter updates are just missing in the code.

In this commit we correct this in two ways. First, we base the
displayed packet sent/received values on direct counters instead
of as previously a calculated difference between current sequence
number and a snapshot. Second, we add the missing updates of the
counters.

This change is compatible with the current netlink API, and requires
no changes to the user space tools.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-27 20:35:55 -05:00
David S. Miller 8eb4adf60b Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec
Steffen Klassert says:

====================
pull request (net): ipsec 2016-11-25

1) Fix a refcount leak in vti6.
   From Nicolas Dichtel.

2) Fix a wrong if statement in xfrm_sk_policy_lookup.
   From Florian Westphal.

3) The flowcache watermarks are per cpu. Take this into
   account when comparing to the threshold where we
   refusing new allocations. From Miroslav Urbanek.

Please pull or let me know if there are problems.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-27 20:21:48 -05:00
Johan Hovold fd05d7b18c net: dsa: fix fixed-link-phy device leaks
Make sure to drop the reference taken by of_phy_find_device() when
registering and deregistering the fixed-link PHY-device.

Fixes: 39b0c70519 ("net: dsa: Allow configuration of CPU & DSA port
speeds/duplex")
Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-27 20:01:15 -05:00
Jon Paul Maloy 6998cc6ec2 tipc: resolve connection flow control compatibility problem
In commit 10724cc7bb ("tipc: redesign connection-level flow control")
we replaced the previous message based flow control with one based on
1k blocks. In order to ensure backwards compatibility the mechanism
falls back to using message as base unit when it senses that the peer
doesn't support the new algorithm. The default flow control window,
i.e., how many units can be sent before the sender blocks and waits
for an acknowledge (aka advertisement) is 512. This was tested against
the previous version, which uses an acknowledge frequency of on ack per
256 received message, and found to work fine.

However, we missed the fact that versions older than Linux 3.15 use an
acknowledge frequency of 512, which is exactly the limit where a 4.6+
sender will stop and wait for acknowledge. This would also work fine if
it weren't for the fact that if the first sent message on a 4.6+ server
side is an empty SYNACK, this one is also is counted as a sent message,
while it is not counted as a received message on a legacy 3.15-receiver.
This leads to the sender always being one step ahead of the receiver, a
scenario causing the sender to block after 512 sent messages, while the
receiver only has registered 511 read messages. Hence, the legacy
receiver is not trigged to send an acknowledge, with a permanently
blocked sender as result.

We solve this deadlock by simply allowing the sender to send one more
message before it blocks, i.e., by a making minimal change to the
condition used for determining connection congestion.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-25 21:38:16 -05:00