Commit Graph

664758 Commits

Author SHA1 Message Date
Xin Long e4dc99c7c2 sctp: process duplicated strreset out and addstrm out requests correctly
Now sctp stream reconf will process a request again even if it's seqno is
less than asoc->strreset_inseq.

If one request has been done successfully and some data chunks have been
accepted and then a duplicated strreset out request comes, the streamin's
ssn will be cleared. It will cause that stream will never receive chunks
any more because of unsynchronized ssn. It allows a replay attack.

A similar issue also exists when processing addstrm out requests. It will
cause more extra streams being added.

This patch is to fix it by saving the last 2 results into asoc. When a
duplicated strreset out or addstrm out request is received, reply it with
bad seqno if it's seqno < asoc->strreset_inseq - 2, and reply it with the
result saved in asoc if it's seqno >= asoc->strreset_inseq - 2.

Note that it saves last 2 results instead of only last 1 result, because
two requests can be sent together in one chunk.

And note that when receiving a duplicated request, the receiver side will
still reply it even if the peer has received the response. It's safe, As
the response will be dropped by the peer.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-18 13:39:50 -04:00
Chonggang Li b89f04c61e bonding: deliver link-local packets with skb->dev set to link that packets arrived on
Bonding driver changes the skb->dev to the bonding-master before
passing the packet to stack for further processing. This, however
does not make sense for the link-local packets and it loses "the
link info" once its skb->dev is changed to bonding-master.  This
patch changes this behavior for link-local packets by not changing
the skb->dev to the bonding-master and maintaining it as it is,
i.e. the link on which the packet arrived.

Signed-off-by: Chonggang Li <chonggangli@google.com>
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-17 15:45:47 -04:00
David Ahern c21ef3e343 net: rtnetlink: plumb extended ack to doit function
Add netlink_ext_ack arg to rtnl_doit_func. Pass extack arg to nlmsg_parse
for doit functions that call it directly.

This is the first step to using extended error reporting in rtnetlink.
>From here individual subsystems can be updated to set netlink_ext_ack as
needed.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-17 15:35:38 -04:00
David Lebrun af3b5158b8 ipv6: sr: fix BUG due to headroom too small after SRH push
When a locally generated packet receives an SRH with two or more segments,
the remaining headroom is too small to push an ethernet header. This patch
ensures that the headroom is large enough after SRH push.

The BUG generated the following trace.

[  192.950285] skbuff: skb_under_panic: text:ffffffff81809675 len:198 put:14 head:ffff88006f306400 data:ffff88006f3063fa tail:0xc0 end:0x2c0 dev:A-1
[  192.952456] ------------[ cut here ]------------
[  192.953218] kernel BUG at net/core/skbuff.c:105!
[  192.953411] invalid opcode: 0000 [#1] PREEMPT SMP
[  192.953411] Modules linked in:
[  192.953411] CPU: 5 PID: 3433 Comm: ping6 Not tainted 4.11.0-rc3+ #237
[  192.953411] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.1-0-g8891697-prebuilt.qemu-project.org 04/01/2014
[  192.953411] task: ffff88007c2d42c0 task.stack: ffffc90000ef4000
[  192.953411] RIP: 0010:skb_panic+0x61/0x70
[  192.953411] RSP: 0018:ffffc90000ef7900 EFLAGS: 00010286
[  192.953411] RAX: 0000000000000085 RBX: 00000000000086dd RCX: 0000000000000201
[  192.953411] RDX: 0000000080000201 RSI: ffffffff81d104c5 RDI: 00000000ffffffff
[  192.953411] RBP: ffffc90000ef7920 R08: 0000000000000001 R09: 0000000000000000
[  192.953411] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[  192.953411] R13: ffff88007c5a4000 R14: ffff88007b363d80 R15: 00000000000000b8
[  192.953411] FS:  00007f94b558b700(0000) GS:ffff88007fd40000(0000) knlGS:0000000000000000
[  192.953411] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  192.953411] CR2: 00007fff5ecd5080 CR3: 0000000074141000 CR4: 00000000001406e0
[  192.953411] Call Trace:
[  192.953411]  skb_push+0x3b/0x40
[  192.953411]  eth_header+0x25/0xc0
[  192.953411]  neigh_resolve_output+0x168/0x230
[  192.953411]  ? ip6_finish_output2+0x242/0x8f0
[  192.953411]  ip6_finish_output2+0x242/0x8f0
[  192.953411]  ? ip6_finish_output2+0x76/0x8f0
[  192.953411]  ip6_finish_output+0xa8/0x1d0
[  192.953411]  ip6_output+0x64/0x2d0
[  192.953411]  ? ip6_output+0x73/0x2d0
[  192.953411]  ? ip6_dst_check+0xb5/0xc0
[  192.953411]  ? dst_cache_per_cpu_get.isra.2+0x40/0x80
[  192.953411]  seg6_output+0xb0/0x220
[  192.953411]  lwtunnel_output+0xcf/0x210
[  192.953411]  ? lwtunnel_output+0x59/0x210
[  192.953411]  ip6_local_out+0x38/0x70
[  192.953411]  ip6_send_skb+0x2a/0xb0
[  192.953411]  ip6_push_pending_frames+0x48/0x50
[  192.953411]  rawv6_sendmsg+0xa39/0xf10
[  192.953411]  ? __lock_acquire+0x489/0x890
[  192.953411]  ? __mutex_lock+0x1fc/0x970
[  192.953411]  ? __lock_acquire+0x489/0x890
[  192.953411]  ? __mutex_lock+0x1fc/0x970
[  192.953411]  ? tty_ioctl+0x283/0xec0
[  192.953411]  inet_sendmsg+0x45/0x1d0
[  192.953411]  ? _copy_from_user+0x54/0x80
[  192.953411]  sock_sendmsg+0x33/0x40
[  192.953411]  SYSC_sendto+0xef/0x170
[  192.953411]  ? entry_SYSCALL_64_fastpath+0x5/0xc2
[  192.953411]  ? trace_hardirqs_on_caller+0x12b/0x1b0
[  192.953411]  ? trace_hardirqs_on_thunk+0x1a/0x1c
[  192.953411]  SyS_sendto+0x9/0x10
[  192.953411]  entry_SYSCALL_64_fastpath+0x1f/0xc2
[  192.953411] RIP: 0033:0x7f94b453db33
[  192.953411] RSP: 002b:00007fff5ecd0578 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[  192.953411] RAX: ffffffffffffffda RBX: 00007fff5ecd16e0 RCX: 00007f94b453db33
[  192.953411] RDX: 0000000000000040 RSI: 000055a78352e9c0 RDI: 0000000000000003
[  192.953411] RBP: 00007fff5ecd1690 R08: 000055a78352c940 R09: 000000000000001c
[  192.953411] R10: 0000000000000000 R11: 0000000000000246 R12: 000055a783321e10
[  192.953411] R13: 000055a7839890c0 R14: 0000000000000004 R15: 0000000000000000
[  192.953411] Code: 00 00 48 89 44 24 10 8b 87 c4 00 00 00 48 89 44 24 08 48 8b 87 d8 00 00 00 48 c7 c7 90 58 d2 81 48 89 04 24 31 c0 e8 4f 70 9a ff <0f> 0b 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 48 8b 97 d8 00 00
[  192.953411] RIP: skb_panic+0x61/0x70 RSP: ffffc90000ef7900
[  193.000186] ---[ end trace bd0b89fabdf2f92c ]---
[  193.000951] Kernel panic - not syncing: Fatal exception in interrupt
[  193.001137] Kernel Offset: disabled
[  193.001169] ---[ end Kernel panic - not syncing: Fatal exception in interrupt

Fixes: 19d5a26f5e ("ipv6: sr: expand skb head only if necessary")
Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-17 15:33:53 -04:00
Ilan Tayari 7a7a9bd7ac gso: Validate assumption of frag_list segementation
Commit 07b26c9454 ("gso: Support partial splitting at the frag_list
pointer") assumes that all SKBs in a frag_list (except maybe the last
one) contain the same amount of GSO payload.

This assumption is not always correct, resulting in the following
warning message in the log:
    skb_segment: too many frags

For example, mlx5 driver in Striding RQ mode creates some RX SKBs with
one frag, and some with 2 frags.
After GRO, the frag_list SKBs end up having different amounts of payload.
If this frag_list SKB is then forwarded, the aforementioned assumption
is violated.

Validate the assumption, and fall back to software GSO if it not true.

Fixes: 07b26c9454 ("gso: Support partial splitting at the frag_list pointer")
Signed-off-by: Ilan Tayari <ilant@mellanox.com>
Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-17 15:31:29 -04:00
Xin Long edb12f2d72 sctp: get list_of_streams of strreset outreq earlier
Now when processing strreset out responses, it gets outreq->list_of_streams
only when result is performed. But if result is not performed, str_p will
be NULL. It will cause panic in sctp_ulpevent_make_stream_reset_event if
nums is not 0.

This patch is to fix it by getting outreq->list_of_streams earlier, and
also to improve some codes for the strreset inreq process.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-17 15:25:35 -04:00
Chenbo Feng 9fd0f31563 Add uid and cookie bpf helper to cg_skb_func_proto
BPF helper functions get_socket_cookie and get_socket_uid can be
used for network traffic classifications, among others. Expose
them also to programs of type BPF_PROG_TYPE_CGROUP_SKB. As of
commit 8f917bba00 ("bpf: pass sk to helper functions") the
required skb->sk function is available at both cgroup bpf ingress
and egress hooks. With these two new helper, cg_skb_func_proto is
effectively the same as sk_filter_func_proto.

Change since V1:
Instead of add the helper to cg_skb_func_proto, redirect the
cg_skb_func_proto to sk_filter_func_proto since all helper function
in sk_filter_func_proto are applicable to cg_skb_func_proto now.

Signed-off-by: Chenbo Feng <fengc@google.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-17 15:22:00 -04:00
Simon Xiao f3c9d40ee1 hv_netvsc: change netvsc device default duplex to FULL
The netvsc device supports full duplex by default.
This warnings in log from bonding device which did not like
seeing UNKNOWN duplex.

Signed-off-by: Simon Xiao <sixiao@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-17 15:12:33 -04:00
stephen hemminger 776e726bfb netvsc: fix RCU warning in get_stats
The statistics functionis called with RTNL held during probe
but with RCU held during access from /proc and elsewhere.
This is safe so update the lockdep annotation.

Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-17 15:12:33 -04:00
Dan Carpenter 1dbba4cb8d net: phy: test the right variable in phy_write_mmd()
This is a copy and paste buglet.  We meant to test for ->write_mmd but
we test for ->read_mmd.

Fixes: 1ee6b9bc62 ("net: phy: make phy_(read|write)_mmd() generic MMD accessors")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-17 15:10:42 -04:00
David S. Miller 450cc8cce2 Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
Johan Hedberg says:

====================
pull request: bluetooth-next 2017-04-14

Here's the main batch of Bluetooth & 802.15.4 patches for the 4.12
kernel.

 - Many fixes to 6LoWPAN, in particular for BLE
 - New CA8210 IEEE 802.15.4 device driver (accounting for most of the
   lines of code added in this pull request)
 - Added Nokia Bluetooth (UART) HCI driver
 - Some serdev & TTY changes that are dependencies for the Nokia
   driver (with acks from relevant maintainers and an agreement that
   these come through the bluetooth tree)
 - Support for new Intel Bluetooth device
 - Various other minor cleanups/fixes here and there

Please let me know if there are any issues pulling. Thanks.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-17 15:00:57 -04:00
David S. Miller d584fec634 Merge branch 'bpf-lru-perf'
Martin KaFai Lau says:

====================
bpf: LRU performance and test-program improvements

The first 4 patches make a few improvements to the LRU tests.

Patch 5/6 is to improve the performance of BPF_F_NO_COMMON_LRU map.

Patch 6/6 adds an example in using LRU map with map-in-map.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-17 13:55:54 -04:00
Martin KaFai Lau 3a5795b83d bpf: lru: Add map-in-map LRU example
This patch adds a map-in-map LRU example.
If we know only a subset of cores will use the
LRU, we can allocate a common LRU list per targeting core
and store it into an array-of-hashs.

It allows using the common LRU map with map-update performance
comparable to the BPF_F_NO_COMMON_LRU map but without wasting memory
on the unused cores that we know they will never access the LRU map.

BPF_F_NO_COMMON_LRU:
> map_perf_test 32 8 10000000 10000000 | awk '{sum += $3}END{print sum}'
9234314 (9.23M/s)

map-in-map LRU:
> map_perf_test 512 8 1260000 80000000 | awk '{sum += $3}END{print sum}'
9962743 (9.96M/s)

Notes that the max_entries for the map-in-map LRU test is 1260000 which
is the max_entries for each inner LRU map.  8 processes have been
started, so 8 * 1260000 = 10080000 (~10M) which is close to what is
used in the BPF_F_NO_COMMON_LRU test.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-17 13:55:52 -04:00
Martin KaFai Lau 695ba2651a bpf: lru: Lower the PERCPU_NR_SCANS from 16 to 4
After doing map_perf_test with a much bigger
BPF_F_NO_COMMON_LRU map, the perf report shows a
lot of time spent in rotating the inactive list (i.e.
__bpf_lru_list_rotate_inactive):
> map_perf_test 32 8 10000 1000000 | awk '{sum += $3}END{print sum}'
19644783 (19M/s)
> map_perf_test 32 8 10000000 10000000 |  awk '{sum += $3}END{print sum}'
6283930 (6.28M/s)

By inactive, it usually means the element is not in cache.  Hence,
there is a need to tune the PERCPU_NR_SCANS value.

This patch finds a better number of elements to
scan during each list rotation.  The PERCPU_NR_SCANS (which
is defined the same as PERCPU_FREE_TARGET) decreases
from 16 elements to 4 elements.  This change only
affects the BPF_F_NO_COMMON_LRU map.

The test_lru_dist does not show meaningful difference
between 16 and 4.  Our production L4 load balancer which uses
the LRU map for conntrack-ing also shows little change in cache
hit rate.  Since both benchmark and production data show no
cache-hit difference, PERCPU_NR_SCANS is lowered from 16 to 4.
We can consider making it configurable if we find a usecase
later that shows another value works better and/or use
a different rotation strategy.

After this change:
> map_perf_test 32 8 10000000 10000000 |  awk '{sum += $3}END{print sum}'
9240324 (9.2M/s)

i.e. 6.28M/s -> 9.2M/s

The test_lru_dist has not shown meaningful difference:
> test_lru_dist zipf.100k.a1_01.out 4000 1:
nr_misses: 31575 (Before) vs 31566 (After)

> test_lru_dist zipf.100k.a0_01.out 40000 1
nr_misses: 67036 (Before) vs 67031 (After)

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-17 13:55:52 -04:00
Martin KaFai Lau 9fd63d05f3 bpf: Allow bpf sample programs (*_user.c) to change bpf_map_def
The current bpf_map_def is statically defined during compile
time.  This patch allows the *_user.c program to change it during
runtime.  It is done by adding load_bpf_file_fixup_map() which
takes a callback.  The callback will be called before creating
each map so that it has a chance to modify the bpf_map_def.

The current usecase is to change max_entries in map_perf_test.
It is interesting to test with a much bigger map size in
some cases (e.g. the following patch on bpf_lru_map.c).
However,  it is hard to find one size to fit all testing
environment.  Hence, it is handy to take the max_entries
as a cmdline arg and then configure the bpf_map_def during
runtime.

This patch adds two cmdline args.  One is to configure
the map's max_entries.  Another is to configure the max_cnt
which controls how many times a syscall is called.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-17 13:55:52 -04:00
Martin KaFai Lau bf8db5d243 bpf: lru: Refactor LRU map tests in map_perf_test
One more LRU test will be added later in this patch series.
In this patch, we first move all existing LRU map tests into
a single syscall (connect) first so that the future new
LRU test can be added without hunting another syscall.

One of the map name is also changed from percpu_lru_hash_map
to nocommon_lru_hash_map to avoid the confusion with percpu_hash_map.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-17 13:55:52 -04:00
Martin KaFai Lau 6467acbc70 bpf: lru: Cleanup test_lru_map.c
This patch does the following cleanup on test_lru_map.c
1) Fix indentation (Replace spaces by tabs)
2) Remove redundant BPF_F_NO_COMMON_LRU test
3) Simplify some comments

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-17 13:55:52 -04:00
Martin KaFai Lau 9746f85686 bpf: lru: Add test_lru_sanity6 for BPF_F_NO_COMMON_LRU
test_lru_sanity3 is not applicable to BPF_F_NO_COMMON_LRU.
It just happens to work when PERCPU_FREE_TARGET == 16.

This patch:
1) Disable test_lru_sanity3 for BPF_F_NO_COMMON_LRU
2) Add test_lru_sanity6 to test list rotation for
   the BPF_F_NO_COMMON_LRU map.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-17 13:55:52 -04:00
Jisheng Zhang 82960fff09 net: mvneta: fix failed to suspend if WOL is enabled
Recently, suspend/resume and WOL support are added into mvneta driver.
If we enable WOL, then we get some error as below on Marvell BG4CT
platforms during suspend:

[  184.149723] dpm_run_callback(): mdio_bus_suspend+0x0/0x50 returns -16
[  184.149727] PM: Device f7b62004.mdio-mi:00 failed to suspend: error -16

-16 means -EBUSY, phy_suspend() will return -EBUSY if it finds the
device has WOL enabled.

We fix this issue by properly setting the netdev's power.can_wakeup
and power.wakeup, i.e

1. in mvneta_mdio_probe(), call device_set_wakeup_capable() to set
power.can_wakeup if the phy support WOL.

2. in mvneta_ethtool_set_wol(), call device_set_wakeup_enable() to
set power.wakeup if WOL has been successfully enabled in phy.

Signed-off-by: Jisheng Zhang <jszhang@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-17 13:46:35 -04:00
Nikolay Aleksandrov cab93af0ed net: bridge: notify on hw fdb takeover
Recently we added support for SW fdbs to take over HW ones, but that
results in changing a user-visible fdb flag thus we need to send a
notification, also it's consistent with how HW takes over SW entries.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-17 13:45:34 -04:00
WANG Cong f5001ceab8 kcm: remove a useless copy_from_user()
struct kcm_clone only contains fd, and kcm_clone() only
writes this struct, so there is no need to copy it from user.

Cc: Tom Herbert <tom@herbertland.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-17 13:28:48 -04:00
Jiri Pirko 6b2af241f0 MAINTAINERS: rename TC entry and add couple of header files
The section is not specific only to "TC classifiers", but applies to the
whole TC subsystem. Also, add couple of forgotten headers.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-17 13:26:21 -04:00
Russell King 786df9c2a4 net: phy: simplify phy_supported_speeds()
Simplify the loop in phy_supported_speeds().

Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-17 13:25:07 -04:00
Russell King d06130377c net: phy: improve phylib correctness for non-autoneg settings
phylib has some undesirable behaviour when forcing a link mode through
ethtool.  phylib uses this code:

	idx = phy_find_valid(phy_find_setting(phydev->speed, phydev->duplex),
			features);

to find an index in the settings table.  phy_find_setting() starts at
index 0, and scans upwards looking for an exact speed and duplex match.
When it doesn't find it, it returns MAX_NUM_SETTINGS - 1, which is
10baseT-Half duplex.

phy_find_valid() then scans from the point (and effectively only checks
one entry) before bailing out, returning MAX_NUM_SETTINGS - 1.

phy_sanitize_settings() then sets ->speed to SPEED_10 and ->duplex to
DUPLEX_HALF whether or not 10baseT-Half is supported or not.  This goes
against all the comments against these functions, and 10baseT-Half may
not even be supported by the hardware.

Rework these functions, introducing a new method of scanning the table.
There are two modes of lookup that phylib wants: exact, and inexact.

- in exact mode, we return either an exact match or failure
- in inexact mode, we return an exact match if it exists, a match at
  the highest speed that is not greater than the requested speed
  (ignoring duplex), or failing that, the lowest supported speed, or
  failure.

The biggest difference is that we always check whether the entry is
supported before further consideration, so all unsupported entries are
not considered as candidates.

This results in arguably saner behaviour, better matches the comments,
and is probably what users would expect.

This becomes important as ethernet speeds increase, PHYs exist which do
not support the 10Mbit speeds, and half-duplex is likely to become
obsolete - it's already not even an option on 10Gbit and faster links.

Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-17 13:25:07 -04:00
stephen hemminger 8ea3e43911 Subject: net: allow configuring default qdisc
Since 3.12 it has been possible to configure the default queuing
discipline via sysctl. This patch adds ability to configure the
default queue discipline in kernel configuration. This is useful for
environments where configuring the value from userspace is difficult
to manage.

The default is still the same as before (pfifo_fast) and it is
possible to change after kernel init with sysctl. This is similar
to how TCP congestion control works.

Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-17 13:23:06 -04:00
David S. Miller 7ca9511813 Merge branch 'qed-arfs'
Manish Chopra says:

====================
qed/qede: aRFS support

This series adds support for Accelerated Flow Steering
in qede driver for TCP/UDP over IPv4/IPv6 protocols.

Please consider applying this series to "net-next"
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-17 13:06:19 -04:00
Chopra, Manish e4917d46a6 qede: Add aRFS support
This patch adds support for aRFS for TCP and UDP
protocols with IPv4/IPv6.

Signed-off-by: Manish Chopra <manish.chopra@cavium.com>
Signed-off-by: Yuval Mintz <yuval.mintz@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-17 13:06:18 -04:00
Chopra, Manish d51e4af5c2 qed: aRFS infrastructure support
This patch adds necessary APIs to interface with
qede aRFS support in successive patch.

It also reserves separate PTT entry for aRFS,
[as being in fastpath flow] for hardware access instead of
trying to acquire it at run time from the ptt pool.

Signed-off-by: Manish Chopra <manish.chopra@cavium.com>
Signed-off-by: Yuval Mintz <yuval.mintz@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-17 13:06:18 -04:00
Martin Wetterwald 53a759c89b smsc95xx: Add comments to the registers definition
This chip is used by a lot of embedded devices and also by the Raspberry
Pi 1, 2 & 3 which were created to promote the study of computer
sciences. Students wanting to learn kernel / network device driver
programming through those devices can only rely on the Linux kernel
driver source to make their own.

This commit adds a lot of comments to the registers definition to expand
the register names.

Cc: Steve Glendinning <steve.glendinning@shawell.net>
Cc: Microchip Linux Driver Support <UNGLinuxDriver@microchip.com>
CC: David Miller <davem@davemloft.net>
Signed-off-by: Martin Wetterwald <martin@wetterwald.eu>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Steve Glendinning <steve.glendinning@shawell.net>
Acked-by: Woojung Huh <Woojung.Huh@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-17 13:04:52 -04:00
R. Parameswaran 57240d0078 l2tp: device MTU setup, tunnel socket needs a lock
The MTU overhead calculation in L2TP device set-up
merged via commit b784e7ebfc
needs to be adjusted to lock the tunnel socket while
referencing the sub-data structures to derive the
socket's IP overhead.

Reported-by: Guillaume Nault <g.nault@alphalink.fr>
Tested-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: R. Parameswaran <rparames@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-17 13:01:48 -04:00
David Ahern 4a6e3c5def net: ipv6: send unsolicited NA on admin up
ndisc_notify is the ipv6 equivalent to arp_notify. When arp_notify is
set to 1, gratuitous arp requests are sent when the device is brought up.
The same is expected when ndisc_notify is set to 1 (per ndisc_notify in
Documentation/networking/ip-sysctl.txt). The NA is not sent on NETDEV_UP
event; add it.

Fixes: 5cb04436ee ("ipv6: add knob to send unsolicited ND on link-layer address change")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-17 12:44:55 -04:00
David S. Miller 70d40b366d Merge branch 'mlx5-RDMA-netdevice'
Saeed Mahameed says:

====================
Mellanox, mlx5 RDMA net device support

This series provides the lower level mlx5 support of RDMA netdevice
creation API [1] suggested and introduced by Intel's HFI OPA VNIC
netdevice driver [2], to enable IPoIB mlx5 RDMA netdevice creation.

mlx5 IPoIB RDMA netdev will serve as an acceleration netdevice for the current
IPoIB ULP generic netdevice, providing:
	- mlx5 RSS support.
	- mlx5 HW RX,TX offloads (checksum, TSO, LRO, etc ..).
	- Full mlx5 HW features transparent to the ULP itself.

The idea here is to reuse and benefit from the already implemented mlx5e netdevice
management and channels API for both etherent and RDMA netdevices, since both IPoIB
and Ethernet netdevices share same common mlx5 HW resources (with some small
exceptions) and share most of the control/data path logic, it is more natural to
have them share the same code.

The differences between IPoIB and Ethernet netdevices can be summarized to:

Steering:
In mlx5, IPoIB traffic is sent and received from an underlay special QP, and in Ethernet
the traffic is handled by vports and vport steering is managed by e-switch or FW.

For IPoIB traffic to get steered correctly the only thing we need to do is to create RSS
HW contexts for RX and TX HW contexts for TX (similar to mlx5e) with the underlay QP attached to
them (underlay QP will be 0 in case of Ethernet).

RX,TX:
Since IPoIB traffic is different, slightly modified RX and TX handlers are required,
still we do some code reuse in data path via common helper functions.

All of the other generic netdevice and mlx5 aspects will be shared between mlx5 Ethernet
and IPoIB netdevices, e.g.
	- Channels creation and handling (RQs,SQs,CQs, NAPI, interrupt moderation, etc..)
	- Offloads, checksum, GRO, LRO, TSO, and more.
        - netdevice logic and non Ethernet specific ndos (open/close, etc..)

In order to achieve what we want:

In patchet 1 to 3, Erez added the supported for underlay QP in mlx5_ifc and refactored
the mlx5 steering code to accept the underlay QP as a parameter for creating steering
objects and enabled flow steering for IB link.

Then we are going to use the mlx5e netdevice profile, which is already used to separate between
NIC and VF representors netdevices, to create new type of IPoIB netdevice profile.

For that, one small refactoring is required to make mlx5e netdevice profile management
more genetic and agnostic to link type which is done in patch #4.

In patch #5, we introduce ipoib.c to host all of mlx5 IPoIB (mlx5i) specific logic and a
skeleton for the IPoIB mlx5 netdevice profile, and we will start filling it in next patches,
using mlx5e already existing APIs.

Patch #6 and #7, Implement init/cleanup RX mlx5i netdev profile handlers to create mlx5 RSS
resources, same as mlx5e but without vlan and L2 steering tables.

Patch #8, Implement init/cleanup TX mlx5i netdev profile handlers, to create TX resources
same as mlx5e but with one TC (tc = 0) support.

Patch #9, Implement mlx5i open/close ndos, where we reuese the mlx5e channels API, to start/stop TX/RX channels.

Patch #10, Create the underlay QP and attach it to mlx5i RSS and TX HW contexts.

Patch #11 and #12, Break down the mlx5e xmit flow into smaller helper function and implement the
mlx5i IPoIB xmit routine.

Patch #13 and #14, Have an RX handler per netdevice profile. We already do this before this series
in a non clean way to separate between NIC netdev and VF representor RX handlers, in patch 13 we make
the RX handler generic and bound to a profile and in patch 14 we implement the IPoIB RX handlers.

Patch #15, Small cleanup to avoid e-switch with IPoIB netdev.

In order to enable mlx5 IPoIB, a merge between the IPoIB RDMA netdev offolad support [3]
- which was alread submitted to the rdma mailing list - and this series is required
plus an extra small patch [4] which will connect between both sides and actually enables the offload.

Once both patch-sets are merged into linux we will have to submit the extra small patch [4], to enable
the feature.

Thanks,
Saeed.

[1] https://patchwork.kernel.org/patch/9676637/

[2] https://lwn.net/Articles/715453/
    https://patchwork.kernel.org/patch/9587815/

[3] https://patchwork.kernel.org/patch/9672069/
[4] https://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux.git/commit/?id=0141db6a686e32294dee015b7d07706162ba48d8
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-17 11:08:33 -04:00
Erez Shitrit 93d576af3c hw/mlx5: Add New bit to check over QP creation
Add check for bit IB_QP_CREATE_NETIF_QP while creating QP.

Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-17 11:08:32 -04:00
Saeed Mahameed 955bc48081 net/mlx5e: E-switch vport manager is valid for ethernet only
Currently the driver support only ethernet eswitch, and we want to
protect downstream IPoIB netdev from trying to access it in IB link.

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Reviewed-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-17 11:08:32 -04:00
Saeed Mahameed 9d6bd752c6 net/mlx5e: IPoIB, RX handler
Implement IPoIB RX SKB handler.

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Reviewed-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-17 11:08:32 -04:00
Saeed Mahameed 20fd0c193f net/mlx5e: RX handlers per netdev profile
In order to have different RX handler per profile, fix and refactor the
current code to take the rx handler directly from the netdevice profile
rather than computing it on runtime as it was done with the switchdev
mode representor rx handler.

This will also remove the current wrong assumption in mlx5e_alloc_rq
code that mlx5e_priv->ppriv is of the type vport_rep.

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Reviewed-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-17 11:08:31 -04:00
Saeed Mahameed 258545449b net/mlx5e: IPoIB, Xmit flow
Implement mlx5e's IPoIB SKB transmit using the helper functions provided
by mlx5e ethernet tx flow, the only difference in the code between
mlx5e_xmit and mlx5i_xmit is that IPoIB has some extra fields to fill
(UD datagram segment) in the TX descriptor (WQE) and it doesn't need to
have any vlan handling.

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Reviewed-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-17 11:08:31 -04:00
Saeed Mahameed 77bdf8950b net/mlx5e: Xmit flow break down
Break current mlx5e xmit flow into smaller blocks (helper functions)
in order to reuse them for IPoIB SKB transmission.

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Reviewed-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-17 11:08:31 -04:00
Saeed Mahameed ec8fd927b7 net/mlx5e: IPoIB, Underlay QP
Create IPoIB underlay QP needed by the IPoIB netdevice profile for RSS
and TX HW context to perform on IPoIB traffic.

Reset the underlay QP on dev_uninit ndo to stop IPoIB traffic going
through this QP when the ULP IPoIB decides to cleanup.

Implement attach/detach mcast RDMA netdev callbacks for later RDMA
netdev use.

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Reviewed-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-17 11:08:31 -04:00
Saeed Mahameed 603f4a4521 net/mlx5e: IPoIB, Basic netdev ndos open/close
Implement open/close of IPoIB netdevice ndos using mlx5e's
channels API to manage data path resources (RQs/SQs/CQs).

Set IPoIB netdev address on dev_init ndo.

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Reviewed-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-17 11:08:30 -04:00
Saeed Mahameed 5426a0b274 net/mlx5e: IPoIB, TX TIS creation
Modify mlx5e tis creation function to accept underlay qp number, which
will be needed by IPoIB.

Implement mlx5i (IPoIB) tx init/cleanup netdevice profile flows to
create one TIS with the IPoIB underlay qp, for IPoIB TX SQs.

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Reviewed-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-17 11:08:30 -04:00
Saeed Mahameed bc81b9d326 net/mlx5e: IPoIB, RSS flow steering tables
Like the mlx5e ethernet mode, on IPoIB mode we need to create RX steering
tables, but IPoIB do not require MAC and VLAN steering tables so the
only tables we create in here are:
1. TTC Table (Traffic Type Classifier table for RSS steering)
2. ARFS Table (for accelerated RFS support)

Creation of those tables is identical to mlx5e ethernet mode, hence the
use of mlx5e_create_ttc_table and mlx5e_arfs_create_tables.

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Reviewed-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-17 11:08:30 -04:00
Saeed Mahameed 8f493ffd88 net/mlx5e: IPoIB, RX steering RSS RQTs and TIRs
Implement IPoIB RX RSS (RQTs and TIRs) HW objects creation,
All we do here is simply reuse the mlx5e implementation to create
direct and indirect (RSS) steering HW objects.

For that we just expose
mlx5e_{create,destroy}_{direct,indirect}_{rqt,tir} functions into en.h
and call them from ipoib.c in init/cleanup_rx IPoIB netdevice profile
callbacks.

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Reviewed-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-17 11:08:30 -04:00
Saeed Mahameed 48935bbb7a net/mlx5e: IPoIB, Add netdevice profile skeleton
Create mlx5e IPoIB netdevice profile skeleton in the new ipoib.c
file with empty implementation.

Downstream patches will provide the full mlx5 rdma netdevice acceleration
support for IPoIB into this new file, by using the mlx5e netdevice
profile and new mlx5_channels APIs and infrastructures.
Same as already done in mlx5e NIC netdevice and switchdev mode VF
representors.

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Reviewed-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-17 11:08:30 -04:00
Saeed Mahameed 2c3b5beec4 net/mlx5e: More generic netdev management API
In preparation for mlx5e RDMA net_device support, here we generalize
mlx5e_attach/detach in a way that those functions will be agnostic
to link type.  For that we move ethernet specific NIC net device logic out
of those functions into {nic,rep}_{enable/disable} mlx5e NIC and
representor profiles callbacks.

Also some of the logic was moved only to NIC profile since it is not right
to have this logic for representor net device (e.g. set port MTU).

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Reviewed-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-17 11:08:29 -04:00
Erez Shitrit ffdb8827ec net/mlx5: Enable flow-steering for IB link
Get the relevant capabilities if supports ipoib_enhanced_offloads and
init the flow steering table accordingly.

Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-17 11:08:29 -04:00
Erez Shitrit b3ba51498b net/mlx5: Refactor create flow table method to accept underlay QP
IB flow tables need the underlay qp to perform flow steering.
Here we change the API of the flow tables creation to accept the
underlay QP number as a parameter in order to support IB (IPoIB) flow
steering.

Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-17 11:08:29 -04:00
Erez Shitrit 500a3d0ded net/mlx5: Add IPoIB enhanced offloads bits to mlx5_ifc
New capability bit: ipoib_enhanced_offloads, indicates new ability for UD
QP to do RSS and enhanced IPoIB offloads and acceleration.

Add underlay_qpn to the TIS and flow_table objects In order to support
SET_ROOT command, to connect between IPoIB QPs and flow steering tables.

Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-17 11:08:29 -04:00
Haiyang Zhang f72860afa2 hv_netvsc: Exclude non-TCP port numbers from vRSS hashing
Azure hosts are not supporting non-TCP port numbers in vRSS hashing for
now. For example, UDP packet loss rate will be high if port numbers are
also included in vRSS hash.

So, we created this patch to use only IP numbers for hashing in non-TCP
traffic.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-17 11:05:19 -04:00
Haiyang Zhang 8db91f6a9b hv_netvsc: Fix the queue index computation in forwarding case
If the outgoing skb has a RX queue mapping available, we use the queue
number directly, other than put it through Send Indirection Table.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-17 11:05:19 -04:00