From cf28f3bbfca097d956f9021cb710dfad56adcc62 Mon Sep 17 00:00:00 2001 From: Yonghong Song Date: Mon, 17 Aug 2020 10:42:14 -0700 Subject: [PATCH 01/46] bpf: Use get_file_rcu() instead of get_file() for task_file iterator With latest `bpftool prog` command, we observed the following kernel panic. BUG: kernel NULL pointer dereference, address: 0000000000000000 #PF: supervisor instruction fetch in kernel mode #PF: error_code(0x0010) - not-present page PGD dfe894067 P4D dfe894067 PUD deb663067 PMD 0 Oops: 0010 [#1] SMP CPU: 9 PID: 6023 ... RIP: 0010:0x0 Code: Bad RIP value. RSP: 0000:ffffc900002b8f18 EFLAGS: 00010286 RAX: ffff8883a405f400 RBX: ffff888e46a6bf00 RCX: 000000008020000c RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8883a405f400 RBP: ffff888e46a6bf50 R08: 0000000000000000 R09: ffffffff81129600 R10: ffff8883a405f300 R11: 0000160000000000 R12: 0000000000002710 R13: 000000e9494b690c R14: 0000000000000202 R15: 0000000000000009 FS: 00007fd9187fe700(0000) GS:ffff888e46a40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffffffffffffd6 CR3: 0000000de5d33002 CR4: 0000000000360ee0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: rcu_core+0x1a4/0x440 __do_softirq+0xd3/0x2c8 irq_exit+0x9d/0xa0 smp_apic_timer_interrupt+0x68/0x120 apic_timer_interrupt+0xf/0x20 RIP: 0033:0x47ce80 Code: Bad RIP value. RSP: 002b:00007fd9187fba40 EFLAGS: 00000206 ORIG_RAX: ffffffffffffff13 RAX: 0000000000000002 RBX: 00007fd931789160 RCX: 000000000000010c RDX: 00007fd9308cdfb4 RSI: 00007fd9308cdfb4 RDI: 00007ffedd1ea0a8 RBP: 00007fd9187fbab0 R08: 000000000000000e R09: 000000000000002a R10: 0000000000480210 R11: 00007fd9187fc570 R12: 00007fd9316cc400 R13: 0000000000000118 R14: 00007fd9308cdfb4 R15: 00007fd9317a9380 After further analysis, the bug is triggered by Commit eaaacd23910f ("bpf: Add task and task/file iterator targets") which introduced task_file bpf iterator, which traverses all open file descriptors for all tasks in the current namespace. The latest `bpftool prog` calls a task_file bpf program to traverse all files in the system in order to associate processes with progs/maps, etc. When traversing files for a given task, rcu read_lock is taken to access all files in a file_struct. But it used get_file() to grab a file, which is not right. It is possible file->f_count is 0 and get_file() will unconditionally increase it. Later put_file() may cause all kind of issues with the above as one of sympotoms. The failure can be reproduced with the following steps in a few seconds: $ cat t.c #include #include #include #include #include #define N 10000 int fd[N]; int main() { int i; for (i = 0; i < N; i++) { fd[i] = open("./note.txt", 'r'); if (fd[i] < 0) { fprintf(stderr, "failed\n"); return -1; } } for (i = 0; i < N; i++) close(fd[i]); return 0; } $ gcc -O2 t.c $ cat run.sh #/bin/bash for i in {1..100} do while true; do ./a.out; done & done $ ./run.sh $ while true; do bpftool prog >& /dev/null; done This patch used get_file_rcu() which only grabs a file if the file->f_count is not zero. This is to ensure the file pointer is always valid. The above reproducer did not fail for more than 30 minutes. Fixes: eaaacd23910f ("bpf: Add task and task/file iterator targets") Suggested-by: Josef Bacik Signed-off-by: Yonghong Song Signed-off-by: Alexei Starovoitov Reviewed-by: Josef Bacik Link: https://lore.kernel.org/bpf/20200817174214.252601-1-yhs@fb.com --- kernel/bpf/task_iter.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c index 232df29793e9..f21b5e1e4540 100644 --- a/kernel/bpf/task_iter.c +++ b/kernel/bpf/task_iter.c @@ -178,10 +178,11 @@ again: f = fcheck_files(curr_files, curr_fd); if (!f) continue; + if (!get_file_rcu(f)) + continue; /* set info->fd */ info->fd = curr_fd; - get_file(f); rcu_read_unlock(); return f; } From 3fb1a96a91120877488071a167d26d76be4be977 Mon Sep 17 00:00:00 2001 From: Andrii Nakryiko Date: Tue, 18 Aug 2020 09:44:56 -0700 Subject: [PATCH 02/46] libbpf: Fix build on ppc64le architecture MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit On ppc64le we get the following warning: In file included from btf_dump.c:16:0: btf_dump.c: In function ‘btf_dump_emit_struct_def’: ../include/linux/kernel.h:20:17: error: comparison of distinct pointer types lacks a cast [-Werror] (void) (&_max1 == &_max2); \ ^ btf_dump.c:882:11: note: in expansion of macro ‘max’ m_sz = max(0LL, btf__resolve_size(d->btf, m->type)); ^~~ Fix by explicitly casting to __s64, which is a return type from btf__resolve_size(). Fixes: 702eddc77a90 ("libbpf: Handle GCC built-in types for Arm NEON") Signed-off-by: Andrii Nakryiko Signed-off-by: Alexei Starovoitov Link: https://lore.kernel.org/bpf/20200818164456.1181661-1-andriin@fb.com --- tools/lib/bpf/btf_dump.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/lib/bpf/btf_dump.c b/tools/lib/bpf/btf_dump.c index fe39bd774697..57c00fa63932 100644 --- a/tools/lib/bpf/btf_dump.c +++ b/tools/lib/bpf/btf_dump.c @@ -879,7 +879,7 @@ static void btf_dump_emit_struct_def(struct btf_dump *d, btf_dump_printf(d, ": %d", m_sz); off = m_off + m_sz; } else { - m_sz = max(0LL, btf__resolve_size(d->btf, m->type)); + m_sz = max((__s64)0, btf__resolve_size(d->btf, m->type)); off = m_off + m_sz * 8; } btf_dump_printf(d, ";"); From 8b61fba503904acae24aeb2bd5569b4d6544d48f Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Alvin=20=C5=A0ipraga?= Date: Tue, 18 Aug 2020 10:51:34 +0200 Subject: [PATCH 03/46] macvlan: validate setting of multiple remote source MAC addresses MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Remote source MAC addresses can be set on a 'source mode' macvlan interface via the IFLA_MACVLAN_MACADDR_DATA attribute. This commit tightens the validation of these MAC addresses to match the validation already performed when setting or adding a single MAC address via the IFLA_MACVLAN_MACADDR attribute. iproute2 uses IFLA_MACVLAN_MACADDR_DATA for its 'macvlan macaddr set' command, and IFLA_MACVLAN_MACADDR for its 'macvlan macaddr add' command, which demonstrates the inconsistent behaviour that this commit addresses: # ip link add link eth0 name macvlan0 type macvlan mode source # ip link set link dev macvlan0 type macvlan macaddr add 01:00:00:00:00:00 RTNETLINK answers: Cannot assign requested address # ip link set link dev macvlan0 type macvlan macaddr set 01:00:00:00:00:00 # ip -d link show macvlan0 5: macvlan0@eth0: mtu 1500 ... link/ether 2e:ac:fd:2d:69:f8 brd ff:ff:ff:ff:ff:ff promiscuity 0 macvlan mode source remotes (1) 01:00:00:00:00:00 numtxqueues 1 ... With this change, the 'set' command will (rightly) fail in the same way as the 'add' command. Signed-off-by: Alvin Šipraga Signed-off-by: David S. Miller --- drivers/net/macvlan.c | 21 +++++++++++++++++---- 1 file changed, 17 insertions(+), 4 deletions(-) diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c index 4942f6112e51..5da04e997989 100644 --- a/drivers/net/macvlan.c +++ b/drivers/net/macvlan.c @@ -1269,6 +1269,9 @@ static void macvlan_port_destroy(struct net_device *dev) static int macvlan_validate(struct nlattr *tb[], struct nlattr *data[], struct netlink_ext_ack *extack) { + struct nlattr *nla, *head; + int rem, len; + if (tb[IFLA_ADDRESS]) { if (nla_len(tb[IFLA_ADDRESS]) != ETH_ALEN) return -EINVAL; @@ -1316,6 +1319,20 @@ static int macvlan_validate(struct nlattr *tb[], struct nlattr *data[], return -EADDRNOTAVAIL; } + if (data[IFLA_MACVLAN_MACADDR_DATA]) { + head = nla_data(data[IFLA_MACVLAN_MACADDR_DATA]); + len = nla_len(data[IFLA_MACVLAN_MACADDR_DATA]); + + nla_for_each_attr(nla, head, len, rem) { + if (nla_type(nla) != IFLA_MACVLAN_MACADDR || + nla_len(nla) != ETH_ALEN) + return -EINVAL; + + if (!is_valid_ether_addr(nla_data(nla))) + return -EADDRNOTAVAIL; + } + } + if (data[IFLA_MACVLAN_MACADDR_COUNT]) return -EINVAL; @@ -1372,10 +1389,6 @@ static int macvlan_changelink_sources(struct macvlan_dev *vlan, u32 mode, len = nla_len(data[IFLA_MACVLAN_MACADDR_DATA]); nla_for_each_attr(nla, head, len, rem) { - if (nla_type(nla) != IFLA_MACVLAN_MACADDR || - nla_len(nla) != ETH_ALEN) - continue; - addr = nla_data(nla); ret = macvlan_hash_add_source(vlan, addr); if (ret) From db06ea341fcd1752fbdb58454507faa140e3842f Mon Sep 17 00:00:00 2001 From: Edward Cree Date: Tue, 18 Aug 2020 13:43:30 +0100 Subject: [PATCH 04/46] sfc: really check hash is valid before using it Actually hook up the .rx_buf_hash_valid method in EF100's nic_type. Fixes: 068885434ccb ("sfc: check hash is valid before using it") Reported-by: Martin Habets Signed-off-by: Edward Cree Reviewed-by: Jesse Brandeburg Signed-off-by: David S. Miller --- drivers/net/ethernet/sfc/ef100_nic.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/net/ethernet/sfc/ef100_nic.c b/drivers/net/ethernet/sfc/ef100_nic.c index 206d70f9d95b..b8a7e9ed7913 100644 --- a/drivers/net/ethernet/sfc/ef100_nic.c +++ b/drivers/net/ethernet/sfc/ef100_nic.c @@ -739,6 +739,7 @@ const struct efx_nic_type ef100_pf_nic_type = { .rx_remove = efx_mcdi_rx_remove, .rx_write = ef100_rx_write, .rx_packet = __ef100_rx_packet, + .rx_buf_hash_valid = ef100_rx_buf_hash_valid, .fini_dmaq = efx_fini_dmaq, .max_rx_ip_filters = EFX_MCDI_FILTER_TBL_ROWS, .filter_table_probe = ef100_filter_table_up, @@ -820,6 +821,7 @@ const struct efx_nic_type ef100_vf_nic_type = { .rx_remove = efx_mcdi_rx_remove, .rx_write = ef100_rx_write, .rx_packet = __ef100_rx_packet, + .rx_buf_hash_valid = ef100_rx_buf_hash_valid, .fini_dmaq = efx_fini_dmaq, .max_rx_ip_filters = EFX_MCDI_FILTER_TBL_ROWS, .filter_table_probe = ef100_filter_table_up, From 9cbbc451098ec1e9942886023203b2247dec94bd Mon Sep 17 00:00:00 2001 From: Edward Cree Date: Tue, 18 Aug 2020 13:43:57 +0100 Subject: [PATCH 05/46] sfc: take correct lock in ef100_reset() When downing and upping the ef100 filter table, we need to take a write lock on efx->filter_sem, not just a read lock, because we may kfree() the table pointers. Without this, resets cause a WARN_ON from efx_rwsem_assert_write_locked(). Fixes: a9dc3d5612ce ("sfc_ef100: RX filter table management and related gubbins") Signed-off-by: Edward Cree Reviewed-by: Jesse Brandeburg Signed-off-by: David S. Miller --- drivers/net/ethernet/sfc/ef100_nic.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/net/ethernet/sfc/ef100_nic.c b/drivers/net/ethernet/sfc/ef100_nic.c index b8a7e9ed7913..19fe86b3b316 100644 --- a/drivers/net/ethernet/sfc/ef100_nic.c +++ b/drivers/net/ethernet/sfc/ef100_nic.c @@ -431,18 +431,18 @@ static int ef100_reset(struct efx_nic *efx, enum reset_type reset_type) /* A RESET_TYPE_ALL will cause filters to be removed, so we remove filters * and reprobe after reset to avoid removing filters twice */ - down_read(&efx->filter_sem); + down_write(&efx->filter_sem); ef100_filter_table_down(efx); - up_read(&efx->filter_sem); + up_write(&efx->filter_sem); rc = efx_mcdi_reset(efx, reset_type); if (rc) return rc; netif_device_attach(efx->net_dev); - down_read(&efx->filter_sem); + down_write(&efx->filter_sem); rc = ef100_filter_table_up(efx); - up_read(&efx->filter_sem); + up_write(&efx->filter_sem); if (rc) return rc; From 788f920a0f137baa4dbc1efdd5039c4a0a01b8d7 Mon Sep 17 00:00:00 2001 From: Edward Cree Date: Tue, 18 Aug 2020 13:44:18 +0100 Subject: [PATCH 06/46] sfc: null out channel->rps_flow_id after freeing it If an ef100_net_open() fails, ef100_net_stop() may be called without channel->rps_flow_id having been written; thus it may hold the address freed by a previous ef100_net_stop()'s call to efx_remove_filters(). This then causes a double-free when efx_remove_filters() is called again, leading to a panic. To prevent this, after freeing it, overwrite it with NULL. Fixes: a9dc3d5612ce ("sfc_ef100: RX filter table management and related gubbins") Signed-off-by: Edward Cree Reviewed-by: Jesse Brandeburg Signed-off-by: David S. Miller --- drivers/net/ethernet/sfc/rx_common.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/net/ethernet/sfc/rx_common.c b/drivers/net/ethernet/sfc/rx_common.c index ef9bca92b0b7..5e29284c89c9 100644 --- a/drivers/net/ethernet/sfc/rx_common.c +++ b/drivers/net/ethernet/sfc/rx_common.c @@ -849,6 +849,7 @@ void efx_remove_filters(struct efx_nic *efx) efx_for_each_channel(channel, efx) { cancel_delayed_work_sync(&channel->filter_work); kfree(channel->rps_flow_id); + channel->rps_flow_id = NULL; } #endif down_write(&efx->filter_sem); From e6a43910d55d09dae65772ad571d4c61e459b17a Mon Sep 17 00:00:00 2001 From: Edward Cree Date: Tue, 18 Aug 2020 13:44:50 +0100 Subject: [PATCH 07/46] sfc: don't free_irq()s if they were never requested If efx_nic_init_interrupt fails, or was never run (e.g. due to an earlier failure in ef100_net_open), freeing irqs in efx_nic_fini_interrupt is not needed and will cause error messages and stack traces. So instead, only do this if efx_nic_init_interrupt successfully completed, as indicated by the new efx->irqs_hooked flag. Fixes: 965b549f3c20 ("sfc_ef100: implement ndo_open/close and EVQ probing") Signed-off-by: Edward Cree Reviewed-by: Jesse Brandeburg Signed-off-by: David S. Miller --- drivers/net/ethernet/sfc/net_driver.h | 2 ++ drivers/net/ethernet/sfc/nic.c | 4 ++++ 2 files changed, 6 insertions(+) diff --git a/drivers/net/ethernet/sfc/net_driver.h b/drivers/net/ethernet/sfc/net_driver.h index dcb741d8bd11..062462a13847 100644 --- a/drivers/net/ethernet/sfc/net_driver.h +++ b/drivers/net/ethernet/sfc/net_driver.h @@ -846,6 +846,7 @@ struct efx_async_filter_insertion { * @timer_quantum_ns: Interrupt timer quantum, in nanoseconds * @timer_max_ns: Interrupt timer maximum value, in nanoseconds * @irq_rx_adaptive: Adaptive IRQ moderation enabled for RX event queues + * @irqs_hooked: Channel interrupts are hooked * @irq_rx_mod_step_us: Step size for IRQ moderation for RX event queues * @irq_rx_moderation_us: IRQ moderation time for RX event queues * @msg_enable: Log message enable flags @@ -1004,6 +1005,7 @@ struct efx_nic { unsigned int timer_quantum_ns; unsigned int timer_max_ns; bool irq_rx_adaptive; + bool irqs_hooked; unsigned int irq_mod_step_us; unsigned int irq_rx_moderation_us; u32 msg_enable; diff --git a/drivers/net/ethernet/sfc/nic.c b/drivers/net/ethernet/sfc/nic.c index d994d136bb03..d1e908846f5d 100644 --- a/drivers/net/ethernet/sfc/nic.c +++ b/drivers/net/ethernet/sfc/nic.c @@ -129,6 +129,7 @@ int efx_nic_init_interrupt(struct efx_nic *efx) #endif } + efx->irqs_hooked = true; return 0; fail2: @@ -154,6 +155,8 @@ void efx_nic_fini_interrupt(struct efx_nic *efx) efx->net_dev->rx_cpu_rmap = NULL; #endif + if (!efx->irqs_hooked) + return; if (EFX_INT_MODE_USE_MSI(efx)) { /* Disable MSI/MSI-X interrupts */ efx_for_each_channel(channel, efx) @@ -163,6 +166,7 @@ void efx_nic_fini_interrupt(struct efx_nic *efx) /* Disable legacy interrupt */ free_irq(efx->legacy_irq, efx); } + efx->irqs_hooked = false; } /* Register dump */ From 335956421c86f64fd46186d76d3961f6adcff187 Mon Sep 17 00:00:00 2001 From: Ganji Aravind Date: Tue, 18 Aug 2020 21:10:57 +0530 Subject: [PATCH 08/46] cxgb4: Fix work request size calculation for loopback test Work request used for sending loopback packet needs to add the firmware work request only once. So, fix by using correct structure size. Fixes: 7235ffae3d2c ("cxgb4: add loopback ethtool self-test") Signed-off-by: Ganji Aravind Reviewed-by: Jesse Brandeburg Signed-off-by: David S. Miller --- drivers/net/ethernet/chelsio/cxgb4/sge.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/chelsio/cxgb4/sge.c b/drivers/net/ethernet/chelsio/cxgb4/sge.c index d2b587d1670a..7c9fe4bc235b 100644 --- a/drivers/net/ethernet/chelsio/cxgb4/sge.c +++ b/drivers/net/ethernet/chelsio/cxgb4/sge.c @@ -2553,8 +2553,8 @@ int cxgb4_selftest_lb_pkt(struct net_device *netdev) pkt_len = ETH_HLEN + sizeof(CXGB4_SELFTEST_LB_STR); - flits = DIV_ROUND_UP(pkt_len + sizeof(struct cpl_tx_pkt) + - sizeof(*wr), sizeof(__be64)); + flits = DIV_ROUND_UP(pkt_len + sizeof(*cpl) + sizeof(*wr), + sizeof(__be64)); ndesc = flits_to_desc(flits); lb = &pi->ethtool_lb; From c650e04898072e4b579cbf8d9dd5b86bcdbe9b00 Mon Sep 17 00:00:00 2001 From: Ganji Aravind Date: Tue, 18 Aug 2020 21:10:58 +0530 Subject: [PATCH 09/46] cxgb4: Fix race between loopback and normal Tx path Even after Tx queues are marked stopped, there exists a small window where the current packet in the normal Tx path is still being sent out and loopback selftest ends up corrupting the same Tx ring. So, ensure selftest takes the Tx lock to synchronize access the Tx ring. Fixes: 7235ffae3d2c ("cxgb4: add loopback ethtool self-test") Signed-off-by: Ganji Aravind Reviewed-by: Jesse Brandeburg Signed-off-by: David S. Miller --- drivers/net/ethernet/chelsio/cxgb4/sge.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/chelsio/cxgb4/sge.c b/drivers/net/ethernet/chelsio/cxgb4/sge.c index 7c9fe4bc235b..869431a1eedd 100644 --- a/drivers/net/ethernet/chelsio/cxgb4/sge.c +++ b/drivers/net/ethernet/chelsio/cxgb4/sge.c @@ -2561,11 +2561,14 @@ int cxgb4_selftest_lb_pkt(struct net_device *netdev) lb->loopback = 1; q = &adap->sge.ethtxq[pi->first_qset]; + __netif_tx_lock(q->txq, smp_processor_id()); reclaim_completed_tx(adap, &q->q, -1, true); credits = txq_avail(&q->q) - ndesc; - if (unlikely(credits < 0)) + if (unlikely(credits < 0)) { + __netif_tx_unlock(q->txq); return -ENOMEM; + } wr = (void *)&q->q.desc[q->q.pidx]; memset(wr, 0, sizeof(struct tx_desc)); @@ -2598,6 +2601,7 @@ int cxgb4_selftest_lb_pkt(struct net_device *netdev) init_completion(&lb->completion); txq_advance(&q->q, ndesc); cxgb4_ring_tx_db(adap, &q->q, ndesc); + __netif_tx_unlock(q->txq); /* wait for the pkt to return */ ret = wait_for_completion_timeout(&lb->completion, 10 * HZ); From 989e4da042ca4a56bbaca9223d1a93639ad11e17 Mon Sep 17 00:00:00 2001 From: Sumera Priyadarsini Date: Wed, 19 Aug 2020 00:22:41 +0530 Subject: [PATCH 10/46] net: gianfar: Add of_node_put() before goto statement Every iteration of for_each_available_child_of_node() decrements reference count of the previous node, however when control is transferred from the middle of the loop, as in the case of a return or break or goto, there is no decrement thus ultimately resulting in a memory leak. Fix a potential memory leak in gianfar.c by inserting of_node_put() before the goto statement. Issue found with Coccinelle. Signed-off-by: Sumera Priyadarsini Signed-off-by: David S. Miller --- drivers/net/ethernet/freescale/gianfar.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/freescale/gianfar.c b/drivers/net/ethernet/freescale/gianfar.c index b513b8c5c3b5..41dd3d0f3452 100644 --- a/drivers/net/ethernet/freescale/gianfar.c +++ b/drivers/net/ethernet/freescale/gianfar.c @@ -750,8 +750,10 @@ static int gfar_of_init(struct platform_device *ofdev, struct net_device **pdev) continue; err = gfar_parse_group(child, priv, model); - if (err) + if (err) { + of_node_put(child); goto err_grp_init; + } } } else { /* SQ_SG_MODE */ err = gfar_parse_group(np, priv, model); From eabe861881a733fc84f286f4d5a1ffaddd4f526f Mon Sep 17 00:00:00 2001 From: Miaohe Lin Date: Sat, 15 Aug 2020 04:46:41 -0400 Subject: [PATCH 11/46] net: handle the return value of pskb_carve_frag_list() correctly pskb_carve_frag_list() may return -ENOMEM in pskb_carve_inside_nonlinear(). we should handle this correctly or we would get wrong sk_buff. Fixes: 6fa01ccd8830 ("skbuff: Add pskb_extract() helper function") Signed-off-by: Miaohe Lin Signed-off-by: David S. Miller --- net/core/skbuff.c | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 5c3b906aeef3..e18184ffa9c3 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -5987,9 +5987,13 @@ static int pskb_carve_inside_nonlinear(struct sk_buff *skb, const u32 off, if (skb_has_frag_list(skb)) skb_clone_fraglist(skb); - if (k == 0) { - /* split line is in frag list */ - pskb_carve_frag_list(skb, shinfo, off - pos, gfp_mask); + /* split line is in frag list */ + if (k == 0 && pskb_carve_frag_list(skb, shinfo, off - pos, gfp_mask)) { + /* skb_frag_unref() is not needed here as shinfo->nr_frags = 0. */ + if (skb_has_frag_list(skb)) + kfree_skb_list(skb_shinfo(skb)->frag_list); + kfree(data); + return -ENOMEM; } skb_release_data(skb); From 0410d07190961ac526f05085765a8d04d926545b Mon Sep 17 00:00:00 2001 From: Jiri Wiesner Date: Sun, 16 Aug 2020 20:52:44 +0200 Subject: [PATCH 12/46] bonding: fix active-backup failover for current ARP slave When the ARP monitor is used for link detection, ARP replies are validated for all slaves (arp_validate=3) and fail_over_mac is set to active, two slaves of an active-backup bond may get stuck in a state where both of them are active and pass packets that they receive to the bond. This state makes IPv6 duplicate address detection fail. The state is reached thus: 1. The current active slave goes down because the ARP target is not reachable. 2. The current ARP slave is chosen and made active. 3. A new slave is enslaved. This new slave becomes the current active slave and can reach the ARP target. As a result, the current ARP slave stays active after the enslave action has finished and the log is littered with "PROBE BAD" messages: > bond0: PROBE: c_arp ens10 && cas ens11 BAD The workaround is to remove the slave with "going back" status from the bond and re-enslave it. This issue was encountered when DPDK PMD interfaces were being enslaved to an active-backup bond. I would be possible to fix the issue in bond_enslave() or bond_change_active_slave() but the ARP monitor was fixed instead to keep most of the actions changing the current ARP slave in the ARP monitor code. The current ARP slave is set as inactive and backup during the commit phase. A new state, BOND_LINK_FAIL, has been introduced for slaves in the context of the ARP monitor. This allows administrators to see how slaves are rotated for sending ARP requests and attempts are made to find a new active slave. Fixes: b2220cad583c9 ("bonding: refactor ARP active-backup monitor") Signed-off-by: Jiri Wiesner Signed-off-by: David S. Miller --- drivers/net/bonding/bond_main.c | 18 ++++++++++++++++-- 1 file changed, 16 insertions(+), 2 deletions(-) diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c index 415a37e44cae..c5d3032dd1a2 100644 --- a/drivers/net/bonding/bond_main.c +++ b/drivers/net/bonding/bond_main.c @@ -2948,6 +2948,9 @@ static int bond_ab_arp_inspect(struct bonding *bond) if (bond_time_in_interval(bond, last_rx, 1)) { bond_propose_link_state(slave, BOND_LINK_UP); commit++; + } else if (slave->link == BOND_LINK_BACK) { + bond_propose_link_state(slave, BOND_LINK_FAIL); + commit++; } continue; } @@ -3056,6 +3059,19 @@ static void bond_ab_arp_commit(struct bonding *bond) continue; + case BOND_LINK_FAIL: + bond_set_slave_link_state(slave, BOND_LINK_FAIL, + BOND_SLAVE_NOTIFY_NOW); + bond_set_slave_inactive_flags(slave, + BOND_SLAVE_NOTIFY_NOW); + + /* A slave has just been enslaved and has become + * the current active slave. + */ + if (rtnl_dereference(bond->curr_active_slave)) + RCU_INIT_POINTER(bond->current_arp_slave, NULL); + continue; + default: slave_err(bond->dev, slave->dev, "impossible: link_new_state %d on slave\n", @@ -3106,8 +3122,6 @@ static bool bond_ab_arp_probe(struct bonding *bond) return should_notify_rtnl; } - bond_set_slave_inactive_flags(curr_arp_slave, BOND_SLAVE_NOTIFY_LATER); - bond_for_each_slave_rcu(bond, slave, iter) { if (!found && !before && bond_slave_is_up(slave)) before = slave; From 4ef1a7cb08e94da1f2f2a34ee6cefe7ae142dc98 Mon Sep 17 00:00:00 2001 From: Xin Long Date: Mon, 17 Aug 2020 14:30:49 +0800 Subject: [PATCH 13/46] ipv6: some fixes for ipv6_dev_find() This patch is to do 3 things for ipv6_dev_find(): As David A. noticed, - rt6_lookup() is not really needed. Different from __ip_dev_find(), ipv6_dev_find() doesn't have a compatibility problem, so remove it. As Hideaki suggested, - "valid" (non-tentative) check for the address is also needed. ipv6_chk_addr() calls ipv6_chk_addr_and_flags(), which will traverse the address hash list, but it's heavy to be called inside ipv6_dev_find(). This patch is to reuse the code of ipv6_chk_addr_and_flags() for ipv6_dev_find(). - dev parameter is passed into ipv6_dev_find(), as link-local addresses from user space has sin6_scope_id set and the dev lookup needs it. Fixes: 81f6cb31222d ("ipv6: add ipv6_dev_find()") Suggested-by: YOSHIFUJI Hideaki Reported-by: David Ahern Signed-off-by: Xin Long Signed-off-by: David S. Miller --- include/net/addrconf.h | 3 ++- net/ipv6/addrconf.c | 60 ++++++++++++++++-------------------------- net/tipc/udp_media.c | 8 +++--- 3 files changed, 28 insertions(+), 43 deletions(-) diff --git a/include/net/addrconf.h b/include/net/addrconf.h index ba3f6c15ad2b..18f783dcd55f 100644 --- a/include/net/addrconf.h +++ b/include/net/addrconf.h @@ -97,7 +97,8 @@ bool ipv6_chk_custom_prefix(const struct in6_addr *addr, int ipv6_chk_prefix(const struct in6_addr *addr, struct net_device *dev); -struct net_device *ipv6_dev_find(struct net *net, const struct in6_addr *addr); +struct net_device *ipv6_dev_find(struct net *net, const struct in6_addr *addr, + struct net_device *dev); struct inet6_ifaddr *ipv6_get_ifaddr(struct net *net, const struct in6_addr *addr, diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c index 8e761b8c47c6..01146b66d666 100644 --- a/net/ipv6/addrconf.c +++ b/net/ipv6/addrconf.c @@ -1893,12 +1893,13 @@ EXPORT_SYMBOL(ipv6_chk_addr); * 2. does the address exist on the specific device * (skip_dev_check = false) */ -int ipv6_chk_addr_and_flags(struct net *net, const struct in6_addr *addr, - const struct net_device *dev, bool skip_dev_check, - int strict, u32 banned_flags) +static struct net_device * +__ipv6_chk_addr_and_flags(struct net *net, const struct in6_addr *addr, + const struct net_device *dev, bool skip_dev_check, + int strict, u32 banned_flags) { unsigned int hash = inet6_addr_hash(net, addr); - const struct net_device *l3mdev; + struct net_device *l3mdev, *ndev; struct inet6_ifaddr *ifp; u32 ifp_flags; @@ -1909,10 +1910,11 @@ int ipv6_chk_addr_and_flags(struct net *net, const struct in6_addr *addr, dev = NULL; hlist_for_each_entry_rcu(ifp, &inet6_addr_lst[hash], addr_lst) { - if (!net_eq(dev_net(ifp->idev->dev), net)) + ndev = ifp->idev->dev; + if (!net_eq(dev_net(ndev), net)) continue; - if (l3mdev_master_dev_rcu(ifp->idev->dev) != l3mdev) + if (l3mdev_master_dev_rcu(ndev) != l3mdev) continue; /* Decouple optimistic from tentative for evaluation here. @@ -1923,15 +1925,23 @@ int ipv6_chk_addr_and_flags(struct net *net, const struct in6_addr *addr, : ifp->flags; if (ipv6_addr_equal(&ifp->addr, addr) && !(ifp_flags&banned_flags) && - (!dev || ifp->idev->dev == dev || + (!dev || ndev == dev || !(ifp->scope&(IFA_LINK|IFA_HOST) || strict))) { rcu_read_unlock(); - return 1; + return ndev; } } rcu_read_unlock(); - return 0; + return NULL; +} + +int ipv6_chk_addr_and_flags(struct net *net, const struct in6_addr *addr, + const struct net_device *dev, bool skip_dev_check, + int strict, u32 banned_flags) +{ + return __ipv6_chk_addr_and_flags(net, addr, dev, skip_dev_check, + strict, banned_flags) ? 1 : 0; } EXPORT_SYMBOL(ipv6_chk_addr_and_flags); @@ -1990,35 +2000,11 @@ EXPORT_SYMBOL(ipv6_chk_prefix); * * The caller should be protected by RCU, or RTNL. */ -struct net_device *ipv6_dev_find(struct net *net, const struct in6_addr *addr) +struct net_device *ipv6_dev_find(struct net *net, const struct in6_addr *addr, + struct net_device *dev) { - unsigned int hash = inet6_addr_hash(net, addr); - struct inet6_ifaddr *ifp, *result = NULL; - struct net_device *dev = NULL; - - rcu_read_lock(); - hlist_for_each_entry_rcu(ifp, &inet6_addr_lst[hash], addr_lst) { - if (net_eq(dev_net(ifp->idev->dev), net) && - ipv6_addr_equal(&ifp->addr, addr)) { - result = ifp; - break; - } - } - - if (!result) { - struct rt6_info *rt; - - rt = rt6_lookup(net, addr, NULL, 0, NULL, 0); - if (rt) { - dev = rt->dst.dev; - ip6_rt_put(rt); - } - } else { - dev = result->idev->dev; - } - rcu_read_unlock(); - - return dev; + return __ipv6_chk_addr_and_flags(net, addr, dev, !dev, 1, + IFA_F_TENTATIVE); } EXPORT_SYMBOL(ipv6_dev_find); diff --git a/net/tipc/udp_media.c b/net/tipc/udp_media.c index 53f0de0676b7..911d13cd2e67 100644 --- a/net/tipc/udp_media.c +++ b/net/tipc/udp_media.c @@ -660,6 +660,7 @@ static int tipc_udp_enable(struct net *net, struct tipc_bearer *b, struct udp_tunnel_sock_cfg tuncfg = {NULL}; struct nlattr *opts[TIPC_NLA_UDP_MAX + 1]; u8 node_id[NODE_ID_LEN] = {0,}; + struct net_device *dev; int rmcast = 0; ub = kzalloc(sizeof(*ub), GFP_ATOMIC); @@ -714,8 +715,6 @@ static int tipc_udp_enable(struct net *net, struct tipc_bearer *b, rcu_assign_pointer(ub->bearer, b); tipc_udp_media_addr_set(&b->addr, &local); if (local.proto == htons(ETH_P_IP)) { - struct net_device *dev; - dev = __ip_dev_find(net, local.ipv4.s_addr, false); if (!dev) { err = -ENODEV; @@ -738,9 +737,8 @@ static int tipc_udp_enable(struct net *net, struct tipc_bearer *b, b->mtu = b->media->mtu; #if IS_ENABLED(CONFIG_IPV6) } else if (local.proto == htons(ETH_P_IPV6)) { - struct net_device *dev; - - dev = ipv6_dev_find(net, &local.ipv6); + dev = ub->ifindex ? __dev_get_by_index(net, ub->ifindex) : NULL; + dev = ipv6_dev_find(net, &local.ipv6, dev); if (!dev) { err = -ENODEV; goto err; From 840110a4eae190dcbb9907d68216d5d1d9f25839 Mon Sep 17 00:00:00 2001 From: Maxim Mikityanskiy Date: Mon, 17 Aug 2020 16:34:05 +0300 Subject: [PATCH 14/46] ethtool: Fix preserving of wanted feature bits in netlink interface Currently, ethtool-netlink calculates new wanted bits as: (req_wanted & req_mask) | (old_active & ~req_mask) It completely discards the old wanted bits, so they are forgotten with the next ethtool command. Sample steps to reproduce: 1. ethtool -k eth0 tx-tcp-segmentation: on # TSO is on from the beginning 2. ethtool -K eth0 tx off tx-tcp-segmentation: off [not requested] 3. ethtool -k eth0 tx-tcp-segmentation: off [requested on] 4. ethtool -K eth0 rx off # Some change unrelated to TSO 5. ethtool -k eth0 tx-tcp-segmentation: off # "Wanted on" is forgotten This commit fixes it by changing the formula to: (req_wanted & req_mask) | (old_wanted & ~req_mask), where old_active was replaced by old_wanted to account for the wanted bits. The shortcut condition for the case where nothing was changed now compares wanted bitmasks, instead of wanted to active. Fixes: 0980bfcd6954 ("ethtool: set netdev features with FEATURES_SET request") Signed-off-by: Maxim Mikityanskiy Reviewed-by: Michal Kubecek Signed-off-by: David S. Miller --- net/ethtool/features.c | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-) diff --git a/net/ethtool/features.c b/net/ethtool/features.c index 4e632dc987d8..ec196f0fddc9 100644 --- a/net/ethtool/features.c +++ b/net/ethtool/features.c @@ -224,7 +224,9 @@ int ethnl_set_features(struct sk_buff *skb, struct genl_info *info) DECLARE_BITMAP(wanted_diff_mask, NETDEV_FEATURE_COUNT); DECLARE_BITMAP(active_diff_mask, NETDEV_FEATURE_COUNT); DECLARE_BITMAP(old_active, NETDEV_FEATURE_COUNT); + DECLARE_BITMAP(old_wanted, NETDEV_FEATURE_COUNT); DECLARE_BITMAP(new_active, NETDEV_FEATURE_COUNT); + DECLARE_BITMAP(new_wanted, NETDEV_FEATURE_COUNT); DECLARE_BITMAP(req_wanted, NETDEV_FEATURE_COUNT); DECLARE_BITMAP(req_mask, NETDEV_FEATURE_COUNT); struct nlattr *tb[ETHTOOL_A_FEATURES_MAX + 1]; @@ -250,6 +252,7 @@ int ethnl_set_features(struct sk_buff *skb, struct genl_info *info) rtnl_lock(); ethnl_features_to_bitmap(old_active, dev->features); + ethnl_features_to_bitmap(old_wanted, dev->wanted_features); ret = ethnl_parse_bitset(req_wanted, req_mask, NETDEV_FEATURE_COUNT, tb[ETHTOOL_A_FEATURES_WANTED], netdev_features_strings, info->extack); @@ -261,11 +264,11 @@ int ethnl_set_features(struct sk_buff *skb, struct genl_info *info) goto out_rtnl; } - /* set req_wanted bits not in req_mask from old_active */ + /* set req_wanted bits not in req_mask from old_wanted */ bitmap_and(req_wanted, req_wanted, req_mask, NETDEV_FEATURE_COUNT); - bitmap_andnot(new_active, old_active, req_mask, NETDEV_FEATURE_COUNT); - bitmap_or(req_wanted, new_active, req_wanted, NETDEV_FEATURE_COUNT); - if (bitmap_equal(req_wanted, old_active, NETDEV_FEATURE_COUNT)) { + bitmap_andnot(new_wanted, old_wanted, req_mask, NETDEV_FEATURE_COUNT); + bitmap_or(req_wanted, new_wanted, req_wanted, NETDEV_FEATURE_COUNT); + if (bitmap_equal(req_wanted, old_wanted, NETDEV_FEATURE_COUNT)) { ret = 0; goto out_rtnl; } From 2847bfed888fbb8bf4c8e8067fd6127538c2c700 Mon Sep 17 00:00:00 2001 From: Maxim Mikityanskiy Date: Mon, 17 Aug 2020 16:34:06 +0300 Subject: [PATCH 15/46] ethtool: Account for hw_features in netlink interface ethtool-netlink ignores dev->hw_features and may confuse the drivers by asking them to enable features not in the hw_features bitmask. For example: 1. ethtool -k eth0 tls-hw-tx-offload: off [fixed] 2. ethtool -K eth0 tls-hw-tx-offload on tls-hw-tx-offload: on 3. ethtool -k eth0 tls-hw-tx-offload: on [fixed] Fitler out dev->hw_features from req_wanted to fix it and to resemble the legacy ethtool behavior. Fixes: 0980bfcd6954 ("ethtool: set netdev features with FEATURES_SET request") Signed-off-by: Maxim Mikityanskiy Reviewed-by: Michal Kubecek Signed-off-by: David S. Miller --- net/ethtool/features.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/net/ethtool/features.c b/net/ethtool/features.c index ec196f0fddc9..6b288bfd7678 100644 --- a/net/ethtool/features.c +++ b/net/ethtool/features.c @@ -273,7 +273,8 @@ int ethnl_set_features(struct sk_buff *skb, struct genl_info *info) goto out_rtnl; } - dev->wanted_features = ethnl_bitmap_to_features(req_wanted); + dev->wanted_features &= ~dev->hw_features; + dev->wanted_features |= ethnl_bitmap_to_features(req_wanted) & dev->hw_features; __netdev_update_features(dev); ethnl_features_to_bitmap(new_active, dev->features); mod = !bitmap_equal(old_active, new_active, NETDEV_FEATURE_COUNT); From f01204ec8be7ea5e8f0230a7d4200e338d563bde Mon Sep 17 00:00:00 2001 From: Maxim Mikityanskiy Date: Mon, 17 Aug 2020 16:34:07 +0300 Subject: [PATCH 16/46] ethtool: Don't omit the netlink reply if no features were changed The legacy ethtool userspace tool shows an error when no features could be changed. It's useful to have a netlink reply to be able to show this error when __netdev_update_features wasn't called, for example: 1. ethtool -k eth0 large-receive-offload: off 2. ethtool -K eth0 rx-fcs on 3. ethtool -K eth0 lro on Could not change any device features rx-lro: off [requested on] 4. ethtool -K eth0 lro on # The output should be the same, but without this patch the kernel # doesn't send the reply, and ethtool is unable to detect the error. This commit makes ethtool-netlink always return a reply when requested, and it still avoids unnecessary calls to __netdev_update_features if the wanted features haven't changed. Fixes: 0980bfcd6954 ("ethtool: set netdev features with FEATURES_SET request") Signed-off-by: Maxim Mikityanskiy Reviewed-by: Michal Kubecek Signed-off-by: David S. Miller --- net/ethtool/features.c | 11 ++++------- 1 file changed, 4 insertions(+), 7 deletions(-) diff --git a/net/ethtool/features.c b/net/ethtool/features.c index 6b288bfd7678..495635f152ba 100644 --- a/net/ethtool/features.c +++ b/net/ethtool/features.c @@ -268,14 +268,11 @@ int ethnl_set_features(struct sk_buff *skb, struct genl_info *info) bitmap_and(req_wanted, req_wanted, req_mask, NETDEV_FEATURE_COUNT); bitmap_andnot(new_wanted, old_wanted, req_mask, NETDEV_FEATURE_COUNT); bitmap_or(req_wanted, new_wanted, req_wanted, NETDEV_FEATURE_COUNT); - if (bitmap_equal(req_wanted, old_wanted, NETDEV_FEATURE_COUNT)) { - ret = 0; - goto out_rtnl; + if (!bitmap_equal(req_wanted, old_wanted, NETDEV_FEATURE_COUNT)) { + dev->wanted_features &= ~dev->hw_features; + dev->wanted_features |= ethnl_bitmap_to_features(req_wanted) & dev->hw_features; + __netdev_update_features(dev); } - - dev->wanted_features &= ~dev->hw_features; - dev->wanted_features |= ethnl_bitmap_to_features(req_wanted) & dev->hw_features; - __netdev_update_features(dev); ethnl_features_to_bitmap(new_active, dev->features); mod = !bitmap_equal(old_active, new_active, NETDEV_FEATURE_COUNT); From 17340552ce449ab55e3ccbfc2af1bcc600b4fbb5 Mon Sep 17 00:00:00 2001 From: Colin Ian King Date: Mon, 17 Aug 2020 23:40:42 +0100 Subject: [PATCH 17/46] net: mscc: ocelot: remove duplicate "the the" phrase in Kconfig text The Kconfig help text contains the phrase "the the" in the help text. Fix this. Signed-off-by: Colin Ian King Signed-off-by: David S. Miller --- drivers/net/dsa/ocelot/Kconfig | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/dsa/ocelot/Kconfig b/drivers/net/dsa/ocelot/Kconfig index f121619d81fe..2d23ccef7d0e 100644 --- a/drivers/net/dsa/ocelot/Kconfig +++ b/drivers/net/dsa/ocelot/Kconfig @@ -9,7 +9,7 @@ config NET_DSA_MSCC_FELIX select NET_DSA_TAG_OCELOT select FSL_ENETC_MDIO help - This driver supports network switches from the the Vitesse / + This driver supports network switches from the Vitesse / Microsemi / Microchip Ocelot family of switching cores that are connected to their host CPU via Ethernet. The following switches are supported: From ad6641189c5935192a15eeb4b369dd04ebedfabb Mon Sep 17 00:00:00 2001 From: Colin Ian King Date: Mon, 17 Aug 2020 23:44:25 +0100 Subject: [PATCH 18/46] net: ipv4: remove duplicate "the the" phrase in Kconfig text The Kconfig help text contains the phrase "the the" in the help text. Fix this and reformat the block of help text. Signed-off-by: Colin Ian King Signed-off-by: David S. Miller --- net/ipv4/Kconfig | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/net/ipv4/Kconfig b/net/ipv4/Kconfig index 60db5a6487cc..87983e70f03f 100644 --- a/net/ipv4/Kconfig +++ b/net/ipv4/Kconfig @@ -661,13 +661,13 @@ config TCP_CONG_BBR BBR (Bottleneck Bandwidth and RTT) TCP congestion control aims to maximize network utilization and minimize queues. It builds an explicit - model of the the bottleneck delivery rate and path round-trip - propagation delay. It tolerates packet loss and delay unrelated to - congestion. It can operate over LAN, WAN, cellular, wifi, or cable - modem links. It can coexist with flows that use loss-based congestion - control, and can operate with shallow buffers, deep buffers, - bufferbloat, policers, or AQM schemes that do not provide a delay - signal. It requires the fq ("Fair Queue") pacing packet scheduler. + model of the bottleneck delivery rate and path round-trip propagation + delay. It tolerates packet loss and delay unrelated to congestion. It + can operate over LAN, WAN, cellular, wifi, or cable modem links. It can + coexist with flows that use loss-based congestion control, and can + operate with shallow buffers, deep buffers, bufferbloat, policers, or + AQM schemes that do not provide a delay signal. It requires the fq + ("Fair Queue") pacing packet scheduler. choice prompt "Default TCP congestion control" From e679654a704e5bd676ea6446fa7b764cbabf168a Mon Sep 17 00:00:00 2001 From: Yonghong Song Date: Tue, 18 Aug 2020 15:23:09 -0700 Subject: [PATCH 19/46] bpf: Fix a rcu_sched stall issue with bpf task/task_file iterator In our production system, we observed rcu stalls when 'bpftool prog` is running. rcu: INFO: rcu_sched self-detected stall on CPU rcu: \x097-....: (20999 ticks this GP) idle=302/1/0x4000000000000000 softirq=1508852/1508852 fqs=4913 \x09(t=21031 jiffies g=2534773 q=179750) NMI backtrace for cpu 7 CPU: 7 PID: 184195 Comm: bpftool Kdump: loaded Tainted: G W 5.8.0-00004-g68bfc7f8c1b4 #6 Hardware name: Quanta Twin Lakes MP/Twin Lakes Passive MP, BIOS F09_3A17 05/03/2019 Call Trace: dump_stack+0x57/0x70 nmi_cpu_backtrace.cold+0x14/0x53 ? lapic_can_unplug_cpu.cold+0x39/0x39 nmi_trigger_cpumask_backtrace+0xb7/0xc7 rcu_dump_cpu_stacks+0xa2/0xd0 rcu_sched_clock_irq.cold+0x1ff/0x3d9 ? tick_nohz_handler+0x100/0x100 update_process_times+0x5b/0x90 tick_sched_timer+0x5e/0xf0 __hrtimer_run_queues+0x12a/0x2a0 hrtimer_interrupt+0x10e/0x280 __sysvec_apic_timer_interrupt+0x51/0xe0 asm_call_on_stack+0xf/0x20 sysvec_apic_timer_interrupt+0x6f/0x80 asm_sysvec_apic_timer_interrupt+0x12/0x20 RIP: 0010:task_file_seq_get_next+0x71/0x220 Code: 00 00 8b 53 1c 49 8b 7d 00 89 d6 48 8b 47 20 44 8b 18 41 39 d3 76 75 48 8b 4f 20 8b 01 39 d0 76 61 41 89 d1 49 39 c1 48 19 c0 <48> 8b 49 08 21 d0 48 8d 04 c1 4c 8b 08 4d 85 c9 74 46 49 8b 41 38 RSP: 0018:ffffc90006223e10 EFLAGS: 00000297 RAX: ffffffffffffffff RBX: ffff888f0d172388 RCX: ffff888c8c07c1c0 RDX: 00000000000f017b RSI: 00000000000f017b RDI: ffff888c254702c0 RBP: ffffc90006223e68 R08: ffff888be2a1c140 R09: 00000000000f017b R10: 0000000000000002 R11: 0000000000100000 R12: ffff888f23c24118 R13: ffffc90006223e60 R14: ffffffff828509a0 R15: 00000000ffffffff task_file_seq_next+0x52/0xa0 bpf_seq_read+0xb9/0x320 vfs_read+0x9d/0x180 ksys_read+0x5f/0xe0 do_syscall_64+0x38/0x60 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f8815f4f76e Code: c0 e9 f6 fe ff ff 55 48 8d 3d 76 70 0a 00 48 89 e5 e8 36 06 02 00 66 0f 1f 44 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 0f 05 <48> 3d 00 f0 ff ff 77 52 c3 66 0f 1f 84 00 00 00 00 00 55 48 89 e5 RSP: 002b:00007fff8f9df578 EFLAGS: 00000246 ORIG_RAX: 0000000000000000 RAX: ffffffffffffffda RBX: 000000000170b9c0 RCX: 00007f8815f4f76e RDX: 0000000000001000 RSI: 00007fff8f9df5b0 RDI: 0000000000000007 RBP: 00007fff8f9e05f0 R08: 0000000000000049 R09: 0000000000000010 R10: 00007f881601fa40 R11: 0000000000000246 R12: 00007fff8f9e05a8 R13: 00007fff8f9e05a8 R14: 0000000001917f90 R15: 000000000000e22e Note that `bpftool prog` actually calls a task_file bpf iterator program to establish an association between prog/map/link/btf anon files and processes. In the case where the above rcu stall occured, we had a process having 1587 tasks and each task having roughly 81305 files. This implied 129 million bpf prog invocations. Unfortunwtely none of these files are prog/map/link/btf files so bpf iterator/prog needs to traverse all these files and not able to return to user space since there are no seq_file buffer overflow. This patch fixed the issue in bpf_seq_read() to limit the number of visited objects. If the maximum number of visited objects is reached, no more objects will be visited in the current syscall. If there is nothing written in the seq_file buffer, -EAGAIN will return to the user so user can try again. The maximum number of visited objects is set at 1 million. In our Intel Xeon D-2191 2.3GHZ 18-core server, bpf_seq_read() visiting 1 million files takes around 0.18 seconds. We did not use cond_resched() since for some iterators, e.g., netlink iterator, where rcu read_lock critical section spans between consecutive seq_ops->next(), which makes impossible to do cond_resched() in the key while loop of function bpf_seq_read(). Signed-off-by: Yonghong Song Signed-off-by: Alexei Starovoitov Cc: Paul E. McKenney Link: https://lore.kernel.org/bpf/20200818222309.2181348-1-yhs@fb.com --- kernel/bpf/bpf_iter.c | 15 ++++++++++++++- 1 file changed, 14 insertions(+), 1 deletion(-) diff --git a/kernel/bpf/bpf_iter.c b/kernel/bpf/bpf_iter.c index b6715964b685..8faa2ce89396 100644 --- a/kernel/bpf/bpf_iter.c +++ b/kernel/bpf/bpf_iter.c @@ -67,6 +67,9 @@ static void bpf_iter_done_stop(struct seq_file *seq) iter_priv->done_stop = true; } +/* maximum visited objects before bailing out */ +#define MAX_ITER_OBJECTS 1000000 + /* bpf_seq_read, a customized and simpler version for bpf iterator. * no_llseek is assumed for this file. * The following are differences from seq_read(): @@ -79,7 +82,7 @@ static ssize_t bpf_seq_read(struct file *file, char __user *buf, size_t size, { struct seq_file *seq = file->private_data; size_t n, offs, copied = 0; - int err = 0; + int err = 0, num_objs = 0; void *p; mutex_lock(&seq->lock); @@ -135,6 +138,7 @@ static ssize_t bpf_seq_read(struct file *file, char __user *buf, size_t size, while (1) { loff_t pos = seq->index; + num_objs++; offs = seq->count; p = seq->op->next(seq, p, &seq->index); if (pos == seq->index) { @@ -153,6 +157,15 @@ static ssize_t bpf_seq_read(struct file *file, char __user *buf, size_t size, if (seq->count >= size) break; + if (num_objs >= MAX_ITER_OBJECTS) { + if (offs == 0) { + err = -EAGAIN; + seq->op->stop(seq, p); + goto done; + } + break; + } + err = seq->op->show(seq, p); if (err > 0) { bpf_iter_dec_seq_num(seq); From e60572b8d4c39572be6857d1ec91fdf979f8775f Mon Sep 17 00:00:00 2001 From: Yonghong Song Date: Tue, 18 Aug 2020 15:23:10 -0700 Subject: [PATCH 20/46] bpf: Avoid visit same object multiple times Currently when traversing all tasks, the next tid is always increased by one. This may result in visiting the same task multiple times in a pid namespace. This patch fixed the issue by seting the next tid as pid_nr_ns(pid, ns) + 1, similar to funciton next_tgid(). Signed-off-by: Yonghong Song Signed-off-by: Alexei Starovoitov Cc: Rik van Riel Link: https://lore.kernel.org/bpf/20200818222310.2181500-1-yhs@fb.com --- kernel/bpf/task_iter.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c index f21b5e1e4540..99af4cea1102 100644 --- a/kernel/bpf/task_iter.c +++ b/kernel/bpf/task_iter.c @@ -29,8 +29,9 @@ static struct task_struct *task_seq_get_next(struct pid_namespace *ns, rcu_read_lock(); retry: - pid = idr_get_next(&ns->idr, tid); + pid = find_ge_pid(*tid, ns); if (pid) { + *tid = pid_nr_ns(pid, ns); task = get_pid_task(pid, PIDTYPE_PID); if (!task) { ++*tid; From 00fa1d83a8b50351c830521d00135e823c46e7d0 Mon Sep 17 00:00:00 2001 From: Yonghong Song Date: Tue, 18 Aug 2020 15:23:12 -0700 Subject: [PATCH 21/46] bpftool: Handle EAGAIN error code properly in pids collection When the error code is EAGAIN, the kernel signals the user space should retry the read() operation for bpf iterators. Let us do it. Signed-off-by: Yonghong Song Signed-off-by: Alexei Starovoitov Acked-by: Andrii Nakryiko Link: https://lore.kernel.org/bpf/20200818222312.2181675-1-yhs@fb.com --- tools/bpf/bpftool/pids.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/tools/bpf/bpftool/pids.c b/tools/bpf/bpftool/pids.c index e3b116325403..df7d8ec76036 100644 --- a/tools/bpf/bpftool/pids.c +++ b/tools/bpf/bpftool/pids.c @@ -134,6 +134,8 @@ int build_obj_refs_table(struct obj_refs_table *table, enum bpf_obj_type type) while (true) { ret = read(fd, buf, sizeof(buf)); if (ret < 0) { + if (errno == EAGAIN) + continue; err = -errno; p_err("failed to read PID iterator output: %d", err); goto out; From 63d4a4c145cca2e84dc6e62d2ef5cb990c9723c2 Mon Sep 17 00:00:00 2001 From: Shay Agroskin Date: Wed, 19 Aug 2020 20:28:36 +0300 Subject: [PATCH 22/46] net: ena: Prevent reset after device destruction The reset work is scheduled by the timer routine whenever it detects that a device reset is required (e.g. when a keep_alive signal is missing). When releasing device resources in ena_destroy_device() the driver cancels the scheduling of the timer routine without destroying the reset work explicitly. This creates the following bug: The driver is suspended and the ena_suspend() function is called -> This function calls ena_destroy_device() to free the net device resources -> The driver waits for the timer routine to finish its execution and then cancels it, thus preventing from it to be called again. If, in its final execution, the timer routine schedules a reset, the reset routine might be called afterwards,and a redundant call to ena_restore_device() would be made. By changing the reset routine we allow it to read the device's state accurately. This is achieved by checking whether ENA_FLAG_TRIGGER_RESET flag is set before resetting the device and making both the destruction function and the flag check are under rtnl lock. The ENA_FLAG_TRIGGER_RESET is cleared at the end of the destruction routine. Also surround the flag check with 'likely' because we expect that the reset routine would be called only when ENA_FLAG_TRIGGER_RESET flag is set. The destruction of the timer and reset services in __ena_shutoff() have to stay, even though the timer routine is destroyed in ena_destroy_device(). This is to avoid a case in which the reset routine is scheduled after free_netdev() in __ena_shutoff(), which would create an access to freed memory in adapter->flags. Fixes: 8c5c7abdeb2d ("net: ena: add power management ops to the ENA driver") Signed-off-by: Shay Agroskin Signed-off-by: David S. Miller --- drivers/net/ethernet/amazon/ena/ena_netdev.c | 19 ++++++++++--------- 1 file changed, 10 insertions(+), 9 deletions(-) diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c b/drivers/net/ethernet/amazon/ena/ena_netdev.c index 2a6c9725e092..44aeace196f0 100644 --- a/drivers/net/ethernet/amazon/ena/ena_netdev.c +++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c @@ -3601,16 +3601,14 @@ static void ena_fw_reset_device(struct work_struct *work) { struct ena_adapter *adapter = container_of(work, struct ena_adapter, reset_task); - struct pci_dev *pdev = adapter->pdev; - if (unlikely(!test_bit(ENA_FLAG_TRIGGER_RESET, &adapter->flags))) { - dev_err(&pdev->dev, - "device reset schedule while reset bit is off\n"); - return; - } rtnl_lock(); - ena_destroy_device(adapter, false); - ena_restore_device(adapter); + + if (likely(test_bit(ENA_FLAG_TRIGGER_RESET, &adapter->flags))) { + ena_destroy_device(adapter, false); + ena_restore_device(adapter); + } + rtnl_unlock(); } @@ -4389,8 +4387,11 @@ static void __ena_shutoff(struct pci_dev *pdev, bool shutdown) netdev->rx_cpu_rmap = NULL; } #endif /* CONFIG_RFS_ACCEL */ - del_timer_sync(&adapter->timer_service); + /* Make sure timer and reset routine won't be called after + * freeing device resources. + */ + del_timer_sync(&adapter->timer_service); cancel_work_sync(&adapter->reset_task); rtnl_lock(); /* lock released inside the below if-else block */ From 8b147f6f3e7de4e51113e3e9ec44aa2debc02c58 Mon Sep 17 00:00:00 2001 From: Shay Agroskin Date: Wed, 19 Aug 2020 20:28:37 +0300 Subject: [PATCH 23/46] net: ena: Change WARN_ON expression in ena_del_napi_in_range() The ena_del_napi_in_range() function unregisters the napi handler for rings in a given range. This function had the following WARN_ON macro: WARN_ON(ENA_IS_XDP_INDEX(adapter, i) && adapter->ena_napi[i].xdp_ring); This macro prints the call stack if the expression inside of it is true [1], but the expression inside of it is the wanted situation. The expression checks whether the ring has an XDP queue and its index corresponds to a XDP one. This patch changes the expression to !ENA_IS_XDP_INDEX(adapter, i) && adapter->ena_napi[i].xdp_ring which indicates an unwanted situation. Also, change the structure of the function. The napi handler is unregistered for all rings, and so there's no need to check whether the index is an XDP index or not. By removing this check the code becomes much more readable. Fixes: 548c4940b9f1 ("net: ena: Implement XDP_TX action") Signed-off-by: Shay Agroskin Signed-off-by: David S. Miller --- drivers/net/ethernet/amazon/ena/ena_netdev.c | 11 ++++------- 1 file changed, 4 insertions(+), 7 deletions(-) diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c b/drivers/net/ethernet/amazon/ena/ena_netdev.c index 44aeace196f0..233db15c970d 100644 --- a/drivers/net/ethernet/amazon/ena/ena_netdev.c +++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c @@ -2180,13 +2180,10 @@ static void ena_del_napi_in_range(struct ena_adapter *adapter, int i; for (i = first_index; i < first_index + count; i++) { - /* Check if napi was initialized before */ - if (!ENA_IS_XDP_INDEX(adapter, i) || - adapter->ena_napi[i].xdp_ring) - netif_napi_del(&adapter->ena_napi[i].napi); - else - WARN_ON(ENA_IS_XDP_INDEX(adapter, i) && - adapter->ena_napi[i].xdp_ring); + netif_napi_del(&adapter->ena_napi[i].napi); + + WARN_ON(!ENA_IS_XDP_INDEX(adapter, i) && + adapter->ena_napi[i].xdp_ring); } } From ccd143e5150f24b9ba15145c7221b61dd9e41021 Mon Sep 17 00:00:00 2001 From: Shay Agroskin Date: Wed, 19 Aug 2020 20:28:38 +0300 Subject: [PATCH 24/46] net: ena: Make missed_tx stat incremental Most statistics in ena driver are incremented, meaning that a stat's value is a sum of all increases done to it since driver/queue initialization. This patch makes all statistics this way, effectively making missed_tx statistic incremental. Also added a comment regarding rx_drops and tx_drops to make it clearer how these counters are calculated. Fixes: 11095fdb712b ("net: ena: add statistics for missed tx packets") Signed-off-by: Shay Agroskin Signed-off-by: David S. Miller --- drivers/net/ethernet/amazon/ena/ena_netdev.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c b/drivers/net/ethernet/amazon/ena/ena_netdev.c index 233db15c970d..a3a8edf9a734 100644 --- a/drivers/net/ethernet/amazon/ena/ena_netdev.c +++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c @@ -3687,7 +3687,7 @@ static int check_missing_comp_in_tx_queue(struct ena_adapter *adapter, } u64_stats_update_begin(&tx_ring->syncp); - tx_ring->tx_stats.missed_tx = missed_tx; + tx_ring->tx_stats.missed_tx += missed_tx; u64_stats_update_end(&tx_ring->syncp); return rc; @@ -4556,6 +4556,9 @@ static void ena_keep_alive_wd(void *adapter_data, tx_drops = ((u64)desc->tx_drops_high << 32) | desc->tx_drops_low; u64_stats_update_begin(&adapter->syncp); + /* These stats are accumulated by the device, so the counters indicate + * all drops since last reset. + */ adapter->dev_stats.rx_drops = rx_drops; adapter->dev_stats.tx_drops = tx_drops; u64_stats_update_end(&adapter->syncp); From d1fb55592909ea249af70170c7a52e637009564d Mon Sep 17 00:00:00 2001 From: Johannes Berg Date: Wed, 19 Aug 2020 21:52:38 +0200 Subject: [PATCH 25/46] netlink: fix state reallocation in policy export Evidently, when I did this previously, we didn't have more than 10 policies and didn't run into the reallocation path, because it's missing a memset() for the unused policies. Fix that. Fixes: d07dcf9aadd6 ("netlink: add infrastructure to expose policies to userspace") Signed-off-by: Johannes Berg Signed-off-by: David S. Miller --- net/netlink/policy.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/net/netlink/policy.c b/net/netlink/policy.c index f6491853c797..2b3e26f7496f 100644 --- a/net/netlink/policy.c +++ b/net/netlink/policy.c @@ -51,6 +51,9 @@ static int add_policy(struct nl_policy_dump **statep, if (!state) return -ENOMEM; + memset(&state->policies[state->n_alloc], 0, + flex_array_size(state, policies, n_alloc - state->n_alloc)); + state->policies[state->n_alloc].policy = policy; state->policies[state->n_alloc].maxtype = maxtype; state->n_alloc = n_alloc; From 957ff4278e0db34f56c2bc121fdd6393e4523ef2 Mon Sep 17 00:00:00 2001 From: Min Li Date: Tue, 18 Aug 2020 10:41:22 -0400 Subject: [PATCH 26/46] ptp: ptp_clockmatrix: use i2c_master_send for i2c write The old code for i2c write would break on some controllers, which fails at handling Repeated Start Condition. So we will just use i2c_master_send to handle write in one transanction. Changes since v1: - Remove indentation change Signed-off-by: Min Li Signed-off-by: David S. Miller --- drivers/ptp/ptp_clockmatrix.c | 56 +++++++++++++++++++++++++++-------- drivers/ptp/ptp_clockmatrix.h | 2 ++ 2 files changed, 45 insertions(+), 13 deletions(-) diff --git a/drivers/ptp/ptp_clockmatrix.c b/drivers/ptp/ptp_clockmatrix.c index 73aaae5574ed..e020faff7da5 100644 --- a/drivers/ptp/ptp_clockmatrix.c +++ b/drivers/ptp/ptp_clockmatrix.c @@ -142,16 +142,15 @@ static int idtcm_strverscmp(const char *ver1, const char *ver2) return result; } -static int idtcm_xfer(struct idtcm *idtcm, - u8 regaddr, - u8 *buf, - u16 count, - bool write) +static int idtcm_xfer_read(struct idtcm *idtcm, + u8 regaddr, + u8 *buf, + u16 count) { struct i2c_client *client = idtcm->client; struct i2c_msg msg[2]; int cnt; - char *fmt = "i2c_transfer failed at %d in %s for %s, at addr: %04X!\n"; + char *fmt = "i2c_transfer failed at %d in %s, at addr: %04X!\n"; msg[0].addr = client->addr; msg[0].flags = 0; @@ -159,7 +158,7 @@ static int idtcm_xfer(struct idtcm *idtcm, msg[0].buf = ®addr; msg[1].addr = client->addr; - msg[1].flags = write ? 0 : I2C_M_RD; + msg[1].flags = I2C_M_RD; msg[1].len = count; msg[1].buf = buf; @@ -170,7 +169,6 @@ static int idtcm_xfer(struct idtcm *idtcm, fmt, __LINE__, __func__, - write ? "write" : "read", regaddr); return cnt; } else if (cnt != 2) { @@ -182,6 +180,37 @@ static int idtcm_xfer(struct idtcm *idtcm, return 0; } +static int idtcm_xfer_write(struct idtcm *idtcm, + u8 regaddr, + u8 *buf, + u16 count) +{ + struct i2c_client *client = idtcm->client; + /* we add 1 byte for device register */ + u8 msg[IDTCM_MAX_WRITE_COUNT + 1]; + int cnt; + char *fmt = "i2c_master_send failed at %d in %s, at addr: %04X!\n"; + + if (count > IDTCM_MAX_WRITE_COUNT) + return -EINVAL; + + msg[0] = regaddr; + memcpy(&msg[1], buf, count); + + cnt = i2c_master_send(client, msg, count + 1); + + if (cnt < 0) { + dev_err(&client->dev, + fmt, + __LINE__, + __func__, + regaddr); + return cnt; + } + + return 0; +} + static int idtcm_page_offset(struct idtcm *idtcm, u8 val) { u8 buf[4]; @@ -195,7 +224,7 @@ static int idtcm_page_offset(struct idtcm *idtcm, u8 val) buf[2] = 0x10; buf[3] = 0x20; - err = idtcm_xfer(idtcm, PAGE_ADDR, buf, sizeof(buf), 1); + err = idtcm_xfer_write(idtcm, PAGE_ADDR, buf, sizeof(buf)); if (err) { idtcm->page_offset = 0xff; @@ -223,11 +252,12 @@ static int _idtcm_rdwr(struct idtcm *idtcm, err = idtcm_page_offset(idtcm, hi); if (err) - goto out; + return err; - err = idtcm_xfer(idtcm, lo, buf, count, write); -out: - return err; + if (write) + return idtcm_xfer_write(idtcm, lo, buf, count); + + return idtcm_xfer_read(idtcm, lo, buf, count); } static int idtcm_read(struct idtcm *idtcm, diff --git a/drivers/ptp/ptp_clockmatrix.h b/drivers/ptp/ptp_clockmatrix.h index ffae56c5d97f..82840d72364a 100644 --- a/drivers/ptp/ptp_clockmatrix.h +++ b/drivers/ptp/ptp_clockmatrix.h @@ -55,6 +55,8 @@ #define PEROUT_ENABLE_OUTPUT_MASK (0xdeadbeef) +#define IDTCM_MAX_WRITE_COUNT (512) + /* Values of DPLL_N.DPLL_MODE.PLL_MODE */ enum pll_mode { PLL_MODE_MIN = 0, From 9553b62c1dd27df67ab2f52ec8a3bc3501887619 Mon Sep 17 00:00:00 2001 From: Sebastian Andrzej Siewior Date: Tue, 18 Aug 2020 18:14:39 +0200 Subject: [PATCH 27/46] net: atlantic: Use readx_poll_timeout() for large timeout Commit 8dcf2ad39fdb2 ("net: atlantic: add hwmon getter for MAC temperature") implemented a read callback with an udelay(10000U). This fails to compile on ARM because the delay is >1ms. I doubt that it is needed to spin for 10ms even if possible on x86. >From looking at the code, the context appears to be preemptible so using usleep() should work and avoid busy spinning. Use readx_poll_timeout() in the poll loop. Fixes: 8dcf2ad39fdb2 ("net: atlantic: add hwmon getter for MAC temperature") Cc: Mark Starovoytov Cc: Igor Russkikh Signed-off-by: Sebastian Andrzej Siewior Acked-by: Guenter Roeck Signed-off-by: David S. Miller --- drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_b0.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_b0.c b/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_b0.c index 16a944707ba9..8941ac4df9e3 100644 --- a/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_b0.c +++ b/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_b0.c @@ -1631,8 +1631,8 @@ static int hw_atl_b0_get_mac_temp(struct aq_hw_s *self, u32 *temp) hw_atl_ts_reset_set(self, 0); } - err = readx_poll_timeout_atomic(hw_atl_b0_ts_ready_and_latch_high_get, - self, val, val == 1, 10000U, 500000U); + err = readx_poll_timeout(hw_atl_b0_ts_ready_and_latch_high_get, self, + val, val == 1, 10000U, 500000U); if (err) return err; From cf96d977381d4a23957bade2ddf1c420b74a26b6 Mon Sep 17 00:00:00 2001 From: Wang Hai Date: Wed, 19 Aug 2020 10:33:09 +0800 Subject: [PATCH 28/46] net: gemini: Fix missing free_netdev() in error path of gemini_ethernet_port_probe() Replace alloc_etherdev_mq with devm_alloc_etherdev_mqs. In this way, when probe fails, netdev can be freed automatically. Fixes: 4d5ae32f5e1e ("net: ethernet: Add a driver for Gemini gigabit ethernet") Reported-by: Hulk Robot Signed-off-by: Wang Hai Signed-off-by: David S. Miller --- drivers/net/ethernet/cortina/gemini.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/drivers/net/ethernet/cortina/gemini.c b/drivers/net/ethernet/cortina/gemini.c index 66e67b24a887..62e271aea4a5 100644 --- a/drivers/net/ethernet/cortina/gemini.c +++ b/drivers/net/ethernet/cortina/gemini.c @@ -2389,7 +2389,7 @@ static int gemini_ethernet_port_probe(struct platform_device *pdev) dev_info(dev, "probe %s ID %d\n", dev_name(dev), id); - netdev = alloc_etherdev_mq(sizeof(*port), TX_QUEUE_NUM); + netdev = devm_alloc_etherdev_mqs(dev, sizeof(*port), TX_QUEUE_NUM, TX_QUEUE_NUM); if (!netdev) { dev_err(dev, "Can't allocate ethernet device #%d\n", id); return -ENOMEM; @@ -2521,7 +2521,6 @@ static int gemini_ethernet_port_probe(struct platform_device *pdev) } port->netdev = NULL; - free_netdev(netdev); return ret; } @@ -2530,7 +2529,6 @@ static int gemini_ethernet_port_remove(struct platform_device *pdev) struct gemini_ethernet_port *port = platform_get_drvdata(pdev); gemini_port_remove(port); - free_netdev(port->netdev); return 0; } From 1e891e513e16c145cc9b45b1fdb8bf4a4f2f9557 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Toke=20H=C3=B8iland-J=C3=B8rgensen?= Date: Wed, 19 Aug 2020 13:05:34 +0200 Subject: [PATCH 29/46] libbpf: Fix map index used in error message MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The error message emitted by bpf_object__init_user_btf_maps() was using the wrong section ID. Signed-off-by: Toke Høiland-Jørgensen Signed-off-by: Daniel Borkmann Acked-by: Yonghong Song Link: https://lore.kernel.org/bpf/20200819110534.9058-1-toke@redhat.com --- tools/lib/bpf/libbpf.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c index 5d20b2da4427..0ad0b0491e1f 100644 --- a/tools/lib/bpf/libbpf.c +++ b/tools/lib/bpf/libbpf.c @@ -2264,7 +2264,7 @@ static int bpf_object__init_user_btf_maps(struct bpf_object *obj, bool strict, data = elf_getdata(scn, NULL); if (!scn || !data) { pr_warn("failed to get Elf_Data from map section %d (%s)\n", - obj->efile.maps_shndx, MAPS_ELF_SEC); + obj->efile.btf_maps_shndx, MAPS_ELF_SEC); return -EINVAL; } From fb73ed5ef7e1f3b561dae517b9efdf58c153636b Mon Sep 17 00:00:00 2001 From: Kaige Li Date: Thu, 20 Aug 2020 14:47:55 +0800 Subject: [PATCH 30/46] net: phy: mscc: Fix a couple of spelling mistakes "spcified" -> "specified" There are a couple of spelling mistakes in comment text. Fix these. Signed-off-by: Kaige Li Signed-off-by: David S. Miller --- drivers/net/phy/mscc/mscc_main.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/net/phy/mscc/mscc_main.c b/drivers/net/phy/mscc/mscc_main.c index a4fbf3a4fa97..6bc7406a1ce7 100644 --- a/drivers/net/phy/mscc/mscc_main.c +++ b/drivers/net/phy/mscc/mscc_main.c @@ -1738,13 +1738,13 @@ static int __phy_write_mcb_s6g(struct phy_device *phydev, u32 reg, u8 mcb, return 0; } -/* Trigger a read to the spcified MCB */ +/* Trigger a read to the specified MCB */ static int phy_update_mcb_s6g(struct phy_device *phydev, u32 reg, u8 mcb) { return __phy_write_mcb_s6g(phydev, reg, mcb, PHY_MCB_S6G_READ); } -/* Trigger a write to the spcified MCB */ +/* Trigger a write to the specified MCB */ static int phy_commit_mcb_s6g(struct phy_device *phydev, u32 reg, u8 mcb) { return __phy_write_mcb_s6g(phydev, reg, mcb, PHY_MCB_S6G_WRITE); From 3e659a82c45076e354d20db4b0776e716c6e7fe3 Mon Sep 17 00:00:00 2001 From: Edward Cree Date: Thu, 20 Aug 2020 11:47:19 +0100 Subject: [PATCH 31/46] sfc: fix build warnings on 32-bit Truncation of DMA_BIT_MASK to 32-bit dma_addr_t is semantically safe, but the compiler was warning because it was happening implicitly. Insert explicit casts to suppress the warnings. Reported-by: Randy Dunlap Signed-off-by: Edward Cree Acked-by: Randy Dunlap # build-tested Signed-off-by: David S. Miller --- drivers/net/ethernet/sfc/ef100.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/net/ethernet/sfc/ef100.c b/drivers/net/ethernet/sfc/ef100.c index 9729983f4840..c54b7f8243f3 100644 --- a/drivers/net/ethernet/sfc/ef100.c +++ b/drivers/net/ethernet/sfc/ef100.c @@ -142,7 +142,7 @@ static int ef100_pci_parse_continue_entry(struct efx_nic *efx, int entry_locatio /* Temporarily map new BAR. */ rc = efx_init_io(efx, bar, - DMA_BIT_MASK(ESF_GZ_TX_SEND_ADDR_WIDTH), + (dma_addr_t)DMA_BIT_MASK(ESF_GZ_TX_SEND_ADDR_WIDTH), pci_resource_len(efx->pci_dev, bar)); if (rc) { netif_err(efx, probe, efx->net_dev, @@ -160,7 +160,7 @@ static int ef100_pci_parse_continue_entry(struct efx_nic *efx, int entry_locatio /* Put old BAR back. */ rc = efx_init_io(efx, previous_bar, - DMA_BIT_MASK(ESF_GZ_TX_SEND_ADDR_WIDTH), + (dma_addr_t)DMA_BIT_MASK(ESF_GZ_TX_SEND_ADDR_WIDTH), pci_resource_len(efx->pci_dev, previous_bar)); if (rc) { netif_err(efx, probe, efx->net_dev, @@ -334,7 +334,7 @@ static int ef100_pci_parse_xilinx_cap(struct efx_nic *efx, int vndr_cap, /* Temporarily map BAR. */ rc = efx_init_io(efx, bar, - DMA_BIT_MASK(ESF_GZ_TX_SEND_ADDR_WIDTH), + (dma_addr_t)DMA_BIT_MASK(ESF_GZ_TX_SEND_ADDR_WIDTH), pci_resource_len(efx->pci_dev, bar)); if (rc) { netif_err(efx, probe, efx->net_dev, @@ -495,7 +495,7 @@ static int ef100_pci_probe(struct pci_dev *pci_dev, /* Set up basic I/O (BAR mappings etc) */ rc = efx_init_io(efx, fcw.bar, - DMA_BIT_MASK(ESF_GZ_TX_SEND_ADDR_WIDTH), + (dma_addr_t)DMA_BIT_MASK(ESF_GZ_TX_SEND_ADDR_WIDTH), pci_resource_len(efx->pci_dev, fcw.bar)); if (rc) goto fail; From ce51f63e63c52a4e1eee4dd040fb0ba0af3b43ab Mon Sep 17 00:00:00 2001 From: Peilin Ye Date: Thu, 20 Aug 2020 16:30:52 +0200 Subject: [PATCH 32/46] net/smc: Prevent kernel-infoleak in __smc_diag_dump() __smc_diag_dump() is potentially copying uninitialized kernel stack memory into socket buffers, since the compiler may leave a 4-byte hole near the beginning of `struct smcd_diag_dmbinfo`. Fix it by initializing `dinfo` with memset(). Fixes: 4b1b7d3b30a6 ("net/smc: add SMC-D diag support") Suggested-by: Dan Carpenter Signed-off-by: Peilin Ye Signed-off-by: Ursula Braun Signed-off-by: David S. Miller --- net/smc/smc_diag.c | 16 +++++++++------- 1 file changed, 9 insertions(+), 7 deletions(-) diff --git a/net/smc/smc_diag.c b/net/smc/smc_diag.c index e1f64f4ba236..da9ba6d1679b 100644 --- a/net/smc/smc_diag.c +++ b/net/smc/smc_diag.c @@ -170,13 +170,15 @@ static int __smc_diag_dump(struct sock *sk, struct sk_buff *skb, (req->diag_ext & (1 << (SMC_DIAG_DMBINFO - 1))) && !list_empty(&smc->conn.lgr->list)) { struct smc_connection *conn = &smc->conn; - struct smcd_diag_dmbinfo dinfo = { - .linkid = *((u32 *)conn->lgr->id), - .peer_gid = conn->lgr->peer_gid, - .my_gid = conn->lgr->smcd->local_gid, - .token = conn->rmb_desc->token, - .peer_token = conn->peer_token - }; + struct smcd_diag_dmbinfo dinfo; + + memset(&dinfo, 0, sizeof(dinfo)); + + dinfo.linkid = *((u32 *)conn->lgr->id); + dinfo.peer_gid = conn->lgr->peer_gid; + dinfo.my_gid = conn->lgr->smcd->local_gid; + dinfo.token = conn->rmb_desc->token; + dinfo.peer_token = conn->peer_token; if (nla_put(skb, SMC_DIAG_DMBINFO, sizeof(dinfo), &dinfo) < 0) goto errout; From 51f6463aacfbfd322bcaadc606da56acef644b05 Mon Sep 17 00:00:00 2001 From: Jiri Olsa Date: Wed, 19 Aug 2020 11:23:42 +0200 Subject: [PATCH 33/46] tools/resolve_btfids: Fix sections with wrong alignment The data of compressed section should be aligned to 4 (for 32bit) or 8 (for 64 bit) bytes. The binutils ld sets sh_addralign to 1, which makes libelf fail with misaligned section error during the update as reported by Jesper: FAILED elf_update(WRITE): invalid section alignment While waiting for ld fix, we can fix compressed sections sh_addralign value manually. Adding warning in -vv mode when the fix is triggered: $ ./tools/bpf/resolve_btfids/resolve_btfids -vv vmlinux ... section(36) .comment, size 44, link 0, flags 30, type=1 section(37) .debug_aranges, size 45684, link 0, flags 800, type=1 - fixing wrong alignment sh_addralign 16, expected 8 section(38) .debug_info, size 129104957, link 0, flags 800, type=1 - fixing wrong alignment sh_addralign 1, expected 8 section(39) .debug_abbrev, size 1152583, link 0, flags 800, type=1 - fixing wrong alignment sh_addralign 1, expected 8 section(40) .debug_line, size 7374522, link 0, flags 800, type=1 - fixing wrong alignment sh_addralign 1, expected 8 section(41) .debug_frame, size 702463, link 0, flags 800, type=1 section(42) .debug_str, size 1017571, link 0, flags 830, type=1 - fixing wrong alignment sh_addralign 1, expected 8 section(43) .debug_loc, size 3019453, link 0, flags 800, type=1 - fixing wrong alignment sh_addralign 1, expected 8 section(44) .debug_ranges, size 1744583, link 0, flags 800, type=1 - fixing wrong alignment sh_addralign 16, expected 8 section(45) .symtab, size 2955888, link 46, flags 0, type=2 section(46) .strtab, size 2613072, link 0, flags 0, type=3 ... update ok for vmlinux Another workaround is to disable compressed debug info data CONFIG_DEBUG_INFO_COMPRESSED kernel option. Fixes: fbbb68de80a4 ("bpf: Add resolve_btfids tool to resolve BTF IDs in ELF object") Reported-by: Jesper Dangaard Brouer Signed-off-by: Jiri Olsa Signed-off-by: Alexei Starovoitov Acked-by: Jesper Dangaard Brouer Acked-by: Yonghong Song Cc: Mark Wielaard Cc: Nick Clifton Link: https://lore.kernel.org/bpf/20200819092342.259004-1-jolsa@kernel.org --- tools/bpf/resolve_btfids/main.c | 36 +++++++++++++++++++++++++++++++++ 1 file changed, 36 insertions(+) diff --git a/tools/bpf/resolve_btfids/main.c b/tools/bpf/resolve_btfids/main.c index 4d9ecb975862..0def0bb1f783 100644 --- a/tools/bpf/resolve_btfids/main.c +++ b/tools/bpf/resolve_btfids/main.c @@ -233,6 +233,39 @@ static struct btf_id *add_symbol(struct rb_root *root, char *name, size_t size) return btf_id__add(root, id, false); } +/* + * The data of compressed section should be aligned to 4 + * (for 32bit) or 8 (for 64 bit) bytes. The binutils ld + * sets sh_addralign to 1, which makes libelf fail with + * misaligned section error during the update: + * FAILED elf_update(WRITE): invalid section alignment + * + * While waiting for ld fix, we fix the compressed sections + * sh_addralign value manualy. + */ +static int compressed_section_fix(Elf *elf, Elf_Scn *scn, GElf_Shdr *sh) +{ + int expected = gelf_getclass(elf) == ELFCLASS32 ? 4 : 8; + + if (!(sh->sh_flags & SHF_COMPRESSED)) + return 0; + + if (sh->sh_addralign == expected) + return 0; + + pr_debug2(" - fixing wrong alignment sh_addralign %u, expected %u\n", + sh->sh_addralign, expected); + + sh->sh_addralign = expected; + + if (gelf_update_shdr(scn, sh) == 0) { + printf("FAILED cannot update section header: %s\n", + elf_errmsg(-1)); + return -1; + } + return 0; +} + static int elf_collect(struct object *obj) { Elf_Scn *scn = NULL; @@ -309,6 +342,9 @@ static int elf_collect(struct object *obj) obj->efile.idlist_shndx = idx; obj->efile.idlist_addr = sh.sh_addr; } + + if (compressed_section_fix(elf, scn, &sh)) + return -1; } return 0; From 5597432dde62befd3ab92e6ef9e073564e277ea8 Mon Sep 17 00:00:00 2001 From: Veronika Kabatova Date: Wed, 19 Aug 2020 18:07:10 +0200 Subject: [PATCH 34/46] selftests/bpf: Remove test_align leftovers Calling generic selftests "make install" fails as rsync expects all files from TEST_GEN_PROGS to be present. The binary is not generated anymore (commit 3b09d27cc93d) so we can safely remove it from there and also from gitignore. Fixes: 3b09d27cc93d ("selftests/bpf: Move test_align under test_progs") Signed-off-by: Veronika Kabatova Signed-off-by: Alexei Starovoitov Acked-by: Jesper Dangaard Brouer Link: https://lore.kernel.org/bpf/20200819160710.1345956-1-vkabatov@redhat.com --- tools/testing/selftests/bpf/.gitignore | 1 - tools/testing/selftests/bpf/Makefile | 2 +- 2 files changed, 1 insertion(+), 2 deletions(-) diff --git a/tools/testing/selftests/bpf/.gitignore b/tools/testing/selftests/bpf/.gitignore index 1bb204cee853..9a0946ddb705 100644 --- a/tools/testing/selftests/bpf/.gitignore +++ b/tools/testing/selftests/bpf/.gitignore @@ -6,7 +6,6 @@ test_lpm_map test_tag FEATURE-DUMP.libbpf fixdep -test_align test_dev_cgroup /test_progs* test_tcpbpf_user diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile index a83b5827532f..fc946b7ac288 100644 --- a/tools/testing/selftests/bpf/Makefile +++ b/tools/testing/selftests/bpf/Makefile @@ -32,7 +32,7 @@ LDLIBS += -lcap -lelf -lz -lrt -lpthread # Order correspond to 'make run_tests' order TEST_GEN_PROGS = test_verifier test_tag test_maps test_lru_map test_lpm_map test_progs \ - test_align test_verifier_log test_dev_cgroup test_tcpbpf_user \ + test_verifier_log test_dev_cgroup test_tcpbpf_user \ test_sock test_btf test_sockmap get_cgroup_id_user test_socket_cookie \ test_cgroup_storage \ test_netcnt test_tcpnotify_user test_sock_fields test_sysctl \ From c8a36f1945b2b1b3f9823b66fc2181dc069cf803 Mon Sep 17 00:00:00 2001 From: Andrii Nakryiko Date: Wed, 19 Aug 2020 22:28:41 -0700 Subject: [PATCH 35/46] bpf: xdp: Fix XDP mode when no mode flags specified 7f0a838254bd ("bpf, xdp: Maintain info on attached XDP BPF programs in net_device") inadvertently changed which XDP mode is assumed when no mode flags are specified explicitly. Previously, driver mode was preferred, if driver supported it. If not, generic SKB mode was chosen. That commit changed default to SKB mode always. This patch fixes the issue and restores the original logic. Fixes: 7f0a838254bd ("bpf, xdp: Maintain info on attached XDP BPF programs in net_device") Reported-by: Lorenzo Bianconi Signed-off-by: Andrii Nakryiko Signed-off-by: Alexei Starovoitov Tested-by: Lorenzo Bianconi Link: https://lore.kernel.org/bpf/20200820052841.1559757-1-andriin@fb.com --- net/core/dev.c | 14 ++++++++------ 1 file changed, 8 insertions(+), 6 deletions(-) diff --git a/net/core/dev.c b/net/core/dev.c index b5d1129d8310..d42c9ea0c3c0 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -8742,13 +8742,15 @@ struct bpf_xdp_link { int flags; }; -static enum bpf_xdp_mode dev_xdp_mode(u32 flags) +static enum bpf_xdp_mode dev_xdp_mode(struct net_device *dev, u32 flags) { if (flags & XDP_FLAGS_HW_MODE) return XDP_MODE_HW; if (flags & XDP_FLAGS_DRV_MODE) return XDP_MODE_DRV; - return XDP_MODE_SKB; + if (flags & XDP_FLAGS_SKB_MODE) + return XDP_MODE_SKB; + return dev->netdev_ops->ndo_bpf ? XDP_MODE_DRV : XDP_MODE_SKB; } static bpf_op_t dev_xdp_bpf_op(struct net_device *dev, enum bpf_xdp_mode mode) @@ -8896,7 +8898,7 @@ static int dev_xdp_attach(struct net_device *dev, struct netlink_ext_ack *extack return -EINVAL; } - mode = dev_xdp_mode(flags); + mode = dev_xdp_mode(dev, flags); /* can't replace attached link */ if (dev_xdp_link(dev, mode)) { NL_SET_ERR_MSG(extack, "Can't replace active BPF XDP link"); @@ -8984,7 +8986,7 @@ static int dev_xdp_detach_link(struct net_device *dev, ASSERT_RTNL(); - mode = dev_xdp_mode(link->flags); + mode = dev_xdp_mode(dev, link->flags); if (dev_xdp_link(dev, mode) != link) return -EINVAL; @@ -9080,7 +9082,7 @@ static int bpf_xdp_link_update(struct bpf_link *link, struct bpf_prog *new_prog, goto out_unlock; } - mode = dev_xdp_mode(xdp_link->flags); + mode = dev_xdp_mode(xdp_link->dev, xdp_link->flags); bpf_op = dev_xdp_bpf_op(xdp_link->dev, mode); err = dev_xdp_install(xdp_link->dev, mode, bpf_op, NULL, xdp_link->flags, new_prog); @@ -9164,7 +9166,7 @@ out_put_dev: int dev_change_xdp_fd(struct net_device *dev, struct netlink_ext_ack *extack, int fd, int expected_fd, u32 flags) { - enum bpf_xdp_mode mode = dev_xdp_mode(flags); + enum bpf_xdp_mode mode = dev_xdp_mode(dev, flags); struct bpf_prog *new_prog = NULL, *old_prog = NULL; int err; From c210773d6c6f595f5922d56b7391fe343bc7310e Mon Sep 17 00:00:00 2001 From: Yauheni Kaliuta Date: Thu, 20 Aug 2020 14:58:43 +0300 Subject: [PATCH 36/46] bpf: selftests: global_funcs: Check err_str before strstr The error path in libbpf.c:load_program() has calls to pr_warn() which ends up for global_funcs tests to test_global_funcs.c:libbpf_debug_print(). For the tests with no struct test_def::err_str initialized with a string, it causes call of strstr() with NULL as the second argument and it segfaults. Fix it by calling strstr() only for non-NULL err_str. Signed-off-by: Yauheni Kaliuta Signed-off-by: Alexei Starovoitov Acked-by: Yonghong Song Link: https://lore.kernel.org/bpf/20200820115843.39454-1-yauheni.kaliuta@redhat.com --- tools/testing/selftests/bpf/prog_tests/test_global_funcs.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/testing/selftests/bpf/prog_tests/test_global_funcs.c b/tools/testing/selftests/bpf/prog_tests/test_global_funcs.c index 25b068591e9a..193002b14d7f 100644 --- a/tools/testing/selftests/bpf/prog_tests/test_global_funcs.c +++ b/tools/testing/selftests/bpf/prog_tests/test_global_funcs.c @@ -19,7 +19,7 @@ static int libbpf_debug_print(enum libbpf_print_level level, log_buf = va_arg(args, char *); if (!log_buf) goto out; - if (strstr(log_buf, err_str) == 0) + if (err_str && strstr(log_buf, err_str) == 0) found = true; out: printf(format, log_buf); From 4d820543c54c47a2bd3c95ddbf52f83c89a219a0 Mon Sep 17 00:00:00 2001 From: Haiyang Zhang Date: Thu, 20 Aug 2020 14:53:14 -0700 Subject: [PATCH 37/46] hv_netvsc: Remove "unlikely" from netvsc_select_queue When using vf_ops->ndo_select_queue, the number of queues of VF is usually bigger than the synthetic NIC. This condition may happen often. Remove "unlikely" from the comparison of ndev->real_num_tx_queues. Fixes: b3bf5666a510 ("hv_netvsc: defer queue selection to VF") Signed-off-by: Haiyang Zhang Signed-off-by: David S. Miller --- drivers/net/hyperv/netvsc_drv.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c index 787f17e2a971..0029292cdb9f 100644 --- a/drivers/net/hyperv/netvsc_drv.c +++ b/drivers/net/hyperv/netvsc_drv.c @@ -367,7 +367,7 @@ static u16 netvsc_select_queue(struct net_device *ndev, struct sk_buff *skb, } rcu_read_unlock(); - while (unlikely(txq >= ndev->real_num_tx_queues)) + while (txq >= ndev->real_num_tx_queues) txq -= ndev->real_num_tx_queues; return txq; From c3d897e01aef8ddc43149e4d661b86f823e3aae7 Mon Sep 17 00:00:00 2001 From: Haiyang Zhang Date: Thu, 20 Aug 2020 14:53:15 -0700 Subject: [PATCH 38/46] hv_netvsc: Fix the queue_mapping in netvsc_vf_xmit() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit netvsc_vf_xmit() / dev_queue_xmit() will call VF NIC’s ndo_select_queue or netdev_pick_tx() again. They will use skb_get_rx_queue() to get the queue number, so the “skb->queue_mapping - 1” will be used. This may cause the last queue of VF not been used. Use skb_record_rx_queue() here, so that the skb_get_rx_queue() called later will get the correct queue number, and VF will be able to use all queues. Fixes: b3bf5666a510 ("hv_netvsc: defer queue selection to VF") Signed-off-by: Haiyang Zhang Signed-off-by: David S. Miller --- drivers/net/hyperv/netvsc_drv.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c index 0029292cdb9f..64b0a74c1523 100644 --- a/drivers/net/hyperv/netvsc_drv.c +++ b/drivers/net/hyperv/netvsc_drv.c @@ -502,7 +502,7 @@ static int netvsc_vf_xmit(struct net_device *net, struct net_device *vf_netdev, int rc; skb->dev = vf_netdev; - skb->queue_mapping = qdisc_skb_cb(skb)->slave_dev_queue_mapping; + skb_record_rx_queue(skb, qdisc_skb_cb(skb)->slave_dev_queue_mapping); rc = dev_queue_xmit(skb); if (likely(rc == NET_XMIT_SUCCESS || rc == NET_XMIT_CN)) { From 272502fcb7cda01ab07fc2fcff82d1d2f73d43cc Mon Sep 17 00:00:00 2001 From: Mark Tomlinson Date: Wed, 19 Aug 2020 13:53:58 +1200 Subject: [PATCH 39/46] gre6: Fix reception with IP6_TNL_F_RCV_DSCP_COPY When receiving an IPv4 packet inside an IPv6 GRE packet, and the IP6_TNL_F_RCV_DSCP_COPY flag is set on the tunnel, the IPv4 header would get corrupted. This is due to the common ip6_tnl_rcv() function assuming that the inner header is always IPv6. This patch checks the tunnel protocol for IPv4 inner packets, but still defaults to IPv6. Fixes: 308edfdf1563 ("gre6: Cleanup GREv6 receive path, call common GRE functions") Signed-off-by: Mark Tomlinson Signed-off-by: David S. Miller --- net/ipv6/ip6_tunnel.c | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c index f635914f42ec..a0217e5bf3bc 100644 --- a/net/ipv6/ip6_tunnel.c +++ b/net/ipv6/ip6_tunnel.c @@ -915,7 +915,15 @@ int ip6_tnl_rcv(struct ip6_tnl *t, struct sk_buff *skb, struct metadata_dst *tun_dst, bool log_ecn_err) { - return __ip6_tnl_rcv(t, skb, tpi, tun_dst, ip6ip6_dscp_ecn_decapsulate, + int (*dscp_ecn_decapsulate)(const struct ip6_tnl *t, + const struct ipv6hdr *ipv6h, + struct sk_buff *skb); + + dscp_ecn_decapsulate = ip6ip6_dscp_ecn_decapsulate; + if (tpi->proto == htons(ETH_P_IP)) + dscp_ecn_decapsulate = ip4ip6_dscp_ecn_decapsulate; + + return __ip6_tnl_rcv(t, skb, tpi, tun_dst, dscp_ecn_decapsulate, log_ecn_err); } EXPORT_SYMBOL(ip6_tnl_rcv); From 41506bff84f1563e20e505ca9a0366a30ae2a879 Mon Sep 17 00:00:00 2001 From: Geert Uytterhoeven Date: Wed, 19 Aug 2020 14:45:39 +0200 Subject: [PATCH 40/46] dt-bindings: net: renesas, ether: Improve schema validation - Remove pinctrl consumer properties, as they are handled by core dt-schema, - Document missing properties, - Document missing PHY child node, - Add "additionalProperties: false". Signed-off-by: Geert Uytterhoeven Reviewed-by: Rob Herring Reviewed-by: Sergei Shtylyov Signed-off-by: David S. Miller --- .../bindings/net/renesas,ether.yaml | 22 +++++++++++++------ 1 file changed, 15 insertions(+), 7 deletions(-) diff --git a/Documentation/devicetree/bindings/net/renesas,ether.yaml b/Documentation/devicetree/bindings/net/renesas,ether.yaml index 08678af5ed93..8ce5ed8a58dd 100644 --- a/Documentation/devicetree/bindings/net/renesas,ether.yaml +++ b/Documentation/devicetree/bindings/net/renesas,ether.yaml @@ -59,9 +59,15 @@ properties: clocks: maxItems: 1 - pinctrl-0: true + power-domains: + maxItems: 1 - pinctrl-names: true + resets: + maxItems: 1 + + phy-mode: true + + phy-handle: true renesas,no-ether-link: type: boolean @@ -74,6 +80,11 @@ properties: specify when the Ether LINK signal is active-low instead of normal active-high +patternProperties: + "^ethernet-phy@[0-9a-f]$": + type: object + $ref: ethernet-phy.yaml# + required: - compatible - reg @@ -83,7 +94,8 @@ required: - '#address-cells' - '#size-cells' - clocks - - pinctrl-0 + +additionalProperties: false examples: # Lager board @@ -99,8 +111,6 @@ examples: clocks = <&mstp8_clks R8A7790_CLK_ETHER>; phy-mode = "rmii"; phy-handle = <&phy1>; - pinctrl-0 = <ðer_pins>; - pinctrl-names = "default"; renesas,ether-link-active-low; #address-cells = <1>; #size-cells = <0>; @@ -109,7 +119,5 @@ examples: reg = <1>; interrupt-parent = <&irqc0>; interrupts = <0 IRQ_TYPE_LEVEL_LOW>; - pinctrl-0 = <&phy1_pins>; - pinctrl-names = "default"; }; }; From ab921f3cdbec01c68705a7ade8bec628d541fc2b Mon Sep 17 00:00:00 2001 From: David Laight Date: Wed, 19 Aug 2020 14:40:52 +0000 Subject: [PATCH 41/46] net: sctp: Fix negotiation of the number of data streams. The number of output and input streams was never being reduced, eg when processing received INIT or INIT_ACK chunks. The effect is that DATA chunks can be sent with invalid stream ids and then discarded by the remote system. Fixes: 2075e50caf5ea ("sctp: convert to genradix") Signed-off-by: David Laight Acked-by: Marcelo Ricardo Leitner Signed-off-by: David S. Miller --- net/sctp/stream.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/net/sctp/stream.c b/net/sctp/stream.c index bda2536dd740..6dc95dcc0ff4 100644 --- a/net/sctp/stream.c +++ b/net/sctp/stream.c @@ -88,12 +88,13 @@ static int sctp_stream_alloc_out(struct sctp_stream *stream, __u16 outcnt, int ret; if (outcnt <= stream->outcnt) - return 0; + goto out; ret = genradix_prealloc(&stream->out, outcnt, gfp); if (ret) return ret; +out: stream->outcnt = outcnt; return 0; } @@ -104,12 +105,13 @@ static int sctp_stream_alloc_in(struct sctp_stream *stream, __u16 incnt, int ret; if (incnt <= stream->incnt) - return 0; + goto out; ret = genradix_prealloc(&stream->in, incnt, gfp); if (ret) return ret; +out: stream->incnt = incnt; return 0; } From eda814b97dfb8d9f4808eb2f65af9bd3705c4cae Mon Sep 17 00:00:00 2001 From: Alaa Hleihel Date: Wed, 19 Aug 2020 18:24:10 +0300 Subject: [PATCH 42/46] net/sched: act_ct: Fix skb double-free in tcf_ct_handle_fragments() error flow tcf_ct_handle_fragments() shouldn't free the skb when ip_defrag() call fails. Otherwise, we will cause a double-free bug. In such cases, just return the error to the caller. Fixes: b57dc7c13ea9 ("net/sched: Introduce action ct") Signed-off-by: Alaa Hleihel Reviewed-by: Roi Dayan Signed-off-by: David S. Miller --- net/sched/act_ct.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/sched/act_ct.c b/net/sched/act_ct.c index e6ad42b11835..2c3619165680 100644 --- a/net/sched/act_ct.c +++ b/net/sched/act_ct.c @@ -704,7 +704,7 @@ static int tcf_ct_handle_fragments(struct net *net, struct sk_buff *skb, err = ip_defrag(net, skb, user); local_bh_enable(); if (err && err != -EINPROGRESS) - goto out_free; + return err; if (!err) { *defrag = true; From f6db9096416209474090d64d8284e7c16c3d8873 Mon Sep 17 00:00:00 2001 From: Xin Long Date: Thu, 20 Aug 2020 15:34:47 +0800 Subject: [PATCH 43/46] tipc: call rcu_read_lock() in tipc_aead_encrypt_done() b->media->send_msg() requires rcu_read_lock(), as we can see elsewhere in tipc, tipc_bearer_xmit, tipc_bearer_xmit_skb and tipc_bearer_bc_xmit(). Syzbot has reported this issue as: net/tipc/bearer.c:466 suspicious rcu_dereference_check() usage! Workqueue: cryptd cryptd_queue_worker Call Trace: tipc_l2_send_msg+0x354/0x420 net/tipc/bearer.c:466 tipc_aead_encrypt_done+0x204/0x3a0 net/tipc/crypto.c:761 cryptd_aead_crypt+0xe8/0x1d0 crypto/cryptd.c:739 cryptd_queue_worker+0x118/0x1b0 crypto/cryptd.c:181 process_one_work+0x94c/0x1670 kernel/workqueue.c:2269 worker_thread+0x64c/0x1120 kernel/workqueue.c:2415 kthread+0x3b5/0x4a0 kernel/kthread.c:291 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:293 So fix it by calling rcu_read_lock() in tipc_aead_encrypt_done() for b->media->send_msg(). Fixes: fc1b6d6de220 ("tipc: introduce TIPC encryption & authentication") Reported-by: syzbot+47bbc6b678d317cccbe0@syzkaller.appspotmail.com Signed-off-by: Xin Long Signed-off-by: David S. Miller --- net/tipc/crypto.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/net/tipc/crypto.c b/net/tipc/crypto.c index 001bcb0f2480..c38babaa4e57 100644 --- a/net/tipc/crypto.c +++ b/net/tipc/crypto.c @@ -757,10 +757,12 @@ static void tipc_aead_encrypt_done(struct crypto_async_request *base, int err) switch (err) { case 0: this_cpu_inc(tx->stats->stat[STAT_ASYNC_OK]); + rcu_read_lock(); if (likely(test_bit(0, &b->up))) b->media->send_msg(net, skb, b, &tx_ctx->dst); else kfree_skb(skb); + rcu_read_unlock(); break; case -EINPROGRESS: return; From 774d977abfd024e6f73484544b9abe5a5cd62de7 Mon Sep 17 00:00:00 2001 From: Tom Rix Date: Fri, 21 Aug 2020 06:56:00 -0700 Subject: [PATCH 44/46] net: dsa: b53: check for timeout clang static analysis reports this problem b53_common.c:1583:13: warning: The left expression of the compound assignment is an uninitialized value. The computed value will also be garbage ent.port &= ~BIT(port); ~~~~~~~~ ^ ent is set by a successful call to b53_arl_read(). Unsuccessful calls are caught by an switch statement handling specific returns. b32_arl_read() calls b53_arl_op_wait() which fails with the unhandled -ETIMEDOUT. So add -ETIMEDOUT to the switch statement. Because b53_arl_op_wait() already prints out a message, do not add another one. Fixes: 1da6df85c6fb ("net: dsa: b53: Implement ARL add/del/dump operations") Signed-off-by: Tom Rix Acked-by: Florian Fainelli Signed-off-by: David S. Miller --- drivers/net/dsa/b53/b53_common.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/net/dsa/b53/b53_common.c b/drivers/net/dsa/b53/b53_common.c index 6500179c2ca2..0837ae0e0c5e 100644 --- a/drivers/net/dsa/b53/b53_common.c +++ b/drivers/net/dsa/b53/b53_common.c @@ -1554,6 +1554,8 @@ static int b53_arl_op(struct b53_device *dev, int op, int port, return ret; switch (ret) { + case -ETIMEDOUT: + return ret; case -ENOSPC: dev_dbg(dev->dev, "{%pM,%.4d} no space left in ARL\n", addr, vid); From b16fc097bc283184cde40e5b30d15705e1590410 Mon Sep 17 00:00:00 2001 From: Tobias Klauser Date: Fri, 21 Aug 2020 15:36:42 +0200 Subject: [PATCH 45/46] bpf: Fix two typos in uapi/linux/bpf.h Also remove trailing whitespaces in bpf_skb_get_tunnel_key example code. Signed-off-by: Tobias Klauser Signed-off-by: Alexei Starovoitov Link: https://lore.kernel.org/bpf/20200821133642.18870-1-tklauser@distanz.ch --- include/uapi/linux/bpf.h | 10 +++++----- tools/include/uapi/linux/bpf.h | 10 +++++----- 2 files changed, 10 insertions(+), 10 deletions(-) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 0480f893facd..b6238b2209b7 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -767,7 +767,7 @@ union bpf_attr { * * Also, note that **bpf_trace_printk**\ () is slow, and should * only be used for debugging purposes. For this reason, a notice - * bloc (spanning several lines) is printed to kernel logs and + * block (spanning several lines) is printed to kernel logs and * states that the helper should not be used "for production use" * the first time this helper is used (or more precisely, when * **trace_printk**\ () buffers are allocated). For passing values @@ -1033,14 +1033,14 @@ union bpf_attr { * * int ret; * struct bpf_tunnel_key key = {}; - * + * * ret = bpf_skb_get_tunnel_key(skb, &key, sizeof(key), 0); * if (ret < 0) * return TC_ACT_SHOT; // drop packet - * + * * if (key.remote_ipv4 != 0x0a000001) * return TC_ACT_SHOT; // drop packet - * + * * return TC_ACT_OK; // accept packet * * This interface can also be used with all encapsulation devices @@ -1147,7 +1147,7 @@ union bpf_attr { * Description * Retrieve the realm or the route, that is to say the * **tclassid** field of the destination for the *skb*. The - * indentifier retrieved is a user-provided tag, similar to the + * identifier retrieved is a user-provided tag, similar to the * one used with the net_cls cgroup (see description for * **bpf_get_cgroup_classid**\ () helper), but here this tag is * held by a route (a destination entry), not by a task. diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 0480f893facd..b6238b2209b7 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -767,7 +767,7 @@ union bpf_attr { * * Also, note that **bpf_trace_printk**\ () is slow, and should * only be used for debugging purposes. For this reason, a notice - * bloc (spanning several lines) is printed to kernel logs and + * block (spanning several lines) is printed to kernel logs and * states that the helper should not be used "for production use" * the first time this helper is used (or more precisely, when * **trace_printk**\ () buffers are allocated). For passing values @@ -1033,14 +1033,14 @@ union bpf_attr { * * int ret; * struct bpf_tunnel_key key = {}; - * + * * ret = bpf_skb_get_tunnel_key(skb, &key, sizeof(key), 0); * if (ret < 0) * return TC_ACT_SHOT; // drop packet - * + * * if (key.remote_ipv4 != 0x0a000001) * return TC_ACT_SHOT; // drop packet - * + * * return TC_ACT_OK; // accept packet * * This interface can also be used with all encapsulation devices @@ -1147,7 +1147,7 @@ union bpf_attr { * Description * Retrieve the realm or the route, that is to say the * **tclassid** field of the destination for the *skb*. The - * indentifier retrieved is a user-provided tag, similar to the + * identifier retrieved is a user-provided tag, similar to the * one used with the net_cls cgroup (see description for * **bpf_get_cgroup_classid**\ () helper), but here this tag is * held by a route (a destination entry), not by a task. From eeaac3634ee0e3f35548be35275efeca888e9b23 Mon Sep 17 00:00:00 2001 From: Nikolay Aleksandrov Date: Sat, 22 Aug 2020 15:06:36 +0300 Subject: [PATCH 46/46] net: nexthop: don't allow empty NHA_GROUP Currently the nexthop code will use an empty NHA_GROUP attribute, but it requires at least 1 entry in order to function properly. Otherwise we end up derefencing null or random pointers all over the place due to not having any nh_grp_entry members allocated, nexthop code relies on having at least the first member present. Empty NHA_GROUP doesn't make any sense so just disallow it. Also add a WARN_ON for any future users of nexthop_create_group(). BUG: kernel NULL pointer dereference, address: 0000000000000080 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [#1] SMP CPU: 0 PID: 558 Comm: ip Not tainted 5.9.0-rc1+ #93 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-2.fc32 04/01/2014 RIP: 0010:fib_check_nexthop+0x4a/0xaa Code: 0f 84 83 00 00 00 48 c7 02 80 03 f7 81 c3 40 80 fe fe 75 12 b8 ea ff ff ff 48 85 d2 74 6b 48 c7 02 40 03 f7 81 c3 48 8b 40 10 <48> 8b 80 80 00 00 00 eb 36 80 78 1a 00 74 12 b8 ea ff ff ff 48 85 RSP: 0018:ffff88807983ba00 EFLAGS: 00010213 RAX: 0000000000000000 RBX: ffff88807983bc00 RCX: 0000000000000000 RDX: ffff88807983bc00 RSI: 0000000000000000 RDI: ffff88807bdd0a80 RBP: ffff88807983baf8 R08: 0000000000000dc0 R09: 000000000000040a R10: 0000000000000000 R11: ffff88807bdd0ae8 R12: 0000000000000000 R13: 0000000000000000 R14: ffff88807bea3100 R15: 0000000000000001 FS: 00007f10db393700(0000) GS:ffff88807dc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000080 CR3: 000000007bd0f004 CR4: 00000000003706f0 Call Trace: fib_create_info+0x64d/0xaf7 fib_table_insert+0xf6/0x581 ? __vma_adjust+0x3b6/0x4d4 inet_rtm_newroute+0x56/0x70 rtnetlink_rcv_msg+0x1e3/0x20d ? rtnl_calcit.isra.0+0xb8/0xb8 netlink_rcv_skb+0x5b/0xac netlink_unicast+0xfa/0x17b netlink_sendmsg+0x334/0x353 sock_sendmsg_nosec+0xf/0x3f ____sys_sendmsg+0x1a0/0x1fc ? copy_msghdr_from_user+0x4c/0x61 ___sys_sendmsg+0x63/0x84 ? handle_mm_fault+0xa39/0x11b5 ? sockfd_lookup_light+0x72/0x9a __sys_sendmsg+0x50/0x6e do_syscall_64+0x54/0xbe entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f10dacc0bb7 Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb cd 66 0f 1f 44 00 00 8b 05 9a 4b 2b 00 85 c0 75 2e 48 63 ff 48 63 d2 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 01 c3 48 8b 15 b1 f2 2a 00 f7 d8 64 89 02 48 RSP: 002b:00007ffcbe628bf8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 00007ffcbe628f80 RCX: 00007f10dacc0bb7 RDX: 0000000000000000 RSI: 00007ffcbe628c60 RDI: 0000000000000003 RBP: 000000005f41099c R08: 0000000000000001 R09: 0000000000000008 R10: 00000000000005e9 R11: 0000000000000246 R12: 0000000000000000 R13: 0000000000000000 R14: 00007ffcbe628d70 R15: 0000563a86c6e440 Modules linked in: CR2: 0000000000000080 CC: David Ahern Fixes: 430a049190de ("nexthop: Add support for nexthop groups") Reported-by: syzbot+a61aa19b0c14c8770bd9@syzkaller.appspotmail.com Signed-off-by: Nikolay Aleksandrov Reviewed-by: David Ahern Signed-off-by: David S. Miller --- net/ipv4/nexthop.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c index cc8049b100b2..134e92382275 100644 --- a/net/ipv4/nexthop.c +++ b/net/ipv4/nexthop.c @@ -446,7 +446,7 @@ static int nh_check_attr_group(struct net *net, struct nlattr *tb[], unsigned int i, j; u8 nhg_fdb = 0; - if (len & (sizeof(struct nexthop_grp) - 1)) { + if (!len || len & (sizeof(struct nexthop_grp) - 1)) { NL_SET_ERR_MSG(extack, "Invalid length for nexthop group attribute"); return -EINVAL; @@ -1187,6 +1187,9 @@ static struct nexthop *nexthop_create_group(struct net *net, struct nexthop *nh; int i; + if (WARN_ON(!num_nh)) + return ERR_PTR(-EINVAL); + nh = nexthop_alloc(); if (!nh) return ERR_PTR(-ENOMEM);