Commit Graph

8326 Commits

Author SHA1 Message Date
Steve Wise d12ff62482 RDMA/nldev: common resource dumpit function
Create a common dumpit function that can be used by all common resource
types.  This reduces code replication and simplifies the code as we add
more resource types.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-03-08 15:03:03 -05:00
Steve Wise 88831a2cfe RDMA/restrack: clean up res_to_dev()
Simplify res_to_dev() to make it easier to read/maintain.

Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-03-08 15:03:03 -05:00
Yishai Hadas d50a8a96ee IB/mlx4: Move mlx4_uverbs_ex_query_device_resp to include/uapi/
This struct is involved in the user API for mlx4 and should not be hidden
inside a driver header file.

Fixes: 09d208b258 ("IB/mlx4: Add report for RSS capabilities by vendor channel")
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-03-07 16:10:07 -07:00
Doug Ledford 1abb791fcd Merge tag 'mlx5-updates-2018-02-28-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux into k.o/wip/dl-for-next
mlx5-updates-2018-02-28-1 (IPSec-1)

This series consists of some fixes and refactors for the mlx5 drivers,
especially around the FPGA and flow steering. Most of them are trivial
fixes and are the foundation of allowing IPSec acceleration from user-space.

We use flow steering abstraction in order to accelerate IPSec packets.
When a user creates a steering rule, [s]he states that we'll carry an
encrypt/decrypt flow action (using a specific configuration) for every
packet which conforms to a certain match. Since currently offloading these
packets is done via FPGA, we'll add another set of flow steering ops.
These ops will execute the required FPGA commands and then call the
standard steering ops.

In order to achieve this, we need that the commands will get all the
required information. Therefore, we pass the fte object and embed the
flow_action struct inside the fte. In addition, we add the shim layer
that will later be used for alternating between the standard and the
FPGA steering commands.

Some fixes, like " net/mlx5e: Wait for FPGA command responses with a timeout"
are very relevant for user-space applications, as these applications could
be killed, but we still want to wait for the FPGA and update the kernel's
database.

Regards,
Aviad and Matan

Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-03-07 15:56:39 -07:00
Zhu Yanjun befd8d98f2 IB/rxe: change the function rxe_init_device_param type
The function rxe_init_device_param always return 0. So the function
type is changed to void.

CC: Srinivas Eeda <srinivas.eeda@oracle.com>
CC: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-03-07 15:56:15 -07:00
Zhu Yanjun 31f1bd14cb IB/rxe: remove unnecessary rxe in rxe_send
In the function rxe_send, the variable rxe is not used in it.
So it should be removed.

CC: Srinivas Eeda <srinivas.eeda@oracle.com>
CC: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-03-07 15:56:14 -07:00
Zhu Yanjun 86af617641 IB/rxe: remove unnecessary skb_clone
In send_atomic_ack function, it is not necessary to make a
skb_clone. To gain better performance (high throughput and
low latency), this skb_clone is removed.

The following tests are made.

 server                       client
---------                    ---------
|1.1.1.1|<----rxe-channel--->|1.1.1.2|
---------                    ---------

On server: rping -s -a 1.1.1.1 -v -C 1000 -S 512
On client: rping -c -a 1.1.1.1 -v -C 1000 -S 512

The kernel config CONFIG_DEBUG_KMEMLEAK is enabled on both server
and client.

This test runs for several hours. There is no memory leak and the whole
system can work well.

Based on the above network, the following tests are made.

Server: ibv_rc_pingpong -d rxe0 -g 1
Client: ibv_rc_pingpong -d rxe0 -g 1 1.1.1.1

The test results on Server(10 tests are made).
Before:
Throughput is 137.07 Mbit/sec
Latency is 517.76 usec/iter

After:
Throughput is 148.85 Mbit/sec
Latency is 476.64 usec/iter

The throughput is enhanced and the latency is reduced.

CC: Srinivas Eeda <srinivas.eeda@oracle.com>
CC: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>

Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-03-07 15:56:14 -07:00
Bart Van Assche 63cf1a902c IB/srpt: Add RDMA/CM support
Add a parameter for configuring the port on which the ib_srpt driver
listens for incoming RDMA/CM connections, namely
/sys/kernel/config/target/srpt/discovery_auth/rdma_cm_port. The default
value for this parameter is 0 which means "do not listen for incoming
RDMA/CM connections". Add RDMA/CM support to all code that handles
connection state changes. Modify srpt_init_nodeacl() such that ACLs can
be configured for IPv4 and IPv6 addresses.

Note: incoming connection requests are only accepted for ports that
have been enabled. See also the "if (!sport->enabled)" code in the
connection request handler. See also the following configfs attribute:
/sys/kernel/config/target/srpt/$port/$port/enable.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-03-07 15:56:14 -07:00
David S. Miller bcde6b725f mlx5-updates-2018-02-28-1 (IPSec-1)
This series consists of some fixes and refactors for the mlx5 drivers,
 especially around the FPGA and flow steering. Most of them are trivial
 fixes and are the foundation of allowing IPSec acceleration from user-space.
 
 We use flow steering abstraction in order to accelerate IPSec packets.
 When a user creates a steering rule, [s]he states that we'll carry an
 encrypt/decrypt flow action (using a specific configuration) for every
 packet which conforms to a certain match. Since currently offloading these
 packets is done via FPGA, we'll add another set of flow steering ops.
 These ops will execute the required FPGA commands and then call the
 standard steering ops.
 
 In order to achieve this, we need that the commands will get all the
 required information. Therefore, we pass the fte object and embed the
 flow_action struct inside the fte. In addition, we add the shim layer
 that will later be used for alternating between the standard and the
 FPGA steering commands.
 
 Some fixes, like " net/mlx5e: Wait for FPGA command responses with a timeout"
 are very relevant for user-space applications, as these applications could
 be killed, but we still want to wait for the FPGA and update the kernel's
 database.
 
 Regards,
 Aviad and Matan
 -----BEGIN PGP SIGNATURE-----
 
 iQEcBAABAgAGBQJan4UmAAoJEEg/ir3gV/o+cZwH/1xBpdLsmeqEimwQ41bAc9Rj
 UmPZXXMyQVUYfGOiE1aLTH7YNi38XWSnTFMN7HklMeX/9YKxUZNG8YuiO9iQhE1B
 rUqRKfYFz9oFrUh95SzeclaunTpKrhYKHjQ9/u9nBMfI3H2Fy+y2NUBjIqJ6nysz
 op3EwRcX5kD4+MjRum24XLUnMSYbg05mHCZKTDj5/2T4x+/j0XQqvvmWWinIt8BO
 R4d7XGGywGjbhtcG1j+XBcFeLsEZnS/w70hoN38TdcmNWvokl1pGk8DVDii9i7GX
 c5jQj2h5WG/bdsS26y8MFfWpoAn3Qzzm4W4OYwp/vmL7n/Llvq0GRCEKCi8AlA0=
 =LeYV
 -----END PGP SIGNATURE-----

Merge tag 'mlx5-updates-2018-02-28-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux

Saeed Mahameed says:

====================
mlx5-updates-2018-02-28-1 (IPSec-1)

This series consists of some fixes and refactors for the mlx5 drivers,
especially around the FPGA and flow steering. Most of them are trivial
fixes and are the foundation of allowing IPSec acceleration from user-space.

We use flow steering abstraction in order to accelerate IPSec packets.
When a user creates a steering rule, [s]he states that we'll carry an
encrypt/decrypt flow action (using a specific configuration) for every
packet which conforms to a certain match. Since currently offloading these
packets is done via FPGA, we'll add another set of flow steering ops.
These ops will execute the required FPGA commands and then call the
standard steering ops.

In order to achieve this, we need that the commands will get all the
required information. Therefore, we pass the fte object and embed the
flow_action struct inside the fte. In addition, we add the shim layer
that will later be used for alternating between the standard and the
FPGA steering commands.

Some fixes, like " net/mlx5e: Wait for FPGA command responses with a timeout"
are very relevant for user-space applications, as these applications could
be killed, but we still want to wait for the FPGA and update the kernel's
database.

Regards,
Aviad and Matan
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-07 15:28:13 -05:00
Leon Romanovsky a5880b8443 RDMA/ucma: Check that user doesn't overflow QP state
The QP state is limited and declared in enum ib_qp_state,
but ucma user was able to supply any possible (u32) value.

Reported-by: syzbot+0df1ab766f8924b1edba@syzkaller.appspotmail.com
Fixes: 7521663857 ("RDMA/cma: Export rdma cm interface to userspace")
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-03-07 15:25:40 -05:00
Leon Romanovsky aa0de36a40 RDMA/mlx5: Fix integer overflow while resizing CQ
The user can provide very large cqe_size which will cause to integer
overflow as it can be seen in the following UBSAN warning:

Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-03-07 15:23:43 -05:00
Leon Romanovsky 6a21dfc0d0 RDMA/ucma: Limit possible option size
Users of ucma are supposed to provide size of option level,
in most paths it is supposed to be equal to u8 or u16, but
it is not the case for the IB path record, where it can be
multiple of struct ib_path_rec_data.

This patch takes simplest possible approach and prevents providing
values more than possible to allocate.

Reported-by: syzbot+a38b0e9f694c379ca7ce@syzkaller.appspotmail.com
Fixes: 7ce86409ad ("RDMA/ucma: Allow user space to set service type")
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-03-07 15:18:03 -05:00
Parav Pandit bb7f8f199c IB/core: Fix possible crash to access NULL netdev
resolved_dev returned might be NULL as ifindex is transient number.
Ignoring NULL check of resolved_dev might crash the kernel.
Therefore perform NULL check before accessing resolved_dev.

Additionally rdma_resolve_ip_route() invokes addr_resolve() which
performs check and address translation for loopback ifindex.
Therefore, checking it again in rdma_resolve_ip_route() is not helpful.
Therefore, the code is simplified to avoid IFF_LOOPBACK check.

Fixes: 200298326b ("IB/core: Validate route when we init ah")
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-03-07 15:15:40 -05:00
Boris Pismenny 3346c48737 {net,IB}/mlx5: Add flow steering helpers
Add helper functions that check if a protocol is
part of a flow steering match criteria.

Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Aviad Yehezkel <aviadye@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-03-06 22:20:14 -08:00
Matan Barak a9db0ecf15 {net,IB}/mlx5: Add has_tag to mlx5_flow_act
The has_tag member will indicate whether a tag action was specified
in flow specification.

A flow tag 0 = MLX5_FS_DEFAULT_FLOW_TAG is assumed a valid flow tag
that is currently used by mlx5 RDMA driver, whereas in HW flow_tag = 0
means that the user doesn't care about flow_tag.  HW always provide
a flow_tag = 0 if all flow tags requested on a specific flow are 0.

So we need a way (in the driver) to differentiate between a user really
requesting flow_tag = 0 and a user who does not care, in order to be
able to report conflicting flow tags on a specific flow.

Signed-off-by: Matan Barak <matanb@mellanox.com>
Reviewed-by: Aviad Yehezkel <aviadye@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-03-06 22:06:33 -08:00
Boris Pismenny 075572d4b7 IB/mlx5: Pass mlx5_flow_act struct instead of multiple arguments
Group and pass all function arguments of parse_flow_attr call in one
common struct mlx5_flow_act.

This patch passes all the action arguments of parse_flow_attr in one common
struct mlx5_flow_act. It allows us to scale the number of actions without adding
new arguments to the function.

Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Acked-by: Jason Gunthorpe <jgg@mellanox.com>
2018-03-06 22:06:11 -08:00
Aviad Yehezkel c33251a3c6 IB/mlx5: Removed not used parameters
Signed-off-by: Aviad Yehezkel <aviadye@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Acked-by: Jason Gunthorpe <jgg@mellanox.com>
2018-03-06 22:05:36 -08:00
Selvin Xavier 942c9b6ca8 RDMA/bnxt_re: Avoid Hard lockup during error CQE processing
Hitting the following hardlockup due to a race condition in
error CQE processing.

[26146.879798] bnxt_en 0000:04:00.0: QPLIB: FP: CQ Processed Req
[26146.886346] bnxt_en 0000:04:00.0: QPLIB: wr_id[1251] = 0x0 with status 0xa
[26156.350935] NMI watchdog: Watchdog detected hard LOCKUP on cpu 4
[26156.357470] Modules linked in: nfsd auth_rpcgss nfs_acl lockd grace
[26156.447957] CPU: 4 PID: 3413 Comm: kworker/4:1H Kdump: loaded
[26156.457994] Hardware name: Dell Inc. PowerEdge R430/0CN7X8,
[26156.466390] Workqueue: ib-comp-wq ib_cq_poll_work [ib_core]
[26156.472639] Call Trace:
[26156.475379]  <NMI>  [<ffffffff98d0d722>] dump_stack+0x19/0x1b
[26156.481833]  [<ffffffff9873f775>] watchdog_overflow_callback+0x135/0x140
[26156.489341]  [<ffffffff9877f237>] __perf_event_overflow+0x57/0x100
[26156.496256]  [<ffffffff98787c24>] perf_event_overflow+0x14/0x20
[26156.502887]  [<ffffffff9860a580>] intel_pmu_handle_irq+0x220/0x510
[26156.509813]  [<ffffffff98d16031>] perf_event_nmi_handler+0x31/0x50
[26156.516738]  [<ffffffff98d1790c>] nmi_handle.isra.0+0x8c/0x150
[26156.523273]  [<ffffffff98d17be8>] do_nmi+0x218/0x460
[26156.528834]  [<ffffffff98d16d79>] end_repeat_nmi+0x1e/0x7e
[26156.534980]  [<ffffffff987089c0>] ? native_queued_spin_lock_slowpath+0x1d0/0x200
[26156.543268]  [<ffffffff987089c0>] ? native_queued_spin_lock_slowpath+0x1d0/0x200
[26156.551556]  [<ffffffff987089c0>] ? native_queued_spin_lock_slowpath+0x1d0/0x200
[26156.559842]  <EOE>  [<ffffffff98d083e4>] queued_spin_lock_slowpath+0xb/0xf
[26156.567555]  [<ffffffff98d15690>] _raw_spin_lock+0x20/0x30
[26156.573696]  [<ffffffffc08381a1>] bnxt_qplib_lock_buddy_cq+0x31/0x40 [bnxt_re]
[26156.581789]  [<ffffffffc083bbaa>] bnxt_qplib_poll_cq+0x43a/0xf10 [bnxt_re]
[26156.589493]  [<ffffffffc083239b>] bnxt_re_poll_cq+0x9b/0x760 [bnxt_re]

The issue happens if RQ poll_cq or SQ poll_cq or Async error event tries to
put the error QP in flush list. Since SQ and RQ of each error qp are added
to two different flush list, we need to protect it using locks of
corresponding CQs. Difference in order of acquiring the lock in
SQ poll_cq and RQ poll_cq can cause a hard lockup.

Revisits the locking strategy and removes the usage of qplib_cq.hwq.lock.
Instead of this lock, introduces qplib_cq.flush_lock to handle
addition/deletion of QPs in flush list. Also, always invoke the flush_lock
in order (SQ CQ lock first and then RQ CQ lock) to avoid any potential
deadlock.

Other than the poll_cq context, the movement of QP to/from flush list can
be done in modify_qp context or from an async error event from HW.
Synchronize these operations using the bnxt_re verbs layer CQ locks.
To achieve this, adds a call back to the HW abstraction layer(qplib) to
bnxt_re ib_verbs layer in case of async error event. Also, removes the
buddy cq functions as it is no longer required.

Signed-off-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Signed-off-by: Somnath Kotur <somnath.kotur@broadcom.com>
Signed-off-by: Devesh Sharma <devesh.sharma@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-03-06 20:08:39 -07:00
Max Gurtovoy d3b9e8ad42 RDMA/core: Reduce poll batch for direct cq polling
Fix warning limit for kernel stack consumption:

drivers/infiniband/core/cq.c: In function 'ib_process_cq_direct':
drivers/infiniband/core/cq.c:78:1: error: the frame size of 1032 bytes
is larger than 1024 bytes [-Werror=frame-larger-than=]

Using smaller ib_wc array on the stack brings us comfortably below that
limit again.

Fixes: 246d8b184c ("IB/cq: Don't force IB_POLL_DIRECT poll context for ib_process_cq_direct")
Reported-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Sergey Gorenko <sergeygo@mellanox.com>
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-03-06 20:08:39 -07:00
Dan Carpenter 5d414b178e IB/mlx5: Fix an error code in __mlx5_ib_modify_qp()
"err" is either zero or possibly uninitialized here.  It should be
-EINVAL.

Fixes: 427c1e7bcd ("{IB, net}/mlx5: Move the modify QP operation table to mlx5_ib")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-03-06 20:08:39 -07:00
Mark Bloch 210b1f7807 IB/mlx5: When not in dual port RoCE mode, use provided port as native
The series that introduced dual port RoCE mode assumed that we don't have
a dual port HCA that use the mlx5 driver, this is not the case for
Connect-IB HCAs. This reasoning led to assigning 1 as the native port
index which causes issue when the second port is used.

For example query_pkey() when called on the second port will return values
of the first port. Make sure that we assign the right port index as the
native port index.

Fixes: 32f69e4be2 ("{net, IB}/mlx5: Manage port association for multiport RoCE")
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-03-06 20:08:38 -07:00
Jack M a18177925c IB/mlx4: Include GID type when deleting GIDs from HW table under RoCE
The commit cited below added a gid_type field (RoCEv1 or RoCEv2)
to GID properties.

When adding GIDs, this gid_type field was copied over to the
hardware gid table. However, when deleting GIDs, the gid_type field
was not copied over to the hardware gid table.

As a result, when running RoCEv2, all RoCEv2 gids in the
hardware gid table were set to type RoCEv1 when any gid was deleted.

This problem would persist until the next gid was added (which would again
restore the gid_type field for all the gids in the hardware gid table).

Fix this by copying over the gid_type field to the hardware gid table
when deleting gids, so that the gid_type of all remaining gids is
preserved when a gid is deleted.

Fixes: b699a859d1 ("IB/mlx4: Add gid_type to GID properties")
Reviewed-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-03-06 20:08:38 -07:00
Jack Morgenstein 0077416a3d IB/mlx4: Fix corruption of RoCEv2 IPv4 GIDs
When using IPv4 addresses in RoCEv2, the GID format for the mapped
IPv4 address should be: ::ffff:<4-byte IPv4 address>.

In the cited commit, IPv4 mapped IPV6 addresses had the 3 upper dwords
zeroed out by memset, which resulted in deleting the ffff field.

However, since procedure ipv6_addr_v4mapped() already verifies that the
gid has format ::ffff:<ipv4 address>, no change is needed for the gid,
and the memset can simply be removed.

Fixes: 7e57b85c44 ("IB/mlx4: Add support for setting RoCEv2 gids in hardware")
Reviewed-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-03-06 20:08:38 -07:00
Kalderon, Michal 551e1c67b4 RDMA/qedr: Fix iWARP write and send with immediate
iWARP does not support RDMA WRITE or SEND with immediate data.
Driver should check this before submitting to FW and return an
immediate error

Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com>
Signed-off-by: Ariel Elior <Ariel.Elior@cavium.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-03-06 19:57:37 -07:00
Kalderon, Michal e3fd112cbf RDMA/qedr: Fix kernel panic when running fio over NFSoRDMA
Race in qedr_poll_cq, lastest_cqe wasn't protected by lock,
leading to a case where two context's accessing poll_cq at
the same time lead to one of them having a pointer to an old
latest_cqe and reading an invalid cqe element

Signed-off-by: Amit Radzi <Amit.Radzi@cavium.com>
Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com>
Signed-off-by: Ariel Elior <Ariel.Elior@cavium.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-03-06 19:57:37 -07:00
Kalderon, Michal ea0ed47803 RDMA/qedr: Fix iWARP connect with port mapper
Fix iWARP connect and listen to use the mapped port for
ipv4 and ipv6. Without this fixed, running on a server
that has iwpmd enabled will not use the correct port

Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com>
Signed-off-by: Ariel Elior <Ariel.Elior@cavium.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-03-06 19:57:37 -07:00
Kalderon, Michal 11052696fd RDMA/qedr: Fix ipv6 destination address resolution
The wrong parameter was passed to dst_neigh_lookup

Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com>
Signed-off-by: Ariel Elior <Ariel.Elior@cavium.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-03-06 19:57:37 -07:00
Gustavo A. R. Silva 63231585a6 RDMA/bnxt_re/qplib_sp: Use true and false for boolean values
Assign true or false to boolean variables instead of an integer value.

This issue was detected with the help of Coccinelle.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Acked-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-03-06 16:00:51 -07:00
Sergey Gorenko fbd36818ee IB/srp: Use the IB_DEVICE_SG_GAPS_REG HCA feature if supported
If a HCA supports the SG_GAPS_REG feature then fewer memory regions
are required per command. This patch reduces the number of memory
regions that is allocated per SRP session.

Signed-off-by: Sergey Gorenko <sergeygo@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Tested-by: Laurence Oberman <loberman@redhat.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Acked-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-03-06 16:00:51 -07:00
Bart Van Assche 4190443947 IB/hfi1: Add a missing rcu_read_unlock()
This patch avoids that sparse reports the following:

drivers/infiniband/hw/hfi1/driver.c:251:13: warning: context imbalance in 'rcv_hdrerr' - different lock contexts for basic block

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
Cc: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-03-06 16:00:51 -07:00
Arushi 666fe24bbe infiniband: hw: Drop unnecessary continue
Continue at the bottom of a loop are removed.
Issue found using drop_continue.cocci Coccinelle script.

Signed-off-by: Arushi Singhal <arushisinghal19971997@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-03-06 16:00:51 -07:00
Shiraz Saleem 7e952b19eb i40iw: Implement get_vector_affinity API
Storage ULPs (like NVMEoF) benefit from exposing affinity mapping
per completion vector to find the optimal multi-queue affinity
assignments. The ULPs call the verbs API ib_get_vector_affinity
introduced in commit c66cd353bb ("RDMA/core: expose affinity mappings per
completion vector") to get the underlying devices affinity mappings.

Add support in driver to expose the affinity masks per MSI-X
completion vector.

Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-03-06 16:00:51 -07:00
Shiraz Saleem 7de8b3576a i40iw: Improve CM node lookup time on connection setup
Currently all CM nodes involved in a connection are
maintained in a connected_node list per dev. During
connection setup, we need to search this every time
we receive a packet on the iWARP LAN Queue (ILQ) and
this can be pretty inefficient for large number of
connections.

Fix this by organizing the CM nodes in two lists -
accelerated list and non-accelerated list. The search
on ILQ receive would be limited to only non accelerated
nodes. When a node moves to RTS, it is added to the
accelerated list.

Benchmarking ucmatose 16k connections shows a 20%
improvement in test completion time.

Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-03-06 16:00:51 -07:00
Mustafa Ismail 6b0c549fc6 i40iw: Refactor handling of txpend list
Currently the TX pending lists for IEQ and ILQ are
handled separately. The handling of both can be
consolidated in i40iw_poll_completion.

Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-03-06 16:00:51 -07:00
Bart Van Assche 2a78cb4db4 IB/srpt: Fix an out-of-bounds stack access in srpt_zerolength_write()
Avoid triggering an out-of-bounds stack access by changing the type
of 'wr' from ib_send_wr into ib_rdma_wr.

This patch fixes the following KASAN bug report:

BUG: KASAN: stack-out-of-bounds in rxe_post_send+0x7a9/0x9a0 [rdma_rxe]
Read of size 8 at addr ffff880068197a48 by task kworker/2:1/44

Workqueue: ib_cm cm_work_handler [ib_cm]
Call Trace:
 dump_stack+0x8e/0xcd
 print_address_description+0x6f/0x280
 kasan_report+0x25a/0x380
 __asan_load8+0x54/0x90
 rxe_post_send+0x7a9/0x9a0 [rdma_rxe]
 srpt_zerolength_write+0xf0/0x180 [ib_srpt]
 srpt_cm_rtu_recv+0x68/0x110 [ib_srpt]
 srpt_rdma_cm_handler+0xbb/0x15b [ib_srpt]
 cma_ib_handler+0x1aa/0x4a0 [rdma_cm]
 cm_process_work+0x30/0x100 [ib_cm]
 cm_work_handler+0xa86/0x351b [ib_cm]
 process_one_work+0x475/0x9f0
 worker_thread+0x69/0x690
 kthread+0x1ad/0x1d0
 ret_from_fork+0x3a/0x50

Fixes: aaf45bd83e ("IB/srpt: Detect session shutdown reliably")
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-03-06 16:00:51 -07:00
Bart Van Assche a6544a624c RDMA/rxe: Fix an out-of-bounds read
This patch avoids that KASAN reports the following when the SRP initiator
calls srp_post_send():

==================================================================
BUG: KASAN: stack-out-of-bounds in rxe_post_send+0x5c4/0x980 [rdma_rxe]
Read of size 8 at addr ffff880066606e30 by task 02-mq/1074

CPU: 2 PID: 1074 Comm: 02-mq Not tainted 4.16.0-rc3-dbg+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
Call Trace:
dump_stack+0x85/0xc7
print_address_description+0x65/0x270
kasan_report+0x231/0x350
rxe_post_send+0x5c4/0x980 [rdma_rxe]
srp_post_send.isra.16+0x149/0x190 [ib_srp]
srp_queuecommand+0x94d/0x1670 [ib_srp]
scsi_dispatch_cmd+0x1c2/0x550 [scsi_mod]
scsi_queue_rq+0x843/0xa70 [scsi_mod]
blk_mq_dispatch_rq_list+0x143/0xac0
blk_mq_do_dispatch_ctx+0x1c5/0x260
blk_mq_sched_dispatch_requests+0x2bf/0x2f0
__blk_mq_run_hw_queue+0xdb/0x160
__blk_mq_delay_run_hw_queue+0xba/0x100
blk_mq_run_hw_queue+0xf2/0x190
blk_mq_sched_insert_request+0x163/0x2f0
blk_execute_rq+0xb0/0x130
scsi_execute+0x14e/0x260 [scsi_mod]
scsi_probe_and_add_lun+0x366/0x13d0 [scsi_mod]
__scsi_scan_target+0x18a/0x810 [scsi_mod]
scsi_scan_target+0x11e/0x130 [scsi_mod]
srp_create_target+0x1522/0x19e0 [ib_srp]
kernfs_fop_write+0x180/0x210
__vfs_write+0xb1/0x2e0
vfs_write+0xf6/0x250
SyS_write+0x99/0x110
do_syscall_64+0xee/0x2b0
entry_SYSCALL_64_after_hwframe+0x42/0xb7

The buggy address belongs to the page:
page:ffffea0001998180 count:0 mapcount:0 mapping:0000000000000000 index:0x0
flags: 0x4000000000000000()
raw: 4000000000000000 0000000000000000 0000000000000000 00000000ffffffff
raw: dead000000000100 dead000000000200 0000000000000000 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
ffff880066606d00: 00 00 00 00 00 00 00 00 00 00 00 00 00 f1 f1 f1
ffff880066606d80: f1 00 f2 f2 f2 f2 f2 f2 f2 00 00 f2 f2 f2 f2 f2
>ffff880066606e00: f2 00 00 00 00 00 f2 f2 f2 f3 f3 f3 f3 00 00 00
                                    ^
ffff880066606e80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
ffff880066606f00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
==================================================================

Fixes: 8700e3e7c4 ("Soft RoCE driver")
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Moni Shoua <monis@mellanox.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-03-06 16:00:51 -07:00
Bart Van Assche a1ae7d0345 RDMA/core: Avoid that ib_drain_qp() triggers an out-of-bounds stack access
This patch fixes the following KASAN complaint:

==================================================================
BUG: KASAN: stack-out-of-bounds in rxe_post_send+0x77d/0x9b0 [rdma_rxe]
Read of size 8 at addr ffff880061aef860 by task 01/1080

CPU: 2 PID: 1080 Comm: 01 Not tainted 4.16.0-rc3-dbg+ #2
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
Call Trace:
dump_stack+0x85/0xc7
print_address_description+0x65/0x270
kasan_report+0x231/0x350
rxe_post_send+0x77d/0x9b0 [rdma_rxe]
__ib_drain_sq+0x1ad/0x250 [ib_core]
ib_drain_qp+0x9/0x30 [ib_core]
srp_destroy_qp+0x51/0x70 [ib_srp]
srp_free_ch_ib+0xfc/0x380 [ib_srp]
srp_create_target+0x1071/0x19e0 [ib_srp]
kernfs_fop_write+0x180/0x210
__vfs_write+0xb1/0x2e0
vfs_write+0xf6/0x250
SyS_write+0x99/0x110
do_syscall_64+0xee/0x2b0
entry_SYSCALL_64_after_hwframe+0x42/0xb7

The buggy address belongs to the page:
page:ffffea000186bbc0 count:0 mapcount:0 mapping:0000000000000000 index:0x0
flags: 0x4000000000000000()
raw: 4000000000000000 0000000000000000 0000000000000000 00000000ffffffff
raw: 0000000000000000 ffffea000186bbe0 0000000000000000 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
ffff880061aef700: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
ffff880061aef780: 00 00 00 00 00 00 00 00 00 00 00 f1 f1 f1 f1 00
>ffff880061aef800: f2 f2 f2 f2 f2 f2 f2 00 00 00 00 00 f2 f2 f2 f2
                                                      ^
ffff880061aef880: f2 f2 f2 00 00 00 00 00 00 00 00 00 00 00 f2 f2
ffff880061aef900: f2 f2 f2 00 00 00 00 00 00 00 00 00 00 00 00 00
==================================================================

Fixes: 765d67748b ("IB: new common API for draining queues")
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Steve Wise <swise@opengridcomputing.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: stable@vger.kernel.org
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-03-06 16:00:51 -07:00
Colin Ian King 042932f7a3 infiniband: remove redundant assignment to pointer 'rdi'
The pointer rdi is being initialized with a value that is never read
and re-assigned immediately after, hence the initialization is redundant
and can be removed.

Cleans up clang warning:
drivers/infiniband/sw/rdmavt/vt.c:94:23: warning: Value stored to 'rdi'
during its initialization is never read

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-03-06 16:00:51 -07:00
David Ahern b75cc8f90f net/ipv6: Pass skb to route lookup
IPv6 does path selection for multipath routes deep in the lookup
functions. The next patch adds L4 hash option and needs the skb
for the forward path. To get the skb to the relevant FIB lookup
functions it needs to go through the fib rules layer, so add a
lookup_data argument to the fib_lookup_arg struct.

Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-04 13:04:22 -05:00
Hernán Gonzalez c33bab622d IB/rxe: Remove unused variable (char *rxe_qp_state_name[])
Note: This is compile only tested as I have no access to the hw.  This
variable was not used anywhere in the code. Removing it saves 24 bytes.

add/remove: 0/1 grow/shrink: 0/0 up/down: 0/-24 (-24)
Function                                     old     new   delta
rxe_qp_state_name                             24       -     -24
Total: Before=3348732, After=3348708, chg -0.00%

Signed-off-by: Hernán Gonzalez <hernan@vanguardiasur.com.ar>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-02-28 13:57:40 -07:00
Hernán Gonzalez 7f566a91b1 IB/qib: Move char *qib_sdma_state_names[] and constify while there.
Note: This is compile only tested as I have no access to the hw.
This variable was not used in qib_sdma.c but in qib_iba7322.c. Declaring it
there, as static, saves 56 bytes.

add/remove: 0/2 grow/shrink: 0/0 up/down: 0/-144 (-144)
Function                                     old     new   delta
qib_sdma_state_names                          56       -     -56
qib_sdma_event_names                          88       -     -88
Total: Before=2874565, After=2874421, chg -0.01%

Signed-off-by: Hernán Gonzalez <hernan@vanguardiasur.com.ar>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-02-28 13:57:40 -07:00
Hernán Gonzalez 4f1d583432 IB/qib: Remove unused variable (char *qib_sdma_event_names[])
Note: This is compile only tested as I have no access to the hw.

This variable was not used anywhere in the code. Removing it saves 88
bytes.

add/remove: 0/1 grow/shrink: 0/0 up/down: 0/-88 (-88)
Function                                     old     new   delta
qib_sdma_event_names                          88       -     -88
Total: Before=2874565, After=2874477, chg -0.00%

Signed-off-by: Hernán Gonzalez <hernan@vanguardiasur.com.ar>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-02-28 13:57:39 -07:00
Bart Van Assche 7da09af91d IB/srp: Use %pIS instead of inet_ntop()
Except for a minor log message change, this patch does not change
any functionality. For the introduction of %pIS, see also commit
1067964305 ("lib: vsprintf: add IPv4/v6 generic %p[Ii]S[pfs]
format specifier").

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-02-28 13:57:39 -07:00
Bart Van Assche c74ff7501e Revert "IB/srp: Avoid that a cable pull can trigger a kernel crash"
The caller of srp_ib_lookup_path() is responsible for holding a reference
on the SCSI host. That means that commit 8a0d18c621 was not necessary.
Hence revert it.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-02-28 13:57:39 -07:00
Bart Van Assche e68088e78d IB/srp: Fix srp_abort()
Before commit e494f6a728 ("[SCSI] improved eh timeout handler") it
did not really matter whether or not abort handlers like srp_abort()
called .scsi_done() when returning another value than SUCCESS. Since
that commit however this matters. Hence only call .scsi_done() when
returning SUCCESS.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-02-28 13:57:39 -07:00
Arnd Bergmann a8ed748708 infiniband: bnxt_re: use BIT_ULL() for 64-bit bit masks
On 32-bit targets, we otherwise get a warning about an impossible constant
integer expression:

In file included from include/linux/kernel.h:11,
                 from include/linux/interrupt.h:6,
                 from drivers/infiniband/hw/bnxt_re/ib_verbs.c:39:
drivers/infiniband/hw/bnxt_re/ib_verbs.c: In function 'bnxt_re_query_device':
include/linux/bitops.h:7:24: error: left shift count >= width of type [-Werror=shift-count-overflow]
 #define BIT(nr)   (1UL << (nr))
                        ^~
drivers/infiniband/hw/bnxt_re/bnxt_re.h:61:34: note: in expansion of macro 'BIT'
 #define BNXT_RE_MAX_MR_SIZE_HIGH BIT(39)
                                  ^~~
drivers/infiniband/hw/bnxt_re/bnxt_re.h:62:30: note: in expansion of macro 'BNXT_RE_MAX_MR_SIZE_HIGH'
 #define BNXT_RE_MAX_MR_SIZE  BNXT_RE_MAX_MR_SIZE_HIGH
                              ^~~~~~~~~~~~~~~~~~~~~~~~
drivers/infiniband/hw/bnxt_re/ib_verbs.c:149:25: note: in expansion of macro 'BNXT_RE_MAX_MR_SIZE'
  ib_attr->max_mr_size = BNXT_RE_MAX_MR_SIZE;
                         ^~~~~~~~~~~~~~~~~~~

Fixes: 872f357824 ("RDMA/bnxt_re: Add support for MRs with Huge pages")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-02-28 13:57:39 -07:00
Arnd Bergmann e5d6574ded infiniband: qplib_fp: fix pointer cast
Building for a 32-bit target results in a couple of warnings from casting
between a 32-bit pointer and a 64-bit integer:

drivers/infiniband/hw/bnxt_re/qplib_fp.c: In function 'bnxt_qplib_service_nq':
drivers/infiniband/hw/bnxt_re/qplib_fp.c:333:23: error: cast to pointer from integer of different size [-Werror=int-to-pointer-cast]
    bnxt_qplib_arm_srq((struct bnxt_qplib_srq *)q_handle,
                       ^
drivers/infiniband/hw/bnxt_re/qplib_fp.c:336:12: error: cast to pointer from integer of different size [-Werror=int-to-pointer-cast]
            (struct bnxt_qplib_srq *)q_handle,
            ^
In file included from include/linux/byteorder/little_endian.h:5,
                 from arch/arm/include/uapi/asm/byteorder.h:22,
                 from include/asm-generic/bitops/le.h:6,
                 from arch/arm/include/asm/bitops.h:342,
                 from include/linux/bitops.h:38,
                 from include/linux/kernel.h:11,
                 from include/linux/interrupt.h:6,
                 from drivers/infiniband/hw/bnxt_re/qplib_fp.c:39:
drivers/infiniband/hw/bnxt_re/qplib_fp.c: In function 'bnxt_qplib_create_srq':
include/uapi/linux/byteorder/little_endian.h:31:43: error: cast from pointer to integer of different size [-Werror=pointer-to-int-cast]
 #define __cpu_to_le64(x) ((__force __le64)(__u64)(x))
                                           ^
include/linux/byteorder/generic.h:86:21: note: in expansion of macro '__cpu_to_le64'
 #define cpu_to_le64 __cpu_to_le64
                     ^~~~~~~~~~~~~
drivers/infiniband/hw/bnxt_re/qplib_fp.c:569:19: note: in expansion of macro 'cpu_to_le64'
  req.srq_handle = cpu_to_le64(srq);

Using a uintptr_t as an intermediate works on all architectures.

Fixes: 37cb11acf1 ("RDMA/bnxt_re: Add SRQ support for Broadcom adapters")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-02-28 13:57:39 -07:00
Markus Elfring c7ec83772a RDMA/iwpm: Delete an error message for a failed memory allocation in iwpm_create_nlmsg()
Omit an extra message for a memory allocation failure in this function.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-02-28 13:57:39 -07:00
Markus Elfring b7c5bc7368 IB/usnic: Delete an error message for a failed memory allocation in usnic_transport_init()
Omit an extra message for a memory allocation failure in this function.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-02-28 13:57:39 -07:00
Leon Romanovsky 55de9a77da RDMA/mlx5: Refactor QP type check to be as early as possible
Perform QP type check in one place and fail as early as possible.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-02-28 13:16:36 -07:00
Jason Gunthorpe 8efe991e8b IB/uverbs: Tidy uverbs_uobject_add
Maintaining the uobjects list is mandatory, hoist it into the common
rdma_alloc_commit_uobject() function and inline it as there is now
only one caller.

Signed-off-by: Leon Romanovsky <leon@kernel.org>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-02-28 12:55:03 -07:00
Muneendra Kumar M 4cd482c12b IB/core : Add null pointer check in addr_resolve
dev_get_by_index is being called in addr_resolve
function which returns NULL and NULL pointer access
leads to kernel crash.

Following call trace is observed while running
rdma_lat test application

[  146.173149] BUG: unable to handle kernel NULL pointer dereference
at 00000000000004a0
[  146.173198] IP: addr_resolve+0x9e/0x3e0 [ib_core]
[  146.173221] PGD 0 P4D 0
[  146.173869] Oops: 0000 [#1] SMP PTI
[  146.182859] CPU: 8 PID: 127 Comm: kworker/8:1 Tainted: G  O 4.15.0-rc6+ #18
[  146.183758] Hardware name: LENOVO System x3650 M5: -[8871AC1]-/01KN179,
 BIOS-[TCE132H-2.50]- 10/11/2017
[  146.184691] Workqueue: ib_cm cm_work_handler [ib_cm]
[  146.185632] RIP: 0010:addr_resolve+0x9e/0x3e0 [ib_core]
[  146.186584] RSP: 0018:ffffc9000362faa0 EFLAGS: 00010246
[  146.187521] RAX: 000000000000001b RBX: ffffc9000362fc08 RCX:
0000000000000006
[  146.188472] RDX: 0000000000000000 RSI: 0000000000000096 RDI
: ffff88087fc16990
[  146.189427] RBP: ffffc9000362fb18 R08: 00000000ffffff9d R09:
00000000000004ac
[  146.190392] R10: 00000000000001e7 R11: 0000000000000001 R12:
ffff88086af2e090
[  146.191361] R13: 0000000000000000 R14: 0000000000000001 R15:
00000000ffffff9d
[  146.192327] FS:  0000000000000000(0000) GS:ffff88087fc00000(0000)
knlGS:0000000000000000
[  146.193301] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  146.194274] CR2: 00000000000004a0 CR3: 000000000220a002 CR4:
00000000003606e0
[  146.195258] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[  146.196256] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[  146.197231] Call Trace:
[  146.198209]  ? rdma_addr_register_client+0x30/0x30 [ib_core]
[  146.199199]  rdma_resolve_ip+0x1af/0x280 [ib_core]
[  146.200196]  rdma_addr_find_l2_eth_by_grh+0x154/0x2b0 [ib_core]

The below patch adds the missing NULL pointer check
returned by dev_get_by_index before accessing the netdev to
avoid kernel crash.

We observed the below crash when we try to do the below test.

 server                       client
 ---------                    ---------
 |1.1.1.1|<----rxe-channel--->|1.1.1.2|
 ---------                    ---------

On server: rdma_lat -c -n 2 -s 1024
On client:rdma_lat 1.1.1.1 -c -n 2 -s 1024

Fixes: 200298326b ("IB/core: Validate route when we init ah")
Signed-off-by: Muneendra <muneendra.kumar@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-02-28 12:10:33 -07:00
Selvin Xavier 497158aa5f RDMA/bnxt_re: Fix the ib_reg failure cleanup
Release the netdev references in the cleanup path.  Invokes the cleanup
routines if bnxt_re_ib_reg fails.

Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-02-28 12:10:33 -07:00
Devesh Sharma c354dff00d RDMA/bnxt_re: Fix incorrect DB offset calculation
To support host systems with non 4K page size, l2_db_size shall be
calculated with 4096 instead of PAGE_SIZE. Also, supply the host page size
to FW during initialization.

Signed-off-by: Devesh Sharma <devesh.sharma@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-02-28 12:10:32 -07:00
Devesh Sharma a45bc17b36 RDMA/bnxt_re: Unconditionly fence non wire memory operations
HW requires an unconditonal fence for all non-wire memory operations
through SQ. This guarantees the completions of these memory operations.

Signed-off-by: Devesh Sharma <devesh.sharma@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-02-28 12:10:32 -07:00
Parav Pandit 2fb4f4eadd IB/core: Fix missing RDMA cgroups release in case of failure to register device
During IB device registration process, if query_device() fails or if
ib_core fails to registers sysfs entries, rdma cgroup cleanup is
skipped.

Cc: <stable@vger.kernel.org> # v4.2+
Fixes: 4be3a4fa51 ("IB/core: Fix kernel crash during fail to initialize device")
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-02-28 12:10:32 -07:00
Moni Shoua 65389322b2 IB/mlx: Set slid to zero in Ethernet completion struct
IB spec says that a lid should be ignored when link layer is Ethernet,
for example when building or parsing a CM request message (CA17-34).
However, since ib_lid_be16() and ib_lid_cpu16()  validates the slid,
not only when link layer is IB, we set the slid to zero to prevent
false warnings in the kernel log.

Fixes: 62ede77799 ("Add OPA extended LID support")
Reviewed-by: Majd Dibbiny <majd@mellanox.com>
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-02-28 12:10:32 -07:00
Daniel Jurgens aba4621346 {net, IB}/mlx5: Raise fatal IB event when sys error occurs
All other mlx5_events report the port number as 1 based, which is how FW
reports it in the port event EQE. Reporting 0 for this event causes
mlx5_ib to not raise a fatal event notification to registered clients
due to a seemingly invalid port.

All switch cases in mlx5_ib_event that go through the port check are
supposed to set the port now, so just do it once at variable
declaration.

Fixes: 89d44f0a6c73("net/mlx5_core: Add pci error handlers to mlx5_core driver")
Reviewed-by: Majd Dibbiny <majd@mellanox.com>
Signed-off-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-02-28 12:10:32 -07:00
Noa Osherovich e7b169f344 IB/mlx5: Avoid passing an invalid QP type to firmware
During QP creation, the mlx5 driver translates the QP type to an
internal value which is passed on to FW. There was no check to make
sure that the translated value is valid, and -EINVAL was coerced into
the mailbox command.

Current firmware refuses this as an invalid QP type, but future/past
firmware may do something else.

Fixes: 09a7d9eca1 ('{net,IB}/mlx5: QP/XRCD commands via mlx5 ifc')
Reviewed-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: Noa Osherovich <noaos@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-02-28 12:10:32 -07:00
Sergey Gorenko da343b6d90 IB/mlx5: Fix incorrect size of klms in the memory region
The value of mr->ndescs greater than mr->max_descs is set in the
function mlx5_ib_sg_to_klms() if sg_nents is greater than
mr->max_descs. This is an invalid value and it causes the
following error when registering mr:

mlx5_0:dump_cqe:276:(pid 193): dump error cqe
00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000030: 00 00 00 00 0f 00 78 06 25 00 00 8b 08 1e 8f d3

Cc: <stable@vger.kernel.org> # 4.5
Fixes: b005d31647 ("mlx5: Add arbitrary sg list support")
Signed-off-by: Sergey Gorenko <sergeygo@mellanox.com>
Tested-by: Laurence Oberman <loberman@redhat.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-02-28 12:10:32 -07:00
Doug Ledford 1d1ab1ae69 mlx5-update-2018-02-23 (IB representors)
From: Mark Bloch <markb@mellanox.com>
 =========
 Add IB representor when in switchdev mode
 
 The following series adds support for an IB (RAW Ethernet only) device
 representor which is created when the user switches to switchdev mode.
 
 Today when switching to switchdev mode the only representors which are
 created are net devices. Each netdev is a representor of a virtual
 function and any data sent via the representor is received on the virtual
 function, and any data sent via the virtual function is received by the
 representor.
 
 For the mlx5 driver the main use of this functionality is to be able to
 use Open vSwitch on the hypervisor in order to manage/control traffic
 from/to the virtual functions. Open vSwitch can also work with  DPDK
 devices and not just net devices, this series exposes an IB device, which
 Mellanox PMD driver uses, which then can be used by Open vSwitch DPDK.
 
 An IB device representor exposes only RAW Ethernet QP capabilities and
 the ability to create flow rules to direct traffic to its RX queues. The
 state of the IB device (ACTIVE/DOWN etc..) is based on the state of the
 corresponding net device representor. No other RDMA/RoCE functionality is
 currently supported and no GID table is exposed.
 =========
 -----BEGIN PGP SIGNATURE-----
 
 iQEcBAABAgAGBQJakH7zAAoJEEg/ir3gV/o+c/MIAMGGgNajr49+JP3t9wnrs011
 +cTfAfM88HBzTlfb/COEBz+jurH2oB7ZF4RZC29S+6pR3loKKBuvbiPndE0XKjSg
 Ue4sOkawybmDvfo9ZiMsusOiMfTp5wsLmqJP1HRUvGMAlSBeriMTZfbiKzx5c3Ok
 X8cMnRIvUOtCoQaJTfKarDUn4OF8aFam4tQW8k/RAo77kTPyihb1NlGiblrcCA2E
 PWYAOWW3D8gvE0cr19JVgEqpKIaJ/VRyjwQ7m8XSvfBJtw1ZTO6YMXiXbWMOsRzD
 fx33H+n/qwJT0cnxDmSpZrR7mEk+Wr2HL92O85KDupOSgLOIlywmtIIkEAnCeaw=
 =Fq6m
 -----END PGP SIGNATURE-----

Merge tag 'mlx5-updates-2018-02-23' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux into k.o/wip/dl-for-next

mlx5-update-2018-02-23 (IB representors)

From: Mark Bloch <markb@mellanox.com>
=========
Add IB representor when in switchdev mode

The following series adds support for an IB (RAW Ethernet only) device
representor which is created when the user switches to switchdev mode.

Today when switching to switchdev mode the only representors which are
created are net devices. Each netdev is a representor of a virtual
function and any data sent via the representor is received on the virtual
function, and any data sent via the virtual function is received by the
representor.

For the mlx5 driver the main use of this functionality is to be able to
use Open vSwitch on the hypervisor in order to manage/control traffic
from/to the virtual functions. Open vSwitch can also work with  DPDK
devices and not just net devices, this series exposes an IB device, which
Mellanox PMD driver uses, which then can be used by Open vSwitch DPDK.

An IB device representor exposes only RAW Ethernet QP capabilities and
the ability to create flow rules to direct traffic to its RX queues. The
state of the IB device (ACTIVE/DOWN etc..) is based on the state of the
corresponding net device representor. No other RDMA/RoCE functionality is
currently supported and no GID table is exposed.
=========

Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-02-28 13:37:39 -05:00
David S. Miller fb66cb0775 mlx5-update-2018-02-23 (IB representors)
From: Mark Bloch <markb@mellanox.com>
 =========
 Add IB representor when in switchdev mode
 
 The following series adds support for an IB (RAW Ethernet only) device
 representor which is created when the user switches to switchdev mode.
 
 Today when switching to switchdev mode the only representors which are
 created are net devices. Each netdev is a representor of a virtual
 function and any data sent via the representor is received on the virtual
 function, and any data sent via the virtual function is received by the
 representor.
 
 For the mlx5 driver the main use of this functionality is to be able to
 use Open vSwitch on the hypervisor in order to manage/control traffic
 from/to the virtual functions. Open vSwitch can also work with  DPDK
 devices and not just net devices, this series exposes an IB device, which
 Mellanox PMD driver uses, which then can be used by Open vSwitch DPDK.
 
 An IB device representor exposes only RAW Ethernet QP capabilities and
 the ability to create flow rules to direct traffic to its RX queues. The
 state of the IB device (ACTIVE/DOWN etc..) is based on the state of the
 corresponding net device representor. No other RDMA/RoCE functionality is
 currently supported and no GID table is exposed.
 =========
 -----BEGIN PGP SIGNATURE-----
 
 iQEcBAABAgAGBQJakH7zAAoJEEg/ir3gV/o+c/MIAMGGgNajr49+JP3t9wnrs011
 +cTfAfM88HBzTlfb/COEBz+jurH2oB7ZF4RZC29S+6pR3loKKBuvbiPndE0XKjSg
 Ue4sOkawybmDvfo9ZiMsusOiMfTp5wsLmqJP1HRUvGMAlSBeriMTZfbiKzx5c3Ok
 X8cMnRIvUOtCoQaJTfKarDUn4OF8aFam4tQW8k/RAo77kTPyihb1NlGiblrcCA2E
 PWYAOWW3D8gvE0cr19JVgEqpKIaJ/VRyjwQ7m8XSvfBJtw1ZTO6YMXiXbWMOsRzD
 fx33H+n/qwJT0cnxDmSpZrR7mEk+Wr2HL92O85KDupOSgLOIlywmtIIkEAnCeaw=
 =Fq6m
 -----END PGP SIGNATURE-----

Merge tag 'mlx5-updates-2018-02-23' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux

Saeed Mahameed says:

mlx5-update-2018-02-23 (IB representors)

From: Mark Bloch <markb@mellanox.com>
=========
Add IB representor when in switchdev mode

The following series adds support for an IB (RAW Ethernet only) device
representor which is created when the user switches to switchdev mode.

Today when switching to switchdev mode the only representors which are
created are net devices. Each netdev is a representor of a virtual
function and any data sent via the representor is received on the virtual
function, and any data sent via the virtual function is received by the
representor.

For the mlx5 driver the main use of this functionality is to be able to
use Open vSwitch on the hypervisor in order to manage/control traffic
from/to the virtual functions. Open vSwitch can also work with  DPDK
devices and not just net devices, this series exposes an IB device, which
Mellanox PMD driver uses, which then can be used by Open vSwitch DPDK.

An IB device representor exposes only RAW Ethernet QP capabilities and
the ability to create flow rules to direct traffic to its RX queues. The
state of the IB device (ACTIVE/DOWN etc..) is based on the state of the
corresponding net device representor. No other RDMA/RoCE functionality is
currently supported and no GID table is exposed.
=========

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-28 09:54:54 -05:00
Kirill Tkhai 25354866e0 net: Convert cma_pernet_operations
These pernet_operations just create and destroy IDR.
So, we mark them as async.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-27 11:01:36 -05:00
David S. Miller f74290fdb3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-02-24 00:04:20 -05:00
Mark Bloch ec9c2fb8ce IB/mlx5: Disable self loopback check when in switchdev mode
When in switchdev mode, there is no need to do self loopback checks
as we can't receive those packets, we insert steering rules to the
eswitch that make sure packets can't be looped back.

Signed-off-by: Mark Bloch <markb@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-02-23 12:36:39 -08:00
Mark Bloch b5ca15ad7e IB/mlx5: Add proper representors support
This commit adds full support for IB representor:

1) Representors profile, We add two new profiles:
   nic_rep_profile - This profile will be used to create an IB device that
   represents the PF/UPLINK.
   rep_profile - This profile will be used to create an IB device that
   represents VFs. Each VF will be its own representor.
2) Proper load/unload callbacks, Those are called by the E-Switch when
   moving to/from switchdev mode.
3) Different flow DB handling for when we in switchdev mode.

Signed-off-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-02-23 12:36:39 -08:00
Mark Bloch b96c9dde17 IB/mlx5: E-Switch, Add rule to forward traffic to vport
In order to forward traffic from representor's SQ to the right virtual
function, every time an SQ is created also add the corresponding flow rule
to the FDB.

Signed-off-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-02-23 12:36:39 -08:00
Mark Bloch 72afcf8247 IB/mlx5: Don't expose MR cache in switchdev mode
When enabling many VFs and switching to switchdev mode, the total amount
of mkeys we try to allocate when loading representors is very large and
may cause timeouts on allocations, the same issues was observed on VFs
and we employ the same fix that was done for them. We avoid allocating
the full MR cache on load but still allow it to be manipulated once the
IB device is loaded.

Signed-off-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-02-23 12:36:39 -08:00
Mark Bloch 8e6efa3a31 IB/mlx5: When in switchdev mode, expose only raw packet capabilities
Currently in switchdev mode we allow only for raw packet QPs.
Expose the right capabilities and set the gid table length to 0, also
make sure we don't try to enable RoCE, so split the function
to enable RoCE so representors can enable only the notifier needed for
net device events.

Signed-off-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-02-23 12:36:39 -08:00
Mark Bloch bcf87f1dbb IB/mlx5: Listen to netdev register/unresiter events in switchdev mode
Currently we listen to netdev register/unregister event based on PCI
device. When in switchdev mode PF and representors share the same PCI
device, so in order to pair ib device and netdev in switchdev mode
compare the netdev that triggered the event to that of the representor.

Expose a function that lets you receive the netdev associated what
a given representor.

Signed-off-by: Mark Bloch <markb@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-02-23 12:36:39 -08:00
Mark Bloch 018a94eec0 IB/mlx5: Add match on vport when in switchdev mode
When we point to a representor, it means we are in switchdev mode.
The flow db is shared between PF and virtual function representors
so each rule created needs to have a match on its specific source port.

Signed-off-by: Mark Bloch <markb@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-02-23 12:36:39 -08:00
Mark Bloch 9a4ca38d77 IB/mlx5: Allocate flow DB only on PF IB device
A flow DB is a shared resource between PF and representors,
need to allocate it only when creating the PF IB device.
Once we add IB representors, they will use the flow db which was
created by the PF.

Signed-off-by: Mark Bloch <markb@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-02-23 12:36:39 -08:00
Mark Bloch fc385b7ac4 IB/mlx5: Add basic regiser/unregister representors code
Create the basic infrastructure of registering and unregistering
IB representors. The load/unload callbacks are left empty and
proper implementation will be introduced in following patches.

Signed-off-by: Mark Bloch <markb@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-02-23 12:36:39 -08:00
Leon Romanovsky 87915bf82e RDMA/verbs: Return proper error code for not supported system call
The proper return error is -EOPNOTSUPP and not -ENOSYS, so update
all places in verbs.c to match this semantics.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-02-22 22:31:18 -05:00
Leon Romanovsky 372e15c5db RDMA/uverbs: Reduce number of command header flags checks
Simplify the code by directly checking the availability of extended
command flog instead of doing multiple shift operations.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-02-22 22:31:18 -05:00
Leon Romanovsky cd35cf4b40 RDMA/uverbs: Replace user's types with kernel's types
The internal to kernel variable declarations don't need to be
declared with user types. This patch converts such occurrences
appeared in ib_uverbs_write().

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-02-22 22:31:18 -05:00
Leon Romanovsky 6284380a97 RDMA/uverbs: Refactor the header validation logic
Move all header validation logic to be performed before SRCU read lock.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-02-22 22:29:50 -05:00
Leon Romanovsky e21719fbbd RDMa/uverbs: Copy ex_hdr outside of SRCU read lock
The SRCU read lock protects the IB device pointer
and doesn't need to be called before copying user
provided header.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-02-22 22:29:50 -05:00
Leon Romanovsky 491d5c6a30 RDMA/uverbs: Move uncontext check before SRCU read lock
There is no need to take SRCU lock before checking
file->ucontext, so move it do it before it.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-02-22 22:29:50 -05:00
Leon Romanovsky eb455e329b RDMA/uverbs: Properly check command supported mask
The check based on index is not sufficient because

  IB_USER_VERBS_EX_CMD_CREATE_CQ = IB_USER_VERBS_CMD_CREATE_CQ

and IB_USER_VERBS_CMD_CREATE_CQ <= IB_USER_VERBS_CMD_OPEN_QP,
so if we execute IB_USER_VERBS_EX_CMD_CREATE_CQ this code checks
ib_dev->uverbs_cmd_mask not ib_dev->uverbs_ex_cmd_mask.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-02-22 22:29:50 -05:00
Leon Romanovsky 77833b8a48 RDMA/uverbs: Refactor command header processing
Move all command header processing into separate function
and perform those checks before acquiring SRCU read lock.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-02-22 22:29:37 -05:00
Leon Romanovsky f2630ce2fb RDMA/uverbs: Unify return values of not supported command
The non-existing command is supposed to return -EOPNOTSUPP, but the
current code returns different errors for different flows for the
same failure. This patch unifies those flows.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-02-22 22:29:36 -05:00
Leon Romanovsky a9ed5b38aa RDMA/uverbs: Return not supported error code for unsupported commands
Command that doesn't exist means that it is not supported,
so update code to return -EOPNOTSUPP in case of failure.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-02-22 22:29:11 -05:00
Leon Romanovsky 43ae95130d RDMA/uverbs: Fail as early as possible if not enough header data was provided
Fail as early as possible if not enough header data
was provided.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-02-22 22:29:11 -05:00
Leon Romanovsky a6c4a66ae9 RDMA/uverbs: Refactor flags checks and update return value
Since commit f21519b23c ("IB/core: extended command: an
improved infrastructure for uverbs commands"), the uverbs
supports extra flags as an input to the command interface.

However actually, there is only one flag available and used,
so it is better to refactor the code, so the resolution and
report to the users is done as early as possible.

As part of this change, we changed the return value of failure case
from ENOSYS to be EINVAL to be consistent with the rest flags checks.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-02-22 22:29:11 -05:00
Leon Romanovsky 08f0e16163 RDMA/uverbs: Update sizeof users
Update sizeof() users to be consistent with coding style.

Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-02-22 22:29:11 -05:00
Leon Romanovsky b5bc598186 RDMA/uverbs: Convert command mask validity check function to be bool
The function validate_command_mask() returns only two results: success
or failure, so convert it to return bool instead of 0 and -1.

Reported-by: Noa Osherovich <noaos@mellanox.com>
Reviewed-by: Matan Barak <matanb@mellanox.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-02-22 22:29:10 -05:00
Doug Ledford 4bb46608ed Merge branch 'k.o/for-rc' into k.o/wip/dl-for-next
There is a 14 patch series waiting to come into for-next that has a
dependecy on code submitted into this kernel's for-rc series.  So, merge
the for-rc branch into the current for-next in order to make the patch
series apply cleanly.

Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-02-22 22:27:20 -05:00
Doug Ledford f76a5c75d9 mlx5-updates-2018-02-21
This series includes shared code updates for mlx5 core driver for both
 netdev and rdma subsystems.
 
 By Saeed,
 First six patches of the series are meant to address a performance issue
 and should provide a performance boost for multi core IRQ interrupt hungry
 workloads.  The issue is fixed in the first patch, all other patches are
 meant to refactor the code in light of this fix.
 
 The problem it comes to fix, is a shared spinlock accessed across all HCA
 IRQs which protects the CQ database.  To solve this we simply move the CQ
 database and its spinlock to be per EQ (IRQ), thus per core.
 
 By Yonatan,
 Fragmented completion queue (CQ) for RDMA,
 core driver implementation to create fragmented CQ buffers rather than
 one large contiguous memory buffer, the implementation scheme already
 exist and used by the netdev CQs, the patch shares that code with the
 rdma CQ creation flow and makes use of the new API in mlx5_ib driver.
 
 Thanks,
 Saeed.
 -----BEGIN PGP SIGNATURE-----
 
 iQEcBAABAgAGBQJajcwxAAoJEEg/ir3gV/o+uJAIAICP2LBu5ANBgX7Z7/B0IZX4
 Abkfk+IKR44v+3t0BBgEAZI2A9LcTQ8pwIOeLCDpcKd78665Y7rYQL1p31WuPKc5
 1WwwBesjQLAmdiNLIjUnjW1fd3YyOmqLBcaMMYIeF0m4m4pCQue1kwtxesQrhdZV
 I7FBxxnQt/8o6hoh/5nmU4DzGyZaY+uR/4FxUCEreklfhNeI5M2DZKe1xQCmhsJ4
 Tosh/cJ8kSBJfclH6hk1lh5eWGDYDltWCpY86KBWsYktl10VAE3P7LPmekn487ta
 e6cL1NPoEHbp+/UBThkNZMzUnA7wpoNeDbTTtVZrdcL1KivwColt1r3pmS1bqEQ=
 =DlfW
 -----END PGP SIGNATURE-----

Merge tag 'mlx5-updates-2018-02-21' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux into k.o/wip/dl-for-next

mlx5-updates-2018-02-21

This series includes shared code updates for mlx5 core driver for both
netdev and rdma subsystems.

By Saeed,
First six patches of the series are meant to address a performance issue
and should provide a performance boost for multi core IRQ interrupt hungry
workloads.  The issue is fixed in the first patch, all other patches are
meant to refactor the code in light of this fix.

The problem it comes to fix, is a shared spinlock accessed across all HCA
IRQs which protects the CQ database.  To solve this we simply move the CQ
database and its spinlock to be per EQ (IRQ), thus per core.

By Yonatan,
Fragmented completion queue (CQ) for RDMA,
core driver implementation to create fragmented CQ buffers rather than
one large contiguous memory buffer, the implementation scheme already
exist and used by the netdev CQs, the patch shares that code with the
rdma CQ creation flow and makes use of the new API in mlx5_ib driver.

Thanks,
Saeed.

Merged into rdma-next tree as well as net-next tree to prevent conflicts
in future patches between the two trees.

Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-02-22 20:52:28 -05:00
David S. Miller f4af1db48b mlx5-updates-2018-02-21
This series includes shared code updates for mlx5 core driver for both
 netdev and rdma subsystems.
 
 By Saeed,
 First six patches of the series are meant to address a performance issue
 and should provide a performance boost for multi core IRQ interrupt hungry
 workloads.  The issue is fixed in the first patch, all other patches are
 meant to refactor the code in light of this fix.
 
 The problem it comes to fix, is a shared spinlock accessed across all HCA
 IRQs which protects the CQ database.  To solve this we simply move the CQ
 database and its spinlock to be per EQ (IRQ), thus per core.
 
 By Yonatan,
 Fragmented completion queue (CQ) for RDMA,
 core driver implementation to create fragmented CQ buffers rather than
 one large contiguous memory buffer, the implementation scheme already
 exist and used by the netdev CQs, the patch shares that code with the
 rdma CQ creation flow and makes use of the new API in mlx5_ib driver.
 
 Thanks,
 Saeed.
 -----BEGIN PGP SIGNATURE-----
 
 iQEcBAABAgAGBQJajcwxAAoJEEg/ir3gV/o+uJAIAICP2LBu5ANBgX7Z7/B0IZX4
 Abkfk+IKR44v+3t0BBgEAZI2A9LcTQ8pwIOeLCDpcKd78665Y7rYQL1p31WuPKc5
 1WwwBesjQLAmdiNLIjUnjW1fd3YyOmqLBcaMMYIeF0m4m4pCQue1kwtxesQrhdZV
 I7FBxxnQt/8o6hoh/5nmU4DzGyZaY+uR/4FxUCEreklfhNeI5M2DZKe1xQCmhsJ4
 Tosh/cJ8kSBJfclH6hk1lh5eWGDYDltWCpY86KBWsYktl10VAE3P7LPmekn487ta
 e6cL1NPoEHbp+/UBThkNZMzUnA7wpoNeDbTTtVZrdcL1KivwColt1r3pmS1bqEQ=
 =DlfW
 -----END PGP SIGNATURE-----

Merge tag 'mlx5-updates-2018-02-21' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux

Saeed Mahameed says:

====================
mlx5-updates-2018-02-21

This series includes shared code updates for mlx5 core driver for both
netdev and rdma subsystems.

By Saeed,
First six patches of the series are meant to address a performance issue
and should provide a performance boost for multi core IRQ interrupt hungry
workloads.  The issue is fixed in the first patch, all other patches are
meant to refactor the code in light of this fix.

The problem it comes to fix, is a shared spinlock accessed across all HCA
IRQs which protects the CQ database.  To solve this we simply move the CQ
database and its spinlock to be per EQ (IRQ), thus per core.

By Yonatan,
Fragmented completion queue (CQ) for RDMA,
core driver implementation to create fragmented CQ buffers rather than
one large contiguous memory buffer, the implementation scheme already
exist and used by the netdev CQs, the patch shares that code with the
rdma CQ creation flow and makes use of the new API in mlx5_ib driver.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-22 14:38:38 -05:00
Leon Romanovsky f45765872e RDMA/uverbs: Fix kernel panic while using XRC_TGT QP type
Attempt to modify XRC_TGT QP type from the user space (ibv_xsrq_pingpong
invocation) will trigger the following kernel panic. It is caused by the
fact that such QPs missed uobject initialization.

[   17.408845] BUG: unable to handle kernel NULL pointer dereference at 0000000000000048
[   17.412645] IP: rdma_lookup_put_uobject+0x9/0x50
[   17.416567] PGD 0 P4D 0
[   17.419262] Oops: 0000 [#1] SMP PTI
[   17.422915] CPU: 0 PID: 455 Comm: ibv_xsrq_pingpo Not tainted 4.16.0-rc1+ #86
[   17.424765] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[   17.427399] RIP: 0010:rdma_lookup_put_uobject+0x9/0x50
[   17.428445] RSP: 0018:ffffb8c7401e7c90 EFLAGS: 00010246
[   17.429543] RAX: 0000000000000000 RBX: ffffb8c7401e7cf8 RCX: 0000000000000000
[   17.432426] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000000
[   17.437448] RBP: 0000000000000000 R08: 00000000000218f0 R09: ffffffff8ebc4cac
[   17.440223] R10: fffff6038052cd80 R11: ffff967694b36400 R12: ffff96769391f800
[   17.442184] R13: ffffb8c7401e7cd8 R14: 0000000000000000 R15: ffff967699f60000
[   17.443971] FS:  00007fc29207d700(0000) GS:ffff96769fc00000(0000) knlGS:0000000000000000
[   17.446623] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   17.448059] CR2: 0000000000000048 CR3: 000000001397a000 CR4: 00000000000006b0
[   17.449677] Call Trace:
[   17.450247]  modify_qp.isra.20+0x219/0x2f0
[   17.451151]  ib_uverbs_modify_qp+0x90/0xe0
[   17.452126]  ib_uverbs_write+0x1d2/0x3c0
[   17.453897]  ? __handle_mm_fault+0x93c/0xe40
[   17.454938]  __vfs_write+0x36/0x180
[   17.455875]  vfs_write+0xad/0x1e0
[   17.456766]  SyS_write+0x52/0xc0
[   17.457632]  do_syscall_64+0x75/0x180
[   17.458631]  entry_SYSCALL_64_after_hwframe+0x21/0x86
[   17.460004] RIP: 0033:0x7fc29198f5a0
[   17.460982] RSP: 002b:00007ffccc71f018 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[   17.463043] RAX: ffffffffffffffda RBX: 0000000000000078 RCX: 00007fc29198f5a0
[   17.464581] RDX: 0000000000000078 RSI: 00007ffccc71f050 RDI: 0000000000000003
[   17.466148] RBP: 0000000000000000 R08: 0000000000000078 R09: 00007ffccc71f050
[   17.467750] R10: 000055b6cf87c248 R11: 0000000000000246 R12: 00007ffccc71f300
[   17.469541] R13: 000055b6cf8733a0 R14: 0000000000000000 R15: 0000000000000000
[   17.471151] Code: 00 00 0f 1f 44 00 00 48 8b 47 48 48 8b 00 48 8b 40 10 e9 0b 8b 68 00 90 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 53 89 f5 <48> 8b 47 48 48 89 fb 40 0f b6 f6 48 8b 00 48 8b 40 20 e8 e0 8a
[   17.475185] RIP: rdma_lookup_put_uobject+0x9/0x50 RSP: ffffb8c7401e7c90
[   17.476841] CR2: 0000000000000048
[   17.477764] ---[ end trace 1dbcc5354071a712 ]---
[   17.478880] Kernel panic - not syncing: Fatal exception
[   17.480277] Kernel Offset: 0xd000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)

Fixes: 2f08ee363f ("RDMA/restrack: don't use uaccess_kernel()")
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-02-21 13:52:19 -05:00
Selvin Xavier 7374fbd9e1 RDMA/bnxt_re: Avoid system hang during device un-reg
BNXT_RE_FLAG_TASK_IN_PROG doesn't handle multiple work
requests posted together. Track schedule of multiple
workqueue items by maintaining a per device counter
and proceed with IB dereg only if this counter is zero.
flush_workqueue is no longer required from
NETDEV_UNREGISTER path.

Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-02-20 11:59:47 -05:00
Selvin Xavier dcdaba0806 RDMA/bnxt_re: Fix system crash during load/unload
During driver unload, the driver proceeds with cleanup
without waiting for the scheduled events. So the device
pointers get freed up and driver crashes when the events
are scheduled later.

Flush the bnxt_re_task work queue before starting
device removal.

Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-02-20 11:57:21 -05:00
Selvin Xavier 3b921e3bc4 RDMA/bnxt_re: Synchronize destroy_qp with poll_cq
Avoid system crash when destroy_qp is invoked while
the driver is processing the poll_cq. Synchronize these
functions using the cq_lock.

Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-02-20 11:57:21 -05:00
Devesh Sharma 6b4521f517 RDMA/bnxt_re: Unpin SQ and RQ memory if QP create fails
Driver leaves the QP memory pinned if QP create command
fails from the FW. Avoids this scenario by adding a proper
exit path if the FW command fails.

Signed-off-by: Devesh Sharma <devesh.sharma@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-02-20 11:57:21 -05:00
Devesh Sharma 7ff662b761 RDMA/bnxt_re: Disable atomic capability on bnxt_re adapters
More testing needs to be done before enabling this feature.
Disabling the feature temporarily

Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-02-20 11:57:21 -05:00
Steve Wise 2f08ee363f RDMA/restrack: don't use uaccess_kernel()
uaccess_kernel() isn't sufficient to determine if an rdma resource is
user-mode or not.  For example, resources allocated in the add_one()
function of an ib_client get falsely labeled as user mode, when they
are kernel mode allocations.  EG: mad qps.

The result is that these qps are skipped over during a nldev query
because of an erroneous namespace mismatch.

So now we determine if the resource is user-mode by looking at the object
struct's uobject or similar pointer to know if it was allocated for user
mode applications.

Fixes: 02d8883f52 ("RDMA/restrack: Add general infrastructure to track RDMA resources")
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-02-16 10:18:11 -07:00
Leon Romanovsky 2188558621 RDMA/verbs: Check existence of function prior to accessing it
Update all the flows to ensure that function pointer exists prior
to accessing it.

This is much safer than checking the uverbs_ex_mask variable, especially
since we know that test isn't working properly and will be removed
in -next.

This prevents a user triggereable oops.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-02-16 09:18:55 -07:00
Bart Van Assche 3a148896b2 IB/srp: Fix completion vector assignment algorithm
Ensure that cv_end is equal to ibdev->num_comp_vectors for the
NUMA node with the highest index. This patch improves spreading
of RDMA channels over completion vectors and thereby improves
performance, especially on systems with only a single NUMA node.
This patch drops support for the comp_vector login parameter by
ignoring the value of that parameter since I have not found a
good way to combine support for that parameter and automatic
spreading of RDMA channels over completion vectors.

Fixes: d92c0da71a ("IB/srp: Add multichannel support")
Reported-by: Alexander Schmid <alex@modula-shop-systems.de>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Alexander Schmid <alex@modula-shop-systems.de>
Cc: stable@vger.kernel.org
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-02-15 15:38:58 -07:00
Adit Ranadive 1f5a6c47aa RDMA/vmw_pvrdma: Fix usage of user response structures in ABI file
This ensures that we return the right structures back to userspace.
Otherwise, it looks like the reserved fields in the response structures
in userspace might have uninitialized data in them.

Fixes: 8b10ba783c ("RDMA/vmw_pvrdma: Add shared receive queue support")
Fixes: 29c8d9eba5 ("IB: Add vmw_pvrdma driver")
Suggested-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Bryan Tan <bryantan@vmware.com>
Reviewed-by: Aditya Sarwade <asarwade@vmware.com>
Reviewed-by: Jorgen Hansen <jhansen@vmware.com>
Signed-off-by: Adit Ranadive <aditr@vmware.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-02-15 15:31:28 -07:00
Leon Romanovsky 5d4c05c3ee RDMA/uverbs: Sanitize user entered port numbers prior to access it
==================================================================
BUG: KASAN: use-after-free in copy_ah_attr_from_uverbs+0x6f2/0x8c0
Read of size 4 at addr ffff88006476a198 by task syzkaller697701/265

CPU: 0 PID: 265 Comm: syzkaller697701 Not tainted 4.15.0+ #90
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
Call Trace:
 dump_stack+0xde/0x164
 ? dma_virt_map_sg+0x22c/0x22c
 ? show_regs_print_info+0x17/0x17
 ? lock_contended+0x11a0/0x11a0
 print_address_description+0x83/0x3e0
 kasan_report+0x18c/0x4b0
 ? copy_ah_attr_from_uverbs+0x6f2/0x8c0
 ? copy_ah_attr_from_uverbs+0x6f2/0x8c0
 ? lookup_get_idr_uobject+0x120/0x200
 ? copy_ah_attr_from_uverbs+0x6f2/0x8c0
 copy_ah_attr_from_uverbs+0x6f2/0x8c0
 ? modify_qp+0xd0e/0x1350
 modify_qp+0xd0e/0x1350
 ib_uverbs_modify_qp+0xf9/0x170
 ? ib_uverbs_query_qp+0xa70/0xa70
 ib_uverbs_write+0x7f9/0xef0
 ? attach_entity_load_avg+0x8b0/0x8b0
 ? ib_uverbs_query_qp+0xa70/0xa70
 ? uverbs_devnode+0x110/0x110
 ? cyc2ns_read_end+0x10/0x10
 ? print_irqtrace_events+0x280/0x280
 ? sched_clock_cpu+0x18/0x200
 ? _raw_spin_unlock_irq+0x29/0x40
 ? _raw_spin_unlock_irq+0x29/0x40
 ? _raw_spin_unlock_irq+0x29/0x40
 ? time_hardirqs_on+0x27/0x670
 __vfs_write+0x10d/0x700
 ? uverbs_devnode+0x110/0x110
 ? kernel_read+0x170/0x170
 ? _raw_spin_unlock_irq+0x29/0x40
 ? finish_task_switch+0x1bd/0x7a0
 ? finish_task_switch+0x194/0x7a0
 ? prandom_u32_state+0xe/0x180
 ? rcu_read_unlock+0x80/0x80
 ? security_file_permission+0x93/0x260
 vfs_write+0x1b0/0x550
 SyS_write+0xc7/0x1a0
 ? SyS_read+0x1a0/0x1a0
 ? trace_hardirqs_on_thunk+0x1a/0x1c
 entry_SYSCALL_64_fastpath+0x1e/0x8b
RIP: 0033:0x433c29
RSP: 002b:00007ffcf2be82a8 EFLAGS: 00000217

Allocated by task 62:
 kasan_kmalloc+0xa0/0xd0
 kmem_cache_alloc+0x141/0x480
 dup_fd+0x101/0xcc0
 copy_process.part.62+0x166f/0x4390
 _do_fork+0x1cb/0xe90
 kernel_thread+0x34/0x40
 call_usermodehelper_exec_work+0x112/0x260
 process_one_work+0x929/0x1aa0
 worker_thread+0x5c6/0x12a0
 kthread+0x346/0x510
 ret_from_fork+0x3a/0x50

Freed by task 259:
 kasan_slab_free+0x71/0xc0
 kmem_cache_free+0xf3/0x4c0
 put_files_struct+0x225/0x2c0
 exit_files+0x88/0xc0
 do_exit+0x67c/0x1520
 do_group_exit+0xe8/0x380
 SyS_exit_group+0x1e/0x20
 entry_SYSCALL_64_fastpath+0x1e/0x8b

The buggy address belongs to the object at ffff88006476a000
 which belongs to the cache files_cache of size 832
The buggy address is located 408 bytes inside of
 832-byte region [ffff88006476a000, ffff88006476a340)
The buggy address belongs to the page:
page:ffffea000191da80 count:1 mapcount:0 mapping:          (null) index:0x0 compound_mapcount: 0
flags: 0x4000000000008100(slab|head)
raw: 4000000000008100 0000000000000000 0000000000000000 0000000100080008
raw: 0000000000000000 0000000100000001 ffff88006bcf7a80 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff88006476a080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff88006476a100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffff88006476a180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                            ^
 ffff88006476a200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff88006476a280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================

Cc: syzkaller <syzkaller@googlegroups.com>
Cc: <stable@vger.kernel.org> # 4.11
Fixes: 44c58487d5 ("IB/core: Define 'ib' and 'roce' rdma_ah_attr types")
Reported-by: Noa Osherovich <noaos@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-02-15 15:31:27 -07:00
Leon Romanovsky 1ff5325c3c RDMA/uverbs: Fix circular locking dependency
Avoid circular locking dependency by calling
to uobj_alloc_commit() outside of xrcd_tree_mutex lock.

======================================================
WARNING: possible circular locking dependency detected
4.15.0+ #87 Not tainted
------------------------------------------------------
syzkaller401056/269 is trying to acquire lock:
 (&uverbs_dev->xrcd_tree_mutex){+.+.}, at: [<000000006c12d2cd>] uverbs_free_xrcd+0xd2/0x360

but task is already holding lock:
 (&ucontext->uobjects_lock){+.+.}, at: [<00000000da010f09>] uverbs_cleanup_ucontext+0x168/0x730

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (&ucontext->uobjects_lock){+.+.}:
       __mutex_lock+0x111/0x1720
       rdma_alloc_commit_uobject+0x22c/0x600
       ib_uverbs_open_xrcd+0x61a/0xdd0
       ib_uverbs_write+0x7f9/0xef0
       __vfs_write+0x10d/0x700
       vfs_write+0x1b0/0x550
       SyS_write+0xc7/0x1a0
       entry_SYSCALL_64_fastpath+0x1e/0x8b

-> #0 (&uverbs_dev->xrcd_tree_mutex){+.+.}:
       lock_acquire+0x19d/0x440
       __mutex_lock+0x111/0x1720
       uverbs_free_xrcd+0xd2/0x360
       remove_commit_idr_uobject+0x6d/0x110
       uverbs_cleanup_ucontext+0x2f0/0x730
       ib_uverbs_cleanup_ucontext.constprop.3+0x52/0x120
       ib_uverbs_close+0xf2/0x570
       __fput+0x2cd/0x8d0
       task_work_run+0xec/0x1d0
       do_exit+0x6a1/0x1520
       do_group_exit+0xe8/0x380
       SyS_exit_group+0x1e/0x20
       entry_SYSCALL_64_fastpath+0x1e/0x8b

other info that might help us debug this:

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&ucontext->uobjects_lock);
                               lock(&uverbs_dev->xrcd_tree_mutex);
                               lock(&ucontext->uobjects_lock);
  lock(&uverbs_dev->xrcd_tree_mutex);

 *** DEADLOCK ***

3 locks held by syzkaller401056/269:
 #0:  (&file->cleanup_mutex){+.+.}, at: [<00000000c9f0c252>] ib_uverbs_close+0xac/0x570
 #1:  (&ucontext->cleanup_rwsem){++++}, at: [<00000000b6994d49>] uverbs_cleanup_ucontext+0xf6/0x730
 #2:  (&ucontext->uobjects_lock){+.+.}, at: [<00000000da010f09>] uverbs_cleanup_ucontext+0x168/0x730

stack backtrace:
CPU: 0 PID: 269 Comm: syzkaller401056 Not tainted 4.15.0+ #87
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
Call Trace:
 dump_stack+0xde/0x164
 ? dma_virt_map_sg+0x22c/0x22c
 ? uverbs_cleanup_ucontext+0x168/0x730
 ? console_unlock+0x502/0xbd0
 print_circular_bug.isra.24+0x35e/0x396
 ? print_circular_bug_header+0x12e/0x12e
 ? find_usage_backwards+0x30/0x30
 ? entry_SYSCALL_64_fastpath+0x1e/0x8b
 validate_chain.isra.28+0x25d1/0x40c0
 ? check_usage+0xb70/0xb70
 ? graph_lock+0x160/0x160
 ? find_usage_backwards+0x30/0x30
 ? cyc2ns_read_end+0x10/0x10
 ? print_irqtrace_events+0x280/0x280
 ? __lock_acquire+0x93d/0x1630
 __lock_acquire+0x93d/0x1630
 lock_acquire+0x19d/0x440
 ? uverbs_free_xrcd+0xd2/0x360
 __mutex_lock+0x111/0x1720
 ? uverbs_free_xrcd+0xd2/0x360
 ? uverbs_free_xrcd+0xd2/0x360
 ? __mutex_lock+0x828/0x1720
 ? mutex_lock_io_nested+0x1550/0x1550
 ? uverbs_cleanup_ucontext+0x168/0x730
 ? __lock_acquire+0x9a9/0x1630
 ? mutex_lock_io_nested+0x1550/0x1550
 ? uverbs_cleanup_ucontext+0xf6/0x730
 ? lock_contended+0x11a0/0x11a0
 ? uverbs_free_xrcd+0xd2/0x360
 uverbs_free_xrcd+0xd2/0x360
 remove_commit_idr_uobject+0x6d/0x110
 uverbs_cleanup_ucontext+0x2f0/0x730
 ? sched_clock_cpu+0x18/0x200
 ? uverbs_close_fd+0x1c0/0x1c0
 ib_uverbs_cleanup_ucontext.constprop.3+0x52/0x120
 ib_uverbs_close+0xf2/0x570
 ? ib_uverbs_remove_one+0xb50/0xb50
 ? ib_uverbs_remove_one+0xb50/0xb50
 __fput+0x2cd/0x8d0
 task_work_run+0xec/0x1d0
 do_exit+0x6a1/0x1520
 ? fsnotify_first_mark+0x220/0x220
 ? exit_notify+0x9f0/0x9f0
 ? entry_SYSCALL_64_fastpath+0x5/0x8b
 ? entry_SYSCALL_64_fastpath+0x5/0x8b
 ? trace_hardirqs_on_thunk+0x1a/0x1c
 ? time_hardirqs_on+0x27/0x670
 ? time_hardirqs_off+0x27/0x490
 ? syscall_return_slowpath+0x6c/0x460
 ? entry_SYSCALL_64_fastpath+0x5/0x8b
 do_group_exit+0xe8/0x380
 SyS_exit_group+0x1e/0x20
 entry_SYSCALL_64_fastpath+0x1e/0x8b
RIP: 0033:0x431ce9

Cc: syzkaller <syzkaller@googlegroups.com>
Cc: <stable@vger.kernel.org> # 4.11
Fixes: fd3c7904db ("IB/core: Change idr objects to use the new schema")
Reported-by: Noa Osherovich <noaos@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-02-15 15:31:27 -07:00
Leon Romanovsky 5c2e1c4f92 RDMA/uverbs: Fix bad unlock balance in ib_uverbs_close_xrcd
There is no matching lock for this mutex. Git history suggests this is
just a missed remnant from an earlier version of the function before
this locking was moved into uverbs_free_xrcd.

Originally this lock was protecting the xrcd_table_delete()

=====================================
WARNING: bad unlock balance detected!
4.15.0+ #87 Not tainted
-------------------------------------
syzkaller223405/269 is trying to release lock (&uverbs_dev->xrcd_tree_mutex) at:
[<00000000b8703372>] ib_uverbs_close_xrcd+0x195/0x1f0
but there are no more locks to release!

other info that might help us debug this:
1 lock held by syzkaller223405/269:
 #0:  (&uverbs_dev->disassociate_srcu){....}, at: [<000000005af3b960>] ib_uverbs_write+0x265/0xef0

stack backtrace:
CPU: 0 PID: 269 Comm: syzkaller223405 Not tainted 4.15.0+ #87
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
Call Trace:
 dump_stack+0xde/0x164
 ? dma_virt_map_sg+0x22c/0x22c
 ? ib_uverbs_write+0x265/0xef0
 ? console_unlock+0x502/0xbd0
 ? ib_uverbs_close_xrcd+0x195/0x1f0
 print_unlock_imbalance_bug+0x131/0x160
 lock_release+0x59d/0x1100
 ? ib_uverbs_close_xrcd+0x195/0x1f0
 ? lock_acquire+0x440/0x440
 ? lock_acquire+0x440/0x440
 __mutex_unlock_slowpath+0x88/0x670
 ? wait_for_completion+0x4c0/0x4c0
 ? rdma_lookup_get_uobject+0x145/0x2f0
 ib_uverbs_close_xrcd+0x195/0x1f0
 ? ib_uverbs_open_xrcd+0xdd0/0xdd0
 ib_uverbs_write+0x7f9/0xef0
 ? cyc2ns_read_end+0x10/0x10
 ? ib_uverbs_open_xrcd+0xdd0/0xdd0
 ? uverbs_devnode+0x110/0x110
 ? cyc2ns_read_end+0x10/0x10
 ? cyc2ns_read_end+0x10/0x10
 ? sched_clock_cpu+0x18/0x200
 __vfs_write+0x10d/0x700
 ? uverbs_devnode+0x110/0x110
 ? kernel_read+0x170/0x170
 ? __fget+0x358/0x5d0
 ? security_file_permission+0x93/0x260
 vfs_write+0x1b0/0x550
 SyS_write+0xc7/0x1a0
 ? SyS_read+0x1a0/0x1a0
 ? trace_hardirqs_on_thunk+0x1a/0x1c
 entry_SYSCALL_64_fastpath+0x1e/0x8b
RIP: 0033:0x4335c9

Cc: syzkaller <syzkaller@googlegroups.com>
Cc: <stable@vger.kernel.org> # 4.11
Fixes: fd3c7904db ("IB/core: Change idr objects to use the new schema")
Reported-by: Noa Osherovich <noaos@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-02-15 15:31:27 -07:00
Leon Romanovsky 0cba0efcc7 RDMA/restrack: Increment CQ restrack object before committing
Once the uobj is committed it is immediately possible another thread
could destroy it, which worst case, can result in a use-after-free
of the restrack objects.

Cc: syzkaller <syzkaller@googlegroups.com>
Fixes: 08f294a152 ("RDMA/core: Add resource tracking for create and destroy CQs")
Reported-by: Noa Osherovich <noaos@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-02-15 15:31:26 -07:00
Leon Romanovsky 3f802b162d RDMA/uverbs: Protect from command mask overflow
The command number is not bounds checked against the command mask before it
is shifted, resulting in an ubsan hit. This does not cause malfunction since
the command number is eventually bounds checked, but we can make this ubsan
clean by moving the bounds check to before the mask check.

================================================================================
UBSAN: Undefined behaviour in
drivers/infiniband/core/uverbs_main.c:647:21
shift exponent 207 is too large for 64-bit type 'long long unsigned int'
CPU: 0 PID: 446 Comm: syz-executor3 Not tainted 4.15.0-rc2+ #61
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
Call Trace:
dump_stack+0xde/0x164
? dma_virt_map_sg+0x22c/0x22c
ubsan_epilogue+0xe/0x81
__ubsan_handle_shift_out_of_bounds+0x293/0x2f7
? debug_check_no_locks_freed+0x340/0x340
? __ubsan_handle_load_invalid_value+0x19b/0x19b
? lock_acquire+0x440/0x440
? lock_acquire+0x19d/0x440
? __might_fault+0xf4/0x240
? ib_uverbs_write+0x68d/0xe20
ib_uverbs_write+0x68d/0xe20
? __lock_acquire+0xcf7/0x3940
? uverbs_devnode+0x110/0x110
? cyc2ns_read_end+0x10/0x10
? sched_clock_cpu+0x18/0x200
? sched_clock_cpu+0x18/0x200
__vfs_write+0x10d/0x700
? uverbs_devnode+0x110/0x110
? kernel_read+0x170/0x170
? __fget+0x35b/0x5d0
? security_file_permission+0x93/0x260
vfs_write+0x1b0/0x550
SyS_write+0xc7/0x1a0
? SyS_read+0x1a0/0x1a0
? trace_hardirqs_on_thunk+0x1a/0x1c
entry_SYSCALL_64_fastpath+0x18/0x85
RIP: 0033:0x448e29
RSP: 002b:00007f033f567c58 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 00007f033f5686bc RCX: 0000000000448e29
RDX: 0000000000000060 RSI: 0000000020001000 RDI: 0000000000000012
RBP: 000000000070bea0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffffff
R13: 00000000000056a0 R14: 00000000006e8740 R15: 0000000000000000
================================================================================

Cc: syzkaller <syzkaller@googlegroups.com>
Cc: <stable@vger.kernel.org> # 4.5
Fixes: 2dbd5186a3 ("IB/core: IB/core: Allow legacy verbs through extended interfaces")
Reported-by: Noa Osherovich <noaos@mellanox.com>
Reviewed-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-02-15 15:31:26 -07:00
Jason Gunthorpe ec6f8401c4 IB/uverbs: Fix unbalanced unlock on error path for rdma_explicit_destroy
If remove_commit fails then the lock is left locked while the uobj still
exists. Eventually the kernel will deadlock.

lockdep detects this and says:

 test/4221 is leaving the kernel with locks still held!
 1 lock held by test/4221:
  #0:  (&ucontext->cleanup_rwsem){.+.+}, at: [<000000001e5c7523>] rdma_explicit_destroy+0x37/0x120 [ib_uverbs]

Fixes: 4da70da23e ("IB/core: Explicitly destroy an object while keeping uobject")
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-02-15 15:31:26 -07:00
Jason Gunthorpe 104f268d43 IB/uverbs: Improve lockdep_check
This is really being used as an assert that the expected usecnt
is being held and implicitly that the usecnt is valid. Rename it to
assert_uverbs_usecnt and tighten the checks to only accept valid
values of usecnt (eg 0 and < -1 are invalid).

The tigher checkes make the assertion cover more cases and is more
likely to find bugs via syzkaller/etc.

Fixes: 3832125624 ("IB/core: Add support for idr types")
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-02-15 14:59:47 -07:00
Leon Romanovsky 6623e3e3cd RDMA/uverbs: Protect from races between lookup and destroy of uobjects
The race is between lookup_get_idr_uobject and
uverbs_idr_remove_uobj -> uverbs_uobject_put.

We deliberately do not call sychronize_rcu after the idr_remove in
uverbs_idr_remove_uobj for performance reasons, instead we call
kfree_rcu() during uverbs_uobject_put.

However, this means we can obtain pointers to uobj's that have
already been released and must protect against krefing them
using kref_get_unless_zero.

==================================================================
BUG: KASAN: use-after-free in copy_ah_attr_from_uverbs.isra.2+0x860/0xa00
Read of size 4 at addr ffff88005fda1ac8 by task syz-executor2/441

CPU: 1 PID: 441 Comm: syz-executor2 Not tainted 4.15.0-rc2+ #56
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xd4
print_address_description+0x73/0x290
kasan_report+0x25c/0x370
? copy_ah_attr_from_uverbs.isra.2+0x860/0xa00
copy_ah_attr_from_uverbs.isra.2+0x860/0xa00
? uverbs_try_lock_object+0x68/0xc0
? modify_qp.isra.7+0xdc4/0x10e0
modify_qp.isra.7+0xdc4/0x10e0
ib_uverbs_modify_qp+0xfe/0x170
? ib_uverbs_query_qp+0x970/0x970
? __lock_acquire+0xa11/0x1da0
ib_uverbs_write+0x55a/0xad0
? ib_uverbs_query_qp+0x970/0x970
? ib_uverbs_query_qp+0x970/0x970
? ib_uverbs_open+0x760/0x760
? futex_wake+0x147/0x410
? sched_clock_cpu+0x18/0x180
? check_prev_add+0x1680/0x1680
? do_futex+0x3b6/0xa30
? sched_clock_cpu+0x18/0x180
__vfs_write+0xf7/0x5c0
? ib_uverbs_open+0x760/0x760
? kernel_read+0x110/0x110
? lock_acquire+0x370/0x370
? __fget+0x264/0x3b0
vfs_write+0x18a/0x460
SyS_write+0xc7/0x1a0
? SyS_read+0x1a0/0x1a0
? trace_hardirqs_on_thunk+0x1a/0x1c
entry_SYSCALL_64_fastpath+0x18/0x85
RIP: 0033:0x448e29
RSP: 002b:00007f443fee0c58 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 00007f443fee16bc RCX: 0000000000448e29
RDX: 0000000000000078 RSI: 00000000209f8000 RDI: 0000000000000012
RBP: 000000000070bea0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffffff
R13: 0000000000008e98 R14: 00000000006ebf38 R15: 0000000000000000

Allocated by task 1:
kmem_cache_alloc_trace+0x16c/0x2f0
mlx5_alloc_cmd_msg+0x12e/0x670
cmd_exec+0x419/0x1810
mlx5_cmd_exec+0x40/0x70
mlx5_core_mad_ifc+0x187/0x220
mlx5_MAD_IFC+0xd7/0x1b0
mlx5_query_mad_ifc_gids+0x1f3/0x650
mlx5_ib_query_gid+0xa4/0xc0
ib_query_gid+0x152/0x1a0
ib_query_port+0x21e/0x290
mlx5_port_immutable+0x30f/0x490
ib_register_device+0x5dd/0x1130
mlx5_ib_add+0x3e7/0x700
mlx5_add_device+0x124/0x510
mlx5_register_interface+0x11f/0x1c0
mlx5_ib_init+0x56/0x61
do_one_initcall+0xa3/0x250
kernel_init_freeable+0x309/0x3b8
kernel_init+0x14/0x180
ret_from_fork+0x24/0x30

Freed by task 1:
kfree+0xeb/0x2f0
mlx5_free_cmd_msg+0xcd/0x140
cmd_exec+0xeba/0x1810
mlx5_cmd_exec+0x40/0x70
mlx5_core_mad_ifc+0x187/0x220
mlx5_MAD_IFC+0xd7/0x1b0
mlx5_query_mad_ifc_gids+0x1f3/0x650
mlx5_ib_query_gid+0xa4/0xc0
ib_query_gid+0x152/0x1a0
ib_query_port+0x21e/0x290
mlx5_port_immutable+0x30f/0x490
ib_register_device+0x5dd/0x1130
mlx5_ib_add+0x3e7/0x700
mlx5_add_device+0x124/0x510
mlx5_register_interface+0x11f/0x1c0
mlx5_ib_init+0x56/0x61
do_one_initcall+0xa3/0x250
kernel_init_freeable+0x309/0x3b8
kernel_init+0x14/0x180
ret_from_fork+0x24/0x30

The buggy address belongs to the object at ffff88005fda1ab0
which belongs to the cache kmalloc-32 of size 32
The buggy address is located 24 bytes inside of
32-byte region [ffff88005fda1ab0, ffff88005fda1ad0)
The buggy address belongs to the page:
page:00000000d5655c19 count:1 mapcount:0 mapping: (null)
index:0xffff88005fda1fc0
flags: 0x4000000000000100(slab)
raw: 4000000000000100 0000000000000000 ffff88005fda1fc0 0000000180550008
raw: ffffea00017f6780 0000000400000004 ffff88006c803980 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
ffff88005fda1980: fc fc fb fb fb fb fc fc fb fb fb fb fc fc fb fb
ffff88005fda1a00: fb fb fc fc fb fb fb fb fc fc 00 00 00 00 fc fc
ffff88005fda1a80: fb fb fb fb fc fc fb fb fb fb fc fc fb fb fb fb
ffff88005fda1b00: fc fc 00 00 00 00 fc fc fb fb fb fb fc fc fb fb
ffff88005fda1b80: fb fb fc fc fb fb fb fb fc fc fb fb fb fb fc fc
==================================================================@

Cc: syzkaller <syzkaller@googlegroups.com>
Cc: <stable@vger.kernel.org> # 4.11
Fixes: 3832125624 ("IB/core: Add support for idr types")
Reported-by: Noa Osherovich <noaos@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-02-15 14:59:47 -07:00
Jason Gunthorpe d9dc7a3500 IB/uverbs: Hold the uobj write lock after allocate
This clarifies the design intention that time between allocate and
commit has the uobj exclusive to the caller. We already guarantee
this by delaying publishing the uobj pointer via idr_insert,
fd_install, list_add, etc.

Additionally holding the usecnt lock during this period provides
extra clarity and more protection against future mistakes.

Fixes: 3832125624 ("IB/core: Add support for idr types")
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-02-15 14:59:46 -07:00
Matan Barak 4d39a959bc IB/uverbs: Fix possible oops with duplicate ioctl attributes
If the same attribute is listed twice by the user in the ioctl attribute
list then error unwind can cause the kernel to deref garbage.

This happens when an object with WRITE access is sent twice. The second
parse properly fails but corrupts the state required for the error unwind
it triggers.

Fixing this by making duplicates in the attribute list invalid. This is
not something we need to support.

The ioctl interface is currently recommended to be disabled in kConfig.

Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-02-15 14:59:46 -07:00
Matan Barak 9dfb2ff400 IB/uverbs: Add ioctl support for 32bit processes
32 bit processes running on a 64 bit kernel call compat_ioctl so that
implementations can revise any structure layout issues. Point compat_ioctl
at our normal ioctl because:

- All our structures are designed to be the same on 32 and 64 bit, ie we
  use __aligned_u64 when required and are careful to manage padding.

- Any pointers are stored in u64's and userspace is expected
  to prepare them properly.

Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-02-15 14:59:46 -07:00
Matan Barak 3d89459e2e IB/uverbs: Fix method merging in uverbs_ioctl_merge
Fix a bug in uverbs_ioctl_merge that looked at the object's iterator
number instead of the method's iterator number when merging methods.

While we're at it, make the uverbs_ioctl_merge code a bit more clear
and faster.

Fixes: 118620d368 ('IB/core: Add uverbs merge trees functionality')
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-02-15 14:59:45 -07:00
Jason Gunthorpe 2f36028ce9 IB/uverbs: Use u64_to_user_ptr() not a union
The union approach will get the endianness wrong sometimes if the kernel's
pointer size is 32 bits resulting in EFAULTs when trying to copy to/from
user.

Signed-off-by: Leon Romanovsky <leon@kernel.org>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-02-15 14:59:45 -07:00
Jason Gunthorpe 6c976c30ad IB/uverbs: Use inline data transfer for UHW_IN
The rule for the API is pointers less than 8 bytes are inlined into
the .data field of the attribute. Fix the creation of the driver udata
struct to follow this rule and point to the .data itself when the size
is less than 8 bytes.

Otherwise if the UHW struct is less than 8 bytes the driver will get
EFAULT during copy_from_user.

Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-02-15 14:59:44 -07:00
Matan Barak 89d9e8d3f1 IB/uverbs: Always use the attribute size provided by the user
This fixes several bugs around the copy_to/from user path:
 - copy_to used the user provided size of the attribute
   and could copy data beyond the end of the kernel buffer into
   userspace.
 - copy_from didn't know the size of the kernel buffer and
   could have left kernel memory unexpectedly un-initialized.
 - copy_from did not use the user length to determine if the
   attribute data is inlined or not.

Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-02-15 14:59:44 -07:00
Leon Romanovsky 415bb699d7 RDMA/restrack: Remove unimplemented XRCD object
Resource tracking of XRCD objects is not implemented in current
version of restrack and hence can be removed.

Fixes: 02d8883f52 ("RDMA/restrack: Add general infrastructure to track RDMA resources")
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-02-15 14:59:44 -07:00
Alaa Hleihel 14fa91e0fe IB/ipoib: Do not warn if IPoIB debugfs doesn't exist
netdev_wait_allrefs() could rebroadcast NETDEV_UNREGISTER event
multiple times until all refs are gone, which will result in calling
ipoib_delete_debug_files multiple times and printing a warning.

Remove the WARN_ONCE since checks of NULL pointers before calling
debugfs_remove are not needed.

Fixes: 771a525840 ("IB/IPoIB: ibX: failed to create mcg debug file")
Signed-off-by: Alaa Hleihel <alaa@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-02-15 14:59:43 -07:00
Yonatan Cohen 388ca8be00 IB/mlx5: Implement fragmented completion queue (CQ)
The current implementation of create CQ requires contiguous
memory, such requirement is problematic once the memory is
fragmented or the system is low in memory, it causes for
failures in dma_zalloc_coherent().

This patch implements new scheme of fragmented CQ to overcome
this issue by introducing new type: 'struct mlx5_frag_buf_ctrl'
to allocate fragmented buffers, rather than contiguous ones.

Base the Completion Queues (CQs) on this new fragmented buffer.

It fixes following crashes:
kworker/29:0: page allocation failure: order:6, mode:0x80d0
CPU: 29 PID: 8374 Comm: kworker/29:0 Tainted: G OE 3.10.0
Workqueue: ib_cm cm_work_handler [ib_cm]
Call Trace:
[<>] dump_stack+0x19/0x1b
[<>] warn_alloc_failed+0x110/0x180
[<>] __alloc_pages_slowpath+0x6b7/0x725
[<>] __alloc_pages_nodemask+0x405/0x420
[<>] dma_generic_alloc_coherent+0x8f/0x140
[<>] x86_swiotlb_alloc_coherent+0x21/0x50
[<>] mlx5_dma_zalloc_coherent_node+0xad/0x110 [mlx5_core]
[<>] ? mlx5_db_alloc_node+0x69/0x1b0 [mlx5_core]
[<>] mlx5_buf_alloc_node+0x3e/0xa0 [mlx5_core]
[<>] mlx5_buf_alloc+0x14/0x20 [mlx5_core]
[<>] create_cq_kernel+0x90/0x1f0 [mlx5_ib]
[<>] mlx5_ib_create_cq+0x3b0/0x4e0 [mlx5_ib]

Signed-off-by: Yonatan Cohen <yonatanc@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-02-15 00:30:03 -08:00
Andy Shevchenko 71591d1280 RDMA/hns: Replace __raw_write*(cpu_to_le*()) with LE write*()
There is no need to repeat the semantics of writel() and similar.
Moreover sparse complains about this:

drivers/infiniband/hw/hns/hns_roce_hw_v1.c:1690:22: expected unsigned long long val
drivers/infiniband/hw/hns/hns_roce_hw_v1.c:1690:22: got restricted __le64 <noident>

Fixing this by replacing __raw_write*(cpu_to_le*()) calls by plain
write*() ones.

Note, write*() accessors are little endian by definition.

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-02-14 16:32:54 -07:00
Jason Gunthorpe 7061f28d8a rxe: Do not use 'struct sockaddr' in a uapi header
Linux has two 'linux/socket.h' files - and only the one in the kernel
defines struct sockaddr - the user space one does not.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-02-14 16:31:35 -07:00
oulijun d480bb50d2 RDMA/hns: Use free_pages function instead of free_page
It need to use free_pages function for free the memory allocated
by __get_free_pages function.

Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-02-14 16:30:12 -07:00
Yixian Liu ced07769dc RDMA/hns: Fix QP state judgement before receiving work requests
The QP can accept receive work requests only when the QP is
in the states that allow them to be submitted.

This patch updates the QP state judgement based on the
specification.

Signed-off-by: Yixian Liu <liuyixian@huawei.com>
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-02-14 16:30:12 -07:00
oulijun 173bc6be96 RDMA/hns: Fix a bug with modifying mac address
When modifying mac address, it will trigger hns_roce_del_gid
function and can't delete the default gid matched the index
because the attribute of gid is null.

Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-02-14 16:30:12 -07:00
Corentin Labbe fc968aee5e IB/cxgb3: remove cxio_dbg.c
cxio_dbg.c is uncompiled since commit 2b540355cd ("RDMA/cxgb3: cleanups")
10 years after, we could remove it.

Signed-off-by: Corentin Labbe <clabbe@baylibre.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-02-14 16:21:15 -07:00
Denys Vlasenko 9b2c45d479 net: make getname() functions return length rather than use int* parameter
Changes since v1:
Added changes in these files:
    drivers/infiniband/hw/usnic/usnic_transport.c
    drivers/staging/lustre/lnet/lnet/lib-socket.c
    drivers/target/iscsi/iscsi_target_login.c
    drivers/vhost/net.c
    fs/dlm/lowcomms.c
    fs/ocfs2/cluster/tcp.c
    security/tomoyo/network.c

Before:
All these functions either return a negative error indicator,
or store length of sockaddr into "int *socklen" parameter
and return zero on success.

"int *socklen" parameter is awkward. For example, if caller does not
care, it still needs to provide on-stack storage for the value
it does not need.

None of the many FOO_getname() functions of various protocols
ever used old value of *socklen. They always just overwrite it.

This change drops this parameter, and makes all these functions, on success,
return length of sockaddr. It's always >= 0 and can be differentiated
from an error.

Tests in callers are changed from "if (err)" to "if (err < 0)", where needed.

rpc_sockname() lost "int buflen" parameter, since its only use was
to be passed to kernel_getsockname() as &buflen and subsequently
not used in any way.

Userspace API is not changed.

    text    data     bss      dec     hex filename
30108430 2633624  873672 33615726 200ef6e vmlinux.before.o
30108109 2633612  873672 33615393 200ee21 vmlinux.o

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
CC: David S. Miller <davem@davemloft.net>
CC: linux-kernel@vger.kernel.org
CC: netdev@vger.kernel.org
CC: linux-bluetooth@vger.kernel.org
CC: linux-decnet-user@lists.sourceforge.net
CC: linux-wireless@vger.kernel.org
CC: linux-rdma@vger.kernel.org
CC: linux-sctp@vger.kernel.org
CC: linux-nfs@vger.kernel.org
CC: linux-x25@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-12 14:15:04 -05:00
Linus Torvalds a9a08845e9 vfs: do bulk POLL* -> EPOLL* replacement
This is the mindless scripted replacement of kernel use of POLL*
variables as described by Al, done by this script:

    for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
        L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
        for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done
    done

with de-mangling cleanups yet to come.

NOTE! On almost all architectures, the EPOLL* constants have the same
values as the POLL* constants do.  But they keyword here is "almost".
For various bad reasons they aren't the same, and epoll() doesn't
actually work quite correctly in some cases due to this on Sparc et al.

The next patch from Al will sort out the final differences, and we
should be all done.

Scripted-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-02-11 14:34:03 -08:00
Linus Torvalds 2246edfaf8 Second pull request for 4.16 merge window
- Clean up some function signatures in rxe for clarity
 - Tidy the RDMA netlink header to remove unimplemented constants
 - bnxt_re driver fixes, one is a regression this window.
 - Minor hns driver fixes
 - Various fixes from Dan Carpenter and his tool
 - Fix IRQ cleanup race in HFI1
 - HF1 performance optimizations and a fix to report counters in the right units
 - Fix for an IPoIB startup sequence race with the external manager
 - Oops fix for the new kabi path
 - Endian cleanups for hns
 - Fix for mlx5 related to the new automatic affinity support
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJaePL/AAoJELgmozMOVy/dqhsQALUhzDuuJ+/F6supjmyqZG53
 Ak/PoFjTmHToGQfDq/1TRzyKwMx12aB2l6WGZc31FzhvCw4daPWkoEVKReNWUUJ+
 fmESxjLgo8ZRGSqpNxn9Q8agE/I/5JZQoA8bCFCYgdZPKTPNKdtAVBphpdhmrOX4
 ygjABikWf/wBsNF1A8lnX9xkfPO21cPHrFQLTnuOzOT/hc6U+PPklHSQCnS91svh
 1+Pqjtssg54rxYkJqiFq3giSnfwvmAXO8WyVGmRRPFGLpB0nIjq0Sl6ZgLLClz7w
 YJdiBGr7rlnNMgGCjlPU2ZO3lO6J0ytXQzFNqRqvKryXQOv+uVeJgep7WqHTcdQU
 UN30FCKQMgLL/F6NF8wKaKcK4X0VgXQa7gpuH2fVSXF0c3LO3/mmWNjixbGSzT2c
 Wj+EW3eOKlTddhRLhgbMOdwc32tIGhaD85z2F4+FZO+XI9ZQtJaDewWVDjYoumP/
 RlDIFw+KCgSq7+UZL8CoXuh0BuS1nu9TGfkx1HW0DLMF1+yigNiswpUfksV4cISP
 JqE2I3yH0A4UobD/a+f9IhIfk2MjxO0tJWNjU8IA9LXgUFlskQ6MpH/AcE9G8JNv
 tlfLGR3s4PJa/7j/Iy2F84og/b/KH8v7vyj4Eknq/hLq63/BiM5wj0AUBRrGulN6
 HhAMOegxGZ7IKP/y0L7I
 =xwZz
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull more rdma updates from Doug Ledford:
 "Items of note:

   - two patches fix a regression in the 4.15 kernel. The 4.14 kernel
     worked fine with NVMe over Fabrics and mlx5 adapters. That broke in
     4.15. The fix is here.

   - one of the patches (the endian notation patch from Lijun) looks
     like a lot of lines of change, but it's mostly mechanical in
     nature. It amounts to the biggest chunk of change in it (it's about
     2/3rds of the overall pull request).

  Summary:

   - Clean up some function signatures in rxe for clarity

   - Tidy the RDMA netlink header to remove unimplemented constants

   - bnxt_re driver fixes, one is a regression this window.

   - Minor hns driver fixes

   - Various fixes from Dan Carpenter and his tool

   - Fix IRQ cleanup race in HFI1

   - HF1 performance optimizations and a fix to report counters in the right units

   - Fix for an IPoIB startup sequence race with the external manager

   - Oops fix for the new kabi path

   - Endian cleanups for hns

   - Fix for mlx5 related to the new automatic affinity support"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (38 commits)
  net/mlx5: increase async EQ to avoid EQ overrun
  mlx5: fix mlx5_get_vector_affinity to start from completion vector 0
  RDMA/hns: Fix the endian problem for hns
  IB/uverbs: Use the standard kConfig format for experimental
  IB: Update references to libibverbs
  IB/hfi1: Add 16B rcvhdr trace support
  IB/hfi1: Convert kzalloc_node and kcalloc to use kcalloc_node
  IB/core: Avoid a potential OOPs for an unused optional parameter
  IB/core: Map iWarp AH type to undefined in rdma_ah_find_type
  IB/ipoib: Fix for potential no-carrier state
  IB/hfi1: Show fault stats in both TX and RX directions
  IB/hfi1: Remove blind constants from 16B update
  IB/hfi1: Convert PortXmitWait/PortVLXmitWait counters to flit times
  IB/hfi1: Do not override given pcie_pset value
  IB/hfi1: Optimize process_receive_ib()
  IB/hfi1: Remove unnecessary fecn and becn fields
  IB/hfi1: Look up ibport using a pointer in receive path
  IB/hfi1: Optimize packet type comparison using 9B and bypass code paths
  IB/hfi1: Compute BTH only for RDMA_WRITE_LAST/SEND_LAST packet
  IB/hfi1: Remove dependence on qp->s_hdrwords
  ...
2018-02-06 11:09:45 -08:00
Linus Torvalds 105cf3c8c6 pci-v4.16-changes
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJad5lgAAoJEFmIoMA60/r8s2kQAI3PztawDpaCP9Z12pkbBHSt
 Ho0xTyk9rCZi9kQJbNjc+a+QrlA3QmTHXIXerB3LSWoh7M+XhsECjem92eHpgLNS
 JvYPhTfOrCr0vdiAmOz6hD0AqN/psrbfzgiJhSwomsGEFS77k7kERSJckRv81sxb
 Aj5F/WjucAgLorwm4auveAJEQ7atE7/6pkXzoqYm4G6NLOb46jUcRGndrnvXZBlz
 fws8fBM4BHyi7i25CYQl24tFq1CGax1rIPgLg+4KnH76bQk/N6Ju0sGVSzfh+hG8
 SIerK9bJbzGRAuNKoxB3aO1dyzsK3x9WztE2mG98w5trOISPIR1FqnvC/225FWAU
 d6eIXiC7wKnEx+DElNTzCjzfHc7SAJoupO32H7CoiTe5zPUlWlxJ1zLYkK1gt50q
 m8PRBiYTglxyznzrO0drtcdjEzvbdZNRrsYnul4wi1vSHzjk6F6XLtzT10XWM1M1
 1pXLB8384FTj0Hu4bq6Y3Aivkmz0Sf+eQM2NaOwe+Zj7/1VV0d3lvi4LUXkqzLCA
 FoXPJSMxG2Qu+iflCeYRQBJjExaZH3eNLZ3dT6QpcJrjaFVedd9u5DeeFqNL27zV
 bhr8TdqrR4p4rc8EBAGoCapw96IxLZROKB3gxbrZVOpfIZpzthwHbElHX6aqUgF4
 w/EV1JWs36WXWaxFk8wd
 =ttq9
 -----END PGP SIGNATURE-----

Merge tag 'pci-v4.16-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci

Pull PCI updates from Bjorn Helgaas:

 - skip AER driver error recovery callbacks for correctable errors
   reported via ACPI APEI, as we already do for errors reported via the
   native path (Tyler Baicar)

 - fix DPC shared interrupt handling (Alex Williamson)

 - print full DPC interrupt number (Keith Busch)

 - enable DPC only if AER is available (Keith Busch)

 - simplify DPC code (Bjorn Helgaas)

 - calculate ASPM L1 substate parameter instead of hardcoding it (Bjorn
   Helgaas)

 - enable Latency Tolerance Reporting for ASPM L1 substates (Bjorn
   Helgaas)

 - move ASPM internal interfaces out of public header (Bjorn Helgaas)

 - allow hot-removal of VGA devices (Mika Westerberg)

 - speed up unplug and shutdown by assuming Thunderbolt controllers
   don't support Command Completed events (Lukas Wunner)

 - add AtomicOps support for GPU and Infiniband drivers (Felix Kuehling,
   Jay Cornwall)

 - expose "ari_enabled" in sysfs to help NIC naming (Stuart Hayes)

 - clean up PCI DMA interface usage (Christoph Hellwig)

 - remove PCI pool API (replaced with DMA pool) (Romain Perier)

 - deprecate pci_get_bus_and_slot(), which assumed PCI domain 0 (Sinan
   Kaya)

 - move DT PCI code from drivers/of/ to drivers/pci/ (Rob Herring)

 - add PCI-specific wrappers for dev_info(), etc (Frederick Lawler)

 - remove warnings on sysfs mmap failure (Bjorn Helgaas)

 - quiet ROM validation messages (Alex Deucher)

 - remove redundant memory alloc failure messages (Markus Elfring)

 - fill in types for compile-time VGA and other I/O port resources
   (Bjorn Helgaas)

 - make "pci=pcie_scan_all" work for Root Ports as well as Downstream
   Ports to help AmigaOne X1000 (Bjorn Helgaas)

 - add SPDX tags to all PCI files (Bjorn Helgaas)

 - quirk Marvell 9128 DMA aliases (Alex Williamson)

 - quirk broken INTx disable on Ceton InfiniTV4 (Bjorn Helgaas)

 - fix CONFIG_PCI=n build by adding dummy pci_irqd_intx_xlate() (Niklas
   Cassel)

 - use DMA API to get MSI address for DesignWare IP (Niklas Cassel)

 - fix endpoint-mode DMA mask configuration (Kishon Vijay Abraham I)

 - fix ARTPEC-6 incorrect IS_ERR() usage (Wei Yongjun)

 - add support for ARTPEC-7 SoC (Niklas Cassel)

 - add endpoint-mode support for ARTPEC (Niklas Cassel)

 - add Cadence PCIe host and endpoint controller driver (Cyrille
   Pitchen)

 - handle multiple INTx status bits being set in dra7xx (Vignesh R)

 - translate dra7xx hwirq range to fix INTD handling (Vignesh R)

 - remove deprecated Exynos PHY initialization code (Jaehoon Chung)

 - fix MSI erratum workaround for HiSilicon Hip06/Hip07 (Dongdong Liu)

 - fix NULL pointer dereference in iProc BCMA driver (Ray Jui)

 - fix Keystone interrupt-controller-node lookup (Johan Hovold)

 - constify qcom driver structures (Julia Lawall)

 - rework Tegra config space mapping to increase space available for
   endpoints (Vidya Sagar)

 - simplify Tegra driver by using bus->sysdata (Manikanta Maddireddy)

 - remove PCI_REASSIGN_ALL_BUS usage on Tegra (Manikanta Maddireddy)

 - add support for Global Fabric Manager Server (GFMS) event to
   Microsemi Switchtec switch driver (Logan Gunthorpe)

 - add IDs for Switchtec PSX 24xG3 and PSX 48xG3 (Kelvin Cao)

* tag 'pci-v4.16-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (140 commits)
  PCI: cadence: Add EndPoint Controller driver for Cadence PCIe controller
  dt-bindings: PCI: cadence: Add DT bindings for Cadence PCIe endpoint controller
  PCI: endpoint: Fix EPF device name to support multi-function devices
  PCI: endpoint: Add the function number as argument to EPC ops
  PCI: cadence: Add host driver for Cadence PCIe controller
  dt-bindings: PCI: cadence: Add DT bindings for Cadence PCIe host controller
  PCI: Add vendor ID for Cadence
  PCI: Add generic function to probe PCI host controllers
  PCI: generic: fix missing call of pci_free_resource_list()
  PCI: OF: Add generic function to parse and allocate PCI resources
  PCI: Regroup all PCI related entries into drivers/pci/Makefile
  PCI/DPC: Reformat DPC register definitions
  PCI/DPC: Add and use DPC Status register field definitions
  PCI/DPC: Squash dpc_rp_pio_get_info() into dpc_process_rp_pio_error()
  PCI/DPC: Remove unnecessary RP PIO register structs
  PCI/DPC: Push dpc->rp_pio_status assignment into dpc_rp_pio_get_info()
  PCI/DPC: Squash dpc_rp_pio_print_error() into dpc_rp_pio_get_info()
  PCI/DPC: Make RP PIO log size check more generic
  PCI/DPC: Rename local "status" to "dpc_status"
  PCI/DPC: Squash dpc_rp_pio_print_tlp_header() into dpc_rp_pio_print_error()
  ...
2018-02-06 09:59:40 -08:00
oulijun 8b9b8d143b RDMA/hns: Fix the endian problem for hns
The hip06 and hip08 run on a little endian ARM, it needs to
revise the annotations to indicate that the HW uses little
endian data in the various DMA buffers, and flow the necessary
swaps throughout.

The imm_data use big endian mode. The cpu_to_le32/le32_to_cpu
swaps are no-op for this, which makes the only substantive
change the handling of imm_data which is now mandatory swapped.

This also keep match with the userspace hns driver and resolve
the warning by sparse.

Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-02-05 10:48:48 -05:00
Jason Gunthorpe e9d1e38927 IB/uverbs: Use the standard kConfig format for experimental
We really don't want people turning this on just yet, make it very
clear with capital letters.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-02-04 11:56:53 -05:00
Jason Gunthorpe 46adb17982 IB: Update references to libibverbs
These days the userspace comes from rdma-core, revise references
in the kernel to point to the current repository.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-02-04 11:56:49 -05:00
Don Hiatt 6197a815fe IB/hfi1: Add 16B rcvhdr trace support
Add trace_hfi1_rcvhdr support for bypass packets.
While here, remove the etype argument as it is available
in struct hfi1_packet.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Don Hiatt <don.hiatt@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-02-01 15:43:32 -07:00
Kamenee Arumugam 953a9cebea IB/hfi1: Convert kzalloc_node and kcalloc to use kcalloc_node
Kzalloc_node API doesn't check for overflows in size multiplication.
While kcalloc API check for overflows in size multiplication
but these implementations are not NUMA-aware.

This conversion allowed for correcting an allocation used in the hot
path to be on the local NUMA and ensure us overflow free multiplication
for the size of a memory allocation.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kamenee Arumugam <kamenee.arumugam@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-02-01 15:43:32 -07:00
Michael J. Ruhl 2ff124d597 IB/core: Avoid a potential OOPs for an unused optional parameter
The ev_file is an optional parameter for CQ creation. If the parameter
is not passed, the ev_file pointer will be NULL.  Using that pointer
to set the cq_context will result in an OOPs.

Verify that ev_file is not NULL before using.

Cc: <stable@vger.kernel.org> # 4.14.x
Fixes: 9ee79fce36 ("IB/core: Add completion queue (cq) object actions")
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-02-01 15:43:32 -07:00
Alex Estrin 1029361084 IB/ipoib: Fix for potential no-carrier state
On reboot SM can program port pkey table before ipoib registered its
event handler, which could result in missing pkey event and leave root
interface with initial pkey value from index 0.

Since OPA port starts with invalid pkey in index 0, root interface will
fail to initialize and stay down with no-carrier flag.

For IB ipoib interface may end up with pkey different from value
opensm put in pkey table idx 0, resulting in connectivity issues
(different mcast groups, for example).

Close the window by calling event handler after registration
to make sure ipoib pkey is in sync with port pkey table.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Alex Estrin <alex.estrin@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-02-01 15:43:31 -07:00
Mitko Haralanov b5de809ef6 IB/hfi1: Show fault stats in both TX and RX directions
The routine which shows the fault stats checks the counters
to determine whether to show any stats based on the number of
transmitted pkts/bytes for a particular opcode.

Unfortunately, it only checked the receive counters. As a result,
if any packet faults have happened for packets egressing the HFI,
those stats would not be shown.

In order to fix this, the routine is amended to also check the
TX counters. With this change the pkt/byte counts are the sum of
both TX and RX counts for the opcode.

Fixes: 1b311f8931 ("IB/hfi1: Add tx_opcode_stats like the opcode_stats")
Reviewed-by: Don Hiatt <don.hiatt@intel.com>
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-02-01 15:43:31 -07:00
Mike Marciniszyn 78d3633ba9 IB/hfi1: Remove blind constants from 16B update
These values were introduced as part of the 16B code to
account for the varying size of the LRH between the differing
packet formats.

Replace the blind constants with defines based on FIELD_SIZEOF()
calls.

Fixes: 5b6cabb0db ("IB/hfi1: Add 16B RC/UC support")
Reviewed-by: Don Hiatt <don.hiatt@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-02-01 15:43:30 -07:00
Kamenee Arumugam 0719007663 IB/hfi1: Convert PortXmitWait/PortVLXmitWait counters to flit times
HFI's counters SendWaitCnt and SendWaitVlCnt are in units
of TXE cycle time (at 805MHz). OPA counters PortXmitWait and
PortVLXmtWait are in units of flit times.
Convert the counter values to flit units using following
conversion formula:

PortXmitWait =
	SendWaitCnt * 2 * (4 /link_width) * (25 Gbps /link_speed)
PortVLXmitWait =
	SendWaitVLCnt * 2 * (4 /link_width) * (25 Gbps /link_speed)

At link up or downgrade events, the link width can change. To ensure
accurate counter calculations, sample the counters after the events,
during counter requests, and then aggregate the OPA counters.

Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Kamenee Arumugam <kamenee.arumugam@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-02-01 15:43:30 -07:00
Bartlomiej Dudek 6391214f4d IB/hfi1: Do not override given pcie_pset value
During PCIe Gen 3 transistion, pcie_pset is read and might be overridden
to a default value(i.e. 255) in do_pcie_gen3_transition() routine.

If the pcie_pset value is overridden then this new value will be used
during initialization of next adapter on a different card.

Introducing a new local variable to avoid modification of pcie_pset

Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Bartlomiej Dudek <bartlomiej.dudek@intel.com>
Signed-off-by: Patel Jay P <jay.p.patel@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-02-01 15:43:30 -07:00
Sebastian Sanchez aca7f4fc32 IB/hfi1: Optimize process_receive_ib()
The arguments for trace_hfi1_rcvhdr() get computed every
time in the hot path regardless of the whether the trace
is on or off. This is seen to be costly with a profile.
The handling of fault inject isolates the verbs device for
all packets regardless of the presence of a RHF_DC_ERR error.

Fix the first by computing trace_hfi1_rcvhdr() arguments within
the trace itself, so that when the trace is off, the argument
data isn't computed. Fix the second by moving the error check to
handle_eflags() when an RHF error occurs and by testing for
RHF_DC_ERR before executing the reset of handle_eflags().

Reviewed-by: Don Hiatt <don.hiatt@intel.com>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Sebastian Sanchez <sebastian.sanchez@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-02-01 15:43:30 -07:00
Sebastian Sanchez ca85bb1ca9 IB/hfi1: Remove unnecessary fecn and becn fields
packet->fecn and packet->becn are calculated in the hot path
and are never used. Remove these fields as they show to be
costly in a profile. Also, remove initialization for
becn and fecn in process_ecn() as they're unconditionally
assigned in the function and ensure fecn and becn variables
use a boolean type.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Sebastian Sanchez <sebastian.sanchez@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-02-01 15:43:29 -07:00
Sebastian Sanchez bdaf96f650 IB/hfi1: Look up ibport using a pointer in receive path
In the receive path, hfi1_ibport is looked up by indexing into an
array. A profile shows this to be expensive. The receive context
data has a pointer to the ibport data, use that pointer instead.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Sebastian Sanchez <sebastian.sanchez@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-02-01 15:43:29 -07:00
Sebastian Sanchez 6d6b8848c8 IB/hfi1: Optimize packet type comparison using 9B and bypass code paths
The packet type comparison used to find out if a packet is a bypass
packet in the hot path is an expensive operation as seen in a profile.

Determine packet's pkey and migration bit through the bypass and 9B
code paths instead.

Reviewed-by: Don Hiatt <don.hiatt@intel.com>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Sebastian Sanchez <sebastian.sanchez@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-02-01 15:43:28 -07:00
Sebastian Sanchez f150e2736f IB/hfi1: Compute BTH only for RDMA_WRITE_LAST/SEND_LAST packet
In hfi1_rc_rcv(), BTH is computed for all packets received.
However, it's only used for packets received with opcodes
RDMA_WRITE_LAST and SEND_LAST, and it is a costly operation.

Compute BTH only in the RDMA_WRITE_LAST/SEND_LAST code path
and let the compiler handle endianness conversion for bitwise
operations.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Sebastian Sanchez <sebastian.sanchez@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-02-01 15:43:28 -07:00
Mitko Haralanov 9636258f10 IB/hfi1: Remove dependence on qp->s_hdrwords
The s_hdrwords variable was used to indicate whether a
packet was already built on a previous iteration of the
send engine. This variable assumed the protection of the
QP's RVT_S_BUSY flag, which was required since the the
QP's s_lock was dropped just prior to the packet being
queued on the one of the egress mechanisms.

Support for multiple send engine instantiations require
that the field not be used due to concurency issues.
The ps.txreq signals the "already built" without the
potential concurency issues.

Fix by getting rid of all s_hdrword usage.   A wrapper
is added to test for the already built case that used to
use s_hdrwords.

What used to be stored in s_hdrwords is now in the txreq.
The PBC is not counted, but is added in the pio/sdma code
paths prior to posting the packet.

Reviewed-by: Don Hiatt <don.hiatt@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-02-01 15:24:32 -07:00
Alex Estrin 2b1e7fe161 IB/hfi1: Fix for potential refcount leak in hfi1_open_file()
The dd refcount is speculatively incremented prior to allocating
the fd memory with kzalloc(). If that kzalloc() failed the dd
refcount leaks.
Increment refcount on kzalloc success.

Fixes: e11ffbd575 ("IB/hfi1: Do not free hfi1 cdev parent structure early")
Reviewed-by: Michael J Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Alex Estrin <alex.estrin@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-02-01 15:24:32 -07:00
Alex Estrin 473291b3ea IB/hfi1: Fix for early release of sdma context
With IRQF_SHARED flag set and CONFIG_DEBUG_SHIRQ enabled
module removal may result in panic in sdma_interrupt() routine
if associated sdma context was released before pci_free_irq();

[ 9198.939885] BUG: unable to handle kernel NULL pointer dereference at           (null)
[ 9198.940514] IP: sdma_make_progress+0xa5/0x450 [hfi1]
[ 9198.941114] PGD 170bdc0067 P4D 170bdc0067 PUD 172063e067 PMD 0
[ 9198.941783] Oops: 0000 [#1] SMP
.....
[ 9198.958877] CPU: 132 PID: 64173 Comm: rmmod Tainted: G           OE   4.14.0-rc4+ #1
[ 9198.961032] Hardware name: Intel Corporation S7200AP/S7200AP, BIOS S72C610.86B.01.02.0118.080620171935 08/06/2017
[ 9198.963323] task: ffff9681397f0000 task.stack: ffffae1647c40000
[ 9198.965695] RIP: 0010:sdma_make_progress+0xa5/0x450 [hfi1]
[ 9198.968082] RSP: 0018:ffffae1647c43be8 EFLAGS: 00010046
[ 9198.970503] RAX: 0000000000000000 RBX: ffff9680ce8b5ca8 RCX: 0000000000000000
[ 9198.973006] RDX: 0000000000000000 RSI: 0000000001a00d28 RDI: ffff9680ce8b5ca0
[ 9198.975546] RBP: ffffae1647c43c40 R08: ffff96814325ec00 R09: 00000000ffffffff
[ 9198.978142] R10: 000000004325e501 R11: ffff96814325ec00 R12: ffff9680ce8b5c44
[ 9198.980779] R13: ffff9680ce8b5ca0 R14: 0000000000000000 R15: ffff9680ce8b5b00
[ 9198.983462] FS:  00007f31196ba740(0000) GS:ffff96819df00000(0000) knlGS:0000000000000000
[ 9198.986231] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 9198.989036] CR2: 0000000000000000 CR3: 000000170833f000 CR4: 00000000001406e0
[ 9198.991911] Call Trace:
[ 9198.994847]  sdma_engine_interrupt+0x82/0x100 [hfi1]
[ 9198.997852]  sdma_interrupt+0x61/0xc0 [hfi1]
[ 9199.000852]  __free_irq+0x1b3/0x2d0
[ 9199.003873]  free_irq+0x35/0x70
[ 9199.006909]  pci_free_irq+0x1c/0x30
[ 9199.009999]  clean_up_interrupts+0x53/0xf0 [hfi1]
[ 9199.013137]  hfi1_start_cleanup+0x117/0x190 [hfi1]
[ 9199.016315]  postinit_cleanup+0x1d/0x270 [hfi1]
[ 9199.019529]  remove_one+0x1f3/0x210 [hfi1]
[ 9199.022738]  pci_device_remove+0x39/0xc0
[ 9199.025974]  device_release_driver_internal+0x141/0x210
[ 9199.029268]  driver_detach+0x3f/0x80
[ 9199.032580]  bus_remove_driver+0x55/0xd0
[ 9199.035931]  driver_unregister+0x2c/0x50
[ 9199.039321]  pci_unregister_driver+0x2a/0xa0
[ 9199.042755]  hfi1_mod_cleanup+0x10/0xb50 [hfi1]
[ 9199.046196]  SyS_delete_module+0x171/0x250
...

Fix by exporting sdma_clean() and removing from sdma_exit().
sdma_exit() now just manipulates the engine state,
leaving the memory free to sdma_clean() which is now called
just before the dd is freed.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Michael J Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Alex Estrin <alex.estrin@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-02-01 15:24:32 -07:00
Michael J. Ruhl 82a9792656 IB/hfi1: Re-order IRQ cleanup to address driver cleanup race
The pci_request_irq() interfaces always adds the IRQF_SHARED bit to
all IRQ requests.

When the kernel is built with CONFIG_DEBUG_SHIRQ config flag, if the
IRQF_SHARED bit is set, a call to the IRQ handler is made from the
__free_irq() function. This is testing a race condition between the
IRQ cleanup and an IRQ racing the cleanup.  The HFI driver should be
able to handle this race, but does not.

This race can cause traces that start with this footprint:

BUG: unable to handle kernel NULL pointer dereference at   (null)
Call Trace:
 <hfi1 irq handler>
 ...
 __free_irq+0x1b3/0x2d0
 free_irq+0x35/0x70
 pci_free_irq+0x1c/0x30
 clean_up_interrupts+0x53/0xf0 [hfi1]
 hfi1_start_cleanup+0x122/0x190 [hfi1]
 postinit_cleanup+0x1d/0x280 [hfi1]
 remove_one+0x233/0x250 [hfi1]
 pci_device_remove+0x39/0xc0

Export IRQ cleanup function so it can be called from other modules.

Using the exported cleanup function:

  Re-order the driver cleanup code to clean up IRQ resources before
  other resources, eliminating the race.

  Re-order error path for init so that the race does not occur.

Reduce severity on spurious error message for SDMA IRQs to info.

Reviewed-by: Alex Estrin <alex.estrin@intel.com>
Reviewed-by: Patel Jay P <jay.p.patel@intel.com>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-02-01 15:24:32 -07:00
Dan Carpenter f34727a135 RDMA/nldev: missing error code in nldev_res_get_doit()
We should return -ENOMEM if the allocation fails.  The current code
accidentally returns success.

Fixes: bf3c5a93c5 ("RDMA/nldev: Provide global resource utilization")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-02-01 15:24:32 -07:00
oulijun 0da6550366 RDMA/hns: Fix misplaced call to hns_roce_cleanup_hem_table
The mtt_table is cleaned up during the err_unmap_cqe label, it is a
mistake to duplicate the cleanup during the later unwind labels.

Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-02-01 15:24:32 -07:00
oulijun fd012f1c4f RDMA/hns: Add names to function arguments in function pointers
This patch mainly fix some style warings matched with the new checkpatch
requirement. The warning as follows:

WARNING: function definition argument 'struct hns_roce_cq *' should also have
an identifier name

Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-02-01 15:24:32 -07:00
oulijun c27991198c RDMA/hns: Remove unnecessary operator
The double not-operator is unncessary when used in a boolean context. This
patch removes them.

Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-02-01 15:24:32 -07:00
Markus Elfring e5b8984386 RDMA/bnxt_re: Use common error handling code in bnxt_qplib_alloc_dpi_tbl()
Add a jump target so that a bit of exception handling can be better reused
at the end of this function.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Acked-by: Devesh Sharma <devesh.sharma@broadcom.com>
Acked-by: Jonathan Toppins <jtoppins@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-02-01 15:24:31 -07:00
Markus Elfring f390b71b65 RDMA/bnxt_re: Delete two error messages for a failed memory allocation in bnxt_qplib_alloc_dpi_tbl()
Omit extra messages for a memory allocation failure in this function.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Acked-by: Devesh Sharma <devesh.sharma@broadcom.com>
Acked-by: Jonathan Toppins <jtoppins@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-02-01 15:24:31 -07:00
Linus Torvalds 73da9e1a9f Merge branch 'akpm' (patches from Andrew)
Merge updates from Andrew Morton:

 - misc fixes

 - ocfs2 updates

 - most of MM

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (118 commits)
  mm: remove PG_highmem description
  tools, vm: new option to specify kpageflags file
  mm/swap.c: make functions and their kernel-doc agree
  mm, memory_hotplug: fix memmap initialization
  mm: correct comments regarding do_fault_around()
  mm: numa: do not trap faults on shared data section pages.
  hugetlb, mbind: fall back to default policy if vma is NULL
  hugetlb, mempolicy: fix the mbind hugetlb migration
  mm, hugetlb: further simplify hugetlb allocation API
  mm, hugetlb: get rid of surplus page accounting tricks
  mm, hugetlb: do not rely on overcommit limit during migration
  mm, hugetlb: integrate giga hugetlb more naturally to the allocation path
  mm, hugetlb: unify core page allocation accounting and initialization
  mm/memcontrol.c: try harder to decrease [memory,memsw].limit_in_bytes
  mm/memcontrol.c: make local symbol static
  mm/hmm: fix uninitialized use of 'entry' in hmm_vma_walk_pmd()
  include/linux/mmzone.h: fix explanation of lower bits in the SPARSEMEM mem_map pointer
  mm/compaction.c: fix comment for try_to_compact_pages()
  mm/page_ext.c: make page_ext_init a noop when CONFIG_PAGE_EXTENSION but nothing uses it
  zsmalloc: use U suffix for negative literals being shifted
  ...
2018-01-31 18:46:22 -08:00
David Rientjes 5ff7091f5a mm, mmu_notifier: annotate mmu notifiers with blockable invalidate callbacks
Commit 4d4bbd8526 ("mm, oom_reaper: skip mm structs with mmu
notifiers") prevented the oom reaper from unmapping private anonymous
memory with the oom reaper when the oom victim mm had mmu notifiers
registered.

The rationale is that doing mmu_notifier_invalidate_range_{start,end}()
around the unmap_page_range(), which is needed, can block and the oom
killer will stall forever waiting for the victim to exit, which may not
be possible without reaping.

That concern is real, but only true for mmu notifiers that have
blockable invalidate_range_{start,end}() callbacks.  This patch adds a
"flags" field to mmu notifier ops that can set a bit to indicate that
these callbacks do not block.

The implementation is steered toward an expensive slowpath, such as
after the oom reaper has grabbed mm->mmap_sem of a still alive oom
victim.

[rientjes@google.com: mmu_notifier_invalidate_range_end() can also call the invalidate_range() must not block, fix comment]
  Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1801091339570.240101@chino.kir.corp.google.com
[akpm@linux-foundation.org: make mm_has_blockable_invalidate_notifiers() return bool, use rwsem_is_locked()]
Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1712141329500.74052@chino.kir.corp.google.com
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Christian König <christian.koenig@amd.com>
Acked-by: Dimitri Sivanich <sivanich@hpe.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Oded Gabbay <oded.gabbay@gmail.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
Cc: Sean Hefty <sean.hefty@intel.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31 17:18:38 -08:00
Linus Torvalds b2fe5fa686 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:

 1) Significantly shrink the core networking routing structures. Result
    of http://vger.kernel.org/~davem/seoul2017_netdev_keynote.pdf

 2) Add netdevsim driver for testing various offloads, from Jakub
    Kicinski.

 3) Support cross-chip FDB operations in DSA, from Vivien Didelot.

 4) Add a 2nd listener hash table for TCP, similar to what was done for
    UDP. From Martin KaFai Lau.

 5) Add eBPF based queue selection to tun, from Jason Wang.

 6) Lockless qdisc support, from John Fastabend.

 7) SCTP stream interleave support, from Xin Long.

 8) Smoother TCP receive autotuning, from Eric Dumazet.

 9) Lots of erspan tunneling enhancements, from William Tu.

10) Add true function call support to BPF, from Alexei Starovoitov.

11) Add explicit support for GRO HW offloading, from Michael Chan.

12) Support extack generation in more netlink subsystems. From Alexander
    Aring, Quentin Monnet, and Jakub Kicinski.

13) Add 1000BaseX, flow control, and EEE support to mvneta driver. From
    Russell King.

14) Add flow table abstraction to netfilter, from Pablo Neira Ayuso.

15) Many improvements and simplifications to the NFP driver bpf JIT,
    from Jakub Kicinski.

16) Support for ipv6 non-equal cost multipath routing, from Ido
    Schimmel.

17) Add resource abstration to devlink, from Arkadi Sharshevsky.

18) Packet scheduler classifier shared filter block support, from Jiri
    Pirko.

19) Avoid locking in act_csum, from Davide Caratti.

20) devinet_ioctl() simplifications from Al viro.

21) More TCP bpf improvements from Lawrence Brakmo.

22) Add support for onlink ipv6 route flag, similar to ipv4, from David
    Ahern.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1925 commits)
  tls: Add support for encryption using async offload accelerator
  ip6mr: fix stale iterator
  net/sched: kconfig: Remove blank help texts
  openvswitch: meter: Use 64-bit arithmetic instead of 32-bit
  tcp_nv: fix potential integer overflow in tcpnv_acked
  r8169: fix RTL8168EP take too long to complete driver initialization.
  qmi_wwan: Add support for Quectel EP06
  rtnetlink: enable IFLA_IF_NETNSID for RTM_NEWLINK
  ipmr: Fix ptrdiff_t print formatting
  ibmvnic: Wait for device response when changing MAC
  qlcnic: fix deadlock bug
  tcp: release sk_frag.page in tcp_disconnect
  ipv4: Get the address of interface correctly.
  net_sched: gen_estimator: fix lockdep splat
  net: macb: Handle HRESP error
  net/mlx5e: IPoIB, Fix copy-paste bug in flow steering refactoring
  ipv6: addrconf: break critical section in addrconf_verify_rtnl()
  ipv6: change route cache aging logic
  i40e/i40evf: Update DESC_NEEDED value to reflect larger value
  bnxt_en: cleanup DIM work on device shutdown
  ...
2018-01-31 14:31:10 -08:00
Zhu Yanjun 316663c6bf IB/rxe: remove redudant parameter in rxe_av_fill_ip_info
In the function rxe_av_fill_ip_info, the parameter rxe is not used.
So it is removed.

CC: Srinivas Eeda <srinivas.eeda@oracle.com>
CC: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-01-31 16:32:10 -05:00
Zhu Yanjun 45a290f8f6 IB/rxe: change the function rxe_av_fill_ip_info to void
The function rxe_av_fill_ip_info always returns 0. So the function
type is changed to void.

CC: Srinivas Eeda <srinivas.eeda@oracle.com>
CC: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-01-31 16:32:10 -05:00
Zhu Yanjun a402dc445d IB/rxe: change the function to void from int
Since the function rxe_av_to_attr always return 0, the function
type is changed to void.

CC: Srinivas Eeda <srinivas.eeda@oracle.com>
CC: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-01-31 16:32:10 -05:00
Zhu Yanjun 9c96f3d4dd IB/rxe: remove unnecessary parameter in rxe_av_to_attr
In the function rxe_av_to_attr, the parameter rxe is
not used. So it is removed.

CC: Srinivas Eeda <srinivas.eeda@oracle.com>
CC: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-01-31 16:32:10 -05:00
Zhu Yanjun ca3d9feee6 IB/rxe: change the function to void from int
The function rxe_av_from_attr always return 0. So change the
function to void.

CC: Srinivas Eeda <srinivas.eeda@oracle.com>
CC: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-01-31 16:32:10 -05:00
Zhu Yanjun 1a241db19c IB/rxe: remove redudant parameter in function
In the function rxe_av_from_attr, the parameter rxe
is not used. So it is removed.

CC: Srinivas Eeda <srinivas.eeda@oracle.com>
CC: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-01-31 16:32:10 -05:00
Dan Carpenter 8de9399a24 RDMA/bnxt_re: Fix an error code in bnxt_qplib_create_srq()
We should return -ENOMEM if the allocation fails.  (The current code
returns succees).

Fixes: 37cb11acf1 ("RDMA/bnxt_re: Add SRQ support for Broadcom adapters")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-By: Devesh Sharma <devesh.sharma@broadcom.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-01-31 16:19:50 -05:00
Doug Ledford 589ccd8b04 RDMA/bnxt_re: Fix static checker warning
If there is ever any error while creating srq->umem, we return that
error, we don't store it in srq->umem, so any check of srq->umem for
IS_ERR is pointless.  Further, checking udata is unnecessary as
srq->umem is always either NULL or valid, without respect to udata.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-01-31 16:19:50 -05:00
Linus Torvalds 7b1cd95d65 First merge window pull request for 4.16
- Misc small driver fixups to
   bnxt_re/hfi1/qib/hns/ocrdma/rdmavt/vmw_pvrdma/nes
 - Several major feature adds to bnxt_re driver: SRIOV VF RoCE support,
   HugePages support, extended hardware stats support, and SRQ support
 - A notable number of fixes to the i40iw driver from debugging scale up
   testing
 - More work to enable the new hip08 chip in the hns driver
 - Misc small ULP fixups to srp/srpt//ipoib
 - Preparation for srp initiator and target to support the RDMA-CM
   protocol for connections
 - Add RDMA-CM support to srp initiator, srp target is still a WIP
 - Fixes for a couple of places where ipoib could spam the dmesg log
 - Fix encode/decode of FDR/EDR data rates in the core
 - Many patches from Parav with ongoing work to clean up inconsistencies
   and bugs in RoCE support around the rdma_cm
 - mlx5 driver support for the userspace features 'thread domain', 'wallclock
   timestamps' and 'DV Direct Connected transport'. Support for the firmware
   dual port rocee capability
 - Core support for more than 32 rdma devices in the char dev allocation
 - kernel doc updates from Randy Dunlap
 - New netlink uAPI for inspecting RDMA objects similar in spirit to 'ss'
 - One minor change to the kobject code acked by GKH
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCgAGBQJacfljAAoJEDht9xV+IJsaUnwP+QFJvfIDEfRlfU2rTmcfymPs
 Rz9bW1KLgETcJx/XOE2ba2DOaqdFr56TLflsDfEfOSIL8AtzBQqH3vTqEj49bBP7
 4JZAkzWllUS/qoYD2XmvOM0IrIfFXzZtLM/lzLi+5dwK26x3GAB9hHXpKzUrJ1vj
 I1Naq14qOFXoNBndEtZJqtIKOhR/Pnd6YtxAiNCmViZGdqm3DIU3D4VJhU5B7pO9
 j6ovJs16wfJl/gV1iiz9xO49ViVFpwzSIzYE/Q2ZCegcrsF3EEVN2J4vZHkKgDuN
 0/Ar/WOvkPzKBFR8hJ7M4kwp0Fy/69/U49s7kpGNxdhML9sU3+Qfse6JYGj0M9L8
 01gTM0SShyAZMNAvjVFbIKLQPg806OAit4cooMwlObbwJ6b7B8K0uN17/uVIkIqp
 gXqertyl1BLhUtTOby/8Fox/f/oEvaZksKiwcTKSb7D1Y5jGZZUPRknJ5SwAFWQB
 RiTPJ6mY7BUsM9zuYQtRE8x2mpgIezYXFcrAz7iT76WuoZQgo1QLIyYRM1+MlhnC
 wNrp5BtqoVfW2Ps0CbSdxJ9vDtDf3cwLg0RzcCB8+NJJccsRD9IVMDev/TDY5k9U
 M9LxxtW3WuulRWgliU0Q9VaswUQoIao16vBMVL7GwUm+ClLvbRVoPe8jxgtfk+W3
 GAANAI7Kv/vUoV/6CFfP
 =sMXV
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull RDMA subsystem updates from Jason Gunthorpe:
 "Overall this cycle did not have any major excitement, and did not
  require any shared branch with netdev.

  Lots of driver updates, particularly of the scale-up and performance
  variety. The largest body of core work was Parav's patches fixing and
  restructing some of the core code to make way for future RDMA
  containerization.

  Summary:

   - misc small driver fixups to
     bnxt_re/hfi1/qib/hns/ocrdma/rdmavt/vmw_pvrdma/nes

   - several major feature adds to bnxt_re driver: SRIOV VF RoCE
     support, HugePages support, extended hardware stats support, and
     SRQ support

   - a notable number of fixes to the i40iw driver from debugging scale
     up testing

   - more work to enable the new hip08 chip in the hns driver

   - misc small ULP fixups to srp/srpt//ipoib

   - preparation for srp initiator and target to support the RDMA-CM
     protocol for connections

   - add RDMA-CM support to srp initiator, srp target is still a WIP

   - fixes for a couple of places where ipoib could spam the dmesg log

   - fix encode/decode of FDR/EDR data rates in the core

   - many patches from Parav with ongoing work to clean up
     inconsistencies and bugs in RoCE support around the rdma_cm

   - mlx5 driver support for the userspace features 'thread domain',
     'wallclock timestamps' and 'DV Direct Connected transport'. Support
     for the firmware dual port rocee capability

   - core support for more than 32 rdma devices in the char dev
     allocation

   - kernel doc updates from Randy Dunlap

   - new netlink uAPI for inspecting RDMA objects similar in spirit to 'ss'

   - one minor change to the kobject code acked by Greg KH"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (259 commits)
  RDMA/nldev: Provide detailed QP information
  RDMA/nldev: Provide global resource utilization
  RDMA/core: Add resource tracking for create and destroy PDs
  RDMA/core: Add resource tracking for create and destroy CQs
  RDMA/core: Add resource tracking for create and destroy QPs
  RDMA/restrack: Add general infrastructure to track RDMA resources
  RDMA/core: Save kernel caller name when creating PD and CQ objects
  RDMA/core: Use the MODNAME instead of the function name for pd callers
  RDMA: Move enum ib_cq_creation_flags to uapi headers
  IB/rxe: Change RDMA_RXE kconfig to use select
  IB/qib: remove qib_keys.c
  IB/mthca: remove mthca_user.h
  RDMA/cm: Fix access to uninitialized variable
  RDMA/cma: Use existing netif_is_bond_master function
  IB/core: Avoid SGID attributes query while converting GID from OPA to IB
  RDMA/mlx5: Avoid memory leak in case of XRCD dealloc failure
  IB/umad: Fix use of unprotected device pointer
  IB/iser: Combine substrings for three messages
  IB/iser: Delete an unnecessary variable initialisation in iser_send_data_out()
  IB/iser: Delete an error message for a failed memory allocation in iser_send_data_out()
  ...
2018-01-31 12:05:10 -08:00
Linus Torvalds 168fe32a07 Merge branch 'misc.poll' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull poll annotations from Al Viro:
 "This introduces a __bitwise type for POLL### bitmap, and propagates
  the annotations through the tree. Most of that stuff is as simple as
  'make ->poll() instances return __poll_t and do the same to local
  variables used to hold the future return value'.

  Some of the obvious brainos found in process are fixed (e.g. POLLIN
  misspelled as POLL_IN). At that point the amount of sparse warnings is
  low and most of them are for genuine bugs - e.g. ->poll() instance
  deciding to return -EINVAL instead of a bitmap. I hadn't touched those
  in this series - it's large enough as it is.

  Another problem it has caught was eventpoll() ABI mess; select.c and
  eventpoll.c assumed that corresponding POLL### and EPOLL### were
  equal. That's true for some, but not all of them - EPOLL### are
  arch-independent, but POLL### are not.

  The last commit in this series separates userland POLL### values from
  the (now arch-independent) kernel-side ones, converting between them
  in the few places where they are copied to/from userland. AFAICS, this
  is the least disruptive fix preserving poll(2) ABI and making epoll()
  work on all architectures.

  As it is, it's simply broken on sparc - try to give it EPOLLWRNORM and
  it will trigger only on what would've triggered EPOLLWRBAND on other
  architectures. EPOLLWRBAND and EPOLLRDHUP, OTOH, are never triggered
  at all on sparc. With this patch they should work consistently on all
  architectures"

* 'misc.poll' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (37 commits)
  make kernel-side POLL... arch-independent
  eventpoll: no need to mask the result of epi_item_poll() again
  eventpoll: constify struct epoll_event pointers
  debugging printk in sg_poll() uses %x to print POLL... bitmap
  annotate poll(2) guts
  9p: untangle ->poll() mess
  ->si_band gets POLL... bitmap stored into a user-visible long field
  ring_buffer_poll_wait() return value used as return value of ->poll()
  the rest of drivers/*: annotate ->poll() instances
  media: annotate ->poll() instances
  fs: annotate ->poll() instances
  ipc, kernel, mm: annotate ->poll() instances
  net: annotate ->poll() instances
  apparmor: annotate ->poll() instances
  tomoyo: annotate ->poll() instances
  sound: annotate ->poll() instances
  acpi: annotate ->poll() instances
  crypto: annotate ->poll() instances
  block: annotate ->poll() instances
  x86: annotate ->poll() instances
  ...
2018-01-30 17:58:07 -08:00
Linus Torvalds d772794637 Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU updates from Ingo Molnar:
 "The main RCU changes in this cycle were:

   - Updates to use cond_resched() instead of cond_resched_rcu_qs()
     where feasible (currently everywhere except in kernel/rcu and in
     kernel/torture.c). Also a couple of fixes to avoid sending IPIs to
     offline CPUs.

   - Updates to simplify RCU's dyntick-idle handling.

   - Updates to remove almost all uses of smp_read_barrier_depends() and
     read_barrier_depends().

   - Torture-test updates.

   - Miscellaneous fixes"

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (72 commits)
  torture: Save a line in stutter_wait(): while -> for
  torture: Eliminate torture_runnable and perf_runnable
  torture: Make stutter less vulnerable to compilers and races
  locking/locktorture: Fix num reader/writer corner cases
  locking/locktorture: Fix rwsem reader_delay
  torture: Place all torture-test modules in one MAINTAINERS group
  rcutorture/kvm-build.sh: Skip build directory check
  rcutorture: Simplify functions.sh include path
  rcutorture: Simplify logging
  rcutorture/kvm-recheck-*: Improve result directory readability check
  rcutorture/kvm.sh: Support execution from any directory
  rcutorture/kvm.sh: Use consistent help text for --qemu-args
  rcutorture/kvm.sh: Remove unused variable, `alldone`
  rcutorture: Remove unused script, config2frag.sh
  rcutorture/configinit: Fix build directory error message
  rcutorture: Preempt RCU-preempt readers more vigorously
  torture: Reduce #ifdefs for preempt_schedule()
  rcu: Remove have_rcu_nocb_mask from tree_plugin.h
  rcu: Add comment giving debug strategy for double call_rcu()
  tracing, rcu: Hide trace event rcu_nocb_wake when not used
  ...
2018-01-30 10:15:30 -08:00
Jason Gunthorpe e7996a9a77 Linux 4.15
-----BEGIN PGP SIGNATURE-----
 
 iQEcBAABAgAGBQJabj6pAAoJEHm+PkMAQRiGs8cIAJQFkCWnbz86e3vG4DuWhyA8
 CMGHCQdUOxxFGa/ixhIiuetbC0x+JVHAjV2FwVYbAQfaZB3pfw2iR1ncQxpAP1AI
 oLU9vBEqTmwKMPc9CM5rRfnLFWpGcGwUNzgPdxD5yYqGDtcM8K840mF6NdkYe5AN
 xU8rv1wlcFPF4A5pvHCH0pvVmK4VxlVFk/2H67TFdxBs4PyJOnSBnf+bcGWgsKO6
 hC8XIVtcKCH2GfFxt5d0Vgc5QXJEpX1zn2mtCa1MwYRjN2plgYfD84ha0xE7J0B0
 oqV/wnjKXDsmrgVpncr3txd4+zKJFNkdNRE4eLAIupHo2XHTG4HvDJ5dBY2NhGU=
 =sOml
 -----END PGP SIGNATURE-----

Merge tag v4.15 of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git

To resolve conflicts in:
 drivers/infiniband/hw/mlx5/main.c
 drivers/infiniband/hw/mlx5/qp.c

From patches merged into the -rc cycle. The conflict resolution matches
what linux-next has been carrying.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-01-30 09:30:00 -07:00
Leon Romanovsky b5fa635aab RDMA/nldev: Provide detailed QP information
Implement RDMA nldev netlink interface to get detailed information on each
QP in the system. This includes the owning process or kernel ULP and
detailed information from the qp_attrs.

Currently only the dumpit variant is implemented.

Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-01-29 20:21:41 -07:00
Leon Romanovsky bf3c5a93c5 RDMA/nldev: Provide global resource utilization
Expose through the netlink interface the global per-device utilization of
the supported object types.

Provide both dumpit and doit callbacks.

As an example of possible output from rdmatool for system with 5
mlx5 cards:

$ rdma res
1: mlx5_0: qp 4 cq 5 pd 3
2: mlx5_1: qp 4 cq 5 pd 3
3: mlx5_2: qp 4 cq 5 pd 3
4: mlx5_3: qp 2 cq 3 pd 2
5: mlx5_4: qp 4 cq 5 pd 3

Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-01-29 20:21:40 -07:00
Leon Romanovsky 9d5f8c209b RDMA/core: Add resource tracking for create and destroy PDs
Track create and destroy operations of PD objects.

Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-01-29 20:21:40 -07:00
Leon Romanovsky 08f294a152 RDMA/core: Add resource tracking for create and destroy CQs
Track create and destroy operations of CQ objects.

Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-01-29 20:21:40 -07:00
Leon Romanovsky 78a0cd648a RDMA/core: Add resource tracking for create and destroy QPs
Track create and destroy operations of QP objects.

Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-01-29 20:21:39 -07:00
Leon Romanovsky 02d8883f52 RDMA/restrack: Add general infrastructure to track RDMA resources
The RDMA subsystem has very strict set of objects to work with, but it
completely lacks tracking facilities and has no visibility of resource
utilization.

The following patch adds such infrastructure to keep track of RDMA
resources to help with debugging of user space applications. The primary
user of this infrastructure is RDMA nldev netlink (following patches), to
be exposed to userspace via rdmatool, but it is not limited too that.

At this stage, the main three objects (PD, CQ and QP) are added, and more
will be added later.

Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-01-29 20:21:39 -07:00
Leon Romanovsky f66c8ba4c9 RDMA/core: Save kernel caller name when creating PD and CQ objects
The KBUILD_MODNAME variable contains the module name and it is known for
kernel users during compilation, so let's reuse it to track the owners.

Followup patches will store this for resource tracking.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-01-29 14:01:44 -07:00
Jason Gunthorpe beb801ac51 RDMA: Move enum ib_cq_creation_flags to uapi headers
The flags field the enum is used with comes directly from the uapi
so it belongs in the uapi headers for clarity and so userspace can
use it.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-01-29 12:58:34 -07:00
Jason Gunthorpe 0812ed1321 IB/rxe: Change RDMA_RXE kconfig to use select
NET_UDP_TUNNEL is not user selectable, so it should be used as a select
in kconfig.

CRYPTO_CRC32 is a required library for RDMA_RXE so it should active
automatically, as most other CRYPTO_ users do.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-01-29 12:58:00 -07:00
David S. Miller 3e3ab9ccca Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-29 10:15:51 -05:00
Corentin Labbe 3356c8c0b8 IB/qib: remove qib_keys.c
qib_keys.c was left uncompilable in commit 7c2e11fe2d ("IB/qib: Remove qp and mr functionality from qib")
Since nothing need it, remove it from tree.

Signed-off-by: Corentin Labbe <clabbe@baylibre.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-01-28 14:07:16 -07:00
Corentin Labbe 57ac30c0ef IB/mthca: remove mthca_user.h
mthca_user.h is unused since commit 486f60954c ("IB/mthca: Move user vendor structures")
Remove it from tree.

Signed-off-by: Corentin Labbe <clabbe@baylibre.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-01-28 14:07:16 -07:00
Leon Romanovsky 925f7ea7a6 RDMA/cm: Fix access to uninitialized variable
The ndev will be initialized and held only for successful
ib_get_cached_gid(), otherwise it is garbage stack memory.
Calling dev_put() in failure path is wrong.

Fixes: 16c72e4028 ("IB/cm: Refactor to avoid setting path record software only fields")
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-01-28 14:07:16 -07:00
Parav Pandit 3cd96fddcc RDMA/cma: Use existing netif_is_bond_master function
When checking whatever the current netdev is the bond master interface,
use kernel API netif_is_bond_master() instead of hardcoding the check.
No functionality is changed.

Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-01-28 14:07:16 -07:00
Parav Pandit 708ea056b3 IB/core: Avoid SGID attributes query while converting GID from OPA to IB
SGID attributes are not used during OPA to IB GID conversion.
Therefore don't query it.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-01-28 14:07:16 -07:00
Leon Romanovsky b081808a66 RDMA/mlx5: Avoid memory leak in case of XRCD dealloc failure
Failure in XRCD FW deallocation command leaves memory leaked and
returns error to the user which he can't do anything about it.

This patch changes behavior to always free memory and always return
success to the user.

Fixes: e126ba97db ("mlx5: Add driver for Mellanox Connect-IB adapters")
Reviewed-by: Majd Dibbiny <majd@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-01-28 14:07:16 -07:00
Jack Morgenstein f23a5350e4 IB/umad: Fix use of unprotected device pointer
The ib_write_umad() is protected by taking the umad file mutex.
However, it accesses file->port->ib_dev -- which is protected only by the
port's mutex (field file_mutex).

The ib_umad_remove_one() calls ib_umad_kill_port() which sets
port->ib_dev to NULL under the port mutex (NOT the file mutex).
It then sets the mad agent to "dead" under the umad file mutex.

This is a race condition -- because there is a window where
port->ib_dev is NULL, while the agent is not "dead".

As a result, we saw stack traces like:

[16490.678059] BUG: unable to handle kernel NULL pointer dereference at 00000000000000b0
[16490.678246] IP: ib_umad_write+0x29c/0xa3a [ib_umad]
[16490.678333] PGD 0 P4D 0
[16490.678404] Oops: 0000 [#1] SMP PTI
[16490.678466] Modules linked in: rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx4_en(OE) ptp pps_core mlx4_ib(OE-) ib_core(OE) mlx4_core(OE) mlx_compat
(OE) memtrack(OE) devlink mst_pciconf(OE) mst_pci(OE) netconsole nfsv3 nfs_acl nfs lockd grace fscache cfg80211 rfkill esp6_offload esp6 esp4_offload esp4 sunrpc kvm_intel kvm ppdev parport_pc irqbypass
parport joydev i2c_piix4 virtio_balloon cirrus drm_kms_helper ttm drm e1000 serio_raw virtio_pci virtio_ring virtio ata_generic pata_acpi qemu_fw_cfg [last unloaded: mlxfw]
[16490.679202] CPU: 4 PID: 3115 Comm: sminfo Tainted: G           OE   4.14.13-300.fc27.x86_64 #1
[16490.679339] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu2 04/01/2014
[16490.679477] task: ffff9cf753890000 task.stack: ffffaf70c26b0000
[16490.679571] RIP: 0010:ib_umad_write+0x29c/0xa3a [ib_umad]
[16490.679664] RSP: 0018:ffffaf70c26b3d90 EFLAGS: 00010202
[16490.679747] RAX: 0000000000000010 RBX: ffff9cf75610fd80 RCX: 0000000000000000
[16490.679856] RDX: 0000000000000001 RSI: 00007ffdf2bfd714 RDI: ffff9cf6bb2a9c00

In the above trace, ib_umad_write is trying to dereference the NULL
file->port->ib_dev pointer.

Fix this by using the agent's device pointer (the device field
in struct ib_mad_agent) -- which IS protected by the umad file mutex.

Cc: <stable@vger.kernel.org> # v4.11
Fixes: 44c58487d5 ("IB/core: Define 'ib' and 'roce' rdma_ah_attr types")
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-01-28 14:07:16 -07:00
Markus Elfring 4cb24c550f IB/iser: Combine substrings for three messages
The script "checkpatch.pl" pointed information out like the following.

WARNING: quoted string split across lines

Thus fix the affected source code places.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-01-28 14:07:16 -07:00
Markus Elfring d769e6be24 IB/iser: Delete an unnecessary variable initialisation in iser_send_data_out()
The variable "tx_desc" will be set to an appropriate pointer a bit later.
Thus omit the explicit initialisation at the beginning.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-01-28 14:07:16 -07:00
Markus Elfring cac5141199 IB/iser: Delete an error message for a failed memory allocation in iser_send_data_out()
Omit an extra message for a memory allocation failure in this function.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-01-28 14:07:16 -07:00
Davidlohr Bueso 487f6683f1 IB/mthca: Fix gup usage in mthca_map_user_db()
get_user_pages() must be called with mmap_sem held, currently
it is not. In fact it is called under the user db_table->mutex.
To fix this we can convert gup to use the fast alternative,
and safely avoid taking mmap_sem, if possible. Furthermore
this is safe wrt to the mutex as other callers that take the
lock (unmap and alloc_db) are not called under mmap_sem
(hence possible deadlock).

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-01-26 10:43:46 -05:00
Nicolas Dichtel f15ca723c1 net: don't call update_pmtu unconditionally
Some dst_ops (e.g. md_dst_ops)) doesn't set this handler. It may result to:
"BUG: unable to handle kernel NULL pointer dereference at           (null)"

Let's add a helper to check if update_pmtu is available before calling it.

Fixes: 52a589d51f ("geneve: update skb dst pmtu on tx path")
Fixes: a93bf0ff44 ("vxlan: update skb dst pmtu on tx path")
CC: Roman Kapl <code@rkapl.cz>
CC: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-25 16:27:34 -05:00
Kalderon, Michal dc728f779a RDMA/qedr: lower print level of flushed CQEs
There are races where can still get flush on CQEs before the QP enters
error state. This is not an error and should be treated as
debug information.

Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com>
Signed-off-by: Ariel Elior <Ariel.Elior@cavium.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-01-25 10:58:36 -05:00
Jason Gunthorpe 3624a8f025 RDMA/uverbs: Use an unambiguous errno for method not supported
Returning EOPNOTSUPP is problematic because it can also be
returned by the method function, and we use it in quite a few
places in drivers these days.

Instead, dedicate EPROTONOSUPPORT to indicate that the ioctl framework
is enabled but the requested object and method are not supported by
the kernel. No other case will return this code, and it lets userspace
know to fall back to write().

grep says we do not use it today in drivers/infiniband subsystem.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-01-25 10:57:29 -05:00
Leon Romanovsky f97f43c9ed RDMA/srpt: Fix RCU debug build error
Combination of CONFIG_DEBUG_OBJECTS_RCU_HEAD=y and
CONFIG_INFINIBAND_SRPT=m produces the following build error.

ERROR: "init_rcu_head" [drivers/infiniband/ulp/srpt/ib_srpt.ko] undefined!
make[1]: *** [scripts/Makefile.modpost:92: __modpost] Error 1
make: *** [Makefile:1216: modules] Error 2

The reason to it that init_rcu_head() is not exported and not supposed
to be used in modules. It is needed for dynamic initialization of
statically allocated rcu_head structures.

Fixes: 795bc112cd ("IB/srpt: Make it safe to use RCU for srpt_device.rch_list")
Fixes: a11253142e ("IB/srpt: Rework multi-channel support")
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-01-25 10:17:01 -05:00
Felix Kuehling 20c3ff6114 RDMA/qedr: Use pci_enable_atomic_ops_to_root()
Use PCI core interface pci_enable_atomic_ops_to_root() to enable atomic
capability.

Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Michal Kalderon <michal.kalderon@cavium.com>
CC: Ram Amrani <Ram.Amrani@cavium.com>
CC: Doug Ledford <dledford@redhat.com>
2018-01-24 15:28:11 -06:00
Bart Van Assche b0780ee56e IB/srp: Add target_can_queue login parameter
Although I'm not sure this parameter is useful for regular SRP users,
setting this parameter to 1 has shown to be invaluable for testing the
block layer core, SCSI core and device mapper queue running mechanisms.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-01-23 11:35:05 -05:00
Bart Van Assche 19f313438c IB/srp: Add RDMA/CM support
Since the SRP_LOGIN_REQ defined in the SRP standard is larger than
what fits in the RDMA/CM login request private data, introduce a new
login request format for the RDMA/CM.

Note: since srp_daemon and ibsrpdm rely on the subnet manager and
since there is no equivalent of the IB subnet manager in non-IB
networks, login has to be performed manually for non-IB networks.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-01-23 11:35:04 -05:00
Parav Pandit 052eac6eeb RDMA/cma: Update RoCE multicast routines to use net namespace
rdma_dev_addr contains the net namespace pointer, while referring
bound_dev_if of the rdma_dev_addr, refer to the net namespace of
rdma_cm_id stored in rdma_dev_addr.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-01-22 11:39:50 -07:00
Parav Pandit 66c74d746d RDMA/cma: Update cma_validate_port to honor net namespace
cma_validate_port uses rdma_dev_addr to validate the port of the cm_id.
It needs to honor the net namespace which is setup during cm_id creation
when finding netdevice.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-01-22 11:39:50 -07:00
Parav Pandit 2493a57bc1 RDMA/cma: Refactor to access multiple fields of rdma_dev_addr
Pass the rdma_cm_id so that multiple fields of the rdma_dev_addr
structure can be accessed, instead of passing each individual fields.

This is needed to access some additional fields in followup patches.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-01-22 11:39:50 -07:00
Parav Pandit 00db63c128 RDMA/cma: Check existence of netdevice during port validation
If valid netdevice is not found for RoCE, GID table should not be
searched with NULL netdevice.

Doing so causes the search routines to ignore the netdev argument and may
match the wrong GID table entry if the netdev is deleted.

Fixes: abae1b71dd ("IB/cma: cma_validate_port should verify the port and netdevice")
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-01-22 11:39:50 -07:00
Leon Romanovsky 10bea9c873 RDMA/mlx5: Remove redundant allocation warning print
The kmalloc() failure to allocate memory generates enough information
and doesn't need to be accompanied by another driver print.

Fixes: d69a24e036 ("IB/mlx5: Move IB event processing onto a workqueue")
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-01-19 13:05:39 -07:00
Parav Pandit 7a2f64ee4a RDMA/ucma: Use rdma cm API to query GID
Make use of rdma_read_gids() API to read SGID and DGID which returns
correct GIDs for RoCE and other transports.

rdma_addr_get_dgid() for RoCE for client side connections returns MAC
address, instead of DGID.
rdma_addr_get_sgid() for RoCE doesn't return correct SGID for IPv6 and
when more than one IP address is assigned to the netdevice.

Therefore use transport agnostic rdma_read_gids() API provided by rdma_cm
module.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-01-19 13:05:38 -07:00
Parav Pandit 411460ac50 RDMA/cma: Introduce API to read GIDs for multiple transports
This patch introduces an API that allows legacy applications to query
GIDs for a rdma_cm_id which is used during connection establishment.

GIDs are stored and created differently for iWarp, IB and RoCE transports.
Therefore rdma_read_gids() returns GID for all the transports hiding
such internal details to caller.
It is usable for client side and server side connections.

In general continued use of GID based addressing outside of IB is
discouraged, so rdma_read_gids() should not be used by any new ULPs.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-01-19 13:05:38 -07:00
Bart Van Assche 2ffcf04263 IB/srpt: Move the code for parsing struct ib_cm_req_event_param
This patch does not change any functionality but makes srpt_cm_req_recv()
independent of the IB/CM and hence simplifies the patch that introduces
RDMA/CM support.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-01-18 14:49:27 -05:00
Bart Van Assche 090fa24bf2 IB/srpt: Preparations for adding RDMA/CM support
Introduce a union in struct srpt_rdma_ch for member variables that
depend on the type of connection manager. Avoid that error messages
report the IB/CM ID.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-01-18 14:49:27 -05:00
Bart Van Assche fcf589364f IB/srpt: Don't allow reordering of commands on wait list
If a receive I/O context is removed from the wait list and
srpt_handle_new_iu() fails to allocate a send I/O context then
re-adding the receive I/O context to the wait list can cause
reordering. Avoid this by only removing a receive I/O context
from the wait list after allocating a send I/O context succeeded.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-01-18 14:49:26 -05:00
Bart Van Assche e28a547da6 IB/srpt: Fix a race condition related to wait list processing
Wait list processing only occurs if the channel state >= CH_LIVE. Hence
set the channel state to CH_LIVE before triggering wait list processing
asynchronously.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-01-18 14:49:26 -05:00
Bart Van Assche db7683d7de IB/srpt: Fix login-related race conditions
Make sure that sport->mutex is not released between the duplicate
channel check, adding a channel to the channel list and performing
the sport enabled check. Avoid that srpt_disconnect_ch() can be
invoked concurrently with the ib_send_cm_rep() call by
srpt_cm_req_recv().

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-01-18 14:49:26 -05:00
Bart Van Assche 63d370a6a9 IB/srpt: Log all zero-length writes and completions
The new pr_debug() statements are useful when debugging the ib_srpt
driver.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-01-18 14:49:25 -05:00
Bart Van Assche 940874f8be IB/srpt: Simplify srpt_close_session()
Move a mutex lock and unlock statement from srpt_close_session()
into srpt_disconnect_ch_sync(). Since the previous patch removed
the last user of the return value of that function, change the
return value of srpt_disconnect_ch_sync() into void.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-01-18 14:49:25 -05:00
Bart Van Assche a11253142e IB/srpt: Rework multi-channel support
Store initiator and target port ID's once per nexus instead of in each
channel data structure. This change simplifies the duplicate connection
check in srpt_cm_req_recv().

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-01-18 14:49:25 -05:00
Bart Van Assche 2dc98f09f9 IB/srpt: Use the source GID as session name
Use the source GID as session name instead of the initiator port ID
from the SRP login request. The only functional change in this patch
is that it changes the session name shown in debug messages.

Note: the fifth argument that is passed to target_alloc_session() is
what the SCSI target core uses as key for lookups in the ACL (access
control list) information.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-01-18 14:49:24 -05:00
Bart Van Assche ba60c84f82 IB/srpt: One target per port
In multipathing setups where a target system is equipped with
dual-port HCAs it is useful to have one connection per target port
instead of one connection per target HCA. Hence move the connection
list (rch_list) from struct srpt_device into struct srpt_port.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-01-18 14:49:24 -05:00
Bart Van Assche c5efb62148 IB/srpt: Add P_Key support
Process connection requests that use another P_Key than the default
correctly.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-01-18 14:49:24 -05:00
Bart Van Assche 4413834452 IB/srpt: Rework srpt_disconnect_ch_sync()
This patch fixes a use-after-free issue for ch->release_done when
running the SRP protocol on top of the rdma_rxe driver.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-01-18 14:49:23 -05:00
Bart Van Assche 795bc112cd IB/srpt: Make it safe to use RCU for srpt_device.rch_list
The next patch will iterate over rch_list from a context from which
it is not allowed to block. Hence make rch_list RCU-safe.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-01-18 14:49:23 -05:00
Bart Van Assche 48900a2834 IB/srp: Refactor srp_send_req()
This patch does not change any functionality but prepares for the patch
that adds RDMA_CM support by making the RDMA_CM patch much easier to
read.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-01-18 14:49:22 -05:00
Bart Van Assche 85769c6f3b IB/srp: Improve path record query error message
Show all path record query parameters if a path record query fails.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-01-18 14:49:22 -05:00
Bart Van Assche 2a174df0c6 IB/srp: Use kstrtoull() instead of simple_strtoull()
Use kstrtoull() since simple_strtoull() is deprecated. This patch
improves error checking but otherwise does not change any functionality.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-01-18 14:49:22 -05:00
weiyongjun (A) 0b5fe5c43a RDMA/hns: Remove unnecessary platform_get_resource() error check
devm_ioremap_resource() already checks if the resource is NULL, so
remove the unnecessary platform_get_resource() error check.

Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-01-18 14:49:21 -05:00
Feras Daoud 5c99eaecb1 IB/mlx5: Mmap the HCA's clock info to user-space
This patch maps the new page to user space applications to
allow converting a user space completion timestamp to system wall
time at the lowest possible latency cost.
By using a versioning scheme we allow compatibility between current
and future userspace libraries.
The change moves mlx5_ib_mmap_cmd enum from mlx5_ib.h to the
abi header file mlx5-abi.h.

Reviewed-by: Alex Vesker <valex@mellanox.com>
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Feras Daoud <ferasda@mellanox.com>
Signed-off-by: Eitan Rabin <rabin@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-01-18 14:49:21 -05:00
Sagi Grimberg 246d8b184c IB/cq: Don't force IB_POLL_DIRECT poll context for ib_process_cq_direct
polling the completion queue directly does not interfere
with the existing polling logic, hence drop the requirement.
Be aware that running ib_process_cq_direct with non IB_POLL_DIRECT
CQ may trigger concurrent CQ processing.

This can be used for polling mode ULPs.

Cc: Bart Van Assche <bart.vanassche@wdc.com>
Reported-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
[maxg: added wcs array argument to __ib_process_cq]
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-01-18 14:49:20 -05:00
Max Gurtovoy aaebd377c0 IB/core: postpone WR initialization during queue drain
No need to initialize completion and WR in case we fail
during QP modification.

Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Acked-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-01-18 14:49:20 -05:00
Bart Van Assche bb3ffb7ad4 RDMA/rxe: Fix rxe_qp_cleanup()
rxe_qp_cleanup() can sleep so it must be run in thread context and
not in atomic context. This patch avoids that the following bug is
triggered:

Kernel BUG at 00000000560033f3 [verbose debug info unavailable]
BUG: sleeping function called from invalid context at net/core/sock.c:2761
in_atomic(): 1, irqs_disabled(): 0, pid: 7, name: ksoftirqd/0
INFO: lockdep is turned off.
Preemption disabled at:
[<00000000b6e69628>] __do_softirq+0x4e/0x540
CPU: 0 PID: 7 Comm: ksoftirqd/0 Not tainted 4.15.0-rc7-dbg+ #4
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
Call Trace:
 dump_stack+0x85/0xbf
 ___might_sleep+0x177/0x260
 lock_sock_nested+0x1d/0x90
 inet_shutdown+0x2e/0xd0
 rxe_qp_cleanup+0x107/0x140 [rdma_rxe]
 rxe_elem_release+0x18/0x80 [rdma_rxe]
 rxe_requester+0x1cf/0x11b0 [rdma_rxe]
 rxe_do_task+0x78/0xf0 [rdma_rxe]
 tasklet_action+0x99/0x270
 __do_softirq+0xc0/0x540
 run_ksoftirqd+0x1c/0x70
 smpboot_thread_fn+0x1be/0x270
 kthread+0x117/0x130
 ret_from_fork+0x24/0x30

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Moni Shoua <monis@mellanox.com>
Cc: stable@vger.kernel.org
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-01-18 14:49:20 -05:00
Bart Van Assche 65567e4121 RDMA/rxe: Fix a race condition in rxe_requester()
The rxe driver works as follows:
* The send queue, receive queue and completion queues are implemented as
  circular buffers.
* ib_post_send() and ib_post_recv() calls are serialized through a spinlock.
* Removing elements from various queues happens from tasklet
  context. Tasklets are guaranteed to run on at most one CPU. This serializes
  access to these queues. See also rxe_completer(), rxe_requester() and
  rxe_responder().
* rxe_completer() processes the skbs queued onto qp->resp_pkts.
* rxe_requester() handles the send queue (qp->sq.queue).
* rxe_responder() processes the skbs queued onto qp->req_pkts.

Since rxe_drain_req_pkts() processes qp->req_pkts, calling
rxe_drain_req_pkts() from rxe_requester() is racy. Hence this patch.

Reported-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: stable@vger.kernel.org
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-01-18 14:49:19 -05:00
Devesh Sharma 37cb11acf1 RDMA/bnxt_re: Add SRQ support for Broadcom adapters
Shared receive queue (SRQ) is defined as a pool of
receive buffers shared among multiple QPs which belong
to same protection domain in a given process context.
Use of SRQ reduces the memory foot print of IB applications.

Broadcom adapters support SRQ, adding code-changes to enable
shared receive queue.

Signed-off-by: Devesh Sharma <devesh.sharma@broadcom.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-01-18 14:49:19 -05:00
Selvin Xavier 89f81008ba RDMA/bnxt_re: expose detailed stats retrieved from HW
Broadcom's adapter supports more granular statistics
to allow better understanding about the state of the
chip when data traffic is flowing.

Exposing the detailed stats to the consumer through
the standard hook available in the kverbs interface.
In order to retrieve all the information, driver
implements a firmware command.

Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Devesh Sharma <devesh.sharma@broadcom.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-01-18 14:49:18 -05:00
Somnath Kotur 872f357824 RDMA/bnxt_re: Add support for MRs with Huge pages
Depending on the OS page-table configurations, applications
may request MRs which has page size alignment other than 4K

Underlying provider driver needs to adjust its PBL boundaries
according to the incoming page boundaries in the PA list.

Adding a capability to register MRs having pages-sizes other
than 4K (Hugepages).

Signed-off-by: Somnath Kotur <somnath.kotur@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Devesh Sharma <devesh.sharma@broadcom.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-01-18 14:49:13 -05:00
Selvin Xavier 2fc68543f2 RDMA/bnxt_re: Add support for query firmware version
The device now reports firmware version thus, removing
the hard coded values of the FW version string and
redundant fw_rev hook from sysfs. Adding code to query
firmware version from underlying device and report it
through the kernel verb to get firmware version string.

Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Devesh Sharma <devesh.sharma@broadcom.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-01-17 09:56:17 -05:00
Selvin Xavier ccd9d0d3df RDMA/bnxt_re: Enable RoCE on virtual functions
RoCE can be used by virtual functions (VFs) as well. Adding
code changes to allow resource reservation, initialization
and avail the resources to the RDMA applications running on
those VFs.

Currently, fifty percent of the total available resources
are reserved for PF and remaining are equally divided among
active VFs.

Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Devesh Sharma <devesh.sharma@broadcom.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-01-17 09:56:17 -05:00
David S. Miller c02b3741eb Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Overlapping changes all over.

The mini-qdisc bits were a little bit tricky, however.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-17 00:10:42 -05:00
Mustafa Ismail f20d429511 i40iw: Free IEQ resources
The iWARP Exception Queue (IEQ) resources are not freed when a QP is
destroyed. Fix this by freeing IEQ resources when freeing QP resources.

Fixes: d374984179 ("i40iw: add files for iwarp interface")
Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-01-16 20:38:18 -07:00
Mustafa Ismail ebb6c0c015 i40iw: Remove setting of rem_addr.len
Remove setting of rem_addr.len before calling iw_rdma_write,
iw_inline_rdma_write and rdma_read. rem_addr.len is not used in those
functions.

Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-01-16 20:38:18 -07:00
Sindhu Devale 72b30e986d i40iw: Remove limit on re-posting AEQ entries to HW
Currently, if the number of processed Asynchronous Event Queue (AEQ)
entries exceeds 255, they are not returned to HW for re-use. During
scale-up, the unreturned AEQ entries can grow to the max AEQ size and
cause the HW to report an AEQ overflow.

Remove the check which limits the number of processed AEQ entries returned
to HW.

Fixes: 86dbcd0f12 ("RDMA/i40iw: add file to handle cqp calls")
Signed-off-by: Sindhu Devale <sindhu.devale@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-01-16 20:38:18 -07:00
Shiraz Saleem 6376e926af i40iw: Zero-out consumer key on allocate stag for FMR
If the application invalidates the MR before the FMR WR, HW parses the
consumer key portion of the stag and returns an invalid stag key
Asynchronous Event (AE) that tears down the QP.

Fix this by zeroing-out the consumer key portion of the allocated stag
returned to application for FMR.

Fixes: ee855d3b93f3 ("RDMA/i40iw: Add base memory management extensions")
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-01-16 20:38:18 -07:00
Shiraz Saleem 23541b28e5 i40iw: Remove extra call to i40iw_est_sd()
Remove redundant estimate SD function call.  sd_needed should already be
updated at the end of the do while resource reduction loop.

Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-01-16 20:38:18 -07:00
oulijun d4994d2f1f RDMA/hns: Set the guid for hip08 RoCE device
This patch assign a guid(Global Unique identifer) value to the hip08
device.

Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Yixian Liu <liuyixian@huawei.com>
Signed-off-by: Wei Hu (Xavier) <xavier.huwei@huawei.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-01-16 20:38:18 -07:00
oulijun 2eade67535 RDMA/hns: Update the verbs of polling for completion
If the port is a RoCEv2 port, the remote port address and QP information
which returned for UD will be modified.

Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Yixian Liu <liuyixian@huawei.com>
Signed-off-by: Wei Hu (Xavier) <xavier.huwei@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-01-16 20:38:18 -07:00
oulijun 6c1f08b347 RDMA/hns: Assign zero for pkey_index of wc in hip08
Because pkey is fixed for hip08 RoCE, it needs to assign zero for
pkey_index of wc. otherwise, it will happen an error when establishing
connection by communication management mechanism.

Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Yixian Liu <liuyixian@huawei.com>
Signed-off-by: Wei Hu (Xavier) <xavier.huwei@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-01-16 20:38:18 -07:00
oulijun 7bdee4158b RDMA/hns: Fill sq wqe context of ud type in hip08
This patch mainly configure the fields of sq wqe of ud type when posting
wr of gsi qp type.

Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Yixian Liu <liuyixian@huawei.com>
Signed-off-by: Wei Hu (Xavier) <xavier.huwei@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-01-16 20:38:18 -07:00
oulijun 0fa95a9a71 RDMA/hns: Add gsi qp support for modifying qp in hip08
It needs to Assign the values for some fields in qp context when qp type
is gsi qp type in hip08.

Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Yixian Liu <liuyixian@huawei.com>
Signed-off-by: Wei Hu (Xavier) <xavier.huwei@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-01-16 20:38:18 -07:00
oulijun b66efc9320 RDMA/hns: Create gsi qp in hip08
The gsi qp and rc qp use the same qp context structure and the created
flow, only differentiate them by qpn and qp type.

Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Yixian Liu <liuyixian@huawei.com>
Signed-off-by: Wei Hu (Xavier) <xavier.huwei@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-01-16 20:38:18 -07:00
oulijun 6d13b869ea RDMA/hns: Assign the correct value for tx_cqn
When modifying qp from init to init, it need to assign the cqn of send cq
for tx cqn field of qp context. Otherwise, it will cause a mistake when
the send and recv cq sizes are different.

Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-01-16 20:38:18 -07:00
Linus Torvalds 8cbab92dff Fifth pull request for 4.15-rc
- Oops fix in hfi1 driver
 - Use-after-free issue in iser-target
 - Use of user supplied array index without proper checking
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJaXnN2AAoJELgmozMOVy/dyYAP/3YBi6ofVHrIjKovrsA6fh+y
 O1+YuG++2glgKF1QuZkZnrKqrJUIueHhGuKce2rRgWWMioAzOfd1mbXbwha3xKWh
 4mH/BK/fv1k/TZy+eu6w4WMnjtExVinPp4qeSwvvfHh3BpJfnBOtPbYNdPVt94MV
 QYPZuLarvSIJe8MFDbv6yitAIQWnPQLfGMWPW8OZb0ut7bsMNPyZ8LvcAXhfXA4a
 tVuxAfCqUb90BHWOnCBcaWGNpp9Ov2w6yyNCDkiIRXvgJxhjvJwxpHze6TMa4fCo
 mv8e4QDdQ8KK2vQaINQAJTl68Y1/fPxtI8um5tpIhJ5YjQE3rrVzTaQJRstBcXXs
 ViiDmxm/4KTUTeFtWHRj7oVJTZNSOBu+FfBVmjqtcUl4k+t9tDfj2Lw/ZVXT/3KQ
 +dsqUV+1PrQspCeFlKwg+e1mgXa3miyOYvMPacAHzRYBLv5oU6MPpb+WHbZMlTMt
 duT50JfjseIB0s6F2aGLNRRXPWb9jVUwHwzOXaDGKwX1+5u6O2IKeUP7+15TgZsS
 IFeLPodNN8W1N1z0TGcGL8zRXtY9GYhlYwrdMUe/4EWfxrrRtawLLCGGVhWKk/Rv
 iPl/Qpxj/8f2DS0A5+N5iHTAZPoOXjNx6JtZf9AZhvI9blvwZVuGdr63mKZMUnpb
 a0xeypv1bnYbv7k70zS8
 =MHNK
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull rdma fixes from Doug Ledford:
 "We had a few more items creep up over the last week. Given we are in
  -rc8, these are obviously limited to bugs that have a big downside and
  for which we are certain of the fix.

  The first is a straight up oops bug that all you have to do is read
  the code to see it's a guaranteed 100% oops bug.

  The second is a use-after-free issue. We get away lucky if the queue
  we are shutting down is empty, but if it isn't, we can end up oopsing.
  We really need to drain the queue before destroying it.

  The final one is an issue with bad user input causing us to access our
  port array out of bounds. While fixing the array out of bounds issue,
  it was noticed that the original code did the same thing twice (the
  call to rdma_ah_set_port_num()), so its removal is not balanced by a
  readd elsewhere, it was already where it needed to be in addition to
  where it didn't need to be.

  Summary:

   - Oops fix in hfi1 driver

   - use-after-free issue in iser-target

   - use of user supplied array index without proper checking"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
  RDMA/mlx5: Fix out-of-bound access while querying AH
  IB/hfi1: Prevent a NULL dereference
  iser-target: Fix possible use-after-free in connection establishment error
2018-01-16 16:47:40 -08:00
Xiongfeng Wang 979a459c83 IB/cma: use strlcpy() instead of strncpy()
gcc-8 reports

drivers/infiniband/core/cma_configfs.c: In function 'make_cma_dev':
./include/linux/string.h:245:9: warning: '__builtin_strncpy' specified
bound 64 equals destination size [-Wstringop-truncation]

We need to use strlcpy() to make sure the string is nul-terminated.

Signed-off-by: Xiongfeng Wang <xiongfeng.wang@linaro.org>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-01-15 15:33:21 -07:00
Jack Morgenstein 852f692759 IB/mlx4: Fix incorrectly releasing steerable UD QPs when have only ETH ports
Allocating steerable UD QPs depends on having at least one IB port,
while releasing those QPs does not.

As a result, when there are only ETH ports, the IB (RoCE) driver
requests releasing a qp range whose base qp is zero, with
qp count zero.

When SR-IOV is enabled, and the VF driver is running on a VM over
a hypervisor which treats such qp release calls as errors
(rather than NOPs), we see lines in the VM message log like:

 mlx4_core 0002:00:02.0: Failed to release qp range base:0 cnt:0

Fix this by adding a check for a zero count in mlx4_release_qp_range()
(which thus treats releasing 0 qps as a nop), and eliminating the
check for device managed flow steering when releasing steerable UD QPs.
(Freeing ib_uc_qpns_bitmap unconditionally is also OK, since it
remains NULL when steerable UD QPs are not allocated).

Cc: <stable@vger.kernel.org>
Fixes: 4196670be7 ("IB/mlx4: Don't allocate range of steerable UD QPs for Ethernet-only device")
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-01-15 15:33:21 -07:00
Jason Gunthorpe 7bed7ebcb7 RDMA/qedr: Fix endian problems around imm_data
The double swap matches what user space rdma-core does to imm_data.

wc->imm_data is not used in the kernel so this change has no practical
impact.

Acked-by: Michal Kalderon <michal.kalderon@cavium.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-01-15 15:33:21 -07:00
Jason Gunthorpe ccb8a29e7d RDMA/hns: Fix endian problems around imm_data and rkey
This matches the changes made recently to the userspace hns
driver when it was made sparse clean.

See rdma-core commit bffd380cfe56 ("libhns: Make the provider sparse
clean")

wc->imm_data is not used in the kernel so this change has no practical
impact.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-01-15 15:33:21 -07:00
Jason Gunthorpe c966ea12c0 RDMA: Mark imm_data as be32 in the verbs uapi header
This matches what the userspace copy of this header has been doing
for a while. imm_data is an opaque 4 byte array carried over the network,
and invalidate_rkey is in CPU byte order.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-01-15 15:33:21 -07:00
Parav Pandit a6753c4d62 IB/core: Limit DMAC resolution to RoCE Connected QPs
Resolving DMAC for RoCE is applicable to only Connected mode QPs.
So resolve DMAC for only for Connected mode QPs.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-01-15 15:33:21 -07:00
Parav Pandit f2290d6d52 IB/core: Attempt DMAC resolution for only RoCE
Instead of returning 0 (success) for RoCE scenarios where DMAC should
not be resolved, avoid such attempt and make code consistent with
ib_create_user_ah().

Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-01-15 15:33:21 -07:00
Parav Pandit b96ac05a87 IB/core: Limit DMAC resolution to userspace QPs
Currently ah_attr is initialized by the ib_cm layer for rdma_cm
based applications. For RoCE transport ah_attr.roce.dmac is already
initialized by ib_cm, rdma_cm either from wc, path record, route
resolve, explicit path record setting depending on active or passive
side QP. Therefore avoid resolving DMAC for QP of kernel consumers.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-01-15 15:33:21 -07:00
Parav Pandit b2bedfb395 IB/core: Perform modify QP on real one
Currently qp->port stores the port number whenever IB_QP_PORT
QP attribute mask is set (during QP state transition to INIT state).
This port number should be stored for the real QP when XRC target QP
is used.

Follow the ib_modify_qp() implementation and hide the access to ->real_qp.

Fixes: a512c2fbef ("IB/core: Introduce modify QP operation with udata")
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-01-15 15:33:21 -07:00
Leon Romanovsky ae59c3f0b6 RDMA/mlx5: Fix out-of-bound access while querying AH
The rdma_ah_find_type() accesses the port array based on an index
controlled by userspace. The existing bounds check is after the first use
of the index, so userspace can generate an out of bounds access, as shown
by the KASN report below.

==================================================================
BUG: KASAN: slab-out-of-bounds in to_rdma_ah_attr+0xa8/0x3b0
Read of size 4 at addr ffff880019ae2268 by task ibv_rc_pingpong/409

CPU: 0 PID: 409 Comm: ibv_rc_pingpong Not tainted 4.15.0-rc2-00031-gb60a3faf5b83-dirty #3
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
Call Trace:
 dump_stack+0xe9/0x18f
 print_address_description+0xa2/0x350
 kasan_report+0x3a5/0x400
 to_rdma_ah_attr+0xa8/0x3b0
 mlx5_ib_query_qp+0xd35/0x1330
 ib_query_qp+0x8a/0xb0
 ib_uverbs_query_qp+0x237/0x7f0
 ib_uverbs_write+0x617/0xd80
 __vfs_write+0xf7/0x500
 vfs_write+0x149/0x310
 SyS_write+0xca/0x190
 entry_SYSCALL_64_fastpath+0x18/0x85
RIP: 0033:0x7fe9c7a275a0
RSP: 002b:00007ffee5498738 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 00007fe9c7ce4b00 RCX: 00007fe9c7a275a0
RDX: 0000000000000018 RSI: 00007ffee5498800 RDI: 0000000000000003
RBP: 000055d0c8d3f010 R08: 00007ffee5498800 R09: 0000000000000018
R10: 00000000000000ba R11: 0000000000000246 R12: 0000000000008000
R13: 0000000000004fb0 R14: 000055d0c8d3f050 R15: 00007ffee5498560

Allocated by task 1:
 __kmalloc+0x3f9/0x430
 alloc_mad_private+0x25/0x50
 ib_mad_post_receive_mads+0x204/0xa60
 ib_mad_init_device+0xa59/0x1020
 ib_register_device+0x83a/0xbc0
 mlx5_ib_add+0x50e/0x5c0
 mlx5_add_device+0x142/0x410
 mlx5_register_interface+0x18f/0x210
 mlx5_ib_init+0x56/0x63
 do_one_initcall+0x15b/0x270
 kernel_init_freeable+0x2d8/0x3d0
 kernel_init+0x14/0x190
 ret_from_fork+0x24/0x30

Freed by task 0:
(stack is not available)

The buggy address belongs to the object at ffff880019ae2000
 which belongs to the cache kmalloc-512 of size 512
The buggy address is located 104 bytes to the right of
 512-byte region [ffff880019ae2000, ffff880019ae2200)
The buggy address belongs to the page:
page:000000005d674e18 count:1 mapcount:0 mapping:          (null) index:0x0 compound_mapcount: 0
flags: 0x4000000000008100(slab|head)
raw: 4000000000008100 0000000000000000 0000000000000000 00000001000c000c
raw: dead000000000100 dead000000000200 ffff88001a402000 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff880019ae2100: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 ffff880019ae2180: 00 00 00 00 00 00 00 00 00 00 00 00 00 fc fc fc
>ffff880019ae2200: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
                                                          ^
 ffff880019ae2280: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 ffff880019ae2300: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
==================================================================
Disabling lock debugging due to kernel taint

Cc: <stable@vger.kernel.org>
Fixes: 44c58487d5 ("IB/core: Define 'ib' and 'roce' rdma_ah_attr types")
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-01-15 14:19:55 -07:00
Eran Ben Elisha 72f36be061 net/mlx5: Fix mlx5_get_uars_page to return error code
Change mlx5_get_uars_page to return ERR_PTR in case of
allocation failure. Change all callers accordingly to
check the IS_ERR(ptr) instead of NULL.

Fixes: 59211bd3b6 ("net/mlx5: Split the load/unload flow into hardware and software flows")
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Eugenia Emantayev <eugenia@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-01-12 02:01:47 +02:00
Eran Ben Elisha 8978cc921f {net,ib}/mlx5: Don't disable local loopback multicast traffic when needed
There are systems platform information management interfaces (such as
HOST2BMC) for which we cannot disable local loopback multicast traffic.

Separate disable_local_lb_mc and disable_local_lb_uc capability bits so
driver will not disable multicast loopback traffic if not supported.
(It is expected that Firmware will not set disable_local_lb_mc if
HOST2BMC is running for example.)

Function mlx5_nic_vport_update_local_lb will do best effort to
disable/enable UC/MC loopback traffic and return success only in case it
succeeded to changed all allowed by Firmware.

Adapt mlx5_ib and mlx5e to support the new cap bits.

Fixes: 2c43c5a036 ("net/mlx5e: Enable local loopback in loopback selftest")
Fixes: c85023e153 ("IB/mlx5: Add raw ethernet local loopback support")
Fixes: bded747bb4 ("net/mlx5: Add raw ethernet local loopback firmware command")
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Cc: kernel-team@fb.com
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-01-12 00:52:42 +02:00
Randy Dunlap 4f9a3018a2 infiniband: fix sw/rdmavt/* kernel-doc notation
Use correct parameter names and formatting in function kernel-doc notation
to eliminate warnings from scripts/kernel-doc.

../drivers/infiniband/sw/rdmavt/mr.c:784: warning: Excess function parameter 'ibmfr' description in 'rvt_map_phys_fmr'
../drivers/infiniband/sw/rdmavt/vt.c:234: warning: Excess function parameter 'intex' description in 'rvt_query_pkey'
../drivers/infiniband/sw/rdmavt/vt.c:266: warning: Excess function parameter 'index' description in 'rvt_query_gid'
../drivers/infiniband/sw/rdmavt/vt.c:306: warning: Excess function parameter 'data' description in 'rvt_alloc_ucontext'
../drivers/infiniband/sw/rdmavt/cq.c:65: warning: Excess function parameter 'sig' description in 'rvt_cq_enter'
../drivers/infiniband/sw/rdmavt/qp.c:279: warning: Excess function parameter 'qpt' description in 'rvt_free_all_qps'
../drivers/infiniband/sw/rdmavt/mcast.c:282: warning: Excess function parameter 'igd' description in 'rvt_attach_mcast'
../drivers/infiniband/sw/rdmavt/mcast.c:345: warning: Excess function parameter 'igd' description in 'rvt_detach_mcast'

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: linux-doc@vger.kernel.org
Acked-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-01-10 22:00:35 -07:00
Randy Dunlap cb8d09426a infiniband: fix ulp/opa_vnic/opa_vnic_vema.c kernel-doc notation
Use correct parameter name and description in kernel-doc notation to
eliminate a kernel-doc warning.

../drivers/infiniband/ulp/opa_vnic/opa_vnic_vema.c:730: warning: Excess function parameter 'cport' description in 'opa_vnic_vema_send_trap'

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: linux-doc@vger.kernel.org
Acked-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-01-10 22:00:34 -07:00
Randy Dunlap 92f617038e infiniband: fix core/fmr_pool.c kernel-doc notation
Fix kernel-doc warning for ib_fmr_pool_map_phys() and also format it
with function description and text spacing.

../drivers/infiniband/core/fmr_pool.c:404: warning: Excess function parameter 'pool' description in 'ib_fmr_pool_map_phys'

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: linux-doc@vger.kernel.org
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-01-10 22:00:34 -07:00
Randy Dunlap 1f58621e40 infiniband: fix core/verbs.c kernel-doc notation
Change function parameter name in kernel-doc notation and other comments
to eliminate a kernel-doc warning.

../drivers/infiniband/core/verbs.c:1790: warning: Excess function parameter 'wq_init_attr' description in 'ib_create_wq'

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: linux-doc@vger.kernel.org
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-01-10 22:00:34 -07:00
Parav Pandit 89838118a5 RDMA/cma: Fix rdma_cm path querying for RoCE
The 'if' logic in ucma_query_path was broken with OPA was introduced
and started to treat RoCE paths as as OPA paths. Invert the logic
of the 'if' so only OPA paths are treated as OPA paths.

Otherwise the path records returned to rdma_cma users are mangled
when in RoCE mode.

Fixes: 5752075144 ("IB/SA: Add OPA path record type")
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-01-10 22:00:34 -07:00
Parav Pandit 8d20a1f0ec RDMA/cma: Fix rdma_cm raw IB path setting for RoCE
rdma_set_ib_path() missed setting path record fields for RoCE
transport when RoCE support was added.

This results in setting incorrect ndev, destination mac address,
incorrect GID type etc errors when user space attempts to set a raw
IB path using the roce IB path compatibility mapping from userspace.

Fixes: 3c86aa70bf ("RDMA/cm: Add RDMA CM support for IBoE devices")
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-01-10 22:00:33 -07:00
Parav Pandit fe75889f27 RDMA/{cma, ucma}: Simplify and rename rdma_set_ib_paths
Since 2006 there has been no user of rdmacm based application to make use
of setting multiple path records using rdma_set_ib_paths API.

Therefore code is simplified to allow setting one path record entry.
Now that it sets only single path, it is renamed to reflect the same.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-01-10 22:00:33 -07:00
Parav Pandit 9327c7afdc RDMA/cma: Provide a function to set RoCE path record L2 parameters
Introduce a helper function to set path record L2 fields for RoCE.
This includes setting GID type, destination mac address and netdev
ifindex.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-01-10 22:00:33 -07:00
Parav Pandit 608bc44634 RDMA/cma: Use the right net namespace for the rdma_cm_id
The net namespace is set in addr during create_rdma_id(),
cma_resolve_iboe_route() should use that instead of the
init namespace.

The original code was added in commit fa20105e09 ("IB/cma: Add support
for network namespaces"), but this path wasn't in use back then.

This patch updates the code to use right namespace, as preparation
for improving namespace support.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-01-10 22:00:33 -07:00
Huy Nguyen 8cf12d7780 IB/core: Increase number of char device minors
There is a need to increase number of possible char devices to support
large number of SR-IOV instances. The current limit is in the range of
64-128 devices/ports. Increase it to support up to 1024.

The patch performs the following steps to refactor the code:
1. Removes the split bitmap for fixed and overflow dev numbers.
2. Pre-allocates the non-legacy major number range during driver
   initialization, choosen for simplicity.
3. Add new define (RDMA_MAX_PORTS) that is shared between all drivers.
   This is the maximum total number of ports on all struct ib_devices.
4. Set RDMA_MAX_PORTS to 1024.

Signed-off-by: Huy Nguyen <huyn@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-01-10 22:00:32 -07:00
Huy Nguyen 008b656f42 IB/core: Remove the locking for character device bitmaps
Remove the locks that protect character device bitmaps of
uverbs, umad and issm.

The character device bitmaps are accessed in "client->add" and
"client->remove" calls from ib_register_device and ib_unregister_device
respectively. These calls are already protected by the "device_mutex"
mutex. Thus, the spinlocks are not needed.

Signed-off-by: Huy Nguyen <huyn@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-01-10 22:00:32 -07:00
Bart Van Assche 6f301e06de RDMA/rxe: Fix a race condition related to the QP error state
The following sequence:
* Change queue pair state into IB_QPS_ERR.
* Post a work request on the queue pair.

Triggers the following race condition in the rdma_rxe driver:
* rxe_qp_error() triggers an asynchronous call of rxe_completer(), the function
  that examines the QP send queue.
* rxe_post_send() posts a work request on the QP send queue.

If rxe_completer() runs prior to rxe_post_send(), it will drain the send
queue and the driver will assume no further action is necessary.
However, once we post the send to the send queue, because the queue is
in error, no send completion will ever happen and the send will get
stuck.  In order to process the send, we need to make sure that
rxe_completer() gets run after a send is posted to a queue pair in an
error state.  This patch ensures that happens.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Moni Shoua <monis@mellanox.com>
Cc: <stable@vger.kernel.org> # v4.8
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-01-10 16:57:27 -05:00
Colin Ian King da005f9fab IB/mlx5: remove redundant assignment of mdev
The initial assignment to mdev is redundant as mdev is re-assigned
later and the first assigned value is never read. Remove this
redundant assignment.

Cleans up clang warning:
drivers/infiniband/hw/mlx5/main.c:359:24: warning: Value stored
to 'mdev' during its initialization is never read

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-01-10 16:52:52 -05:00
Dan Carpenter 57194fa763 IB/hfi1: Prevent a NULL dereference
In the original code, we set "fd->uctxt" to NULL and then dereference it
which will cause an Oops.

Fixes: f2a3bc00a0 ("IB/hfi1: Protect context array set/clear with spinlock")
Cc: <stable@vger.kernel.org> # 4.14.x
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-01-10 16:48:09 -05:00
Sagi Grimberg cd52cb26e7 iser-target: Fix possible use-after-free in connection establishment error
In case we fail to establish the connection we must drain our pre-posted
login recieve work request before continuing safely with connection
teardown.

Fixes: a060b5629a ("IB/core: generic RDMA READ/WRITE API")
Cc: <stable@vger.kernel.org> # 4.7+
Reported-by: Amrani, Ram <Ram.Amrani@cavium.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-01-10 16:46:03 -05:00
David S. Miller a0ce093180 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-01-09 10:37:00 -05:00
Linus Torvalds 44596f8682 Fourth pull request for 4.15-rc
- One line fix to mlx4 error flow (same as mlx5 fix in last pull request,
   just in the mlx4 driver)
 - Fix a race condition in the IPoIB driver.  This patch is larger than
   just a one line fix, but resolves a race condition in a fairly
   straight forward manner
 - Fix a locking issue in the RDMA netlink code.  This patch is also
   larger than I would like for a late -rc.  It has, however, had a week
   to bake in the rdma tree prior to this pull request
 - One line fix to fix granting remote machine access to memory that they
   don't need and shouldn't have
 - One line fix to correct the fact that our sgid/dgid pair is swapped
   from what you would expect when receiving an incoming connection
   request
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJaU+ZkAAoJELgmozMOVy/dLw8P/1f27k9c7Bg91VfuyQeIcSxA
 kyRDdzlkRzuI/6QJ4ErK+IkOH8ADG6UGmQa+fOv1dxG8do+YwVflcY7gEgjJA7fP
 k0oPuGjiq8wrEWZrFGinln38ou0KALYd4F2C32unVYrsIohQLHSr1D6Ttw0W5FA6
 NQG4nVn9FzmilgjqtkW2zOGKw4jdAn57J47tUp49KufuPBTUcxjmZCdaV5AmiuzN
 5JpZUieL49Zoc18pcm1OreqDPZcj5LV1XquDNV+AZgU9+uGKoIb932k6hQjBRuml
 FSePxpPjdN8zX/KVaa4HQHX4U4uMBp0HcRHYME1bDsKwTh/d9xKM/yTPzzCtJz+r
 wmGJ9TPr2nq8blJJq17nSXbaJ4LmzlScCwork3LomdZJi880JwWJlvjFG3M/Yir9
 HvS2zIOUJm+xZBNCDVEayYcBMkXew5XjxETtDwOvfYX8FM419LLk1WOp2y/4LKDD
 hIR8QYkZMl37lMYqWZUghNjR7Rov6jdd30KDiCGdOAO/qszlNyTSL+icWyzc1t/X
 VT4ai7vc0RTicPWwb8H8o8/dQNj8Ed8w5NnMq3hjen+KrTKShkZTMuW+or/E9jZN
 ha9jIzSPLRfOvX6mZRrQVe6hiY3fOWMZXdw7gtehUy2hX7LCSwwbn2v6FcsDxyMQ
 UW6ZVG3ccP9YSY+tBWKg
 =kUnv
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull rdma fixes from Doug Ledford:

 - One line fix to mlx4 error flow (same as mlx5 fix in last pull
   request, just in the mlx4 driver)

 - Fix a race condition in the IPoIB driver. This patch is larger than
   just a one line fix, but resolves a race condition in a fairly
   straight forward manner

 - Fix a locking issue in the RDMA netlink code. This patch is also
   larger than I would like for a late -rc. It has, however, had a week
   to bake in the rdma tree prior to this pull request

 - One line fix to fix granting remote machine access to memory that
   they don't need and shouldn't have

 - One line fix to correct the fact that our sgid/dgid pair is swapped
   from what you would expect when receiving an incoming connection
   request

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
  IB/srpt: Fix ACL lookup during login
  IB/srpt: Disable RDMA access by the initiator
  RDMA/netlink: Fix locking around __ib_get_device_by_index
  IB/ipoib: Fix race condition in neigh creation
  IB/mlx4: Fix mlx4_ib_alloc_mr error flow
2018-01-08 16:17:31 -08:00
Zhu Yanjun 5793b46521 IB/rxe: remove unnecessary skb_clone in xmit
In xmit, there is a skb_clone. This function copies the struct sk_buff.
And some parameters are changed to the new skb. Then the new skb is sent
while the old skb is freed.

While the function skb_clone is removed, the parameter changes are made on
the old skb, then the old skb is sent. It can also work well.

The following tests are made.

 server                       client
---------                    ---------
|1.1.1.1|<----rxe-channel--->|1.1.1.2|
---------                    ---------

On server: rping -s -a 1.1.1.1 -v -C 1000 -S 512
On client: rping -c -a 1.1.1.1 -v -C 1000 -S 512

The kernel config CONFIG_DEBUG_KMEMLEAK is enabled on both server
and client.

This test runs for several hours. There is no memory leak and the whole
system can work well.

As the above network, the following tests are made.

Server: ibv_rc_pingpong -d rxe0 -g 1
Client: ibv_rc_pingpong -d rxe0 -g 1 1.1.1.1

The result on Server.
Before:
8192000 bytes in 0.88 seconds = 74.36 Mbit/sec
1000 iters in 0.88 seconds = 881.30 usec/iter

After:
8192000 bytes in 0.81 seconds = 81.15 Mbit/sec
1000 iters in 0.81 seconds = 807.62 usec/iter

The throughput is enhanced and the latency is reduced.

CC: Srinivas Eeda <srinivas.eeda@oracle.com>
CC: Joe Jin <joe.jin@oracle.com>
CC: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>

Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-01-08 17:43:47 -05:00
Zhu Yanjun 91eab79653 IB/rxe: add the static type to the variable
The variable recv_sockets is only used in the file rxe_net.c. So
it is better to add static type to it.

CC: Srinivas Eeda <srinivas.eeda@oracle.com>
CC: Joe Jin <joe.jin@oracle.com>
CC: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-01-08 17:43:06 -05:00
Doug Ledford f8457d5832 Merge branch 'bart-srpt-for-next' into k.o/wip/dl-for-next
Merging in 12 patch series from Bart that required changes in the
current for-rc branch in order to apply cleanly.

Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-01-08 16:06:20 -05:00
Bart Van Assche 2d67017cc7 IB/srpt: Micro-optimize I/O context state manipulation
Since all I/O context state changes are already serialized, it is
not necessary to protect I/O context state changes with the I/O
context spinlock. Hence remove that spinlock.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-01-08 16:05:12 -05:00
Bart Van Assche dd3bec8655 IB/srpt: Inline srpt_get_cmd_state()
It is not necessary to obtain ioctx->spinlock when reading the ioctx
state. Since after removal of this locking only a single line remains,
inline the srpt_get_cmd_state() function.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-01-08 16:05:12 -05:00
Bart Van Assche b14cb74479 IB/srpt: Introduce srpt_format_guid()
Introduce a function for converting a GUID into an ASCII string. This
patch does not change any functionality.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-01-08 16:05:12 -05:00
Bart Van Assche 3fa7f0e994 IB/srpt: Reduce frequency of receive failure messages
Disabling an SRP target port causes the state of all QPs associated
with a port to be changed into IB_QPS_ERR. Avoid that this causes
one error message per I/O context to be reported.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-01-08 16:05:12 -05:00
Bart Van Assche ea51d2e12a IB/srpt: Convert a warning into a debug message
At least when running the ib_srpt driver on top of the rdma_rxe
driver it is easy to trigger a zero-length write completion in
the CH_DISCONNECTED state. Hence make the message that reports
this less noisy.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-01-08 16:05:12 -05:00
Bart Van Assche b185d3f895 IB/srpt: Use the IPv6 format for GIDs in log messages
Make the ib_srpt driver use the IPv6 format for GIDs in log messages
to improve consistency of this driver with other RDMA kernel drivers.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-01-08 16:05:12 -05:00
Bart Van Assche 5f1141b165 IB/srpt: Verify port numbers in srpt_event_handler()
Verify whether port numbers are in the expected range before using
these as an array index. Complain if a port number is out of range.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-01-08 16:05:12 -05:00
Bart Van Assche d9f45ae69f IB/srpt: Reduce the severity level of a log message
Since the SRQ event message is only useful for debugging purposes,
reduce its severity from "informational" to "debug".

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-01-08 16:05:12 -05:00
Bart Van Assche ed262287e2 IB/srpt: Rename a local variable, a member variable and a constant
Rename rsp_size into max_rsp_size and SRPT_RQ_SIZE into MAX_SRPT_RQ_SIZE.
The new names better reflect the role of this member variable and constant.
Since the prefix "srp_" is superfluous in the context of the function
that creates an RDMA channel, rename srp_sq_size into sq_size.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-01-08 16:05:12 -05:00
Bart Van Assche f8781a5300 IB/srpt: Document all structure members in ib_srpt.h
This patch avoids that the following command reports any warnings:

scripts/kernel-doc -none drivers/infiniband/ulp/srpt/ib_srpt.h

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-01-08 16:05:12 -05:00
Bart Van Assche 10eac19bb2 IB/srpt: Fix kernel-doc warnings in ib_srpt.c
Avoid that warnings about missing parameter descriptions are reported
when building with W=1.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-01-08 16:05:11 -05:00
Bart Van Assche 1b0bb73f72 IB/srpt: Remove an unused structure member
Fixes: commit a42d985bd5 ("ib_srpt: Initial SRP Target merge for v3.3-rc1")
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-01-08 16:05:11 -05:00
Daniel Jurgens 85c7c01499 IB/mlx5: Don't advertise RAW QP support in dual port mode
When operating in dual port RoCE mode FW doesn't support steering for
raw QPs on the slave port. They still work on the master port, but
the user has no way of knowing which port is the master. The
capability is reported per device, not per port.

Signed-off-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-01-08 11:42:24 -07:00
Daniel Jurgens 212f2a87b7 IB/mlx5: Route MADs for dual port RoCE
Route performance query MADs to the correct mlx5_core_dev when using
dual port RoCE mode.

Signed-off-by: Daniel Jurgens <danielj@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-01-08 11:42:23 -07:00
Daniel Jurgens cfe4e37fdc {net, IB}/mlx5: Change set_roce_gid to take a port number
When in dual port mode setting a RoCE GID for any port flows through the
master ports mlx5_core_dev. Provide an interface to set the port when
sending this command.

Signed-off-by: Daniel Jurgens <danielj@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-01-08 11:42:23 -07:00
Daniel Jurgens aac4492ef2 IB/mlx5: Update counter implementation for dual port RoCE
Update the counter interface for multiple ports. Some counter sets
always comes from the primary device.

Port specific counters should be accessed per mlx5_core_dev not always
through the IB master mdev.

Signed-off-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-01-08 11:42:23 -07:00
Parav Pandit a9e546e73a IB/mlx5: Change debugfs to have per port contents
When there are multiple ports for single IB(RoCE) device, support
debugfs entries to be available for each port.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-01-08 11:42:22 -07:00
Daniel Jurgens b3cbd6f080 IB/mlx5: Implement dual port functionality in query routines
Port operations must be routed to their native mlx5_core_dev. A
multiport RoCE device registers itself as having 2 ports even before a
2nd port is affiliated. If an unaffilated port is queried use capability
information from the master port, these values are the same.

Signed-off-by: Daniel Jurgens <danielj@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-01-08 11:42:22 -07:00
Daniel Jurgens d69a24e036 IB/mlx5: Move IB event processing onto a workqueue
Because mlx5_ib_event can be called from atomic context move event
handling onto a workqueue. A mutex lock is required to get the IB device
for slave ports, so move event processing onto a work queue. When an IB
event is received, check if the mlx5_core_dev  is a slave port, if so
attempt to get the IB device it's affiliated with. If found process the
event for that device, otherwise return.

Signed-off-by: Daniel Jurgens <danielj@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-01-08 11:42:22 -07:00
Daniel Jurgens 32f69e4be2 {net, IB}/mlx5: Manage port association for multiport RoCE
When mlx5_ib_add is called determine if the mlx5 core device being
added is capable of dual port RoCE operation. If it is, determine
whether it is a master device or a slave device using the
num_vhca_ports and affiliate_nic_vport_criteria capabilities.

If the device is a slave, attempt to find a master device to affiliate it
with. Devices that can be affiliated will share a system image guid. If
none are found place it on a list of unaffiliated ports. If a master is
found bind the port to it by configuring the port affiliation in the NIC
vport context.

Similarly when mlx5_ib_remove is called determine the port type. If it's
a slave port, unaffiliate it from the master device, otherwise just
remove it from the unaffiliated port list.

The IB device is registered as a multiport device, even if a 2nd port is
not available for affiliation. When the 2nd port is affiliated later the
GID cache must be refreshed in order to get the default GIDs for the 2nd
port in the cache. Export roce_rescan_device to provide a mechanism to
refresh the cache after a new port is bound.

In a multiport configuration all IB object (QP, MR, PD, etc) related
commands should flow through the master mlx5_core_dev, other commands
must be sent to the slave port mlx5_core_mdev, an interface is provide
to get the correct mdev for non IB object commands.

Signed-off-by: Daniel Jurgens <danielj@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-01-08 11:42:22 -07:00
Daniel Jurgens 7fd8aefb7c IB/mlx5: Make netdev notifications multiport capable
When multiple RoCE ports are supported registration for events on
multiple netdevs is required. Refactor the event registration and
handling to support multiple ports.

Signed-off-by: Daniel Jurgens <danielj@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-01-08 11:42:21 -07:00
Daniel Jurgens 508562d6f7 IB/mlx5: Reduce the use of num_port capability
Remove use of the num_ports general capability throughout. The number of
ports will be variable in the future, and reported in a different way.

Signed-off-by: Daniel Jurgens <danielj@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-01-08 11:42:21 -07:00
Daniel Jurgens 908d6460b3 IB/core: Change roce_rescan_device to return void
It always returns 0. Change return type to void.

Signed-off-by: Daniel Jurgens <danielj@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-01-08 11:42:21 -07:00