Paravirtualized RDMA Device (PVRDMA)
====================================


1. Description
==============
PVRDMA is the QEMU implementation of VMware's paravirtualized RDMA device.
It works with the existing Linux kernel driver as is; no special guest
modifications are needed.

While it complies with the VMware device, it can also communicate with bare
metal RDMA-enabled machines. It does not require an RDMA HCA in the host: it
can work with Soft-RoCE (rxe).

It does not require the whole guest RAM to be pinned, which allows memory
over-commit. Migration is not implemented yet, but it will be possible with
some HW assistance.

A project presentation accompanies this document:
 - http://events.linuxfoundation.org/sites/events/files/slides/lpc-2017-pvrdma-marcel-apfelbaum-yuval-shaia.pdf



2. Setup
========


2.1 Guest setup
===============
Fedora 27+ kernels work out of the box; older distributions
require updating the kernel to 4.14 to include the pvrdma driver.

However, the libpvrdma library needed by user-level software is still
not available as part of the distributions, so the rdma-core library
needs to be compiled and optionally installed.

Please follow the instructions at:
  https://github.com/linux-rdma/rdma-core.git


2.2 Host Setup
==============
The pvrdma backend is an ibdevice interface that can be exposed
either by a Soft-RoCE (rxe) device on machines with no RDMA device,
or by an HCA SRIOV function (VF/PF).
Note that ibdevice interfaces can't be shared between pvrdma devices;
each one requires a separate instance (rxe or SRIOV VF).


2.2.1 Soft-RoCE backend (rxe)
=============================
A stable version of rxe is required; Fedora 27+ or a Linux
Kernel 4.14+ is preferred.

The rdma_rxe module is part of the Linux Kernel but is not loaded by default.
Install the User Level library (librxe) following the instructions from:
https://github.com/SoftRoCE/rxe-dev/wiki/rxe-dev:-Home

Associate an ETH interface with rxe by running:
   rxe_cfg add eth0
An rxe0 ibdevice interface will be created and can be used as pvrdma backend.


2.2.2 RDMA device Virtual Function backend
==========================================
Nothing special is required. The pvrdma device can work not only with
Ethernet links, but also with InfiniBand links.
All that is needed is an ibdevice with an active port. For Mellanox cards
it will be something like mlx5_6, which can be used as the backend.
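
Whether a candidate ibdevice has an active port can be checked from the host
before starting QEMU. The following is a minimal sketch, assuming the usual
Linux sysfs layout (/sys/class/infiniband/<dev>/ports/<n>/state, holding
text such as "4: ACTIVE"). The helper name active_ports and its sysfs_root
parameter are made up for this example and are not part of QEMU or rdma-core.

```python
import os


def active_ports(ibdev, sysfs_root="/sys/class/infiniband"):
    """Return the sorted port numbers of ibdev whose state is ACTIVE."""
    ports_dir = os.path.join(sysfs_root, ibdev, "ports")
    if not os.path.isdir(ports_dir):
        return []  # unknown device, or no RDMA stack on this host
    active = []
    for port in os.listdir(ports_dir):
        if not port.isdigit():
            continue
        try:
            with open(os.path.join(ports_dir, port, "state")) as f:
                state = f.read()
        except OSError:
            continue
        # The file holds "<number>: <NAME>"; compare the name exactly.
        if state.split(":")[-1].strip() == "ACTIVE":
            active.append(int(port))
    return sorted(active)


# "rxe0" is the device name used in the Soft-RoCE section above.
print(active_ports("rxe0"))
```

An empty list means the device is missing or none of its ports is active, in
which case it is not yet usable as a pvrdma backend.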


2.2.3 QEMU setup
================
Configure QEMU with the --enable-rdma flag, after installing
the required RDMA libraries.



3. Usage
========
Currently the device works only with memory-backed RAM,
and it must be marked as "shared":
   -m 1G \
   -object memory-backend-ram,id=mb1,size=1G,share=on \
   -numa node,memdev=mb1 \

The pvrdma device is composed of two functions:
 - Function 0 is a vmxnet Ethernet Device which is redundant in Guest
   but is required to pass the ibdevice GID using its MAC.
   Examples:
     For an rxe backend using eth0 interface it will use its mac:
       -device vmxnet3,addr=<slot>.0,multifunction=on,mac=<eth0 MAC>
     For an SRIOV VF, we take the Ethernet Interface exposed by it:
       -device vmxnet3,multifunction=on,mac=<RoCE eth MAC>
 - Function 1 is the actual device:
       -device pvrdma,addr=<slot>.1,backend-dev=<ibdevice>,backend-gid-idx=<gid>,backend-port=<port>
   where the ibdevice can be rxe or RDMA VF (e.g. mlx5_4)
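
The options above can be assembled programmatically. The sketch below only
concatenates the flags already shown in this section; the helper name
pvrdma_qemu_args and the sample slot, MAC, and device values are
illustrative assumptions. The share flag is written in its explicit
share=on form.

```python
def pvrdma_qemu_args(slot, mac, ibdev, gid_idx, port, mem="1G"):
    """Return the QEMU arguments for one pvrdma device (hypothetical helper)."""
    return [
        "-m", mem,
        # Guest RAM must come from a shared memory backend.
        "-object", f"memory-backend-ram,id=mb1,size={mem},share=on",
        "-numa", "node,memdev=mb1",
        # Function 0: vmxnet3, carries the MAC that must match the GID.
        "-device", f"vmxnet3,addr={slot}.0,multifunction=on,mac={mac}",
        # Function 1: the pvrdma device, bound to the host ibdevice.
        "-device",
        f"pvrdma,addr={slot}.1,backend-dev={ibdev},"
        f"backend-gid-idx={gid_idx},backend-port={port}",
    ]


# Sample values only: slot 10, the MAC from the GID example, an rxe backend.
args = pvrdma_qemu_args("10", "7c:fe:90:cb:74:3a", "rxe0", 0, 1)
print(" ".join(args))
```

The returned list can be appended to the rest of a QEMU command line, for
example when launching through subprocess.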
 Note: Make sure that the GID at backend-gid-idx matches vmxnet's MAC.
 The rules of conversion are part of the RoCE spec. Manual conversion is not
 required, and a mismatch is easy to spot by comparing the two values:
    Example: GID: fe80:0000:0000:0000:7efe:90ff:fecb:743a
             MAC: 7c:fe:90:cb:74:3a
    Note the difference between the first byte of the MAC (7c) and the
    corresponding byte of the GID (7e), and the ff:fe inserted in the middle.
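
The example pair above follows the modified EUI-64 rule used for link-local
addresses: flip the universal/local bit of the first MAC byte, insert ff:fe
in the middle, and prefix the result with fe80::/64. A minimal sketch of that
conversion follows. The function name is made up for illustration, and it
covers only MAC-derived link-local GIDs, not other GID types a port may
expose at other indexes.

```python
def mac_to_link_local_gid(mac):
    """Derive the link-local GID from a MAC using modified EUI-64."""
    b = [int(part, 16) for part in mac.split(":")]
    if len(b) != 6:
        raise ValueError("expected a 6-byte MAC address")
    b[0] ^= 0x02                            # flip the universal/local bit
    eui64 = b[:3] + [0xFF, 0xFE] + b[3:]    # insert ff:fe in the middle
    gid = [0xFE, 0x80, 0, 0, 0, 0, 0, 0] + eui64
    # Format as eight 16-bit groups, fully expanded as in the example.
    return ":".join(f"{gid[i]:02x}{gid[i + 1]:02x}" for i in range(0, 16, 2))


print(mac_to_link_local_gid("7c:fe:90:cb:74:3a"))
# fe80:0000:0000:0000:7efe:90ff:fecb:743a
```

Comparing this value with the GID table entry at backend-gid-idx is a quick
way to confirm that the right index was chosen.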



4. Implementation details
=========================


4.1 Overview
============
The device acts like a proxy between the Guest Driver and the host
ibdevice interface.
On configuration path:
 - For every hardware resource request (PD/QP/CQ/...) the pvrdma will request
   a resource from the backend interface, maintaining a 1-1 mapping
   between the guest and host.
On data path:
 - Every post_send/receive received from the guest will be converted into
   a post_send/receive for the backend. The buffer data will not be touched
   or copied, resulting in near bare-metal performance for large enough buffers.
 - Completions from the backend interface will result in completions for
   the pvrdma device.
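
The 1-1 mapping on the configuration path can be pictured as a table indexed
by the handle returned to the guest, where each slot holds the matching
backend object. The class below is a toy illustration of that idea under
assumed names (ResourceTable, alloc, get, dealloc); it is not the data
structure used by the QEMU code.

```python
class ResourceTable:
    """Maps guest-visible handles to backend objects, one to one."""

    def __init__(self, max_entries):
        self._slots = [None] * max_entries

    def alloc(self, backend_obj):
        """Store backend_obj in the first free slot and return its handle."""
        for handle, slot in enumerate(self._slots):
            if slot is None:
                self._slots[handle] = backend_obj
                return handle
        raise RuntimeError("no free resource handles")

    def get(self, handle):
        """Return the backend object for a guest handle."""
        obj = self._slots[handle]
        if obj is None:
            raise KeyError(handle)
        return obj

    def dealloc(self, handle):
        """Free the slot so that the handle can be reused."""
        self._slots[handle] = None


cqs = ResourceTable(max_entries=4)
handle = cqs.alloc("backend-cq-A")   # placeholder for a real backend CQ
print(handle, cqs.get(handle))
```

One such table per resource type (PD, CQ, QP, ...) is enough to translate
every guest handle seen on the data path into its backend counterpart.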


4.2 PCI BARs
============
PCI BARs:
    BAR 0 - MSI-X
        MSI-X vectors:
            (0) Command - used when execution of a command is completed.
            (1) Async - not in use.
            (2) Completion - used when a completion event is placed in
              device's CQ ring.
    BAR 1 - Registers
        --------------------------------------------------------
        | VERSION |  DSR | CTL | REQ | ERR |  ICR | IMR  | MAC |
        --------------------------------------------------------
            DSR - Address of driver/device shared memory used
              for the command channel, used for passing:
                    - General info such as driver version
                    - Address of 'command' and 'response'
                    - Address of async ring
                    - Address of device's CQ ring
                    - Device capabilities
            CTL - Device control operations (activate, reset etc)
            IMR - Set interrupt mask
            REQ - Command execution register
            ERR - Operation status

    BAR 2 - UAR
        ---------------------------------------------------------
        | QP_NUM  | SEND/RECV Flag ||  CQ_NUM |   ARM/POLL Flag |
        ---------------------------------------------------------
            - Offset 0 used for QP operations (send and recv)
            - Offset 4 used for CQ operations (arm and poll)
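
A doorbell write to the UAR packs a resource number and a flag into one
32-bit value. The sketch below encodes and decodes the receive doorbell,
using the two offsets listed above and the recv flag at bit 31 mentioned in
the post receive flow. The width of the number field (HANDLE_MASK, low 24
bits) is an assumption made for this example; check the driver's device API
header for the authoritative layout.

```python
UAR_QP_OFFSET = 0          # from the text: QP operations
UAR_CQ_OFFSET = 4          # from the text: CQ operations
QP_RECV_BIT = 1 << 31      # from the text: qp_recv_bit (31)
HANDLE_MASK = 0x00FFFFFF   # assumption: number occupies the low 24 bits


def recv_doorbell(qpn):
    """Value the guest writes at UAR_QP_OFFSET to post a receive."""
    if qpn < 0 or qpn & ~HANDLE_MASK:
        raise ValueError("QP number does not fit in the handle field")
    return qpn | QP_RECV_BIT


def decode_qp_doorbell(value):
    """Split a QP doorbell value into (qpn, is_recv) on the device side."""
    return value & HANDLE_MASK, bool(value & QP_RECV_BIT)


value = recv_doorbell(5)
print(hex(value), decode_qp_doorbell(value))
```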


4.3 Major flows
===============

4.3.1 Create CQ
===============
    - Guest driver
        - Allocates pages for CQ ring
        - Creates page directory (pdir) to hold CQ ring's pages
        - Initializes CQ ring
        - Initializes 'Create CQ' command object (cqe, pdir etc)
        - Copies the command to 'command' address
        - Writes 0 into REQ register
    - Device
        - Reads the request object from the 'command' address
        - Allocates CQ object and initializes CQ ring based on pdir
        - Creates the backend CQ
        - Writes operation status to ERR register
        - Posts command-interrupt to guest
    - Guest driver
        - Reads the HW response code from ERR register
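
The handshake above (fill 'command', write REQ, device executes, status in
ERR, interrupt) can be modelled in a few lines. This is a toy simulation
under stated assumptions: the class and field names are invented, the
command is a plain dictionary instead of the real binary layout, and the
status values (0 for success, 1 for failure) are placeholders.

```python
class ToyDevice:
    """Toy model of the command channel; not the QEMU data structures."""

    def __init__(self):
        self.command = None        # stands in for the shared 'command' buffer
        self.err = None            # stands in for the ERR register
        self.irq_pending = False   # stands in for the command interrupt
        self.cqs = {}              # CQ handle -> emulated CQ state

    def write_req(self, value):
        """Guest write to REQ: execute the command found in 'command'."""
        cmd = self.command
        if value != 0 or cmd is None or cmd.get("op") != "create_cq":
            self.err = 1           # placeholder failure status
        else:
            handle = len(self.cqs)
            # The real device would also build the ring from pdir and
            # create the backend CQ at this point.
            self.cqs[handle] = {"cqe": cmd["cqe"], "pdir": cmd["pdir"]}
            self.err = 0           # placeholder success status
        self.irq_pending = True    # tells the guest to read ERR


dev = ToyDevice()
dev.command = {"op": "create_cq", "cqe": 64, "pdir": 0x1000}  # guest side
dev.write_req(0)                                               # guest side
print(dev.err, dev.irq_pending, dev.cqs)
```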

4.3.2 Create QP
===============
    - Guest driver
        - Allocates pages for send and receive rings
        - Creates page directory (pdir) to hold the ring's pages
        - Initializes 'Create QP' command object (max_send_wr,
          send_cq_handle, recv_cq_handle, pdir etc)
        - Copies the object to 'command' address
        - Writes 0 into REQ register
    - Device
        - Reads the request object from 'command' address
        - Allocates the QP object and initializes
            - Send and recv rings based on pdir
            - Send and recv ring state
        - Creates the backend QP
        - Writes the operation status to ERR register
        - Posts command-interrupt to guest
    - Guest driver
        - Reads the HW response code from ERR register

4.3.3 Post receive
==================
    - Guest driver
        - Initializes a wqe and places it on recv ring
        - Writes qpn|qp_recv_bit (31) to QP offset in UAR
    - Device
        - Extracts qpn from UAR
        - Walks through the ring and does the following for each wqe
            - Prepares the backend CQE context to be used when
              receiving completion from backend (wr_id, op_code, emu_cq_num)
            - For each sge prepares backend sge
            - Calls backend's post_recv
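
The device-side half of this flow can be sketched as a loop over the recv
ring. Everything here is illustrative: the Sge and Wqe tuples, the
translate_guest_addr placeholder (a fixed offset standing in for real
guest-to-host address translation), and the backend_post_recv callback are
assumptions, not QEMU or libibverbs APIs.

```python
from collections import namedtuple

Sge = namedtuple("Sge", "addr length lkey")
Wqe = namedtuple("Wqe", "wr_id sges")

HOST_BASE = 0x7F0000000000  # made-up base address for the placeholder below


def translate_guest_addr(addr):
    """Placeholder for guest-to-host address translation."""
    return HOST_BASE + addr


def post_recv_all(ring, emu_cq_num, backend_post_recv):
    """Drain the recv ring, posting each wqe to the backend."""
    posted = 0
    while ring:
        wqe = ring.pop(0)
        # Context kept until the backend reports the completion.
        ctx = {"wr_id": wqe.wr_id, "op_code": "recv", "emu_cq_num": emu_cq_num}
        # Only the addresses change; the buffer data is never copied.
        backend_sges = [
            Sge(translate_guest_addr(s.addr), s.length, s.lkey) for s in wqe.sges
        ]
        backend_post_recv(backend_sges, ctx)
        posted += 1
    return posted


calls = []
ring = [Wqe(wr_id=7, sges=[Sge(0x1000, 64, 1)])]
count = post_recv_all(ring, 3, lambda sges, ctx: calls.append((sges, ctx)))
print(count, calls)
```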

4.3.4 Process backend events
============================
    - Done by a dedicated thread used to process backend events;
      at initialization it is attached to the device and creates
      the communication channel.
    - Thread main loop:
        - Polls for completions
        - Extracts emu_cq_num, wr_id and op_code from context
        - Writes CQE to CQ ring
        - Writes CQ number to device CQ
        - Sends completion-interrupt to guest
        - Deallocates context
        - Acks the event to backend
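
One pass of the main loop can be sketched as follows, taking the completions
as an already-polled list. The dictionaries, the ctx_id key used to find the
saved context, and the raise_irq and ack callbacks are assumptions made to
keep the example self-contained; they do not mirror the real structures.

```python
def process_completions(completions, contexts, cq_rings, device_cq, raise_irq, ack):
    """Handle already-polled backend completions; return how many were handled."""
    handled = 0
    for comp in completions:
        ctx = contexts[comp["ctx_id"]]              # saved at post time
        cqe = {
            "wr_id": ctx["wr_id"],
            "op_code": ctx["op_code"],
            "status": comp["status"],
        }
        cq_rings[ctx["emu_cq_num"]].append(cqe)     # CQE into the guest CQ ring
        device_cq.append(ctx["emu_cq_num"])         # CQ number into the device CQ
        raise_irq()                                 # completion interrupt
        del contexts[comp["ctx_id"]]                # deallocate the context
        ack(comp)                                   # ack the event to the backend
        handled += 1
    return handled


contexts = {1: {"wr_id": 7, "op_code": "recv", "emu_cq_num": 3}}
cq_rings = {3: []}
device_cq, irqs, acked = [], [], []
handled = process_completions(
    [{"ctx_id": 1, "status": 0}], contexts, cq_rings, device_cq,
    lambda: irqs.append(1), acked.append,
)
print(handled, cq_rings, device_cq)
```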



5. Limitations
==============
- The device is limited by the features of the VMware device API that the
  guest Linux driver implements.
- The memory registration mechanism requires an mremap for every page in the
  buffer in order to map it to a contiguous virtual address range. Since this
  is not the data path it should not matter much. If the default max mr size
  is increased, be aware that memory registration can take up to 0.5 seconds
  for 1GB of memory.
- The device requires the target page size to be the same as the host page
  size, otherwise it will fail to init.
- QEMU cannot map guest RAM from a file descriptor if a pvrdma device is
  attached, so it can't work with huge pages. The limitation will be addressed
  in the future; however, QEMU allocates Guest RAM with MADV_HUGEPAGE, so if
  there are enough huge pages available, QEMU will use them. QEMU will fail to
  init if the requirements are not met.
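
To put the memory registration cost in perspective, the number of per-page
remaps grows linearly with the buffer size. The small calculation below
assumes 4 KiB pages, which is an assumption about the host rather than
something this document specifies; the helper name is illustrative.

```python
def mremap_calls(buffer_bytes, page_size=4096):
    """Number of per-page remaps needed to register a buffer (rounded up)."""
    return -(-buffer_bytes // page_size)


pages = mremap_calls(1 << 30)   # 1GB buffer, assumed 4 KiB pages
print(pages)                    # 262144
```

With 262144 pages, the 0.5 second upper bound quoted above corresponds to
roughly 1.9 microseconds per page.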



6. Performance
==============
By design the pvrdma device exits on each post-send/receive, so for small buffers
the performance is affected; however for medium buffers it will become close to
bare metal, and from 1MB buffers and up it reaches bare metal performance.
(tested with 2 VMs, the pvrdma devices connected to 2 VFs of the same device)

All the above assumes no memory registration is done on data path.