Suggested-by: Cédric Le Goater <clg@redhat.com> Signed-off-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Cédric Le Goater <clg@redhat.com>
===============================
IOMMUFD BACKEND usage with VFIO
===============================

(In this document, "backend", "container" and "BE" have the same meaning.)

With the introduction of iommufd, the Linux kernel provides a generic
interface for user space drivers to propagate their DMA mappings to the
kernel for assigned devices. While the legacy kernel interface is
group-centric, the new iommufd interface is device-centric, relying on a
device fd and an iommufd.

To support both interfaces in the QEMU VFIO device, a base container
abstracts the common part of the VFIO legacy and iommufd containers, so that
the generic VFIO code can use either container.

The base container implements generic functions such as memory_listener and
address space management, whereas the derived container implements callbacks
specific to either legacy or iommufd. Each container has its own way to set
up the secure context and its own DMA management interface. The diagram below
shows how the two containers fit together.

::

                      VFIO                           AddressSpace/Memory
      +-------+  +----------+  +-----+  +-----+
      |  pci  |  | platform |  |  ap |  | ccw |
      +---+---+  +----+-----+  +--+--+  +--+--+     +----------------------+
          |           |           |        |        |   AddressSpace       |
          |           |           |        |        +------------+---------+
      +---V-----------V-----------V--------V----+               /
      |           VFIOAddressSpace              | <------------+
      |                  |                      |  MemoryListener
      |        VFIOContainerBase list           |
      +-------+----------------------------+----+
              |                            |
              |                            |
      +-------V------+            +--------V----------+
      |   iommufd    |            |    vfio legacy    |
      |  container   |            |     container     |
      +-------+------+            +--------+----------+
              |                            |
              | /dev/iommu                 | /dev/vfio/vfio
              | /dev/vfio/devices/vfioX    | /dev/vfio/$group_id
  Userspace   |                            |
  ============+============================+===========================
  Kernel      |  device fd                 |
              +---------------+            | group/container fd
              | (BIND_IOMMUFD |            | (SET_CONTAINER/SET_IOMMU)
              |  ATTACH_IOAS) |            | device fd
              |               |            |
              |       +-------V------------V-----------------+
      iommufd |       |                 vfio                 |
  (map/unmap  |       +---------+--------------------+-------+
  ioas_copy)  |                 |                    | map/unmap
              |                 |                    |
      +-------V------+    +-----V------+      +------V--------+
      | iommufd core |    |  device    |      |  vfio iommu   |
      +--------------+    +------------+      +---------------+

* Secure Context setup

  - iommufd BE: uses the device fd and iommufd to set up the secure context
    (bind_iommufd, attach_ioas)
  - vfio legacy BE: uses the group fd and container fd to set up the secure
    context (set_container, set_iommu)

* Device access

  - iommufd BE: the device fd is opened through ``/dev/vfio/devices/vfioX``
  - vfio legacy BE: the device fd is retrieved through a group fd ioctl

* DMA Mapping flow

  1. VFIOAddressSpace receives MemoryRegion add/del via MemoryListener
  2. VFIO populates DMA map/unmap via the container BEs

     * iommufd BE: uses iommufd
     * vfio legacy BE: uses container fd

Example configuration
=====================

Step 1: configure the host device
---------------------------------

This is exactly the same as for a VFIO device with the legacy VFIO container.

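As a minimal sketch of that host-side setup, assuming the host device sits at
PCI address ``0000:02:00.0`` and the host kernel exposes the VFIO device cdev,
the device can be bound to the ``vfio-pci`` host driver through sysfs (run as
root). The last command lists the ``vfioX`` name under which the device
appears in ``/dev/vfio/devices/``.

.. code-block:: bash

    # Assumption: 0000:02:00.0 is the host device to assign.
    bdf=0000:02:00.0
    modprobe vfio-pci
    # Detach the device from its current host driver, if it has one.
    if [ -e /sys/bus/pci/devices/$bdf/driver ]; then
        echo "$bdf" > /sys/bus/pci/devices/$bdf/driver/unbind
    fi
    # Make vfio-pci the only driver allowed to claim it, then re-probe.
    echo vfio-pci > /sys/bus/pci/devices/$bdf/driver_override
    echo "$bdf" > /sys/bus/pci/drivers_probe
    # With the VFIO cdev available, this prints the vfioX name.
    ls /sys/bus/pci/devices/$bdf/vfio-dev/
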
Step 2: configure QEMU
----------------------

Interactions with ``/dev/iommu`` are abstracted by a new iommufd
object (compiled in with the ``CONFIG_IOMMUFD`` option).

Any QEMU device (e.g. a VFIO device) wishing to use ``/dev/iommu`` must
be linked with an iommufd object. Such a device gets a new optional property
named ``iommufd``, through which an iommufd object is passed. Taking the
``vfio-pci`` device as an example:

.. code-block:: bash

    -object iommufd,id=iommufd0
    -device vfio-pci,host=0000:02:00.0,iommufd=iommufd0

Note that ``/dev/iommu`` and the VFIO cdev can be opened externally by a
management layer. In that case the fds are passed to QEMU through the ``fd``
property, which accepts either a number or a string naming the fd, for
example:

.. code-block:: bash

    -object iommufd,id=iommufd0,fd=22
    -device vfio-pci,iommufd=iommufd0,fd=23

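The fd numbers in this example have to refer to descriptors that are already
open when QEMU starts. As a sketch only, assuming a bash script plays the role
of the management layer and the device cdev is ``/dev/vfio/devices/vfio0``
(a hypothetical name; the real ``vfioX`` differs per host), the two files can
be opened read/write and inherited by QEMU:

.. code-block:: bash

    # Open /dev/iommu as fd 22 and the VFIO cdev as fd 23 (read/write).
    exec 22<>/dev/iommu
    exec 23<>/dev/vfio/devices/vfio0   # hypothetical vfioX name

    # QEMU inherits both descriptors; other options are omitted here.
    qemu-system-x86_64 \
        -object iommufd,id=iommufd0,fd=22 \
        -device vfio-pci,iommufd=iommufd0,fd=23
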
If the ``fd`` property is not passed, the fd is opened by QEMU.

If no ``iommufd`` object is passed to the ``vfio-pci`` device, iommufd
is not used and the user gets the behavior based on the legacy VFIO
container:

.. code-block:: bash

    -device vfio-pci,host=0000:02:00.0

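A wrapper script may want to choose between the two forms automatically. The
following is a sketch and not part of QEMU: ``vfio_args`` is a hypothetical
helper that prints the iommufd form when the ``/dev/iommu`` character device
exists and the legacy form otherwise. The presence of ``/dev/iommu`` does not
by itself guarantee that QEMU was built with ``CONFIG_IOMMUFD``, so treat the
check as a heuristic. The optional second argument only exists to make the
check easy to exercise.

.. code-block:: bash

    # Hypothetical helper: print vfio-pci arguments for host device $1.
    # $2 (optional) overrides the iommufd device node to probe.
    vfio_args() {
        local bdf="$1" iommu_dev="${2:-/dev/iommu}"
        if [ -c "$iommu_dev" ]; then
            echo "-object iommufd,id=iommufd0" \
                 "-device vfio-pci,host=$bdf,iommufd=iommufd0"
        else
            echo "-device vfio-pci,host=$bdf"
        fi
    }

    vfio_args 0000:02:00.0
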
Supported platform
==================

x86, ARM and s390x are currently supported.

Caveats
=======

Dirty page sync
---------------

Dirty page sync is not yet supported with the iommufd backend, so live
migration is disabled by default. It can be force-enabled as shown below,
although migration is then inefficient.

.. code-block:: bash

    -object iommufd,id=iommufd0
    -device vfio-pci,host=0000:02:00.0,iommufd=iommufd0,enable-migration=on

P2P DMA
-------

PCI P2P DMA is unsupported because IOMMUFD does not yet support mapping
hardware PCI BAR regions. The warning below is shown for an assigned PCI
device; it is not a bug.

.. code-block:: none

    qemu-system-x86_64: warning: IOMMU_IOAS_MAP failed: Bad address, PCI BAR?
    qemu-system-x86_64: vfio_container_dma_map(0x560cb6cb1620, 0xe000000021000, 0x3000, 0x7f32ed55c000) = -14 (Bad address)

FD passing with mdev
--------------------

The ``vfio-pci`` device checks the ``sysfsdev`` property to decide whether the
backend is an mdev. If FD passing is used, there is no way to know that, and
the mdev is treated like a real PCI device. The error below is reported if
the user tries to enable RAM discarding for the mdev.

.. code-block:: none

    qemu-system-x86_64: -device vfio-pci,iommufd=iommufd0,x-balloon-allowed=on,fd=9: vfio VFIO_FD9: x-balloon-allowed only potentially compatible with mdev devices

The ``vfio-ap`` and ``vfio-ccw`` devices do not have this issue because their
backend devices are always mdevs and RAM discarding is force-enabled.

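For comparison, when the mdev is given by its sysfs path instead of a passed
fd, ``vfio-pci`` can see the ``sysfsdev`` property, so the mdev check
described above has something to work with. A sketch, where ``<mdev-uuid>``
is a placeholder for an mdev that already exists on the host:

.. code-block:: bash

    -object iommufd,id=iommufd0
    -device vfio-pci,sysfsdev=/sys/bus/mdev/devices/<mdev-uuid>,iommufd=iommufd0
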