2015-09-23 21:04:44 +02:00
|
|
|
/*
|
|
|
|
* vfio based device assignment support - PCI devices
|
|
|
|
*
|
|
|
|
* Copyright Red Hat, Inc. 2012-2015
|
|
|
|
*
|
|
|
|
* Authors:
|
|
|
|
* Alex Williamson <alex.williamson@redhat.com>
|
|
|
|
*
|
|
|
|
* This work is licensed under the terms of the GNU GPL, version 2. See
|
|
|
|
* the COPYING file in the top-level directory.
|
|
|
|
*/
|
|
|
|
#ifndef HW_VFIO_VFIO_PCI_H
|
|
|
|
#define HW_VFIO_VFIO_PCI_H
|
|
|
|
|
|
|
|
#include "exec/memory.h"
|
|
|
|
#include "hw/pci/pci.h"
|
|
|
|
#include "hw/vfio/vfio-common.h"
|
|
|
|
#include "qemu/event_notifier.h"
|
|
|
|
#include "qemu/queue.h"
|
|
|
|
#include "qemu/timer.h"
|
2020-09-03 22:43:22 +02:00
|
|
|
#include "qom/object.h"
|
2015-09-23 21:04:44 +02:00
|
|
|
|
2015-09-23 21:04:49 +02:00
|
|
|
#define PCI_ANY_ID (~0)
|
|
|
|
|
2015-09-23 21:04:44 +02:00
|
|
|
struct VFIOPCIDevice;
|
|
|
|
|
vfio/quirks: ioeventfd quirk acceleration
The NVIDIA BAR0 quirks virtualize the PCI config space mirrors found
in device MMIO space. Normally PCI config space is considered a slow
path and further optimization is unnecessary, however NVIDIA uses a
register here to enable the MSI interrupt to re-trigger. Exiting to
QEMU for this MSI-ACK handling can therefore rate limit our interrupt
handling. Fortunately the MSI-ACK write is easily detected since the
quirk MemoryRegion otherwise has very few accesses, so simply looking
for consecutive writes with the same data is sufficient, in this case
10 consecutive writes with the same data and size is arbitrarily
chosen. We configure the KVM ioeventfd with data match, so there's
no risk of triggering for the wrong data or size, but we do risk that
pathological driver behavior might consume all of QEMU's file
descriptors, so we cap ourselves to 10 ioeventfds for this purpose.
In support of the above, generic ioeventfd infrastructure is added
for vfio quirks. This automatically initializes an ioeventfd list
per quirk, disables and frees ioeventfds on exit, and allows
ioeventfds marked as dynamic to be dropped on device reset. The
rationale for this latter feature is that useful ioeventfds may
depend on specific driver behavior and since we necessarily place a
cap on our use of ioeventfds, a machine reset is a reasonable point
at which to assume a new driver and re-profile.
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2018-06-05 16:23:17 +02:00
|
|
|
typedef struct VFIOIOEventFD {
|
|
|
|
QLIST_ENTRY(VFIOIOEventFD) next;
|
|
|
|
MemoryRegion *mr;
|
|
|
|
hwaddr addr;
|
|
|
|
unsigned size;
|
|
|
|
uint64_t data;
|
|
|
|
EventNotifier e;
|
|
|
|
VFIORegion *region;
|
|
|
|
hwaddr region_addr;
|
|
|
|
bool dynamic; /* Added runtime, removed on device reset */
|
2018-06-05 16:23:17 +02:00
|
|
|
bool vfio;
|
vfio/quirks: ioeventfd quirk acceleration
The NVIDIA BAR0 quirks virtualize the PCI config space mirrors found
in device MMIO space. Normally PCI config space is considered a slow
path and further optimization is unnecessary, however NVIDIA uses a
register here to enable the MSI interrupt to re-trigger. Exiting to
QEMU for this MSI-ACK handling can therefore rate limit our interrupt
handling. Fortunately the MSI-ACK write is easily detected since the
quirk MemoryRegion otherwise has very few accesses, so simply looking
for consecutive writes with the same data is sufficient, in this case
10 consecutive writes with the same data and size is arbitrarily
chosen. We configure the KVM ioeventfd with data match, so there's
no risk of triggering for the wrong data or size, but we do risk that
pathological driver behavior might consume all of QEMU's file
descriptors, so we cap ourselves to 10 ioeventfds for this purpose.
In support of the above, generic ioeventfd infrastructure is added
for vfio quirks. This automatically initializes an ioeventfd list
per quirk, disables and frees ioeventfds on exit, and allows
ioeventfds marked as dynamic to be dropped on device reset. The
rationale for this latter feature is that useful ioeventfds may
depend on specific driver behavior and since we necessarily place a
cap on our use of ioeventfds, a machine reset is a reasonable point
at which to assume a new driver and re-profile.
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2018-06-05 16:23:17 +02:00
|
|
|
} VFIOIOEventFD;
|
|
|
|
|
2015-09-23 21:04:46 +02:00
|
|
|
typedef struct VFIOQuirk {
|
|
|
|
QLIST_ENTRY(VFIOQuirk) next;
|
|
|
|
void *data;
|
vfio/quirks: ioeventfd quirk acceleration
The NVIDIA BAR0 quirks virtualize the PCI config space mirrors found
in device MMIO space. Normally PCI config space is considered a slow
path and further optimization is unnecessary, however NVIDIA uses a
register here to enable the MSI interrupt to re-trigger. Exiting to
QEMU for this MSI-ACK handling can therefore rate limit our interrupt
handling. Fortunately the MSI-ACK write is easily detected since the
quirk MemoryRegion otherwise has very few accesses, so simply looking
for consecutive writes with the same data is sufficient, in this case
10 consecutive writes with the same data and size is arbitrarily
chosen. We configure the KVM ioeventfd with data match, so there's
no risk of triggering for the wrong data or size, but we do risk that
pathological driver behavior might consume all of QEMU's file
descriptors, so we cap ourselves to 10 ioeventfds for this purpose.
In support of the above, generic ioeventfd infrastructure is added
for vfio quirks. This automatically initializes an ioeventfd list
per quirk, disables and frees ioeventfds on exit, and allows
ioeventfds marked as dynamic to be dropped on device reset. The
rationale for this latter feature is that useful ioeventfds may
depend on specific driver behavior and since we necessarily place a
cap on our use of ioeventfds, a machine reset is a reasonable point
at which to assume a new driver and re-profile.
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2018-06-05 16:23:17 +02:00
|
|
|
QLIST_HEAD(, VFIOIOEventFD) ioeventfds;
|
2015-09-23 21:04:46 +02:00
|
|
|
int nr_mem;
|
|
|
|
MemoryRegion *mem;
|
2018-06-05 16:23:17 +02:00
|
|
|
void (*reset)(struct VFIOPCIDevice *vdev, struct VFIOQuirk *quirk);
|
2015-09-23 21:04:44 +02:00
|
|
|
} VFIOQuirk;
|
|
|
|
|
|
|
|
typedef struct VFIOBAR {
|
|
|
|
VFIORegion region;
|
2018-02-06 19:08:25 +01:00
|
|
|
MemoryRegion *mr;
|
|
|
|
size_t size;
|
|
|
|
uint8_t type;
|
2015-09-23 21:04:44 +02:00
|
|
|
bool ioport;
|
|
|
|
bool mem64;
|
|
|
|
QLIST_HEAD(, VFIOQuirk) quirks;
|
|
|
|
} VFIOBAR;
|
|
|
|
|
|
|
|
typedef struct VFIOVGARegion {
|
|
|
|
MemoryRegion mem;
|
|
|
|
off_t offset;
|
|
|
|
int nr;
|
|
|
|
QLIST_HEAD(, VFIOQuirk) quirks;
|
|
|
|
} VFIOVGARegion;
|
|
|
|
|
|
|
|
typedef struct VFIOVGA {
|
|
|
|
off_t fd_offset;
|
|
|
|
int fd;
|
|
|
|
VFIOVGARegion region[QEMU_PCI_VGA_NUM_REGIONS];
|
|
|
|
} VFIOVGA;
|
|
|
|
|
|
|
|
typedef struct VFIOINTx {
|
|
|
|
bool pending; /* interrupt pending */
|
|
|
|
bool kvm_accel; /* set when QEMU bypass through KVM enabled */
|
|
|
|
uint8_t pin; /* which pin to pull for qemu_set_irq */
|
|
|
|
EventNotifier interrupt; /* eventfd triggered on interrupt */
|
|
|
|
EventNotifier unmask; /* eventfd for unmask on QEMU bypass */
|
|
|
|
PCIINTxRoute route; /* routing info for QEMU bypass */
|
|
|
|
uint32_t mmap_timeout; /* delay to re-enable mmaps after interrupt */
|
|
|
|
QEMUTimer *mmap_timer; /* enable mmaps after periods w/o interrupts */
|
|
|
|
} VFIOINTx;
|
|
|
|
|
|
|
|
typedef struct VFIOMSIVector {
|
|
|
|
/*
|
|
|
|
* Two interrupt paths are configured per vector. The first, is only used
|
|
|
|
* for interrupts injected via QEMU. This is typically the non-accel path,
|
|
|
|
* but may also be used when we want QEMU to handle masking and pending
|
|
|
|
* bits. The KVM path bypasses QEMU and is therefore higher performance,
|
|
|
|
* but requires masking at the device. virq is used to track the MSI route
|
|
|
|
* through KVM, thus kvm_interrupt is only available when virq is set to a
|
|
|
|
* valid (>= 0) value.
|
|
|
|
*/
|
|
|
|
EventNotifier interrupt;
|
|
|
|
EventNotifier kvm_interrupt;
|
|
|
|
struct VFIOPCIDevice *vdev; /* back pointer to device */
|
|
|
|
int virq;
|
|
|
|
bool use;
|
|
|
|
} VFIOMSIVector;
|
|
|
|
|
|
|
|
enum {
|
|
|
|
VFIO_INT_NONE = 0,
|
|
|
|
VFIO_INT_INTx = 1,
|
|
|
|
VFIO_INT_MSI = 2,
|
|
|
|
VFIO_INT_MSIX = 3,
|
|
|
|
};
|
|
|
|
|
2018-02-06 19:08:25 +01:00
|
|
|
/* Cache of MSI-X setup */
|
2015-09-23 21:04:44 +02:00
|
|
|
typedef struct VFIOMSIXInfo {
|
|
|
|
uint8_t table_bar;
|
|
|
|
uint8_t pba_bar;
|
|
|
|
uint16_t entries;
|
|
|
|
uint32_t table_offset;
|
|
|
|
uint32_t pba_offset;
|
2016-01-19 19:33:42 +01:00
|
|
|
unsigned long *pending;
|
2015-09-23 21:04:44 +02:00
|
|
|
} VFIOMSIXInfo;
|
|
|
|
|
2020-08-25 21:20:38 +02:00
|
|
|
#define TYPE_VFIO_PCI "vfio-pci"
|
2020-09-16 20:25:19 +02:00
|
|
|
OBJECT_DECLARE_SIMPLE_TYPE(VFIOPCIDevice, VFIO_PCI)
|
2020-08-25 21:20:38 +02:00
|
|
|
|
2020-09-03 22:43:22 +02:00
|
|
|
struct VFIOPCIDevice {
|
2015-09-23 21:04:44 +02:00
|
|
|
PCIDevice pdev;
|
|
|
|
VFIODevice vbasedev;
|
|
|
|
VFIOINTx intx;
|
|
|
|
unsigned int config_size;
|
|
|
|
uint8_t *emulated_config_bits; /* QEMU emulated bits, little-endian */
|
|
|
|
off_t config_offset; /* Offset of config space region within device fd */
|
|
|
|
unsigned int rom_size;
|
|
|
|
off_t rom_offset; /* Offset of ROM region within device fd */
|
|
|
|
void *rom;
|
|
|
|
int msi_cap_size;
|
|
|
|
VFIOMSIVector *msi_vectors;
|
|
|
|
VFIOMSIXInfo *msix;
|
|
|
|
int nr_vectors; /* Number of MSI/MSIX vectors currently in use */
|
|
|
|
int interrupt; /* Current interrupt type */
|
|
|
|
VFIOBAR bars[PCI_NUM_REGIONS - 1]; /* No ROM */
|
2016-03-10 17:39:08 +01:00
|
|
|
VFIOVGA *vga; /* 0xa0000, 0x3b0, 0x3c0 */
|
vfio/pci: Intel graphics legacy mode assignment
Enable quirks to support SandyBridge and newer IGD devices as primary
VM graphics. This requires new vfio-pci device specific regions added
in kernel v4.6 to expose the IGD OpRegion, the shadow ROM, and config
space access to the PCI host bridge and LPC/ISA bridge. VM firmware
support, SeaBIOS only so far, is also required for reserving memory
regions for IGD specific use. In order to enable this mode, IGD must
be assigned to the VM at PCI bus address 00:02.0, it must have a ROM,
it must be able to enable VGA, it must have or be able to create on
its own an LPC/ISA bridge of the proper type at PCI bus address
00:1f.0 (sorry, not compatible with Q35 yet), and it must have the
above noted vfio-pci kernel features and BIOS. The intention is that
to enable this mode, a user simply needs to assign 00:02.0 from the
host to 00:02.0 in the VM:
-device vfio-pci,host=0000:00:02.0,bus=pci.0,addr=02.0
and everything either happens automatically or it doesn't. In the
case that it doesn't, we leave error reports, but assume the device
will operate in universal passthrough mode (UPT), which doesn't
require any of this, but has a much more narrow window of supported
devices, supported use cases, and supported guest drivers.
When using IGD in this mode, the VM firmware is required to reserve
some VM RAM for the OpRegion (on the order or several 4k pages) and
stolen memory for the GTT (up to 8MB for the latest GPUs). An
additional option, x-igd-gms allows the user to specify some amount
of additional memory (value is number of 32MB chunks up to 512MB) that
is pre-allocated for graphics use. TBH, I don't know of anything that
requires this or makes use of this memory, which is why we don't
allocate any by default, but the specification suggests this is not
actually a valid combination, so the option exists as a workaround.
Please report if it's actually necessary in some environment.
See code comments for further discussion about the actual operation
of the quirks necessary to assign these devices.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Gerd Hoffmann <kraxel@redhat.com>
Tested-by: Gerd Hoffmann <kraxel@redhat.com>
2016-05-26 17:43:21 +02:00
|
|
|
void *igd_opregion;
|
2015-09-23 21:04:44 +02:00
|
|
|
PCIHostDeviceAddress host;
|
|
|
|
EventNotifier err_notifier;
|
|
|
|
EventNotifier req_notifier;
|
|
|
|
int (*resetfn)(struct VFIOPCIDevice *);
|
2015-09-23 21:04:49 +02:00
|
|
|
uint32_t vendor_id;
|
|
|
|
uint32_t device_id;
|
|
|
|
uint32_t sub_vendor_id;
|
|
|
|
uint32_t sub_device_id;
|
2015-09-23 21:04:44 +02:00
|
|
|
uint32_t features;
|
|
|
|
#define VFIO_FEATURE_ENABLE_VGA_BIT 0
|
|
|
|
#define VFIO_FEATURE_ENABLE_VGA (1 << VFIO_FEATURE_ENABLE_VGA_BIT)
|
|
|
|
#define VFIO_FEATURE_ENABLE_REQ_BIT 1
|
|
|
|
#define VFIO_FEATURE_ENABLE_REQ (1 << VFIO_FEATURE_ENABLE_REQ_BIT)
|
2016-05-26 17:43:22 +02:00
|
|
|
#define VFIO_FEATURE_ENABLE_IGD_OPREGION_BIT 2
|
|
|
|
#define VFIO_FEATURE_ENABLE_IGD_OPREGION \
|
|
|
|
(1 << VFIO_FEATURE_ENABLE_IGD_OPREGION_BIT)
|
2018-03-13 18:17:29 +01:00
|
|
|
OnOffAuto display;
|
2019-03-11 18:14:40 +01:00
|
|
|
uint32_t display_xres;
|
|
|
|
uint32_t display_yres;
|
2015-09-23 21:04:44 +02:00
|
|
|
int32_t bootindex;
|
vfio/pci: Intel graphics legacy mode assignment
Enable quirks to support SandyBridge and newer IGD devices as primary
VM graphics. This requires new vfio-pci device specific regions added
in kernel v4.6 to expose the IGD OpRegion, the shadow ROM, and config
space access to the PCI host bridge and LPC/ISA bridge. VM firmware
support, SeaBIOS only so far, is also required for reserving memory
regions for IGD specific use. In order to enable this mode, IGD must
be assigned to the VM at PCI bus address 00:02.0, it must have a ROM,
it must be able to enable VGA, it must have or be able to create on
its own an LPC/ISA bridge of the proper type at PCI bus address
00:1f.0 (sorry, not compatible with Q35 yet), and it must have the
above noted vfio-pci kernel features and BIOS. The intention is that
to enable this mode, a user simply needs to assign 00:02.0 from the
host to 00:02.0 in the VM:
-device vfio-pci,host=0000:00:02.0,bus=pci.0,addr=02.0
and everything either happens automatically or it doesn't. In the
case that it doesn't, we leave error reports, but assume the device
will operate in universal passthrough mode (UPT), which doesn't
require any of this, but has a much more narrow window of supported
devices, supported use cases, and supported guest drivers.
When using IGD in this mode, the VM firmware is required to reserve
some VM RAM for the OpRegion (on the order or several 4k pages) and
stolen memory for the GTT (up to 8MB for the latest GPUs). An
additional option, x-igd-gms allows the user to specify some amount
of additional memory (value is number of 32MB chunks up to 512MB) that
is pre-allocated for graphics use. TBH, I don't know of anything that
requires this or makes use of this memory, which is why we don't
allocate any by default, but the specification suggests this is not
actually a valid combination, so the option exists as a workaround.
Please report if it's actually necessary in some environment.
See code comments for further discussion about the actual operation
of the quirks necessary to assign these devices.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Gerd Hoffmann <kraxel@redhat.com>
Tested-by: Gerd Hoffmann <kraxel@redhat.com>
2016-05-26 17:43:21 +02:00
|
|
|
uint32_t igd_gms;
|
vfio/pci: Allow relocating MSI-X MMIO
Recently proposed vfio-pci kernel changes (v4.16) remove the
restriction preventing userspace from mmap'ing PCI BARs in areas
overlapping the MSI-X vector table. This change is primarily intended
to benefit host platforms which make use of system page sizes larger
than the PCI spec recommendation for alignment of MSI-X data
structures (ie. not x86_64). In the case of POWER systems, the SPAPR
spec requires the VM to program MSI-X using hypercalls, rendering the
MSI-X vector table unused in the VM view of the device. However,
ARM64 platforms also support 64KB pages and rely on QEMU emulation of
MSI-X. Regardless of the kernel driver allowing mmaps overlapping
the MSI-X vector table, emulation of the MSI-X vector table also
prevents direct mapping of device MMIO spaces overlapping this page.
Thanks to the fact that PCI devices have a standard self discovery
mechanism, we can try to resolve this by relocating the MSI-X data
structures, either by creating a new PCI BAR or extending an existing
BAR and updating the MSI-X capability for the new location. There's
even a very slim chance that this could benefit devices which do not
adhere to the PCI spec alignment guidelines on x86_64 systems.
This new x-msix-relocation option accepts the following choices:
off: Disable MSI-X relocation, use native device config (default)
auto: Use a known good combination for the platform/device (none yet)
bar0..bar5: Specify the target BAR for MSI-X data structures
If compatible, the target BAR will either be created or extended and
the new portion will be used for MSI-X emulation.
The first obvious user question with this option is how to determine
whether a given platform and device might benefit from this option.
In most cases, the answer is that it won't, especially on x86_64.
Devices often dedicate an entire BAR to MSI-X and therefore no
performance sensitive registers overlap the MSI-X area. Take for
example:
# lspci -vvvs 0a:00.0
0a:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network Connection
...
Region 0: Memory at db680000 (32-bit, non-prefetchable) [size=512K]
Region 3: Memory at db7f8000 (32-bit, non-prefetchable) [size=16K]
...
Capabilities: [70] MSI-X: Enable+ Count=10 Masked-
Vector table: BAR=3 offset=00000000
PBA: BAR=3 offset=00002000
This device uses the 16K bar3 for MSI-X with the vector table at
offset zero and the pending bits arrary at offset 8K, fully honoring
the PCI spec alignment guidance. The data sheet specifically refers
to this as an MSI-X BAR. This device would not see a benefit from
MSI-X relocation regardless of the platform, regardless of the page
size.
However, here's another example:
# lspci -vvvs 02:00.0
02:00.0 Serial Attached SCSI controller: xxxxxxxx
...
Region 0: I/O ports at c000 [size=256]
Region 1: Memory at ef640000 (64-bit, non-prefetchable) [size=64K]
Region 3: Memory at ef600000 (64-bit, non-prefetchable) [size=256K]
...
Capabilities: [c0] MSI-X: Enable+ Count=16 Masked-
Vector table: BAR=1 offset=0000e000
PBA: BAR=1 offset=0000f000
Here the MSI-X data structures are placed on separate 4K pages at the
end of a 64KB BAR. If our host page size is 4K, we're likely fine,
but at 64KB page size, MSI-X emulation at that location prevents the
entire BAR from being directly mapped into the VM address space.
Overlapping performance sensitive registers then starts to be a very
likely scenario on such a platform. At this point, the user could
enable tracing on vfio_region_read and vfio_region_write to determine
more conclusively if device accesses are being trapped through QEMU.
Upon finding a device and platform in need of MSI-X relocation, the
next problem is how to choose target PCI BAR to host the MSI-X data
structures. A few key rules to keep in mind for this selection
include:
* There are only 6 BAR slots, bar0..bar5
* 64-bit BARs occupy two BAR slots, 'lspci -vvv' lists the first slot
* PCI BARs are always a power of 2 in size, extending == doubling
* The maximum size of a 32-bit BAR is 2GB
* MSI-X data structures must reside in an MMIO BAR
Using these rules, we can evaluate each BAR of the second example
device above as follows:
bar0: I/O port BAR, incompatible with MSI-X tables
bar1: BAR could be extended, incurring another 64KB of MMIO
bar2: Unavailable, bar1 is 64-bit, this register is used by bar1
bar3: BAR could be extended, incurring another 256KB of MMIO
bar4: Unavailable, bar3 is 64bit, this register is used by bar3
bar5: Available, empty BAR, minimum additional MMIO
A secondary optimization we might wish to make in relocating MSI-X
is to minimize the additional MMIO required for the device, therefore
we might test the available choices in order of preference as bar5,
bar1, and finally bar3. The original proposal for this feature
included an 'auto' option which would choose bar5 in this case, but
various drivers have been found that make assumptions about the
properties of the "first" BAR or the size of BARs such that there
appears to be no foolproof automatic selection available, requiring
known good combinations to be sourced from users. This patch is
pre-enabled for an 'auto' selection making use of a validated lookup
table, but no entries are yet identified.
Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Tested-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2018-02-06 19:08:26 +01:00
|
|
|
OffAutoPCIBAR msix_relo;
|
2015-09-23 21:04:44 +02:00
|
|
|
uint8_t pm_cap;
|
2017-08-30 00:05:47 +02:00
|
|
|
uint8_t nv_gpudirect_clique;
|
2015-09-23 21:04:44 +02:00
|
|
|
bool pci_aer;
|
|
|
|
bool req_enabled;
|
|
|
|
bool has_flr;
|
|
|
|
bool has_pm_reset;
|
|
|
|
bool rom_read_failed;
|
|
|
|
bool no_kvm_intx;
|
|
|
|
bool no_kvm_msi;
|
|
|
|
bool no_kvm_msix;
|
2018-02-06 19:08:27 +01:00
|
|
|
bool no_geforce_quirks;
|
vfio/quirks: ioeventfd quirk acceleration
The NVIDIA BAR0 quirks virtualize the PCI config space mirrors found
in device MMIO space. Normally PCI config space is considered a slow
path and further optimization is unnecessary, however NVIDIA uses a
register here to enable the MSI interrupt to re-trigger. Exiting to
QEMU for this MSI-ACK handling can therefore rate limit our interrupt
handling. Fortunately the MSI-ACK write is easily detected since the
quirk MemoryRegion otherwise has very few accesses, so simply looking
for consecutive writes with the same data is sufficient, in this case
10 consecutive writes with the same data and size is arbitrarily
chosen. We configure the KVM ioeventfd with data match, so there's
no risk of triggering for the wrong data or size, but we do risk that
pathological driver behavior might consume all of QEMU's file
descriptors, so we cap ourselves to 10 ioeventfds for this purpose.
In support of the above, generic ioeventfd infrastructure is added
for vfio quirks. This automatically initializes an ioeventfd list
per quirk, disables and frees ioeventfds on exit, and allows
ioeventfds marked as dynamic to be dropped on device reset. The
rationale for this latter feature is that useful ioeventfds may
depend on specific driver behavior and since we necessarily place a
cap on our use of ioeventfds, a machine reset is a reasonable point
at which to assume a new driver and re-profile.
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2018-06-05 16:23:17 +02:00
|
|
|
bool no_kvm_ioeventfd;
|
2018-06-05 16:23:17 +02:00
|
|
|
bool no_vfio_ioeventfd;
|
2018-10-15 18:52:09 +02:00
|
|
|
bool enable_ramfb;
|
2018-03-13 18:17:30 +01:00
|
|
|
VFIODisplay *dpy;
|
2019-10-17 03:38:30 +02:00
|
|
|
Notifier irqchip_change_notifier;
|
2020-09-03 22:43:22 +02:00
|
|
|
};
|
2015-09-23 21:04:44 +02:00
|
|
|
|
2020-02-06 19:55:42 +01:00
|
|
|
/* Use uin32_t for vendor & device so PCI_ANY_ID expands and cannot match hw */
|
|
|
|
static inline bool vfio_pci_is(VFIOPCIDevice *vdev, uint32_t vendor, uint32_t device)
|
|
|
|
{
|
|
|
|
return (vendor == PCI_ANY_ID || vendor == vdev->vendor_id) &&
|
|
|
|
(device == PCI_ANY_ID || device == vdev->device_id);
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline bool vfio_is_vga(VFIOPCIDevice *vdev)
|
|
|
|
{
|
|
|
|
PCIDevice *pdev = &vdev->pdev;
|
|
|
|
uint16_t class = pci_get_word(pdev->config + PCI_CLASS_DEVICE);
|
|
|
|
|
|
|
|
return class == PCI_CLASS_DISPLAY_VGA;
|
|
|
|
}
|
|
|
|
|
2015-09-23 21:04:45 +02:00
|
|
|
uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len);
|
|
|
|
void vfio_pci_write_config(PCIDevice *pdev,
|
|
|
|
uint32_t addr, uint32_t val, int len);
|
|
|
|
|
|
|
|
uint64_t vfio_vga_read(void *opaque, hwaddr addr, unsigned size);
|
|
|
|
void vfio_vga_write(void *opaque, hwaddr addr, uint64_t data, unsigned size);
|
|
|
|
|
2021-02-05 18:18:17 +01:00
|
|
|
bool vfio_opt_rom_in_denylist(VFIOPCIDevice *vdev);
|
2015-09-23 21:04:45 +02:00
|
|
|
void vfio_vga_quirk_setup(VFIOPCIDevice *vdev);
|
2016-03-10 17:39:08 +01:00
|
|
|
void vfio_vga_quirk_exit(VFIOPCIDevice *vdev);
|
|
|
|
void vfio_vga_quirk_finalize(VFIOPCIDevice *vdev);
|
2015-09-23 21:04:45 +02:00
|
|
|
void vfio_bar_quirk_setup(VFIOPCIDevice *vdev, int nr);
|
2016-03-10 17:39:08 +01:00
|
|
|
void vfio_bar_quirk_exit(VFIOPCIDevice *vdev, int nr);
|
|
|
|
void vfio_bar_quirk_finalize(VFIOPCIDevice *vdev, int nr);
|
2015-09-23 21:04:49 +02:00
|
|
|
void vfio_setup_resetfn_quirk(VFIOPCIDevice *vdev);
|
2017-08-30 00:05:39 +02:00
|
|
|
int vfio_add_virt_caps(VFIOPCIDevice *vdev, Error **errp);
|
2018-06-05 16:23:17 +02:00
|
|
|
void vfio_quirk_reset(VFIOPCIDevice *vdev);
|
2020-02-06 19:55:42 +01:00
|
|
|
VFIOQuirk *vfio_quirk_alloc(int nr_mem);
|
|
|
|
void vfio_probe_igd_bar4_quirk(VFIOPCIDevice *vdev, int nr);
|
2015-09-23 21:04:45 +02:00
|
|
|
|
2017-08-30 00:05:47 +02:00
|
|
|
extern const PropertyInfo qdev_prop_nv_gpudirect_clique;
|
|
|
|
|
2016-10-17 18:57:57 +02:00
|
|
|
int vfio_populate_vga(VFIOPCIDevice *vdev, Error **errp);
|
2016-03-10 17:39:08 +01:00
|
|
|
|
2016-05-26 17:43:22 +02:00
|
|
|
int vfio_pci_igd_opregion_init(VFIOPCIDevice *vdev,
|
2016-10-17 18:57:59 +02:00
|
|
|
struct vfio_region_info *info,
|
|
|
|
Error **errp);
|
spapr: Support NVIDIA V100 GPU with NVLink2
NVIDIA V100 GPUs have on-board RAM which is mapped into the host memory
space and accessible as normal RAM via an NVLink bus. The VFIO-PCI driver
implements special regions for such GPUs and emulates an NVLink bridge.
NVLink2-enabled POWER9 CPUs also provide address translation services
which includes an ATS shootdown (ATSD) register exported via the NVLink
bridge device.
This adds a quirk to VFIO to map the GPU memory and create an MR;
the new MR is stored in a PCI device as a QOM link. The sPAPR PCI uses
this to get the MR and map it to the system address space.
Another quirk does the same for ATSD.
This adds additional steps to sPAPR PHB setup:
1. Search for specific GPUs and NPUs, collect findings in
sPAPRPHBState::nvgpus, manage system address space mappings;
2. Add device-specific properties such as "ibm,npu", "ibm,gpu",
"memory-block", "link-speed" to advertise the NVLink2 function to
the guest;
3. Add "mmio-atsd" to vPHB to advertise the ATSD capability;
4. Add new memory blocks (with extra "linux,memory-usable" to prevent
the guest OS from accessing the new memory until it is onlined) and
npuphb# nodes representing an NPU unit for every vPHB as the GPU driver
uses it for link discovery.
This allocates space for GPU RAM and ATSD like we do for MMIOs by
adding 2 new parameters to the phb_placement() hook. Older machine types
set these to zero.
This puts new memory nodes in a separate NUMA node to as the GPU RAM
needs to be configured equally distant from any other node in the system.
Unlike the host setup which assigns numa ids from 255 downwards, this
adds new NUMA nodes after the user configures nodes or from 1 if none
were configured.
This adds requirement similar to EEH - one IOMMU group per vPHB.
The reason for this is that ATSD registers belong to a physical NPU
so they cannot invalidate translations on GPUs attached to another NPU.
It is guaranteed by the host platform as it does not mix NVLink bridges
or GPUs from different NPU in the same IOMMU group. If more than one
IOMMU group is detected on a vPHB, this disables ATSD support for that
vPHB and prints a warning.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for vfio portions]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Message-Id: <20190312082103.130561-1-aik@ozlabs.ru>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2019-03-12 09:21:03 +01:00
|
|
|
int vfio_pci_nvidia_v100_ram_init(VFIOPCIDevice *vdev, Error **errp);
|
|
|
|
int vfio_pci_nvlink2_init(VFIOPCIDevice *vdev, Error **errp);
|
2016-05-26 17:43:22 +02:00
|
|
|
|
2018-04-27 11:11:06 +02:00
|
|
|
void vfio_display_reset(VFIOPCIDevice *vdev);
|
2018-03-13 18:17:29 +01:00
|
|
|
int vfio_display_probe(VFIOPCIDevice *vdev, Error **errp);
|
|
|
|
void vfio_display_finalize(VFIOPCIDevice *vdev);
|
|
|
|
|
2015-09-23 21:04:44 +02:00
|
|
|
#endif /* HW_VFIO_VFIO_PCI_H */
|