qemu-e2k

Commit Graph

Author	SHA1	Message	Date
Alexey Kardashevskiy	ec132efaa8	spapr: Support NVIDIA V100 GPU with NVLink2 NVIDIA V100 GPUs have on-board RAM which is mapped into the host memory space and accessible as normal RAM via an NVLink bus. The VFIO-PCI driver implements special regions for such GPUs and emulates an NVLink bridge. NVLink2-enabled POWER9 CPUs also provide address translation services which include an ATS shootdown (ATSD) register exported via the NVLink bridge device. This adds a quirk to VFIO to map the GPU memory and create an MR; the new MR is stored in a PCI device as a QOM link. The sPAPR PCI uses this to get the MR and map it to the system address space. Another quirk does the same for ATSD. This adds additional steps to sPAPR PHB setup: 1. Search for specific GPUs and NPUs, collect findings in sPAPRPHBState::nvgpus, manage system address space mappings; 2. Add device-specific properties such as "ibm,npu", "ibm,gpu", "memory-block", "link-speed" to advertise the NVLink2 function to the guest; 3. Add "mmio-atsd" to vPHB to advertise the ATSD capability; 4. Add new memory blocks (with extra "linux,memory-usable" to prevent the guest OS from accessing the new memory until it is onlined) and npuphb# nodes representing an NPU unit for every vPHB as the GPU driver uses it for link discovery. This allocates space for GPU RAM and ATSD like we do for MMIOs by adding 2 new parameters to the phb_placement() hook. Older machine types set these to zero. This puts new memory nodes in a separate NUMA node as the GPU RAM needs to be configured equally distant from any other node in the system. Unlike the host setup which assigns numa ids from 255 downwards, this adds new NUMA nodes after the user configures nodes or from 1 if none were configured. This adds a requirement similar to EEH - one IOMMU group per vPHB. The reason for this is that ATSD registers belong to a physical NPU so they cannot invalidate translations on GPUs attached to another NPU. It is guaranteed by the host platform as it does not mix NVLink bridges or GPUs from different NPUs in the same IOMMU group. If more than one IOMMU group is detected on a vPHB, this disables ATSD support for that vPHB and prints a warning. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> [aw: for vfio portions] Acked-by: Alex Williamson <alex.williamson@redhat.com> Message-Id: <20190312082103.130561-1-aik@ozlabs.ru> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>	2019-04-26 10:41:23 +10:00
Markus Armbruster	8f8f588565	vfio: Report warnings with warn_report(), not error_printf() Cc: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Markus Armbruster <armbru@redhat.com> Message-Id: <20190417190641.26814-8-armbru@redhat.com> Acked-by: Alex Williamson <alex.williamson@redhat.com>	2019-04-18 22:18:59 +02:00
Gerd Hoffmann	c62a0c7ce3	vfio/display: add xres + yres properties This allows configuring the display resolution which the vgpu should use. The information will be passed to the guest using EDID, so the mdev driver must support the vfio edid region for this to work. Signed-off-by: Gerd Hoffmann <kraxel@redhat.com> Reviewed-by: Liam Merwick <liam.merwick@oracle.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2019-03-11 12:59:59 -06:00
Daniel P. Berrangé	772f1b3721	trace: forbid use of %m in trace event format strings The '%m' format instructs glibc's printf()/syslog() implementation to insert the contents of strerror(errno). Since this is a glibc extension it should generally be avoided in QEMU due to need for portability to a variety of platforms. Even though vfio is Linux-only code that could otherwise use "%m", it must still be avoided in trace-events files because several of the backends do not use the format string and so this error information is invisible to them. The errno string value should be given as an explicit trace argument instead, making it accessible to all backends. This also allows it to work correctly with future patches that use the format string with systemtap's simple printf code. Reviewed-by: Eric Blake <eblake@redhat.com> Signed-off-by: Daniel P. Berrangé <berrange@redhat.com> Message-id: 20190123120016.4538-4-berrange@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>	2019-01-24 14:16:56 +00:00
Alex Williamson	d26e543891	vfio/pci: Remove PCIe Link Status emulation Now that the downstream port will virtually negotiate itself to the link status of the downstream device, we can remove this emulation. It's not clear that it was ever terribly useful anyway. Tested-by: Geoffrey McRae <geoff@hostfission.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2018-12-19 16:48:16 -05:00
Alex Williamson	d96a0ac71c	pcie: Create enums for link speed and width In preparation for reporting higher virtual link speeds and widths, create enums and macros to help us manage them. Cc: Marcel Apfelbaum <marcel.apfelbaum@gmail.com> Tested-by: Geoffrey McRae <geoff@hostfission.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2018-12-19 16:48:16 -05:00
Markus Armbruster	c3b8e3e0ed	vfio: Clean up error reporting after previous commit The previous commit changed vfio's warning messages from vfio warning: DEV-NAME: Could not frobnicate to warning: vfio DEV-NAME: Could not frobnicate To match this change, change error messages from vfio error: DEV-NAME: On fire to vfio DEV-NAME: On fire Note the loss of "error". If we think marking error messages that way is a good idea, we should mark all error messages, i.e. make error_report() print it. Cc: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Markus Armbruster <armbru@redhat.com> Acked-by: Alex Williamson <alex.williamson@redhat.com> Message-Id: <20181017082702.5581-7-armbru@redhat.com>	2018-10-19 14:51:34 +02:00
Markus Armbruster	e1eb292ace	vfio: Use warn_report() & friends to report warnings The vfio code reports warnings like error_report(WARN_PREFIX "Could not frobnicate", DEV-NAME); where WARN_PREFIX is defined so the message comes out as vfio warning: DEV-NAME: Could not frobnicate This usage predates the introduction of warn_report() & friends in commit `97f40301f1`. It's time to convert to that interface. Since these functions already prefix the message with "warning: ", replace WARN_PREFIX by VFIO_MSG_PREFIX, so the messages come out like warning: vfio DEV-NAME: Could not frobnicate The next commit will replace ERR_PREFIX. Cc: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Markus Armbruster <armbru@redhat.com> Acked-by: Alex Williamson <alex.williamson@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Message-Id: <20181017082702.5581-6-armbru@redhat.com>	2018-10-19 14:51:34 +02:00
Markus Armbruster	4b5766488f	error: Fix use of error_prepend() with &error_fatal, &error_abort From include/qapi/error.h: * Pass an existing error to the caller with the message modified: * error_propagate(errp, err); * error_prepend(errp, "Could not frobnicate '%s': ", name); Fei Li pointed out that doing error_propagate() first doesn't work well when @errp is &error_fatal or &error_abort: the error_prepend() is never reached. Since I doubt fixing the documentation will stop people from getting it wrong, introduce error_propagate_prepend(), in the hope that it lures people away from using its constituents in the wrong order. Update the instructions in error.h accordingly. Convert existing error_prepend() next to error_propagate to error_propagate_prepend(). If any of these get reached with &error_fatal or &error_abort, the error messages improve. I didn't check whether that's the case anywhere. Cc: Fei Li <fli@suse.com> Signed-off-by: Markus Armbruster <armbru@redhat.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Message-Id: <20181017082702.5581-2-armbru@redhat.com>	2018-10-19 14:51:34 +02:00
Li Qiang	2683ccd5be	vfio-pci: make vfio-pci device more QOM conventional Define a TYPE_VFIO_PCI and drop DO_UPCAST. Signed-off-by: Li Qiang <liq3ea@gmail.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2018-10-15 11:22:29 -06:00
Gerd Hoffmann	b290659fc3	hw/vfio/display: add ramfb support So we have a boot display when using a vgpu as primary display. ramfb depends on a fw_cfg file. fw_cfg files can not be added and removed at runtime, therefore a ramfb-enabled vfio device can't be hotplugged. Add a nohotplug variant of the vfio-pci device (as child class). Add the ramfb property to the nohotplug variant only. So to enable the vgpu display with boot support use this: -device vfio-pci-nohotplug,display=on,ramfb=on,sysfsdev=... Signed-off-by: Gerd Hoffmann <kraxel@redhat.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2018-10-15 10:52:09 -06:00
Alex Williamson	a1c0f88649	vfio/pci: Handle subsystem realpath() returning NULL Fix error reported by Coverity where realpath can return NULL, resulting in a segfault in strcmp(). This should never happen given that we're working through regularly structured sysfs paths, but trivial enough to easily avoid. Fixes: `238e917285` ("vfio/ccw/pci: Allow devices to opt-in for ballooning") Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2018-08-23 10:45:57 -06:00
Alex Williamson	238e917285	vfio/ccw/pci: Allow devices to opt-in for ballooning If a vfio assigned device makes use of a physical IOMMU, then memory ballooning is necessarily inhibited due to the page pinning, lack of page level granularity at the IOMMU, and sufficient notifiers to both remove the page on balloon inflation and add it back on deflation. However, not all devices are backed by a physical IOMMU. In the case of mediated devices, if a vendor driver is well synchronized with the guest driver, such that only pages actively used by the guest driver are pinned by the host mdev vendor driver, then there should be no overlap between pages available for the balloon driver and pages actively in use by the device. Under these conditions, ballooning should be safe. vfio-ccw devices are always mediated devices and always operate under the constraints above. Therefore we can consider all vfio-ccw devices as balloon compatible. The situation is far from straightforward with vfio-pci. These devices can be physical devices with physical IOMMU backing or mediated devices where it is unknown whether a physical IOMMU is in use or whether the vendor driver is well synchronized to the working set of the guest driver. The safest approach is therefore to assume all vfio-pci devices are incompatible with ballooning, but allow user opt-in should they have further insight into mediated devices. Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2018-08-17 09:27:16 -06:00
Cédric Le Goater	26c0ae5638	vfio/pci: do not set the PCIDevice 'has_rom' attribute PCI devices needing a ROM allocate an optional MemoryRegion with pci_add_option_rom(). pci_del_option_rom() does the cleanup when the device is destroyed. The only action taken by this routine is to call vmstate_unregister_ram() which clears the id string of the optional ROM RAMBlock and now, also flags the RAMBlock as non-migratable. This was recently added by commit `b895de5027` ("migration: discard non-migratable RAMBlocks"). VFIO devices do their own loading of the PCI option ROM in vfio_pci_size_rom(). The memory region is switched to an I/O region and the PCI attribute 'has_rom' is set but the RAMBlock of the ROM region is not allocated. When the associated PCI device is deleted, pci_del_option_rom() calls vmstate_unregister_ram() which tries to flag a NULL RAMBlock, leading to a SEGV. It seems that 'has_rom' was set to have memory_region_destroy() called, but since commit `469b046ead` ("memory: remove memory_region_destroy") this is not necessary anymore as the MemoryRegion is freed automagically. Remove the PCIDevice 'has_rom' attribute setting in vfio. Fixes: `b895de5027` ("migration: discard non-migratable RAMBlocks") Signed-off-by: Cédric Le Goater <clg@kaod.org> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2018-07-11 13:43:57 -06:00
Philippe Mathieu-Daudé	e0255bb1ac	hw/vfio: Use the IEC binary prefix definitions It eases code review, unit is explicit. Patch generated using: $ git grep -E '(1024\|2048\|4096\|8192\|(<<\|>>).?(10\|20\|30))' hw/ include/hw/ and modified manually. Signed-off-by: Philippe Mathieu-Daudé <f4bug@amsat.org> Message-Id: <20180625124238.25339-38-f4bug@amsat.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2018-07-02 15:41:16 +02:00
Alex Williamson	8151a9c56d	vfio/pci: Default display option to "off" Commit `a9994687cb` ("vfio/display: core & wireup") added display support to vfio-pci with the default being "auto", which breaks existing VMs when the vGPU requires GL support but had no previous requirement for a GL compatible configuration. "Off" is the safer default as we impose no new requirements to VM configurations. Fixes: `a9994687cb` ("vfio/display: core & wireup") Cc: qemu-stable@nongnu.org Cc: Gerd Hoffmann <kraxel@redhat.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2018-06-05 08:28:09 -06:00
Alex Williamson	2b1dbd0d72	vfio/quirks: Enable ioeventfd quirks to be handled by vfio directly With vfio ioeventfd support, we can program vfio-pci to perform a specified BAR write when an eventfd is triggered. This allows the KVM ioeventfd to be wired directly to vfio-pci, entirely avoiding userspace handling for these events. On the same micro-benchmark where the ioeventfd got us to almost 90% of performance versus disabling the GeForce quirks, this gets us to within 95%. Reviewed-by: Peter Xu <peterx@redhat.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2018-06-05 08:28:09 -06:00
Alex Williamson	c958c51d2e	vfio/quirks: ioeventfd quirk acceleration The NVIDIA BAR0 quirks virtualize the PCI config space mirrors found in device MMIO space. Normally PCI config space is considered a slow path and further optimization is unnecessary, however NVIDIA uses a register here to enable the MSI interrupt to re-trigger. Exiting to QEMU for this MSI-ACK handling can therefore rate limit our interrupt handling. Fortunately the MSI-ACK write is easily detected since the quirk MemoryRegion otherwise has very few accesses, so simply looking for consecutive writes with the same data is sufficient, in this case 10 consecutive writes with the same data and size is arbitrarily chosen. We configure the KVM ioeventfd with data match, so there's no risk of triggering for the wrong data or size, but we do risk that pathological driver behavior might consume all of QEMU's file descriptors, so we cap ourselves to 10 ioeventfds for this purpose. In support of the above, generic ioeventfd infrastructure is added for vfio quirks. This automatically initializes an ioeventfd list per quirk, disables and frees ioeventfds on exit, and allows ioeventfds marked as dynamic to be dropped on device reset. The rationale for this latter feature is that useful ioeventfds may depend on specific driver behavior and since we necessarily place a cap on our use of ioeventfds, a machine reset is a reasonable point at which to assume a new driver and re-profile. Reviewed-by: Peter Xu <peterx@redhat.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2018-06-05 08:23:17 -06:00
Alex Williamson	469d02de99	vfio/quirks: Add quirk reset callback Quirks can be self modifying, provide a hook to allow them to cleanup on device reset if desired. Reviewed-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2018-06-05 08:23:17 -06:00
Tina Zhang	8983e3e350	ui: introduce vfio_display_reset During guest OS reboot, guest framebuffer is invalid. It will cause bugs, if the invalid guest framebuffer is still used by host. This patch is to introduce vfio_display_reset which is invoked during vfio display reset. This vfio_display_reset function is used to release the invalid display resource, disable scanout mode and replace the invalid surface with QemuConsole's DisplaySurface. This patch can fix the GPU hang issue caused by gd_egl_draw during guest OS reboot. Changes v3->v4: - Move dma-buf based display check into the vfio_display_reset(). (Gerd) Changes v2->v3: - Limit vfio_display_reset to dma-buf based vfio display. (Gerd) Changes v1->v2: - Use dpy_gfx_update_full() update screen after reset. (Gerd) - Remove dpy_gfx_switch_surface(). (Gerd) Signed-off-by: Tina Zhang <tina.zhang@intel.com> Message-id: 1524820266-27079-3-git-send-email-tina.zhang@intel.com Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>	2018-04-27 11:36:34 +02:00
Alexey Kardashevskiy	fcad0d2121	ppc/spapr, vfio: Turn off MSIX emulation for VFIO devices This adds a possibility for the platform to tell VFIO not to emulate MSIX so MMIO memory regions do not get split into chunks in flatview and the entire page can be registered as a KVM memory slot and make direct MMIO access possible for the guest. This enables the entire MSIX BAR mapping to the guest for the pseries platform in order to achieve the maximum MMIO performance for certain devices. Tested on: LSI Logic / Symbios Logic SAS3008 PCI-Express Fusion-MPT SAS-3 (rev 02) Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2018-03-13 11:17:31 -06:00
Alexey Kardashevskiy	ae0215b2bb	vfio-pci: Allow mmap of MSIX BAR At the moment we unconditionally avoid mapping MSIX data of a BAR and emulate MSIX table in QEMU. However it is 1) not always necessary as a platform may provide a paravirt interface for MSIX configuration; 2) can affect the speed of MMIO access by emulating them in QEMU when frequently accessed registers share same system page with MSIX data, this is particularly a problem for systems with the page size bigger than 4KB. A new capability - VFIO_REGION_INFO_CAP_MSIX_MAPPABLE - has been added to the kernel [1] which tells the userspace that mapping of the MSIX data is possible now. This makes use of it so from now on QEMU tries mapping the entire BAR as a whole and emulate MSIX on top of that. [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a32295c612c57990d17fb0f41e7134394b2f35f6 Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2018-03-13 11:17:31 -06:00
Gerd Hoffmann	a9994687cb	vfio/display: core & wireup Infrastructure for display support. Must be enabled using 'display' property. Signed-off-by: Gerd Hoffmann <kraxel@redhat.com> Reviewed By: Kirti Wankhede <kwankhede@nvidia.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2018-03-13 11:17:29 -06:00
Julia Suvorova	3e015d815b	use g_path_get_basename instead of basename basename(3) and dirname(3) modify their argument and may return pointers to statically allocated memory which may be overwritten by subsequent calls. g_path_get_basename and g_path_get_dirname have no such issues, and are therefore preferable. Signed-off-by: Julia Suvorova <jusual@mail.ru> Message-Id: <1519888086-4207-1-git-send-email-jusual@mail.ru> Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2018-03-06 14:01:29 +01:00
Peter Maydell	b734ed9de1	Merge remote-tracking branch 'remotes/mst/tags/for_upstream' into staging virtio,vhost,pci,pc: features, fixes and cleanups - new stats in virtio balloon - virtio eventfd rework for boot speedup - vhost memory rework for boot speedup - fixes and cleanups all over the place Signed-off-by: Michael S. Tsirkin <mst@redhat.com> # gpg: Signature made Tue 13 Feb 2018 16:29:55 GMT # gpg: using RSA key 281F0DB8D28D5469 # gpg: Good signature from "Michael S. Tsirkin <mst@kernel.org>" # gpg: aka "Michael S. Tsirkin <mst@redhat.com>" # Primary key fingerprint: 0270 606B 6F3C DF3D 0B17 0970 C350 3912 AFBE 8E67 # Subkey fingerprint: 5D09 FD08 71C8 F85B 94CA 8A0D 281F 0DB8 D28D 5469 * remotes/mst/tags/for_upstream: (22 commits) virtio-balloon: include statistics of disk/file caches acpi-test: update FADT lpc: drop pcie host dependency tests: acpi: fix FADT not being compared to reference table hw/pci-bridge: fix pcie root port's IO hints capability libvhost-user: Support across-memory-boundary access libvhost-user: Fix resource leak virtio-balloon: unref the memory region before continuing pci: removed the is_express field since a uniform interface was inserted virtio-blk: enable multiple vectors when using multiple I/O queues pci/bus: let it has higher migration priority pci-bridge/i82801b11: clear bridge registers on platform reset vhost: Move log_dirty check vhost: Merge and delete unused callbacks vhost: Clean out old vhost_set_memory and friends vhost: Regenerate region list from changed sections list vhost: Merge sections added to temporary list vhost: Simplify ring verification checks vhost: Build temporary section list and deref after commit virtio: improve virtio devices initialization time ... Signed-off-by: Peter Maydell <peter.maydell@linaro.org>	2018-02-13 16:33:31 +00:00
Markus Armbruster	922a01a013	Move include qemu/option.h from qemu-common.h to actual users qemu-common.h includes qemu/option.h, but most places that include the former don't actually need the latter. Drop the include, and add it to the places that actually need it. While there, drop superfluous includes of both headers, and separate #include from file comment with a blank line. This cleanup makes the number of objects depending on qemu/option.h drop from 4545 (out of 4743) to 284 in my "build everything" tree. Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org> Signed-off-by: Markus Armbruster <armbru@redhat.com> Message-Id: <20180201111846.21846-20-armbru@redhat.com> [Semantic conflict with commit `bdd6a90a9e` in block/nvme.c resolved]	2018-02-09 13:52:16 +01:00
Yoni Bettan	d61a363d3e	pci: removed the is_express field since a uniform interface was inserted according to Eduardo Habkost's commit `fd3b02c889` all PCIEs now implement INTERFACE_PCIE_DEVICE so we don't need is_express field anymore. Devices that implement only INTERFACE_PCIE_DEVICE (is_express == 1) or devices that implement only INTERFACE_CONVENTIONAL_PCI_DEVICE (is_express == 0) were not affected by the change. The only devices that were affected are those that are hybrid and also had (is_express == 1) - therefore only: - hw/vfio/pci.c - hw/usb/hcd-xhci.c - hw/xen/xen_pt.c For those 3 I made sure that QEMU_PCI_CAP_EXPRESS is on in instance_init() Reviewed-by: Marcel Apfelbaum <marcel@redhat.com> Reviewed-by: Eduardo Habkost <ehabkost@redhat.com> Signed-off-by: Yoni Bettan <ybettan@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2018-02-08 21:06:41 +02:00
Alex Williamson	db32d0f438	vfio/pci: Add option to disable GeForce quirks These quirks are necessary for GeForce, but not for Quadro/GRID/Tesla assignment. Leaving them enabled is fully functional and provides the most compatibility, but due to the unique NVIDIA MSI ACK behavior[1], it also introduces latency in re-triggering the MSI interrupt. This overhead is typically negligible, but has been shown to adversely affect some (very) high interrupt rate applications. This adds the vfio-pci device option "x-no-geforce-quirks=" which can be set to "on" to disable this additional overhead. A follow-on optimization for GeForce might be to make use of an ioeventfd to allow KVM to trigger an irqfd in the kernel vfio-pci driver, avoiding the bounce through userspace to handle this device write. [1] Background: the NVIDIA driver has been observed to issue a write to the MMIO mirror of PCI config space in BAR0 in order to allow the MSI interrupt for the device to retrigger. Older reports indicated a write of 0xff to the (read-only) MSI capability ID register, while more recently a write of 0x0 is observed at config space offset 0x704, non-architected, extended config space of the device (BAR0 offset 0x88704). Virtualization of this range is only required for GeForce. Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2018-02-06 11:08:27 -07:00
Alex Williamson	89d5202edc	vfio/pci: Allow relocating MSI-X MMIO Recently proposed vfio-pci kernel changes (v4.16) remove the restriction preventing userspace from mmap'ing PCI BARs in areas overlapping the MSI-X vector table. This change is primarily intended to benefit host platforms which make use of system page sizes larger than the PCI spec recommendation for alignment of MSI-X data structures (i.e. not x86_64). In the case of POWER systems, the SPAPR spec requires the VM to program MSI-X using hypercalls, rendering the MSI-X vector table unused in the VM view of the device. However, ARM64 platforms also support 64KB pages and rely on QEMU emulation of MSI-X. Regardless of the kernel driver allowing mmaps overlapping the MSI-X vector table, emulation of the MSI-X vector table also prevents direct mapping of device MMIO spaces overlapping this page. Thanks to the fact that PCI devices have a standard self discovery mechanism, we can try to resolve this by relocating the MSI-X data structures, either by creating a new PCI BAR or extending an existing BAR and updating the MSI-X capability for the new location. There's even a very slim chance that this could benefit devices which do not adhere to the PCI spec alignment guidelines on x86_64 systems. This new x-msix-relocation option accepts the following choices: off: Disable MSI-X relocation, use native device config (default) auto: Use a known good combination for the platform/device (none yet) bar0..bar5: Specify the target BAR for MSI-X data structures If compatible, the target BAR will either be created or extended and the new portion will be used for MSI-X emulation. The first obvious user question with this option is how to determine whether a given platform and device might benefit from this option. In most cases, the answer is that it won't, especially on x86_64. Devices often dedicate an entire BAR to MSI-X and therefore no performance sensitive registers overlap the MSI-X area. Take for example: # lspci -vvvs 0a:00.0 0a:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network Connection ... Region 0: Memory at db680000 (32-bit, non-prefetchable) [size=512K] Region 3: Memory at db7f8000 (32-bit, non-prefetchable) [size=16K] ... Capabilities: [70] MSI-X: Enable+ Count=10 Masked- Vector table: BAR=3 offset=00000000 PBA: BAR=3 offset=00002000 This device uses the 16K bar3 for MSI-X with the vector table at offset zero and the pending bits array at offset 8K, fully honoring the PCI spec alignment guidance. The data sheet specifically refers to this as an MSI-X BAR. This device would not see a benefit from MSI-X relocation regardless of the platform, regardless of the page size. However, here's another example: # lspci -vvvs 02:00.0 02:00.0 Serial Attached SCSI controller: xxxxxxxx ... Region 0: I/O ports at c000 [size=256] Region 1: Memory at ef640000 (64-bit, non-prefetchable) [size=64K] Region 3: Memory at ef600000 (64-bit, non-prefetchable) [size=256K] ... Capabilities: [c0] MSI-X: Enable+ Count=16 Masked- Vector table: BAR=1 offset=0000e000 PBA: BAR=1 offset=0000f000 Here the MSI-X data structures are placed on separate 4K pages at the end of a 64KB BAR. If our host page size is 4K, we're likely fine, but at 64KB page size, MSI-X emulation at that location prevents the entire BAR from being directly mapped into the VM address space. Overlapping performance sensitive registers then starts to be a very likely scenario on such a platform. At this point, the user could enable tracing on vfio_region_read and vfio_region_write to determine more conclusively if device accesses are being trapped through QEMU. Upon finding a device and platform in need of MSI-X relocation, the next problem is how to choose the target PCI BAR to host the MSI-X data structures. A few key rules to keep in mind for this selection include: * There are only 6 BAR slots, bar0..bar5 * 64-bit BARs occupy two BAR slots, 'lspci -vvv' lists the first slot * PCI BARs are always a power of 2 in size, extending == doubling * The maximum size of a 32-bit BAR is 2GB * MSI-X data structures must reside in an MMIO BAR Using these rules, we can evaluate each BAR of the second example device above as follows: bar0: I/O port BAR, incompatible with MSI-X tables bar1: BAR could be extended, incurring another 64KB of MMIO bar2: Unavailable, bar1 is 64-bit, this register is used by bar1 bar3: BAR could be extended, incurring another 256KB of MMIO bar4: Unavailable, bar3 is 64-bit, this register is used by bar3 bar5: Available, empty BAR, minimum additional MMIO A secondary optimization we might wish to make in relocating MSI-X is to minimize the additional MMIO required for the device, therefore we might test the available choices in order of preference as bar5, bar1, and finally bar3. The original proposal for this feature included an 'auto' option which would choose bar5 in this case, but various drivers have been found that make assumptions about the properties of the "first" BAR or the size of BARs such that there appears to be no foolproof automatic selection available, requiring known good combinations to be sourced from users. This patch is pre-enabled for an 'auto' selection making use of a validated lookup table, but no entries are yet identified. Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: Eric Auger <eric.auger@redhat.com> Tested-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2018-02-06 11:08:26 -07:00
Alex Williamson	04f336b05f	vfio/pci: Emulate BARs The kernel provides similar emulation of PCI BAR register access to QEMU, so up until now we've used that for things like BAR sizing and storing the BAR address. However, if we intend to resize BARs or add BARs that don't exist on the physical device, we need to switch to the pure QEMU emulation of the BAR. Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: Eric Auger <eric.auger@redhat.com> Tested-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2018-02-06 11:08:25 -07:00
Alex Williamson	3a286732d1	vfio/pci: Add base BAR MemoryRegion Add one more layer to our stack of MemoryRegions, this base region allows us to register BARs independently of the vfio region or to extend the size of BARs which do map to a region. This will be useful when we want hypervisor defined BARs or sections of BARs, for purposes such as relocating MSI-X emulation. We therefore call msix_init() based on this new base MemoryRegion, while the quirks, which only modify regions still operate on those sub-MemoryRegions. Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2018-02-06 11:08:25 -07:00
David Gibson	fd56e0612b	pci: Eliminate redundant PCIDevice::bus pointer The bus pointer in PCIDevice is basically redundant with QOM information. It's always initialized to the qdev_get_parent_bus(), the only difference is the type. Therefore this patch eliminates the field, instead creating a pci_get_bus() helper to do the type mangling to derive it conveniently from the QOM Device object underneath. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Eduardo Habkost <ehabkost@redhat.com> Reviewed-by: Marcel Apfelbaum <marcel@redhat.com> Reviewed-by: Peter Xu <peterx@redhat.com>	2017-12-05 19:13:45 +02:00
Eduardo Habkost	a5fa336f11	pci: Add interface names to hybrid PCI devices The following devices support both PCI Express and Conventional PCI, by including special code to handle the QEMU_PCI_CAP_EXPRESS flag and/or conditional pcie_endpoint_cap_init() calls: * vfio-pci (is_express=1, but legacy PCI handled by vfio_populate_device()) * vmxnet3 (is_express=0, but PCIe handled by vmxnet3_realize()) * pvscsi (is_express=0, but PCIe handled by pvscsi_realize()) * virtio-pci (is_express=0, but PCIe handled by virtio_pci_dc_realize(), and additional legacy PCI code at virtio_pci_realize()) * base-xhci (is_express=1, but pcie_endpoint_cap_init() call is conditional on pci_bus_is_express(dev->bus)) * Note that xhci does not clear QEMU_PCI_CAP_EXPRESS like the other hybrid devices Cc: Dmitry Fleytman <dmitry@daynix.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Gerd Hoffmann <kraxel@redhat.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Signed-off-by: Eduardo Habkost <ehabkost@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Marcel Apfelbaum <marcel@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2017-10-15 05:54:42 +03:00
Alex Williamson	dfbee78db8	vfio/pci: Add NVIDIA GPUDirect Cliques support NVIDIA has defined a specification for creating GPUDirect "cliques", where devices with the same clique ID support direct peer-to-peer DMA. When running on bare-metal, tools like NVIDIA's p2pBandwidthLatencyTest (part of cuda-samples) determine which GPUs can support peer-to-peer based on chipset and topology. When running in a VM, these tools have no visibility to the physical hardware support or topology. This option allows the user to specify hints via a vendor defined capability. For instance: <qemu:commandline> <qemu:arg value='-set'/> <qemu:arg value='device.hostdev0.x-nv-gpudirect-clique=0'/> <qemu:arg value='-set'/> <qemu:arg value='device.hostdev1.x-nv-gpudirect-clique=1'/> <qemu:arg value='-set'/> <qemu:arg value='device.hostdev2.x-nv-gpudirect-clique=1'/> </qemu:commandline> This enables two cliques. The first is a singleton clique with ID 0, for the first hostdev defined in the XML (note that since cliques define peer-to-peer sets, singleton clique offer no benefit). The subsequent two hostdevs are both added to clique ID 1, indicating peer-to-peer is possible between these devices. QEMU only provides validation that the clique ID is valid and applied to an NVIDIA graphics device, any validation that the resulting cliques are functional and valid is the user's responsibility. The NVIDIA specification allows a 4-bit clique ID, thus valid values are 0-15. Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2017-10-03 12:57:36 -06:00
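The clique assignment in the GPUDirect message above can be sketched as follows. This is an illustrative Python helper, not part of QEMU or libvirt: `clique_set_args()` and `peer_groups()` are hypothetical names. Only the `x-nv-gpudirect-clique` property, the `device.<id>.<property>=<value>` form of the `-set` argument, and the 0-15 range come from the message.

```python
from collections import defaultdict
from typing import Dict, List

MAX_CLIQUE_ID = 15  # the clique ID is 4 bits wide, so valid values are 0-15


def clique_set_args(assignments: Dict[str, int]) -> List[str]:
    """Build '-set' arguments assigning each device ID to a clique."""
    args: List[str] = []
    for dev, clique in assignments.items():
        if not 0 <= clique <= MAX_CLIQUE_ID:
            raise ValueError(f"{dev}: clique ID {clique} is outside 0-15")
        args += ["-set", f"device.{dev}.x-nv-gpudirect-clique={clique}"]
    return args


def peer_groups(assignments: Dict[str, int]) -> Dict[int, List[str]]:
    """Return only the cliques with two or more members.

    A singleton clique offers no benefit, since a clique defines a
    peer-to-peer set.
    """
    groups = defaultdict(list)
    for dev, clique in assignments.items():
        groups[clique].append(dev)
    return {c: devs for c, devs in groups.items() if len(devs) > 1}


if __name__ == "__main__":
    example = {"hostdev0": 0, "hostdev1": 1, "hostdev2": 1}
    print(" ".join(clique_set_args(example)))
    print(peer_groups(example))
```

As the message states, QEMU checks only that the ID is valid and applied to an NVIDIA graphics device. Whether the grouped devices can really do peer-to-peer DMA remains the user's responsibility, and this sketch does not check it either.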
Alex Williamson	e3f79f3bd4	vfio/pci: Add virtual capabilities quirk infrastructure If the hypervisor needs to add purely virtual capabilities, give us a hook through quirks to do that. Note that we determine the maximum size for a capability based on the physical device; if we insert a virtual capability, that can change. Therefore, if the maximum size is smaller after adding virtual capabilities, use that. Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2017-10-03 12:57:36 -06:00
Alex Williamson	5b31c8229d	vfio/pci: Do not unwind on error If vfio_add_std_cap() errors then going to out prepends irrelevant errors for capabilities we haven't attempted to add as we unwind our recursive stack. Just return error. Fixes: `7ef165b9a8` ("vfio/pci: Pass an error object to vfio_add_capabilities") Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2017-10-03 12:57:35 -06:00
Philippe Mathieu-Daudé	96d2c2c574	vfio/pci: fix use of freed memory hw/vfio/pci.c:308:29: warning: Use of memory after it is freed qemu_set_fd_handler(*pfd, NULL, NULL, vdev); ^~~~ Reported-by: Clang Static Analyzer Signed-off-by: Philippe Mathieu-Daudé <f4bug@amsat.org> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2017-07-26 11:38:18 -06:00
Alex Williamson	47985727e3	vfio/pci: Fixup v0 PCIe capabilities Intel 82599 VFs report a PCIe capability version of 0, which is invalid. The earliest version of the PCIe spec used version 1. This causes Windows to fail startup on the device and it will be disabled with error code 10. Our choices are either to drop the PCIe cap on such devices, which has the side effect of likely preventing the guest from discovering any extended capabilities, or performing a fixup to update the capability to the earliest valid version. This implements the latter. Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2017-07-10 10:39:43 -06:00
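The version fixup described in the message above amounts to a small bit operation on the 16-bit PCI Express Capabilities register, whose low four bits hold the capability version. The following Python sketch illustrates the idea only; `fixup_pcie_cap_version()` is a hypothetical name, and the actual QEMU change operates on the emulated config space rather than on a bare integer.

```python
PCI_EXP_FLAGS_VERS = 0x000F  # capability version: bits 3:0 of the register
EARLIEST_VALID_VERSION = 1   # the earliest PCIe spec used version 1


def fixup_pcie_cap_version(flags: int) -> int:
    """Return the register value with an invalid version 0 bumped to 1.

    All other bits (device/port type, slot implemented, and so on) are
    left untouched, as are registers that already report a valid version.
    """
    if flags & PCI_EXP_FLAGS_VERS == 0:
        return (flags & ~PCI_EXP_FLAGS_VERS & 0xFFFF) | EARLIEST_VALID_VERSION
    return flags


if __name__ == "__main__":
    print(hex(fixup_pcie_cap_version(0x0000)))
    print(hex(fixup_pcie_cap_version(0x0002)))
```

Patching the version instead of dropping the capability keeps the PCIe capability visible, which the message gives as the reason the guest can still discover extended capabilities.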
Alex Williamson	7da624e26a	vfio: Test realized when using VFIOGroup.device_list iterator VFIOGroup.device_list is effectively our reference tracking mechanism such that we can teardown a group when all of the device references are removed. However, we also use this list from our machine reset handler for processing resets that affect multiple devices. Generally device removals are fully processed (exitfn + finalize) when this reset handler is invoked, however if the removal is triggered via another reset handler (piix4_reset->acpi_pcihp_reset) then the device exitfn may run, but not finalize. In this case we hit asserts when we start trying to access PCI helpers since much of the PCI state of the device is released. To resolve this, add a pointer to the Object DeviceState in our common base-device and skip non-realized devices as we iterate. Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2017-07-10 10:39:43 -06:00
Mao Zhongyi	2784127857	pci: Replace pci_add_capability2() with pci_add_capability() After the patch 'Make errp the last parameter of pci_add_capability()', pci_add_capability() and pci_add_capability2() now do exactly the same. So drop the wrapper pci_add_capability() of pci_add_capability2(), then replace the pci_add_capability2() with pci_add_capability() everywhere. Cc: pbonzini@redhat.com Cc: rth@twiddle.net Cc: ehabkost@redhat.com Cc: mst@redhat.com Cc: dmitry@daynix.com Cc: jasowang@redhat.com Cc: marcel@redhat.com Cc: alex.williamson@redhat.com Cc: armbru@redhat.com Suggested-by: Eduardo Habkost <ehabkost@redhat.com> Signed-off-by: Mao Zhongyi <maozy.fnst@cn.fujitsu.com> Reviewed-by: Marcel Apfelbaum <marcel@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2017-07-03 22:29:49 +03:00
Mao Zhongyi	9a7c2a5970	pci: Make errp the last parameter of pci_add_capability() Add Error argument for pci_add_capability() to leverage the errp to pass info on errors. This way is helpful for its callers to make a better error handling when moving to 'realize'. Cc: pbonzini@redhat.com Cc: rth@twiddle.net Cc: ehabkost@redhat.com Cc: mst@redhat.com Cc: jasowang@redhat.com Cc: marcel@redhat.com Cc: alex.williamson@redhat.com Cc: armbru@redhat.com Signed-off-by: Mao Zhongyi <maozy.fnst@cn.fujitsu.com> Reviewed-by: Marcel Apfelbaum <marcel@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2017-07-03 22:29:49 +03:00
Mao Zhongyi	9a815774bb	pci: Fix the wrong assertion. pci_add_capability returns a strictly positive value on success, correct asserts. Cc: dmitry@daynix.com Cc: jasowang@redhat.com Cc: kraxel@redhat.com Cc: alex.williamson@redhat.com Cc: armbru@redhat.com Cc: marcel@redhat.com Signed-off-by: Mao Zhongyi <maozy.fnst@cn.fujitsu.com> Reviewed-by: Marcel Apfelbaum <marcel@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2017-07-03 22:29:49 +03:00
Dong Jia Shi	6e4e6f0d40	vfio/pci: Fix incorrect error message When the "No host device provided" error occurs, the hint message that starts with "Use -vfio-pci," makes no sense, since "-vfio-pci" is not a valid command line parameter. Correct this by replacing "-vfio-pci" with "-device vfio-pci". Signed-off-by: Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2017-05-03 14:52:35 -06:00
Alex Williamson	d0d1cd70d1	vfio/pci: Improve extended capability comments, skip masked caps Since commit `4bb571d857` ("pci/pcie: don't assume cap id 0 is reserved") removes the internal use of extended capability ID 0, the comment here becomes invalid. However, peeling back the onion, the code is still correct and we still can't seed the capability chain with ID 0, unless we want to muck with using the version number to force the header to be non-zero, which is much uglier to deal with. The comment also now covers some of the subtleties of using cap ID 0, such as transparently indicating absence of capabilities if none are added. This doesn't detract from the correctness of the referenced commit as vfio in the kernel also uses capability ID zero to mask capabilities. In fact, we should skip zero capabilities precisely because the kernel might also expose such a capability at the head position and re-introduce the problem. Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Reviewed-by: Peter Xu <peterx@redhat.com> Tested-by: Peter Xu <peterx@redhat.com> Reported-by: Jintack Lim <jintack@cs.columbia.edu> Tested-by: Jintack Lim <jintack@cs.columbia.edu>	2017-02-22 13:19:58 -07:00
Alex Williamson	35c7cb4caf	vfio/pci: Report errors from qdev_unplug() via device request Currently we ignore this error, report it with error_reportf_err() Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>	2017-02-22 13:19:58 -07:00
Cao jin	ee640c625e	pci: Convert msix_init() to Error and fix callers msix_init() reports errors with error_report(), which is wrong when it's used in realize(). The same issue was fixed for msi_init() in commit `1108b2f`. In order to make the API change as small as possible, leave the return value check to later patch. For some devices(like e1000e, vmxnet3, nvme) who won't fail because of msix_init's failure, suppress the error report by passing NULL error object. Bonus: add comment for msix_init. CC: Jiri Pirko <jiri@resnulli.us> CC: Gerd Hoffmann <kraxel@redhat.com> CC: Dmitry Fleytman <dmitry@daynix.com> CC: Jason Wang <jasowang@redhat.com> CC: Michael S. Tsirkin <mst@redhat.com> CC: Hannes Reinecke <hare@suse.de> CC: Paolo Bonzini <pbonzini@redhat.com> CC: Alex Williamson <alex.williamson@redhat.com> CC: Markus Armbruster <armbru@redhat.com> CC: Marcel Apfelbaum <marcel@redhat.com> Signed-off-by: Cao jin <caoj.fnst@cn.fujitsu.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2017-02-01 03:37:18 +02:00
Cao jin	8907379204	vfio: remove a duplicated word in comments Signed-off-by: Cao jin <caoj.fnst@cn.fujitsu.com> Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>	2017-01-24 23:26:53 +03:00
Yongji Xie	95251725e3	vfio: Add support for mmapping sub-page MMIO BARs Now the kernel commit 05f0c03fbac1 ("vfio-pci: Allow to mmap sub-page MMIO BARs if the mmio page is exclusive") allows VFIO to mmap sub-page BARs. This is the corresponding QEMU patch. With those patches applied, we could passthrough sub-page BARs to guest, which can help to improve IO performance for some devices. In this patch, we expand MemoryRegions of these sub-page MMIO BARs to PAGE_SIZE in vfio_pci_write_config(), so that the BARs could be passed to KVM ioctl KVM_SET_USER_MEMORY_REGION with a valid size. The expanding size will be recovered when the base address of sub-page BAR is changed and not page aligned any more in guest. And we also set the priority of these BARs' memory regions to zero in case of overlap with BARs which share the same page with sub-page BARs in guest. Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2016-10-31 09:53:04 -06:00
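The sizing rule in the sub-page BAR message above can be sketched as a single decision. This Python sketch is a simplification under stated assumptions: `effective_region_size()` is a hypothetical name, and the 4096-byte default page size is an assumption (the host page size may differ, for example 64 KiB). The real logic lives in `vfio_pci_write_config()` and operates on MemoryRegions.

```python
def effective_region_size(bar_size: int, base_addr: int,
                          page_size: int = 4096) -> int:
    """Size to use for a BAR's memory region given its guest base address.

    A sub-page BAR is expanded to a full page only while its base address
    is page aligned; once the guest moves it to an unaligned address, the
    original size is restored. BARs of a page or more are never changed.
    """
    is_sub_page = bar_size < page_size
    is_aligned = base_addr % page_size == 0
    if is_sub_page and is_aligned:
        return page_size
    return bar_size


if __name__ == "__main__":
    print(effective_region_size(256, 0xFE000000))
    print(effective_region_size(256, 0xFE000100))
```

Expanding to a whole page is what lets the region be handed to KVM with a valid size, as the message explains. The sketch does not model the separate priority change the message describes for BARs that share the same page.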
Ido Yariv	a52a4c4717	vfio/pci: fix out-of-sync BAR information on reset When a PCI device is reset, pci_do_device_reset resets all BAR addresses in the relevant PCIDevice's config buffer. The VFIO configuration space stays untouched, so the guest OS may choose to skip restoring the BAR addresses as they would seem intact. The PCI device may be left non-operational. One example of such a scenario is when the guest exits S3. Fix this by resetting the BAR addresses in the VFIO configuration space as well. Signed-off-by: Ido Yariv <ido@wizery.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2016-10-31 09:53:04 -06:00
Cao jin	893bfc3cc8	vfio: fix duplicate function call When a vfio device is reset (on FLR or bus reset) and a bus reset is needed (vfio_pci_hot_reset_one is called), vfio_pci_pre_reset and vfio_pci_post_reset are called twice. Signed-off-by: Cao jin <caoj.fnst@cn.fujitsu.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2016-10-17 10:58:03 -06:00

135 Commits