qemu-e2k

Author	SHA1	Message	Date
Alex Williamson	567d7d3e6b	vfio/common: Work around kernel overflow bug in DMA unmap A kernel bug was introduced in v4.15 via commit 71a7d3d78e3c which adds a test for address space wrap-around in the vfio DMA unmap path. Unfortunately due to overflow, the kernel detects an unmap of the last page in the 64-bit address space as a wrap-around. In QEMU, a Q35 guest with VT-d emulation and guest IOMMU enabled will attempt to make such an unmap request during VM system reset, triggering an error: qemu-kvm: VFIO_UNMAP_DMA: -22 qemu-kvm: vfio_dma_unmap(0x561f059948f0, 0xfef00000, 0xffffffff01100000) = -22 (Invalid argument) Here the IOVA start address (0xfef00000) and the size parameter (0xffffffff01100000) add to exactly 2^64, triggering the bug. A kernel fix is queued for the Linux v5.0 release to address this. This patch implements a workaround to retry the unmap, excluding the final page of the range when we detect an unmap failing which matches the requirements for this issue. This is expected to be a safe and complete workaround as the VT-d address space does not extend to the full 64-bit space and therefore the last page should never be mapped. This workaround can be removed once all kernels with this bug are sufficiently deprecated. Link: https://bugzilla.redhat.com/show_bug.cgi?id=1662291 Reported-by: Pei Zhang <pezhang@redhat.com> Debugged-by: Peter Xu <peterx@redhat.com> Reviewed-by: Peter Xu <peterx@redhat.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2019-02-21 21:07:03 -07:00
Daniel P. Berrangé	772f1b3721	trace: forbid use of %m in trace event format strings The '%m' format instructs glibc's printf()/syslog() implementation to insert the contents of strerror(errno). Since this is a glibc extension it should generally be avoided in QEMU due to need for portability to a variety of platforms. Even though vfio is Linux-only code that could otherwise use "%m", it must still be avoided in trace-events files because several of the backends do not use the format string and so this error information is invisible to them. The errno string value should be given as an explicit trace argument instead, making it accessible to all backends. This also allows it to work correctly with future patches that use the format string with systemtap's simple printf code. Reviewed-by: Eric Blake <eblake@redhat.com> Signed-off-by: Daniel P. Berrangé <berrange@redhat.com> Message-id: 20190123120016.4538-4-berrange@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>	2019-01-24 14:16:56 +00:00
Alex Williamson	238e917285	vfio/ccw/pci: Allow devices to opt-in for ballooning If a vfio assigned device makes use of a physical IOMMU, then memory ballooning is necessarily inhibited due to the page pinning, lack of page level granularity at the IOMMU, and sufficient notifiers to both remove the page on balloon inflation and add it back on deflation. However, not all devices are backed by a physical IOMMU. In the case of mediated devices, if a vendor driver is well synchronized with the guest driver, such that only pages actively used by the guest driver are pinned by the host mdev vendor driver, then there should be no overlap between pages available for the balloon driver and pages actively in use by the device. Under these conditions, ballooning should be safe. vfio-ccw devices are always mediated devices and always operate under the constraints above. Therefore we can consider all vfio-ccw devices as balloon compatible. The situation is far from straightforward with vfio-pci. These devices can be physical devices with physical IOMMU backing or mediated devices where it is unknown whether a physical IOMMU is in use or whether the vendor driver is well synchronized to the working set of the guest driver. The safest approach is therefore to assume all vfio-pci devices are incompatible with ballooning, but allow user opt-in should they have further insight into mediated devices. Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2018-08-17 09:27:16 -06:00
Alex Williamson	2b1dbd0d72	vfio/quirks: Enable ioeventfd quirks to be handled by vfio directly With vfio ioeventfd support, we can program vfio-pci to perform a specified BAR write when an eventfd is triggered. This allows the KVM ioeventfd to be wired directly to vfio-pci, entirely avoiding userspace handling for these events. On the same micro-benchmark where the ioeventfd got us to almost 90% of performance versus disabling the GeForce quirks, this gets us to within 95%. Reviewed-by: Peter Xu <peterx@redhat.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2018-06-05 08:28:09 -06:00
Alex Williamson	c958c51d2e	vfio/quirks: ioeventfd quirk acceleration The NVIDIA BAR0 quirks virtualize the PCI config space mirrors found in device MMIO space. Normally PCI config space is considered a slow path and further optimization is unnecessary, however NVIDIA uses a register here to enable the MSI interrupt to re-trigger. Exiting to QEMU for this MSI-ACK handling can therefore rate limit our interrupt handling. Fortunately the MSI-ACK write is easily detected since the quirk MemoryRegion otherwise has very few accesses, so simply looking for consecutive writes with the same data is sufficient, in this case 10 consecutive writes with the same data and size is arbitrarily chosen. We configure the KVM ioeventfd with data match, so there's no risk of triggering for the wrong data or size, but we do risk that pathological driver behavior might consume all of QEMU's file descriptors, so we cap ourselves to 10 ioeventfds for this purpose. In support of the above, generic ioeventfd infrastructure is added for vfio quirks. This automatically initializes an ioeventfd list per quirk, disables and frees ioeventfds on exit, and allows ioeventfds marked as dynamic to be dropped on device reset. The rationale for this latter feature is that useful ioeventfds may depend on specific driver behavior and since we necessarily place a cap on our use of ioeventfds, a machine reset is a reasonable point at which to assume a new driver and re-profile. Reviewed-by: Peter Xu <peterx@redhat.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2018-06-05 08:23:17 -06:00
Eric Auger	5c08600547	vfio: Use a trace point when a RAM section cannot be DMA mapped Commit `567b5b309a` ("vfio/pci: Relax DMA map errors for MMIO regions") added an error message if a passed memory section address or size is not aligned to the page size and thus cannot be DMA mapped. This patch fixes the trace by printing the region name and the memory region section offset within the address space (instead of offset_within_region). We also turn the error_report into a trace event. Indeed, In some cases, the traces can be confusing to non expert end-users and let think the use case does not work (whereas it works as before). This is the case where a BAR is successively mapped at different GPAs and its sections are not compatible with dma map. The listener is called several times and traces are issued for each intermediate mapping. The end-user cannot easily match those GPAs against the final GPA output by lscpi. So let's keep those information to informed users. In mid term, the plan is to advise the user about BAR relocation relevance. Fixes: `567b5b309a` ("vfio/pci: Relax DMA map errors for MMIO regions") Signed-off-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2018-04-05 10:48:52 -06:00
Alex Williamson	89d5202edc	vfio/pci: Allow relocating MSI-X MMIO Recently proposed vfio-pci kernel changes (v4.16) remove the restriction preventing userspace from mmap'ing PCI BARs in areas overlapping the MSI-X vector table. This change is primarily intended to benefit host platforms which make use of system page sizes larger than the PCI spec recommendation for alignment of MSI-X data structures (ie. not x86_64). In the case of POWER systems, the SPAPR spec requires the VM to program MSI-X using hypercalls, rendering the MSI-X vector table unused in the VM view of the device. However, ARM64 platforms also support 64KB pages and rely on QEMU emulation of MSI-X. Regardless of the kernel driver allowing mmaps overlapping the MSI-X vector table, emulation of the MSI-X vector table also prevents direct mapping of device MMIO spaces overlapping this page. Thanks to the fact that PCI devices have a standard self discovery mechanism, we can try to resolve this by relocating the MSI-X data structures, either by creating a new PCI BAR or extending an existing BAR and updating the MSI-X capability for the new location. There's even a very slim chance that this could benefit devices which do not adhere to the PCI spec alignment guidelines on x86_64 systems. This new x-msix-relocation option accepts the following choices: off: Disable MSI-X relocation, use native device config (default) auto: Use a known good combination for the platform/device (none yet) bar0..bar5: Specify the target BAR for MSI-X data structures If compatible, the target BAR will either be created or extended and the new portion will be used for MSI-X emulation. The first obvious user question with this option is how to determine whether a given platform and device might benefit from this option. In most cases, the answer is that it won't, especially on x86_64. Devices often dedicate an entire BAR to MSI-X and therefore no performance sensitive registers overlap the MSI-X area. Take for example: # lspci -vvvs 0a:00.0 0a:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network Connection ... Region 0: Memory at db680000 (32-bit, non-prefetchable) [size=512K] Region 3: Memory at db7f8000 (32-bit, non-prefetchable) [size=16K] ... Capabilities: [70] MSI-X: Enable+ Count=10 Masked- Vector table: BAR=3 offset=00000000 PBA: BAR=3 offset=00002000 This device uses the 16K bar3 for MSI-X with the vector table at offset zero and the pending bits arrary at offset 8K, fully honoring the PCI spec alignment guidance. The data sheet specifically refers to this as an MSI-X BAR. This device would not see a benefit from MSI-X relocation regardless of the platform, regardless of the page size. However, here's another example: # lspci -vvvs 02:00.0 02:00.0 Serial Attached SCSI controller: xxxxxxxx ... Region 0: I/O ports at c000 [size=256] Region 1: Memory at ef640000 (64-bit, non-prefetchable) [size=64K] Region 3: Memory at ef600000 (64-bit, non-prefetchable) [size=256K] ... Capabilities: [c0] MSI-X: Enable+ Count=16 Masked- Vector table: BAR=1 offset=0000e000 PBA: BAR=1 offset=0000f000 Here the MSI-X data structures are placed on separate 4K pages at the end of a 64KB BAR. If our host page size is 4K, we're likely fine, but at 64KB page size, MSI-X emulation at that location prevents the entire BAR from being directly mapped into the VM address space. Overlapping performance sensitive registers then starts to be a very likely scenario on such a platform. At this point, the user could enable tracing on vfio_region_read and vfio_region_write to determine more conclusively if device accesses are being trapped through QEMU. Upon finding a device and platform in need of MSI-X relocation, the next problem is how to choose target PCI BAR to host the MSI-X data structures. A few key rules to keep in mind for this selection include: * There are only 6 BAR slots, bar0..bar5 * 64-bit BARs occupy two BAR slots, 'lspci -vvv' lists the first slot * PCI BARs are always a power of 2 in size, extending == doubling * The maximum size of a 32-bit BAR is 2GB * MSI-X data structures must reside in an MMIO BAR Using these rules, we can evaluate each BAR of the second example device above as follows: bar0: I/O port BAR, incompatible with MSI-X tables bar1: BAR could be extended, incurring another 64KB of MMIO bar2: Unavailable, bar1 is 64-bit, this register is used by bar1 bar3: BAR could be extended, incurring another 256KB of MMIO bar4: Unavailable, bar3 is 64bit, this register is used by bar3 bar5: Available, empty BAR, minimum additional MMIO A secondary optimization we might wish to make in relocating MSI-X is to minimize the additional MMIO required for the device, therefore we might test the available choices in order of preference as bar5, bar1, and finally bar3. The original proposal for this feature included an 'auto' option which would choose bar5 in this case, but various drivers have been found that make assumptions about the properties of the "first" BAR or the size of BARs such that there appears to be no foolproof automatic selection available, requiring known good combinations to be sourced from users. This patch is pre-enabled for an 'auto' selection making use of a validated lookup table, but no entries are yet identified. Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: Eric Auger <eric.auger@redhat.com> Tested-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2018-02-06 11:08:26 -07:00
Alexey Kardashevskiy	07bc681a33	vfio/spapr: Use iommu memory region's get_attr() In order to enable TCE operations support in KVM, we have to inform the KVM about VFIO groups being attached to specific LIOBNs. The KVM already knows about VFIO groups, the only bit missing is which in-kernel TCE table (the one with user visible TCEs) should update the attached broups. There is an KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE attribute of the VFIO KVM device which receives a groupfd/tablefd couple. This uses a new memory_region_iommu_get_attr() helper to get the IOMMU fd and calls KVM to establish the link. As get_attr() is not implemented yet, this should cause no behavioural change. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Acked-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2018-02-06 11:08:24 -07:00
Vladimir Sementsov-Ogievskiy	8908eb1a4a	trace-events: fix code style: print 0x before hex numbers The only exception are groups of numers separated by symbols '.', ' ', ':', '/', like 'ab.09.7d'. This patch is made by the following: > find . -name trace-events \| xargs python script.py where script.py is the following python script: ========================= #!/usr/bin/env python import sys import re import fileinput rhex = '%[-+ .0-9](?:[hljztL]\|ll\|hh)?(?:x\|X\|"\sPRI[xX][^"]"?)' rgroup = re.compile('((?:' + rhex + '[.:/ ])+' + rhex + ')') rbad = re.compile('(?<!0x)' + rhex) files = sys.argv[1:] for fname in files: for line in fileinput.input(fname, inplace=True): arr = re.split(rgroup, line) for i in range(0, len(arr), 2): arr[i] = re.sub(rbad, '0x\g<0>', arr[i]) sys.stdout.write(''.join(arr)) ========================= Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Acked-by: Cornelia Huck <cohuck@redhat.com> Message-id: 20170731160135.12101-5-vsementsov@virtuozzo.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>	2017-08-01 12:13:07 +01:00
Philippe Mathieu-Daudé	87e0331c5a	docs: fix broken paths to docs/devel/tracing.txt With the move of some docs/ to docs/devel/ on `ac06724a71`, no references were updated. Signed-off-by: Philippe Mathieu-Daudé <f4bug@amsat.org> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>	2017-07-31 13:12:53 +03:00
Peter Xu	3213835720	vfio: trace map/unmap for notify as well We traces its range, but we don't know whether it's a MAP/UNMAP. Let's dump it as well. Acked-by: Alex Williamson <alex.williamson@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2017-02-17 21:52:31 +02:00
Stefan Hajnoczi	7f4076c1bb	trace: clean up trace-events files There are a number of unused trace events that scripts/cleanup-trace-events.pl finds. The "hw/vfio/pci-quirks.c" filename was typoed and "qapi/qapi-visit-core.c" was missing the qapi/ directory prefix. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Message-id: 20170126171613.1399-3-stefanha@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>	2017-01-31 17:12:15 +00:00
Eric Auger	1a22aca1d0	vfio/pci: Conversion to realize This patch converts VFIO PCI to realize function. Also original initfn errors now are propagated using QEMU error objects. All errors are formatted with the same pattern: "vfio: %s: the error description" Signed-off-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Markus Armbruster <armbru@redhat.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2016-10-17 10:58:01 -06:00
Laurent Vivier	e723b87103	trace-events: fix first line comment in trace-events Documentation is docs/tracing.txt instead of docs/trace-events.txt. find . -name trace-events -exec \ sed -i "s?See docs/trace-events.txt for syntax documentation.?See docs/tracing.txt for syntax documentation.?" \ {} \; Signed-off-by: Laurent Vivier <lvivier@redhat.com> Message-id: 1470669081-17860-1-git-send-email-lvivier@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>	2016-08-12 10:36:01 +01:00
Alexey Kardashevskiy	2e4109de8e	vfio/spapr: Create DMA window dynamically (SPAPR IOMMU v2) New VFIO_SPAPR_TCE_v2_IOMMU type supports dynamic DMA window management. This adds ability to VFIO common code to dynamically allocate/remove DMA windows in the host kernel when new VFIO container is added/removed. This adds a helper to vfio_listener_region_add which makes VFIO_IOMMU_SPAPR_TCE_CREATE ioctl and adds just created IOMMU into the host IOMMU list; the opposite action is taken in vfio_listener_region_del. When creating a new window, this uses heuristic to decide on the TCE table levels number. This should cause no guest visible change in behavior. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> [dwg: Added some casts to prevent printf() warnings on certain targets where the kernel headers' __u64 doesn't match uint64_t or PRIx64] Signed-off-by: David Gibson <david@gibson.dropbear.id.au>	2016-07-05 14:31:08 +10:00
Alexey Kardashevskiy	318f67ce13	vfio: spapr: Add DMA memory preregistering (SPAPR IOMMU v2) This makes use of the new "memory registering" feature. The idea is to provide the userspace ability to notify the host kernel about pages which are going to be used for DMA. Having this information, the host kernel can pin them all once per user process, do locked pages accounting (once) and not spent time on doing that in real time with possible failures which cannot be handled nicely in some cases. This adds a prereg memory listener which listens on address_space_memory and notifies a VFIO container about memory which needs to be pinned/unpinned. VFIO MMIO regions (i.e. "skip dump" regions) are skipped. The feature is only enabled for SPAPR IOMMU v2. The host kernel changes are required. Since v2 does not need/support VFIO_IOMMU_ENABLE, this does not call it when v2 is detected and enabled. This enforces guest RAM blocks to be host page size aligned; however this is not new as KVM already requires memory slots to be host page size aligned. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> [dwg: Fix compile error on 32-bit host] Signed-off-by: David Gibson <david@gibson.dropbear.id.au>	2016-07-05 14:30:54 +10:00
Alex Williamson	e37dac06dc	vfio/pci: Hide SR-IOV capability The kernel currently exposes the SR-IOV capability as read-only through vfio-pci. This is sufficient to protect the host kernel, but has the potential to confuse guests without further virtualization. In particular, OVMF tries to size the VF BARs and comes up with absurd results, ending with an assert. There's not much point in adding virtualization to a read-only capability, so we simply hide it for now. If the kernel ever enables SR-IOV virtualization, we should easily be able to test it through VF BAR sizing or explicit flags. Testing whether we should parse extended capabilities is also pulled into the function to keep these assumptions in one place. Tested-by: Laszlo Ersek <lersek@redhat.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2016-06-30 13:00:23 -06:00
Daniel P. Berrange	1cf6ebc73f	trace: split out trace events for hw/vfio/ directory Move all trace-events for files in the hw/vfio/ directory to their own file. Signed-off-by: Daniel P. Berrange <berrange@redhat.com> Message-id: 1466066426-16657-30-git-send-email-berrange@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>	2016-06-20 17:22:16 +01:00

18 Commits