Thie PIIX3 south bridge implements both the PIC and the ISA bus, so wiring the
interrupts there makes the device model more self-contained. Furthermore, this
allows the ISA interrupts to be wired to internal child devices in
pci_piix3_realize() which will be performed in subsequent patches.
Signed-off-by: Bernhard Beschow <shentey@gmail.com>
Message-Id: <20231007123843.127151-10-shentey@gmail.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
When the board assigns the ISA IRQs after the device's realize(), internal
devices such as the RTC can't be wired in ich9_lpc_realize() since the qemu_irqs
are still NULL. Fix that by assigning the ISA interrupts before realize().
This change is necessary for PIIX consolidation because PIIX4 wires the RTC
interrupts in its realize() method, so PIIX3 needs to do so as well. Since the
PC and Q35 boards share RTC code, and since PIIX3 needs the change, ICH9 needs
to be adapted as well.
Signed-off-by: Bernhard Beschow <shentey@gmail.com>
Message-Id: <20231007123843.127151-9-shentey@gmail.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
TYPE_PIIX3_DEVICE doesn't instantiate a PIC since it relies on the board to do
so. The "pic" attribute, however, suggests that there is one. Rename the
attribute to reflect that it represents ISA interrupt lines. Use the same naming
convention as in the VIA south bridges as well as in TYPE_I82378.
Signed-off-by: Bernhard Beschow <shentey@gmail.com>
Message-Id: <20231007123843.127151-8-shentey@gmail.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
The variable is never used by its declared type. Eliminate it.
Signed-off-by: Bernhard Beschow <shentey@gmail.com>
Message-Id: <20231007123843.127151-7-shentey@gmail.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Avoid assigning the private member of struct PIIX3State from outside which goes
against best QOM practices. Instead, implement best QOM practice by adding an
"isa-irqs" array property to TYPE_PIIX3_DEVICE and assign it in board code, i.e.
from outside.
Signed-off-by: Bernhard Beschow <shentey@gmail.com>
Message-Id: <20231007123843.127151-6-shentey@gmail.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
PIIX_NUM_PIC_IRQS is assumed to be the same as ISA_NUM_IRQS, otherwise
inconsistencies can occur.
Signed-off-by: Bernhard Beschow <shentey@gmail.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Message-Id: <20231007123843.127151-5-shentey@gmail.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Unlike its PIIX4 counterpart, TYPE_PIIX3_DEVICE doesn't instantiate a PIC
itself. Instead, it relies on the board to do so. This means that the board
needs to wire the ISA IRQs to the PIIX3 device model. As long as the board
assigns the ISA IRQs after PIIX3's realize(), internal devices can't be wired in
pci_piix3_realize() since the qemu_irqs are still NULL. Fix that by assigning
the ISA interrupts before realize(). This will allow for embedding child devices
into the host device as already done for PIIX4.
Signed-off-by: Bernhard Beschow <shentey@gmail.com>
Message-Id: <20231007123843.127151-4-shentey@gmail.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
The next patches will need to take advantage of it.
Signed-off-by: Bernhard Beschow <shentey@gmail.com>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Message-Id: <20231007123843.127151-3-shentey@gmail.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
By being the only entity assigning a non-NULL value to "rtc_irq", the first if
statement determines whether the second if statement is executed. So merge the
two statements into one.
Signed-off-by: Bernhard Beschow <shentey@gmail.com>
Message-Id: <20231007123843.127151-2-shentey@gmail.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
When the given uuid is already present in the hash table,
virtio_add_resource() does not add the passed VirtioSharedObject. In
this case, free it in the callers to avoid leaking memory. This fixed
the following `make check` error, when built with --enable-sanitizers:
4/166 qemu:unit / test-virtio-dmabuf ERROR 1.51s exit status 1
==7716==ERROR: LeakSanitizer: detected memory leaks
Direct leak of 320 byte(s) in 20 object(s) allocated from:
#0 0x7f6fc16e3808 in __interceptor_malloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cc:144
#1 0x7f6fc1503e98 in g_malloc (/lib/x86_64-linux-gnu/libglib-2.0.so.0+0x57e98)
#2 0x564d63cafb6b in test_add_invalid_resource ../tests/unit/test-virtio-dmabuf.c:100
#3 0x7f6fc152659d (/lib/x86_64-linux-gnu/libglib-2.0.so.0+0x7a59d)
SUMMARY: AddressSanitizer: 320 byte(s) leaked in 20 allocation(s).
The changes at virtio_add_resource() itself are not strictly necessary
for the memleak fix, but they make it more obvious that, on an error
return, the passed object is not added to the hash.
Signed-off-by: Matheus Tavares Bernardino <quic_mathbern@quicinc.com>
Message-Id: <c61c13f9a0c67dec473bdbfc8789c29ef26c900b.1696624734.git.quic_mathbern@quicinc.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Albert Esteve <aesteve@redhat.com>
Signed-off-by: Matheus Tavares Bernardino <<a href="mailto:quic_mathbern@quicinc.com" target="_blank">quic_mathbern@quicinc.com</a>><br>
Currently, the one-shot (mode 1) PIT expires far too quickly,
due to the output being set under the wrong logic.
This change fixes the one-shot PIT mode to behave similarly to mode 0.
TESTED: using the one-shot PIT mode to calibrate a local apic timer.
Signed-off-by: Damien Zammit <damien@zamaudio.com>
Message-Id: <20230226015755.52624-1-damien@zamaudio.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Commit 6103451aeb ("hw/i386: Build-time assertion on pc/q35 reset register
being identical.") introduced a build-time check where the addresses of the
reset registers are expected to be equal. Back then rev3 of the FADT was used
which required the reset register to be populated and there was common code.
In commit 3a3fcc75f9 ("pc: acpi: force FADT rev1 for 440fx based machine
types") the FADT was downgraded to rev1 for PIIX where the reset register isn't
available. Thus, there is no need for the assertion any longer, so remove it.
Signed-off-by: Bernhard Beschow <shentey@gmail.com>
Reviewed-by: Ani Sinha <anisinha@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-Id: <20231004092355.12929-1-shentey@gmail.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
vhost-user-scsi has a VirtioDeviceClass->reset() function that calls
->vhost_reset_device(). The other vhost devices don't notify the vhost
device upon reset.
Stateful vhost devices may need to handle device reset in order to free
resources or prevent stale device state from interfering after reset.
Call ->vhost_device_reset() from virtio_reset() so that that vhost
devices are notified of device reset.
This patch affects behavior as follows:
- vhost-kernel: No change in behavior since ->vhost_reset_device() is
not implemented.
- vhost-user: back-ends that negotiate
VHOST_USER_PROTOCOL_F_RESET_DEVICE now receive a
VHOST_USER_DEVICE_RESET message upon device reset. Otherwise there is
no change in behavior. DPDK, SPDK, libvhost-user, and the
vhost-user-backend crate do not negotiate
VHOST_USER_PROTOCOL_F_RESET_DEVICE automatically.
- vhost-vdpa: an extra SET_STATUS 0 call is made during device reset.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20231004014532.1228637-4-stefanha@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Raphael Norwitz <raphael.norwitz@nutanix.com>
Reviewed-by: Hanna Czenczek <hreitz@redhat.com>
vhost_kernel_reset_device() invokes RESET_OWNER, which disassociates the
owner process from the device. The device is left non-operational since
SET_OWNER is only called once during startup in vhost_dev_init().
vhost_kernel_reset_device() is never called so this latent bug never
appears. Get rid of vhost_kernel_reset_device() for now. If someone
needs it in the future they'll need to implement it correctly.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20231004014532.1228637-3-stefanha@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Raphael Norwitz <raphael.norwitz@nutanix.com>
Reviewed-by: Hanna Czenczek <hreitz@redhat.com>
The VHOST_USER_RESET_OWNER message is deprecated in the spec:
This is no longer used. Used to be sent to request disabling all
rings, but some back-ends interpreted it to also discard connection
state (this interpretation would lead to bugs). It is recommended
that back-ends either ignore this message, or use it to disable all
rings.
The only caller of vhost_user_reset_device() is vhost_user_scsi_reset().
It checks that F_RESET_DEVICE was negotiated before calling it:
static void vhost_user_scsi_reset(VirtIODevice *vdev)
{
VHostSCSICommon *vsc = VHOST_SCSI_COMMON(vdev);
struct vhost_dev *dev = &vsc->dev;
/*
* Historically, reset was not implemented so only reset devices
* that are expecting it.
*/
if (!virtio_has_feature(dev->protocol_features,
VHOST_USER_PROTOCOL_F_RESET_DEVICE)) {
return;
}
if (dev->vhost_ops->vhost_reset_device) {
dev->vhost_ops->vhost_reset_device(dev);
}
}
Therefore VHOST_USER_RESET_OWNER is actually never sent by
vhost_user_reset_device(). Remove the dead code. This effectively moves
the vhost-user protocol specific code from vhost-user-scsi.c into
vhost-user.c where it belongs.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20231004014532.1228637-2-stefanha@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Raphael Norwitz <raphael.norwitz@nutanix.com>
(1) The virtio-1.2 specification
<http://docs.oasis-open.org/virtio/virtio/v1.2/virtio-v1.2.html> writes:
> 3 General Initialization And Device Operation
> 3.1 Device Initialization
> 3.1.1 Driver Requirements: Device Initialization
>
> [...]
>
> 7. Perform device-specific setup, including discovery of virtqueues for
> the device, optional per-bus setup, reading and possibly writing the
> device’s virtio configuration space, and population of virtqueues.
>
> 8. Set the DRIVER_OK status bit. At this point the device is “live”.
and
> 4 Virtio Transport Options
> 4.1 Virtio Over PCI Bus
> 4.1.4 Virtio Structure PCI Capabilities
> 4.1.4.3 Common configuration structure layout
> 4.1.4.3.2 Driver Requirements: Common configuration structure layout
>
> [...]
>
> The driver MUST configure the other virtqueue fields before enabling the
> virtqueue with queue_enable.
>
> [...]
(The same statements are present in virtio-1.0 identically, at
<http://docs.oasis-open.org/virtio/virtio/v1.0/virtio-v1.0.html>.)
These together mean that the following sub-sequence of steps is valid for
a virtio-1.0 guest driver:
(1.1) set "queue_enable" for the needed queues as the final part of device
initialization step (7),
(1.2) set DRIVER_OK in step (8),
(1.3) immediately start sending virtio requests to the device.
(2) When vhost-user is enabled, and the VHOST_USER_F_PROTOCOL_FEATURES
special virtio feature is negotiated, then virtio rings start in disabled
state, according to
<https://qemu-project.gitlab.io/qemu/interop/vhost-user.html#ring-states>.
In this case, explicit VHOST_USER_SET_VRING_ENABLE messages are needed for
enabling vrings.
Therefore setting "queue_enable" from the guest (1.1) -- which is
technically "buffered" on the QEMU side until the guest sets DRIVER_OK
(1.2) -- is a *control plane* operation, which -- after (1.2) -- travels
from the guest through QEMU to the vhost-user backend, using a unix domain
socket.
Whereas sending a virtio request (1.3) is a *data plane* operation, which
evades QEMU -- it travels from guest to the vhost-user backend via
eventfd.
This means that operations ((1.1) + (1.2)) and (1.3) travel through
different channels, and their relative order can be reversed, as perceived
by the vhost-user backend.
That's exactly what happens when OVMF's virtiofs driver (VirtioFsDxe) runs
against the Rust-language virtiofsd version 1.7.2. (Which uses version
0.10.1 of the vhost-user-backend crate, and version 0.8.1 of the vhost
crate.)
Namely, when VirtioFsDxe binds a virtiofs device, it goes through the
device initialization steps (i.e., control plane operations), and
immediately sends a FUSE_INIT request too (i.e., performs a data plane
operation). In the Rust-language virtiofsd, this creates a race between
two components that run *concurrently*, i.e., in different threads or
processes:
- Control plane, handling vhost-user protocol messages:
The "VhostUserSlaveReqHandlerMut::set_vring_enable" method
[crates/vhost-user-backend/src/handler.rs] handles
VHOST_USER_SET_VRING_ENABLE messages, and updates each vring's "enabled"
flag according to the message processed.
- Data plane, handling virtio / FUSE requests:
The "VringEpollHandler::handle_event" method
[crates/vhost-user-backend/src/event_loop.rs] handles the incoming
virtio / FUSE request, consuming the virtio kick at the same time. If
the vring's "enabled" flag is set, the virtio / FUSE request is
processed genuinely. If the vring's "enabled" flag is clear, then the
virtio / FUSE request is discarded.
Note that OVMF enables the queue *first*, and sends FUSE_INIT *second*.
However, if the data plane processor in virtiofsd wins the race, then it
sees the FUSE_INIT *before* the control plane processor took notice of
VHOST_USER_SET_VRING_ENABLE and green-lit the queue for the data plane
processor. Therefore the latter drops FUSE_INIT on the floor, and goes
back to waiting for further virtio / FUSE requests with epoll_wait.
Meanwhile OVMF is stuck waiting for the FUSET_INIT response -- a deadlock.
The deadlock is not deterministic. OVMF hangs infrequently during first
boot. However, OVMF hangs almost certainly during reboots from the UEFI
shell.
The race can be "reliably masked" by inserting a very small delay -- a
single debug message -- at the top of "VringEpollHandler::handle_event",
i.e., just before the data plane processor checks the "enabled" field of
the vring. That delay suffices for the control plane processor to act upon
VHOST_USER_SET_VRING_ENABLE.
We can deterministically prevent the race in QEMU, by blocking OVMF inside
step (1.2) -- i.e., in the write to the device status register that
"unleashes" queue enablement -- until VHOST_USER_SET_VRING_ENABLE actually
*completes*. That way OVMF's VCPU cannot advance to the FUSE_INIT
submission before virtiofsd's control plane processor takes notice of the
queue being enabled.
Wait for VHOST_USER_SET_VRING_ENABLE completion by:
- setting the NEED_REPLY flag on VHOST_USER_SET_VRING_ENABLE, and waiting
for the reply, if the VHOST_USER_PROTOCOL_F_REPLY_ACK vhost-user feature
has been negotiated, or
- performing a separate VHOST_USER_GET_FEATURES *exchange*, which requires
a backend response regardless of VHOST_USER_PROTOCOL_F_REPLY_ACK.
Cc: "Michael S. Tsirkin" <mst@redhat.com> (supporter:vhost)
Cc: Eugenio Perez Martin <eperezma@redhat.com>
Cc: German Maglione <gmaglione@redhat.com>
Cc: Liu Jiang <gerry@linux.alibaba.com>
Cc: Sergio Lopez Pascual <slp@redhat.com>
Cc: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Laszlo Ersek <lersek@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Tested-by: Albert Esteve <aesteve@redhat.com>
[lersek@redhat.com: work Eugenio's explanation into the commit message,
about QEMU containing step (1.1) until step (1.2)]
Reviewed-by: Eugenio Pérez <eperezma@redhat.com>
Message-Id: <20231002203221.17241-8-lersek@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
The "vhost_set_vring" function already centralizes the common parts of
"vhost_user_set_vring_num", "vhost_user_set_vring_base" and
"vhost_user_set_vring_enable". We'll want to allow some of those callers
to wait for a reply.
Therefore, rebase "vhost_set_vring" from just "vhost_user_write" to
"vhost_user_write_sync", exposing the "wait_for_reply" parameter.
This is purely refactoring -- there is no observable change. That's
because:
- all three callers pass in "false" for "wait_for_reply", which disables
all logic in "vhost_user_write_sync" except the call to
"vhost_user_write";
- the fds=NULL and fd_num=0 arguments of the original "vhost_user_write"
call inside "vhost_set_vring" are hard-coded within
"vhost_user_write_sync".
Cc: "Michael S. Tsirkin" <mst@redhat.com> (supporter:vhost)
Cc: Eugenio Perez Martin <eperezma@redhat.com>
Cc: German Maglione <gmaglione@redhat.com>
Cc: Liu Jiang <gerry@linux.alibaba.com>
Cc: Sergio Lopez Pascual <slp@redhat.com>
Cc: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Laszlo Ersek <lersek@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Tested-by: Albert Esteve <aesteve@redhat.com>
Reviewed-by: Eugenio Pérez <eperezma@redhat.com>
Message-Id: <20231002203221.17241-7-lersek@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
In order to avoid a forward-declaration for "vhost_user_write_sync" in a
subsequent patch, hoist "vhost_user_write_sync" ->
"vhost_user_get_features" -> "vhost_user_get_u64" just above
"vhost_set_vring".
This is purely code movement -- no observable change.
Cc: "Michael S. Tsirkin" <mst@redhat.com> (supporter:vhost)
Cc: Eugenio Perez Martin <eperezma@redhat.com>
Cc: German Maglione <gmaglione@redhat.com>
Cc: Liu Jiang <gerry@linux.alibaba.com>
Cc: Sergio Lopez Pascual <slp@redhat.com>
Cc: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Laszlo Ersek <lersek@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Tested-by: Albert Esteve <aesteve@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Eugenio Pérez <eperezma@redhat.com>
Message-Id: <20231002203221.17241-6-lersek@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
At this point, only "vhost_user_write_sync" calls "enforce_reply"; embed
the latter into the former.
This is purely refactoring -- no observable change.
Cc: "Michael S. Tsirkin" <mst@redhat.com> (supporter:vhost)
Cc: Eugenio Perez Martin <eperezma@redhat.com>
Cc: German Maglione <gmaglione@redhat.com>
Cc: Liu Jiang <gerry@linux.alibaba.com>
Cc: Sergio Lopez Pascual <slp@redhat.com>
Cc: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Laszlo Ersek <lersek@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Tested-by: Albert Esteve <aesteve@redhat.com>
Reviewed-by: Eugenio Pérez <eperezma@redhat.com>
Message-Id: <20231002203221.17241-5-lersek@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
The tails of the "vhost_user_set_vring_addr" and "vhost_user_set_u64"
functions are now byte-for-byte identical. Factor the common tail out to a
new function called "vhost_user_write_sync".
This is purely refactoring -- no observable change.
Cc: "Michael S. Tsirkin" <mst@redhat.com> (supporter:vhost)
Cc: Eugenio Perez Martin <eperezma@redhat.com>
Cc: German Maglione <gmaglione@redhat.com>
Cc: Liu Jiang <gerry@linux.alibaba.com>
Cc: Sergio Lopez Pascual <slp@redhat.com>
Cc: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Laszlo Ersek <lersek@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Tested-by: Albert Esteve <aesteve@redhat.com>
Reviewed-by: Eugenio Pérez <eperezma@redhat.com>
Message-Id: <20231002203221.17241-4-lersek@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
In the vhost_user_set_vring_addr() function, we calculate
"reply_supported" unconditionally, even though we'll only need it if
"wait_for_reply" is also true.
Restrict the scope of "reply_supported" to the minimum.
This is purely refactoring -- no observable change.
Cc: "Michael S. Tsirkin" <mst@redhat.com> (supporter:vhost)
Cc: Eugenio Perez Martin <eperezma@redhat.com>
Cc: German Maglione <gmaglione@redhat.com>
Cc: Liu Jiang <gerry@linux.alibaba.com>
Cc: Sergio Lopez Pascual <slp@redhat.com>
Cc: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Laszlo Ersek <lersek@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Tested-by: Albert Esteve <aesteve@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Eugenio Pérez <eperezma@redhat.com>
Message-Id: <20231002203221.17241-3-lersek@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Next patches in this series will delay the polling
and checking of buffers until either the SVQ is
full or control commands shadow buffers are full,
no longer perform an immediate poll and check of
the device's used buffers for each CVQ state load command.
To achieve this, this patch exposes
vhost_svq_available_slots(), allowing QEMU to know
whether the SVQ is full.
Signed-off-by: Hawkins Jiawei <yin31149@gmail.com>
Acked-by: Eugenio Pérez <eperezma@redhat.com>
Message-Id: <25938079f0bd8185fd664c64e205e629f7a966be.1697165821.git.yin31149@gmail.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
-----BEGIN PGP SIGNATURE-----
iLMEAAEKAB0WIQS4/x2g0v3LLaCcbCxAov/yOSY+3wUCZSimNQAKCRBAov/yOSY+
33XwBADF9ZKlESDBDa/huNFAKD7BsUIdglHfz9lHnLY+kQbCun4HyTLtp2IBsySu
mZTjdfU/LnaBidFLjEnmZZMPyiI3oV1ruSzT53egSDaxrFUXGpc9oxtMNLsyfk9P
swdngG13Fc9sWVKC7IJeYDYXgkvHY7NxsiV8U9vdqXOyw2uoHA==
=ufPc
-----END PGP SIGNATURE-----
Merge tag 'pull-loongarch-20231013' of https://gitlab.com/gaosong/qemu into staging
pull-loongarch-20231013
# -----BEGIN PGP SIGNATURE-----
#
# iLMEAAEKAB0WIQS4/x2g0v3LLaCcbCxAov/yOSY+3wUCZSimNQAKCRBAov/yOSY+
# 33XwBADF9ZKlESDBDa/huNFAKD7BsUIdglHfz9lHnLY+kQbCun4HyTLtp2IBsySu
# mZTjdfU/LnaBidFLjEnmZZMPyiI3oV1ruSzT53egSDaxrFUXGpc9oxtMNLsyfk9P
# swdngG13Fc9sWVKC7IJeYDYXgkvHY7NxsiV8U9vdqXOyw2uoHA==
# =ufPc
# -----END PGP SIGNATURE-----
# gpg: Signature made Thu 12 Oct 2023 22:06:45 EDT
# gpg: using RSA key B8FF1DA0D2FDCB2DA09C6C2C40A2FFF239263EDF
# gpg: Good signature from "Song Gao <m17746591750@163.com>" [unknown]
# gpg: WARNING: This key is not certified with a trusted signature!
# gpg: There is no indication that the signature belongs to the owner.
# Primary key fingerprint: B8FF 1DA0 D2FD CB2D A09C 6C2C 40A2 FFF2 3926 3EDF
* tag 'pull-loongarch-20231013' of https://gitlab.com/gaosong/qemu:
LoongArch: step down as general arch maintainer
hw/loongarch/virt: Remove unused 'loongarch_virt_pm' region
hw/loongarch/virt: Remove unused ISA Bus
hw/loongarch/virt: Remove unused ISA UART
hw/loongarch: remove global loaderparams variable
target/loongarch: Add preldx instruction
target/loongarch: fix ASXE flag conflict
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
* Deprecate the rdma code
* Fix flaky npcm7xx_timer test
* i2c-echo license statement and Kconfig switch
* Disable the failing riscv64-debian-cross CI job by default
-----BEGIN PGP SIGNATURE-----
iQJFBAABCAAvFiEEJ7iIR+7gJQEY8+q5LtnXdP5wLbUFAmUoI3IRHHRodXRoQHJl
ZGhhdC5jb20ACgkQLtnXdP5wLbXl3A//dnuUf7cXrllZc9PyTRMg99c45DRrj058
e9iOIdQdMAxN8QQtG/6HHVGRBKP5q5LqcAHZE4DTbIxz9keyoVNQ7CiOL8Zm1GJx
up5o/HGul8pAnlG0HrM21VJ6ZVSFW8j+50LQbxLTPOMdGAmzxBPvVIGonDOMBIJh
GPMTFFbJcppSCY2i5ijZ1qmgWMjwErEdQ2kXxxWo5eczNUFT68A4UmQ4NzlA0DZ7
jG19+2DzigJxnLhYXIOPGCkKjVPcue7QjB1ka35ixU6AgQcW3WNrEwqVbf+FHNyN
U1ONWqSTkpGlxB/3a+u/HecC4qN4bCRzVdfTNOArQpgNcc+NWr+jZlAoCZ9N72aH
jZNfW4D1TcgLh+Etj2NWRJKe/2F8N03V/9sJta0eYekzwUqu15Lc1FhhG2zJUJzO
QZxPik6J7PkxX28AoSLjyRx80g3V06eBh2YXqIZioZnbdRQrqX1GG7YpRk7YQRCF
o77T85R64s/z/6GSvEee2f6e6c8O/WKMhLBBoI8fCN0TZVv5VlwHLipKxRQqUZIn
+Iu/LpKsGtmQT259D+rqkCJMzdPFcc3S8e4MwWb8KMe3/rqZuGoCO3ecXQR+qto4
Gif0Z7T+us1hf6ExHDi+8FaRBMlbDpk1/Ebw1NvwcWqMaWipyeAY1yVmYzAMcQ9/
dm2e2YWb4mk=
=aPMR
-----END PGP SIGNATURE-----
Merge tag 'pull-request-2023-10-12' of https://gitlab.com/thuth/qemu into staging
* Fix CVE-2023-1544
* Deprecate the rdma code
* Fix flaky npcm7xx_timer test
* i2c-echo license statement and Kconfig switch
* Disable the failing riscv64-debian-cross CI job by default
* tag 'pull-request-2023-10-12' of https://gitlab.com/thuth/qemu:
gitlab-ci: Disable the riscv64-debian-cross-container by default
MAINTAINERS: Add include/sysemu/qtest.h to the qtest section
hw/misc/Kconfig: add switch for i2c-echo
hw/misc/i2c-echo: add copyright/license note
tests/qtest: Fix npcm7xx_timer-test.c flaky test
hw/rdma: Deprecate the pvrdma device and the rdma subsystem
hw/pvrdma: Protect against buggy or malicious guest driver
Conflicts:
docs/about/deprecated.rst
Context conflict between RISC-V and RDMA deprecation.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
"Host Memory Backends" and "Memory devices" queue ("mem"):
- Support memory devices with multiple memslots
- Support memory devices that dynamically consume memslots
- Support memory devices that can automatically decide on the number of
memslots to use
- virtio-mem support for exposing memory dynamically via multiple
memslots
- Some required cleanups/refactorings
-----BEGIN PGP SIGNATURE-----
iQJFBAABCAAvFiEEG9nKrXNcTDpGDfzKTd4Q9wD/g1oFAmUn+XMRHGRhdmlkQHJl
ZGhhdC5jb20ACgkQTd4Q9wD/g1qDHA//T01suTa+uzrcoJHoMWN11S47WnAmbuTo
vVakucLBPMJAa9xZeCy3OavXaVGpHkw+t6g3OFknof0LfQ5/j9iE3Q1PxURN7g5j
SJ2WJXCoceM6T4TMhPvVvgEaYjFmESqZB5FZgedMT0QRyhAxMuF9pCkWhk1O3OAV
JqQKqLFiGcv60AEuBYGZGzgiOUv8EJ5gKwRF4VOdyHIxqZDw1aZXzlcd4TzFZBQ7
rwW/3ef+sFmUJdmfrSrqcIlQSRrqZ2w95xATDzLTIEEUT3SWqh/E95EZWIz1M0oQ
NgWgFiLCR1KOj7bWFhLXT7IfyLh0mEysD+P/hY6QwQ4RewWG7EW5UK+JFswssdcZ
rEj5XpHZzev/wx7hM4bWsoQ+VIvrH7j3uYGyWkcgYRbdDEkWDv2rsT23lwGYNhht
oBsrdEBELRw6v4C8doq/+sCmHmuxUMqTGwbArCQVnB1XnLxOEkuqlnfq5MORkzNF
fxbIRx+LRluOllC0HVaDQd8qxRq1+UC5WIpAcDcrouy4HGgi1onWKrXpgjIAbVyH
M6cENkK7rnRk96gpeXdmrf0h9HqRciAOY8oUsFsvLyKBOCPBWDrLyOQEY5UoSdtD
m4QpEVgywCy2z1uU/UObeT/UxJy/9EL/Zb+DHoEK06iEhwONoUJjEBYMJD38RMkk
mwPTB4UAk9g=
=s69t
-----END PGP SIGNATURE-----
Merge tag 'mem-2023-10-12' of https://github.com/davidhildenbrand/qemu into staging
Hi,
"Host Memory Backends" and "Memory devices" queue ("mem"):
- Support memory devices with multiple memslots
- Support memory devices that dynamically consume memslots
- Support memory devices that can automatically decide on the number of
memslots to use
- virtio-mem support for exposing memory dynamically via multiple
memslots
- Some required cleanups/refactorings
# -----BEGIN PGP SIGNATURE-----
#
# iQJFBAABCAAvFiEEG9nKrXNcTDpGDfzKTd4Q9wD/g1oFAmUn+XMRHGRhdmlkQHJl
# ZGhhdC5jb20ACgkQTd4Q9wD/g1qDHA//T01suTa+uzrcoJHoMWN11S47WnAmbuTo
# vVakucLBPMJAa9xZeCy3OavXaVGpHkw+t6g3OFknof0LfQ5/j9iE3Q1PxURN7g5j
# SJ2WJXCoceM6T4TMhPvVvgEaYjFmESqZB5FZgedMT0QRyhAxMuF9pCkWhk1O3OAV
# JqQKqLFiGcv60AEuBYGZGzgiOUv8EJ5gKwRF4VOdyHIxqZDw1aZXzlcd4TzFZBQ7
# rwW/3ef+sFmUJdmfrSrqcIlQSRrqZ2w95xATDzLTIEEUT3SWqh/E95EZWIz1M0oQ
# NgWgFiLCR1KOj7bWFhLXT7IfyLh0mEysD+P/hY6QwQ4RewWG7EW5UK+JFswssdcZ
# rEj5XpHZzev/wx7hM4bWsoQ+VIvrH7j3uYGyWkcgYRbdDEkWDv2rsT23lwGYNhht
# oBsrdEBELRw6v4C8doq/+sCmHmuxUMqTGwbArCQVnB1XnLxOEkuqlnfq5MORkzNF
# fxbIRx+LRluOllC0HVaDQd8qxRq1+UC5WIpAcDcrouy4HGgi1onWKrXpgjIAbVyH
# M6cENkK7rnRk96gpeXdmrf0h9HqRciAOY8oUsFsvLyKBOCPBWDrLyOQEY5UoSdtD
# m4QpEVgywCy2z1uU/UObeT/UxJy/9EL/Zb+DHoEK06iEhwONoUJjEBYMJD38RMkk
# mwPTB4UAk9g=
# =s69t
# -----END PGP SIGNATURE-----
# gpg: Signature made Thu 12 Oct 2023 09:49:39 EDT
# gpg: using RSA key 1BD9CAAD735C4C3A460DFCCA4DDE10F700FF835A
# gpg: issuer "david@redhat.com"
# gpg: Good signature from "David Hildenbrand <david@redhat.com>" [unknown]
# gpg: aka "David Hildenbrand <davidhildenbrand@gmail.com>" [full]
# gpg: aka "David Hildenbrand <hildenbr@in.tum.de>" [unknown]
# gpg: WARNING: The key's User ID is not certified with a trusted signature!
# gpg: There is no indication that the signature belongs to the owner.
# Primary key fingerprint: 1BD9 CAAD 735C 4C3A 460D FCCA 4DDE 10F7 00FF 835A
* tag 'mem-2023-10-12' of https://github.com/davidhildenbrand/qemu:
virtio-mem: Mark memslot alias memory regions unmergeable
memory,vhost: Allow for marking memory device memory regions unmergeable
virtio-mem: Expose device memory dynamically via multiple memslots if enabled
virtio-mem: Update state to match bitmap as soon as it's been migrated
virtio-mem: Pass non-const VirtIOMEM via virtio_mem_range_cb
memory: Clarify mapping requirements for RamDiscardManager
memory-device,vhost: Support automatic decision on the number of memslots
vhost: Add vhost_get_max_memslots()
kvm: Add stub for kvm_get_max_memslots()
memory-device,vhost: Support memory devices that dynamically consume memslots
memory-device: Track required and actually used memslots in DeviceMemoryState
stubs: Rename qmp_memory_device.c to memory_device.c
memory-device: Support memory devices with multiple memslots
vhost: Return number of free memslots
kvm: Return number of free memslots
softmmu/physmem: Fixup qemu_ram_block_from_host() documentation
vhost: Remove vhost_backend_can_merge() callback
vhost: Rework memslot filtering and fix "used_memslot" tracking
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This adds initial support for gfxstream and cross-domain. Both
features rely on virtio-gpu blob resources and context types, which
are also implemented in this patch.
gfxstream has a long and illustrious history in Android graphics
paravirtualization. It has been powering graphics in the Android
Studio Emulator for more than a decade, which is the main developer
platform.
Originally conceived by Jesse Hall, it was first known as "EmuGL" [a].
The key design characteristic was a 1:1 threading model and
auto-generation, which fit nicely with the OpenGLES spec. It also
allowed easy layering with ANGLE on the host, which provides the GLES
implementations on Windows or MacOS enviroments.
gfxstream has traditionally been maintained by a single engineer, and
between 2015 to 2021, the goldfish throne passed to Frank Yang.
Historians often remark this glorious reign ("pax gfxstreama" is the
academic term) was comparable to that of Augustus and both Queen
Elizabeths. Just to name a few accomplishments in a resplendent
panoply: higher versions of GLES, address space graphics, snapshot
support and CTS compliant Vulkan [b].
One major drawback was the use of out-of-tree goldfish drivers.
Android engineers didn't know much about DRM/KMS and especially TTM so
a simple guest to host pipe was conceived.
Luckily, virtio-gpu 3D started to emerge in 2016 due to the work of
the Mesa/virglrenderer communities. In 2018, the initial virtio-gpu
port of gfxstream was done by Cuttlefish enthusiast Alistair Delva.
It was a symbol compatible replacement of virglrenderer [c] and named
"AVDVirglrenderer". This implementation forms the basis of the
current gfxstream host implementation still in use today.
cross-domain support follows a similar arc. Originally conceived by
Wayland aficionado David Reveman and crosvm enjoyer Zach Reizner in
2018, it initially relied on the downstream "virtio-wl" device.
In 2020 and 2021, virtio-gpu was extended to include blob resources
and multiple timelines by yours truly, features gfxstream/cross-domain
both require to function correctly.
Right now, we stand at the precipice of a truly fantastic possibility:
the Android Emulator powered by upstream QEMU and upstream Linux
kernel. gfxstream will then be packaged properfully, and app
developers can even fix gfxstream bugs on their own if they encounter
them.
It's been quite the ride, my friends. Where will gfxstream head next,
nobody really knows. I wouldn't be surprised if it's around for
another decade, maintained by a new generation of Android graphics
enthusiasts.
Technical details:
- Very simple initial display integration: just used Pixman
- Largely, 1:1 mapping of virtio-gpu hypercalls to rutabaga function
calls
Next steps for Android VMs:
- The next step would be improving display integration and UI interfaces
with the goal of the QEMU upstream graphics being in an emulator
release [d].
Next steps for Linux VMs for display virtualization:
- For widespread distribution, someone needs to package Sommelier or the
wayland-proxy-virtwl [e] ideally into Debian main. In addition, newer
versions of the Linux kernel come with DRM_VIRTIO_GPU_KMS option,
which allows disabling KMS hypercalls. If anyone cares enough, it'll
probably be possible to build a custom VM variant that uses this display
virtualization strategy.
[a] https://android-review.googlesource.com/c/platform/development/+/34470
[b] https://android-review.googlesource.com/q/topic:%22vulkan-hostconnection-start%22
[c] https://android-review.googlesource.com/c/device/generic/goldfish-opengl/+/761927
[d] https://developer.android.com/studio/releases/emulator
[e] https://github.com/talex5/wayland-proxy-virtwl
Signed-off-by: Gurchetan Singh <gurchetansingh@chromium.org>
Tested-by: Alyssa Ross <hi@alyssa.is>
Tested-by: Emmanouil Pitsidianakis <manos.pitsidianakis@linaro.org>
Tested-by: Akihiko Odaki <akihiko.odaki@daynix.com>
Reviewed-by: Emmanouil Pitsidianakis <manos.pitsidianakis@linaro.org>
Reviewed-by: Antonio Caggiano <quic_acaggian@quicinc.com>
Reviewed-by: Akihiko Odaki <akihiko.odaki@daynix.com>
Use VIRTIO_GPU_SHM_ID_HOST_VISIBLE as id for virtio-gpu.
Signed-off-by: Antonio Caggiano <antonio.caggiano@collabora.com>
Tested-by: Alyssa Ross <hi@alyssa.is>
Tested-by: Akihiko Odaki <akihiko.odaki@daynix.com>
Tested-by: Huang Rui <ray.huang@amd.com>
Acked-by: Huang Rui <ray.huang@amd.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Akihiko Odaki <akihiko.odaki@daynix.com>
Define a new capability type 'VIRTIO_PCI_CAP_SHARED_MEMORY_CFG' to allow
defining shared memory regions with sizes and offsets of 2^32 and more.
Multiple instances of the capability are allowed and distinguished
by a device-specific 'id'.
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Antonio Caggiano <antonio.caggiano@collabora.com>
Signed-off-by: Gurchetan Singh <gurchetansingh@chromium.org>
Tested-by: Alyssa Ross <hi@alyssa.is>
Tested-by: Huang Rui <ray.huang@amd.com>
Tested-by: Akihiko Odaki <akihiko.odaki@daynix.com>
Acked-by: Huang Rui <ray.huang@amd.com>
Reviewed-by: Gurchetan Singh <gurchetansingh@chromium.org>
Reviewed-by: Akihiko Odaki <akihiko.odaki@daynix.com>
Fixed four ufs-related coverity issues.
The coverity issues and fixes are as follows
1. CID 1519042: Security issue with the rand() function
Changed to use a fixed value (0xab) instead of rand() as
the value for testing
2. CID 1519043: Dereference after null check
Removed useless (redundant) null checks
3. CID 1519050: Out-of-bounds access issue
Fix to pass an array type variable to find_first_bit and
find_next_bit using DECLARE_BITMAP()
4. CID 1519051: Out-of-bounds read issue
Fix incorrect range check for lun
Fix coverity CID: 1519042 1519043 1519050 1519051
Signed-off-by: Jeuk Kim <jeuk20.kim@samsung.com>
The system test shutdown uses the 'loongarch_virt_pm' region.
We can use the write AcpiFadtData.sleep_clt register to realize the shutdown.
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Signed-off-by: Song Gao <gaosong@loongson.cn>
Message-ID: <20231012072351.1409344-1-gaosong@loongson.cn>
Signed-off-by: Song Gao <gaosong@loongson.cn>
The LoongArch 'virt' machine doesn't use its ISA I/O region.
If a ISA device were to be mapped there, there is no support
for ISA IRQ. Unlikely useful. Simply remove.
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Song Gao <gaosong@loongson.cn>
Message-Id: <20231010135342.40219-3-philmd@linaro.org>
Signed-off-by: Song Gao <gaosong@loongson.cn>
The LoongArch 'virt' machine doesn't use any ISA UART.
No need to build the device model, remove its Kconfig entry.
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Song Gao <gaosong@loongson.cn>
Message-Id: <20231010135342.40219-2-philmd@linaro.org>
Signed-off-by: Song Gao <gaosong@loongson.cn>
Passing the struct around explicitly makes the control-flow more
obvious.
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Song Gao <gaosong@loongson.cn>
Signed-off-by: Thomas Weißschuh <thomas@t-8ch.de>
Message-Id: <20231010-loongarch-loader-params-v2-1-512cc7959683@t-8ch.de>
Signed-off-by: Song Gao <gaosong@loongson.cn>
Let's mark the memslot alias memory regions as unmergable, such that
flatview and vhost won't merge adjacent memory region aliases and we can
atomically map/unmap individual aliases without affecting adjacent
alias memory regions.
This handles vhost and vfio in multiple-memslot mode correctly (which do
not support atomic memslot updates) and avoids the temporary removal of
large memslots, which can be an expensive operation. For example, vfio
might have to unpin + repin a lot of memory, which is undesired.
Message-ID: <20230926185738.277351-19-david@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Let's allow for marking memory regions unmergeable, to teach
flatview code and vhost to not merge adjacent aliases to the same memory
region into a larger memory section; instead, we want separate aliases to
stay separate such that we can atomically map/unmap aliases without
affecting other aliases.
This is desired for virtio-mem mapping device memory located on a RAM
memory region via multiple aliases into a memory region container,
resulting in separate memslots that can get (un)mapped atomically.
As an example with virtio-mem, the layout would look something like this:
[...]
0000000240000000-00000020bfffffff (prio 0, i/o): device-memory
0000000240000000-000000043fffffff (prio 0, i/o): virtio-mem
0000000240000000-000000027fffffff (prio 0, ram): alias memslot-0 @mem2 0000000000000000-000000003fffffff
0000000280000000-00000002bfffffff (prio 0, ram): alias memslot-1 @mem2 0000000040000000-000000007fffffff
00000002c0000000-00000002ffffffff (prio 0, ram): alias memslot-2 @mem2 0000000080000000-00000000bfffffff
[...]
Without unmergable memory regions, all three memslots would get merged into
a single memory section. For example, when mapping another alias (e.g.,
virtio-mem-memslot-3) or when unmapping any of the mapped aliases,
memory listeners will first get notified about the removal of the big
memory section to then get notified about re-adding of the new
(differently merged) memory section(s).
In an ideal world, memory listeners would be able to deal with that
atomically, like KVM nowadays does. However, (a) supporting this for other
memory listeners (vhost-user, vfio) is fairly hard: temporary removal
can result in all kinds of issues on concurrent access to guest memory;
and (b) this handling is undesired, because temporarily removing+readding
can consume quite some time on bigger memslots and is not efficient
(e.g., vfio unpinning and repinning pages ...).
Let's allow for marking a memory region unmergeable, such that we
can atomically (un)map aliases to the same memory region, similar to
(un)mapping individual DIMMs.
Similarly, teach vhost code to not redo what flatview core stopped doing:
don't merge such sections. Merging in vhost code is really only relevant
for handling random holes in boot memory where; without this merging,
the vhost-user backend wouldn't be able to mmap() some boot memory
backed on hugetlb.
We'll use this for virtio-mem next.
Message-ID: <20230926185738.277351-18-david@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Having large virtio-mem devices that only expose little memory to a VM
is currently a problem: we map the whole sparse memory region into the
guest using a single memslot, resulting in one gigantic memslot in KVM.
KVM allocates metadata for the whole memslot, which can result in quite
some memory waste.
Assuming we have a 1 TiB virtio-mem device and only expose little (e.g.,
1 GiB) memory, we would create a single 1 TiB memslot and KVM has to
allocate metadata for that 1 TiB memslot: on x86, this implies allocating
a significant amount of memory for metadata:
(1) RMAP: 8 bytes per 4 KiB, 8 bytes per 2 MiB, 8 bytes per 1 GiB
-> For 1 TiB: 2147483648 + 4194304 + 8192 = ~ 2 GiB (0.2 %)
With the TDP MMU (cat /sys/module/kvm/parameters/tdp_mmu) this gets
allocated lazily when required for nested VMs
(2) gfn_track: 2 bytes per 4 KiB
-> For 1 TiB: 536870912 = ~512 MiB (0.05 %)
(3) lpage_info: 4 bytes per 2 MiB, 4 bytes per 1 GiB
-> For 1 TiB: 2097152 + 4096 = ~2 MiB (0.0002 %)
(4) 2x dirty bitmaps for tracking: 2x 1 bit per 4 KiB page
-> For 1 TiB: 536870912 = 64 MiB (0.006 %)
So we primarily care about (1) and (2). The bad thing is, that the
memory consumption *doubles* once SMM is enabled, because we create the
memslot once for !SMM and once for SMM.
Having a 1 TiB memslot without the TDP MMU consumes around:
* With SMM: 5 GiB
* Without SMM: 2.5 GiB
Having a 1 TiB memslot with the TDP MMU consumes around:
* With SMM: 1 GiB
* Without SMM: 512 MiB
... and that's really something we want to optimize, to be able to just
start a VM with small boot memory (e.g., 4 GiB) and a virtio-mem device
that can grow very large (e.g., 1 TiB).
Consequently, using multiple memslots and only mapping the memslots we
really need can significantly reduce memory waste and speed up
memslot-related operations. Let's expose the sparse RAM memory region using
multiple memslots, mapping only the memslots we currently need into our
device memory region container.
The feature can be enabled using "dynamic-memslots=on" and requires
"unplugged-inaccessible=on", which is nowadays the default.
Once enabled, we'll auto-detect the number of memslots to use based on the
memslot limit provided by the core. We'll use at most 1 memslot per
gigabyte. Note that our global limit of memslots accross all memory devices
is currently set to 256: even with multiple large virtio-mem devices,
we'd still have a sane limit on the number of memslots used.
The default is to not dynamically map memslot for now
("dynamic-memslots=off"). The optimization must be enabled manually,
because some vhost setups (e.g., hotplug of vhost-user devices) might be
problematic until we support more memslots especially in vhost-user backends.
Note that "dynamic-memslots=on" is just a hint that multiple memslots
*may* be used for internal optimizations, not that multiple memslots
*must* be used. The actual number of memslots that are used is an
internal detail: for example, once memslot metadata is no longer an
issue, we could simply stop optimizing for that. Migration source and
destination can differ on the setting of "dynamic-memslots".
Message-ID: <20230926185738.277351-17-david@redhat.com>
Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
It's cleaner and future-proof to just have other state that depends on the
bitmap state to be updated as soon as possible when restoring the bitmap.
So factor out informing RamDiscardListener into a functon and call it in
case of early migration right after we restored the bitmap.
Message-ID: <20230926185738.277351-16-david@redhat.com>
Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Let's prepare for a user that has to modify the VirtIOMEM device state.
Message-ID: <20230926185738.277351-15-david@redhat.com>
Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
We want to support memory devices that can automatically decide how many
memslots they will use. In the worst case, they have to use a single
memslot.
The target use cases are virtio-mem and the hyper-v balloon.
Let's calculate a reasonable limit such a memory device may use, and
instruct the device to make a decision based on that limit. Use a simple
heuristic that considers:
* A memslot soft-limit for all memory devices of 256; also, to not
consume too many memslots -- which could harm performance.
* Actually still free and unreserved memslots
* The percentage of the remaining device memory region that memory device
will occupy.
Further, while we properly check before plugging a memory device whether
there still is are free memslots, we have other memslot consumers (such as
boot memory, PCI BARs) that don't perform any checks and might dynamically
consume memslots without any prior reservation. So we might succeed in
plugging a memory device, but once we dynamically map a PCI BAR we would
be in trouble. Doing accounting / reservation / checks for all such
users is problematic (e.g., sometimes we might temporarily split boot
memory into two memslots, triggered by the BIOS).
We use the historic magic memslot number of 509 as orientation to when
supporting 256 memory devices -> memslots (leaving 253 for boot memory and
other devices) has been proven to work reliable. We'll fallback to
suggesting a single memslot if we don't have at least 509 total memslots.
Plugging vhost devices with less than 509 memslots available while we
have memory devices plugged that consume multiple memslots due to
automatic decisions can be problematic. Most configurations might just fail
due to "limit < used + reserved", however, it can also happen that these
memory devices would suddenly consume memslots that would actually be
required by other memslot consumers (boot, PCI BARs) later. Note that this
has always been sketchy with vhost devices that support only a small number
of memslots; but we don't want to make it any worse.So let's keep it simple
and simply reject plugging such vhost devices in such a configuration.
Eventually, all vhost devices that want to be fully compatible with such
memory devices should support a decent number of memslots (>= 509).
Message-ID: <20230926185738.277351-13-david@redhat.com>
Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Let's add vhost_get_max_memslots().
Message-ID: <20230926185738.277351-12-david@redhat.com>
Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
We want to support memory devices that have a dynamically managed memory
region container as device memory region. This device memory region maps
multiple RAM memory subregions (e.g., aliases to the same RAM memory
region), whereby these subregions can be (un)mapped on demand.
Each RAM subregion will consume a memslot in KVM and vhost, resulting in
such a new device consuming memslots dynamically, and initially usually
0. We already track the number of used vs. required memslots for all
memslots. From that, we can derive the number of reserved memslots that
must not be used otherwise.
The target use case is virtio-mem and the hyper-v balloon, which will
dynamically map aliases to RAM memory region into their device memory
region container.
Properly document what's supported and what's not and extend the vhost
memslot check accordingly.
Message-ID: <20230926185738.277351-10-david@redhat.com>
Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Let's track how many memslots are required by plugged memory devices and
how many are currently actually getting used by plugged memory
devices.
"required - used" is the number of reserved memslots. For now, the number
of used and required memslots is always equal, and there are no
reservations. This is a preparation for memory devices that want to
dynamically consume memslots after initially specifying how many they
require -- where we'll end up with reserved memslots.
To track the number of used memslots, create a new address space for
our device memory and register a memory listener (add/remove) for that
address space.
Message-ID: <20230926185738.277351-9-david@redhat.com>
Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
We want to support memory devices that have a memory region container as
device memory region that maps multiple RAM memory regions. Let's start
by supporting memory devices that statically map multiple RAM memory
regions and, thereby, consume multiple memslots.
We already have one device that uses a container as device memory region:
NVDIMMs. However, a NVDIMM always ends up consuming exactly one memslot.
Let's add support for that by asking the memory device via a new
callback how many memslots it requires.
Message-ID: <20230926185738.277351-7-david@redhat.com>
Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>