Commit Graph

1574 Commits

Author SHA1 Message Date
Maciej S. Szmigiero
2d7f108186 Revert "hw/virtio/virtio-pmem: Replace impossible check by assertion"
This reverts commit 5960f254db since the
previous commit made this situation possible again.

Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
2023-11-06 13:53:59 +01:00
Steve Sistare
89415796f6 cpr: relax vhost migration blockers
vhost blocks migration if logging is not supported to track dirty
memory, and vhost-user blocks it if the log cannot be saved to a shm fd.

vhost-vdpa blocks migration if both hosts do not support all the device's
features using a shadow VQ, for tracking requests and dirty memory.

vhost-scsi blocks migration if storage cannot be shared across hosts,
or if state cannot be migrated.

None of these conditions apply if the old and new qemu processes do
not run concurrently, and if new qemu starts on the same host as old,
which is the case for cpr.

Narrow the scope of these blockers so they only apply to normal mode.
They will not block cpr modes when they are added in subsequent patches.

No functional change until a new mode is added.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>
Message-ID: <1698263069-406971-5-git-send-email-steven.sistare@oracle.com>
2023-11-01 16:13:59 +01:00
Juan Quintela
99b16e8ee4 migration: Use vmstate_register_any()
This are the easiest cases, where we were already using
VMSTATE_INSTANCE_ID_ANY.

Reviewed-by: Stefan Berger <stefanb@linux.ibm.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>
Message-ID: <20231020090731.28701-3-quintela@redhat.com>
2023-11-01 16:13:58 +01:00
Stefan Hajnoczi
84d61e5f36 virtio: use defer_call() in virtio_irqfd_notify()
virtio-blk and virtio-scsi invoke virtio_irqfd_notify() to send Used
Buffer Notifications from an IOThread. This involves an eventfd
write(2) syscall. Calling this repeatedly when completing multiple I/O
requests in a row is wasteful.

Use the defer_call() API to batch together virtio_irqfd_notify() calls
made during thread pool (aio=threads), Linux AIO (aio=native), and
io_uring (aio=io_uring) completion processing.

Behavior is unchanged for emulated devices that do not use
defer_call_begin()/defer_call_end() since defer_call() immediately
invokes the callback when called outside a
defer_call_begin()/defer_call_end() region.

fio rw=randread bs=4k iodepth=64 numjobs=8 IOPS increases by ~9% with a
single IOThread and 8 vCPUs. iodepth=1 decreases by ~1% but this could
be noise. Detailed performance data and configuration specifics are
available here:
https://gitlab.com/stefanha/virt-playbooks/-/tree/blk_io_plug-irqfd

This duplicates the BH that virtio-blk uses for batching. The next
commit will remove it.

Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-ID: <20230913200045.1024233-4-stefanha@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2023-10-31 15:42:14 +01:00
Paolo Bonzini
126e7f7803 kvm: require KVM_CAP_IOEVENTFD and KVM_CAP_IOEVENTFD_ANY_LENGTH
KVM_CAP_IOEVENTFD_ANY_LENGTH was added in Linux 4.4, released in 2016.
Assume that it is present.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-10-25 17:35:15 +02:00
Paolo Bonzini
5d9ec1f4c7 kvm: assume that many ioeventfds can be created
NR_IOBUS_DEVS was increased to 200 in Linux 2.6.34.  By Linux 3.5 it had
increased to 1000 and later ioeventfds were changed to not count against
the limit.  But the earlier limit of 200 would already be enough for
kvm_check_many_ioeventfds() to be true, so remove the check.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-10-25 17:35:15 +02:00
Stefan Hajnoczi
1b4a5a20da virtio,pc,pci: features, cleanups
infrastructure for vhost-vdpa shadow work
 piix south bridge rework
 reconnect for vhost-user-scsi
 dummy ACPI QTG DSM for cxl
 
 tests, cleanups, fixes all over the place
 
 Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQFDBAABCAAtFiEEXQn9CHHI+FuUyooNKB8NuNKNVGkFAmU06PMPHG1zdEByZWRo
 YXQuY29tAAoJECgfDbjSjVRpNIsH/0DlKti86VZLJ6PbNqsnKxoK2gg05TbEhPZU
 pQ+RPDaCHpFBsLC5qsoMJwvaEQFe0e49ZFemw7bXRzBxgmbbNnZ9ArCIPqT+rvQd
 7UBmyC+kacVyybZatq69aK2BHKFtiIRlT78d9Izgtjmp8V7oyKoz14Esh8wkE+FT
 ypHUa70Addi6alNm6BVkm7bxZxi0Wrmf3THqF8ViYvufzHKl7JR5e17fKWEG0BqV
 9W7AeHMnzJ7jkTvBGUw7g5EbzFn7hPLTbO4G/VW97k0puS4WRX5aIMkVhUazsRIa
 zDOuXCCskUWuRapiCwY0E4g7cCaT8/JR6JjjBaTgkjJgvo5Y8Eg=
 =ILek
 -----END PGP SIGNATURE-----

Merge tag 'for_upstream' of https://git.kernel.org/pub/scm/virt/kvm/mst/qemu into staging

virtio,pc,pci: features, cleanups

infrastructure for vhost-vdpa shadow work
piix south bridge rework
reconnect for vhost-user-scsi
dummy ACPI QTG DSM for cxl

tests, cleanups, fixes all over the place

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

# -----BEGIN PGP SIGNATURE-----
#
# iQFDBAABCAAtFiEEXQn9CHHI+FuUyooNKB8NuNKNVGkFAmU06PMPHG1zdEByZWRo
# YXQuY29tAAoJECgfDbjSjVRpNIsH/0DlKti86VZLJ6PbNqsnKxoK2gg05TbEhPZU
# pQ+RPDaCHpFBsLC5qsoMJwvaEQFe0e49ZFemw7bXRzBxgmbbNnZ9ArCIPqT+rvQd
# 7UBmyC+kacVyybZatq69aK2BHKFtiIRlT78d9Izgtjmp8V7oyKoz14Esh8wkE+FT
# ypHUa70Addi6alNm6BVkm7bxZxi0Wrmf3THqF8ViYvufzHKl7JR5e17fKWEG0BqV
# 9W7AeHMnzJ7jkTvBGUw7g5EbzFn7hPLTbO4G/VW97k0puS4WRX5aIMkVhUazsRIa
# zDOuXCCskUWuRapiCwY0E4g7cCaT8/JR6JjjBaTgkjJgvo5Y8Eg=
# =ILek
# -----END PGP SIGNATURE-----
# gpg: Signature made Sun 22 Oct 2023 02:18:43 PDT
# gpg:                using RSA key 5D09FD0871C8F85B94CA8A0D281F0DB8D28D5469
# gpg:                issuer "mst@redhat.com"
# gpg: Good signature from "Michael S. Tsirkin <mst@kernel.org>" [full]
# gpg:                 aka "Michael S. Tsirkin <mst@redhat.com>" [full]
# Primary key fingerprint: 0270 606B 6F3C DF3D 0B17  0970 C350 3912 AFBE 8E67
#      Subkey fingerprint: 5D09 FD08 71C8 F85B 94CA  8A0D 281F 0DB8 D28D 5469

* tag 'for_upstream' of https://git.kernel.org/pub/scm/virt/kvm/mst/qemu: (62 commits)
  intel-iommu: Report interrupt remapping faults, fix return value
  MAINTAINERS: Add include/hw/intc/i8259.h to the PC chip section
  vhost-user: Fix protocol feature bit conflict
  tests/acpi: Update DSDT.cxl with QTG DSM
  hw/cxl: Add QTG _DSM support for ACPI0017 device
  tests/acpi: Allow update of DSDT.cxl
  hw/i386/cxl: ensure maxram is greater than ram size for calculating cxl range
  vhost-user: fix lost reconnect
  vhost-user-scsi: start vhost when guest kicks
  vhost-user-scsi: support reconnect to backend
  vhost: move and rename the conn retry times
  vhost-user-common: send get_inflight_fd once
  hw/i386/pc_piix: Make PIIX4 south bridge usable in PC machine
  hw/isa/piix: Implement multi-process QEMU support also for PIIX4
  hw/isa/piix: Resolve duplicate code regarding PCI interrupt wiring
  hw/isa/piix: Reuse PIIX3's PCI interrupt triggering in PIIX4
  hw/isa/piix: Rename functions to be shared for PCI interrupt triggering
  hw/isa/piix: Reuse PIIX3 base class' realize method in PIIX4
  hw/isa/piix: Share PIIX3's base class with PIIX4
  hw/isa/piix: Harmonize names of reset control memory regions
  ...

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2023-10-23 14:45:29 -07:00
Li Feng
f02a4b8e64 vhost-user: fix lost reconnect
When the vhost-user is reconnecting to the backend, and if the vhost-user fails
at the get_features in vhost_dev_init(), then the reconnect will fail
and it will not be retriggered forever.

The reason is:
When the vhost-user fails at get_features, the vhost_dev_cleanup will be called
immediately.

vhost_dev_cleanup calls 'memset(hdev, 0, sizeof(struct vhost_dev))'.

The reconnect path is:
vhost_user_blk_event
   vhost_user_async_close(.. vhost_user_blk_disconnect ..)
     qemu_chr_fe_set_handlers <----- clear the notifier callback
       schedule vhost_user_async_close_bh

The vhost->vdev is null, so the vhost_user_blk_disconnect will not be
called, then the event fd callback will not be reinstalled.

All vhost-user devices have this issue, including vhost-user-blk/scsi.

With this patch, if the vdev->vdev is null, the fd callback will still
be reinstalled.

Fixes: 71e076a07d ("hw/virtio: generalise CHR_EVENT_CLOSED handling")

Signed-off-by: Li Feng <fengli@smartx.com>
Reviewed-by: Raphael Norwitz <raphael.norwitz@nutanix.com>
Message-Id: <20231009044735.941655-6-fengli@smartx.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2023-10-22 05:18:17 -04:00
Li Feng
4dfcc09f48 vhost: move and rename the conn retry times
Multiple devices need this macro, move it to a common header.

Signed-off-by: Li Feng <fengli@smartx.com>
Reviewed-by: Raphael Norwitz <raphael.norwitz@nutanix.com>
Message-Id: <20231009044735.941655-3-fengli@smartx.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2023-10-22 05:18:17 -04:00
Stefan Hajnoczi
c0c4f14729 virtio: call ->vhost_reset_device() during reset
vhost-user-scsi has a VirtioDeviceClass->reset() function that calls
->vhost_reset_device(). The other vhost devices don't notify the vhost
device upon reset.

Stateful vhost devices may need to handle device reset in order to free
resources or prevent stale device state from interfering after reset.

Call ->vhost_device_reset() from virtio_reset() so that that vhost
devices are notified of device reset.

This patch affects behavior as follows:
- vhost-kernel: No change in behavior since ->vhost_reset_device() is
  not implemented.
- vhost-user: back-ends that negotiate
  VHOST_USER_PROTOCOL_F_RESET_DEVICE now receive a
  VHOST_USER_DEVICE_RESET message upon device reset. Otherwise there is
  no change in behavior. DPDK, SPDK, libvhost-user, and the
  vhost-user-backend crate do not negotiate
  VHOST_USER_PROTOCOL_F_RESET_DEVICE automatically.
- vhost-vdpa: an extra SET_STATUS 0 call is made during device reset.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20231004014532.1228637-4-stefanha@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Raphael Norwitz <raphael.norwitz@nutanix.com>
Reviewed-by: Hanna Czenczek <hreitz@redhat.com>
2023-10-22 05:18:16 -04:00
Stefan Hajnoczi
e6383293eb vhost-backend: remove vhost_kernel_reset_device()
vhost_kernel_reset_device() invokes RESET_OWNER, which disassociates the
owner process from the device. The device is left non-operational since
SET_OWNER is only called once during startup in vhost_dev_init().

vhost_kernel_reset_device() is never called so this latent bug never
appears. Get rid of vhost_kernel_reset_device() for now. If someone
needs it in the future they'll need to implement it correctly.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20231004014532.1228637-3-stefanha@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Raphael Norwitz <raphael.norwitz@nutanix.com>
Reviewed-by: Hanna Czenczek <hreitz@redhat.com>
2023-10-22 05:18:16 -04:00
Stefan Hajnoczi
22d2464f7e vhost-user: do not send RESET_OWNER on device reset
The VHOST_USER_RESET_OWNER message is deprecated in the spec:

   This is no longer used. Used to be sent to request disabling all
   rings, but some back-ends interpreted it to also discard connection
   state (this interpretation would lead to bugs).  It is recommended
   that back-ends either ignore this message, or use it to disable all
   rings.

The only caller of vhost_user_reset_device() is vhost_user_scsi_reset().
It checks that F_RESET_DEVICE was negotiated before calling it:

  static void vhost_user_scsi_reset(VirtIODevice *vdev)
  {
      VHostSCSICommon *vsc = VHOST_SCSI_COMMON(vdev);
      struct vhost_dev *dev = &vsc->dev;

      /*
       * Historically, reset was not implemented so only reset devices
       * that are expecting it.
       */
      if (!virtio_has_feature(dev->protocol_features,
                              VHOST_USER_PROTOCOL_F_RESET_DEVICE)) {
          return;
      }

      if (dev->vhost_ops->vhost_reset_device) {
          dev->vhost_ops->vhost_reset_device(dev);
      }
  }

Therefore VHOST_USER_RESET_OWNER is actually never sent by
vhost_user_reset_device(). Remove the dead code. This effectively moves
the vhost-user protocol specific code from vhost-user-scsi.c into
vhost-user.c where it belongs.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20231004014532.1228637-2-stefanha@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Raphael Norwitz <raphael.norwitz@nutanix.com>
2023-10-22 05:18:16 -04:00
Laszlo Ersek
d7dc0682f5 vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously
(1) The virtio-1.2 specification
<http://docs.oasis-open.org/virtio/virtio/v1.2/virtio-v1.2.html> writes:

> 3     General Initialization And Device Operation
> 3.1   Device Initialization
> 3.1.1 Driver Requirements: Device Initialization
>
> [...]
>
> 7. Perform device-specific setup, including discovery of virtqueues for
>    the device, optional per-bus setup, reading and possibly writing the
>    device’s virtio configuration space, and population of virtqueues.
>
> 8. Set the DRIVER_OK status bit. At this point the device is “live”.

and

> 4         Virtio Transport Options
> 4.1       Virtio Over PCI Bus
> 4.1.4     Virtio Structure PCI Capabilities
> 4.1.4.3   Common configuration structure layout
> 4.1.4.3.2 Driver Requirements: Common configuration structure layout
>
> [...]
>
> The driver MUST configure the other virtqueue fields before enabling the
> virtqueue with queue_enable.
>
> [...]

(The same statements are present in virtio-1.0 identically, at
<http://docs.oasis-open.org/virtio/virtio/v1.0/virtio-v1.0.html>.)

These together mean that the following sub-sequence of steps is valid for
a virtio-1.0 guest driver:

(1.1) set "queue_enable" for the needed queues as the final part of device
initialization step (7),

(1.2) set DRIVER_OK in step (8),

(1.3) immediately start sending virtio requests to the device.

(2) When vhost-user is enabled, and the VHOST_USER_F_PROTOCOL_FEATURES
special virtio feature is negotiated, then virtio rings start in disabled
state, according to
<https://qemu-project.gitlab.io/qemu/interop/vhost-user.html#ring-states>.
In this case, explicit VHOST_USER_SET_VRING_ENABLE messages are needed for
enabling vrings.

Therefore setting "queue_enable" from the guest (1.1) -- which is
technically "buffered" on the QEMU side until the guest sets DRIVER_OK
(1.2) -- is a *control plane* operation, which -- after (1.2) -- travels
from the guest through QEMU to the vhost-user backend, using a unix domain
socket.

Whereas sending a virtio request (1.3) is a *data plane* operation, which
evades QEMU -- it travels from guest to the vhost-user backend via
eventfd.

This means that operations ((1.1) + (1.2)) and (1.3) travel through
different channels, and their relative order can be reversed, as perceived
by the vhost-user backend.

That's exactly what happens when OVMF's virtiofs driver (VirtioFsDxe) runs
against the Rust-language virtiofsd version 1.7.2. (Which uses version
0.10.1 of the vhost-user-backend crate, and version 0.8.1 of the vhost
crate.)

Namely, when VirtioFsDxe binds a virtiofs device, it goes through the
device initialization steps (i.e., control plane operations), and
immediately sends a FUSE_INIT request too (i.e., performs a data plane
operation). In the Rust-language virtiofsd, this creates a race between
two components that run *concurrently*, i.e., in different threads or
processes:

- Control plane, handling vhost-user protocol messages:

  The "VhostUserSlaveReqHandlerMut::set_vring_enable" method
  [crates/vhost-user-backend/src/handler.rs] handles
  VHOST_USER_SET_VRING_ENABLE messages, and updates each vring's "enabled"
  flag according to the message processed.

- Data plane, handling virtio / FUSE requests:

  The "VringEpollHandler::handle_event" method
  [crates/vhost-user-backend/src/event_loop.rs] handles the incoming
  virtio / FUSE request, consuming the virtio kick at the same time. If
  the vring's "enabled" flag is set, the virtio / FUSE request is
  processed genuinely. If the vring's "enabled" flag is clear, then the
  virtio / FUSE request is discarded.

Note that OVMF enables the queue *first*, and sends FUSE_INIT *second*.
However, if the data plane processor in virtiofsd wins the race, then it
sees the FUSE_INIT *before* the control plane processor took notice of
VHOST_USER_SET_VRING_ENABLE and green-lit the queue for the data plane
processor. Therefore the latter drops FUSE_INIT on the floor, and goes
back to waiting for further virtio / FUSE requests with epoll_wait.
Meanwhile OVMF is stuck waiting for the FUSET_INIT response -- a deadlock.

The deadlock is not deterministic. OVMF hangs infrequently during first
boot. However, OVMF hangs almost certainly during reboots from the UEFI
shell.

The race can be "reliably masked" by inserting a very small delay -- a
single debug message -- at the top of "VringEpollHandler::handle_event",
i.e., just before the data plane processor checks the "enabled" field of
the vring. That delay suffices for the control plane processor to act upon
VHOST_USER_SET_VRING_ENABLE.

We can deterministically prevent the race in QEMU, by blocking OVMF inside
step (1.2) -- i.e., in the write to the device status register that
"unleashes" queue enablement -- until VHOST_USER_SET_VRING_ENABLE actually
*completes*. That way OVMF's VCPU cannot advance to the FUSE_INIT
submission before virtiofsd's control plane processor takes notice of the
queue being enabled.

Wait for VHOST_USER_SET_VRING_ENABLE completion by:

- setting the NEED_REPLY flag on VHOST_USER_SET_VRING_ENABLE, and waiting
  for the reply, if the VHOST_USER_PROTOCOL_F_REPLY_ACK vhost-user feature
  has been negotiated, or

- performing a separate VHOST_USER_GET_FEATURES *exchange*, which requires
  a backend response regardless of VHOST_USER_PROTOCOL_F_REPLY_ACK.

Cc: "Michael S. Tsirkin" <mst@redhat.com> (supporter:vhost)
Cc: Eugenio Perez Martin <eperezma@redhat.com>
Cc: German Maglione <gmaglione@redhat.com>
Cc: Liu Jiang <gerry@linux.alibaba.com>
Cc: Sergio Lopez Pascual <slp@redhat.com>
Cc: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Laszlo Ersek <lersek@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Tested-by: Albert Esteve <aesteve@redhat.com>
[lersek@redhat.com: work Eugenio's explanation into the commit message,
 about QEMU containing step (1.1) until step (1.2)]
Reviewed-by: Eugenio Pérez <eperezma@redhat.com>
Message-Id: <20231002203221.17241-8-lersek@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2023-10-22 05:18:16 -04:00
Laszlo Ersek
75b6b6da21 vhost-user: allow "vhost_set_vring" to wait for a reply
The "vhost_set_vring" function already centralizes the common parts of
"vhost_user_set_vring_num", "vhost_user_set_vring_base" and
"vhost_user_set_vring_enable". We'll want to allow some of those callers
to wait for a reply.

Therefore, rebase "vhost_set_vring" from just "vhost_user_write" to
"vhost_user_write_sync", exposing the "wait_for_reply" parameter.

This is purely refactoring -- there is no observable change. That's
because:

- all three callers pass in "false" for "wait_for_reply", which disables
  all logic in "vhost_user_write_sync" except the call to
  "vhost_user_write";

- the fds=NULL and fd_num=0 arguments of the original "vhost_user_write"
  call inside "vhost_set_vring" are hard-coded within
  "vhost_user_write_sync".

Cc: "Michael S. Tsirkin" <mst@redhat.com> (supporter:vhost)
Cc: Eugenio Perez Martin <eperezma@redhat.com>
Cc: German Maglione <gmaglione@redhat.com>
Cc: Liu Jiang <gerry@linux.alibaba.com>
Cc: Sergio Lopez Pascual <slp@redhat.com>
Cc: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Laszlo Ersek <lersek@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Tested-by: Albert Esteve <aesteve@redhat.com>
Reviewed-by: Eugenio Pérez <eperezma@redhat.com>
Message-Id: <20231002203221.17241-7-lersek@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2023-10-22 05:18:16 -04:00
Laszlo Ersek
df3b2abc32 vhost-user: hoist "write_sync", "get_features", "get_u64"
In order to avoid a forward-declaration for "vhost_user_write_sync" in a
subsequent patch, hoist "vhost_user_write_sync" ->
"vhost_user_get_features" -> "vhost_user_get_u64" just above
"vhost_set_vring".

This is purely code movement -- no observable change.

Cc: "Michael S. Tsirkin" <mst@redhat.com> (supporter:vhost)
Cc: Eugenio Perez Martin <eperezma@redhat.com>
Cc: German Maglione <gmaglione@redhat.com>
Cc: Liu Jiang <gerry@linux.alibaba.com>
Cc: Sergio Lopez Pascual <slp@redhat.com>
Cc: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Laszlo Ersek <lersek@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Tested-by: Albert Esteve <aesteve@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Eugenio Pérez <eperezma@redhat.com>
Message-Id: <20231002203221.17241-6-lersek@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2023-10-22 05:18:16 -04:00
Laszlo Ersek
99ad9ec89d vhost-user: flatten "enforce_reply" into "vhost_user_write_sync"
At this point, only "vhost_user_write_sync" calls "enforce_reply"; embed
the latter into the former.

This is purely refactoring -- no observable change.

Cc: "Michael S. Tsirkin" <mst@redhat.com> (supporter:vhost)
Cc: Eugenio Perez Martin <eperezma@redhat.com>
Cc: German Maglione <gmaglione@redhat.com>
Cc: Liu Jiang <gerry@linux.alibaba.com>
Cc: Sergio Lopez Pascual <slp@redhat.com>
Cc: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Laszlo Ersek <lersek@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Tested-by: Albert Esteve <aesteve@redhat.com>
Reviewed-by: Eugenio Pérez <eperezma@redhat.com>
Message-Id: <20231002203221.17241-5-lersek@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2023-10-22 05:18:16 -04:00
Laszlo Ersek
54ae36822f vhost-user: factor out "vhost_user_write_sync"
The tails of the "vhost_user_set_vring_addr" and "vhost_user_set_u64"
functions are now byte-for-byte identical. Factor the common tail out to a
new function called "vhost_user_write_sync".

This is purely refactoring -- no observable change.

Cc: "Michael S. Tsirkin" <mst@redhat.com> (supporter:vhost)
Cc: Eugenio Perez Martin <eperezma@redhat.com>
Cc: German Maglione <gmaglione@redhat.com>
Cc: Liu Jiang <gerry@linux.alibaba.com>
Cc: Sergio Lopez Pascual <slp@redhat.com>
Cc: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Laszlo Ersek <lersek@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Tested-by: Albert Esteve <aesteve@redhat.com>
Reviewed-by: Eugenio Pérez <eperezma@redhat.com>
Message-Id: <20231002203221.17241-4-lersek@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2023-10-22 05:18:16 -04:00
Laszlo Ersek
ed0b3ebbae vhost-user: tighten "reply_supported" scope in "set_vring_addr"
In the vhost_user_set_vring_addr() function, we calculate
"reply_supported" unconditionally, even though we'll only need it if
"wait_for_reply" is also true.

Restrict the scope of "reply_supported" to the minimum.

This is purely refactoring -- no observable change.

Cc: "Michael S. Tsirkin" <mst@redhat.com> (supporter:vhost)
Cc: Eugenio Perez Martin <eperezma@redhat.com>
Cc: German Maglione <gmaglione@redhat.com>
Cc: Liu Jiang <gerry@linux.alibaba.com>
Cc: Sergio Lopez Pascual <slp@redhat.com>
Cc: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Laszlo Ersek <lersek@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Tested-by: Albert Esteve <aesteve@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Eugenio Pérez <eperezma@redhat.com>
Message-Id: <20231002203221.17241-3-lersek@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2023-10-22 05:18:16 -04:00
Laszlo Ersek
1428831981 vhost-user: strip superfluous whitespace
Cc: "Michael S. Tsirkin" <mst@redhat.com> (supporter:vhost)
Cc: Eugenio Perez Martin <eperezma@redhat.com>
Cc: German Maglione <gmaglione@redhat.com>
Cc: Liu Jiang <gerry@linux.alibaba.com>
Cc: Sergio Lopez Pascual <slp@redhat.com>
Cc: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Laszlo Ersek <lersek@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Tested-by: Albert Esteve <aesteve@redhat.com>
Reviewed-by: Eugenio Pérez <eperezma@redhat.com>
Message-Id: <20231002203221.17241-2-lersek@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2023-10-22 05:18:16 -04:00
Stefan Hajnoczi
384dbdda94 Migration Pull request (20231020)
In this pull request:
 - disable analyze-migration on s390x (thomas)
 - Fix parse_ramblock() (peter)
 - start merging live update (steve)
 - migration-test support for using several binaries (fabiano)
 - multifd cleanups (fabiano)
 
 CI: https://gitlab.com/juan.quintela/qemu/-/pipelines/1042492801
 
 Please apply.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEGJn/jt6/WMzuA0uC9IfvGFhy1yMFAmUyJMsACgkQ9IfvGFhy
 1yP0AQ/9ELr6VJ0crqzfGm2dy2emnZMaQhDtzR4Kk4ciZF6U+GiATdGN9hK499mP
 6WzRIjtSzwD8YZvhLfegxIVTGcEttaM93uXFPznWrk7gwny6QTvuA4qtcRYejTSl
 wE4GQQOsSrukVCUlqcZtY/t2aphVWQzlx8RRJE3XGaodT1gNLMjd+xp34NbbOoR3
 32ixpSPUCOGvCd7hb+HG7pEzk+905Pn2URvbdiP71uqhgJZdjMAv8ehSGD3kufdg
 FMrZyIEq7Eguk2bO1+7ZiVuIafXXRloIVqi1ENmjIyNDa/Rlv2CA85u0CfgeP6qY
 Ttj+MZaz8PIhf97IJEILFn+NDXYgsGqEFl//uNbLuTeCpmr9NPhBzLw8CvCefPrR
 rwBs3J+QbDHWX9EYjk6QZ9QfYJy/DXkl0KfdNtQy9Wf+0o1mHDn5/y3s782T24aJ
 lGo0ph4VJLBNOx58rpgmoO5prRIjqzF5w4j8pCSeGUC4Bcub5af4TufYrwaf+cps
 iIbNFx79dLXBlfkKIn7i9RLpz7641Fs/iTQ/MZh1eyvX++UDXAPWnbd4GDYOEewA
 U3WKsTs/ipIbY8nqaO4j1VMzADPUfetBXznBw60xsZcfjynFJsPV6/F/0OpUupdv
 qPEY4LZ2uwP4K7AlzrUzUn2f3BKrspL0ObX0qTn0WJ8WX5Jp/YA=
 =m+uB
 -----END PGP SIGNATURE-----

Merge tag 'migration-20231020-pull-request' of https://gitlab.com/juan.quintela/qemu into staging

Migration Pull request (20231020)

In this pull request:
- disable analyze-migration on s390x (thomas)
- Fix parse_ramblock() (peter)
- start merging live update (steve)
- migration-test support for using several binaries (fabiano)
- multifd cleanups (fabiano)

CI: https://gitlab.com/juan.quintela/qemu/-/pipelines/1042492801

Please apply.

# -----BEGIN PGP SIGNATURE-----
#
# iQIzBAABCAAdFiEEGJn/jt6/WMzuA0uC9IfvGFhy1yMFAmUyJMsACgkQ9IfvGFhy
# 1yP0AQ/9ELr6VJ0crqzfGm2dy2emnZMaQhDtzR4Kk4ciZF6U+GiATdGN9hK499mP
# 6WzRIjtSzwD8YZvhLfegxIVTGcEttaM93uXFPznWrk7gwny6QTvuA4qtcRYejTSl
# wE4GQQOsSrukVCUlqcZtY/t2aphVWQzlx8RRJE3XGaodT1gNLMjd+xp34NbbOoR3
# 32ixpSPUCOGvCd7hb+HG7pEzk+905Pn2URvbdiP71uqhgJZdjMAv8ehSGD3kufdg
# FMrZyIEq7Eguk2bO1+7ZiVuIafXXRloIVqi1ENmjIyNDa/Rlv2CA85u0CfgeP6qY
# Ttj+MZaz8PIhf97IJEILFn+NDXYgsGqEFl//uNbLuTeCpmr9NPhBzLw8CvCefPrR
# rwBs3J+QbDHWX9EYjk6QZ9QfYJy/DXkl0KfdNtQy9Wf+0o1mHDn5/y3s782T24aJ
# lGo0ph4VJLBNOx58rpgmoO5prRIjqzF5w4j8pCSeGUC4Bcub5af4TufYrwaf+cps
# iIbNFx79dLXBlfkKIn7i9RLpz7641Fs/iTQ/MZh1eyvX++UDXAPWnbd4GDYOEewA
# U3WKsTs/ipIbY8nqaO4j1VMzADPUfetBXznBw60xsZcfjynFJsPV6/F/0OpUupdv
# qPEY4LZ2uwP4K7AlzrUzUn2f3BKrspL0ObX0qTn0WJ8WX5Jp/YA=
# =m+uB
# -----END PGP SIGNATURE-----
# gpg: Signature made Thu 19 Oct 2023 23:57:15 PDT
# gpg:                using RSA key 1899FF8EDEBF58CCEE034B82F487EF185872D723
# gpg: Good signature from "Juan Quintela <quintela@redhat.com>" [full]
# gpg:                 aka "Juan Quintela <quintela@trasno.org>" [full]
# Primary key fingerprint: 1899 FF8E DEBF 58CC EE03  4B82 F487 EF18 5872 D723

* tag 'migration-20231020-pull-request' of https://gitlab.com/juan.quintela/qemu:
  tests/qtest: Don't print messages from query instances
  tests/qtest/migration: Allow user to specify a machine type
  tests/qtest/migration: Support more than one QEMU binary
  tests/qtest/migration: Set q35 as the default machine for x86_86
  tests/qtest/migration: Specify the geometry of the bootsector
  tests/qtest/migration: Define a machine for all architectures
  tests/qtest/migration: Introduce find_common_machine_version
  tests/qtest: Introduce qtest_resolve_machine_alias
  tests/qtest: Introduce qtest_has_machine_with_env
  tests/qtest: Allow qtest_get_machines to use an alternate QEMU binary
  tests/qtest: Introduce qtest_init_with_env
  tests/qtest: Allow qtest_qemu_binary to use a custom environment variable
  migration/multifd: Stop checking p->quit in multifd_send_thread
  migration: simplify notifiers
  migration: Fix parse_ramblock() on overwritten retvals
  migration: simplify blockers
  tests/qtest/migration-test: Disable the analyze-migration.py test on s390x

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2023-10-20 06:46:53 -07:00
Steve Sistare
c8a7fc5179 migration: simplify blockers
Modify migrate_add_blocker and migrate_del_blocker to take an Error **
reason.  This allows migration to own the Error object, so that if
an error occurs in migrate_add_blocker, migration code can free the Error
and clear the client handle, simplifying client code.  It also simplifies
the migrate_del_blocker call site.

In addition, this is a pre-requisite for a proposed future patch that would
add a mode argument to migration requests to support live update, and
maintain a list of blockers for each mode.  A blocker may apply to a single
mode or to multiple modes, and passing Error** will allow one Error object
to be registered for multiple modes.

No functional change.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Tested-by: Michael Galaxy <mgalaxy@akamai.com>
Reviewed-by: Michael Galaxy <mgalaxy@akamai.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>
Message-ID: <1697634216-84215-1-git-send-email-steven.sistare@oracle.com>
2023-10-20 08:51:41 +02:00
Philippe Mathieu-Daudé
5960f254db hw/virtio/virtio-pmem: Replace impossible check by assertion
The get_memory_region() handler is used when (un)plugging the
device, which can only occur *after* it is realized.

virtio_pmem_realize() ensure the instance can not be realized
without 'memdev'. Remove the superfluous check, replacing it
by an assertion.

Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Manos Pitsidianakis <manos.pitsidianakis@linaro.org>
Message-Id: <20231017140150.44995-2-philmd@linaro.org>
2023-10-19 23:13:28 +02:00
Hawkins Jiawei
99d6a32469 vhost: Expose vhost_svq_available_slots()
Next patches in this series will delay the polling
and checking of buffers until either the SVQ is
full or control commands shadow buffers are full,
no longer perform an immediate poll and check of
the device's used buffers for each CVQ state load command.

To achieve this, this patch exposes
vhost_svq_available_slots(), allowing QEMU to know
whether the SVQ is full.

Signed-off-by: Hawkins Jiawei <yin31149@gmail.com>
Acked-by: Eugenio Pérez <eperezma@redhat.com>
Message-Id: <25938079f0bd8185fd664c64e205e629f7a966be.1697165821.git.yin31149@gmail.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2023-10-18 10:41:50 -04:00
Stefan Hajnoczi
0193b3bc05 virtio-gpu rutabaga support
-----BEGIN PGP SIGNATURE-----
 
 iQJQBAABCAA6FiEEh6m9kz+HxgbSdvYt2ujhCXWWnOUFAmUtP5YcHG1hcmNhbmRy
 ZS5sdXJlYXVAcmVkaGF0LmNvbQAKCRDa6OEJdZac5X9CD/4s1n/GZyDr9bh04V03
 otAqtq2CSyuUOviqBrqxYgraCosUD1AuX8WkDy5cCPtnKC4FxRjgVlm9s7K/yxOW
 xZ78e4oVgB1F3voOq6LgtKK6BRG/BPqNzq9kuGcayCHQbSxg7zZVwa702Y18r2ZD
 pjOhbZCrJTSfASL7C3e/rm7798Wk/hzSrClGR56fbRAVgQ6Lww2L97/g0nHyDsWK
 DrCBrdqFtKjpLeUHmcqqS4AwdpG2SyCgqE7RehH/wOhvGTxh/JQvHbLGWK2mDC3j
 Qvs8mClC5bUlyNQuUz7lZtXYpzCW6VGMWlz8bIu+ncgSt6RK1TRbdEfDJPGoS4w9
 ZCGgcTxTG/6BEO76J/VpydfTWDo1FwQCQ0Vv7EussGoRTLrFC3ZRFgDWpqCw85yi
 AjPtc0C49FHBZhK0l1CoJGV4gGTDtD9jTYN0ffsd+aQesOjcsgivAWBaCOOQWUc8
 KOv9sr4kLLxcnuCnP7p/PuVRQD4eg0TmpdS8bXfnCzLSH8fCm+n76LuJEpGxEBey
 3KPJPj/1BNBgVgew+znSLD/EYM6YhdK2gF5SNrYsdR6UcFdrPED/xmdhzFBeVym/
 xbBWqicDw4HLn5YrJ4tzqXje5XUz5pmJoT5zrRMXTHiu4pjBkEXO/lOdAoFwSy8M
 WNOtmSyB69uCrbyLw6xE2/YX8Q==
 =5a/Z
 -----END PGP SIGNATURE-----

Merge tag 'gpu-pull-request' of https://gitlab.com/marcandre.lureau/qemu into staging

virtio-gpu rutabaga support

# -----BEGIN PGP SIGNATURE-----
#
# iQJQBAABCAA6FiEEh6m9kz+HxgbSdvYt2ujhCXWWnOUFAmUtP5YcHG1hcmNhbmRy
# ZS5sdXJlYXVAcmVkaGF0LmNvbQAKCRDa6OEJdZac5X9CD/4s1n/GZyDr9bh04V03
# otAqtq2CSyuUOviqBrqxYgraCosUD1AuX8WkDy5cCPtnKC4FxRjgVlm9s7K/yxOW
# xZ78e4oVgB1F3voOq6LgtKK6BRG/BPqNzq9kuGcayCHQbSxg7zZVwa702Y18r2ZD
# pjOhbZCrJTSfASL7C3e/rm7798Wk/hzSrClGR56fbRAVgQ6Lww2L97/g0nHyDsWK
# DrCBrdqFtKjpLeUHmcqqS4AwdpG2SyCgqE7RehH/wOhvGTxh/JQvHbLGWK2mDC3j
# Qvs8mClC5bUlyNQuUz7lZtXYpzCW6VGMWlz8bIu+ncgSt6RK1TRbdEfDJPGoS4w9
# ZCGgcTxTG/6BEO76J/VpydfTWDo1FwQCQ0Vv7EussGoRTLrFC3ZRFgDWpqCw85yi
# AjPtc0C49FHBZhK0l1CoJGV4gGTDtD9jTYN0ffsd+aQesOjcsgivAWBaCOOQWUc8
# KOv9sr4kLLxcnuCnP7p/PuVRQD4eg0TmpdS8bXfnCzLSH8fCm+n76LuJEpGxEBey
# 3KPJPj/1BNBgVgew+znSLD/EYM6YhdK2gF5SNrYsdR6UcFdrPED/xmdhzFBeVym/
# xbBWqicDw4HLn5YrJ4tzqXje5XUz5pmJoT5zrRMXTHiu4pjBkEXO/lOdAoFwSy8M
# WNOtmSyB69uCrbyLw6xE2/YX8Q==
# =5a/Z
# -----END PGP SIGNATURE-----
# gpg: Signature made Mon 16 Oct 2023 09:50:14 EDT
# gpg:                using RSA key 87A9BD933F87C606D276F62DDAE8E10975969CE5
# gpg:                issuer "marcandre.lureau@redhat.com"
# gpg: Good signature from "Marc-André Lureau <marcandre.lureau@redhat.com>" [full]
# gpg:                 aka "Marc-André Lureau <marcandre.lureau@gmail.com>" [full]
# Primary key fingerprint: 87A9 BD93 3F87 C606 D276  F62D DAE8 E109 7596 9CE5

* tag 'gpu-pull-request' of https://gitlab.com/marcandre.lureau/qemu:
  docs/system: add basic virtio-gpu documentation
  gfxstream + rutabaga: enable rutabaga
  gfxstream + rutabaga: meson support
  gfxstream + rutabaga: add initial support for gfxstream
  gfxstream + rutabaga prep: added need defintions, fields, and options
  virtio-gpu: blob prep
  virtio-gpu: hostmem
  virtio-gpu: CONTEXT_INIT feature
  virtio: Add shared memory capability

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2023-10-17 10:05:51 -04:00
Dr. David Alan Gilbert
605a16a762 virtio: Add shared memory capability
Define a new capability type 'VIRTIO_PCI_CAP_SHARED_MEMORY_CFG' to allow
defining shared memory regions with sizes and offsets of 2^32 and more.
Multiple instances of the capability are allowed and distinguished
by a device-specific 'id'.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Antonio Caggiano <antonio.caggiano@collabora.com>
Signed-off-by: Gurchetan Singh <gurchetansingh@chromium.org>
Tested-by: Alyssa Ross <hi@alyssa.is>
Tested-by: Huang Rui <ray.huang@amd.com>
Tested-by: Akihiko Odaki <akihiko.odaki@daynix.com>
Acked-by: Huang Rui <ray.huang@amd.com>
Reviewed-by: Gurchetan Singh <gurchetansingh@chromium.org>
Reviewed-by: Akihiko Odaki <akihiko.odaki@daynix.com>
2023-10-16 11:29:56 +04:00
David Hildenbrand
ee6398d862 virtio-mem: Mark memslot alias memory regions unmergeable
Let's mark the memslot alias memory regions as unmergable, such that
flatview and vhost won't merge adjacent memory region aliases and we can
atomically map/unmap individual aliases without affecting adjacent
alias memory regions.

This handles vhost and vfio in multiple-memslot mode correctly (which do
not support atomic memslot updates) and avoids the temporary removal of
large memslots, which can be an expensive operation. For example, vfio
might have to unpin + repin a lot of memory, which is undesired.

Message-ID: <20230926185738.277351-19-david@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
2023-10-12 14:15:22 +02:00
David Hildenbrand
533f5d6679 memory,vhost: Allow for marking memory device memory regions unmergeable
Let's allow for marking memory regions unmergeable, to teach
flatview code and vhost to not merge adjacent aliases to the same memory
region into a larger memory section; instead, we want separate aliases to
stay separate such that we can atomically map/unmap aliases without
affecting other aliases.

This is desired for virtio-mem mapping device memory located on a RAM
memory region via multiple aliases into a memory region container,
resulting in separate memslots that can get (un)mapped atomically.

As an example with virtio-mem, the layout would look something like this:
  [...]
  0000000240000000-00000020bfffffff (prio 0, i/o): device-memory
    0000000240000000-000000043fffffff (prio 0, i/o): virtio-mem
      0000000240000000-000000027fffffff (prio 0, ram): alias memslot-0 @mem2 0000000000000000-000000003fffffff
      0000000280000000-00000002bfffffff (prio 0, ram): alias memslot-1 @mem2 0000000040000000-000000007fffffff
      00000002c0000000-00000002ffffffff (prio 0, ram): alias memslot-2 @mem2 0000000080000000-00000000bfffffff
  [...]

Without unmergable memory regions, all three memslots would get merged into
a single memory section. For example, when mapping another alias (e.g.,
virtio-mem-memslot-3) or when unmapping any of the mapped aliases,
memory listeners will first get notified about the removal of the big
memory section to then get notified about re-adding of the new
(differently merged) memory section(s).

In an ideal world, memory listeners would be able to deal with that
atomically, like KVM nowadays does. However, (a) supporting this for other
memory listeners (vhost-user, vfio) is fairly hard: temporary removal
can result in all kinds of issues on concurrent access to guest memory;
and (b) this handling is undesired, because temporarily removing+readding
can consume quite some time on bigger memslots and is not efficient
(e.g., vfio unpinning and repinning pages ...).

Let's allow for marking a memory region unmergeable, such that we
can atomically (un)map aliases to the same memory region, similar to
(un)mapping individual DIMMs.

Similarly, teach vhost code to not redo what flatview core stopped doing:
don't merge such sections. Merging in vhost code is really only relevant
for handling random holes in boot memory where; without this merging,
the vhost-user backend wouldn't be able to mmap() some boot memory
backed on hugetlb.

We'll use this for virtio-mem next.

Message-ID: <20230926185738.277351-18-david@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
2023-10-12 14:15:22 +02:00
David Hildenbrand
177f9b1ee4 virtio-mem: Expose device memory dynamically via multiple memslots if enabled
Having large virtio-mem devices that only expose little memory to a VM
is currently a problem: we map the whole sparse memory region into the
guest using a single memslot, resulting in one gigantic memslot in KVM.
KVM allocates metadata for the whole memslot, which can result in quite
some memory waste.

Assuming we have a 1 TiB virtio-mem device and only expose little (e.g.,
1 GiB) memory, we would create a single 1 TiB memslot and KVM has to
allocate metadata for that 1 TiB memslot: on x86, this implies allocating
a significant amount of memory for metadata:

(1) RMAP: 8 bytes per 4 KiB, 8 bytes per 2 MiB, 8 bytes per 1 GiB
    -> For 1 TiB: 2147483648 + 4194304 + 8192 = ~ 2 GiB (0.2 %)

    With the TDP MMU (cat /sys/module/kvm/parameters/tdp_mmu) this gets
    allocated lazily when required for nested VMs
(2) gfn_track: 2 bytes per 4 KiB
    -> For 1 TiB: 536870912 = ~512 MiB (0.05 %)
(3) lpage_info: 4 bytes per 2 MiB, 4 bytes per 1 GiB
    -> For 1 TiB: 2097152 + 4096 = ~2 MiB (0.0002 %)
(4) 2x dirty bitmaps for tracking: 2x 1 bit per 4 KiB page
    -> For 1 TiB: 536870912 = 64 MiB (0.006 %)

So we primarily care about (1) and (2). The bad thing is, that the
memory consumption *doubles* once SMM is enabled, because we create the
memslot once for !SMM and once for SMM.

Having a 1 TiB memslot without the TDP MMU consumes around:
* With SMM: 5 GiB
* Without SMM: 2.5 GiB
Having a 1 TiB memslot with the TDP MMU consumes around:
* With SMM: 1 GiB
* Without SMM: 512 MiB

... and that's really something we want to optimize, to be able to just
start a VM with small boot memory (e.g., 4 GiB) and a virtio-mem device
that can grow very large (e.g., 1 TiB).

Consequently, using multiple memslots and only mapping the memslots we
really need can significantly reduce memory waste and speed up
memslot-related operations. Let's expose the sparse RAM memory region using
multiple memslots, mapping only the memslots we currently need into our
device memory region container.

The feature can be enabled using "dynamic-memslots=on" and requires
"unplugged-inaccessible=on", which is nowadays the default.

Once enabled, we'll auto-detect the number of memslots to use based on the
memslot limit provided by the core. We'll use at most 1 memslot per
gigabyte. Note that our global limit of memslots accross all memory devices
is currently set to 256: even with multiple large virtio-mem devices,
we'd still have a sane limit on the number of memslots used.

The default is to not dynamically map memslot for now
("dynamic-memslots=off"). The optimization must be enabled manually,
because some vhost setups (e.g., hotplug of vhost-user devices) might be
problematic until we support more memslots especially in vhost-user backends.

Note that "dynamic-memslots=on" is just a hint that multiple memslots
*may* be used for internal optimizations, not that multiple memslots
*must* be used. The actual number of memslots that are used is an
internal detail: for example, once memslot metadata is no longer an
issue, we could simply stop optimizing for that. Migration source and
destination can differ on the setting of "dynamic-memslots".

Message-ID: <20230926185738.277351-17-david@redhat.com>
Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
2023-10-12 14:15:22 +02:00
David Hildenbrand
884a0c20e6 virtio-mem: Update state to match bitmap as soon as it's been migrated
It's cleaner and future-proof to just have other state that depends on the
bitmap state to be updated as soon as possible when restoring the bitmap.

So factor out informing RamDiscardListener into a functon and call it in
case of early migration right after we restored the bitmap.

Message-ID: <20230926185738.277351-16-david@redhat.com>
Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
2023-10-12 14:15:22 +02:00
David Hildenbrand
a45171dba7 virtio-mem: Pass non-const VirtIOMEM via virtio_mem_range_cb
Let's prepare for a user that has to modify the VirtIOMEM device state.

Message-ID: <20230926185738.277351-15-david@redhat.com>
Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
2023-10-12 14:15:22 +02:00
David Hildenbrand
a2335113ae memory-device,vhost: Support automatic decision on the number of memslots
We want to support memory devices that can automatically decide how many
memslots they will use. In the worst case, they have to use a single
memslot.

The target use cases are virtio-mem and the hyper-v balloon.

Let's calculate a reasonable limit such a memory device may use, and
instruct the device to make a decision based on that limit. Use a simple
heuristic that considers:
* A memslot soft-limit for all memory devices of 256; also, to not
  consume too many memslots -- which could harm performance.
* Actually still free and unreserved memslots
* The percentage of the remaining device memory region that memory device
  will occupy.

Further, while we properly check before plugging a memory device whether
there still is are free memslots, we have other memslot consumers (such as
boot memory, PCI BARs) that don't perform any checks and might dynamically
consume memslots without any prior reservation. So we might succeed in
plugging a memory device, but once we dynamically map a PCI BAR we would
be in trouble. Doing accounting / reservation / checks for all such
users is problematic (e.g., sometimes we might temporarily split boot
memory into two memslots, triggered by the BIOS).

We use the historic magic memslot number of 509 as orientation to when
supporting 256 memory devices -> memslots (leaving 253 for boot memory and
other devices) has been proven to work reliable. We'll fallback to
suggesting a single memslot if we don't have at least 509 total memslots.

Plugging vhost devices with less than 509 memslots available while we
have memory devices plugged that consume multiple memslots due to
automatic decisions can be problematic. Most configurations might just fail
due to "limit < used + reserved", however, it can also happen that these
memory devices would suddenly consume memslots that would actually be
required by other memslot consumers (boot, PCI BARs) later. Note that this
has always been sketchy with vhost devices that support only a small number
of memslots; but we don't want to make it any worse.So let's keep it simple
and simply reject plugging such vhost devices in such a configuration.

Eventually, all vhost devices that want to be fully compatible with such
memory devices should support a decent number of memslots (>= 509).

Message-ID: <20230926185738.277351-13-david@redhat.com>
Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
2023-10-12 14:15:22 +02:00
David Hildenbrand
cd89c065b0 vhost: Add vhost_get_max_memslots()
Let's add vhost_get_max_memslots().

Message-ID: <20230926185738.277351-12-david@redhat.com>
Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
2023-10-12 14:15:22 +02:00
David Hildenbrand
766aa0a654 memory-device,vhost: Support memory devices that dynamically consume memslots
We want to support memory devices that have a dynamically managed memory
region container as device memory region. This device memory region maps
multiple RAM memory subregions (e.g., aliases to the same RAM memory
region), whereby these subregions can be (un)mapped on demand.

Each RAM subregion will consume a memslot in KVM and vhost, resulting in
such a new device consuming memslots dynamically, and initially usually
0. We already track the number of used vs. required memslots for all
memslots. From that, we can derive the number of reserved memslots that
must not be used otherwise.

The target use case is virtio-mem and the hyper-v balloon, which will
dynamically map aliases to RAM memory region into their device memory
region container.

Properly document what's supported and what's not and extend the vhost
memslot check accordingly.

Message-ID: <20230926185738.277351-10-david@redhat.com>
Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
2023-10-12 14:15:22 +02:00
David Hildenbrand
8c49951c4a vhost: Return number of free memslots
Let's return the number of free slots instead of only checking if there
is a free slot. Required to support memory devices that consume multiple
memslots.

This is a preparation for memory devices that consume multiple memslots.

Message-ID: <20230926185738.277351-6-david@redhat.com>
Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
2023-10-12 14:15:22 +02:00
David Hildenbrand
309ebfa691 vhost: Remove vhost_backend_can_merge() callback
Checking whether the memory regions are equal is sufficient: if they are
equal, then most certainly the contained fd is equal.

The whole vhost-user memslot handling is suboptimal and overly
complicated. We shouldn't have to lookup a RAM memory regions we got
notified about in vhost_user_get_mr_data() using a host pointer. But that
requires a bigger rework -- especially an alternative vhost_set_mem_table()
backend call that simply consumes MemoryRegionSections.

For now, let's just drop vhost_backend_can_merge().

Message-ID: <20230926185738.277351-3-david@redhat.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Igor Mammedov <imammedo@redhat.com>
Acked-by: Igor Mammedov <imammedo@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
2023-10-12 14:15:21 +02:00
David Hildenbrand
552b25229c vhost: Rework memslot filtering and fix "used_memslot" tracking
Having multiple vhost devices, some filtering out fd-less memslots and
some not, can mess up the "used_memslot" accounting. Consequently our
"free memslot" checks become unreliable and we might run out of free
memslots at runtime later.

An example sequence which can trigger a potential issue that involves
different vhost backends (vhost-kernel and vhost-user) and hotplugged
memory devices can be found at [1].

Let's make the filtering mechanism less generic and distinguish between
backends that support private memslots (without a fd) and ones that only
support shared memslots (with a fd). Track the used_memslots for both
cases separately and use the corresponding value when required.

Note: Most probably we should filter out MAP_PRIVATE fd-based RAM regions
(for example, via memory-backend-memfd,...,shared=off or as default with
 memory-backend-file) as well. When not using MAP_SHARED, it might not work
as expected. Add a TODO for now.

[1] https://lkml.kernel.org/r/fad9136f-08d3-3fd9-71a1-502069c000cf@redhat.com

Message-ID: <20230926185738.277351-2-david@redhat.com>
Fixes: 988a27754b ("vhost: allow backends to filter memory sections")
Cc: Tiwei Bie <tiwei.bie@intel.com>
Acked-by: Igor Mammedov <imammedo@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
2023-10-12 14:15:21 +02:00
Thomas Huth
da3182887c hw/virtio/vhost: Silence compiler warnings in vhost code when using -Wshadow
Rename a variable in vhost_dev_sync_region() and remove a superfluous
declaration in vhost_commit() to make this code compilable with "-Wshadow".

Signed-off-by: Thomas Huth <thuth@redhat.com>
Message-ID: <20231004114809.105672-1-thuth@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-By: Michael Tokarev <mjt@tls.msk.ru>
Signed-off-by: Markus Armbruster <armbru@redhat.com>
2023-10-06 12:47:57 +02:00
Thomas Huth
5ae80e6297 hw/virtio/virtio-pci: Avoid compiler warning with -Wshadow
"len" is used as parameter of the functions virtio_write_config()
and virtio_read_config(), and additionally as a local variable,
so this causes a compiler warning when compiling with "-Wshadow"
and can be confusing for the reader. Rename the local variables
to "caplen" to avoid this problem.

Signed-off-by: Thomas Huth <thuth@redhat.com>
Message-ID: <20231004095302.99037-1-thuth@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Signed-off-by: Markus Armbruster <armbru@redhat.com>
2023-10-06 10:56:54 +02:00
Albert Esteve
1609476662 vhost-user: add shared_object msg
Add three new vhost-user protocol
`VHOST_USER_BACKEND_SHARED_OBJECT_* messages`.
These new messages are sent from vhost-user
back-ends to interact with the virtio-dmabuf
table in order to add or remove themselves as
virtio exporters, or lookup for virtio dma-buf
shared objects.

The action taken in the front-end depends
on the type stored in the virtio shared
object hash table.

When the table holds a pointer to a vhost
backend for a given UUID, the front-end sends
a VHOST_USER_GET_SHARED_OBJECT to the
backend holding the shared object.

The messages can only be sent after successfully
negotiating a new VHOST_USER_PROTOCOL_F_SHARED_OBJECT
vhost-user protocol feature bit.

Finally, refactor code to send response message so
that all common parts both for the common REPLY_ACK
case, and other data responses, can call it and
avoid code repetition.

Signed-off-by: Albert Esteve <aesteve@redhat.com>
Message-Id: <20231002065706.94707-4-aesteve@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2023-10-04 18:15:06 -04:00
Ilya Maximets
70f88436aa virtio: remove unused next argument from virtqueue_split_read_next_desc()
The 'next' was converted from a local variable to an output parameter
in commit:
  412e0e81b1 ("virtio: handle virtqueue_read_next_desc() errors")

But all the actual uses of the 'i/next' as an output were removed a few
months prior in commit:
  aa570d6fb6 ("virtio: combine the read of a descriptor")

Remove the unused argument to simplify the code.

Also, adding a comment to the function to describe what it is actually
doing, as it is not obvious that the 'desc' is both an input and an
output argument.

Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Message-Id: <20230927140016.2317404-3-i.maximets@ovn.org>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2023-10-04 18:15:06 -04:00
Ilya Maximets
d501f97d96 virtio: remove unnecessary thread fence while reading next descriptor
It was supposed to be a compiler barrier and it was a compiler barrier
initially called 'wmb' when virtio core support was introduced.
Later all the instances of 'wmb' were switched to smp_wmb to fix memory
ordering issues on non-x86 platforms.  However, this one doesn't need
to be an actual barrier, as its only purpose was to ensure that the
value is not read twice.

And since commit aa570d6fb6 ("virtio: combine the read of a descriptor")
there is no need for a barrier at all, since we're no longer reading
guest memory here, but accessing a local structure.

Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Message-Id: <20230927140016.2317404-2-i.maximets@ovn.org>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2023-10-04 18:15:06 -04:00
Ilya Maximets
850cd20b07 virtio: use shadow_avail_idx while checking number of heads
We do not need the most up to date number of heads, we only want to
know if there is at least one.

Use shadow variable as long as it is not equal to the last available
index checked.  This avoids expensive qatomic dereference of the
RCU-protected memory region cache as well as the memory access itself.

The change improves performance of the af-xdp network backend by 2-3%.

Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Message-Id: <20230927135157.2316982-1-i.maximets@ovn.org>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2023-10-04 18:15:06 -04:00
Jonah Palmer
3d123a8b41 vhost-user: move VhostUserProtocolFeature definition to header file
Move the definition of VhostUserProtocolFeature to
include/hw/virtio/vhost-user.h.

Remove previous definitions in hw/scsi/vhost-user-scsi.c,
hw/virtio/vhost-user.c, and hw/virtio/virtio-qmp.c.

Previously there were 3 separate definitions of this over 3 different
files. Now only 1 definition of this will be present for these 3 files.

Signed-off-by: Jonah Palmer <jonah.palmer@oracle.com>
Reviewed-by: Emmanouil Pitsidianakis <manos.pitsidianakis@linaro.org>
Message-Id: <20230926224107.2951144-4-jonah.palmer@oracle.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2023-10-04 04:54:28 -04:00
Jonah Palmer
58f8168978 qmp: update virtio feature maps, vhost-user-gpio introspection
Add new vhost-user protocol feature to vhost-user protocol feature map
and enumeration:
 - VHOST_USER_PROTOCOL_F_STATUS

Add new virtio device features for several virtio devices to their
respective feature mappings:

virtio-blk:
 - VIRTIO_BLK_F_SECURE_ERASE

virtio-net:
 - VIRTIO_NET_F_NOTF_COAL
 - VIRTIO_NET_F_GUEST_USO4
 - VIRTIO_NET_F_GUEST_USO6
 - VIRTIO_NET_F_HOST_USO

virtio/vhost-user-gpio:
 - VIRTIO_GPIO_F_IRQ
 - VHOST_USER_F_PROTOCOL_FEATURES

Add support for introspection on vhost-user-gpio devices.

Signed-off-by: Jonah Palmer <jonah.palmer@oracle.com>
Reviewed-by: Emmanouil Pitsidianakis <manos.pitsidianakis@linaro.org>
Message-Id: <20230926224107.2951144-3-jonah.palmer@oracle.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2023-10-04 04:54:26 -04:00
Jonah Palmer
b532c684e0 qmp: remove virtio_list, search QOM tree instead
The virtio_list duplicates information about virtio devices that already
exist in the QOM composition tree. Instead of creating this list of
realized virtio devices, search the QOM composition tree instead.

This patch modifies the QMP command qmp_x_query_virtio to instead
recursively search the QOM composition tree for devices of type
'TYPE_VIRTIO_DEVICE'. The device is also checked to ensure it's
realized.

Signed-off-by: Jonah Palmer <jonah.palmer@oracle.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Message-Id: <20230926224107.2951144-2-jonah.palmer@oracle.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2023-10-04 04:54:24 -04:00
Hawkins Jiawei
b0de17a2e2 vhost: Add count argument to vhost_svq_poll()
Next patches in this series will no longer perform an
immediate poll and check of the device's used buffers
for each CVQ state load command. Instead, they will
send CVQ state load commands in parallel by polling
multiple pending buffers at once.

To achieve this, this patch refactoring vhost_svq_poll()
to accept a new argument `num`, which allows vhost_svq_poll()
to wait for the device to use multiple elements,
rather than polling for a single element.

Signed-off-by: Hawkins Jiawei <yin31149@gmail.com>
Acked-by: Eugenio Pérez <eperezma@redhat.com>
Message-Id: <950b3bfcfc5d446168b9d6a249d554a013a691d4.1693287885.git.yin31149@gmail.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2023-10-04 04:54:23 -04:00
Eugenio Pérez
6c4825476a vdpa: move vhost_vdpa_set_vring_ready to the caller
Doing that way allows CVQ to be enabled before the dataplane vqs,
restoring the state as MQ or MAC addresses properly in the case of a
migration.

The patch does it by defining a ->load NetClientInfo callback also for
dataplane.  Ideally, this should be done by an independent patch, but
the function is already static so it would only add an empty
vhost_vdpa_net_data_load stub.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
Message-Id: <20230822085330.3978829-5-eperezma@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Tested-by: Lei Yang <leiyang@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2023-10-04 04:54:21 -04:00
Eugenio Pérez
d7ce084176 vdpa: export vhost_vdpa_set_vring_ready
The vhost-vdpa net backend needs to enable vrings in a different order
than default, so export it.

No functional change intended except for tracing, that now includes the
(virtio) index being enabled and the return value of the ioctl.

Still ignoring return value of this function if called from
vhost_vdpa_dev_start, as reorganize calling code around it is out of
the scope of this series.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Message-Id: <20230822085330.3978829-3-eperezma@redhat.com>
Tested-by: Lei Yang <leiyang@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2023-10-04 04:54:19 -04:00
Ilya Maximets
43d6376980 virtio: don't zero out memory region cache for indirect descriptors
Lots of virtio functions that are on a hot path in data transmission
are initializing indirect descriptor cache at the point of stack
allocation.  It's a 112 byte structure that is getting zeroed out on
each call adding unnecessary overhead.  It's going to be correctly
initialized later via special init function.  The only reason to
actually initialize right away is the ability to safely destruct it.
Replacing a designated initializer with a function to only initialize
what is necessary.

Removal of the unnecessary stack initializations improves throughput
of virtio-net devices in terms of 64B packets per second by 6-14 %
depending on the case.  Tested with a proposed af-xdp network backend
and a dpdk testpmd application in the guest, but should be beneficial
for other virtio devices as well.

Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Message-Id: <20230811143423.3258788-1-i.maximets@ovn.org>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2023-10-04 04:54:15 -04:00
Alex Bennée
f92a2d61cd hw/virtio: add config support to vhost-user-device
To use the generic device the user will need to provide the config
region size via the command line. We also add a notifier so the guest
can be pinged if the remote daemon updates the config.

With these changes:

  -device vhost-user-device-pci,virtio-id=41,num_vqs=2,config_size=8

is equivalent to:

  -device vhost-user-gpio-pci

Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Message-Id: <20230710153522.3469097-11-alex.bennee@linaro.org>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2023-10-04 04:54:05 -04:00