/*
 * QEMU KVM support
 *
 * Copyright IBM, Corp. 2008
 *           Red Hat, Inc. 2008
 *
 * Authors:
 *  Anthony Liguori   <aliguori@us.ibm.com>
 *  Glauber Costa     <gcosta@redhat.com>
 *
 * This work is licensed under the terms of the GNU GPL, version 2 or later.
 * See the COPYING file in the top-level directory.
 *
 */

#include "qemu/osdep.h"
#include <sys/ioctl.h>
#include <poll.h>

#include <linux/kvm.h>

#include "qemu/atomic.h"
#include "qemu/option.h"
#include "qemu/config-file.h"
#include "qemu/error-report.h"
#include "qapi/error.h"
#include "hw/pci/msi.h"
#include "hw/pci/msix.h"
#include "hw/s390x/adapter.h"
#include "exec/gdbstub.h"
#include "sysemu/kvm_int.h"
#include "sysemu/runstate.h"
#include "sysemu/cpus.h"
#include "sysemu/accel-blocker.h"
|
2012-12-17 18:20:00 +01:00
|
|
|
#include "qemu/bswap.h"
|
2012-12-17 18:19:49 +01:00
|
|
|
#include "exec/memory.h"
|
2013-11-04 12:59:02 +01:00
|
|
|
#include "exec/ram_addr.h"
|
2012-12-17 18:20:00 +01:00
|
|
|
#include "qemu/event_notifier.h"
|
#include "qemu/main-loop.h"
|
2017-06-02 08:06:46 +02:00
|
|
|
#include "trace.h"
|
2015-07-06 20:15:13 +02:00
|
|
|
#include "hw/irq.h"
|
2019-11-13 10:56:53 +01:00
|
|
|
#include "qapi/visitor.h"
|
2019-11-13 10:56:53 +01:00
|
|
|
#include "qapi/qapi-types-common.h"
|
|
|
|
#include "qapi/qapi-visit-common.h"
|
2020-05-12 05:06:06 +02:00
|
|
|
#include "sysemu/reset.h"
|
2020-07-07 10:54:37 +02:00
|
|
|
#include "qemu/guest-random.h"
|
|
|
|
#include "sysemu/hw_accel.h"
|
|
|
|
#include "kvm-cpus.h"
|
2022-06-25 19:38:35 +02:00
|
|
|
#include "sysemu/dirtylimit.h"
|
#include "qemu/range.h"
|
2008-11-05 17:29:27 +01:00
|
|
|
|
2013-12-23 16:40:40 +01:00
|
|
|
#include "hw/boards.h"
|
2023-01-24 13:19:36 +01:00
|
|
|
#include "sysemu/stats.h"
|
2013-12-23 16:40:40 +01:00
|
|
|
|
2011-01-10 12:50:05 +01:00
|
|
|
/* This check must be after config-host.h is included */
|
|
|
|
#ifdef CONFIG_EVENTFD
|
|
|
|
#include <sys/eventfd.h>
|
|
|
|
#endif
|
|
|
|
|
2015-11-10 01:23:42 +01:00
|
|
|
/* KVM uses PAGE_SIZE in its definition of KVM_COALESCED_MMIO_MAX. We
|
|
|
|
* need to use the real host PAGE_SIZE, as that's what KVM will use.
|
|
|
|
*/
|
2021-01-18 07:38:06 +01:00
|
|
|
#ifdef PAGE_SIZE
|
|
|
|
#undef PAGE_SIZE
|
|
|
|
#endif
|
2022-03-23 16:57:22 +01:00
|
|
|
#define PAGE_SIZE qemu_real_host_page_size()
|
2008-12-09 21:09:57 +01:00
|
|
|
|
2021-11-11 12:06:04 +01:00
|
|
|
#ifndef KVM_GUESTDBG_BLOCKIRQ
|
|
|
|
#define KVM_GUESTDBG_BLOCKIRQ 0
|
|
|
|
#endif
|
|
|
|
|
2008-11-05 17:29:27 +01:00
|
|
|
//#define DEBUG_KVM
|
|
|
|
|
|
|
|
#ifdef DEBUG_KVM
|
2010-04-18 16:22:14 +02:00
|
|
|
#define DPRINTF(fmt, ...) \
|
2008-11-05 17:29:27 +01:00
|
|
|
do { fprintf(stderr, fmt, ## __VA_ARGS__); } while (0)
|
|
|
|
#else
|
2010-04-18 16:22:14 +02:00
|
|
|
#define DPRINTF(fmt, ...) \
|
2008-11-05 17:29:27 +01:00
|
|
|
do { } while (0)
|
|
|
|
#endif
|
|
|
|
|
2016-05-12 05:48:13 +02:00
|
|
|
struct KVMParkedVcpu {
    unsigned long vcpu_id;
    int kvm_fd;
    QLIST_ENTRY(KVMParkedVcpu) node;
};

KVMState *kvm_state;
bool kvm_kernel_irqchip;
bool kvm_split_irqchip;
bool kvm_async_interrupts_allowed;
bool kvm_halt_in_kernel_allowed;
bool kvm_eventfds_allowed;
bool kvm_irqfds_allowed;
bool kvm_resamplefds_allowed;
bool kvm_msi_via_irqfd_allowed;
bool kvm_gsi_routing_allowed;
bool kvm_gsi_direct_mapping;
bool kvm_allowed;
bool kvm_readonly_mem_allowed;
bool kvm_vm_attributes_allowed;
bool kvm_direct_msi_allowed;
bool kvm_ioeventfd_any_length_allowed;
bool kvm_msi_use_devid;
bool kvm_has_guest_debug;
static int kvm_sstep_flags;
static bool kvm_immediate_exit;
static hwaddr kvm_max_slot_size = ~0;

static const KVMCapabilityInfo kvm_required_capabilites[] = {
    KVM_CAP_INFO(USER_MEMORY),
    KVM_CAP_INFO(DESTROY_MEMORY_REGION_WORKS),
    KVM_CAP_INFO(JOIN_MEMORY_REGIONS_WORKS),
    KVM_CAP_LAST_INFO
};

static NotifierList kvm_irqchip_change_notifiers =
    NOTIFIER_LIST_INITIALIZER(kvm_irqchip_change_notifiers);

struct KVMResampleFd {
    int gsi;
    EventNotifier *resample_event;
    QLIST_ENTRY(KVMResampleFd) node;
};
typedef struct KVMResampleFd KVMResampleFd;

/*
 * Only used with split irqchip where we need to do the resample fd
 * kick for the kernel from userspace.
 */
static QLIST_HEAD(, KVMResampleFd) kvm_resample_fd_list =
    QLIST_HEAD_INITIALIZER(kvm_resample_fd_list);

static QemuMutex kml_slots_lock;

#define kvm_slots_lock()    qemu_mutex_lock(&kml_slots_lock)
#define kvm_slots_unlock()  qemu_mutex_unlock(&kml_slots_lock)

static void kvm_slot_init_dirty_bitmap(KVMSlot *mem);

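/*
 * List helpers for the resamplefd tracking above: _insert/_remove keep
 * one entry per GSI, and kvm_resample_fd_notify() sets the registered
 * EventNotifier for a GSI, i.e. performs the resample kick mentioned
 * in the comment above.
 */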
static inline void kvm_resample_fd_remove(int gsi)
{
    KVMResampleFd *rfd;

    QLIST_FOREACH(rfd, &kvm_resample_fd_list, node) {
        if (rfd->gsi == gsi) {
            QLIST_REMOVE(rfd, node);
            g_free(rfd);
            break;
        }
    }
}

static inline void kvm_resample_fd_insert(int gsi, EventNotifier *event)
{
    KVMResampleFd *rfd = g_new0(KVMResampleFd, 1);

    rfd->gsi = gsi;
    rfd->resample_event = event;

    QLIST_INSERT_HEAD(&kvm_resample_fd_list, rfd, node);
}

void kvm_resample_fd_notify(int gsi)
{
    KVMResampleFd *rfd;

    QLIST_FOREACH(rfd, &kvm_resample_fd_list, node) {
        if (rfd->gsi == gsi) {
            event_notifier_set(rfd->resample_event);
            trace_kvm_resample_fd_notify(gsi);
            return;
        }
    }
}

int kvm_get_max_memslots(void)
{
    KVMState *s = KVM_STATE(current_accel());

    return s->nr_slots;
}

/* Called with KVMMemoryListener.slots_lock held */
static KVMSlot *kvm_get_free_slot(KVMMemoryListener *kml)
{
    KVMState *s = kvm_state;
    int i;

    for (i = 0; i < s->nr_slots; i++) {
        if (kml->slots[i].memory_size == 0) {
            return &kml->slots[i];
        }
    }

    return NULL;
}

bool kvm_has_free_slot(MachineState *ms)
{
    KVMState *s = KVM_STATE(ms->accelerator);
    bool result;
    KVMMemoryListener *kml = &s->memory_listener;

    kvm_slots_lock();
    result = !!kvm_get_free_slot(kml);
    kvm_slots_unlock();

    return result;
}

/* Called with KVMMemoryListener.slots_lock held */
static KVMSlot *kvm_alloc_slot(KVMMemoryListener *kml)
{
    KVMSlot *slot = kvm_get_free_slot(kml);

    if (slot) {
        return slot;
    }

    fprintf(stderr, "%s: no free slot available\n", __func__);
    abort();
}

static KVMSlot *kvm_lookup_matching_slot(KVMMemoryListener *kml,
                                         hwaddr start_addr,
                                         hwaddr size)
{
    KVMState *s = kvm_state;
    int i;

    for (i = 0; i < s->nr_slots; i++) {
        KVMSlot *mem = &kml->slots[i];

        if (start_addr == mem->start_addr && size == mem->memory_size) {
            return mem;
        }
    }

    return NULL;
}

/*
 * Calculate and align the start address and the size of the section.
 * Return the size. If the size is 0, the aligned section is empty.
 */
static hwaddr kvm_align_section(MemoryRegionSection *section,
                                hwaddr *start)
{
    hwaddr size = int128_get64(section->size);
    hwaddr delta, aligned;

    /* kvm works in page size chunks, but the function may be called
       with sub-page size and an unaligned start address. Pad the start
       address up to the next page boundary and truncate the size down
       to the previous one. */
    aligned = ROUND_UP(section->offset_within_address_space,
                       qemu_real_host_page_size());
    delta = aligned - section->offset_within_address_space;
    *start = aligned;
    if (delta > size) {
        return 0;
    }

    return (size - delta) & qemu_real_host_page_mask();
}

int kvm_physical_memory_addr_from_host(KVMState *s, void *ram,
                                       hwaddr *phys_addr)
{
    KVMMemoryListener *kml = &s->memory_listener;
    int i, ret = 0;

    kvm_slots_lock();
    for (i = 0; i < s->nr_slots; i++) {
        KVMSlot *mem = &kml->slots[i];

        if (ram >= mem->ram && ram < mem->ram + mem->memory_size) {
            *phys_addr = mem->start_addr + (ram - mem->ram);
            ret = 1;
            break;
        }
    }
    kvm_slots_unlock();

    return ret;
}

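/*
 * Program one memslot into the kernel with KVM_SET_USER_MEMORY_REGION.
 * When the KVM_MEM_READONLY flag of an existing, non-empty slot
 * changes, the slot is first dropped (size 0) and then re-added, as
 * noted in the comment below about KVM commit 75d61fbc.
 */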
static int kvm_set_user_memory_region(KVMMemoryListener *kml, KVMSlot *slot, bool new)
{
    KVMState *s = kvm_state;
    struct kvm_userspace_memory_region mem;
    int ret;

    mem.slot = slot->slot | (kml->as_id << 16);
    mem.guest_phys_addr = slot->start_addr;
    mem.userspace_addr = (unsigned long)slot->ram;
    mem.flags = slot->flags;

    if (slot->memory_size && !new && (mem.flags ^ slot->old_flags) & KVM_MEM_READONLY) {
        /* Set the slot size to 0 before setting the slot to the desired
         * value. This is needed based on KVM commit 75d61fbc. */
        mem.memory_size = 0;
        ret = kvm_vm_ioctl(s, KVM_SET_USER_MEMORY_REGION, &mem);
        if (ret < 0) {
            goto err;
        }
    }
    mem.memory_size = slot->memory_size;
    ret = kvm_vm_ioctl(s, KVM_SET_USER_MEMORY_REGION, &mem);
    slot->old_flags = mem.flags;
err:
    trace_kvm_set_user_memory(mem.slot, mem.flags, mem.guest_phys_addr,
                              mem.memory_size, mem.userspace_addr, ret);
    if (ret < 0) {
        error_report("%s: KVM_SET_USER_MEMORY_REGION failed, slot=%d,"
                     " start=0x%" PRIx64 ", size=0x%" PRIx64 ": %s",
                     __func__, mem.slot, slot->start_addr,
                     (uint64_t)mem.memory_size, strerror(errno));
    }
    return ret;
}

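/*
 * Tear down a vcpu's userspace state (arch state, the kvm_run mapping
 * and, if present, the dirty ring mapping) and park its file
 * descriptor on kvm_parked_vcpus so that a later kvm_get_vcpu() for
 * the same vcpu id can reuse it instead of calling KVM_CREATE_VCPU
 * again.
 */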
static int do_kvm_destroy_vcpu(CPUState *cpu)
{
    KVMState *s = kvm_state;
    long mmap_size;
    struct KVMParkedVcpu *vcpu = NULL;
    int ret = 0;

    DPRINTF("kvm_destroy_vcpu\n");

    ret = kvm_arch_destroy_vcpu(cpu);
    if (ret < 0) {
        goto err;
    }

    mmap_size = kvm_ioctl(s, KVM_GET_VCPU_MMAP_SIZE, 0);
    if (mmap_size < 0) {
        ret = mmap_size;
        DPRINTF("KVM_GET_VCPU_MMAP_SIZE failed\n");
        goto err;
    }

    ret = munmap(cpu->kvm_run, mmap_size);
    if (ret < 0) {
        goto err;
    }

    if (cpu->kvm_dirty_gfns) {
        ret = munmap(cpu->kvm_dirty_gfns, s->kvm_dirty_ring_bytes);
        if (ret < 0) {
            goto err;
        }
    }

    vcpu = g_malloc0(sizeof(*vcpu));
    vcpu->vcpu_id = kvm_arch_vcpu_id(cpu);
    vcpu->kvm_fd = cpu->kvm_fd;
    QLIST_INSERT_HEAD(&kvm_state->kvm_parked_vcpus, vcpu, node);
err:
    return ret;
}

void kvm_destroy_vcpu(CPUState *cpu)
{
    if (do_kvm_destroy_vcpu(cpu) < 0) {
        error_report("kvm_destroy_vcpu failed");
        exit(EXIT_FAILURE);
    }
}

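/*
 * Return a vcpu file descriptor for @vcpu_id: reuse a parked one if
 * available, otherwise ask the kernel for a new one via
 * KVM_CREATE_VCPU.
 */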
static int kvm_get_vcpu(KVMState *s, unsigned long vcpu_id)
{
    struct KVMParkedVcpu *cpu;

    QLIST_FOREACH(cpu, &s->kvm_parked_vcpus, node) {
        if (cpu->vcpu_id == vcpu_id) {
            int kvm_fd;

            QLIST_REMOVE(cpu, node);
            kvm_fd = cpu->kvm_fd;
            g_free(cpu);
            return kvm_fd;
        }
    }

    return kvm_vm_ioctl(s, KVM_CREATE_VCPU, (void *)vcpu_id);
}

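/*
 * Create (or un-park) the vcpu, mmap the shared kvm_run area and, when
 * the dirty ring is enabled, the per-vcpu dirty gfn ring, then hand the
 * vcpu to kvm_arch_init_vcpu().  Errors are returned through @errp.
 */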
int kvm_init_vcpu(CPUState *cpu, Error **errp)
{
    KVMState *s = kvm_state;
    long mmap_size;
    int ret;

    trace_kvm_init_vcpu(cpu->cpu_index, kvm_arch_vcpu_id(cpu));

    ret = kvm_get_vcpu(s, kvm_arch_vcpu_id(cpu));
    if (ret < 0) {
        error_setg_errno(errp, -ret, "kvm_init_vcpu: kvm_get_vcpu failed (%lu)",
                         kvm_arch_vcpu_id(cpu));
        goto err;
    }

    cpu->kvm_fd = ret;
    cpu->kvm_state = s;
    cpu->vcpu_dirty = true;
    cpu->dirty_pages = 0;
    cpu->throttle_us_per_full = 0;

    mmap_size = kvm_ioctl(s, KVM_GET_VCPU_MMAP_SIZE, 0);
    if (mmap_size < 0) {
        ret = mmap_size;
        error_setg_errno(errp, -mmap_size,
                         "kvm_init_vcpu: KVM_GET_VCPU_MMAP_SIZE failed");
        goto err;
    }

    cpu->kvm_run = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE, MAP_SHARED,
                        cpu->kvm_fd, 0);
    if (cpu->kvm_run == MAP_FAILED) {
        ret = -errno;
        error_setg_errno(errp, ret,
                         "kvm_init_vcpu: mmap'ing vcpu state failed (%lu)",
                         kvm_arch_vcpu_id(cpu));
        goto err;
    }

    if (s->coalesced_mmio && !s->coalesced_mmio_ring) {
        s->coalesced_mmio_ring =
            (void *)cpu->kvm_run + s->coalesced_mmio * PAGE_SIZE;
    }

    if (s->kvm_dirty_ring_size) {
        /* Use MAP_SHARED to share pages with the kernel */
        cpu->kvm_dirty_gfns = mmap(NULL, s->kvm_dirty_ring_bytes,
                                   PROT_READ | PROT_WRITE, MAP_SHARED,
                                   cpu->kvm_fd,
                                   PAGE_SIZE * KVM_DIRTY_LOG_PAGE_OFFSET);
        if (cpu->kvm_dirty_gfns == MAP_FAILED) {
            ret = -errno;
            DPRINTF("mmap'ing vcpu dirty gfns failed: %d\n", ret);
            goto err;
        }
    }

    ret = kvm_arch_init_vcpu(cpu);
    if (ret < 0) {
        error_setg_errno(errp, -ret,
                         "kvm_init_vcpu: kvm_arch_init_vcpu failed (%lu)",
                         kvm_arch_vcpu_id(cpu));
    }
    cpu->kvm_vcpu_stats_fd = kvm_vcpu_ioctl(cpu, KVM_GET_STATS_FD, NULL);

err:
    return ret;
}

/*
 * dirty pages logging control
 */

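/*
 * Derive the KVM memslot flags for a memory region: request dirty page
 * logging when any dirty-log client is active on the region, and map
 * read-only / ROM-device regions to KVM_MEM_READONLY when the host
 * supports it.
 */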
static int kvm_mem_flags(MemoryRegion *mr)
{
    bool readonly = mr->readonly || memory_region_is_romd(mr);
    int flags = 0;

    if (memory_region_get_dirty_log_mask(mr) != 0) {
        flags |= KVM_MEM_LOG_DIRTY_PAGES;
    }
    if (readonly && kvm_readonly_mem_allowed) {
        flags |= KVM_MEM_READONLY;
    }
    return flags;
}

/* Called with KVMMemoryListener.slots_lock held */
static int kvm_slot_update_flags(KVMMemoryListener *kml, KVMSlot *mem,
                                 MemoryRegion *mr)
{
    mem->flags = kvm_mem_flags(mr);

    /* If nothing changed effectively, no need to issue ioctl */
    if (mem->flags == mem->old_flags) {
        return 0;
    }

    kvm_slot_init_dirty_bitmap(mem);
    return kvm_set_user_memory_region(kml, mem, false);
}

static int kvm_section_update_flags(KVMMemoryListener *kml,
                                    MemoryRegionSection *section)
{
    hwaddr start_addr, size, slot_size;
    KVMSlot *mem;
    int ret = 0;

    size = kvm_align_section(section, &start_addr);
    if (!size) {
        return 0;
    }

    kvm_slots_lock();

    while (size && !ret) {
        slot_size = MIN(kvm_max_slot_size, size);
        mem = kvm_lookup_matching_slot(kml, start_addr, slot_size);
        if (!mem) {
            /* We don't have a slot if we want to trap every access. */
            goto out;
        }

        ret = kvm_slot_update_flags(kml, mem, section->mr);
        start_addr += slot_size;
        size -= slot_size;
    }

out:
    kvm_slots_unlock();
    return ret;
}

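/*
 * MemoryListener callbacks: enable dirty logging on the KVM slots
 * backing @section when the first dirty-log client appears
 * (kvm_log_start), and disable it again when the last one goes away
 * (kvm_log_stop).
 */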
static void kvm_log_start(MemoryListener *listener,
                          MemoryRegionSection *section,
                          int old, int new)
{
    KVMMemoryListener *kml = container_of(listener, KVMMemoryListener, listener);
    int r;

    if (old != 0) {
        return;
    }

    r = kvm_section_update_flags(kml, section);
    if (r < 0) {
        abort();
    }
}

static void kvm_log_stop(MemoryListener *listener,
                         MemoryRegionSection *section,
                         int old, int new)
{
    KVMMemoryListener *kml = container_of(listener, KVMMemoryListener, listener);
    int r;

    if (new != 0) {
        return;
    }

    r = kvm_section_update_flags(kml, section);
    if (r < 0) {
        abort();
    }
}

/* get kvm's dirty pages bitmap and update qemu's */
static void kvm_slot_sync_dirty_pages(KVMSlot *slot)
{
    ram_addr_t start = slot->ram_start_offset;
    ram_addr_t pages = slot->memory_size / qemu_real_host_page_size();

    cpu_physical_memory_set_dirty_lebitmap(slot->dirty_bmap, start, pages);
}

static void kvm_slot_reset_dirty_pages(KVMSlot *slot)
{
    memset(slot->dirty_bmap, 0, slot->dirty_bmap_size);
}

#define ALIGN(x, y)  (((x)+(y)-1) & ~((y)-1))

/* Allocate the dirty bitmap for a slot */
static void kvm_slot_init_dirty_bitmap(KVMSlot *mem)
{
    if (!(mem->flags & KVM_MEM_LOG_DIRTY_PAGES) || mem->dirty_bmap) {
        return;
    }

    /*
     * XXX bad kernel interface alert
     * For dirty bitmap, kernel allocates array of size aligned to
     * bits-per-long.  But for case when the kernel is 64bits and
     * the userspace is 32bits, userspace can't align to the same
     * bits-per-long, since sizeof(long) is different between kernel
     * and user space.  This way, userspace will provide buffer which
     * may be 4 bytes less than the kernel will use, resulting in
     * userspace memory corruption (which is not detectable by valgrind
     * too, in most cases).
     * So for now, let's align to 64 instead of HOST_LONG_BITS here, in
     * a hope that sizeof(long) won't become >8 any time soon.
     *
     * Note: the granule of kvm dirty log is qemu_real_host_page_size.
     * And mem->memory_size is aligned to it (otherwise this mem can't
     * be registered to KVM).
     */
    hwaddr bitmap_size = ALIGN(mem->memory_size / qemu_real_host_page_size(),
                               /*HOST_LONG_BITS*/ 64) / 8;
    mem->dirty_bmap = g_malloc0(bitmap_size);
    mem->dirty_bmap_size = bitmap_size;
}

/*
 * Sync dirty bitmap from kernel to KVMSlot.dirty_bmap, return true if
 * succeeded, false otherwise
 */
static bool kvm_slot_get_dirty_log(KVMState *s, KVMSlot *slot)
{
    struct kvm_dirty_log d = {};
    int ret;

    d.dirty_bitmap = slot->dirty_bmap;
    d.slot = slot->slot | (slot->as_id << 16);
    ret = kvm_vm_ioctl(s, KVM_GET_DIRTY_LOG, &d);

    if (ret == -ENOENT) {
        /* kernel does not have dirty bitmap in this slot */
        ret = 0;
    }
    if (ret) {
        error_report_once("%s: KVM_GET_DIRTY_LOG failed with %d",
                          __func__, ret);
    }
    return ret == 0;
}

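/*
 * Dirty-ring path: each vcpu exposes a ring of kvm_dirty_gfn entries.
 * The helpers below mark collected pages in the per-slot bitmap
 * (kvm_dirty_ring_mark_page), walk a vcpu's ring and collect every
 * entry the kernel has flagged dirty (kvm_dirty_ring_reap_one), and
 * then ask the kernel to re-protect the collected pages via
 * KVM_RESET_DIRTY_RINGS (kvm_dirty_ring_reap_locked).
 */
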
/* Should be with all slots_lock held for the address spaces. */
static void kvm_dirty_ring_mark_page(KVMState *s, uint32_t as_id,
                                     uint32_t slot_id, uint64_t offset)
{
    KVMMemoryListener *kml;
    KVMSlot *mem;

    if (as_id >= s->nr_as) {
        return;
    }

    kml = s->as[as_id].ml;
    mem = &kml->slots[slot_id];

    if (!mem->memory_size || offset >=
        (mem->memory_size / qemu_real_host_page_size())) {
        return;
    }

    set_bit(offset, mem->dirty_bmap);
}

static bool dirty_gfn_is_dirtied(struct kvm_dirty_gfn *gfn)
{
    /*
     * Read the flags before the value.  Pairs with barrier in
     * KVM's kvm_dirty_ring_push() function.
     */
    return qatomic_load_acquire(&gfn->flags) == KVM_DIRTY_GFN_F_DIRTY;
}

static void dirty_gfn_set_collected(struct kvm_dirty_gfn *gfn)
{
    /*
     * Use a store-release so that the CPU that executes KVM_RESET_DIRTY_RINGS
     * sees the full content of the ring:
     *
     *     CPU0                     CPU1                         CPU2
     * ------------------------------------------------------------------------------
     *                                                            fill gfn0
     *                                                            store-rel flags for gfn0
     * load-acq flags for gfn0
     * store-rel RESET for gfn0
     *                          ioctl(RESET_RINGS)
     *                          load-acq flags for gfn0
     *                          check if flags have RESET
     *
     * The synchronization goes from CPU2 to CPU0 to CPU1.
     */
    qatomic_store_release(&gfn->flags, KVM_DIRTY_GFN_F_RESET);
}

/*
 * Should be with all slots_lock held for the address spaces.  It returns
 * the number of dirty pages we've collected from this vcpu's dirty ring.
 */
static uint32_t kvm_dirty_ring_reap_one(KVMState *s, CPUState *cpu)
{
    struct kvm_dirty_gfn *dirty_gfns = cpu->kvm_dirty_gfns, *cur;
    uint32_t ring_size = s->kvm_dirty_ring_size;
    uint32_t count = 0, fetch = cpu->kvm_fetch_index;

    /*
     * It's possible that we race with vcpu creation code where the vcpu is
     * put onto the vcpus list but not yet initialized the dirty ring
     * structures.  If so, skip it.
     */
    if (!cpu->created) {
        return 0;
    }

    assert(dirty_gfns && ring_size);
    trace_kvm_dirty_ring_reap_vcpu(cpu->cpu_index);

    while (true) {
        cur = &dirty_gfns[fetch % ring_size];
        if (!dirty_gfn_is_dirtied(cur)) {
            break;
        }
        kvm_dirty_ring_mark_page(s, cur->slot >> 16, cur->slot & 0xffff,
                                 cur->offset);
        dirty_gfn_set_collected(cur);
        trace_kvm_dirty_ring_page(cpu->cpu_index, fetch, cur->offset);
        fetch++;
        count++;
    }
    cpu->kvm_fetch_index = fetch;
    cpu->dirty_pages += count;

    return count;
}

/* Must be with slots_lock held */
static uint64_t kvm_dirty_ring_reap_locked(KVMState *s, CPUState *cpu)
{
    int ret;
    uint64_t total = 0;
    int64_t stamp;

    stamp = get_clock();

    if (cpu) {
        total = kvm_dirty_ring_reap_one(s, cpu);
    } else {
        CPU_FOREACH(cpu) {
            total += kvm_dirty_ring_reap_one(s, cpu);
        }
    }

    if (total) {
        ret = kvm_vm_ioctl(s, KVM_RESET_DIRTY_RINGS);
        assert(ret == total);
    }

    stamp = get_clock() - stamp;

    if (total) {
        trace_kvm_dirty_ring_reap(total, stamp / 1000);
    }

    return total;
}

/*
 * Currently for simplicity, we must hold BQL before calling this.  We
 * could consider dropping the BQL once we're clear about all the race
 * conditions.
 */
static uint64_t kvm_dirty_ring_reap(KVMState *s, CPUState *cpu)
{
    uint64_t total;

    /*
     * We need to lock all kvm slots for all address spaces here,
     * because:
     *
     * (1) We need to mark dirty for dirty bitmaps in multiple slots
     *     and for tons of pages, so it's better to take the lock here
     *     once rather than once per page.  And more importantly,
     *
     * (2) We must _NOT_ publish dirty bits to the other threads
     *     (e.g., the migration thread) via the kvm memory slot dirty
     *     bitmaps before correctly re-protecting those dirtied pages.
     *     Otherwise we can have potential risk of data corruption if
     *     the page data is read in the other thread before we do
     *     reset below.
     */
    kvm_slots_lock();
    total = kvm_dirty_ring_reap_locked(s, cpu);
    kvm_slots_unlock();

    return total;
}

static void do_kvm_cpu_synchronize_kick(CPUState *cpu, run_on_cpu_data arg)
{
    /* No need to do anything */
}

/*
 * Kick all vcpus out in a synchronized way.  When returned, we
 * guarantee that every vcpu has been kicked and at least returned to
 * userspace once.
 */
static void kvm_cpu_synchronize_kick_all(void)
{
    CPUState *cpu;

    CPU_FOREACH(cpu) {
        run_on_cpu(cpu, do_kvm_cpu_synchronize_kick, RUN_ON_CPU_NULL);
    }
}

/*
 * Flush all the existing dirty pages to the KVM slot buffers.  When
 * this call returns, we guarantee that all the touched dirty pages
 * before calling this function have been put into the per-kvmslot
 * dirty bitmap.
 *
 * This function must be called with BQL held.
 */
static void kvm_dirty_ring_flush(void)
{
    trace_kvm_dirty_ring_flush(0);
    /*
     * The function needs to be serialized.  Since this function
     * should always be with BQL held, serialization is guaranteed.
     * However, let's be sure of it.
     */
    assert(qemu_mutex_iothread_locked());
    /*
     * First make sure to flush the hardware buffers by kicking all
     * vcpus out in a synchronous way.
     */
    kvm_cpu_synchronize_kick_all();
    kvm_dirty_ring_reap(kvm_state, NULL);
    trace_kvm_dirty_ring_flush(1);
}

/**
 * kvm_physical_sync_dirty_bitmap - Sync dirty bitmap from kernel space
 *
 * This function will first try to fetch the dirty bitmap from the
 * kernel, and then update qemu's dirty bitmap.
 *
 * NOTE: caller must be with kml->slots_lock held.
 *
 * @kml: the KVM memory listener object
 * @section: the memory section to sync the dirty bitmap with
 */
static void kvm_physical_sync_dirty_bitmap(KVMMemoryListener *kml,
                                           MemoryRegionSection *section)
{
    KVMState *s = kvm_state;
    KVMSlot *mem;
    hwaddr start_addr, size;
    hwaddr slot_size;

    size = kvm_align_section(section, &start_addr);
    while (size) {
        slot_size = MIN(kvm_max_slot_size, size);
        mem = kvm_lookup_matching_slot(kml, start_addr, slot_size);
        if (!mem) {
            /* We don't have a slot if we want to trap every access. */
            return;
        }
        if (kvm_slot_get_dirty_log(s, mem)) {
            kvm_slot_sync_dirty_pages(mem);
        }
        start_addr += slot_size;
        size -= slot_size;
    }
}

/* Alignment requirement for KVM_CLEAR_DIRTY_LOG - 64 pages */
#define KVM_CLEAR_LOG_SHIFT  6
#define KVM_CLEAR_LOG_ALIGN  (qemu_real_host_page_size() << KVM_CLEAR_LOG_SHIFT)
#define KVM_CLEAR_LOG_MASK   (-KVM_CLEAR_LOG_ALIGN)

static int kvm_log_clear_one_slot(KVMSlot *mem, int as_id, uint64_t start,
                                  uint64_t size)
{
    KVMState *s = kvm_state;
    uint64_t end, bmap_start, start_delta, bmap_npages;
    struct kvm_clear_dirty_log d;
    unsigned long *bmap_clear = NULL, psize = qemu_real_host_page_size();
    int ret;

    /*
     * We need to extend either the start or the size or both to
     * satisfy the KVM interface requirement.  Firstly, do the start
     * page alignment on 64 host pages
     */
    bmap_start = start & KVM_CLEAR_LOG_MASK;
    start_delta = start - bmap_start;
    bmap_start /= psize;

    /*
     * The kernel interface has restriction on the size too, that either:
     *
     * (1) the size is 64 host pages aligned (just like the start), or
     * (2) the size fills up until the end of the KVM memslot.
     */
    bmap_npages = DIV_ROUND_UP(size + start_delta, KVM_CLEAR_LOG_ALIGN)
        << KVM_CLEAR_LOG_SHIFT;
    end = mem->memory_size / psize;
    if (bmap_npages > end - bmap_start) {
        bmap_npages = end - bmap_start;
    }
    start_delta /= psize;

    /*
     * Prepare the bitmap to clear dirty bits.  Here we must guarantee
     * that we won't clear any unknown dirty bits otherwise we might
     * accidentally clear some set bits which are not yet synced from
     * the kernel into QEMU's bitmap, then we'll lose track of the
     * guest modifications upon those pages (which can directly lead
     * to guest data loss or panic after migration).
     *
     * Layout of the KVMSlot.dirty_bmap:
     *
     *                   |<-------- bmap_npages -----------..>|
     *                                                     [1]
     *                     start_delta         size
     *  |----------------|-------------|------------------|------------|
     *  ^                ^             ^                               ^
     *  |                |             |                               |
     * start          bmap_start     (start)                         end
     * of memslot                                                of memslot
     *
     * [1] bmap_npages can be aligned to either 64 pages or the end of slot
     */

    assert(bmap_start % BITS_PER_LONG == 0);
    /* We should never do log_clear before log_sync */
    assert(mem->dirty_bmap);
    if (start_delta || bmap_npages - size / psize) {
        /* Slow path - we need to manipulate a temp bitmap */
        bmap_clear = bitmap_new(bmap_npages);
        bitmap_copy_with_src_offset(bmap_clear, mem->dirty_bmap,
                                    bmap_start, start_delta + size / psize);
        /*
         * We need to fill the holes at start because that was not
         * specified by the caller and we extended the bitmap only for
         * 64 pages alignment
         */
        bitmap_clear(bmap_clear, 0, start_delta);
        d.dirty_bitmap = bmap_clear;
    } else {
        /*
         * Fast path - both start and size align well with BITS_PER_LONG
         * (or the end of memory slot)
         */
        d.dirty_bitmap = mem->dirty_bmap + BIT_WORD(bmap_start);
    }

    d.first_page = bmap_start;
    /* It should never overflow.  If it happens, say something */
    assert(bmap_npages <= UINT32_MAX);
    d.num_pages = bmap_npages;
    d.slot = mem->slot | (as_id << 16);

    ret = kvm_vm_ioctl(s, KVM_CLEAR_DIRTY_LOG, &d);
    if (ret < 0 && ret != -ENOENT) {
        error_report("%s: KVM_CLEAR_DIRTY_LOG failed, slot=%d, "
                     "start=0x%"PRIx64", size=0x%"PRIx32", errno=%d",
                     __func__, d.slot, (uint64_t)d.first_page,
                     (uint32_t)d.num_pages, ret);
    } else {
        ret = 0;
        trace_kvm_clear_dirty_log(d.slot, d.first_page, d.num_pages);
    }

    /*
     * After we have updated the remote dirty bitmap, we update the
     * cached bitmap as well for the memslot, then if another user
     * clears the same region we know we shouldn't clear it again on
     * the remote otherwise it's data loss as well.
     */
    bitmap_clear(mem->dirty_bmap, bmap_start + start_delta,
                 size / psize);
    /* This handles the NULL case well */
    g_free(bmap_clear);
    return ret;
}


/**
 * kvm_physical_log_clear - Clear the kernel's dirty bitmap for range
 *
 * NOTE: this will be a no-op if we haven't enabled manual dirty log
 * protection in the host kernel because in that case this operation
 * will be done within log_sync().
 *
 * @kml: the kvm memory listener
 * @section: the memory range to clear dirty bitmap
 */
static int kvm_physical_log_clear(KVMMemoryListener *kml,
                                  MemoryRegionSection *section)
{
    KVMState *s = kvm_state;
    uint64_t start, size, offset, count;
    KVMSlot *mem;
    int ret = 0, i;

    if (!s->manual_dirty_log_protect) {
        /* No need to do explicit clear */
        return ret;
    }

    start = section->offset_within_address_space;
    size = int128_get64(section->size);

    if (!size) {
        /* Nothing more we can do... */
        return ret;
    }

    kvm_slots_lock();

    for (i = 0; i < s->nr_slots; i++) {
        mem = &kml->slots[i];
        /* Discard slots that are empty or do not overlap the section */
        if (!mem->memory_size ||
            mem->start_addr > start + size - 1 ||
            start > mem->start_addr + mem->memory_size - 1) {
            continue;
        }

        if (start >= mem->start_addr) {
            /* The slot starts before section or is aligned to it.  */
            offset = start - mem->start_addr;
            count = MIN(mem->memory_size - offset, size);
        } else {
            /* The slot starts after section.  */
            offset = 0;
            count = MIN(mem->memory_size, size - (mem->start_addr - start));
        }
        ret = kvm_log_clear_one_slot(mem, kml->as_id, offset, count);
        if (ret < 0) {
            break;
        }
    }

    kvm_slots_unlock();

    return ret;
}

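/*
 * Coalesced MMIO/PIO: register with the kernel the regions whose guest
 * writes may be batched into the coalesced ring instead of exiting to
 * userspace on every access.  The PIO variants set zone.pio = 1.
 */
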
static void kvm_coalesce_mmio_region(MemoryListener *listener,
                                     MemoryRegionSection *section,
                                     hwaddr start, hwaddr size)
{
    KVMState *s = kvm_state;

    if (s->coalesced_mmio) {
        struct kvm_coalesced_mmio_zone zone;

        zone.addr = start;
        zone.size = size;
        zone.pad = 0;

        (void)kvm_vm_ioctl(s, KVM_REGISTER_COALESCED_MMIO, &zone);
    }
}

static void kvm_uncoalesce_mmio_region(MemoryListener *listener,
                                       MemoryRegionSection *section,
                                       hwaddr start, hwaddr size)
{
    KVMState *s = kvm_state;

    if (s->coalesced_mmio) {
        struct kvm_coalesced_mmio_zone zone;

        zone.addr = start;
        zone.size = size;
        zone.pad = 0;

        (void)kvm_vm_ioctl(s, KVM_UNREGISTER_COALESCED_MMIO, &zone);
    }
}

static void kvm_coalesce_pio_add(MemoryListener *listener,
                                 MemoryRegionSection *section,
                                 hwaddr start, hwaddr size)
{
    KVMState *s = kvm_state;

    if (s->coalesced_pio) {
        struct kvm_coalesced_mmio_zone zone;

        zone.addr = start;
        zone.size = size;
        zone.pio = 1;

        (void)kvm_vm_ioctl(s, KVM_REGISTER_COALESCED_MMIO, &zone);
    }
}

static void kvm_coalesce_pio_del(MemoryListener *listener,
                                 MemoryRegionSection *section,
                                 hwaddr start, hwaddr size)
{
    KVMState *s = kvm_state;

    if (s->coalesced_pio) {
        struct kvm_coalesced_mmio_zone zone;

        zone.addr = start;
        zone.size = size;
        zone.pio = 1;

        (void)kvm_vm_ioctl(s, KVM_UNREGISTER_COALESCED_MMIO, &zone);
    }
}

static MemoryListener kvm_coalesced_pio_listener = {
    .name = "kvm-coalesced-pio",
    .coalesced_io_add = kvm_coalesce_pio_add,
    .coalesced_io_del = kvm_coalesce_pio_del,
    .priority = MEMORY_LISTENER_PRIORITY_MIN,
};

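/*
 * kvm_check_extension() queries a capability on the global /dev/kvm
 * fd; kvm_vm_check_extension() prefers the per-VM query and falls back
 * to the global one when the kernel doesn't implement it for the VM.
 * Both return 0 when the capability is absent.
 */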
int kvm_check_extension(KVMState *s, unsigned int extension)
{
    int ret;

    ret = kvm_ioctl(s, KVM_CHECK_EXTENSION, extension);
    if (ret < 0) {
        ret = 0;
    }

    return ret;
}

int kvm_vm_check_extension(KVMState *s, unsigned int extension)
{
    int ret;

    ret = kvm_vm_ioctl(s, KVM_CHECK_EXTENSION, extension);
    if (ret < 0) {
        /* VM wide version not implemented, use global one instead */
        ret = kvm_check_extension(s, extension);
    }

    return ret;
}

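/*
 * Track guest RAM pages reported as hardware-poisoned.  Each page is
 * recorded at most once by kvm_hwpoison_page_add(); kvm_unpoison_all()
 * remaps every recorded page and empties the list.
 */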
typedef struct HWPoisonPage {
    ram_addr_t ram_addr;
    QLIST_ENTRY(HWPoisonPage) list;
} HWPoisonPage;

static QLIST_HEAD(, HWPoisonPage) hwpoison_page_list =
    QLIST_HEAD_INITIALIZER(hwpoison_page_list);

static void kvm_unpoison_all(void *param)
{
    HWPoisonPage *page, *next_page;

    QLIST_FOREACH_SAFE(page, &hwpoison_page_list, list, next_page) {
        QLIST_REMOVE(page, list);
        qemu_ram_remap(page->ram_addr, TARGET_PAGE_SIZE);
        g_free(page);
    }
}

void kvm_hwpoison_page_add(ram_addr_t ram_addr)
{
    HWPoisonPage *page;

    QLIST_FOREACH(page, &hwpoison_page_list, list) {
        if (page->ram_addr == ram_addr) {
            return;
        }
    }
    page = g_new(HWPoisonPage, 1);
    page->ram_addr = ram_addr;
    QLIST_INSERT_HEAD(&hwpoison_page_list, page, list);
}

static uint32_t adjust_ioeventfd_endianness(uint32_t val, uint32_t size)
{
#if HOST_BIG_ENDIAN != TARGET_BIG_ENDIAN
    /* The kernel expects ioeventfd values in HOST_BIG_ENDIAN
     * endianness, but the memory core hands them in target endianness.
     * For example, PPC is always treated as big-endian even if running
     * on KVM and on PPC64LE.  Correct here.
     */
    switch (size) {
    case 2:
        val = bswap16(val);
        break;
    case 4:
        val = bswap32(val);
        break;
    }
#endif
    return val;
}

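/*
 * Register (or deregister, when @assign is false) an eventfd via
 * KVM_IOEVENTFD so that a matching guest write to the given MMIO
 * address or PIO port signals @fd in the kernel rather than exiting to
 * userspace.
 */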
static int kvm_set_ioeventfd_mmio(int fd, hwaddr addr, uint32_t val,
                                  bool assign, uint32_t size, bool datamatch)
{
    int ret;
    struct kvm_ioeventfd iofd = {
        .datamatch = datamatch ? adjust_ioeventfd_endianness(val, size) : 0,
        .addr = addr,
        .len = size,
        .flags = 0,
        .fd = fd,
    };

    trace_kvm_set_ioeventfd_mmio(fd, (uint64_t)addr, val, assign, size,
                                 datamatch);
    if (!kvm_enabled()) {
        return -ENOSYS;
    }

    if (datamatch) {
        iofd.flags |= KVM_IOEVENTFD_FLAG_DATAMATCH;
    }
    if (!assign) {
        iofd.flags |= KVM_IOEVENTFD_FLAG_DEASSIGN;
    }

    ret = kvm_vm_ioctl(kvm_state, KVM_IOEVENTFD, &iofd);

    if (ret < 0) {
        return -errno;
    }

    return 0;
}

static int kvm_set_ioeventfd_pio(int fd, uint16_t addr, uint16_t val,
                                 bool assign, uint32_t size, bool datamatch)
{
    struct kvm_ioeventfd kick = {
        .datamatch = datamatch ? adjust_ioeventfd_endianness(val, size) : 0,
        .addr = addr,
        .flags = KVM_IOEVENTFD_FLAG_PIO,
        .len = size,
        .fd = fd,
    };
    int r;
    trace_kvm_set_ioeventfd_pio(fd, addr, val, assign, size, datamatch);
    if (!kvm_enabled()) {
        return -ENOSYS;
    }
    if (datamatch) {
        kick.flags |= KVM_IOEVENTFD_FLAG_DATAMATCH;
    }
    if (!assign) {
        kick.flags |= KVM_IOEVENTFD_FLAG_DEASSIGN;
    }
    r = kvm_vm_ioctl(kvm_state, KVM_IOEVENTFD, &kick);
    if (r < 0) {
        return r;
    }
    return 0;
}


static int kvm_check_many_ioeventfds(void)
|
|
|
|
{
|
2011-01-25 17:17:14 +01:00
|
|
|
/* Userspace can use ioeventfd for io notification. This requires a host
|
|
|
|
* that supports eventfd(2) and an I/O thread; since eventfd does not
|
|
|
|
* support SIGIO it cannot interrupt the vcpu.
|
|
|
|
*
|
|
|
|
* Older kernels have a 6 device limit on the KVM io bus. Find out so we
|
2011-01-10 12:50:05 +01:00
|
|
|
* can avoid creating too many ioeventfds.
|
|
|
|
*/
|
2011-08-22 15:24:58 +02:00
|
|
|
#if defined(CONFIG_EVENTFD)
|
2011-01-10 12:50:05 +01:00
|
|
|
int ioeventfds[7];
|
|
|
|
int i, ret = 0;
|
|
|
|
for (i = 0; i < ARRAY_SIZE(ioeventfds); i++) {
|
|
|
|
ioeventfds[i] = eventfd(0, EFD_CLOEXEC);
|
|
|
|
if (ioeventfds[i] < 0) {
|
|
|
|
break;
|
|
|
|
}
|
2013-04-02 15:52:25 +02:00
|
|
|
ret = kvm_set_ioeventfd_pio(ioeventfds[i], 0, i, true, 2, true);
|
2011-01-10 12:50:05 +01:00
|
|
|
if (ret < 0) {
|
|
|
|
close(ioeventfds[i]);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Decide whether many devices are supported or not */
|
|
|
|
ret = i == ARRAY_SIZE(ioeventfds);
|
|
|
|
|
|
|
|
while (i-- > 0) {
|
2013-04-02 15:52:25 +02:00
|
|
|
kvm_set_ioeventfd_pio(ioeventfds[i], 0, i, false, 2, true);
|
2011-01-10 12:50:05 +01:00
|
|
|
close(ioeventfds[i]);
|
|
|
|
}
|
|
|
|
return ret;
|
|
|
|
#else
|
|
|
|
return 0;
|
|
|
|
#endif
|
|
|
|
}
static const KVMCapabilityInfo *
kvm_check_extension_list(KVMState *s, const KVMCapabilityInfo *list)
{
    while (list->name) {
        if (!kvm_check_extension(s, list->value)) {
            return list;
        }
        list++;
    }
    return NULL;
}

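/*
 * Allow a target to cap the size of a single KVM memslot; larger memory
 * regions are then transparently split into multiple slots by
 * kvm_set_phys_mem().  The cap must be a multiple of the host page size.
 * (Used by targets whose kernel limits the per-slot size, e.g. s390x --
 * an assumption based on the known callers, not enforced here.)
 */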
void kvm_set_max_memslot_size(hwaddr max_slot_size)
{
    g_assert(
        ROUND_UP(max_slot_size, qemu_real_host_page_size()) == max_slot_size
    );
    kvm_max_slot_size = max_slot_size;
}

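/*
 * Register or unregister the host memory backing a MemoryRegionSection as
 * one or more KVM memslots.  Sections larger than kvm_max_slot_size are
 * split into consecutive slots; on removal, dirty state is synced back on a
 * best-effort basis before KVM_SET_USER_MEMORY_REGION drops the slot.
 */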
/* Called with KVMMemoryListener.slots_lock held */
static void kvm_set_phys_mem(KVMMemoryListener *kml,
                             MemoryRegionSection *section, bool add)
{
    KVMSlot *mem;
    int err;
    MemoryRegion *mr = section->mr;
    bool writable = !mr->readonly && !mr->rom_device;
    hwaddr start_addr, size, slot_size, mr_offset;
    ram_addr_t ram_start_offset;
    void *ram;

    if (!memory_region_is_ram(mr)) {
        if (writable || !kvm_readonly_mem_allowed) {
            return;
        } else if (!mr->romd_mode) {
            /* If the memory device is not in romd_mode, then we actually want
             * to remove the kvm memory slot so all accesses will trap. */
            add = false;
        }
    }

    size = kvm_align_section(section, &start_addr);
    if (!size) {
        return;
    }

    /* The offset of the kvmslot within the memory region */
    mr_offset = section->offset_within_region + start_addr -
        section->offset_within_address_space;

    /* use aligned delta to align the ram address and offset */
    ram = memory_region_get_ram_ptr(mr) + mr_offset;
    ram_start_offset = memory_region_get_ram_addr(mr) + mr_offset;

    if (!add) {
        do {
            slot_size = MIN(kvm_max_slot_size, size);
            mem = kvm_lookup_matching_slot(kml, start_addr, slot_size);
            if (!mem) {
                return;
            }
            if (mem->flags & KVM_MEM_LOG_DIRTY_PAGES) {
                /*
                 * NOTE: We should be aware of the fact that here we're only
                 * doing a best effort to sync dirty bits.  No matter whether
                 * we're using dirty log or dirty ring, we ignore two facts:
                 *
                 * (1) dirty bits can reside in hardware buffers (PML)
                 *
                 * (2) after we collected dirty bits here, pages can be dirtied
                 * again before we do the final KVM_SET_USER_MEMORY_REGION to
                 * remove the slot.
                 *
                 * Not easy.  Let's keep our fingers crossed until it's fixed.
                 */
                if (kvm_state->kvm_dirty_ring_size) {
                    kvm_dirty_ring_reap_locked(kvm_state, NULL);
                    if (kvm_state->kvm_dirty_ring_with_bitmap) {
                        kvm_slot_sync_dirty_pages(mem);
                        kvm_slot_get_dirty_log(kvm_state, mem);
                    }
                } else {
                    kvm_slot_get_dirty_log(kvm_state, mem);
                }
                kvm_slot_sync_dirty_pages(mem);
            }

            /* unregister the slot */
            g_free(mem->dirty_bmap);
            mem->dirty_bmap = NULL;
            mem->memory_size = 0;
            mem->flags = 0;
            err = kvm_set_user_memory_region(kml, mem, false);
            if (err) {
                fprintf(stderr, "%s: error unregistering slot: %s\n",
                        __func__, strerror(-err));
                abort();
            }
            start_addr += slot_size;
            size -= slot_size;
        } while (size);
        return;
    }

    /* register the new slot */
    do {
        slot_size = MIN(kvm_max_slot_size, size);
        mem = kvm_alloc_slot(kml);
        mem->as_id = kml->as_id;
        mem->memory_size = slot_size;
        mem->start_addr = start_addr;
        mem->ram_start_offset = ram_start_offset;
        mem->ram = ram;
        mem->flags = kvm_mem_flags(mr);
        kvm_slot_init_dirty_bitmap(mem);
        err = kvm_set_user_memory_region(kml, mem, true);
        if (err) {
            fprintf(stderr, "%s: error registering slot: %s\n", __func__,
                    strerror(-err));
            abort();
        }
        start_addr += slot_size;
        ram_start_offset += slot_size;
        ram += slot_size;
        size -= slot_size;
    } while (size);
}

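/*
 * Background thread that periodically (roughly once per second) collects
 * dirty-ring entries from the kernel into the per-slot dirty bitmaps.  It
 * takes the BQL around each reap so it can safely walk the memslots.
 */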
static void *kvm_dirty_ring_reaper_thread(void *data)
{
    KVMState *s = data;
    struct KVMDirtyRingReaper *r = &s->reaper;

    rcu_register_thread();

    trace_kvm_dirty_ring_reaper("init");

    while (true) {
        r->reaper_state = KVM_DIRTY_RING_REAPER_WAIT;
        trace_kvm_dirty_ring_reaper("wait");
        /*
         * TODO: provide a smarter timeout rather than a constant?
         */
        sleep(1);

        /* keep sleeping so that the dirty limit logic is not disturbed by the reaper */
        if (dirtylimit_in_service()) {
            continue;
        }

        trace_kvm_dirty_ring_reaper("wakeup");
        r->reaper_state = KVM_DIRTY_RING_REAPER_REAPING;

        qemu_mutex_lock_iothread();
        kvm_dirty_ring_reap(s, NULL);
        qemu_mutex_unlock_iothread();

        r->reaper_iteration++;
    }

    trace_kvm_dirty_ring_reaper("exit");

    rcu_unregister_thread();

    return NULL;
}

static void kvm_dirty_ring_reaper_init(KVMState *s)
{
    struct KVMDirtyRingReaper *r = &s->reaper;

    qemu_thread_create(&r->reaper_thr, "kvm-reaper",
                       kvm_dirty_ring_reaper_thread,
                       s, QEMU_THREAD_JOINABLE);
}

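/*
 * Negotiate the KVM dirty-ring capability.  Returns 0 with the ring left
 * disabled when the kernel lacks support (the bitmap-based dirty log is used
 * instead), and a negative errno only for configuration errors.
 */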
static int kvm_dirty_ring_init(KVMState *s)
{
    uint32_t ring_size = s->kvm_dirty_ring_size;
    uint64_t ring_bytes = ring_size * sizeof(struct kvm_dirty_gfn);
    unsigned int capability = KVM_CAP_DIRTY_LOG_RING;
    int ret;

    s->kvm_dirty_ring_size = 0;
    s->kvm_dirty_ring_bytes = 0;

    /* Bail if the dirty ring size isn't specified */
    if (!ring_size) {
        return 0;
    }

    /*
     * Read the max supported pages. Fall back to dirty logging mode
     * if the dirty ring isn't supported.
     */
    ret = kvm_vm_check_extension(s, capability);
    if (ret <= 0) {
        capability = KVM_CAP_DIRTY_LOG_RING_ACQ_REL;
        ret = kvm_vm_check_extension(s, capability);
    }

    if (ret <= 0) {
        warn_report("KVM dirty ring not available, using bitmap method");
        return 0;
    }

    if (ring_bytes > ret) {
        error_report("KVM dirty ring size %" PRIu32 " too big "
                     "(maximum is %ld). Please use a smaller value.",
                     ring_size, (long)ret / sizeof(struct kvm_dirty_gfn));
        return -EINVAL;
    }

    ret = kvm_vm_enable_cap(s, capability, 0, ring_bytes);
    if (ret) {
        error_report("Enabling of KVM dirty ring failed: %s. "
                     "Suggested minimum value is 1024.", strerror(-ret));
        return -EIO;
    }

    /* Enable the backup bitmap if it is supported */
    ret = kvm_vm_check_extension(s, KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP);
    if (ret > 0) {
        ret = kvm_vm_enable_cap(s, KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP, 0);
        if (ret) {
            error_report("Enabling of KVM dirty ring's backup bitmap failed: "
                         "%s. ", strerror(-ret));
            return -EIO;
        }

        s->kvm_dirty_ring_with_bitmap = true;
    }

    s->kvm_dirty_ring_size = ring_size;
    s->kvm_dirty_ring_bytes = ring_bytes;

    return 0;
}

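/*
 * The region_add/region_del callbacks only queue the affected sections on
 * the listener's transaction lists; the memslot ioctls themselves are issued
 * later from kvm_region_commit().
 */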
static void kvm_region_add(MemoryListener *listener,
                           MemoryRegionSection *section)
{
    KVMMemoryListener *kml = container_of(listener, KVMMemoryListener, listener);
    KVMMemoryUpdate *update;

    update = g_new0(KVMMemoryUpdate, 1);
    update->section = *section;

    QSIMPLEQ_INSERT_TAIL(&kml->transaction_add, update, next);
}

static void kvm_region_del(MemoryListener *listener,
                           MemoryRegionSection *section)
{
    KVMMemoryListener *kml = container_of(listener, KVMMemoryListener, listener);
    KVMMemoryUpdate *update;

    update = g_new0(KVMMemoryUpdate, 1);
    update->section = *section;

    QSIMPLEQ_INSERT_TAIL(&kml->transaction_del, update, next);
}

static void kvm_region_commit(MemoryListener *listener)
{
    KVMMemoryListener *kml = container_of(listener, KVMMemoryListener,
                                          listener);
    KVMMemoryUpdate *u1, *u2;
    bool need_inhibit = false;

    if (QSIMPLEQ_EMPTY(&kml->transaction_add) &&
        QSIMPLEQ_EMPTY(&kml->transaction_del)) {
        return;
    }

    /*
     * We have to be careful when regions to add overlap with ranges to remove.
     * We have to simulate atomic KVM memslot updates by making sure no ioctl()
     * is currently active.
     *
     * The lists are ordered by address, so it's easy to find overlaps.
     */
    u1 = QSIMPLEQ_FIRST(&kml->transaction_del);
    u2 = QSIMPLEQ_FIRST(&kml->transaction_add);
    while (u1 && u2) {
        Range r1, r2;

        range_init_nofail(&r1, u1->section.offset_within_address_space,
                          int128_get64(u1->section.size));
        range_init_nofail(&r2, u2->section.offset_within_address_space,
                          int128_get64(u2->section.size));

        if (range_overlaps_range(&r1, &r2)) {
            need_inhibit = true;
            break;
        }
        if (range_lob(&r1) < range_lob(&r2)) {
            u1 = QSIMPLEQ_NEXT(u1, next);
        } else {
            u2 = QSIMPLEQ_NEXT(u2, next);
        }
    }

    kvm_slots_lock();
    if (need_inhibit) {
        accel_ioctl_inhibit_begin();
    }

    /* Remove all memslots before adding the new ones. */
    while (!QSIMPLEQ_EMPTY(&kml->transaction_del)) {
        u1 = QSIMPLEQ_FIRST(&kml->transaction_del);
        QSIMPLEQ_REMOVE_HEAD(&kml->transaction_del, next);

        kvm_set_phys_mem(kml, &u1->section, false);
        memory_region_unref(u1->section.mr);

        g_free(u1);
    }
    while (!QSIMPLEQ_EMPTY(&kml->transaction_add)) {
        u1 = QSIMPLEQ_FIRST(&kml->transaction_add);
        QSIMPLEQ_REMOVE_HEAD(&kml->transaction_add, next);

        memory_region_ref(u1->section.mr);
        kvm_set_phys_mem(kml, &u1->section, true);

        g_free(u1);
    }

    if (need_inhibit) {
        accel_ioctl_inhibit_end();
    }
    kvm_slots_unlock();
}

static void kvm_log_sync(MemoryListener *listener,
                         MemoryRegionSection *section)
{
    KVMMemoryListener *kml = container_of(listener, KVMMemoryListener, listener);

    kvm_slots_lock();
    kvm_physical_sync_dirty_bitmap(kml, section);
    kvm_slots_unlock();
}

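/*
 * Dirty-ring counterpart of kvm_log_sync(): flush the whole ring and fold
 * the result into every slot's dirty bitmap, rather than syncing a single
 * section.  Wired up as listener.log_sync_global when the dirty ring is
 * enabled (see kvm_memory_listener_register()).
 */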
static void kvm_log_sync_global(MemoryListener *l, bool last_stage)
{
    KVMMemoryListener *kml = container_of(l, KVMMemoryListener, listener);
    KVMState *s = kvm_state;
    KVMSlot *mem;
    int i;

    /* Flush all kernel dirty addresses into KVMSlot dirty bitmap */
    kvm_dirty_ring_flush();

    /*
     * TODO: make this faster when nr_slots is big while there are
     * only a few used slots (small VMs).
     */
    kvm_slots_lock();
    for (i = 0; i < s->nr_slots; i++) {
        mem = &kml->slots[i];
        if (mem->memory_size && mem->flags & KVM_MEM_LOG_DIRTY_PAGES) {
            kvm_slot_sync_dirty_pages(mem);

            if (s->kvm_dirty_ring_with_bitmap && last_stage &&
                kvm_slot_get_dirty_log(s, mem)) {
                kvm_slot_sync_dirty_pages(mem);
            }

            /*
             * This is not needed by KVM_GET_DIRTY_LOG because the
             * ioctl will unconditionally overwrite the whole region.
             * However kvm dirty ring has no such side effect.
             */
            kvm_slot_reset_dirty_pages(mem);
        }
    }
    kvm_slots_unlock();
}

static void kvm_log_clear(MemoryListener *listener,
                          MemoryRegionSection *section)
{
    KVMMemoryListener *kml = container_of(listener, KVMMemoryListener, listener);
    int r;

    r = kvm_physical_log_clear(kml, section);
    if (r < 0) {
        error_report_once("%s: kvm log clear failed: mr=%s "
                          "offset=%"HWADDR_PRIx" size=%"PRIx64, __func__,
                          section->mr->name, section->offset_within_region,
                          int128_get64(section->size));
        abort();
    }
}

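/*
 * MemoryListener eventfd hooks: translate eventfd_add/eventfd_del events for
 * MMIO and PIO regions into the kvm_set_ioeventfd_* helpers above.  Failures
 * are fatal because devices rely on the registration succeeding.
 */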
static void kvm_mem_ioeventfd_add(MemoryListener *listener,
                                  MemoryRegionSection *section,
                                  bool match_data, uint64_t data,
                                  EventNotifier *e)
{
    int fd = event_notifier_get_fd(e);
    int r;

    r = kvm_set_ioeventfd_mmio(fd, section->offset_within_address_space,
                               data, true, int128_get64(section->size),
                               match_data);
    if (r < 0) {
        fprintf(stderr, "%s: error adding ioeventfd: %s (%d)\n",
                __func__, strerror(-r), -r);
        abort();
    }
}

static void kvm_mem_ioeventfd_del(MemoryListener *listener,
                                  MemoryRegionSection *section,
                                  bool match_data, uint64_t data,
                                  EventNotifier *e)
{
    int fd = event_notifier_get_fd(e);
    int r;

    r = kvm_set_ioeventfd_mmio(fd, section->offset_within_address_space,
                               data, false, int128_get64(section->size),
                               match_data);
    if (r < 0) {
        fprintf(stderr, "%s: error deleting ioeventfd: %s (%d)\n",
                __func__, strerror(-r), -r);
        abort();
    }
}

static void kvm_io_ioeventfd_add(MemoryListener *listener,
                                 MemoryRegionSection *section,
                                 bool match_data, uint64_t data,
                                 EventNotifier *e)
{
    int fd = event_notifier_get_fd(e);
    int r;

    r = kvm_set_ioeventfd_pio(fd, section->offset_within_address_space,
                              data, true, int128_get64(section->size),
                              match_data);
    if (r < 0) {
        fprintf(stderr, "%s: error adding ioeventfd: %s (%d)\n",
                __func__, strerror(-r), -r);
        abort();
    }
}

static void kvm_io_ioeventfd_del(MemoryListener *listener,
                                 MemoryRegionSection *section,
                                 bool match_data, uint64_t data,
                                 EventNotifier *e)
{
    int fd = event_notifier_get_fd(e);
    int r;

    r = kvm_set_ioeventfd_pio(fd, section->offset_within_address_space,
                              data, false, int128_get64(section->size),
                              match_data);
    if (r < 0) {
        fprintf(stderr, "%s: error deleting ioeventfd: %s (%d)\n",
                __func__, strerror(-r), -r);
        abort();
    }
}

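/*
 * Set up a KVMMemoryListener for one address space: allocate its slot array,
 * pick bitmap- or ring-based dirty tracking callbacks, and record the
 * address space in s->as[] so slots can later be mapped back to it.
 */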
void kvm_memory_listener_register(KVMState *s, KVMMemoryListener *kml,
                                  AddressSpace *as, int as_id, const char *name)
{
    int i;

    kml->slots = g_new0(KVMSlot, s->nr_slots);
    kml->as_id = as_id;

    for (i = 0; i < s->nr_slots; i++) {
        kml->slots[i].slot = i;
    }

    QSIMPLEQ_INIT(&kml->transaction_add);
    QSIMPLEQ_INIT(&kml->transaction_del);

    kml->listener.region_add = kvm_region_add;
    kml->listener.region_del = kvm_region_del;
    kml->listener.commit = kvm_region_commit;
    kml->listener.log_start = kvm_log_start;
    kml->listener.log_stop = kvm_log_stop;
    kml->listener.priority = MEMORY_LISTENER_PRIORITY_ACCEL;
    kml->listener.name = name;

    if (s->kvm_dirty_ring_size) {
        kml->listener.log_sync_global = kvm_log_sync_global;
    } else {
        kml->listener.log_sync = kvm_log_sync;
        kml->listener.log_clear = kvm_log_clear;
    }

    memory_listener_register(&kml->listener, as);

    for (i = 0; i < s->nr_as; ++i) {
        if (!s->as[i].as) {
            s->as[i].as = as;
            s->as[i].ml = kml;
            break;
        }
    }
}

static MemoryListener kvm_io_listener = {
    .name = "kvm-io",
    .eventfd_add = kvm_io_ioeventfd_add,
    .eventfd_del = kvm_io_ioeventfd_del,
    .priority = MEMORY_LISTENER_PRIORITY_DEV_BACKEND,
};

int kvm_set_irq(KVMState *s, int irq, int level)
{
    struct kvm_irq_level event;
    int ret;

    assert(kvm_async_interrupts_enabled());

    event.level = level;
    event.irq = irq;
    ret = kvm_vm_ioctl(s, s->irq_set_ioctl, &event);
    if (ret < 0) {
        perror("kvm_set_irq");
        abort();
    }

    return (s->irq_set_ioctl == KVM_IRQ_LINE) ? 1 : event.status;
}

#ifdef KVM_CAP_IRQ_ROUTING
typedef struct KVMMSIRoute {
    struct kvm_irq_routing_entry kroute;
    QTAILQ_ENTRY(KVMMSIRoute) entry;
} KVMMSIRoute;

static void set_gsi(KVMState *s, unsigned int gsi)
{
    set_bit(gsi, s->used_gsi_bitmap);
}

static void clear_gsi(KVMState *s, unsigned int gsi)
{
    clear_bit(gsi, s->used_gsi_bitmap);
}

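/*
 * GSI routing bookkeeping: used_gsi_bitmap tracks which GSI numbers are in
 * use, while s->irq_routes mirrors the table that is pushed to the kernel
 * with KVM_SET_GSI_ROUTING by kvm_irqchip_commit_routes().
 */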
void kvm_init_irq_routing(KVMState *s)
{
    int gsi_count, i;

    gsi_count = kvm_check_extension(s, KVM_CAP_IRQ_ROUTING) - 1;
    if (gsi_count > 0) {
        /* Round up so we can search ints using ffs */
        s->used_gsi_bitmap = bitmap_new(gsi_count);
        s->gsi_count = gsi_count;
    }

    s->irq_routes = g_malloc0(sizeof(*s->irq_routes));
    s->nr_allocated_irq_routes = 0;

    if (!kvm_direct_msi_allowed) {
        for (i = 0; i < KVM_MSI_HASHTAB_SIZE; i++) {
            QTAILQ_INIT(&s->msi_hashtab[i]);
        }
    }

    kvm_arch_init_irq_routing(s);
}

void kvm_irqchip_commit_routes(KVMState *s)
{
    int ret;

    if (kvm_gsi_direct_mapping()) {
        return;
    }

    if (!kvm_gsi_routing_enabled()) {
        return;
    }

    s->irq_routes->flags = 0;
    trace_kvm_irqchip_commit_routes();
    ret = kvm_vm_ioctl(s, KVM_SET_GSI_ROUTING, s->irq_routes);
    assert(ret == 0);
}

static void kvm_add_routing_entry(KVMState *s,
                                  struct kvm_irq_routing_entry *entry)
{
    struct kvm_irq_routing_entry *new;
    int n, size;

    if (s->irq_routes->nr == s->nr_allocated_irq_routes) {
        n = s->nr_allocated_irq_routes * 2;
        if (n < 64) {
            n = 64;
        }
        size = sizeof(struct kvm_irq_routing);
        size += n * sizeof(*new);
        s->irq_routes = g_realloc(s->irq_routes, size);
        s->nr_allocated_irq_routes = n;
    }
    n = s->irq_routes->nr++;
    new = &s->irq_routes->entries[n];

    *new = *entry;

    set_gsi(s, entry->gsi);
}

static int kvm_update_routing_entry(KVMState *s,
                                    struct kvm_irq_routing_entry *new_entry)
{
    struct kvm_irq_routing_entry *entry;
    int n;

    for (n = 0; n < s->irq_routes->nr; n++) {
        entry = &s->irq_routes->entries[n];
        if (entry->gsi != new_entry->gsi) {
            continue;
        }

        if (!memcmp(entry, new_entry, sizeof *entry)) {
            return 0;
        }

        *entry = *new_entry;

        return 0;
    }

    return -ESRCH;
}

void kvm_irqchip_add_irq_route(KVMState *s, int irq, int irqchip, int pin)
{
    struct kvm_irq_routing_entry e = {};

    assert(pin < s->gsi_count);

    e.gsi = irq;
    e.type = KVM_IRQ_ROUTING_IRQCHIP;
    e.flags = 0;
    e.u.irqchip.irqchip = irqchip;
    e.u.irqchip.pin = pin;
    kvm_add_routing_entry(s, &e);
}

void kvm_irqchip_release_virq(KVMState *s, int virq)
{
    struct kvm_irq_routing_entry *e;
    int i;

    if (kvm_gsi_direct_mapping()) {
        return;
    }

    for (i = 0; i < s->irq_routes->nr; i++) {
        e = &s->irq_routes->entries[i];
        if (e->gsi == virq) {
            s->irq_routes->nr--;
            *e = s->irq_routes->entries[s->irq_routes->nr];
        }
    }
    clear_gsi(s, virq);
    kvm_arch_release_virq_post(virq);
    trace_kvm_irqchip_release_virq(virq);
}

void kvm_irqchip_add_change_notifier(Notifier *n)
{
    notifier_list_add(&kvm_irqchip_change_notifiers, n);
}

void kvm_irqchip_remove_change_notifier(Notifier *n)
{
    notifier_remove(n);
}

void kvm_irqchip_change_notify(void)
{
    notifier_list_notify(&kvm_irqchip_change_notifiers, NULL);
}

static unsigned int kvm_hash_msi(uint32_t data)
{
    /* This is optimized for IA32 MSI layout. However, no other arch shall
     * repeat the mistake of not providing a direct MSI injection API. */
    return data & 0xff;
}

static void kvm_flush_dynamic_msi_routes(KVMState *s)
{
    KVMMSIRoute *route, *next;
    unsigned int hash;

    for (hash = 0; hash < KVM_MSI_HASHTAB_SIZE; hash++) {
        QTAILQ_FOREACH_SAFE(route, &s->msi_hashtab[hash], entry, next) {
            kvm_irqchip_release_virq(s, route->kroute.gsi);
            QTAILQ_REMOVE(&s->msi_hashtab[hash], route, entry);
            g_free(route);
        }
    }
}

static int kvm_irqchip_get_virq(KVMState *s)
{
    int next_virq;

    /*
     * PIC and IOAPIC share the first 16 GSI numbers, thus the number of
     * available GSIs is larger than the number of IRQ route entries.
     * Allocating a GSI number can succeed even though a new route entry
     * cannot be added.  When this happens, flush dynamic MSI entries to
     * free IRQ route entries.
     */
    if (!kvm_direct_msi_allowed && s->irq_routes->nr == s->gsi_count) {
        kvm_flush_dynamic_msi_routes(s);
    }

    /* Return the lowest unused GSI in the bitmap */
    next_virq = find_first_zero_bit(s->used_gsi_bitmap, s->gsi_count);
    if (next_virq >= s->gsi_count) {
        return -ENOSPC;
    } else {
        return next_virq;
    }
}

static KVMMSIRoute *kvm_lookup_msi_route(KVMState *s, MSIMessage msg)
{
    unsigned int hash = kvm_hash_msi(msg.data);
    KVMMSIRoute *route;

    QTAILQ_FOREACH(route, &s->msi_hashtab[hash], entry) {
        if (route->kroute.u.msi.address_lo == (uint32_t)msg.address &&
            route->kroute.u.msi.address_hi == (msg.address >> 32) &&
            route->kroute.u.msi.data == le32_to_cpu(msg.data)) {
            return route;
        }
    }
    return NULL;
}

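/*
 * Inject an MSI.  With KVM_SIGNAL_MSI available the message is delivered
 * directly; otherwise a routing entry is allocated on demand, cached in
 * msi_hashtab keyed by a hash of the MSI data, and triggered via
 * kvm_set_irq().
 */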
int kvm_irqchip_send_msi(KVMState *s, MSIMessage msg)
{
    struct kvm_msi msi;
    KVMMSIRoute *route;

    if (kvm_direct_msi_allowed) {
        msi.address_lo = (uint32_t)msg.address;
        msi.address_hi = msg.address >> 32;
        msi.data = le32_to_cpu(msg.data);
        msi.flags = 0;
        memset(msi.pad, 0, sizeof(msi.pad));

        return kvm_vm_ioctl(s, KVM_SIGNAL_MSI, &msi);
    }

    route = kvm_lookup_msi_route(s, msg);
    if (!route) {
        int virq;

        virq = kvm_irqchip_get_virq(s);
        if (virq < 0) {
            return virq;
        }

        route = g_new0(KVMMSIRoute, 1);
        route->kroute.gsi = virq;
        route->kroute.type = KVM_IRQ_ROUTING_MSI;
        route->kroute.flags = 0;
        route->kroute.u.msi.address_lo = (uint32_t)msg.address;
        route->kroute.u.msi.address_hi = msg.address >> 32;
        route->kroute.u.msi.data = le32_to_cpu(msg.data);

        kvm_add_routing_entry(s, &route->kroute);
        kvm_irqchip_commit_routes(s);

        QTAILQ_INSERT_TAIL(&s->msi_hashtab[kvm_hash_msi(msg.data)], route,
                           entry);
    }

    assert(route->kroute.type == KVM_IRQ_ROUTING_MSI);

    return kvm_set_irq(s, route->kroute.gsi, 1);
}

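/*
 * Allocate a virq and a matching MSI routing entry for a device vector.  The
 * entry is only queued here; the caller is expected to commit the
 * accumulated KVMRouteChange (c->changes) before relying on the route.
 */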
int kvm_irqchip_add_msi_route(KVMRouteChange *c, int vector, PCIDevice *dev)
{
    struct kvm_irq_routing_entry kroute = {};
    int virq;
    KVMState *s = c->s;
    MSIMessage msg = {0, 0};

    if (pci_available && dev) {
        msg = pci_get_msi_message(dev, vector);
    }

    if (kvm_gsi_direct_mapping()) {
        return kvm_arch_msi_data_to_gsi(msg.data);
    }

    if (!kvm_gsi_routing_enabled()) {
        return -ENOSYS;
    }

    virq = kvm_irqchip_get_virq(s);
    if (virq < 0) {
        return virq;
    }

    kroute.gsi = virq;
    kroute.type = KVM_IRQ_ROUTING_MSI;
    kroute.flags = 0;
    kroute.u.msi.address_lo = (uint32_t)msg.address;
    kroute.u.msi.address_hi = msg.address >> 32;
    kroute.u.msi.data = le32_to_cpu(msg.data);
    if (pci_available && kvm_msi_devid_required()) {
        kroute.flags = KVM_MSI_VALID_DEVID;
        kroute.u.msi.devid = pci_requester_id(dev);
    }
    if (kvm_arch_fixup_msi_route(&kroute, msg.address, msg.data, dev)) {
        kvm_irqchip_release_virq(s, virq);
        return -EINVAL;
    }

    trace_kvm_irqchip_add_msi_route(dev ? dev->name : (char *)"N/A",
                                    vector, virq);

    kvm_add_routing_entry(s, &kroute);
    kvm_arch_add_msi_route_post(&kroute, vector, dev);
    c->changes++;

    return virq;
}

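/*
 * Rewrite the MSI payload (address/data) of an already routed virq, e.g.
 * after the guest reprograms the vector, and update the cached routing entry.
 */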
int kvm_irqchip_update_msi_route(KVMState *s, int virq, MSIMessage msg,
                                 PCIDevice *dev)
{
    struct kvm_irq_routing_entry kroute = {};

    if (kvm_gsi_direct_mapping()) {
        return 0;
    }

    if (!kvm_irqchip_in_kernel()) {
        return -ENOSYS;
    }

    kroute.gsi = virq;
    kroute.type = KVM_IRQ_ROUTING_MSI;
    kroute.flags = 0;
    kroute.u.msi.address_lo = (uint32_t)msg.address;
    kroute.u.msi.address_hi = msg.address >> 32;
    kroute.u.msi.data = le32_to_cpu(msg.data);
    if (pci_available && kvm_msi_devid_required()) {
        kroute.flags = KVM_MSI_VALID_DEVID;
        kroute.u.msi.devid = pci_requester_id(dev);
    }
    if (kvm_arch_fixup_msi_route(&kroute, msg.address, msg.data, dev)) {
        return -EINVAL;
    }

    trace_kvm_irqchip_update_msi_route(virq);

    return kvm_update_routing_entry(s, &kroute);
}

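/*
 * Assign or deassign an irqfd that triggers 'virq'.  An optional resample
 * notifier is either handed to the kernel (KVM_IRQFD_FLAG_RESAMPLE) or, in
 * the split irqchip case, tracked in userspace so it can be kicked on EOI;
 * see the comment inside for the reasoning.
 */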
static int kvm_irqchip_assign_irqfd(KVMState *s, EventNotifier *event,
                                    EventNotifier *resample, int virq,
                                    bool assign)
{
    int fd = event_notifier_get_fd(event);
    int rfd = resample ? event_notifier_get_fd(resample) : -1;

    struct kvm_irqfd irqfd = {
        .fd = fd,
        .gsi = virq,
        .flags = assign ? 0 : KVM_IRQFD_FLAG_DEASSIGN,
    };

    if (rfd != -1) {
        assert(assign);
        if (kvm_irqchip_is_split()) {
            /*
             * When the slow irqchip (e.g. IOAPIC) is in the
             * userspace, the KVM kernel resamplefd will not work
             * because the EOI of the interrupt will be delivered to
             * userspace instead, so the KVM kernel resamplefd kick
             * will be skipped.  The userspace here mimics what the
             * kernel provides with resamplefd: remember the
             * resamplefd and kick it when we receive the EOI of this
             * IRQ.
             *
             * This is hackery because the IOAPIC is mostly bypassed
             * (except EOI broadcasts) when irqfd is used.  However,
             * this can bring much performance back for split irqchip
             * with INTx IRQs (for VFIO, this gives 93% perf of the
             * full fast path, which is a 46% perf boost compared to
             * the INTx slow path).
             */
            kvm_resample_fd_insert(virq, resample);
        } else {
            irqfd.flags |= KVM_IRQFD_FLAG_RESAMPLE;
            irqfd.resamplefd = rfd;
        }
    } else if (!assign) {
        if (kvm_irqchip_is_split()) {
            kvm_resample_fd_remove(virq);
        }
    }

    if (!kvm_irqfds_enabled()) {
        return -ENOSYS;
    }

    return kvm_vm_ioctl(s, KVM_IRQFD, &irqfd);
}

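/* Allocate a virq and install an s390 adapter routing entry for it. */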
int kvm_irqchip_add_adapter_route(KVMState *s, AdapterInfo *adapter)
{
    struct kvm_irq_routing_entry kroute = {};
    int virq;

    if (!kvm_gsi_routing_enabled()) {
        return -ENOSYS;
    }

    virq = kvm_irqchip_get_virq(s);
    if (virq < 0) {
        return virq;
    }

    kroute.gsi = virq;
    kroute.type = KVM_IRQ_ROUTING_S390_ADAPTER;
    kroute.flags = 0;
    kroute.u.adapter.summary_addr = adapter->summary_addr;
    kroute.u.adapter.ind_addr = adapter->ind_addr;
    kroute.u.adapter.summary_offset = adapter->summary_offset;
    kroute.u.adapter.ind_offset = adapter->ind_offset;
    kroute.u.adapter.adapter_id = adapter->adapter_id;

    kvm_add_routing_entry(s, &kroute);

    return virq;
}

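/*
 * Allocate a virq and route it to a Hyper-V SynIC SINT of the given vcpu.
 * Requires both GSI routing and KVM_CAP_HYPERV_SYNIC.
 */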
int kvm_irqchip_add_hv_sint_route(KVMState *s, uint32_t vcpu, uint32_t sint)
{
    struct kvm_irq_routing_entry kroute = {};
    int virq;

    if (!kvm_gsi_routing_enabled()) {
        return -ENOSYS;
    }
    if (!kvm_check_extension(s, KVM_CAP_HYPERV_SYNIC)) {
        return -ENOSYS;
    }
    virq = kvm_irqchip_get_virq(s);
    if (virq < 0) {
        return virq;
    }

    kroute.gsi = virq;
    kroute.type = KVM_IRQ_ROUTING_HV_SINT;
    kroute.flags = 0;
    kroute.u.hv_sint.vcpu = vcpu;
    kroute.u.hv_sint.sint = sint;

    kvm_add_routing_entry(s, &kroute);
    kvm_irqchip_commit_routes(s);

    return virq;
}

#else /* !KVM_CAP_IRQ_ROUTING */

void kvm_init_irq_routing(KVMState *s)
{
}

void kvm_irqchip_release_virq(KVMState *s, int virq)
{
}

int kvm_irqchip_send_msi(KVMState *s, MSIMessage msg)
{
    abort();
}

int kvm_irqchip_add_msi_route(KVMRouteChange *c, int vector, PCIDevice *dev)
{
    return -ENOSYS;
}

int kvm_irqchip_add_adapter_route(KVMState *s, AdapterInfo *adapter)
{
    return -ENOSYS;
}

int kvm_irqchip_add_hv_sint_route(KVMState *s, uint32_t vcpu, uint32_t sint)
{
    return -ENOSYS;
}

static int kvm_irqchip_assign_irqfd(KVMState *s, EventNotifier *event,
                                    EventNotifier *resample, int virq,
                                    bool assign)
{
    abort();
}

int kvm_irqchip_update_msi_route(KVMState *s, int virq, MSIMessage msg)
{
    return -ENOSYS;
}

#endif /* !KVM_CAP_IRQ_ROUTING */

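/*
 * Public irqfd helpers.  The _gsi variants attach or detach an event notifier
 * to a GSI directly; the qemu_irq variants look the GSI up in the gsimap hash
 * table populated by kvm_irqchip_set_qemuirq_gsi().
 */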
int kvm_irqchip_add_irqfd_notifier_gsi(KVMState *s, EventNotifier *n,
                                       EventNotifier *rn, int virq)
{
    return kvm_irqchip_assign_irqfd(s, n, rn, virq, true);
}

int kvm_irqchip_remove_irqfd_notifier_gsi(KVMState *s, EventNotifier *n,
                                          int virq)
{
    return kvm_irqchip_assign_irqfd(s, n, NULL, virq, false);
}

int kvm_irqchip_add_irqfd_notifier(KVMState *s, EventNotifier *n,
                                   EventNotifier *rn, qemu_irq irq)
{
    gpointer key, gsi;
    gboolean found = g_hash_table_lookup_extended(s->gsimap, irq, &key, &gsi);

    if (!found) {
        return -ENXIO;
    }
    return kvm_irqchip_add_irqfd_notifier_gsi(s, n, rn, GPOINTER_TO_INT(gsi));
}

int kvm_irqchip_remove_irqfd_notifier(KVMState *s, EventNotifier *n,
                                      qemu_irq irq)
{
    gpointer key, gsi;
    gboolean found = g_hash_table_lookup_extended(s->gsimap, irq, &key, &gsi);

    if (!found) {
        return -ENXIO;
    }
    return kvm_irqchip_remove_irqfd_notifier_gsi(s, n, GPOINTER_TO_INT(gsi));
}

void kvm_irqchip_set_qemuirq_gsi(KVMState *s, qemu_irq irq, int gsi)
{
    g_hash_table_insert(s->gsimap, irq, GINT_TO_POINTER(gsi));
}

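/*
 * Create the in-kernel irqchip (or enable it on s390) if the host supports
 * one, then initialize GSI routing and the qemu_irq -> GSI map.  A failure to
 * create an advertised irqchip is fatal.
 */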
static void kvm_irqchip_create(KVMState *s)
{
    int ret;

    assert(s->kernel_irqchip_split != ON_OFF_AUTO_AUTO);
    if (kvm_check_extension(s, KVM_CAP_IRQCHIP)) {
        ;
    } else if (kvm_check_extension(s, KVM_CAP_S390_IRQCHIP)) {
        ret = kvm_vm_enable_cap(s, KVM_CAP_S390_IRQCHIP, 0);
        if (ret < 0) {
            fprintf(stderr, "Enable kernel irqchip failed: %s\n", strerror(-ret));
            exit(1);
        }
    } else {
        return;
    }

    /* First probe and see if there's an arch-specific hook to create the
     * in-kernel irqchip for us */
    ret = kvm_arch_irqchip_create(s);
    if (ret == 0) {
        if (s->kernel_irqchip_split == ON_OFF_AUTO_ON) {
            error_report("Split IRQ chip mode not supported.");
            exit(1);
        } else {
            ret = kvm_vm_ioctl(s, KVM_CREATE_IRQCHIP);
        }
    }
    if (ret < 0) {
        fprintf(stderr, "Create kernel irqchip failed: %s\n", strerror(-ret));
        exit(1);
    }

    kvm_kernel_irqchip = true;
    /* If we have an in-kernel IRQ chip then we must have asynchronous
     * interrupt delivery (though the reverse is not necessarily true)
     */
    kvm_async_interrupts_allowed = true;
    kvm_halt_in_kernel_allowed = true;

    kvm_init_irq_routing(s);

    s->gsimap = g_hash_table_new(g_direct_hash, g_direct_equal);
}

/* Find number of supported CPUs using the recommended
 * procedure from the kernel API documentation to cope with
 * older kernels that may be missing capabilities.
 */
static int kvm_recommended_vcpus(KVMState *s)
{
    int ret = kvm_vm_check_extension(s, KVM_CAP_NR_VCPUS);
    return (ret) ? ret : 4;
}

static int kvm_max_vcpus(KVMState *s)
{
    int ret = kvm_check_extension(s, KVM_CAP_MAX_VCPUS);
    return (ret) ? ret : kvm_recommended_vcpus(s);
}

static int kvm_max_vcpu_id(KVMState *s)
{
    int ret = kvm_check_extension(s, KVM_CAP_MAX_VCPU_ID);
    return (ret) ? ret : kvm_max_vcpus(s);
}

bool kvm_vcpu_id_is_valid(int vcpu_id)
{
    KVMState *s = KVM_STATE(current_accel());
    return vcpu_id >= 0 && vcpu_id < kvm_max_vcpu_id(s);
}

bool kvm_dirty_ring_enabled(void)
{
    return kvm_state->kvm_dirty_ring_size ? true : false;
}

static void query_stats_cb(StatsResultList **result, StatsTarget target,
                           strList *names, strList *targets, Error **errp);
static void query_stats_schemas_cb(StatsSchemaList **result, Error **errp);

uint32_t kvm_dirty_ring_size(void)
{
    return kvm_state->kvm_dirty_ring_size;
}

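/*
 * Accelerator init: open /dev/kvm, create the VM (retrying on EINTR), probe
 * the capabilities this file relies on, create the irqchip if allowed and
 * register the memory listeners.  Returns 0 on success or a negative errno,
 * closing any file descriptors opened so far on failure.
 */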
static int kvm_init(MachineState *ms)
{
    MachineClass *mc = MACHINE_GET_CLASS(ms);
    static const char upgrade_note[] =
        "Please upgrade to at least kernel 2.6.29 or recent kvm-kmod\n"
        "(see http://sourceforge.net/projects/kvm).\n";
    const struct {
        const char *name;
        int num;
    } num_cpus[] = {
        { "SMP",          ms->smp.cpus },
        { "hotpluggable", ms->smp.max_cpus },
        { /* end of list */ }
    }, *nc = num_cpus;
    int soft_vcpus_limit, hard_vcpus_limit;
    KVMState *s;
    const KVMCapabilityInfo *missing_cap;
    int ret;
    int type;
    uint64_t dirty_log_manual_caps;

    qemu_mutex_init(&kml_slots_lock);

    s = KVM_STATE(ms->accelerator);

    /*
     * On systems where the kernel can support different base page
     * sizes, host page size may be different from TARGET_PAGE_SIZE,
     * even with KVM.  TARGET_PAGE_SIZE is assumed to be the minimum
     * page size for the system though.
     */
    assert(TARGET_PAGE_SIZE <= qemu_real_host_page_size());

    s->sigmask_len = 8;
    accel_blocker_init();

#ifdef KVM_CAP_SET_GUEST_DEBUG
    QTAILQ_INIT(&s->kvm_sw_breakpoints);
#endif
    QLIST_INIT(&s->kvm_parked_vcpus);
    s->fd = qemu_open_old("/dev/kvm", O_RDWR);
    if (s->fd == -1) {
        fprintf(stderr, "Could not access KVM kernel module: %m\n");
        ret = -errno;
        goto err;
    }

    ret = kvm_ioctl(s, KVM_GET_API_VERSION, 0);
    if (ret < KVM_API_VERSION) {
        if (ret >= 0) {
            ret = -EINVAL;
        }
        fprintf(stderr, "kvm version too old\n");
        goto err;
    }

    if (ret > KVM_API_VERSION) {
        ret = -EINVAL;
        fprintf(stderr, "kvm version not supported\n");
        goto err;
    }

    kvm_immediate_exit = kvm_check_extension(s, KVM_CAP_IMMEDIATE_EXIT);
    s->nr_slots = kvm_check_extension(s, KVM_CAP_NR_MEMSLOTS);

    /* If unspecified, use the default value */
    if (!s->nr_slots) {
        s->nr_slots = 32;
    }

    s->nr_as = kvm_check_extension(s, KVM_CAP_MULTI_ADDRESS_SPACE);
    if (s->nr_as <= 1) {
        s->nr_as = 1;
    }
    s->as = g_new0(struct KVMAs, s->nr_as);

    if (object_property_find(OBJECT(current_machine), "kvm-type")) {
        g_autofree char *kvm_type = object_property_get_str(OBJECT(current_machine),
                                                            "kvm-type",
                                                            &error_abort);
        type = mc->kvm_type(ms, kvm_type);
    } else if (mc->kvm_type) {
        type = mc->kvm_type(ms, NULL);
    } else {
        type = kvm_arch_get_default_type(ms);
    }

    if (type < 0) {
        ret = -EINVAL;
        goto err;
    }

    do {
        ret = kvm_ioctl(s, KVM_CREATE_VM, type);
    } while (ret == -EINTR);

    if (ret < 0) {
        fprintf(stderr, "ioctl(KVM_CREATE_VM) failed: %d %s\n", -ret,
                strerror(-ret));

#ifdef TARGET_S390X
        if (ret == -EINVAL) {
            fprintf(stderr,
                    "Host kernel setup problem detected. Please verify:\n");
            fprintf(stderr, "- for kernels supporting the switch_amode or"
                    " user_mode parameters, whether\n");
            fprintf(stderr,
                    "  user space is running in primary address space\n");
            fprintf(stderr,
                    "- for kernels supporting the vm.allocate_pgste sysctl, "
                    "whether it is enabled\n");
        }
#elif defined(TARGET_PPC)
        if (ret == -EINVAL) {
            fprintf(stderr,
                    "PPC KVM module is not loaded. Try modprobe kvm_%s.\n",
                    (type == 2) ? "pr" : "hv");
        }
#endif
        goto err;
    }

    s->vmfd = ret;

    /* check the vcpu limits */
    soft_vcpus_limit = kvm_recommended_vcpus(s);
    hard_vcpus_limit = kvm_max_vcpus(s);

    while (nc->name) {
        if (nc->num > soft_vcpus_limit) {
            warn_report("Number of %s cpus requested (%d) exceeds "
                        "the recommended cpus supported by KVM (%d)",
                        nc->name, nc->num, soft_vcpus_limit);

            if (nc->num > hard_vcpus_limit) {
                fprintf(stderr, "Number of %s cpus requested (%d) exceeds "
                        "the maximum cpus supported by KVM (%d)\n",
                        nc->name, nc->num, hard_vcpus_limit);
                exit(1);
            }
        }
        nc++;
    }

    missing_cap = kvm_check_extension_list(s, kvm_required_capabilites);
    if (!missing_cap) {
        missing_cap =
            kvm_check_extension_list(s, kvm_arch_required_capabilities);
    }
    if (missing_cap) {
        ret = -EINVAL;
        fprintf(stderr, "kvm does not support %s\n%s",
                missing_cap->name, upgrade_note);
        goto err;
    }

    s->coalesced_mmio = kvm_check_extension(s, KVM_CAP_COALESCED_MMIO);
    s->coalesced_pio = s->coalesced_mmio &&
                       kvm_check_extension(s, KVM_CAP_COALESCED_PIO);

    /*
     * Enable KVM dirty ring if supported, otherwise fall back to
     * dirty logging mode
     */
    ret = kvm_dirty_ring_init(s);
    if (ret < 0) {
        goto err;
    }

    /*
     * KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 is not needed when dirty ring is
     * enabled.  More importantly, KVM_DIRTY_LOG_INITIALLY_SET will assume no
     * page is wr-protected initially, which is against how the kvm dirty ring
     * is used: the dirty ring requires that all pages be wr-protected at the
     * very beginning.  Enabling this feature for dirty ring causes data
     * corruption.
     *
     * TODO: Without KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 and kvm clear dirty log,
     * we may expect a higher stall time when starting the migration.  In the
     * future we can enable KVM_CLEAR_DIRTY_LOG to work with dirty ring too:
     * instead of clearing the dirty bit, it can be a way to explicitly
     * wr-protect guest pages.
     */
    if (!s->kvm_dirty_ring_size) {
        dirty_log_manual_caps =
            kvm_check_extension(s, KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2);
        dirty_log_manual_caps &= (KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE |
                                  KVM_DIRTY_LOG_INITIALLY_SET);
        s->manual_dirty_log_protect = dirty_log_manual_caps;
        if (dirty_log_manual_caps) {
            ret = kvm_vm_enable_cap(s, KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2, 0,
                                    dirty_log_manual_caps);
            if (ret) {
                warn_report("Trying to enable capability %"PRIu64" of "
                            "KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 but failed. "
                            "Falling back to the legacy mode. ",
                            dirty_log_manual_caps);
                s->manual_dirty_log_protect = 0;
            }
        }
    }

#ifdef KVM_CAP_VCPU_EVENTS
    s->vcpu_events = kvm_check_extension(s, KVM_CAP_VCPU_EVENTS);
#endif

    s->robust_singlestep =
        kvm_check_extension(s, KVM_CAP_X86_ROBUST_SINGLESTEP);

#ifdef KVM_CAP_DEBUGREGS
    s->debugregs = kvm_check_extension(s, KVM_CAP_DEBUGREGS);
#endif

    s->max_nested_state_len = kvm_check_extension(s, KVM_CAP_NESTED_STATE);

#ifdef KVM_CAP_IRQ_ROUTING
    kvm_direct_msi_allowed = (kvm_check_extension(s, KVM_CAP_SIGNAL_MSI) > 0);
#endif

    s->intx_set_mask = kvm_check_extension(s, KVM_CAP_PCI_2_3);

    s->irq_set_ioctl = KVM_IRQ_LINE;
    if (kvm_check_extension(s, KVM_CAP_IRQ_INJECT_STATUS)) {
        s->irq_set_ioctl = KVM_IRQ_LINE_STATUS;
    }

    kvm_readonly_mem_allowed =
        (kvm_check_extension(s, KVM_CAP_READONLY_MEM) > 0);

    kvm_eventfds_allowed =
        (kvm_check_extension(s, KVM_CAP_IOEVENTFD) > 0);

    kvm_irqfds_allowed =
        (kvm_check_extension(s, KVM_CAP_IRQFD) > 0);

    kvm_resamplefds_allowed =
        (kvm_check_extension(s, KVM_CAP_IRQFD_RESAMPLE) > 0);

    kvm_vm_attributes_allowed =
        (kvm_check_extension(s, KVM_CAP_VM_ATTRIBUTES) > 0);

    kvm_ioeventfd_any_length_allowed =
        (kvm_check_extension(s, KVM_CAP_IOEVENTFD_ANY_LENGTH) > 0);

#ifdef KVM_CAP_SET_GUEST_DEBUG
    kvm_has_guest_debug =
        (kvm_check_extension(s, KVM_CAP_SET_GUEST_DEBUG) > 0);
#endif

    kvm_sstep_flags = 0;
    if (kvm_has_guest_debug) {
        kvm_sstep_flags = SSTEP_ENABLE;

#if defined KVM_CAP_SET_GUEST_DEBUG2
        int guest_debug_flags =
            kvm_check_extension(s, KVM_CAP_SET_GUEST_DEBUG2);

        if (guest_debug_flags & KVM_GUESTDBG_BLOCKIRQ) {
            kvm_sstep_flags |= SSTEP_NOIRQ;
        }
#endif
    }

    kvm_state = s;

    ret = kvm_arch_init(ms, s);
    if (ret < 0) {
        goto err;
    }

    if (s->kernel_irqchip_split == ON_OFF_AUTO_AUTO) {
        s->kernel_irqchip_split = mc->default_kernel_irqchip_split ? ON_OFF_AUTO_ON : ON_OFF_AUTO_OFF;
    }

    qemu_register_reset(kvm_unpoison_all, NULL);

    if (s->kernel_irqchip_allowed) {
        kvm_irqchip_create(s);
    }

    if (kvm_eventfds_allowed) {
        s->memory_listener.listener.eventfd_add = kvm_mem_ioeventfd_add;
        s->memory_listener.listener.eventfd_del = kvm_mem_ioeventfd_del;
    }
    s->memory_listener.listener.coalesced_io_add = kvm_coalesce_mmio_region;
    s->memory_listener.listener.coalesced_io_del = kvm_uncoalesce_mmio_region;

    kvm_memory_listener_register(s, &s->memory_listener,
                                 &address_space_memory, 0, "kvm-memory");
    if (kvm_eventfds_allowed) {
        memory_listener_register(&kvm_io_listener,
                                 &address_space_io);
    }
    memory_listener_register(&kvm_coalesced_pio_listener,
                             &address_space_io);

    s->many_ioeventfds = kvm_check_many_ioeventfds();

    s->sync_mmu = !!kvm_vm_check_extension(kvm_state, KVM_CAP_SYNC_MMU);
    if (!s->sync_mmu) {
        ret = ram_block_discard_disable(true);
        assert(!ret);
    }

    if (s->kvm_dirty_ring_size) {
        kvm_dirty_ring_reaper_init(s);
    }

    if (kvm_check_extension(kvm_state, KVM_CAP_BINARY_STATS_FD)) {
        add_stats_callbacks(STATS_PROVIDER_KVM, query_stats_cb,
                            query_stats_schemas_cb);
    }

    return 0;

err:
    assert(ret < 0);
    if (s->vmfd >= 0) {
        close(s->vmfd);
    }
    if (s->fd != -1) {
        close(s->fd);
    }
    g_free(s->as);
    g_free(s->memory_listener.slots);

    return ret;
}

void kvm_set_sigmask_len(KVMState *s, unsigned int sigmask_len)
{
    s->sigmask_len = sigmask_len;
}

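/*
 * Replay a (possibly batched) PIO exit: issue 'count' accesses of 'size'
 * bytes to address_space_io, walking through the data area that the kernel
 * shares with us in struct kvm_run.
 */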
static void kvm_handle_io(uint16_t port, MemTxAttrs attrs, void *data, int direction,
                          int size, uint32_t count)
{
    int i;
    uint8_t *ptr = data;

    for (i = 0; i < count; i++) {
        address_space_rw(&address_space_io, port, attrs,
                         ptr, size,
                         direction == KVM_EXIT_IO_OUT);
        ptr += size;
    }
}

static int kvm_handle_internal_error(CPUState *cpu, struct kvm_run *run)
{
    fprintf(stderr, "KVM internal error. Suberror: %d\n",
            run->internal.suberror);

    if (kvm_check_extension(kvm_state, KVM_CAP_INTERNAL_ERROR_DATA)) {
        int i;

        for (i = 0; i < run->internal.ndata; ++i) {
            fprintf(stderr, "extra data[%d]: 0x%016"PRIx64"\n",
                    i, (uint64_t)run->internal.data[i]);
        }
    }
    if (run->internal.suberror == KVM_INTERNAL_ERROR_EMULATION) {
        fprintf(stderr, "emulation failure\n");
        if (!kvm_arch_stop_on_emulation_error(cpu)) {
            cpu_dump_state(cpu, stderr, CPU_DUMP_CODE);
            return EXCP_INTERRUPT;
        }
    }
    /* FIXME: Should trigger a qmp message to let management know
     * something went wrong.
     */
    return -1;
}

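/*
 * Drain the coalesced MMIO/PIO ring shared with the kernel, replaying each
 * pending access into the right address space.  Re-entrancy is guarded by
 * coalesced_flush_in_progress.
 */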
void kvm_flush_coalesced_mmio_buffer(void)
{
    KVMState *s = kvm_state;

    if (!s || s->coalesced_flush_in_progress) {
        return;
    }

    s->coalesced_flush_in_progress = true;

    if (s->coalesced_mmio_ring) {
        struct kvm_coalesced_mmio_ring *ring = s->coalesced_mmio_ring;
        while (ring->first != ring->last) {
            struct kvm_coalesced_mmio *ent;

            ent = &ring->coalesced_mmio[ring->first];

            if (ent->pio == 1) {
                address_space_write(&address_space_io, ent->phys_addr,
                                    MEMTXATTRS_UNSPECIFIED, ent->data,
                                    ent->len);
            } else {
                cpu_physical_memory_write(ent->phys_addr, ent->data, ent->len);
            }
            smp_wmb();
            ring->first = (ring->first + 1) % KVM_COALESCED_MMIO_MAX;
        }
    }

    s->coalesced_flush_in_progress = false;
}

bool kvm_cpu_check_are_resettable(void)
{
    return kvm_arch_cpu_check_are_resettable();
}

static void do_kvm_cpu_synchronize_state(CPUState *cpu, run_on_cpu_data arg)
{
    if (!cpu->vcpu_dirty) {
        kvm_arch_get_registers(cpu);
        cpu->vcpu_dirty = true;
    }
}

void kvm_cpu_synchronize_state(CPUState *cpu)
{
    if (!cpu->vcpu_dirty) {
        run_on_cpu(cpu, do_kvm_cpu_synchronize_state, RUN_ON_CPU_NULL);
    }
}

static void do_kvm_cpu_synchronize_post_reset(CPUState *cpu, run_on_cpu_data arg)
{
    kvm_arch_put_registers(cpu, KVM_PUT_RESET_STATE);
    cpu->vcpu_dirty = false;
}

void kvm_cpu_synchronize_post_reset(CPUState *cpu)
{
    run_on_cpu(cpu, do_kvm_cpu_synchronize_post_reset, RUN_ON_CPU_NULL);
}

static void do_kvm_cpu_synchronize_post_init(CPUState *cpu, run_on_cpu_data arg)
{
    kvm_arch_put_registers(cpu, KVM_PUT_FULL_STATE);
    cpu->vcpu_dirty = false;
}

void kvm_cpu_synchronize_post_init(CPUState *cpu)
{
    run_on_cpu(cpu, do_kvm_cpu_synchronize_post_init, RUN_ON_CPU_NULL);
}

static void do_kvm_cpu_synchronize_pre_loadvm(CPUState *cpu, run_on_cpu_data arg)
{
    cpu->vcpu_dirty = true;
}

void kvm_cpu_synchronize_pre_loadvm(CPUState *cpu)
{
    run_on_cpu(cpu, do_kvm_cpu_synchronize_pre_loadvm, RUN_ON_CPU_NULL);
}

#ifdef KVM_HAVE_MCE_INJECTION
static __thread void *pending_sigbus_addr;
static __thread int pending_sigbus_code;
static __thread bool have_sigbus_pending;
#endif

static void kvm_cpu_kick(CPUState *cpu)
{
    qatomic_set(&cpu->kvm_run->immediate_exit, 1);
}

static void kvm_cpu_kick_self(void)
{
    if (kvm_immediate_exit) {
        kvm_cpu_kick(current_cpu);
    } else {
        qemu_cpu_kick_self();
    }
}

static void kvm_eat_signals(CPUState *cpu)
{
    struct timespec ts = { 0, 0 };
    siginfo_t siginfo;
    sigset_t waitset;
    sigset_t chkset;
    int r;

    if (kvm_immediate_exit) {
        qatomic_set(&cpu->kvm_run->immediate_exit, 0);
        /* Write kvm_run->immediate_exit before the cpu->exit_request
         * write in kvm_cpu_exec.
         */
        smp_wmb();
        return;
    }

    sigemptyset(&waitset);
    sigaddset(&waitset, SIG_IPI);

    do {
        r = sigtimedwait(&waitset, &siginfo, &ts);
        if (r == -1 && !(errno == EAGAIN || errno == EINTR)) {
            perror("sigtimedwait");
            exit(1);
        }

        r = sigpending(&chkset);
        if (r == -1) {
            perror("sigpending");
            exit(1);
        }
    } while (sigismember(&chkset, SIG_IPI));
}

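/*
 * Main vCPU loop: flush dirty register state, enter the guest with KVM_RUN
 * (outside the BQL), then dispatch on the exit reason until one of the
 * handlers asks to leave the loop.  Returns an EXCP_* value or a negative
 * number on fatal errors, in which case the VM is stopped.
 */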
int kvm_cpu_exec(CPUState *cpu)
{
    struct kvm_run *run = cpu->kvm_run;
    int ret, run_ret;

    DPRINTF("kvm_cpu_exec()\n");

    if (kvm_arch_process_async_events(cpu)) {
        qatomic_set(&cpu->exit_request, 0);
        return EXCP_HLT;
    }

    qemu_mutex_unlock_iothread();
    cpu_exec_start(cpu);

    do {
        MemTxAttrs attrs;

        if (cpu->vcpu_dirty) {
            kvm_arch_put_registers(cpu, KVM_PUT_RUNTIME_STATE);
            cpu->vcpu_dirty = false;
        }

        kvm_arch_pre_run(cpu, run);
        if (qatomic_read(&cpu->exit_request)) {
            DPRINTF("interrupt exit requested\n");
            /*
             * KVM requires us to reenter the kernel after IO exits to complete
             * instruction emulation. This self-signal will ensure that we
             * leave ASAP again.
             */
            kvm_cpu_kick_self();
        }

        /* Read cpu->exit_request before KVM_RUN reads run->immediate_exit.
         * Matching barrier in kvm_eat_signals.
         */
        smp_rmb();

        run_ret = kvm_vcpu_ioctl(cpu, KVM_RUN, 0);

        attrs = kvm_arch_post_run(cpu, run);

#ifdef KVM_HAVE_MCE_INJECTION
        if (unlikely(have_sigbus_pending)) {
            qemu_mutex_lock_iothread();
            kvm_arch_on_sigbus_vcpu(cpu, pending_sigbus_code,
                                    pending_sigbus_addr);
            have_sigbus_pending = false;
            qemu_mutex_unlock_iothread();
        }
#endif

        if (run_ret < 0) {
            if (run_ret == -EINTR || run_ret == -EAGAIN) {
                DPRINTF("io window exit\n");
                kvm_eat_signals(cpu);
                ret = EXCP_INTERRUPT;
                break;
            }
            fprintf(stderr, "error: kvm run failed %s\n",
                    strerror(-run_ret));
#ifdef TARGET_PPC
            if (run_ret == -EBUSY) {
                fprintf(stderr,
                        "This is probably because your SMT is enabled.\n"
                        "VCPU can only run on primary threads with all "
                        "secondary threads offline.\n");
            }
#endif
            ret = -1;
            break;
        }

        trace_kvm_run_exit(cpu->cpu_index, run->exit_reason);
        switch (run->exit_reason) {
        case KVM_EXIT_IO:
            DPRINTF("handle_io\n");
            /* Called outside BQL */
            kvm_handle_io(run->io.port, attrs,
                          (uint8_t *)run + run->io.data_offset,
                          run->io.direction,
                          run->io.size,
                          run->io.count);
            ret = 0;
            break;
        case KVM_EXIT_MMIO:
            DPRINTF("handle_mmio\n");
            /* Called outside BQL */
            address_space_rw(&address_space_memory,
                             run->mmio.phys_addr, attrs,
                             run->mmio.data,
                             run->mmio.len,
                             run->mmio.is_write);
            ret = 0;
            break;
        case KVM_EXIT_IRQ_WINDOW_OPEN:
            DPRINTF("irq_window_open\n");
            ret = EXCP_INTERRUPT;
            break;
        case KVM_EXIT_SHUTDOWN:
            DPRINTF("shutdown\n");
            qemu_system_reset_request(SHUTDOWN_CAUSE_GUEST_RESET);
            ret = EXCP_INTERRUPT;
            break;
        case KVM_EXIT_UNKNOWN:
            fprintf(stderr, "KVM: unknown exit, hardware reason %" PRIx64 "\n",
                    (uint64_t)run->hw.hardware_exit_reason);
            ret = -1;
            break;
        case KVM_EXIT_INTERNAL_ERROR:
            ret = kvm_handle_internal_error(cpu, run);
            break;
        case KVM_EXIT_DIRTY_RING_FULL:
            /*
             * We shouldn't continue if the dirty ring of this vcpu is
             * still full.  Got kicked by KVM_RESET_DIRTY_RINGS.
             */
            trace_kvm_dirty_ring_full(cpu->cpu_index);
            qemu_mutex_lock_iothread();
            /*
             * We throttle the vCPU by making it sleep once it exits the
             * kernel due to a full dirty ring.  In the dirtylimit scenario,
             * reaping all vCPUs after a single vCPU's dirty ring gets full
             * would skip that sleep, so just reap the ring-full vCPU.
             */
            if (dirtylimit_in_service()) {
                kvm_dirty_ring_reap(kvm_state, cpu);
            } else {
                kvm_dirty_ring_reap(kvm_state, NULL);
            }
            qemu_mutex_unlock_iothread();
            dirtylimit_vcpu_execute(cpu);
            ret = 0;
            break;
        case KVM_EXIT_SYSTEM_EVENT:
            switch (run->system_event.type) {
            case KVM_SYSTEM_EVENT_SHUTDOWN:
                qemu_system_shutdown_request(SHUTDOWN_CAUSE_GUEST_SHUTDOWN);
                ret = EXCP_INTERRUPT;
                break;
            case KVM_SYSTEM_EVENT_RESET:
                qemu_system_reset_request(SHUTDOWN_CAUSE_GUEST_RESET);
                ret = EXCP_INTERRUPT;
                break;
            case KVM_SYSTEM_EVENT_CRASH:
                kvm_cpu_synchronize_state(cpu);
                qemu_mutex_lock_iothread();
                qemu_system_guest_panicked(cpu_get_crash_info(cpu));
                qemu_mutex_unlock_iothread();
                ret = 0;
                break;
            default:
                DPRINTF("kvm_arch_handle_exit\n");
                ret = kvm_arch_handle_exit(cpu, run);
                break;
            }
            break;
        default:
            DPRINTF("kvm_arch_handle_exit\n");
            ret = kvm_arch_handle_exit(cpu, run);
            break;
        }
    } while (ret == 0);

    cpu_exec_end(cpu);
    qemu_mutex_lock_iothread();

    if (ret < 0) {
        cpu_dump_state(cpu, stderr, CPU_DUMP_CODE);
        vm_stop(RUN_STATE_INTERNAL_ERROR);
    }

    qatomic_set(&cpu->exit_request, 0);
    return ret;
}

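/*
 * Thin wrappers around ioctl(2) on the /dev/kvm, VM, vCPU and device file
 * descriptors.  They trace each call, convert the -1/errno convention into a
 * negative errno return value, and bracket the VM, vCPU and device ioctls
 * with accel_ioctl_begin()/end() (or the per-CPU variants).
 */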
int kvm_ioctl(KVMState *s, int type, ...)
{
    int ret;
    void *arg;
    va_list ap;

    va_start(ap, type);
    arg = va_arg(ap, void *);
    va_end(ap);

    trace_kvm_ioctl(type, arg);
    ret = ioctl(s->fd, type, arg);
    if (ret == -1) {
        ret = -errno;
    }
    return ret;
}

int kvm_vm_ioctl(KVMState *s, int type, ...)
{
    int ret;
    void *arg;
    va_list ap;

    va_start(ap, type);
    arg = va_arg(ap, void *);
    va_end(ap);

    trace_kvm_vm_ioctl(type, arg);
    accel_ioctl_begin();
    ret = ioctl(s->vmfd, type, arg);
    accel_ioctl_end();
    if (ret == -1) {
        ret = -errno;
    }
    return ret;
}

int kvm_vcpu_ioctl(CPUState *cpu, int type, ...)
{
    int ret;
    void *arg;
    va_list ap;

    va_start(ap, type);
    arg = va_arg(ap, void *);
    va_end(ap);

    trace_kvm_vcpu_ioctl(cpu->cpu_index, type, arg);
    accel_cpu_ioctl_begin(cpu);
    ret = ioctl(cpu->kvm_fd, type, arg);
    accel_cpu_ioctl_end(cpu);
    if (ret == -1) {
        ret = -errno;
    }
    return ret;
}

int kvm_device_ioctl(int fd, int type, ...)
{
    int ret;
    void *arg;
    va_list ap;

    va_start(ap, type);
    arg = va_arg(ap, void *);
    va_end(ap);

    trace_kvm_device_ioctl(fd, type, arg);
    accel_ioctl_begin();
    ret = ioctl(fd, type, arg);
    accel_ioctl_end();
    if (ret == -1) {
        ret = -errno;
    }
    return ret;
}

int kvm_vm_check_attr(KVMState *s, uint32_t group, uint64_t attr)
{
    int ret;
    struct kvm_device_attr attribute = {
        .group = group,
        .attr = attr,
    };

    if (!kvm_vm_attributes_allowed) {
        return 0;
    }

    ret = kvm_vm_ioctl(s, KVM_HAS_DEVICE_ATTR, &attribute);
    /* kvm returns 0 on success for HAS_DEVICE_ATTR */
    return ret ? 0 : 1;
}

int kvm_device_check_attr(int dev_fd, uint32_t group, uint64_t attr)
{
    struct kvm_device_attr attribute = {
        .group = group,
        .attr = attr,
        .flags = 0,
    };

    return kvm_device_ioctl(dev_fd, KVM_HAS_DEVICE_ATTR, &attribute) ? 0 : 1;
}

int kvm_device_access(int fd, int group, uint64_t attr,
                      void *val, bool write, Error **errp)
{
    struct kvm_device_attr kvmattr;
    int err;

    kvmattr.flags = 0;
    kvmattr.group = group;
    kvmattr.attr = attr;
    kvmattr.addr = (uintptr_t)val;

    err = kvm_device_ioctl(fd,
                           write ? KVM_SET_DEVICE_ATTR : KVM_GET_DEVICE_ATTR,
                           &kvmattr);
    if (err < 0) {
        error_setg_errno(errp, -err,
                         "KVM_%s_DEVICE_ATTR failed: Group %d "
                         "attr 0x%016" PRIx64,
                         write ? "SET" : "GET", group, attr);
    }
    return err;
}

bool kvm_has_sync_mmu(void)
{
    return kvm_state->sync_mmu;
}

int kvm_has_vcpu_events(void)
{
    return kvm_state->vcpu_events;
}

int kvm_has_robust_singlestep(void)
{
    return kvm_state->robust_singlestep;
}

int kvm_has_debugregs(void)
{
    return kvm_state->debugregs;
}

int kvm_max_nested_state_length(void)
{
    return kvm_state->max_nested_state_len;
}

int kvm_has_many_ioeventfds(void)
{
    if (!kvm_enabled()) {
        return 0;
    }
    return kvm_state->many_ioeventfds;
}

int kvm_has_gsi_routing(void)
{
#ifdef KVM_CAP_IRQ_ROUTING
    return kvm_check_extension(kvm_state, KVM_CAP_IRQ_ROUTING);
#else
    return false;
#endif
}

int kvm_has_intx_set_mask(void)
{
    return kvm_state->intx_set_mask;
}

bool kvm_arm_supports_user_irq(void)
{
    return kvm_check_extension(kvm_state, KVM_CAP_ARM_USER_IRQ);
}

#ifdef KVM_CAP_SET_GUEST_DEBUG
struct kvm_sw_breakpoint *kvm_find_sw_breakpoint(CPUState *cpu, vaddr pc)
{
    struct kvm_sw_breakpoint *bp;

    QTAILQ_FOREACH(bp, &cpu->kvm_state->kvm_sw_breakpoints, entry) {
        if (bp->pc == pc) {
            return bp;
        }
    }
    return NULL;
}

int kvm_sw_breakpoints_active(CPUState *cpu)
{
    return !QTAILQ_EMPTY(&cpu->kvm_state->kvm_sw_breakpoints);
}

struct kvm_set_guest_debug_data {
    struct kvm_guest_debug dbg;
    int err;
};

static void kvm_invoke_set_guest_debug(CPUState *cpu, run_on_cpu_data data)
{
    struct kvm_set_guest_debug_data *dbg_data =
        (struct kvm_set_guest_debug_data *) data.host_ptr;

    dbg_data->err = kvm_vcpu_ioctl(cpu, KVM_SET_GUEST_DEBUG,
                                   &dbg_data->dbg);
}

2013-07-25 20:50:21 +02:00
|
|
|
int kvm_update_guest_debug(CPUState *cpu, unsigned long reinject_trap)
|
2009-03-12 21:12:48 +01:00
|
|
|
{
|
2009-07-16 23:55:28 +02:00
|
|
|
struct kvm_set_guest_debug_data data;
|
2009-03-12 21:12:48 +01:00
|
|
|
|
2010-03-01 19:10:29 +01:00
|
|
|
data.dbg.control = reinject_trap;
|
2009-03-12 21:12:48 +01:00
|
|
|
|
2013-06-21 20:20:45 +02:00
|
|
|
if (cpu->singlestep_enabled) {
|
2010-03-01 19:10:29 +01:00
|
|
|
data.dbg.control |= KVM_GUESTDBG_ENABLE | KVM_GUESTDBG_SINGLESTEP;
|
2021-11-11 12:06:04 +01:00
|
|
|
|
|
|
|
if (cpu->singlestep_enabled & SSTEP_NOIRQ) {
|
|
|
|
data.dbg.control |= KVM_GUESTDBG_BLOCKIRQ;
|
|
|
|
}
|
2010-03-01 19:10:29 +01:00
|
|
|
}
|
2012-10-31 06:57:49 +01:00
|
|
|
kvm_arch_update_guest_debug(cpu, &data.dbg);
|
2009-03-12 21:12:48 +01:00
|
|
|
|
2016-10-31 10:36:08 +01:00
|
|
|
run_on_cpu(cpu, kvm_invoke_set_guest_debug,
|
|
|
|
RUN_ON_CPU_HOST_PTR(&data));
|
2009-07-16 23:55:28 +02:00
|
|
|
return data.err;
|
2009-03-12 21:12:48 +01:00
|
|
|
}
|
|
|
|
|
2022-09-29 13:42:25 +02:00
|
|
|
bool kvm_supports_guest_debug(void)
|
|
|
|
{
|
|
|
|
/* probed during kvm_init() */
|
|
|
|
return kvm_has_guest_debug;
|
|
|
|
}
|
|
|
|
|
2022-12-06 16:20:27 +01:00
|
|
|
int kvm_insert_breakpoint(CPUState *cpu, int type, vaddr addr, vaddr len)
|
2009-03-12 21:12:48 +01:00
|
|
|
{
|
|
|
|
struct kvm_sw_breakpoint *bp;
|
|
|
|
int err;
|
|
|
|
|
|
|
|
if (type == GDB_BREAKPOINT_SW) {
|
2013-06-19 17:37:31 +02:00
|
|
|
bp = kvm_find_sw_breakpoint(cpu, addr);
|
2009-03-12 21:12:48 +01:00
|
|
|
if (bp) {
|
|
|
|
bp->use_count++;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2022-03-15 15:41:56 +01:00
|
|
|
bp = g_new(struct kvm_sw_breakpoint, 1);
|
2009-03-12 21:12:48 +01:00
|
|
|
bp->pc = addr;
|
|
|
|
bp->use_count = 1;
|
2013-06-19 17:37:31 +02:00
|
|
|
err = kvm_arch_insert_sw_breakpoint(cpu, bp);
|
2009-03-12 21:12:48 +01:00
|
|
|
if (err) {
|
2011-08-21 05:09:37 +02:00
|
|
|
g_free(bp);
|
2009-03-12 21:12:48 +01:00
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
2013-06-19 17:37:31 +02:00
|
|
|
QTAILQ_INSERT_HEAD(&cpu->kvm_state->kvm_sw_breakpoints, bp, entry);
|
2009-03-12 21:12:48 +01:00
|
|
|
} else {
|
|
|
|
err = kvm_arch_insert_hw_breakpoint(addr, len, type);
|
2011-01-04 09:32:13 +01:00
|
|
|
if (err) {
|
2009-03-12 21:12:48 +01:00
|
|
|
return err;
|
2011-01-04 09:32:13 +01:00
|
|
|
}
|
2009-03-12 21:12:48 +01:00
|
|
|
}
|
|
|
|
|
2013-06-24 23:50:24 +02:00
|
|
|
CPU_FOREACH(cpu) {
|
2013-07-25 20:50:21 +02:00
|
|
|
err = kvm_update_guest_debug(cpu, 0);
|
2011-01-04 09:32:13 +01:00
|
|
|
if (err) {
|
2009-03-12 21:12:48 +01:00
|
|
|
return err;
|
2011-01-04 09:32:13 +01:00
|
|
|
}
|
2009-03-12 21:12:48 +01:00
|
|
|
}
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2022-12-06 16:20:27 +01:00
|
|
|
int kvm_remove_breakpoint(CPUState *cpu, int type, vaddr addr, vaddr len)
|
2009-03-12 21:12:48 +01:00
|
|
|
{
|
|
|
|
struct kvm_sw_breakpoint *bp;
|
|
|
|
int err;
|
|
|
|
|
|
|
|
if (type == GDB_BREAKPOINT_SW) {
|
2013-06-19 17:37:31 +02:00
|
|
|
bp = kvm_find_sw_breakpoint(cpu, addr);
|
2011-01-04 09:32:13 +01:00
|
|
|
if (!bp) {
|
2009-03-12 21:12:48 +01:00
|
|
|
return -ENOENT;
|
2011-01-04 09:32:13 +01:00
|
|
|
}
|
2009-03-12 21:12:48 +01:00
|
|
|
|
|
|
|
if (bp->use_count > 1) {
|
|
|
|
bp->use_count--;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2013-06-19 17:37:31 +02:00
|
|
|
err = kvm_arch_remove_sw_breakpoint(cpu, bp);
|
2011-01-04 09:32:13 +01:00
|
|
|
if (err) {
|
2009-03-12 21:12:48 +01:00
|
|
|
return err;
|
2011-01-04 09:32:13 +01:00
|
|
|
}
|
2009-03-12 21:12:48 +01:00
|
|
|
|
2013-06-19 17:37:31 +02:00
|
|
|
QTAILQ_REMOVE(&cpu->kvm_state->kvm_sw_breakpoints, bp, entry);
|
2011-08-21 05:09:37 +02:00
|
|
|
g_free(bp);
|
2009-03-12 21:12:48 +01:00
|
|
|
} else {
|
|
|
|
err = kvm_arch_remove_hw_breakpoint(addr, len, type);
|
2011-01-04 09:32:13 +01:00
|
|
|
if (err) {
|
2009-03-12 21:12:48 +01:00
|
|
|
return err;
|
2011-01-04 09:32:13 +01:00
|
|
|
}
|
2009-03-12 21:12:48 +01:00
|
|
|
}
|
|
|
|
|
2013-06-24 23:50:24 +02:00
|
|
|
CPU_FOREACH(cpu) {
|
2013-07-25 20:50:21 +02:00
|
|
|
err = kvm_update_guest_debug(cpu, 0);
|
2011-01-04 09:32:13 +01:00
|
|
|
if (err) {
|
2009-03-12 21:12:48 +01:00
|
|
|
return err;
|
2011-01-04 09:32:13 +01:00
|
|
|
}
|
2009-03-12 21:12:48 +01:00
|
|
|
}
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2013-05-27 14:40:48 +02:00
|
|
|
void kvm_remove_all_breakpoints(CPUState *cpu)
|
2009-03-12 21:12:48 +01:00
|
|
|
{
|
|
|
|
struct kvm_sw_breakpoint *bp, *next;
|
2013-06-19 17:37:31 +02:00
|
|
|
KVMState *s = cpu->kvm_state;
|
2014-07-19 03:21:46 +02:00
|
|
|
CPUState *tmpcpu;
|
2009-03-12 21:12:48 +01:00
|
|
|
|
2009-09-12 09:36:22 +02:00
|
|
|
QTAILQ_FOREACH_SAFE(bp, &s->kvm_sw_breakpoints, entry, next) {
|
2013-06-19 17:37:31 +02:00
|
|
|
if (kvm_arch_remove_sw_breakpoint(cpu, bp) != 0) {
|
2009-03-12 21:12:48 +01:00
|
|
|
/* Try harder to find a CPU that currently sees the breakpoint. */
|
2014-07-19 03:21:46 +02:00
|
|
|
CPU_FOREACH(tmpcpu) {
|
|
|
|
if (kvm_arch_remove_sw_breakpoint(tmpcpu, bp) == 0) {
|
2009-03-12 21:12:48 +01:00
|
|
|
break;
|
2011-01-04 09:32:13 +01:00
|
|
|
}
|
2009-03-12 21:12:48 +01:00
|
|
|
}
|
|
|
|
}
|
2012-11-12 15:04:35 +01:00
|
|
|
QTAILQ_REMOVE(&s->kvm_sw_breakpoints, bp, entry);
|
|
|
|
g_free(bp);
|
2009-03-12 21:12:48 +01:00
|
|
|
}
|
|
|
|
kvm_arch_remove_all_hw_breakpoints();
|
|
|
|
|
2013-06-24 23:50:24 +02:00
|
|
|
CPU_FOREACH(cpu) {
|
2013-07-25 20:50:21 +02:00
|
|
|
kvm_update_guest_debug(cpu, 0);
|
2011-01-04 09:32:13 +01:00
|
|
|
}
|
2009-03-12 21:12:48 +01:00
|
|
|
}
|
|
|
|
|
|
|
|
#endif /* KVM_CAP_SET_GUEST_DEBUG */
|
2010-02-17 23:14:42 +01:00
|
|
|
|
2017-02-09 09:41:14 +01:00
|
|
|
static int kvm_set_signal_mask(CPUState *cpu, const sigset_t *sigset)
|
2010-02-17 23:14:42 +01:00
|
|
|
{
|
2014-06-18 00:10:31 +02:00
|
|
|
KVMState *s = kvm_state;
|
2010-02-17 23:14:42 +01:00
|
|
|
struct kvm_signal_mask *sigmask;
|
|
|
|
int r;
|
|
|
|
|
2011-08-21 05:09:37 +02:00
|
|
|
sigmask = g_malloc(sizeof(*sigmask) + sizeof(*sigset));
|
2010-02-17 23:14:42 +01:00
|
|
|
|
2014-06-18 00:10:31 +02:00
|
|
|
sigmask->len = s->sigmask_len;
|
2010-02-17 23:14:42 +01:00
|
|
|
memcpy(sigmask->sigset, sigset, sizeof(*sigset));
|
2012-10-31 06:06:49 +01:00
|
|
|
r = kvm_vcpu_ioctl(cpu, KVM_SET_SIGNAL_MASK, sigmask);
|
2011-08-21 05:09:37 +02:00
|
|
|
g_free(sigmask);
|
2010-02-17 23:14:42 +01:00
|
|
|
|
|
|
|
return r;
|
|
|
|
}
|
2017-02-09 10:04:34 +01:00
|
|
|
|
2017-02-08 13:52:50 +01:00
|
|
|
static void kvm_ipi_signal(int sig)
|
2017-02-09 09:41:14 +01:00
|
|
|
{
|
2017-02-08 13:52:50 +01:00
|
|
|
if (current_cpu) {
|
|
|
|
assert(kvm_immediate_exit);
|
|
|
|
kvm_cpu_kick(current_cpu);
|
|
|
|
}
|
2017-02-09 09:41:14 +01:00
|
|
|
}
|
|
|
|
|
|
|
|
void kvm_init_cpu_signals(CPUState *cpu)
|
|
|
|
{
|
|
|
|
int r;
|
|
|
|
sigset_t set;
|
|
|
|
struct sigaction sigact;
|
|
|
|
|
|
|
|
memset(&sigact, 0, sizeof(sigact));
|
2017-02-08 13:52:50 +01:00
|
|
|
sigact.sa_handler = kvm_ipi_signal;
|
2017-02-09 09:41:14 +01:00
|
|
|
sigaction(SIG_IPI, &sigact, NULL);
|
|
|
|
|
|
|
|
pthread_sigmask(SIG_BLOCK, NULL, &set);
|
|
|
|
#if defined KVM_HAVE_MCE_INJECTION
|
|
|
|
sigdelset(&set, SIGBUS);
|
|
|
|
pthread_sigmask(SIG_SETMASK, &set, NULL);
|
|
|
|
#endif
|
|
|
|
sigdelset(&set, SIG_IPI);
|
2017-02-08 13:52:50 +01:00
|
|
|
if (kvm_immediate_exit) {
|
|
|
|
r = pthread_sigmask(SIG_SETMASK, &set, NULL);
|
|
|
|
} else {
|
|
|
|
r = kvm_set_signal_mask(cpu, &set);
|
|
|
|
}
|
2017-02-09 09:41:14 +01:00
|
|
|
if (r) {
|
|
|
|
fprintf(stderr, "kvm_set_signal_mask: %s\n", strerror(-r));
|
|
|
|
exit(1);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2017-02-08 12:48:54 +01:00
|
|
|
/* Called asynchronously in VCPU thread. */
|
2013-01-17 09:30:27 +01:00
|
|
|
int kvm_on_sigbus_vcpu(CPUState *cpu, int code, void *addr)
|
2011-02-01 22:15:51 +01:00
|
|
|
{
|
2017-02-08 12:48:54 +01:00
|
|
|
#ifdef KVM_HAVE_MCE_INJECTION
|
|
|
|
if (have_sigbus_pending) {
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
have_sigbus_pending = true;
|
|
|
|
pending_sigbus_addr = addr;
|
|
|
|
pending_sigbus_code = code;
|
2020-09-23 12:56:46 +02:00
|
|
|
qatomic_set(&cpu->exit_request, 1);
|
2017-02-08 12:48:54 +01:00
|
|
|
return 0;
|
|
|
|
#else
|
|
|
|
return 1;
|
|
|
|
#endif
|
2011-02-01 22:15:51 +01:00
|
|
|
}
|
|
|
|
|
2017-02-08 12:48:54 +01:00
|
|
|
/* Called synchronously (via signalfd) in main thread. */
|
2011-02-01 22:15:51 +01:00
|
|
|
int kvm_on_sigbus(int code, void *addr)
|
|
|
|
{
|
2017-02-08 12:48:54 +01:00
|
|
|
#ifdef KVM_HAVE_MCE_INJECTION
|
2017-02-09 10:04:34 +01:00
|
|
|
/* Action required MCE kills the process if SIGBUS is blocked. Because
|
|
|
|
* that's what happens in the I/O thread, where we handle MCE via signalfd,
|
|
|
|
* we can only get action optional here.
|
|
|
|
*/
|
|
|
|
assert(code != BUS_MCEERR_AR);
|
|
|
|
kvm_arch_on_sigbus_vcpu(first_cpu, code, addr);
|
|
|
|
return 0;
|
2017-02-08 12:48:54 +01:00
|
|
|
#else
|
|
|
|
return 1;
|
|
|
|
#endif
|
2011-02-01 22:15:51 +01:00
|
|
|
}
|
2014-02-26 18:20:00 +01:00
|
|
|
|
|
|
|
int kvm_create_device(KVMState *s, uint64_t type, bool test)
|
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
struct kvm_create_device create_dev;
|
|
|
|
|
|
|
|
create_dev.type = type;
|
|
|
|
create_dev.fd = -1;
|
|
|
|
create_dev.flags = test ? KVM_CREATE_DEVICE_TEST : 0;
|
|
|
|
|
|
|
|
if (!kvm_check_extension(s, KVM_CAP_DEVICE_CTRL)) {
|
|
|
|
return -ENOTSUP;
|
|
|
|
}
|
|
|
|
|
|
|
|
ret = kvm_vm_ioctl(s, KVM_CREATE_DEVICE, &create_dev);
|
|
|
|
if (ret) {
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
return test ? 0 : create_dev.fd;
|
|
|
|
}
|
2014-05-09 10:06:46 +02:00
|
|
|
|
2016-03-30 18:27:24 +02:00
|
|
|
bool kvm_device_supported(int vmfd, uint64_t type)
|
|
|
|
{
|
|
|
|
struct kvm_create_device create_dev = {
|
|
|
|
.type = type,
|
|
|
|
.fd = -1,
|
|
|
|
.flags = KVM_CREATE_DEVICE_TEST,
|
|
|
|
};
|
|
|
|
|
|
|
|
if (ioctl(vmfd, KVM_CHECK_EXTENSION, KVM_CAP_DEVICE_CTRL) <= 0) {
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
|
|
|
return (ioctl(vmfd, KVM_CREATE_DEVICE, &create_dev) >= 0);
|
|
|
|
}
|
|
|
|
|
2014-05-09 10:06:46 +02:00
|
|
|
int kvm_set_one_reg(CPUState *cs, uint64_t id, void *source)
|
|
|
|
{
|
|
|
|
struct kvm_one_reg reg;
|
|
|
|
int r;
|
|
|
|
|
|
|
|
reg.id = id;
|
|
|
|
reg.addr = (uintptr_t) source;
|
|
|
|
r = kvm_vcpu_ioctl(cs, KVM_SET_ONE_REG, &reg);
|
|
|
|
if (r) {
|
2016-02-01 20:37:44 +01:00
|
|
|
trace_kvm_failed_reg_set(id, strerror(-r));
|
2014-05-09 10:06:46 +02:00
|
|
|
}
|
|
|
|
return r;
|
|
|
|
}
|
|
|
|
|
|
|
|
int kvm_get_one_reg(CPUState *cs, uint64_t id, void *target)
|
|
|
|
{
|
|
|
|
struct kvm_one_reg reg;
|
|
|
|
int r;
|
|
|
|
|
|
|
|
reg.id = id;
|
|
|
|
reg.addr = (uintptr_t) target;
|
|
|
|
r = kvm_vcpu_ioctl(cs, KVM_GET_ONE_REG, &reg);
|
|
|
|
if (r) {
|
2016-02-01 20:37:44 +01:00
|
|
|
trace_kvm_failed_reg_get(id, strerror(-r));
|
2014-05-09 10:06:46 +02:00
|
|
|
}
|
|
|
|
return r;
|
|
|
|
}
|
2014-09-26 22:45:24 +02:00
|
|
|
|
2019-06-14 03:52:37 +02:00
|
|
|
static bool kvm_accel_has_memory(MachineState *ms, AddressSpace *as,
|
|
|
|
hwaddr start_addr, hwaddr size)
|
|
|
|
{
|
|
|
|
KVMState *kvm = KVM_STATE(ms->accelerator);
|
|
|
|
int i;
|
|
|
|
|
|
|
|
for (i = 0; i < kvm->nr_as; ++i) {
|
|
|
|
if (kvm->as[i].as == as && kvm->as[i].ml) {
|
2019-09-24 16:47:50 +02:00
|
|
|
size = MIN(kvm_max_slot_size, size);
|
2019-06-14 03:52:37 +02:00
|
|
|
return NULL != kvm_lookup_matching_slot(kvm->as[i].ml,
|
|
|
|
start_addr, size);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
2019-11-13 10:56:53 +01:00
|
|
|
static void kvm_get_kvm_shadow_mem(Object *obj, Visitor *v,
|
|
|
|
const char *name, void *opaque,
|
|
|
|
Error **errp)
|
|
|
|
{
|
|
|
|
KVMState *s = KVM_STATE(obj);
|
|
|
|
int64_t value = s->kvm_shadow_mem;
|
|
|
|
|
|
|
|
visit_type_int(v, name, &value, errp);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void kvm_set_kvm_shadow_mem(Object *obj, Visitor *v,
|
|
|
|
const char *name, void *opaque,
|
|
|
|
Error **errp)
|
|
|
|
{
|
|
|
|
KVMState *s = KVM_STATE(obj);
|
|
|
|
int64_t value;
|
|
|
|
|
2021-05-17 10:17:15 +02:00
|
|
|
if (s->fd != -1) {
|
|
|
|
error_setg(errp, "Cannot set properties after the accelerator has been initialized");
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
error: Eliminate error_propagate() with Coccinelle, part 1
When all we do with an Error we receive into a local variable is
propagating to somewhere else, we can just as well receive it there
right away. Convert
if (!foo(..., &err)) {
...
error_propagate(errp, err);
...
return ...
}
to
if (!foo(..., errp)) {
...
...
return ...
}
where nothing else needs @err. Coccinelle script:
@rule1 forall@
identifier fun, err, errp, lbl;
expression list args, args2;
binary operator op;
constant c1, c2;
symbol false;
@@
if (
(
- fun(args, &err, args2)
+ fun(args, errp, args2)
|
- !fun(args, &err, args2)
+ !fun(args, errp, args2)
|
- fun(args, &err, args2) op c1
+ fun(args, errp, args2) op c1
)
)
{
... when != err
when != lbl:
when strict
- error_propagate(errp, err);
... when != err
(
return;
|
return c2;
|
return false;
)
}
@rule2 forall@
identifier fun, err, errp, lbl;
expression list args, args2;
expression var;
binary operator op;
constant c1, c2;
symbol false;
@@
- var = fun(args, &err, args2);
+ var = fun(args, errp, args2);
... when != err
if (
(
var
|
!var
|
var op c1
)
)
{
... when != err
when != lbl:
when strict
- error_propagate(errp, err);
... when != err
(
return;
|
return c2;
|
return false;
|
return var;
)
}
@depends on rule1 || rule2@
identifier err;
@@
- Error *err = NULL;
... when != err
Not exactly elegant, I'm afraid.
The "when != lbl:" is necessary to avoid transforming
if (fun(args, &err)) {
goto out
}
...
out:
error_propagate(errp, err);
even though other paths to label out still need the error_propagate().
For an actual example, see sclp_realize().
Without the "when strict", Coccinelle transforms vfio_msix_setup(),
incorrectly. I don't know what exactly "when strict" does, only that
it helps here.
The match of return is narrower than what I want, but I can't figure
out how to express "return where the operand doesn't use @err". For
an example where it's too narrow, see vfio_intx_enable().
Silently fails to convert hw/arm/armsse.c, because Coccinelle gets
confused by ARMSSE being used both as typedef and function-like macro
there. Converted manually.
Line breaks tidied up manually. One nested declaration of @local_err
deleted manually. Preexisting unwanted blank line dropped in
hw/riscv/sifive_e.c.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20200707160613.848843-35-armbru@redhat.com>
2020-07-07 18:06:02 +02:00
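For concreteness, the shape of that conversion in plain C, in the style of
the call right below (a hedged illustration, not an exact hunk from the
commit):
    /* Before: receive the error locally, then propagate it. */
    Error *err = NULL;
    if (!visit_type_int(v, name, &value, &err)) {
        error_propagate(errp, err);
        return;
    }
    /* After: hand the caller's errp straight to the visitor. */
    if (!visit_type_int(v, name, &value, errp)) {
        return;
    }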
|
|
|
if (!visit_type_int(v, name, &value, errp)) {
|
2019-11-13 10:56:53 +01:00
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
s->kvm_shadow_mem = value;
|
|
|
|
}
|
|
|
|
|
2019-11-13 10:56:53 +01:00
|
|
|
static void kvm_set_kernel_irqchip(Object *obj, Visitor *v,
|
|
|
|
const char *name, void *opaque,
|
|
|
|
Error **errp)
|
|
|
|
{
|
|
|
|
KVMState *s = KVM_STATE(obj);
|
|
|
|
OnOffSplit mode;
|
|
|
|
|
2021-05-17 10:17:15 +02:00
|
|
|
if (s->fd != -1) {
|
|
|
|
error_setg(errp, "Cannot set properties after the accelerator has been initialized");
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2020-07-07 18:05:47 +02:00
|
|
|
if (!visit_type_OnOffSplit(v, name, &mode, errp)) {
|
2019-11-13 10:56:53 +01:00
|
|
|
return;
|
2020-07-07 18:05:47 +02:00
|
|
|
}
|
|
|
|
switch (mode) {
|
|
|
|
case ON_OFF_SPLIT_ON:
|
|
|
|
s->kernel_irqchip_allowed = true;
|
|
|
|
s->kernel_irqchip_required = true;
|
|
|
|
s->kernel_irqchip_split = ON_OFF_AUTO_OFF;
|
|
|
|
break;
|
|
|
|
case ON_OFF_SPLIT_OFF:
|
|
|
|
s->kernel_irqchip_allowed = false;
|
|
|
|
s->kernel_irqchip_required = false;
|
|
|
|
s->kernel_irqchip_split = ON_OFF_AUTO_OFF;
|
|
|
|
break;
|
|
|
|
case ON_OFF_SPLIT_SPLIT:
|
|
|
|
s->kernel_irqchip_allowed = true;
|
|
|
|
s->kernel_irqchip_required = true;
|
|
|
|
s->kernel_irqchip_split = ON_OFF_AUTO_ON;
|
|
|
|
break;
|
|
|
|
default:
|
|
|
|
/* The value was checked in visit_type_OnOffSplit() above. If
|
|
|
|
* we get here, then something is wrong in QEMU.
|
|
|
|
*/
|
|
|
|
abort();
|
2019-11-13 10:56:53 +01:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2019-11-13 11:17:12 +01:00
|
|
|
bool kvm_kernel_irqchip_allowed(void)
|
|
|
|
{
|
2019-11-13 10:56:53 +01:00
|
|
|
return kvm_state->kernel_irqchip_allowed;
|
2019-11-13 11:17:12 +01:00
|
|
|
}
|
|
|
|
|
|
|
|
bool kvm_kernel_irqchip_required(void)
|
|
|
|
{
|
2019-11-13 10:56:53 +01:00
|
|
|
return kvm_state->kernel_irqchip_required;
|
2019-11-13 11:17:12 +01:00
|
|
|
}
|
|
|
|
|
|
|
|
bool kvm_kernel_irqchip_split(void)
|
|
|
|
{
|
2019-12-28 11:43:26 +01:00
|
|
|
return kvm_state->kernel_irqchip_split == ON_OFF_AUTO_ON;
|
2019-11-13 11:17:12 +01:00
|
|
|
}
|
|
|
|
|
2021-05-06 18:05:47 +02:00
|
|
|
static void kvm_get_dirty_ring_size(Object *obj, Visitor *v,
|
|
|
|
const char *name, void *opaque,
|
|
|
|
Error **errp)
|
|
|
|
{
|
|
|
|
KVMState *s = KVM_STATE(obj);
|
|
|
|
uint32_t value = s->kvm_dirty_ring_size;
|
|
|
|
|
|
|
|
visit_type_uint32(v, name, &value, errp);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void kvm_set_dirty_ring_size(Object *obj, Visitor *v,
|
|
|
|
const char *name, void *opaque,
|
|
|
|
Error **errp)
|
|
|
|
{
|
|
|
|
KVMState *s = KVM_STATE(obj);
|
|
|
|
uint32_t value;
|
|
|
|
|
|
|
|
if (s->fd != -1) {
|
|
|
|
error_setg(errp, "Cannot set properties after the accelerator has been initialized");
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2022-11-21 09:50:53 +01:00
|
|
|
if (!visit_type_uint32(v, name, &value, errp)) {
|
2021-05-06 18:05:47 +02:00
|
|
|
return;
|
|
|
|
}
|
|
|
|
if (value & (value - 1)) {
|
|
|
|
error_setg(errp, "dirty-ring-size must be a power of two.");
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
s->kvm_dirty_ring_size = value;
|
|
|
|
}
|
|
|
|
|
2019-11-13 10:56:53 +01:00
|
|
|
static void kvm_accel_instance_init(Object *obj)
|
|
|
|
{
|
|
|
|
KVMState *s = KVM_STATE(obj);
|
|
|
|
|
2021-05-17 10:17:15 +02:00
|
|
|
s->fd = -1;
|
|
|
|
s->vmfd = -1;
|
2019-11-13 10:56:53 +01:00
|
|
|
s->kvm_shadow_mem = -1;
|
2019-12-28 11:43:26 +01:00
|
|
|
s->kernel_irqchip_allowed = true;
|
|
|
|
s->kernel_irqchip_split = ON_OFF_AUTO_AUTO;
|
2021-05-06 18:05:47 +02:00
|
|
|
/* KVM dirty ring is by default off */
|
|
|
|
s->kvm_dirty_ring_size = 0;
|
2023-05-09 04:21:20 +02:00
|
|
|
s->kvm_dirty_ring_with_bitmap = false;
|
2023-09-05 11:12:46 +02:00
|
|
|
s->kvm_eager_split_size = 0;
|
2022-09-29 09:20:14 +02:00
|
|
|
s->notify_vmexit = NOTIFY_VMEXIT_OPTION_RUN;
|
|
|
|
s->notify_window = 0;
|
2022-12-03 18:51:13 +01:00
|
|
|
s->xen_version = 0;
|
2022-12-16 17:27:00 +01:00
|
|
|
s->xen_gnttab_max_frames = 64;
|
2023-01-18 15:36:23 +01:00
|
|
|
s->xen_evtchn_max_pirq = 256;
|
2019-11-13 10:56:53 +01:00
|
|
|
}
|
|
|
|
|
2022-09-29 13:42:23 +02:00
|
|
|
/**
|
|
|
|
* kvm_gdbstub_sstep_flags():
|
|
|
|
*
|
|
|
|
* Returns: SSTEP_* flags that KVM supports for guest debug. The
|
|
|
|
* support is probed during kvm_init()
|
|
|
|
*/
|
|
|
|
static int kvm_gdbstub_sstep_flags(void)
|
|
|
|
{
|
|
|
|
return kvm_sstep_flags;
|
|
|
|
}
|
|
|
|
|
2014-09-26 22:45:24 +02:00
|
|
|
static void kvm_accel_class_init(ObjectClass *oc, void *data)
|
|
|
|
{
|
|
|
|
AccelClass *ac = ACCEL_CLASS(oc);
|
|
|
|
ac->name = "KVM";
|
accel: Rename 'init' method to 'init_machine'
Today, all accelerator init functions affect some global state:
* tcg_init() calls tcg_exec_init() and affects globals such as tcg_ctx,
page size globals, and possibly others;
* kvm_init() changes the kvm_state global, cpu_interrupt_handler, and possibly
others;
* xen_init() changes the xen_xc global, and registers a change state handler.
With the new accelerator QOM classes, initialization may now be split in two
steps:
* instance_init() will do basic initialization that doesn't affect any global
state and don't need MachineState or MachineClass data. This will allow
probing code to safely create multiple accelerator objects on the fly just
for reporting host/accelerator capabilities, for example.
* accel_init_machine()/init_machine() will save the accelerator object in
MachineState, and do initialization steps which still affect global state,
machine state, or that need data from MachineClass or MachineState.
To clarify the difference between those two steps, rename init() to
init_machine().
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-09-26 22:45:29 +02:00
|
|
|
ac->init_machine = kvm_init;
|
2019-06-14 03:52:37 +02:00
|
|
|
ac->has_memory = kvm_accel_has_memory;
|
2014-09-26 22:45:24 +02:00
|
|
|
ac->allowed = &kvm_allowed;
|
2022-09-29 13:42:23 +02:00
|
|
|
ac->gdbstub_supported_sstep_flags = kvm_gdbstub_sstep_flags;
|
2019-11-13 10:56:53 +01:00
|
|
|
|
2019-11-13 10:56:53 +01:00
|
|
|
object_class_property_add(oc, "kernel-irqchip", "on|off|split",
|
|
|
|
NULL, kvm_set_kernel_irqchip,
|
qom: Drop parameter @errp of object_property_add() & friends
The only way object_property_add() can fail is when a property with
the same name already exists. Since our property names are all
hardcoded, failure is a programming error, and the appropriate way to
handle it is passing &error_abort.
Same for its variants, except for object_property_add_child(), which
additionally fails when the child already has a parent. Parentage is
also under program control, so this is a programming error, too.
We have a bit over 500 callers. Almost half of them pass
&error_abort, slightly fewer ignore errors, one test case handles
errors, and the remaining few callers pass them to their own callers.
The previous few commits demonstrated once again that ignoring
programming errors is a bad idea.
Of the few ones that pass on errors, several violate the Error API.
The Error ** argument must be NULL, &error_abort, &error_fatal, or a
pointer to a variable containing NULL. Passing an argument of the
latter kind twice without clearing it in between is wrong: if the
first call sets an error, it no longer points to NULL for the second
call. ich9_pm_add_properties(), sparc32_ledma_realize(),
sparc32_dma_realize(), xilinx_axidma_realize(), xilinx_enet_realize()
are wrong that way.
When the one appropriate choice of argument is &error_abort, letting
users pick the argument is a bad idea.
Drop parameter @errp and assert the preconditions instead.
There's one exception to "duplicate property name is a programming
error": the way object_property_add() implements the magic (and
undocumented) "automatic arrayification". Don't drop @errp there.
Instead, rename object_property_add() to object_property_try_add(),
and add the obvious wrapper object_property_add().
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20200505152926.18877-15-armbru@redhat.com>
[Two semantic rebase conflicts resolved]
2020-05-05 17:29:22 +02:00
|
|
|
NULL, NULL);
|
2019-11-13 10:56:53 +01:00
|
|
|
object_class_property_set_description(oc, "kernel-irqchip",
|
2020-05-05 17:29:15 +02:00
|
|
|
"Configure KVM in-kernel irqchip");
|
2019-11-13 10:56:53 +01:00
|
|
|
|
2019-11-13 10:56:53 +01:00
|
|
|
object_class_property_add(oc, "kvm-shadow-mem", "int",
|
|
|
|
kvm_get_kvm_shadow_mem, kvm_set_kvm_shadow_mem,
|
2020-05-05 17:29:22 +02:00
|
|
|
NULL, NULL);
|
2019-11-13 10:56:53 +01:00
|
|
|
object_class_property_set_description(oc, "kvm-shadow-mem",
|
2020-05-05 17:29:15 +02:00
|
|
|
"KVM shadow MMU size");
|
2021-05-06 18:05:47 +02:00
|
|
|
|
|
|
|
object_class_property_add(oc, "dirty-ring-size", "uint32",
|
|
|
|
kvm_get_dirty_ring_size, kvm_set_dirty_ring_size,
|
|
|
|
NULL, NULL);
|
|
|
|
object_class_property_set_description(oc, "dirty-ring-size",
|
|
|
|
"Size of KVM dirty page ring buffer (default: 0, i.e. use bitmap)");
|
2022-09-29 09:20:12 +02:00
|
|
|
|
|
|
|
kvm_arch_accel_class_init(oc);
|
2014-09-26 22:45:24 +02:00
|
|
|
}
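A hedged before/after sketch of the @errp removal described in the "qom:
Drop parameter @errp" message above, using one of the property
registrations from kvm_accel_class_init(); the pre-change argument list is
reconstructed from that message, not taken from this file:
    /* Before: a trailing Error ** that could only reasonably be &error_abort. */
    object_class_property_add(oc, "kvm-shadow-mem", "int",
                              kvm_get_kvm_shadow_mem, kvm_set_kvm_shadow_mem,
                              NULL, NULL, &error_abort);
    /* After: the parameter is gone; the precondition is asserted inside. */
    object_class_property_add(oc, "kvm-shadow-mem", "int",
                              kvm_get_kvm_shadow_mem, kvm_set_kvm_shadow_mem,
                              NULL, NULL);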
|
|
|
|
|
|
|
|
static const TypeInfo kvm_accel_type = {
|
|
|
|
.name = TYPE_KVM_ACCEL,
|
|
|
|
.parent = TYPE_ACCEL,
|
2019-11-13 10:56:53 +01:00
|
|
|
.instance_init = kvm_accel_instance_init,
|
2014-09-26 22:45:24 +02:00
|
|
|
.class_init = kvm_accel_class_init,
|
2014-09-26 22:45:32 +02:00
|
|
|
.instance_size = sizeof(KVMState),
|
2014-09-26 22:45:24 +02:00
|
|
|
};
|
|
|
|
|
|
|
|
static void kvm_type_init(void)
|
|
|
|
{
|
|
|
|
type_register_static(&kvm_accel_type);
|
|
|
|
}
|
|
|
|
|
|
|
|
type_init(kvm_type_init);
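One practical consequence of the instance_init()/init_machine() split
described in the "accel: Rename 'init' method to 'init_machine'" message
above: once the type is registered, capability-probing code can create and
discard the accelerator object without side effects. A hedged sketch (the
probing caller itself is hypothetical):
    /* kvm_accel_instance_init() only fills in defaults on the KVMState;
     * kvm_init() runs later via ac->init_machine once a machine is built. */
    Object *probe = object_new(TYPE_KVM_ACCEL);
    /* ... inspect object properties here ... */
    object_unref(probe);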
|
2022-02-15 16:04:33 +01:00
|
|
|
|
|
|
|
typedef struct StatsArgs {
|
|
|
|
union StatsResultsType {
|
|
|
|
StatsResultList **stats;
|
|
|
|
StatsSchemaList **schema;
|
|
|
|
} result;
|
qmp: add filtering of statistics by name
Allow retrieving only a subset of statistics. This can be useful
for example in order to plot a subset of the statistics many times
a second: KVM publishes ~40 statistics for each vCPU on x86; retrieving
and serializing all of them would be useless.
Another use will be in HMP in the following patch; implementing the
filter in the backend is easy enough that it was deemed okay to make
this a public interface.
Example:
{ "execute": "query-stats",
"arguments": {
"target": "vcpu",
"vcpus": [ "/machine/unattached/device[2]",
"/machine/unattached/device[4]" ],
"providers": [
{ "provider": "kvm",
"names": [ "l1d_flush", "exits" ] } } }
{ "return": {
"vcpus": [
{ "path": "/machine/unattached/device[2]"
"providers": [
{ "provider": "kvm",
"stats": [ { "name": "l1d_flush", "value": 41213 },
{ "name": "exits", "value": 74291 } ] } ] },
{ "path": "/machine/unattached/device[4]"
"providers": [
{ "provider": "kvm",
"stats": [ { "name": "l1d_flush", "value": 16132 },
{ "name": "exits", "value": 57922 } ] } ] } ] } }
Extracted from a patch by Mark Kanda.
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-24 19:13:16 +02:00
|
|
|
strList *names;
|
2022-02-15 16:04:33 +01:00
|
|
|
Error **errp;
|
|
|
|
} StatsArgs;
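The name-based filtering described in the "qmp: add filtering of statistics
by name" message above is applied further down through
apply_str_list_filter(). A minimal sketch of such a filter, assuming
"empty list matches everything, otherwise exact match" semantics (the
helper below is illustrative, not the real utility function):
    static bool example_name_filter(const char *name, const strList *names)
    {
        if (!names) {
            return true;                 /* no filter requested: keep all */
        }
        for (; names; names = names->next) {
            if (g_str_equal(names->value, name)) {
                return true;             /* explicitly requested */
            }
        }
        return false;                    /* not in the requested set */
    }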
|
|
|
|
|
|
|
|
static StatsList *add_kvmstat_entry(struct kvm_stats_desc *pdesc,
|
|
|
|
uint64_t *stats_data,
|
|
|
|
StatsList *stats_list,
|
|
|
|
Error **errp)
|
|
|
|
{
|
|
|
|
|
|
|
|
Stats *stats;
|
|
|
|
uint64List *val_list = NULL;
|
|
|
|
|
|
|
|
/* Only add stats that we understand. */
|
|
|
|
switch (pdesc->flags & KVM_STATS_TYPE_MASK) {
|
|
|
|
case KVM_STATS_TYPE_CUMULATIVE:
|
|
|
|
case KVM_STATS_TYPE_INSTANT:
|
|
|
|
case KVM_STATS_TYPE_PEAK:
|
|
|
|
case KVM_STATS_TYPE_LINEAR_HIST:
|
|
|
|
case KVM_STATS_TYPE_LOG_HIST:
|
|
|
|
break;
|
|
|
|
default:
|
|
|
|
return stats_list;
|
|
|
|
}
|
|
|
|
|
|
|
|
switch (pdesc->flags & KVM_STATS_UNIT_MASK) {
|
|
|
|
case KVM_STATS_UNIT_NONE:
|
|
|
|
case KVM_STATS_UNIT_BYTES:
|
|
|
|
case KVM_STATS_UNIT_CYCLES:
|
|
|
|
case KVM_STATS_UNIT_SECONDS:
|
2022-07-14 14:10:22 +02:00
|
|
|
case KVM_STATS_UNIT_BOOLEAN:
|
2022-02-15 16:04:33 +01:00
|
|
|
break;
|
|
|
|
default:
|
|
|
|
return stats_list;
|
|
|
|
}
|
|
|
|
|
|
|
|
switch (pdesc->flags & KVM_STATS_BASE_MASK) {
|
|
|
|
case KVM_STATS_BASE_POW10:
|
|
|
|
case KVM_STATS_BASE_POW2:
|
|
|
|
break;
|
|
|
|
default:
|
|
|
|
return stats_list;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Alloc and populate data list */
|
|
|
|
stats = g_new0(Stats, 1);
|
|
|
|
stats->name = g_strdup(pdesc->name);
|
|
|
|
stats->value = g_new0(StatsValue, 1);
|
|
|
|
|
2022-07-14 14:10:22 +02:00
|
|
|
if ((pdesc->flags & KVM_STATS_UNIT_MASK) == KVM_STATS_UNIT_BOOLEAN) {
|
|
|
|
stats->value->u.boolean = *stats_data;
|
|
|
|
stats->value->type = QTYPE_QBOOL;
|
|
|
|
} else if (pdesc->size == 1) {
|
2022-02-15 16:04:33 +01:00
|
|
|
stats->value->u.scalar = *stats_data;
|
|
|
|
stats->value->type = QTYPE_QNUM;
|
|
|
|
} else {
|
|
|
|
int i;
|
|
|
|
for (i = 0; i < pdesc->size; i++) {
|
|
|
|
QAPI_LIST_PREPEND(val_list, stats_data[i]);
|
|
|
|
}
|
|
|
|
stats->value->u.list = val_list;
|
|
|
|
stats->value->type = QTYPE_QLIST;
|
|
|
|
}
|
|
|
|
|
|
|
|
QAPI_LIST_PREPEND(stats_list, stats);
|
|
|
|
return stats_list;
|
|
|
|
}
|
|
|
|
|
|
|
|
static StatsSchemaValueList *add_kvmschema_entry(struct kvm_stats_desc *pdesc,
|
|
|
|
StatsSchemaValueList *list,
|
|
|
|
Error **errp)
|
|
|
|
{
|
|
|
|
StatsSchemaValueList *schema_entry = g_new0(StatsSchemaValueList, 1);
|
|
|
|
schema_entry->value = g_new0(StatsSchemaValue, 1);
|
|
|
|
|
|
|
|
switch (pdesc->flags & KVM_STATS_TYPE_MASK) {
|
|
|
|
case KVM_STATS_TYPE_CUMULATIVE:
|
|
|
|
schema_entry->value->type = STATS_TYPE_CUMULATIVE;
|
|
|
|
break;
|
|
|
|
case KVM_STATS_TYPE_INSTANT:
|
|
|
|
schema_entry->value->type = STATS_TYPE_INSTANT;
|
|
|
|
break;
|
|
|
|
case KVM_STATS_TYPE_PEAK:
|
|
|
|
schema_entry->value->type = STATS_TYPE_PEAK;
|
|
|
|
break;
|
|
|
|
case KVM_STATS_TYPE_LINEAR_HIST:
|
|
|
|
schema_entry->value->type = STATS_TYPE_LINEAR_HISTOGRAM;
|
|
|
|
schema_entry->value->bucket_size = pdesc->bucket_size;
|
|
|
|
schema_entry->value->has_bucket_size = true;
|
|
|
|
break;
|
|
|
|
case KVM_STATS_TYPE_LOG_HIST:
|
|
|
|
schema_entry->value->type = STATS_TYPE_LOG2_HISTOGRAM;
|
|
|
|
break;
|
|
|
|
default:
|
|
|
|
goto exit;
|
|
|
|
}
|
|
|
|
|
|
|
|
switch (pdesc->flags & KVM_STATS_UNIT_MASK) {
|
|
|
|
case KVM_STATS_UNIT_NONE:
|
|
|
|
break;
|
2022-07-14 14:10:22 +02:00
|
|
|
case KVM_STATS_UNIT_BOOLEAN:
|
|
|
|
schema_entry->value->has_unit = true;
|
|
|
|
schema_entry->value->unit = STATS_UNIT_BOOLEAN;
|
|
|
|
break;
|
2022-02-15 16:04:33 +01:00
|
|
|
case KVM_STATS_UNIT_BYTES:
|
|
|
|
schema_entry->value->has_unit = true;
|
|
|
|
schema_entry->value->unit = STATS_UNIT_BYTES;
|
|
|
|
break;
|
|
|
|
case KVM_STATS_UNIT_CYCLES:
|
|
|
|
schema_entry->value->has_unit = true;
|
|
|
|
schema_entry->value->unit = STATS_UNIT_CYCLES;
|
|
|
|
break;
|
|
|
|
case KVM_STATS_UNIT_SECONDS:
|
|
|
|
schema_entry->value->has_unit = true;
|
|
|
|
schema_entry->value->unit = STATS_UNIT_SECONDS;
|
|
|
|
break;
|
|
|
|
default:
|
|
|
|
goto exit;
|
|
|
|
}
|
|
|
|
|
|
|
|
schema_entry->value->exponent = pdesc->exponent;
|
|
|
|
if (pdesc->exponent) {
|
|
|
|
switch (pdesc->flags & KVM_STATS_BASE_MASK) {
|
|
|
|
case KVM_STATS_BASE_POW10:
|
|
|
|
schema_entry->value->has_base = true;
|
|
|
|
schema_entry->value->base = 10;
|
|
|
|
break;
|
|
|
|
case KVM_STATS_BASE_POW2:
|
|
|
|
schema_entry->value->has_base = true;
|
|
|
|
schema_entry->value->base = 2;
|
|
|
|
break;
|
|
|
|
default:
|
|
|
|
goto exit;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
schema_entry->value->name = g_strdup(pdesc->name);
|
|
|
|
schema_entry->next = list;
|
|
|
|
return schema_entry;
|
|
|
|
exit:
|
|
|
|
g_free(schema_entry->value);
|
|
|
|
g_free(schema_entry);
|
|
|
|
return list;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Cached stats descriptors */
|
|
|
|
typedef struct StatsDescriptors {
|
|
|
|
const char *ident; /* cache key, currently the StatsTarget */
|
|
|
|
struct kvm_stats_desc *kvm_stats_desc;
|
2022-09-05 12:06:02 +02:00
|
|
|
struct kvm_stats_header kvm_stats_header;
|
2022-02-15 16:04:33 +01:00
|
|
|
QTAILQ_ENTRY(StatsDescriptors) next;
|
|
|
|
} StatsDescriptors;
|
|
|
|
|
|
|
|
static QTAILQ_HEAD(, StatsDescriptors) stats_descriptors =
|
|
|
|
QTAILQ_HEAD_INITIALIZER(stats_descriptors);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Return the descriptors for 'target', that either have already been read
|
|
|
|
* or are retrieved from 'stats_fd'.
|
|
|
|
*/
|
|
|
|
static StatsDescriptors *find_stats_descriptors(StatsTarget target, int stats_fd,
|
|
|
|
Error **errp)
|
|
|
|
{
|
|
|
|
StatsDescriptors *descriptors;
|
|
|
|
const char *ident;
|
|
|
|
struct kvm_stats_desc *kvm_stats_desc;
|
|
|
|
struct kvm_stats_header *kvm_stats_header;
|
|
|
|
size_t size_desc;
|
|
|
|
ssize_t ret;
|
|
|
|
|
|
|
|
ident = StatsTarget_str(target);
|
|
|
|
QTAILQ_FOREACH(descriptors, &stats_descriptors, next) {
|
|
|
|
if (g_str_equal(descriptors->ident, ident)) {
|
|
|
|
return descriptors;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
descriptors = g_new0(StatsDescriptors, 1);
|
|
|
|
|
|
|
|
/* Read stats header */
|
2022-09-05 12:06:02 +02:00
|
|
|
kvm_stats_header = &descriptors->kvm_stats_header;
|
2023-06-18 23:24:40 +02:00
|
|
|
ret = pread(stats_fd, kvm_stats_header, sizeof(*kvm_stats_header), 0);
|
2022-02-15 16:04:33 +01:00
|
|
|
if (ret != sizeof(*kvm_stats_header)) {
|
|
|
|
error_setg(errp, "KVM stats: failed to read stats header: "
|
|
|
|
"expected %zu actual %zu",
|
|
|
|
sizeof(*kvm_stats_header), ret);
|
2022-06-24 08:31:59 +02:00
|
|
|
g_free(descriptors);
|
2022-02-15 16:04:33 +01:00
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
size_desc = sizeof(*kvm_stats_desc) + kvm_stats_header->name_size;
|
|
|
|
|
|
|
|
/* Read stats descriptors */
|
|
|
|
kvm_stats_desc = g_malloc0_n(kvm_stats_header->num_desc, size_desc);
|
|
|
|
ret = pread(stats_fd, kvm_stats_desc,
|
|
|
|
size_desc * kvm_stats_header->num_desc,
|
|
|
|
kvm_stats_header->desc_offset);
|
|
|
|
|
|
|
|
if (ret != size_desc * kvm_stats_header->num_desc) {
|
|
|
|
error_setg(errp, "KVM stats: failed to read stats descriptors: "
|
|
|
|
"expected %zu actual %zu",
|
|
|
|
size_desc * kvm_stats_header->num_desc, ret);
|
|
|
|
g_free(descriptors);
|
|
|
|
g_free(kvm_stats_desc);
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
descriptors->kvm_stats_desc = kvm_stats_desc;
|
|
|
|
descriptors->ident = ident;
|
|
|
|
QTAILQ_INSERT_TAIL(&stats_descriptors, descriptors, next);
|
|
|
|
return descriptors;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void query_stats(StatsResultList **result, StatsTarget target,
|
2023-06-18 23:24:40 +02:00
|
|
|
strList *names, int stats_fd, CPUState *cpu,
|
|
|
|
Error **errp)
|
2022-02-15 16:04:33 +01:00
|
|
|
{
|
|
|
|
struct kvm_stats_desc *kvm_stats_desc;
|
|
|
|
struct kvm_stats_header *kvm_stats_header;
|
|
|
|
StatsDescriptors *descriptors;
|
|
|
|
g_autofree uint64_t *stats_data = NULL;
|
|
|
|
struct kvm_stats_desc *pdesc;
|
|
|
|
StatsList *stats_list = NULL;
|
|
|
|
size_t size_desc, size_data = 0;
|
|
|
|
ssize_t ret;
|
|
|
|
int i;
|
|
|
|
|
|
|
|
descriptors = find_stats_descriptors(target, stats_fd, errp);
|
|
|
|
if (!descriptors) {
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2022-09-05 12:06:02 +02:00
|
|
|
kvm_stats_header = &descriptors->kvm_stats_header;
|
2022-02-15 16:04:33 +01:00
|
|
|
kvm_stats_desc = descriptors->kvm_stats_desc;
|
|
|
|
size_desc = sizeof(*kvm_stats_desc) + kvm_stats_header->name_size;
|
|
|
|
|
|
|
|
/* Tally the total data size */
|
|
|
|
for (i = 0; i < kvm_stats_header->num_desc; ++i) {
|
|
|
|
pdesc = (void *)kvm_stats_desc + i * size_desc;
|
|
|
|
size_data += pdesc->size * sizeof(*stats_data);
|
|
|
|
}
|
|
|
|
|
|
|
|
stats_data = g_malloc0(size_data);
|
|
|
|
ret = pread(stats_fd, stats_data, size_data, kvm_stats_header->data_offset);
|
|
|
|
|
|
|
|
if (ret != size_data) {
|
|
|
|
error_setg(errp, "KVM stats: failed to read data: "
|
|
|
|
"expected %zu actual %zu", size_data, ret);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
for (i = 0; i < kvm_stats_header->num_desc; ++i) {
|
|
|
|
uint64_t *stats;
|
|
|
|
pdesc = (void *)kvm_stats_desc + i * size_desc;
|
|
|
|
|
|
|
|
/* Add entry to the list */
|
|
|
|
stats = (void *)stats_data + pdesc->offset;
|
2022-05-24 19:13:16 +02:00
|
|
|
if (!apply_str_list_filter(pdesc->name, names)) {
|
|
|
|
continue;
|
|
|
|
}
|
2022-02-15 16:04:33 +01:00
|
|
|
stats_list = add_kvmstat_entry(pdesc, stats, stats_list, errp);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!stats_list) {
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
switch (target) {
|
|
|
|
case STATS_TARGET_VM:
|
|
|
|
add_stats_entry(result, STATS_PROVIDER_KVM, NULL, stats_list);
|
|
|
|
break;
|
|
|
|
case STATS_TARGET_VCPU:
|
|
|
|
add_stats_entry(result, STATS_PROVIDER_KVM,
|
2023-06-18 23:24:40 +02:00
|
|
|
cpu->parent_obj.canonical_path,
|
2022-02-15 16:04:33 +01:00
|
|
|
stats_list);
|
|
|
|
break;
|
|
|
|
default:
|
2022-07-19 15:48:53 +02:00
|
|
|
g_assert_not_reached();
|
2022-02-15 16:04:33 +01:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
static void query_stats_schema(StatsSchemaList **result, StatsTarget target,
|
|
|
|
int stats_fd, Error **errp)
|
|
|
|
{
|
|
|
|
struct kvm_stats_desc *kvm_stats_desc;
|
|
|
|
struct kvm_stats_header *kvm_stats_header;
|
|
|
|
StatsDescriptors *descriptors;
|
|
|
|
struct kvm_stats_desc *pdesc;
|
|
|
|
StatsSchemaValueList *stats_list = NULL;
|
|
|
|
size_t size_desc;
|
|
|
|
int i;
|
|
|
|
|
|
|
|
descriptors = find_stats_descriptors(target, stats_fd, errp);
|
|
|
|
if (!descriptors) {
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2022-09-05 12:06:02 +02:00
|
|
|
kvm_stats_header = &descriptors->kvm_stats_header;
|
2022-02-15 16:04:33 +01:00
|
|
|
kvm_stats_desc = descriptors->kvm_stats_desc;
|
|
|
|
size_desc = sizeof(*kvm_stats_desc) + kvm_stats_header->name_size;
|
|
|
|
|
|
|
|
/* Read the schema data */
|
|
|
|
for (i = 0; i < kvm_stats_header->num_desc; ++i) {
|
|
|
|
pdesc = (void *)kvm_stats_desc + i * size_desc;
|
|
|
|
stats_list = add_kvmschema_entry(pdesc, stats_list, errp);
|
|
|
|
}
|
|
|
|
|
|
|
|
add_stats_schema(result, STATS_PROVIDER_KVM, target, stats_list);
|
|
|
|
}
|
|
|
|
|
2023-06-18 23:24:40 +02:00
|
|
|
static void query_stats_vcpu(CPUState *cpu, StatsArgs *kvm_stats_args)
|
2022-02-15 16:04:33 +01:00
|
|
|
{
|
2023-06-18 23:24:40 +02:00
|
|
|
int stats_fd = cpu->kvm_vcpu_stats_fd;
|
2022-02-15 16:04:33 +01:00
|
|
|
Error *local_err = NULL;
|
|
|
|
|
|
|
|
if (stats_fd == -1) {
|
|
|
|
error_setg_errno(&local_err, errno, "KVM stats: ioctl failed");
|
|
|
|
error_propagate(kvm_stats_args->errp, local_err);
|
|
|
|
return;
|
|
|
|
}
|
2022-05-24 19:13:16 +02:00
|
|
|
query_stats(kvm_stats_args->result.stats, STATS_TARGET_VCPU,
|
2023-06-18 23:24:40 +02:00
|
|
|
kvm_stats_args->names, stats_fd, cpu,
|
|
|
|
kvm_stats_args->errp);
|
2022-02-15 16:04:33 +01:00
|
|
|
}
|
|
|
|
|
2023-06-18 23:24:40 +02:00
|
|
|
static void query_stats_schema_vcpu(CPUState *cpu, StatsArgs *kvm_stats_args)
|
2022-02-15 16:04:33 +01:00
|
|
|
{
|
2023-06-18 23:24:40 +02:00
|
|
|
int stats_fd = cpu->kvm_vcpu_stats_fd;
|
2022-02-15 16:04:33 +01:00
|
|
|
Error *local_err = NULL;
|
|
|
|
|
|
|
|
if (stats_fd == -1) {
|
|
|
|
error_setg_errno(&local_err, errno, "KVM stats: ioctl failed");
|
|
|
|
error_propagate(kvm_stats_args->errp, local_err);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
query_stats_schema(kvm_stats_args->result.schema, STATS_TARGET_VCPU, stats_fd,
|
|
|
|
kvm_stats_args->errp);
|
|
|
|
}
|
|
|
|
|
2022-04-26 14:59:44 +02:00
|
|
|
static void query_stats_cb(StatsResultList **result, StatsTarget target,
|
2022-05-24 19:13:16 +02:00
|
|
|
strList *names, strList *targets, Error **errp)
|
2022-02-15 16:04:33 +01:00
|
|
|
{
|
|
|
|
KVMState *s = kvm_state;
|
|
|
|
CPUState *cpu;
|
|
|
|
int stats_fd;
|
|
|
|
|
|
|
|
switch (target) {
|
|
|
|
case STATS_TARGET_VM:
|
|
|
|
{
|
|
|
|
stats_fd = kvm_vm_ioctl(s, KVM_GET_STATS_FD, NULL);
|
|
|
|
if (stats_fd == -1) {
|
|
|
|
error_setg_errno(errp, errno, "KVM stats: ioctl failed");
|
|
|
|
return;
|
|
|
|
}
|
2023-06-18 23:24:40 +02:00
|
|
|
query_stats(result, target, names, stats_fd, NULL, errp);
|
2022-02-15 16:04:33 +01:00
|
|
|
close(stats_fd);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
case STATS_TARGET_VCPU:
|
|
|
|
{
|
|
|
|
StatsArgs stats_args;
|
|
|
|
stats_args.result.stats = result;
|
2022-05-24 19:13:16 +02:00
|
|
|
stats_args.names = names;
|
2022-02-15 16:04:33 +01:00
|
|
|
stats_args.errp = errp;
|
|
|
|
CPU_FOREACH(cpu) {
|
2022-04-26 14:59:44 +02:00
|
|
|
if (!apply_str_list_filter(cpu->parent_obj.canonical_path, targets)) {
|
|
|
|
continue;
|
|
|
|
}
|
2023-06-18 23:24:40 +02:00
|
|
|
query_stats_vcpu(cpu, &stats_args);
|
2022-02-15 16:04:33 +01:00
|
|
|
}
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
default:
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
void query_stats_schemas_cb(StatsSchemaList **result, Error **errp)
|
|
|
|
{
|
|
|
|
StatsArgs stats_args;
|
|
|
|
KVMState *s = kvm_state;
|
|
|
|
int stats_fd;
|
|
|
|
|
|
|
|
stats_fd = kvm_vm_ioctl(s, KVM_GET_STATS_FD, NULL);
|
|
|
|
if (stats_fd == -1) {
|
|
|
|
error_setg_errno(errp, errno, "KVM stats: ioctl failed");
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
query_stats_schema(result, STATS_TARGET_VM, stats_fd, errp);
|
|
|
|
close(stats_fd);
|
|
|
|
|
2022-08-18 14:08:24 +02:00
|
|
|
if (first_cpu) {
|
|
|
|
stats_args.result.schema = result;
|
|
|
|
stats_args.errp = errp;
|
2023-06-18 23:24:40 +02:00
|
|
|
query_stats_schema_vcpu(first_cpu, &stats_args);
|
2022-08-18 14:08:24 +02:00
|
|
|
}
|
2022-02-15 16:04:33 +01:00
|
|
|
}
|