This moves these commands from MAINTAINERS section "QMP" to new
section "Stats". Status is Orphan. Volunteers welcome!
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <20230124121946.1139465-23-armbru@redhat.com>
If we update an existing memslot (e.g., resize, split), we temporarily
remove the memslot to re-add it immediately afterwards. These updates
are not atomic, especially not for KVM VCPU threads, such that we can
get spurious faults.
Let's inhibit most KVM ioctls while performing relevant updates, such
that we can perform the update just as if it would happen atomically
without additional kernel support.
We capture the add/del changes and apply them in the notifier commit
stage instead. There, we can check for overlaps and perform the ioctl
inhibiting only if really required (-> overlap).
To keep things simple we don't perform additional checks that wouldn't
actually result in an overlap -- such as !RAM memory regions in some
cases (see kvm_set_phys_mem()).
To minimize cache-line bouncing, use a separate indicator
(in_ioctl_lock) per CPU. Also, make sure to hold the kvm_slots_lock
while performing both actions (removing+re-adding).
We have to wait until all IOCTLs were exited and block new ones from
getting executed.
This approach cannot result in a deadlock as long as the inhibitor does
not hold any locks that might hinder an IOCTL from getting finished and
exited - something fairly unusual. The inhibitor will always hold the BQL.
AFAIKs, one possible candidate would be userfaultfd. If a page cannot be
placed (e.g., during postcopy), because we're waiting for a lock, or if the
userfaultfd thread cannot process a fault, because it is waiting for a
lock, there could be a deadlock. However, the BQL is not applicable here,
because any other guest memory access while holding the BQL would already
result in a deadlock.
Nothing else in the kernel should block forever and wait for userspace
intervention.
Note: pause_all_vcpus()/resume_all_vcpus() or
start_exclusive()/end_exclusive() cannot be used, as they either drop
the BQL or require to be called without the BQL - something inhibitors
cannot handle. We need a low-level locking mechanism that is
deadlock-free even when not releasing the BQL.
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Tested-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20221111154758.1372674-4-eesposit@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Using the new accel-blocker API, mark where ioctls are being called
in KVM. Next, we will implement the critical section that will take
care of performing memslots modifications atomically, therefore
preventing any new ioctl from running and allowing the running ones
to finish.
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20221111154758.1372674-3-eesposit@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Commit 012d4c96e2 changed the visitor functions taking Error ** to
return bool instead of void, and the commits following it used the new
return value to simplify error checking. Since then a few more uses
in need of the same treatment crept in. Do that. All pretty
mechanical except for
* balloon_stats_get_all()
This is basically the same transformation commit 012d4c96e2 applied
to the virtual walk example in include/qapi/visitor.h.
* set_max_queue_size()
Additionally replace "goto end of function" by return.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <20221121085054.683122-10-armbru@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
There are cases that malicious virtual machine can cause CPU stuck (due
to event windows don't open up), e.g., infinite loop in microcode when
nested #AC (CVE-2015-5307). No event window means no event (NMI, SMI and
IRQ) can be delivered. It leads the CPU to be unavailable to host or
other VMs. Notify VM exit is introduced to mitigate such kind of
attacks, which will generate a VM exit if no event window occurs in VM
non-root mode for a specified amount of time (notify window).
A new KVM capability KVM_CAP_X86_NOTIFY_VMEXIT is exposed to user space
so that the user can query the capability and set the expected notify
window when creating VMs. The format of the argument when enabling this
capability is as follows:
Bit 63:32 - notify window specified in qemu command
Bit 31:0 - some flags (e.g. KVM_X86_NOTIFY_VMEXIT_ENABLED is set to
enable the feature.)
Users can configure the feature by a new (x86 only) accel property:
qemu -accel kvm,notify-vmexit=run|internal-error|disable,notify-window=n
The default option of notify-vmexit is run, which will enable the
capability and do nothing if the exit happens. The internal-error option
raises a KVM internal error if it happens. The disable option does not
enable the capability. The default value of notify-window is 0. It is valid
only when notify-vmexit is not disabled. The valid range of notify-window
is non-negative. It is even safe to set it to zero since there's an
internal hardware threshold to be added to ensure no false positive.
Because a notify VM exit may happen with VM_CONTEXT_INVALID set in exit
qualification (no cases are anticipated that would set this bit), which
means VM context is corrupted. It would be reflected in the flags of
KVM_EXIT_NOTIFY exit. If KVM_NOTIFY_CONTEXT_INVALID bit is set, raise a KVM
internal error unconditionally.
Acked-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
Message-Id: <20220929072014.20705-5-chenyi.qiang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Expose struct KVMState out of kvm-all.c so that the field of struct
KVMState can be accessed when defining target-specific accelerator
properties.
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
Message-Id: <20220929072014.20705-4-chenyi.qiang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Several hypervisor capabilities in KVM are target-specific. When exposed
to QEMU users as accelerator properties (i.e. -accel kvm,prop=value), they
should not be available for all targets.
Add a hook for targets to add their own properties to -accel kvm, for
now no such property is defined.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220929072014.20705-3-chenyi.qiang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This removes the final hard coding of kvm_enabled() in gdbstub and
moves the check to an AccelOps.
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Mads Ynddal <mads@ynddal.dk>
Message-Id: <20220929114231.583801-46-alex.bennee@linaro.org>
As HW virtualization requires specific support to handle breakpoints
lets push out special casing out of the core gdbstub code and into
AccelOpsClass. This will make it easier to add other accelerator
support and reduces some of the stub shenanigans.
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Mads Ynddal <mads@ynddal.dk>
Message-Id: <20220929114231.583801-45-alex.bennee@linaro.org>
The support of single-stepping is very much dependent on support from
the accelerator we are using. To avoid special casing in gdbstub move
the probing out to an AccelClass function so future accelerators can
put their code there.
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Mads Ynddal <mads@ynddal.dk>
Message-Id: <20220929114231.583801-44-alex.bennee@linaro.org>
Reported by Coverity as CID 1490142. Since the size is constant and the
lifetime is the same as the StatsDescriptors struct, embed the struct
directly instead of using a separate allocation.
Suggested-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The following scenario can happen if QEMU sets more RESET flags while
the KVM_RESET_DIRTY_RINGS ioctl is ongoing on another host CPU:
CPU0 CPU1 CPU2
------------------------ ------------------ ------------------------
fill gfn0
store-rel flags for gfn0
fill gfn1
store-rel flags for gfn1
load-acq flags for gfn0
set RESET for gfn0
load-acq flags for gfn1
set RESET for gfn1
do ioctl! ----------->
ioctl(RESET_RINGS)
fill gfn2
store-rel flags for gfn2
load-acq flags for gfn2
set RESET for gfn2
process gfn0
process gfn1
process gfn2
do ioctl!
etc.
The three load-acquire in CPU0 synchronize with the three store-release
in CPU2, but CPU0 and CPU1 are only synchronized up to gfn1 and CPU1
may miss gfn2's fields other than flags.
The kernel must be able to cope with invalid values of the fields, and
userspace *will* invoke the ioctl once more. However, once the RESET flag
is cleared on gfn2, it is lost forever, therefore in the above scenario
CPU1 must read the correct value of gfn2's fields.
Therefore RESET must be set with a store-release, that will synchronize
with KVM's load-acquire in CPU1.
Cc: Gavin Shan <gshan@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The KVM_DIRTY_GFN_F_DIRTY flag ensures that the entry is valid. If
the read of the fields are not ordered after the read of the flag,
QEMU might see stale values.
Cc: Gavin Shan <gshan@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-M none creates a guest without a vCPU, causing the following error:
$ ./qemu-system-x86_64 -qmp stdio -M none -accel kvm
{execute:qmp_capabilities}
{"return": {}}
{execute: query-stats-schemas}
Segmentation fault (core dumped)
Fix it by not querying the vCPU stats if first_cpu is NULL.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
perror() is designed to append the decoded errno value to a
string. This, however, only makes sense if we called something that
actually sets errno prior to that.
For the callers that check for split irqchip support that is not the
case, and we end up with confusing error messages that end in
"success". Use error_report() instead.
Signed-off-by: Cornelia Huck <cohuck@redhat.com>
Message-Id: <20220728142446.438177-1-cohuck@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Coverity complains that there is a codepath in the query_stats()
function where it can leak the memory pointed to by stats_list. This
can only happen if the caller passes something other than
STATS_TARGET_VM or STATS_TARGET_VCPU as the 'target', which no
callsite does. Enforce this assumption using g_assert_not_reached(),
so that if we have a future bug we hit the assert rather than
silently leaking memory.
Resolves: Coverity CID 1490140
Fixes: cc01a3f4ca ("kvm: Support for querying fd-based stats")
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Message-Id: <20220719134853.327059-1-peter.maydell@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Setup a negative feedback system when vCPU thread
handling KVM_EXIT_DIRTY_RING_FULL exit by introducing
throttle_us_per_full field in struct CPUState. Sleep
throttle_us_per_full microseconds to throttle vCPU
if dirtylimit is in service.
Signed-off-by: Hyman Huang(黄勇) <huangy81@chinatelecom.cn>
Reviewed-by: Peter Xu <peterx@redhat.com>
Message-Id: <977e808e03a1cef5151cae75984658b6821be618.1656177590.git.huangy81@chinatelecom.cn>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Introduce kvm_dirty_ring_size util function to help calculate
dirty ring ful time.
Signed-off-by: Hyman Huang(黄勇) <huangy81@chinatelecom.cn>
Acked-by: Peter Xu <peterx@redhat.com>
Message-Id: <f9ce1f550bfc0e3a1f711e17b1dbc8f701700e56.1656177590.git.huangy81@chinatelecom.cn>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Add a non-required argument 'CPUState' to kvm_dirty_ring_reap so
that it can cover single vcpu dirty-ring-reaping scenario.
Signed-off-by: Hyman Huang(黄勇) <huangy81@chinatelecom.cn>
Reviewed-by: Peter Xu <peterx@redhat.com>
Message-Id: <c32001242875e83b0d9f78f396fe2dcd380ba9e8.1656177590.git.huangy81@chinatelecom.cn>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
The next version of Linux will introduce boolean statistics, which
can only have 0 or 1 values. Convert them to the new QAPI fields
added in the previous commit.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This function doesn't release descriptors in one error path,
result in memory leak. Call g_free() to release it.
Fixes: cc01a3f4ca ("kvm: Support for querying fd-based stats")
Signed-off-by: Miaoqian Lin <linmq006@gmail.com>
Message-Id: <20220624063159.57411-1-linmq006@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Allow retrieving only a subset of statistics. This can be useful
for example in order to plot a subset of the statistics many times
a second: KVM publishes ~40 statistics for each vCPU on x86; retrieving
and serializing all of them would be useless.
Another use will be in HMP in the following patch; implementing the
filter in the backend is easy enough that it was deemed okay to make
this a public interface.
Example:
{ "execute": "query-stats",
"arguments": {
"target": "vcpu",
"vcpus": [ "/machine/unattached/device[2]",
"/machine/unattached/device[4]" ],
"providers": [
{ "provider": "kvm",
"names": [ "l1d_flush", "exits" ] } } }
{ "return": {
"vcpus": [
{ "path": "/machine/unattached/device[2]"
"providers": [
{ "provider": "kvm",
"stats": [ { "name": "l1d_flush", "value": 41213 },
{ "name": "exits", "value": 74291 } ] } ] },
{ "path": "/machine/unattached/device[4]"
"providers": [
{ "provider": "kvm",
"stats": [ { "name": "l1d_flush", "value": 16132 },
{ "name": "exits", "value": 57922 } ] } ] } ] } }
Extracted from a patch by Mark Kanda.
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Allow retrieving the statistics from a specific provider only.
This can be used in the future by HMP commands such as "info
sync-profile" or "info profile". The next patch also adds
filter-by-provider capabilities to the HMP equivalent of
query-stats, "info stats".
Example:
{ "execute": "query-stats",
"arguments": {
"target": "vm",
"providers": [
{ "provider": "kvm" } ] } }
The QAPI is a bit more verbose than just a list of StatsProvider,
so that it can be subsequently extended with filtering of statistics
by name.
If a provider is specified more than once in the filter, each request
will be included separately in the output.
Extracted from a patch by Mark Kanda.
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Introduce a simple filtering of statistics, that allows to retrieve
statistics for a subset of the guest vCPUs. This will be used for
example by the HMP monitor, in order to retrieve the statistics
for the currently selected CPU.
Example:
{ "execute": "query-stats",
"arguments": {
"target": "vcpu",
"vcpus": [ "/machine/unattached/device[2]",
"/machine/unattached/device[4]" ] } }
Extracted from a patch by Mark Kanda.
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add support for querying fd-based KVM stats - as introduced by Linux kernel
commit:
cb082bfab59a ("KVM: stats: Add fd-based API to read binary stats data")
This allows the user to analyze the behavior of the VM without access
to debugfs.
Signed-off-by: Mark Kanda <mark.kanda@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
We have about 30 instances of the typo/variant spelling 'writeable',
and over 500 of the more common 'writable'. Standardize on the
latter.
Change produced with:
sed -i -e 's/\([Ww][Rr][Ii][Tt]\)[Ee]\([Aa][Bb][Ll][Ee]\)/\1\2/g' $(git grep -il writeable)
and then hand-undoing the instance in linux-headers/linux/kvm.h.
Most of these changes are in comments or documentation; the
exceptions are:
* a local variable in accel/hvf/hvf-accel-ops.c
* a local variable in accel/kvm/kvm-all.c
* the PMCR_WRITABLE_MASK macro in target/arm/internals.h
* the EPT_VIOLATION_GPA_WRITABLE macro in target/i386/hvf/vmcs.h
(which is never used anywhere)
* the AR_TYPE_WRITABLE_MASK macro in target/i386/hvf/vmx.h
(which is never used anywhere)
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Reviewed-by: Stefan Weil <sw@weilnetz.de>
Message-id: 20220505095015.2714666-1-peter.maydell@linaro.org
Replace the global variables with inlined helper functions. getpagesize() is very
likely annotated with a "const" function attribute (at least with glibc), and thus
optimization should apply even better.
This avoids the need for a constructor initialization too.
Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Message-Id: <20220323155743.1585078-12-marcandre.lureau@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Convert the TARGET_WORDS_BIGENDIAN macro, similarly to what was done
with HOST_BIG_ENDIAN. The new TARGET_BIG_ENDIAN macro is either 0 or 1,
and thus should always be defined to prevent misuse.
Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Suggested-by: Halil Pasic <pasic@linux.ibm.com>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20220323155743.1585078-8-marcandre.lureau@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Replace a config-time define with a compile time condition
define (compatible with clang and gcc) that must be declared prior to
its usage. This avoids having a global configure time define, but also
prevents from bad usage, if the config header wasn't included before.
This can help to make some code independent from qemu too.
gcc supports __BYTE_ORDER__ from about 4.6 and clang from 3.2.
Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
[ For the s390x parts I'm involved in ]
Acked-by: Halil Pasic <pasic@linux.ibm.com>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20220323155743.1585078-7-marcandre.lureau@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
g_new(T, n) is neater than g_malloc(sizeof(T) * n). It's also safer,
for two reasons. One, it catches multiplication overflowing size_t.
Two, it returns T * rather than void *, which lets the compiler catch
more type errors.
This commit only touches allocations with size arguments of the form
sizeof(T).
Patch created mechanically with:
$ spatch --in-place --sp-file scripts/coccinelle/use-g_new-etc.cocci \
--macro-file scripts/cocci-macro-file.h FILES...
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Acked-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Message-Id: <20220315144156.1595462-4-armbru@redhat.com>
Reviewed-by: Pavel Dovgalyuk <Pavel.Dovgalyuk@ispras.ru>
We invoke the kvm_irqchip_commit_routes() for each addition to MSI route
table, which is not efficient if we are adding lots of routes in some cases.
This patch lets callers invoke the kvm_irqchip_commit_routes(), so the
callers can decide how to optimize.
[1] https://lists.gnu.org/archive/html/qemu-devel/2021-11/msg00967.html
Signed-off-by: Longpeng <longpeng2@huawei.com>
Message-Id: <20220222141116.2091-3-longpeng2@huawei.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add cpus_are_resettable() to AccelOps, and implement it for the
KVM accelerator.
Suggested-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Message-Id: <20220207075426.81934-12-f4bug@amsat.org>
Add cpu_thread_is_idle() to AccelOps, and implement it for the
KVM / WHPX accelerators.
Suggested-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Message-Id: <20220207075426.81934-11-f4bug@amsat.org>
Use the KVM_GUESTDBG_BLOCKIRQ debug flag if supported.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
[Extracted from Maxim's patch into a separate commit. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20211111110604.207376-6-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
[Extracted from Maxim's patch into a separate commit. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Message-Id: <20211111110604.207376-5-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
dirty_pages is used to calculate dirtyrate via dirty ring, when
enabled, kvm-reaper will increase the dirty pages after gfns
being dirtied.
kvm_dirty_ring_enabled shows if kvm-reaper is working. dirtyrate
thread could use it to check if measurement can base on dirty
ring feature.
Signed-off-by: Hyman Huang(黄勇) <huangy81@chinatelecom.cn>
Message-Id: <fee5fb2ab17ec2159405fc54a3cff8e02322f816.1624040308.git.huangy81@chinatelecom.cn>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>
SEV is x86-specific, no need to add its stub to other
architectures. Move the stub file to target/i386/kvm/.
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20211007161716.453984-5-philmd@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Provide a name field for all the memory listeners. It can be used to identify
which memory listener is which.
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Message-Id: <20210817013553.30584-2-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
PowerPC has two KVM types (HV, PR) that translate into three kernel
modules:
kvm.ko - common kvm code
kvm_hv.ko - kvm running with MSR_HV=1 or MSR_HV|PR=0 in a nested guest.
kvm_pr.ko - kvm running in usermode MSR_PR=1.
Since the two KVM types can both be running at the same time, this
creates a situation in which it is possible for one or both of the
modules to fail to initialize, leaving the generic one behind. This
leads QEMU to think it can create a guest, but KVM will fail when
calling the type-specific code:
ioctl(KVM_CREATE_VM) failed: 22 Invalid argument
qemu-kvm: failed to initialize KVM: Invalid argument
Ideally this would be solved kernel-side, but it might be a while
until we can get rid of one of the modules. So in the meantime this
patch tries to make this less confusing for the end user by adding a
more elucidative message:
ioctl(KVM_CREATE_VM) failed: 22 Invalid argument
PPC KVM module is not loaded. Try 'modprobe kvm_hv'.
[dwg: Fixed error in #elif which failed compile on !ppc hosts]
Signed-off-by: Fabiano Rosas <farosas@linux.ibm.com>
Message-Id: <20210722141340.2367905-1-farosas@linux.ibm.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Found this when I wanted to try the per-vcpu dirty rate series out, then I
found that it's not really working and it can quickly hang death a guest. I
found strange errors (e.g. guest crash after migration) happens even without
the per-vcpu dirty rate series.
When merging dirty ring, probably no one notice that the trivial renaming diff
[1] missed two existing references of kvm_dirty_ring_sizes; they do matter
since otherwise we'll mmap() a shorter range of memory after the renaming.
I think it didn't SIGBUS for me easily simply because some other stuff within
qemu mmap()ed right after the dirty rings (e.g. when testing 4096 slots, it
aligned with one small page on x86), so when we access the rings we've been
reading/writting to random memory elsewhere of qemu.
Fix the two sizes when map/unmap the shared dirty gfn memory.
[1] https://lore.kernel.org/qemu-devel/dac5f0c6-1bca-3daf-e5d2-6451dbbaca93@redhat.com/
Cc: Hyman Huang <huangy81@chinatelecom.cn>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20210609014355.217110-1-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Commit e50caf4a5c ("tracing: convert documentation to rST")
converted docs/devel/tracing.txt to docs/devel/tracing.rst.
We still have several references to the old file, so let's fix them
with the following command:
sed -i s/tracing.txt/tracing.rst/ $(git grep -l docs/devel/tracing.txt)
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20210517151702.109066-2-sgarzare@redhat.com>
Signed-off-by: Thomas Huth <thuth@redhat.com>
KVM dirty ring is a new interface to pass over dirty bits from kernel to the
userspace. Instead of using a bitmap for each memory region, the dirty ring
contains an array of dirtied GPAs to fetch (in the form of offset in slots).
For each vcpu there will be one dirty ring that binds to it.
kvm_dirty_ring_reap() is the major function to collect dirty rings. It can be
called either by a standalone reaper thread that runs in the background,
collecting dirty pages for the whole VM. It can also be called directly by any
thread that has BQL taken.
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20210506160549.130416-11-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 is for KVM_CLEAR_DIRTY_LOG, which is only
useful for KVM_GET_DIRTY_LOG. Skip enabling it for kvm dirty ring.
More importantly, KVM_DIRTY_LOG_INITIALLY_SET will not wr-protect all the pages
initially, which is against how kvm dirty ring is used - there's no way for kvm
dirty ring to re-protect a page before it's notified as being written first
with a GFN entry in the ring! So when KVM_DIRTY_LOG_INITIALLY_SET is enabled
with dirty ring, we'll see silent data loss after migration.
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20210506160549.130416-10-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add a parameter for dirty gfn count for dirty rings. If zero, dirty ring is
disabled. Otherwise dirty ring will be enabled with the per-vcpu gfn count as
specified. If dirty ring cannot be enabled due to unsupported kernel or
illegal parameter, it'll fallback to dirty logging.
By default, dirty ring is not enabled (dirty-gfn-count default to 0).
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20210506160549.130416-9-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Cache it too because we'll reference it more frequently in the future.
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20210506160549.130416-8-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kvm_physical_sync_dirty_bitmap() on the whole section is inaccurate, because
the section can be a superset of the memslot that we're working on. The result
is that if the section covers multiple kvm memslots, we could be doing the
synchronization for multiple times for each kvmslot in the section.
With the two helpers that we just introduced, it's very easy to do it right now
by calling the helpers.
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20210506160549.130416-7-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kvm_physical_sync_dirty_bitmap() calculates the ramblock offset in an
awkward way from the MemoryRegionSection that passed in from the
caller. The truth is for each KVMSlot the ramblock offset never
change for the lifecycle. Cache the ramblock offset for each KVMSlot
into the structure when the KVMSlot is created.
With that, we can further simplify kvm_physical_sync_dirty_bitmap()
with a helper to sync KVMSlot dirty bitmap to the ramblock dirty
bitmap of a specific KVMSlot.
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20210506160549.130416-6-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>