129 Commits

Author SHA1 Message Date
Avi Kivity 15ad71460d KVM: Use the scheduler preemption notifiers to make kvm preemptible
Current kvm disables preemption while the new virtualization registers are
in use.  This of course is not very good for latency sensitive workloads (one
use of virtualization is to offload user interface and other latency
insensitive stuff to a container, so that it is easier to analyze the
remaining workload).  This patch re-enables preemption for kvm; preemption
is now only disabled when switching the registers in and out, and during
the switch to guest mode and back.

Contains fixes from Shaohua Li <shaohua.li@intel.com>.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-10-13 10:18:20 +02:00
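The following is a small userspace model of the mechanism. A notifier is embedded in the vcpu, and its sched_out/sched_in callbacks unload the virtualization registers when the task is preempted, then reload them on whichever cpu the task resumes on. The kernel's real interface (struct preempt_ops, preempt_notifier_register()) is not reproduced here. Every type and function below is a hypothetical stand-in, and the scheduler is simulated by a plain function call.

```c
#include <stddef.h>

struct notifier_sketch;

/* Callbacks the (simulated) scheduler invokes around a context switch. */
struct preempt_ops_sketch {
	void (*sched_in)(struct notifier_sketch *n, int cpu);
	void (*sched_out)(struct notifier_sketch *n);
};

struct notifier_sketch {
	const struct preempt_ops_sketch *ops;
};

struct vcpu_sketch {
	int loaded_cpu;			/* cpu holding our registers, -1 if none */
	struct notifier_sketch pn;	/* embedded, so callbacks can find the vcpu */
};

#define container_of_sketch(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

static void vcpu_sched_in(struct notifier_sketch *n, int cpu)
{
	struct vcpu_sketch *v = container_of_sketch(n, struct vcpu_sketch, pn);

	v->loaded_cpu = cpu;	/* "switch the registers in" on the new cpu */
}

static void vcpu_sched_out(struct notifier_sketch *n)
{
	struct vcpu_sketch *v = container_of_sketch(n, struct vcpu_sketch, pn);

	v->loaded_cpu = -1;	/* "switch the registers out" before yielding */
}

static const struct preempt_ops_sketch vcpu_preempt_ops = {
	.sched_in = vcpu_sched_in,
	.sched_out = vcpu_sched_out,
};

void vcpu_load_sketch(struct vcpu_sketch *v, int cpu)
{
	v->pn.ops = &vcpu_preempt_ops;	/* analogous to registering the notifier */
	vcpu_sched_in(&v->pn, cpu);
}

/* Stand-in for the scheduler preempting the task and resuming it elsewhere. */
void simulate_preemption(struct notifier_sketch *n, int new_cpu)
{
	n->ops->sched_out(n);
	n->ops->sched_in(n, new_cpu);
}
```

Because the state is reloaded in sched_in, preemption only has to be disabled around the register switch and the guest entry itself.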
Jeff Dike 519ef35341 KVM: add hypercall nr to kvm_run
Add the hypercall number to kvm_run and initialize it.  This changes the ABI,
but since this particular ABI was unusable before, no users are affected.

Signed-off-by: Jeff Dike <jdike@linux.intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-10-13 10:18:20 +02:00
Rusty Russell fb3f0f51d9 KVM: Dynamically allocate vcpus
This patch converts the vcpus array in "struct kvm" to a pointer
array, and changes the "vcpu_create" and "vcpu_setup" hooks into one
"vcpu_create" call which does the allocation and initialization of the
vcpu (calling back into the kvm_vcpu_init core helper).

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-10-13 10:18:20 +02:00
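Here is a minimal sketch of the shape this commit describes. The VM keeps an array of vcpu pointers instead of embedded structs, and a single create hook both allocates and initializes the vcpu. It does so by calling back into a core init helper, here vcpu_init_sketch(), which stands in for kvm_vcpu_init. The limit, field names and error codes are assumptions for illustration.

```c
#include <errno.h>
#include <stdlib.h>

#define MAX_VCPUS_SKETCH 4	/* assumed limit, for illustration only */

struct vcpu_sketch {
	int id;
	int initialized;
};

/* The VM holds pointers; a vcpu costs memory only once it is created. */
struct vm_sketch {
	struct vcpu_sketch *vcpus[MAX_VCPUS_SKETCH];
};

/* Core helper that the arch-specific create hook calls back into. */
static int vcpu_init_sketch(struct vcpu_sketch *v, int id)
{
	v->id = id;
	v->initialized = 1;
	return 0;
}

/* One hook that both allocates and initializes, replacing create + setup. */
int vcpu_create_sketch(struct vm_sketch *vm, int id)
{
	struct vcpu_sketch *v;
	int r;

	if (id < 0 || id >= MAX_VCPUS_SKETCH)
		return -EINVAL;
	if (vm->vcpus[id])
		return -EEXIST;

	v = calloc(1, sizeof(*v));
	if (!v)
		return -ENOMEM;

	r = vcpu_init_sketch(v, id);
	if (r) {
		free(v);
		return r;
	}
	vm->vcpus[id] = v;
	return 0;
}

void vm_destroy_sketch(struct vm_sketch *vm)
{
	for (int i = 0; i < MAX_VCPUS_SKETCH; i++) {
		free(vm->vcpus[i]);	/* free(NULL) is a no-op */
		vm->vcpus[i] = NULL;
	}
}
```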
Gregory Haskins a2fa3e9f52 KVM: Remove arch specific components from the general code
struct kvm_vcpu has vmx-specific members; move them to a private structure.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-10-13 10:18:20 +02:00
Rusty Russell c820c2aa27 KVM: load_pdptrs() cleanups
load_pdptrs can be handed an invalid cr3, and it should not oops.
This can happen because we injected #gp in set_cr3() after setting
vcpu->cr3 to the invalid value, because of kvm_vcpu_ioctl_set_sregs(),
or because the memory configuration changed after the guest did set_cr3().

We should also copy the pdpte array once, before checking and
assigning, otherwise an SMP guest can potentially alter the values
between the check and the set.

Finally, one nitpick: ret = 1 should be assigned as late as possible; this
allows GCC to warn about an unset "ret" should the function change in
the future.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-10-13 10:18:20 +02:00
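The copy-once rule from the load_pdptrs() cleanup can be sketched as follows. The function is hypothetical, and the reserved-bit mask is passed in as a parameter because the exact mask is not given here. All four entries are snapshotted before any check, so a concurrent writer on another vcpu cannot change a value between validation and installation. ret is assigned only on the final paths.

```c
#include <stdint.h>
#include <string.h>

#define PDPTE_PRESENT_SKETCH 1ull	/* bit 0: entry is present */

/* Hypothetical helper: validate and install four PAE PDPTEs.
 * `guest` may be modified concurrently, so each entry is read exactly once. */
int load_pdptrs_sketch(const volatile uint64_t guest[4], uint64_t installed[4],
		       uint64_t rsvd_mask)
{
	uint64_t snapshot[4];
	int i;
	int ret;

	/* Copy once: the checks and the final assignment both use the snapshot. */
	for (i = 0; i < 4; i++)
		snapshot[i] = guest[i];

	for (i = 0; i < 4; i++) {
		if ((snapshot[i] & PDPTE_PRESENT_SKETCH) && (snapshot[i] & rsvd_mask)) {
			ret = 0;	/* invalid: leave `installed` untouched */
			goto out;
		}
	}

	memcpy(installed, snapshot, sizeof(snapshot));
	ret = 1;		/* assigned as late as possible */
out:
	return ret;
}
```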
Shaohua Li fe55188194 KVM: Move gfn_to_page out of kmap/unmap pairs
gfn_to_page might sleep with swap support. Move it out of the kmap calls.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-10-13 10:18:19 +02:00
Rusty Russell 310bc76c2b KVM: Return if the pdptrs are invalid when the guest turns on PAE.
Don't fall through and turn on PAE in this case.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-10-13 10:18:19 +02:00
Rusty Russell 7075bc816c KVM: Use standard CR8 flags, and fix TPR definition
The Intel manual (and the KVM definition) say the TPR is 4 bits wide.  Also fix
the CR8_RESEVED_BITS typo.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-10-13 10:18:19 +02:00
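To illustrate a 4-bit TPR, this sketch maps between CR8 and the 8-bit local APIC task-priority register, where CR8[3:0] corresponds to TPR[7:4]. It treats every CR8 bit above bit 3 as reserved. The macro and function names are hypothetical, not KVM's.

```c
#include <stdint.h>

/* Only CR8[3:0] is defined; everything above is reserved. */
#define CR8_TPR_MASK_SKETCH		0xfull
#define CR8_RESERVED_BITS_SKETCH	(~CR8_TPR_MASK_SKETCH)

/* CR8 holds the priority class, i.e. bits 7:4 of the 8-bit APIC TPR. */
static inline uint8_t cr8_to_apic_tpr(uint64_t cr8)
{
	return (uint8_t)((cr8 & CR8_TPR_MASK_SKETCH) << 4);
}

static inline uint64_t apic_tpr_to_cr8(uint8_t tpr)
{
	return (tpr >> 4) & CR8_TPR_MASK_SKETCH;
}

/* A write with any reserved bit set should be refused (it faults). */
static inline int cr8_write_ok(uint64_t cr8)
{
	return (cr8 & CR8_RESERVED_BITS_SKETCH) == 0;
}
```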
Jeff Dike 8fc0d085f5 KVM: Set exit_reason to KVM_EXIT_MMIO where run->mmio is initialized.
Signed-off-by: Jeff Dike <jdike@linux.intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-10-13 10:18:19 +02:00
Rusty Russell 9eb829ced8 KVM: Trivial: Use standard BITMAP macros, open-code userspace-exposed header
Creating one's own BITMAP macro seems suboptimal: if we use manual
arithmetic in the one place exposed to userspace, we can use standard
macros elsewhere.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-10-13 10:18:18 +02:00
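The split described in this commit might look like this. The userspace-visible struct sizes its bitmap with open-coded arithmetic on fixed-width words, while kernel-internal code uses a BITS_TO_LONGS()-style macro. All names are hypothetical stand-ins, and the interrupt count of 256 is an assumption for the example.

```c
#include <limits.h>
#include <stdint.h>

#define NR_INTERRUPTS_SKETCH 256

/* Kernel-style helpers, modelled on BITS_TO_LONGS(). */
#define BITS_PER_LONG_SKETCH (sizeof(unsigned long) * CHAR_BIT)
#define BITS_TO_LONGS_SKETCH(n) \
	(((n) + BITS_PER_LONG_SKETCH - 1) / BITS_PER_LONG_SKETCH)

/* Userspace-visible layout: fixed-width words and open-coded arithmetic,
 * so the ABI does not depend on sizeof(long). */
struct sregs_sketch {
	uint64_t interrupt_bitmap[(NR_INTERRUPTS_SKETCH + 63) / 64];
};

/* Kernel-internal layout: free to use the standard macro. */
struct vcpu_sketch {
	unsigned long irq_pending[BITS_TO_LONGS_SKETCH(NR_INTERRUPTS_SKETCH)];
};

static inline void set_bit_sketch(unsigned long *map, unsigned int nr)
{
	map[nr / BITS_PER_LONG_SKETCH] |= 1UL << (nr % BITS_PER_LONG_SKETCH);
}

static inline int test_bit_sketch(const unsigned long *map, unsigned int nr)
{
	return (map[nr / BITS_PER_LONG_SKETCH] >> (nr % BITS_PER_LONG_SKETCH)) & 1UL;
}
```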
Rusty Russell 66aee91aaa KVM: Use standard CR4 flags, tighten checking
On this machine (Intel), writing to the CR4 bits 0x00000800 and
0x00001000 causes a GPF.  The Intel manual is a little unclear, but
AFAICT they're reserved, too.

Also fix spelling of CR4_RESEVED_BITS.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-10-13 10:18:18 +02:00
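Here is a sketch of the tightened check. It assumes the set of supported CR4 bits is exactly the architectural flags listed below. The reserved mask is the complement of this whitelist, so 0x800 and 0x1000 are rejected without being named. The _SKETCH macros mirror the standard flag values but are not the kernel's definitions.

```c
/* Architectural CR4 bit values, with a _SKETCH suffix. */
#define CR4_VME_SKETCH		0x0001ul
#define CR4_PVI_SKETCH		0x0002ul
#define CR4_TSD_SKETCH		0x0004ul
#define CR4_DE_SKETCH		0x0008ul
#define CR4_PSE_SKETCH		0x0010ul
#define CR4_PAE_SKETCH		0x0020ul
#define CR4_MCE_SKETCH		0x0040ul
#define CR4_PGE_SKETCH		0x0080ul
#define CR4_PCE_SKETCH		0x0100ul
#define CR4_OSFXSR_SKETCH	0x0200ul
#define CR4_OSXMMEXCPT_SKETCH	0x0400ul
#define CR4_VMXE_SKETCH		0x2000ul

/* Whitelist: every bit not listed is treated as reserved, which covers
 * 0x800 and 0x1000 as well as everything above VMXE. */
#define CR4_RESERVED_BITS_SKETCH					\
	(~(CR4_VME_SKETCH | CR4_PVI_SKETCH | CR4_TSD_SKETCH |		\
	   CR4_DE_SKETCH | CR4_PSE_SKETCH | CR4_PAE_SKETCH |		\
	   CR4_MCE_SKETCH | CR4_PGE_SKETCH | CR4_PCE_SKETCH |		\
	   CR4_OSFXSR_SKETCH | CR4_OSXMMEXCPT_SKETCH | CR4_VMXE_SKETCH))

static inline int cr4_write_ok(unsigned long cr4)
{
	return (cr4 & CR4_RESERVED_BITS_SKETCH) == 0;
}
```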
Rusty Russell f802a307cb KVM: Use standard CR3 flags, tighten checking
The kernel now has asm/cpu-features.h: use those macros instead of inventing
our own.

Also spell out definition of CR3_RESEVED_BITS, fix spelling and
tighten it for the non-PAE case.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-10-13 10:18:18 +02:00
Rusty Russell 707d92fa72 KVM: Trivial: Use standard CR0 flags macros from asm/cpu-features.h
The kernel now has asm/cpu-features.h: use those macros instead of
inventing our own.

Also spell out definition of CR0_RESEVED_BITS (no code change) and fix typo.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-10-13 10:18:18 +02:00
Rusty Russell 9a2b85c620 KVM: Trivial: Avoid hardware_disable predeclaration
Don't pre-declare hardware_disable: shuffle the reboot hook down.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-10-13 10:18:18 +02:00
Eddie Dong 65619eb5a8 KVM: In-kernel string pio write support
Add string pio write support to support some versions of Windows.

Signed-off-by: Yaozu (Eddie) Dong <eddie.dong@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-10-13 10:18:17 +02:00
Qing He dad3795d2b KVM: SMP: Add vcpu_id field in struct vcpu
This patch adds a `vcpu_id' field in `struct vcpu', so we can
differentiate BSP and APs without pointer comparison or arithmetic.

Signed-off-by: Qing He <qing.he@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-10-13 10:18:17 +02:00
Nguyen Anh Quynh cd0d913797 KVM: Fix *nopage() in kvm_main.c
*nopage() in kvm_main.c should only store the type of mmap() fault if
the pointers are not NULL. This patch fixes the problem.

Signed-off-by: Nguyen Anh Quynh <aquynh@gmail.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-10-13 10:18:17 +02:00
Avi Kivity 6ec8a856e4 KVM: Avoid calling smp_call_function_single() with interrupts disabled
When taking a cpu down, we need to hardware_disable() it.
Unfortunately, the CPU_DYING notifier is called with interrupts
disabled, which means we can't use smp_call_function_single().

Fortunately, the CPU_DYING notifier is always called on the dying cpu,
so we don't need to use the function at all and can simply call
hardware_disable() directly.

Tested-by: Paolo Ornati <ornati@fastwebnet.it>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-08-19 10:13:49 -07:00
Avi Kivity 4c981b43d7 KVM: Fix removal of nx capability from guest cpuid
Testing the wrong bit caused kvm not to disable nx on the guest when it is
disabled on the host (an mmu optimization relies on the nx bits being the
same in the guest and host).

This allows Windows to boot when nx is disabled on the host (e.g. when
host pae is disabled).

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-07-25 14:31:13 +03:00
Avi Kivity 7cfa4b0a43 Revert "KVM: Avoid useless memory write when possible"
This reverts commit a3c870bdce.  While it
does save useless updates, it (probably) defeats the fork detector, causing
a massive performance loss.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-07-25 14:30:56 +03:00
Rusty Russell 5e58cfe41c KVM: Fix unlikely kvm_create vs decache_vcpus_on_cpu race
We add the kvm to the vm_list before initializing the vcpu mutexes,
which can be mutex_trylock()'ed by decache_vcpus_on_cpu().

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-07-25 14:29:34 +03:00
Avi Kivity b0fcd903e6 KVM: Correctly handle writes crossing a page boundary
Writes that are contiguous in virtual memory may not be contiguous in
physical memory; so split writes that straddle a page boundary.

Thanks to Aurelien for reporting the bug, patient testing, and a fix
to this very patch.

Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-07-25 14:29:17 +03:00
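The page-splitting rule can be sketched as a loop that never hands the per-page writer a range crossing a page boundary, so each chunk can be translated to its own physical page. The sketch assumes a 4 KiB page size, and write_split_by_page() and log_chunk() are hypothetical helpers.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE_SKETCH 4096u	/* assumed 4 KiB guest pages */

/* Per-page writer: stands in for the physical-memory write path. */
typedef int (*page_write_fn)(uint64_t gva, const void *data, size_t len, void *ctx);

/* Split a virtually contiguous write so that no chunk crosses a page
 * boundary; each chunk may then map to a different physical page. */
int write_split_by_page(uint64_t gva, const void *data, size_t len,
			page_write_fn fn, void *ctx)
{
	const unsigned char *p = data;

	while (len > 0) {
		size_t offset = (size_t)(gva & (PAGE_SIZE_SKETCH - 1));
		size_t chunk = PAGE_SIZE_SKETCH - offset;
		int r;

		if (chunk > len)
			chunk = len;
		r = fn(gva, p, chunk, ctx);
		if (r)
			return r;	/* stop at the first failing page */
		gva += chunk;
		p += chunk;
		len -= chunk;
	}
	return 0;
}

/* Example callback that only records the chunks it receives. */
struct chunk_log {
	uint64_t gva[8];
	size_t len[8];
	int n;
};

int log_chunk(uint64_t gva, const void *data, size_t len, void *ctx)
{
	struct chunk_log *rec = ctx;

	(void)data;
	if (rec->n >= 8)
		return -1;
	rec->gva[rec->n] = gva;
	rec->len[rec->n] = len;
	rec->n++;
	return 0;
}
```

A 16-byte write starting at address 4090 is delivered as two chunks: 6 bytes at 4090 and 10 bytes at 4096.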
Avi Kivity 35f3f28613 KVM: x86 emulator: implement rdmsr and wrmsr
Allow real-mode emulation of rdmsr and wrmsr.  This allows smp Windows to
boot, presumably for its sipi trampoline.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-07-20 20:16:29 +03:00
Avi Kivity 90cb0529dd KVM: Fix memory slot management functions for guest smp
The memory slot management functions were oriented towards vcpu 0, whereas
they should be kvm-wide.  This causes hangs when starting X on guest smp.

Fix by making the functions (and resultant tail in the mmu) non-vcpu-specific.
Unfortunately this reduces the efficiency of the mmu object cache a bit.  We
may have to revisit this later.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-07-20 20:16:29 +03:00
Avi Kivity cec9ad279b KVM: Use CPU_DYING for disabling virtualization
Only at the CPU_DYING stage can we be sure that no user process will
be scheduled onto the cpu and oops when trying to use virtualization
extensions.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-07-16 12:05:51 +03:00
Avi Kivity 4267c41a45 KVM: Tune hotplug/suspend IPIs
The hotplug IPIs can be called from the cpu on which we are currently
running, so use on_cpu().  Similarly, drop on_each_cpu() for the
suspend/resume callbacks, as we're in atomic context here and only one
cpu is up anyway.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-07-16 12:05:51 +03:00
Avi Kivity 1b6c016818 KVM: Keep track of which cpus have virtualization enabled
By keeping track of which cpus have virtualization enabled, we
prevent double-enable or double-disable during hotplug, either of which
causes a fatal oops.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-07-16 12:05:51 +03:00
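The bookkeeping can be sketched with one bit per cpu. The sketch assumes at most 64 cpus (the kernel would use a cpumask) and single-threaded callers. Enable and disable become idempotent, which is what prevents the double-enable and double-disable described above. All names are hypothetical.

```c
#include <stdint.h>

/* One bit per cpu; assumes at most 64 cpus. */
static uint64_t cpus_hardware_enabled_sketch;

/* Returns 1 if virtualization was switched on, 0 if it already was. */
int hardware_enable_sketch(unsigned int cpu)
{
	uint64_t bit;

	if (cpu >= 64)
		return 0;	/* outside the assumed range */
	bit = UINT64_C(1) << cpu;
	if (cpus_hardware_enabled_sketch & bit)
		return 0;	/* already on: a second enable is skipped */
	cpus_hardware_enabled_sketch |= bit;
	/* ...the arch-specific enable would run here, on `cpu` itself... */
	return 1;
}

/* Returns 1 if virtualization was switched off, 0 if it already was. */
int hardware_disable_sketch(unsigned int cpu)
{
	uint64_t bit;

	if (cpu >= 64)
		return 0;
	bit = UINT64_C(1) << cpu;
	if (!(cpus_hardware_enabled_sketch & bit))
		return 0;	/* already off: a second disable is skipped */
	cpus_hardware_enabled_sketch &= ~bit;
	/* ...the arch-specific disable would run here... */
	return 1;
}
```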
Avi Kivity e495606dd0 KVM: Clean up #includes
Remove unnecessary ones, and rearrange the remaining in the standard order.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-07-16 12:05:49 +03:00
Avi Kivity d6d2816849 KVM: Remove kvmfs in favor of the anonymous inodes source
kvm uses a pseudo filesystem, kvmfs, to generate inodes, a job that the
new anonymous inodes source does much better.

Cc: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-07-16 12:05:49 +03:00
Luca Tettamanti a3c870bdce KVM: Avoid useless memory write when possible
When writing to normal memory and the memory area is unchanged, the write
can be safely skipped, avoiding the costly kvm_mmu_pte_write.

Signed-Off-By: Luca Tettamanti <kronos.it@gmail.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-07-16 12:05:48 +03:00
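Here is a minimal sketch of the skipped-write check. It assumes a hypothetical hook that stands in for the costly kvm_mmu_pte_write(). The contents are compared first, and the write (and notification) happens only when they differ. Commit 7cfa4b0a43 in this log reverts this change because it (probably) defeats the fork detector, so this shows the idea rather than code KVM kept.

```c
#include <stddef.h>
#include <string.h>

/* Counts calls to the hypothetical stand-in for the expensive pte update. */
int pte_write_hook_calls;

static void pte_write_hook_sketch(void *dst, size_t len)
{
	(void)dst;
	(void)len;
	pte_write_hook_calls++;
}

/* Returns 1 if memory was written, 0 if the write was skipped. */
int emulated_write_sketch(void *dst, const void *src, size_t len)
{
	if (memcmp(dst, src, len) == 0)
		return 0;	/* contents unchanged: skip the write and the hook */
	memcpy(dst, src, len);
	pte_write_hook_sketch(dst, len);
	return 1;
}
```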
Eddie Dong 74906345ff KVM: Add support for in-kernel pio handlers
Useful for the PIC and PIT.

Signed-off-by: Yaozu (Eddie) Dong <eddie.dong@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-07-16 12:05:48 +03:00
Gregory Haskins 2eeb2e94eb KVM: Adds support for in-kernel mmio handlers
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-07-16 12:05:47 +03:00
Avi Kivity d9e368d612 KVM: Flush remote tlbs when reducing shadow pte permissions
When a vcpu causes a shadow tlb entry to have reduced permissions, it
must also clear the tlb on remote vcpus.  We do that by:

- setting a bit on the vcpu that requests a tlb flush before the next entry
- if the vcpu is currently executing, we send an ipi to make sure it
  exits before we continue

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-07-16 12:05:46 +03:00
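The two steps above can be modelled in userspace with C11 atomics. The request bit is set first, and then only vcpus currently in guest mode are interrupted. The IPI is simulated by a counter, and the request-bit index is an assumption. The vcpu consumes the request on its own thread before its next entry. None of these names are KVM's.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define REQ_TLB_FLUSH_SKETCH 0u	/* assumed request-bit index */

struct vcpu_sketch {
	atomic_uint requests;	/* pending requests, checked before each entry */
	atomic_bool in_guest;	/* true while the vcpu is executing guest code */
	int ipis_sent;		/* stand-in for real IPIs, for illustration */
};

static void send_ipi_sketch(struct vcpu_sketch *v)
{
	v->ipis_sent++;		/* a real IPI would force a vmexit */
}

/* Called after a shadow pte has had its permissions reduced. */
void request_remote_tlb_flush(struct vcpu_sketch *vcpus, int n)
{
	for (int i = 0; i < n; i++) {
		/* 1. ask for a flush before the next guest entry */
		atomic_fetch_or(&vcpus[i].requests, 1u << REQ_TLB_FLUSH_SKETCH);
		/* 2. if the vcpu is executing now, make it exit */
		if (atomic_load(&vcpus[i].in_guest))
			send_ipi_sketch(&vcpus[i]);
	}
}

/* Called on the vcpu's own thread just before entering the guest;
 * returns true if a flush was pending, and clears the request. */
bool consume_tlb_flush_request(struct vcpu_sketch *v)
{
	unsigned int old = atomic_fetch_and(&v->requests,
					    ~(1u << REQ_TLB_FLUSH_SKETCH));

	return (old & (1u << REQ_TLB_FLUSH_SKETCH)) != 0;
}
```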
Avi Kivity 39c3b86e5c KVM: Keep an upper bound of initialized vcpus
That way, we don't need to loop for KVM_MAX_VCPUS for a single vcpu
vm.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-07-16 12:05:46 +03:00
Avi Kivity d3bef15f84 KVM: Move duplicate halt handling code into kvm_main.c
Will soon have a third user.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-07-16 12:05:46 +03:00
Avi Kivity 120e9a453b KVM: Fix adding an smp virtual machine to the vm list
If we add the vm once per vcpu, we corrupt the list if the guest has
multiple vcpus.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-07-16 12:05:45 +03:00
Avi Kivity 7b53aa5650 KVM: Fix vcpu freeing for guest smp
A vcpu can pin up to four mmu shadow pages, which means the freeing
loop will never terminate.  Fix by first unpinning shadow pages on
all vcpus, then freeing shadow pages.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-07-16 12:05:45 +03:00
Nguyen Anh Quynh 313899477f KVM: Remove unnecessary initialization and checks in mark_page_dirty()
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-07-16 12:05:45 +03:00
Avi Kivity d3d25b048b KVM: MMU: Use slab caches for shadow pages and their headers
Use slab caches instead of a simple custom list.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-07-16 12:05:43 +03:00
Eddie Dong 2cc51560ae KVM: VMX: Avoid saving and restoring msr_efer on lightweight vmexit
MSR_EFER.LME/LMA bits are automatically saved/restored by VMX
hardware; KVM only needs to save the NX/SCE bits at the time of a
heavyweight VM exit. But clearing the NX bit in the host environment may
cause a system hang if the host page table is using EXB bits,
so we leave the NX bit as it is. If host NX=1 and guest NX=0, we
can check the guest page table EXB bits before inserting a shadow
pte (though no guest is expecting to see this kind of gp fault).
If host NX=0, we do not present the Execute-Disable feature to the guest,
so the host NX=0, guest NX=1 combination cannot occur.

This patch reduces raw vmexit time by ~27%.

Me: fix compile warnings on i386.

Signed-off-by: Yaozu (Eddie) Dong <eddie.dong@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-07-16 12:05:42 +03:00
Matthew Gregan 2dc7094b56 KVM: Implement IA32_EBL_CR_POWERON msr
Attempting to boot the default 'bsd' kernel of OpenBSD 4.1 i386 in a guest
fails early in the kernel init inside p3_get_bus_clock while trying to read
the IA32_EBL_CR_POWERON MSR.  KVM logs an 'unhandled MSR' message and the
guest kernel faults.

This patch is sufficient to allow OpenBSD to boot, after which it seems to
run fine.  I'm not sure if this is the correct solution for dealing with
this particular MSR, but it works for me.

Signed-off-by: Matthew Gregan <kinetik@flim.org>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-07-16 12:05:40 +03:00
Avi Kivity 09072daf37 KVM: Unify kvm_mmu_pre_write() and kvm_mmu_post_write()
Instead of calling two functions and repeating expensive checks, call one
function and provide it with before/after information.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-07-16 12:05:38 +03:00
Avi Kivity e6adf28365 KVM: Avoid saving and restoring some host CPU state on lightweight vmexit
Many msrs and the like will only be used by the host if we schedule() or
return to userspace.  Therefore, we avoid saving them if we handle the
exit within the kernel, and if a reschedule is not requested.

Based on a patch from Eddie Dong <eddie.dong@intel.com> with a couple of
fixes by me.

Signed-off-by: Yaozu(Eddie) Dong <eddie.dong@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-07-16 12:05:38 +03:00
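Here is a hypothetical model of the lightweight/heavyweight distinction. Host state is saved on the first guest entry and left alone across exits handled in the kernel. It is restored only when the loop leaves for userspace or a reschedule. The counters exist only to make the effect observable.

```c
#include <stdbool.h>

/* Hypothetical model: count how often host state is saved and restored. */
struct vcpu_sketch {
	bool host_state_saved;
	int saves;
	int restores;
};

static void save_host_state(struct vcpu_sketch *v)
{
	if (v->host_state_saved)
		return;		/* still saved from an earlier entry */
	v->host_state_saved = true;
	v->saves++;
}

static void load_host_state(struct vcpu_sketch *v)
{
	if (!v->host_state_saved)
		return;
	v->host_state_saved = false;
	v->restores++;
}

/* exits[i] is true if the i-th exit is handled in the kernel with no
 * reschedule pending; the first false ends the loop (heavyweight exit).
 * Returns the number of guest entries made. */
int run_sketch(struct vcpu_sketch *v, const bool *exits, int nexits)
{
	int entries = 0;

	for (int i = 0; i < nexits; i++) {
		save_host_state(v);	/* no-op after the first entry */
		entries++;		/* ...enter the guest, take exit i... */
		if (!exits[i])
			break;
	}
	load_host_state(v);	/* the host needs its state back before leaving */
	return entries;
}
```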
Avi Kivity 7702fd1f6f KVM: Prevent guest fpu state from leaking into the host
The lazy fpu changes did not take into account that some vmexit handlers
can sleep.  Move loading the guest state into the inner loop so that it
can be reloaded if necessary, and move loading the host state into
vmx_vcpu_put() so it can be performed whenever we relinquish the vcpu.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-06-15 12:30:59 +03:00
Alexey Dobriyan e8edc6e03a Detach sched.h from mm.h
The first thing mm.h does is include sched.h, solely for the can_do_mlock()
inline function, which dereferences "current". By dealing with can_do_mlock(),
mm.h can be detached from sched.h, which is good. See below for why.

This patch
a) removes unconditional inclusion of sched.h from mm.h
b) makes can_do_mlock() a normal function in mm/mlock.c
c) exports can_do_mlock() to not break compilation
d) adds sched.h inclusions back to files that were getting it indirectly.
e) adds less bloated headers to some files (asm/signal.h, jiffies.h) that were
   getting them indirectly

Net result is:
a) mm.h users would get less code to open, read, preprocess, parse, ... if
   they don't need sched.h
b) sched.h stops being a dependency for a significant number of files:
   on x86_64 allmodconfig touching sched.h results in recompile of 4083 files,
   after patch it's only 3744 (-8.3%).

Cross-compile tested on

	all arm defconfigs, all mips defconfigs, all powerpc defconfigs,
	alpha alpha-up
	arm
	i386 i386-up i386-defconfig i386-allnoconfig
	ia64 ia64-up
	m68k
	mips
	parisc parisc-up
	powerpc powerpc-up
	s390 s390-up
	sparc sparc-up
	sparc64 sparc64-up
	um-x86_64
	x86_64 x86_64-up x86_64-defconfig x86_64-allnoconfig

as well as my two usual configs.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-21 09:18:19 -07:00
Rafael J. Wysocki 8bb7844286 Add suspend-related notifications for CPU hotplug
Since nonboot CPUs are now disabled after tasks and devices have been
frozen and the CPU hotplug infrastructure is used for this purpose, we need
special CPU hotplug notifications that will help the CPU-hotplug-aware
subsystems distinguish normal CPU hotplug events from CPU hotplug events
related to a system-wide suspend or resume operation in progress.  This
patch introduces such notifications and causes them to be used during
suspend and resume transitions.  It also changes all of the
CPU-hotplug-aware subsystems to take these notifications into consideration
(for now they are handled in the same way as the corresponding "normal"
ones).

[oleg@tv-sign.ru: cleanups]
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:56 -07:00
Avi Kivity 02c8320972 KVM: Don't require explicit indication of completion of mmio or pio
It is illegal to return from a pio or mmio request without completing
it, as mmio or pio is an atomic operation.  Therefore, we can simplify
the userspace interface by avoiding the completion indication.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-05-03 10:52:32 +03:00
Avi Kivity e7df56e4a0 KVM: Remove extraneous guest entry on mmio read
When emulating an mmio read, we actually emulate twice: once to determine
the physical address of the mmio, and, after we've exited to userspace to
get the mmio value, we emulate again to place the value in the result
register and update any flags.

But we don't really need to enter the guest again for that, only to take
an immediate vmexit.  So, if we detect that we're doing an mmio read, we
emulate a single instruction before entering the guest again.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-05-03 10:52:32 +03:00
Anthony Liguori 25c4c2762e KVM: VMX: Properly shadow the CR0 register in the vcpu struct
Set all of the host mask bits for CR0 so that we can maintain a proper
shadow of CR0.  This exposes CR0.TS, paving the way for lazy fpu handling.

Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-05-03 10:52:31 +03:00
Avi Kivity 4c690a1e86 KVM: Allow passing 64-bit values to the emulated read/write API
This simplifies the API somewhat (by eliminating the special-case
cmpxchg8b on i386).

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-05-03 10:52:31 +03:00