Prepare to support TLB flush hypercalls, some of which are REP hypercalls.
Also, return HV_STATUS_INVALID_HYPERCALL_INPUT as it seems more
appropriate.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Hyper-V TLB flush hypercalls definitions will be required for KVM so move
them hyperv-tlfs.h. Structures also need to be renamed as '_pcpu' suffix is
irrelevant for a general-purpose definition.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
The CLDEMOTE instruction hints to hardware that the cache line that
contains the linear address should be moved("demoted") from
the cache(s) closest to the processor core to a level more distant
from the processor core. This may accelerate subsequent accesses
to the line by other cores in the same coherence domain,
especially if the line was written by the core that demotes the line.
This patch exposes the cldemote feature to the guest.
The release document ref below link:
https://software.intel.com/sites/default/files/managed/c5/15/\
architecture-instruction-set-extensions-programming-reference.pdf
This patch has a dependency on https://lkml.org/lkml/2018/4/23/928
Signed-off-by: Jingqi Liu <jingqi.liu@intel.com>
Reviewed-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
When vmcs12 uses VPID, all TLB entries populated by L2 are tagged with
vmx->nested.vpid02. Currently, INVVPID executed by L1 is emulated by L0
by using INVVPID single/global-context to flush all TLB entries
tagged with vmx->nested.vpid02 regardless of INVVPID type executed by
L1.
However, we can easily optimize the case of L1 INVVPID on an
individual-address. Just INVVPID given individual-address tagged with
vmx->nested.vpid02.
Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
[Squashed with a preparatory patch that added the !operand.vpid line.]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Since commit 5c614b3583 ("KVM: nVMX: nested VPID emulation"),
vmcs01 and vmcs02 don't share the same VPID. vmcs01 uses vmx->vpid
while vmcs02 uses vmx->nested.vpid02. This was done such that TLB
flush could be avoided when switching between L1 and L2.
However, the above mentioned commit only changed L2 VMEntry logic to
not flush TLB when switching from L1 to L2. It forgot to also remove
the TLB flush which is done when simulating a VMExit from L2 to L1.
To fix this issue, on VMExit from L2 to L1 we flush TLB only in case
vmcs01 enables VPID and vmcs01->vpid==vmcs02->vpid. This happens when
vmcs01 enables VPID and vmcs12 does not.
Fixes: 5c614b3583 ("KVM: nVMX: nested VPID emulation")
Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
This is a fix from reviewing the code, but it looks like it might be
able to lead to an Oops. It affects 32bit systems.
The KVM_MEMORY_ENCRYPT_REG_REGION ioctl uses a u64 for range->addr and
range->size but the high 32 bits would be truncated away on a 32 bit
system. This is harmless but it's also harmless to prevent it.
Then in sev_pin_memory() the "uaddr + ulen" calculation can wrap around.
The wrap around can happen on 32 bit or 64 bit systems, but I was only
able to figure out a problem for 32 bit systems. We would pick a number
which results in "npages" being zero. The sev_pin_memory() would then
return ZERO_SIZE_PTR without allocating anything.
I made it illegal to call sev_pin_memory() with "ulen" set to zero.
Hopefully, that doesn't cause any problems. I also changed the type of
"first" and "last" to long, just for cosmetic reasons. Otherwise on a
64 bit system you're saving "uaddr >> 12" in an int and it truncates the
high 20 bits away. The math works in the current code so far as I can
see but it's just weird.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
[Brijesh noted that the code is only reachable on X86_64.]
Reviewed-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
handle_mmio_page_fault() was recently moved to be an internal-only
MMU function, i.e. it's static and no longer defined in kvm_host.h.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Enforce the invariant that existing VMCS12 field offsets must not
change. Experience has shown that without strict enforcement, this
invariant will not be maintained.
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
[Changed the code to use BUILD_BUG_ON_MSG instead of better, but GCC 4.6
requiring _Static_assert. - Radim.]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Changing the VMCS12 layout will break save/restore compatibility with
older kvm releases once the KVM_{GET,SET}_NESTED_STATE ioctls are
accepted upstream. Google has already been using these ioctls for some
time, and we implore the community not to disturb the existing layout.
Move the four most recently added fields to preserve the offsets of
the previously defined fields and reserve locations for the vmread and
vmwrite bitmaps, which will be used in the virtualization of VMCS
shadowing (to improve the performance of double-nesting).
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
[Kept the SDM order in vmcs_field_to_offset_table. - Radim]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
The hypercall was added using a struct timespec based implementation,
but we should not use timespec in new code.
This changes it to timespec64. There is no functional change
here since the implementation is only used in 64-bit kernels
that use the same definition for timespec and timespec64.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
When saving a vCPU's nested state, the vmcs02 is discarded. Only the
shadow vmcs12 is saved. The shadow vmcs12 contains all of the
information needed to reconstruct an equivalent vmcs02 on restore, but
we have to be able to deal with two contexts:
1. The nested state was saved immediately after an emulated VM-entry,
before the vmcs02 was ever launched.
2. The nested state was saved some time after the first successful
launch of the vmcs02.
Though it's an implementation detail rather than an architected bit,
vmx->nested_run_pending serves to distinguish between these two
cases. Hence, we save it as part of the vCPU's nested state. (Yes,
this is ugly.)
Even when restoring from a checkpoint, it may be necessary to build
the vmcs02 as if prepare_vmcs02 was called from nested_vmx_run. So,
the 'from_vmentry' argument should be dropped, and
vmx->nested_run_pending should be consulted instead. The nested state
restoration code then has to set vmx->nested_run_pending prior to
calling prepare_vmcs02. It's important that the restoration code set
vmx->nested_run_pending anyway, since the flag impacts things like
interrupt delivery as well.
Fixes: cf8b84f48a ("kvm: nVMX: Prepare for checkpointing L2 state")
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
The Hyper-V APIC code is built when CONFIG_HYPERV is enabled but the actual
code in that file is guarded with CONFIG_X86_64. There is no point in doing
this. Neither is there a point in having the CONFIG_HYPERV guard in there
because the containing directory is not built when CONFIG_HYPERV=n.
Further for the hv_init_apic() function a stub is provided only for
CONFIG_HYPERV=n, which is pointless as the callsite is not compiled at
all. But for X86_32 the stub is missing and the build fails.
Clean that up:
- Compile hv_apic.c only when CONFIG_X86_64=y
- Make the stub for hv_init_apic() available when CONFG_X86_64=n
Fixes: 6b48cb5f83 ("X86/Hyper-V: Enlighten APIC access")
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: K. Y. Srinivasan <kys@microsoft.com>
Cc: Michael Kelley <mikelley@microsoft.com>
Not all configurations magically include asm/apic.h, but the Hyper-V code
requires it. Include it explicitely.
Fixes: 6b48cb5f83 ("X86/Hyper-V: Enlighten APIC access")
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: K. Y. Srinivasan <kys@microsoft.com>
Cc: Michael Kelley <mikelley@microsoft.com>
Just three commits.
The two cxl ones are not fixes per se, but they modify code that was added this
cycle so that it will work with a recent firmware change.
And then a fix for a recent commit that added sleeps in the NVRAM code, which
needs to be more careful and not sleep if eg. we're called in the panic() path.
Thanks to:
Nicholas Piggin, Philippe Bergheaud, Christophe Lombard.
-----BEGIN PGP SIGNATURE-----
iQIwBAABCAAaBQJa/tZPExxtcGVAZWxsZXJtYW4uaWQuYXUACgkQUevqPMjhpYAZ
1xAAtFpiF0YzTigwxSJBrDHjj+JkkPrHZ0OtItIFI6V3FaerbL6ZFJyn2/wfM7N7
v5wluFucSq4I96pMMecLJx7KGMFMBg1GoMbToEM9dqY2mDxK6pvrXX8RJ3dafMql
N+NsaJptdGm8hu4HB//P752nvjH9G2xARIc4twMPdXdBTyINtCXhTPFOtZQN0c4V
IStGkchwkyJPdzQHy1T+T1crKrgL1hqcTX0ubC7iedX8L1H+pdxm58GUo4zQTVKF
5ZOWYPYScawdPQJSi/KEaZqYW9nPiJiE0zd5B5tjGwFQlcZtzIz3PfokdD6vY75g
SEbvbOJxdWFiihwCsZPrQyUe7P4RRboLxl5n0YWeK7SYh1eEQ38YeOTkEIy8KKEV
tQUR3c3XeWBD8WOpMXuxmwla5HsNgcYhDmSIx96jVX/VVOzKFbpdeYxfbOZDkVK6
puX0Bwm8WVI9wbLtqTdMDtdencn9qOuwX7Pzot0p/XhJGOvwmgS3+rPhzcfU1Ftw
0Y2y5MH+wthj6glxLYro0qSwkUQYnT6KHzY8S1pheN2DgzWVkSWOnIqrWkvcvm3R
10bqg/wBrAQYylTO4jkYKKMBYMgSE3x2nIC08I7nfJ37xQFSwTI5K8WbhSYG8MKo
TZxwhgZX/ip4Ylxuk3P0nMx1Tt/DmkvcguuYqgfBKMNMO+U=
=yrZc
-----END PGP SIGNATURE-----
Merge tag 'powerpc-4.17-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
"Just three commits.
The two cxl ones are not fixes per se, but they modify code that was
added this cycle so that it will work with a recent firmware change.
And then a fix for a recent commit that added sleeps in the NVRAM
code, which needs to be more careful and not sleep if eg. we're called
in the panic() path.
Thanks to Nicholas Piggin, Philippe Bergheaud, Christophe Lombard"
* tag 'powerpc-4.17-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/powernv: Fix NVRAM sleep in invalid context when crashing
cxl: Report the tunneled operations status
cxl: Set the PBCQ Tunnel BAR register when enabling capi mode
Fix race condition when accessing System Management Network registers
Fix reading critical temperatures on F15h M60h and M70h
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJa+0BbAAoJEMsfJm/On5mBo3EQAJxtFC7pA7JzY0yZsXvaA+50
ObN9EtG5mhVMZQfcOThcN6ZGzV12rpJltsCp6Poy0g8n7rgLiB5y2IJvinM7ETil
6zbw5onfv2So/WyvXWBylEI0J4WjtGc8n17S1+nlT+Ppy4ID6PQPv1pGfr7YVI0o
0T2sLSfDQD7vgtvpHi7A+4q2hbsI0HjS3LKI8CAy4UboZ8yltxJBsgV7gJ3fbv4Z
tX9DOH05bGsCR/9vwoA3rRVbUKbvPnwTY36DCAyT53QuYRIBwREXi/xkxCkKdSsn
X3o78TPkvE/qTyK1ZjuJ5yxDdLmesibiKOtyPBeaPaTQ+jcayfSr+rQrAvsZ2Ogp
8pjZ5he3LR4/8wdmBhZBBcDXDdBMar8SRMSpPrBRyWONpn5fSLuszUkintKTND4c
dH1zlXmYjRFsQBW2O+/b6k1Hq/p654mwD4hBbxHN7FVBnrWDWzUgd2xSpQLxSqkz
sfyd6wsvrVeUCGHAsgVY9sXYlbrTjI1WWkOX4EAJC2YKvWDYTB/kQXg0I5vICN4m
9tLyoC8tvKothIe8J1U5VUeGgpP5QES+yf7YNF9gc02D8l5xlsWuUAVrBI1XBOdS
0MXFFFxM68Y6ufhIiahSXPM7vocSFi6CuuYbuz6Z09a2L9cahG4C5+Qe9E9h6PjM
N4uOoFJGKckctQYJB0rO
=SujR
-----END PGP SIGNATURE-----
Merge tag 'hwmon-for-linus-v4.17-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging
Pull hwmon fixes from Guenter Roeck:
"Two k10temp fixes:
- fix race condition when accessing System Management Network
registers
- fix reading critical temperatures on F15h M60h and M70h
Also add PCI ID's for the AMD Raven Ridge root bridge"
* tag 'hwmon-for-linus-v4.17-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
hwmon: (k10temp) Use API function to access System Management Network
x86/amd_nb: Add support for Raven Ridge CPUs
hwmon: (k10temp) Fix reading critical temperature register
* x86 fixes: PCID, UMIP, locking
* Improved support for recent Windows version that have a 2048 Hz
APIC timer.
* Rename KVM_HINTS_DEDICATED CPUID bit to KVM_HINTS_REALTIME
* Better behaved selftests.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJa/bkTAAoJEL/70l94x66Dzf8IAJ1GqtXi0CNbq8MvU4QIqw0L
HLIRoe/QgkTeTUa2fwirEuu5I+/wUyPvy5sAIsn/F5eiZM7nciLm+fYzw6F2uPIm
lSCqKpVwmh8dPl1SBaqPnTcB1HPVwcCgc2SF9Ph7yZCUwFUtoeUuPj8v6Qy6y21g
jfobHFZa3MrFgi7kPxOXSrC1qxuNJL9yLB5mwCvCK/K7jj2nrGJkLLDuzgReCqvz
isOdpof3hz8whXDQG5cTtybBgE9veym4YqJY8R5ANXBKqbFlhaNF1T3xXrdPMISZ
7bsGgkhYEOqeQsPrFwzAIiFxe2DogFwkn1BcvJ1B+duXrayt5CBnDPRB6Yxg00M=
=H0d0
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
- ARM/ARM64 locking fixes
- x86 fixes: PCID, UMIP, locking
- improved support for recent Windows version that have a 2048 Hz APIC
timer
- rename KVM_HINTS_DEDICATED CPUID bit to KVM_HINTS_REALTIME
- better behaved selftests
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
kvm: rename KVM_HINTS_DEDICATED to KVM_HINTS_REALTIME
KVM: arm/arm64: VGIC/ITS save/restore: protect kvm_read_guest() calls
KVM: arm/arm64: VGIC/ITS: protect kvm_read_guest() calls with SRCU lock
KVM: arm/arm64: VGIC/ITS: Promote irq_lock() in update_affinity
KVM: arm/arm64: Properly protect VGIC locks from IRQs
KVM: X86: Lower the default timer frequency limit to 200us
KVM: vmx: update sec exec controls for UMIP iff emulating UMIP
kvm: x86: Suppress CR3_PCID_INVD bit only when PCIDs are enabled
KVM: selftests: exit with 0 status code when tests cannot be run
KVM: hyperv: idr_find needs RCU protection
x86: Delay skip of emulated hypercall instruction
KVM: Extend MAX_IRQ_ROUTES to 4096 for all archs
KVM_HINTS_DEDICATED seems to be somewhat confusing:
Guest doesn't really care whether it's the only task running on a host
CPU as long as it's not preempted.
And there are more reasons for Guest to be preempted than host CPU
sharing, for example, with memory overcommit it can get preempted on a
memory access, post copy migration can cause preemption, etc.
Let's call it KVM_HINTS_REALTIME which seems to better
match what guests expect.
Also, the flag most be set on all vCPUs - current guests assume this.
Note so in the documentation.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Pull s390 fixes from Martin Schwidefsky:
- a fix for the vfio ccw translation code
- update an incorrect email address in the MAINTAINERS file
- fix a division by zero oops in the cpum_sf code found by trinity
- two fixes for the error handling of the qdio code
- several spectre related patches to convert all left-over indirect
branches in the kernel to expoline branches
- update defconfigs to avoid warnings due to the netfilter Kconfig
changes
- avoid several compiler warnings in the kexec_file code for s390
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/qdio: don't release memory in qdio_setup_irq()
s390/qdio: fix access to uninitialized qdio_q fields
s390/cpum_sf: ensure sample frequency of perf event attributes is non-zero
s390: use expoline thunks in the BPF JIT
s390: extend expoline to BC instructions
s390: remove indirect branch from do_softirq_own_stack
s390: move spectre sysfs attribute code
s390/kernel: use expoline for indirect branches
s390/ftrace: use expoline for indirect branches
s390/lib: use expoline for indirect branches
s390/crc32-vx: use expoline for indirect branches
s390: move expoline assembler macros to a header
vfio: ccw: fix cleanup if cp_prefetch fails
s390/kexec_file: add declaration of purgatory related globals
s390: update defconfigs
MAINTAINERS: update s390 zcrypt maintainers email address
Similarly to opal_event_shutdown, opal_nvram_write can be called in
the crash path with irqs disabled. Special case the delay to avoid
sleeping in invalid context.
Fixes: 3b8070335f ("powerpc/powernv: Fix OPAL NVRAM driver OPAL_BUSY loops")
Cc: stable@vger.kernel.org # v3.2
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
a field event. This is increasingly important for the histogram trigger
work that is being extended.
While auditing trace events, I found that a couple of the xen events
were used as just marking that a function was called, by creating
a static array of size zero. This can play havoc with the tracing
features if these events are used, because a zero size of a static
array is denoted as a special nul terminated dynamic array (this is
what the trace_marker code uses). But since the xen events have no
size, they are not nul terminated, and unexpected results may occur.
As trace events were never intended on being a marker to denote
that a function was hit or not, especially since function tracing
and kprobes can trivially do the same, the best course of action is
to simply remove these events.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCWvtgDhQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qtY0AQC2HSSRkP5GVL1/c1Xoxl202O1tQ9Dp
G08oci4bfcRCIAEA8ATc+1LZPGQUvd0ucrD4FiJnfpYUHrCTvvRsz4d9LQQ=
=HUQR
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.17-rc4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fix from Steven Rostedt:
"Some of the ftrace internal events use a zero for a data size of a
field event. This is increasingly important for the histogram trigger
work that is being extended.
While auditing trace events, I found that a couple of the xen events
were used as just marking that a function was called, by creating a
static array of size zero. This can play havoc with the tracing
features if these events are used, because a zero size of a static
array is denoted as a special nul terminated dynamic array (this is
what the trace_marker code uses). But since the xen events have no
size, they are not nul terminated, and unexpected results may occur.
As trace events were never intended on being a marker to denote that a
function was hit or not, especially since function tracing and kprobes
can trivially do the same, the best course of action is to simply
remove these events"
* tag 'trace-v4.17-rc4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing/x86/xen: Remove zero data size trace events trace_xen_mmu_flush_tlb{_all}
kvm_read_guest() will eventually look up in kvm_memslots(), which requires
either to hold the kvm->slots_lock or to be inside a kvm->srcu critical
section.
In contrast to x86 and s390 we don't take the SRCU lock on every guest
exit, so we have to do it individually for each kvm_read_guest() call.
Provide a wrapper which does that and use that everywhere.
Note that ending the SRCU critical section before returning from the
kvm_read_guest() wrapper is safe, because the data has been *copied*, so
we don't need to rely on valid references to the memslot anymore.
Cc: Stable <stable@vger.kernel.org> # 4.8+
Reported-by: Jan Glauber <jan.glauber@caviumnetworks.com>
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Acked-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Anthoine reported:
The period used by Windows change over time but it can be 1
milliseconds or less. I saw the limit_periodic_timer_frequency
print so 500 microseconds is sometimes reached.
As suggested by Paolo, lower the default timer frequency limit to a
smaller interval of 200 us (5000 Hz) to leave some headroom. This
is required due to Windows 10 changing the scheduler tick limit
from 1024 Hz to 2048 Hz.
Reported-by: Anthoine Bourgeois <anthoine.bourgeois@blade-group.com>
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Anthoine Bourgeois <anthoine.bourgeois@blade-group.com>
Cc: Darren Kenny <darren.kenny@oracle.com>
Cc: Jan Kiszka <jan.kiszka@web.de>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Doing an audit of trace events, I discovered two trace events in the xen
subsystem that use a hack to create zero data size trace events. This is not
what trace events are for. Trace events add memory footprint overhead, and
if all you need to do is see if a function is hit or not, simply make that
function noinline and use function tracer filtering.
Worse yet, the hack used was:
__array(char, x, 0)
Which creates a static string of zero in length. There's assumptions about
such constructs in ftrace that this is a dynamic string that is nul
terminated. This is not the case with these tracepoints and can cause
problems in various parts of ftrace.
Nuke the trace events!
Link: http://lkml.kernel.org/r/20180509144605.5a220327@gandalf.local.home
Cc: stable@vger.kernel.org
Fixes: 95a7d76897 ("xen/mmu: Use Xen specific TLB flush instead of the generic one.")
Reviewed-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
These private pages have special purposes in the virtualization of L1,
but not in the virtualization of L2. In particular, L1's APIC access
page should never be entered into L2's page tables, because this
causes a great deal of confusion when the APIC virtualization hardware
is being used to accelerate L2's accesses to its own APIC.
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
L1 and L2 need to have disjoint mappings, so that L1's APIC access
page (under VMX) can be omitted from L2's mappings.
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
It is only possible to share the APIC access page between L1 and L2 if
they also share the virtual-APIC page. If L2 has its own virtual-APIC
page, then MMIO accesses to L1's TPR from L2 will access L2's TPR
instead. Moreover, L1's local APIC has to be in xAPIC mode, which is
another condition that hasn't been checked.
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Previously, we toggled between SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE
and SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES, depending on whether or
not the EXTD bit was set in MSR_IA32_APICBASE. However, if the local
APIC is disabled, we should not set either of these APIC
virtualization control bits.
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The local APIC can be in one of three modes: disabled, xAPIC or
x2APIC. (A fourth mode, "invalid," is included for completeness.)
Using the new enumeration can make some of the APIC mode logic easier
to read. In kvm_set_apic_base, for instance, it is clear that one
cannot transition directly from x2APIC mode to xAPIC mode or directly
from APIC disabled to x2APIC mode.
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
[Check invalid bits even if msr_info->host_initiated. Reported by
Wanpeng Li. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Enlightened MSR-Bitmap is a natural extension of Enlightened VMCS:
Hyper-V Top Level Functional Specification states:
"The L1 hypervisor may collaborate with the L0 hypervisor to make MSR
accesses more efficient. It can enable enlightened MSR bitmaps by setting
the corresponding field in the enlightened VMCS to 1. When enabled, the L0
hypervisor does not monitor the MSR bitmaps for changes. Instead, the L1
hypervisor must invalidate the corresponding clean field after making
changes to one of the MSR bitmaps."
I reached out to Hyper-V team for additional details and I got the
following information:
"Current Hyper-V implementation works as following:
If the enlightened MSR bitmap is not enabled:
- All MSR accesses of L2 guests cause physical VM-Exits
If the enlightened MSR bitmap is enabled:
- Physical VM-Exits for L2 accesses to certain MSRs (currently FS_BASE,
GS_BASE and KERNEL_GS_BASE) are avoided, thus making these MSR accesses
faster."
I tested my series with a tight rdmsrl loop in L2, for KERNEL_GS_BASE the
results are:
Without Enlightened MSR-Bitmap: 1300 cycles/read
With Enlightened MSR-Bitmap: 120 cycles/read
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Tested-by: Lan Tianyu <Tianyu.Lan@microsoft.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Extract the logic to free a root page in a separate function to avoid code
duplication in mmu_free_roots(). Also, change it to an exported function
i.e. kvm_mmu_free_roots().
Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
MSB of CR3 is a reserved bit if the PCIDE bit is not set in CR4.
It should be checked when PCIDE bit is not set, however commit
'd1cd3ce900441 ("KVM: MMU: check guest CR3 reserved bits based on
its physical address width")' removes the bit 63 checking
unconditionally. This patch fixes it by checking bit 63 of CR3
when PCIDE bit is not set in CR4.
Fixes: d1cd3ce900 (KVM: MMU: check guest CR3 reserved bits based on its physical address width)
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Liran Alon <liran.alon@oracle.com>
Cc: stable@vger.kernel.org
Reviewed-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Pull x86/pti updates from Thomas Gleixner:
"A mixed bag of fixes and updates for the ghosts which are hunting us.
The scheduler fixes have been pulled into that branch to avoid
conflicts.
- A set of fixes to address a khread_parkme() race which caused lost
wakeups and loss of state.
- A deadlock fix for stop_machine() solved by moving the wakeups
outside of the stopper_lock held region.
- A set of Spectre V1 array access restrictions. The possible
problematic spots were discuvered by Dan Carpenters new checks in
smatch.
- Removal of an unused file which was forgotten when the rest of that
functionality was removed"
* 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/vdso: Remove unused file
perf/x86/cstate: Fix possible Spectre-v1 indexing for pkg_msr
perf/x86/msr: Fix possible Spectre-v1 indexing in the MSR driver
perf/x86: Fix possible Spectre-v1 indexing for x86_pmu::event_map()
perf/x86: Fix possible Spectre-v1 indexing for hw_perf_event cache_*
perf/core: Fix possible Spectre-v1 indexing for ->aux_pages[]
sched/autogroup: Fix possible Spectre-v1 indexing for sched_prio_to_weight[]
sched/core: Fix possible Spectre-v1 indexing for sched_prio_to_weight[]
sched/core: Introduce set_special_state()
kthread, sched/wait: Fix kthread_parkme() completion issue
kthread, sched/wait: Fix kthread_parkme() wait-loop
sched/fair: Fix the update of blocked load when newly idle
stop_machine, sched: Fix migrate_swap() vs. active_balance() deadlock
Add Raven Ridge root bridge and data fabric PCI IDs.
This is required for amd_pci_dev_to_node_id() and amd_smn_read().
Cc: stable@vger.kernel.org # v4.16+
Tested-by: Gabriel Craciunescu <nix.or.die@gmail.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Merge misc fixes from Andrew Morton:
"13 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
rbtree: include rcu.h
scripts/faddr2line: fix error when addr2line output contains discriminator
ocfs2: take inode cluster lock before moving reflinked inode from orphan dir
mm, oom: fix concurrent munlock and oom reaper unmap, v3
mm: migrate: fix double call of radix_tree_replace_slot()
proc/kcore: don't bounds check against address 0
mm: don't show nr_indirectly_reclaimable in /proc/vmstat
mm: sections are not offlined during memory hotremove
z3fold: fix reclaim lock-ups
init: fix false positives in W+X checking
lib/find_bit_benchmark.c: avoid soft lockup in test_find_first_bit()
KASAN: prohibit KASAN+STRUCTLEAK combination
MAINTAINERS: update Shuah's email address
Currently STRUCTLEAK inserts initialization out of live scope of variables
from KASAN point of view. This leads to KASAN false positive reports.
Prohibit this combination for now.
Link: http://lkml.kernel.org/r/20180419172451.104700-1-dvyukov@google.com
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Dennis Zhou <dennisszhou@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull networking fixes from David Miller:
1) Verify lengths of keys provided by the user is AF_KEY, from Kevin
Easton.
2) Add device ID for BCM89610 PHY. Thanks to Bhadram Varka.
3) Add Spectre guards to some ATM code, courtesy of Gustavo A. R.
Silva.
4) Fix infinite loop in NSH protocol code. To Eric Dumazet we are most
grateful for this fix.
5) Line up /proc/net/netlink headers properly. This fix from YU Bo, we
do appreciate.
6) Use after free in TLS code. Once again we are blessed by the
honorable Eric Dumazet with this fix.
7) Fix regression in TLS code causing stalls on partial TLS records.
This fix is bestowed upon us by Andrew Tomt.
8) Deal with too small MTUs properly in LLC code, another great gift
from Eric Dumazet.
9) Handle cached route flushing properly wrt. MTU locking in ipv4, to
Hangbin Liu we give thanks for this.
10) Fix regression in SO_BINDTODEVIC handling wrt. UDP socket demux.
Paolo Abeni, he gave us this.
11) Range check coalescing parameters in mlx4 driver, thank you Moshe
Shemesh.
12) Some ipv6 ICMP error handling fixes in rxrpc, from our good brother
David Howells.
13) Fix kexec on mlx5 by freeing IRQs in shutdown path. Daniel Juergens,
you're the best!
14) Don't send bonding RLB updates to invalid MAC addresses. Debabrata
Benerjee saved us!
15) Uh oh, we were leaking in udp_sendmsg and ping_v4_sendmsg. The ship
is now water tight, thanks to Andrey Ignatov.
16) IPSEC memory leak in ixgbe from Colin Ian King, man we've got holes
everywhere!
17) Fix error path in tcf_proto_create, Jiri Pirko what would we do
without you!
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (92 commits)
net sched actions: fix refcnt leak in skbmod
net: sched: fix error path in tcf_proto_create() when modules are not configured
net sched actions: fix invalid pointer dereferencing if skbedit flags missing
ixgbe: fix memory leak on ipsec allocation
ixgbevf: fix ixgbevf_xmit_frame()'s return type
ixgbe: return error on unsupported SFP module when resetting
ice: Set rq_last_status when cleaning rq
ipv4: fix memory leaks in udp_sendmsg, ping_v4_sendmsg
mlxsw: core: Fix an error handling path in 'mlxsw_core_bus_device_register()'
bonding: send learning packets for vlans on slave
bonding: do not allow rlb updates to invalid mac
net/mlx5e: Err if asked to offload TC match on frag being first
net/mlx5: E-Switch, Include VF RDMA stats in vport statistics
net/mlx5: Free IRQs in shutdown path
rxrpc: Trace UDP transmission failure
rxrpc: Add a tracepoint to log ICMP/ICMP6 and error messages
rxrpc: Fix the min security level for kernel calls
rxrpc: Fix error reception on AF_INET6 sockets
rxrpc: Fix missing start of call timeout
qed: fix spelling mistake: "taskelt" -> "tasklet"
...
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQEcBAABAgAGBQJa9eaYAAoJELcQ+SIFb8HamGAH/imdKSv7GJe7OOZyJ1AjHPIc
BLklw6RA6ejGpKdFNVzK77g+5DmpH+00sInWJNDginQCop+quH5aqTvLdVaoFh/w
/XU+aZClSNvqI3wesPPKTKuWj/bofDks7pGfiXsBXrSIT6SGQVMC7U/iL9YhkyN2
93hgdYs1qEoy2d3MgFfyRmAJKhPVnr4jbzN5RLXpbZox9NMYBa4qPL2GXvhx0A7T
3DtXykSmfBAHBHbAnLCIvwgJ9aQg9TUVMKJo3jXynTvnD+5qgLYiE8y3j+Yg6qRm
WG1G5f5sU9fEPgkYApg30wdrO3HbSi9POsLDW0Inzlt1/WUMrC3iJ8Gza7K58kM=
=VZ/j
-----END PGP SIGNATURE-----
Merge tag 'sh-for-4.17-fixes' of git://git.libc.org/linux-sh
Pull arch/sh fixes from Rich Felker:
"Fixes for critical regressions and a build failure.
The regressions were introduced in 4.15 and 4.17-rc1 and prevented
booting on affected systems"
* tag 'sh-for-4.17-fixes' of git://git.libc.org/linux-sh:
sh: switch to NO_BOOTMEM
sh: mm: Fix unprotected access to struct device
sh: fix build failure for J2 cpu with SMP disabled
- Mitigate Spectre-v2 for NVIDIA Denver CPUs
- Free memblocks corresponding to freed initrd area
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABCgAGBQJa9bgsAAoJELescNyEwWM0yYwIAKvMuUU8d6fy/5EdjTm2uG9p
DoSw+ezHeiUrphQwNvOc/fj0vGutM+sftcmghRV1KmP7lvAqk/zvK57PAZjwQ5ua
i1X2AJemKr7Gs77FV5Y6Jgkkd2kaIh3n86d9/hM7n9TfAt31vPAYCapb8h3LbRBJ
bjZXoTHeujZAIMLGyxzLGVlk9MdW2UjQ3LvWGby/mFEPuktJKkApxBSNQOJOuRKw
Ny/eCwFhbyLzDA4zXw7hASld/J+WWBhk0m8ks2qy7BD/F2auZX/p5flU/NoE1VXi
JevclGif18iQtZQRV/hJ1woLROfbp6cRKWaVB4cEFKSnB2mG6FLSfrYyvbCj6LE=
=lZDP
-----END PGP SIGNATURE-----
Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fixes from Will Deacon:
"There's a small memblock accounting problem when freeing the initrd
and a Spectre-v2 mitigation for NVIDIA Denver CPUs which just requires
a match on the CPU ID register.
Summary:
- Mitigate Spectre-v2 for NVIDIA Denver CPUs
- Free memblocks corresponding to freed initrd area"
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: capabilities: Add NVIDIA Denver CPU to bp_harden list
arm64: Add MIDR encoding for NVIDIA CPUs
arm64: To remove initrd reserved area entry from memblock
One fix for an actual regression, the change to the SYSCALL_DEFINE wrapper broke
FTRACE_SYSCALLS for us due to a name mismatch. There's also another commit to
the same code to make sure we match all our syscalls with various prefixes.
And then just one minor build fix, and the removal of an unused variable that
was removed and then snuck back in due to some rebasing.
Thanks to:
Naveen N. Rao.
-----BEGIN PGP SIGNATURE-----
iQIwBAABCAAaBQJa9Zw4ExxtcGVAZWxsZXJtYW4uaWQuYXUACgkQUevqPMjhpYAr
cg/9Go3swOrZhh0Nc8UUwxXJlNN20fgOJVmqVuTh6cE8I10pw2Uy5z/1mD642X0n
Y3qLRIJJPYGoAXyZglxPpD1rUWQJ7C7WCvFIgXGj9itDZmySD4lgxwDKFGBUphRG
1PejPtY/IHpcSyCECwD6tPj5ZSSPemlkXqLZAXS8ECSTQwi605tcV3WlZtWtPk1w
7PZraHSxc9gGtfxXR78aLd6C7ZAzodm7zXyJeZq3g0QuXUbEgA3DDvD7O9vRDQau
f2gj4b+19gHCpq9uPoTnnZH7V1eyZ8XpLUoxBKkdD1rwq8oW8pSZnOg40j9qwkGc
z0gnys0c34jFGc0GeQUgfA4SJwclO6KkEcdQiy3uU0+QIUA0dW4QvKHvdxULps1y
h80d2PSi1hQVBGpeV+agrc8zSLzSp7M7mo2CCkU+S2OBBnxjh+6dhSqAT6Ms7cZx
W9Ly6NbMSLQCsC1K4EP8d56fug99z4O+44Wb3anTiGg0LhVELm9+Mlo5Ku8s/klB
qSa7rFJ6mx5donBTf70pMRhSJY24bie5E8j5+o78gEnC66j+p7sBEhBOe44TPNKi
y3Zv76PT7/7hvo1Eyus849dW0YKiJUiZKEInVgErqkSI75z0mO0WqAJ88UNQLaXB
TCRIBf5aX8mpsWcVcJb8OF2+JNGDjX9w2snLEta/TfUso14=
=7GGr
-----END PGP SIGNATURE-----
Merge tag 'powerpc-4.17-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
"One fix for an actual regression, the change to the SYSCALL_DEFINE
wrapper broke FTRACE_SYSCALLS for us due to a name mismatch. There's
also another commit to the same code to make sure we match all our
syscalls with various prefixes.
And then just one minor build fix, and the removal of an unused
variable that was removed and then snuck back in due to some rebasing.
Thanks to: Naveen N. Rao"
* tag 'powerpc-4.17-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/pseries: Fix CONFIG_NUMA=n build
powerpc/trace/syscalls: Update syscall name matching logic to account for ppc_ prefix
powerpc/trace/syscalls: Update syscall name matching logic
powerpc/64: Remove unused paca->soft_enabled
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQRTLbB6QfY48x44uB6AXGG7T9hjvgUCWvV2WQAKCRCAXGG7T9hj
vvV1AQD/mqwRavel82e8JiMosoqrpZWwZ4uK2m7DhhIGhdyuegEAjmqzkjYSInrA
0A7FeFH2Wl1nYiKBl8ppvAd2GOkbbws=
=kcKL
-----END PGP SIGNATURE-----
Merge tag 'for-linus-4.17-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen fix from Juergen Gross:
"One fix for the kernel running as a fully virtualized guest using PV
drivers on old Xen hypervisor versions"
* tag 'for-linus-4.17-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
x86/xen: Reset VCPU0 info pointer after shared_info remap
Commit 0fa1c57934 ("of/fdt: use memblock_virt_alloc for early alloc")
inadvertently switched the DT unflattening allocations from memblock to
bootmem which doesn't work because the unflattening happens before
bootmem is initialized. Swapping the order of bootmem init and
unflattening could also fix this, but removing bootmem is desired. So
enable NO_BOOTMEM on SH like other architectures have done.
Fixes: 0fa1c57934 ("of/fdt: use memblock_virt_alloc for early alloc")
Reported-by: Rich Felker <dalias@libc.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Rob Herring <robh@kernel.org>
Signed-off-by: Rich Felker <dalias@libc.org>
Update SECONDARY_EXEC_DESC for UMIP emulation if and only UMIP
is actually being emulated. Skipping the VMCS update eliminates
unnecessary VMREAD/VMWRITE when UMIP is supported in hardware,
and on platforms that don't have SECONDARY_VM_EXEC_CONTROL. The
latter case resolves a bug where KVM would fill the kernel log
with warnings due to failed VMWRITEs on older platforms.
Fixes: 0367f205a3 ("KVM: vmx: add support for emulating UMIP")
Cc: stable@vger.kernel.org #4.16
Reported-by: Paolo Zeppegno <pzeppegno@gmail.com>
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Suggested-by: Radim KrÄmář <rkrcmar@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>