Commit Graph

4437 Commits

Author SHA1 Message Date
Mike Galbraith 53d7783841 x86: UV: raw_spinlock conversion
Shrug.  Lots of hobbyists have a beast in their basement, right?

Cc: stable-rt@vger.kernel.org
Signed-off-by: Mike Galbraith <mgalbraith@suse.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2020-10-14 00:59:24 +03:00
Thomas Gleixner 5ccf423693 x86-preempt-lazy.patch
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2020-10-14 00:59:22 +03:00
Thomas Gleixner b33545443e sched: Add support for lazy preemption
It has become an obsession to mitigate the determinism vs. throughput
loss of RT. Looking at the mainline semantics of preemption points
gives a hint why RT sucks throughput wise for ordinary SCHED_OTHER
tasks. One major issue is the wakeup of tasks which right away
preempt the waking task while the waking task holds a lock on which
the woken task will block right after having preempted the waker. In
mainline this is prevented by the implicit preemption disable of
spin/rw_lock held regions. On RT this is not possible due to the fully
preemptible nature of sleeping spinlocks.

Though for a SCHED_OTHER task preempting another SCHED_OTHER task this
is really not a correctness issue. RT folks are concerned about
preemption of SCHED_FIFO/RR tasks, not about the purely fairness
driven SCHED_OTHER preemption latencies.

So I introduced a lazy preemption mechanism which only applies to
SCHED_OTHER tasks preempting another SCHED_OTHER task. Aside from the
existing preempt_count, each task now sports a preempt_lazy_count
which is manipulated on lock acquisition and release. This is slightly
incorrect as, for laziness reasons, I coupled this to
migrate_disable/enable so some other mechanisms get the same treatment
(e.g. get_cpu_light).

Now on the scheduler side, instead of setting NEED_RESCHED this sets
NEED_RESCHED_LAZY in case of a SCHED_OTHER/SCHED_OTHER preemption and
therefore allows the waking task to exit the lock-held region before
the woken task preempts. That also works better for cross-CPU wakeups
as the other side can stay in the adaptive spinning loop.

For RT class preemption there is no change. This simply sets
NEED_RESCHED and forgoes the lazy preemption counter.

Initial tests do not expose any observable latency increase, but
history shows that I've been proven wrong before :)

The lazy preemption mode is on by default, but with
CONFIG_SCHED_DEBUG enabled it can be disabled via:

 # echo NO_PREEMPT_LAZY >/sys/kernel/debug/sched_features

and reenabled via

 # echo PREEMPT_LAZY >/sys/kernel/debug/sched_features

The test results so far are very machine and workload dependent, but
there is a clear trend that it enhances non-RT workload
performance.
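
[ Note: schematically, the wakeup-side decision looks like the sketch
  below. TIF_NEED_RESCHED_LAZY comes from this patch; the function name
  and exact conditions here are illustrative, not the literal diff:

    /* Sketch: a woken SCHED_OTHER task wants to preempt rq->curr. */
    static void check_preempt_sketch(struct rq *rq, struct task_struct *p)
    {
            if (rt_task(p) || rt_task(rq->curr)) {
                    /* RT class: set NEED_RESCHED, preempt immediately. */
                    resched_task(rq->curr);
                    return;
            }
            /*
             * SCHED_OTHER preempting SCHED_OTHER: only flag the lazy
             * bit so the waker can leave its lock-held region first.
             */
            set_tsk_thread_flag(rq->curr, TIF_NEED_RESCHED_LAZY);
    }
]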

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2020-10-14 00:59:22 +03:00
Thomas Gleixner 633db31439 x86: stackprotector: Avoid random pool on rt
CPU bringup calls into the random pool to initialize the stack
canary. During boot that works nicely even on RT as the might-sleep
checks are disabled. During CPU hotplug the might-sleep checks
trigger. Making the locks in random raw is a major PITA, so avoiding
the call on RT is the only sensible solution. This is basically the
same randomness which we get during boot where the random pool has no
entropy and we rely on the TSC randomness.
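
[ Note: the shape of the fix as a hedged sketch; the exact config guard
  and TSC accessor differ across kernel versions:

    static __always_inline void boot_init_stack_canary(void)
    {
            u64 canary = 0;
            u64 tsc;

    #ifndef CONFIG_PREEMPT_RT_FULL
            /* The random pool may take sleeping locks -- skip it on RT. */
            get_random_bytes(&canary, sizeof(canary));
    #endif
            /* Mix in the TSC, as the early boot path effectively does. */
            tsc = native_read_tsc();
            canary += tsc + (tsc << 32UL);

            this_cpu_write(stack_canary.canary, canary);
    }
]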

Reported-by: Carsten Emde <carsten.emde@osadl.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2020-10-14 00:59:18 +03:00
Oleg Nesterov 5627fb89dd signal/x86: Delay calling signals in atomic
On x86_64 we must disable preemption before we enable interrupts
for stack faults, int3 and debugging, because the current task is using
a per CPU debug stack defined by the IST. If we schedule out, another task
can come in and use the same stack and cause the stack to be corrupted
and crash the kernel on return.

When CONFIG_PREEMPT_RT_FULL is enabled, spin_locks become mutexes, and
one of these is the spin lock used in signal handling.

Some of the debug code (int3) causes do_trap() to send a signal.
This function takes a spin lock that has been converted to a mutex
and thus has the possibility to sleep. If this happens, the above
issues with the corrupted stack are possible.

Instead of calling the signal right away, for PREEMPT_RT and x86_64,
the signal information is stored on the task's task_struct and
TIF_NOTIFY_RESUME is set. Then on exit of the trap, the signal resume
code will send the signal when preemption is enabled.
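
[ Note: an illustrative sketch of the delayed-send path described above;
  the forced_info field name follows later -rt trees, the rest is
  schematic, not the literal diff:

    #ifdef ARCH_RT_DELAYS_SIGNAL_SEND
            if (in_atomic()) {
                    /* Stash the siginfo on the task and defer delivery. */
                    current->forced_info = *info;
                    set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
                    return 0;
            }
    #endif
            /* Otherwise deliver normally (may take a sleeping lock on RT). */
            return force_sig_info(sig, info, current);
]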

[ rostedt: Switched from #ifdef CONFIG_PREEMPT_RT_FULL to
  ARCH_RT_DELAYS_SIGNAL_SEND and added comments to the code. ]

Cc: stable-rt@vger.kernel.org
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2020-10-14 00:59:08 +03:00
Andy Lutomirski 510458cb7d x86/asm/irq: Stop relying on magic JMP behavior for early_idt_handlers
commit 425be5679f upstream.

The early_idt_handlers asm code generates an array of entry
points spaced nine bytes apart.  It's not really clear from that
code or from the places that reference it what's going on, and
the code only works in the first place because GAS never
generates two-byte JMP instructions when jumping to global
labels.

Clean up the code to generate the correct array stride (member size)
explicitly. This should be considerably more robust against
screw-ups, as GAS will warn if a .fill directive has a negative
count.  Using '. =' to advance would have been even more robust
(it would generate an actual error if it tried to move
backwards), but it would pad with nulls, confusing anyone who
tries to disassemble the code.  The new scheme should be much
clearer to future readers.

While we're at it, improve the comments and rename the array and
common code.

Binutils may start relaxing jumps to non-weak labels.  If so,
this change will fix our build, and we may need to backport this
change.

Before, on x86_64:

  0000000000000000 <early_idt_handlers>:
     0:   6a 00                   pushq  $0x0
     2:   6a 00                   pushq  $0x0
     4:   e9 00 00 00 00          jmpq   9 <early_idt_handlers+0x9>
                          5: R_X86_64_PC32        early_idt_handler-0x4
  ...
    48:   66 90                   xchg   %ax,%ax
    4a:   6a 08                   pushq  $0x8
    4c:   e9 00 00 00 00          jmpq   51 <early_idt_handlers+0x51>
                          4d: R_X86_64_PC32       early_idt_handler-0x4
  ...
   117:   6a 00                   pushq  $0x0
   119:   6a 1f                   pushq  $0x1f
   11b:   e9 00 00 00 00          jmpq   120 <early_idt_handler>
                          11c: R_X86_64_PC32      early_idt_handler-0x4

After:

  0000000000000000 <early_idt_handler_array>:
     0:   6a 00                   pushq  $0x0
     2:   6a 00                   pushq  $0x0
     4:   e9 14 01 00 00          jmpq   11d <early_idt_handler_common>
  ...
    48:   6a 08                   pushq  $0x8
    4a:   e9 d1 00 00 00          jmpq   120 <early_idt_handler_common>
    4f:   cc                      int3
    50:   cc                      int3
  ...
   117:   6a 00                   pushq  $0x0
   119:   6a 1f                   pushq  $0x1f
   11b:   eb 03                   jmp    120 <early_idt_handler_common>
   11d:   cc                      int3
   11e:   cc                      int3
   11f:   cc                      int3
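
[ Note: the generating assembly looks roughly like this (quoted from
  memory of the x86_64 version; EXCEPTION_ERRCODE_MASK marks vectors
  that push their own error code):

    ENTRY(early_idt_handler_array)
            i = 0
            .rept NUM_EXCEPTION_VECTORS
            .ifeq (EXCEPTION_ERRCODE_MASK >> i) & 1
            pushq $0        # Dummy error code, keeps the frame uniform
            .endif
            pushq $i        # Vector number
            jmp early_idt_handler_common
            i = i + 1
            .fill early_idt_handler_array + i*EARLY_IDT_HANDLER_SIZE - ., 1, 0xcc
            .endr
    ENDPROC(early_idt_handler_array)

  The .fill pads each member to EARLY_IDT_HANDLER_SIZE bytes with int3
  (0xcc) and makes GAS warn if a member ever overflows its slot. ]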

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Acked-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: Binutils <binutils@sourceware.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H.J. Lu <hjl.tools@gmail.com>
Cc: Jan Beulich <JBeulich@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/ac027962af343b0c599cbfcf50b945ad2ef3d7a8.1432336324.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-06-22 17:01:21 -07:00
Len Brown 9308cced73 sched/idle/x86: Restore mwait_idle() to fix boot hangs, to improve power savings and to improve performance
commit b253149b84 upstream.

In Linux-3.9 we removed the mwait_idle() loop:

  69fb3676df ("x86 idle: remove mwait_idle() and "idle=mwait" cmdline param")

The reasoning was that modern machines should be sufficiently
happy during the boot process using the default_idle() HALT
loop, until cpuidle loads and either acpi_idle or intel_idle
invoke the newer MWAIT-with-hints idle loop.

But two machines reported problems:

 1. Certain Core2-era machines support MWAIT-C1 and HALT only.
    MWAIT-C1 is preferred for optimal power and performance.
    But if they support just C1, cpuidle never loads and
    so they use the boot-time default idle loop forever.

 2. Some laptops will boot-hang if HALT is used,
    but will boot successfully if MWAIT is used.
    This appears to be a hidden assumption in BIOS SMI,
    that is presumably valid on the proprietary OS
    where the BIOS was validated.

       https://bugzilla.kernel.org/show_bug.cgi?id=60770

So here we effectively revert the patch above, restoring
the mwait_idle() loop.  However, we don't bother restoring
the idle=mwait cmdline parameter, since it appears to add
no value.

Maintainer notes:

  For 3.9, simply revert 69fb3676df.
  For 3.10, patch -F3 applies; fuzz is needed due to __cpuinit use in
  the context lines.
  For 3.11, 3.12, 3.13, this patch applies cleanly.

Tested-by: Mike Galbraith <bitbucket@online.de>
Signed-off-by: Len Brown <len.brown@intel.com>
Acked-by: Mike Galbraith <bitbucket@online.de>
Cc: <stable@vger.kernel.org> # 3.9+
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ian Malone <ibmalone@gmail.com>
Cc: Josh Boyer <jwboyer@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/345254a551eb5a6a866e048d7ab570fd2193aca4.1389763084.git.len.brown@intel.com
[ Ported to recent kernels. ]
[ Mike: 3.10 backport ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-05-06 21:59:08 +02:00
Oleg Nesterov c16957b5ab x86/fpu: Drop_fpu() should not assume that tsk equals current
commit f4c3686386 upstream.

drop_fpu() does clear_used_math() and usually this is correct
because tsk == current.

However switch_fpu_finish()->restore_fpu_checking() is called before
__switch_to() updates the "current_task" variable. If it fails,
we will wrongly clear the PF_USED_MATH flag of the previous task.

So use clear_stopped_child_used_math() instead.
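
[ Note: the fix boils down to operating on tsk rather than implicitly on
  current; roughly:

    static inline void drop_fpu(struct task_struct *tsk)
    {
            preempt_disable();
            tsk->thread.fpu_counter = 0;
            __drop_fpu(tsk);
            /* Was clear_used_math(), which always acted on current. */
            clear_stopped_child_used_math(tsk);
            preempt_enable();
    }
]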

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Pekka Riikonen <priikone@iki.fi>
Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Cc: Suresh Siddha <sbsiddha@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20150309171041.GB11388@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-03-26 15:06:58 +01:00
Andy Lutomirski 83df3d59f5 x86, tls: Interpret an all-zero struct user_desc as "no segment"
commit 3669ef9fa7 upstream.

The Witcher 2 did something like this to allocate a TLS segment index:

        struct user_desc u_info;
        bzero(&u_info, sizeof(u_info));
        u_info.entry_number = (uint32_t)-1;

        syscall(SYS_set_thread_area, &u_info);

Strictly speaking, this code was never correct.  It should have set
read_exec_only and seg_not_present to 1 to indicate that it wanted
to find a free slot without putting anything there, or it should
have put something sensible in the TLS slot if it wanted to allocate
a TLS entry for real.  The actual effect of this code was to
allocate a bogus segment that could be used to exploit espfix.

The set_thread_area hardening patches changed the behavior, causing
set_thread_area to return -EINVAL and crashing the game.

This changes set_thread_area to interpret this as a request to find
a free slot and to leave it empty, which isn't *quite* what the game
expects but should be close enough to keep it working.  In
particular, using the code above to allocate two segments will
allocate the same segment both times.

According to FrostbittenKing on Github, this fixes The Witcher 2.

If this somehow still causes problems, we could instead allocate
a limit==0 32-bit data segment, but that seems rather ugly to me.
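
[ Note: the new interpretation amounts to treating a descriptor that is
  all-zero apart from entry_number as "empty"; schematically:

    static bool LDT_zero(const struct user_desc *info)
    {
            return (info->base_addr       == 0 &&
                    info->limit           == 0 &&
                    info->contents        == 0 &&
                    info->read_exec_only  == 0 &&
                    info->seg_32bit       == 0 &&
                    info->limit_in_pages  == 0 &&
                    info->seg_not_present == 0 &&
                    info->useable         == 0);
    }

  set_thread_area() then clears the slot for both LDT_empty() and
  LDT_zero() descriptors instead of returning -EINVAL. ]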

Fixes: 41bdc78544 x86/tls: Validate TLS entries to protect espfix
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Cc: torvalds@linux-foundation.org
Link: http://lkml.kernel.org/r/0cb251abe1ff0958b8e468a9a9a905b80ae3a746.1421954363.git.luto@amacapital.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-29 17:40:49 -08:00
Andy Lutomirski 1fe84f910a x86, tls, ldt: Stop checking lm in LDT_empty
commit e30ab185c4 upstream.

32-bit programs don't have an lm bit in their ABI, so they can't
reliably cause LDT_empty to return true without resorting to memset.
They shouldn't need to do this.

This should fix a longstanding, if minor, issue in all 64-bit kernels
as well as a potential regression in the TLS hardening code.

Fixes: 41bdc78544 x86/tls: Validate TLS entries to protect espfix
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Cc: torvalds@linux-foundation.org
Link: http://lkml.kernel.org/r/72a059de55e86ad5e2935c80aa91880ddf19d07c.1421954363.git.luto@amacapital.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-29 17:40:49 -08:00
Andy Lutomirski 99c8619590 x86, vdso: Use asm volatile in __getcpu
commit 1ddf0b1b11 upstream.

In Linux 3.18 and below, GCC hoists the lsl instructions in the
pvclock code all the way to the beginning of __vdso_clock_gettime,
slowing the non-paravirt case significantly.  For unknown reasons,
presumably related to the removal of a branch, the performance issue
is gone as of

e76b027e64 x86,vdso: Use LSL unconditionally for vgetcpu

but I don't trust GCC enough to expect the problem to stay fixed.

There should be no correctness issue, because the __getcpu calls in
__vdso_clock_gettime were never necessary in the first place.
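
[ Note: the change boils down to marking the segment-limit load volatile
  so GCC can neither hoist nor merge it; a sketch:

    static inline unsigned int __getcpu(void)
    {
            unsigned int p;

            /*
             * Volatile keeps gcc from hoisting the LSL out of the
             * calling function and from reordering it wrt barrier().
             */
            asm volatile ("lsl %1,%0" : "=r" (p) : "r" (__PER_CPU_SEG));

            return p;
    }
]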

Note to stable maintainers: In 3.18 and below, depending on
configuration, gcc 4.9.2 generates code like this:

     9c3:       44 0f 03 e8             lsl    %ax,%r13d
     9c7:       45 89 eb                mov    %r13d,%r11d
     9ca:       0f 03 d8                lsl    %ax,%ebx

This patch won't apply as is to any released kernel, but I'll send a
trivial backported version if needed.

[
 Backported by Andy Lutomirski.  Should apply to all affected
 versions.  This fixes a functionality bug as well as a performance
 bug: buggy kernels can infinite loop in __vdso_clock_gettime on
 affected compilers.  See, for example:

 https://bugzilla.redhat.com/show_bug.cgi?id=1178975
]

Fixes: 51c19b4f59 x86: vdso: pvclock gettime support
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-16 06:59:31 -08:00
Andy Lutomirski e990e54996 x86/tls: Don't validate lm in set_thread_area() after all
commit 3fb2f4237b upstream.

It turns out that there's a lurking ABI issue.  GCC, when
compiling this in a 32-bit program:

struct user_desc desc = {
	.entry_number    = idx,
	.base_addr       = base,
	.limit           = 0xfffff,
	.seg_32bit       = 1,
	.contents        = 0, /* Data, grow-up */
	.read_exec_only  = 0,
	.limit_in_pages  = 1,
	.seg_not_present = 0,
	.useable         = 0,
};

will leave .lm uninitialized.  This means that anything in the
kernel that reads user_desc.lm for 32-bit tasks is unreliable.

Revert the .lm check in set_thread_area().  The value never did
anything in the first place.

Fixes: 0e58af4e1d ("x86/tls: Disallow unusual TLS segments")
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/d7875b60e28c512f6a6fc0baf5714d58e7eaadbb.1418856405.git.luto@amacapital.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-08 10:00:49 -08:00
Paolo Bonzini 28b0107f54 x86: kvm: use alternatives for VMCALL vs. VMMCALL if kernel text is read-only
commit c1118b3602 upstream.

On x86_64, kernel text mappings are mapped read-only with CONFIG_DEBUG_RODATA.
In that case, KVM will fail to patch VMCALL instructions to VMMCALL
as required on AMD processors.

The failure mode is currently a divide-by-zero exception, which obviously
is a KVM bug that has to be fixed.  However, picking the right instruction
between VMCALL and VMMCALL will be faster and will help if you cannot upgrade
the hypervisor.
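
[ Note: the patched hypercall site picks the right instruction at
  runtime via the alternatives mechanism, roughly:

    /* VMCALL = 0f 01 c1 (Intel), VMMCALL = 0f 01 d9 (AMD). */
    #define KVM_HYPERCALL \
            ALTERNATIVE(".byte 0x0f,0x01,0xc1", \
                        ".byte 0x0f,0x01,0xd9", \
                        X86_FEATURE_VMMCALL)

  Alternatives are applied before kernel text is marked read-only, so no
  late patching of VMCALL is needed any more. ]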

Reported-by: Chris Webb <chris@arachsys.com>
Tested-by: Chris Webb <chris@arachsys.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Chris J Arges <chris.j.arges@canonical.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-12-06 15:55:40 -08:00
Andy Lutomirski 52c01ecc65 uprobes, x86: Fix _TIF_UPROBE vs _TIF_NOTIFY_RESUME
commit 82975bc6a6 upstream.

x86 calls do_notify_resume on paranoid returns if TIF_UPROBE is set but
not on non-paranoid returns.  I suspect that this is a mistake and that
the code only works because int3 is paranoid.

Setting _TIF_NOTIFY_RESUME in the uprobe code was probably a workaround
for the x86 bug.  With that bug fixed, we can remove _TIF_NOTIFY_RESUME
from the uprobes code.

Reported-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-12-06 15:55:33 -08:00
Andy Lutomirski c6328855c4 x86_64, traps: Stop using IST for #SS
commit 6f442be2fb upstream.

On a 32-bit kernel, this has no effect, since there are no IST stacks.

On a 64-bit kernel, #SS can only happen in user code, on a failed iret
to user space, a canonical violation on access via RSP or RBP, or a
genuine stack segment violation in 32-bit kernel code.  The first two
cases don't need IST, and the latter two cases are unlikely fatal bugs,
and promoting them to double faults would be fine.

This fixes a bug in which the espfix64 code mishandles a stack segment
violation.

This saves 4k of memory per CPU and a tiny bit of code.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-12-06 15:55:33 -08:00
Ong Boon Leong e31f17130f x86, iosf: Added Quark MBI identifiers
commit 7ef1def800 upstream.

Added all the MBI units below and their associated read/write
opcodes:
 - Host Bridge Arbiter
 - Host Bridge
 - Remote Management Unit
 - Memory Manager & eSRAM
 - SoC Unit

Signed-off-by: Ong Boon Leong <boon.leong.ong@intel.com>
Link: http://lkml.kernel.org/r/1399668248-24199-3-git-send-email-david.e.box@linux.intel.com
Signed-off-by: David E. Box <david.e.box@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: Chang Rebecca Swee Fun <rebecca.swee.fun.chang@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14 09:00:00 -08:00
David E. Box 2e6400671d x86, iosf: Make IOSF driver modular and usable by more drivers
commit 6b8f0c8780 upstream.

Currently, drivers that run on non-IOSF systems (Core/Xeon) can't use the IOSF
driver on SoCs without selecting it, which forces an unnecessary and limiting
dependency. This patch provides dummy functions to allow these modules to
conditionally use the driver on IOSF-equipped platforms without impacting their
ability to compile and load on non-IOSF platforms. The build defaults to 'm' to
ensure availability on x86 SoCs.

Signed-off-by: David E. Box <david.e.box@linux.intel.com>
Link: http://lkml.kernel.org/r/1399668248-24199-2-git-send-email-david.e.box@linux.intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: Chang Rebecca Swee Fun <rebecca.swee.fun.chang@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14 08:59:59 -08:00
Petr Matousek 8b74c6f87d kvm: vmx: handle invvpid vm exit gracefully
commit a642fc3050 upstream.

On systems with invvpid instruction support (the corresponding bit in
the IA32_VMX_EPT_VPID_CAP MSR is set) guest invocation of invvpid
causes a vm exit, which is currently not handled and results in
propagation of an unknown exit to userspace.

Fix this by installing an invvpid vm exit handler.

This is CVE-2014-3646.
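
[ Note: since nested VPID is not exposed to the guest, the handler can
  simply inject #UD; roughly:

    /* Emulate the INVVPID instruction: not supported, raise #UD. */
    static int handle_invvpid(struct kvm_vcpu *vcpu)
    {
            kvm_queue_exception(vcpu, UD_VECTOR);
            return 1;
    }

  wired up as [EXIT_REASON_INVVPID] = handle_invvpid in the VMX exit
  handler table. ]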

Signed-off-by: Petr Matousek <pmatouse@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14 08:59:55 -08:00
Nadav Amit 44d1efb927 KVM: x86: Check non-canonical addresses upon WRMSR
commit 854e8bb1aa upstream.

Upon WRMSR, the CPU should inject #GP if a non-canonical value (address) is
written to certain MSRs. The behavior is "almost" identical for AMD and Intel
(ignoring MSRs that are not implemented in either architecture since they would
anyhow #GP). However, IA32_SYSENTER_ESP and IA32_SYSENTER_EIP cause #GP if
non-canonical address is written on Intel but not on AMD (which ignores the top
32-bits).

Accordingly, this patch injects a #GP on the MSRs which behave identically on
Intel and AMD.  To eliminate the differences between the architectures, the
value which is written to IA32_SYSENTER_ESP and IA32_SYSENTER_EIP is turned into
a canonical value before writing instead of injecting a #GP.

Some references from Intel and AMD manuals:

According to Intel SDM description of WRMSR instruction #GP is expected on
WRMSR "If the source register contains a non-canonical address and ECX
specifies one of the following MSRs: IA32_DS_AREA, IA32_FS_BASE, IA32_GS_BASE,
IA32_KERNEL_GS_BASE, IA32_LSTAR, IA32_SYSENTER_EIP, IA32_SYSENTER_ESP."

According to the AMD instruction manual:
LSTAR/CSTAR (SYSCALL): "The WRMSR instruction loads the target RIP into the
LSTAR and CSTAR registers.  If an RIP written by WRMSR is not in canonical
form, a general-protection exception (#GP) occurs."
IA32_GS_BASE and IA32_FS_BASE (WRFSBASE/WRGSBASE): "The address written to the
base field must be in canonical form or a #GP fault will occur."
IA32_KERNEL_GS_BASE (SWAPGS): "The address stored in the KernelGSbase MSR must
be in canonical form."

This patch fixes CVE-2014-3610.
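
[ Note: with 48-bit virtual addresses, the canonicalisation used here is
  a sign extension of bit 47; a sketch of the helpers:

    static inline u64 get_canonical(u64 la)
    {
            /* Sign-extend bit 47 into bits 63:48. */
            return ((int64_t)la << 16) >> 16;
    }

    static inline bool is_noncanonical_address(u64 la)
    {
    #ifdef CONFIG_X86_64
            return get_canonical(la) != la;
    #else
            return false;
    #endif
    }
]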

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14 08:59:55 -08:00
Andy Honig fb07a1411e KVM: x86: Prevent host from panicking on shared MSR writes.
commit 8b3c3104c3 upstream.

The previous patch blocked invalid writes directly when the MSR
is written.  As a precaution, prevent future similar mistakes by
gracefully handling #GPs caused by writes to shared MSRs.

Signed-off-by: Andrew Honig <ahonig@google.com>
[Remove parts obsoleted by Nadav's patch. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14 08:59:54 -08:00
Ben Hutchings c8d171f626 x86: Reject x32 executables if x32 ABI not supported
commit 0e6d3112a4 upstream.

It is currently possible to execve() an x32 executable on an x86_64
kernel that has only ia32 compat enabled.  However all its syscalls
will fail, even _exit().  This usually causes it to segfault.

Change the ELF compat architecture check so that x32 executables are
rejected if we don't support the x32 ABI.
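
[ Note: the check effectively becomes something like:

    #define compat_elf_check_arch(x)                                  \
            (elf_check_arch_ia32(x) ||                                \
             (IS_ENABLED(CONFIG_X86_X32_ABI) &&                       \
              (x)->e_machine == EM_X86_64))

  so an x32 binary is only accepted when the kernel was built with
  CONFIG_X86_X32_ABI. ]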

Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Link: http://lkml.kernel.org/r/1410120305.6822.9.camel@decadent.org.uk
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14 08:59:47 -08:00
David Matlack a523af29af kvm: x86: fix stale mmio cache bug
commit 56f17dd3fb upstream.

The following events can lead to an incorrect KVM_EXIT_MMIO bubbling
up to userspace:

(1) Guest accesses gpa X without a memory slot. The gfn is cached in
struct kvm_vcpu_arch (mmio_gfn). On Intel EPT-enabled hosts, KVM sets
the SPTE write-execute-noread so that future accesses cause
EPT_MISCONFIGs.

(2) Host userspace creates a memory slot via KVM_SET_USER_MEMORY_REGION
covering the page just accessed.

(3) Guest attempts to read or write to gpa X again. On Intel, this
generates an EPT_MISCONFIG. The memory slot generation number that
was incremented in (2) would normally take care of this but we fast
path mmio faults through quickly_check_mmio_pf(), which only checks
the per-vcpu mmio cache. Since we hit the cache, KVM passes a
KVM_EXIT_MMIO up to userspace.

This patch fixes the issue by using the memslot generation number
to validate the mmio cache.
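
[ Note: conceptually, the cache entry records the memslot generation at
  fill time and a lookup only hits while the generations still match;
  sketched (field names as in the description above):

    static inline void vcpu_cache_mmio_info(struct kvm_vcpu *vcpu,
                                            gva_t gva, gfn_t gfn,
                                            unsigned access)
    {
            vcpu->arch.mmio_gva = gva & PAGE_MASK;
            vcpu->arch.access = access;
            vcpu->arch.mmio_gfn = gfn;
            vcpu->arch.mmio_gen = kvm_memslots(vcpu->kvm)->generation;
    }

    static inline bool vcpu_match_mmio_gen(struct kvm_vcpu *vcpu)
    {
            return vcpu->arch.mmio_gen ==
                   kvm_memslots(vcpu->kvm)->generation;
    }

  quickly_check_mmio_pf() then treats a stale generation as a miss. ]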

Signed-off-by: David Matlack <dmatlack@google.com>
[xiaoguangrong: adjust the code to make it simpler for stable-tree fix.]
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Tested-by: David Matlack <dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-10-30 09:38:19 -07:00
Dave Young c2e334db0b x86 early_ioremap: Increase FIX_BTMAPS_SLOTS to 8
commit 3eddc69ffe upstream.

The 3.16 kernel fails to boot with earlyprintk=efi; it keeps scrolling at the
bottom line of the screen.

Bisected, the first bad commit is below:
commit 86dfc6f339
Author: Lv Zheng <lv.zheng@intel.com>
Date:   Fri Apr 4 12:38:57 2014 +0800

    ACPICA: Tables: Fix table checksums verification before installation.

I did some debugging by enabling both serial and efi earlyprintk. Below is
some debug dmesg; early_ioremap seems to fail in the scroll-up function
because there is no free slot, see the dmesg output below:

  WARNING: CPU: 0 PID: 0 at mm/early_ioremap.c:116 __early_ioremap+0x90/0x1c4()
  __early_ioremap(ed00c800, 00000c80) not found slot
  Modules linked in:
  CPU: 0 PID: 0 Comm: swapper Not tainted 3.17.0-rc1+ #204
  Hardware name: Hewlett-Packard HP Z420 Workstation/1589, BIOS J61 v03.15 05/09/2013
  Call Trace:
    dump_stack+0x4e/0x7a
    warn_slowpath_common+0x75/0x8e
    ? __early_ioremap+0x90/0x1c4
    warn_slowpath_fmt+0x47/0x49
    __early_ioremap+0x90/0x1c4
    ? sprintf+0x46/0x48
    early_ioremap+0x13/0x15
    early_efi_map+0x24/0x26
    early_efi_scroll_up+0x6d/0xc0
    early_efi_write+0x1b0/0x214
    call_console_drivers.constprop.21+0x73/0x7e
    console_unlock+0x151/0x3b2
    ? vprintk_emit+0x49f/0x532
    vprintk_emit+0x521/0x532
    ? console_unlock+0x383/0x3b2
    printk+0x4f/0x51
    acpi_os_vprintf+0x2b/0x2d
    acpi_os_printf+0x43/0x45
    acpi_info+0x5c/0x63
    ? __acpi_map_table+0x13/0x18
    ? acpi_os_map_iomem+0x21/0x147
    acpi_tb_print_table_header+0x177/0x186
    acpi_tb_install_table_with_override+0x4b/0x62
    acpi_tb_install_standard_table+0xd9/0x215
    ? early_ioremap+0x13/0x15
    ? __acpi_map_table+0x13/0x18
    acpi_tb_parse_root_table+0x16e/0x1b4
    acpi_initialize_tables+0x57/0x59
    acpi_table_init+0x50/0xce
    acpi_boot_table_init+0x1e/0x85
    setup_arch+0x9b7/0xcc4
    start_kernel+0x94/0x42d
    ? early_idt_handlers+0x120/0x120
    x86_64_start_reservations+0x2a/0x2c
    x86_64_start_kernel+0xf3/0x100

Quote reply from Lv.zheng about the early ioremap slot usage in this case:

"""
In early_efi_scroll_up(), 2 mapping entries will be used for the src/dst screen buffer.
In drivers/acpi/acpica/tbutils.c, we've improved the early table loading code in acpi_tb_parse_root_table().
We now need 2 mapping entries:
1. One mapping entry is used for RSDT table mapping. Each RSDT entry contains an address for another ACPI table.
2. For each entry in RSDP, we need another mapping entry to map the table to perform necessary check/override before installing it.

When acpi_tb_parse_root_table() prints something through EFI earlyprintk console, we'll have 4 mapping entries used.
The current 4 slots setting of early_ioremap() seems to be too small for such a use case.
"""

Thus increase the number of slots to 8 in this patch to fix this issue.
Boot-time mappings grow to 512 pages with this change.

Signed-off-by: Dave Young <dyoung@redhat.com>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-10-05 14:52:17 -07:00
Stefan Bader 7fad1f4c34 x86/xen: don't copy bogus duplicate entries into kernel page tables
commit 0b5a50635f upstream.

When RANDOMIZE_BASE (KASLR) is enabled; or the sum of all loaded
modules exceeds 512 MiB, then loading modules fails with a warning
(and hence a vmalloc allocation failure) because the PTEs for the
newly-allocated vmalloc address space are not zero.

  WARNING: CPU: 0 PID: 494 at linux/mm/vmalloc.c:128
           vmap_page_range_noflush+0x2a1/0x360()

This is caused by xen_setup_kernel_pagetables() copying
level2_kernel_pgt into level2_fixmap_pgt, overwriting many non-present
entries.

Without KASLR, the normal kernel image size only covers the first half
of level2_kernel_pgt and module space starts after that.

L4[511]->level3_kernel_pgt[510]->level2_kernel_pgt[  0..255]->kernel
                                                  [256..511]->module
                          [511]->level2_fixmap_pgt[  0..505]->module

This allows 512 MiB of module vmalloc space to be used before
having to use the corrupted level2_fixmap_pgt entries.

With KASLR enabled, the kernel image uses the full PUD range of 1G and
module space starts in the level2_fixmap_pgt. So basically:

L4[511]->level3_kernel_pgt[510]->level2_kernel_pgt[0..511]->kernel
                          [511]->level2_fixmap_pgt[0..505]->module

And now no module vmalloc space can be used without using the corrupt
level2_fixmap_pgt entries.

Fix this by properly converting the level2_fixmap_pgt entries to MFNs,
and setting level1_fixmap_pgt as read-only.

A number of comments were also using the wrong L3 offset for
level2_kernel_pgt.  These have been corrected.

Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-10-05 14:52:17 -07:00
Paolo Bonzini 98a634d07e Revert "KVM: x86: Increase the number of fixed MTRR regs to 10"
commit 0d234daf7e upstream.

This reverts commit 682367c494,
which causes 32-bit SMP Windows 7 guests to panic.

SeaBIOS has a limit on the number of MTRRs that it can handle,
and this patch exceeded the limit.  Better revert it.
Thanks to Nadav Amit for debugging the cause.

Reported-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-09-05 16:34:15 -07:00
Andy Lutomirski 74020e9499 x86_64/entry/xen: Do not invoke espfix64 on Xen
commit 7209a75d20 upstream.

This moves the espfix64 logic into native_iret.  To make this work,
it gets rid of the native patch for INTERRUPT_RETURN:
INTERRUPT_RETURN on native kernels is now 'jmp native_iret'.

This changes the 16-bit SS behavior on Xen from OOPSing to leaking
some bits of the Xen hypervisor's RSP (I think).

[ hpa: this is a nonzero cost on native, but probably not enough to
  measure. Xen needs to fix this in their own code, probably doing
  something equivalent to espfix64. ]

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Link: http://lkml.kernel.org/r/7b8f1d8ef6597cb16ae004a43c56980a7de3cf94.1406129132.git.luto@amacapital.net
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-08-07 14:52:38 -07:00
H. Peter Anvin 1ce1f063cc x86, espfix: Fix broken header guard
commit 20b68535cd upstream.

Header guard is #ifndef, not #ifdef...

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-08-07 14:52:38 -07:00
H. Peter Anvin 5e03b3d8ef x86, espfix: Move espfix definitions into a separate header file
commit e1fe9ed8d2 upstream.

Sparse warns that the percpu variables aren't declared before they are
defined.  Rather than hacking around it, move espfix definitions into
a proper header file.

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-08-07 14:52:38 -07:00
H. Peter Anvin 7ecbe6e0a0 x86-64, espfix: Don't leak bits 31:16 of %esp returning to 16-bit stack
commit 3891a04aaf upstream.

The IRET instruction, when returning to a 16-bit segment, only
restores the bottom 16 bits of the user space stack pointer.  This
causes some 16-bit software to break, but it also leaks kernel state
to user space.  We have a software workaround for that ("espfix") for
the 32-bit kernel, but it relies on a nonzero stack segment base which
is not available in 64-bit mode.

In checkin:

    b3b42ac2cb x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels

we "solved" this by forbidding 16-bit segments on 64-bit kernels, with
the logic that 16-bit support is crippled on 64-bit kernels anyway (no
V86 support), but it turns out that people are doing stuff like
running old Win16 binaries under Wine and expect it to work.

This works around this by creating percpu "ministacks", each of which
is mapped 2^16 times 64K apart.  When we detect that the return SS is
on the LDT, we copy the IRET frame to the ministack and use the
relevant alias to return to userspace.  The ministacks are mapped
readonly, so if IRET faults we promote #GP to #DF which is an IST
vector and thus has its own stack; we then do the fixup in the #DF
handler.

(Making #GP an IST exception would make the msr_safe functions unsafe
in NMI/MC context, and quite possibly have other effects.)

Special thanks to:

- Andy Lutomirski, for the suggestion of using very small stack slots
  and copy (as opposed to map) the IRET frame there, and for the
  suggestion to mark them readonly and let the fault promote to #DF.
- Konrad Wilk for paravirt fixup and testing.
- Borislav Petkov for testing help and useful comments.

Reported-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Link: http://lkml.kernel.org/r/1398816946-3351-1-git-send-email-hpa@linux.intel.com
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Andrew Lutomirski <amluto@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Dirk Hohndel <dirk@hohndel.org>
Cc: Arjan van de Ven <arjan.van.de.ven@intel.com>
Cc: comex <comexk@gmail.com>
Cc: Alexander van Heukelum <heukelum@fastmail.fm>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: <stable@vger.kernel.org> # consider after upstream merge
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-08-07 14:52:38 -07:00
Paolo Bonzini a46438472e KVM: x86: preserve the high 32-bits of the PAT register
commit 7cb060a91c upstream.

KVM does not really do much with the PAT, so this went unnoticed for a
long time.  It is exposed however if you try to do rdmsr on the PAT
register.

Reported-by: Valentine Sinitsyn <valentine.sinitsyn@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-09 11:18:27 -07:00
Nadav Amit 16f225f355 KVM: x86: Increase the number of fixed MTRR regs to 10
commit 682367c494 upstream.

Recent Intel CPUs have 10 variable range MTRRs. Since operating systems
sometimes make assumptions about CPUs while ignoring capability MSRs, it is
better for KVM to be consistent with recent CPUs. Reporting more MTRRs than
actually supported has no functional implications.

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-09 11:18:27 -07:00
Tejun Heo b897eba830 ptrace,x86: force IRET path after a ptrace_stop()
commit b9cd18de4d upstream.

The 'sysret' fastpath does not correctly restore even all regular
registers, much less any segment registers or rflags values.  That is
very much part of why it's faster than 'iret'.

Normally that isn't a problem, because the normal ptrace() interface
catches the process using the signal handler infrastructure, which
always returns with an iret.

However, some paths can get caught using ptrace_event() instead of the
signal path, and for those we need to make sure that we aren't going to
return to user space using 'sysret'.  Otherwise the modifications that
may have been done to the register set by the tracer wouldn't
necessarily take effect.

Fix it by forcing IRET path by setting TIF_NOTIFY_RESUME from
arch_ptrace_stop_needed() which is invoked from ptrace_stop().
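
[ Note: the x86 hook piggybacks on arch_ptrace_stop_needed(), which
  ptrace_stop() evaluates anyway; schematically:

    #define arch_ptrace_stop_needed(code, info)                        \
    ({                                                                 \
            set_thread_flag(TIF_NOTIFY_RESUME); /* force IRET path */  \
            false;  /* no arch-specific stop work actually needed */   \
    })
]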

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Andy Lutomirski <luto@amacapital.net>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-06 18:57:29 -07:00
Anthony Iliopoulos d0cdfb253e x86, mm, hugetlb: Add missing TLB page invalidation for hugetlb_cow()
commit 9844f54623 upstream.

The invalidation is required in order to maintain proper semantics
under CoW conditions. In scenarios where a process clones several
threads, a thread operating on a core whose DTLB entry for a
particular hugepage has not been invalidated will be reading from
the hugepage that belongs to the forked child process, even after
hugetlb_cow().

The thread will not see the updated page as long as the stale DTLB
entry remains cached, the thread attempts to write into the page,
the child process exits, or the thread gets migrated to a different
processor.
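
[ Note: on x86 the fix is tiny -- huge_ptep_clear_flush() had been an
  empty stub; roughly:

    static inline void huge_ptep_clear_flush(struct vm_area_struct *vma,
                                             unsigned long addr, pte_t *ptep)
    {
            ptep_clear_flush(vma, addr, ptep);  /* was a no-op before */
    }
]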

Signed-off-by: Anthony Iliopoulos <anthony.iliopoulos@huawei.com>
Link: http://lkml.kernel.org/r/20140514092948.GA17391@server-36.huawei.corp
Suggested-by: Shay Goikhman <shay.goikhman@huawei.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-07 10:28:10 -07:00
Peter Zijlstra 4c03d46991 x86,preempt: Fix preemption for i386
Many people reported preemption/reschedule problems with i386 kernels
for .13 and .14. After Michele bisected this to a combination of

  3e8e42c69bb ("sched: Revert need_resched() to look at TIF_NEED_RESCHED")
  ded7975475 ("irq: Force hardirq exit's softirq processing on its own stack")

it finally dawned on me that i386's current_thread_info() was to
blame.

When we are on interrupt/exception stacks, we fail to observe the
right TIF_NEED_RESCHED bit and therefore the PREEMPT_NEED_RESCHED
folding malfunctions.

Current upstream fixes this by making i386 behave the same as x86_64
already did:

  2432e1364b ("x86: Nuke the supervisor_stack field in i386 thread_info")
  b807902a88 ("x86: Nuke GET_THREAD_INFO_WITH_ESP() macro for i386")
  0788aa6a23 ("x86: Prepare removal of previous_esp from i386 thread_info structure")
  198d208df4 ("x86: Keep thread_info on thread stack in x86_32")

However, that is far too much to stuff into -stable. Therefore I
propose we merge the below patch which uses task_thread_info(current)
for tif_need_resched() instead of the ESP based current_thread_info().

This makes sure we always observe the one true TIF_NEED_RESCHED bit
and things will work as expected again.
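
[ Note: the proposed stable-only change is essentially:

    /*
     * Look at the task's real thread_info, not the one derived from
     * the (possibly interrupt/exception) stack pointer on i386.
     */
    #define tif_need_resched() \
            test_tsk_thread_flag(current, TIF_NEED_RESCHED)
]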

Cc: bp@alien8.de
Cc: fweisbec@gmail.com
Cc: david.a.cohen@linux.intel.com
Cc: mingo@kernel.org
Cc: greg@kroah.com
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: gregkh@linuxfoundation.org
Cc: pbonzini@redhat.com
Cc: rostedt@goodmis.org
Cc: stefan.bader@canonical.com
Cc: toralf.foerster@gmx.de
Cc: David Cohen <david.a.cohen@linux.intel.com>
Cc: torvalds@linux-foundation.org
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: <stable@vger.kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: <stable-commits@vger.kernel.org>
Cc: peterz@infradead.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: barra_cuda@katamail.com
Tested-by: Stefan Bader <stefan.bader@canonical.com>
Tested-by: Toralf Förster <toralf.foerster@gmx.de>
Tested-by: Michele Ballabio <barra_cuda@katamail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140409142447.GD13658@twins.programming.kicks-ass.net
2014-05-31 13:20:28 -07:00
Fenghua Yu 5bc3ab30db x86, AVX-512: Enable AVX-512 States Context Switch
commit c2bc11f10a upstream.

This patch enables Opmask, ZMM_Hi256, and Hi16_ZMM AVX-512 states for
xstate context switch.

Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Link: http://lkml.kernel.org/r/1392931491-33237-2-git-send-email-fenghua.yu@intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-06 07:59:26 -07:00
Fenghua Yu 05f4be8204 x86, AVX-512: AVX-512 Feature Detection
commit 8e5780fdee upstream.

AVX-512 is an extension of AVX2. Its spec can be found at:
http://download-software.intel.com/sites/default/files/managed/71/2e/319433-017.pdf

This patch detects AVX-512 features by CPUID.

Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Link: http://lkml.kernel.org/r/1392931491-33237-1-git-send-email-fenghua.yu@intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-06 07:59:26 -07:00
Borislav Petkov 579d8f085b x86/efi: Make efi virtual runtime map passing more robust
commit b7b898ae0c upstream.

Currently, running SetVirtualAddressMap() and passing the physical
address of the virtual map array was working only by a lucky coincidence
because the memory was present in the EFI page table too. Until Toshi
went and booted this on a big HP box - the krealloc() manner of resizing
the memmap we're doing did allocate from such physical addresses which
were not mapped anymore and boom:

http://lkml.kernel.org/r/1386806463.1791.295.camel@misato.fc.hp.com

One way to take care of that issue is to reimplement the krealloc thing
but with pages. We start with contiguous pages of order 1, i.e. 2 pages,
and when we deplete that memory (shouldn't happen all that often but you
know firmware) we realloc the next power-of-two pages.

Having the pages, it is much more handy and easy to map them into the
EFI page table with the already existing mapping code which we're using
for building the virtual mappings.

Thanks to Toshi Kani and Matt for the great debugging help.

Reported-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-04-14 06:50:02 -07:00
Borislav Petkov 6f48949f01 x86, pageattr: Export page unmapping interface
commit 42a5477251 upstream.

We will use it in efi so expose it.

Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-04-14 06:50:02 -07:00
Artem Fetishev 825600c0f2 x86: fix boot on uniprocessor systems
On x86 uniprocessor systems topology_physical_package_id() returns -1,
which causes rapl_cpu_prepare() to leave the rapl_pmu variable
uninitialized, which leads to a GPF in rapl_pmu_init().

See arch/x86/kernel/cpu/perf_event_intel_rapl.c.

It turns out that physical_package_id and core_id can actually be
retrieved for uniprocessor systems too.  Enabling them also fixes
the rapl_pmu code.

Signed-off-by: Artem Fetishev <artem_fetishev@epam.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-03-28 13:56:58 -07:00
David Vrabel 5926f87fda Revert "xen: properly account for _PAGE_NUMA during xen pte translations"
This reverts commit a9c8e4beee.

PTEs in Xen PV guests must contain machine addresses if _PAGE_PRESENT
is set and pseudo-physical addresses if _PAGE_PRESENT is clear.

This is because during a domain save/restore (migration) the page
table entries are "canonicalised" and "uncanonicalised", i.e., MFNs are
converted to PFNs during domain save so that on a restore the page
table entries may be rewritten with the new MFNs on the destination.
This canonicalisation is only done for PTEs that are present.

This change resulted in writing PTEs with MFNs if _PAGE_PROTNONE (or
_PAGE_NUMA) was set but _PAGE_PRESENT was clear.  These PTEs would be
migrated as-is which would result in unexpected behaviour in the
destination domain.  Either a) the MFN would be translated to the
wrong PFN/page; b) setting the _PAGE_PRESENT bit would clear the PTE
because the MFN is no longer owned by the domain; or c) the present
bit would not get set.

Symptoms include "Bad page" reports when munmapping after migrating a
domain.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: <stable@vger.kernel.org>        [3.12+]
2014-03-25 11:11:42 +00:00
Dave Jones 09df7c4c80 x86: Remove CONFIG_X86_OOSTORE
This was an optimization that made memcpy type benchmarks a little
faster on ancient (circa 1998) IDT WinChip CPUs.  In real-life
workloads, it wasn't even noticeable, and I doubt anyone is running
benchmarks on 16 year old silicon any more.

Given this code has likely seen very little use over the last decade,
let's just remove it.

Signed-off-by: Dave Jones <davej@fedoraproject.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-03-11 10:16:18 -07:00
H. Peter Anvin 3c0b566334 * Disable the new EFI 1:1 virtual mapping for SGI UV because using it
causes a crash during boot - Borislav Petkov

Merge tag 'efi-urgent' into x86/urgent

 * Disable the new EFI 1:1 virtual mapping for SGI UV because using it
   causes a crash during boot - Borislav Petkov

Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-03-04 15:50:06 -08:00
Borislav Petkov a5d90c923b x86/efi: Quirk out SGI UV
Alex reported hitting the following BUG after the EFI 1:1 virtual
mapping work was merged,

 kernel BUG at arch/x86/mm/init_64.c:351!
 invalid opcode: 0000 [#1] SMP
 Call Trace:
  [<ffffffff818aa71d>] init_extra_mapping_uc+0x13/0x15
  [<ffffffff818a5e20>] uv_system_init+0x22b/0x124b
  [<ffffffff8108b886>] ? clockevents_register_device+0x138/0x13d
  [<ffffffff81028dbb>] ? setup_APIC_timer+0xc5/0xc7
  [<ffffffff8108b620>] ? clockevent_delta2ns+0xb/0xd
  [<ffffffff818a3a92>] ? setup_boot_APIC_clock+0x4a8/0x4b7
  [<ffffffff8153d955>] ? printk+0x72/0x74
  [<ffffffff818a1757>] native_smp_prepare_cpus+0x389/0x3d6
  [<ffffffff818957bc>] kernel_init_freeable+0xb7/0x1fb
  [<ffffffff81535530>] ? rest_init+0x74/0x74
  [<ffffffff81535539>] kernel_init+0x9/0xff
  [<ffffffff81541dfc>] ret_from_fork+0x7c/0xb0
  [<ffffffff81535530>] ? rest_init+0x74/0x74

Getting this thing to work with the new mapping scheme would need more
work, so automatically switch to the old memmap layout for SGI UV.

Acked-by: Russ Anderson <rja@sgi.com>
Cc: Alex Thorlton <athorlton@sgi.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
2014-03-04 23:43:33 +00:00
Thomas Gleixner 5f0e030930 x86, tsc: Fallback to normal calibration if fast MSR calibration fails
If we cannot calibrate the TSC via MSR based calibration,
try_msr_calibrate_tsc() stores zero to fast_calibrate and returns that
to the caller. This value then gets propagated further to the
clockevents code, resulting in a division-by-zero oops like the one
below:

 divide error: 0000 [#1] PREEMPT SMP
 Modules linked in:
 CPU: 0 PID: 1 Comm: swapper/0 Tainted: G        W    3.13.0+ #47
 task: ffff880075508000 ti: ffff880075506000 task.ti: ffff880075506000
 RIP: 0010:[<ffffffff810aec14>]  [<ffffffff810aec14>] clockevents_config.part.3+0x24/0xa0
 RSP: 0000:ffff880075507e58  EFLAGS: 00010246
 RAX: ffffffffffffffff RBX: ffff880079c0cd80 RCX: 0000000000000000
 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffffffffffff
 RBP: ffff880075507e70 R08: 0000000000000001 R09: 00000000000000be
 R10: 00000000000000bd R11: 0000000000000003 R12: 000000000000b008
 R13: 0000000000000008 R14: 000000000000b010 R15: 0000000000000000
 FS:  0000000000000000(0000) GS:ffff880079c00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
 CR2: ffff880079fff000 CR3: 0000000001c0b000 CR4: 00000000001006f0
 Stack:
  ffff880079c0cd80 000000000000b008 0000000000000008 ffff880075507e88
  ffffffff810aecb0 ffff880079c0cd80 ffff880075507e98 ffffffff81030168
  ffff880075507ed8 ffffffff81d1104f 00000000000000c3 0000000000000000
 Call Trace:
  [<ffffffff810aecb0>] clockevents_config_and_register+0x20/0x30
  [<ffffffff81030168>] setup_APIC_timer+0xc8/0xd0
  [<ffffffff81d1104f>] setup_boot_APIC_clock+0x4cc/0x4d8
  [<ffffffff81d0f5de>] native_smp_prepare_cpus+0x3dd/0x3f0
  [<ffffffff81d02ee9>] kernel_init_freeable+0xc3/0x205
  [<ffffffff8177c910>] ? rest_init+0x90/0x90
  [<ffffffff8177c91e>] kernel_init+0xe/0x120
  [<ffffffff8178deec>] ret_from_fork+0x7c/0xb0
  [<ffffffff8177c910>] ? rest_init+0x90/0x90

Prevent this from happening by:
 1) Modifying try_msr_calibrate_tsc() to return the calibration value or
    zero if it fails.
 2) Checking this return value in native_calibrate_tsc() and, in case of
    zero, falling back to normal non-MSR based calibration.
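
[ Note: the caller side then reads roughly:

    /* In native_calibrate_tsc(): try the MSR method first ... */
    local_irq_save(flags);
    fast_calibrate = try_msr_calibrate_tsc();
    local_irq_restore(flags);
    if (fast_calibrate)
            return fast_calibrate;
    /* ... and fall through to PIT/HPET/PMTIMER calibration on zero. */
]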

[mw: Added subject and changelog]

Reported-and-tested-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Bin Gao <bin.gao@linux.intel.com>
Cc: One Thousand Gnomes <gnomes@lxorguk.ukuu.org.uk>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Link: http://lkml.kernel.org/r/1392810750-18660-1-git-send-email-mika.westerberg@linux.intel.com
Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-02-19 17:12:24 +01:00
Linus Torvalds 7fc9280462 Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 EFI fixes from Peter Anvin:
 "A few more EFI-related fixes"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/efi: Check status field to validate BGRT header
  x86/efi: Fix 32-bit fallout
2014-02-15 15:02:28 -08:00
Borislav Petkov c55d016f7a x86/efi: Fix 32-bit fallout
We do not enable the new efi memmap on 32-bit and thus we need to run
runtime_code_page_mkexec() unconditionally there. Fix that.

Reported-and-tested-by: Lejun Zhu <lejun.zhu@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
2014-02-14 09:30:19 +00:00
Mel Gorman a9c8e4beee xen: properly account for _PAGE_NUMA during xen pte translations
Steven Noonan forwarded a users report where they had a problem starting
vsftpd on a Xen paravirtualized guest, with this in dmesg:

  BUG: Bad page map in process vsftpd  pte:8000000493b88165 pmd:e9cc01067
  page:ffffea00124ee200 count:0 mapcount:-1 mapping:     (null) index:0x0
  page flags: 0x2ffc0000000014(referenced|dirty)
  addr:00007f97eea74000 vm_flags:00100071 anon_vma:ffff880e98f80380 mapping:          (null) index:7f97eea74
  CPU: 4 PID: 587 Comm: vsftpd Not tainted 3.12.7-1-ec2 #1
  Call Trace:
    dump_stack+0x45/0x56
    print_bad_pte+0x22e/0x250
    unmap_single_vma+0x583/0x890
    unmap_vmas+0x65/0x90
    exit_mmap+0xc5/0x170
    mmput+0x65/0x100
    do_exit+0x393/0x9e0
    do_group_exit+0xcc/0x140
    SyS_exit_group+0x14/0x20
    system_call_fastpath+0x1a/0x1f
  Disabling lock debugging due to kernel taint
  BUG: Bad rss-counter state mm:ffff880e9ca60580 idx:0 val:-1
  BUG: Bad rss-counter state mm:ffff880e9ca60580 idx:1 val:1

The issue could not be reproduced under an HVM instance with the same
kernel, so it appears to be exclusive to paravirtual Xen guests.  He
bisected the problem to commit 1667918b64 ("mm: numa: clear numa
hinting information on mprotect") that was also included in 3.12-stable.

The problem was related to how xen translates ptes because it was not
accounting for the _PAGE_NUMA bit.  This patch splits pte_present to add
a pteval_present helper for use by xen so both bare metal and xen use
the same code when checking if a PTE is present.
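
[ Note: the helper makes the presence test explicit about the NUMA
  hinting bit; approximately:

    static inline pteval_t pteval_present(pteval_t pteval)
    {
            /* PROTNONE/NUMA hinting ptes count as present here. */
            return pteval & (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_NUMA);
    }

  so bare metal and Xen share one definition of "present". ]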

[mgorman@suse.de: wrote changelog, proposed minor modifications]
[akpm@linux-foundation.org: fix typo in comment]
Reported-by: Steven Noonan <steven@uplinklabs.net>
Tested-by: Steven Noonan <steven@uplinklabs.net>
Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: <stable@vger.kernel.org>	[3.12+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-02-10 16:01:41 -08:00
Linus Torvalds c1ff84317f Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Peter Anvin:
 "Quite a varied little collection of fixes.  Most of them are
  relatively small or isolated; the biggest one is Mel Gorman's fixes
  for TLB range flushing.

  A couple of AMD-related fixes (including not crashing when given an
  invalid microcode image) and fix a crash when compiled with gcov"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86, microcode, AMD: Unify valid container checks
  x86, hweight: Fix BUG when booting with CONFIG_GCOV_PROFILE_ALL=y
  x86/efi: Allow mapping BGRT on x86-32
  x86: Fix the initialization of physnode_map
  x86, cpu hotplug: Fix stack frame warning in check_irq_vectors_for_cpu_disable()
  x86/intel/mid: Fix X86_INTEL_MID dependencies
  arch/x86/mm/srat: Skip NUMA_NO_NODE while parsing SLIT
  mm, x86: Revisit tlb_flushall_shift tuning for page flushes except on IvyBridge
  x86: mm: change tlb_flushall_shift for IvyBridge
  x86/mm: Eliminate redundant page table walk during TLB range flushing
  x86/mm: Clean up inconsistencies when flushing TLB ranges
  mm, x86: Account for TLB flushes only when debugging
  x86/AMD/NB: Fix amd_set_subcaches() parameter type
  x86/quirks: Add workaround for AMD F16h Erratum792
  x86, doc, kconfig: Fix dud URL for Microcode data
2014-02-08 11:54:43 -08:00
H. Peter Anvin a3b072cd18 * Avoid WARN_ON() when mapping BGRT on Baytrail (EFI 32-bit).

Merge tag 'efi-urgent' into x86/urgent

 * Avoid WARN_ON() when mapping BGRT on Baytrail (EFI 32-bit).

Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-02-07 11:27:30 -08:00
Linus Torvalds 1cd731df09 Bug-fixes:
- Revert "xen/grant-table: Avoid m2p_override during mapping" as it broke Xen ARM build.
  - Fix CR4 not being set on AP processors in Xen PVH mode.

Merge tag 'stable/for-linus-3.14-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip

Pull Xen fixes from Konrad Rzeszutek Wilk:
 "Bug-fixes:
   - Revert "xen/grant-table: Avoid m2p_override during mapping" as it
     broke Xen ARM build.
   - Fix CR4 not being set on AP processors in Xen PVH mode"

* tag 'stable/for-linus-3.14-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  xen/pvh: set CR4 flags for APs
  Revert "xen/grant-table: Avoid m2p_override during mapping"
2014-02-05 16:01:11 -08:00