Commit Graph

217 Commits

Author SHA1 Message Date
Linus Torvalds 4cb38750d4 Merge branch 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86/mm changes from Peter Anvin:
 "The big change here is the patchset by Alex Shi to use INVLPG to flush
  only the affected pages when we only need to flush a small page range.

  It also removes the special INVALIDATE_TLB_VECTOR interrupts (32
  vectors!) and replace it with an ordinary IPI function call."

Fix up trivial conflicts in arch/x86/include/asm/apic.h (added code next
to changed line)

* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/tlb: Fix build warning and crash when building for !SMP
  x86/tlb: do flush_tlb_kernel_range by 'invlpg'
  x86/tlb: replace INVALIDATE_TLB_VECTOR by CALL_FUNCTION_VECTOR
  x86/tlb: enable tlb flush range support for x86
  mm/mmu_gather: enable tlb flush range in generic mmu_gather
  x86/tlb: add tlb_flushall_shift knob into debugfs
  x86/tlb: add tlb_flushall_shift for specific CPU
  x86/tlb: fall back to flush all when meet a THP large page
  x86/flush_tlb: try flush_tlb_single one by one in flush_tlb_range
  x86/tlb_info: get last level TLB entry number of CPU
  x86: Add read_mostly declaration/definition to variables from smp.h
  x86: Define early read-mostly per-cpu macros
2012-07-26 13:17:17 -07:00
David Vrabel 66a27dde9a xen/mm: zero PTEs for non-present MFNs in the initial page table
When constructing the initial page tables, if the MFN for a usable PFN
is missing in the p2m then that frame is initially ballooned out.  In
this case, zero the PTE (as in decrease_reservation() in
drivers/xen/balloon.c).

This is obviously safe instead of having an valid PTE with an MFN of
INVALID_P2M_ENTRY (~0).

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-07-19 15:51:44 -04:00
David Vrabel d095d43e78 xen/mm: do direct hypercall in xen_set_pte() if batching is unavailable
In xen_set_pte() if batching is unavailable (because the caller is in
an interrupt context such as handling a page fault) it would fall back
to using native_set_pte() and trapping and emulating the PTE write.

On 32-bit guests this requires two traps for each PTE write (one for
each dword of the PTE).  Instead, do one mmu_update hypercall
directly.

During construction of the initial page tables, continue to use
native_set_pte() because most of the PTEs being set are in writable
and unpinned pages (see phys_pmd_init() in arch/x86/mm/init_64.c) and
using a hypercall for this is very expensive.

This significantly improves page fault performance in 32-bit PV
guests.

lmbench3 test  Before    After     Improvement
----------------------------------------------
lat_pagefault  3.18 us   2.32 us   27%
lat_proc fork  356 us    313.3 us  11%

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-07-19 15:51:43 -04:00
Alex Shi e7b52ffd45 x86/flush_tlb: try flush_tlb_single one by one in flush_tlb_range
x86 has no flush_tlb_range support in instruction level. Currently the
flush_tlb_range just implemented by flushing all page table. That is not
the best solution for all scenarios. In fact, if we just use 'invlpg' to
flush few lines from TLB, we can get the performance gain from later
remain TLB lines accessing.

But the 'invlpg' instruction costs much of time. Its execution time can
compete with cr3 rewriting, and even a bit more on SNB CPU.

So, on a 512 4KB TLB entries CPU, the balance points is at:
	(512 - X) * 100ns(assumed TLB refill cost) =
		X(TLB flush entries) * 100ns(assumed invlpg cost)

Here, X is 256, that is 1/2 of 512 entries.

But with the mysterious CPU pre-fetcher and page miss handler Unit, the
assumed TLB refill cost is far lower then 100ns in sequential access. And
2 HT siblings in one core makes the memory access more faster if they are
accessing the same memory. So, in the patch, I just do the change when
the target entries is less than 1/16 of whole active tlb entries.
Actually, I have no data support for the percentage '1/16', so any
suggestions are welcomed.

As to hugetlb, guess due to smaller page table, and smaller active TLB
entries, I didn't see benefit via my benchmark, so no optimizing now.

My micro benchmark show in ideal scenarios, the performance improves 70
percent in reading. And in worst scenario, the reading/writing
performance is similar with unpatched 3.4-rc4 kernel.

Here is the reading data on my 2P * 4cores *HT NHM EP machine, with THP
'always':

multi thread testing, '-t' paramter is thread number:
	       	        with patch   unpatched 3.4-rc4
./mprotect -t 1           14ns		24ns
./mprotect -t 2           13ns		22ns
./mprotect -t 4           12ns		19ns
./mprotect -t 8           14ns		16ns
./mprotect -t 16          28ns		26ns
./mprotect -t 32          54ns		51ns
./mprotect -t 128         200ns		199ns

Single process with sequencial flushing and memory accessing:

		       	with patch   unpatched 3.4-rc4
./mprotect		    7ns			11ns
./mprotect -p 4096  -l 8 -n 10240
			    21ns		21ns

[ hpa: http://lkml.kernel.org/r/1B4B44D9196EFF41AE41FDA404FC0A100BFF94@SHSMSX101.ccr.corp.intel.com
  has additional performance numbers. ]

Signed-off-by: Alex Shi <alex.shi@intel.com>
Link: http://lkml.kernel.org/r/1340845344-27557-3-git-send-email-alex.shi@intel.com
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2012-06-27 19:29:07 -07:00
Linus Torvalds b5f4035adf Features:
* Extend the APIC ops implementation and add IRQ_WORKER vector support so that 'perf' can work properly.
  * Fix self-ballooning code, and balloon logic when booting as initial domain.
  * Move array printing code to generic debugfs
  * Support XenBus domains.
  * Lazily free grants when a domain is dead/non-existent.
  * In M2P code use batching calls
 Bug-fixes:
  * Fix NULL dereference in allocation failure path (hvc_xen)
  * Fix unbinding of IRQ_WORKER vector during vCPU hot-unplug
  * Fix HVM guest resume - we would leak an PIRQ value instead of reusing the existing one.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQEcBAABAgAGBQJPu9MpAAoJEFjIrFwIi8fJaNQH/RylThiO+O+LBpPrO8VRUw+2
 /Io98T7ZK2ggoUeaJx0C8irM0JMFAkxGMcfX3w9fwNt/BTec4s++4JhbN1jYN0da
 6a0PqINo+M8y73So6CBfuJDCunaRLGKVG/ibIO3Y3WAff51/H+DMvO7uYYDAE0aA
 mikyOxnaty0DiG5i4JGDHGmCzDASfK/jgGccZ03m6522mDx5ZIbTzZWONLfz8dqT
 rbxnn9vrNLgEYWuzyLMwW0GymToUtt01xBQvwJLAbhn8lr1WBRBLpxXA+5iYNQrn
 Ri25G7keYJhG4uwZfaHnR+4HTrmhlGzK1Z96dkqpGUaeIcdyWmPMp22VtBBiwG8=
 =uyRr
 -----END PGP SIGNATURE-----

Merge tag 'stable/for-linus-3.5-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen

Pull Xen updates from Konrad Rzeszutek Wilk:
 "Features:
   * Extend the APIC ops implementation and add IRQ_WORKER vector
     support so that 'perf' can work properly.
   * Fix self-ballooning code, and balloon logic when booting as initial
     domain.
   * Move array printing code to generic debugfs
   * Support XenBus domains.
   * Lazily free grants when a domain is dead/non-existent.
   * In M2P code use batching calls
  Bug-fixes:
   * Fix NULL dereference in allocation failure path (hvc_xen)
   * Fix unbinding of IRQ_WORKER vector during vCPU hot-unplug
   * Fix HVM guest resume - we would leak an PIRQ value instead of
     reusing the existing one."

Fix up add-add onflicts in arch/x86/xen/enlighten.c due to addition of
apic ipi interface next to the new apic_id functions.

* tag 'stable/for-linus-3.5-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
  xen: do not map the same GSI twice in PVHVM guests.
  hvc_xen: NULL dereference on allocation failure
  xen: Add selfballoning memory reservation tunable.
  xenbus: Add support for xenbus backend in stub domain
  xen/smp: unbind irqworkX when unplugging vCPUs.
  xen: enter/exit lazy_mmu_mode around m2p_override calls
  xen/acpi/sleep: Enable ACPI sleep via the __acpi_os_prepare_sleep
  xen: implement IRQ_WORK_VECTOR handler
  xen: implement apic ipi interface
  xen/setup: update VA mapping when releasing memory during setup
  xen/setup: Combine the two hypercall functions - since they are quite similar.
  xen/setup: Populate freed MFNs from non-RAM E820 entries and gaps to E820 RAM
  xen/setup: Only print "Freeing XXX-YYY pfn range: Z pages freed" if Z > 0
  xen/gnttab: add deferred freeing logic
  debugfs: Add support to print u32 array in debugfs
  xen/p2m: An early bootup variant of set_phys_to_machine
  xen/p2m: Collapse early_alloc_p2m_middle redundant checks.
  xen/p2m: Allow alloc_p2m_middle to call reserve_brk depending on argument
  xen/p2m: Move code around to allow for better re-usage.
2012-05-24 16:02:08 -07:00
Linus Torvalds f08b9c2f8a Merge branch 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86/apic changes from Ingo Molnar:
 "Most of the changes are about helping virtualized guest kernels
  achieve better performance."

Fix up trivial conflicts with the iommu updates to arch/x86/kernel/apic/io_apic.c

* 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/apic: Implement EIO micro-optimization
  x86/apic: Add apic->eoi_write() callback
  x86/apic: Use symbolic APIC_EOI_ACK
  x86/apic: Fix typo EIO_ACK -> EOI_ACK and document it
  x86/xen/apic: Add missing #include <xen/xen.h>
  x86/apic: Only compile local function if used with !CONFIG_GENERIC_PENDING_IRQ
  x86/apic: Fix UP boot crash
  x86: Conditionally update time when ack-ing pending irqs
  xen/apic: implement io apic read with hypercall
  Revert "xen/x86: Workaround 'x86/ioapic: Add register level checks to detect bogus io-apic entries'"
  xen/x86: Implement x86_apic_ops
  x86/apic: Replace io_apic_ops with x86_io_apic_ops.
2012-05-22 18:38:11 -07:00
Konrad Rzeszutek Wilk 4b3451ad13 Merge branch 'stable/autoballoon.v5.2' into stable/for-linus-3.5
* stable/autoballoon.v5.2:
  xen/setup: update VA mapping when releasing memory during setup
  xen/setup: Combine the two hypercall functions - since they are quite similar.
  xen/setup: Populate freed MFNs from non-RAM E820 entries and gaps to E820 RAM
  xen/setup: Only print "Freeing XXX-YYY pfn range: Z pages freed" if Z > 0
  xen/p2m: An early bootup variant of set_phys_to_machine
  xen/p2m: Collapse early_alloc_p2m_middle redundant checks.
  xen/p2m: Allow alloc_p2m_middle to call reserve_brk depending on argument
  xen/p2m: Move code around to allow for better re-usage.
2012-05-07 15:33:27 -04:00
David Vrabel 83d51ab473 xen/setup: update VA mapping when releasing memory during setup
In xen_memory_setup(), if a page that is being released has a VA
mapping this must also be updated.  Otherwise, the page will be not
released completely -- it will still be referenced in Xen and won't be
freed util the mapping is removed and this prevents it from being
reallocated at a different PFN.

This was already being done for the ISA memory region in
xen_ident_map_ISA() but on many systems this was omitting a few pages
as many systems marked a few pages below the ISA memory region as
reserved in the e820 map.

This fixes errors such as:

(XEN) page_alloc.c:1148:d0 Over-allocation for domain 0: 2097153 > 2097152
(XEN) memory.c:133:d0 Could not allocate order=0 extent: id=0 memflags=0 (0 of 17)

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-05-07 15:32:24 -04:00
Konrad Rzeszutek Wilk b7e5ffe5d8 xen/pte: Fix crashes when trying to see non-existent PGD/PMD/PUD/PTEs
If I try to do "cat /sys/kernel/debug/kernel_page_tables"
I end up with:

BUG: unable to handle kernel paging request at ffffc7fffffff000
IP: [<ffffffff8106aa51>] ptdump_show+0x221/0x480
PGD 0
Oops: 0000 [#1] SMP
CPU 0
.. snip..
RAX: 0000000000000000 RBX: ffffc00000000fff RCX: 0000000000000000
RDX: 0000800000000000 RSI: 0000000000000000 RDI: ffffc7fffffff000

which is due to the fact we are trying to access a PFN that is not
accessible to us. The reason (at least in this case) was that
PGD[256] is set to __HYPERVISOR_VIRT_START which was setup (by the
hypervisor) to point to a read-only linear map of the MFN->PFN array.
During our parsing we would get the MFN (a valid one), try to look
it up in the MFN->PFN tree and find it invalid and return ~0 as PFN.
Then pte_mfn_to_pfn would happilly feed that in, attach the flags
and return it back to the caller. 'ptdump_show' bitshifts it and
gets and invalid value that it tries to dereference.

Instead of doing all of that, we detect the ~0 case and just
return !_PAGE_PRESENT.

This bug has been in existence .. at least until 2.6.37 (yikes!)

CC: stable@kernel.org
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-05-07 10:21:13 -04:00
Konrad Rzeszutek Wilk 27abd14bd9 Revert "xen/x86: Workaround 'x86/ioapic: Add register level checks to detect bogus io-apic entries'"
This reverts commit 2531d64b6f.

The two patches:
      x86/apic: Replace io_apic_ops with x86_io_apic_ops.
      xen/x86: Implement x86_apic_ops

take care of fixing it properly.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-05-01 14:50:55 -04:00
Linus Torvalds 9479f0f801 Two fixes for regressions:
* one is a workaround that will be removed in v3.5 with proper fix in the tip/x86 tree,
  * the other is to fix drivers to load on PV (a previous patch made them only
    load in PVonHVM mode).
 
 The rest are just minor fixes in the various drivers and some cleanup in the
 core code.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQEcBAABAgAGBQJPfyVUAAoJEFjIrFwIi8fJUjUH/jbY5JavRqSlNELZW2A4Ta76
 8p00LqLHw/C56iHZcWKke8mqtWNb+ZfcQt7ZYcxDIYa4QWBL28x0OLAO2tOBIt37
 ZjYESWSdFJaJvmpADluWtFyGyZ9TYJllDTBm/jWj1ZtKSZvR1YkhuMXCS0f4AmGQ
 xFzSWJZUDdiOAqpN+VQD8wP00gfR8knQLg16XE2fvFdQo4XwpCtqLfHV/5pMMGdy
 Cs/ep6rq/7cdv/nshKOcBnw7RW8l3Xoi/28ht8k3DvAQ2VtFq1Tugv2G9pcCHwQG
 DIBkB3SOU6/v6P5at5+egKS5xR1fJetCWlkMd8kkbcdz2NPI4UDMkvOW6Q8yQls=
 =6Ve+
 -----END PGP SIGNATURE-----

Merge tag 'stable/for-linus-3.4-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen

Pull xen fixes from Konrad Rzeszutek Wilk:
 "Two fixes for regressions:
   * one is a workaround that will be removed in v3.5 with proper fix in
     the tip/x86 tree,
   * the other is to fix drivers to load on PV (a previous patch made
     them only load in PVonHVM mode).

  The rest are just minor fixes in the various drivers and some cleanup
  in the core code."

* tag 'stable/for-linus-3.4-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
  xen/pcifront: avoid pci_frontend_enable_msix() falsely returning success
  xen/pciback: fix XEN_PCI_OP_enable_msix result
  xen/smp: Remove unnecessary call to smp_processor_id()
  xen/x86: Workaround 'x86/ioapic: Add register level checks to detect bogus io-apic entries'
  xen: only check xen_platform_pci_unplug if hvm
2012-04-06 17:54:53 -07:00
Konrad Rzeszutek Wilk 2531d64b6f xen/x86: Workaround 'x86/ioapic: Add register level checks to detect bogus io-apic entries'
The above mentioned patch checks the IOAPIC and if it contains
-1, then it unmaps said IOAPIC. But under Xen we get this:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000040
IP: [<ffffffff8134e51f>] xen_irq_init+0x1f/0xb0
PGD 0
Oops: 0002 [#1] SMP
CPU 0
Modules linked in:

Pid: 1, comm: swapper/0 Not tainted 3.2.10-3.fc16.x86_64 #1 Dell Inc. Inspiron
1525                  /0U990C
RIP: e030:[<ffffffff8134e51f>]  [<ffffffff8134e51f>] xen_irq_init+0x1f/0xb0
RSP: e02b: ffff8800d42cbb70  EFLAGS: 00010202
RAX: 0000000000000000 RBX: 00000000ffffffef RCX: 0000000000000001
RDX: 0000000000000040 RSI: 00000000ffffffef RDI: 0000000000000001
RBP: ffff8800d42cbb80 R08: ffff8800d6400000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 00000000ffffffef
R13: 0000000000000001 R14: 0000000000000001 R15: 0000000000000010
FS:  0000000000000000(0000) GS:ffff8800df5fe000(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000 CR0:000000008005003b
CR2: 0000000000000040 CR3: 0000000001a05000 CR4: 0000000000002660
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process swapper/0 (pid: 1, threadinfo ffff8800d42ca000, task ffff8800d42d0000)
Stack:
 00000000ffffffef 0000000000000010 ffff8800d42cbbe0 ffffffff8134f157
 ffffffff8100a9b2 ffffffff8182ffd1 00000000000000a0 00000000829e7384
 0000000000000002 0000000000000010 00000000ffffffff 0000000000000000
Call Trace:
 [<ffffffff8134f157>] xen_bind_pirq_gsi_to_irq+0x87/0x230
 [<ffffffff8100a9b2>] ? check_events+0x12+0x20
 [<ffffffff814bab42>] xen_register_pirq+0x82/0xe0
 [<ffffffff814bac1a>] xen_register_gsi.part.2+0x4a/0xd0
 [<ffffffff814bacc0>] acpi_register_gsi_xen+0x20/0x30
 [<ffffffff8103036f>] acpi_register_gsi+0xf/0x20
 [<ffffffff8131abdb>] acpi_pci_irq_enable+0x12e/0x202
 [<ffffffff814bc849>] pcibios_enable_device+0x39/0x40
 [<ffffffff812dc7ab>] do_pci_enable_device+0x4b/0x70
 [<ffffffff812dc878>] __pci_enable_device_flags+0xa8/0xf0
 [<ffffffff812dc8d3>] pci_enable_device+0x13/0x20

The reason we are dying is b/c the call acpi_get_override_irq() is used,
which returns the polarity and trigger for the IRQs. That function calls
mp_find_ioapics to get the 'struct ioapic' structure - which along with the
mp_irq[x] is used to figure out the default values and the polarity/trigger
overrides. Since the mp_find_ioapics now returns -1 [b/c the IOAPIC is filled
with 0xffffffff], the acpi_get_override_irq() stops trying to lookup in the
mp_irq[x] the proper INT_SRV_OVR and we can't install the SCI interrupt.

The proper fix for this is going in v3.5 and adds an x86_io_apic_ops
struct so that platforms can override it. But for v3.4 lets carry this
work-around. This patch does that by providing a slightly different variant
of the fake IOAPIC entries.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-04-06 12:13:06 -04:00
Linus Torvalds d4c6fa73fe Features:
- PV multiconsole support, so that there can be hvc1, hvc2, etc;
  - P-state and C-state power management driver that uploads said
    power management data to the hypervisor. It also inhibits cpufreq
    scaling drivers to load so that only the hypervisor can make power
    management decisions - fixing a weird perf bug.
  - Function Level Reset (FLR) support in the Xen PCI backend.
 Fixes:
  - Kconfig dependencies for Xen PV keyboard and video
  - Compile warnings and constify fixes
  - Change over to use percpu_xxx instead of this_cpu_xxx
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQEcBAABAgAGBQJPZ0qkAAoJEFjIrFwIi8fJjCgH/jeJ39E8ML8DP9tCS2HQnMqM
 uTEjLcqvoJ7sEhHvtBLPeG2p0jyBvOWjLbSc7P8nESBAMPvSYol8L6WqfWrdSU4r
 lHrma2sg9UYzRog5NyxAgkp7bBsBBFOnhVL3Cxb5Ig78cPWzeSWGpqGZ8M/d51Wf
 1iE0tHuU4DpN+fg1SZqPqEm8ecEJ/eSrVTnyTx/Qo2Ak+Zw98SqzX7SV5lo8mudd
 WFL1F2K9FyTNk79ndGhqFt36x6nEbFgMLbmCDWumLuWN6bMd1Uq0wNkCqW4F1h28
 3yqnY+rfQh4y3eXK1B9nttCUTs+/66U5ZWrT6B1IJumGTAIqcWfgeUX/Vn/HVC4=
 =tfMc
 -----END PGP SIGNATURE-----

Merge tag 'stable/for-linus-3.4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen

Pull xen updates from Konrad Rzeszutek Wilk:
 "which has three neat features:

   - PV multiconsole support, so that there can be hvc1, hvc2, etc; This
     can be used in HVM and in PV mode.

   - P-state and C-state power management driver that uploads said power
     management data to the hypervisor.  It also inhibits cpufreq
     scaling drivers to load so that only the hypervisor can make power
     management decisions - fixing a weird perf bug.

     There is one thing in the Kconfig that you won't like: "default y
     if (X86_ACPI_CPUFREQ = y || X86_POWERNOW_K8 = y)" (note, that it
     all depends on CONFIG_XEN which depends on CONFIG_PARAVIRT which by
     default is off).  I've a fix to convert that boolean expression
     into "default m" which I am going to post after the cpufreq git
     pull - as the two patches to make this work depend on a fix in Dave
     Jones's tree.

   - Function Level Reset (FLR) support in the Xen PCI backend.

  Fixes:

   - Kconfig dependencies for Xen PV keyboard and video
   - Compile warnings and constify fixes
   - Change over to use percpu_xxx instead of this_cpu_xxx"

Fix up trivial conflicts in drivers/tty/hvc/hvc_xen.c due to changes to
a removed commit.

* tag 'stable/for-linus-3.4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
  xen kconfig: relax INPUT_XEN_KBDDEV_FRONTEND deps
  xen/acpi-processor: C and P-state driver that uploads said data to hypervisor.
  xen: constify all instances of "struct attribute_group"
  xen/xenbus: ignore console/0
  hvc_xen: introduce HVC_XEN_FRONTEND
  hvc_xen: implement multiconsole support
  hvc_xen: support PV on HVM consoles
  xenbus: don't free other end details too early
  xen/enlighten: Expose MWAIT and MWAIT_LEAF if hypervisor OKs it.
  xen/setup/pm/acpi: Remove the call to boot_option_idle_override.
  xenbus: address compiler warnings
  xen: use this_cpu_xxx replace percpu_xxx funcs
  xen/pciback: Support pci_reset_function, aka FLR or D3 support.
  pci: Introduce __pci_reset_function_locked to be used when holding device_lock.
  xen: Utilize the restore_msi_irqs hook.
2012-03-22 20:16:14 -07:00
Konrad Rzeszutek Wilk 8eaffa67b4 xen/pat: Disable PAT support for now.
[Pls also look at https://lkml.org/lkml/2012/2/10/228]

Using of PAT to change pages from WB to WC works quite nicely.
Changing it back to WB - not so much. The crux of the matter is
that the code that does this (__page_change_att_set_clr) has only
limited information so when it tries to the change it gets
the "raw" unfiltered information instead of the properly filtered one -
and the "raw" one tell it that PSE bit is on (while infact it
is not).  As a result when the PTE is set to be WB from WC, we get
tons of:

:WARNING: at arch/x86/xen/mmu.c:475 xen_make_pte+0x67/0xa0()
:Hardware name: HP xw4400 Workstation
.. snip..
:Pid: 27, comm: kswapd0 Tainted: G        W    3.2.2-1.fc16.x86_64 #1
:Call Trace:
: [<ffffffff8106dd1f>] warn_slowpath_common+0x7f/0xc0
: [<ffffffff8106dd7a>] warn_slowpath_null+0x1a/0x20
: [<ffffffff81005a17>] xen_make_pte+0x67/0xa0
: [<ffffffff810051bd>] __raw_callee_save_xen_make_pte+0x11/0x1e
: [<ffffffff81040e15>] ? __change_page_attr_set_clr+0x9d5/0xc00
: [<ffffffff8114c2e8>] ? __purge_vmap_area_lazy+0x158/0x1d0
: [<ffffffff8114cca5>] ? vm_unmap_aliases+0x175/0x190
: [<ffffffff81041168>] change_page_attr_set_clr+0x128/0x4c0
: [<ffffffff81041542>] set_pages_array_wb+0x42/0xa0
: [<ffffffff8100a9b2>] ? check_events+0x12/0x20
: [<ffffffffa0074d4c>] ttm_pages_put+0x1c/0x70 [ttm]
: [<ffffffffa0074e98>] ttm_page_pool_free+0xf8/0x180 [ttm]
: [<ffffffffa0074f78>] ttm_pool_mm_shrink+0x58/0x90 [ttm]
: [<ffffffff8112ba04>] shrink_slab+0x154/0x310
: [<ffffffff8112f17a>] balance_pgdat+0x4fa/0x6c0
: [<ffffffff8112f4b8>] kswapd+0x178/0x3d0
: [<ffffffff815df134>] ? __schedule+0x3d4/0x8c0
: [<ffffffff81090410>] ? remove_wait_queue+0x50/0x50
: [<ffffffff8112f340>] ? balance_pgdat+0x6c0/0x6c0
: [<ffffffff8108fb6c>] kthread+0x8c/0xa0

for every page. The proper fix for this is has been posted
and is https://lkml.org/lkml/2012/2/10/228
"x86/cpa: Use pte_attrs instead of pte_flags on CPA/set_p.._wb/wc operations."
along with a detailed description of the problem and solution.

But since that posting has gone nowhere I am proposing
this band-aid solution so that at least users don't get
the page corruption (the pages that are WC don't get changed to WB
and end up being recycled for filesystem or other things causing
mysterious crashes).

The negative impact of this patch is that users of WC flag
(which are InfiniBand, radeon, nouveau drivers) won't be able
to set that flag - so they are going to see performance degradation.
But stability is more important here.

Fixes RH BZ# 742032, 787403, and 745574
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-02-20 10:41:35 -05:00
Alex Shi 2113f46916 xen: use this_cpu_xxx replace percpu_xxx funcs
percpu_xxx funcs are duplicated with this_cpu_xxx funcs, so replace them
for further code clean up.

I don't know much of xen code. But, since the code is in x86 architecture,
the percpu_xxx is exactly same as this_cpu_xxx serials functions. So, the
change is safe.

Signed-off-by: Alex Shi <alex.shi@intel.com>
Acked-by: Christoph Lameter <cl@gentwo.org>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-01-24 12:20:24 -05:00
Konrad Rzeszutek Wilk dc6821e0cf xen/mmu: Fix compile errors introduced by x86/memblock mismerge.
The git commit d4bbf7e775
"Merge branch 'master' into x86/memblock" mismerged the 32-bit
section causing:

arch/x86/xen/mmu.c: In function ‘xen_setup_kernel_pagetable’:
arch/x86/xen/mmu.c:1855: error: expected ‘;’ before ‘)’ token
arch/x86/xen/mmu.c:1855: error: expected statement before ‘)’ token

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-01-09 12:05:05 -05:00
Tejun Heo d4bbf7e775 Merge branch 'master' into x86/memblock
Conflicts & resolutions:

* arch/x86/xen/setup.c

	dc91c728fd "xen: allow extra memory to be in multiple regions"
	24aa07882b "memblock, x86: Replace memblock_x86_reserve/free..."

	conflicted on xen_add_extra_mem() updates.  The resolution is
	trivial as the latter just want to replace
	memblock_x86_reserve_range() with memblock_reserve().

* drivers/pci/intel-iommu.c

	166e9278a3 "x86/ia64: intel-iommu: move to drivers/iommu/"
	5dfe8660a3 "bootmem: Replace work_with_active_regions() with..."

	conflicted as the former moved the file under drivers/iommu/.
	Resolved by applying the chnages from the latter on the moved
	file.

* mm/Kconfig

	6661672053 "memblock: add NO_BOOTMEM config symbol"
	c378ddd53f "memblock, x86: Make ARCH_DISCARD_MEMBLOCK a config option"

	conflicted trivially.  Both added config options.  Just
	letting both add their own options resolves the conflict.

* mm/memblock.c

	d1f0ece6cd "mm/memblock.c: small function definition fixes"
	ed7b56a799 "memblock: Remove memblock_memory_can_coalesce()"

	confliected.  The former updates function removed by the
	latter.  Resolution is trivial.

Signed-off-by: Tejun Heo <tj@kernel.org>
2011-11-28 09:46:22 -08:00
Linus Torvalds 31018acd4c Merge branches 'stable/bug.fixes-3.2' and 'stable/mmu.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen
* 'stable/bug.fixes-3.2' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
  xen/p2m/debugfs: Make type_name more obvious.
  xen/p2m/debugfs: Fix potential pointer exception.
  xen/enlighten: Fix compile warnings and set cx to known value.
  xen/xenbus: Remove the unnecessary check.
  xen/irq: If we fail during msi_capability_init return proper error code.
  xen/events: Don't check the info for NULL as it is already done.
  xen/events: BUG() when we can't allocate our event->irq array.

* 'stable/mmu.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
  xen: Fix selfballooning and ensure it doesn't go too far
  xen/gntdev: Fix sleep-inside-spinlock
  xen: modify kernel mappings corresponding to granted pages
  xen: add an "highmem" parameter to alloc_xenballooned_pages
  xen/p2m: Use SetPagePrivate and its friends for M2P overrides.
  xen/p2m: Make debug/xen/mmu/p2m visible again.
  Revert "xen/debug: WARN_ON when identity PFN has no _PAGE_IOMAP flag set."
2011-10-25 09:17:47 +02:00
Konrad Rzeszutek Wilk a867db10e8 xen/p2m: Make debug/xen/mmu/p2m visible again.
We dropped a lot of the MMU debugfs in favour of using
tracing API - but there is one which just provides
mostly static information that was made invisible by this change.

Bring it back.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2011-09-23 22:22:32 -04:00
Linus Torvalds abbe0d3c26 Merge branch 'stable/bug.fixes' of git://oss.oracle.com/git/kwilk/xen
* 'stable/bug.fixes' of git://oss.oracle.com/git/kwilk/xen:
  xen/i386: follow-up to "replace order-based range checking of M2P table by linear one"
  xen/irq: Alter the locking to use a mutex instead of a spinlock.
  xen/e820: if there is no dom0_mem=, don't tweak extra_pages.
  xen: disable PV spinlocks on HVM
2011-09-16 11:28:11 -07:00
Jan Beulich 61cca2fab7 xen/i386: follow-up to "replace order-based range checking of M2P table by linear one"
The numbers obtained from the hypervisor really can't ever lead to an
overflow here, only the original calculation going through the order
of the range could have. This avoids the (as Jeremy points outs)
somewhat ugly NULL-based calculation here.

Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2011-09-15 04:39:46 -04:00
Linus Torvalds 4762e252f4 Merge branch 'stable/bug.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen
* 'stable/bug.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
  xen/tracing: Fix tracing config option properly
  xen: Do not enable PV IPIs when vector callback not present
  xen/x86: replace order-based range checking of M2P table by linear one
  xen: xen-selfballoon.c needs more header files
2011-08-22 11:25:44 -07:00
Jan Beulich ccbcdf7cf1 xen/x86: replace order-based range checking of M2P table by linear one
The order-based approach is not only less efficient (requiring a shift
and a compare, typical generated code looking like this

	mov	eax, [machine_to_phys_order]
	mov	ecx, eax
	shr	ebx, cl
	test	ebx, ebx
	jnz	...

whereas a direct check requires just a compare, like in

	cmp	ebx, [machine_to_phys_nr]
	jae	...

), but also slightly dangerous in the 32-on-64 case - the element
address calculation can wrap if the next power of two boundary is
sufficiently far away from the actual upper limit of the table, and
hence can result in user space addresses being accessed (with it being
unknown what may actually be mapped there).

Additionally, the elimination of the mistaken use of fls() here (should
have been __fls()) fixes a latent issue on x86-64 that would trigger
if the code was run on a system with memory extending beyond the 44-bit
boundary.

CC: stable@kernel.org
Signed-off-by: Jan Beulich <jbeulich@novell.com>
[v1: Based on Jeremy's feedback]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2011-08-17 10:26:48 -04:00
Linus Torvalds 06e727d2a5 Merge branch 'x86-vdso-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-tip
* 'x86-vdso-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-tip:
  x86-64: Rework vsyscall emulation and add vsyscall= parameter
  x86-64: Wire up getcpu syscall
  x86: Remove unnecessary compile flag tweaks for vsyscall code
  x86-64: Add vsyscall:emulate_vsyscall trace event
  x86-64: Add user_64bit_mode paravirt op
  x86-64, xen: Enable the vvar mapping
  x86-64: Work around gold bug 13023
  x86-64: Move the "user" vsyscall segment out of the data segment.
  x86-64: Pad vDSO to a page boundary
2011-08-12 20:46:24 -07:00
Konrad Rzeszutek Wilk 10fe570fc1 Revert "xen/debug: WARN_ON when identity PFN has no _PAGE_IOMAP flag set."
We don' use it anymore and there are more false positives.

This reverts commit fc25151d9a.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2011-08-09 13:04:08 -04:00
Andy Lutomirski 5d5791af4c x86-64, xen: Enable the vvar mapping
Xen needs to handle VVAR_PAGE, introduced in git commit:
9fd67b4ed0
x86-64: Give vvars their own page

Otherwise we die during bootup with a message like:

(XEN) mm.c:940:d10 Error getting mfn 1888 (pfn 1e3e48) from L1 entry
      8000000001888465 for l1e_owner=10, pg_owner=10
(XEN) mm.c:5049:d10 ptwr_emulate: could not get_page_from_l1e()
[    0.000000] BUG: unable to handle kernel NULL pointer dereference at (null)
[    0.000000] IP: [<ffffffff8103a930>] xen_set_pte+0x20/0xe0

Signed-off-by: Andy Lutomirski <luto@mit.edu>
Link: http://lkml.kernel.org/r/4659478ed2f3480938f96491c2ecbe2b2e113a23.1312378163.git.luto@mit.edu
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2011-08-04 16:13:47 -07:00
Jeremy Fitzhardinge bc7fe1d977 xen/mmu: tune pgtable alloc/release
Make sure the fastpath code is inlined.  Batch the page permission change
and the pin/unpin, and make sure that it can be batched with any
adjacent set_pte/pmd/etc operations.

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
2011-07-18 15:43:28 -07:00
Jeremy Fitzhardinge dcf7435cfe xen/mmu: use extend_args for more mmuext updates
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
2011-07-18 15:43:27 -07:00
Jeremy Fitzhardinge c8eed1719a xen/trace: add tlb flush tracepoints
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
2011-07-18 15:43:27 -07:00
Jeremy Fitzhardinge 5f94fb5b8e xen/trace: add xen_pgd_(un)pin tracepoints
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
2011-07-18 15:43:27 -07:00
Jeremy Fitzhardinge c2ba050d2e xen/trace: add ptpage alloc/release tracepoints
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
2011-07-18 15:43:27 -07:00
Jeremy Fitzhardinge 8470880791 xen/trace: add mmu tracepoints
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
2011-07-18 15:43:27 -07:00
Tejun Heo 24aa07882b memblock, x86: Replace memblock_x86_reserve/free_range() with generic ones
Other than sanity check and debug message, the x86 specific version of
memblock reserve/free functions are simple wrappers around the generic
versions - memblock_reserve/free().

This patch adds debug messages with caller identification to the
generic versions and replaces x86 specific ones and kills them.
arch/x86/include/asm/memblock.h and arch/x86/mm/memblock.c are empty
after this change and removed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/1310462166-31469-14-git-send-email-tj@kernel.org
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2011-07-14 11:47:53 -07:00
Konrad Rzeszutek Wilk 32dd11942a xen/mmu: Fix for linker errors when CONFIG_SMP is not defined.
Simple enough - we use an extern defined symbol which is not
defined when CONFIG_SMP is not defined. This fixes the linker
dying.

CC: stable@kernel.org
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2011-06-30 09:21:10 -04:00
Andrew Jones 900cba8881 xen: support CONFIG_MAXSMP
The MAXSMP config option requires CPUMASK_OFFSTACK, which in turn
requires we init the memory for the maps while we bring up the cpus.
MAXSMP also increases NR_CPUS to 4096. This increase in size exposed an
issue in the argument construction for multicalls from
xen_flush_tlb_others. The args should only need space for the actual
number of cpus.

Also in 2.6.39 it exposes a bootup problem.

BUG: unable to handle kernel NULL pointer dereference at           (null)
IP: [<ffffffff8157a1d3>] set_cpu_sibling_map+0x123/0x30d
...
Call Trace:
[<ffffffff81039a3f>] ? xen_restore_fl_direct_reloc+0x4/0x4
[<ffffffff819dc4db>] xen_smp_prepare_cpus+0x36/0x135
..

CC: stable@kernel.org
Signed-off-by: Andrew Jones <drjones@redhat.com>
[v2: Updated to compile on 3.0]
[v3: Updated to compile when CONFIG_SMP is not defined]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2011-06-15 14:18:49 -04:00
Stefano Stabellini a91d92875e xen: partially revert "xen: set max_pfn_mapped to the last pfn mapped"
We only need to set max_pfn_mapped to the last pfn mapped on x86_64 to
make sure that cleanup_highmap doesn't remove important mappings at
_end.

We don't need to do this on x86_32 because cleanup_highmap is not called
on x86_32. Besides lowering max_pfn_mapped on x86_32 has the unwanted
side effect of limiting the amount of memory available for the 1:1
kernel pagetable allocation.

This patch reverts the x86_32 part of the original patch.

CC: stable@kernel.org
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2011-06-09 09:08:53 -04:00
Linus Torvalds dc7acbb251 Merge branch 'upstream/tidy-xen-mmu-2.6.39' of git://git.kernel.org/pub/scm/linux/kernel/git/jeremy/xen
* 'upstream/tidy-xen-mmu-2.6.39' of git://git.kernel.org/pub/scm/linux/kernel/git/jeremy/xen:
  xen: fix compile without CONFIG_XEN_DEBUG_FS
  Use arbitrary_virt_to_machine() to deal with ioremapped pud updates.
  Use arbitrary_virt_to_machine() to deal with ioremapped pmd updates.
  xen/mmu: remove all ad-hoc stats stuff
  xen: use normal virt_to_machine for ptes
  xen: make a pile of mmu pvop functions static
  vmalloc: remove vmalloc_sync_all() from alloc_vm_area()
  xen: condense everything onto xen_set_pte
  xen: use mmu_update for xen_set_pte_at()
  xen: drop all the special iomap pte paths.
2011-05-26 19:01:15 -07:00
Jeremy Fitzhardinge 4bf0ff24e3 xen: fix compile without CONFIG_XEN_DEBUG_FS
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
2011-05-20 16:34:44 -07:00
Jeremy Fitzhardinge 2a001f6482 Use arbitrary_virt_to_machine() to deal with ioremapped pud updates.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
2011-05-20 14:26:40 -07:00
Jeremy Fitzhardinge f05608d278 Use arbitrary_virt_to_machine() to deal with ioremapped pmd updates.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
2011-05-20 14:26:39 -07:00
Jeremy Fitzhardinge c86d8077b3 xen/mmu: remove all ad-hoc stats stuff
To make way for tracing.

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
2011-05-20 14:26:39 -07:00
Jeremy Fitzhardinge d5108316b8 xen: use normal virt_to_machine for ptes
We no longer support HIGHPTE allocations, so ptes should always be
within the kernel's direct map, and don't need pagetable walks
to convert to machine addresses.

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
2011-05-20 14:25:24 -07:00
Jeremy Fitzhardinge 4c13629f81 xen: make a pile of mmu pvop functions static
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
2011-05-20 14:25:24 -07:00
Jeremy Fitzhardinge 4a35c13cb8 xen: condense everything onto xen_set_pte
xen_set_pte_at and xen_clear_pte are essentially identical to
xen_set_pte, so just make them all common.

When batched set_pte and pte_clear are the same, but the unbatch operation
must be different: they need to update the two halves of the pte in
different order.

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
2011-05-20 14:14:32 -07:00
Jeremy Fitzhardinge a99ac5e861 xen: use mmu_update for xen_set_pte_at()
In principle update_va_mapping is a good match for set_pte_at, since
it gets the address being mapped, which allows Xen to use its linear
pagetable mapping.

However that assumes that the pmd for the address is attached to the
current pagetable, which may not be true for a given user address space
because the kernel pmd is not shared (at least on 32-bit guests).
Normally the kernel will automatically sync a missing part of the
pagetable with the init_mm pagetable transparently via faults, but that
fails when a missing address is passed to Xen.

And while the linear pagetable mapping is very useful for 32-bit Xen
(as it avoids an explicit domain mapping), 32-bit Xen is deprecated.
64-bit Xen has all memory mapped all the time, so it makes no real
difference.

The upshot is that we should use mmu_update, since it can operate on
non-current pagetables or detached pagetables.

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
2011-05-20 14:14:31 -07:00
Jeremy Fitzhardinge 331468b11b xen: drop all the special iomap pte paths.
Xen can work out when we're doing IO mappings for itself, so we don't
need to do anything special, and the extra tests just clog things up.

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
2011-05-20 14:14:31 -07:00
Linus Torvalds e33ab8f275 Merge branches 'stable/irq', 'stable/p2m.bugfixes', 'stable/e820.bugfixes' and 'stable/mmu.bugfixes' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen
* 'stable/irq' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
  xen: do not clear and mask evtchns in __xen_evtchn_do_upcall

* 'stable/p2m.bugfixes' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
  xen/p2m: Create entries in the P2M_MFN trees's to track 1-1 mappings

* 'stable/e820.bugfixes' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
  xen/setup: Fix for incorrect xen_extra_mem_start initialization under 32-bit
  xen/setup: Ignore E820_UNUSABLE when setting 1-1 mappings.

* 'stable/mmu.bugfixes' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
  xen mmu: fix a race window causing leave_mm BUG()
2011-05-19 16:14:58 -07:00
Daniel Kiper 3f508953dd arch/x86/xen/mmu: Cleanup code/data sections definitions
Cleanup code/data sections definitions
accordingly to include/linux/init.h.

Signed-off-by: Daniel Kiper <dkiper@net-space.pl>
[v1: Rebased on top of latest linus's to include fixes in mmu.c]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2011-05-19 11:30:29 -04:00
Tian, Kevin 7899891c7d xen mmu: fix a race window causing leave_mm BUG()
There's a race window in xen_drop_mm_ref, where remote cpu may exit
dirty bitmap between the check on this cpu and the point where remote
cpu handles drop request. So in drop_other_mm_ref we need check
whether TLB state is still lazy before calling into leave_mm. This
bug is rarely observed in earlier kernel, but exaggerated by the
commit 831d52bc15
("x86, mm: avoid possible bogus tlb entries by clearing prev mm_cpumask after switching mm")
which clears bitmap after changing the TLB state. the call trace is as below:

---------------------------------
kernel BUG at arch/x86/mm/tlb.c:61!
invalid opcode: 0000 [#1] SMP
last sysfs file: /sys/devices/system/xen_memory/xen_memory0/info/current_kb
CPU 1
Modules linked in: 8021q garp xen_netback xen_blkback blktap blkback_pagemap nbd bridge stp llc autofs4 ipmi_devintf ipmi_si ipmi_msghandler lockd sunrpc bonding ipv6 xenfs dm_multipath video output sbs sbshc parport_pc lp parport ses enclosure snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device serio_raw bnx2 snd_pcm_oss snd_mixer_oss snd_pcm snd_timer iTCO_wdt snd soundcore snd_page_alloc i2c_i801 iTCO_vendor_support i2c_core pcs pkr pata_acpi ata_generic ata_piix shpchp mptsas mptscsih mptbase [last unloaded: freq_table]
Pid: 25581, comm: khelper Not tainted 2.6.32.36fixxen #1 Tecal RH2285
RIP: e030:[<ffffffff8103a3cb>]  [<ffffffff8103a3cb>] leave_mm+0x15/0x46
RSP: e02b:ffff88002805be48  EFLAGS: 00010046
RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffff88015f8e2da0
RDX: ffff88002805be78 RSI: 0000000000000000 RDI: 0000000000000001
RBP: ffff88002805be48 R08: ffff88009d662000 R09: dead000000200200
R10: dead000000100100 R11: ffffffff814472b2 R12: ffff88009bfc1880
R13: ffff880028063020 R14: 00000000000004f6 R15: 0000000000000000
FS:  00007f62362d66e0(0000) GS:ffff880028058000(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000003aabc11909 CR3: 000000009b8ca000 CR4: 0000000000002660
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 00000000000000 00
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process khelper (pid: 25581, threadinfo ffff88007691e000, task ffff88009b92db40)
Stack:
 ffff88002805be68 ffffffff8100e4ae 0000000000000001 ffff88009d733b88
<0> ffff88002805be98 ffffffff81087224 ffff88002805be78 ffff88002805be78
<0> ffff88015f808360 00000000000004f6 ffff88002805bea8 ffffffff81010108
Call Trace:
 <IRQ>
 [<ffffffff8100e4ae>] drop_other_mm_ref+0x2a/0x53
 [<ffffffff81087224>] generic_smp_call_function_single_interrupt+0xd8/0xfc
 [<ffffffff81010108>] xen_call_function_single_interrupt+0x13/0x28
 [<ffffffff810a936a>] handle_IRQ_event+0x66/0x120
 [<ffffffff810aac5b>] handle_percpu_irq+0x41/0x6e
 [<ffffffff8128c1c0>] __xen_evtchn_do_upcall+0x1ab/0x27d
 [<ffffffff8128dd11>] xen_evtchn_do_upcall+0x33/0x46
 [<ffffffff81013efe>] xen_do_hyper visor_callback+0x1e/0x30
 <EOI>
 [<ffffffff814472b2>] ? _spin_unlock_irqrestore+0x15/0x17
 [<ffffffff8100f8cf>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff81113f71>] ? flush_old_exec+0x3ac/0x500
 [<ffffffff81150dc5>] ? load_elf_binary+0x0/0x17ef
 [<ffffffff81150dc5>] ? load_elf_binary+0x0/0x17ef
 [<ffffffff8115115d>] ? load_elf_binary+0x398/0x17ef
 [<ffffffff81042fcf>] ? need_resched+0x23/0x2d
 [<ffffffff811f4648>] ? process_measurement+0xc0/0xd7
 [<ffffffff81150dc5>] ? load_elf_binary+0x0/0x17ef
 [<ffffffff81113094>] ? search_binary_handler+0xc8/0x255
 [<ffffffff81114362>] ? do_execve+0x1c3/0x29e
 [<ffffffff8101155d>] ? sys_execve+0x43/0x5d
 [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
 [<ffffffff81013e28>] ? kernel_execve+0x68/0xd0
 [<ffffffff 8106fc45>] ? __call_usermodehelper+0x0/0x6f
 [<ffffffff8100f8cf>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff8106fb64>] ? ____call_usermodehelper+0x113/0x11e
 [<ffffffff81013daa>] ? child_rip+0xa/0x20
 [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
 [<ffffffff81012f91>] ? int_ret_from_sys_call+0x7/0x1b
 [<ffffffff8101371d>] ? retint_restore_args+0x5/0x6
 [<ffffffff81013da0>] ? child_rip+0x0/0x20
Code: 41 5e 41 5f c9 c3 55 48 89 e5 0f 1f 44 00 00 e8 17 ff ff ff c9 c3 55 48 89 e5 0f 1f 44 00 00 65 8b 04 25 c8 55 01 00 ff c8 75 04 <0f> 0b eb fe 65 48 8b 34 25 c0 55 01 00 48 81 c6 b8 02 00 00 e8
RIP  [<ffffffff8103a3cb>] leave_mm+0x15/0x46
 RSP <ffff88002805be48>
---[ end trace ce9cee6832a9c503 ]---

Tested-by: Maoxiaoyun<tinnycloud@hotmail.com>
Signed-off-by: Kevin Tian <kevin.tian@intel.com>
[v1: Fleshed out the git description a bit]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2011-05-12 14:27:43 -04:00
Stefano Stabellini 279b706bf8 x86,xen: introduce x86_init.mapping.pagetable_reserve
Introduce a new x86_init hook called pagetable_reserve that at the end
of init_memory_mapping is used to reserve a range of memory addresses for
the kernel pagetable pages we used and free the other ones.

On native it just calls memblock_x86_reserve_range while on xen it also
takes care of setting the spare memory previously allocated
for kernel pagetable pages from RO to RW, so that it can be used for
other purposes.

A detailed explanation of the reason why this hook is needed follows.

As a consequence of the commit:

commit 4b239f458c
Author: Yinghai Lu <yinghai@kernel.org>
Date:   Fri Dec 17 16:58:28 2010 -0800

    x86-64, mm: Put early page table high

at some point init_memory_mapping is going to reach the pagetable pages
area and map those pages too (mapping them as normal memory that falls
in the range of addresses passed to init_memory_mapping as argument).
Some of those pages are already pagetable pages (they are in the range
pgt_buf_start-pgt_buf_end) therefore they are going to be mapped RO and
everything is fine.
Some of these pages are not pagetable pages yet (they fall in the range
pgt_buf_end-pgt_buf_top; for example the page at pgt_buf_end) so they
are going to be mapped RW.  When these pages become pagetable pages and
are hooked into the pagetable, xen will find that the guest has already
a RW mapping of them somewhere and fail the operation.
The reason Xen requires pagetables to be RO is that the hypervisor needs
to verify that the pagetables are valid before using them. The validation
operations are called "pinning" (more details in arch/x86/xen/mmu.c).

In order to fix the issue we mark all the pages in the entire range
pgt_buf_start-pgt_buf_top as RO, however when the pagetable allocation
is completed only the range pgt_buf_start-pgt_buf_end is reserved by
init_memory_mapping. Hence the kernel is going to crash as soon as one
of the pages in the range pgt_buf_end-pgt_buf_top is reused (b/c those
ranges are RO).

For this reason we need a hook to reserve the kernel pagetable pages we
used and free the other ones so that they can be reused for other
purposes.
On native it just means calling memblock_x86_reserve_range, on Xen it
also means marking RW the pagetable pages that we allocated before but
that haven't been used before.

Another way to fix this is without using the hook is by adding a 'if
(xen_pv_domain)' in the 'init_memory_mapping' code and calling the Xen
counterpart, but that is just nasty.

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Acked-by: Yinghai Lu <yinghai@kernel.org>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2011-05-12 13:05:04 -04:00