- An implementation for the newly added hv_ops->flush() for the OPAL hvc
console driver backends, I forgot to apply this after merging the hvc driver
changes before the merge window.
- Enable all PCI bridges at boot on powernv, to avoid races when multiple
children of a bridge try to enable it simultaneously. This is a workaround
until the PCI core can be enhanced to fix the races.
- A fix to query PowerVM for the correct system topology at boot before
initialising sched domains, seen in some configurations to cause broken
scheduling etc.
- A fix for pte_access_permitted() on "nohash" platforms.
- Two commits to fix SIGBUS when using remap_pfn_range() seen on Power9 due to
a workaround when using the nest MMU (GPUs, accelerators).
- Another fix to the VFIO code used by KVM, the previous fix had some bugs
which caused guests to not start in some configurations.
- A handful of other minor fixes.
Thanks to:
Aneesh Kumar K.V, Benjamin Herrenschmidt, Christophe Leroy, Hari Bathini, Luke
Dashjr, Mahesh Salgaonkar, Nicholas Piggin, Paul Mackerras, Srikar Dronamraju.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAlt//10THG1wZUBlbGxl
cm1hbi5pZC5hdQAKCRBR6+o8yOGlgLUGD/9y/MPrs+V2lNFhxP+l1jO4Ro0v8DPI
vARjqq06WfppgQgSS1dvWfzLaMFbGe8wPRfL0T1xMOXCZg8Ts/HrgxVBVFYcv6/Q
xBaU5bKztg6HKQbwwO+8B/gdTA3hO7yFVux6JGwsGO5Ebl8Q3UDOdbvYX6XTj3H8
rdiicle9LaI7qodC8bxlBo1Be0YKEW0O/ag179sxXzozzvoPIyFpeX6FL1sAft++
XlQS1MHu4hErSM2rbmyoFCm+SmyRt3CD0NTVmNd2cgw5XexPOBFlnsdgpaK1jJFc
CYu1chP83E91ol1/8NAPkcmPWvP6MGyoqOl75RghooY2D1IJ2GtLKoz2Dvc53ay2
ZlIpMyc2CYIa55Mj18tOV/NbGbh0Lf0Ta++BxqxbcCDt5fq0VxHkoDPqBgWh/tdp
Po7oQc7U2VTKKC3wiLr//nSHpgtSTAWtucDt7oT7GdP8+EMxUZ8teBFIfTkwfuD2
plroEmcYuRD3beI4FAG/iCp5POOCsnHLkKVDl7tyQPl3Yvu8hvyLY9gBS9RN4Unt
/z94YFJtz7UD+VP7jDaPiQS26y0WEJfW8ml0tNxMZVBdksZLPIPN0/EU/04V0jWf
GxynzKMhElxC67tK/Eb43EGiLEZAlbEnJxoOtiCnL1MK+OIJPjg6e7o6qlsmaa+Y
zkXVpxLtkq0OoQ==
=1AUa
-----END PGP SIGNATURE-----
Merge tag 'powerpc-4.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
- An implementation for the newly added hv_ops->flush() for the OPAL
hvc console driver backends, I forgot to apply this after merging the
hvc driver changes before the merge window.
- Enable all PCI bridges at boot on powernv, to avoid races when
multiple children of a bridge try to enable it simultaneously. This
is a workaround until the PCI core can be enhanced to fix the races.
- A fix to query PowerVM for the correct system topology at boot before
initialising sched domains, seen in some configurations to cause
broken scheduling etc.
- A fix for pte_access_permitted() on "nohash" platforms.
- Two commits to fix SIGBUS when using remap_pfn_range() seen on Power9
due to a workaround when using the nest MMU (GPUs, accelerators).
- Another fix to the VFIO code used by KVM, the previous fix had some
bugs which caused guests to not start in some configurations.
- A handful of other minor fixes.
Thanks to: Aneesh Kumar K.V, Benjamin Herrenschmidt, Christophe Leroy,
Hari Bathini, Luke Dashjr, Mahesh Salgaonkar, Nicholas Piggin, Paul
Mackerras, Srikar Dronamraju.
* tag 'powerpc-4.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/mce: Fix SLB rebolting during MCE recovery path.
KVM: PPC: Book3S: Fix guest DMA when guest partially backed by THP pages
powerpc/mm/radix: Only need the Nest MMU workaround for R -> RW transition
powerpc/mm/books3s: Add new pte bit to mark pte temporarily invalid.
powerpc/nohash: fix pte_access_permitted()
powerpc/topology: Get topology for shared processors at boot
powerpc64/ftrace: Include ftrace.h needed for enable/disable calls
powerpc/powernv/pci: Work around races in PCI bridge enabling
powerpc/fadump: cleanup crash memory ranges support
powerpc/powernv: provide a console flush operation for opal hvc driver
powerpc/traps: Avoid rate limit messages from show unhandled signals
powerpc/64s: Fix PACA_IRQ_HARD_DIS accounting in idle_power4()
When splitting a huge pmd pte, we need to mark the pmd entry invalid. We
can do that by clearing _PAGE_PRESENT bit. But then that will be taken as a
swap pte. In order to differentiate between the two use a software pte bit
when invalidating.
For regular pte, due to bd5050e38a ("powerpc/mm/radix: Change pte relax
sequence to handle nest MMU hang") we need to mark the pte entry invalid when
relaxing access permission. Instead of marking pte_none which can result in
different page table walk routines possibly skipping this pte entry, invalidate
it but still keep it marked present.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Commit 5769beaf18 ("powerpc/mm: Add proper pte access check helper
for other platforms") replaced generic pte_access_permitted() by an
arch specific one.
The generic one is defined as
(pte_present(pte) && (!(write) || pte_write(pte)))
The arch specific one is open coded checking that _PAGE_USER and
_PAGE_WRITE (_PAGE_RW) flags are set, but lacking to check that
_PAGE_RO and _PAGE_PRIVILEGED are unset, leading to a useless test
on targets like the 8xx which defines _PAGE_RW and _PAGE_USER as 0.
Commit 5fa5b16be5 ("powerpc/mm/hugetlb: Use pte_access_permitted
for hugetlb access check") replaced some tests performed with
pte helpers by a call to pte_access_permitted(), leading to the same
issue.
This patch rewrites powerpc/nohash pte_access_permitted()
using pte helpers.
Fixes: 5769beaf18 ("powerpc/mm: Add proper pte access check helper for other platforms")
Fixes: 5fa5b16be5 ("powerpc/mm/hugetlb: Use pte_access_permitted for hugetlb access check")
Cc: stable@vger.kernel.org # v4.15+
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
On a shared LPAR, Phyp will not update the CPU associativity at boot
time. Just after the boot system does recognize itself as a shared
LPAR and trigger a request for correct CPU associativity. But by then
the scheduler would have already created/destroyed its sched domains.
This causes
- Broken load balance across Nodes causing islands of cores.
- Performance degradation esp if the system is lightly loaded
- dmesg to wrongly report all CPUs to be in Node 0.
- Messages in dmesg saying borken topology.
- With commit 051f3ca02e ("sched/topology: Introduce NUMA identity
node sched domain"), can cause rcu stalls at boot up.
The sched_domains_numa_masks table which is used to generate cpumasks
is only created at boot time just before creating sched domains and
never updated. Hence, its better to get the topology correct before
the sched domains are created.
For example on 64 core Power 8 shared LPAR, dmesg reports
Brought up 512 CPUs
Node 0 CPUs: 0-511
Node 1 CPUs:
Node 2 CPUs:
Node 3 CPUs:
Node 4 CPUs:
Node 5 CPUs:
Node 6 CPUs:
Node 7 CPUs:
Node 8 CPUs:
Node 9 CPUs:
Node 10 CPUs:
Node 11 CPUs:
...
BUG: arch topology borken
the DIE domain not a subset of the NUMA domain
BUG: arch topology borken
the DIE domain not a subset of the NUMA domain
numactl/lscpu output will still be correct with cores spreading across
all nodes:
Socket(s): 64
NUMA node(s): 12
Model: 2.0 (pvr 004d 0200)
Model name: POWER8 (architected), altivec supported
Hypervisor vendor: pHyp
Virtualization type: para
L1d cache: 64K
L1i cache: 32K
NUMA node0 CPU(s): 0-7,32-39,64-71,96-103,176-183,272-279,368-375,464-471
NUMA node1 CPU(s): 8-15,40-47,72-79,104-111,184-191,280-287,376-383,472-479
NUMA node2 CPU(s): 16-23,48-55,80-87,112-119,192-199,288-295,384-391,480-487
NUMA node3 CPU(s): 24-31,56-63,88-95,120-127,200-207,296-303,392-399,488-495
NUMA node4 CPU(s): 208-215,304-311,400-407,496-503
NUMA node5 CPU(s): 168-175,264-271,360-367,456-463
NUMA node6 CPU(s): 128-135,224-231,320-327,416-423
NUMA node7 CPU(s): 136-143,232-239,328-335,424-431
NUMA node8 CPU(s): 216-223,312-319,408-415,504-511
NUMA node9 CPU(s): 144-151,240-247,336-343,432-439
NUMA node10 CPU(s): 152-159,248-255,344-351,440-447
NUMA node11 CPU(s): 160-167,256-263,352-359,448-455
Currently on this LPAR, the scheduler detects 2 levels of Numa and
created numa sched domains for all CPUs, but it finds a single DIE
domain consisting of all CPUs. Hence it deletes all numa sched
domains.
To address this, detect the shared processor and update topology soon
after CPUs are setup so that correct topology is updated just before
scheduler creates sched domain.
With the fix, dmesg reports:
numa: Node 0 CPUs: 0-7 32-39 64-71 96-103 176-183 272-279 368-375 464-471
numa: Node 1 CPUs: 8-15 40-47 72-79 104-111 184-191 280-287 376-383 472-479
numa: Node 2 CPUs: 16-23 48-55 80-87 112-119 192-199 288-295 384-391 480-487
numa: Node 3 CPUs: 24-31 56-63 88-95 120-127 200-207 296-303 392-399 488-495
numa: Node 4 CPUs: 208-215 304-311 400-407 496-503
numa: Node 5 CPUs: 168-175 264-271 360-367 456-463
numa: Node 6 CPUs: 128-135 224-231 320-327 416-423
numa: Node 7 CPUs: 136-143 232-239 328-335 424-431
numa: Node 8 CPUs: 216-223 312-319 408-415 504-511
numa: Node 9 CPUs: 144-151 240-247 336-343 432-439
numa: Node 10 CPUs: 152-159 248-255 344-351 440-447
numa: Node 11 CPUs: 160-167 256-263 352-359 448-455
and lscpu also reports:
Socket(s): 64
NUMA node(s): 12
Model: 2.0 (pvr 004d 0200)
Model name: POWER8 (architected), altivec supported
Hypervisor vendor: pHyp
Virtualization type: para
L1d cache: 64K
L1i cache: 32K
NUMA node0 CPU(s): 0-7,32-39,64-71,96-103,176-183,272-279,368-375,464-471
NUMA node1 CPU(s): 8-15,40-47,72-79,104-111,184-191,280-287,376-383,472-479
NUMA node2 CPU(s): 16-23,48-55,80-87,112-119,192-199,288-295,384-391,480-487
NUMA node3 CPU(s): 24-31,56-63,88-95,120-127,200-207,296-303,392-399,488-495
NUMA node4 CPU(s): 208-215,304-311,400-407,496-503
NUMA node5 CPU(s): 168-175,264-271,360-367,456-463
NUMA node6 CPU(s): 128-135,224-231,320-327,416-423
NUMA node7 CPU(s): 136-143,232-239,328-335,424-431
NUMA node8 CPU(s): 216-223,312-319,408-415,504-511
NUMA node9 CPU(s): 144-151,240-247,336-343,432-439
NUMA node10 CPU(s): 152-159,248-255,344-351,440-447
NUMA node11 CPU(s): 160-167,256-263,352-359,448-455
Reported-by: Manjunatha H R <manjuhr1@in.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
[mpe: Trim / format change log]
Tested-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Provide the flush hv_op for the opal hvc driver. This will flush the
firmware console buffers without spinning with interrupts disabled.
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
For x86 this brings in PCID emulation and CR3 caching for shadow page
tables, nested VMX live migration, nested VMCS shadowing, an optimized
IPI hypercall, and some optimizations.
ARM will come next week.
There is a semantic conflict because tip also added an .init_platform
callback to kvm.c. Please keep the initializer from this branch,
and add a call to kvmclock_init (added by tip) inside kvm_init_platform
(added here).
Also, there is a backmerge from 4.18-rc6. This is because of a
refactoring that conflicted with a relatively late bugfix and
resulted in a particularly hellish conflict. Because the conflict
was only due to unfortunate timing of the bugfix, I backmerged and
rebased the refactoring rather than force the resolution on you.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJbdwNFAAoJEL/70l94x66DiPEH/1cAGZWGd85Y3yRu1dmTmqiz
kZy0V+WTQ5kyJF4ZsZKKOp+xK7Qxh5e9kLdTo70uPZCHwLu9IaGKN9+dL9Jar3DR
yLPX5bMsL8UUed9g9mlhdaNOquWi7d7BseCOnIyRTolb+cqnM5h3sle0gqXloVrS
UQb4QogDz8+86czqR8tNfazjQRKW/D2HEGD5NDNVY1qtpY+leCDAn9/u6hUT5c6z
EtufgyDh35UN+UQH0e2605gt3nN3nw3FiQJFwFF1bKeQ7k5ByWkuGQI68XtFVhs+
2WfqL3ftERkKzUOy/WoSJX/C9owvhMcpAuHDGOIlFwguNGroZivOMVnACG1AI3I=
=9Mgw
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull first set of KVM updates from Paolo Bonzini:
"PPC:
- minor code cleanups
x86:
- PCID emulation and CR3 caching for shadow page tables
- nested VMX live migration
- nested VMCS shadowing
- optimized IPI hypercall
- some optimizations
ARM will come next week"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (85 commits)
kvm: x86: Set highest physical address bits in non-present/reserved SPTEs
KVM/x86: Use CC_SET()/CC_OUT in arch/x86/kvm/vmx.c
KVM: X86: Implement PV IPIs in linux guest
KVM: X86: Add kvm hypervisor init time platform setup callback
KVM: X86: Implement "send IPI" hypercall
KVM/x86: Move X86_CR4_OSXSAVE check into kvm_valid_sregs()
KVM: x86: Skip pae_root shadow allocation if tdp enabled
KVM/MMU: Combine flushing remote tlb in mmu_set_spte()
KVM: vmx: skip VMWRITE of HOST_{FS,GS}_BASE when possible
KVM: vmx: skip VMWRITE of HOST_{FS,GS}_SEL when possible
KVM: vmx: always initialize HOST_{FS,GS}_BASE to zero during setup
KVM: vmx: move struct host_state usage to struct loaded_vmcs
KVM: vmx: compute need to reload FS/GS/LDT on demand
KVM: nVMX: remove a misleading comment regarding vmcs02 fields
KVM: vmx: rename __vmx_load_host_state() and vmx_save_host_state()
KVM: vmx: add dedicated utility to access guest's kernel_gs_base
KVM: vmx: track host_state.loaded using a loaded_vmcs pointer
KVM: vmx: refactor segmentation code in vmx_save_host_state()
kvm: nVMX: Fix fault priority for VMX operations
kvm: nVMX: Fix fault vector for VMX operation at CPL > 0
...
Merge updates from Andrew Morton:
- a few misc things
- a few Y2038 fixes
- ntfs fixes
- arch/sh tweaks
- ocfs2 updates
- most of MM
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (111 commits)
mm/hmm.c: remove unused variables align_start and align_end
fs/userfaultfd.c: remove redundant pointer uwq
mm, vmacache: hash addresses based on pmd
mm/list_lru: introduce list_lru_shrink_walk_irq()
mm/list_lru.c: pass struct list_lru_node* as an argument to __list_lru_walk_one()
mm/list_lru.c: move locking from __list_lru_walk_one() to its caller
mm/list_lru.c: use list_lru_walk_one() in list_lru_walk_node()
mm, swap: make CONFIG_THP_SWAP depend on CONFIG_SWAP
mm/sparse: delete old sparse_init and enable new one
mm/sparse: add new sparse_init_nid() and sparse_init()
mm/sparse: move buffer init/fini to the common place
mm/sparse: use the new sparse buffer functions in non-vmemmap
mm/sparse: abstract sparse buffer allocations
mm/hugetlb.c: don't zero 1GiB bootmem pages
mm, page_alloc: double zone's batchsize
mm/oom_kill.c: document oom_lock
mm/hugetlb: remove gigantic page support for HIGHMEM
mm, oom: remove sleep from under oom_lock
kernel/dma: remove unsupported gfp_mask parameter from dma_alloc_from_contiguous()
mm/cma: remove unsupported gfp_mask parameter from cma_alloc()
...
Use new return type vm_fault_t for fault handler. For now, this is just
documenting that the function returns a VM_FAULT value rather than an
errno. Once all instances are converted, vm_fault_t will become a
distinct type.
Ref-> commit 1c8f422059 ("mm: change return type to vm_fault_t")
In this patch all the caller of handle_mm_fault() are changed to return
vm_fault_t type.
Link: http://lkml.kernel.org/r/20180617084810.GA6730@jordon-HP-15-Notebook-PC
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Michal Simek <monstr@monstr.eu>
Cc: James Hogan <jhogan@kernel.org>
Cc: Ley Foon Tan <lftan@altera.com>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: James E.J. Bottomley <jejb@parisc-linux.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Palmer Dabbelt <palmer@sifive.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: David S. Miller <davem@davemloft.net>
Cc: Richard Weinberger <richard@nod.at>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "Levin, Alexander (Sasha Levin)" <alexander.levin@verizon.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Notable changes:
- A fix for a bug in our page table fragment allocator, where a page table page
could be freed and reallocated for something else while still in use, leading
to memory corruption etc. The fix reuses pt_mm in struct page (x86 only) for
a powerpc only refcount.
- Fixes to our pkey support. Several are user-visible changes, but bring us in
to line with x86 behaviour and/or fix outright bugs. Thanks to Florian Weimer
for reporting many of these.
- A series to improve the hvc driver & related OPAL console code, which have
been seen to cause hardlockups at times. The hvc driver changes in particular
have been in linux-next for ~month.
- Increase our MAX_PHYSMEM_BITS to 128TB when SPARSEMEM_VMEMMAP=y.
- Remove Power8 DD1 and Power9 DD1 support, neither chip should be in use
anywhere other than as a paper weight.
- An optimised memcmp implementation using Power7-or-later VMX instructions
- Support for barrier_nospec on some NXP CPUs.
- Support for flushing the count cache on context switch on some IBM CPUs
(controlled by firmware), as a Spectre v2 mitigation.
- A series to enhance the information we print on unhandled signals to bring it
into line with other arches, including showing the offending VMA and dumping
the instructions around the fault.
Thanks to:
Aaro Koskinen, Akshay Adiga, Alastair D'Silva, Alexey Kardashevskiy, Alexey
Spirkov, Alistair Popple, Andrew Donnellan, Aneesh Kumar K.V, Anju T Sudhakar,
Arnd Bergmann, Bartosz Golaszewski, Benjamin Herrenschmidt, Bharat Bhushan,
Bjoern Noetel, Boqun Feng, Breno Leitao, Bryant G. Ly, Camelia Groza,
Christophe Leroy, Christoph Hellwig, Cyril Bur, Dan Carpenter, Daniel Klamt,
Darren Stevens, Dave Young, David Gibson, Diana Craciun, Finn Thain, Florian
Weimer, Frederic Barrat, Gautham R. Shenoy, Geert Uytterhoeven, Geoff Levand,
Guenter Roeck, Gustavo Romero, Haren Myneni, Hari Bathini, Joel Stanley,
Jonathan Neuschäfer, Kees Cook, Madhavan Srinivasan, Mahesh Salgaonkar, Markus
Elfring, Mathieu Malaterre, Mauro S. M. Rodrigues, Michael Hanselmann, Michael
Neuling, Michael Schmitz, Mukesh Ojha, Murilo Opsfelder Araujo, Nicholas
Piggin, Parth Y Shah, Paul Mackerras, Paul Menzel, Ram Pai, Randy Dunlap,
Rashmica Gupta, Reza Arbab, Rodrigo R. Galvao, Russell Currey, Sam Bobroff,
Scott Wood, Shilpasri G Bhat, Simon Guo, Souptick Joarder, Stan Johnson,
Thiago Jung Bauermann, Tyrel Datwyler, Vaibhav Jain, Vasant Hegde, Venkat Rao
B, zhong jiang.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAlt2O6cTHG1wZUBlbGxl
cm1hbi5pZC5hdQAKCRBR6+o8yOGlgC7hD/4+cj796Df7GsVsIMxzQm7SS9dklIdO
JuKj2Nr5HRzTH59jWlXukLG9mfTNCFgFJB4gEpK1ArDOTcHTCI9RRsLZTZ/kum66
7Pd+7T40dLYXB5uecuUs0vMXa2fI3syKh1VLzACSXv3Dh9BBIKQBwW/aD2eww4YI
1fS5LnXZ2PSxfr6KNAC6ogZnuaiD0sHXOYrtGHq+S/TFC7+Z6ySa6+AnPS+hPVoo
/rHDE1Khr66aj7uk+PP2IgUrCFj6Sbj6hTVlS/iAuwbMjUl9ty6712PmvX9x6wMZ
13hJQI+g6Ci+lqLKqmqVUpXGSr6y4NJGPS/Hko4IivBTJApI+qV/tF2H9nxU+6X0
0RqzsMHPHy13n2torA1gC7ttzOuXPI4hTvm6JWMSsfmfjTxLANJng3Dq3ejh6Bqw
76EMowpDLexwpy7/glPpqNdsP4ySf2Qm8yq3mR7qpL4m3zJVRGs11x+s5DW8NKBL
Fl5SqZvd01abH+sHwv6NLaLkEtayUyohxvyqu2RU3zu5M5vi7DhqstybTPjKPGu0
icSPh7b2y10WpOUpC6lxpdi8Me8qH47mVc/trZ+SpgBrsuEmtJhGKszEnzRCOqos
o2IhYHQv3lQv86kpaAFQlg/RO+Lv+Lo5qbJ209V+hfU5nYzXpEulZs4dx1fbA+ze
fK8GEh+u0L4uJg==
=PzRz
-----END PGP SIGNATURE-----
Merge tag 'powerpc-4.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
"Notable changes:
- A fix for a bug in our page table fragment allocator, where a page
table page could be freed and reallocated for something else while
still in use, leading to memory corruption etc. The fix reuses
pt_mm in struct page (x86 only) for a powerpc only refcount.
- Fixes to our pkey support. Several are user-visible changes, but
bring us in to line with x86 behaviour and/or fix outright bugs.
Thanks to Florian Weimer for reporting many of these.
- A series to improve the hvc driver & related OPAL console code,
which have been seen to cause hardlockups at times. The hvc driver
changes in particular have been in linux-next for ~month.
- Increase our MAX_PHYSMEM_BITS to 128TB when SPARSEMEM_VMEMMAP=y.
- Remove Power8 DD1 and Power9 DD1 support, neither chip should be in
use anywhere other than as a paper weight.
- An optimised memcmp implementation using Power7-or-later VMX
instructions
- Support for barrier_nospec on some NXP CPUs.
- Support for flushing the count cache on context switch on some IBM
CPUs (controlled by firmware), as a Spectre v2 mitigation.
- A series to enhance the information we print on unhandled signals
to bring it into line with other arches, including showing the
offending VMA and dumping the instructions around the fault.
Thanks to: Aaro Koskinen, Akshay Adiga, Alastair D'Silva, Alexey
Kardashevskiy, Alexey Spirkov, Alistair Popple, Andrew Donnellan,
Aneesh Kumar K.V, Anju T Sudhakar, Arnd Bergmann, Bartosz Golaszewski,
Benjamin Herrenschmidt, Bharat Bhushan, Bjoern Noetel, Boqun Feng,
Breno Leitao, Bryant G. Ly, Camelia Groza, Christophe Leroy, Christoph
Hellwig, Cyril Bur, Dan Carpenter, Daniel Klamt, Darren Stevens, Dave
Young, David Gibson, Diana Craciun, Finn Thain, Florian Weimer,
Frederic Barrat, Gautham R. Shenoy, Geert Uytterhoeven, Geoff Levand,
Guenter Roeck, Gustavo Romero, Haren Myneni, Hari Bathini, Joel
Stanley, Jonathan Neuschäfer, Kees Cook, Madhavan Srinivasan, Mahesh
Salgaonkar, Markus Elfring, Mathieu Malaterre, Mauro S. M. Rodrigues,
Michael Hanselmann, Michael Neuling, Michael Schmitz, Mukesh Ojha,
Murilo Opsfelder Araujo, Nicholas Piggin, Parth Y Shah, Paul
Mackerras, Paul Menzel, Ram Pai, Randy Dunlap, Rashmica Gupta, Reza
Arbab, Rodrigo R. Galvao, Russell Currey, Sam Bobroff, Scott Wood,
Shilpasri G Bhat, Simon Guo, Souptick Joarder, Stan Johnson, Thiago
Jung Bauermann, Tyrel Datwyler, Vaibhav Jain, Vasant Hegde, Venkat
Rao, zhong jiang"
* tag 'powerpc-4.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (234 commits)
powerpc/mm/book3s/radix: Add mapping statistics
powerpc/uaccess: Enable get_user(u64, *p) on 32-bit
powerpc/mm/hash: Remove unnecessary do { } while(0) loop
powerpc/64s: move machine check SLB flushing to mm/slb.c
powerpc/powernv/idle: Fix build error
powerpc/mm/tlbflush: update the mmu_gather page size while iterating address range
powerpc/mm: remove warning about ‘type’ being set
powerpc/32: Include setup.h header file to fix warnings
powerpc: Move `path` variable inside DEBUG_PROM
powerpc/powermac: Make some functions static
powerpc/powermac: Remove variable x that's never read
cxl: remove a dead branch
powerpc/powermac: Add missing include of header pmac.h
powerpc/kexec: Use common error handling code in setup_new_fdt()
powerpc/xmon: Add address lookup for percpu symbols
powerpc/mm: remove huge_pte_offset_and_shift() prototype
powerpc/lib: Use patch_site to patch copy_32 functions once cache is enabled
powerpc/pseries: Fix endianness while restoring of r3 in MCE handler.
powerpc/fadump: merge adjacent memory ranges to reduce PT_LOAD segements
powerpc/fadump: handle crash memory ranges array index overflow
...
Pull perf update from Thomas Gleixner:
"The perf crowd presents:
Kernel updates:
- Removal of jprobes
- Cleanup and consolidatation the handling of kprobes
- Cleanup and consolidation of hardware breakpoints
- The usual pile of fixes and updates to PMUs and event descriptors
Tooling updates:
- Updates and improvements all over the place. Nothing outstanding,
just the (good) boring incremental grump work"
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (103 commits)
perf trace: Do not require --no-syscalls to suppress strace like output
perf bpf: Include uapi/linux/bpf.h from the 'perf trace' script's bpf.h
perf tools: Allow overriding MAX_NR_CPUS at compile time
perf bpf: Show better message when failing to load an object
perf list: Unify metric group description format with PMU event description
perf vendor events arm64: Update ThunderX2 implementation defined pmu core events
perf cs-etm: Generate branch sample for CS_ETM_TRACE_ON packet
perf cs-etm: Generate branch sample when receiving a CS_ETM_TRACE_ON packet
perf cs-etm: Support dummy address value for CS_ETM_TRACE_ON packet
perf cs-etm: Fix start tracing packet handling
perf build: Fix installation directory for eBPF
perf c2c report: Fix crash for empty browser
perf tests: Fix indexing when invoking subtests
perf trace: Beautify the AF_INET & AF_INET6 'socket' syscall 'protocol' args
perf trace beauty: Add beautifiers for 'socket''s 'protocol' arg
perf trace beauty: Do not print NULL strarray entries
perf beauty: Add a generator for IPPROTO_ socket's protocol constants
tools include uapi: Grab a copy of linux/in.h
perf tests: Fix complex event name parsing
perf evlist: Fix error out while applying initial delay and LBR
...
Pull locking/atomics update from Thomas Gleixner:
"The locking, atomics and memory model brains delivered:
- A larger update to the atomics code which reworks the ordering
barriers, consolidates the atomic primitives, provides the new
atomic64_fetch_add_unless() primitive and cleans up the include
hell.
- Simplify cmpxchg() instrumentation and add instrumentation for
xchg() and cmpxchg_double().
- Updates to the memory model and documentation"
* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (48 commits)
locking/atomics: Rework ordering barriers
locking/atomics: Instrument cmpxchg_double*()
locking/atomics: Instrument xchg()
locking/atomics: Simplify cmpxchg() instrumentation
locking/atomics/x86: Reduce arch_cmpxchg64*() instrumentation
tools/memory-model: Rename litmus tests to comply to norm7
tools/memory-model/Documentation: Fix typo, smb->smp
sched/Documentation: Update wake_up() & co. memory-barrier guarantees
locking/spinlock, sched/core: Clarify requirements for smp_mb__after_spinlock()
sched/core: Use smp_mb() in wake_woken_function()
tools/memory-model: Add informal LKMM documentation to MAINTAINERS
locking/atomics/Documentation: Describe atomic_set() as a write operation
tools/memory-model: Make scripts executable
tools/memory-model: Remove ACCESS_ONCE() from model
tools/memory-model: Remove ACCESS_ONCE() from recipes
locking/memory-barriers.txt/kokr: Update Korean translation to fix broken DMA vs. MMIO ordering example
MAINTAINERS: Add Daniel Lustig as an LKMM reviewer
tools/memory-model: Fix ISA2+pooncelock+pooncelock+pombonce name
tools/memory-model: Add litmus test for full multicopy atomicity
locking/refcount: Always allow checked forms
...
Add statistics that show how memory is mapped within the kernel linear mapping.
This is similar to commit 37cd944c8d ("s390/pgtable: add mapping statistics")
We don't do this with Hash translation mode. Hash uses one size (mmu_linear_psize)
to map the kernel linear mapping and we print the linear psize during boot as
below.
"Page orders: linear mapping = 24, virtual = 16, io = 16, vmemmap = 24"
A sample output looks like:
DirectMap4k: 0 kB
DirectMap64k: 18432 kB
DirectMap2M: 1030144 kB
DirectMap1G: 11534336 kB
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Currently if you build a 32-bit powerpc kernel and use get_user() to
load a u64 value it will fail to build with eg:
kernel/rseq.o: In function `rseq_get_rseq_cs':
kernel/rseq.c:123: undefined reference to `__get_user_bad'
This is hitting the check in __get_user_size() that makes sure the
size we're copying doesn't exceed the size of the destination:
#define __get_user_size(x, ptr, size, retval)
do {
retval = 0;
__chk_user_ptr(ptr);
if (size > sizeof(x))
(x) = __get_user_bad();
Which doesn't immediately make sense because the size of the
destination is u64, but it's not really, because __get_user_check()
etc. internally create an unsigned long and copy into that:
#define __get_user_check(x, ptr, size)
({
long __gu_err = -EFAULT;
unsigned long __gu_val = 0;
The problem being that on 32-bit unsigned long is not big enough to
hold a u64. We can fix this with a trick from hpa in the x86 code, we
statically check the type of x and set the type of __gu_val to either
unsigned long or unsigned long long.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The machine check code that flushes and restores bolted segments in
real mode belongs in mm/slb.c. This will also be used by pseries
machine check and idle code in future changes.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This patch makes sure we update the mmu_gather page size even if we are
requesting for a fullmm flush. This avoids triggering VM_WARN_ON in code
paths like __tlb_remove_page_size that explicitly check for removing range page
size to be same as mmu gather page size.
Fixes: 5a6099346c ("powerpc/64s/radix: tlb do not flush on page size when fullmm")
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Acked-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
huge_pte_offset_and_shift() has never existed
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The symbol memcpy_nocache_branch defined in order to allow patching
of memset function once cache is enabled leads to confusing reports
by perf tool.
Using the new patch_site functionality solves this issue.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Crash memory ranges is an array of memory ranges of the crashing kernel
to be exported as a dump via /proc/vmcore file. The size of the array
is set based on INIT_MEMBLOCK_REGIONS, which works alright in most cases
where memblock memory regions count is less than INIT_MEMBLOCK_REGIONS
value. But this count can grow beyond INIT_MEMBLOCK_REGIONS value since
commit 142b45a72e ("memblock: Add array resizing support").
On large memory systems with a few DLPAR operations, the memblock memory
regions count could be larger than INIT_MEMBLOCK_REGIONS value. On such
systems, registering fadump results in crash or other system failures
like below:
task: c00007f39a290010 ti: c00000000b738000 task.ti: c00000000b738000
NIP: c000000000047df4 LR: c0000000000f9e58 CTR: c00000000010f180
REGS: c00000000b73b570 TRAP: 0300 Tainted: G L X (4.4.140+)
MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 22004484 XER: 20000000
CFAR: c000000000008500 DAR: 000007a450000000 DSISR: 40000000 SOFTE: 0
...
NIP [c000000000047df4] smp_send_reschedule+0x24/0x80
LR [c0000000000f9e58] resched_curr+0x138/0x160
Call Trace:
resched_curr+0x138/0x160 (unreliable)
check_preempt_curr+0xc8/0xf0
ttwu_do_wakeup+0x38/0x150
try_to_wake_up+0x224/0x4d0
__wake_up_common+0x94/0x100
ep_poll_callback+0xac/0x1c0
__wake_up_common+0x94/0x100
__wake_up_sync_key+0x70/0xa0
sock_def_readable+0x58/0xa0
unix_stream_sendmsg+0x2dc/0x4c0
sock_sendmsg+0x68/0xa0
___sys_sendmsg+0x2cc/0x2e0
__sys_sendmsg+0x5c/0xc0
SyS_socketcall+0x36c/0x3f0
system_call+0x3c/0x100
as array index overflow is not checked for while setting up crash memory
ranges causing memory corruption. To resolve this issue, dynamically
allocate memory for crash memory ranges and resize it incrementally,
in units of pagesize, on hitting array size limit.
Fixes: 2df173d9e8 ("fadump: Initialize elfcore header and add PT_LOAD program headers.")
Cc: stable@vger.kernel.org # v3.4+
Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Reviewed-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
[mpe: Just use PAGE_SIZE directly, fixup variable placement]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
commit e8cb7a55eb ("powerpc: remove superflous inclusions of
asm/fixmap.h") removed inclusion of asm/fixmap.h from files not
including objects from that file.
However, asm/mmu-8xx.h includes call to __fix_to_virt(). The proper
way would be to include asm/fixmap.h in asm/mmu-8xx.h but it creates
an inclusion loop.
So we have to leave asm/fixmap.h in sysdep/cpm_common.c for
CONFIG_PPC_EARLY_DEBUG_CPM
CC arch/powerpc/sysdev/cpm_common.o
In file included from ./arch/powerpc/include/asm/mmu.h:340:0,
from ./arch/powerpc/include/asm/reg_8xx.h:8,
from ./arch/powerpc/include/asm/reg.h:29,
from ./arch/powerpc/include/asm/processor.h:13,
from ./arch/powerpc/include/asm/thread_info.h:28,
from ./include/linux/thread_info.h:38,
from ./arch/powerpc/include/asm/ptrace.h:159,
from ./arch/powerpc/include/asm/hw_irq.h:12,
from ./arch/powerpc/include/asm/irqflags.h:12,
from ./include/linux/irqflags.h:16,
from ./include/asm-generic/cmpxchg-local.h:6,
from ./arch/powerpc/include/asm/cmpxchg.h:537,
from ./arch/powerpc/include/asm/atomic.h:11,
from ./include/linux/atomic.h:5,
from ./include/linux/mutex.h:18,
from ./include/linux/kernfs.h:13,
from ./include/linux/sysfs.h:16,
from ./include/linux/kobject.h:20,
from ./include/linux/device.h:16,
from ./include/linux/node.h:18,
from ./include/linux/cpu.h:17,
from ./include/linux/of_device.h:5,
from arch/powerpc/sysdev/cpm_common.c:21:
arch/powerpc/sysdev/cpm_common.c: In function ‘udbg_init_cpm’:
./arch/powerpc/include/asm/mmu-8xx.h:218:25: error: implicit declaration of function ‘__fix_to_virt’ [-Werror=implicit-function-declaration]
#define VIRT_IMMR_BASE (__fix_to_virt(FIX_IMMR_BASE))
^
arch/powerpc/sysdev/cpm_common.c:75:7: note: in expansion of macro ‘VIRT_IMMR_BASE’
VIRT_IMMR_BASE);
^
./arch/powerpc/include/asm/mmu-8xx.h:218:39: error: ‘FIX_IMMR_BASE’ undeclared (first use in this function)
#define VIRT_IMMR_BASE (__fix_to_virt(FIX_IMMR_BASE))
^
arch/powerpc/sysdev/cpm_common.c:75:7: note: in expansion of macro ‘VIRT_IMMR_BASE’
VIRT_IMMR_BASE);
^
./arch/powerpc/include/asm/mmu-8xx.h:218:39: note: each undeclared identifier is reported only once for each function it appears in
#define VIRT_IMMR_BASE (__fix_to_virt(FIX_IMMR_BASE))
^
arch/powerpc/sysdev/cpm_common.c:75:7: note: in expansion of macro ‘VIRT_IMMR_BASE’
VIRT_IMMR_BASE);
^
cc1: all warnings being treated as errors
make[1]: *** [arch/powerpc/sysdev/cpm_common.o] Error 1
Fixes: e8cb7a55eb ("powerpc: remove superflous inclusions of asm/fixmap.h")
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
NX increments readOffset by FIFO size in receive FIFO control register
when CRB is read. But the index in RxFIFO has to match with the
corresponding entry in FIFO maintained by VAS in kernel. Otherwise NX
may be processing incorrect CRBs and can cause CRB timeout.
VAS FIFO offset is 0 when the receive window is opened during
initialization. When the module is reloaded or in kexec boot, readOffset
in FIFO control register may not match with VAS entry. This patch adds
nx_coproc_init OPAL call to reset readOffset and queued entries in FIFO
control register for both high and normal FIFOs.
Signed-off-by: Haren Myneni <haren@us.ibm.com>
[mpe: Fixup uninitialized variable warning]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
show_user_instructions() is a slightly modified version of
show_instructions() that allows userspace instruction dump.
This will be useful within show_signal_msg() to dump userspace
instructions of the faulty location.
Here is a sample of what show_user_instructions() outputs:
pandafault[10850]: code: 4bfffeec 4bfffee8 3c401002 38427f00 fbe1fff8 f821ffc1 7c3f0b78 3d22fffe
pandafault[10850]: code: 392988d0 f93f0020 e93f0020 39400048 <99490000> 39200000 7d234b78 383f0040
The current->comm and current->pid printed can serve as a glue that
links the instructions dump to its originator, allowing messages to be
interleaved in the logs.
Signed-off-by: Murilo Opsfelder Araujo <muriloo@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Use the existing hypercall to determine the appropriate settings for
the count cache flush, and then call the generic powerpc code to set
it up based on the security feature flags.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Some CPU revisions support a mode where the count cache needs to be
flushed by software on context switch. Additionally some revisions may
have a hardware accelerated flush, in which case the software flush
sequence can be shortened.
If we detect the appropriate flag from firmware we patch a branch
into _switch() which takes us to a count cache flush sequence.
That sequence in turn may be patched to return early if we detect that
the CPU supports accelerating the flush sequence in hardware.
Add debugfs support for reporting the state of the flush, as well as
runtime disabling it.
And modify the spectre_v2 sysfs file to report the state of the
software flush.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Add security feature flags to indicate the need for software to flush
the count cache on context switch, and for the presence of a hardware
assisted count cache flush.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Add a macro and some helper C functions for patching single asm
instructions.
The gas macro means we can do something like:
1: nop
patch_site 1b, patch__foo
Which is less visually distracting than defining a GLOBAL symbol at 1,
and also doesn't pollute the symbol table which can confuse eg. perf.
These are obviously similar to our existing feature sections, but are
not automatically patched based on CPU/MMU features, rather they are
designed to be manually patched by C code at some arbitrary point.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Implement the barrier_nospec as a isync;sync instruction sequence.
The implementation uses the infrastructure built for BOOK3S 64.
Signed-off-by: Diana Craciun <diana.craciun@nxp.com>
[mpe: Split out of larger patch]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Currently we require platform code to call setup_barrier_nospec(). But
if we add an empty definition for the !CONFIG_PPC_BARRIER_NOSPEC case
then we can call it in setup_arch().
Signed-off-by: Diana Craciun <diana.craciun@nxp.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Add a config symbol to encode which platforms support the
barrier_nospec speculation barrier. Currently this is just Book3S 64
but we will add Book3E in a future patch.
Signed-off-by: Diana Craciun <diana.craciun@nxp.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
We pass the "loc" (location) parameter to MASKABLE_EXCEPTION and
friends, but it's not used, so drop it.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
_MASKABLE_RELON_EXCEPTION_PSERIES() does nothing useful, update all
callers to use __MASKABLE_RELON_EXCEPTION_PSERIES() directly.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
_MASKABLE_EXCEPTION_PSERIES() does nothing useful, update all callers
to use __MASKABLE_EXCEPTION_PSERIES() directly.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The EXCEPTION_RELON_PROLOG_PSERIES_1() macro does the same job as
EXCEPTION_PROLOG_2 (which we just recently created), except for
"RELON" (relocation on) exceptions.
So rename it as such.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
As with the other patches in this series, we are removing the
"PSERIES" from the name as it's no longer meaningful.
In this case it's not simply a case of removing the "PSERIES" as that
would result in a clash with the existing EXCEPTION_PROLOG_1.
Instead we name this one EXCEPTION_PROLOG_2, as it's usually used in
sequence after 0 and 1.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The "PSERIES" in STD_EXCEPTION_PSERIES is to differentiate the macros
from the legacy iSeries versions, which are called
STD_EXCEPTION_ISERIES. It is not anything to do with pseries vs
powernv or powermac etc.
We removed the legacy iSeries code in 2012, in commit 8ee3e0d69623x
("powerpc: Remove the main legacy iSerie platform code").
So remove "PSERIES" from the macros.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
EXCEPTION_RELON_PROLOG_PSERIES() only has two users,
STD_RELON_EXCEPTION_PSERIES() and STD_RELON_EXCEPTION_HV() both of
which "call" SET_SCRATCH0(), so just move SET_SCRATCH0() into
EXCEPTION_RELON_PROLOG_PSERIES().
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
EXCEPTION_PROLOG_PSERIES() only has two users, STD_EXCEPTION_PSERIES()
and STD_EXCEPTION_HV() both of which "call" SET_SCRATCH0(), so just
move SET_SCRATCH0() into EXCEPTION_PROLOG_PSERIES().
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The generic implementation of strlen() reads strings byte per byte.
This patch implements strlen() in assembly based on a read of entire
words, in the same spirit as what some other arches and glibc do.
On a 8xx the time spent in strlen is reduced by 3/4 for long strings.
strlen() selftest on an 8xx provides the following values:
Before the patch (ie with the generic strlen() in lib/string.c):
len 256 : time = 1.195055
len 016 : time = 0.083745
len 008 : time = 0.046828
len 004 : time = 0.028390
After the patch:
len 256 : time = 0.272185 ==> 78% improvment
len 016 : time = 0.040632 ==> 51% improvment
len 008 : time = 0.033060 ==> 29% improvment
len 004 : time = 0.029149 ==> 2% degradation
On a 832x:
Before the patch:
len 256 : time = 0.236125
len 016 : time = 0.018136
len 008 : time = 0.011000
len 004 : time = 0.007229
After the patch:
len 256 : time = 0.094950 ==> 60% improvment
len 016 : time = 0.013357 ==> 26% improvment
len 008 : time = 0.010586 ==> 4% improvment
len 004 : time = 0.008784
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
rtas_log_buf is a buffer to hold RTAS event data that are communicated
to kernel by hypervisor. This buffer is then used to pass RTAS event
data to user through proc fs. This buffer is allocated from
vmalloc (non-linear mapping) area.
On Machine check interrupt, register r3 points to RTAS extended event
log passed by hypervisor that contains the MCE event. The pseries
machine check handler then logs this error into rtas_log_buf. The
rtas_log_buf is a vmalloc-ed (non-linear) buffer we end up taking up a
page fault (vector 0x300) while accessing it. Since machine check
interrupt handler runs in NMI context we can not afford to take any
page fault. Page faults are not honored in NMI context and causes
kernel panic. Apart from that, as Nick pointed out,
pSeries_log_error() also takes a spin_lock while logging error which
is not safe in NMI context. It may endup in deadlock if we get another
MCE before releasing the lock. Fix this by deferring the logging of
rtas error to irq work queue.
Current implementation uses two different buffers to hold rtas error
log depending on whether extended log is provided or not. This makes
bit difficult to identify which buffer has valid data that needs to
logged later in irq work. Simplify this using single buffer, one per
paca, and copy rtas log to it irrespective of whether extended log is
provided or not. Allocate this buffer below RMA region so that it can
be accessed in real mode mce handler.
Fixes: b96672dd84 ("powerpc: Machine check interrupt is a non-maskable interrupt")
Cc: stable@vger.kernel.org # v4.14+
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
It's identical to xive_teardown_cpu() so just use the latter
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When the mm is being torn down there will be a full PID flush so
there is no need to flush the TLB on page size changes.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-----BEGIN PGP SIGNATURE-----
iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAltU8z0eHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiG5X8H/2fJr7m3k242+t76
sitwvx1eoPqTgryW59dRKm9IuXAGA+AjauvHzaz1QxomeQa50JghGWefD0eiJfkA
1AphQ/24EOiAbbVk084dAI/C2p122dE4D5Fy7CrfLnuouyrbFaZI5STbnrRct7sR
9deeYW0GDHO1Uenp4WDCj0baaqJqaevZ+7GG09DnWpya2nQtSkGBjqn6GpYmrfOU
mqFuxAX8mEOW6cwK16y/vYtnVjuuMAiZ63/OJ8AQ6d6ArGLwAsdn7f8Fn4I4tEr2
L0d3CRLUyegms4++Dmlu05k64buQu46WlPhjCZc5/Ts4kjrNxBuHejj2/jeSnUSt
vJJlibI=
=42a5
-----END PGP SIGNATURE-----
Merge tag 'v4.18-rc6' into HEAD
Pull bug fixes into the KVM development tree to avoid nasty conflicts.