2005-04-17 00:20:36 +02:00
|
|
|
/*
|
|
|
|
* mm/mprotect.c
|
|
|
|
*
|
|
|
|
* (C) Copyright 1994 Linus Torvalds
|
|
|
|
* (C) Copyright 2002 Christoph Hellwig
|
|
|
|
*
|
2009-01-05 15:06:29 +01:00
|
|
|
* Address space accounting code <alan@lxorguk.ukuu.org.uk>
|
2005-04-17 00:20:36 +02:00
|
|
|
* (C) Copyright 2002 Red Hat Inc, All Rights Reserved
|
|
|
|
*/
|
|
|
|
|
|
|
|
#include <linux/mm.h>
|
|
|
|
#include <linux/hugetlb.h>
|
|
|
|
#include <linux/shm.h>
|
|
|
|
#include <linux/mman.h>
|
|
|
|
#include <linux/fs.h>
|
|
|
|
#include <linux/highmem.h>
|
|
|
|
#include <linux/security.h>
|
|
|
|
#include <linux/mempolicy.h>
|
|
|
|
#include <linux/personality.h>
|
|
|
|
#include <linux/syscalls.h>
|
[PATCH] Swapless page migration: add R/W migration entries
Implement read/write migration ptes
We take the upper two swapfiles for the two types of migration ptes and define
a series of macros in swapops.h.
The VM is modified to handle the migration entries. migration entries can
only be encountered when the page they are pointing to is locked. This limits
the number of places one has to fix. We also check in copy_pte_range and in
mprotect_pte_range() for migration ptes.
We check for migration ptes in do_swap_cache and call a function that will
then wait on the page lock. This allows us to effectively stop all accesses
to apge.
Migration entries are created by try_to_unmap if called for migration and
removed by local functions in migrate.c
From: Hugh Dickins <hugh@veritas.com>
Several times while testing swapless page migration (I've no NUMA, just
hacking it up to migrate recklessly while running load), I've hit the
BUG_ON(!PageLocked(p)) in migration_entry_to_page.
This comes from an orphaned migration entry, unrelated to the current
correctly locked migration, but hit by remove_anon_migration_ptes as it
checks an address in each vma of the anon_vma list.
Such an orphan may be left behind if an earlier migration raced with fork:
copy_one_pte can duplicate a migration entry from parent to child, after
remove_anon_migration_ptes has checked the child vma, but before it has
removed it from the parent vma. (If the process were later to fault on this
orphaned entry, it would hit the same BUG from migration_entry_wait.)
This could be fixed by locking anon_vma in copy_one_pte, but we'd rather
not. There's no such problem with file pages, because vma_prio_tree_add
adds child vma after parent vma, and the page table locking at each end is
enough to serialize. Follow that example with anon_vma: add new vmas to the
tail instead of the head.
(There's no corresponding problem when inserting migration entries,
because a missed pte will leave the page count and mapcount high, which is
allowed for. And there's no corresponding problem when migrating via swap,
because a leftover swap entry will be correctly faulted. But the swapless
method has no refcounting of its entries.)
From: Ingo Molnar <mingo@elte.hu>
pte_unmap_unlock() takes the pte pointer as an argument.
From: Hugh Dickins <hugh@veritas.com>
Several times while testing swapless page migration, gcc has tried to exec
a pointer instead of a string: smells like COW mappings are not being
properly write-protected on fork.
The protection in copy_one_pte looks very convincing, until at last you
realize that the second arg to make_migration_entry is a boolean "write",
and SWP_MIGRATION_READ is 30.
Anyway, it's better done like in change_pte_range, using
is_write_migration_entry and make_migration_entry_read.
From: Hugh Dickins <hugh@veritas.com>
Remove unnecessary obfuscation from sys_swapon's range check on swap type,
which blew up causing memory corruption once swapless migration made
MAX_SWAPFILES no longer 2 ^ MAX_SWAPFILES_SHIFT.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
From: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 11:03:35 +02:00
|
|
|
#include <linux/swap.h>
|
|
|
|
#include <linux/swapops.h>
|
mmu-notifiers: core
With KVM/GFP/XPMEM there isn't just the primary CPU MMU pointing to pages.
There are secondary MMUs (with secondary sptes and secondary tlbs) too.
sptes in the kvm case are shadow pagetables, but when I say spte in
mmu-notifier context, I mean "secondary pte". In GRU case there's no
actual secondary pte and there's only a secondary tlb because the GRU
secondary MMU has no knowledge about sptes and every secondary tlb miss
event in the MMU always generates a page fault that has to be resolved by
the CPU (this is not the case of KVM where the a secondary tlb miss will
walk sptes in hardware and it will refill the secondary tlb transparently
to software if the corresponding spte is present). The same way
zap_page_range has to invalidate the pte before freeing the page, the spte
(and secondary tlb) must also be invalidated before any page is freed and
reused.
Currently we take a page_count pin on every page mapped by sptes, but that
means the pages can't be swapped whenever they're mapped by any spte
because they're part of the guest working set. Furthermore a spte unmap
event can immediately lead to a page to be freed when the pin is released
(so requiring the same complex and relatively slow tlb_gather smp safe
logic we have in zap_page_range and that can be avoided completely if the
spte unmap event doesn't require an unpin of the page previously mapped in
the secondary MMU).
The mmu notifiers allow kvm/GRU/XPMEM to attach to the tsk->mm and know
when the VM is swapping or freeing or doing anything on the primary MMU so
that the secondary MMU code can drop sptes before the pages are freed,
avoiding all page pinning and allowing 100% reliable swapping of guest
physical address space. Furthermore it avoids the code that teardown the
mappings of the secondary MMU, to implement a logic like tlb_gather in
zap_page_range that would require many IPI to flush other cpu tlbs, for
each fixed number of spte unmapped.
To make an example: if what happens on the primary MMU is a protection
downgrade (from writeable to wrprotect) the secondary MMU mappings will be
invalidated, and the next secondary-mmu-page-fault will call
get_user_pages and trigger a do_wp_page through get_user_pages if it
called get_user_pages with write=1, and it'll re-establishing an updated
spte or secondary-tlb-mapping on the copied page. Or it will setup a
readonly spte or readonly tlb mapping if it's a guest-read, if it calls
get_user_pages with write=0. This is just an example.
This allows to map any page pointed by any pte (and in turn visible in the
primary CPU MMU), into a secondary MMU (be it a pure tlb like GRU, or an
full MMU with both sptes and secondary-tlb like the shadow-pagetable layer
with kvm), or a remote DMA in software like XPMEM (hence needing of
schedule in XPMEM code to send the invalidate to the remote node, while no
need to schedule in kvm/gru as it's an immediate event like invalidating
primary-mmu pte).
At least for KVM without this patch it's impossible to swap guests
reliably. And having this feature and removing the page pin allows
several other optimizations that simplify life considerably.
Dependencies:
1) mm_take_all_locks() to register the mmu notifier when the whole VM
isn't doing anything with "mm". This allows mmu notifier users to keep
track if the VM is in the middle of the invalidate_range_begin/end
critical section with an atomic counter incraese in range_begin and
decreased in range_end. No secondary MMU page fault is allowed to map
any spte or secondary tlb reference, while the VM is in the middle of
range_begin/end as any page returned by get_user_pages in that critical
section could later immediately be freed without any further
->invalidate_page notification (invalidate_range_begin/end works on
ranges and ->invalidate_page isn't called immediately before freeing
the page). To stop all page freeing and pagetable overwrites the
mmap_sem must be taken in write mode and all other anon_vma/i_mmap
locks must be taken too.
2) It'd be a waste to add branches in the VM if nobody could possibly
run KVM/GRU/XPMEM on the kernel, so mmu notifiers will only enabled if
CONFIG_KVM=m/y. In the current kernel kvm won't yet take advantage of
mmu notifiers, but this already allows to compile a KVM external module
against a kernel with mmu notifiers enabled and from the next pull from
kvm.git we'll start using them. And GRU/XPMEM will also be able to
continue the development by enabling KVM=m in their config, until they
submit all GRU/XPMEM GPLv2 code to the mainline kernel. Then they can
also enable MMU_NOTIFIERS in the same way KVM does it (even if KVM=n).
This guarantees nobody selects MMU_NOTIFIER=y if KVM and GRU and XPMEM
are all =n.
The mmu_notifier_register call can fail because mm_take_all_locks may be
interrupted by a signal and return -EINTR. Because mmu_notifier_reigster
is used when a driver startup, a failure can be gracefully handled. Here
an example of the change applied to kvm to register the mmu notifiers.
Usually when a driver startups other allocations are required anyway and
-ENOMEM failure paths exists already.
struct kvm *kvm_arch_create_vm(void)
{
struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL);
+ int err;
if (!kvm)
return ERR_PTR(-ENOMEM);
INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
+ kvm->arch.mmu_notifier.ops = &kvm_mmu_notifier_ops;
+ err = mmu_notifier_register(&kvm->arch.mmu_notifier, current->mm);
+ if (err) {
+ kfree(kvm);
+ return ERR_PTR(err);
+ }
+
return kvm;
}
mmu_notifier_unregister returns void and it's reliable.
The patch also adds a few needed but missing includes that would prevent
kernel to compile after these changes on non-x86 archs (x86 didn't need
them by luck).
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix mm/filemap_xip.c build]
[akpm@linux-foundation.org: fix mm/mmu_notifier.c build]
Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Jack Steiner <steiner@sgi.com>
Cc: Robin Holt <holt@sgi.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Kanoj Sarcar <kanojsarcar@yahoo.com>
Cc: Roland Dreier <rdreier@cisco.com>
Cc: Steve Wise <swise@opengridcomputing.com>
Cc: Avi Kivity <avi@qumranet.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Anthony Liguori <aliguori@us.ibm.com>
Cc: Chris Wright <chrisw@redhat.com>
Cc: Marcelo Tosatti <marcelo@kvack.org>
Cc: Eric Dumazet <dada1@cosmosbay.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Cc: Izik Eidus <izike@qumranet.com>
Cc: Anthony Liguori <aliguori@us.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-29 00:46:29 +02:00
|
|
|
#include <linux/mmu_notifier.h>
|
2009-01-06 23:39:16 +01:00
|
|
|
#include <linux/migrate.h>
|
perf: Do the big rename: Performance Counters -> Performance Events
Bye-bye Performance Counters, welcome Performance Events!
In the past few months the perfcounters subsystem has grown out its
initial role of counting hardware events, and has become (and is
becoming) a much broader generic event enumeration, reporting, logging,
monitoring, analysis facility.
Naming its core object 'perf_counter' and naming the subsystem
'perfcounters' has become more and more of a misnomer. With pending
code like hw-breakpoints support the 'counter' name is less and
less appropriate.
All in one, we've decided to rename the subsystem to 'performance
events' and to propagate this rename through all fields, variables
and API names. (in an ABI compatible fashion)
The word 'event' is also a bit shorter than 'counter' - which makes
it slightly more convenient to write/handle as well.
Thanks goes to Stephane Eranian who first observed this misnomer and
suggested a rename.
User-space tooling and ABI compatibility is not affected - this patch
should be function-invariant. (Also, defconfigs were not touched to
keep the size down.)
This patch has been generated via the following script:
FILES=$(find * -type f | grep -vE 'oprofile|[^K]config')
sed -i \
-e 's/PERF_EVENT_/PERF_RECORD_/g' \
-e 's/PERF_COUNTER/PERF_EVENT/g' \
-e 's/perf_counter/perf_event/g' \
-e 's/nb_counters/nb_events/g' \
-e 's/swcounter/swevent/g' \
-e 's/tpcounter_event/tp_event/g' \
$FILES
for N in $(find . -name perf_counter.[ch]); do
M=$(echo $N | sed 's/perf_counter/perf_event/g')
mv $N $M
done
FILES=$(find . -name perf_event.*)
sed -i \
-e 's/COUNTER_MASK/REG_MASK/g' \
-e 's/COUNTER/EVENT/g' \
-e 's/\<event\>/event_id/g' \
-e 's/counter/event/g' \
-e 's/Counter/Event/g' \
$FILES
... to keep it as correct as possible. This script can also be
used by anyone who has pending perfcounters patches - it converts
a Linux kernel tree over to the new naming. We tried to time this
change to the point in time where the amount of pending patches
is the smallest: the end of the merge window.
Namespace clashes were fixed up in a preparatory patch - and some
stylistic fallout will be fixed up in a subsequent patch.
( NOTE: 'counters' are still the proper terminology when we deal
with hardware registers - and these sed scripts are a bit
over-eager in renaming them. I've undone some of that, but
in case there's something left where 'counter' would be
better than 'event' we can undo that on an individual basis
instead of touching an otherwise nicely automated patch. )
Suggested-by: Stephane Eranian <eranian@google.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Paul Mackerras <paulus@samba.org>
Reviewed-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <linux-arch@vger.kernel.org>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-09-21 12:02:48 +02:00
|
|
|
#include <linux/perf_event.h>
|
x86/pkeys: Allocation/free syscalls
This patch adds two new system calls:
int pkey_alloc(unsigned long flags, unsigned long init_access_rights)
int pkey_free(int pkey);
These implement an "allocator" for the protection keys
themselves, which can be thought of as analogous to the allocator
that the kernel has for file descriptors. The kernel tracks
which numbers are in use, and only allows operations on keys that
are valid. A key which was not obtained by pkey_alloc() may not,
for instance, be passed to pkey_mprotect().
These system calls are also very important given the kernel's use
of pkeys to implement execute-only support. These help ensure
that userspace can never assume that it has control of a key
unless it first asks the kernel. The kernel does not promise to
preserve PKRU (right register) contents except for allocated
pkeys.
The 'init_access_rights' argument to pkey_alloc() specifies the
rights that will be established for the returned pkey. For
instance:
pkey = pkey_alloc(flags, PKEY_DENY_WRITE);
will allocate 'pkey', but also sets the bits in PKRU[1] such that
writing to 'pkey' is already denied.
The kernel does not prevent pkey_free() from successfully freeing
in-use pkeys (those still assigned to a memory range by
pkey_mprotect()). It would be expensive to implement the checks
for this, so we instead say, "Just don't do it" since sane
software will never do it anyway.
Any piece of userspace calling pkey_alloc() needs to be prepared
for it to fail. Why? pkey_alloc() returns the same error code
(ENOSPC) when there are no pkeys and when pkeys are unsupported.
They can be unsupported for a whole host of reasons, so apps must
be prepared for this. Also, libraries or LD_PRELOADs might steal
keys before an application gets access to them.
This allocation mechanism could be implemented in userspace.
Even if we did it in userspace, we would still need additional
user/kernel interfaces to tell userspace which keys are being
used by the kernel internally (such as for execute-only
mappings). Having the kernel provide this facility completely
removes the need for these additional interfaces, or having an
implementation of this in userspace at all.
Note that we have to make changes to all of the architectures
that do not use mman-common.h because we use the new
PKEY_DENY_ACCESS/WRITE macros in arch-independent code.
1. PKRU is the Protection Key Rights User register. It is a
usermode-accessible register that controls whether writes
and/or access to each individual pkey is allowed or denied.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: linux-arch@vger.kernel.org
Cc: Dave Hansen <dave@sr71.net>
Cc: arnd@arndb.de
Cc: linux-api@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: luto@kernel.org
Cc: akpm@linux-foundation.org
Cc: torvalds@linux-foundation.org
Link: http://lkml.kernel.org/r/20160729163015.444FE75F@viggo.jf.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-07-29 18:30:15 +02:00
|
|
|
#include <linux/pkeys.h>
|
2014-01-22 00:51:02 +01:00
|
|
|
#include <linux/ksm.h>
|
2016-12-24 20:46:01 +01:00
|
|
|
#include <linux/uaccess.h>
|
2005-04-17 00:20:36 +02:00
|
|
|
#include <asm/pgtable.h>
|
|
|
|
#include <asm/cacheflush.h>
|
x86/pkeys: Allocation/free syscalls
This patch adds two new system calls:
int pkey_alloc(unsigned long flags, unsigned long init_access_rights)
int pkey_free(int pkey);
These implement an "allocator" for the protection keys
themselves, which can be thought of as analogous to the allocator
that the kernel has for file descriptors. The kernel tracks
which numbers are in use, and only allows operations on keys that
are valid. A key which was not obtained by pkey_alloc() may not,
for instance, be passed to pkey_mprotect().
These system calls are also very important given the kernel's use
of pkeys to implement execute-only support. These help ensure
that userspace can never assume that it has control of a key
unless it first asks the kernel. The kernel does not promise to
preserve PKRU (right register) contents except for allocated
pkeys.
The 'init_access_rights' argument to pkey_alloc() specifies the
rights that will be established for the returned pkey. For
instance:
pkey = pkey_alloc(flags, PKEY_DENY_WRITE);
will allocate 'pkey', but also sets the bits in PKRU[1] such that
writing to 'pkey' is already denied.
The kernel does not prevent pkey_free() from successfully freeing
in-use pkeys (those still assigned to a memory range by
pkey_mprotect()). It would be expensive to implement the checks
for this, so we instead say, "Just don't do it" since sane
software will never do it anyway.
Any piece of userspace calling pkey_alloc() needs to be prepared
for it to fail. Why? pkey_alloc() returns the same error code
(ENOSPC) when there are no pkeys and when pkeys are unsupported.
They can be unsupported for a whole host of reasons, so apps must
be prepared for this. Also, libraries or LD_PRELOADs might steal
keys before an application gets access to them.
This allocation mechanism could be implemented in userspace.
Even if we did it in userspace, we would still need additional
user/kernel interfaces to tell userspace which keys are being
used by the kernel internally (such as for execute-only
mappings). Having the kernel provide this facility completely
removes the need for these additional interfaces, or having an
implementation of this in userspace at all.
Note that we have to make changes to all of the architectures
that do not use mman-common.h because we use the new
PKEY_DENY_ACCESS/WRITE macros in arch-independent code.
1. PKRU is the Protection Key Rights User register. It is a
usermode-accessible register that controls whether writes
and/or access to each individual pkey is allowed or denied.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: linux-arch@vger.kernel.org
Cc: Dave Hansen <dave@sr71.net>
Cc: arnd@arndb.de
Cc: linux-api@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: luto@kernel.org
Cc: akpm@linux-foundation.org
Cc: torvalds@linux-foundation.org
Link: http://lkml.kernel.org/r/20160729163015.444FE75F@viggo.jf.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-07-29 18:30:15 +02:00
|
|
|
#include <asm/mmu_context.h>
|
2005-04-17 00:20:36 +02:00
|
|
|
#include <asm/tlbflush.h>
|
|
|
|
|
mm: fix mprotect() behaviour on VM_LOCKED VMAs
On mlock(2) we trigger COW on private writable VMA to avoid faults in
future.
mm/gup.c:
840 long populate_vma_page_range(struct vm_area_struct *vma,
841 unsigned long start, unsigned long end, int *nonblocking)
842 {
...
855 * We want to touch writable mappings with a write fault in order
856 * to break COW, except for shared mappings because these don't COW
857 * and we would not want to dirty them for nothing.
858 */
859 if ((vma->vm_flags & (VM_WRITE | VM_SHARED)) == VM_WRITE)
860 gup_flags |= FOLL_WRITE;
But we miss this case when we make VM_LOCKED VMA writeable via
mprotect(2). The test case:
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <sys/types.h>
#define PAGE_SIZE 4096
int main(int argc, char **argv)
{
struct rusage usage;
long before;
char *p;
int fd;
/* Create a file and populate first page of page cache */
fd = open("/tmp", O_TMPFILE | O_RDWR, S_IRUSR | S_IWUSR);
write(fd, "1", 1);
/* Create a *read-only* *private* mapping of the file */
p = mmap(NULL, PAGE_SIZE, PROT_READ, MAP_PRIVATE, fd, 0);
/*
* Since the mapping is read-only, mlock() will populate the mapping
* with PTEs pointing to page cache without triggering COW.
*/
mlock(p, PAGE_SIZE);
/*
* Mapping became read-write, but it's still populated with PTEs
* pointing to page cache.
*/
mprotect(p, PAGE_SIZE, PROT_READ | PROT_WRITE);
getrusage(RUSAGE_SELF, &usage);
before = usage.ru_minflt;
/* Trigger COW: fault in mlock()ed VMA. */
*p = 1;
getrusage(RUSAGE_SELF, &usage);
printf("faults: %ld\n", usage.ru_minflt - before);
return 0;
}
$ ./test
faults: 1
Let's fix it by triggering populating of VMA in mprotect_fixup() on this
condition. We don't care about population error as we don't in other
similar cases i.e. mremap.
[akpm@linux-foundation.org: tweak comment text]
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-25 01:56:10 +02:00
|
|
|
#include "internal.h"
|
|
|
|
|
2012-10-25 14:16:32 +02:00
|
|
|
static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
|
2006-09-26 08:30:59 +02:00
|
|
|
unsigned long addr, unsigned long end, pgprot_t newprot,
|
2013-10-07 12:29:25 +02:00
|
|
|
int dirty_accountable, int prot_numa)
|
2005-04-17 00:20:36 +02:00
|
|
|
{
|
2012-10-25 14:16:32 +02:00
|
|
|
struct mm_struct *mm = vma->vm_mm;
|
[PATCH] Swapless page migration: add R/W migration entries
Implement read/write migration ptes
We take the upper two swapfiles for the two types of migration ptes and define
a series of macros in swapops.h.
The VM is modified to handle the migration entries. migration entries can
only be encountered when the page they are pointing to is locked. This limits
the number of places one has to fix. We also check in copy_pte_range and in
mprotect_pte_range() for migration ptes.
We check for migration ptes in do_swap_cache and call a function that will
then wait on the page lock. This allows us to effectively stop all accesses
to apge.
Migration entries are created by try_to_unmap if called for migration and
removed by local functions in migrate.c
From: Hugh Dickins <hugh@veritas.com>
Several times while testing swapless page migration (I've no NUMA, just
hacking it up to migrate recklessly while running load), I've hit the
BUG_ON(!PageLocked(p)) in migration_entry_to_page.
This comes from an orphaned migration entry, unrelated to the current
correctly locked migration, but hit by remove_anon_migration_ptes as it
checks an address in each vma of the anon_vma list.
Such an orphan may be left behind if an earlier migration raced with fork:
copy_one_pte can duplicate a migration entry from parent to child, after
remove_anon_migration_ptes has checked the child vma, but before it has
removed it from the parent vma. (If the process were later to fault on this
orphaned entry, it would hit the same BUG from migration_entry_wait.)
This could be fixed by locking anon_vma in copy_one_pte, but we'd rather
not. There's no such problem with file pages, because vma_prio_tree_add
adds child vma after parent vma, and the page table locking at each end is
enough to serialize. Follow that example with anon_vma: add new vmas to the
tail instead of the head.
(There's no corresponding problem when inserting migration entries,
because a missed pte will leave the page count and mapcount high, which is
allowed for. And there's no corresponding problem when migrating via swap,
because a leftover swap entry will be correctly faulted. But the swapless
method has no refcounting of its entries.)
From: Ingo Molnar <mingo@elte.hu>
pte_unmap_unlock() takes the pte pointer as an argument.
From: Hugh Dickins <hugh@veritas.com>
Several times while testing swapless page migration, gcc has tried to exec
a pointer instead of a string: smells like COW mappings are not being
properly write-protected on fork.
The protection in copy_one_pte looks very convincing, until at last you
realize that the second arg to make_migration_entry is a boolean "write",
and SWP_MIGRATION_READ is 30.
Anyway, it's better done like in change_pte_range, using
is_write_migration_entry and make_migration_entry_read.
From: Hugh Dickins <hugh@veritas.com>
Remove unnecessary obfuscation from sys_swapon's range check on swap type,
which blew up causing memory corruption once swapless migration made
MAX_SWAPFILES no longer 2 ^ MAX_SWAPFILES_SHIFT.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
From: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 11:03:35 +02:00
|
|
|
pte_t *pte, oldpte;
|
2005-10-30 02:16:27 +01:00
|
|
|
spinlock_t *ptl;
|
2012-11-19 03:14:23 +01:00
|
|
|
unsigned long pages = 0;
|
2016-12-13 01:41:47 +01:00
|
|
|
int target_node = NUMA_NO_NODE;
|
2005-04-17 00:20:36 +02:00
|
|
|
|
2017-02-23 00:44:12 +01:00
|
|
|
/*
|
|
|
|
* Can be called with only the mmap_sem for reading by
|
|
|
|
* prot_numa so we must check the pmd isn't constantly
|
|
|
|
* changing from under us from pmd_none to pmd_trans_huge
|
|
|
|
* and/or the other way around.
|
|
|
|
*/
|
|
|
|
if (pmd_trans_unstable(pmd))
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The pmd points to a regular pte so the pmd can't change
|
|
|
|
* from under us even if the mmap_sem is only hold for
|
|
|
|
* reading.
|
|
|
|
*/
|
|
|
|
pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
|
2014-04-08 00:36:56 +02:00
|
|
|
if (!pte)
|
|
|
|
return 0;
|
|
|
|
|
2016-12-13 01:41:47 +01:00
|
|
|
/* Get target node for single threaded private VMAs */
|
|
|
|
if (prot_numa && !(vma->vm_flags & VM_SHARED) &&
|
|
|
|
atomic_read(&vma->vm_mm->mm_users) == 1)
|
|
|
|
target_node = numa_node_id();
|
|
|
|
|
2006-10-01 08:29:33 +02:00
|
|
|
arch_enter_lazy_mmu_mode();
|
2005-04-17 00:20:36 +02:00
|
|
|
do {
|
[PATCH] Swapless page migration: add R/W migration entries
Implement read/write migration ptes
We take the upper two swapfiles for the two types of migration ptes and define
a series of macros in swapops.h.
The VM is modified to handle the migration entries. migration entries can
only be encountered when the page they are pointing to is locked. This limits
the number of places one has to fix. We also check in copy_pte_range and in
mprotect_pte_range() for migration ptes.
We check for migration ptes in do_swap_cache and call a function that will
then wait on the page lock. This allows us to effectively stop all accesses
to apge.
Migration entries are created by try_to_unmap if called for migration and
removed by local functions in migrate.c
From: Hugh Dickins <hugh@veritas.com>
Several times while testing swapless page migration (I've no NUMA, just
hacking it up to migrate recklessly while running load), I've hit the
BUG_ON(!PageLocked(p)) in migration_entry_to_page.
This comes from an orphaned migration entry, unrelated to the current
correctly locked migration, but hit by remove_anon_migration_ptes as it
checks an address in each vma of the anon_vma list.
Such an orphan may be left behind if an earlier migration raced with fork:
copy_one_pte can duplicate a migration entry from parent to child, after
remove_anon_migration_ptes has checked the child vma, but before it has
removed it from the parent vma. (If the process were later to fault on this
orphaned entry, it would hit the same BUG from migration_entry_wait.)
This could be fixed by locking anon_vma in copy_one_pte, but we'd rather
not. There's no such problem with file pages, because vma_prio_tree_add
adds child vma after parent vma, and the page table locking at each end is
enough to serialize. Follow that example with anon_vma: add new vmas to the
tail instead of the head.
(There's no corresponding problem when inserting migration entries,
because a missed pte will leave the page count and mapcount high, which is
allowed for. And there's no corresponding problem when migrating via swap,
because a leftover swap entry will be correctly faulted. But the swapless
method has no refcounting of its entries.)
From: Ingo Molnar <mingo@elte.hu>
pte_unmap_unlock() takes the pte pointer as an argument.
From: Hugh Dickins <hugh@veritas.com>
Several times while testing swapless page migration, gcc has tried to exec
a pointer instead of a string: smells like COW mappings are not being
properly write-protected on fork.
The protection in copy_one_pte looks very convincing, until at last you
realize that the second arg to make_migration_entry is a boolean "write",
and SWP_MIGRATION_READ is 30.
Anyway, it's better done like in change_pte_range, using
is_write_migration_entry and make_migration_entry_read.
From: Hugh Dickins <hugh@veritas.com>
Remove unnecessary obfuscation from sys_swapon's range check on swap type,
which blew up causing memory corruption once swapless migration made
MAX_SWAPFILES no longer 2 ^ MAX_SWAPFILES_SHIFT.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
From: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 11:03:35 +02:00
|
|
|
oldpte = *pte;
|
|
|
|
if (pte_present(oldpte)) {
|
2005-04-17 00:20:36 +02:00
|
|
|
pte_t ptent;
|
2015-03-25 23:55:40 +01:00
|
|
|
bool preserve_write = prot_numa && pte_write(oldpte);
|
2005-04-17 00:20:36 +02:00
|
|
|
|
2015-02-12 23:58:35 +01:00
|
|
|
/*
|
|
|
|
* Avoid trapping faults against the zero or KSM
|
|
|
|
* pages. See similar comment in change_huge_pmd.
|
|
|
|
*/
|
|
|
|
if (prot_numa) {
|
|
|
|
struct page *page;
|
|
|
|
|
|
|
|
page = vm_normal_page(vma, addr, oldpte);
|
|
|
|
if (!page || PageKsm(page))
|
|
|
|
continue;
|
2015-02-12 23:58:44 +01:00
|
|
|
|
|
|
|
/* Avoid TLB flush if possible */
|
|
|
|
if (pte_protnone(oldpte))
|
|
|
|
continue;
|
2016-12-13 01:41:47 +01:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Don't mess with PTEs if page is already on the node
|
|
|
|
* a single-threaded process is running on.
|
|
|
|
*/
|
|
|
|
if (target_node == page_to_nid(page))
|
|
|
|
continue;
|
2015-02-12 23:58:35 +01:00
|
|
|
}
|
|
|
|
|
2015-02-12 23:58:22 +01:00
|
|
|
ptent = ptep_modify_prot_start(mm, addr, pte);
|
|
|
|
ptent = pte_modify(ptent, newprot);
|
2015-03-25 23:55:40 +01:00
|
|
|
if (preserve_write)
|
2017-02-24 23:59:16 +01:00
|
|
|
ptent = pte_mk_savedwrite(ptent);
|
2012-10-25 14:16:32 +02:00
|
|
|
|
2015-02-12 23:58:22 +01:00
|
|
|
/* Avoid taking write faults for known dirty pages */
|
|
|
|
if (dirty_accountable && pte_dirty(ptent) &&
|
|
|
|
(pte_soft_dirty(ptent) ||
|
|
|
|
!(vma->vm_flags & VM_SOFTDIRTY))) {
|
|
|
|
ptent = pte_mkwrite(ptent);
|
2012-10-25 14:16:32 +02:00
|
|
|
}
|
2015-02-12 23:58:22 +01:00
|
|
|
ptep_modify_prot_commit(mm, addr, pte, ptent);
|
|
|
|
pages++;
|
2015-02-10 23:10:04 +01:00
|
|
|
} else if (IS_ENABLED(CONFIG_MIGRATION)) {
|
[PATCH] Swapless page migration: add R/W migration entries
Implement read/write migration ptes
We take the upper two swapfiles for the two types of migration ptes and define
a series of macros in swapops.h.
The VM is modified to handle the migration entries. migration entries can
only be encountered when the page they are pointing to is locked. This limits
the number of places one has to fix. We also check in copy_pte_range and in
mprotect_pte_range() for migration ptes.
We check for migration ptes in do_swap_cache and call a function that will
then wait on the page lock. This allows us to effectively stop all accesses
to apge.
Migration entries are created by try_to_unmap if called for migration and
removed by local functions in migrate.c
From: Hugh Dickins <hugh@veritas.com>
Several times while testing swapless page migration (I've no NUMA, just
hacking it up to migrate recklessly while running load), I've hit the
BUG_ON(!PageLocked(p)) in migration_entry_to_page.
This comes from an orphaned migration entry, unrelated to the current
correctly locked migration, but hit by remove_anon_migration_ptes as it
checks an address in each vma of the anon_vma list.
Such an orphan may be left behind if an earlier migration raced with fork:
copy_one_pte can duplicate a migration entry from parent to child, after
remove_anon_migration_ptes has checked the child vma, but before it has
removed it from the parent vma. (If the process were later to fault on this
orphaned entry, it would hit the same BUG from migration_entry_wait.)
This could be fixed by locking anon_vma in copy_one_pte, but we'd rather
not. There's no such problem with file pages, because vma_prio_tree_add
adds child vma after parent vma, and the page table locking at each end is
enough to serialize. Follow that example with anon_vma: add new vmas to the
tail instead of the head.
(There's no corresponding problem when inserting migration entries,
because a missed pte will leave the page count and mapcount high, which is
allowed for. And there's no corresponding problem when migrating via swap,
because a leftover swap entry will be correctly faulted. But the swapless
method has no refcounting of its entries.)
From: Ingo Molnar <mingo@elte.hu>
pte_unmap_unlock() takes the pte pointer as an argument.
From: Hugh Dickins <hugh@veritas.com>
Several times while testing swapless page migration, gcc has tried to exec
a pointer instead of a string: smells like COW mappings are not being
properly write-protected on fork.
The protection in copy_one_pte looks very convincing, until at last you
realize that the second arg to make_migration_entry is a boolean "write",
and SWP_MIGRATION_READ is 30.
Anyway, it's better done like in change_pte_range, using
is_write_migration_entry and make_migration_entry_read.
From: Hugh Dickins <hugh@veritas.com>
Remove unnecessary obfuscation from sys_swapon's range check on swap type,
which blew up causing memory corruption once swapless migration made
MAX_SWAPFILES no longer 2 ^ MAX_SWAPFILES_SHIFT.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
From: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 11:03:35 +02:00
|
|
|
swp_entry_t entry = pte_to_swp_entry(oldpte);
|
|
|
|
|
|
|
|
if (is_write_migration_entry(entry)) {
|
2013-10-16 22:46:51 +02:00
|
|
|
pte_t newpte;
|
[PATCH] Swapless page migration: add R/W migration entries
Implement read/write migration ptes
We take the upper two swapfiles for the two types of migration ptes and define
a series of macros in swapops.h.
The VM is modified to handle the migration entries. migration entries can
only be encountered when the page they are pointing to is locked. This limits
the number of places one has to fix. We also check in copy_pte_range and in
mprotect_pte_range() for migration ptes.
We check for migration ptes in do_swap_cache and call a function that will
then wait on the page lock. This allows us to effectively stop all accesses
to apge.
Migration entries are created by try_to_unmap if called for migration and
removed by local functions in migrate.c
From: Hugh Dickins <hugh@veritas.com>
Several times while testing swapless page migration (I've no NUMA, just
hacking it up to migrate recklessly while running load), I've hit the
BUG_ON(!PageLocked(p)) in migration_entry_to_page.
This comes from an orphaned migration entry, unrelated to the current
correctly locked migration, but hit by remove_anon_migration_ptes as it
checks an address in each vma of the anon_vma list.
Such an orphan may be left behind if an earlier migration raced with fork:
copy_one_pte can duplicate a migration entry from parent to child, after
remove_anon_migration_ptes has checked the child vma, but before it has
removed it from the parent vma. (If the process were later to fault on this
orphaned entry, it would hit the same BUG from migration_entry_wait.)
This could be fixed by locking anon_vma in copy_one_pte, but we'd rather
not. There's no such problem with file pages, because vma_prio_tree_add
adds child vma after parent vma, and the page table locking at each end is
enough to serialize. Follow that example with anon_vma: add new vmas to the
tail instead of the head.
(There's no corresponding problem when inserting migration entries,
because a missed pte will leave the page count and mapcount high, which is
allowed for. And there's no corresponding problem when migrating via swap,
because a leftover swap entry will be correctly faulted. But the swapless
method has no refcounting of its entries.)
From: Ingo Molnar <mingo@elte.hu>
pte_unmap_unlock() takes the pte pointer as an argument.
From: Hugh Dickins <hugh@veritas.com>
Several times while testing swapless page migration, gcc has tried to exec
a pointer instead of a string: smells like COW mappings are not being
properly write-protected on fork.
The protection in copy_one_pte looks very convincing, until at last you
realize that the second arg to make_migration_entry is a boolean "write",
and SWP_MIGRATION_READ is 30.
Anyway, it's better done like in change_pte_range, using
is_write_migration_entry and make_migration_entry_read.
From: Hugh Dickins <hugh@veritas.com>
Remove unnecessary obfuscation from sys_swapon's range check on swap type,
which blew up causing memory corruption once swapless migration made
MAX_SWAPFILES no longer 2 ^ MAX_SWAPFILES_SHIFT.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
From: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 11:03:35 +02:00
|
|
|
/*
|
|
|
|
* A protection check is difficult so
|
|
|
|
* just be safe and disable write
|
|
|
|
*/
|
|
|
|
make_migration_entry_read(&entry);
|
2013-10-16 22:46:51 +02:00
|
|
|
newpte = swp_entry_to_pte(entry);
|
|
|
|
if (pte_swp_soft_dirty(oldpte))
|
|
|
|
newpte = pte_swp_mksoft_dirty(newpte);
|
|
|
|
set_pte_at(mm, addr, pte, newpte);
|
2013-10-07 12:28:48 +02:00
|
|
|
|
|
|
|
pages++;
|
[PATCH] Swapless page migration: add R/W migration entries
Implement read/write migration ptes
We take the upper two swapfiles for the two types of migration ptes and define
a series of macros in swapops.h.
The VM is modified to handle the migration entries. migration entries can
only be encountered when the page they are pointing to is locked. This limits
the number of places one has to fix. We also check in copy_pte_range and in
mprotect_pte_range() for migration ptes.
We check for migration ptes in do_swap_cache and call a function that will
then wait on the page lock. This allows us to effectively stop all accesses
to apge.
Migration entries are created by try_to_unmap if called for migration and
removed by local functions in migrate.c
From: Hugh Dickins <hugh@veritas.com>
Several times while testing swapless page migration (I've no NUMA, just
hacking it up to migrate recklessly while running load), I've hit the
BUG_ON(!PageLocked(p)) in migration_entry_to_page.
This comes from an orphaned migration entry, unrelated to the current
correctly locked migration, but hit by remove_anon_migration_ptes as it
checks an address in each vma of the anon_vma list.
Such an orphan may be left behind if an earlier migration raced with fork:
copy_one_pte can duplicate a migration entry from parent to child, after
remove_anon_migration_ptes has checked the child vma, but before it has
removed it from the parent vma. (If the process were later to fault on this
orphaned entry, it would hit the same BUG from migration_entry_wait.)
This could be fixed by locking anon_vma in copy_one_pte, but we'd rather
not. There's no such problem with file pages, because vma_prio_tree_add
adds child vma after parent vma, and the page table locking at each end is
enough to serialize. Follow that example with anon_vma: add new vmas to the
tail instead of the head.
(There's no corresponding problem when inserting migration entries,
because a missed pte will leave the page count and mapcount high, which is
allowed for. And there's no corresponding problem when migrating via swap,
because a leftover swap entry will be correctly faulted. But the swapless
method has no refcounting of its entries.)
From: Ingo Molnar <mingo@elte.hu>
pte_unmap_unlock() takes the pte pointer as an argument.
From: Hugh Dickins <hugh@veritas.com>
Several times while testing swapless page migration, gcc has tried to exec
a pointer instead of a string: smells like COW mappings are not being
properly write-protected on fork.
The protection in copy_one_pte looks very convincing, until at last you
realize that the second arg to make_migration_entry is a boolean "write",
and SWP_MIGRATION_READ is 30.
Anyway, it's better done like in change_pte_range, using
is_write_migration_entry and make_migration_entry_read.
From: Hugh Dickins <hugh@veritas.com>
Remove unnecessary obfuscation from sys_swapon's range check on swap type,
which blew up causing memory corruption once swapless migration made
MAX_SWAPFILES no longer 2 ^ MAX_SWAPFILES_SHIFT.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
From: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 11:03:35 +02:00
|
|
|
}
|
2005-04-17 00:20:36 +02:00
|
|
|
}
|
|
|
|
} while (pte++, addr += PAGE_SIZE, addr != end);
|
2006-10-01 08:29:33 +02:00
|
|
|
arch_leave_lazy_mmu_mode();
|
2005-10-30 02:16:27 +01:00
|
|
|
pte_unmap_unlock(pte - 1, ptl);
|
2012-11-19 03:14:23 +01:00
|
|
|
|
|
|
|
return pages;
|
2005-04-17 00:20:36 +02:00
|
|
|
}
|
|
|
|
|
2012-12-18 23:23:17 +01:00
|
|
|
static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
|
|
|
|
pud_t *pud, unsigned long addr, unsigned long end,
|
|
|
|
pgprot_t newprot, int dirty_accountable, int prot_numa)
|
2005-04-17 00:20:36 +02:00
|
|
|
{
|
|
|
|
pmd_t *pmd;
|
2014-04-08 00:36:57 +02:00
|
|
|
struct mm_struct *mm = vma->vm_mm;
|
2005-04-17 00:20:36 +02:00
|
|
|
unsigned long next;
|
2012-11-19 03:14:23 +01:00
|
|
|
unsigned long pages = 0;
|
2013-11-13 00:08:32 +01:00
|
|
|
unsigned long nr_huge_updates = 0;
|
2014-04-08 00:36:57 +02:00
|
|
|
unsigned long mni_start = 0;
|
2005-04-17 00:20:36 +02:00
|
|
|
|
|
|
|
pmd = pmd_offset(pud, addr);
|
|
|
|
do {
|
2013-10-07 12:29:14 +02:00
|
|
|
unsigned long this_pages;
|
|
|
|
|
2005-04-17 00:20:36 +02:00
|
|
|
next = pmd_addr_end(addr, end);
|
2016-01-16 01:56:52 +01:00
|
|
|
if (!pmd_trans_huge(*pmd) && !pmd_devmap(*pmd)
|
|
|
|
&& pmd_none_or_clear_bad(pmd))
|
2014-04-08 00:36:55 +02:00
|
|
|
continue;
|
2014-04-08 00:36:57 +02:00
|
|
|
|
|
|
|
/* invoke the mmu notifier if the pmd is populated */
|
|
|
|
if (!mni_start) {
|
|
|
|
mni_start = addr;
|
|
|
|
mmu_notifier_invalidate_range_start(mm, mni_start, end);
|
|
|
|
}
|
|
|
|
|
2016-01-16 01:56:52 +01:00
|
|
|
if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
|
2016-02-12 01:13:03 +01:00
|
|
|
if (next - addr != HPAGE_PMD_SIZE) {
|
2016-12-13 01:42:20 +01:00
|
|
|
__split_huge_pmd(vma, pmd, addr, false, NULL);
|
2016-02-12 01:13:03 +01:00
|
|
|
} else {
|
2013-10-07 12:28:49 +02:00
|
|
|
int nr_ptes = change_huge_pmd(vma, pmd, addr,
|
2015-02-12 23:58:35 +01:00
|
|
|
newprot, prot_numa);
|
2013-10-07 12:28:49 +02:00
|
|
|
|
|
|
|
if (nr_ptes) {
|
2013-11-13 00:08:32 +01:00
|
|
|
if (nr_ptes == HPAGE_PMD_NR) {
|
|
|
|
pages += HPAGE_PMD_NR;
|
|
|
|
nr_huge_updates++;
|
|
|
|
}
|
2014-04-08 00:36:56 +02:00
|
|
|
|
|
|
|
/* huge pmd was handled */
|
2013-10-07 12:28:49 +02:00
|
|
|
continue;
|
|
|
|
}
|
2012-11-19 03:14:23 +01:00
|
|
|
}
|
2014-04-08 00:36:55 +02:00
|
|
|
/* fall through, the trans huge pmd just split */
|
2011-01-14 00:47:04 +01:00
|
|
|
}
|
2013-10-07 12:29:14 +02:00
|
|
|
this_pages = change_pte_range(vma, pmd, addr, next, newprot,
|
2013-10-07 12:29:25 +02:00
|
|
|
dirty_accountable, prot_numa);
|
2013-10-07 12:29:14 +02:00
|
|
|
pages += this_pages;
|
2005-04-17 00:20:36 +02:00
|
|
|
} while (pmd++, addr = next, addr != end);
|
2012-11-19 03:14:23 +01:00
|
|
|
|
2014-04-08 00:36:57 +02:00
|
|
|
if (mni_start)
|
|
|
|
mmu_notifier_invalidate_range_end(mm, mni_start, end);
|
|
|
|
|
2013-11-13 00:08:32 +01:00
|
|
|
if (nr_huge_updates)
|
|
|
|
count_vm_numa_events(NUMA_HUGE_PTE_UPDATES, nr_huge_updates);
|
2012-11-19 03:14:23 +01:00
|
|
|
return pages;
|
2005-04-17 00:20:36 +02:00
|
|
|
}
|
|
|
|
|
2012-12-18 23:23:17 +01:00
|
|
|
static inline unsigned long change_pud_range(struct vm_area_struct *vma,
|
|
|
|
pgd_t *pgd, unsigned long addr, unsigned long end,
|
|
|
|
pgprot_t newprot, int dirty_accountable, int prot_numa)
|
2005-04-17 00:20:36 +02:00
|
|
|
{
|
|
|
|
pud_t *pud;
|
|
|
|
unsigned long next;
|
2012-11-19 03:14:23 +01:00
|
|
|
unsigned long pages = 0;
|
2005-04-17 00:20:36 +02:00
|
|
|
|
|
|
|
pud = pud_offset(pgd, addr);
|
|
|
|
do {
|
|
|
|
next = pud_addr_end(addr, end);
|
|
|
|
if (pud_none_or_clear_bad(pud))
|
|
|
|
continue;
|
2012-11-19 03:14:23 +01:00
|
|
|
pages += change_pmd_range(vma, pud, addr, next, newprot,
|
2012-10-25 14:16:32 +02:00
|
|
|
dirty_accountable, prot_numa);
|
2005-04-17 00:20:36 +02:00
|
|
|
} while (pud++, addr = next, addr != end);
|
2012-11-19 03:14:23 +01:00
|
|
|
|
|
|
|
return pages;
|
2005-04-17 00:20:36 +02:00
|
|
|
}
|
|
|
|
|
2012-11-19 03:14:23 +01:00
|
|
|
static unsigned long change_protection_range(struct vm_area_struct *vma,
|
2006-09-26 08:30:59 +02:00
|
|
|
unsigned long addr, unsigned long end, pgprot_t newprot,
|
2012-10-25 14:16:32 +02:00
|
|
|
int dirty_accountable, int prot_numa)
|
2005-04-17 00:20:36 +02:00
|
|
|
{
|
|
|
|
struct mm_struct *mm = vma->vm_mm;
|
|
|
|
pgd_t *pgd;
|
|
|
|
unsigned long next;
|
|
|
|
unsigned long start = addr;
|
2012-11-19 03:14:23 +01:00
|
|
|
unsigned long pages = 0;
|
2005-04-17 00:20:36 +02:00
|
|
|
|
|
|
|
BUG_ON(addr >= end);
|
|
|
|
pgd = pgd_offset(mm, addr);
|
|
|
|
flush_cache_range(vma, addr, end);
|
mm: fix TLB flush race between migration, and change_protection_range
There are a few subtle races, between change_protection_range (used by
mprotect and change_prot_numa) on one side, and NUMA page migration and
compaction on the other side.
The basic race is that there is a time window between when the PTE gets
made non-present (PROT_NONE or NUMA), and the TLB is flushed.
During that time, a CPU may continue writing to the page.
This is fine most of the time, however compaction or the NUMA migration
code may come in, and migrate the page away.
When that happens, the CPU may continue writing, through the cached
translation, to what is no longer the current memory location of the
process.
This only affects x86, which has a somewhat optimistic pte_accessible.
All other architectures appear to be safe, and will either always flush,
or flush whenever there is a valid mapping, even with no permissions
(SPARC).
The basic race looks like this:
CPU A CPU B CPU C
load TLB entry
make entry PTE/PMD_NUMA
fault on entry
read/write old page
start migrating page
change PTE/PMD to new page
read/write old page [*]
flush TLB
reload TLB from new entry
read/write new page
lose data
[*] the old page may belong to a new user at this point!
The obvious fix is to flush remote TLB entries, by making sure that
pte_accessible aware of the fact that PROT_NONE and PROT_NUMA memory may
still be accessible if there is a TLB flush pending for the mm.
This should fix both NUMA migration and compaction.
[mgorman@suse.de: fix build]
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Alex Thorlton <athorlton@sgi.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-12-19 02:08:44 +01:00
|
|
|
set_tlb_flush_pending(mm);
|
2005-04-17 00:20:36 +02:00
|
|
|
do {
|
|
|
|
next = pgd_addr_end(addr, end);
|
|
|
|
if (pgd_none_or_clear_bad(pgd))
|
|
|
|
continue;
|
2012-11-19 03:14:23 +01:00
|
|
|
pages += change_pud_range(vma, pgd, addr, next, newprot,
|
2012-10-25 14:16:32 +02:00
|
|
|
dirty_accountable, prot_numa);
|
2005-04-17 00:20:36 +02:00
|
|
|
} while (pgd++, addr = next, addr != end);
|
2012-11-19 03:14:23 +01:00
|
|
|
|
2012-11-19 03:14:24 +01:00
|
|
|
/* Only flush the TLB if we actually modified any entries: */
|
|
|
|
if (pages)
|
|
|
|
flush_tlb_range(vma, start, end);
|
mm: fix TLB flush race between migration, and change_protection_range
There are a few subtle races, between change_protection_range (used by
mprotect and change_prot_numa) on one side, and NUMA page migration and
compaction on the other side.
The basic race is that there is a time window between when the PTE gets
made non-present (PROT_NONE or NUMA), and the TLB is flushed.
During that time, a CPU may continue writing to the page.
This is fine most of the time, however compaction or the NUMA migration
code may come in, and migrate the page away.
When that happens, the CPU may continue writing, through the cached
translation, to what is no longer the current memory location of the
process.
This only affects x86, which has a somewhat optimistic pte_accessible.
All other architectures appear to be safe, and will either always flush,
or flush whenever there is a valid mapping, even with no permissions
(SPARC).
The basic race looks like this:
CPU A CPU B CPU C
load TLB entry
make entry PTE/PMD_NUMA
fault on entry
read/write old page
start migrating page
change PTE/PMD to new page
read/write old page [*]
flush TLB
reload TLB from new entry
read/write new page
lose data
[*] the old page may belong to a new user at this point!
The obvious fix is to flush remote TLB entries, by making sure that
pte_accessible aware of the fact that PROT_NONE and PROT_NUMA memory may
still be accessible if there is a TLB flush pending for the mm.
This should fix both NUMA migration and compaction.
[mgorman@suse.de: fix build]
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Alex Thorlton <athorlton@sgi.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-12-19 02:08:44 +01:00
|
|
|
clear_tlb_flush_pending(mm);
|
2012-11-19 03:14:23 +01:00
|
|
|
|
|
|
|
return pages;
|
|
|
|
}
|
|
|
|
|
|
|
|
unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
|
|
|
|
unsigned long end, pgprot_t newprot,
|
2012-10-25 14:16:32 +02:00
|
|
|
int dirty_accountable, int prot_numa)
|
2012-11-19 03:14:23 +01:00
|
|
|
{
|
|
|
|
unsigned long pages;
|
|
|
|
|
|
|
|
if (is_vm_hugetlb_page(vma))
|
|
|
|
pages = hugetlb_change_protection(vma, start, end, newprot);
|
|
|
|
else
|
2012-10-25 14:16:32 +02:00
|
|
|
pages = change_protection_range(vma, start, end, newprot, dirty_accountable, prot_numa);
|
2012-11-19 03:14:23 +01:00
|
|
|
|
|
|
|
return pages;
|
2005-04-17 00:20:36 +02:00
|
|
|
}
|
|
|
|
|
2007-07-19 10:48:16 +02:00
|
|
|
int
|
2005-04-17 00:20:36 +02:00
|
|
|
mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
|
|
|
|
unsigned long start, unsigned long end, unsigned long newflags)
|
|
|
|
{
|
|
|
|
struct mm_struct *mm = vma->vm_mm;
|
|
|
|
unsigned long oldflags = vma->vm_flags;
|
|
|
|
long nrpages = (end - start) >> PAGE_SHIFT;
|
|
|
|
unsigned long charged = 0;
|
|
|
|
pgoff_t pgoff;
|
|
|
|
int error;
|
2006-09-26 08:30:59 +02:00
|
|
|
int dirty_accountable = 0;
|
2005-04-17 00:20:36 +02:00
|
|
|
|
|
|
|
if (newflags == oldflags) {
|
|
|
|
*pprev = vma;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If we make a private mapping writable we increase our commit;
|
|
|
|
* but (without finer accounting) cannot reduce our commit if we
|
2009-02-10 15:02:27 +01:00
|
|
|
* make it unwritable again. hugetlb mapping were accounted for
|
|
|
|
* even if read-only so there is no need to account for them here
|
2005-04-17 00:20:36 +02:00
|
|
|
*/
|
|
|
|
if (newflags & VM_WRITE) {
|
2016-01-15 00:22:07 +01:00
|
|
|
/* Check space limits when area turns into data. */
|
|
|
|
if (!may_expand_vm(mm, newflags, nrpages) &&
|
|
|
|
may_expand_vm(mm, oldflags, nrpages))
|
|
|
|
return -ENOMEM;
|
2009-02-10 15:02:27 +01:00
|
|
|
if (!(oldflags & (VM_ACCOUNT|VM_WRITE|VM_HUGETLB|
|
2008-07-24 06:27:28 +02:00
|
|
|
VM_SHARED|VM_NORESERVE))) {
|
2005-04-17 00:20:36 +02:00
|
|
|
charged = nrpages;
|
2012-02-13 04:58:52 +01:00
|
|
|
if (security_vm_enough_memory_mm(mm, charged))
|
2005-04-17 00:20:36 +02:00
|
|
|
return -ENOMEM;
|
|
|
|
newflags |= VM_ACCOUNT;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* First try to merge with previous and/or next vma.
|
|
|
|
*/
|
|
|
|
pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
|
|
|
|
*pprev = vma_merge(mm, *pprev, start, end, newflags,
|
2015-09-05 00:46:24 +02:00
|
|
|
vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma),
|
|
|
|
vma->vm_userfaultfd_ctx);
|
2005-04-17 00:20:36 +02:00
|
|
|
if (*pprev) {
|
|
|
|
vma = *pprev;
|
mm: vma_merge: fix vm_page_prot SMP race condition against rmap_walk
The rmap_walk can access vm_page_prot (and potentially vm_flags in the
pte/pmd manipulations). So it's not safe to wait the caller to update
the vm_page_prot/vm_flags after vma_merge returned potentially removing
the "next" vma and extending the "current" vma over the
next->vm_start,vm_end range, but still with the "current" vma
vm_page_prot, after releasing the rmap locks.
The vm_page_prot/vm_flags must be transferred from the "next" vma to the
current vma while vma_merge still holds the rmap locks.
The side effect of this race condition is pte corruption during migrate
as remove_migration_ptes when run on a address of the "next" vma that
got removed, used the vm_page_prot of the current vma.
migrate mprotect
------------ -------------
migrating in "next" vma
vma_merge() # removes "next" vma and
# extends "current" vma
# current vma is not with
# vm_page_prot updated
remove_migration_ptes
read vm_page_prot of current "vma"
establish pte with wrong permissions
vm_set_page_prot(vma) # too late!
change_protection in the old vma range
only, next range is not updated
This caused segmentation faults and potentially memory corruption in
heavy mprotect loads with some light page migration caused by compaction
in the background.
Hugh Dickins pointed out the comment about the Odd case 8 in vma_merge
which confirms the case 8 is only buggy one where the race can trigger,
in all other vma_merge cases the above cannot happen.
This fix removes the oddness factor from case 8 and it converts it from:
AAAA
PPPPNNNNXXXX -> PPPPNNNNNNNN
to:
AAAA
PPPPNNNNXXXX -> PPPPXXXXXXXX
XXXX has the right vma properties for the whole merged vma returned by
vma_adjust, so it solves the problem fully. It has the added benefits
that the callers could stop updating vma properties when vma_merge
succeeds however the callers are not updated by this patch (there are
bits like VM_SOFTDIRTY that still need special care for the whole range,
as the vma merging ignores them, but as long as they're not processed by
rmap walks and instead they're accessed with the mmap_sem at least for
reading, they are fine not to be updated within vma_adjust before
releasing the rmap_locks).
Link: http://lkml.kernel.org/r/1474309513-20313-1-git-send-email-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Aditya Mandaleeka <adityam@microsoft.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Jan Vorlicek <janvorli@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-08 02:01:28 +02:00
|
|
|
VM_WARN_ON((vma->vm_flags ^ newflags) & ~VM_SOFTDIRTY);
|
2005-04-17 00:20:36 +02:00
|
|
|
goto success;
|
|
|
|
}
|
|
|
|
|
|
|
|
*pprev = vma;
|
|
|
|
|
|
|
|
if (start != vma->vm_start) {
|
|
|
|
error = split_vma(mm, vma, start, 1);
|
|
|
|
if (error)
|
|
|
|
goto fail;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (end != vma->vm_end) {
|
|
|
|
error = split_vma(mm, vma, end, 0);
|
|
|
|
if (error)
|
|
|
|
goto fail;
|
|
|
|
}
|
|
|
|
|
|
|
|
success:
|
|
|
|
/*
|
|
|
|
* vm_flags and vm_page_prot are protected by the mmap_sem
|
|
|
|
* held in write mode.
|
|
|
|
*/
|
|
|
|
vma->vm_flags = newflags;
|
2016-10-08 02:01:22 +02:00
|
|
|
dirty_accountable = vma_wants_writenotify(vma, vma->vm_page_prot);
|
mm: softdirty: enable write notifications on VMAs after VM_SOFTDIRTY cleared
For VMAs that don't want write notifications, PTEs created for read faults
have their write bit set. If the read fault happens after VM_SOFTDIRTY is
cleared, then the PTE's softdirty bit will remain clear after subsequent
writes.
Here's a simple code snippet to demonstrate the bug:
char* m = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE,
MAP_ANONYMOUS | MAP_SHARED, -1, 0);
system("echo 4 > /proc/$PPID/clear_refs"); /* clear VM_SOFTDIRTY */
assert(*m == '\0'); /* new PTE allows write access */
assert(!soft_dirty(x));
*m = 'x'; /* should dirty the page */
assert(soft_dirty(x)); /* fails */
With this patch, write notifications are enabled when VM_SOFTDIRTY is
cleared. Furthermore, to avoid unnecessary faults, write notifications
are disabled when VM_SOFTDIRTY is set.
As a side effect of enabling and disabling write notifications with
care, this patch fixes a bug in mprotect where vm_page_prot bits set by
drivers were zapped on mprotect. An analogous bug was fixed in mmap by
commit c9d0bf241451 ("mm: uncached vma support with writenotify").
Signed-off-by: Peter Feiner <pfeiner@google.com>
Reported-by: Peter Feiner <pfeiner@google.com>
Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Jamie Liu <jamieliu@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-14 00:55:46 +02:00
|
|
|
vma_set_page_prot(vma);
|
2006-09-26 08:30:57 +02:00
|
|
|
|
2012-12-18 23:23:17 +01:00
|
|
|
change_protection(vma, start, end, vma->vm_page_prot,
|
|
|
|
dirty_accountable, 0);
|
2012-11-19 03:14:23 +01:00
|
|
|
|
mm: fix mprotect() behaviour on VM_LOCKED VMAs
On mlock(2) we trigger COW on private writable VMA to avoid faults in
future.
mm/gup.c:
840 long populate_vma_page_range(struct vm_area_struct *vma,
841 unsigned long start, unsigned long end, int *nonblocking)
842 {
...
855 * We want to touch writable mappings with a write fault in order
856 * to break COW, except for shared mappings because these don't COW
857 * and we would not want to dirty them for nothing.
858 */
859 if ((vma->vm_flags & (VM_WRITE | VM_SHARED)) == VM_WRITE)
860 gup_flags |= FOLL_WRITE;
But we miss this case when we make VM_LOCKED VMA writeable via
mprotect(2). The test case:
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <sys/types.h>
#define PAGE_SIZE 4096
int main(int argc, char **argv)
{
struct rusage usage;
long before;
char *p;
int fd;
/* Create a file and populate first page of page cache */
fd = open("/tmp", O_TMPFILE | O_RDWR, S_IRUSR | S_IWUSR);
write(fd, "1", 1);
/* Create a *read-only* *private* mapping of the file */
p = mmap(NULL, PAGE_SIZE, PROT_READ, MAP_PRIVATE, fd, 0);
/*
* Since the mapping is read-only, mlock() will populate the mapping
* with PTEs pointing to page cache without triggering COW.
*/
mlock(p, PAGE_SIZE);
/*
* Mapping became read-write, but it's still populated with PTEs
* pointing to page cache.
*/
mprotect(p, PAGE_SIZE, PROT_READ | PROT_WRITE);
getrusage(RUSAGE_SELF, &usage);
before = usage.ru_minflt;
/* Trigger COW: fault in mlock()ed VMA. */
*p = 1;
getrusage(RUSAGE_SELF, &usage);
printf("faults: %ld\n", usage.ru_minflt - before);
return 0;
}
$ ./test
faults: 1
Let's fix it by triggering populating of VMA in mprotect_fixup() on this
condition. We don't care about population error as we don't in other
similar cases i.e. mremap.
[akpm@linux-foundation.org: tweak comment text]
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-25 01:56:10 +02:00
|
|
|
/*
|
|
|
|
* Private VM_LOCKED VMA becoming writable: trigger COW to avoid major
|
|
|
|
* fault on access.
|
|
|
|
*/
|
|
|
|
if ((oldflags & (VM_WRITE | VM_SHARED | VM_LOCKED)) == VM_LOCKED &&
|
|
|
|
(newflags & VM_WRITE)) {
|
|
|
|
populate_vma_page_range(vma, start, end, NULL);
|
|
|
|
}
|
|
|
|
|
2016-01-15 00:22:07 +01:00
|
|
|
vm_stat_account(mm, oldflags, -nrpages);
|
|
|
|
vm_stat_account(mm, newflags, nrpages);
|
2010-11-08 20:29:07 +01:00
|
|
|
perf_event_mmap(vma);
|
2005-04-17 00:20:36 +02:00
|
|
|
return 0;
|
|
|
|
|
|
|
|
fail:
|
|
|
|
vm_unacct_memory(charged);
|
|
|
|
return error;
|
|
|
|
}
|
|
|
|
|
mm: Implement new pkey_mprotect() system call
pkey_mprotect() is just like mprotect, except it also takes a
protection key as an argument. On systems that do not support
protection keys, it still works, but requires that key=0.
Otherwise it does exactly what mprotect does.
I expect it to get used like this, if you want to guarantee that
any mapping you create can *never* be accessed without the right
protection keys set up.
int real_prot = PROT_READ|PROT_WRITE;
pkey = pkey_alloc(0, PKEY_DENY_ACCESS);
ptr = mmap(NULL, PAGE_SIZE, PROT_NONE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
ret = pkey_mprotect(ptr, PAGE_SIZE, real_prot, pkey);
This way, there is *no* window where the mapping is accessible
since it was always either PROT_NONE or had a protection key set
that denied all access.
We settled on 'unsigned long' for the type of the key here. We
only need 4 bits on x86 today, but I figured that other
architectures might need some more space.
Semantically, we have a bit of a problem if we combine this
syscall with our previously-introduced execute-only support:
What do we do when we mix execute-only pkey use with
pkey_mprotect() use? For instance:
pkey_mprotect(ptr, PAGE_SIZE, PROT_WRITE, 6); // set pkey=6
mprotect(ptr, PAGE_SIZE, PROT_EXEC); // set pkey=X_ONLY_PKEY?
mprotect(ptr, PAGE_SIZE, PROT_WRITE); // is pkey=6 again?
To solve that, we make the plain-mprotect()-initiated execute-only
support only apply to VMAs that have the default protection key (0)
set on them.
Proposed semantics:
1. protection key 0 is special and represents the default,
"unassigned" protection key. It is always allocated.
2. mprotect() never affects a mapping's pkey_mprotect()-assigned
protection key. A protection key of 0 (even if set explicitly)
represents an unassigned protection key.
2a. mprotect(PROT_EXEC) on a mapping with an assigned protection
key may or may not result in a mapping with execute-only
properties. pkey_mprotect() plus pkey_set() on all threads
should be used to _guarantee_ execute-only semantics if this
is not a strong enough semantic.
3. mprotect(PROT_EXEC) may result in an "execute-only" mapping. The
kernel will internally attempt to allocate and dedicate a
protection key for the purpose of execute-only mappings. This
may not be possible in cases where there are no free protection
keys available. It can also happen, of course, in situations
where there is no hardware support for protection keys.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: linux-arch@vger.kernel.org
Cc: Dave Hansen <dave@sr71.net>
Cc: arnd@arndb.de
Cc: linux-api@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: luto@kernel.org
Cc: akpm@linux-foundation.org
Cc: torvalds@linux-foundation.org
Link: http://lkml.kernel.org/r/20160729163012.3DDD36C4@viggo.jf.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-07-29 18:30:12 +02:00
|
|
|
/*
|
|
|
|
* pkey==-1 when doing a legacy mprotect()
|
|
|
|
*/
|
|
|
|
static int do_mprotect_pkey(unsigned long start, size_t len,
|
|
|
|
unsigned long prot, int pkey)
|
2005-04-17 00:20:36 +02:00
|
|
|
{
|
mm/core, x86/mm/pkeys: Add execute-only protection keys support
Protection keys provide new page-based protection in hardware.
But, they have an interesting attribute: they only affect data
accesses and never affect instruction fetches. That means that
if we set up some memory which is set as "access-disabled" via
protection keys, we can still execute from it.
This patch uses protection keys to set up mappings to do just that.
If a user calls:
mmap(..., PROT_EXEC);
or
mprotect(ptr, sz, PROT_EXEC);
(note PROT_EXEC-only without PROT_READ/WRITE), the kernel will
notice this, and set a special protection key on the memory. It
also sets the appropriate bits in the Protection Keys User Rights
(PKRU) register so that the memory becomes unreadable and
unwritable.
I haven't found any userspace that does this today. With this
facility in place, we expect userspace to move to use it
eventually. Userspace _could_ start doing this today. Any
PROT_EXEC calls get converted to PROT_READ inside the kernel, and
would transparently be upgraded to "true" PROT_EXEC with this
code. IOW, userspace never has to do any PROT_EXEC runtime
detection.
This feature provides enhanced protection against leaking
executable memory contents. This helps thwart attacks which are
attempting to find ROP gadgets on the fly.
But, the security provided by this approach is not comprehensive.
The PKRU register which controls access permissions is a normal
user register writable from unprivileged userspace. An attacker
who can execute the 'wrpkru' instruction can easily disable the
protection provided by this feature.
The protection key that is used for execute-only support is
permanently dedicated at compile time. This is fine for now
because there is currently no API to set a protection key other
than this one.
Despite there being a constant PKRU value across the entire
system, we do not set it unless this feature is in use in a
process. That is to preserve the PKRU XSAVE 'init state',
which can lead to faster context switches.
PKRU *is* a user register and the kernel is modifying it. That
means that code doing:
pkru = rdpkru()
pkru |= 0x100;
mmap(..., PROT_EXEC);
wrpkru(pkru);
could lose the bits in PKRU that enforce execute-only
permissions. To avoid this, we suggest avoiding ever calling
mmap() or mprotect() when the PKRU value is expected to be
unstable.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Chen Gang <gang.chen.5i5j@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: David Hildenbrand <dahi@linux.vnet.ibm.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Piotr Kwapulinski <kwapulinski.piotr@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: Vladimir Murzin <vladimir.murzin@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: keescook@google.com
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20160212210240.CB4BB5CA@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-12 22:02:40 +01:00
|
|
|
unsigned long nstart, end, tmp, reqprot;
|
2005-04-17 00:20:36 +02:00
|
|
|
struct vm_area_struct *vma, *prev;
|
|
|
|
int error = -EINVAL;
|
|
|
|
const int grows = prot & (PROT_GROWSDOWN|PROT_GROWSUP);
|
2016-03-22 22:27:51 +01:00
|
|
|
const bool rier = (current->personality & READ_IMPLIES_EXEC) &&
|
|
|
|
(prot & PROT_READ);
|
|
|
|
|
2005-04-17 00:20:36 +02:00
|
|
|
prot &= ~(PROT_GROWSDOWN|PROT_GROWSUP);
|
|
|
|
if (grows == (PROT_GROWSDOWN|PROT_GROWSUP)) /* can't be both */
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
if (start & ~PAGE_MASK)
|
|
|
|
return -EINVAL;
|
|
|
|
if (!len)
|
|
|
|
return 0;
|
|
|
|
len = PAGE_ALIGN(len);
|
|
|
|
end = start + len;
|
|
|
|
if (end <= start)
|
|
|
|
return -ENOMEM;
|
2008-07-07 16:28:51 +02:00
|
|
|
if (!arch_validate_prot(prot))
|
2005-04-17 00:20:36 +02:00
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
reqprot = prot;
|
|
|
|
|
2016-05-24 01:25:27 +02:00
|
|
|
if (down_write_killable(¤t->mm->mmap_sem))
|
|
|
|
return -EINTR;
|
2005-04-17 00:20:36 +02:00
|
|
|
|
x86/pkeys: Allocation/free syscalls
This patch adds two new system calls:
int pkey_alloc(unsigned long flags, unsigned long init_access_rights)
int pkey_free(int pkey);
These implement an "allocator" for the protection keys
themselves, which can be thought of as analogous to the allocator
that the kernel has for file descriptors. The kernel tracks
which numbers are in use, and only allows operations on keys that
are valid. A key which was not obtained by pkey_alloc() may not,
for instance, be passed to pkey_mprotect().
These system calls are also very important given the kernel's use
of pkeys to implement execute-only support. These help ensure
that userspace can never assume that it has control of a key
unless it first asks the kernel. The kernel does not promise to
preserve PKRU (right register) contents except for allocated
pkeys.
The 'init_access_rights' argument to pkey_alloc() specifies the
rights that will be established for the returned pkey. For
instance:
pkey = pkey_alloc(flags, PKEY_DENY_WRITE);
will allocate 'pkey', but also sets the bits in PKRU[1] such that
writing to 'pkey' is already denied.
The kernel does not prevent pkey_free() from successfully freeing
in-use pkeys (those still assigned to a memory range by
pkey_mprotect()). It would be expensive to implement the checks
for this, so we instead say, "Just don't do it" since sane
software will never do it anyway.
Any piece of userspace calling pkey_alloc() needs to be prepared
for it to fail. Why? pkey_alloc() returns the same error code
(ENOSPC) when there are no pkeys and when pkeys are unsupported.
They can be unsupported for a whole host of reasons, so apps must
be prepared for this. Also, libraries or LD_PRELOADs might steal
keys before an application gets access to them.
This allocation mechanism could be implemented in userspace.
Even if we did it in userspace, we would still need additional
user/kernel interfaces to tell userspace which keys are being
used by the kernel internally (such as for execute-only
mappings). Having the kernel provide this facility completely
removes the need for these additional interfaces, or having an
implementation of this in userspace at all.
Note that we have to make changes to all of the architectures
that do not use mman-common.h because we use the new
PKEY_DENY_ACCESS/WRITE macros in arch-independent code.
1. PKRU is the Protection Key Rights User register. It is a
usermode-accessible register that controls whether writes
and/or access to each individual pkey is allowed or denied.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: linux-arch@vger.kernel.org
Cc: Dave Hansen <dave@sr71.net>
Cc: arnd@arndb.de
Cc: linux-api@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: luto@kernel.org
Cc: akpm@linux-foundation.org
Cc: torvalds@linux-foundation.org
Link: http://lkml.kernel.org/r/20160729163015.444FE75F@viggo.jf.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-07-29 18:30:15 +02:00
|
|
|
/*
|
|
|
|
* If userspace did not allocate the pkey, do not let
|
|
|
|
* them use it here.
|
|
|
|
*/
|
|
|
|
error = -EINVAL;
|
|
|
|
if ((pkey != -1) && !mm_pkey_is_allocated(current->mm, pkey))
|
|
|
|
goto out;
|
|
|
|
|
2012-03-07 03:23:36 +01:00
|
|
|
vma = find_vma(current->mm, start);
|
2005-04-17 00:20:36 +02:00
|
|
|
error = -ENOMEM;
|
|
|
|
if (!vma)
|
|
|
|
goto out;
|
2012-03-07 03:23:36 +01:00
|
|
|
prev = vma->vm_prev;
|
2005-04-17 00:20:36 +02:00
|
|
|
if (unlikely(grows & PROT_GROWSDOWN)) {
|
|
|
|
if (vma->vm_start >= end)
|
|
|
|
goto out;
|
|
|
|
start = vma->vm_start;
|
|
|
|
error = -EINVAL;
|
|
|
|
if (!(vma->vm_flags & VM_GROWSDOWN))
|
|
|
|
goto out;
|
2012-12-18 23:23:17 +01:00
|
|
|
} else {
|
2005-04-17 00:20:36 +02:00
|
|
|
if (vma->vm_start > start)
|
|
|
|
goto out;
|
|
|
|
if (unlikely(grows & PROT_GROWSUP)) {
|
|
|
|
end = vma->vm_end;
|
|
|
|
error = -EINVAL;
|
|
|
|
if (!(vma->vm_flags & VM_GROWSUP))
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if (start > vma->vm_start)
|
|
|
|
prev = vma;
|
|
|
|
|
|
|
|
for (nstart = start ; ; ) {
|
2016-07-29 18:30:13 +02:00
|
|
|
unsigned long mask_off_old_flags;
|
2005-04-17 00:20:36 +02:00
|
|
|
unsigned long newflags;
|
mm: Implement new pkey_mprotect() system call
pkey_mprotect() is just like mprotect, except it also takes a
protection key as an argument. On systems that do not support
protection keys, it still works, but requires that key=0.
Otherwise it does exactly what mprotect does.
I expect it to get used like this, if you want to guarantee that
any mapping you create can *never* be accessed without the right
protection keys set up.
int real_prot = PROT_READ|PROT_WRITE;
pkey = pkey_alloc(0, PKEY_DENY_ACCESS);
ptr = mmap(NULL, PAGE_SIZE, PROT_NONE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
ret = pkey_mprotect(ptr, PAGE_SIZE, real_prot, pkey);
This way, there is *no* window where the mapping is accessible
since it was always either PROT_NONE or had a protection key set
that denied all access.
We settled on 'unsigned long' for the type of the key here. We
only need 4 bits on x86 today, but I figured that other
architectures might need some more space.
Semantically, we have a bit of a problem if we combine this
syscall with our previously-introduced execute-only support:
What do we do when we mix execute-only pkey use with
pkey_mprotect() use? For instance:
pkey_mprotect(ptr, PAGE_SIZE, PROT_WRITE, 6); // set pkey=6
mprotect(ptr, PAGE_SIZE, PROT_EXEC); // set pkey=X_ONLY_PKEY?
mprotect(ptr, PAGE_SIZE, PROT_WRITE); // is pkey=6 again?
To solve that, we make the plain-mprotect()-initiated execute-only
support only apply to VMAs that have the default protection key (0)
set on them.
Proposed semantics:
1. protection key 0 is special and represents the default,
"unassigned" protection key. It is always allocated.
2. mprotect() never affects a mapping's pkey_mprotect()-assigned
protection key. A protection key of 0 (even if set explicitly)
represents an unassigned protection key.
2a. mprotect(PROT_EXEC) on a mapping with an assigned protection
key may or may not result in a mapping with execute-only
properties. pkey_mprotect() plus pkey_set() on all threads
should be used to _guarantee_ execute-only semantics if this
is not a strong enough semantic.
3. mprotect(PROT_EXEC) may result in an "execute-only" mapping. The
kernel will internally attempt to allocate and dedicate a
protection key for the purpose of execute-only mappings. This
may not be possible in cases where there are no free protection
keys available. It can also happen, of course, in situations
where there is no hardware support for protection keys.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: linux-arch@vger.kernel.org
Cc: Dave Hansen <dave@sr71.net>
Cc: arnd@arndb.de
Cc: linux-api@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: luto@kernel.org
Cc: akpm@linux-foundation.org
Cc: torvalds@linux-foundation.org
Link: http://lkml.kernel.org/r/20160729163012.3DDD36C4@viggo.jf.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-07-29 18:30:12 +02:00
|
|
|
int new_vma_pkey;
|
2005-04-17 00:20:36 +02:00
|
|
|
|
2012-12-18 23:23:17 +01:00
|
|
|
/* Here we know that vma->vm_start <= nstart < vma->vm_end. */
|
2005-04-17 00:20:36 +02:00
|
|
|
|
2016-03-22 22:27:51 +01:00
|
|
|
/* Does the application expect PROT_READ to imply PROT_EXEC */
|
|
|
|
if (rier && (vma->vm_flags & VM_MAYEXEC))
|
|
|
|
prot |= PROT_EXEC;
|
|
|
|
|
2016-07-29 18:30:13 +02:00
|
|
|
/*
|
|
|
|
* Each mprotect() call explicitly passes r/w/x permissions.
|
|
|
|
* If a permission is not passed to mprotect(), it must be
|
|
|
|
* cleared from the VMA.
|
|
|
|
*/
|
|
|
|
mask_off_old_flags = VM_READ | VM_WRITE | VM_EXEC |
|
|
|
|
ARCH_VM_PKEY_FLAGS;
|
|
|
|
|
mm: Implement new pkey_mprotect() system call
pkey_mprotect() is just like mprotect, except it also takes a
protection key as an argument. On systems that do not support
protection keys, it still works, but requires that key=0.
Otherwise it does exactly what mprotect does.
I expect it to get used like this, if you want to guarantee that
any mapping you create can *never* be accessed without the right
protection keys set up.
int real_prot = PROT_READ|PROT_WRITE;
pkey = pkey_alloc(0, PKEY_DENY_ACCESS);
ptr = mmap(NULL, PAGE_SIZE, PROT_NONE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
ret = pkey_mprotect(ptr, PAGE_SIZE, real_prot, pkey);
This way, there is *no* window where the mapping is accessible
since it was always either PROT_NONE or had a protection key set
that denied all access.
We settled on 'unsigned long' for the type of the key here. We
only need 4 bits on x86 today, but I figured that other
architectures might need some more space.
Semantically, we have a bit of a problem if we combine this
syscall with our previously-introduced execute-only support:
What do we do when we mix execute-only pkey use with
pkey_mprotect() use? For instance:
pkey_mprotect(ptr, PAGE_SIZE, PROT_WRITE, 6); // set pkey=6
mprotect(ptr, PAGE_SIZE, PROT_EXEC); // set pkey=X_ONLY_PKEY?
mprotect(ptr, PAGE_SIZE, PROT_WRITE); // is pkey=6 again?
To solve that, we make the plain-mprotect()-initiated execute-only
support only apply to VMAs that have the default protection key (0)
set on them.
Proposed semantics:
1. protection key 0 is special and represents the default,
"unassigned" protection key. It is always allocated.
2. mprotect() never affects a mapping's pkey_mprotect()-assigned
protection key. A protection key of 0 (even if set explicitly)
represents an unassigned protection key.
2a. mprotect(PROT_EXEC) on a mapping with an assigned protection
key may or may not result in a mapping with execute-only
properties. pkey_mprotect() plus pkey_set() on all threads
should be used to _guarantee_ execute-only semantics if this
is not a strong enough semantic.
3. mprotect(PROT_EXEC) may result in an "execute-only" mapping. The
kernel will internally attempt to allocate and dedicate a
protection key for the purpose of execute-only mappings. This
may not be possible in cases where there are no free protection
keys available. It can also happen, of course, in situations
where there is no hardware support for protection keys.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: linux-arch@vger.kernel.org
Cc: Dave Hansen <dave@sr71.net>
Cc: arnd@arndb.de
Cc: linux-api@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: luto@kernel.org
Cc: akpm@linux-foundation.org
Cc: torvalds@linux-foundation.org
Link: http://lkml.kernel.org/r/20160729163012.3DDD36C4@viggo.jf.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-07-29 18:30:12 +02:00
|
|
|
new_vma_pkey = arch_override_mprotect_pkey(vma, prot, pkey);
|
|
|
|
newflags = calc_vm_prot_bits(prot, new_vma_pkey);
|
2016-07-29 18:30:13 +02:00
|
|
|
newflags |= (vma->vm_flags & ~mask_off_old_flags);
|
2005-04-17 00:20:36 +02:00
|
|
|
|
2005-09-21 18:55:39 +02:00
|
|
|
/* newflags >> 4 shift VM_MAY% in place of VM_% */
|
|
|
|
if ((newflags & ~(newflags >> 4)) & (VM_READ | VM_WRITE | VM_EXEC)) {
|
2005-04-17 00:20:36 +02:00
|
|
|
error = -EACCES;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
error = security_file_mprotect(vma, reqprot, prot);
|
|
|
|
if (error)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
tmp = vma->vm_end;
|
|
|
|
if (tmp > end)
|
|
|
|
tmp = end;
|
|
|
|
error = mprotect_fixup(vma, &prev, nstart, tmp, newflags);
|
|
|
|
if (error)
|
|
|
|
goto out;
|
|
|
|
nstart = tmp;
|
|
|
|
|
|
|
|
if (nstart < prev->vm_end)
|
|
|
|
nstart = prev->vm_end;
|
|
|
|
if (nstart >= end)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
vma = prev->vm_next;
|
|
|
|
if (!vma || vma->vm_start != nstart) {
|
|
|
|
error = -ENOMEM;
|
|
|
|
goto out;
|
|
|
|
}
|
2016-03-22 22:27:51 +01:00
|
|
|
prot = reqprot;
|
2005-04-17 00:20:36 +02:00
|
|
|
}
|
|
|
|
out:
|
|
|
|
up_write(¤t->mm->mmap_sem);
|
|
|
|
return error;
|
|
|
|
}
|
mm: Implement new pkey_mprotect() system call
pkey_mprotect() is just like mprotect, except it also takes a
protection key as an argument. On systems that do not support
protection keys, it still works, but requires that key=0.
Otherwise it does exactly what mprotect does.
I expect it to get used like this, if you want to guarantee that
any mapping you create can *never* be accessed without the right
protection keys set up.
int real_prot = PROT_READ|PROT_WRITE;
pkey = pkey_alloc(0, PKEY_DENY_ACCESS);
ptr = mmap(NULL, PAGE_SIZE, PROT_NONE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
ret = pkey_mprotect(ptr, PAGE_SIZE, real_prot, pkey);
This way, there is *no* window where the mapping is accessible
since it was always either PROT_NONE or had a protection key set
that denied all access.
We settled on 'unsigned long' for the type of the key here. We
only need 4 bits on x86 today, but I figured that other
architectures might need some more space.
Semantically, we have a bit of a problem if we combine this
syscall with our previously-introduced execute-only support:
What do we do when we mix execute-only pkey use with
pkey_mprotect() use? For instance:
pkey_mprotect(ptr, PAGE_SIZE, PROT_WRITE, 6); // set pkey=6
mprotect(ptr, PAGE_SIZE, PROT_EXEC); // set pkey=X_ONLY_PKEY?
mprotect(ptr, PAGE_SIZE, PROT_WRITE); // is pkey=6 again?
To solve that, we make the plain-mprotect()-initiated execute-only
support only apply to VMAs that have the default protection key (0)
set on them.
Proposed semantics:
1. protection key 0 is special and represents the default,
"unassigned" protection key. It is always allocated.
2. mprotect() never affects a mapping's pkey_mprotect()-assigned
protection key. A protection key of 0 (even if set explicitly)
represents an unassigned protection key.
2a. mprotect(PROT_EXEC) on a mapping with an assigned protection
key may or may not result in a mapping with execute-only
properties. pkey_mprotect() plus pkey_set() on all threads
should be used to _guarantee_ execute-only semantics if this
is not a strong enough semantic.
3. mprotect(PROT_EXEC) may result in an "execute-only" mapping. The
kernel will internally attempt to allocate and dedicate a
protection key for the purpose of execute-only mappings. This
may not be possible in cases where there are no free protection
keys available. It can also happen, of course, in situations
where there is no hardware support for protection keys.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: linux-arch@vger.kernel.org
Cc: Dave Hansen <dave@sr71.net>
Cc: arnd@arndb.de
Cc: linux-api@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: luto@kernel.org
Cc: akpm@linux-foundation.org
Cc: torvalds@linux-foundation.org
Link: http://lkml.kernel.org/r/20160729163012.3DDD36C4@viggo.jf.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-07-29 18:30:12 +02:00
|
|
|
|
|
|
|
SYSCALL_DEFINE3(mprotect, unsigned long, start, size_t, len,
|
|
|
|
unsigned long, prot)
|
|
|
|
{
|
|
|
|
return do_mprotect_pkey(start, len, prot, -1);
|
|
|
|
}
|
|
|
|
|
2016-12-13 01:43:09 +01:00
|
|
|
#ifdef CONFIG_ARCH_HAS_PKEYS
|
|
|
|
|
mm: Implement new pkey_mprotect() system call
pkey_mprotect() is just like mprotect, except it also takes a
protection key as an argument. On systems that do not support
protection keys, it still works, but requires that key=0.
Otherwise it does exactly what mprotect does.
I expect it to get used like this, if you want to guarantee that
any mapping you create can *never* be accessed without the right
protection keys set up.
int real_prot = PROT_READ|PROT_WRITE;
pkey = pkey_alloc(0, PKEY_DENY_ACCESS);
ptr = mmap(NULL, PAGE_SIZE, PROT_NONE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
ret = pkey_mprotect(ptr, PAGE_SIZE, real_prot, pkey);
This way, there is *no* window where the mapping is accessible
since it was always either PROT_NONE or had a protection key set
that denied all access.
We settled on 'unsigned long' for the type of the key here. We
only need 4 bits on x86 today, but I figured that other
architectures might need some more space.
Semantically, we have a bit of a problem if we combine this
syscall with our previously-introduced execute-only support:
What do we do when we mix execute-only pkey use with
pkey_mprotect() use? For instance:
pkey_mprotect(ptr, PAGE_SIZE, PROT_WRITE, 6); // set pkey=6
mprotect(ptr, PAGE_SIZE, PROT_EXEC); // set pkey=X_ONLY_PKEY?
mprotect(ptr, PAGE_SIZE, PROT_WRITE); // is pkey=6 again?
To solve that, we make the plain-mprotect()-initiated execute-only
support only apply to VMAs that have the default protection key (0)
set on them.
Proposed semantics:
1. protection key 0 is special and represents the default,
"unassigned" protection key. It is always allocated.
2. mprotect() never affects a mapping's pkey_mprotect()-assigned
protection key. A protection key of 0 (even if set explicitly)
represents an unassigned protection key.
2a. mprotect(PROT_EXEC) on a mapping with an assigned protection
key may or may not result in a mapping with execute-only
properties. pkey_mprotect() plus pkey_set() on all threads
should be used to _guarantee_ execute-only semantics if this
is not a strong enough semantic.
3. mprotect(PROT_EXEC) may result in an "execute-only" mapping. The
kernel will internally attempt to allocate and dedicate a
protection key for the purpose of execute-only mappings. This
may not be possible in cases where there are no free protection
keys available. It can also happen, of course, in situations
where there is no hardware support for protection keys.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: linux-arch@vger.kernel.org
Cc: Dave Hansen <dave@sr71.net>
Cc: arnd@arndb.de
Cc: linux-api@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: luto@kernel.org
Cc: akpm@linux-foundation.org
Cc: torvalds@linux-foundation.org
Link: http://lkml.kernel.org/r/20160729163012.3DDD36C4@viggo.jf.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-07-29 18:30:12 +02:00
|
|
|
SYSCALL_DEFINE4(pkey_mprotect, unsigned long, start, size_t, len,
|
|
|
|
unsigned long, prot, int, pkey)
|
|
|
|
{
|
|
|
|
return do_mprotect_pkey(start, len, prot, pkey);
|
|
|
|
}
|
x86/pkeys: Allocation/free syscalls
This patch adds two new system calls:
int pkey_alloc(unsigned long flags, unsigned long init_access_rights)
int pkey_free(int pkey);
These implement an "allocator" for the protection keys
themselves, which can be thought of as analogous to the allocator
that the kernel has for file descriptors. The kernel tracks
which numbers are in use, and only allows operations on keys that
are valid. A key which was not obtained by pkey_alloc() may not,
for instance, be passed to pkey_mprotect().
These system calls are also very important given the kernel's use
of pkeys to implement execute-only support. These help ensure
that userspace can never assume that it has control of a key
unless it first asks the kernel. The kernel does not promise to
preserve PKRU (right register) contents except for allocated
pkeys.
The 'init_access_rights' argument to pkey_alloc() specifies the
rights that will be established for the returned pkey. For
instance:
pkey = pkey_alloc(flags, PKEY_DENY_WRITE);
will allocate 'pkey', but also sets the bits in PKRU[1] such that
writing to 'pkey' is already denied.
The kernel does not prevent pkey_free() from successfully freeing
in-use pkeys (those still assigned to a memory range by
pkey_mprotect()). It would be expensive to implement the checks
for this, so we instead say, "Just don't do it" since sane
software will never do it anyway.
Any piece of userspace calling pkey_alloc() needs to be prepared
for it to fail. Why? pkey_alloc() returns the same error code
(ENOSPC) when there are no pkeys and when pkeys are unsupported.
They can be unsupported for a whole host of reasons, so apps must
be prepared for this. Also, libraries or LD_PRELOADs might steal
keys before an application gets access to them.
This allocation mechanism could be implemented in userspace.
Even if we did it in userspace, we would still need additional
user/kernel interfaces to tell userspace which keys are being
used by the kernel internally (such as for execute-only
mappings). Having the kernel provide this facility completely
removes the need for these additional interfaces, or having an
implementation of this in userspace at all.
Note that we have to make changes to all of the architectures
that do not use mman-common.h because we use the new
PKEY_DENY_ACCESS/WRITE macros in arch-independent code.
1. PKRU is the Protection Key Rights User register. It is a
usermode-accessible register that controls whether writes
and/or access to each individual pkey is allowed or denied.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: linux-arch@vger.kernel.org
Cc: Dave Hansen <dave@sr71.net>
Cc: arnd@arndb.de
Cc: linux-api@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: luto@kernel.org
Cc: akpm@linux-foundation.org
Cc: torvalds@linux-foundation.org
Link: http://lkml.kernel.org/r/20160729163015.444FE75F@viggo.jf.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-07-29 18:30:15 +02:00
|
|
|
|
|
|
|
SYSCALL_DEFINE2(pkey_alloc, unsigned long, flags, unsigned long, init_val)
|
|
|
|
{
|
|
|
|
int pkey;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
/* No flags supported yet. */
|
|
|
|
if (flags)
|
|
|
|
return -EINVAL;
|
|
|
|
/* check for unsupported init values */
|
|
|
|
if (init_val & ~PKEY_ACCESS_MASK)
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
down_write(¤t->mm->mmap_sem);
|
|
|
|
pkey = mm_pkey_alloc(current->mm);
|
|
|
|
|
|
|
|
ret = -ENOSPC;
|
|
|
|
if (pkey == -1)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
ret = arch_set_user_pkey_access(current, pkey, init_val);
|
|
|
|
if (ret) {
|
|
|
|
mm_pkey_free(current->mm, pkey);
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
ret = pkey;
|
|
|
|
out:
|
|
|
|
up_write(¤t->mm->mmap_sem);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
SYSCALL_DEFINE1(pkey_free, int, pkey)
|
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
down_write(¤t->mm->mmap_sem);
|
|
|
|
ret = mm_pkey_free(current->mm, pkey);
|
|
|
|
up_write(¤t->mm->mmap_sem);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We could provie warnings or errors if any VMA still
|
|
|
|
* has the pkey set here.
|
|
|
|
*/
|
|
|
|
return ret;
|
|
|
|
}
|
2016-12-13 01:43:09 +01:00
|
|
|
|
|
|
|
#endif /* CONFIG_ARCH_HAS_PKEYS */
|