After m68k's task_thread_info() doesn't refer to current,
it's possible to remove sched.h from interrupt.h and not break m68k!
Many thanks to Heiko Carstens for allowing this.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
* 'sparc-perf-events-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
mm, perf_event: Make vmalloc_user() align base kernel virtual address to SHMLBA
perf_event: Provide vmalloc() based mmap() backing
When a vmalloc'd area is mmap'd into userspace, some kind of
co-ordination is necessary for this to work on platforms with cpu
D-caches which can have aliases.
Otherwise kernel side writes won't be seen properly in userspace
and vice versa.
If the kernel side mapping and the user side one have the same
alignment, modulo SHMLBA, this can work as long as VM_SHARED is
shared of VMA and for all current users this is true. VM_SHARED
will force SHMLBA alignment of the user side mmap on platforms with
D-cache aliasing matters.
The bulk of this patch is just making it so that a specific
alignment can be passed down into __get_vm_area_node(). All
existing callers pass in '1' which preserves existing behavior.
vmalloc_user() gives SHMLBA for the alignment.
As a side effect this should get the video media drivers and other
vmalloc_user() users into more working shape on such systems.
Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
LKML-Reference: <200909211922.n8LJMYjw029425@imap1.linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
fix the following 'make includecheck' warning:
mm/vmalloc.c: linux/highmem.h is included more than once.
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some archs define MODULED_VADDR/MODULES_END which is not in VMALLOC area.
This is handled only in x86-64. This patch make it more generic. And we
can use vread/vwrite to access the area. Fix it.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jirislaby@gmail.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sizing of memory allocations shouldn't depend on the number of physical
pages found in a system, as that generally includes (perhaps a huge amount
of) non-RAM pages. The amount of what actually is usable as storage
should instead be used as a basis here.
Some of the calculations (i.e. those not intending to use high memory)
should likely even use (totalram_pages - totalhigh_pages).
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Dave Airlie <airlied@linux.ie>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
vread/vwrite access vmalloc area without checking there is a page or not.
In most case, this works well.
In old ages, the caller of get_vm_ara() is only IOREMAP and there is no
memory hole within vm_struct's [addr...addr + size - PAGE_SIZE] (
-PAGE_SIZE is for a guard page.)
After per-cpu-alloc patch, it uses get_vm_area() for reserve continuous
virtual address but remap _later_. There tend to be a hole in valid
vmalloc area in vm_struct lists. Then, skip the hole (not mapped page) is
necessary. This patch updates vread/vwrite() for avoiding memory hole.
Routines which access vmalloc area without knowing for which addr is used
are
- /proc/kcore
- /dev/kmem
kcore checks IOREMAP, /dev/kmem doesn't. After this patch, IOREMAP is
checked and /dev/kmem will avoid to read/write it. Fixes to /proc/kcore
will be in the next patch in series.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: WANG Cong <xiyou.wangcong@gmail.com>
Cc: Mike Smith <scgtrp@gmail.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
vmap area should be purged after vm_struct is removed from the list
because vread/vwrite etc...believes the range is valid while it's on
vm_struct list.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: WANG Cong <xiyou.wangcong@gmail.com>
Cc: Mike Smith <scgtrp@gmail.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is no need for double error checking.
Signed-off-by: Figo.zhang <figo1802@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
To directly use spread NUMA memories for percpu units, percpu
allocator will be updated to allow sparsely mapping units in a chunk.
As the distances between units can be very large, this makes
allocating single vmap area for each chunk undesirable. This patch
implements pcpu_get_vm_areas() and pcpu_free_vm_areas() which
allocates and frees sparse congruent vmap areas.
pcpu_get_vm_areas() take @offsets and @sizes array which define
distances and sizes of vmap areas. It scans down from the top of
vmalloc area looking for the top-most address which can accomodate all
the areas. The top-down scan is to avoid interacting with regular
vmallocs which can push up these congruent areas up little by little
ending up wasting address space and page table.
To speed up top-down scan, the highest possible address hint is
maintained. Although the scan is linear from the hint, given the
usual large holes between memory addresses between NUMA nodes, the
scanning is highly likely to finish after finding the first hole for
the last unit which is scanned first.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Nick Piggin <npiggin@suse.de>
Separate out insert_vmalloc_vm() from __get_vm_area_node().
insert_vmalloc_vm() initializes vm_struct from vmap_area and inserts
it into vmlist. insert_vmalloc_vm() only initializes fields which can
be determined from @vm, @flags and @caller The rest should be
initialized by the caller. For __get_vm_area_node(), all other fields
just need to be cleared and this is done by using kzalloc instead of
kmalloc.
This will be used to implement pcpu_get_vm_areas().
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Nick Piggin <npiggin@suse.de>
* 'for-linus' of git://linux-arm.org/linux-2.6:
kmemleak: Add the corresponding MAINTAINERS entry
kmemleak: Simple testing module for kmemleak
kmemleak: Enable the building of the memory leak detector
kmemleak: Remove some of the kmemleak false positives
kmemleak: Add modules support
kmemleak: Add kmemleak_alloc callback from alloc_large_system_hash
kmemleak: Add the vmalloc memory allocation/freeing hooks
kmemleak: Add the slub memory allocation/freeing hooks
kmemleak: Add the slob memory allocation/freeing hooks
kmemleak: Add the slab memory allocation/freeing hooks
kmemleak: Add documentation on the memory leak detector
kmemleak: Add the base support
Manual conflict resolution (with the slab/earlyboot changes) in:
drivers/char/vt.c
init/main.c
mm/slab.c
We can call vmalloc_init() after kmem_cache_init() and use kzalloc() instead of
the bootmem allocator when initializing vmalloc data structures.
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Nick Piggin <npiggin@suse.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
If alloc_vmap_area() fails the allocated struct vmap_area has to be freed.
Signed-off-by: Ralph Wuerthner <ralphw@linux.vnet.ibm.com>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
vmap's dirty_list is unused. It's for optimizing flushing. but Nick
didn't write the code yet. so, we don't need it until time as it is
needed.
This patch removes vmap_block's dirty_list and codes related to it.
Signed-off-by: MinChan Kim <minchan.kim@gmail.com>
Acked-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I just got this new warning from kmemcheck:
WARNING: kmemcheck: Caught 32-bit read from freed memory (c7806a60)
a06a80c7ecde70c1a04080c700000000a06709c1000000000000000000000000
f f f f f f f f f f f f f f f f f f f f f f f f f f f f f f f f
^
Pid: 0, comm: swapper Not tainted (2.6.29-rc4 #230)
EIP: 0060:[<c1096df7>] EFLAGS: 00000286 CPU: 0
EIP is at __purge_vmap_area_lazy+0x117/0x140
EAX: 00070f43 EBX: c7806a40 ECX: c1677080 EDX: 00027b66
ESI: 00002001 EDI: c170df0c EBP: c170df00 ESP: c178830c
DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
CR0: 80050033 CR2: c7806b14 CR3: 01775000 CR4: 00000690
DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
DR6: 00004000 DR7: 00000000
[<c1096f3e>] free_unmap_vmap_area_noflush+0x6e/0x70
[<c1096f6a>] remove_vm_area+0x2a/0x70
[<c1097025>] __vunmap+0x45/0xe0
[<c10970de>] vunmap+0x1e/0x30
[<c1008ba5>] text_poke+0x95/0x150
[<c1008ca9>] alternatives_smp_unlock+0x49/0x60
[<c171ef47>] alternative_instructions+0x11b/0x124
[<c171f991>] check_bugs+0xbd/0xdc
[<c17148c5>] start_kernel+0x2ed/0x360
[<c171409e>] __init_begin+0x9e/0xa9
[<ffffffff>] 0xffffffff
It happened here:
$ addr2line -e vmlinux -i c1096df7
mm/vmalloc.c:540
Code:
list_for_each_entry(va, &valist, purge_list)
__free_vmap_area(va);
It's this instruction:
mov 0x20(%ebx),%edx
Which corresponds to a dereference of va->purge_list.next:
(gdb) p ((struct vmap_area *) 0)->purge_list.next
Cannot access memory at address 0x20
It seems that we should use "safe" list traversal here, as the element
is freed inside the loop. Please verify that this is the right fix.
Acked-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: <stable@kernel.org> [2.6.28.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The new vmap allocator can wrap the address and get confused in the case
of large allocations or VMALLOC_END near the end of address space.
Problem reported by Christoph Hellwig on a 32-bit XFS workload.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Reported-by: Christoph Hellwig <hch@lst.de>
Cc: <stable@kernel.org> [2.6.28.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Impact: allow larger alignment for early vmalloc area allocation
Some early vmalloc users might want larger alignment, for example, for
custom large page mapping. Add @align to vm_area_register_early().
While at it, drop docbook comment on non-existent @size.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Impact: proper vcache flush on unmap_kernel_range()
flush_cache_vunmap() should be called before pages are unmapped. Add
a call to it in unmap_kernel_range().
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Nick Piggin <npiggin@suse.de>
Acked-by: David S. Miller <davem@davemloft.net>
Cc: <stable@kernel.org> [2.6.28.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Impact: two more public map/unmap functions
Implement map_kernel_range_noflush() and unmap_kernel_range_noflush().
These functions respectively map and unmap address range in kernel VM
area but doesn't do any vcache or tlb flushing. These will be used by
new percpu allocator.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Impact: allow multiple early vm areas
There are places where kernel VM area needs to be allocated before
vmalloc is initialized. This is done by allocating static vm_struct,
initializing several fields and linking it to vmlist and later vmalloc
initialization picking up these from vmlist. This is currently done
manually and if there's more than one such areas, there's no defined
way to arbitrate who gets which address.
This patch implements vm_area_register_early(), which takes vm_area
struct with flags and size initialized, assigns address to it and puts
it on the vmlist. This way, multiple early vm areas can determine
which addresses they should use. The only current user - alpha mm
init - is converted to use it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: proper vcache flush on unmap_kernel_range()
flush_cache_vunmap() should be called before pages are unmapped. Add
a call to it in unmap_kernel_range().
Signed-off-by: Tejun Heo <tj@kernel.org>
We have get_vm_area_caller() and __get_vm_area() but not
__get_vm_area_caller()
On powerpc, I use __get_vm_area() to separate the ranges of addresses
given to vmalloc vs. ioremap (various good reasons for that) so in order
to be able to implement the new caller tracking in /proc/vmallocinfo, I
need a "_caller" variant of it.
(akpm: needed for ongoing powerpc development, so merge it early)
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On alpha, we have to map some stuff in the VMALLOC space very early in the
boot process (to make SRM console callbacks work and so on, see
arch/alpha/mm/init.c). For old VM allocator, we just manually placed a
vm_struct onto the global vmlist and this worked for ages.
Unfortunately, the new allocator isn't aware of this, so it constantly
tries to allocate the VM space which is already in use, making vmalloc on
alpha defunct.
This patch forces KVA to import vmlist entries on init.
[akpm@linux-foundation.org: remove unneeded check (per Johannes)]
Signed-off-by: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Lazy unmapping in the vmalloc code has now opened the possibility for use
after free bugs to go undetected. We can catch those by forcing an unmap
and flush (which is going to be slow, but that's what happens).
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The vmalloc purge lock can be a mutex so we can sleep while a purge is
going on (purge involves a global kernel TLB invalidate, so it can take
quite a while).
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If we do that, output of files like /proc/vmallocinfo will show things
like "vmalloc_32", "vmalloc_user", or whomever the caller was as the
caller. This info is not as useful as the real caller of the allocation.
So, proposal is to call __vmalloc_node node directly, with matching
parameters to save the caller information
Signed-off-by: Glauber Costa <glommer@redhat.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If we can't service a vmalloc allocation, show size of the allocation that
actually failed. Useful for debugging.
Signed-off-by: Glauber Costa <glommer@redhat.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The flush_cache_vmap in vmap_page_range() is called with the end of the
range twice. The following patch fixes this for me.
Signed-off-by: Adam Lackorzynski <adam@os.inf.tu-dresden.de>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Miles Lane tailing /sys files hit a BUG which Pekka Enberg has tracked
to my 966c8c12dc sprint_symbol(): use
less stack exposing a bug in slub's list_locations() -
kallsyms_lookup() writes a 0 to namebuf[KSYM_NAME_LEN-1], but that was
beyond the end of page provided.
The 100 slop which list_locations() allows at end of page looks roughly
enough for all the other stuff it might print after the symbol before
it checks again: break out KSYM_SYMBOL_LEN earlier than before.
Latencytop and ftrace and are using KSYM_NAME_LEN buffers where they
need KSYM_SYMBOL_LEN buffers, and vmallocinfo a 2*KSYM_NAME_LEN buffer
where it wants a KSYM_SYMBOL_LEN buffer: fix those before anyone copies
them.
[akpm@linux-foundation.org: ftrace.h needs module.h]
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc Miles Lane <miles.lane@gmail.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Steven Rostedt <srostedt@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jim Radford has reported that the vmap subsystem rewrite was sometimes
causing his VIVT ARM system to behave strangely (seemed like going into
infinite loops trying to fault in pages to userspace).
We determined that the problem was most likely due to a cache aliasing
issue. flush_cache_vunmap was only being called at the moment the page
tables were to be taken down, however with lazy unmapping, this can happen
after the page has subsequently been freed and allocated for something
else. The dangling alias may still have dirty data attached to it.
The fix for this problem is to do the cache flushing when the caller has
called vunmap -- it would be a bug for them to write anything else to the
mapping at that point.
That appeared to solve Jim's problems.
Reported-by: Jim Radford <radford@blackbean.org>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Current vmalloc restart search for a free area in case we can't find one.
The reason is there are areas which are lazily freed, and could be
possibly freed now. However, current implementation start searching the
tree from the last failing address, which is pretty much by definition at
the end of address space. So, we fail.
The proposal of this patch is to restart the search from the beginning of
the requested vstart address. This fixes the regression in running KVM
virtual machines for me, described in http://lkml.org/lkml/2008/10/28/349,
caused by commit db64fe0225.
Signed-off-by: Glauber Costa <glommer@redhat.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
An initial vmalloc failure should start off a synchronous flush of lazy
areas, in case someone is in progress flushing them already, which could
cause us to return an allocation failure even if there is plenty of KVA
free.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix off by one bug in the KVA allocator that can leave gaps in the address
space.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Xen can end up calling vm_unmap_aliases() before vmalloc_init() has
been called. In this case its safe to make it a simple no-op.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Linux Memory Management List <linux-mm@kvack.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
As of 73bdf0a60e, the kernel needs
to know where modules are located in the virtual address space.
On ARM, we located this region between MODULE_START and MODULE_END.
Unfortunately, everyone else calls it MODULES_VADDR and MODULES_END.
Update ARM to use the same naming, so is_vmalloc_or_module_addr()
can work properly. Also update the comment on mm/vmalloc.c to
reflect that ARM also places modules in a separate region from the
vmalloc space.
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Delete excess kernel-doc notation in mm/ subdirectory.
Actually this is a kernel-doc notation fix.
Warning(/var/linsrc/linux-2.6.27-git10//mm/vmalloc.c:902): Excess function parameter or struct member 'returns' description in 'vm_map_ram'
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86 ACPI: fix breakage of resume on 64-bit UP systems with SMP kernel
Introduce is_vmalloc_or_module_addr() and use with DEBUG_VIRTUAL
Rewrite the vmap allocator to use rbtrees and lazy tlb flushing, and
provide a fast, scalable percpu frontend for small vmaps (requires a
slightly different API, though).
The biggest problem with vmap is actually vunmap. Presently this requires
a global kernel TLB flush, which on most architectures is a broadcast IPI
to all CPUs to flush the cache. This is all done under a global lock. As
the number of CPUs increases, so will the number of vunmaps a scaled
workload will want to perform, and so will the cost of a global TLB flush.
This gives terrible quadratic scalability characteristics.
Another problem is that the entire vmap subsystem works under a single
lock. It is a rwlock, but it is actually taken for write in all the fast
paths, and the read locking would likely never be run concurrently anyway,
so it's just pointless.
This is a rewrite of vmap subsystem to solve those problems. The existing
vmalloc API is implemented on top of the rewritten subsystem.
The TLB flushing problem is solved by using lazy TLB unmapping. vmap
addresses do not have to be flushed immediately when they are vunmapped,
because the kernel will not reuse them again (would be a use-after-free)
until they are reallocated. So the addresses aren't allocated again until
a subsequent TLB flush. A single TLB flush then can flush multiple
vunmaps from each CPU.
XEN and PAT and such do not like deferred TLB flushing because they can't
always handle multiple aliasing virtual addresses to a physical address.
They now call vm_unmap_aliases() in order to flush any deferred mappings.
That call is very expensive (well, actually not a lot more expensive than
a single vunmap under the old scheme), however it should be OK if not
called too often.
The virtual memory extent information is stored in an rbtree rather than a
linked list to improve the algorithmic scalability.
There is a per-CPU allocator for small vmaps, which amortizes or avoids
global locking.
To use the per-CPU interface, the vm_map_ram / vm_unmap_ram interfaces
must be used in place of vmap and vunmap. Vmalloc does not use these
interfaces at the moment, so it will not be quite so scalable (although it
will use lazy TLB flushing).
As a quick test of performance, I ran a test that loops in the kernel,
linearly mapping then touching then unmapping 4 pages. Different numbers
of tests were run in parallel on an 4 core, 2 socket opteron. Results are
in nanoseconds per map+touch+unmap.
threads vanilla vmap rewrite
1 14700 2900
2 33600 3000
4 49500 2800
8 70631 2900
So with a 8 cores, the rewritten version is already 25x faster.
In a slightly more realistic test (although with an older and less
scalable version of the patch), I ripped the not-very-good vunmap batching
code out of XFS, and implemented the large buffer mapping with vm_map_ram
and vm_unmap_ram... along with a couple of other tricks, I was able to
speed up a large directory workload by 20x on a 64 CPU system. I believe
vmap/vunmap is actually sped up a lot more than 20x on such a system, but
I'm running into other locks now. vmap is pretty well blown off the
profiles.
Before:
1352059 total 0.1401
798784 _write_lock 8320.6667 <- vmlist_lock
529313 default_idle 1181.5022
15242 smp_call_function 15.8771 <- vmap tlb flushing
2472 __get_vm_area_node 1.9312 <- vmap
1762 remove_vm_area 4.5885 <- vunmap
316 map_vm_area 0.2297 <- vmap
312 kfree 0.1950
300 _spin_lock 3.1250
252 sn_send_IPI_phys 0.4375 <- tlb flushing
238 vmap 0.8264 <- vmap
216 find_lock_page 0.5192
196 find_next_bit 0.3603
136 sn2_send_IPI 0.2024
130 pio_phys_write_mmr 2.0312
118 unmap_kernel_range 0.1229
After:
78406 total 0.0081
40053 default_idle 89.4040
33576 ia64_spinlock_contention 349.7500
1650 _spin_lock 17.1875
319 __reg_op 0.5538
281 _atomic_dec_and_lock 1.0977
153 mutex_unlock 1.5938
123 iget_locked 0.1671
117 xfs_dir_lookup 0.1662
117 dput 0.1406
114 xfs_iget_core 0.0268
92 xfs_da_hashname 0.1917
75 d_alloc 0.0670
68 vmap_page_range 0.0462 <- vmap
58 kmem_cache_alloc 0.0604
57 memset 0.0540
52 rb_next 0.1625
50 __copy_user 0.0208
49 bitmap_find_free_region 0.2188 <- vmap
46 ia64_sn_udelay 0.1106
45 find_inode_fast 0.1406
42 memcmp 0.2188
42 finish_task_switch 0.1094
42 __d_lookup 0.0410
40 radix_tree_lookup_slot 0.1250
37 _spin_unlock_irqrestore 0.3854
36 xfs_bmapi 0.0050
36 kmem_cache_free 0.0256
35 xfs_vn_getattr 0.0322
34 radix_tree_lookup 0.1062
33 __link_path_walk 0.0035
31 xfs_da_do_buf 0.0091
30 _xfs_buf_find 0.0204
28 find_get_page 0.0875
27 xfs_iread 0.0241
27 __strncpy_from_user 0.2812
26 _xfs_buf_initialize 0.0406
24 _xfs_buf_lookup_pages 0.0179
24 vunmap_page_range 0.0250 <- vunmap
23 find_lock_page 0.0799
22 vm_map_ram 0.0087 <- vmap
20 kfree 0.0125
19 put_page 0.0330
18 __kmalloc 0.0176
17 xfs_da_node_lookup_int 0.0086
17 _read_lock 0.0885
17 page_waitqueue 0.0664
vmap has gone from being the top 5 on the profiles and flushing the crap
out of all TLBs, to using less than 1% of kernel time.
[akpm@linux-foundation.org: cleanups, section fix]
[akpm@linux-foundation.org: fix build on alpha]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Krzysztof Helt <krzysztof.h1@poczta.fm>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Impact: crash on module insertion with CONFIG_DEBUG_VIRTUAL
We would incorrectly BUG due to:
VIRTUAL_BUG_ON(!is_vmalloc_addr(vmalloc_addr) &&
!is_module_address(addr));
... because, at least on x86-64, is_module_address() doesn't do what
it should. This patch introduces is_vmalloc_or_module_addr(), which
is what we really want anyway, and uses it instead.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>