Currently we hand off PAGE_SIZEd kmallocs to the page allocator in the
mistaken belief that the page allocator can handle these allocations
effectively. However, measurements indicate a minimum slowdown by the
factor of 8 (and that is only SMP, NUMA is much worse) vs the slub fastpath
which causes regressions in tbench.
Increase the number of kmalloc caches by one so that we again handle 4k
kmallocs directly from slub. 4k page buffering for the page allocator
will be performed by slub like done by slab.
At some point the page allocator fastpath should be fixed. A lot of the kernel
would benefit from a faster ability to allocate a single page. If that is
done then the 4k allocs may again be forwarded to the page allocator and this
patch could be reverted.
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Slub already has two ways of allocating an object. One is via its own
logic and the other is via the call to kmalloc_large to hand off object
allocation to the page allocator. kmalloc_large is typically used
for objects >= PAGE_SIZE.
We can use that handoff to avoid failing if a higher order kmalloc slab
allocation cannot be satisfied by the page allocator. If we reach the
out of memory path then simply try a kmalloc_large(). kfree() can
already handle the case of an object that was allocated via the page
allocator and so this will work just fine (apart from object
accounting...).
For any kmalloc slab that already requires higher order allocs (which
makes it impossible to use the page allocator fastpath!)
we just use PAGE_ALLOC_COSTLY_ORDER to get the largest number of
objects in one go from the page allocator slowpath.
On a 4k platform this patch will lead to the following use of higher
order pages for the following kmalloc slabs:
8 ... 1024 order 0
2048 .. 4096 order 3 (4k slab only after the next patch)
We may waste some space if fallback occurs on a 2k slab but we
are always able to fallback to an order 0 alloc.
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Currently we determine the gfp flags to pass to the page allocator
each time a slab is being allocated.
Determine the bits to be set at the time the slab is created. Store
in a new allocflags field and add the flags in allocate_slab().
Acked-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
This adds a proper function for kmalloc page allocator pass-through. While it
simplifies any code that does slab tracing code a lot, I think it's a
worthwhile cleanup in itself.
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
fix checkpatch --file mm/slub.c errors and warnings.
$ q-code-quality-compare
errors lines of code errors/KLOC
mm/slub.c [before] 22 4204 5.2
mm/slub.c [after] 0 4210 0
no code changed:
text data bss dec hex filename
22195 8634 136 30965 78f5 slub.o.before
22195 8634 136 30965 78f5 slub.o.after
md5:
93cdfbec2d6450622163c590e1064358 slub.o.before.asm
93cdfbec2d6450622163c590e1064358 slub.o.after.asm
[clameter: rediffed against Pekka's cleanup patch, omitted
moves of the name of a function to the start of line]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Slub can use the non-atomic version to unlock because other flags will not
get modified with the lock held.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The statistics provided here allow the monitoring of allocator behavior but
at the cost of some (minimal) loss of performance. Counters are placed in
SLUB's per cpu data structure. The per cpu structure may be extended by the
statistics to grow larger than one cacheline which will increase the cache
footprint of SLUB.
There is a compile option to enable/disable the inclusion of the runtime
statistics and its off by default.
The slabinfo tool is enhanced to support these statistics via two options:
-D Switches the line of information displayed for a slab from size
mode to activity mode.
-A Sorts the slabs displayed by activity. This allows the display of
the slabs most important to the performance of a certain load.
-r Report option will report detailed statistics on
Example (tbench load):
slabinfo -AD ->Shows the most active slabs
Name Objects Alloc Free %Fast
skbuff_fclone_cache 33 111953835 111953835 99 99
:0000192 2666 5283688 5281047 99 99
:0001024 849 5247230 5246389 83 83
vm_area_struct 1349 119642 118355 91 22
:0004096 15 66753 66751 98 98
:0000064 2067 25297 23383 98 78
dentry 10259 28635 18464 91 45
:0000080 11004 18950 8089 98 98
:0000096 1703 12358 10784 99 98
:0000128 762 10582 9875 94 18
:0000512 184 9807 9647 95 81
:0002048 479 9669 9195 83 65
anon_vma 777 9461 9002 99 71
kmalloc-8 6492 9981 5624 99 97
:0000768 258 7174 6931 58 15
So the skbuff_fclone_cache is of highest importance for the tbench load.
Pretty high load on the 192 sized slab. Look for the aliases
slabinfo -a | grep 000192
:0000192 <- xfs_btree_cur filp kmalloc-192 uid_cache tw_sock_TCP
request_sock_TCPv6 tw_sock_TCPv6 skbuff_head_cache xfs_ili
Likely skbuff_head_cache.
Looking into the statistics of the skbuff_fclone_cache is possible through
slabinfo skbuff_fclone_cache ->-r option implied if cache name is mentioned
.... Usual output ...
Slab Perf Counter Alloc Free %Al %Fr
--------------------------------------------------
Fastpath 111953360 111946981 99 99
Slowpath 1044 7423 0 0
Page Alloc 272 264 0 0
Add partial 25 325 0 0
Remove partial 86 264 0 0
RemoteObj/SlabFrozen 350 4832 0 0
Total 111954404 111954404
Flushes 49 Refill 0
Deactivate Full=325(92%) Empty=0(0%) ToHead=24(6%) ToTail=1(0%)
Looks good because the fastpath is overwhelmingly taken.
skbuff_head_cache:
Slab Perf Counter Alloc Free %Al %Fr
--------------------------------------------------
Fastpath 5297262 5259882 99 99
Slowpath 4477 39586 0 0
Page Alloc 937 824 0 0
Add partial 0 2515 0 0
Remove partial 1691 824 0 0
RemoteObj/SlabFrozen 2621 9684 0 0
Total 5301739 5299468
Deactivate Full=2620(100%) Empty=0(0%) ToHead=0(0%) ToTail=0(0%)
Descriptions of the output:
Total: The total number of allocation and frees that occurred for a
slab
Fastpath: The number of allocations/frees that used the fastpath.
Slowpath: Other allocations
Page Alloc: Number of calls to the page allocator as a result of slowpath
processing
Add Partial: Number of slabs added to the partial list through free or
alloc (occurs during cpuslab flushes)
Remove Partial: Number of slabs removed from the partial list as a result of
allocations retrieving a partial slab or by a free freeing
the last object of a slab.
RemoteObj/Froz: How many times were remotely freed object encountered when a
slab was about to be deactivated. Frozen: How many times was
free able to skip list processing because the slab was in use
as the cpuslab of another processor.
Flushes: Number of times the cpuslab was flushed on request
(kmem_cache_shrink, may result from races in __slab_alloc)
Refill: Number of times we were able to refill the cpuslab from
remotely freed objects for the same slab.
Deactivate: Statistics how slabs were deactivated. Shows how they were
put onto the partial list.
In general fastpath is very good. Slowpath without partial list processing is
also desirable. Any touching of partial list uses node specific locks which
may potentially cause list lock contention.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Provide an alternate implementation of the SLUB fast paths for alloc
and free using cmpxchg_local. The cmpxchg_local fast path is selected
for arches that have CONFIG_FAST_CMPXCHG_LOCAL set. An arch should only
set CONFIG_FAST_CMPXCHG_LOCAL if the cmpxchg_local is faster than an
interrupt enable/disable sequence. This is known to be true for both
x86 platforms so set FAST_CMPXCHG_LOCAL for both arches.
Currently another requirement for the fastpath is that the kernel is
compiled without preemption. The restriction will go away with the
introduction of a new per cpu allocator and new per cpu operations.
The advantages of a cmpxchg_local based fast path are:
1. Potentially lower cycle count (30%-60% faster)
2. There is no need to disable and enable interrupts on the fast path.
Currently interrupts have to be disabled and enabled on every
slab operation. This is likely avoiding a significant percentage
of interrupt off / on sequences in the kernel.
3. The disposal of freed slabs can occur with interrupts enabled.
The alternate path is realized using #ifdef's. Several attempts to do the
same with macros and inline functions resulted in a mess (in particular due
to the strange way that local_interrupt_save() handles its argument and due
to the need to define macros/functions that sometimes disable interrupts
and sometimes do something else).
[clameter: Stripped preempt bits and disabled fastpath if preempt is enabled]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
We use a NULL pointer on freelists to signal that there are no more objects.
However the NULL pointers of all slabs match in contrast to the pointers to
the real objects which are in different ranges for different slab pages.
Change the end pointer to be a pointer to the first object and set bit 0.
Every slab will then have a different end pointer. This is necessary to ensure
that end markers can be matched to the source slab during cmpxchg_local.
Bring back the use of the mapping field by SLUB since we would otherwise have
to call a relatively expensive function page_address() in __slab_alloc(). Use
of the mapping field allows avoiding a call to page_address() in various other
functions as well.
There is no need to change the page_mapping() function since bit 0 is set on
the mapping as also for anonymous pages. page_mapping(slab_page) will
therefore still return NULL although the mapping field is overloaded.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
gcc 4.2 spits out an annoying warning if one casts a const void *
pointer to a void * pointer. No warning is generated if the
conversion is done through an assignment.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
This fixes most of the obvious coding style violations in mm/slub.c as
reported by checkpatch.
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Add a parameter to add_partial instead of having separate functions. The
parameter allows a more detailed control of where the slab pages is placed in
the partial queues.
If we put slabs back to the front then they are likely immediately used for
allocations. If they are put at the end then we can maximize the time that
the partial slabs spent without being subject to allocations.
When deactivating slab we can put the slabs that had remote objects freed (we
can see that because objects were put on the freelist that requires locks) to
them at the end of the list so that the cachelines of remote processors can
cool down. Slabs that had objects from the local cpu freed to them (objects
exist in the lockless freelist) are put in the front of the list to be reused
ASAP in order to exploit the cache hot state of the local cpu.
Patch seems to slightly improve tbench speed (1-2%).
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The NUMA defrag works by allocating objects from partial slabs on remote
nodes. Rename it to
remote_node_defrag_ratio
to be clear about this.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Move the counting function for objects in partial slabs so that it is placed
before kmem_cache_shrink.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
If CONFIG_SYSFS is set then free the kmem_cache structure when
sysfs tells us its okay.
Otherwise there is the danger (as pointed out by
Al Viro) that sysfs thinks the kobject still exists after
kmem_cache_destroy() removed it.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Reviewed-by: Pekka J Enberg <penberg@cs.helsinki.fi>
Introduce 'len' at outer level:
mm/slub.c:3406:26: warning: symbol 'n' shadows an earlier one
mm/slub.c:3393:6: originally declared here
No need to declare new node:
mm/slub.c:3501:7: warning: symbol 'node' shadows an earlier one
mm/slub.c:3491:6: originally declared here
No need to declare new x:
mm/slub.c:3513:9: warning: symbol 'x' shadows an earlier one
mm/slub.c:3492:6: originally declared here
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
This converts the code to use the new kobject functions, cleaning up the
logic in doing so.
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
kernel_kset does not need to be a kset, but a much simpler kobject now
that we have kobj_attributes.
We also rename kernel_kset to kernel_kobj to catch all users of this
symbol with a build error instead of an easy-to-ignore build warning.
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
/sys/kernel is where these things should go.
Also updated the documentation and tool that used this directory.
Cc: Kay Sievers <kay.sievers@vrfy.org>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Dynamically create the kset instead of declaring it statically.
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
We don't need a "default" ktype for a kset. We should set this
explicitly every time for each kset. This change is needed so that we
can make ksets dynamic, and cleans up one of the odd, undocumented
assumption that the kset/kobject/ktype model has.
This patch is based on a lot of help from Kay Sievers.
Nasty bug in the block code was found by Dave Young
<hidave.darkstar@gmail.com>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Dave Young <hidave.darkstar@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Both SLUB and SLAB really did almost exactly the same thing for
/proc/slabinfo setup, using duplicate code and per-allocator #ifdef's.
This just creates a common CONFIG_SLABINFO that is enabled by both SLUB
and SLAB, and shares all the setup code. Maybe SLOB will want this some
day too.
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Increase the mininum number of partial slabs to keep around and put
partial slabs to the end of the partial queue so that they can add
more objects.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove a recently added useless masking of GFP_ZERO. GFP_ZERO is already
masked out in new_slab() (See how it calls allocate_slab). No need to do
it twice.
This reverts the SLUB parts of 7fd272550b.
Cc: Matt Mackall <mpm@selenic.com>
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Both slob and slub react to __GFP_ZERO by clearing the allocation, which
means that passing the GFP_ZERO bit down to the page allocator is just
wasteful and pointless.
Acked-by: Matt Mackall <mpm@selenic.com>
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I can't pass memory allocated by kmalloc() to ksize() if it is allocated by
SLUB allocator and size is larger than (I guess) PAGE_SIZE / 2.
The error of ksize() seems to be that it does not check if the allocation
was made by SLUB or the page allocator.
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Tested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Christoph Lameter <clameter@sgi.com>, Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix the memory leak that may occur when we attempt to reuse a cpu_slab
that was allocated while we reenabled interrupts in order to be able to
grow a slab cache.
The per cpu freelist may contain objects and in that situation we may
overwrite the per cpu freelist pointer loosing objects. This only
occurs if we find that the concurrently allocated slab fits our
allocation needs.
If we simply always deactivate the slab then the freelist will be
properly reintegrated and the memory leak will go away.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix a panic due to access NULL pointer of kmem_cache_node at discard_slab()
after memory online.
When memory online is called, kmem_cache_nodes are created for all SLUBs
for new node whose memory are available.
slab_mem_going_online_callback() is called to make kmem_cache_node() in
callback of memory online event. If it (or other callbacks) fails, then
slab_mem_offline_callback() is called for rollback.
In memory offline, slab_mem_going_offline_callback() is called to shrink
all slub cache, then slab_mem_offline_callback() is called later.
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: locking fix]
[akpm@linux-foundation.org: build fix]
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Slab constructors currently have a flags parameter that is never used. And
the order of the arguments is opposite to other slab functions. The object
pointer is placed before the kmem_cache pointer.
Convert
ctor(void *object, struct kmem_cache *s, unsigned long flags)
to
ctor(struct kmem_cache *s, void *object)
throughout the kernel
[akpm@linux-foundation.org: coupla fixes]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move irq handling out of new slab into __slab_alloc. That is useful for
Mathieu's cmpxchg_local patchset and also allows us to remove the crude
local_irq_off in early_kmem_cache_alloc().
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We touch a cacheline in the kmem_cache structure for zeroing to get the
size. However, the hot paths in slab_alloc and slab_free do not reference
any other fields in kmem_cache, so we may have to just bring in the
cacheline for this one access.
Add a new field to kmem_cache_cpu that contains the object size. That
cacheline must already be used in the hotpaths. So we save one cacheline
on every slab_alloc if we zero.
We need to update the kmem_cache_cpu object size if an aliasing operation
changes the objsize of an non debug slab.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The kmem_cache_cpu structures introduced are currently an array placed in the
kmem_cache struct. Meaning the kmem_cache_cpu structures are overwhelmingly
on the wrong node for systems with a higher amount of nodes. These are
performance critical structures since the per node information has
to be touched for every alloc and free in a slab.
In order to place the kmem_cache_cpu structure optimally we put an array
of pointers to kmem_cache_cpu structs in kmem_cache (similar to SLAB).
However, the kmem_cache_cpu structures can now be allocated in a more
intelligent way.
We would like to put per cpu structures for the same cpu but different
slab caches in cachelines together to save space and decrease the cache
footprint. However, the slab allocators itself control only allocations
per node. We set up a simple per cpu array for every processor with
100 per cpu structures which is usually enough to get them all set up right.
If we run out then we fall back to kmalloc_node. This also solves the
bootstrap problem since we do not have to use slab allocator functions
early in boot to get memory for the small per cpu structures.
Pro:
- NUMA aware placement improves memory performance
- All global structures in struct kmem_cache become readonly
- Dense packing of per cpu structures reduces cacheline
footprint in SMP and NUMA.
- Potential avoidance of exclusive cacheline fetches
on the free and alloc hotpath since multiple kmem_cache_cpu
structures are in one cacheline. This is particularly important
for the kmalloc array.
Cons:
- Additional reference to one read only cacheline (per cpu
array of pointers to kmem_cache_cpu) in both slab_alloc()
and slab_free().
[akinobu.mita@gmail.com: fix cpu hotplug offline/online path]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: "Pekka Enberg" <penberg@cs.helsinki.fi>
Cc: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Set c->node to -1 if we allocate from a debug slab instead for SlabDebug
which requires access the page struct cacheline.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Tested-by: Alexey Dobriyan <adobriyan@sw.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We need the offset from the page struct during slab_alloc and slab_free. In
both cases we also reference the cacheline of the kmem_cache_cpu structure.
We can therefore move the offset field into the kmem_cache_cpu structure
freeing up 16 bits in the page struct.
Moving the offset allows an allocation from slab_alloc() without touching the
page struct in the hot path.
The only thing left in slab_free() that touches the page struct cacheline for
per cpu freeing is the checking of SlabDebug(page). The next patch deals with
that.
Use the available 16 bits to broaden page->inuse. More than 64k objects per
slab become possible and we can get rid of the checks for that limitation.
No need anymore to shrink the order of slabs if we boot with 2M sized slabs
(slub_min_order=9).
No need anymore to switch off the offset calculation for very large slabs
since the field in the kmem_cache_cpu structure is 32 bits and so the offset
field can now handle slab sizes of up to 8GB.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After moving the lockless_freelist to kmem_cache_cpu we no longer need
page->lockless_freelist. Restructure the use of the struct page fields in
such a way that we never touch the mapping field.
This is turn allows us to remove the special casing of SLUB when determining
the mapping of a page (needed for corner cases of virtual caches machines that
need to flush caches of processors mapping a page).
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A remote free may access the same page struct that also contains the lockless
freelist for the cpu slab. If objects have a short lifetime and are freed by
a different processor then remote frees back to the slab from which we are
currently allocating are frequent. The cacheline with the page struct needs
to be repeately acquired in exclusive mode by both the allocating thread and
the freeing thread. If this is frequent enough then performance will suffer
because of cacheline bouncing.
This patchset puts the lockless_freelist pointer in its own cacheline. In
order to make that happen we introduce a per cpu structure called
kmem_cache_cpu.
Instead of keeping an array of pointers to page structs we now keep an array
to a per cpu structure that--among other things--contains the pointer to the
lockless freelist. The freeing thread can then keep possession of exclusive
access to the page struct cacheline while the allocating thread keeps its
exclusive access to the cacheline containing the per cpu structure.
This works as long as the allocating cpu is able to service its request
from the lockless freelist. If the lockless freelist runs empty then the
allocating thread needs to acquire exclusive access to the cacheline with
the page struct lock the slab.
The allocating thread will then check if new objects were freed to the per
cpu slab. If so it will keep the slab as the cpu slab and continue with the
recently remote freed objects. So the allocating thread can take a series
of just freed remote pages and dish them out again. Ideally allocations
could be just recycling objects in the same slab this way which will lead
to an ideal allocation / remote free pattern.
The number of objects that can be handled in this way is limited by the
capacity of one slab. Increasing slab size via slub_min_objects/
slub_max_order may increase the number of objects and therefore performance.
If the allocating thread runs out of objects and finds that no objects were
put back by the remote processor then it will retrieve a new slab (from the
partial lists or from the page allocator) and start with a whole
new set of objects while the remote thread may still be freeing objects to
the old cpu slab. This may then repeat until the new slab is also exhausted.
If remote freeing has freed objects in the earlier slab then that earlier
slab will now be on the partial freelist and the allocating thread will
pick that slab next for allocation. So the loop is extended. However,
both threads need to take the list_lock to make the swizzling via
the partial list happen.
It is likely that this kind of scheme will keep the objects being passed
around to a small set that can be kept in the cpu caches leading to increased
performance.
More code cleanups become possible:
- Instead of passing a cpu we can now pass a kmem_cache_cpu structure around.
Allows reducing the number of parameters to various functions.
- Can define a new node_match() function for NUMA to encapsulate locality
checks.
Effect on allocations:
Cachelines touched before this patch:
Write: page cache struct and first cacheline of object
Cachelines touched after this patch:
Write: kmem_cache_cpu cacheline and first cacheline of object
Read: page cache struct (but see later patch that avoids touching
that cacheline)
The handling when the lockless alloc list runs empty gets to be a bit more
complicated since another cacheline has now to be written to. But that is
halfway out of the hot path.
Effect on freeing:
Cachelines touched before this patch:
Write: page_struct and first cacheline of object
Cachelines touched after this patch depending on how we free:
Write(to cpu_slab): kmem_cache_cpu struct and first cacheline of object
Write(to other): page struct and first cacheline of object
Read(to cpu_slab): page struct to id slab etc. (but see later patch that
avoids touching the page struct on free)
Read(to other): cpu local kmem_cache_cpu struct to verify its not
the cpu slab.
Summary:
Pro:
- Distinct cachelines so that concurrent remote frees and local
allocs on a cpuslab can occur without cacheline bouncing.
- Avoids potential bouncing cachelines because of neighboring
per cpu pointer updates in kmem_cache's cpu_slab structure since
it now grows to a cacheline (Therefore remove the comment
that talks about that concern).
Cons:
- Freeing objects now requires the reading of one additional
cacheline. That can be mitigated for some cases by the following
patches but its not possible to completely eliminate these
references.
- Memory usage grows slightly.
The size of each per cpu object is blown up from one word
(pointing to the page_struct) to one cacheline with various data.
So this is NR_CPUS*NR_SLABS*L1_BYTES more memory use. Lets say
NR_SLABS is 100 and a cache line size of 128 then we have just
increased SLAB metadata requirements by 12.8k per cpu.
(Another later patch reduces these requirements)
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch marks a number of allocations that are either short-lived such as
network buffers or are reclaimable such as inode allocations. When something
like updatedb is called, long-lived and unmovable kernel allocations tend to
be spread throughout the address space which increases fragmentation.
This patch groups these allocations together as much as possible by adding a
new MIGRATE_TYPE. The MIGRATE_RECLAIMABLE type is for allocations that can be
reclaimed on demand, but not moved. i.e. they can be migrated by deleting
them and re-reading the information from elsewhere.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The function of GFP_LEVEL_MASK seems to be unclear. In order to clear up
the mystery we get rid of it and replace GFP_LEVEL_MASK with 3 sets of GFP
flags:
GFP_RECLAIM_MASK Flags used to control page allocator reclaim behavior.
GFP_CONSTRAINT_MASK Flags used to limit where allocations can occur.
GFP_SLAB_BUG_MASK Flags that the slab allocator BUG()s on.
These replace the uses of GFP_LEVEL mask in the slab allocators and in
vmalloc.c.
The use of the flags not included in these sets may occur as a result of a
slab allocation standing in for a page allocation when constructing scatter
gather lists. Extraneous flags are cleared and not passed through to the
page allocator. __GFP_MOVABLE/RECLAIMABLE, __GFP_COLD and __GFP_COMP will
now be ignored if passed to a slab allocator.
Change the allocation of allocator meta data in SLAB and vmalloc to not
pass through flags listed in GFP_CONSTRAINT_MASK. SLAB already removes the
__GFP_THISNODE flag for such allocations. Generalize that to also cover
vmalloc. The use of GFP_CONSTRAINT_MASK also includes __GFP_HARDWALL.
The impact of allocator metadata placement on access latency to the
cachelines of the object itself is minimal since metadata is only
referenced on alloc and free. The attempt is still made to place the meta
data optimally but we consistently allow fallback both in SLAB and vmalloc
(SLUB does not need to allocate metadata like that).
Allocator metadata may serve multiple in kernel users and thus should not
be subject to the limitations arising from a single allocation context.
[akpm@linux-foundation.org: fix fallback_alloc()]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Simply switch all for_each_online_node to for_each_node_state(NORMAL_MEMORY).
That way SLUB only operates on nodes with regular memory. Any allocation
attempt on a memoryless node or a node with just highmem will fall whereupon
SLUB will fetch memory from a nearby node (depending on how memory policies
and cpuset describe fallback).
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Tested-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Bob Picco <bob.picco@hp.com>
Cc: Nishanth Aravamudan <nacc@us.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@skynet.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A NULL pointer means that the object was not allocated. One cannot
determine the size of an object that has not been allocated. Currently we
return 0 but we really should BUG() on attempts to determine the size of
something nonexistent.
krealloc() interprets NULL to mean a zero sized object. Handle that
separately in krealloc().
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Considering kfree(NULL) would normally occur only in error paths and
kfree(ZERO_SIZE_PTR) is uncommon as well, so let's use unlikely() for the
condition check in SLUB's and SLOB's kfree() to optimize for the common
case. SLAB has this already.
Signed-off-by: Satyam Sharma <satyam@infradead.org>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This gets rid of all kmalloc caches larger than page size. A kmalloc
request larger than PAGE_SIZE > 2 is going to be passed through to the page
allocator. This works both inline where we will call __get_free_pages
instead of kmem_cache_alloc and in __kmalloc.
kfree is modified to check if the object is in a slab page. If not then
the page is freed via the page allocator instead. Roughly similar to what
SLOB does.
Advantages:
- Reduces memory overhead for kmalloc array
- Large kmalloc operations are faster since they do not
need to pass through the slab allocator to get to the
page allocator.
- Performance increase of 10%-20% on alloc and 50% on free for
PAGE_SIZEd allocations.
SLUB must call page allocator for each alloc anyways since
the higher order pages which that allowed avoiding the page alloc calls
are not available in a reliable way anymore. So we are basically removing
useless slab allocator overhead.
- Large kmallocs yields page aligned object which is what
SLAB did. Bad things like using page sized kmalloc allocations to
stand in for page allocate allocs can be transparently handled and are not
distinguishable from page allocator uses.
- Checking for too large objects can be removed since
it is done by the page allocator.
Drawbacks:
- No accounting for large kmalloc slab allocations anymore
- No debugging of large kmalloc slab allocations.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This was posted on Aug 28 and fixes an issue that could cause troubles
when slab caches >=128k are created.
http://marc.info/?l=linux-mm&m=118798149918424&w=2
Currently we simply add the debug flags unconditional when checking for a
matching slab. This creates issues for sysfs processing when slabs exist
that are exempt from debugging due to their huge size or because only a
subset of slabs was selected for debugging.
We need to only add the flags if kmem_cache_open() would also add them.
Create a function to calculate the flags that would be set
if the cache would be opened and use that function to determine
the flags before looking for a compatible slab.
[akpm@linux-foundation.org: fixlets]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Chuck Ebbert <cebbert@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Do not BUG() if we cannot register a slab with sysfs. Just print an error.
The only consequence of not registering is that the slab cache is not
visible via /sys/slab. A BUG() may not be visible that early during boot
and we have had multiple issues here already.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Print a big fat warning and do what is necessary to continue if a node is
marked as up (meaning either node is online (upstream) or node has memory
(Andrew's tree)) but allocations from the node do not succeed.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
SLUB is using atomic_read() for variables declared atomic_long_t.
Switch to atomic_long_read().
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The dynamic dma kmalloc creation can run into trouble if a
GFP_ATOMIC allocation is the first one performed for a certain size
of dma kmalloc slab.
- Move the adding of the slab to sysfs into a workqueue
(sysfs does GFP_KERNEL allocations)
- Do not call kmem_cache_destroy() (uses slub_lock)
- Only acquire the slub_lock once and--if we cannot wait--do a trylock.
This introduces a slight risk of the first kmalloc(x, GFP_DMA|GFP_ATOMIC)
for a range of sizes failing due to another process holding the slub_lock.
However, we only need to acquire the spinlock once in order to establish
each power of two DMA kmalloc cache. The possible conflict is with the
slub_lock taken during slab management actions (create / remove slab cache).
It is rather typical that a driver will first fill its buffers using
GFP_KERNEL allocations which will wait until the slub_lock can be acquired.
Drivers will also create its slab caches first outside of an atomic
context before starting to use atomic kmalloc from an interrupt context.
If there are any failures then they will occur early after boot or when
loading of multiple drivers concurrently. Drivers can already accomodate
failures of GFP_ATOMIC for other reasons. Retries will then create the slab.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
The MAX_PARTIAL checks were supposed to be an optimization. However, slab
shrinking is a manually triggered process either through running slabinfo
or by the kernel calling kmem_cache_shrink.
If one really wants to shrink a slab then all operations should be done
regardless of the size of the partial list. This also fixes an issue that
could surface if the number of partial slabs was initially above MAX_PARTIAL
in kmem_cache_shrink and later drops below MAX_PARTIAL through the
elimination of empty slabs on the partial list (rare). In that case a few
slabs may be left off the partial list (and only be put back when they
are empty).
Signed-off-by: Christoph Lameter <clameter@sgi.com>
We ClearSlabDebug() before the last SlabDebug() check. Clear it later.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Ingo noticed that the SLUB code does include the lock debugging free
check.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Slab destructors were no longer supported after Christoph's
c59def9f22 change. They've been
BUGs for both slab and slub, and slob never supported them
either.
This rips out support for the dtor pointer from kmem_cache_create()
completely and fixes up every single callsite in the kernel (there were
about 224, not including the slab allocator definitions themselves,
or the documentation references).
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
The slab and slob allocators already did this right, but slub would call
"get_object_page()" on the magic ZERO_SIZE_PTR, with all kinds of nasty
end results.
Noted by Ingo Molnar.
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We currently cannot disable CONFIG_SLUB_DEBUG for CONFIG_NUMA. Now that
embedded systems start to use NUMA we may need this.
Put an #ifdef around places where NUMA only code uses fields only valid
for CONFIG_SLUB_DEBUG.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sysfs can do a gazillion things when called. Make sure that we do not call
any sysfs functions while holding the slub_lock.
Just protect the essentials:
1. The list of all slab caches
2. The kmalloc_dma array
3. The ref counters of the slabs.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The objects per slab increase with the current patches in mm since we allow up
to order 3 allocs by default. More patches in mm actually allow to use 2M or
higher sized slabs. For slab validation we need per object bitmaps in order
to check a slab. We end up with up to 64k objects per slab resulting in a
potential requirement of 8K stack space. That does not look good.
Allocate the bit arrays via kmalloc.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It becomes now easy to support the zeroing allocs with generic inline
functions in slab.h. Provide inline definitions to allow the continued use of
kzalloc, kmem_cache_zalloc etc but remove other definitions of zeroing
functions from the slab allocators and util.c.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We can get to the length of the object through the kmem_cache_structure. The
additional parameter does no good and causes the compiler to generate bad
code.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Do proper spacing and we only need to do this in steps of 8.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is no need to caculate the dma slab size ourselves. We can simply
lookup the size of the corresponding non dma slab.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kmalloc_index is a long series of comparisons. The attempt to replace
kmalloc_index with something more efficient like ilog2 failed due to compiler
issues with constant folding on gcc 3.3 / powerpc.
kmalloc_index()'es long list of comparisons works fine for constant folding
since all the comparisons are optimized away. However, SLUB also uses
kmalloc_index to determine the slab to use for the __kmalloc_xxx functions.
This leads to a large set of comparisons in get_slab().
The patch here allows to get rid of that list of comparisons in get_slab():
1. If the requested size is larger than 192 then we can simply use
fls to determine the slab index since all larger slabs are
of the power of two type.
2. If the requested size is smaller then we cannot use fls since there
are non power of two caches to be considered. However, the sizes are
in a managable range. So we divide the size by 8. Then we have only
24 possibilities left and then we simply look up the kmalloc index
in a table.
Code size of slub.o decreases by more than 200 bytes through this patch.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We modify the kmalloc_cache_dma[] array without proper locking. Do the proper
locking and undo the dma cache creation if another processor has already
created it.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The rarely used dma functionality in get_slab() makes the function too
complex. The compiler begins to spill variables from the working set onto the
stack. The created function is only used in extremely rare cases so make sure
that the compiler does not decide on its own to merge it back into get_slab().
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add #ifdefs around data structures only needed if debugging is compiled into
SLUB.
Add inlines to small functions to reduce code size.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A kernel convention for many allocators is that if __GFP_ZERO is passed to an
allocator then the allocated memory should be zeroed.
This is currently not supported by the slab allocators. The inconsistency
makes it difficult to implement in derived allocators such as in the uncached
allocator and the pool allocators.
In addition the support zeroed allocations in the slab allocators does not
have a consistent API. There are no zeroing allocator functions for NUMA node
placement (kmalloc_node, kmem_cache_alloc_node). The zeroing allocations are
only provided for default allocs (kzalloc, kmem_cache_zalloc_node).
__GFP_ZERO will make zeroing universally available and does not require any
addititional functions.
So add the necessary logic to all slab allocators to support __GFP_ZERO.
The code is added to the hot path. The gfp flags are on the stack and so the
cacheline is readily available for checking if we want a zeroed object.
Zeroing while allocating is now a frequent operation and we seem to be
gradually approaching a 1-1 parity between zeroing and not zeroing allocs.
The current tree has 3476 uses of kmalloc vs 2731 uses of kzalloc.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Define ZERO_OR_NULL_PTR macro to be able to remove the checks from the
allocators. Move ZERO_SIZE_PTR related stuff into slab.h.
Make ZERO_SIZE_PTR work for all slab allocators and get rid of the
WARN_ON_ONCE(size == 0) that is still remaining in SLAB.
Make slub return NULL like the other allocators if a too large memory segment
is requested via __kmalloc.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The size of a kmalloc object is readily available via ksize(). ksize is
provided by all allocators and thus we can implement krealloc in a generic
way.
Implement krealloc in mm/util.c and drop slab specific implementations of
krealloc.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The function we are calling to initialize object debug state during early NUMA
bootstrap sets up an inactive object giving it the wrong redzone signature.
The bootstrap nodes are active objects and should have active redzone
signatures.
Currently slab validation complains and reverts the object to active state.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently SLUB has no provision to deal with too high page orders that may
be specified on the kernel boot line. If an order higher than 6 (on a 4k
platform) is generated then we will BUG() because slabs get more than 65535
objects.
Add some logic that decreases order for slabs that have too many objects.
This allow booting with slab sizes up to MAX_ORDER.
For example
slub_min_order=10
will boot with a default slab size of 4M and reduce slab sizes for small
object sizes to lower orders if the number of objects becomes too big.
Large slab sizes like that allow a concentration of objects of the same
slab cache under as few as possible TLB entries and thus potentially
reduces TLB pressure.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We currently have to do an GFP_ATOMIC allocation because the list_lock is
already taken when we first allocate memory for tracking allocation
information. It would be better if we could avoid atomic allocations.
Allocate a size of the tracking table that is usually sufficient (one page)
before we take the list lock. We will then only do the atomic allocation
if we need to resize the table to become larger than a page (mostly only
needed under large NUMA because of the tracking of cpus and nodes otherwise
the table stays small).
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use list_for_each_entry() instead of list_for_each().
Get rid of for_all_slabs(). It had only one user. So fold it into the
callback. This also gets rid of cpu_slab_flush.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Changes the error reporting format to loosely follow lockdep.
If data corruption is detected then we generate the following lines:
============================================
BUG <slab-cache>: <problem>
--------------------------------------------
INFO: <more information> [possibly multiple times]
<object dump>
FIX <slab-cache>: <remedial action>
This also adds some more intelligence to the data corruption detection. Its
now capable of figuring out the start and end.
Add a comment on how to configure SLUB so that a production system may
continue to operate even though occasional slab corruption occur through
a misbehaving kernel component. See "Emergency operations" in
Documentation/vm/slub.txt.
[akpm@linux-foundation.org: build fix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add a new configuration variable
CONFIG_SLUB_DEBUG_ON
If set then the kernel will be booted by default with slab debugging
switched on. Similar to CONFIG_SLAB_DEBUG. By default slab debugging
is available but must be enabled by specifying "slub_debug" as a
kernel parameter.
Also add support to switch off slab debugging for a kernel that was
built with CONFIG_SLUB_DEBUG_ON. This works by specifying
slub_debug=-
as a kernel parameter.
Dave Jones wanted this feature.
http://marc.info/?l=linux-kernel&m=118072189913045&w=2
[akpm@linux-foundation.org: clean up switch statement]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kmem_cache_open is static. EXPORT_SYMBOL was leftover from some earlier
time period where kmem_cache_open was usable outside of slub.
(Fixes powerpc build error)
Signed-off-by: Chrsitoph Lameter <clameter@sgi.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If we move the local_irq_enable() to the end of the function then
add_partial() in early_kmem_cache_node_alloc() will be called
with interrupts disabled like during regular operations.
This makes lockdep happy.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Tested-by: Andre Noll <maan@systemlinux.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If slabs are allocated or freed from a large set of call sites (typical for
the kmalloc area) then we may create more output than fits into a single
PAGE and sysfs only gives us one page. The output should be truncated.
This patch fixes the checks to do the truncation properly.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If ARCH_KMALLOC_MINALIGN is set to a value greater than 8 (SLUBs smallest
kmalloc cache) then SLUB may generate duplicate slabs in sysfs (yes again)
because the object size is padded to reach ARCH_KMALLOC_MINALIGN. Thus the
size of the small slabs is all the same.
No arch sets ARCH_KMALLOC_MINALIGN larger than 8 though except mips which
for some reason wants a 128 byte alignment.
This patch increases the size of the smallest cache if
ARCH_KMALLOC_MINALIGN is greater than 8. In that case more and more of the
smallest caches are disabled.
If we do that then the count of the active general caches that is displayed
on boot is not correct anymore since we may skip elements of the kmalloc
array. So count them separately.
This approach was tested by Havard yesterday.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Haavard Skinnemoen <hskinnemoen@atmel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The data structure to manage the information gathered about functions
allocating and freeing objects is allocated when the list_lock has already
been taken. We need to allocate with GFP_ATOMIC instead of GFP_KERNEL.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Instead of returning the smallest available object return ZERO_SIZE_PTR.
A ZERO_SIZE_PTR can be legitimately used as an object pointer as long as it
is not deferenced. The dereference of ZERO_SIZE_PTR causes a distinctive
fault. kfree can handle a ZERO_SIZE_PTR in the same way as NULL.
This enables functions to use zero sized object. e.g. n = number of objects.
objects = kmalloc(n * sizeof(object));
for (i = 0; i < n; i++)
objects[i].x = y;
kfree(objects);
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hotplug callbacks are performed with interrupts enabled. Slub requires
interrupts to be disabled for flushing caches.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Michal Piotrowski <michal.k.k.piotrowski@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We need this patch in ASAP. Patch fixes the mysterious hang that remained
on some particular configurations with lockdep on after the first fix that
moved the #idef CONFIG_SLUB_DEBUG to the right location. See
http://marc.info/?t=117963072300001&r=1&w=2
The kmem_cache_node cache is very special because it is needed for NUMA
bootstrap. Under certain conditions (like for example if lockdep is
enabled and significantly increases the size of spinlock_t) the structure
may become exactly the size as one of the larger caches in the kmalloc
array.
That early during bootstrap we cannot perform merging properly. The unique
id for the kmem_cache_node cache will match one of the kmalloc array.
Sysfs will complain about a duplicate directory entry. All of this occurs
while the console is not yet fully operational. Thus boot may appear to be
silently failing.
The kmem_cache_node cache is very special. During early boostrap the main
allocation function is not operational yet and so we have to run our own
small special alloc function during early boot. It is also special in that
it is never freed.
We really do not want any merging on that cache. Set the refcount -1 and
forbid merging of slabs that have a negative refcount.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The check for super sized slabs where we can no longer move the free
pointer behind the object for debugging purposes etc is accessing a
field that is not setup yet. We must use objsize here since the size of
the slab has not been determined yet.
The effect of this is that a global slab shrink via "slabinfo -s" will
show errors about offsets being wrong if booted with slub_debug.
Potentially there are other troubles with huge slabs under slub_debug
because the calculated free pointer offset is truncated.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The object size calculation is wrong if !CONFIG_SLUB_DEBUG because the
#ifdef CONFIG_SLUB_DEBUG is now switching off the size adjustments for
DESTROY_BY_RCU and ctor.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Consolidate functionality into the #ifdef section.
Extract tracing into one subroutine.
Move object debug processing into the #ifdef section so that the
code in __slab_alloc and __slab_free becomes minimal.
Reduce number of functions we need to provide stubs for in the !SLUB_DEBUG case.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
SLAB_CTOR_CONSTRUCTOR is always specified. No point in checking it.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Steven French <sfrench@us.ibm.com>
Cc: Michael Halcrow <mhalcrow@us.ibm.com>
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Dave Kleikamp <shaggy@austin.ibm.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Anton Altaparmakov <aia21@cantab.net>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jan Kara <jack@ucw.cz>
Cc: David Chinner <dgc@sgi.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The atomicity when handling flags in SLUB is not necessary since both flags
used by SLUB are not updated in a racy way. Flag updates are either done
during slab creation or destruction or under slab_lock. Some of these flags
do not have the non atomic variants that we need. So define our own.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use inline functions to access the per cpu bit. Intoduce the notion of
"freezing" a slab to make things more understandable.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is no user of destructors left. There is no reason why we should keep
checking for destructors calls in the slab allocators.
The RFC for this patch was discussed at
http://marc.info/?l=linux-kernel&m=117882364330705&w=2
Destructors were mainly used for list management which required them to take a
spinlock. Taking a spinlock in a destructor is a bit risky since the slab
allocators may run the destructors anytime they decide a slab is no longer
needed.
Patch drops destructor support. Any attempt to use a destructor will BUG().
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kmem_cache_create() was swapping ctor and dtor in calling find_mergeable():
though it caused no bug, and probably never would, even if destructors are
retained; but fix it so as not to generate anxiety ;)
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This was in SLUB in order to head off trouble while the nr_cpu_ids
functionality was not merged. Its merged now so no need to still have this.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Avoid atomic overhead in slab_alloc and slab_free
SLUB needs to use the slab_lock for the per cpu slabs to synchronize with
potential kfree operations. This patch avoids that need by moving all free
objects onto a lockless_freelist. The regular freelist continues to exist
and will be used to free objects. So while we consume the
lockless_freelist the regular freelist may build up objects.
If we are out of objects on the lockless_freelist then we may check the
regular freelist. If it has objects then we move those over to the
lockless_freelist and do this again. There is a significant savings in
terms of atomic operations that have to be performed.
We can even free directly to the lockless_freelist if we know that we are
running on the same processor. So this speeds up short lived objects.
They may be allocated and freed without taking the slab_lock. This is
particular good for netperf.
In order to maximize the effect of the new faster hotpath we extract the
hottest performance pieces into inlined functions. These are then inlined
into kmem_cache_alloc and kmem_cache_free. So hotpath allocation and
freeing no longer requires a subroutine call within SLUB.
[I am not sure that it is worth doing this because it changes the easy to
read structure of slub just to reduce atomic ops. However, there is
someone out there with a benchmark on 4 way and 8 way processor systems
that seems to show a 5% regression vs. Slab. Seems that the regression is
due to increased atomic operations use vs. SLAB in SLUB). I wonder if
this is applicable or discernable at all in a real workload?]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently the slab allocators contain callbacks into the page allocator to
perform the draining of pagesets on remote nodes. This requires SLUB to have
a whole subsystem in order to be compatible with SLAB. Moving node draining
out of the slab allocators avoids a section of code in SLUB.
Move the node draining so that is is done when the vm statistics are updated.
At that point we are already touching all the cachelines with the pagesets of
a processor.
Add a expire counter there. If we have to update per zone or global vm
statistics then assume that the pageset will require subsequent draining.
The expire counter will be decremented on each vm stats update pass until it
reaches zero. Then we will drain one batch from the pageset. The draining
will cause vm counter updates which will then cause another expiration until
the pcp is empty. So we will drain a batch every 3 seconds.
Note that remote node draining is a somewhat esoteric feature that is required
on large NUMA systems because otherwise significant portions of system memory
can become trapped in pcp queues. The number of pcp is determined by the
number of processors and nodes in a system. A system with 4 processors and 2
nodes has 8 pcps which is okay. But a system with 1024 processors and 512
nodes has 512k pcps with a high potential for large amount of memory being
caught in them.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
vmstat is currently using the cache reaper to periodically bring the
statistics up to date. The cache reaper does only exists in SLUB as a way to
provide compatibility with SLAB. This patch removes the vmstat calls from the
slab allocators and provides its own handling.
The advantage is also that we can use a different frequency for the updates.
Refreshing vm stats is a pretty fast job so we can run this every second and
stagger this by only one tick. This will lead to some overlap in large
systems. F.e a system running at 250 HZ with 1024 processors will have 4 vm
updates occurring at once.
However, the vm stats update only accesses per node information. It is only
necessary to stagger the vm statistics updates per processor in each node. Vm
counter updates occurring on distant nodes will not cause cacheline
contention.
We could implement an alternate approach that runs the first processor on each
node at the second and then each of the other processor on a node on a
subsequent tick. That may be useful to keep a large amount of the second free
of timer activity. Maybe the timer folks will have some feedback on this one?
[jirislaby@gmail.com: add missing break]
Cc: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>