SLUB core
This is a new slab allocator which was motivated by the complexity of the
existing code in mm/slab.c. It attempts to address a variety of concerns
with the existing implementation.

A. Management of object queues

   A particular concern was the complex management of the numerous object
   queues in SLAB. SLUB has no such queues. Instead we dedicate a slab to
   each allocating CPU and use objects from that slab directly instead of
   queueing them up.
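
   To make this concrete, here is a minimal user-space sketch (not the
   actual kernel code; all names and sizes are hypothetical) of the idea:
   one active slab per CPU whose free objects are linked through their own
   first word, so allocation is a plain freelist pop with no intermediate
   object queues.

	#include <stdio.h>
	#include <stdlib.h>

	#define SLAB_SIZE 4096
	#define OBJ_SIZE  128

	struct toy_slab {
		void *freelist;               /* first free object in this slab */
		unsigned char mem[SLAB_SIZE]; /* storage for the objects */
	};

	static struct toy_slab *new_slab(void)
	{
		struct toy_slab *s = malloc(sizeof(*s));
		if (!s)
			return NULL;
		/* Chain the free objects through their own first word. */
		for (int off = 0; off + OBJ_SIZE <= SLAB_SIZE; off += OBJ_SIZE) {
			void **obj = (void **)(s->mem + off);
			*obj = off + 2 * OBJ_SIZE <= SLAB_SIZE ?
				(void *)(s->mem + off + OBJ_SIZE) : NULL;
		}
		s->freelist = s->mem;
		return s;
	}

	/* One active slab per CPU; a single "CPU" is shown here. */
	static struct toy_slab *cpu_slab;

	static void *toy_alloc(void)
	{
		if (!cpu_slab || !cpu_slab->freelist)
			cpu_slab = new_slab();   /* partial-list handling omitted */
		if (!cpu_slab)
			return NULL;
		void **obj = cpu_slab->freelist;
		cpu_slab->freelist = *obj;       /* pop straight off the slab */
		return obj;
	}

	static void toy_free(void *p)
	{
		void **obj = p;            /* simplified: assumes p came from cpu_slab */
		*obj = cpu_slab->freelist;
		cpu_slab->freelist = obj;  /* push back onto the slab */
	}

	int main(void)
	{
		void *a = toy_alloc(), *b = toy_alloc();
		printf("two objects from the same slab: %p %p\n", a, b);
		toy_free(b);
		toy_free(a);
		return 0;
	}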

B. Storage overhead of object queues

   SLAB object queues exist per node and per CPU. The alien cache queue
   even has a queue array that contains a queue for each processor on each
   node. For very large systems the number of queues and the number of
   objects that may be caught in those queues grows with the product of the
   node and processor counts. On our systems with 1k nodes / processors we
   have several gigabytes tied up just for storing references to objects in
   those queues. This does not include the objects that could be on those
   queues. One fears that the whole memory of the machine could one day be
   consumed by those queues.

C. SLAB meta data overhead

   SLAB has overhead at the beginning of each slab. This means that data
   cannot be naturally aligned at the beginning of a slab block. SLUB keeps
   all metadata in the corresponding struct page. Objects can be naturally
   aligned in the slab. For example, a 128 byte object will be aligned at
   128 byte boundaries and 32 of them fit tightly into a 4k page with no
   bytes left over. SLAB cannot do this.

D. SLAB has a complex cache reaper

   SLUB does not need a cache reaper for UP systems. On SMP systems
   the per-CPU slab may be pushed back onto the partial list, but that
   operation is simple and does not require iterating over a list
   of objects. SLAB expires the per-CPU, shared and alien object queues
   during cache reaping, which may cause strange hold-offs.

E. SLAB has complex NUMA policy layer support

   SLUB pushes NUMA policy handling into the page allocator. This means that
   allocation is coarser (SLUB interleaves at the page level) but that
   situation was also present before 2.6.13. SLAB's application of
   memory policies to individual slab objects is certainly a performance
   concern due to the frequent references to memory policies, which may
   cause consecutive objects to come from one node after another. SLUB will
   get a slab full of objects from one node and then will switch to the
   next.

F. Reduction of the size of partial slab lists

   SLAB has per-node partial lists. This means that over time a large
   number of partial slabs may accumulate on those lists. These can
   only be reused if allocations occur on those specific nodes. SLUB has a
   global pool of partial slabs and will consume slabs from that pool to
   decrease fragmentation.

G. Tunables

   SLAB has sophisticated tuning abilities for each slab cache. One can
   manipulate the queue sizes in detail. However, filling the queues still
   requires the use of a spin lock to check out slabs. SLUB has a single
   global tuning parameter (slub_min_order). Increasing the minimum slab
   order can decrease the locking overhead. The bigger the slab order, the
   fewer movements of pages between the per-CPU and partial lists occur,
   and the better SLUB will scale.
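
   For example (assuming 4k pages, see the boot options below), booting
   with slub_min_order=3 makes every slab at least 2^3 = 8 pages (32k), so
   each slab holds several times more objects and the per-CPU slab needs to
   be exchanged against the partial lists correspondingly less often.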

H. Slab merging

   We often have slab caches with similar parameters. SLUB detects those
   on boot up and merges them into the corresponding general caches. This
   leads to more effective memory use. About 50% of all caches can
   be eliminated through slab merging. This will also decrease
   slab fragmentation because partially allocated slabs can be filled
   up again. Slab merging can be switched off by specifying
   slub_nomerge on boot up.

   Note that merging can expose heretofore unknown bugs in the kernel
   because corrupted objects may now be placed differently and corrupt
   different neighboring objects. Enable sanity checks to find those.

I. Diagnostics

   The current slab diagnostics are difficult to use and require a
   recompilation of the kernel. SLUB contains debugging code that
   is always available (but is kept out of the hot code paths).
   SLUB diagnostics can be enabled via the "slub_debug" option.
   Parameters can be specified to select a single slab cache or a
   group of slab caches for diagnostics. This means that the system runs
   with the usual performance and it is much more likely that
   race conditions can be reproduced.

J. Resiliency

   If basic sanity checks are on then SLUB is capable of detecting
   common error conditions and recovering as well as possible to allow the
   system to continue.

K. Tracing

   Tracing can be enabled via the slub_debug=T,<slabcache> option
   during boot. SLUB will then log all actions on that slab cache
   and dump the object contents on free.

L. On-demand DMA cache creation

   Generally DMA caches are not needed. If a kmalloc is done with
   __GFP_DMA then only the single slab cache that is needed gets created.
   For systems that have no ZONE_DMA requirement the support is
   completely eliminated.
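
   As an illustration only (a hypothetical driver helper, not code from
   this patch): an allocation like the following is what triggers the
   on-demand creation of the matching DMA slab cache; nothing is set up
   for it beforehand on systems that never do such allocations.

	#include <linux/slab.h>
	#include <linux/gfp.h>

	/* Hypothetical helper: buffer that must come from ZONE_DMA. */
	static void *alloc_dma_buffer(size_t len)
	{
		/* __GFP_DMA causes the needed DMA kmalloc cache to be created. */
		return kmalloc(len, GFP_KERNEL | __GFP_DMA);
	}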

M. Performance increase

   Some benchmarks have shown speed improvements on kernbench in the
   range of 5-10%. The locking overhead of SLUB depends on the
   underlying base allocation size. If we can reliably allocate
   larger order pages then it is possible to increase SLUB
   performance much further. The anti-fragmentation patches may
   enable further performance increases.

Tested on:
i386 UP + SMP, x86_64 UP + SMP + NUMA emulation, IA64 NUMA + Simulator

SLUB Boot options

slub_nomerge		Disable merging of slabs
slub_min_order=x	Require a minimum order for slab caches. This
			increases the managed chunk size and therefore
			reduces meta data and locking overhead.
slub_min_objects=x	Minimum objects per slab. Default is 8.
slub_max_order=x	Avoid generating slabs larger than order specified.
slub_debug		Enable all diagnostics for all caches
slub_debug=<options>	Enable selective options for all caches
slub_debug=<o>,<cache>	Enable selective options for a certain set of
			caches

Available Debug options
F		Double Free checking, sanity and resiliency
R		Red zoning
P		Object / padding poisoning
U		Track last free / alloc
T		Trace all allocs / frees (only use for individual slabs).
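
For example (the cache name is only an illustration):

slub_debug=FPU,dentry	Sanity/double free checks, poisoning and
			alloc/free tracking for the dentry cache only
slub_debug=FR		Sanity/double free checks plus red zoning
			for all caches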

To use SLUB: Apply this patch and then select SLUB as the default slab
allocator.

[hugh@veritas.com: fix an oops-causing locking error]
[akpm@linux-foundation.org: various stupid cleanups and small fixes]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>