Commit Graph

259 Commits

Author SHA1 Message Date
Christoph Lameter 17022220dd SLAB: remove WARN_ON_ONCE for zero sized objects for 2.6.22 release
We agreed to remove the WARN_ON_ONCE before 2.6.22 is released.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-01 12:29:43 -07:00
Christoph Lameter 3cdc0ed0ce slab: fix alien cache handling
cache_free_alien must be called regardless if we use alien caches or not.
cache_free_alien() will do the right thing if there are no alien caches
available.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Acked-by: Pekka J Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-08 17:23:32 -07:00
Sam Ravnborg 38bdc32af4 mm/slab: fix section mismatch warning
Use the new __init_refok marker to avoid the
section mismatch warning from slab.c

Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
2007-05-19 09:11:58 +02:00
Christoph Lameter 0aa817f078 Slab allocators: define common size limitations
Currently we have a maze of configuration variables that determine the
maximum slab size.  Worst of all it seems to vary between SLAB and SLUB.

So define a common maximum size for kmalloc.  For conveniences sake we use
the maximum size ever supported which is 32 MB.  We limit the maximum size
to a lower limit if MAX_ORDER does not allow such large allocations.

For many architectures this patch will have the effect of adding large
kmalloc sizes.  x86_64 adds 5 new kmalloc sizes.  So a small amount of
memory will be needed for these caches (contemporary SLAB has dynamically
sizeable node and cpu structure so the waste is less than in the past)

Most architectures will then be able to allocate object with sizes up to
MAX_ORDER.  We have had repeated breakage (in fact whenever we doubled the
number of supported processors) on IA64 because one or the other struct
grew beyond what the slab allocators supported.  This will avoid future
issues and f.e.  avoid fixes for 2k and 4k cpu support.

CONFIG_LARGE_ALLOCS is no longer necessary so drop it.

It fixes sparc64 with SLAB.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-17 05:23:04 -07:00
Christoph Lameter a35afb830f Remove SLAB_CTOR_CONSTRUCTOR
SLAB_CTOR_CONSTRUCTOR is always specified. No point in checking it.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Steven French <sfrench@us.ibm.com>
Cc: Michael Halcrow <mhalcrow@us.ibm.com>
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Dave Kleikamp <shaggy@austin.ibm.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Anton Altaparmakov <aia21@cantab.net>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jan Kara <jack@ucw.cz>
Cc: David Chinner <dgc@sgi.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-17 05:23:04 -07:00
Christoph Lameter 0b44f7a5b5 slab: warn on zero-length allocations
slub warns on this, and we're working on making kmalloc(0) return NULL.
Let's make slab warn as well so our testers detect such callers more
rapidly.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-17 05:23:03 -07:00
Christoph Lameter c59def9f22 Slab allocators: Drop support for destructors
There is no user of destructors left.  There is no reason why we should keep
checking for destructors calls in the slab allocators.

The RFC for this patch was discussed at
http://marc.info/?l=linux-kernel&m=117882364330705&w=2

Destructors were mainly used for list management which required them to take a
spinlock.  Taking a spinlock in a destructor is a bit risky since the slab
allocators may run the destructors anytime they decide a slab is no longer
needed.

Patch drops destructor support.  Any attempt to use a destructor will BUG().

Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-17 05:23:03 -07:00
Christoph Lameter 4037d45220 Move remote node draining out of slab allocators
Currently the slab allocators contain callbacks into the page allocator to
perform the draining of pagesets on remote nodes.  This requires SLUB to have
a whole subsystem in order to be compatible with SLAB.  Moving node draining
out of the slab allocators avoids a section of code in SLUB.

Move the node draining so that is is done when the vm statistics are updated.
At that point we are already touching all the cachelines with the pagesets of
a processor.

Add a expire counter there.  If we have to update per zone or global vm
statistics then assume that the pageset will require subsequent draining.

The expire counter will be decremented on each vm stats update pass until it
reaches zero.  Then we will drain one batch from the pageset.  The draining
will cause vm counter updates which will then cause another expiration until
the pcp is empty.  So we will drain a batch every 3 seconds.

Note that remote node draining is a somewhat esoteric feature that is required
on large NUMA systems because otherwise significant portions of system memory
can become trapped in pcp queues.  The number of pcp is determined by the
number of processors and nodes in a system.  A system with 4 processors and 2
nodes has 8 pcps which is okay.  But a system with 1024 processors and 512
nodes has 512k pcps with a high potential for large amount of memory being
caught in them.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:56 -07:00
Christoph Lameter d1187ed210 vmstat: use our own timer events
vmstat is currently using the cache reaper to periodically bring the
statistics up to date.  The cache reaper does only exists in SLUB as a way to
provide compatibility with SLAB.  This patch removes the vmstat calls from the
slab allocators and provides its own handling.

The advantage is also that we can use a different frequency for the updates.
Refreshing vm stats is a pretty fast job so we can run this every second and
stagger this by only one tick.  This will lead to some overlap in large
systems.  F.e a system running at 250 HZ with 1024 processors will have 4 vm
updates occurring at once.

However, the vm stats update only accesses per node information.  It is only
necessary to stagger the vm statistics updates per processor in each node.  Vm
counter updates occurring on distant nodes will not cause cacheline
contention.

We could implement an alternate approach that runs the first processor on each
node at the second and then each of the other processor on a node on a
subsequent tick.  That may be useful to keep a large amount of the second free
of timer activity.  Maybe the timer folks will have some feedback on this one?

[jirislaby@gmail.com: add missing break]
Cc: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:56 -07:00
Rafael J. Wysocki 8bb7844286 Add suspend-related notifications for CPU hotplug
Since nonboot CPUs are now disabled after tasks and devices have been
frozen and the CPU hotplug infrastructure is used for this purpose, we need
special CPU hotplug notifications that will help the CPU-hotplug-aware
subsystems distinguish normal CPU hotplug events from CPU hotplug events
related to a system-wide suspend or resume operation in progress.  This
patch introduces such notifications and causes them to be used during
suspend and resume transitions.  It also changes all of the
CPU-hotplug-aware subsystems to take these notifications into consideration
(for now they are handled in the same way as the corresponding "normal"
ones).

[oleg@tv-sign.ru: cleanups]
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:56 -07:00
Christoph Lameter 5830c59021 slab: shut down cache_reaper when cpu goes down
Shutdown the cache_reaper if the cpu is brought down and set the
cache_reap.func to NULL.  Otherwise hotplug shuts down the reaper for good.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:53 -07:00
Heiko Carstens 38c3bd96a0 slab: use CPU_LOCK_[ACQUIRE|RELEASE]
Looks like this was forgotten when CPU_LOCK_[ACQUIRE|RELEASE] was
introduced.

Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Gautham Shenoy <ego@in.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:51 -07:00
Pekka J Enberg 7ae439ce0c krealloc: fix kerneldoc comments
No "blank" (or "*") line is allowed between the function name and lines for
it parameter(s).

Cc: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:46 -07:00
Alexey Dobriyan a5c43dae7a Fix race between cat /proc/slab_allocators and rmmod
Same story as with cat /proc/*/wchan race vs rmmod race, only
/proc/slab_allocators want more info than just symbol name.

Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:15:08 -07:00
David Woodhouse b46b8f19c9 Increase slab redzone to 64bits
There are two problems with the existing redzone implementation.

Firstly, it's causing misalignment of structures which contain a 64-bit
integer, such as netfilter's 'struct ipt_entry' -- causing netfilter
modules to fail to load because of the misalignment.  (In particular, the
first check in
net/ipv4/netfilter/ip_tables.c::check_entry_size_and_hooks())

On ppc32 and sparc32, amongst others, __alignof__(uint64_t) == 8.

With slab debugging, we use 32-bit redzones. And allocated slab objects
aren't sufficiently aligned to hold a structure containing a uint64_t.

By _just_ setting ARCH_KMALLOC_MINALIGN to __alignof__(u64) we'd disable
redzone checks on those architectures.  By using 64-bit redzones we avoid that
loss of debugging, and also fix the other problem while we're at it.

When investigating this, I noticed that on 64-bit platforms we're using a
32-bit value of RED_ACTIVE/RED_INACTIVE in the 64-bit memory location set
aside for the redzone.  Which means that the four bytes immediately before
or after the allocated object at 0x00,0x00,0x00,0x00 for LE and BE
machines, respectively.  Which is probably not the most useful choice of
poison value.

One way to fix both of those at once is just to switch to 64-bit
redzones in all cases.

Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:14:57 -07:00
Christoph Lameter cfce66047f Slab allocators: remove useless __GFP_NO_GROW flag
There is no user remaining and I have never seen any use of that flag.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:57 -07:00
Christoph Lameter 4f10493459 slab allocators: Remove SLAB_CTOR_ATOMIC
SLAB_CTOR atomic is never used which is no surprise since I cannot imagine
that one would want to do something serious in a constructor or destructor.
 In particular given that the slab allocators run with interrupts disabled.
 Actions in constructors and destructors are by their nature very limited
and usually do not go beyond initializing variables and list operations.

(The i386 pgd ctor and dtors do take a spinlock in constructor and
destructor.....  I think that is the furthest we go at this point.)

There is no flag passed to the destructor so removing SLAB_CTOR_ATOMIC also
establishes a certain symmetry.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:57 -07:00
Christoph Lameter 50953fe9e0 slab allocators: Remove SLAB_DEBUG_INITIAL flag
I have never seen a use of SLAB_DEBUG_INITIAL.  It is only supported by
SLAB.

I think its purpose was to have a callback after an object has been freed
to verify that the state is the constructor state again?  The callback is
performed before each freeing of an object.

I would think that it is much easier to check the object state manually
before the free.  That also places the check near the code object
manipulation of the object.

Also the SLAB_DEBUG_INITIAL callback is only performed if the kernel was
compiled with SLAB debugging on.  If there would be code in a constructor
handling SLAB_DEBUG_INITIAL then it would have to be conditional on
SLAB_DEBUG otherwise it would just be dead code.  But there is no such code
in the kernel.  I think SLUB_DEBUG_INITIAL is too problematic to make real
use of, difficult to understand and there are easier ways to accomplish the
same effect (i.e.  add debug code before kfree).

There is a related flag SLAB_CTOR_VERIFY that is frequently checked to be
clear in fs inode caches.  Remove the pointless checks (they would even be
pointless without removeal of SLAB_DEBUG_INITIAL) from the fs constructors.

This is the last slab flag that SLUB did not support.  Remove the check for
unimplemented flags from SLUB.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:57 -07:00
Akinobu Mita 824ebef122 fault injection: fix failslab with CONFIG_NUMA
Currently failslab injects failures into ____cache_alloc().  But with enabling
CONFIG_NUMA it's not enough to let actual slab allocator functions (kmalloc,
kmem_cache_alloc, ...) return NULL.

This patch moves fault injection hook inside of __cache_alloc() and
__cache_alloc_node().  These are lower call path than ____cache_alloc() and
enable to inject faulures to slab allocators with CONFIG_NUMA.

Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:55 -07:00
Christoph Lameter 5af6083990 slab allocators: Remove obsolete SLAB_MUST_HWCACHE_ALIGN
This patch was recently posted to lkml and acked by Pekka.

The flag SLAB_MUST_HWCACHE_ALIGN is

1. Never checked by SLAB at all.

2. A duplicate of SLAB_HWCACHE_ALIGN for SLUB

3. Fulfills the role of SLAB_HWCACHE_ALIGN for SLOB.

The only remaining use is in sparc64 and ppc64 and their use there
reflects some earlier role that the slab flag once may have had. If
its specified then SLAB_HWCACHE_ALIGN is also specified.

The flag is confusing, inconsistent and has no purpose.

Remove it.

Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:55 -07:00
matze b4169525bc include KERN_* constant in printk() calls in mm/slab.c
Signed-off-by: Matthias Kaehlcke <matthias.kaehlcke@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:54 -07:00
Christoph Lameter b49af68ff9 Add virt_to_head_page and consolidate code in slab and slub
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:54 -07:00
Christoph Lameter d85f33855c Make page->private usable in compound pages
If we add a new flag so that we can distinguish between the first page and the
tail pages then we can avoid to use page->private in the first page.
page->private == page for the first page, so there is no real information in
there.

Freeing up page->private makes the use of compound pages more transparent.
They become more usable like real pages.  Right now we have to be careful f.e.
 if we are going beyond PAGE_SIZE allocations in the slab on i386 because we
can then no longer use the private field.  This is one of the issues that
cause us not to support debugging for page size slabs in SLAB.

Having page->private available for SLUB would allow more meta information in
the page struct.  I can probably avoid the 16 bit ints that I have in there
right now.

Also if page->private is available then a compound page may be equipped with
buffer heads.  This may free up the way for filesystems to support larger
blocks than page size.

We add PageTail as an alias of PageReclaim.  Compound pages cannot currently
be reclaimed.  Because of the alias one needs to check PageCompound first.

The RFC for the this approach was discussed at
http://marc.info/?t=117574302800001&r=1&w=2

[nacc@us.ibm.com: fix hugetlbfs]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:53 -07:00
Andrew Morton a3a02be791 slab: mark set_up_list3s() __init
It is only ever used prior to free_initmem().

(It will cause a warning when we run the section checking, but that's a
false-positive and it simply changes the source of an existing warning, which
is also a false-positive)

Cc: Christoph Lameter <clameter@engr.sgi.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:53 -07:00
Eric Dumazet 8da3430d8a slab: NUMA kmem_cache diet
Some NUMA machines have a big MAX_NUMNODES (possibly 1024), but fewer
possible nodes.  This patch dynamically sizes the 'struct kmem_cache' to
allocate only needed space.

I moved nodelists[] field at the end of struct kmem_cache, and use the
following computation in kmem_cache_init()

cache_cache.buffer_size = offsetof(struct kmem_cache, nodelists) +
                                 nr_node_ids * sizeof(struct kmem_list3 *);

On my two nodes x86_64 machine, kmem_cache.obj_size is now 192 instead of 704
(This is because on x86_64, MAX_NUMNODES is 64)

On bigger NUMA setups, this might reduce the gfporder of "cache_cache"

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:53 -07:00
Eric Dumazet 6310984694 SLAB: don't allocate empty shared caches
We can avoid allocating empty shared caches and avoid unecessary check of
cache->limit.  We save some memory.  We avoid bringing into CPU cache
unecessary cache lines.

All accesses to l3->shared are already checking NULL pointers so this patch is
safe.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:53 -07:00
Eric Dumazet 364fbb29a0 SLAB: use num_possible_cpus() in enable_cpucache()
The existing comment in mm/slab.c is *perfect*, so I reproduce it :

         /*
          * CPU bound tasks (e.g. network routing) can exhibit cpu bound
          * allocation behaviour: Most allocs on one cpu, most free operations
          * on another cpu. For these cases, an efficient object passing between
          * cpus is necessary. This is provided by a shared array. The array
          * replaces Bonwick's magazine layer.
          * On uniprocessor, it's functionally equivalent (but less efficient)
          * to a larger limit. Thus disabled by default.
          */

As most shiped linux kernels are now compiled with CONFIG_SMP, there is no way
a preprocessor #if can detect if the machine is UP or SMP. Better to use
num_possible_cpus().

This means on UP we allocate a 'size=0 shared array', to be more efficient.

Another patch can later avoid the allocations of 'empty shared arrays', to
save some memory.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:52 -07:00
Pekka Enberg 714b8171af slab: ensure cache_alloc_refill terminates
If slab->inuse is corrupted, cache_alloc_refill can enter an infinite
loop as detailed by Michael Richardson in the following post:
<http://lkml.org/lkml/2007/2/16/292>. This adds a BUG_ON to catch
those cases.

Cc: Michael Richardson <mcr@sandelman.ca>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:51 -07:00
Pekka Enberg fd76bab2fa slab: introduce krealloc
This introduce krealloc() that reallocates memory while keeping the contents
unchanged.  The allocator avoids reallocation if the new size fits the
currently used cache.  I also added a simple non-optimized version for
mm/slob.c for compatibility.

[akpm@linux-foundation.org: fix warnings]
Acked-by: Josef Sipek <jsipek@fsl.cs.sunysb.edu>
Acked-by: Matt Mackall <mpm@selenic.com>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:50 -07:00
Siddha, Suresh B 62918a0361 [PATCH] x86-64: skip cache_free_alien() on non NUMA
Set use_alien_caches to 0 on non NUMA platforms.  And avoid calling the
cache_free_alien() when use_alien_caches is not set.  This will avoid the
cache miss that happens while dereferencing slabp to get nodeid.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Eric Dumazet <dada1@cosmosbay.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2007-05-02 19:27:18 +02:00
David Howells e94a40c508 [PATCH] SLAB: Mention slab name when listing corrupt objects
Mention the slab name when listing corrupt objects.  Although the function
that released the memory is mentioned, that is frequently ambiguous as such
functions often release several pieces of memory.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-04-04 08:51:52 -07:00
Randy Dunlap 05fb6bf0b2 [PATCH] kernel-doc fixes for 2.6.20-git15 (non-drivers)
Fix kernel-doc warnings in 2.6.20-git15 (lib/, mm/, kernel/, include/).

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-01 14:53:37 -08:00
Christoph Lameter 8ef8286689 [PATCH] slab: reduce size of alien cache to cover only possible nodes
The alien cache is a per cpu per node array allocated for every slab on the
system.  Currently we size this array for all nodes that the kernel does
support.  For IA64 this is 1024 nodes.  So we allocate an array with 1024
objects even if we only boot a system with 4 nodes.

This patch uses "nr_node_ids" to determine the number of possible nodes
supported by a hardware configuration and only allocates an alien cache
sized for possible nodes.

The initialization of nr_node_ids occurred too late relative to the bootstrap
of the slab allocator and so I moved the setup_nr_node_ids() into
free_area_init_nodes().

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-20 17:10:13 -08:00
Robert P. J. Day 72fd4a35a8 [PATCH] Numerous fixes to kernel-doc info in source files.
A variety of (mostly) innocuous fixes to the embedded kernel-doc content in
source files, including:

  * make multi-line initial descriptions single line
  * denote some function names, constants and structs as such
  * change erroneous opening '/*' to '/**' in a few places
  * reword some text for clarity

Signed-off-by: Robert P. J. Day <rpjday@mindspring.com>
Cc: "Randy.Dunlap" <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:32 -08:00
Ingo Molnar 898552c9d8 [PATCH] lockdep: also check for freed locks in kmem_cache_free()
kmem_cache_free() was missing the check for freeing held locks.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:26 -08:00
Christoph Lameter 4b51d66989 [PATCH] optional ZONE_DMA: optional ZONE_DMA in the VM
Make ZONE_DMA optional in core code.

- ifdef all code for ZONE_DMA and related definitions following the example
  for ZONE_DMA32 and ZONE_HIGHMEM.

- Without ZONE_DMA, ZONE_HIGHMEM and ZONE_DMA32 we get to a ZONES_SHIFT of
  0.

- Modify the VM statistics to work correctly without a DMA zone.

- Modify slab to not create DMA slabs if there is no ZONE_DMA.

[akpm@osdl.org: cleanup]
[jdike@addtoit.com: build fix]
[apw@shadowen.org: Simplify calculation of the number of bits we need for ZONES_SHIFT]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Andi Kleen <ak@suse.de>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Matthew Wilcox <willy@debian.org>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Jeff Dike <jdike@addtoit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:18 -08:00
Christoph Lameter 7c5cae368a [PATCH] slab: use parameter passed to cache_reap to determine pointer to work structure
Use the pointer passed to cache_reap to determine the work pointer and
consolidate exit paths.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:17 -08:00
Pekka Enberg 8c8cc2c10c [PATCH] slab: cache alloc cleanups
Clean up __cache_alloc and __cache_alloc_node functions a bit.  We no
longer need to do NUMA_BUILD tricks and the UMA allocation path is much
simpler.  No functional changes in this patch.

Note: saves few kernel text bytes on x86 NUMA build due to using gotos in
__cache_alloc_node() and moving __GFP_THISNODE check in to
fallback_alloc().

Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Manfred Spraul <manfred@colorfullife.com>
Acked-by: Christoph Lameter <christoph@lameter.com>
Cc: Paul Jackson <pj@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:16 -08:00
Pekka Enberg 6e40e73097 [PATCH] slab: remove broken PageSlab check from kfree_debugcheck
The PageSlab debug check in kfree_debugcheck() is broken for compound
pages.  It is also redundant as we already do BUG_ON for non-slab pages in
page_get_cache() and page_get_slab() which are always called before we free
any actual objects.

Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:16 -08:00
Hugh Dickins b6a6045181 [PATCH] fix BUG_ON(!PageSlab) from fallback_alloc
pdflush hit the BUG_ON(!PageSlab(page)) in kmem_freepages called from
fallback_alloc: cache_grow already freed those pages when alloc_slabmgmt
failed.  But it wouldn't have freed them if __GFP_NO_GROW, so make sure
fallback_alloc doesn't waste its time on that case.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Acked-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Pekka J Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2007-01-05 23:55:23 -08:00
Randy Dunlap af9997e426 [PATCH] fix kernel-doc warnings in 2.6.20-rc1
Fix kernel-doc warnings in 2.6.20-rc1.

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-22 08:55:47 -08:00
Christoph Lameter b7f869a284 [PATCH] slab: fix kmem_ptr_validate definition
The declaration of kmem_ptr_validate in slab.h does not match the
one in slab.c. Remove the fastcall attribute (this is the only use in
slab.c).

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-22 08:55:47 -08:00
Eric Dumazet 6a2d7a955d [PATCH] SLAB: use a multiply instead of a divide in obj_to_index()
When some objects are allocated by one CPU but freed by another CPU we can
consume lot of cycles doing divides in obj_to_index().

(Typical load on a dual processor machine where network interrupts are
handled by one particular CPU (allocating skbufs), and the other CPU is
running the application (consuming and freeing skbufs))

Here on one production server (dual-core AMD Opteron 285), I noticed this
divide took 1.20 % of CPU_CLK_UNHALTED events in kernel.  But Opteron are
quite modern cpus and the divide is much more expensive on oldest
architectures :

On a 200 MHz sparcv9 machine, the division takes 64 cycles instead of 1
cycle for a multiply.

Doing some math, we can use a reciprocal multiplication instead of a divide.

If we want to compute V = (A / B)  (A and B being u32 quantities)
we can instead use :

V = ((u64)A * RECIPROCAL(B)) >> 32 ;

where RECIPROCAL(B) is precalculated to ((1LL << 32) + (B - 1)) / B

Note :

I wrote pure C code for clarity. gcc output for i386 is not optimal but
acceptable :

mull   0x14(%ebx)
mov    %edx,%eax // part of the >> 32
xor     %edx,%edx // useless
mov    %eax,(%esp) // could be avoided
mov    %edx,0x4(%esp) // useless
mov    (%esp),%ebx

[akpm@osdl.org: small cleanups]
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-13 09:05:49 -08:00
Paul Jackson 02a0e53d82 [PATCH] cpuset: rework cpuset_zone_allowed api
Elaborate the API for calling cpuset_zone_allowed(), so that users have to
explicitly choose between the two variants:

  cpuset_zone_allowed_hardwall()
  cpuset_zone_allowed_softwall()

Until now, whether or not you got the hardwall flavor depended solely on
whether or not you or'd in the __GFP_HARDWALL gfp flag to the gfp_mask
argument.

If you didn't specify __GFP_HARDWALL, you implicitly got the softwall
version.

Unfortunately, this meant that users would end up with the softwall version
without thinking about it.  Since only the softwall version might sleep,
this led to bugs with possible sleeping in interrupt context on more than
one occassion.

The hardwall version requires that the current tasks mems_allowed allows
the node of the specified zone (or that you're in interrupt or that
__GFP_THISNODE is set or that you're on a one cpuset system.)

The softwall version, depending on the gfp_mask, might allow a node if it
was allowed in the nearest enclusing cpuset marked mem_exclusive (which
requires taking the cpuset lock 'callback_mutex' to evaluate.)

This patch removes the cpuset_zone_allowed() call, and forces the caller to
explicitly choose between the hardwall and the softwall case.

If the caller wants the gfp_mask to determine this choice, they should (1)
be sure they can sleep or that __GFP_HARDWALL is set, and (2) invoke the
cpuset_zone_allowed_softwall() routine.

This adds another 100 or 200 bytes to the kernel text space, due to the few
lines of nearly duplicate code at the top of both cpuset_zone_allowed_*
routines.  It should save a few instructions executed for the calls that
turned into calls of cpuset_zone_allowed_hardwall, thanks to not having to
set (before the call) then check (within the call) the __GFP_HARDWALL flag.

For the most critical call, from get_page_from_freelist(), the same
instructions are executed as before -- the old cpuset_zone_allowed()
routine it used to call is the same code as the
cpuset_zone_allowed_softwall() routine that it calls now.

Not a perfect win, but seems worth it, to reduce this chance of hitting a
sleeping with irq off complaint again.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-13 09:05:49 -08:00
Christoph Lameter 55935a34a4 [PATCH] More slab.h cleanups
More cleanups for slab.h

1. Remove tabs from weird locations as suggested by Pekka

2. Drop the check for NUMA and SLAB_DEBUG from the fallback section
   as suggested by Pekka.

3. Uses static inline for the fallback defs as also suggested by Pekka.

4. Make kmem_ptr_valid take a const * argument.

5. Separate the NUMA fallback definitions from the kmalloc_track fallback
   definitions.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-13 09:05:49 -08:00
Christoph Lameter dd47ea7556 [PATCH] slab: fix sleeping in atomic bug
Fallback_alloc() does not do the check for GFP_WAIT as done in
cache_grow().  Thus interrupts are disabled when we call kmem_getpages()
which results in the failure.

Duplicate the handling of GFP_WAIT in cache_grow().

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Jay Cliburn <jacliburn@bellsouth.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-13 09:05:48 -08:00
Arjan van de Ven 2b2842146c [PATCH] user of the jiffies rounding patch: Slab
This patch introduces users of the round_jiffies() function in the slab code.

The slab code has a few "run every second" timers for background work; these
are obviously not timing critical as long as they happen roughly at the right
frequency.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 09:57:22 -08:00
Don Mullis 6b1b60f41e [PATCH] fault-injection: defaults likely to please a new user
Assign defaults most likely to please a new user:
 1) generate some logging output
    (verbose=2)
 2) avoid injecting failures likely to lock up UI
    (ignore_gfp_wait=1, ignore_gfp_highmem=1)

Signed-off-by: Don Mullis <dwm@meer.net>
Cc: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:29:03 -08:00
Akinobu Mita 8a8b6502fb [PATCH] fault-injection capability for kmalloc
This patch provides fault-injection capability for kmalloc.

Boot option:

failslab=<interval>,<probability>,<space>,<times>

	<interval> -- specifies the interval of failures.

	<probability> -- specifies how often it should fail in percent.

	<space> -- specifies the size of free space where memory can be
		   allocated safely in bytes.

	<times> -- specifies how many times failures may happen at most.

Debugfs:

/debug/failslab/interval
/debug/failslab/probability
/debug/failslab/specifies
/debug/failslab/times
/debug/failslab/ignore-gfp-highmem
/debug/failslab/ignore-gfp-wait

Example:

	failslab=10,100,0,-1

slab allocation (kmalloc(), kmem_cache_alloc(),..) fails once per 10 times.

Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:29:02 -08:00
Paul Jackson b8b50b6519 [PATCH] mm: fallback_alloc cpuset_zone_allowed irq fix
fallback_alloc() could end up calling cpuset_zone_allowed() with interrupts
disabled (by code in kmem_cache_alloc_node()), but without __GFP_HARDWALL
set, leading to a possible call of a sleeping function with interrupts
disabled.

This results in the BUG report:

  BUG: sleeping function called from invalid context at kernel/cpuset.c:1520
in_atomic():0, irqs_disabled():1

Thanks to Paul Menage for catching this one.

Signed-off-by: Paul Jackson <pj@sgi.com>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:28:37 -08:00
Helge Deller 15ad7cdcfd [PATCH] struct seq_operations and struct file_operations constification
- move some file_operations structs into the .rodata section

 - move static strings from policy_types[] array into the .rodata section

 - fix generic seq_operations usages, so that those structs may be defined
   as "const" as well

[akpm@osdl.org: couple of fixes]
Signed-off-by: Helge Deller <deller@gmx.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:46 -08:00
Andrew Morton 138ae6631a [PATCH] slab: use probe_kernel_address()
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:34 -08:00
Christoph Lameter 3c517a6132 [PATCH] slab: better fallback allocation behavior
Currently we simply attempt to allocate from all allowed nodes using
GFP_THISNODE.  However, GFP_THISNODE does not do reclaim (it wont do any at
all if the recent GFP_THISNODE patch is accepted).  If we truly run out of
memory in the whole system then fallback_alloc may return NULL although
memory may still be available if we would perform more thorough reclaim.

This patch changes fallback_alloc() so that we first only inspect all the
per node queues for available slabs.  If we find any then we allocate from
those.  This avoids slab fragmentation by first getting rid of all partial
allocated slabs on every node before allocating new memory.

If we cannot satisfy the allocation from any per node queue then we extend
a slab.  We now call into the page allocator without specifying
GFP_THISNODE.  The page allocator will then implement its own fallback (in
the given cpuset context), perform necessary reclaim (again considering not
a single node but the whole set of allowed nodes) and then return pages for
a new slab.

We identify from which node the pages were allocated and then insert the
pages into the corresponding per node structure.  In order to do so we need
to modify cache_grow() to take a parameter that specifies the new slab.
kmem_getpages() can no longer set the GFP_THISNODE flag since we need to be
able to use kmem_getpage to allocate from an arbitrary node.  GFP_THISNODE
needs to be specified when calling cache_grow().

One key advantage is that the decision from which node to allocate new
memory is removed from slab fallback processing.  The patch allows to go
back to use of the page allocators fallback/reclaim logic.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:25 -08:00
Christoph Lameter 5bcd234d88 [PATCH] slab: fix two issues in kmalloc_node / __cache_alloc_node
This addresses two issues:

1. Kmalloc_node() may intermittently return NULL if we are allocating
   from the current node and are unable to obtain memory for the current
   node from the page allocator.  This is because we call ___cache_alloc()
   if nodeid == numa_node_id() and ____cache_alloc is not able to fallback
   to other nodes.

   This was introduced in the 2.6.19 development cycle.  <= 2.6.18 in
   that case does not do a restricted allocation and blindly trusts the
   page allocator to have given us memory from the indicated node.  It
   inserts the page regardless of the node it came from into the queues for
   the current node.

2. If kmalloc_node() is used on a node that has not been bootstrapped
   yet then we may try to pass an invalid node number to
   ____cache_alloc_node() triggering a BUG().

   Change the function to call fallback_alloc() instead.  Only call
   fallback_alloc() if we are allowed to fallback at all.  The need to
   handle a node not bootstrapped yet also first surfaced in the 2.6.19
   cycle.

Update the comments since they were still describing the old kmalloc_node
from 2.6.12.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:25 -08:00
Christoph Lameter 441e143e95 [PATCH] slab: remove SLAB_DMA
SLAB_DMA is an alias of GFP_DMA. This is the last one so we
remove the leftover comment too.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:24 -08:00
Christoph Lameter e94b176609 [PATCH] slab: remove SLAB_KERNEL
SLAB_KERNEL is an alias of GFP_KERNEL.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:24 -08:00
Christoph Lameter a06d72c1dc [PATCH] slab: remove SLAB_LEVEL_MASK
SLAB_LEVEL_MASK is only used internally to the slab and is
and alias of GFP_LEVEL_MASK.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:23 -08:00
Christoph Lameter 6e0eaa4b05 [PATCH] slab: remove SLAB_NO_GROW
It is only used internally in the slab.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:23 -08:00
Christoph Hellwig 8b98c1699e [PATCH] leak tracking for kmalloc_node
We have variants of kmalloc and kmem_cache_alloc that leave leak tracking to
the caller.  This is used for subsystem-specific allocators like skb_alloc.

To make skb_alloc node-aware we need similar routines for the node-aware slab
allocator, which this patch adds.

Note that the code is rather ugly, but it mirrors the non-node-aware code 1:1:

[akpm@osdl.org: add module export]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:22 -08:00
Paul Menage 3395ee0588 [PATCH] mm: add noaliencache boot option to disable numa alien caches
When using numa=fake on non-NUMA hardware there is no benefit to having the
alien caches, and they consume much memory.

Add a kernel boot option to disable them.

Christoph sayeth "This is good to have even on large NUMA.  The problem is
that the alien caches grow by the square of the size of the system in terms of
nodes."

Cc: Christoph Lameter <clameter@engr.sgi.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:21 -08:00
Ravikiran G Thirumalai 8f5be20bf8 [PATCH] mm: slab: eliminate lock_cpu_hotplug from slab
Here's an attempt towards doing away with lock_cpu_hotplug in the slab
subsystem.  This approach also fixes a bug which shows up when cpus are
being offlined/onlined and slab caches are being tuned simultaneously.

http://marc.theaimsgroup.com/?l=linux-kernel&m=116098888100481&w=2

The patch has been stress tested overnight on a 2 socket 4 core AMD box with
repeated cpu online and offline, while dbench and kernbench process are
running, and slab caches being tuned at the same time.
There were no lockdep warnings either.  (This test on 2,6.18 as 2.6.19-rc
crashes at __drain_pages
http://marc.theaimsgroup.com/?l=linux-kernel&m=116172164217678&w=2 )

The approach here is to hold cache_chain_mutex from CPU_UP_PREPARE until
CPU_ONLINE (similar in approach as worqueue_mutex) .  Slab code sensitive
to cpu_online_map (kmem_cache_create, kmem_cache_destroy, slabinfo_write,
__cache_shrink) is already serialized with cache_chain_mutex.  (This patch
lengthens cache_chain_mutex hold time at kmem_cache_destroy to cover this).
 This patch also takes the cache_chain_sem at kmem_cache_shrink to protect
sanity of cpu_online_map at __cache_shrink, as viewed by slab.
(kmem_cache_shrink->__cache_shrink->drain_cpu_caches).  But, really,
kmem_cache_shrink is used at just one place in the acpi subsystem!  Do we
really need to keep kmem_cache_shrink at all?

Another note.  Looks like a cpu hotplug event can send  CPU_UP_CANCELED to
a registered subsystem even if the subsystem did not receive CPU_UP_PREPARE.
This could be due to a subsystem registered for notification earlier than
the current subsystem crapping out with NOTIFY_BAD. Badness can occur with
in the CPU_UP_CANCELED code path at slab if this happens (The same would
apply for workqueue.c as well).  To overcome this, we might have to use either
a) a per subsystem flag and avoid handling of CPU_UP_CANCELED, or
b) Use a special notifier events like LOCK_ACQUIRE/RELEASE as Gautham was
   using in his experiments, or
c) Do not send CPU_UP_CANCELED to a subsystem which did not receive
   CPU_UP_PREPARE.

I would prefer c).

Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Shai Fultheim <shai@scalex86.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:21 -08:00
Kevin Hilman a44b56d354 [PATCH] slab debug and ARCH_SLAB_MINALIGN don't get along
When CONFIG_SLAB_DEBUG is used in combination with ARCH_SLAB_MINALIGN, some
debug flags should be disabled which depend on BYTES_PER_WORD alignment.

The disabling of these debug flags is not properly handled when
BYTES_PER_WORD < ARCH_SLAB_MEMALIGN < cache_line_size()

This patch fixes that and also adds an alignment check to
cache_alloc_debugcheck_after() when ARCH_SLAB_MINALIGN is used.

Signed-off-by: Kevin Hilman <khilman@mvista.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:21 -08:00
David Howells 65f27f3844 WorkStruct: Pass the work_struct pointer instead of context data
Pass the work_struct pointer to the work function rather than context data.
The work function can use container_of() to work out the data.

For the cases where the container of the work_struct may go away the moment the
pending bit is cleared, it is made possible to defer the release of the
structure by deferring the clearing of the pending bit.

To make this work, an extra flag is introduced into the management side of the
work_struct.  This governs auto-release of the structure upon execution.

Ordinarily, the work queue executor would release the work_struct for further
scheduling or deallocation by clearing the pending bit prior to jumping to the
work function.  This means that, unless the driver makes some guarantee itself
that the work_struct won't go away, the work function may not access anything
else in the work_struct or its container lest they be deallocated..  This is a
problem if the auxiliary data is taken away (as done by the last patch).

However, if the pending bit is *not* cleared before jumping to the work
function, then the work function *may* access the work_struct and its container
with no problems.  But then the work function must itself release the
work_struct by calling work_release().

In most cases, automatic release is fine, so this is the default.  Special
initiators exist for the non-auto-release case (ending in _NAR).


Signed-Off-By: David Howells <dhowells@redhat.com>
2006-11-22 14:55:48 +00:00
David Howells 52bad64d95 WorkStruct: Separate delayable and non-delayable events.
Separate delayable work items from non-delayable work items be splitting them
into a separate structure (delayed_work), which incorporates a work_struct and
the timer_list removed from work_struct.

The work_struct struct is huge, and this limits it's usefulness.  On a 64-bit
architecture it's nearly 100 bytes in size.  This reduces that by half for the
non-delayable type of event.

Signed-Off-By: David Howells <dhowells@redhat.com>
2006-11-22 14:54:01 +00:00
Daniel Yeisley 7f6b8876c7 [PATCH] init_reap_node() initialization fix
It looks like there is a bug in init_reap_node() in slab.c that can cause
multiple oops's on certain ES7000 configurations.  The variable reap_node
is defined per cpu, but only initialized on a single CPU.  This causes an
oops in next_reap_node() when __get_cpu_var(reap_node) returns the wrong
value.  Fix is below.

Signed-off-by: Dan Yeisley <dan.yeisley@unisys.com>
Cc: Andi Kleen <ak@suse.de>
Acked-by: Christoph Lameter <clameter@engr.sgi.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-11-03 12:27:58 -08:00
Christoph Lameter aedb0eb107 [PATCH] Slab: Do not fallback to nodes that have not been bootstrapped yet
The zonelist may contain zones of nodes that have not been bootstrapped and
we will oops if we try to allocate from those zones.  So check if the node
information for the slab and the node have been setup before attempting an
allocation.  If it has not been setup then skip that zone.

Usually we will not encounter this situation since the slab bootstrap code
avoids falling back before we have setup the respective nodes but we seem
to have a special needs for pppc.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Andy Whitcroft <apw@shadowen.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Mike Kravetz <kravetz@us.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Will Schmidt <will_schmidt@vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-21 13:35:06 -07:00
Christoph Lameter dcbd4ec4c2 [PATCH] slab: remove wrongly placed BUG_ON
Init list is called with a list parameter that is not equal to the
cachep->nodelists entry under NUMA if more than one node exists.  This is
fully legitimatei.  One may want to populate the list fields before
switching nodelist pointers.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-07 10:51:14 -07:00
Pekka Enberg 1ca4cb2418 [PATCH] slab: reduce numa text size
Reduce the NUMA text size of mm/slab.o a little on x86 by using a local
variable to store the result of numa_node_id().

    text    data     bss     dec     hex filename
   16858    2584      16   19458    4c02 mm/slab.o (before)
   16804    2584      16   19404    4bcc mm/slab.o (after)

[akpm@osdl.org: use better names]
[pbadari@us.ibm.com: fix that]
Cc: Christoph Lameter <christoph@lameter.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-06 08:53:40 -07:00
Linus Torvalds fefd26b3b8 Merge master.kernel.org:/pub/scm/linux/kernel/git/davej/configh
* master.kernel.org:/pub/scm/linux/kernel/git/davej/configh:
  Remove all inclusions of <linux/config.h>

Manually resolved trivial path conflicts due to removed files in
the sound/oss/ subdirectory.
2006-10-04 09:59:57 -07:00
Christoph Hellwig 1d2c8eea69 [PATCH] slab: clean up leak tracking ifdefs a little bit
- rename ____kmalloc to kmalloc_track_caller so that people have a chance
  to guess what it does just from it's name.  Add a comment describing it
  for those who don't.  Also move it after kmalloc in slab.h so people get
  less confused when they are just looking for kmalloc - move things around
  in slab.c a little to reduce the ifdef mess.

[penberg@cs.helsinki.fi: Fix up reversed #ifdef]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-04 07:55:13 -07:00
Dave Jones 038b0a6d8d Remove all inclusions of <linux/config.h>
kbuild explicitly includes this at build time.

Signed-off-by: Dave Jones <davej@redhat.com>
2006-10-04 03:38:54 -04:00
Dave Jones aa83aa40ed [PATCH] single bit flip detector
In cases where we detect a single bit has been flipped, we spew the usual
slab corruption message, which users instantly think is a kernel bug.  In a
lot of cases, single bit errors are down to bad memory, or other hardware
failure.

This patch adds an extra line to the slab debug messages in those cases, in
the hope that users will try memtest before they report a bug.

000: 6b 6b 6b 6b 6a 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
Single bit error detected. Possibly bad RAM. Run memtest86.

[akpm@osdl.org: cleanups]
Signed-off-by: Dave Jones <davej@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:10 -07:00
Christoph Lameter 765c4507af [PATCH] GFP_THISNODE for the slab allocator
This patch insures that the slab node lists in the NUMA case only contain
slabs that belong to that specific node.  All slab allocations use
GFP_THISNODE when calling into the page allocator.  If an allocation fails
then we fall back in the slab allocator according to the zonelists appropriate
for a certain context.

This allows a replication of the behavior of alloc_pages and alloc_pages node
in the slab layer.

Currently allocations requested from the page allocator may be redirected via
cpusets to other nodes.  This results in remote pages on nodelists and that in
turn results in interrupt latency issues during cache draining.  Plus the slab
is handing out memory as local when it is really remote.

Fallback for slab memory allocations will occur within the slab allocator and
not in the page allocator.  This is necessary in order to be able to use the
existing pools of objects on the nodes that we fall back to before adding more
pages to a slab.

The fallback function insures that the nodes we fall back to obey cpuset
restrictions of the current context.  We do not allocate objects from outside
of the current cpuset context like before.

Note that the implementation of locality constraints within the slab allocator
requires importing logic from the page allocator.  This is a mischmash that is
not that great.  Other allocators (uncached allocator, vmalloc, huge pages)
face similar problems and have similar minimal reimplementations of the basic
fallback logic of the page allocator.  There is another way of implementing a
slab by avoiding per node lists (see modular slab) but this wont work within
the existing slab.

V1->V2:
- Use NUMA_BUILD to avoid #ifdef CONFIG_NUMA
- Exploit GFP_THISNODE being 0 in the NON_NUMA case to avoid another
  #ifdef

[akpm@osdl.org: build fix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 08:26:12 -07:00
Christoph Lameter de3083ec3e [PATCH] slab: fix kmalloc_node applying memory policies if nodeid == numa_node_id()
kmalloc_node() falls back to ___cache_alloc() under certain conditions and
at that point memory policies may be applied redirecting the allocation
away from the current node.  Therefore kmalloc_node(...,numa_node_id()) or
kmalloc_node(...,-1) may not return memory from the local node.

Fix this by doing the policy check in __cache_alloc() instead of
____cache_alloc().

This version here is a cleanup of Kiran's patch.

- Tested on ia64.
- Extra material removed.
- Consolidate the exit path if alternate_node_alloc() returned an object.

[akpm@osdl.org: warning fix]
Signed-off-by: Alok N Kataria <alok.kataria@calsoftinc.com>
Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Shai Fultheim <shai@scalex86.org>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 08:26:12 -07:00
Alexey Dobriyan 133d205a18 [PATCH] Make kmem_cache_destroy() return void
un-, de-, -free, -destroy, -exit, etc functions should in general return
void.  Also,

There is very little, say, filesystem driver code can do upon failed
kmem_cache_destroy().  If it will be decided to BUG in this case, BUG
should be put in generic code, instead.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 08:26:11 -07:00
Christoph Lameter 972d1a7b14 [PATCH] ZVC: Support NR_SLAB_RECLAIMABLE / NR_SLAB_UNRECLAIMABLE
Remove the atomic counter for slab_reclaim_pages and replace the counter
and NR_SLAB with two ZVC counter that account for unreclaimable and
reclaimable slab pages: NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE.

Change the check in vmscan.c to refer to to NR_SLAB_RECLAIMABLE.  The
intend seems to be to check for slab pages that could be freed.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:51 -07:00
Christoph Lameter d00bcc98d7 [PATCH] Extract the allocpercpu functions from the slab allocator
The allocpercpu functions __alloc_percpu and __free_percpu() are heavily
using the slab allocator.  However, they are conceptually slab.  This also
simplifies SLOB (at this point slob may be broken in mm.  This should fix
it).

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:51 -07:00
Siddha, Suresh B d2e7b7d0aa [PATCH] fix potential stack overflow in mm/slab.c
On High end systems (1024 or so cpus) this can potentially cause stack
overflow. Fix the stack usage.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:50 -07:00
Ravikiran G Thirumalai 056c62418c [PATCH] slab: fix lockdep warnings
Place the alien array cache locks of on slab malloc slab caches on a
seperate lockdep class.  This avoids false positives from lockdep

[akpm@osdl.org: build fix]
Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Shai Fultheim <shai@scalex86.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:50 -07:00
Christoph Lameter 2ed3a4ef95 [PATCH] slab: do not panic when alloc_kmemlist fails and slab is up
It is fairly easy to get a system to oops by simply sizing a cache via
/proc in such a way that one of the chaches (shared is easiest) becomes
bigger than the maximum allowed slab allocation size.  This occurs because
enable_cpucache() fails if it cannot reallocate some caches.

However, enable_cpucache() is used for multiple purposes: resizing caches,
cache creation and bootstrap.

If the slab is already up then we already have working caches.  The resize
can fail without a problem.  We just need to return the proper error code.
F.e.  after this patch:

# echo "size-64 10000 50 1000" >/proc/slabinfo
-bash: echo: write error: Cannot allocate memory

notice no OOPS.

If we are doing a kmem_cache_create() then we also should not panic but
return -ENOMEM.

If on the other hand we do not have a fully bootstrapped slab allocator yet
then we should indeed panic since we are unable to bring up the slab to its
full functionality.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:50 -07:00
Christoph Lameter 117f6eb1d8 [PATCH] slab: extract __kmem_cache_destroy from kmem_cache_destroy
The ability to free memory allocated to a slab cache is also useful if an
error occurs during setup of a slab.  So extract the function.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:50 -07:00
Christoph Hellwig dbe5e69d2d [PATCH] slab: optimize kmalloc_node the same way as kmalloc
[akpm@osdl.org: export fix]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:49 -07:00
Ravikiran G Thirumalai e5ac9c5aec [PATCH] Add some comments to slab.c
Also, checks if we get a valid slabp_cache for off slab slab-descriptors.
We should always get this.  If we don't, then in that case we, will have to
disable off-slab descriptors for this cache and do the calculations again.
This is a rare case, so add a BUG_ON, for now, just in case.

Signed-off-by: Alok N Kataria <alok.kataria@calsoftinc.com>
Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Shai Fultheim <shai@scalex86.org>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:49 -07:00
Pekka Enberg ca5f9703df [PATCH] slab: respect architecture and caller mandated alignment
As explained by Heiko, on s390 (32-bit) ARCH_KMALLOC_MINALIGN is set to
eight because their common I/O layer allocates data structures that need to
have an eight byte alignment.  This does not work when CONFIG_SLAB_DEBUG is
enabled because kmem_cache_create will override alignment to BYTES_PER_WORD
which is four.

So change kmem_cache_create to ensure cache alignment is always at minimum
what the architecture or caller mandates even if slab debugging is enabled.

Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:48 -07:00
Martin Peschke 7ff6f08295 [PATCH] CPU hotplug compatible alloc_percpu()
This patch splits alloc_percpu() up into two phases.  Likewise for
free_percpu().  This allows clients to limit initial allocations to online
cpu's, and to populate or depopulate per-cpu data at run time as needed:

  struct my_struct *obj;

  /* initial allocation for online cpu's */
  obj = percpu_alloc(sizeof(struct my_struct), GFP_KERNEL);

  ...

  /* populate per-cpu data for cpu coming online */
  ptr = percpu_populate(obj, sizeof(struct my_struct), GFP_KERNEL, cpu);

  ...

  /* access per-cpu object */
  ptr = percpu_ptr(obj, smp_processor_id());

  ...

  /* depopulate per-cpu data for cpu going offline */
  percpu_depopulate(obj, cpu);

  ...

  /* final removal */
  percpu_free(obj);

Signed-off-by: Martin Peschke <mp3@de.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:47 -07:00
Adrian Bunk b221385bc4 [PATCH] mm/: make functions static
This patch makes the following needlessly global functions static:
 - slab.c: kmem_find_general_cachep()
 - swap.c: __page_cache_release()
 - vmalloc.c: __vmalloc_node()

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:45 -07:00
Rolf Eike Beer b8008b2bc2 [PATCH] Fix kmem_cache_alloc() been documented twice
kmem_cache_alloc() was documented twice, but kmem_cache_zalloc() never.
Fix this obvious typo to get things right.

Signed-off-by: Rolf Eike Beer <eike-kernel@sf-tec.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-31 13:28:43 -07:00
Chandra Seetharaman 8c78f3075d [PATCH] cpu hotplug: replace __devinit* with __cpuinit* for cpu notifications
Few of the callback functions and notifier blocks that are associated with cpu
notifications incorrectly have __devinit and __devinitdata.  They should be
__cpuinit and __cpuinitdata instead.

It makes no functional difference but wastes text area when CONFIG_HOTPLUG is
enabled and CONFIG_HOTPLUG_CPU is not.

This patch fixes all those instances.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-31 13:28:39 -07:00
Ingo Molnar fc818301a8 [PATCH] revert slab.c locking change
Chandra Seetharaman reported SLAB crashes caused by the slab.c lock
annotation patch.  There is only one chunk of that patch that has a
material effect on the slab logic - this patch undoes that chunk.

This was confirmed to fix the slab problem by Chandra.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Tested-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-13 15:38:43 -07:00
Arjan van de Ven f1aaee53f2 [PATCH] lockdep: annotate mm/slab.c
mm/slab.c uses nested locking when dealing with 'off-slab'
caches, in that case it allocates the slab header from the
(on-slab) kmalloc caches. Teach the lock validator about
this by putting all on-slab caches into a separate class.

this patch has no effect on non-lockdep kernels.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-13 12:02:44 -07:00
Ingo Molnar 873623dfab [PATCH] lockdep: undo mm/slab.c annotation
undo existing mm/slab.c lock-validator annotations, in preparation
of a new, less intrusive annotation patch.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-13 12:02:44 -07:00
Ingo Molnar 2b2d5493e1 [PATCH] lockdep: annotate SLAB code
Teach special (recursive) locking code to the lock validator.  Has no effect
on non-lockdep kernels.

Fix initialize-locks-via-memcpy assumptions.

Effects on non-lockdep kernels: the subclass nesting parameter is passed into
cache_free_alien() and __cache_free(), and turns one internal
kmem_cache_free() call into an open-coded __cache_free() call.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:09 -07:00
Christoph Lameter ed11d9eb22 [PATCH] slab: consolidate code to free slabs from freelist
Post and discussion:
http://marc.theaimsgroup.com/?t=115074342800003&r=1&w=2

Code in __shrink_node() duplicates code in cache_reap()

Add a new function drain_freelist that removes slabs with objects that are
already free and use that in various places.

This eliminates the __node_shrink() function and provides the interrupt
holdoff reduction from slab_free to code that used to call __node_shrink.

[akpm@osdl.org: build fixes]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-30 11:25:36 -07:00
Christoph Lameter 9a865ffa34 [PATCH] zoned vm counters: conversion of nr_slab to per zone counter
- Allows reclaim to access counter without looping over processor counts.

- Allows accurate statistics on how many pages are used in a zone by
  the slab. This may become useful to balance slab allocations over
  various zones.

[akpm@osdl.org: bugfix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-30 11:25:35 -07:00
Christoph Lameter 2244b95a7b [PATCH] zoned vm counters: basic ZVC (zoned vm counter) implementation
Per zone counter infrastructure

The counters that we currently have for the VM are split per processor.  The
processor however has not much to do with the zone these pages belong to.  We
cannot tell f.e.  how many ZONE_DMA pages are dirty.

So we are blind to potentially inbalances in the usage of memory in various
zones.  F.e.  in a NUMA system we cannot tell how many pages are dirty on a
particular node.  If we knew then we could put measures into the VM to balance
the use of memory between different zones and different nodes in a NUMA
system.  For example it would be possible to limit the dirty pages per node so
that fast local memory is kept available even if a process is dirtying huge
amounts of pages.

Another example is zone reclaim.  We do not know how many unmapped pages exist
per zone.  So we just have to try to reclaim.  If it is not working then we
pause and try again later.  It would be better if we knew when it makes sense
to reclaim unmapped pages from a zone.  This patchset allows the determination
of the number of unmapped pages per zone.  We can remove the zone reclaim
interval with the counters introduced here.

Futhermore the ability to have various usage statistics available will allow
the development of new NUMA balancing algorithms that may be able to improve
the decision making in the scheduler of when to move a process to another node
and hopefully will also enable automatic page migration through a user space
program that can analyse the memory load distribution and then rebalance
memory use in order to increase performance.

The counter framework here implements differential counters for each processor
in struct zone.  The differential counters are consolidated when a threshold
is exceeded (like done in the current implementation for nr_pageache), when
slab reaping occurs or when a consolidation function is called.

Consolidation uses atomic operations and accumulates counters per zone in the
zone structure and also globally in the vm_stat array.  VM functions can
access the counts by simply indexing a global or zone specific array.

The arrangement of counters in an array also simplifies processing when output
has to be generated for /proc/*.

Counters can be updated by calling inc/dec_zone_page_state or
_inc/dec_zone_page_state analogous to *_page_state.  The second group of
functions can be called if it is known that interrupts are disabled.

Special optimized increment and decrement functions are provided.  These can
avoid certain checks and use increment or decrement instructions that an
architecture may provide.

We also add a new CONFIG_DMA_IS_NORMAL that signifies that an architecture can
do DMA to all memory and therefore ZONE_NORMAL will not be populated.  This is
only currently set for IA64 SGI SN2 and currently only affects
node_page_state().  In the best case node_page_state can be reduced to
retrieving a single counter for the one zone on the node.

[akpm@osdl.org: cleanups]
[akpm@osdl.org: export vm_stat[] for filesystems]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-30 11:25:34 -07:00
Ingo Molnar e7eebaf6a8 [PATCH] pi-futex: rt mutex debug
Runtime debugging functionality for rt-mutexes.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:47 -07:00
Ingo Molnar f9b8404cf8 [PATCH] pi-futex: introduce debug_check_no_locks_freed()
Add debug_check_no_locks_freed(), as a central inline to add
bad-lock-free-debugging functionality to.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:46 -07:00
Chandra Seetharaman 74b85f3790 [PATCH] cpu hotplug: make cpu_notifier related notifier blocks __cpuinit only
Make notifier_blocks associated with cpu_notifier as __cpuinitdata.

__cpuinitdata makes sure that the data is init time only unless
CONFIG_HOTPLUG_CPU is defined.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:41 -07:00
Chandra Seetharaman 9c7b216d23 [PATCH] cpu hotplug: revert init patch submitted for 2.6.17
In 2.6.17, there was a problem with cpu_notifiers and XFS.  I provided a
band-aid solution to solve that problem.  In the process, i undid all the
changes you both were making to ensure that these notifiers were available
only at init time (unless CONFIG_HOTPLUG_CPU is defined).

We deferred the real fix to 2.6.18.  Here is a set of patches that fixes the
XFS problem cleanly and makes the cpu notifiers available only at init time
(unless CONFIG_HOTPLUG_CPU is defined).

If CONFIG_HOTPLUG_CPU is defined then cpu notifiers are available at run
time.

This patch reverts the notifier_call changes made in 2.6.17

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:40 -07:00
Randy Dunlap c9cf55285e [PATCH] add poison.h and patch primary users
Localize poison values into one header file for better documentation and
easier/quicker debugging and so that the same values won't be used for
multiple purposes.

Use these constants in core arch., mm, driver, and fs code.

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Acked-by: Matt Mackall <mpm@selenic.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:38 -07:00