Commit Graph

1788 Commits

Author SHA1 Message Date
Peter Zijlstra 23c1fb5296 mm: fixup /proc/vmstat output
Line up the vmstat_text with zone_stat_item

enum zone_stat_item {
	/* First 128 byte cacheline (assuming 64 bit words) */
	NR_FREE_PAGES,
	NR_INACTIVE,
	NR_ACTIVE,

We current have nr_active and nr_inactive reversed.

[ "OK with patch, though using initializers canbe handy to prevent such
   things in future:

	static const char * const vmstat_text[] = {
		[NR_FREE_PAGES] = "nr_free_pages",
		..."
							 - Alexey ]

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-06 10:26:50 -07:00
David Woodhouse 87a927c715 Fix slab redzone alignment
Commit b46b8f19c9 fixed a couple of bugs
by switching the redzone to 64 bits. Unfortunately, it neglected to
ensure that the _second_ redzone, after the slab object, is aligned
correctly. This caused illegal instruction faults on sparc32, which for
some reason not entirely clear to me are not trapped and fixed up.

Two things need to be done to fix this:
  - increase the object size, rounding up to alignof(long long) so
    that the second redzone can be aligned correctly.
  - If SLAB_STORE_USER is set but alignof(long long)==8, allow a
    full 64 bits of space for the user word at the end of the buffer,
    even though we may not _use_ the whole 64 bits.

This patch should be a no-op on any 64-bit architecture or any 32-bit
architecture where alignof(long long) == 4. Of the others, it's tested
on ppc32 by myself and a very similar patch was tested on sparc32 by
Mark Fortescue, who reported the new problem.

Also, fix the conditions for FORCED_DEBUG, which hadn't been adjusted to
the new sizes. Again noticed by Mark.

Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-05 15:54:13 -07:00
Christoph Lameter dbc55faa64 SLUB: Make lockdep happy by not calling add_partial with interrupts enabled during bootstrap
If we move the local_irq_enable() to the end of the function then
add_partial() in early_kmem_cache_node_alloc() will be called
with interrupts disabled like during regular operations.

This makes lockdep happy.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Tested-by: Andre Noll <maan@systemlinux.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-03 13:56:13 -07:00
Christoph Lameter 17022220dd SLAB: remove WARN_ON_ONCE for zero sized objects for 2.6.22 release
We agreed to remove the WARN_ON_ONCE before 2.6.22 is released.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-01 12:29:43 -07:00
Hugh Dickins 30acbabae3 mm: kill validate_anon_vma to avoid mapcount BUG
validate_anon_vma gave a useful check on the integrity of the anon_vma list
when Andrea was developing obj rmap; but it was not enabled in SLES9
itself, nor in mainline, until Nick changed commented-out RMAP_DEBUG to
configurable CONFIG_DEBUG_VM in 2.6.17.  Now Petr Vandrovec reports that
its BUG_ON(mapcount > 100000) can easily crash a CONFIG_DEBUG_VM=y system.

That limit was just an arbitrary number to protect against an infinite
loop.  We could raise it to something enormous (depending on sizeof struct
vma and size of memory?); but I rather think validate_anon_vma has outlived
its usefulness, and is better just removed - which gives a magnificent
performance boost to anything like Petr's test program ;)

Of course, a very long anon_vma list is bad news for preemption latency,
and I believe there has been one recent report of such: let's not forget
that, but validate_anon_vma only makes it worse not better.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: Petr Vandrovec <petr@vmware.com>
Acked-by: Nick Piggin <npiggin@suse.de>
Cc: Andrea Arcangeli <andrea@suse.de>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-28 11:34:53 -07:00
Christoph Lameter 8496634302 SLUB: fix behavior if the text output of list_locations overflows PAGE_SIZE
If slabs are allocated or freed from a large set of call sites (typical for
the kmalloc area) then we may create more output than fits into a single
PAGE and sysfs only gives us one page.  The output should be truncated.
This patch fixes the checks to do the truncation properly.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-24 08:59:11 -07:00
Helge Deller 06b32f3ab6 [PARISC] Handle wrapping in expand_upwards()
Function expand_upwards() did not guarded against wrapping
around to address 0. This fixes the adjtimex02 testcase from
the Linux Test Project on a 32bit PARISC kernel.

[expand_upwards is only used on parisc and ia64; it looks like it does
 the right thing on both. --kyle]

Signed-off-by: Helge Deller <deller@gmx.de>
Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Kyle McMartin <kyle@parisc-linux.org>
2007-06-21 17:46:20 -04:00
Christoph Lameter 4b356be019 SLUB: minimum alignment fixes
If ARCH_KMALLOC_MINALIGN is set to a value greater than 8 (SLUBs smallest
kmalloc cache) then SLUB may generate duplicate slabs in sysfs (yes again)
because the object size is padded to reach ARCH_KMALLOC_MINALIGN.  Thus the
size of the small slabs is all the same.

No arch sets ARCH_KMALLOC_MINALIGN larger than 8 though except mips which
for some reason wants a 128 byte alignment.

This patch increases the size of the smallest cache if
ARCH_KMALLOC_MINALIGN is greater than 8.  In that case more and more of the
smallest caches are disabled.

If we do that then the count of the active general caches that is displayed
on boot is not correct anymore since we may skip elements of the kmalloc
array.  So count them separately.

This approach was tested by Havard yesterday.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Haavard Skinnemoen <hskinnemoen@atmel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-16 13:16:16 -07:00
Benjamin Herrenschmidt 8dab5241d0 Rework ptep_set_access_flags and fix sun4c
Some changes done a while ago to avoid pounding on ptep_set_access_flags and
update_mmu_cache in some race situations break sun4c which requires
update_mmu_cache() to always be called on minor faults.

This patch reworks ptep_set_access_flags() semantics, implementations and
callers so that it's now responsible for returning whether an update is
necessary or not (basically whether the PTE actually changed).  This allow
fixing the sparc implementation to always return 1 on sun4c.

[akpm@linux-foundation.org: fixes, cleanups]
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: David Miller <davem@davemloft.net>
Cc: Mark Fortescue <mark@mtfhpc.demon.co.uk>
Acked-by: William Lee Irwin III <wli@holomorphy.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-16 13:16:16 -07:00
Christoph Lameter dd08c40e3e SLUB slab validation: Alloc while interrupts are disabled must use GFP_ATOMIC
The data structure to manage the information gathered about functions
allocating and freeing objects is allocated when the list_lock has already
been taken.  We need to allocate with GFP_ATOMIC instead of GFP_KERNEL.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-16 13:16:15 -07:00
Paul Mundt d09c6b8094 mm: Fix memory/cpu hotplug section mismatch and oops.
When building with memory hotplug enabled and cpu hotplug disabled, we
end up with the following section mismatch:

WARNING: mm/built-in.o(.text+0x4e58): Section mismatch: reference to
.init.text: (between 'free_area_init_node' and '__build_all_zonelists')

This happens as a result of:

        -> free_area_init_node()
          -> free_area_init_core()
            -> zone_pcp_init() <-- all __meminit up to this point
              -> zone_batchsize() <-- marked as __cpuinit                     fo

This happens because CONFIG_HOTPLUG_CPU=n sets __cpuinit to __init, but
CONFIG_MEMORY_HOTPLUG=y unsets __meminit.

Changing zone_batchsize() to __devinit fixes this.

__devinit is the only thing that is common between CONFIG_HOTPLUG_CPU=y and
CONFIG_MEMORY_HOTPLUG=y. In the long run, perhaps this should be moved to
another section identifier completely. Without this, memory hot-add
of offline nodes (via hotadd_new_pgdat()) will oops if CPU hotplug is
not also enabled.

Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Acked-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

--

 mm/page_alloc.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
2007-06-15 16:18:08 -07:00
Benjamin Herrenschmidt c19c03fc74 [POWERPC] unmap_vm_area becomes unmap_kernel_range for the public
This makes unmap_vm_area static and a wrapper around a new
exported unmap_kernel_range that takes an explicit range instead
of a vm_area struct.

This makes it more versatile for code that wants to play with kernel
page tables outside of the standard vmalloc area.

(One example is some rework of the PowerPC PCI IO space mapping
code that depends on that patch and removes some code duplication
and horrible abuse of forged struct vm_struct).

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2007-06-14 22:29:56 +10:00
Stephen Rothwell 193faea928 Move three functions that are only needed for CONFIG_MEMORY_HOTPLUG
into the appropriate #ifdef.

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-08 17:23:33 -07:00
Christoph Lameter 272c1d21d6 SLUB: return ZERO_SIZE_PTR for kmalloc(0)
Instead of returning the smallest available object return ZERO_SIZE_PTR.

A ZERO_SIZE_PTR can be legitimately used as an object pointer as long as it
is not deferenced.  The dereference of ZERO_SIZE_PTR causes a distinctive
fault.  kfree can handle a ZERO_SIZE_PTR in the same way as NULL.

This enables functions to use zero sized object. e.g. n = number of objects.

	objects = kmalloc(n * sizeof(object));

	for (i = 0; i < n; i++)
		objects[i].x = y;

	kfree(objects);

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-08 17:23:33 -07:00
Christoph Lameter 3cdc0ed0ce slab: fix alien cache handling
cache_free_alien must be called regardless if we use alien caches or not.
cache_free_alien() will do the right thing if there are no alien caches
available.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Acked-by: Pekka J Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-08 17:23:32 -07:00
Hugh Dickins a210906c1b mount -t tmpfs -o mpol=: check nodes online
Randy Dunlap reports that a tmpfs, mounted with NUMA mpol= specifying an
offline node, crashes as soon as data is allocated upon it.  Now restrict it
to online nodes, where before it restricted to MAX_NUMNODES.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: Robin Holt <holt@sgi.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Andi Kleen <ak@suse.de>
Tested-and-acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-08 17:23:32 -07:00
Paul Mundt 33d63bd83b sh: memory hot-add for sparsemem users support.
This enables simple hotplug support for sparsemem users. Presently
this only permits memory being added in to node 0 on ZONE_NORMAL.

Signed-off-by: Paul Mundt <lethal@linux-sh.org>
2007-06-08 02:43:51 +00:00
Christoph Lameter 27390bc335 SLUB: fix locking for hotplug callbacks
Hotplug callbacks are performed with interrupts enabled.  Slub requires
interrupts to be disabled for flushing caches.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Michal Piotrowski <michal.k.k.piotrowski@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-01 08:18:30 -07:00
Yasunori Goto 13466c8419 memory hotplug: fix unnecessary calling of init_currenty_empty_zone()
zone->present_pages is updated in online_pages().  But, __add_zone() can be
called twice or more before calling online_pages().  So,
init_currenty_empty_zone() can be called unnecessary times.  It is cause of
memory leak of zone's wait_table.

Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-01 08:18:29 -07:00
Zou Nan hai 2e1c49db4c x86_64: allocate sparsemem memmap above 4G
On systems with huge amount of physical memory, VFS cache and memory memmap
may eat all available system memory under 4G, then the system may fail to
allocate swiotlb bounce buffer.

There was a fix for this issue in arch/x86_64/mm/numa.c, but that fix dose
not cover sparsemem model.

This patch add fix to sparsemem model by first try to allocate memmap above
4G.

Signed-off-by: Zou Nan hai <nanhai.zou@intel.com>
Acked-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Andi Kleen <ak@suse.de>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-01 08:18:27 -07:00
Roman Zippel 12d810c1b8 m68k: discontinuous memory support
Fix support for discontinuous memory

Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-31 07:58:14 -07:00
Christoph Lameter 8ffa68755a SLUB: Fix NUMA / SYSFS bootstrap issue
We need this patch in ASAP.  Patch fixes the mysterious hang that remained
on some particular configurations with lockdep on after the first fix that
moved the #idef CONFIG_SLUB_DEBUG to the right location.  See
http://marc.info/?t=117963072300001&r=1&w=2

The kmem_cache_node cache is very special because it is needed for NUMA
bootstrap.  Under certain conditions (like for example if lockdep is
enabled and significantly increases the size of spinlock_t) the structure
may become exactly the size as one of the larger caches in the kmalloc
array.

That early during bootstrap we cannot perform merging properly.  The unique
id for the kmem_cache_node cache will match one of the kmalloc array.
Sysfs will complain about a duplicate directory entry.  All of this occurs
while the console is not yet fully operational.  Thus boot may appear to be
silently failing.

The kmem_cache_node cache is very special.  During early boostrap the main
allocation function is not operational yet and so we have to run our own
small special alloc function during early boot.  It is also special in that
it is never freed.

We really do not want any merging on that cache.  Set the refcount -1 and
forbid merging of slabs that have a negative refcount.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-31 07:58:14 -07:00
Christoph Lameter 33e9e24101 SLUB Debug: fix check for super sized slabs (>512k 64bit, >256k 32bit)
The check for super sized slabs where we can no longer move the free
pointer behind the object for debugging purposes etc is accessing a
field that is not setup yet.  We must use objsize here since the size of
the slab has not been determined yet.

The effect of this is that a global slab shrink via "slabinfo -s" will
show errors about offsets being wrong if booted with slub_debug.
Potentially there are other troubles with huge slabs under slub_debug
because the calculated free pointer offset is truncated.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-23 20:14:13 -07:00
Miklos Szeredi 418508c132 fix unused setup_nr_node_ids
mm/page_alloc.c:931: warning: 'setup_nr_node_ids' defined but not used

This is now the only (!) compiler warning I get in my UML build :)

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-23 20:14:13 -07:00
Christoph Lameter c12b3c6251 SLUB Debug: Fix object size calculation
The object size calculation is wrong if !CONFIG_SLUB_DEBUG because the
#ifdef CONFIG_SLUB_DEBUG is now switching off the size adjustments for
DESTROY_BY_RCU and ctor.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-23 20:14:11 -07:00
Linus Torvalds 080e89270a Merge git://git.kernel.org/pub/scm/linux/kernel/git/sam/kbuild-fix
* git://git.kernel.org/pub/scm/linux/kernel/git/sam/kbuild-fix:
  mm/slab: fix section mismatch warning
  mm: fix section mismatch warnings
  init/main: use __init_refok to fix section mismatch
  kbuild: introduce __init_refok/__initdata_refok to supress section mismatch warnings
  all-archs: consolidate .data section definition in asm-generic
  all-archs: consolidate .text section definition in asm-generic
  kbuild: add "Section mismatch" warning whitelist for powerpc
  kbuild: make better section mismatch reports on i386, arm and mips
  kbuild: make modpost section warnings clearer
  kconfig: search harder for curses library in check-lxdialog.sh
  kbuild: include limits.h in sumversion.c for PATH_MAX
  powerpc: Fix the MODALIAS generation in modpost for of devices
2007-05-21 12:03:04 -07:00
Alexey Dobriyan e8edc6e03a Detach sched.h from mm.h
First thing mm.h does is including sched.h solely for can_do_mlock() inline
function which has "current" dereference inside. By dealing with can_do_mlock()
mm.h can be detached from sched.h which is good. See below, why.

This patch
a) removes unconditional inclusion of sched.h from mm.h
b) makes can_do_mlock() normal function in mm/mlock.c
c) exports can_do_mlock() to not break compilation
d) adds sched.h inclusions back to files that were getting it indirectly.
e) adds less bloated headers to some files (asm/signal.h, jiffies.h) that were
   getting them indirectly

Net result is:
a) mm.h users would get less code to open, read, preprocess, parse, ... if
   they don't need sched.h
b) sched.h stops being dependency for significant number of files:
   on x86_64 allmodconfig touching sched.h results in recompile of 4083 files,
   after patch it's only 3744 (-8.3%).

Cross-compile tested on

	all arm defconfigs, all mips defconfigs, all powerpc defconfigs,
	alpha alpha-up
	arm
	i386 i386-up i386-defconfig i386-allnoconfig
	ia64 ia64-up
	m68k
	mips
	parisc parisc-up
	powerpc powerpc-up
	s390 s390-up
	sparc sparc-up
	sparc64 sparc64-up
	um-x86_64
	x86_64 x86_64-up x86_64-defconfig x86_64-allnoconfig

as well as my two usual configs.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-21 09:18:19 -07:00
Sam Ravnborg 38bdc32af4 mm/slab: fix section mismatch warning
Use the new __init_refok marker to avoid the
section mismatch warning from slab.c

Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
2007-05-19 09:11:58 +02:00
Sam Ravnborg 577a32f620 mm: fix section mismatch warnings
modpost had two cases hardcoded for mm/
Shift over to __init_refok and kill the
hardcoded function names in modpost.

This has the drawback that the functions
will always be kept no matter configuration.
With previous code the function were placed in
init section if configuration allowed it.

Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
2007-05-19 09:11:58 +02:00
Nick Piggin c97a9e10ea mm: more rmap checking
Re-introduce rmap verification patches that Hugh removed when he removed
PG_map_lock. PG_map_lock actually isn't needed to synchronise access to
anonymous pages, because PG_locked and PTL together already do.

These checks were important in discovering and fixing a rare rmap corruption
in SLES9.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-17 05:23:06 -07:00
Benjamin Herrenschmidt d55e2ca873 Make __vunmap static
__vunmap doesn't seem to be used outside of mm/vmalloc.c, and has
no prototype in any header so let's make it static

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-17 05:23:04 -07:00
Christoph Lameter 0aa817f078 Slab allocators: define common size limitations
Currently we have a maze of configuration variables that determine the
maximum slab size.  Worst of all it seems to vary between SLAB and SLUB.

So define a common maximum size for kmalloc.  For conveniences sake we use
the maximum size ever supported which is 32 MB.  We limit the maximum size
to a lower limit if MAX_ORDER does not allow such large allocations.

For many architectures this patch will have the effect of adding large
kmalloc sizes.  x86_64 adds 5 new kmalloc sizes.  So a small amount of
memory will be needed for these caches (contemporary SLAB has dynamically
sizeable node and cpu structure so the waste is less than in the past)

Most architectures will then be able to allocate object with sizes up to
MAX_ORDER.  We have had repeated breakage (in fact whenever we doubled the
number of supported processors) on IA64 because one or the other struct
grew beyond what the slab allocators supported.  This will avoid future
issues and f.e.  avoid fixes for 2k and 4k cpu support.

CONFIG_LARGE_ALLOCS is no longer necessary so drop it.

It fixes sparc64 with SLAB.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-17 05:23:04 -07:00
Christoph Lameter 3ec0974210 SLUB: Simplify debug code
Consolidate functionality into the #ifdef section.

Extract tracing into one subroutine.

Move object debug processing into the #ifdef section so that the
code in __slab_alloc and __slab_free becomes minimal.

Reduce number of functions we need to provide stubs for in the !SLUB_DEBUG case.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-17 05:23:04 -07:00
Christoph Lameter a35afb830f Remove SLAB_CTOR_CONSTRUCTOR
SLAB_CTOR_CONSTRUCTOR is always specified. No point in checking it.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Steven French <sfrench@us.ibm.com>
Cc: Michael Halcrow <mhalcrow@us.ibm.com>
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Dave Kleikamp <shaggy@austin.ibm.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Anton Altaparmakov <aia21@cantab.net>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jan Kara <jack@ucw.cz>
Cc: David Chinner <dgc@sgi.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-17 05:23:04 -07:00
Christoph Lameter 5577bd8a85 SLUB: Do our own flags based on PG_active and PG_error
The atomicity when handling flags in SLUB is not necessary since both flags
used by SLUB are not updated in a racy way.  Flag updates are either done
during slab creation or destruction or under slab_lock.  Some of these flags
do not have the non atomic variants that we need.  So define our own.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-17 05:23:03 -07:00
Christoph Lameter 0b44f7a5b5 slab: warn on zero-length allocations
slub warns on this, and we're working on making kmalloc(0) return NULL.
Let's make slab warn as well so our testers detect such callers more
rapidly.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-17 05:23:03 -07:00
Christoph Lameter 4b6f075045 SLUB: Define functions for cpu slab handling instead of using PageActive
Use inline functions to access the per cpu bit.  Intoduce the notion of
"freezing" a slab to make things more understandable.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-17 05:23:03 -07:00
Christoph Lameter c59def9f22 Slab allocators: Drop support for destructors
There is no user of destructors left.  There is no reason why we should keep
checking for destructors calls in the slab allocators.

The RFC for this patch was discussed at
http://marc.info/?l=linux-kernel&m=117882364330705&w=2

Destructors were mainly used for list management which required them to take a
spinlock.  Taking a spinlock in a destructor is a bit risky since the slab
allocators may run the destructors anytime they decide a slab is no longer
needed.

Patch drops destructor support.  Any attempt to use a destructor will BUG().

Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-17 05:23:03 -07:00
Nick Piggin afc0cedbe9 slob: implement RCU freeing
The SLOB allocator should implement SLAB_DESTROY_BY_RCU correctly, because
even on UP, RCU freeing semantics are not equivalent to simply freeing
immediately.  This also allows SLOB to be used on SMP.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-17 05:23:02 -07:00
Christoph Lameter 43c0f3d25c Fix: find_or_create_page skips cpuset memory spreading.
We call alloc_page where we should be calling __page_cache_alloc.

__page_cache_alloc performs cpuset memory spreading.  alloc_page does not.
There is no reason that pages allocated via find_or_create should be
exempt.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-16 21:19:15 -07:00
Hugh Dickins 1800782016 slub: don't confuse ctor and dtor
kmem_cache_create() was swapping ctor and dtor in calling find_mergeable():
though it caused no bug, and probably never would, even if destructors are
retained; but fix it so as not to generate anxiety ;)

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-16 21:19:15 -07:00
Paul Mundt 6c645ac725 sh64: generic quicklist support.
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
2007-05-14 09:55:35 +09:00
Miklos Szeredi 0ea9718016 consolidate generic_writepages and mpage_writepages
Clean up massive code duplication between mpage_writepages() and
generic_writepages().

The new generic function, write_cache_pages() takes a function pointer
argument, which will be called for each page to be written.

Maybe cifs_writepages() too can use this infrastructure, but I'm not
touching that with a ten-foot pole.

The upcoming page writeback support in fuse will also want this.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Acked-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-11 08:29:35 -07:00
Christoph Lameter 39bf6270f5 VM statistics: Make timer deferrable
VM statistics updates do not matter if the kernel is in idle powersaving
mode.  So allow the timer to be deferred.

It would be better though if we could switch the timer between deferrable
and nondeferrable based on differentials present.  The timer would start
out nondeferrable and if we find that there were no updates in the last
statistics interval then we would switch the timer to deferrable.  If the
timer later finds again that there are differentials then go to
nondeferrable again.

And yet another way would be to run the timer shortly before going to idle?

The solution here means that the VM counters may be slightly off during
idle since differentials may be still pending while the timer is deferred.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-11 08:29:32 -07:00
Mika Kukkonen 7faaa5f0bf Bug in mm/thrash.c function grab_swap_token()
Following bug was uncovered by compiling with '-W' flag:

  CC      mm/thrash.o
mm/thrash.c: In function ‘grab_swap_token’:
mm/thrash.c:52: warning: comparison of unsigned expression < 0 is always false

Variable token_priority is unsigned, so decrementing first and then
checking the result does not work; fixed by reversing the test, patch
attached (compile tested only).

I am not sure if likely() makes much sense in this new situation, but
I'll let somebody else to make a decision on that.

Signed-off-by: Mika Kukkonen <mikukkon@iki.fi>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-11 08:29:32 -07:00
Christoph Lameter bcf889f965 SLUB: remove nr_cpu_ids hack
This was in SLUB in order to head off trouble while the nr_cpu_ids
functionality was not merged.  Its merged now so no need to still have this.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-10 09:26:53 -07:00
Stephen Rothwell 6f076f5dd9 early_pfn_to_nid needs to be __meminit
Since it is referenced by memmap_init_zone (which is __meminit) via the
early_pfn_in_nid macro when CONFIG_NODES_SPAN_OTHER_NODES is set (which
basically means PowerPC 64).

This removes a section mismatch warning in those circumstances.

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-10 09:26:52 -07:00
Christoph Lameter 894b8788d7 slub: support concurrent local and remote frees and allocs on a slab
Avoid atomic overhead in slab_alloc and slab_free

SLUB needs to use the slab_lock for the per cpu slabs to synchronize with
potential kfree operations.  This patch avoids that need by moving all free
objects onto a lockless_freelist.  The regular freelist continues to exist
and will be used to free objects.  So while we consume the
lockless_freelist the regular freelist may build up objects.

If we are out of objects on the lockless_freelist then we may check the
regular freelist.  If it has objects then we move those over to the
lockless_freelist and do this again.  There is a significant savings in
terms of atomic operations that have to be performed.

We can even free directly to the lockless_freelist if we know that we are
running on the same processor.  So this speeds up short lived objects.
They may be allocated and freed without taking the slab_lock.  This is
particular good for netperf.

In order to maximize the effect of the new faster hotpath we extract the
hottest performance pieces into inlined functions.  These are then inlined
into kmem_cache_alloc and kmem_cache_free.  So hotpath allocation and
freeing no longer requires a subroutine call within SLUB.

[I am not sure that it is worth doing this because it changes the easy to
read structure of slub just to reduce atomic ops.  However, there is
someone out there with a benchmark on 4 way and 8 way processor systems
that seems to show a 5% regression vs.  Slab.  Seems that the regression is
due to increased atomic operations use vs.  SLAB in SLUB).  I wonder if
this is applicable or discernable at all in a real workload?]

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-10 09:26:52 -07:00
Linus Torvalds d84c4124c4 Merge master.kernel.org:/pub/scm/linux/kernel/git/lethal/sh-2.6
* master.kernel.org:/pub/scm/linux/kernel/git/lethal/sh-2.6:
  sh: Fix stacktrace simplification fallout.
  sh: SH7760 DMABRG support.
  sh: clockevent/clocksource/hrtimers/nohz TMU support.
  sh: Truncate MAX_ACTIVE_REGIONS for the common case.
  rtc: rtc-sh: Fix rtc_dev pointer for rtc_update_irq().
  sh: Convert to common die chain.
  sh: Wire up utimensat syscall.
  sh: landisk mv_nr_irqs definition.
  sh: Fixup ndelay() xloops calculation for alternate HZ.
  sh: Add 32-bit opcode feature CPU flag.
  sh: Fix PC adjustments for varying opcode length.
  sh: Support for SH-2A 32-bit opcodes.
  sh: Kill off redundant __div64_32 symbol export.
  sh: Share exception vector table for SH-3/4.
  sh: Always define TRAPA_BUG_OPCODE.
  sh: __GFP_REPEAT for pte allocations, too.
  rtc: rtc-sh: Fix up dev_dbg() warnings.
  sh: generic quicklist support.
2007-05-09 13:08:20 -07:00
David Howells c855ff3718 Fix a bad error case handling in read_cache_page_async()
Commit 6fe6900e1e introduced a nasty bug
in read_cache_page_async().

It added a "mark_page_accessed(page)" at the final return path in
read_cache_page_async().  But in error cases, 'page' holds the error
code, and you can't mark it accessed.

[ and Glauber de Oliveira Costa points out that we can use a return
  instead of adding more goto's ]

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 13:04:03 -07:00
Linus Torvalds 9a9136e270 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial
* git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial: (25 commits)
  sound: convert "sound" subdirectory to UTF-8
  MAINTAINERS: Add cxacru website/mailing list
  include files: convert "include" subdirectory to UTF-8
  general: convert "kernel" subdirectory to UTF-8
  documentation: convert the Documentation directory to UTF-8
  Convert the toplevel files CREDITS and MAINTAINERS to UTF-8.
  remove broken URLs from net drivers' output
  Magic number prefix consistency change to Documentation/magic-number.txt
  trivial: s/i_sem /i_mutex/
  fix file specification in comments
  drivers/base/platform.c: fix small typo in doc
  misc doc and kconfig typos
  Remove obsolete fat_cvf help text
  Fix occurrences of "the the "
  Fix minor typoes in kernel/module.c
  Kconfig: Remove reference to external mqueue library
  Kconfig: A couple of grammatical fixes in arch/i386/Kconfig
  Correct comments in genrtc.c to refer to correct /proc file.
  Fix more "deprecated" spellos.
  Fix "deprecated" typoes.
  ...

Fix trivial comment conflict in kernel/relay.c.
2007-05-09 12:54:17 -07:00
Christoph Lameter 4037d45220 Move remote node draining out of slab allocators
Currently the slab allocators contain callbacks into the page allocator to
perform the draining of pagesets on remote nodes.  This requires SLUB to have
a whole subsystem in order to be compatible with SLAB.  Moving node draining
out of the slab allocators avoids a section of code in SLUB.

Move the node draining so that is is done when the vm statistics are updated.
At that point we are already touching all the cachelines with the pagesets of
a processor.

Add a expire counter there.  If we have to update per zone or global vm
statistics then assume that the pageset will require subsequent draining.

The expire counter will be decremented on each vm stats update pass until it
reaches zero.  Then we will drain one batch from the pageset.  The draining
will cause vm counter updates which will then cause another expiration until
the pcp is empty.  So we will drain a batch every 3 seconds.

Note that remote node draining is a somewhat esoteric feature that is required
on large NUMA systems because otherwise significant portions of system memory
can become trapped in pcp queues.  The number of pcp is determined by the
number of processors and nodes in a system.  A system with 4 processors and 2
nodes has 8 pcps which is okay.  But a system with 1024 processors and 512
nodes has 512k pcps with a high potential for large amount of memory being
caught in them.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:56 -07:00
Christoph Lameter 77461ab332 Make vm statistics update interval configurable
Make it configurable.  Code in mm makes the vm statistics intervals
independent from the cache reaper use that opportunity to make it
configurable.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:56 -07:00
Christoph Lameter d1187ed210 vmstat: use our own timer events
vmstat is currently using the cache reaper to periodically bring the
statistics up to date.  The cache reaper does only exists in SLUB as a way to
provide compatibility with SLAB.  This patch removes the vmstat calls from the
slab allocators and provides its own handling.

The advantage is also that we can use a different frequency for the updates.
Refreshing vm stats is a pretty fast job so we can run this every second and
stagger this by only one tick.  This will lead to some overlap in large
systems.  F.e a system running at 250 HZ with 1024 processors will have 4 vm
updates occurring at once.

However, the vm stats update only accesses per node information.  It is only
necessary to stagger the vm statistics updates per processor in each node.  Vm
counter updates occurring on distant nodes will not cause cacheline
contention.

We could implement an alternate approach that runs the first processor on each
node at the second and then each of the other processor on a node on a
subsequent tick.  That may be useful to keep a large amount of the second free
of timer activity.  Maybe the timer folks will have some feedback on this one?

[jirislaby@gmail.com: add missing break]
Cc: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:56 -07:00
Rafael J. Wysocki 8bb7844286 Add suspend-related notifications for CPU hotplug
Since nonboot CPUs are now disabled after tasks and devices have been
frozen and the CPU hotplug infrastructure is used for this purpose, we need
special CPU hotplug notifications that will help the CPU-hotplug-aware
subsystems distinguish normal CPU hotplug events from CPU hotplug events
related to a system-wide suspend or resume operation in progress.  This
patch introduces such notifications and causes them to be used during
suspend and resume transitions.  It also changes all of the
CPU-hotplug-aware subsystems to take these notifications into consideration
(for now they are handled in the same way as the corresponding "normal"
ones).

[oleg@tv-sign.ru: cleanups]
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:56 -07:00
Nate Diller 01f2705daf fs: convert core functions to zero_user_page
It's very common for file systems to need to zero part or all of a page,
the simplist way is just to use kmap_atomic() and memset().  There's
actually a library function in include/linux/highmem.h that does exactly
that, but it's confusingly named memclear_highpage_flush(), which is
descriptive of *how* it does the work rather than what the *purpose* is.
So this patchset renames the function to zero_user_page(), and calls it
from the various places that currently open code it.

This first patch introduces the new function call, and converts all the
core kernel callsites, both the open-coded ones and the old
memclear_highpage_flush() ones.  Following this patch is a series of
conversions for each file system individually, per AKPM, and finally a
patch deprecating the old call.  The diffstat below shows the entire
patchset.

[akpm@linux-foundation.org: fix a few things]
Signed-off-by: Nate Diller <nate.diller@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:55 -07:00
Christoph Lameter 5830c59021 slab: shut down cache_reaper when cpu goes down
Shutdown the cache_reaper if the cpu is brought down and set the
cache_reap.func to NULL.  Otherwise hotplug shuts down the reaper for good.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:53 -07:00
Heiko Carstens 38c3bd96a0 slab: use CPU_LOCK_[ACQUIRE|RELEASE]
Looks like this was forgotten when CPU_LOCK_[ACQUIRE|RELEASE] was
introduced.

Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Gautham Shenoy <ego@in.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:51 -07:00
David Howells ef71c15c46 AFS: export a couple of core functions for AFS write support
Export a couple of core functions for AFS write support to use:

	find_get_pages_contig()
	find_get_pages_tag()

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:50 -07:00
Ken Chen 8a63011275 pretend cpuset has some form of hugetlb page reservation
When cpuset is configured, it breaks the strict hugetlb page reservation as
the accounting is done on a global variable.  Such reservation is
completely rubbish in the presence of cpuset because the reservation is not
checked against page availability for the current cpuset.  Application can
still potentially OOM'ed by kernel with lack of free htlb page in cpuset
that the task is in.  Attempt to enforce strict accounting with cpuset is
almost impossible (or too ugly) because cpuset is too fluid that task or
memory node can be dynamically moved between cpusets.

The change of semantics for shared hugetlb mapping with cpuset is
undesirable.  However, in order to preserve some of the semantics, we fall
back to check against current free page availability as a best attempt and
hopefully to minimize the impact of changing semantics that cpuset has on
hugetlb.

Signed-off-by: Ken Chen <kenchen@google.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:49 -07:00
Ken Chen ace4bd29c2 fix leaky resv_huge_pages when cpuset is in use
The internal hugetlb resv_huge_pages variable can permanently leak nonzero
value in the error path of hugetlb page fault handler when hugetlb page is
used in combination of cpuset.  The leaked count can permanently trap N
number of hugetlb pages in unusable "reserved" state.

Steps to reproduce the bug:

  (1) create two cpuset, user1 and user2
  (2) reserve 50 htlb pages in cpuset user1
  (3) attempt to shmget/shmat 50 htlb page inside cpuset user2
  (4) kernel oom the user process in step 3
  (5) ipcrm the shm segment

At this point resv_huge_pages will have a count of 49, even though
there are no active hugetlbfs file nor hugetlb shared memory segment
in the system.  The leak is permanent and there is no recovery method
other than system reboot. The leaked count will hold up all future use
of that many htlb pages in all cpusets.

The culprit is that the error path of alloc_huge_page() did not
properly undo the change it made to resv_huge_page, causing
inconsistent state.

Signed-off-by: Ken Chen <kenchen@google.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Adam Litke <agl@us.ibm.com>
Cc: Martin Bligh <mbligh@google.com>
Acked-by: David Gibson <dwg@au1.ibm.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:48 -07:00
Pekka J Enberg 7ae439ce0c krealloc: fix kerneldoc comments
No "blank" (or "*") line is allowed between the function name and lines for
it parameter(s).

Cc: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:46 -07:00
Christoph Lameter 5e6d444ea1 SLUB: rework slab order determination
In some cases SLUB is creating uselessly slabs that are larger than
slub_max_order. Also the layout of some of the slabs was not satisfactory.

Go to an iterarive approach.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:46 -07:00
Christoph Lameter 45edfa580b SLUB: include lifetime stats and sets of cpus / nodes in tracking output
We have information about how long an object existed and about the nodes and
cpus where the allocations and frees took place.  Add that information to the
tracking output in /sys/slab/xx/alloc_calls and /sys/slab/free_calls

This will then enable slabinfo to output nice reports like this:

  christoph@qirst:~/slub$ ./slabinfo kmalloc-128

  Slabcache: kmalloc-128           Aliases:  0 Order :  0

  Sizes (bytes)     Slabs              Debug                Memory
  ------------------------------------------------------------------------
  Object :     128  Total  :      12   Sanity Checks : On   Total:   49152
  SlabObj:     200  Full   :       7   Redzoning     : On   Used :   24832
  SlabSiz:    4096  Partial:       4   Poisoning     : On   Loss :   24320
  Loss   :      72  CpuSlab:       1   Tracking      : On   Lalig:   13968
  Align  :       8  Objects:      20   Tracing       : Off  Lpadd:    1152

  kmalloc-128 has no kmem_cache operations

  kmalloc-128: Kernel object allocation
  -----------------------------------------------------------------------
        6 param_sysfs_setup+0x71/0x130 age=284512/284512/284512 pid=1 nodes=0-1,3
       11 percpu_populate+0x39/0x80 age=283914/284428/284512 pid=1 nodes=0
       21 __register_chrdev_region+0x31/0x170 age=282896/284347/284473 pid=1-1705 nodes=0-2
        1 sys_inotify_init+0x76/0x1c0 age=283423 pid=1004 nodes=0
       19 as_get_io_context+0x32/0xd0 age=6/247567/283988 pid=1-11782 nodes=0,2
       10 ida_pre_get+0x4a/0x80 age=277666/283773/284526 pid=0-2177 nodes=0,2
       24 kobject_kset_add_dir+0x37/0xb0 age=282727/283860/284472 pid=1-1723 nodes=0-2
        1 acpi_ds_build_internal_buffer_obj+0xd3/0x11d age=284508 pid=1 nodes=0
       24 con_insert_unipair+0xd7/0x110 age=284438/284438/284438 pid=1 nodes=0,2
        1 uart_open+0x2d2/0x4b0 age=283896 pid=1 nodes=0
       26 dma_pool_create+0x73/0x1a0 age=282762/282833/282916 pid=1705-1723 nodes=0
        1 neigh_table_init_no_netlink+0xd2/0x210 age=284461 pid=1 nodes=0
        2 neigh_parms_alloc+0x2b/0xe0 age=284410/284411/284412 pid=1 nodes=2
        2 neigh_resolve_output+0x1e1/0x280 age=276289/276291/276293 pid=0-2443 nodes=0
        1 netlink_kernel_create+0x90/0x170 age=284472 pid=1 nodes=0
        4 xt_alloc_table_info+0x39/0xf0 age=283958/283958/283959 pid=1 nodes=1
        3 fn_hash_insert+0x473/0x720 age=277653/277661/277666 pid=2177-2185 nodes=0
        1 get_mtrr_state+0x285/0x2a0 age=284526 pid=0 nodes=0
        1 cacheinfo_cpu_callback+0x26d/0x3e0 age=284458 pid=1 nodes=0
       29 kernel_param_sysfs_setup+0x25/0x90 age=284511/284511/284512 pid=1 nodes=0-1,3
        5 process_zones+0x5e/0x170 age=284546/284546/284546 pid=0 nodes=0
        1 drm_core_init+0x48/0x160 age=284421 pid=1 nodes=2

  kmalloc-128: Kernel object freeing
  ------------------------------------------------------------------------
      163 <not-available> age=4295176847 pid=0 nodes=0-3
        1 __vunmap+0x6e/0xf0 age=282907 pid=1723 nodes=0
       28 free_as_io_context+0x12/0x90 age=9243/262197/283474 pid=42-11754 nodes=0
        1 acpi_get_object_info+0x1b7/0x1d4 age=284475 pid=1 nodes=0
        1 do_acpi_find_child+0x45/0x4e age=284475 pid=1 nodes=0

  NUMA nodes           :    0    1    2    3
  ------------------------------------------
  All slabs                 7    2    2    1
  Partial slabs             2    2    0    0

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:46 -07:00
Christoph Lameter 41ecc55b8a SLUB: add CONFIG_SLUB_DEBUG
CONFIG_SLUB_DEBUG can be used to switch off the debugging and sysfs components
of SLUB.  Thus SLUB will be able to replace SLOB.  SLUB can arrange objects in
a denser way than SLOB and the code size should be minimal without debugging
and sysfs support.

Note that CONFIG_SLUB_DEBUG is materially different from CONFIG_SLAB_DEBUG.
CONFIG_SLAB_DEBUG is used to enable slab debugging in SLAB.  SLUB enables
debugging via a boot parameter.  SLUB debug code should always be present.

CONFIG_SLUB_DEBUG can be modified in the embedded config section.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:45 -07:00
Christoph Lameter 02cbc87446 SLUB: move tracking definitions and check_valid_pointer() away from debug code
Move the tracking definitions and the check_valid_pointer() function away from
the debugging related functions.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:45 -07:00
Christoph Lameter 636f0d7de8 SLUB: consolidate trace code
Trace in both slab_alloc and slab_free has a lot of common code.  Use a single
function for both.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:45 -07:00
Christoph Lameter 35e5d7ee27 SLUB: introduce DebugSlab(page)
This replaces the PageError() checking.  DebugSlab is clearer and allows for
future changes to the page bit used.  We also need it to support
CONFIG_SLUB_DEBUG.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:45 -07:00
Christoph Lameter b345970905 SLUB: move resiliency check into SYSFS section
Move the resiliency check into the SYSFS section after validate_slab that is
used by the resiliency check.  This will avoid a forward declaration.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:45 -07:00
Christoph Lameter 7656c72b5a SLUB: add macros for scanning objects in a slab
Scanning of objects happens in a number of functions.  Consolidate that code.
DECLARE_BITMAP instead of coding the declaration for bitmaps.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:45 -07:00
Christoph Lameter 672bba3a4b SLUB: update comments
Update comments throughout SLUB to reflect the new developments.  Fix up
various awkward sentences.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:45 -07:00
Christoph Lameter 26a7bd0302 SLUB: get rid of finish_bootstrap
Its only purpose was to bring some sort of symmetry to sysfs usage when
dealing with bootstrapping per cpu flushing.  Since we do not time out slabs
anymore we have no need to run finish_bootstrap even without sysfs.  Fold it
back into slab_sysfs_init and drop the initcall for the !SYFS case.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:45 -07:00
Christoph Lameter 1f99a283dc SLUB: clean up krealloc
We really do not need all this gaga there.

ksize gives us all the information we need to figure out if the object can
cope with the new size.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:45 -07:00
Christoph Lameter abcd08a6f5 SLUB: use check_valid_pointer in kmem_ptr_validate
We needlessly duplicate code. Also make check_valid_pointer inline.

Signed-off-by: Christoph LAemter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:44 -07:00
Christoph Lameter be7b3fbcef SLUB: after object padding only needed for Redzoning
If no redzoning is selected then we do not need padding before the next
object.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:44 -07:00
Christoph Lameter 65c02d4cfb SLUB: add support for dynamic cacheline size determination
SLUB currently assumes that the cacheline size is static.  However, i386 f.e.
supports dynamic cache line size determination.

Use cache_line_size() instead of L1_CACHE_BYTES in the allocator.

That also explains the purpose of SLAB_HWCACHE_ALIGN.  So we will need to keep
that one around to allow dynamic aligning of objects depending on boot
determination of the cache line size.

[akpm@linux-foundation.org: need to define it before we use it]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:44 -07:00
Michael Opdenacker 59c51591a0 Fix occurrences of "the the "
Signed-off-by: Michael Opdenacker <michael@free-electrons.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
2007-05-09 08:57:56 +02:00
Paul Mundt 5f8c9908f2 sh: generic quicklist support.
This moves SH over to the generic quicklists. As per x86_64,
we have special mappings for the PGDs, so these go on their
own list..

Signed-off-by: Paul Mundt <lethal@linux-sh.org>
2007-05-09 01:35:00 +00:00
Roland McGrath 74add80cbd Remove unused variable in get_unmapped_area
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:35:28 -07:00
Jaya Kumar 60b59beafb fbdev: mm: Deferred IO support
This implements deferred IO support in fbdev.  Deferred IO is a way to delay
and repurpose IO.  This implementation is done using mm's page_mkwrite and
page_mkclean hooks in order to detect, delay and then rewrite IO.  This
functionality is used by hecubafb.

[adaplas]
This is useful for graphics hardware with no directly addressable/mappable
framebuffer. Implementing this will allow the "framebuffer" to be accesible
from user space via mmap().

Signed-off-by: Jaya Kumar <jayakumar.lkml@gmail.com>
Signed-off-by: Antonino Daplas <adaplas@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:15:26 -07:00
Alexey Dobriyan a5c43dae7a Fix race between cat /proc/slab_allocators and rmmod
Same story as with cat /proc/*/wchan race vs rmmod race, only
/proc/slab_allocators want more info than just symbol name.

Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:15:08 -07:00
Mark Fasheh ef51c97623 Remove do_sync_file_range()
Remove do_sync_file_range() and convert callers to just use
do_sync_mapping_range().

Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:15:04 -07:00
Christoph Hellwig 1eeb66a1bb move die notifier handling to common code
This patch moves the die notifier handling to common code.  Previous
various architectures had exactly the same code for it.  Note that the new
code is compiled unconditionally, this should be understood as an appel to
the other architecture maintainer to implement support for it aswell (aka
sprinkling a notify_die or two in the proper place)

arm had a notifiy_die that did something totally different, I renamed it to
arm_notify_die as part of the patch and made it static to the file it's
declared and used at.  avr32 used to pass slightly less information through
this interface and I brought it into line with the other architectures.

[akpm@linux-foundation.org: build fix]
[akpm@linux-foundation.org: fix vmalloc_sync_all bustage]
[bryan.wu@analog.com: fix vmalloc_sync_all in nommu]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: <linux-arch@vger.kernel.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Bryan Wu <bryan.wu@analog.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:15:04 -07:00
Guillaume Chazarain 3e9f45bd18 Factor outstanding I/O error handling
Cleanup: setting an outstanding error on a mapping was open coded too many
times.  Factor it out in mapping_set_error().

Signed-off-by: Guillaume Chazarain <guichaz@yahoo.fr>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:14:57 -07:00
Yasunori Goto 72280ede31 Add white list into modpost.c for memory hotplug code and ia64's machvec section
This patch is add white list into modpost.c for some functions and
ia64's section to fix section mismatchs.

  sparse_index_alloc() and zone_wait_table_init() calls bootmem allocator
  at boot time, and kmalloc/vmalloc at hotplug time. If config
  memory hotplug is on, there are references of bootmem allocater(init text)
  from them (normal text). This is cause of section mismatch.

  Bootmem is called by many functions and it must be
  used only at boot time. I think __init of them should keep for
  section mismatch check. So, I would like to register sparse_index_alloc()
  and zone_wait_table_init() into white list.

  In addition, ia64's .machvec section is function table of some platform
  dependent code. It is mixture of .init.text and normal text. These
  reference of __init functions are valid too.

Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:14:57 -07:00
Yasunori Goto a3142c8e1d Fix section mismatch of memory hotplug related code.
This is to fix many section mismatches of code related to memory hotplug.
I checked compile with memory hotplug on/off on ia64 and x86-64 box.

Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:14:57 -07:00
Dmitriy Monakhov 0ceb331433 mm: move common segment checks to separate helper function
[akpm@linux-foundation.org: cleanup]
Signed-off-by: Monakhov Dmitriy <dmonakhov@openvz.org>
Cc: Christoph Hellwig <hch@lst.de>
Acked-by: Anton Altaparmakov <aia21@cam.ac.uk>
Acked-by: David Chinner <dgc@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:14:57 -07:00
David Woodhouse b46b8f19c9 Increase slab redzone to 64bits
There are two problems with the existing redzone implementation.

Firstly, it's causing misalignment of structures which contain a 64-bit
integer, such as netfilter's 'struct ipt_entry' -- causing netfilter
modules to fail to load because of the misalignment.  (In particular, the
first check in
net/ipv4/netfilter/ip_tables.c::check_entry_size_and_hooks())

On ppc32 and sparc32, amongst others, __alignof__(uint64_t) == 8.

With slab debugging, we use 32-bit redzones. And allocated slab objects
aren't sufficiently aligned to hold a structure containing a uint64_t.

By _just_ setting ARCH_KMALLOC_MINALIGN to __alignof__(u64) we'd disable
redzone checks on those architectures.  By using 64-bit redzones we avoid that
loss of debugging, and also fix the other problem while we're at it.

When investigating this, I noticed that on 64-bit platforms we're using a
32-bit value of RED_ACTIVE/RED_INACTIVE in the 64-bit memory location set
aside for the redzone.  Which means that the four bytes immediately before
or after the allocated object at 0x00,0x00,0x00,0x00 for LE and BE
machines, respectively.  Which is probably not the most useful choice of
poison value.

One way to fix both of those at once is just to switch to 64-bit
redzones in all cases.

Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:14:57 -07:00
Linus Torvalds 0f9008ef38 Fix up SLUB compile
The newly merged SLUB allocator patches had been generated before the
removal of "struct subsystem", and ended up applying fine, but wouldn't
build based on the current tree as a result.

Fix up that merge error - not that SLUB is likely really ready for
showtime yet, but at least I can fix the trivial stuff.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:31:58 -07:00
Linus Torvalds ef93127e4c Merge master.kernel.org:/pub/scm/linux/kernel/git/davem/sparc-2.6
* master.kernel.org:/pub/scm/linux/kernel/git/davem/sparc-2.6:
  [SERIAL] sunsu: Fix section mismatch warnings.
  [SPARC64]: pgtable_cache_init() should be __init.
  [SPARC64]: Fix section mismatch warnings in arch/sparc64/kernel/prom.c
  [SPARC64]: Fix section mismatch warnings in arch/sparc64/kernel/pci.c
  [SPARC64]: Fix section mismatch warnings in arch/sparc64/kernel/console.c
  [MM]: sparse_init() should be __init.
  [SPARC64]: Update defconfig.
  [VIDEO]: Add Sun XVR-2500 framebuffer driver.
  [VIDEO]: Add Sun XVR-500 framebuffer driver.
  [SPARC64]: SUN4U PCI-E controller support.
  [SPARC]: Fix comment typo in smp4m_blackbox_current().
  [SCSI] SUNESP: sun_esp.c needs linux/delay.h

Fix up conflict in arch/sparc64/mm/init.c manually due to removal of
pgtable_cache_init() through the -mm patches (even though that patch was
also by David ;)

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:22:48 -07:00
Rafael J. Wysocki b1296cc48b freezer: fix racy usage of try_to_freeze in kswapd
Currently we can miss freeze_process()->signal_wake_up() in kswapd() if it
happens between try_to_freeze() and prepare_to_wait().  To prevent this
from happening we should check freezing(current) before calling schedule().

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:59 -07:00
Rafael J. Wysocki 7be9823491 swsusp: use inline functions for changing page flags
Replace direct invocations of SetPageNosave(), SetPageNosaveFree() etc.  with
calls to inline functions that can be changed in subsequent patches without
modifying the code calling them.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:58 -07:00
Akinobu Mita 4ab688c512 slob: fix page order calculation on not 4KB page
SLOB doesn't calculate correct page order when page size is not 4KB.  This
patch fixes it with using get_order() instead of find_order() which is SLOB
version of get_order().

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Acked-by: Matt Mackall <mpm@selenic.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:57 -07:00
Christoph Lameter cfce66047f Slab allocators: remove useless __GFP_NO_GROW flag
There is no user remaining and I have never seen any use of that flag.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:57 -07:00
Christoph Lameter 4f10493459 slab allocators: Remove SLAB_CTOR_ATOMIC
SLAB_CTOR atomic is never used which is no surprise since I cannot imagine
that one would want to do something serious in a constructor or destructor.
 In particular given that the slab allocators run with interrupts disabled.
 Actions in constructors and destructors are by their nature very limited
and usually do not go beyond initializing variables and list operations.

(The i386 pgd ctor and dtors do take a spinlock in constructor and
destructor.....  I think that is the furthest we go at this point.)

There is no flag passed to the destructor so removing SLAB_CTOR_ATOMIC also
establishes a certain symmetry.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:57 -07:00
Christoph Lameter 50953fe9e0 slab allocators: Remove SLAB_DEBUG_INITIAL flag
I have never seen a use of SLAB_DEBUG_INITIAL.  It is only supported by
SLAB.

I think its purpose was to have a callback after an object has been freed
to verify that the state is the constructor state again?  The callback is
performed before each freeing of an object.

I would think that it is much easier to check the object state manually
before the free.  That also places the check near the code object
manipulation of the object.

Also the SLAB_DEBUG_INITIAL callback is only performed if the kernel was
compiled with SLAB debugging on.  If there would be code in a constructor
handling SLAB_DEBUG_INITIAL then it would have to be conditional on
SLAB_DEBUG otherwise it would just be dead code.  But there is no such code
in the kernel.  I think SLUB_DEBUG_INITIAL is too problematic to make real
use of, difficult to understand and there are easier ways to accomplish the
same effect (i.e.  add debug code before kfree).

There is a related flag SLAB_CTOR_VERIFY that is frequently checked to be
clear in fs inode caches.  Remove the pointless checks (they would even be
pointless without removeal of SLAB_DEBUG_INITIAL) from the fs constructors.

This is the last slab flag that SLUB did not support.  Remove the check for
unimplemented flags from SLUB.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:57 -07:00
Benjamin Herrenschmidt 4b1d89290b get_unmapped_area doesn't need hugetlbfs hacks anymore
Remove the hugetlbfs specific hacks in toplevel get_unmapped_area() now that
all archs and hugetlbfs itself do the right thing for both cases.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: William Irwin <bill.irwin@oracle.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Russell King <rmk+kernel@arm.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: Andi Kleen <ak@suse.de>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Grant Grundler <grundler@parisc-linux.org>
Cc: Matthew Wilcox <willy@debian.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Adam Litke <agl@us.ibm.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:57 -07:00
Benjamin Herrenschmidt 06abdfb47e get_unmapped_area handles MAP_FIXED in generic code
generic arch_get_unmapped_area() now handles MAP_FIXED.  Now that all
implementations have been fixed, change the toplevel get_unmapped_area() to
call into arch or drivers for the MAP_FIXED case.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Russell King <rmk+kernel@arm.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: Andi Kleen <ak@suse.de>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Grant Grundler <grundler@parisc-linux.org>
Cc: Matthew Wilcox <willy@debian.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: William Irwin <bill.irwin@oracle.com>
Cc: Adam Litke <agl@us.ibm.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:57 -07:00
David Rientjes 2b45ab3398 oom: fix constraint deadlock
Fixes a deadlock in the OOM killer for allocations that are not
__GFP_HARDWALL.

Before the OOM killer checks for the allocation constraint, it takes
callback_mutex.

constrained_alloc() iterates through each zone in the allocation zonelist
and calls cpuset_zone_allowed_softwall() to determine whether an allocation
for gfp_mask is possible.  If a zone's node is not in the OOM-triggering
task's mems_allowed, it is not exiting, and we did not fail on a
__GFP_HARDWALL allocation, cpuset_zone_allowed_softwall() attempts to take
callback_mutex to check the nearest exclusive ancestor of current's cpuset.
 This results in deadlock.

We now take callback_mutex after iterating through the zonelist since we
don't need it yet.

Cc: Andi Kleen <ak@suse.de>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Cc: Martin J. Bligh <mbligh@mbligh.org>
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:55 -07:00
Yasunori Goto 2b744c01a5 mm: fix handling of panic_on_oom when cpusets are in use
The current panic_on_oom may not work if there is a process using
cpusets/mempolicy, because other nodes' memory may remain.  But some people
want failover by panic ASAP even if they are used.  This patch makes new
setting for its request.

This is tested on my ia64 box which has 3 nodes.

Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Ethan Solomita <solo@google.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:55 -07:00
Akinobu Mita 824ebef122 fault injection: fix failslab with CONFIG_NUMA
Currently failslab injects failures into ____cache_alloc().  But with enabling
CONFIG_NUMA it's not enough to let actual slab allocator functions (kmalloc,
kmem_cache_alloc, ...) return NULL.

This patch moves fault injection hook inside of __cache_alloc() and
__cache_alloc_node().  These are lower call path than ____cache_alloc() and
enable to inject faulures to slab allocators with CONFIG_NUMA.

Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:55 -07:00
Christoph Lameter 5af6083990 slab allocators: Remove obsolete SLAB_MUST_HWCACHE_ALIGN
This patch was recently posted to lkml and acked by Pekka.

The flag SLAB_MUST_HWCACHE_ALIGN is

1. Never checked by SLAB at all.

2. A duplicate of SLAB_HWCACHE_ALIGN for SLUB

3. Fulfills the role of SLAB_HWCACHE_ALIGN for SLOB.

The only remaining use is in sparc64 and ppc64 and their use there
reflects some earlier role that the slab flag once may have had. If
its specified then SLAB_HWCACHE_ALIGN is also specified.

The flag is confusing, inconsistent and has no purpose.

Remove it.

Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:55 -07:00
Nick Piggin 0a27a14a62 mm: madvise avoid exclusive mmap_sem
Avoid down_write of the mmap_sem in madvise when we can help it.

Acked-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:54 -07:00
matze b4169525bc include KERN_* constant in printk() calls in mm/slab.c
Signed-off-by: Matthias Kaehlcke <matthias.kaehlcke@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:54 -07:00
Akinobu Mita bc0055aee4 slob: handle SLAB_PANIC flag
kmem_cache_create() for slob doesn't handle SLAB_PANIC.

Signed-off-by: Matt Mackall <mpm@selenic.com>
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:54 -07:00
Christoph Lameter 6225e93735 Quicklists for page table pages
On x86_64 this cuts allocation overhead for page table pages down to a
fraction (kernel compile / editing load.  TSC based measurement of times spend
in each function):

no quicklist

pte_alloc               1569048 4.3s(401ns/2.7us/179.7us)
pmd_alloc                780988 2.1s(337ns/2.7us/86.1us)
pud_alloc                780072 2.2s(424ns/2.8us/300.6us)
pgd_alloc                260022 1s(920ns/4us/263.1us)

quicklist:

pte_alloc                452436 573.4ms(8ns/1.3us/121.1us)
pmd_alloc                196204 174.5ms(7ns/889ns/46.1us)
pud_alloc                195688 172.4ms(7ns/881ns/151.3us)
pgd_alloc                 65228 9.8ms(8ns/150ns/6.1us)

pgd allocations are the most complex and there we see the most dramatic
improvement (may be we can cut down the amount of pgds cached somewhat?).  But
even the pte allocations still see a doubling of performance.

1. Proven code from the IA64 arch.

	The method used here has been fine tuned for years and
	is NUMA aware. It is based on the knowledge that accesses
	to page table pages are sparse in nature. Taking a page
	off the freelists instead of allocating a zeroed pages
	allows a reduction of number of cachelines touched
	in addition to getting rid of the slab overhead. So
	performance improves. This is particularly useful if pgds
	contain standard mappings. We can save on the teardown
	and setup of such a page if we have some on the quicklists.
	This includes avoiding lists operations that are otherwise
	necessary on alloc and free to track pgds.

2. Light weight alternative to use slab to manage page size pages

	Slab overhead is significant and even page allocator use
	is pretty heavy weight. The use of a per cpu quicklist
	means that we touch only two cachelines for an allocation.
	There is no need to access the page_struct (unless arch code
	needs to fiddle around with it). So the fast past just
	means bringing in one cacheline at the beginning of the
	page. That same cacheline may then be used to store the
	page table entry. Or a second cacheline may be used
	if the page table entry is not in the first cacheline of
	the page. The current code will zero the page which means
	touching 32 cachelines (assuming 128 byte). We get down
	from 32 to 2 cachelines in the fast path.

3. x86_64 gets lightweight page table page management.

	This will allow x86_64 arch code to faster repopulate pgds
	and other page table entries. The list operations for pgds
	are reduced in the same way as for i386 to the point where
	a pgd is allocated from the page allocator and when it is
	freed back to the page allocator. A pgd can pass through
	the quicklists without having to be reinitialized.

64 Consolidation of code from multiple arches

	So far arches have their own implementation of quicklist
	management. This patch moves that feature into the core allowing
	an easier maintenance and consistent management of quicklists.

Page table pages have the characteristics that they are typically zero or in a
known state when they are freed.  This is usually the exactly same state as
needed after allocation.  So it makes sense to build a list of freed page
table pages and then consume the pages already in use first.  Those pages have
already been initialized correctly (thus no need to zero them) and are likely
already cached in such a way that the MMU can use them most effectively.  Page
table pages are used in a sparse way so zeroing them on allocation is not too
useful.

Such an implementation already exits for ia64.  Howver, that implementation
did not support constructors and destructors as needed by i386 / x86_64.  It
also only supported a single quicklist.  The implementation here has
constructor and destructor support as well as the ability for an arch to
specify how many quicklists are needed.

Quicklists are defined by an arch defining CONFIG_QUICKLIST.  If more than one
quicklist is necessary then we can define NR_QUICK for additional lists.  F.e.
 i386 needs two and thus has

config NR_QUICK
	int
	default 2

If an arch has requested quicklist support then pages can be allocated
from the quicklist (or from the page allocator if the quicklist is
empty) via:

quicklist_alloc(<quicklist-nr>, <gfpflags>, <constructor>)

Page table pages can be freed using:

quicklist_free(<quicklist-nr>, <destructor>, <page>)

Pages must have a definite state after allocation and before
they are freed. If no constructor is specified then pages
will be zeroed on allocation and must be zeroed before they are
freed.

If a constructor is used then the constructor will establish
a definite page state. F.e. the i386 and x86_64 pgd constructors
establish certain mappings.

Constructors and destructors can also be used to track the pages.
i386 and x86_64 use a list of pgds in order to be able to dynamically
update standard mappings.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Andi Kleen <ak@suse.de>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: William Lee Irwin III <wli@holomorphy.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:54 -07:00
Christoph Lameter 70d71228af slub: remove object activities out of checking functions
Make sure that the check function really only check things and do not perform
activities.  Extract the tracing and object seeding out of the two check
functions and place them into slab_alloc and slab_free

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:54 -07:00
Christoph Lameter 2086d26a05 SLUB: Free slabs and sort partial slab lists in kmem_cache_shrink
At kmem_cache_shrink check if we have any empty slabs on the partial
if so then remove them.

Also--as an anti-fragmentation measure--sort the partial slabs so that
the most fully allocated ones come first and the least allocated last.

The next allocations may fill up the nearly full slabs. Having the
least allocated slabs last gives them the maximum chance that their
remaining objects may be freed. Thus we can hopefully minimize the
partial slabs.

I think this is the best one can do in terms antifragmentation
measures. Real defragmentation (meaning moving objects out of slabs with
the least free objects to those that are almost full) can be implemted
by reverse scanning through the list produced here but that would mean
that we need to provide a callback at slab cache creation that allows
the deletion or moving of an object. This will involve slab API
changes, so defer for now.

Cc: Mel Gorman <mel@skynet.ie>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:54 -07:00
Christoph Lameter 88a420e4e2 slub: add ability to list alloc / free callers per slab
This patch enables listing the callers who allocated or freed objects in a
cache.

For example to list the allocators for kmalloc-128 do

cat /sys/slab/kmalloc-128/alloc_calls
      7 sn_io_slot_fixup+0x40/0x700
      7 sn_io_slot_fixup+0x80/0x700
      9 sn_bus_fixup+0xe0/0x380
      6 param_sysfs_setup+0xf0/0x280
    276 percpu_populate+0xf0/0x1a0
     19 __register_chrdev_region+0x30/0x360
      8 expand_files+0x2e0/0x6e0
      1 sys_epoll_create+0x60/0x200
      1 __mounts_open+0x140/0x2c0
     65 kmem_alloc+0x110/0x280
      3 alloc_disk_node+0xe0/0x200
     33 as_get_io_context+0x90/0x280
     74 kobject_kset_add_dir+0x40/0x140
     12 pci_create_bus+0x2a0/0x5c0
      1 acpi_ev_create_gpe_block+0x120/0x9e0
     41 con_insert_unipair+0x100/0x1c0
      1 uart_open+0x1c0/0xba0
      1 dma_pool_create+0xe0/0x340
      2 neigh_table_init_no_netlink+0x260/0x4c0
      6 neigh_parms_alloc+0x30/0x200
      1 netlink_kernel_create+0x130/0x320
      5 fz_hash_alloc+0x50/0xe0
      2 sn_common_hubdev_init+0xd0/0x6e0
     28 kernel_param_sysfs_setup+0x30/0x180
     72 process_zones+0x70/0x2e0

cat /sys/slab/kmalloc-128/free_calls
    558 <not-available>
      3 sn_io_slot_fixup+0x600/0x700
     84 free_fdtable_rcu+0x120/0x260
      2 seq_release+0x40/0x60
      6 kmem_free+0x70/0xc0
     24 free_as_io_context+0x20/0x200
      1 acpi_get_object_info+0x3a0/0x3e0
      1 acpi_add_single_object+0xcf0/0x1e40
      2 con_release_unimap+0x80/0x140
      1 free+0x20/0x40

SLAB_STORE_USER must be enabled for a slab cache by either booting with
"slab_debug" or enabling user tracking specifically for the slab of interest.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:54 -07:00
Christoph Lameter e95eed571e SLUB: Add MIN_PARTIAL
We leave a mininum of partial slabs on nodes when we search for
partial slabs on other node. Define a constant for that value.

Then modify slub to keep MIN_PARTIAL slabs around.

This avoids bad situations where a function frees the last object
in a slab (which results in the page being returned to the page
allocator) only to then allocate one again (which requires getting
a page back from the page allocator if the partial list was empty).
Keeping a couple of slabs on the partial list reduces overhead.

Empty slabs are added to the end of the partial list to insure that
partially allocated slabs are consumed first (defragmentation).

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:54 -07:00
Christoph Lameter 53e15af03b slub: validation of slabs (metadata and guard zones)
This enables validation of slab.  Validation means that all objects are
checked to see if there are redzone violations, if padding has been
overwritten or any pointers have been corrupted.  Also checks the consistency
of slab counters.

Validation enables the detection of metadata corruption without the kernel
having to execute code that actually uses (allocs/frees) and object.  It
allows one to make sure that the slab metainformation and the guard values
around an object have not been compromised.

A single slabcache can be checked by writing a 1 to the "validate" file.

i.e.

echo 1 >/sys/slab/kmalloc-128/validate

or use the slabinfo tool to check all slabs

slabinfo -v

Error messages will show up in the syslog.

Note that validation can only reach slabs that are on a list.  This means that
we are usually restricted to partial slabs and active slabs unless
SLAB_STORE_USER is active which will build a full slab list and allows
validation of slabs that are fully in use.  Booting with "slub_debug" set will
enable SLAB_STORE_USER and then full diagnostic are available.

Note that we attempt to push cpu slabs back to the lists when we start the
check.  If the cpu slab is reactivated before we get to it (another processor
grabs it before we get to it) then it cannot be checked.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:54 -07:00
Christoph Lameter 643b113849 slub: enable tracking of full slabs
If slab tracking is on then build a list of full slabs so that we can verify
the integrity of all slabs and are also able to built list of alloc/free
callers.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:54 -07:00
Christoph Lameter 77c5e2d01a slub: fix object tracking
Object tracking did not work the right way for several call chains. Fix this up
by adding a new parameter to slub_alloc and slub_free that specifies the
caller address explicitly.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:54 -07:00
Christoph Lameter b49af68ff9 Add virt_to_head_page and consolidate code in slab and slub
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:54 -07:00
Christoph Lameter 6d7779538f mm: optimize compound_head() by avoiding a shared page flag
The patch adds PageTail(page) and PageHead(page) to check if a page is the
head or the tail of a compound page.  This is done by masking the two bits
describing the state of a compound page and then comparing them.  So one
comparision and a branch instead of two bit checks and two branches.

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:53 -07:00
Christoph Lameter d85f33855c Make page->private usable in compound pages
If we add a new flag so that we can distinguish between the first page and the
tail pages then we can avoid to use page->private in the first page.
page->private == page for the first page, so there is no real information in
there.

Freeing up page->private makes the use of compound pages more transparent.
They become more usable like real pages.  Right now we have to be careful f.e.
 if we are going beyond PAGE_SIZE allocations in the slab on i386 because we
can then no longer use the private field.  This is one of the issues that
cause us not to support debugging for page size slabs in SLAB.

Having page->private available for SLUB would allow more meta information in
the page struct.  I can probably avoid the 16 bit ints that I have in there
right now.

Also if page->private is available then a compound page may be equipped with
buffer heads.  This may free up the way for filesystems to support larger
blocks than page size.

We add PageTail as an alias of PageReclaim.  Compound pages cannot currently
be reclaimed.  Because of the alias one needs to check PageCompound first.

The RFC for the this approach was discussed at
http://marc.info/?t=117574302800001&r=1&w=2

[nacc@us.ibm.com: fix hugetlbfs]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:53 -07:00
Christoph Lameter 614410d589 SLUB: allocate smallest object size if the user asks for 0 bytes
Makes SLUB behave like SLAB in this area to avoid issues....

Throw a stack dump to alert people.

At some point the behavior should be switched back.  NULL is no memory as
far as I can tell and if the use asked for 0 bytes then he need to get no
memory.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:53 -07:00
Christoph Lameter 47bfdc0d5a SLUB: change default alignments
Structures may contain u64 items on 32 bit platforms that are only able to
address 64 bit items on 64 bit boundaries.  Change the mininum alignment of
slabs to conform to those expectations.

ARCH_KMALLOC_MINALIGN must be changed for good since a variety of structure
are mixed in the general slabs.

ARCH_SLAB_MINALIGN is changed because currently there is no consistent
specification of object alignment.  We may have that in the future when the
KMEM_CACHE and related macros are used to generate slabs.  These pass the
alignment of the structure generated by the compiler to the slab.

With KMEM_CACHE etc we could align structures that do not contain 64
bit values to 32 bit boundaries potentially saving some memory.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:53 -07:00
Christoph Lameter 81819f0fc8 SLUB core
This is a new slab allocator which was motivated by the complexity of the
existing code in mm/slab.c. It attempts to address a variety of concerns
with the existing implementation.

A. Management of object queues

   A particular concern was the complex management of the numerous object
   queues in SLAB. SLUB has no such queues. Instead we dedicate a slab for
   each allocating CPU and use objects from a slab directly instead of
   queueing them up.

B. Storage overhead of object queues

   SLAB Object queues exist per node, per CPU. The alien cache queue even
   has a queue array that contain a queue for each processor on each
   node. For very large systems the number of queues and the number of
   objects that may be caught in those queues grows exponentially. On our
   systems with 1k nodes / processors we have several gigabytes just tied up
   for storing references to objects for those queues  This does not include
   the objects that could be on those queues. One fears that the whole
   memory of the machine could one day be consumed by those queues.

C. SLAB meta data overhead

   SLAB has overhead at the beginning of each slab. This means that data
   cannot be naturally aligned at the beginning of a slab block. SLUB keeps
   all meta data in the corresponding page_struct. Objects can be naturally
   aligned in the slab. F.e. a 128 byte object will be aligned at 128 byte
   boundaries and can fit tightly into a 4k page with no bytes left over.
   SLAB cannot do this.

D. SLAB has a complex cache reaper

   SLUB does not need a cache reaper for UP systems. On SMP systems
   the per CPU slab may be pushed back into partial list but that
   operation is simple and does not require an iteration over a list
   of objects. SLAB expires per CPU, shared and alien object queues
   during cache reaping which may cause strange hold offs.

E. SLAB has complex NUMA policy layer support

   SLUB pushes NUMA policy handling into the page allocator. This means that
   allocation is coarser (SLUB does interleave on a page level) but that
   situation was also present before 2.6.13. SLABs application of
   policies to individual slab objects allocated in SLAB is
   certainly a performance concern due to the frequent references to
   memory policies which may lead a sequence of objects to come from
   one node after another. SLUB will get a slab full of objects
   from one node and then will switch to the next.

F. Reduction of the size of partial slab lists

   SLAB has per node partial lists. This means that over time a large
   number of partial slabs may accumulate on those lists. These can
   only be reused if allocator occur on specific nodes. SLUB has a global
   pool of partial slabs and will consume slabs from that pool to
   decrease fragmentation.

G. Tunables

   SLAB has sophisticated tuning abilities for each slab cache. One can
   manipulate the queue sizes in detail. However, filling the queues still
   requires the uses of the spin lock to check out slabs. SLUB has a global
   parameter (min_slab_order) for tuning. Increasing the minimum slab
   order can decrease the locking overhead. The bigger the slab order the
   less motions of pages between per CPU and partial lists occur and the
   better SLUB will be scaling.

G. Slab merging

   We often have slab caches with similar parameters. SLUB detects those
   on boot up and merges them into the corresponding general caches. This
   leads to more effective memory use. About 50% of all caches can
   be eliminated through slab merging. This will also decrease
   slab fragmentation because partial allocated slabs can be filled
   up again. Slab merging can be switched off by specifying
   slub_nomerge on boot up.

   Note that merging can expose heretofore unknown bugs in the kernel
   because corrupted objects may now be placed differently and corrupt
   differing neighboring objects. Enable sanity checks to find those.

H. Diagnostics

   The current slab diagnostics are difficult to use and require a
   recompilation of the kernel. SLUB contains debugging code that
   is always available (but is kept out of the hot code paths).
   SLUB diagnostics can be enabled via the "slab_debug" option.
   Parameters can be specified to select a single or a group of
   slab caches for diagnostics. This means that the system is running
   with the usual performance and it is much more likely that
   race conditions can be reproduced.

I. Resiliency

   If basic sanity checks are on then SLUB is capable of detecting
   common error conditions and recover as best as possible to allow the
   system to continue.

J. Tracing

   Tracing can be enabled via the slab_debug=T,<slabcache> option
   during boot. SLUB will then protocol all actions on that slabcache
   and dump the object contents on free.

K. On demand DMA cache creation.

   Generally DMA caches are not needed. If a kmalloc is used with
   __GFP_DMA then just create this single slabcache that is needed.
   For systems that have no ZONE_DMA requirement the support is
   completely eliminated.

L. Performance increase

   Some benchmarks have shown speed improvements on kernbench in the
   range of 5-10%. The locking overhead of slub is based on the
   underlying base allocation size. If we can reliably allocate
   larger order pages then it is possible to increase slub
   performance much further. The anti-fragmentation patches may
   enable further performance increases.

Tested on:
i386 UP + SMP, x86_64 UP + SMP + NUMA emulation, IA64 NUMA + Simulator

SLUB Boot options

slub_nomerge		Disable merging of slabs
slub_min_order=x	Require a minimum order for slab caches. This
			increases the managed chunk size and therefore
			reduces meta data and locking overhead.
slub_min_objects=x	Mininum objects per slab. Default is 8.
slub_max_order=x	Avoid generating slabs larger than order specified.
slub_debug		Enable all diagnostics for all caches
slub_debug=<options>	Enable selective options for all caches
slub_debug=<o>,<cache>	Enable selective options for a certain set of
			caches

Available Debug options
F		Double Free checking, sanity and resiliency
R		Red zoning
P		Object / padding poisoning
U		Track last free / alloc
T		Trace all allocs / frees (only use for individual slabs).

To use SLUB: Apply this patch and then select SLUB as the default slab
allocator.

[hugh@veritas.com: fix an oops-causing locking error]
[akpm@linux-foundation.org: various stupid cleanups and small fixes]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:53 -07:00
Andrew Morton a3a02be791 slab: mark set_up_list3s() __init
It is only ever used prior to free_initmem().

(It will cause a warning when we run the section checking, but that's a
false-positive and it simply changes the source of an existing warning, which
is also a false-positive)

Cc: Christoph Lameter <clameter@engr.sgi.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:53 -07:00
Mel Gorman 3b1d92c565 Do not disable interrupts when reading min_free_kbytes
The sysctl handler for min_free_kbytes calls setup_per_zone_pages_min() on
read or write.  This function iterates through every zone and calls
spin_lock_irqsave() on the zone LRU lock.  When reading min_free_kbytes,
this is a total waste of time that disables interrupts on the local
processor.  It might even be noticable machines with large numbers of zones
if a process started constantly reading min_free_kbytes.

This patch only calls setup_per_zone_pages_min() only on write. Tested on
an x86 laptop and it did the right thing.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:53 -07:00
Eric Dumazet 8da3430d8a slab: NUMA kmem_cache diet
Some NUMA machines have a big MAX_NUMNODES (possibly 1024), but fewer
possible nodes.  This patch dynamically sizes the 'struct kmem_cache' to
allocate only needed space.

I moved nodelists[] field at the end of struct kmem_cache, and use the
following computation in kmem_cache_init()

cache_cache.buffer_size = offsetof(struct kmem_cache, nodelists) +
                                 nr_node_ids * sizeof(struct kmem_list3 *);

On my two nodes x86_64 machine, kmem_cache.obj_size is now 192 instead of 704
(This is because on x86_64, MAX_NUMNODES is 64)

On bigger NUMA setups, this might reduce the gfporder of "cache_cache"

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:53 -07:00
Eric Dumazet 6310984694 SLAB: don't allocate empty shared caches
We can avoid allocating empty shared caches and avoid unecessary check of
cache->limit.  We save some memory.  We avoid bringing into CPU cache
unecessary cache lines.

All accesses to l3->shared are already checking NULL pointers so this patch is
safe.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:53 -07:00
Eric Dumazet 364fbb29a0 SLAB: use num_possible_cpus() in enable_cpucache()
The existing comment in mm/slab.c is *perfect*, so I reproduce it :

         /*
          * CPU bound tasks (e.g. network routing) can exhibit cpu bound
          * allocation behaviour: Most allocs on one cpu, most free operations
          * on another cpu. For these cases, an efficient object passing between
          * cpus is necessary. This is provided by a shared array. The array
          * replaces Bonwick's magazine layer.
          * On uniprocessor, it's functionally equivalent (but less efficient)
          * to a larger limit. Thus disabled by default.
          */

As most shiped linux kernels are now compiled with CONFIG_SMP, there is no way
a preprocessor #if can detect if the machine is UP or SMP. Better to use
num_possible_cpus().

This means on UP we allocate a 'size=0 shared array', to be more efficient.

Another patch can later avoid the allocations of 'empty shared arrays', to
save some memory.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:52 -07:00
Jan Kara 6ce745ed39 readahead: code cleanup
Rename file_ra_state.prev_page to prev_index and file_ra_state.offset to
prev_offset.  Also update of prev_index in do_generic_mapping_read() is now
moved close to the update of prev_offset.

[wfg@mail.ustc.edu.cn: fix it]
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: WU Fengguang <wfg@mail.ustc.edu.cn>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:52 -07:00
Jan Kara ec0f163722 readahead: improve heuristic detecting sequential reads
Introduce ra.offset and store in it an offset where the previous read
ended.  This way we can detect whether reads are really sequential (and
thus we should not mark the page as accessed repeatedly) or whether they
are random and just happen to be in the same page (and the page should
really be marked accessed again).

Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: WU Fengguang <wfg@mail.ustc.edu.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:52 -07:00
Borislav Petkov 9490991482 Add unitialized_var() macro for suppressing gcc warnings
Introduce a macro for suppressing gcc from generating a warning about a
probable uninitialized state of a variable.

Example:

-	spinlock_t *ptl;
+	spinlock_t *uninitialized_var(ptl);

Not a happy solution, but those warnings are obnoxious.

- Using the usual pointlessly-set-it-to-zero approach wastes several
  bytes of text.

- Using a macro means we can (hopefully) do something else if gcc changes
  cause the `x = x' hack to stop working

- Using a macro means that people who are worried about hiding true bugs
  can easily turn it off.

Signed-off-by: Borislav Petkov <bbpetkov@yahoo.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:52 -07:00
Nick Piggin a8127717cb mm: simplify filemap_nopage
Identical block is duplicated twice: contrary to the comment, we have been
re-reading the page *twice* in filemap_nopage rather than once.

If any retry logic or anything is needed, it belongs in lower levels anyway.
Only retry once.  Linus agrees.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:52 -07:00
Andy Whitcroft 14e0729841 add pfn_valid_within helper for sub-MAX_ORDER hole detection
Generally we work under the assumption that memory the mem_map array is
contigious and valid out to MAX_ORDER_NR_PAGES block of pages, ie.  that if we
have validated any page within this MAX_ORDER_NR_PAGES block we need not check
any other.  This is not true when CONFIG_HOLES_IN_ZONE is set and we must
check each and every reference we make from a pfn.

Add a pfn_valid_within() helper which should be used when scanning pages
within a MAX_ORDER_NR_PAGES block when we have already checked the validility
of the block normally with pfn_valid().  This can then be optimised away when
we do not have holes within a MAX_ORDER_NR_PAGES block of pages.

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Bob Picco <bob.picco@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:52 -07:00
Joshua N Pritikin 9a82782f8f allow oom_adj of saintly processes
If the badness of a process is zero then oom_adj>0 has no effect.  This
patch makes sure that the oom_adj shift actually increases badness points
appropriately.

Signed-off-by: Joshua N. Pritikin <jpritikin@pobox.com>
Cc: Andrea Arcangeli <andrea@novell.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:51 -07:00
Nick Piggin 6fe6900e1e mm: make read_cache_page synchronous
Ensure pages are uptodate after returning from read_cache_page, which allows
us to cut out most of the filesystem-internal PageUptodate calls.

I didn't have a great look down the call chains, but this appears to fixes 7
possible use-before uptodate in hfs, 2 in hfsplus, 1 in jfs, a few in
ecryptfs, 1 in jffs2, and a possible cleared data overwritten with readpage in
block2mtd.  All depending on whether the filler is async and/or can return
with a !uptodate page.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:51 -07:00
Pekka Enberg 714b8171af slab: ensure cache_alloc_refill terminates
If slab->inuse is corrupted, cache_alloc_refill can enter an infinite
loop as detailed by Michael Richardson in the following post:
<http://lkml.org/lkml/2007/2/16/292>. This adds a BUG_ON to catch
those cases.

Cc: Michael Richardson <mcr@sandelman.ca>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:51 -07:00
Nick Piggin 5f22df00a0 mm: remove gcc workaround
Minimum gcc version is 3.2 now.  However, with likely profiling, even
modern gcc versions cannot always eliminate the call.

Replace the placeholder functions with the more conventional empty static
inlines, which should be optimal for everyone.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:51 -07:00
Christoph Lameter 1b4244647c Use ZVC counters to establish exact size of dirtyable pages
We can use the global ZVC counters to establish the exact size of the LRU
and the free pages.  This allows a more accurate determination of the dirty
ratio.

This patch will fix the broken ratio calculations if large amounts of
memory are allocated to huge pags or other consumers that do not put the
pages on to the LRU.

Notes:
- I did not add NR_SLAB_RECLAIMABLE to the calculation of the
  dirtyable pages. Those may be reclaimable but they are at this
  point not dirtyable. If NR_SLAB_RECLAIMABLE would be considered
  then a huge number of reclaimable pages would stop writeback
  from occurring.

- This patch used to be in mm as the last one in a series of patches.
  It was removed when Linus updated the treatment of highmem because
  there was a conflict. I updated the patch to follow Linus' approach.
  This patch is neede to fulfill the claims made in the beginning of the
  patchset that is now in Linus' tree.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:51 -07:00
Christoph Lameter 476f35348e Safer nr_node_ids and nr_node_ids determination and initial values
The nr_cpu_ids value is currently only calculated in smp_init.  However, it
may be needed before (SLUB needs it on kmem_cache_init!) and other kernel
components may also want to allocate dynamically sized per cpu array before
smp_init.  So move the determination of possible cpus into sched_init()
where we already loop over all possible cpus early in boot.

Also initialize both nr_node_ids and nr_cpu_ids with the highest value they
could take.  If we have accidental users before these values are determined
then the current valud of 0 may cause too small per cpu and per node arrays
to be allocated.  If it is set to the maximum possible then we only waste
some memory for early boot users.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:51 -07:00
Jeremy Fitzhardinge aee16b3cee Add apply_to_page_range() which applies a function to a pte range
Add a new mm function apply_to_page_range() which applies a given function to
every pte in a given virtual address range in a given mm structure.  This is a
generic alternative to cut-and-pasting the Linux idiomatic pagetable walking
code in every place that a sequence of PTEs must be accessed.

Although this interface is intended to be useful in a wide range of
situations, it is currently used specifically by several Xen subsystems, for
example: to ensure that pagetables have been allocated for a virtual address
range, and to construct batched special pagetable update requests to map I/O
memory (in ioremap()).

[akpm@linux-foundation.org: fix warning, unpleasantly]
Signed-off-by: Ian Pratt <ian.pratt@xensource.com>
Signed-off-by: Christian Limpach <Christian.Limpach@cl.cam.ac.uk>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Matt Mackall <mpm@waste.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:51 -07:00
Pekka Enberg fd76bab2fa slab: introduce krealloc
This introduce krealloc() that reallocates memory while keeping the contents
unchanged.  The allocator avoids reallocation if the new size fits the
currently used cache.  I also added a simple non-optimized version for
mm/slob.c for compatibility.

[akpm@linux-foundation.org: fix warnings]
Acked-by: Josef Sipek <jsipek@fsl.cs.sunysb.edu>
Acked-by: Matt Mackall <mpm@selenic.com>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:50 -07:00
David S. Miller 6a5b518f22 [MM]: sparse_init() should be __init.
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-05-06 23:54:25 -07:00
Siddha, Suresh B 62918a0361 [PATCH] x86-64: skip cache_free_alien() on non NUMA
Set use_alien_caches to 0 on non NUMA platforms.  And avoid calling the
cache_free_alien() when use_alien_caches is not set.  This will avoid the
cache miss that happens while dereferencing slabp to get nodeid.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Eric Dumazet <dada1@cosmosbay.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2007-05-02 19:27:18 +02:00
Jeremy Fitzhardinge ce6234b529 [PATCH] i386: PARAVIRT: add kmap_atomic_pte for mapping highpte pages
Xen and VMI both have special requirements when mapping a highmem pte
page into the kernel address space.  These can be dealt with by adding
a new kmap_atomic_pte() function for mapping highptes, and hooking it
into the paravirt_ops infrastructure.

Xen specifically wants to map the pte page RO, so this patch exposes a
helper function, kmap_atomic_prot, which maps the page with the
specified page protections.

This also adds a kmap_flush_unused() function to clear out the cached
kmap mappings.  Xen needs this to clear out any potential stray RW
mappings of pages which will become part of a pagetable.

[ Zach - vmi.c will need some attention after this patch.  It wasn't
  immediately obvious to me what needs to be done. ]

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Zachary Amsden <zach@vmware.com>
2007-05-02 19:27:15 +02:00
Jeremy Fitzhardinge d6dd61c831 [PATCH] x86: PARAVIRT: add hooks to intercept mm creation and destruction
Add hooks to allow a paravirt implementation to track the lifetime of
an mm.  Paravirtualization requires three hooks, but only two are
needed in common code.  They are:

arch_dup_mmap, which is called when a new mmap is created at fork

arch_exit_mmap, which is called when the last process reference to an
  mm is dropped, which typically happens on exit and exec.

The third hook is activate_mm, which is called from the arch-specific
activate_mm() macro/function, and so doesn't need stub versions for
other architectures.  It's called when an mm is first used.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: linux-arch@vger.kernel.org
Cc: James Bottomley <James.Bottomley@SteelEye.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
2007-05-02 19:27:14 +02:00
Andi Kleen 0d08e0d3a9 [PATCH] x86-64: Fix vmalloc_32 to really allocate <4GB on 64bit platforms
Ugly ifdef, but should handle all 64bit platforms that have suitable
zones. On some like Altix it's probably impossible without IOMMU
use to get memory <4GB this way,  but they have to live with that.
Signed-off-by: Andi Kleen <ak@suse.de>
2007-05-02 19:27:12 +02:00
Linus Torvalds da8ac5e0fa Merge branch 'for-linus' of git://git390.osdl.marist.edu/pub/scm/linux-2.6
* 'for-linus' of git://git390.osdl.marist.edu/pub/scm/linux-2.6: (38 commits)
  [S390] SPIN_LOCK_UNLOCKED cleanup in drivers/s390
  [S390] Clean up smp code in preparation for some larger changes.
  [S390] Remove debugging junk.
  [S390] Switch etr from tasklet to workqueue.
  [S390] split page_test_and_clear_dirty.
  [S390] Processor degradation notification.
  [S390] vtime: cleanup per_cpu usage.
  [S390] crypto: cleanup.
  [S390] sclp: fix coding style.
  [S390] vmlogrdr: stop IUCV connection in vmlogrdr_release.
  [S390] sclp: initialize early.
  [S390] ctc: kmalloc->kzalloc/casting cleanups.
  [S390] zfcpdump support.
  [S390] dasd: Add ipldev parameter.
  [S390] dasd: Add sysfs attribute status and generate uevents.
  [S390] Improved kernel stack overflow checking.
  [S390] Get rid of console setup functions.
  [S390] No execute support cleanup.
  [S390] Minor fault path optimization.
  [S390] Use generic bug.
  ...
2007-04-27 09:15:31 -07:00
Linus Torvalds 07db59bd6b Change default dirty-writeback limits
Do this really early in the 2.6.22-rc series, so that we'll get
feedback.  And don't change by half measures.  Just cut the default
dirty limit to a quarter of what it was, and see if anybody even
notices.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-04-27 09:10:47 -07:00
Martin Schwidefsky 6c210482ae [S390] split page_test_and_clear_dirty.
The page_test_and_clear_dirty primitive really consists of two
operations, page_test_dirty and the page_clear_dirty. The combination
of the two is not an atomic operation, so it makes more sense to have
two separate operations instead of one.
In addition to the improved readability of the s390 version of
SetPageUptodate, it now avoids the page_test_dirty operation which is
an insert-storage-key-extended (iske) instruction which is an expensive
operation.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2007-04-27 16:01:46 +02:00
Christoph Lameter 0e8c7d0fd5 page migration: fix NR_FILE_PAGES accounting
NR_FILE_PAGES must be accounted for depending on the zone that the page
belongs to.  If we replace the page in the radix tree then we may have to
shift the count to another zone.

Suggested-by: Ethan Solomita <solo@google.com>
Eventually-typed-in-by: Christoph Lameter <clameter@sgi.com>
Cc: Martin Bligh <mbligh@mbligh.org>
Cc: <stable@kernel.org>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-04-24 08:23:08 -07:00
Hugh Dickins 3d124cbba3 fix OOM killing processes wrongly thought MPOL_BIND
I only have CONFIG_NUMA=y for build testing: surprised when trying a memhog
to see lots of other processes killed with "No available memory
(MPOL_BIND)".  memhog is killed correctly once we initialize nodemask in
constrained_alloc().

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Acked-by: Christoph Lameter <clameter@sgi.com>
Acked-by: William Irwin <bill.irwin@oracle.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-04-24 08:23:07 -07:00
David Rientjes 650a7c974f oom: kill all threads that share mm with killed task
oom_kill_task() calls __oom_kill_task() to OOM kill a selected task.
When finding other threads that share an mm with that task, we need to
kill those individual threads and not the same one.

(Bug introduced by f2a2a7108a)

Acked-by: William Irwin <bill.irwin@oracle.com>
Acked-by: Christoph Lameter <clameter@engr.sgi.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Andrew Morton <akpm@osdl.org>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-04-24 08:11:49 -07:00
Wu, Bryan 6a04de6dbe [PATCH] nommu: fix bug ip_conntrack does not work on nommu
num_physpages is not exported out in mm/nommu.c, so the ip_conntrack module
link will fail.

Signed-off-by: Bryan Wu <bryan.wu@analog.com>
Acked-By: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-04-12 15:31:42 -07:00
Linus Torvalds 8d00647f2c Merge branch 'for-linus' of git://git390.osdl.marist.edu/pub/scm/linux-2.6
* 'for-linus' of git://git390.osdl.marist.edu/pub/scm/linux-2.6:
  [S390] cio: Fix handling of interrupt for csch().
  [S390] page_mkclean data corruption.
2007-04-04 10:11:16 -07:00
David Howells e94a40c508 [PATCH] SLAB: Mention slab name when listing corrupt objects
Mention the slab name when listing corrupt objects.  Although the function
that released the memory is mentioned, that is frequently ambiguous as such
functions often release several pieces of memory.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-04-04 08:51:52 -07:00
Martin Schwidefsky 6e1beb3c22 [S390] page_mkclean data corruption.
The git commit c2fda5fed8 which
added the page_test_and_clear_dirty call to page_mkclean and the
git commit 7658cc2892 which fixes
the "nasty and subtle race in shared mmap'ed page writeback"
problem in clear_page_dirty_for_io cause data corruption on s390.

The effect of the two changes is that for every call to
clear_page_dirty_for_io a page_test_and_clear_dirty is done. If
the per page dirty bit is set set_page_dirty is called. Strangly
clear_page_dirty_for_io is called for not-uptodate pages, e.g.
over this call-chain:

 [<000000000007c0f2>] clear_page_dirty_for_io+0x12a/0x130
 [<000000000007c494>] generic_writepages+0x258/0x3e0
 [<000000000007c692>] do_writepages+0x76/0x7c
 [<00000000000c7a26>] __writeback_single_inode+0xba/0x3e4
 [<00000000000c831a>] sync_sb_inodes+0x23e/0x398
 [<00000000000c8802>] writeback_inodes+0x12e/0x140
 [<000000000007b9ee>] wb_kupdate+0xd2/0x178
 [<000000000007cca2>] pdflush+0x162/0x23c

The bad news now is that page_test_and_clear_dirty might claim
that a not-uptodate page is dirty since SetPageUptodate which
resets the per page dirty bit has not yet been called. The page
writeback that follows clobbers the data on disk.

The simplest solution to this problem is to move the call to
page_test_and_clear_dirty under the "if (page_mapped(page))".
If a file backed page is mapped it is uptodate.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2007-04-04 14:37:39 +02:00
Carsten Otte a76c0b9763 [PATCH] mm: fix xip issue with /dev/zero
Fix the bug, that reading into xip mapping from /dev/zero fills the user
page table with ZERO_PAGE() entries.  Later on, xip cannot tell which pages
have been ZERO_PAGE() filled by access to a sparse mapping, and which ones
origin from /dev/zero.  It will unmap ZERO_PAGE from all mappings when
filling the sparse hole with data.  xip does now use its own zeroed page
for its sparse mappings.  Please apply.

Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-29 08:22:26 -07:00
Hugh Dickins 90ed52ebe4 [PATCH] holepunch: fix mmap_sem i_mutex deadlock
sys_madvise has down_write of mmap_sem, then madvise_remove calls
vmtruncate_range which takes i_mutex and i_alloc_sem: no, we can easily devise
deadlocks from that ordering.

madvise_remove drop mmap_sem while calling vmtruncate_range: luckily, since
madvise_remove doesn't split or merge vmas, it's easy to handle this case with
a NULL prev, without restructuring sys_madvise.  (Though sad to retake
mmap_sem when it's unlikely to be needed, and certainly down_read is
sufficient for MADV_REMOVE, unlike the other madvices.)

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: Miklos Szeredi <mszeredi@suse.cz>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-29 08:22:26 -07:00
Hugh Dickins 16a100190d [PATCH] holepunch: fix disconnected pages after second truncate
shmem_truncate_range has its own truncate_inode_pages_range, to free any pages
racily instantiated while it was in progress: a SHMEM_PAGEIN flag is set when
this might have happened.  But holepunching gets no chance to clear that flag
at the start of vmtruncate_range, so it's always set (unless a truncate came
just before), so holepunch almost always does this second
truncate_inode_pages_range.

shmem holepunch has unlikely swap<->file races hereabouts whatever we do
(without a fuller rework than is fit for this release): I was going to skip
the second truncate in the punch_hole case, but Miklos points out that would
make holepunch correctness more vulnerable to swapoff.  So keep the second
truncate, but follow it by an unmap_mapping_range to eliminate the
disconnected pages (freed from pagecache while still mapped in userspace) that
it might have left behind.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: Miklos Szeredi <mszeredi@suse.cz>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-29 08:22:25 -07:00
Hugh Dickins 1ae7000630 [PATCH] holepunch: fix shmem_truncate_range punch locking
Miklos Szeredi observes that during truncation of shmem page directories,
info->lock is released to improve latency (after lowering i_size and
next_index to exclude races); but this is quite wrong for holepunching, which
receives no such protection from i_size or next_index, and is left vulnerable
to races with shmem_unuse, shmem_getpage and shmem_writepage.

Hold info->lock throughout when holepunching?  No, any user could prevent
rescheduling for far too long.  Instead take info->lock just when needed: in
shmem_free_swp when removing the swap entries, and whenever removing a
directory page from the level above.  But so long as we remove before
scanning, we can safely skip taking the lock at the lower levels, except at
misaligned start and end of the hole.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: Miklos Szeredi <mszeredi@suse.cz>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-29 08:22:25 -07:00
Hugh Dickins a2646d1e6c [PATCH] holepunch: fix shmem_truncate_range punching too far
Miklos Szeredi observes BUG_ON(!entry) in shmem_writepage() triggered in rare
circumstances, because shmem_truncate_range() erroneously removes partially
truncated directory pages at the end of the range: later reclaim on pages
pointing to these removed directories triggers the BUG.  Indeed, and it can
also cause data loss beyond the hole.

Fix this as in the patch proposed by Miklos, but distinguish between "limit"
(how far we need to search: ignore truncation's next_index optimization in the
holepunch case - if there are races it's more consistent to act on the whole
range specified) and "upper_limit" (how far we can free directory pages:
generally we must be careful to keep partially punched pages, but can relax at
end of file - i_size being held stable by i_mutex).

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: Miklos Szeredi <mszeredi@suse.cs>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-29 08:22:25 -07:00
Vasily Tarasov f772b3d9ca block: blk_max_pfn is somtimes wrong
There is a small problem in handling page bounce.

At the moment blk_max_pfn equals max_pfn, which is in fact not maximum
possible _number_ of a page frame, but the _amount_ of page frames.  For
example for the 32bit x86 node with 4Gb RAM, max_pfn = 0x100000, but not
0xFFFF.

request_queue structure has a member q->bounce_pfn and queue needs bounce
pages for the pages _above_ this limit.  This routine is handled by
blk_queue_bounce(), where the following check is produced:

	if (q->bounce_pfn >= blk_max_pfn)
		return;

Assume, that a driver has set q->bounce_pfn to 0xFFFF, but blk_max_pfn
equals 0x10000.  In such situation the check above fails and for each bio
we always fall down for iterating over pages tied to the bio.

I want to notice, that for quite a big range of device drivers (ide, md,
...) such problem doesn't happen because they use BLK_BOUNCE_ANY for
bounce_pfn.  BLK_BOUNCE_ANY is defined as blk_max_pfn << PAGE_SHIFT, and
then the check above doesn't fail.  But for other drivers, which obtain
reuired value from drivers, it fails.  For example sata_nv uses
ATA_DMA_MASK or dev->dma_mask.

I propose to use (max_pfn - 1) for blk_max_pfn.  And the same for
blk_max_low_pfn.  The patch also cleanses some checks related with
bounce_pfn.

Signed-off-by: Vasily Tarasov <vtaras@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-03-27 08:52:47 +02:00
David Howells 165b239270 [PATCH] NOMMU: make SYSV SHM nattch work correctly
Make the SYSV SHM nattch counter work correctly by forcing multiple VMAs to
be produced to represent MAP_SHARED segments, even if they overlap exactly.

Using this test program:

	http://people.redhat.com/~dhowells/doshm.c

Run as:

	doshm sysv

I can see nattch going from one before the patch:

	# /doshm sysv
	Command: sysv
	shmid: 65536
	memory: 0xc3700000
	c0b00000-c0b04000 rw-p 00000000 00:00 0
	c0bb0000-c0bba788 r-xs 00000000 00:0b 14582157  /lib/ld-uClibc-0.9.28.so
	c3180000-c31dede4 r-xs 00000000 00:0b 14582179  /lib/libuClibc-0.9.28.so
	c3520000-c352278c rw-p 00000000 00:0b 13763417  /doshm
	c3584000-c35865e8 r-xs 00000000 00:0b 13763417  /doshm
	c3588000-c358aa00 rw-p 00008000 00:0b 14582157  /lib/ld-uClibc-0.9.28.so
	c3590000-c359b6c0 rw-p 00000000 00:00 0
	c3620000-c3640000 rwxp 00000000 00:00 0
	c3700000-c37fa000 rw-S 00000000 00:06 1411      /SYSV00000000 (deleted)
	c3700000-c37fa000 rw-S 00000000 00:06 1411      /SYSV00000000 (deleted)
	nattch 1

To two after the patch:

	# /doshm sysv
	Command: sysv
	shmid: 0
	memory: 0xc3700000
	c0bb0000-c0bba788 r-xs 00000000 00:0b 14582157  /lib/ld-uClibc-0.9.28.so
	c3180000-c31dede4 r-xs 00000000 00:0b 14582179  /lib/libuClibc-0.9.28.so
	c3320000-c3340000 rwxp 00000000 00:00 0
	c3530000-c35325e8 r-xs 00000000 00:0b 13763417  /doshm
	c3534000-c353678c rw-p 00000000 00:0b 13763417  /doshm
	c3538000-c353aa00 rw-p 00008000 00:0b 14582157  /lib/ld-uClibc-0.9.28.so
	c3590000-c359b6c0 rw-p 00000000 00:00 0
	c35a4000-c35a8000 rw-p 00000000 00:00 0
	c3700000-c37fa000 rw-S 00000000 00:06 1369      /SYSV00000000 (deleted)
	c3700000-c37fa000 rw-S 00000000 00:06 1369      /SYSV00000000 (deleted)
	nattch 2

That's +1 to nattch for each shmat() made.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-22 19:39:06 -07:00
David Howells d56e03cd27 [PATCH] NOMMU: supply get_unmapped_area() to fix NOMMU SYSV SHM
Supply a get_unmapped_area() to fix NOMMU SYSV SHM support.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Adam Litke <agl@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-22 19:39:05 -07:00
Ankita Garg 35ae834fa0 [PATCH] oom fix: prevent oom from killing a process with children/sibling unkillable
Looking at oom_kill.c, found that the intention to not kill the selected
process if any of its children/siblings has OOM_DISABLE set, is not being
met.

Signed-off-by: Ankita Garg <ankita@in.ibm.com>
Acked-by: Nick Piggin <npiggin@suse.de>
Acked-by: William Irwin <wli@holomorphy.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-16 19:25:06 -07:00
Peter Zijlstra 89a09141df [PATCH] nfs: fix congestion control
The current NFS client congestion logic is severly broken, it marks the
backing device congested during each nfs_writepages() call but doesn't
mirror this in nfs_writepage() which makes for deadlocks.  Also it
implements its own waitqueue.

Replace this by a more regular congestion implementation that puts a cap on
the number of active writeback pages and uses the bdi congestion waitqueue.

Also always use an interruptible wait since it makes sense to be able to
SIGKILL the process even for mounts without 'intr'.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-16 19:25:05 -07:00
Zach Brown 65b8291c40 [PATCH] dio: invalidate clean pages before dio write
This patch fixes a user-triggerable oops that was reported by Leonid
Ananiev as archived at http://lkml.org/lkml/2007/2/8/337.

dio writes invalidate clean pages that intersect the written region so that
subsequent buffered reads go to disk to read the new data.  If this fails
the interface tries to tell the caller that the cache is inconsistent by
returning EIO.

Before this patch we had the problem where this invalidation failure would
clobber -EIOCBQUEUED as it made its way from fs/direct-io.c to fs/aio.c.
Both fs/aio.c and bio completion call aio_complete() and we reference freed
memory, usually oopsing.

This patch addresses this problem by invalidating before the write so that
we can cleanly return -EIO before ->direct_IO() has had a chance to return
-EIOCBQUEUED.

There is a compromise here.  During the dio write we can fault in mmap()ed
pages which intersect the written range with get_user_pages() if the user
provided them for the source buffer.  This is a crazy thing to do, but we
can make it mostly work in most cases by trying the invalidation again.
The compromise is that we won't return an error if this second invalidation
fails if it's an AIO write and we have -EIOCBQUEUED.

This was tested by having two processes race performing large O_DIRECT and
buffered ordered writes.  Within minutes ext3 would see a race between
ext3_releasepage() and jbd holding a reference on ordered data buffers and
would cause invalidation to fail, panicing the box.  The test can be found
in the 'aio_dio_bugs' test group in test.kernel.org/autotest.  After this
patch the test passes.

Signed-off-by: Zach Brown <zach.brown@oracle.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Cc: Leonid Ananiev <leonid.i.ananiev@linux.intel.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-16 19:25:04 -07:00
Nick Piggin 00e9fa2d64 [PATCH] mm: fix madvise infinine loop
madvise(MADV_REMOVE) can go into an infinite loop or cause an oops if the
call covers a region from the start of a vma, and extending past that vma.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Acked-by: Hugh Dickins <hugh@veritas.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-16 19:25:04 -07:00
Christoph Lameter 0dc952dc3e [PATCH] Page migration: Fix vma flag checking
Currently we do not check for vma flags if sys_move_pages is called to move
individual pages.  If sys_migrate_pages is called to move pages then we
check for vm_flags that indicate a non migratable vma but that still
includes VM_LOCKED and we can migrate mlocked pages.

Extract the vma_migratable check from mm/mempolicy.c, fix it and put it
into migrate.h so that is can be used from both locations.

Problem was spotted by Lee Schermerhorn

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-05 07:57:51 -08:00
Hugh Dickins 759b9775c2 [PATCH] shmem and simple const super_operations
shmem's super_operations were missed from the recent const-ification;
and simple_fill_super()'s, which can share with get_sb_pseudo()'s.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Acked-by: Josef 'Jeff' Sipek <jsipek@cs.sunysb.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-05 07:57:51 -08:00
Trond Myklebust 7b965e0884 [PATCH] VM: invalidate_inode_pages2_range() should not exit early
Fix invalidate_inode_pages2_range() so that it does not immediately exit
just because a single page in the specified range could not be removed.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-01 14:53:39 -08:00
Oleg Nesterov 34bbd70405 [PATCH] adapt page_lock_anon_vma() to PREEMPT_RCU
page_lock_anon_vma() uses spin_lock() to block RCU.  This doesn't work with
PREEMPT_RCU, we have to do rcu_read_lock() explicitely.  Otherwise, it is
theoretically possible that slab returns anon_vma's memory to the system
before we do spin_unlock(&anon_vma->lock).

[ Hugh points out that this only matters for PREEMPT_RCU, which isn't merged
  yet, and may never be.  Regardless, this patch is conceptually the
  right thing to do, even if it doesn't matter at this point.  - Linus ]

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Acked-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-01 14:53:39 -08:00
Andrew Morton 232ea4d69d [PATCH] throttle_vm_writeout(): don't loop on GFP_NOFS and GFP_NOIO allocations
throttle_vm_writeout() is designed to wait for the dirty levels to subside.
But if the caller holds IO or FS locks, we might be holding up that writeout.

So change it to take a single nap to give other devices a chance to clean some
memory, then return.

Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Kumar Gala <galak@kernel.crashing.org>
Cc: Pete Zaitcev <zaitcev@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-01 14:53:38 -08:00
David Miller d1af65d13f [PATCH] Bug in MM_RB debugging
The code is seemingly trying to make sure that rb_next() brings us to
successive increasing vma entries.

But the two variables, prev and pend, used to perform these checks, are
never advanced.

Signed-off-by: David S. Miller <davem@davemloft.net>
Cc: Andrea Arcangeli <andrea@novell.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-01 14:53:38 -08:00
Nick Piggin 5409bae07a [PATCH] Rename PG_checked to PG_owner_priv_1
Rename PG_checked to PG_owner_priv_1 to reflect its availablilty as a
private flag for use by the owner/allocator of the page.  In the case of
pagecache pages (which might be considered to be owned by the mm),
filesystems may use the flag.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-01 14:53:37 -08:00
Randy Dunlap 05fb6bf0b2 [PATCH] kernel-doc fixes for 2.6.20-git15 (non-drivers)
Fix kernel-doc warnings in 2.6.20-git15 (lib/, mm/, kernel/, include/).

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-01 14:53:37 -08:00
Adrian Bunk 9b83a6a852 [PATCH] mm/{,tiny-}shmem.c cleanups
shmem_{nopage,mmap} are no longer used in ipc/shm.c

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-01 14:53:35 -08:00
Christoph Lameter 8ef8286689 [PATCH] slab: reduce size of alien cache to cover only possible nodes
The alien cache is a per cpu per node array allocated for every slab on the
system.  Currently we size this array for all nodes that the kernel does
support.  For IA64 this is 1024 nodes.  So we allocate an array with 1024
objects even if we only boot a system with 4 nodes.

This patch uses "nr_node_ids" to determine the number of possible nodes
supported by a hardware configuration and only allocates an alien cache
sized for possible nodes.

The initialization of nr_node_ids occurred too late relative to the bootstrap
of the slab allocator and so I moved the setup_nr_node_ids() into
free_area_init_nodes().

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-20 17:10:13 -08:00
Christoph Lameter 74c7aa8b85 [PATCH] Replace highest_possible_node_id() with nr_node_ids
highest_possible_node_id() is currently used to calculate the last possible
node idso that the network subsystem can figure out how to size per node
arrays.

I think having the ability to determine the maximum amount of nodes in a
system at runtime is useful but then we should name this entry
correspondingly, it should return the number of node_ids, and the the value
needs to be setup only once on bootup.  The node_possible_map does not
change after bootup.

This patch introduces nr_node_ids and replaces the use of
highest_possible_node_id().  nr_node_ids is calculated on bootup when the
page allocators pagesets are initialized.

[deweerdt@free.fr: fix oops]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Frederik Deweerdt <frederik.deweerdt@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-20 17:10:13 -08:00
KAMEZAWA Hiroyuki 8af5e2eb3c [PATCH] fix mempolicy's check on a system with memory-less-node
bind_zonelist() can create zero-length zonelist if there is a
memory-less-node.  This patch checks the length of zonelist.  If length is
0, returns -EINVAL.

tested on ia64/NUMA with memory-less-node.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Andi Kleen <ak@suse.de>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-20 17:10:13 -08:00
NeilBrown 29dbb3fc80 [PATCH] knfsd: stop NFSD writes from being broken into lots of little writes to filesystem
When NFSD receives a write request, the data is typically in a number of
1448 byte segments and writev is used to collect them together.

Unfortunately, generic_file_buffered_write passes these to the filesystem
one at a time, so an e.g.  32K over-write becomes a series of partial-page
writes to each page, causing the filesystem to have to pre-read those pages
- wasted effort.

generic_file_buffered_write handles one segment of the vector at a time as
it has to pre-fault in each segment to avoid deadlocks.  When writing from
kernel-space (and nfsd does) this is not an issue, so
generic_file_buffered_write does not need to break and iovec from nfsd into
little pieces.

This patch avoids the splitting when  get_fs is KERNEL_DS as it is
from NFSd.

This issue was introduced by commit 6527c2bdf1

Acked-by: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Norman Weathers <norman.r.weathers@conocophillips.com>
Cc: Vladimir V. Saveliev <vs@namesys.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-16 08:14:01 -08:00
Nick Piggin e0a04cffa4 [PATCH] mincore: vma crossing fix
My mincore also forgot about crossing vmas.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-15 09:57:03 -08:00
Nick Piggin 4a76ef036a [PATCH] mincore: fill in results properly
Paper bag time. Thanks to Randy for noticing that I didn't actually assign
'present' to anything.

Unfortunately my original patch passed the few simple test cases I gave it,
purely by coincidence.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-15 09:57:03 -08:00
Nick Piggin 30fcffed81 [PATCH] mincore: CONFIG_SWAP=n fix
Fix mincore-anon patch to compile with CONFIG_SWAP=n

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-15 09:57:03 -08:00
Arjan van de Ven 92e1d5be91 [PATCH] mark struct inode_operations const 2
Many struct inode_operations in the kernel can be "const".  Marking them const
moves these to the .rodata section, which avoids false sharing with potential
dirty data.  In addition it'll catch accidental writes at compile time to
these shared resources.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12 09:48:46 -08:00
Nick Piggin 42da9cbd3e [PATCH] mm: mincore anon
Make mincore work for anon mappings, nonlinear, and migration entries.
Based on patch from Linus Torvalds <torvalds@linux-foundation.org>.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12 09:48:27 -08:00
Benjamin Herrenschmidt 22cd25ed31 [PATCH] Add NOPFN_REFAULT result from vm_ops->nopfn()
Add a NOPFN_REFAULT return code for vm_ops->nopfn() equivalent to
NOPAGE_REFAULT for vmops->nopage() indicating that the handler requests a
re-execution of the faulting instruction

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Arnd Bergmann <arnd.bergmann@de.ibm.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12 09:48:27 -08:00
Nick Piggin e0dc0d8f4a [PATCH] add vm_insert_pfn()
Add a vm_insert_pfn helper, so that ->fault handlers can have nopfn
functionality by installing their own pte and returning NULL.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Arnd Bergmann <arnd.bergmann@de.ibm.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12 09:48:27 -08:00
Paul E. McKenney aa0f030374 [PATCH] Change constant zero to NOTIFY_DONE in ratelimit_handler()
Change a hard-coded constant 0 to the symbolic equivalent NOTIFY_DONE in
the ratelimit_handler() CPU notifier handler function.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 11:18:07 -08:00
Robert P. J. Day 72fd4a35a8 [PATCH] Numerous fixes to kernel-doc info in source files.
A variety of (mostly) innocuous fixes to the embedded kernel-doc content in
source files, including:

  * make multi-line initial descriptions single line
  * denote some function names, constants and structs as such
  * change erroneous opening '/*' to '/**' in a few places
  * reword some text for clarity

Signed-off-by: Robert P. J. Day <rpjday@mindspring.com>
Cc: "Randy.Dunlap" <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:32 -08:00
Andrew Morton fc0ecff698 [PATCH] remove invalidate_inode_pages()
Convert all calls to invalidate_inode_pages() into open-coded calls to
invalidate_mapping_pages().

Leave the invalidate_inode_pages() wrapper in place for now, marked as
deprecated.

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:31 -08:00
Anton Altaparmakov 54bc485522 [PATCH] Export invalidate_mapping_pages() to modules
It makes no sense to me to export invalidate_inode_pages() and not
invalidate_mapping_pages() and I actually need invalidate_mapping_pages()
because of its range specification ability...

akpm: also remove the export of invalidate_inode_pages() by making it an
inlined wrapper.

Signed-off-by: Anton Altaparmakov <aia21@cantab.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:30 -08:00
Ingo Molnar 898552c9d8 [PATCH] lockdep: also check for freed locks in kmem_cache_free()
kmem_cache_free() was missing the check for freeing held locks.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:26 -08:00
Ken Chen daa88c8d21 [PATCH] do not disturb page referenced state when unmapping memory range
When kernel unmaps an address range, it needs to transfer PTE state into
page struct.  Currently, kernel transfer access bit via
mark_page_accessed().  The call to mark_page_accessed in the unmap path
doesn't look logically correct.

At unmap time, calling mark_page_accessed will causes page LRU state to be
bumped up one step closer to more recently used state.  It is causing quite
a bit headache in a scenario when a process creates a shmem segment, touch
a whole bunch of pages, then unmaps it.  The unmapping takes a long time
because mark_page_accessed() will start moving pages from inactive to
active list.

I'm not too much concerned with moving the page from one list to another in
LRU.  Sooner or later it might be moved because of multiple mappings from
various processes.  But it just doesn't look logical that when user asks a
range to be unmapped, it's his intention that the process is no longer
interested in these pages.  Moving those pages to active list (or bumping
up a state towards more active) seems to be an over reaction.  It also
prolongs unmapping latency which is the core issue I'm trying to solve.

As suggested by Peter, we should still preserve the info on pte young
pages, but not more.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Ken Chen <kenchen@google.com>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:19 -08:00
Ken Chen 767193253b [PATCH] simplify shmem_aops.set_page_dirty() method
shmem backed file does not have page writeback, nor it participates in
backing device's dirty or writeback accounting.  So using generic
__set_page_dirty_nobuffers() for its .set_page_dirty aops method is a bit
overkill.  It unnecessarily prolongs shm unmap latency.

For example, on a densely populated large shm segment (sevearl GBs), the
unmapping operation becomes painfully long.  Because at unmap, kernel
transfers dirty bit in PTE into page struct and to the radix tree tag.  The
operation of tagging the radix tree is particularly expensive because it
has to traverse the tree from the root to the leaf node on every dirty
page.  What's bothering is that radix tree tag is used for page write back.
 However, shmem is memory backed and there is no page write back for such
file system.  And in the end, we spend all that time tagging radix tree and
none of that fancy tagging will be used.  So let's simplify it by introduce
a new aops __set_page_dirty_no_writeback and this will speed up shm unmap.

Signed-off-by: Ken Chen <kenchen@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:19 -08:00
Christoph Lameter 5ac6da669e [PATCH] Set CONFIG_ZONE_DMA for arches with GENERIC_ISA_DMA
As Andi pointed out: CONFIG_GENERIC_ISA_DMA only disables the ISA DMA
channel management.  Other functionality may still expect GFP_DMA to
provide memory below 16M.  So we need to make sure that CONFIG_ZONE_DMA is
set independent of CONFIG_GENERIC_ISA_DMA.  Undo the modifications to
mm/Kconfig where we made ZONE_DMA dependent on GENERIC_ISA_DMA and set
theses explicitly in each arches Kconfig.

Reviews must occur for each arch in order to determine if ZONE_DMA can be
switched off.  It can only be switched off if we know that all devices
supported by a platform are capable of performing DMA transfers to all of
memory (Some arches already support this: uml, avr32, sh sh64, parisc and
IA64/Altix).

In order to switch ZONE_DMA off conditionally, one would have to establish
a scheme by which one can assure that no drivers are enabled that are only
capable of doing I/O to a part of memory, or one needs to provide an
alternate means of performing an allocation from a specific range of memory
(like provided by alloc_pages_range()) and insure that all drivers use that
call.  In that case the arches alloc_dma_coherent() may need to be modified
to call alloc_pages_range() instead of relying on GFP_DMA.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:19 -08:00
Christoph Lameter 4b51d66989 [PATCH] optional ZONE_DMA: optional ZONE_DMA in the VM
Make ZONE_DMA optional in core code.

- ifdef all code for ZONE_DMA and related definitions following the example
  for ZONE_DMA32 and ZONE_HIGHMEM.

- Without ZONE_DMA, ZONE_HIGHMEM and ZONE_DMA32 we get to a ZONES_SHIFT of
  0.

- Modify the VM statistics to work correctly without a DMA zone.

- Modify slab to not create DMA slabs if there is no ZONE_DMA.

[akpm@osdl.org: cleanup]
[jdike@addtoit.com: build fix]
[apw@shadowen.org: Simplify calculation of the number of bits we need for ZONES_SHIFT]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Andi Kleen <ak@suse.de>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Matthew Wilcox <willy@debian.org>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Jeff Dike <jdike@addtoit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:18 -08:00
Christoph Lameter 66701b1499 [PATCH] optional ZONE_DMA: introduce CONFIG_ZONE_DMA
This patch simply defines CONFIG_ZONE_DMA for all arches.  We later do special
things with CONFIG_ZONE_DMA after the VM and an arch are prepared to work
without ZONE_DMA.

CONFIG_ZONE_DMA can be defined in two ways depending on how an architecture
handles ISA DMA.

First if CONFIG_GENERIC_ISA_DMA is set by the arch then we know that the arch
needs ZONE_DMA because ISA DMA devices are supported.  We can catch this in
mm/Kconfig and do not need to modify arch code.

Second, arches may use ZONE_DMA in an unknown way.  We set CONFIG_ZONE_DMA for
all arches that do not set CONFIG_GENERIC_ISA_DMA in order to insure backwards
compatibility.  The arches may later undefine ZONE_DMA if their arch code has
been verified to not depend on ZONE_DMA.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Andi Kleen <ak@suse.de>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Matthew Wilcox <willy@debian.org>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:18 -08:00
Christoph Lameter 6267276f3f [PATCH] optional ZONE_DMA: deal with cases of ZONE_DMA meaning the first zone
This patchset follows up on the earlier work in Andrew's tree to reduce the
number of zones.  The patches allow to go to a minimum of 2 zones.  This one
allows also to make ZONE_DMA optional and therefore the number of zones can be
reduced to one.

ZONE_DMA is usually used for ISA DMA devices.  There are a number of reasons
why we would not want to have ZONE_DMA

1. Some arches do not need ZONE_DMA at all.

2. With the advent of IOMMUs DMA zones are no longer needed.
   The necessity of DMA zones may drastically be reduced
   in the future. This patchset allows a compilation of
   a kernel without that overhead.

3. Devices that require ISA DMA get rare these days. All
   my systems do not have any need for ISA DMA.

4. The presence of an additional zone unecessarily complicates
   VM operations because it must be scanned and balancing
   logic must operate on its.

5. With only ZONE_NORMAL one can reach the situation where
   we have only one zone. This will allow the unrolling of many
   loops in the VM and allows the optimization of varous
   code paths in the VM.

6. Having only a single zone in a NUMA system results in a
   1-1 correspondence between nodes and zones. Various additional
   optimizations to critical VM paths become possible.

Many systems today can operate just fine with a single zone.  If you look at
what is in ZONE_DMA then one usually sees that nothing uses it.  The DMA slabs
are empty (Some arches use ZONE_DMA instead of ZONE_NORMAL, then ZONE_NORMAL
will be empty instead).

On all of my systems (i386, x86_64, ia64) ZONE_DMA is completely empty.  Why
constantly look at an empty zone in /proc/zoneinfo and empty slab in
/proc/slabinfo?  Non i386 also frequently have no need for ZONE_DMA and zones
stay empty.

The patchset was tested on i386 (UP / SMP), x86_64 (UP, NUMA) and ia64 (NUMA).

The RFC posted earlier (see
http://marc.theaimsgroup.com/?l=linux-kernel&m=115231723513008&w=2) had lots
of #ifdefs in them.  An effort has been made to minize the number of #ifdefs
and make this as compact as possible.  The job was made much easier by the
ongoing efforts of others to extract common arch specific functionality.

I have been running this for awhile now on my desktop and finally Linux is
using all my available RAM instead of leaving the 16MB in ZONE_DMA untouched:

christoph@pentium940:~$ cat /proc/zoneinfo
Node 0, zone   Normal
  pages free     4435
        min      1448
        low      1810
        high     2172
        active   241786
        inactive 210170
        scanned  0 (a: 0 i: 0)
        spanned  524224
        present  524224
    nr_anon_pages 61680
    nr_mapped    14271
    nr_file_pages 390264
    nr_slab_reclaimable 27564
    nr_slab_unreclaimable 1793
    nr_page_table_pages 449
    nr_dirty     39
    nr_writeback 0
    nr_unstable  0
    nr_bounce    0
    cpu: 0 pcp: 0
              count: 156
              high:  186
              batch: 31
    cpu: 0 pcp: 1
              count: 9
              high:  62
              batch: 15
  vm stats threshold: 20
    cpu: 1 pcp: 0
              count: 177
              high:  186
              batch: 31
    cpu: 1 pcp: 1
              count: 12
              high:  62
              batch: 15
  vm stats threshold: 20
  all_unreclaimable: 0
  prev_priority:     12
  temp_priority:     12
  start_pfn:         0

This patch:

In two places in the VM we use ZONE_DMA to refer to the first zone.  If
ZONE_DMA is optional then other zones may be first.  So simply replace
ZONE_DMA with zone 0.

This also fixes ZONETABLE_PGSHIFT.  If we have only a single zone then
ZONES_PGSHIFT may become 0 because there is no need anymore to encode the zone
number related to a pgdat.  However, we still need a zonetable to index all
the zones for each node if this is a NUMA system.  Therefore define
ZONETABLE_SHIFT unconditionally as the offset of the ZONE field in page flags.

[apw@shadowen.org: fix mismerge]
Acked-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Andi Kleen <ak@suse.de>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Matthew Wilcox <willy@debian.org>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:18 -08:00
Christoph Lameter 65e458d43d [PATCH] Drop get_zone_counts()
Values are available via ZVC sums.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:18 -08:00
Christoph Lameter 05a0416be2 [PATCH] Drop __get_zone_counts()
Values are readily available via ZVC per node and global sums.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:18 -08:00
Christoph Lameter 9195481d2f [PATCH] Drop nr_free_pages_pgdat()
Function is unnecessary now.  We can use the summing features of the ZVCs to
get the values we need.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:18 -08:00
Christoph Lameter 9617729941 [PATCH] Drop free_pages()
nr_free_pages is now a simple access to a global variable.  Make it a macro
instead of a function.

The nr_free_pages now requires vmstat.h to be included.  There is one
occurrence in power management where we need to add the include.  Directly
refrer to global_page_state() there to clarify why the #include was added.

[akpm@osdl.org: arm build fix]
[akpm@osdl.org: sparc64 build fix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:18 -08:00
Christoph Lameter 51ed449127 [PATCH] Reorder ZVCs according to cacheline
The global and per zone counter sums are in arrays of longs.  Reorder the ZVCs
so that the most frequently used ZVCs are put into the same cacheline.  That
way calculations of the global, node and per zone vm state touches only a
single cacheline.  This is mostly important for 64 bit systems were one 128
byte cacheline takes only 8 longs.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:17 -08:00
Christoph Lameter d23ad42324 [PATCH] Use ZVC for free_pages
This is again simplifies some of the VM counter calculations through the use
of the ZVC consolidated counters.

[michal.k.k.piotrowski@gmail.com: build fix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Michal Piotrowski <michal.k.k.piotrowski@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:17 -08:00
Christoph Lameter c878538598 [PATCH] Use ZVC for inactive and active counts
The determination of the dirty ratio to determine writeback behavior is
currently based on the number of total pages on the system.

However, not all pages in the system may be dirtied.  Thus the ratio is always
too low and can never reach 100%.  The ratio may be particularly skewed if
large hugepage allocations, slab allocations or device driver buffers make
large sections of memory not available anymore.  In that case we may get into
a situation in which f.e.  the background writeback ratio of 40% cannot be
reached anymore which leads to undesired writeback behavior.

This patchset fixes that issue by determining the ratio based on the actual
pages that may potentially be dirty.  These are the pages on the active and
the inactive list plus free pages.

The problem with those counts has so far been that it is expensive to
calculate these because counts from multiple nodes and multiple zones will
have to be summed up.  This patchset makes these counters ZVC counters.  This
means that a current sum per zone, per node and for the whole system is always
available via global variables and not expensive anymore to calculate.

The patchset results in some other good side effects:

- Removal of the various functions that sum up free, active and inactive
  page counts

- Cleanup of the functions that display information via the proc filesystem.

This patch:

The use of a ZVC for nr_inactive and nr_active allows a simplification of some
counter operations.  More ZVC functionality is used for sums etc in the
following patches.

[akpm@osdl.org: UP build fix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:17 -08:00
Hugh Dickins c3704ceb4a [PATCH] page_mkwrite caller race fix
After do_wp_page has tested page_mkwrite, it must release old_page after
acquiring page table lock, not before: at some stage that ordering got
reversed, leaving a (very unlikely) window in which old_page might be
truncated, freed, and reused in the same position.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Acked-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:17 -08:00
Andrew Morton 5a88a13d06 [PATCH] /proc/zoneinfo: fix vm stats display
This early break prevents us from displaying info for the vm stats thresholds
if the zone doesn't have any pages in its per-cpu pagesets.

So my 800MB i386 box says:

Node 0, zone      DMA
  pages free     2365
        min      16
        low      20
        high     24
        active   0
        inactive 0
        scanned  0 (a: 0 i: 0)
        spanned  4096
        present  4044
    nr_anon_pages 0
    nr_mapped    1
    nr_file_pages 0
    nr_slab_reclaimable 0
    nr_slab_unreclaimable 0
    nr_page_table_pages 0
    nr_dirty     0
    nr_writeback 0
    nr_unstable  0
    nr_bounce    0
    nr_vmscan_write 0
        protection: (0, 868, 868)
  pagesets
  all_unreclaimable: 0
  prev_priority:     12
  start_pfn:         0
Node 0, zone   Normal
  pages free     199713
        min      934
        low      1167
        high     1401
        active   10215
        inactive 4507
        scanned  0 (a: 0 i: 0)
        spanned  225280
        present  222420
    nr_anon_pages 2685
    nr_mapped    1110
    nr_file_pages 12055
    nr_slab_reclaimable 2216
    nr_slab_unreclaimable 1527
    nr_page_table_pages 213
    nr_dirty     0
    nr_writeback 0
    nr_unstable  0
    nr_bounce    0
    nr_vmscan_write 0
        protection: (0, 0, 0)
  pagesets
    cpu: 0 pcp: 0
              count: 152
              high:  186
              batch: 31
    cpu: 0 pcp: 1
              count: 13
              high:  62
              batch: 15
  vm stats threshold: 16
    cpu: 1 pcp: 0
              count: 34
              high:  186
              batch: 31
    cpu: 1 pcp: 1
              count: 10
              high:  62
              batch: 15
  vm stats threshold: 16
  all_unreclaimable: 0
  prev_priority:     12
  start_pfn:         4096

Just nuke all that search-for-the-first-non-empty-pageset code.  Dunno why it
was there in the first place..

Cc: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:17 -08:00
Mel Gorman a6af2bc3d5 [PATCH] Avoid excessive sorting of early_node_map[]
find_min_pfn_for_node() and find_min_pfn_with_active_regions() sort
early_node_map[] on every call.  This is an excessive amount of sorting and
that can be avoided.  This patch always searches the whole early_node_map[]
in find_min_pfn_for_node() instead of returning the first value found.  The
map is then only sorted once when required.  Successfully boot tested on a
number of machines.

[akpm@osdl.org: cleanup]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:17 -08:00
Christoph Lameter 7c5cae368a [PATCH] slab: use parameter passed to cache_reap to determine pointer to work structure
Use the pointer passed to cache_reap to determine the work pointer and
consolidate exit paths.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:17 -08:00
Pekka Enberg 8c8cc2c10c [PATCH] slab: cache alloc cleanups
Clean up __cache_alloc and __cache_alloc_node functions a bit.  We no
longer need to do NUMA_BUILD tricks and the UMA allocation path is much
simpler.  No functional changes in this patch.

Note: saves few kernel text bytes on x86 NUMA build due to using gotos in
__cache_alloc_node() and moving __GFP_THISNODE check in to
fallback_alloc().

Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Manfred Spraul <manfred@colorfullife.com>
Acked-by: Christoph Lameter <christoph@lameter.com>
Cc: Paul Jackson <pj@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:16 -08:00
Pekka Enberg 6e40e73097 [PATCH] slab: remove broken PageSlab check from kfree_debugcheck
The PageSlab debug check in kfree_debugcheck() is broken for compound
pages.  It is also redundant as we already do BUG_ON for non-slab pages in
page_get_cache() and page_get_slab() which are always called before we free
any actual objects.

Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:16 -08:00
Roland McGrath fa5dc22f85 [PATCH] Add install_special_mapping
This patch adds a utility function install_special_mapping, for creating a
special vma using a fixed set of preallocated pages as backing, such as for a
vDSO.  This consolidates some nearly identical code used for vDSO mapping
reimplemented for different architectures.

Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-09 09:25:47 -08:00
Andrew Morton a25700a53f [PATCH] mm: show bounce pages in oom killer output
Also split that long line up - people like to send us wordwrapped oom-kill
traces.

Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-09 09:25:47 -08:00
Ken Chen 6649a38632 [PATCH] hugetlb: preserve hugetlb pte dirty state
__unmap_hugepage_range() is buggy that it does not preserve dirty state of
huge_pte when unmapping hugepage range.  It causes data corruption in the
event of dop_caches being used by sys admin.  For example, an application
creates a hugetlb file, modify pages, then unmap it.  While leaving the
hugetlb file alive, comes along sys admin doing a "echo 3 >
/proc/sys/vm/drop_caches".

drop_pagecache_sb() will happily free all pages that aren't marked dirty if
there are no active mapping.  Later when application remaps the hugetlb
file back and all data are gone, triggering catastrophic flip over on
application.

Not only that, the internal resv_huge_pages count will also get all messed
up.  Fix it up by marking page dirty appropriately.

Signed-off-by: Ken Chen <kenchen@google.com>
Cc: "Nish Aravamudan" <nish.aravamudan@gmail.com>
Cc: Adam Litke <agl@us.ibm.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: <stable@kernel.org>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-09 09:25:46 -08:00
Nick Piggin 62045305c2 [PATCH] mm: remove find_trylock_page
Remove find_trylock_page as per the removal schedule.

Signed-off-by: Nick Piggin <npiggin@suse.de>
[ Let's see if anybody screams ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-09 08:06:14 -08:00
Linus Torvalds 6fd6b17c6d Revert "[PATCH] mm: micro optimise zone_watermark_ok"
This reverts commit e80ee884ae.

Pawel Sikora had a boot-time oops due to it - because the sign change
invalidates the following comparisons, since 'free_pages' can be
negative.

The micro-optimization just isn't worth it.

Bisected-by: Pawel Sikora <pluto@agmk.net>
Acked-by: Andrew Morton <akpm@osdl.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-01-31 16:46:40 -08:00
Adam Litke 0d59a01bc4 [PATCH] Don't allow the stack to grow into hugetlb reserved regions
When expanding the stack, we don't currently check if the VMA will cross
into an area of the address space that is reserved for hugetlb pages.
Subsequent faults on the expanded portion of such a VMA will confuse the
low-level MMU code, resulting in an OOPS.  Check for this.

Signed-off-by: Adam Litke <agl@us.ibm.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-01-30 16:01:35 -08:00
Hugh Dickins 701dfbc1cb [PATCH] mm: mremap correct rmap accounting
Nick Piggin points out that page accounting on MIPS multiple ZERO_PAGEs
is not maintained by its move_pte, and could lead to freeing a ZERO_PAGE.

Instead of complicating that move_pte, just forget the minor optimization
when mremapping, and change the one thing which needed it for correctness
- filemap_xip use ZERO_PAGE(0) throughout instead of according to address.

[ "There is no block device driver one could use for XIP on mips
   platforms" - Carsten Otte ]

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Andrew Morton <akpm@osdl.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-01-30 08:33:32 -08:00
Linus Torvalds dc6e29da91 Fix balance_dirty_page() calculations with CONFIG_HIGHMEM
This makes balance_dirty_page() always base its calculations on the
amount of non-highmem memory in the machine, rather than try to base it
on total memory and then falling back on non-highmem memory if the
mapping it was writing wasn't highmem capable.

This not only fixes a situation where two different writers can have
wildly different notions about what is a "balanced" dirty state, but it
also means that people with highmem machines don't run into an OOM
situation when regular memory fills up with dirty pages.

We used to try to handle the latter case by scaling down the dirty_ratio
if the machine had a lot of highmem pages in page_writeback_init(), but
it wasn't aggressive enough for some situations, and since basing the
dirty ratio on highmem memory was broken in the first place, let's just
stop doing so.

(A variation of this theme fixed Justin Piszcz's OOM problem when
copying an 18GB file on a RAID setup).

Acked-by: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Justin Piszcz <jpiszcz@lucidpixels.com>
Cc: Andrew Morton <akpm@osdl.org>
Cc: Neil Brown <neilb@suse.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-01-29 16:37:38 -08:00
Trond Myklebust 569d3287c1 [PATCH] MM: Remove [PATCH] invalidate_inode_pages2_range() debug
NFS can handle the case where invalidate_inode_pages2_range() fails, so the
premise behind commit 8258d4a574 is now gone.

Remove the WARN_ON_ONCE() which is causing users grief as we can see from
http://bugzilla.kernel.org/show_bug.cgi?id=7826

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-01-26 13:51:00 -08:00
Roland McGrath f47aef55d9 [PATCH] i386 vDSO: use VM_ALWAYSDUMP
This patch fixes core dumps to include the vDSO vma, which is left out now.
It removes the special-case core writing macros, which were not doing the
right thing for the vDSO vma anyway.  Instead, it uses VM_ALWAYSDUMP in the
vma; there is no need for the fixmap page to be installed.  It handles the
CONFIG_COMPAT_VDSO case by making elf_core_dump use the fake vma from
get_gate_vma after real vmas in the same way the /proc/PID/maps code does.

This changes core dumps so they no longer include the non-PT_LOAD phdrs from
the vDSO.  I made the change to add them in the first place, but in turned out
that nothing ever wanted them there since the advent of NT_AUXV.  It's cleaner
to leave them out, and just let the phdrs inside the vDSO image speak for
themselves.

Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-01-26 13:50:58 -08:00
Roland McGrath b6558c4a23 [PATCH] Fix gate_vma.vm_flags
This patch fixes the initialization of gate_vma.vm_flags and
gate_vma.vm_page_prot to reflect reality.  This makes the "[vdso]" line in
/proc/PID/maps correctly show r-xp instead of ---p, when gate_vma is used
(CONFIG_COMPAT_VDSO on i386).

Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-01-26 13:50:58 -08:00
Linus Torvalds ecdfc9787f Resurrect 'try_to_free_buffers()' VM hackery
It's not pretty, but it appears that ext3 with data=journal will clean
pages without ever actually telling the VM that they are clean.  This,
in turn, will result in the VM (and balance_dirty_pages() in particular)
to never realize that the pages got cleaned, and wait forever for an
event that already happened.

Technically, this seems to be a problem with ext3 itself, but it used to
be hidden by 'try_to_free_buffers()' noticing this situation on its own,
and just working around the filesystem problem.

This commit re-instates that hack, in order to avoid a regression for
the 2.6.20 release. This fixes bugzilla 7844:

	http://bugzilla.kernel.org/show_bug.cgi?id=7844

Peter Zijlstra points out that we should probably retain the debugging
code that this removes from cancel_dirty_page(), and I agree, but for
the imminent release we might as well just silence the warning too
(since it's not a new bug: anything that triggers that warning has been
around forever).

Acked-by: Randy Dunlap <rdunlap@xenotime.net>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-01-26 12:47:06 -08:00
Christoph Lameter 30150f8d7b [PATCH] mbind: restrict nodes to the currently allowed cpuset
Currently one can specify an arbitrary node mask to mbind that includes
nodes not allowed.  If that is done with an interleave policy then we will
go around all the nodes.  Those outside of the currently allowed cpuset
will be redirected to the border nodes.  Interleave will then create
imbalances at the borders of the cpuset.

This patch restricts the nodes to the currently allowed cpuset.

The RFC for this patch was discussed at
http://marc.theaimsgroup.com/?t=116793842100004&r=1&w=2

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-01-23 07:52:06 -08:00
Jens Axboe c43a5082a6 [PATCH] blktrace: only add a bounce trace when we really bounce
Currently we issue a bounce trace when __blk_queue_bounce() is called,
but that merely means that the device has a lower dma mask than the
higher pages in the system. The bio itself may still be lower pages. So
move the bounce trace into __blk_queue_bounce(), when we know there will
actually be page bouncing.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2007-01-12 10:46:49 -08:00
Trond Myklebust e3db7691e9 [PATCH] NFS: Fix race in nfs_release_page()
NFS: Fix race in nfs_release_page()

    invalidate_inode_pages2() may find the dirty bit has been set on a page
    owing to the fact that the page may still be mapped after it was locked.
    Only after the call to unmap_mapping_range() are we sure that the page
    can no longer be dirtied.
    In order to fix this, NFS has hooked the releasepage() method and tries
    to write the page out between the call to unmap_mapping_range() and the
    call to remove_mapping(). This, however leads to deadlocks in the page
    reclaim code, where the page may be locked without holding a reference
    to the inode or dentry.

    Fix is to add a new address_space_operation, launder_page(), which will
    attempt to write out a dirty page without releasing the page lock.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

    Also, the bare SetPageDirty() can skew all sort of accounting leading to
    other nasties.

[akpm@osdl.org: cleanup]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2007-01-11 18:18:21 -08:00
Dave Hansen a2f3aa0257 [PATCH] Fix sparsemem on Cell
Fix an oops experienced on the Cell architecture when init-time functions,
early_*(), are called at runtime.  It alters the call paths to make sure
that the callers explicitly say whether the call is being made on behalf of
a hotplug even, or happening at boot-time.

It has been compile tested on ppc64, ia64, s390, i386 and x86_64.

Acked-by: Arnd Bergmann <arndb@de.ibm.com>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Acked-by: Andy Whitcroft <apw@shadowen.org>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2007-01-11 18:18:20 -08:00
Linus Torvalds 74bda9310f Merge master.kernel.org:/home/rmk/linux-2.6-arm
* master.kernel.org:/home/rmk/linux-2.6-arm:
  [ARM] Provide basic printk_clock() implementation
  [ARM] Resolve fuse and direct-IO failures due to missing cache flushes
  [ARM] pass vma for flush_anon_page()
  [ARM] Fix potential MMCI bug
  [ARM] Fix kernel-mode undefined instruction aborts
  [ARM] 4082/1: iop3xx: fix iop33x gpio register offset
  [ARM] 4070/1: arch/arm/kernel: fix warnings from missing includes
  [ARM] 4079/1: iop: Update MAINTAINERS
2007-01-08 15:06:39 -08:00
Russell King a6f36be326 [ARM] pass vma for flush_anon_page()
Since get_user_pages() may be used with processes other than the
current process and calls flush_anon_page(), flush_anon_page() has to
cope in some way with non-current processes.

It may not be appropriate, or even desirable to flush a region of
virtual memory cache in the current process when that is different to
the process that we want the flush to occur for.

Therefore, pass the vma into flush_anon_page() so that the architecture
can work out whether the 'vmaddr' is for the current process or not.

Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2007-01-08 19:49:54 +00:00
Andrew Morton 76395d3761 [PATCH] shrink_all_memory(): fix lru_pages handling
At the end of shrink_all_memory() we forget to recalculate lru_pages: it can
be zero.

Fix that up, and add a helper function for this operation too.

Also, recalculate lru_pages each time around the inner loop to get the
balancing correct.

Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2007-01-05 23:55:29 -08:00
Hugh Dickins 7ba3485947 [PATCH] fix OOM killing of swapoff
These days, if you swapoff when there isn't enough memory, OOM killer gives
"BUG: scheduling while atomic" and the machine hangs: badness() needs to do
its PF_SWAPOFF return after the task_unlock (tasklist_lock is also held
here, so p isn't going to be freed: PF_SWAPOFF might get turned off at any
moment, but that doesn't really matter).

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2007-01-05 23:55:29 -08:00
Christoph Lameter f2e12bb272 [PATCH] Check for populated zone in __drain_pages
Both process_zones() and drain_node_pages() check for populated zones
before touching pagesets.  However, __drain_pages does not do so,

This may result in a NULL pointer dereference for pagesets in unpopulated
zones if a NUMA setup is combined with cpu hotplug.

Initially the unpopulated zone has the pcp pointers pointing to the boot
pagesets.  Since the zone is not populated the boot pageset pointers will
not be changed during page allocator and slab bootstrap.

If a cpu is later brought down (first call to __drain_pages()) then the pcp
pointers for cpus in unpopulated zones are set to NULL since __drain_pages
does not first check for an unpopulated zone.

If the cpu is then brought up again then we call process_zones() which will
ignore the unpopulated zone.  So the pageset pointers will still be NULL.

If the cpu is then again brought down then __drain_pages will attempt to
drain pages by following the NULL pageset pointer for unpopulated zones.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2007-01-05 23:55:29 -08:00
Hugh Dickins b6a6045181 [PATCH] fix BUG_ON(!PageSlab) from fallback_alloc
pdflush hit the BUG_ON(!PageSlab(page)) in kmem_freepages called from
fallback_alloc: cache_grow already freed those pages when alloc_slabmgmt
failed.  But it wouldn't have freed them if __GFP_NO_GROW, so make sure
fallback_alloc doesn't waste its time on that case.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Acked-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Pekka J Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2007-01-05 23:55:23 -08:00
Paul Mundt 9ab37b8f21 [PATCH] Sanely size hash tables when using large base pages
At the moment the inode/dentry cache hash tables (common by way of
alloc_large_system_hash()) are incorrectly sized by their respective
detection logic when we attempt to use large base pages on systems with
little memory.

This results in odd behaviour when using a 64kB PAGE_SIZE, such as:

Dentry cache hash table entries: 8192 (order: -1, 32768 bytes)
Inode-cache hash table entries: 4096 (order: -2, 16384 bytes)

The mount cache hash table is seemingly the only one that gets this right
by directly taking PAGE_SIZE in to account.

The following patch attempts to catch the bogus values and round it up to
at least 0-order.

Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2007-01-05 23:55:23 -08:00
Rafael J. Wysocki 7bf2368742 [PATCH] swsusp: Do not fail if resume device is not set
In the kernels later than 2.6.19 there is a regression that makes swsusp
fail if the resume device is not explicitly specified.

It can be fixed by adding an additional parameter to
mm/swapfile.c:swap_type_of() allowing us to pass the (struct block_device
*) corresponding to the first available swap back to the caller.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2007-01-05 23:55:22 -08:00
Shantanu Goel 918d3f90e8 [PATCH] Buglet in vmscan.c
Fix a rather obvious buglet.  Noticed while instrumenting the VM using
/proc/vmstat.

Cc: Christoph Lameter <clameter@engr.sgi.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-30 10:56:43 -08:00
Al Viro d6e88e671a [PATCH] page_mkclean_one(): fix call to set_pte_at()
(akpm: macros are wonderful)

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-30 10:56:42 -08:00
Dimitri Gorokhovik bcb4ddb46a [PATCH] MM: SLOB is broken by recent cleanup of slab.h
Recent cleanup of slab.h broke SLOB allocator: the routine kmem_cache_init
has now the __init attribute for both slab.c and slob.c.  This routine
cannot be removed after init in the case of slob.c -- it serves as a timer
callback.

Provide a separate timer callback routine, call it once from kmem_cache_init,
keep the __init attribute on the latter.

Signed-off-by: Dimitri Gorokhovik <dimitri.gorokhovik@free.fr>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-30 10:56:42 -08:00
KAMEZAWA Hiroyuki 96ac5913f4 [PATCH] fix oom killer kills current every time if there is memory-less-node take2
constrained_alloc(), which is called to detect where oom is from, checks
passed zone_list().  If zone_list doesn't include all nodes, it thinks oom
is from mempolicy.

But there is memory-less-node.  memory-less-node's zones are never included
in zonelist[].

contstrained_alloc() should get memory_less_node into count.  Otherwise, it
always thinks 'oom is from mempolicy'.  This means that current process
dies at any time.  This patch fix it.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-30 10:55:55 -08:00
Linus Torvalds 7658cc2892 VM: Fix nasty and subtle race in shared mmap'ed page writeback
The VM layer (on the face of it, fairly reasonably) expected that when
it does a ->writepage() call to the filesystem, it would write out the
full page at that point in time.  Especially since it had earlier marked
the whole page dirty with "set_page_dirty()".

But that isn't actually the case: ->writepage() does not actually write
a page, it writes the parts of the page that have been explicitly marked
dirty before, *and* that had not got written out for other reasons since
the last time we told it they were dirty.

That last caveat is the important one.

Which _most_ of the time ends up being the whole page (since we had
called "set_page_dirty()" on the page earlier), but if the filesystem
had done any dirty flushing of its own (for example, to honor some
internal write ordering guarantees), it might end up doing only a
partial page IO (or none at all) when ->writepage() is actually called.

That is the correct thing in general (since we actually often _want_
only the known-dirty parts of the page to be written out), but the
shared dirty page handling had implicitly forgotten about these details,
and had a number of cases where it was doing just the "->writepage()"
part, without telling the low-level filesystem that the whole page might
have been re-dirtied as part of being mapped writably into user space.

Since most of the time the FS did actually write out the full page, we
didn't notice this for a loong time, and this needed some really odd
patterns to trigger.  But it caused occasional corruption with rtorrent
and with the Debian "apt" database, because both use shared mmaps to
update the end result.

This fixes it. Finally. After way too much hair-pulling.

Acked-by: Nick Piggin <nickpiggin@yahoo.com.au>
Acked-by: Martin J. Bligh <mbligh@google.com>
Acked-by: Martin Michlmayr <tbm@cyrius.com>
Acked-by: Martin Johansson <martin@fatbob.nu>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Andrei Popa <andrei.popa@i-neo.ro>
Cc: High Dickins <hugh@veritas.com>
Cc: Andrew Morton <akpm@osdl.org>,
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Segher Boessenkool <segher@kernel.crashing.org>
Cc: David Miller <davem@davemloft.net>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Gordon Farquharson <gordonfarquharson@gmail.com>
Cc: Guillaume Chazarain <guichaz@yahoo.fr>
Cc: Theodore Tso <tytso@mit.edu>
Cc: Kenneth Cheng <kenneth.w.chen@intel.com>
Cc: Tobias Diedrich <ranma@tdiedrich.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-29 10:00:58 -08:00
Linus Torvalds 8368e328df Clean up and export cancel_dirty_page() to modules
Make cancel_dirty_page() act more like all the other dirty and writeback
accounting functions: test for "mapping" being NULL, and do the
NR_FILE_DIRY accounting purely based on mapping_cap_account_dirty()).

Also, add it to the exports, so that modular filesystems can use it.

Acked-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-23 09:25:04 -08:00
Peter Zijlstra c2fda5fed8 [PATCH] Fix up page_mkclean_one(): virtual caches, s390
- add flush_cache_page() for all those virtual indexed cache
   architectures.

 - handle s390.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-22 10:39:35 -08:00
Nick Piggin 7de6b80579 [PATCH] mm: more rmap debugging
Add more debugging in the rmap code in an attempt to locate to source of
the occasional "mapcount went negative" assertions.

Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-22 08:55:49 -08:00
Nigel Cunningham e07aa05b60 [PATCH] Fix swapped parameters in mm/vmscan.c
The version of mm/vmscan.c in Linus' current tree has swapped parameters in
the shrink_all_zones declaration and call, used by the various
suspend-to-disk implementations.  This doesn't seem to have any great
adverse effect, but it's clearly wrong.

Signed-off-by: Nigel Cunningham <nigel@suspend2.net>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-22 08:55:47 -08:00
Randy Dunlap af9997e426 [PATCH] fix kernel-doc warnings in 2.6.20-rc1
Fix kernel-doc warnings in 2.6.20-rc1.

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-22 08:55:47 -08:00
Christoph Lameter b7f869a284 [PATCH] slab: fix kmem_ptr_validate definition
The declaration of kmem_ptr_validate in slab.h does not match the
one in slab.c. Remove the fastcall attribute (this is the only use in
slab.c).

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-22 08:55:47 -08:00
Badari Pulavarty 92a3d03aab [PATCH] Fix for shmem_truncate_range() BUG_ON()
Ran into BUG() while doing madvise(REMOVE) testing.  If we are punching a
hole into shared memory segment using madvise(REMOVE) and the entire hole
is below the indirect blocks, we hit following assert.

	        BUG_ON(limit <= SHMEM_NR_DIRECT);

Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-22 08:55:47 -08:00
Andrew Morton 5f2a105d5e [PATCH] truncate: dirty memory accounting fix
Only (un)account for IO and page-dirtying for devices which have real backing
store (ie: not tmpfs or ramdisks).

Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-22 08:55:45 -08:00
Andrew Morton 3e67c0987d [PATCH] truncate: clear page dirtiness before running try_to_free_buffers()
truncate presently invalidates the dirty page's buffer_heads then shoots down
the page.  But try_to_free_buffers() will now bale out because the page is
dirty.

Net effect: the LRU gets filled with dirty pages which have invalidated
buffer_heads attached.  They have no ->mapping and hence cannot be cleaned.
The machine leaks memory at an enormous rate.

Fix this by cleaning the page before running try_to_free_buffers(), so
try_to_free_buffers() can do its work.

Also, remember to do dirty-page-acoounting in cancel_dirty_page() so the
machine won't wedge up trying to write non-existent dirty pages.

Probably still wrong, but now less so.

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-21 11:17:26 -08:00
Linus Torvalds fba2591bf4 VM: Remove "clear_page_dirty()" and "test_clear_page_dirty()" functions
They were horribly easy to mis-use because of their tempting naming, and
they also did way more than any users of them generally wanted them to
do.

A dirty page can become clean under two circumstances:

 (a) when we write it out.  We have "clear_page_dirty_for_io()" for
     this, and that function remains unchanged.

     In the "for IO" case it is not sufficient to just clear the dirty
     bit, you also have to mark the page as being under writeback etc.

 (b) when we actually remove a page due to it becoming inaccessible to
     users, notably because it was truncate()'d away or the file (or
     metadata) no longer exists, and we thus want to cancel any
     outstanding dirty state.

For the (b) case, we now introduce "cancel_dirty_page()", which only
touches the page state itself, and verifies that the page is not mapped
(since cancelling writes on a mapped page would be actively wrong as it
is still accessible to users).

Some filesystems need to be fixed up for this: CIFS, FUSE, JFS,
ReiserFS, XFS all use the old confusing functions, and will be fixed
separately in subsequent commits (with some of them just removing the
offending logic, and others using clear_page_dirty_for_io()).

This was confirmed by Martin Michlmayr to fix the apt database
corruption on ARM.

Cc: Martin Michlmayr <tbm@cyrius.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Andrei Popa <andrei.popa@i-neo.ro>
Cc: Andrew Morton <akpm@osdl.org>
Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Cc: Gordon Farquharson <gordonfarquharson@gmail.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-21 09:19:57 -08:00
Oleg Nesterov 825020c386 [PATCH] sys_mincore: s/max/min/
fix a typo, sys_mincore() needs min().

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Linus "I'm a moron" Torvalds <torvalds@osdl.org>
2006-12-17 10:21:53 -08:00
Linus Torvalds 4fb23e439c Fix up mm/mincore.c error value cases
Hugh Dickins correctly points out that mincore() is actually _supposed_
to fail on an unmapped hole in the user address space, rather than
return valid ("empty") information about the hole.  This just simplifies
the problem further (I had been misled by our previous confusing and
complicated way of doing mincore()).

Also, in the unlikely situation that we can't allocate a temporary
kernel buffer, we should actually return EAGAIN, not ENOMEM, to keep the
"unmapped hole" and "allocation failure" error cases separate.

Finally, add a comment about our stupid historical lack of support for
anonymous mappings.  I'll fix that if somebody reminds me after 2.6.20
is out.

Acked-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-16 16:01:50 -08:00
Linus Torvalds 2f77d10705 Fix incorrect user space access locking in mincore()
Doug Chapman noticed that mincore() will doa "copy_to_user()" of the
result while holding the mmap semaphore for reading, which is a big
no-no.  While a recursive read-lock on a semaphore in the case of a page
fault happens to work, we don't actually allow them due to deadlock
schenarios with writers due to fairness issues.

Doug and Marcel sent in a patch to fix it, but I decided to just rewrite
the mess instead - not just fixing the locking problem, but making the
code smaller and (imho) much easier to understand.

Cc: Doug Chapman <dchapman@redhat.com>
Cc: Marcel Holtmann <holtmann@redhat.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-16 09:44:32 -08:00
Atsushi Nemoto 9de455b207 [PATCH] Pass vma argument to copy_user_highpage().
To allow a more effective copy_user_highpage() on certain architectures,
a vma argument is added to the function and cow_user_page() allowing
the implementation of these functions to check for the VM_EXEC bit.

The main part of this patch was originally written by Ralf Baechle;
Atushi Nemoto did the the debugging.

Signed-off-by: Atsushi Nemoto <anemo@mba.ocn.ne.jp>
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-13 09:27:08 -08:00
Eric Dumazet 6a2d7a955d [PATCH] SLAB: use a multiply instead of a divide in obj_to_index()
When some objects are allocated by one CPU but freed by another CPU we can
consume lot of cycles doing divides in obj_to_index().

(Typical load on a dual processor machine where network interrupts are
handled by one particular CPU (allocating skbufs), and the other CPU is
running the application (consuming and freeing skbufs))

Here on one production server (dual-core AMD Opteron 285), I noticed this
divide took 1.20 % of CPU_CLK_UNHALTED events in kernel.  But Opteron are
quite modern cpus and the divide is much more expensive on oldest
architectures :

On a 200 MHz sparcv9 machine, the division takes 64 cycles instead of 1
cycle for a multiply.

Doing some math, we can use a reciprocal multiplication instead of a divide.

If we want to compute V = (A / B)  (A and B being u32 quantities)
we can instead use :

V = ((u64)A * RECIPROCAL(B)) >> 32 ;

where RECIPROCAL(B) is precalculated to ((1LL << 32) + (B - 1)) / B

Note :

I wrote pure C code for clarity. gcc output for i386 is not optimal but
acceptable :

mull   0x14(%ebx)
mov    %edx,%eax // part of the >> 32
xor     %edx,%edx // useless
mov    %eax,(%esp) // could be avoided
mov    %edx,0x4(%esp) // useless
mov    (%esp),%ebx

[akpm@osdl.org: small cleanups]
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-13 09:05:49 -08:00
Paul Jackson 02a0e53d82 [PATCH] cpuset: rework cpuset_zone_allowed api
Elaborate the API for calling cpuset_zone_allowed(), so that users have to
explicitly choose between the two variants:

  cpuset_zone_allowed_hardwall()
  cpuset_zone_allowed_softwall()

Until now, whether or not you got the hardwall flavor depended solely on
whether or not you or'd in the __GFP_HARDWALL gfp flag to the gfp_mask
argument.

If you didn't specify __GFP_HARDWALL, you implicitly got the softwall
version.

Unfortunately, this meant that users would end up with the softwall version
without thinking about it.  Since only the softwall version might sleep,
this led to bugs with possible sleeping in interrupt context on more than
one occassion.

The hardwall version requires that the current tasks mems_allowed allows
the node of the specified zone (or that you're in interrupt or that
__GFP_THISNODE is set or that you're on a one cpuset system.)

The softwall version, depending on the gfp_mask, might allow a node if it
was allowed in the nearest enclusing cpuset marked mem_exclusive (which
requires taking the cpuset lock 'callback_mutex' to evaluate.)

This patch removes the cpuset_zone_allowed() call, and forces the caller to
explicitly choose between the hardwall and the softwall case.

If the caller wants the gfp_mask to determine this choice, they should (1)
be sure they can sleep or that __GFP_HARDWALL is set, and (2) invoke the
cpuset_zone_allowed_softwall() routine.

This adds another 100 or 200 bytes to the kernel text space, due to the few
lines of nearly duplicate code at the top of both cpuset_zone_allowed_*
routines.  It should save a few instructions executed for the calls that
turned into calls of cpuset_zone_allowed_hardwall, thanks to not having to
set (before the call) then check (within the call) the __GFP_HARDWALL flag.

For the most critical call, from get_page_from_freelist(), the same
instructions are executed as before -- the old cpuset_zone_allowed()
routine it used to call is the same code as the
cpuset_zone_allowed_softwall() routine that it calls now.

Not a perfect win, but seems worth it, to reduce this chance of hitting a
sleeping with irq off complaint again.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-13 09:05:49 -08:00
Christoph Lameter 55935a34a4 [PATCH] More slab.h cleanups
More cleanups for slab.h

1. Remove tabs from weird locations as suggested by Pekka

2. Drop the check for NUMA and SLAB_DEBUG from the fallback section
   as suggested by Pekka.

3. Uses static inline for the fallback defs as also suggested by Pekka.

4. Make kmem_ptr_valid take a const * argument.

5. Separate the NUMA fallback definitions from the kmalloc_track fallback
   definitions.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-13 09:05:49 -08:00
Christoph Lameter 2e892f43cc [PATCH] Cleanup slab headers / API to allow easy addition of new slab allocators
This is a response to an earlier discussion on linux-mm about splitting
slab.h components per allocator.  Patch is against 2.6.19-git11.  See
http://marc.theaimsgroup.com/?l=linux-mm&m=116469577431008&w=2

This patch cleans up the slab header definitions.  We define the common
functions of slob and slab in slab.h and put the extra definitions needed
for slab's kmalloc implementations in <linux/slab_def.h>.  In order to get
a greater set of common functions we add several empty functions to slob.c
and also rename slob's kmalloc to __kmalloc.

Slob does not need any special definitions since we introduce a fallback
case.  If there is no need for a slab implementation to provide its own
kmalloc mess^H^H^Hacros then we simply fall back to __kmalloc functions.
That is sufficient for SLOB.

Sort the function in slab.h according to their functionality.  First the
functions operating on struct kmem_cache * then the kmalloc related
functions followed by special debug and fallback definitions.

Also redo a lot of comments.

Signed-off-by: Christoph Lameter <clameter@sgi.com>?
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-13 09:05:49 -08:00
Christoph Lameter dd47ea7556 [PATCH] slab: fix sleeping in atomic bug
Fallback_alloc() does not do the check for GFP_WAIT as done in
cache_grow().  Thus interrupts are disabled when we call kmem_getpages()
which results in the failure.

Duplicate the handling of GFP_WAIT in cache_grow().

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Jay Cliburn <jacliburn@bellsouth.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-13 09:05:48 -08:00
Arjan van de Ven 2b2842146c [PATCH] user of the jiffies rounding patch: Slab
This patch introduces users of the round_jiffies() function in the slab code.

The slab code has a few "run every second" timers for background work; these
are obviously not timing critical as long as they happen roughly at the right
frequency.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 09:57:22 -08:00
Zach Brown 8459d86aff [PATCH] dio: only call aio_complete() after returning -EIOCBQUEUED
The only time it is safe to call aio_complete() is when the ->ki_retry
function returns -EIOCBQUEUED to the AIO core.  direct_io_worker() has
historically done this by relying on its caller to translate positive return
codes into -EIOCBQUEUED for the aio case.  It did this by trying to keep
conditionals in sync.  direct_io_worker() knew when finished_one_bio() was
going to call aio_complete().  It would reverse the test and wait and free the
dio in the cases it thought that finished_one_bio() wasn't going to.

Not surprisingly, it ended up getting it wrong.  'ret' could be a negative
errno from the submission path but it failed to communicate this to
finished_one_bio().  direct_io_worker() would return < 0, it's callers
wouldn't raise -EIOCBQUEUED, and aio_complete() would be called.  In the
future finished_one_bio()'s tests wouldn't reflect this and aio_complete()
would be called for a second time which can manifest as an oops.

The previous cleanups have whittled the sync and async completion paths down
to the point where we can collapse them and clearly reassert the invariant
that we must only call aio_complete() after returning -EIOCBQUEUED.
direct_io_worker() will only return -EIOCBQUEUED when it is not the last to
drop the dio refcount and the aio bio completion path will only call
aio_complete() when it is the last to drop the dio refcount.
direct_io_worker() can ensure that it is the last to drop the reference count
by waiting for bios to drain.  It does this for sync ops, of course, and for
partial dio writes that must fall back to buffered and for aio ops that saw
errors during submission.

This means that operations that end up waiting, even if they were issued as
aio ops, will not call aio_complete() from dio.  Instead we return the return
code of the operation and let the aio core call aio_complete().  This is
purposely done to fix a bug where AIO DIO file extensions would call
aio_complete() before their callers have a chance to update i_size.

Now that direct_io_worker() is explicitly returning -EIOCBQUEUED its callers
no longer have to translate for it.  XFS needs to be careful not to free
resources that will be used during AIO completion if -EIOCBQUEUED is returned.
 We maintain the previous behaviour of trying to write fs metadata for O_SYNC
aio+dio writes.

Signed-off-by: Zach Brown <zach.brown@oracle.com>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Suparna Bhattacharya <suparna@in.ibm.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Cc: <xfs-masters@oss.sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 09:57:21 -08:00
Andrew Morton 8bde37f08f [PATCH] io-accounting-read-accounting nfs fix
nfs's ->readpages uses read_cache_pages().  Wire it up there.

[wfg@mail.ustc.edu.cn: account only successful nfs/fuse reads]
Cc: Jay Lan <jlan@sgi.com>
Cc: Shailabh Nagar <nagar@watson.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Chris Sturtivant <csturtiv@sgi.com>
Cc: Tony Ernst <tee@sgi.com>
Cc: Guillaume Thouvenin <guillaume.thouvenin@bull.net>
Cc: David Wright <daw@sgi.com>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 09:55:41 -08:00
Andrew Morton e08748ce01 [PATCH] io-accounting: write-cancel accounting
Account for the number of byte writes which this process caused to not happen
after all.

Cc: Jay Lan <jlan@sgi.com>
Cc: Shailabh Nagar <nagar@watson.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Chris Sturtivant <csturtiv@sgi.com>
Cc: Tony Ernst <tee@sgi.com>
Cc: Guillaume Thouvenin <guillaume.thouvenin@bull.net>
Cc: David Wright <daw@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 09:55:41 -08:00
Andrew Morton 55e829af06 [PATCH] io-accounting: write accounting
Accounting writes is fairly simple: whenever a process flips a page from clean
to dirty, we accuse it of having caused a write to underlying storage of
PAGE_CACHE_SIZE bytes.

This may overestimate the amount of writing: the page-dirtying may cause only
one buffer_head's worth of writeout.  Fixing that is possible, but probably a
bit messy and isn't obviously important.

Cc: Jay Lan <jlan@sgi.com>
Cc: Shailabh Nagar <nagar@watson.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Chris Sturtivant <csturtiv@sgi.com>
Cc: Tony Ernst <tee@sgi.com>
Cc: Guillaume Thouvenin <guillaume.thouvenin@bull.net>
Cc: David Wright <daw@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 09:55:41 -08:00
Andrew Morton 8c08540f87 [PATCH] clean up __set_page_dirty_nobuffers()
Save a tabstop in __set_page_dirty_nobuffers() and __set_page_dirty_buffers()
and a few other places.  No functional changes.

Cc: Jay Lan <jlan@sgi.com>
Cc: Shailabh Nagar <nagar@watson.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Chris Sturtivant <csturtiv@sgi.com>
Cc: Tony Ernst <tee@sgi.com>
Cc: Guillaume Thouvenin <guillaume.thouvenin@bull.net>
Cc: David Wright <daw@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 09:55:41 -08:00
Hugh Dickins 5fcf7bb73f [PATCH] read_zero_pagealigned() locking fix
Ramiro Voicu hits the BUG_ON(!pte_none(*pte)) in zeromap_pte_range: kernel
bugzilla 7645.  Right: read_zero_pagealigned uses down_read of mmap_sem,
but another thread's racing read of /dev/zero, or a normal fault, can
easily set that pte again, in between zap_page_range and zeromap_page_range
getting there.  It's been wrong ever since 2.4.3.

The simple fix is to use down_write instead, but that would serialize reads
of /dev/zero more than at present: perhaps some app would be badly
affected.  So instead let zeromap_page_range return the error instead of
BUG_ON, and read_zero_pagealigned break to the slower clear_user loop in
that case - there's no need to optimize for it.

Use -EEXIST for when a pte is found: BUG_ON in mmap_zero (the other user of
zeromap_page_range), though it really isn't interesting there.  And since
mmap_zero wants -EAGAIN for out-of-memory, the zeromaps better return that
than -ENOMEM.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: Ramiro Voicu: <Ramiro.Voicu@cern.ch>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 09:55:39 -08:00
Don Mullis 6b1b60f41e [PATCH] fault-injection: defaults likely to please a new user
Assign defaults most likely to please a new user:
 1) generate some logging output
    (verbose=2)
 2) avoid injecting failures likely to lock up UI
    (ignore_gfp_wait=1, ignore_gfp_highmem=1)

Signed-off-by: Don Mullis <dwm@meer.net>
Cc: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:29:03 -08:00
Akinobu Mita 933e312e73 [PATCH] fault-injection capability for alloc_pages()
This patch provides fault-injection capability for alloc_pages()

Boot option:

fail_page_alloc=<interval>,<probability>,<space>,<times>

	<interval> -- specifies the interval of failures.

	<probability> -- specifies how often it should fail in percent.

	<space> -- specifies the size of free space where memory can be
		   allocated safely in pages.

	<times> -- specifies how many times failures may happen at most.

Debugfs:

/debug/fail_page_alloc/interval
/debug/fail_page_alloc/probability
/debug/fail_page_alloc/specifies
/debug/fail_page_alloc/times
/debug/fail_page_alloc/ignore-gfp-highmem
/debug/fail_page_alloc/ignore-gfp-wait

Example:

	fail_page_alloc=10,100,0,-1

The page allocation (alloc_pages(), ...) fails once per 10 times.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:29:02 -08:00
Akinobu Mita 8a8b6502fb [PATCH] fault-injection capability for kmalloc
This patch provides fault-injection capability for kmalloc.

Boot option:

failslab=<interval>,<probability>,<space>,<times>

	<interval> -- specifies the interval of failures.

	<probability> -- specifies how often it should fail in percent.

	<space> -- specifies the size of free space where memory can be
		   allocated safely in bytes.

	<times> -- specifies how many times failures may happen at most.

Debugfs:

/debug/failslab/interval
/debug/failslab/probability
/debug/failslab/specifies
/debug/failslab/times
/debug/failslab/ignore-gfp-highmem
/debug/failslab/ignore-gfp-wait

Example:

	failslab=10,100,0,-1

slab allocation (kmalloc(), kmem_cache_alloc(),..) fails once per 10 times.

Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:29:02 -08:00
David Howells f0d1b0b30d [PATCH] LOG2: Implement a general integer log2 facility in the kernel
This facility provides three entry points:

	ilog2()		Log base 2 of unsigned long
	ilog2_u32()	Log base 2 of u32
	ilog2_u64()	Log base 2 of u64

These facilities can either be used inside functions on dynamic data:

	int do_something(long q)
	{
		...;
		y = ilog2(x)
		...;
	}

Or can be used to statically initialise global variables with constant values:

	unsigned n = ilog2(27);

When performing static initialisation, the compiler will report "error:
initializer element is not constant" if asked to take a log of zero or of
something not reducible to a constant.  They treat negative numbers as
unsigned.

When not dealing with a constant, they fall back to using fls() which permits
them to use arch-specific log calculation instructions - such as BSR on
x86/x86_64 or SCAN on FRV - if available.

[akpm@osdl.org: MMC fix]
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: David Howells <dhowells@redhat.com>
Cc: Wojtek Kaniewski <wojtekka@toxygen.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:28:51 -08:00
Josef Sipek e9536ae720 [PATCH] struct path: convert mm
Signed-off-by: Josef Sipek <jsipek@fsl.cs.sunysb.edu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:28:47 -08:00
Josef "Jeff" Sipek d3ac7f892b [PATCH] mm: change uses of f_{dentry,vfsmnt} to use f_path
Change all the uses of f_{dentry,vfsmnt} to f_path.{dentry,mnt} in linux/mm/.

Signed-off-by: Josef "Jeff" Sipek <jsipek@cs.sunysb.edu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:28:43 -08:00
Paul Jackson b8b50b6519 [PATCH] mm: fallback_alloc cpuset_zone_allowed irq fix
fallback_alloc() could end up calling cpuset_zone_allowed() with interrupts
disabled (by code in kmem_cache_alloc_node()), but without __GFP_HARDWALL
set, leading to a possible call of a sleeping function with interrupts
disabled.

This results in the BUG report:

  BUG: sleeping function called from invalid context at kernel/cpuset.c:1520
in_atomic():0, irqs_disabled():1

Thanks to Paul Menage for catching this one.

Signed-off-by: Paul Jackson <pj@sgi.com>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:28:37 -08:00
Helge Deller 15ad7cdcfd [PATCH] struct seq_operations and struct file_operations constification
- move some file_operations structs into the .rodata section

 - move static strings from policy_types[] array into the .rodata section

 - fix generic seq_operations usages, so that those structs may be defined
   as "const" as well

[akpm@osdl.org: couple of fixes]
Signed-off-by: Helge Deller <deller@gmx.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:46 -08:00
Adrian Bunk 045f147f32 [PATCH] remove EXPORT_UNUSED_SYMBOL'ed symbols
In time for 2.6.20, we can get rid of this junk.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:44 -08:00
Burman Yan 4668edc334 [PATCH] kernel core: replace kmalloc+memset with kzalloc
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:41 -08:00
Adrian Bunk 1f370a23f2 [PATCH] make mm/shmem.c:shmem_xattr_security_handler static
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:39 -08:00
Ingo Molnar 0231606785 [PATCH] hotplug CPU: clean up hotcpu_notifier() use
There was lots of #ifdef noise in the kernel due to hotcpu_notifier(fn,
prio) not correctly marking 'fn' as used in the !HOTPLUG_CPU case, and thus
generating compiler warnings of unused symbols, hence forcing people to add
#ifdefs.

the compiler can skip truly unused functions just fine:

    text    data     bss     dec     hex filename
 1624412  728710 3674856 6027978  5bfaca vmlinux.before
 1624412  728710 3674856 6027978  5bfaca vmlinux.after

[akpm@osdl.org: topology.c fix]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:39 -08:00
Andrew Morton 0490366432 [PATCH] remove HASH_HIGHMEM
It has no users and it's doubtful that we'll need it again.

Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:37 -08:00
OGAWA Hirofumi 38da288b8b [PATCH] read_cache_pages() cleanup
Use put_pages_list() instead of opencoding it.

Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:34 -08:00
Andrew Morton 138ae6631a [PATCH] slab: use probe_kernel_address()
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:34 -08:00
Nigel Cunningham 7dfb71030f [PATCH] Add include/linux/freezer.h and move definitions from sched.h
Move process freezing functions from include/linux/sched.h to freezer.h, so
that modifications to the freezer or the kernel configuration don't require
recompiling just about everything.

[akpm@osdl.org: fix ueagle driver]
Signed-off-by: Nigel Cunningham <nigel@suspend2.net>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:27 -08:00
Rafael J. Wysocki 8357376d3d [PATCH] swsusp: Improve handling of highmem
Currently swsusp saves the contents of highmem pages by copying them to the
normal zone which is quite inefficient (eg.  it requires two normal pages
to be used for saving one highmem page).  This may be improved by using
highmem for saving the contents of saveable highmem pages.

Namely, during the suspend phase of the suspend-resume cycle we try to
allocate as many free highmem pages as there are saveable highmem pages.
If there are not enough highmem image pages to store the contents of all of
the saveable highmem pages, some of them will be stored in the "normal"
memory.  Next, we allocate as many free "normal" pages as needed to store
the (remaining) image data.  We use a memory bitmap to mark the allocated
free pages (ie.  highmem as well as "normal" image pages).

Now, we use another memory bitmap to mark all of the saveable pages
(highmem as well as "normal") and the contents of the saveable pages are
copied into the image pages.  Then, the second bitmap is used to save the
pfns corresponding to the saveable pages and the first one is used to save
their data.

During the resume phase the pfns of the pages that were saveable during the
suspend are loaded from the image and used to mark the "unsafe" page
frames.  Next, we try to allocate as many free highmem page frames as to
load all of the image data that had been in the highmem before the suspend
and we allocate so many free "normal" page frames that the total number of
allocated free pages (highmem and "normal") is equal to the size of the
image.  While doing this we have to make sure that there will be some extra
free "normal" and "safe" page frames for two lists of PBEs constructed
later.

Now, the image data are loaded, if possible, into their "original" page
frames.  The image data that cannot be written into their "original" page
frames are loaded into "safe" page frames and their "original" kernel
virtual addresses, as well as the addresses of the "safe" pages containing
their copies, are stored in one of two lists of PBEs.

One list of PBEs is for the copies of "normal" suspend pages (ie.  "normal"
pages that were saveable during the suspend) and it is used in the same way
as previously (ie.  by the architecture-dependent parts of swsusp).  The
other list of PBEs is for the copies of highmem suspend pages.  The pages
in this list are restored (in a reversible way) right before the
arch-dependent code is called.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:27 -08:00
Rafael J. Wysocki 3aef83e0ef [PATCH] swsusp: use block device offsets to identify swap locations
Make swsusp use block device offsets instead of swap offsets to identify swap
locations and make it use the same code paths for writing as well as for
reading data.

This allows us to use the same code for handling swap files and swap
partitions and to simplify the code, eg.  by dropping rw_swap_page_sync().

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:27 -08:00
Rafael J. Wysocki 915bae9ebe [PATCH] swsusp: use partition device and offset to identify swap areas
The Linux kernel handles swap files almost in the same way as it handles swap
partitions and there are only two differences between these two types of swap
areas:

(1) swap files need not be contiguous,

(2) the header of a swap file is not in the first block of the partition
    that holds it.  From the swsusp's point of view (1) is not a problem,
    because it is already taken care of by the swap-handling code, but (2) has
    to be taken into consideration.

In principle the location of a swap file's header may be determined with the
help of appropriate filesystem driver.  Unfortunately, however, it requires
the filesystem holding the swap file to be mounted, and if this filesystem is
journaled, it cannot be mounted during a resume from disk.  For this reason we
need some other means by which swap areas can be identified.

For example, to identify a swap area we can use the partition that holds the
area and the offset from the beginning of this partition at which the swap
header is located.

The following patch allows swsusp to identify swap areas this way.  It changes
swap_type_of() so that it takes an additional argument representing an offset
of the swap header within the partition represented by its first argument.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:27 -08:00
Nick Piggin 7cf9c2c76c [PATCH] radix-tree: RCU lockless readside
Make radix tree lookups safe to be performed without locks.  Readers are
protected against nodes being deleted by using RCU based freeing.  Readers
are protected against new node insertion by using memory barriers to ensure
the node itself will be properly written before it is visible in the radix
tree.

Each radix tree node keeps a record of their height (above leaf nodes).
This height does not change after insertion -- when the radix tree is
extended, higher nodes are only inserted in the top.  So a lookup can take
the pointer to what is *now* the root node, and traverse down it even if
the tree is concurrently extended and this node becomes a subtree of a new
root.

"Direct" pointers (tree height of 0, where root->rnode points directly to
the data item) are handled by using the low bit of the pointer to signal
whether rnode is a direct pointer or a pointer to a radix tree node.

When a reader wants to traverse the next branch, they will take a copy of
the pointer.  This pointer will be either NULL (and the branch is empty) or
non-NULL (and will point to a valid node).

[akpm@osdl.org: cleanups]
[Lee.Schermerhorn@hp.com: bugfixes, comments, simplifications]
[clameter@sgi.com: build fix]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:25 -08:00
Andy Whitcroft 33f2ef89f8 [PATCH] mm: make compound page destructor handling explicit
Currently we we use the lru head link of the second page of a compound page
to hold its destructor.  This was ok when it was purely an internal
implmentation detail.  However, hugetlbfs overrides this destructor
violating the layering.  Abstract this out as explicit calls, also
introduce a type for the callback function allowing them to be type
checked.  For each callback we pre-declare the function, causing a type
error on definition rather than on use elsewhere.

[akpm@osdl.org: cleanups]
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:25 -08:00
Christoph Lameter 3c517a6132 [PATCH] slab: better fallback allocation behavior
Currently we simply attempt to allocate from all allowed nodes using
GFP_THISNODE.  However, GFP_THISNODE does not do reclaim (it wont do any at
all if the recent GFP_THISNODE patch is accepted).  If we truly run out of
memory in the whole system then fallback_alloc may return NULL although
memory may still be available if we would perform more thorough reclaim.

This patch changes fallback_alloc() so that we first only inspect all the
per node queues for available slabs.  If we find any then we allocate from
those.  This avoids slab fragmentation by first getting rid of all partial
allocated slabs on every node before allocating new memory.

If we cannot satisfy the allocation from any per node queue then we extend
a slab.  We now call into the page allocator without specifying
GFP_THISNODE.  The page allocator will then implement its own fallback (in
the given cpuset context), perform necessary reclaim (again considering not
a single node but the whole set of allowed nodes) and then return pages for
a new slab.

We identify from which node the pages were allocated and then insert the
pages into the corresponding per node structure.  In order to do so we need
to modify cache_grow() to take a parameter that specifies the new slab.
kmem_getpages() can no longer set the GFP_THISNODE flag since we need to be
able to use kmem_getpage to allocate from an arbitrary node.  GFP_THISNODE
needs to be specified when calling cache_grow().

One key advantage is that the decision from which node to allocate new
memory is removed from slab fallback processing.  The patch allows to go
back to use of the page allocators fallback/reclaim logic.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:25 -08:00
Christoph Lameter 952f3b51be [PATCH] GFP_THISNODE must not trigger global reclaim
The intent of GFP_THISNODE is to make sure that an allocation occurs on a
particular node.  If this is not possible then NULL needs to be returned so
that the caller can choose what to do next on its own (the slab allocator
depends on that).

However, GFP_THISNODE currently triggers reclaim before returning a failure
(GFP_THISNODE means GFP_NORETRY is set).  If we have over allocated a node
then we will currently do some reclaim before returning NULL.  The caller
may want memory from other nodes before reclaim should be triggered.  (If
the caller wants reclaim then he can directly use __GFP_THISNODE instead).

There is no flag to avoid reclaim in the page allocator and adding yet
another GFP_xx flag would be difficult given that we are out of available
flags.

So just compare and see if all bits for GFP_THISNODE (__GFP_THISNODE,
__GFP_NORETRY and __GFP_NOWARN) are set.  If so then we return NULL before
waking up kswapd.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:25 -08:00
Christoph Lameter 5bcd234d88 [PATCH] slab: fix two issues in kmalloc_node / __cache_alloc_node
This addresses two issues:

1. Kmalloc_node() may intermittently return NULL if we are allocating
   from the current node and are unable to obtain memory for the current
   node from the page allocator.  This is because we call ___cache_alloc()
   if nodeid == numa_node_id() and ____cache_alloc is not able to fallback
   to other nodes.

   This was introduced in the 2.6.19 development cycle.  <= 2.6.18 in
   that case does not do a restricted allocation and blindly trusts the
   page allocator to have given us memory from the indicated node.  It
   inserts the page regardless of the node it came from into the queues for
   the current node.

2. If kmalloc_node() is used on a node that has not been bootstrapped
   yet then we may try to pass an invalid node number to
   ____cache_alloc_node() triggering a BUG().

   Change the function to call fallback_alloc() instead.  Only call
   fallback_alloc() if we are allowed to fallback at all.  The need to
   handle a node not bootstrapped yet also first surfaced in the 2.6.19
   cycle.

Update the comments since they were still describing the old kmalloc_node
from 2.6.12.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:25 -08:00
Christoph Lameter 441e143e95 [PATCH] slab: remove SLAB_DMA
SLAB_DMA is an alias of GFP_DMA. This is the last one so we
remove the leftover comment too.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:24 -08:00
Christoph Lameter e94b176609 [PATCH] slab: remove SLAB_KERNEL
SLAB_KERNEL is an alias of GFP_KERNEL.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:24 -08:00
Christoph Lameter a06d72c1dc [PATCH] slab: remove SLAB_LEVEL_MASK
SLAB_LEVEL_MASK is only used internally to the slab and is
and alias of GFP_LEVEL_MASK.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:23 -08:00
Christoph Lameter 6e0eaa4b05 [PATCH] slab: remove SLAB_NO_GROW
It is only used internally in the slab.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:23 -08:00
Hugh Dickins 2d4d862f72 [PATCH] kill install_file_pte's pte_val
David Binderman and his Intel C compiler rightly observe that
install_file_pte no longer has any use for its pte_val.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: d binderman <dcb314@hotmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:23 -08:00
Andy Whitcroft ce421c799b [PATCH] mm: cleanup indentation on switch for CPU operations
These patches introduced new switch statements which are indented contrary
to the concensus in mm/*.c.  Fix them up to match that concensus.

    [PATCH] node local per-cpu-pages
    [PATCH] ZVC: Scale thresholds depending on the size of the system
    commit e7c8d5c995
    commit df9ecaba3f

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:23 -08:00
Eric Sandeen 5d1854e15e [PATCH] reject corrupt swapfiles earlier
The fsfuzzer found this; with a corrupt small swapfile that claims to have
many pages:

  [root]# file swap.741.img
  swap.741.img: Linux/i386 swap file (new style) 1 (4K pages) size 1040191487 pages
  [root]# ls -l swap.741.img
  -rw-r--r-- 1 root root 16777216 Nov 22 05:18 swap.741.img

sys_swapon() will try to vmalloc all those pages, and -then- check to see if
the file is actually that large:

                if (!(p->swap_map = vmalloc(maxpages * sizeof(short)))) {
  <snip>
        if (swapfilesize && maxpages > swapfilesize) {
                printk(KERN_WARNING
                       "Swap area shorter than signature indicates\n");

It seems to me that it would make more sense to move this test up before
the vmalloc, with the other checks, to avoid the OOM-killer in this
situation...

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:23 -08:00
Andy Whitcroft 25ba77c141 [PATCH] numa node ids are int, page_to_nid and zone_to_nid should return int
NUMA node ids are passed as either int or unsigned int almost exclusivly
page_to_nid and zone_to_nid both return unsigned long.  This is a throw
back to when page_to_nid was a #define and was thus exposing the real type
of the page flags field.

In addition to fixing up the definitions of page_to_nid and zone_to_nid I
audited the users of these functions identifying the following incorrect
uses:

1) mm/page_alloc.c show_node() -- printk dumping the node id,
2) include/asm-ia64/pgalloc.h pgtable_quicklist_free() -- comparison
   against numa_node_id() which returns an int from cpu_to_node(), and
3) mm/mpolicy.c check_pte_range -- used as an index in node_isset which
   uses bit_set which in generic code takes an int.

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:23 -08:00
Christoph Lameter bc4ba393c0 [PATCH] drain_node_page(): Drain pages in batch units
drain_node_pages() currently drains the complete pageset of all pages.  If
there are a large number of pages in the queues then we may hold off
interrupts for too long.

Duplicate the method used in free_hot_cold_page.  Only drain pcp->batch
pages at one time.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:23 -08:00
Adrian Bunk e30500557e [PATCH] make mm/thrash.c:global_faults static
This patch makes the needlessly global "global_faults" static.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:22 -08:00
Christian Krafft 7c309a64d6 [PATCH] enable booting a NUMA system where some nodes have no memory
When booting a NUMA system with nodes that have no memory (eg by limiting
memory), bootmem_alloc_core tried to find pages in an uninitialized
bootmem_map.  This caused a null pointer access.  This fix adds a check, so
that NULL is returned.  That will enable the caller (bootmem_alloc_nopanic)
to alloc memory on other without a panic.

Signed-off-by: Christian Krafft <krafft@de.ibm.com>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Martin Bligh <mbligh@google.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:22 -08:00
Alan Stern a120586873 [PATCH] Allow NULL pointers in percpu_free
The patch (as824b) makes percpu_free() ignore NULL arguments, as one would
expect for a deallocation routine.  (Note that free_percpu is #defined as
percpu_free in include/linux/percpu.h.) A few callers are updated to remove
now-unneeded tests for NULL.  A few other callers already seem to assume
that passing a NULL pointer to percpu_free() is okay!

The patch also removes an unnecessary NULL check in percpu_depopulate().

Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:22 -08:00
Christoph Hellwig 8b98c1699e [PATCH] leak tracking for kmalloc_node
We have variants of kmalloc and kmem_cache_alloc that leave leak tracking to
the caller.  This is used for subsystem-specific allocators like skb_alloc.

To make skb_alloc node-aware we need similar routines for the node-aware slab
allocator, which this patch adds.

Note that the code is rather ugly, but it mirrors the non-node-aware code 1:1:

[akpm@osdl.org: add module export]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:22 -08:00
Suleiman Souhlal 881e4aabe4 [PATCH] Always print out the header line in /proc/swaps
It would be possible for /proc/swaps to not always print out the header:

swapon /dev/hdc2
swapon /dev/hde2
swapoff /dev/hdc2

At this point /proc/swaps would not have a header.

Signed-off-by: Suleiman Souhlal <suleiman@google.com>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:22 -08:00
Kirill Korotaev b43a57bb4d [PATCH] OOM can panic due to processes stuck in __alloc_pages()
OOM can panic due to the processes stuck in __alloc_pages() doing infinite
rebalance loop while no memory can be reclaimed.  OOM killer tries to kill
some processes, but unfortunetaly, rebalance label was moved by someone
below the TIF_MEMDIE check, so buddy allocator doesn't see that process is
OOM-killed and it can simply fail the allocation :/

Observed in reality on RHEL4(2.6.9)+OpenVZ kernel when a user doing some
memory allocation tricks triggered OOM panic.

Signed-off-by: Denis Lunev <den@sw.ru>
Signed-off-by: Kirill Korotaev <dev@openvz.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:22 -08:00
Rik Bobbaers a3eea484f7 [PATCH] mlock cleanup
mm is defined as vma->vm_mm, so use that.

Acked-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:22 -08:00
Paul Menage 3395ee0588 [PATCH] mm: add noaliencache boot option to disable numa alien caches
When using numa=fake on non-NUMA hardware there is no benefit to having the
alien caches, and they consume much memory.

Add a kernel boot option to disable them.

Christoph sayeth "This is good to have even on large NUMA.  The problem is
that the alien caches grow by the square of the size of the system in terms of
nodes."

Cc: Christoph Lameter <clameter@engr.sgi.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:21 -08:00
Ravikiran G Thirumalai 8f5be20bf8 [PATCH] mm: slab: eliminate lock_cpu_hotplug from slab
Here's an attempt towards doing away with lock_cpu_hotplug in the slab
subsystem.  This approach also fixes a bug which shows up when cpus are
being offlined/onlined and slab caches are being tuned simultaneously.

http://marc.theaimsgroup.com/?l=linux-kernel&m=116098888100481&w=2

The patch has been stress tested overnight on a 2 socket 4 core AMD box with
repeated cpu online and offline, while dbench and kernbench process are
running, and slab caches being tuned at the same time.
There were no lockdep warnings either.  (This test on 2,6.18 as 2.6.19-rc
crashes at __drain_pages
http://marc.theaimsgroup.com/?l=linux-kernel&m=116172164217678&w=2 )

The approach here is to hold cache_chain_mutex from CPU_UP_PREPARE until
CPU_ONLINE (similar in approach as worqueue_mutex) .  Slab code sensitive
to cpu_online_map (kmem_cache_create, kmem_cache_destroy, slabinfo_write,
__cache_shrink) is already serialized with cache_chain_mutex.  (This patch
lengthens cache_chain_mutex hold time at kmem_cache_destroy to cover this).
 This patch also takes the cache_chain_sem at kmem_cache_shrink to protect
sanity of cpu_online_map at __cache_shrink, as viewed by slab.
(kmem_cache_shrink->__cache_shrink->drain_cpu_caches).  But, really,
kmem_cache_shrink is used at just one place in the acpi subsystem!  Do we
really need to keep kmem_cache_shrink at all?

Another note.  Looks like a cpu hotplug event can send  CPU_UP_CANCELED to
a registered subsystem even if the subsystem did not receive CPU_UP_PREPARE.
This could be due to a subsystem registered for notification earlier than
the current subsystem crapping out with NOTIFY_BAD. Badness can occur with
in the CPU_UP_CANCELED code path at slab if this happens (The same would
apply for workqueue.c as well).  To overcome this, we might have to use either
a) a per subsystem flag and avoid handling of CPU_UP_CANCELED, or
b) Use a special notifier events like LOCK_ACQUIRE/RELEASE as Gautham was
   using in his experiments, or
c) Do not send CPU_UP_CANCELED to a subsystem which did not receive
   CPU_UP_PREPARE.

I would prefer c).

Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Shai Fultheim <shai@scalex86.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:21 -08:00
Kevin Hilman a44b56d354 [PATCH] slab debug and ARCH_SLAB_MINALIGN don't get along
When CONFIG_SLAB_DEBUG is used in combination with ARCH_SLAB_MINALIGN, some
debug flags should be disabled which depend on BYTES_PER_WORD alignment.

The disabling of these debug flags is not properly handled when
BYTES_PER_WORD < ARCH_SLAB_MEMALIGN < cache_line_size()

This patch fixes that and also adds an alignment check to
cache_alloc_debugcheck_after() when ARCH_SLAB_MINALIGN is used.

Signed-off-by: Kevin Hilman <khilman@mvista.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:21 -08:00
Chen, Kenneth W cace673d37 [PATCH] htlb forget rss with pt sharing
Imprecise RSS accounting is an irritating ill effect with pt sharing.  After
consulted with several VM experts, I have tried various methods to solve that
problem: (1) iterate through all mm_structs that share the PT and increment
count; (2) keep RSS count in page table structure and then sum them up at
reporting time.  None of the above methods yield any satisfactory
implementation.

Since process RSS accounting is pure information only, I propose we don't
count them at all for hugetlb page.  rlimit has such field, though there is
absolutely no enforcement on limiting that resource.  One other method is to
account all RSS at hugetlb mmap time regardless they are faulted or not.  I
opt for the simplicity of no accounting at all.

Hugetlb page are special, they are reserved up front in global reservation
pool and is not reclaimable.  From physical memory resource point of view, it
is already consumed regardless whether there are users using them.

If the concern is that RSS can be used to control resource allocation, we
already can specify hugetlb fs size limit and sysadmin can enforce that at
mount time.  Combined with the two points mentioned above, I fail to see if
there is anything got affected because of this patch.

Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
Acked-by: Hugh Dickins <hugh@veritas.com>
Cc: Dave McCracken <dmccr@us.ibm.com>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Adam Litke <agl@us.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:21 -08:00
Chen, Kenneth W 39dde65c99 [PATCH] shared page table for hugetlb page
Following up with the work on shared page table done by Dave McCracken.  This
set of patch target shared page table for hugetlb memory only.

The shared page table is particular useful in the situation of large number of
independent processes sharing large shared memory segments.  In the normal
page case, the amount of memory saved from process' page table is quite
significant.  For hugetlb, the saving on page table memory is not the primary
objective (as hugetlb itself already cuts down page table overhead
significantly), instead, the purpose of using shared page table on hugetlb is
to allow faster TLB refill and smaller cache pollution upon TLB miss.

With PT sharing, pte entries are shared among hundreds of processes, the cache
consumption used by all the page table is smaller and in return, application
gets much higher cache hit ratio.  One other effect is that cache hit ratio
with hardware page walker hitting on pte in cache will be higher and this
helps to reduce tlb miss latency.  These two effects contribute to higher
application performance.

Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
Acked-by: Hugh Dickins <hugh@veritas.com>
Cc: Dave McCracken <dmccr@us.ibm.com>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Adam Litke <agl@us.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:21 -08:00
Andrew Morton e1dbeda60a [PATCH] balance_pdgat() cleanup
Despaghettify balance_pdgat() a bit.

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:21 -08:00
Nick Piggin cc10250907 [PATCH] mm: add arch_alloc_page
Add an arch_alloc_page to match arch_free_page.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:21 -08:00
Ashwin Chaugule 7602bdf2fd [PATCH] new scheme to preempt swap token
The new swap token patches replace the current token traversal algo.  The old
algo had a crude timeout parameter that was used to handover the token from
one task to another.  This algo, transfers the token to the tasks that are in
need of the token.  The urgency for the token is based on the number of times
a task is required to swap-in pages.  Accordingly, the priority of a task is
incremented if it has been badly affected due to swap-outs.  To ensure that
the token doesnt bounce around rapidly, the token holders are given a priority
boost.  The priority of tasks is also decremented, if their rate of swap-in's
keeps reducing.  This way, the condition to check whether to pre-empt the swap
token, is a matter of comparing two task's priority fields.

[akpm@osdl.org: cleanups]
Signed-off-by: Ashwin Chaugule <ashwin.chaugule@celunite.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:21 -08:00
Ashwin Chaugule 098fe651f7 [PATCH] grab swap token reordered
Make sure the contention for the token happens _before_ any read-in and
kicks the swap-token algo only when the VM is under pressure.

Signed-off-by: Ashwin Chaugule <ashwin.chaugule@celunite.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:21 -08:00
Nick Piggin f2a2a7108a [PATCH] oom: less memdie
Don't cause all threads in all other thread groups to gain TIF_MEMDIE
otherwise we'll get a thundering herd eating our memory reserve.  This may not
be the optimal scheme, but it fits our policy of allowing just one TIF_MEMDIE
in the system at once.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:20 -08:00
Nick Piggin f3af38d30c [PATCH] oom: cleanup messages
Clean up the OOM killer messages to be more consistent.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:20 -08:00
Nick Piggin c33e0fca35 [PATCH] oom: don't kill unkillable children or siblings
Abort the kill if any of our threads have OOM_DISABLE set.  Having this
test here also prevents any OOM_DISABLE child of the "selected" process
from being killed.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:20 -08:00
Paul Jackson 9276b1bc96 [PATCH] memory page_alloc zonelist caching speedup
Optimize the critical zonelist scanning for free pages in the kernel memory
allocator by caching the zones that were found to be full recently, and
skipping them.

Remembers the zones in a zonelist that were short of free memory in the
last second.  And it stashes a zone-to-node table in the zonelist struct,
to optimize that conversion (minimize its cache footprint.)

Recent changes:

    This differs in a significant way from a similar patch that I
    posted a week ago.  Now, instead of having a nodemask_t of
    recently full nodes, I have a bitmask of recently full zones.
    This solves a problem that last weeks patch had, which on
    systems with multiple zones per node (such as DMA zone) would
    take seeing any of these zones full as meaning that all zones
    on that node were full.

    Also I changed names - from "zonelist faster" to "zonelist cache",
    as that seemed to better convey what we're doing here - caching
    some of the key zonelist state (for faster access.)

    See below for some performance benchmark results.  After all that
    discussion with David on why I didn't need them, I went and got
    some ;).  I wanted to verify that I had not hurt the normal case
    of memory allocation noticeably.  At least for my one little
    microbenchmark, I found (1) the normal case wasn't affected, and
    (2) workloads that forced scanning across multiple nodes for
    memory improved up to 10% fewer System CPU cycles and lower
    elapsed clock time ('sys' and 'real').  Good.  See details, below.

    I didn't have the logic in get_page_from_freelist() for various
    full nodes and zone reclaim failures correct.  That should be
    fixed up now - notice the new goto labels zonelist_scan,
    this_zone_full, and try_next_zone, in get_page_from_freelist().

There are two reasons I persued this alternative, over some earlier
proposals that would have focused on optimizing the fake numa
emulation case by caching the last useful zone:

 1) Contrary to what I said before, we (SGI, on large ia64 sn2 systems)
    have seen real customer loads where the cost to scan the zonelist
    was a problem, due to many nodes being full of memory before
    we got to a node we could use.  Or at least, I think we have.
    This was related to me by another engineer, based on experiences
    from some time past.  So this is not guaranteed.  Most likely, though.

    The following approach should help such real numa systems just as
    much as it helps fake numa systems, or any combination thereof.

 2) The effort to distinguish fake from real numa, using node_distance,
    so that we could cache a fake numa node and optimize choosing
    it over equivalent distance fake nodes, while continuing to
    properly scan all real nodes in distance order, was going to
    require a nasty blob of zonelist and node distance munging.

    The following approach has no new dependency on node distances or
    zone sorting.

See comment in the patch below for a description of what it actually does.

Technical details of note (or controversy):

 - See the use of "zlc_active" and "did_zlc_setup" below, to delay
   adding any work for this new mechanism until we've looked at the
   first zone in zonelist.  I figured the odds of the first zone
   having the memory we needed were high enough that we should just
   look there, first, then get fancy only if we need to keep looking.

 - Some odd hackery was needed to add items to struct zonelist, while
   not tripping up the custom zonelists built by the mm/mempolicy.c
   code for MPOL_BIND.  My usual wordy comments below explain this.
   Search for "MPOL_BIND".

 - Some per-node data in the struct zonelist is now modified frequently,
   with no locking.  Multiple CPU cores on a node could hit and mangle
   this data.  The theory is that this is just performance hint data,
   and the memory allocator will work just fine despite any such mangling.
   The fields at risk are the struct 'zonelist_cache' fields 'fullzones'
   (a bitmask) and 'last_full_zap' (unsigned long jiffies).  It should
   all be self correcting after at most a one second delay.

 - This still does a linear scan of the same lengths as before.  All
   I've optimized is making the scan faster, not algorithmically
   shorter.  It is now able to scan a compact array of 'unsigned
   short' in the case of many full nodes, so one cache line should
   cover quite a few nodes, rather than each node hitting another
   one or two new and distinct cache lines.

 - If both Andi and Nick don't find this too complicated, I will be
   (pleasantly) flabbergasted.

 - I removed the comment claiming we only use one cachline's worth of
   zonelist.  We seem, at least in the fake numa case, to have put the
   lie to that claim.

 - I pay no attention to the various watermarks and such in this performance
   hint.  A node could be marked full for one watermark, and then skipped
   over when searching for a page using a different watermark.  I think
   that's actually quite ok, as it will tend to slightly increase the
   spreading of memory over other nodes, away from a memory stressed node.

===============

Performance - some benchmark results and analysis:

This benchmark runs a memory hog program that uses multiple
threads to touch alot of memory as quickly as it can.

Multiple runs were made, touching 12, 38, 64 or 90 GBytes out of
the total 96 GBytes on the system, and using 1, 19, 37, or 55
threads (on a 56 CPU system.)  System, user and real (elapsed)
timings were recorded for each run, shown in units of seconds,
in the table below.

Two kernels were tested - 2.6.18-mm3 and the same kernel with
this zonelist caching patch added.  The table also shows the
percentage improvement the zonelist caching sys time is over
(lower than) the stock *-mm kernel.

      number     2.6.18-mm3	   zonelist-cache    delta (< 0 good)	percent
 GBs    N  	------------	   --------------    ----------------	systime
 mem threads   sys user  real	  sys  user  real     sys  user  real	 better
  12	 1     153   24   177	  151	 24   176      -2     0    -1	   1%
  12	19	99   22     8	   99	 22	8	0     0     0	   0%
  12	37     111   25     6	  112	 25	6	1     0     0	  -0%
  12	55     115   25     5	  110	 23	5      -5    -2     0	   4%
  38	 1     502   74   576	  497	 73   570      -5    -1    -6	   0%
  38	19     426   78    48	  373	 76    39     -53    -2    -9	  12%
  38	37     544   83    36	  547	 82    36	3    -1     0	  -0%
  38	55     501   77    23	  511	 80    24      10     3     1	  -1%
  64	 1     917  125  1042	  890	124  1014     -27    -1   -28	   2%
  64	19    1118  138   119	  965	141   103    -153     3   -16	  13%
  64	37    1202  151    94	 1136	150    81     -66    -1   -13	   5%
  64	55    1118  141    61	 1072	140    58     -46    -1    -3	   4%
  90	 1    1342  177  1519	 1275	174  1450     -67    -3   -69	   4%
  90	19    2392  199   192	 2116	189   176    -276   -10   -16	  11%
  90	37    3313  238   175	 2972	225   145    -341   -13   -30	  10%
  90	55    1948  210   104	 1843	213   100    -105     3    -4	   5%

Notes:
 1) This test ran a memory hog program that started a specified number N of
    threads, and had each thread allocate and touch 1/N'th of
    the total memory to be used in the test run in a single loop,
    writing a constant word to memory, one store every 4096 bytes.
    Watching this test during some earlier trial runs, I would see
    each of these threads sit down on one CPU and stay there, for
    the remainder of the pass, a different CPU for each thread.

 2) The 'real' column is not comparable to the 'sys' or 'user' columns.
    The 'real' column is seconds wall clock time elapsed, from beginning
    to end of that test pass.  The 'sys' and 'user' columns are total
    CPU seconds spent on that test pass.  For a 19 thread test run,
    for example, the sum of 'sys' and 'user' could be up to 19 times the
    number of 'real' elapsed wall clock seconds.

 3) Tests were run on a fresh, single-user boot, to minimize the amount
    of memory already in use at the start of the test, and to minimize
    the amount of background activity that might interfere.

 4) Tests were done on a 56 CPU, 28 Node system with 96 GBytes of RAM.

 5) Notice that the 'real' time gets large for the single thread runs, even
    though the measured 'sys' and 'user' times are modest.  I'm not sure what
    that means - probably something to do with it being slow for one thread to
    be accessing memory along ways away.  Perhaps the fake numa system, running
    ostensibly the same workload, would not show this substantial degradation
    of 'real' time for one thread on many nodes -- lets hope not.

 6) The high thread count passes (one thread per CPU - on 55 of 56 CPUs)
    ran quite efficiently, as one might expect.  Each pair of threads needed
    to allocate and touch the memory on the node the two threads shared, a
    pleasantly parallizable workload.

 7) The intermediate thread count passes, when asking for alot of memory forcing
    them to go to a few neighboring nodes, improved the most with this zonelist
    caching patch.

Conclusions:
 * This zonelist cache patch probably makes little difference one way or the
   other for most workloads on real numa hardware, if those workloads avoid
   heavy off node allocations.
 * For memory intensive workloads requiring substantial off-node allocations
   on real numa hardware, this patch improves both kernel and elapsed timings
   up to ten per-cent.
 * For fake numa systems, I'm optimistic, but will have to leave that up to
   Rohit Seth to actually test (once I get him a 2.6.18 backport.)

Signed-off-by: Paul Jackson <pj@sgi.com>
Cc: Rohit Seth <rohitseth@google.com>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Cc: David Rientjes <rientjes@cs.washington.edu>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:20 -08:00
Christoph Lameter 89689ae7f9 [PATCH] Get rid of zone_table[]
The zone table is mostly not needed.  If we have a node in the page flags
then we can get to the zone via NODE_DATA() which is much more likely to be
already in the cpu cache.

In case of SMP and UP NODE_DATA() is a constant pointer which allows us to
access an exact replica of zonetable in the node_zones field.  In all of
the above cases there will be no need at all for the zone table.

The only remaining case is if in a NUMA system the node numbers do not fit
into the page flags.  In that case we make sparse generate a table that
maps sections to nodes and use that table to to figure out the node number.
 This table is sized to fit in a single cache line for the known 32 bit
NUMA platform which makes it very likely that the information can be
obtained without a cache miss.

For sparsemem the zone table seems to be have been fairly large based on
the maximum possible number of sections and the number of zones per node.
There is some memory saving by removing zone_table.  The main benefit is to
reduce the cache foootprint of the VM from the frequent lookups of zones.
Plus it simplifies the page allocator.

[akpm@osdl.org: build fix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:20 -08:00
Chen, Kenneth W c0a499c2c4 [PATCH] __unmap_hugepage_range(): add comment
Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:20 -08:00
Paul Jackson 0798e5193c [PATCH] memory page alloc minor cleanups
- s/freeliest/freelist/ spelling fix

- Check for NULL *z zone seems useless - even if it could happen, so
  what?  Perhaps we should have a check later on if we are faced with an
  allocation request that is not allowed to fail - shouldn't that be a
  serious kernel error, passing an empty zonelist with a mandate to not
  fail?

- Initializing 'z' to zonelist->zones can wait until after the first
  get_page_from_freelist() fails; we only use 'z' in the wakeup_kswapd()
  loop, so let's initialize 'z' there, in a 'for' loop.  Seems clearer.

- Remove superfluous braces around a break

- Fix a couple errant spaces

- Adjust indentation on the cpuset_zone_allowed() check, to match the
  lines just before it -- seems easier to read in this case.

- Add another set of braces to the zone_watermark_ok logic

From: Paul Jackson <pj@sgi.com>

  Backout one item from a previous "memory page_alloc minor cleanups" patch.
   Until and unless we are certain that no one can ever pass an empty zonelist
  to __alloc_pages(), this check for an empty zonelist (or some BUG
  equivalent) is essential.  The code in get_page_from_freelist() blow ups if
  passed an empty zonelist.

Signed-off-by: Paul Jackson <pj@sgi.com>
Acked-by: Christoph Lameter <clameter@sgi.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:20 -08:00
Linus Torvalds dd8856bda5 Merge git://git.infradead.org/users/dhowells/workq-2.6
* git://git.infradead.org/users/dhowells/workq-2.6:
  Actually update the fixed up compile failures.
  WorkQueue: Fix up arch-specific work items where possible
  WorkStruct: make allyesconfig
  WorkStruct: Pass the work_struct pointer instead of context data
  WorkStruct: Merge the pending bit into the wq_data pointer
  WorkStruct: Typedef the work function prototype
  WorkStruct: Separate delayable and non-delayable events.
2006-12-06 08:01:37 -08:00
Mike Frysinger f81cff0d40 [PATCH] uclinux: fix mmap() of directory for nommu case
I was playing with blackfin when i hit a neat bug ... doing an open() on a
directory and then passing that fd to mmap() would cause the kernel to hang

after poking into the code a bit more, i found that
mm/nommu.c:validate_mmap_request() checks the length and if it is 0, just
returns the address ... this is in stark contrast to mmu's
mm/mmap.c:do_mmap_pgoff() where it returns -EINVAL for 0 length requests ...
i then noticed that some other parts of the logic is out of date between the
two funcs, so perhaps that's the easy fix ?

Signed-off-by: Greg Ungerer <gerg@uclinux.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-06 07:41:26 -08:00
David Howells 9db7372445 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6
Conflicts:

	drivers/ata/libata-scsi.c
	include/linux/libata.h

Futher merge of Linus's head and compilation fixups.

Signed-Off-By: David Howells <dhowells@redhat.com>
2006-12-05 17:01:28 +00:00
David Howells 4c1ac1b491 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6
Conflicts:

	drivers/infiniband/core/iwcm.c
	drivers/net/chelsio/cxgb2.c
	drivers/net/wireless/bcm43xx/bcm43xx_main.c
	drivers/net/wireless/prism54/islpci_eth.c
	drivers/usb/core/hub.h
	drivers/usb/input/hid-core.c
	net/core/netpoll.c

Fix up merge failures with Linus's head and fix new compilation failures.

Signed-Off-By: David Howells <dhowells@redhat.com>
2006-12-05 14:37:56 +00:00
Mark Fasheh d23a147bb6 [PATCH] Export should_remove_suid()
This helps us avoid replicating the same logic within file system drivers.

Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
2006-12-01 18:28:38 -08:00
Mel Gorman 1abbfb412b [PATCH] x86_64: fix bad page state in process 'swapper'
find_min_pfn_for_node() and find_min_pfn_with_active_regions() both
depend on a sorted early_node_map[].  However, sort_node_map() is being
called after fin_min_pfn_with_active_regions() in
free_area_init_nodes().

In most cases, this is ok, but on at least one x86_64, the SRAT table
caused the E820 ranges to be registered out of order.  This gave the
wrong values for the min PFN range resulting in some pages not being
initialised.

This patch sorts the early_node_map in find_min_pfn_for_node().  It has
been boot tested on x86, x86_64, ppc64 and ia64.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-11-23 09:30:38 -08:00
David Howells c4028958b6 WorkStruct: make allyesconfig
Fix up for make allyesconfig.

Signed-Off-By: David Howells <dhowells@redhat.com>
2006-11-22 14:57:56 +00:00
David Howells 65f27f3844 WorkStruct: Pass the work_struct pointer instead of context data
Pass the work_struct pointer to the work function rather than context data.
The work function can use container_of() to work out the data.

For the cases where the container of the work_struct may go away the moment the
pending bit is cleared, it is made possible to defer the release of the
structure by deferring the clearing of the pending bit.

To make this work, an extra flag is introduced into the management side of the
work_struct.  This governs auto-release of the structure upon execution.

Ordinarily, the work queue executor would release the work_struct for further
scheduling or deallocation by clearing the pending bit prior to jumping to the
work function.  This means that, unless the driver makes some guarantee itself
that the work_struct won't go away, the work function may not access anything
else in the work_struct or its container lest they be deallocated..  This is a
problem if the auxiliary data is taken away (as done by the last patch).

However, if the pending bit is *not* cleared before jumping to the work
function, then the work function *may* access the work_struct and its container
with no problems.  But then the work function must itself release the
work_struct by calling work_release().

In most cases, automatic release is fine, so this is the default.  Special
initiators exist for the non-auto-release case (ending in _NAR).


Signed-Off-By: David Howells <dhowells@redhat.com>
2006-11-22 14:55:48 +00:00
David Howells 52bad64d95 WorkStruct: Separate delayable and non-delayable events.
Separate delayable work items from non-delayable work items be splitting them
into a separate structure (delayed_work), which incorporates a work_struct and
the timer_list removed from work_struct.

The work_struct struct is huge, and this limits it's usefulness.  On a 64-bit
architecture it's nearly 100 bytes in size.  This reduces that by half for the
non-delayable type of event.

Signed-Off-By: David Howells <dhowells@redhat.com>
2006-11-22 14:54:01 +00:00
OGAWA Hirofumi 31be830953 [PATCH] Fix strange size check in __get_vm_area_node()
Recently, __get_vm_area_node() was changed like following

 	if (unlikely(!area))
 		return NULL;

-	if (unlikely(!size)) {
-		kfree (area);
+	if (unlikely(!size))
 		return NULL;
-	}

It is leaking `area', also original code seems strange already.
Probably, we wanted to do this patch.

Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-11-16 11:43:38 -08:00
Hugh Dickins cd2579d7aa [PATCH] hugetlb: fix error return for brk() entering a hugepage region
Commit cb07c9a186 causes the wrong return
value.  is_hugepage_only_range() is a boolean, so we should return
-EINVAL rather than 1.

Also - we can use "mm" instead of looking up "current->mm" again.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-11-14 15:15:01 -08:00
David Gibson cb07c9a186 [PATCH] hugetlb: check for brk() entering a hugepage region
Unlike mmap(), the codepath for brk() creates a vma without first checking
that it doesn't touch a region exclusively reserved for hugepages.  On
powerpc, this can allow it to create a normal page vma in a hugepage
region, causing oopses and other badness.

Add a test to prevent this.  With this patch, brk() will simply fail if it
attempts to move the break into a hugepage reserved region.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Cc: Adam Litke <agl@us.ibm.com>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-11-14 09:09:27 -08:00
Hugh Dickins 68589bc353 [PATCH] hugetlb: prepare_hugepage_range check offset too
(David:)

If hugetlbfs_file_mmap() returns a failure to do_mmap_pgoff() - for example,
because the given file offset is not hugepage aligned - then do_mmap_pgoff
will go to the unmap_and_free_vma backout path.

But at this stage the vma hasn't been marked as hugepage, and the backout path
will call unmap_region() on it.  That will eventually call down to the
non-hugepage version of unmap_page_range().  On ppc64, at least, that will
cause serious problems if there are any existing hugepage pagetable entries in
the vicinity - for example if there are any other hugepage mappings under the
same PUD.  unmap_page_range() will trigger a bad_pud() on the hugepage pud
entries.  I suspect this will also cause bad problems on ia64, though I don't
have a machine to test it on.

(Hugh:)

prepare_hugepage_range() should check file offset alignment when it checks
virtual address and length, to stop MAP_FIXED with a bad huge offset from
unmapping before it fails further down.  PowerPC should apply the same
prepare_hugepage_range alignment checks as ia64 and all the others do.

Then none of the alignment checks in hugetlbfs_file_mmap are required (nor
is the check for too small a mapping); but even so, move up setting of
VM_HUGETLB and add a comment to warn of what David Gibson discovered - if
hugetlbfs_file_mmap fails before setting it, do_mmap_pgoff's unmap_region
when unwinding from error will go the non-huge way, which may cause bad
behaviour on architectures (powerpc and ia64) which segregate their huge
mappings into a separate region of the address space.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Acked-by: Adam Litke <agl@us.ibm.com>
Acked-by: David Gibson <david@gibson.dropbear.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-11-14 09:09:27 -08:00
Eric Dumazet 2b4ac44e7c [PATCH] vmalloc: optimization, cleanup, bugfixes
- reorder 'struct vm_struct' to speedup lookups on CPUS with small cache
  lines.  The fields 'next,addr,size' should be now in the same cache line,
  to speedup lookups.

- One minor cleanup in __get_vm_area_node()

- Bugfixes in vmalloc_user() and vmalloc_32_user() NULL returns from
  __vmalloc() and __find_vm_area() were not tested.

[akpm@osdl.org: remove redundant BUG_ONs]
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-11-13 07:40:42 -08:00
Stephen Rothwell 8ce08464d2 [PATCH] Fix sys_move_pages when a NULL node list is passed
sys_move_pages() uses vmalloc() to allocate an array of structures that is
fills with information passed from user mode and then passes to
do_stat_pages() (in the case the node list is NULL).  do_stat_pages()
depends on a marker in the node field of the structure to decide how large
the array is and this marker is correctly inserted into the last element of
the array.  However, vmalloc() doesn't zero the memory it allocates and if
the user passes NULL for the node list, then the node fields are not filled
in (except for the end marker).  If the memory the vmalloc() returned
happend to have a word with the marker value in it in just the right place,
do_pages_stat will fail to fill the status field of part of the array and
we will return (random) kernel data to user mode.

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-11-03 12:27:59 -08:00
Daniel Yeisley 7f6b8876c7 [PATCH] init_reap_node() initialization fix
It looks like there is a bug in init_reap_node() in slab.c that can cause
multiple oops's on certain ES7000 configurations.  The variable reap_node
is defined per cpu, but only initialized on a single CPU.  This causes an
oops in next_reap_node() when __get_cpu_var(reap_node) returns the wrong
value.  Fix is below.

Signed-off-by: Dan Yeisley <dan.yeisley@unisys.com>
Cc: Andi Kleen <ak@suse.de>
Acked-by: Christoph Lameter <clameter@engr.sgi.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-11-03 12:27:58 -08:00
OGAWA Hirofumi 029e332ea7 [PATCH] Cleanup read_pages()
Current read_pages() assume ->readpages() frees the passed pages.

This patch free the pages in ->read_pages(), if those were remaining in the
pages_list.  So, readpages() just can ignore the remaining pages in
pages_list.

Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Steven French <sfrench@us.ibm.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-11-03 12:27:56 -08:00
nkalmala 941c7105dc [PATCH] mm: un-needed add-store operation wastes a few bytes
Un-needed add-store operation wastes a few bytes.
8 bytes wasted with -O2, on a ppc.

Signed-off-by: nkalmala <nkalmala@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-11-03 12:27:56 -08:00
Giridhar Pemmasani 5211e6e6c6 [PATCH] Fix GFP_HIGHMEM slab panic
As reported by Martin J. Bligh <mbligh@google.com>, we let through some
non-slab bits to slab allocation through __get_vm_area_node when doing a
vmalloc.

I haven't been able to reproduce this, although I understand why it
happens: vmalloc allocates memory with

GFP_KERNEL | __GFP_HIGHMEM

and commit 52fd24ca1d resulted in the same
flags are passed down to cache_alloc_refill, causing the BUG.  The
following patch fixes it.

Note that when calling kmalloc_node, I am masking off __GFP_HIGHMEM with
GFP_LEVEL_MASK, whereas __vmalloc_area_node does the same with

~(__GFP_HIGHMEM | __GFP_ZERO).

IMHO, using GFP_LEVEL_MASK is preferable, but either should fix this
problem.

Signed-off-by: Giridhar Pemmasani (pgiri@yahoo.com)
Cc: Martin J. Bligh <mbligh@google.com>
Cc: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-29 08:01:58 -08:00
Mel Gorman 0c6cb97463 [PATCH] Calculation fix for memory holes beyong the end of physical memory
absent_pages_in_range() made the assumption that users of the
arch-independent zone-sizing API would not care about holes beyound the end
of physical memory.  This was not the case and was "fixed" in a patch
called "Account for holes that are outside the range of physical memory".
However, when given a range that started before a hole in "real" memory and
ended beyond the end of memory, it would get the result wrong.  The bug is
in mainline but a patch is below.

It has been tested successfully on a number of machines and architectures.
Additional credit to Keith Mannthey for discovering the problem, helping
identify the correct fix and confirming it Worked For Him.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: keith mannthey <kmannth@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-28 11:30:55 -07:00
Hugh Dickins ebed4bfc8d [PATCH] hugetlb: fix absurd HugePages_Rsvd
If you truncated an mmap'ed hugetlbfs file, then faulted on the truncated
area, /proc/meminfo's HugePages_Rsvd wrapped hugely "negative".  Reinstate my
preliminary i_size check before attempting to allocate the page (though this
only fixes the most obvious case: more work will be needed here).

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: Adam Litke <agl@us.ibm.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-28 11:30:53 -07:00
Giridhar Pemmasani 52fd24ca1d [PATCH] __vmalloc with GFP_ATOMIC causes 'sleeping from invalid context'
If __vmalloc is called to allocate memory with GFP_ATOMIC in atomic
context, the chain of calls results in __get_vm_area_node allocating memory
for vm_struct with GFP_KERNEL, causing the 'sleeping from invalid context'
warning.  This patch fixes it by passing the gfp flags along so
__get_vm_area_node allocates memory for vm_struct with the same flags.

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-28 11:30:52 -07:00
Yasunori Goto f2d0aa5bf8 [PATCH] memory hotplug: __GFP_NOWARN is better for __kmalloc_section_memmap()
Add __GFP_NOWARN flag to calling of __alloc_pages() in
__kmalloc_section_memmap().  It can reduce noisy failure message.

In ia64, section size is 1 GB, this means that order 8 pages are necessary
for each section's memmap.  It is often very hard requirement under heavy
memory pressure as you know.  So, __alloc_pages() gives up allocation and
shows many noisy stack traces which means no page for each sections.
(Current my environment shows 32 times of stack trace....)

But, __kmalloc_section_memmap() calls vmalloc() after failure of it, and it
can succeed allocation of memmap.  So, its stack trace warning becomes just
noisy.  I suppose it shouldn't be shown.

Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-28 11:30:52 -07:00
Martin Bligh bbdb396a60 [PATCH] Use min of two prio settings in calculating distress for reclaim
If try_to_free_pages / balance_pgdat are called with a gfp_mask specifying
GFP_IO and/or GFP_FS, they will reclaim the requisite number of pages, and the
reset prev_priority to DEF_PRIORITY (or to some other high (ie: unurgent)
value).

However, another reclaimer without those gfp_mask flags set (say, GFP_NOIO)
may still be struggling to reclaim pages.  The concurrent overwrite of
zone->prev_priority will cause this GFP_NOIO thread to unexpectedly cease
deactivating mapped pages, thus causing reclaim difficulties.

Fix this is to key the distress calculation not off zone->prev_priority, but
also take into account the local caller's priority by using
min(zone->prev_priority, sc->priority)

Signed-off-by: Martin J. Bligh <mbligh@google.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-28 11:30:51 -07:00
Martin Bligh 3bb1a852ab [PATCH] vmscan: Fix temp_priority race
The temp_priority field in zone is racy, as we can walk through a reclaim
path, and just before we copy it into prev_priority, it can be overwritten
(say with DEF_PRIORITY) by another reclaimer.

The same bug is contained in both try_to_free_pages and balance_pgdat, but
it is fixed slightly differently.  In balance_pgdat, we keep a separate
priority record per zone in a local array.  In try_to_free_pages there is
no need to do this, as the priority level is the same for all zones that we
reclaim from.

Impact of this bug is that temp_priority is copied into prev_priority, and
setting this artificially high causes reclaimers to set distress
artificially low.  They then fail to reclaim mapped pages, when they are,
in fact, under severe memory pressure (their priority may be as low as 0).
This causes the OOM killer to fire incorrectly.

From: Andrew Morton <akpm@osdl.org>

__zone_reclaim() isn't modifying zone->prev_priority.  But zone->prev_priority
is used in the decision whether or not to bring mapped pages onto the inactive
list.  Hence there's a risk here that __zone_reclaim() will fail because
zone->prev_priority ir large (ie: low urgency) and lots of mapped pages end up
stuck on the active list.

Fix that up by decreasing (ie making more urgent) zone->prev_priority as
__zone_reclaim() scans the zone's pages.

This bug perhaps explains why ZONE_RECLAIM_PRIORITY was created.  It should be
possible to remove that now, and to just start out at DEF_PRIORITY?

Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-28 11:30:50 -07:00
Nick Piggin 2ae88149a2 [PATCH] mm: clean up pagecache allocation
- Consolidate page_cache_alloc

- Fix splice: only the pagecache pages and filesystem data need to use
  mapping_gfp_mask.

- Fix grab_cache_page_nowait: same as splice, also honour NUMA placement.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-28 11:30:50 -07:00
Christoph Lameter aedb0eb107 [PATCH] Slab: Do not fallback to nodes that have not been bootstrapped yet
The zonelist may contain zones of nodes that have not been bootstrapped and
we will oops if we try to allocate from those zones.  So check if the node
information for the slab and the node have been setup before attempting an
allocation.  If it has not been setup then skip that zone.

Usually we will not encounter this situation since the slab bootstrap code
avoids falling back before we have setup the respective nodes but we seem
to have a special needs for pppc.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Andy Whitcroft <apw@shadowen.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Mike Kravetz <kravetz@us.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Will Schmidt <will_schmidt@vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-21 13:35:06 -07:00
Andy Whitcroft 7516795739 [PATCH] Reintroduce NODES_SPAN_OTHER_NODES for powerpc
Reintroduce NODES_SPAN_OTHER_NODES for powerpc

Revert "[PATCH] Remove SPAN_OTHER_NODES config definition"
    This reverts commit f62859bb68.
Revert "[PATCH] mm: remove arch independent NODES_SPAN_OTHER_NODES"
    This reverts commit a94b3ab7ea.

Also update the comments to indicate that this is still required
and where its used.

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Mike Kravetz <kravetz@us.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Will Schmidt <will_schmidt@vnet.ibm.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-21 13:35:06 -07:00
Linus Torvalds 7b7fc708b5 Merge branch 'splice' of git://brick.kernel.dk/data/git/linux-2.6-block
* 'splice' of git://brick.kernel.dk/data/git/linux-2.6-block:
  [PATCH] Remove SUID when splicing into an inode
  [PATCH] Add lockless helpers for remove_suid()
  [PATCH] Introduce generic_file_splice_write_nolock()
  [PATCH] Take i_mutex in splice_from_pipe()
2006-10-21 10:01:52 -07:00
Nick Piggin 82591e6ea2 [PATCH] mm: more commenting on lock ordering
Clarify lockorder comments now that sys_msync dropps mmap_sem before
calling do_fsync.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-20 10:26:44 -07:00
Dmitriy Monakhov c4ec7b0de4 [PATCH] mm: D-cache aliasing issue in cow_user_page
--=-=-=

 from mm/memory.c:
  1434  static inline void cow_user_page(struct page *dst, struct page *src, unsigned long va)
  1435  {
  1436          /*
  1437           * If the source page was a PFN mapping, we don't have
  1438           * a "struct page" for it. We do a best-effort copy by
  1439           * just copying from the original user address. If that
  1440           * fails, we just zero-fill it. Live with it.
  1441           */
  1442          if (unlikely(!src)) {
  1443                  void *kaddr = kmap_atomic(dst, KM_USER0);
  1444                  void __user *uaddr = (void __user *)(va & PAGE_MASK);
  1445
  1446                  /*
  1447                   * This really shouldn't fail, because the page is there
  1448                   * in the page tables. But it might just be unreadable,
  1449                   * in which case we just give up and fill the result with
  1450                   * zeroes.
  1451                   */
  1452                  if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE))
  1453                          memset(kaddr, 0, PAGE_SIZE);
  1454                  kunmap_atomic(kaddr, KM_USER0);
  #### D-cache have to be flushed here.
  #### It seems it is just forgotten.

  1455                  return;
  1456
  1457          }
  1458          copy_user_highpage(dst, src, va);
  #### Ok here. flush_dcache_page() called from this func if arch need it
  1459  }

Following is the patch  fix this issue:

Signed-off-by: Dmitriy Monakhov <dmonakhov@openvz.org>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-20 10:26:43 -07:00
Andrew Morton 6220ec7844 [PATCH] highest_possible_node_id() linkage fix
Qooting Adrian:

- net/sunrpc/svc.c uses highest_possible_node_id()

- include/linux/nodemask.h says highest_possible_node_id() is
  out-of-line #if MAX_NUMNODES > 1

- the out-of-line highest_possible_node_id() is in lib/cpumask.c

- lib/Makefile: lib-$(CONFIG_SMP) += cpumask.o
  CONFIG_ARCH_DISCONTIGMEM_ENABLE=y, CONFIG_SMP=n, CONFIG_SUNRPC=y

-> highest_possible_node_id() is used in net/sunrpc/svc.c
   CONFIG_NODES_SHIFT defined and > 0

-> include/linux/numa.h: MAX_NUMNODES > 1

-> compile error

The bug is not present on architectures where ARCH_DISCONTIGMEM_ENABLE
depends on NUMA (but m32r isn't the only affected architecture).

So move the function into page_alloc.c

Cc: Adrian Bunk <bunk@stusta.de>
Cc: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-20 10:26:43 -07:00
Alexey Dobriyan 8ac773b4f7 [PATCH] OOM killer meets userspace headers
Despite mm.h is not being exported header, it does contain one thing
which is part of userspace ABI -- value disabling OOM killer for given
process. So,
a) create and export include/linux/oom.h
b) move OOM_DISABLE define there.
c) turn bounding values of /proc/$PID/oom_adj into defines and export
   them too.

Note: mass __KERNEL__ removal will be done later.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-20 10:26:38 -07:00
Andrew Morton 3fcfab16c5 [PATCH] separate bdi congestion functions from queue congestion functions
Separate out the concept of "queue congestion" from "backing-dev congestion".
Congestion is a backing-dev concept, not a queue concept.

The blk_* congestion functions are retained, as wrappers around the core
backing-dev congestion functions.

This proper layering is needed so that NFS can cleanly use the congestion
functions, and so that CONFIG_BLOCK=n actually links.

Cc: "Thomas Maier" <balagi@justmail.de>
Cc: "Jens Axboe" <jens.axboe@oracle.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: David Howells <dhowells@redhat.com>
Cc: Peter Osterlund <petero2@telia.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-20 10:26:35 -07:00
Jeff Moyer fb5527e68d [PATCH] direct-io: sync and invalidate file region when falling back to buffered write
When direct-io falls back to buffered write, it will just leave the dirty data
floating about in pagecache, pending regular writeback.

But normal direct-io semantics are that IO is synchronous, and that it leaves
no pagecache behind.

So change the fallback-to-buffered-write code to sync the file region and to
then strip away the pagecache, just as a regular direct-io write would do.

Acked-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Zach Brown <zach.brown@oracle.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-20 10:26:35 -07:00
Jens Axboe 01de85e057 [PATCH] Add lockless helpers for remove_suid()
Right now users have to grab i_mutex before calling remove_suid(), in the
unlikely event that a call to ->setattr() may be needed. Split up the
function in two parts:

- One to check if we need to remove suid
- One to actually remove it

The first we can call lockless.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2006-10-19 20:53:08 +02:00
Andrew Morton 286e1ea3ac [PATCH] vmalloc(): don't pass __GFP_ZERO to slab
A recent change to the vmalloc() code accidentally resulted in us passing
__GFP_ZERO into the slab allocator.  But we only wanted __GFP_ZERO for the
actual pages whcih are being vmalloc()ed, and passing __GFP_ZERO into slab is
not a rational thing to ask for.

Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-17 08:18:44 -07:00
David M. Grimes 91828a405a [PATCH] knfsd: add nfs-export support to tmpfs
We need to encode a decode the 'file' part of a handle.  We simply use the
inode number and generation number to construct the filehandle.

The generation number is the time when the file was created.  As inode numbers
cycle through the full 32 bits before being reused, there is no real chance of
the same inum being allocated to different files in the same second so this is
suitably unique.  Using time-of-day rather than e.g.  jiffies makes it less
likely that the same filehandle can be created after a reboot.

In order to be able to decode a filehandle we need to be able to lookup by
inum, which means that the inode needs to be added to the inode hash table
(tmpfs doesn't currently hash inodes as there is never a need to lookup by
inum).  To avoid overhead when not exporting, we only hash an inode when it is
first exported.  This requires a lock to ensure it isn't hashed twice.

This code is separate from the patch posted in June06 from Atal Shargorodsky
which provided the same functionality, but does borrow slightly from it.

Locking comment: Most filesystems that hash their inodes do so at the point
where the 'struct inode' is initialised, and that has suitable locking
(I_NEW).  Here in shmem, we are hashing the inode later, the first time we
need an NFS file handle for it.  We no longer have I_NEW to ensure only one
thread tries to add it to the hash table.

Cc: Atal Shargorodsky <atal@codefidence.com>
Cc: Gilad Ben-Yossef <gilad@codefidence.com>
Signed-off-by: David M. Grimes <dgrimes@navisite.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Acked-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-17 08:18:43 -07:00
Andrew Morton a649fd9271 [PATCH] invalidate: remove_mapping() fix
If remove_mapping() failed to remove the page from its mapping, don't go and
mark it not uptodate!  Makes kernel go dead.

(Actually, I don't think the ClearPageUptodate is needed there at all).

Says Nick Piggin:

   "Right, it isn't needed because at this point the page is guaranteed
    by remove_mapping to have no references (except us) and cannot pick
    up any new ones because it is removed from pagecache.

    We can delete it."

Signed-off-by: Andrew Morton <akpm@osdl.org>
Acked-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-17 08:18:43 -07:00
Linus Torvalds 80c5606c3b Fix VM_MAYEXEC calculation
.. and clean up the file mapping code while at it.  No point in having a
"if (file)" repeated twice, and generally doing similar checks in two
different sections of the same code

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-15 14:09:55 -07:00
Aneesh Kumar 53bc5b2db1 [PATCH] Fix typos in mm/shmem_acl.c
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-11 11:14:23 -07:00
Trond Myklebust 887ed2f3ae [PATCH] VM: Fix the gfp_mask in invalidate_complete_page2
If try_to_release_page() is called with a zero gfp mask, then the
filesystem is effectively denied the possibility of sleeping while
attempting to release the page.  There doesn't appear to be any valid
reason why this should be banned, given that we're not calling this from a
memory allocation context.

For this reason, change the gfp_mask argument of the call to GFP_KERNEL.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Steve Dickson <SteveD@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-11 11:14:22 -07:00
Andrew Morton 8258d4a574 [PATCH] invalidate_inode_pages2_range() debug
A failure in invalidate_inode_pages2_range() can result in unpleasant things
happening in NFS (at least).  Stick a WARN_ON_ONCE() in there so we can find
out if it happens, and maybe why.

(akpm: might be a -mm-only patch, we'll see..)

Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Steve Dickson <SteveD@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-11 11:14:22 -07:00
Nick Piggin 9858db504c [PATCH] mm: locks_freed fix
Move the lock debug checks below the page reserved checks.  Also, having
debug_check_no_locks_freed in kernel_map_pages is wrong.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-11 11:14:19 -07:00
Nick Piggin dafb13673c [PATCH] mm: arch_free_page fix
After the PG_reserved check was added, arch_free_page was being called in the
wrong place (it could be called for a page we don't actually want to free).
Fix that.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-11 11:14:19 -07:00
Keith Owens 6993974997 [PATCH] Fix do_mbind warning with CONFIG_MIGRATION=n
With CONFIG_MIGRATION=n

mm/mempolicy.c: In function 'do_mbind':
mm/mempolicy.c:796: warning: passing argument 2 of 'migrate_pages' from incompatible pointer type

Signed-off-by: Keith Owens <kaos@ocs.com.au>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-11 11:14:19 -07:00
Dave Jones b16bc64d1a [PATCH] move rmap BUG_ON outside DEBUG_VM
We have a persistent dribble of reports of this BUG triggering.  Its extended
diagnostics were recently made conditional on CONFIG_DEBUG_VM, which was a bad
idea - we want to know about it.

Signed-off-by: Dave Jones <davej@redhat.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-11 11:14:19 -07:00
Chen, Kenneth W 502717f4e1 [PATCH] hugetlb: fix linked list corruption in unmap_hugepage_range()
commit fe1668ae5b causes kernel to oops with
libhugetlbfs test suite.  The problem is that hugetlb pages can be shared
by multiple mappings.  Multiple threads can fight over page->lru in the
unmap path and bad things happen.  We now serialize __unmap_hugepage_range
to void concurrent linked list manipulation.  Such serialization is also
needed for shared page table page on hugetlb area.  This patch will fixed
the bug and also serve as a prepatch for shared page table.

Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-11 11:14:15 -07:00
Mel Gorman b888132b0f [PATCH] mm: remove memmap_zone_idx()
memmap_zone_idx() is not used anymore.  It was required by an earlier
version of
account-for-memmap-and-optionally-the-kernel-image-as-holes.patch but not
any more.

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-11 11:14:14 -07:00
Christoph Lameter dcbd4ec4c2 [PATCH] slab: remove wrongly placed BUG_ON
Init list is called with a list parameter that is not equal to the
cachep->nodelists entry under NUMA if more than one node exists.  This is
fully legitimatei.  One may want to populate the list fields before
switching nodelist pointers.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-07 10:51:14 -07:00
Benjamin Herrenschmidt 7f7bbbe50b [PATCH] page fault retry with NOPAGE_REFAULT
Add a way for a no_page() handler to request a retry of the faulting
instruction.  It goes back to userland on page faults and just tries again
in get_user_pages().  I added a cond_resched() in the loop in that later
case.

The problem I have with signal and spufs is an actual bug affecting apps and I
don't see other ways of fixing it.

In addition, we are having issues with infiniband and 64k pages (related to
the way the hypervisor deals with some HV cards) that will require us to muck
around with the MMU from within the IB driver's no_page() (it's a pSeries
specific driver) and return to the caller the same way using NOPAGE_REFAULT.

And to add to this, the graphics folks have been following a new approach of
memory management that involves transparently swapping objects between video
ram and main meory.  To do that, they need installing PTEs from a no_page()
handler as well and that also requires returning with NOPAGE_REFAULT.

(For the later, they are currently using io_remap_pfn_range to install one PTE
from no_page() which is a bit racy, we need to add a check for the PTE having
already been installed afer taking the lock, but that's ok, they are only at
the proof-of-concept stage.  I'll send a patch adding a "clean" function to do
that, we can use that from spufs too and get rid of the sparsemem hacks we do
to create struct page for SPEs.  Basically, that provides a generic solution
for being able to have no_page() map hardware devices, which is something that
I think sound driver folks have been asking for some time too).

All of these things depend on having the NOPAGE_REFAULT exit path from
no_page() handlers.

Signed-off-by: Benjamin Herrenchmidt <benh@kernel.crashing.org>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-06 08:53:40 -07:00
Pekka Enberg 1ca4cb2418 [PATCH] slab: reduce numa text size
Reduce the NUMA text size of mm/slab.o a little on x86 by using a local
variable to store the result of numa_node_id().

    text    data     bss     dec     hex filename
   16858    2584      16   19458    4c02 mm/slab.o (before)
   16804    2584      16   19404    4bcc mm/slab.o (after)

[akpm@osdl.org: use better names]
[pbadari@us.ibm.com: fix that]
Cc: Christoph Lameter <christoph@lameter.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-06 08:53:40 -07:00
Linus Torvalds fefd26b3b8 Merge master.kernel.org:/pub/scm/linux/kernel/git/davej/configh
* master.kernel.org:/pub/scm/linux/kernel/git/davej/configh:
  Remove all inclusions of <linux/config.h>

Manually resolved trivial path conflicts due to removed files in
the sound/oss/ subdirectory.
2006-10-04 09:59:57 -07:00
Linus Torvalds 4a61f17378 Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6: (292 commits)
  [GFS2] Fix endian bug for de_type
  [GFS2] Initialize SELinux extended attributes at inode creation time.
  [GFS2] Move logging code into log.c (mostly)
  [GFS2] Mark nlink cleared so VFS sees it happen
  [GFS2] Two redundant casts removed
  [GFS2] Remove uneeded endian conversion
  [GFS2] Remove duplicate sb reading code
  [GFS2] Mark metadata reads for blktrace
  [GFS2] Remove iflags.h, use FS_
  [GFS2] Fix code style/indent in ops_file.c
  [GFS2] streamline-generic_file_-interfaces-and-filemap gfs fix
  [GFS2] Remove readv/writev methods and use aio_read/aio_write instead (gfs bits)
  [GFS2] inode-diet: Eliminate i_blksize from the inode structure
  [GFS2] inode_diet: Replace inode.u.generic_ip with inode.i_private (gfs)
  [GFS2] Fix typo in last patch
  [GFS2] Fix direct i/o logic in filemap.c
  [GFS2] Fix bug in Makefiles for lock modules
  [GFS2] Remove (extra) fs_subsys declaration
  [GFS2/DLM] Fix trailing whitespace
  [GFS2] Tidy up meta_io code
  ...
2006-10-04 09:06:16 -07:00
Christoph Hellwig 1d2c8eea69 [PATCH] slab: clean up leak tracking ifdefs a little bit
- rename ____kmalloc to kmalloc_track_caller so that people have a chance
  to guess what it does just from it's name.  Add a comment describing it
  for those who don't.  Also move it after kmalloc in slab.h so people get
  less confused when they are just looking for kmalloc - move things around
  in slab.c a little to reduce the ifdef mess.

[penberg@cs.helsinki.fi: Fix up reversed #ifdef]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-04 07:55:13 -07:00
Randy Dunlap 88ca3b94e8 [PATCH] page_alloc: fix kernel-doc and func. declaration
Fix kernel-doc and function declaration (missing "void") in
mm/page_alloc.c.

Add mm/page_alloc.c to kernel-api.tmpl in DocBook.

mm/page_alloc.c:2589:38: warning: non-ANSI function declaration of function 'remove_all_active_ranges'

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-04 07:55:12 -07:00
Chen, Kenneth W fe1668ae5b [PATCH] enforce proper tlb flush in unmap_hugepage_range
Spotted by Hugh that hugetlb page is free'ed back to global pool before
performing any TLB flush in unmap_hugepage_range().  This potentially allow
threads to abuse free-alloc race condition.

The generic tlb gather code is unsuitable to use by hugetlb, I just open
coded a page gathering list and delayed put_page until tlb flush is
performed.

Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
Acked-by: William Irwin <wli@holomorphy.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-04 07:55:12 -07:00
Nick Piggin e80ee884ae [PATCH] mm: micro optimise zone_watermark_ok
Having min be a signed quantity means gcc can't turn high latency divides
into shifts.  There happen to be two such divides for GFP_ATOMIC (ie.
networking, ie.  important) allocations, one of which depends on the other.
 Fixing this makes code smaller as a bonus.

Shame on somebody (probably me).

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-04 07:55:12 -07:00
Henrik Kretzschmar b2abacf3a2 [PATCH] mm: fix in kerneldoc
Fixes an kerneldoc error.

Signed-off-by: Henrik Kretzschmar <henne@nachtwindheim.de>
Cc: "Randy.Dunlap" <rdunlap@xenotime.net>
Acked-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-04 07:55:12 -07:00
Dave Jones 038b0a6d8d Remove all inclusions of <linux/config.h>
kbuild explicitly includes this at build time.

Signed-off-by: Dave Jones <davej@redhat.com>
2006-10-04 03:38:54 -04:00
Michael Opdenacker c1c8897f83 Spelling fix: "control" instead of "cotrol"
This patch against fixes a spelling mistake ("control" instead of "cotrol").

Signed-off-by: Michael Opdenacker <michael@free-electrons.com>
Acked-by: Alan Cox <alan@redhat.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-10-03 23:21:02 +02:00
Uwe Zeisberger f30c226954 fix file specification in comments
Many files include the filename at the beginning, serveral used a wrong one.

Signed-off-by: Uwe Zeisberger <Uwe_Zeisberger@digi.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-10-03 23:01:26 +02:00
Matt LaPlante 84eb8d0608 Fix "can not" in Documentation and Kconfig
Randy brought it to my attention that in proper english "can not" should always
be written "cannot". I donot see any reason to argue, even if I mightnot
understand why this rule exists.  This patch fixes "can not" in several
Documentation files as well as three Kconfigs.

Signed-off-by: Matt LaPlante <kernel1@cyberdogtech.com>
Acked-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-10-03 22:53:09 +02:00
Matt LaPlante 44c09201a4 more misc typo fixes
Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-10-03 22:34:14 +02:00
Steven Whitehouse 59458f40e2 Merge branch 'master' into gfs2 2006-10-02 08:45:08 -04:00
Zachary Amsden 6606c3e0da [PATCH] paravirt: lazy mmu mode hooks.patch
Implement lazy MMU update hooks which are SMP safe for both direct and shadow
page tables.  The idea is that PTE updates and page invalidations while in
lazy mode can be batched into a single hypercall.  We use this in VMI for
shadow page table synchronization, and it is a win.  It also can be used by
PPC and for direct page tables on Xen.

For SMP, the enter / leave must happen under protection of the page table
locks for page tables which are being modified.  This is because otherwise,
you end up with stale state in the batched hypercall, which other CPUs can
race ahead of.  Doing this under the protection of the locks guarantees the
synchronization is correct, and also means that spurious faults which are
generated during this window by remote CPUs are properly handled, as the page
fault handler must re-check the PTE under protection of the same lock.

Signed-off-by: Zachary Amsden <zach@vmware.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:33 -07:00
Zachary Amsden 9888a1cae3 [PATCH] paravirt: pte clear not present
Change pte_clear_full to a more appropriately named pte_clear_not_present,
allowing optimizations when not-present mapping changes need not be reflected
in the hardware TLB for protected page table modes.  There is also another
case that can use it in the fremap code.

Signed-off-by: Zachary Amsden <zach@vmware.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:33 -07:00
Zachary Amsden 3dc9079514 [PATCH] paravirt: remove read hazard from cow
We don't want to read PTEs directly like this after they have been modified,
as a lazy MMU implementation of direct page tables may not have written the
updated PTE back to memory yet.

Signed-off-by: Zachary Amsden <zach@vmware.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:33 -07:00
Andrew Morton bd4c8ce41a [PATCH] invalidate_inode_pages2(): ignore page refcounts
The recent fix to invalidate_inode_pages() (git commit 016eb4a) managed to
unfix invalidate_inode_pages2().

The problem is that various bits of code in the kernel can take transient refs
on pages: the page scanner will do this when inspecting a batch of pages, and
the lru_cache_add() batching pagevecs also hold a ref.

Net result is transient failures in invalidate_inode_pages2().  This affects
NFS directory invalidation (observed) and presumably also block-backed
direct-io (not yet reported).

Fix it by reverting invalidate_inode_pages2() back to the old version which
ignores the page refcounts.

We may come up with something more clever later, but for now we need a 2.6.18
fix for NFS.

Cc: Chuck Lever <cel@citi.umich.edu>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:33 -07:00
Dave Hansen d8c76e6f45 [PATCH] r/o bind mount prepwork: inc_nlink() helper
This is mostly included for parity with dec_nlink(), where we will have some
more hooks.  This one should stay pretty darn straightforward for now.

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:30 -07:00
Dave Hansen 9a53c3a783 [PATCH] r/o bind mounts: unlink: monitor i_nlink
When a filesystem decrements i_nlink to zero, it means that a write must be
performed in order to drop the inode from the filesystem.

We're shortly going to have keep filesystems from being remounted r/o between
the time that this i_nlink decrement and that write occurs.

So, add a little helper function to do the decrements.  We'll tie into it in a
bit to note when i_nlink hits zero.

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:30 -07:00
Badari Pulavarty 543ade1fc9 [PATCH] Streamline generic_file_* interfaces and filemap cleanups
This patch cleans up generic_file_*_read/write() interfaces.  Christoph
Hellwig gave me the idea for this clean ups.

In a nutshell, all filesystems should set .aio_read/.aio_write methods and use
do_sync_read/ do_sync_write() as their .read/.write methods.  This allows us
to cleanup all variants of generic_file_* routines.

Final available interfaces:

generic_file_aio_read() - read handler
generic_file_aio_write() - write handler
generic_file_aio_write_nolock() - no lock write handler

__generic_file_aio_write_nolock() - internal worker routine

Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:28 -07:00
Badari Pulavarty ee0b3e671b [PATCH] Remove readv/writev methods and use aio_read/aio_write instead
This patch removes readv() and writev() methods and replaces them with
aio_read()/aio_write() methods.

Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:28 -07:00
Badari Pulavarty 027445c372 [PATCH] Vectorize aio_read/aio_write fileop methods
This patch vectorizes aio_read() and aio_write() methods to prepare for
collapsing all aio & vectored operations into one interface - which is
aio_read()/aio_write().

Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Michael Holzheu <HOLZHEU@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:28 -07:00
Alexey Dobriyan 52978be636 [PATCH] kmemdup: some users
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:19 -07:00
Alexey Dobriyan 1a2f67b459 [PATCH] kmemdup: introduce
One of idiomatic ways to duplicate a region of memory is

	dst = kmalloc(len, GFP_KERNEL);
	if (!dst)
		return -ENOMEM;
	memcpy(dst, src, len);

which is neat code except a programmer needs to write size twice.  Which
sometimes leads to mistakes.  If len passed to kmalloc is smaller that len
passed to memcpy, it's straight overwrite-beyond-end.  If len passed to
memcpy is smaller than len passed to kmalloc, it's either a) legit
behaviour ;-), or b) cloned buffer will contain garbage in second half.

Slight trolling of commit lists shows several duplications bugs
done exactly because of diverged lenghts:

	Linux:
		[CRYPTO]: Fix memcpy/memset args.
		[PATCH] memcpy/memset fixes
	OpenBSD:
		kerberosV/src/lib/asn1: der_copy.c:1.4

If programmer is given only one place to play with lengths, I believe, such
mistakes could be avoided.

With kmemdup, the snippet above will be rewritten as:

	dst = kmemdup(src, len, GFP_KERNEL);
	if (!dst)
		return -ENOMEM;

This also leads to smaller code (kzalloc effect). Quick grep shows
200+ places where kmemdup() can be used.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:19 -07:00
Keith Mannthey 45e0b78b05 [PATCH] hot-add-mem x86_64: use CONFIG_MEMORY_HOTPLUG_RESERVE
The api for hot-add memory already has a construct for finding nodes based on
an address, memory_add_physaddr_to_nid.  This patch allows the fucntion to do
something besides return 0.  It uses the nodes_add infomation to lookup to
node info for a hot add event.

Signed-off-by: Keith Mannthey <kmannth@us.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:18 -07:00
Keith Mannthey 53947027ad [PATCH] hot-add-mem x86_64: use CONFIG_MEMORY_HOTPLUG_SPARSE
Migate CONFIG_MEMORY_HOTPLUG to CONFIG_MEMORY_HOTPLUG_SPARSE where needed.

Signed-off-by: Keith Mannthey <kmannth@us.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:18 -07:00
Keith Mannthey ec69acbb11 [PATCH] hot-add-mem x86_64: Kconfig changes
Create Kconfig namespace for MEMORY_HOTPLUG_RESERVE and MEMORY_HOTPLUG_SPARSE.
 This is needed to create a disticiton between the 2 paths.  Selecting the
high level opiton of MEMORY_HOTPLUG will get you MEMORY_HOTPLUG_SPARSE if you
have sparsemem enabled or MEMORY_HOTPLUG_RESERVE if you are x86_64 with
discontig and ACPI numa support.

Signed-off-by: Keith Mannthey <kmannth@us.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:18 -07:00
Keith Mannthey f28c5edc06 [PATCH] hot-add-mem x86_64: fixup externs
Fix up externs in memory_hotplug.c.  Cleanup.

Signed-off-by: Keith Mannthey <kmannth@us.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:18 -07:00
Gavin Lambert 3fcd03e070 [PATCH] NOMMU: don't try and give NULL to fput()
Don't try and give NULL to fput() in the error handling in do_mmap_pgoff()
as it'll cause an oops.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:17 -07:00
David Howells 9361401eb7 [PATCH] BLOCK: Make it possible to disable the block layer [try #6]
Make it possible to disable the block layer.  Not all embedded devices require
it, some can make do with just JFFS2, NFS, ramfs, etc - none of which require
the block layer to be present.

This patch does the following:

 (*) Introduces CONFIG_BLOCK to disable the block layer, buffering and blockdev
     support.

 (*) Adds dependencies on CONFIG_BLOCK to any configuration item that controls
     an item that uses the block layer.  This includes:

     (*) Block I/O tracing.

     (*) Disk partition code.

     (*) All filesystems that are block based, eg: Ext3, ReiserFS, ISOFS.

     (*) The SCSI layer.  As far as I can tell, even SCSI chardevs use the
     	 block layer to do scheduling.  Some drivers that use SCSI facilities -
     	 such as USB storage - end up disabled indirectly from this.

     (*) Various block-based device drivers, such as IDE and the old CDROM
     	 drivers.

     (*) MTD blockdev handling and FTL.

     (*) JFFS - which uses set_bdev_super(), something it could avoid doing by
     	 taking a leaf out of JFFS2's book.

 (*) Makes most of the contents of linux/blkdev.h, linux/buffer_head.h and
     linux/elevator.h contingent on CONFIG_BLOCK being set.  sector_div() is,
     however, still used in places, and so is still available.

 (*) Also made contingent are the contents of linux/mpage.h, linux/genhd.h and
     parts of linux/fs.h.

 (*) Makes a number of files in fs/ contingent on CONFIG_BLOCK.

 (*) Makes mm/bounce.c (bounce buffering) contingent on CONFIG_BLOCK.

 (*) set_page_dirty() doesn't call __set_page_dirty_buffers() if CONFIG_BLOCK
     is not enabled.

 (*) fs/no-block.c is created to hold out-of-line stubs and things that are
     required when CONFIG_BLOCK is not set:

     (*) Default blockdev file operations (to give error ENODEV on opening).

 (*) Makes some /proc changes:

     (*) /proc/devices does not list any blockdevs.

     (*) /proc/diskstats and /proc/partitions are contingent on CONFIG_BLOCK.

 (*) Makes some compat ioctl handling contingent on CONFIG_BLOCK.

 (*) If CONFIG_BLOCK is not defined, makes sys_quotactl() return -ENODEV if
     given command other than Q_SYNC or if a special device is specified.

 (*) In init/do_mounts.c, no reference is made to the blockdev routines if
     CONFIG_BLOCK is not defined.  This does not prohibit NFS roots or JFFS2.

 (*) The bdflush, ioprio_set and ioprio_get syscalls can now be absent (return
     error ENOSYS by way of cond_syscall if so).

 (*) The seclvl_bd_claim() and seclvl_bd_release() security calls do nothing if
     CONFIG_BLOCK is not set, since they can't then happen.

Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2006-09-30 20:52:31 +02:00
David Howells 811d736f9e [PATCH] BLOCK: Dissociate generic_writepages() from mpage stuff [try #6]
Dissociate the generic_writepages() function from the mpage stuff, moving its
declaration to linux/mm.h and actually emitting a full implementation into
mm/page-writeback.c.

The implementation is a partial duplicate of mpage_writepages() with all BIO
references removed.

It is used by NFS to do writeback.

Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2006-09-30 20:52:26 +02:00
David Howells 831058dec3 [PATCH] BLOCK: Separate the bounce buffering code from the highmem code [try #6]
Move the bounce buffer code from mm/highmem.c to mm/bounce.c so that it can be
more easily disabled when the block layer is disabled.

!!!NOTE!!! There may be a bug in this code: Should init_emergency_pool() be
	   contingent on CONFIG_HIGHMEM?

Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2006-09-30 20:32:11 +02:00
David Howells b398f6bff9 [PATCH] BLOCK: Stop fallback_migrate_page() from using page_has_buffers() [try #6]
Stop fallback_migrate_page() from using page_has_buffers() since that might not
be available.  Use PagePrivate() instead since that's more general.

Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2006-09-30 20:31:20 +02:00
David Howells cf9a2ae8d4 [PATCH] BLOCK: Move functions out of buffer code [try #6]
Move some functions out of the buffering code that aren't strictly buffering
specific.  This is a precursor to being able to disable the block layer.

 (*) Moved some stuff out of fs/buffer.c:

     (*) The file sync and general sync stuff moved to fs/sync.c.

     (*) The superblock sync stuff moved to fs/super.c.

     (*) do_invalidatepage() moved to mm/truncate.c.

     (*) try_to_release_page() moved to mm/filemap.c.

 (*) Moved some related declarations between header files:

     (*) declarations for do_invalidatepage() and try_to_release_page() moved
     	 to linux/mm.h.

     (*) __set_page_dirty_buffers() moved to linux/buffer_head.h.

Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2006-09-30 20:31:19 +02:00
Andreas Gruenbacher 39f0247d38 [PATCH] Access Control Lists for tmpfs
Add access control lists for tmpfs.

Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:24 -07:00
Hugh Dickins 3f9e7949f8 [PATCH] valid_swaphandles() fix
akpm draws my attention to the fact that sysctl(VM_PAGE_CLUSTER) might
conceivably change page_cluster to 0 while valid_swaphandles() is in the
middle of using it, leading to an embarrassingly long loop: take a local
snapshot of page_cluster and work with that.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:23 -07:00
Chandra Seetharaman 2d1d43f6a4 [PATCH] call mm/page-writeback.c:set_ratelimit() when new pages are hot-added
ratelimit_pages in page-writeback.c is recalculated (in set_ratelimit())
every time a CPU is hot-added/removed.  But this value is not recalculated
when new pages are hot-added.

This patch fixes that problem by calling set_ratelimit() when new pages
are hot-added.

[akpm@osdl.org: cleanups]
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:22 -07:00
Chandra Seetharaman 40c99aae23 [PATCH] remove static variable mm/page-writeback.c:total_pages
page-writeback.c has a static local variable "total_pages", which is the
total number of pages in the system.

There is a global variable "vm_total_pages", which is the total number of
pages the VM controls.

Both are assigned from the return value of nr_free_pagecache_pages().

This patch removes the local variable and uses the global variable in that
place.

One more issue with the local static variable "total_pages" is that it is
not updated when new pages are hot-added.  Since vm_total_pages is updated
when new pages are hot-added, this patch fixes that problem too.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:22 -07:00
Paul Jackson 38837fc75a [PATCH] cpuset: top_cpuset tracks hotplug changes to node_online_map
Change the list of memory nodes allowed to tasks in the top (root) nodeset
to dynamically track what cpus are online, using a call to a cpuset hook
from the memory hotplug code.  Make this top cpus file read-only.

On systems that have cpusets configured in their kernel, but that aren't
actively using cpusets (for some distros, this covers the majority of
systems) all tasks end up in the top cpuset.

If that system does support memory hotplug, then these tasks cannot make
use of memory nodes that are added after system boot, because the memory
nodes are not allowed in the top cpuset.  This is a surprising regression
over earlier kernels that didn't have cpusets enabled.

One key motivation for this change is to remain consistent with the
behaviour for the top_cpuset's 'cpus', which is also read-only, and which
automatically tracks the cpu_online_map.

This change also has the minor benefit that it fixes a long standing,
little noticed, minor bug in cpusets.  The cpuset performance tweak to
short circuit the cpuset_zone_allowed() check on systems with just a single
cpuset (see 'number_of_cpusets', in linux/cpuset.h) meant that simply
changing the 'mems' of the top_cpuset had no affect, even though the change
(the write system call) appeared to succeed.  With the following change,
that write to the 'mems' file fails -EACCES, and the 'mems' file stubbornly
refuses to be changed via user space writes.  Thus no one should be mislead
into thinking they've changed the top_cpusets's 'mems' when in affect they
haven't.

In order to keep the behaviour of cpusets consistent between systems
actively making use of them and systems not using them, this patch changes
the behaviour of the 'mems' file in the top (root) cpuset, making it read
only, and making it automatically track the value of node_online_map.  Thus
tasks in the top cpuset will have automatic use of hot plugged memory nodes
allowed by their cpuset.

[akpm@osdl.org: build fix]
[bunk@stusta.de: build fix]
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:21 -07:00
Nick Piggin b78483a4ba [PATCH] oom: don't kill current when another OOM in progress
A previous patch to allow an exiting task to OOM kill itself (and thereby
avoid a little deadlock) introduced a problem.  We don't want the
PF_EXITING task, even if it is 'current', to access mem reserves if there
is already a TIF_MEMDIE process in the system sucking up reserves.

Also make the commenting a little bit clearer, and note that our current
scheme of effectively single threading the OOM killer is not itself
perfect.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:21 -07:00
Oleg Nesterov 01017a2270 [PATCH] oom_kill_task(): cleanup ->mm checks
- It is not possible to have task->mm == &init_mm.

- task_lock() buys nothing for 'if (!p->mm)' check.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:21 -07:00
Oleg Nesterov 972c4ea59c [PATCH] select_bad_process(): cleanup 'releasing' check
No logic changes, but imho easier to read.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:21 -07:00
Oleg Nesterov 28324d1df6 [PATCH] select_bad_process(): kill a bogus PF_DEAD/TASK_DEAD check
The only one usage of TASK_DEAD outside of last schedule path,

select_bad_process:

	for_each_task(p) {

		if (!p->mm)
			continue;
		...
			if (p->state == TASK_DEAD)
				continue;
		...

TASK_DEAD state is set at the end of do_exit(), this means that p->mm
was already set == NULL by exit_mm(), so this task was already rejected
by 'if (!p->mm)' above.

Note also that the caller holds tasklist_lock, this means that p can't
pass exit_notify() and then set TASK_DEAD when p->mm != NULL.

Also, remove open-coded is_init().

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:21 -07:00
Oleg Nesterov c394cc9fbb [PATCH] introduce TASK_DEAD state
I am not sure about this patch, I am asking Ingo to take a decision.

task_struct->state == EXIT_DEAD is a very special case, to avoid a confusion
it makes sense to introduce a new state, TASK_DEAD, while EXIT_DEAD should
live only in ->exit_state as documented in sched.h.

Note that this state is not visible to user-space, get_task_state() masks off
unsuitable states.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:21 -07:00
Oleg Nesterov 55a101f8f7 [PATCH] kill PF_DEAD flag
After the previous change (->flags & PF_DEAD) <=> (->state == EXIT_DEAD), we
don't need PF_DEAD any longer.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:20 -07:00
Sukadev Bhattiprolu f400e198b2 [PATCH] pidspace: is_init()
This is an updated version of Eric Biederman's is_init() patch.
(http://lkml.org/lkml/2006/2/6/280).  It applies cleanly to 2.6.18-rc3 and
replaces a few more instances of ->pid == 1 with is_init().

Further, is_init() checks pid and thus removes dependency on Eric's other
patches for now.

Eric's original description:

	There are a lot of places in the kernel where we test for init
	because we give it special properties.  Most  significantly init
	must not die.  This results in code all over the kernel test
	->pid == 1.

	Introduce is_init to capture this case.

	With multiple pid spaces for all of the cases affected we are
	looking for only the first process on the system, not some other
	process that has pid == 1.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: <lxc-devel@lists.sourceforge.net>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:12 -07:00
Dave Jones aa83aa40ed [PATCH] single bit flip detector
In cases where we detect a single bit has been flipped, we spew the usual
slab corruption message, which users instantly think is a kernel bug.  In a
lot of cases, single bit errors are down to bad memory, or other hardware
failure.

This patch adds an extra line to the slab debug messages in those cases, in
the hope that users will try memtest before they report a bug.

000: 6b 6b 6b 6b 6a 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
Single bit error detected. Possibly bad RAM. Run memtest86.

[akpm@osdl.org: cleanups]
Signed-off-by: Dave Jones <davej@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:10 -07:00
Adam Litke 79f5acf5d7 [PATCH] mm: make filemap_nopage use NOPAGE_SIGBUS
Don't open-code NOPAGE_SIGBUS.

Signed-off-by: Adam Litke <agl@us.ibm.com>
Acked-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:03 -07:00
Siddha, Suresh B 4ce072f1fa [PATCH] mm: fix a race condition under SMC + COW
Failing context is a multi threaded process context and the failing
sequence is as follows.

One thread T0 doing self modifying code on page X on processor P0 and
another thread T1 doing COW (breaking the COW setup as part of just
happened fork() in another thread T2) on the same page X on processor P1.
T0 doing SMC can endup modifying the new page Y (allocated by the T1 doing
COW on P1) but because of different I/D TLB's, P0 ITLB will not see the new
mapping till the flush TLB IPI from P1 is received.  During this interval,
if T0 executes the code created by SMC it can result in an app error (as
ITLB still points to old page X and endup executing the content in page X
rather than using the content in page Y).

Fix this issue by first clearing the PTE and flushing it, before updating
it with new entry.

Hugh sayeth:

  I was a bit sceptical, in the habit of thinking that Self Modifying Code
  must look such issues itself: but I guess there's nothing it can do to avoid
  this one.

  Fair enough, what you're changing it to is pretty much what powerpc and
  s390 were already doing, and is a more robust way of proceeding, consistent
  with how ptes are set everywhere else.

  The ptep_clear_flush is a bit heavy-handed (it's anxious to return the pte
  that was atomically cleared), but we'd have to wander through lots of arches
  to get the right minimal behaviour.  It'd also be nice to eliminate
  ptep_establish completely, now only used to define other macros/inlines: it
  always seemed obfuscation to me, what you've got there now is clearer.
  Let's put those cleanups on a TODO list.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Acked-by: "David S. Miller" <davem@davemloft.net>
Acked-by: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:03 -07:00
Steven Whitehouse 185a257f2f Merge branch 'master' into gfs2 2006-09-28 08:29:59 -04:00
Steven Whitehouse 3f1a9aaeff [GFS2] Fix typo in last patch
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2006-09-27 14:52:48 -04:00
Steven Whitehouse 0e0bcae3bf [GFS2] Fix direct i/o logic in filemap.c
We shouldn't mark the file accessed in the case that it
wasn't accessed.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2006-09-27 14:45:07 -04:00
Theodore Ts'o ba52de123d [PATCH] inode-diet: Eliminate i_blksize from the inode structure
This eliminates the i_blksize field from struct inode.  Filesystems that want
to provide a per-inode st_blksize can do so by providing their own getattr
routine instead of using the generic_fillattr() function.

Note that some filesystems were providing pretty much random (and incorrect)
values for i_blksize.

[bunk@stusta.de: cleanup]
[akpm@osdl.org: generic_fillattr() fix]
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 08:26:18 -07:00
David Howells 930e652a21 [PATCH] NOMMU: Make futexes work under NOMMU conditions
Make futexes work under NOMMU conditions.

This can be tested by running this in one shell:

	#define SYSERROR(X, Y) \
		do { if ((long)(X) == -1L) { perror(Y); exit(1); }} while(0)

	int main()
	{
		int shmid, tmp, *f, n;

		shmid = shmget(23, 4, IPC_CREAT|0666);
		SYSERROR(shmid, "shmget");

		f = shmat(shmid, NULL, 0);
		SYSERROR(f, "shmat");

		n = *f;
		printf("WAIT: %p{%x}\n", f, n);
		tmp = futex(f, FUTEX_WAIT, n, NULL, NULL, 0);
		SYSERROR(tmp, "futex");
		printf("WAITED: %d\n", tmp);

		tmp = shmdt(f);
		SYSERROR(tmp, "shmdt");

		exit(0);
	}

And then this in the other shell:

	#define SYSERROR(X, Y) \
		do { if ((long)(X) == -1L) { perror(Y); exit(1); }} while(0)

	int main()
	{
		int shmid, tmp, *f;

		shmid = shmget(23, 4, IPC_CREAT|0666);
		SYSERROR(shmid, "shmget");

		f = shmat(shmid, NULL, 0);
		SYSERROR(f, "shmat");

		(*f)++;
		printf("WAKE: %p{%x}\n", f, *f);
		tmp = futex(f, FUTEX_WAKE, 1, NULL, NULL, 0);
		SYSERROR(tmp, "futex");
		printf("WOKE: %d\n", tmp);

		tmp = shmdt(f);
		SYSERROR(tmp, "shmdt");

		exit(0);
	}

The first program will set up a SYSV IPC SHM segment and wait on a futex in it
for the number at the start to change.  The program will increment that number
and wake the first program up.  This leads to output of the form:

	SHELL 1			SHELL 2
	=======================	=======================
	# /dowait
	WAIT: 0xc32ac000{0}
				# /dowake
				WAKE: 0xc32ac000{1}
	WAITED: 0		WOKE: 1

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 08:26:15 -07:00
David Howells 6fa5f80bc3 [PATCH] NOMMU: Make mremap() partially work for NOMMU kernels
Make mremap() partially work for NOMMU kernels.  It may resize a VMA provided
that it doesn't exceed the size of the slab object in which the storage is
allocated that the VMA refers to.  Shareable VMAs may not be resized.

Moving VMAs (as permitted by MREMAP_MAYMOVE) is not currently supported.

This patch also makes use of the fact that the VMA list is now ordered to cut
it short when possible.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 08:26:14 -07:00
David Howells 3034097a50 [PATCH] NOMMU: Order the per-mm_struct VMA list
Order the per-mm_struct VMA list by address so that searching it can be cut
short when the appropriate address has been exceeded.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 08:26:14 -07:00
David Howells d00c7b9937 [PATCH] NOMMU: Permit ptrace to ignore non-PROT_WRITE VMAs in NOMMU mode
Permit ptrace to modify a section that's non-shared but is marked
unwritable, such as is obtained by mapping the text segment of an ELF-FDPIC
executable binary with into a binary that's being ptraced[*].

[*] Under NOMMU conditions ptrace causes read-only MAP_PRIVATE mmaps to become
    totally private copies because if a private mapping was actually shared
    then the debugging setting breakpoints in it would potentially crash
    other processes.

This is done by using the VM_MAYWRITE flag rather than the VM_WRITE flag
when deciding whether to permit a write.

Without this patch a debugger can't set breakpoints in the mapped text
sections of executables that are mapped read-only private, even if the
mmap() syscall has taken a private copy because PT_PTRACED is set.

In addition, VM_MAYREAD is used instead of VM_READ for similar reasons.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 08:26:14 -07:00
David Howells 7b4d5b8b39 [PATCH] NOMMU: Check VMA protections
Check the VMA protections in get_user_pages() against what's being asked.

This checks to see that we don't accidentally write on a non-writable VMA or
permit an I/O mapping VMA to be accessed (which may lack page structs).

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 08:26:14 -07:00
Sonic Zhang 910e46da4b [PATCH] Check if start address is in vma region in NOMMU function get_user_pages()
In NOMMU arch, if run "cat /proc/self/mem", data from physical address 0
are read.  This behavior is different from MMU arch.  In IA32, message
"cat: /proc/self/mem: Input/output error" is reported.

This issue is rootcaused by not validate the start address in NOMMU
function get_user_pages().  Following patch solves this issue.

Signed-off-by: Sonic Zhang <sonic.adi@gmail.com>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 08:26:14 -07:00
David Howells 0159b141d8 [PATCH] NOMMU: Use find_vma() rather than reimplementing a VMA search
Use find_vma() in the NOMMU version of access_process_vm() rather than
reimplementing it.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 08:26:14 -07:00
David Howells 0ec76a110f [PATCH] NOMMU: Check that access_process_vm() has a valid target
Check that access_process_vm() is accessing a valid mapping in the target
process.

This limits ptrace() accesses and accesses through /proc/<pid>/maps to only
those regions actually mapped by a program.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 08:26:14 -07:00
Rolf Eike Beer d24afc57d5 [PATCH] Mark __remove_vm_area() static
The function is exported but not used from anywhere else.  It's also marked as
"not for driver use" so noone out there should really care.

Signed-off-by: Rolf Eike Beer <eike-kernel@sf-tec.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 08:26:13 -07:00
Rolf Eike Beer ead04089b1 [PATCH] Fix kerneldoc comments in mm/vmalloc.c
The empty line between the short description and the first argument
description causes a section to appear twice in the generated manpage.
Also the short description should really be short: the script can't handle
multiple lines.

Signed-off-by: Rolf Eike Beer <eike-kernel@sf-tec.de>
Acked-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 08:26:13 -07:00
Randy Dunlap 423b41d773 [PATCH] mm/page_alloc: use NULL instead of 0 for ptr
Use NULL instead of 0 for pointer value, eliminate sparse warnings.

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 08:26:13 -07:00
Jes Sorensen f4b81804a2 [PATCH] do_no_pfn()
Implement do_no_pfn() for handling mapping of memory without a struct page
backing it.  This avoids creating fake page table entries for regions which
are not backed by real memory.

This feature is used by the MSPEC driver and other users, where it is
highly undesirable to have a struct page sitting behind the page (for
instance if the page is accessed in cached mode via the struct page in
parallel to the the driver accessing it uncached, which can result in data
corruption on some architectures, such as ia64).

This version uses specific NOPFN_{SIGBUS,OOM} return values, rather than
expect all negative pfn values would be an error.  It also bugs on cow
mappings as this would not work with the VM.

[akpm@osdl.org: micro-optimise]
Signed-off-by: Jes Sorensen <jes@sgi.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 08:26:13 -07:00
Christoph Lameter 5d29234362 [PATCH] zone_statistics: Use hot node instead of cold zone_pgdat
Now that we have the node in the hot zone of struct zone we can avoid
accessing zone_pgdat in zone_statistics.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 08:26:13 -07:00
Christoph Lameter 66a550308b [PATCH] Do not allocate pagesets for unpopulated zones.
We do not need to allocate pagesets for unpopulated zones.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 08:26:13 -07:00
Christoph Lameter d5f541ed6e [PATCH] Add node to zone for the NUMA case
Add the node in order to optimize zone_to_nid.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 08:26:13 -07:00
Christoph Lameter 765c4507af [PATCH] GFP_THISNODE for the slab allocator
This patch insures that the slab node lists in the NUMA case only contain
slabs that belong to that specific node.  All slab allocations use
GFP_THISNODE when calling into the page allocator.  If an allocation fails
then we fall back in the slab allocator according to the zonelists appropriate
for a certain context.

This allows a replication of the behavior of alloc_pages and alloc_pages node
in the slab layer.

Currently allocations requested from the page allocator may be redirected via
cpusets to other nodes.  This results in remote pages on nodelists and that in
turn results in interrupt latency issues during cache draining.  Plus the slab
is handing out memory as local when it is really remote.

Fallback for slab memory allocations will occur within the slab allocator and
not in the page allocator.  This is necessary in order to be able to use the
existing pools of objects on the nodes that we fall back to before adding more
pages to a slab.

The fallback function insures that the nodes we fall back to obey cpuset
restrictions of the current context.  We do not allocate objects from outside
of the current cpuset context like before.

Note that the implementation of locality constraints within the slab allocator
requires importing logic from the page allocator.  This is a mischmash that is
not that great.  Other allocators (uncached allocator, vmalloc, huge pages)
face similar problems and have similar minimal reimplementations of the basic
fallback logic of the page allocator.  There is another way of implementing a
slab by avoiding per node lists (see modular slab) but this wont work within
the existing slab.

V1->V2:
- Use NUMA_BUILD to avoid #ifdef CONFIG_NUMA
- Exploit GFP_THISNODE being 0 in the NON_NUMA case to avoid another
  #ifdef

[akpm@osdl.org: build fix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 08:26:12 -07:00
Christoph Lameter 08e0f6a970 [PATCH] Add NUMA_BUILD definition in kernel.h to avoid #ifdef CONFIG_NUMA
The NUMA_BUILD constant is always available and will be set to 1 on
NUMA_BUILDs.  That way checks valid only under CONFIG_NUMA can easily be done
without #ifdef CONFIG_NUMA

F.e.

if (NUMA_BUILD && <numa_condition>) {
...
}

[akpm: not a thing we'd normally do, but CONFIG_NUMA is special: it is
 causing ifdef explosion in core kernel, so let's see if this is a comfortable
 way in whcih to control that]

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 08:26:12 -07:00
Jes Sorensen c72419138f [PATCH] Condense output of show_free_areas()
On larger systems, the amount of output dumped on the console when you do
SysRq-M is beyond insane.  This patch is trying to reduce it somewhat as
even with the smaller NUMA systems that have hit the desktop this seems to
be a fair thing to do.

The philosophy I have taken is as follows:
 1) If a zone is empty, don't tell, we don't need yet another line
    telling us so. The information is available since one can look up
    the fact how many zones were initialized in the first place.
 2) Put as much information on a line is possible, if it can be done
    in one line, rahter than two, then do it in one. I tried to format
    the temperature stuff for easy reading.

Change show_free_areas() to not print lines for empty zones.  If no zone
output is printed, the zone is empty.  This reduces the number of lines
dumped to the console in sysrq on a large system by several thousand lines.

Change the zone temperature printouts to use one line per CPU instead of
two lines (one hot, one cold).  On a 1024 CPU, 1024 node system, this
reduces the console output by over a million lines of output.

While this is a bigger problem on large NUMA systems, it is also applicable
to smaller desktop sized and mid range NUMA systems.

Old format:

Mem-info:
Node 0 DMA per-cpu:
cpu 0 hot: high 42, batch 7 used:24
cpu 0 cold: high 14, batch 3 used:1
cpu 1 hot: high 42, batch 7 used:34
cpu 1 cold: high 14, batch 3 used:0
cpu 2 hot: high 42, batch 7 used:0
cpu 2 cold: high 14, batch 3 used:0
cpu 3 hot: high 42, batch 7 used:0
cpu 3 cold: high 14, batch 3 used:0
cpu 4 hot: high 42, batch 7 used:0
cpu 4 cold: high 14, batch 3 used:0
cpu 5 hot: high 42, batch 7 used:0
cpu 5 cold: high 14, batch 3 used:0
cpu 6 hot: high 42, batch 7 used:0
cpu 6 cold: high 14, batch 3 used:0
cpu 7 hot: high 42, batch 7 used:0
cpu 7 cold: high 14, batch 3 used:0
Node 0 DMA32 per-cpu: empty
Node 0 Normal per-cpu: empty
Node 0 HighMem per-cpu: empty
Node 1 DMA per-cpu:
[snip]
Free pages:     5410688kB (0kB HighMem)
Active:9536 inactive:4261 dirty:6 writeback:0 unstable:0 free:338168 slab:1931 mapped:1900 pagetables:208
Node 0 DMA free:1676304kB min:3264kB low:4080kB high:4896kB active:128048kB inactive:61568kB present:1970880kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 0 DMA32 free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 0 Normal free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 0 HighMem free:0kB min:512kB low:512kB high:512kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 1 DMA free:1951728kB min:3280kB low:4096kB high:4912kB active:5632kB inactive:1504kB present:1982464kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
....

New format:

Mem-info:
Node 0 DMA per-cpu:
CPU    0: Hot: hi:   42, btch:   7 usd:  41   Cold: hi:   14, btch:   3 usd:   2
CPU    1: Hot: hi:   42, btch:   7 usd:  40   Cold: hi:   14, btch:   3 usd:   1
CPU    2: Hot: hi:   42, btch:   7 usd:   0   Cold: hi:   14, btch:   3 usd:   0
CPU    3: Hot: hi:   42, btch:   7 usd:   0   Cold: hi:   14, btch:   3 usd:   0
CPU    4: Hot: hi:   42, btch:   7 usd:   0   Cold: hi:   14, btch:   3 usd:   0
CPU    5: Hot: hi:   42, btch:   7 usd:   0   Cold: hi:   14, btch:   3 usd:   0
CPU    6: Hot: hi:   42, btch:   7 usd:   0   Cold: hi:   14, btch:   3 usd:   0
CPU    7: Hot: hi:   42, btch:   7 usd:   0   Cold: hi:   14, btch:   3 usd:   0
Node 1 DMA per-cpu:
[snip]
Free pages:     5411088kB (0kB HighMem)
Active:9558 inactive:4233 dirty:6 writeback:0 unstable:0 free:338193 slab:1942 mapped:1918 pagetables:208
Node 0 DMA free:1677648kB min:3264kB low:4080kB high:4896kB active:129296kB inactive:58864kB present:1970880kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 1 DMA free:1948448kB min:3280kB low:4096kB high:4912kB active:6864kB inactive:3536kB present:1982464kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0

Signed-off-by: Jes Sorensen <jes@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 08:26:12 -07:00
Christoph Lameter de3083ec3e [PATCH] slab: fix kmalloc_node applying memory policies if nodeid == numa_node_id()
kmalloc_node() falls back to ___cache_alloc() under certain conditions and
at that point memory policies may be applied redirecting the allocation
away from the current node.  Therefore kmalloc_node(...,numa_node_id()) or
kmalloc_node(...,-1) may not return memory from the local node.

Fix this by doing the policy check in __cache_alloc() instead of
____cache_alloc().

This version here is a cleanup of Kiran's patch.

- Tested on ia64.
- Extra material removed.
- Consolidate the exit path if alternate_node_alloc() returned an object.

[akpm@osdl.org: warning fix]
Signed-off-by: Alok N Kataria <alok.kataria@calsoftinc.com>
Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Shai Fultheim <shai@scalex86.org>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 08:26:12 -07:00
Nick Piggin 0fd0e6b05a [PATCH] page invalidation cleanup
Clean up the invalidate code, and use a common function to safely remove
the page from pagecache.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 08:26:12 -07:00
Andrew Morton e129b5c23c [PATCH] vm: add per-zone writeout counter
The VM is supposed to minimise the number of pages which get written off the
LRU (for IO scheduling efficiency, and for high reclaim-success rates).  But
we don't actually have a clear way of showing how true this is.

So add `nr_vmscan_write' to /proc/vmstat and /proc/zoneinfo - the number of
pages which have been written by the vm scanner in this zone and globally.

Cc: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 08:26:12 -07:00
Mel Gorman fb01439c5b [PATCH] Allow an arch to expand node boundaries
Arch-independent zone-sizing determines the size of a node
(pgdat->node_spanned_pages) based on the physical memory that was
registered by the architecture.  However, when
CONFIG_MEMORY_HOTPLUG_RESERVE is set, the architecture expects that the
spanned_pages will be much larger and that mem_map will be allocated that
is used lated on memory hot-add.

This patch allows an architecture that sets CONFIG_MEMORY_HOTPLUG_RESERVE
to call push_node_boundaries() which will set the node beginning and end to
at *least* the requested boundary.

Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Andi Kleen <ak@muc.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "Keith Mannthey" <kmannth@gmail.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 08:26:12 -07:00
Mel Gorman 9c7cd6877c [PATCH] Account for holes that are outside the range of physical memory
absent_pages_in_range() made the assumption that users of the API would not
care about holes beyound the end of physical memory.  This was not the
case.  This patch will account for ranges outside of physical memory as
holes correctly.

Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Andi Kleen <ak@muc.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "Keith Mannthey" <kmannth@gmail.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 08:26:11 -07:00
Mel Gorman 0e0b864e06 [PATCH] Account for memmap and optionally the kernel image as holes
The x86_64 code accounted for memmap and some portions of the the DMA zone as
holes.  This was because those areas would never be reclaimed and accounting
for them as memory affects min watermarks.  This patch will account for the
memmap as a memory hole.  Architectures may optionally use set_dma_reserve()
if they wish to account for a portion of memory in ZONE_DMA as a hole.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Andi Kleen <ak@muc.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "Keith Mannthey" <kmannth@gmail.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 08:26:11 -07:00
Mel Gorman c713216dee [PATCH] Introduce mechanism for registering active regions of memory
At a basic level, architectures define structures to record where active
ranges of page frames are located.  Once located, the code to calculate zone
sizes and holes in each architecture is very similar.  Some of this zone and
hole sizing code is difficult to read for no good reason.  This set of patches
eliminates the similar-looking architecture-specific code.

The patches introduce a mechanism where architectures register where the
active ranges of page frames are with add_active_range().  When all areas have
been discovered, free_area_init_nodes() is called to initialise the pgdat and
zones.  The zone sizes and holes are then calculated in an architecture
independent manner.

Patch 1 introduces the mechanism for registering and initialising PFN ranges
Patch 2 changes ppc to use the mechanism - 139 arch-specific LOC removed
Patch 3 changes x86 to use the mechanism - 136 arch-specific LOC removed
Patch 4 changes x86_64 to use the mechanism - 74 arch-specific LOC removed
Patch 5 changes ia64 to use the mechanism - 52 arch-specific LOC removed
Patch 6 accounts for mem_map as a memory hole as the pages are not reclaimable.
	It adjusts the watermarks slightly

Tony Luck has successfully tested for ia64 on Itanium with tiger_defconfig,
gensparse_defconfig and defconfig.  Bob Picco has also tested and debugged on
IA64.  Jack Steiner successfully boot tested on a mammoth SGI IA64-based
machine.  These were on patches against 2.6.17-rc1 and release 3 of these
patches but there have been no ia64-changes since release 3.

There are differences in the zone sizes for x86_64 as the arch-specific code
for x86_64 accounts the kernel image and the starting mem_maps as memory holes
but the architecture-independent code accounts the memory as present.

The big benefit of this set of patches is a sizable reduction of
architecture-specific code, some of which is very hairy.  There should be a
greater reduction when other architectures use the same mechanisms for zone
and hole sizing but I lack the hardware to test on.

Additional credit;
	Dave Hansen for the initial suggestion and comments on early patches
	Andy Whitcroft for reviewing early versions and catching numerous
		errors
	Tony Luck for testing and debugging on IA64
	Bob Picco for fixing bugs related to pfn registration, reviewing a
		number of patch revisions, providing a number of suggestions
		on future direction and testing heavily
	Jack Steiner and Robin Holt for testing on IA64 and clarifying
		issues related to memory holes
	Yasunori for testing on IA64
	Andi Kleen for reviewing and feeding back about x86_64
	Christian Kujau for providing valuable information related to ACPI
		problems on x86_64 and testing potential fixes

This patch:

Define the structure to represent an active range of page frames within a node
in an architecture independent manner.  Architectures are expected to register
active ranges of PFNs using add_active_range(nid, start_pfn, end_pfn) and call
free_area_init_nodes() passing the PFNs of the end of each zone.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Bob Picco <bob.picco@hp.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Andi Kleen <ak@muc.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "Keith Mannthey" <kmannth@gmail.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 08:26:11 -07:00
Alexey Dobriyan 133d205a18 [PATCH] Make kmem_cache_destroy() return void
un-, de-, -free, -destroy, -exit, etc functions should in general return
void.  Also,

There is very little, say, filesystem driver code can do upon failed
kmem_cache_destroy().  If it will be decided to BUG in this case, BUG
should be put in generic code, instead.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 08:26:11 -07:00