Commit Graph

1983 Commits

Author SHA1 Message Date
Hugh Dickins 6d48ff8bcf memcg: css_put after remove_list
mem_cgroup_uncharge_page does css_put on the mem_cgroup before uncharging from
it, and before removing page_cgroup from one of its lru lists: isn't there a
danger that struct mem_cgroup memory could be freed and reused before
completing that, so corrupting something?  Never seen it, and for all I know
there may be other constraints which make it impossible; but let's be
defensive and reverse the ordering there.

mem_cgroup_force_empty_list is safe because there's an extra css_get around
all its works; but even so, change its ordering the same way round, to help
get in the habit of doing it like this.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hirokazu Takahashi <taka@valinux.co.jp>
Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-04 16:35:15 -08:00
Hugh Dickins b9c565d5a2 memcg: remove clear_page_cgroup and atomics
Remove clear_page_cgroup: it's an unhelpful helper, see for example how
mem_cgroup_uncharge_page had to unlock_page_cgroup just in order to call it
(serious races from that?  I'm not sure).

Once that's gone, you can see it's pointless for page_cgroup's ref_cnt to be
atomic: it's always manipulated under lock_page_cgroup, except where
force_empty unilaterally reset it to 0 (and how does uncharge's
atomic_dec_and_test protect against that?).

Simplify this page_cgroup locking: if you've got the lock and the pc is
attached, then the ref_cnt must be positive: VM_BUG_ONs to check that, and to
check that pc->page matches page (we're on the way to finding why sometimes it
doesn't, but this patch doesn't fix that).

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hirokazu Takahashi <taka@valinux.co.jp>
Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-04 16:35:15 -08:00
Hugh Dickins d5b69e38f8 memcg: memcontrol uninlined and static
More cleanup to memcontrol.c, this time changing some of the code generated.
Let the compiler decide what to inline (except for page_cgroup_locked which is
only used when CONFIG_DEBUG_VM): the __always_inline on lock_page_cgroup etc.
was quite a waste since bit_spin_lock etc.  are inlines in a header file; made
mem_cgroup_force_empty and mem_cgroup_write_strategy static.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hirokazu Takahashi <taka@valinux.co.jp>
Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-04 16:35:15 -08:00
Hugh Dickins 8869b8f6e0 memcg: memcontrol whitespace cleanups
Sorry, before getting down to more important changes, I'd like to do some
cleanup in memcontrol.c.  This patch doesn't change the code generated, but
cleans up whitespace, moves up a double declaration, removes an unused enum,
removes void returns, removes misleading comments, that kind of thing.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hirokazu Takahashi <taka@valinux.co.jp>
Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-04 16:35:15 -08:00
Hugh Dickins 8289546e57 memcg: remove mem_cgroup_uncharge
Nothing uses mem_cgroup_uncharge apart from mem_cgroup_uncharge_page, (a
trivial wrapper around it) and mem_cgroup_end_migration (which does the same
as mem_cgroup_uncharge_page).  And it often ends up having to lock just to let
its caller unlock.  Remove it (but leave the silly locking until a later
patch).

Moved mem_cgroup_cache_charge next to mem_cgroup_charge in memcontrol.h.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: David Rientjes <rientjes@google.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hirokazu Takahashi <taka@valinux.co.jp>
Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-04 16:35:15 -08:00
Hugh Dickins 7e924aafa4 memcg: mem_cgroup_charge never NULL
My memcgroup patch to fix hang with shmem/tmpfs added NULL page handling to
mem_cgroup_charge_common.  It seemed convenient at the time, but hard to
justify now: there's a perfectly appropriate swappage to charge and uncharge
instead, this is not on any hot path through shmem_getpage, and no performance
hit was observed from the slight extra overhead.

So revert that NULL page handling from mem_cgroup_charge_common; and make it
clearer by bringing page_cgroup_assign_new_page_cgroup into its body - that
was a helper I found more of a hindrance to understanding.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: David Rientjes <rientjes@google.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hirokazu Takahashi <taka@valinux.co.jp>
Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-04 16:35:15 -08:00
Hugh Dickins 9442ec9df4 memcg: bad page if page_cgroup when free
Replace free_hot_cold_page's VM_BUG_ON(page_get_page_cgroup(page)) by a "Bad
page state" and clear: most users don't have CONFIG_DEBUG_VM on, and if it
were set here, it'd likely cause corruption when the page is reused.

Don't use page_assign_page_cgroup to clear it: that should be private to
memcontrol.c, and always called with the lock taken; and memmap_init_zone
doesn't need it either - like page->mapping and other pointers throughout the
kernel, Linux assumes pointers in zeroed structures are NULL pointers.

Instead use page_reset_bad_cgroup, added to memcontrol.h for this only.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: David Rientjes <rientjes@google.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hirokazu Takahashi <taka@valinux.co.jp>
Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-04 16:35:15 -08:00
Hugh Dickins 98837c7f82 memcg: fix VM_BUG_ON from page migration
Page migration gave me free_hot_cold_page's VM_BUG_ON page->page_cgroup.
remove_migration_pte was calling mem_cgroup_charge on the new page whenever it
found a swap pte, before it had determined it to be a migration entry.  That
left a surplus reference count on the page_cgroup, so it was still attached
when the page was later freed.

Move that mem_cgroup_charge down to where we're sure it's a migration entry.
We were already under i_mmap_lock or anon_vma->lock, so its GFP_KERNEL was
already inappropriate: change that to GFP_ATOMIC.

It's essential that remove_migration_pte removes all the migration entries,
other crashes follow if not.  So proceed even when the charge fails: normally
it cannot, but after a mem_cgroup_force_empty it might - comment in the code.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hirokazu Takahashi <taka@valinux.co.jp>
Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-04 16:35:14 -08:00
Hugh Dickins 61469f1d51 memcg: when do_swap's do_wp_page fails
Don't uncharge when do_swap_page's call to do_wp_page fails: the page which
was charged for is there in the pagetable, and will be correctly uncharged
when that area is unmapped - it was only its COWing which failed.

And while we're here, remove earlier XXX comment: yes, OR in do_wp_page's
return value (maybe VM_FAULT_WRITE) with do_swap_page's there; but if it
fails, mask out success bits, which might confuse some arches e.g.  sparc.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: David Rientjes <rientjes@google.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hirokazu Takahashi <taka@valinux.co.jp>
Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-04 16:35:14 -08:00
Hugh Dickins 6dbf6d3bb9 memcg: page_cache_release not __free_page
There's nothing wrong with mem_cgroup_charge failure in do_wp_page and
do_anonymous page using __free_page, but it does look odd when nearby code
uses page_cache_release: use that instead (while turning a blind eye to
ancient inconsistencies of page_cache_release versus put_page).

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: David Rientjes <rientjes@google.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hirokazu Takahashi <taka@valinux.co.jp>
Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-04 16:35:14 -08:00
Hugh Dickins 427d5416f3 memcg: move_lists on page not page_cgroup
Each caller of mem_cgroup_move_lists is having to use page_get_page_cgroup:
it's more convenient if it acts upon the page itself not the page_cgroup; and
in a later patch this becomes important to handle within memcontrol.c.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: David Rientjes <rientjes@google.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hirokazu Takahashi <taka@valinux.co.jp>
Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-04 16:35:14 -08:00
Hugh Dickins bd845e38c7 memcg: mm_match_cgroup not vm_match_cgroup
vm_match_cgroup is a perverse name for a macro to match mm with cgroup: rename
it mm_match_cgroup, matching mm_init_cgroup and mm_free_cgroup.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hirokazu Takahashi <taka@valinux.co.jp>
Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-04 16:35:14 -08:00
Balbir Singh 00f0b8259e Memory controller: rename to Memory Resource Controller
Rename Memory Controller to Memory Resource Controller.  Reflect the same
changes in the CONFIG definition for the Memory Resource Controller.  Group
together the config options for Resource Counters and Memory Resource
Controller.

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-04 16:35:12 -08:00
Eric Dumazet be852795e1 alloc_percpu() fails to allocate percpu data
Some oprofile results obtained while using tbench on a 2x2 cpu machine were
very surprising.

For example, loopback_xmit() function was using high number of cpu cycles
to perform the statistic updates, supposed to be real cheap since they use
percpu data

        pcpu_lstats = netdev_priv(dev);
        lb_stats = per_cpu_ptr(pcpu_lstats, smp_processor_id());
        lb_stats->packets++;  /* HERE : serious contention */
        lb_stats->bytes += skb->len;

struct pcpu_lstats is a small structure containing two longs.  It appears
that on my 32bits platform, alloc_percpu(8) allocates a single cache line,
instead of giving to each cpu a separate cache line.

Using the following patch gave me impressive boost in various benchmarks
( 6 % in tbench)
(all percpu_counters hit this bug too)

Long term fix (ie >= 2.6.26) would be to let each CPU allocate their own
block of memory, so that we dont need to roudup sizes to L1_CACHE_BYTES, or
merging the SGI stuff of course...

Note : SLUB vs SLAB is important here to *show* the improvement, since they
dont have the same minimum allocation sizes (8 bytes vs 32 bytes).  This
could very well explain regressions some guys reported when they switched
to SLUB.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-04 16:35:11 -08:00
KOSAKI Motohiro 10ed273f50 zlc_setup(): handle jiffies wraparound
jiffies subtraction may cause an overflow problem.  It should be using
time_after().

[akpm@linux-foundation.org: include jiffies.h]
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-04 16:35:10 -08:00
Cyrill Gorcunov 62e5c4b4d6 slub: fix possible NULL pointer dereference
This patch fix possible NULL pointer dereference if kzalloc
failed. To be able to return proper error code the function
return type is changed to ssize_t (according to callees and
sysfs definitions).

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
2008-03-03 12:22:32 -08:00
Christoph Lameter f619cfe1bd slub: Add kmalloc_large_node() to support kmalloc_node fallback
Slub is missing some NUMA support for large kmallocs. Provide that.

Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
2008-03-03 12:22:32 -08:00
Pekka J Enberg 7693143481 slub: look up object from the freelist once
We only need to look up object from c->page->freelist once in
__slab_alloc().

Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
2008-03-03 12:22:32 -08:00
Christoph Lameter 6446faa2ff slub: Fix up comments
Provide comments and fix up various spelling / style issues.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
2008-03-03 12:22:32 -08:00
Christoph Lameter d8b42bf54b slub: Rearrange #ifdef CONFIG_SLUB_DEBUG in calculate_sizes()
Group SLUB_DEBUG code together to reduce the number of #ifdefs. Move some
debug checks under the #ifdef.

Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
2008-03-03 12:22:31 -08:00
Christoph Lameter ae20bfda68 slub: Remove BUG_ON() from ksize and omit checks for !SLUB_DEBUG
The BUG_ONs are useless since the pointer derefs will lead to
NULL deref errors anyways. Some of the checks are not necessary
if no debugging is possible.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
2008-03-03 12:22:31 -08:00
Christoph Lameter 27d9e4e948 slub: Use the objsize from the kmem_cache_cpu structure
No need to access the kmem_cache structure. We have the same value
in kmem_cache_cpu.

Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
2008-03-03 12:22:31 -08:00
Christoph Lameter d692ef6dcd slub: Remove useless checks in alloc_debug_processing
Alloc debug processing is never called with a NULL object pointer.
No reason to check for NULL.

Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
2008-03-03 12:22:31 -08:00
Christoph Lameter e153362a50 slub: Remove objsize check in kmem_cache_flags()
There is no page->offset anymore and also no associated limit on the number
of objects. The page->offset field was removed for 2.6.24. So the check
in kmem_cache_flags() is now also obsolete (should have been dropped
earlier, somehow a hunk vanished).

Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-by: Christoph Lameter <clameter@sgi.com>
2008-03-03 12:22:30 -08:00
Christoph Lameter d9acf4b7b6 slub: rename slab_objects to show_slab_objects
The sysfs callback is better named show_slab_objects since it is always
called from the xxx_show callbacks. We need the name for other purposes
later.

Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
2008-03-03 12:22:30 -08:00
Christoph Lameter a973e9dd1e Revert "unique end pointer" patch
This only made sense for the alternate fastpath which was reverted last week.

Mathieu is working on a new version that addresses the fastpath issues but that
new code first needs to go through mm and it is not clear if we need the
unique end pointers with his new scheme.

Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
2008-03-03 12:22:30 -08:00
Randy Dunlap 0643245f59 docbook: fix kernel-api source files
Fix docbook problems in kernel-api.tmpl.
These cause the generated docbook to be incorrect.

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-03 10:47:14 -08:00
Li Zefan 2dda81ca31 memcgroup: return negative error code in mem_cgroup_create()
Cgroup requires the subsystem to return negative error code on error in the
create method.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-23 17:13:25 -08:00
Li Zefan 7fde4c3eb7 memcgroup: remove a useless VM_BUG_ON()
Remove this VM_BUG_ON(), as Balbir stated:

We used to have a for loop with !list_empty() as a termination condition
and VM_BUG_ON(!pc) is a spill over.  With the new loop, VM_BUG_ON(!pc) does
not make sense.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Balbir Singh <balbir@in.ibm.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-23 17:13:25 -08:00
Alexander van Heukelum b5a0e01132 Solve section mismatch for free_area_init_core.
WARNING: vmlinux.o(.meminit.text+0x649):
Section mismatch in reference from the
function free_area_init_core() to the function .init.text:setup_usemap()
The function __meminit free_area_init_core() references
a function __init setup_usemap().
If free_area_init_core is only used by setup_usemap then
annotate free_area_init_core with a matching annotation.

The warning is covers this stack of functions in mm/page_alloc.c:

alloc_bootmem_node must be marked __init.
alloc_bootmem_node is used by setup_usemap, if !SPARSEMEM.
(usemap_size is only used by setup_usemap, if !SPARSEMEM.)
setup_usemap is only used by free_area_init_core.
free_area_init_core is only used by free_area_init_node.

free_area_init_node is used by:
arch/alpha/mm/numa.c: __init paging_init()
arch/arm/mm/init.c: __init bootmem_init_node()
arch/avr32/mm/init.c: __init paging_init()
arch/cris/arch-v10/mm/init.c: __init paging_init()
arch/cris/arch-v32/mm/init.c: __init paging_init()
arch/m32r/mm/discontig.c: __init zone_sizes_init()
arch/m32r/mm/init.c: __init zone_sizes_init()
arch/m68k/mm/motorola.c: __init paging_init()
arch/m68k/mm/sun3mmu.c: __init paging_init()
arch/mips/sgi-ip27/ip27-memory.c: __init paging_init()
arch/parisc/mm/init.c: __init paging_init()
arch/sparc/mm/srmmu.c: __init srmmu_paging_init()
arch/sparc/mm/sun4c.c: __init sun4c_paging_init()
arch/sparc64/mm/init.c: __init paging_init()
mm/page_alloc.c: __init free_area_init_nodes()
mm/page_alloc.c: __init free_area_init()
and
mm/memory_hotplug.c: hotadd_new_pgdat()

hotadd_new_pgdat can not be an __init function, but:

It is compiled for MEMORY_HOTPLUG configurations only
MEMORY_HOTPLUG depends on SPARSEMEM || X86_64_ACPI_NUMA
X86_64_ACPI_NUMA depends on X86_64
ARCH_FLATMEM_ENABLE depends on X86_32
ARCH_DISCONTIGMEM_ENABLE depends on X86_32
So X86_64_ACPI_NUMA implies SPARSEMEM, right?

So we can mark the stack of functions __init for !SPARSEMEM, but we must mark
them __meminit for SPARSEMEM configurations.  This is ok, because then the
calls to alloc_bootmem_node are also avoided.

Compile-tested on:
silly minimal config
defconfig x86_32
defconfig x86_64
defconfig x86_64 -HIBERNATION +MEMORY_HOTPLUG

Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>
Reviewed-by: Sam Ravnborg <sam@ravnborg.org>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-23 17:13:24 -08:00
Andy Whitcroft e5df70ab19 hugetlb: ensure we do not reference a surplus page after handing it to buddy
When we free a page via free_huge_page and we detect that we are in surplus
the page will be returned to the buddy.  After this we no longer own the page.

However at the end free_huge_page we clear out our mapping pointer from
page private.  Even where the page is not a surplus we free the page to
the hugepage pool, drop the pool locks and then clear page private.  In
either case the page may have been reallocated.  BAD.

Make sure we clear out page private before we free the page.

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Acked-by: Adam Litke <agl@us.ibm.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-23 17:12:13 -08:00
Linus Torvalds 00e962c540 Revert "SLUB: Alternate fast paths using cmpxchg_local"
This reverts commit 1f84260c8c, which is
suspected to be the reason for some very occasional and hard-to-trigger
crashes that usually look related to memory allocation (mostly reported
in networking, but since that's generally the most common source of
shortlived allocations - and allocations in interrupt contexts - that in
itself is not a big clue).

See for example
	http://bugzilla.kernel.org/show_bug.cgi?id=9973
	http://lkml.org/lkml/2008/2/19/278
etc.

One promising suspicion for what the root cause of bug is (which also
explains why it's so hard to trigger in practice) came from Eric
Dumazet:

   "I wonder how SLUB_FASTPATH is supposed to work, since it is affected
    by a classical ABA problem of lockless algo.

    cmpxchg_local(&c->freelist, object, object[c->offset]) can succeed,
    while an interrupt came (on this cpu), and several allocations were
    done, and one free was performed at the end of this interruption, so
    'object' was recycled.

    c->freelist can then contain the previous value (object), but
    object[c->offset] was changed by IRQ.

    We then put back in freelist an already allocated object."

but another reason for the revert is simply that everybody agrees that
this code was the main suspect just by virtue of the pattern of oopses.

Cc: Torsten Kaiser <just.for.lkml@googlemail.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-19 09:08:49 -08:00
Linus Torvalds f527cf4050 Merge branch 'slab-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/christoph/vm
* 'slab-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/christoph/vm:
  slub: Support 4k kmallocs again to compensate for page allocator slowness
  slub: Fallback to kmalloc_large for failing higher order allocs
  slub: Determine gfpflags once and not every time a slab is allocated
  make slub.c:slab_address() static
  slub: kmalloc page allocator pass-through cleanup
  slab: avoid double initialization & do initialization in 1 place
2008-02-14 21:24:02 -08:00
Linus Torvalds 664a1566df Merge git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86
* git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86:
  x86: cpa, fix out of date comment
  KVM is not seen under X86 config with latest git (32 bit compile)
  x86: cpa: ensure page alignment
  x86: include proper prototypes for rodata_test
  x86: fix gart_iommu_init()
  x86: EFI set_memory_x()/set_memory_uc() fixes
  x86: make dump_pagetable() static
  x86: fix "BUG: sleeping function called from invalid context" in print_vma_addr()
2008-02-14 21:23:19 -08:00
Jan Blunck cf28b4863f d_path: Make d_path() use a struct path
d_path() is used on a <dentry,vfsmount> pair.  Lets use a struct path to
reflect this.

[akpm@linux-foundation.org: fix build in mm/memory.c]
Signed-off-by: Jan Blunck <jblunck@suse.de>
Acked-by: Bryan Wu <bryan.wu@analog.com>
Acked-by: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Neil Brown <neilb@suse.de>
Cc: Michael Halcrow <mhalcrow@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-14 21:17:09 -08:00
Jan Blunck c32c2f63a9 d_path: Make seq_path() use a struct path argument
seq_path() is always called with a dentry and a vfsmount from a struct path.
Make seq_path() take it directly as an argument.

Signed-off-by: Jan Blunck <jblunck@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-14 21:17:08 -08:00
Christoph Lameter 331dc558fa slub: Support 4k kmallocs again to compensate for page allocator slowness
Currently we hand off PAGE_SIZEd kmallocs to the page allocator in the
mistaken belief that the page allocator can handle these allocations
effectively. However, measurements indicate a minimum slowdown by the
factor of 8 (and that is only SMP, NUMA is much worse) vs the slub fastpath
which causes regressions in tbench.

Increase the number of kmalloc caches by one so that we again handle 4k
kmallocs directly from slub. 4k page buffering for the page allocator
will be performed by slub like done by slab.

At some point the page allocator fastpath should be fixed. A lot of the kernel
would benefit from a faster ability to allocate a single page. If that is
done then the 4k allocs may again be forwarded to the page allocator and this
patch could be reverted.

Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
2008-02-14 15:30:02 -08:00
Christoph Lameter 71c7a06ff0 slub: Fallback to kmalloc_large for failing higher order allocs
Slub already has two ways of allocating an object. One is via its own
logic and the other is via the call to kmalloc_large to hand off object
allocation to the page allocator. kmalloc_large is typically used
for objects >= PAGE_SIZE.

We can use that handoff to avoid failing if a higher order kmalloc slab
allocation cannot be satisfied by the page allocator. If we reach the
out of memory path then simply try a kmalloc_large(). kfree() can
already handle the case of an object that was allocated via the page
allocator and so this will work just fine (apart from object
accounting...).

For any kmalloc slab that already requires higher order allocs (which
makes it impossible to use the page allocator fastpath!)
we just use PAGE_ALLOC_COSTLY_ORDER to get the largest number of
objects in one go from the page allocator slowpath.

On a 4k platform this patch will lead to the following use of higher
order pages for the following kmalloc slabs:

8 ... 1024	order 0
2048 .. 4096	order 3 (4k slab only after the next patch)

We may waste some space if fallback occurs on a 2k slab but we
are always able to fallback to an order 0 alloc.

Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
2008-02-14 15:30:01 -08:00
Christoph Lameter b7a49f0d4c slub: Determine gfpflags once and not every time a slab is allocated
Currently we determine the gfp flags to pass to the page allocator
each time a slab is being allocated.

Determine the bits to be set at the time the slab is created. Store
in a new allocflags field and add the flags in allocate_slab().

Acked-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
2008-02-14 15:30:01 -08:00
Adrian Bunk dada123d99 make slub.c:slab_address() static
slab_address() can become static.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
2008-02-14 15:30:01 -08:00
Pekka Enberg eada35efcb slub: kmalloc page allocator pass-through cleanup
This adds a proper function for kmalloc page allocator pass-through. While it
simplifies any code that does slab tracing code a lot, I think it's a
worthwhile cleanup in itself.

Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
2008-02-14 15:30:01 -08:00
Marcin Slusarz e51bfd0ad1 slab: avoid double initialization & do initialization in 1 place
- alloc_slabmgmt: initialize all slab fields in 1 place
- slab->nodeid was initialized twice: in alloc_slabmgmt
  and immediately after it in cache_grow

Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com>
CC: Christoph Lameter <clameter@sgi.com>
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
2008-02-14 15:30:01 -08:00
Ingo Molnar e8bff74afb x86: fix "BUG: sleeping function called from invalid context" in print_vma_addr()
Jiri Kosina reported the following deadlock scenario with
show_unhandled_signals enabled:

 [   68.379022] gnome-settings-[2941] trap int3 ip:3d2c840f34
 sp:7fff36f5d100 error:0<3>BUG: sleeping function called from invalid
 context at kernel/rwsem.c:21
 [   68.379039] in_atomic():1, irqs_disabled():0
 [   68.379044] no locks held by gnome-settings-/2941.
 [   68.379050] Pid: 2941, comm: gnome-settings- Not tainted 2.6.25-rc1 #30
 [   68.379054]
 [   68.379056] Call Trace:
 [   68.379061]  <#DB>  [<ffffffff81064883>] ? __debug_show_held_locks+0x13/0x30
 [   68.379109]  [<ffffffff81036765>] __might_sleep+0xe5/0x110
 [   68.379123]  [<ffffffff812f2240>] down_read+0x20/0x70
 [   68.379137]  [<ffffffff8109cdca>] print_vma_addr+0x3a/0x110
 [   68.379152]  [<ffffffff8100f435>] do_trap+0xf5/0x170
 [   68.379168]  [<ffffffff8100f52b>] do_int3+0x7b/0xe0
 [   68.379180]  [<ffffffff812f4a6f>] int3+0x9f/0xd0
 [   68.379203]  <<EOE>>
 [   68.379229]  in libglib-2.0.so.0.1505.0[3d2c800000+dc000]

and tracked it down to:

  commit 03252919b7
  Author: Andi Kleen <ak@suse.de>
  Date:   Wed Jan 30 13:33:18 2008 +0100

      x86: print which shared library/executable faulted in segfault etc. messages

the problem is that we call down_read() from an atomic context.

Solve this by returning from print_vma_addr() if the preempt count is
elevated. Update preempt_conditional_sti / preempt_conditional_cli to
unconditionally lift the preempt count even on !CONFIG_PREEMPT.

Reported-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-02-14 23:30:19 +01:00
Nishanth Aravamudan 064d9efe94 hugetlb: fix overcommit locking
proc_doulongvec_minmax() calls copy_to_user()/copy_from_user(), so we can't
hold hugetlb_lock over the call.  Use a dummy variable to store the sysctl
result, like in hugetlb_sysctl_handler(), then grab the lock to update
nr_overcommit_huge_pages.

Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Reported-by: Miles Lane <miles.lane@gmail.com>
Cc: Adam Litke <agl@us.ibm.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-13 16:21:18 -08:00
Harvey Harrison b5606c2d44 remove final fastcall users
fastcall always expands to empty, remove it.

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-13 16:21:18 -08:00
KOSAKI Motohiro 31f1de46b9 mempolicy: silently restrict nodemask to allowed nodes
Kosaki Motohito noted that "numactl --interleave=all ..." failed in the
presence of memoryless nodes.  This patch attempts to fix that problem.

Some background:

numactl --interleave=all calls set_mempolicy(2) with a fully populated
[out to MAXNUMNODES] nodemask.  set_mempolicy() [in do_set_mempolicy()]
calls contextualize_policy() which requires that the nodemask be a
subset of the current task's mems_allowed; else EINVAL will be returned.

A task's mems_allowed will always be a subset of node_states[N_HIGH_MEMORY]
i.e., nodes with memory.  So, a fully populated nodemask will be
declared invalid if it includes memoryless nodes.

  NOTE:  the same thing will occur when running in a cpuset
         with restricted mem_allowed--for the same reason:
         node mask contains dis-allowed nodes.

mbind(2), on the other hand, just masks off any nodes in the nodemask
that are not included in the caller's mems_allowed.

In each case [mbind() and set_mempolicy()], mpol_check_policy() will
complain [again, resulting in EINVAL] if the nodemask contains any
memoryless nodes.  This is somewhat redundant as mpol_new() will remove
memoryless nodes for interleave policy, as will bind_zonelist()--called
by mpol_new() for BIND policy.

Proposed fix:

1) modify contextualize_policy logic to:
   a) remember whether the incoming node mask is empty.
   b) if not, restrict the nodemask to allowed nodes, as is
      currently done in-line for mbind().  This guarantees
      that the resulting mask includes only nodes with memory.

      NOTE:  this is a [benign, IMO] change in behavior for
             set_mempolicy().  Dis-allowed nodes will be
             silently ignored, rather than returning an error.

   c) fold this code into mpol_check_policy(), replace 2 calls to
      contextualize_policy() to call mpol_check_policy() directly
      and remove contextualize_policy().

2) In existing mpol_check_policy() logic, after "contextualization":
   a) MPOL_DEFAULT:  require that in coming mask "was_empty"
   b) MPOL_{BIND|INTERLEAVE}:  require that contextualized nodemask
      contains at least one node.
   c) add a case for MPOL_PREFERRED:  if in coming was not empty
      and resulting mask IS empty, user specified invalid nodes.
      Return EINVAL.
   c) remove the now redundant check for memoryless nodes

3) remove the now redundant masking of policy nodes for interleave
   policy from mpol_new().

4) Now that mpol_check_policy() contextualizes the nodemask, remove
   the in-line nodes_and() from sys_mbind().  I believe that this
   restores mbind() to the behavior before the memoryless-nodes
   patch series.  E.g., we'll no longer treat an invalid nodemask
   with MPOL_PREFERRED as local allocation.

[ Patch history:

  v1 -> v2:
   - Communicate whether or not incoming node mask was empty to
     mpol_check_policy() for better error checking.
   - As suggested by David Rientjes, remove the now unused
     cpuset_nodes_subset_current_mems_allowed() from cpuset.h

  v2 -> v3:
   - As suggested by Kosaki Motohito, fold the "contextualization"
     of policy nodemask into mpol_check_policy().  Looks a little
     cleaner. ]

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by:  KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Tested-by:      KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by:       David Rientjes <rientjes@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-11 20:48:29 -08:00
Jonathan Corbet 900cf086fd Be more robust about bad arguments in get_user_pages()
So I spent a while pounding my head against my monitor trying to figure
out the vmsplice() vulnerability - how could a failure to check for
*read* access turn into a root exploit? It turns out that it's a buffer
overflow problem which is made easy by the way get_user_pages() is
coded.

In particular, "len" is a signed int, and it is only checked at the
*end* of a do {} while() loop.  So, if it is passed in as zero, the loop
will execute once and decrement len to -1.  At that point, the loop will
proceed until the next invalid address is found; in the process, it will
likely overflow the pages array passed in to get_user_pages().

I think that, if get_user_pages() has been asked to grab zero pages,
that's what it should do.  Thus this patch; it is, among other things,
enough to block the (already fixed) root exploit and any others which
might be lurking in similar code.  I also think that the number of pages
should be unsigned, but changing the prototype of this function probably
requires some more careful review.

Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-11 20:44:44 -08:00
David Rientjes 60c12b1202 memcontrol: add vm_match_cgroup()
mm_cgroup() is exclusively used to test whether an mm's mem_cgroup pointer
is pointing to a specific cgroup.  Instead of returning the pointer, we can
just do the test itself in a new macro:

	vm_match_cgroup(mm, cgroup)

returns non-zero if the mm's mem_cgroup points to cgroup.  Otherwise it
returns zero.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-09 11:08:33 -08:00
Nick Piggin b1d0e4f535 mm: special mapping nopage
Convert special mapping install from nopage to fault.

Because the "vm_file" is NULL for the special mapping, the generic VM
code has messed up "vm_pgoff" thinking that it's an anonymous mapping
and the offset does't matter.  For that reason, we need to undo the
vm_pgoff offset that got added into vmf->pgoff.

[ We _really_ should clean that up - either by making this whole special
  mapping code just use a real file entry rather than that ugly array of
  "struct page" pointers, or by just making the VM code realize that
  even if vm_file is NULL it may not be a regular anonymous mmap.
							 - Linus ]

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08 18:57:39 -08:00
Martin Schwidefsky 2f569afd9c CONFIG_HIGHPTE vs. sub-page page tables.
Background: I've implemented 1K/2K page tables for s390.  These sub-page
page tables are required to properly support the s390 virtualization
instruction with KVM.  The SIE instruction requires that the page tables
have 256 page table entries (pte) followed by 256 page status table entries
(pgste).  The pgstes are only required if the process is using the SIE
instruction.  The pgstes are updated by the hardware and by the hypervisor
for a number of reasons, one of them is dirty and reference bit tracking.
To avoid wasting memory the standard pte table allocation should return
1K/2K (31/64 bit) and 2K/4K if the process is using SIE.

Problem: Page size on s390 is 4K, page table size is 1K or 2K.  That means
the s390 version for pte_alloc_one cannot return a pointer to a struct
page.  Trouble is that with the CONFIG_HIGHPTE feature on x86 pte_alloc_one
cannot return a pointer to a pte either, since that would require more than
32 bit for the return value of pte_alloc_one (and the pte * would not be
accessible since its not kmapped).

Solution: The only solution I found to this dilemma is a new typedef: a
pgtable_t.  For s390 pgtable_t will be a (pte *) - to be introduced with a
later patch.  For everybody else it will be a (struct page *).  The
additional problem with the initialization of the ptl lock and the
NR_PAGETABLE accounting is solved with a constructor pgtable_page_ctor and
a destructor pgtable_page_dtor.  The page table allocation and free
functions need to call these two whenever a page table page is allocated or
freed.  pmd_populate will get a pgtable_t instead of a struct page pointer.
 To get the pgtable_t back from a pmd entry that has been installed with
pmd_populate a new function pmd_pgtable is added.  It replaces the pmd_page
call in free_pte_range and apply_to_pte_range.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08 09:22:42 -08:00