Commit Graph

105 Commits

Author SHA1 Message Date
Lee Schermerhorn 3ad33b2436 Migration: find correct vma in new_vma_page()
We hit the BUG_ON() in mm/rmap.c:vma_address() when trying to migrate via
mbind(MPOL_MF_MOVE) a non-anon region that spans multiple vmas.  For
anon-regions, we just fail to migrate any pages beyond the 1st vma in the
range.

This occurs because do_mbind() collects a list of pages to migrate by
calling check_range().  check_range() walks the task's mm, spanning vmas as
necessary, to collect the migratable pages into a list.  Then, do_mbind()
calls migrate_pages() passing the list of pages, a function to allocate new
pages based on vma policy [new_vma_page()], and a pointer to the first vma
of the range.

For each page in the list, new_vma_page() calls page_address_in_vma()
passing the page and the vma [first in range] to obtain the address to get
for alloc_page_vma().  The page address is needed to get interleaving
policy correct.  If the pages in the list come from multiple vmas,
eventually, new_page_address() will pass that page to page_address_in_vma()
with the incorrect vma.  For !PageAnon pages, this will result in a bug
check in rmap.c:vma_address().  For anon pages, vma_address() will just
return EFAULT and fail the migration.

This patch modifies new_vma_page() to check the return value from
page_address_in_vma().  If the return value is EFAULT, new_vma_page()
searchs forward via vm_next for the vma that maps the page--i.e., that does
not return EFAULT.  This assumes that the pages in the list handed to
migrate_pages() is in address order.  This is currently case.  The patch
documents this assumption in a new comment block for new_vma_page().

If new_vma_page() cannot locate the vma mapping the page in a forward
search in the mm, it will pass a NULL vma to alloc_page_vma().  This will
result in the allocation using the task policy, if any, else system default
policy.  This situation is unlikely, but the patch documents this behavior
with a comment.

Note, this patch results in restarting from the first vma in a multi-vma
range each time new_vma_page() is called.  If this is not acceptable, we
can make the vma argument a pointer, both in new_vma_page() and it's caller
unmap_and_move() so that the value held by the loop in migrate_pages()
always passes down the last vma in which a page was found.  This will
require changes to all new_page_t functions passed to migrate_pages().  Is
this necessary?

For this patch to work, we can't bug check in vma_address() for pages
outside the argument vma.  This patch removes the BUG_ON().  All other
callers [besides new_vma_page()] already check the return status.

Tested on x86_64, 4 node NUMA platform.

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-11-14 18:45:38 -08:00
Pavel Emelyanov 228ebcbe63 Uninline find_task_by_xxx set of functions
The find_task_by_something is a set of macros are used to find task by pid
depending on what kind of pid is proposed - global or virtual one.  All of
them are wrappers above the most generic one - find_task_by_pid_type_ns() -
and just substitute some args for it.

It turned out, that dereferencing the current->nsproxy->pid_ns construction
and pushing one more argument on the stack inline cause kernel text size to
grow.

This patch moves all this stuff out-of-line into kernel/pid.c.  Together
with the next patch it saves a bit less than 400 bytes from the .text
section.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Paul Menage <menage@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 11:53:40 -07:00
Pavel Emelyanov b488893a39 pid namespaces: changes to show virtual ids to user
This is the largest patch in the set. Make all (I hope) the places where
the pid is shown to or get from user operate on the virtual pids.

The idea is:
 - all in-kernel data structures must store either struct pid itself
   or the pid's global nr, obtained with pid_nr() call;
 - when seeking the task from kernel code with the stored id one
   should use find_task_by_pid() call that works with global pids;
 - when showing pid's numerical value to the user the virtual one
   should be used, but however when one shows task's pid outside this
   task's namespace the global one is to be used;
 - when getting the pid from userspace one need to consider this as
   the virtual one and use appropriate task/pid-searching functions.

[akpm@linux-foundation.org: build fix]
[akpm@linux-foundation.org: nuther build fix]
[akpm@linux-foundation.org: yet nuther build fix]
[akpm@linux-foundation.org: remove unneeded casts]
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Alexey Dobriyan <adobriyan@openvz.org>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Paul Menage <menage@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 11:53:40 -07:00
Paul Menage 8793d854ed Task Control Groups: make cpusets a client of cgroups
Remove the filesystem support logic from the cpusets system and makes cpusets
a cgroup subsystem

The "cpuset" filesystem becomes a dummy filesystem; attempts to mount it get
passed through to the cgroup filesystem with the appropriate options to
emulate the old cpuset filesystem behaviour.

Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 11:53:36 -07:00
Adrian Bunk dbcb0f19c8 mm/mempolicy.c: cleanups
This patch contains the following cleanups:
- every file should include the headers containing the prototypes for
  its global functions
- make the follosing needlessly global functions static:
  - migrate_to_node()
  - do_mbind()
  - sp_alloc()
  - mpol_rebind_policy()

[akpm@linux-foundation.org: fix uninitialised var warning]
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:43:03 -07:00
Lee Schermerhorn 37b07e4163 memoryless nodes: fixup uses of node_online_map in generic code
Here's a cut at fixing up uses of the online node map in generic code.

mm/shmem.c:shmem_parse_mpol()

	Ensure nodelist is subset of nodes with memory.
	Use node_states[N_HIGH_MEMORY] as default for missing
	nodelist for interleave policy.

mm/shmem.c:shmem_fill_super()

	initialize policy_nodes to node_states[N_HIGH_MEMORY]

mm/page-writeback.c:highmem_dirtyable_memory()

	sum over nodes with memory

mm/page_alloc.c:zlc_setup()

	allowednodes - use nodes with memory.

mm/page_alloc.c:default_zonelist_order()

	average over nodes with memory.

mm/page_alloc.c:find_next_best_node()

	skip nodes w/o memory.
	N_HIGH_MEMORY state mask may not be initialized at this time,
	unless we want to depend on early_calculate_totalpages() [see
	below].  Will ZONE_MOVABLE ever be configurable?

mm/page_alloc.c:find_zone_movable_pfns_for_nodes()

	spread kernelcore over nodes with memory.

	This required calling early_calculate_totalpages()
	unconditionally, and populating N_HIGH_MEMORY node
	state therein from nodes in the early_node_map[].
	If we can depend on this, we can eliminate the
	population of N_HIGH_MEMORY mask from __build_all_zonelists()
	and use the N_HIGH_MEMORY mask in find_next_best_node().

mm/mempolicy.c:mpol_check_policy()

	Ensure nodes specified for policy are subset of
	nodes with memory.

[akpm@linux-foundation.org: fix warnings]
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Christoph Lameter <clameter@sgi.com>
Cc: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:59 -07:00
Christoph Lameter 56bbd65df0 Memoryless nodes: Update memory policy and page migration
Online nodes now may have no memory.  The checks and initialization must
therefore be changed to no longer use the online functions.

This will correctly initialize the interleave on bootup to only target nodes
with memory and will make sys_move_pages return an error when a page is to be
moved to a memoryless node.  Similarly we will get an error if MPOL_BIND and
MPOL_INTERLEAVE is used on a memoryless node.

These are somewhat new semantics.  So far one could specify memoryless nodes
and we would maybe do the right thing and just ignore the node (or we'd do
something strange like with MPOL_INTERLEAVE).  If we want to allow the
specification of memoryless nodes via memory policies then we need to keep
checking for online nodes.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Nishanth Aravamudan <nacc@us.ibm.com>
Tested-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Bob Picco <bob.picco@hp.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@skynet.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:58 -07:00
Christoph Lameter 6eaf806a22 Memoryless nodes: Fix interleave behavior for memoryless nodes
MPOL_INTERLEAVE currently simply loops over all nodes.  Allocations on
memoryless nodes will be redirected to nodes with memory.  This results in an
imbalance because the neighboring nodes to memoryless nodes will get
significantly more interleave hits that the rest of the nodes on the system.

We can avoid this imbalance by clearing the nodes in the interleave node set
that have no memory.  If we use the node map of the memory nodes instead of
the online nodes then we have only the nodes we want.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Tested-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Bob Picco <bob.picco@hp.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@skynet.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:58 -07:00
Lee Schermerhorn 754af6f5a8 Mem Policy: add MPOL_F_MEMS_ALLOWED get_mempolicy() flag
Allow an application to query the memories allowed by its context.

Updated numa_memory_policy.txt to mention that applications can use this to
obtain allowed memories for constructing valid policies.

TODO:  update out-of-tree libnuma wrapper[s], or maybe add a new
wrapper--e.g.,  numa_get_mems_allowed() ?

Also, update numa syscall man pages.

Tested with memtoy V>=0.13.

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Christoph Lameter <clameter@sgi.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:54 -07:00
Jesper Juhl 43fac94dd6 Clean up duplicate includes in mm/
This patch cleans up duplicate includes in
	mm/

Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Acked-by: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:52 -07:00
Lee Schermerhorn 480eccf9ae Fix NUMA Memory Policy Reference Counting
This patch proposes fixes to the reference counting of memory policy in the
page allocation paths and in show_numa_map().  Extracted from my "Memory
Policy Cleanups and Enhancements" series as stand-alone.

Shared policy lookup [shmem] has always added a reference to the policy,
but this was never unrefed after page allocation or after formatting the
numa map data.

Default system policy should not require additional ref counting, nor
should the current task's task policy.  However, show_numa_map() calls
get_vma_policy() to examine what may be [likely is] another task's policy.
The latter case needs protection against freeing of the policy.

This patch adds a reference count to a mempolicy returned by
get_vma_policy() when the policy is a vma policy or another task's
mempolicy.  Again, shared policy is already reference counted on lookup.  A
matching "unref" [__mpol_free()] is performed in alloc_page_vma() for
shared and vma policies, and in show_numa_map() for shared and another
task's mempolicy.  We can call __mpol_free() directly, saving an admittedly
inexpensive inline NULL test, because we know we have a non-NULL policy.

Handling policy ref counts for hugepages is a bit trickier.
huge_zonelist() returns a zone list that might come from a shared or vma
'BIND policy.  In this case, we should hold the reference until after the
huge page allocation in dequeue_hugepage().  The patch modifies
huge_zonelist() to return a pointer to the mempolicy if it needs to be
unref'd after allocation.

Kernel Build [16cpu, 32GB, ia64] - average of 10 runs:

		w/o patch	w/ refcount patch
	    Avg	  Std Devn	   Avg	  Std Devn
Real:	 100.59	    0.38	 100.63	    0.43
User:	1209.60	    0.37	1209.91	    0.31
System:   81.52	    0.42	  81.64	    0.34

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Andi Kleen <ak@suse.de>
Cc: Christoph Lameter <clameter@sgi.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-09-19 11:24:18 -07:00
Christoph Lameter 3b42d28b2a Page migration: Do not accept invalid nodes in the target nodeset
Page migration currently does not check if the target of the move contains
nodes that that are invalid (if root attempts to migrate pages)
and may try to allocate from invalid nodes if these are specified
leading to oopses.

Return -EINVAL if an offline node is specified.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-08-31 01:42:23 -07:00
Mel Gorman b377fd3982 Apply memory policies to top two highest zones when highest zone is ZONE_MOVABLE
The NUMA layer only supports NUMA policies for the highest zone.  When
ZONE_MOVABLE is configured with kernelcore=, the the highest zone becomes
ZONE_MOVABLE.  The result is that policies are only applied to allocations
like anonymous pages and page cache allocated from ZONE_MOVABLE when the
zone is used.

This patch applies policies to the two highest zones when the highest zone
is ZONE_MOVABLE.  As ZONE_MOVABLE consists of pages from the highest "real"
zone, it's always functionally equivalent.

The patch has been tested on a variety of machines both NUMA and non-NUMA
covering x86, x86_64 and ppc64.  No abnormal results were seen in
kernbench, tbench, dbench or hackbench.  It passes regression tests from
the numactl package with and without kernelcore= once numactl tests are
patched to wait for vmstat counters to update.

akpm: this is the nasty hack to fix NUMA mempolicies in the presence of
ZONE_MOVABLE and kernelcore= in 2.6.23.  Christoph says "For .24 either merge
the mobility or get the other solution that Mel is working on.  That solution
would only use a single zonelist per node and filter on the fly.  That may
help performance and also help to make memory policies work better."

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>
Tested-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Christoph Lameter <clameter@sgi.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-08-22 19:52:47 -07:00
Paul Mundt 20c2df83d2 mm: Remove slab destructors from kmem_cache_create().
Slab destructors were no longer supported after Christoph's
c59def9f22 change. They've been
BUGs for both slab and slub, and slob never supported them
either.

This rips out support for the dtor pointer from kmem_cache_create()
completely and fixes up every single callsite in the kernel (there were
about 224, not including the slab allocator definitions themselves,
or the documentation references).

Signed-off-by: Paul Mundt <lethal@linux-sh.org>
2007-07-20 10:11:58 +09:00
Mel Gorman 396faf0303 Allow huge page allocations to use GFP_HIGH_MOVABLE
Huge pages are not movable so are not allocated from ZONE_MOVABLE.  However,
as ZONE_MOVABLE will always have pages that can be migrated or reclaimed, it
can be used to satisfy hugepage allocations even when the system has been
running a long time.  This allows an administrator to resize the hugepage pool
at runtime depending on the size of ZONE_MOVABLE.

This patch adds a new sysctl called hugepages_treat_as_movable.  When a
non-zero value is written to it, future allocations for the huge page pool
will use ZONE_MOVABLE.  Despite huge pages being non-movable, we do not
introduce additional external fragmentation of note as huge pages are always
the largest contiguous block we care about.

[akpm@linux-foundation.org: various fixes]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-17 10:22:59 -07:00
Mel Gorman 769848c038 Add __GFP_MOVABLE for callers to flag allocations from high memory that may be migrated
It is often known at allocation time whether a page may be migrated or not.
This patch adds a flag called __GFP_MOVABLE and a new mask called
GFP_HIGH_MOVABLE.  Allocations using the __GFP_MOVABLE can be either migrated
using the page migration mechanism or reclaimed by syncing with backing
storage and discarding.

An API function very similar to alloc_zeroed_user_highpage() is added for
__GFP_MOVABLE allocations called alloc_zeroed_user_highpage_movable().  The
flags used by alloc_zeroed_user_highpage() are not changed because it would
change the semantics of an existing API.  After this patch is applied there
are no in-kernel users of alloc_zeroed_user_highpage() so it probably should
be marked deprecated if this patch is merged.

Note that this patch includes a minor cleanup to the use of __GFP_ZERO in
shmem.c to keep all flag modifications to inode->mapping in the
shmem_dir_alloc() helper function.  This clean-up suggestion is courtesy of
Hugh Dickens.

Additional credit goes to Christoph Lameter and Linus Torvalds for shaping the
concept.  Credit to Hugh Dickens for catching issues with shmem swap vector
and ramfs allocations.

[akpm@linux-foundation.org: build fix]
[hugh@veritas.com: __GFP_ZERO cleanup]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-17 10:22:59 -07:00
Paul Mundt 140d5a4904 numa: mempolicy: trivial debug fixes.
Enabling debugging fails to build due to the nodemask variable in
do_mbind() having changed names, and then oopses on boot due to the
assumption that the nodemask can be dereferenced -- which doesn't work out
so well when the policy is changed to MPOL_DEFAULT with a NULL nodemask by
numa_default_policy().

This fixes it up, and switches from PDprintk() to pr_debug() while
we're at it.

Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16 09:05:36 -07:00
Paul Mundt b71636e298 numa: mempolicy: dynamic interleave map for system init
This converts the default system init memory policy to use a dynamically
created node map instead of defaulting to all online nodes.  Nodes of a
certain size (>= 16MB) are judged to be suitable for interleave, and are added
to the map.  If all nodes are smaller in size, the largest one is
automatically selected.

Without this, tiny nodes find themselves out of memory before we even make it
to userspace.  Systems with large nodes will notice no change.

Only the system init policy is effected by this change, the regular
MPOL_DEFAULT policy is still switched to later on in the boot process as
normal.

Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Cc: Andi Kleen <ak@suse.de>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16 09:05:36 -07:00
Christoph Lameter 0dc952dc3e [PATCH] Page migration: Fix vma flag checking
Currently we do not check for vma flags if sys_move_pages is called to move
individual pages.  If sys_migrate_pages is called to move pages then we
check for vm_flags that indicate a non migratable vma but that still
includes VM_LOCKED and we can migrate mlocked pages.

Extract the vma_migratable check from mm/mempolicy.c, fix it and put it
into migrate.h so that is can be used from both locations.

Problem was spotted by Lee Schermerhorn

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-05 07:57:51 -08:00
KAMEZAWA Hiroyuki 8af5e2eb3c [PATCH] fix mempolicy's check on a system with memory-less-node
bind_zonelist() can create zero-length zonelist if there is a
memory-less-node.  This patch checks the length of zonelist.  If length is
0, returns -EINVAL.

tested on ia64/NUMA with memory-less-node.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Andi Kleen <ak@suse.de>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-20 17:10:13 -08:00
Christoph Lameter 6267276f3f [PATCH] optional ZONE_DMA: deal with cases of ZONE_DMA meaning the first zone
This patchset follows up on the earlier work in Andrew's tree to reduce the
number of zones.  The patches allow to go to a minimum of 2 zones.  This one
allows also to make ZONE_DMA optional and therefore the number of zones can be
reduced to one.

ZONE_DMA is usually used for ISA DMA devices.  There are a number of reasons
why we would not want to have ZONE_DMA

1. Some arches do not need ZONE_DMA at all.

2. With the advent of IOMMUs DMA zones are no longer needed.
   The necessity of DMA zones may drastically be reduced
   in the future. This patchset allows a compilation of
   a kernel without that overhead.

3. Devices that require ISA DMA get rare these days. All
   my systems do not have any need for ISA DMA.

4. The presence of an additional zone unecessarily complicates
   VM operations because it must be scanned and balancing
   logic must operate on its.

5. With only ZONE_NORMAL one can reach the situation where
   we have only one zone. This will allow the unrolling of many
   loops in the VM and allows the optimization of varous
   code paths in the VM.

6. Having only a single zone in a NUMA system results in a
   1-1 correspondence between nodes and zones. Various additional
   optimizations to critical VM paths become possible.

Many systems today can operate just fine with a single zone.  If you look at
what is in ZONE_DMA then one usually sees that nothing uses it.  The DMA slabs
are empty (Some arches use ZONE_DMA instead of ZONE_NORMAL, then ZONE_NORMAL
will be empty instead).

On all of my systems (i386, x86_64, ia64) ZONE_DMA is completely empty.  Why
constantly look at an empty zone in /proc/zoneinfo and empty slab in
/proc/slabinfo?  Non i386 also frequently have no need for ZONE_DMA and zones
stay empty.

The patchset was tested on i386 (UP / SMP), x86_64 (UP, NUMA) and ia64 (NUMA).

The RFC posted earlier (see
http://marc.theaimsgroup.com/?l=linux-kernel&m=115231723513008&w=2) had lots
of #ifdefs in them.  An effort has been made to minize the number of #ifdefs
and make this as compact as possible.  The job was made much easier by the
ongoing efforts of others to extract common arch specific functionality.

I have been running this for awhile now on my desktop and finally Linux is
using all my available RAM instead of leaving the 16MB in ZONE_DMA untouched:

christoph@pentium940:~$ cat /proc/zoneinfo
Node 0, zone   Normal
  pages free     4435
        min      1448
        low      1810
        high     2172
        active   241786
        inactive 210170
        scanned  0 (a: 0 i: 0)
        spanned  524224
        present  524224
    nr_anon_pages 61680
    nr_mapped    14271
    nr_file_pages 390264
    nr_slab_reclaimable 27564
    nr_slab_unreclaimable 1793
    nr_page_table_pages 449
    nr_dirty     39
    nr_writeback 0
    nr_unstable  0
    nr_bounce    0
    cpu: 0 pcp: 0
              count: 156
              high:  186
              batch: 31
    cpu: 0 pcp: 1
              count: 9
              high:  62
              batch: 15
  vm stats threshold: 20
    cpu: 1 pcp: 0
              count: 177
              high:  186
              batch: 31
    cpu: 1 pcp: 1
              count: 12
              high:  62
              batch: 15
  vm stats threshold: 20
  all_unreclaimable: 0
  prev_priority:     12
  temp_priority:     12
  start_pfn:         0

This patch:

In two places in the VM we use ZONE_DMA to refer to the first zone.  If
ZONE_DMA is optional then other zones may be first.  So simply replace
ZONE_DMA with zone 0.

This also fixes ZONETABLE_PGSHIFT.  If we have only a single zone then
ZONES_PGSHIFT may become 0 because there is no need anymore to encode the zone
number related to a pgdat.  However, we still need a zonetable to index all
the zones for each node if this is a NUMA system.  Therefore define
ZONETABLE_SHIFT unconditionally as the offset of the ZONE field in page flags.

[apw@shadowen.org: fix mismerge]
Acked-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Andi Kleen <ak@suse.de>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Matthew Wilcox <willy@debian.org>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:18 -08:00
Christoph Lameter 30150f8d7b [PATCH] mbind: restrict nodes to the currently allowed cpuset
Currently one can specify an arbitrary node mask to mbind that includes
nodes not allowed.  If that is done with an interleave policy then we will
go around all the nodes.  Those outside of the currently allowed cpuset
will be redirected to the border nodes.  Interleave will then create
imbalances at the borders of the cpuset.

This patch restricts the nodes to the currently allowed cpuset.

The RFC for this patch was discussed at
http://marc.theaimsgroup.com/?t=116793842100004&r=1&w=2

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-01-23 07:52:06 -08:00
Josef Sipek e9536ae720 [PATCH] struct path: convert mm
Signed-off-by: Josef Sipek <jsipek@fsl.cs.sunysb.edu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:28:47 -08:00
Helge Deller 15ad7cdcfd [PATCH] struct seq_operations and struct file_operations constification
- move some file_operations structs into the .rodata section

 - move static strings from policy_types[] array into the .rodata section

 - fix generic seq_operations usages, so that those structs may be defined
   as "const" as well

[akpm@osdl.org: couple of fixes]
Signed-off-by: Helge Deller <deller@gmx.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:46 -08:00
Christoph Lameter e94b176609 [PATCH] slab: remove SLAB_KERNEL
SLAB_KERNEL is an alias of GFP_KERNEL.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:24 -08:00
Andy Whitcroft 25ba77c141 [PATCH] numa node ids are int, page_to_nid and zone_to_nid should return int
NUMA node ids are passed as either int or unsigned int almost exclusivly
page_to_nid and zone_to_nid both return unsigned long.  This is a throw
back to when page_to_nid was a #define and was thus exposing the real type
of the page flags field.

In addition to fixing up the definitions of page_to_nid and zone_to_nid I
audited the users of these functions identifying the following incorrect
uses:

1) mm/page_alloc.c show_node() -- printk dumping the node id,
2) include/asm-ia64/pgalloc.h pgtable_quicklist_free() -- comparison
   against numa_node_id() which returns an int from cpu_to_node(), and
3) mm/mpolicy.c check_pte_range -- used as an index in node_isset which
   uses bit_set which in generic code takes an int.

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:23 -08:00
Paul Jackson 9276b1bc96 [PATCH] memory page_alloc zonelist caching speedup
Optimize the critical zonelist scanning for free pages in the kernel memory
allocator by caching the zones that were found to be full recently, and
skipping them.

Remembers the zones in a zonelist that were short of free memory in the
last second.  And it stashes a zone-to-node table in the zonelist struct,
to optimize that conversion (minimize its cache footprint.)

Recent changes:

    This differs in a significant way from a similar patch that I
    posted a week ago.  Now, instead of having a nodemask_t of
    recently full nodes, I have a bitmask of recently full zones.
    This solves a problem that last weeks patch had, which on
    systems with multiple zones per node (such as DMA zone) would
    take seeing any of these zones full as meaning that all zones
    on that node were full.

    Also I changed names - from "zonelist faster" to "zonelist cache",
    as that seemed to better convey what we're doing here - caching
    some of the key zonelist state (for faster access.)

    See below for some performance benchmark results.  After all that
    discussion with David on why I didn't need them, I went and got
    some ;).  I wanted to verify that I had not hurt the normal case
    of memory allocation noticeably.  At least for my one little
    microbenchmark, I found (1) the normal case wasn't affected, and
    (2) workloads that forced scanning across multiple nodes for
    memory improved up to 10% fewer System CPU cycles and lower
    elapsed clock time ('sys' and 'real').  Good.  See details, below.

    I didn't have the logic in get_page_from_freelist() for various
    full nodes and zone reclaim failures correct.  That should be
    fixed up now - notice the new goto labels zonelist_scan,
    this_zone_full, and try_next_zone, in get_page_from_freelist().

There are two reasons I persued this alternative, over some earlier
proposals that would have focused on optimizing the fake numa
emulation case by caching the last useful zone:

 1) Contrary to what I said before, we (SGI, on large ia64 sn2 systems)
    have seen real customer loads where the cost to scan the zonelist
    was a problem, due to many nodes being full of memory before
    we got to a node we could use.  Or at least, I think we have.
    This was related to me by another engineer, based on experiences
    from some time past.  So this is not guaranteed.  Most likely, though.

    The following approach should help such real numa systems just as
    much as it helps fake numa systems, or any combination thereof.

 2) The effort to distinguish fake from real numa, using node_distance,
    so that we could cache a fake numa node and optimize choosing
    it over equivalent distance fake nodes, while continuing to
    properly scan all real nodes in distance order, was going to
    require a nasty blob of zonelist and node distance munging.

    The following approach has no new dependency on node distances or
    zone sorting.

See comment in the patch below for a description of what it actually does.

Technical details of note (or controversy):

 - See the use of "zlc_active" and "did_zlc_setup" below, to delay
   adding any work for this new mechanism until we've looked at the
   first zone in zonelist.  I figured the odds of the first zone
   having the memory we needed were high enough that we should just
   look there, first, then get fancy only if we need to keep looking.

 - Some odd hackery was needed to add items to struct zonelist, while
   not tripping up the custom zonelists built by the mm/mempolicy.c
   code for MPOL_BIND.  My usual wordy comments below explain this.
   Search for "MPOL_BIND".

 - Some per-node data in the struct zonelist is now modified frequently,
   with no locking.  Multiple CPU cores on a node could hit and mangle
   this data.  The theory is that this is just performance hint data,
   and the memory allocator will work just fine despite any such mangling.
   The fields at risk are the struct 'zonelist_cache' fields 'fullzones'
   (a bitmask) and 'last_full_zap' (unsigned long jiffies).  It should
   all be self correcting after at most a one second delay.

 - This still does a linear scan of the same lengths as before.  All
   I've optimized is making the scan faster, not algorithmically
   shorter.  It is now able to scan a compact array of 'unsigned
   short' in the case of many full nodes, so one cache line should
   cover quite a few nodes, rather than each node hitting another
   one or two new and distinct cache lines.

 - If both Andi and Nick don't find this too complicated, I will be
   (pleasantly) flabbergasted.

 - I removed the comment claiming we only use one cachline's worth of
   zonelist.  We seem, at least in the fake numa case, to have put the
   lie to that claim.

 - I pay no attention to the various watermarks and such in this performance
   hint.  A node could be marked full for one watermark, and then skipped
   over when searching for a page using a different watermark.  I think
   that's actually quite ok, as it will tend to slightly increase the
   spreading of memory over other nodes, away from a memory stressed node.

===============

Performance - some benchmark results and analysis:

This benchmark runs a memory hog program that uses multiple
threads to touch alot of memory as quickly as it can.

Multiple runs were made, touching 12, 38, 64 or 90 GBytes out of
the total 96 GBytes on the system, and using 1, 19, 37, or 55
threads (on a 56 CPU system.)  System, user and real (elapsed)
timings were recorded for each run, shown in units of seconds,
in the table below.

Two kernels were tested - 2.6.18-mm3 and the same kernel with
this zonelist caching patch added.  The table also shows the
percentage improvement the zonelist caching sys time is over
(lower than) the stock *-mm kernel.

      number     2.6.18-mm3	   zonelist-cache    delta (< 0 good)	percent
 GBs    N  	------------	   --------------    ----------------	systime
 mem threads   sys user  real	  sys  user  real     sys  user  real	 better
  12	 1     153   24   177	  151	 24   176      -2     0    -1	   1%
  12	19	99   22     8	   99	 22	8	0     0     0	   0%
  12	37     111   25     6	  112	 25	6	1     0     0	  -0%
  12	55     115   25     5	  110	 23	5      -5    -2     0	   4%
  38	 1     502   74   576	  497	 73   570      -5    -1    -6	   0%
  38	19     426   78    48	  373	 76    39     -53    -2    -9	  12%
  38	37     544   83    36	  547	 82    36	3    -1     0	  -0%
  38	55     501   77    23	  511	 80    24      10     3     1	  -1%
  64	 1     917  125  1042	  890	124  1014     -27    -1   -28	   2%
  64	19    1118  138   119	  965	141   103    -153     3   -16	  13%
  64	37    1202  151    94	 1136	150    81     -66    -1   -13	   5%
  64	55    1118  141    61	 1072	140    58     -46    -1    -3	   4%
  90	 1    1342  177  1519	 1275	174  1450     -67    -3   -69	   4%
  90	19    2392  199   192	 2116	189   176    -276   -10   -16	  11%
  90	37    3313  238   175	 2972	225   145    -341   -13   -30	  10%
  90	55    1948  210   104	 1843	213   100    -105     3    -4	   5%

Notes:
 1) This test ran a memory hog program that started a specified number N of
    threads, and had each thread allocate and touch 1/N'th of
    the total memory to be used in the test run in a single loop,
    writing a constant word to memory, one store every 4096 bytes.
    Watching this test during some earlier trial runs, I would see
    each of these threads sit down on one CPU and stay there, for
    the remainder of the pass, a different CPU for each thread.

 2) The 'real' column is not comparable to the 'sys' or 'user' columns.
    The 'real' column is seconds wall clock time elapsed, from beginning
    to end of that test pass.  The 'sys' and 'user' columns are total
    CPU seconds spent on that test pass.  For a 19 thread test run,
    for example, the sum of 'sys' and 'user' could be up to 19 times the
    number of 'real' elapsed wall clock seconds.

 3) Tests were run on a fresh, single-user boot, to minimize the amount
    of memory already in use at the start of the test, and to minimize
    the amount of background activity that might interfere.

 4) Tests were done on a 56 CPU, 28 Node system with 96 GBytes of RAM.

 5) Notice that the 'real' time gets large for the single thread runs, even
    though the measured 'sys' and 'user' times are modest.  I'm not sure what
    that means - probably something to do with it being slow for one thread to
    be accessing memory along ways away.  Perhaps the fake numa system, running
    ostensibly the same workload, would not show this substantial degradation
    of 'real' time for one thread on many nodes -- lets hope not.

 6) The high thread count passes (one thread per CPU - on 55 of 56 CPUs)
    ran quite efficiently, as one might expect.  Each pair of threads needed
    to allocate and touch the memory on the node the two threads shared, a
    pleasantly parallizable workload.

 7) The intermediate thread count passes, when asking for alot of memory forcing
    them to go to a few neighboring nodes, improved the most with this zonelist
    caching patch.

Conclusions:
 * This zonelist cache patch probably makes little difference one way or the
   other for most workloads on real numa hardware, if those workloads avoid
   heavy off node allocations.
 * For memory intensive workloads requiring substantial off-node allocations
   on real numa hardware, this patch improves both kernel and elapsed timings
   up to ten per-cent.
 * For fake numa systems, I'm optimistic, but will have to leave that up to
   Rohit Seth to actually test (once I get him a 2.6.18 backport.)

Signed-off-by: Paul Jackson <pj@sgi.com>
Cc: Rohit Seth <rohitseth@google.com>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Cc: David Rientjes <rientjes@cs.washington.edu>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:20 -08:00
Keith Owens 6993974997 [PATCH] Fix do_mbind warning with CONFIG_MIGRATION=n
With CONFIG_MIGRATION=n

mm/mempolicy.c: In function 'do_mbind':
mm/mempolicy.c:796: warning: passing argument 2 of 'migrate_pages' from incompatible pointer type

Signed-off-by: Keith Owens <kaos@ocs.com.au>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-11 11:14:19 -07:00
Alexey Dobriyan 52978be636 [PATCH] kmemdup: some users
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:19 -07:00
Christoph Lameter 765c4507af [PATCH] GFP_THISNODE for the slab allocator
This patch insures that the slab node lists in the NUMA case only contain
slabs that belong to that specific node.  All slab allocations use
GFP_THISNODE when calling into the page allocator.  If an allocation fails
then we fall back in the slab allocator according to the zonelists appropriate
for a certain context.

This allows a replication of the behavior of alloc_pages and alloc_pages node
in the slab layer.

Currently allocations requested from the page allocator may be redirected via
cpusets to other nodes.  This results in remote pages on nodelists and that in
turn results in interrupt latency issues during cache draining.  Plus the slab
is handing out memory as local when it is really remote.

Fallback for slab memory allocations will occur within the slab allocator and
not in the page allocator.  This is necessary in order to be able to use the
existing pools of objects on the nodes that we fall back to before adding more
pages to a slab.

The fallback function insures that the nodes we fall back to obey cpuset
restrictions of the current context.  We do not allocate objects from outside
of the current cpuset context like before.

Note that the implementation of locality constraints within the slab allocator
requires importing logic from the page allocator.  This is a mischmash that is
not that great.  Other allocators (uncached allocator, vmalloc, huge pages)
face similar problems and have similar minimal reimplementations of the basic
fallback logic of the page allocator.  There is another way of implementing a
slab by avoiding per node lists (see modular slab) but this wont work within
the existing slab.

V1->V2:
- Use NUMA_BUILD to avoid #ifdef CONFIG_NUMA
- Exploit GFP_THISNODE being 0 in the NON_NUMA case to avoid another
  #ifdef

[akpm@osdl.org: build fix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 08:26:12 -07:00
Christoph Lameter 89fa30242f [PATCH] NUMA: Add zone_to_nid function
There are many places where we need to determine the node of a zone.
Currently we use a difficult to read sequence of pointer dereferencing.
Put that into an inline function and use throughout VM.  Maybe we can find
a way to optimize the lookup in the future.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:52 -07:00
Christoph Lameter 9b819d204c [PATCH] Add __GFP_THISNODE to avoid fallback to other nodes and ignore cpuset/memory policy restrictions
Add a new gfp flag __GFP_THISNODE to avoid fallback to other nodes.  This
flag is essential if a kernel component requires memory to be located on a
certain node.  It will be needed for alloc_pages_node() to force allocation
on the indicated node and for alloc_pages() to force allocation on the
current node.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:50 -07:00
Christoph Lameter 19655d3487 [PATCH] linearly index zone->node_zonelists[]
I wonder why we need this bitmask indexing into zone->node_zonelists[]?

We always start with the highest zone and then include all lower zones
if we build zonelists.

Are there really cases where we need allocation from ZONE_DMA or
ZONE_HIGHMEM but not ZONE_NORMAL? It seems that the current implementation
of highest_zone() makes that already impossible.

If we go linear on the index then gfp_zone() == highest_zone() and a lot
of definitions fall by the wayside.

We can now revert back to the use of gfp_zone() in mempolicy.c ;-)

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:47 -07:00
Christoph Lameter 2f6726e54a [PATCH] Apply type enum zone_type
After we have done this we can now do some typing cleanup.

The memory policy layer keeps a policy_zone that specifies
the zone that gets memory policies applied. This variable
can now be of type enum zone_type.

The check_highest_zone function and the build_zonelists funnctionm must
then also take a enum zone_type parameter.

Plus there are a number of loops over zones that also should use
zone_type.

We run into some troubles at some points with functions that need a
zone_type variable to become -1. Fix that up.

[pj@sgi.com: fix set_mempolicy() crash]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:47 -07:00
Christoph Lameter 4e4785bcf0 [PATCH] mempolicies: fix policy_zone check
There is a check in zonelist_policy that compares pieces of the bitmap
obtained from a gfp mask via GFP_ZONETYPES with a zone number in function
zonelist_policy().

The bitmap is an ORed mask of __GFP_DMA, __GFP_DMA32 and __GFP_HIGHMEM.
The policy_zone is a zone number with the possible values of ZONE_DMA,
ZONE_DMA32, ZONE_HIGHMEM and ZONE_NORMAL. These are two different domains
of values.

For some reason seemed to work before the zone reduction patchset (It
definitely works on SGI boxes since we just have one zone and the check
cannot fail).

With the zone reduction patchset this check definitely fails on systems
with two zones if the system actually has memory in both zones.

This is because ZONE_NORMAL is selected using no __GFP flag at
all and thus gfp_zone(gfpmask) == 0. ZONE_DMA is selected when __GFP_DMA
is set. __GFP_DMA is 0x01.  So gfp_zone(gfpmask) == 1.

policy_zone is set to ZONE_NORMAL (==1) if ZONE_NORMAL and ZONE_DMA are
populated.

For ZONE_NORMAL gfp_zone(<no _GFP_DMA>) yields 0 which is <
policy_zone(ZONE_NORMAL) and so policy is not applied to regular memory
allocations!

Instead gfp_zone(__GFP_DMA) == 1 which results in policy being applied
to DMA allocations!

What we realy want in that place is to establish the highest allowable
zone for a given gfp_mask. If the highest zone is higher or equal to the
policy_zone then memory policies need to be applied. We have such
a highest_zone() function in page_alloc.c.

So move the highest_zone() function from mm/page_alloc.c into
include/linux/gfp.h.  On the way we simplify the function and use the new
zone_type that was also introduced with the zone reduction patchset plus we
also specify the right type for the gfp flags parameter.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:47 -07:00
Nishanth Aravamudan 3b98b087fc [PATCH] fix NUMA interleaving for huge pages
Since vma->vm_pgoff is in units of smallpages, VMAs for huge pages have the
lower HPAGE_SHIFT - PAGE_SHIFT bits always cleared, which results in badd
offsets to the interleave functions.  Take this difference from small pages
into account when calculating the offset.  This does add a 0-bit shift into
the small-page path (via alloc_page_vma()), but I think that is negligible.
 Also add a BUG_ON to prevent the offset from growing due to a negative
right-shift, which probably shouldn't be allowed anyways.

Tested on an 8-memory node ppc64 NUMA box and got the interleaving I
expected.

Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Signed-off-by: Adam Litke <agl@us.ibm.com>
Cc: Andi Kleen <ak@muc.de>
Acked-by: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-01 11:39:10 -07:00
Christoph Lameter ca889e6c45 [PATCH] Use Zoned VM Counters for NUMA statistics
The numa statistics are really event counters.  But they are per node and
so we have had special treatment for these counters through additional
fields on the pcp structure.  We can now use the per zone nature of the
zoned VM counters to realize these.

This will shrink the size of the pcp structure on NUMA systems.  We will
have some room to add additional per zone counters that will all still fit
in the same cacheline.

 Bits	Prior pcp size	  	Size after patch	We can add
 ------------------------------------------------------------------
 64	128 bytes (16 words)	80 bytes (10 words)	48
 32	 76 bytes (19 words)	56 bytes (14 words)	8 (64 byte cacheline)
							72 (128 byte)

Remove the special statistics for numa and replace them with zoned vm
counters.  This has the side effect that global sums of these events now
show up in /proc/vmstat.

Also take the opportunity to move the zone_statistics() function from
page_alloc.c into vmstat.c.

Discussions:
V2 http://marc.theaimsgroup.com/?t=115048227000002&r=1&w=2

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-30 11:25:36 -07:00
Eric W. Biederman 99f8955183 [PATCH] proc: don't lock task_structs indefinitely
Every inode in /proc holds a reference to a struct task_struct.  If a
directory or file is opened and remains open after the the task exits this
pinning continues.  With 8K stacks on a 32bit machine the amount pinned per
file descriptor is about 10K.

Normally I would figure a reasonable per user process limit is about 100
processes.  With 80 processes, with a 1000 file descriptors each I can trigger
the 00M killer on a 32bit kernel, because I have pinned about 800MB of useless
data.

This patch replaces the struct task_struct pointer with a pointer to a struct
task_ref which has a struct task_struct pointer.  The so the pinning of dead
tasks does not happen.

The code now has to contend with the fact that the task may now exit at any
time.  Which is a little but not muh more complicated.

With this change it takes about 1000 processes each opening up 1000 file
descriptors before I can trigger the OOM killer.  Much better.

[mlp@google.com: task_mmu small fixes]
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Paul Jackson <pj@sgi.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Albert Cahalan <acahalan@gmail.com>
Signed-off-by: Prasanna Meda <mlp@google.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:25 -07:00
Christoph Lameter 7b2259b3e5 [PATCH] page migration: Support a vma migration function
Hooks for calling vma specific migration functions

With this patch a vma may define a vma->vm_ops->migrate function.  That
function may perform page migration on its own (some vmas may not contain page
structs and therefore cannot be handled by regular page migration.  Pages in a
vma may require special preparatory treatment before migration is possible
etc) .  Only mmap_sem is held when the migration function is called.  The
migrate() function gets passed two sets of nodemasks describing the source and
the target of the migration.  The flags parameter either contains

MPOL_MF_MOVE	which means that only pages used exclusively by
		the specified mm should be moved

or

MPOL_MF_MOVE_ALL which means that pages shared with other processes
		should also be moved.

The migration function returns 0 on success or an error condition.  An error
condition will prevent regular page migration from occurring.

On its own this patch cannot be included since there are no users for this
functionality.  But it seems that the uncached allocator will need this
functionality at some point.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 10:00:55 -07:00
David Quigley 86c3a7645c [PATCH] SELinux: add security_task_movememory calls to mm code
This patch inserts security_task_movememory hook calls into memory management
code to enable security modules to mediate this operation between tasks.

Since the last posting, the hook has been renamed following feedback from
Christoph Lameter.

Signed-off-by: David Quigley <dpquigl@tycho.nsa.gov>
Acked-by:  Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: James Morris <jmorris@namei.org>
Cc: Andi Kleen <ak@muc.de>
Acked-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:42:54 -07:00
Christoph Lameter 742755a1d8 [PATCH] page migration: sys_move_pages(): support moving of individual pages
move_pages() is used to move individual pages of a process. The function can
be used to determine the location of pages and to move them onto the desired
node. move_pages() returns status information for each page.

long move_pages(pid, number_of_pages_to_move,
		addresses_of_pages[],
		nodes[] or NULL,
		status[],
		flags);

The addresses of pages is an array of void * pointing to the
pages to be moved.

The nodes array contains the node numbers that the pages should be moved
to. If a NULL is passed instead of an array then no pages are moved but
the status array is updated. The status request may be used to determine
the page state before issuing another move_pages() to move pages.

The status array will contain the state of all individual page migration
attempts when the function terminates. The status array is only valid if
move_pages() completed successfullly.

Possible page states in status[]:

0..MAX_NUMNODES	The page is now on the indicated node.

-ENOENT		Page is not present

-EACCES		Page is mapped by multiple processes and can only
		be moved if MPOL_MF_MOVE_ALL is specified.

-EPERM		The page has been mlocked by a process/driver and
		cannot be moved.

-EBUSY		Page is busy and cannot be moved. Try again later.

-EFAULT		Invalid address (no VMA or zero page).

-ENOMEM		Unable to allocate memory on target node.

-EIO		Unable to write back page. The page must be written
		back in order to move it since the page is dirty and the
		filesystem does not provide a migration function that
		would allow the moving of dirty pages.

-EINVAL		A dirty page cannot be moved. The filesystem does not provide
		a migration function and has no ability to write back pages.

The flags parameter indicates what types of pages to move:

MPOL_MF_MOVE	Move pages that are only mapped by the process.

MPOL_MF_MOVE_ALL Also move pages that are mapped by multiple processes.
		Requires sufficient capabilities.

Possible return codes from move_pages()

-ENOENT		No pages found that would require moving. All pages
		are either already on the target node, not present, had an
		invalid address or could not be moved because they were
		mapped by multiple processes.

-EINVAL		Flags other than MPOL_MF_MOVE(_ALL) specified or an attempt
		to migrate pages in a kernel thread.

-EPERM		MPOL_MF_MOVE_ALL specified without sufficient priviledges.
		or an attempt to move a process belonging to another user.

-EACCES		One of the target nodes is not allowed by the current cpuset.

-ENODEV		One of the target nodes is not online.

-ESRCH		Process does not exist.

-E2BIG		Too many pages to move.

-ENOMEM		Not enough memory to allocate control array.

-EFAULT		Parameters could not be accessed.

A test program for move_pages() may be found with the patches
on ftp.kernel.org:/pub/linux/kernel/people/christoph/pmig/patches-2.6.17-rc4-mm3

From: Christoph Lameter <clameter@sgi.com>

  Detailed results for sys_move_pages()

  Pass a pointer to an integer to get_new_page() that may be used to
  indicate where the completion status of a migration operation should be
  placed.  This allows sys_move_pags() to report back exactly what happened to
  each page.

  Wish there would be a better way to do this. Looks a bit hacky.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Jes Sorensen <jes@trained-monkey.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Michael Kerrisk <mtk-manpages@gmx.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:42:53 -07:00
Christoph Lameter 95a402c384 [PATCH] page migration: use allocator function for migrate_pages()
Instead of passing a list of new pages, pass a function to allocate a new
page.  This allows the correct placement of MPOL_INTERLEAVE pages during page
migration.  It also further simplifies the callers of migrate pages.
migrate_pages() becomes similar to migrate_pages_to() so drop
migrate_pages_to().  The batching of new page allocations becomes unnecessary.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Jes Sorensen <jes@trained-monkey.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:42:53 -07:00
Christoph Lameter aaa994b300 [PATCH] page migration: handle freeing of pages in migrate_pages()
Do not leave pages on the lists passed to migrate_pages().  Seems that we will
not need any postprocessing of pages.  This will simplify the handling of
pages by the callers of migrate_pages().

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Jes Sorensen <jes@trained-monkey.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:42:53 -07:00
Christoph Lameter 6d472be378 [PATCH] Remove cond_resched in gather_stats()
gather_stats() is called with a spinlock held from check_pte_range.  We
cannot reschedule with a lock held.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-20 07:54:03 -07:00
Alexey Dobriyan 7f927fcc2f [PATCH] Typo fixes
Fix a lot of typos.  Eyeballed by jmc@ in OpenBSD.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28 09:16:08 -08:00
Paul Jackson c61afb181c [PATCH] cpuset memory spread slab cache optimizations
The hooks in the slab cache allocator code path for support of NUMA
mempolicies and cpuset memory spreading are in an important code path.  Many
systems will use neither feature.

This patch optimizes those hooks down to a single check of some bits in the
current tasks task_struct flags.  For non NUMA systems, this hook and related
code is already ifdef'd out.

The optimization is done by using another task flag, set if the task is using
a non-default NUMA mempolicy.  Taking this flag bit along with the
PF_SPREAD_PAGE and PF_SPREAD_SLAB flag bits added earlier in this 'cpuset
memory spreading' patch set, one can check for the combination of any of these
special case memory placement mechanisms with a single test of the current
tasks task_struct flags.

This patch also tightens up the code, to save a few bytes of kernel text
space, and moves some of it out of line.  Due to the nested inlines called
from multiple places, we were ending up with three copies of this code, which
once we get off the main code path (for local node allocation) seems a bit
wasteful of instruction memory.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24 07:33:23 -08:00
Christoph Lameter b20a35035f [PATCH] page migration reorg
Centralize the page migration functions in anticipation of additional
tinkering.  Creates a new file mm/migrate.c

1. Extract buffer_migrate_page() from fs/buffer.c

2. Extract central migration code from vmscan.c

3. Extract some components from mempolicy.c

4. Export pageout() and remove_from_swap() from vmscan.c

5. Make it possible to configure NUMA systems without page migration
   and non-NUMA systems with page migration.

I had to so some #ifdeffing in mempolicy.c that may need a cleanup.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22 07:54:06 -08:00
Pekka Enberg fcc234f888 [PATCH] mm: kill kmem_cache_t usage
We have struct kmem_cache now so use it instead of the old typedef.

Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22 07:53:58 -08:00
Christoph Lameter 90036ee593 [PATCH] page migration: Fail with error if swap not setup
Currently the migration of anonymous pages will silently fail if no swap is
setup.  This patch makes page migration functions check for available swap
and fail with -ENODEV if no swap space is available.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-17 07:51:25 -08:00
Christoph Lameter 74c0024105 [PATCH] Consistent capabilites associated with MPOL_MOVE_ALL
It seems that setting scheduling policy and priorities is also the kind of
thing that might be performed in apps that also use the NUMA API, so it
would seem consistent to use CAP_SYS_NICE for NUMA also.

So use CAP_SYS_NICE for controlling migration permissions.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Michael Kerrisk <mtk-manpages@gmx.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-14 21:43:02 -08:00