linux/mm
KAMEZAWA Hiroyuki 4bfc44958e mm: make set_mempolicy(MPOL_INTERLEAV) N_HIGH_MEMORY aware
At first, init_task's mems_allowed is initialized as this.
 init_task->mems_allowed == node_state[N_POSSIBLE]

And cpuset's top_cpuset mask is initialized as this
 top_cpuset->mems_allowed = node_state[N_HIGH_MEMORY]

Before 2.6.29:
policy's mems_allowed is initialized as this.

  1. update tasks->mems_allowed by its cpuset->mems_allowed.
  2. policy->mems_allowed = nodes_and(tasks->mems_allowed, user's mask)

Updating task's mems_allowed in reference to top_cpuset's one.
cpuset's mems_allowed is aware of N_HIGH_MEMORY, always.

In 2.6.30: After commit 58568d2a82
("cpuset,mm: update tasks' mems_allowed in time"), policy's mems_allowed
is initialized as this.

  1. policy->mems_allowd = nodes_and(task->mems_allowed, user's mask)

Here, if task is in top_cpuset, task->mems_allowed is not updated from
init's one.  Assume user excutes command as #numactrl --interleave=all
,....

  policy->mems_allowd = nodes_and(N_POSSIBLE, ALL_SET_MASK)

Then, policy's mems_allowd can includes a possible node, which has no pgdat.

MPOL's INTERLEAVE just scans nodemask of task->mems_allowd and access this
directly.

  NODE_DATA(nid)->zonelist even if NODE_DATA(nid)==NULL

Then, what's we need is making policy->mems_allowed be aware of
N_HIGH_MEMORY.  This patch does that.  But to do so, extra nodemask will
be on statck.  Because I know cpumask has a new interface of
CPUMASK_ALLOC(), I added it to node.

This patch stands on old behavior.  But I feel this fix itself is just a
Band-Aid.  But to do fundametal fix, we have to take care of memory
hotplug and it takes time.  (task->mems_allowd should be N_HIGH_MEMORY, I
think.)

mpol_set_nodemask() should be aware of N_HIGH_MEMORY and policy's nodemask
should be includes only online nodes.

In old behavior, this is guaranteed by frequent reference to cpuset's
code.  Now, most of them are removed and mempolicy has to check it by
itself.

To do check, a few nodemask_t will be used for calculating nodemask.  But,
size of nodemask_t can be big and it's not good to allocate them on stack.

Now, cpumask_t has CPUMASK_ALLOC/FREE an easy code for get scratch area.
NODEMASK_ALLOC/FREE shoudl be there.

[akpm@linux-foundation.org: cleanups & tweaks]
Tested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Miao Xie <miaox@cn.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Paul Menage <menage@google.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: David Rientjes <rientjes@google.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-08-07 10:39:55 -07:00
..
allocpercpu.c
backing-dev.c Fix congestion_wait() sync/async vs read/write confusion 2009-07-10 20:31:53 +02:00
bootmem.c kmemleak: Add callbacks to the bootmem allocator 2009-07-08 14:25:14 +01:00
bounce.c
debug-pagealloc.c
dmapool.c dmapools: protect page_list walk in show_pools() 2009-06-30 18:56:00 -07:00
fadvise.c
failslab.c
filemap_xip.c
filemap.c mm: mark page accessed before we write_end() 2009-07-06 13:57:03 -07:00
fremap.c
highmem.c
hugetlb.c hugetlbfs: fix i_blocks accounting 2009-07-29 19:10:35 -07:00
init-mm.c
internal.h vmscan: do not unconditionally treat zones that fail zone_reclaim() as full 2009-06-16 19:47:45 -07:00
Kconfig Merge branch 'akpm' 2009-06-16 19:50:13 -07:00
Kconfig.debug
kmemcheck.c
kmemleak-test.c
kmemleak.c kmemleak: Protect the seq start/next/stop sequence by rcu_read_lock() 2009-07-29 12:34:58 -07:00
maccess.c
madvise.c
Makefile Merge branch 'akpm' 2009-06-16 19:50:13 -07:00
memcontrol.c cgroup avoid permanent sleep at rmdir 2009-07-29 19:10:35 -07:00
memory_hotplug.c page-allocator: reset wmark_min and inactive ratio of zone when hotplug happens 2009-06-16 19:47:42 -07:00
memory.c mm: Pass virtual address to [__]p{te,ud,md}_free_tlb() 2009-07-27 12:10:38 -07:00
mempolicy.c mm: make set_mempolicy(MPOL_INTERLEAV) N_HIGH_MEMORY aware 2009-08-07 10:39:55 -07:00
mempool.c
migrate.c
mincore.c
mlock.c mm: remove CONFIG_UNEVICTABLE_LRU config option 2009-06-16 19:47:42 -07:00
mm_init.c
mmap.c
mmu_notifier.c
mmzone.c
mprotect.c
mremap.c
msync.c
nommu.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/lethal/sh-2.6 2009-07-01 11:46:30 -07:00
oom_kill.c oom: only oom kill exiting tasks with attached memory 2009-06-16 19:47:45 -07:00
page_alloc.c page-allocator: allow too high-order warning messages to be suppressed with __GFP_NOWARN 2009-07-29 19:10:35 -07:00
page_cgroup.c memcg: remove some redundant checks 2009-06-18 13:03:47 -07:00
page_io.c mm: remove file argument from swap_readpage() 2009-06-16 19:47:44 -07:00
page_isolation.c
page-writeback.c Fix congestion_wait() sync/async vs read/write confusion 2009-07-10 20:31:53 +02:00
pagewalk.c
pdflush.c
percpu.c x86: implement percpu_alloc kernel parameter 2009-06-22 11:56:24 +09:00
prio_tree.c
quicklist.c
readahead.c
rmap.c memcg: add file-based RSS accounting 2009-06-18 13:03:47 -07:00
shmem_acl.c switch shmem to inode->i_acl 2009-06-24 08:17:06 -04:00
shmem.c Get "no acls for this inode" right, fix shmem breakage 2009-06-24 16:58:48 -04:00
slab.c SLAB: Fix lockdep annotations 2009-06-29 09:57:10 +03:00
slob.c fix RCU-callback-after-kmem_cache_destroy problem in sl[aou]b 2009-06-26 12:10:47 +03:00
slub.c kmemleak: Trace the kmalloc_large* functions in slub 2009-07-08 14:25:14 +01:00
sparse-vmemmap.c
sparse.c
swap_state.c mm: remove file argument from swap_readpage() 2009-06-16 19:47:44 -07:00
swap.c
swapfile.c PM / Hibernate: Replace bdget call with simple atomic_inc of i_count 2009-07-29 21:07:55 +02:00
thrash.c mm: pass mm to grab_swap_token 2009-06-23 12:50:05 -07:00
truncate.c mm: remove __invalidate_mapping_pages variant 2009-06-16 19:47:43 -07:00
util.c Merge branches 'slab/documentation', 'slab/fixes', 'slob/cleanups' and 'slub/fixes' into for-linus 2009-06-17 08:30:15 +03:00
vmalloc.c
vmscan.c Fix congestion_wait() sync/async vs read/write confusion 2009-07-10 20:31:53 +02:00
vmstat.c vmscan: count the number of times zone_reclaim() scans and fails 2009-06-16 19:47:46 -07:00