linux/kernel
KAMEZAWA Hiroyuki f0c0b2b808 change zonelist order: zonelist order selection logic
Make zonelist creation policy selectable from sysctl/boot option v6.

This patch makes NUMA's zonelist (of pgdat) order selectable.
Available order are Default(automatic)/ Node-based / Zone-based.

[Default Order]
The kernel selects Node-based or Zone-based order automatically.

[Node-based Order]
This policy treats the locality of memory as the most important parameter.
Zonelist order is created by each zone's locality. This means lower zones
(ex. ZONE_DMA) can be used before higher zone (ex. ZONE_NORMAL) exhausion.
IOW. ZONE_DMA will be in the middle of zonelist.
current 2.6.21 kernel uses this.

Pros.
 * A user can expect local memory as much as possible.
Cons.
 * lower zone will be exhansted before higher zone. This may cause OOM_KILL.

Maybe suitable if ZONE_DMA is relatively big and you never see OOM_KILL
because of ZONE_DMA exhaution and you need the best locality.

(example)
assume 2 node NUMA. node(0) has ZONE_DMA/ZONE_NORMAL, node(1) has ZONE_NORMAL.

*node(0)'s memory allocation order:

 node(0)'s NORMAL -> node(0)'s DMA -> node(1)'s NORMAL.

*node(1)'s memory allocation order:

 node(1)'s NORMAL -> node(0)'s NORMAL -> node(0)'s DMA.

[Zone-based order]
This policy treats the zone type as the most important parameter.
Zonelist order is created by zone-type order. This means lower zone
never be used bofere higher zone exhaustion.
IOW. ZONE_DMA will be always at the tail of zonelist.

Pros.
 * OOM_KILL(bacause of lower zone) occurs only if the whole zones are exhausted.
Cons.
 * memory locality may not be best.

(example)
assume 2 node NUMA. node(0) has ZONE_DMA/ZONE_NORMAL, node(1) has ZONE_NORMAL.

*node(0)'s memory allocation order:

 node(0)'s NORMAL -> node(1)'s NORMAL -> node(0)'s DMA.

*node(1)'s memory allocation order:

 node(1)'s NORMAL -> node(0)'s NORMAL -> node(0)'s DMA.

bootoption "numa_zonelist_order=" and proc/sysctl is supporetd.

command:
%echo N > /proc/sys/vm/numa_zonelist_order

Will rebuild zonelist in Node-based order.

command:
%echo Z > /proc/sys/vm/numa_zonelist_order

Will rebuild zonelist in Zone-based order.

Thanks to Lee Schermerhorn, he gives me much help and codes.

[Lee.Schermerhorn@hp.com: add check_highest_zone to build_zonelists_in_zone_order]
[akpm@linux-foundation.org: build fix]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Andi Kleen <ak@suse.de>
Cc: "jesse.barnes@intel.com" <jesse.barnes@intel.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16 09:05:35 -07:00
..
irq
power PM: introduce set_target method in pm_ops 2007-07-01 12:29:44 -07:00
time NTP: remove clock_was_set() call to prevent deadlock 2007-07-03 13:54:27 -07:00
.gitignore
acct.c
audit.c
audit.h
auditfilter.c audit: fix oops removing watch if audit disabled 2007-06-24 08:59:12 -07:00
auditsc.c
capability.c
compat.c
configs.c
cpu.c
cpuset.c
delayacct.c sched: update delay-accounting to use CFS's precise stats 2007-07-09 18:52:00 +02:00
die_notifier.c
dma.c
exec_domain.c
exit.c sched: update delay-accounting to use CFS's precise stats 2007-07-09 18:52:00 +02:00
extable.c
fork.c sched: update delay-accounting to use CFS's precise stats 2007-07-09 18:52:00 +02:00
futex_compat.c
futex.c FUTEX: Restore the dropped ERSCH fix 2007-06-24 12:08:53 -07:00
hrtimer.c
itimer.c
kallsyms.c
Kconfig.hz
Kconfig.preempt
kexec.c
kfifo.c
kmod.c
kprobes.c
ksysfs.c
kthread.c
latency.c
lockdep_internals.h
lockdep_proc.c
lockdep.c
Makefile
module.c sysfs: kill unnecessary attribute->owner 2007-07-11 16:09:06 -07:00
mutex-debug.c
mutex-debug.h
mutex.c
mutex.h
nsproxy.c fix refcounting of nsproxy object when unshared 2007-06-24 08:59:10 -07:00
panic.c
params.c sysfs: kill unnecessary attribute->owner 2007-07-11 16:09:06 -07:00
pid.c
posix-cpu-timers.c sched: make posix-cpu-timers use CFS's accounting information 2007-07-09 18:51:58 +02:00
posix-timers.c posix-timers: Prevent softirq starvation by small intervals and SIG_IGN 2007-06-21 15:57:04 -07:00
printk.c serial: convert early_uart to earlycon for 8250 2007-07-16 09:05:35 -07:00
profile.c
ptrace.c
rcupdate.c
rcutorture.c
relay.c relay: fixup kerneldoc comment 2007-07-13 14:14:28 +02:00
resource.c
rtmutex_common.h
rtmutex-debug.c
rtmutex-debug.h
rtmutex-tester.c
rtmutex.c
rtmutex.h
rwsem.c
sched_debug.c [PATCH] sched: remove stale version info from kernel/sched_debug.c 2007-07-13 10:10:41 -07:00
sched_fair.c sched: cfs core, kernel/sched_fair.c 2007-07-09 18:51:58 +02:00
sched_idletask.c sched: cfs core, kernel/sched_idletask.c 2007-07-09 18:51:58 +02:00
sched_rt.c sched: cfs core, kernel/sched_rt.c 2007-07-09 18:51:58 +02:00
sched_stats.h sched: update delay-accounting to use CFS's precise stats 2007-07-09 18:52:00 +02:00
sched.c CFS: Fix missing digit off in wmult table 2007-07-13 16:45:43 -07:00
seccomp.c
signal.c
softirq.c sched: do not set softirqs to nice +19 2007-07-09 18:52:00 +02:00
softlockup.c
spinlock.c
srcu.c
stacktrace.c
stop_machine.c
sys_ni.c
sys.c
sysctl.c change zonelist order: zonelist order selection logic 2007-07-16 09:05:35 -07:00
taskstats.c
time.c
timer.c
tsacct.c
uid16.c
user.c
utsname_sysctl.c
utsname.c
wait.c
workqueue.c