linux/kernel
Paul Jackson 68860ec10b [PATCH] cpusets: automatic numa mempolicy rebinding
This patch automatically updates a tasks NUMA mempolicy when its cpuset
memory placement changes.  It does so within the context of the task,
without any need to support low level external mempolicy manipulation.

If a system is not using cpusets, or if running on a system with just the
root (all-encompassing) cpuset, then this remap is a no-op.  Only when a
task is moved between cpusets, or a cpusets memory placement is changed
does the following apply.  Otherwise, the main routine below,
rebind_policy() is not even called.

When mixing cpusets, scheduler affinity, and NUMA mempolicies, the
essential role of cpusets is to place jobs (several related tasks) on a set
of CPUs and Memory Nodes, the essential role of sched_setaffinity is to
manage a jobs processor placement within its allowed cpuset, and the
essential role of NUMA mempolicy (mbind, set_mempolicy) is to manage a jobs
memory placement within its allowed cpuset.

However, CPU affinity and NUMA memory placement are managed within the
kernel using absolute system wide numbering, not cpuset relative numbering.

This is ok until a job is migrated to a different cpuset, or what's the
same, a jobs cpuset is moved to different CPUs and Memory Nodes.

Then the CPU affinity and NUMA memory placement of the tasks in the job
need to be updated, to preserve their cpuset-relative position.  This can
be done for CPU affinity using sched_setaffinity() from user code, as one
task can modify anothers CPU affinity.  This cannot be done from an
external task for NUMA memory placement, as that can only be modified in
the context of the task using it.

However, it easy enough to remap a tasks NUMA mempolicy automatically when
a task is migrated, using the existing cpuset mechanism to trigger a
refresh of a tasks memory placement after its cpuset has changed.  All that
is needed is the old and new nodemask, and notice to the task that it needs
to rebind its mempolicy.  The tasks mems_allowed has the old mask, the
tasks cpuset has the new mask, and the existing
cpuset_update_current_mems_allowed() mechanism provides the notice.  The
bitmap/cpumask/nodemask remap operators provide the cpuset relative
calculations.

This patch leaves open a couple of issues:

 1) Updating vma and shmfs/tmpfs/hugetlbfs memory policies:

    These mempolicies may reference nodes outside of those allowed to
    the current task by its cpuset.  Tasks are migrated as part of jobs,
    which reside on what might be several cpusets in a subtree.  When such
    a job is migrated, all NUMA memory policy references to nodes within
    that cpuset subtree should be translated, and references to any nodes
    outside that subtree should be left untouched.  A future patch will
    provide the cpuset mechanism needed to mark such subtrees.  With that
    patch, we will be able to correctly migrate these other memory policies
    across a job migration.

 2) Updating cpuset, affinity and memory policies in user space:

    This is harder.  Any placement state stored in user space using
    system-wide numbering will be invalidated across a migration.  More
    work will be required to provide user code with a migration-safe means
    to manage its cpuset relative placement, while preserving the current
    API's that pass system wide numbers, not cpuset relative numbers across
    the kernel-user boundary.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 17:37:22 -08:00
..
irq [PATCH] CHECK_IRQ_PER_CPU() to avoid dead code in __do_IRQ() 2005-09-07 16:57:29 -07:00
power [PATCH] introduce .valid callback for pm_ops 2005-10-30 17:37:15 -08:00
Kconfig.hz [PATCH] i386: Selectable Frequency of the Timer Interrupt 2005-06-23 09:45:10 -07:00
Kconfig.preempt [PATCH] sched: voluntary kernel preemption 2005-06-25 16:24:45 -07:00
Makefile [PATCH] Remove redundant configs.o 2005-10-30 17:37:10 -08:00
acct.c [PATCH] mm: rss = file_rss + anon_rss 2005-10-29 21:40:38 -07:00
audit.c [PATCH] gfp_t: kernel/* 2005-10-28 08:16:49 -07:00
auditsc.c [PATCH] gfp_t: kernel/* 2005-10-28 08:16:49 -07:00
capability.c [PATCH] kernel/capability.c: add kerneldoc 2005-07-27 16:26:06 -07:00
compat.c [PATCH] kernel: fix-up schedule_timeout() usage 2005-09-10 10:06:37 -07:00
configs.c Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
cpu.c [PATCH] create and destroy cpufreq sysfs entries based on cpu notifiers 2005-10-30 17:37:14 -08:00
cpuset.c [PATCH] cpusets: automatic numa mempolicy rebinding 2005-10-30 17:37:22 -08:00
crash_dump.c [PATCH] kernel/crash_dump.c: add kerneldoc 2005-07-27 16:26:06 -07:00
dma.c Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
exec_domain.c Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
exit.c [PATCH] mm: update_hiwaters just in time 2005-10-29 21:40:39 -07:00
extable.c Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
fork.c [PATCH] mm: ptd_alloc take ptlock 2005-10-29 21:40:40 -07:00
futex.c [PATCH] mm: follow_page with inner ptlock 2005-10-29 21:40:41 -07:00
intermodule.c [PATCH] introduce and use kzalloc 2005-09-07 16:57:45 -07:00
itimer.c [PATCH] itimer fixes 2005-07-27 16:25:51 -07:00
kallsyms.c [PATCH] ppc32: platform-specific functions missing from kallsyms. 2005-05-05 16:36:31 -07:00
kexec.c [PATCH] mm: split page table lock 2005-10-29 21:40:42 -07:00
kfifo.c [PATCH] gfp flags annotations - part 1 2005-10-08 15:00:57 -07:00
kmod.c [PATCH] Keys: Pass session keyring to call_usermodehelper() 2005-06-24 00:05:18 -07:00
kprobes.c [PATCH] kprobes: fix bug when probed on task and isr functions 2005-09-07 16:58:01 -07:00
ksysfs.c [PATCH] Kdump: Export crash notes section address through sysfs 2005-06-25 16:24:51 -07:00
kthread.c [PATCH] Add kthread_stop_sem() 2005-10-30 17:37:17 -08:00
module.c [PATCH] use add_taint() for setting tainted bit flags 2005-09-13 08:22:29 -07:00
panic.c [PATCH] Call emergency_reboot from panic 2005-07-26 14:35:43 -07:00
params.c [PATCH] Ignore trailing whitespace on kernel parameters correctly 2005-09-28 07:46:41 -07:00
pid.c Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
posix-cpu-timers.c [PATCH] Yet more posix-cpu-timer fixes 2005-10-27 09:08:43 -07:00
posix-timers.c [PATCH] posix-timers: use schedule_timeout() in common_nsleep() 2005-10-30 17:37:20 -08:00
printk.c [PATCH] Add printk_clock() 2005-09-21 10:11:54 -07:00
profile.c [PATCH] mostly_read data section 2005-07-07 18:23:46 -07:00
ptrace.c [PATCH] remove duplicated code from proc and ptrace 2005-09-07 16:57:43 -07:00
rcupdate.c [PATCH] rcu: keep rcu callback event counter 2005-10-17 15:27:58 -07:00
resource.c [PATCH] introduce and use kzalloc 2005-09-07 16:57:45 -07:00
sched.c [PATCH] mm: update_hiwaters just in time 2005-10-29 21:40:39 -07:00
seccomp.c Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
signal.c [PATCH] Unify sys_tkill() and sys_tgkill() 2005-10-30 17:37:19 -08:00
softirq.c [PATCH] x86-64: Some cleanup and optimization to the processor data area. 2005-09-12 10:49:58 -07:00
softlockup.c [PATCH] detect soft lockups 2005-09-07 16:57:17 -07:00
spinlock.c [PATCH] spinlock consolidation 2005-09-10 10:06:21 -07:00
stop_machine.c [PATCH] smp_processor_id() cleanup 2005-06-21 18:46:13 -07:00
sys.c [PATCH] reboot: comment and factor the main reboot functions 2005-09-22 22:17:33 -07:00
sys_ni.c [PATCH] remove sys_set_zone_reclaim() 2005-08-01 10:03:56 -07:00
sysctl.c [NET]: Fix sparse warnings 2005-08-29 16:01:32 -07:00
time.c [PATCH] NTP shift_right cleanup 2005-10-30 17:37:18 -08:00
timer.c [PATCH] remove timer debug field 2005-10-30 17:37:18 -08:00
uid16.c Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
user.c [PATCH] inotify 2005-07-12 20:38:38 -07:00
wait.c Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
workqueue.c [PATCH] Use alloc_percpu to allocate workqueues locally 2005-10-30 17:37:18 -08:00