linux/kernel
Paul Jackson 4247bdc600 [PATCH] cpuset semaphore depth check deadlock fix
The cpusets-formalize-intermediate-gfp_kernel-containment patch
has a deadlock problem.

This patch was part of a set of four patches to make more
extensive use of the cpuset 'mem_exclusive' attribute to
manage kernel GFP_KERNEL memory allocations and to constrain
the out-of-memory (oom) killer.

A task that is changing cpusets in particular ways on a system
when it is very short of free memory could double trip over
the global cpuset_sem semaphore (get the lock and then deadlock
trying to get it again).

The second attempt to get cpuset_sem would be in the routine
cpuset_zone_allowed().  This was discovered by code inspection.
I cannot reproduce the problem except with an artificially
hacked kernel and a specialized stress test.

In real life you cannot hit this unless you are manipulating
cpusets, and are very unlikely to hit it unless you are rapidly
modifying cpusets on a memory tight system.  Even then it would
be a rare occurrence.

If you did hit it, the task double tripping over cpuset_sem
would deadlock in the kernel, and any other task also trying
to manipulate cpusets would deadlock there too, on cpuset_sem.
Your batch manager would be wedged solid (if it was cpuset
savvy), but classic Unix shells and utilities would work well
enough to reboot the system.

The unusual condition that led to this bug is that unlike most
semaphores, cpuset_sem _can_ be acquired while in the page
allocation code, when __alloc_pages() calls cpuset_zone_allowed().
So it is easy to mistakenly perform the following sequence:
  1) task makes system call to alter a cpuset
  2) take cpuset_sem
  3) try to allocate memory
  4) memory allocator, via cpuset_zone_allowed(), tries to take cpuset_sem
  5) deadlock
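
The sequence above can be sketched with a toy, single-threaded
model of a non-recursive semaphore.  This is only an illustration,
not kernel code: struct toy_sem, toy_try_down(), toy_up(),
toy_zone_allowed() and toy_alter_cpuset_deadlocks() are all
hypothetical names, and "would block" is reported as a return
value instead of actually hanging.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a binary semaphore (hypothetical): 1 = free, 0 = held. */
struct toy_sem { int count; };

/* Returns false where a real down() would block the caller. */
static bool toy_try_down(struct toy_sem *s)
{
	if (s->count == 0)
		return false;
	s->count--;
	return true;
}

static void toy_up(struct toy_sem *s)
{
	assert(s->count == 0);	/* only a held semaphore may be released */
	s->count++;
}

/* Stand-in for the allocator path of step 4; false means it would block. */
static bool toy_zone_allowed(struct toy_sem *s)
{
	if (!toy_try_down(s))
		return false;
	toy_up(s);
	return true;
}

/* Walks steps 1-5; returns true if the task would deadlock on itself. */
bool toy_alter_cpuset_deadlocks(void)
{
	struct toy_sem sem = { 1 };
	bool deadlock;

	(void)toy_try_down(&sem);		/* step 2: take the semaphore */
	deadlock = !toy_zone_allowed(&sem);	/* steps 3-4: allocate memory */
	toy_up(&sem);				/* release so the model ends clean */
	return deadlock;
}
```

In this model toy_alter_cpuset_deadlocks() returns true, because
the second attempt finds the semaphore already held by the same
task.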

The reason that this is not a serious bug for most users
is that almost all calls to allocate memory don't require
taking cpuset_sem.  Only some code paths off the beaten
track require taking cpuset_sem -- which is good.  Taking
a global semaphore on the main code path for allocating
memory would not scale well.

This patch fixes this deadlock by wrapping the up() and down()
calls on cpuset_sem in kernel/cpuset.c with code that tracks
the nesting depth of the current task on that semaphore, and
only does the real down() if the task doesn't hold the lock
already, and only does the real up() if the nesting depth
(number of unmatched downs) is exactly one.
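
A minimal user-space sketch of that wrapping, under stated
assumptions: struct model_sem, struct model_task, real_down(),
real_up(), nested_down() and nested_up() are hypothetical names
chosen for illustration, and the per-task depth counter lives in a
toy struct rather than wherever the patch actually stores it.  The
asserts in real_down() and real_up() stand in for "this would
block" and "this would release a lock that is not held".

```c
#include <assert.h>

/* Toy binary semaphore (hypothetical): 1 = free, 0 = held. */
struct model_sem { int count; };

/* Toy per-task state: number of unmatched downs by this task. */
struct model_task { int depth; };

static void real_down(struct model_sem *s)
{
	assert(s->count == 1);	/* a second real down() would block forever */
	s->count = 0;
}

static void real_up(struct model_sem *s)
{
	assert(s->count == 0);	/* must be held to be released */
	s->count = 1;
}

/* Only the outermost down touches the real semaphore. */
void nested_down(struct model_task *t, struct model_sem *s)
{
	if (t->depth == 0)
		real_down(s);
	t->depth++;
}

/* Only the up that matches the outermost down releases it. */
void nested_up(struct model_task *t, struct model_sem *s)
{
	assert(t->depth > 0);
	if (t->depth == 1)
		real_up(s);
	t->depth--;
}
```

With these wrappers a nested nested_down() from the allocator path
only bumps the depth from 1 to 2, and the semaphore is released
only when the depth returns to 0.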

The previously required use of refresh_mems(), any time that
the cpuset_sem semaphore was acquired and the code executed
while holding that semaphore might try to allocate memory, is
no longer needed.  Two refresh_mems() calls were removed
thanks to this.  This is a good change, as failing to get
all the necessary refresh_mems() calls placed was a primary
source of bugs in this cpuset code.  The only remaining call
to refresh_mems() is made while doing a memory allocation,
if certain task memory placement data needs to be updated
from its cpuset, due to the cpuset having been changed behind
the task's back.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 10:06:21 -07:00
..
irq [PATCH] CHECK_IRQ_PER_CPU() to avoid dead code in __do_IRQ() 2005-09-07 16:57:29 -07:00
power Merge linux-2.6 with linux-acpi-2.6 2005-09-08 01:45:47 -04:00
Kconfig.hz [PATCH] i386: Selectable Frequency of the Timer Interrupt 2005-06-23 09:45:10 -07:00
Kconfig.preempt [PATCH] sched: voluntary kernel preemption 2005-06-25 16:24:45 -07:00
Makefile [PATCH] spinlock consolidation 2005-09-10 10:06:21 -07:00
acct.c [PATCH] largefile support for accounting 2005-09-07 16:57:31 -07:00
audit.c [NETLINK]: Add "groups" argument to netlink_kernel_create 2005-08-29 16:01:11 -07:00
auditsc.c AUDIT: Record working directory when syscall arguments are pathnames 2005-05-27 12:17:28 +01:00
capability.c [PATCH] kernel/capability.c: add kerneldoc 2005-07-27 16:26:06 -07:00
compat.c [PATCH] Fix get_compat_sigevent() 2005-04-16 15:24:01 -07:00
configs.c Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
cpu.c [PATCH] i386 CPU hotplug 2005-06-25 16:24:29 -07:00
cpuset.c [PATCH] cpuset semaphore depth check deadlock fix 2005-09-10 10:06:21 -07:00
crash_dump.c [PATCH] kernel/crash_dump.c: add kerneldoc 2005-07-27 16:26:06 -07:00
dma.c Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
exec_domain.c Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
exit.c [PATCH] files: files struct with RCU 2005-09-09 13:57:55 -07:00
extable.c Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
fork.c [PATCH] files: files struct with RCU 2005-09-09 13:57:55 -07:00
futex.c [PATCH] futex: remove duplicate code 2005-09-07 16:57:33 -07:00
intermodule.c [PATCH] introduce and use kzalloc 2005-09-07 16:57:45 -07:00
itimer.c [PATCH] itimer fixes 2005-07-27 16:25:51 -07:00
kallsyms.c [PATCH] ppc32: platform-specific functions missing from kallsyms. 2005-05-05 16:36:31 -07:00
kexec.c [PATCH] kexec: fix sparse warnings 2005-06-28 14:53:40 -07:00
kfifo.c Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
kmod.c [PATCH] Keys: Pass session keyring to call_usermodehelper() 2005-06-24 00:05:18 -07:00
kprobes.c [PATCH] kprobes: fix bug when probed on task and isr functions 2005-09-07 16:58:01 -07:00
ksysfs.c [PATCH] Kdump: Export crash notes section address through sysfs 2005-06-25 16:24:51 -07:00
kthread.c [PATCH] use smp_mb/wmb/rmb where possible 2005-05-01 08:58:47 -07:00
module.c [PATCH] flush icache early when loading module 2005-09-07 16:57:26 -07:00
panic.c [PATCH] Call emergency_reboot from panic 2005-07-26 14:35:43 -07:00
params.c [PATCH] introduce and use kzalloc 2005-09-07 16:57:45 -07:00
pid.c Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
posix-cpu-timers.c Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
posix-timers.c [PATCH] fix send_sigqueue() vs thread exit race 2005-09-07 16:57:33 -07:00
printk.c [PATCH] Provide better printk() support for SMP machines 2005-09-07 16:57:18 -07:00
profile.c [PATCH] mostly_read data section 2005-07-07 18:23:46 -07:00
ptrace.c [PATCH] remove duplicated code from proc and ptrace 2005-09-07 16:57:43 -07:00
rcupdate.c [PATCH] files: rcuref APIs 2005-09-09 13:57:54 -07:00
resource.c [PATCH] introduce and use kzalloc 2005-09-07 16:57:45 -07:00
sched.c [PATCH] spinlock consolidation 2005-09-10 10:06:21 -07:00
seccomp.c Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
signal.c [PATCH] fix send_sigqueue() vs thread exit race 2005-09-07 16:57:33 -07:00
softirq.c [PATCH] revert bogus softirq changes 2005-07-30 10:49:59 -07:00
softlockup.c [PATCH] detect soft lockups 2005-09-07 16:57:17 -07:00
spinlock.c [PATCH] spinlock consolidation 2005-09-10 10:06:21 -07:00
stop_machine.c [PATCH] smp_processor_id() cleanup 2005-06-21 18:46:13 -07:00
sys.c [PATCH] remove a redundant variable in sys_prctl() 2005-09-07 16:57:32 -07:00
sys_ni.c [PATCH] remove sys_set_zone_reclaim() 2005-08-01 10:03:56 -07:00
sysctl.c [NET]: Fix sparse warnings 2005-08-29 16:01:32 -07:00
time.c [PATCH] clean up inline static vs static inline 2005-07-27 16:26:20 -07:00
timer.c [PATCH] optimize writer path in time_interpolator_get_counter() 2005-09-07 16:57:24 -07:00
uid16.c Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
user.c [PATCH] inotify 2005-07-12 20:38:38 -07:00
wait.c Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
workqueue.c [PATCH] introduce and use kzalloc 2005-09-07 16:57:45 -07:00