linux/block
Jeff Moyer c10b61f091 cfq: Don't allow queue merges for queues that have no process references
Hi,

A user reported a kernel bug when running a particular program that did
the following:

created 32 threads
- each thread took a mutex, grabbed a global offset, added a buffer size
  to that offset, released the lock
- read from the given offset in the file
- created a new thread to do the same
- exited

The result is that cfq's close cooperator logic would trigger, as the
threads were issuing I/O within the mean seek distance of one another.
This workload managed to routinely trigger a use after free bug when
walking the list of merge candidates for a particular cfqq
(cfqq->new_cfqq).  The logic used for merging queues looks like this:

static void cfq_setup_merge(struct cfq_queue *cfqq, struct cfq_queue *new_cfqq)
{
	int process_refs, new_process_refs;
	struct cfq_queue *__cfqq;

	/* Avoid a circular list and skip interim queue merges */
	while ((__cfqq = new_cfqq->new_cfqq)) {
		if (__cfqq == cfqq)
			return;
		new_cfqq = __cfqq;
	}

	process_refs = cfqq_process_refs(cfqq);
	/*
	 * If the process for the cfqq has gone away, there is no
	 * sense in merging the queues.
	 */
	if (process_refs == 0)
		return;

	/*
	 * Merge in the direction of the lesser amount of work.
	 */
	new_process_refs = cfqq_process_refs(new_cfqq);
	if (new_process_refs >= process_refs) {
		cfqq->new_cfqq = new_cfqq;
		atomic_add(process_refs, &new_cfqq->ref);
	} else {
		new_cfqq->new_cfqq = cfqq;
		atomic_add(new_process_refs, &cfqq->ref);
	}
}

When a merge candidate is found, we add the process references for the
queue with less references to the queue with more.  The actual merging
of queues happens when a new request is issued for a given cfqq.  In the
case of the test program, it only does a single pread call to read in
1MB, so the actual merge never happens.

Normally, this is fine, as when the queue exits, we simply drop the
references we took on the other cfqqs in the merge chain:

	/*
	 * If this queue was scheduled to merge with another queue, be
	 * sure to drop the reference taken on that queue (and others in
	 * the merge chain).  See cfq_setup_merge and cfq_merge_cfqqs.
	 */
	__cfqq = cfqq->new_cfqq;
	while (__cfqq) {
		if (__cfqq == cfqq) {
			WARN(1, "cfqq->new_cfqq loop detected\n");
			break;
		}
		next = __cfqq->new_cfqq;
		cfq_put_queue(__cfqq);
		__cfqq = next;
	}

However, there is a hole in this logic.  Consider the following (and
keep in mind that each I/O keeps a reference to the cfqq):

q1->new_cfqq = q2   // q2 now has 2 process references
q3->new_cfqq = q2   // q2 now has 3 process references

// the process associated with q2 exits
// q2 now has 2 process references

// queue 1 exits, drops its reference on q2
// q2 now has 1 process reference

// q3 exits, so has 0 process references, and hence drops its references
// to q2, which leaves q2 also with 0 process references

q4 comes along and wants to merge with q3

q3->new_cfqq still points at q2!  We follow that link and end up at an
already freed cfqq.

So, the fix is to not follow a merge chain if the top-most queue does
not have a process reference, otherwise any queue in the chain could be
already freed.  I also changed the logic to disallow merging with a
queue that does not have any process references.  Previously, we did
this check for one of the merge candidates, but not the other.  That
doesn't really make sense.

Without the attached patch, my system would BUG within a couple of
seconds of running the reproducer program.  With the patch applied, my
system ran the program for over an hour without issues.

This addresses the following bugzilla:
    https://bugzilla.kernel.org/show_bug.cgi?id=16217

Thanks a ton to Phil Carns for providing the bug report and an excellent
reproducer.

[ Note for stable: this applies to 2.6.32/33/34 ].

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Reported-by: Phil Carns <carns@mcs.anl.gov>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-06-17 20:17:35 +02:00
..
blk-barrier.c blkdev: move blkdev_issue helper functions to separate file 2010-04-28 19:47:36 +02:00
blk-cgroup.c Merge branch 'master' into for-2.6.35 2010-05-21 21:27:26 +02:00
blk-cgroup.h blk-cgroup: config options re-arrangement 2010-04-26 19:27:56 +02:00
blk-core.c block: fix DISCARD_BARRIER requests 2010-06-17 10:10:53 +02:00
blk-exec.c block: don't set REQ_NOMERGE unnecessarily 2009-04-28 07:37:33 +02:00
blk-integrity.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
blk-ioc.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
blk-iopoll.c tree-wide: fix assorted typos all over the place 2009-12-04 15:39:55 +01:00
blk-lib.c block: fix bad use of min() on different types 2010-04-29 09:28:21 +02:00
blk-map.c block: Use accessor functions for queue limits 2009-05-22 23:22:54 +02:00
blk-merge.c block: Consolidate phys_segment and hw_segment limits 2010-02-26 13:58:08 +01:00
blk-settings.c Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block 2010-04-09 11:50:29 -07:00
blk-softirq.c
blk-sysfs.c Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block 2010-04-09 11:50:29 -07:00
blk-tag.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
blk-timeout.c block: ensure jiffies wrap is handled correctly in blk_rq_timed_out_timer 2010-04-21 17:42:08 +02:00
blk.h block: implement mixed merge of different failfast requests 2009-09-11 14:33:30 +02:00
bsg.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
cfq-iosched.c cfq: Don't allow queue merges for queues that have no process references 2010-06-17 20:17:35 +02:00
compat_ioctl.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
deadline-iosched.c block: convert to pos and nr_sectors accessors 2009-05-11 09:50:54 +02:00
elevator.c block: make blk_init_free_list and elevator_init idempotent 2010-06-04 13:47:06 +02:00
genhd.c block: remove all rcu head initializations 2010-05-21 20:01:02 +02:00
ioctl.c blkdev: generalize flags for blkdev_issue_fn functions 2010-04-28 19:47:36 +02:00
Kconfig blk-cgroup: config options re-arrangement 2010-04-26 19:27:56 +02:00
Kconfig.iosched blk-cgroup: config options re-arrangement 2010-04-26 19:27:56 +02:00
Makefile blkdev: move blkdev_issue helper functions to separate file 2010-04-28 19:47:36 +02:00
noop-iosched.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
scsi_ioctl.c block/scsi_ioctl.c: quiet sparse noise 2009-11-04 09:10:33 +01:00