linux/block
Thadeu Lima de Souza Cascardo 032651863c blk-throttle: check stats_cpu before reading it from sysfs
commit 045c47ca30 upstream.

When reading blkio.throttle.io_serviced in a recently created blkio
cgroup, it's possible to race against the creation of a throttle policy,
which delays the allocation of stats_cpu.

Like other functions in the throttle code, just checking for a NULL
stats_cpu prevents the following oops caused by that race.

[ 1117.285199] Unable to handle kernel paging request for data at address 0x7fb4d0020
[ 1117.285252] Faulting instruction address: 0xc0000000003efa2c
[ 1137.733921] Oops: Kernel access of bad area, sig: 11 [#1]
[ 1137.733945] SMP NR_CPUS=2048 NUMA PowerNV
[ 1137.734025] Modules linked in: bridge stp llc kvm_hv kvm binfmt_misc autofs4
[ 1137.734102] CPU: 3 PID: 5302 Comm: blkcgroup Not tainted 3.19.0 #5
[ 1137.734132] task: c000000f1d188b00 ti: c000000f1d210000 task.ti: c000000f1d210000
[ 1137.734167] NIP: c0000000003efa2c LR: c0000000003ef9f0 CTR: c0000000003ef980
[ 1137.734202] REGS: c000000f1d213500 TRAP: 0300   Not tainted  (3.19.0)
[ 1137.734230] MSR: 9000000000009032 <SF,HV,EE,ME,IR,DR,RI>  CR: 42008884  XER: 20000000
[ 1137.734325] CFAR: 0000000000008458 DAR: 00000007fb4d0020 DSISR: 40000000 SOFTE: 0
GPR00: c0000000003ed3a0 c000000f1d213780 c000000000c59538 0000000000000000
GPR04: 0000000000000800 0000000000000000 0000000000000000 0000000000000000
GPR08: ffffffffffffffff 00000007fb4d0020 00000007fb4d0000 c000000000780808
GPR12: 0000000022000888 c00000000fdc0d80 0000000000000000 0000000000000000
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20: 000001003e120200 c000000f1d5b0cc0 0000000000000200 0000000000000000
GPR24: 0000000000000001 c000000000c269e0 0000000000000020 c000000f1d5b0c80
GPR28: c000000000ca3a08 c000000000ca3dec c000000f1c667e00 c000000f1d213850
[ 1137.734886] NIP [c0000000003efa2c] .tg_prfill_cpu_rwstat+0xac/0x180
[ 1137.734915] LR [c0000000003ef9f0] .tg_prfill_cpu_rwstat+0x70/0x180
[ 1137.734943] Call Trace:
[ 1137.734952] [c000000f1d213780] [d000000005560520] 0xd000000005560520 (unreliable)
[ 1137.734996] [c000000f1d2138a0] [c0000000003ed3a0] .blkcg_print_blkgs+0xe0/0x1a0
[ 1137.735039] [c000000f1d213960] [c0000000003efb50] .tg_print_cpu_rwstat+0x50/0x70
[ 1137.735082] [c000000f1d2139e0] [c000000000104b48] .cgroup_seqfile_show+0x58/0x150
[ 1137.735125] [c000000f1d213a70] [c0000000002749dc] .kernfs_seq_show+0x3c/0x50
[ 1137.735161] [c000000f1d213ae0] [c000000000218630] .seq_read+0xe0/0x510
[ 1137.735197] [c000000f1d213bd0] [c000000000275b04] .kernfs_fop_read+0x164/0x200
[ 1137.735240] [c000000f1d213c80] [c0000000001eb8e0] .__vfs_read+0x30/0x80
[ 1137.735276] [c000000f1d213cf0] [c0000000001eb9c4] .vfs_read+0x94/0x1b0
[ 1137.735312] [c000000f1d213d90] [c0000000001ebb38] .SyS_read+0x58/0x100
[ 1137.735349] [c000000f1d213e30] [c000000000009218] syscall_exit+0x0/0x98
[ 1137.735383] Instruction dump:
[ 1137.735405] 7c6307b4 7f891800 409d00b8 60000000 60420000 3d420004 392a63b0 786a1f24
[ 1137.735471] 7d49502a e93e01c8 7d495214 7d2ad214 <7cead02a> e9090008 e9490010 e9290018

The following program reproduces the oops easily, although the problem
was first found while running docker.

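/* Headers needed to build the reproducer. */
#define _GNU_SOURCE		/* for O_DIRECT */
#include <fcntl.h>
#include <malloc.h>		/* memalign() */
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * The original message does not define these macros; the values below
 * are assumptions and must be adjusted to the test system.
 */
#define CGPATH		"/sys/fs/cgroup/blkio"
#define BUFFER_ALIGN	512
#define BUFFER_SIZE	4096
#define NR_TESTS	1000

void run(pid_t pid)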
{
	int n;
	int status;
	int fd;
	char *buffer;
	buffer = memalign(BUFFER_ALIGN, BUFFER_SIZE);
	n = snprintf(buffer, BUFFER_SIZE, "%d\n", pid);
	fd = open(CGPATH "/test/tasks", O_WRONLY);
	write(fd, buffer, n);
	close(fd);
	if (fork() > 0) {
		fd = open("/dev/sda", O_RDONLY | O_DIRECT);
		read(fd, buffer, 512);
		close(fd);
		wait(&status);
	} else {
		fd = open(CGPATH "/test/blkio.throttle.io_serviced", O_RDONLY);
		n = read(fd, buffer, BUFFER_SIZE);
		close(fd);
	}
	free(buffer);
	exit(0);
}

void test(void)
{
	int status;
	mkdir(CGPATH "/test", 0666);
	if (fork() > 0)
		wait(&status);
	else
		run(getpid());
	rmdir(CGPATH "/test");
}

int main(int argc, char **argv)
{
	int i;
	for (i = 0; i < NR_TESTS; i++)
		test();
	return 0;
}
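
For illustration only, the NULL check described in this message can be
sketched in userspace. This is not the kernel patch itself: struct group,
struct rw_stats, group_alloc_stats() and group_read_serviced() are
hypothetical names that only mimic a group whose stats_cpu is allocated
some time after the group becomes visible to readers.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the per-cpu read/write counters. */
struct rw_stats {
	uint64_t reads;
	uint64_t writes;
};

/*
 * Hypothetical stand-in for a throttle group: stats_cpu stays NULL until
 * a deferred allocation step fills it in.
 */
struct group {
	struct rw_stats *stats_cpu;
};

/* Deferred allocation; returns 0 on success, -1 on failure. */
int group_alloc_stats(struct group *g)
{
	g->stats_cpu = calloc(1, sizeof(*g->stats_cpu));
	return g->stats_cpu ? 0 : -1;
}

/*
 * Reader path: report the total number of serviced I/Os. If the counters
 * have not been allocated yet there is nothing to report, so return 0
 * instead of dereferencing the NULL pointer.
 */
uint64_t group_read_serviced(const struct group *g)
{
	if (g->stats_cpu == NULL)
		return 0;

	return g->stats_cpu->reads + g->stats_cpu->writes;
}
```

A reader that runs before group_alloc_stats() gets 0 rather than a fault.
The sketch shows only the guard; it does not model the concurrency between
the reader and the allocation that makes the race possible in the kernel.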

Reported-by: Ricardo Marin Matinata <rmm@br.ibm.com>
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-03-06 14:43:32 -08:00
partitions partitions: aix.c: off by one bug 2014-10-05 14:52:24 -07:00
Kconfig block: change config option name for cmdline partition parsing 2013-09-30 14:31:02 -07:00
Kconfig.iosched blkcg: make CONFIG_BLK_CGROUP bool 2012-03-06 21:27:21 +01:00
Makefile blk-mq: new multi-queue block IO queueing mechanism 2013-10-25 11:56:00 +01:00
blk-cgroup.c blkcg: don't call into policy draining if root_blkg is already gone 2014-07-31 12:52:55 -07:00
blk-cgroup.h blkcg: fix use-after-free in __blkg_release_rcu() by making blkcg_gq refcnt an atomic_t 2014-07-09 11:18:27 -07:00
blk-core.c blktrace: fix accounting of partially completed requests 2014-05-31 13:20:28 -07:00
blk-exec.c blk-mq: merge blk_mq_insert_request and blk_mq_run_request 2014-02-21 08:58:48 -08:00
blk-flush.c block: change flush sequence list addition back to front add 2014-03-08 20:31:31 -07:00
blk-integrity.c bio-integrity: Convert to bvec_iter 2013-11-23 22:33:50 -08:00
blk-ioc.c block: cleanup removing dependency on bootmem headers 2013-11-08 19:43:48 -07:00
blk-iopoll.c block: Replace __get_cpu_var uses 2013-11-08 08:59:58 -07:00
blk-lib.c block: add cond_resched() to potentially long running ioctl discard loop 2014-02-12 09:36:37 -07:00
blk-map.c block: Abstract out bvec iterator 2013-11-23 22:33:47 -08:00
blk-merge.c block: Explicitly handle discard/write same segments 2014-02-07 13:54:08 -07:00
blk-mq-cpu.c rt,blk,mq: Make blk_mq_cpu_notify_lock a raw spinlock 2014-03-03 09:34:10 -07:00
blk-mq-cpumap.c blk-mq: use 'nr_cpu_ids' as highest CPU ID count for hwq <-> cpu map 2015-01-16 06:59:31 -08:00
blk-mq-sysfs.c block: fix memory leaks on unplugging block device 2013-12-06 09:18:02 -07:00
blk-mq-tag.c Merge branch 'for-linus' of git://git.kernel.dk/linux-block 2014-02-14 10:45:18 -08:00
blk-mq-tag.h blk-mq: new multi-queue block IO queueing mechanism 2013-10-25 11:56:00 +01:00
blk-mq.c blk-mq: add REQ_SYNC early 2014-03-07 08:15:28 -07:00
blk-mq.h blk-mq: merge blk_mq_insert_request and blk_mq_run_request 2014-02-21 08:58:48 -08:00
blk-settings.c block: fix alignment_offset math that assumes io_min is a power-of-2 2014-11-14 08:59:51 -08:00
blk-softirq.c kernel: remove CONFIG_USE_GENERIC_SMP_HELPERS 2013-11-15 09:32:22 +09:00
blk-sysfs.c blk-mq: rework flush sequencing logic 2014-02-10 09:29:00 -07:00
blk-tag.c block: don't assume last put of shared tags is for the host 2014-07-31 12:52:54 -07:00
blk-throttle.c blk-throttle: check stats_cpu before reading it from sysfs 2015-03-06 14:43:32 -08:00
blk-timeout.c blk-mq: rework I/O completions 2014-02-10 09:27:31 -07:00
blk.h block: __elv_next_request() shouldn't call into the elevator if bypassing 2014-01-30 12:57:25 -07:00
bsg-lib.c bsg: Remove unused function bsg_goose_queue() 2012-12-06 14:33:02 +01:00
bsg.c hlist: drop the node parameter from iterators 2013-02-27 19:10:24 -08:00
cfq-iosched.c cfq-iosched: fix incorrect filing of rt async cfqq 2015-03-06 14:43:29 -08:00
cmdline-parser.c block: remove unrelated header files and export symbol 2014-01-21 20:18:26 -08:00
compat_ioctl.c block: provide compat ioctl for BLKZEROOUT 2014-07-31 12:52:54 -07:00
deadline-iosched.c block: Convert kmalloc_node(...GFP_ZERO...) to kzalloc_node(...) 2013-09-11 13:22:03 -06:00
elevator.c block: Abstract out bvec iterator 2013-11-23 22:33:47 -08:00
genhd.c genhd: check for int overflow in disk_expand_part_tbl() 2015-01-16 06:59:32 -08:00
ioctl.c block: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO 2013-11-08 09:05:31 -07:00
noop-iosched.c elevator: Fix a race in elevator switching 2013-07-03 13:25:24 +02:00
partition-generic.c block: Fix dev_t minor allocation lifetime 2014-10-05 14:52:19 -07:00
scsi_ioctl.c scsi: Fix error handling in SCSI_IOCTL_SEND_COMMAND 2014-11-14 09:00:08 -08:00