Commit Graph

2087 Commits

Author SHA1 Message Date
Thomas Gleixner 6e3d617ff1 block: Shorten interrupt disabled regions
Moving the blk_sched_flush_plug() call out of the interrupt/preempt
disabled region in the scheduler allows us to replace
local_irq_save/restore(flags) by local_irq_disable/enable() in
blk_flush_plug().

Now instead of doing this we disable interrupts explicitely when we
lock the request_queue and reenable them when we drop the lock. That
allows interrupts to be handled when the plug list contains requests
for more than one queue.

Aside of that this change makes the scope of the irq disabled region
more obvious. The current code confused the hell out of me when
looking at:

 local_irq_save(flags);
   spin_lock(q->queue_lock);
   ...
   queue_unplugged(q...);
     scsi_request_fn();
       spin_unlock(q->queue_lock);
       spin_lock(shost->host_lock);
       spin_unlock_irq(shost->host_lock);

-------------------^^^ ????

       spin_lock_irq(q->queue_lock);
       spin_unlock(q->lock);
 local_irq_restore(flags);

Also add a comment to __blk_run_queue() documenting that
q->request_fn() can drop q->queue_lock and reenable interrupts, but
must return with q->queue_lock held and interrupts disabled.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20110622174919.025446432@linutronix.de
2020-10-14 00:59:08 +03:00
Dan Williams 42928ef4b6 block: fix ext_dev_lock lockdep report
commit 4d66e5e9b6 upstream.

 =================================
 [ INFO: inconsistent lock state ]
 4.1.0-rc7+ #217 Tainted: G           O
 ---------------------------------
 inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
 swapper/6/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
  (ext_devt_lock){+.?...}, at: [<ffffffff8143a60c>] blk_free_devt+0x3c/0x70
 {SOFTIRQ-ON-W} state was registered at:
   [<ffffffff810bf6b1>] __lock_acquire+0x461/0x1e70
   [<ffffffff810c1947>] lock_acquire+0xb7/0x290
   [<ffffffff818ac3a8>] _raw_spin_lock+0x38/0x50
   [<ffffffff8143a07d>] blk_alloc_devt+0x6d/0xd0  <-- take the lock in process context
[..]
  [<ffffffff810bf64e>] __lock_acquire+0x3fe/0x1e70
  [<ffffffff810c00ad>] ? __lock_acquire+0xe5d/0x1e70
  [<ffffffff810c1947>] lock_acquire+0xb7/0x290
  [<ffffffff8143a60c>] ? blk_free_devt+0x3c/0x70
  [<ffffffff818ac3a8>] _raw_spin_lock+0x38/0x50
  [<ffffffff8143a60c>] ? blk_free_devt+0x3c/0x70
  [<ffffffff8143a60c>] blk_free_devt+0x3c/0x70    <-- take the lock in softirq
  [<ffffffff8143bfec>] part_release+0x1c/0x50
  [<ffffffff8158edf6>] device_release+0x36/0xb0
  [<ffffffff8145ac2b>] kobject_cleanup+0x7b/0x1a0
  [<ffffffff8145aad0>] kobject_put+0x30/0x70
  [<ffffffff8158f147>] put_device+0x17/0x20
  [<ffffffff8143c29c>] delete_partition_rcu_cb+0x16c/0x180
  [<ffffffff8143c130>] ? read_dev_sector+0xa0/0xa0
  [<ffffffff810e0e0f>] rcu_process_callbacks+0x2ff/0xa90
  [<ffffffff810e0dcf>] ? rcu_process_callbacks+0x2bf/0xa90
  [<ffffffff81067e2e>] __do_softirq+0xde/0x600

Neil sees this in his tests and it also triggers on pmem driver unbind
for the libnvdimm tests.  This fix is on top of an initial fix by Keith
for incorrect usage of mutex_lock() in this path: 2da78092dd "block:
Fix dev_t minor allocation lifetime".  Both this and 2da78092dd are
candidates for -stable.

Fixes: 2da78092dd ("block: Fix dev_t minor allocation lifetime")
Cc: Keith Busch <keith.busch@intel.com>
Reported-by: NeilBrown <neilb@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-06-22 17:01:20 -07:00
Thadeu Lima de Souza Cascardo 032651863c blk-throttle: check stats_cpu before reading it from sysfs
commit 045c47ca30 upstream.

When reading blkio.throttle.io_serviced in a recently created blkio
cgroup, it's possible to race against the creation of a throttle policy,
which delays the allocation of stats_cpu.

Like other functions in the throttle code, just checking for a NULL
stats_cpu prevents the following oops caused by that race.

[ 1117.285199] Unable to handle kernel paging request for data at address 0x7fb4d0020
[ 1117.285252] Faulting instruction address: 0xc0000000003efa2c
[ 1137.733921] Oops: Kernel access of bad area, sig: 11 [#1]
[ 1137.733945] SMP NR_CPUS=2048 NUMA PowerNV
[ 1137.734025] Modules linked in: bridge stp llc kvm_hv kvm binfmt_misc autofs4
[ 1137.734102] CPU: 3 PID: 5302 Comm: blkcgroup Not tainted 3.19.0 #5
[ 1137.734132] task: c000000f1d188b00 ti: c000000f1d210000 task.ti: c000000f1d210000
[ 1137.734167] NIP: c0000000003efa2c LR: c0000000003ef9f0 CTR: c0000000003ef980
[ 1137.734202] REGS: c000000f1d213500 TRAP: 0300   Not tainted  (3.19.0)
[ 1137.734230] MSR: 9000000000009032 <SF,HV,EE,ME,IR,DR,RI>  CR: 42008884  XER: 20000000
[ 1137.734325] CFAR: 0000000000008458 DAR: 00000007fb4d0020 DSISR: 40000000 SOFTE: 0
GPR00: c0000000003ed3a0 c000000f1d213780 c000000000c59538 0000000000000000
GPR04: 0000000000000800 0000000000000000 0000000000000000 0000000000000000
GPR08: ffffffffffffffff 00000007fb4d0020 00000007fb4d0000 c000000000780808
GPR12: 0000000022000888 c00000000fdc0d80 0000000000000000 0000000000000000
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20: 000001003e120200 c000000f1d5b0cc0 0000000000000200 0000000000000000
GPR24: 0000000000000001 c000000000c269e0 0000000000000020 c000000f1d5b0c80
GPR28: c000000000ca3a08 c000000000ca3dec c000000f1c667e00 c000000f1d213850
[ 1137.734886] NIP [c0000000003efa2c] .tg_prfill_cpu_rwstat+0xac/0x180
[ 1137.734915] LR [c0000000003ef9f0] .tg_prfill_cpu_rwstat+0x70/0x180
[ 1137.734943] Call Trace:
[ 1137.734952] [c000000f1d213780] [d000000005560520] 0xd000000005560520 (unreliable)
[ 1137.734996] [c000000f1d2138a0] [c0000000003ed3a0] .blkcg_print_blkgs+0xe0/0x1a0
[ 1137.735039] [c000000f1d213960] [c0000000003efb50] .tg_print_cpu_rwstat+0x50/0x70
[ 1137.735082] [c000000f1d2139e0] [c000000000104b48] .cgroup_seqfile_show+0x58/0x150
[ 1137.735125] [c000000f1d213a70] [c0000000002749dc] .kernfs_seq_show+0x3c/0x50
[ 1137.735161] [c000000f1d213ae0] [c000000000218630] .seq_read+0xe0/0x510
[ 1137.735197] [c000000f1d213bd0] [c000000000275b04] .kernfs_fop_read+0x164/0x200
[ 1137.735240] [c000000f1d213c80] [c0000000001eb8e0] .__vfs_read+0x30/0x80
[ 1137.735276] [c000000f1d213cf0] [c0000000001eb9c4] .vfs_read+0x94/0x1b0
[ 1137.735312] [c000000f1d213d90] [c0000000001ebb38] .SyS_read+0x58/0x100
[ 1137.735349] [c000000f1d213e30] [c000000000009218] syscall_exit+0x0/0x98
[ 1137.735383] Instruction dump:
[ 1137.735405] 7c6307b4 7f891800 409d00b8 60000000 60420000 3d420004 392a63b0 786a1f24
[ 1137.735471] 7d49502a e93e01c8 7d495214 7d2ad214 <7cead02a> e9090008 e9490010 e9290018

And here is one code that allows to easily reproduce this, although this
has first been found by running docker.

void run(pid_t pid)
{
	int n;
	int status;
	int fd;
	char *buffer;
	buffer = memalign(BUFFER_ALIGN, BUFFER_SIZE);
	n = snprintf(buffer, BUFFER_SIZE, "%d\n", pid);
	fd = open(CGPATH "/test/tasks", O_WRONLY);
	write(fd, buffer, n);
	close(fd);
	if (fork() > 0) {
		fd = open("/dev/sda", O_RDONLY | O_DIRECT);
		read(fd, buffer, 512);
		close(fd);
		wait(&status);
	} else {
		fd = open(CGPATH "/test/blkio.throttle.io_serviced", O_RDONLY);
		n = read(fd, buffer, BUFFER_SIZE);
		close(fd);
	}
	free(buffer);
	exit(0);
}

void test(void)
{
	int status;
	mkdir(CGPATH "/test", 0666);
	if (fork() > 0)
		wait(&status);
	else
		run(getpid());
	rmdir(CGPATH "/test");
}

int main(int argc, char **argv)
{
	int i;
	for (i = 0; i < NR_TESTS; i++)
		test();
	return 0;
}

Reported-by: Ricardo Marin Matinata <rmm@br.ibm.com>
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-03-06 14:43:32 -08:00
Jeff Moyer 68ba0558dc cfq-iosched: fix incorrect filing of rt async cfqq
commit c6ce194325 upstream.

Hi,

If you can manage to submit an async write as the first async I/O from
the context of a process with realtime scheduling priority, then a
cfq_queue is allocated, but filed into the wrong async_cfqq bucket.  It
ends up in the best effort array, but actually has realtime I/O
scheduling priority set in cfqq->ioprio.

The reason is that cfq_get_queue assumes the default scheduling class and
priority when there is no information present (i.e. when the async cfqq
is created):

static struct cfq_queue *
cfq_get_queue(struct cfq_data *cfqd, bool is_sync, struct cfq_io_cq *cic,
	      struct bio *bio, gfp_t gfp_mask)
{
	const int ioprio_class = IOPRIO_PRIO_CLASS(cic->ioprio);
	const int ioprio = IOPRIO_PRIO_DATA(cic->ioprio);

cic->ioprio starts out as 0, which is "invalid".  So, class of 0
(IOPRIO_CLASS_NONE) is passed to cfq_async_queue_prio like so:

		async_cfqq = cfq_async_queue_prio(cfqd, ioprio_class, ioprio);

static struct cfq_queue **
cfq_async_queue_prio(struct cfq_data *cfqd, int ioprio_class, int ioprio)
{
        switch (ioprio_class) {
        case IOPRIO_CLASS_RT:
                return &cfqd->async_cfqq[0][ioprio];
        case IOPRIO_CLASS_NONE:
                ioprio = IOPRIO_NORM;
                /* fall through */
        case IOPRIO_CLASS_BE:
                return &cfqd->async_cfqq[1][ioprio];
        case IOPRIO_CLASS_IDLE:
                return &cfqd->async_idle_cfqq;
        default:
                BUG();
        }
}

Here, instead of returning a class mapped from the process' scheduling
priority, we get back the bucket associated with IOPRIO_CLASS_BE.

Now, there is no queue allocated there yet, so we create it:

		cfqq = cfq_find_alloc_queue(cfqd, is_sync, cic, bio, gfp_mask);

That function ends up doing this:

			cfq_init_cfqq(cfqd, cfqq, current->pid, is_sync);
			cfq_init_prio_data(cfqq, cic);

cfq_init_cfqq marks the priority as having changed.  Then, cfq_init_prio
data does this:

	ioprio_class = IOPRIO_PRIO_CLASS(cic->ioprio);
	switch (ioprio_class) {
	default:
		printk(KERN_ERR "cfq: bad prio %x\n", ioprio_class);
	case IOPRIO_CLASS_NONE:
		/*
		 * no prio set, inherit CPU scheduling settings
		 */
		cfqq->ioprio = task_nice_ioprio(tsk);
		cfqq->ioprio_class = task_nice_ioclass(tsk);
		break;

So we basically have two code paths that treat IOPRIO_CLASS_NONE
differently, which results in an RT async cfqq filed into a best effort
bucket.

Attached is a patch which fixes the problem.  I'm not sure how to make
it cleaner.  Suggestions would be welcome.

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Tested-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-03-06 14:43:29 -08:00
Konstantin Khlebnikov 16f3a871c4 cfq-iosched: handle failure of cfq group allocation
commit 69abaffec7 upstream.

Cfq_lookup_create_cfqg() allocates struct blkcg_gq using GFP_ATOMIC.
In cfq_find_alloc_queue() possible allocation failure is not handled.
As a result kernel oopses on NULL pointer dereference when
cfq_link_cfqq_cfqg() calls cfqg_get() for NULL pointer.

Bug was introduced in v3.5 in commit cd1604fab4 ("blkcg: factor
out blkio_group creation"). Prior to that commit cfq group lookup
had returned pointer to root group as fallback.

This patch handles this error using existing fallback oom_cfqq.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Fixes: cd1604fab4 ("blkcg: factor out blkio_group creation")
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-03-06 14:43:29 -08:00
Jens Axboe 0fa799921d genhd: check for int overflow in disk_expand_part_tbl()
commit 5fabcb4c33 upstream.

We can get here from blkdev_ioctl() -> blkpg_ioctl() -> add_partition()
with a user passed in partno value. If we pass in 0x7fffffff, the
new target in disk_expand_part_tbl() overflows the 'int' and we
access beyond the end of ptbl->part[] and even write to it when we
do the rcu_assign_pointer() to assign the new partition.

Reported-by: David Ramos <daramos@stanford.edu>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-16 06:59:32 -08:00
Jens Axboe bdc69c2acc blk-mq: use 'nr_cpu_ids' as highest CPU ID count for hwq <-> cpu map
commit a33c1ba291 upstream.

We currently use num_possible_cpus(), but that breaks on sparc64 where
the CPU ID space is discontig. Use nr_cpu_ids as the highest CPU ID
instead, so we don't end up reading from invalid memory.

Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-16 06:59:31 -08:00
Jan Kara 42c141b6ee scsi: Fix error handling in SCSI_IOCTL_SEND_COMMAND
commit 84ce0f0e94 upstream.

When sg_scsi_ioctl() fails to prepare request to submit in
blk_rq_map_kern() we jump to a label where we just end up copying
(luckily zeroed-out) kernel buffer to userspace instead of reporting
error. Fix the problem by jumping to the right label.

CC: Jens Axboe <axboe@kernel.dk>
CC: linux-scsi@vger.kernel.org
Coverity-id: 1226871
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Fixed up the, now unused, out label.

Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-14 09:00:08 -08:00
Mike Snitzer e851024dbf block: fix alignment_offset math that assumes io_min is a power-of-2
commit b8839b8c55 upstream.

The math in both blk_stack_limits() and queue_limit_alignment_offset()
assume that a block device's io_min (aka minimum_io_size) is always a
power-of-2.  Fix the math such that it works for non-power-of-2 io_min.

This issue (of alignment_offset != 0) became apparent when testing
dm-thinp with a thinp blocksize that matches a RAID6 stripesize of
1280K.  Commit fdfb4c8c1 ("dm thin: set minimum_io_size to pool's data
block size") unlocked the potential for alignment_offset != 0 due to
the dm-thin-pool's io_min possibly being a non-power-of-2.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14 08:59:51 -08:00
Dan Carpenter 763e95c4ac partitions: aix.c: off by one bug
commit d97a86c170 upstream.

The lvip[] array has "state->limit" elements so the condition here
should be >= instead of >.

Fixes: 6ceea22bbb ('partitions: add aix lvm partition support files')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Philippe De Muyter <phdm@macqel.be>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-10-05 14:52:24 -07:00
Jens Axboe 9783e9905e genhd: fix leftover might_sleep() in blk_free_devt()
commit 46f341ffcf upstream.

Commit 2da78092 changed the locking from a mutex to a spinlock,
so we now longer sleep in this context. But there was a leftover
might_sleep() in there, which now triggers since we do the final
free from an RCU callback. Get rid of it.

Reported-by: Pontus Fuchs <pontus.fuchs@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-10-05 14:52:20 -07:00
Keith Busch 5f9b9210b5 block: Fix dev_t minor allocation lifetime
commit 2da78092dd upstream.

Releases the dev_t minor when all references are closed to prevent
another device from acquiring the same major/minor.

Since the partition's release may be invoked from call_rcu's soft-irq
context, the ext_dev_idr's mutex had to be replaced with a spinlock so
as not so sleep.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-10-05 14:52:19 -07:00
Toshiaki Makita 63ea55d4d1 cfq-iosched: Fix wrong children_weight calculation
commit e15693ef18 upstream.

cfq_group_service_tree_add() is applying new_weight at the beginning of
the function via cfq_update_group_weight().
This actually allows weight to change between adding it to and subtracting
it from children_weight, and triggers WARN_ON_ONCE() in
cfq_group_service_tree_del(), or even causes oops by divide error during
vfr calculation in cfq_group_service_tree_add().

The detailed scenario is as follows:
1. Create blkio cgroups X and Y as a child of X.
   Set X's weight to 500 and perform some I/O to apply new_weight.
   This X's I/O completes before starting Y's I/O.
2. Y starts I/O and cfq_group_service_tree_add() is called with Y.
3. cfq_group_service_tree_add() walks up the tree during children_weight
   calculation and adds parent X's weight (500) to children_weight of root.
   children_weight becomes 500.
4. Set X's weight to 1000.
5. X starts I/O and cfq_group_service_tree_add() is called with X.
6. cfq_group_service_tree_add() applies its new_weight (1000).
7. I/O of Y completes and cfq_group_service_tree_del() is called with Y.
8. I/O of X completes and cfq_group_service_tree_del() is called with X.
9. cfq_group_service_tree_del() subtracts X's weight (1000) from
   children_weight of root. children_weight becomes -500.
   This triggers WARN_ON_ONCE().
10. Set X's weight to 500.
11. X starts I/O and cfq_group_service_tree_add() is called with X.
12. cfq_group_service_tree_add() applies its new_weight (500) and adds it
    to children_weight of root. children_weight becomes 0. Calcularion of
    vfr triggers oops by divide error.

weight should be updated right before adding it to children_weight.

Reported-by: Ruki Sekiya <sekiya.ruki@lab.ntt.co.jp>
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-10-05 14:52:11 -07:00
Tejun Heo d98c246a64 blkcg: don't call into policy draining if root_blkg is already gone
commit 0b462c89e3 upstream.

While a queue is being destroyed, all the blkgs are destroyed and its
->root_blkg pointer is set to NULL.  If someone else starts to drain
while the queue is in this state, the following oops happens.

  NULL pointer dereference at 0000000000000028
  IP: [<ffffffff8144e944>] blk_throtl_drain+0x84/0x230
  PGD e4a1067 PUD b773067 PMD 0
  Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
  Modules linked in: cfq_iosched(-) [last unloaded: cfq_iosched]
  CPU: 1 PID: 537 Comm: bash Not tainted 3.16.0-rc3-work+ #2
  Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
  task: ffff88000e222250 ti: ffff88000efd4000 task.ti: ffff88000efd4000
  RIP: 0010:[<ffffffff8144e944>]  [<ffffffff8144e944>] blk_throtl_drain+0x84/0x230
  RSP: 0018:ffff88000efd7bf0  EFLAGS: 00010046
  RAX: 0000000000000000 RBX: ffff880015091450 RCX: 0000000000000001
  RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
  RBP: ffff88000efd7c10 R08: 0000000000000000 R09: 0000000000000001
  R10: ffff88000e222250 R11: 0000000000000000 R12: ffff880015091450
  R13: ffff880015092e00 R14: ffff880015091d70 R15: ffff88001508fc28
  FS:  00007f1332650740(0000) GS:ffff88001fa80000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
  CR2: 0000000000000028 CR3: 0000000009446000 CR4: 00000000000006e0
  Stack:
   ffffffff8144e8f6 ffff880015091450 0000000000000000 ffff880015091d80
   ffff88000efd7c28 ffffffff8144ae2f ffff880015091450 ffff88000efd7c58
   ffffffff81427641 ffff880015091450 ffffffff82401f00 ffff880015091450
  Call Trace:
   [<ffffffff8144ae2f>] blkcg_drain_queue+0x1f/0x60
   [<ffffffff81427641>] __blk_drain_queue+0x71/0x180
   [<ffffffff81429b3e>] blk_queue_bypass_start+0x6e/0xb0
   [<ffffffff814498b8>] blkcg_deactivate_policy+0x38/0x120
   [<ffffffff8144ec44>] blk_throtl_exit+0x34/0x50
   [<ffffffff8144aea5>] blkcg_exit_queue+0x35/0x40
   [<ffffffff8142d476>] blk_release_queue+0x26/0xd0
   [<ffffffff81454968>] kobject_cleanup+0x38/0x70
   [<ffffffff81454848>] kobject_put+0x28/0x60
   [<ffffffff81427505>] blk_put_queue+0x15/0x20
   [<ffffffff817d07bb>] scsi_device_dev_release_usercontext+0x16b/0x1c0
   [<ffffffff810bc339>] execute_in_process_context+0x89/0xa0
   [<ffffffff817d064c>] scsi_device_dev_release+0x1c/0x20
   [<ffffffff817930e2>] device_release+0x32/0xa0
   [<ffffffff81454968>] kobject_cleanup+0x38/0x70
   [<ffffffff81454848>] kobject_put+0x28/0x60
   [<ffffffff817934d7>] put_device+0x17/0x20
   [<ffffffff817d11b9>] __scsi_remove_device+0xa9/0xe0
   [<ffffffff817d121b>] scsi_remove_device+0x2b/0x40
   [<ffffffff817d1257>] sdev_store_delete+0x27/0x30
   [<ffffffff81792ca8>] dev_attr_store+0x18/0x30
   [<ffffffff8126f75e>] sysfs_kf_write+0x3e/0x50
   [<ffffffff8126ea87>] kernfs_fop_write+0xe7/0x170
   [<ffffffff811f5e9f>] vfs_write+0xaf/0x1d0
   [<ffffffff811f69bd>] SyS_write+0x4d/0xc0
   [<ffffffff81d24692>] system_call_fastpath+0x16/0x1b

776687bce4 ("block, blk-mq: draining can't be skipped even if
bypass_depth was non-zero") made it easier to trigger this bug by
making blk_queue_bypass_start() drain even when it loses the first
bypass test to blk_cleanup_queue(); however, the bug has always been
there even before the commit as blk_queue_bypass_start() could race
against queue destruction, win the initial bypass test but perform the
actual draining after blk_cleanup_queue() already destroyed all blkgs.

Fix it by skippping calling into policy draining if all the blkgs are
already gone.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Shirish Pargaonkar <spargaonkar@suse.com>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Reported-by: Jet Chen <jet.chen@intel.com>
Tested-by: Shirish Pargaonkar <spargaonkar@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-31 12:52:55 -07:00
Christoph Hellwig 0b9f20dad0 block: don't assume last put of shared tags is for the host
commit d45b3279a5 upstream.

There is no inherent reason why the last put of a tag structure must be
the one for the Scsi_Host, as device model objects can be held for
arbitrary periods.  Merge blk_free_tags and __blk_free_tags into a single
funtion that just release a references and get rid of the BUG() when the
host reference wasn't the last.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-31 12:52:54 -07:00
Mikulas Patocka 3b968e0cc9 block: provide compat ioctl for BLKZEROOUT
commit 3b3a1814d1 upstream.

This patch provides the compat BLKZEROOUT ioctl. The argument is a pointer
to two uint64_t values, so there is no need to translate it.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-31 12:52:54 -07:00
Tejun Heo baa9dbe85a blkcg: fix use-after-free in __blkg_release_rcu() by making blkcg_gq refcnt an atomic_t
commit a5049a8ae3 upstream.

Hello,

So, this patch should do.  Joe, Vivek, can one of you guys please
verify that the oops goes away with this patch?

Jens, the original thread can be read at

  http://thread.gmane.org/gmane.linux.kernel/1720729

The fix converts blkg->refcnt from int to atomic_t.  It does some
overhead but it should be minute compared to everything else which is
going on and the involved cacheline bouncing, so I think it's highly
unlikely to cause any noticeable difference.  Also, the refcnt in
question should be converted to a perpcu_ref for blk-mq anyway, so the
atomic_t is likely to go away pretty soon anyway.

Thanks.

------- 8< -------
__blkg_release_rcu() may be invoked after the associated request_queue
is released with a RCU grace period inbetween.  As such, the function
and callbacks invoked from it must not dereference the associated
request_queue.  This is clearly indicated in the comment above the
function.

Unfortunately, while trying to fix a different issue, 2a4fd070ee
("blkcg: move bulk of blkcg_gq release operations to the RCU
callback") ignored this and added [un]locking of @blkg->q->queue_lock
to __blkg_release_rcu().  This of course can cause oops as the
request_queue may be long gone by the time this code gets executed.

  general protection fault: 0000 [#1] SMP
  CPU: 21 PID: 30 Comm: rcuos/21 Not tainted 3.15.0 #1
  Hardware name: Stratus ftServer 6400/G7LAZ, BIOS BIOS Version 6.3:57 12/25/2013
  task: ffff880854021de0 ti: ffff88085403c000 task.ti: ffff88085403c000
  RIP: 0010:[<ffffffff8162e9e5>]  [<ffffffff8162e9e5>] _raw_spin_lock_irq+0x15/0x60
  RSP: 0018:ffff88085403fdf0  EFLAGS: 00010086
  RAX: 0000000000020000 RBX: 0000000000000010 RCX: 0000000000000000
  RDX: 000060ef80008248 RSI: 0000000000000286 RDI: 6b6b6b6b6b6b6b6b
  RBP: ffff88085403fdf0 R08: 0000000000000286 R09: 0000000000009f39
  R10: 0000000000020001 R11: 0000000000020001 R12: ffff88103c17a130
  R13: ffff88103c17a080 R14: 0000000000000000 R15: 0000000000000000
  FS:  0000000000000000(0000) GS:ffff88107fca0000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00000000006e5ab8 CR3: 000000000193d000 CR4: 00000000000407e0
  Stack:
   ffff88085403fe18 ffffffff812cbfc2 ffff88103c17a130 0000000000000000
   ffff88103c17a130 ffff88085403fec0 ffffffff810d1d28 ffff880854021de0
   ffff880854021de0 ffff88107fcaec58 ffff88085403fe80 ffff88107fcaec30
  Call Trace:
   [<ffffffff812cbfc2>] __blkg_release_rcu+0x72/0x150
   [<ffffffff810d1d28>] rcu_nocb_kthread+0x1e8/0x300
   [<ffffffff81091d81>] kthread+0xe1/0x100
   [<ffffffff8163813c>] ret_from_fork+0x7c/0xb0
  Code: ff 47 04 48 8b 7d 08 be 00 02 00 00 e8 55 48 a4 ff 5d c3 0f 1f 00 66 66 66 66 90 55 48 89 e5
  +fa 66 66 90 66 66 90 b8 00 00 02 00 <f0> 0f c1 07 89 c2 c1 ea 10 66 39 c2 75 02 5d c3 83 e2 fe 0f
  +b7
  RIP  [<ffffffff8162e9e5>] _raw_spin_lock_irq+0x15/0x60
   RSP <ffff88085403fdf0>

The request_queue locking was added because blkcg_gq->refcnt is an int
protected with the queue lock and __blkg_release_rcu() needs to put
the parent.  Let's fix it by making blkcg_gq->refcnt an atomic_t and
dropping queue locking in the function.

Given the general heavy weight of the current request_queue and blkcg
operations, this is unlikely to cause any noticeable overhead.
Moreover, blkcg_gq->refcnt is likely to be converted to percpu_ref in
the near future, so whatever (most likely negligible) overhead it may
add is temporary.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Joe Lawrence <joe.lawrence@stratus.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Link: http://lkml.kernel.org/g/alpine.DEB.2.02.1406081816540.17948@jlaw-desktop.mno.stratus.com
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-09 11:18:27 -07:00
Roman Pen 0a8eda9c00 blktrace: fix accounting of partially completed requests
commit af5040da01 upstream.

trace_block_rq_complete does not take into account that request can
be partially completed, so we can get the following incorrect output
of blkparser:

  C   R 232 + 240 [0]
  C   R 240 + 232 [0]
  C   R 248 + 224 [0]
  C   R 256 + 216 [0]

but should be:

  C   R 232 + 8 [0]
  C   R 240 + 8 [0]
  C   R 248 + 8 [0]
  C   R 256 + 8 [0]

Also, the whole output summary statistics of completed requests and
final throughput will be incorrect.

This patch takes into account real completion size of the request and
fixes wrong completion accounting.

Signed-off-by: Roman Pen <r.peniaev@gmail.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: Ingo Molnar <mingo@redhat.com>
CC: linux-kernel@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-31 13:20:28 -07:00
Dave Jones 708f04d2ab block: free q->flush_rq in blk_init_allocated_queue error paths
Commit 7982e90c3a ("block: fix q->flush_rq NULL pointer crash on
dm-mpath flush") moved an allocation to blk_init_allocated_queue(), but
neglected to free that allocation on the error paths that follow.

Signed-off-by: Dave Jones <davej@fedoraproject.org>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-03-20 22:32:06 -07:00
Mike Snitzer 10beafc190 block: change flush sequence list addition back to front add
Commit 18741986 inadvertently changed the rq flush insertion
from a head to a tail insertion. Fix that back up.

Signed-off-by: Mike Snitzer <msnitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-03-08 20:31:31 -07:00
Mike Snitzer 7982e90c3a block: fix q->flush_rq NULL pointer crash on dm-mpath flush
Commit 1874198 ("blk-mq: rework flush sequencing logic") switched
->flush_rq from being an embedded member of the request_queue structure
to being dynamically allocated in blk_init_queue_node().

Request-based DM multipath doesn't use blk_init_queue_node(), instead it
uses blk_alloc_queue_node() + blk_init_allocated_queue().  Because
commit 1874198 placed the dynamic allocation of ->flush_rq in
blk_init_queue_node() any flush issued to a dm-mpath device would crash
with a NULL pointer, e.g.:

BUG: unable to handle kernel NULL pointer dereference at           (null)
IP: [<ffffffff8125037e>] blk_rq_init+0x1e/0xb0
PGD bb3c7067 PUD bb01d067 PMD 0
Oops: 0002 [#1] SMP
...
CPU: 5 PID: 5028 Comm: dt Tainted: G        W  O 3.14.0-rc3.snitm+ #10
...
task: ffff88032fb270e0 ti: ffff880079564000 task.ti: ffff880079564000
RIP: 0010:[<ffffffff8125037e>]  [<ffffffff8125037e>] blk_rq_init+0x1e/0xb0
RSP: 0018:ffff880079565c98  EFLAGS: 00010046
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000030
RDX: ffff880260c74048 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffff880079565ca8 R08: ffff880260aa1e98 R09: 0000000000000001
R10: ffff88032fa78500 R11: 0000000000000246 R12: 0000000000000000
R13: ffff880260aa1de8 R14: 0000000000000650 R15: 0000000000000000
FS:  00007f8d36a2a700(0000) GS:ffff88033fca0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 0000000079b36000 CR4: 00000000000007e0
Stack:
 0000000000000000 ffff880260c74048 ffff880079565cd8 ffffffff81257a47
 ffff880260aa1de8 ffff880260c74048 0000000000000001 0000000000000000
 ffff880079565d08 ffffffff81257c2d 0000000000000000 ffff880260aa1de8
Call Trace:
 [<ffffffff81257a47>] blk_flush_complete_seq+0x2d7/0x2e0
 [<ffffffff81257c2d>] blk_insert_flush+0x1dd/0x210
 [<ffffffff8124ec59>] __elv_add_request+0x1f9/0x320
 [<ffffffff81250681>] ? blk_account_io_start+0x111/0x190
 [<ffffffff81253a4b>] blk_queue_bio+0x25b/0x330
 [<ffffffffa0020bf5>] dm_request+0x35/0x40 [dm_mod]
 [<ffffffff812530c0>] generic_make_request+0xc0/0x100
 [<ffffffff81253173>] submit_bio+0x73/0x140
 [<ffffffff811becdd>] submit_bio_wait+0x5d/0x80
 [<ffffffff81257528>] blkdev_issue_flush+0x78/0xa0
 [<ffffffff811c1f6f>] blkdev_fsync+0x3f/0x60
 [<ffffffff811b7fde>] vfs_fsync_range+0x1e/0x20
 [<ffffffff811b7ffc>] vfs_fsync+0x1c/0x20
 [<ffffffff811b81f1>] do_fsync+0x41/0x80
 [<ffffffff8118874e>] ? SyS_lseek+0x7e/0x80
 [<ffffffff811b8260>] SyS_fsync+0x10/0x20
 [<ffffffff8154c2d2>] system_call_fastpath+0x16/0x1b

Fix this by moving the ->flush_rq allocation from blk_init_queue_node()
to blk_init_allocated_queue().  blk_init_queue_node() also calls
blk_init_allocated_queue() so this change is functionality equivalent
for all blk_init_queue_node() callers.

Reported-by: Hannes Reinecke <hare@suse.de>
Reported-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-03-08 17:20:01 -07:00
Shaohua Li 739c3eea71 blk-mq: add REQ_SYNC early
Add REQ_SYNC early, so rq_dispatched[] in blk_mq_rq_ctx_init
is set correctly.

Signed-off-by: Shaohua Li<shli@fusionio.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-03-07 08:15:28 -07:00
Mike Galbraith 2a26ebef84 rt,blk,mq: Make blk_mq_cpu_notify_lock a raw spinlock
[  365.164040] BUG: sleeping function called from invalid context at kernel/rtmutex.c:674
[  365.164041] in_atomic(): 1, irqs_disabled(): 1, pid: 26, name: migration/1
[  365.164043] no locks held by migration/1/26.
[  365.164044] irq event stamp: 6648
[  365.164056] hardirqs last  enabled at (6647): [<ffffffff8153d377>] restore_args+0x0/0x30
[  365.164062] hardirqs last disabled at (6648): [<ffffffff810ed98d>] multi_cpu_stop+0x9d/0x120
[  365.164070] softirqs last  enabled at (0): [<ffffffff810543bc>] copy_process.part.28+0x6fc/0x1920
[  365.164072] softirqs last disabled at (0): [<          (null)>]           (null)
[  365.164076] CPU: 1 PID: 26 Comm: migration/1 Tainted: GF           N  3.12.12-rt19-0.gcb6c4a2-rt #3
[  365.164078] Hardware name: QCI QSSC-S4R/QSSC-S4R, BIOS QSSC-S4R.QCI.01.00.S013.032920111005 03/29/2011
[  365.164091]  0000000000000001 ffff880a42ea7c30 ffffffff815367e6 ffffffff81a086c0
[  365.164099]  ffff880a42ea7c40 ffffffff8108919c ffff880a42ea7c60 ffffffff8153c24f
[  365.164107]  ffff880a42ea91f0 00000000ffffffe1 ffff880a42ea7c88 ffffffff81297ec0
[  365.164108] Call Trace:
[  365.164119]  [<ffffffff810060b1>] try_stack_unwind+0x191/0x1a0
[  365.164127]  [<ffffffff81004872>] dump_trace+0x92/0x360
[  365.164133]  [<ffffffff81006108>] show_trace_log_lvl+0x48/0x60
[  365.164138]  [<ffffffff81004c18>] show_stack_log_lvl+0xd8/0x1d0
[  365.164143]  [<ffffffff81006160>] show_stack+0x20/0x50
[  365.164153]  [<ffffffff815367e6>] dump_stack+0x54/0x9a
[  365.164163]  [<ffffffff8108919c>] __might_sleep+0xfc/0x140
[  365.164173]  [<ffffffff8153c24f>] rt_spin_lock+0x1f/0x70
[  365.164182]  [<ffffffff81297ec0>] blk_mq_main_cpu_notify+0x20/0x70
[  365.164191]  [<ffffffff81540a1c>] notifier_call_chain+0x4c/0x70
[  365.164201]  [<ffffffff81083499>] __raw_notifier_call_chain+0x9/0x10
[  365.164207]  [<ffffffff810567be>] cpu_notify+0x1e/0x40
[  365.164217]  [<ffffffff81525da2>] take_cpu_down+0x22/0x40
[  365.164223]  [<ffffffff810ed9c6>] multi_cpu_stop+0xd6/0x120
[  365.164229]  [<ffffffff810edd97>] cpu_stopper_thread+0xd7/0x1e0
[  365.164235]  [<ffffffff810863a3>] smpboot_thread_fn+0x203/0x380
[  365.164241]  [<ffffffff8107cbf8>] kthread+0xc8/0xd0
[  365.164250]  [<ffffffff8154440c>] ret_from_fork+0x7c/0xb0
[  365.164429] smpboot: CPU 1 is now offline

Signed-off-by: Mike Galbraith <bitbucket@online.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-03-03 09:34:10 -07:00
Christoph Hellwig d6a25b3131 blk-mq: support partial I/O completions
Add a new blk_mq_end_io_partial function to partially complete requests
as needed by the SCSI layer.  We do this by reusing blk_update_request
to advance the bio instead of having a simplified version of it in
the blk-mq code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-02-21 08:58:49 -08:00
Christoph Hellwig feb71dae1f blk-mq: merge blk_mq_insert_request and blk_mq_run_request
It's almost identical to blk_mq_insert_request, so fold the two into one
slightly more generic function by making the flush special case a bit
smarted.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-02-21 08:58:48 -08:00
Christoph Hellwig fd694131bb blk-mq: remove blk_mq_alloc_rq
There's only one caller, which is a straight wrapper and fits the naming
scheme of the related functions a lot better.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-02-21 08:58:47 -08:00
Linus Torvalds 5e57dc8110 Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block IO fixes from Jens Axboe:
 "Second round of updates and fixes for 3.14-rc2.  Most of this stuff
  has been queued up for a while.  The notable exception is the blk-mq
  changes, which are naturally a bit more in flux still.

  The pull request contains:

   - Two bug fixes for the new immutable vecs, causing crashes with raid
     or swap.  From Kent.

   - Various blk-mq tweaks and fixes from Christoph.  A fix for
     integrity bio's from Nic.

   - A few bcache fixes from Kent and Darrick Wong.

   - xen-blk{front,back} fixes from David Vrabel, Matt Rushton, Nicolas
     Swenson, and Roger Pau Monne.

   - Fix for a vec miscount with integrity vectors from Martin.

   - Minor annotations or fixes from Masanari Iida and Rashika Kheria.

   - Tweak to null_blk to do more normal FIFO processing of requests
     from Shlomo Pongratz.

   - Elevator switching bypass fix from Tejun.

   - Softlockup in blkdev_issue_discard() fix when !CONFIG_PREEMPT from
     me"

* 'for-linus' of git://git.kernel.dk/linux-block: (31 commits)
  block: add cond_resched() to potentially long running ioctl discard loop
  xen-blkback: init persistent_purge_work work_struct
  blk-mq: pair blk_mq_start_request / blk_mq_requeue_request
  blk-mq: dont assume rq->errors is set when returning an error from ->queue_rq
  block: Fix cloning of discard/write same bios
  block: Fix type mismatch in ssize_t_blk_mq_tag_sysfs_show
  blk-mq: rework flush sequencing logic
  null_blk: use blk_complete_request and blk_mq_complete_request
  virtio_blk: use blk_mq_complete_request
  blk-mq: rework I/O completions
  fs: Add prototype declaration to appropriate header file include/linux/bio.h
  fs: Mark function as static in fs/bio-integrity.c
  block/null_blk: Fix completion processing from LIFO to FIFO
  block: Explicitly handle discard/write same segments
  block: Fix nr_vecs for inline integrity vectors
  blk-mq: Add bio_integrity setup to blk_mq_make_request
  blk-mq: initialize sg_reserved_size
  blk-mq: handle dma_drain_size
  blk-mq: divert __blk_put_request for MQ ops
  blk-mq: support at_head inserations for blk_execute_rq
  ...
2014-02-14 10:45:18 -08:00
Jens Axboe c8123f8c9c block: add cond_resched() to potentially long running ioctl discard loop
When mkfs issues a full device discard and the device only
supports discards of a smallish size, we can loop in
blkdev_issue_discard() for a long time. If preempt isn't enabled,
this can turn into a softlock situation and the kernel will
start complaining.

Add an explicit cond_resched() at the end of the loop to avoid
that.

Cc: stable@kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-02-12 09:36:37 -07:00
Christoph Hellwig 49f5baa510 blk-mq: pair blk_mq_start_request / blk_mq_requeue_request
Make sure we have a proper pairing between starting and requeueing
requests.  Move the dma drain and REQ_END setup into blk_mq_start_request,
and make sure blk_mq_requeue_request properly undoes them, giving us
a pair of function to prepare and unprepare a request without leaving
side effects.

Together this ensures we always clean up properly after
BLK_MQ_RQ_QUEUE_BUSY returns from ->queue_rq.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-02-11 09:34:08 -07:00
Christoph Hellwig 1e93b8c274 blk-mq: dont assume rq->errors is set when returning an error from ->queue_rq
rq->errors never has been part of the communication protocol between drivers
and the block stack and most drivers will not have initialized it.

Return -EIO to upper layers when the driver returns BLK_MQ_RQ_QUEUE_ERROR
unconditionally.  If a driver want to return a different error it can easily
do so by returning success after calling blk_mq_end_io itself.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-02-11 09:34:07 -07:00
Masanari Iida 11c9444407 block: Fix type mismatch in ssize_t_blk_mq_tag_sysfs_show
cppcheck detected following format string mismatch.
[blk-mq-tag.c:201]: (warning) %u in format string (no. 1) requires
'unsigned int' but the argument type is 'int'.

Change "cpu" from int to unsigned int, because the cpu
never become minus value.

Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-02-10 10:39:53 -07:00
Christoph Hellwig 18741986a4 blk-mq: rework flush sequencing logic
Witch to using a preallocated flush_rq for blk-mq similar to what's done
with the old request path.  This allows us to set up the request properly
with a tag from the actually allowed range and ->rq_disk as needed by
some drivers.  To make life easier we also switch to dynamic allocation
of ->flush_rq for the old path.

This effectively reverts most of

    "blk-mq: fix for flush deadlock"

and

    "blk-mq: Don't reserve a tag for flush request"

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-02-10 09:29:00 -07:00
Christoph Hellwig 30a91cb4ef blk-mq: rework I/O completions
Rework I/O completions to work more like the old code path.  blk_mq_end_io
now stays out of the business of deferring completions to others CPUs
and calling blk_mark_rq_complete.  The latter is very important to allow
completing requests that have timed out and thus are already marked completed,
the former allows using the IPI callout even for driver specific completions
instead of having to reimplement them.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-02-10 09:27:31 -07:00
Kent Overstreet 5cb8850c9c block: Explicitly handle discard/write same segments
Immutable biovecs changed the way biovecs are interpreted - drivers no
longer use bi_vcnt, they have to go by bi_iter.bi_size (to allow for
using part of an existing segment without modifying it).

This breaks with discards and write_same bios, since for those bi_size
has nothing to do with segments in the biovec. So for now, we need a
fairly gross hack - we fortunately know that there will never be more
than one segment for the entire request, so we can special case
discard/write_same.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Tested-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-02-07 13:54:08 -07:00
Nicholas Bellinger 14ec77f352 blk-mq: Add bio_integrity setup to blk_mq_make_request
This patch adds the missing bio_integrity_enabled() +
bio_integrity_prep() setup into blk_mq_make_request()
in order to use DIF protection with scsi-mq.

Cc: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-02-07 13:45:39 -07:00
Christoph Hellwig 1be036e946 blk-mq: initialize sg_reserved_size
To behave the same way as the old request path.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-02-07 11:58:54 -07:00
Christoph Hellwig 4f7f418c48 blk-mq: handle dma_drain_size
Make blk-mq handle the dma_drain_size field the same way as the old request
path.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-02-07 11:58:54 -07:00
Christoph Hellwig 6f5ba581c0 blk-mq: divert __blk_put_request for MQ ops
__blk_put_request needs to call into the blk-mq code just like
blk_put_request.  As we don't have the queue lock in this case both
end up calling the same function.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-02-07 11:58:54 -07:00
Christoph Hellwig 72a0a36e28 blk-mq: support at_head inserations for blk_execute_rq
This is neede for proper SG_IO operation as well as various uses of
blk_execute_rq from the SCSI midlayer.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-02-07 11:58:54 -07:00
Linus Torvalds 4e13c5d021 Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending
Pull SCSI target updates from Nicholas Bellinger:
 "The highlights this round include:

  - add support for SCSI Referrals (Hannes)
  - add support for T10 DIF into target core (nab + mkp)
  - add support for T10 DIF emulation in FILEIO + RAMDISK backends (Sagi + nab)
  - add support for T10 DIF -> bio_integrity passthrough in IBLOCK backend (nab)
  - prep changes to iser-target for >= v3.15 T10 DIF support (Sagi)
  - add support for qla2xxx N_Port ID Virtualization - NPIV (Saurav + Quinn)
  - allow percpu_ida_alloc() to receive task state bitmask (Kent)
  - fix >= v3.12 iscsi-target session reset hung task regression (nab)
  - fix >= v3.13 percpu_ref se_lun->lun_ref_active race (nab)
  - fix a long-standing network portal creation race (Andy)"

* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending: (51 commits)
  target: Fix percpu_ref_put race in transport_lun_remove_cmd
  target/iscsi: Fix network portal creation race
  target: Report bad sector in sense data for DIF errors
  iscsi-target: Convert gfp_t parameter to task state bitmask
  iscsi-target: Fix connection reset hang with percpu_ida_alloc
  percpu_ida: Make percpu_ida_alloc + callers accept task state bitmask
  iscsi-target: Pre-allocate more tags to avoid ack starvation
  qla2xxx: Configure NPIV fc_vport via tcm_qla2xxx_npiv_make_lport
  qla2xxx: Enhancements to enable NPIV support for QLOGIC ISPs with TCM/LIO.
  qla2xxx: Fix scsi_host leak on qlt_lport_register callback failure
  IB/isert: pass scatterlist instead of cmd to fast_reg_mr routine
  IB/isert: Move fastreg descriptor creation to a function
  IB/isert: Avoid frwr notation, user fastreg
  IB/isert: seperate connection protection domains and dma MRs
  tcm_loop: Enable DIF/DIX modes in SCSI host LLD
  target/rd: Add DIF protection into rd_execute_rw
  target/rd: Add support for protection SGL setup + release
  target/rd: Refactor rd_build_device_space + rd_release_device_space
  target/file: Add DIF protection support to fd_execute_rw
  target/file: Add DIF protection init/format support
  ...
2014-01-31 15:31:23 -08:00
Tejun Heo 556ee818c0 block: __elv_next_request() shouldn't call into the elevator if bypassing
request_queue bypassing is used to suppress higher-level function of a
request_queue so that they can be switched, reconfigured and shut
down.  A request_queue does the followings while bypassing.

* bypasses elevator and io_cq association and queues requests directly
  to the FIFO dispatch queue.

* bypasses block cgroup request_list lookup and always uses the root
  request_list.

Once confirmed to be bypassing, specific elevator and block cgroup
policy implementations can assume that nothing is in flight for them
and perform various operations which would be dangerous otherwise.

Such confirmation is acheived by short-circuiting all new requests
directly to the dispatch queue and waiting for all the requests which
were issued before to finish.  Unfortunately, while the request
allocating and draining sides were properly handled, we forgot to
actually plug the request dispatch path.  Even after bypassing mode is
confirmed, if the attached driver tries to fetch a request and the
dispatch queue is empty, __elv_next_request() would invoke the current
elevator's elevator_dispatch_fn() callback.  As all in-flight requests
were drained, the elevator wouldn't contain any request but once
bypass is confirmed we don't even know whether the elevator is even
there.  It might be in the process of being switched and half torn
down.

Frank Mayhar reports that this actually happened while switching
elevators, leading to an oops.

Let's fix it by making __elv_next_request() avoid invoking the
elevator_dispatch_fn() callback if the queue is bypassing.  It already
avoids invoking the callback if the queue is dying.  As a dying queue
is guaranteed to be bypassing, we can simply replace blk_queue_dying()
check with blk_queue_bypass().

Reported-by: Frank Mayhar <fmayhar@google.com>
References: http://lkml.kernel.org/g/1390319905.20232.38.camel@bobble.lax.corp.google.com
Cc: stable@vger.kernel.org
Tested-by: Frank Mayhar <fmayhar@google.com>

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2014-01-30 12:57:25 -07:00
Shaohua Li f0276924fa blk-mq: Don't reserve a tag for flush request
Reserving a tag (request) for flush to avoid dead lock is a overkill. A
tag is valuable resource. We can track the number of flush requests and
disallow having too many pending flush requests allocated. With this
patch, blk_mq_alloc_request_pinned() could do a busy nop (but not a dead
loop) if too many pending requests are allocated and new flush request
is allocated. But this should not be a problem, too many pending flush
requests are very rare case.

I verified this can fix the deadlock caused by too many pending flush
requests.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2014-01-30 12:57:25 -07:00
Linus Torvalds 53d8ab29f8 Merge branch 'for-3.14/drivers' of git://git.kernel.dk/linux-block
Pull block IO driver changes from Jens Axboe:

 - bcache update from Kent Overstreet.

 - two bcache fixes from Nicholas Swenson.

 - cciss pci init error fix from Andrew.

 - underflow fix in the parallel IDE pg_write code from Dan Carpenter.
   I'm sure the 1 (or 0) users of that are now happy.

 - two PCI related fixes for sx8 from Jingoo Han.

 - floppy init fix for first block read from Jiri Kosina.

 - pktcdvd error return miss fix from Julia Lawall.

 - removal of IRQF_SHARED from the SEGA Dreamcast CD-ROM code from
   Michael Opdenacker.

 - comment typo fix for the loop driver from Olaf Hering.

 - potential oops fix for null_blk from Raghavendra K T.

 - two fixes from Sam Bradshaw (Micron) for the mtip32xx driver, fixing
   an OOM problem and a problem with handling security locked conditions

* 'for-3.14/drivers' of git://git.kernel.dk/linux-block: (47 commits)
  mg_disk: Spelling s/finised/finished/
  null_blk: Null pointer deference problem in alloc_page_buffers
  mtip32xx: Correctly handle security locked condition
  mtip32xx: Make SGL container per-command to eliminate high order dma allocation
  drivers/block/loop.c: fix comment typo in loop_config_discard
  drivers/block/cciss.c:cciss_init_one(): use proper errnos
  drivers/block/paride/pg.c: underflow bug in pg_write()
  drivers/block/sx8.c: remove unnecessary pci_set_drvdata()
  drivers/block/sx8.c: use module_pci_driver()
  floppy: bail out in open() if drive is not responding to block0 read
  bcache: Fix auxiliary search trees for key size > cacheline size
  bcache: Don't return -EINTR when insert finished
  bcache: Improve bucket_prio() calculation
  bcache: Add bch_bkey_equal_header()
  bcache: update bch_bkey_try_merge
  bcache: Move insert_fixup() to btree_keys_ops
  bcache: Convert sorting to btree_keys
  bcache: Convert debug code to btree_keys
  bcache: Convert btree_iter to struct btree_keys
  bcache: Refactor bset_tree sysfs stats
  ...
2014-01-30 11:40:10 -08:00
Linus Torvalds f568849eda Merge branch 'for-3.14/core' of git://git.kernel.dk/linux-block
Pull core block IO changes from Jens Axboe:
 "The major piece in here is the immutable bio_ve series from Kent, the
  rest is fairly minor.  It was supposed to go in last round, but
  various issues pushed it to this release instead.  The pull request
  contains:

   - Various smaller blk-mq fixes from different folks.  Nothing major
     here, just minor fixes and cleanups.

   - Fix for a memory leak in the error path in the block ioctl code
     from Christian Engelmayer.

   - Header export fix from CaiZhiyong.

   - Finally the immutable biovec changes from Kent Overstreet.  This
     enables some nice future work on making arbitrarily sized bios
     possible, and splitting more efficient.  Related fixes to immutable
     bio_vecs:

        - dm-cache immutable fixup from Mike Snitzer.
        - btrfs immutable fixup from Muthu Kumar.

  - bio-integrity fix from Nic Bellinger, which is also going to stable"

* 'for-3.14/core' of git://git.kernel.dk/linux-block: (44 commits)
  xtensa: fixup simdisk driver to work with immutable bio_vecs
  block/blk-mq-cpu.c: use hotcpu_notifier()
  blk-mq: for_each_* macro correctness
  block: Fix memory leak in rw_copy_check_uvector() handling
  bio-integrity: Fix bio_integrity_verify segment start bug
  block: remove unrelated header files and export symbol
  blk-mq: uses page->list incorrectly
  blk-mq: use __smp_call_function_single directly
  btrfs: fix missing increment of bi_remaining
  Revert "block: Warn and free bio if bi_end_io is not set"
  block: Warn and free bio if bi_end_io is not set
  blk-mq: fix initializing request's start time
  block: blk-mq: don't export blk_mq_free_queue()
  block: blk-mq: make blk_sync_queue support mq
  block: blk-mq: support draining mq queue
  dm cache: increment bi_remaining when bi_end_io is restored
  block: fixup for generic bio chaining
  block: Really silence spurious compiler warnings
  block: Silence spurious compiler warnings
  block: Kill bio_pair_split()
  ...
2014-01-30 11:19:05 -08:00
Andrew Morton 381d3ee33b block/blk-mq-cpu.c: use hotcpu_notifier()
Cleaner, reduces text size when cpu hotplug is disabled.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2014-01-28 09:52:01 -07:00
Kent Overstreet 6f6b5d1ec5 percpu_ida: Make percpu_ida_alloc + callers accept task state bitmask
This patch changes percpu_ida_alloc() + callers to accept task state
bitmask for prepare_to_wait() for code like target/iscsi that needs
it for interruptible sleep, that is provided in a subsequent patch.

It now expects TASK_UNINTERRUPTIBLE when the caller is able to sleep
waiting for a new tag, or TASK_RUNNING when the caller cannot sleep,
and is forced to return a negative value when no tags are available.

v2 changes:
  - Include blk-mq + tcm_fc + vhost/scsi + target/iscsi changes
  - Drop signal_pending_state() call
v3 changes:
  - Only call prepare_to_wait() + finish_wait() when != TASK_RUNNING
    (PeterZ)

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: <stable@vger.kernel.org> #3.12+
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2014-01-23 20:17:18 +00:00
Christian Engelmayer 17a05cca99 block: Fix memory leak in rw_copy_check_uvector() handling
Fix a memory leak in the error handling path of function sg_io()
that is used during the processing of scsi ioctl. Memory already
allocated by rw_copy_check_uvector() needs to be freed correctly.
Detected by Coverity: CID 1128953.

Signed-off-by: Christian Engelmayer <cengelma@gmx.at>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2014-01-21 20:36:17 -08:00
CaiZhiyong bca266b388 block: remove unrelated header files and export symbol
Fix up the following items:

 - remove unrelated header files.
 - export interface function.
 - modify function cmdline_parts_parse return value, this will make
   it more friendly for the caller.

Signed-off-by: CaiZhiyong <caizhiyong@huawei.com>
Cc: Ezequiel Garcia <ezequiel.garcia@free-electrons.com>
CC: Brian Norris <computersforpeace@gmail.com>
Cc: "Wanglin (Albert)" <albert.wanglin@hisilicon.com>
Cc: Artem Bityutskiy <dedekind1@gmail.com>
Cc: Karel Zak <kzak@redhat.com>
Cc: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2014-01-21 20:18:26 -08:00
Linus Torvalds f075e0f699 Merge branch 'for-3.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
 "The bulk of changes are cleanups and preparations for the upcoming
  kernfs conversion.

   - cgroup_event mechanism which is and will be used only by memcg is
     moved to memcg.

   - pidlist handling is updated so that it can be served by seq_file.

     Also, the list is not sorted if sane_behavior.  cgroup
     documentation explicitly states that the file is not sorted but it
     has been for quite some time.

   - All cgroup file handling now happens on top of seq_file.  This is
     to prepare for kernfs conversion.  In addition, all operations are
     restructured so that they map 1-1 to kernfs operations.

   - Other cleanups and low-pri fixes"

* 'for-3.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (40 commits)
  cgroup: trivial style updates
  cgroup: remove stray references to css_id
  doc: cgroups: Fix typo in doc/cgroups
  cgroup: fix fail path in cgroup_load_subsys()
  cgroup: fix missing unlock on error in cgroup_load_subsys()
  cgroup: remove for_each_root_subsys()
  cgroup: implement for_each_css()
  cgroup: factor out cgroup_subsys_state creation into create_css()
  cgroup: combine css handling loops in cgroup_create()
  cgroup: reorder operations in cgroup_create()
  cgroup: make for_each_subsys() useable under cgroup_root_mutex
  cgroup: css iterations and css_from_dir() are safe under cgroup_mutex
  cgroup: unify pidlist and other file handling
  cgroup: replace cftype->read_seq_string() with cftype->seq_show()
  cgroup: attach cgroup_open_file to all cgroup files
  cgroup: generalize cgroup_pidlist_open_file
  cgroup: unify read path so that seq_file is always used
  cgroup: unify cgroup_write_X64() and cgroup_write_string()
  cgroup: remove cftype->read(), ->read_map() and ->write()
  hugetlb_cgroup: convert away from cftype->read()
  ...
2014-01-21 17:51:34 -08:00
Dave Hansen 6753471c0c blk-mq: uses page->list incorrectly
'struct page' has two list_head fields: 'lru' and 'list'.  Conveniently,
they are unioned together.  This means that code can use them
interchangably, which gets horribly confusing.

The blk-mq made the logical decision to try to use page->list.  But, that
field was actually introduced just for the slub code.  ->lru is the right
field to use outside of slab/slub.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2014-01-08 20:17:46 -07:00