Commit Graph

3853 Commits

Author SHA1 Message Date
Filippo Muzzini ac857e0d54 block, bfq: remove the removal of 'next' rq in bfq_requests_merged
Since bfq_finish_request() is always called on the request 'next',
after bfq_requests_merged() is finished, and bfq_finish_request()
removes 'next' from its bfq_queue if needed, it isn't necessary to do
such a removal in advance in bfq_merged_requests().

This commit removes such a useless 'next' removal.

Signed-off-by: Filippo Muzzini <filippo.muzzini@outlook.it>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-31 08:48:32 -06:00
Paolo Valente 8abfa4d6fd block, bfq: remove wrong check in bfq_requests_merged
The request rq passed to the function bfq_requests_merged is always in
a bfq_queue, so the check !RB_EMPTY_NODE(&rq->rb_node) at the
beginning of bfq_requests_merged always succeeds, and the control
flow systematically skips to the end of the function.  This implies
that the body of the function is never executed, i.e., the
repositioning of rq is never performed.

On the opposite end, a control is missing in the body of the function:
'next' must be removed only if it is inside a bfq_queue.

This commit removes the wrong check on rq, and adds the missing check
on 'next'. In addition, this commit adds comments on
bfq_requests_merged.

Signed-off-by: Filippo Muzzini <filippo.muzzini@outlook.it>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-31 08:48:05 -06:00
Filippo Muzzini a12bffebc0 block, bfq: remove wrong lock in bfq_requests_merged
In bfq_requests_merged(), there is a deadlock because the lock on
bfqq->bfqd->lock is held by the calling function, but the code of
this function tries to grab the lock again.

This deadlock is currently hidden by another bug (fixed by next commit
for this source file), which causes the body of bfq_requests_merged()
to be never executed.

This commit removes the deadlock by removing the lock/unlock pair.

Signed-off-by: Filippo Muzzini <filippo.muzzini@outlook.it>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-31 08:42:27 -06:00
Jens Axboe 04c4950d5b block: fixup bioset_integrity_create() call
Missed converting the bioset_integrity_create() bounce bio set
call.

Fixes: 338aa96d56 ("block: convert bounce, q->bio_split to bioset_init()/mempool_init()")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-30 18:51:21 -06:00
Kent Overstreet dad0852752 block: Drop bioset_create()
All users have been converted to bioset_init(), kill off the
old API.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-30 15:33:32 -06:00
Kent Overstreet 338aa96d56 block: convert bounce, q->bio_split to bioset_init()/mempool_init()
Convert the core block functionality to embedded bio sets.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-30 15:33:32 -06:00
Chengguang Xu 0b6bad7d66 blk-throttle: return proper bool type to caller instead of 0/1
Change to return true/false only for bool type return code.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-30 12:48:22 -06:00
Christoph Hellwig d250bf4e77 blk-mq: only iterate over inflight requests in blk_mq_tagset_busy_iter
We already check for started commands in all callbacks, but we should
also protect against already completed commands.  Do this by taking
the checks to common code.

Acked-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-30 11:31:34 -06:00
Liu Bo 2ab74cd296 blk-throttle: fix potential NULL pointer dereference in throtl_select_dispatch
tg in throtl_select_dispatch is used first and then do check. Since tg
may be NULL, it has potential NULL pointer dereference risk. So fix
it.

Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-30 10:54:33 -06:00
Jianchao Wang a6088845c2 block: kyber: make kyber more friendly with merging
Currently, kyber is very unfriendly with merging. kyber depends
on ctx rq_list to do merging, however, most of time, it will not
leave any requests in ctx rq_list. This is because even if tokens
of one domain is used up, kyber will try to dispatch requests
from other domain and flush the rq_list there.

To improve this, we setup kyber_ctx_queue (kcq) which is similar
with ctx, but it has rq_lists for different domain and build same
mapping between kcq and khd as the ctx & hctx. Then we could merge,
insert and dispatch for different domains separately. At the same
time, only flush the rq_list of kcq when get domain token successfully.
Then if one domain token is used up, the requests could be left in
the rq_list of that domain and maybe merged with following io.

Following is my test result on machine with 8 cores and NVMe card
INTEL SSDPEKKR128G7

fio size=256m ioengine=libaio iodepth=64 direct=1 numjobs=8
seq/random
+------+---------------------------------------------------------------+
|patch?| bw(MB/s) |   iops    | slat(usec) |    clat(usec)   |  merge  |
+----------------------------------------------------------------------+
| w/o  |  606/612 | 151k/153k |  6.89/7.03 | 3349.21/3305.40 |   0/0   |
+----------------------------------------------------------------------+
| w/   | 1083/616 | 277k/154k |  4.93/6.95 | 1830.62/3279.95 | 223k/3k |
+----------------------------------------------------------------------+
When set numjobs to 16, the bw and iops could reach 1662MB/s and 425k
on my platform.

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-30 10:47:40 -06:00
Jens Axboe 9c55873464 blk-mq: abstract out blk-mq-sched rq list iteration bio merge helper
No functional changes in this patch, just a prep patch for utilizing
this in an IO scheduler.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Omar Sandoval <osandov@fb.com>
2018-05-30 10:43:58 -06:00
Christoph Hellwig 5de815a7ee block: remove parent device reference from struct bsg_class_device
Bsg holding a reference to the parent device may result in a crash if a
bsg file handle is closed after the parent device driver has unloaded.

Holding a reference is not really needed: the parent device must exist
between bsg_register_queue and bsg_unregister_queue.  Before the device
goes away the caller does blk_cleanup_queue so that all in-flight
requests to the device are gone and all new requests cannot pass beyond
the queue.  The queue itself is a refcounted object and it will stay
alive with a bsg file.

Based on analysis, previous patch and changelog from Anatoliy Glagolev.

Reported-by: Anatoliy Glagolev <glagolig@gmail.com>
Reviewed-by: James E.J. Bottomley <jejb@linux.vnet.ibm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-29 13:00:25 -06:00
Christoph Hellwig 5afb78356c block: don't print a message when the device went away
The information about a size change in this case just creates confusion.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-29 08:59:21 -06:00
Christoph Hellwig d1210d5afb blk-mq: simplify blk_mq_rq_timed_out
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-29 08:59:21 -06:00
Christoph Hellwig f6e7d48a78 block: remove BLK_EH_HANDLED
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-29 08:59:21 -06:00
Christoph Hellwig 6600593cbd block: rename BLK_EH_NOT_HANDLED to BLK_EH_DONE
The BLK_EH_NOT_HANDLED implies nothing happen, but very often that
is not what is happening - instead the driver already completed the
command.  Fix the symbolic name to reflect that a little better.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-29 08:59:21 -06:00
Keith Busch 12f5b93145 blk-mq: Remove generation seqeunce
This patch simplifies the timeout handling by relying on the request
reference counting to ensure the iterator is operating on an inflight
and truly timed out request. Since the reference counting prevents the
tag from being reallocated, the block layer no longer needs to prevent
drivers from completing their requests while the timeout handler is
operating on it: a driver completing a request is allowed to proceed to
the next state without additional syncronization with the block layer.

This also removes any need for generation sequence numbers since the
request lifetime is prevented from being reallocated as a new sequence
while timeout handling is operating on it.

To enables this a refcount is added to struct request so that request
users can be sure they're operating on the same request without it
changing while they're processing it.  The request's tag won't be
released for reuse until both the timeout handler and the completion
are done with it.

Signed-off-by: Keith Busch <keith.busch@intel.com>
[hch: slight cleanups, added back submission side hctx lock, use cmpxchg
 for completions]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-29 08:59:21 -06:00
Keith Busch ad103e7983 blk-mq: Fix timeout and state order
The block layer had been setting the state to in-flight prior to updating
the timer. This is the wrong order since the timeout handler could observe
the in-flight state with the older timeout, believing the request had
expired when in fact it is just getting started.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-29 08:47:40 -06:00
Joe Perches 5657a819a8 block drivers/block: Use octal not symbolic permissions
Convert the S_<FOO> symbolic permissions to their octal equivalents as
using octal and not symbolic permissions is preferred by many as more
readable.

see: https://lkml.org/lkml/2016/8/2/1945

Done with automated conversion via:
$ ./scripts/checkpatch.pl -f --types=SYMBOLIC_PERMS --fix-inplace <files...>

Miscellanea:

o Wrapped modified multi-line calls to a single line where appropriate
o Realign modified multi-line calls to open parenthesis

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-24 13:38:59 -06:00
Ming Lei e6fc464987 blk-mq: avoid starving tag allocation after allocating process migrates
When the allocation process is scheduled back and the mapped hw queue is
changed, fake one extra wake up on previous queue for compensating wake
up miss, so other allocations on the previous queue won't be starved.

This patch fixes one request allocation hang issue, which can be
triggered easily in case of very low nr_request.

The race is as follows:

1) 2 hw queues, nr_requests are 2, and wake_batch is one

2) there are 3 waiters on hw queue 0

3) two in-flight requests in hw queue 0 are completed, and only two
   waiters of 3 are waken up because of wake_batch, but both the two
   waiters can be scheduled to another CPU and cause to switch to hw
   queue 1

4) then the 3rd waiter will wait for ever, since no in-flight request
   is in hw queue 0 any more.

5) this patch fixes it by the fake wakeup when waiter is scheduled to
   another hw queue

Cc: <stable@vger.kernel.org>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>

Modified commit message to make it clearer, and make it apply on
top of the 4.18 branch.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-24 11:00:39 -06:00
Bart Van Assche 327ea4adcf blkdev_report_zones_ioctl(): Use vmalloc() to allocate large buffers
Avoid that complaints similar to the following appear in the kernel log
if the number of zones is sufficiently large:

  fio: page allocation failure: order:9, mode:0x140c0c0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), nodemask=(null)
  Call Trace:
  dump_stack+0x63/0x88
  warn_alloc+0xf5/0x190
  __alloc_pages_slowpath+0x8f0/0xb0d
  __alloc_pages_nodemask+0x242/0x260
  alloc_pages_current+0x6a/0xb0
  kmalloc_order+0x18/0x50
  kmalloc_order_trace+0x26/0xb0
  __kmalloc+0x20e/0x220
  blkdev_report_zones_ioctl+0xa5/0x1a0
  blkdev_ioctl+0x1ba/0x930
  block_ioctl+0x41/0x50
  do_vfs_ioctl+0xaa/0x610
  SyS_ioctl+0x79/0x90
  do_syscall_64+0x79/0x1b0
  entry_SYSCALL_64_after_hwframe+0x3d/0xa2

Fixes: 3ed05a987e ("blk-zoned: implement ioctls")
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Shaun Tancheff <shaun.tancheff@seagate.com>
Cc: Damien Le Moal <damien.lemoal@hgst.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-22 11:58:07 -06:00
huhai b4f6f38d9f blk-mq: remove wrong 'unlikely' check
When dispatch_rq_from_ctx is called, in the vast majority of cases
the ctx->rq_list is not empty.

Signed-off-by: huhai <huhai@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-22 08:38:04 -06:00
huhai d416c92c5d blk-mq: clear hctx->dispatch_from when mappings change
When the number of hardware queues is changed, the drivers will call
blk_mq_update_nr_hw_queues() to remap hardware queues. This changes
the ctx mappings, but the current code doesn't clear the
->dispatch_from hint. This can result in dispatch_from pointing to
a ctx that isn't mapped to the hctx anymore.

Fixes: b347689ffb ("blk-mq-sched: improve dispatching from sw queue")
Signed-off-by: huhai <huhai@kylinos.cn>
Reviewed-by: Ming Lei <ming.lei@redhat.com>

Moved the placement of the clearing to where we clear other items
pertaining to the existing mapping, added Fixes line, and reworded
the commit message.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-18 08:35:38 -06:00
huhai 8fa9f55645 blk-mq: remove redundant insert case in blk_mq_make_request()
We can use blk_mq_sched_insert_request() even if we don't have
an IO scheduler attached, since that case will end up being
exactly the same as what blk_mq_queue_io() was doing now.

Signed-off-by: huhai <huhai@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-16 08:22:47 -06:00
Kent Overstreet 6fcefbe578 block: Add sysfs entry for fua support
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-14 13:16:17 -06:00
Kent Overstreet 1900fcc461 block: Export bio check/set pages_dirty
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-14 13:16:15 -06:00
Kent Overstreet 0ba99ca483 block: Add warning for bi_next not NULL in bio_endio()
Recently found a bug where a driver left bi_next not NULL and then
called bio_endio(), and then the submitter of the bio used
bio_copy_data() which was treating src and dst as lists of bios.

Fixed that bug by splitting out bio_list_copy_data(), but in case other
things are depending on bi_next in weird ways, add a warning to help
avoid more bugs like that in the future.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-14 13:16:13 -06:00
Kent Overstreet 6e6e811d74 block: Add missing flush_dcache_page() call
Since a bio can point to userspace pages (e.g. direct IO), this is
generally necessary.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-14 13:16:11 -06:00
Kent Overstreet 45db54d58d block: Split out bio_list_copy_data()
Found a bug (with ASAN) where we were passing a bio to bio_copy_data()
with bi_next not NULL, when it should have been - a driver had left
bi_next set to something after calling bio_endio().

Since the normal case is only copying single bios, split out
bio_list_copy_data() to avoid more bugs like this in the future.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-14 13:16:10 -06:00
Kent Overstreet 38a72dac48 block: Add bio_copy_data_iter(), zero_fill_bio_iter()
Add versions that take bvec_iter args instead of using bio->bi_iter - to
be used by bcachefs.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-14 13:16:08 -06:00
Kent Overstreet f4f8154a08 block: Use bioset_init() for fs_bio_set
Minor optimization - remove a pointer indirection when using fs_bio_set.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-14 13:16:06 -06:00
Kent Overstreet 917a38c71a block: Add bioset_init()/bioset_exit()
Similarly to mempool_init()/mempool_exit(), take a pointer indirection
out of allocation/freeing by allowing biosets to be embedded in other
structs.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-14 13:16:04 -06:00
Kent Overstreet 8aa6ba2f6e block: Convert bio_set to mempool_init()
Minor performance improvement by getting rid of pointer indirections
from allocation/freeing fastpaths.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-14 13:16:03 -06:00
Christoph Hellwig 0eb0b63c1d block: consistently use GFP_NOIO instead of __GFP_NORECLAIM
Same numerical value (for now at least), but a much better documentation
of intent.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-14 08:55:18 -06:00
Christoph Hellwig c3036021c7 block: use GFP_NOIO instead of __GFP_DIRECT_RECLAIM
We just can't do I/O when doing block layer requests allocations,
so use GFP_NOIO instead of the even more limited __GFP_DIRECT_RECLAIM.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-14 08:55:16 -06:00
Christoph Hellwig 4accf5fc79 block: pass an explicit gfp_t to get_request
blk_old_get_request already has it at hand, and in blk_queue_bio, which
is the fast path, it is constant.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-14 08:55:14 -06:00
Christoph Hellwig ff005a0662 block: sanitize blk_get_request calling conventions
Switch everyone to blk_get_request_flags, and then rename
blk_get_request_flags to blk_get_request.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-14 08:55:12 -06:00
Christoph Hellwig a9a14d3671 block: fix __get_request documentation
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-14 08:55:11 -06:00
Jens Axboe 2882064076 kyber-iosched: update shallow depth when setting up hardware queue
We don't expect the async depth to be smaller than the wake batch
count for sbitmap, but just in case, inform sbitmap of what shallow
depth kyber may use.

Acked-by: Paolo Valente <paolo.valente@linaro.org>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-10 11:27:46 -06:00
Jens Axboe 483b7bf2e4 bfq-iosched: update shallow depth to smallest one used
If our shallow depth is smaller than the wake batching of sbitmap,
we can introduce hangs. Ensure that sbitmap knows how low we'll go.

Acked-by: Paolo Valente <paolo.valente@linaro.org>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-10 11:27:41 -06:00
Jens Axboe bd7d4ef6a4 bfq-iosched: remove unused variable
bfqd->sb_shift was attempted used as a cache for the sbitmap queue
shift, but we don't need it, as it never changes. Kill it with fire.

Acked-by: Paolo Valente <paolo.valente@linaro.org>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-10 11:27:31 -06:00
Jens Axboe f0635b8a41 bfq: calculate shallow depths at init time
It doesn't change, so don't put it in the per-IO hot path.

Acked-by: Paolo Valente <paolo.valente@linaro.org>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-10 11:27:29 -06:00
Jens Axboe 55141366de bfq-iosched: don't worry about reserved tags in limit_depth
Reserved tags are used for error handling, we don't need to
care about them for regular IO. The core won't call us for these
anyway.

Acked-by: Paolo Valente <paolo.valente@linaro.org>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-10 11:27:17 -06:00
Jens Axboe 17a5119932 blk-mq: don't call into depth limiting for reserved tags
It's not useful, they are internal and/or error handling recovery
commands.

Acked-by: Paolo Valente <paolo.valente@linaro.org>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-10 11:27:12 -06:00
Paolo Valente 18e5a57d79 block, bfq: postpone rq preparation to insert or merge
When invoked for an I/O request rq, the prepare_request hook of bfq
increments reference counters in the destination bfq_queue for rq. In
this respect, after this hook has been invoked, rq may still be
transformed into a request with no icq attached, i.e., for bfq, a
request not associated with any bfq_queue. No further hook is invoked
to signal this tranformation to bfq (in general, to the destination
elevator for rq). This leads bfq into an inconsistent state, because
bfq has no chance to correctly lower these counters back. This
inconsistency may in its turn cause incorrect scheduling and hangs. It
certainly causes memory leaks, by making it impossible for bfq to free
the involved bfq_queue.

On the bright side, no transformation can still happen for rq after rq
has been inserted into bfq, or merged with another, already inserted,
request. Exploiting this fact, this commit addresses the above issue
by delaying the preparation of an I/O request to when the request is
inserted or merged.

This change also gives a performance bonus: a lock-contention point
gets removed. To prepare a request, bfq needs to hold its scheduler
lock. After postponing request preparation to insertion or merging, no
lock needs to be grabbed any longer in the prepare_request hook, while
the lock already taken to perform insertion or merging is used to
preparare the request as well.

Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-10 10:16:29 -06:00
Omar Sandoval 522a777566 block: consolidate struct request timestamp fields
Currently, struct request has four timestamp fields:

- A start time, set at get_request time, in jiffies, used for iostats
- An I/O start time, set at start_request time, in ktime nanoseconds,
  used for blk-stats (i.e., wbt, kyber, hybrid polling)
- Another start time and another I/O start time, used for cfq and bfq

These can all be consolidated into one start time and one I/O start
time, both in ktime nanoseconds, shaving off up to 16 bytes from struct
request depending on the kernel config.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-09 08:33:09 -06:00
Omar Sandoval 4bc6339a58 block: move blk_stat_add() to __blk_mq_end_request()
We want this next to blk_account_io_done() for the next change so that
we can call ktime_get() only once for both.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-09 08:33:07 -06:00
Omar Sandoval 84c7afcebe block: use ktime_get_ns() instead of sched_clock() for cfq and bfq
cfq and bfq have some internal fields that use sched_clock() which can
trivially use ktime_get_ns() instead. Their timestamp fields in struct
request can also use ktime_get_ns(), which resolves the 8 year old
comment added by commit 28f4197e5d ("block: disable preemption before
using sched_clock()").

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-09 08:33:06 -06:00
Omar Sandoval 544ccc8dc9 block: get rid of struct blk_issue_stat
struct blk_issue_stat squashes three things into one u64:

- The time the driver started working on a request
- The original size of the request (for the io.low controller)
- Flags for writeback throttling

It turns out that on x86_64, we have a 4 byte hole in struct request
which we can fill with the non-timestamp fields from blk_issue_stat,
simplifying things quite a bit.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-09 08:33:05 -06:00
Omar Sandoval 5238dcf413 block: replace bio->bi_issue_stat with bio-specific type
struct blk_issue_stat is going away, and bio->bi_issue_stat doesn't even
use the blk-stats interface, so we can provide a separate implementation
specific for bios. The helpers work the same way as the blk-stats
helpers.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-09 08:33:03 -06:00