o Dominik Klein reported a system hang issue while doing some blkio
throttling testing.
https://lkml.org/lkml/2011/2/24/173
o Some tracing revealed that CFQ was not dispatching any more jobs as
queue unplug was not happening. And queue unplug was not happening
because unplug work was not being called as there was one throttling
work on same cpu which as not finished yet. And throttling work had not
finished as it was tyring to dispatch a bio to CFQ but all the request
descriptors were consume to it was put to sleep.
o So basically it is a cyclic dependecny between CFQ unplug work and
throtl dispatch work. Tejun suggested that use separate workqueue for
such cases.
o This patch uses a separate workqueue for throttle related work and
does not rely on kblockd workqueue anymore.
Cc: stable@kernel.org
Reported-by: Dominik Klein <dk@in-telegence.net>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
o Jeff Moyer was doing some testing on a RAM backed disk and
blkiocg_lookup_group() showed up high overhead after memcpy(). Similarly
somebody else reported that blkiocg_lookup_group() is eating 6% extra
cpu. Though looking at the code I can't think why the overhead of
this function is so high. One thing is that it is called with very high
frequency (once for every IO).
o For lot of folks blkio controller will be compiled in but they might
not have actually created cgroups. Hence optimize the case of root
cgroup where we can avoid calling blkiocg_lookup_group() if IO is happening
in root group (common case).
Reported-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
o I was discussing what are the variable being updated without spin lock and
why do we need barriers and Oleg pointed out that location of smp_rmb()
should be between read of td->limits_changed and tg->limits_changed. This
patch fixes it.
o Following is one possible sequence of events. Say cpu0 is executing
throtl_update_blkio_group_read_bps() and cpu1 is executing
throtl_process_limit_change().
cpu0 cpu1
tg->limits_changed = true;
smp_mb__before_atomic_inc();
atomic_inc(&td->limits_changed);
if (!atomic_read(&td->limits_changed))
return;
if (tg->limits_changed)
do_something;
If cpu0 has updated tg->limits_changed and td->limits_changed, we want to
make sure that if update to td->limits_changed is visible on cpu1, then
update to tg->limits_changed should also be visible.
Oleg pointed out to ensure that we need to insert an smp_rmb() between
td->limits_changed read and tg->limits_changed read.
o I had erroneously put smp_rmb() before atomic_read(&td->limits_changed).
This patch fixes it.
Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
o During some testing I did following and noticed throttling stops working.
- Put a very low limit on a cgroup, say 1 byte per second.
- Start some reads, this will set slice_end to a very high value.
- Change the limit to higher value say 1MB/s
- Now IO unthrottles and finishes as expected.
- Try to do the read again but IO is not limited to 1MB/s as expected.
o What is happening.
- Initially low value of limit sets slice_end to a very high value.
- During updation of limit, slice_end is not being truncated.
- Very high value of slice_end leads to keeping the existing slice
valid for a very long time and new slice does not start.
- tg_may_dispatch() is called in blk_throtle_bio(), and trim_slice()
is not called in this path. So slice_start is some old value and
practically we are able to do huge amount of IO.
o There are many ways it can be fixed. I have fixed it by trying to
adjust/cleanup slice_end in trim_slice(). Generally we extend slices if bio
is big and can't be dispatched in one slice. After dispatch of bio, readjust
the slice_end to make sure we don't end up with huge values.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
o Currently we try to dispatch more READS and less WRITES (75%, 25%) in one
dispatch round. ummy pointed out that there is a bug in max_nr_writes
calculation. This patch fixes it.
Reported-by: ummy y <yummylln@yahoo.com.cn>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
o User can specify max iops value of 32bit (UINT_MAX), through cgroup
interface. If a user has specified say 4294967294 (UNIT_MAX - 2), then
on 32bit platform, following multiplication can overflow.
io_allowed = (tg->iops[rw] * jiffy_elapsed_rnd)
o Explicitly cast the multiplication to 64bit and then perform division and
then check whether result is still great then UNINT_MAX.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
o Do not convert jiffies to mili seconds as it is not required. Just work
with jiffies and HZ.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
o Randy Dunlap reported following linux-next failure. This patch fixes it.
on i386:
blk-throttle.c:(.text+0x1abb8): undefined reference to `__udivdi3'
blk-throttle.c:(.text+0x1b1dc): undefined reference to `__udivdi3'
o bytes_per_second interface is 64bit and I was continuing to do 64 bit
division even on 32bit platform without help of special macros/functions
hence the failure.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reported-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
o Currently any cgroup throttle limit changes are processed asynchronousy and
the change does not take affect till a new bio is dispatched from same group.
o It might happen that a user sets a redicuously low limit on throttling.
Say 1 bytes per second on reads. In such cases simple operations like mount
a disk can wait for a very long time.
o Once bio is throttled, there is no easy way to come out of that wait even if
user increases the read limit later.
o This patch fixes it. Now if a user changes the cgroup limits, we recalculate
the bio dispatch time according to new limits.
o Can't take queueu lock under blkcg_lock, hence after the change I wake
up the dispatch thread again which recalculates the time. So there are some
variables being synchronized across two threads without lock and I had to
make use of barriers. Hoping I have used barriers correctly. Any review of
memory barrier code especially will help.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
o Currently all the dynamically allocated groups, except root grp is added
to td->tg_list. This was not a problem so far but in next patch I will
travel through td->tg_list to process any updates of limits on the group.
If root group is not in tg_list, then root group's updates are not
processed.
o It is better to root group also to tg_list instead of doing special
processing for it during limit updates.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
o Actual implementation of throttling policy in block layer. Currently it
implements READ and WRITE bytes per second throttling logic. IOPS throttling
comes in later patches.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>