547 lines
23 KiB
Plaintext
547 lines
23 KiB
Plaintext
BFQ (Budget Fair Queueing)
|
|
==========================
|
|
|
|
BFQ is a proportional-share I/O scheduler, with some extra
|
|
low-latency capabilities. In addition to cgroups support (blkio or io
|
|
controllers), BFQ's main features are:
|
|
- BFQ guarantees a high system and application responsiveness, and a
|
|
low latency for time-sensitive applications, such as audio or video
|
|
players;
|
|
- BFQ distributes bandwidth, and not just time, among processes or
|
|
groups (switching back to time distribution when needed to keep
|
|
throughput high).
|
|
|
|
In its default configuration, BFQ privileges latency over
|
|
throughput. So, when needed for achieving a lower latency, BFQ builds
|
|
schedules that may lead to a lower throughput. If your main or only
|
|
goal, for a given device, is to achieve the maximum-possible
|
|
throughput at all times, then do switch off all low-latency heuristics
|
|
for that device, by setting low_latency to 0. Full details in Section 3.
|
|
|
|
On average CPUs, the current version of BFQ can handle devices
|
|
performing at most ~30K IOPS; at most ~50 KIOPS on faster CPUs. As a
|
|
reference, 30-50 KIOPS correspond to very high bandwidths with
|
|
sequential I/O (e.g., 8-12 GB/s if I/O requests are 256 KB large), and
|
|
to 120-200 MB/s with 4KB random I/O. BFQ has not yet been tested on
|
|
multi-queue devices.
|
|
|
|
The table of contents follow. Impatients can just jump to Section 3.
|
|
|
|
CONTENTS
|
|
|
|
1. When may BFQ be useful?
|
|
1-1 Personal systems
|
|
1-2 Server systems
|
|
2. How does BFQ work?
|
|
3. What are BFQ's tunable?
|
|
4. BFQ group scheduling
|
|
4-1 Service guarantees provided
|
|
4-2 Interface
|
|
|
|
1. When may BFQ be useful?
|
|
==========================
|
|
|
|
BFQ provides the following benefits on personal and server systems.
|
|
|
|
1-1 Personal systems
|
|
--------------------
|
|
|
|
Low latency for interactive applications
|
|
|
|
Regardless of the actual background workload, BFQ guarantees that, for
|
|
interactive tasks, the storage device is virtually as responsive as if
|
|
it was idle. For example, even if one or more of the following
|
|
background workloads are being executed:
|
|
- one or more large files are being read, written or copied,
|
|
- a tree of source files is being compiled,
|
|
- one or more virtual machines are performing I/O,
|
|
- a software update is in progress,
|
|
- indexing daemons are scanning filesystems and updating their
|
|
databases,
|
|
starting an application or loading a file from within an application
|
|
takes about the same time as if the storage device was idle. As a
|
|
comparison, with CFQ, NOOP or DEADLINE, and in the same conditions,
|
|
applications experience high latencies, or even become unresponsive
|
|
until the background workload terminates (also on SSDs).
|
|
|
|
Low latency for soft real-time applications
|
|
|
|
Also soft real-time applications, such as audio and video
|
|
players/streamers, enjoy a low latency and a low drop rate, regardless
|
|
of the background I/O workload. As a consequence, these applications
|
|
do not suffer from almost any glitch due to the background workload.
|
|
|
|
Higher speed for code-development tasks
|
|
|
|
If some additional workload happens to be executed in parallel, then
|
|
BFQ executes the I/O-related components of typical code-development
|
|
tasks (compilation, checkout, merge, ...) much more quickly than CFQ,
|
|
NOOP or DEADLINE.
|
|
|
|
High throughput
|
|
|
|
On hard disks, BFQ achieves up to 30% higher throughput than CFQ, and
|
|
up to 150% higher throughput than DEADLINE and NOOP, with all the
|
|
sequential workloads considered in our tests. With random workloads,
|
|
and with all the workloads on flash-based devices, BFQ achieves,
|
|
instead, about the same throughput as the other schedulers.
|
|
|
|
Strong fairness, bandwidth and delay guarantees
|
|
|
|
BFQ distributes the device throughput, and not just the device time,
|
|
among I/O-bound applications in proportion their weights, with any
|
|
workload and regardless of the device parameters. From these bandwidth
|
|
guarantees, it is possible to compute tight per-I/O-request delay
|
|
guarantees by a simple formula. If not configured for strict service
|
|
guarantees, BFQ switches to time-based resource sharing (only) for
|
|
applications that would otherwise cause a throughput loss.
|
|
|
|
1-2 Server systems
|
|
------------------
|
|
|
|
Most benefits for server systems follow from the same service
|
|
properties as above. In particular, regardless of whether additional,
|
|
possibly heavy workloads are being served, BFQ guarantees:
|
|
|
|
. audio and video-streaming with zero or very low jitter and drop
|
|
rate;
|
|
|
|
. fast retrieval of WEB pages and embedded objects;
|
|
|
|
. real-time recording of data in live-dumping applications (e.g.,
|
|
packet logging);
|
|
|
|
. responsiveness in local and remote access to a server.
|
|
|
|
|
|
2. How does BFQ work?
|
|
=====================
|
|
|
|
BFQ is a proportional-share I/O scheduler, whose general structure,
|
|
plus a lot of code, are borrowed from CFQ.
|
|
|
|
- Each process doing I/O on a device is associated with a weight and a
|
|
(bfq_)queue.
|
|
|
|
- BFQ grants exclusive access to the device, for a while, to one queue
|
|
(process) at a time, and implements this service model by
|
|
associating every queue with a budget, measured in number of
|
|
sectors.
|
|
|
|
- After a queue is granted access to the device, the budget of the
|
|
queue is decremented, on each request dispatch, by the size of the
|
|
request.
|
|
|
|
- The in-service queue is expired, i.e., its service is suspended,
|
|
only if one of the following events occurs: 1) the queue finishes
|
|
its budget, 2) the queue empties, 3) a "budget timeout" fires.
|
|
|
|
- The budget timeout prevents processes doing random I/O from
|
|
holding the device for too long and dramatically reducing
|
|
throughput.
|
|
|
|
- Actually, as in CFQ, a queue associated with a process issuing
|
|
sync requests may not be expired immediately when it empties. In
|
|
contrast, BFQ may idle the device for a short time interval,
|
|
giving the process the chance to go on being served if it issues
|
|
a new request in time. Device idling typically boosts the
|
|
throughput on rotational devices, if processes do synchronous
|
|
and sequential I/O. In addition, under BFQ, device idling is
|
|
also instrumental in guaranteeing the desired throughput
|
|
fraction to processes issuing sync requests (see the description
|
|
of the slice_idle tunable in this document, or [1, 2], for more
|
|
details).
|
|
|
|
- With respect to idling for service guarantees, if several
|
|
processes are competing for the device at the same time, but
|
|
all processes (and groups, after the following commit) have
|
|
the same weight, then BFQ guarantees the expected throughput
|
|
distribution without ever idling the device. Throughput is
|
|
thus as high as possible in this common scenario.
|
|
|
|
- If low-latency mode is enabled (default configuration), BFQ
|
|
executes some special heuristics to detect interactive and soft
|
|
real-time applications (e.g., video or audio players/streamers),
|
|
and to reduce their latency. The most important action taken to
|
|
achieve this goal is to give to the queues associated with these
|
|
applications more than their fair share of the device
|
|
throughput. For brevity, we call just "weight-raising" the whole
|
|
sets of actions taken by BFQ to privilege these queues. In
|
|
particular, BFQ provides a milder form of weight-raising for
|
|
interactive applications, and a stronger form for soft real-time
|
|
applications.
|
|
|
|
- BFQ automatically deactivates idling for queues born in a burst of
|
|
queue creations. In fact, these queues are usually associated with
|
|
the processes of applications and services that benefit mostly
|
|
from a high throughput. Examples are systemd during boot, or git
|
|
grep.
|
|
|
|
- As CFQ, BFQ merges queues performing interleaved I/O, i.e.,
|
|
performing random I/O that becomes mostly sequential if
|
|
merged. Differently from CFQ, BFQ achieves this goal with a more
|
|
reactive mechanism, called Early Queue Merge (EQM). EQM is so
|
|
responsive in detecting interleaved I/O (cooperating processes),
|
|
that it enables BFQ to achieve a high throughput, by queue
|
|
merging, even for queues for which CFQ needs a different
|
|
mechanism, preemption, to get a high throughput. As such EQM is a
|
|
unified mechanism to achieve a high throughput with interleaved
|
|
I/O.
|
|
|
|
- Queues are scheduled according to a variant of WF2Q+, named
|
|
B-WF2Q+, and implemented using an augmented rb-tree to preserve an
|
|
O(log N) overall complexity. See [2] for more details. B-WF2Q+ is
|
|
also ready for hierarchical scheduling. However, for a cleaner
|
|
logical breakdown, the code that enables and completes
|
|
hierarchical support is provided in the next commit, which focuses
|
|
exactly on this feature.
|
|
|
|
- B-WF2Q+ guarantees a tight deviation with respect to an ideal,
|
|
perfectly fair, and smooth service. In particular, B-WF2Q+
|
|
guarantees that each queue receives a fraction of the device
|
|
throughput proportional to its weight, even if the throughput
|
|
fluctuates, and regardless of: the device parameters, the current
|
|
workload and the budgets assigned to the queue.
|
|
|
|
- The last, budget-independence, property (although probably
|
|
counterintuitive in the first place) is definitely beneficial, for
|
|
the following reasons:
|
|
|
|
- First, with any proportional-share scheduler, the maximum
|
|
deviation with respect to an ideal service is proportional to
|
|
the maximum budget (slice) assigned to queues. As a consequence,
|
|
BFQ can keep this deviation tight not only because of the
|
|
accurate service of B-WF2Q+, but also because BFQ *does not*
|
|
need to assign a larger budget to a queue to let the queue
|
|
receive a higher fraction of the device throughput.
|
|
|
|
- Second, BFQ is free to choose, for every process (queue), the
|
|
budget that best fits the needs of the process, or best
|
|
leverages the I/O pattern of the process. In particular, BFQ
|
|
updates queue budgets with a simple feedback-loop algorithm that
|
|
allows a high throughput to be achieved, while still providing
|
|
tight latency guarantees to time-sensitive applications. When
|
|
the in-service queue expires, this algorithm computes the next
|
|
budget of the queue so as to:
|
|
|
|
- Let large budgets be eventually assigned to the queues
|
|
associated with I/O-bound applications performing sequential
|
|
I/O: in fact, the longer these applications are served once
|
|
got access to the device, the higher the throughput is.
|
|
|
|
- Let small budgets be eventually assigned to the queues
|
|
associated with time-sensitive applications (which typically
|
|
perform sporadic and short I/O), because, the smaller the
|
|
budget assigned to a queue waiting for service is, the sooner
|
|
B-WF2Q+ will serve that queue (Subsec 3.3 in [2]).
|
|
|
|
- If several processes are competing for the device at the same time,
|
|
but all processes and groups have the same weight, then BFQ
|
|
guarantees the expected throughput distribution without ever idling
|
|
the device. It uses preemption instead. Throughput is then much
|
|
higher in this common scenario.
|
|
|
|
- ioprio classes are served in strict priority order, i.e.,
|
|
lower-priority queues are not served as long as there are
|
|
higher-priority queues. Among queues in the same class, the
|
|
bandwidth is distributed in proportion to the weight of each
|
|
queue. A very thin extra bandwidth is however guaranteed to
|
|
the Idle class, to prevent it from starving.
|
|
|
|
|
|
3. What are BFQ's tunable?
|
|
==========================
|
|
|
|
The tunables back_seek-max, back_seek_penalty, fifo_expire_async and
|
|
fifo_expire_sync below are the same as in CFQ. Their description is
|
|
just copied from that for CFQ. Some considerations in the description
|
|
of slice_idle are copied from CFQ too.
|
|
|
|
per-process ioprio and weight
|
|
-----------------------------
|
|
|
|
Unless the cgroups interface is used (see "4. BFQ group scheduling"),
|
|
weights can be assigned to processes only indirectly, through I/O
|
|
priorities, and according to the relation:
|
|
weight = (IOPRIO_BE_NR - ioprio) * 10.
|
|
|
|
Beware that, if low-latency is set, then BFQ automatically raises the
|
|
weight of the queues associated with interactive and soft real-time
|
|
applications. Unset this tunable if you need/want to control weights.
|
|
|
|
slice_idle
|
|
----------
|
|
|
|
This parameter specifies how long BFQ should idle for next I/O
|
|
request, when certain sync BFQ queues become empty. By default
|
|
slice_idle is a non-zero value. Idling has a double purpose: boosting
|
|
throughput and making sure that the desired throughput distribution is
|
|
respected (see the description of how BFQ works, and, if needed, the
|
|
papers referred there).
|
|
|
|
As for throughput, idling can be very helpful on highly seeky media
|
|
like single spindle SATA/SAS disks where we can cut down on overall
|
|
number of seeks and see improved throughput.
|
|
|
|
Setting slice_idle to 0 will remove all the idling on queues and one
|
|
should see an overall improved throughput on faster storage devices
|
|
like multiple SATA/SAS disks in hardware RAID configuration.
|
|
|
|
So depending on storage and workload, it might be useful to set
|
|
slice_idle=0. In general for SATA/SAS disks and software RAID of
|
|
SATA/SAS disks keeping slice_idle enabled should be useful. For any
|
|
configurations where there are multiple spindles behind single LUN
|
|
(Host based hardware RAID controller or for storage arrays), setting
|
|
slice_idle=0 might end up in better throughput and acceptable
|
|
latencies.
|
|
|
|
Idling is however necessary to have service guarantees enforced in
|
|
case of differentiated weights or differentiated I/O-request lengths.
|
|
To see why, suppose that a given BFQ queue A must get several I/O
|
|
requests served for each request served for another queue B. Idling
|
|
ensures that, if A makes a new I/O request slightly after becoming
|
|
empty, then no request of B is dispatched in the middle, and thus A
|
|
does not lose the possibility to get more than one request dispatched
|
|
before the next request of B is dispatched. Note that idling
|
|
guarantees the desired differentiated treatment of queues only in
|
|
terms of I/O-request dispatches. To guarantee that the actual service
|
|
order then corresponds to the dispatch order, the strict_guarantees
|
|
tunable must be set too.
|
|
|
|
There is an important flipside for idling: apart from the above cases
|
|
where it is beneficial also for throughput, idling can severely impact
|
|
throughput. One important case is random workload. Because of this
|
|
issue, BFQ tends to avoid idling as much as possible, when it is not
|
|
beneficial also for throughput. As a consequence of this behavior, and
|
|
of further issues described for the strict_guarantees tunable,
|
|
short-term service guarantees may be occasionally violated. And, in
|
|
some cases, these guarantees may be more important than guaranteeing
|
|
maximum throughput. For example, in video playing/streaming, a very
|
|
low drop rate may be more important than maximum throughput. In these
|
|
cases, consider setting the strict_guarantees parameter.
|
|
|
|
strict_guarantees
|
|
-----------------
|
|
|
|
If this parameter is set (default: unset), then BFQ
|
|
|
|
- always performs idling when the in-service queue becomes empty;
|
|
|
|
- forces the device to serve one I/O request at a time, by dispatching a
|
|
new request only if there is no outstanding request.
|
|
|
|
In the presence of differentiated weights or I/O-request sizes, both
|
|
the above conditions are needed to guarantee that every BFQ queue
|
|
receives its allotted share of the bandwidth. The first condition is
|
|
needed for the reasons explained in the description of the slice_idle
|
|
tunable. The second condition is needed because all modern storage
|
|
devices reorder internally-queued requests, which may trivially break
|
|
the service guarantees enforced by the I/O scheduler.
|
|
|
|
Setting strict_guarantees may evidently affect throughput.
|
|
|
|
back_seek_max
|
|
-------------
|
|
|
|
This specifies, given in Kbytes, the maximum "distance" for backward seeking.
|
|
The distance is the amount of space from the current head location to the
|
|
sectors that are backward in terms of distance.
|
|
|
|
This parameter allows the scheduler to anticipate requests in the "backward"
|
|
direction and consider them as being the "next" if they are within this
|
|
distance from the current head location.
|
|
|
|
back_seek_penalty
|
|
-----------------
|
|
|
|
This parameter is used to compute the cost of backward seeking. If the
|
|
backward distance of request is just 1/back_seek_penalty from a "front"
|
|
request, then the seeking cost of two requests is considered equivalent.
|
|
|
|
So scheduler will not bias toward one or the other request (otherwise scheduler
|
|
will bias toward front request). Default value of back_seek_penalty is 2.
|
|
|
|
fifo_expire_async
|
|
-----------------
|
|
|
|
This parameter is used to set the timeout of asynchronous requests. Default
|
|
value of this is 248ms.
|
|
|
|
fifo_expire_sync
|
|
----------------
|
|
|
|
This parameter is used to set the timeout of synchronous requests. Default
|
|
value of this is 124ms. In case to favor synchronous requests over asynchronous
|
|
one, this value should be decreased relative to fifo_expire_async.
|
|
|
|
low_latency
|
|
-----------
|
|
|
|
This parameter is used to enable/disable BFQ's low latency mode. By
|
|
default, low latency mode is enabled. If enabled, interactive and soft
|
|
real-time applications are privileged and experience a lower latency,
|
|
as explained in more detail in the description of how BFQ works.
|
|
|
|
DISABLE this mode if you need full control on bandwidth
|
|
distribution. In fact, if it is enabled, then BFQ automatically
|
|
increases the bandwidth share of privileged applications, as the main
|
|
means to guarantee a lower latency to them.
|
|
|
|
In addition, as already highlighted at the beginning of this document,
|
|
DISABLE this mode if your only goal is to achieve a high throughput.
|
|
In fact, privileging the I/O of some application over the rest may
|
|
entail a lower throughput. To achieve the highest-possible throughput
|
|
on a non-rotational device, setting slice_idle to 0 may be needed too
|
|
(at the cost of giving up any strong guarantee on fairness and low
|
|
latency).
|
|
|
|
timeout_sync
|
|
------------
|
|
|
|
Maximum amount of device time that can be given to a task (queue) once
|
|
it has been selected for service. On devices with costly seeks,
|
|
increasing this time usually increases maximum throughput. On the
|
|
opposite end, increasing this time coarsens the granularity of the
|
|
short-term bandwidth and latency guarantees, especially if the
|
|
following parameter is set to zero.
|
|
|
|
max_budget
|
|
----------
|
|
|
|
Maximum amount of service, measured in sectors, that can be provided
|
|
to a BFQ queue once it is set in service (of course within the limits
|
|
of the above timeout). According to what said in the description of
|
|
the algorithm, larger values increase the throughput in proportion to
|
|
the percentage of sequential I/O requests issued. The price of larger
|
|
values is that they coarsen the granularity of short-term bandwidth
|
|
and latency guarantees.
|
|
|
|
The default value is 0, which enables auto-tuning: BFQ sets max_budget
|
|
to the maximum number of sectors that can be served during
|
|
timeout_sync, according to the estimated peak rate.
|
|
|
|
weights
|
|
-------
|
|
|
|
Read-only parameter, used to show the weights of the currently active
|
|
BFQ queues.
|
|
|
|
|
|
wr_ tunables
|
|
------------
|
|
|
|
BFQ exports a few parameters to control/tune the behavior of
|
|
low-latency heuristics.
|
|
|
|
wr_coeff
|
|
|
|
Factor by which the weight of a weight-raised queue is multiplied. If
|
|
the queue is deemed soft real-time, then the weight is further
|
|
multiplied by an additional, constant factor.
|
|
|
|
wr_max_time
|
|
|
|
Maximum duration of a weight-raising period for an interactive task
|
|
(ms). If set to zero (default value), then this value is computed
|
|
automatically, as a function of the peak rate of the device. In any
|
|
case, when the value of this parameter is read, it always reports the
|
|
current duration, regardless of whether it has been set manually or
|
|
computed automatically.
|
|
|
|
wr_max_softrt_rate
|
|
|
|
Maximum service rate below which a queue is deemed to be associated
|
|
with a soft real-time application, and is then weight-raised
|
|
accordingly (sectors/sec).
|
|
|
|
wr_min_idle_time
|
|
|
|
Minimum idle period after which interactive weight-raising may be
|
|
reactivated for a queue (in ms).
|
|
|
|
wr_rt_max_time
|
|
|
|
Maximum weight-raising duration for soft real-time queues (in ms). The
|
|
start time from which this duration is considered is automatically
|
|
moved forward if the queue is detected to be still soft real-time
|
|
before the current soft real-time weight-raising period finishes.
|
|
|
|
wr_min_inter_arr_async
|
|
|
|
Minimum period between I/O request arrivals after which weight-raising
|
|
may be reactivated for an already busy async queue (in ms).
|
|
|
|
|
|
4. Group scheduling with BFQ
|
|
============================
|
|
|
|
BFQ supports both cgroups-v1 and cgroups-v2 io controllers, namely
|
|
blkio and io. In particular, BFQ supports weight-based proportional
|
|
share. To activate cgroups support, set BFQ_GROUP_IOSCHED.
|
|
|
|
4-1 Service guarantees provided
|
|
-------------------------------
|
|
|
|
With BFQ, proportional share means true proportional share of the
|
|
device bandwidth, according to group weights. For example, a group
|
|
with weight 200 gets twice the bandwidth, and not just twice the time,
|
|
of a group with weight 100.
|
|
|
|
BFQ supports hierarchies (group trees) of any depth. Bandwidth is
|
|
distributed among groups and processes in the expected way: for each
|
|
group, the children of the group share the whole bandwidth of the
|
|
group in proportion to their weights. In particular, this implies
|
|
that, for each leaf group, every process of the group receives the
|
|
same share of the whole group bandwidth, unless the ioprio of the
|
|
process is modified.
|
|
|
|
The resource-sharing guarantee for a group may partially or totally
|
|
switch from bandwidth to time, if providing bandwidth guarantees to
|
|
the group lowers the throughput too much. This switch occurs on a
|
|
per-process basis: if a process of a leaf group causes throughput loss
|
|
if served in such a way to receive its share of the bandwidth, then
|
|
BFQ switches back to just time-based proportional share for that
|
|
process.
|
|
|
|
4-2 Interface
|
|
-------------
|
|
|
|
To get proportional sharing of bandwidth with BFQ for a given device,
|
|
BFQ must of course be the active scheduler for that device.
|
|
|
|
Within each group directory, the names of the files associated with
|
|
BFQ-specific cgroup parameters and stats begin with the "bfq."
|
|
prefix. So, with cgroups-v1 or cgroups-v2, the full prefix for
|
|
BFQ-specific files is "blkio.bfq." or "io.bfq." For example, the group
|
|
parameter to set the weight of a group with BFQ is blkio.bfq.weight
|
|
or io.bfq.weight.
|
|
|
|
Parameters to set
|
|
-----------------
|
|
|
|
For each group, there is only the following parameter to set.
|
|
|
|
weight (namely blkio.bfq.weight or io.bfq-weight): the weight of the
|
|
group inside its parent. Available values: 1..10000 (default 100). The
|
|
linear mapping between ioprio and weights, described at the beginning
|
|
of the tunable section, is still valid, but all weights higher than
|
|
IOPRIO_BE_NR*10 are mapped to ioprio 0.
|
|
|
|
Recall that, if low-latency is set, then BFQ automatically raises the
|
|
weight of the queues associated with interactive and soft real-time
|
|
applications. Unset this tunable if you need/want to control weights.
|
|
|
|
|
|
[1] P. Valente, A. Avanzini, "Evolution of the BFQ Storage I/O
|
|
Scheduler", Proceedings of the First Workshop on Mobile System
|
|
Technologies (MST-2015), May 2015.
|
|
http://algogroup.unimore.it/people/paolo/disk_sched/mst-2015.pdf
|
|
|
|
[2] P. Valente and M. Andreolini, "Improving Application
|
|
Responsiveness with the BFQ Disk I/O Scheduler", Proceedings of
|
|
the 5th Annual International Systems and Storage Conference
|
|
(SYSTOR '12), June 2012.
|
|
Slightly extended version:
|
|
http://algogroup.unimore.it/people/paolo/disk_sched/bfq-v1-suite-
|
|
results.pdf
|