docs/multiple-iothreads.txt: add documentation on IOThread programming

This document explains how IOThreads and the main loop are related, especially how to write code that can run in an IOThread. Currently only virtio-blk-data-plane uses these techniques. The next obvious target is virtio-scsi; there has also been work on virtio-net.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>

parent 8cced12143
commit ef558696b5

@@ -0,0 +1,134 @@
Copyright (c) 2014 Red Hat Inc.

This work is licensed under the terms of the GNU GPL, version 2 or later. See
the COPYING file in the top-level directory.


This document explains the IOThread feature and how to write code that runs
outside the QEMU global mutex.

The main loop and IOThreads
---------------------------
QEMU is an event-driven program that can do several things at once using an
event loop. The VNC server and the QMP monitor are both processed from the
same event loop, which monitors their file descriptors until they become
readable and then invokes a callback.

The default event loop is called the main loop (see main-loop.c). It is
possible to create additional event loop threads using -object
iothread,id=my-iothread.

Side note: The main loop and IOThread are both event loops but their code is
not shared completely. Sometimes it is useful to remember that although they
are conceptually similar they are currently not interchangeable.

Why IOThreads are useful
------------------------
IOThreads allow the user to control the placement of work. The main loop is a
scalability bottleneck on hosts with many CPUs. Work can be spread across
several IOThreads instead of just one main loop. When set up correctly this
can improve I/O latency and reduce jitter seen by the guest.

The main loop is also deeply associated with the QEMU global mutex, which is a
scalability bottleneck in itself. vCPU threads and the main loop use the QEMU
global mutex to serialize execution of QEMU code. This mutex is necessary
because a lot of QEMU's code historically was not thread-safe.

The fact that all I/O processing is done in a single main loop and that the
QEMU global mutex is contended by all vCPU threads and the main loop explains
why it is desirable to place work into IOThreads.

The experimental virtio-blk data-plane implementation has been benchmarked and
shows these effects:
ftp://public.dhe.ibm.com/linux/pdfs/KVM_Virtualized_IO_Performance_Paper.pdf

How to program for IOThreads
----------------------------
The main difference between legacy code and new code that can run in an
IOThread is dealing explicitly with the event loop object, AioContext
(see include/block/aio.h). Code that only works in the main loop
implicitly uses the main loop's AioContext. Code that supports running
in IOThreads must be aware of its AioContext.

AioContext supports the following services:
* File descriptor monitoring (read/write/error on POSIX hosts)
* Event notifiers (inter-thread signalling)
* Timers
* Bottom Halves (BH): deferred callbacks

There are several old APIs that use the main loop AioContext:
* LEGACY qemu_aio_set_fd_handler() - monitor a file descriptor
* LEGACY qemu_aio_set_event_notifier() - monitor an event notifier
* LEGACY timer_new_ms() - create a timer
* LEGACY qemu_bh_new() - create a BH
* LEGACY qemu_aio_wait() - run an event loop iteration

Since they implicitly work on the main loop they cannot be used in code that
runs in an IOThread. They might cause a crash or deadlock if called from an
IOThread since the QEMU global mutex is not held.

Instead, use the AioContext functions directly (see include/block/aio.h):
* aio_set_fd_handler() - monitor a file descriptor
* aio_set_event_notifier() - monitor an event notifier
* aio_timer_new() - create a timer
* aio_bh_new() - create a BH
* aio_poll() - run an event loop iteration

The AioContext can be obtained from the IOThread using
iothread_get_aio_context() or for the main loop using qemu_get_aio_context().
Code that takes an AioContext argument works in both IOThreads and the main
loop, depending on which AioContext instance the caller passes in.
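
The idea of passing the event loop explicitly can be illustrated with a toy
model. The sketch below is not QEMU code: ToyEventLoop, toy_bh_new(),
toy_bh_schedule(), and toy_poll() are hypothetical stand-ins that only loosely
mirror AioContext, aio_bh_new(), qemu_bh_schedule(), and aio_poll(), and the
real declarations in include/block/aio.h differ. It only shows that code which
takes the context as an argument runs unchanged on whichever event loop the
caller selects.

```c
/* Toy model only: every type and function here is hypothetical and is NOT
 * part of QEMU.  See include/block/aio.h for the real API. */
#include <stddef.h>

#define TOY_MAX_BH 8

typedef void ToyBHFunc(void *opaque);

typedef struct {
    ToyBHFunc *cb;
    void *opaque;
    int scheduled;
} ToyBH;

typedef struct {
    const char *name;
    ToyBH bhs[TOY_MAX_BH];
    size_t nb_bhs;
} ToyEventLoop;

/* Create a deferred callback owned by the given loop. */
ToyBH *toy_bh_new(ToyEventLoop *loop, ToyBHFunc *cb, void *opaque)
{
    ToyBH *bh;

    if (loop->nb_bhs == TOY_MAX_BH) {
        return NULL;
    }
    bh = &loop->bhs[loop->nb_bhs++];
    bh->cb = cb;
    bh->opaque = opaque;
    bh->scheduled = 0;
    return bh;
}

/* Mark the callback as pending for the next iteration of its loop. */
void toy_bh_schedule(ToyBH *bh)
{
    bh->scheduled = 1;
}

/* Run one iteration on this loop only; returns the number of callbacks run. */
int toy_poll(ToyEventLoop *loop)
{
    int progress = 0;

    for (size_t i = 0; i < loop->nb_bhs; i++) {
        ToyBH *bh = &loop->bhs[i];
        if (bh->scheduled) {
            bh->scheduled = 0;
            bh->cb(bh->opaque);
            progress++;
        }
    }
    return progress;
}

void toy_count_cb(void *opaque)
{
    (*(int *)opaque)++;
}

/* Context-aware code: it never assumes which loop it runs on. */
ToyBH *toy_start_counter(ToyEventLoop *loop, int *counter)
{
    ToyBH *bh = toy_bh_new(loop, toy_count_cb, counter);

    if (bh) {
        toy_bh_schedule(bh);
    }
    return bh;
}
```

Calling toy_start_counter() once with a loop standing in for the main loop and
once with a loop standing in for an IOThread queues one callback on each, and
toy_poll() on one loop leaves the other loop's callback pending.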

How to synchronize with an IOThread
-----------------------------------
AioContext is not thread-safe so some rules must be followed when using file
descriptors, event notifiers, timers, or BHs across threads:

1. AioContext functions can be called safely from file descriptor, event
notifier, timer, or BH callbacks invoked by the AioContext. No locking is
necessary.

2. Other threads wishing to access the AioContext must use
aio_context_acquire()/aio_context_release() for mutual exclusion. Once the
context is acquired no other thread can access it or run event loop iterations
in this AioContext.

aio_context_acquire()/aio_context_release() calls may be nested. This
means you can call them if you're not sure whether #1 applies.
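
Nesting can be pictured with a recursive lock. The following is a minimal
sketch under that assumption and says nothing about how QEMU implements
aio_context_acquire(): ToyLockedCtx and the toy_ctx_*() functions are
hypothetical names, and a POSIX recursive mutex is used purely for
illustration. A helper that always acquires the context works whether or not
its caller already holds it.

```c
/* Toy model only: hypothetical names, NOT QEMU's implementation. */
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>

typedef struct {
    pthread_mutex_t lock;  /* recursive, so the owner may acquire again */
    int depth;             /* current nesting level, protected by lock */
} ToyLockedCtx;

int toy_ctx_init(ToyLockedCtx *ctx)
{
    pthread_mutexattr_t attr;
    int ret;

    ret = pthread_mutexattr_init(&attr);
    if (ret != 0) {
        return ret;
    }
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    ret = pthread_mutex_init(&ctx->lock, &attr);
    pthread_mutexattr_destroy(&attr);
    ctx->depth = 0;
    return ret;
}

void toy_ctx_acquire(ToyLockedCtx *ctx)
{
    pthread_mutex_lock(&ctx->lock);
    ctx->depth++;
}

void toy_ctx_release(ToyLockedCtx *ctx)
{
    ctx->depth--;
    pthread_mutex_unlock(&ctx->lock);
}

/* Helper that does not know whether its caller holds the context.
 * Returns the nesting depth observed while the context was held. */
int toy_helper(ToyLockedCtx *ctx)
{
    int depth;

    toy_ctx_acquire(ctx);
    depth = ctx->depth;
    toy_ctx_release(ctx);
    return depth;
}

/* Caller that already holds the context when it invokes the helper. */
int toy_caller(ToyLockedCtx *ctx)
{
    int depth;

    toy_ctx_acquire(ctx);
    depth = toy_helper(ctx);  /* nested acquire by the same thread */
    toy_ctx_release(ctx);
    return depth;
}
```

In this model toy_helper() sees a depth of 1 when called on its own and a
depth of 2 when called through toy_caller(); in neither case does the second
acquire block.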

There is currently no lock ordering rule if a thread needs to acquire multiple
AioContexts simultaneously. Therefore, it is only safe for code holding the
QEMU global mutex to acquire other AioContexts.

Side note: the best way to schedule a function call across threads is to create
a BH in the target AioContext beforehand and then call qemu_bh_schedule(). No
acquire/release or locking is needed for the qemu_bh_schedule() call. But be
sure to acquire the AioContext for aio_bh_new() if necessary.
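
The cross-thread scheduling pattern can be sketched with a toy event loop
thread. Everything below is an assumption made for illustration (ToyXBH,
ToyXLoop, toy_xbh_schedule(), and the busy-polling loop are hypothetical and
unrelated to QEMU's actual BH implementation, which blocks and is woken up
rather than spinning). The point is only that the BH exists before the other
thread touches it, and that scheduling is a single atomic operation needing no
lock in the caller.

```c
/* Toy model only: hypothetical names, NOT QEMU's BH implementation. */
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>

typedef struct {
    void (*cb)(void *opaque);
    void *opaque;
    atomic_int scheduled;
} ToyXBH;

typedef struct {
    ToyXBH *bh;       /* a single BH slot keeps the sketch short */
    atomic_int stop;
} ToyXLoop;

typedef struct {
    pthread_t ran_in; /* thread that executed the callback */
    atomic_int done;
} ToyXResult;

/* Safe from any thread: only flips an atomic flag. */
void toy_xbh_schedule(ToyXBH *bh)
{
    atomic_store(&bh->scheduled, 1);
}

/* Event loop thread: runs the callback in its own thread when scheduled.
 * A real loop would sleep until notified instead of spinning. */
void *toy_xloop_thread(void *arg)
{
    ToyXLoop *loop = arg;

    while (!atomic_load(&loop->stop)) {
        if (atomic_exchange(&loop->bh->scheduled, 0)) {
            loop->bh->cb(loop->bh->opaque);
        }
        sched_yield();
    }
    return NULL;
}

void toy_record_thread_cb(void *opaque)
{
    ToyXResult *res = opaque;

    res->ran_in = pthread_self();
    atomic_store(&res->done, 1);
}

/* Returns 1 if the callback ran in the loop thread rather than in the
 * scheduling thread, 0 if not, -1 if the thread could not be created. */
int toy_xdemo(void)
{
    ToyXResult res = { .done = 0 };
    /* The BH is fully set up before the loop thread starts. */
    ToyXBH bh = { .cb = toy_record_thread_cb, .opaque = &res, .scheduled = 0 };
    ToyXLoop loop = { .bh = &bh, .stop = 0 };
    pthread_t tid;

    if (pthread_create(&tid, NULL, toy_xloop_thread, &loop) != 0) {
        return -1;
    }
    toy_xbh_schedule(&bh);            /* no lock taken by the caller */
    while (!atomic_load(&res.done)) {
        sched_yield();
    }
    atomic_store(&loop.stop, 1);
    pthread_join(tid, NULL);
    return pthread_equal(res.ran_in, tid) &&
           !pthread_equal(res.ran_in, pthread_self());
}
```

The callback body executes in the loop thread even though the schedule call
was made from another thread, which is the property the side note relies on.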

The relationship between AioContext and the block layer
-------------------------------------------------------
The AioContext originates from the QEMU block layer because it provides a
scoped way of running event loop iterations until all work is done. This
feature is used to complete all in-flight block I/O requests (see
bdrv_drain_all()). Nowadays AioContext is a generic event loop that can be
used by any QEMU subsystem.

The block layer has support for AioContext integrated. Each BlockDriverState
is associated with an AioContext using bdrv_set_aio_context() and
bdrv_get_aio_context(). This allows block layer code to process I/O inside the
right AioContext. Other subsystems may wish to follow a similar approach.

Block layer code must therefore expect to run in an IOThread and avoid using
old APIs that implicitly use the main loop. See the "How to program for
IOThreads" section above for information on how to do that.

If main loop code such as a QMP function wishes to access a BlockDriverState it
must first call aio_context_acquire(bdrv_get_aio_context(bs)) to ensure the
IOThread does not run in parallel.
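
The acquire-before-access rule can be modelled as follows. This is a sketch
under stated assumptions, not QEMU code: ToyBDS, ToyBdsCtx,
toy_bdrv_get_context(), and the sectors_written field are hypothetical, and a
plain pthread mutex stands in for the AioContext lock. One thread plays the
IOThread that owns the device and another plays a QMP command in the main
loop; both only touch the shared state while holding the context.

```c
/* Toy model only: hypothetical names, NOT QEMU's block layer. */
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>

#define TOY_ITERATIONS 100000

typedef struct {
    pthread_mutex_t lock;
} ToyBdsCtx;

typedef struct {
    ToyBdsCtx *ctx;        /* context this device is associated with */
    long sectors_written;  /* only accessed with ctx held */
} ToyBDS;

ToyBdsCtx *toy_bdrv_get_context(ToyBDS *bs)
{
    return bs->ctx;
}

void toy_bds_ctx_acquire(ToyBdsCtx *ctx)
{
    pthread_mutex_lock(&ctx->lock);
}

void toy_bds_ctx_release(ToyBdsCtx *ctx)
{
    pthread_mutex_unlock(&ctx->lock);
}

/* Stands in for an IOThread: each iteration runs with the context held. */
void *toy_iothread_fn(void *arg)
{
    ToyBDS *bs = arg;

    for (int i = 0; i < TOY_ITERATIONS; i++) {
        toy_bds_ctx_acquire(bs->ctx);
        bs->sectors_written++;
        toy_bds_ctx_release(bs->ctx);
    }
    return NULL;
}

/* Stands in for a QMP command: acquire the device's context first. */
void toy_qmp_command(ToyBDS *bs)
{
    ToyBdsCtx *ctx = toy_bdrv_get_context(bs);

    toy_bds_ctx_acquire(ctx);
    bs->sectors_written++;
    toy_bds_ctx_release(ctx);
}

/* Runs both sides concurrently; returns the final counter, or -1 on error. */
long toy_bds_run(void)
{
    ToyBdsCtx ctx;
    ToyBDS bs = { .ctx = &ctx, .sectors_written = 0 };
    pthread_t tid;

    pthread_mutex_init(&ctx.lock, NULL);
    if (pthread_create(&tid, NULL, toy_iothread_fn, &bs) != 0) {
        pthread_mutex_destroy(&ctx.lock);
        return -1;
    }
    for (int i = 0; i < TOY_ITERATIONS; i++) {
        toy_qmp_command(&bs);
    }
    pthread_join(tid, NULL);
    pthread_mutex_destroy(&ctx.lock);
    return bs.sectors_written;
}
```

Because every access happens with the context held, no increment is lost and
the final count is exactly twice TOY_ITERATIONS. Dropping the acquire/release
pair from toy_qmp_command() would turn the counter update into a data race.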

Long-running jobs (usually in the form of coroutines) are best scheduled in the
BlockDriverState's AioContext to avoid the need to acquire/release around each
bdrv_*() call. Be aware that there is currently no mechanism to get notified
when bdrv_set_aio_context() moves this BlockDriverState to a different
AioContext (see bdrv_detach_aio_context()/bdrv_attach_aio_context()), so you
may need to add this if you want to support long-running jobs.