eabc977973
This patch rewrites the ctx->dispatching optimization, which was the cause
of some mysterious hangs that could be reproduced on aarch64 KVM only.
The hangs were indirectly caused by aio_poll() and in particular by
flash memory updates' call to blk_write(), which invokes aio_poll().
Fun stuff: they had an extremely short race window, so short that
adding any kind of tracing to either the kernel or QEMU made them
go away (a single printf made them half as reproducible).
On the plus side, the failure mode (a hang until the next keypress)
made it very easy to examine the state of the process with a debugger.
And there was a very nice reproducer from Laszlo, which failed pretty
often (more than half of the time) on any version of QEMU with a non-debug
kernel; it also failed fast, while still in the firmware. So, it could
have been worse.
For some unknown reason they happened only with virtio-scsi, but
that's not important. It's more interesting that they disappeared with
io=native, making thread-pool.c a likely suspect for where the bug arose.
thread-pool.c is also one of the few places which use bottom halves
across threads, by the way.
I hope that no other similar bugs exist, but just in case :) I am
going to describe how the successful debugging went... Since the
likely culprit was the ctx->dispatching optimization, which mostly
affects bottom halves, the first observation was that there are two
qemu_bh_schedule() invocations in the thread pool: the one in the aio
worker and the one in thread_pool_completion_bh. The latter always
causes the optimization to trigger, the former may or may not. In
order to restrict the possibilities, I introduced new functions
qemu_bh_schedule_slow() and qemu_bh_schedule_fast():

    /* qemu_bh_schedule_slow: */
    ctx = bh->ctx;
    bh->idle = 0;
    if (atomic_xchg(&bh->scheduled, 1) == 0) {
        event_notifier_set(&ctx->notifier);
    }

    /* qemu_bh_schedule_fast: */
    ctx = bh->ctx;
    bh->idle = 0;
    assert(ctx->dispatching);
    atomic_xchg(&bh->scheduled, 1);

Notice how the atomic_xchg is still in qemu_bh_schedule_slow(). This
was already debated a few months ago, so I assumed it to be correct.
In retrospect this was a very good idea, as you'll see later.
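
To make the role of the exchange concrete, here is a minimal standalone
sketch of the "notify only on the 0 -> 1 transition" pattern, written
with C11 atomics rather than QEMU's own atomic_xchg(). The struct and
function names (sketch_ctx, sketch_bh, sketch_bh_schedule_slow) are
hypothetical, and a plain counter stands in for event_notifier_set():

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the AioContext and its EventNotifier. */
struct sketch_ctx {
    int notify_count;          /* how many times the notifier was "set" */
};

/* Hypothetical stand-in for QEMUBH. */
struct sketch_bh {
    struct sketch_ctx *ctx;
    atomic_int scheduled;
    int idle;
};

/* Mirrors qemu_bh_schedule_slow(): notify only on the 0 -> 1 transition. */
static void sketch_bh_schedule_slow(struct sketch_bh *bh)
{
    struct sketch_ctx *ctx = bh->ctx;

    bh->idle = 0;
    if (atomic_exchange(&bh->scheduled, 1) == 0) {
        ctx->notify_count++;   /* stands in for event_notifier_set() */
    }
}
```

Scheduling the same bottom half twice in a row bumps notify_count only
once, which is exactly the property that later turns a skipped wakeup
into a permanent one.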
Changing thread_pool_completion_bh() to qemu_bh_schedule_fast() didn't
trigger the assertion (as expected). Changing the worker's invocation
to qemu_bh_schedule_slow() didn't hide the bug (another assumption
which luckily held). This already heavily limited the amount of
interaction between the threads, hinting that the problematic events
must have triggered around thread_pool_completion_bh().
As mentioned earlier, invoking a debugger to examine the state of a
hung process was pretty easy; the iothread was always waiting on a
poll(..., -1) system call. Infinite timeouts are much rarer on x86,
and this could be the reason why the bug was never observed there.
With the buggy sequence more or less resolved to an interaction between
thread_pool_completion_bh() and poll(..., -1), my "tracing" strategy was
to just add a few qemu_clock_get_ns(QEMU_CLOCK_REALTIME) calls, hoping
that the ordering of aio_ctx_prepare(), aio_ctx_dispatch, poll() and
qemu_bh_schedule_fast() would provide some hint. The output was:

    (gdb) p last_prepare
    $3 = 103885451
    (gdb) p last_dispatch
    $4 = 103876492
    (gdb) p last_poll
    $5 = 115909333
    (gdb) p last_schedule
    $6 = 115925212

Notice how the last call to qemu_poll_ns() came after aio_ctx_dispatch().
This makes little sense unless there is an aio_poll() call involved,
and indeed with a slightly different instrumentation you can see that
there is one:

    (gdb) p last_prepare
    $3 = 107569679
    (gdb) p last_dispatch
    $4 = 107561600
    (gdb) p last_aio_poll
    $5 = 110671400
    (gdb) p last_schedule
    $6 = 110698917

So the scenario becomes clearer:

  iothread                          VCPU thread
--------------------------------------------------------------------------
  aio_ctx_prepare
  aio_ctx_check
  qemu_poll_ns(timeout=-1)
                                    aio_poll
                                      aio_dispatch
                                        thread_pool_completion_bh
                                          qemu_bh_schedule()

At this point bh->scheduled = 1 and the iothread has not been woken up.
The solution must be close, but this alone should not be a problem,
because the bottom half is only rescheduled to account for rare situations
(see commit 3c80ca1, thread-pool: avoid deadlock in nested aio_poll()
calls, 2014-07-15).
Introducing a third thread---a thread pool worker thread, which
also does qemu_bh_schedule()---does bring out the problematic case.
The third thread must be awakened *after* the callback is complete and
thread_pool_completion_bh has redone the whole loop, explaining the
short race window. And then this is what happens:

                                                    thread pool worker
--------------------------------------------------------------------------
                                                    <I/O completes>
                                                    qemu_bh_schedule()

Tada, bh->scheduled is already 1, so qemu_bh_schedule() does nothing
and the iothread is never woken up. This is where the bh->scheduled
optimization comes into play---it is correct, but removing it would
have masked the bug.
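
The interleaving can be replayed deterministically in a single thread.
The sketch below is only a model under stated assumptions: the names
(old_ctx, old_aio_notify, old_bh_schedule, replay_lost_wakeup) are
hypothetical, the old aio_notify is assumed to skip the wakeup whenever
the dispatching flag is set, and a counter stands in for
event_notifier_set():

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, single-threaded model of the old logic. */
struct old_ctx {
    bool dispatching;   /* old flag: "some aio_poll is dispatching" */
    int scheduled;      /* stands in for bh->scheduled */
    int wakeups;        /* number of event_notifier_set() calls */
};

/* Assumed old aio_notify(): skip the wakeup while something dispatches. */
static void old_aio_notify(struct old_ctx *ctx)
{
    if (!ctx->dispatching) {
        ctx->wakeups++;
    }
}

/* Model of qemu_bh_schedule(): notify only on the 0 -> 1 transition. */
static void old_bh_schedule(struct old_ctx *ctx)
{
    int was_scheduled = ctx->scheduled;

    ctx->scheduled = 1;
    if (!was_scheduled) {
        old_aio_notify(ctx);
    }
}

/* Replay the sequence described above and return how many wakeups the
 * sleeping iothread received. */
static int replay_lost_wakeup(void)
{
    struct old_ctx ctx = { false, 0, 0 };

    /* iothread: already blocked in poll(..., -1). */

    /* VCPU thread: aio_poll() enters its dispatch phase... */
    ctx.dispatching = true;
    /* ...and thread_pool_completion_bh reschedules itself. */
    old_bh_schedule(&ctx);
    ctx.dispatching = false;

    /* Worker thread: I/O completes, but scheduled is already 1. */
    old_bh_schedule(&ctx);

    return ctx.wakeups;
}
```

In this model replay_lost_wakeup() returns 0: neither call reaches the
notifier, so nothing ever interrupts the infinite poll.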
So, what is the bug?
Well, the question asked by the ctx->dispatching optimization ("is any
active aio_poll dispatching?") was wrong. The right question to ask
instead is "is any active aio_poll *not* dispatching", i.e. in the prepare
or poll phases? In that case, the aio_poll is sleeping or might go to
sleep anytime soon, and the EventNotifier must be invoked to wake
it up.
In any other case (including if there is *no* active aio_poll at all!)
we can just wait for the next prepare phase to pick up the event (e.g. a
bottom half); the prepare phase will avoid the blocking and service the
bottom half.
Expressing the invariant with a logic formula, the broken one looked like:

    !(exists(thread): in_dispatching(thread)) => !optimize

or equivalently:

    !(exists(thread):
            in_aio_poll(thread) && in_dispatching(thread)) => !optimize

In the correct one, the negation is in a slightly different place:

    (exists(thread):
            in_aio_poll(thread) && !in_dispatching(thread)) => !optimize

or equivalently:

    (exists(thread): in_prepare_or_poll(thread)) => !optimize

Even if the difference boils down to moving an exclamation mark :)
the implementation is quite different. However, I think the new
one is simpler to understand.
In the old implementation, the "exists" was implemented with a boolean
value. This didn't really support the case of multiple concurrent
event loops well, but I thought that this was okay: aio_poll holds the
AioContext lock so there cannot be concurrent aio_poll invocations, and
I was just considering nested event loops. However, aio_poll _could_
indeed be concurrent with the GSource. This is why I came up with the
wrong invariant.
In the new implementation, "exists" is computed simply by counting how many
threads are in the prepare or poll phases. There are some interesting
points to consider, but the gist of the idea remains:

1) AioContext can be used through GSource as well; as mentioned in the
   patch, bit 0 of the counter is reserved for the GSource.

2) the counter need not be updated for a non-blocking aio_poll, because
   it won't sleep forever anyway. This is just a matter of checking
   the "blocking" variable. This requires some changes to the win32
   implementation, but is otherwise not too complicated.

3) as mentioned above, the new implementation will not call aio_notify
   when there is *no* active aio_poll at all. The tests have to be
   adjusted for this change. The calls to aio_notify in async.c are fine;
   they only want to kick aio_poll out of a blocking wait, but need not
   do anything if aio_poll is not running.

4) nested aio_poll: these just work with the new implementation; when
   a nested event loop is invoked, the outer event loop is never in the
   prepare or poll phases. The outer event loop thus has already decremented
   the counter.
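
The counting scheme in the four points above can be sketched as a small
standalone model. Only the field name notify_me is taken from the patch;
every function name here is hypothetical, C11 atomics replace QEMU's
atomic helpers, memory barriers are left out, and a counter stands in
for event_notifier_set(). Which GSource callbacks set and clear bit 0 is
deliberately not modelled:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model: only the field name notify_me comes from the patch. */
struct new_ctx {
    atomic_uint notify_me;  /* bit 0: GSource; bits 1-31: aio_poll count */
    int wakeups;            /* stands in for event_notifier_set() calls */
};

/* aio_poll() entering prepare/poll; non-blocking calls are not counted. */
static void poll_wait_begin(struct new_ctx *ctx, bool blocking)
{
    if (blocking) {
        atomic_fetch_add(&ctx->notify_me, 2u);
    }
}

/* aio_poll() leaving the poll phase, just before it dispatches. */
static void poll_wait_end(struct new_ctx *ctx, bool blocking)
{
    if (blocking) {
        atomic_fetch_sub(&ctx->notify_me, 2u);
    }
}

/* The GSource sets and clears bit 0 instead of counting. */
static void gsource_wait_begin(struct new_ctx *ctx)
{
    atomic_fetch_or(&ctx->notify_me, 1u);
}

static void gsource_wait_end(struct new_ctx *ctx)
{
    atomic_fetch_and(&ctx->notify_me, ~1u);
}

/* New rule: wake somebody only if a loop is in prepare or poll. */
static void new_aio_notify(struct new_ctx *ctx)
{
    if (atomic_load(&ctx->notify_me) != 0) {
        ctx->wakeups++;
    }
}

/* Replay the failing interleaving; returns wakeups seen by the iothread. */
static int replay_with_counter(void)
{
    struct new_ctx ctx = { 0, 0 };

    gsource_wait_begin(&ctx);       /* iothread heads into poll(..., -1) */
    poll_wait_begin(&ctx, true);    /* VCPU thread: blocking aio_poll() */
    poll_wait_end(&ctx, true);      /* VCPU thread: now dispatching */
    new_aio_notify(&ctx);           /* completion BH reschedules itself */
    return ctx.wakeups;
}
```

In this model the reschedule from the VCPU thread's dispatch phase finds
bit 0 still set by the waiting GSource, so the iothread does get its
wakeup. A nested loop needs no special casing: the outer call has
already subtracted its 2 before it dispatches, so the inner call's
add/subtract pair starts from a clean count.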
Reported-by: Richard W. M. Jones <rjones@redhat.com>
Reported-by: Laszlo Ersek <lersek@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Tested-by: Richard W.M. Jones <rjones@redhat.com>
Message-id: 1437487673-23740-5-git-send-email-pbonzini@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
344 lines
11 KiB
C
/*
 * QEMU aio implementation
 *
 * Copyright IBM, Corp. 2008
 *
 * Authors:
 *  Anthony Liguori <aliguori@us.ibm.com>
 *
 * This work is licensed under the terms of the GNU GPL, version 2. See
 * the COPYING file in the top-level directory.
 *
 */

#ifndef QEMU_AIO_H
#define QEMU_AIO_H

#include "qemu/typedefs.h"
#include "qemu-common.h"
#include "qemu/queue.h"
#include "qemu/event_notifier.h"
#include "qemu/thread.h"
#include "qemu/rfifolock.h"
#include "qemu/timer.h"

typedef struct BlockAIOCB BlockAIOCB;
typedef void BlockCompletionFunc(void *opaque, int ret);

typedef struct AIOCBInfo {
    void (*cancel_async)(BlockAIOCB *acb);
    AioContext *(*get_aio_context)(BlockAIOCB *acb);
    size_t aiocb_size;
} AIOCBInfo;

struct BlockAIOCB {
    const AIOCBInfo *aiocb_info;
    BlockDriverState *bs;
    BlockCompletionFunc *cb;
    void *opaque;
    int refcnt;
};

void *qemu_aio_get(const AIOCBInfo *aiocb_info, BlockDriverState *bs,
                   BlockCompletionFunc *cb, void *opaque);
void qemu_aio_unref(void *p);
void qemu_aio_ref(void *p);

typedef struct AioHandler AioHandler;
typedef void QEMUBHFunc(void *opaque);
typedef void IOHandler(void *opaque);

struct AioContext {
    GSource source;

    /* Protects all fields from multi-threaded access */
    RFifoLock lock;

    /* The list of registered AIO handlers */
    QLIST_HEAD(, AioHandler) aio_handlers;

    /* This is a simple lock used to protect the aio_handlers list.
     * Specifically, it's used to ensure that no callbacks are removed while
     * we're walking and dispatching callbacks.
     */
    int walking_handlers;

    /* Used to avoid unnecessary event_notifier_set calls in aio_notify;
     * accessed with atomic primitives. If this field is 0, everything
     * (file descriptors, bottom halves, timers) will be re-evaluated
     * before the next blocking poll(), thus the event_notifier_set call
     * can be skipped. If it is non-zero, you may need to wake up a
     * concurrent aio_poll or the glib main event loop, making
     * event_notifier_set necessary.
     *
     * Bit 0 is reserved for GSource usage of the AioContext, and is 1
     * between a call to aio_ctx_check and the next call to aio_ctx_dispatch.
     * Bits 1-31 simply count the number of active calls to aio_poll
     * that are in the prepare or poll phase.
     *
     * The GSource and aio_poll must use a different mechanism because
     * there is no certainty that a call to GSource's prepare callback
     * (via g_main_context_prepare) is indeed followed by check and
     * dispatch. It's not clear whether this would be a bug, but let's
     * play safe and allow it---it will just cause extra calls to
     * event_notifier_set until the next call to dispatch.
     *
     * Instead, the aio_poll calls include both the prepare and the
     * dispatch phase, hence a simple counter is enough for them.
     */
    uint32_t notify_me;

    /* lock to protect between bh's adders and deleter */
    QemuMutex bh_lock;

    /* Anchor of the list of Bottom Halves belonging to the context */
    struct QEMUBH *first_bh;

    /* A simple lock used to protect the first_bh list, and ensure that
     * no callbacks are removed while we're walking and dispatching callbacks.
     */
    int walking_bh;

    /* Used for aio_notify. */
    EventNotifier notifier;

    /* Thread pool for performing work and receiving completion callbacks */
    struct ThreadPool *thread_pool;

    /* TimerLists for calling timers - one per clock type */
    QEMUTimerListGroup tlg;
};

/**
 * aio_context_new: Allocate a new AioContext.
 *
 * AioContexts provide a mini event-loop that can be waited on synchronously.
 * They also provide bottom halves, a service to execute a piece of code
 * as soon as possible.
 */
AioContext *aio_context_new(Error **errp);

/**
 * aio_context_ref:
 * @ctx: The AioContext to operate on.
 *
 * Add a reference to an AioContext.
 */
void aio_context_ref(AioContext *ctx);

/**
 * aio_context_unref:
 * @ctx: The AioContext to operate on.
 *
 * Drop a reference to an AioContext.
 */
void aio_context_unref(AioContext *ctx);

/* Take ownership of the AioContext. If the AioContext will be shared between
 * threads, and a thread does not want to be interrupted, it will have to
 * take ownership around calls to aio_poll(). Otherwise, aio_poll()
 * automatically takes care of calling aio_context_acquire and
 * aio_context_release.
 *
 * Access to timers and BHs from a thread that has not acquired AioContext
 * is possible. Access to callbacks for now must be done while the AioContext
 * is owned by the thread (FIXME).
 */
void aio_context_acquire(AioContext *ctx);

/* Relinquish ownership of the AioContext. */
void aio_context_release(AioContext *ctx);

/**
 * aio_bh_new: Allocate a new bottom half structure.
 *
 * Bottom halves are lightweight callbacks whose invocation is guaranteed
 * to be wait-free, thread-safe and signal-safe. The #QEMUBH structure
 * is opaque and must be allocated prior to its use.
 */
QEMUBH *aio_bh_new(AioContext *ctx, QEMUBHFunc *cb, void *opaque);

/**
 * aio_notify: Force processing of pending events.
 *
 * Similar to signaling a condition variable, aio_notify forces
 * aio_wait to exit, so that the next call will re-examine pending events.
 * The caller of aio_notify will usually call aio_wait again very soon,
 * or go through another iteration of the GLib main loop. Hence, aio_notify
 * also has the side effect of recalculating the sets of file descriptors
 * that the main loop waits for.
 *
 * Calling aio_notify is rarely necessary, because for example scheduling
 * a bottom half calls it already.
 */
void aio_notify(AioContext *ctx);

/**
 * aio_bh_poll: Poll bottom halves for an AioContext.
 *
 * These are internal functions used by the QEMU main loop.
 * And notice that multiple occurrences of aio_bh_poll cannot
 * be called concurrently
 */
int aio_bh_poll(AioContext *ctx);

/**
 * qemu_bh_schedule: Schedule a bottom half.
 *
 * Scheduling a bottom half interrupts the main loop and causes the
 * execution of the callback that was passed to qemu_bh_new.
 *
 * Bottom halves that are scheduled from a bottom half handler are instantly
 * invoked. This can create an infinite loop if a bottom half handler
 * schedules itself.
 *
 * @bh: The bottom half to be scheduled.
 */
void qemu_bh_schedule(QEMUBH *bh);

/**
 * qemu_bh_cancel: Cancel execution of a bottom half.
 *
 * Canceling execution of a bottom half undoes the effect of calls to
 * qemu_bh_schedule without freeing its resources yet. While cancellation
 * itself is also wait-free and thread-safe, it can of course race with the
 * loop that executes bottom halves unless you are holding the iothread
 * mutex. This makes it mostly useless if you are not holding the mutex.
 *
 * @bh: The bottom half to be canceled.
 */
void qemu_bh_cancel(QEMUBH *bh);

/**
 * qemu_bh_delete: Cancel execution of a bottom half and free its resources.
 *
 * Deleting a bottom half frees the memory that was allocated for it by
 * qemu_bh_new. It also implies canceling the bottom half if it was
 * scheduled.
 * This func is async. The bottom half will do the delete action at the final
 * end.
 *
 * @bh: The bottom half to be deleted.
 */
void qemu_bh_delete(QEMUBH *bh);

/* Return whether there are any pending callbacks from the GSource
 * attached to the AioContext, before g_poll is invoked.
 *
 * This is used internally in the implementation of the GSource.
 */
bool aio_prepare(AioContext *ctx);

/* Return whether there are any pending callbacks from the GSource
 * attached to the AioContext, after g_poll is invoked.
 *
 * This is used internally in the implementation of the GSource.
 */
bool aio_pending(AioContext *ctx);

/* Dispatch any pending callbacks from the GSource attached to the AioContext.
 *
 * This is used internally in the implementation of the GSource.
 */
bool aio_dispatch(AioContext *ctx);

/* Progress in completing AIO work to occur. This can issue new pending
 * aio as a result of executing I/O completion or bh callbacks.
 *
 * Return whether any progress was made by executing AIO or bottom half
 * handlers. If @blocking == true, this should always be true except
 * if someone called aio_notify.
 *
 * If there are no pending bottom halves, but there are pending AIO
 * operations, it may not be possible to make any progress without
 * blocking. If @blocking is true, this function will wait until one
 * or more AIO events have completed, to ensure something has moved
 * before returning.
 */
bool aio_poll(AioContext *ctx, bool blocking);

/* Register a file descriptor and associated callbacks. Behaves very similarly
 * to qemu_set_fd_handler. Unlike qemu_set_fd_handler, these callbacks will
 * be invoked when using aio_poll().
 *
 * Code that invokes AIO completion functions should rely on this function
 * instead of qemu_set_fd_handler[2].
 */
void aio_set_fd_handler(AioContext *ctx,
                        int fd,
                        IOHandler *io_read,
                        IOHandler *io_write,
                        void *opaque);

/* Register an event notifier and associated callbacks. Behaves very similarly
 * to event_notifier_set_handler. Unlike event_notifier_set_handler, these callbacks
 * will be invoked when using aio_poll().
 *
 * Code that invokes AIO completion functions should rely on this function
 * instead of event_notifier_set_handler.
 */
void aio_set_event_notifier(AioContext *ctx,
                            EventNotifier *notifier,
                            EventNotifierHandler *io_read);

/* Return a GSource that lets the main loop poll the file descriptors attached
 * to this AioContext.
 */
GSource *aio_get_g_source(AioContext *ctx);

/* Return the ThreadPool bound to this AioContext */
struct ThreadPool *aio_get_thread_pool(AioContext *ctx);

/**
 * aio_timer_new:
 * @ctx: the aio context
 * @type: the clock type
 * @scale: the scale
 * @cb: the callback to call on timer expiry
 * @opaque: the opaque pointer to pass to the callback
 *
 * Allocate a new timer attached to the context @ctx.
 * The function is responsible for memory allocation.
 *
 * The preferred interface is aio_timer_init. Use that
 * unless you really need dynamic memory allocation.
 *
 * Returns: a pointer to the new timer
 */
static inline QEMUTimer *aio_timer_new(AioContext *ctx, QEMUClockType type,
                                       int scale,
                                       QEMUTimerCB *cb, void *opaque)
{
    return timer_new_tl(ctx->tlg.tl[type], scale, cb, opaque);
}

/**
 * aio_timer_init:
 * @ctx: the aio context
 * @ts: the timer
 * @type: the clock type
 * @scale: the scale
 * @cb: the callback to call on timer expiry
 * @opaque: the opaque pointer to pass to the callback
 *
 * Initialise a new timer attached to the context @ctx.
 * The caller is responsible for memory allocation.
 */
static inline void aio_timer_init(AioContext *ctx,
                                  QEMUTimer *ts, QEMUClockType type,
                                  int scale,
                                  QEMUTimerCB *cb, void *opaque)
{
    timer_init_tl(ts, ctx->tlg.tl[type], scale, cb, opaque);
}

/**
 * aio_compute_timeout:
 * @ctx: the aio context
 *
 * Compute the timeout that a blocking aio_poll should use.
 */
int64_t aio_compute_timeout(AioContext *ctx);

#endif