de99dab06f
Some of the constraints on operand sizes have been relaxed, so adjust the documentation. Deprecate atomic_mb_read and atomic_mb_set; it is not really possible to use them correctly because they do not interoperate with sequentially-consistent RMW operations. Finally, extend the memory barrier pairing section to cover acquire and release semantics in general, roughly based on the KVM Forum 2016 talk, "<atomic.h> weapons". Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
508 lines
25 KiB
ReStructuredText
508 lines
25 KiB
ReStructuredText
=========================
|
|
Atomic operations in QEMU
|
|
=========================
|
|
|
|
CPUs perform independent memory operations effectively in random order.
|
|
but this can be a problem for CPU-CPU interaction (including interactions
|
|
between QEMU and the guest). Multi-threaded programs use various tools
|
|
to instruct the compiler and the CPU to restrict the order to something
|
|
that is consistent with the expectations of the programmer.
|
|
|
|
The most basic tool is locking. Mutexes, condition variables and
|
|
semaphores are used in QEMU, and should be the default approach to
|
|
synchronization. Anything else is considerably harder, but it's
|
|
also justified more often than one would like;
|
|
the most performance-critical parts of QEMU in particular require
|
|
a very low level approach to concurrency, involving memory barriers
|
|
and atomic operations. The semantics of concurrent memory accesses are governed
|
|
by the C11 memory model.
|
|
|
|
QEMU provides a header, ``qemu/atomic.h``, which wraps C11 atomics to
|
|
provide better portability and a less verbose syntax. ``qemu/atomic.h``
|
|
provides macros that fall in three camps:
|
|
|
|
- compiler barriers: ``barrier()``;
|
|
|
|
- weak atomic access and manual memory barriers: ``atomic_read()``,
|
|
``atomic_set()``, ``smp_rmb()``, ``smp_wmb()``, ``smp_mb()``, ``smp_mb_acquire()``,
|
|
``smp_mb_release()``, ``smp_read_barrier_depends()``;
|
|
|
|
- sequentially consistent atomic access: everything else.
|
|
|
|
In general, use of ``qemu/atomic.h`` should be wrapped with more easily
|
|
used data structures (e.g. the lock-free singly-linked list operations
|
|
``QSLIST_INSERT_HEAD_ATOMIC`` and ``QSLIST_MOVE_ATOMIC``) or synchronization
|
|
primitives (such as RCU, ``QemuEvent`` or ``QemuLockCnt``). Bare use of
|
|
atomic operations and memory barriers should be limited to inter-thread
|
|
checking of flags and documented thoroughly.
|
|
|
|
|
|
|
|
Compiler memory barrier
|
|
=======================
|
|
|
|
``barrier()`` prevents the compiler from moving the memory accesses on
|
|
either side of it to the other side. The compiler barrier has no direct
|
|
effect on the CPU, which may then reorder things however it wishes.
|
|
|
|
``barrier()`` is mostly used within ``qemu/atomic.h`` itself. On some
|
|
architectures, CPU guarantees are strong enough that blocking compiler
|
|
optimizations already ensures the correct order of execution. In this
|
|
case, ``qemu/atomic.h`` will reduce stronger memory barriers to simple
|
|
compiler barriers.
|
|
|
|
Still, ``barrier()`` can be useful when writing code that can be interrupted
|
|
by signal handlers.
|
|
|
|
|
|
Sequentially consistent atomic access
|
|
=====================================
|
|
|
|
Most of the operations in the ``qemu/atomic.h`` header ensure *sequential
|
|
consistency*, where "the result of any execution is the same as if the
|
|
operations of all the processors were executed in some sequential order,
|
|
and the operations of each individual processor appear in this sequence
|
|
in the order specified by its program".
|
|
|
|
``qemu/atomic.h`` provides the following set of atomic read-modify-write
|
|
operations::
|
|
|
|
void atomic_inc(ptr)
|
|
void atomic_dec(ptr)
|
|
void atomic_add(ptr, val)
|
|
void atomic_sub(ptr, val)
|
|
void atomic_and(ptr, val)
|
|
void atomic_or(ptr, val)
|
|
|
|
typeof(*ptr) atomic_fetch_inc(ptr)
|
|
typeof(*ptr) atomic_fetch_dec(ptr)
|
|
typeof(*ptr) atomic_fetch_add(ptr, val)
|
|
typeof(*ptr) atomic_fetch_sub(ptr, val)
|
|
typeof(*ptr) atomic_fetch_and(ptr, val)
|
|
typeof(*ptr) atomic_fetch_or(ptr, val)
|
|
typeof(*ptr) atomic_fetch_xor(ptr, val)
|
|
typeof(*ptr) atomic_fetch_inc_nonzero(ptr)
|
|
typeof(*ptr) atomic_xchg(ptr, val)
|
|
typeof(*ptr) atomic_cmpxchg(ptr, old, new)
|
|
|
|
all of which return the old value of ``*ptr``. These operations are
|
|
polymorphic; they operate on any type that is as wide as a pointer or
|
|
smaller.
|
|
|
|
Similar operations return the new value of ``*ptr``::
|
|
|
|
typeof(*ptr) atomic_inc_fetch(ptr)
|
|
typeof(*ptr) atomic_dec_fetch(ptr)
|
|
typeof(*ptr) atomic_add_fetch(ptr, val)
|
|
typeof(*ptr) atomic_sub_fetch(ptr, val)
|
|
typeof(*ptr) atomic_and_fetch(ptr, val)
|
|
typeof(*ptr) atomic_or_fetch(ptr, val)
|
|
typeof(*ptr) atomic_xor_fetch(ptr, val)
|
|
|
|
``qemu/atomic.h`` also provides loads and stores that cannot be reordered
|
|
with each other::
|
|
|
|
typeof(*ptr) atomic_mb_read(ptr)
|
|
void atomic_mb_set(ptr, val)
|
|
|
|
However these do not provide sequential consistency and, in particular,
|
|
they do not participate in the total ordering enforced by
|
|
sequentially-consistent operations. For this reason they are deprecated.
|
|
They should instead be replaced with any of the following (ordered from
|
|
easiest to hardest):
|
|
|
|
- accesses inside a mutex or spinlock
|
|
|
|
- lightweight synchronization primitives such as ``QemuEvent``
|
|
|
|
- RCU operations (``atomic_rcu_read``, ``atomic_rcu_set``) when publishing
|
|
or accessing a new version of a data structure
|
|
|
|
- other atomic accesses: ``atomic_read`` and ``atomic_load_acquire`` for
|
|
loads, ``atomic_set`` and ``atomic_store_release`` for stores, ``smp_mb``
|
|
to forbid reordering subsequent loads before a store.
|
|
|
|
|
|
Weak atomic access and manual memory barriers
|
|
=============================================
|
|
|
|
Compared to sequentially consistent atomic access, programming with
|
|
weaker consistency models can be considerably more complicated.
|
|
The only guarantees that you can rely upon in this case are:
|
|
|
|
- atomic accesses will not cause data races (and hence undefined behavior);
|
|
ordinary accesses instead cause data races if they are concurrent with
|
|
other accesses of which at least one is a write. In order to ensure this,
|
|
the compiler will not optimize accesses out of existence, create unsolicited
|
|
accesses, or perform other similar optimzations.
|
|
|
|
- acquire operations will appear to happen, with respect to the other
|
|
components of the system, before all the LOAD or STORE operations
|
|
specified afterwards.
|
|
|
|
- release operations will appear to happen, with respect to the other
|
|
components of the system, after all the LOAD or STORE operations
|
|
specified before.
|
|
|
|
- release operations will *synchronize with* acquire operations;
|
|
see :ref:`acqrel` for a detailed explanation.
|
|
|
|
When using this model, variables are accessed with:
|
|
|
|
- ``atomic_read()`` and ``atomic_set()``; these prevent the compiler from
|
|
optimizing accesses out of existence and creating unsolicited
|
|
accesses, but do not otherwise impose any ordering on loads and
|
|
stores: both the compiler and the processor are free to reorder
|
|
them.
|
|
|
|
- ``atomic_load_acquire()``, which guarantees the LOAD to appear to
|
|
happen, with respect to the other components of the system,
|
|
before all the LOAD or STORE operations specified afterwards.
|
|
Operations coming before ``atomic_load_acquire()`` can still be
|
|
reordered after it.
|
|
|
|
- ``atomic_store_release()``, which guarantees the STORE to appear to
|
|
happen, with respect to the other components of the system,
|
|
after all the LOAD or STORE operations specified before.
|
|
Operations coming after ``atomic_store_release()`` can still be
|
|
reordered before it.
|
|
|
|
Restrictions to the ordering of accesses can also be specified
|
|
using the memory barrier macros: ``smp_rmb()``, ``smp_wmb()``, ``smp_mb()``,
|
|
``smp_mb_acquire()``, ``smp_mb_release()``, ``smp_read_barrier_depends()``.
|
|
|
|
Memory barriers control the order of references to shared memory.
|
|
They come in six kinds:
|
|
|
|
- ``smp_rmb()`` guarantees that all the LOAD operations specified before
|
|
the barrier will appear to happen before all the LOAD operations
|
|
specified after the barrier with respect to the other components of
|
|
the system.
|
|
|
|
In other words, ``smp_rmb()`` puts a partial ordering on loads, but is not
|
|
required to have any effect on stores.
|
|
|
|
- ``smp_wmb()`` guarantees that all the STORE operations specified before
|
|
the barrier will appear to happen before all the STORE operations
|
|
specified after the barrier with respect to the other components of
|
|
the system.
|
|
|
|
In other words, ``smp_wmb()`` puts a partial ordering on stores, but is not
|
|
required to have any effect on loads.
|
|
|
|
- ``smp_mb_acquire()`` guarantees that all the LOAD operations specified before
|
|
the barrier will appear to happen before all the LOAD or STORE operations
|
|
specified after the barrier with respect to the other components of
|
|
the system.
|
|
|
|
- ``smp_mb_release()`` guarantees that all the STORE operations specified *after*
|
|
the barrier will appear to happen after all the LOAD or STORE operations
|
|
specified *before* the barrier with respect to the other components of
|
|
the system.
|
|
|
|
- ``smp_mb()`` guarantees that all the LOAD and STORE operations specified
|
|
before the barrier will appear to happen before all the LOAD and
|
|
STORE operations specified after the barrier with respect to the other
|
|
components of the system.
|
|
|
|
``smp_mb()`` puts a partial ordering on both loads and stores. It is
|
|
stronger than both a read and a write memory barrier; it implies both
|
|
``smp_mb_acquire()`` and ``smp_mb_release()``, but it also prevents STOREs
|
|
coming before the barrier from overtaking LOADs coming after the
|
|
barrier and vice versa.
|
|
|
|
- ``smp_read_barrier_depends()`` is a weaker kind of read barrier. On
|
|
most processors, whenever two loads are performed such that the
|
|
second depends on the result of the first (e.g., the first load
|
|
retrieves the address to which the second load will be directed),
|
|
the processor will guarantee that the first LOAD will appear to happen
|
|
before the second with respect to the other components of the system.
|
|
However, this is not always true---for example, it was not true on
|
|
Alpha processors. Whenever this kind of access happens to shared
|
|
memory (that is not protected by a lock), a read barrier is needed,
|
|
and ``smp_read_barrier_depends()`` can be used instead of ``smp_rmb()``.
|
|
|
|
Note that the first load really has to have a _data_ dependency and not
|
|
a control dependency. If the address for the second load is dependent
|
|
on the first load, but the dependency is through a conditional rather
|
|
than actually loading the address itself, then it's a _control_
|
|
dependency and a full read barrier or better is required.
|
|
|
|
|
|
Memory barriers and ``atomic_load_acquire``/``atomic_store_release`` are
|
|
mostly used when a data structure has one thread that is always a writer
|
|
and one thread that is always a reader:
|
|
|
|
+----------------------------------+----------------------------------+
|
|
| thread 1 | thread 2 |
|
|
+==================================+==================================+
|
|
| :: | :: |
|
|
| | |
|
|
| atomic_store_release(&a, x); | y = atomic_load_acquire(&b); |
|
|
| atomic_store_release(&b, y); | x = atomic_load_acquire(&a); |
|
|
+----------------------------------+----------------------------------+
|
|
|
|
In this case, correctness is easy to check for using the "pairing"
|
|
trick that is explained below.
|
|
|
|
Sometimes, a thread is accessing many variables that are otherwise
|
|
unrelated to each other (for example because, apart from the current
|
|
thread, exactly one other thread will read or write each of these
|
|
variables). In this case, it is possible to "hoist" the barriers
|
|
outside a loop. For example:
|
|
|
|
+------------------------------------------+----------------------------------+
|
|
| before | after |
|
|
+==========================================+==================================+
|
|
| :: | :: |
|
|
| | |
|
|
| n = 0; | n = 0; |
|
|
| for (i = 0; i < 10; i++) | for (i = 0; i < 10; i++) |
|
|
| n += atomic_load_acquire(&a[i]); | n += atomic_read(&a[i]); |
|
|
| | smp_mb_acquire(); |
|
|
+------------------------------------------+----------------------------------+
|
|
| :: | :: |
|
|
| | |
|
|
| | smp_mb_release(); |
|
|
| for (i = 0; i < 10; i++) | for (i = 0; i < 10; i++) |
|
|
| atomic_store_release(&a[i], false); | atomic_set(&a[i], false); |
|
|
+------------------------------------------+----------------------------------+
|
|
|
|
Splitting a loop can also be useful to reduce the number of barriers:
|
|
|
|
+------------------------------------------+----------------------------------+
|
|
| before | after |
|
|
+==========================================+==================================+
|
|
| :: | :: |
|
|
| | |
|
|
| n = 0; | smp_mb_release(); |
|
|
| for (i = 0; i < 10; i++) { | for (i = 0; i < 10; i++) |
|
|
| atomic_store_release(&a[i], false); | atomic_set(&a[i], false); |
|
|
| smp_mb(); | smb_mb(); |
|
|
| n += atomic_read(&b[i]); | n = 0; |
|
|
| } | for (i = 0; i < 10; i++) |
|
|
| | n += atomic_read(&b[i]); |
|
|
+------------------------------------------+----------------------------------+
|
|
|
|
In this case, a ``smp_mb_release()`` is also replaced with a (possibly cheaper, and clearer
|
|
as well) ``smp_wmb()``:
|
|
|
|
+------------------------------------------+----------------------------------+
|
|
| before | after |
|
|
+==========================================+==================================+
|
|
| :: | :: |
|
|
| | |
|
|
| | smp_mb_release(); |
|
|
| for (i = 0; i < 10; i++) { | for (i = 0; i < 10; i++) |
|
|
| atomic_store_release(&a[i], false); | atomic_set(&a[i], false); |
|
|
| atomic_store_release(&b[i], false); | smb_wmb(); |
|
|
| } | for (i = 0; i < 10; i++) |
|
|
| | atomic_set(&b[i], false); |
|
|
+------------------------------------------+----------------------------------+
|
|
|
|
|
|
.. _acqrel:
|
|
|
|
Acquire/release pairing and the *synchronizes-with* relation
|
|
------------------------------------------------------------
|
|
|
|
Atomic operations other than ``atomic_set()`` and ``atomic_read()`` have
|
|
either *acquire* or *release* semantics [#rmw]_. This has two effects:
|
|
|
|
.. [#rmw] Read-modify-write operations can have both---acquire applies to the
|
|
read part, and release to the write.
|
|
|
|
- within a thread, they are ordered either before subsequent operations
|
|
(for acquire) or after previous operations (for release).
|
|
|
|
- if a release operation in one thread *synchronizes with* an acquire operation
|
|
in another thread, the ordering constraints propagates from the first to the
|
|
second thread. That is, everything before the release operation in the
|
|
first thread is guaranteed to *happen before* everything after the
|
|
acquire operation in the second thread.
|
|
|
|
The concept of acquire and release semantics is not exclusive to atomic
|
|
operations; almost all higher-level synchronization primitives also have
|
|
acquire or release semantics. For example:
|
|
|
|
- ``pthread_mutex_lock`` has acquire semantics, ``pthread_mutex_unlock`` has
|
|
release semantics and synchronizes with a ``pthread_mutex_lock`` for the
|
|
same mutex.
|
|
|
|
- ``pthread_cond_signal`` and ``pthread_cond_broadcast`` have release semantics;
|
|
``pthread_cond_wait`` has both release semantics (synchronizing with
|
|
``pthread_mutex_lock``) and acquire semantics (synchronizing with
|
|
``pthread_mutex_unlock`` and signaling of the condition variable).
|
|
|
|
- ``pthread_create`` has release semantics and synchronizes with the start
|
|
of the new thread; ``pthread_join`` has acquire semantics and synchronizes
|
|
with the exiting of the thread.
|
|
|
|
- ``qemu_event_set`` has release semantics, ``qemu_event_wait`` has
|
|
acquire semantics.
|
|
|
|
For example, in the following example there are no atomic accesses, but still
|
|
thread 2 is relying on the *synchronizes-with* relation between ``pthread_exit``
|
|
(release) and ``pthread_join`` (acquire):
|
|
|
|
+----------------------+-------------------------------+
|
|
| thread 1 | thread 2 |
|
|
+======================+===============================+
|
|
| :: | :: |
|
|
| | |
|
|
| *a = 1; | |
|
|
| pthread_exit(a); | pthread_join(thread1, &a); |
|
|
| | x = *a; |
|
|
+----------------------+-------------------------------+
|
|
|
|
Synchronization between threads basically descends from this pairing of
|
|
a release operation and an acquire operation. Therefore, atomic operations
|
|
other than ``atomic_set()`` and ``atomic_read()`` will almost always be
|
|
paired with another operation of the opposite kind: an acquire operation
|
|
will pair with a release operation and vice versa. This rule of thumb is
|
|
extremely useful; in the case of QEMU, however, note that the other
|
|
operation may actually be in a driver that runs in the guest!
|
|
|
|
``smp_read_barrier_depends()``, ``smp_rmb()``, ``smp_mb_acquire()``,
|
|
``atomic_load_acquire()`` and ``atomic_rcu_read()`` all count
|
|
as acquire operations. ``smp_wmb()``, ``smp_mb_release()``,
|
|
``atomic_store_release()`` and ``atomic_rcu_set()`` all count as release
|
|
operations. ``smp_mb()`` counts as both acquire and release, therefore
|
|
it can pair with any other atomic operation. Here is an example:
|
|
|
|
+----------------------+------------------------------+
|
|
| thread 1 | thread 2 |
|
|
+======================+==============================+
|
|
| :: | :: |
|
|
| | |
|
|
| atomic_set(&a, 1); | |
|
|
| smp_wmb(); | |
|
|
| atomic_set(&b, 2); | x = atomic_read(&b); |
|
|
| | smp_rmb(); |
|
|
| | y = atomic_read(&a); |
|
|
+----------------------+------------------------------+
|
|
|
|
Note that a load-store pair only counts if the two operations access the
|
|
same variable: that is, a store-release on a variable ``x`` *synchronizes
|
|
with* a load-acquire on a variable ``x``, while a release barrier
|
|
synchronizes with any acquire operation. The following example shows
|
|
correct synchronization:
|
|
|
|
+--------------------------------+--------------------------------+
|
|
| thread 1 | thread 2 |
|
|
+================================+================================+
|
|
| :: | :: |
|
|
| | |
|
|
| atomic_set(&a, 1); | |
|
|
| atomic_store_release(&b, 2); | x = atomic_load_acquire(&b); |
|
|
| | y = atomic_read(&a); |
|
|
+--------------------------------+--------------------------------+
|
|
|
|
Acquire and release semantics of higher-level primitives can also be
|
|
relied upon for the purpose of establishing the *synchronizes with*
|
|
relation.
|
|
|
|
Note that the "writing" thread is accessing the variables in the
|
|
opposite order as the "reading" thread. This is expected: stores
|
|
before a release operation will normally match the loads after
|
|
the acquire operation, and vice versa. In fact, this happened already
|
|
in the ``pthread_exit``/``pthread_join`` example above.
|
|
|
|
Finally, this more complex example has more than two accesses and data
|
|
dependency barriers. It also does not use atomic accesses whenever there
|
|
cannot be a data race:
|
|
|
|
+----------------------+------------------------------+
|
|
| thread 1 | thread 2 |
|
|
+======================+==============================+
|
|
| :: | :: |
|
|
| | |
|
|
| b[2] = 1; | |
|
|
| smp_wmb(); | |
|
|
| x->i = 2; | |
|
|
| smp_wmb(); | |
|
|
| atomic_set(&a, x); | x = atomic_read(&a); |
|
|
| | smp_read_barrier_depends(); |
|
|
| | y = x->i; |
|
|
| | smp_read_barrier_depends(); |
|
|
| | z = b[y]; |
|
|
+----------------------+------------------------------+
|
|
|
|
Comparison with Linux kernel primitives
|
|
=======================================
|
|
|
|
Here is a list of differences between Linux kernel atomic operations
|
|
and memory barriers, and the equivalents in QEMU:
|
|
|
|
- atomic operations in Linux are always on a 32-bit int type and
|
|
use a boxed ``atomic_t`` type; atomic operations in QEMU are polymorphic
|
|
and use normal C types.
|
|
|
|
- Originally, ``atomic_read`` and ``atomic_set`` in Linux gave no guarantee
|
|
at all. Linux 4.1 updated them to implement volatile
|
|
semantics via ``ACCESS_ONCE`` (or the more recent ``READ``/``WRITE_ONCE``).
|
|
|
|
QEMU's ``atomic_read`` and ``atomic_set`` implement C11 atomic relaxed
|
|
semantics if the compiler supports it, and volatile semantics otherwise.
|
|
Both semantics prevent the compiler from doing certain transformations;
|
|
the difference is that atomic accesses are guaranteed to be atomic,
|
|
while volatile accesses aren't. Thus, in the volatile case we just cross
|
|
our fingers hoping that the compiler will generate atomic accesses,
|
|
since we assume the variables passed are machine-word sized and
|
|
properly aligned.
|
|
|
|
No barriers are implied by ``atomic_read`` and ``atomic_set`` in either Linux
|
|
or QEMU.
|
|
|
|
- atomic read-modify-write operations in Linux are of three kinds:
|
|
|
|
===================== =========================================
|
|
``atomic_OP`` returns void
|
|
``atomic_OP_return`` returns new value of the variable
|
|
``atomic_fetch_OP`` returns the old value of the variable
|
|
``atomic_cmpxchg`` returns the old value of the variable
|
|
===================== =========================================
|
|
|
|
In QEMU, the second kind is named ``atomic_OP_fetch``.
|
|
|
|
- different atomic read-modify-write operations in Linux imply
|
|
a different set of memory barriers; in QEMU, all of them enforce
|
|
sequential consistency.
|
|
|
|
- in QEMU, ``atomic_read()`` and ``atomic_set()`` do not participate in
|
|
the total ordering enforced by sequentially-consistent operations.
|
|
This is because QEMU uses the C11 memory model. The following example
|
|
is correct in Linux but not in QEMU:
|
|
|
|
+----------------------------------+--------------------------------+
|
|
| Linux (correct) | QEMU (incorrect) |
|
|
+==================================+================================+
|
|
| :: | :: |
|
|
| | |
|
|
| a = atomic_fetch_add(&x, 2); | a = atomic_fetch_add(&x, 2); |
|
|
| b = READ_ONCE(&y); | b = atomic_read(&y); |
|
|
+----------------------------------+--------------------------------+
|
|
|
|
because the read of ``y`` can be moved (by either the processor or the
|
|
compiler) before the write of ``x``.
|
|
|
|
Fixing this requires an ``smp_mb()`` memory barrier between the write
|
|
of ``x`` and the read of ``y``. In the common case where only one thread
|
|
writes ``x``, it is also possible to write it like this:
|
|
|
|
+--------------------------------+
|
|
| QEMU (correct) |
|
|
+================================+
|
|
| :: |
|
|
| |
|
|
| a = atomic_read(&x); |
|
|
| atomic_set(&x, a + 2); |
|
|
| smp_mb(); |
|
|
| b = atomic_read(&y); |
|
|
+--------------------------------+
|
|
|
|
Sources
|
|
=======
|
|
|
|
- ``Documentation/memory-barriers.txt`` from the Linux kernel
|