Commit b1f416aa8d870fab71030abc9401cfc77b948e8e breaks vhost_net
because it always registers the virtio_pci_host_notifier_read() handler
function on the ioeventfd, even when vhost_net.ko is using the ioeventfd.
The result is both QEMU and vhost_net.ko polling on the same eventfd
and the virtio_net.ko guest driver seeing inconsistent results:
# ifconfig eth0 192.168.0.1 netmask 255.255.255.0
virtio_net virtio0: output:id 0 is not a head!
To fix this, proceed the same as we do for irqfd: add a parameter to
virtio_queue_set_host_notifier_fd_handler and in that case only set
the notifier, not the handler.
Cc: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Tested-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
All transports can use the same event handler for the irqfd, though the
exact mechanics of the assignment will be specific. Note that there
are three states: handled by the kernel, handled in userspace, disabled.
This also lets virtio use event_notifier_set_handler.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
All transports can use the same event handler for the ioeventfd, though
the exact setup (address/memory region) will be specific.
This lets virtio use event_notifier_set_handler.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
virtio has the equivalent of:
if (vq->last_avail_index != vring_avail_idx(vq)) {
read descriptor head at vq->last_avail_index;
}
In theory, processor can reorder descriptor head
read to happen speculatively before the index read.
this would trigger the following race:
host descriptor head read <- reads invalid head from ring
guest writes valid descriptor head
guest writes avail index
host avail index read <- observes valid index
as a result host will use an invalid head value.
This was not observed in the field by me but after
the experience with the previous two races
I think it is prudent to address this theoretical race condition.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
This fixes an issue dual to the one fixed by
patch 'virtio: add missing mb() on notification'
and applies on top.
In this case, to enable vq kick to exit to host,
qemu writes out used flag then reads the
avail index. if these are reordered we get a race:
host avail index read: ring is empty
guest avail index write
guest flag read: exit disabled
host used flag write: enable exit
which results in a lost exit: host will never be notified about the
avail index update. Again, happens in the field but only seems to
trigger on some specific hardware.
Insert an smp_mb barrier operation to ensure the correct ordering.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
During normal operation, virtio first writes a used index
and then checks whether it should interrupt the guest
by reading guest avail index/flag values.
Guest does the reverse: writes the index/flag,
then checks the used ring.
The ordering is important: if host avail flag read bypasses the used
index write, we could in effect get this timing:
host avail flag read
guest enable interrupts: avail flag write
guest check used ring: ring is empty
host used index write
which results in a lost interrupt: guest will never be notified
about the used ring update.
This actually can happen when using kvm with an io thread,
such that the guest vcpu and qemu run on different host cpus,
and this has actually been observed in the field
(but only seems to trigger on very specific processor types)
with userspace virtio: vhost has the necessary smp_mb()
in place to prevent the regordering, so the same workload stalls
forever waiting for an interrupt with vhost=off but works
fine with vhost=on.
Insert an smp_mb barrier operation in userspace virtio to
ensure the correct ordering.
Applying this patch fixed the race condition we have observed.
Tested on x86_64. I checked the code generated by the new macro
for i386 and ppc but didn't run virtio.
Note: mb could in theory be implemented by __sync_synchronize, but this
would make us hit old GCC bugs. Besides old GCC
not implementing __sync_synchronize at all, there were bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36793
in this functionality as recently as in 4.3.
As we need asm for rmb,wmb anyway, it's just as well to
use it for mb.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Serializing virtio-scsi requests needs a simple way to get from a
VirtQueue to the number of the queue. The virtio_queue_get_id
provides this.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When accessing the device specific virtio config space, we memcpy
the data into a variable in QEMU. At that point we're basically
pulling host endianness into the game which is a really bad idea.
So instead, let's use the target specific load/store helpers for
memory pointers which fetch things in target endianness. The whole
array is already populated in target endianness anyways
(see virtio-blk).
Signed-off-by: Alexander Graf <agraf@suse.de>
Reviewed-by: Anthony Liguori <aliguori@us.ibm.com>
vdev->guest_features is not masking features that are not supported by
the guest. Fix this by introducing a common wrapper to be used by all
virtio bus implementations.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
The virtio code uses wmb() macros in several places, as required by the
SMP-aware virtio protocol. However the wmb() macro is locally defined
to be a compiler barrier only. This is probably sufficient on x86
due to its strong storage ordering model, but it certainly isn't on other
platforms, such as ppc.
In any case, qemu already has some globally defined memory barrier macros
in qemu-barrier.h. This patch, therefore converts virtio.c to use those
barrier macros. The macros in qemu-barrier.h are also wrong (or at least,
safe for x86 only) but this way at least there's only one place to fix
them.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
The virtio device lifecycle can be observed by looking at the sequence
of set status operations. This is especially important for catching the
reset operation (status value 0), which resets the device and all
virtqueues.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
Today, when notifying a VM state change with vm_state_notify(),
we pass a VMSTOP macro as the 'reason' argument. This is not ideal
because the VMSTOP macros tell why qemu stopped and not exactly
what the current VM state is.
One example to demonstrate this problem is that vm_start() calls
vm_state_notify() with reason=0, which turns out to be VMSTOP_USER.
This commit fixes that by replacing the VMSTOP macros with a proper
state type called RunState.
Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
virtio_common_init() allocates RAM for the vdev struct (and any
additional memory, depending on the size passed to the function). This
memory wasn't being freed until now.
Signed-off-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
We were previously allowing arbitrarily-long indirect descriptors, which
could lead to a buffer overflow in qemu-kvm process.
CVE-2011-2212
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
error_report() prepends location, and appends a newline. The message
constructed from the arguments should not contain a newline. Fix the
obvious offenders.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Add support for event_idx feature, and utilize it to
reduce the number of interrupts and exits for the guest.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Current vm_running was not explicitly initialized and its value was changed by
vm state notifier, this may confuse the virtio device being hotplugged such as
virtio-net with vhost backend as it may think the vm was not running. Solve this
by initialize this value explicitly in virtio_common_init().
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
The virtio_queue_notify() function checks that the virtqueue number is
less than the maximum number of virtqueues. A signed comparison is used
but the virtqueue number could be negative if a buggy or malicious guest
is run. This results in memory accesses outside of the virtqueue array.
It is risky doing input validation in common code instead of at the
guest<->host boundary. Note that virtio_queue_set_addr(),
virtio_queue_get_addr(), virtio_queue_get_num(), and many other virtio
functions do *not* validate the virtqueue number argument.
Instead of fixing the comparison in virtio_queue_notify(), move the
comparison to the virtio bindings (just like VIRTIO_PCI_QUEUE_SEL) where
we have a uint32_t value and can avoid ever calling into common virtio
code if the virtqueue number is invalid.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Virtqueue notify is currently handled synchronously in userspace virtio. This
prevents the vcpu from executing guest code while hardware emulation code
handles the notify.
On systems that support KVM, the ioeventfd mechanism can be used to make
virtqueue notify a lightweight exit by deferring hardware emulation to the
iothread and allowing the VM to continue execution. This model is similar to
how vhost receives virtqueue notifies.
The result of this change is improved performance for userspace virtio devices.
Virtio-blk throughput increases especially for multithreaded scenarios and
virtio-net transmit throughput increases substantially.
Some virtio devices are known to have guest drivers which expect a notify to be
processed synchronously and spin waiting for completion.
For virtio-net, this also seems to interact with the guest stack in strange
ways so that TCP throughput for small message sizes (~200bytes)
is harmed. Only enable ioeventfd for virtio-blk for now.
Care must be taken not to interfere with vhost-net, which uses host
notifiers. If the set_host_notifier() API is used by a device
virtio-pci will disable virtio-ioeventfd and let the device deal with
host notifiers as it wishes.
Finally, there used to be a limit of 6 KVM io bus devices inside the
kernel. On such a kernel, don't use ioeventfd for virtqueue host
notification since the limit is reached too easily. This ensures that
existing vhost-net setups (which always use ioeventfd) have ioeventfds
available so they can continue to work.
After migration and on VM change state (running/paused) virtio-ioeventfd
will enable/disable itself.
* VIRTIO_CONFIG_S_DRIVER_OK -> enable virtio-ioeventfd
* !VIRTIO_CONFIG_S_DRIVER_OK -> disable virtio-ioeventfd
* virtio_pci_set_host_notifier() -> disable virtio-ioeventfd
* vm_change_state(running=0) -> disable virtio-ioeventfd
* vm_change_state(running=1) -> enable virtio-ioeventfd
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Move tracking vmstate change from virtio-net to virtio.c
as it is going to be used by virito-blk and virtio-pci
for the ioeventfd support.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
When migration triggers before a VQ is initialized,
base pa is 0 and last_used_index must be 0 too:
we don't have a ring to compare to.
Reported-by: Juan Quintela <quintela@redhat.com>
Tested-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
(cherry picked from commit cd92f4cc22fbe12a7bf60c9430731f768dc1537c)
Checking available index upon load instead of
only when vm is running makes is easier to
debug failures.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
As status is set to 0 on reset, invoke the relevant callback. This makes
for a cleaner code in devices as they don't need to duplicate the code
in their reset routine, as well as excercises this path a little more.
In particular this makes it possible to unify
vhost-net handling code with the following patch.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
This patch adds trace events for virtqueue operations including
adding/removing buffers, notifying the guest, and receiving a notify
from the guest.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Separate the mapping of requests to host memory from the descriptor iteration.
The next patch will make use of it in a different context.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
After migration, vhost was not getting features
acked because set_features callback was never invoked.
The fix is just to invoke that callback.
Reported-by: David L Stevens <dlstevens@us.ibm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Tested-by: David L Stevens <dlstevens@us.ibm.com>
u_int64_t raises compiler error messages:
CC libhw32/virtio.o
/qemu/ar7/hw/virtio.c: In function ‘virtio_queue_get_avail_size’:
/qemu/ar7/hw/virtio.c:776: error: ‘u_int64_t’ undeclared (first use in this function)
/qemu/ar7/hw/virtio.c:776: error: (Each undeclared identifier is reported only once
/qemu/ar7/hw/virtio.c:776: error: for each function it appears in.)
Replacing u_int64_t by uint64_t helps.
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Anthony Liguori <aliguori@us.ibm.com>
Signed-off-by: Stefan Weil <weil@mail.berlios.de>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
vhost needs physical addresses for ring and other queue fields,
so add APIs for these. In particular, add binding API to set
host/guest notifiers. Will be used by vhost.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
VIRTIO_PCI_QUEUE_MAX is redefined in hw/virtio.c. Let's just keep it in
hw/virtio.h.
Also, bump up the value of the maximum allowed virtqueues to 64. This is
in preparation to allow multiple ports per virtio-console device.
Signed-off-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Add feature bits as properties to virtio. This makes it possible to e.g. define
machine without indirect buffer support, which is required for 0.10
compatibility, or without hardware checksum support, which is required for 0.11
compatibility. Since default values for optional features are now set by qdev,
get_features callback has been modified: it sets non-optional bits, and clears
bits not supported by host.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Gerd Hoffmann <kraxel@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Rename features->guest_features. This is
what they are, avoid confusion with
host features which we also need to keep around.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
migrating between hosts which have different features
might break silently, if the migration destination
does not support some features supported by source.
Prevent this from happening by comparing acked feature
bits with the mask supported by the device.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
wmb must be at least a compiler barrier, even without SMP.
Further, we likely need some rmb()/mb() as well:
I have not audited the code but lguest has mb(),
add a comment for now.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
In the very least, a change like this requires discussion on the list.
The naming convention is goofy and it causes a massive merge problem. Something
like this _must_ be presented on the list first so people can provide input
and cope with it.
This reverts commit 99a0949b720a0936da2052cb9a46db04ffc6db29.
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
initialize vectors for all vqs to VIRTIO_NO_VECTOR rather than 0 which
is a valid vector. This fixes migration which happened before driver
was loaded.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reported-by: Amit Shah <amit.shah@redhat.com>
Tested-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
control vector is saved/restored by virtio-pci,
it does not belong in virtio.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Implement bindings for virtio save/load. Use them in virtio pci.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Extend virtio to support many interrupt vectors, and rearrange code in
preparation for multi-vector support (mostly move reset out to bindings,
because we will have to reset the vectors in transport-specific code).
Actual bindings in pci, and use in net, to follow.
Load and save are not connected to bindings yet, so they are left
stubbed out for now.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Support a new feature flag for indirect ring entries. These are ring
entries which point to a table of buffer descriptors.
The idea here is to increase the ring capacity by allowing a larger
effective ring size whereby the ring size dictates the number of
requests that may be outstanding, rather than the size of those
requests.
This should be most effective in the case of block I/O where we can
potentially benefit by concurrently dispatching a large number of
large requests. Even in the simple case of single segment block
requests, this results in a threefold increase in ring capacity.
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Change the vring descriptor helpers to take the physical
address of the descriptor table rather than a virtqueue.
This is needed in order to allow these helpers to be used
with an indirect descriptor table.
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Add the parameter 'order' to qemu_register_reset and sort callbacks on
registration. On system reset, callbacks with lower order will be
invoked before those with higher order. Update all existing users to the
standard order 0.
Note: At least for x86, the existing users seem to assume that handlers
are called in their registration order. Therefore, the patch preserves
this property. If someone feels bored, (s)he could try to identify this
dependency and express it properly on callback registration.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>