If a user chooses to turn on the auto-converge migration capability
these changes detect the lack of convergence and throttle down the
guest. i.e. force the VCPUs out of the guest for some duration
and let the migration thread catchup and help converge.
Verified the convergence using the following :
- Java Warehouse workload running on a 20VCPU/256G guest(~80% busy)
- OLTP like workload running on a 80VCPU/512G guest (~80% busy)
Sample results with Java warehouse workload : (migrate speed set to 20Gb and
migrate downtime set to 4seconds).
(qemu) info migrate
capabilities: xbzrle: off auto-converge: off <----
Migration status: active
total time: 1487503 milliseconds
expected downtime: 519 milliseconds
transferred ram: 383749347 kbytes
remaining ram: 2753372 kbytes
total ram: 268444224 kbytes
duplicate: 65461532 pages
skipped: 64901568 pages
normal: 95750218 pages
normal bytes: 383000872 kbytes
dirty pages rate: 67551 pages
---
(qemu) info migrate
capabilities: xbzrle: off auto-converge: on <----
Migration status: completed
total time: 241161 milliseconds
downtime: 6373 milliseconds
transferred ram: 28235307 kbytes
remaining ram: 0 kbytes
total ram: 268444224 kbytes
duplicate: 64946416 pages
skipped: 64903523 pages
normal: 7044971 pages
normal bytes: 28179884 kbytes
Signed-off-by: Chegu Vinod <chegu_vinod@hp.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>
backup_start() creates a block job that copies a point-in-time snapshot
of a block device to a target block device.
We call backup_do_cow() for each write during backup. That function
reads the original data from the block device before it gets
overwritten. The data is then written to the target device.
Currently backup cluster size is hardcoded to 65536 bytes.
[I made a number of changes to Dietmar's original patch and folded them
in to make code review easy. Here is the full list:
* Drop BackupDumpFunc interface in favor of a target block device
* Detect zero clusters with buffer_is_zero() and use bdrv_co_write_zeroes()
* Use 0 delay instead of 1us, like other block jobs
* Unify creation/start functions into backup_start()
* Simplify cleanup, free bitmap in backup_run() instead of cb
* function
* Use HBitmap to avoid duplicating bitmap code
* Use bdrv_getlength() instead of accessing ->total_sectors
* directly
* Delete the backup.h header file, it is no longer necessary
* Move ./backup.c to block/backup.c
* Remove #ifdefed out code
* Coding style and whitespace cleanups
* Use bdrv_add_before_write_notifier() instead of blockjob-specific hooks
* Keep our own in-flight CowRequest list instead of using block.c
tracked requests. This means a little code duplication but is much
simpler than trying to share the tracked requests list and use the
backup block size.
* Add on_source_error and on_target_error error handling.
* Use trace events instead of DPRINTF()
-- stefanha]
Signed-off-by: Dietmar Maurer <dietmar@proxmox.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
# By Paolo Bonzini (11) and others
# Via Paolo Bonzini
* bonzini/iommu-for-anthony:
memory: clean up phys_page_find
memory: populate FlatView for new address spaces
memory: limit sections in the radix tree to the actual address space size
s390x: reduce TARGET_PHYS_ADDR_SPACE_BITS to 62
memory: fix address space initialization/destruction
memory: make memory_global_sync_dirty_bitmap take an AddressSpace
memory: do not duplicate memory_region_destructor_none
memory: Rename readable flag to romd_mode
memory: Replace open-coded memory_region_is_romd
memory: allow memory_region_find() to run on non-root memory regions
memory: assert that PhysPageEntry's ptr does not overflow
exec: eliminate stq_phys_notdirty
exec: make qemu_get_ram_ptr private
exec: eliminate qemu_put_ram_ptr
exec: remove obsolete comment
Message-id: 1369414987-8839-1-git-send-email-pbonzini@redhat.com
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
qemu_co_queue_next(&queue) arranges that the next queued coroutine is
run at a later point in time. This deferred restart is useful because
the caller may not want to transfer control yet.
This behavior was implemented using QEMUBH in the past, which meant that
CoQueue (and hence CoMutex and CoRwlock) had a dependency on the
AioContext event loop. This hidden dependency causes trouble when we
move to a world with multiple event loops - now qemu_co_queue_next()
needs to know which event loop to schedule the QEMUBH in.
After pondering how to stash AioContext I realized the best solution is
to not use AioContext at all. This patch implements the deferred
restart behavior purely in terms of coroutines and no longer uses
QEMUBH.
Here is how it works:
Each Coroutine has a wakeup queue that starts out empty. When
qemu_co_queue_next() is called, the next coroutine is added to our
wakeup queue. The wakeup queue is processed when we yield or terminate.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Commit e7a09b92b7 added a trace at each
memory freeing, but unfortunately inverted size and pointer when printing
them. Fix trace.
This also led to a compilation error on 32 bit hosts:
In file included from include/trace.h:4:0,
from trace/generated-events.c:3:
./trace/generated-tracers.h: In function ‘trace_qemu_anon_ram_free’:
./trace/generated-tracers.h:64:9: error: format ‘%zu’ expects argument of type
‘size_t’, but argument 3 has type ‘void *’ [-Werror=format]
./trace/generated-tracers.h:64:9: error: format ‘%p’ expects argument of type
‘void *’, but argument 4 has type ‘size_t’ [-Werror=format]
Signed-off-by: Hervé Poussineau <hpoussin@reactos.org>
Signed-off-by: Hervé Poussineau <hpoussin@reactos.org>
Message-id: 1369045989-14016-1-git-send-email-hpoussin@reactos.org
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
We switched from qemu_memalign to mmap() but then we don't modify
qemu_vfree() to do a munmap() over free(). Which we cannot do
because qemu_vfree() frees memory allocated by qemu_{mem,block}align.
Introduce a new function that does the munmap(), luckily the size is
available in the RAMBlock.
Reported-by: Amos Kong <akong@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Amos Kong <akong@redhat.com>
Message-id: 1368454796-14989-3-git-send-email-pbonzini@redhat.com
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
This is preparatory to the introduction of a separate freeing API.
Reported-by: Amos Kong <akong@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Amos Kong <akong@redhat.com>
Message-id: 1368454796-14989-2-git-send-email-pbonzini@redhat.com
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
This provides a way to detect the cast that leads to a (reproducible)
crash even when QOM cast debugging is disabled.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-id: 1368188203-3407-6-git-send-email-pbonzini@redhat.com
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
This patch enable us to know exit reason of KVM_RUN. It will help us
know where the trouble is caused.
Signed-off-by: Kazuya Saito <saito.kazuya@jp.fujitsu.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This patch adds tracepoints at ioctl to kvm. Tracing these ioctl is
useful for clarification whether the cause of troubles is qemu or kvm.
Signed-off-by: Kazuya Saito <saito.kazuya@jp.fujitsu.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This fixes the following error:
In file included from qemu/include/trace.h:4:0,
from trace/generated-events.c:3:
./trace/generated-tracers.h: In function ‘trace_pvscsi_get_sg_list’:
./trace/generated-tracers.h:4271:9: error: format ‘%lu’ expects argument of
type ‘long unsigned int’, but argument 4 has type ‘size_t’ [-Werror=format]
Signed-off-by: Hervé Poussineau <hpoussin@reactos.org>
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
Report the supported speeds for device and port in the error message.
Also add the speeds to the tracepoint. And while being at it drop
the redundant error message in usb_desc_attach, usb_device_attach will
report the error anyway.
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
Signed-off-by: Dmitry Fleytman <dmitry@daynix.com>
Signed-off-by: Yan Vugenfirer <yan@daynix.com>
[ Rename files to vmw_pvscsi, fix setting of hostStatus in
pvscsi_request_cancelled - Paolo ]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reimplement usb-host on top of libusb.
Reasons to do this:
(1) Largely rewritten from scratch, nice opportunity to kill historical
cruft.
(2) Offload usbfs handling to libusb.
(3) Have a single portable code base instead of bsd + linux variants.
(4) Bring usb-host support to any platform supported by libusbx.
For now this goes side-by-side to the existing code. That is only to
simplify regression testing though, at the end of the day I want remove
the old code and support libusb exclusively. Merge early in 1.5 cycle,
remove the old code after 1.5 release or something like this.
Thanks to qdev the old and new code can coexist nicely on linux. Just
use "-device usb-host-linux" to use the old linux driver instead of the
libusb one (which takes over the "usb-host" name).
The bsd driver isn't qdev'ified so it isn't that easy for bsd.
I didn't bother making it runtime switchable, so you have to rebuild
qemu with --disable-libusb to get back the old code.
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
Check for port reset first and skip everything else then.
Add sanity checks for PLS updates.
Add PLC notification when entering PLS_U0 state.
This gets host-initiated port resume going on win8.
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
Make gui update rate adaption code in gui_update() actually work.
Sprinkle in a tracepoint so you can see the code at work. Remove
the update rate adaption code in vnc and make vnc simply use the
generic bits instead.
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
Hardcode depth to 32 bpp. It effectively was that way before because
that is the default surface depth, this just makes it explicit in the
code.
Rename depth to new_depth to make it consistent with the new_width +
new_height names. In theory we can make new_depth changeable (i.e.
allow the guest to fill in -- say -- 16 there). In practice the guests
don't try, the X-Server refuses to start if you ask it to use 16bpp
depth (via DefaultDepth in the Screen section).
Always return the correct rmask+gmask+bmask values for the given
new_depth.
Fix mode setting to also verify at new_depth to make sure we have a
correct DisplaySurface, even if the current video mode happes to be
16bpp (set by vgabios via bochs vbe interface). While being at it
switch over to use qemu_create_displaysurface_from, so the surface is
backed by guest-visible video memory and we save a memcpy.
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
# By Kevin Wolf (22) and Peter Lieven (1)
# Via Stefan Hajnoczi
* stefanha/block: (23 commits)
block: Fix direct use of protocols as driver for bdrv_open()
qcow2: Gather clusters in a looping loop
qcow2: Move cluster gathering to a non-looping loop
qcow2: Allow requests with multiple l2metas
qcow2: Use byte granularity in qcow2_alloc_cluster_offset()
qcow2: Prepare handle_alloc/copied() for byte granularity
qcow2: handle_copied(): Implement non-zero host_offset
qcow2: handle_copied(): Get rid of keep_clusters parameter
qcow2: handle_copied(): Get rid of nb_clusters parameter
qcow2: Factor out handle_copied()
qcow2: Clean up handle_alloc()
qcow2: Finalise interface of handle_alloc()
qcow2: handle_alloc(): Get rid of keep_clusters parameter
qcow2: handle_alloc(): Get rid of nb_clusters parameter
qcow2: Factor out handle_alloc()
qcow2: Decouple cluster allocation from cluster reuse code
qcow2: Change handle_dependency to byte granularity
qcow2: Improve check for overlapping allocations
qcow2: Handle dependencies earlier
qcow2: Remove bogus unlock of s->lock
...
This patch enables us to know RunState transition. It will be userful
for investigation when the trouble occured in special event such like
live migration, shutdown, suspend, and so on.
Signed-off-by: Kazuya Saito <saito.kazuya@jp.fujitsu.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Decouple DisplaySurface allocation & deallocation from DisplayState.
Replace dpy_gfx_resize + dpy_gfx_setdata with a dpy_gfx_replace_surface
function.
This handles the graphic hardware emulation.
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
Split callbacks into separate Ops struct. Pass DisplayChangeListener
pointer as first argument to all callbacks. Uninline a bunch of
display functions and move them from console.h to console.c
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
Move global variables into a struct so multiple thread pools can be
supported in the future.
This patch does not change thread-pool.h interfaces. There is still a
global thread pool and it is not yet possible to create/destroy
individual thread pools. Moving the variables into a struct first makes
later patches easier to review.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
This patch allows to specify multiple directories where qemu should look
for data files. To implement that the behavior of the -L switch is
slightly different now: Instead of replacing the data directory the
path specified will be appended to the data directory list. So when
specifiying -L multiple times all directories specified will be checked,
in the order they are specified on the command line, instead of just the
last one.
Additionally the default paths are always appended to the directory
data list. This allows to specify a incomplete directory (such as the
seabios out/ directory) via -L. Anything not found there will be loaded
from the default paths, so you don't have to create a symlink farm for
all the rom blobs.
For trouble-shooting a tracepoint has been added, logging which blob
has been loaded from which location.
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
Message-id: 1362739344-8068-1-git-send-email-kraxel@redhat.com
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Add streams support to the xhci emulation. No secondary streams yet,
only linear stream arays are supported for now.
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
Add a new virtio transport that uses channel commands to perform
virtio operations.
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Provide a mechanism for qemu to provide fully virtual subchannels to
the guest.
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Provide handlers for (most) channel I/O instructions.
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Yet another optimization is to extend the mirroring iteration to include more
adjacent dirty blocks. This limits the number of I/O operations and makes
mirroring efficient even with a small granularity. Most of the infrastructure
is already in place; we only need to put a loop around the computation of
the origin and sector count of the iteration.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
With AIO support in place, we can start copying more than one chunk
in parallel. This patch introduces the required infrastructure for
this: the buffer is split into multiple granularity-sized chunks,
and there is a free list to access them.
Because of copy-on-write, a single operation may already require
multiple chunks to be available on the free list.
In addition, two different iterations on the HBitmap may want to
copy the same cluster. We avoid this by keeping a bitmap of in-flight
I/O operations, and blocking until the previous iteration completes.
This should be a pretty rare occurrence, though; as long as there is
no overlap the next iteration can start before the previous one finishes.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
There is really no change in the behavior of the job here, since
there is still a maximum of one in-flight I/O operation between
the source and the target. However, this patch already introduces
the AIO callbacks (which are unmodified in the next patch)
and some of the logic to count in-flight operations and only
complete the job when there is none.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
When mirroring runs, the backing files for the target may not yet be
ready. However, this means that a copy-on-write operation on the target
would fill the missing sectors with zeros. Copy-on-write only happens
if the granularity of the dirty bitmap is smaller than the cluster size
(and only for clusters that are allocated in the source after the job
has started copying). So far, the granularity was fixed to 1MB; to avoid
the problem we detected the situation and required the backing files to
be available in that case only.
However, we want to lower the granularity for efficiency, so we need
a better solution. The solution is to always copy a whole cluster the
first time it is touched. The code keeps a bitmap of clusters that
have already been allocated by the mirroring job, and only does "manual"
copy-on-write if the chunk being copied is zero in the bitmap.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This actually uses the dirty bitmap in the block layer, and converts
mirroring to use an HBitmapIter.
Reviewed-by: Laszlo Ersek <lersek@redhat.com> (except block/mirror.c parts)
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
HBitmaps provides an array of bits. The bits are stored as usual in an
array of unsigned longs, but HBitmap is also optimized to provide fast
iteration over set bits; going from one bit to the next is O(logB n)
worst case, with B = sizeof(long) * CHAR_BIT: the result is low enough
that the number of levels is in fact fixed.
In order to do this, it stacks multiple bitmaps with progressively coarser
granularity; in all levels except the last, bit N is set iff the N-th
unsigned long is nonzero in the immediately next level. When iteration
completes on the last level it can examine the 2nd-last level to quickly
skip entire words, and even do so recursively to skip blocks of 64 words or
powers thereof (32 on 32-bit machines).
Given an index in the bitmap, it can be split in group of bits like
this (for the 64-bit case):
bits 0-57 => word in the last bitmap | bits 58-63 => bit in the word
bits 0-51 => word in the 2nd-last bitmap | bits 52-57 => bit in the word
bits 0-45 => word in the 3rd-last bitmap | bits 46-51 => bit in the word
So it is easy to move up simply by shifting the index right by
log2(BITS_PER_LONG) bits. To move down, you shift the index left
similarly, and add the word index within the group. Iteration uses
ffs (find first set bit) to find the next word to examine; this
operation can be done in constant time in most current architectures.
Setting or clearing a range of m bits on all levels, the work to perform
is O(m + m/W + m/W^2 + ...), which is O(m) like on a regular bitmap.
When iterating on a bitmap, each bit (on any level) is only visited
once. Hence, The total cost of visiting a bitmap with m bits in it is
the number of bits that are set in all bitmaps. Unless the bitmap is
extremely sparse, this is also O(m + m/W + m/W^2 + ...), so the amortized
cost of advancing from one bit to the next is usually constant.
Reviewed-by: Laszlo Ersek <lersek@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Many callers pass size_t, which gets silently truncated to uint32_t.
Harmless, because all practical sizes are well below 4GiB. Clean it
up anyway. Size overflow now fails assertions.
Bonus: saves a whole bunch of silly casts.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
virtio-blk-data-plane is a subset implementation of virtio-blk. It only
handles read, write, and flush requests. It does this using a dedicated
thread that executes an epoll(2)-based event loop and processes I/O
using Linux AIO.
This approach performs very well but can be used for raw image files
only. The number of IOPS achieved has been reported to be several times
higher than the existing virtio-blk implementation.
Eventually it should be possible to unify virtio-blk-data-plane with the
main body of QEMU code once the block layer and hardware emulation is
able to run outside the global mutex.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
The virtio-blk-data-plane cannot access memory using the usual QEMU
functions since it executes outside the global mutex and the memory APIs
are this time are not thread-safe.
This patch introduces a virtqueue module based on the kernel's vhost
vring code. The trick is that we map guest memory ahead of time and
access it cheaply outside the global mutex.
Once the hardware emulation code can execute outside the global mutex it
will be possible to drop this code.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Add a new spice chardev to allow arbitrary communication between the
host and the Spice client via the spice server.
Examples:
This allows the Spice client to have a special port for the qemu
monitor:
... -chardev spiceport,name=org.qemu.monitor,id=monitorport
-mon chardev=monitorport
v2:
- remove support for chardev to chardev linking
- conditionnaly compile with SPICE_SERVER_VERSION
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>