The old code detected an overlapping allocation even when the
allocations didn't actually overlap, but were only adjacent.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Handling overlapping allocations isn't just a detail of cluster
allocation. It is rather one of three ways to get the host cluster
offset for a write request:
1. If a request overlaps an in-flight allocations, the cluster offset
can be taken from there (this is what handle_dependencies will evolve
into) or the request must just wait until the allocation has
completed. Accessing the L2 is not valid in this case, it has
outdated information.
2. Outside overlapping areas, check the clusters that can be written to
as they are, with no COW involved.
3. If a COW is required, allocate new clusters
Changing the code to reflect this doesn't change the behaviour because
overlaps cannot exist for clusters that are kept in step 2. It does
however make it easier for later patches to work on clusters that belong
to an allocation that is still in flight.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
The unlock wakes up the next coroutine, but the currently running
coroutine will lock it again before it yields, so this doesn't make a
lot of sense.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This should be based on the virtual disk size, not on the size of the
image.
Interesting observation: With some VM state stored in the image file,
percentages higher than 100% are possible, even though snapshots
themselves are ignored. This is a qcow2 bug to be fixed another day: The
VM state should be discarded in the active L2 tables after completing
the snapshot creation.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
The new parameter is unused yet.
This part was missing in commit 787e4a8500.
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: Eric Blake <eblake@redhat.com>
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Commit 787e4a85 [block: Add options QDict to bdrv_file_open() prototypes] didn't
update rbd.c accordingly.
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Liu Yuan <tailai.ly@taobao.com>
Reviewed-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
A file name may only specified if no host or socket path is specified.
The latter two may not appear at the same time either.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
The URL method already takes care to apply the default port when none is
specfied. Directly specifying driver-specific options required the port
number until now. Allow leaving it out and apply the default.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
In order to achieve this, the .bdrv_probe callbacks of all drivers must
cope with this. The DMG driver is the only one that bases its decision
on the filename and it needs to be changed.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
If a driver needs structured data and not just a string, it can provide
a .bdrv_parse_filename callback now that parses the command line string
into separate options. Keeping this separate from .bdrv_open_filename
ensures that the preferred way of directly specifying the options always
works as well if parsing the string works.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
The existing parsers for the file name now parse everything into the
bdrv_open() options QDict. Instead of using these parsers, you can now
directly specify the options on the command line, like this:
qemu-system-x86_64 -drive file=nbd:,file.port=1234,file.host=::1
Clearly the file=... part could use further improvement, but it's a
start.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
The NBD block supports an URL syntax, for which a URL parser returns
separate hostname and port fields. It also supports the traditional qemu
syntax encoded in a filename. Until now, after parsing the URL to get
each piece of information, a new string is built to be fed to socket
functions.
Instead of building a string in the URL case that is immediately parsed
again, parse the string in both cases and use the QemuOpts interface to
qemu-sockets.c.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Need to pass an options QDict to qcow2_open() now. This fixes a segfault
on the migration target with qcow2.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Sheepdog (neither quorum nor unsafe mode) will refuse to serve IO requests when
number of alive nodes is less than that of copies specified by users. This will
return 0x19 to QEMU client which currently doesn't recognize it.
This patch adds an error description when QEMU client receives it, other than
plainly printing 'Invalid error code'
Cc: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Liu Yuan <tailai.ly@taobao.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Now that each AioContext has a ThreadPool and the main loop AioContext
can be fetched with bdrv_get_aio_context(), we can eliminate the concept
of a global thread pool from thread-pool.c.
The submit functions must take a ThreadPool* argument.
block/raw-posix.c and block/raw-win32.c use
aio_get_thread_pool(bdrv_get_aio_context(bs)) to fetch the main loop's
ThreadPool.
tests/test-thread-pool.c must be updated to reflect the new
thread_pool_submit() function prototypes.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
If an io_flush handler is not set, qemu_aio_wait doesn't invoke
callbacks.
Signed-off-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Using a blocking socket in the coroutine context reduces the chance of
switching to other work. This patch makes the sheepdog driver use a
non-blocking fd always.
Signed-off-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Otherwise, live migration of the top layer will miss zero clusters and
let the backing file show through. This also matches what is done in qed.
QCOW2_CLUSTER_ZERO clusters are invalid in v2 image files. Check this
directly in qcow2_get_cluster_offset instead of replicating the test
everywhere.
Cc: qemu-stable@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
We already flush when the function completes. There is no need to flush
after every compressed cluster.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The update_cluster_refcount() function increments/decrements a cluster's
refcount and then returns the new refcount value.
There is no need to flush since both update_cluster_refcount() callers
already take care of this:
1. qcow2_alloc_bytes() calls update_cluster_refcount() when compressed
sectors will be appended to an existing cluster with enough free
space. qcow2_alloc_bytes() already flushes so there is no need to do
so in update_cluster_refcount().
2. qcow2_update_snapshot_refcount() sets a cache dependency on refcounts
if it needs to update L2 entries. It also flushes before completing.
Removing this flush significantly speeds up qcow2 snapshot creation:
$ qemu-img create -f qcow2 test.qcow2 -o size=50G,preallocation=metadata
$ time qemu-img snapshot -c new test.qcow2
Time drops from more than 3 minutes to under 1 second.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Users of qcow2_update_snapshot_refcount() do not flush consistently.
qcow2_snapshot_create() flushes but qcow2_snapshot_goto() and
qcow2_snapshot_delete() do not.
Solve this by moving the bdrv_flush() into
qcow2_update_snapshot_refcount().
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Compressed writes use qcow2_alloc_bytes() to allocate space with byte
granularity. The affected clusters' refcounts will be incremented but
we do not need to flush yet.
Set a L2 cache dependency on the refcount block cache, so that the
refcounts get written out before the L2 updates.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Since qcow2 metadata is cached we need to flush the caches, not just the
underlying file. Use bdrv_flush(bs) instead of bdrv_flush(bs->file).
Also add the error return path when bdrv_flush() fails and move the
flush after checking for qcow2_alloc_clusters() failure so that the
qcow2_alloc_clusters() error return value takes precedence.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
update_refcount() affects the refcount cache, it does not write to disk.
Therefore bdrv_flush(bs->file) does nothing. We need to flush the
refcount cache in order to write out the refcount updates!
While we're here also add error returns when qcow2_cache_flush() fails.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
qcow2 images now accept a boolean lazy_refcounts options. Use it like
this:
-drive file=test.qcow2,lazy_refcounts=on
If the option is specified on the command line, it overrides the default
specified by the qcow2 header flags that were set when creating the
image.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
It doesn't do anything yet except storing the options QDict in the
BlockDriverState.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
this patch adds iscsi_truncate which effectively allows for
online resizing of iscsi volumes. for this to work you have
to resize the volume on your storage and then call
block_resize command in qemu which will issue a
readcapacity16 to update the capacity.
v4:
- factor out complete readcapacity logic into a separate function
- handle capacity change check condition in readcapacity function
(this happens if the block_resize cmd is the first iscsi task
executed after a resize on the storage)
v3:
- remove switch statement in iscsi_open
- create separate patch for brdv_drain_all() in bdrv_truncate()
v2:
- add a general bdrv_drain_all() before bdrv_truncate() to avoid
in-flight AIOs while the device is truncated
- since no AIOs are in flight we can use a sync libiscsi call
to re-read the capacity
- factor out the readcapacity16 logic as it is redundant
to iscsi_open() and iscsi_truncate().
Signed-off-by: Peter Lieven <pl@kamp.de>
[allow any type of unit attention check condition in iscsi_readcapacity_sync(),
as in Message-ID: <51263A2A.6070304@dlhnet.de> - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
the storage might return a check condition status for various reasons.
(e.g. bus reset, capacity change, thin-provisioning info etc.)
currently all these informative status responses lead to an I/O error
which is populated to the guest. this patch introduces a retry mechanism
to avoid this.
Signed-off-by: Peter Lieven <pl@kamp.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This patch adds support for a unix domain socket for a connection
between qemu and local sheepdog server. You can use the unix domain
socket with the following syntax:
$ qemu sheepdog+unix:///<vdiname>?socket=<socket path>[#snapid]
Signed-off-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This uses the form "<host>:<port>" for the representation of the
sheepdog server to use inet_connect.
Signed-off-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
The URI syntax is consistent with the NBD and Gluster syntax. The
syntax is
sheepdog[+tcp]://[host:port]/vdiname[#snapid|#tag]
Signed-off-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
The qemu-img check command can display fragmentation statistics:
* Total number of clusters in virtual disk
* Number of allocated clusters
* Number of fragmented clusters
This patch adds fragmentation statistics support to qcow2.
Compressed and normal clusters count as allocated. Zero clusters are
not counted as allocated unless their L2 entry has a non-zero offset
(e.g. preallocation).
Only the current L1 table counts towards the statistics - snapshots are
ignored.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
The check_refcounts_l1/l2() functions have a check_copied argument to
check that the QCOW_O_COPIED flag is consistent with refcount == 1.
This should be a bool, not an int.
However, the next patch introduces qcow2 fragmentation statistics and
also needs to pass an option to check_refcounts_l1/l2(). This is a good
opportunity to use an int flags field.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This patch adds the support for reporting the image end offset (in
bytes). This is particularly useful after a conversion (or a rebase)
where the destination is a block device in order to find the first
unused byte at the end of the image.
Signed-off-by: Federico Simoncelli <fsimonce@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The curl_easy_setopt(state->curl, CURLOPT_PROTOCOLS, ...) interface was
introduced in libcurl 7.19.4. Therefore we cannot protect against
CVE-2013-0249 when linking against an older libcurl.
This fixes the build failure introduced by
fb6d1bbd24.
Reported-by: Andreas Färber <afaerber@suse.de>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Tested-by: Andreas Färber <andreas.faeber@web.de>
Message-id: 1360743934-8337-1-git-send-email-stefanha@redhat.com
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
This reverts commit f880defbb0.
Jeff Cody's testing revealed that the interpretation of size differs
even between VirtualPC and HyperV. Revert this so there is time to
consider the impact of any backwards incompatible behavior this change
creates.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Linux block devices can be set read-only with "blockdev --setro
<device>". The same thing can be done for LVM volumes using "lvchange
--permission r <volume>". This read-only setting is independent of
device node permissions. Therefore the device can still be opened
O_RDWR but actual writes will fail.
This results in odd behavior for QEMU. bdrv_open() is supposed to fail
if a read-only image is being opened with BDRV_O_RDWR. By not failing
for Linux block devices, the guest boots up but every write produces an
I/O error.
This patch checks whether the block device is read-only so that Linux
block devices behave like regular files.
Reported-by: Sibiao Luo <sluo@redhat.com>
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The size calculated from the CHS values is not the real image (disk) size,
but usually a smaller value. This is caused by rounding effects.
Only older operating systems use CHS. Such guests won't be able to use
the whole disk. All modern operating systems use the real size.
This patch fixes https://bugs.launchpad.net/qemu/+bug/1105670/.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Message-id: 1360265212-22037-1-git-send-email-sw@weilnetz.de
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
There is a buffer overflow in libcurl POP3/SMTP/IMAP. The workaround is
simple: disable extra protocols so that they cannot be exploited. Full
details here:
http://curl.haxx.se/docs/adv_20130206.html
QEMU only cares about HTTP, HTTPS, FTP, FTPS, and TFTP. I have tested
that this fix prevents the exploit on my host with
libcurl-7.27.0-5.fc18.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Commit eeb6b45d48 (block: raw-posix image
file reopen) broke the build on OpenIndiana.
illumos has no O_ASYNC. Exclude it from flags to be compared
and instead assert that it is not set where defined.
Cf. e61ab1da7e for qemu-ga.
Cc: qemu-stable@nongnu.org (1.3.x)
Cc: Jeff Cody <jcody@redhat.com>
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Andreas Färber <andreas.faerber@web.de>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
The previous scanf() format string stopped parsing the file name on the
first white white space, which seems to be allowed at least by VMware
Workstation.
Change the format string to collect everything between the first and
second quote as the file name, disallowing line breaks.
Signed-off-by: Philipp Hahn <hahn@univention.de>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Return -errno instead of -1 on errors. Hey, no memory leak to fix here
while we're touching it!
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
The buffers are allocated with g_(re)alloc, so use g_free to free them.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Return -errno instead of -1 on errors and add error checks in some
places that didn't have one. Passing things by reference requires more
correct typing, replaced a few off_ts therefore - with a 32-bit off_t
this is even a fix for truncation bugs.
While touching the code, fix even some more memory leaks than in the
other drivers...
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Return -errno instead of -1 on errors. While touching the
code, fix a memory leak.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Return -errno instead of -1 on errors. While touching the
code, fix a memory leak.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Return -errno instead of -1 on errors. While touching the
code, fix a memory leak.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Sheep daemon needs vdi_id to identify which vdi is closed to release resources
such as object cache.
Cc: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Liu Yuan <tailai.ly@taobao.com>
Reviewed-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Introduce a new option "adapter_type" when converting to vmdk images.
It can be one of the following: ide (default), buslogic, lsilogic
or legacyESX (according to the vmdk spec from vmware).
In case of a non-ide adapter, heads is set to 255 instead of the 16.
The latter is used for "ide".
Also see LP#545089
Signed-off-by: Othmar Pasteka <pasteka@kabsi.at>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Once upon a time, it was decided that qemu_malloc(0) should abort.
Switching to glib retired that bright idea. Some code that was added
to cope with it (e.g. in commits 702ef63, b76b6e9) is still around.
Bury it.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
On a zero-sized disk we need to break out of the job successfully
before bdrv_dirty_iter_init is called, otherwise you will get an
assertion failure with the next patch.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Laszlo Ersek <lersek@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
vdi_open did not check for a bad signature.
This check was only in vdi_probe.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
vdi_open returned -1 in case of any error, but it should return an
error code (negative value of errno or -EMEDIUMTYPE).
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The signature is a 32 bit value and needs up to 8 hex digits for printing.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This improves error reports for bochs, cow, qcow, qcow2, qed and vmdk
when a file with the wrong format is selected.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Yet another optimization is to extend the mirroring iteration to include more
adjacent dirty blocks. This limits the number of I/O operations and makes
mirroring efficient even with a small granularity. Most of the infrastructure
is already in place; we only need to put a loop around the computation of
the origin and sector count of the iteration.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
With AIO support in place, we can start copying more than one chunk
in parallel. This patch introduces the required infrastructure for
this: the buffer is split into multiple granularity-sized chunks,
and there is a free list to access them.
Because of copy-on-write, a single operation may already require
multiple chunks to be available on the free list.
In addition, two different iterations on the HBitmap may want to
copy the same cluster. We avoid this by keeping a bitmap of in-flight
I/O operations, and blocking until the previous iteration completes.
This should be a pretty rare occurrence, though; as long as there is
no overlap the next iteration can start before the previous one finishes.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This makes sense when the next commit starts using the extra buffer space
to perform many I/O operations asynchronously.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
There is really no change in the behavior of the job here, since
there is still a maximum of one in-flight I/O operation between
the source and the target. However, this patch already introduces
the AIO callbacks (which are unmodified in the next patch)
and some of the logic to count in-flight operations and only
complete the job when there is none.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The desired granularity may be very different depending on the kind of
operation (e.g. continuous replication vs. collapse-to-raw) and whether
the VM is expected to perform lots of I/O while mirroring is in progress.
Allow the user to customize it, while providing a sane default so that
in general there will be no extra allocated space in the target compared
to the source.
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Laszlo Ersek <lersek@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
When mirroring runs, the backing files for the target may not yet be
ready. However, this means that a copy-on-write operation on the target
would fill the missing sectors with zeros. Copy-on-write only happens
if the granularity of the dirty bitmap is smaller than the cluster size
(and only for clusters that are allocated in the source after the job
has started copying). So far, the granularity was fixed to 1MB; to avoid
the problem we detected the situation and required the backing files to
be available in that case only.
However, we want to lower the granularity for efficiency, so we need
a better solution. The solution is to always copy a whole cluster the
first time it is touched. The code keeps a bitmap of clusters that
have already been allocated by the mirroring job, and only does "manual"
copy-on-write if the chunk being copied is zero in the bitmap.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This actually uses the dirty bitmap in the block layer, and converts
mirroring to use an HBitmapIter.
Reviewed-by: Laszlo Ersek <lersek@redhat.com> (except block/mirror.c parts)
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This patch adds support for directly passing the iovec
array from QEMUIOVector if libiscsi supports it (1.8.0
or newer).
Signed-off-by: Peter Lieven <pl@kamp.de>
[Preserve the improvements from commit 4cc841b, iscsi: partly
avoid iovec linearization in iscsi_aio_writev, 2012-11-19 - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
acb->buf is freed in the WRITE(16) callback, but this may not
get called at all when commands are aborted. Add another
free in the ABORT TASK callback, which requires setting acb->buf
to NULL everywhere.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
# By Peter Lieven (3) and others
# Via Paolo Bonzini
* bonzini/scsi-next:
scsi: Drop useless null test in scsi_unit_attention()
lsi: use qbus_reset_all to reset SCSI bus
scsi: fix segfault with 0-byte disk
iscsi: add support for iSCSI NOPs [v2]
iscsi: partly avoid iovec linearization in iscsi_aio_writev
iscsi: add iscsi_create support
This patch will send NOP-Out PDUs every 5 seconds to the iSCSI target.
If a consecutive number of NOP-In replies fail a reconnect is initiated.
iSCSI NOPs help to ensure that the connection to the target is still operational.
This should not, but in reality may be the case even if the TCP connection is still
alive if there are bugs in either the target or the initiator implementation.
v2:
- track the NOPs inside libiscsi so libiscsi can reset the counter
in case it initiates a reconnect.
Signed-off-by: Peter Lieven <pl@kamp.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
libiscsi expects all write16 data in a linear buffer. If the
iovec only contains one buffer we can skip the linearization
step as well as the additional malloc/free and pass the
buffer directly.
Reported-by: Ronnie Sahlberg <ronniesahlberg@gmail.com>
Signed-off-by: Peter Lieven <pl@kamp.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This patch adds support for bdrv_create. This allows e.g.
to use qemu-img to convert from any supported device to
an iscsi backed storage as destination.
Signed-off-by: Peter Lieven <pl@kamp.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
# By Kevin Wolf (4) and others
# Via Stefan Hajnoczi
* stefanha/block:
dataplane: support viostor virtio-pci status bit setting
dataplane: avoid reentrancy during virtio_blk_data_plane_stop()
win32-aio: use iov utility functions instead of open-coding them
win32-aio: Fix memory leak
win32-aio: Fix vectored reads
aio: Fix return value of aio_poll()
ide: Remove wrong assertion
block: fix null-pointer bug on error case in block commit
Fixes the build on OpenBSD among others.
Suggested-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Andreas Färber <andreas.faerber@web.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
We have iov_from_buf() and iov_to_buf(), use them instead of
open-coding these in block/win32-aio.c
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
The buffer is allocated for both reads and writes, and obviously it
should be freed even if an error occurs.
Cc: qemu-stable@nongnu.org
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Copying data in the right direction really helps a lot!
Cc: qemu-stable@nongnu.org
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This is a bug that was caught by a coverity run by Markus. In
the error case when we errored out to exit_restore_open early in the
function, 'overlay_bs' was still NULL at that point, although it is
used to look up flags and perform a bdrv_reopen().
Move the overlay_bs lookup to where it is needed, and check for NULL
before restoring the flags. Also get rid of the unneeded parameter
initialization.
Reported-By: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
It allocates with qemu_blockalign(), therefore it must free with
qemu_vfree(), not g_free().
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
win32_aio_submit() allocates it with qemu_blockalign(), therefore it
must be freed with qemu_vfree(), not g_free().
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
The last two parameters of sd_aio_setup() are never used, so remove them.
Cc: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Liu Yuan <tailai.ly@taobao.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This will reduce sockfds connected to the sheep server to one, which simply the
future hacks.
Cc: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Liu Yuan <tailai.ly@taobao.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This is easy with the thread pool, because we can use s->is_xfs and
s->has_discard from the worker function.
QEMU has a widespread assumption that each I/O operation writes less
than 2^32 bytes. This patch doesn't fix it throughout of course,
but it starts correcting struct RawPosixAIOData so that there is
no regression with respect to the synchronous discard implementation.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Block devices use a ioctl instead of fallocate, so add a separate
implementation.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Avoid sending system calls repeatedly if they shall fail. This
does not apply to XFS: if the filesystem-specific ioctl fails,
something weird is happening.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Linux 2.6.38 introduced the filesystem independent interface to
deallocate part of a file. As of Linux 3.7, btrfs, ext4, ocfs2,
tmpfs and xfs support it.
Even though the system calls here are in practice issued on Linux,
the code is structured to allow plugging in alternatives for other Unix
variants. EOPNOTSUPP is used unconditionally in this patch, but it is
supported in both OpenBSD and Mac OS X since forever (see for example
http://lists.debian.org/debian-glibc/2006/02/msg00337.html).
Signed-off-by: Kusanagi Kouichi <slash@ac.auone-net.jp>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
One of the recent refactoring patches (commit f50f88b9) didn't take care
to initialise l2meta properly, so with zero-length writes, which don't
even enter the write loop, qemu just segfaulted.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
The qiov_is_aligned() function checks whether a QEMUIOVector meets a
BlockDriverState's alignment requirements. This is needed by
virtio-blk-data-plane so:
1. Move the function from block/raw-posix.c to block/block.c.
2. Make it public in block/block.h.
3. Rename to bdrv_qiov_is_aligned().
4. Change return type from int to bool.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
When the raw-posix aio=thread code was moved from posix-aio-compat.c
to block/raw-posix.c, there was an unintended change to the ioctl code.
The code used to return the ioctl command, which posix_aio_read()
would later morph into a zero. This hack is not necessary anymore,
and in fact breaks scsi-generic (which expects a zero return code).
Remove it.
Cc: qemu-stable@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Sheepdog supports both writeback/writethrough write but has not yet supported
DIRECTIO semantics which bypass the cache completely even if Sheepdog daemon is
set up with cache enabled.
Suppose cache is enabled on Sheepdog daemon size, the new cache control is
cache=writeback # enable the writeback semantics for write
cache=writethrough # enable the emulated writethrough semantics for write
cache=directsync # disable cache competely
Guest WCE toggling on the run time to toggle writeback/writethrough is also
supported.
Cc: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: Stefan Hajnoczi <stefanha@gmail.com>
Signed-off-by: Liu Yuan <tailai.ly@taobao.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This allows removing of MinGW specific code and improves
reentrancy for POSIX hosts.
[Removed unused ret variable in qemu_get_timedate() to fix warning:
vl.c: In function ‘qemu_get_timedate’:
vl.c:451:16: error: variable ‘ret’ set but not used [-Werror=unused-but-set-variable]
-- Stefan Hajnoczi]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Cc: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Cc: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Liu Yuan <tailai.ly@taobao.com>
Reviewed-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
For the error case such as SD_RES_NO_SPACE, we shouldn't update the inode bitmap
to avoid the scenario that the object is allocated but wasn't created at the
server side. This will result in VM's IO error on the failed object.
Cc: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Cc: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Liu Yuan <tailai.ly@taobao.com>
Reviewed-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Commit fbcad04d6b added fprintf statements
with wrong format specifiers.
GetLastError() returns a DWORD which is unsigned long, so %lu must be used.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
The raw_get_aio_fd() function allows virtio-blk-data-plane to get the
file descriptor of a raw image file with Linux AIO enabled. This
interface is really a layering violation that can be resolved once the
block layer is able to run outside the global mutex - at that point
virtio-blk-data-plane will switch from custom Linux AIO code to using
the block layer.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Touching char/char.h basically causes the whole of QEMU to
be rebuilt. Avoid this, it is usually unnecessary.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Various header files rely on qemu-char.h including qemu-config.h or
main-loop.h, but they really do not need qemu-char.h at all (particularly
interesting is the case of the block layer!). Clean this up, and also
add missing inclusions of qemu-char.h itself.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
There's no reason for run_dependent_requests() to hold s->lock, and a
later patch will require that in fact the lock is not held.
Also, before this patch, run_dependent_requests() not only does what its
name suggests, but also removes the l2meta from the list of in-flight
requests. When changing this, it becomes an one-liner, so just inline it
completely.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This is closer to where the dirty flag is really needed, and it avoids
having checks for special cases related to cluster allocation directly
in the writev loop.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Even for writes to already allocated clusters, an l2meta is allocated,
though it stays effectively unused. After this patch, only allocating
requests still have one. Each l2meta now describes an in-flight request
that writes to clusters that are not yet hooked up in the L2 table.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
There's no real reason to have an l2meta for normal requests that don't
allocate anything. Before we can get rid of it, we must return the host
cluster offset in a different way.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
As soon as delayed COW is introduced, the l2meta struct is needed even
after completion of the request, so it can't live on the stack.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This makes it easier to address the areas for which a COW must be
performed. As a nice side effect, the COW code in
qcow2_alloc_cluster_link_l2 becomes really trivial.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The offset within the cluster is already present as n_start and this is
what the code uses. QCowL2Meta.offset is only needed at a cluster
granularity.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We want to use these events to suspend requests for testing concurrent
AIO requests. Suspending requests while they are holding the CoMutex is
rather boring for this purpose.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This allows more systematic AIO testing. The patch adds three new
operations to blkdebug:
* Setting a "breakpoint" on a blkdebug event. The next request that
triggers this breakpoint is suspended and is tagged with a name.
The breakpoint is removed after a request has triggered it.
* A suspended request (identified by it's tag) can be resumed
* It's possible to check whether a suspended request with a given
tag exists. This can be used for waiting for an event.
Ideally, we would instead tag requests right when they are created and
set breakpoints for individual requests. However, at this point the
block layer doesn't allow this easily, and breakpoints that trigger for
any request already allow a lot of useful testing.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The cleanup work to remove a rule depends on the type of the rule. It's
easy for the existing rules as there is no data that must be cleaned up
and is specific to a type yet, but the next patch will change this.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
As soon as new rules can be set during runtime, as introduced by the
next patch, blkdebug makes sense even without a config file.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
An error has occurred if the return value is invalid_set_file_pointer
and getlasterror doesn't return no_error.
Signed-off-by: Fabien Chouteau <chouteau@adacore.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
This one fixes a race which qemu had also in iscsi block driver
between cancellation and io completition.
qemu_rbd_aio_cancel was not synchronously waiting for the end of
the command.
To archieve this it introduces a new status flag which uses
-EINPROGRESS.
Signed-off-by: Stefan Priebe <s.priebe@profihost.ag>
Reviewed-by: Stefan Hajnoczi <stefanha@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
clang now warns about an unused function:
CC block/raw-posix.o
block/raw-posix.c:707:26: warning: unused function paio_ioctl
[-Wunused-function]
static BlockDriverAIOCB *paio_ioctl(BlockDriverState *bs, int fd,
^
1 warning generated.
because the only use of paio_ioctl() is inside a #if defined(__linux__)
guard and it is static now.
Reported-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
The VHD specification allows for up to a 2 TB disk size. The current
implementation in qemu emulates EIDE and ATA-2 hardware which only allows
for up to 127 GB. This disk size limitation can be overridden by allowing
up to 255 heads instead of the normal 4 bit limitation of 16. Doing so
allows disk images to be created of up to nearly 2 TB. This change does
not violate the VHD format specification nor does it change how smaller
disks (ie, <=127GB) are defined.
[Charles Arnold also writes: "In analyzing a 160 GB VHD fixed disk image
created on Windows 2008 R2, it appears that MS is also ignoring the CHS
values in the footer geometry field in whatever driver they use for
accessing the image. The CHS values are set at 65535,16,255 which
obviously doesn't represent an image size of 160 GB." -- Stefan]
Signed-off-by: Charles Arnold <carnold@suse.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Initialize the uuid field in the footer with a generated uuid.
Signed-off-by: Charles Arnold <carnold@suse.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Without any complex checks we can't assume that an
iscsi target is initialized to zero.
Signed-off-by: Peter Lieven <pl@kamp.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If the connection is interrupted before the first login is successfully
completed qemu-kvm is waiting forever in qemu_aio_wait().
This is fixed by performing an sync login to the target. If the
connection breaks after the first successful login errors are
handled internally by libiscsi.
Signed-off-by: Peter Lieven <pl@kamp.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If an invalid URL is specified iscsi_get_error(iscsi) is called
with iscsi == NULL.
Signed-off-by: Peter Lieven <pl@kamp.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
rbd / rados tends to return pretty often length of writes
or discarded blocks. These values might be bigger than int.
The steps to reproduce are:
mkfs.xfs -f a whole device bigger than int in bytes. mkfs.xfs sends
a discard. Important is that you use scsi-hd and set
discard_granularity=512. Otherwise rbd disabled discard support.
Signed-off-by: Stefan Priebe <s.priebe@profihost.ag>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
It's poor symbol hygiene to provide a global symbols that collide with a
common library like libuuid. If QEMU links against a shared library
that depends on uuid_generate() it can end up calling our stub version
of the function.
This exact scenario happened with GlusterFS libgfapi.so, which depends
on libglusterfs.so's uuid_generate().
Scope the uuid stubs for vdi.c only and avoid affecting other shared
objects.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
For hdev, floppy, and cdrom, the reopen() handlers are the same as
for the file reopen handler. For floppy and cdrom types, however,
we keep O_NONBLOCK, as in the _open function.
Signed-off-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Fixed a MAJOR BUG in VMDK files on file boundaries on reads
and ALSO ON WRITES WHICH MIGHT CORRUPT THE IMAGE AND DATA!!!!!!
Triggered for example with the following VMDK file (partly listed):
RW 4193792 FLAT "XP-W1-f001.vmdk" 0
RW 2097664 FLAT "XP-W1-f002.vmdk" 0
RW 4193792 FLAT "XP-W1-f003.vmdk" 0
RW 512 FLAT "XP-W1-f004.vmdk" 0
RW 4193792 FLAT "XP-W1-f005.vmdk" 0
RW 2097664 FLAT "XP-W1-f006.vmdk" 0
RW 4193792 FLAT "XP-W1-f007.vmdk" 0
RW 512 FLAT "XP-W1-f008.vmdk" 0
Patch includes:
1.) Patch fixes wrong calculation on extent boundaries. Especially it
fixes the relativeness of the sector number to the current extent.
Verfied correctness with:
1.) Converted either with Virtualbox to VDI and then with qemu-img and
then with qemu-img only:
VBoxManage clonehd --format vdi /VM/XP-W/new/XP-W1.vmdk ~/.VirtualBox/Harddisks/XP-W1-new-test.vdi
./qemu-img convert -O raw ~/.VirtualBox/Harddisks/XP-W1-new-test.vdi /root/QEMU/VM-XP-W1/XP-W1-via-VBOX.img
md5sum /root/QEMU/VM-XP-W/XP-W1-direct.img
md5sum /root/QEMU/VM-XP-W/XP-W1-via-VBOX.img
=> same MD5 hash
2.) Verified debug log files
3.) Run Windows XP successfully
4.) chkdsk run successfully without any errors
Signed-off-by: Gerhard Wiesinger <lists@wiesinger.com>
Acked-by: Fam Zheng <famcool@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Now that AIOPool no longer keeps a freelist, it isn't really a "pool"
anymore. Rename it to AIOCBInfo and make it const since it no longer
needs to be modified.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Versions before gcc-4.6 don't support unnamed fields in initializers
(see http://gcc.gnu.org/bugzilla/show_bug.cgi?id=10676).
Offset and OffsetHigh belong to an unnamed struct which is part of an
unnamed union. Therefore the original code does not work with older
versions of gcc.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
A missing factor for the refcount table entry size in the calculation
could mean that too little memory was allocated for the in-memory
representation of the table, resulting in a buffer overflow.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Michael Tokarev <mjt@tls.msk.ru>
Tested-by: Michael Tokarev <mjt@tls.msk.ru>
The URI syntax is consistent with the Gluster syntax. Export names
are specified in the path, preceded by one or more (otherwise unused)
slashes.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
With the new support for EventNotifiers in the AIO event loop, we
can hook a completion port to every opened file and use asynchronous
I/O on them.
Wine's support is extremely inefficient, also because it really does
the I/O synchronously on regular files. (!) But it works, and it is
good to keep the Win32 and POSIX ports as similar as possible.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Making the qemu_paiocb specific to raw devices will let us access members
of the BDRVRawState arbitrarily.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The Win32 implementation will only accept EventNotifiers, thus a few
drivers are disabled under Windows. EventNotifiers are a good match
for the GSource implementation, too, because the Win32 port of glib
allows to place their HANDLEs in a GPollFD.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Error management is important for mirroring; otherwise, an error on the
target (even something as "innocent" as ENOSPC) requires to start again
with a full copy. Similar to on_read_error/on_write_error, two separate
knobs are provided for on_source_error (reads) and on_target_error (writes).
The default is 'report' for both.
The 'ignore' policy will leave the sector dirty, so that it will be
retried later. Thus, it will not cause corruption.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Switching to the target of the migration is done mostly asynchronously,
and reported to management via the BLOCK_JOB_COMPLETED event; the only
synchronous phase is opening the backing files. bdrv_open_backing_file
can always be done, even for migration of the full image (aka sync:
'full'). In this case, qmp_drive_mirror will create the target disk
with no backing file at all, and bdrv_open_backing_file will be a no-op.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This patch adds the implementation of a new job that mirrors a disk to
a new image while letting the guest continue using the old image.
The target is treated as a "black box" and data is copied from the
source to the target in the background. This can be used for several
purposes, including storage migration, continuous replication, and
observation of the guest I/O in an external program. It is also a
first step in replacing the inefficient block migration code that is
part of QEMU.
The job is possibly never-ending, but it is logically structured into
two phases: 1) copy all data as fast as possible until the target
first gets in sync with the source; 2) keep target in sync and
ensure that reopening to the target gets a correct (full) copy
of the source data.
The second phase is indicated by the progress in "info block-jobs"
reporting the current offset to be equal to the length of the file.
When the job is cancelled in the second phase, QEMU will run the
job until the source is clean and quiescent, then it will report
successful completion of the job.
In other words, the BLOCK_JOB_CANCELLED event means that the target
may _not_ be consistent with a past state of the source; the
BLOCK_JOB_COMPLETED event means that the target is consistent with
a past state of the source. (Note that it could already happen
that management lost the race against QEMU and got a completion
event instead of cancellation).
It is not yet possible to complete the job and switch over to the target
disk. The next patches will fix this and add many refinements to the
basic idea introduced here. These include improved error management,
some tunable knobs and performance optimizations.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>