-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABAgAGBQJVevYFAAoJEH8JsnLIjy/W3jEP/0hiQ3rCRZ/he8s5maTdT+TR
YSeHkB5rKpz0Uopn1DMn1QrIbUVzX7dyb+uf9zQ0/xRQIzf6k8uxqU/NWrdoF3NK
qx91dGWedwnG+TEBIMbcR7nMrw4dP6kH7uPz/VWMXDHVLz0HIcD95qhKgs0mSY6J
dWqex6ACjXM68zJU5IioagU9evV80WZE1S8z7zfixxtTBx5hCaTVbwalkaCxcrXw
PbZle55rjI8B10+OzgBw0fq10nias+NTndU9CwNBboxmEtAjq8/mQ663vcWlmiFo
9a/hkda27Z5ut/0Tqk1v4uLHauylp++rrAabPBAuCFMKes6cdkddP15Q/r52aJ29
5meodQtbet1rGrM+Aq4vuSuWId71PGypEI/3URDdNfYFNISoeLLsk4lcQUu7VrDD
sRX3Jt8SI3nkIgOnhPyi7NDPmafxFt8yRt5vM8MyR5ynF8NS/2hiAc3wqnbXGjUj
a5GqDCefb1yM0R5HvksuFFt3OnXlKJQ3J+ksXNUJf9DSAZPauqWD696pcTeg8wyy
3PIGkczgUuKTVfFWd3THZxJLAo7ZuqvXBHHV8o1SeMBDxwh4FhTd8Kjvm3rUNFfl
VDox4qwZ1AcxLrxgqazKU7sD9iWBDHURRcpOoUsBys7oxQZnhcmMp1fRlEkTOyrD
HNiSNByqBrtkfeVzHlSe
=QgYk
-----END PGP SIGNATURE-----
Merge remote-tracking branch 'remotes/kevin/tags/for-upstream' into staging
Block layer core and image format patches
# gpg: Signature made Fri Jun 12 16:08:53 2015 BST using RSA key ID C88F2FD6
# gpg: Good signature from "Kevin Wolf <kwolf@redhat.com>"
* remotes/kevin/tags/for-upstream: (25 commits)
block: Fix reopen flag inheritance
block: Add BlockDriverState.inherits_from
block: Add list of children to BlockDriverState
queue.h: Add QLIST_FIX_HEAD_PTR()
block: Drain requests before swapping nodes in bdrv_swap()
block: Move flag inheritance to bdrv_open_inherit()
block: Use QemuOpts in bdrv_open_common()
block: Use macro for cache option names
vmdk: Use bdrv_open_image()
quorum: Use bdrv_open_image()
check-qdict: Test cases for new functions
qdict: Add qdict_{set,copy}_default()
qdict: Add qdict_array_entries()
iotests: Add tests for overriding BDRV_O_PROTOCOL
block: driver should override flags in bdrv_open()
block: Change bitmap truncate conditional to assertion
block: record new size in bdrv_dirty_bitmap_truncate
raw-posix: Fix .bdrv_co_get_block_status() for unaligned image size
vmdk: Use vmdk_find_index_in_cluster everywhere
vmdk: Fix index_in_cluster calculation in vmdk_co_get_block_status
...
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
When reopening an image, the block layer already takes care to reopen
bs->file as well with recalculated inherited flags. The same must happen
for any other child (most notably missing before this patch: backing
files).
If bs->file (or any other child) didn't originally inherit from bs, e.g.
because it was created separately and then only referenced, it must not
inherit flags on reopen either, so check the inherited_from field before
propagation the reopen down.
VMDK already reopened its extents manually; this code can now be
dropped.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Currently, the block layer assumes that any block node can have only one
parent, and if it has a parent, that it inherits some options/flags from
this parent.
This is not true any more: With references used in block device
creation, a single node can be used by multiple parents, or it can be
created separately and not inherit flags from any parent.
To handle reopens correctly, a node must know from which parent it
inherited options. This patch adds the information to BlockDriverState.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
This allows iterating over all children of a given BDS, not only
including bs->file and bs->backing_hd, but also driver-specific
ones like VMDK extents or Quorum children.
For bdrv_swap(), the list of children of the swapped BDS stays at that
BDS (because that's where the pointers stay as well). The list head
moves and pointers to it must be fixed up therefore.
The list of children in the parent of the swapped BDS is not affected by
the swap. The contents of the BDS objects is swapped, so the existing
pointer in the parent automatically points to the newly swapped in BDS.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
bdrv_swap() requires that there are no requests in flight on either of
the two devices. The request coroutine would work on the wrong
BlockDriverState object (with bs->opaque even being interpreted as a
different type potentially) and all sorts of bad things would result
from this.
The currently existing callers mostly ensure that there is no I/O
pending on nodes that are swapped. In detail, this is:
1. Live snapshots. This goes through qmp_transaction(), which calls
bdrv_drain_all() before doing anything. The command is executed
synchronously, so no new I/O can be issued concurrently.
2. snapshot=on in bdrv_open(). We're in the middle of opening the image
(both the original image and its temporary overlay), so there can't
be any I/O in flight yet.
3. Mirroring. bdrv_drain() is already used on the source device so that
the mirror doesn't miss anything. However, the main loop runs between
that and the bdrv_swap() (which is actually a bug, being addressed in
another series), so there is a small window in which new I/O might be
issued that would be in flight during bdrv_swap().
It is safer to just drain the request queue of both devices in
bdrv_swap() instead of relying on callers to do the right thing.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Instead of letting every caller of bdrv_open() determine the right flags
for its child node manually and pass them to the function, pass the
parent node and the role of the newly opened child (like backing file,
protocol layer, etc.).
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Instead of manually parsing options and then deleting them from the
options QDict, just use QemuOpts like most other places that deal with
block device options.
More options will be added there and then QemuOpts is a lot more
manageable than open-coding everything.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
The BDRV_O_PROTOCOL flag should have an impact only if no driver is
specified explicitly. Therefore, if bdrv_open() is called with an
explicit block driver argument (either through the options QDict or
through the drv parameter) and that block driver is a protocol block
driver, BDRV_O_PROTOCOL should be set; if it is a format block driver,
BDRV_O_PROTOCOL should be unset.
While there was code to unset the flag in case a format block driver
has been selected, it only followed the bdrv_fill_options() function
call whereas the flag in fact needs to be adjusted before it is used
there.
With that change, BDRV_O_PROTOCOL will always be set if the BDS should
be a protocol driver; if the driver has been specified explicitly, the
new code will set it; and bdrv_fill_options() will only "probe" a
protocol driver if BDRV_O_PROTOCOL is set. The probing after
bdrv_fill_options() cannot select a protocol driver.
Thus, bdrv_open_image() to open BDS.file is never called if a protocol
BDS is about to be created. With that change in turn it is impossible to
call bdrv_open_common() with a protocol drv and file != NULL, which
allows us to remove the bdrv_swap() call.
This change breaks a test case in qemu-iotest 051:
"-drive file=t.qcow2,file.driver=qcow2" now works because the explicitly
specified "qcow2" overrides the BDRV_O_PROTOCOL which is automatically
set for the "file" BDS (and the filename is just passed down).
Therefore, this patch removes that test case.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This is an artifact of an older version that had both all-bitmap and
single-bitmap truncate functions, and some info got lost in the shuffle.
Bitmaps can only be frozen during a backup operation, and a backup
operation should prevent a resize operation, so just assert that this
cannot happen.
Suggested-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
ce1ffea8 neglected to update the BdrvDirtyBitmap structure
itself for internal consistency. It's currently not an issue,
but for migration and persistence series this will cause headaches.
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
bdrv_swap() touches the fields of a BlockDriverState that are
protected by the ThrottleGroup lock. Although those fields end up in
their original place, they are temporarily swapped in the process,
so there's a chance that an operation on a member of the same group
happening on a different thread can try to use them.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: d92dc40d7c4f1fc5cda5cbbf4ffb7a4670b79d17.1433779731.git.berto@igalia.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
The throttle group support use a cooperative round robin scheduling
algorithm.
The principles of the algorithm are simple:
- Each BDS of the group is used as a token in a circular way.
- The active BDS computes if a wait must be done and arms the right
timer.
- If a wait must be done the token timer will be armed so the token
will become the next active BDS.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: f0082a86f3ac01c46170f7eafe2101a92e8fde39.1433779731.git.berto@igalia.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Group throttling will share ThrottleState between multiple bs.
As a consequence the ThrottleState will be accessed by multiple aio
context.
Timers are tied to their aio context so they must go out of the
ThrottleState structure.
This commit paves the way for each bs of a common ThrottleState to
have its own timer.
Signed-off-by: Benoit Canet <benoit.canet@nodalink.com>
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 6cf9ea96d8b32ae2f8769cead38f68a6a0c8c909.1433779731.git.berto@igalia.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Bitmaps can be in a handful of different states with potentially
more to come as we tool around with migration and persistence patches.
Management applications may need to know why certain bitmaps are
unavailable for various commands, e.g. busy in another operation,
busy being migrated, etc.
Right now, all we offer is BlockDirtyInfo's boolean member 'frozen'.
Instead of adding more booleans, replace it by an enumeration member
'status' with values 'active' and 'frozen'. Then add new value
'disabled'.
Incompatible change. Fine because the changed part hasn't been
released so far.
Suggested-by: Eric Blake <eblake@redhat.com>
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
[Commit message tweaked]
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Bogus image may have a large total_sectors that will overflow the
multiplication. For cleanness, fix the return code so the error message
will be meaningful.
Reported-by: Richard W.M. Jones <rjones@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The following sequence
int fd = open(argv[1], O_RDWR | O_CREAT | O_DIRECT, 0644);
for (i = 0; i < 100000; i++)
write(fd, buf, 4096);
performs 5% better if buf is aligned to 4096 bytes.
The difference is quite reliable.
On the other hand we do not want at the moment to enforce bounce
buffering if guest request is aligned to 512 bytes.
The patch changes default bounce buffer optimal alignment to
MAX(page size, 4k). 4k is chosen as maximal known sector size on real
HDD.
The justification of the performance improve is quite interesting.
From the kernel point of view each request to the disk was split
by two. This could be seen by blktrace like this:
9,0 11 1 0.000000000 11151 Q WS 312737792 + 1023 [qemu-img]
9,0 11 2 0.000007938 11151 Q WS 312738815 + 8 [qemu-img]
9,0 11 3 0.000030735 11151 Q WS 312738823 + 1016 [qemu-img]
9,0 11 4 0.000032482 11151 Q WS 312739839 + 8 [qemu-img]
9,0 11 5 0.000041379 11151 Q WS 312739847 + 1016 [qemu-img]
9,0 11 6 0.000042818 11151 Q WS 312740863 + 8 [qemu-img]
9,0 11 7 0.000051236 11151 Q WS 312740871 + 1017 [qemu-img]
9,0 5 1 0.169071519 11151 Q WS 312741888 + 1023 [qemu-img]
After the patch the pattern becomes normal:
9,0 6 1 0.000000000 12422 Q WS 314834944 + 1024 [qemu-img]
9,0 6 2 0.000038527 12422 Q WS 314835968 + 1024 [qemu-img]
9,0 6 3 0.000072849 12422 Q WS 314836992 + 1024 [qemu-img]
9,0 6 4 0.000106276 12422 Q WS 314838016 + 1024 [qemu-img]
and the amount of requests sent to disk (could be calculated counting
number of lines in the output of blktrace) is reduced about 2 times.
Both qemu-img and qemu-io are affected while qemu-kvm is not. The guest
does his job well and real requests comes properly aligned (to page).
Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-id: 1431441056-26198-3-git-send-email-den@openvz.org
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
The patch introduces new concept: minimal memory alignment for bounce
buffers. Original so called "optimal" value is actually minimal required
value for aligment. It should be used for validation that the IOVec
is properly aligned and bounce buffer is not required.
Though, from the performance point of view, it would be better if
bounce buffer or IOVec allocated by QEMU will be aligned stricter.
The patch does not change any alignment value yet.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-id: 1431441056-26198-2-git-send-email-den@openvz.org
CC: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
The block.c file has grown to over 6000 lines. It is time to split this
file so there are fewer conflicts and the code is easier to maintain.
Extract I/O request processing code:
* Read
* Write
* Zero writes and making the image empty
* Flush
* Discard
* ioctl
* Tracked requests and queuing
* Throttling and copy-on-read
* Block status and allocated functions
* Refreshing block limits
* Reading/writing vmstate
* qemu_blockalign() and friends
The patch simply moves code from block.c into block/io.c.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Move the code to install coroutine and aio emulation function pointers
in a BlockDriver to its own function.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The dirty bitmap functions are called from the block I/O processing
code. Make them visible to block_int.h users so they can be used
outside block.c.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The bdrv_states list is a static variable in block.c.
bdrv_drain_all() and bdrv_flush_all() use this variable to iterate over
all drives.
The next patch will move bdrv_drain_all() and bdrv_flush_all() out of
block.c so it's necessary to switch to the public bdrv_next() interface.
Reviewed-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 1429314609-29776-16-git-send-email-jsnow@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We often don't need the BlockDriverState for functions
that operate on bitmaps. Remove it.
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 1429314609-29776-15-git-send-email-jsnow@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 1429314609-29776-14-git-send-email-jsnow@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Add the "frozen" status booleans, to inform clients
when a bitmap is occupied doing a task.
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-id: 1429314609-29776-13-git-send-email-jsnow@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Add bdrv_clear_dirty_bitmap and a matching QMP command,
qmp_block_dirty_bitmap_clear that enables a user to reset
the bitmap attached to a drive.
This allows us to reset a bitmap in the event of a full
drive backup.
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-id: 1429314609-29776-12-git-send-email-jsnow@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
For "dirty-bitmap" sync mode, the block job will iterate through the
given dirty bitmap to decide if a sector needs backup (backup all the
dirty clusters and skip clean ones), just as allocation conditions of
"top" sync mode.
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 1429314609-29776-11-git-send-email-jsnow@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
A bitmap successor is an anonymous BdrvDirtyBitmap that is intended to
be created just prior to a sensitive operation (e.g. Incremental Backup)
that can either succeed or fail, but during the course of which we still
want a bitmap tracking writes.
On creating a successor, we "freeze" the parent bitmap which prevents
its deletion, enabling, anonymization, or creating a bitmap with the
same name.
On success, the parent bitmap can "abdicate" responsibility to the
successor, which will inherit its name. The successor will have been
tracking writes during the course of the backup operation. The parent
will be safely deleted.
On failure, we can "reclaim" the successor from the parent, unifying
them such that the resulting bitmap describes all writes occurring since
the last successful backup, for instance. Reclamation will thaw the
parent, but not explicitly re-enable it.
BdrvDirtyBitmap operations that target a single bitmap are protected
by assertions that the bitmap is not frozen and/or disabled.
BdrvDirtyBitmap operations that target a group of bitmaps, such as
bdrv_{set,reset}_dirty will ignore frozen/disabled drives with a
conditional instead.
Internal functions that enable/disable dirty bitmaps have assertions
added to them to prevent modifying frozen bitmaps.
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-id: 1429314609-29776-10-git-send-email-jsnow@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Add a status indicating the enabled/disabled state of the bitmap.
A bitmap is by default enabled, but you can lock the bitmap into
a read-only state by setting disabled = true.
A previous version of this patch added a QMP interface for changing
the state of the bitmap, but it has since been removed for now until
a use case emerges where this state must be revealed to the user.
The disabled state WILL be used internally for bitmap migration and
bitmap persistence.
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 1429314609-29776-9-git-send-email-jsnow@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This returns the granularity (in bytes) of dirty bitmap,
which matches the QMP interface and the existing query
interface.
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 1429314609-29776-6-git-send-email-jsnow@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The new command pair is added to manage a user created dirty bitmap. The
dirty bitmap's name is mandatory and must be unique for the same device,
but different devices can have bitmaps with the same names.
The granularity is an optional field. If it is not specified, we will
choose a default granularity based on the cluster size if available,
clamped to between 4K and 64K to mirror how the 'mirror' code was
already choosing granularity. If we do not have cluster size info
available, we choose 64K. This code has been factored out into a helper
shared with block/mirror.
This patch also introduces the 'block_dirty_bitmap_lookup' helper,
which takes a device name and a dirty bitmap name and validates the
lookup, returning NULL and setting errp if there is a problem with
either field. This helper will be re-used in future patches in this
series.
The types added to block-core.json will be re-used in future patches
in this series, see:
'qapi: Add transaction support to block-dirty-bitmap-{add, enable, disable}'
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-id: 1429314609-29776-5-git-send-email-jsnow@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We treat this field with a variety of different types everywhere
in the code. Now it's just uint32_t.
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 1429314609-29776-4-git-send-email-jsnow@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This field will be set for user created dirty bitmap. Also pass in an
error pointer to bdrv_create_dirty_bitmap, so when a name is already
taken on this BDS, it can report an error message. This is not global
check, two BDSes can have dirty bitmap with a common name.
Implemented bdrv_find_dirty_bitmap to find a dirty bitmap by name, will
be used later when other QMP commands want to reference dirty bitmap by
name.
Add bdrv_dirty_bitmap_make_anon. This unsets the name of dirty bitmap.
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 1429314609-29776-3-git-send-email-jsnow@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The image field in BlockDeviceInfo is supposed to contain an ImageInfo
object. However that is being filled in by bdrv_query_info(), not by
bdrv_block_device_info(), which is where BlockDeviceInfo is actually
created.
Anyone calling bdrv_block_device_info() directly will get a null image
field. As a consequence of this, the HMP command 'info block -n -v'
crashes QEMU.
This patch moves the code that fills in that field from
bdrv_query_info() to bdrv_block_device_info().
Signed-off-by: Alberto Garcia <berto@igalia.com>
Message-id: 1429271563-3765-1-git-send-email-berto@igalia.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
There are several error messages that identify a BlockDriverState by
its device name. However those errors can be produced in nodes that
don't have a device name associated.
In those cases we should use bdrv_get_device_or_node_name() to fall
back to the node name and produce a more meaningful message. The
messages are also updated to use the more generic term 'node' instead
of 'device'.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-id: 9823a1f0514fdb0692e92868661c38a9e00a12d6.1428485266.git.berto@igalia.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This function gets the device name associated with a BlockDriverState,
or its node name if the device name is empty.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-id: 4fa30aa8d61d9052ce266fd5429a59a14e941255.1428485266.git.berto@igalia.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
bdrv_aio_* APIs can use coroutines to achieve asynchronicity. However,
the coroutine may terminate without having yielded back to the caller
(for example because of something that invokes a nested event loop,
or because the coroutine is doing nothing at all). In this case,
the bdrv_aio_* API must delay the completion to the next iteration
of the main loop, because bdrv_aio_* will never invoke the callback
before returning.
This can be done with a bottom half, and indeed bdrv_aio_* is always
using one for simplicity. It is possible to gain some performance
(~3%) by avoiding this in the common case. A new field in the
BlockAIOCBCoroutine struct is set to true until the first time the
corotine has yielded to its creator, and completion goes through a
new function bdrv_co_complete. If the flag is false, bdrv_co_complete
invokes the callback immediately. If it is true, the caller will
notice that the coroutine has completed and schedule the bottom
half itself.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 1427524638-28157-1-git-send-email-pbonzini@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This is necessary to suppress more IO requests from being generated from
block job coroutines.
Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Message-id: 1428069921-2957-3-git-send-email-famz@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Currently, throttle timers won't make any progress when VCPU is not
running, which would stall the request queue in utils, qtest, vm
suspending, and live migration, without special handling.
Block jobs are confusingly inconsistent between with and without
throttling: if user sets a bps limit, stops the vm, then start a block
job, the block job will not make any progress; in contrary, if user
unsets the bps limit, or if it's not set, the block job will run
normally.
After this patch, with the host clock, even if the VCPUs are stopped,
the throttle queues will be processed.
This patch also enables potential to add throttle to bdrv_drain_all.
Currently all requests are drained immediately. In other words whenever
it is called, IO throttling goes ineffective (examples: system reset,
migration and many block job operations.). This is a loophole that guest
could exploit. If we use the host clock, we can later just trust the
nested poll. This could be done on top.
Note that for qemu-iotests case 093, which uses qtest, we still keep vm
clock so the script can control the clock stepping in order to be
deterministic.
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
Message-id: 1427268446-6426-1-git-send-email-famz@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This commit was generated mechanically by coccinelle from the following
semantic patch:
@@
expression val;
@@
- (ffs(val) - 1)
+ ctz32(val)
The call sites have been audited to ensure the ffs(0) - 1 == -1 case
never occurs (due to input validation, asserts, etc). Therefore we
don't need to worry about the fact that ctz32(0) == 32.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 1427124571-28598-5-git-send-email-stefanha@redhat.com
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
If the zero write is not aligned, bdrv_co_do_pwritev will segfault
because of accessing to the NULL qiov passed in by bdrv_co_write_zeroes.
Fix this by allocating a local qiov in bdrv_co_do_pwritev if the request
is not aligned. (In this case the padding iovs are necessary anyway, so
it doesn't hurt.)
Also add a check at the end of bdrv_co_do_pwritev to clear the zero flag
if padding is involved.
Signed-off-by: Fam Zheng <famz@redhat.com>
Message-id: 1427160230-4489-2-git-send-email-famz@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
All callers are converted, so drop it.
Signed-off-by: Fam Zheng <famz@redhat.com>
Message-id: 1425296209-1476-5-git-send-email-famz@redhat.com
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
We've steered users away from QCOW/QCOW2 encryption for a while,
because it's a flawed design (commit 136cd19 Describe flaws in
qcow/qcow2 encryption in the docs).
In addition to flawed crypto, we have comically bad usability, and
plain old bugs. Let me show you.
= Example images =
I'm going to use a raw image as backing file, and two QCOW2 images,
one encrypted, and one not:
$ qemu-img create -f raw backing.img 4m
Formatting 'backing.img', fmt=raw size=4194304
$ qemu-img create -f qcow2 -o encryption,backing_file=backing.img,backing_fmt=raw geheim.qcow2 4m
Formatting 'geheim.qcow2', fmt=qcow2 size=4194304 backing_file='backing.img' backing_fmt='raw' encryption=on cluster_size=65536 lazy_refcounts=off
$ qemu-img create -f qcow2 -o backing_file=backing.img,backing_fmt=raw normal.qcow2 4m
Formatting 'normal.qcow2', fmt=qcow2 size=4194304 backing_file='backing.img' backing_fmt='raw' encryption=off cluster_size=65536 lazy_refcounts=off
= Usability issues =
== Confusing startup ==
When no image is encrypted, and you don't give -S, QEMU starts the
guest immediately:
$ qemu-system-x86_64 -nodefaults -display none -monitor stdio normal.qcow2
QEMU 2.2.50 monitor - type 'help' for more information
(qemu) info status
VM status: running
But as soon as there's an encrypted image in play, the guest is *not*
started, with no notification whatsoever:
$ qemu-system-x86_64 -nodefaults -display none -monitor stdio geheim.qcow2
QEMU 2.2.50 monitor - type 'help' for more information
(qemu) info status
VM status: paused (prelaunch)
If the user figured out that he needs to type "cont" to enter his
keys, the confusion enters the next level: "cont" asks for at most
*one* key. If more are needed, it then silently does nothing. The
user has to type "cont" once per encrypted image:
$ qemu-system-x86_64 -nodefaults -display none -monitor stdio -drive if=none,file=geheim.qcow2 -drive if=none,file=geheim.qcow2
QEMU 2.2.50 monitor - type 'help' for more information
(qemu) info status
VM status: paused (prelaunch)
(qemu) c
none0 (geheim.qcow2) is encrypted.
Password: ******
(qemu) info status
VM status: paused (prelaunch)
(qemu) c
none1 (geheim.qcow2) is encrypted.
Password: ******
(qemu) info status
VM status: running
== Incorrect passwords not caught ==
All existing encryption schemes give you the GIGO treatment: garbage
password in, garbage data out. Guests usually refuse to mount
garbage, but other usage is prone to data loss.
== Need to stop the guest to add an encrypted image ==
$ qemu-system-x86_64 -nodefaults -display none -monitor stdio
QEMU 2.2.50 monitor - type 'help' for more information
(qemu) info status
VM status: running
(qemu) drive_add "" if=none,file=geheim.qcow2
Guest must be stopped for opening of encrypted image
(qemu) stop
(qemu) drive_add "" if=none,file=geheim.qcow2
OK
Commit c3adb58 added this restriction. Before, we could expose images
lacking an encryption key to guests, with potentially catastrophic
results. See also "Use without key is not always caught".
= Bugs =
== Use without key is not always caught ==
Encrypted images can be in an intermediate state "opened, but no key".
The weird startup behavior and the need to stop the guest are there to
ensure the guest isn't exposed to that state. But other things still
are!
* drive_backup
$ qemu-system-x86_64 -nodefaults -display none -monitor stdio geheim.qcow2
QEMU 2.2.50 monitor - type 'help' for more information
(qemu) drive_backup -f ide0-hd0 out.img raw
Formatting 'out.img', fmt=raw size=4194304
I guess this writes encrypted data to raw image out.img. Good luck
with figuring out how to decrypt that again.
* commit
$ qemu-system-x86_64 -nodefaults -display none -monitor stdio geheim.qcow2
QEMU 2.2.50 monitor - type 'help' for more information
(qemu) commit ide0-hd0
I guess this writes encrypted data into the unencrypted raw backing
image, effectively destroying it.
== QMP device_add of usb-storage fails when it shouldn't ==
When the image is encrypted, device_add creates the device, defers
actually attaching it to when the key becomes available, then fails.
This is wrong. device_add must either create the device and succeed,
or do nothing and fail.
$ qemu-system-x86_64 -nodefaults -display none -usb -qmp stdio -drive if=none,id=foo,file=geheim.qcow2
{"QMP": {"version": {"qemu": {"micro": 50, "minor": 2, "major": 2}, "package": ""}, "capabilities": []}}
{ "execute": "qmp_capabilities" }
{"return": {}}
{ "execute": "device_add", "arguments": { "driver": "usb-storage", "id": "bar", "drive": "foo" } }
{"error": {"class": "DeviceEncrypted", "desc": "'foo' (geheim.qcow2) is encrypted"}}
{"execute":"device_del","arguments": { "id": "bar" } }
{"timestamp": {"seconds": 1426003440, "microseconds": 237181}, "event": "DEVICE_DELETED", "data": {"path": "/machine/peripheral/bar/bar.0/legacy[0]"}}
{"timestamp": {"seconds": 1426003440, "microseconds": 238231}, "event": "DEVICE_DELETED", "data": {"device": "bar", "path": "/machine/peripheral/bar"}}
{"return": {}}
This stuff is worse than useless, it's a trap for users.
If people become sufficiently interested in encrypted images to
contribute a cryptographically sane implementation for QCOW2 (or
whatever other format), then rewriting the necessary support around it
from scratch will likely be easier and yield better results than
fixing up the existing mess.
Let's deprecate the mess now, drop it after a grace period, and move
on.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Add driver functions for geometry and blocksize detection
Signed-off-by: Ekaterina Tumanova <tumanova@linux.vnet.ibm.com>
Reviewed-by: Thomas Huth <thuth@linux.vnet.ibm.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 1424087278-49393-2-git-send-email-tumanova@linux.vnet.ibm.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
img_convert() and img_amend() use qemu_opts_do_parse(), which reports
errors with qerror_report_err(). Its error messages aren't helpful
here, the caller reports one that actually makes sense. Reproducer:
$ qemu-img convert -o backing_format=raw in.img out.img
qemu-img: Invalid parameter 'backing_format'
qemu-img: Invalid options for file format 'raw'
To fix, propagate errors through qemu_opts_do_parse(). This lifts the
error reporting into callers. Drop it from img_convert() and
img_amend(), keep it in qemu_chr_parse_compat(), bdrv_img_create().
Since I'm touching qemu_opts_do_parse() anyway, write a function
comment for it.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
qemu_opt_set() is a wrapper around qemu_opt_set() that reports the
error with qerror_report_err().
Most of its users assume the function can't fail. Make them use
qemu_opt_set_err() with &error_abort, so that should the assumption
ever break, it'll break noisily.
Just two users remain, in util/qemu-config.c. Switch them to
qemu_opt_set_err() as well, then rename qemu_opt_set_err() to
qemu_opt_set().
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
bdrv_img_create() uses qemu_opt_set(), which reports errors with
qerror_report_err(). Its error messages aren't helpful here, the
caller reports one that actually makes sense. I don't know how to
trigger the error conditions, though.
Switch to qemu_opt_set_err() to get rid of the unwanted messages.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Return the Error object instead of reporting it with
qerror_report_err().
Change callers that assume the function can't fail to pass
&error_abort, so that should the assumption ever break, it'll break
noisily.
Turns out all callers outside its unit test assume that. We could
drop the Error ** argument, but that would make the interface less
regular, so don't.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Do not throw away the value returned by bdrv_check_request() and
bdrv_check_byte_request().
Fix up some coding style issues in the proximity of the affected hunks.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-id: 1423162705-32065-17-git-send-email-mreitz@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Now that request clamping is done in the BlockBackend, the "growable"
field can be removed from the BlockDriverState. All BDSs are now treated
as being "growable" (that is, they are allowed to grow; they are not
necessarily actually able to).
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-id: 1423162705-32065-16-git-send-email-mreitz@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
The argument given to bdrv_find_protocol() is just a file name, which
makes it difficult for the caller to reconstruct what protocol
bdrv_find_protocol() was hoping to find. This patch adds an Error
parameter to that function to solve this issue.
Suggested-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-id: 1423162705-32065-4-git-send-email-mreitz@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
The QERR_ macros are leftovers from the days of "rich" error objects.
They're used with error_set() and qerror_report(), and expand into the
first *two* arguments. This trickiness has become pointless. Clean
up QERR_DEVICE_ENCRYPTED and QERR_DEVICE_NOT_ENCRYPTED.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-id: 1422524221-8566-5-git-send-email-armbru@redhat.com
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-id: 1422524221-8566-4-git-send-email-armbru@redhat.com
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
we check and adjust request sizes at several places with
sometimes inconsistent checks or default values:
INT_MAX
INT_MAX >> BDRV_SECTOR_BITS
UINT_MAX >> BDRV_SECTOR_BITS
SIZE_MAX >> BDRV_SECTOR_BITS
This patches introdocues a macro for the maximal allowed sectors
per request and uses it at several places.
Signed-off-by: Peter Lieven <pl@kamp.de>
Reviewed-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Peter Lieven <pl@kamp.de>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
do not trim requests if the driver does not supply a limit
through BlockLimits. For write zeroes we still keep a limit
for the unsupported path to avoid allocating a big bounce buffer.
Suggested-by: Kevin Wolf <kwolf@redhat.com>
Suggested-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: Peter Lieven <pl@kamp.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
As Stefan pointed out, the variable 'filename' in bdrv_commit is unused,
despite being maintained in previous patches.
With this patch, get rid of the variable for good.
Signed-off-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Like BLOCK_OP_TYPE_BACKUP_SOURCE and BLOCK_OP_TYPE_BACKUP_TARGET,
block-commit involves two asymmetric devices.
This change is not user-visible (yet), because commit only works with
device names.
But once we enable backing reference in blockdev-add, or specifying
node-name in block-commit command, we don't want the user to start two
commit jobs on the same backing chain, which will corrupt things because
of the final bdrv_swap.
Before we have per category blockers, splitting this type is still
better.
[Resolved virtio-blk dataplane conflict by replacing
BLOCK_OP_TYPE_COMMIT with both BLOCK_OP_TYPE_COMMIT_{SOURCE, TARGET}.
They are safe since the block job runs in the same AioContext as the
dataplane IOThread.
--Stefan]
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
If bs->bl.max_write_zeroes is large and we end up in the unsupported
path we might allocate a lot of memory for the iovector and/or even
generate an oversized requests.
Fix this by limiting the request by the minimum of the reported
maximum transfer size or 16MB (32768 sectors).
Reported-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: Peter Lieven <pl@kamp.de>
Reviewed-by: Denis V. Lunev <den@openvz.org>
Message-id: 1420457389-16332-1-git-send-email-pl@kamp.de
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Mirror and migration use dirty bitmaps for their purposes, and since
commit [block: per caller dirty bitmap] they use their own bitmaps, not
the global one. But they use old functions bdrv_set_dirty and
bdrv_reset_dirty, which change all dirty bitmaps.
Named dirty bitmaps series by Fam and Snow are affected: mirroring and
migration will spoil all (not related to this mirroring or migration)
named dirty bitmaps.
This patch fixes this by adding bdrv_set_dirty_bitmap and
bdrv_reset_dirty_bitmap, which change concrete bitmap. Also, to prevent
such mistakes in future, old functions bdrv_(set,reset)_dirty are made
static, for internal block usage.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@parallels.com>
CC: John Snow <jsnow@redhat.com>
CC: Fam Zheng <famz@redhat.com>
CC: Denis V. Lunev <den@openvz.org>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Message-id: 1417081246-3593-1-git-send-email-vsementsov@parallels.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
Relative backing filenames are always relative to the backed image's
directory; the same applies to image creation. Therefore, if the backing
file has to be opened for determining its size (in case the size has not
been explicitly specified) its filename should be interpreted relative
to the new image's base directory and not relative to qemu's working
directory.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
When using a relative backing file name, qemu needs to know the
directory of the top image file. For JSON filenames, such a directory
cannot be easily determined (e.g. how do you determine the directory of
a qcow2 BDS directly on top of a quorum BDS?). Therefore, do not allow
relative filenames for the backing file of BDSs only having a JSON
filename.
Furthermore, BDS::exact_filename should be used whenever possible. If
BDS::filename is not equal to BDS::exact_filename, the former will
always be a JSON object.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Introduce bdrv_get_full_backing_filename_from_filename(), a function
which takes the name of the backed file and a potentially relative
backing filename to produce the full (absolute) backing filename.
Use this function from bdrv_get_full_backing_filename().
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Most reads do not go past the end of the file, and they can use the
input QEMUIOVector instead of creating one. This removes the
qemu_iovec_* functions from the profile.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Change the message printing code to output a separator for each option
string before it instead of after, then we don't one more extra ' ' in
the end.
To update qemu-iotests output files, most of the times one would just
copy the *.out.bad to *.out. With this change we will not have the
space disliked by checkpatch.pl.
Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Message-id: 1418110684-19528-3-git-send-email-famz@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
If vmdk blindly tries to use path_combine() using bs->file->filename as
the base file name, this will result in a bad error message for JSON
file names when calling bdrv_open(). It is better to only try
bs->file->exact_filename; if that is empty, bs->file->filename will be
useless for path_combine() and an error should be emitted (containing
bs->file->filename because desc_file_path (which is
bs->file->exact_filename) is empty).
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Message-id: 1417615043-26174-2-git-send-email-mreitz@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
If a driver supports image creation, it needs to set the .create_opts
field. We can use that to make sure .create_opts for both drivers
involved is not NULL in bdrv_img_create(), which is important so that
the create_opts pointer in that function is not NULL after the
qemu_opts_append() calls and when going into qemu_opts_create().
Cc: qemu-stable@nongnu.org
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We can always assume raw, file and qcow2 being available; so do not use
bdrv_find_format() to locate their BlockDriver objects but statically
reference the respective objects.
Cc: qemu-stable@nongnu.org
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
If a qcow2 image specifies a backing file format that doesn't correspond
to any format driver that qemu knows, we shouldn't fall back to probing,
but simply error out.
Not looking up the backing file driver in bdrv_open_backing_file(), but
just filling in the "driver" option if it isn't there moves us closer to
the goal of having everything in QDict options and gets us the error
handling of bdrv_open(), which correctly refuses unknown drivers.
Cc: qemu-stable@nongnu.org
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-id: 1416935562-7760-4-git-send-email-kwolf@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
If the user neglects to specify the image format, QEMU probes the
image to guess it automatically, for convenience.
Relying on format probing is insecure for raw images (CVE-2008-2004).
If the guest writes a suitable header to the device, the next probe
will recognize a format chosen by the guest. A malicious guest can
abuse this to gain access to host files, e.g. by crafting a QCOW2
header with backing file /etc/shadow.
Commit 1e72d3b (April 2008) provided -drive parameter format to let
users disable probing. Commit f965509 (March 2009) extended QCOW2 to
optionally store the backing file format, to let users disable backing
file probing. QED has had a flag to suppress probing since the
beginning (2010), set whenever a raw backing file is assigned.
All of these additions that allow to avoid format probing have to be
specified explicitly. The default still allows the attack.
In order to fix this, commit 79368c8 (July 2010) put probed raw images
in a restricted mode, in which they wouldn't be able to overwrite the
first few bytes of the image so that they would identify as a different
image. If a write to the first sector would write one of the signatures
of another driver, qemu would instead zero out the first four bytes.
This patch was later reverted in commit 8b33d9e (September 2010) because
it didn't get the handling of unaligned qiov members right.
Today's block layer that is based on coroutines and has qiov utility
functions makes it much easier to get this functionality right, so this
patch implements it.
The other differences of this patch to the old one are that it doesn't
silently write something different than the guest requested by zeroing
out some bytes (it fails the request instead) and that it doesn't
maintain a list of signatures in the raw driver (it calls the usual
probe function instead).
Note that this change doesn't introduce new breakage for false positive
cases where the guest legitimately writes data into the first sector
that matches the signatures of an image format (e.g. for nested virt):
These cases were broken before, only the failure mode changes from
corruption after the next restart (when the wrong format is probed) to
failing the problematic write request.
Also note that like in the original patch, the restrictions only apply
if the image format has been guessed by probing. Explicitly specifying a
format allows guests to write anything they like.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 1416497234-29880-8-git-send-email-kwolf@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The only image format driver that even potentially accesses anything
after 512 bytes in its bdrv_probe() implementation is VMDK, which reads
a plain-text descriptor file. In practice, the field it's looking for
seems to come first and will be well within the first 512 bytes, too.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-id: 1416497234-29880-7-git-send-email-kwolf@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-id: 1416497234-29880-6-git-send-email-kwolf@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This returns the node name of a BDS. Remove the TODO comment and expect
the callers to be explicit.
Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Similar to bdrv_next, this traverses through graph_bdrv_states. Will be
useful to enumerate all the named nodes.
Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
In bdrv_rw_co we report -EINVAL for nb_sectors > INT_MAX /
BDRV_SECTOR_SIZE, so a caller shouldn't exceed it.
Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Message-id: 1415603264-21497-1-git-send-email-famz@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQEcBAABAgAGBQJUYh9vAAoJEL7lnXSkw9fbgPQH/065L5+SpaJR1Nte9Lz3N2s1
a6tGSI22yu85tKvYCdYjeoVHSkSTyR57FdTfUd2xc2QPj+J4sWXpA81KILBGTJUp
NMpmLpWg4LOh8Ek4ViRgmFFdryzIFa4dT4gc1AcSAIAQ6jsgK1dM7m5kfncC3TN0
TUs248vJ2i/DaE0k8TOeJqxJTqInoFttlJEqG7RD+V5JznokE4zpFNXHDGx9BptE
W2J38GJ/TKRPe9UrHMKZI1r6+ZBdXyE/CaqsNNKLJdqrHgSQuAyK/PS6dQbM4BLg
M1qdP7Tp0wOlvv9qoEZMOEiUsi54XPqLgaLMbW74Yp5X459fqmLW2imy49pHXt8=
=klsW
-----END PGP SIGNATURE-----
Merge remote-tracking branch 'remotes/mjt/tags/pull-trivial-patches-2014-11-11' into staging
trivial patches for 2014-11-11
# gpg: Signature made Tue 11 Nov 2014 14:38:39 GMT using RSA key ID A4C3D7DB
# gpg: Good signature from "Michael Tokarev <mjt@tls.msk.ru>"
# gpg: aka "Michael Tokarev <mjt@corpit.ru>"
# gpg: aka "Michael Tokarev <mjt@debian.org>"
* remotes/mjt/tags/pull-trivial-patches-2014-11-11:
block: Fix comment for bdrv_co_get_block_status
sysbus: Correct SYSTEM_BUS(obj) defines
target-i386: cpu: keeping function parameters alignment on new line
xen-hvm: Remove redundant variable 'xstate'
coroutine-sigaltstack: Change jmp_buf to sigjmp_buf
pc-bios: petalogix-s3adsp1800.dtb: Use 'xlnx, xps-ethernetlite-2.00.a' instead of 'xlnx, xps-ethernetlite-2.00.b'
gdbstub: Add a missing case of signal number translation in gdbstub
numa: make 'info numa' take into account hotplugged memory
slirp/smbd: modify/set several parameters in generated smbd.conf
qemu-doc.texi: fix typos in x509 examples
icc_bus: fix typo ICC_BRIGDE -> ICC_BRIDGE
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
It returns more information than binary, fix the comment.
Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
If the specified backing file could not be opened, do not generate a new
error message which contains the message which has been generated by
bdrv_open(), but just propagate the latter.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Peter Lieven <pl@kamp.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The mirror block job must run in the BlockDriverState AioContext so that
it works with dataplane.
Acquire the AioContext in blockdev.c so starting the block job is safe.
Note that to_replace is treated separately from other BlockDriverStates
in that it does not need to be in the same AioContext. Explicitly
acquire/release to_replace's AioContext when accessing it.
The completion code in block/mirror.c must perform BDS graph
manipulation and bdrv_reopen() from the main loop. Use
block_job_defer_to_main_loop() to achieve that.
The bdrv_drain_all() call is not allowed outside the main loop since it
could lead to lock ordering problems. Use bdrv_drain(bs) instead
because we have acquired the AioContext so nothing else can sneak in
I/O.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-id: 1413889440-32577-10-git-send-email-stefanha@redhat.com
Now that op blockers are in use, we can ensure that no other sources are
generating I/O on a BlockDriverState. Therefore it is possible to drain
requests for a single BDS.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-id: 1413889440-32577-7-git-send-email-stefanha@redhat.com
Depending on the changed options and the image format,
bdrv_amend_options() may take a significant amount of time. In these
cases, a way to be informed about the operation's status is desirable.
Since the operation is rather complex and may fundamentally change the
image, implementing it as AIO or a coroutine does not seem feasible. On
the other hand, implementing it as a block job would be significantly
more difficult than a simple callback and would not add benefits other
than progress report to the amending operation, because it should not
actually be run as a block job at all.
A callback may not be very pretty, but it's very easy to implement and
perfectly fits its purpose here.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Benoît Canet <benoit.canet@nodalink.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-id: 1414404776-4919-2-git-send-email-mreitz@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
The local variable 'ac' in send_qmp_error_event() is declared with the
wrong type, which causes clang to complain when it is initialized
and again when it is used:
block.c:3655:20: warning: implicit conversion from enumeration type 'enum IoOperationType' to different enumeration type 'BlockErrorAction' (aka 'enum BlockErrorAction') [-Wenum-conversion]
ac = is_read ? IO_OPERATION_TYPE_READ : IO_OPERATION_TYPE_WRITE;
~ ^~~~~~~~~~~~~~~~~~~~~~
block.c:3655:45: warning: implicit conversion from enumeration type 'enum IoOperationType' to different enumeration type 'BlockErrorAction' (aka 'enum BlockErrorAction') [-Wenum-conversion]
ac = is_read ? IO_OPERATION_TYPE_READ : IO_OPERATION_TYPE_WRITE;
~ ^~~~~~~~~~~~~~~~~~~~~~~
block.c:3656:62: warning: implicit conversion from enumeration type 'BlockErrorAction' (aka 'enum BlockErrorAction') to different enumeration type 'IoOperationType' (aka 'enum IoOperationType') [-Wenum-conversion]
qapi_event_send_block_io_error(bdrv_get_device_name(bs), ac, action,
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ^~
Correct the type to IoOperationType, and rename the variable
to 'optype' to match its correct type.
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Luiz Capitulino <lcapitulino@redhat.com>
Message-id: 1412969583-21045-1-git-send-email-peter.maydell@linaro.org
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Peter Lieven <pl@kamp.de>
Reviewed-by: Ronnie Sahlberg <ronniesahlberg@gmail.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
When falling through to the underlying file in
bdrv_co_get_block_status(), if it returns that the query offset is
beyond the file end (by setting *pnum to 0), return the range to be
zero and do not let the number of sectors for which information could be
obtained be overwritten.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
These functions call their non-0-counterparts and then fill the
allocated buffer with 0 (if the allocation has been successful).
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Move device model attachment / detachment and the BlockDevOps device
model callbacks and their wrappers from BlockDriverState to
BlockBackend.
Wrapper calls in block.c change from
bdrv_dev_FOO_cb(bs, ...)
to
if (bs->blk) {
bdrv_dev_FOO_cb(bs->blk, ...);
}
No change, because both bdrv_dev_change_media_cb() and
bdrv_dev_resize_cb() do nothing when no device model is attached, and
a device model can be attached only when bs->blk.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
I'll use it with block backends shortly, and the name is going to fit
badly there. It's a block layer thing anyway, not just a block driver
thing.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
I'll use BlockDriverAIOCB with block backends shortly, and the name is
going to fit badly there. It's a block layer thing anyway, not just a
block driver thing.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
BlockBackend's name space is separate only to keep the initial patches
simple. Time to merge the two.
Retain bdrv_find() and bdrv_get_device_name() for now, to keep this
series manageable.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
device_name[] can become non-empty only in bdrv_new_root() and
bdrv_move_feature_fields(). The latter is used only to undo damage
done by bdrv_swap(). The former is called only by blk_new_with_bs().
Therefore, when a BlockDriverState's device_name[] is non-empty, then
it's been created with a BlockBackend, and vice versa. Furthermore,
blk_new_with_bs() keeps the two names equal.
Therefore, device_name[] is redundant. Eliminate it.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Benoît Canet <benoit.canet@nodalink.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Make the BlockBackend own the DriveInfo. Change blockdev_init() to
return the BlockBackend instead of the DriveInfo.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Convenience function blk_new_with_bs() creates a BlockBackend with its
BlockDriverState. Callers have to unref both. The commit after next
will relieve them of the need to unref the BlockDriverState.
Complication: due to the silly way drive_del works, we need a way to
hide a BlockBackend, just like bdrv_make_anon(). To emphasize its
"special" status, give the function a suitably off-putting name:
blk_hide_on_behalf_of_do_drive_del(). Unfortunately, hiding turns the
BlockBackend's name into the empty string. Can't avoid that without
breaking the blk->bs->device_name equals blk->name invariant.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Creating an anonymous BDS can't fail. Make that obvious.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Benoît Canet <benoit.canet@nodalink.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
When migrated using libvirt with "--copy-storage-all", at the end of
migration there is race between NBD mirroring task trying to do flush
and migration completion, both end up invalidating cache. Since qcow2
driver does not handle this situation very well, random crashes happen.
This disables the BDRV_O_INCOMING flag for the block device being migrated
once the cache has been invalidated.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
--
fixed parens by hand
Signed-off-by: Juan Quintela <quintela@redhat.com>
IDs have long spread beyond QemuOpts: not everything with an ID
necessarily goes through QemuOpts. Commit 9aebf3b is about such a
case: block layer names are meant to be well-formed IDs, but some of
them don't go through QemuOpts, and thus weren't checked. The commit
fixed that the straightforward way: rename the internal QemuOpts
helper id_wellformed() to qemu_opts_id_wellformed() and give it
external linkage.
Instead of using it directly in block.c, the commit adds wrapper
bdrv_is_valid_name(), probably to hide the connection to QemuOpts.
Go one logical step further: emancipate IDs from QemuOpts. Rename the
function back to id_wellformed(), and put it in another file. While
there, clean up its value to bool. Peel off the bdrv_is_valid_name()
wrapper.
[Replaced stray return 0 with return false to match bool returns used
elsewhere in id_wellformed().
--Stefan]
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
The device_name of a BlockDriverState is currently checked because it is
always used as a QemuOpts ID and qemu_opts_create() checks whether such
IDs are wellformed.
node-name is supposed to share the same namespace, but it isn't checked
currently. This patch adds explicit checks both for device_name and
node-name so that the same rules will still apply even if QemuOpts won't
be used any more at some point.
qemu-img used to use names with spaces in them, which isn't allowed any
more. Replace them with underscores.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Suggested-by: Benoit Canet <benoit.canet@nodalink.com>
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Benoît Canet <benoit.canet@nodalink.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
If the BDS's refcnt > 0, drive_del() destroys the DriveInfo, but not
the BDS. This can happen in three places:
* Device model destruction during unplug: blockdev_auto_del()
* Xen IDE unplug: pci_piix3_xen_ide_unplug()
* drive_del command when no device model is attached: do_drive_del()
The other callers of drive_del are on error paths where refcnt == 1.
If the user somehow manages to plug in a device model using a BDS that
has gone through drive_del(), the legacy configuration passed in
DriveInfo doesn't reach the device model, and automatic deletion on
unplug doesn't work. Worse, some device models such as scsi-disk
crash when DriveInfo doesn't exist.
This is theoretical; I didn't research an actual reproducer. The problem
was introduced when we replaced DriveInfo reference counting by BDS
reference counting in commit a94a3fa..fa510eb.
Fix by keeping DriveInfo alive until its BDS dies.
This affects qemu_drive_opts: now you can't reuse the same ID for new
drive options until the BDS dies. Before, you could, but since the
code always attempts to create a BDS with the same ID next, the
enclosing operation "create a new drive" failed anyway. Different
error path, same result.
Unfortunately, the fix involves use of blockdev.c stuff from block.c,
which is a layering violation. Fortunately, my forthcoming
BlockBackend work will get rid of it again.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Benoît Canet <benoit.canet@nodalink.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Now that all the implementations are converted to asynchronous version
and we can emulate synchronous cancellation with it. Let's drop the
unused member.
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This is the async version of bdrv_aio_cancel, which doesn't block the
caller. It guarantees that the cb is called either before returning or
some time later.
bdrv_aio_cancel can base on bdrv_aio_cancel_async, later we can convert
all .io_cancel implementations to .io_cancel_async, and the aio_poll is
the common logic. In the end, .io_cancel can be dropped.
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This will be useful in synchronous cancel emulation with
bdrv_aio_cancel_async.
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
BLOCK_IO_ERROR events are logged by libvirt, which helps with
post mortem analysis of guests. However, one information that
we miss today is a human readable string describing the cause
of the I/O error.
This commit adds that string it to BLOCK_IO_ERROR. Note that
this string is a debugging aid for humans, meaning that it
should not parsed by applications.
Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This is the next step for decoupling block accounting functions from
BlockDriverState.
In a future commit the BlockAcctStats structure will be moved from
BlockDriverState to the device models structures.
Note that bdrv_get_stats was introduced so device models can retrieve the
BlockAcctStats structure of a BlockDriverState without being aware of it's
layout.
This function should go away when BlockAcctStats will be embedded in the device
models structures.
CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Keith Busch <keith.busch@intel.com>
CC: Anthony Liguori <aliguori@amazon.com>
CC: "Michael S. Tsirkin" <mst@redhat.com>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Eric Blake <eblake@redhat.com>
CC: Peter Maydell <peter.maydell@linaro.org>
CC: Michael Tokarev <mjt@tls.msk.ru>
CC: John Snow <jsnow@redhat.com>
CC: Markus Armbruster <armbru@redhat.com>
CC: Alexander Graf <agraf@suse.de>
CC: Max Reitz <mreitz@redhat.com>
Signed-off-by: Benoît Canet <benoit.canet@nodalink.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The plan is to add new accounting metrics (latency, invalid requests, failed
requests, queue depth) and block.c is overpopulated so it will be better to work
in a separate module.
Moreover the long term plan is to have statistics in each of the BDS of the graph
for metrology purpose; this means that the device model statistics must move from
the topmost BDS to the device model.
So we need to decouple the statistic code from BlockDriverState.
This is another argument for the extraction of the code in a separate module.
CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Max Reitz <mreitz@redhat.com>
CC: Eric Blake <eblake@redhat.com>
CC: Benoit Canet <benoit@irqsave.net>
CC: Fam Zheng <famz@redhat.com>
CC: Peter Crosthwaite <peter.crosthwaite@xilinx.com>
CC: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Benoît Canet <benoit.canet@nodalink.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Extract the block accounting statistics into a structure so the block device
models can hold them in the future.
CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Max Reitz <mreitz@redhat.com>
CC: Eric Blake <eblake@redhat.com>
Signed-off-by: Benoît Canet <benoit.canet@nodalink.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Management software, such as RHEV's vdsm, want to be able to allocate
disk space on demand. The basic use case is to start a VM with a small
disk and then the disk is enlarged when QEMU hits a ENOSPC condition.
To this end, the management software has to be notified when QEMU
encounters ENOSPC. The solution implemented by this commit is simple:
it extends the BLOCK_IO_ERROR with a 'nospace' key, which is true
when QEMU is stopped due to ENOSPC.
Note that support for querying this event is already present in
query-block by means of the 'io-status' key. Also, the new 'nospace'
BLOCK_IO_ERROR field shares the same semantics with 'io-status',
which basically means that werror= has to be set to either
'stop' or 'enospc' to enable 'nospace'.
Finally, this commit also updates the 'io-status' key doc in the
schema with a list of supported device models.
Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Liu Yuan <namei.unix@gmail.com>
Reviewed-by: Benoît Canet <benoit.canet@nodalink.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
When request A is a strict superset of request B:
AAAAAAAA
BBBB
multiwrite_merge() merges them as follows:
AABBBB
The tail of request A should have been included:
AABBBBAA
This patch fixes data loss but this code path is probably rare. Since
guests cannot assume ordering between in-flight requests, few
applications submit overlapping write requests.
Reported-by: Slava Pestov <sviatoslav.pestov@gmail.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Benoit Canet <benoit@irqsave.net>
If a long-running operation on a BDS wants to always remain in the same
AIO context, it somehow needs to keep track of the BDS changing its
context. This adds a function for registering callbacks on a BDS which
are called whenever the BDS is attached or detached from an AIO context.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Format names are best consumed in alphabetical order. This makes
human-readable output easy to produce.
bdrv_iterate_format() already has an array of format strings. Sort them
before invoking the iteration callback.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Benoît Canet <benoit.canet@nodalink.com>
Some block devices may not have a filename in their BDS; and for some,
there may not even be a normal filename at all. To work around this, add
a function which tries to construct a valid filename for the
BDS.filename field.
If a filename exists or a block driver is able to reconstruct a valid
filename (which is placed in BDS.exact_filename), this can directly be
used.
If no filename can be constructed, we can still construct an options
QDict which is then converted to a JSON object and prefixed with the
"json:" pseudo protocol prefix. The QDict is placed in
BDS.full_open_options.
For most block drivers, this process can be done automatically; those
that need special handling may define a .bdrv_refresh_filename() method
to fill BDS.exact_filename and BDS.full_open_options themselves.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
g_new(T, n) is neater than g_malloc(sizeof(T) * n). It's also safer,
for two reasons. One, it catches multiplication overflowing size_t.
Two, it returns T * rather than void *, which lets the compiler catch
more type errors.
Patch created with Coccinelle, with two manual changes on top:
* Add const to bdrv_iterate_format() to keep the types straight
* Convert the allocation in bdrv_drop_intermediate(), which Coccinelle
inexplicably misses
Coccinelle semantic patch:
@@
type T;
@@
-g_malloc(sizeof(T))
+g_new(T, 1)
@@
type T;
@@
-g_try_malloc(sizeof(T))
+g_try_new(T, 1)
@@
type T;
@@
-g_malloc0(sizeof(T))
+g_new0(T, 1)
@@
type T;
@@
-g_try_malloc0(sizeof(T))
+g_try_new0(T, 1)
@@
type T;
expression n;
@@
-g_malloc(sizeof(T) * (n))
+g_new(T, n)
@@
type T;
expression n;
@@
-g_try_malloc(sizeof(T) * (n))
+g_try_new(T, n)
@@
type T;
expression n;
@@
-g_malloc0(sizeof(T) * (n))
+g_new0(T, n)
@@
type T;
expression n;
@@
-g_try_malloc0(sizeof(T) * (n))
+g_try_new0(T, n)
@@
type T;
expression p, n;
@@
-g_realloc(p, sizeof(T) * (n))
+g_renew(T, p, n)
@@
type T;
expression p, n;
@@
-g_try_realloc(p, sizeof(T) * (n))
+g_try_renew(T, p, n)
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
qemu-img check calls bdrv_check() twice if the first run repaired some
inconsistencies. If the first run however again triggered corruption
prevention (on qcow2) due to very bad inconsistencies, bs->drv may be
NULL afterwards. Thus, bdrv_check() should check whether bs->drv is set.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Some code in the block layer makes potentially huge allocations. Failure
is not completely unexpected there, so avoid aborting qemu and handle
out-of-memory situations gracefully.
This patch addresses bounce buffer allocations in block.c. While at it,
convert bdrv_commit() from plain g_malloc() to qemu_try_blockalign().
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
This function returns NULL instead of aborting when an allocation fails.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Benoit Canet <benoit@irqsave.net>
If bdrv_unref() is passed a NULL BDS pointer, it is safe to
exit with no operation. This will allow cleanup code to blindly
call bdrv_unref() on a BDS that has been initialized to NULL.
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
When a BlockDriverState is associated with a storage controller
DeviceState we expect guest I/O. Use this opportunity to bump the
coroutine pool size by 64.
This patch ensures that the coroutine pool size scales with the number
of drives attached to the guest. It should increase coroutine pool
usage (which makes qemu_coroutine_create() fast) without hogging too
much memory when fewer drives are attached.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
bdrv_get_geometry() hides errors. Use bdrv_nb_sectors() or
bdrv_getlength() instead where that's obviously inappropriate.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Benoit Canet <benoit@irqsave.net>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
It returns a multiple of the sector size.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Benoit Canet <benoit@irqsave.net>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Instead of bdrv_getlength().
Aside: a few of these callers don't handle errors. I didn't
investigate whether they should.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Benoit Canet <benoit@irqsave.net>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Instead of bdrv_getlength().
Replace variables length, length2 by total_sectors, nb_sectors2.
Bonus: use total_sectors instead of the slightly unclean
bs->total_sectors.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Benoit Canet <benoit@irqsave.net>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Instead of bdrv_getlength().
Variable target_size is initially in bytes, then changes meaning to
sectors. Ugh. Replace by target_sectors.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Benoit Canet <benoit@irqsave.net>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
A call to retrieve the image size converts between bytes and sectors
several times:
* BlockDriver method bdrv_getlength() returns bytes.
* refresh_total_sectors() converts to sectors, rounding up, and stores
in total_sectors.
* bdrv_getlength() converts total_sectors back to bytes (now rounded
up to a multiple of the sector size).
* Callers wanting sectors rather bytes convert it right back.
Example: bdrv_get_geometry().
bdrv_nb_sectors() provides a way to omit the last two conversions.
It's exactly bdrv_getlength() with the conversion to bytes omitted.
It's functionally like bdrv_get_geometry() without its odd error
handling.
Reimplement bdrv_getlength() and bdrv_get_geometry() on top of
bdrv_nb_sectors().
The next patches will convert some users of bdrv_getlength() to
bdrv_nb_sectors().
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Benoit Canet <benoit@irqsave.net>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
At least raw-posix relies on this because it can allocate bounce buffers
based on the request length, but access it using all of the qiov entries
later.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
If a read request goes across EOF, the block driver sees a shortened
request that stops at EOF (the rest is memsetted in block.c), however
the original qiov was used for this request.
This patch makes the qiov size match the request size, avoiding a
potential buffer overflow in raw-posix.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
bdrv_is_allocated() should return either 0 or 1 in successful cases.
We're lucky that currently, the callers that rely on this (e.g. because
they check for ret == 1) don't seem to break badly. They just might skip
some optimisation or in the case of qemu-io 'map' print separate lines
where a single line would suffice. In theory, a wrong allocation status
could lead to image corruption with certain operations, so let's fix
this quickly.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
This patch introduces three APIs so that following
patches can support queuing I/O requests and submitting them
as a batch for improving I/O performance.
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
On some image chains, QEMU may not always be able to resolve the
filenames properly, when updating the backing file of an image
after a block commit.
For instance, certain relative pathnames may fail, or drives may
have been specified originally by file descriptor (e.g. /dev/fd/???),
or a relative protocol pathname may have been used.
In these instances, QEMU may lack the information to be able to make
the correct choice, but the user or management layer most likely does
have that knowledge.
With this extension to the block-commit api, the user is able to change
the backing file of the overlay image as part of the block-commit
operation.
This allows the change to be 'safe', in the sense that if the attempt
to write the overlay image metadata fails, then the block-commit
operation returns failure, without disrupting the guest.
If the commit top is the active layer, then specifying the backing
file string will be treated as an error (there is no overlay image
to modify in that case).
If a backing file string is not specified in the command, the backing
file string to use is determined in the same manner as it was
previously.
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Jeff Cody <jcody@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This is a small helper function, to determine if 'base' is in the
chain of BlockDriverState 'top'. It returns true if it is in the chain,
and false otherwise.
If either argument is NULL, it will also return false.
Reviewed-by: Benoit Canet <benoit@irqsave.net>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Jeff Cody <jcody@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This simplifies the function bdrv_find_overlay(). With this change,
bdrv_find_base() is just a subset of usage of bdrv_find_overlay(),
so this also takes advantage of that.
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Benoit Canet <benoit@irqsave.net>
Signed-off-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
When failure occurs, 'ret' need be set, or may return 0 to indicate
success. Previously, an error was set in errp, but 0 was returned
anyway. So let bdrv_append_temp_snapshot() return an error code and
use that for the bdrv_open() return value.
Also, error_propagate() need be called only one time within a function.
Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
drive-mirror will bdrv_swap the new BDS named node-name with the one
pointed by replaces when the mirroring is finished.
Signed-off-by: Benoit Canet <benoit@irqsave.net>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
If we check for the RESIZE blocker in bdrv_truncate(), that means a
commit will fail if the overlay layer is larger than the base, due to
the backing blocker.
This is a regression in behavior from 2.0; currently, commit will try to
grow the size of the base image to match the overlay size, if the
overlay size is larger.
By moving this into the QMP command qmp_block_resize(), it allows
usage of bdrv_truncate() within block jobs.
Signed-off-by: Jeff Cody <jcody@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The only semantic change is that bs->open_flags gets BDRV_O_PROTOCOL set
now. This isn't useful, but it doesn't hurt either. The code that was
previously skipped by 'goto done' is automatically disabled because
protocol drivers don't support backing files (and if they did, this
would probably be a fix) and can't have snapshot_flags set.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Since we parse backing.* options to add a backing file from the command
line when the driver didn't assign one, it has been possible to have a
backing file for e.g. raw images (it just was never accessed).
This is obvious nonsense and should be rejected.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
This recursion was introduced in commit 505d7583 in order to allow
nesting image formats. It only ever takes effect when the user
explicitly specifies a driver name and that driver isn't suitable for
the protocol level.
We can check this earlier in bdrv_open() and if the explicitly
requested driver is a format driver, clear BDRV_O_PROTOCOL so that
another bs->file layer is opened.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
It doesn't do much any more, we can move the code to bdrv_open() now.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Benoit Canet <benoit@irqsave.net>
Reviewed-by: Eric Blake <eblake@redhat.com>
This moves the bdrv_open_file() call a bit down so that it can use the
bdrv_open() code that selects the right block driver.
The code between the old and the new call site is either common code
(the error message for an unknown driver has been unified now) or
doesn't run with cleared BDRV_O_PROTOCOL (added an if block in one
place, whereas the right path was already asserted in another place)
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Benoit Canet <benoit@irqsave.net>
The "driver" entry in the options QDict is now only missing if we're
opening an image with format probing.
We also catch cases now where both the drv argument and a "driver"
option is specified, e.g. by specifying -drive format=qcow2,driver=raw
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
The idea of bdrv_fill_options() is to convert every parameter for
opening images, in particular the filename and flags, to entries in the
options QDict.
This patch starts with moving the filename parsing and driver probing
part from bdrv_file_open() to the new function.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
'buf' is not used actually, so remove it and related snprintf() statement.
Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
In order to let event defines use existing types later, instead of
redefine new ones, some old type defines for spice and vnc are changed,
and BlockErrorAction is moved from block.h to qapi schema. Note that
BlockErrorAction is not merged with BlockdevOnError.
At this point, VncInfo is not made a child of VncBasicInfo, because
VncBasicInfo has mandatory fields where VncInfo makes them optional.
Signed-off-by: Wenchao Xia <wenchaoqemu@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
With virtio-blk dataplane, I/O errors might occur while QEMU is
not in the main I/O thread. However, it's invalid to call vm_stop
when we're neither in a VCPU thread nor in the main I/O thread,
even if we were to take the iothread mutex around it.
To avoid this problem, we can raise a request to the main I/O thread,
similar to what QEMU does when vm_stop is called from a CPU thread.
We know that bdrv_error_action is called from an AIO callback, and
the moment at which the callback will fire is not well-defined; it
depends on the moment at which the disk or OS finishes the operation,
which can happen at any time. Note that QEMU is certainly not in a CPU
thread and we do not need to call cpu_stop_current() like vm_stop() does.
However, we need to ensure that any action taken by management will
result in correct detection of the error _and_ a running VM. In particular:
- the event must be raised after the iostatus has been set, so that
"info block" will return an iostatus that matches the event.
- the VM must be stopped after the iostatus has been set, so that
"info block" will return an iostatus that matches the runstate.
The ordering between the STOP and BLOCK_IO_ERROR events is preserved;
BLOCK_IO_ERROR is documented to come first.
This makes bdrv_error_action() thread safe (assuming QMP events are,
which is attacked by a separate series).
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Now that all backend drivers are using QemuOpts, remove all
QEMUOptionParameter related codes.
Signed-off-by: Dong Xu Wang <wdongxu@linux.vnet.ibm.com>
Signed-off-by: Chunyan Liu <cyliu@suse.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Change block layer to support both QemuOpts and QEMUOptionParameter.
After this patch, it will change backend drivers one by one. At the end,
QEMUOptionParameter will be removed and only QemuOpts is kept.
Signed-off-by: Dong Xu Wang <wdongxu@linux.vnet.ibm.com>
Signed-off-by: Chunyan Liu <cyliu@suse.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Block I/O throttling uses timers and currently always adds them to the
main loop. Throttling will break if bdrv_set_aio_context() is used to
move a BlockDriverState to a different AioContext.
This patch adds throttle_detach/attach_aio_context() interfaces so the
throttling timers and uses them to move timers to the new AioContext.
Note that bdrv_set_aio_context() already drains all requests so we're
sure no throttled requests are pending.
The test cases need to be updated since the throttle_init() interface
has changed.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Benoit Canet <benoit@irqsave.net>
Up until now all BlockDriverState instances have used the QEMU main loop
for fd handlers, timers, and BHs. This is not scalable on SMP guests
and hosts so we need to move to a model with multiple event loops on
different host CPUs.
bdrv_set_aio_context() assigns the AioContext event loop to use for a
particular BlockDriverState. It first detaches the entire
BlockDriverState graph from the current AioContext and then attaches to
the new AioContext.
This function will be used by virtio-blk data-plane to assign a
BlockDriverState to its IOThread AioContext. Make
bdrv_aio_set_context() public since data-plane should not include
block_int.h.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Modify bdrv_drain_all() to take into account that BlockDriverState
instances may be running in different AioContexts.
This patch changes the implementation of bdrv_drain_all() while
preserving the semantics. Previously kicking throttled requests and
checking for pending requests were done across all BlockDriverState
instances in sequence. Now we process each BlockDriverState in turn,
making sure to acquire and release its AioContext.
This prevents race conditions between the thread executing
bdrv_drain_all() and the thread running the AioContext.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
bdrv_close_all(), bdrv_commit_all(), bdrv_flush_all(),
bdrv_invalidate_cache_all(), and bdrv_clear_incoming_migration_all() are
called by main loop code and touch all BlockDriverState instances.
Some BlockDriverState instances may be running in another AioContext.
Make sure to acquire the AioContext before closing the BlockDriverState.
This will protect against race conditions once virtio-blk data-plane is
using the BlockDriverState from another AioContext event loop.
Note that this patch does not convert bdrv_drain_all() yet since that
conversion is non-trivial.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Drop the assumption that we're using the main AioContext. Convert
qemu_aio_wait() to aio_poll() and qemu_bh_new() to aio_bh_new() so the
BlockDriverState AioContext is used.
Note there is still one qemu_aio_wait() left in bdrv_create() but we do
not have a BlockDriverState there and only main loop code invokes this
function.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Introduced in commit da557a. Spotted by Coverity.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Benoit Canet <benoit@irqsave.net>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The above bdrv_set_backing_hd already does this.
Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This makes use of op_blocker and blocks all the operations except for
commit target, on each BlockDriverState->backing_hd.
The asserts for op_blocker in bdrv_swap are removed because with this
change, the target of block commit has at least the backing blocker of
its child, so the assertion is not true. Callers should do their check.
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
We need to handle the coming backing_blocker properly, so don't open
code the assignment, instead, call bdrv_set_backing_hd to change
backing_hd.
Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This is the common but non-trivial steps to assign or change the
backing_hd of BDS.
Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This drops BlockDriverState.in_use with op_blockers:
- Call bdrv_op_block_all in place of bdrv_set_in_use(bs, 1).
- Call bdrv_op_unblock_all in place of bdrv_set_in_use(bs, 0).
- Check bdrv_op_is_blocked() in place of bdrv_in_use(bs).
The specific types are used, e.g. in place of starting block backup,
bdrv_op_is_blocked(bs, BLOCK_OP_TYPE_BACKUP, ...).
There is one exception in block_job_create, where
bdrv_op_blocker_is_empty() is used, because we don't know the operation
type here. This doesn't matter because in a few commits away we will drop
the check and move it to callers that _do_ know the type.
- Check bdrv_op_blocker_is_empty() in place of assert(!bs->in_use).
Note: there is only bdrv_op_block_all and bdrv_op_unblock_all callers at
this moment. So although the checks are specific to op types, this
changes can still be seen as identical logic with previously with
in_use. The difference is error message are improved because of blocker
error info.
Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
BlockDriverState.op_blockers is an array of lists with BLOCK_OP_TYPE_MAX
elements. Each list is a list of blockers of an operation type
(BlockOpType), that marks this BDS as currently blocked for a certain
type of operation with reason errors stored in the list. The rule of
usage is:
* BDS user who wants to take an operation should check if there's any
blocker of the type with bdrv_op_is_blocked().
* BDS user who wants to block certain types of operation, should call
bdrv_op_block (or bdrv_op_block_all to block all types of operations,
which is similar to the existing bdrv_set_in_use()).
* A blocker is only referenced by op_blockers, so the lifecycle is
managed by caller, and shouldn't be lost until unblock, so typically
a caller does these:
- Allocate a blocker with error_setg or similar, call bdrv_op_block()
to block some operations.
- Hold the blocker, do his job.
- Unblock operations that it blocked, with the same reason pointer
passed to bdrv_op_unblock().
- Release the blocker with error_free().
Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Benoit Canet <benoit@irqsave.net>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
this patch tries to optimize zero write requests
by automatically using bdrv_write_zeroes if it is
supported by the format.
This significantly speeds up file system initialization and
should speed zero write test used to test backend storage
performance.
I ran the following 2 tests on my internal SSD with a
50G QCOW2 container and on an attached iSCSI storage.
a) mkfs.ext4 -E lazy_itable_init=0,lazy_journal_init=0 /dev/vdX
QCOW2 [off] [on] [unmap]
-----
runtime: 14secs 1.1secs 1.1secs
filesize: 937M 18M 18M
iSCSI [off] [on] [unmap]
----
runtime: 9.3s 0.9s 0.9s
b) dd if=/dev/zero of=/dev/vdX bs=1M oflag=direct
QCOW2 [off] [on] [unmap]
-----
runtime: 246secs 18secs 18secs
filesize: 51G 192K 192K
throughput: 203M/s 2.3G/s 2.3G/s
iSCSI* [off] [on] [unmap]
----
runtime: 8mins 45secs 33secs
throughput: 106M/s 1.2G/s 1.6G/s
allocated: 100% 100% 0%
* The storage was connected via an 1Gbit interface.
It seems to internally handle writing zeroes
via WRITESAME16 very fast.
Signed-off-by: Peter Lieven <pl@kamp.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
If the filename given to bdrv_open() is prefixed with "json:", parse the
rest as a JSON object and merge the result into the options QDict. If
there are conflicts, the options QDict takes precedence.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
bdrv_is_allocated() shouldn't return true for sectors that are
unallocated, but after the end of a short backing file, even though
such sectors are (correctly) marked as containing zeros.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
The immediately visible effect of this patch is that it fixes committing
a temporary snapshot to its backing file. Previously, it would fail with
a "permission denied" error because bdrv_inherited_flags() forced the
backing file to be read-only, ignoring the r/w reopen of bdrv_commit().
The bigger problem this revealed is that the original open flags must
actually only be applied to the temporary snapshot, and the original
image file must be treated as a backing file of the temporary snapshot
and get the right flags for that.
Reported-by: Jan Kiszka <jan.kiszka@web.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Use the same function as bdrv_open() for determining what the right
flags for bs->file are. Without doing this, a reopen means that
bs->file loses BDRV_O_CACHE_WB or BDRV_O_UNMAP if bs doesn't have it as
well.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
This reverts commit 3a389e7926. The commit
was wrong and what it tried to fix just works today without any change.
What the commit tried to fix:
When creating live snapshots, the new image file is opened with
BDRV_O_NO_BACKING because the whole backing chain is already opened.
It is then appended to the chain using bdrv_append(). The result of
this was that the image had a backing file, but BDRV_O_NO_BACKING
was still set. This is obviously inconsistent.
There used to be some places in qemu that closed and image and then
opened it again, with its old flags (a bdrv_open()/close() sequence
involves reopening the whole backing file chain, too). In this case
the BDRV_O_NO_BACKING flag meant that the backing chain wasn't
reopened and only the top layer was left.
(Most, but not all of these places are replaced by bdrv_reopen()
today, which doesn't touch the backing files at all.)
Other places that looked at bs->open_flags weren't interested in
BDRV_O_NO_BACKING, so no breakage there.
What it actually did:
The commit moved the BDRV_O_NO_BACKING away to the backing file.
Because the bdrv_open()/close() sequences only looked at the flags
of the top level BlockDriverState and used it for the whole chain,
the flag didn't hurt there any more. Obviously, it is still
inconsistent because the backing file may have another backing file,
but without practical impact.
At the same time, it swapped all other flags. This is practically
irrelevant as long as live snapshots only allow opening the new
layer with the same flags as the old top layer. It still doesn't
make any sense, and it is a time bomb that explodes as soon as the
flags can differ.
bdrv_append_temp_snapshot() is such a case: It adds the new flag
BDRV_O_TEMPORARY for the temporary snapshot. The swapping of commit
3a389e79 results in the following nonsensical configuration:
bs->open_flags: BDRV_O_TEMPORARY cleared
bs->file->open_flags: BDRV_O_TEMPORARY set
bs->backing_hd->open_flags: BDRV_O_TEMPORARY set
bs->backing_hd->file->open_flags: BDRV_O_TEMPORARY cleared
We're still lucky because the format layer ignores the flag and the
protocol layer happens to get the right value, but sooner or later
this is bound to go wrong...
What the right fix would have been:
Simply clear the BDRV_O_NO_BACKING flag when the BlockDriverState is
appended to an existing backing file chain, because now it does have
a backing file.
Commit 4ddc07ca already implemented this silently in bdrv_append(),
so we don't have to come up with a new fix.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Instead of having unlink() calls in the generic block layer, where we
aren't even guarateed to have a file name, move them to those block
drivers that are actually used and that always have a filename. Gets us
rid of some #ifdefs as well.
The patch also converts bs->is_temporary to a new BDRV_O_TEMPORARY open
flag so that it is inherited in the protocol layer and the raw-posix and
raw-win32 drivers can unlink the file.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Copy on Read makes sense on the format level where backing files are
implemented, but it's not required on the protocol level. While it
shouldn't actively break anything to have COR enabled on both layers,
needless serialisation and allocation checks may impact performance.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Instead of manipulation flags inline, move the derivation of the flags
of a backing file into a new function next to the existing functions
that derive flags for bs->file and for the block driver open function.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Instead of having bdrv_open_flags() as a function that creates flags for
several unrelated places and then adding open-coded flags on top, create
a new function that derives the flags for bs->file from the flags for bs.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Some block drivers have multiple BlockDriver instances with identical
format_name fields (e.g. gluster, nbd).
Both qemu-img and qemu will use bdrv_iterate_format() to list the
supported formats when a help option is invoked. As protocols and
formats may register multiple drivers, redundant listings of formats
occur (e.g., "Supported formats: ... gluster gluster gluster gluster ...
").
Since the list of driver formats will be small, this performs a simple
linear search on format_name, and ignores any duplicates.
The end result change is that the iterator will no longer receive
duplicate string names, nor will it receive NULL pointers.
Signed-off-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
error_is_set(&var) is the same as var != NULL, but it takes
whole-program analysis to figure that out. Unnecessarily hard for
optimizers, static checkers, and human readers. Commit 84d18f0 dumbed
it down to obvious, but a few more have crept in since, and
documentation was overlooked. Dumb these down, too.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
In 1.7.1 qcow2_create2 reopen the file for flushing without the BDRV_O_NO_BACKING
flags.
As a consequence the code would recursively open the whole backing chain.
These three stack arrays would pile up through the recursion and lead to a coroutine
stack overflow.
Convert these array to malloced buffers in order to streamline the coroutine
footprint.
Symptoms where freezes or segfaults on production machines while taking QMP externals
snapshots. The overflow disturbed coroutine switching.
[Resolved conflicts on qemu.git/master since the patch was against v1.7.1
--Stefan]
Signed-off-by: Benoit Canet <benoit.canet@gmail.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Since commit f298d071, block devices added with blockdev-add don't have
a QemuOpts around in dinfo->opts. Consequently, we can't rely any more
on QemuOpts catching duplicate IDs for block devices.
This patch adds a new check for duplicate IDs to bdrv_new(), and moves
the existing check that the ID isn't already taken for a node-name there
as well.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
This patch adds an errp parameter to bdrv_new() and updates all its
callers. The next patches will make use of this in order to check for
duplicate IDs. Most of the callers know that their ID is fine, so they
can simply assert that there is no error.
Behaviour doesn't change with this patch yet as bdrv_new() doesn't
actually assign errors to errp.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
bdrv_getlength could fail, check the return value before using it.
Return NULL and set errno if it fails. Callers are updated to handle
the error case.
Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Insanely large requests could cause an integer overflow in
bdrv_rw_co() while converting sectors to bytes. This patch catches the
problem and returns an error (if we hadn't overflown the integer here,
bdrv_check_byte_request() would have rejected the request, so we're not
breaking anything that was supposed to work before).
We actually do have a test case that triggers behaviour where we
accidentally let such a request pass, so that it would return success,
but read 0 bytes instead of the requested 4 GB. It fails now like it
should.
If the vdi block driver wants to be able to deal with huge images, it
can't read the whole block bitmap at once into memory like it does
today, but needs to use a metadata cache like qcow2 does.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Commit 8f4754ed intended to protect against integer overflow bugs in
block drivers by making sure that a single request that is passed to
drivers is no longer than INT_MAX bytes.
However, meanwhile there are some callers that don't use that code path
any more but call bdrv_check_byte_request() directy, so let's add a
check there as well.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Since commit 9fd3171a, BDRV_O_SNAPSHOT uses an option QDict to specify
the originally requested image as the backing file of the newly created
temporary snapshot. This means that the filename is stored in
"file.filename", which is an option that is not parsed for protocol
names. Therefore things like -drive file=nbd:localhost:10809 were
broken because it looked for a local file with the literal name
'nbd:localhost:10809'.
This patch changes the way BDRV_O_SNAPSHOT works once again. We now open
the originally requested image as normal, and then do a similar
operation as for live snapshots to put the temporary snapshot on top.
This way, both driver specific options and parsed filenames work.
As a nice side effect, this results in code movement to factor
bdrv_append_temp_snapshot() out. This is a good preparation for moving
its call to drive_init() and friends eventually.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
When using the QDict option 'filename', it is supposed to be interpreted
literally. The code did correctly avoid guessing the protocol from any
string before the first colon, but it still called bdrv_parse_filename()
which would, for example, incorrectly remove a 'file:' prefix in the
raw-posix driver.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Limiting the size of a single request to INT_MAX not only fixes a
direct integer overflow in bdrv_check_request() (which would only
trigger bad behaviour with ridiculously huge images, as in close to
2^64 bytes), but can also prevent overflows in all block drivers.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
If it returns an error, the migrated VM will not be started, but qemu
exits with an error message.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Benoit Canet <benoit@irqsave.net>
Opening an encrypted image takes an additional step: setting the key.
Between open and the key set, the image must not be used.
We have some protection against accidental use in place: you can't
unpause a guest while we're missing keys. You can, however, hot-plug
block devices lacking keys into a running guest just fine, or insert
media lacking keys. In the latter case, notifying the guest of the
insert is delayed until the key is set, which may suffice to protect
at least some guests in common usage.
This patch makes the protection apply in more cases, in a rather
heavy-handed way: it doesn't let you open encrypted images unless
we're in a paused state.
It doesn't extend the protection to users other than the guest (block
jobs?). Use of runstate_check() from block.c is disgusting. Best I
can do right now.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
If the image file cannot be opened and was created as a temporary file,
it should be deleted; thus, in this case, we should jump to the
"unlink_and_fail" label and not just to "fail".
Reported-by: Benoît Canet <benoit@irqsave.net>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This patch keep the recursive way of doing things but simplify it by giving
two responsabilities to all block filters implementors.
They will need to do two things:
-Set the is_filter field of their block driver to true.
-Implement the bdrv_recurse_is_first_non_filter method of their block driver like
it is done on the Quorum block driver. (block/quorum.c)
[Paolo Bonzini <pbonzini@redhat.com> pointed out that this patch changes
the semantics of blkverify, which now recurses down both bs->file and
s->test_file.
-- Stefan]
Reported-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Benoit Canet <benoit@irqsave.net>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Currently, bdrv_debug_resume() requires every bs->drv in the BDS stack
to be NULL until a bs->drv with an implementation of bdrv_debug_resume()
is found. For a normal function, this would be fine, but this is a
function for debugging purposes and should therefore allow intermediate
BDS not to have a driver (i.e., be "ejected"). Otherwise, it is hard to
debug such situations.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
After migration has completed, we call bdrv_invalidate_cache() so that
drivers which cache some data drop their stale copy of the data and
reread it from the image file to get a new version of data that the
source modified while the migration was running.
Reloading metadata from the image file is useless, though, if the size
of the image file stays stale (this is a value that is cached for all
image formats in block.c). Reads from (meta)data after the old EOF
return only zeroes, causing image corruption.
We need to update bs->total_sectors in all layers that could potentially
have changed their size (i.e. backing files are not a concern - if they
are changed, we're in bigger trouble)
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>