Commit Graph

3996 Commits

Author SHA1 Message Date
Alberto Garcia 25b8e4db7f throttle-groups: Don't allow timers without throttled requests
Commit 6fccbb475b fixed a bug caused by
QEMU attempting to remove a throttle group member with no pending
requests but an active timer set. This was the result of a previous
bdrv_drained_begin() call processing the throttled requests but
leaving the timer untouched.

Although the commit does solve the problem, the situation shouldn't
happen in the first place. If we try to drain a throttle group member
which has a timer set, we should cancel the timer instead of ignoring
it.

Signed-off-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-08-15 12:50:39 +02:00
Alberto Garcia 5d8e4ca035 throttle-groups: Skip the round-robin if a member is being drained
In the throttling code after an I/O request has been completed the
next one is selected from a different member using a round-robin
algorithm. This ensures that all members get a chance to finish their
pending I/O requests.

However, if a group member has its I/O limits disabled (because it's
being drained) then we should always give it priority in order to have
all its pending requests finished as soon as possible.

If we don't do this we could have a member in the process of being
drained waiting for the throttled requests of other members, for which
the I/O limits still apply.

This can have additional consequences: if we're running in qtest mode
(with QEMU_CLOCK_VIRTUAL) then timers can only fire if we advance the
clock manually, so attempting to drain a block device can hang QEMU in
the BDRV_POLL_WHILE() loop at the end of bdrv_do_drained_begin().

Signed-off-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-08-15 12:50:39 +02:00
Kevin Wolf f62492bb8d block/qapi: Fix memory leak in qmp_query_blockstats()
For BlockBackends that are skipped in query-blockstats, we would leak
info since commit 567dcb31. Allocate info only later to avoid the memory
leak.

Fixes: CID 1394727
Cc: qemu-stable@nongnu.org
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
2018-08-15 12:50:39 +02:00
Paolo Bonzini 2f0d8947a6 nvme: simplify plug/unplug
bdrv_io_plug/bdrv_io_unplug take care of keeping a nesting count,
so change s->plugged to just a bool.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20180813144320.12382-2-pbonzini@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
2018-08-15 10:12:35 +08:00
Fam Zheng 9582f357bb nvme: Fix nvme_init error handling
It is wrong to leave this field as 1, as nvme_close() called in the
error handling code in nvme_file_open() will use it and try to free
s->queues again.

Another problem is the cleaning ups are duplicated between the fail*
labels of nvme_init() and nvme_file_open(), which calls nvme_close().

A third problem is nvme_close() misses g_free() and
event_notifier_cleanup().

Fix all of them.

Cc: qemu-stable@nongnu.org
Signed-off-by: Fam Zheng <famz@redhat.com>

Message-Id: <20180712025420.4932-1-famz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
2018-08-15 10:12:35 +08:00
Kevin Wolf 567dcb31f2 block/qapi: Include anonymous BBs in query-blockstats
Consistent with query-block, query-blockstats should not only include
named BlockBackends, but also those that are anonymous, but belong to a
device model.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
2018-07-30 15:35:37 +02:00
Kevin Wolf 5a9cb5a97f block/qapi: Add 'qdev' field to query-blockstats result
Like for query-block, the client needs to identify which BlockBackend
the returned data is for. Anonymous BlockBackends are identified by the
device model they are attached to. Add a 'qdev' field that contains the
qdev ID or QOM path of the attached device model.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
2018-07-30 15:35:37 +02:00
Kevin Wolf 34fa110e42 file-posix: Fix write_zeroes with unmap on block devices
The BLKDISCARD ioctl doesn't guarantee that the discarded blocks read as
all-zero afterwards, so don't try to abuse it for zero writing. We try
to only use this if BLKDISCARDZEROES tells us that it is safe, but this
is unreliable on older kernels and a constant 0 in newer kernels. In
other words, this code path is never actually used with newer kernels,
so we don't even try to unmap while writing zeros.

This patch removes the abuse of discard for writing zeroes from
file-posix and instead adds a new function that uses interfaces that are
actually meant to deallocate and zero out at the same time. Only if
those fail, it falls back to zeroing out without unmap. We never fall
back to a discard operation any more that may or may not result in
zeros.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-07-30 15:35:37 +02:00
Fam Zheng a1c81f4f16 file-posix: Handle EINTR in preallocation=full write
Cc: qemu-stable@nongnu.org
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-07-30 15:35:37 +02:00
Leonid Bloch 308999e9d4 qcow2: A grammar fix in conflicting cache sizing error message
Signed-off-by: Leonid Bloch <lbloch@janustech.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-07-30 15:35:37 +02:00
KONRAD Frederic 41b6513436 qcow: fix a reference leak
Since 42a3e1ab36 qemu asserts when using the
vvfat driver:

git clone git://qemu.org/qemu.git
cd qemu
./configure --target-list=ppc-softmmu --enable-debug
make -j8
mkdir foo
touch foo/hello
./ppc-softmmu/qemu-system-ppc -M prep --nographic --monitor null             \
                              -hda fat:rw:./foo

"Ctrl-C"

qemu-system-ppc: block.c:3368: bdrv_close_all: Assertion                     \
   `((&all_bdrv_states)->tqh_first == ((void *)0))' failed.

This is because we reference bs twice in qcow_co_create(..) one time in
bdrv_open_blockdev_ref(..) and in blk_insert_bs(..) but we unref it only once
in blk_unref which leads to the reference leak.

Note that I didn't tested much QCOW after this change as I don't use it much.

Signed-off-by: KONRAD Frederic <frederic.konrad@adacore.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-07-30 15:35:37 +02:00
Markus Armbruster ba891d68b4 qstring: Move qstring_from_substr()'s @end one to the right
qstring_from_substr() takes the index of the substring's first and
last character.  qstring_from_substr(s, 0, SIZE_MAX) denotes an empty
substring.  Awkward.

Shift the end index one to the right.  This simplifies both
qstring_from_substr() and its callers.

Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20180727062204.10401-3-armbru@redhat.com>
2018-07-28 09:09:58 +02:00
Nishanth Aravamudan 042b757cc7 block/file-posix: add bdrv_attach_aio_context callback for host dev and cdrom
In ed6e2161 ("linux-aio: properly bubble up errors from initialzation"),
I only added a bdrv_attach_aio_context callback for the bdrv_file
driver. There are several other drivers that use the shared
aio_plug callback, though, and they will trip the assertion added to
aio_get_linux_aio because they did not call aio_setup_linux_aio first.
Add the appropriate callback definition to the affected driver
definitions.

Fixes: ed6e2161 ("linux-aio: properly bubble up errors from initialization")
Reported-by: Farhan Ali <alifm@linux.ibm.com>
Signed-off-by: Nishanth Aravamudan <naravamudan@digitalocean.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Message-id: 20180718211256.29774-1-naravamudan@digitalocean.com
Cc: Eric Blake <eblake@redhat.com>
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: John Snow <jsnow@redhat.com>
Cc: Max Reitz <mreitz@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Fam Zheng <famz@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: qemu-block@nongnu.org
Cc: qemu-devel@nongnu.org
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2018-07-24 14:27:41 +01:00
Thomas Huth 3e31b4e170 block/vvfat: Disable debug message by default
It's annoying to see this debug message every time you use vvfat.
Disable it with the DLOG() macro by default, as it is done with the
other debug messages in this file.

Signed-off-by: Thomas Huth <thuth@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-07-23 16:50:43 +02:00
Stefan Weil 50d6a8a352 block: Fix typos in comments (found by codespell)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Reviewed-by: John Snow <jsnow@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-07-23 16:50:43 +02:00
Stefan Hajnoczi 6fccbb475b throttle-groups: fix hang when group member leaves
Throttle groups consist of members sharing one throttling state
(including bps/iops limits).  Round-robin scheduling is used to ensure
fairness.  If a group member already has a timer pending then other
groups members do not schedule their own timers.  The next group member
will have its turn when the existing timer expires.

A hang may occur when a group member leaves while it had a timer
scheduled.  Although the code carefully removes the group member from
the round-robin list, it does not schedule the next member.  Therefore
remaining members continue to wait for the removed member's timer to
expire.

This patch schedules the next request if a timer is pending.
Unfortunately the actual bug is a race condition that I've been unable
to capture in a test case.

Sometimes drive2 hangs when drive1 is removed from the throttling group:

  $ qemu ... -drive if=none,id=drive1,cache=none,format=qcow2,file=data1.qcow2,iops=100,group=foo \
             -device virtio-blk-pci,id=virtio-blk-pci0,drive=drive1 \
             -drive if=none,id=drive2,cache=none,format=qcow2,file=data2.qcow2,iops=10,group=foo \
             -device virtio-blk-pci,id=virtio-blk-pci1,drive=drive2
  (guest-console1)# fio -filename /dev/vda 4k-seq-read.job
  (guest-console2)# fio -filename /dev/vdb 4k-seq-read.job
  (qmp) {"execute": "block_set_io_throttle", "arguments": {"device": "drive1","bps": 0,"bps_rd": 0,"bps_wr": 0,"iops": 0,"iops_rd": 0,"iops_wr": 0}}

Reported-by: Nini Gu <ngu@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 20180704145410.794-1-stefanha@redhat.com
RHBZ: https://bugzilla.redhat.com/show_bug.cgi?id=1535914
Cc: Alberto Garcia <berto@igalia.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2018-07-19 13:08:26 +01:00
John Snow 230ff73904 file-posix: specify expected filetypes
Adjust each caller of raw_open_common to specify if they are expecting
host and character devices or not. Tighten expectations of file types upon
open in the common code and refuse types that are not expected.

This has two effects:

(1) Character and block devices are now considered deprecated for the
    'file' driver, which expects only S_IFREG, and
(2) no file-posix driver (file, host_cdrom, or host_device) can open
    directories now.

I don't think there's a legitimate reason to open directories as if
they were files. This prevents QEMU from opening and attempting to probe
a directory inode, which can break in exciting ways. One of those ways
is lseek on ext4/xfs, which will return 0x7fffffffffffffff as the file
size instead of EISDIR. This can coax QEMU into responding with a
confusing "file too big" instead of "Hey, that's not a file".

See: https://bugs.launchpad.net/qemu/+bug/1739304/
Signed-off-by: John Snow <jsnow@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-07-12 18:24:08 +02:00
Peter Maydell 7851f1a706 Block layer patches:
- Copy offloading fixes for when the copy increases the image size
 - Temporary revert of the removal of deprecated -drive options
 - Fix request serialisation in the image fleecing scenario
 - Fix copy-on-read crash with unaligned image size
 - Fix another drain crash
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJbRNLQAAoJEH8JsnLIjy/WOaQQALlZk01JohETuwGG6HGl0LdI
 jEEm+N0J+BlGOVjoGU67OKGidUCl5WvBsQTlyYkmlaToGuk/njWxCa/GA6+iNRnt
 MDq7Ovr8uZI3D+0Fuc6xg/6NBiLkukgh0Q9gMWkzn3jaNWzO2WcTr8WXwepvP6sj
 YtPhEQOXTT3sXf/MFY8ig7qRrZ6f7LFOoKu7LMnrD+QWDo8TY3QLZaxP9OUFHH7S
 A6J0LIfuRZlq79a7SgrRkCR2ddtgYyBQ+zD7PD5kf1vLW4+dOhDOutQEsZCMCPgR
 ft99kNhrZcJGN6n2r8/oVcvRkw5c4I1JPgakm/GoW/NllfPMebuPospKaS4wiJnB
 zI4YOtmco4Mfxkw/wK+Ep/bPCpxEF43uDcpPiEjsNADrdLq0eKnPn5ctwSyWlGvn
 ayQWxDoKoYckn/ccjtLxJ2xPws8433cTXrBdIKnJadWxi3iRNzlIKHRuEfXf9zQt
 G+Nq7ruysT9TPf9ifuCHcZnTsi3SLYLsjCj7pAgBkazBYE2cCI3eKN8kxsDJi7qv
 cWzFCpwE28pHRJ6FwtdzBVkNcfTlC/XopR1M66OzYZlLqR/4hbNhyHL3hBV+yfrM
 fC7mPi81ttI6e+JAgC6K8t3Ey242MjSzUYa7pJUNws7RpqUhfhr6EXXbBceJKsVW
 F8qKZoiIEK7wDacUiEiE
 =FXOo
 -----END PGP SIGNATURE-----

Merge remote-tracking branch 'remotes/kevin/tags/for-upstream' into staging

Block layer patches:

- Copy offloading fixes for when the copy increases the image size
- Temporary revert of the removal of deprecated -drive options
- Fix request serialisation in the image fleecing scenario
- Fix copy-on-read crash with unaligned image size
- Fix another drain crash

# gpg: Signature made Tue 10 Jul 2018 16:37:52 BST
# gpg:                using RSA key 7F09B272C88F2FD6
# gpg: Good signature from "Kevin Wolf <kwolf@redhat.com>"
# Primary key fingerprint: DC3D EB15 9A9A F95D 3D74  56FE 7F09 B272 C88F 2FD6

* remotes/kevin/tags/for-upstream: (24 commits)
  block: Use common write req handling in truncate
  block: Fix bdrv_co_truncate overlap check
  block: Use common req handling in copy offloading
  block: Use common req handling for discard
  block: Fix handling of image enlarging write
  block: Extract common write req handling
  block: Use uint64_t for BdrvTrackedRequest byte fields
  block: Use BdrvChild to discard
  block: Add copy offloading trace points
  block: Prefix file driver trace points with "file_"
  Revert "block: Remove deprecated -drive geometry options"
  Revert "block: Remove deprecated -drive option addr"
  Revert "block: Remove deprecated -drive option serial"
  Revert "block: Remove dead deprecation warning code"
  block/blklogwrites: Make sure the log sector size is not too small
  qapi/block-core.json: Add missing documentation for blklogwrites log-append option
  block/backup: fix fleecing scheme: use serialized writes
  block: add BDRV_REQ_SERIALISING flag
  block: split flags in copy_range
  block/io: fix copy_range
  ...

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2018-07-10 17:28:29 +01:00
Fam Zheng cd47d792d7 block: Use common write req handling in truncate
Truncation is the last to convert from open coded req handling to
reusing helpers. This time the permission check in prepare has to adapt
to the new caller: it checks a different permission bit, and doesn't
trigger the before write notifier.

Also, truncation should always trigger a bs->total_sectors update and in
turn call parent resize_cb. Update the condition in finish helper, too.

It's intended to do a duplicated bs->read_only check before calling
bdrv_co_write_req_prepare() so that we can be more informative with the
error message, as bdrv_co_write_req_prepare() doesn't have Error
parameter.

Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-07-10 16:46:22 +02:00
Fam Zheng 5416a11eb5 block: Fix bdrv_co_truncate overlap check
If we are growing the image and potentially using preallocation for the
new area, we need to make sure that no write requests are made to the
"preallocated" area which is [@old_size, @offset), not
[@offset, offset * 2 - @old_size).

Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-07-10 16:46:22 +02:00
Fam Zheng 0eb1e89112 block: Use common req handling in copy offloading
This brings the request handling logic inline with write and discard,
fixing write_gen, resize_cb, dirty bitmaps and image size refreshing.
The last of these issues broke iotest case 222, which is now fixed.

Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-07-10 16:46:22 +02:00
Fam Zheng 00695c27a0 block: Use common req handling for discard
Reuse the new bdrv_co_write_req_prepare/finish helpers. The variation
here is that discard requests don't affect bs->wr_highest_offset, and it
cannot extend the image.

Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-07-10 16:46:16 +02:00
Fam Zheng 7f8f03ef6d block: Fix handling of image enlarging write
Two problems exist when a write request that enlarges the image (i.e.
write beyond EOF) finishes:

1) parent is not notified about size change;
2) dirty bitmap is not resized although we try to set the dirty bits;

Fix them just like how bdrv_co_truncate works.

Reported-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-07-10 16:01:52 +02:00
Fam Zheng 85fe24796d block: Extract common write req handling
As a mechanical refactoring patch, this is the first step towards
unified and more correct write code paths. This is helpful because
multiple BlockDriverState fields need to be updated after modifying
image data, and it's hard to maintain in multiple places such as copy
offload, discard and truncate.

Suggested-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-07-10 16:01:52 +02:00
Fam Zheng 22931a1533 block: Use uint64_t for BdrvTrackedRequest byte fields
This matches the types used for bytes in the rest parts of block layer.
In the case of bdrv_co_truncate, new_bytes can be the image size which
probably doesn't fit in a 32 bit int.

Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-07-10 16:01:52 +02:00
Fam Zheng 0b9fd3f467 block: Use BdrvChild to discard
Other I/O functions are already using a BdrvChild pointer in the API, so
make discard do the same. It makes it possible to initiate the same
permission checks before doing I/O, and much easier to share the
helper functions for this, which will be added and used by write,
truncate and copy range paths.

Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-07-10 16:01:52 +02:00
Fam Zheng ecc983a507 block: Add copy offloading trace points
A few trace points that can help reveal what is happening in a copy
offloading I/O path.

Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-07-10 16:01:52 +02:00
Fam Zheng f8a30874ca block: Prefix file driver trace points with "file_"
With in one module, trace points usually have a common prefix named
after the module name. paio_submit and paio_submit_co are the only two
trace points so far in the two file protocol drivers. As we are adding
more, having a common prefix here is better so that trace points can be
enabled with a glob. Rename them.

Suggested-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-07-10 16:01:51 +02:00
Cornelia Huck 44e8b4689c Revert "block: Remove deprecated -drive option serial"
This reverts commit b008326744.

Hold off removing this for one more QEMU release (current libvirt
release still uses it.)

Signed-off-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-07-10 14:36:11 +02:00
Ari Sundholm ba814c82bb block/blklogwrites: Make sure the log sector size is not too small
The sector size needs to be large enough to accommodate the data
structures for the log super block and log write entries. This was
previously not properly checked, which made it possible to cause
QEMU to badly misbehave.

Signed-off-by: Ari Sundholm <ari@tuxera.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-07-10 13:17:48 +02:00
Vladimir Sementsov-Ogievskiy f8d59dfb40 block/backup: fix fleecing scheme: use serialized writes
Fleecing scheme works as follows: we want a kind of temporary snapshot
of active drive A. We create temporary image B, with B->backing = A.
Then we start backup(sync=none) from A to B. From this point, B reads
as point-in-time snapshot of A (A continues to be active drive,
accepting guest IO).

This scheme needs some additional synchronization between reads from B
and backup COW operations, otherwise, the following situation is
theoretically possible:

(assume B is qcow2, client is NBD client, reading from B)

1. client starts reading and take qcow2 mutex in qcow2_co_preadv, and
   goes up to l2 table loading (assume cache miss)

2) guest write => backup COW => qcow2 write =>
   try to take qcow2 mutex => waiting

3. l2 table loaded, we see that cluster is UNALLOCATED, go to
   "case QCOW2_CLUSTER_UNALLOCATED" and unlock mutex before
   bdrv_co_preadv(bs->backing, ...)

4) aha, mutex unlocked, backup COW continues, and we finally finish
   guest write and change cluster in our active disk A

5. actually, do bdrv_co_preadv(bs->backing, ...) and read
   _new updated_ data.

To avoid this, let's make backup writes serializing, to not intersect
with reads from B.

Note: we expand range of handled cases from (sync=none and
B->backing = A) to just (A in backing chain of B), to finally allow
safe reading from B during backup for all cases when A in backing chain
of B, i.e. B formally looks like point-in-time snapshot of A.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-07-10 13:10:29 +02:00
Vladimir Sementsov-Ogievskiy 09d2f94846 block: add BDRV_REQ_SERIALISING flag
Serialized writes should be used in copy-on-write of backup(sync=none)
for image fleecing scheme.

We need to change an assert in bdrv_aligned_pwritev, added in
28de2dcd88. The assert may fail now, because call to
wait_serialising_requests here may become first call to it for this
request with serializing flag set. It occurs if the request is aligned
(otherwise, we should already set serializing flag before calling
bdrv_aligned_pwritev and correspondingly waited for all intersecting
requests). However, for aligned requests, we should not care about
outdating of previously read data, as there no such data. Therefore,
let's just update an assert to not care about aligned requests.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-07-10 13:10:25 +02:00
Vladimir Sementsov-Ogievskiy 67b51fb998 block: split flags in copy_range
Pass read flags and write flags separately. This is needed to handle
coming BDRV_REQ_NO_SERIALISING clearly in following patches.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-07-10 13:04:25 +02:00
Vladimir Sementsov-Ogievskiy 999658a05e block/io: fix copy_range
Here two things are fixed:

1. Architecture

On each recursion step, we go to the child of src or dst, only for one
of them. So, it's wrong to create tracked requests for both on each
step. It leads to tracked requests duplication.

2. Wait for serializing requests on write path independently of
   BDRV_REQ_NO_SERIALISING

Before commit 9ded4a0114 "backup: Use copy offloading",
BDRV_REQ_NO_SERIALISING was used for only one case: read in
copy-on-write operation during backup. Also, the flag was handled only
on read path (in bdrv_co_preadv and bdrv_aligned_preadv).

After 9ded4a0114, flag is used for not waiting serializing operations
on backup target (in same case of copy-on-write operation). This
behavior change is unsubstantiated and potentially dangerous, let's
drop it and add additional asserts and documentation.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-07-10 13:04:22 +02:00
Kevin Wolf b0ddcbbb36 block: Fix copy-on-read crash with partial final cluster
If the virtual disk size isn't aligned to full clusters,
bdrv_co_do_copy_on_readv() may get pnum == 0 before having the full
cluster completed, which will let it run into an assertion failure:

qemu-io: block/io.c:1203: bdrv_co_do_copy_on_readv: Assertion `skip_bytes < pnum' failed.

Check for EOF, assert that we read at least as much as the read request
originally wanted to have (which is true at EOF because otherwise
bdrv_check_byte_request() would already have returned an error) and
return success early even though we couldn't copy the full cluster.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-07-10 10:36:15 +02:00
Kevin Wolf 4be6a6d118 block: Poll after drain on attaching a node
Commit dcf94a23b1 ('block: Don't poll in parent drain callbacks')
removed polling in bdrv_child_cb_drained_begin() on the grounds that the
original bdrv_drain() already will poll and BdrvChildRole.drained_begin
calls must not cause graph changes (and therefore must not call
aio_poll() or the recursion through the graph will break.

This reasoning is correct for calls through bdrv_do_drained_begin().
However, BdrvChildRole.drained_begin is also called when a node that is
already in a drained section (i.e. bdrv_do_drained_begin() has already
returned and therefore can't poll any more) is attached to a new parent.
In this case, we must explicitly poll to have all requests completed
before the drained new child can be attached to the parent.

In bdrv_replace_child_noperm(), we know that we're not inside the
recursion of bdrv_do_drained_begin() because graph changes are not
allowed there, and bdrv_replace_child_noperm() is a graph change. The
call of BdrvChildRole.drained_begin() must therefore be followed by a
BDRV_POLL_WHILE() that waits for the completion of requests.

Reported-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-07-10 10:36:15 +02:00
Vladimir Sementsov-Ogievskiy 0e4e4318ea qcow2: add overlap check for bitmap directory
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-id: 20180705151515.779173-1-vsementsov@virtuozzo.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
2018-07-09 19:43:24 +02:00
Max Reitz 439e89fc09 vmdk: Fix possible segfault with non-VMDK backing
VMDK performs a probing check in vmdk_co_create_opts() to prevent the
user from assigning non-VMDK files as a backing file, because it only
supports VMDK backing files.  However, with the @backing runtime option,
it is possible to assign arbitrary nodes as backing nodes, regardless of
what the image header says.  Therefore, VMDK may not just access backing
nodes assuming they are VMDK nodes -- which it does, because it needs to
compare the backing file's CID with the overlay's parentCID value, and
naturally the backing file only has a CID when it's a VMDK file.
Instead, it should report the CID of non-VMDK backing files not to match
the overlay because clearly a non-present CID does not match.

Without this change, vmdk_read_cid() reads from the backing file's
bs->file, which may be NULL (in which case we get a segfault).  Also, it
interprets bs->opaque as a BDRVVmdkState and then reads from the
.desc_offset field, which usually will just return some arbitrary value
which then results in either garbage to be read, or bdrv_pread() to
return an error, both of which result in a non-matching CID to be
reported.

(In a very unlikely case, we could read something that looks like a
VMDK descriptor, and then get a CID which might actually match.  But
that is highly unlikely, and the only result would be that VMDK accepts
the backing file which is not too bad (albeit unintentional).)

((And in theory, the seek to .desc_offset might leak data from another
block driver's opaque object.  But then again, the user should realize
very quickly that a non-VMDK backing file does not work (because the
read will very likely fail, due to the reasons given above), so this
should not be exploitable.))

Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-id: 20180702210721.4847-2-mreitz@redhat.com
Reviewed-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
2018-07-09 19:43:24 +02:00
Fam Zheng 6d6bcc46b5 raw: Drop superfluous semicolon
Reported-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
Message-id: 20180702025836.20957-5-famz@redhat.com
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
2018-07-09 19:43:24 +02:00
Fam Zheng 65f1899e9b qcow2: Drop unreachable break
Reported-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
Message-id: 20180702025836.20957-4-famz@redhat.com
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
2018-07-09 19:43:24 +02:00
Fam Zheng 9f850f67ad file-posix: Fix fd_open check in raw_co_copy_range_to
One of them is a typo. But update both to be more readable.

Reported-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
Message-id: 20180702025836.20957-3-famz@redhat.com
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
2018-07-09 19:43:24 +02:00
Fam Zheng 148546c822 qcow2: Drop unused cluster_data
Reported-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
Message-id: 20180702025836.20957-2-famz@redhat.com
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
2018-07-09 19:43:24 +02:00
Peter Maydell 1daf14ec9e Block layer patches:
- qcow2: Use worker threads for compression to improve performance of
   'qemu-img convert -W' and compressed backup jobs
 - blklogwrites: New filter driver to log write requests to an image in
   the dm-log-writes format
 - file-posix: Fix image locking during image creation
 - crypto: Fix memory leak in error path
 - Error out instead of silently truncating node names
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJbPfHhAAoJEH8JsnLIjy/Wxp8P/jeEJ/jJ0N2doix5KJSvtkdP
 dYx4cJxHg9tMSnmBfA49qc1ilgbzZqjbF/wH3Ko3zPsmFP0xsxx+9zE4c3iL7HA6
 6E4/r3szznO8DIxdgzAIeHjZc7YPpsY3alrthT55eR/FnDyc8zofi/iUMgvTKPYA
 8TnWqgi2fdaVwNy+lIRJFdhzzQNtPxOEO+0DFHaOZvQ9vlc5DGPC+Hj3Qc11GK8M
 u7orkeYqQPknxy0hJKPxtWHvzCQUJMcSSs6PbuslnzOerYiQLmx+RVIwMhXfVZjV
 lXe2SppAszBujtcIENhZlj1cECs7MrWTmFDcWvBA+Mh/JhFEWwykmlQHYrjCqdvw
 QyWhVLgP/6jQDaBls6LADSZDxX7i0sV27DzVGN4gYUX/KcyVPEDROm90MyE8nxN0
 Y5hhujTkzu94Zvd/OlKZRs4tPxmbRrd2SWy0id8Kj5/gO/+UWECOZCrPrlB5uzec
 bsxHoeXLVe2/56JkCIiOw3hLILMY6gPLJgeaQjz6hu1oZJqYuoof8grVFjUCSH0C
 BBlz8O4upHdPIKNRl4rmdgwasg5YIiJ7SFl4h7KMCD0elkKmo3SyTKosagvqrpVN
 JvW1YuosCedcfYIKNPnZdEGWDKuTaZtcnZnM9IRY6hr8ySMw1LloFsgXsJ3tqTwe
 wb1Wjm4Qlz5BaoB7o/Os
 =z5AC
 -----END PGP SIGNATURE-----

Merge remote-tracking branch 'remotes/kevin/tags/for-upstream' into staging

Block layer patches:

- qcow2: Use worker threads for compression to improve performance of
  'qemu-img convert -W' and compressed backup jobs
- blklogwrites: New filter driver to log write requests to an image in
  the dm-log-writes format
- file-posix: Fix image locking during image creation
- crypto: Fix memory leak in error path
- Error out instead of silently truncating node names

# gpg: Signature made Thu 05 Jul 2018 11:24:33 BST
# gpg:                using RSA key 7F09B272C88F2FD6
# gpg: Good signature from "Kevin Wolf <kwolf@redhat.com>"
# Primary key fingerprint: DC3D EB15 9A9A F95D 3D74  56FE 7F09 B272 C88F 2FD6

* remotes/kevin/tags/for-upstream:
  file-posix: Unlock FD after creation
  file-posix: Fix creation locking
  block/blklogwrites: Add an option for the update interval of the log superblock
  block/blklogwrites: Add an option for appending to an old log
  block/blklogwrites: Change log_sector_size from int64_t to uint64_t
  block/crypto: Fix memory leak in create error path
  block: Don't silently truncate node names
  block: Add blklogwrites
  block: Move two block permission constants to the relevant enum
  qcow2: add compress threads
  qcow2: refactor data compression
  qemu-img: allow compressed not-in-order writes

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2018-07-05 15:53:04 +01:00
Max Reitz 7c20c808a5 file-posix: Unlock FD after creation
Closing the FD does not necessarily mean that it is unlocked.  Fix this
by relinquishing all permission locks before qemu_close().

Reported-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-07-05 11:07:58 +02:00
Max Reitz d815efcaf0 file-posix: Fix creation locking
raw_apply_lock_bytes() takes a bit mask of "permissions that are NOT
shared".

Also, make the "perm" and "shared" variables uint64_t, because I do not
particularly like using ~ on signed integers (and other permission masks
are usually uint64_t, too).

Reported-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-07-05 11:07:58 +02:00
Ari Sundholm 1dce698ea8 block/blklogwrites: Add an option for the update interval of the log superblock
This is a way to ensure that the log superblock is periodically
updated. Before, this was only done on flush requests, which may
not be enough if the VM exits abnormally, omitting the final flush.

The default interval is 4096 write requests.

Signed-off-by: Ari Sundholm <ari@tuxera.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-07-05 10:50:21 +02:00
Ari Sundholm 0878b3c113 block/blklogwrites: Add an option for appending to an old log
Suggested by Kevin Wolf. May be useful when testing multiple batches
of writes or doing long-term testing involving restarts of the VM.

Signed-off-by: Ari Sundholm <ari@tuxera.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-07-05 10:50:20 +02:00
Ari Sundholm 2dacaf7c82 block/blklogwrites: Change log_sector_size from int64_t to uint64_t
This was a simple oversight when working on intermediate versions
of the original patch which introduced blklogwrites.

Signed-off-by: Ari Sundholm <ari@tuxera.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-07-05 10:50:20 +02:00
Kevin Wolf 0b68589d17 block/crypto: Fix memory leak in create error path
Fixes: Coverity CID 1393782
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
2018-07-05 10:29:19 +02:00
Aapo Vienamo bfcc224e3c block: Add blklogwrites
Implements a block device write logging system, similar to Linux kernel
device mapper dm-log-writes. The write operations that are performed
on a block device are logged to a file or another block device. The
write log format is identical to the dm-log-writes format. Currently,
log markers are not supported.

This functionality can be used for crash consistency and fs consistency
testing. By implementing it in qemu, tests utilizing write logs can be
be used to test non-Linux drivers and older kernels.

The driver accepts an optional parameter to set the sector size used
for logging. This makes the driver require all requests to be aligned
to this sector size and also makes offsets and sizes of writes in the
log metadata to be expressed in terms of this value (the log format has
a granularity of one sector for offsets and sizes). This allows
accurate logging of writes to guest block devices that have unusual
sector sizes.

The implementation is based on the blkverify and blkdebug block
drivers.

Signed-off-by: Aapo Vienamo <aapo@tuxera.com>
Signed-off-by: Ari Sundholm <ari@tuxera.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-07-05 10:29:19 +02:00
Vladimir Sementsov-Ogievskiy ceb029cd6f qcow2: add compress threads
Do data compression in separate threads. This significantly improve
performance for qemu-img convert with -W (allow async writes) and -c
(compressed) options.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-07-05 10:29:19 +02:00
Vladimir Sementsov-Ogievskiy 2714f13d69 qcow2: refactor data compression
Make a separate function for compression to be parallelized later.
 - use .avail_out field instead of .next_out to calculate size of
   compressed data. It looks more natural and it allows to keep dest to
   be void pointer
 - set avail_out to be at least one byte less than input, to be sure
   avoid inefficient compression earlier

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-07-05 10:29:00 +02:00
Vladimir Sementsov-Ogievskiy 58f72b965e dirty-bitmap: fix double lock on bitmap enabling
Bitmap lock/unlock were added to bdrv_enable_dirty_bitmap in
8b1402ce80, but some places were not updated correspondingly, which
leads to trying to take this lock twice, which is dead-lock. Fix this.

Actually, iotest 199 (about dirty bitmap postcopy migration) is broken
now, and this fixes it.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-id: 20180625165745.25259-3-vsementsov@virtuozzo.com
Signed-off-by: John Snow <jsnow@redhat.com>
2018-07-04 02:12:49 -04:00
Vladimir Sementsov-Ogievskiy 92bcea40d3 block/dirty-bitmap: add bdrv_enable_dirty_bitmap_locked
Add _locked version of bdrv_enable_dirty_bitmap, to fix dirty bitmap
migration in the following patch.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-id: 20180625165745.25259-2-vsementsov@virtuozzo.com
Signed-off-by: John Snow <jsnow@redhat.com>
2018-07-04 02:12:49 -04:00
Peter Maydell a395717cbd -----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJbOvCTAAoJEL2+eyfA3jBX354P/RFVLsozhcb3DeFj5Ocq2kfS
 sRt/82Ke/f/w/8lNd4wNbOsdG/eg2M4RmLWF4ONWWoeO7Z0KIatUOTtw5HxjBxBX
 XfhQy7RZ65luuCHnLDU6c4IdnDvXBVG/kErydhDZjEyeY8qlxrurBB8331cTRFwu
 hisweIwogPOFDA+/Bty0W0EyVQWFAobL3ExYFlOYFuHwsqJfMPQbytw2zDzC4kjn
 8Ecppyt7rfLsEcyf/4OAoHfbbYOiQl7PkXE7/uXDFyL8zPdRpIlDFSZtmy1Zb213
 mcYhPmehUkFHV/BDF/LdnzjlraK8oMaNu0IDld5cX/1xUU4VtbW2YjAt6OdCn7Ll
 7YbNNKYU/mM1QUPshX4qJkbUaCu7JoTDKPiBbJei/MV7zMJBLpNVG/AuJE2gbweI
 2levV76QzS2+fQVKv/9LUliqOYEp5T0/aybb+35Vzhf5WNpSO7s1oaCDAvSgUhS+
 qU1MIAROQQPCmdM8PwqzG9b2TGp/tYcWOju5bqt488Twmo0BbTGjYCFl6StJHibC
 mN5fASP5nQiz1fc3FrBp0h/PCQlGtd2ZgeyeC+lkPVcFovclA2vo/ib1k2LK/0nU
 TzkKIFRJZH58yYjppuBrB6c/aFfVkutE4Hz25i+3nZ91ZEyQKbv1mDxCyRNXNWjt
 Gteul6gUo/AjzMOWFFvH
 =AKR7
 -----END PGP SIGNATURE-----

Merge remote-tracking branch 'remotes/cody/tags/block-pull-request' into staging

# gpg: Signature made Tue 03 Jul 2018 04:42:11 BST
# gpg:                using RSA key BDBE7B27C0DE3057
# gpg: Good signature from "Jeffrey Cody <jcody@redhat.com>"
# gpg:                 aka "Jeffrey Cody <jeff@codyprime.org>"
# gpg:                 aka "Jeffrey Cody <codyprime@gmail.com>"
# Primary key fingerprint: 9957 4B4D 3474 90E7 9D98  D624 BDBE 7B27 C0DE 3057

* remotes/cody/tags/block-pull-request:
  backup: Use copy offloading
  block: Honour BDRV_REQ_NO_SERIALISING in copy range
  block: Fix parameter checking in bdrv_co_copy_range_internal

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2018-07-03 11:49:51 +01:00
Peter Maydell 9b75dcb15f nbd patches for 2018-07-02
Bug fixes and iotest exposure of fleecing via NBD (serving a
 read-only point-in-time view via blockdev-backup sync:none,
 as well as serving dirty bitmaps over NBD), including a new
 x-dirty-bitmap parameter when opening NBD clients as the
 counterpart to x-nbd-server-add-bitmap. Also a random fix
 for iscsi block_status spotted by Coverity that missed other
 miscellaneous trees.
 
 - Eric Blake: nbd/server: Fix dirty bitmap logic regression
 - Eric Blake: iscsi: Avoid potential for get_status overflow
 - John Snow/Vladimir Sementsov-Ogievskiy: 0/2 block: formalize and test fleecing
 - Eric Blake: 0/2 test NBD bitmap export
 -----BEGIN PGP SIGNATURE-----
 Comment: Public key at http://people.redhat.com/eblake/eblake.gpg
 
 iQEcBAABCAAGBQJbOtJPAAoJEKeha0olJ0NqEvwH/3FwWnlBdBvdYGgPjzGE1Atm
 ofCKcyxE/2VJtxeWlZQHzs1VqSq81s7am5SdzOrIWnQekvHFcLu6/71RABiauzMd
 neCvVOrXOVdktj1i1Z2Gg4BgjDmqbTDlo5ssVh/oXP0Zebi6OZFfQrB7y3cGBvui
 4XI7lW9qJxt6F1FlKloXnofWRDENyo5vgdz6QjQXfauthw2T5045RIPTfiz03FCp
 fbs+6K0+bKxfPdNLrqxxOZo/loYnEXbDYv6VBAIWBqztXVnMHxCqnh0YN05jwsfF
 TRW0/YT8lpWarOZ1soIC6a/OGXQZbxgRhZ+Zr+Wa2jw0YNHJanU9isxi37aUqQo=
 =Xivx
 -----END PGP SIGNATURE-----

Merge remote-tracking branch 'remotes/ericb/tags/pull-nbd-2018-07-02' into staging

nbd patches for 2018-07-02

Bug fixes and iotest exposure of fleecing via NBD (serving a
read-only point-in-time view via blockdev-backup sync:none,
as well as serving dirty bitmaps over NBD), including a new
x-dirty-bitmap parameter when opening NBD clients as the
counterpart to x-nbd-server-add-bitmap. Also a random fix
for iscsi block_status spotted by Coverity that missed other
miscellaneous trees.

- Eric Blake: nbd/server: Fix dirty bitmap logic regression
- Eric Blake: iscsi: Avoid potential for get_status overflow
- John Snow/Vladimir Sementsov-Ogievskiy: 0/2 block: formalize and test fleecing
- Eric Blake: 0/2 test NBD bitmap export

# gpg: Signature made Tue 03 Jul 2018 02:33:03 BST
# gpg:                using RSA key A7A16B4A2527436A
# gpg: Good signature from "Eric Blake <eblake@redhat.com>"
# gpg:                 aka "Eric Blake (Free Software Programmer) <ebb9@byu.net>"
# gpg:                 aka "[jpeg image of size 6874]"
# Primary key fingerprint: 71C2 CC22 B1C4 6029 27D2  F3AA A7A1 6B4A 2527 436A

* remotes/ericb/tags/pull-nbd-2018-07-02:
  iotests: New test 223 for exporting dirty bitmap over NBD
  nbd/client: Add x-dirty-bitmap to query bitmap from server
  iotests: add 222 to test basic fleecing
  blockdev: enable non-root nodes for backup source
  iscsi: Avoid potential for get_status overflow
  nbd/server: Fix dirty bitmap logic regression

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2018-07-03 10:47:02 +01:00
Fam Zheng 9ded4a0114 backup: Use copy offloading
The implementation is similar to the 'qemu-img convert'. In the
beginning of the job, offloaded copy is attempted. If it fails, further
I/O will go through the existing bounce buffer code path.

Then, as Kevin pointed out, both this and qemu-img convert can benefit
from a local check if one request fails because of, for example, the
offset is beyond EOF, but another may well be accepted by the protocol
layer. This will be implemented separately.

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
Message-id: 20180703023758.14422-4-famz@redhat.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
2018-07-02 23:23:45 -04:00
Fam Zheng dee12de893 block: Honour BDRV_REQ_NO_SERIALISING in copy range
This semantics is needed by drive-backup so implement it before using
this API there.

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
Message-id: 20180703023758.14422-3-famz@redhat.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
2018-07-02 23:23:45 -04:00
Fam Zheng d4d3e5a0d5 block: Fix parameter checking in bdrv_co_copy_range_internal
src may be NULL if BDRV_REQ_ZERO_WRITE flag is set, in this case only
check dst and dst->bs. This bug was introduced when moving in the
request tracking code from bdrv_co_copy_range, in 37aec7d75e.

This especially fixes the possible segfault when initializing src_bs
with a NULL src.

Signed-off-by: Fam Zheng <famz@redhat.com>
Message-id: 20180703023758.14422-2-famz@redhat.com
Reviewed-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Jeff Cody <jcody@redhat.com>
2018-07-02 23:23:45 -04:00
Eric Blake 216ee3657e nbd/client: Add x-dirty-bitmap to query bitmap from server
In order to test that the NBD server is properly advertising
dirty bitmaps, we need a bare minimum client that can request
and read the context.  Since feature freeze for 3.0 is imminent,
this is the smallest workable patch, which replaces the qemu
block status report with the results of the NBD server's dirty
bitmap (making it very easy to use 'qemu-img map --output=json'
to learn where the dirty portions are).  Note that the NBD
protocol defines a dirty section with the same bit but opposite
sense that normal "base:allocation" uses to report an allocated
section; so in qemu-img map output, "data":true corresponds to
clean, "data":false corresponds to dirty.

A more complete solution that allows dirty bitmaps to be queried
at the same time as normal block status will be required before
this addition can lose the x- prefix.  Until then, the fact that
this replaces normal status with dirty status means actions
like 'qemu-img convert' will likely misbehave due to treating
dirty regions of the file as if they are unallocated.

The next patch adds an iotest to exercise this new code.

Signed-off-by: Eric Blake <eblake@redhat.com>
Message-Id: <20180702191458.28741-2-eblake@redhat.com>
2018-07-02 15:27:38 -05:00
Eric Blake 8ee1cef459 iscsi: Avoid potential for get_status overflow
Detected by Coverity: Multiplying two 32-bit int and assigning
the result to a 64-bit number is a risk of overflow.  Prior to
the conversion to byte-based interfaces, the block layer took
care of ensuring that a status request never exceeded 2G in
the driver; but after that conversion, the block layer expects
drivers to deal with any size request (the driver can always
truncate the request size back down, as long as it makes
progress).  So, in the off-chance that someone makes a large
request, we are at the mercy of whether iscsi_get_lba_status_task()
will cap things to at most INT_MAX / iscsilun->block_size when
it populates lbasd->num_blocks; since I could not easily audit
that, it's better to be safe than sorry by just forcing a 64-bit
multiply.

Fixes: 92809c36
CC: qemu-stable@nongnu.org
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-Id: <20180508212718.1482663-1-eblake@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
2018-07-02 14:28:26 -05:00
Philippe Mathieu-Daudé f043568f54 vdi: Use definitions from "qemu/units.h"
Signed-off-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Reviewed-by: Stefan Weil <sw@weilnetz.de>
Message-Id: <20180625124238.25339-3-f4bug@amsat.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-07-02 14:45:23 +02:00
Eric Blake 583c99d393 block: Remove unused sector-based vectored I/O
We are gradually moving away from sector-based interfaces, towards
byte-based.  Now that all callers of vectored I/O have been converted
to use our preferred byte-based bdrv_co_p{read,write}v(), we can
delete the unused bdrv_co_{read,write}v().

Furthermore, this gets rid of the signature difference between the
public bdrv_co_writev() and the callback .bdrv_co_writev (the
latter still exists, because some drivers still need more work
before they are fully byte-based).

Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-06-29 14:20:56 +02:00
Eric Blake 3a7404b31e vhdx: Switch to byte-based calls
We are gradually moving away from sector-based interfaces, towards
byte-based.  Make the change for the last few sector-based calls
into the block layer from the vhdx driver.

Ideally, the vhdx driver should switch to doing everything
byte-based, but that's a more invasive change that requires a
bit more auditing.

Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-06-29 14:20:56 +02:00
Eric Blake 04a11d87d1 replication: Switch to byte-based calls
We are gradually moving away from sector-based interfaces, towards
byte-based.  Make the change for the last few sector-based calls
into the block layer from the replication driver.

Ideally, the replication driver should switch to doing everything
byte-based, but that's a more invasive change that requires a
bit more auditing.

Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-06-29 14:20:56 +02:00
Eric Blake 609841a3f8 qcow: Switch to a byte-based driver
We are gradually moving away from sector-based interfaces, towards
byte-based.  The qcow driver is now ready to fully utilize the
byte-based callback interface, as long as we override the default
alignment to still be 512 (needed at least for asserts present
because of encryption, but easier to do everywhere than to audit
which sub-sector requests are handled correctly, especially since
we no longer recommend qcow for new disk images).

Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-06-29 14:20:56 +02:00
Eric Blake d1326a786d qcow: Switch qcow_co_writev to byte-based calls
We are gradually moving away from sector-based interfaces, towards
byte-based.  Make the change for the internals of the qcow
driver write function, by iterating over offset/bytes instead of
sector_num/nb_sectors, and with a rename of index_in_cluster and
repurposing of n to track bytes instead of sectors.

A later patch will then switch the qcow driver as a whole over
to byte-based operation.

Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-06-29 14:20:56 +02:00
Eric Blake a15312b017 qcow: Switch qcow_co_readv to byte-based calls
We are gradually moving away from sector-based interfaces, towards
byte-based.  Make the change for the internals of the qcow
driver read function, by iterating over offset/bytes instead of
sector_num/nb_sectors, and with a rename of index_in_cluster and
repurposing of n to track bytes instead of sectors.

A later patch will then switch the qcow driver as a whole over
to byte-based operation.

Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-06-29 14:20:56 +02:00
Eric Blake 787993a543 qcow: Switch get_cluster_offset to be byte-based
We are gradually moving away from sector-based interfaces, towards
byte-based.  Make the change for the internal helper function
get_cluster_offset(), by changing n_start and n_end to be byte
offsets rather than sector indices within the cluster being
allocated.  However, assert that these values are still
sector-aligned (at least qcrypto_block_encrypt() still wants that).
For now we get that alignment for free because we still use
sector-based driver callbacks.

A later patch will then switch the qcow driver as a whole over
to byte-based operation; but will still leave things at sector
alignments as it is not worth auditing the qcow image format
to worry about sub-sector requests.

Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-06-29 14:20:56 +02:00
Eric Blake d08c2a245f parallels: Switch to byte-based calls
We are gradually moving away from sector-based interfaces, towards
byte-based.  Make the change for the last few sector-based calls
into the block layer from the parallels driver.

Ideally, the parallels driver should switch to doing everything
byte-based, but that's a more invasive change that requires a
bit more auditing.

Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-06-29 14:20:56 +02:00
Fam Zheng c436e3d014 file-posix: Fix EINTR handling
EINTR should be checked against errno, not ret. While fixing the bug,
collect the branches with a switch block.

Also, change the return value from -ENOSTUP to -ENOSPC when the actual
issue is request range passes EOF, which should be distinguishable from
the case of error == ENOSYS by the caller, so that it could still retry
with other byte ranges, whereas it shouldn't retry anymore upon ENOSYS.

Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-06-29 14:20:56 +02:00
Fam Zheng 1439b9c110 iscsi: Don't blindly use designator length in response for memcpy
Per SCSI definition the designator_length we receive from INQUIRY is 8,
12 or at most 16, but we should be careful because the remote iscsi
target may misbehave, otherwise we could have a buffer overflow.

Reported-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-06-29 14:20:56 +02:00
Fam Zheng e06f4639d8 qcow2: Fix src_offset in copy offloading
Not updating src_offset will result in wrong data being written to dst
image.

Reported-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-06-29 14:20:56 +02:00
Kevin Wolf 33d70fb6fa file-posix: Implement co versions of discard/flush
This simplifies file-posix by implementing the coroutine variants of
the discard and flush BlockDriver callbacks. These were the last
remaining users of paio_submit(), which can be removed now.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
2018-06-29 14:20:56 +02:00
Kevin Wolf 8b24cd1415 qcow2: Free allocated clusters on write error
If we managed to allocate the clusters, but then failed to write the
data, there's a good chance that we'll still be able to free the
clusters again in order to avoid cluster leaks (the refcounts are
cached, so even if we can't write them out right now, we may be able to
do so when the VM is resumed after a werror=stop/enospc pause).

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Tested-by: Eric Blake <eblake@redhat.com>
2018-06-29 14:20:56 +02:00
Markus Armbruster 796d323945 block/crypto: Simplify block_crypto_{open,create}_opts_init()
block_crypto_open_opts_init() and block_crypto_create_opts_init()
contain a virtual visit of QCryptoBlockOptions and
QCryptoBlockCreateOptions less member "format", respectively.

Change their callers to put member "format" in the QDict, so they can
use the generated visitors for these types instead.

Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-06-29 14:20:56 +02:00
Fam Zheng 37aec7d75e block: Move request tracking to children in copy offloading
in_flight and tracked requests need to be tracked in every layer during
recursion. For now the only user is qemu-img convert where overlapping
requests and IOThreads don't exist, therefore this change doesn't make
much difference form user point of view, but it is incorrect as part of
the API. Fix it.

Reported-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-06-29 14:20:56 +02:00
Fam Zheng 354d930dc6 qcow2: Remove dead check on !ret
In the beginning of the function, we initialize the local variable to 0,
and in the body of the function, we check the assigned values and exit
the loop immediately. So here it can never be non-zero.

Reported-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-06-29 14:20:56 +02:00
Kevin Wolf 93f4e2ff4b file-posix: Make .bdrv_co_truncate asynchronous
This moves the code to resize an image file to the thread pool to avoid
blocking.

Creating large images with preallocation with blockdev-create is now
actually a background job instead of blocking the monitor (and most
other things) until the preallocation has completed.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
2018-06-29 14:20:56 +02:00
Kevin Wolf 1bc5f09f2e block: Use tracked request for truncate
When growing an image, block drivers (especially protocol drivers) may
initialise the newly added area. I/O requests to the same area need to
wait for this initialisation to be completed so that data writes don't
get overwritten and reads don't read uninitialised data.

To avoid overhead in the fast I/O path by adding new locking in the
protocol drivers and to restrict the impact to requests that actually
touch the new area, reuse the existing tracked request infrastructure in
block/io.c and mark all discard requests as serialising.

With this change, it is safe for protocol drivers to make
.bdrv_co_truncate actually asynchronous.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
2018-06-29 14:20:56 +02:00
Kevin Wolf 3d9f2d2af6 block: Move bdrv_truncate() implementation to io.c
This moves the bdrv_truncate() implementation from block.c to block/io.c
so it can have access to the tracked requests infrastructure.

This involves making refresh_total_sectors() public (in block_int.h).

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
2018-06-29 14:20:56 +02:00
Kevin Wolf 47e86b868d qcow2: Remove coroutine trampoline for preallocate_co()
All callers are coroutine_fns now, so we can just directly call
preallocate_co().

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
2018-06-29 14:20:56 +02:00
Kevin Wolf 061ca8a368 block: Convert .bdrv_truncate callback to coroutine_fn
bdrv_truncate() is an operation that can block (even for a quite long
time, depending on the PreallocMode) in I/O paths that shouldn't block.
Convert it to a coroutine_fn so that we have the infrastructure for
drivers to make their .bdrv_co_truncate implementation asynchronous.

This change could potentially introduce new race conditions because
bdrv_truncate() isn't necessarily executed atomically any more. Whether
this is a problem needs to be evaluated for each block driver that
supports truncate:

* file-posix/win32, gluster, iscsi, nfs, rbd, ssh, sheepdog: The
  protocol drivers are trivially safe because they don't actually yield
  yet, so there is no change in behaviour.

* copy-on-read, crypto, raw-format: Essentially just filter drivers that
  pass the request to a child node, no problem.

* qcow2: The implementation modifies metadata, so it needs to hold
  s->lock to be safe with concurrent I/O requests. In order to avoid
  double locking, this requires pulling the locking out into
  preallocate_co() and using qcow2_write_caches() instead of
  bdrv_flush().

* qed: Does a single header update, this is fine without locking.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
2018-06-29 14:20:56 +02:00
Kevin Wolf ae5475e82f qcow2: Fix qcow2_truncate() error return value
If qcow2_alloc_clusters_at() returns an error, we do need to negate it
to get back the positive errno code for error_setg_errno(), but we still
need to return the negative error code.

Fixes: 772d1f973f
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
2018-06-29 14:20:56 +02:00
Markus Armbruster e6af90f3fb block/crypto: Pacify Coverity after commit f853465aac
Coverity can't see that qobject_input_visitor_new_flat_confused()
returns non-null when it doesn't set @local_err.  Check the return
value instead, like all the other callers do.

Fixes: CID 1393615
Fixes: CID 1393616
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-06-29 14:20:56 +02:00
Nishanth Aravamudan ed6e216171 linux-aio: properly bubble up errors from initialization
laio_init() can fail for a couple of reasons, which will lead to a NULL
pointer dereference in laio_attach_aio_context().

To solve this, add a aio_setup_linux_aio() function which is called
early in raw_open_common. If this fails, propagate the error up. The
signature of aio_get_linux_aio() was not modified, because it seems
preferable to return the actual errno from the possible failing
initialization calls.

Additionally, when the AioContext changes, we need to associate a
LinuxAioState with the new AioContext. Use the bdrv_attach_aio_context
callback and call the new aio_setup_linux_aio(), which will allocate a
new AioContext if needed, and return errors on failures. If it fails for
any reason, fallback to threaded AIO with an error message, as the
device is already in-use by the guest.

Add an assert that aio_get_linux_aio() cannot return NULL.

Signed-off-by: Nishanth Aravamudan <naravamudan@digitalocean.com>
Message-id: 20180622193700.6523-1-naravamudan@digitalocean.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2018-06-27 13:06:34 +01:00
Anton Nefedov 29cd0403f1 qapi: remove empty flat union branches and types
Flat unions may now have uncovered branches, so it is possible to get rid
of empty types defined for that purpose only.

Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <1529311206-76847-3-git-send-email-anton.nefedov@virtuozzo.com>
Signed-off-by: Markus Armbruster <armbru@redhat.com>
2018-06-22 16:33:46 +02:00
Max Reitz 481debaa32 block/mirror: Add copy mode QAPI interface
This patch allows the user to specify whether to use active or only
background mode for mirror block jobs.  Currently, this setting will
remain constant for the duration of the entire block job.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Message-id: 20180613181823.13618-14-mreitz@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
2018-06-18 17:05:16 +02:00
Max Reitz d06107ade0 block/mirror: Add active mirroring
This patch implements active synchronous mirroring.  In active mode, the
passive mechanism will still be in place and is used to copy all
initially dirty clusters off the source disk; but every write request
will write data both to the source and the target disk, so the source
cannot be dirtied faster than data is mirrored to the target.  Also,
once the block job has converged (BLOCK_JOB_READY sent), source and
target are guaranteed to stay in sync (unless an error occurs).

Active mode is completely optional and currently disabled at runtime.  A
later patch will add a way for users to enable it.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Message-id: 20180613181823.13618-13-mreitz@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
2018-06-18 17:05:15 +02:00
Max Reitz 429076e88d block/mirror: Add MirrorBDSOpaque
This will allow us to access the block job data when the mirror block
driver becomes more complex.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Message-id: 20180613181823.13618-11-mreitz@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
2018-06-18 17:04:59 +02:00
Max Reitz 72d10a9421 block/dirty-bitmap: Add bdrv_dirty_iter_next_area
This new function allows to look for a consecutively dirty area in a
dirty bitmap.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Message-id: 20180613181823.13618-10-mreitz@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
2018-06-18 17:04:57 +02:00
Max Reitz a33fbb4f8b hbitmap: Add @advance param to hbitmap_iter_next()
This new parameter allows the caller to just query the next dirty
position without moving the iterator.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Message-id: 20180613181823.13618-8-mreitz@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
2018-06-18 17:04:55 +02:00
Max Reitz 138f9fffb8 block/mirror: Use source as a BdrvChild
With this, the mirror_top_bs is no longer just a technically required
node in the BDS graph but actually represents the block job operation.

Also, drop MirrorBlockJob.source, as we can reach it through
mirror_top_bs->backing.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Message-id: 20180613181823.13618-6-mreitz@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
2018-06-18 17:04:54 +02:00
Max Reitz 1181e19a6d block/mirror: Wait for in-flight op conflicts
This patch makes the mirror code differentiate between simply waiting
for any operation to complete (mirror_wait_for_free_in_flight_slot())
and specifically waiting for all operations touching a certain range of
the virtual disk to complete (mirror_wait_on_conflicts()).

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Message-id: 20180613181823.13618-5-mreitz@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
2018-06-18 17:04:53 +02:00
Max Reitz 12aa40822d block/mirror: Use CoQueue to wait on in-flight ops
Attach a CoQueue to each in-flight operation so if we need to wait for
any we can use it to wait instead of just blindly yielding and hoping
for some operation to wake us.

A later patch will use this infrastructure to allow requests accessing
the same area of the virtual disk to specifically wait for each other.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Message-id: 20180613181823.13618-4-mreitz@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
2018-06-18 17:04:52 +02:00
Max Reitz 2e1990b26e block/mirror: Convert to coroutines
In order to talk to the source BDS (and maybe in the future to the
target BDS as well) directly, we need to convert our existing AIO
requests into coroutine I/O requests.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Message-id: 20180613181823.13618-3-mreitz@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
2018-06-18 17:04:48 +02:00
Max Reitz 4295c5fc61 block/mirror: Pull out mirror_perform()
When converting mirror's I/O to coroutines, we are going to need a point
where these coroutines are created.  mirror_perform() is going to be
that point.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Message-id: 20180613181823.13618-2-mreitz@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
2018-06-18 17:04:43 +02:00
Greg Kurz f45280cbf6 block: fix QEMU crash with scsi-hd and drive_del
Removing a drive with drive_del while it is being used to run an I/O
intensive workload can cause QEMU to crash.

An AIO flush can yield at some point:

blk_aio_flush_entry()
 blk_co_flush(blk)
  bdrv_co_flush(blk->root->bs)
   ...
    qemu_coroutine_yield()

and let the HMP command to run, free blk->root and give control
back to the AIO flush:

    hmp_drive_del()
     blk_remove_bs()
      bdrv_root_unref_child(blk->root)
       child_bs = blk->root->bs
       bdrv_detach_child(blk->root)
        bdrv_replace_child(blk->root, NULL)
         blk->root->bs = NULL
        g_free(blk->root) <============== blk->root becomes stale
       bdrv_unref(child_bs)
        bdrv_delete(child_bs)
         bdrv_close()
          bdrv_drained_begin()
           bdrv_do_drained_begin()
            bdrv_drain_recurse()
             aio_poll()
              ...
              qemu_coroutine_switch()

and the AIO flush completion ends up dereferencing blk->root:

  blk_aio_complete()
   scsi_aio_complete()
    blk_get_aio_context(blk)
     bs = blk_bs(blk)
 ie, bs = blk->root ? blk->root->bs : NULL
            ^^^^^
            stale

The problem is that we should avoid making block driver graph
changes while we have in-flight requests. Let's drain all I/O
for this BB before calling bdrv_root_unref_child().

Signed-off-by: Greg Kurz <groug@kaod.org>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-06-18 15:03:25 +02:00
Kevin Wolf 0f12264e7a block: Allow graph changes in bdrv_drain_all_begin/end sections
bdrv_drain_all_*() used bdrv_next() to iterate over all root nodes and
did a subtree drain for each of them. This works fine as long as the
graph is static, but sadly, reality looks different.

If the graph changes so that root nodes are added or removed, we would
have to compensate for this. bdrv_next() returns each root node only
once even if it's the root node for multiple BlockBackends or for a
monitor-owned block driver tree, which would only complicate things.

The much easier and more obviously correct way is to fundamentally
change the way the functions work: Iterate over all BlockDriverStates,
no matter who owns them, and drain them individually. Compensation is
only necessary when a new BDS is created inside a drain_all section.
Removal of a BDS doesn't require any action because it's gone afterwards
anyway.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-06-18 15:03:25 +02:00
Kevin Wolf 6cd5c9d7b2 block: ignore_bds_parents parameter for drain functions
In the future, bdrv_drained_all_begin/end() will drain all invidiual
nodes separately rather than whole subtrees. This means that we don't
want to propagate the drain to all parents any more: If the parent is a
BDS, it will already be drained separately. Recursing to all parents is
unnecessary work and would make it an O(n²) operation.

Prepare the drain function for the changed drain_all by adding an
ignore_bds_parents parameter to the internal implementation that
prevents the propagation of the drain to BDS parents. We still (have to)
propagate it to non-BDS parents like BlockBackends or Jobs because those
are not drained separately.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-06-18 15:03:25 +02:00