Commit Graph

4532 Commits

Author SHA1 Message Date
Sam Eiderman 98eb9733f4 vmdk: Add read-only support for seSparse snapshots
Until ESXi 6.5 VMware used the vmfsSparse format for snapshots (VMDK3 in
QEMU).

This format was lacking in the following:

    * Grain directory (L1) and grain table (L2) entries were 32-bit,
      allowing access to only 2TB (slightly less) of data.
    * The grain size (default) was 512 bytes - leading to data
      fragmentation and many grain tables.
    * For space reclamation purposes, it was necessary to find all the
      grains which are not pointed to by any grain table - so a reverse
      mapping of "offset of grain in vmdk" to "grain table" must be
      constructed - which takes large amounts of CPU/RAM.

The format specification can be found in VMware's documentation:
https://www.vmware.com/support/developer/vddk/vmdk_50_technote.pdf

In ESXi 6.5, to support snapshot files larger than 2TB, a new format was
introduced: SESparse (Space Efficient).

This format fixes the above issues:

    * All entries are now 64-bit.
    * The grain size (default) is 4KB.
    * Grain directory and grain tables are now located at the beginning
      of the file.
      + seSparse format reserves space for all grain tables.
      + Grain tables can be addressed using an index.
      + Grains are located in the end of the file and can also be
        addressed with an index.
      - seSparse vmdks of large disks (64TB) have huge preallocated
        headers - mainly due to L2 tables, even for empty snapshots.
    * The header contains a reverse mapping ("backmap") of "offset of
      grain in vmdk" to "grain table" and a bitmap ("free bitmap") which
      specifies for each grain - whether it is allocated or not.
      Using these data structures we can implement space reclamation
      efficiently.
    * Due to the fact that the header now maintains two mappings:
        * The regular one (grain directory & grain tables)
        * A reverse one (backmap and free bitmap)
      These data structures can lose consistency upon crash and result
      in a corrupted VMDK.
      Therefore, a journal is also added to the VMDK and is replayed
      when the VMware reopens the file after a crash.

Since ESXi 6.7 - SESparse is the only snapshot format available.

Unfortunately, VMware does not provide documentation regarding the new
seSparse format.

This commit is based on black-box research of the seSparse format.
Various in-guest block operations and their effect on the snapshot file
were tested.

The only VMware provided source of information (regarding the underlying
implementation) was a log file on the ESXi:

    /var/log/hostd.log

Whenever an seSparse snapshot is created - the log is being populated
with seSparse records.

Relevant log records are of the form:

[...] Const Header:
[...]  constMagic     = 0xcafebabe
[...]  version        = 2.1
[...]  capacity       = 204800
[...]  grainSize      = 8
[...]  grainTableSize = 64
[...]  flags          = 0
[...] Extents:
[...]  Header         : <1 : 1>
[...]  JournalHdr     : <2 : 2>
[...]  Journal        : <2048 : 2048>
[...]  GrainDirectory : <4096 : 2048>
[...]  GrainTables    : <6144 : 2048>
[...]  FreeBitmap     : <8192 : 2048>
[...]  BackMap        : <10240 : 2048>
[...]  Grain          : <12288 : 204800>
[...] Volatile Header:
[...] volatileMagic     = 0xcafecafe
[...] FreeGTNumber      = 0
[...] nextTxnSeqNumber  = 0
[...] replayJournal     = 0

The sizes that are seen in the log file are in sectors.
Extents are of the following format: <offset : size>

This commit is a strict implementation which enforces:
    * magics
    * version number 2.1
    * grain size of 8 sectors  (4KB)
    * grain table size of 64 sectors
    * zero flags
    * extent locations

Additionally, this commit proivdes only a subset of the functionality
offered by seSparse's format:
    * Read-only
    * No journal replay
    * No space reclamation
    * No unmap support

Hence, journal header, journal, free bitmap and backmap extents are
unused, only the "classic" (L1 -> L2 -> data) grain access is
implemented.

However there are several differences in the grain access itself.
Grain directory (L1):
    * Grain directory entries are indexes (not offsets) to grain
      tables.
    * Valid grain directory entries have their highest nibble set to
      0x1.
    * Since grain tables are always located in the beginning of the
      file - the index can fit into 32 bits - so we can use its low
      part if it's valid.
Grain table (L2):
    * Grain table entries are indexes (not offsets) to grains.
    * If the highest nibble of the entry is:
        0x0:
            The grain in not allocated.
            The rest of the bytes are 0.
        0x1:
            The grain is unmapped - guest sees a zero grain.
            The rest of the bits point to the previously mapped grain,
            see 0x3 case.
        0x2:
            The grain is zero.
        0x3:
            The grain is allocated - to get the index calculate:
            ((entry & 0x0fff000000000000) >> 48) |
            ((entry & 0x0000ffffffffffff) << 12)
    * The difference between 0x1 and 0x2 is that 0x1 is an unallocated
      grain which results from the guest using sg_unmap to unmap the
      grain - but the grain itself still exists in the grain extent - a
      space reclamation procedure should delete it.
      Unmapping a zero grain has no effect (0x2 will not change to 0x1)
      but unmapping an unallocated grain will (0x0 to 0x1) - naturally.

In order to implement seSparse some fields had to be changed to support
both 32-bit and 64-bit entry sizes.

Reviewed-by: Karl Heubaum <karl.heubaum@oracle.com>
Reviewed-by: Eyal Moscovici <eyal.moscovici@oracle.com>
Reviewed-by: Arbel Moshe <arbel.moshe@oracle.com>
Signed-off-by: Sam Eiderman <shmuel.eiderman@oracle.com>
Message-id: 20190620091057.47441-4-shmuel.eiderman@oracle.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-06-24 15:53:02 +02:00
Sam Eiderman 59d6ee4850 vmdk: Reduce the max bound for L1 table size
512M of L1 entries is a very loose bound, only 32M are required to store
the maximal supported VMDK file size of 2TB.

Fixed qemu-iotest 59# - now failure occures before on impossible L1
table size.

Reviewed-by: Karl Heubaum <karl.heubaum@oracle.com>
Reviewed-by: Eyal Moscovici <eyal.moscovici@oracle.com>
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Arbel Moshe <arbel.moshe@oracle.com>
Signed-off-by: Sam Eiderman <shmuel.eiderman@oracle.com>
Message-id: 20190620091057.47441-3-shmuel.eiderman@oracle.com
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-06-24 15:53:02 +02:00
Sam Eiderman 940a2cd5d2 vmdk: Fix comment regarding max l1_size coverage
Commit b0651b8c24 ("vmdk: Move l1_size check into vmdk_add_extent")
extended the l1_size check from VMDK4 to VMDK3 but did not update the
default coverage in the moved comment.

The previous vmdk4 calculation:

    (512 * 1024 * 1024) * 512(l2 entries) * 65536(grain) = 16PB

The added vmdk3 calculation:

    (512 * 1024 * 1024) * 4096(l2 entries) * 512(grain) = 1PB

Adding the calculation of vmdk3 to the comment.

In any case, VMware does not offer virtual disks more than 2TB for
vmdk4/vmdk3 or 64TB for the new undocumented seSparse format which is
not implemented yet in qemu.

Reviewed-by: Karl Heubaum <karl.heubaum@oracle.com>
Reviewed-by: Eyal Moscovici <eyal.moscovici@oracle.com>
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Arbel Moshe <arbel.moshe@oracle.com>
Signed-off-by: Sam Eiderman <shmuel.eiderman@oracle.com>
Message-id: 20190620091057.47441-2-shmuel.eiderman@oracle.com
Reviewed-by: yuchenlin <yuchenlin@synology.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-06-24 15:53:02 +02:00
Max Reitz a193ad3b3b block/commit: Drop bdrv_child_try_set_perm()
commit_top_bs never requests or unshares any permissions.  There is no
reason to make this so explicit here.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-06-18 16:41:10 +02:00
Max Reitz f94dc3b414 block/mirror: Fix child permissions
We cannot use bdrv_child_try_set_perm() to give up all restrictions on
the child edge, and still have bdrv_mirror_top_child_perm() request
BLK_PERM_WRITE.  Fix this by making bdrv_mirror_top_child_perm() return
0/BLK_PERM_ALL when we want to give up all permissions, and replacing
bdrv_child_try_set_perm() by bdrv_child_refresh_perms().

The bdrv_child_try_set_perm() before removing the node with
bdrv_replace_node() is then unnecessary.  No permissions have changed
since the previous invocation of bdrv_child_try_set_perm().

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-06-18 16:41:10 +02:00
Max Reitz 094e363944 file-posix: Update open_flags in raw_set_perm()
raw_check_perm() + raw_set_perm() can change the flags associated with
the current FD.  If so, we have to update BDRVRawState.open_flags
accordingly.  Otherwise, we may keep reopening the FD even though the
current one already has the correct flags.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-06-18 16:41:10 +02:00
Vladimir Sementsov-Ogievskiy b23c580c94 block: drop bs->job
Drop remaining users of bs->job:
1. assertions actually duplicated by assert(!bs->refcnt)
2. trace-point seems not enough reason to change stream_start to return
   BlockJob pointer
3. Restricting creation of two jobs based on same bs is bad idea, as
   3.1 Some jobs creates filters to be their main node, so, this check
   don't actually prevent creating second job on same real node (which
   will create another filter node) (but I hope it is restricted by
   other mechanisms)
   3.2 Even without bs->job we have two systems of permissions:
   op-blockers and BLK_PERM
   3.3 We may want to run several jobs on one node one day

And finally, drop bs->job pointer itself. Hurrah!

Suggested-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-06-18 16:41:10 +02:00
Vladimir Sementsov-Ogievskiy 68d00e4293 block/block-backend: blk_iostatus_reset: drop usage of bs->job
We are going to remove bs->job pointer. Drop it's usage in
blk_iostatus_reset.

blk_iostatus_reset() has only two callers:

1. blk_attach_dev(). This doesn't have anything to do with jobs and
    attaching a new guest device won't solve any problem the job
    encountered, so no reason to reset the iostatus for the job.

2. qmp_cont(). This resets the iostatus for everything. We can just
    call block_job_iostatus_reset() for all block jobs instead of going
    through BlockBackend.

Suggested-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-06-18 16:41:10 +02:00
Vladimir Sementsov-Ogievskiy cc19f1773d block/replication: drop usage of bs->job
We are going to remove bs->job pointer. Drop it's usage in replication
code. Additionally we have to return job pointer from some mirror APIs.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-06-18 16:41:09 +02:00
Max Reitz 1adb0b5e0f blkdebug: Inject errors on .bdrv_co_block_status()
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-id: 20190507203508.18026-6-mreitz@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-06-14 14:16:57 +02:00
Max Reitz f8cec157cb blkdebug: Add "none" event
Together with @iotypes and @sector, this can be used to trap e.g. the
first read or write access to a certain sector without having to know
what happens internally in the block layer, i.e. which "real" events
happen right before such an access.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-id: 20190507203508.18026-5-mreitz@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-06-14 14:16:57 +02:00
Max Reitz 16789db3de blkdebug: Add @iotype error option
This new error option allows users of blkdebug to inject errors only on
certain kinds of I/O operations.  Users usually want to make a very
specific operation fail, not just any; but right now they simply hope
that the event that triggers the error injection is followed up with
that very operation.  That may not be true, however, because the block
layer is changing (including blkdebug, which may increase the number of
types of I/O operations on which to inject errors).

The new option's default has been chosen to keep backwards
compatibility.

Note that similar to the internal representation, we could choose to
expose this option as a list of I/O types.  But there is no practical
use for this, because as described above, users usually know exactly
which kind of operation they want to make fail, so there is no need to
specify multiple I/O types at once.  In addition, exposing this option
as a list would require non-trivial changes to qemu_opts_absorb_qdict().

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-id: 20190507203508.18026-4-mreitz@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-06-14 14:16:57 +02:00
Vladimir Sementsov-Ogievskiy 611ae1d716 block/nbd: merge NBDClientSession struct back to BDRVNBDState
No reason to keep it separate, it differs from others block driver
behavior and therefore confuses. Instead of generic
  'state = (State*)bs->opaque' we have to use special helper.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20190611102720.86114-4-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2019-06-13 10:00:42 -05:00
Vladimir Sementsov-Ogievskiy 86f8cdf3db block/nbd: merge nbd-client.* to nbd.c
No reason for keeping driver handlers realization separate from driver
structure. We can get rid of extra header file.

While being here, fix comments style, restore forgotten comments for
NBD_FOREACH_REPLY_CHUNK and nbd_reply_chunk_iter_receive, remove extra
includes.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20190611102720.86114-3-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2019-06-13 09:55:09 -05:00
Vladimir Sementsov-Ogievskiy 0a93b359db block/nbd-client: drop stale logout
Drop one on failure path (we have errp) and turn two others into trace
points.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20190611102720.86114-2-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2019-06-13 09:35:53 -05:00
Stefano Garzarella 2ea8e96da2 block/gluster: update .help of BLOCK_OPT_PREALLOC option
Add missing 'falloc' among the allowed values of 'preallocation'
option; show it and 'full' only when they are supported.
('falloc' is supported if defined CONFIG_GLUSTERFS_FALLOCATE,
'full' is supported if defined CONFIG_GLUSTERFS_ZEROFILL)

Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <20190524075848.23781-4-sgarzare@redhat.com>
Signed-off-by: Markus Armbruster <armbru@redhat.com>
2019-06-12 18:32:32 +02:00
Stefano Garzarella abea00533f block/file-posix: update .help of BLOCK_OPT_PREALLOC option
Show 'falloc' among the allowed values of 'preallocation'
option, only when it is supported (if defined CONFIG_POSIX_FALLOCATE)

Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <20190524075848.23781-3-sgarzare@redhat.com>
Signed-off-by: Markus Armbruster <armbru@redhat.com>
2019-06-12 18:31:46 +02:00
Markus Armbruster a8d2532645 Include qemu-common.h exactly where needed
No header includes qemu-common.h after this commit, as prescribed by
qemu-common.h's file comment.

Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <20190523143508.25387-5-armbru@redhat.com>
[Rebased with conflicts resolved automatically, except for
include/hw/arm/xlnx-zynqmp.h hw/arm/nrf51_soc.c hw/arm/msf2-soc.c
block/qcow2-refcount.c block/qcow2-cluster.c block/qcow2-cache.c
target/arm/cpu.h target/lm32/cpu.h target/m68k/cpu.h target/mips/cpu.h
target/moxie/cpu.h target/nios2/cpu.h target/openrisc/cpu.h
target/riscv/cpu.h target/tilegx/cpu.h target/tricore/cpu.h
target/unicore32/cpu.h target/xtensa/cpu.h; bsd-user/main.c and
net/tap-bsd.c fixed up]
2019-06-12 13:20:20 +02:00
Markus Armbruster 0b8fa32f55 Include qemu/module.h where needed, drop it from qemu-common.h
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <20190523143508.25387-4-armbru@redhat.com>
[Rebased with conflicts resolved automatically, except for
hw/usb/dev-hub.c hw/misc/exynos4210_rng.c hw/misc/bcm2835_rng.c
hw/misc/aspeed_scu.c hw/display/virtio-vga.c hw/arm/stm32f205_soc.c;
ui/cocoa.m fixed up]
2019-06-12 13:18:33 +02:00
Markus Armbruster 856dfd8a03 qemu-common: Move qemu_isalnum() etc. to qemu/ctype.h
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <20190523143508.25387-3-armbru@redhat.com>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
2019-06-11 20:22:09 +02:00
Vladimir Sementsov-Ogievskiy d93e572688 block/io: bdrv_pdiscard: support int64_t bytes parameter
This fixes at least one overflow in qcow2_process_discards, which
passes 64bit region length to bdrv_pdiscard where bytes (or sectors in
the past) parameter is int since its introduction in 0b919fae.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-06-04 16:55:58 +02:00
Vladimir Sementsov-Ogievskiy 1477b6c803 block/qcow2-refcount: add trace-point to qcow2_process_discards
Let's at least trace ignored failure.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-06-04 16:55:58 +02:00
Kevin Wolf d0ee0204f4 block: Remove wrong bdrv_set_aio_context() calls
The mirror and commit block jobs use bdrv_set_aio_context() to move
their filter node into the right AioContext before hooking it up in the
graph. Similarly, bdrv_open_backing_file() explicitly moves the backing
file node into the right AioContext first.

This isn't necessary any more, they get automatically moved into the
right context now when attaching them.

However, in the case of bdrv_open_backing_file() with a node reference,
it's actually not only unnecessary, but even wrong: The unchecked
bdrv_set_aio_context() changes the AioContext of the child node even if
other parents require it to retain the old context. So this is not only
a simplification, but a bug fix, too.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1684342
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-06-04 15:22:22 +02:00
Kevin Wolf 132ada80c4 block: Adjust AioContexts when attaching nodes
So far, we only made sure that updating the AioContext of a node
affected the whole subtree. However, if a node is newly attached to a
new parent, we also need to make sure that both the subtree of the node
and the parent are in the same AioContext. This tries to move the new
child node to the parent AioContext and returns an error if this isn't
possible.

BlockBackends now actually apply their AioContext to their root node.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-06-04 15:22:22 +02:00
Kevin Wolf d861ab3acf block: Add BlockBackend.ctx
This adds a new parameter to blk_new() which requires its callers to
declare from which AioContext this BlockBackend is going to be used (or
the locks of which AioContext need to be taken anyway).

The given context is only stored and kept up to date when changing
AioContexts. Actually applying the stored AioContext to the root node
is saved for another commit.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-06-04 15:22:22 +02:00
Kevin Wolf 97896a4887 block: Add Error to blk_set_aio_context()
Add an Error parameter to blk_set_aio_context() and use
bdrv_child_try_set_aio_context() internally to check whether all
involved nodes can actually support the AioContext switch.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-06-04 15:22:22 +02:00
Julia Suvorova 2b02fd81de block/linux-aio: Drop unused BlockAIOCB submission method
Callback-based laio_submit() and laio_cancel() were left after
rewriting Linux AIO backend to coroutines in hope that they would be
used in other code that could bypass coroutines. They can be safely
removed because they have not been used since that time.

Signed-off-by: Julia Suvorova <jusual@mail.ru>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-06-04 15:20:41 +02:00
Max Reitz 5cb2737e92 block/io: Delay decrementing the quiesce_counter
When ending a drained section, bdrv_do_drained_end() currently first
decrements the quiesce_counter, and only then actually ends the drain.

The bdrv_drain_invoke(bs, false) call may cause graph changes.  Say the
graph change involves replacing an existing BB's ("blk") BDS
(blk_bs(blk)) by @bs.  Let us introducing the following values:
- bs_oqc = old_quiesce_counter
  (so bs->quiesce_counter == bs_oqc - 1)
- obs_qc = blk_bs(blk)->quiesce_counter (before bdrv_drain_invoke())

Let us assume there is no blk_pread_unthrottled() involved, so
blk->quiesce_counter == obs_qc (before bdrv_drain_invoke()).

Now replacing blk_bs(blk) by @bs will reduce blk->quiesce_counter by
obs_qc (making it 0) and increase it by bs_oqc-1 (making it bs_oqc-1).

bdrv_drain_invoke() returns and we invoke bdrv_parent_drained_end().
This will decrement blk->quiesce_counter by one, so it would be -1 --
were there not an assertion against that in blk_root_drained_end().

We therefore have to keep the quiesce_counter up at least until
bdrv_drain_invoke() returns, so that bdrv_parent_drained_end() does the
right thing for the parents @bs got during bdrv_drain_invoke().

But let us delay it even further, namely until bdrv_parent_drained_end()
returns, because then it mirrors bdrv_do_drained_begin(): There, we
first increment the quiesce_counter, then begin draining the parents,
and then call bdrv_drain_invoke().  It makes sense to let
bdrv_do_drained_end() unravel this exactly in reverse.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-06-04 15:20:41 +02:00
Vladimir Sementsov-Ogievskiy 69f47505ee block: avoid recursive block_status call if possible
drv_co_block_status digs bs->file for additional, more accurate search
for hole inside region, reported as DATA by bs since 5daa74a6eb.

This accuracy is not free: assume we have qcow2 disk. Actually, qcow2
knows, where are holes and where is data. But every block_status
request calls lseek additionally. Assume a big disk, full of
data, in any iterative copying block job (or img convert) we'll call
lseek(HOLE) on every iteration, and each of these lseeks will have to
iterate through all metadata up to the end of file. It's obviously
ineffective behavior. And for many scenarios we don't need this lseek
at all.

However, lseek is needed when we have metadata-preallocated image.

So, let's detect metadata-preallocation case and don't dig qcow2's
protocol file in other cases.

The idea is to compare allocation size in POV of filesystem with
allocations size in POV of Qcow2 (by refcounts). If allocation in fs is
significantly lower, consider it as metadata-preallocation case.

102 iotest changed, as our detector can't detect shrinked file as
metadata-preallocation, which don't seem to be wrong, as with metadata
preallocation we always have valid file length.

Two other iotests have a slight change in their QMP output sequence:
Active 'block-commit' returns earlier because the job coroutine yields
earlier on a blocking operation. This operation is loading the refcount
blocks in qcow2_detect_metadata_preallocation().

Suggested-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-06-04 15:20:41 +02:00
Peter Maydell 62f6849e7a Pull request
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE+ber27ys35W+dsvQfe+BBqr8OQ4FAlztyykACgkQfe+BBqr8
 OQ4EaA//QQVDpIBRcMN+LeKWeEs8VSLziPUrZuFvuhMEEnjnaU6gbKq8G8xbFQ62
 JIHg0DBGhTt8ymE9Ay6O/cooR8F0z+XyfDr7UlpI7JL/Uwl7JguGKQrWUYBRMqCv
 Q2cLaWStLkfdkuW7Y3WRc16VEnIlizDxjRzfjE2ESYpuzD2fFsBY3KZbgbJwYwZw
 SujWUQ3MdsNdw5kDmerlrDUy7r/eyl2cLXyIt6ClHNoqq392oGMoUn4XbsaLnCWE
 H5s46qm33eXtvBHqxVGoOMAli5FwCnhwF+H3xg93jIG6vC/RXQYCIhlEmEwKyrU2
 g2DWWe/8+9b0iX+zTIcAPTcn1pmjVivGRorOurP0AtMtjV/8PvV+hAQQeSg2ARB3
 rLpXaEphD4WTwu7mYlZ5kX0qvX2SftaMU08k1IgR3mfo8Z3X9znVoFIv8HLlHuy+
 OhCmwT5OWYw4mNABTXeBMH/Dcs9EcU4+T/KhAGLReHo18CSyjeT2xsT+XCsETagF
 KlAP88dP0EdJ9Oiccyb8as22u7ygKWIiDYPplBdb4SkKg/koQnYGDjeDAzB2vXS3
 cGVhGJD2DBbcePA8iaCfWzsSCDOTBFQLa45uhPD3DnkAJylhecSsiDQP+IrLslK3
 h/8v9e8MAlHMgrueSnS7foMDI9rdrTNsChuNCJWOOaUI/ZWnXFg=
 =kCrN
 -----END PGP SIGNATURE-----

Merge remote-tracking branch 'remotes/jnsnow/tags/bitmaps-pull-request' into staging

Pull request

# gpg: Signature made Wed 29 May 2019 00:58:33 BST
# gpg:                using RSA key F9B7ABDBBCACDF95BE76CBD07DEF8106AAFC390E
# gpg: Good signature from "John Snow (John Huston) <jsnow@redhat.com>" [full]
# Primary key fingerprint: FAEB 9711 A12C F475 812F  18F2 88A9 064D 1835 61EB
#      Subkey fingerprint: F9B7 ABDB BCAC DF95 BE76  CBD0 7DEF 8106 AAFC 390E

* remotes/jnsnow/tags/bitmaps-pull-request:
  iotests: test external snapshot with bitmap copying
  qapi: support external bitmaps in block-dirty-bitmap-merge
  migration/dirty-bitmaps: change bitmap enumeration method

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2019-05-30 12:10:27 +01:00
Vladimir Sementsov-Ogievskiy eff0829b07 qapi: support external bitmaps in block-dirty-bitmap-merge
Add new optional parameter making possible to merge bitmaps from
different nodes. It is needed to maintain external snapshots during
incremental backup chain history.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Message-id: 20190517152111.206494-2-vsementsov@virtuozzo.com
Signed-off-by: John Snow <jsnow@redhat.com>
2019-05-28 19:33:31 -04:00
Andrey Shinkevich 6388903e7c qcow2-bitmap: initialize bitmap directory alignment
Valgrind detects multiple issues in QEMU iotests when the memory is
used without being initialized. Valgrind may dump lots of unnecessary
reports what makes the memory issue analysis harder. Particularly,
that is true for the aligned bitmap directory and can be seen while
running the iotest #169. Padding the aligned space with zeros eases
the pain.

Signed-off-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
Message-id: 1558961521-131620-1-git-send-email-andrey.shinkevich@virtuozzo.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-05-28 20:30:55 +02:00
Anton Nefedov c8bb23cbdb qcow2: skip writing zero buffers to empty COW areas
If COW areas of the newly allocated clusters are zeroes on the backing
image, efficient bdrv_write_zeroes(flags=BDRV_REQ_NO_FALLBACK) can be
used on the whole cluster instead of writing explicit zero buffers later
in perform_cow().

iotest 060:
write to the discarded cluster does not trigger COW anymore.
Use a backing image instead.

Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
Message-id: 20190516142749.81019-2-anton.nefedov@virtuozzo.com
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-05-28 20:30:55 +02:00
Alberto Garcia b441dc71c0 block: Make bdrv_root_attach_child() unref child_bs on failure
A consequence of the previous patch is that bdrv_attach_child()
transfers the reference to child_bs from the caller to parent_bs,
which will drop it on bdrv_close() or when someone calls
bdrv_unref_child().

But this only happens when bdrv_attach_child() succeeds. If it fails
then the caller is responsible for dropping the reference to child_bs.

This patch makes bdrv_attach_child() take the reference also when
there is an error, freeing the caller for having to do it.

A similar situation happens with bdrv_root_attach_child(), so the
changes on this patch affect both functions.

Signed-off-by: Alberto Garcia <berto@igalia.com>
Message-id: 20dfb3d9ccec559cdd1a9690146abad5d204a186.1557754872.git.berto@igalia.com
[mreitz: Removed now superfluous BdrvChild * variable in
         bdrv_open_child()]
Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-05-28 20:30:55 +02:00
Vladimir Sementsov-Ogievskiy ae6b12fa4c block/backup: refactor: split out backup_calculate_cluster_size
Split out cluster_size calculation. Move copy-bitmap creation above
block-job creation, as we are going to share it with upcoming
backup-top filter, which also should be created before actual block job
creation.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-id: 20190429090842.57910-6-vsementsov@virtuozzo.com
[mreitz: Dropped a paragraph from the commit message that was left over
         from a previous version]
Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-05-28 20:30:55 +02:00
Vladimir Sementsov-Ogievskiy c334e897d0 block/backup: unify different modes code path
Do full, top and incremental mode copying all in one place. This
unifies the code path and helps further improvements.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-id: 20190429090842.57910-5-vsementsov@virtuozzo.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-05-28 20:30:55 +02:00
Vladimir Sementsov-Ogievskiy 9eb5a248f3 block/backup: refactor and tolerate unallocated cluster skipping
Split allocation checking to separate function and reduce nesting.
Consider bdrv_is_allocated() fail as allocated area, as copying more
than needed is not wrong (and we do it anyway) and seems better than
fail the whole job. And, most probably we will fail on the next read,
if there are real problem with source.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-id: 20190429090842.57910-4-vsementsov@virtuozzo.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-05-28 20:30:55 +02:00
Vladimir Sementsov-Ogievskiy a8389e315e block/backup: move to copy_bitmap with granularity
We are going to share this bitmap between backup and backup-top filter
driver, so let's share something more meaningful. It also simplifies
some calculations.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-id: 20190429090842.57910-3-vsementsov@virtuozzo.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-05-28 20:30:55 +02:00
Vladimir Sementsov-Ogievskiy c2da3413c0 block/backup: simplify backup_incremental_init_copy_bitmap
Simplify backup_incremental_init_copy_bitmap using the function
bdrv_dirty_bitmap_next_dirty_area.

Note: move to job->len instead of bitmap size: it should not matter but
less code.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-id: 20190429090842.57910-2-vsementsov@virtuozzo.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-05-28 20:30:55 +02:00
Vladimir Sementsov-Ogievskiy 8ac0f15f33 qcow2: do encryption in threads
Do encryption/decryption in threads, like it is already done for
compression. This improves asynchronous encrypted io.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-id: 20190506142741.41731-9-vsementsov@virtuozzo.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-05-28 20:30:55 +02:00
Vladimir Sementsov-Ogievskiy 5447c3a03f qcow2: bdrv_co_pwritev: move encryption code out of the lock
Encryption will be done in threads, to take benefit of it, we should
move it out of the lock first.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-id: 20190506142741.41731-8-vsementsov@virtuozzo.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-05-28 20:30:55 +02:00
Vladimir Sementsov-Ogievskiy f24196d388 qcow2: qcow2_co_preadv: improve locking
Background: decryption will be done in threads, to take benefit of it,
we should move it out of the lock first.

But let's go further: it turns out, that only
qcow2_get_cluster_offset() needs locking, so reduce locking to it.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-id: 20190506142741.41731-7-vsementsov@virtuozzo.com
Reviewed-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-05-28 20:30:55 +02:00
Vladimir Sementsov-Ogievskiy 6f13a316dd qcow2-threads: split out generic path
Move generic part out of qcow2_co_do_compress, to reuse it for
encryption and rename things that would be shared with encryption path.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-id: 20190506142741.41731-6-vsementsov@virtuozzo.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-05-28 20:30:55 +02:00
Vladimir Sementsov-Ogievskiy 0f5636c51c qcow2-threads: qcow2_co_do_compress: protect queuing by mutex
Drop dependence on AioContext lock.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-id: 20190506142741.41731-5-vsementsov@virtuozzo.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-05-28 20:30:55 +02:00
Vladimir Sementsov-Ogievskiy 269062efc8 qcow2-threads: use thread_pool_submit_co
Use thread_pool_submit_co, instead of reinventing it here. Note, that
thread_pool_submit_aio() never returns NULL, so checking it was an
extra thing.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-id: 20190506142741.41731-4-vsementsov@virtuozzo.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-05-28 20:30:55 +02:00
Vladimir Sementsov-Ogievskiy 56e2f1d898 qcow2: add separate file for threaded data processing functions
Move compression-on-threads to separate file. Encryption will be in it
too.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-id: 20190506142741.41731-3-vsementsov@virtuozzo.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-05-28 20:30:55 +02:00
Vladimir Sementsov-Ogievskiy 9353db47c5 qcow2.h: add missing include
qcow2.h depends on block_int.h. Compilation isn't broken currently only
due to block_int.h always included before qcow2.h. Though, it seems
better to directly include block_int.h in qcow2.h.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-id: 20190506142741.41731-2-vsementsov@virtuozzo.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-05-28 20:30:55 +02:00
Max Reitz 9c3db310ff block/file-posix: Unaligned O_DIRECT block-status
Currently, qemu crashes whenever someone queries the block status of an
unaligned image tail of an O_DIRECT image:
$ echo > foo
$ qemu-img map --image-opts driver=file,filename=foo,cache.direct=on
Offset          Length          Mapped to       File
qemu-img: block/io.c:2093: bdrv_co_block_status: Assertion `*pnum &&
QEMU_IS_ALIGNED(*pnum, align) && align > offset - aligned_offset'
failed.

This is because bdrv_co_block_status() checks that the result returned
by the driver's implementation is aligned to the request_alignment, but
file-posix can fail to do so, which is actually mentioned in a comment
there: "[...] possibly including a partial sector at EOF".

Fix this by rounding up those partial sectors.

There are two possible alternative fixes:
(1) We could refuse to open unaligned image files with O_DIRECT
    altogether.  That sounds reasonable until you realize that qcow2
    does necessarily not fill up its metadata clusters, and that nobody
    runs qemu-img create with O_DIRECT.  Therefore, unpreallocated qcow2
    files usually have an unaligned image tail.

(2) bdrv_co_block_status() could ignore unaligned tails.  It actually
    throws away everything past the EOF already, so that sounds
    reasonable.
    Unfortunately, the block layer knows file lengths only with a
    granularity of BDRV_SECTOR_SIZE, so bdrv_co_block_status() usually
    would have to guess whether its file length information is inexact
    or whether the driver is broken.

Fixing what raw_co_block_status() returns is the safest thing to do.

There seems to be no other block driver that sets request_alignment and
does not make sure that it always returns aligned values.

Cc: qemu-stable@nongnu.org
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-05-20 17:08:57 +02:00
Kevin Wolf 9ff7f0df87 blockjob: Propagate AioContext change to all job nodes
Block jobs require that all of the nodes the job is using are in the
same AioContext. Therefore all BdrvChild objects of the job propagate
.(can_)set_aio_context to all other job nodes, so that the switch is
checked and performed consistently even if both nodes are in different
subtrees.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-05-20 17:08:56 +02:00
Kevin Wolf 980b0f943a block: Add blk_set_allow_aio_context_change()
Some users (like block jobs) can tolerate an AioContext change for their
BlockBackend. Add a function that tells the BlockBackend that it can
allow changes.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-05-20 17:08:56 +02:00
Kevin Wolf 38475269d4 block: Implement .(can_)set_aio_ctx for BlockBackend
bdrv_try_set_aio_context() currently fails if a BlockBackend is attached
to a node because it doesn't implement the BdrvChildRole callbacks for
AioContext management.

We can allow changing the AioContext of monitor-owned BlockBackends as
long as no device is attached to them.

When setting the AioContext of the root node of a BlockBackend, we now
need to pass blk->root as an ignored child because we don't want the
root node to recursively call back into BlockBackend and execute
blk_do_set_aio_context() a second time.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-05-20 17:08:56 +02:00
Alberto Garcia 41ae31e3d7 block: Use BDRV_REQUEST_MAX_BYTES instead of BDRV_REQUEST_MAX_SECTORS
There are a few places in which we turn a number of bytes into sectors
in order to compare the result against BDRV_REQUEST_MAX_SECTORS
instead of using BDRV_REQUEST_MAX_BYTES directly.

Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-05-20 17:08:56 +02:00
Alberto Garcia b6c246942b qcow2: Define and use QCOW2_COMPRESSED_SECTOR_SIZE
When an L2 table entry points to a compressed cluster the space used
by the data is specified in 512-byte sectors. This size is independent
from BDRV_SECTOR_SIZE and is specific to the qcow2 file format.

The QCOW2_COMPRESSED_SECTOR_SIZE constant defined in this patch makes
this explicit.

Signed-off-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-05-20 17:08:56 +02:00
Max Reitz 50ba5b2d99 block/file-posix: Truncate in xfs_write_zeroes()
XFS_IOC_ZERO_RANGE does not increase the file length:
$ touch foo
$ xfs_io -c 'zero 0 65536' foo
$ stat -c "size=%s, blocks=%b" foo
size=0, blocks=128

We do want writes beyond the EOF to automatically increase the file
length, however.  This is evidenced by the fact that iotest 061 is
broken on XFS since qcow2's check implementation checks for blocks
beyond the EOF.

Reported-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-05-20 17:08:56 +02:00
Peter Maydell 01807c8b0e Miscellaneous patches for 2019-05-13
-----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJc2RbiAAoJEDhwtADrkYZTfpAP/itXg+X9wRfMeerni6SAkgtz
 knaLtJNC7YvwH7W6TIkSf2QgMrg/nYzIXxxj6V+Y3Vdn4CD93C7ldItWkm33amlA
 SEhREHpAn4F3wg/LsOGYYgpuqF/wrUcZsmzudnslfMd3mM6Q9Q6J3q6mu8n1oRcl
 RRKlk++ElqyRdvmxFhddhPxk797Vuunh76vd3ARUFmzKs2n7CGkeBu+qbk41VqI9
 YtjmWHO6BDY5b01PvjuWPir6n1yJsYcpfo3ZElZvPf5jQHMmO6fGB3SZc/PIWegq
 gAVeoXtwhNm+nywMpIv1wHQMkvRDZW0wrurIQBc4VGpH1Pa90dR9FNVZ8r0OZqPB
 aErPCdC7ED73uzJwzXKTnLxY0XDgdhsAsW7lFggANs6YyewZNcbDaVhZWsopTTK/
 3jBbddIw2RsfHNQgXlFVVzjZJGHBNHxFjAFASCKcapUWQwDKU42kQrS1GqxG56NI
 Lgi8Ce+Q0GsVF4wme96Oa/8EMRfmNvsHMfWQvmqGqA1OACSOf2PSGCeD618A5gq6
 kV6wF4v5HdGFkc0x9Vr5ur7kv3eOhpzFzBM6XJXe3CyqnYrkNuBldkyGZBbrNY7G
 aW5sR26Is4m9i+7159cNB5LmnfQqtsscibkSC0UQiXcuWgevd6cdiF+0r1YuNp7C
 Faa2yPOHs4mHCjUwade9
 =yKfZ
 -----END PGP SIGNATURE-----

Merge remote-tracking branch 'remotes/armbru/tags/pull-misc-2019-05-13' into staging

Miscellaneous patches for 2019-05-13

# gpg: Signature made Mon 13 May 2019 08:04:02 BST
# gpg:                using RSA key 3870B400EB918653
# gpg: Good signature from "Markus Armbruster <armbru@redhat.com>" [full]
# gpg:                 aka "Markus Armbruster <armbru@pond.sub.org>" [full]
# Primary key fingerprint: 354B C8B3 D7EB 2A6B 6867  4E5F 3870 B400 EB91 8653

* remotes/armbru/tags/pull-misc-2019-05-13:
  Clean up decorations and whitespace around header guards
  Normalize header guard symbol definition.
  Clean up ill-advised or unusual header guards
  Clean up header guards that don't match their file name
  target/xtensa: Clean up core-isa.h header guards
  linux-user/nios2 linux-user/riscv: Clean up header guards
  authz: Normalize #include "authz/trace.h" to "trace.h"
  Use #include "..." for our own headers, <...> for others
  Clean up includes

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2019-05-13 13:55:13 +01:00
Markus Armbruster a8b991b52d Clean up ill-advised or unusual header guards
Leading underscores are ill-advised because such identifiers are
reserved.  Trailing underscores are merely ugly.  Strip both.

Our header guards commonly end in _H.  Normalize the exceptions.

Done with scripts/clean-header-guards.pl.

Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <20190315145123.28030-7-armbru@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
[Changes to slirp/ dropped, as we're about to spin it off]
2019-05-13 08:58:55 +02:00
Alberto Garcia 433e8e3b22 qcow2: Remove BDRVQcow2State.cluster_sectors
The last user of this field disappeared when we replace the
sector-based bdrv_write() with the byte-based bdrv_pwrite().

Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-05-10 16:45:40 +02:00
Alberto Garcia 2e11d7562a block: Remove bdrv_read() and bdrv_write()
No one is using these functions anymore, all callers have switched to
the byte-based bdrv_pread() and bdrv_pwrite()

Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-05-10 16:45:40 +02:00
Alberto Garcia e5a0a6784a vvfat: Replace bdrv_{read,write}() with bdrv_{pread,pwrite}()
There's only a couple of bdrv_read() and bdrv_write() calls left in
the vvfat code, and they can be trivially replaced with the byte-based
bdrv_pread() and bdrv_pwrite().

Signed-off-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-05-10 16:45:40 +02:00
Alberto Garcia d4f189713f vdi: Replace bdrv_{read,write}() with bdrv_{pread,pwrite}()
There's only a couple of bdrv_read() and bdrv_write() calls left in
the vdi code, and they can be trivially replaced with the byte-based
bdrv_pread() and bdrv_pwrite().

Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-05-10 16:45:40 +02:00
Alberto Garcia e3b4257d03 qcow2: Replace bdrv_write() with bdrv_pwrite()
There's only one bdrv_write() call left in the qcow2 code, and it can
be trivially replaced with the byte-based bdrv_pwrite().

Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-05-10 16:45:40 +02:00
Andrey Shinkevich 118f99442d block/io.c: fix for the allocation failure
On a file system used by the customer, fallocate() returns an error
if the block is not properly aligned. So, bdrv_co_pwrite_zeroes()
fails. We can handle that case the same way as it is done for the
unsupported cases, namely, call to bdrv_driver_pwritev() that writes
zeroes to an image for the unaligned chunk of the block.

Suggested-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Message-id: 1554474244-553661-1-git-send-email-andrey.shinkevich@virtuozzo.com
Message-Id: <1554474244-553661-1-git-send-email-andrey.shinkevich@virtuozzo.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2019-05-10 10:53:21 +01:00
Alberto Garcia 6a63419980 commit: Use bdrv_append() in commit_start()
This function combines bdrv_set_backing_hd() and bdrv_replace_node()
so we can use it to simplify the code a bit in commit_start().

Signed-off-by: Alberto Garcia <berto@igalia.com>
Message-id: 20190403143748.9790-1-berto@igalia.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-05-07 17:14:21 +02:00
Max Reitz 21205c7c3b block/ssh: Implement .bdrv_dirname()
ssh_bdrv_dirname() is basically the generic bdrv_dirname(), except it
takes care not to silently chop off any query string (i.e.,
host_key_check).

Signed-off-by: Max Reitz <mreitz@redhat.com>
Tested-by: Richard W.M. Jones <rjones@redhat.com>
Message-id: 20190225190828.17726-3-mreitz@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-05-07 17:14:21 +02:00
Max Reitz b8c1f90118 block/ssh: Implement .bdrv_refresh_filename()
This requires some changes to keep iotests 104 and 207 working.

qemu-img info in 104 will now return a filename including the user name
and the port, which need to be filtered by adjusting REMOTE_TEST_DIR in
common.rc.  This additional information has to be marked optional,
however (which is simple as REMOTE_TEST_DIR is a regex), because
otherwise 197 and 215 would fail: They use it (indirectly) to filter
qemu-img create output which contains a backing filename they have
passed to it -- which probably does not contain a user name or port
number.

The problem in 207 is a nice one to have: qemu-img info used to return
json:{} filenames, but with this patch it returns nice plain ones.  We
now need to adjust the filtering to hide the user name (and port number
while we are at it).  The simplest way to do this is to include both in
iotests.remote_filename() so that bdrv_refresh_filename() will not
change it, and then iotests.img_info_log() will filter it correctly
automatically.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Tested-by: Richard W.M. Jones <rjones@redhat.com>
Message-id: 20190225190828.17726-2-mreitz@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-05-07 17:14:21 +02:00
Andrey Shinkevich 444b82369b qcow2: discard bitmap when removed
Bitmap data may take a lot of disk space, so it's better to discard it
always.

Signed-off-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
Message-id: 1551346019-293202-1-git-send-email-andrey.shinkevich@virtuozzo.com
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
[mreitz: Use the commit message proposed by Vladimir]
Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-05-07 17:14:21 +02:00
Vladimir Sementsov-Ogievskiy 54b10010eb qcow2-refcount: don't mask corruptions under internal errors
No reasons for not reporting found corruptions as corruptions in case
of some internal errors, especially in case of just failed to fix l2
entry (and in this case, missed corruptions may influence comparing
logic, when we calculate difference between corruptions fields of two
results)

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-id: 20190227131433.197063-6-vsementsov@virtuozzo.com
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-05-07 17:14:21 +02:00
Vladimir Sementsov-Ogievskiy cbb51e9f93 qcow2-refcount: check_refcounts_l2: don't count fixed cluster as allocated
Do not count a cluster which is fixed to be ZERO as allocated.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-id: 20190227131433.197063-5-vsementsov@virtuozzo.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-05-07 17:14:21 +02:00
Vladimir Sementsov-Ogievskiy 1ef337b7a0 qcow2-refcount: check_refcounts_l2: reduce ignored overlaps
Reduce number of structures ignored in overlap check: when checking
active table ignore active tables, when checking inactive table ignore
inactive ones.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-id: 20190227131433.197063-4-vsementsov@virtuozzo.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-05-07 17:14:21 +02:00
Vladimir Sementsov-Ogievskiy a5fff8d4b4 qcow2-refcount: avoid eating RAM
qcow2_inc_refcounts_imrt() (through realloc_refcount_array()) can eat
an unpredictable amount of memory on corrupted table entries, which are
referencing regions far beyond the end of file.

Prevent this, by skipping such regions from further processing.

Interesting that iotest 138 checks exactly the behavior which we fix
here. So, change the test appropriately.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-id: 20190227131433.197063-3-vsementsov@virtuozzo.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-05-07 17:14:21 +02:00
Vladimir Sementsov-Ogievskiy 7e3e736cbd qcow2-refcount: fix check_oflag_copied
Increase corruptions_fixed only after successful fix.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-id: 20190227131433.197063-2-vsementsov@virtuozzo.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-05-07 17:14:21 +02:00
Vladimir Sementsov-Ogievskiy 54277a2aab block/qed: add missed coroutine_fn markers
qed_read_table and qed_write_table use coroutine-only interfaces but
are not marked coroutine_fn. Happily, they are called only from
coroutine context, so we only need to add missed markers.

Reported-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-04-30 15:29:00 +02:00
Alberto Garcia 065abf9f2b commit: Make base read-only if there is an early failure
You can reproduce this by passing an invalid filter-node-name (like
"1234") to block-commit. In this case the base image is put in
read-write mode but is never reset back to read-only.

Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-04-30 15:29:00 +02:00
Vladimir Sementsov-Ogievskiy f4326aefcf block/stream: use buffer-based io
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-04-30 15:29:00 +02:00
Vladimir Sementsov-Ogievskiy 08b6261f34 block/commit: use buffer-based io
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-04-30 15:29:00 +02:00
Vladimir Sementsov-Ogievskiy 607dbdc4e0 block/backup: use buffer-based io
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-04-30 15:29:00 +02:00
Vladimir Sementsov-Ogievskiy a4072543cc block/parallels: use buffer-based io
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-04-30 15:29:00 +02:00
Vladimir Sementsov-Ogievskiy 696e8cb292 block/qed: use buffer-based io
Move to _co_ versions of io functions qed_read_table() and
qed_write_table(), as we use qemu_co_mutex_unlock()
anyway.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-04-30 15:29:00 +02:00
Vladimir Sementsov-Ogievskiy 4ed3e0c486 block/qcow: use buffer-based io
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-04-30 15:29:00 +02:00
Vladimir Sementsov-Ogievskiy b00cb15bda block/qcow2: use buffer-based io
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-04-30 15:29:00 +02:00
Alberto Garcia e1f4a37a49 qcow2: Fix error handling in the compression code
This patch fixes a few things in the way error codes are handled in
the qcow2 compression code:

a) qcow2_co_pwritev_compressed() expects qcow2_co_compress() to only
   return -1 or -2 on failure, but this is not correct. Since the
   change from qcow2_compress() to qcow2_co_compress() in commit
   ceb029cd6f the new code can also return -EINVAL (although
   there does not seem to exist any code path that would cause that
   error in the current implementation).

b) -1 and -2 are ad-hoc error codes defined in qcow2_compress().
   This patch replaces them with standard constants from errno.h.

c) Both qcow2_compress() and qcow2_co_do_compress() return a negative
   value on failure, but qcow2_co_pwritev_compressed() stores the
   value in an unsigned data type.

Signed-off-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-04-30 15:29:00 +02:00
Kevin Wolf db04524f82 qcow2: Fix qcow2_make_empty() with external data file
make_completely_empty() is an optimisated path for bdrv_make_empty()
where completely new metadata is created inside the image file instead
of going through all clusters and discarding them. For an external data
file, however, we actually need to do discard operations on the data
file; just overwriting the qcow2 file doesn't get rid of the data.

The necessary slow path with an explicit discard operation already
exists for other cases. Use it for external data files, too.

Cc: qemu-stable@nongnu.org
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
2019-04-30 15:29:00 +02:00
Kevin Wolf 718c0fce2f qcow2: Fix full preallocation with external data file
preallocate_co() already gave the data file the full size without
forwarding the requested preallocation mode to the protocol. When
bdrv_co_truncate() was called later with the preallocation mode, the
file didn't actually grow any more, so the data file stayed unallocated
even if full preallocation was requested.

Pass the right preallocation mode to preallocate_co() and remove the
second bdrv_co_truncate() to fix this. As a side effect, the ugly
one-byte write in preallocate_co() is replaced with a truncate call,
now leaving the last block unallocated on the protocol level as it
should be.

Cc: qemu-stable@nongnu.org
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
2019-04-30 15:29:00 +02:00
Kevin Wolf 360bd07471 qcow2: Add errp to preallocate_co()
We'll add a bdrv_co_truncate() call in the next patch which can return
an Error that we don't want to discard. So add an errp parameter to
preallocate_co().

Cc: qemu-stable@nongnu.org
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
2019-04-30 15:29:00 +02:00
Kevin Wolf f29fbf7c6b qcow2: Avoid COW during metadata preallocation
Limiting the allocation to INT_MAX bytes isn't particularly clever
because it means that the final cluster will be a partial cluster which
will be completed through a COW operation. This results in unnecessary
data read and write requests which lead to an unwanted non-sparse
filesystem block for metadata preallocation.

Align the maximum allocation size down to the cluster size to avoid this
situation.

Cc: qemu-stable@nongnu.org
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
2019-04-30 15:29:00 +02:00
Eric Blake de38b5005e qemu-img: Saner printing of large file sizes
Disk sizes close to INT64_MAX cause overflow, for some pretty
ridiculous output:

  $ ./nbdkit -U - memory size=$((2**63 - 512)) --run 'qemu-img info $nbd'
  image: nbd+unix://?socket=/tmp/nbdkitHSAzNz/socket
  file format: raw
  virtual size: -8388607T (9223372036854775296 bytes)
  disk size: unavailable

But there's no reason to have two separate implementations of integer
to human-readable abbreviation, where one has overflow and stops at
'T', while the other avoids overflow and goes all the way to 'E'. With
this patch, the output now claims 8EiB instead of -8388607T, which
really is the correct rounding of largest file size supported by qemu
(we could go 511 bytes larger if we used byte-accurate sizing instead
of rounding up to the next sector boundary, but that wouldn't change
the human-readable result).

Quite a few iotests need updates to expected output to match.

Reported-by: Richard W.M. Jones <rjones@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
Tested-by: Richard W.M. Jones <rjones@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Tested-by: Max Reitz <mreitz@redhat.com>
2019-04-30 15:29:00 +02:00
Stefano Garzarella 0cb98af218 block/vhdx: Use IEC binary prefixes for size constants
Using IEC binary prefixes in order to make the code more readable,
with the exception of DEFAULT_LOG_SIZE because it's passed to
stringify().

Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Reviewed-by: John Snow <jsnow@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-04-30 15:29:00 +02:00
Stefano Garzarella e9991e29ea block/vhdx: Remove redundant IEC binary prefixes definition
IEC binary prefixes are already defined in "qemu/units.h",
so we can remove redundant definitions in "block/vhdx.h".

Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Reviewed-by: John Snow <jsnow@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-04-30 15:29:00 +02:00
Sam Eiderman 7502be838e vmdk: Set vmdk parent backing_format to vmdk
Commit b69864e5a ("vmdk: Support version=3 in VMDK descriptor files")
fixed the probe function to correctly guess vmdk descriptors with
version=3.

This solves the issue where vmdk snapshot with parent vmdk descriptor
containing "version=3" would be treated as raw instead vmdk.

In the future case where a new vmdk version is introduced, we will again
experience this issue, even if the user will provide "-f vmdk" it will
only apply to the tip image and not to the underlying "misprobed" parent
image.

The code in vmdk.c already assumes that the backing file of vmdk must be
vmdk (see vmdk_is_cid_valid which returns 0 if backing file is not
vmdk).

So let's make it official by supplying the backing_format as vmdk.

Reviewed-by: Mark Kanda <mark.kanda@oracle.com>
Reviewed-By: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Arbel Moshe <arbel.moshe@oracle.com>
Signed-off-by: Shmuel Eiderman <shmuel.eiderman@oracle.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Fam Zheng <fam@euphon.net>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-04-30 15:29:00 +02:00
Zhengui li 126734c4f7 vpc: unlock Coroutine lock to make IO submit Concurrently
Concurrent IO becomes serial IO because of the qemu Coroutine lock,
which reduce IO performance severely.

So unlock Coroutine lock before bdrv_co_pwritev and
bdrv_co_preadv to fix it.

Signed-off-by: Zhengui li <lizhengui@huawei.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-04-30 15:29:00 +02:00
Markus Armbruster e1ce7d747b block/qapi: Clean up how we print to monitor or stdout
bdrv_snapshot_dump(), bdrv_image_info_specific_dump(),
bdrv_image_info_dump() and their helpers take an fprintf()-like
callback and a FILE * to pass to it.

hmp.c passes monitor_printf() cast to fprintf_function and the current
monitor cast to FILE *.

qemu-img.c and qemu-io-cmds.c pass fprintf and stdout.

The type-punning is technically undefined behaviour, but works in
practice.  Clean up: drop the callback, and call qemu_printf()
instead.

Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Message-Id: <20190417191805.28198-8-armbru@redhat.com>
2019-04-18 22:18:59 +02:00
Markus Armbruster 6b3048cee0 block/ssh: Do not report read/write/flush errors to the user
Callbacks ssh_co_readv(), ssh_co_writev(), ssh_co_flush() report
errors to the user with error_printf().  They shouldn't, it's their
caller's job.  Replace by a suitable trace point.  While there, drop
the unreachable !s->sftp case.

Perhaps we should convert this part of the block driver interface to
Error, so block drivers can pass more detail to their callers.  Not
today.

Cc: "Richard W.M. Jones" <rjones@redhat.com>
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: Max Reitz <mreitz@redhat.com>
Cc: qemu-block@nongnu.org
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20190417190641.26814-3-armbru@redhat.com>
2019-04-17 21:21:49 +02:00
Kevin Wolf 93e32b3e20 qcow2: Fix preallocation bdrv_pwrite to wrong file
With an external data file, preallocate_co() must write the final byte
to the external data file, not to the qcow2 image file.

This is harmless for preallocation of newly created images (only the
qcow2 file size is increased to the virtual disk size while it should be
much smaller), but with preallocated resize, it could in theory cause
visible corruption if the metadata of the image is larger than the data
(e.g. lots of bitmaps).

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
2019-04-16 16:23:24 +02:00
Alberto Garcia 20509c4b8b block: freeze the backing chain earlier in stream_start()
Commit 6585493369 added code to freeze
the backing chain from 'top' to 'base' for the duration of the
block-stream job.

The problem is that the freezing happens too late in stream_start():
during the bdrv_reopen_set_read_only() call earlier in that function
another job can jump in and remove the base image. If that happens we
have an invalid chain and QEMU crashes.

This patch puts the bdrv_freeze_backing_chain() call at the beginning
of the function.

Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-04-02 12:04:44 +02:00
Vladimir Sementsov-Ogievskiy 696aaaed57 block/file-posix: do not fail on unlock bytes
bdrv_replace_child() calls bdrv_check_perm() with error_abort on
loosening permissions. However file-locking operations may fail even
in this case, for example on NFS. And this leads to Qemu crash.

Let's avoid such errors. Note, that we ignore such things anyway on
permission update commit and abort.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-04-02 12:04:44 +02:00
Stefano Garzarella de23e72bb7 block/gluster: limit the transfer size to 512 MiB
Several versions of GlusterFS (3.12? -> 6.0.1) fail when the
transfer size is greater or equal to 1024 MiB, so we are
limiting the transfer size to 512 MiB to avoid this rare issue.

Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=1691320
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Niels de Vos <ndevos@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-04-02 12:04:44 +02:00
Eric Blake 75d34eb98c nbd/client: Trace server noncompliance on structured reads
Just as we recently added a trace for a server sending block status
that doesn't match the server's advertised minimum block alignment,
let's do the same for read chunks.  But since qemu 3.1 is such a
server (because it advertised 512-byte alignment, but when serving a
file that ends in data but is not sector-aligned, NBD_CMD_READ would
detect a mid-sector change between data and hole at EOF and the
resulting read chunks are unaligned), we don't want to change our
behavior of otherwise tolerating unaligned reads.

Note that even though we fixed the server for 4.0 to advertise an
actual block alignment (which gets rid of the unaligned reads at EOF
for posix files), we can still trigger it via other means:

$ qemu-nbd --image-opts driver=blkdebug,align=512,image.driver=file,image.filename=/path/to/non-aligned-file

Arguably, that is a bug in the blkdebug block status function, for
leaking a block status that is not aligned. It may also be possible to
observe issues with a backing layer with smaller alignment than the
active layer, although so far I have been unable to write a reliable
iotest for that scenario.

Signed-off-by: Eric Blake <eblake@redhat.com>
Message-Id: <20190330165349.32256-1-eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
2019-04-01 08:58:04 -05:00
Eric Blake 4841211e0d block: Add bdrv_get_request_alignment()
The next patch needs access to a device's minimum permitted
alignment, since NBD wants to advertise this to clients. Add
an accessor function, borrowing from blk_get_max_transfer()
for accessing a backend's block limits.

Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20190329042750.14704-6-eblake@redhat.com>
2019-04-01 08:46:52 -05:00
Eric Blake 9cf638508c nbd/client: Support qemu-img convert from unaligned size
If an NBD server advertises a size that is not a multiple of a sector,
the block layer rounds up that size, even though we set info.size to
the exact byte value sent by the server. The block layer then proceeds
to let us read or query block status on the hole that it added past
EOF, which the NBD server is unlikely to be happy with. Fortunately,
qemu as a server never advertizes an unaligned size, so we generally
don't run into this problem; but the nbdkit server makes it easy to
test:

$ printf %1000d 1 > f1
$ ~/nbdkit/nbdkit -fv file f1 & pid=$!
$ qemu-img convert -f raw nbd://localhost:10809 f2
$ kill $pid
$ qemu-img compare f1 f2

Pre-patch, the server attempts a 1024-byte read, which nbdkit
rightfully rejects as going beyond its advertised 1000 byte size; the
conversion fails and the output files differ (not even the first
sector is copied, because qemu-img does not follow ddrescue's habit of
trying smaller reads to get as much information as possible in spite
of errors). Post-patch, the client's attempts to read (and query block
status, for new enough nbdkit) are properly truncated to the server's
length, with sane handling of the hole the block layer forced on
us. Although f2 ends up as a larger file (1024 bytes instead of 1000),
qemu-img compare shows the two images to have identical contents for
display to the guest.

I didn't add iotests coverage since I didn't want to add a dependency
on nbdkit in iotests. I also did NOT patch write, trim, or write
zeroes - these commands continue to fail (usually with ENOSPC, but
whatever the server chose), because we really can't write to the end
of the file, and because 'qemu-img convert' is the most common case
where we care about being tolerant (which is read-only). Perhaps we
could truncate the request if the client is writing zeros to the tail,
but that seems like more work, especially if the block layer is fixed
in 4.1 to track byte-accurate sizing (in which case this patch would
be reverted as unnecessary).

Signed-off-by: Eric Blake <eblake@redhat.com>
Message-Id: <20190329042750.14704-5-eblake@redhat.com>
Tested-by: Richard W.M. Jones <rjones@redhat.com>
2019-04-01 08:32:44 -05:00
Eric Blake a62a85ef5c nbd/client: Report offsets in bdrv_block_status
It is desirable for 'qemu-img map' to have the same output for a file
whether it is served over file or nbd protocols. However, ever since
we implemented block status for NBD (2.12), the NBD protocol forgot to
inform the block layer that as the final layer in the chain, the
offset is valid; without an offset, the human-readable form of
qemu-img map gives up with the unhelpful:

$ nbdkit -U - data data="1" size=512 --run 'qemu-img map $nbd'
Offset          Length          Mapped to       File
qemu-img: File contains external, encrypted or compressed clusters.

The --output=json form always works, because it is reporting the
lower-level bdrv_block_status results directly rather than trying to
filter out sparse ranges for human consumption - but now it also
shows the offset member.

With this patch, the human output changes to:

Offset          Length          Mapped to       File
0               0x200           0               nbd+unix://?socket=/tmp/nbdkitOxeoLa/socket

This change is observable to several iotests.

Fixes: 78a33ab5
Reported-by: Richard W.M. Jones <rjones@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-Id: <20190329042750.14704-4-eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
2019-03-30 20:52:29 -05:00
Eric Blake 7da537f70d nbd/client: Lower min_block for block-status, unaligned size
We have a latent bug in our NBD client code, tickled by the brand new
nbdkit 1.11.10 block status support:

$ nbdkit --filter=log --filter=truncate -U - \
           data data="1" size=511 truncate=64K logfile=/dev/stdout \
           --run 'qemu-img convert $nbd /var/tmp/out'
...
qemu-img: block/io.c:2122: bdrv_co_block_status: Assertion `*pnum && QEMU_IS_ALIGNED(*pnum, align) && align > offset - aligned_offset' failed.

The culprit? Our implementation of .bdrv_co_block_status can return
unaligned block status for any server that operates with a lower
actual alignment than what we tell the block layer in
request_alignment, in violation of the block layer's constraints. To
date, we've been unable to trip the bug, because qemu as NBD server
always advertises block sizing (at which point it is a server bug if
the server sends unaligned status - although qemu 3.1 is such a server
and I've sent separate patches for 4.0 both to get the server to obey
the spec, and to let the client to tolerate server oddities at EOF).

But nbdkit does not (yet) advertise block sizing, and therefore is not
in violation of the spec for returning block status at whatever
boundaries it wants, and those unaligned results can occur anywhere
rather than just at EOF. While we are still wise to avoid sending
sub-sector read/write requests to a server of unknown origin, we MUST
consider that a server telling us block status without an advertised
block size is correct.  So, we either have to munge unaligned answers
from the server into aligned ones that we hand back to the block
layer, or we have to tell the block layer about a smaller alignment.

Similarly, if the server advertises an image size that is not
sector-aligned, we might as well assume that the server intends to let
us access those tail bytes, and therefore supports a minimum block
size of 1, regardless of whether the server supports block status
(although we still need more patches to fix the problem that with an
unaligned image, we can send read or block status requests that exceed
EOF to the server). Again, qemu as server cannot trip this problem
(because it rounds images to sector alignment), but nbdkit advertised
unaligned size even before it gained block status support.

Solve both alignment problems at once by using better heuristics on
what alignment to report to the block layer when the server did not
give us something to work with. Note that very few NBD servers
implement block status (to date, only qemu and nbdkit are known to do
so); and as the NBD spec mentioned block sizing constraints prior to
documenting block status, it can be assumed that any future
implementations of block status are aware that they must advertise
block size if they want a minimum size other than 1.

We've had a long history of struggles with picking the right alignment
to use in the block layer, as evidenced by the commit message of
fd8d372d (v2.12) that introduced the current choice of forced 512-byte
alignment.

There is no iotest coverage for this fix, because qemu can't provoke
it, and I didn't want to make test 241 dependent on nbdkit.

Fixes: fd8d372d
Reported-by: Richard W.M. Jones <rjones@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-Id: <20190329042750.14704-3-eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Tested-by: Richard W.M. Jones <rjones@redhat.com>
2019-03-30 20:52:19 -05:00
Eric Blake 737d3f5244 nbd-client: Work around server BLOCK_STATUS misalignment at EOF
The NBD spec is clear that a server that advertises a minimum block
size should reply to NBD_CMD_BLOCK_STATUS with extents aligned
accordingly. However, we know that the qemu NBD server implementation
has had a corner-case bug where it is not compliant with the spec,
present since the introduction of NBD_CMD_BLOCK_STATUS in qemu 2.12
(and unlikely to be patched in time for 4.0). Namely, when qemu is
serving a file that is not a multiple of 512 bytes, it rounds the size
advertised over NBD up to the next sector boundary (someday, I'd like
to fix that to be byte-accurate, but it's a much bigger audit not
appropriate for this release); yet if the final sector contains data
prior to EOF, lseek(SEEK_HOLE) will point to the implicit hole
mid-sector which qemu then reported over NBD.

We are well within our rights to hang up on a server that can't follow
the spec, but it is more useful to try and keep the connection alive
in spite of the problem. Do so by tracing a message about the problem,
and then either truncating the request back to an aligned boundary (if
it covered more than the final sector) or widening it out to the full
boundary with a forced status of data (since truncating would result
in 0 bytes, but we have to make progress, and valid since data is a
default-safe answer). And in practice, since the problem only happens
on a sector that starts with data and ends with a hole, we are going
to want to read that full sector anyway (where qemu as the server
fills in the tail beyond EOF with appropriate NUL bytes).

Easy reproduction:
$ printf %1000d 1 > file
$ qemu-nbd -f raw -t file & pid=$!
$ qemu-img map --output=json -f raw nbd://localhost:10809
qemu-img: Could not read file metadata: Invalid argument
$ kill $pid

where the patched version instead succeeds with:
[{ "start": 0, "length": 1024, "depth": 0, "zero": false, "data": true}]

Signed-off-by: Eric Blake <eblake@redhat.com>
Message-Id: <20190326171317.4036-1-eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
2019-03-30 10:06:08 -05:00
Eric Blake ebd82cd872 nbd: Permit simple error to NBD_CMD_BLOCK_STATUS
The NBD spec is clear that when structured replies are active, a
simple error reply is acceptable to any command except for
NBD_CMD_READ.  However, we were mistakenly requiring structured errors
for NBD_CMD_BLOCK_STATUS, and hanging up on a server that gave a
simple error (since qemu does not behave as such a server, we didn't
notice the problem until now).  Broken since its introduction in
commit 78a33ab5 (v2.12).

Noticed while debugging a separate failure reported by nbdkit while
working out its initial implementation of BLOCK_STATUS, although it
turns out that nbdkit also chose to send structured error replies for
BLOCK_STATUS, so I had to manually provoke the situation by hacking
qemu's server to send a simple error reply:

| diff --git i/nbd/server.c w/nbd/server.c
| index fd013a2817a..833288d7c45 100644
| 00--- i/nbd/server.c
| +++ w/nbd/server.c
| @@ -2269,6 +2269,8 @@ static coroutine_fn int nbd_handle_request(NBDClient *client,
|                                        "discard failed", errp);
|
|      case NBD_CMD_BLOCK_STATUS:
| +        return nbd_co_send_simple_reply(client, request->handle, ENOMEM,
| +                                        NULL, 0, errp);
|          if (!request->len) {
|              return nbd_send_generic_reply(client, request->handle, -EINVAL,
|                                            "need non-zero length", errp);
|

Signed-off-by: Eric Blake <eblake@redhat.com>
Acked-by: Richard W.M. Jones <rjones@redhat.com>
Message-Id: <20190325190104.30213-3-eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
2019-03-30 10:06:08 -05:00
Eric Blake b29f3a3d2a nbd: Don't lose server's error to NBD_CMD_BLOCK_STATUS
When the server replies with a (structured [*]) error to
NBD_CMD_BLOCK_STATUS, without any extent information sent first, the
client code was blindly throwing away the server's error code and
instead telling the caller that EIO occurred.  This has been broken
since its introduction in 78a33ab5 (v2.12, where we should have called:
   error_setg(&local_err, "Server did not reply with any status extents");
   nbd_iter_error(&iter, false, -EIO, &local_err);
to declare the situation as a non-fatal error if no earlier error had
already been flagged, rather than just blindly slamming iter.err and
iter.ret), although it is more noticeable since commit 7f86068d, which
actually tries hard to preserve the server's code thanks to a separate
iter.request_ret.

[*] The spec is clear that the server is also permitted to reply with
a simple error, but that's a separate fix.

I was able to provoke this scenario with a hack to the server, then
seeing whether ENOMEM makes it back to the caller:

| diff --git a/nbd/server.c b/nbd/server.c
| index fd013a2817a..29c7995de02 100644
| --- a/nbd/server.c
| +++ b/nbd/server.c
| @@ -2269,6 +2269,8 @@ static coroutine_fn int nbd_handle_request(NBDClient *client,
|                                        "discard failed", errp);
|
|      case NBD_CMD_BLOCK_STATUS:
| +        return nbd_send_generic_reply(client, request->handle, -ENOMEM,
| +                                      "no status for you today", errp);
|          if (!request->len) {
|              return nbd_send_generic_reply(client, request->handle, -EINVAL,
|                                            "need non-zero length", errp);
| --

Signed-off-by: Eric Blake <eblake@redhat.com>
Message-Id: <20190325190104.30213-2-eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
2019-03-30 10:06:08 -05:00
Eric Blake a39286dd61 nbd: Tolerate some server non-compliance in NBD_CMD_BLOCK_STATUS
The NBD spec states that NBD_CMD_FLAG_REQ_ONE (which we currently
always use) should not reply with an extent larger than our request,
and that the server's response should be exactly one extent. Right
now, that means that if a server sends more than one extent, we treat
the server as broken, fail the block status request, and disconnect,
which prevents all further use of the block device. But while good
software should be strict in what it sends, it should be tolerant in
what it receives.

While trying to implement NBD_CMD_BLOCK_STATUS in nbdkit, we
temporarily had a non-compliant server sending too many extents in
spite of REQ_ONE. Oddly enough, 'qemu-img convert' with qemu 3.1
failed with a somewhat useful message:
  qemu-img: Protocol error: invalid payload for NBD_REPLY_TYPE_BLOCK_STATUS

which then disappeared with commit d8b4bad8, on the grounds that an
error message flagged only at the time of coroutine teardown is
pointless, and instead we should rely on the actual failed API to
report an error - in other words, the 3.1 behavior was masking the
fact that qemu-img was not reporting an error. That has since been
fixed in the previous patch, where qemu-img convert now fails with:
  qemu-img: error while reading block status of sector 0: Invalid argument

But even that is harsh.  Since we already partially relaxed things in
commit acfd8f7a to tolerate a server that exceeds the cap (although
that change was made prior to the NBD spec actually putting a cap on
the extent length during REQ_ONE - in fact, the NBD spec change was
BECAUSE of the qemu behavior prior to that commit), it's not that much
harder to argue that we should also tolerate a server that sends too
many extents.  But at the same time, it's nice to trace when we are
being tolerant of server non-compliance, in order to help server
writers fix their implementations to be more portable (if they refer
to our traces, rather than just stderr).

Reported-by: Richard W.M. Jones <rjones@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-Id: <20190323212639.579-3-eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
2019-03-30 10:06:08 -05:00
Kevin Wolf 738301e117 file-posix: Support BDRV_REQ_NO_FALLBACK for zero writes
We know that the kernel implements a slow fallback code path for
BLKZEROOUT, so if BDRV_REQ_NO_FALLBACK is given, we shouldn't call it.
The other operations we call in the context of .bdrv_co_pwrite_zeroes
should usually be quick, so no modification should be needed for them.
If we ever notice that there are additional problematic cases, we can
still make these conditional as well.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Acked-by: Eric Blake <eblake@redhat.com>
2019-03-26 11:37:51 +01:00
Kevin Wolf 80f5c33ff3 block: Advertise BDRV_REQ_NO_FALLBACK in filter drivers
Filter drivers that support .bdrv_co_pwrite_zeroes can safely advertise
BDRV_REQ_NO_FALLBACK because they just forward the request flags to
their child node.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Acked-by: Eric Blake <eblake@redhat.com>
2019-03-26 11:37:51 +01:00
Kevin Wolf fe0480d629 block: Add BDRV_REQ_NO_FALLBACK
For qemu-img convert, we want an operation that zeroes out the whole
image if this can be done efficiently, but that returns an error
otherwise so we don't write explicit zeroes and immediately overwrite
them with the real data, potentially doubling the amount of data to be
written.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Acked-by: Eric Blake <eblake@redhat.com>
2019-03-26 11:37:51 +01:00
Kevin Wolf 48ce986096 block: Remove error messages in bdrv_make_zero()
There is only a single caller of bdrv_make_zero(), which is qemu-img
convert. If the function fails, we just fall back to a different method
of zeroing out blocks on the target image. There is no good reason to
print error messages on stderr when the higher level operation will
actually succeed.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Acked-by: Eric Blake <eblake@redhat.com>
2019-03-26 11:37:51 +01:00
Markus Armbruster a9779a3ab0 trace-events: Delete unused trace points
Tracked down with cleanup-trace-events.pl.  Funnies requiring manual
post-processing:

* block.c and blockdev.c trace points are in block/trace-events.

* hw/block/nvme.c uses the preprocessor to hide its trace point use
  from cleanup-trace-events.pl.

* include/hw/xen/xen_common.h trace points are in hw/xen/trace-events.

* net/colo-compare and net/filter-rewriter.c use pseudo trace points
  colo_compare_udp_miscompare and colo_filter_rewriter_debug to guard
  debug code.

Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-id: 20190314180929.27722-5-armbru@redhat.com
Message-Id: <20190314180929.27722-5-armbru@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2019-03-22 16:18:07 +00:00
Markus Armbruster 500016e5db trace-events: Shorten file names in comments
We spell out sub/dir/ in sub/dir/trace-events' comments pointing to
source files.  That's because when trace-events got split up, the
comments were moved verbatim.

Delete the sub/dir/ part from these comments.  Gets rid of several
misspellings.

Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-id: 20190314180929.27722-3-armbru@redhat.com
Message-Id: <20190314180929.27722-3-armbru@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2019-03-22 16:18:07 +00:00
Alberto Garcia 782b9d06bf block: Make bdrv_{copy_on_read,crypto_luks,replication} static
Signed-off-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-03-19 15:49:29 +01:00
Sam Eiderman b69864e5a8 vmdk: Support version=3 in VMDK descriptor files
Commit 509d39aa22 added support for read
only VMDKs of version 3.

This commit fixes the probe function to correctly handle descriptors of
version 3.

This commit has two effects:
    1. We no longer need to supply '-f vmdk' when pointing to descriptor
       files of version 3 in qemu/qemu-img command line arguments.
    2. This fixes the scenario where a VMDK points to a parent version 3
       descriptor file which is being probed as "raw" instead of "vmdk".

Reviewed-by: Arbel Moshe <arbel.moshe@oracle.com>
Reviewed-by: Mark Kanda <mark.kanda@oracle.com>
Signed-off-by: Shmuel Eiderman <shmuel.eiderman@oracle.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-03-19 15:49:29 +01:00
Kevin Wolf a0cf83639c qcow2: Fix data file error condition in qcow2_co_create()
We were trying to check whether bdrv_open_blockdev_ref() returned
success, but accidentally checked the wrong variable. Spotted by
Coverity (CID 1399703).

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
2019-03-19 15:49:29 +01:00
Sergio Lopez 5e771752a1 mirror: Confirm we're quiesced only if the job is paused or cancelled
While child_job_drained_begin() calls to job_pause(), the job doesn't
actually transition between states until it runs again and reaches a
pause point. This means bdrv_drained_begin() may return with some jobs
using the node still having 'busy == true'.

As a consequence, block_job_detach_aio_context() may get into a
deadlock, waiting for the job to be actually paused, while the coroutine
servicing the job is yielding and doesn't get the opportunity to get
scheduled again. This situation can be reproduced by issuing a
'block-commit' immediately followed by a 'device_del'.

To ensure bdrv_drained_begin() only returns when the jobs have been
paused, we change mirror_drained_poll() to only confirm it's quiesced
when job->paused == true and there aren't any in-flight requests, except
if we reached that point by a drained section initiated by the
mirror/commit job itself.

The other block jobs shouldn't need any changes, as the default
drained_poll() behavior is to only confirm it's quiesced if the job is
not busy or completed.

Signed-off-by: Sergio Lopez <slp@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-03-19 15:49:29 +01:00
Peter Maydell dbbc277510 Pull request
* Add 'drop-cache=on|off' option to file-posix.c.  The default is on.
    Disabling the option fixes a QEMU 3.0.0 performance regression when live
    migrating on the same host with cache.direct=off.
 -----BEGIN PGP SIGNATURE-----
 
 iQEcBAABAgAGBQJciOSEAAoJEJykq7OBq3PIVSUIAI6r2Mgoi+no4nle8Jf2nZ+W
 EnQXnNEFyJA0lKRtqQ2UILD9udVdKd/L1PZu5k/Il/Ralto9Yf3+62brekI7rsss
 c3Qusu4LUK6jom2RslRjRIaJ9GilQi/jWezKV/O0VlcsMVemgVHX008EIR+ea1U4
 H0/u2kfu04PciKQ5MR2+6aacu9bfmyH1yM2no+aMN5dDu/38PV6JEsf0Zl2agowg
 opGepJ7YiDQsxH9IBXrbfm38mBrrY0K2vFzAb9BzTHfBPotGMNIZNJNM2FChRfoM
 sTjOIpZz3NDwPEUPQPZxp+7YKRFFYfse1oHtpyh4n1rMQksB019SCGlP9TBhrF0=
 =CH5G
 -----END PGP SIGNATURE-----

Merge remote-tracking branch 'remotes/stefanha/tags/block-pull-request' into staging

Pull request

 * Add 'drop-cache=on|off' option to file-posix.c.  The default is on.
   Disabling the option fixes a QEMU 3.0.0 performance regression when live
   migrating on the same host with cache.direct=off.

# gpg: Signature made Wed 13 Mar 2019 11:07:48 GMT
# gpg:                using RSA key 9CA4ABB381AB73C8
# gpg: Good signature from "Stefan Hajnoczi <stefanha@redhat.com>" [full]
# gpg:                 aka "Stefan Hajnoczi <stefanha@gmail.com>" [full]
# Primary key fingerprint: 8695 A8BF D3F9 7CDA AC35  775A 9CA4 ABB3 81AB 73C8

* remotes/stefanha/tags/block-pull-request:
  file-posix: add drop-cache=on|off option

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2019-03-14 09:34:51 +00:00
Peter Maydell 523a2a42c3 Pull request
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE+ber27ys35W+dsvQfe+BBqr8OQ4FAlyIFSwACgkQfe+BBqr8
 OQ7wghAAm16eCEr57oTO7QXR3y8uVFsKqXBn9cNH6nbrFp2PUQSglwMDKBls1Z5m
 olF23X/JaqSlSmkL9BBuzDZ6Up+kkHKuxPq4/5RKXfiDI0pr3R0eqts0COAlaN9q
 Bew3ipj99m8gzMi2093AW4+Ob0N3658fuDTGLe1M1Uoy7CEg1QJ7rVOBBEui7vIl
 RbZ8l/Zmb4ldNpB3lnE4Nh9ue8fy0RAj3Nai161nCnNeXNF/VzD3Ye8bojSBbnux
 PIMX6/RWmykX4feIf9QP8apDpxX4HkyuPq5EdwT9PD8PwdyXPAXZtsYUNCuNtQuk
 n5VKFVgFYgqUclBeVHmrMYPU4K4iCFQp4/Fua7wzPEC0iG05NiiDv91oVkEJCp3L
 ManHeuGfNLCcXaIntKZhuJl1cK8yMM3yDww6/pPTehrPjcyvKa0NOqhQBExektcD
 R6q7maJRzFaxSxdcs+Zzuog9zESvH1mlJxQCKzeYhAP0kkxInyTELE/Vbx37xuqR
 RFfZYyVQ6x87Q/sxHx4EMiV97WUM8elZOQdSEC/okt5WUUNpgIu0WF9nSQ1VKZ8C
 CZmv5xh9ogfwvB/kOm6IVwNkLvVagJQcLwddORI5LLXLbSIUcuwVSuyMp/7iDtQ/
 hnHkGs2mIJ2JUYbSSNsSJNs6oTurn8eTFCeGoYKJgd9l4QxaThU=
 =ekU+
 -----END PGP SIGNATURE-----

Merge remote-tracking branch 'remotes/jnsnow/tags/bitmaps-pull-request' into staging

Pull request

# gpg: Signature made Tue 12 Mar 2019 20:23:08 GMT
# gpg:                using RSA key F9B7ABDBBCACDF95BE76CBD07DEF8106AAFC390E
# gpg: Good signature from "John Snow (John Huston) <jsnow@redhat.com>" [full]
# Primary key fingerprint: FAEB 9711 A12C F475 812F  18F2 88A9 064D 1835 61EB
#      Subkey fingerprint: F9B7 ABDB BCAC DF95 BE76  CBD0 7DEF 8106 AAFC 390E

* remotes/jnsnow/tags/bitmaps-pull-request: (22 commits)
  tests/qemu-iotests: add bitmap resize test 246
  block/qcow2-bitmap: Allow resizes with persistent bitmaps
  block/qcow2-bitmap: Don't check size for IN_USE bitmap
  docs/interop/qcow2: Improve bitmap flag in_use specification
  bitmaps: Fix typo in function name
  block/dirty-bitmaps: implement inconsistent bit
  block/dirty-bitmaps: disallow busy bitmaps as merge source
  block/dirty-bitmaps: prohibit removing readonly bitmaps
  block/dirty-bitmaps: prohibit readonly bitmaps for backups
  block/dirty-bitmaps: add block_dirty_bitmap_check function
  block/dirty-bitmap: add inconsistent status
  block/dirty-bitmaps: add inconsistent bit
  iotests: add busy/recording bit test to 124
  blockdev: remove unused paio parameter documentation
  block/dirty-bitmaps: move comment block
  block/dirty-bitmaps: unify qmp_locked and user_locked calls
  block/dirty-bitmap: explicitly lock bitmaps with successors
  nbd: change error checking order for bitmaps
  block/dirty-bitmap: change semantics of enabled predicate
  block/dirty-bitmap: remove set/reset assertions against enabled bit
  ...

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>

# Conflicts:
#	tests/qemu-iotests/group
2019-03-13 17:30:34 +00:00
Stefan Hajnoczi f357fcd890 file-posix: add drop-cache=on|off option
Commit dd577a26ff ("block/file-posix:
implement bdrv_co_invalidate_cache() on Linux") introduced page cache
invalidation so that cache.direct=off live migration is safe on Linux.

The invalidation takes a significant amount of time when the file is
large and present in the page cache.  Normally this is not the case for
cross-host live migration but it can happen when migrating between QEMU
processes on the same host.

On same-host migration we don't need to invalidate pages for correctness
anyway, so an option to skip page cache invalidation is useful.  I
investigated optimizing invalidation and detecting same-host migration,
but both are hard to achieve so a user-visible option will suffice.

As a bonus this option means that the cache invalidation feature will
now be detectable by libvirt via QMP schema introspection.

Suggested-by: Neil Skrypuch <neil@tembosocial.com>
Tested-by: Neil Skrypuch <neil@tembosocial.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 20190307164941.3322-1-stefanha@redhat.com
Message-Id: <20190307164941.3322-1-stefanha@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2019-03-13 10:54:55 +00:00
Alberto Garcia 5019aece2a block: Remove the AioContext parameter from bdrv_reopen_multiple()
This parameter has been unused since 1a63a90750

Signed-off-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-03-12 20:30:14 +01:00
Alberto Garcia 8a2ce0bc1e block: Add a 'mutable_opts' field to BlockDriver
If we reopen a BlockDriverState and there is an option that is present
in bs->options but missing from the new set of options then we have to
return an error unless the driver is able to reset it to its default
value.

This patch adds a new 'mutable_opts' field to BlockDriver. This is
a list of runtime options that can be modified during reopen. If an
option in this list is unspecified on reopen then it must be reset (or
return an error).

Signed-off-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-03-12 20:30:14 +01:00
Alberto Garcia 077e8e2018 block: Add 'keep_old_opts' parameter to bdrv_reopen_queue()
The bdrv_reopen_queue() function is used to create a queue with
the BDSs that are going to be reopened and their new options. Once
the queue is ready bdrv_reopen_multiple() is called to perform the
operation.

The original options from each one of the BDSs are kept, with the new
options passed to bdrv_reopen_queue() applied on top of them.

For "x-blockdev-reopen" we want a function that behaves much like
"blockdev-add". We want to ignore the previous set of options so that
only the ones actually specified by the user are applied, with the
rest having their default values.

One of the things that we need is a way to tell bdrv_reopen_queue()
whether we want to keep the old set of options or not, and that's what
this patch does. All current callers are setting this new parameter to
true and x-blockdev-reopen will set it to false.

Signed-off-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-03-12 20:30:14 +01:00
Alberto Garcia 6585493369 block: Freeze the backing chain for the duration of the stream job
Signed-off-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-03-12 20:30:14 +01:00
Alberto Garcia ef53dc09ed block: Freeze the backing chain for the duration of the mirror job
Signed-off-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-03-12 20:30:14 +01:00
Alberto Garcia df827336ab block: Freeze the backing chain for the duration of the commit job
Signed-off-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-03-12 20:30:14 +01:00
Kevin Wolf 23dece19da file-posix: Make auto-read-only dynamic
Until now, with auto-read-only=on we tried to open the file read-write
first and if that failed, read-only was tried. This is actually not good
enough for libvirt, which gives QEMU SELinux permissions for read-write
only as soon as it actually intends to write to the image. So we need to
be able to switch between read-only and read-write at runtime.

This patch makes auto-read-only dynamic, i.e. the file is opened
read-only as long as no user of the node has requested write
permissions, but it is automatically reopened read-write as soon as the
first writer is attached. Conversely, if the last writer goes away, the
file is reopened read-only again.

bs->read_only is no longer set for auto-read-only=on files even if the
file descriptor is opened read-only because it will be transparently
upgraded as soon as a writer is attached. This changes the output of
qemu-iotests 232.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-03-12 20:30:14 +01:00
Kevin Wolf 6ceabe6f77 file-posix: Prepare permission code for fd switching
In order to be able to dynamically reopen the file read-only or
read-write, depending on the users that are attached, we need to be able
to switch to a different file descriptor during the permission change.

This interacts with reopen, which also creates a new file descriptor and
performs permission changes internally. In this case, the permission
change code must reuse the reopen file descriptor instead of creating a
third one.

In turn, reopen can drop its code to copy file locks to the new file
descriptor because that is now done when applying the new permissions.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-03-12 20:30:14 +01:00
Kevin Wolf a6aeca0ca5 file-posix: Lock new fd in raw_reopen_prepare()
There is no reason why we can take locks on the new file descriptor only
in raw_reopen_commit() where error handling isn't possible any more.
Instead, we can already do this in raw_reopen_prepare().

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-03-12 20:30:14 +01:00
Kevin Wolf e0c9cf3a48 file-posix: Store BDRVRawState.reopen_state during reopen
We'll want to access the file descriptor in the reopen_state while
processing permission changes in the context of the repoen.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-03-12 20:30:14 +01:00
Kevin Wolf 5cec287025 file-posix: Factor out raw_reconfigure_getfd()
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-03-12 20:30:14 +01:00
Vladimir Sementsov-Ogievskiy cb8aac3783 qapi: drop x- from x-block-latency-histogram-set
Drop x- and x_ prefixes for latency histograms and update version to
4.0

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-03-12 20:30:08 +01:00
John Snow d19c6b36ff block/qcow2-bitmap: Allow resizes with persistent bitmaps
Since we now load all bitmaps into memory anyway, we can just truncate
them in-memory and then flush them back to disk. Just in case, we will
still check and enforce that this shortcut is valid -- i.e. that any
bitmap described on-disk is indeed in-memory and can be modified.

If there are any inconsistent bitmaps, we refuse to allow the truncate
as we do not actually load these bitmaps into memory, and it isn't safe
or reasonable to attempt to truncate corrupted data.

Signed-off-by: John Snow <jsnow@redhat.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-id: 20190311185147.52309-4-vsementsov@virtuozzo.com
  [vsementsov: drop bitmap flushing, fix block comments style]
Signed-off-by: John Snow <jsnow@redhat.com>
2019-03-12 14:57:38 -04:00
Vladimir Sementsov-Ogievskiy bf5f0cf5d8 block/qcow2-bitmap: Don't check size for IN_USE bitmap
We are going to allow image resize when there are persistent bitmaps.
It may lead to appearing of inconsistent bitmaps (IN_USE=1) with
inconsistent size. But we still want to load them as inconsistent.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-id: 20190311185147.52309-3-vsementsov@virtuozzo.com
Signed-off-by: John Snow <jsnow@redhat.com>
2019-03-12 14:50:28 -04:00
Eric Blake 796a3798ab bitmaps: Fix typo in function name
Commit a88b179f introduced the ability to set and query bitmap
persistence, but with an atypical spelling.

Signed-off-by: Eric Blake <eblake@redhat.com>
Message-id: 20190308205845.25734-1-eblake@redhat.com
Signed-off-by: John Snow <jsnow@redhat.com>
2019-03-12 12:05:49 -04:00
John Snow 74da6b9435 block/dirty-bitmaps: implement inconsistent bit
Set the inconsistent bit on load instead of rejecting such bitmaps.
There is no way to un-set it; the only option is to delete the bitmap.

Obvervations:
- bitmap loading does not need to update the header for in_use bitmaps.
- inconsistent bitmaps don't need to have their data loaded; they're
  glorified corruption sentinels.
- bitmap saving does not need to save inconsistent bitmaps back to disk.
- bitmap reopening DOES need to drop the readonly flag from inconsistent
  bitmaps to allow reopening of qcow2 files with non-qemu-owned bitmaps
  being eventually flushed back to disk.

Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-id: 20190301191545.8728-8-jsnow@redhat.com
Signed-off-by: John Snow <jsnow@redhat.com>
2019-03-12 12:05:49 -04:00
John Snow cb8e58e3de block/dirty-bitmaps: disallow busy bitmaps as merge source
We didn't do any state checking on source bitmaps at all,
so this adds inconsistent and busy checks. readonly is
allowed, so you can still copy a readonly bitmap to a new
destination to use it for operations like drive-backup.

Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-id: 20190301191545.8728-7-jsnow@redhat.com
Signed-off-by: John Snow <jsnow@redhat.com>
2019-03-12 12:05:49 -04:00
John Snow 3ae96d6684 block/dirty-bitmaps: add block_dirty_bitmap_check function
Instead of checking against busy, inconsistent, or read only directly,
use a check function with permissions bits that let us streamline the
checks without reproducing them in many places.

Included in this patch are permissions changes that simply add the
inconsistent check to existing permissions call spots, without
addressing existing bugs.

In general, this means that busy+readonly checks become BDRV_BITMAP_DEFAULT,
which checks against all three conditions. busy-only checks become
BDRV_BITMAP_ALLOW_RO.

Notably, remove allows inconsistent bitmaps, so it doesn't follow the pattern.

Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-id: 20190301191545.8728-4-jsnow@redhat.com
Signed-off-by: John Snow <jsnow@redhat.com>
2019-03-12 12:05:49 -04:00
John Snow 0064cfefa4 block/dirty-bitmap: add inconsistent status
Even though the status field is deprecated, we still have to support
it for a few more releases. Since this is a very new kind of bitmap
state, it makes sense for it to have its own status field.

Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-id: 20190301191545.8728-3-jsnow@redhat.com
Signed-off-by: John Snow <jsnow@redhat.com>
2019-03-12 12:05:49 -04:00
John Snow b0f455599d block/dirty-bitmaps: add inconsistent bit
Add an inconsistent bit to dirty-bitmaps that allows us to report a bitmap as
persistent but potentially inconsistent, i.e. if we find bitmaps on a qcow2
that have been marked as "in use".

Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-id: 20190301191545.8728-2-jsnow@redhat.com
Signed-off-by: John Snow <jsnow@redhat.com>
2019-03-12 12:05:49 -04:00
John Snow 1e6fddcd6f block/dirty-bitmaps: move comment block
Simply move the big status enum comment block to above the status
function, and document it as being deprecated. The whole confusing
block can get deleted in three releases time.

Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-id: 20190223000614.13894-9-jsnow@redhat.com
Signed-off-by: John Snow <jsnow@redhat.com>
2019-03-12 12:05:48 -04:00
John Snow 27a1b301a4 block/dirty-bitmaps: unify qmp_locked and user_locked calls
These mean the same thing now. Unify them and rename the merged call
bdrv_dirty_bitmap_busy to indicate semantically what we are describing,
as well as help disambiguate from the various _locked and _unlocked
versions of bitmap helpers that refer to mutex locks.

Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-id: 20190223000614.13894-8-jsnow@redhat.com
Signed-off-by: John Snow <jsnow@redhat.com>
2019-03-12 12:05:48 -04:00
John Snow 21d2376f26 block/dirty-bitmap: explicitly lock bitmaps with successors
Instead of implying a user_locked/busy status, make it explicit.
Now, bitmaps in use by migration, NBD or backup operations
are all treated the same way with the same code paths.

Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-id: 20190223000614.13894-7-jsnow@redhat.com
Signed-off-by: John Snow <jsnow@redhat.com>
2019-03-12 12:05:48 -04:00
John Snow 8b2e20f64f block/dirty-bitmap: change semantics of enabled predicate
Currently, the enabled predicate means something like:
"the QAPI status of the bitmap is ACTIVE."
After this patch, it should mean exclusively:
"This bitmap is recording guest writes, and is allowed to do so."

In many places, this is how this predicate was already used.
Internal usages of the bitmap QPI can call user_locked to find out if
the bitmap is in use by an operation.

To accommodate this, modify the create_successor routine to now
explicitly disable the parent bitmap at creation time.

Justifications:

1. bdrv_dirty_bitmap_status suffers no change from the lack of
   1:1 parity with the new predicates because of the order in which
   the predicates are checked. This is now only for compatibility.

2. bdrv_set_dirty() is unchanged: pre-patch, it was skipping bitmaps that were
   disabled or had a successor, while post-patch it is only skipping bitmaps
   that are disabled. To accommodate this, create_successor now ensures that
   any bitmap with a successor is explicitly disabled.

3. qcow2_store_persistent_dirty_bitmaps: No functional change. This function
   cares only about the literal enabled bit, and makes no effort to check if
   the bitmap is in-use or not. After this patch there are still no ways to
   produce an enabled bitmap with a successor.

4. block_dirty_bitmap_enable_prepare
   block_dirty_bitmap_disable_prepare
   init_dirty_bitmap_migration
   nbd_export_new

   These functions care about the literal enabled bit,
   and already check user_locked separately.

Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-id: 20190223000614.13894-5-jsnow@redhat.com
Signed-off-by: John Snow <jsnow@redhat.com>
2019-03-12 12:05:48 -04:00
John Snow c28ddbb07e block/dirty-bitmap: remove set/reset assertions against enabled bit
bdrv_set_dirty_bitmap and bdrv_reset_dirty_bitmap are only used as an
internal API by the mirror and migration areas of our code. These
calls modify the bitmap, but do so at the behest of QEMU and not the
guest.

Presently, these bitmaps are always "enabled" anyway, but there's no
reason they have to be.

Modify these internal APIs to drop this assertion.

Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-id: 20190223000614.13894-4-jsnow@redhat.com
Signed-off-by: John Snow <jsnow@redhat.com>
2019-03-12 12:05:48 -04:00
John Snow 50a47257f8 block/dirty-bitmaps: rename frozen predicate helper
"Frozen" was a good description a long time ago, but it isn't adequate now.
Rename the frozen predicate to has_successor to make the semantics of the
predicate more clear to outside callers.

In the process, remove some calls to frozen() that no longer semantically
make sense. For bdrv_enable_dirty_bitmap_locked and
bdrv_disable_dirty_bitmap_locked, it doesn't make sense to prohibit QEMU
internals from performing this action when we only wished to prohibit QMP
users from issuing these commands. All of the QMP API commands for bitmap
manipulation already check against user_locked() to prohibit these actions.

Several other assertions really want to check that the bitmap isn't in-use
by another operation -- use the bitmap_user_locked function for this instead,
which presently also checks for has_successor. This leaves some redundant
checks of has_successor through different helpers that are addressed in
forthcoming patches.

Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-id: 20190223000614.13894-3-jsnow@redhat.com
Signed-off-by: John Snow <jsnow@redhat.com>
2019-03-12 12:05:48 -04:00
John Snow 4db6ceb0b5 block/dirty-bitmap: add recording and busy properties
The current API allows us to report a single status, which we've defined as:

Frozen: has a successor, treated as qmp_locked, may or may not be enabled.
Locked: no successor, qmp_locked. may or may not be enabled.
Disabled: Not frozen or locked, disabled.
Active: Not frozen, locked, or disabled.

The problem is that both "Frozen" and "Locked" mean nearly the same thing,
and that both of them do not intuit whether they are recording guest writes
or not.

This patch deprecates that status field and introduces two orthogonal
properties instead to replace it.

Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-id: 20190223000614.13894-2-jsnow@redhat.com
Signed-off-by: John Snow <jsnow@redhat.com>
2019-03-12 12:05:48 -04:00
Niels de Vos 0e3b891fef gluster: the glfs_io_cbk callback function pointer adds pre/post stat args
The glfs_*_async() functions do a callback once finished. This callback
has changed its arguments, pre- and post-stat structures have been
added. This makes it possible to improve caching, which is useful for
Samba and NFS-Ganesha, but not so much for QEMU. Gluster 6 is the first
release that includes these new arguments.

With an additional detection in ./configure, the new arguments can
conditionally get included in the glfs_io_cbk handler.

Signed-off-by: Niels de Vos <ndevos@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-03-12 14:26:49 +01:00
Prasanna Kumar Kalever e014dbe74e gluster: Handle changed glfs_ftruncate signature
New versions of Glusters libgfapi.so have an updated glfs_ftruncate()
function that returns additional 'struct stat' structures to enable
advanced caching of attributes. This is useful for file servers, not so
much for QEMU. Nevertheless, the API has changed and needs to be
adopted.

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
Signed-off-by: Niels de Vos <ndevos@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-03-12 14:26:49 +01:00
Philippe Mathieu-Daudé d4cef0c67c block/iscsi: Restrict Linux-specific code
Some Linux specific code is missing guards, leading to
build failure on OSX:

  $ sudo brew install libiscsi
  $ ./configure && make
  [...]
    CC      block/iscsi.o
  qemu/block/iscsi.c:338:24: error: 'iscsi_aiocb_info' defined but not used [-Werror=unused-const-variable=]
   static const AIOCBInfo iscsi_aiocb_info = {
                          ^~~~~~~~~~~~~~~~
  qemu/block/iscsi.c:168:1: error: 'iscsi_schedule_bh' defined but not used [-Werror=unused-function]
   iscsi_schedule_bh(IscsiAIOCB *acb)
   ^~~~~~~~~~~~~~~~~
  cc1: all warnings being treated as errors

Add guards to restrict this code for Linux.

Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20190220000553.28438-1-philmd@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-03-11 16:33:49 +01:00
Kevin Wolf 6c3944dc62 qcow2: Implement data-file-raw create option
Provide an option to force QEMU to always keep the external data file
consistent as a standalone read-only raw image.

At the moment, this means making sure that write_zeroes requests are
forwarded to the data file instead of just updating the metadata, and
checking that no backing file is used.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-03-08 12:26:46 +01:00
Kevin Wolf 9b890bdcb6 qcow2: Store data file name in the image
Rather than requiring that the external data file node is passed
explicitly when creating the qcow2 node, store the filename in the
designated header extension during .bdrv_create and read it from there
as a default during .bdrv_open.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-03-08 12:26:46 +01:00
Kevin Wolf dcc98687f8 qcow2: Creating images with external data file
This adds a .bdrv_create option to use an external data file.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-03-08 12:26:46 +01:00
Kevin Wolf 0e8c08be27 qcow2: Add basic data-file infrastructure
This adds a .bdrv_open option to specify the external data file node.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-03-08 12:26:46 +01:00
Kevin Wolf e9f5b6deaa qcow2: Support external data file in qemu-img check
For external data files, data clusters must be excluded from the
refcount calculations. Instead, an implicit refcount of 1 is assumed for
the COPIED flag.

Compressed clusters and internal snapshots are incompatible with
external data files, so print an error if they are in use for images
with an external data file.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-03-08 12:26:46 +01:00
Kevin Wolf aa8b34c1b2 qcow2: Return error for snapshot operation with data file
Internal snapshots and an external data file are incompatible because
snapshots require refcounting and non-linear mapping. Return an error
for all of the snapshot operations if an external data file is in use.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-03-08 12:26:46 +01:00
Kevin Wolf 966b000f49 qcow2: External file I/O
This changes the qcow2 implementation to direct all guest data I/O to
s->data_file rather than bs->file, while metadata I/O still uses
bs->file. At the moment, this is still always the same, but soon we'll
add options to set s->data_file to an external data file.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-03-08 12:26:46 +01:00
Kevin Wolf 37be14036b qcow2: Prepare qcow2_co_block_status() for data file
Offset 0 cannot be assumed to mean an unallocated cluster any more.
Instead, the cluster type needs to be checked.

*file must refer to the data file instead of the image file if a valid
offset is returned from qcow2_co_block_status().

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-03-08 12:26:45 +01:00
Kevin Wolf 77e023ff79 qcow2: Return 0/-errno in qcow2_alloc_compressed_cluster_offset()
qcow2_alloc_compressed_cluster_offset() used to return the cluster
offset for success and 0 for error. This doesn't only conflict with 0 as
a valid host offset, but also loses the error code.

Similar to the change made to qcow2_alloc_cluster_offset() for
uncompressed clusters in commit 148da7ea9d, make the function return
0/-errno and return the allocated cluster offset in a by-reference
parameter.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-03-08 12:26:45 +01:00
Kevin Wolf c6d619cc12 qcow2: Don't assume 0 is an invalid cluster offset
The cluster allocation code uses 0 as an invalid offset that is used in
case of errors or as "offset not yet determined". With external data
files, a host cluster offset of 0 becomes valid, though.

Define a constant INV_OFFSET (which is not cluster aligned and will
therefore never be a valid offset) that can be used for such purposes.

This removes the additional host_offset == 0 check that commit
ff52aab2df introduced; the confusion between an invalid offset and
(erroneous) allocation at offset 0 is removed with this change.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-03-08 12:26:45 +01:00
Kevin Wolf b8c8353a38 qcow2: Prepare count_contiguous_clusters() for external data file
Offset 0 can be valid for normal (allocated) clusters now, so use
qcow2_get_cluster_type() instead.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-03-08 12:26:45 +01:00
Kevin Wolf a4ea184d8a qcow2: Prepare qcow2_get_cluster_type() for external data file
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-03-08 12:26:45 +01:00
Kevin Wolf 808c2bb4c4 qcow2: Pass bs to qcow2_get_cluster_type()
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-03-08 12:26:45 +01:00
Kevin Wolf 93c2493646 qcow2: Basic definitions for external data files
This adds basic constants, struct fields and helper function for
external data file support to the implementation.

QCOW2_INCOMPAT_MASK and QCOW2_AUTOCLEAR_MASK are not updated yet so that
opening images with an external data file still fails (we don't handle
them correctly yet).

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-03-08 12:26:45 +01:00
Kevin Wolf c5e86ebc11 qcow2: Simplify preallocation code
Image creation already involves a bdrv_co_truncate() call, which allows
to specify a preallocation mode. Just pass the right mode there and
remove the code that is made redundant by this.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-03-08 12:26:45 +01:00
Alberto Garcia af39bd0d9a qcow2: Default to 4KB for the qcow2 cache entry size
QEMU 2.12 (commit 1221fe6f63) introduced
a new setting called l2-cache-entry-size that allows making entries on
the qcow2 L2 cache smaller than the cluster size.

I have been performing several tests with different cluster and entry
sizes and all of them show that reducing the entry size (aka L2 slice)
consistently improves I/O performance, notably during random I/O (all
tests done with sequential I/O show similar results). This is to be
expected because loading and evicting an L2 slice is more expensive
the larger the slice is.

Here are some numbers on fully populated 40GB qcow2 images. The
rightmost column represents the maximum L2 cache size in both cases.

   Cluster size = 64 KB
   |-------------+--------------+--------------+--------------|
   |             | 1MB L2 cache | 3MB L2 cache | 5MB L2 cache |
   |-------------+--------------+--------------+--------------|
   |  4KB slices |    6545 IOPS |   12045 IOPS |   55680 IOPS |
   | 16KB slices |    5177 IOPS |    9798 IOPS |   56278 IOPS |
   | 64KB slices |    2718 IOPS |    5326 IOPS |   57355 IOPS |
   |-------------+--------------+--------------+--------------|

   Cluster size = 256 KB
   |--------------+----------------+--------------+-----------------|
   |              | 512KB L2 cache | 1MB L2 cache | 1280KB L2 cache |
   |--------------+----------------+--------------+-----------------|
   |   4KB slices |      8539 IOPS |   21071 IOPS |      55417 IOPS |
   |  64KB slices |      3598 IOPS |    9772 IOPS |      57687 IOPS |
   | 256KB slices |      1415 IOPS |    4120 IOPS |      58001 IOPS |
   |--------------+----------------+--------------+-----------------|

As can be seen in the numbers, the only exception to the rule is when
the cache is large enough to hold all L2 tables. This is also to be
expected because in this case no cache entry is ever evicted so
reducing its size doesn't bring any benefit.

This patch sets the default L2 cache entry size to 4KB except when the
cache is large enough for the whole disk.

Signed-off-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-03-08 12:26:45 +01:00
Peter Maydell adf2e451f3 Block layer patches:
- Block graph change fixes (avoid loops, cope with non-tree graphs)
 - bdrv_set_aio_context() related fixes
 - HMP snapshot commands: Use only tag, not the ID to identify snapshots
 - qmeu-img, commit: Error path fixes
 - block/nvme: Build fix for gcc 9
 - MAINTAINERS updates
 - Fix various issues with bdrv_refresh_filename()
 - Fix various iotests
 - Include LUKS overhead in qemu-img measure for qcow2
 - A fix for vmdk's image creation interface
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJcc/knAAoJEH8JsnLIjy/WptQP/3F8Lh52H4egXaP7NUUuDjQM
 AhqhuDAp/EZBS+xim9kLTogNJADe/rMWdSX/YB5aLpSPYbjasC66NgaLhd6QewgQ
 VIcsLUdlYAyZ5ZjJytimfMTLwm1X02RmVIe55y52DTY8LlfViZzOlf3qwqPm00ao
 EJB2cl8UJLM+PVEu59cCw3R0/06LY+WIJRB32d3tnCBRTkaJwfR9h4lrp/juVcFZ
 U+2eWU68KMbUHSYiWANowN+KRV3uPY4HVA98v3F0vDmcBxlVHOeBg6S+PcT7tK8p
 huzCMwcdwUyPMJgVs/+WBtUnbG0jN6SHUYmFLz859UMVgBnCw5tzBMf8qw1wOA4A
 Iw+zor27Pxj4IlxcLPp5f97YZ8k9acdMR2VKPH6xLJZ1JF+sKa54RfzESd5EJeIj
 Mfcp773H0lIaWcFJ6RY1F0L1E1ta7QigwNBiWMdYfh0a0EWHnDvGyYeaSPYEQ+rl
 e8bZOcfrYwVI7DTDiZOIkGA9D8DXEPDNp+sl6s1DxeY69D0NNaXTtCPqFNNAbFbd
 20uD7yDRZlWq32cQB/K9D5cSkZRSOzdUpLfLU3nQU2+dz11x6OpM6m7DVboSrztD
 1HtPPDzDEvH5dOP7ibd60s+ntjkSiNfNkUgnuVrBE/d/PocC1eHHpZt5V7f43Ofb
 RxVwH5+smzQ9nsNBfQR0
 =gaah
 -----END PGP SIGNATURE-----

Merge remote-tracking branch 'remotes/kevin/tags/for-upstream' into staging

Block layer patches:

- Block graph change fixes (avoid loops, cope with non-tree graphs)
- bdrv_set_aio_context() related fixes
- HMP snapshot commands: Use only tag, not the ID to identify snapshots
- qmeu-img, commit: Error path fixes
- block/nvme: Build fix for gcc 9
- MAINTAINERS updates
- Fix various issues with bdrv_refresh_filename()
- Fix various iotests
- Include LUKS overhead in qemu-img measure for qcow2
- A fix for vmdk's image creation interface

# gpg: Signature made Mon 25 Feb 2019 14:18:15 GMT
# gpg:                using RSA key 7F09B272C88F2FD6
# gpg: Good signature from "Kevin Wolf <kwolf@redhat.com>" [full]
# Primary key fingerprint: DC3D EB15 9A9A F95D 3D74  56FE 7F09 B272 C88F 2FD6

* remotes/kevin/tags/for-upstream: (71 commits)
  iotests: Skip 211 on insufficient memory
  vmdk: false positive of compat6 with hwversion not set
  iotests: add LUKS payload overhead to 178 qemu-img measure test
  qcow2: include LUKS payload overhead in qemu-img measure
  iotests.py: s/_/-/g on keys in qmp_log()
  iotests: Let 045 be run concurrently
  iotests: Filter SSH paths
  iotests.py: Filter filename in any string value
  iotests.py: Add is_str()
  iotests: Fix 207 to use QMP filters for qmp_log
  iotests: Fix 232 for LUKS
  iotests: Remove superfluous rm from 232
  iotests: Fix 237 for Python 2.x
  iotests: Re-add filename filters
  iotests: Test json:{} filenames of internal BDSs
  block: BDS options may lack the "driver" option
  block/null: Generate filename even with latency-ns
  block/curl: Implement bdrv_refresh_filename()
  block/curl: Harmonize option defaults
  block/nvme: Fix bdrv_refresh_filename()
  ...

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2019-02-26 19:04:47 +00:00
yuchenlin 26c9296c31 vmdk: false positive of compat6 with hwversion not set
In vmdk_co_create_opts, when it finds hw_version is undefined, it will
set it to 4, which misleading the compat6 and hwversion in
vmdk_co_do_create. Simply set hw_version to NULL after free, let
the logic in vmdk_co_do_create to decide the value of hw_version.

This bug can be reproduced by:

$ qemu-img convert -O vmdk -o subformat=streamOptimized,compat6
/home/yuchenlin/syno.qcow2 /home/yuchenlin/syno.vmdk

qemu-img: /home/yuchenlin/syno.vmdk: error while converting vmdk:
compat6 cannot be enabled with hwversion set

Signed-off-by: yuchenlin <yuchenlin@synology.com>
Message-id: 20190221110805.28239-1-yuchenlin@synology.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-02-25 15:11:28 +01:00
Stefan Hajnoczi 61914f8906 qcow2: include LUKS payload overhead in qemu-img measure
LUKS encryption reserves clusters for its own payload data.  The size of
this area must be included in the qemu-img measure calculation so that
we arrive at the correct minimum required image size.

(Ab)use the qcrypto_block_create() API to determine the payload
overhead.  We discard the payload data that qcrypto thinks will be
written to the image.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-id: 20190218104525.23674-2-stefanha@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-02-25 15:11:28 +01:00
Max Reitz 1e47cb7f52 block/null: Generate filename even with latency-ns
While we cannot represent the latency-ns option in a filename, it is not
a strong option so not being able to should not stop us from generating
a filename nonetheless.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Message-id: 20190201192935.18394-30-mreitz@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-02-25 15:11:27 +01:00
Max Reitz 937c007b6e block/curl: Implement bdrv_refresh_filename()
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Message-id: 20190201192935.18394-29-mreitz@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-02-25 15:11:27 +01:00
Max Reitz 712b64e8f3 block/curl: Harmonize option defaults
Both of the defaults we currently have in the curl driver are named
based on a slightly different schema, let's unify that and call both
CURL_BLOCK_OPT_${NAME}_DEFAULT.

While at it, we can add a macro for the third option for which a default
exists, namely "sslverify".

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Message-id: 20190201192935.18394-28-mreitz@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-02-25 15:11:27 +01:00
Max Reitz cc61b0740f block/nvme: Fix bdrv_refresh_filename()
Currently, nvme's bdrv_refresh_filename() is an exact copy of null's
implementation.  However, for null, "null-co://" and "null-aio://" are
indeed valid filenames -- for nvme, they are not, as a device address is
still required.

The correct implementation should generate a filename of the form
"nvme://[PCI address]/[namespace]" (as the comment above
nvme_parse_filename() describes).

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Message-id: 20190201192935.18394-27-mreitz@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-02-25 15:11:27 +01:00
Max Reitz 998b3a1e5a block: Purify .bdrv_refresh_filename()
Currently, BlockDriver.bdrv_refresh_filename() is supposed to both
refresh the filename (BDS.exact_filename) and set BDS.full_open_options.
Now that we have generic code in the central bdrv_refresh_filename() for
creating BDS.full_open_options, we can drop the latter part from all
BlockDriver.bdrv_refresh_filename() implementations.

This also means that we can drop all of the existing default code for
this from the global bdrv_refresh_filename() itself.

Furthermore, we now have to call BlockDriver.bdrv_refresh_filename()
after having set BDS.full_open_options, because the block driver's
implementation should now be allowed to depend on BDS.full_open_options
being set correctly.

Finally, with this patch we can drop the @options parameter from
BlockDriver.bdrv_refresh_filename(); also, add a comment on this
function's purpose in block/block_int.h while touching its interface.

This completely obsoletes blklogwrite's implementation of
.bdrv_refresh_filename().

Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-id: 20190201192935.18394-25-mreitz@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-02-25 15:11:27 +01:00
Max Reitz abc521a9aa block: Add BlockDriver.bdrv_gather_child_options
Some follow-up patches will rework the way bs->full_open_options is
refreshed in bdrv_refresh_filename(). The new implementation will remove
the need for the block drivers' bdrv_refresh_filename() implementations
to set bs->full_open_options; instead, it will be generic and use static
information from each block driver.

However, by implementing bdrv_gather_child_options(), block drivers will
still be able to override the way the full_open_options of their
children are incorporated into their own.

We need to implement this function for VMDK because we have to prevent
the generic implementation from gathering the options of all children:
It is not possible to specify options for the extents through the
runtime options.

For quorum, the child names that would be used by the generic
implementation and the ones that we actually (currently) want to use
differ. See quorum_gather_child_options() for more information.

Note that both of these are cases which are not ideal: In case of VMDK
it would probably be nice to be able to specify options for all extents.
In case of quorum, the current runtime option structure is simply broken
and needs to be fixed (but that is left for another patch).

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Message-id: 20190201192935.18394-23-mreitz@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-02-25 15:11:27 +01:00
Max Reitz 2654267cc1 block: Add strong_runtime_opts to BlockDriver
This new field can be set by block drivers to list the runtime options
they accept that may influence the contents of the respective BDS. As of
a follow-up patch, this list will be used by the common
bdrv_refresh_filename() implementation to decide which options to put
into BDS.full_open_options (and consequently whether a JSON filename has
to be created), thus freeing the drivers of having to implement that
logic themselves.

Additionally, this patch adds the field to all of the block drivers that
need it and sets it accordingly.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Message-id: 20190201192935.18394-22-mreitz@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-02-25 15:11:27 +01:00
Max Reitz 0dcbc54a95 block/nfs: Implement bdrv_dirname()
While the basic idea is obvious and could be handled by the default
bdrv_dirname() implementation, we cannot generate a directory name if
the gid or uid are set, so we have to explicitly return NULL in those
cases.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Message-id: 20190201192935.18394-19-mreitz@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-02-25 15:11:26 +01:00
Max Reitz 8a6239c071 block/nbd: Make bdrv_dirname() return NULL
The generic bdrv_dirname() implementation would be able to generate some
form of directory name for many NBD nodes, but it would be always wrong.
Therefore, we have to explicitly make it an error (until NBD has some
form of specification for export paths, if it ever will).

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-id: 20190201192935.18394-18-mreitz@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-02-25 15:11:26 +01:00
Max Reitz f3037bd254 quorum: Make bdrv_dirname() return NULL
While the common implementation for bdrv_dirname() should return NULL
for quorum BDSs already (because they do not have a file node and their
exact_filename field should be empty), there is no reason not to make
that explicit.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Message-id: 20190201192935.18394-17-mreitz@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-02-25 15:11:26 +01:00
Max Reitz 27953572a5 blkverify: Make bdrv_dirname() return NULL
blkverify's BDSs have a file BDS, but we do not want this to be
preferred over the raw node. There is no way to decide between the two
(and not really a reason to, either), so just return NULL in blkverify's
implementation of bdrv_dirname().

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Message-id: 20190201192935.18394-16-mreitz@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-02-25 15:11:26 +01:00
Max Reitz 6b6833c1b4 block: bdrv_get_full_backing_filename's ret. val.
Make bdrv_get_full_backing_filename() return an allocated string instead
of placing the result in a caller-provided buffer.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Message-id: 20190201192935.18394-12-mreitz@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-02-25 15:11:26 +01:00
Max Reitz 645ae7d88e block: bdrv_get_full_backing_filename_from_...'s ret. val.
Make bdrv_get_full_backing_filename_from_filename() return an allocated
string instead of placing the result in a caller-provided buffer.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-id: 20190201192935.18394-11-mreitz@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-02-25 15:11:26 +01:00
Max Reitz 009b03aaa2 block: Make path_combine() return the path
Besides being safe for arbitrary path lengths, after some follow-up
patches all callers will want a freshly allocated buffer anyway.

In the meantime, path_combine_deprecated() is added which has the same
interface as path_combine() had before this patch. All callers to that
function will be converted in follow-up patches.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-id: 20190201192935.18394-10-mreitz@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-02-25 15:11:26 +01:00
Max Reitz 998c201923 block: Add BDS.auto_backing_file
If the backing file is overridden, this most probably does change the
guest-visible data of a BDS.  Therefore, we will need to consider this
in bdrv_refresh_filename().

To see whether it has been overridden, we might want to compare
bs->backing_file and bs->backing->bs->filename.  However,
bs->backing_file is changed by bdrv_set_backing_hd() (which is just used
to change the backing child at runtime, without modifying the image
header), so bs->backing_file most of the time simply contains a copy of
bs->backing->bs->filename anyway, so it is useless for such a
comparison.

This patch adds an auto_backing_file BDS field which contains the
backing file path as indicated by the image header, which is not changed
by bdrv_set_backing_hd().

Because of bdrv_refresh_filename() magic, however, a BDS's filename may
differ from what has been specified during bdrv_open().  Then, the
comparison between bs->auto_backing_file and bs->backing->bs->filename
may fail even though bs->backing was opened from bs->auto_backing_file.
To mitigate this, we can copy the real BDS's filename (after the whole
bdrv_open() and bdrv_refresh_filename() process) into
bs->auto_backing_file, if we know the former has been opened based on
the latter.  This is only possible if no options modifying the backing
file's behavior have been specified, though.  To simplify things, this
patch only copies the filename from the backing file if no options have
been specified for it at all.

Furthermore, there are cases where an overlay is created by qemu which
already contains a BDS's filename (e.g. in blockdev-snapshot-sync).  We
do not need to worry about updating the overlay's bs->auto_backing_file
there, because we actually wrote a post-bdrv_refresh_filename() filename
into the image header.

So all in all, there will be false negatives where (as of a future
patch) bdrv_refresh_filename() will assume that the backing file differs
from what was specified in the image header, even though it really does
not.  However, these cases should be limited to where (1) the user
actually did override something in the backing chain (e.g. by specifying
options for the backing file), or (2) the user executed a QMP command to
change some node's backing file (e.g. change-backing-file or
block-commit with @backing-file given) where the given filename does not
happen to coincide with qemu's idea of the backing BDS's filename.

Then again, (1) really is limited to -drive.  With -blockdev or
blockdev-add, you have to adhere to the schema, so a user cannot give
partial "unimportant" options (e.g. by just setting backing.node-name
and leaving the rest to the image header).  Therefore, trying to fix
this would mean trying to fix something for -drive only.

To improve on (2), we would need a full infrastructure to "canonicalize"
an arbitrary filename (+ options), so it can be compared against
another.  That seems a bit over the top, considering that filenames
nowadays are there mostly for the user's entertainment.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Message-id: 20190201192935.18394-5-mreitz@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-02-25 15:11:25 +01:00
Max Reitz e24518e303 block: Use children list in bdrv_refresh_filename
bdrv_refresh_filename() should invoke itself recursively on all
children, not just on file.

With that change, we can remove the manual invocations in blkverify,
quorum, commit, mirror, and blklogwrites.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Message-id: 20190201192935.18394-3-mreitz@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-02-25 15:11:25 +01:00
Max Reitz f30c66ba6e block: Use bdrv_refresh_filename() to pull
Before this patch, bdrv_refresh_filename() is used in a pushing manner:
Whenever the BDS graph is modified, the parents of the modified edges
are supposed to be updated (recursively upwards).  However, that is
nonviable, considering that we want child changes not to concern
parents.

Also, in the long run we want a pull model anyway: Here, we would have a
bdrv_filename() function which returns a BDS's filename, freshly
constructed.

This patch is an intermediate step.  It adds bdrv_refresh_filename()
calls before every place a BDS.filename value is used.  The only
exceptions are protocol drivers that use their own filename, which
clearly would not profit from refreshing that filename before.

Also, bdrv_get_encrypted_filename() is removed along the way (as a user
of BDS.filename), since it is completely unused.

In turn, all of the calls to bdrv_refresh_filename() before this patch
are removed, because we no longer have to call this function on graph
changes.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-id: 20190201192935.18394-2-mreitz@redhat.com
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-02-25 15:11:25 +01:00
Thomas Huth 83c68e149a block/nvme: Remove QEMU_PACKED from naturally aligned NVMeRegs struct
The QEMU_PACKED is causing a compiler warning/error with GCC 9:

  CC      block/nvme.o
block/nvme.c: In function ‘nvme_create_queue_pair’:
block/nvme.c:209:22: error: taking address of packed member of
 ‘struct <anonymous>’ may result in an unaligned pointer value
 [-Werror=address-of-packed-member]
  209 |     q->sq.doorbell = &s->regs->doorbells[idx * 2 * s->doorbell_scale];

All members of the struct are naturally aligned, so there should
not be the need for QEMU_PACKED here, and the following QEMU_BUILD_BUG_ON
also ensures that there is no padding. Thus simply remove the QEMU_PACKED
here.

Buglink: https://bugs.launchpad.net/qemu/+bug/1817525
Reported-by: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
Signed-off-by: Thomas Huth <thuth@redhat.com>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-02-25 15:09:48 +01:00
Alberto Garcia c1c4399084 qcow2: Assert that L2 table offsets fit in the L1 table
L1 table entries have a field to store the offset of an L2 table.
The rest of the bits of the entry are currently reserved except from
bit 63, which stores the COPIED flag.

The offset is always taken from the entry using L1E_OFFSET_MASK to
ensure that we only use the bits that belong to that field.

While that mask is used every time we read from the L1 table, it is
never used when we write to it. Due to the limits set elsewhere in the
code QEMU can never produce L2 table offsets that don't fit in that
field so any such offset when allocating an L2 table would indicate a
bug in QEMU.

Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-02-25 15:05:23 +01:00
Kevin Wolf 28e0b2d2e1 nbd: Increase bs->in_flight during AioContext switch
bdrv_drain() must not leave connection_co scheduled, so bs->in_flight
needs to be increased while the coroutine is waiting to be scheduled
in the new AioContext after nbd_client_attach_aio_context().

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-02-25 15:03:19 +01:00
Kevin Wolf d3bd5b9089 nbd: Use low-level QIOChannel API in nbd_read_eof()
Instead of using the convenience wrapper qio_channel_read_all_eof(), use
the lower level QIOChannel API. This means duplicating some code, but
we'll need this because this coroutine yield is special: We want it to
be interruptible so that nbd_client_attach_aio_context() can correctly
reenter the coroutine.

This moves the bdrv_dec/inc_in_flight() pair into nbd_read_eof(), so
that connection_co will always sit in this exact qio_channel_yield()
call when bdrv_drain() returns.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
2019-02-25 15:03:19 +01:00
Kevin Wolf 5ad81b4946 nbd: Restrict connection_co reentrance
nbd_client_attach_aio_context() schedules connection_co in the new
AioContext and this way reenters it in any arbitrary place that has
yielded. We can restrict this a bit to the function call where the
coroutine actually sits waiting when it's idle.

This doesn't solve any bug yet, but it shows where in the code we need
to support this random reentrance and where we don't have to care.

Add FIXME comments for the existing bugs that the rest of this series
will fix.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
2019-02-25 15:03:19 +01:00
Kevin Wolf c90e2a9cfd block-backend: Make blk_inc/dec_in_flight public
For some users of BlockBackends, just increasing the in_flight counter
is easier than implementing separate handlers in BlockDevOps. Make the
helper functions for this public.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-02-25 15:03:19 +01:00
Alberto Garcia 2468eed3be commit: Replace commit_top_bs on failure after deleting the block job
If there's an error in commit_start() then the block job must be
deleted before replacing commit_top_bs, otherwise it will fail because
of lack of permissions. This happens since the permission system was
introduced in 8dfba27977.

Fortunately this bug doesn't seem to be possible to reproduce at the
moment without changing the code.

Signed-off-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-02-25 15:03:19 +01:00
Daniel Henrique Barboza 161e612d20 qcow2-snapshot: remove redundant find_snapshot_by_id_and_name call
In qcow2_snapshot_create there is the following code block:

    /* Generate an ID */
    find_new_snapshot_id(bs, sn_info->id_str, sizeof(sn_info->id_str));

    /* Check that the ID is unique */
    if (find_snapshot_by_id_and_name(bs, sn_info->id_str, NULL) >= 0) {
        return -EEXIST;
    }

find_new_snapshot_id cycles through all snapshots, getting the id_str
as an unsigned long int, calculating the max id_max value of all the
existing id_strs and writing in the id_str pointer id_max + 1:

    for(i = 0; i < s->nb_snapshots; i++) {
        sn = s->snapshots + i;
        id = strtoul(sn->id_str, NULL, 10);
        if (id > id_max)
            id_max = id;
    }
    snprintf(id_str, id_str_size, "%lu", id_max + 1);

Here, sn_info->id_str will have the unique value id_max + 1. Right
after that, find_snapshot_by_id_and_name is called with
id = sn_info->id_str and name = NULL. This will cause the function
to execute the following:

    } else if (id) {
        for (i = 0; i < s->nb_snapshots; i++) {
            if (!strcmp(s->snapshots[i].id_str, id)) {
                return i;
            }
        }
    }

In short, we're searching the existing snapshots to see if sn_info->id_str
matches any existing id, right after we set in the previous line a
sn_info->id_str value that is already unique.

The first code block goes way back to commit 585f8587ad, a 2006 commit from
Fabrice Bellard that simply says "new qcow2 disk image format". No more
info is provided about this logic in any subsequent commits that moved
this code block around.

I can't say about the original design, but the current logic is redundant.
bdrv_snapshot_create is called in aio_context lock, forbidding any
concurrent call to accidentally create a new snapshot between
the find_new_snapshot_id and find_snapshot_by_id_and_name calls. What
we're ending up doing is to cycle through the snapshots two times
for no viable reason.

This patch eliminates the redundancy by removing the 'id is unique'
check that calls find_snapshot_by_id_and_name.

Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-02-25 15:03:19 +01:00
Daniel Henrique Barboza 8c04093c8c block/snapshot: remove bdrv_snapshot_delete_by_id_or_name
After the previous patch, the only instance of this function left
is inside qemu-img.c.

qemu-img is using it inside the 'img_snapshot' function to delete
snapshots in the SNAPSHOT_DELETE case, based on a "snapshot_name"
string that refers to the tag, not ID, of the QEMUSnapshotInfo struct.
This can be verified by checking the SNAPSHOT_CREATE case that
comes shortly before SNAPSHOT_DELETE. In that case, the same
"snapshot_name" variable is being strcpy to the 'name' field
of the QEMUSnapshotInfo struct sn:

pstrcpy(sn.name, sizeof(sn.name), snapshot_name);

Based on that, it is unlikely that "snapshot_name" might contain
an "id" in SNAPSHOT_DELETE.

This patch changes SNAPSHOT_DELETE to use snapshot_find() and
snapshot_delete() instead of bdrv_snapshot_delete_by_id_or_name.
After that, there is no instances left of bdrv_snapshot_delete_by_id_or_name
in the code, so it is safe to remove it entirely.

Suggested-by: Murilo Opsfelder Araujo <muriloo@linux.ibm.com>
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-02-25 15:03:18 +01:00
Daniel Henrique Barboza 6ca080453e block/snapshot.c: eliminate use of ID input in snapshot operations
At this moment, QEMU attempts to create/load/delete snapshots
by using either an ID (id_str) or a name. The problem is that the code
isn't consistent of whether the entered argument is an ID or a name,
causing unexpected behaviors.

For example, when creating snapshots via savevm <arg>, what happens is that
"arg" is treated as both name and id_str. In a guest without snapshots, create
a single snapshot via savevm:

(qemu) savevm 0
(qemu) info snapshots
List of snapshots present on all disks:
ID        TAG                 VM SIZE                DATE       VM CLOCK
--        0                      741M 2018-07-31 13:39:56   00:41:25.313

A snapshot with name "0" is created. ID is hidden from the user, but the
ID is a non-zero integer that starts at "1". Thus, this snapshot has
id_str=1, TAG="0". Creating a second snapshot with arg = 1, the first one
is deleted:

(qemu) savevm 1
(qemu) info snapshots
List of snapshots present on all disks:
ID        TAG                 VM SIZE                DATE       VM CLOCK
--        1                      741M 2018-07-31 13:42:14   00:41:55.252

What happened?

- when creating the second snapshot, a verification is done inside
bdrv_all_delete_snapshot to delete any existing snapshots that matches an
string argument. Here, the code calls bdrv_all_delete_snapshot("1", ...);

- bdrv_all_delete_snapshot calls bdrv_snapshot_find(..., "1") for each
BlockDriverState of the guest. And this is where things goes tilting:
bdrv_snapshot_find does a search by both id_str and name. It finds
out that there is a snapshot that has id_str = 1, stores a reference
to the snapshot in the sn_info pointer and then returns match found;

- since a match was found, a call to bdrv_snapshot_delete_by_id_or_name() is
made. This function ignores the pointer written by bdrv_snapshot_find. Instead,
it deletes the snapshot using bdrv_snapshot_delete() calling it first with
id_str = 1. If it fails to delete, then it calls it again with name = 1.

- after all that, QEMU creates the new snapshot, that has id_str = 1 and
name = 1. The user is left wondering that happened with the first snapshot
created. Similar bugs can be triggered when using loadvm and delvm.

Before contemplating discarding the use of ID input in these operations,
I've searched the code of what would be the implications. My findings
are:

- the RBD and Sheepdog drivers don't care. Both uses the 'name' field as
key in their logic, making id_str = name when appropriate.
replay-snapshot.c does not make any special use of id_str;

- qcow2 uses id_str as an unique identifier but it is automatically
calculated, not being influenced by user input. Other than that, there are
no distinguish operations made only with id_str;

- in blockdev.c, the delete operation uses a match of both id_str AND
name. Given that id_str is either a copy of 'name' or auto-generated,
we're fine here.

This gives motivation to not consider ID as a valid user input in HMP
commands - sticking with 'name' input only is more consistent. To
accomplish that, the following changes were made in this patch:

- bdrv_snapshot_find() does not match for id_str anymore, only 'name'. The
function is called in save_snapshot(), load_snapshot(), bdrv_all_delete_snapshot()
and bdrv_all_find_snapshot(). This change makes the search function more
predictable and does not change the behavior of any underlying code that uses
these affected functions, which are related to HMP (which is fine) and the
main loop inside vl.c (which doesn't care about it anyways);

- bdrv_all_delete_snapshot() does not call bdrv_snapshot_delete_by_id_or_name
anymore. Instead, it uses the pointer returned by bdrv_snapshot_find to
erase the snapshot with the exact match of id_str an name. This function
is called in save_snapshot and hmp_delvm, thus this change  produces the
intended effect;

- documentation changes to reflect the new behavior. I consider this to
be an API fix instead of an API change - the user was already creating
snapshots using 'name', but now he/she will also enjoy a consistent
behavior.

Ideally we would get rid of the id_str field entirely, but this would have
repercussions on existing snapshots. Another day perhaps.

Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Acked-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-02-25 15:03:18 +01:00
Vladimir Sementsov-Ogievskiy 199d95b043 block/vmdk: use qemu_iovec_init_buf
Use new qemu_iovec_init_buf() instead of
qemu_iovec_init_external( ... , 1), which simplifies the code.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 20190218140926.333779-12-vsementsov@virtuozzo.com
Message-Id: <20190218140926.333779-12-vsementsov@virtuozzo.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2019-02-22 09:42:13 +00:00
Vladimir Sementsov-Ogievskiy 342544f98b block/qed: use qemu_iovec_init_buf
Use new qemu_iovec_init_buf() instead of
qemu_iovec_init_external( ... , 1), which simplifies the code.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 20190218140926.333779-11-vsementsov@virtuozzo.com
Message-Id: <20190218140926.333779-11-vsementsov@virtuozzo.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2019-02-22 09:42:13 +00:00
Vladimir Sementsov-Ogievskiy c793d4ff20 block/qcow2: use qemu_iovec_init_buf
Use new qemu_iovec_init_buf() instead of
qemu_iovec_init_external( ... , 1), which simplifies the code.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 20190218140926.333779-10-vsementsov@virtuozzo.com
Message-Id: <20190218140926.333779-10-vsementsov@virtuozzo.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2019-02-22 09:42:13 +00:00
Vladimir Sementsov-Ogievskiy 30d780f8fe block/qcow: use qemu_iovec_init_buf
Use new qemu_iovec_init_buf() instead of
qemu_iovec_init_external( ... , 1), which simplifies the code.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 20190218140926.333779-9-vsementsov@virtuozzo.com
Message-Id: <20190218140926.333779-9-vsementsov@virtuozzo.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2019-02-22 09:42:13 +00:00
Vladimir Sementsov-Ogievskiy b2589d66b4 block/parallels: use QEMU_IOVEC_INIT_BUF
Use new QEMU_IOVEC_INIT_BUF() instead of
qemu_iovec_init_external( ... , 1), which simplifies the code.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 20190218140926.333779-8-vsementsov@virtuozzo.com
Message-Id: <20190218140926.333779-8-vsementsov@virtuozzo.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2019-02-22 09:42:13 +00:00
Vladimir Sementsov-Ogievskiy 8040446d30 block/stream: use QEMU_IOVEC_INIT_BUF
Use new QEMU_IOVEC_INIT_BUF() instead of
qemu_iovec_init_external( ... , 1), which simplifies the code.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 20190218140926.333779-7-vsementsov@virtuozzo.com
Message-Id: <20190218140926.333779-7-vsementsov@virtuozzo.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2019-02-22 09:42:13 +00:00
Vladimir Sementsov-Ogievskiy ee7a883ace block/commit: use QEMU_IOVEC_INIT_BUF
Use new QEMU_IOVEC_INIT_BUF() instead of
qemu_iovec_init_external( ... , 1), which simplifies the code.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 20190218140926.333779-6-vsementsov@virtuozzo.com
Message-Id: <20190218140926.333779-6-vsementsov@virtuozzo.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2019-02-22 09:42:13 +00:00
Vladimir Sementsov-Ogievskiy 31e0a6e1af block/backup: use qemu_iovec_init_buf
Use new qemu_iovec_init_buf() instead of
qemu_iovec_init_external( ... , 1), which simplifies the code.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 20190218140926.333779-5-vsementsov@virtuozzo.com
Message-Id: <20190218140926.333779-5-vsementsov@virtuozzo.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2019-02-22 09:42:13 +00:00
Vladimir Sementsov-Ogievskiy ae5a9592ce block/block-backend: use QEMU_IOVEC_INIT_BUF
Use new QEMU_IOVEC_INIT_BUF() instead of
qemu_iovec_init_external( ... , 1), which simplifies the code.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 20190218140926.333779-4-vsementsov@virtuozzo.com
Message-Id: <20190218140926.333779-4-vsementsov@virtuozzo.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2019-02-22 09:42:13 +00:00
Vladimir Sementsov-Ogievskiy 0d93ed0845 block/io: use qemu_iovec_init_buf
Use new qemu_iovec_init_buf() instead of
qemu_iovec_init_external( ... , 1), which simplifies the code.

While being here, use qemu_try_blockalign0 as well.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 20190218140926.333779-3-vsementsov@virtuozzo.com
Message-Id: <20190218140926.333779-3-vsementsov@virtuozzo.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2019-02-22 09:42:13 +00:00
John Snow 73ab5d601c block/dirty-bitmap: Documentation and Comment fixups
The meaning of the states has changed subtly over time,
this should bring the understanding more in-line with the
current, actual usages.

Reported-by: Eric Blake <eblake@redhat.com>
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-id: 20190202011048.12343-1-jsnow@redhat.com
Signed-off-by: John Snow <jsnow@redhat.com>
2019-02-19 17:49:43 -05:00
Eric Blake f67cf661f8 dirty-bitmap: Expose persistent flag to 'query-block'
Since qemu currently doesn't flush persistent bitmaps to disk until
shutdown (which might be MUCH later), it's useful if 'query-block'
at least shows WHICH bitmaps will (eventually) make it to persistent
storage.  Update affected iotests.

Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Message-id: 20190204210512.27458-1-eblake@redhat.com
Signed-off-by: John Snow <jsnow@redhat.com>
2019-02-19 17:49:43 -05:00
Andrey Shinkevich b8968c875f qcow2: Add list of bitmaps to ImageInfoSpecificQCow2
In the 'Format specific information' section of the 'qemu-img info'
command output, the supplemental information about existing QCOW2
bitmaps will be shown, such as a bitmap name, flags and granularity:

image: /vz/vmprivate/VM1/harddisk.hdd
file format: qcow2
virtual size: 64G (68719476736 bytes)
disk size: 3.0M
cluster_size: 1048576
Format specific information:
    compat: 1.1
    lazy refcounts: true
    bitmaps:
        [0]:
            flags:
                [0]: in-use
                [1]: auto
            name: back-up1
            granularity: 65536
        [1]:
            flags:
                [0]: in-use
                [1]: auto
            name: back-up2
            granularity: 65536
    refcount bits: 16
    corrupt: false

Signed-off-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <1549638368-530182-3-git-send-email-andrey.shinkevich@virtuozzo.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2019-02-11 14:35:43 -06:00
Andrey Shinkevich 1bf6e9ca92 bdrv_query_image_info Error parameter added
Inform a user in case qcow2_get_specific_info fails to obtain
QCOW2 image specific information. This patch is preliminary to
the one "qcow2: Add list of bitmaps to ImageInfoSpecificQCow2".

Signed-off-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <1549638368-530182-2-git-send-email-andrey.shinkevich@virtuozzo.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2019-02-11 14:35:43 -06:00
Vladimir Sementsov-Ogievskiy bc5a03350c block/nbd-client: rename read_reply_co to connection_co
This coroutine will serve nbd reconnects, so, rename it to be something
more generic.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20190201130138.94525-7-vsementsov@virtuozzo.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2019-02-04 15:11:27 -06:00
Vladimir Sementsov-Ogievskiy 88ed4e1bf0 block/nbd-client: don't check ioc
We have several paranoid checks for ioc != NULL. But ioc may become
NULL only on close, which should not happen during requests handling.
Also, we check ioc only sometimes, not after each yield, which is
inconsistent. Let's drop these checks. However, for safety, let's leave
asserts instead.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20190201130138.94525-6-vsementsov@virtuozzo.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2019-02-04 15:11:27 -06:00
Vladimir Sementsov-Ogievskiy 65e01d4765 block/nbd-client: fix nbd_reply_chunk_iter_receive
Use exported report, not the variable to be reused (should not really
matter).

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20190201130138.94525-5-vsementsov@virtuozzo.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2019-02-04 15:11:27 -06:00
Vladimir Sementsov-Ogievskiy b0e4b5a58f block/nbd-client: split connection from initialization
Split connection code to reuse it for reconnect.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20190201130138.94525-4-vsementsov@virtuozzo.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2019-02-04 15:11:27 -06:00
Vladimir Sementsov-Ogievskiy d42f78e940 block/nbd: move connection code from block/nbd to block/nbd-client
Keep all connection code in one file, to be able to implement reconnect
in further patches.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20190201130138.94525-3-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
[eblake: format tweak]
Signed-off-by: Eric Blake <eblake@redhat.com>
2019-02-04 15:11:27 -06:00
Vladimir Sementsov-Ogievskiy 7f86068dc1 block/nbd-client: split channel errors from export errors
To implement nbd reconnect in further patches, we need to distinguish
error codes, returned by nbd server, from channel errors, to reconnect
only in the latter case.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20190201130138.94525-2-vsementsov@virtuozzo.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2019-02-04 15:11:27 -06:00
Vladimir Sementsov-Ogievskiy e6798f06a6 nbd: generalize usage of nbd_read
We generally do very similar things around nbd_read: error_prepend
specifying what we have tried to read, and be_to_cpu conversion of
integers.

So, it seems reasonable to move common things to helper functions,
which:
1. simplify code a bit
2. generalize nbd_read error descriptions, all starting with
   "Failed to read"
3. make it more difficult to forget to convert things from BE

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20190128165830.165170-1-vsementsov@virtuozzo.com>
[eblake: rename macro to DEF_NBD_READ_N and formatting tweaks;
checkpatch has false positive complaint]
Signed-off-by: Eric Blake <eblake@redhat.com>
2019-02-04 15:11:27 -06:00
Markus Armbruster 14632122b8 block: Eliminate the S_1KiB, S_2KiB, ... macros
We define 54 macros for the powers of two >= 1024.  We use six, in six
macro definitions.  Four of them could just as well use the common MiB
macro, so do that.  The remaining two can't, because they get passed
to stringify.  Replace the macro by the literal number there.
Slightly harder to read in one instance (1048576 vs. S_1MiB), so add a
comment there.  The other instance is a wash: 65536 vs S_64KiB.  65536
has been good enough for more than seven years there.

This effectively reverts commit 540b849261 and 1240ac558d.

Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-02-01 13:46:45 +01:00
Thomas Huth d09ea2d227 block: Remove blk_attach_dev_legacy() / legacy_dev code
The last user of blk_attach_dev_legacy() was the code in xen_disk which
has recently been reworked. Now there is no user for this legacy function
anymore. Thus we can finally remove all code related to the "legacy_dev"
flag, too, and turn the related "void *" in block-backend.c into proper
"DeviceState *" to fix some of the remaining TODOs there.

Signed-off-by: Thomas Huth <thuth@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-02-01 13:46:45 +01:00
Peter Maydell 1324f06384 uuid: Make qemu_uuid_bswap() take and return a QemuUUID
Currently qemu_uuid_bswap() takes a pointer to the QemuUUID to
be byte-swapped. This means it can't be used when the UUID
to be swapped is in a packed member of a struct. It's also
out of line with the general bswap*() functions we provide
in bswap.h, which take the value to be swapped and return it.

Make qemu_uuid_bswap() take a QemuUUID and return the swapped version.

This fixes some clang warnings about taking the address of
a packed struct member in block/vdi.c.

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-02-01 13:46:45 +01:00
Peter Maydell ac928b8ee8 block/vdi: Don't take address of fields in packed structs
Taking the address of a field in a packed struct is a bad idea, because
it might not be actually aligned enough for that pointer type (and
thus cause a crash on dereference on some host architectures). Newer
versions of clang warn about this.

Instead of passing UUID related functions the address of a possibly
unaligned QemuUUID struct, use local variables and then copy to/from
the struct field as appropriate.

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-02-01 13:46:44 +01:00
Peter Maydell 0dbaaa7981 block/vpc: Don't take address of fields in packed structs
Taking the address of a field in a packed struct is a bad idea, because
it might not be actually aligned enough for that pointer type (and
thus cause a crash on dereference on some host architectures). Newer
versions of clang warn about this. Avoid the bug by generating the
UUID into a local variable which is definitely safely aligned and
then copying it into place.

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-02-01 13:46:44 +01:00
Kevin Wolf 4a960ece17 vmdk: Reject excess extents in blockdev-create
Clarify that the number of extents provided in BlockdevCreateOptionsVmdk
must match the number of extents that will actually be used. Providing
more extents will result in an error now.

This requires adapting the test case to provide the right number of
extents.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
2019-02-01 13:46:44 +01:00
Fam Zheng 3015372dd0 vmdk: Implement .bdrv_co_create callback
This makes VMDK support blockdev-create. The implementation reuses the
image creation code in vmdk_co_create_opts which now acceptes a callback
pointer to "retrieve" BlockBackend pointers from the caller. This way we
separate the logic between file/extent acquisition and initialization.

The QAPI command parameters are mostly the same as the old create_opts
except the dropped legacy @compat6 switch, which is redundant with
@hwversion.

Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-02-01 13:46:44 +01:00
Fam Zheng 5be28490ca vmdk: Refactor vmdk_create_extent
The extracted vmdk_init_extent takes a BlockBackend object and
initializes the format metadata. It is the common part between "qemu-img
create" and "blockdev-create".

Add a "BlockBackend *pbb" parameter to vmdk_create_extent, to return the
opened BB to the caller in the next patch.

Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-02-01 13:46:44 +01:00
Kevin Wolf 4720cbeea1 block: Fix hangs in synchronous APIs with iothreads
In the block layer, synchronous APIs are often implemented by creating a
coroutine that calls the asynchronous coroutine-based implementation and
then waiting for completion with BDRV_POLL_WHILE().

For this to work with iothreads (more specifically, when the synchronous
API is called in a thread that is not the home thread of the block
device, so that the coroutine will run in a different thread), we must
make sure to call aio_wait_kick() at the end of the operation. Many
places are missing this, so that BDRV_POLL_WHILE() keeps hanging even if
the condition has long become false.

Note that bdrv_dec_in_flight() involves an aio_wait_kick() call. This
corresponds to the BDRV_POLL_WHILE() in the drain functions, but it is
generally not enough for most other operations because they haven't set
the return value in the coroutine entry stub yet. To avoid race
conditions there, we need to kick after setting the return value.

The race window is small enough that the problem doesn't usually surface
in the common path. However, it does surface and causes easily
reproducible hangs if the operation can return early before even calling
bdrv_inc/dec_in_flight, which many of them do (trivial error or no-op
success paths).

The bug in bdrv_truncate(), bdrv_check() and bdrv_invalidate_cache() is
slightly different: These functions even neglected to schedule the
coroutine in the home thread of the node. This avoids the hang, but is
obviously wrong, too. Fix those to schedule the coroutine in the right
AioContext in addition to adding aio_wait_kick() calls.

Cc: qemu-stable@nongnu.org
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
2019-02-01 13:46:44 +01:00
Markus Armbruster 4e20c1becb block: Replace qdict_put() by qdict_put_obj() where appropriate
Patch created mechanically by rerunning:

  $  spatch --sp-file scripts/coccinelle/qobject.cocci \
	    --macro-file scripts/cocci-macro-file.h \
	    --dir block --in-place

Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-02-01 13:46:44 +01:00
Alberto Garcia cdc674c736 qcow2: Assert that refcount block offsets fit in the refcount table
Refcount table entries have a field to store the offset of the
refcount block. The rest of the bits of the entry are currently
reserved.

The offset is always taken from the entry using REFT_OFFSET_MASK to
ensure that we only use the bits that belong to that field.

While that mask is used every time we read from the refcount table, it
is never used when we write to it. Due to the other constraints of the
qcow2 format QEMU can never produce refcount block offsets that don't
fit in that field so any such offset when allocating a refcount block
would indicate a bug in QEMU.

Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-02-01 13:46:44 +01:00
Alberto Garcia 67b24427fe mirror: Block the source BlockDriverState in mirror_start_job()
The mirror_start_job() function used for the commit-active job blocks
the source, target and all intermediate nodes for the duration of the
job.

   target <- intermediate <- source

Since 4ef85a9c23 this function creates a dummy mirror_top_bs that
goes on top of the source node, and it is this dummy node that gets
blocked instead. The source node is never blocked or added to the
job's list of nodes.

   target <- intermediate <- source <- mirror_top

At the moment I don't think it is possible to exploit this problem
because any additional job on 'source' would either be forbidden for
other reasons or it would need to involve an additional node that is
blocked, causing an error.

This can be seen in the error messages, however, because they never
refer to the source node being blocked:

  $ qemu-img create -f qcow2 hd0.qcow2 1M
  $ qemu-img create -f qcow2 -b hd0.qcow2 hd1.qcow2
  $ qemu-io -c 'write 0 1M' hd0.qcow2
  $ $QEMU -drive if=none,file=hd1.qcow2,node-name=hd1
  { "execute": "qmp_capabilities" }
  { "execute": "block-commit", "arguments": {"device": "hd1", "speed": 256}}
  { "execute": "block-stream", "arguments": {"device": "hd1"}}
  { "error": {"class": "GenericError",
    "desc": "Node 'hd0' is busy: block device is in use by block job: commit"}}

After this patch the error message refers to 'hd1', as it should.

The expected output of iotest 141 also needs to be updated for the
same reason.

Signed-off-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-02-01 13:46:44 +01:00
Alberto Garcia e917e2cb2a mirror: Release the dirty bitmap if mirror_start_job() fails
At the moment I don't see how to make this function fail after the
dirty bitmap has been created, but if that was possible then we would
hit the assert(QLIST_EMPTY(&bs->dirty_bitmaps)) in bdrv_close().

Signed-off-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-02-01 13:46:44 +01:00
Laurent Vivier 70018a149c block/sheepdog: Convert from DPRINTF() macro to trace events
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-id: 20181213162727.17438-5-lvivier@redhat.com
[mreitz: Fixed sheepdog_snapshot_create_inode's format string to use
         PRIx32 for uint32_ts]
Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-01-31 00:38:19 +01:00
Laurent Vivier 4f7d28d738 block/file-posix: Convert from DPRINTF() macro to trace events
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-id: 20181213162727.17438-4-lvivier@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-01-31 00:38:19 +01:00
Laurent Vivier ed2a66deb9 block/curl: Convert from DPRINTF() macro to trace events
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: Richard W.M. Jones <rjones@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-id: 20181213162727.17438-3-lvivier@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-01-31 00:38:19 +01:00
Laurent Vivier 023908a24d block/ssh: Convert from DPRINTF() macro to trace events
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: Richard W.M. Jones <rjones@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-id: 20181213162727.17438-2-lvivier@redhat.com
[mreitz: Fixed type of ssh_{read,write}_return's parameter to be ssize_t
         instead of size_t]
Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-01-31 00:38:19 +01:00
Vladimir Sementsov-Ogievskiy 5d3b4e9946 qapi: add x-debug-query-block-graph
Add a new command, returning block nodes (and their users) graph.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-id: 20181221170909.25584-2-vsementsov@virtuozzo.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-01-31 00:38:19 +01:00
Stefan Hajnoczi bc19a0a6e4 throttle-groups: fix restart coroutine iothread race
The following QMP command leads to a crash when iothreads are used:

  { 'execute': 'device_del', 'arguments': {'id': 'data'} }

The backtrace involves the queue restart coroutine where
tgm->throttle_state is a NULL pointer because
throttle_group_unregister_tgm() has already been called:

  (gdb) bt full
  #0  0x00005585a7a3b378 in qemu_mutex_lock_impl (mutex=0xffffffffffffffd0, file=0x5585a7bb3d54 "block/throttle-groups.c", line=412) at util/qemu-thread-posix.c:64
        err = <optimized out>
        __PRETTY_FUNCTION__ = "qemu_mutex_lock_impl"
        __func__ = "qemu_mutex_lock_impl"
  #1  0x00005585a79be074 in throttle_group_restart_queue_entry (opaque=0x5585a9de4eb0) at block/throttle-groups.c:412
        _f = <optimized out>
        data = 0x5585a9de4eb0
        tgm = 0x5585a9079440
        ts = 0x0
        tg = 0xffffffffffffff98
        is_write = false
        empty_queue = 255

This coroutine should not execute in the iothread after the throttle
group member has been unregistered!

The root cause is that the device_del code path schedules the restart
coroutine in the iothread while holding the AioContext lock.  Therefore
the iothread cannot execute the coroutine until after device_del
releases the lock - by this time it's too late.

This patch adds a reference count to ThrottleGroupMember so we can
synchronously wait for restart coroutines to complete.  Once they are
done it is safe to unregister the ThrottleGroupMember.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Message-id: 20190114133257.30299-2-stefanha@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2019-01-24 10:02:28 +00:00
Peter Maydell 952bc8b3c2 nbd patches for 2019-01-21
Add 'qemu-nbd --list' for probing a remote NBD server's advertisements.
 
 - Eric Blake: 0/21 nbd: add qemu-nbd --list
 -----BEGIN PGP SIGNATURE-----
 
 iQEcBAABCAAGBQJcRktLAAoJEKeha0olJ0NqY8cIAJqcHDAuk1GwcDE/NIAASY9E
 2PsaEL2pM65bEWUwsnVkFyRb4y4S4QX9VtAR/c9G6jR0YHuDvCCz7VGjiHh3eGso
 9AWKzuh5C4mayAN5dTOlbjryOx7hlcz8wDDjP5OQNqIlBvmWVAnLeh5Kkqlc7KDk
 f1SH8kJfqegDAhKDeDhi/HGlL1UVzMr5A2sR6I1wbb5IbBru5JtAKzgqAmdBH893
 lWZ9EDXqajwPCo8ASwZNyawmtmjx+VBeFjO2juA3qTWvf262roiB3XT5W/3n/CCb
 Er5CBwQQvSFBUfF9bRYDziyjwQwTXoIQJ6nuJLzPGx2cUpQbJ/svphoRQ6rBOjY=
 =gywW
 -----END PGP SIGNATURE-----

Merge remote-tracking branch 'remotes/ericb/tags/pull-nbd-2019-01-21' into staging

nbd patches for 2019-01-21

Add 'qemu-nbd --list' for probing a remote NBD server's advertisements.

- Eric Blake: 0/21 nbd: add qemu-nbd --list

# gpg: Signature made Mon 21 Jan 2019 22:44:27 GMT
# gpg:                using RSA key A7A16B4A2527436A
# gpg: Good signature from "Eric Blake <eblake@redhat.com>"
# gpg:                 aka "Eric Blake (Free Software Programmer) <ebb9@byu.net>"
# gpg:                 aka "[jpeg image of size 6874]"
# Primary key fingerprint: 71C2 CC22 B1C4 6029 27D2  F3AA A7A1 6B4A 2527 436A

* remotes/ericb/tags/pull-nbd-2019-01-21: (21 commits)
  iotests: Enhance 223, 233 to cover 'qemu-nbd --list'
  nbd/client: Work around 3.0 bug for listing meta contexts
  qemu-nbd: Add --list option
  nbd/client: Add meta contexts to nbd_receive_export_list()
  nbd/client: Add nbd_receive_export_list()
  nbd/client: Refactor nbd_opt_go() to support NBD_OPT_INFO
  nbd/client: Pull out oldstyle size determination
  nbd/client: Split handshake into two functions
  nbd/client: Refactor return of nbd_receive_negotiate()
  nbd/client: Split out nbd_receive_one_meta_context()
  nbd/client: Split out nbd_send_meta_query()
  nbd/client: Change signature of nbd_negotiate_simple_meta_context()
  nbd/client: Move export name into NBDExportInfo
  nbd/client: Refactor nbd_receive_list()
  qemu-nbd: Avoid strtol open-coding
  nbd/server: Favor [u]int64_t over off_t
  nbd/server: Hoist length check to qmp_nbd_server_add
  qemu-nbd: Sanity check partition bounds
  qemu-nbd: Enhance man page
  maint: Allow for EXAMPLES in texi2pod
  ...

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2019-01-22 17:56:21 +00:00
Thomas Huth df71ca84e4 block: Work-around a bug in libiscsi 1.9.0 when used in gnu99 mode
The header "scsi-lowlevel.h" of libiscsi 1.9.0 contains some bad
"inline" prototype definitions which GCC refuses to compile in its
gnu99 mode:

In file included from block/iscsi.c:52:0:
/usr/include/iscsi/scsi-lowlevel.h:810:13: error: inline function
‘scsi_set_uint16’ declared but never defined [-Werror]
 inline void scsi_set_uint16(unsigned char *c, uint16_t val);
             ^
/usr/include/iscsi/scsi-lowlevel.h:809:13: error: inline function
‘scsi_set_uint32’ declared but never defined [-Werror]
 inline void scsi_set_uint32(unsigned char *c, uint32_t val);
             ^
[...]

This has been fixed by upstream libiscsi in version 1.10.0 (see
https://github.com/sahlberg/libiscsi/commit/7692027d6c11 ), but
since we still want to support 1.9.0 for CentOS 7 / RHEL7, we
have to work-around the issue by redefining the "inline" keyword
to use the old "gnu89" mode behavior via "gnu_inline" instead.

Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Thomas Huth <thuth@redhat.com>
2019-01-22 06:26:32 +01:00
Eric Blake 2df94eb52b nbd/client: Change signature of nbd_negotiate_simple_meta_context()
Pass 'info' instead of three separate parameters related to info,
when requesting the server to set the meta context.  Update the
NBDExportInfo struct to rename the received id field to match the
fact that we are currently overloading the field to match whatever
context the user supplied through the x-dirty-bitmap hack, as well
as adding a TODO comment to remind future patches about a desire
to request two contexts at once.

Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Richard W.M. Jones <rjones@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20190117193658.16413-11-eblake@redhat.com>
2019-01-21 15:49:52 -06:00
Eric Blake 6dc1667d68 nbd/client: Move export name into NBDExportInfo
Refactor the 'name' parameter of nbd_receive_negotiate() from
being a separate parameter into being part of the in-out 'info'.
This also spills over to a simplification of nbd_opt_go().

The main driver for this refactoring is that an upcoming patch
would like to add support to qemu-nbd to list information about
all exports available on a server, where the name(s) will be
provided by the server instead of the client.  But another benefit
is that we can now allow the client to explicitly specify the
empty export name "" even when connecting to an oldstyle server
(even if qemu is no longer such a server after commit 7f7dfe2a).

Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Richard W.M. Jones <rjones@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20190117193658.16413-10-eblake@redhat.com>
2019-01-21 15:49:52 -06:00
Peter Maydell 51c1c13560 pci, pc, virtio: fixes, features
tpm physical presence interface
 rsc support in virtio net
 ivshmem is removed
 misc cleanups and fixes all over the place
 
 Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQEcBAABAgAGBQJcQTW/AAoJECgfDbjSjVRpY34H/0wniUqht3CVVihAfC9qnL7N
 jgAJ41OGgrrCmXLrwa5P5iWxIKRjp2Odo7DQ2dl+9sqYshe6WoBBXmqEa8bDeldE
 69EJ3xSnZjkpScqXKoSaDO3dtibUGpdvlTKpjuh7q2FFBOvNvf3hQubhyt/vINsO
 qoDMqycjXJkHSaZLfdNb3P/LAkAuHI4veNk8KEbog3XTijSUbZlb2GYc8vT/agka
 0SldiEVi9CTIh7co0L2BkcLVeTMVrF3OZuWDek3tVHXPxPaNbQM3GMvHZaxqZusU
 SZinnNhcnI++kgLZVHuurxo7e1S9d8Oy0QuHGX8CZ1sfeqy3vzxer0Hbb7BdLIY=
 =GhIM
 -----END PGP SIGNATURE-----

Merge remote-tracking branch 'remotes/mst/tags/for_upstream' into staging

pci, pc, virtio: fixes, features

tpm physical presence interface
rsc support in virtio net
ivshmem is removed
misc cleanups and fixes all over the place

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

# gpg: Signature made Fri 18 Jan 2019 02:11:11 GMT
# gpg:                using RSA key 281F0DB8D28D5469
# gpg: Good signature from "Michael S. Tsirkin <mst@kernel.org>"
# gpg:                 aka "Michael S. Tsirkin <mst@redhat.com>"
# Primary key fingerprint: 0270 606B 6F3C DF3D 0B17  0970 C350 3912 AFBE 8E67
#      Subkey fingerprint: 5D09 FD08 71C8 F85B 94CA  8A0D 281F 0DB8 D28D 5469

* remotes/mst/tags/for_upstream: (49 commits)
  migration: Use strnlen() for fixed-size string
  migration: Fix stringop-truncation warning
  hw/acpi: Use QEMU_NONSTRING for non NUL-terminated arrays
  block/sheepdog: Use QEMU_NONSTRING for non NUL-terminated arrays
  qemu/compiler: Define QEMU_NONSTRING
  acpi: update expected files
  hw: acpi: Fix memory hotplug AML generation error
  tpm: clear RAM when "memory overwrite" requested
  acpi: add ACPI memory clear interface
  acpi: build TPM Physical Presence interface
  acpi: expose TPM/PPI configuration parameters to firmware via fw_cfg
  tpm: allocate/map buffer for TPM Physical Presence interface
  tpm: add a "ppi" boolean property
  hw/misc/edu: add msi_uninit() for pci_edu_uninit()
  virtio: Make disable-legacy/disable-modern compat properties optional
  globals: Allow global properties to be optional
  virtio: virtio 9p really requires CONFIG_VIRTFS to work
  virtio: split virtio crypto bits from virtio-pci.h
  virtio: split virtio gpu bits from virtio-pci.h
  virtio: split virtio serial bits from virtio-pci
  ...

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2019-01-18 14:58:58 +00:00
Philippe Mathieu-Daudé 97b583f46c block/sheepdog: Use QEMU_NONSTRING for non NUL-terminated arrays
GCC 8 added a -Wstringop-truncation warning:

  The -Wstringop-truncation warning added in GCC 8.0 via r254630 for
  bug 81117 is specifically intended to highlight likely unintended
  uses of the strncpy function that truncate the terminating NUL
  character from the source string.

This new warning leads to compilation failures:

    CC      block/sheepdog.o
  qemu/block/sheepdog.c: In function 'find_vdi_name':
  qemu/block/sheepdog.c:1239:5: error: 'strncpy' specified bound 256 equals destination size [-Werror=stringop-truncation]
       strncpy(buf + SD_MAX_VDI_LEN, tag, SD_MAX_VDI_TAG_LEN);
       ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  make: *** [qemu/rules.mak:69: block/sheepdog.o] Error 1

As described previous to the strncpy() calls, the use of strncpy() is
correct here:

    /* This pair of strncpy calls ensures that the buffer is zero-filled,
     * which is desirable since we'll soon be sending those bytes, and
     * don't want the send_req to read uninitialized data.
     */
    strncpy(buf, filename, SD_MAX_VDI_LEN);
    strncpy(buf + SD_MAX_VDI_LEN, tag, SD_MAX_VDI_TAG_LEN);

Use the QEMU_NONSTRING attribute, since this array is intended to store
character arrays that do not necessarily contain a terminating NUL.

Suggested-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
2019-01-17 21:10:57 -05:00
Vladimir Sementsov-Ogievskiy 19c021e194 Revert "hbitmap: Add @advance param to hbitmap_iter_next()"
This reverts commit a33fbb4f8b.

The functionality is unused.

Note: in addition to automatic revert, drop second parameter in
hbitmap_iter_next() call from hbitmap_next_dirty_area() too.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: John Snow <jsnow@redhat.com>
2019-01-15 18:26:50 -05:00
Vladimir Sementsov-Ogievskiy 166cd55125 Revert "block/dirty-bitmap: Add bdrv_dirty_iter_next_area"
This reverts commit 72d10a9421.

The function is unused now.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: John Snow <jsnow@redhat.com>
2019-01-15 18:26:50 -05:00
Vladimir Sementsov-Ogievskiy 1eaf1b0fdf block/mirror: fix and improve do_sync_target_write
Use bdrv_dirty_bitmap_next_dirty_area() instead of
bdrv_dirty_iter_next_area(), because of the following problems of
bdrv_dirty_iter_next_area():

1. Using HBitmap iterators we should carefully handle unaligned offset,
as first call to hbitmap_iter_next() may return a value less than
original offset (actually, it will be original offset rounded down to
bitmap granularity). This handling is not done in
do_sync_target_write().

2. bdrv_dirty_iter_next_area() handles unaligned max_offset
incorrectly:

look at the code:
    if (max_offset == iter->bitmap->size) {
        /* If max_offset points to the image end, round it up by the
         * bitmap granularity */
        gran_max_offset = ROUND_UP(max_offset, granularity);
    } else {
        gran_max_offset = max_offset;
    }

    ret = hbitmap_iter_next(&iter->hbi, false);
    if (ret < 0 || ret + granularity > gran_max_offset) {
        return false;
    }

and assume that max_offset != iter->bitmap->size but still unaligned.
if 0 < ret < max_offset we found dirty area, but the function can
return false in this case (if ret + granularity > max_offset).

3. bdrv_dirty_iter_next_area() uses inefficient loop to find the end of
the dirty area. Let's use more efficient hbitmap_next_zero instead
(bdrv_dirty_bitmap_next_dirty_area() do so)

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
2019-01-15 18:26:50 -05:00
Vladimir Sementsov-Ogievskiy a78a1a48cd dirty-bitmap: add bdrv_dirty_bitmap_next_dirty_area
The function alters bdrv_dirty_iter_next_area(), which is wrong and
less efficient (see further commit
"block/mirror: fix and improve do_sync_target_write" for description).

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
2019-01-15 18:26:50 -05:00
Vladimir Sementsov-Ogievskiy 76d570dc49 dirty-bitmap: improve bdrv_dirty_bitmap_next_zero
Add bytes parameter to the function, to limit searched range.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
2019-01-15 18:26:49 -05:00
John Snow 07d5a8df6a block/dirty-bitmap: remove assertion from restore
When making a backup of a dirty bitmap (for transactions), we want to
restore that backup whether or not the bitmap is enabled.

It is perfectly valid to write into bitmaps that are disabled. It is
only illegitimate for the guest to have done so.

Remove this assertion.

Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: John Snow <jsnow@redhat.com>
Message-Id: <20181221093529.23855-3-jsnow@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2019-01-14 10:09:46 -06:00
Paolo Bonzini 7d37435bd5 avoid TABs in files that only contain a few
Most files that have TABs only contain a handful of them.  Change
them to spaces so that we don't confuse people.

disas, standard-headers, linux-headers and libdecnumber are imported
from other projects and probably should be exempted from the check.
Outside those, after this patch the following files still contain both
8-space and TAB sequences at the beginning of the line.  Many of them
have a majority of TABs, or were initially committed with all tabs.

    bsd-user/i386/target_syscall.h
    bsd-user/x86_64/target_syscall.h
    crypto/aes.c
    hw/audio/fmopl.c
    hw/audio/fmopl.h
    hw/block/tc58128.c
    hw/display/cirrus_vga.c
    hw/display/xenfb.c
    hw/dma/etraxfs_dma.c
    hw/intc/sh_intc.c
    hw/misc/mst_fpga.c
    hw/net/pcnet.c
    hw/sh4/sh7750.c
    hw/timer/m48t59.c
    hw/timer/sh_timer.c
    include/crypto/aes.h
    include/disas/bfd.h
    include/hw/sh4/sh.h
    libdecnumber/decNumber.c
    linux-headers/asm-generic/unistd.h
    linux-headers/linux/kvm.h
    linux-user/alpha/target_syscall.h
    linux-user/arm/nwfpe/double_cpdo.c
    linux-user/arm/nwfpe/fpa11_cpdt.c
    linux-user/arm/nwfpe/fpa11_cprt.c
    linux-user/arm/nwfpe/fpa11.h
    linux-user/flat.h
    linux-user/flatload.c
    linux-user/i386/target_syscall.h
    linux-user/ppc/target_syscall.h
    linux-user/sparc/target_syscall.h
    linux-user/syscall.c
    linux-user/syscall_defs.h
    linux-user/x86_64/target_syscall.h
    slirp/cksum.c
    slirp/if.c
    slirp/ip.h
    slirp/ip_icmp.c
    slirp/ip_icmp.h
    slirp/ip_input.c
    slirp/ip_output.c
    slirp/mbuf.c
    slirp/misc.c
    slirp/sbuf.c
    slirp/socket.c
    slirp/socket.h
    slirp/tcp_input.c
    slirp/tcpip.h
    slirp/tcp_output.c
    slirp/tcp_subr.c
    slirp/tcp_timer.c
    slirp/tftp.c
    slirp/udp.c
    slirp/udp.h
    target/cris/cpu.h
    target/cris/mmu.c
    target/cris/op_helper.c
    target/sh4/helper.c
    target/sh4/op_helper.c
    target/sh4/translate.c
    tcg/sparc/tcg-target.inc.c
    tests/tcg/cris/check_addo.c
    tests/tcg/cris/check_moveq.c
    tests/tcg/cris/check_swap.c
    tests/tcg/multiarch/test-mmap.c
    ui/vnc-enc-hextile-template.h
    ui/vnc-enc-zywrle.h
    util/envlist.c
    util/readline.c

The following have only TABs:

    bsd-user/i386/target_signal.h
    bsd-user/sparc64/target_signal.h
    bsd-user/sparc64/target_syscall.h
    bsd-user/sparc/target_signal.h
    bsd-user/sparc/target_syscall.h
    bsd-user/x86_64/target_signal.h
    crypto/desrfb.c
    hw/audio/intel-hda-defs.h
    hw/core/uboot_image.h
    hw/sh4/sh7750_regnames.c
    hw/sh4/sh7750_regs.h
    include/hw/cris/etraxfs_dma.h
    linux-user/alpha/termbits.h
    linux-user/arm/nwfpe/fpopcode.h
    linux-user/arm/nwfpe/fpsr.h
    linux-user/arm/syscall_nr.h
    linux-user/arm/target_signal.h
    linux-user/cris/target_signal.h
    linux-user/i386/target_signal.h
    linux-user/linux_loop.h
    linux-user/m68k/target_signal.h
    linux-user/microblaze/target_signal.h
    linux-user/mips64/target_signal.h
    linux-user/mips/target_signal.h
    linux-user/mips/target_syscall.h
    linux-user/mips/termbits.h
    linux-user/ppc/target_signal.h
    linux-user/sh4/target_signal.h
    linux-user/sh4/termbits.h
    linux-user/sparc64/target_syscall.h
    linux-user/sparc/target_signal.h
    linux-user/x86_64/target_signal.h
    linux-user/x86_64/termbits.h
    pc-bios/optionrom/optionrom.h
    slirp/mbuf.h
    slirp/misc.h
    slirp/sbuf.h
    slirp/tcp.h
    slirp/tcp_timer.h
    slirp/tcp_var.h
    target/i386/svm.h
    target/sparc/asi.h
    target/xtensa/core-dc232b/xtensa-modules.inc.c
    target/xtensa/core-dc233c/xtensa-modules.inc.c
    target/xtensa/core-de212/core-isa.h
    target/xtensa/core-de212/xtensa-modules.inc.c
    target/xtensa/core-fsf/xtensa-modules.inc.c
    target/xtensa/core-sample_controller/core-isa.h
    target/xtensa/core-sample_controller/xtensa-modules.inc.c
    target/xtensa/core-test_kc705_be/core-isa.h
    target/xtensa/core-test_kc705_be/xtensa-modules.inc.c
    tests/tcg/cris/check_abs.c
    tests/tcg/cris/check_addc.c
    tests/tcg/cris/check_addcm.c
    tests/tcg/cris/check_addoq.c
    tests/tcg/cris/check_bound.c
    tests/tcg/cris/check_ftag.c
    tests/tcg/cris/check_int64.c
    tests/tcg/cris/check_lz.c
    tests/tcg/cris/check_openpf5.c
    tests/tcg/cris/check_sigalrm.c
    tests/tcg/cris/crisutils.h
    tests/tcg/cris/sys.c
    tests/tcg/i386/test-i386-ssse3.c
    ui/vgafont.h

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20181213223737.11793-3-pbonzini@redhat.com>
Reviewed-by: Aleksandar Markovic <amarkovic@wavecomp.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Wainer dos Santos Moschetta <wainersm@redhat.com>
Acked-by: Richard Henderson <richard.henderson@linaro.org>
Acked-by: Eric Blake <eblake@redhat.com>
Acked-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Stefan Markovic <smarkovic@wavecomp.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-01-11 15:46:56 +01:00
Paolo Bonzini b58deb344d qemu/queue.h: leave head structs anonymous unless necessary
Most list head structs need not be given a name.  In most cases the
name is given just in case one is going to use QTAILQ_LAST, QTAILQ_PREV
or reverse iteration, but this does not apply to lists of other kinds,
and even for QTAILQ in practice this is only rarely needed.  In addition,
we will soon reimplement those macros completely so that they do not
need a name for the head struct.  So clean up everything, not giving a
name except in the rare case where it is necessary.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-01-11 15:46:55 +01:00
Stefan Hajnoczi 88e94fd238 block/iscsi: cancel libiscsi task when ABORT TASK TMF completes
The libiscsi iscsi_task_mgmt_async() API documentation says:

  abort_task will also cancel the scsi task. The callback for the scsi
  task will be invoked with SCSI_STATUS_CANCELLED

The libiscsi implementation does not fulfil this promise.  The task's
callback is not invoked and its struct iscsi_pdu remains in the internal
list (effectively leaked).

This patch invokes the libiscsi iscsi_scsi_cancel_task() API to force
the task's callback to be invoked with SCSI_STATUS_CANCELLED when the
ABORT TASK TMF completes and the task's callback hasn't been invoked
yet.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20180215111526.2464-1-stefanha@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-01-11 13:57:24 +01:00
Stefan Hajnoczi c100448790 block/iscsi: fix ioctl cancel use-after-free
iscsi_aio_cancel() does not increment the request's reference count,
causing a use-after-free when ABORT TASK finishes after the request has
already completed.

There are some additional issues with iscsi_aio_cancel():
1. Several ABORT TASKs may be sent for the same task if
   iscsi_aio_cancel() is invoked multiple times.  It's better to avoid
   this just in case the command identifier is reused.
2. The iscsilun->mutex protection is missing in iscsi_aio_cancel().

Reported-by: Felipe Franciosi <felipe@nutanix.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20180203061621.7033-4-stefanha@redhat.com>
Reviewed-by: Felipe Franciosi <felipe@nutanix.com>
Tested-by: Sreejith Mohanan <sreejit.mohanan@nutanix.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-01-11 13:57:24 +01:00