qemu-e2k

Author	SHA1	Message	Date
Andrey Shinkevich	96a07d5bf4	block/stream: refactor stream_run: drop goto The goto is unnecessary in the stream_run() since the common exit code was removed in the commit `eb23654dbe`: "jobs: utilize job_exit shim". Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Signed-off-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com> Reviewed-by: Alberto Garcia <berto@igalia.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Message-id: 1559152576-281803-3-git-send-email-andrey.shinkevich@virtuozzo.com Reviewed-by: Max Reitz <mreitz@redhat.com> Signed-off-by: Max Reitz <mreitz@redhat.com>	2019-07-02 03:53:04 +02:00
Andrey Shinkevich	170d3bd341	block: include base when checking image chain for block allocation This patch is used in the 'block/stream: introduce a bottom node' that is following. Instead of the base node, the caller may pass the node that has the base as its backing image to the function bdrv_is_allocated_above() with a new parameter include_base = true and get rid of the dependency on the base that may change during commit/stream parallel jobs. Now, if the specified base is not found in the backing image chain, the QEMU will abort. Suggested-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Signed-off-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Reviewed-by: Alberto Garcia <berto@igalia.com> Message-id: 1559152576-281803-2-git-send-email-andrey.shinkevich@virtuozzo.com [mreitz: Squashed in the following as a rebase on conflicting patches:] Message-id: e3cf99ae-62e9-8b6e-5a06-d3c8b9363b85@redhat.com Signed-off-by: Max Reitz <mreitz@redhat.com>	2019-07-02 03:53:04 +02:00
Stefano Garzarella	d24f80234b	block/rbd: increase dynamically the image size RBD APIs don't allow us to write more than the size set with rbd_create() or rbd_resize(). In order to support growing images (eg. qcow2), we resize the image before write operations that exceed the current size. Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Message-id: 20190509145927.293369-1-sgarzare@redhat.com Signed-off-by: Max Reitz <mreitz@redhat.com>	2019-07-02 03:53:04 +02:00
Pino Toscano	b10d49d761	ssh: switch from libssh2 to libssh Rewrite the implementation of the ssh block driver to use libssh instead of libssh2. The libssh library has various advantages over libssh2: - easier API for authentication (for example for using ssh-agent) - easier API for known_hosts handling - supports newer types of keys in known_hosts Use APIs/features available in libssh 0.8 conditionally, to support older versions (which are not recommended though). Adjust the iotest 207 according to the different error message, and to find the default key type for localhost (to properly compare the fingerprint with). Contributed-by: Max Reitz <mreitz@redhat.com> Adjust the various Docker/Travis scripts to use libssh when available instead of libssh2. The mingw/mxe testing is dropped for now, as there are no packages for it. Signed-off-by: Pino Toscano <ptoscano@redhat.com> Tested-by: Philippe Mathieu-Daudé <philmd@redhat.com> Acked-by: Alex Bennée <alex.bennee@linaro.org> Message-id: 20190620200840.17655-1-ptoscano@redhat.com Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com> Message-id: 5873173.t2JhDm7DL7@lindworm.usersys.redhat.com Signed-off-by: Max Reitz <mreitz@redhat.com>	2019-06-24 16:01:04 +02:00
Sam Eiderman	98eb9733f4	vmdk: Add read-only support for seSparse snapshots Until ESXi 6.5 VMware used the vmfsSparse format for snapshots (VMDK3 in QEMU). This format was lacking in the following: * Grain directory (L1) and grain table (L2) entries were 32-bit, allowing access to only 2TB (slightly less) of data. * The grain size (default) was 512 bytes - leading to data fragmentation and many grain tables. * For space reclamation purposes, it was necessary to find all the grains which are not pointed to by any grain table - so a reverse mapping of "offset of grain in vmdk" to "grain table" must be constructed - which takes large amounts of CPU/RAM. The format specification can be found in VMware's documentation: https://www.vmware.com/support/developer/vddk/vmdk_50_technote.pdf In ESXi 6.5, to support snapshot files larger than 2TB, a new format was introduced: SESparse (Space Efficient). This format fixes the above issues: * All entries are now 64-bit. * The grain size (default) is 4KB. * Grain directory and grain tables are now located at the beginning of the file. + seSparse format reserves space for all grain tables. + Grain tables can be addressed using an index. + Grains are located in the end of the file and can also be addressed with an index. - seSparse vmdks of large disks (64TB) have huge preallocated headers - mainly due to L2 tables, even for empty snapshots. * The header contains a reverse mapping ("backmap") of "offset of grain in vmdk" to "grain table" and a bitmap ("free bitmap") which specifies for each grain - whether it is allocated or not. Using these data structures we can implement space reclamation efficiently. * Due to the fact that the header now maintains two mappings: * The regular one (grain directory & grain tables) * A reverse one (backmap and free bitmap) These data structures can lose consistency upon crash and result in a corrupted VMDK. Therefore, a journal is also added to the VMDK and is replayed when the VMware reopens the file after a crash. Since ESXi 6.7 - SESparse is the only snapshot format available. Unfortunately, VMware does not provide documentation regarding the new seSparse format. This commit is based on black-box research of the seSparse format. Various in-guest block operations and their effect on the snapshot file were tested. The only VMware provided source of information (regarding the underlying implementation) was a log file on the ESXi: /var/log/hostd.log Whenever an seSparse snapshot is created - the log is being populated with seSparse records. Relevant log records are of the form: [...] Const Header: [...] constMagic = 0xcafebabe [...] version = 2.1 [...] capacity = 204800 [...] grainSize = 8 [...] grainTableSize = 64 [...] flags = 0 [...] Extents: [...] Header : <1 : 1> [...] JournalHdr : <2 : 2> [...] Journal : <2048 : 2048> [...] GrainDirectory : <4096 : 2048> [...] GrainTables : <6144 : 2048> [...] FreeBitmap : <8192 : 2048> [...] BackMap : <10240 : 2048> [...] Grain : <12288 : 204800> [...] Volatile Header: [...] volatileMagic = 0xcafecafe [...] FreeGTNumber = 0 [...] nextTxnSeqNumber = 0 [...] replayJournal = 0 The sizes that are seen in the log file are in sectors. Extents are of the following format: <offset : size> This commit is a strict implementation which enforces: * magics * version number 2.1 * grain size of 8 sectors (4KB) * grain table size of 64 sectors * zero flags * extent locations Additionally, this commit proivdes only a subset of the functionality offered by seSparse's format: * Read-only * No journal replay * No space reclamation * No unmap support Hence, journal header, journal, free bitmap and backmap extents are unused, only the "classic" (L1 -> L2 -> data) grain access is implemented. However there are several differences in the grain access itself. Grain directory (L1): * Grain directory entries are indexes (not offsets) to grain tables. * Valid grain directory entries have their highest nibble set to 0x1. * Since grain tables are always located in the beginning of the file - the index can fit into 32 bits - so we can use its low part if it's valid. Grain table (L2): * Grain table entries are indexes (not offsets) to grains. * If the highest nibble of the entry is: 0x0: The grain in not allocated. The rest of the bytes are 0. 0x1: The grain is unmapped - guest sees a zero grain. The rest of the bits point to the previously mapped grain, see 0x3 case. 0x2: The grain is zero. 0x3: The grain is allocated - to get the index calculate: ((entry & 0x0fff000000000000) >> 48) \| ((entry & 0x0000ffffffffffff) << 12) * The difference between 0x1 and 0x2 is that 0x1 is an unallocated grain which results from the guest using sg_unmap to unmap the grain - but the grain itself still exists in the grain extent - a space reclamation procedure should delete it. Unmapping a zero grain has no effect (0x2 will not change to 0x1) but unmapping an unallocated grain will (0x0 to 0x1) - naturally. In order to implement seSparse some fields had to be changed to support both 32-bit and 64-bit entry sizes. Reviewed-by: Karl Heubaum <karl.heubaum@oracle.com> Reviewed-by: Eyal Moscovici <eyal.moscovici@oracle.com> Reviewed-by: Arbel Moshe <arbel.moshe@oracle.com> Signed-off-by: Sam Eiderman <shmuel.eiderman@oracle.com> Message-id: 20190620091057.47441-4-shmuel.eiderman@oracle.com Signed-off-by: Max Reitz <mreitz@redhat.com>	2019-06-24 15:53:02 +02:00
Sam Eiderman	59d6ee4850	vmdk: Reduce the max bound for L1 table size 512M of L1 entries is a very loose bound, only 32M are required to store the maximal supported VMDK file size of 2TB. Fixed qemu-iotest 59# - now failure occures before on impossible L1 table size. Reviewed-by: Karl Heubaum <karl.heubaum@oracle.com> Reviewed-by: Eyal Moscovici <eyal.moscovici@oracle.com> Reviewed-by: Liran Alon <liran.alon@oracle.com> Reviewed-by: Arbel Moshe <arbel.moshe@oracle.com> Signed-off-by: Sam Eiderman <shmuel.eiderman@oracle.com> Message-id: 20190620091057.47441-3-shmuel.eiderman@oracle.com Reviewed-by: Max Reitz <mreitz@redhat.com> Signed-off-by: Max Reitz <mreitz@redhat.com>	2019-06-24 15:53:02 +02:00
Sam Eiderman	940a2cd5d2	vmdk: Fix comment regarding max l1_size coverage Commit `b0651b8c24` ("vmdk: Move l1_size check into vmdk_add_extent") extended the l1_size check from VMDK4 to VMDK3 but did not update the default coverage in the moved comment. The previous vmdk4 calculation: (512 * 1024 * 1024) * 512(l2 entries) * 65536(grain) = 16PB The added vmdk3 calculation: (512 * 1024 * 1024) * 4096(l2 entries) * 512(grain) = 1PB Adding the calculation of vmdk3 to the comment. In any case, VMware does not offer virtual disks more than 2TB for vmdk4/vmdk3 or 64TB for the new undocumented seSparse format which is not implemented yet in qemu. Reviewed-by: Karl Heubaum <karl.heubaum@oracle.com> Reviewed-by: Eyal Moscovici <eyal.moscovici@oracle.com> Reviewed-by: Liran Alon <liran.alon@oracle.com> Reviewed-by: Arbel Moshe <arbel.moshe@oracle.com> Signed-off-by: Sam Eiderman <shmuel.eiderman@oracle.com> Message-id: 20190620091057.47441-2-shmuel.eiderman@oracle.com Reviewed-by: yuchenlin <yuchenlin@synology.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Signed-off-by: Max Reitz <mreitz@redhat.com>	2019-06-24 15:53:02 +02:00
Max Reitz	a193ad3b3b	block/commit: Drop bdrv_child_try_set_perm() commit_top_bs never requests or unshares any permissions. There is no reason to make this so explicit here. Signed-off-by: Max Reitz <mreitz@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-06-18 16:41:10 +02:00
Max Reitz	f94dc3b414	block/mirror: Fix child permissions We cannot use bdrv_child_try_set_perm() to give up all restrictions on the child edge, and still have bdrv_mirror_top_child_perm() request BLK_PERM_WRITE. Fix this by making bdrv_mirror_top_child_perm() return 0/BLK_PERM_ALL when we want to give up all permissions, and replacing bdrv_child_try_set_perm() by bdrv_child_refresh_perms(). The bdrv_child_try_set_perm() before removing the node with bdrv_replace_node() is then unnecessary. No permissions have changed since the previous invocation of bdrv_child_try_set_perm(). Signed-off-by: Max Reitz <mreitz@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-06-18 16:41:10 +02:00
Max Reitz	094e363944	file-posix: Update open_flags in raw_set_perm() raw_check_perm() + raw_set_perm() can change the flags associated with the current FD. If so, we have to update BDRVRawState.open_flags accordingly. Otherwise, we may keep reopening the FD even though the current one already has the correct flags. Signed-off-by: Max Reitz <mreitz@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-06-18 16:41:10 +02:00
Vladimir Sementsov-Ogievskiy	b23c580c94	block: drop bs->job Drop remaining users of bs->job: 1. assertions actually duplicated by assert(!bs->refcnt) 2. trace-point seems not enough reason to change stream_start to return BlockJob pointer 3. Restricting creation of two jobs based on same bs is bad idea, as 3.1 Some jobs creates filters to be their main node, so, this check don't actually prevent creating second job on same real node (which will create another filter node) (but I hope it is restricted by other mechanisms) 3.2 Even without bs->job we have two systems of permissions: op-blockers and BLK_PERM 3.3 We may want to run several jobs on one node one day And finally, drop bs->job pointer itself. Hurrah! Suggested-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-06-18 16:41:10 +02:00
Vladimir Sementsov-Ogievskiy	68d00e4293	block/block-backend: blk_iostatus_reset: drop usage of bs->job We are going to remove bs->job pointer. Drop it's usage in blk_iostatus_reset. blk_iostatus_reset() has only two callers: 1. blk_attach_dev(). This doesn't have anything to do with jobs and attaching a new guest device won't solve any problem the job encountered, so no reason to reset the iostatus for the job. 2. qmp_cont(). This resets the iostatus for everything. We can just call block_job_iostatus_reset() for all block jobs instead of going through BlockBackend. Suggested-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-06-18 16:41:10 +02:00
Vladimir Sementsov-Ogievskiy	cc19f1773d	block/replication: drop usage of bs->job We are going to remove bs->job pointer. Drop it's usage in replication code. Additionally we have to return job pointer from some mirror APIs. Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-06-18 16:41:09 +02:00
Max Reitz	1adb0b5e0f	blkdebug: Inject errors on .bdrv_co_block_status() Signed-off-by: Max Reitz <mreitz@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Message-id: 20190507203508.18026-6-mreitz@redhat.com Signed-off-by: Max Reitz <mreitz@redhat.com>	2019-06-14 14:16:57 +02:00
Max Reitz	f8cec157cb	blkdebug: Add "none" event Together with @iotypes and @sector, this can be used to trap e.g. the first read or write access to a certain sector without having to know what happens internally in the block layer, i.e. which "real" events happen right before such an access. Signed-off-by: Max Reitz <mreitz@redhat.com> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Message-id: 20190507203508.18026-5-mreitz@redhat.com Signed-off-by: Max Reitz <mreitz@redhat.com>	2019-06-14 14:16:57 +02:00
Max Reitz	16789db3de	blkdebug: Add @iotype error option This new error option allows users of blkdebug to inject errors only on certain kinds of I/O operations. Users usually want to make a very specific operation fail, not just any; but right now they simply hope that the event that triggers the error injection is followed up with that very operation. That may not be true, however, because the block layer is changing (including blkdebug, which may increase the number of types of I/O operations on which to inject errors). The new option's default has been chosen to keep backwards compatibility. Note that similar to the internal representation, we could choose to expose this option as a list of I/O types. But there is no practical use for this, because as described above, users usually know exactly which kind of operation they want to make fail, so there is no need to specify multiple I/O types at once. In addition, exposing this option as a list would require non-trivial changes to qemu_opts_absorb_qdict(). Signed-off-by: Max Reitz <mreitz@redhat.com> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Message-id: 20190507203508.18026-4-mreitz@redhat.com Signed-off-by: Max Reitz <mreitz@redhat.com>	2019-06-14 14:16:57 +02:00
Vladimir Sementsov-Ogievskiy	611ae1d716	block/nbd: merge NBDClientSession struct back to BDRVNBDState No reason to keep it separate, it differs from others block driver behavior and therefore confuses. Instead of generic 'state = (State*)bs->opaque' we have to use special helper. Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Message-Id: <20190611102720.86114-4-vsementsov@virtuozzo.com> Reviewed-by: Eric Blake <eblake@redhat.com> Signed-off-by: Eric Blake <eblake@redhat.com>	2019-06-13 10:00:42 -05:00
Vladimir Sementsov-Ogievskiy	86f8cdf3db	block/nbd: merge nbd-client.* to nbd.c No reason for keeping driver handlers realization separate from driver structure. We can get rid of extra header file. While being here, fix comments style, restore forgotten comments for NBD_FOREACH_REPLY_CHUNK and nbd_reply_chunk_iter_receive, remove extra includes. Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Message-Id: <20190611102720.86114-3-vsementsov@virtuozzo.com> Reviewed-by: Eric Blake <eblake@redhat.com> Signed-off-by: Eric Blake <eblake@redhat.com>	2019-06-13 09:55:09 -05:00
Vladimir Sementsov-Ogievskiy	0a93b359db	block/nbd-client: drop stale logout Drop one on failure path (we have errp) and turn two others into trace points. Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Message-Id: <20190611102720.86114-2-vsementsov@virtuozzo.com> Reviewed-by: Eric Blake <eblake@redhat.com> Signed-off-by: Eric Blake <eblake@redhat.com>	2019-06-13 09:35:53 -05:00
Stefano Garzarella	2ea8e96da2	block/gluster: update .help of BLOCK_OPT_PREALLOC option Add missing 'falloc' among the allowed values of 'preallocation' option; show it and 'full' only when they are supported. ('falloc' is supported if defined CONFIG_GLUSTERFS_FALLOCATE, 'full' is supported if defined CONFIG_GLUSTERFS_ZEROFILL) Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Markus Armbruster <armbru@redhat.com> Message-Id: <20190524075848.23781-4-sgarzare@redhat.com> Signed-off-by: Markus Armbruster <armbru@redhat.com>	2019-06-12 18:32:32 +02:00
Stefano Garzarella	abea00533f	block/file-posix: update .help of BLOCK_OPT_PREALLOC option Show 'falloc' among the allowed values of 'preallocation' option, only when it is supported (if defined CONFIG_POSIX_FALLOCATE) Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Markus Armbruster <armbru@redhat.com> Message-Id: <20190524075848.23781-3-sgarzare@redhat.com> Signed-off-by: Markus Armbruster <armbru@redhat.com>	2019-06-12 18:31:46 +02:00
Markus Armbruster	a8d2532645	Include qemu-common.h exactly where needed No header includes qemu-common.h after this commit, as prescribed by qemu-common.h's file comment. Signed-off-by: Markus Armbruster <armbru@redhat.com> Message-Id: <20190523143508.25387-5-armbru@redhat.com> [Rebased with conflicts resolved automatically, except for include/hw/arm/xlnx-zynqmp.h hw/arm/nrf51_soc.c hw/arm/msf2-soc.c block/qcow2-refcount.c block/qcow2-cluster.c block/qcow2-cache.c target/arm/cpu.h target/lm32/cpu.h target/m68k/cpu.h target/mips/cpu.h target/moxie/cpu.h target/nios2/cpu.h target/openrisc/cpu.h target/riscv/cpu.h target/tilegx/cpu.h target/tricore/cpu.h target/unicore32/cpu.h target/xtensa/cpu.h; bsd-user/main.c and net/tap-bsd.c fixed up]	2019-06-12 13:20:20 +02:00
Markus Armbruster	0b8fa32f55	Include qemu/module.h where needed, drop it from qemu-common.h Signed-off-by: Markus Armbruster <armbru@redhat.com> Message-Id: <20190523143508.25387-4-armbru@redhat.com> [Rebased with conflicts resolved automatically, except for hw/usb/dev-hub.c hw/misc/exynos4210_rng.c hw/misc/bcm2835_rng.c hw/misc/aspeed_scu.c hw/display/virtio-vga.c hw/arm/stm32f205_soc.c; ui/cocoa.m fixed up]	2019-06-12 13:18:33 +02:00
Markus Armbruster	856dfd8a03	qemu-common: Move qemu_isalnum() etc. to qemu/ctype.h Signed-off-by: Markus Armbruster <armbru@redhat.com> Message-Id: <20190523143508.25387-3-armbru@redhat.com> Reviewed-by: Richard Henderson <richard.henderson@linaro.org>	2019-06-11 20:22:09 +02:00
Vladimir Sementsov-Ogievskiy	d93e572688	block/io: bdrv_pdiscard: support int64_t bytes parameter This fixes at least one overflow in qcow2_process_discards, which passes 64bit region length to bdrv_pdiscard where bytes (or sectors in the past) parameter is int since its introduction in `0b919fae`. Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-06-04 16:55:58 +02:00
Vladimir Sementsov-Ogievskiy	1477b6c803	block/qcow2-refcount: add trace-point to qcow2_process_discards Let's at least trace ignored failure. Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Reviewed-by: Eric Blake <eblake@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-06-04 16:55:58 +02:00
Kevin Wolf	d0ee0204f4	block: Remove wrong bdrv_set_aio_context() calls The mirror and commit block jobs use bdrv_set_aio_context() to move their filter node into the right AioContext before hooking it up in the graph. Similarly, bdrv_open_backing_file() explicitly moves the backing file node into the right AioContext first. This isn't necessary any more, they get automatically moved into the right context now when attaching them. However, in the case of bdrv_open_backing_file() with a node reference, it's actually not only unnecessary, but even wrong: The unchecked bdrv_set_aio_context() changes the AioContext of the child node even if other parents require it to retain the old context. So this is not only a simplification, but a bug fix, too. Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1684342 Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-06-04 15:22:22 +02:00
Kevin Wolf	132ada80c4	block: Adjust AioContexts when attaching nodes So far, we only made sure that updating the AioContext of a node affected the whole subtree. However, if a node is newly attached to a new parent, we also need to make sure that both the subtree of the node and the parent are in the same AioContext. This tries to move the new child node to the parent AioContext and returns an error if this isn't possible. BlockBackends now actually apply their AioContext to their root node. Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-06-04 15:22:22 +02:00
Kevin Wolf	d861ab3acf	block: Add BlockBackend.ctx This adds a new parameter to blk_new() which requires its callers to declare from which AioContext this BlockBackend is going to be used (or the locks of which AioContext need to be taken anyway). The given context is only stored and kept up to date when changing AioContexts. Actually applying the stored AioContext to the root node is saved for another commit. Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-06-04 15:22:22 +02:00
Kevin Wolf	97896a4887	block: Add Error to blk_set_aio_context() Add an Error parameter to blk_set_aio_context() and use bdrv_child_try_set_aio_context() internally to check whether all involved nodes can actually support the AioContext switch. Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-06-04 15:22:22 +02:00
Julia Suvorova	2b02fd81de	block/linux-aio: Drop unused BlockAIOCB submission method Callback-based laio_submit() and laio_cancel() were left after rewriting Linux AIO backend to coroutines in hope that they would be used in other code that could bypass coroutines. They can be safely removed because they have not been used since that time. Signed-off-by: Julia Suvorova <jusual@mail.ru> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-06-04 15:20:41 +02:00
Max Reitz	5cb2737e92	block/io: Delay decrementing the quiesce_counter When ending a drained section, bdrv_do_drained_end() currently first decrements the quiesce_counter, and only then actually ends the drain. The bdrv_drain_invoke(bs, false) call may cause graph changes. Say the graph change involves replacing an existing BB's ("blk") BDS (blk_bs(blk)) by @bs. Let us introducing the following values: - bs_oqc = old_quiesce_counter (so bs->quiesce_counter == bs_oqc - 1) - obs_qc = blk_bs(blk)->quiesce_counter (before bdrv_drain_invoke()) Let us assume there is no blk_pread_unthrottled() involved, so blk->quiesce_counter == obs_qc (before bdrv_drain_invoke()). Now replacing blk_bs(blk) by @bs will reduce blk->quiesce_counter by obs_qc (making it 0) and increase it by bs_oqc-1 (making it bs_oqc-1). bdrv_drain_invoke() returns and we invoke bdrv_parent_drained_end(). This will decrement blk->quiesce_counter by one, so it would be -1 -- were there not an assertion against that in blk_root_drained_end(). We therefore have to keep the quiesce_counter up at least until bdrv_drain_invoke() returns, so that bdrv_parent_drained_end() does the right thing for the parents @bs got during bdrv_drain_invoke(). But let us delay it even further, namely until bdrv_parent_drained_end() returns, because then it mirrors bdrv_do_drained_begin(): There, we first increment the quiesce_counter, then begin draining the parents, and then call bdrv_drain_invoke(). It makes sense to let bdrv_do_drained_end() unravel this exactly in reverse. Signed-off-by: Max Reitz <mreitz@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-06-04 15:20:41 +02:00
Vladimir Sementsov-Ogievskiy	69f47505ee	block: avoid recursive block_status call if possible drv_co_block_status digs bs->file for additional, more accurate search for hole inside region, reported as DATA by bs since `5daa74a6eb`. This accuracy is not free: assume we have qcow2 disk. Actually, qcow2 knows, where are holes and where is data. But every block_status request calls lseek additionally. Assume a big disk, full of data, in any iterative copying block job (or img convert) we'll call lseek(HOLE) on every iteration, and each of these lseeks will have to iterate through all metadata up to the end of file. It's obviously ineffective behavior. And for many scenarios we don't need this lseek at all. However, lseek is needed when we have metadata-preallocated image. So, let's detect metadata-preallocation case and don't dig qcow2's protocol file in other cases. The idea is to compare allocation size in POV of filesystem with allocations size in POV of Qcow2 (by refcounts). If allocation in fs is significantly lower, consider it as metadata-preallocation case. 102 iotest changed, as our detector can't detect shrinked file as metadata-preallocation, which don't seem to be wrong, as with metadata preallocation we always have valid file length. Two other iotests have a slight change in their QMP output sequence: Active 'block-commit' returns earlier because the job coroutine yields earlier on a blocking operation. This operation is loading the refcount blocks in qcow2_detect_metadata_preallocation(). Suggested-by: Denis V. Lunev <den@openvz.org> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-06-04 15:20:41 +02:00
Peter Maydell	62f6849e7a	Pull request -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEE+ber27ys35W+dsvQfe+BBqr8OQ4FAlztyykACgkQfe+BBqr8 OQ4EaA//QQVDpIBRcMN+LeKWeEs8VSLziPUrZuFvuhMEEnjnaU6gbKq8G8xbFQ62 JIHg0DBGhTt8ymE9Ay6O/cooR8F0z+XyfDr7UlpI7JL/Uwl7JguGKQrWUYBRMqCv Q2cLaWStLkfdkuW7Y3WRc16VEnIlizDxjRzfjE2ESYpuzD2fFsBY3KZbgbJwYwZw SujWUQ3MdsNdw5kDmerlrDUy7r/eyl2cLXyIt6ClHNoqq392oGMoUn4XbsaLnCWE H5s46qm33eXtvBHqxVGoOMAli5FwCnhwF+H3xg93jIG6vC/RXQYCIhlEmEwKyrU2 g2DWWe/8+9b0iX+zTIcAPTcn1pmjVivGRorOurP0AtMtjV/8PvV+hAQQeSg2ARB3 rLpXaEphD4WTwu7mYlZ5kX0qvX2SftaMU08k1IgR3mfo8Z3X9znVoFIv8HLlHuy+ OhCmwT5OWYw4mNABTXeBMH/Dcs9EcU4+T/KhAGLReHo18CSyjeT2xsT+XCsETagF KlAP88dP0EdJ9Oiccyb8as22u7ygKWIiDYPplBdb4SkKg/koQnYGDjeDAzB2vXS3 cGVhGJD2DBbcePA8iaCfWzsSCDOTBFQLa45uhPD3DnkAJylhecSsiDQP+IrLslK3 h/8v9e8MAlHMgrueSnS7foMDI9rdrTNsChuNCJWOOaUI/ZWnXFg= =kCrN -----END PGP SIGNATURE----- Merge remote-tracking branch 'remotes/jnsnow/tags/bitmaps-pull-request' into staging Pull request # gpg: Signature made Wed 29 May 2019 00:58:33 BST # gpg: using RSA key F9B7ABDBBCACDF95BE76CBD07DEF8106AAFC390E # gpg: Good signature from "John Snow (John Huston) <jsnow@redhat.com>" [full] # Primary key fingerprint: FAEB 9711 A12C F475 812F 18F2 88A9 064D 1835 61EB # Subkey fingerprint: F9B7 ABDB BCAC DF95 BE76 CBD0 7DEF 8106 AAFC 390E * remotes/jnsnow/tags/bitmaps-pull-request: iotests: test external snapshot with bitmap copying qapi: support external bitmaps in block-dirty-bitmap-merge migration/dirty-bitmaps: change bitmap enumeration method Signed-off-by: Peter Maydell <peter.maydell@linaro.org>	2019-05-30 12:10:27 +01:00
Vladimir Sementsov-Ogievskiy	eff0829b07	qapi: support external bitmaps in block-dirty-bitmap-merge Add new optional parameter making possible to merge bitmaps from different nodes. It is needed to maintain external snapshots during incremental backup chain history. Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Reviewed-by: John Snow <jsnow@redhat.com> Message-id: 20190517152111.206494-2-vsementsov@virtuozzo.com Signed-off-by: John Snow <jsnow@redhat.com>	2019-05-28 19:33:31 -04:00
Andrey Shinkevich	6388903e7c	qcow2-bitmap: initialize bitmap directory alignment Valgrind detects multiple issues in QEMU iotests when the memory is used without being initialized. Valgrind may dump lots of unnecessary reports what makes the memory issue analysis harder. Particularly, that is true for the aligned bitmap directory and can be seen while running the iotest #169. Padding the aligned space with zeros eases the pain. Signed-off-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com> Message-id: 1558961521-131620-1-git-send-email-andrey.shinkevich@virtuozzo.com Signed-off-by: Max Reitz <mreitz@redhat.com>	2019-05-28 20:30:55 +02:00
Anton Nefedov	c8bb23cbdb	qcow2: skip writing zero buffers to empty COW areas If COW areas of the newly allocated clusters are zeroes on the backing image, efficient bdrv_write_zeroes(flags=BDRV_REQ_NO_FALLBACK) can be used on the whole cluster instead of writing explicit zero buffers later in perform_cow(). iotest 060: write to the discarded cluster does not trigger COW anymore. Use a backing image instead. Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com> Message-id: 20190516142749.81019-2-anton.nefedov@virtuozzo.com Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Reviewed-by: Alberto Garcia <berto@igalia.com> Signed-off-by: Max Reitz <mreitz@redhat.com>	2019-05-28 20:30:55 +02:00
Alberto Garcia	b441dc71c0	block: Make bdrv_root_attach_child() unref child_bs on failure A consequence of the previous patch is that bdrv_attach_child() transfers the reference to child_bs from the caller to parent_bs, which will drop it on bdrv_close() or when someone calls bdrv_unref_child(). But this only happens when bdrv_attach_child() succeeds. If it fails then the caller is responsible for dropping the reference to child_bs. This patch makes bdrv_attach_child() take the reference also when there is an error, freeing the caller for having to do it. A similar situation happens with bdrv_root_attach_child(), so the changes on this patch affect both functions. Signed-off-by: Alberto Garcia <berto@igalia.com> Message-id: 20dfb3d9ccec559cdd1a9690146abad5d204a186.1557754872.git.berto@igalia.com [mreitz: Removed now superfluous BdrvChild * variable in bdrv_open_child()] Signed-off-by: Max Reitz <mreitz@redhat.com>	2019-05-28 20:30:55 +02:00
Vladimir Sementsov-Ogievskiy	ae6b12fa4c	block/backup: refactor: split out backup_calculate_cluster_size Split out cluster_size calculation. Move copy-bitmap creation above block-job creation, as we are going to share it with upcoming backup-top filter, which also should be created before actual block job creation. Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Message-id: 20190429090842.57910-6-vsementsov@virtuozzo.com [mreitz: Dropped a paragraph from the commit message that was left over from a previous version] Signed-off-by: Max Reitz <mreitz@redhat.com>	2019-05-28 20:30:55 +02:00
Vladimir Sementsov-Ogievskiy	c334e897d0	block/backup: unify different modes code path Do full, top and incremental mode copying all in one place. This unifies the code path and helps further improvements. Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Message-id: 20190429090842.57910-5-vsementsov@virtuozzo.com Signed-off-by: Max Reitz <mreitz@redhat.com>	2019-05-28 20:30:55 +02:00
Vladimir Sementsov-Ogievskiy	9eb5a248f3	block/backup: refactor and tolerate unallocated cluster skipping Split allocation checking to separate function and reduce nesting. Consider bdrv_is_allocated() fail as allocated area, as copying more than needed is not wrong (and we do it anyway) and seems better than fail the whole job. And, most probably we will fail on the next read, if there are real problem with source. Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Message-id: 20190429090842.57910-4-vsementsov@virtuozzo.com Signed-off-by: Max Reitz <mreitz@redhat.com>	2019-05-28 20:30:55 +02:00
Vladimir Sementsov-Ogievskiy	a8389e315e	block/backup: move to copy_bitmap with granularity We are going to share this bitmap between backup and backup-top filter driver, so let's share something more meaningful. It also simplifies some calculations. Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Message-id: 20190429090842.57910-3-vsementsov@virtuozzo.com Signed-off-by: Max Reitz <mreitz@redhat.com>	2019-05-28 20:30:55 +02:00
Vladimir Sementsov-Ogievskiy	c2da3413c0	block/backup: simplify backup_incremental_init_copy_bitmap Simplify backup_incremental_init_copy_bitmap using the function bdrv_dirty_bitmap_next_dirty_area. Note: move to job->len instead of bitmap size: it should not matter but less code. Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Message-id: 20190429090842.57910-2-vsementsov@virtuozzo.com Signed-off-by: Max Reitz <mreitz@redhat.com>	2019-05-28 20:30:55 +02:00
Vladimir Sementsov-Ogievskiy	8ac0f15f33	qcow2: do encryption in threads Do encryption/decryption in threads, like it is already done for compression. This improves asynchronous encrypted io. Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Reviewed-by: Alberto Garcia <berto@igalia.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Message-id: 20190506142741.41731-9-vsementsov@virtuozzo.com Signed-off-by: Max Reitz <mreitz@redhat.com>	2019-05-28 20:30:55 +02:00
Vladimir Sementsov-Ogievskiy	5447c3a03f	qcow2: bdrv_co_pwritev: move encryption code out of the lock Encryption will be done in threads, to take benefit of it, we should move it out of the lock first. Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Reviewed-by: Alberto Garcia <berto@igalia.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Message-id: 20190506142741.41731-8-vsementsov@virtuozzo.com Signed-off-by: Max Reitz <mreitz@redhat.com>	2019-05-28 20:30:55 +02:00
Vladimir Sementsov-Ogievskiy	f24196d388	qcow2: qcow2_co_preadv: improve locking Background: decryption will be done in threads, to take benefit of it, we should move it out of the lock first. But let's go further: it turns out, that only qcow2_get_cluster_offset() needs locking, so reduce locking to it. Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Message-id: 20190506142741.41731-7-vsementsov@virtuozzo.com Reviewed-by: Alberto Garcia <berto@igalia.com> Signed-off-by: Max Reitz <mreitz@redhat.com>	2019-05-28 20:30:55 +02:00
Vladimir Sementsov-Ogievskiy	6f13a316dd	qcow2-threads: split out generic path Move generic part out of qcow2_co_do_compress, to reuse it for encryption and rename things that would be shared with encryption path. Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Reviewed-by: Alberto Garcia <berto@igalia.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Message-id: 20190506142741.41731-6-vsementsov@virtuozzo.com Signed-off-by: Max Reitz <mreitz@redhat.com>	2019-05-28 20:30:55 +02:00
Vladimir Sementsov-Ogievskiy	0f5636c51c	qcow2-threads: qcow2_co_do_compress: protect queuing by mutex Drop dependence on AioContext lock. Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Reviewed-by: Alberto Garcia <berto@igalia.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Message-id: 20190506142741.41731-5-vsementsov@virtuozzo.com Signed-off-by: Max Reitz <mreitz@redhat.com>	2019-05-28 20:30:55 +02:00
Vladimir Sementsov-Ogievskiy	269062efc8	qcow2-threads: use thread_pool_submit_co Use thread_pool_submit_co, instead of reinventing it here. Note, that thread_pool_submit_aio() never returns NULL, so checking it was an extra thing. Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Reviewed-by: Alberto Garcia <berto@igalia.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Message-id: 20190506142741.41731-4-vsementsov@virtuozzo.com Signed-off-by: Max Reitz <mreitz@redhat.com>	2019-05-28 20:30:55 +02:00
Vladimir Sementsov-Ogievskiy	56e2f1d898	qcow2: add separate file for threaded data processing functions Move compression-on-threads to separate file. Encryption will be in it too. Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Reviewed-by: Alberto Garcia <berto@igalia.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Message-id: 20190506142741.41731-3-vsementsov@virtuozzo.com Signed-off-by: Max Reitz <mreitz@redhat.com>	2019-05-28 20:30:55 +02:00
Vladimir Sementsov-Ogievskiy	9353db47c5	qcow2.h: add missing include qcow2.h depends on block_int.h. Compilation isn't broken currently only due to block_int.h always included before qcow2.h. Though, it seems better to directly include block_int.h in qcow2.h. Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Reviewed-by: Alberto Garcia <berto@igalia.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Message-id: 20190506142741.41731-2-vsementsov@virtuozzo.com Signed-off-by: Max Reitz <mreitz@redhat.com>	2019-05-28 20:30:55 +02:00
Max Reitz	9c3db310ff	block/file-posix: Unaligned O_DIRECT block-status Currently, qemu crashes whenever someone queries the block status of an unaligned image tail of an O_DIRECT image: $ echo > foo $ qemu-img map --image-opts driver=file,filename=foo,cache.direct=on Offset Length Mapped to File qemu-img: block/io.c:2093: bdrv_co_block_status: Assertion `pnum && QEMU_IS_ALIGNED(pnum, align) && align > offset - aligned_offset' failed. This is because bdrv_co_block_status() checks that the result returned by the driver's implementation is aligned to the request_alignment, but file-posix can fail to do so, which is actually mentioned in a comment there: "[...] possibly including a partial sector at EOF". Fix this by rounding up those partial sectors. There are two possible alternative fixes: (1) We could refuse to open unaligned image files with O_DIRECT altogether. That sounds reasonable until you realize that qcow2 does necessarily not fill up its metadata clusters, and that nobody runs qemu-img create with O_DIRECT. Therefore, unpreallocated qcow2 files usually have an unaligned image tail. (2) bdrv_co_block_status() could ignore unaligned tails. It actually throws away everything past the EOF already, so that sounds reasonable. Unfortunately, the block layer knows file lengths only with a granularity of BDRV_SECTOR_SIZE, so bdrv_co_block_status() usually would have to guess whether its file length information is inexact or whether the driver is broken. Fixing what raw_co_block_status() returns is the safest thing to do. There seems to be no other block driver that sets request_alignment and does not make sure that it always returns aligned values. Cc: qemu-stable@nongnu.org Signed-off-by: Max Reitz <mreitz@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-05-20 17:08:57 +02:00
Kevin Wolf	9ff7f0df87	blockjob: Propagate AioContext change to all job nodes Block jobs require that all of the nodes the job is using are in the same AioContext. Therefore all BdrvChild objects of the job propagate .(can_)set_aio_context to all other job nodes, so that the switch is checked and performed consistently even if both nodes are in different subtrees. Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-05-20 17:08:56 +02:00
Kevin Wolf	980b0f943a	block: Add blk_set_allow_aio_context_change() Some users (like block jobs) can tolerate an AioContext change for their BlockBackend. Add a function that tells the BlockBackend that it can allow changes. Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-05-20 17:08:56 +02:00
Kevin Wolf	38475269d4	block: Implement .(can_)set_aio_ctx for BlockBackend bdrv_try_set_aio_context() currently fails if a BlockBackend is attached to a node because it doesn't implement the BdrvChildRole callbacks for AioContext management. We can allow changing the AioContext of monitor-owned BlockBackends as long as no device is attached to them. When setting the AioContext of the root node of a BlockBackend, we now need to pass blk->root as an ignored child because we don't want the root node to recursively call back into BlockBackend and execute blk_do_set_aio_context() a second time. Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-05-20 17:08:56 +02:00
Alberto Garcia	41ae31e3d7	block: Use BDRV_REQUEST_MAX_BYTES instead of BDRV_REQUEST_MAX_SECTORS There are a few places in which we turn a number of bytes into sectors in order to compare the result against BDRV_REQUEST_MAX_SECTORS instead of using BDRV_REQUEST_MAX_BYTES directly. Signed-off-by: Alberto Garcia <berto@igalia.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-05-20 17:08:56 +02:00
Alberto Garcia	b6c246942b	qcow2: Define and use QCOW2_COMPRESSED_SECTOR_SIZE When an L2 table entry points to a compressed cluster the space used by the data is specified in 512-byte sectors. This size is independent from BDRV_SECTOR_SIZE and is specific to the qcow2 file format. The QCOW2_COMPRESSED_SECTOR_SIZE constant defined in this patch makes this explicit. Signed-off-by: Alberto Garcia <berto@igalia.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-05-20 17:08:56 +02:00
Max Reitz	50ba5b2d99	block/file-posix: Truncate in xfs_write_zeroes() XFS_IOC_ZERO_RANGE does not increase the file length: $ touch foo $ xfs_io -c 'zero 0 65536' foo $ stat -c "size=%s, blocks=%b" foo size=0, blocks=128 We do want writes beyond the EOF to automatically increase the file length, however. This is evidenced by the fact that iotest 061 is broken on XFS since qcow2's check implementation checks for blocks beyond the EOF. Reported-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Max Reitz <mreitz@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-05-20 17:08:56 +02:00
Peter Maydell	01807c8b0e	Miscellaneous patches for 2019-05-13 -----BEGIN PGP SIGNATURE----- iQIcBAABAgAGBQJc2RbiAAoJEDhwtADrkYZTfpAP/itXg+X9wRfMeerni6SAkgtz knaLtJNC7YvwH7W6TIkSf2QgMrg/nYzIXxxj6V+Y3Vdn4CD93C7ldItWkm33amlA SEhREHpAn4F3wg/LsOGYYgpuqF/wrUcZsmzudnslfMd3mM6Q9Q6J3q6mu8n1oRcl RRKlk++ElqyRdvmxFhddhPxk797Vuunh76vd3ARUFmzKs2n7CGkeBu+qbk41VqI9 YtjmWHO6BDY5b01PvjuWPir6n1yJsYcpfo3ZElZvPf5jQHMmO6fGB3SZc/PIWegq gAVeoXtwhNm+nywMpIv1wHQMkvRDZW0wrurIQBc4VGpH1Pa90dR9FNVZ8r0OZqPB aErPCdC7ED73uzJwzXKTnLxY0XDgdhsAsW7lFggANs6YyewZNcbDaVhZWsopTTK/ 3jBbddIw2RsfHNQgXlFVVzjZJGHBNHxFjAFASCKcapUWQwDKU42kQrS1GqxG56NI Lgi8Ce+Q0GsVF4wme96Oa/8EMRfmNvsHMfWQvmqGqA1OACSOf2PSGCeD618A5gq6 kV6wF4v5HdGFkc0x9Vr5ur7kv3eOhpzFzBM6XJXe3CyqnYrkNuBldkyGZBbrNY7G aW5sR26Is4m9i+7159cNB5LmnfQqtsscibkSC0UQiXcuWgevd6cdiF+0r1YuNp7C Faa2yPOHs4mHCjUwade9 =yKfZ -----END PGP SIGNATURE----- Merge remote-tracking branch 'remotes/armbru/tags/pull-misc-2019-05-13' into staging Miscellaneous patches for 2019-05-13 # gpg: Signature made Mon 13 May 2019 08:04:02 BST # gpg: using RSA key 3870B400EB918653 # gpg: Good signature from "Markus Armbruster <armbru@redhat.com>" [full] # gpg: aka "Markus Armbruster <armbru@pond.sub.org>" [full] # Primary key fingerprint: 354B C8B3 D7EB 2A6B 6867 4E5F 3870 B400 EB91 8653 * remotes/armbru/tags/pull-misc-2019-05-13: Clean up decorations and whitespace around header guards Normalize header guard symbol definition. Clean up ill-advised or unusual header guards Clean up header guards that don't match their file name target/xtensa: Clean up core-isa.h header guards linux-user/nios2 linux-user/riscv: Clean up header guards authz: Normalize #include "authz/trace.h" to "trace.h" Use #include "..." for our own headers, <...> for others Clean up includes Signed-off-by: Peter Maydell <peter.maydell@linaro.org>	2019-05-13 13:55:13 +01:00
Markus Armbruster	a8b991b52d	Clean up ill-advised or unusual header guards Leading underscores are ill-advised because such identifiers are reserved. Trailing underscores are merely ugly. Strip both. Our header guards commonly end in _H. Normalize the exceptions. Done with scripts/clean-header-guards.pl. Signed-off-by: Markus Armbruster <armbru@redhat.com> Message-Id: <20190315145123.28030-7-armbru@redhat.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com> [Changes to slirp/ dropped, as we're about to spin it off]	2019-05-13 08:58:55 +02:00
Alberto Garcia	433e8e3b22	qcow2: Remove BDRVQcow2State.cluster_sectors The last user of this field disappeared when we replace the sector-based bdrv_write() with the byte-based bdrv_pwrite(). Signed-off-by: Alberto Garcia <berto@igalia.com> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-05-10 16:45:40 +02:00
Alberto Garcia	2e11d7562a	block: Remove bdrv_read() and bdrv_write() No one is using these functions anymore, all callers have switched to the byte-based bdrv_pread() and bdrv_pwrite() Signed-off-by: Alberto Garcia <berto@igalia.com> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-05-10 16:45:40 +02:00
Alberto Garcia	e5a0a6784a	vvfat: Replace bdrv_{read,write}() with bdrv_{pread,pwrite}() There's only a couple of bdrv_read() and bdrv_write() calls left in the vvfat code, and they can be trivially replaced with the byte-based bdrv_pread() and bdrv_pwrite(). Signed-off-by: Alberto Garcia <berto@igalia.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-05-10 16:45:40 +02:00
Alberto Garcia	d4f189713f	vdi: Replace bdrv_{read,write}() with bdrv_{pread,pwrite}() There's only a couple of bdrv_read() and bdrv_write() calls left in the vdi code, and they can be trivially replaced with the byte-based bdrv_pread() and bdrv_pwrite(). Signed-off-by: Alberto Garcia <berto@igalia.com> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-05-10 16:45:40 +02:00
Alberto Garcia	e3b4257d03	qcow2: Replace bdrv_write() with bdrv_pwrite() There's only one bdrv_write() call left in the qcow2 code, and it can be trivially replaced with the byte-based bdrv_pwrite(). Signed-off-by: Alberto Garcia <berto@igalia.com> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-05-10 16:45:40 +02:00
Andrey Shinkevich	118f99442d	block/io.c: fix for the allocation failure On a file system used by the customer, fallocate() returns an error if the block is not properly aligned. So, bdrv_co_pwrite_zeroes() fails. We can handle that case the same way as it is done for the unsupported cases, namely, call to bdrv_driver_pwritev() that writes zeroes to an image for the unaligned chunk of the block. Suggested-by: Denis V. Lunev <den@openvz.org> Signed-off-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com> Reviewed-by: John Snow <jsnow@redhat.com> Message-id: 1554474244-553661-1-git-send-email-andrey.shinkevich@virtuozzo.com Message-Id: <1554474244-553661-1-git-send-email-andrey.shinkevich@virtuozzo.com> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>	2019-05-10 10:53:21 +01:00
Alberto Garcia	6a63419980	commit: Use bdrv_append() in commit_start() This function combines bdrv_set_backing_hd() and bdrv_replace_node() so we can use it to simplify the code a bit in commit_start(). Signed-off-by: Alberto Garcia <berto@igalia.com> Message-id: 20190403143748.9790-1-berto@igalia.com Signed-off-by: Max Reitz <mreitz@redhat.com>	2019-05-07 17:14:21 +02:00
Max Reitz	21205c7c3b	block/ssh: Implement .bdrv_dirname() ssh_bdrv_dirname() is basically the generic bdrv_dirname(), except it takes care not to silently chop off any query string (i.e., host_key_check). Signed-off-by: Max Reitz <mreitz@redhat.com> Tested-by: Richard W.M. Jones <rjones@redhat.com> Message-id: 20190225190828.17726-3-mreitz@redhat.com Signed-off-by: Max Reitz <mreitz@redhat.com>	2019-05-07 17:14:21 +02:00
Max Reitz	b8c1f90118	block/ssh: Implement .bdrv_refresh_filename() This requires some changes to keep iotests 104 and 207 working. qemu-img info in 104 will now return a filename including the user name and the port, which need to be filtered by adjusting REMOTE_TEST_DIR in common.rc. This additional information has to be marked optional, however (which is simple as REMOTE_TEST_DIR is a regex), because otherwise 197 and 215 would fail: They use it (indirectly) to filter qemu-img create output which contains a backing filename they have passed to it -- which probably does not contain a user name or port number. The problem in 207 is a nice one to have: qemu-img info used to return json:{} filenames, but with this patch it returns nice plain ones. We now need to adjust the filtering to hide the user name (and port number while we are at it). The simplest way to do this is to include both in iotests.remote_filename() so that bdrv_refresh_filename() will not change it, and then iotests.img_info_log() will filter it correctly automatically. Signed-off-by: Max Reitz <mreitz@redhat.com> Tested-by: Richard W.M. Jones <rjones@redhat.com> Message-id: 20190225190828.17726-2-mreitz@redhat.com Signed-off-by: Max Reitz <mreitz@redhat.com>	2019-05-07 17:14:21 +02:00
Andrey Shinkevich	444b82369b	qcow2: discard bitmap when removed Bitmap data may take a lot of disk space, so it's better to discard it always. Signed-off-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com> Message-id: 1551346019-293202-1-git-send-email-andrey.shinkevich@virtuozzo.com Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> [mreitz: Use the commit message proposed by Vladimir] Signed-off-by: Max Reitz <mreitz@redhat.com>	2019-05-07 17:14:21 +02:00
Vladimir Sementsov-Ogievskiy	54b10010eb	qcow2-refcount: don't mask corruptions under internal errors No reasons for not reporting found corruptions as corruptions in case of some internal errors, especially in case of just failed to fix l2 entry (and in this case, missed corruptions may influence comparing logic, when we calculate difference between corruptions fields of two results) Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Message-id: 20190227131433.197063-6-vsementsov@virtuozzo.com Reviewed-by: Max Reitz <mreitz@redhat.com> Signed-off-by: Max Reitz <mreitz@redhat.com>	2019-05-07 17:14:21 +02:00
Vladimir Sementsov-Ogievskiy	cbb51e9f93	qcow2-refcount: check_refcounts_l2: don't count fixed cluster as allocated Do not count a cluster which is fixed to be ZERO as allocated. Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Message-id: 20190227131433.197063-5-vsementsov@virtuozzo.com Signed-off-by: Max Reitz <mreitz@redhat.com>	2019-05-07 17:14:21 +02:00
Vladimir Sementsov-Ogievskiy	1ef337b7a0	qcow2-refcount: check_refcounts_l2: reduce ignored overlaps Reduce number of structures ignored in overlap check: when checking active table ignore active tables, when checking inactive table ignore inactive ones. Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Message-id: 20190227131433.197063-4-vsementsov@virtuozzo.com Signed-off-by: Max Reitz <mreitz@redhat.com>	2019-05-07 17:14:21 +02:00
Vladimir Sementsov-Ogievskiy	a5fff8d4b4	qcow2-refcount: avoid eating RAM qcow2_inc_refcounts_imrt() (through realloc_refcount_array()) can eat an unpredictable amount of memory on corrupted table entries, which are referencing regions far beyond the end of file. Prevent this, by skipping such regions from further processing. Interesting that iotest 138 checks exactly the behavior which we fix here. So, change the test appropriately. Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Message-id: 20190227131433.197063-3-vsementsov@virtuozzo.com Signed-off-by: Max Reitz <mreitz@redhat.com>	2019-05-07 17:14:21 +02:00
Vladimir Sementsov-Ogievskiy	7e3e736cbd	qcow2-refcount: fix check_oflag_copied Increase corruptions_fixed only after successful fix. Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Message-id: 20190227131433.197063-2-vsementsov@virtuozzo.com Signed-off-by: Max Reitz <mreitz@redhat.com>	2019-05-07 17:14:21 +02:00
Vladimir Sementsov-Ogievskiy	54277a2aab	block/qed: add missed coroutine_fn markers qed_read_table and qed_write_table use coroutine-only interfaces but are not marked coroutine_fn. Happily, they are called only from coroutine context, so we only need to add missed markers. Reported-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-04-30 15:29:00 +02:00
Alberto Garcia	065abf9f2b	commit: Make base read-only if there is an early failure You can reproduce this by passing an invalid filter-node-name (like "1234") to block-commit. In this case the base image is put in read-write mode but is never reset back to read-only. Signed-off-by: Alberto Garcia <berto@igalia.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-04-30 15:29:00 +02:00
Vladimir Sementsov-Ogievskiy	f4326aefcf	block/stream: use buffer-based io Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-04-30 15:29:00 +02:00
Vladimir Sementsov-Ogievskiy	08b6261f34	block/commit: use buffer-based io Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-04-30 15:29:00 +02:00
Vladimir Sementsov-Ogievskiy	607dbdc4e0	block/backup: use buffer-based io Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-04-30 15:29:00 +02:00
Vladimir Sementsov-Ogievskiy	a4072543cc	block/parallels: use buffer-based io Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-04-30 15:29:00 +02:00
Vladimir Sementsov-Ogievskiy	696e8cb292	block/qed: use buffer-based io Move to _co_ versions of io functions qed_read_table() and qed_write_table(), as we use qemu_co_mutex_unlock() anyway. Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-04-30 15:29:00 +02:00
Vladimir Sementsov-Ogievskiy	4ed3e0c486	block/qcow: use buffer-based io Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-04-30 15:29:00 +02:00
Vladimir Sementsov-Ogievskiy	b00cb15bda	block/qcow2: use buffer-based io Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-04-30 15:29:00 +02:00
Alberto Garcia	e1f4a37a49	qcow2: Fix error handling in the compression code This patch fixes a few things in the way error codes are handled in the qcow2 compression code: a) qcow2_co_pwritev_compressed() expects qcow2_co_compress() to only return -1 or -2 on failure, but this is not correct. Since the change from qcow2_compress() to qcow2_co_compress() in commit `ceb029cd6f` the new code can also return -EINVAL (although there does not seem to exist any code path that would cause that error in the current implementation). b) -1 and -2 are ad-hoc error codes defined in qcow2_compress(). This patch replaces them with standard constants from errno.h. c) Both qcow2_compress() and qcow2_co_do_compress() return a negative value on failure, but qcow2_co_pwritev_compressed() stores the value in an unsigned data type. Signed-off-by: Alberto Garcia <berto@igalia.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-04-30 15:29:00 +02:00
Kevin Wolf	db04524f82	qcow2: Fix qcow2_make_empty() with external data file make_completely_empty() is an optimisated path for bdrv_make_empty() where completely new metadata is created inside the image file instead of going through all clusters and discarding them. For an external data file, however, we actually need to do discard operations on the data file; just overwriting the qcow2 file doesn't get rid of the data. The necessary slow path with an explicit discard operation already exists for other cases. Use it for external data files, too. Cc: qemu-stable@nongnu.org Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com>	2019-04-30 15:29:00 +02:00
Kevin Wolf	718c0fce2f	qcow2: Fix full preallocation with external data file preallocate_co() already gave the data file the full size without forwarding the requested preallocation mode to the protocol. When bdrv_co_truncate() was called later with the preallocation mode, the file didn't actually grow any more, so the data file stayed unallocated even if full preallocation was requested. Pass the right preallocation mode to preallocate_co() and remove the second bdrv_co_truncate() to fix this. As a side effect, the ugly one-byte write in preallocate_co() is replaced with a truncate call, now leaving the last block unallocated on the protocol level as it should be. Cc: qemu-stable@nongnu.org Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com>	2019-04-30 15:29:00 +02:00
Kevin Wolf	360bd07471	qcow2: Add errp to preallocate_co() We'll add a bdrv_co_truncate() call in the next patch which can return an Error that we don't want to discard. So add an errp parameter to preallocate_co(). Cc: qemu-stable@nongnu.org Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com>	2019-04-30 15:29:00 +02:00
Kevin Wolf	f29fbf7c6b	qcow2: Avoid COW during metadata preallocation Limiting the allocation to INT_MAX bytes isn't particularly clever because it means that the final cluster will be a partial cluster which will be completed through a COW operation. This results in unnecessary data read and write requests which lead to an unwanted non-sparse filesystem block for metadata preallocation. Align the maximum allocation size down to the cluster size to avoid this situation. Cc: qemu-stable@nongnu.org Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com>	2019-04-30 15:29:00 +02:00
Eric Blake	de38b5005e	qemu-img: Saner printing of large file sizes Disk sizes close to INT64_MAX cause overflow, for some pretty ridiculous output: $ ./nbdkit -U - memory size=$((2**63 - 512)) --run 'qemu-img info $nbd' image: nbd+unix://?socket=/tmp/nbdkitHSAzNz/socket file format: raw virtual size: -8388607T (9223372036854775296 bytes) disk size: unavailable But there's no reason to have two separate implementations of integer to human-readable abbreviation, where one has overflow and stops at 'T', while the other avoids overflow and goes all the way to 'E'. With this patch, the output now claims 8EiB instead of -8388607T, which really is the correct rounding of largest file size supported by qemu (we could go 511 bytes larger if we used byte-accurate sizing instead of rounding up to the next sector boundary, but that wouldn't change the human-readable result). Quite a few iotests need updates to expected output to match. Reported-by: Richard W.M. Jones <rjones@redhat.com> Signed-off-by: Eric Blake <eblake@redhat.com> Tested-by: Richard W.M. Jones <rjones@redhat.com> Reviewed-by: Alberto Garcia <berto@igalia.com> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com> Tested-by: Max Reitz <mreitz@redhat.com>	2019-04-30 15:29:00 +02:00
Stefano Garzarella	0cb98af218	block/vhdx: Use IEC binary prefixes for size constants Using IEC binary prefixes in order to make the code more readable, with the exception of DEFAULT_LOG_SIZE because it's passed to stringify(). Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org> Reviewed-by: John Snow <jsnow@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-04-30 15:29:00 +02:00
Stefano Garzarella	e9991e29ea	block/vhdx: Remove redundant IEC binary prefixes definition IEC binary prefixes are already defined in "qemu/units.h", so we can remove redundant definitions in "block/vhdx.h". Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org> Reviewed-by: John Snow <jsnow@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-04-30 15:29:00 +02:00
Sam Eiderman	7502be838e	vmdk: Set vmdk parent backing_format to vmdk Commit `b69864e5a` ("vmdk: Support version=3 in VMDK descriptor files") fixed the probe function to correctly guess vmdk descriptors with version=3. This solves the issue where vmdk snapshot with parent vmdk descriptor containing "version=3" would be treated as raw instead vmdk. In the future case where a new vmdk version is introduced, we will again experience this issue, even if the user will provide "-f vmdk" it will only apply to the tip image and not to the underlying "misprobed" parent image. The code in vmdk.c already assumes that the backing file of vmdk must be vmdk (see vmdk_is_cid_valid which returns 0 if backing file is not vmdk). So let's make it official by supplying the backing_format as vmdk. Reviewed-by: Mark Kanda <mark.kanda@oracle.com> Reviewed-By: Liran Alon <liran.alon@oracle.com> Reviewed-by: Arbel Moshe <arbel.moshe@oracle.com> Signed-off-by: Shmuel Eiderman <shmuel.eiderman@oracle.com> Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Fam Zheng <fam@euphon.net> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-04-30 15:29:00 +02:00
Zhengui li	126734c4f7	vpc: unlock Coroutine lock to make IO submit Concurrently Concurrent IO becomes serial IO because of the qemu Coroutine lock, which reduce IO performance severely. So unlock Coroutine lock before bdrv_co_pwritev and bdrv_co_preadv to fix it. Signed-off-by: Zhengui li <lizhengui@huawei.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-04-30 15:29:00 +02:00
Markus Armbruster	e1ce7d747b	block/qapi: Clean up how we print to monitor or stdout bdrv_snapshot_dump(), bdrv_image_info_specific_dump(), bdrv_image_info_dump() and their helpers take an fprintf()-like callback and a FILE * to pass to it. hmp.c passes monitor_printf() cast to fprintf_function and the current monitor cast to FILE *. qemu-img.c and qemu-io-cmds.c pass fprintf and stdout. The type-punning is technically undefined behaviour, but works in practice. Clean up: drop the callback, and call qemu_printf() instead. Signed-off-by: Markus Armbruster <armbru@redhat.com> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Message-Id: <20190417191805.28198-8-armbru@redhat.com>	2019-04-18 22:18:59 +02:00
Markus Armbruster	6b3048cee0	block/ssh: Do not report read/write/flush errors to the user Callbacks ssh_co_readv(), ssh_co_writev(), ssh_co_flush() report errors to the user with error_printf(). They shouldn't, it's their caller's job. Replace by a suitable trace point. While there, drop the unreachable !s->sftp case. Perhaps we should convert this part of the block driver interface to Error, so block drivers can pass more detail to their callers. Not today. Cc: "Richard W.M. Jones" <rjones@redhat.com> Cc: Kevin Wolf <kwolf@redhat.com> Cc: Max Reitz <mreitz@redhat.com> Cc: qemu-block@nongnu.org Signed-off-by: Markus Armbruster <armbru@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Message-Id: <20190417190641.26814-3-armbru@redhat.com>	2019-04-17 21:21:49 +02:00
Kevin Wolf	93e32b3e20	qcow2: Fix preallocation bdrv_pwrite to wrong file With an external data file, preallocate_co() must write the final byte to the external data file, not to the qcow2 image file. This is harmless for preallocation of newly created images (only the qcow2 file size is increased to the virtual disk size while it should be much smaller), but with preallocated resize, it could in theory cause visible corruption if the metadata of the image is larger than the data (e.g. lots of bitmaps). Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com>	2019-04-16 16:23:24 +02:00
Alberto Garcia	20509c4b8b	block: freeze the backing chain earlier in stream_start() Commit `6585493369` added code to freeze the backing chain from 'top' to 'base' for the duration of the block-stream job. The problem is that the freezing happens too late in stream_start(): during the bdrv_reopen_set_read_only() call earlier in that function another job can jump in and remove the base image. If that happens we have an invalid chain and QEMU crashes. This patch puts the bdrv_freeze_backing_chain() call at the beginning of the function. Signed-off-by: Alberto Garcia <berto@igalia.com> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-04-02 12:04:44 +02:00
Vladimir Sementsov-Ogievskiy	696aaaed57	block/file-posix: do not fail on unlock bytes bdrv_replace_child() calls bdrv_check_perm() with error_abort on loosening permissions. However file-locking operations may fail even in this case, for example on NFS. And this leads to Qemu crash. Let's avoid such errors. Note, that we ignore such things anyway on permission update commit and abort. Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-04-02 12:04:44 +02:00
Stefano Garzarella	de23e72bb7	block/gluster: limit the transfer size to 512 MiB Several versions of GlusterFS (3.12? -> 6.0.1) fail when the transfer size is greater or equal to 1024 MiB, so we are limiting the transfer size to 512 MiB to avoid this rare issue. Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=1691320 Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Niels de Vos <ndevos@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-04-02 12:04:44 +02:00
Eric Blake	75d34eb98c	nbd/client: Trace server noncompliance on structured reads Just as we recently added a trace for a server sending block status that doesn't match the server's advertised minimum block alignment, let's do the same for read chunks. But since qemu 3.1 is such a server (because it advertised 512-byte alignment, but when serving a file that ends in data but is not sector-aligned, NBD_CMD_READ would detect a mid-sector change between data and hole at EOF and the resulting read chunks are unaligned), we don't want to change our behavior of otherwise tolerating unaligned reads. Note that even though we fixed the server for 4.0 to advertise an actual block alignment (which gets rid of the unaligned reads at EOF for posix files), we can still trigger it via other means: $ qemu-nbd --image-opts driver=blkdebug,align=512,image.driver=file,image.filename=/path/to/non-aligned-file Arguably, that is a bug in the blkdebug block status function, for leaking a block status that is not aligned. It may also be possible to observe issues with a backing layer with smaller alignment than the active layer, although so far I have been unable to write a reliable iotest for that scenario. Signed-off-by: Eric Blake <eblake@redhat.com> Message-Id: <20190330165349.32256-1-eblake@redhat.com> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>	2019-04-01 08:58:04 -05:00
Eric Blake	4841211e0d	block: Add bdrv_get_request_alignment() The next patch needs access to a device's minimum permitted alignment, since NBD wants to advertise this to clients. Add an accessor function, borrowing from blk_get_max_transfer() for accessing a backend's block limits. Signed-off-by: Eric Blake <eblake@redhat.com> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Message-Id: <20190329042750.14704-6-eblake@redhat.com>	2019-04-01 08:46:52 -05:00
Eric Blake	9cf638508c	nbd/client: Support qemu-img convert from unaligned size If an NBD server advertises a size that is not a multiple of a sector, the block layer rounds up that size, even though we set info.size to the exact byte value sent by the server. The block layer then proceeds to let us read or query block status on the hole that it added past EOF, which the NBD server is unlikely to be happy with. Fortunately, qemu as a server never advertizes an unaligned size, so we generally don't run into this problem; but the nbdkit server makes it easy to test: $ printf %1000d 1 > f1 $ ~/nbdkit/nbdkit -fv file f1 & pid=$! $ qemu-img convert -f raw nbd://localhost:10809 f2 $ kill $pid $ qemu-img compare f1 f2 Pre-patch, the server attempts a 1024-byte read, which nbdkit rightfully rejects as going beyond its advertised 1000 byte size; the conversion fails and the output files differ (not even the first sector is copied, because qemu-img does not follow ddrescue's habit of trying smaller reads to get as much information as possible in spite of errors). Post-patch, the client's attempts to read (and query block status, for new enough nbdkit) are properly truncated to the server's length, with sane handling of the hole the block layer forced on us. Although f2 ends up as a larger file (1024 bytes instead of 1000), qemu-img compare shows the two images to have identical contents for display to the guest. I didn't add iotests coverage since I didn't want to add a dependency on nbdkit in iotests. I also did NOT patch write, trim, or write zeroes - these commands continue to fail (usually with ENOSPC, but whatever the server chose), because we really can't write to the end of the file, and because 'qemu-img convert' is the most common case where we care about being tolerant (which is read-only). Perhaps we could truncate the request if the client is writing zeros to the tail, but that seems like more work, especially if the block layer is fixed in 4.1 to track byte-accurate sizing (in which case this patch would be reverted as unnecessary). Signed-off-by: Eric Blake <eblake@redhat.com> Message-Id: <20190329042750.14704-5-eblake@redhat.com> Tested-by: Richard W.M. Jones <rjones@redhat.com>	2019-04-01 08:32:44 -05:00
Eric Blake	a62a85ef5c	nbd/client: Report offsets in bdrv_block_status It is desirable for 'qemu-img map' to have the same output for a file whether it is served over file or nbd protocols. However, ever since we implemented block status for NBD (2.12), the NBD protocol forgot to inform the block layer that as the final layer in the chain, the offset is valid; without an offset, the human-readable form of qemu-img map gives up with the unhelpful: $ nbdkit -U - data data="1" size=512 --run 'qemu-img map $nbd' Offset Length Mapped to File qemu-img: File contains external, encrypted or compressed clusters. The --output=json form always works, because it is reporting the lower-level bdrv_block_status results directly rather than trying to filter out sparse ranges for human consumption - but now it also shows the offset member. With this patch, the human output changes to: Offset Length Mapped to File 0 0x200 0 nbd+unix://?socket=/tmp/nbdkitOxeoLa/socket This change is observable to several iotests. Fixes: `78a33ab5` Reported-by: Richard W.M. Jones <rjones@redhat.com> Signed-off-by: Eric Blake <eblake@redhat.com> Message-Id: <20190329042750.14704-4-eblake@redhat.com> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>	2019-03-30 20:52:29 -05:00
Eric Blake	7da537f70d	nbd/client: Lower min_block for block-status, unaligned size We have a latent bug in our NBD client code, tickled by the brand new nbdkit 1.11.10 block status support: $ nbdkit --filter=log --filter=truncate -U - \ data data="1" size=511 truncate=64K logfile=/dev/stdout \ --run 'qemu-img convert $nbd /var/tmp/out' ... qemu-img: block/io.c:2122: bdrv_co_block_status: Assertion `pnum && QEMU_IS_ALIGNED(pnum, align) && align > offset - aligned_offset' failed. The culprit? Our implementation of .bdrv_co_block_status can return unaligned block status for any server that operates with a lower actual alignment than what we tell the block layer in request_alignment, in violation of the block layer's constraints. To date, we've been unable to trip the bug, because qemu as NBD server always advertises block sizing (at which point it is a server bug if the server sends unaligned status - although qemu 3.1 is such a server and I've sent separate patches for 4.0 both to get the server to obey the spec, and to let the client to tolerate server oddities at EOF). But nbdkit does not (yet) advertise block sizing, and therefore is not in violation of the spec for returning block status at whatever boundaries it wants, and those unaligned results can occur anywhere rather than just at EOF. While we are still wise to avoid sending sub-sector read/write requests to a server of unknown origin, we MUST consider that a server telling us block status without an advertised block size is correct. So, we either have to munge unaligned answers from the server into aligned ones that we hand back to the block layer, or we have to tell the block layer about a smaller alignment. Similarly, if the server advertises an image size that is not sector-aligned, we might as well assume that the server intends to let us access those tail bytes, and therefore supports a minimum block size of 1, regardless of whether the server supports block status (although we still need more patches to fix the problem that with an unaligned image, we can send read or block status requests that exceed EOF to the server). Again, qemu as server cannot trip this problem (because it rounds images to sector alignment), but nbdkit advertised unaligned size even before it gained block status support. Solve both alignment problems at once by using better heuristics on what alignment to report to the block layer when the server did not give us something to work with. Note that very few NBD servers implement block status (to date, only qemu and nbdkit are known to do so); and as the NBD spec mentioned block sizing constraints prior to documenting block status, it can be assumed that any future implementations of block status are aware that they must advertise block size if they want a minimum size other than 1. We've had a long history of struggles with picking the right alignment to use in the block layer, as evidenced by the commit message of `fd8d372d` (v2.12) that introduced the current choice of forced 512-byte alignment. There is no iotest coverage for this fix, because qemu can't provoke it, and I didn't want to make test 241 dependent on nbdkit. Fixes: `fd8d372d` Reported-by: Richard W.M. Jones <rjones@redhat.com> Signed-off-by: Eric Blake <eblake@redhat.com> Message-Id: <20190329042750.14704-3-eblake@redhat.com> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Tested-by: Richard W.M. Jones <rjones@redhat.com>	2019-03-30 20:52:19 -05:00
Eric Blake	737d3f5244	nbd-client: Work around server BLOCK_STATUS misalignment at EOF The NBD spec is clear that a server that advertises a minimum block size should reply to NBD_CMD_BLOCK_STATUS with extents aligned accordingly. However, we know that the qemu NBD server implementation has had a corner-case bug where it is not compliant with the spec, present since the introduction of NBD_CMD_BLOCK_STATUS in qemu 2.12 (and unlikely to be patched in time for 4.0). Namely, when qemu is serving a file that is not a multiple of 512 bytes, it rounds the size advertised over NBD up to the next sector boundary (someday, I'd like to fix that to be byte-accurate, but it's a much bigger audit not appropriate for this release); yet if the final sector contains data prior to EOF, lseek(SEEK_HOLE) will point to the implicit hole mid-sector which qemu then reported over NBD. We are well within our rights to hang up on a server that can't follow the spec, but it is more useful to try and keep the connection alive in spite of the problem. Do so by tracing a message about the problem, and then either truncating the request back to an aligned boundary (if it covered more than the final sector) or widening it out to the full boundary with a forced status of data (since truncating would result in 0 bytes, but we have to make progress, and valid since data is a default-safe answer). And in practice, since the problem only happens on a sector that starts with data and ends with a hole, we are going to want to read that full sector anyway (where qemu as the server fills in the tail beyond EOF with appropriate NUL bytes). Easy reproduction: $ printf %1000d 1 > file $ qemu-nbd -f raw -t file & pid=$! $ qemu-img map --output=json -f raw nbd://localhost:10809 qemu-img: Could not read file metadata: Invalid argument $ kill $pid where the patched version instead succeeds with: [{ "start": 0, "length": 1024, "depth": 0, "zero": false, "data": true}] Signed-off-by: Eric Blake <eblake@redhat.com> Message-Id: <20190326171317.4036-1-eblake@redhat.com> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>	2019-03-30 10:06:08 -05:00
Eric Blake	ebd82cd872	nbd: Permit simple error to NBD_CMD_BLOCK_STATUS The NBD spec is clear that when structured replies are active, a simple error reply is acceptable to any command except for NBD_CMD_READ. However, we were mistakenly requiring structured errors for NBD_CMD_BLOCK_STATUS, and hanging up on a server that gave a simple error (since qemu does not behave as such a server, we didn't notice the problem until now). Broken since its introduction in commit `78a33ab5` (v2.12). Noticed while debugging a separate failure reported by nbdkit while working out its initial implementation of BLOCK_STATUS, although it turns out that nbdkit also chose to send structured error replies for BLOCK_STATUS, so I had to manually provoke the situation by hacking qemu's server to send a simple error reply: \| diff --git i/nbd/server.c w/nbd/server.c \| index fd013a2817a..833288d7c45 100644 \| 00--- i/nbd/server.c \| +++ w/nbd/server.c \| @@ -2269,6 +2269,8 @@ static coroutine_fn int nbd_handle_request(NBDClient *client, \| "discard failed", errp); \| \| case NBD_CMD_BLOCK_STATUS: \| + return nbd_co_send_simple_reply(client, request->handle, ENOMEM, \| + NULL, 0, errp); \| if (!request->len) { \| return nbd_send_generic_reply(client, request->handle, -EINVAL, \| "need non-zero length", errp); \| Signed-off-by: Eric Blake <eblake@redhat.com> Acked-by: Richard W.M. Jones <rjones@redhat.com> Message-Id: <20190325190104.30213-3-eblake@redhat.com> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>	2019-03-30 10:06:08 -05:00
Eric Blake	b29f3a3d2a	nbd: Don't lose server's error to NBD_CMD_BLOCK_STATUS When the server replies with a (structured []) error to NBD_CMD_BLOCK_STATUS, without any extent information sent first, the client code was blindly throwing away the server's error code and instead telling the caller that EIO occurred. This has been broken since its introduction in `78a33ab5` (v2.12, where we should have called: error_setg(&local_err, "Server did not reply with any status extents"); nbd_iter_error(&iter, false, -EIO, &local_err); to declare the situation as a non-fatal error if no earlier error had already been flagged, rather than just blindly slamming iter.err and iter.ret), although it is more noticeable since commit `7f86068d`, which actually tries hard to preserve the server's code thanks to a separate iter.request_ret. [] The spec is clear that the server is also permitted to reply with a simple error, but that's a separate fix. I was able to provoke this scenario with a hack to the server, then seeing whether ENOMEM makes it back to the caller: \| diff --git a/nbd/server.c b/nbd/server.c \| index fd013a2817a..29c7995de02 100644 \| --- a/nbd/server.c \| +++ b/nbd/server.c \| @@ -2269,6 +2269,8 @@ static coroutine_fn int nbd_handle_request(NBDClient *client, \| "discard failed", errp); \| \| case NBD_CMD_BLOCK_STATUS: \| + return nbd_send_generic_reply(client, request->handle, -ENOMEM, \| + "no status for you today", errp); \| if (!request->len) { \| return nbd_send_generic_reply(client, request->handle, -EINVAL, \| "need non-zero length", errp); \| -- Signed-off-by: Eric Blake <eblake@redhat.com> Message-Id: <20190325190104.30213-2-eblake@redhat.com> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>	2019-03-30 10:06:08 -05:00
Eric Blake	a39286dd61	nbd: Tolerate some server non-compliance in NBD_CMD_BLOCK_STATUS The NBD spec states that NBD_CMD_FLAG_REQ_ONE (which we currently always use) should not reply with an extent larger than our request, and that the server's response should be exactly one extent. Right now, that means that if a server sends more than one extent, we treat the server as broken, fail the block status request, and disconnect, which prevents all further use of the block device. But while good software should be strict in what it sends, it should be tolerant in what it receives. While trying to implement NBD_CMD_BLOCK_STATUS in nbdkit, we temporarily had a non-compliant server sending too many extents in spite of REQ_ONE. Oddly enough, 'qemu-img convert' with qemu 3.1 failed with a somewhat useful message: qemu-img: Protocol error: invalid payload for NBD_REPLY_TYPE_BLOCK_STATUS which then disappeared with commit `d8b4bad8`, on the grounds that an error message flagged only at the time of coroutine teardown is pointless, and instead we should rely on the actual failed API to report an error - in other words, the 3.1 behavior was masking the fact that qemu-img was not reporting an error. That has since been fixed in the previous patch, where qemu-img convert now fails with: qemu-img: error while reading block status of sector 0: Invalid argument But even that is harsh. Since we already partially relaxed things in commit `acfd8f7a` to tolerate a server that exceeds the cap (although that change was made prior to the NBD spec actually putting a cap on the extent length during REQ_ONE - in fact, the NBD spec change was BECAUSE of the qemu behavior prior to that commit), it's not that much harder to argue that we should also tolerate a server that sends too many extents. But at the same time, it's nice to trace when we are being tolerant of server non-compliance, in order to help server writers fix their implementations to be more portable (if they refer to our traces, rather than just stderr). Reported-by: Richard W.M. Jones <rjones@redhat.com> Signed-off-by: Eric Blake <eblake@redhat.com> Message-Id: <20190323212639.579-3-eblake@redhat.com> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>	2019-03-30 10:06:08 -05:00
Kevin Wolf	738301e117	file-posix: Support BDRV_REQ_NO_FALLBACK for zero writes We know that the kernel implements a slow fallback code path for BLKZEROOUT, so if BDRV_REQ_NO_FALLBACK is given, we shouldn't call it. The other operations we call in the context of .bdrv_co_pwrite_zeroes should usually be quick, so no modification should be needed for them. If we ever notice that there are additional problematic cases, we can still make these conditional as well. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Acked-by: Eric Blake <eblake@redhat.com>	2019-03-26 11:37:51 +01:00
Kevin Wolf	80f5c33ff3	block: Advertise BDRV_REQ_NO_FALLBACK in filter drivers Filter drivers that support .bdrv_co_pwrite_zeroes can safely advertise BDRV_REQ_NO_FALLBACK because they just forward the request flags to their child node. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Acked-by: Eric Blake <eblake@redhat.com>	2019-03-26 11:37:51 +01:00
Kevin Wolf	fe0480d629	block: Add BDRV_REQ_NO_FALLBACK For qemu-img convert, we want an operation that zeroes out the whole image if this can be done efficiently, but that returns an error otherwise so we don't write explicit zeroes and immediately overwrite them with the real data, potentially doubling the amount of data to be written. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Acked-by: Eric Blake <eblake@redhat.com>	2019-03-26 11:37:51 +01:00
Kevin Wolf	48ce986096	block: Remove error messages in bdrv_make_zero() There is only a single caller of bdrv_make_zero(), which is qemu-img convert. If the function fails, we just fall back to a different method of zeroing out blocks on the target image. There is no good reason to print error messages on stderr when the higher level operation will actually succeed. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Acked-by: Eric Blake <eblake@redhat.com>	2019-03-26 11:37:51 +01:00
Markus Armbruster	a9779a3ab0	trace-events: Delete unused trace points Tracked down with cleanup-trace-events.pl. Funnies requiring manual post-processing: * block.c and blockdev.c trace points are in block/trace-events. * hw/block/nvme.c uses the preprocessor to hide its trace point use from cleanup-trace-events.pl. * include/hw/xen/xen_common.h trace points are in hw/xen/trace-events. * net/colo-compare and net/filter-rewriter.c use pseudo trace points colo_compare_udp_miscompare and colo_filter_rewriter_debug to guard debug code. Signed-off-by: Markus Armbruster <armbru@redhat.com> Message-id: 20190314180929.27722-5-armbru@redhat.com Message-Id: <20190314180929.27722-5-armbru@redhat.com> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>	2019-03-22 16:18:07 +00:00
Markus Armbruster	500016e5db	trace-events: Shorten file names in comments We spell out sub/dir/ in sub/dir/trace-events' comments pointing to source files. That's because when trace-events got split up, the comments were moved verbatim. Delete the sub/dir/ part from these comments. Gets rid of several misspellings. Signed-off-by: Markus Armbruster <armbru@redhat.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com> Message-id: 20190314180929.27722-3-armbru@redhat.com Message-Id: <20190314180929.27722-3-armbru@redhat.com> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>	2019-03-22 16:18:07 +00:00
Alberto Garcia	782b9d06bf	block: Make bdrv_{copy_on_read,crypto_luks,replication} static Signed-off-by: Alberto Garcia <berto@igalia.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-03-19 15:49:29 +01:00
Sam Eiderman	b69864e5a8	vmdk: Support version=3 in VMDK descriptor files Commit `509d39aa22` added support for read only VMDKs of version 3. This commit fixes the probe function to correctly handle descriptors of version 3. This commit has two effects: 1. We no longer need to supply '-f vmdk' when pointing to descriptor files of version 3 in qemu/qemu-img command line arguments. 2. This fixes the scenario where a VMDK points to a parent version 3 descriptor file which is being probed as "raw" instead of "vmdk". Reviewed-by: Arbel Moshe <arbel.moshe@oracle.com> Reviewed-by: Mark Kanda <mark.kanda@oracle.com> Signed-off-by: Shmuel Eiderman <shmuel.eiderman@oracle.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-03-19 15:49:29 +01:00
Kevin Wolf	a0cf83639c	qcow2: Fix data file error condition in qcow2_co_create() We were trying to check whether bdrv_open_blockdev_ref() returned success, but accidentally checked the wrong variable. Spotted by Coverity (CID 1399703). Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>	2019-03-19 15:49:29 +01:00
Sergio Lopez	5e771752a1	mirror: Confirm we're quiesced only if the job is paused or cancelled While child_job_drained_begin() calls to job_pause(), the job doesn't actually transition between states until it runs again and reaches a pause point. This means bdrv_drained_begin() may return with some jobs using the node still having 'busy == true'. As a consequence, block_job_detach_aio_context() may get into a deadlock, waiting for the job to be actually paused, while the coroutine servicing the job is yielding and doesn't get the opportunity to get scheduled again. This situation can be reproduced by issuing a 'block-commit' immediately followed by a 'device_del'. To ensure bdrv_drained_begin() only returns when the jobs have been paused, we change mirror_drained_poll() to only confirm it's quiesced when job->paused == true and there aren't any in-flight requests, except if we reached that point by a drained section initiated by the mirror/commit job itself. The other block jobs shouldn't need any changes, as the default drained_poll() behavior is to only confirm it's quiesced if the job is not busy or completed. Signed-off-by: Sergio Lopez <slp@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-03-19 15:49:29 +01:00
Peter Maydell	dbbc277510	Pull request * Add 'drop-cache=on\|off' option to file-posix.c. The default is on. Disabling the option fixes a QEMU 3.0.0 performance regression when live migrating on the same host with cache.direct=off. -----BEGIN PGP SIGNATURE----- iQEcBAABAgAGBQJciOSEAAoJEJykq7OBq3PIVSUIAI6r2Mgoi+no4nle8Jf2nZ+W EnQXnNEFyJA0lKRtqQ2UILD9udVdKd/L1PZu5k/Il/Ralto9Yf3+62brekI7rsss c3Qusu4LUK6jom2RslRjRIaJ9GilQi/jWezKV/O0VlcsMVemgVHX008EIR+ea1U4 H0/u2kfu04PciKQ5MR2+6aacu9bfmyH1yM2no+aMN5dDu/38PV6JEsf0Zl2agowg opGepJ7YiDQsxH9IBXrbfm38mBrrY0K2vFzAb9BzTHfBPotGMNIZNJNM2FChRfoM sTjOIpZz3NDwPEUPQPZxp+7YKRFFYfse1oHtpyh4n1rMQksB019SCGlP9TBhrF0= =CH5G -----END PGP SIGNATURE----- Merge remote-tracking branch 'remotes/stefanha/tags/block-pull-request' into staging Pull request * Add 'drop-cache=on\|off' option to file-posix.c. The default is on. Disabling the option fixes a QEMU 3.0.0 performance regression when live migrating on the same host with cache.direct=off. # gpg: Signature made Wed 13 Mar 2019 11:07:48 GMT # gpg: using RSA key 9CA4ABB381AB73C8 # gpg: Good signature from "Stefan Hajnoczi <stefanha@redhat.com>" [full] # gpg: aka "Stefan Hajnoczi <stefanha@gmail.com>" [full] # Primary key fingerprint: 8695 A8BF D3F9 7CDA AC35 775A 9CA4 ABB3 81AB 73C8 * remotes/stefanha/tags/block-pull-request: file-posix: add drop-cache=on\|off option Signed-off-by: Peter Maydell <peter.maydell@linaro.org>	2019-03-14 09:34:51 +00:00
Peter Maydell	523a2a42c3	Pull request -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEE+ber27ys35W+dsvQfe+BBqr8OQ4FAlyIFSwACgkQfe+BBqr8 OQ7wghAAm16eCEr57oTO7QXR3y8uVFsKqXBn9cNH6nbrFp2PUQSglwMDKBls1Z5m olF23X/JaqSlSmkL9BBuzDZ6Up+kkHKuxPq4/5RKXfiDI0pr3R0eqts0COAlaN9q Bew3ipj99m8gzMi2093AW4+Ob0N3658fuDTGLe1M1Uoy7CEg1QJ7rVOBBEui7vIl RbZ8l/Zmb4ldNpB3lnE4Nh9ue8fy0RAj3Nai161nCnNeXNF/VzD3Ye8bojSBbnux PIMX6/RWmykX4feIf9QP8apDpxX4HkyuPq5EdwT9PD8PwdyXPAXZtsYUNCuNtQuk n5VKFVgFYgqUclBeVHmrMYPU4K4iCFQp4/Fua7wzPEC0iG05NiiDv91oVkEJCp3L ManHeuGfNLCcXaIntKZhuJl1cK8yMM3yDww6/pPTehrPjcyvKa0NOqhQBExektcD R6q7maJRzFaxSxdcs+Zzuog9zESvH1mlJxQCKzeYhAP0kkxInyTELE/Vbx37xuqR RFfZYyVQ6x87Q/sxHx4EMiV97WUM8elZOQdSEC/okt5WUUNpgIu0WF9nSQ1VKZ8C CZmv5xh9ogfwvB/kOm6IVwNkLvVagJQcLwddORI5LLXLbSIUcuwVSuyMp/7iDtQ/ hnHkGs2mIJ2JUYbSSNsSJNs6oTurn8eTFCeGoYKJgd9l4QxaThU= =ekU+ -----END PGP SIGNATURE----- Merge remote-tracking branch 'remotes/jnsnow/tags/bitmaps-pull-request' into staging Pull request # gpg: Signature made Tue 12 Mar 2019 20:23:08 GMT # gpg: using RSA key F9B7ABDBBCACDF95BE76CBD07DEF8106AAFC390E # gpg: Good signature from "John Snow (John Huston) <jsnow@redhat.com>" [full] # Primary key fingerprint: FAEB 9711 A12C F475 812F 18F2 88A9 064D 1835 61EB # Subkey fingerprint: F9B7 ABDB BCAC DF95 BE76 CBD0 7DEF 8106 AAFC 390E * remotes/jnsnow/tags/bitmaps-pull-request: (22 commits) tests/qemu-iotests: add bitmap resize test 246 block/qcow2-bitmap: Allow resizes with persistent bitmaps block/qcow2-bitmap: Don't check size for IN_USE bitmap docs/interop/qcow2: Improve bitmap flag in_use specification bitmaps: Fix typo in function name block/dirty-bitmaps: implement inconsistent bit block/dirty-bitmaps: disallow busy bitmaps as merge source block/dirty-bitmaps: prohibit removing readonly bitmaps block/dirty-bitmaps: prohibit readonly bitmaps for backups block/dirty-bitmaps: add block_dirty_bitmap_check function block/dirty-bitmap: add inconsistent status block/dirty-bitmaps: add inconsistent bit iotests: add busy/recording bit test to 124 blockdev: remove unused paio parameter documentation block/dirty-bitmaps: move comment block block/dirty-bitmaps: unify qmp_locked and user_locked calls block/dirty-bitmap: explicitly lock bitmaps with successors nbd: change error checking order for bitmaps block/dirty-bitmap: change semantics of enabled predicate block/dirty-bitmap: remove set/reset assertions against enabled bit ... Signed-off-by: Peter Maydell <peter.maydell@linaro.org> # Conflicts: # tests/qemu-iotests/group	2019-03-13 17:30:34 +00:00
Stefan Hajnoczi	f357fcd890	file-posix: add drop-cache=on\|off option Commit `dd577a26ff` ("block/file-posix: implement bdrv_co_invalidate_cache() on Linux") introduced page cache invalidation so that cache.direct=off live migration is safe on Linux. The invalidation takes a significant amount of time when the file is large and present in the page cache. Normally this is not the case for cross-host live migration but it can happen when migrating between QEMU processes on the same host. On same-host migration we don't need to invalidate pages for correctness anyway, so an option to skip page cache invalidation is useful. I investigated optimizing invalidation and detecting same-host migration, but both are hard to achieve so a user-visible option will suffice. As a bonus this option means that the cache invalidation feature will now be detectable by libvirt via QMP schema introspection. Suggested-by: Neil Skrypuch <neil@tembosocial.com> Tested-by: Neil Skrypuch <neil@tembosocial.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Message-id: 20190307164941.3322-1-stefanha@redhat.com Message-Id: <20190307164941.3322-1-stefanha@redhat.com> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>	2019-03-13 10:54:55 +00:00
Alberto Garcia	5019aece2a	block: Remove the AioContext parameter from bdrv_reopen_multiple() This parameter has been unused since `1a63a90750` Signed-off-by: Alberto Garcia <berto@igalia.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-03-12 20:30:14 +01:00
Alberto Garcia	8a2ce0bc1e	block: Add a 'mutable_opts' field to BlockDriver If we reopen a BlockDriverState and there is an option that is present in bs->options but missing from the new set of options then we have to return an error unless the driver is able to reset it to its default value. This patch adds a new 'mutable_opts' field to BlockDriver. This is a list of runtime options that can be modified during reopen. If an option in this list is unspecified on reopen then it must be reset (or return an error). Signed-off-by: Alberto Garcia <berto@igalia.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-03-12 20:30:14 +01:00
Alberto Garcia	077e8e2018	block: Add 'keep_old_opts' parameter to bdrv_reopen_queue() The bdrv_reopen_queue() function is used to create a queue with the BDSs that are going to be reopened and their new options. Once the queue is ready bdrv_reopen_multiple() is called to perform the operation. The original options from each one of the BDSs are kept, with the new options passed to bdrv_reopen_queue() applied on top of them. For "x-blockdev-reopen" we want a function that behaves much like "blockdev-add". We want to ignore the previous set of options so that only the ones actually specified by the user are applied, with the rest having their default values. One of the things that we need is a way to tell bdrv_reopen_queue() whether we want to keep the old set of options or not, and that's what this patch does. All current callers are setting this new parameter to true and x-blockdev-reopen will set it to false. Signed-off-by: Alberto Garcia <berto@igalia.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-03-12 20:30:14 +01:00
Alberto Garcia	6585493369	block: Freeze the backing chain for the duration of the stream job Signed-off-by: Alberto Garcia <berto@igalia.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-03-12 20:30:14 +01:00
Alberto Garcia	ef53dc09ed	block: Freeze the backing chain for the duration of the mirror job Signed-off-by: Alberto Garcia <berto@igalia.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-03-12 20:30:14 +01:00
Alberto Garcia	df827336ab	block: Freeze the backing chain for the duration of the commit job Signed-off-by: Alberto Garcia <berto@igalia.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-03-12 20:30:14 +01:00
Kevin Wolf	23dece19da	file-posix: Make auto-read-only dynamic Until now, with auto-read-only=on we tried to open the file read-write first and if that failed, read-only was tried. This is actually not good enough for libvirt, which gives QEMU SELinux permissions for read-write only as soon as it actually intends to write to the image. So we need to be able to switch between read-only and read-write at runtime. This patch makes auto-read-only dynamic, i.e. the file is opened read-only as long as no user of the node has requested write permissions, but it is automatically reopened read-write as soon as the first writer is attached. Conversely, if the last writer goes away, the file is reopened read-only again. bs->read_only is no longer set for auto-read-only=on files even if the file descriptor is opened read-only because it will be transparently upgraded as soon as a writer is attached. This changes the output of qemu-iotests 232. Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-03-12 20:30:14 +01:00
Kevin Wolf	6ceabe6f77	file-posix: Prepare permission code for fd switching In order to be able to dynamically reopen the file read-only or read-write, depending on the users that are attached, we need to be able to switch to a different file descriptor during the permission change. This interacts with reopen, which also creates a new file descriptor and performs permission changes internally. In this case, the permission change code must reuse the reopen file descriptor instead of creating a third one. In turn, reopen can drop its code to copy file locks to the new file descriptor because that is now done when applying the new permissions. Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-03-12 20:30:14 +01:00
Kevin Wolf	a6aeca0ca5	file-posix: Lock new fd in raw_reopen_prepare() There is no reason why we can take locks on the new file descriptor only in raw_reopen_commit() where error handling isn't possible any more. Instead, we can already do this in raw_reopen_prepare(). Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-03-12 20:30:14 +01:00
Kevin Wolf	e0c9cf3a48	file-posix: Store BDRVRawState.reopen_state during reopen We'll want to access the file descriptor in the reopen_state while processing permission changes in the context of the repoen. Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-03-12 20:30:14 +01:00
Kevin Wolf	5cec287025	file-posix: Factor out raw_reconfigure_getfd() Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-03-12 20:30:14 +01:00
Vladimir Sementsov-Ogievskiy	cb8aac3783	qapi: drop x- from x-block-latency-histogram-set Drop x- and x_ prefixes for latency histograms and update version to 4.0 Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-03-12 20:30:08 +01:00
John Snow	d19c6b36ff	block/qcow2-bitmap: Allow resizes with persistent bitmaps Since we now load all bitmaps into memory anyway, we can just truncate them in-memory and then flush them back to disk. Just in case, we will still check and enforce that this shortcut is valid -- i.e. that any bitmap described on-disk is indeed in-memory and can be modified. If there are any inconsistent bitmaps, we refuse to allow the truncate as we do not actually load these bitmaps into memory, and it isn't safe or reasonable to attempt to truncate corrupted data. Signed-off-by: John Snow <jsnow@redhat.com> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Message-id: 20190311185147.52309-4-vsementsov@virtuozzo.com [vsementsov: drop bitmap flushing, fix block comments style] Signed-off-by: John Snow <jsnow@redhat.com>	2019-03-12 14:57:38 -04:00
Vladimir Sementsov-Ogievskiy	bf5f0cf5d8	block/qcow2-bitmap: Don't check size for IN_USE bitmap We are going to allow image resize when there are persistent bitmaps. It may lead to appearing of inconsistent bitmaps (IN_USE=1) with inconsistent size. But we still want to load them as inconsistent. Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Message-id: 20190311185147.52309-3-vsementsov@virtuozzo.com Signed-off-by: John Snow <jsnow@redhat.com>	2019-03-12 14:50:28 -04:00
Eric Blake	796a3798ab	bitmaps: Fix typo in function name Commit `a88b179f` introduced the ability to set and query bitmap persistence, but with an atypical spelling. Signed-off-by: Eric Blake <eblake@redhat.com> Message-id: 20190308205845.25734-1-eblake@redhat.com Signed-off-by: John Snow <jsnow@redhat.com>	2019-03-12 12:05:49 -04:00
John Snow	74da6b9435	block/dirty-bitmaps: implement inconsistent bit Set the inconsistent bit on load instead of rejecting such bitmaps. There is no way to un-set it; the only option is to delete the bitmap. Obvervations: - bitmap loading does not need to update the header for in_use bitmaps. - inconsistent bitmaps don't need to have their data loaded; they're glorified corruption sentinels. - bitmap saving does not need to save inconsistent bitmaps back to disk. - bitmap reopening DOES need to drop the readonly flag from inconsistent bitmaps to allow reopening of qcow2 files with non-qemu-owned bitmaps being eventually flushed back to disk. Signed-off-by: John Snow <jsnow@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Message-id: 20190301191545.8728-8-jsnow@redhat.com Signed-off-by: John Snow <jsnow@redhat.com>	2019-03-12 12:05:49 -04:00
John Snow	cb8e58e3de	block/dirty-bitmaps: disallow busy bitmaps as merge source We didn't do any state checking on source bitmaps at all, so this adds inconsistent and busy checks. readonly is allowed, so you can still copy a readonly bitmap to a new destination to use it for operations like drive-backup. Signed-off-by: John Snow <jsnow@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Message-id: 20190301191545.8728-7-jsnow@redhat.com Signed-off-by: John Snow <jsnow@redhat.com>	2019-03-12 12:05:49 -04:00
John Snow	3ae96d6684	block/dirty-bitmaps: add block_dirty_bitmap_check function Instead of checking against busy, inconsistent, or read only directly, use a check function with permissions bits that let us streamline the checks without reproducing them in many places. Included in this patch are permissions changes that simply add the inconsistent check to existing permissions call spots, without addressing existing bugs. In general, this means that busy+readonly checks become BDRV_BITMAP_DEFAULT, which checks against all three conditions. busy-only checks become BDRV_BITMAP_ALLOW_RO. Notably, remove allows inconsistent bitmaps, so it doesn't follow the pattern. Signed-off-by: John Snow <jsnow@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Message-id: 20190301191545.8728-4-jsnow@redhat.com Signed-off-by: John Snow <jsnow@redhat.com>	2019-03-12 12:05:49 -04:00
John Snow	0064cfefa4	block/dirty-bitmap: add inconsistent status Even though the status field is deprecated, we still have to support it for a few more releases. Since this is a very new kind of bitmap state, it makes sense for it to have its own status field. Reviewed-by: Eric Blake <eblake@redhat.com> Signed-off-by: John Snow <jsnow@redhat.com> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Message-id: 20190301191545.8728-3-jsnow@redhat.com Signed-off-by: John Snow <jsnow@redhat.com>	2019-03-12 12:05:49 -04:00
John Snow	b0f455599d	block/dirty-bitmaps: add inconsistent bit Add an inconsistent bit to dirty-bitmaps that allows us to report a bitmap as persistent but potentially inconsistent, i.e. if we find bitmaps on a qcow2 that have been marked as "in use". Signed-off-by: John Snow <jsnow@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Message-id: 20190301191545.8728-2-jsnow@redhat.com Signed-off-by: John Snow <jsnow@redhat.com>	2019-03-12 12:05:49 -04:00
John Snow	1e6fddcd6f	block/dirty-bitmaps: move comment block Simply move the big status enum comment block to above the status function, and document it as being deprecated. The whole confusing block can get deleted in three releases time. Signed-off-by: John Snow <jsnow@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Message-id: 20190223000614.13894-9-jsnow@redhat.com Signed-off-by: John Snow <jsnow@redhat.com>	2019-03-12 12:05:48 -04:00
John Snow	27a1b301a4	block/dirty-bitmaps: unify qmp_locked and user_locked calls These mean the same thing now. Unify them and rename the merged call bdrv_dirty_bitmap_busy to indicate semantically what we are describing, as well as help disambiguate from the various _locked and _unlocked versions of bitmap helpers that refer to mutex locks. Signed-off-by: John Snow <jsnow@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Message-id: 20190223000614.13894-8-jsnow@redhat.com Signed-off-by: John Snow <jsnow@redhat.com>	2019-03-12 12:05:48 -04:00
John Snow	21d2376f26	block/dirty-bitmap: explicitly lock bitmaps with successors Instead of implying a user_locked/busy status, make it explicit. Now, bitmaps in use by migration, NBD or backup operations are all treated the same way with the same code paths. Signed-off-by: John Snow <jsnow@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Message-id: 20190223000614.13894-7-jsnow@redhat.com Signed-off-by: John Snow <jsnow@redhat.com>	2019-03-12 12:05:48 -04:00
John Snow	8b2e20f64f	block/dirty-bitmap: change semantics of enabled predicate Currently, the enabled predicate means something like: "the QAPI status of the bitmap is ACTIVE." After this patch, it should mean exclusively: "This bitmap is recording guest writes, and is allowed to do so." In many places, this is how this predicate was already used. Internal usages of the bitmap QPI can call user_locked to find out if the bitmap is in use by an operation. To accommodate this, modify the create_successor routine to now explicitly disable the parent bitmap at creation time. Justifications: 1. bdrv_dirty_bitmap_status suffers no change from the lack of 1:1 parity with the new predicates because of the order in which the predicates are checked. This is now only for compatibility. 2. bdrv_set_dirty() is unchanged: pre-patch, it was skipping bitmaps that were disabled or had a successor, while post-patch it is only skipping bitmaps that are disabled. To accommodate this, create_successor now ensures that any bitmap with a successor is explicitly disabled. 3. qcow2_store_persistent_dirty_bitmaps: No functional change. This function cares only about the literal enabled bit, and makes no effort to check if the bitmap is in-use or not. After this patch there are still no ways to produce an enabled bitmap with a successor. 4. block_dirty_bitmap_enable_prepare block_dirty_bitmap_disable_prepare init_dirty_bitmap_migration nbd_export_new These functions care about the literal enabled bit, and already check user_locked separately. Signed-off-by: John Snow <jsnow@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Message-id: 20190223000614.13894-5-jsnow@redhat.com Signed-off-by: John Snow <jsnow@redhat.com>	2019-03-12 12:05:48 -04:00
John Snow	c28ddbb07e	block/dirty-bitmap: remove set/reset assertions against enabled bit bdrv_set_dirty_bitmap and bdrv_reset_dirty_bitmap are only used as an internal API by the mirror and migration areas of our code. These calls modify the bitmap, but do so at the behest of QEMU and not the guest. Presently, these bitmaps are always "enabled" anyway, but there's no reason they have to be. Modify these internal APIs to drop this assertion. Signed-off-by: John Snow <jsnow@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Message-id: 20190223000614.13894-4-jsnow@redhat.com Signed-off-by: John Snow <jsnow@redhat.com>	2019-03-12 12:05:48 -04:00
John Snow	50a47257f8	block/dirty-bitmaps: rename frozen predicate helper "Frozen" was a good description a long time ago, but it isn't adequate now. Rename the frozen predicate to has_successor to make the semantics of the predicate more clear to outside callers. In the process, remove some calls to frozen() that no longer semantically make sense. For bdrv_enable_dirty_bitmap_locked and bdrv_disable_dirty_bitmap_locked, it doesn't make sense to prohibit QEMU internals from performing this action when we only wished to prohibit QMP users from issuing these commands. All of the QMP API commands for bitmap manipulation already check against user_locked() to prohibit these actions. Several other assertions really want to check that the bitmap isn't in-use by another operation -- use the bitmap_user_locked function for this instead, which presently also checks for has_successor. This leaves some redundant checks of has_successor through different helpers that are addressed in forthcoming patches. Signed-off-by: John Snow <jsnow@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Message-id: 20190223000614.13894-3-jsnow@redhat.com Signed-off-by: John Snow <jsnow@redhat.com>	2019-03-12 12:05:48 -04:00
John Snow	4db6ceb0b5	block/dirty-bitmap: add recording and busy properties The current API allows us to report a single status, which we've defined as: Frozen: has a successor, treated as qmp_locked, may or may not be enabled. Locked: no successor, qmp_locked. may or may not be enabled. Disabled: Not frozen or locked, disabled. Active: Not frozen, locked, or disabled. The problem is that both "Frozen" and "Locked" mean nearly the same thing, and that both of them do not intuit whether they are recording guest writes or not. This patch deprecates that status field and introduces two orthogonal properties instead to replace it. Signed-off-by: John Snow <jsnow@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Message-id: 20190223000614.13894-2-jsnow@redhat.com Signed-off-by: John Snow <jsnow@redhat.com>	2019-03-12 12:05:48 -04:00
Niels de Vos	0e3b891fef	gluster: the glfs_io_cbk callback function pointer adds pre/post stat args The glfs_*_async() functions do a callback once finished. This callback has changed its arguments, pre- and post-stat structures have been added. This makes it possible to improve caching, which is useful for Samba and NFS-Ganesha, but not so much for QEMU. Gluster 6 is the first release that includes these new arguments. With an additional detection in ./configure, the new arguments can conditionally get included in the glfs_io_cbk handler. Signed-off-by: Niels de Vos <ndevos@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-03-12 14:26:49 +01:00
Prasanna Kumar Kalever	e014dbe74e	gluster: Handle changed glfs_ftruncate signature New versions of Glusters libgfapi.so have an updated glfs_ftruncate() function that returns additional 'struct stat' structures to enable advanced caching of attributes. This is useful for file servers, not so much for QEMU. Nevertheless, the API has changed and needs to be adopted. Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com> Signed-off-by: Niels de Vos <ndevos@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-03-12 14:26:49 +01:00
Philippe Mathieu-Daudé	d4cef0c67c	block/iscsi: Restrict Linux-specific code Some Linux specific code is missing guards, leading to build failure on OSX: $ sudo brew install libiscsi $ ./configure && make [...] CC block/iscsi.o qemu/block/iscsi.c:338:24: error: 'iscsi_aiocb_info' defined but not used [-Werror=unused-const-variable=] static const AIOCBInfo iscsi_aiocb_info = { ^~~~~~~~~~~~~~~~ qemu/block/iscsi.c:168:1: error: 'iscsi_schedule_bh' defined but not used [-Werror=unused-function] iscsi_schedule_bh(IscsiAIOCB *acb) ^~~~~~~~~~~~~~~~~ cc1: all warnings being treated as errors Add guards to restrict this code for Linux. Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com> Message-Id: <20190220000553.28438-1-philmd@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-03-11 16:33:49 +01:00
Kevin Wolf	6c3944dc62	qcow2: Implement data-file-raw create option Provide an option to force QEMU to always keep the external data file consistent as a standalone read-only raw image. At the moment, this means making sure that write_zeroes requests are forwarded to the data file instead of just updating the metadata, and checking that no backing file is used. Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-03-08 12:26:46 +01:00
Kevin Wolf	9b890bdcb6	qcow2: Store data file name in the image Rather than requiring that the external data file node is passed explicitly when creating the qcow2 node, store the filename in the designated header extension during .bdrv_create and read it from there as a default during .bdrv_open. Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-03-08 12:26:46 +01:00
Kevin Wolf	dcc98687f8	qcow2: Creating images with external data file This adds a .bdrv_create option to use an external data file. Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-03-08 12:26:46 +01:00
Kevin Wolf	0e8c08be27	qcow2: Add basic data-file infrastructure This adds a .bdrv_open option to specify the external data file node. Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-03-08 12:26:46 +01:00
Kevin Wolf	e9f5b6deaa	qcow2: Support external data file in qemu-img check For external data files, data clusters must be excluded from the refcount calculations. Instead, an implicit refcount of 1 is assumed for the COPIED flag. Compressed clusters and internal snapshots are incompatible with external data files, so print an error if they are in use for images with an external data file. Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-03-08 12:26:46 +01:00
Kevin Wolf	aa8b34c1b2	qcow2: Return error for snapshot operation with data file Internal snapshots and an external data file are incompatible because snapshots require refcounting and non-linear mapping. Return an error for all of the snapshot operations if an external data file is in use. Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-03-08 12:26:46 +01:00
Kevin Wolf	966b000f49	qcow2: External file I/O This changes the qcow2 implementation to direct all guest data I/O to s->data_file rather than bs->file, while metadata I/O still uses bs->file. At the moment, this is still always the same, but soon we'll add options to set s->data_file to an external data file. Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-03-08 12:26:46 +01:00
Kevin Wolf	37be14036b	qcow2: Prepare qcow2_co_block_status() for data file Offset 0 cannot be assumed to mean an unallocated cluster any more. Instead, the cluster type needs to be checked. *file must refer to the data file instead of the image file if a valid offset is returned from qcow2_co_block_status(). Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-03-08 12:26:45 +01:00
Kevin Wolf	77e023ff79	qcow2: Return 0/-errno in qcow2_alloc_compressed_cluster_offset() qcow2_alloc_compressed_cluster_offset() used to return the cluster offset for success and 0 for error. This doesn't only conflict with 0 as a valid host offset, but also loses the error code. Similar to the change made to qcow2_alloc_cluster_offset() for uncompressed clusters in commit `148da7ea9d`, make the function return 0/-errno and return the allocated cluster offset in a by-reference parameter. Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-03-08 12:26:45 +01:00
Kevin Wolf	c6d619cc12	qcow2: Don't assume 0 is an invalid cluster offset The cluster allocation code uses 0 as an invalid offset that is used in case of errors or as "offset not yet determined". With external data files, a host cluster offset of 0 becomes valid, though. Define a constant INV_OFFSET (which is not cluster aligned and will therefore never be a valid offset) that can be used for such purposes. This removes the additional host_offset == 0 check that commit `ff52aab2df` introduced; the confusion between an invalid offset and (erroneous) allocation at offset 0 is removed with this change. Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-03-08 12:26:45 +01:00
Kevin Wolf	b8c8353a38	qcow2: Prepare count_contiguous_clusters() for external data file Offset 0 can be valid for normal (allocated) clusters now, so use qcow2_get_cluster_type() instead. Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-03-08 12:26:45 +01:00
Kevin Wolf	a4ea184d8a	qcow2: Prepare qcow2_get_cluster_type() for external data file Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-03-08 12:26:45 +01:00
Kevin Wolf	808c2bb4c4	qcow2: Pass bs to qcow2_get_cluster_type() Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-03-08 12:26:45 +01:00
Kevin Wolf	93c2493646	qcow2: Basic definitions for external data files This adds basic constants, struct fields and helper function for external data file support to the implementation. QCOW2_INCOMPAT_MASK and QCOW2_AUTOCLEAR_MASK are not updated yet so that opening images with an external data file still fails (we don't handle them correctly yet). Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-03-08 12:26:45 +01:00
Kevin Wolf	c5e86ebc11	qcow2: Simplify preallocation code Image creation already involves a bdrv_co_truncate() call, which allows to specify a preallocation mode. Just pass the right mode there and remove the code that is made redundant by this. Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-03-08 12:26:45 +01:00
Alberto Garcia	af39bd0d9a	qcow2: Default to 4KB for the qcow2 cache entry size QEMU 2.12 (commit `1221fe6f63`) introduced a new setting called l2-cache-entry-size that allows making entries on the qcow2 L2 cache smaller than the cluster size. I have been performing several tests with different cluster and entry sizes and all of them show that reducing the entry size (aka L2 slice) consistently improves I/O performance, notably during random I/O (all tests done with sequential I/O show similar results). This is to be expected because loading and evicting an L2 slice is more expensive the larger the slice is. Here are some numbers on fully populated 40GB qcow2 images. The rightmost column represents the maximum L2 cache size in both cases. Cluster size = 64 KB \|-------------+--------------+--------------+--------------\| \| \| 1MB L2 cache \| 3MB L2 cache \| 5MB L2 cache \| \|-------------+--------------+--------------+--------------\| \| 4KB slices \| 6545 IOPS \| 12045 IOPS \| 55680 IOPS \| \| 16KB slices \| 5177 IOPS \| 9798 IOPS \| 56278 IOPS \| \| 64KB slices \| 2718 IOPS \| 5326 IOPS \| 57355 IOPS \| \|-------------+--------------+--------------+--------------\| Cluster size = 256 KB \|--------------+----------------+--------------+-----------------\| \| \| 512KB L2 cache \| 1MB L2 cache \| 1280KB L2 cache \| \|--------------+----------------+--------------+-----------------\| \| 4KB slices \| 8539 IOPS \| 21071 IOPS \| 55417 IOPS \| \| 64KB slices \| 3598 IOPS \| 9772 IOPS \| 57687 IOPS \| \| 256KB slices \| 1415 IOPS \| 4120 IOPS \| 58001 IOPS \| \|--------------+----------------+--------------+-----------------\| As can be seen in the numbers, the only exception to the rule is when the cache is large enough to hold all L2 tables. This is also to be expected because in this case no cache entry is ever evicted so reducing its size doesn't bring any benefit. This patch sets the default L2 cache entry size to 4KB except when the cache is large enough for the whole disk. Signed-off-by: Alberto Garcia <berto@igalia.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-03-08 12:26:45 +01:00
Peter Maydell	adf2e451f3	Block layer patches: - Block graph change fixes (avoid loops, cope with non-tree graphs) - bdrv_set_aio_context() related fixes - HMP snapshot commands: Use only tag, not the ID to identify snapshots - qmeu-img, commit: Error path fixes - block/nvme: Build fix for gcc 9 - MAINTAINERS updates - Fix various issues with bdrv_refresh_filename() - Fix various iotests - Include LUKS overhead in qemu-img measure for qcow2 - A fix for vmdk's image creation interface -----BEGIN PGP SIGNATURE----- iQIcBAABAgAGBQJcc/knAAoJEH8JsnLIjy/WptQP/3F8Lh52H4egXaP7NUUuDjQM AhqhuDAp/EZBS+xim9kLTogNJADe/rMWdSX/YB5aLpSPYbjasC66NgaLhd6QewgQ VIcsLUdlYAyZ5ZjJytimfMTLwm1X02RmVIe55y52DTY8LlfViZzOlf3qwqPm00ao EJB2cl8UJLM+PVEu59cCw3R0/06LY+WIJRB32d3tnCBRTkaJwfR9h4lrp/juVcFZ U+2eWU68KMbUHSYiWANowN+KRV3uPY4HVA98v3F0vDmcBxlVHOeBg6S+PcT7tK8p huzCMwcdwUyPMJgVs/+WBtUnbG0jN6SHUYmFLz859UMVgBnCw5tzBMf8qw1wOA4A Iw+zor27Pxj4IlxcLPp5f97YZ8k9acdMR2VKPH6xLJZ1JF+sKa54RfzESd5EJeIj Mfcp773H0lIaWcFJ6RY1F0L1E1ta7QigwNBiWMdYfh0a0EWHnDvGyYeaSPYEQ+rl e8bZOcfrYwVI7DTDiZOIkGA9D8DXEPDNp+sl6s1DxeY69D0NNaXTtCPqFNNAbFbd 20uD7yDRZlWq32cQB/K9D5cSkZRSOzdUpLfLU3nQU2+dz11x6OpM6m7DVboSrztD 1HtPPDzDEvH5dOP7ibd60s+ntjkSiNfNkUgnuVrBE/d/PocC1eHHpZt5V7f43Ofb RxVwH5+smzQ9nsNBfQR0 =gaah -----END PGP SIGNATURE----- Merge remote-tracking branch 'remotes/kevin/tags/for-upstream' into staging Block layer patches: - Block graph change fixes (avoid loops, cope with non-tree graphs) - bdrv_set_aio_context() related fixes - HMP snapshot commands: Use only tag, not the ID to identify snapshots - qmeu-img, commit: Error path fixes - block/nvme: Build fix for gcc 9 - MAINTAINERS updates - Fix various issues with bdrv_refresh_filename() - Fix various iotests - Include LUKS overhead in qemu-img measure for qcow2 - A fix for vmdk's image creation interface # gpg: Signature made Mon 25 Feb 2019 14:18:15 GMT # gpg: using RSA key 7F09B272C88F2FD6 # gpg: Good signature from "Kevin Wolf <kwolf@redhat.com>" [full] # Primary key fingerprint: DC3D EB15 9A9A F95D 3D74 56FE 7F09 B272 C88F 2FD6 * remotes/kevin/tags/for-upstream: (71 commits) iotests: Skip 211 on insufficient memory vmdk: false positive of compat6 with hwversion not set iotests: add LUKS payload overhead to 178 qemu-img measure test qcow2: include LUKS payload overhead in qemu-img measure iotests.py: s/_/-/g on keys in qmp_log() iotests: Let 045 be run concurrently iotests: Filter SSH paths iotests.py: Filter filename in any string value iotests.py: Add is_str() iotests: Fix 207 to use QMP filters for qmp_log iotests: Fix 232 for LUKS iotests: Remove superfluous rm from 232 iotests: Fix 237 for Python 2.x iotests: Re-add filename filters iotests: Test json:{} filenames of internal BDSs block: BDS options may lack the "driver" option block/null: Generate filename even with latency-ns block/curl: Implement bdrv_refresh_filename() block/curl: Harmonize option defaults block/nvme: Fix bdrv_refresh_filename() ... Signed-off-by: Peter Maydell <peter.maydell@linaro.org>	2019-02-26 19:04:47 +00:00
yuchenlin	26c9296c31	vmdk: false positive of compat6 with hwversion not set In vmdk_co_create_opts, when it finds hw_version is undefined, it will set it to 4, which misleading the compat6 and hwversion in vmdk_co_do_create. Simply set hw_version to NULL after free, let the logic in vmdk_co_do_create to decide the value of hw_version. This bug can be reproduced by: $ qemu-img convert -O vmdk -o subformat=streamOptimized,compat6 /home/yuchenlin/syno.qcow2 /home/yuchenlin/syno.vmdk qemu-img: /home/yuchenlin/syno.vmdk: error while converting vmdk: compat6 cannot be enabled with hwversion set Signed-off-by: yuchenlin <yuchenlin@synology.com> Message-id: 20190221110805.28239-1-yuchenlin@synology.com Signed-off-by: Max Reitz <mreitz@redhat.com>	2019-02-25 15:11:28 +01:00
Stefan Hajnoczi	61914f8906	qcow2: include LUKS payload overhead in qemu-img measure LUKS encryption reserves clusters for its own payload data. The size of this area must be included in the qemu-img measure calculation so that we arrive at the correct minimum required image size. (Ab)use the qcrypto_block_create() API to determine the payload overhead. We discard the payload data that qcrypto thinks will be written to the image. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Message-id: 20190218104525.23674-2-stefanha@redhat.com Signed-off-by: Max Reitz <mreitz@redhat.com>	2019-02-25 15:11:28 +01:00
Max Reitz	1e47cb7f52	block/null: Generate filename even with latency-ns While we cannot represent the latency-ns option in a filename, it is not a strong option so not being able to should not stop us from generating a filename nonetheless. Signed-off-by: Max Reitz <mreitz@redhat.com> Reviewed-by: Alberto Garcia <berto@igalia.com> Message-id: 20190201192935.18394-30-mreitz@redhat.com Signed-off-by: Max Reitz <mreitz@redhat.com>	2019-02-25 15:11:27 +01:00
Max Reitz	937c007b6e	block/curl: Implement bdrv_refresh_filename() Signed-off-by: Max Reitz <mreitz@redhat.com> Reviewed-by: Alberto Garcia <berto@igalia.com> Message-id: 20190201192935.18394-29-mreitz@redhat.com Signed-off-by: Max Reitz <mreitz@redhat.com>	2019-02-25 15:11:27 +01:00
Max Reitz	712b64e8f3	block/curl: Harmonize option defaults Both of the defaults we currently have in the curl driver are named based on a slightly different schema, let's unify that and call both CURL_BLOCK_OPT_${NAME}_DEFAULT. While at it, we can add a macro for the third option for which a default exists, namely "sslverify". Signed-off-by: Max Reitz <mreitz@redhat.com> Reviewed-by: Alberto Garcia <berto@igalia.com> Message-id: 20190201192935.18394-28-mreitz@redhat.com Signed-off-by: Max Reitz <mreitz@redhat.com>	2019-02-25 15:11:27 +01:00
Max Reitz	cc61b0740f	block/nvme: Fix bdrv_refresh_filename() Currently, nvme's bdrv_refresh_filename() is an exact copy of null's implementation. However, for null, "null-co://" and "null-aio://" are indeed valid filenames -- for nvme, they are not, as a device address is still required. The correct implementation should generate a filename of the form "nvme://[PCI address]/[namespace]" (as the comment above nvme_parse_filename() describes). Signed-off-by: Max Reitz <mreitz@redhat.com> Reviewed-by: Alberto Garcia <berto@igalia.com> Message-id: 20190201192935.18394-27-mreitz@redhat.com Signed-off-by: Max Reitz <mreitz@redhat.com>	2019-02-25 15:11:27 +01:00
Max Reitz	998b3a1e5a	block: Purify .bdrv_refresh_filename() Currently, BlockDriver.bdrv_refresh_filename() is supposed to both refresh the filename (BDS.exact_filename) and set BDS.full_open_options. Now that we have generic code in the central bdrv_refresh_filename() for creating BDS.full_open_options, we can drop the latter part from all BlockDriver.bdrv_refresh_filename() implementations. This also means that we can drop all of the existing default code for this from the global bdrv_refresh_filename() itself. Furthermore, we now have to call BlockDriver.bdrv_refresh_filename() after having set BDS.full_open_options, because the block driver's implementation should now be allowed to depend on BDS.full_open_options being set correctly. Finally, with this patch we can drop the @options parameter from BlockDriver.bdrv_refresh_filename(); also, add a comment on this function's purpose in block/block_int.h while touching its interface. This completely obsoletes blklogwrite's implementation of .bdrv_refresh_filename(). Signed-off-by: Max Reitz <mreitz@redhat.com> Message-id: 20190201192935.18394-25-mreitz@redhat.com Signed-off-by: Max Reitz <mreitz@redhat.com>	2019-02-25 15:11:27 +01:00
Max Reitz	abc521a9aa	block: Add BlockDriver.bdrv_gather_child_options Some follow-up patches will rework the way bs->full_open_options is refreshed in bdrv_refresh_filename(). The new implementation will remove the need for the block drivers' bdrv_refresh_filename() implementations to set bs->full_open_options; instead, it will be generic and use static information from each block driver. However, by implementing bdrv_gather_child_options(), block drivers will still be able to override the way the full_open_options of their children are incorporated into their own. We need to implement this function for VMDK because we have to prevent the generic implementation from gathering the options of all children: It is not possible to specify options for the extents through the runtime options. For quorum, the child names that would be used by the generic implementation and the ones that we actually (currently) want to use differ. See quorum_gather_child_options() for more information. Note that both of these are cases which are not ideal: In case of VMDK it would probably be nice to be able to specify options for all extents. In case of quorum, the current runtime option structure is simply broken and needs to be fixed (but that is left for another patch). Signed-off-by: Max Reitz <mreitz@redhat.com> Reviewed-by: Alberto Garcia <berto@igalia.com> Message-id: 20190201192935.18394-23-mreitz@redhat.com Signed-off-by: Max Reitz <mreitz@redhat.com>	2019-02-25 15:11:27 +01:00
Max Reitz	2654267cc1	block: Add strong_runtime_opts to BlockDriver This new field can be set by block drivers to list the runtime options they accept that may influence the contents of the respective BDS. As of a follow-up patch, this list will be used by the common bdrv_refresh_filename() implementation to decide which options to put into BDS.full_open_options (and consequently whether a JSON filename has to be created), thus freeing the drivers of having to implement that logic themselves. Additionally, this patch adds the field to all of the block drivers that need it and sets it accordingly. Signed-off-by: Max Reitz <mreitz@redhat.com> Reviewed-by: Alberto Garcia <berto@igalia.com> Message-id: 20190201192935.18394-22-mreitz@redhat.com Signed-off-by: Max Reitz <mreitz@redhat.com>	2019-02-25 15:11:27 +01:00
Max Reitz	0dcbc54a95	block/nfs: Implement bdrv_dirname() While the basic idea is obvious and could be handled by the default bdrv_dirname() implementation, we cannot generate a directory name if the gid or uid are set, so we have to explicitly return NULL in those cases. Signed-off-by: Max Reitz <mreitz@redhat.com> Reviewed-by: Alberto Garcia <berto@igalia.com> Message-id: 20190201192935.18394-19-mreitz@redhat.com Signed-off-by: Max Reitz <mreitz@redhat.com>	2019-02-25 15:11:26 +01:00
Max Reitz	8a6239c071	block/nbd: Make bdrv_dirname() return NULL The generic bdrv_dirname() implementation would be able to generate some form of directory name for many NBD nodes, but it would be always wrong. Therefore, we have to explicitly make it an error (until NBD has some form of specification for export paths, if it ever will). Signed-off-by: Max Reitz <mreitz@redhat.com> Reviewed-by: Alberto Garcia <berto@igalia.com> Reviewed-by: Eric Blake <eblake@redhat.com> Message-id: 20190201192935.18394-18-mreitz@redhat.com Signed-off-by: Max Reitz <mreitz@redhat.com>	2019-02-25 15:11:26 +01:00
Max Reitz	f3037bd254	quorum: Make bdrv_dirname() return NULL While the common implementation for bdrv_dirname() should return NULL for quorum BDSs already (because they do not have a file node and their exact_filename field should be empty), there is no reason not to make that explicit. Signed-off-by: Max Reitz <mreitz@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Alberto Garcia <berto@igalia.com> Message-id: 20190201192935.18394-17-mreitz@redhat.com Signed-off-by: Max Reitz <mreitz@redhat.com>	2019-02-25 15:11:26 +01:00
Max Reitz	27953572a5	blkverify: Make bdrv_dirname() return NULL blkverify's BDSs have a file BDS, but we do not want this to be preferred over the raw node. There is no way to decide between the two (and not really a reason to, either), so just return NULL in blkverify's implementation of bdrv_dirname(). Signed-off-by: Max Reitz <mreitz@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Alberto Garcia <berto@igalia.com> Message-id: 20190201192935.18394-16-mreitz@redhat.com Signed-off-by: Max Reitz <mreitz@redhat.com>	2019-02-25 15:11:26 +01:00
Max Reitz	6b6833c1b4	block: bdrv_get_full_backing_filename's ret. val. Make bdrv_get_full_backing_filename() return an allocated string instead of placing the result in a caller-provided buffer. Signed-off-by: Max Reitz <mreitz@redhat.com> Reviewed-by: Alberto Garcia <berto@igalia.com> Message-id: 20190201192935.18394-12-mreitz@redhat.com Signed-off-by: Max Reitz <mreitz@redhat.com>	2019-02-25 15:11:26 +01:00
Max Reitz	645ae7d88e	block: bdrv_get_full_backing_filename_from_...'s ret. val. Make bdrv_get_full_backing_filename_from_filename() return an allocated string instead of placing the result in a caller-provided buffer. Signed-off-by: Max Reitz <mreitz@redhat.com> Message-id: 20190201192935.18394-11-mreitz@redhat.com Signed-off-by: Max Reitz <mreitz@redhat.com>	2019-02-25 15:11:26 +01:00
Max Reitz	009b03aaa2	block: Make path_combine() return the path Besides being safe for arbitrary path lengths, after some follow-up patches all callers will want a freshly allocated buffer anyway. In the meantime, path_combine_deprecated() is added which has the same interface as path_combine() had before this patch. All callers to that function will be converted in follow-up patches. Signed-off-by: Max Reitz <mreitz@redhat.com> Reviewed-by: Alberto Garcia <berto@igalia.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Message-id: 20190201192935.18394-10-mreitz@redhat.com Signed-off-by: Max Reitz <mreitz@redhat.com>	2019-02-25 15:11:26 +01:00
Max Reitz	998c201923	block: Add BDS.auto_backing_file If the backing file is overridden, this most probably does change the guest-visible data of a BDS. Therefore, we will need to consider this in bdrv_refresh_filename(). To see whether it has been overridden, we might want to compare bs->backing_file and bs->backing->bs->filename. However, bs->backing_file is changed by bdrv_set_backing_hd() (which is just used to change the backing child at runtime, without modifying the image header), so bs->backing_file most of the time simply contains a copy of bs->backing->bs->filename anyway, so it is useless for such a comparison. This patch adds an auto_backing_file BDS field which contains the backing file path as indicated by the image header, which is not changed by bdrv_set_backing_hd(). Because of bdrv_refresh_filename() magic, however, a BDS's filename may differ from what has been specified during bdrv_open(). Then, the comparison between bs->auto_backing_file and bs->backing->bs->filename may fail even though bs->backing was opened from bs->auto_backing_file. To mitigate this, we can copy the real BDS's filename (after the whole bdrv_open() and bdrv_refresh_filename() process) into bs->auto_backing_file, if we know the former has been opened based on the latter. This is only possible if no options modifying the backing file's behavior have been specified, though. To simplify things, this patch only copies the filename from the backing file if no options have been specified for it at all. Furthermore, there are cases where an overlay is created by qemu which already contains a BDS's filename (e.g. in blockdev-snapshot-sync). We do not need to worry about updating the overlay's bs->auto_backing_file there, because we actually wrote a post-bdrv_refresh_filename() filename into the image header. So all in all, there will be false negatives where (as of a future patch) bdrv_refresh_filename() will assume that the backing file differs from what was specified in the image header, even though it really does not. However, these cases should be limited to where (1) the user actually did override something in the backing chain (e.g. by specifying options for the backing file), or (2) the user executed a QMP command to change some node's backing file (e.g. change-backing-file or block-commit with @backing-file given) where the given filename does not happen to coincide with qemu's idea of the backing BDS's filename. Then again, (1) really is limited to -drive. With -blockdev or blockdev-add, you have to adhere to the schema, so a user cannot give partial "unimportant" options (e.g. by just setting backing.node-name and leaving the rest to the image header). Therefore, trying to fix this would mean trying to fix something for -drive only. To improve on (2), we would need a full infrastructure to "canonicalize" an arbitrary filename (+ options), so it can be compared against another. That seems a bit over the top, considering that filenames nowadays are there mostly for the user's entertainment. Signed-off-by: Max Reitz <mreitz@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Alberto Garcia <berto@igalia.com> Message-id: 20190201192935.18394-5-mreitz@redhat.com Signed-off-by: Max Reitz <mreitz@redhat.com>	2019-02-25 15:11:25 +01:00
Max Reitz	e24518e303	block: Use children list in bdrv_refresh_filename bdrv_refresh_filename() should invoke itself recursively on all children, not just on file. With that change, we can remove the manual invocations in blkverify, quorum, commit, mirror, and blklogwrites. Signed-off-by: Max Reitz <mreitz@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Alberto Garcia <berto@igalia.com> Message-id: 20190201192935.18394-3-mreitz@redhat.com Signed-off-by: Max Reitz <mreitz@redhat.com>	2019-02-25 15:11:25 +01:00
Max Reitz	f30c66ba6e	block: Use bdrv_refresh_filename() to pull Before this patch, bdrv_refresh_filename() is used in a pushing manner: Whenever the BDS graph is modified, the parents of the modified edges are supposed to be updated (recursively upwards). However, that is nonviable, considering that we want child changes not to concern parents. Also, in the long run we want a pull model anyway: Here, we would have a bdrv_filename() function which returns a BDS's filename, freshly constructed. This patch is an intermediate step. It adds bdrv_refresh_filename() calls before every place a BDS.filename value is used. The only exceptions are protocol drivers that use their own filename, which clearly would not profit from refreshing that filename before. Also, bdrv_get_encrypted_filename() is removed along the way (as a user of BDS.filename), since it is completely unused. In turn, all of the calls to bdrv_refresh_filename() before this patch are removed, because we no longer have to call this function on graph changes. Signed-off-by: Max Reitz <mreitz@redhat.com> Message-id: 20190201192935.18394-2-mreitz@redhat.com Reviewed-by: Eric Blake <eblake@redhat.com> Signed-off-by: Max Reitz <mreitz@redhat.com>	2019-02-25 15:11:25 +01:00
Thomas Huth	83c68e149a	block/nvme: Remove QEMU_PACKED from naturally aligned NVMeRegs struct The QEMU_PACKED is causing a compiler warning/error with GCC 9: CC block/nvme.o block/nvme.c: In function ‘nvme_create_queue_pair’: block/nvme.c:209:22: error: taking address of packed member of ‘struct <anonymous>’ may result in an unaligned pointer value [-Werror=address-of-packed-member] 209 \| q->sq.doorbell = &s->regs->doorbells[idx * 2 * s->doorbell_scale]; All members of the struct are naturally aligned, so there should not be the need for QEMU_PACKED here, and the following QEMU_BUILD_BUG_ON also ensures that there is no padding. Thus simply remove the QEMU_PACKED here. Buglink: https://bugs.launchpad.net/qemu/+bug/1817525 Reported-by: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com> Signed-off-by: Thomas Huth <thuth@redhat.com> Reviewed-by: Peter Maydell <peter.maydell@linaro.org> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-02-25 15:09:48 +01:00
Alberto Garcia	c1c4399084	qcow2: Assert that L2 table offsets fit in the L1 table L1 table entries have a field to store the offset of an L2 table. The rest of the bits of the entry are currently reserved except from bit 63, which stores the COPIED flag. The offset is always taken from the entry using L1E_OFFSET_MASK to ensure that we only use the bits that belong to that field. While that mask is used every time we read from the L1 table, it is never used when we write to it. Due to the limits set elsewhere in the code QEMU can never produce L2 table offsets that don't fit in that field so any such offset when allocating an L2 table would indicate a bug in QEMU. Signed-off-by: Alberto Garcia <berto@igalia.com> Reviewed-by: Eric Blake <eblake@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-02-25 15:05:23 +01:00
Kevin Wolf	28e0b2d2e1	nbd: Increase bs->in_flight during AioContext switch bdrv_drain() must not leave connection_co scheduled, so bs->in_flight needs to be increased while the coroutine is waiting to be scheduled in the new AioContext after nbd_client_attach_aio_context(). Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-02-25 15:03:19 +01:00
Kevin Wolf	d3bd5b9089	nbd: Use low-level QIOChannel API in nbd_read_eof() Instead of using the convenience wrapper qio_channel_read_all_eof(), use the lower level QIOChannel API. This means duplicating some code, but we'll need this because this coroutine yield is special: We want it to be interruptible so that nbd_client_attach_aio_context() can correctly reenter the coroutine. This moves the bdrv_dec/inc_in_flight() pair into nbd_read_eof(), so that connection_co will always sit in this exact qio_channel_yield() call when bdrv_drain() returns. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com>	2019-02-25 15:03:19 +01:00
Kevin Wolf	5ad81b4946	nbd: Restrict connection_co reentrance nbd_client_attach_aio_context() schedules connection_co in the new AioContext and this way reenters it in any arbitrary place that has yielded. We can restrict this a bit to the function call where the coroutine actually sits waiting when it's idle. This doesn't solve any bug yet, but it shows where in the code we need to support this random reentrance and where we don't have to care. Add FIXME comments for the existing bugs that the rest of this series will fix. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com>	2019-02-25 15:03:19 +01:00
Kevin Wolf	c90e2a9cfd	block-backend: Make blk_inc/dec_in_flight public For some users of BlockBackends, just increasing the in_flight counter is easier than implementing separate handlers in BlockDevOps. Make the helper functions for this public. Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-02-25 15:03:19 +01:00
Alberto Garcia	2468eed3be	commit: Replace commit_top_bs on failure after deleting the block job If there's an error in commit_start() then the block job must be deleted before replacing commit_top_bs, otherwise it will fail because of lack of permissions. This happens since the permission system was introduced in `8dfba27977`. Fortunately this bug doesn't seem to be possible to reproduce at the moment without changing the code. Signed-off-by: Alberto Garcia <berto@igalia.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-02-25 15:03:19 +01:00
Daniel Henrique Barboza	161e612d20	qcow2-snapshot: remove redundant find_snapshot_by_id_and_name call In qcow2_snapshot_create there is the following code block: /* Generate an ID / find_new_snapshot_id(bs, sn_info->id_str, sizeof(sn_info->id_str)); / Check that the ID is unique */ if (find_snapshot_by_id_and_name(bs, sn_info->id_str, NULL) >= 0) { return -EEXIST; } find_new_snapshot_id cycles through all snapshots, getting the id_str as an unsigned long int, calculating the max id_max value of all the existing id_strs and writing in the id_str pointer id_max + 1: for(i = 0; i < s->nb_snapshots; i++) { sn = s->snapshots + i; id = strtoul(sn->id_str, NULL, 10); if (id > id_max) id_max = id; } snprintf(id_str, id_str_size, "%lu", id_max + 1); Here, sn_info->id_str will have the unique value id_max + 1. Right after that, find_snapshot_by_id_and_name is called with id = sn_info->id_str and name = NULL. This will cause the function to execute the following: } else if (id) { for (i = 0; i < s->nb_snapshots; i++) { if (!strcmp(s->snapshots[i].id_str, id)) { return i; } } } In short, we're searching the existing snapshots to see if sn_info->id_str matches any existing id, right after we set in the previous line a sn_info->id_str value that is already unique. The first code block goes way back to commit `585f8587ad`, a 2006 commit from Fabrice Bellard that simply says "new qcow2 disk image format". No more info is provided about this logic in any subsequent commits that moved this code block around. I can't say about the original design, but the current logic is redundant. bdrv_snapshot_create is called in aio_context lock, forbidding any concurrent call to accidentally create a new snapshot between the find_new_snapshot_id and find_snapshot_by_id_and_name calls. What we're ending up doing is to cycle through the snapshots two times for no viable reason. This patch eliminates the redundancy by removing the 'id is unique' check that calls find_snapshot_by_id_and_name. Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-02-25 15:03:19 +01:00
Daniel Henrique Barboza	8c04093c8c	block/snapshot: remove bdrv_snapshot_delete_by_id_or_name After the previous patch, the only instance of this function left is inside qemu-img.c. qemu-img is using it inside the 'img_snapshot' function to delete snapshots in the SNAPSHOT_DELETE case, based on a "snapshot_name" string that refers to the tag, not ID, of the QEMUSnapshotInfo struct. This can be verified by checking the SNAPSHOT_CREATE case that comes shortly before SNAPSHOT_DELETE. In that case, the same "snapshot_name" variable is being strcpy to the 'name' field of the QEMUSnapshotInfo struct sn: pstrcpy(sn.name, sizeof(sn.name), snapshot_name); Based on that, it is unlikely that "snapshot_name" might contain an "id" in SNAPSHOT_DELETE. This patch changes SNAPSHOT_DELETE to use snapshot_find() and snapshot_delete() instead of bdrv_snapshot_delete_by_id_or_name. After that, there is no instances left of bdrv_snapshot_delete_by_id_or_name in the code, so it is safe to remove it entirely. Suggested-by: Murilo Opsfelder Araujo <muriloo@linux.ibm.com> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-02-25 15:03:18 +01:00
Daniel Henrique Barboza	6ca080453e	block/snapshot.c: eliminate use of ID input in snapshot operations At this moment, QEMU attempts to create/load/delete snapshots by using either an ID (id_str) or a name. The problem is that the code isn't consistent of whether the entered argument is an ID or a name, causing unexpected behaviors. For example, when creating snapshots via savevm <arg>, what happens is that "arg" is treated as both name and id_str. In a guest without snapshots, create a single snapshot via savevm: (qemu) savevm 0 (qemu) info snapshots List of snapshots present on all disks: ID TAG VM SIZE DATE VM CLOCK -- 0 741M 2018-07-31 13:39:56 00:41:25.313 A snapshot with name "0" is created. ID is hidden from the user, but the ID is a non-zero integer that starts at "1". Thus, this snapshot has id_str=1, TAG="0". Creating a second snapshot with arg = 1, the first one is deleted: (qemu) savevm 1 (qemu) info snapshots List of snapshots present on all disks: ID TAG VM SIZE DATE VM CLOCK -- 1 741M 2018-07-31 13:42:14 00:41:55.252 What happened? - when creating the second snapshot, a verification is done inside bdrv_all_delete_snapshot to delete any existing snapshots that matches an string argument. Here, the code calls bdrv_all_delete_snapshot("1", ...); - bdrv_all_delete_snapshot calls bdrv_snapshot_find(..., "1") for each BlockDriverState of the guest. And this is where things goes tilting: bdrv_snapshot_find does a search by both id_str and name. It finds out that there is a snapshot that has id_str = 1, stores a reference to the snapshot in the sn_info pointer and then returns match found; - since a match was found, a call to bdrv_snapshot_delete_by_id_or_name() is made. This function ignores the pointer written by bdrv_snapshot_find. Instead, it deletes the snapshot using bdrv_snapshot_delete() calling it first with id_str = 1. If it fails to delete, then it calls it again with name = 1. - after all that, QEMU creates the new snapshot, that has id_str = 1 and name = 1. The user is left wondering that happened with the first snapshot created. Similar bugs can be triggered when using loadvm and delvm. Before contemplating discarding the use of ID input in these operations, I've searched the code of what would be the implications. My findings are: - the RBD and Sheepdog drivers don't care. Both uses the 'name' field as key in their logic, making id_str = name when appropriate. replay-snapshot.c does not make any special use of id_str; - qcow2 uses id_str as an unique identifier but it is automatically calculated, not being influenced by user input. Other than that, there are no distinguish operations made only with id_str; - in blockdev.c, the delete operation uses a match of both id_str AND name. Given that id_str is either a copy of 'name' or auto-generated, we're fine here. This gives motivation to not consider ID as a valid user input in HMP commands - sticking with 'name' input only is more consistent. To accomplish that, the following changes were made in this patch: - bdrv_snapshot_find() does not match for id_str anymore, only 'name'. The function is called in save_snapshot(), load_snapshot(), bdrv_all_delete_snapshot() and bdrv_all_find_snapshot(). This change makes the search function more predictable and does not change the behavior of any underlying code that uses these affected functions, which are related to HMP (which is fine) and the main loop inside vl.c (which doesn't care about it anyways); - bdrv_all_delete_snapshot() does not call bdrv_snapshot_delete_by_id_or_name anymore. Instead, it uses the pointer returned by bdrv_snapshot_find to erase the snapshot with the exact match of id_str an name. This function is called in save_snapshot and hmp_delvm, thus this change produces the intended effect; - documentation changes to reflect the new behavior. I consider this to be an API fix instead of an API change - the user was already creating snapshots using 'name', but now he/she will also enjoy a consistent behavior. Ideally we would get rid of the id_str field entirely, but this would have repercussions on existing snapshots. Another day perhaps. Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com> Acked-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2019-02-25 15:03:18 +01:00
Vladimir Sementsov-Ogievskiy	199d95b043	block/vmdk: use qemu_iovec_init_buf Use new qemu_iovec_init_buf() instead of qemu_iovec_init_external( ... , 1), which simplifies the code. Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Message-id: 20190218140926.333779-12-vsementsov@virtuozzo.com Message-Id: <20190218140926.333779-12-vsementsov@virtuozzo.com> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>	2019-02-22 09:42:13 +00:00
Vladimir Sementsov-Ogievskiy	342544f98b	block/qed: use qemu_iovec_init_buf Use new qemu_iovec_init_buf() instead of qemu_iovec_init_external( ... , 1), which simplifies the code. Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Message-id: 20190218140926.333779-11-vsementsov@virtuozzo.com Message-Id: <20190218140926.333779-11-vsementsov@virtuozzo.com> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>	2019-02-22 09:42:13 +00:00

... 2 3 4 5 6 ...

4486 Commits