.bdrv_amend_clean() says block drivers can use it to clean up what was
done in .bdrv_amend_pre_run(). Therefore, it should always be called
after .bdrv_amend_pre_run(), which means we need it to call it in the
JobDriver.free() callback, not in JobDriver.clean().
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20220304153729.711387-3-hreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
bdrv_refresh_limits() recurses down to the node's children. That does
not seem necessary: When we refresh limits on some node, and then
recurse down and were to change one of its children's BlockLimits, then
that would mean we noticed the changed limits by pure chance. The fact
that we refresh the parent's limits has nothing to do with it, so the
reason for the change probably happened before this point in time, and
we should have refreshed the limits then.
Consequently, we should actually propagate block limits changes upwards,
not downwards. That is a separate and pre-existing issue, though, and
so will not be addressed in this patch.
The problem with recursing is that bdrv_refresh_limits() is not atomic.
It begins with zeroing BDS.bl, and only then sets proper, valid limits.
If we do not drain all nodes whose limits are refreshed, then concurrent
I/O requests can encounter invalid request_alignment values and crash
qemu. Therefore, a recursing bdrv_refresh_limits() requires the whole
subtree to be drained, which is currently not ensured by most callers.
A non-recursive bdrv_refresh_limits() only requires the node in question
to not receive I/O requests, and this is done by most callers in some
way or another:
- bdrv_open_driver() deals with a new node with no parents yet
- bdrv_set_file_or_backing_noperm() acts on a drained node
- bdrv_reopen_commit() acts only on drained nodes
- bdrv_append() should in theory require the node to be drained; in
practice most callers just lock the AioContext, which should at least
be enough to prevent concurrent I/O requests from accessing invalid
limits
So we can resolve the bug by making bdrv_refresh_limits() non-recursive.
Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=1879437
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20220216105355.30729-2-hreitz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20220303151616.325444-27-eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
block coroutines functions run in different aiocontext, and are
not protected by the BQL. Therefore are I/O.
On the other side, generated_co_wrapper functions use BDRV_POLL_WHILE,
meaning the caller can either be the main loop or a specific iothread.
Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20220303151616.325444-25-eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
copy-before-write functions always run under BQL.
Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20220303151616.325444-24-eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Snapshots run also under the BQL, so they all are
in the global state API. The aiocontext lock that they hold
is currently an overkill and in future could be removed.
Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20220303151616.325444-23-eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20220303151616.325444-22-eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Following the assertion derived from the API split,
propagate the assertion also in the static functions.
Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20220303151616.325444-18-eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Mark all I/O functions with IO_CODE, and all "I/O OR GS" with
IO_OR_GS_CODE.
Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20220303151616.325444-14-eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20220303151616.325444-13-eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Mark all I/O functions with IO_CODE, and all "I/O OR GS" with
IO_OR_GS_CODE.
Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20220303151616.325444-10-eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
All the global state (GS) API functions will check that
qemu_in_main_thread() returns true. If not, it means
that the safety of BQL cannot be guaranteed, and
they need to be moved to I/O.
Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20220303151616.325444-9-eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Similarly to the previous patches, split block-backend.h
in block-backend-io.h and block-backend-global-state.h
In addition, remove "block/block.h" include as it seems
it is not necessary anymore, together with "qemu/iov.h"
block-backend-common.h contains the structures shared between
the two headers, and the functions that can't be categorized as
I/O or global state.
Assertions are added in the next patch.
Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20220303151616.325444-8-eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Allow writable exports to get BLK_PERM_RESIZE permission
from creation, in fuse_export_create().
In this way, there is no need to give the permission in
fuse_do_truncate(), which might be run in an iothread.
Permissions should be set only in the main thread, so
in any case if an iothread tries to set RESIZE, it will
be blocked.
Also assert in fuse_do_truncate that if we give the
RESIZE permission we can then restore the original ones.
Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20220303151616.325444-7-eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Mark all I/O functions with IO_CODE, and all "I/O OR GS" with
IO_OR_GS_CODE.
Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20220303151616.325444-6-eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
All the global state (GS) API functions will check that
qemu_in_main_thread() returns true. If not, it means
that the safety of BQL cannot be guaranteed, and
they need to be moved to I/O.
Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20220303151616.325444-5-eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
block.h currently contains a mix of functions:
some of them run under the BQL and modify the block layer graph,
others are instead thread-safe and perform I/O in iothreads.
Some others can only be called by either the main loop or the
iothread running the AioContext (and not other iothreads),
and using them in another thread would cause deadlocks, and therefore
it is not ideal to define them as I/O.
It is not easy to understand which function is part of which
group (I/O vs GS vs "I/O or GS"), and this patch aims to clarify it.
The "GS" functions need the BQL, and often use
aio_context_acquire/release and/or drain to be sure they
can modify the graph safely.
The I/O function are instead thread safe, and can run in
any AioContext.
"I/O or GS" functions run instead in the main loop or in
a single iothread, and use BDRV_POLL_WHILE().
By splitting the header in two files, block-io.h
and block-global-state.h we have a clearer view on what
needs what kind of protection. block-common.h
contains common structures shared by both headers.
block.h is left there for legacy and to avoid changing
all includes in all c files that use the block APIs.
Assertions are added in the next patch.
Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20220303151616.325444-4-eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Following the bdrv_activate renaming, change also the name
of the respective callers.
bdrv_invalidate_cache_all -> bdrv_activate_all
blk_invalidate_cache -> blk_activate
test_sync_op_invalidate_cache -> test_sync_op_activate
No functional change intended.
Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20220209105452.1694545-5-eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This function is currently just a wrapper for bdrv_invalidate_cache(),
but in future will contain the code of bdrv_co_invalidate_cache() that
has to always be protected by BQL, and leave the rest in the I/O
coroutine.
Replace all bdrv_invalidate_cache() invokations with bdrv_activate().
Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20220209105452.1694545-4-eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
block_crypto_amend_options_generic_luks uses the block layer
permission API, therefore it should be called with the BQL held.
However, the same function is being called by two BlockDriver
callbacks: bdrv_amend_options (under BQL) and bdrv_co_amend (I/O).
The latter is I/O because it is invoked by block/amend.c's
blockdev_amend_run(), a .run callback of the amend JobDriver.
Therefore we want to change this function to still perform
the permission check, but making sure it is done under BQL regardless
of the caller context.
Remove the permission check in block_crypto_amend_options_generic_luks()
and:
- in block_crypto_amend_options_luks() (BQL case, called by
.bdrv_amend_options()), reuse helper functions
block_crypto_amend_{prepare/cleanup} that take care of checking
permissions.
- for block_crypto_co_amend_luks() (I/O case, called by
.bdrv_co_amend()), don't check for permissions but delegate
.bdrv_amend_pre_run() and .bdrv_amend_clean() to do it,
performing these checks before and after the job runs in its aiocontext.
Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20220209105452.1694545-3-eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Move the permission API calls into driver-specific callbacks
that always run under BQL. In this case, bdrv_crypto_luks
needs to perform permission checks before and after
qcrypto_block_amend_options(). The problem is that the caller,
block_crypto_amend_options_generic_luks(), can also run in I/O
from .bdrv_co_amend(). This does not comply with Global State-I/O API split,
as permissions API must always run under BQL.
Firstly, introduce .bdrv_amend_pre_run() and .bdrv_amend_clean()
callbacks. These two callbacks are guaranteed to be invoked under
BQL, respectively before and after .bdrv_co_amend().
They take care of performing the permission checks
in the same way as they are currently done before and after
qcrypto_block_amend_options().
These callbacks are in preparation for next patch, where we
delete the original permission check. Right now they just add redundant
control.
Then, call .bdrv_amend_pre_run() before job_start in
qmp_x_blockdev_amend(), so that it will be run before the job coroutine
is created and stay in the main loop.
As a cleanup, use JobDriver's .clean() callback to call
.bdrv_amend_clean(), and run amend-specific cleanup callbacks under BQL.
After this patch, permission failures occur early in the blockdev-amend
job to update a LUKS volume's keys. iotest 296 must now expect them in
x-blockdev-amend's QMP reply instead of waiting for the actual job to
fail later.
Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20220209105452.1694545-2-eesposit@redhat.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20220304153729.711387-6-hreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
- Improves documentation of SSH fingerprint checking
- Fixes SHA256 fingerprints with non-blockdev usage
- Blocks the clone3, setns, unshare & execveat syscalls
with seccomp
- Blocks process spawning via clone syscall, but allows
threads, with seccomp
- Takes over seccomp maintainer role
- Expands firmware descriptor spec to allow flash
without NVRAM
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE2vOm/bJrYpEtDo4/vobrtBUQT98FAmIOOBkACgkQvobrtBUQ
T9/ruhAAr8jkAH8FN5ftx2/L7q8SHpjPupue1CJ0Nl/ykmYhTGc+SqC3R2nZWOk2
Ws8hHVcDVT1lhrGxPtU7o+JPC1TebJTsloimJoKQY3qfdvZadJeR/4KsOUzi2ruu
VZ6HiYvZc1c9T+NPf3QRhBo7yyascKWKWHDseUNIt/2DiefCox4QFUDDMG86HiQF
KK30xWTvwJdcPxRlbfZbWRoqA0v4OoSDK6Ftp94FQSNBkExO85kstDq3xVaApf8H
DE1QD7gf+dvz11wVuFhrf4d1EH032nU0p0kMxhABc4/kZXo5iWXohhzML3/MUEVT
pe5/9pzUdWpfXQd/2r7x2PyPgySAG7lGbkgltowY52qnRPaNw9ukwkFCFAj8wiD8
FT2ghvkYD3zLfnZ3nuuzJVjf3pXgCc5VcfXaoffT72a7gpI1LTuEqPFwo04imV4l
21fYFx26mYTGCLH1CwVw8MQ2z/dg6uorT/NHdmRA/KrYJ1Elay2K7DV3Z5jOM5MI
0Ll5HkfsUut+1rioUjNgmlQ+96k/G0P0hVUoTUIcgl3U/GDx2+ypcrNTfmEcaCLV
bOhsjtrcg/KAXsCSbvnfDe3bWf0txnscyqoilEzDahLvciWG3d6qlhczLy29LGb4
/w7iqnUcSygXc+a9/ckVo1h5fo0i9qb3W8Pw9klapvz6SGJ83g4=
=PeCY
-----END PGP SIGNATURE-----
Merge remote-tracking branch 'remotes/berrange-gitlab/tags/misc-next-pull-request' into staging
This misc series of changes:
- Improves documentation of SSH fingerprint checking
- Fixes SHA256 fingerprints with non-blockdev usage
- Blocks the clone3, setns, unshare & execveat syscalls
with seccomp
- Blocks process spawning via clone syscall, but allows
threads, with seccomp
- Takes over seccomp maintainer role
- Expands firmware descriptor spec to allow flash
without NVRAM
# gpg: Signature made Thu 17 Feb 2022 11:57:13 GMT
# gpg: using RSA key DAF3A6FDB26B62912D0E8E3FBE86EBB415104FDF
# gpg: Good signature from "Daniel P. Berrange <dan@berrange.com>" [full]
# gpg: aka "Daniel P. Berrange <berrange@redhat.com>" [full]
# Primary key fingerprint: DAF3 A6FD B26B 6291 2D0E 8E3F BE86 EBB4 1510 4FDF
* remotes/berrange-gitlab/tags/misc-next-pull-request:
docs: expand firmware descriptor to allow flash without NVRAM
MAINTAINERS: take over seccomp from Eduardo Otubo
seccomp: block setns, unshare and execveat syscalls
seccomp: block use of clone3 syscall
seccomp: fix blocking of process spawning
seccomp: add unit test for seccomp filtering
seccomp: allow action to be customized per syscall
block: print the server key type and fingerprint on failure
block: support sha256 fingerprint with pre-blockdev options
block: better document SSH host key fingerprint checking
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
When validating the server key fingerprint fails, it is difficult for
the user to know what they got wrong. The fingerprint accepted by QEMU
is received in a different format than OpenSSH displays. There can also
be keys for multiple different ciphers in known_hosts. It may not be
obvious which cipher QEMU will use and whether it will be the same
as OpenSSH. Address this by printing the server key type and its
corresponding fingerprint in the format QEMU accepts.
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
When support for sha256 fingerprint checking was aded in
commit bf783261f0
Author: Daniel P. Berrangé <berrange@redhat.com>
Date: Tue Jun 22 12:51:56 2021 +0100
block/ssh: add support for sha256 host key fingerprints
it was only made to work with -blockdev. Getting it working with
-drive requires some extra custom parsing.
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
v2: add my s-o-b marks to each commit
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEi5wmzbL9FHyIDoahVh8kwfGfefsFAmIGYU8ACgkQVh8kwfGf
efslpQ/+OAEgA8z2frIOzoa5PKmyWDaE4xOAsr595EzhSnDeLi1QVIqSSXFap/m6
1rzqglD+5udM9LT9FZBHeMDbWr1UqvODZTAah6/T8E6vVzXm9Jn+CivEj4kT4Dd2
2Mrg3ydjkZsVivs1el4eG7QZ88TvSZkoNAIlT9lVYuT/TaRBKohrms1B4h25wqn9
u78tnbdAdEGtlHUFy9qyOXPt0eGRt6a1CTo+u0wwJLxKjLpq8cht0DPA4/CQhZay
Jm7cudJAJyTFxT7EUxpfROj4KuSMsUDmF/pNE8qwQuSnZC9G9u67ImhWpMFmwEZl
yCAJbOTudWYefgZ3+BtKMH5SBVLA6AsQ60luWkXkqK1dUB4cWxt5lG/UAm1OMJsn
FL6w3WJFkRTIC1OtHqqJnBtDFHWhTxNX9jjcAdb6lpjz4/wKywUVs+l5aSbxswog
DMrE1XT+JIab1IqAnSGw0QTY0H9COiQyOlsJ8EjtFFsqwLACSX4qBowesGfoA7IF
N0ZorAs6QuYYzEaSbRG1vB1lAusSCF5s/UnTg1mDLFWSqJOsqZN7jxXyQbbiFhsN
f+OTlyIwv4t32eGLRrKHF0vaWUoKQ3skB8m9UFMdQbGBO/LakZL9L42ITxarzUMI
CVeGLAzs3KvbZUlUyJWl97F2KkdrornmM/9CB57sjw13K94oNYo=
=+lgn
-----END PGP SIGNATURE-----
Merge remote-tracking branch 'remotes/vsementsov/tags/pull-nbd-2022-02-09-v2' into staging
nbd: handle AioContext change correctly
v2: add my s-o-b marks to each commit
# gpg: Signature made Fri 11 Feb 2022 13:14:55 GMT
# gpg: using RSA key 8B9C26CDB2FD147C880E86A1561F24C1F19F79FB
# gpg: Good signature from "Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>" [unknown]
# gpg: WARNING: This key is not certified with a trusted signature!
# gpg: There is no indication that the signature belongs to the owner.
# Primary key fingerprint: 8B9C 26CD B2FD 147C 880E 86A1 561F 24C1 F19F 79FB
* remotes/vsementsov/tags/pull-nbd-2022-02-09-v2:
iotests/281: Let NBD connection yield in iothread
block/nbd: Move s->ioc on AioContext change
iotests/281: Test lingering timers
iotests.py: Add QemuStorageDaemon class
block/nbd: Assert there are no timers when closed
block/nbd: Delete open timer when done
block/nbd: Delete reconnect delay timer when done
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
s->ioc must always be attached to the NBD node's AioContext. If that
context changes, s->ioc must be attached to the new context.
Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=2033626
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Our two timers must not remain armed beyond nbd_clear_bdrvstate(), or
they will access freed data when they fire.
This patch is separate from the patches that actually fix the issue
(HEAD^^ and HEAD^) so that you can run the associated regression iotest
(281) on a configuration that reproducibly exposes the bug.
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
We start the open timer to cancel the connection attempt after a while.
Once nbd_do_establish_connection() has returned, the attempt is over,
and we no longer need the timer.
Delete it before returning from nbd_open(), so that it does not persist
for longer. It has no use after nbd_open(), and just like the reconnect
delay timer, it might well be dangerous if it were to fire afterwards.
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
We start the reconnect delay timer to cancel the reconnection attempt
after a while. Once nbd_co_do_establish_connection() has returned, this
attempt is over, and we no longer need the timer.
Delete it before returning from nbd_reconnect_attempt(), so that it does
not persist beyond the I/O request that was paused for reconnecting; we
do not want it to fire in a drained section, because all sort of things
can happen in such a section (e.g. the AioContext might be changed, and
we do not want the timer to fire in the wrong context; or the BDS might
even be deleted, and so the timer CB would access already-freed data).
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
For a long time, we assumed that libxml2 is necessary for parallels
block format support (block/parallels*). However, this format actually
does not use libxml [*]. Since this is the only user of libxml2 in
whole QEMU tree, we can drop all libxml2 checks and dependencies too.
It is even more: --enable-parallels configure option was the only
option which was silently ignored when it's (fake) dependency
(libxml2) isn't installed.
Drop all mentions of libxml2.
[*] Actually the basis for libxml use were introduced in commit
ed279a06c5 ("configure: add dependency") but the implementation
was never merged:
https://lore.kernel.org/qemu-devel/70227bbd-a517-70e9-714f-e6e0ec431be9@openvz.org/
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20220119090423.149315-1-mjt@msgid.tls.msk.ru>
Tested-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
[PMD: Updated description and adapted to use lcitool]
Reviewed-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Message-Id: <20220121154134.315047-5-f4bug@amsat.org>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20220204204335.1689602-9-alex.bennee@linaro.org>
librbd had a bug until early 2022 that affected all versions of ceph that
supported fast-diff. This bug results in reporting of incorrect offsets
if the offset parameter to rbd_diff_iterate2 is not object aligned.
This patch works around this bug for pre Quincy versions of librbd.
Fixes: 0347a8fd4c
Cc: qemu-stable@nongnu.org
Signed-off-by: Peter Lieven <pl@kamp.de>
Message-Id: <20220113144426.4036493-3-pl@kamp.de>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Tested-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
the assumption that we can't hit a hole if we do not diff against a snapshot was wrong.
We can see a hole in an image if we diff against base if there exists an older snapshot
of the image and we have discarded blocks in the image where the snapshot has data.
Fix this by simply handling a hole like an unallocated area. There are no callbacks
for unallocated areas so just bail out if we hit a hole.
Fixes: 0347a8fd4c
Suggested-by: Ilya Dryomov <idryomov@gmail.com>
Cc: qemu-stable@nongnu.org
Signed-off-by: Peter Lieven <pl@kamp.de>
Message-Id: <20220113144426.4036493-2-pl@kamp.de>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
When building on FreeBSD we get:
[816/6851] Compiling C object libblockdev.fa.p/block_export_fuse.c.o
../block/export/fuse.c:628:16: error: use of undeclared identifier 'FALLOC_FL_KEEP_SIZE'
if (mode & FALLOC_FL_KEEP_SIZE) {
^
../block/export/fuse.c:651:16: error: use of undeclared identifier 'FALLOC_FL_PUNCH_HOLE'
if (mode & FALLOC_FL_PUNCH_HOLE) {
^
../block/export/fuse.c:652:22: error: use of undeclared identifier 'FALLOC_FL_KEEP_SIZE'
if (!(mode & FALLOC_FL_KEEP_SIZE)) {
^
3 errors generated.
FAILED: libblockdev.fa.p/block_export_fuse.c.o
Meson indeed reported FALLOC_FL_PUNCH_HOLE is not available:
C compiler for the host machine: cc (clang 10.0.1 "FreeBSD clang version 10.0.1")
Checking for function "fallocate" : NO
Checking for function "posix_fallocate" : YES
Header <linux/falloc.h> has symbol "FALLOC_FL_PUNCH_HOLE" : NO
Header <linux/falloc.h> has symbol "FALLOC_FL_ZERO_RANGE" : NO
...
Similarly to commit 304332039 ("block/export/fuse.c: fix musl build"),
guard the code requiring FALLOC_FL_KEEP_SIZE / FALLOC_FL_PUNCH_HOLE
definitions under CONFIG_FALLOCATE_PUNCH_HOLE #ifdef'ry.
Signed-off-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Message-Id: <20220201112655.344373-3-f4bug@amsat.org>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
In order to safely maintain a mixture of #ifdef'ry with if-else-if
ladder, rearrange the last statement (!mode) first. Since it is
mutually exclusive with the other conditions, checking it first
doesn't make any logical difference, but allows to add #ifdef'ry
around in a more cleanly way.
Suggested-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Message-Id: <20220201112655.344373-2-f4bug@amsat.org>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The vhost-user-blk export runs requests asynchronously in their own
coroutine. When the vhost connection goes away and we want to stop the
vhost-user server, we need to wait for these coroutines to stop before
we can unmap the shared memory. Otherwise, they would still access the
unmapped memory and crash.
This introduces a refcount to VuServer which is increased when spawning
a new request coroutine and decreased before the coroutine exits. The
memory is only unmapped when the refcount reaches zero.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20220125151435.48792-1-kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
After migration, the permissions the guest device wants to impose on its
BlockBackend are stored in blk->perm and blk->shared_perm. In
blk_root_activate(), we take our permissions, but keep all shared
permissions open by calling `blk_set_perm(blk->perm, BLK_PERM_ALL)`.
Only afterwards (immediately or later, depending on the runstate) do we
restrict the shared permissions by calling
`blk_set_perm(blk->perm, blk->shared_perm)`. Unfortunately, our first
call with shared_perm=BLK_PERM_ALL has overwritten blk->shared_perm to
be BLK_PERM_ALL, so this is a no-op and the set of shared permissions is
not restricted.
Fix this bug by saving the set of shared permissions before invoking
blk_set_perm() with BLK_PERM_ALL and restoring it afterwards.
Fixes: 5f7772c4d0
("block-backend: Defer shared_perm tightening migration
completion")
Reported-by: Peng Liang <liangpeng10@huawei.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20211125135317.186576-2-hreitz@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Tested-by: Peng Liang <liangpeng10@huawei.com>
If image doesn't have any compressed cluster we can easily switch to
zlib compression, which may allow to downgrade the image.
That's mostly needed to support IMGOPTS='compression_type=zstd' in some
iotests which do qcow2 downgrade.
While being here also fix checkpatch complain against '#' in printf
formatting.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20211223160144.1097696-13-vsementsov@virtuozzo.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
We update the block-status cache whenever we get new information from a
bdrv_co_block_status() call to the block driver. However, if we have
passed want_zero=false to that call, it may flag areas containing zeroes
as data, and so we would update the block-status cache with wrong
information.
Therefore, we should not update the cache with want_zero=false.
Reported-by: Nir Soffer <nsoffer@redhat.com>
Fixes: 0bc329fbb0 ("block: block-status cache for data regions")
Reviewed-by: Nir Soffer <nsoffer@redhat.com>
Cc: qemu-stable@nongnu.org
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20220118170000.49423-2-hreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
First, this permission never protected a node from being changed, as
generic child-replacing functions don't check it.
Second, it's a strange thing: it presents a permission of parent node
to change its child. But generally, children are replaced by different
mechanisms, like jobs or qmp commands, not by nodes.
Graph-mod permission is hard to understand. All other permissions
describe operations which done by parent node on its child: read,
write, resize. Graph modification operations are something completely
different.
The only place where BLK_PERM_GRAPH_MOD is used as "perm" (not shared
perm) is mirror_start_job, for s->target. Still modern code should use
bdrv_freeze_backing_chain() to protect from graph modification, if we
don't do it somewhere it may be considered as a bug. So, it's a bit
risky to drop GRAPH_MOD, and analyzing of possible loss of protection
is hard. But one day we should do it, let's do it now.
One more bit of information is that locking the corresponding byte in
file-posix doesn't make sense at all.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210902093754.2352-1-vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The calculation in sector2cluster() is done relative to the offset of
the root directory. Any writes to blocks before the start of the root
directory (in particular, writes to the FAT) result in negative values,
which are not handled correctly in vvfat_write().
This changes sector2cluster() to return a signed value, and makes sure
that vvfat_write() doesn't try to find mappings for negative cluster
number. It clarifies the code in vvfat_write() to make it more obvious
that the cluster numbers can be negative.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20211209152231.23756-1-kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The size of the qcow size was calculated so that only the FAT partition
would fit on it, but not the whole disk. However, offsets relative to
the whole disk are used to access it, so increase its size to be large
enough for that.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20211209151815.23495-1-kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The BlockBackend root child can change when aio_poll() is invoked. This
happens when a temporary filter node is removed upon blockjob
completion, for example.
Functions in block/block-backend.c must be aware of this when using a
blk_bs() pointer across aio_poll() because the BlockDriverState refcnt
may reach 0, resulting in a stale pointer.
One example is scsi_device_purge_requests(), which calls blk_drain() to
wait for in-flight requests to cancel. If the backup blockjob is active,
then the BlockBackend root child is a temporary filter BDS owned by the
blockjob. The blockjob can complete during bdrv_drained_begin() and the
last reference to the BDS is released when the temporary filter node is
removed. This results in a use-after-free when blk_drain() calls
bdrv_drained_end(bs) on the dangling pointer.
Explicitly hold a reference to bs across block APIs that invoke
aio_poll().
Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=2021778
Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=2036178
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20220111153613.25453-2-stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
drive_def is only a particular use case of
qemu_opts_parse_noisily, so it can be inlined.
Also remove drive_mark_claimed_by_board, as it is only defined
but not implemented (nor used) anywhere.
Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20211215121140.456939-3-eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Adaptive polling measures the execution time of the polling check plus
handlers called when a polled event becomes ready. Handlers can take a
significant amount of time, making it look like polling was running for
a long time when in fact the event handler was running for a long time.
For example, on Linux the io_submit(2) syscall invoked when a virtio-blk
device's virtqueue becomes ready can take 10s of microseconds. This
can exceed the default polling interval (32 microseconds) and cause
adaptive polling to stop polling.
By excluding the handler's execution time from the polling check we make
the adaptive polling calculation more accurate. As a result, the event
loop now stays in polling mode where previously it would have fallen
back to file descriptor monitoring.
The following data was collected with virtio-blk num-queues=2
event_idx=off using an IOThread. Before:
168k IOPS, IOThread syscalls:
9837.115 ( 0.020 ms): IO iothread1/620155 io_submit(ctx_id: 140512552468480, nr: 16, iocbpp: 0x7fcb9f937db0) = 16
9837.158 ( 0.002 ms): IO iothread1/620155 write(fd: 103, buf: 0x556a2ef71b88, count: 8) = 8
9837.161 ( 0.001 ms): IO iothread1/620155 write(fd: 104, buf: 0x556a2ef71b88, count: 8) = 8
9837.163 ( 0.001 ms): IO iothread1/620155 ppoll(ufds: 0x7fcb90002800, nfds: 4, tsp: 0x7fcb9f1342d0, sigsetsize: 8) = 3
9837.164 ( 0.001 ms): IO iothread1/620155 read(fd: 107, buf: 0x7fcb9f939cc0, count: 512) = 8
9837.174 ( 0.001 ms): IO iothread1/620155 read(fd: 105, buf: 0x7fcb9f939cc0, count: 512) = 8
9837.176 ( 0.001 ms): IO iothread1/620155 read(fd: 106, buf: 0x7fcb9f939cc0, count: 512) = 8
9837.209 ( 0.035 ms): IO iothread1/620155 io_submit(ctx_id: 140512552468480, nr: 32, iocbpp: 0x7fca7d0cebe0) = 32
174k IOPS (+3.6%), IOThread syscalls:
9809.566 ( 0.036 ms): IO iothread1/623061 io_submit(ctx_id: 140539805028352, nr: 32, iocbpp: 0x7fd0cdd62be0) = 32
9809.625 ( 0.001 ms): IO iothread1/623061 write(fd: 103, buf: 0x5647cfba5f58, count: 8) = 8
9809.627 ( 0.002 ms): IO iothread1/623061 write(fd: 104, buf: 0x5647cfba5f58, count: 8) = 8
9809.663 ( 0.036 ms): IO iothread1/623061 io_submit(ctx_id: 140539805028352, nr: 32, iocbpp: 0x7fd0d0388b50) = 32
Notice that ppoll(2) and eventfd read(2) syscalls are eliminated because
the IOThread stays in polling mode instead of falling back to file
descriptor monitoring.
As usual, polling is not implemented on Windows so this patch ignores
the new io_poll_read() callback in aio-win32.c.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Message-id: 20211207132336.36627-2-stefanha@redhat.com
[Fixed up aio_set_event_notifier() calls in
tests/unit/test-fdmon-epoll.c added after this series was queued.
--Stefan]
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
The handling for the XFS_IOC_DIOINFO ioctl is currently quite excessive:
This is not a "real" feature like the other features that we provide with
the "--enable-xxx" and "--disable-xxx" switches for the configure script,
since this does not influence lots of code (it's only about one call to
xfsctl() in file-posix.c), so people don't gain much with the ability to
disable this with "--disable-xfsctl".
It's also unfortunate that the ioctl will be disabled on Linux in case
the user did not install the right xfsprogs-devel package before running
configure. Thus let's simplify this by providing the ioctl definition
on our own, so we can completely get rid of the header dependency and
thus the related code in the configure script.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Thomas Huth <thuth@redhat.com>
Message-Id: <20211215125824.250091-1-thuth@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
It's unused now (except for permission handling)[*]. The only reasonable
user of it was block-stream job, recently updated to use own blk. And
other block jobs prefer to use own source node related objects.
So, the arguments of dropping the field are:
- block jobs prefer not to use it
- block jobs usually has more then one node to operate on, and better
to operate symmetrically (for example has both source and target
blk's in specific block-job state structure)
*: BlockJob.blk is used to keep some permissions. We simply move
permissions to block-job child created in block_job_create() together
with blk.
In mirror, we just should not care anymore about restoring state of
blk. Most probably this code could be dropped long ago, after dropping
bs->job pointer. Now it finally goes away together with BlockJob.blk
itself.
iotest 141 output is updated, as "bdrv_has_blk(bs)" check in
qmp_blockdev_del() doesn't fail (we don't have blk now). Still, new
error message looks even better.
In iotest 283 we need to add a job id, otherwise "Invalid job ID"
happens now earlier than permission check (as permissions moved from
blk to block-job node).
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Nikita Lapshin <nikita.lapshin@virtuozzo.com>
block-stream is the only block-job, that reasonably use BlockJob.blk.
We are going to drop BlockJob.blk soon. So, let block-stream have own
blk.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Nikita Lapshin <nikita.lapshin@virtuozzo.com>
It is useful when start of vm and start of nbd server are not
simple to sync.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
When the request free list is exhausted the coroutine waits on
q->free_req_queue for the next free request. Whenever a request is
completed a BH is scheduled to invoke nvme_free_req_queue_cb() and wake
up waiting coroutines.
1. nvme_get_free_req() waits for a free request:
while (q->free_req_head == -1) {
...
trace_nvme_free_req_queue_wait(q->s, q->index);
qemu_co_queue_wait(&q->free_req_queue, &q->lock);
...
}
2. nvme_free_req_queue_cb() wakes up the coroutine:
while (qemu_co_enter_next(&q->free_req_queue, &q->lock)) {
^--- infinite loop when free_req_head == -1
}
nvme_free_req_queue_cb() and the coroutine form an infinite loop when
q->free_req_head == -1. Fix this by checking q->free_req_head in
nvme_free_req_queue_cb(). If the free request list is exhausted, don't
wake waiting coroutines. Eventually an in-flight request will complete
and the BH will be scheduled again, guaranteeing forward progress.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-id: 20211208152246.244585-1-stefanha@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Function vvfat_open called function enable_write_target and init_directories,
and these functions malloc new memory for BDRVVVFATState::qcow_filename,
BDRVVVFATState::used_clusters, and BDRVVVFATState::cluster_buff.
When the specified folder does not exist ,it may contains memory leak.
After init_directories function is executed, the vvfat_open return -EIO,
and bdrv_open_driver goto label open_failed,
the program use g_free(bs->opaque) to release BDRVVVFATState struct
without members mentioned.
command line:
qemu-system-x86_64 -hdb <vdisk qcow file> -usb -device usb-storage,drive=fat16
-drive file=fat:rw:fat-type=16:"<path of a host folder does not exist>",
id=fat16,format=raw,if=none
enable_write_target called:
(gdb) bt
at ../block/vvfat.c:3114
flags=155650, errp=0x7fffffffd780) at ../block/vvfat.c:1236
node_name=0x0, options=0x555556fa45d0, open_flags=155650,
errp=0x7fffffffd890) at ../block.c:1558
errp=0x7fffffffd890) at ../block.c:1852
reference=0x0, options=0x555556fa45d0, flags=40962, parent=0x555556f98cd0,
child_class=0x555556b1d6a0 <child_of_bds>, child_role=19,
errp=0x7fffffffda90) at ../block.c:3779
options=0x555556f9cfc0, bdref_key=0x555556239bb8 "file",
parent=0x555556f98cd0, child_class=0x555556b1d6a0 <child_of_bds>,
child_role=19, allow_none=true, errp=0x7fffffffda90) at ../block.c:3419
reference=0x0, options=0x555556f9cfc0, flags=8194, parent=0x0,
child_class=0x0, child_role=0, errp=0x555556c98c40 <error_fatal>)
at ../block.c:3726
options=0x555556f757b0, flags=0, errp=0x555556c98c40 <error_fatal>)
at ../block.c:3872
options=0x555556f757b0, flags=0, errp=0x555556c98c40 <error_fatal>)
at ../block/block-backend.c:436
bs_opts=0x555556f757b0, errp=0x555556c98c40 <error_fatal>)
at ../blockdev.c:608
errp=0x555556c98c40 <error_fatal>) at ../blockdev.c:992
......
Signed-off-by: Daniella Lee <daniellalee111@gmail.com>
Message-Id: <20211119112553.352222-1-daniellalee111@gmail.com>
[hreitz: Took commit message from v1]
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
At the end of a reopen, we already call bdrv_refresh_limits(), which
should update bs->request_alignment according to the new file
descriptor. However, raw_probe_alignment() relies on s->needs_alignment
and just uses 1 if it isn't set. We neglected to update this field, so
starting with cache=writeback and then reopening with cache=none means
that we get an incorrect bs->request_alignment == 1 and unaligned
requests fail instead of being automatically aligned.
Fix this by recalculating s->needs_alignment in raw_refresh_limits()
before calling raw_probe_alignment().
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20211104113109.56336-1-kwolf@redhat.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20211115145409.176785-13-kwolf@redhat.com>
[hreitz: Fix iotest 142 for block sizes greater than 512 by operating on
a file with a size of 1 MB]
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20211116101431.105252-1-hreitz@redhat.com>
bdrv_cor_filter_drop() modifies the block graph. That means that other
parties can also modify the block graph before it returns. Therefore,
we cannot assume that the result of a graph traversal we did before
remains valid afterwards.
We should thus fetch `base` and `unfiltered_base` afterwards instead of
before.
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20211111120829.81329-2-hreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20211115145409.176785-2-kwolf@redhat.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Instead of duplicating code, extract the common helper to free
a single queue.
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20211006164931.172349-4-philmd@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
For debugging purpose it is helpful to know the CQ/SQ pointers.
We already have a trace event in nvme_free_queue_pair(), extend
it to report these pointer addresses.
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20211006164931.172349-3-philmd@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Since commit 4d324c0bf6 ("introduce QEMU_AUTO_VFREE") buffers
allocated by qemu_memalign() can automatically freed when using
the QEMU_AUTO_VFREE macro. Use it to simplify a bit.
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20211006164931.172349-2-philmd@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Similarly to e7e588d432, there is a
warning in block/block-backend.c that qiov->size <= INT64_MAX is always
true on machines where size_t is narrower than a uint64_t. In said
commit, we silenced this warning by casting to uint64_t.
The commit introducing this warning here
(a93d81c84a) anticipated it and so tried
to address it the same way. However, it only did so in one of two
places where this comparison occurs, and so we still need to fix up the
other one.
Fixes: a93d81c84a
("block-backend: convert blk_aio_ functions to int64_t bytes
paramter")
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20211026090745.30800-1-hreitz@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Between the submission of a request and the unplug, other devices
with larger limits may have been queued new requests without flushing
the batch.
Using the new `dev_max_batch` parameter, laio_io_unplug() can check
if the batch exceeds the device limit to flush the current batch.
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Message-Id: <20211026162346.253081-4-sgarzare@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This new parameter can be used by block devices to limit the
Linux AIO batch size more than the limit set by the AIO context.
file-posix backend supports this, passing its `aio-max-batch` option
previously added.
Add an helper function to calculate the maximum batch size.
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Message-Id: <20211026162346.253081-3-sgarzare@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Commit d7ddd0a161 ("linux-aio: limit the batch size using
`aio-max-batch` parameter") added a way to limit the batch size
of Linux AIO backend for the entire AIO context.
The same AIO context can be shared by multiple devices, so
latency-sensitive devices may want to limit the batch size even
more to avoid increasing latency.
For this reason we add the `aio-max-batch` option to the file
backend, which will be used by the next commits to limit the size of
batches including requests generated by this device.
Suggested-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Message-Id: <20211026162346.253081-2-sgarzare@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Include linux/falloc.h if CONFIG_FALLOCATE_ZERO_RANGE is defined to fix
50482fda98
and avoid the following build failure on musl:
../block/export/fuse.c: In function 'fuse_fallocate':
../block/export/fuse.c:643:21: error: 'FALLOC_FL_ZERO_RANGE' undeclared (first use in this function)
643 | else if (mode & FALLOC_FL_ZERO_RANGE) {
| ^~~~~~~~~~~~~~~~~~~~
Fixes:
- http://autobuild.buildroot.org/results/be24433a429fda681fb66698160132c1c99bc53b
Fixes: 50482fda98 ("block/export/fuse.c: fix musl build")
Signed-off-by: Fabrice Fontaine <fontaine.fabrice@gmail.com>
Message-Id: <20211022095209.1319671-1-fontaine.fabrice@gmail.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
the qemu rbd driver currently lacks support for bdrv_co_block_status.
This results mainly in incorrect progress during block operations (e.g.
qemu-img convert with an rbd image as source).
This patch utilizes the rbd_diff_iterate2 call from librbd to detect
allocated and unallocated (all zero areas).
To avoid querying the ceph OSDs for the answer this is only done if
the image has the fast-diff feature which depends on the object-map and
exclusive-lock features. In this case it is guaranteed that the information
is present in memory in the librbd client and thus very fast.
If fast-diff is not available all areas are reported to be allocated
which is the current behaviour if bdrv_co_block_status is not implemented.
Signed-off-by: Peter Lieven <pl@kamp.de>
Message-Id: <20211012152231.24868-1-pl@kamp.de>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
AIO discards regressed as a result of the following commit:
0dfc7af2 block/file-posix: Optimize for macOS
When trying to run blkdiscard within a Linux guest, the request would
fail, with some errors in dmesg:
---- [ snip ] ----
[ 4.010070] sd 2:0:0:0: [sda] tag#0 FAILED Result: hostbyte=DID_OK
driverbyte=DRIVER_SENSE
[ 4.011061] sd 2:0:0:0: [sda] tag#0 Sense Key : Aborted Command
[current]
[ 4.011061] sd 2:0:0:0: [sda] tag#0 Add. Sense: I/O process
terminated
[ 4.011061] sd 2:0:0:0: [sda] tag#0 CDB: Unmap/Read sub-channel 42
00 00 00 00 00 00 00 18 00
[ 4.011061] blk_update_request: I/O error, dev sda, sector 0
---- [ snip ] ----
This turns out to be a result of a flaw in changes to the error value
translation logic in handle_aiocb_discard(). The default return value
may be left untranslated in some configurations, and the wrong variable
is used in one translation.
Fix both issues.
Fixes: 0dfc7af2b2 ("block/file-posix: Optimize for macOS")
Cc: qemu-stable@nongnu.org
Signed-off-by: Ari Sundholm <ari@tuxera.com>
Signed-off-by: Emil Karlson <jkarlson@tuxera.com>
Reviewed-by: Akihiko Odaki <akihiko.odaki@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20211019110954.4170931-1-ari@tuxera.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The code in vpc.c uses BDRVVPCState->footer.type in various places
to decide whether the image is a fixed-size (VHD_FIXED) or a dynamic
(VHD_DYNAMIC) image. However, we never check that this field really
contains VHD_FIXED if we detected a fixed size image in vpc_open(),
so a wrong value here could cause quite some trouble during runtime.
Suggested-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Thomas Huth <thuth@redhat.com>
Message-Id: <20211012082702.792259-1-thuth@redhat.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
VMDK files support an attribute that represents the version of the guest
tools that are installed on the disk.
This attribute is used by vSphere before a machine has been started to
determine if the VM has the guest tools installed.
This is important when configuring "Operating system customizations" in
vSphere, as it checks for the presence of the guest tools before
allowing those customizations.
Thus when the VM has not yet booted normally it would be impossible to
customize it, therefore preventing a customized first-boot.
The attribute should not hurt on disks that do not have the guest tools
installed and indeed the VMware tools also unconditionally add this
attribute.
(Defaulting to the value "2147483647", as is done in this patch)
Signed-off-by: Thomas Weißschuh <thomas.weissschuh.ext@zeiss.com>
Message-Id: <20210913130419.13241-1-thomas.weissschuh.ext@zeiss.com>
[hreitz: Added missing '#' in block-core.json]
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
blk_check_bytes_request is called from blk_co_do_preadv,
blk_co_do_pwritev_part, blk_co_do_pdiscard and blk_co_copy_range
before (maybe) calling throttle_group_co_io_limits_intercept() (which
has int64_t argument) and then calling corresponding bdrv_co_ function.
bdrv_co_ functions are OK with int64_t bytes as well.
So dropping the check for INT_MAX we just get same restrictions as in
bdrv_ layer: discard and write-zeroes goes through
bdrv_check_qiov_request() and are allowed to be 64bit. Other requests
go through bdrv_check_request32() and still restricted by INT_MAX
boundary.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20211006131718.214235-13-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
To be consistent with declarations in include/sysemu/block-backend.h.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20211006131718.214235-12-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
1. Convert bytes in BlkAioEmAIOCB:
aio->bytes is only passed to already int64_t interfaces, and set in
blk_aio_prwv, which is updated here.
2. For all updated functions the parameter type becomes wider so callers
are safe.
3. In blk_aio_prwv we only store bytes to BlkAioEmAIOCB, which is
updated here.
4. Other updated functions are wrappers on blk_aio_prwv.
Note that blk_aio_preadv and blk_aio_pwritev become safer: before this
commit, it's theoretically possible to pass qiov with size exceeding
INT_MAX, which than converted to int argument of blk_aio_prwv. Now it's
converted to int64_t which is a lot better. Still add assertions.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20211006131718.214235-11-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
[eblake: tweak assertion and grammar]
Signed-off-by: Eric Blake <eblake@redhat.com>
Function is updated so that parameter type becomes wider, so all
callers should be OK with it.
Look at blk_co_copy_range() itself: bytes is passed only to
blk_check_byte_request() and bdrv_co_copy_range(), which already have
int64_t bytes parameter, so we are OK.
Note that requests exceeding INT_MAX are still restricted by
blk_check_byte_request().
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20211006131718.214235-10-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
[eblake: grammar tweaks]
Signed-off-by: Eric Blake <eblake@redhat.com>
Convert blk_pdiscard, blk_pwrite_compressed, blk_pwrite_zeroes.
These are just wrappers for functions with int64_t argument, so allow
passing int64_t as well. Parameter type becomes wider so all callers
should be OK with it.
Note that requests exceeding INT_MAX are still restricted by
blk_check_byte_request().
Note also that we don't (and are not going to) convert blk_pwrite and
blk_pread: these functions return number of bytes on success, so to
update them, we should change return type to int64_t as well, which
will lead to investigating and updating all callers which is too much.
So, blk_pread and blk_pwrite remain unchanged.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20211006131718.214235-9-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
[eblake: grammar tweaks]
Signed-off-by: Eric Blake <eblake@redhat.com>
Let's drop hand-made coroutine wrappers and use coroutine wrapper
generation like in block/io.c.
Now, blk_foo() functions are written in same way as blk_co_foo() ones,
but wrap blk_do_foo() instead of blk_co_do_foo().
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20211006131718.214235-8-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
[eblake: spelling fix]
Signed-off-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20211006131718.214235-7-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
This is a preparation to the following commit, to use automatic
coroutine wrapper generation.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20211006131718.214235-6-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
We updated blk_do_pdiscard() and its wrapper blk_co_pdiscard(). Both
functions are updated so that the parameter type becomes wider, so all
callers should be OK with it.
Look at blk_do_pdiscard(): bytes is passed only to
blk_check_byte_request() and bdrv_co_pdiscard(), which already have
int64_t bytes parameter, so we are OK.
Note that requests exceeding INT_MAX are still restricted by
blk_check_byte_request().
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20211006131718.214235-5-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
[eblake: grammar tweaks]
Signed-off-by: Eric Blake <eblake@redhat.com>
We convert blk_do_pwritev_part() and some wrappers:
blk_co_pwritev_part(), blk_co_pwritev(), blk_co_pwrite_zeroes().
All functions are converted so that the parameter type becomes wider, so
all callers should be OK with it.
Look at blk_do_pwritev_part() body:
bytes is passed to:
- trace_blk_co_pwritev (we update it here)
- blk_check_byte_request, throttle_group_co_io_limits_intercept,
bdrv_co_pwritev_part - all already have int64_t argument.
Note that requests exceeding INT_MAX are still restricted by
blk_check_byte_request().
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20211006131718.214235-4-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
[eblake: grammar tweaks]
Signed-off-by: Eric Blake <eblake@redhat.com>
For both updated functions, the type of bytes becomes wider, so all callers
should be OK with it.
blk_co_preadv() only passes its arguments to blk_do_preadv().
blk_do_preadv() passes bytes to:
- trace_blk_co_preadv, which is updated too
- blk_check_byte_request, throttle_group_co_io_limits_intercept,
bdrv_co_preadv, which are already int64_t.
Note that requests exceeding INT_MAX are still restricted by
blk_check_byte_request().
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20211006131718.214235-3-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
[eblake: grammar tweaks]
Signed-off-by: Eric Blake <eblake@redhat.com>
Rename size and make it int64_t to correspond to modern block layer,
which always uses int64_t for offset and bytes (not in blk layer yet,
which is a task for following commits).
All callers pass int or unsigned int.
So, for bytes in [0, INT_MAX] nothing is changed, for negative bytes we
now fail on "bytes < 0" check instead of "bytes > INT_MAX" check.
Note, that blk_check_byte_request() still doesn't allow requests
exceeding INT_MAX.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20211006131718.214235-2-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
With -m32, size_t is generally only a uint32_t. That makes clang
complain that in the assertion
assert(qiov->size <= INT64_MAX);
the range of the type of qiov->size (size_t) is too small for any of its
values to ever exceed INT64_MAX.
Cast qiov->size to uint64_t to silence clang.
Fixes: f7ef38dd13
("block: use int64_t instead of uint64_t in driver read
handlers")
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20211011155031.149158-1-hreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
v2: add small fix by Stefano, Hanna's series fixed
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEi5wmzbL9FHyIDoahVh8kwfGfefsFAmFfEVMACgkQVh8kwfGf
eftAVA//WLtOaiVYPSjEl5EK80kry39VknZkQyeYUzyV7JNr/FRMlbJaIF2sOjH5
KPRpfwBiuijOc8R0s34HY0BpyweRd1rbypHblZkO7EO4XwHx1FLF5kNHF6yV7wPL
c9W564sZpc6Z96wSMgC4Is9QHJ6JbO4TJNJsG8v/PHEqGQV/yCYkgBox4loJckww
uSAZ7l63IWA8uPSq/rOu34bREKN9s0kHkvFq0JNWk2HtOBLDiRDUYmbSfdjfT4jz
np7ojKiffcAJED9JA28Zo2Y+MSId+FyoO4lbt+deMNzIHboy2oVlHouoHHprr61x
dIO7Qt1IoMk5IBIfkPRYkReMwxxSVKuIJcWm8Qqtkcg2X0g5ayNUmHwpBMd50h2z
XPjrr0YdOixhxMHoBnqlkPlWU0Y/B+YJIQ+mjqp+vRNkk94NoXhsXnCod1ajkgWO
zjc/dztew7HvNStJaMM0rnEjanLhzFZKtlMO4WwZHQp2yZG2AINkPStswo2f3AmL
FI+2By/UhFKm3BEemf0wYWDPWrPHU+BOiu16KjSKeS0GA9t7GXBUDRxNYPhUheXJ
eJKIpNsGbseNxKrAbLyRhAB75Fa/ReZqqybmEcLyal/ball3R/cNF3gaMHeX0o1n
HTGIAF5JOAXNGApS5TilkXPZ7jHFOVPh/Fi6/16/08tcgxjVfro=
=TVTu
-----END PGP SIGNATURE-----
Merge remote-tracking branch 'remotes/vsementsov/tags/pull-jobs-2021-10-07-v2' into staging
mirror: Handle errors after READY cancel
v2: add small fix by Stefano, Hanna's series fixed
# gpg: Signature made Thu 07 Oct 2021 08:25:07 AM PDT
# gpg: using RSA key 8B9C26CDB2FD147C880E86A1561F24C1F19F79FB
# gpg: Good signature from "Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>" [unknown]
# gpg: WARNING: This key is not certified with a trusted signature!
# gpg: There is no indication that the signature belongs to the owner.
# Primary key fingerprint: 8B9C 26CD B2FD 147C 880E 86A1 561F 24C1 F19F 79FB
* remotes/vsementsov/tags/pull-jobs-2021-10-07-v2:
iotests: Add mirror-ready-cancel-error test
mirror: Do not clear .cancelled
mirror: Stop active mirroring after force-cancel
mirror: Check job_is_cancelled() earlier
mirror: Use job_is_cancelled()
job: Add job_cancel_requested()
job: Do not soft-cancel after a job is done
jobs: Give Job.force_cancel more meaning
job: @force parameter for job_cancel_sync()
job: Force-cancel jobs in a failed transaction
mirror: Drop s->synced
mirror: Keep s->synced on error
job: Context changes in job_completed_txn_abort()
block/aio_task: assert `max_busy_tasks` is greater than 0
block/backup: avoid integer overflow of `max-workers`
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Clearing .cancelled before leaving the main loop when the job has been
soft-cancelled is no longer necessary since job_is_cancelled() only
returns true for jobs that have been force-cancelled.
Therefore, this only makes a differences in places that call
job_cancel_requested(). In block/mirror.c, this is done only before
.cancelled was cleared.
In job.c, there are two callers:
- job_completed_txn_abort() asserts that .cancelled is true, so keeping
it true will not affect this place.
- job_complete() refuses to let a job complete that has .cancelled set.
It is correct to refuse to let the user invoke job-complete on mirror
jobs that have already been soft-cancelled.
With this change, there are no places that reset .cancelled to false and
so we can be sure that .force_cancel can only be true if .cancelled is
true as well. Assert this in job_is_cancelled().
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20211006151940.214590-13-hreitz@redhat.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Once the mirror job is force-cancelled (job_is_cancelled() is true), we
should not generate new I/O requests. This applies to active mirroring,
too, so stop it once the job is cancelled.
(We must still forward all I/O requests to the source, though, of
course, but those are not really I/O requests generated by the job, so
this is fine.)
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20211006151940.214590-12-hreitz@redhat.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
We must check whether the job is force-cancelled early in our main loop,
most importantly before any `continue` statement. For example, we used
to have `continue`s before our current checking location that are
triggered by `mirror_flush()` failing. So, if `mirror_flush()` kept
failing, force-cancelling the job would not terminate it.
Jobs can be cancelled while they yield, and once they are
(force-cancelled), they should not generate new I/O requests.
Therefore, we should put the check after the last yield before
mirror_iteration() is invoked.
Buglink: https://gitlab.com/qemu-project/qemu/-/issues/462
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20211006151940.214590-11-hreitz@redhat.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
mirror_drained_poll() returns true whenever the job is cancelled,
because "we [can] be sure that it won't issue more requests". However,
this is only true for force-cancelled jobs, so use job_is_cancelled().
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20211006151940.214590-10-hreitz@redhat.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Most callers of job_is_cancelled() actually want to know whether the job
is on its way to immediate termination. For example, we refuse to pause
jobs that are cancelled; but this only makes sense for jobs that are
really actually cancelled.
A mirror job that is cancelled during READY with force=false should
absolutely be allowed to pause. This "cancellation" (which is actually
a kind of completion) may take an indefinite amount of time, and so
should behave like any job during normal operation. For example, with
on-target-error=stop, the job should stop on write errors. (In
contrast, force-cancelled jobs should not get write errors, as they
should just terminate and not do further I/O.)
Therefore, redefine job_is_cancelled() to only return true for jobs that
are force-cancelled (which as of HEAD^ means any job that interprets the
cancellation request as a request for immediate termination), and add
job_cancel_requested() as the general variant, which returns true for
any jobs which have been requested to be cancelled, whether it be
immediately or after an arbitrarily long completion phase.
Finally, here is a justification for how different job_is_cancelled()
invocations are treated by this patch:
- block/mirror.c (mirror_run()):
- The first invocation is a while loop that should loop until the job
has been cancelled or scheduled for completion. What kind of cancel
does not matter, only the fact that the job is supposed to end.
- The second invocation wants to know whether the job has been
soft-cancelled. Calling job_cancel_requested() is a bit too broad,
but if the job were force-cancelled, we should leave the main loop
as soon as possible anyway, so this should not matter here.
- The last two invocations already check force_cancel, so they should
continue to use job_is_cancelled().
- block/backup.c, block/commit.c, block/stream.c, anything in tests/:
These jobs know only force-cancel, so there is no difference between
job_is_cancelled() and job_cancel_requested(). We can continue using
job_is_cancelled().
- job.c:
- job_pause_point(), job_yield(), job_sleep_ns(): Only force-cancelled
jobs should be prevented from being paused. Continue using job_is_cancelled().
- job_update_rc(), job_finalize_single(), job_finish_sync(): These
functions are all called after the job has left its main loop. The
mirror job (the only job that can be soft-cancelled) will clear
.cancelled before leaving the main loop if it has been
soft-cancelled. Therefore, these functions will observe .cancelled
to be true only if the job has been force-cancelled. We can
continue to use job_is_cancelled().
(Furthermore, conceptually, a soft-cancelled mirror job should not
report to have been cancelled. It should report completion (see
also the block-job-cancel QAPI documentation). Therefore, it makes
sense for these functions not to distinguish between a
soft-cancelled mirror job and a job that has completed as normal.)
- job_completed_txn_abort(): All jobs other than @job have been
force-cancelled. job_is_cancelled() must be true for them.
Regarding @job itself: job_completed_txn_abort() is mostly called
when the job's return value is not 0. A soft-cancelled mirror has a
return value of 0, and so will not end up here then.
However, job_cancel() invokes job_completed_txn_abort() if the job
has been deferred to the main loop, which is mostly the case for
completed jobs (which skip the assertion), but not for sure.
To be safe, use job_cancel_requested() in this assertion.
- job_complete(): This is function eventually invoked by the user
(through qmp_block_job_complete() or qmp_job_complete(), or
job_complete_sync(), which comes from qemu-img). The intention here
is to prevent a user from invoking job-complete after the job has
been cancelled. This should also apply to soft cancelling: After a
mirror job has been soft-cancelled, the user should not be able to
decide otherwise and have it complete as normal (i.e. pivoting to
the target).
- job_cancel(): Both functions are equivalent (see comment there), but
we want to use job_is_cancelled(), because this shows that we call
job_completed_txn_abort() only for force-cancelled jobs. (As
explained for job_update_rc(), soft-cancelled jobs should be treated
as if they have completed as normal.)
Buglink: https://gitlab.com/qemu-project/qemu/-/issues/462
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20211006151940.214590-9-hreitz@redhat.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
We largely have two cancel modes for jobs:
First, there is actual cancelling. The job is terminated as soon as
possible, without trying to reach a consistent result.
Second, we have mirror in the READY state. Technically, the job is not
really cancelled, but it just is a different completion mode. The job
can still run for an indefinite amount of time while it tries to reach a
consistent result.
We want to be able to clearly distinguish which cancel mode a job is in
(when it has been cancelled). We can use Job.force_cancel for this, but
right now it only reflects cancel requests from the user with
force=true, but clearly, jobs that do not even distinguish between
force=false and force=true are effectively always force-cancelled.
So this patch has Job.force_cancel signify whether the job will
terminate as soon as possible (force_cancel=true) or whether it will
effectively remain running despite being "cancelled"
(force_cancel=false).
To this end, we let jobs that provide JobDriver.cancel() tell the
generic job code whether they will terminate as soon as possible or not,
and for jobs that do not provide that method we assume they will.
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20211006151940.214590-7-hreitz@redhat.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Callers should be able to specify whether they want job_cancel_sync() to
force-cancel the job or not.
In fact, almost all invocations do not care about consistency of the
result and just want the job to terminate as soon as possible, so they
should pass force=true. The replication block driver is the exception,
specifically the active commit job it runs.
As for job_cancel_sync_all(), all callers want it to force-cancel all
jobs, because that is the point of it: To cancel all remaining jobs as
quickly as possible (generally on process termination). So make it
invoke job_cancel_sync() with force=true.
This changes some iotest outputs, because quitting qemu while a mirror
job is active will now lead to it being cancelled instead of completed,
which is what we want. (Cancelling a READY mirror job with force=false
may take an indefinite amount of time, which we do not want when
quitting. If users want consistent results, they must have all jobs be
done before they quit qemu.)
Buglink: https://gitlab.com/qemu-project/qemu/-/issues/462
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20211006151940.214590-6-hreitz@redhat.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
As of HEAD^, there is no meaning to s->synced other than whether the job
is READY or not. job_is_ready() gives us that information, too.
Suggested-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20211006151940.214590-4-hreitz@redhat.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
An error does not take us out of the READY phase, which is what
s->synced signifies. It does of course mean that source and target are
no longer in sync, but that is what s->actively_sync is for -- s->synced
never meant that source and target are in sync, only that they were at
some point (and at that point we transitioned into the READY phase).
The tangible problem is that we transition to READY once we are in sync
and s->synced is false. By resetting s->synced here, we will transition
from READY to READY once the error is resolved (if the job keeps
running), and that transition is not allowed.
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20211006151940.214590-3-hreitz@redhat.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Linux limits the size of iovecs to 1024 (UIO_MAXIOV in the kernel
sources, IOV_MAX in POSIX). Because of this, on some host adapters
requests with many iovecs are rejected with -EINVAL by the
io_submit() or readv()/writev() system calls.
In fact, the same limit applies to SG_IO as well. To fix both the
EINVAL and the possible performance issues from using fewer iovecs
than allowed by Linux (some HBAs have max_segments as low as 128),
introduce a separate entry in BlockLimits to hold the max_segments
value from sysfs. This new limit is used only for SG_IO and clamped
to bs->bl.max_iov anyway, just like max_hw_transfer is clamped to
bs->bl.max_transfer.
Reported-by: Halil Pasic <pasic@linux.ibm.com>
Cc: Hanna Reitz <hreitz@redhat.com>
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: qemu-block@nongnu.org
Cc: qemu-stable@nongnu.org
Fixes: 18473467d5 ("file-posix: try BLKSECTGET on block devices too, do not round to power of 2", 2021-06-25)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210923130436.1187591-1-pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
All code in block/aio_task.c expects `max_busy_tasks` to always
be greater than 0.
Assert this condition during the AioTaskPool creation where
`max_busy_tasks` is set.
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20211005161157.282396-3-sgarzare@redhat.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
QAPI generates `struct BackupPerf` where `max-workers` value is stored
in an `int64_t` variable.
But block_copy_async(), and the underlying code, uses an `int` parameter.
At the end that variable is used to initialize `max_busy_tasks` in
block/aio_task.c causing the following assertion failure if a value
greater than INT_MAX(2147483647) is used:
../block/aio_task.c:63: aio_task_pool_wait_one: Assertion `pool->busy_tasks > 0' failed.
Let's check that `max-workers` doesn't exceed INT_MAX and print an
error in that case.
Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=2009310
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20211005161157.282396-2-sgarzare@redhat.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
If we don't have active request, that waiting for this handle to be
received, we should report an error.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210902103805.25686-6-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
OK, that's a big rewrite of the logic.
Pre-patch we have an always running coroutine - connection_co. It does
reply receiving and reconnecting. And it leads to a lot of difficult
and unobvious code around drained sections and context switch. We also
abuse bs->in_flight counter which is increased for connection_co and
temporary decreased in points where we want to allow drained section to
begin. One of these place is in another file: in nbd_read_eof() in
nbd/client.c.
We also cancel reconnect and requests waiting for reconnect on drained
begin which is not correct. And this patch fixes that.
Let's finally drop this always running coroutine and go another way:
do both reconnect and receiving in request coroutines.
The detailed list of changes below (in the sequence of diff hunks).
1. receiving coroutines are woken directly from nbd_channel_error, when
we change s->state
2. nbd_co_establish_connection_cancel(): we don't have drain_begin now,
and in nbd_teardown_connection() all requests should already be
finished (and reconnect is done from request). So
nbd_co_establish_connection_cancel() is called from
nbd_cancel_in_flight() (to cancel the request that is doing
nbd_co_establish_connection()) and from reconnect_delay_timer_cb()
(previously we didn't need it, as reconnect delay only should cancel
active requests not the reconnection itself). But now reconnection
itself is done in the separate thread (we now call
nbd_client_connection_enable_retry() in nbd_open()), and we need to
cancel the requests that wait in nbd_co_establish_connection()
now).
2A. We do receive headers in request coroutine. But we also should
dispatch replies for other pending requests. So,
nbd_connection_entry() is turned into nbd_receive_replies(), which
does reply dispatching while it receives other request headers, and
returns when it receives the requested header.
3. All old staff around drained sections and context switch is dropped.
In details:
- we don't need to move connection_co to new aio context, as we
don't have connection_co anymore
- we don't have a fake "request" of connection_co (extra increasing
in_flight), so don't care with it in drain_begin/end
- we don't stop reconnection during drained section anymore. This
means that drain_begin may wait for a long time (up to
reconnect_delay). But that's an improvement and more correct
behavior see below[*]
4. In nbd_teardown_connection() we don't have to wait for
connection_co, as it is dropped. And cleanup for s->ioc and nbd_yank
is moved here from removed connection_co.
5. In nbd_co_do_establish_connection() we now should handle
NBD_CLIENT_CONNECTING_NOWAIT: if new request comes when we are in
NBD_CLIENT_CONNECTING_NOWAIT, it still should call
nbd_co_establish_connection() (who knows, maybe the connection was
already established by another thread in the background). But we
shouldn't wait: if nbd_co_establish_connection() can't return new
channel immediately the request should fail (we are in
NBD_CLIENT_CONNECTING_NOWAIT state).
6. nbd_reconnect_attempt() is simplified: it's now easier to wait for
other requests in the caller, so here we just assert that fact.
Also delay time is now initialized here: we can easily detect first
attempt and start a timer.
7. nbd_co_reconnect_loop() is dropped, we don't need it. Reconnect
retries are fully handle by thread (nbd/client-connection.c), delay
timer we initialize in nbd_reconnect_attempt(), we don't have to
bother with s->drained and friends. nbd_reconnect_attempt() now
called from nbd_co_send_request().
8. nbd_connection_entry is dropped: reconnect is now handled by
nbd_co_send_request(), receiving reply is now handled by
nbd_receive_replies(): all handled from request coroutines.
9. So, welcome new nbd_receive_replies() called from request coroutine,
that receives reply header instead of nbd_connection_entry().
Like with sending requests, only one coroutine may receive in a
moment. So we introduce receive_mutex, which is locked around
nbd_receive_reply(). It also protects some related fields. Still,
full audit of thread-safety in nbd driver is a separate task.
New function waits for a reply with specified handle being received
and works rather simple:
Under mutex:
- if current handle is 0, do receive by hand. If another handle
received - switch to other request coroutine, release mutex and
yield. Otherwise return success
- if current handle == requested handle, we are done
- otherwise, release mutex and yield
10: in nbd_co_send_request() we now do nbd_reconnect_attempt() if
needed. Also waiting in free_sema queue we now wait for one of two
conditions:
- connectED, in_flight < MAX_NBD_REQUESTS (so we can start new one)
- connectING, in_flight == 0, so we can call
nbd_reconnect_attempt()
And this logic is protected by s->send_mutex
Also, on failure we don't have to care of removed s->connection_co
11. nbd_co_do_receive_one_chunk(): now instead of yield() and wait for
s->connection_co we just call new nbd_receive_replies().
12. nbd_co_receive_one_chunk(): place where s->reply.handle becomes 0,
which means that handling of the whole reply is finished. Here we
need to wake one of coroutines sleeping in nbd_receive_replies().
If none are sleeping - do nothing. That's another behavior change: we
don't have endless recv() in the idle time. It may be considered as
a drawback. If so, it may be fixed later.
13. nbd_reply_chunk_iter_receive(): don't care about removed
connection_co, just ping in_flight waiters.
14. Don't create connection_co, enable retry in the connection thread
(we don't have own reconnect loop anymore)
15. We now need to add a nbd_co_establish_connection_cancel() call in
nbd_cancel_in_flight(), to cancel the request that is doing a
connection attempt.
[*], ok, now we don't cancel reconnect on drain begin. That's correct:
reconnect feature leads to possibility of long-running requests (up
to reconnect delay). Still, drain begin is not a reason to kill
long requests. We should wait for them.
This also means, that we can again reproduce a dead-lock, described
in 8c517de24a.
Why we are OK with it:
1. Now this is not absolutely-dead dead-lock: the vm is unfrozen
after reconnect delay. Actually 8c517de24a fixed a bug in
NBD logic, that was not described in 8c517de24a and led to
forever dead-lock. The problem was that nobody woke the free_sema
queue, but drain_begin can't finish until there is a request in
free_sema queue. Now we have a reconnect delay timer that works
well.
2. It's not a problem of the NBD driver, but of the ide code,
because it does drain_begin under the global mutex; the problem
doesn't reproduce when using scsi instead of ide.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210902103805.25686-5-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
[eblake: grammar and comment tweaks]
Signed-off-by: Eric Blake <eblake@redhat.com>
Split out nbd_recv_coroutine_wake_one(), as it will be used
separately.
Rename the function and add a possibility to wake only first found
sleeping coroutine.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210902103805.25686-4-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
[eblake: grammar tweak]
Signed-off-by: Eric Blake <eblake@redhat.com>