Commit Graph

5437 Commits

Author SHA1 Message Date
Richard Henderson 741bdeb1d5 Block layer patches
- Fail gracefully when blockdev-snapshot creates loops
 - ide: Fix IDENTIFY DEVICE for disks > 128 GiB
 - file-posix: Fix return value translation for AIO discards
 - file-posix: add 'aio-max-batch' option
 - rbd: implement bdrv_co_block_status
 - Code cleanups and build fixes
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCAAvFiEE3D3rFZqa+V09dFb+fwmycsiPL9YFAmGBYXIRHGt3b2xmQHJl
 ZGhhdC5jb20ACgkQfwmycsiPL9Y3Qw//S9iBIU0cemm5mX/LlDzkrb0VhB5zCQ/C
 8ReZh2R0j3kJl2n+ViRr3qRzvjKyjYuKKJ+bb8FNl/DC88evmiowuXj+csVnC6PM
 I6a6vfKPdioIk36MwACWw75N208eVRNHOPvWH/s3sjar1hMlDftdN1lDZA9sDLLx
 NGH9VRG8gSoKox9+Wg4x584XGI3YU/BqGdcM+W9E0T4OLT9X0RUmXbzVzspYFlPh
 O/r/gTZ0Ip5UXoPiHgyGhCalNep6Uc9A8n1sTaHIjaVcfSE86AuJgc5RcM0HFASn
 1fTNmxbX2taEeynISJAn2x77LASCvLSUErBmVQjHZOAkbNgJD4/R9MUlndxr+MAR
 tXJD+Pkah+M2LwUWdUIfm8UiBBX3LC4Qd/5byc85qr+s24q+EPvOmyqThLT2RNjq
 NLGO9WmRnf7sNvFjvmMXz1JAp1pV4P82j2P/4oy3ZZSW8AVaWr6xxS5hgePpqD4x
 yFP91lnce//UKVfU60VafPh6wUv1Isv0aN1JaW0Vz9wo0LF/i/6zbFO+RVS2PbMR
 VpGmGee7rbgPyWxPrjunyXv+FqciRmbd+IEAg2MlKacHk1F2pQiDLnJUPIGf+9tz
 nv+cq92FNd7hpx4qsuBhe0Ys89QJ2hueTk4RrQF/38tEoRzZ6IwppLlp7kRK39C2
 8TPSkiyqlIQ=
 =Jtih
 -----END PGP SIGNATURE-----

Merge remote-tracking branch 'remotes/kwolf/tags/for-upstream' into staging

Block layer patches

- Fail gracefully when blockdev-snapshot creates loops
- ide: Fix IDENTIFY DEVICE for disks > 128 GiB
- file-posix: Fix return value translation for AIO discards
- file-posix: add 'aio-max-batch' option
- rbd: implement bdrv_co_block_status
- Code cleanups and build fixes

# gpg: Signature made Tue 02 Nov 2021 12:04:02 PM EDT
# gpg:                using RSA key DC3DEB159A9AF95D3D7456FE7F09B272C88F2FD6
# gpg:                issuer "kwolf@redhat.com"
# gpg: Good signature from "Kevin Wolf <kwolf@redhat.com>" [full]

* remotes/kwolf/tags/for-upstream:
  block/nvme: Extract nvme_free_queue() from nvme_free_queue_pair()
  block/nvme: Display CQ/SQ pointer in nvme_free_queue_pair()
  block/nvme: Automatically free qemu_memalign() with QEMU_AUTO_VFREE
  block-backend: Silence clang -m32 compiler warning
  linux-aio: add `dev_max_batch` parameter to laio_io_unplug()
  linux-aio: add `dev_max_batch` parameter to laio_co_submit()
  file-posix: add `aio-max-batch` option
  block/export/fuse.c: fix musl build
  ide: Cap LBA28 capacity announcement to 2^28-1
  block/rbd: implement bdrv_co_block_status
  block: Fail gracefully when blockdev-snapshot creates loops
  block/file-posix: Fix return value translation for AIO discards

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2021-11-03 00:32:56 -04:00
Philippe Mathieu-Daudé a895143894 block/nvme: Extract nvme_free_queue() from nvme_free_queue_pair()
Instead of duplicating code, extract the common helper to free
a single queue.

Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20211006164931.172349-4-philmd@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2021-11-02 15:49:13 +01:00
Philippe Mathieu-Daudé 53cedeaaee block/nvme: Display CQ/SQ pointer in nvme_free_queue_pair()
For debugging purpose it is helpful to know the CQ/SQ pointers.
We already have a trace event in nvme_free_queue_pair(), extend
it to report these pointer addresses.

Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20211006164931.172349-3-philmd@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2021-11-02 15:49:12 +01:00
Philippe Mathieu-Daudé 4a613bd862 block/nvme: Automatically free qemu_memalign() with QEMU_AUTO_VFREE
Since commit 4d324c0bf6 ("introduce QEMU_AUTO_VFREE") buffers
allocated by qemu_memalign() can automatically freed when using
the QEMU_AUTO_VFREE macro. Use it to simplify a bit.

Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20211006164931.172349-2-philmd@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2021-11-02 15:48:55 +01:00
Hanna Reitz 73d4a11300 block-backend: Silence clang -m32 compiler warning
Similarly to e7e588d432, there is a
warning in block/block-backend.c that qiov->size <= INT64_MAX is always
true on machines where size_t is narrower than a uint64_t.  In said
commit, we silenced this warning by casting to uint64_t.

The commit introducing this warning here
(a93d81c84a) anticipated it and so tried
to address it the same way.  However, it only did so in one of two
places where this comparison occurs, and so we still need to fix up the
other one.

Fixes: a93d81c84a
       ("block-backend: convert blk_aio_ functions to int64_t bytes
       paramter")
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20211026090745.30800-1-hreitz@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2021-11-02 13:22:09 +01:00
Stefano Garzarella 68d7946648 linux-aio: add `dev_max_batch` parameter to laio_io_unplug()
Between the submission of a request and the unplug, other devices
with larger limits may have been queued new requests without flushing
the batch.

Using the new `dev_max_batch` parameter, laio_io_unplug() can check
if the batch exceeds the device limit to flush the current batch.

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Message-Id: <20211026162346.253081-4-sgarzare@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2021-11-02 13:03:35 +01:00
Stefano Garzarella 512da21101 linux-aio: add `dev_max_batch` parameter to laio_co_submit()
This new parameter can be used by block devices to limit the
Linux AIO batch size more than the limit set by the AIO context.

file-posix backend supports this, passing its `aio-max-batch` option
previously added.

Add an helper function to calculate the maximum batch size.

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Message-Id: <20211026162346.253081-3-sgarzare@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2021-11-02 13:03:35 +01:00
Stefano Garzarella 684960d462 file-posix: add `aio-max-batch` option
Commit d7ddd0a161 ("linux-aio: limit the batch size using
`aio-max-batch` parameter") added a way to limit the batch size
of Linux AIO backend for the entire AIO context.

The same AIO context can be shared by multiple devices, so
latency-sensitive devices may want to limit the batch size even
more to avoid increasing latency.

For this reason we add the `aio-max-batch` option to the file
backend, which will be used by the next commits to limit the size of
batches including requests generated by this device.

Suggested-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Message-Id: <20211026162346.253081-2-sgarzare@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2021-11-02 13:03:30 +01:00
Fabrice Fontaine 3043320390 block/export/fuse.c: fix musl build
Include linux/falloc.h if CONFIG_FALLOCATE_ZERO_RANGE is defined to fix
50482fda98
and avoid the following build failure on musl:

../block/export/fuse.c: In function 'fuse_fallocate':
../block/export/fuse.c:643:21: error: 'FALLOC_FL_ZERO_RANGE' undeclared (first use in this function)
  643 |     else if (mode & FALLOC_FL_ZERO_RANGE) {
      |                     ^~~~~~~~~~~~~~~~~~~~

Fixes:
 - http://autobuild.buildroot.org/results/be24433a429fda681fb66698160132c1c99bc53b

Fixes: 50482fda98 ("block/export/fuse.c: fix musl build")
Signed-off-by: Fabrice Fontaine <fontaine.fabrice@gmail.com>
Message-Id: <20211022095209.1319671-1-fontaine.fabrice@gmail.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2021-11-02 13:02:46 +01:00
Peter Lieven 0347a8fd4c block/rbd: implement bdrv_co_block_status
the qemu rbd driver currently lacks support for bdrv_co_block_status.
This results mainly in incorrect progress during block operations (e.g.
qemu-img convert with an rbd image as source).

This patch utilizes the rbd_diff_iterate2 call from librbd to detect
allocated and unallocated (all zero areas).

To avoid querying the ceph OSDs for the answer this is only done if
the image has the fast-diff feature which depends on the object-map and
exclusive-lock features. In this case it is guaranteed that the information
is present in memory in the librbd client and thus very fast.

If fast-diff is not available all areas are reported to be allocated
which is the current behaviour if bdrv_co_block_status is not implemented.

Signed-off-by: Peter Lieven <pl@kamp.de>
Message-Id: <20211012152231.24868-1-pl@kamp.de>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2021-11-02 13:02:46 +01:00
Ari Sundholm 13a028336f block/file-posix: Fix return value translation for AIO discards
AIO discards regressed as a result of the following commit:
	0dfc7af2 block/file-posix: Optimize for macOS

When trying to run blkdiscard within a Linux guest, the request would
fail, with some errors in dmesg:

---- [ snip ] ----
[    4.010070] sd 2:0:0:0: [sda] tag#0 FAILED Result: hostbyte=DID_OK
driverbyte=DRIVER_SENSE
[    4.011061] sd 2:0:0:0: [sda] tag#0 Sense Key : Aborted Command
[current]
[    4.011061] sd 2:0:0:0: [sda] tag#0 Add. Sense: I/O process
terminated
[    4.011061] sd 2:0:0:0: [sda] tag#0 CDB: Unmap/Read sub-channel 42
00 00 00 00 00 00 00 18 00
[    4.011061] blk_update_request: I/O error, dev sda, sector 0
---- [ snip ] ----

This turns out to be a result of a flaw in changes to the error value
translation logic in handle_aiocb_discard(). The default return value
may be left untranslated in some configurations, and the wrong variable
is used in one translation.

Fix both issues.

Fixes: 0dfc7af2b2 ("block/file-posix: Optimize for macOS")
Cc: qemu-stable@nongnu.org
Signed-off-by: Ari Sundholm <ari@tuxera.com>
Signed-off-by: Emil Karlson <jkarlson@tuxera.com>
Reviewed-by: Akihiko Odaki <akihiko.odaki@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20211019110954.4170931-1-ari@tuxera.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2021-11-02 13:02:46 +01:00
Thomas Huth 7da9623cc0 block/vpc: Add a sanity check that fixed-size images have the right type
The code in vpc.c uses BDRVVPCState->footer.type in various places
to decide whether the image is a fixed-size (VHD_FIXED) or a dynamic
(VHD_DYNAMIC) image. However, we never check that this field really
contains VHD_FIXED if we detected a fixed size image in vpc_open(),
so a wrong value here could cause quite some trouble during runtime.

Suggested-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Thomas Huth <thuth@redhat.com>
Message-Id: <20211012082702.792259-1-thuth@redhat.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
2021-11-02 12:47:51 +01:00
Thomas Weißschuh f3d43dfd9a vmdk: allow specification of tools version
VMDK files support an attribute that represents the version of the guest
tools that are installed on the disk.
This attribute is used by vSphere before a machine has been started to
determine if the VM has the guest tools installed.
This is important when configuring "Operating system customizations" in
vSphere, as it checks for the presence of the guest tools before
allowing those customizations.
Thus when the VM has not yet booted normally it would be impossible to
customize it, therefore preventing a customized first-boot.

The attribute should not hurt on disks that do not have the guest tools
installed and indeed the VMware tools also unconditionally add this
attribute.
(Defaulting to the value "2147483647", as is done in this patch)

Signed-off-by: Thomas Weißschuh <thomas.weissschuh.ext@zeiss.com>
Message-Id: <20210913130419.13241-1-thomas.weissschuh.ext@zeiss.com>
[hreitz: Added missing '#' in block-core.json]
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
2021-11-02 12:47:51 +01:00
Vladimir Sementsov-Ogievskiy aa78b82516 block-backend: drop INT_MAX restriction from blk_check_byte_request()
blk_check_bytes_request is called from blk_co_do_preadv,
blk_co_do_pwritev_part, blk_co_do_pdiscard and blk_co_copy_range
before (maybe) calling throttle_group_co_io_limits_intercept() (which
has int64_t argument) and then calling corresponding bdrv_co_ function.
bdrv_co_ functions are OK with int64_t bytes as well.

So dropping the check for INT_MAX we just get same restrictions as in
bdrv_ layer: discard and write-zeroes goes through
bdrv_check_qiov_request() and are allowed to be 64bit. Other requests
go through bdrv_check_request32() and still restricted by INT_MAX
boundary.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20211006131718.214235-13-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-10-15 16:00:07 -05:00
Vladimir Sementsov-Ogievskiy 14149710f9 block-backend: blk_pread, blk_pwrite: rename count parameter to bytes
To be consistent with declarations in include/sysemu/block-backend.h.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20211006131718.214235-12-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-10-15 15:59:26 -05:00
Vladimir Sementsov-Ogievskiy a93d81c84a block-backend: convert blk_aio_ functions to int64_t bytes paramter
1. Convert bytes in BlkAioEmAIOCB:
  aio->bytes is only passed to already int64_t interfaces, and set in
  blk_aio_prwv, which is updated here.

2. For all updated functions the parameter type becomes wider so callers
   are safe.

3. In blk_aio_prwv we only store bytes to BlkAioEmAIOCB, which is
   updated here.

4. Other updated functions are wrappers on blk_aio_prwv.

Note that blk_aio_preadv and blk_aio_pwritev become safer: before this
commit, it's theoretically possible to pass qiov with size exceeding
INT_MAX, which than converted to int argument of blk_aio_prwv. Now it's
converted to int64_t which is a lot better. Still add assertions.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20211006131718.214235-11-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
[eblake: tweak assertion and grammar]
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-10-15 15:57:29 -05:00
Vladimir Sementsov-Ogievskiy e192179bb2 block-backend: convert blk_co_copy_range to int64_t bytes
Function is updated so that parameter type becomes wider, so all
callers should be OK with it.

Look at blk_co_copy_range() itself: bytes is passed only to
blk_check_byte_request() and bdrv_co_copy_range(), which already have
int64_t bytes parameter, so we are OK.

Note that requests exceeding INT_MAX are still restricted by
blk_check_byte_request().

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20211006131718.214235-10-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
[eblake: grammar tweaks]
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-10-15 15:55:28 -05:00
Vladimir Sementsov-Ogievskiy 06f0325c5b block-backend: convert blk_foo wrappers to use int64_t bytes parameter
Convert blk_pdiscard, blk_pwrite_compressed, blk_pwrite_zeroes.
These are just wrappers for functions with int64_t argument, so allow
passing int64_t as well. Parameter type becomes wider so all callers
should be OK with it.

Note that requests exceeding INT_MAX are still restricted by
blk_check_byte_request().

Note also that we don't (and are not going to) convert blk_pwrite and
blk_pread: these functions return number of bytes on success, so to
update them, we should change return type to int64_t as well, which
will lead to investigating and updating all callers which is too much.

So, blk_pread and blk_pwrite remain unchanged.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20211006131718.214235-9-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
[eblake: grammar tweaks]
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-10-15 15:53:48 -05:00
Vladimir Sementsov-Ogievskiy 16d36e2996 block-backend: drop blk_prw, use block-coroutine-wrapper
Let's drop hand-made coroutine wrappers and use coroutine wrapper
generation like in block/io.c.

Now, blk_foo() functions are written in same way as blk_co_foo() ones,
but wrap blk_do_foo() instead of blk_co_do_foo().

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20211006131718.214235-8-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
[eblake: spelling fix]
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-10-15 15:53:24 -05:00
Vladimir Sementsov-Ogievskiy 7d55a3bbad block-coroutine-wrapper.py: support BlockBackend first argument
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20211006131718.214235-7-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-10-15 15:51:33 -05:00
Vladimir Sementsov-Ogievskiy 70e8775ed9 block-backend: rename _do_ helper functions to _co_do_
This is a preparation to the following commit, to use automatic
coroutine wrapper generation.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20211006131718.214235-6-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-10-15 15:50:40 -05:00
Vladimir Sementsov-Ogievskiy 2800637a33 block-backend: convert blk_co_pdiscard to int64_t bytes
We updated blk_do_pdiscard() and its wrapper blk_co_pdiscard(). Both
functions are updated so that the parameter type becomes wider, so all
callers should be OK with it.

Look at blk_do_pdiscard(): bytes is passed only to
blk_check_byte_request() and bdrv_co_pdiscard(), which already have
int64_t bytes parameter, so we are OK.

Note that requests exceeding INT_MAX are still restricted by
blk_check_byte_request().

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20211006131718.214235-5-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
[eblake: grammar tweaks]
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-10-15 15:48:56 -05:00
Vladimir Sementsov-Ogievskiy 34460feb63 block-backend: convert blk_co_pwritev_part to int64_t bytes
We convert blk_do_pwritev_part() and some wrappers:
blk_co_pwritev_part(), blk_co_pwritev(), blk_co_pwrite_zeroes().

All functions are converted so that the parameter type becomes wider, so
all callers should be OK with it.

Look at blk_do_pwritev_part() body:
bytes is passed to:

 - trace_blk_co_pwritev (we update it here)
 - blk_check_byte_request, throttle_group_co_io_limits_intercept,
   bdrv_co_pwritev_part - all already have int64_t argument.

Note that requests exceeding INT_MAX are still restricted by
blk_check_byte_request().

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20211006131718.214235-4-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
[eblake: grammar tweaks]
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-10-15 15:47:18 -05:00
Vladimir Sementsov-Ogievskiy 9547907705 block-backend: make blk_co_preadv() 64bit
For both updated functions, the type of bytes becomes wider, so all callers
should be OK with it.

blk_co_preadv() only passes its arguments to blk_do_preadv().

blk_do_preadv() passes bytes to:

 - trace_blk_co_preadv, which is updated too
 - blk_check_byte_request, throttle_group_co_io_limits_intercept,
   bdrv_co_preadv, which are already int64_t.

Note that requests exceeding INT_MAX are still restricted by
blk_check_byte_request().

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20211006131718.214235-3-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
[eblake: grammar tweaks]
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-10-15 15:46:44 -05:00
Vladimir Sementsov-Ogievskiy 7242db6389 block-backend: blk_check_byte_request(): int64_t bytes
Rename size and make it int64_t to correspond to modern block layer,
which always uses int64_t for offset and bytes (not in blk layer yet,
which is a task for following commits).

All callers pass int or unsigned int.

So, for bytes in [0, INT_MAX] nothing is changed, for negative bytes we
now fail on "bytes < 0" check instead of "bytes > INT_MAX" check.

Note, that blk_check_byte_request() still doesn't allow requests
exceeding INT_MAX.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20211006131718.214235-2-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-10-15 15:39:40 -05:00
Hanna Reitz e7e588d432 qcow2: Silence clang -m32 compiler warning
With -m32, size_t is generally only a uint32_t.  That makes clang
complain that in the assertion

  assert(qiov->size <= INT64_MAX);

the range of the type of qiov->size (size_t) is too small for any of its
values to ever exceed INT64_MAX.

Cast qiov->size to uint64_t to silence clang.

Fixes: f7ef38dd13
       ("block: use int64_t instead of uint64_t in driver read
       handlers")
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20211011155031.149158-1-hreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-10-15 15:39:38 -05:00
Paolo Bonzini ff66f3e55b configure, meson: move libaio check to meson.build
Message-Id: <20211007130829.632254-10-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-14 09:50:57 +02:00
Richard Henderson 14f12119aa mirror: Handle errors after READY cancel
v2: add small fix by Stefano, Hanna's series fixed
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEi5wmzbL9FHyIDoahVh8kwfGfefsFAmFfEVMACgkQVh8kwfGf
 eftAVA//WLtOaiVYPSjEl5EK80kry39VknZkQyeYUzyV7JNr/FRMlbJaIF2sOjH5
 KPRpfwBiuijOc8R0s34HY0BpyweRd1rbypHblZkO7EO4XwHx1FLF5kNHF6yV7wPL
 c9W564sZpc6Z96wSMgC4Is9QHJ6JbO4TJNJsG8v/PHEqGQV/yCYkgBox4loJckww
 uSAZ7l63IWA8uPSq/rOu34bREKN9s0kHkvFq0JNWk2HtOBLDiRDUYmbSfdjfT4jz
 np7ojKiffcAJED9JA28Zo2Y+MSId+FyoO4lbt+deMNzIHboy2oVlHouoHHprr61x
 dIO7Qt1IoMk5IBIfkPRYkReMwxxSVKuIJcWm8Qqtkcg2X0g5ayNUmHwpBMd50h2z
 XPjrr0YdOixhxMHoBnqlkPlWU0Y/B+YJIQ+mjqp+vRNkk94NoXhsXnCod1ajkgWO
 zjc/dztew7HvNStJaMM0rnEjanLhzFZKtlMO4WwZHQp2yZG2AINkPStswo2f3AmL
 FI+2By/UhFKm3BEemf0wYWDPWrPHU+BOiu16KjSKeS0GA9t7GXBUDRxNYPhUheXJ
 eJKIpNsGbseNxKrAbLyRhAB75Fa/ReZqqybmEcLyal/ball3R/cNF3gaMHeX0o1n
 HTGIAF5JOAXNGApS5TilkXPZ7jHFOVPh/Fi6/16/08tcgxjVfro=
 =TVTu
 -----END PGP SIGNATURE-----

Merge remote-tracking branch 'remotes/vsementsov/tags/pull-jobs-2021-10-07-v2' into staging

mirror: Handle errors after READY cancel
v2: add small fix by Stefano, Hanna's series fixed

# gpg: Signature made Thu 07 Oct 2021 08:25:07 AM PDT
# gpg:                using RSA key 8B9C26CDB2FD147C880E86A1561F24C1F19F79FB
# gpg: Good signature from "Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>" [unknown]
# gpg: WARNING: This key is not certified with a trusted signature!
# gpg:          There is no indication that the signature belongs to the owner.
# Primary key fingerprint: 8B9C 26CD B2FD 147C 880E  86A1 561F 24C1 F19F 79FB

* remotes/vsementsov/tags/pull-jobs-2021-10-07-v2:
  iotests: Add mirror-ready-cancel-error test
  mirror: Do not clear .cancelled
  mirror: Stop active mirroring after force-cancel
  mirror: Check job_is_cancelled() earlier
  mirror: Use job_is_cancelled()
  job: Add job_cancel_requested()
  job: Do not soft-cancel after a job is done
  jobs: Give Job.force_cancel more meaning
  job: @force parameter for job_cancel_sync()
  job: Force-cancel jobs in a failed transaction
  mirror: Drop s->synced
  mirror: Keep s->synced on error
  job: Context changes in job_completed_txn_abort()
  block/aio_task: assert `max_busy_tasks` is greater than 0
  block/backup: avoid integer overflow of `max-workers`

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2021-10-07 10:26:35 -07:00
Hanna Reitz a640fa0e38 mirror: Do not clear .cancelled
Clearing .cancelled before leaving the main loop when the job has been
soft-cancelled is no longer necessary since job_is_cancelled() only
returns true for jobs that have been force-cancelled.

Therefore, this only makes a differences in places that call
job_cancel_requested().  In block/mirror.c, this is done only before
.cancelled was cleared.

In job.c, there are two callers:
- job_completed_txn_abort() asserts that .cancelled is true, so keeping
  it true will not affect this place.

- job_complete() refuses to let a job complete that has .cancelled set.
  It is correct to refuse to let the user invoke job-complete on mirror
  jobs that have already been soft-cancelled.

With this change, there are no places that reset .cancelled to false and
so we can be sure that .force_cancel can only be true if .cancelled is
true as well.  Assert this in job_is_cancelled().

Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20211006151940.214590-13-hreitz@redhat.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
2021-10-07 10:42:50 +02:00
Hanna Reitz 9b230ef93e mirror: Stop active mirroring after force-cancel
Once the mirror job is force-cancelled (job_is_cancelled() is true), we
should not generate new I/O requests.  This applies to active mirroring,
too, so stop it once the job is cancelled.

(We must still forward all I/O requests to the source, though, of
course, but those are not really I/O requests generated by the job, so
this is fine.)

Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20211006151940.214590-12-hreitz@redhat.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
2021-10-07 10:42:50 +02:00
Hanna Reitz 4feeec7e23 mirror: Check job_is_cancelled() earlier
We must check whether the job is force-cancelled early in our main loop,
most importantly before any `continue` statement.  For example, we used
to have `continue`s before our current checking location that are
triggered by `mirror_flush()` failing.  So, if `mirror_flush()` kept
failing, force-cancelling the job would not terminate it.

Jobs can be cancelled while they yield, and once they are
(force-cancelled), they should not generate new I/O requests.
Therefore, we should put the check after the last yield before
mirror_iteration() is invoked.

Buglink: https://gitlab.com/qemu-project/qemu/-/issues/462
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20211006151940.214590-11-hreitz@redhat.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
2021-10-07 10:42:50 +02:00
Hanna Reitz 20ad4d204a mirror: Use job_is_cancelled()
mirror_drained_poll() returns true whenever the job is cancelled,
because "we [can] be sure that it won't issue more requests".  However,
this is only true for force-cancelled jobs, so use job_is_cancelled().

Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20211006151940.214590-10-hreitz@redhat.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
2021-10-07 10:42:50 +02:00
Hanna Reitz 08b83bff2a job: Add job_cancel_requested()
Most callers of job_is_cancelled() actually want to know whether the job
is on its way to immediate termination.  For example, we refuse to pause
jobs that are cancelled; but this only makes sense for jobs that are
really actually cancelled.

A mirror job that is cancelled during READY with force=false should
absolutely be allowed to pause.  This "cancellation" (which is actually
a kind of completion) may take an indefinite amount of time, and so
should behave like any job during normal operation.  For example, with
on-target-error=stop, the job should stop on write errors.  (In
contrast, force-cancelled jobs should not get write errors, as they
should just terminate and not do further I/O.)

Therefore, redefine job_is_cancelled() to only return true for jobs that
are force-cancelled (which as of HEAD^ means any job that interprets the
cancellation request as a request for immediate termination), and add
job_cancel_requested() as the general variant, which returns true for
any jobs which have been requested to be cancelled, whether it be
immediately or after an arbitrarily long completion phase.

Finally, here is a justification for how different job_is_cancelled()
invocations are treated by this patch:

- block/mirror.c (mirror_run()):
  - The first invocation is a while loop that should loop until the job
    has been cancelled or scheduled for completion.  What kind of cancel
    does not matter, only the fact that the job is supposed to end.

  - The second invocation wants to know whether the job has been
    soft-cancelled.  Calling job_cancel_requested() is a bit too broad,
    but if the job were force-cancelled, we should leave the main loop
    as soon as possible anyway, so this should not matter here.

  - The last two invocations already check force_cancel, so they should
    continue to use job_is_cancelled().

- block/backup.c, block/commit.c, block/stream.c, anything in tests/:
  These jobs know only force-cancel, so there is no difference between
  job_is_cancelled() and job_cancel_requested().  We can continue using
  job_is_cancelled().

- job.c:
  - job_pause_point(), job_yield(), job_sleep_ns(): Only force-cancelled
    jobs should be prevented from being paused.  Continue using job_is_cancelled().

  - job_update_rc(), job_finalize_single(), job_finish_sync(): These
    functions are all called after the job has left its main loop.  The
    mirror job (the only job that can be soft-cancelled) will clear
    .cancelled before leaving the main loop if it has been
    soft-cancelled.  Therefore, these functions will observe .cancelled
    to be true only if the job has been force-cancelled.  We can
    continue to use job_is_cancelled().
    (Furthermore, conceptually, a soft-cancelled mirror job should not
    report to have been cancelled.  It should report completion (see
    also the block-job-cancel QAPI documentation).  Therefore, it makes
    sense for these functions not to distinguish between a
    soft-cancelled mirror job and a job that has completed as normal.)

  - job_completed_txn_abort(): All jobs other than @job have been
    force-cancelled.  job_is_cancelled() must be true for them.
    Regarding @job itself: job_completed_txn_abort() is mostly called
    when the job's return value is not 0.  A soft-cancelled mirror has a
    return value of 0, and so will not end up here then.
    However, job_cancel() invokes job_completed_txn_abort() if the job
    has been deferred to the main loop, which is mostly the case for
    completed jobs (which skip the assertion), but not for sure.
    To be safe, use job_cancel_requested() in this assertion.

  - job_complete(): This is function eventually invoked by the user
    (through qmp_block_job_complete() or qmp_job_complete(), or
    job_complete_sync(), which comes from qemu-img).  The intention here
    is to prevent a user from invoking job-complete after the job has
    been cancelled.  This should also apply to soft cancelling: After a
    mirror job has been soft-cancelled, the user should not be able to
    decide otherwise and have it complete as normal (i.e. pivoting to
    the target).

  - job_cancel(): Both functions are equivalent (see comment there), but
    we want to use job_is_cancelled(), because this shows that we call
    job_completed_txn_abort() only for force-cancelled jobs.  (As
    explained for job_update_rc(), soft-cancelled jobs should be treated
    as if they have completed as normal.)

Buglink: https://gitlab.com/qemu-project/qemu/-/issues/462
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20211006151940.214590-9-hreitz@redhat.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
2021-10-07 10:42:40 +02:00
Hanna Reitz 73895f3838 jobs: Give Job.force_cancel more meaning
We largely have two cancel modes for jobs:

First, there is actual cancelling.  The job is terminated as soon as
possible, without trying to reach a consistent result.

Second, we have mirror in the READY state.  Technically, the job is not
really cancelled, but it just is a different completion mode.  The job
can still run for an indefinite amount of time while it tries to reach a
consistent result.

We want to be able to clearly distinguish which cancel mode a job is in
(when it has been cancelled).  We can use Job.force_cancel for this, but
right now it only reflects cancel requests from the user with
force=true, but clearly, jobs that do not even distinguish between
force=false and force=true are effectively always force-cancelled.

So this patch has Job.force_cancel signify whether the job will
terminate as soon as possible (force_cancel=true) or whether it will
effectively remain running despite being "cancelled"
(force_cancel=false).

To this end, we let jobs that provide JobDriver.cancel() tell the
generic job code whether they will terminate as soon as possible or not,
and for jobs that do not provide that method we assume they will.

Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20211006151940.214590-7-hreitz@redhat.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
2021-10-07 10:42:34 +02:00
Hanna Reitz 4cfb3f0562 job: @force parameter for job_cancel_sync()
Callers should be able to specify whether they want job_cancel_sync() to
force-cancel the job or not.

In fact, almost all invocations do not care about consistency of the
result and just want the job to terminate as soon as possible, so they
should pass force=true.  The replication block driver is the exception,
specifically the active commit job it runs.

As for job_cancel_sync_all(), all callers want it to force-cancel all
jobs, because that is the point of it: To cancel all remaining jobs as
quickly as possible (generally on process termination).  So make it
invoke job_cancel_sync() with force=true.

This changes some iotest outputs, because quitting qemu while a mirror
job is active will now lead to it being cancelled instead of completed,
which is what we want.  (Cancelling a READY mirror job with force=false
may take an indefinite amount of time, which we do not want when
quitting.  If users want consistent results, they must have all jobs be
done before they quit qemu.)

Buglink: https://gitlab.com/qemu-project/qemu/-/issues/462
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20211006151940.214590-6-hreitz@redhat.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
2021-10-07 10:42:09 +02:00
Hanna Reitz 4471622428 mirror: Drop s->synced
As of HEAD^, there is no meaning to s->synced other than whether the job
is READY or not.  job_is_ready() gives us that information, too.

Suggested-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20211006151940.214590-4-hreitz@redhat.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
2021-10-07 10:40:48 +02:00
Hanna Reitz a3810da5cf mirror: Keep s->synced on error
An error does not take us out of the READY phase, which is what
s->synced signifies.  It does of course mean that source and target are
no longer in sync, but that is what s->actively_sync is for -- s->synced
never meant that source and target are in sync, only that they were at
some point (and at that point we transitioned into the READY phase).

The tangible problem is that we transition to READY once we are in sync
and s->synced is false.  By resetting s->synced here, we will transition
from READY to READY once the error is resolved (if the job keeps
running), and that transition is not allowed.

Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20211006151940.214590-3-hreitz@redhat.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
2021-10-07 10:40:48 +02:00
Paolo Bonzini cc07162953 block: introduce max_hw_iov for use in scsi-generic
Linux limits the size of iovecs to 1024 (UIO_MAXIOV in the kernel
sources, IOV_MAX in POSIX).  Because of this, on some host adapters
requests with many iovecs are rejected with -EINVAL by the
io_submit() or readv()/writev() system calls.

In fact, the same limit applies to SG_IO as well.  To fix both the
EINVAL and the possible performance issues from using fewer iovecs
than allowed by Linux (some HBAs have max_segments as low as 128),
introduce a separate entry in BlockLimits to hold the max_segments
value from sysfs.  This new limit is used only for SG_IO and clamped
to bs->bl.max_iov anyway, just like max_hw_transfer is clamped to
bs->bl.max_transfer.

Reported-by: Halil Pasic <pasic@linux.ibm.com>
Cc: Hanna Reitz <hreitz@redhat.com>
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: qemu-block@nongnu.org
Cc: qemu-stable@nongnu.org
Fixes: 18473467d5 ("file-posix: try BLKSECTGET on block devices too, do not round to power of 2", 2021-06-25)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210923130436.1187591-1-pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2021-10-06 10:25:55 +02:00
Stefano Garzarella a9515df4d6 block/aio_task: assert `max_busy_tasks` is greater than 0
All code in block/aio_task.c expects `max_busy_tasks` to always
be greater than 0.

Assert this condition during the AioTaskPool creation where
`max_busy_tasks` is set.

Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20211005161157.282396-3-sgarzare@redhat.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
2021-10-05 18:56:41 +02:00
Stefano Garzarella 8fc898ce0b block/backup: avoid integer overflow of `max-workers`
QAPI generates `struct BackupPerf` where `max-workers` value is stored
in an `int64_t` variable.
But block_copy_async(), and the underlying code, uses an `int` parameter.

At the end that variable is used to initialize `max_busy_tasks` in
block/aio_task.c causing the following assertion failure if a value
greater than INT_MAX(2147483647) is used:

  ../block/aio_task.c:63: aio_task_pool_wait_one: Assertion `pool->busy_tasks > 0' failed.

Let's check that `max-workers` doesn't exceed INT_MAX and print an
error in that case.

Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=2009310
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20211005161157.282396-2-sgarzare@redhat.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
2021-10-05 18:56:20 +02:00
Vladimir Sementsov-Ogievskiy 1af7737871 block/nbd: check that received handle is valid
If we don't have active request, that waiting for this handle to be
received, we should report an error.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210902103805.25686-6-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-09-29 13:46:33 -05:00
Vladimir Sementsov-Ogievskiy 4ddb5d2fde block/nbd: drop connection_co
OK, that's a big rewrite of the logic.

Pre-patch we have an always running coroutine - connection_co. It does
reply receiving and reconnecting. And it leads to a lot of difficult
and unobvious code around drained sections and context switch. We also
abuse bs->in_flight counter which is increased for connection_co and
temporary decreased in points where we want to allow drained section to
begin. One of these place is in another file: in nbd_read_eof() in
nbd/client.c.

We also cancel reconnect and requests waiting for reconnect on drained
begin which is not correct. And this patch fixes that.

Let's finally drop this always running coroutine and go another way:
do both reconnect and receiving in request coroutines.

The detailed list of changes below (in the sequence of diff hunks).

1. receiving coroutines are woken directly from nbd_channel_error, when
   we change s->state

2. nbd_co_establish_connection_cancel(): we don't have drain_begin now,
   and in nbd_teardown_connection() all requests should already be
   finished (and reconnect is done from request). So
   nbd_co_establish_connection_cancel() is called from
   nbd_cancel_in_flight() (to cancel the request that is doing
   nbd_co_establish_connection()) and from reconnect_delay_timer_cb()
   (previously we didn't need it, as reconnect delay only should cancel
   active requests not the reconnection itself). But now reconnection
   itself is done in the separate thread (we now call
   nbd_client_connection_enable_retry() in nbd_open()), and we need to
   cancel the requests that wait in nbd_co_establish_connection()
   now).

2A. We do receive headers in request coroutine. But we also should
   dispatch replies for other pending requests. So,
   nbd_connection_entry() is turned into nbd_receive_replies(), which
   does reply dispatching while it receives other request headers, and
   returns when it receives the requested header.

3. All old staff around drained sections and context switch is dropped.
   In details:
   - we don't need to move connection_co to new aio context, as we
     don't have connection_co anymore
   - we don't have a fake "request" of connection_co (extra increasing
     in_flight), so don't care with it in drain_begin/end
   - we don't stop reconnection during drained section anymore. This
     means that drain_begin may wait for a long time (up to
     reconnect_delay). But that's an improvement and more correct
     behavior see below[*]

4. In nbd_teardown_connection() we don't have to wait for
   connection_co, as it is dropped. And cleanup for s->ioc and nbd_yank
   is moved here from removed connection_co.

5. In nbd_co_do_establish_connection() we now should handle
   NBD_CLIENT_CONNECTING_NOWAIT: if new request comes when we are in
   NBD_CLIENT_CONNECTING_NOWAIT, it still should call
   nbd_co_establish_connection() (who knows, maybe the connection was
   already established by another thread in the background). But we
   shouldn't wait: if nbd_co_establish_connection() can't return new
   channel immediately the request should fail (we are in
   NBD_CLIENT_CONNECTING_NOWAIT state).

6. nbd_reconnect_attempt() is simplified: it's now easier to wait for
   other requests in the caller, so here we just assert that fact.
   Also delay time is now initialized here: we can easily detect first
   attempt and start a timer.

7. nbd_co_reconnect_loop() is dropped, we don't need it. Reconnect
   retries are fully handle by thread (nbd/client-connection.c), delay
   timer we initialize in nbd_reconnect_attempt(), we don't have to
   bother with s->drained and friends. nbd_reconnect_attempt() now
   called from nbd_co_send_request().

8. nbd_connection_entry is dropped: reconnect is now handled by
   nbd_co_send_request(), receiving reply is now handled by
   nbd_receive_replies(): all handled from request coroutines.

9. So, welcome new nbd_receive_replies() called from request coroutine,
   that receives reply header instead of nbd_connection_entry().
   Like with sending requests, only one coroutine may receive in a
   moment. So we introduce receive_mutex, which is locked around
   nbd_receive_reply(). It also protects some related fields. Still,
   full audit of thread-safety in nbd driver is a separate task.
   New function waits for a reply with specified handle being received
   and works rather simple:

   Under mutex:
     - if current handle is 0, do receive by hand. If another handle
       received - switch to other request coroutine, release mutex and
       yield. Otherwise return success
     - if current handle == requested handle, we are done
     - otherwise, release mutex and yield

10: in nbd_co_send_request() we now do nbd_reconnect_attempt() if
    needed. Also waiting in free_sema queue we now wait for one of two
    conditions:
    - connectED, in_flight < MAX_NBD_REQUESTS (so we can start new one)
    - connectING, in_flight == 0, so we can call
      nbd_reconnect_attempt()
    And this logic is protected by s->send_mutex

    Also, on failure we don't have to care of removed s->connection_co

11. nbd_co_do_receive_one_chunk(): now instead of yield() and wait for
    s->connection_co we just call new nbd_receive_replies().

12. nbd_co_receive_one_chunk(): place where s->reply.handle becomes 0,
    which means that handling of the whole reply is finished. Here we
    need to wake one of coroutines sleeping in nbd_receive_replies().
    If none are sleeping - do nothing. That's another behavior change: we
    don't have endless recv() in the idle time. It may be considered as
    a drawback. If so, it may be fixed later.

13. nbd_reply_chunk_iter_receive(): don't care about removed
    connection_co, just ping in_flight waiters.

14. Don't create connection_co, enable retry in the connection thread
    (we don't have own reconnect loop anymore)

15. We now need to add a nbd_co_establish_connection_cancel() call in
    nbd_cancel_in_flight(), to cancel the request that is doing a
    connection attempt.

[*], ok, now we don't cancel reconnect on drain begin. That's correct:
    reconnect feature leads to possibility of long-running requests (up
    to reconnect delay). Still, drain begin is not a reason to kill
    long requests. We should wait for them.

    This also means, that we can again reproduce a dead-lock, described
    in 8c517de24a.
    Why we are OK with it:
    1. Now this is not absolutely-dead dead-lock: the vm is unfrozen
       after reconnect delay. Actually 8c517de24a fixed a bug in
       NBD logic, that was not described in 8c517de24a and led to
       forever dead-lock. The problem was that nobody woke the free_sema
       queue, but drain_begin can't finish until there is a request in
       free_sema queue. Now we have a reconnect delay timer that works
       well.
    2. It's not a problem of the NBD driver, but of the ide code,
       because it does drain_begin under the global mutex; the problem
       doesn't reproduce when using scsi instead of ide.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210902103805.25686-5-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
[eblake: grammar and comment tweaks]
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-09-29 13:46:33 -05:00
Vladimir Sementsov-Ogievskiy 04a953b232 block/nbd: refactor nbd_recv_coroutines_wake_all()
Split out nbd_recv_coroutine_wake_one(), as it will be used
separately.
Rename the function and add a possibility to wake only first found
sleeping coroutine.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210902103805.25686-4-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
[eblake: grammar tweak]
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-09-29 13:46:33 -05:00
Vladimir Sementsov-Ogievskiy 3bc0bd1f42 block/nbd: move nbd_recv_coroutines_wake_all() up
We are going to use it in nbd_channel_error(), so move it up. Note,
that we are going also refactor and rename
nbd_recv_coroutines_wake_all() in future anyway, so keeping it where it
is and making forward declaration doesn't make real sense.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20210902103805.25686-3-vsementsov@virtuozzo.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-09-29 13:46:33 -05:00
Vladimir Sementsov-Ogievskiy cb116da7d7 block/nbd: nbd_channel_error() shutdown channel unconditionally
Don't rely on connection being totally broken in case of -EIO. Safer
and more correct is to just shut down the channel anyway, since we
change the state and plan on reconnecting.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20210902103805.25686-2-vsementsov@virtuozzo.com>
[eblake: grammar tweaks]
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-09-29 13:46:33 -05:00
Vladimir Sementsov-Ogievskiy 6a8f3dbb19 block/io: allow 64bit discard requests
Now that all drivers are updated by the previous commit, we can drop
the last limiter on pdiscard path: INT_MAX in bdrv_co_pdiscard().

Now everything is prepared for implementing incredibly cool and fast
big-discard requests in NBD and qcow2. And any other driver which wants
it of course.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20210903102807.27127-12-vsementsov@virtuozzo.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-09-29 13:46:32 -05:00
Vladimir Sementsov-Ogievskiy 0c8022876f block: use int64_t instead of int in driver discard handlers
We are generally moving to int64_t for both offset and bytes parameters
on all io paths.

Main motivation is realization of 64-bit write_zeroes operation for
fast zeroing large disk chunks, up to the whole disk.

We chose signed type, to be consistent with off_t (which is signed) and
with possibility for signed return type (where negative value means
error).

So, convert driver discard handlers bytes parameter to int64_t.

The only caller of all updated function is bdrv_co_pdiscard in
block/io.c. It is already prepared to work with 64bit requests, but
pass at most max(bs->bl.max_pdiscard, INT_MAX) to the driver.

Let's look at all updated functions:

blkdebug: all calculations are still OK, thanks to
  bdrv_check_qiov_request().
  both rule_check and bdrv_co_pdiscard are 64bit

blklogwrites: pass to blk_loc_writes_co_log which is 64bit

blkreplay, copy-on-read, filter-compress: pass to bdrv_co_pdiscard, OK

copy-before-write: pass to bdrv_co_pdiscard which is 64bit and to
  cbw_do_copy_before_write which is 64bit

file-posix: one handler calls raw_account_discard() is 64bit and both
  handlers calls raw_do_pdiscard(). Update raw_do_pdiscard, which pass
  to RawPosixAIOData::aio_nbytes, which is 64bit (and calls
  raw_account_discard())

gluster: somehow, third argument of glfs_discard_async is size_t.
  Let's set max_pdiscard accordingly.

iscsi: iscsi_allocmap_set_invalid is 64bit,
  !is_byte_request_lun_aligned is 64bit.
  list.num is uint32_t. Let's clarify max_pdiscard and
  pdiscard_alignment.

mirror_top: pass to bdrv_mirror_top_do_write() which is
  64bit

nbd: protocol limitation. max_pdiscard is alredy set strict enough,
  keep it as is for now.

nvme: buf.nlb is uint32_t and we do shift. So, add corresponding limits
  to nvme_refresh_limits().

preallocate: pass to bdrv_co_pdiscard() which is 64bit.

rbd: pass to qemu_rbd_start_co() which is 64bit.

qcow2: calculations are still OK, thanks to bdrv_check_qiov_request(),
  qcow2_cluster_discard() is 64bit.

raw-format: raw_adjust_offset() is 64bit, bdrv_co_pdiscard too.

throttle: pass to bdrv_co_pdiscard() which is 64bit and to
  throttle_group_co_io_limits_intercept() which is 64bit as well.

test-block-iothread: bytes argument is unused

Great! Now all drivers are prepared to handle 64bit discard requests,
or else have explicit max_pdiscard limits.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210903102807.27127-11-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-09-29 13:46:32 -05:00
Vladimir Sementsov-Ogievskiy 39af49c0d7 block: make BlockLimits::max_pdiscard 64bit
We are going to support 64 bit discard requests. Now update the
limit variable. It's absolutely safe. The variable is set in some
drivers, and used in bdrv_co_pdiscard().

Update also max_pdiscard variable in bdrv_co_pdiscard(), so that
bdrv_co_pdiscard() is now prepared for 64bit requests. The remaining
logic including num, offset and bytes variables is already
supporting 64bit requests.

So the only thing that prevents 64 bit requests is limiting
max_pdiscard variable to INT_MAX in bdrv_co_pdiscard().
We'll drop this limitation after updating all block drivers.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20210903102807.27127-10-vsementsov@virtuozzo.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-09-29 13:46:32 -05:00
Vladimir Sementsov-Ogievskiy 2aaa3f9b33 block/io: allow 64bit write-zeroes requests
Now that all drivers are updated by previous commit, we can drop two
last limiters on write-zeroes path: INT_MAX in
bdrv_co_do_pwrite_zeroes() and bdrv_check_request32() in
bdrv_co_pwritev_part().

Now everything is prepared for implementing incredibly cool and fast
big-write-zeroes in NBD and qcow2. And any other driver which wants it
of course.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20210903102807.27127-9-vsementsov@virtuozzo.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-09-29 13:46:32 -05:00
Vladimir Sementsov-Ogievskiy f34b2bcf8c block: use int64_t instead of int in driver write_zeroes handlers
We are generally moving to int64_t for both offset and bytes parameters
on all io paths.

Main motivation is realization of 64-bit write_zeroes operation for
fast zeroing large disk chunks, up to the whole disk.

We chose signed type, to be consistent with off_t (which is signed) and
with possibility for signed return type (where negative value means
error).

So, convert driver write_zeroes handlers bytes parameter to int64_t.

The only caller of all updated function is bdrv_co_do_pwrite_zeroes().

bdrv_co_do_pwrite_zeroes() itself is of course OK with widening of
callee parameter type. Also, bdrv_co_do_pwrite_zeroes()'s
max_write_zeroes is limited to INT_MAX. So, updated functions all are
safe, they will not get "bytes" larger than before.

Still, let's look through all updated functions, and add assertions to
the ones which are actually unprepared to values larger than INT_MAX.
For these drivers also set explicit max_pwrite_zeroes limit.

Let's go:

blkdebug: calculations can't overflow, thanks to
  bdrv_check_qiov_request() in generic layer. rule_check() and
  bdrv_co_pwrite_zeroes() both have 64bit argument.

blklogwrites: pass to blk_log_writes_co_log() with 64bit argument.

blkreplay, copy-on-read, filter-compress: pass to
  bdrv_co_pwrite_zeroes() which is OK

copy-before-write: Calls cbw_do_copy_before_write() and
  bdrv_co_pwrite_zeroes, both have 64bit argument.

file-posix: both handler calls raw_do_pwrite_zeroes, which is updated.
  In raw_do_pwrite_zeroes() calculations are OK due to
  bdrv_check_qiov_request(), bytes go to RawPosixAIOData::aio_nbytes
  which is uint64_t.
  Check also where that uint64_t gets handed:
  handle_aiocb_write_zeroes_block() passes a uint64_t[2] to
  ioctl(BLKZEROOUT), handle_aiocb_write_zeroes() calls do_fallocate()
  which takes off_t (and we compile to always have 64-bit off_t), as
  does handle_aiocb_write_zeroes_unmap. All look safe.

gluster: bytes go to GlusterAIOCB::size which is int64_t and to
  glfs_zerofill_async works with off_t.

iscsi: Aha, here we deal with iscsi_writesame16_task() that has
  uint32_t num_blocks argument and iscsi_writesame16_task() has
  uint16_t argument. Make comments, add assertions and clarify
  max_pwrite_zeroes calculation.
  iscsi_allocmap_() functions already has int64_t argument
  is_byte_request_lun_aligned is simple to update, do it.

mirror_top: pass to bdrv_mirror_top_do_write which has uint64_t
  argument

nbd: Aha, here we have protocol limitation, and NBDRequest::len is
  uint32_t. max_pwrite_zeroes is cleanly set to 32bit value, so we are
  OK for now.

nvme: Again, protocol limitation. And no inherent limit for
  write-zeroes at all. But from code that calculates cdw12 it's obvious
  that we do have limit and alignment. Let's clarify it. Also,
  obviously the code is not prepared to handle bytes=0. Let's handle
  this case too.
  trace events already 64bit

preallocate: pass to handle_write() and bdrv_co_pwrite_zeroes(), both
  64bit.

rbd: pass to qemu_rbd_start_co() which is 64bit.

qcow2: offset + bytes and alignment still works good (thanks to
  bdrv_check_qiov_request()), so tail calculation is OK
  qcow2_subcluster_zeroize() has 64bit argument, should be OK
  trace events updated

qed: qed_co_request wants int nb_sectors. Also in code we have size_t
  used for request length which may be 32bit. So, let's just keep
  INT_MAX as a limit (aligning it down to pwrite_zeroes_alignment) and
  don't care.

raw-format: Is OK. raw_adjust_offset and bdrv_co_pwrite_zeroes are both
  64bit.

throttle: Both throttle_group_co_io_limits_intercept() and
  bdrv_co_pwrite_zeroes() are 64bit.

vmdk: pass to vmdk_pwritev which is 64bit

quorum: pass to quorum_co_pwritev() which is 64bit

Hooray!

At this point all block drivers are prepared to support 64bit
write-zero requests, or have explicitly set max_pwrite_zeroes.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210903102807.27127-8-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
[eblake: use <= rather than < in assertions relying on max_pwrite_zeroes]
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-09-29 13:46:32 -05:00