/*
 * QEMU Block driver for RADOS (Ceph)
 *
 * Copyright (C) 2010-2011 Christian Brunner <chb@muc.de>,
 *                         Josh Durgin <josh.durgin@dreamhost.com>
 *
 * This work is licensed under the terms of the GNU GPL, version 2. See
 * the COPYING file in the top-level directory.
 *
 * Contributions after 2012-01-13 are licensed under the terms of the
 * GNU GPL, version 2 or (at your option) any later version.
 */

#include "qemu/osdep.h"

rbd: Fix bugs around -drive parameter "server"
qemu_rbd_open() takes option parameters as a flattened QDict, with
keys of the form server.%d.host, server.%d.port, where %d counts up
from zero.
qemu_rbd_array_opts() extracts these values as follows. First, it
calls qdict_array_entries() to find the list's length. For each list
element, it formats the list's key prefix (e.g. "server.0."), then
creates a new QDict holding the options with that key prefix, then
converts that to a QemuOpts, so it can finally get the member values
from there.
If there's one surefire way to make code using QDict more awkward,
it's creating more of them and mixing in QemuOpts for good measure.
The extraction of keys starting with server.%d into another QDict
makes us ignore parameters like server.0.neither-host-nor-port
silently.
The conversion to QemuOpts abuses runtime_opts, as described a few
commits ago.
Rewrite to simply get the values straight from the options QDict.
Fixes -drive not to crash when server.*.* are present, but
server.*.host is absent.
Fixes -drive to reject invalid server.*.*.
Permits cleaning up runtime_opts. Do that, and fix -drive to reject
bogus parameters host and port instead of silently ignoring them.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Message-id: 1490691368-32099-11-git-send-email-armbru@redhat.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
2017-03-28 10:56:08 +02:00
#include <rbd/librbd.h>
include/qemu/osdep.h: Don't include qapi/error.h
Commit 57cb38b included qapi/error.h into qemu/osdep.h to get the
Error typedef. Since then, we've moved to include qemu/osdep.h
everywhere. Its file comment explains: "To avoid getting into
possible circular include dependencies, this file should not include
any other QEMU headers, with the exceptions of config-host.h,
compiler.h, os-posix.h and os-win32.h, all of which are doing a
similar job to this file and are under similar constraints."
qapi/error.h doesn't do a similar job, and it doesn't adhere to
similar constraints: it includes qapi-types.h. That's in excess of
100KiB of crap most .c files don't actually need.
Add the typedef to qemu/typedefs.h, and include that instead of
qapi/error.h. Include qapi/error.h in .c files that need it and don't
get it now. Include qapi-types.h in qom/object.h for uint16List.
Update scripts/clean-includes accordingly. Update it further to match
reality: replace config.h by config-target.h, add sysemu/os-posix.h,
sysemu/os-win32.h. Update the list of includes in the qemu/osdep.h
comment quoted above similarly.
This reduces the number of objects depending on qapi/error.h from "all
of them" to less than a third. Unfortunately, the number depending on
qapi-types.h shrinks only a little. More work is needed for that one.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
[Fix compilation without the spice devel packages. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-14 09:01:28 +01:00
#include "qapi/error.h"
#include "qemu/error-report.h"
#include "qemu/module.h"
#include "qemu/option.h"
#include "block/block_int.h"
#include "block/qdict.h"
#include "crypto/secret.h"
#include "qemu/cutils.h"
#include "sysemu/replay.h"
#include "qapi/qmp/qstring.h"
#include "qapi/qmp/qdict.h"
rbd: Fix regression in legacy key/values containing escaped :
Commit c7cacb3 accidentally broke legacy key-value parsing through
pseudo-filename parsing of -drive file=rbd://..., for any key that
contains an escaped ':'. Such a key is surprisingly common, thanks
to mon_host specifying a 'host:port' string. The break happens
because passing things from QDict through QemuOpts back to another
QDict requires that we pack our parsed key/value pairs into a string,
and then reparse that string, but the intermediate string that we
created ("key1=value1:key2=value2") lost the \: escaping that was
present in the original, so that we could no longer see which : were
used as separators vs. those used as part of the original input.
Fix it by collecting the key/value pairs through a QList, and
sending that list on a round trip through a JSON QString (as in
'["key1","value1","key2","value2"]') on its way through QemuOpts,
rather than hand-rolling our own string. Since the string is only
handled internally, this was faster than creating a full-blown
struct of '[{"key1":"value1"},{"key2":"value2"}]', and safer at
guaranteeing order compared to '{"key1":"value1","key2":"value2"}'.
It would be nicer if we didn't have to round-trip through QemuOpts
in the first place, but that's a much bigger task for later.
Reproducer:
./x86_64-softmmu/qemu-system-x86_64 -nodefaults -nographic -qmp stdio \
-drive 'file=rbd:volumes/volume-ea141b5c-cdb3-4765-910d-e7008b209a70'\
':id=compute:key=AQAVkvxXAAAAABAA9ZxWFYdRmV+DSwKr7BKKXg=='\
':auth_supported=cephx\;none:mon_host=192.168.1.2\:6789'\
',format=raw,if=none,id=drive-virtio-disk0,'\
'serial=ea141b5c-cdb3-4765-910d-e7008b209a70,cache=writeback'
Even without an RBD setup, this serves a test of whether we get
the incorrect parser error of:
qemu-system-x86_64: -drive file=rbd:...cache=writeback: conf option 6789 has no value
or the correct behavior of hanging while trying to connect to
the requested mon_host of 192.168.1.2:6789.
Reported-by: Alexandru Avadanii <Alexandru.Avadanii@enea.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-id: 20170331152730.12514-1-eblake@redhat.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
2017-03-31 17:27:30 +02:00
#include "qapi/qmp/qjson.h"
#include "qapi/qmp/qlist.h"
#include "qapi/qobject-input-visitor.h"
#include "qapi/qapi-visit-block-core.h"

/*
 * When specifying the image filename use:
 *
 * rbd:poolname/devicename[@snapshotname][:option1=value1[:option2=value2...]]
 *
 * poolname must be the name of an existing rados pool.
 *
 * devicename is the name of the rbd image.
 *
 * Each option given is used to configure rados, and may be any valid
 * Ceph option, "id", or "conf".
 *
 * The "id" option indicates what user we should authenticate as to
 * the Ceph cluster. If it is excluded we will use the Ceph default
 * (normally 'admin').
 *
 * The "conf" option specifies a Ceph configuration file to read. If
 * it is not specified, we will read from the default Ceph locations
 * (e.g., /etc/ceph/ceph.conf). To avoid reading _any_ configuration
 * file, specify conf=/dev/null.
 *
 * Configuration values containing :, @, or = can be escaped with a
 * leading "\".
 */
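The escaping rule in the comment above can be illustrated from the caller's side. What follows is a minimal sketch and not part of QEMU: `example_rbd_escape` is a hypothetical helper that puts a backslash in front of `:`, `@` and `=` before a value is placed into an `rbd:` filename. Escaping the backslash itself is an assumption, made so that the value survives the unescaping the parser performs later.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not a QEMU function): copy 'in' to 'out' and put a
 * backslash in front of ':', '@', '=' and '\\'.
 * Returns 0 on success, -1 if 'out' (of size 'outlen') is too small.
 */
static int example_rbd_escape(const char *in, char *out, size_t outlen)
{
    size_t o = 0;

    assert(in != NULL && out != NULL);
    if (outlen == 0) {
        return -1;
    }
    for (; *in; ++in) {
        int special = strchr(":@=\\", *in) != NULL;
        size_t need = special ? 2 : 1;

        /* Keep one byte free for the terminating NUL. */
        if (o + need >= outlen) {
            return -1;
        }
        if (special) {
            out[o++] = '\\';
        }
        out[o++] = *in;
    }
    out[o] = '\0';
    return 0;
}
```

Given the value `192.168.1.2:6789`, the helper writes `192.168.1.2\:6789`, which could then be appended to a filename as `:mon_host=192.168.1.2\:6789`.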

#define OBJ_MAX_SIZE (1UL << OBJ_DEFAULT_OBJ_ORDER)

#define RBD_MAX_SNAPS 100

#define RBD_ENCRYPTION_LUKS_HEADER_VERIFICATION_LEN 8

static const char rbd_luks_header_verification[
        RBD_ENCRYPTION_LUKS_HEADER_VERIFICATION_LEN] = {
    'L', 'U', 'K', 'S', 0xBA, 0xBE, 0, 1
};

static const char rbd_luks2_header_verification[
        RBD_ENCRYPTION_LUKS_HEADER_VERIFICATION_LEN] = {
    'L', 'U', 'K', 'S', 0xBA, 0xBE, 0, 2
};

typedef enum {
    RBD_AIO_READ,
    RBD_AIO_WRITE,
    RBD_AIO_DISCARD,
block/rbd: add write zeroes support
This patch wittingly sets BDRV_REQ_NO_FALLBACK and silently ignores
BDRV_REQ_MAY_UNMAP for older librbd versions.
The rationale for this is as follows (citing Ilya Dryomov current RBD
maintainer):
---8<---
a) remove the BDRV_REQ_MAY_UNMAP check in qemu_rbd_co_pwrite_zeroes()
and as a consequence always unmap if librbd is too old
It's not clear what qemu's expectation is but in general Write
Zeroes is allowed to unmap. The only guarantee is that subsequent
reads return zeroes, everything else is a hint. This is how it is
specified in the kernel and in the NVMe spec.
In particular, block/nvme.c implements it as follows:
if (flags & BDRV_REQ_MAY_UNMAP) {
cdw12 |= (1 << 25);
}
This sets the Deallocate bit. But if it's not set, the device may
still deallocate:
"""
If the Deallocate bit (CDW12.DEAC) is set to '1' in a Write Zeroes
command, and the namespace supports clearing all bytes to 0h in the
values read (e.g., bits 2:0 in the DLFEAT field are set to 001b)
from a deallocated logical block and its metadata (excluding
protection information), then for each specified logical block, the
controller:
- should deallocate that logical block;
...
If the Deallocate bit is cleared to '0' in a Write Zeroes command,
and the namespace supports clearing all bytes to 0h in the values
read (e.g., bits 2:0 in the DLFEAT field are set to 001b) from
a deallocated logical block and its metadata (excluding protection
information), then, for each specified logical block, the
controller:
- may deallocate that logical block;
"""
https://nvmexpress.org/wp-content/uploads/NVM-Express-NVM-Command-Set-Specification-2021.06.02-Ratified-1.pdf
b) set BDRV_REQ_NO_FALLBACK in supported_zero_flags
Again, it's not clear what qemu expects here, but without it we end
up in a ridiculous situation where specifying the "don't allow slow
fallback" switch immediately fails all efficient zeroing requests on
a device where Write Zeroes is always efficient:
$ qemu-io -c 'help write' | grep -- '-[zun]'
-n, -- with -z, don't allow slow fallback
-u, -- with -z, allow unmapping
-z, -- write zeroes using blk_co_pwrite_zeroes
$ qemu-io -f rbd -c 'write -z -u -n 0 1M' rbd:foo/bar
write failed: Operation not supported
--->8---
Signed-off-by: Peter Lieven <pl@kamp.de>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Message-Id: <20210702172356.11574-6-idryomov@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2021-07-02 19:23:55 +02:00
    RBD_AIO_FLUSH,
    RBD_AIO_WRITE_ZEROES
} RBDAIOCmd;

typedef struct BDRVRBDState {
    rados_t cluster;
    rados_ioctx_t io_ctx;
    rbd_image_t image;
    char *image_name;
    char *snap;
block/rbd: Add support for ceph namespaces
Starting from ceph Nautilus, RBD has support for namespaces, allowing
for finer grain ACLs on images inside a pool, and tenant isolation.
In the rbd cli tool documentation, the new image-spec and snap-spec are :
- [pool-name/[namespace-name/]]image-name
- [pool-name/[namespace-name/]]image-name@snap-name
When using a qemu without namespace support, it complains about not
finding the image called namespace-name/image-name, thus we only need to
parse the image once again to find if there is a '/' in its name, and if
there is, use what is before it as the name of the namespace to later
pass it to rados_ioctx_set_namespace.
rados_ioctx_set_namespace, if called with an empty string or a null
pointer as the namespace parameter, pretty much does nothing, as it then
defaults to the default namespace.
The namespace is extracted inside qemu_rbd_parse_filename, stored in the
qdict, and used in qemu_rbd_connect to make it work with both qemu-img,
and qemu itself.
Signed-off-by: Florian Florensa <fflorensa@online.net>
Message-Id: <20200110111513.321728-2-fflorensa@online.net>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2020-01-10 12:15:13 +01:00
    char *namespace;
    uint64_t image_size;
    uint64_t object_size;
} BDRVRBDState;

typedef struct RBDTask {
    BlockDriverState *bs;
    Coroutine *co;
    bool complete;
    int64_t ret;
} RBDTask;

typedef struct RBDDiffIterateReq {
    uint64_t offs;
    uint64_t bytes;
    bool exists;
} RBDDiffIterateReq;

static int qemu_rbd_connect(rados_t *cluster, rados_ioctx_t *io_ctx,
                            BlockdevOptionsRbd *opts, bool cache,
                            const char *keypairs, const char *secretid,
                            Error **errp);

/*
 * Return a pointer to the first unescaped occurrence of delim in src,
 * or NULL if there is none. A backslash escapes the character after it.
 */
static char *qemu_rbd_strchr(char *src, char delim)
{
    char *p;

    for (p = src; *p; ++p) {
        if (*p == delim) {
            return p;
        }
        if (*p == '\\' && p[1] != '\0') {
            ++p;
        }
    }

    return NULL;
}
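A minimal sketch of how the escape-aware search behaves, assuming a standalone copy of the logic above under the hypothetical name `example_strchr_escaped` so that it compiles outside QEMU. The input string is made up for illustration: it has an escaped `:` inside a host:port value, followed by a real separator.

```c
#include <assert.h>
#include <stddef.h>

/* Standalone copy of the search logic; the name is hypothetical. */
static char *example_strchr_escaped(char *src, char delim)
{
    char *p;

    assert(src != NULL);
    for (p = src; *p; ++p) {
        if (*p == delim) {
            return p;
        }
        /* A backslash hides the next character from the search. */
        if (*p == '\\' && p[1] != '\0') {
            ++p;
        }
    }
    return NULL;
}
```

Called on `192.168.1.2\:6789:conf=/dev/null` with `':'` as the delimiter, the search skips the escaped colon and stops at the one in front of `conf`. A plain `strchr()` would stop at the escaped colon instead, which is the kind of mis-split the escaping is meant to prevent.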

/*
 * Split src in place at the first unescaped delim. The token is
 * NUL-terminated there and *p is set to the remainder of the string,
 * or to NULL if delim was not found. Returns src.
 */
static char *qemu_rbd_next_tok(char *src, char delim, char **p)
{
    char *end;

    *p = NULL;

    end = qemu_rbd_strchr(src, delim);
    if (end) {
        *p = end + 1;
        *end = '\0';
    }
    return src;
}
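The in-place tokenizing can be sketched as follows. This is an illustration under stated assumptions rather than QEMU code: `example_find_unescaped` and `example_next_tok` are hypothetical standalone copies of the two helpers above, and the input string is invented.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical standalone copy of the escape-aware search. */
static char *example_find_unescaped(char *src, char delim)
{
    char *p;

    for (p = src; *p; ++p) {
        if (*p == delim) {
            return p;
        }
        if (*p == '\\' && p[1] != '\0') {
            ++p;
        }
    }
    return NULL;
}

/*
 * Hypothetical standalone copy of the tokenizer: cut src at the first
 * unescaped delim, store the remainder in *rest (NULL if there is no
 * delim) and return the token, which is src itself.
 */
static char *example_next_tok(char *src, char delim, char **rest)
{
    char *end;

    assert(src != NULL && rest != NULL);
    *rest = NULL;
    end = example_find_unescaped(src, delim);
    if (end) {
        *rest = end + 1;
        *end = '\0';
    }
    return src;
}
```

Starting from a writable buffer holding `pool/image@snap:id=admin`, successive calls with `'/'`, `'@'` and `':'` yield `pool`, `image` and `snap`. A final call with `':'` returns `id=admin` and leaves the remainder pointer NULL, which is how a caller can tell that the input is exhausted. Passing the same variable as both the source and the remainder pointer works because the source pointer is copied by value before the remainder is reset.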

/*
 * Remove escaping backslashes from src in place. A lone backslash at
 * the end of the string is kept as is.
 */
static void qemu_rbd_unescape(char *src)
{
    char *p;

    for (p = src; *src; ++src, ++p) {
        if (*src == '\\' && src[1] != '\0') {
            src++;
        }
        *p = *src;
    }
    *p = '\0';
}
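A short sketch of the unescaping step, assuming a standalone copy of the function above under the hypothetical name `example_unescape`. It shows the two cases worth knowing: an escaped separator loses its backslash, and a trailing lone backslash is left alone.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical standalone copy of the in-place unescape logic. */
static void example_unescape(char *src)
{
    char *p;

    assert(src != NULL);
    for (p = src; *src; ++src, ++p) {
        /* Skip the backslash and copy the character it protects. */
        if (*src == '\\' && src[1] != '\0') {
            src++;
        }
        *p = *src;
    }
    *p = '\0';
}
```

Because the output is never longer than the input, the rewrite can safely happen in the same buffer. That is why the parser can unescape each token right after splitting it off, without extra allocations.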

static void qemu_rbd_parse_filename(const char *filename, QDict *options,
                                    Error **errp)
{
    const char *start;
    char *p, *buf;
    QList *keypairs = NULL;
    char *found_str, *image_name;

    if (!strstart(filename, "rbd:", &start)) {
        error_setg(errp, "File name must start with 'rbd:'");
        return;
    }

    buf = g_strdup(start);
    p = buf;

    found_str = qemu_rbd_next_tok(p, '/', &p);
    if (!p) {
        error_setg(errp, "Pool name is required");
        goto done;
    }
    qemu_rbd_unescape(found_str);
    qdict_put_str(options, "pool", found_str);

    if (qemu_rbd_strchr(p, '@')) {
        image_name = qemu_rbd_next_tok(p, '@', &p);

        found_str = qemu_rbd_next_tok(p, ':', &p);
        qemu_rbd_unescape(found_str);
        qdict_put_str(options, "snapshot", found_str);
    } else {
        image_name = qemu_rbd_next_tok(p, ':', &p);
    }
    /* Check for namespace in the image_name */
    if (qemu_rbd_strchr(image_name, '/')) {
        found_str = qemu_rbd_next_tok(image_name, '/', &image_name);
        qemu_rbd_unescape(found_str);
        qdict_put_str(options, "namespace", found_str);
    } else {
        qdict_put_str(options, "namespace", "");
    }
    qemu_rbd_unescape(image_name);
    qdict_put_str(options, "image", image_name);
    if (!p) {
        goto done;
    }

    /* The following are essentially all key/value pairs, and we treat
     * 'id' and 'conf' a bit special. Key/value pairs may be in any order. */
    while (p) {
        char *name, *value;
        name = qemu_rbd_next_tok(p, '=', &p);
        if (!p) {
            error_setg(errp, "conf option %s has no value", name);
            break;
        }

        qemu_rbd_unescape(name);

        value = qemu_rbd_next_tok(p, ':', &p);
        qemu_rbd_unescape(value);

        if (!strcmp(name, "conf")) {
            qdict_put_str(options, "conf", value);
        } else if (!strcmp(name, "id")) {
            qdict_put_str(options, "user", value);
        } else {
            /*
             * We pass these internally to qemu_rbd_set_keypairs(), so
             * we can get away with the simpler list of [ "key1",
             * "value1", "key2", "value2" ] rather than a raw dict
             * { "key1": "value1", "key2": "value2" } where we can't
             * guarantee order, or even a more correct but complex
             * [ { "key1": "value1" }, { "key2": "value2" } ]
             */
            if (!keypairs) {
                keypairs = qlist_new();
            }
            qlist_append_str(keypairs, name);
            qlist_append_str(keypairs, value);
        }
    }

rbd: Fix regression in legacy key/values containing escaped :
Commit c7cacb3 accidentally broke legacy key-value parsing through
pseudo-filename parsing of -drive file=rbd://..., for any key that
contains an escaped ':'. Such a key is surprisingly common, thanks
to mon_host specifying a 'host:port' string. The break happens
because passing things from QDict through QemuOpts back to another
QDict requires that we pack our parsed key/value pairs into a string,
and then reparse that string, but the intermediate string that we
created ("key1=value1:key2=value2") lost the \: escaping that was
present in the original, so that we could no longer see which : were
used as separators vs. those used as part of the original input.
Fix it by collecting the key/value pairs through a QList, and
sending that list on a round trip through a JSON QString (as in
'["key1","value1","key2","value2"]') on its way through QemuOpts,
rather than hand-rolling our own string. Since the string is only
handled internally, this was faster than creating a full-blown
struct of '[{"key1":"value1"},{"key2":"value2"}]', and safer at
guaranteeing order compared to '{"key1":"value1","key2":"value2"}'.
It would be nicer if we didn't have to round-trip through QemuOpts
in the first place, but that's a much bigger task for later.
Reproducer:
./x86_64-softmmu/qemu-system-x86_64 -nodefaults -nographic -qmp stdio \
-drive 'file=rbd:volumes/volume-ea141b5c-cdb3-4765-910d-e7008b209a70'\
':id=compute:key=AQAVkvxXAAAAABAA9ZxWFYdRmV+DSwKr7BKKXg=='\
':auth_supported=cephx\;none:mon_host=192.168.1.2\:6789'\
',format=raw,if=none,id=drive-virtio-disk0,'\
'serial=ea141b5c-cdb3-4765-910d-e7008b209a70,cache=writeback'
Even without an RBD setup, this serves as a test of whether we get
the incorrect parser error of:
qemu-system-x86_64: -drive file=rbd:...cache=writeback: conf option 6789 has no value
or the correct behavior of hanging while trying to connect to
the requested mon_host of 192.168.1.2:6789.
Reported-by: Alexandru Avadanii <Alexandru.Avadanii@enea.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-id: 20170331152730.12514-1-eblake@redhat.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
2017-03-31 17:27:30 +02:00
|
|
|
if (keypairs) {
|
|
|
|
qdict_put(options, "=keyvalue-pairs",
|
2020-12-11 18:11:37 +01:00
|
|
|
qstring_from_gstring(qobject_to_json(QOBJECT(keypairs))));
|
2017-02-26 23:50:42 +01:00
|
|
|
}
|
|
|
|
|
|
|
|
done:
|
|
|
|
g_free(buf);
|
2018-04-19 17:01:43 +02:00
|
|
|
qobject_unref(keypairs);
|
2017-02-26 23:50:42 +01:00
|
|
|
return;
|
2011-09-07 18:28:04 +02:00
|
|
|
}
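
The escaped-colon regression described in the commit message above can be reproduced outside QEMU. The following is a minimal standalone sketch, not the QEMU implementation: `join_lossy` and `count_colon_fields` are hypothetical helpers that pack name/value pairs into the old "key1=value1:key2=value2" form and then count what a naive ':' splitter would see.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not from the QEMU source): pack alternating
 * name/value strings as "name=value:name=value" without re-escaping
 * any ':' found inside a value. This is the lossy intermediate form.
 * The caller must pass outsz >= 1; output is truncated if too small.
 */
static void join_lossy(const char *const *kv, size_t n,
                       char *out, size_t outsz)
{
    size_t i;

    out[0] = '\0';
    for (i = 0; i + 1 < n; i += 2) {
        if (i > 0) {
            strncat(out, ":", outsz - strlen(out) - 1);
        }
        strncat(out, kv[i], outsz - strlen(out) - 1);
        strncat(out, "=", outsz - strlen(out) - 1);
        strncat(out, kv[i + 1], outsz - strlen(out) - 1);
    }
}

/* Count the fields a parser sees when it splits on every ':'. */
static size_t count_colon_fields(const char *s)
{
    size_t fields = 1;

    for (; *s; s++) {
        if (*s == ':') {
            fields++;
        }
    }
    return fields;
}
```

Feeding in the two pairs `id`/`compute` and `mon_host`/`192.168.1.2:6789` yields three fields rather than two, and the stray third field `6789` has no `=`, which matches the "conf option 6789 has no value" error quoted in the commit message. Keeping the pairs in a flat list avoids the re-parse entirely.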
|
|
|
|
|
2018-06-14 21:14:43 +02:00
|
|
|
static int qemu_rbd_set_auth(rados_t cluster, BlockdevOptionsRbd *opts,
|
2016-01-21 15:19:19 +01:00
|
|
|
Error **errp)
|
|
|
|
{
|
2018-06-14 21:14:43 +02:00
|
|
|
char *key, *acr;
|
rbd: New parameter auth-client-required
Parameter auth-client-required lets you configure authentication
methods. We tried to provide that in v2.9.0, but backed out due to
interface design doubts (commit 464444fcc16).
This commit is similar to what we backed out, but simpler: we use a
list of enumeration values instead of a list of objects with a member
of enumeration type.
Let's review our reasons for backing out the first try, as stated in
the commit message:
* The implementation uses deprecated rados_conf_set() key
"auth_supported". No biggie.
Fixed: we use "auth-client-required".
* The implementation makes -drive silently ignore invalid parameters
"auth" and "auth-supported.*.X" where X isn't "auth". Fixable (in
fact I'm going to fix similar bugs around parameter server), so
again no biggie.
That fix is commit 2836284db60. This commit doesn't bring the bugs
back.
* BlockdevOptionsRbd member @password-secret applies only to
authentication method cephx. Should it be a variant member of
RbdAuthMethod?
We've had time to ponder, and we decided to stick to the way Ceph
configuration works: the key is configured separately, and silently
ignored if the authentication method doesn't use it.
* BlockdevOptionsRbd member @user could apply to both methods cephx
and none, but I'm not sure it's actually used with none. If it
isn't, should it be a variant member of RbdAuthMethod?
Likewise.
* The client offers a *set* of authentication methods, not a list.
Should the methods be optional members of BlockdevOptionsRbd instead
of members of list @auth-supported? The latter begs the question
what multiple entries for the same method mean. Trivial question
now that RbdAuthMethod contains nothing but @type, but less so when
RbdAuthMethod acquires other members, such as the ones discussed above.
Again, we decided to stick to the way Ceph configuration works, except
we make auth-client-required a list of enumeration values instead of a
string containing keywords separated by delimiters.
* How BlockdevOptionsRbd member @auth-supported interacts with
settings from a configuration file specified with @conf is
undocumented. I suspect it's untested, too.
Not actually true, the documentation for @conf says "Values in the
configuration file will be overridden by options specified via QAPI",
and we've tested this.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-06-14 21:14:42 +02:00
|
|
|
int r;
|
|
|
|
GString *accu;
|
|
|
|
RbdAuthModeList *auth;
|
|
|
|
|
2018-06-14 21:14:43 +02:00
|
|
|
if (opts->key_secret) {
|
|
|
|
key = qcrypto_secret_lookup_as_base64(opts->key_secret, errp);
|
|
|
|
if (!key) {
|
|
|
|
return -EIO;
|
|
|
|
}
|
|
|
|
r = rados_conf_set(cluster, "key", key);
|
|
|
|
g_free(key);
|
|
|
|
if (r < 0) {
|
|
|
|
error_setg_errno(errp, -r, "Could not set 'key'");
|
|
|
|
return r;
|
2018-06-14 21:14:42 +02:00
|
|
|
}
|
2016-01-21 15:19:19 +01:00
|
|
|
}
|
|
|
|
|
2018-06-14 21:14:42 +02:00
|
|
|
if (opts->has_auth_client_required) {
|
|
|
|
accu = g_string_new("");
|
|
|
|
for (auth = opts->auth_client_required; auth; auth = auth->next) {
|
|
|
|
if (accu->str[0]) {
|
|
|
|
g_string_append_c(accu, ';');
|
|
|
|
}
|
|
|
|
g_string_append(accu, RbdAuthMode_str(auth->value));
|
|
|
|
}
|
|
|
|
acr = g_string_free(accu, FALSE);
|
|
|
|
r = rados_conf_set(cluster, "auth_client_required", acr);
|
|
|
|
g_free(acr);
|
|
|
|
if (r < 0) {
|
|
|
|
error_setg_errno(errp, -r,
|
|
|
|
"Could not set 'auth_client_required'");
|
|
|
|
return r;
|
|
|
|
}
|
|
|
|
}
|
2016-01-21 15:19:19 +01:00
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
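
The loop in qemu_rbd_set_auth() builds the value for "auth_client_required" by joining the mode names with ';', adding the separator only when the accumulator is already non-empty. Below is a minimal sketch of the same joining rule using a fixed caller-supplied buffer instead of a GString; `join_auth_modes` is a hypothetical name, and the mode names "cephx" and "none" are taken from the commit message rather than from the generated RbdAuthMode table.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: join n mode names with ';' into out.
 * Returns 0 on success, -1 if out cannot hold the result.
 */
static int join_auth_modes(const char *const *modes, size_t n,
                           char *out, size_t outsz)
{
    size_t used = 0;
    size_t i;

    if (outsz == 0) {
        return -1;
    }
    out[0] = '\0';

    for (i = 0; i < n; i++) {
        size_t len = strlen(modes[i]);
        /* Separator only when something was already written. */
        size_t sep = used ? 1 : 0;

        if (used + sep + len + 1 > outsz) {
            return -1;
        }
        if (sep) {
            out[used++] = ';';
        }
        memcpy(out + used, modes[i], len);
        used += len;
        out[used] = '\0';
    }
    return 0;
}
```

With the two modes "cephx" and "none" the result is "cephx;none"; a single mode produces no separator at all.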
|
|
|
|
|
2017-03-31 17:27:30 +02:00
|
|
|
static int qemu_rbd_set_keypairs(rados_t cluster, const char *keypairs_json,
|
2017-02-26 23:50:42 +01:00
|
|
|
Error **errp)
|
2011-05-27 01:07:32 +02:00
|
|
|
{
|
2017-03-31 17:27:30 +02:00
|
|
|
QList *keypairs;
|
|
|
|
QString *name;
|
|
|
|
QString *value;
|
|
|
|
const char *key;
|
|
|
|
size_t remaining;
|
2011-05-27 01:07:32 +02:00
|
|
|
int ret = 0;
|
|
|
|
|
2017-03-31 17:27:30 +02:00
|
|
|
if (!keypairs_json) {
|
|
|
|
return ret;
|
|
|
|
}
|
2018-02-24 16:40:29 +01:00
|
|
|
keypairs = qobject_to(QList,
|
|
|
|
qobject_from_json(keypairs_json, &error_abort));
|
2017-03-31 17:27:30 +02:00
|
|
|
remaining = qlist_size(keypairs) / 2;
|
|
|
|
assert(remaining);
|
|
|
|
|
|
|
|
while (remaining--) {
|
2018-02-24 16:40:29 +01:00
|
|
|
name = qobject_to(QString, qlist_pop(keypairs));
|
|
|
|
value = qobject_to(QString, qlist_pop(keypairs));
|
2017-03-31 17:27:30 +02:00
|
|
|
assert(name && value);
|
|
|
|
key = qstring_get_str(name);
|
|
|
|
|
|
|
|
ret = rados_conf_set(cluster, key, qstring_get_str(value));
|
2018-04-19 17:01:43 +02:00
|
|
|
qobject_unref(value);
|
2017-02-26 23:50:42 +01:00
|
|
|
if (ret < 0) {
|
2017-03-31 17:27:30 +02:00
|
|
|
error_setg_errno(errp, -ret, "invalid conf option %s", key);
|
2018-04-19 17:01:43 +02:00
|
|
|
qobject_unref(name);
|
2017-02-26 23:50:42 +01:00
|
|
|
ret = -EINVAL;
|
|
|
|
break;
|
2011-05-27 01:07:32 +02:00
|
|
|
}
|
2018-04-19 17:01:43 +02:00
|
|
|
qobject_unref(name);
|
2011-05-27 01:07:32 +02:00
|
|
|
}
|
|
|
|
|
2018-04-19 17:01:43 +02:00
|
|
|
qobject_unref(keypairs);
|
2011-05-27 01:07:32 +02:00
|
|
|
return ret;
|
|
|
|
}
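
qemu_rbd_set_keypairs() consumes the flat [ "key1", "value1", "key2", "value2" ] list two entries at a time and stops at the first setting that is rejected. The sketch below shows that pairwise walk over a plain string array, under the assumption that a callback stands in for rados_conf_set(); `apply_keypairs`, `conf_set_fn`, and `counting_setter` are hypothetical names, not QEMU or librados API.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a configuration setter such as rados_conf_set(). */
typedef int (*conf_set_fn)(const char *key, const char *value, void *opaque);

/*
 * Walk a flat array of alternating names and values in pairs.
 * Returns 0 if every pair was accepted, -EINVAL at the first rejection.
 */
static int apply_keypairs(const char *const *kv, size_t n,
                          conf_set_fn set, void *opaque)
{
    size_t remaining = n / 2;   /* number of complete pairs */
    size_t next = 0;

    while (remaining--) {
        const char *key = kv[next++];
        const char *value = kv[next++];

        if (set(key, value, opaque) < 0) {
            return -EINVAL;
        }
    }
    return 0;
}

/* Demo setter: counts accepted pairs and rejects empty values. */
static int counting_setter(const char *key, const char *value, void *opaque)
{
    (void)key;
    if (value[0] == '\0') {
        return -1;
    }
    ++*(int *)opaque;
    return 0;
}
```

Because order is preserved by the array itself, no separator or escaping convention is needed, which is the property the commit message cites for choosing the flat list.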
|
|
|
|
|
2021-06-27 13:46:35 +02:00
|
|
|
#ifdef LIBRBD_SUPPORTS_ENCRYPTION
|
|
|
|
static int qemu_rbd_convert_luks_options(
|
|
|
|
RbdEncryptionOptionsLUKSBase *luks_opts,
|
|
|
|
char **passphrase,
|
|
|
|
size_t *passphrase_len,
|
|
|
|
Error **errp)
|
|
|
|
{
|
|
|
|
return qcrypto_secret_lookup(luks_opts->key_secret, (uint8_t **)passphrase,
|
|
|
|
passphrase_len, errp);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int qemu_rbd_convert_luks_create_options(
|
|
|
|
RbdEncryptionCreateOptionsLUKSBase *luks_opts,
|
|
|
|
rbd_encryption_algorithm_t *alg,
|
|
|
|
char **passphrase,
|
|
|
|
size_t *passphrase_len,
|
|
|
|
Error **errp)
|
|
|
|
{
|
|
|
|
int r = 0;
|
|
|
|
|
|
|
|
r = qemu_rbd_convert_luks_options(
|
|
|
|
qapi_RbdEncryptionCreateOptionsLUKSBase_base(luks_opts),
|
|
|
|
passphrase, passphrase_len, errp);
|
|
|
|
if (r < 0) {
|
|
|
|
return r;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (luks_opts->has_cipher_alg) {
|
|
|
|
switch (luks_opts->cipher_alg) {
|
|
|
|
case QCRYPTO_CIPHER_ALG_AES_128: {
|
|
|
|
*alg = RBD_ENCRYPTION_ALGORITHM_AES128;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
case QCRYPTO_CIPHER_ALG_AES_256: {
|
|
|
|
*alg = RBD_ENCRYPTION_ALGORITHM_AES256;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
default: {
|
|
|
|
r = -ENOTSUP;
|
|
|
|
error_setg_errno(errp, -r, "unknown encryption algorithm: %u",
|
|
|
|
luks_opts->cipher_alg);
|
|
|
|
return r;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
} else {
|
|
|
|
/* default alg */
|
|
|
|
*alg = RBD_ENCRYPTION_ALGORITHM_AES256;
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int qemu_rbd_encryption_format(rbd_image_t image,
|
|
|
|
RbdEncryptionCreateOptions *encrypt,
|
|
|
|
Error **errp)
|
|
|
|
{
|
|
|
|
int r = 0;
|
|
|
|
g_autofree char *passphrase = NULL;
|
|
|
|
size_t passphrase_len;
|
|
|
|
rbd_encryption_format_t format;
|
|
|
|
rbd_encryption_options_t opts;
|
|
|
|
rbd_encryption_luks1_format_options_t luks_opts;
|
|
|
|
rbd_encryption_luks2_format_options_t luks2_opts;
|
|
|
|
size_t opts_size;
|
|
|
|
uint64_t raw_size, effective_size;
|
|
|
|
|
|
|
|
r = rbd_get_size(image, &raw_size);
|
|
|
|
if (r < 0) {
|
|
|
|
error_setg_errno(errp, -r, "cannot get raw image size");
|
|
|
|
return r;
|
|
|
|
}
|
|
|
|
|
|
|
|
switch (encrypt->format) {
|
|
|
|
case RBD_IMAGE_ENCRYPTION_FORMAT_LUKS: {
|
|
|
|
memset(&luks_opts, 0, sizeof(luks_opts));
|
|
|
|
format = RBD_ENCRYPTION_FORMAT_LUKS1;
|
|
|
|
opts = &luks_opts;
|
|
|
|
opts_size = sizeof(luks_opts);
|
|
|
|
r = qemu_rbd_convert_luks_create_options(
|
|
|
|
qapi_RbdEncryptionCreateOptionsLUKS_base(&encrypt->u.luks),
|
|
|
|
&luks_opts.alg, &passphrase, &passphrase_len, errp);
|
|
|
|
if (r < 0) {
|
|
|
|
return r;
|
|
|
|
}
|
|
|
|
luks_opts.passphrase = passphrase;
|
|
|
|
luks_opts.passphrase_size = passphrase_len;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
case RBD_IMAGE_ENCRYPTION_FORMAT_LUKS2: {
|
|
|
|
memset(&luks2_opts, 0, sizeof(luks2_opts));
|
|
|
|
format = RBD_ENCRYPTION_FORMAT_LUKS2;
|
|
|
|
opts = &luks2_opts;
|
|
|
|
opts_size = sizeof(luks2_opts);
|
|
|
|
r = qemu_rbd_convert_luks_create_options(
|
|
|
|
qapi_RbdEncryptionCreateOptionsLUKS2_base(
|
|
|
|
&encrypt->u.luks2),
|
|
|
|
&luks2_opts.alg, &passphrase, &passphrase_len, errp);
|
|
|
|
if (r < 0) {
|
|
|
|
return r;
|
|
|
|
}
|
|
|
|
luks2_opts.passphrase = passphrase;
|
|
|
|
luks2_opts.passphrase_size = passphrase_len;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
default: {
|
|
|
|
r = -ENOTSUP;
|
|
|
|
error_setg_errno(
|
|
|
|
errp, -r, "unknown image encryption format: %u",
|
|
|
|
encrypt->format);
|
|
|
|
return r;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
r = rbd_encryption_format(image, format, opts, opts_size);
|
|
|
|
if (r < 0) {
|
|
|
|
error_setg_errno(errp, -r, "encryption format fail");
|
|
|
|
return r;
|
|
|
|
}
|
|
|
|
|
|
|
|
r = rbd_get_size(image, &effective_size);
|
|
|
|
if (r < 0) {
|
|
|
|
error_setg_errno(errp, -r, "cannot get effective image size");
|
|
|
|
return r;
|
|
|
|
}
|
|
|
|
|
|
|
|
r = rbd_resize(image, raw_size + (raw_size - effective_size));
|
|
|
|
if (r < 0) {
|
|
|
|
error_setg_errno(errp, -r, "cannot resize image after format");
|
|
|
|
return r;
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
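
The final rbd_resize() call in qemu_rbd_encryption_format() appears to compensate for the space the encryption header takes: the image is grown by the difference between the size measured before formatting and the size reported afterwards. A worked sketch of that arithmetic follows; `grown_size` is a hypothetical helper, the 16 MiB overhead used in the usage note is purely illustrative, and the sketch assumes effective_size never exceeds raw_size.

```c
#include <stdint.h>

/*
 * Hypothetical helper mirroring the expression
 *     raw_size + (raw_size - effective_size)
 * Assumes effective_size <= raw_size.
 */
static uint64_t grown_size(uint64_t raw_size, uint64_t effective_size)
{
    uint64_t overhead = raw_size - effective_size;

    return raw_size + overhead;
}
```

For example, with a 1 GiB raw image and an assumed 16 MiB of overhead, the effective size after formatting would be 1 GiB minus 16 MiB, and the image would be resized to 1 GiB plus 16 MiB. If formatting consumed no space, the size would stay unchanged.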
|
|
|
|
|
|
|
|
static int qemu_rbd_encryption_load(rbd_image_t image,
|
|
|
|
RbdEncryptionOptions *encrypt,
|
|
|
|
Error **errp)
|
|
|
|
{
|
|
|
|
int r = 0;
|
|
|
|
g_autofree char *passphrase = NULL;
|
|
|
|
size_t passphrase_len;
|
|
|
|
rbd_encryption_luks1_format_options_t luks_opts;
|
|
|
|
rbd_encryption_luks2_format_options_t luks2_opts;
|
|
|
|
rbd_encryption_format_t format;
|
|
|
|
rbd_encryption_options_t opts;
|
|
|
|
size_t opts_size;
|
|
|
|
|
|
|
|
switch (encrypt->format) {
|
|
|
|
case RBD_IMAGE_ENCRYPTION_FORMAT_LUKS: {
|
|
|
|
memset(&luks_opts, 0, sizeof(luks_opts));
|
|
|
|
format = RBD_ENCRYPTION_FORMAT_LUKS1;
|
|
|
|
opts = &luks_opts;
|
|
|
|
opts_size = sizeof(luks_opts);
|
|
|
|
r = qemu_rbd_convert_luks_options(
|
|
|
|
qapi_RbdEncryptionOptionsLUKS_base(&encrypt->u.luks),
|
|
|
|
&passphrase, &passphrase_len, errp);
|
|
|
|
if (r < 0) {
|
|
|
|
return r;
|
|
|
|
}
|
|
|
|
luks_opts.passphrase = passphrase;
|
|
|
|
luks_opts.passphrase_size = passphrase_len;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
case RBD_IMAGE_ENCRYPTION_FORMAT_LUKS2: {
|
|
|
|
memset(&luks2_opts, 0, sizeof(luks2_opts));
|
|
|
|
format = RBD_ENCRYPTION_FORMAT_LUKS2;
|
|
|
|
opts = &luks2_opts;
|
|
|
|
opts_size = sizeof(luks2_opts);
|
|
|
|
r = qemu_rbd_convert_luks_options(
|
|
|
|
qapi_RbdEncryptionOptionsLUKS2_base(&encrypt->u.luks2),
|
|
|
|
&passphrase, &passphrase_len, errp);
|
|
|
|
if (r < 0) {
|
|
|
|
return r;
|
|
|
|
}
|
|
|
|
luks2_opts.passphrase = passphrase;
|
|
|
|
luks2_opts.passphrase_size = passphrase_len;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
default: {
|
|
|
|
r = -ENOTSUP;
|
|
|
|
error_setg_errno(
|
|
|
|
errp, -r, "unknown image encryption format: %u",
|
|
|
|
encrypt->format);
|
|
|
|
return r;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
r = rbd_encryption_load(image, format, opts, opts_size);
|
|
|
|
if (r < 0) {
|
|
|
|
error_setg_errno(errp, -r, "encryption load fail");
|
|
|
|
return r;
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2018-06-14 21:14:43 +02:00
|
|
|
/* FIXME Deprecate and remove keypairs or make it available in QMP. */
|
2018-01-31 16:27:38 +01:00
|
|
|
static int qemu_rbd_do_create(BlockdevCreateOptions *options,
|
|
|
|
const char *keypairs, const char *password_secret,
|
|
|
|
Error **errp)
|
2010-12-06 20:53:01 +01:00
|
|
|
{
|
2018-01-31 16:27:38 +01:00
|
|
|
BlockdevCreateOptionsRbd *opts = &options->u.rbd;
|
2011-05-27 01:07:31 +02:00
|
|
|
rados_t cluster;
|
|
|
|
rados_ioctx_t io_ctx;
|
2018-01-31 16:27:38 +01:00
|
|
|
int obj_order = 0;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
assert(options->driver == BLOCKDEV_DRIVER_RBD);
|
|
|
|
if (opts->location->has_snapshot) {
|
|
|
|
error_setg(errp, "Can't use snapshot name for image creation");
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
2010-12-06 20:53:01 +01:00
|
|
|
|
2021-06-27 13:46:35 +02:00
|
|
|
#ifndef LIBRBD_SUPPORTS_ENCRYPTION
|
|
|
|
if (opts->has_encrypt) {
|
|
|
|
error_setg(errp, "RBD library does not support image encryption");
|
|
|
|
return -ENOTSUP;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2018-01-31 16:27:38 +01:00
|
|
|
if (opts->has_cluster_size) {
|
|
|
|
int64_t objsize = opts->cluster_size;
|
2014-06-05 11:21:04 +02:00
|
|
|
if ((objsize - 1) & objsize) { /* not a power of 2? */
|
|
|
|
error_setg(errp, "obj size needs to be power of 2");
|
2018-01-31 16:27:38 +01:00
|
|
|
return -EINVAL;
|
2014-06-05 11:21:04 +02:00
|
|
|
}
|
|
|
|
if (objsize < 4096) {
|
|
|
|
error_setg(errp, "obj size too small");
|
2018-01-31 16:27:38 +01:00
|
|
|
return -EINVAL;
|
2010-12-06 20:53:01 +01:00
|
|
|
}
|
2015-03-23 16:29:26 +01:00
|
|
|
obj_order = ctz32(objsize);
|
2010-12-06 20:53:01 +01:00
|
|
|
}
|
|
|
|
|
2018-02-16 18:48:25 +01:00
|
|
|
ret = qemu_rbd_connect(&cluster, &io_ctx, opts->location, false, keypairs,
|
|
|
|
password_secret, errp);
|
2016-05-09 09:51:59 +02:00
|
|
|
if (ret < 0) {
|
2018-01-31 16:27:38 +01:00
|
|
|
return ret;
|
2010-12-06 20:53:01 +01:00
|
|
|
}
|
|
|
|
|
2018-01-31 16:27:38 +01:00
|
|
|
ret = rbd_create(io_ctx, opts->location->image, opts->size, &obj_order);
|
2016-05-09 09:51:59 +02:00
|
|
|
if (ret < 0) {
|
|
|
|
error_setg_errno(errp, -ret, "error rbd create");
|
2018-02-16 18:48:25 +01:00
|
|
|
goto out;
|
2016-05-09 09:51:59 +02:00
|
|
|
}
|
2010-12-06 20:53:01 +01:00
|
|
|
|
2021-06-27 13:46:35 +02:00
|
|
|
#ifdef LIBRBD_SUPPORTS_ENCRYPTION
|
|
|
|
if (opts->has_encrypt) {
|
|
|
|
rbd_image_t image;
|
|
|
|
|
|
|
|
ret = rbd_open(io_ctx, opts->location->image, &image, NULL);
|
|
|
|
if (ret < 0) {
|
|
|
|
error_setg_errno(errp, -ret,
|
|
|
|
"error opening image '%s' for encryption format",
|
|
|
|
opts->location->image);
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
ret = qemu_rbd_encryption_format(image, opts->encrypt, errp);
|
|
|
|
rbd_close(image);
|
|
|
|
if (ret < 0) {
|
|
|
|
/* encryption format fail, try removing the image */
|
|
|
|
rbd_remove(io_ctx, opts->location->image);
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2018-01-31 16:27:38 +01:00
|
|
|
ret = 0;
|
2018-02-16 18:48:25 +01:00
|
|
|
out:
|
|
|
|
rados_ioctx_destroy(io_ctx);
|
2016-10-15 10:26:13 +02:00
|
|
|
rados_shutdown(cluster);
|
2018-01-31 16:27:38 +01:00
|
|
|
return ret;
|
|
|
|
}
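
The cluster_size checks in qemu_rbd_do_create() reduce to "power of two, at least 4096, then take log2". The sketch below restates that logic without QEMU's ctz32(). The helper name rbd_obj_order_from_size is hypothetical, and the explicit rejection of non-positive sizes is an added assumption, not something the function above does in that form.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of QEMU): validate an object size the
 * way the cluster_size checks do and return log2(objsize), or -EINVAL.
 */
static int rbd_obj_order_from_size(int64_t objsize)
{
    int order = 0;

    /* Assumption: reject non-positive sizes up front. */
    if (objsize <= 0 || ((objsize - 1) & objsize)) {
        return -EINVAL; /* not a power of 2 */
    }
    if (objsize < 4096) {
        return -EINVAL; /* object size too small */
    }
    /* Count trailing zero bits; for a power of 2 this equals log2. */
    while (!(objsize & 1)) {
        objsize >>= 1;
        order++;
    }
    assert(objsize == 1);
    return order;
}
```

With this sketch, a 4 MiB object size would map to order 22, which is the kind of value passed to rbd_create() through obj_order.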
|
|
|
|
|
|
|
|
static int qemu_rbd_co_create(BlockdevCreateOptions *options, Error **errp)
|
|
|
|
{
|
|
|
|
return qemu_rbd_do_create(options, NULL, NULL, errp);
|
|
|
|
}
|
|
|
|
|
2021-06-27 13:46:35 +02:00
|
|
|
static int qemu_rbd_extract_encryption_create_options(
|
|
|
|
QemuOpts *opts,
|
|
|
|
RbdEncryptionCreateOptions **spec,
|
|
|
|
Error **errp)
|
|
|
|
{
|
|
|
|
QDict *opts_qdict;
|
|
|
|
QDict *encrypt_qdict;
|
|
|
|
Visitor *v;
|
|
|
|
int ret = 0;
|
|
|
|
|
|
|
|
opts_qdict = qemu_opts_to_qdict(opts, NULL);
|
|
|
|
qdict_extract_subqdict(opts_qdict, &encrypt_qdict, "encrypt.");
|
|
|
|
qobject_unref(opts_qdict);
|
|
|
|
if (!qdict_size(encrypt_qdict)) {
|
|
|
|
*spec = NULL;
|
|
|
|
goto exit;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Convert options into a QAPI object */
|
|
|
|
v = qobject_input_visitor_new_flat_confused(encrypt_qdict, errp);
|
|
|
|
if (!v) {
|
|
|
|
ret = -EINVAL;
|
|
|
|
goto exit;
|
|
|
|
}
|
|
|
|
|
|
|
|
visit_type_RbdEncryptionCreateOptions(v, NULL, spec, errp);
|
|
|
|
visit_free(v);
|
|
|
|
if (!*spec) {
|
|
|
|
ret = -EINVAL;
|
|
|
|
goto exit;
|
|
|
|
}
|
|
|
|
|
|
|
|
exit:
|
|
|
|
qobject_unref(encrypt_qdict);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2020-03-26 02:12:17 +01:00
|
|
|
static int coroutine_fn qemu_rbd_co_create_opts(BlockDriver *drv,
|
|
|
|
const char *filename,
|
2018-01-31 16:27:38 +01:00
|
|
|
QemuOpts *opts,
|
|
|
|
Error **errp)
|
|
|
|
{
|
|
|
|
BlockdevCreateOptions *create_options;
|
|
|
|
BlockdevCreateOptionsRbd *rbd_opts;
|
|
|
|
BlockdevOptionsRbd *loc;
|
2021-06-27 13:46:35 +02:00
|
|
|
RbdEncryptionCreateOptions *encrypt = NULL;
|
2018-01-31 16:27:38 +01:00
|
|
|
Error *local_err = NULL;
|
|
|
|
const char *keypairs, *password_secret;
|
|
|
|
QDict *options = NULL;
|
|
|
|
int ret = 0;
|
|
|
|
|
|
|
|
create_options = g_new0(BlockdevCreateOptions, 1);
|
|
|
|
create_options->driver = BLOCKDEV_DRIVER_RBD;
|
|
|
|
rbd_opts = &create_options->u.rbd;
|
|
|
|
|
|
|
|
rbd_opts->location = g_new0(BlockdevOptionsRbd, 1);
|
|
|
|
|
|
|
|
password_secret = qemu_opt_get(opts, "password-secret");
|
|
|
|
|
|
|
|
/* Read out options */
|
|
|
|
rbd_opts->size = ROUND_UP(qemu_opt_get_size_del(opts, BLOCK_OPT_SIZE, 0),
|
|
|
|
BDRV_SECTOR_SIZE);
|
|
|
|
rbd_opts->cluster_size = qemu_opt_get_size_del(opts,
|
|
|
|
BLOCK_OPT_CLUSTER_SIZE, 0);
|
|
|
|
rbd_opts->has_cluster_size = (rbd_opts->cluster_size != 0);
|
|
|
|
|
|
|
|
options = qdict_new();
|
|
|
|
qemu_rbd_parse_filename(filename, options, &local_err);
|
|
|
|
if (local_err) {
|
|
|
|
ret = -EINVAL;
|
|
|
|
error_propagate(errp, local_err);
|
|
|
|
goto exit;
|
|
|
|
}
|
|
|
|
|
2021-06-27 13:46:35 +02:00
|
|
|
ret = qemu_rbd_extract_encryption_create_options(opts, &encrypt, errp);
|
|
|
|
if (ret < 0) {
|
|
|
|
goto exit;
|
|
|
|
}
|
|
|
|
rbd_opts->encrypt = encrypt;
|
|
|
|
rbd_opts->has_encrypt = !!encrypt;
|
|
|
|
|
2018-01-31 16:27:38 +01:00
|
|
|
/*
|
|
|
|
* Caution: while qdict_get_try_str() is fine, getting non-string
|
|
|
|
* types would require more care. When @options come from -blockdev
|
|
|
|
* or blockdev_add, its members are typed according to the QAPI
|
|
|
|
* schema, but when they come from -drive, they're all QString.
|
|
|
|
*/
|
|
|
|
loc = rbd_opts->location;
|
block/rbd: Add support for ceph namespaces
Starting from ceph Nautilus, RBD has support for namespaces, allowing
for finer grain ACLs on images inside a pool, and tenant isolation.
In the rbd cli tool documentation, the new image-spec and snap-spec are :
- [pool-name/[namespace-name/]]image-name
- [pool-name/[namespace-name/]]image-name@snap-name
When using a non-namespace-enabled qemu, it complains about not
finding the image called namespace-name/image-name, thus we only need to
parse the image once again to find if there is a '/' in its name, and if
there is, use what is before it as the name of the namespace to later
pass it to rados_ioctx_set_namespace.
rados_ioctx_set_namespace if called with an empty string or a null
pointer as the namespace parameters pretty much does nothing, as it then
defaults to the default namespace.
The namespace is extracted inside qemu_rbd_parse_filename, stored in the
qdict, and used in qemu_rbd_connect to make it work with both qemu-img,
and qemu itself.
Signed-off-by: Florian Florensa <fflorensa@online.net>
Message-Id: <20200110111513.321728-2-fflorensa@online.net>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2020-01-10 12:15:13 +01:00
|
|
|
loc->pool = g_strdup(qdict_get_try_str(options, "pool"));
|
|
|
|
loc->conf = g_strdup(qdict_get_try_str(options, "conf"));
|
|
|
|
loc->has_conf = !!loc->conf;
|
|
|
|
loc->user = g_strdup(qdict_get_try_str(options, "user"));
|
|
|
|
loc->has_user = !!loc->user;
|
|
|
|
loc->q_namespace = g_strdup(qdict_get_try_str(options, "namespace"));
|
2021-03-29 17:01:29 +02:00
|
|
|
loc->has_q_namespace = !!loc->q_namespace;
|
2020-01-10 12:15:13 +01:00
|
|
|
loc->image = g_strdup(qdict_get_try_str(options, "image"));
|
|
|
|
keypairs = qdict_get_try_str(options, "=keyvalue-pairs");
|
2018-01-31 16:27:38 +01:00
|
|
|
|
|
|
|
ret = qemu_rbd_do_create(create_options, keypairs, password_secret, errp);
|
|
|
|
if (ret < 0) {
|
|
|
|
goto exit;
|
|
|
|
}
|
2017-02-26 23:50:42 +01:00
|
|
|
|
|
|
|
exit:
|
2018-04-19 17:01:43 +02:00
|
|
|
qobject_unref(options);
|
2018-01-31 16:27:38 +01:00
|
|
|
qapi_free_BlockdevCreateOptions(create_options);
|
2010-12-06 20:53:01 +01:00
|
|
|
return ret;
|
|
|
|
}
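
The namespace commit message quoted above describes image specs of the form [namespace-name/]image-name, where the text before the '/' becomes the namespace. A minimal sketch of that split follows. It is not QEMU's qemu_rbd_parse_filename(), which also deals with escaping and other options; split_namespace is a hypothetical name used only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: split "[namespace-name/]image-name" in place.
 * On return *image points at the image name and *ns at the namespace,
 * or NULL when the spec contains no '/'.
 */
static void split_namespace(char *spec, char **ns, char **image)
{
    char *slash;

    assert(spec && ns && image);
    slash = strchr(spec, '/');
    if (slash) {
        *slash = '\0';      /* terminate the namespace part */
        *ns = spec;
        *image = slash + 1;
    } else {
        *ns = NULL;         /* default namespace */
        *image = spec;
    }
}
```

A NULL namespace is a reasonable "not specified" value here because, as the commit message notes, rados_ioctx_set_namespace() treats NULL or an empty string as the default namespace.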
|
|
|
|
|
2018-02-15 20:58:24 +01:00
|
|
|
static char *qemu_rbd_mon_host(BlockdevOptionsRbd *opts, Error **errp)
|
2017-02-27 18:36:46 +01:00
|
|
|
{
|
2018-02-15 20:58:24 +01:00
|
|
|
const char **vals;
|
2017-03-28 10:56:08 +02:00
|
|
|
const char *host, *port;
|
|
|
|
char *rados_str;
|
2018-02-15 20:58:24 +01:00
|
|
|
InetSocketAddressBaseList *p;
|
|
|
|
int i, cnt;
|
|
|
|
|
|
|
|
if (!opts->has_server) {
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
for (cnt = 0, p = opts->server; p; p = p->next) {
|
|
|
|
cnt++;
|
|
|
|
}
|
|
|
|
|
|
|
|
vals = g_new(const char *, cnt + 1);
|
|
|
|
|
|
|
|
for (i = 0, p = opts->server; p; p = p->next, i++) {
|
|
|
|
host = p->value->host;
|
|
|
|
port = p->value->port;
|
2017-02-27 18:36:46 +01:00
|
|
|
|
2017-03-28 10:56:08 +02:00
|
|
|
if (strchr(host, ':')) {
|
2018-02-15 20:58:24 +01:00
|
|
|
vals[i] = g_strdup_printf("[%s]:%s", host, port);
|
2017-02-27 18:36:46 +01:00
|
|
|
} else {
|
2018-02-15 20:58:24 +01:00
|
|
|
vals[i] = g_strdup_printf("%s:%s", host, port);
|
2017-02-27 18:36:46 +01:00
|
|
|
}
|
|
|
|
}
|
2017-03-28 10:56:08 +02:00
|
|
|
vals[i] = NULL;
|
2017-02-27 18:36:46 +01:00
|
|
|
|
2017-03-28 10:56:08 +02:00
|
|
|
rados_str = i ? g_strjoinv(";", (char **)vals) : NULL;
|
|
|
|
g_strfreev((char **)vals);
|
2017-02-27 18:36:46 +01:00
|
|
|
return rados_str;
|
|
|
|
}
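
qemu_rbd_mon_host() formats each server entry as host:port, wraps hosts that contain ':' (IPv6 literals) in brackets, and joins the results with ';'. The per-entry formatting can be sketched without GLib as below. format_mon_addr is a hypothetical helper, and the caller-supplied buffer is an assumption made to keep the example self-contained.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not QEMU's API): format one monitor address into
 * buf. Hosts containing ':' are treated as IPv6 literals and bracketed.
 * Returns the snprintf() result.
 */
static int format_mon_addr(char *buf, size_t len,
                           const char *host, const char *port)
{
    assert(buf && host && port);
    if (strchr(host, ':')) {
        return snprintf(buf, len, "[%s]:%s", host, port);
    }
    return snprintf(buf, len, "%s:%s", host, port);
}
```

The brackets keep the port separator unambiguous when the host itself contains colons.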
|
|
|
|
|
2018-02-15 19:13:47 +01:00
|
|
|
static int qemu_rbd_connect(rados_t *cluster, rados_ioctx_t *io_ctx,
|
2018-02-15 20:58:24 +01:00
|
|
|
BlockdevOptionsRbd *opts, bool cache,
|
2018-02-15 20:31:04 +01:00
|
|
|
const char *keypairs, const char *secretid,
|
|
|
|
Error **errp)
|
2010-12-06 20:53:01 +01:00
|
|
|
{
|
2017-02-27 18:36:46 +01:00
|
|
|
char *mon_host = NULL;
|
2018-02-15 19:13:47 +01:00
|
|
|
Error *local_err = NULL;
|
2010-12-06 20:53:01 +01:00
|
|
|
int r;
|
|
|
|
|
2018-06-14 21:14:43 +02:00
|
|
|
if (secretid) {
|
|
|
|
if (opts->key_secret) {
|
|
|
|
error_setg(errp,
|
|
|
|
"Legacy 'password-secret' clashes with 'key-secret'");
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
opts->key_secret = g_strdup(secretid);
|
|
|
|
opts->has_key_secret = true;
|
|
|
|
}
|
|
|
|
|
2018-02-15 20:58:24 +01:00
|
|
|
mon_host = qemu_rbd_mon_host(opts, &local_err);
|
2017-02-27 18:36:46 +01:00
|
|
|
if (local_err) {
|
|
|
|
error_propagate(errp, local_err);
|
|
|
|
r = -EINVAL;
|
2021-03-29 17:01:28 +02:00
|
|
|
goto out;
|
2017-02-27 18:36:46 +01:00
|
|
|
}
|
|
|
|
|
2018-02-15 20:58:24 +01:00
|
|
|
r = rados_create(cluster, opts->user);
|
2011-05-27 01:07:31 +02:00
|
|
|
if (r < 0) {
|
2016-05-09 09:51:59 +02:00
|
|
|
error_setg_errno(errp, -r, "error initializing");
|
2021-03-29 17:01:28 +02:00
|
|
|
goto out;
|
2010-12-06 20:53:01 +01:00
|
|
|
}
|
|
|
|
|
2017-02-26 23:50:42 +01:00
|
|
|
/* try default location when conf=NULL, but ignore failure */
|
2018-02-15 20:58:24 +01:00
|
|
|
r = rados_conf_read_file(*cluster, opts->conf);
|
|
|
|
if (opts->has_conf && r < 0) {
|
|
|
|
error_setg_errno(errp, -r, "error reading conf file %s", opts->conf);
|
2017-02-26 23:50:42 +01:00
|
|
|
goto failed_shutdown;
|
2015-06-11 05:28:45 +02:00
|
|
|
}
|
|
|
|
|
2018-02-15 19:13:47 +01:00
|
|
|
r = qemu_rbd_set_keypairs(*cluster, keypairs, errp);
|
2017-02-26 23:50:42 +01:00
|
|
|
if (r < 0) {
|
|
|
|
goto failed_shutdown;
|
2015-06-11 05:28:45 +02:00
|
|
|
}
|
|
|
|
|
2017-02-27 18:36:46 +01:00
|
|
|
if (mon_host) {
|
2018-02-15 19:13:47 +01:00
|
|
|
r = rados_conf_set(*cluster, "mon_host", mon_host);
|
2017-02-27 18:36:46 +01:00
|
|
|
if (r < 0) {
|
|
|
|
goto failed_shutdown;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2018-06-14 21:14:43 +02:00
|
|
|
r = qemu_rbd_set_auth(*cluster, opts, errp);
|
|
|
|
if (r < 0) {
|
2016-01-21 15:19:19 +01:00
|
|
|
goto failed_shutdown;
|
|
|
|
}
|
|
|
|
|
2012-05-17 22:42:29 +02:00
|
|
|
/*
|
|
|
|
* Fall back to more conservative semantics if setting cache
|
|
|
|
* options fails. Ignore errors from setting rbd_cache because the
|
|
|
|
* only possible error is that the option does not exist, and
|
|
|
|
* librbd defaults to no caching. If write through caching cannot
|
|
|
|
* be set up, fall back to no caching.
|
|
|
|
*/
|
2018-02-15 19:13:47 +01:00
|
|
|
if (cache) {
|
|
|
|
rados_conf_set(*cluster, "rbd_cache", "true");
|
2012-05-17 22:42:29 +02:00
|
|
|
} else {
|
2018-02-15 19:13:47 +01:00
|
|
|
rados_conf_set(*cluster, "rbd_cache", "false");
|
2012-05-17 22:42:29 +02:00
|
|
|
}
|
|
|
|
|
2018-02-15 19:13:47 +01:00
|
|
|
r = rados_connect(*cluster);
|
2011-05-27 01:07:31 +02:00
|
|
|
if (r < 0) {
|
2016-05-09 09:51:59 +02:00
|
|
|
error_setg_errno(errp, -r, "error connecting");
|
2011-09-07 18:28:06 +02:00
|
|
|
goto failed_shutdown;
|
2010-12-06 20:53:01 +01:00
|
|
|
}
|
|
|
|
|
2018-02-15 20:58:24 +01:00
|
|
|
r = rados_ioctx_create(*cluster, opts->pool, io_ctx);
|
2011-05-27 01:07:31 +02:00
|
|
|
if (r < 0) {
|
2018-02-15 20:58:24 +01:00
|
|
|
error_setg_errno(errp, -r, "error opening pool %s", opts->pool);
|
2011-09-07 18:28:06 +02:00
|
|
|
goto failed_shutdown;
|
2010-12-06 20:53:01 +01:00
|
|
|
}
|
2020-01-10 12:15:13 +01:00
|
|
|
/*
|
|
|
|
* Set the namespace after opening the io context on the pool,
|
|
|
|
* if nspace == NULL or if nspace == "", it is just as if we did nothing
|
|
|
|
*/
|
|
|
|
rados_ioctx_set_namespace(*io_ctx, opts->q_namespace);
|
2010-12-06 20:53:01 +01:00
|
|
|
|
2021-03-29 17:01:28 +02:00
|
|
|
r = 0;
|
|
|
|
goto out;
|
2018-02-15 19:13:47 +01:00
|
|
|
|
|
|
|
failed_shutdown:
|
|
|
|
rados_shutdown(*cluster);
|
2021-03-29 17:01:28 +02:00
|
|
|
out:
|
2018-02-15 19:13:47 +01:00
|
|
|
g_free(mon_host);
|
|
|
|
return r;
|
|
|
|
}
|
|
|
|
|
2018-09-12 00:32:30 +02:00
|
|
|
static int qemu_rbd_convert_options(QDict *options, BlockdevOptionsRbd **opts,
|
|
|
|
Error **errp)
|
|
|
|
{
|
|
|
|
Visitor *v;
|
|
|
|
|
|
|
|
/* Convert the remaining options into a QAPI object */
|
|
|
|
v = qobject_input_visitor_new_flat_confused(options, errp);
|
|
|
|
if (!v) {
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
|
2020-07-07 18:06:07 +02:00
|
|
|
visit_type_BlockdevOptionsRbd(v, NULL, opts, errp);
|
2018-09-12 00:32:30 +02:00
|
|
|
visit_free(v);
|
2020-07-07 18:06:07 +02:00
|
|
|
if (!*opts) {
|
2018-09-12 00:32:30 +02:00
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2018-09-12 00:32:31 +02:00
|
|
|
static int qemu_rbd_attempt_legacy_options(QDict *options,
|
|
|
|
BlockdevOptionsRbd **opts,
|
|
|
|
char **keypairs)
|
|
|
|
{
|
|
|
|
char *filename;
|
|
|
|
int r;
|
|
|
|
|
|
|
|
filename = g_strdup(qdict_get_try_str(options, "filename"));
|
|
|
|
if (!filename) {
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
qdict_del(options, "filename");
|
|
|
|
|
|
|
|
qemu_rbd_parse_filename(filename, options, NULL);
|
|
|
|
|
|
|
|
/* keypairs freed by caller */
|
|
|
|
*keypairs = g_strdup(qdict_get_try_str(options, "=keyvalue-pairs"));
|
|
|
|
if (*keypairs) {
|
|
|
|
qdict_del(options, "=keyvalue-pairs");
|
|
|
|
}
|
|
|
|
|
|
|
|
r = qemu_rbd_convert_options(options, opts, NULL);
|
|
|
|
|
|
|
|
g_free(filename);
|
|
|
|
return r;
|
|
|
|
}
|
|
|
|
|
2018-02-15 19:13:47 +01:00
|
|
|
static int qemu_rbd_open(BlockDriverState *bs, QDict *options, int flags,
|
|
|
|
Error **errp)
|
|
|
|
{
|
|
|
|
BDRVRBDState *s = bs->opaque;
|
2018-02-15 20:58:24 +01:00
|
|
|
BlockdevOptionsRbd *opts = NULL;
|
2018-04-04 17:40:45 +02:00
|
|
|
const QDictEntry *e;
|
2018-02-15 19:13:47 +01:00
|
|
|
Error *local_err = NULL;
|
2018-02-15 20:31:04 +01:00
|
|
|
char *keypairs, *secretid;
|
2021-07-02 19:23:52 +02:00
|
|
|
rbd_image_info_t info;
|
2018-02-15 19:13:47 +01:00
|
|
|
int r;
|
|
|
|
|
2018-02-15 20:31:04 +01:00
|
|
|
keypairs = g_strdup(qdict_get_try_str(options, "=keyvalue-pairs"));
|
|
|
|
if (keypairs) {
|
|
|
|
qdict_del(options, "=keyvalue-pairs");
|
|
|
|
}
|
|
|
|
|
|
|
|
secretid = g_strdup(qdict_get_try_str(options, "password-secret"));
|
|
|
|
if (secretid) {
|
|
|
|
qdict_del(options, "password-secret");
|
|
|
|
}
|
|
|
|
|
2018-09-12 00:32:30 +02:00
|
|
|
r = qemu_rbd_convert_options(options, &opts, &local_err);
|
2018-02-15 20:58:24 +01:00
|
|
|
if (local_err) {
|
2018-09-12 00:32:31 +02:00
|
|
|
/* If keypairs are present, that means some options are present in
|
|
|
|
* the modern option format. Don't attempt to parse legacy option
|
|
|
|
* formats, as we won't support mixed usage. */
|
|
|
|
if (keypairs) {
|
|
|
|
error_propagate(errp, local_err);
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* If the initial attempt to convert and process the options failed,
|
|
|
|
* we may be attempting to open an image file that has the rbd options
|
|
|
|
* specified in the older format consisting of all key/value pairs
|
|
|
|
* encoded in the filename. Go ahead and attempt to parse the
|
|
|
|
* filename, and see if we can pull out the required options. */
|
|
|
|
r = qemu_rbd_attempt_legacy_options(options, &opts, &keypairs);
|
|
|
|
if (r < 0) {
|
|
|
|
/* Propagate the original error, not the legacy parsing fallback
|
|
|
|
* error, as the latter was just a best-effort attempt. */
|
|
|
|
error_propagate(errp, local_err);
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
/* Take care whenever deciding to actually deprecate; once this ability
|
|
|
|
* is removed, we will not be able to open any images with legacy-styled
|
|
|
|
* backing image strings. */
|
2018-10-17 10:26:27 +02:00
|
|
|
warn_report("RBD options encoded in the filename as keyvalue pairs "
|
|
|
|
"is deprecated");
|
2018-02-15 20:58:24 +01:00
|
|
|
}
|
|
|
|
|
2018-04-04 17:40:45 +02:00
|
|
|
/* Remove the processed options from the QDict (the visitor processes
|
|
|
|
* _all_ options in the QDict) */
|
|
|
|
while ((e = qdict_first(options))) {
|
|
|
|
qdict_del(options, e->key);
|
|
|
|
}
|
|
|
|
|
2018-02-16 18:54:52 +01:00
|
|
|
r = qemu_rbd_connect(&s->cluster, &s->io_ctx, opts,
|
|
|
|
!(flags & BDRV_O_NOCACHE), keypairs, secretid, errp);
|
2018-02-15 19:13:47 +01:00
|
|
|
if (r < 0) {
|
2018-02-15 20:31:04 +01:00
|
|
|
goto out;
|
2018-02-15 19:13:47 +01:00
|
|
|
}
|
|
|
|
|
2018-02-16 18:54:52 +01:00
|
|
|
s->snap = g_strdup(opts->snapshot);
|
|
|
|
s->image_name = g_strdup(opts->image);
|
|
|
|
|
block: do not set BDS read_only if copy_on_read enabled
A few block drivers will set the BDS read_only flag from their
.bdrv_open() function. This means the bs->read_only flag could
be set after we enable copy_on_read, as the BDRV_O_COPY_ON_READ
flag check occurs prior to the call to bdrv->bdrv_open().
This adds an error return to bdrv_set_read_only(), and an error will be
return if we try to set the BDS to read_only while copy_on_read is
enabled.
This patch also changes the behavior of vvfat. Before, vvfat could
override the drive 'readonly' flag with its own, internal 'rw' flag.
For instance, this -drive parameter would result in a writable image:
"-drive format=vvfat,dir=/tmp/vvfat,rw,if=virtio,readonly=on"
This is not correct. Now, attempting to use the above -drive parameter
will result in an error (i.e., 'rw' is incompatible with 'readonly=on').
Signed-off-by: Jeff Cody <jcody@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Message-id: 0c5b4c1cc2c651471b131f21376dfd5ea24d2196.1491597120.git.jcody@redhat.com
2017-04-07 22:55:26 +02:00
|
|
|
/* rbd_open is always r/w */
|
2017-04-07 22:55:31 +02:00
|
|
|
r = rbd_open(s->io_ctx, s->image_name, &s->image, s->snap);
|
2010-12-06 20:53:01 +01:00
|
|
|
if (r < 0) {
|
2017-04-07 22:55:31 +02:00
|
|
|
error_setg_errno(errp, -r, "error reading header from %s",
|
|
|
|
s->image_name);
|
2011-09-07 18:28:06 +02:00
|
|
|
goto failed_open;
|
2010-12-06 20:53:01 +01:00
|
|
|
}
|
|
|
|
|
2021-06-27 13:46:35 +02:00
|
|
|
if (opts->has_encrypt) {
|
|
|
|
#ifdef LIBRBD_SUPPORTS_ENCRYPTION
|
|
|
|
r = qemu_rbd_encryption_load(s->image, opts->encrypt, errp);
|
|
|
|
if (r < 0) {
|
|
|
|
goto failed_post_open;
|
|
|
|
}
|
|
|
|
#else
|
|
|
|
r = -ENOTSUP;
|
|
|
|
error_setg(errp, "RBD library does not support image encryption");
|
|
|
|
goto failed_post_open;
|
|
|
|
#endif
|
|
|
|
}
|
|
|
|
|
2021-07-02 19:23:52 +02:00
|
|
|
r = rbd_stat(s->image, &info, sizeof(info));
|
2019-05-09 16:59:27 +02:00
|
|
|
if (r < 0) {
|
2021-07-02 19:23:52 +02:00
|
|
|
error_setg_errno(errp, -r, "error getting image info from %s",
|
2019-05-09 16:59:27 +02:00
|
|
|
s->image_name);
|
2021-06-27 13:46:35 +02:00
|
|
|
goto failed_post_open;
|
2019-05-09 16:59:27 +02:00
|
|
|
}
|
2021-07-02 19:23:52 +02:00
|
|
|
s->image_size = info.size;
|
|
|
|
s->object_size = info.obj_size;
|
2019-05-09 16:59:27 +02:00
|
|
|
|
2017-04-07 22:55:26 +02:00
|
|
|
/* If we are using an rbd snapshot, we must be r/o, otherwise
|
|
|
|
* leave as-is */
|
|
|
|
if (s->snap != NULL) {
|
2018-10-12 11:27:41 +02:00
|
|
|
r = bdrv_apply_auto_read_only(bs, "rbd snapshots are read-only", errp);
|
|
|
|
if (r < 0) {
|
2021-06-27 13:46:35 +02:00
|
|
|
goto failed_post_open;
|
2017-04-07 22:55:26 +02:00
|
|
|
}
|
|
|
|
}
|
2010-12-06 20:53:01 +01:00
|
|
|
|
block/rbd: add write zeroes support
This patch wittingly sets BDRV_REQ_NO_FALLBACK and silently ignores
BDRV_REQ_MAY_UNMAP for older librbd versions.
The rationale for this is as follows (citing Ilya Dryomov, the current RBD
maintainer):
---8<---
a) remove the BDRV_REQ_MAY_UNMAP check in qemu_rbd_co_pwrite_zeroes()
and as a consequence always unmap if librbd is too old
It's not clear what qemu's expectation is but in general Write
Zeroes is allowed to unmap. The only guarantee is that subsequent
reads return zeroes, everything else is a hint. This is how it is
specified in the kernel and in the NVMe spec.
In particular, block/nvme.c implements it as follows:
if (flags & BDRV_REQ_MAY_UNMAP) {
cdw12 |= (1 << 25);
}
This sets the Deallocate bit. But if it's not set, the device may
still deallocate:
"""
If the Deallocate bit (CDW12.DEAC) is set to '1' in a Write Zeroes
command, and the namespace supports clearing all bytes to 0h in the
values read (e.g., bits 2:0 in the DLFEAT field are set to 001b)
from a deallocated logical block and its metadata (excluding
protection information), then for each specified logical block, the
controller:
- should deallocate that logical block;
...
If the Deallocate bit is cleared to '0' in a Write Zeroes command,
and the namespace supports clearing all bytes to 0h in the values
read (e.g., bits 2:0 in the DLFEAT field are set to 001b) from
a deallocated logical block and its metadata (excluding protection
information), then, for each specified logical block, the
controller:
- may deallocate that logical block;
"""
https://nvmexpress.org/wp-content/uploads/NVM-Express-NVM-Command-Set-Specification-2021.06.02-Ratified-1.pdf
b) set BDRV_REQ_NO_FALLBACK in supported_zero_flags
Again, it's not clear what qemu expects here, but without it we end
up in a ridiculous situation where specifying the "don't allow slow
fallback" switch immediately fails all efficient zeroing requests on
a device where Write Zeroes is always efficient:
$ qemu-io -c 'help write' | grep -- '-[zun]'
-n, -- with -z, don't allow slow fallback
-u, -- with -z, allow unmapping
-z, -- write zeroes using blk_co_pwrite_zeroes
$ qemu-io -f rbd -c 'write -z -u -n 0 1M' rbd:foo/bar
write failed: Operation not supported
--->8---
Signed-off-by: Peter Lieven <pl@kamp.de>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Message-Id: <20210702172356.11574-6-idryomov@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2021-07-02 19:23:55 +02:00
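The flag handling described in this commit message can be sketched in isolation. This is a minimal illustration, not the driver code: the SKETCH_* constants and sketch_zero_flags() are hypothetical stand-ins for BDRV_REQ_MAY_UNMAP, RBD_WRITE_ZEROES_FLAG_THICK_PROVISION and the logic in the write-zeroes path, and the have_thick_flag parameter models the compile-time #ifdef as a runtime value.

```c
#include <stdbool.h>

/* Hypothetical stand-ins; the real values live in QEMU and librbd headers. */
enum {
    SKETCH_REQ_MAY_UNMAP = 1 << 0,        /* role of BDRV_REQ_MAY_UNMAP */
    SKETCH_ZERO_THICK_PROVISION = 1 << 0  /* role of the librbd thick flag */
};

/*
 * Translate request flags into write-zeroes flags.
 * have_thick_flag models whether librbd provides the thick-provision flag.
 */
static int sketch_zero_flags(int req_flags, bool have_thick_flag)
{
    int zero_flags = 0;

    if (have_thick_flag && !(req_flags & SKETCH_REQ_MAY_UNMAP)) {
        /* The caller did not allow unmapping: request thick provisioning. */
        zero_flags = SKETCH_ZERO_THICK_PROVISION;
    }
    /* Without the flag the hint is dropped and the range may be unmapped. */
    return zero_flags;
}
```

With an older librbd (have_thick_flag false) the result is always 0, which matches the "silently ignores BDRV_REQ_MAY_UNMAP" behavior the message describes.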
|
|
|
#ifdef LIBRBD_SUPPORTS_WRITE_ZEROES
|
|
|
|
bs->supported_zero_flags = BDRV_REQ_MAY_UNMAP | BDRV_REQ_NO_FALLBACK;
|
|
|
|
#endif
|
|
|
|
|
2020-04-28 22:29:00 +02:00
|
|
|
/* When extending regular files, we get zeros from the OS */
|
|
|
|
bs->supported_truncate_flags = BDRV_REQ_ZERO_WRITE;
|
|
|
|
|
2018-02-15 20:31:04 +01:00
|
|
|
r = 0;
|
|
|
|
goto out;
|
2010-12-06 20:53:01 +01:00
|
|
|
|
2021-06-27 13:46:35 +02:00
|
|
|
failed_post_open:
|
|
|
|
rbd_close(s->image);
|
2011-09-07 18:28:06 +02:00
|
|
|
failed_open:
|
2011-05-27 01:07:31 +02:00
|
|
|
rados_ioctx_destroy(s->io_ctx);
|
2011-09-07 18:28:06 +02:00
|
|
|
g_free(s->snap);
|
2017-04-07 22:55:31 +02:00
|
|
|
g_free(s->image_name);
|
2018-02-15 19:13:47 +01:00
|
|
|
rados_shutdown(s->cluster);
|
2018-02-15 20:31:04 +01:00
|
|
|
out:
|
2018-02-15 20:58:24 +01:00
|
|
|
qapi_free_BlockdevOptionsRbd(opts);
|
2018-02-15 20:31:04 +01:00
|
|
|
g_free(keypairs);
|
|
|
|
g_free(secretid);
|
2010-12-06 20:53:01 +01:00
|
|
|
return r;
|
|
|
|
}
|
|
|
|
|
2017-04-07 22:55:32 +02:00
|
|
|
|
|
|
|
/* Since RBD is currently always opened R/W via the API,
|
|
|
|
* we just need to check if we are using a snapshot or not, in
|
|
|
|
* order to determine if we will allow it to be R/W */
|
|
|
|
static int qemu_rbd_reopen_prepare(BDRVReopenState *state,
|
|
|
|
BlockReopenQueue *queue, Error **errp)
|
|
|
|
{
|
|
|
|
BDRVRBDState *s = state->bs->opaque;
|
|
|
|
int ret = 0;
|
|
|
|
|
|
|
|
if (s->snap && state->flags & BDRV_O_RDWR) {
|
|
|
|
error_setg(errp,
|
|
|
|
"Cannot change node '%s' to r/w when using RBD snapshot",
|
|
|
|
bdrv_get_device_or_node_name(state->bs));
|
|
|
|
ret = -EINVAL;
|
|
|
|
}
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2011-05-27 01:07:31 +02:00
|
|
|
static void qemu_rbd_close(BlockDriverState *bs)
|
2010-12-06 20:53:01 +01:00
|
|
|
{
|
|
|
|
BDRVRBDState *s = bs->opaque;
|
|
|
|
|
2011-05-27 01:07:31 +02:00
|
|
|
rbd_close(s->image);
|
|
|
|
rados_ioctx_destroy(s->io_ctx);
|
2011-08-21 05:09:37 +02:00
|
|
|
g_free(s->snap);
|
2017-04-07 22:55:31 +02:00
|
|
|
g_free(s->image_name);
|
2011-05-27 01:07:31 +02:00
|
|
|
rados_shutdown(s->cluster);
|
2010-12-06 20:53:01 +01:00
|
|
|
}
|
|
|
|
|
2019-05-09 16:59:27 +02:00
|
|
|
/* Resize the RBD image and update the 'image_size' with the current size */
|
|
|
|
static int qemu_rbd_resize(BlockDriverState *bs, uint64_t size)
|
|
|
|
{
|
|
|
|
BDRVRBDState *s = bs->opaque;
|
|
|
|
int r;
|
|
|
|
|
|
|
|
r = rbd_resize(s->image, size);
|
|
|
|
if (r < 0) {
|
|
|
|
return r;
|
|
|
|
}
|
|
|
|
|
|
|
|
s->image_size = size;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2021-07-02 19:23:54 +02:00
|
|
|
static void qemu_rbd_finish_bh(void *opaque)
|
2010-12-06 20:53:01 +01:00
|
|
|
{
|
2021-07-02 19:23:54 +02:00
|
|
|
RBDTask *task = opaque;
|
2021-07-07 20:04:48 +02:00
|
|
|
task->complete = true;
|
2021-07-02 19:23:54 +02:00
|
|
|
aio_co_wake(task->co);
|
2011-05-27 01:07:31 +02:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2021-07-02 19:23:54 +02:00
|
|
|
* This is the completion callback function for all rbd aio calls
|
|
|
|
* started from qemu_rbd_start_co().
|
2011-05-27 01:07:31 +02:00
|
|
|
*
|
|
|
|
* Note: this function is being called from a non qemu thread so
|
|
|
|
* we need to be careful about what we do here. Generally we only
|
2013-12-05 16:38:33 +01:00
|
|
|
* schedule a BH, and do the rest of the io completion handling
|
2021-07-02 19:23:54 +02:00
|
|
|
* from qemu_rbd_finish_bh() which runs in a qemu context.
|
2011-05-27 01:07:31 +02:00
|
|
|
*/
|
2021-07-02 19:23:54 +02:00
|
|
|
static void qemu_rbd_completion_cb(rbd_completion_t c, RBDTask *task)
|
2011-05-27 01:07:31 +02:00
|
|
|
{
|
2021-07-02 19:23:54 +02:00
|
|
|
task->ret = rbd_aio_get_return_value(c);
|
2011-05-27 01:07:31 +02:00
|
|
|
rbd_aio_release(c);
|
2021-07-02 19:23:54 +02:00
|
|
|
aio_bh_schedule_oneshot(bdrv_get_aio_context(task->bs),
|
|
|
|
qemu_rbd_finish_bh, task);
|
2010-12-06 20:53:01 +01:00
|
|
|
}
|
|
|
|
|
2021-07-02 19:23:54 +02:00
|
|
|
static int coroutine_fn qemu_rbd_start_co(BlockDriverState *bs,
|
|
|
|
uint64_t offset,
|
|
|
|
uint64_t bytes,
|
|
|
|
QEMUIOVector *qiov,
|
|
|
|
int flags,
|
|
|
|
RBDAIOCmd cmd)
|
2010-12-06 20:53:01 +01:00
|
|
|
{
|
2021-07-02 19:23:54 +02:00
|
|
|
BDRVRBDState *s = bs->opaque;
|
|
|
|
RBDTask task = { .bs = bs, .co = qemu_coroutine_self() };
|
2011-05-27 01:07:31 +02:00
|
|
|
rbd_completion_t c;
|
2011-05-27 01:07:33 +02:00
|
|
|
int r;
|
2010-12-06 20:53:01 +01:00
|
|
|
|
2021-07-02 19:23:54 +02:00
|
|
|
assert(!qiov || qiov->size == bytes);
|
2017-02-21 07:50:03 +01:00
|
|
|
|
2022-03-17 17:26:38 +01:00
|
|
|
if (cmd == RBD_AIO_WRITE || cmd == RBD_AIO_WRITE_ZEROES) {
|
|
|
|
/*
|
|
|
|
* RBD APIs don't allow us to write more than actual size, so in order
|
|
|
|
* to support growing images, we resize the image before write
|
|
|
|
* operations that exceed the current size.
|
|
|
|
*/
|
|
|
|
if (offset + bytes > s->image_size) {
|
|
|
|
int r = qemu_rbd_resize(bs, offset + bytes);
|
|
|
|
if (r < 0) {
|
|
|
|
return r;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2021-07-02 19:23:54 +02:00
|
|
|
r = rbd_aio_create_completion(&task,
|
|
|
|
(rbd_callback_t) qemu_rbd_completion_cb, &c);
|
2011-05-27 01:07:33 +02:00
|
|
|
if (r < 0) {
|
2021-07-02 19:23:54 +02:00
|
|
|
return r;
|
2011-05-27 01:07:33 +02:00
|
|
|
}
|
2010-12-06 20:53:01 +01:00
|
|
|
|
2012-05-01 08:16:45 +02:00
|
|
|
switch (cmd) {
|
|
|
|
case RBD_AIO_READ:
|
2021-07-02 19:23:54 +02:00
|
|
|
r = rbd_aio_readv(s->image, qiov->iov, qiov->niov, offset, c);
|
|
|
|
break;
|
|
|
|
case RBD_AIO_WRITE:
|
|
|
|
r = rbd_aio_writev(s->image, qiov->iov, qiov->niov, offset, c);
|
2012-05-01 08:16:45 +02:00
|
|
|
break;
|
|
|
|
case RBD_AIO_DISCARD:
|
2021-07-02 19:23:54 +02:00
|
|
|
r = rbd_aio_discard(s->image, offset, bytes, c);
|
2012-05-01 08:16:45 +02:00
|
|
|
break;
|
2013-03-29 21:03:23 +01:00
|
|
|
case RBD_AIO_FLUSH:
|
2021-07-02 19:23:51 +02:00
|
|
|
r = rbd_aio_flush(s->image, c);
|
2013-03-29 21:03:23 +01:00
|
|
|
break;
|
2021-07-02 19:23:55 +02:00
|
|
|
#ifdef LIBRBD_SUPPORTS_WRITE_ZEROES
|
|
|
|
case RBD_AIO_WRITE_ZEROES: {
|
|
|
|
int zero_flags = 0;
|
|
|
|
#ifdef RBD_WRITE_ZEROES_FLAG_THICK_PROVISION
|
|
|
|
if (!(flags & BDRV_REQ_MAY_UNMAP)) {
|
|
|
|
zero_flags = RBD_WRITE_ZEROES_FLAG_THICK_PROVISION;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
r = rbd_aio_write_zeroes(s->image, offset, bytes, c, zero_flags, 0);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
#endif
|
2012-05-01 08:16:45 +02:00
|
|
|
default:
|
|
|
|
r = -EINVAL;
|
2011-05-27 01:07:33 +02:00
|
|
|
}
|
|
|
|
|
|
|
|
if (r < 0) {
|
2021-07-02 19:23:54 +02:00
|
|
|
error_report("rbd request failed early: cmd %d offset %" PRIu64
|
|
|
|
" bytes %" PRIu64 " flags %d r %d (%s)", cmd, offset,
|
|
|
|
bytes, flags, r, strerror(-r));
|
|
|
|
rbd_aio_release(c);
|
|
|
|
return r;
|
2010-12-06 20:53:01 +01:00
|
|
|
}
|
2011-05-27 01:07:33 +02:00
|
|
|
|
2021-07-02 19:23:54 +02:00
|
|
|
while (!task.complete) {
|
|
|
|
qemu_coroutine_yield();
|
|
|
|
}
|
2017-02-21 07:50:03 +01:00
|
|
|
|
2021-07-02 19:23:54 +02:00
|
|
|
if (task.ret < 0) {
|
|
|
|
error_report("rbd request failed: cmd %d offset %" PRIu64 " bytes %"
|
|
|
|
PRIu64 " flags %d task.ret %" PRIi64 " (%s)", cmd, offset,
|
|
|
|
bytes, flags, task.ret, strerror(-task.ret));
|
|
|
|
return task.ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* zero pad short reads */
|
|
|
|
if (cmd == RBD_AIO_READ && task.ret < qiov->size) {
|
|
|
|
qemu_iovec_memset(qiov, task.ret, 0, qiov->size - task.ret);
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
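The "zero pad short reads" step at the end of qemu_rbd_start_co() can be shown on a flat buffer. This is only a sketch under that simplification: the driver works on a QEMUIOVector via qemu_iovec_memset(), while sketch_pad_short_read() below is a hypothetical helper that uses plain memset().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: zero the tail of buf that a short read left unfilled.
 * len is the number of bytes requested, got the number actually read.
 * Returns the number of bytes that were zeroed.
 */
static size_t sketch_pad_short_read(unsigned char *buf, size_t len, size_t got)
{
    assert(got <= len);
    if (got < len) {
        /* Bytes [got, len) were never written by the read: clear them. */
        memset(buf + got, 0, len - got);
    }
    return len - got;
}
```

Padding rather than failing lets the caller treat the part of the request that the backend did not return as zeroes.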
|
|
|
|
|
|
|
|
static int
|
block: use int64_t instead of uint64_t in driver read handlers
We are generally moving to int64_t for both offset and bytes parameters
on all io paths.
Main motivation is realization of 64-bit write_zeroes operation for
fast zeroing large disk chunks, up to the whole disk.
We chose signed type, to be consistent with off_t (which is signed) and
with possibility for signed return type (where negative value means
error).
So, convert driver read handlers parameters which are already 64bit to
signed type.
While being here, convert also flags parameter to be BdrvRequestFlags.
Now let's consider all callers. Simple
git grep '\->bdrv_\(aio\|co\)_preadv\(_part\)\?'
shows that there are three callers of the driver function:
bdrv_driver_preadv() in block/io.c, passes int64_t, checked by
bdrv_check_qiov_request() to be non-negative.
qcow2_load_vmstate() does bdrv_check_qiov_request().
do_perform_cow_read() has uint64_t argument. And a lot of things in
qcow2 driver are uint64_t, so converting it is big job. But we must
not work with requests that don't satisfy bdrv_check_qiov_request(),
so let's just assert it here.
Still, the functions may be called directly, not only by drv->...
Let's check:
git grep '\.bdrv_\(aio\|co\)_preadv\(_part\)\?\s*=' | \
awk '{print $4}' | sed 's/,//' | sed 's/&//' | sort | uniq | \
while read func; do git grep "$func(" | \
grep -v "$func(BlockDriverState"; done
The only one such caller:
QEMUIOVector qiov = QEMU_IOVEC_INIT_BUF(qiov, &data, 1);
...
ret = bdrv_replace_test_co_preadv(bs, 0, 1, &qiov, 0);
in tests/unit/test-bdrv-drain.c, and it's OK obviously.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210903102807.27127-4-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
[eblake: fix typos]
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-09-03 12:27:59 +02:00
|
|
|
coroutine_fn qemu_rbd_co_preadv(BlockDriverState *bs, int64_t offset,
|
|
|
|
int64_t bytes, QEMUIOVector *qiov,
|
|
|
|
BdrvRequestFlags flags)
|
2021-07-02 19:23:54 +02:00
|
|
|
{
|
|
|
|
return qemu_rbd_start_co(bs, offset, bytes, qiov, flags, RBD_AIO_READ);
|
2010-12-06 20:53:01 +01:00
|
|
|
}
|
|
|
|
|
2021-07-02 19:23:54 +02:00
|
|
|
static int
|
block: use int64_t instead of uint64_t in driver write handlers
We are generally moving to int64_t for both offset and bytes parameters
on all io paths.
Main motivation is realization of 64-bit write_zeroes operation for
fast zeroing large disk chunks, up to the whole disk.
We chose signed type, to be consistent with off_t (which is signed) and
with possibility for signed return type (where negative value means
error).
So, convert driver write handlers parameters which are already 64bit to
signed type.
While being here, convert also flags parameter to be BdrvRequestFlags.
Now let's consider all callers. Simple
git grep '\->bdrv_\(aio\|co\)_pwritev\(_part\)\?'
shows that there are three callers of the driver function:
bdrv_driver_pwritev() and bdrv_driver_pwritev_compressed() in
block/io.c, both pass int64_t, checked by bdrv_check_qiov_request() to
be non-negative.
qcow2_save_vmstate() does bdrv_check_qiov_request().
Still, the functions may be called directly, not only by drv->...
Let's check:
git grep '\.bdrv_\(aio\|co\)_pwritev\(_part\)\?\s*=' | \
awk '{print $4}' | sed 's/,//' | sed 's/&//' | sort | uniq | \
while read func; do git grep "$func(" | \
grep -v "$func(BlockDriverState"; done
shows several callers:
qcow2:
qcow2_co_truncate() write at most up to @offset, which is checked in
generic qcow2_co_truncate() by bdrv_check_request().
qcow2_co_pwritev_compressed_task() pass the request (or part of the
request) that already went through normal write path, so it should
be OK
qcow:
qcow_co_pwritev_compressed() pass int64_t, it's updated by this patch
quorum:
quorum_co_pwrite_zeroes() pass int64_t and int - OK
throttle:
throttle_co_pwritev_compressed() pass int64_t, it's updated by this
patch
vmdk:
vmdk_co_pwritev_compressed() pass int64_t, it's updated by this
patch
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210903102807.27127-5-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-09-03 12:28:00 +02:00
|
|
|
coroutine_fn qemu_rbd_co_pwritev(BlockDriverState *bs, int64_t offset,
|
|
|
|
int64_t bytes, QEMUIOVector *qiov,
|
|
|
|
BdrvRequestFlags flags)
|
2010-12-06 20:53:01 +01:00
|
|
|
{
|
2021-07-02 19:23:54 +02:00
|
|
|
return qemu_rbd_start_co(bs, offset, bytes, qiov, flags, RBD_AIO_WRITE);
|
2010-12-06 20:53:01 +01:00
|
|
|
}
|
|
|
|
|
2021-07-02 19:23:54 +02:00
|
|
|
static int coroutine_fn qemu_rbd_co_flush(BlockDriverState *bs)
|
2010-12-06 20:53:01 +01:00
|
|
|
{
|
2021-07-02 19:23:54 +02:00
|
|
|
return qemu_rbd_start_co(bs, 0, 0, NULL, 0, RBD_AIO_FLUSH);
|
2010-12-06 20:53:01 +01:00
|
|
|
}
|
|
|
|
|
2021-07-02 19:23:54 +02:00
|
|
|
static int coroutine_fn qemu_rbd_co_pdiscard(BlockDriverState *bs,
|
block: use int64_t instead of int in driver discard handlers
We are generally moving to int64_t for both offset and bytes parameters
on all io paths.
Main motivation is realization of 64-bit write_zeroes operation for
fast zeroing large disk chunks, up to the whole disk.
We chose signed type, to be consistent with off_t (which is signed) and
with possibility for signed return type (where negative value means
error).
So, convert driver discard handlers bytes parameter to int64_t.
The only caller of all updated function is bdrv_co_pdiscard in
block/io.c. It is already prepared to work with 64bit requests, but
pass at most max(bs->bl.max_pdiscard, INT_MAX) to the driver.
Let's look at all updated functions:
blkdebug: all calculations are still OK, thanks to
bdrv_check_qiov_request().
both rule_check and bdrv_co_pdiscard are 64bit
blklogwrites: pass to blk_log_writes_co_log which is 64bit
blkreplay, copy-on-read, filter-compress: pass to bdrv_co_pdiscard, OK
copy-before-write: pass to bdrv_co_pdiscard which is 64bit and to
cbw_do_copy_before_write which is 64bit
file-posix: one handler calls raw_account_discard(), which is 64bit, and
both handlers call raw_do_pdiscard(). Update raw_do_pdiscard(), which passes
to RawPosixAIOData::aio_nbytes, which is 64bit (and calls
raw_account_discard())
gluster: somehow, third argument of glfs_discard_async is size_t.
Let's set max_pdiscard accordingly.
iscsi: iscsi_allocmap_set_invalid is 64bit,
!is_byte_request_lun_aligned is 64bit.
list.num is uint32_t. Let's clarify max_pdiscard and
pdiscard_alignment.
mirror_top: pass to bdrv_mirror_top_do_write() which is
64bit
nbd: protocol limitation. max_pdiscard is already set strict enough,
keep it as is for now.
nvme: buf.nlb is uint32_t and we do shift. So, add corresponding limits
to nvme_refresh_limits().
preallocate: pass to bdrv_co_pdiscard() which is 64bit.
rbd: pass to qemu_rbd_start_co() which is 64bit.
qcow2: calculations are still OK, thanks to bdrv_check_qiov_request(),
qcow2_cluster_discard() is 64bit.
raw-format: raw_adjust_offset() is 64bit, bdrv_co_pdiscard too.
throttle: pass to bdrv_co_pdiscard() which is 64bit and to
throttle_group_co_io_limits_intercept() which is 64bit as well.
test-block-iothread: bytes argument is unused
Great! Now all drivers are prepared to handle 64bit discard requests,
or else have explicit max_pdiscard limits.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210903102807.27127-11-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-09-03 12:28:06 +02:00
|
|
|
int64_t offset, int64_t bytes)
|
2013-03-29 21:03:23 +01:00
|
|
|
{
|
2021-09-03 12:28:06 +02:00
|
|
|
return qemu_rbd_start_co(bs, offset, bytes, NULL, 0, RBD_AIO_DISCARD);
|
2013-03-29 21:03:23 +01:00
|
|
|
}
|
|
|
|
|
2021-07-02 19:23:55 +02:00
|
|
|
#ifdef LIBRBD_SUPPORTS_WRITE_ZEROES
|
|
|
|
static int
|
|
|
|
coroutine_fn qemu_rbd_co_pwrite_zeroes(BlockDriverState *bs, int64_t offset,
|
block: use int64_t instead of int in driver write_zeroes handlers
We are generally moving to int64_t for both offset and bytes parameters
on all io paths.
Main motivation is realization of 64-bit write_zeroes operation for
fast zeroing large disk chunks, up to the whole disk.
We chose signed type, to be consistent with off_t (which is signed) and
with possibility for signed return type (where negative value means
error).
So, convert driver write_zeroes handlers bytes parameter to int64_t.
The only caller of all updated function is bdrv_co_do_pwrite_zeroes().
bdrv_co_do_pwrite_zeroes() itself is of course OK with widening of
callee parameter type. Also, bdrv_co_do_pwrite_zeroes()'s
max_write_zeroes is limited to INT_MAX. So, updated functions all are
safe, they will not get "bytes" larger than before.
Still, let's look through all updated functions, and add assertions to
the ones which are actually unprepared to values larger than INT_MAX.
For these drivers also set explicit max_pwrite_zeroes limit.
Let's go:
blkdebug: calculations can't overflow, thanks to
bdrv_check_qiov_request() in generic layer. rule_check() and
bdrv_co_pwrite_zeroes() both have 64bit argument.
blklogwrites: pass to blk_log_writes_co_log() with 64bit argument.
blkreplay, copy-on-read, filter-compress: pass to
bdrv_co_pwrite_zeroes() which is OK
copy-before-write: Calls cbw_do_copy_before_write() and
bdrv_co_pwrite_zeroes, both have 64bit argument.
file-posix: both handlers call raw_do_pwrite_zeroes, which is updated.
In raw_do_pwrite_zeroes() calculations are OK due to
bdrv_check_qiov_request(), bytes go to RawPosixAIOData::aio_nbytes
which is uint64_t.
Check also where that uint64_t gets handed:
handle_aiocb_write_zeroes_block() passes a uint64_t[2] to
ioctl(BLKZEROOUT), handle_aiocb_write_zeroes() calls do_fallocate()
which takes off_t (and we compile to always have 64-bit off_t), as
does handle_aiocb_write_zeroes_unmap. All look safe.
gluster: bytes go to GlusterAIOCB::size which is int64_t and to
glfs_zerofill_async works with off_t.
iscsi: Aha, here we deal with iscsi_writesame16_task() that has
uint32_t num_blocks argument and iscsi_writesame10_task() has
uint16_t argument. Make comments, add assertions and clarify
max_pwrite_zeroes calculation.
iscsi_allocmap_() functions already have int64_t argument
is_byte_request_lun_aligned is simple to update, do it.
mirror_top: pass to bdrv_mirror_top_do_write which has uint64_t
argument
nbd: Aha, here we have protocol limitation, and NBDRequest::len is
uint32_t. max_pwrite_zeroes is cleanly set to 32bit value, so we are
OK for now.
nvme: Again, protocol limitation. And no inherent limit for
write-zeroes at all. But from code that calculates cdw12 it's obvious
that we do have limit and alignment. Let's clarify it. Also,
obviously the code is not prepared to handle bytes=0. Let's handle
this case too.
trace events already 64bit
preallocate: pass to handle_write() and bdrv_co_pwrite_zeroes(), both
64bit.
rbd: pass to qemu_rbd_start_co() which is 64bit.
qcow2: offset + bytes and alignment still works good (thanks to
bdrv_check_qiov_request()), so tail calculation is OK
qcow2_subcluster_zeroize() has 64bit argument, should be OK
trace events updated
qed: qed_co_request wants int nb_sectors. Also in code we have size_t
used for request length which may be 32bit. So, let's just keep
INT_MAX as a limit (aligning it down to pwrite_zeroes_alignment) and
don't care.
raw-format: Is OK. raw_adjust_offset and bdrv_co_pwrite_zeroes are both
64bit.
throttle: Both throttle_group_co_io_limits_intercept() and
bdrv_co_pwrite_zeroes() are 64bit.
vmdk: pass to vmdk_pwritev which is 64bit
quorum: pass to quorum_co_pwritev() which is 64bit
Hooray!
At this point all block drivers are prepared to support 64bit
write-zero requests, or have explicitly set max_pwrite_zeroes.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210903102807.27127-8-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
[eblake: use <= rather than < in assertions relying on max_pwrite_zeroes]
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-09-03 12:28:03 +02:00
|
|
|
int64_t bytes, BdrvRequestFlags flags)
|
block/rbd: add write zeroes support
This patch wittingly sets BDRV_REQ_NO_FALLBACK and silently ignores
BDRV_REQ_MAY_UNMAP for older librbd versions.
The rationale for this is as follows (citing Ilya Dryomov current RBD
maintainer):
---8<---
a) remove the BDRV_REQ_MAY_UNMAP check in qemu_rbd_co_pwrite_zeroes()
and as a consequence always unmap if librbd is too old
It's not clear what qemu's expectation is but in general Write
Zeroes is allowed to unmap. The only guarantee is that subsequent
reads return zeroes, everything else is a hint. This is how it is
specified in the kernel and in the NVMe spec.
In particular, block/nvme.c implements it as follows:
if (flags & BDRV_REQ_MAY_UNMAP) {
cdw12 |= (1 << 25);
}
This sets the Deallocate bit. But if it's not set, the device may
still deallocate:
"""
If the Deallocate bit (CDW12.DEAC) is set to '1' in a Write Zeroes
command, and the namespace supports clearing all bytes to 0h in the
values read (e.g., bits 2:0 in the DLFEAT field are set to 001b)
from a deallocated logical block and its metadata (excluding
protection information), then for each specified logical block, the
controller:
- should deallocate that logical block;
...
If the Deallocate bit is cleared to '0' in a Write Zeroes command,
and the namespace supports clearing all bytes to 0h in the values
read (e.g., bits 2:0 in the DLFEAT field are set to 001b) from
a deallocated logical block and its metadata (excluding protection
information), then, for each specified logical block, the
controller:
- may deallocate that logical block;
"""
https://nvmexpress.org/wp-content/uploads/NVM-Express-NVM-Command-Set-Specification-2021.06.02-Ratified-1.pdf
b) set BDRV_REQ_NO_FALLBACK in supported_zero_flags
Again, it's not clear what qemu expects here, but without it we end
up in a ridiculous situation where specifying the "don't allow slow
fallback" switch immediately fails all efficient zeroing requests on
a device where Write Zeroes is always efficient:
$ qemu-io -c 'help write' | grep -- '-[zun]'
-n, -- with -z, don't allow slow fallback
-u, -- with -z, allow unmapping
-z, -- write zeroes using blk_co_pwrite_zeroes
$ qemu-io -f rbd -c 'write -z -u -n 0 1M' rbd:foo/bar
write failed: Operation not supported
--->8---
Signed-off-by: Peter Lieven <pl@kamp.de>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Message-Id: <20210702172356.11574-6-idryomov@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
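The point made in (a) is that the unmap flag only adds a hint bit to the command. A minimal sketch of that idea follows, assuming a hypothetical helper sketch_write_zeroes_cdw12() and the 16-bit, zero-based block count field of the NVMe Write Zeroes command; only the bit 25 handling is taken from the snippet quoted above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Deallocate hint, bit 25 of CDW12, as in the quoted block/nvme.c snippet. */
#define SKETCH_DEAC (UINT32_C(1) << 25)

/*
 * Hypothetical helper: build CDW12 for a Write Zeroes command.
 * nlb_minus_one is the zero-based number of logical blocks (assumed to
 * occupy bits 15:0); may_unmap mirrors BDRV_REQ_MAY_UNMAP.
 */
static uint32_t sketch_write_zeroes_cdw12(uint32_t nlb_minus_one, bool may_unmap)
{
    uint32_t cdw12 = nlb_minus_one & 0xFFFF;

    if (may_unmap) {
        cdw12 |= SKETCH_DEAC; /* a hint only: the device may unmap anyway */
    }
    return cdw12;
}
```

Either way the caller can only rely on subsequent reads returning zeroes, which is why ignoring the flag is argued to be acceptable.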
2021-07-02 19:23:55 +02:00
|
|
|
{
|
block: use int64_t instead of int in driver write_zeroes handlers
We are generally moving to int64_t for both offset and bytes parameters
on all io paths.
Main motivation is realization of 64-bit write_zeroes operation for
fast zeroing large disk chunks, up to the whole disk.
We chose signed type, to be consistent with off_t (which is signed) and
with possibility for signed return type (where negative value means
error).
So, convert driver write_zeroes handlers bytes parameter to int64_t.
The only caller of all updated function is bdrv_co_do_pwrite_zeroes().
bdrv_co_do_pwrite_zeroes() itself is of course OK with widening of
callee parameter type. Also, bdrv_co_do_pwrite_zeroes()'s
max_write_zeroes is limited to INT_MAX. So, updated functions all are
safe, they will not get "bytes" larger than before.
Still, let's look through all updated functions, and add assertions to
the ones which are actually unprepared to values larger than INT_MAX.
For these drivers also set explicit max_pwrite_zeroes limit.
Let's go:
blkdebug: calculations can't overflow, thanks to
bdrv_check_qiov_request() in generic layer. rule_check() and
bdrv_co_pwrite_zeroes() both have 64bit argument.
blklogwrites: pass to blk_log_writes_co_log() with 64bit argument.
blkreplay, copy-on-read, filter-compress: pass to
bdrv_co_pwrite_zeroes() which is OK
copy-before-write: Calls cbw_do_copy_before_write() and
bdrv_co_pwrite_zeroes, both have 64bit argument.
file-posix: both handler calls raw_do_pwrite_zeroes, which is updated.
In raw_do_pwrite_zeroes() calculations are OK due to
bdrv_check_qiov_request(), bytes go to RawPosixAIOData::aio_nbytes
which is uint64_t.
Check also where that uint64_t gets handed:
handle_aiocb_write_zeroes_block() passes a uint64_t[2] to
ioctl(BLKZEROOUT), handle_aiocb_write_zeroes() calls do_fallocate()
which takes off_t (and we compile to always have 64-bit off_t), as
does handle_aiocb_write_zeroes_unmap. All look safe.
gluster: bytes go to GlusterAIOCB::size which is int64_t and to
glfs_zerofill_async works with off_t.
iscsi: Aha, here we deal with iscsi_writesame16_task() that has
uint32_t num_blocks argument and iscsi_writesame10_task() that has
uint16_t argument. Make comments, add assertions and clarify
max_pwrite_zeroes calculation.
iscsi_allocmap_() functions already have int64_t argument
is_byte_request_lun_aligned is simple to update, do it.
mirror_top: pass to bdrv_mirror_top_do_write which has uint64_t
argument
nbd: Aha, here we have protocol limitation, and NBDRequest::len is
uint32_t. max_pwrite_zeroes is cleanly set to 32bit value, so we are
OK for now.
nvme: Again, protocol limitation. And no inherent limit for
write-zeroes at all. But from code that calculates cdw12 it's obvious
that we do have limit and alignment. Let's clarify it. Also,
obviously the code is not prepared to handle bytes=0. Let's handle
this case too.
trace events already 64bit
preallocate: pass to handle_write() and bdrv_co_pwrite_zeroes(), both
64bit.
rbd: pass to qemu_rbd_start_co() which is 64bit.
qcow2: offset + bytes and alignment still works good (thanks to
bdrv_check_qiov_request()), so tail calculation is OK
qcow2_subcluster_zeroize() has 64bit argument, should be OK
trace events updated
qed: qed_co_request wants int nb_sectors. Also in code we have size_t
used for request length which may be 32bit. So, let's just keep
INT_MAX as a limit (aligning it down to pwrite_zeroes_alignment) and
don't care.
raw-format: Is OK. raw_adjust_offset and bdrv_co_pwrite_zeroes are both
64bit.
throttle: Both throttle_group_co_io_limits_intercept() and
bdrv_co_pwrite_zeroes() are 64bit.
vmdk: pass to vmdk_pwritev which is 64bit
quorum: pass to quorum_co_pwritev() which is 64bit
Hooray!
At this point all block drivers are prepared to support 64bit
write-zero requests, or have explicitly set max_pwrite_zeroes.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210903102807.27127-8-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
[eblake: use <= rather than < in assertions relying on max_pwrite_zeroes]
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-09-03 12:28:03 +02:00
|
|
|
return qemu_rbd_start_co(bs, offset, bytes, NULL, flags,
|
2021-07-02 19:23:55 +02:00
|
|
|
RBD_AIO_WRITE_ZEROES);
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2011-05-27 01:07:31 +02:00
|
|
|
static int qemu_rbd_getinfo(BlockDriverState *bs, BlockDriverInfo *bdi)
|
2010-12-06 20:53:01 +01:00
|
|
|
{
|
|
|
|
BDRVRBDState *s = bs->opaque;
|
2021-07-02 19:23:52 +02:00
|
|
|
bdi->cluster_size = s->object_size;
|
2010-12-06 20:53:01 +01:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2021-06-27 13:46:35 +02:00
|
|
|
static ImageInfoSpecific *qemu_rbd_get_specific_info(BlockDriverState *bs,
|
|
|
|
Error **errp)
|
|
|
|
{
|
|
|
|
BDRVRBDState *s = bs->opaque;
|
|
|
|
ImageInfoSpecific *spec_info;
|
|
|
|
char buf[RBD_ENCRYPTION_LUKS_HEADER_VERIFICATION_LEN] = {0};
|
|
|
|
int r;
|
|
|
|
|
|
|
|
if (s->image_size >= RBD_ENCRYPTION_LUKS_HEADER_VERIFICATION_LEN) {
|
|
|
|
r = rbd_read(s->image, 0,
|
|
|
|
RBD_ENCRYPTION_LUKS_HEADER_VERIFICATION_LEN, buf);
|
|
|
|
if (r < 0) {
|
|
|
|
error_setg_errno(errp, -r, "cannot read image start for probe");
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
spec_info = g_new(ImageInfoSpecific, 1);
|
|
|
|
*spec_info = (ImageInfoSpecific){
|
|
|
|
.type = IMAGE_INFO_SPECIFIC_KIND_RBD,
|
|
|
|
.u.rbd.data = g_new0(ImageInfoSpecificRbd, 1),
|
|
|
|
};
|
|
|
|
|
|
|
|
if (memcmp(buf, rbd_luks_header_verification,
|
|
|
|
RBD_ENCRYPTION_LUKS_HEADER_VERIFICATION_LEN) == 0) {
|
|
|
|
spec_info->u.rbd.data->encryption_format =
|
|
|
|
RBD_IMAGE_ENCRYPTION_FORMAT_LUKS;
|
|
|
|
spec_info->u.rbd.data->has_encryption_format = true;
|
|
|
|
} else if (memcmp(buf, rbd_luks2_header_verification,
|
|
|
|
RBD_ENCRYPTION_LUKS_HEADER_VERIFICATION_LEN) == 0) {
|
|
|
|
spec_info->u.rbd.data->encryption_format =
|
|
|
|
RBD_IMAGE_ENCRYPTION_FORMAT_LUKS2;
|
|
|
|
spec_info->u.rbd.data->has_encryption_format = true;
|
|
|
|
} else {
|
|
|
|
spec_info->u.rbd.data->has_encryption_format = false;
|
|
|
|
}
|
|
|
|
|
|
|
|
return spec_info;
|
|
|
|
}
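The probe above compares the first bytes of the image against two fixed signatures. A self-contained sketch of that comparison follows. The signature bytes are an assumption (the standard LUKS magic followed by a big-endian version number), since the driver's own constants are defined outside this section, and all sketch_ names are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define SKETCH_MAGIC_LEN 8

typedef enum {
    SKETCH_FORMAT_NONE,
    SKETCH_FORMAT_LUKS,
    SKETCH_FORMAT_LUKS2,
} SketchFormat;

/* Assumed values: "LUKS" 0xBA 0xBE, then the version as two bytes. */
static const unsigned char sketch_luks_magic[SKETCH_MAGIC_LEN] = {
    'L', 'U', 'K', 'S', 0xBA, 0xBE, 0, 1
};
static const unsigned char sketch_luks2_magic[SKETCH_MAGIC_LEN] = {
    'L', 'U', 'K', 'S', 0xBA, 0xBE, 0, 2
};

/* buf must hold at least SKETCH_MAGIC_LEN bytes read from offset 0. */
static SketchFormat sketch_probe(const unsigned char *buf)
{
    if (memcmp(buf, sketch_luks_magic, SKETCH_MAGIC_LEN) == 0) {
        return SKETCH_FORMAT_LUKS;
    }
    if (memcmp(buf, sketch_luks2_magic, SKETCH_MAGIC_LEN) == 0) {
        return SKETCH_FORMAT_LUKS2;
    }
    return SKETCH_FORMAT_NONE;
}
```

A zero-filled buffer, which is what the function above passes when the image is shorter than the signature, matches neither signature and reports no encryption format.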
|
|
|
|
|
2021-10-12 17:22:31 +02:00
|
|
|
/*
|
|
|
|
* rbd_diff_iterate2 allows interrupting the execution by returning a negative
|
|
|
|
* value in the callback routine. Choose a value that does not conflict with
|
|
|
|
* an existing exit code and return it if we want to prematurely stop the
|
|
|
|
* execution because we detected a change in the allocation status.
|
|
|
|
*/
|
|
|
|
#define QEMU_RBD_EXIT_DIFF_ITERATE2 -9000
|
|
|
|
|
|
|
|
static int qemu_rbd_diff_iterate_cb(uint64_t offs, size_t len,
|
|
|
|
int exists, void *opaque)
|
|
|
|
{
|
|
|
|
RBDDiffIterateReq *req = opaque;
|
|
|
|
|
|
|
|
assert(req->offs + req->bytes <= offs);
|
2022-01-13 15:44:25 +01:00
|
|
|
|
|
|
|
/* treat a hole like an unallocated area and bail out */
|
|
|
|
if (!exists) {
|
|
|
|
return 0;
|
|
|
|
}
|
2021-10-12 17:22:31 +02:00
|
|
|
|
|
|
|
if (!req->exists && offs > req->offs) {
|
|
|
|
/*
|
|
|
|
* we started in an unallocated area and hit the first allocated
|
|
|
|
* block. req->bytes must be set to the length of the unallocated area
|
|
|
|
* before the allocated area. stop further processing.
|
|
|
|
*/
|
|
|
|
req->bytes = offs - req->offs;
|
|
|
|
return QEMU_RBD_EXIT_DIFF_ITERATE2;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (req->exists && offs > req->offs + req->bytes) {
|
|
|
|
/*
|
|
|
|
* we started in an allocated area and jumped over an unallocated area,
|
|
|
|
* req->bytes contains the length of the allocated area before the
|
|
|
|
* unallocated area. stop further processing.
|
|
|
|
*/
|
|
|
|
return QEMU_RBD_EXIT_DIFF_ITERATE2;
|
|
|
|
}
|
|
|
|
|
|
|
|
req->bytes += len;
|
|
|
|
req->exists = true;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
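To make the callback's early-exit behaviour easier to follow, here is a stand-alone sketch that replays the same decisions over a plain array of allocated extents. The types SketchReq and SketchExtent, the driver loop sketch_iterate() and the constant SKETCH_STOP are hypothetical stand-ins. The real iteration is done by librbd, and the hole check is left out because the array holds allocated extents only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_STOP (-9000) /* plays the role of QEMU_RBD_EXIT_DIFF_ITERATE2 */

typedef struct {
    uint64_t offs;  /* start of the queried range */
    uint64_t bytes; /* length of the first uniform run found so far */
    bool exists;    /* true once an allocated extent has been merged */
} SketchReq;

typedef struct {
    uint64_t offs;
    size_t len;
} SketchExtent;

/* Same decisions as the callback above, for allocated extents only. */
static int sketch_cb(uint64_t offs, size_t len, SketchReq *req)
{
    if (!req->exists && offs > req->offs) {
        /* The range starts unallocated; report the gap and stop. */
        req->bytes = offs - req->offs;
        return SKETCH_STOP;
    }
    if (req->exists && offs > req->offs + req->bytes) {
        /* A gap follows the allocated run; stop without extending it. */
        return SKETCH_STOP;
    }
    req->bytes += len;
    req->exists = true;
    return 0;
}

/* Feed extents in ascending order until the callback asks to stop. */
static int sketch_iterate(const SketchExtent *ext, size_t n, SketchReq *req)
{
    for (size_t i = 0; i < n; i++) {
        int r = sketch_cb(ext[i].offs, ext[i].len, req);
        if (r < 0) {
            return r;
        }
    }
    return 0;
}
```

With extents at 0, 4096 and 16384 (4096 bytes each) and a query starting at 0, the first two extents merge into an 8192-byte allocated run and the third triggers the stop code.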
|
|
|
|
|
|
|
|
static int coroutine_fn qemu_rbd_co_block_status(BlockDriverState *bs,
|
|
|
|
bool want_zero, int64_t offset,
|
|
|
|
int64_t bytes, int64_t *pnum,
|
|
|
|
int64_t *map,
|
|
|
|
BlockDriverState **file)
|
|
|
|
{
|
|
|
|
BDRVRBDState *s = bs->opaque;
|
|
|
|
int status, r;
|
|
|
|
RBDDiffIterateReq req = { .offs = offset };
|
|
|
|
uint64_t features, flags;
|
2022-01-13 15:44:26 +01:00
|
|
|
uint64_t head = 0;
|
2021-10-12 17:22:31 +02:00
|
|
|
|
|
|
|
assert(offset + bytes <= s->image_size);
|
|
|
|
|
|
|
|
/* default to all sectors allocated */
|
|
|
|
status = BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
|
|
|
|
*map = offset;
|
|
|
|
*file = bs;
|
|
|
|
*pnum = bytes;
|
|
|
|
|
|
|
|
/* check if RBD image supports fast-diff */
|
|
|
|
r = rbd_get_features(s->image, &features);
|
|
|
|
if (r < 0) {
|
|
|
|
return status;
|
|
|
|
}
|
|
|
|
if (!(features & RBD_FEATURE_FAST_DIFF)) {
|
|
|
|
return status;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* check if RBD fast-diff result is valid */
|
|
|
|
r = rbd_get_flags(s->image, &flags);
|
|
|
|
if (r < 0) {
|
|
|
|
return status;
|
|
|
|
}
|
|
|
|
if (flags & RBD_FLAG_FAST_DIFF_INVALID) {
|
|
|
|
return status;
|
|
|
|
}
|
|
|
|
|
2022-01-13 15:44:26 +01:00
|
|
|
#if LIBRBD_VERSION_CODE < LIBRBD_VERSION(1, 17, 0)
|
|
|
|
/*
|
|
|
|
* librbd had a bug until early 2022 that affected all versions of ceph that
|
|
|
|
* supported fast-diff. This bug results in reporting of incorrect offsets
|
|
|
|
* if the offset parameter to rbd_diff_iterate2 is not object aligned.
|
|
|
|
* Work around this bug by rounding down the offset to object boundaries.
|
|
|
|
* This is OK because we call rbd_diff_iterate2 with whole_object = true.
|
|
|
|
* However, this workaround only works for non cloned images with default
|
|
|
|
* striping.
|
|
|
|
*
|
|
|
|
* See: https://tracker.ceph.com/issues/53784
|
|
|
|
*/
|
|
|
|
|
|
|
|
/* check if RBD image has non-default striping enabled */
|
|
|
|
if (features & RBD_FEATURE_STRIPINGV2) {
|
|
|
|
return status;
|
|
|
|
}
|
|
|
|
|
|
|
|
#pragma GCC diagnostic push
|
|
|
|
#pragma GCC diagnostic ignored "-Wdeprecated-declarations"
|
|
|
|
/*
|
|
|
|
* check if RBD image is a clone (= has a parent).
|
|
|
|
*
|
|
|
|
* rbd_get_parent_info is deprecated from Nautilus onwards, but the
|
|
|
|
* replacement rbd_get_parent is not present in Luminous and Mimic.
|
|
|
|
*/
|
|
|
|
if (rbd_get_parent_info(s->image, NULL, 0, NULL, 0, NULL, 0) != -ENOENT) {
|
|
|
|
return status;
|
|
|
|
}
|
|
|
|
#pragma GCC diagnostic pop
|
|
|
|
|
|
|
|
head = req.offs & (s->object_size - 1);
|
|
|
|
req.offs -= head;
|
|
|
|
bytes += head;
|
|
|
|
#endif
|
|
|
|
|
|
|
|
r = rbd_diff_iterate2(s->image, NULL, req.offs, bytes, true, true,
|
2021-10-12 17:22:31 +02:00
|
|
|
qemu_rbd_diff_iterate_cb, &req);
|
|
|
|
if (r < 0 && r != QEMU_RBD_EXIT_DIFF_ITERATE2) {
|
|
|
|
return status;
|
|
|
|
}
|
|
|
|
assert(req.bytes <= bytes);
|
|
|
|
if (!req.exists) {
|
|
|
|
if (r == 0) {
|
|
|
|
/*
|
|
|
|
* rbd_diff_iterate2 does not invoke callbacks for unallocated
|
|
|
|
* areas. This here catches the case where no callback was
|
|
|
|
* invoked at all (req.bytes == 0).
|
|
|
|
*/
|
|
|
|
assert(req.bytes == 0);
|
|
|
|
req.bytes = bytes;
|
|
|
|
}
|
|
|
|
status = BDRV_BLOCK_ZERO | BDRV_BLOCK_OFFSET_VALID;
|
|
|
|
}
|
|
|
|
|
2022-01-13 15:44:26 +01:00
|
|
|
assert(req.bytes > head);
|
|
|
|
*pnum = req.bytes - head;
|
2021-10-12 17:22:31 +02:00
|
|
|
return status;
|
|
|
|
}
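The alignment workaround in the function above boils down to a few lines of arithmetic. A minimal sketch follows, assuming the object size is a power of two (the mask trick relies on that) and using a hypothetical helper name, sketch_align_down().

```c
#include <assert.h>
#include <stdint.h>

/*
 * Round *offs down to a multiple of object_size and grow *bytes by the
 * same amount, so the queried range still ends at the same position.
 * Returns the number of bytes added in front ("head").
 * Assumption: object_size is a power of two.
 */
static uint64_t sketch_align_down(uint64_t *offs, uint64_t *bytes,
                                  uint64_t object_size)
{
    uint64_t head = *offs & (object_size - 1);

    *offs -= head;
    *bytes += head;
    return head;
}
```

The caller then subtracts head from the length reported by the iteration, as the function above does when it sets *pnum, so the result again describes the range that was originally requested.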
|
|
|
|
|
2011-05-27 01:07:31 +02:00
|
|
|
static int64_t qemu_rbd_getlength(BlockDriverState *bs)
|
2010-12-06 20:53:01 +01:00
|
|
|
{
|
|
|
|
BDRVRBDState *s = bs->opaque;
|
2011-05-27 01:07:31 +02:00
|
|
|
int r;
|
2010-12-06 20:53:01 +01:00
|
|
|
|
2021-07-02 19:23:53 +02:00
|
|
|
r = rbd_get_size(s->image, &s->image_size);
|
2011-05-27 01:07:31 +02:00
|
|
|
if (r < 0) {
|
|
|
|
return r;
|
|
|
|
}
|
|
|
|
|
2021-07-02 19:23:53 +02:00
|
|
|
return s->image_size;
|
2010-12-06 20:53:01 +01:00
|
|
|
}
|
|
|
|
|
block: Convert .bdrv_truncate callback to coroutine_fn
bdrv_truncate() is an operation that can block (even for a quite long
time, depending on the PreallocMode) in I/O paths that shouldn't block.
Convert it to a coroutine_fn so that we have the infrastructure for
drivers to make their .bdrv_co_truncate implementation asynchronous.
This change could potentially introduce new race conditions because
bdrv_truncate() isn't necessarily executed atomically any more. Whether
this is a problem needs to be evaluated for each block driver that
supports truncate:
* file-posix/win32, gluster, iscsi, nfs, rbd, ssh, sheepdog: The
protocol drivers are trivially safe because they don't actually yield
yet, so there is no change in behaviour.
* copy-on-read, crypto, raw-format: Essentially just filter drivers that
pass the request to a child node, no problem.
* qcow2: The implementation modifies metadata, so it needs to hold
s->lock to be safe with concurrent I/O requests. In order to avoid
double locking, this requires pulling the locking out into
preallocate_co() and using qcow2_write_caches() instead of
bdrv_flush().
* qed: Does a single header update, this is fine without locking.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
2018-06-21 17:54:35 +02:00
|
|
|
static int coroutine_fn qemu_rbd_co_truncate(BlockDriverState *bs,
|
|
|
|
int64_t offset,
|
2019-09-18 11:51:40 +02:00
|
|
|
bool exact,
|
2018-06-21 17:54:35 +02:00
|
|
|
PreallocMode prealloc,
|
2020-04-24 14:54:39 +02:00
|
|
|
BdrvRequestFlags flags,
|
2018-06-21 17:54:35 +02:00
|
|
|
Error **errp)
|
2011-05-27 01:07:34 +02:00
|
|
|
{
|
|
|
|
int r;
|
|
|
|
|
2017-06-13 22:20:52 +02:00
|
|
|
if (prealloc != PREALLOC_MODE_OFF) {
|
|
|
|
error_setg(errp, "Unsupported preallocation mode '%s'",
|
2017-08-24 10:46:08 +02:00
|
|
|
PreallocMode_str(prealloc));
|
2017-06-13 22:20:52 +02:00
|
|
|
return -ENOTSUP;
|
|
|
|
}
|
|
|
|
|
2019-05-09 16:59:27 +02:00
|
|
|
r = qemu_rbd_resize(bs, offset);
|
2011-05-27 01:07:34 +02:00
|
|
|
if (r < 0) {
|
2017-03-28 22:51:29 +02:00
|
|
|
error_setg_errno(errp, -r, "Failed to resize file");
|
2011-05-27 01:07:34 +02:00
|
|
|
return r;
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2011-05-27 01:07:31 +02:00
|
|
|
static int qemu_rbd_snap_create(BlockDriverState *bs,
|
|
|
|
QEMUSnapshotInfo *sn_info)
|
2010-12-06 20:53:01 +01:00
|
|
|
{
|
|
|
|
BDRVRBDState *s = bs->opaque;
|
|
|
|
int r;
|
|
|
|
|
|
|
|
if (sn_info->name[0] == '\0') {
|
|
|
|
return -EINVAL; /* we need a name for rbd snapshots */
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* rbd snapshots use the name as the user-controlled unique identifier;
|
|
|
|
* we can't use the rbd snapid for that purpose, as it can't be set
|
|
|
|
*/
|
|
|
|
if (sn_info->id_str[0] != '\0' &&
|
|
|
|
strcmp(sn_info->id_str, sn_info->name) != 0) {
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (strlen(sn_info->name) >= sizeof(sn_info->id_str)) {
|
|
|
|
return -ERANGE;
|
|
|
|
}
|
|
|
|
|
2011-05-27 01:07:31 +02:00
|
|
|
r = rbd_snap_create(s->image, sn_info->name);
|
2010-12-06 20:53:01 +01:00
|
|
|
if (r < 0) {
|
2011-05-27 01:07:31 +02:00
|
|
|
error_report("failed to create snap: %s", strerror(-r));
|
2010-12-06 20:53:01 +01:00
|
|
|
return r;
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2012-01-11 20:53:52 +01:00
|
|
|
static int qemu_rbd_snap_remove(BlockDriverState *bs,
|
snapshot: distinguish id and name in snapshot delete
Snapshot creation already distinguishes id and name, since it takes
a structured parameter *sn, but delete can't. An accurate delete will be
needed later in qmp_transaction abort and blockdev-snapshot-delete-sync,
so change its prototype. Also add *errp to report errors, but keep the
return value so the caller can check what kind of error happened. The
existing callers (savevm, delvm and qemu-img) are not affected, thanks to
the new function bdrv_snapshot_delete_by_id_or_name(), which checks the
return value and retries the operation.
Before this patch:
For qcow2, it searches by id first, then by name, to find the one to delete.
For rbd, it searches by name.
For sheepdog, it does nothing.
After this patch:
For qcow2, the logic is the same, because the caller invokes it twice.
For rbd, delete by id always fails, but the second try still searches by
name, so nothing changes for the user.
Some code for *errp is based on Pavel's patch.
Signed-off-by: Wenchao Xia <xiawenc@linux.vnet.ibm.com>
Signed-off-by: Pavel Hrdina <phrdina@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2013-09-11 08:04:33 +02:00
|
|
|
const char *snapshot_id,
|
|
|
|
const char *snapshot_name,
|
|
|
|
Error **errp)
|
2012-01-11 20:53:52 +01:00
|
|
|
{
|
|
|
|
BDRVRBDState *s = bs->opaque;
|
|
|
|
int r;
|
|
|
|
|
2013-09-11 08:04:33 +02:00
|
|
|
if (!snapshot_name) {
|
|
|
|
error_setg(errp, "rbd need a valid snapshot name");
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* If snapshot_id is specified, it must be equal to name, see
|
|
|
|
qemu_rbd_snap_list() */
|
|
|
|
if (snapshot_id && strcmp(snapshot_id, snapshot_name)) {
|
|
|
|
error_setg(errp,
|
|
|
|
"rbd do not support snapshot id, it should be NULL or "
|
|
|
|
"equal to snapshot name");
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
|
2012-01-11 20:53:52 +01:00
|
|
|
r = rbd_snap_remove(s->image, snapshot_name);
|
2013-09-11 08:04:33 +02:00
|
|
|
if (r < 0) {
|
|
|
|
error_setg_errno(errp, -r, "Failed to remove the snapshot");
|
|
|
|
}
|
2012-01-11 20:53:52 +01:00
|
|
|
return r;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int qemu_rbd_snap_rollback(BlockDriverState *bs,
|
|
|
|
const char *snapshot_name)
|
|
|
|
{
|
|
|
|
BDRVRBDState *s = bs->opaque;
|
|
|
|
|
2016-06-13 23:57:58 +02:00
|
|
|
return rbd_snap_rollback(s->image, snapshot_name);
|
2012-01-11 20:53:52 +01:00
|
|
|
}
|
|
|
|
|
2011-05-27 01:07:31 +02:00
|
|
|
static int qemu_rbd_snap_list(BlockDriverState *bs,
|
|
|
|
QEMUSnapshotInfo **psn_tab)
|
2010-12-06 20:53:01 +01:00
|
|
|
{
|
|
|
|
BDRVRBDState *s = bs->opaque;
|
|
|
|
QEMUSnapshotInfo *sn_info, *sn_tab = NULL;
|
2011-05-27 01:07:31 +02:00
|
|
|
int i, snap_count;
|
|
|
|
rbd_snap_info_t *snaps;
|
|
|
|
int max_snaps = RBD_MAX_SNAPS;
|
2010-12-06 20:53:01 +01:00
|
|
|
|
2011-05-27 01:07:31 +02:00
|
|
|
do {
|
2014-08-19 10:31:09 +02:00
|
|
|
snaps = g_new(rbd_snap_info_t, max_snaps);
|
2011-05-27 01:07:31 +02:00
|
|
|
snap_count = rbd_snap_list(s->image, snaps, &max_snaps);
|
2013-09-25 16:00:48 +02:00
|
|
|
if (snap_count <= 0) {
|
2011-08-21 05:09:37 +02:00
|
|
|
g_free(snaps);
|
2010-12-06 20:53:01 +01:00
|
|
|
}
|
2011-05-27 01:07:31 +02:00
|
|
|
} while (snap_count == -ERANGE);
|
2010-12-06 20:53:01 +01:00
|
|
|
|
2011-05-27 01:07:31 +02:00
|
|
|
if (snap_count <= 0) {
|
2011-12-07 02:05:10 +01:00
|
|
|
goto done;
|
2010-12-06 20:53:01 +01:00
|
|
|
}
|
|
|
|
|
block: Use g_new() & friends where that makes obvious sense
g_new(T, n) is neater than g_malloc(sizeof(T) * n). It's also safer,
for two reasons. One, it catches multiplication overflowing size_t.
Two, it returns T * rather than void *, which lets the compiler catch
more type errors.
Patch created with Coccinelle, with two manual changes on top:
* Add const to bdrv_iterate_format() to keep the types straight
* Convert the allocation in bdrv_drop_intermediate(), which Coccinelle
inexplicably misses
Coccinelle semantic patch:
@@
type T;
@@
-g_malloc(sizeof(T))
+g_new(T, 1)
@@
type T;
@@
-g_try_malloc(sizeof(T))
+g_try_new(T, 1)
@@
type T;
@@
-g_malloc0(sizeof(T))
+g_new0(T, 1)
@@
type T;
@@
-g_try_malloc0(sizeof(T))
+g_try_new0(T, 1)
@@
type T;
expression n;
@@
-g_malloc(sizeof(T) * (n))
+g_new(T, n)
@@
type T;
expression n;
@@
-g_try_malloc(sizeof(T) * (n))
+g_try_new(T, n)
@@
type T;
expression n;
@@
-g_malloc0(sizeof(T) * (n))
+g_new0(T, n)
@@
type T;
expression n;
@@
-g_try_malloc0(sizeof(T) * (n))
+g_try_new0(T, n)
@@
type T;
expression p, n;
@@
-g_realloc(p, sizeof(T) * (n))
+g_renew(T, p, n)
@@
type T;
expression p, n;
@@
-g_try_realloc(p, sizeof(T) * (n))
+g_try_renew(T, p, n)
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
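The overflow argument in the message above can be illustrated without GLib. The sketch below is not g_new(): sketch_alloc_array() is a hypothetical helper that returns NULL when the element count times the element size would overflow size_t, whereas GLib's g_new() aborts the program in that situation.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper: allocate room for n elements of elem_size bytes.
 * Returns NULL instead of allocating a too-small buffer when
 * n * elem_size does not fit in size_t.
 */
static void *sketch_alloc_array(size_t n, size_t elem_size)
{
    if (elem_size != 0 && n > SIZE_MAX / elem_size) {
        return NULL; /* the multiplication would wrap around */
    }
    return malloc(n * elem_size);
}
```

A plain malloc(sizeof(T) * n) would silently wrap in the same situation and hand back a buffer far smaller than the caller expects.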
2014-08-19 10:31:08 +02:00
|
|
|
sn_tab = g_new0(QEMUSnapshotInfo, snap_count);
|
2010-12-06 20:53:01 +01:00
|
|
|
|
2011-05-27 01:07:31 +02:00
|
|
|
for (i = 0; i < snap_count; i++) {
|
|
|
|
const char *snap_name = snaps[i].name;
|
2010-12-06 20:53:01 +01:00
|
|
|
|
|
|
|
sn_info = sn_tab + i;
|
|
|
|
pstrcpy(sn_info->id_str, sizeof(sn_info->id_str), snap_name);
|
|
|
|
pstrcpy(sn_info->name, sizeof(sn_info->name), snap_name);
|
|
|
|
|
2011-05-27 01:07:31 +02:00
|
|
|
sn_info->vm_state_size = snaps[i].size;
|
2010-12-06 20:53:01 +01:00
|
|
|
sn_info->date_sec = 0;
|
|
|
|
sn_info->date_nsec = 0;
|
|
|
|
sn_info->vm_clock_nsec = 0;
|
|
|
|
}
|
2011-05-27 01:07:31 +02:00
|
|
|
rbd_snap_list_end(snaps);
|
2013-09-25 16:00:48 +02:00
|
|
|
g_free(snaps);
|
2011-05-27 01:07:31 +02:00
|
|
|
|
2011-12-07 02:05:10 +01:00
|
|
|
done:
|
2010-12-06 20:53:01 +01:00
|
|
|
*psn_tab = sn_tab;
|
|
|
|
return snap_count;
|
|
|
|
}
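The do/while loop in the function above is a grow-and-retry pattern: allocate a buffer, ask for the list, and try again with the capacity the callee reports when it answers -ERANGE. A self-contained sketch of the pattern follows. sketch_list() is a hypothetical stand-in for the listing call and always has five entries, and plain malloc() replaces the GLib allocator.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical lister: fills out[] and returns the entry count, or
 * stores the required capacity in *max and returns -ERANGE.
 */
static int sketch_list(int *out, int *max)
{
    const int available = 5;

    if (*max < available) {
        *max = available;
        return -ERANGE;
    }
    for (int i = 0; i < available; i++) {
        out[i] = i + 1;
    }
    return available;
}

/* Retry with a larger buffer until the list fits or a real error occurs. */
static int sketch_list_all(int **result)
{
    int max = 2; /* deliberately too small for the first attempt */
    int count;
    int *buf;

    do {
        buf = malloc(sizeof(*buf) * (size_t)max);
        if (!buf) {
            *result = NULL;
            return -ENOMEM;
        }
        count = sketch_list(buf, &max);
        if (count <= 0) {
            free(buf); /* nothing usable: drop the buffer before retrying */
            buf = NULL;
        }
    } while (count == -ERANGE);

    *result = buf; /* caller frees; NULL when count <= 0 */
    return count;
}
```

Freeing the buffer on every non-positive result, as the driver does, keeps the retry path and the error path from leaking the earlier allocation.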
|
|
|
|
|
2018-03-01 17:36:18 +01:00
|
|
|
static void coroutine_fn qemu_rbd_co_invalidate_cache(BlockDriverState *bs,
|
|
|
|
Error **errp)
|
2014-10-09 20:44:32 +02:00
|
|
|
{
|
|
|
|
BDRVRBDState *s = bs->opaque;
|
|
|
|
int r = rbd_invalidate_cache(s->image);
|
|
|
|
if (r < 0) {
|
|
|
|
error_setg_errno(errp, -r, "Failed to invalidate the cache");
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2014-06-05 11:21:04 +02:00
|
|
|
static QemuOptsList qemu_rbd_create_opts = {
|
|
|
|
.name = "rbd-create-opts",
|
|
|
|
.head = QTAILQ_HEAD_INITIALIZER(qemu_rbd_create_opts.head),
|
|
|
|
.desc = {
|
|
|
|
{
|
|
|
|
.name = BLOCK_OPT_SIZE,
|
|
|
|
.type = QEMU_OPT_SIZE,
|
|
|
|
.help = "Virtual disk size"
|
|
|
|
},
|
|
|
|
{
|
|
|
|
.name = BLOCK_OPT_CLUSTER_SIZE,
|
|
|
|
.type = QEMU_OPT_SIZE,
|
|
|
|
.help = "RBD object size"
|
|
|
|
},
|
2016-01-21 15:19:19 +01:00
|
|
|
{
|
|
|
|
.name = "password-secret",
|
|
|
|
.type = QEMU_OPT_STRING,
|
|
|
|
.help = "ID of secret providing the password",
|
|
|
|
},
|
2021-06-27 13:46:35 +02:00
|
|
|
{
|
|
|
|
.name = "encrypt.format",
|
|
|
|
.type = QEMU_OPT_STRING,
|
|
|
|
.help = "Encrypt the image, format choices: 'luks', 'luks2'",
|
|
|
|
},
|
|
|
|
{
|
|
|
|
.name = "encrypt.cipher-alg",
|
|
|
|
.type = QEMU_OPT_STRING,
|
|
|
|
.help = "Name of encryption cipher algorithm"
|
|
|
|
" (allowed values: aes-128, aes-256)",
|
|
|
|
},
|
|
|
|
{
|
|
|
|
.name = "encrypt.key-secret",
|
|
|
|
.type = QEMU_OPT_STRING,
|
|
|
|
.help = "ID of secret providing LUKS passphrase",
|
|
|
|
},
|
2014-06-05 11:21:04 +02:00
|
|
|
{ /* end of list */ }
|
|
|
|
}
|
2010-12-06 20:53:01 +01:00
|
|
|
};
|
|
|
|
|
2019-02-01 20:29:25 +01:00
|
|
|
static const char *const qemu_rbd_strong_runtime_opts[] = {
|
|
|
|
"pool",
|
2020-09-14 21:05:53 +02:00
|
|
|
"namespace",
|
2019-02-01 20:29:25 +01:00
|
|
|
"image",
|
|
|
|
"conf",
|
|
|
|
"snapshot",
|
|
|
|
"user",
|
|
|
|
"server.",
|
|
|
|
"password-secret",
|
|
|
|
|
|
|
|
NULL
|
|
|
|
};
|
|
|
|
|
2010-12-06 20:53:01 +01:00
|
|
|
static BlockDriver bdrv_rbd = {
|
2017-02-26 23:50:42 +01:00
|
|
|
.format_name = "rbd",
|
|
|
|
.instance_size = sizeof(BDRVRBDState),
|
|
|
|
.bdrv_parse_filename = qemu_rbd_parse_filename,
|
|
|
|
.bdrv_file_open = qemu_rbd_open,
|
|
|
|
.bdrv_close = qemu_rbd_close,
|
2017-04-07 22:55:32 +02:00
|
|
|
.bdrv_reopen_prepare = qemu_rbd_reopen_prepare,
|
2018-01-31 16:27:38 +01:00
|
|
|
.bdrv_co_create = qemu_rbd_co_create,
|
2018-01-18 13:43:45 +01:00
|
|
|
.bdrv_co_create_opts = qemu_rbd_co_create_opts,
|
2017-02-26 23:50:42 +01:00
|
|
|
.bdrv_has_zero_init = bdrv_has_zero_init_1,
|
|
|
|
.bdrv_get_info = qemu_rbd_getinfo,
|
2021-06-27 13:46:35 +02:00
|
|
|
.bdrv_get_specific_info = qemu_rbd_get_specific_info,
|
2017-02-26 23:50:42 +01:00
|
|
|
.create_opts = &qemu_rbd_create_opts,
|
|
|
|
.bdrv_getlength = qemu_rbd_getlength,
|
2018-06-21 17:54:35 +02:00
|
|
|
.bdrv_co_truncate = qemu_rbd_co_truncate,
|
2017-02-26 23:50:42 +01:00
|
|
|
.protocol_name = "rbd",
|
2010-12-06 20:53:01 +01:00
|
|
|
|
2021-07-02 19:23:54 +02:00
|
|
|
.bdrv_co_preadv = qemu_rbd_co_preadv,
|
|
|
|
.bdrv_co_pwritev = qemu_rbd_co_pwritev,
|
|
|
|
.bdrv_co_flush_to_disk = qemu_rbd_co_flush,
|
|
|
|
.bdrv_co_pdiscard = qemu_rbd_co_pdiscard,
|
block/rbd: add write zeroes support

This patch wittingly sets BDRV_REQ_NO_FALLBACK and silently ignores
BDRV_REQ_MAY_UNMAP for older librbd versions.

The rationale for this is as follows (citing Ilya Dryomov current RBD
maintainer):

---8<---
a) remove the BDRV_REQ_MAY_UNMAP check in qemu_rbd_co_pwrite_zeroes()
   and as a consequence always unmap if librbd is too old

   It's not clear what qemu's expectation is but in general Write
   Zeroes is allowed to unmap. The only guarantee is that subsequent
   reads return zeroes, everything else is a hint. This is how it is
   specified in the kernel and in the NVMe spec.

   In particular, block/nvme.c implements it as follows:

   if (flags & BDRV_REQ_MAY_UNMAP) {
       cdw12 |= (1 << 25);
   }

   This sets the Deallocate bit. But if it's not set, the device may
   still deallocate:

   """
   If the Deallocate bit (CDW12.DEAC) is set to '1' in a Write Zeroes
   command, and the namespace supports clearing all bytes to 0h in the
   values read (e.g., bits 2:0 in the DLFEAT field are set to 001b)
   from a deallocated logical block and its metadata (excluding
   protection information), then for each specified logical block, the
   controller:
   - should deallocate that logical block;

   ...

   If the Deallocate bit is cleared to '0' in a Write Zeroes command,
   and the namespace supports clearing all bytes to 0h in the values
   read (e.g., bits 2:0 in the DLFEAT field are set to 001b) from
   a deallocated logical block and its metadata (excluding protection
   information), then, for each specified logical block, the
   controller:
   - may deallocate that logical block;
   """

   https://nvmexpress.org/wp-content/uploads/NVM-Express-NVM-Command-Set-Specification-2021.06.02-Ratified-1.pdf

b) set BDRV_REQ_NO_FALLBACK in supported_zero_flags

   Again, it's not clear what qemu expects here, but without it we end
   up in a ridiculous situation where specifying the "don't allow slow
   fallback" switch immediately fails all efficient zeroing requests on
   a device where Write Zeroes is always efficient:

   $ qemu-io -c 'help write' | grep -- '-[zun]'
    -n, -- with -z, don't allow slow fallback
    -u, -- with -z, allow unmapping
    -z, -- write zeroes using blk_co_pwrite_zeroes

   $ qemu-io -f rbd -c 'write -z -u -n 0 1M' rbd:foo/bar
   write failed: Operation not supported
--->8---

Signed-off-by: Peter Lieven <pl@kamp.de>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Message-Id: <20210702172356.11574-6-idryomov@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2021-07-02 19:23:55 +02:00
|
|
|
#ifdef LIBRBD_SUPPORTS_WRITE_ZEROES
|
|
|
|
.bdrv_co_pwrite_zeroes = qemu_rbd_co_pwrite_zeroes,
|
|
|
|
#endif
|
2021-10-12 17:22:31 +02:00
|
|
|
.bdrv_co_block_status = qemu_rbd_co_block_status,
|
2012-05-01 08:16:45 +02:00
|
|
|
|
2011-11-10 17:25:44 +01:00
|
|
|
.bdrv_snapshot_create = qemu_rbd_snap_create,
|
2012-01-11 20:53:52 +01:00
|
|
|
.bdrv_snapshot_delete = qemu_rbd_snap_remove,
|
2011-11-10 17:25:44 +01:00
|
|
|
.bdrv_snapshot_list = qemu_rbd_snap_list,
|
2012-01-11 20:53:52 +01:00
|
|
|
.bdrv_snapshot_goto = qemu_rbd_snap_rollback,
|
2018-03-01 17:36:18 +01:00
|
|
|
.bdrv_co_invalidate_cache = qemu_rbd_co_invalidate_cache,
|
2019-02-01 20:29:25 +01:00
|
|
|
|
|
|
|
.strong_runtime_opts = qemu_rbd_strong_runtime_opts,
|
2010-12-06 20:53:01 +01:00
|
|
|
};
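The write-zeroes commit message quoted in the driver table above makes two points: a request that asks for "no slow fallback" should not fail on a driver whose zeroing is always efficient, and the "may unmap" flag is only a hint that an older library may be unable to honour. A rough sketch of both decisions follows. All names are hypothetical stand-ins (the SKETCH_REQ_* values for BDRV_REQ_MAY_UNMAP and BDRV_REQ_NO_FALLBACK, SKETCH_ZERO_THICK_PROVISION for a library-side "do not unmap" flag), and the logic is a simplified model under those assumptions, not the code in this file or in the generic block layer.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for BDRV_REQ_MAY_UNMAP / BDRV_REQ_NO_FALLBACK. */
enum {
    SKETCH_REQ_MAY_UNMAP   = 1 << 0,
    SKETCH_REQ_NO_FALLBACK = 1 << 1,
};

/* Hypothetical stand-in for a library flag that forbids unmapping. */
enum { SKETCH_ZERO_THICK_PROVISION = 1 << 0 };

/*
 * Caller side (assumed behaviour): a request carrying NO_FALLBACK is
 * rejected up front unless the driver advertises that flag.
 */
static int sketch_check_request(int supported_zero_flags, int req_flags)
{
    if ((req_flags & ~supported_zero_flags) & SKETCH_REQ_NO_FALLBACK) {
        return -ENOTSUP;
    }
    return 0;
}

/*
 * Driver side: translate request flags into library flags. Without
 * thick-provision support in the library, MAY_UNMAP cannot be enforced
 * either way, so it is silently ignored and the library may unmap.
 */
static int sketch_library_zero_flags(int req_flags, bool lib_has_thick)
{
    if (lib_has_thick && !(req_flags & SKETCH_REQ_MAY_UNMAP)) {
        return SKETCH_ZERO_THICK_PROVISION;
    }
    return 0;
}
```

With both flags advertised, a request like the `write -z -u -n` example from the commit message passes the first check; with an empty set of advertised flags, the same request is rejected with -ENOTSUP before the driver is ever called.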
|
|
|
|
|
|
|
|
static void bdrv_rbd_init(void)
|
|
|
|
{
|
|
|
|
bdrv_register(&bdrv_rbd);
|
|
|
|
}
|
|
|
|
|
|
|
|
block_init(bdrv_rbd_init);
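For readers unfamiliar with the block_init() idiom: the registration function above is not called directly anywhere in this file. The sketch below illustrates one way such a pattern can work, assuming a constructor that queues the init hook before main() and a later explicit pass that runs the queued hooks. Every sketch_* name is hypothetical, `__attribute__((constructor))` is a GCC/Clang extension, and the code is an illustration of the general technique rather than QEMU's module machinery.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, heavily reduced driver record. */
typedef struct SketchDriver {
    const char *format_name;
    struct SketchDriver *next;
} SketchDriver;

static SketchDriver *sketch_drivers;   /* head of the registered list */

static void sketch_register(SketchDriver *drv)
{
    drv->next = sketch_drivers;
    sketch_drivers = drv;
}

static SketchDriver *sketch_find(const char *name)
{
    for (SketchDriver *d = sketch_drivers; d; d = d->next) {
        if (strcmp(d->format_name, name) == 0) {
            return d;
        }
    }
    return NULL;
}

/* Init hooks queued by constructors, run later by sketch_call_init(). */
#define SKETCH_MAX_INIT 16
static void (*sketch_init_fns[SKETCH_MAX_INIT])(void);
static size_t sketch_init_count;

static void sketch_queue_init(void (*fn)(void))
{
    assert(sketch_init_count < SKETCH_MAX_INIT);
    sketch_init_fns[sketch_init_count++] = fn;
}

/* Expands to a constructor that only queues `fn`; nothing runs yet. */
#define sketch_block_init(fn)                                       \
    static void __attribute__((constructor)) sketch_ctor_##fn(void) \
    {                                                               \
        sketch_queue_init(fn);                                      \
    }

/* Runs every queued hook once, then empties the queue. */
static void sketch_call_init(void)
{
    for (size_t i = 0; i < sketch_init_count; i++) {
        sketch_init_fns[i]();
    }
    sketch_init_count = 0;
}

static SketchDriver sketch_rbd = { .format_name = "rbd" };

static void sketch_rbd_init(void)
{
    sketch_register(&sketch_rbd);
}

sketch_block_init(sketch_rbd_init)
```

In this model the driver only becomes visible to sketch_find("rbd") after the program calls sketch_call_init(); the two-step design keeps constructors trivial and lets the program decide when drivers are actually registered.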
|