Use special offset to write zeroes efficiently, when zeroed-grain GTE is
available. If zero-write an allocated cluster, cluster is leaked because
its offset pointer is overwritten by "0x1".
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Previously VmdkMetaData.offset is stored little endian while other
fields are cpu endian. This changes offset to cpu endian and convert
before writing to image.
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Add image create option "zeroed-grain" to enable zeroed-grain GTE
feature of vmdk sparse extents. When this option is on, header version
of newly created extent will be 2 and VMDK4_FLAG_ZERO_GRAIN flag bit
will be set.
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Introduced support for zeroed-grain GTE, as specified in Virtual Disk
Format 5.0[1].
Recent VMware hosted platform products support a new “zeroed‐grain”
grain table entry (GTE). The zeroed‐grain GTE returns all zeros on
read. In other words, the zeroed‐grain GTE indicates that a grain
in the child disk is zero‐filled but does not actually occupy space
in storage. A sparse extent with zeroed‐grain GTE has the following
in its header:
* SparseExtentHeader.version = 2
* SparseExtentHeader.flags has bit 2 set
Other than the new flag and the possibly zeroed‐grain GTE, version 2
sparse extents are identical to version 1. Also, a zeroed‐grain GTE
has value 0x1 in the GT table.
[1] Virtual Disk Format 5.0, http://www.vmware.com/support/developer/vddk/vmdk_50_technote.pdf?src=vmdk
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Internal routines in vmdk.c previously return -1 on error and 0 on
success. More return values are useful for future changes such as
zeroed-grain GTE. Change all the magic `return 0` and `return -1` to
macro names:
* VMDK_OK 0
* VMDK_ERROR (-1)
* VMDK_UNALLOC (-2)
* VMDK_ZEROED (-3)
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This adds in read-only support to the VHDX image format. This supports
reads for fixed-size, and dynamic sized VHDX images.
Differencing files are still unsupported.
The image must be opened without BDRV_O_RDWR set, because we do not
yet update the headers. I.e., pass 'readonly=on' in the drive image
options from the QEMU commandline.
Signed-off-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This is the initial block driver framework for VHDX image support
(i.e. Hyper-V image file formats), that supports opening VHDX files, and
parsing the headers.
This commit does not yet enable:
- reading
- writing
- updating the header
- differencing files (images with parents)
- log replay / dirty logs (only clean images)
This is based on Microsoft's VHDX specification:
"VHDX Format Specification v0.95", published 4/12/2012
https://www.microsoft.com/en-us/download/details.aspx?id=29681
Signed-off-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This is based on Microsoft's VHDX specification:
"VHDX Format Specification v0.95", published 4/12/2012
https://www.microsoft.com/en-us/download/details.aspx?id=29681
These structures define the various header, metadata, and other
block structures defined in the VHDX specification.
Signed-off-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Currently the 'loadvm' opertaion works as following:
1. switch to the snapshot
2. mark current working VDI as a snapshot
3. rely on sd_create_branch to create a new working VDI based on the snapshot
This works not the same as other format as QCOW2. For e.g,
qemu > savevm # get a live snapshot snap1
qemu > savevm # snap2
qemu > loadvm 1 # This will steally create snap3 of the working VDI
Which will result in following snapshot chain:
base <-- snap1 <-- snap2 <-- snap3
^
|
working VDI
snap3 was unnecessarily created and might be annoying users.
This patch discard the unnecessary 'snap3' creation. and implement
rollback(loadvm) operation to the specified snapshot by
1. switch to the snapshot
2. delete working VDI
3. rely on sd_create_branch to create a new working VDI based on the snapshot
The snapshot chain for above example will be:
base <-- snap1 <-- snap2
^
|
working VDI
Cc: qemu-devel@nongnu.org
Cc: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Liu Yuan <tailai.ly@taobao.com>
Reviewed-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
When a snapshot is taken from out side of qemu (e.g. qemu-img
snapshot), write requests to the current vdi return SD_RES_READONLY.
In this case, the sheepdog block driver needs to update the current
inode to the latest one and resend the write requests.
Signed-off-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This adds a helper function to update the current inode state with the
specified vdi object.
Signed-off-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Sheepdog returns SD_RES_READONLY when qemu sends write requests to the
snapshot vdi. This adds the result code and makes sd_strerror() print
its error reason.
Signed-off-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This makes 'filename' and 'tag' constant variables, and renames
'for_snapshot' to 'lock' to clear how it works.
Signed-off-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Commit a9ccedc3 frees the QemuOpts for the driver-specific options
immediately, even though it still needs the filename string that is
contained there. This doesn't work. Move the deletion of the QemuOpts to
the end of the function where its content isn't needed any more.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Cc: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Liu Yuan <tailai.ly@taobao.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Cc: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Liu Yuan <tailai.ly@taobao.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
The 'TRIM' command from VM that is to release underlying data storage for
better thin-provision is already supported by the Sheepdog.
This patch adds the TRIM support at QEMU part.
For older Sheepdog that doesn't support it, we return 0(success) to upper layer.
Cc: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Liu Yuan <tailai.ly@taobao.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
# By Kevin Wolf (16) and Stefan Hajnoczi (4)
# Via Kevin Wolf
* kwolf/for-anthony:
qemu-iotests: add 053 unaligned compressed image size test
block: Allow overriding backing.file.filename
block: Remove filename parameter from .bdrv_file_open()
vvfat: Use bdrv_open options instead of filename
sheepdog: Use bdrv_open options instead of filename
rbd: Use bdrv_open options instead of filename
iscsi: Use bdrv_open options instead of filename
gluster: Use bdrv_open options instead of filename
curl: Use bdrv_open options instead of filename
blkverify: Use bdrv_open options instead of filename
blkdebug: Use bdrv_open options instead of filename
raw-win32: Use bdrv_open options instead of filename
raw-posix: Use bdrv_open options instead of filename
block: Enable filename option
block: Add driver-specific options for backing files
block: Fail gracefully when using a format driver on protocol level
qemu-iotests: Fix _filter_qemu
qemu-img: do not zero-pad the compressed write buffer
qcow: allow sub-cluster compressed write to last cluster
qcow2: allow sub-cluster compressed write to last cluster
Message-id: 1366630294-18984-1-git-send-email-kwolf@redhat.com
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
# By Stefan Hajnoczi
# Via Paolo Bonzini
* bonzini/nbd-next:
nbd: set TCP_NODELAY
nbd: use TCP_CORK in nbd_co_send_request()
nbd: unlock mutex in nbd_co_send_request() error path
Message-id: 1366381830-11267-1-git-send-email-pbonzini@redhat.com
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
This is only to convert the internal interface that is used for passing
the "filename" to be parsed, but converting to actual fine grained
options is left for another day, as it doesn't look trivial.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
This is only to convert the internal interface that is used for passing
the "filename" to be parsed, but converting to actual fine grained
options is left for another day, as it doesn't look trivial.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
This is only to convert the internal interface that is used for passing
the "filename" to be parsed, but converting to actual fine grained
options is left for another day, as it doesn't look trivial.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
This is only to convert the internal interface that is used for passing
the "filename" to be parsed, but converting to actual fine grained
options is left for another day, as it doesn't look trivial.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
As a bonus, going through the QemuOpts QEMU_OPT_SIZE parser for the
readahead option gives us proper error reporting that the previous use
of atoi() lacked.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Options starting in "backing." are passed to the backing file now. If
you don't need to specify the filename for the backing file, you can add
it on the command line instead of in the image file:
$ qemu-nbd -t /tmp/test.img
$ qemu-img create -f qcow2 empty.qcow2 1G
$ qemu-system-x86_64 -drive file=empty.qcow2,backing.file.driver=nbd,\
backing.file.host=localhost
Note that this doesn't override the backing filename from the image. If
the image has one, this will fail because NBD doesn't want the options
and a filename at the same time.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Compression in qcow requires image length to be a multiple of the
cluster size. Lift this requirement by zero-padding the final cluster
when necessary. The virtual disk size is still not cluster-aligned, so
the guest cannot access the zero sectors.
Note that this is almost identical to the qcow2 version of this code.
qcow2's compression code is drawn from qcow.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Compression in qcow2 requires image length to be a multiple of the
cluster size. Lift this requirement by zero-padding the final cluster
when necessary. The virtual disk size is still not cluster-aligned, so
the guest cannot access the zero sectors.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Now gcc will check whether format string and variable arguments match.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Disable the Nagle algorithm to reduce latency. Note this means we must
also use TCP_CORK when sending header followed by payload to avoid
fragmenting lots of little packets. The previous patch took care of
that.
Suggested-by: Nick Thomas <nick@bytemark.co.uk>
Tested-by: Nick Thomas <nick@bytemark.co.uk>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use TCP_CORK to defer packet transmission until both the header and the
payload have been written.
Suggested-by: Nick Thomas <nick@bytemark.co.uk>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The existing bdrv_co_flush_to_disk implementation uses rbd_flush(),
which is sychronous and causes the main qemu thread to block until it
is complete. This results in unresponsiveness and extra latency for
the guest.
Fix this by using an asynchronous version of flush. This was added to
librbd with a special #define to indicate its presence, since it will
be backported to stable versions. Thus, there is no need to check the
version of librbd.
Implement this as bdrv_aio_flush, since it matches other aio functions
in the rbd block driver, and leave out bdrv_co_flush_to_disk when the
asynchronous version is available.
Reported-by: Oliver Francke <oliver@filoo.de>
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
libssh2_sftp_fsync is an extension to libssh2 to support fsync(2) over
sftp, which is itself an extension of OpenSSH.
If both libssh2 and the ssh daemon support it, this will allow
bdrv_flush_to_disk to commit changes through to disk on the remote
server.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
qemu-system-x86_64 -drive file=ssh://hostname/some/image
QEMU will ssh into 'hostname' and open '/some/image' which is made
available as a standard block device.
You can specify a username (ssh://user@host/...) and/or a port number
(ssh://host:port/...). You can also use an alternate syntax using
properties (file.user, file.host, file.port, file.path).
Current limitations:
- Authentication must be done without passwords or passphrases, using
ssh-agent. Other authentication methods are not supported.
- Uses a single connection, instead of concurrent AIO with multiple
SSH connections.
This is implemented using libssh2 on the client side. The server just
requires a regular ssh daemon with sftp-server support. Most ssh
daemons on Unix/Linux systems will work out of the box.
Signed-off-by: Richard W.M. Jones <rjones@redhat.com>
Cc: Stefan Hajnoczi <stefanha@gmail.com>
Cc: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Directly pass the QEMUIOVector on instead of linearising it.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Move aes.h from include/block to include/qemu to show it can be reused
by other subsystems.
Cc: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Edgar E. Iglesias <edgar.iglesias@gmail.com>
Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
Many of these should be cleaned up with proper qdev-/QOM-ification.
Right now there are many catch-all headers in include/hw/ARCH depending
on cpu.h, and this makes it necessary to compile these files per-target.
However, fixing this does not belong in these patches.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
It ignored the error code, and at least the 'goto fail' is obvious
nonsense as it creates an endless loop (if the next attempt doesn't
magically succeed) and leaves the in-memory L1 table in big-endian
instead of converting it back.
In error cases, there's no point in writing an updated L1 table, so
skip this part for them.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The fcntl(fd, F_SETFL, O_NONBLOCK) flag is not specific to sockets.
Rename to qemu_set_nonblock() just like qemu_set_cloexec().
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
Instead of just checking once in exactly this order if there are
dependendies, non-COW clusters and new allocation, this starts looping
around these. This way we can, for example, gather non-COW clusters after
new allocations as long as the host cluster offsets stay contiguous.
Once handle_dependencies() is extended so that COW areas of in-flight
allocations can be overwritten, this allows to continue with gathering
other clusters (we wouldn't be able to do that without this change
because we would have missed a possible second dependency in one of the
next clusters).
This means that in the typical sequential write case, we can combine the
COW overwrite of one cluster with the allocation of the next cluster as
soon as something like Delayed COW gets actually implemented. It is only
by avoiding splitting requests this way that Delayed COW actually starts
improving performance noticably.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>