Commit Graph

114 Commits

Author SHA1 Message Date
Wei Yang 038adc2f58 core: replace getpagesize() with qemu_real_host_page_size
There are three page size in qemu:

  real host page size
  host page size
  target page size

All of them have dedicate variable to represent. For the last two, we
use the same form in the whole qemu project, while for the first one we
use two forms: qemu_real_host_page_size and getpagesize().

qemu_real_host_page_size is defined to be a replacement of
getpagesize(), so let it serve the role.

[Note] Not fully tested for some arch or device.

Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
Message-Id: <20191013021145.16011-3-richardw.yang@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-10-26 15:38:06 +02:00
Anton Nefedov d924559953 qapi: query-blockstat: add driver specific file-posix stats
A block driver can provide a callback to report driver-specific
statistics.

file-posix driver now reports discard statistics

Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Acked-by: Markus Armbruster <armbru@redhat.com>
Message-id: 20190923121737.83281-10-anton.nefedov@virtuozzo.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-10-10 10:56:18 +02:00
Anton Nefedov 1c45036636 file-posix: account discard operations
This will help to identify how many of the user-issued discard operations
(accounted on a device level) have actually suceeded down on the host file
(even though the numbers will not be exactly the same if non-raw format
driver is used (e.g. qcow2 sending metadata discards)).

Note that these numbers will not include discards triggered by
write-zeroes + MAY_UNMAP calls.

Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-id: 20190923121737.83281-9-anton.nefedov@virtuozzo.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-10-10 10:56:18 +02:00
Kevin Wolf effecce6bc file-posix: Fix has_write_zeroes after NO_FALLBACK
If QEMU_AIO_NO_FALLBACK is given, we always return failure and don't
even try to use the BLKZEROOUT ioctl. In this failure case, we shouldn't
disable has_write_zeroes because we didn't learn anything about the
ioctl. The next request might not set QEMU_AIO_NO_FALLBACK and we can
still use the ioctl then.

Fixes: 738301e117
Reported-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
2019-09-10 08:58:43 +02:00
Max Reitz b2c6f23f4a block/file-posix: Reduce xfsctl() use
This patch removes xfs_write_zeroes() and xfs_discard().  Both functions
have been added just before the same feature was present through
fallocate():

- fallocate() has supported PUNCH_HOLE for XFS since Linux 2.6.38 (March
  2011); xfs_discard() was added in December 2010.

- fallocate() has supported ZERO_RANGE for XFS since Linux 3.15 (June
  2014); xfs_write_zeroes() was added in November 2013.

Nowadays, all systems that qemu runs on should support both fallocate()
features (RHEL 7's kernel does).

xfsctl() is still useful for getting the request alignment for O_DIRECT,
so this patch does not remove our dependency on it completely.

Note that xfs_write_zeroes() had a bug: It calls ftruncate() when the
file is shorter than the specified range (because ZERO_RANGE does not
increase the file length).  ftruncate() may yield and then discard data
that parallel write requests have written past the EOF in the meantime.
Dropping the function altogether fixes the bug.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Fixes: 50ba5b2d99
Reported-by: Lukáš Doktor <ldoktor@redhat.com>
Cc: qemu-stable@nongnu.org
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Tested-by: Stefano Garzarella <sgarzare@redhat.com>
Tested-by: John Snow <jsnow@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-09-10 08:58:43 +02:00
Andrey Shinkevich 294682cc3a block: workaround for unaligned byte range in fallocate()
Revert the commit 118f99442d 'block/io.c: fix for the allocation failure'
and use better error handling for file systems that do not support
fallocate() for an unaligned byte range. Allow falling back to pwrite
in case fallocate() returns EINVAL.

Suggested-by: Kevin Wolf <kwolf@redhat.com>
Suggested-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Denis V. Lunev <den@openvz.org>
Message-Id: <1566913973-15490-1-git-send-email-andrey.shinkevich@virtuozzo.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2019-09-05 16:01:31 -05:00
Stefan Hajnoczi 236094c738 file-posix: fix request_alignment typo
Fixes: a6b257a08e
       ("file-posix: Handle undetectable alignment")
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 20190827101328.4062-1-stefanha@redhat.com
Reviewed-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-09-03 14:55:35 +02:00
Nir Soffer 3a20013fbb block: posix: Always allocate the first block
When creating an image with preallocation "off" or "falloc", the first
block of the image is typically not allocated. When using Gluster
storage backed by XFS filesystem, reading this block using direct I/O
succeeds regardless of request length, fooling alignment detection.

In this case we fallback to a safe value (4096) instead of the optimal
value (512), which may lead to unneeded data copying when aligning
requests.  Allocating the first block avoids the fallback.

Since we allocate the first block even with preallocation=off, we no
longer create images with zero disk size:

    $ ./qemu-img create -f raw test.raw 1g
    Formatting 'test.raw', fmt=raw size=1073741824

    $ ls -lhs test.raw
    4.0K -rw-r--r--. 1 nsoffer nsoffer 1.0G Aug 16 23:48 test.raw

And converting the image requires additional cluster:

    $ ./qemu-img measure -f raw -O qcow2 test.raw
    required size: 458752
    fully allocated size: 1074135040

When using format like vmdk with multiple files per image, we allocate
one block per file:

    $ ./qemu-img create -f vmdk -o subformat=twoGbMaxExtentFlat test.vmdk 4g
    Formatting 'test.vmdk', fmt=vmdk size=4294967296 compat6=off hwversion=undefined subformat=twoGbMaxExtentFlat

    $ ls -lhs test*.vmdk
    4.0K -rw-r--r--. 1 nsoffer nsoffer 2.0G Aug 27 03:23 test-f001.vmdk
    4.0K -rw-r--r--. 1 nsoffer nsoffer 2.0G Aug 27 03:23 test-f002.vmdk
    4.0K -rw-r--r--. 1 nsoffer nsoffer  353 Aug 27 03:23 test.vmdk

I did quick performance test for copying disks with qemu-img convert to
new raw target image to Gluster storage with sector size of 512 bytes:

    for i in $(seq 10); do
        rm -f dst.raw
        sleep 10
        time ./qemu-img convert -f raw -O raw -t none -T none src.raw dst.raw
    done

Here is a table comparing the total time spent:

Type    Before(s)   After(s)    Diff(%)
---------------------------------------
real      530.028    469.123      -11.4
user       17.204     10.768      -37.4
sys        17.881      7.011      -60.7

We can see very clear improvement in CPU usage.

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
Message-id: 20190827010528.8818-2-nsoffer@redhat.com
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-09-03 14:55:35 +02:00
Max Reitz 1dcaf52760 block: Implement .bdrv_has_zero_init_truncate()
We need to implement .bdrv_has_zero_init_truncate() for every block
driver that supports truncation and has a .bdrv_has_zero_init()
implementation.

Implement it the same way each driver implements .bdrv_has_zero_init().
This is at least not any more unsafe than what we had before.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-id: 20190724171239.8764-5-mreitz@redhat.com
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-08-19 17:13:26 +02:00
Paolo Bonzini f6fc1e30cf block: fix NetBSD qemu-iotests failure
Opening a block device on NetBSD has an additional step compared to other OSes,
corresponding to raw_normalize_devicepath.  The error message in that function
is slightly different from that in raw_open_common and this was causing spurious
failures in qemu-iotests.  However, in general it is not important to know what
exact step was failing, for example in the qemu-iotests case the error message
contains the fairly unequivocal "No such file or directory" text from strerror.
We can thus fix the failures by standardizing on a single error message for
both raw_open_common and raw_normalize_devicepath; in fact, we can even
use error_setg_file_open to make sure the error message is the same as in
the rest of QEMU.

Message-Id: <20190725095920.28419-1-pbonzini@redhat.com>
Tested-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Signed-off-by: Thomas Huth <thuth@redhat.com>
2019-08-17 09:02:59 +02:00
Nir Soffer a6b257a08e file-posix: Handle undetectable alignment
In some cases buf_align or request_alignment cannot be detected:

1. With Gluster, buf_align cannot be detected since the actual I/O is
   done on Gluster server, and qemu buffer alignment does not matter.
   Since we don't have alignment requirement, buf_align=1 is the best
   value.

2. With local XFS filesystem, buf_align cannot be detected if reading
   from unallocated area. In this we must align the buffer, but we don't
   know what is the correct size. Using the wrong alignment results in
   I/O error.

3. With Gluster backed by XFS, request_alignment cannot be detected if
   reading from unallocated area. In this case we need to use the
   correct alignment, and failing to do so results in I/O errors.

4. With NFS, the server does not use direct I/O, so both buf_align cannot
   be detected. In this case we don't need any alignment so we can use
   buf_align=1 and request_alignment=1.

These cases seems to work when storage sector size is 512 bytes, because
the current code starts checking align=512. If the check succeeds
because alignment cannot be detected we use 512. But this does not work
for storage with 4k sector size.

To determine if we can detect the alignment, we probe first with
align=1. If probing succeeds, maybe there are no alignment requirement
(cases 1, 4) or we are probing unallocated area (cases 2, 3). Since we
don't have any way to tell, we treat this as undetectable alignment. If
probing with align=1 fails with EINVAL, but probing with one of the
expected alignments succeeds, we know that we found a working alignment.

Practically the alignment requirements are the same for buffer
alignment, buffer length, and offset in file. So in case we cannot
detect buf_align, we can use request alignment. If we cannot detect
request alignment, we can fallback to a safe value. To use this logic,
we probe first request alignment instead of buf_align.

Here is a table showing the behaviour with current code (the value in
parenthesis is the optimal value).

Case    Sector    buf_align (opt)   request_alignment (opt)     result
======================================================================
1       512       512   (1)          512   (512)                 OK
1       4096      512   (1)          4096  (4096)                FAIL
----------------------------------------------------------------------
2       512       512   (512)        512   (512)                 OK
2       4096      512   (4096)       4096  (4096)                FAIL
----------------------------------------------------------------------
3       512       512   (1)          512   (512)                 OK
3       4096      512   (1)          512   (4096)                FAIL
----------------------------------------------------------------------
4       512       512   (1)          512   (1)                   OK
4       4096      512   (1)          512   (1)                   OK

Same cases with this change:

Case    Sector    buf_align (opt)   request_alignment (opt)     result
======================================================================
1       512       512   (1)          512   (512)                 OK
1       4096      4096  (1)          4096  (4096)                OK
----------------------------------------------------------------------
2       512       512   (512)        512   (512)                 OK
2       4096      4096  (4096)       4096  (4096)                OK
----------------------------------------------------------------------
3       512       4096  (1)          4096  (512)                 OK
3       4096      4096  (1)          4096  (4096)                OK
----------------------------------------------------------------------
4       512       4096  (1)          4096  (1)                   OK
4       4096      4096  (1)          4096  (1)                   OK

I tested that provisioning VMs and copying disks on local XFS and
Gluster with 4k bytes sector size work now, resolving bugs [1],[2].
I tested also on XFS, NFS, Gluster with 512 bytes sector size.

[1] https://bugzilla.redhat.com/1737256
[2] https://bugzilla.redhat.com/1738657

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-08-16 11:29:11 +02:00
Maxim Levitsky 867eccfed8 file-posix: Use max transfer length/segment count only for SCSI passthrough
Regular kernel block devices (/dev/sda*, /dev/nvme*, etc) don't have
max segment size/max segment count hardware requirements exposed
to the userspace, but rather the kernel block layer
takes care to split the incoming requests that
violate these requirements.

Allowing the kernel to do the splitting allows qemu to avoid
various overheads that arise otherwise from this.

This is especially visible in nbd server,
exposing as a raw file, a mostly empty qcow2 image over the net.
In this case most of the reads by the remote user
won't even hit the underlying kernel block device,
and therefore most of the  overhead will be in the
nbd traffic which increases significantly with lower max transfer size.

In addition to that even for local block device
access the peformance improves a bit due to less
traffic between qemu and the kernel when large
transfer sizes are used (e.g for image conversion)

More info can be found at:
https://bugzilla.redhat.com/show_bug.cgi?id=1647104

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Pankaj Gupta <pagupta@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-07-12 15:42:23 +02:00
Max Reitz 094e363944 file-posix: Update open_flags in raw_set_perm()
raw_check_perm() + raw_set_perm() can change the flags associated with
the current FD.  If so, we have to update BDRVRawState.open_flags
accordingly.  Otherwise, we may keep reopening the FD even though the
current one already has the correct flags.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-06-18 16:41:10 +02:00
Stefano Garzarella abea00533f block/file-posix: update .help of BLOCK_OPT_PREALLOC option
Show 'falloc' among the allowed values of 'preallocation'
option, only when it is supported (if defined CONFIG_POSIX_FALLOCATE)

Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <20190524075848.23781-3-sgarzare@redhat.com>
Signed-off-by: Markus Armbruster <armbru@redhat.com>
2019-06-12 18:31:46 +02:00
Markus Armbruster a8d2532645 Include qemu-common.h exactly where needed
No header includes qemu-common.h after this commit, as prescribed by
qemu-common.h's file comment.

Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <20190523143508.25387-5-armbru@redhat.com>
[Rebased with conflicts resolved automatically, except for
include/hw/arm/xlnx-zynqmp.h hw/arm/nrf51_soc.c hw/arm/msf2-soc.c
block/qcow2-refcount.c block/qcow2-cluster.c block/qcow2-cache.c
target/arm/cpu.h target/lm32/cpu.h target/m68k/cpu.h target/mips/cpu.h
target/moxie/cpu.h target/nios2/cpu.h target/openrisc/cpu.h
target/riscv/cpu.h target/tilegx/cpu.h target/tricore/cpu.h
target/unicore32/cpu.h target/xtensa/cpu.h; bsd-user/main.c and
net/tap-bsd.c fixed up]
2019-06-12 13:20:20 +02:00
Max Reitz 9c3db310ff block/file-posix: Unaligned O_DIRECT block-status
Currently, qemu crashes whenever someone queries the block status of an
unaligned image tail of an O_DIRECT image:
$ echo > foo
$ qemu-img map --image-opts driver=file,filename=foo,cache.direct=on
Offset          Length          Mapped to       File
qemu-img: block/io.c:2093: bdrv_co_block_status: Assertion `*pnum &&
QEMU_IS_ALIGNED(*pnum, align) && align > offset - aligned_offset'
failed.

This is because bdrv_co_block_status() checks that the result returned
by the driver's implementation is aligned to the request_alignment, but
file-posix can fail to do so, which is actually mentioned in a comment
there: "[...] possibly including a partial sector at EOF".

Fix this by rounding up those partial sectors.

There are two possible alternative fixes:
(1) We could refuse to open unaligned image files with O_DIRECT
    altogether.  That sounds reasonable until you realize that qcow2
    does necessarily not fill up its metadata clusters, and that nobody
    runs qemu-img create with O_DIRECT.  Therefore, unpreallocated qcow2
    files usually have an unaligned image tail.

(2) bdrv_co_block_status() could ignore unaligned tails.  It actually
    throws away everything past the EOF already, so that sounds
    reasonable.
    Unfortunately, the block layer knows file lengths only with a
    granularity of BDRV_SECTOR_SIZE, so bdrv_co_block_status() usually
    would have to guess whether its file length information is inexact
    or whether the driver is broken.

Fixing what raw_co_block_status() returns is the safest thing to do.

There seems to be no other block driver that sets request_alignment and
does not make sure that it always returns aligned values.

Cc: qemu-stable@nongnu.org
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-05-20 17:08:57 +02:00
Max Reitz 50ba5b2d99 block/file-posix: Truncate in xfs_write_zeroes()
XFS_IOC_ZERO_RANGE does not increase the file length:
$ touch foo
$ xfs_io -c 'zero 0 65536' foo
$ stat -c "size=%s, blocks=%b" foo
size=0, blocks=128

We do want writes beyond the EOF to automatically increase the file
length, however.  This is evidenced by the fact that iotest 061 is
broken on XFS since qcow2's check implementation checks for blocks
beyond the EOF.

Reported-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-05-20 17:08:56 +02:00
Vladimir Sementsov-Ogievskiy 696aaaed57 block/file-posix: do not fail on unlock bytes
bdrv_replace_child() calls bdrv_check_perm() with error_abort on
loosening permissions. However file-locking operations may fail even
in this case, for example on NFS. And this leads to Qemu crash.

Let's avoid such errors. Note, that we ignore such things anyway on
permission update commit and abort.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-04-02 12:04:44 +02:00
Kevin Wolf 738301e117 file-posix: Support BDRV_REQ_NO_FALLBACK for zero writes
We know that the kernel implements a slow fallback code path for
BLKZEROOUT, so if BDRV_REQ_NO_FALLBACK is given, we shouldn't call it.
The other operations we call in the context of .bdrv_co_pwrite_zeroes
should usually be quick, so no modification should be needed for them.
If we ever notice that there are additional problematic cases, we can
still make these conditional as well.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Acked-by: Eric Blake <eblake@redhat.com>
2019-03-26 11:37:51 +01:00
Peter Maydell dbbc277510 Pull request
* Add 'drop-cache=on|off' option to file-posix.c.  The default is on.
    Disabling the option fixes a QEMU 3.0.0 performance regression when live
    migrating on the same host with cache.direct=off.
 -----BEGIN PGP SIGNATURE-----
 
 iQEcBAABAgAGBQJciOSEAAoJEJykq7OBq3PIVSUIAI6r2Mgoi+no4nle8Jf2nZ+W
 EnQXnNEFyJA0lKRtqQ2UILD9udVdKd/L1PZu5k/Il/Ralto9Yf3+62brekI7rsss
 c3Qusu4LUK6jom2RslRjRIaJ9GilQi/jWezKV/O0VlcsMVemgVHX008EIR+ea1U4
 H0/u2kfu04PciKQ5MR2+6aacu9bfmyH1yM2no+aMN5dDu/38PV6JEsf0Zl2agowg
 opGepJ7YiDQsxH9IBXrbfm38mBrrY0K2vFzAb9BzTHfBPotGMNIZNJNM2FChRfoM
 sTjOIpZz3NDwPEUPQPZxp+7YKRFFYfse1oHtpyh4n1rMQksB019SCGlP9TBhrF0=
 =CH5G
 -----END PGP SIGNATURE-----

Merge remote-tracking branch 'remotes/stefanha/tags/block-pull-request' into staging

Pull request

 * Add 'drop-cache=on|off' option to file-posix.c.  The default is on.
   Disabling the option fixes a QEMU 3.0.0 performance regression when live
   migrating on the same host with cache.direct=off.

# gpg: Signature made Wed 13 Mar 2019 11:07:48 GMT
# gpg:                using RSA key 9CA4ABB381AB73C8
# gpg: Good signature from "Stefan Hajnoczi <stefanha@redhat.com>" [full]
# gpg:                 aka "Stefan Hajnoczi <stefanha@gmail.com>" [full]
# Primary key fingerprint: 8695 A8BF D3F9 7CDA AC35  775A 9CA4 ABB3 81AB 73C8

* remotes/stefanha/tags/block-pull-request:
  file-posix: add drop-cache=on|off option

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2019-03-14 09:34:51 +00:00
Stefan Hajnoczi f357fcd890 file-posix: add drop-cache=on|off option
Commit dd577a26ff ("block/file-posix:
implement bdrv_co_invalidate_cache() on Linux") introduced page cache
invalidation so that cache.direct=off live migration is safe on Linux.

The invalidation takes a significant amount of time when the file is
large and present in the page cache.  Normally this is not the case for
cross-host live migration but it can happen when migrating between QEMU
processes on the same host.

On same-host migration we don't need to invalidate pages for correctness
anyway, so an option to skip page cache invalidation is useful.  I
investigated optimizing invalidation and detecting same-host migration,
but both are hard to achieve so a user-visible option will suffice.

As a bonus this option means that the cache invalidation feature will
now be detectable by libvirt via QMP schema introspection.

Suggested-by: Neil Skrypuch <neil@tembosocial.com>
Tested-by: Neil Skrypuch <neil@tembosocial.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 20190307164941.3322-1-stefanha@redhat.com
Message-Id: <20190307164941.3322-1-stefanha@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2019-03-13 10:54:55 +00:00
Alberto Garcia 8a2ce0bc1e block: Add a 'mutable_opts' field to BlockDriver
If we reopen a BlockDriverState and there is an option that is present
in bs->options but missing from the new set of options then we have to
return an error unless the driver is able to reset it to its default
value.

This patch adds a new 'mutable_opts' field to BlockDriver. This is
a list of runtime options that can be modified during reopen. If an
option in this list is unspecified on reopen then it must be reset (or
return an error).

Signed-off-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-03-12 20:30:14 +01:00
Kevin Wolf 23dece19da file-posix: Make auto-read-only dynamic
Until now, with auto-read-only=on we tried to open the file read-write
first and if that failed, read-only was tried. This is actually not good
enough for libvirt, which gives QEMU SELinux permissions for read-write
only as soon as it actually intends to write to the image. So we need to
be able to switch between read-only and read-write at runtime.

This patch makes auto-read-only dynamic, i.e. the file is opened
read-only as long as no user of the node has requested write
permissions, but it is automatically reopened read-write as soon as the
first writer is attached. Conversely, if the last writer goes away, the
file is reopened read-only again.

bs->read_only is no longer set for auto-read-only=on files even if the
file descriptor is opened read-only because it will be transparently
upgraded as soon as a writer is attached. This changes the output of
qemu-iotests 232.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-03-12 20:30:14 +01:00
Kevin Wolf 6ceabe6f77 file-posix: Prepare permission code for fd switching
In order to be able to dynamically reopen the file read-only or
read-write, depending on the users that are attached, we need to be able
to switch to a different file descriptor during the permission change.

This interacts with reopen, which also creates a new file descriptor and
performs permission changes internally. In this case, the permission
change code must reuse the reopen file descriptor instead of creating a
third one.

In turn, reopen can drop its code to copy file locks to the new file
descriptor because that is now done when applying the new permissions.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-03-12 20:30:14 +01:00
Kevin Wolf a6aeca0ca5 file-posix: Lock new fd in raw_reopen_prepare()
There is no reason why we can take locks on the new file descriptor only
in raw_reopen_commit() where error handling isn't possible any more.
Instead, we can already do this in raw_reopen_prepare().

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-03-12 20:30:14 +01:00
Kevin Wolf e0c9cf3a48 file-posix: Store BDRVRawState.reopen_state during reopen
We'll want to access the file descriptor in the reopen_state while
processing permission changes in the context of the repoen.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-03-12 20:30:14 +01:00
Kevin Wolf 5cec287025 file-posix: Factor out raw_reconfigure_getfd()
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-03-12 20:30:14 +01:00
Laurent Vivier 4f7d28d738 block/file-posix: Convert from DPRINTF() macro to trace events
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-id: 20181213162727.17438-4-lvivier@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-01-31 00:38:19 +01:00
Paolo Bonzini 7d37435bd5 avoid TABs in files that only contain a few
Most files that have TABs only contain a handful of them.  Change
them to spaces so that we don't confuse people.

disas, standard-headers, linux-headers and libdecnumber are imported
from other projects and probably should be exempted from the check.
Outside those, after this patch the following files still contain both
8-space and TAB sequences at the beginning of the line.  Many of them
have a majority of TABs, or were initially committed with all tabs.

    bsd-user/i386/target_syscall.h
    bsd-user/x86_64/target_syscall.h
    crypto/aes.c
    hw/audio/fmopl.c
    hw/audio/fmopl.h
    hw/block/tc58128.c
    hw/display/cirrus_vga.c
    hw/display/xenfb.c
    hw/dma/etraxfs_dma.c
    hw/intc/sh_intc.c
    hw/misc/mst_fpga.c
    hw/net/pcnet.c
    hw/sh4/sh7750.c
    hw/timer/m48t59.c
    hw/timer/sh_timer.c
    include/crypto/aes.h
    include/disas/bfd.h
    include/hw/sh4/sh.h
    libdecnumber/decNumber.c
    linux-headers/asm-generic/unistd.h
    linux-headers/linux/kvm.h
    linux-user/alpha/target_syscall.h
    linux-user/arm/nwfpe/double_cpdo.c
    linux-user/arm/nwfpe/fpa11_cpdt.c
    linux-user/arm/nwfpe/fpa11_cprt.c
    linux-user/arm/nwfpe/fpa11.h
    linux-user/flat.h
    linux-user/flatload.c
    linux-user/i386/target_syscall.h
    linux-user/ppc/target_syscall.h
    linux-user/sparc/target_syscall.h
    linux-user/syscall.c
    linux-user/syscall_defs.h
    linux-user/x86_64/target_syscall.h
    slirp/cksum.c
    slirp/if.c
    slirp/ip.h
    slirp/ip_icmp.c
    slirp/ip_icmp.h
    slirp/ip_input.c
    slirp/ip_output.c
    slirp/mbuf.c
    slirp/misc.c
    slirp/sbuf.c
    slirp/socket.c
    slirp/socket.h
    slirp/tcp_input.c
    slirp/tcpip.h
    slirp/tcp_output.c
    slirp/tcp_subr.c
    slirp/tcp_timer.c
    slirp/tftp.c
    slirp/udp.c
    slirp/udp.h
    target/cris/cpu.h
    target/cris/mmu.c
    target/cris/op_helper.c
    target/sh4/helper.c
    target/sh4/op_helper.c
    target/sh4/translate.c
    tcg/sparc/tcg-target.inc.c
    tests/tcg/cris/check_addo.c
    tests/tcg/cris/check_moveq.c
    tests/tcg/cris/check_swap.c
    tests/tcg/multiarch/test-mmap.c
    ui/vnc-enc-hextile-template.h
    ui/vnc-enc-zywrle.h
    util/envlist.c
    util/readline.c

The following have only TABs:

    bsd-user/i386/target_signal.h
    bsd-user/sparc64/target_signal.h
    bsd-user/sparc64/target_syscall.h
    bsd-user/sparc/target_signal.h
    bsd-user/sparc/target_syscall.h
    bsd-user/x86_64/target_signal.h
    crypto/desrfb.c
    hw/audio/intel-hda-defs.h
    hw/core/uboot_image.h
    hw/sh4/sh7750_regnames.c
    hw/sh4/sh7750_regs.h
    include/hw/cris/etraxfs_dma.h
    linux-user/alpha/termbits.h
    linux-user/arm/nwfpe/fpopcode.h
    linux-user/arm/nwfpe/fpsr.h
    linux-user/arm/syscall_nr.h
    linux-user/arm/target_signal.h
    linux-user/cris/target_signal.h
    linux-user/i386/target_signal.h
    linux-user/linux_loop.h
    linux-user/m68k/target_signal.h
    linux-user/microblaze/target_signal.h
    linux-user/mips64/target_signal.h
    linux-user/mips/target_signal.h
    linux-user/mips/target_syscall.h
    linux-user/mips/termbits.h
    linux-user/ppc/target_signal.h
    linux-user/sh4/target_signal.h
    linux-user/sh4/termbits.h
    linux-user/sparc64/target_syscall.h
    linux-user/sparc/target_signal.h
    linux-user/x86_64/target_signal.h
    linux-user/x86_64/termbits.h
    pc-bios/optionrom/optionrom.h
    slirp/mbuf.h
    slirp/misc.h
    slirp/sbuf.h
    slirp/tcp.h
    slirp/tcp_timer.h
    slirp/tcp_var.h
    target/i386/svm.h
    target/sparc/asi.h
    target/xtensa/core-dc232b/xtensa-modules.inc.c
    target/xtensa/core-dc233c/xtensa-modules.inc.c
    target/xtensa/core-de212/core-isa.h
    target/xtensa/core-de212/xtensa-modules.inc.c
    target/xtensa/core-fsf/xtensa-modules.inc.c
    target/xtensa/core-sample_controller/core-isa.h
    target/xtensa/core-sample_controller/xtensa-modules.inc.c
    target/xtensa/core-test_kc705_be/core-isa.h
    target/xtensa/core-test_kc705_be/xtensa-modules.inc.c
    tests/tcg/cris/check_abs.c
    tests/tcg/cris/check_addc.c
    tests/tcg/cris/check_addcm.c
    tests/tcg/cris/check_addoq.c
    tests/tcg/cris/check_bound.c
    tests/tcg/cris/check_ftag.c
    tests/tcg/cris/check_int64.c
    tests/tcg/cris/check_lz.c
    tests/tcg/cris/check_openpf5.c
    tests/tcg/cris/check_sigalrm.c
    tests/tcg/cris/crisutils.h
    tests/tcg/cris/sys.c
    tests/tcg/i386/test-i386-ssse3.c
    ui/vgafont.h

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20181213223737.11793-3-pbonzini@redhat.com>
Reviewed-by: Aleksandar Markovic <amarkovic@wavecomp.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Wainer dos Santos Moschetta <wainersm@redhat.com>
Acked-by: Richard Henderson <richard.henderson@linaro.org>
Acked-by: Eric Blake <eblake@redhat.com>
Acked-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Stefan Markovic <smarkovic@wavecomp.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-01-11 15:46:56 +01:00
Kevin Wolf 0342567115 file-posix: Avoid aio_worker() for QEMU_AIO_IOCTL
aio_worker() doesn't add anything interesting, it's only a useless
indirection. Call the handler function directly instead.

As we know that this handler function is only called from coroutine
context and the coroutine stays around until the worker thread finishes,
we can keep RawPosixAIOData on the stack.

This was the last user of aio_worker(), so the function goes away now.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-12-14 11:54:49 +01:00
Kevin Wolf 2f3a7ab39b file-posix: Switch to .bdrv_co_ioctl
No real reason to keep using the callback based mechanism here when the
rest of the file-posix driver is coroutine based. Changing it brings
ioctls more in line with how other request types work.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-12-14 11:52:41 +01:00
Kevin Wolf c9db2b6489 file-posix: Remove paio_submit_co()
The function is not used any more, remove it.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-12-14 11:52:41 +01:00
Kevin Wolf 999e6b69ce file-posix: Avoid aio_worker() for QEMU_AIO_READ/WRITE
aio_worker() doesn't add anything interesting, it's only a useless
indirection. Call the handler function directly instead.

As we know that this handler function is only called from coroutine
context and the coroutine stays around until the worker thread finishes,
we can keep RawPosixAIOData on the stack.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-12-14 11:52:41 +01:00
Kevin Wolf 54c7ca1b81 file-posix: Move read/write operation logic out of aio_worker()
aio_worker() for reads and writes isn't boring enough yet. It still does
some postprocessing for handling short reads and turning the result into
the right return value.

However, there is no reason why handle_aiocb_rw() couldn't do the same,
and even without duplicating code between the read and write path. So
move the code there.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-12-14 11:52:41 +01:00
Kevin Wolf 06dc9bd571 file-posix: Avoid aio_worker() for QEMU_AIO_FLUSH
aio_worker() doesn't add anything interesting, it's only a useless
indirection. Call the handler function directly instead.

As we know that this handler function is only called from coroutine
context and the coroutine stays around until the worker thread finishes,
we can keep RawPosixAIOData on the stack.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-12-14 11:52:41 +01:00
Kevin Wolf 46ee0f462b file-posix: Avoid aio_worker() for QEMU_AIO_DISCARD
aio_worker() doesn't add anything interesting, it's only a useless
indirection. Call the handler function directly instead.

As we know that this handler function is only called from coroutine
context and the coroutine stays around until the worker thread finishes,
we can keep RawPosixAIOData on the stack.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-12-14 11:52:41 +01:00
Kevin Wolf 7154d8ae66 file-posix: Avoid aio_worker() for QEMU_AIO_WRITE_ZEROES
aio_worker() doesn't add anything interesting, it's only a useless
indirection. Call the handler function directly instead.

As we know that this handler function is only called from coroutine
context and the coroutine stays around until the worker thread finishes,
we can keep RawPosixAIOData on the stack.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-12-14 11:52:41 +01:00
Kevin Wolf 58a209c437 file-posix: Avoid aio_worker() for QEMU_AIO_COPY_RANGE
aio_worker() doesn't add anything interesting, it's only a useless
indirection. Call the handler function directly instead.

As we know that this handler function is only called from coroutine
context and the coroutine stays around until the worker thread finishes,
we can keep RawPosixAIOData on the stack.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-12-14 11:52:41 +01:00
Kevin Wolf 29cb4c01e7 file-posix: Avoid aio_worker() for QEMU_AIO_TRUNCATE
aio_worker() doesn't add anything interesting, it's only a useless
indirection. Call the handler function directly instead.

As we know that this handler function is only called from coroutine
context and the coroutine stays around until the worker thread finishes,
we can keep RawPosixAIOData on the stack.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-12-14 11:52:41 +01:00
Kevin Wolf 5d5de25005 file-posix: Factor out raw_thread_pool_submit()
Getting the thread pool of the AioContext of a block node and scheduling
some work in it is an operation that is already done twice, and we'll
get more instances. Factor it out into a separate function.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-12-14 11:52:41 +01:00
Kevin Wolf d57c44d00f file-posix: Reorganise RawPosixAIOData
RawPosixAIOData contains a lot of fields for several separate operations
that are to be processed in a worker thread and that need different
parameters. The struct is currently rather unorganised, with unions that
cover some, but not all operations, and even one #define for field names
instead of a union.

Clean this up to have some common fields and a single union. As a side
effect, on x86_64 the struct shrinks from 72 to 48 bytes.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-12-14 11:52:41 +01:00
Max Reitz 577a133988 file-posix: Fix shared locks on reopen commit
s->locked_shared_perm is the set of bits locked in the file, which is
the inverse of the permissions actually shared.  So we need to pass them
as they are to raw_apply_lock_bytes() instead of inverting them again.

Reported-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-11-19 14:32:01 +01:00
Fam Zheng f2e3af29b7 file-posix: Drop s->lock_fd
The lock_fd field is not strictly necessary because transferring locked
bytes from old fd to the new one shouldn't fail anyway. This spares the
user one fd per image.

Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-11-12 17:46:57 +01:00
Fam Zheng 2996ffad3a file-posix: Skip effectiveless OFD lock operations
If we know we've already locked the bytes, don't do it again; similarly
don't unlock a byte if we haven't locked it. This doesn't change the
behavior, but fixes a corner case explained below.

Libvirt had an error handling bug that an image can get its (ownership,
file mode, SELinux) permissions changed (RHBZ 1584982) by mistake behind
QEMU. Specifically, an image in use by Libvirt VM has:

    $ ls -lhZ b.img
    -rw-r--r--. qemu qemu system_u:object_r:svirt_image_t:s0:c600,c690 b.img

Trying to attach it a second time won't work because of image locking.
And after the error, it becomes:

    $ ls -lhZ b.img
    -rw-r--r--. root root system_u:object_r:virt_image_t:s0 b.img

Then, we won't be able to do OFD lock operations with the existing fd.
In other words, the code such as in blk_detach_dev:

    blk_set_perm(blk, 0, BLK_PERM_ALL, &error_abort);

can abort() QEMU, out of environmental changes.

This patch is an easy fix to this and the change is regardlessly
reasonable, so do it.

Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-11-12 17:46:57 +01:00
Fam Zheng db0754df88 file-posix: Use error API properly
Use error_report for situations that affect user operation (i.e.  we're
actually returning error), and warn_report/warn_report_err when some
less critical error happened but the user operation can still carry on.

For raw_normalize_devicepath, add Error parameter to propagate to
its callers.

Suggested-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-11-12 17:46:57 +01:00
Kevin Wolf 64107dc044 file-posix: Support auto-read-only option
If read-only=off, but auto-read-only=on is given, open the file
read-write if we have the permissions, but instead of erroring out for
read-only files, just degrade to read-only.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
2018-11-05 15:09:55 +01:00
Alberto Garcia 8d3245750b file-posix: Forbid trying to change unsupported options during reopen
The file-posix code is used for the "file", "host_device" and
"host_cdrom" drivers, and it allows reopening images. However the only
option that is actually processed is "x-check-cache-dropped", and
changes in all other options (e.g. "filename") are silently ignored:

   (qemu) qemu-io virtio0 "reopen -o file.filename=no-such-file"

While we could allow changing some of the other options, let's keep
things as they are for now but return an error if the user tries to
change any of them.

Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-10-01 12:51:12 +02:00
Alberto Garcia 589f20dccd file-posix: x-check-cache-dropped should default to false on reopen
The default value of x-check-cache-dropped is false. There's no reason
to use the previous value as a default in raw_reopen_prepare() because
bdrv_reopen_queue_child() already takes care of putting the old
options in the BDRVReopenState.options QDict.

If x-check-cache-dropped was previously set but is now missing from
the reopen QDict then it should be reset to false.

Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-10-01 12:51:11 +02:00
Fam Zheng b857431d2a file-posix: Include filename in locking error message
Image locking errors happening at device initialization time doesn't say
which file cannot be locked, for instance,

    -device scsi-disk,drive=drive-1: Failed to get shared "write" lock
    Is another process using the image?

could refer to either the overlay image or its backing image.

Hoist the error_append_hint to the caller of raw_check_lock_bytes where
file name is known, and include it in the error hint.

Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-10-01 12:51:11 +02:00
Kevin Wolf 34fa110e42 file-posix: Fix write_zeroes with unmap on block devices
The BLKDISCARD ioctl doesn't guarantee that the discarded blocks read as
all-zero afterwards, so don't try to abuse it for zero writing. We try
to only use this if BLKDISCARDZEROES tells us that it is safe, but this
is unreliable on older kernels and a constant 0 in newer kernels. In
other words, this code path is never actually used with newer kernels,
so we don't even try to unmap while writing zeros.

This patch removes the abuse of discard for writing zeroes from
file-posix and instead adds a new function that uses interfaces that are
actually meant to deallocate and zero out at the same time. Only if
those fail, it falls back to zeroing out without unmap. We never fall
back to a discard operation any more that may or may not result in
zeros.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-07-30 15:35:37 +02:00