Commit Graph

4381 Commits

Author SHA1 Message Date
Peter Maydell e018ccb3fb Block layer patches:
- file-posix: Fix O_DIRECT alignment detection
 - Fixes for concurrent block jobs
 - block-backend: Queue requests while drained (fix IDE vs. job crashes)
 - qemu-img convert: Deprecate using -n and -o together
 - iotests: Migration tests with filter nodes
 - iotests: More media change tests
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJdVnduAAoJEH8JsnLIjy/W0IgQAKft/M3aDgt0sbTzQh8vdy6A
 yAfTnnSL4Z56+8qAsqhEnplC3rZxvTkg9AGOoNYHOZKl3FgRH9r8g9/Enemh4fWu
 MH52hiRf2ytlFVurIQal3aj9O+i0YTnzuvYbysvkH4ID5zbv2QnwdagtEcBxbbYL
 NZTMZBynDzp4rKIZ7p6T/kkaklLHh4vZrjW+Mzm3LQx9JJr8TwVNqqetSfc4VKIJ
 ByaNbbihDUVjQyIaJ24DXXJdzonGrrtSbSZycturc5FzXymzSRgrXZCeSKCs8X+i
 fjwMXH5v4/UfK511ILsXiumeuxBfD2Ck4sAblFxVo06oMPRNmsAKdRLeDByE7IC1
 lWep/pB3y/au9CW2/pkWJOiaz5s5iuv2fFYidKUJ0KQ1dD7G8M9rzkQlV3FUmTZO
 jBKSxHEffXsYl0ojn0vGmZEd7FAPi3fsZibGGws1dVgxlWI93aUJsjCq0E+lHIRD
 hEmQcjqZZa4taKpj0Y3Me05GkL7tH6RYA153jDNb8rPdzriGRCLZSObEISrOJf8H
 Mh0gTLi8KJNh6bULd12Ake1tKn7ZeTXpHH+gadz9OU7eIModh1qYTSHPlhy5oAv0
 Hm9BikNlS1Hzw+a+EbLcOW7TrsteNeGr7r8T6QKPMq1sfsYcp3svbC2c+zVlQ6Ll
 mLoTssksXOkgBevVqSiS
 =T7L5
 -----END PGP SIGNATURE-----

Merge remote-tracking branch 'remotes/kevin/tags/for-upstream' into staging

Block layer patches:

- file-posix: Fix O_DIRECT alignment detection
- Fixes for concurrent block jobs
- block-backend: Queue requests while drained (fix IDE vs. job crashes)
- qemu-img convert: Deprecate using -n and -o together
- iotests: Migration tests with filter nodes
- iotests: More media change tests

# gpg: Signature made Fri 16 Aug 2019 10:29:18 BST
# gpg:                using RSA key 7F09B272C88F2FD6
# gpg: Good signature from "Kevin Wolf <kwolf@redhat.com>" [full]
# Primary key fingerprint: DC3D EB15 9A9A F95D 3D74  56FE 7F09 B272 C88F 2FD6

* remotes/kevin/tags/for-upstream:
  file-posix: Handle undetectable alignment
  qemu-img convert: Deprecate using -n and -o together
  block-backend: Queue requests while drained
  mirror: Keep mirror_top_bs drained after dropping permissions
  block: Remove blk_pread_unthrottled()
  iotests: Add test for concurrent stream/commit
  tests: Test mid-drain bdrv_replace_child_noperm()
  tests: Test polling in bdrv_drop_intermediate()
  block: Reduce (un)drains when replacing a child
  block: Keep subtree drained in drop_intermediate
  block: Simplify bdrv_filter_default_perms()
  iotests: Test migration with all kinds of filter nodes
  iotests: Move migration helpers to iotests.py
  iotests/118: Add -blockdev based tests
  iotests/118: Create test classes dynamically
  iotests/118: Test media change for scsi-cd

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2019-08-16 16:43:46 +01:00
Peter Maydell c6a2225a5a nbd patches for 2019-08-15
- Addition of InetSocketAddress keep-alive
 - Addition of BDRV_REQ_PREFETCH for more efficient copy-on-read
 - Initial refactoring in preparation of NBD reconnect
 -----BEGIN PGP SIGNATURE-----
 
 iQEcBAABCAAGBQJdVaRZAAoJEKeha0olJ0NqrGoIAJSvVLMDeWZIkHr3CQ5AbMHy
 6IHUntBwv4PEHw0FyyDU7lLgEWubTwe/7RfvyJ69kQYSJLjvHa3KEic0aa7SOETK
 hGUlSoIFHEugi+XDcYyy9EG+ItUR7jnunkwomxvFRm4XzjEHFO9ck8fOS+uq/23e
 LGDHwdoZI6vawUPftbBuRAlB3egCEcBtTWXYMk8lm3MXHOHL7O18DRkfWvwcHfl6
 mNIKgTVMtl1gYoJznCUmC5VLHL4jQy+kSNXnyHBQOEEvTcORu0EztJS81H+BODni
 sxa9seem7JL9NLUTmkJsbGfSM6RKdfypX34oik9yakqUnXRrlxkxI+IX26XfdQ4=
 =2MAO
 -----END PGP SIGNATURE-----

Merge remote-tracking branch 'remotes/ericb/tags/pull-nbd-2019-08-15' into staging

nbd patches for 2019-08-15

- Addition of InetSocketAddress keep-alive
- Addition of BDRV_REQ_PREFETCH for more efficient copy-on-read
- Initial refactoring in preparation of NBD reconnect

# gpg: Signature made Thu 15 Aug 2019 19:28:41 BST
# gpg:                using RSA key A7A16B4A2527436A
# gpg: Good signature from "Eric Blake <eblake@redhat.com>" [full]
# gpg:                 aka "Eric Blake (Free Software Programmer) <ebb9@byu.net>" [full]
# gpg:                 aka "[jpeg image of size 6874]" [full]
# Primary key fingerprint: 71C2 CC22 B1C4 6029 27D2  F3AA A7A1 6B4A 2527 436A

* remotes/ericb/tags/pull-nbd-2019-08-15:
  block/nbd: refactor nbd connection parameters
  block/nbd: add cmdline and qapi parameter reconnect-delay
  block/nbd: move from quit to state
  block/nbd: use non-blocking io channel for nbd negotiation
  block/nbd: split connection_co start out of nbd_client_connect
  nbd: improve CMD_CACHE: use BDRV_REQ_PREFETCH
  block/stream: use BDRV_REQ_PREFETCH
  block: implement BDRV_REQ_PREFETCH
  qapi: Add InetSocketAddress member keep-alive

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2019-08-16 15:53:37 +01:00
Markus Armbruster 54d31236b9 sysemu: Split sysemu/runstate.h off sysemu/sysemu.h
sysemu/sysemu.h is a rather unfocused dumping ground for stuff related
to the system-emulator.  Evidence:

* It's included widely: in my "build everything" tree, changing
  sysemu/sysemu.h still triggers a recompile of some 1100 out of 6600
  objects (not counting tests and objects that don't depend on
  qemu/osdep.h, down from 5400 due to the previous two commits).

* It pulls in more than a dozen additional headers.

Split stuff related to run state management into its own header
sysemu/runstate.h.

Touching sysemu/sysemu.h now recompiles some 850 objects.  qemu/uuid.h
also drops from 1100 to 850, and qapi/qapi-types-run-state.h from 4400
to 4200.  Touching new sysemu/runstate.h recompiles some 500 objects.

Since I'm touching MAINTAINERS to add sysemu/runstate.h anyway, also
add qemu/main-loop.h.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <20190812052359.30071-30-armbru@redhat.com>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
[Unbreak OS-X build]
2019-08-16 13:37:36 +02:00
Markus Armbruster d5938f29fe Clean up inclusion of sysemu/sysemu.h
In my "build everything" tree, changing sysemu/sysemu.h triggers a
recompile of some 5400 out of 6600 objects (not counting tests and
objects that don't depend on qemu/osdep.h).

Almost a third of its inclusions are actually superfluous.  Delete
them.  Downgrade two more to qapi/qapi-types-run-state.h, and move one
from char/serial.h to char/serial.c.

hw/semihosting/config.c, monitor/monitor.c, qdev-monitor.c, and
stubs/semihost.c define variables declared in sysemu/sysemu.h without
including it.  The compiler is cool with that, but include it anyway.

This doesn't reduce actual use much, as it's still included into
widely included headers.  The next commit will tackle that.

Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Alistair Francis <alistair.francis@wdc.com>
Message-Id: <20190812052359.30071-27-armbru@redhat.com>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
2019-08-16 13:31:53 +02:00
Markus Armbruster db72581598 Include qemu/main-loop.h less
In my "build everything" tree, changing qemu/main-loop.h triggers a
recompile of some 5600 out of 6600 objects (not counting tests and
objects that don't depend on qemu/osdep.h).  It includes block/aio.h,
which in turn includes qemu/event_notifier.h, qemu/notify.h,
qemu/processor.h, qemu/qsp.h, qemu/queue.h, qemu/thread-posix.h,
qemu/thread.h, qemu/timer.h, and a few more.

Include qemu/main-loop.h only where it's needed.  Touching it now
recompiles only some 1700 objects.  For block/aio.h and
qemu/event_notifier.h, these numbers drop from 5600 to 2800.  For the
others, they shrink only slightly.

Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <20190812052359.30071-21-armbru@redhat.com>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Tested-by: Philippe Mathieu-Daudé <philmd@redhat.com>
2019-08-16 13:31:52 +02:00
Markus Armbruster 13d4ff07e8 trace: Do not include qom/cpu.h into generated trace.h
docs/devel/tracing.txt explains "since many source files include
trace.h, [the generated trace.h use] a minimum of types and other
header files included to keep the namespace clean and compile times
and dependencies down."

Commit 4815185902 "trace: Add per-vCPU tracing states for events with
the 'vcpu' property" made them all include qom/cpu.h via
control-internal.h.  qom/cpu.h in turn includes about thirty headers.
Ouch.

Per-vCPU tracing is currently not supported in sub-directories'
trace-events.  In other words, qom/cpu.h can only be used in
trace-root.h, not in any trace.h.

Split trace/control-vcpu.h off trace/control.h and
trace/control-internal.h.  Have the generated trace.h include
trace/control.h (which no longer includes qom/cpu.h), and trace-root.h
include trace/control-vcpu.h (which includes it).

The resulting improvement is a bit disappointing: in my "build
everything" tree, some 1100 out of 6600 objects (not counting tests
and objects that don't depend on qemu/osdep.h) depend on a trace.h,
and about 600 of them no longer depend on qom/cpu.h.  But more than
1300 others depend on trace-root.h.  More work is clearly needed.
Left for another day.

Cc: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Tested-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20190812052359.30071-8-armbru@redhat.com>
2019-08-16 13:31:52 +02:00
Nir Soffer a6b257a08e file-posix: Handle undetectable alignment
In some cases buf_align or request_alignment cannot be detected:

1. With Gluster, buf_align cannot be detected since the actual I/O is
   done on Gluster server, and qemu buffer alignment does not matter.
   Since we don't have alignment requirement, buf_align=1 is the best
   value.

2. With local XFS filesystem, buf_align cannot be detected if reading
   from unallocated area. In this we must align the buffer, but we don't
   know what is the correct size. Using the wrong alignment results in
   I/O error.

3. With Gluster backed by XFS, request_alignment cannot be detected if
   reading from unallocated area. In this case we need to use the
   correct alignment, and failing to do so results in I/O errors.

4. With NFS, the server does not use direct I/O, so both buf_align cannot
   be detected. In this case we don't need any alignment so we can use
   buf_align=1 and request_alignment=1.

These cases seems to work when storage sector size is 512 bytes, because
the current code starts checking align=512. If the check succeeds
because alignment cannot be detected we use 512. But this does not work
for storage with 4k sector size.

To determine if we can detect the alignment, we probe first with
align=1. If probing succeeds, maybe there are no alignment requirement
(cases 1, 4) or we are probing unallocated area (cases 2, 3). Since we
don't have any way to tell, we treat this as undetectable alignment. If
probing with align=1 fails with EINVAL, but probing with one of the
expected alignments succeeds, we know that we found a working alignment.

Practically the alignment requirements are the same for buffer
alignment, buffer length, and offset in file. So in case we cannot
detect buf_align, we can use request alignment. If we cannot detect
request alignment, we can fallback to a safe value. To use this logic,
we probe first request alignment instead of buf_align.

Here is a table showing the behaviour with current code (the value in
parenthesis is the optimal value).

Case    Sector    buf_align (opt)   request_alignment (opt)     result
======================================================================
1       512       512   (1)          512   (512)                 OK
1       4096      512   (1)          4096  (4096)                FAIL
----------------------------------------------------------------------
2       512       512   (512)        512   (512)                 OK
2       4096      512   (4096)       4096  (4096)                FAIL
----------------------------------------------------------------------
3       512       512   (1)          512   (512)                 OK
3       4096      512   (1)          512   (4096)                FAIL
----------------------------------------------------------------------
4       512       512   (1)          512   (1)                   OK
4       4096      512   (1)          512   (1)                   OK

Same cases with this change:

Case    Sector    buf_align (opt)   request_alignment (opt)     result
======================================================================
1       512       512   (1)          512   (512)                 OK
1       4096      4096  (1)          4096  (4096)                OK
----------------------------------------------------------------------
2       512       512   (512)        512   (512)                 OK
2       4096      4096  (4096)       4096  (4096)                OK
----------------------------------------------------------------------
3       512       4096  (1)          4096  (512)                 OK
3       4096      4096  (1)          4096  (4096)                OK
----------------------------------------------------------------------
4       512       4096  (1)          4096  (1)                   OK
4       4096      4096  (1)          4096  (1)                   OK

I tested that provisioning VMs and copying disks on local XFS and
Gluster with 4k bytes sector size work now, resolving bugs [1],[2].
I tested also on XFS, NFS, Gluster with 512 bytes sector size.

[1] https://bugzilla.redhat.com/1737256
[2] https://bugzilla.redhat.com/1738657

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-08-16 11:29:11 +02:00
Kevin Wolf cf3129323f block-backend: Queue requests while drained
This fixes devices like IDE that can still start new requests from I/O
handlers in the CPU thread while the block backend is drained.

The basic assumption is that in a drain section, no new requests should
be allowed through a BlockBackend (blk_drained_begin/end don't exist,
we get drain sections only on the node level). However, there are two
special cases where requests should not be queued:

1. Block jobs: We already make sure that block jobs are paused in a
   drain section, so they won't start new requests. However, if the
   drain_begin is called on the job's BlockBackend first, it can happen
   that we deadlock because the job stays busy until it reaches a pause
   point - which it can't if its requests aren't processed any more.

   The proper solution here would be to make all requests through the
   job's filter node instead of using a BlockBackend. For now, just
   disabling request queuing on the job BlockBackend is simpler.

2. In test cases where making requests through bdrv_* would be
   cumbersome because we'd need a BdrvChild. As we already got the
   functionality to disable request queuing from 1., use it in tests,
   too, for convenience.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
2019-08-16 10:25:16 +02:00
Kevin Wolf d2da5e288a mirror: Keep mirror_top_bs drained after dropping permissions
mirror_top_bs is currently implicitly drained through its connection to
the source or the target node. However, the drain section for target_bs
ends early after moving mirror_top_bs from src to target_bs, so that
requests can already be restarted while mirror_top_bs is still present
in the chain, but has dropped all permissions and therefore runs into an
assertion failure like this:

    qemu-system-x86_64: block/io.c:1634: bdrv_co_write_req_prepare:
    Assertion `child->perm & BLK_PERM_WRITE' failed.

Keep mirror_top_bs drained until all graph changes have completed.

Cc: qemu-stable@nongnu.org
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
2019-08-16 10:25:16 +02:00
Kevin Wolf 421919d76b block: Remove blk_pread_unthrottled()
The functionality offered by blk_pread_unthrottled() goes back to commit
498e386c58. Then, we couldn't perform I/O throttling with synchronous
requests because timers wouldn't be executed in polling loops. So the
commit automatically disabled I/O throttling as soon as a synchronous
request was issued.

However, for geometry detection during disk initialisation, we always
used (and still use) synchronous requests even if guest requests use AIO
later. Geometry detection was not wanted to disable I/O throttling, so
bdrv_pread_unthrottled() was introduced which disabled throttling only
temporarily.

All of this isn't necessary any more because we do run timers in polling
loop and even synchronous requests are now using coroutine
infrastructure internally. For this reason, commit 90c78624f already
removed the automatic disabling of I/O throttling.

It's time to get rid of the workaround for the removed code, and its
abuse of blk_root_drained_begin()/end(), as well.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
2019-08-16 10:25:16 +02:00
Vladimir Sementsov-Ogievskiy 8f071c9db5 block/nbd: refactor nbd connection parameters
We'll need some connection parameters to be available all the time to
implement nbd reconnect. So, let's refactor them: define additional
parameters in BDRVNBDState, drop them from function parameters, drop
nbd_client_init and separate options parsing instead from nbd_open.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20190618114328.55249-6-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
[eblake: Drop useless 'if' before object_unref]
Signed-off-by: Eric Blake <eblake@redhat.com>
2019-08-15 13:22:14 -05:00
Vladimir Sementsov-Ogievskiy b172ae2e0e block/nbd: add cmdline and qapi parameter reconnect-delay
Reconnect will be implemented in the following commit, so for now,
in semantics below, disconnect itself is a "serious error".

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20190618114328.55249-5-vsementsov@virtuozzo.com>
[eblake: slipped from 4.1 to 4.2]
Signed-off-by: Eric Blake <eblake@redhat.com>
2019-08-15 13:22:14 -05:00
Vladimir Sementsov-Ogievskiy a34b1e5e06 block/nbd: move from quit to state
To implement reconnect we need several states for the client:
CONNECTED, QUIT and two different CONNECTING states. CONNECTING states
will be added in the following patches. This patch implements CONNECTED
and QUIT.

QUIT means, that we should close the connection and fail all current
and further requests (like old quit = true).

CONNECTED means that connection is ok, we can send requests (like old
quit = false).

For receiving loop we use a comparison of the current state with QUIT,
because reconnect will be in the same loop, so it should be looping
until the end.

Opposite, for requests we use a comparison of the current state with
CONNECTED, as we don't want to send requests in future CONNECTING
states.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20190618114328.55249-4-vsementsov@virtuozzo.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2019-08-15 13:22:14 -05:00
Vladimir Sementsov-Ogievskiy a8e2bb6a76 block/nbd: use non-blocking io channel for nbd negotiation
No reason to use blocking channel for negotiation and we'll benefit in
further reconnect feature, as qio_channel reads and writes will do
qemu_coroutine_yield while waiting for io completion.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20190618114328.55249-3-vsementsov@virtuozzo.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2019-08-15 13:22:14 -05:00
Vladimir Sementsov-Ogievskiy 962b7b3d4c block/nbd: split connection_co start out of nbd_client_connect
nbd_client_connect is going to be used from connection_co, so, let's
refactor nbd_client_connect in advance, leaving io channel
configuration all in nbd_client_connect.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20190618114328.55249-2-vsementsov@virtuozzo.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2019-08-15 13:22:13 -05:00
Vladimir Sementsov-Ogievskiy 99136607b1 block/stream: use BDRV_REQ_PREFETCH
This helps to avoid extra io, allocations and memory copying.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20190725100550.33801-3-vsementsov@virtuozzo.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
[eblake: fix comment grammar]
Signed-off-by: Eric Blake <eblake@redhat.com>
2019-08-15 13:22:13 -05:00
Vladimir Sementsov-Ogievskiy 3299e5ecf7 block: implement BDRV_REQ_PREFETCH
Do effective copy-on-read request when we don't need data actually. It
will be used for block-stream and NBD_CMD_CACHE.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20190725100550.33801-2-vsementsov@virtuozzo.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
[eblake: comment grammar fix]
Signed-off-by: Eric Blake <eblake@redhat.com>
2019-08-15 13:22:13 -05:00
Vladimir Sementsov-Ogievskiy 110571be4e block/backup: disable copy_range for compressed backup
Enabled by default copy_range ignores compress option. It's definitely
unexpected for user.

It's broken since introduction of copy_range usage in backup in
9ded4a0114.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-id: 20190730163251.755248-3-vsementsov@virtuozzo.com
Reviewed-by: John Snow <jsnow@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Cc: qemu-stable@nongnu.org
Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-08-06 13:17:27 +02:00
Max Reitz 9adc1cb49a mirror: Only mirror granularity-aligned chunks
In write-blocking mode, all writes to the top node directly go to the
target.  We must only mirror chunks of data that are aligned to the
job's granularity, because that is how the dirty bitmap works.
Therefore, the request alignment for writes must be the job's
granularity (in write-blocking mode).

Unfortunately, this forces all reads and writes to have the same
granularity (we only need this alignment for writes to the target, not
the source), but that is something to be fixed another time.

Cc: qemu-stable@nongnu.org
Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-id: 20190805153308.2657-1-mreitz@redhat.com
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Fixes: d06107ade0
Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-08-06 13:17:25 +02:00
Max Reitz 4a5b91ca02 backup: Copy only dirty areas
The backup job must only copy areas that the copy_bitmap reports as
dirty.  This is always the case when using traditional non-offloading
backup, because it copies each cluster separately.  When offloading the
copy operation, we sometimes copy more than one cluster at a time, but
we only check whether the first one is dirty.

Therefore, whenever copy offloading is possible, the backup job
currently produces wrong output when the guest writes to an area of
which an inner part has already been backed up, because that inner part
will be re-copied.

Fixes: 9ded4a0114
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-id: 20190801173900.23851-2-mreitz@redhat.com
Cc: qemu-stable@nongnu.org
Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-08-06 13:17:01 +02:00
Max Reitz 1120407bdf nvme: Limit blkshift to 12 (for 4 kB blocks)
Linux does not support blocks greater than 4 kB anyway, so we might as
well limit blkshift to 12 and thus save us from some potential trouble.

Reported-by: Peter Maydell <peter.maydell@linaro.org>
Suggested-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-id: 20190730114812.10493-1-mreitz@redhat.com
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Coverity: CID 1403771
Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-07-30 14:49:24 +02:00
Kevin Wolf 2b23f28639 block/copy-on-read: Fix permissions for inactive node
The copy-on-read drive must not request the WRITE_UNCHANGED permission
for its child if the node is inactive, otherwise starting a migration
destination with -incoming will fail because the child cannot provide
write access yet:

  qemu-system-x86_64: -blockdev copy-on-read,file=img,node-name=cor: Block node is read-only

Earlier QEMU versions additionally ran into an abort() on the migration
source side: bdrv_inactivate_recurse() failed to update permissions.
This is silently ignored today because it was only supposed to loosen
restrictions. This is the symptom that was originally reported here:

  https://bugzilla.redhat.com/show_bug.cgi?id=1733022

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
2019-07-30 12:25:43 +02:00
Max Reitz 65181d6381 block: Dec. drained_end_counter before bdrv_wakeup
Decrementing drained_end_counter after bdrv_dec_in_flight() (which in
turn invokes bdrv_wakeup() and thus aio_wait_kick()) is not very clever.
We should decrement it beforehand, so that any waiting aio_poll() that
is woken by bdrv_dec_in_flight() sees the decremented
drained_end_counter.

Because the time window between decrementing drained_end_counter and
aio_wait_kick() is very small, I cannot supply a reliable regression
test.  However, running e.g. the /bdrv-drain/blockjob/iothread/drain_all
test in test-bdrv-drain has a small chance of hanging without this
patch (about 1/200 or so; it gets to nearly 100 % if you add e.g. an
fputc(' ', stderr); after the bdrv_dec_in_flight()).

Fixes: e037c09c78
Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-id: 20190722133054.21781-2-mreitz@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-07-22 18:41:35 +02:00
Maxim Levitsky 258867d1dc block/nvme: don't touch the completion entries
Completion entries are meant to be only read by the host and written by the device.
The driver is supposed to scan the completions from the last point where it left,
and until it sees a completion with non flipped phase bit.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-id: 20190716163020.13383-4-mlevitsk@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-07-22 18:40:32 +02:00
Maxim Levitsky 118d1b6a81 block/nvme: support larger that 512 bytes sector devices
Currently the driver hardcodes the sector size to 512,
and doesn't check the underlying device. Fix that.

Also fail if underlying nvme device is formatted with metadata
as this needs special support.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-id: 20190716163020.13383-3-mlevitsk@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-07-22 18:40:32 +02:00
Maxim Levitsky 461bba04bf block/nvme: fix doorbell stride
Fix the math involving non standard doorbell stride

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-id: 20190716163020.13383-2-mlevitsk@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-07-22 18:40:32 +02:00
Peter Maydell b9e02bb3f9 nbd patches for 2019-07-19
- silence harmless compiler/valgrind warning
 -----BEGIN PGP SIGNATURE-----
 
 iQEcBAABCAAGBQJdMiVIAAoJEKeha0olJ0NqhSkH/RF4EViyGi/BW9rHHmZIKQjL
 h71g+Y5l0QvTHW2qkU9UCjYC7lrhTnD8r06v2qQZCk37Rb0z/y8BGpcZQVE92nPt
 GW5VbqWh4IdNCMUNXlfFo/U2t+hJL+BAAd8OkolKcDU4FUuN9QGkvQztBLb6Edzk
 oz4/NbnFsd87TRI61EjOEspTpTO6cukLRmE0HIsL8KOaYo3E7QhgxRvR45Y58sbN
 TvUTpI0teHzo4MiXD/yRH8oDz7zKttFwAj3E0oU9IcVTyBJXpE8lNA4J2a65KdbH
 S+43tlNDJA1a5+OqyfzFHcgz6nLCcntU4+5LH7YfTy753EurDVqsi7vTo4lJqHM=
 =uPfp
 -----END PGP SIGNATURE-----

Merge remote-tracking branch 'remotes/ericb/tags/pull-nbd-2019-07-19' into staging

nbd patches for 2019-07-19

- silence harmless compiler/valgrind warning

# gpg: Signature made Fri 19 Jul 2019 21:17:12 BST
# gpg:                using RSA key A7A16B4A2527436A
# gpg: Good signature from "Eric Blake <eblake@redhat.com>" [full]
# gpg:                 aka "Eric Blake (Free Software Programmer) <ebb9@byu.net>" [full]
# gpg:                 aka "[jpeg image of size 6874]" [full]
# Primary key fingerprint: 71C2 CC22 B1C4 6029 27D2  F3AA A7A1 6B4A 2527 436A

* remotes/ericb/tags/pull-nbd-2019-07-19:
  nbd: Initialize reply on failure

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2019-07-22 10:11:28 +01:00
Eric Blake 5cf42b1c1f nbd: Initialize reply on failure
We've had two separate reports of different callers running into use
of uninitialized data if s->quit is set (one detected by gcc -O3,
another by valgrind), due to checking 'nbd_reply_is_simple(reply) ||
s->quit' in the wrong order. Rather than chasing down which callers
need to pre-initialize reply, and whether there are any other
uninitialized uses, it's easier to guarantee that reply will always be
set by nbd_co_receive_one_chunk() even on failure.

The uninitialized use happens to be harmless (the only time the
variable is uninitialized is if s->quit is set, so the conditional
results in the same action regardless of what was read from reply),
and was introduced in commit 65e01d47.

In fixing the problem, it can also be seen that all (one) callers pass
in a non-NULL reply, so there is a dead conditional to also be cleaned
up.

Reported-by: Thomas Huth <thuth@redhat.com>
Reported-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-Id: <20190719172001.19770-1-eblake@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
2019-07-19 13:19:18 -05:00
Max Reitz 61ad631cee block: Loop unsafely in bdrv*drained_end()
The graph must not change in these loops (or a QLIST_FOREACH_SAFE would
not even be enough).  We now ensure this by only polling once in the
root bdrv_drained_end() call, so we can drop the _SAFE suffix.  Doing so
makes it clear that the graph must not change.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-07-19 13:19:17 +02:00
Max Reitz e037c09c78 block: Do not poll in bdrv_do_drained_end()
We should never poll anywhere in bdrv_do_drained_end() (including its
recursive callees like bdrv_drain_invoke()), because it does not cope
well with graph changes.  In fact, it has been written based on the
postulation that no graph changes will happen in it.

Instead, the callers that want to poll must poll, i.e. all currently
globally available wrappers: bdrv_drained_end(),
bdrv_subtree_drained_end(), bdrv_unapply_subtree_drain(), and
bdrv_drain_all_end().  Graph changes there do not matter.

They can poll simply by passing a pointer to a drained_end_counter and
wait until it reaches 0.

This patch also adds a non-polling global wrapper for
bdrv_do_drained_end() that takes a drained_end_counter pointer.  We need
such a variant because now no function called anywhere from
bdrv_do_drained_end() must poll.  This includes
BdrvChildRole.drained_end(), which already must not poll according to
its interface documentation, but bdrv_child_cb_drained_end() just
violates that by invoking bdrv_drained_end() (which does poll).
Therefore, BdrvChildRole.drained_end() must take a *drained_end_counter
parameter, which bdrv_child_cb_drained_end() can pass on to the new
bdrv_drained_end_no_poll() function.

Note that we now have a pattern of all drained_end-related functions
either polling or receiving a *drained_end_counter to let the caller
poll based on that.

A problem with a single poll loop is that when the drained section in
bdrv_set_aio_context_ignore() ends, some nodes in the subgraph may be in
the old contexts, while others are in the new context already.  To let
the collective poll in bdrv_drained_end() work correctly, we must not
hold a lock to the old context, so that the old context can make
progress in case it is different from the current context.

(In the process, remove the comment saying that the current context is
always the old context, because it is wrong.)

In all other places, all nodes in a subtree must be in the same context,
so we can just poll that.  The exception of course is
bdrv_drain_all_end(), but that always runs in the main context, so we
can just poll NULL (like bdrv_drain_all_begin() does).

Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-07-19 13:19:16 +02:00
Max Reitz f4c8a43be0 block: Make bdrv_parent_drained_[^_]*() static
These functions are not used outside of block/io.c, there is no reason
why they should be globally available.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-07-19 13:19:16 +02:00
Max Reitz 8e1da77e6e block: Add @drained_end_counter
Callers can now pass a pointer to an integer that bdrv_drain_invoke()
(and its recursive callees) will increment for every
bdrv_drain_invoke_entry() operation they schedule.
bdrv_drain_invoke_entry() in turn will decrement it once it has invoked
BlockDriver.bdrv_co_drain_end().

We use atomic operations to access the pointee, because the
bdrv_do_drained_end() caller may wish to end drained sections for
multiple nodes in different AioContexts (bdrv_drain_all_end() does, for
example).

This is the first step to moving the polling for BdrvCoDrainData.done to
become true out of bdrv_drain_invoke() and into the root drained_end
function.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-07-19 13:19:16 +02:00
Max Reitz 804db8ea00 block: Introduce BdrvChild.parent_quiesce_counter
Commit 5cb2737e92 laid out why
bdrv_do_drained_end() must decrement the quiesce_counter after
bdrv_drain_invoke().  It did not give a very good reason why it has to
happen after bdrv_parent_drained_end(), instead only claiming symmetry
to bdrv_do_drained_begin().

It turns out that delaying it for so long is wrong.

Situation: We have an active commit job (i.e. a mirror job) from top to
base for the following graph:

                  filter
                    |
                  [file]
                    |
                    v
top --[backing]--> base

Now the VM is closed, which results in the job being cancelled and a
bdrv_drain_all() happening pretty much simultaneously.

Beginning the drain means the job is paused once whenever one of its
nodes is quiesced.  This is reversed when the drain ends.

With how the code currently is, after base's drain ends (which means
that it will have unpaused the job once), its quiesce_counter remains at
1 while it goes to undrain its parents (bdrv_parent_drained_end()).  For
some reason or another, undraining filter causes the job to be kicked
and enter mirror_exit_common(), where it proceeds to invoke
block_job_remove_all_bdrv().

Now base will be detached from the job.  Because its quiesce_counter is
still 1, it will unpause the job once more.  So in total, undraining
base will unpause the job twice.  Eventually, this will lead to the
job's pause_count going negative -- well, it would, were there not an
assertion against this, which crashes qemu.

The general problem is that if in bdrv_parent_drained_end() we undrain
parent A, and then undrain parent B, which then leads to A detaching the
child, bdrv_replace_child_noperm() will undrain A as if we had not done
so yet; that is, one time too many.

It follows that we cannot decrement the quiesce_counter after invoking
bdrv_parent_drained_end().

Unfortunately, decrementing it before bdrv_parent_drained_end() would be
wrong, too.  Imagine the above situation in reverse: Undraining A leads
to B detaching the child.  If we had already decremented the
quiesce_counter by that point, bdrv_replace_child_noperm() would undrain
B one time too little; because it expects bdrv_parent_drained_end() to
issue this undrain.  But bdrv_parent_drained_end() won't do that,
because B is no longer a parent.

Therefore, we have to do something else.  This patch opts for
introducing a second quiesce_counter that counts how many times a
child's parent has been quiesced (though c->role->drained_*).  With
that, bdrv_replace_child_noperm() just has to undrain the parent exactly
that many times when removing a child, and it will always be right.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-07-19 13:19:16 +02:00
Peter Maydell 697f59243f * VFIO bugfix for AMD SEV (Alex)
* Kconfig improvements (Julio, Philippe)
 * MemoryRegion reference counting bugfix (King Wang)
 * Build system cleanups (Marc-André, myself)
 * rdmacm-mux off-by-one (Marc-André)
 * ZBC passthrough fixes (Shinichiro, myself)
 * WHPX build fix (Stefan)
 * char-pty fix (Wei Yang)
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQEcBAABAgAGBQJdLX1PAAoJEL/70l94x66DQ1YH/im8BbRRRPsm3Qg55fTolcWN
 0+dm/Vfv2P7nfxDMsZ4S+jrvCaCWOZb6ua75TdB74VIXpJTGPU7a3JxyTzRueP+2
 c4WH3owT8x9e4iyLNGZoIDAKtJXLSX6FInjHKTkupLVbs2UpAh0Mipq4zIoIambl
 wf83jFmJ6KCemayE9gfw8Z45YTJcLceIOLaEyXgqrPoHXTmerEj5ZMMIqEMag3W/
 dKszhVjRb6En5Ldn0jEqeC5fU10tKIs+y7VNwdJ8CZw41daBDiXDVmXemJyTF/Xn
 SYJCwrJUSdVU42AE2xXCpBfANCh7eGyg4loCitLv8Z393tN7bRufULsnM/rEreI=
 =tEVO
 -----END PGP SIGNATURE-----

Merge remote-tracking branch 'remotes/bonzini/tags/for-upstream' into staging

* VFIO bugfix for AMD SEV (Alex)
* Kconfig improvements (Julio, Philippe)
* MemoryRegion reference counting bugfix (King Wang)
* Build system cleanups (Marc-André, myself)
* rdmacm-mux off-by-one (Marc-André)
* ZBC passthrough fixes (Shinichiro, myself)
* WHPX build fix (Stefan)
* char-pty fix (Wei Yang)

# gpg: Signature made Tue 16 Jul 2019 08:31:27 BST
# gpg:                using RSA key BFFBD25F78C7AE83
# gpg: Good signature from "Paolo Bonzini <bonzini@gnu.org>" [full]
# gpg:                 aka "Paolo Bonzini <pbonzini@redhat.com>" [full]
# Primary key fingerprint: 46F5 9FBD 57D6 12E7 BFD4  E2F7 7E15 100C CD36 69B1
#      Subkey fingerprint: F133 3857 4B66 2389 866C  7682 BFFB D25F 78C7 AE83

* remotes/bonzini/tags/for-upstream:
  vl: make sure char-pty message displayed by moving setbuf to the beginning
  create_config: remove $(CONFIG_SOFTMMU) hack
  Makefile: do not repeat $(CONFIG_SOFTMMU) in hw/Makefile.objs
  hw/usb/Kconfig: USB_XHCI_NEC requires USB_XHCI
  hw/usb/Kconfig: Add CONFIG_USB_EHCI_PCI
  target/i386: sev: Do not unpin ram device memory region
  checkpatch: detect doubly-encoded UTF-8
  hw/lm32/Kconfig: Milkymist One provides a USB 1.1 Controller
  util: merge main-loop.c and iohandler.c
  Fix broken build with WHPX enabled
  memory: unref the memory region in simplify flatview
  hw/i386: turn off vmport if CONFIG_VMPORT is disabled
  rdmacm-mux: fix strcpy string warning
  build-sys: remove slirp cflags from main-loop.o
  iscsi: base all handling of check condition on scsi_sense_to_errno
  iscsi: fix busy/timeout/task set full
  scsi: add guest-recoverable ZBC errors
  scsi: explicitly list guest-recoverable sense codes
  scsi-disk: pass sense correctly for guest-recoverable errors

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2019-07-16 15:08:29 +01:00
Stefano Garzarella 0b1847bbc2 gluster: fix .bdrv_reopen_prepare when backing file is a JSON object
When the backing_file is specified as a JSON object, the
qemu_gluster_reopen_prepare() fails with this message:
    invalid URI json:{"server.0.host": ...}

In this case, we should call qemu_gluster_init() using the QDict
'state->options' that contains the JSON parameters already parsed.

Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=1542445
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Message-id: 20190715132844.506584-1-sgarzare@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-07-15 15:48:41 +02:00
Max Reitz 8441d82d51 block/stream: Swap backing file change order
bdrv_change_backing_file() can result in yields.  Therefore, @base may
no longer be the the backing_bs() of s->bottom afterwards.

Just swap the order of the two calls to fix this.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-id: 20190703172813.6868-4-mreitz@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-07-15 15:48:40 +02:00
Max Reitz 17a7c39248 block/stream: Fix error path
As of commit c624b015bf, the stream job
only freezes the chain until the overlay of the base node.  The error
path must consider this.

Fixes: c624b015bf
Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-id: 20190703172813.6868-3-mreitz@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-07-15 15:48:40 +02:00
Max Reitz e5182c1c57 block: Add BDS.never_freeze
The commit and the mirror block job must be able to drop their filter
node at any point.  However, this will not be possible if any of the
BdrvChild links to them is frozen.  Therefore, we need to prevent them
from ever becoming frozen.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Message-id: 20190703172813.6868-2-mreitz@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-07-15 15:48:40 +02:00
Michal Privoznik 95667c3be0 nvme: Set number of queues later in nvme_init()
When creating the admin queue in nvme_init() the variable that
holds the number of queues created is modified before actual
queue creation. This is a problem because if creating the queue
fails then the variable is left in inconsistent state. This was
actually observed when I tried to hotplug a nvme disk. The
control got to nvme_file_open() which called nvme_init() which
failed and thus nvme_close() was called which in turn called
nvme_free_queue_pair() with queue being NULL. This lead to an
instant crash:

  #0  0x000055d9507ec211 in nvme_free_queue_pair (bs=0x55d952ddb880, q=0x0) at block/nvme.c:164
  #1  0x000055d9507ee180 in nvme_close (bs=0x55d952ddb880) at block/nvme.c:729
  #2  0x000055d9507ee3d5 in nvme_file_open (bs=0x55d952ddb880, options=0x55d952bb1410, flags=147456, errp=0x7ffd8e19e200) at block/nvme.c:781
  #3  0x000055d9507629f3 in bdrv_open_driver (bs=0x55d952ddb880, drv=0x55d95109c1e0 <bdrv_nvme>, node_name=0x0, options=0x55d952bb1410, open_flags=147456, errp=0x7ffd8e19e310) at block.c:1291
  #4  0x000055d9507633d6 in bdrv_open_common (bs=0x55d952ddb880, file=0x0, options=0x55d952bb1410, errp=0x7ffd8e19e310) at block.c:1551
  #5  0x000055d950766881 in bdrv_open_inherit (filename=0x0, reference=0x0, options=0x55d952bb1410, flags=32768, parent=0x55d9538ce420, child_role=0x55d950eaade0 <child_file>, errp=0x7ffd8e19e510) at block.c:3063
  #6  0x000055d950765ae4 in bdrv_open_child_bs (filename=0x0, options=0x55d9541cdff0, bdref_key=0x55d950af33aa "file", parent=0x55d9538ce420, child_role=0x55d950eaade0 <child_file>, allow_none=true, errp=0x7ffd8e19e510) at block.c:2712
  #7  0x000055d950766633 in bdrv_open_inherit (filename=0x0, reference=0x0, options=0x55d9541cdff0, flags=0, parent=0x0, child_role=0x0, errp=0x7ffd8e19e908) at block.c:3011
  #8  0x000055d950766dba in bdrv_open (filename=0x0, reference=0x0, options=0x55d953d00390, flags=0, errp=0x7ffd8e19e908) at block.c:3156
  #9  0x000055d9507cb635 in blk_new_open (filename=0x0, reference=0x0, options=0x55d953d00390, flags=0, errp=0x7ffd8e19e908) at block/block-backend.c:389
  #10 0x000055d950465ec5 in blockdev_init (file=0x0, bs_opts=0x55d953d00390, errp=0x7ffd8e19e908) at blockdev.c:602

Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Message-id: 927aae40b617ba7d4b6c7ffe74e6d7a2595f8e86.1562770546.git.mprivozn@redhat.com
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Tested-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-07-15 15:48:40 +02:00
Paolo Bonzini 8c460269aa iscsi: base all handling of check condition on scsi_sense_to_errno
Now that scsi-disk is not using scsi_sense_to_errno to separate guest-recoverable
sense codes, we can modify it to simplify iscsi's own sense handling.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-15 11:20:42 +02:00
Paolo Bonzini 00e3cccdf4 iscsi: fix busy/timeout/task set full
In this case, do_retry was set without calling aio_co_wake, thus never
waking up the coroutine.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-15 11:20:42 +02:00
Maxim Levitsky 867eccfed8 file-posix: Use max transfer length/segment count only for SCSI passthrough
Regular kernel block devices (/dev/sda*, /dev/nvme*, etc) don't have
max segment size/max segment count hardware requirements exposed
to the userspace, but rather the kernel block layer
takes care to split the incoming requests that
violate these requirements.

Allowing the kernel to do the splitting allows qemu to avoid
various overheads that arise otherwise from this.

This is especially visible in nbd server,
exposing as a raw file, a mostly empty qcow2 image over the net.
In this case most of the reads by the remote user
won't even hit the underlying kernel block device,
and therefore most of the  overhead will be in the
nbd traffic which increases significantly with lower max transfer size.

In addition to that even for local block device
access the peformance improves a bit due to less
traffic between qemu and the kernel when large
transfer sizes are used (e.g for image conversion)

More info can be found at:
https://bugzilla.redhat.com/show_bug.cgi?id=1647104

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Pankaj Gupta <pagupta@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-07-12 15:42:23 +02:00
Eric Blake f7077c9860 qcow2: Allow -o compat=v3 during qemu-img amend
Commit b76b4f60 allowed '-o compat=v3' as an alias for the
less-appealing '-o compat=1.1' for 'qemu-img create' since we want to
use the QMP form as much as possible, but forgot to do likewise for
qemu-img amend.  Also, it doesn't help that '-o help' doesn't list our
new preferred spellings.

Signed-off-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-07-08 16:00:31 +02:00
John Snow 197bfa7da7 block/qcow: Improve error when opening qcow2 files as qcow
Reported-by: radmehrsaeed7@gmail.com
Fixes: https://bugs.launchpad.net/bugs/1832914
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-07-08 16:00:26 +02:00
Andrey Shinkevich c624b015bf block/stream: introduce a bottom node
The bottom node is the intermediate block device that has the base as its
backing image. It is used instead of the base node while a block stream
job is running to avoid dependency on the base that may change due to the
parallel jobs. The change may take place due to a filter node as well that
is inserted between the base and the intermediate bottom node. It occurs
when the base node is the top one for another commit or stream job.
After the introduction of the bottom node, don't freeze its backing child,
that's the base, anymore.

Suggested-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Message-id: 1559152576-281803-4-git-send-email-andrey.shinkevich@virtuozzo.com
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-07-02 03:53:05 +02:00
Andrey Shinkevich 96a07d5bf4 block/stream: refactor stream_run: drop goto
The goto is unnecessary in the stream_run() since the common exit
code was removed in the commit eb23654dbe43b549ea2a9ebff9d8e:
"jobs: utilize job_exit shim".

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-id: 1559152576-281803-3-git-send-email-andrey.shinkevich@virtuozzo.com
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-07-02 03:53:04 +02:00
Andrey Shinkevich 170d3bd341 block: include base when checking image chain for block allocation
This patch is used in the 'block/stream: introduce a bottom node'
that is following. Instead of the base node, the caller may pass
the node that has the base as its backing image to the function
bdrv_is_allocated_above() with a new parameter include_base = true
and get rid of the dependency on the base that may change during
commit/stream parallel jobs. Now, if the specified base is not
found in the backing image chain, the QEMU will abort.

Suggested-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Message-id: 1559152576-281803-2-git-send-email-andrey.shinkevich@virtuozzo.com
[mreitz: Squashed in the following as a rebase on conflicting patches:]
Message-id: e3cf99ae-62e9-8b6e-5a06-d3c8b9363b85@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-07-02 03:53:04 +02:00
Stefano Garzarella d24f80234b block/rbd: increase dynamically the image size
RBD APIs don't allow us to write more than the size set with
rbd_create() or rbd_resize().
In order to support growing images (eg. qcow2), we resize the
image before write operations that exceed the current size.

Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Message-id: 20190509145927.293369-1-sgarzare@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-07-02 03:53:04 +02:00
Pino Toscano b10d49d761 ssh: switch from libssh2 to libssh
Rewrite the implementation of the ssh block driver to use libssh instead
of libssh2.  The libssh library has various advantages over libssh2:
- easier API for authentication (for example for using ssh-agent)
- easier API for known_hosts handling
- supports newer types of keys in known_hosts

Use APIs/features available in libssh 0.8 conditionally, to support
older versions (which are not recommended though).

Adjust the iotest 207 according to the different error message, and to
find the default key type for localhost (to properly compare the
fingerprint with).
Contributed-by: Max Reitz <mreitz@redhat.com>

Adjust the various Docker/Travis scripts to use libssh when available
instead of libssh2. The mingw/mxe testing is dropped for now, as there
are no packages for it.

Signed-off-by: Pino Toscano <ptoscano@redhat.com>
Tested-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Acked-by: Alex Bennée <alex.bennee@linaro.org>
Message-id: 20190620200840.17655-1-ptoscano@redhat.com
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-id: 5873173.t2JhDm7DL7@lindworm.usersys.redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-06-24 16:01:04 +02:00
Sam Eiderman 98eb9733f4 vmdk: Add read-only support for seSparse snapshots
Until ESXi 6.5 VMware used the vmfsSparse format for snapshots (VMDK3 in
QEMU).

This format was lacking in the following:

    * Grain directory (L1) and grain table (L2) entries were 32-bit,
      allowing access to only 2TB (slightly less) of data.
    * The grain size (default) was 512 bytes - leading to data
      fragmentation and many grain tables.
    * For space reclamation purposes, it was necessary to find all the
      grains which are not pointed to by any grain table - so a reverse
      mapping of "offset of grain in vmdk" to "grain table" must be
      constructed - which takes large amounts of CPU/RAM.

The format specification can be found in VMware's documentation:
https://www.vmware.com/support/developer/vddk/vmdk_50_technote.pdf

In ESXi 6.5, to support snapshot files larger than 2TB, a new format was
introduced: SESparse (Space Efficient).

This format fixes the above issues:

    * All entries are now 64-bit.
    * The grain size (default) is 4KB.
    * Grain directory and grain tables are now located at the beginning
      of the file.
      + seSparse format reserves space for all grain tables.
      + Grain tables can be addressed using an index.
      + Grains are located in the end of the file and can also be
        addressed with an index.
      - seSparse vmdks of large disks (64TB) have huge preallocated
        headers - mainly due to L2 tables, even for empty snapshots.
    * The header contains a reverse mapping ("backmap") of "offset of
      grain in vmdk" to "grain table" and a bitmap ("free bitmap") which
      specifies for each grain - whether it is allocated or not.
      Using these data structures we can implement space reclamation
      efficiently.
    * Due to the fact that the header now maintains two mappings:
        * The regular one (grain directory & grain tables)
        * A reverse one (backmap and free bitmap)
      These data structures can lose consistency upon crash and result
      in a corrupted VMDK.
      Therefore, a journal is also added to the VMDK and is replayed
      when the VMware reopens the file after a crash.

Since ESXi 6.7 - SESparse is the only snapshot format available.

Unfortunately, VMware does not provide documentation regarding the new
seSparse format.

This commit is based on black-box research of the seSparse format.
Various in-guest block operations and their effect on the snapshot file
were tested.

The only VMware provided source of information (regarding the underlying
implementation) was a log file on the ESXi:

    /var/log/hostd.log

Whenever an seSparse snapshot is created - the log is being populated
with seSparse records.

Relevant log records are of the form:

[...] Const Header:
[...]  constMagic     = 0xcafebabe
[...]  version        = 2.1
[...]  capacity       = 204800
[...]  grainSize      = 8
[...]  grainTableSize = 64
[...]  flags          = 0
[...] Extents:
[...]  Header         : <1 : 1>
[...]  JournalHdr     : <2 : 2>
[...]  Journal        : <2048 : 2048>
[...]  GrainDirectory : <4096 : 2048>
[...]  GrainTables    : <6144 : 2048>
[...]  FreeBitmap     : <8192 : 2048>
[...]  BackMap        : <10240 : 2048>
[...]  Grain          : <12288 : 204800>
[...] Volatile Header:
[...] volatileMagic     = 0xcafecafe
[...] FreeGTNumber      = 0
[...] nextTxnSeqNumber  = 0
[...] replayJournal     = 0

The sizes that are seen in the log file are in sectors.
Extents are of the following format: <offset : size>

This commit is a strict implementation which enforces:
    * magics
    * version number 2.1
    * grain size of 8 sectors  (4KB)
    * grain table size of 64 sectors
    * zero flags
    * extent locations

Additionally, this commit proivdes only a subset of the functionality
offered by seSparse's format:
    * Read-only
    * No journal replay
    * No space reclamation
    * No unmap support

Hence, journal header, journal, free bitmap and backmap extents are
unused, only the "classic" (L1 -> L2 -> data) grain access is
implemented.

However there are several differences in the grain access itself.
Grain directory (L1):
    * Grain directory entries are indexes (not offsets) to grain
      tables.
    * Valid grain directory entries have their highest nibble set to
      0x1.
    * Since grain tables are always located in the beginning of the
      file - the index can fit into 32 bits - so we can use its low
      part if it's valid.
Grain table (L2):
    * Grain table entries are indexes (not offsets) to grains.
    * If the highest nibble of the entry is:
        0x0:
            The grain in not allocated.
            The rest of the bytes are 0.
        0x1:
            The grain is unmapped - guest sees a zero grain.
            The rest of the bits point to the previously mapped grain,
            see 0x3 case.
        0x2:
            The grain is zero.
        0x3:
            The grain is allocated - to get the index calculate:
            ((entry & 0x0fff000000000000) >> 48) |
            ((entry & 0x0000ffffffffffff) << 12)
    * The difference between 0x1 and 0x2 is that 0x1 is an unallocated
      grain which results from the guest using sg_unmap to unmap the
      grain - but the grain itself still exists in the grain extent - a
      space reclamation procedure should delete it.
      Unmapping a zero grain has no effect (0x2 will not change to 0x1)
      but unmapping an unallocated grain will (0x0 to 0x1) - naturally.

In order to implement seSparse some fields had to be changed to support
both 32-bit and 64-bit entry sizes.

Reviewed-by: Karl Heubaum <karl.heubaum@oracle.com>
Reviewed-by: Eyal Moscovici <eyal.moscovici@oracle.com>
Reviewed-by: Arbel Moshe <arbel.moshe@oracle.com>
Signed-off-by: Sam Eiderman <shmuel.eiderman@oracle.com>
Message-id: 20190620091057.47441-4-shmuel.eiderman@oracle.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-06-24 15:53:02 +02:00