21b17cd011
Reorganize the page, moving things around, and add a few headlines ("Postcopy internals", "Postcopy features") to cover sub-areas. Reviewed-by: Cédric Le Goater <clg@redhat.com> Link: https://lore.kernel.org/r/20240109064628.595453-9-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com>
314 lines
13 KiB
ReStructuredText
314 lines
13 KiB
ReStructuredText
========
|
|
Postcopy
|
|
========
|
|
|
|
.. contents::
|
|
|
|
'Postcopy' migration is a way to deal with migrations that refuse to converge
|
|
(or take too long to converge) its plus side is that there is an upper bound on
|
|
the amount of migration traffic and time it takes, the down side is that during
|
|
the postcopy phase, a failure of *either* side causes the guest to be lost.
|
|
|
|
In postcopy the destination CPUs are started before all the memory has been
|
|
transferred, and accesses to pages that are yet to be transferred cause
|
|
a fault that's translated by QEMU into a request to the source QEMU.
|
|
|
|
Postcopy can be combined with precopy (i.e. normal migration) so that if precopy
|
|
doesn't finish in a given time the switch is made to postcopy.
|
|
|
|
Enabling postcopy
|
|
=================
|
|
|
|
To enable postcopy, issue this command on the monitor (both source and
|
|
destination) prior to the start of migration:
|
|
|
|
``migrate_set_capability postcopy-ram on``
|
|
|
|
The normal commands are then used to start a migration, which is still
|
|
started in precopy mode. Issuing:
|
|
|
|
``migrate_start_postcopy``
|
|
|
|
will now cause the transition from precopy to postcopy.
|
|
It can be issued immediately after migration is started or any
|
|
time later on. Issuing it after the end of a migration is harmless.
|
|
|
|
Blocktime is a postcopy live migration metric, intended to show how
|
|
long the vCPU was in state of interruptible sleep due to pagefault.
|
|
That metric is calculated both for all vCPUs as overlapped value, and
|
|
separately for each vCPU. These values are calculated on destination
|
|
side. To enable postcopy blocktime calculation, enter following
|
|
command on destination monitor:
|
|
|
|
``migrate_set_capability postcopy-blocktime on``
|
|
|
|
Postcopy blocktime can be retrieved by query-migrate qmp command.
|
|
postcopy-blocktime value of qmp command will show overlapped blocking
|
|
time for all vCPU, postcopy-vcpu-blocktime will show list of blocking
|
|
time per vCPU.
|
|
|
|
.. note::
|
|
During the postcopy phase, the bandwidth limits set using
|
|
``migrate_set_parameter`` is ignored (to avoid delaying requested pages that
|
|
the destination is waiting for).
|
|
|
|
Postcopy internals
|
|
==================
|
|
|
|
State machine
|
|
-------------
|
|
|
|
Postcopy moves through a series of states (see postcopy_state) from
|
|
ADVISE->DISCARD->LISTEN->RUNNING->END
|
|
|
|
- Advise
|
|
|
|
Set at the start of migration if postcopy is enabled, even
|
|
if it hasn't had the start command; here the destination
|
|
checks that its OS has the support needed for postcopy, and performs
|
|
setup to ensure the RAM mappings are suitable for later postcopy.
|
|
The destination will fail early in migration at this point if the
|
|
required OS support is not present.
|
|
(Triggered by reception of POSTCOPY_ADVISE command)
|
|
|
|
- Discard
|
|
|
|
Entered on receipt of the first 'discard' command; prior to
|
|
the first Discard being performed, hugepages are switched off
|
|
(using madvise) to ensure that no new huge pages are created
|
|
during the postcopy phase, and to cause any huge pages that
|
|
have discards on them to be broken.
|
|
|
|
- Listen
|
|
|
|
The first command in the package, POSTCOPY_LISTEN, switches
|
|
the destination state to Listen, and starts a new thread
|
|
(the 'listen thread') which takes over the job of receiving
|
|
pages off the migration stream, while the main thread carries
|
|
on processing the blob. With this thread able to process page
|
|
reception, the destination now 'sensitises' the RAM to detect
|
|
any access to missing pages (on Linux using the 'userfault'
|
|
system).
|
|
|
|
- Running
|
|
|
|
POSTCOPY_RUN causes the destination to synchronise all
|
|
state and start the CPUs and IO devices running. The main
|
|
thread now finishes processing the migration package and
|
|
now carries on as it would for normal precopy migration
|
|
(although it can't do the cleanup it would do as it
|
|
finishes a normal migration).
|
|
|
|
- Paused
|
|
|
|
Postcopy can run into a paused state (normally on both sides when
|
|
happens), where all threads will be temporarily halted mostly due to
|
|
network errors. When reaching paused state, migration will make sure
|
|
the qemu binary on both sides maintain the data without corrupting
|
|
the VM. To continue the migration, the admin needs to fix the
|
|
migration channel using the QMP command 'migrate-recover' on the
|
|
destination node, then resume the migration using QMP command 'migrate'
|
|
again on source node, with resume=true flag set.
|
|
|
|
- End
|
|
|
|
The listen thread can now quit, and perform the cleanup of migration
|
|
state, the migration is now complete.
|
|
|
|
Device transfer
|
|
---------------
|
|
|
|
Loading of device data may cause the device emulation to access guest RAM
|
|
that may trigger faults that have to be resolved by the source, as such
|
|
the migration stream has to be able to respond with page data *during* the
|
|
device load, and hence the device data has to be read from the stream completely
|
|
before the device load begins to free the stream up. This is achieved by
|
|
'packaging' the device data into a blob that's read in one go.
|
|
|
|
Source behaviour
|
|
----------------
|
|
|
|
Until postcopy is entered the migration stream is identical to normal
|
|
precopy, except for the addition of a 'postcopy advise' command at
|
|
the beginning, to tell the destination that postcopy might happen.
|
|
When postcopy starts the source sends the page discard data and then
|
|
forms the 'package' containing:
|
|
|
|
- Command: 'postcopy listen'
|
|
- The device state
|
|
|
|
A series of sections, identical to the precopy streams device state stream
|
|
containing everything except postcopiable devices (i.e. RAM)
|
|
- Command: 'postcopy run'
|
|
|
|
The 'package' is sent as the data part of a Command: ``CMD_PACKAGED``, and the
|
|
contents are formatted in the same way as the main migration stream.
|
|
|
|
During postcopy the source scans the list of dirty pages and sends them
|
|
to the destination without being requested (in much the same way as precopy),
|
|
however when a page request is received from the destination, the dirty page
|
|
scanning restarts from the requested location. This causes requested pages
|
|
to be sent quickly, and also causes pages directly after the requested page
|
|
to be sent quickly in the hope that those pages are likely to be used
|
|
by the destination soon.
|
|
|
|
Destination behaviour
|
|
---------------------
|
|
|
|
Initially the destination looks the same as precopy, with a single thread
|
|
reading the migration stream; the 'postcopy advise' and 'discard' commands
|
|
are processed to change the way RAM is managed, but don't affect the stream
|
|
processing.
|
|
|
|
::
|
|
|
|
------------------------------------------------------------------------------
|
|
1 2 3 4 5 6 7
|
|
main -----DISCARD-CMD_PACKAGED ( LISTEN DEVICE DEVICE DEVICE RUN )
|
|
thread | |
|
|
| (page request)
|
|
| \___
|
|
v \
|
|
listen thread: --- page -- page -- page -- page -- page --
|
|
|
|
a b c
|
|
------------------------------------------------------------------------------
|
|
|
|
- On receipt of ``CMD_PACKAGED`` (1)
|
|
|
|
All the data associated with the package - the ( ... ) section in the diagram -
|
|
is read into memory, and the main thread recurses into qemu_loadvm_state_main
|
|
to process the contents of the package (2) which contains commands (3,6) and
|
|
devices (4...)
|
|
|
|
- On receipt of 'postcopy listen' - 3 -(i.e. the 1st command in the package)
|
|
|
|
a new thread (a) is started that takes over servicing the migration stream,
|
|
while the main thread carries on loading the package. It loads normal
|
|
background page data (b) but if during a device load a fault happens (5)
|
|
the returned page (c) is loaded by the listen thread allowing the main
|
|
threads device load to carry on.
|
|
|
|
- The last thing in the ``CMD_PACKAGED`` is a 'RUN' command (6)
|
|
|
|
letting the destination CPUs start running. At the end of the
|
|
``CMD_PACKAGED`` (7) the main thread returns to normal running behaviour and
|
|
is no longer used by migration, while the listen thread carries on servicing
|
|
page data until the end of migration.
|
|
|
|
Source side page bitmap
|
|
-----------------------
|
|
|
|
The 'migration bitmap' in postcopy is basically the same as in the precopy,
|
|
where each of the bit to indicate that page is 'dirty' - i.e. needs
|
|
sending. During the precopy phase this is updated as the CPU dirties
|
|
pages, however during postcopy the CPUs are stopped and nothing should
|
|
dirty anything any more. Instead, dirty bits are cleared when the relevant
|
|
pages are sent during postcopy.
|
|
|
|
Postcopy features
|
|
=================
|
|
|
|
Postcopy recovery
|
|
-----------------
|
|
|
|
Comparing to precopy, postcopy is special on error handlings. When any
|
|
error happens (in this case, mostly network errors), QEMU cannot easily
|
|
fail a migration because VM data resides in both source and destination
|
|
QEMU instances. On the other hand, when issue happens QEMU on both sides
|
|
will go into a paused state. It'll need a recovery phase to continue a
|
|
paused postcopy migration.
|
|
|
|
The recovery phase normally contains a few steps:
|
|
|
|
- When network issue occurs, both QEMU will go into PAUSED state
|
|
|
|
- When the network is recovered (or a new network is provided), the admin
|
|
can setup the new channel for migration using QMP command
|
|
'migrate-recover' on destination node, preparing for a resume.
|
|
|
|
- On source host, the admin can continue the interrupted postcopy
|
|
migration using QMP command 'migrate' with resume=true flag set.
|
|
|
|
- After the connection is re-established, QEMU will continue the postcopy
|
|
migration on both sides.
|
|
|
|
During a paused postcopy migration, the VM can logically still continue
|
|
running, and it will not be impacted from any page access to pages that
|
|
were already migrated to destination VM before the interruption happens.
|
|
However, if any of the missing pages got accessed on destination VM, the VM
|
|
thread will be halted waiting for the page to be migrated, it means it can
|
|
be halted until the recovery is complete.
|
|
|
|
The impact of accessing missing pages can be relevant to different
|
|
configurations of the guest. For example, when with async page fault
|
|
enabled, logically the guest can proactively schedule out the threads
|
|
accessing missing pages.
|
|
|
|
Postcopy with hugepages
|
|
-----------------------
|
|
|
|
Postcopy now works with hugetlbfs backed memory:
|
|
|
|
a) The linux kernel on the destination must support userfault on hugepages.
|
|
b) The huge-page configuration on the source and destination VMs must be
|
|
identical; i.e. RAMBlocks on both sides must use the same page size.
|
|
c) Note that ``-mem-path /dev/hugepages`` will fall back to allocating normal
|
|
RAM if it doesn't have enough hugepages, triggering (b) to fail.
|
|
Using ``-mem-prealloc`` enforces the allocation using hugepages.
|
|
d) Care should be taken with the size of hugepage used; postcopy with 2MB
|
|
hugepages works well, however 1GB hugepages are likely to be problematic
|
|
since it takes ~1 second to transfer a 1GB hugepage across a 10Gbps link,
|
|
and until the full page is transferred the destination thread is blocked.
|
|
|
|
Postcopy with shared memory
|
|
---------------------------
|
|
|
|
Postcopy migration with shared memory needs explicit support from the other
|
|
processes that share memory and from QEMU. There are restrictions on the type of
|
|
memory that userfault can support shared.
|
|
|
|
The Linux kernel userfault support works on ``/dev/shm`` memory and on ``hugetlbfs``
|
|
(although the kernel doesn't provide an equivalent to ``madvise(MADV_DONTNEED)``
|
|
for hugetlbfs which may be a problem in some configurations).
|
|
|
|
The vhost-user code in QEMU supports clients that have Postcopy support,
|
|
and the ``vhost-user-bridge`` (in ``tests/``) and the DPDK package have changes
|
|
to support postcopy.
|
|
|
|
The client needs to open a userfaultfd and register the areas
|
|
of memory that it maps with userfault. The client must then pass the
|
|
userfaultfd back to QEMU together with a mapping table that allows
|
|
fault addresses in the clients address space to be converted back to
|
|
RAMBlock/offsets. The client's userfaultfd is added to the postcopy
|
|
fault-thread and page requests are made on behalf of the client by QEMU.
|
|
QEMU performs 'wake' operations on the client's userfaultfd to allow it
|
|
to continue after a page has arrived.
|
|
|
|
.. note::
|
|
There are two future improvements that would be nice:
|
|
a) Some way to make QEMU ignorant of the addresses in the clients
|
|
address space
|
|
b) Avoiding the need for QEMU to perform ufd-wake calls after the
|
|
pages have arrived
|
|
|
|
Retro-fitting postcopy to existing clients is possible:
|
|
a) A mechanism is needed for the registration with userfault as above,
|
|
and the registration needs to be coordinated with the phases of
|
|
postcopy. In vhost-user extra messages are added to the existing
|
|
control channel.
|
|
b) Any thread that can block due to guest memory accesses must be
|
|
identified and the implication understood; for example if the
|
|
guest memory access is made while holding a lock then all other
|
|
threads waiting for that lock will also be blocked.
|
|
|
|
Postcopy preemption mode
|
|
------------------------
|
|
|
|
Postcopy preempt is a new capability introduced in 8.0 QEMU release, it
|
|
allows urgent pages (those got page fault requested from destination QEMU
|
|
explicitly) to be sent in a separate preempt channel, rather than queued in
|
|
the background migration channel. Anyone who cares about latencies of page
|
|
faults during a postcopy migration should enable this feature. By default,
|
|
it's not enabled.
|