qemu-e2k

Author	SHA1	Message	Date
Stefan Hajnoczi	826cc32423	aio-posix: split poll check from ready handler Adaptive polling measures the execution time of the polling check plus handlers called when a polled event becomes ready. Handlers can take a significant amount of time, making it look like polling was running for a long time when in fact the event handler was running for a long time. For example, on Linux the io_submit(2) syscall invoked when a virtio-blk device's virtqueue becomes ready can take 10s of microseconds. This can exceed the default polling interval (32 microseconds) and cause adaptive polling to stop polling. By excluding the handler's execution time from the polling check we make the adaptive polling calculation more accurate. As a result, the event loop now stays in polling mode where previously it would have fallen back to file descriptor monitoring. The following data was collected with virtio-blk num-queues=2 event_idx=off using an IOThread. Before: 168k IOPS, IOThread syscalls: 9837.115 ( 0.020 ms): IO iothread1/620155 io_submit(ctx_id: 140512552468480, nr: 16, iocbpp: 0x7fcb9f937db0) = 16 9837.158 ( 0.002 ms): IO iothread1/620155 write(fd: 103, buf: 0x556a2ef71b88, count: 8) = 8 9837.161 ( 0.001 ms): IO iothread1/620155 write(fd: 104, buf: 0x556a2ef71b88, count: 8) = 8 9837.163 ( 0.001 ms): IO iothread1/620155 ppoll(ufds: 0x7fcb90002800, nfds: 4, tsp: 0x7fcb9f1342d0, sigsetsize: 8) = 3 9837.164 ( 0.001 ms): IO iothread1/620155 read(fd: 107, buf: 0x7fcb9f939cc0, count: 512) = 8 9837.174 ( 0.001 ms): IO iothread1/620155 read(fd: 105, buf: 0x7fcb9f939cc0, count: 512) = 8 9837.176 ( 0.001 ms): IO iothread1/620155 read(fd: 106, buf: 0x7fcb9f939cc0, count: 512) = 8 9837.209 ( 0.035 ms): IO iothread1/620155 io_submit(ctx_id: 140512552468480, nr: 32, iocbpp: 0x7fca7d0cebe0) = 32 174k IOPS (+3.6%), IOThread syscalls: 9809.566 ( 0.036 ms): IO iothread1/623061 io_submit(ctx_id: 140539805028352, nr: 32, iocbpp: 0x7fd0cdd62be0) = 32 9809.625 ( 0.001 ms): IO iothread1/623061 write(fd: 103, buf: 0x5647cfba5f58, count: 8) = 8 9809.627 ( 0.002 ms): IO iothread1/623061 write(fd: 104, buf: 0x5647cfba5f58, count: 8) = 8 9809.663 ( 0.036 ms): IO iothread1/623061 io_submit(ctx_id: 140539805028352, nr: 32, iocbpp: 0x7fd0d0388b50) = 32 Notice that ppoll(2) and eventfd read(2) syscalls are eliminated because the IOThread stays in polling mode instead of falling back to file descriptor monitoring. As usual, polling is not implemented on Windows so this patch ignores the new io_poll_read() callback in aio-win32.c. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Message-id: 20211207132336.36627-2-stefanha@redhat.com [Fixed up aio_set_event_notifier() calls in tests/unit/test-fdmon-epoll.c added after this series was queued. --Stefan] Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>	2022-01-12 17:09:39 +00:00
Laurent Vivier	1b529d908d	failover: Silence warning messages during qtest virtio-net-failover test tries several device combinations that produces some expected warnings. These warning can be confusing, so we disable them during the qtest sequence. Reported-by: Thomas Huth <thuth@redhat.com> Signed-off-by: Laurent Vivier <lvivier@redhat.com> Message-Id: <20211220145314.390697-1-lvivier@redhat.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com> [thuth: Fix memory leak by using error_free()] Signed-off-by: Thomas Huth <thuth@redhat.com>	2021-12-22 08:12:45 +01:00
Juan Quintela	a5ed229488	multifd: Make zlib compression method not use iovs Signed-off-by: Juan Quintela <quintela@redhat.com> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>	2021-12-15 10:38:34 +01:00
Juan Quintela	f5ff548774	multifd: Make zstd compression method not use iovs Signed-off-by: Juan Quintela <quintela@redhat.com> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>	2021-12-15 10:38:17 +01:00
Rao, Lei	9c5c8ff24e	COLO: Move some trace code behind qemu_mutex_unlock_iothread() There is no need to put some trace code in the critical section. So, moving it behind qemu_mutex_unlock_iothread() can reduce the lock time. Signed-off-by: Lei Rao <lei.rao@intel.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com>	2021-12-15 10:31:42 +01:00
Li Zhang	077fbb5942	multifd: Shut down the QIO channels to avoid blocking the send threads when they are terminated. When doing live migration with multifd channels 8, 16 or larger number, the guest hangs in the presence of the network errors such as missing TCP ACKs. At sender's side: The main thread is blocked on qemu_thread_join, migration_fd_cleanup is called because one thread fails on qio_channel_write_all when the network problem happens and other send threads are blocked on sendmsg. They could not be terminated. So the main thread is blocked on qemu_thread_join to wait for the threads terminated. (gdb) bt 0 0x00007f30c8dcffc0 in __pthread_clockjoin_ex () at /lib64/libpthread.so.0 1 0x000055cbb716084b in qemu_thread_join (thread=0x55cbb881f418) at ../util/qemu-thread-posix.c:627 2 0x000055cbb6b54e40 in multifd_save_cleanup () at ../migration/multifd.c:542 3 0x000055cbb6b4de06 in migrate_fd_cleanup (s=0x55cbb8024000) at ../migration/migration.c:1808 4 0x000055cbb6b4dfb4 in migrate_fd_cleanup_bh (opaque=0x55cbb8024000) at ../migration/migration.c:1850 5 0x000055cbb7173ac1 in aio_bh_call (bh=0x55cbb7eb98e0) at ../util/async.c:141 6 0x000055cbb7173bcb in aio_bh_poll (ctx=0x55cbb7ebba80) at ../util/async.c:169 7 0x000055cbb715ba4b in aio_dispatch (ctx=0x55cbb7ebba80) at ../util/aio-posix.c:381 8 0x000055cbb7173ffe in aio_ctx_dispatch (source=0x55cbb7ebba80, callback=0x0, user_data=0x0) at ../util/async.c:311 9 0x00007f30c9c8cdf4 in g_main_context_dispatch () at /usr/lib64/libglib-2.0.so.0 10 0x000055cbb71851a2 in glib_pollfds_poll () at ../util/main-loop.c:232 11 0x000055cbb718521c in os_host_main_loop_wait (timeout=42251070366) at ../util/main-loop.c:255 12 0x000055cbb7185321 in main_loop_wait (nonblocking=0) at ../util/main-loop.c:531 13 0x000055cbb6e6ba27 in qemu_main_loop () at ../softmmu/runstate.c:726 14 0x000055cbb6ad6fd7 in main (argc=68, argv=0x7ffc0c578888, envp=0x7ffc0c578ab0) at ../softmmu/main.c:50 To make sure that the send threads could be terminated, IO channels should be shut down to avoid waiting IO. Signed-off-by: Li Zhang <lizhang@suse.de> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com>	2021-12-15 10:31:42 +01:00
Juan Quintela	01102a2ef6	multifd: Fill offset and block for reception We were using the iov directly, but we will need this info on the following patch. Signed-off-by: Juan Quintela <quintela@redhat.com> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>	2021-12-15 10:31:42 +01:00
Juan Quintela	40a4bfe9d3	multifd: remove used parameter from send_recv_pages() method It is already there as p->pages->num. Signed-off-by: Juan Quintela <quintela@redhat.com> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>	2021-12-15 10:31:42 +01:00
Juan Quintela	02fb81043e	multifd: remove used parameter from send_prepare() method It is already there as p->pages->num. Signed-off-by: Juan Quintela <quintela@redhat.com> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>	2021-12-15 10:31:42 +01:00
Juan Quintela	1943c11a62	multifd: The variable is only used inside the loop Signed-off-by: Juan Quintela <quintela@redhat.com> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>	2021-12-15 10:31:42 +01:00
Juan Quintela	18ede636bc	multifd: Add missing documention Signed-off-by: Juan Quintela <quintela@redhat.com> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>	2021-12-15 10:31:42 +01:00
Juan Quintela	90a3d2f9d5	multifd: Rename used field to num We will need to split it later in zero_num (number of zero pages) and normal_num (number of normal pages). This name is better. Signed-off-by: Juan Quintela <quintela@redhat.com> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>	2021-12-15 10:31:42 +01:00
Juan Quintela	144fa06b34	migration: Never call twice qemu_target_page_size() Signed-off-by: Juan Quintela <quintela@redhat.com> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>	2021-12-15 10:31:42 +01:00
Juan Quintela	47a1782461	multifd: Delete useless operation We are dividing by page_size to multiply again in the only use. Once there, improve the comments. Signed-off-by: Juan Quintela <quintela@redhat.com> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>	2021-12-15 10:31:42 +01:00
Juan Quintela	bad452a77e	migration: Remove is_zero_range() It just calls buffer_is_zero(). Just change the callers. Signed-off-by: Juan Quintela <quintela@redhat.com> Reviewed-by: Richard Henderson <richard.henderson@linaro.org>	2021-12-15 10:31:42 +01:00
Zhang Chen	751fe4c608	migration/colo: Optimize COLO primary node start code path Optimize COLO primary start path from: MIGRATION_STATUS_XXX --> MIGRATION_STATUS_ACTIVE --> MIGRATION_STATUS_COLO --> MIGRATION_STATUS_COMPLETED To: MIGRATION_STATUS_XXX --> MIGRATION_STATUS_COLO --> MIGRATION_STATUS_COMPLETED No need to start primary COLO through "MIGRATION_STATUS_ACTIVE". Signed-off-by: Zhang Chen <chen.zhang@intel.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com>	2021-12-15 10:31:42 +01:00
Rao, Lei	795969ab1f	Fixed a QEMU hang when guest poweroff in COLO mode When the PVM guest poweroff, the COLO thread may wait a semaphore in colo_process_checkpoint().So, we should wake up the COLO thread before migration shutdown. Signed-off-by: Lei Rao <lei.rao@intel.com> Reviewed-by: Zhang Chen <chen.zhang@intel.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com>	2021-12-15 10:31:42 +01:00
Zhang Chen	0e0f0479e2	migration/colo: More accurate update checkpoint time Previous operation(like vm_start and replication_start_all) will consume extra time before update the timer, so reduce time in this patch. Signed-off-by: Zhang Chen <chen.zhang@intel.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com>	2021-12-15 10:31:42 +01:00
Rao, Lei	672159a97c	migration/ram.c: Remove the qemu_mutex_lock in colo_flush_ram_cache. The code to acquire bitmap_mutex is added in the commit of "63268c4970a5f126cc9af75f3ccb8057abef5ec0". There is no need to acquire bitmap_mutex in colo_flush_ram_cache(). This is because the colo_flush_ram_cache only be called on the COLO secondary VM, which is the destination side. On the COLO secondary VM, only the COLO thread will touch the bitmap of ram cache. Signed-off-by: Lei Rao <lei.rao@intel.com> Reviewed-by: Zhang Chen <chen.zhang@intel.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com>	2021-12-15 10:31:42 +01:00
Rao, Lei	91fe9a8dbd	Reset the auto-converge counter at every checkpoint. if we don't reset the auto-converge counter, it will continue to run with COLO running, and eventually the system will hang due to the CPU throttle reaching DEFAULT_MIGRATE_MAX_CPU_THROTTLE. Signed-off-by: Lei Rao <lei.rao@intel.com> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Reviewed-by: Lukas Straub <lukasstraub2@web.de> Tested-by: Lukas Straub <lukasstraub2@web.de> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com>	2021-11-09 08:48:36 +01:00
Rao, Lei	a6a83cef9c	Reduce the PVM stop time during Checkpoint When flushing memory from ram cache to ram during every checkpoint on secondary VM, we can copy continuous chunks of memory instead of 4096 bytes per time to reduce the time of VM stop during checkpoint. Signed-off-by: Lei Rao <lei.rao@intel.com> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Reviewed-by: Lukas Straub <lukasstraub2@web.de> Reviewed-by: Juan Quintela <quintela@redhat.com> Tested-by: Lukas Straub <lukasstraub2@web.de> Signed-off-by: Juan Quintela <quintela@redhat.com>	2021-11-09 08:46:30 +01:00
Juan Quintela	565599807f	migration: Check that postcopy fd's are not NULL If postcopy has finished, it frees the array. But vhost-user unregister it at cleanup time. fixes: `c4f7538` Signed-off-by: Juan Quintela <quintela@redhat.com> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>	2021-11-06 12:35:29 +01:00
Lukas Straub	e5fdf92096	colo: Don't dump colo cache if dump-guest-core=off One might set dump-guest-core=off to make coredumps smaller and still allow to debug many qemu bugs. Extend this option to the colo cache. Signed-off-by: Lukas Straub <lukasstraub2@web.de> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com>	2021-11-03 09:39:31 +01:00
Rao, Lei	2b9f6bf36c	Changed the last-mode to none of first start COLO When we first stated the COLO, the last-mode is as follows: { "execute": "query-colo-status" } {"return": {"last-mode": "primary", "mode": "primary", "reason": "none"}} The last-mode is unreasonable. After the patch, will be changed to the following: { "execute": "query-colo-status" } {"return": {"last-mode": "none", "mode": "primary", "reason": "none"}} Signed-off-by: Lei Rao <lei.rao@intel.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com>	2021-11-03 09:38:53 +01:00
Rao, Lei	04dd89169b	Removed the qemu_fclose() in colo_process_incoming_thread After the live migration, the related fd will be cleanup in migration_incoming_state_destroy(). So, the qemu_close() in colo_process_incoming_thread is not necessary. Signed-off-by: Lei Rao <lei.rao@intel.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com>	2021-11-03 09:38:53 +01:00
Rao, Lei	ac183dac96	colo: fixed 'Segmentation fault' when the simplex mode PVM poweroff The GDB statck is as follows: Program terminated with signal SIGSEGV, Segmentation fault. 0 object_class_dynamic_cast (class=0x55c8f5d2bf50, typename=0x55c8f2f7379e "qio-channel") at qom/object.c:832 if (type->class->interfaces && [Current thread is 1 (Thread 0x7f756e97eb00 (LWP 1811577))] (gdb) bt 0 object_class_dynamic_cast (class=0x55c8f5d2bf50, typename=0x55c8f2f7379e "qio-channel") at qom/object.c:832 1 0x000055c8f2c3dd14 in object_dynamic_cast (obj=0x55c8f543ac00, typename=0x55c8f2f7379e "qio-channel") at qom/object.c:763 2 0x000055c8f2c3ddce in object_dynamic_cast_assert (obj=0x55c8f543ac00, typename=0x55c8f2f7379e "qio-channel", file=0x55c8f2f73780 "migration/qemu-file-channel.c", line=117, func=0x55c8f2f73800 <__func__.18724> "channel_shutdown") at qom/object.c:786 3 0x000055c8f2bbc6ac in channel_shutdown (opaque=0x55c8f543ac00, rd=true, wr=true, errp=0x0) at migration/qemu-file-channel.c:117 4 0x000055c8f2bba56e in qemu_file_shutdown (f=0x7f7558070f50) at migration/qemu-file.c:67 5 0x000055c8f2ba5373 in migrate_fd_cancel (s=0x55c8f4ccf3f0) at migration/migration.c:1699 6 0x000055c8f2ba1992 in migration_shutdown () at migration/migration.c:187 7 0x000055c8f29a5b77 in main (argc=69, argv=0x7fff3e9e8c08, envp=0x7fff3e9e8e38) at vl.c:4512 The root cause is that we still want to shutdown the from_dst_file in migrate_fd_cancel() after qemu_close in colo_process_checkpoint(). So, we should set the s->rp_state.from_dst_file = NULL after qemu_close(). Signed-off-by: Lei Rao <lei.rao@intel.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com>	2021-11-03 09:38:53 +01:00
Rao, Lei	684bfd1820	Fixed SVM hang when do failover before PVM crash This patch fixed as follows: Thread 1 (Thread 0x7f34ee738d80 (LWP 11212)): #0 __pthread_clockjoin_ex (threadid=139847152957184, thread_return=0x7f30b1febf30, clockid=<optimized out>, abstime=<optimized out>, block=<optimized out>) at pthread_join_common.c:145 #1 0x0000563401998e36 in qemu_thread_join (thread=0x563402d66610) at util/qemu-thread-posix.c:587 #2 0x00005634017a79fa in process_incoming_migration_co (opaque=0x0) at migration/migration.c:502 #3 0x00005634019b59c9 in coroutine_trampoline (i0=63395504, i1=22068) at util/coroutine-ucontext.c:115 #4 0x00007f34ef860660 in ?? () at ../sysdeps/unix/sysv/linux/x86_64/__start_context.S:91 from /lib/x86_64-linux-gnu/libc.so.6 #5 0x00007f30b21ee730 in ?? () #6 0x0000000000000000 in ?? () Thread 13 (Thread 0x7f30b3dff700 (LWP 11747)): #0 __lll_lock_wait (futex=futex@entry=0x56340218ffa0 <qemu_global_mutex>, private=0) at lowlevellock.c:52 #1 0x00007f34efa000a3 in _GI__pthread_mutex_lock (mutex=0x56340218ffa0 <qemu_global_mutex>) at ../nptl/pthread_mutex_lock.c:80 #2 0x0000563401997f99 in qemu_mutex_lock_impl (mutex=0x56340218ffa0 <qemu_global_mutex>, file=0x563401b7a80e "migration/colo.c", line=806) at util/qemu-thread-posix.c:78 #3 0x0000563401407144 in qemu_mutex_lock_iothread_impl (file=0x563401b7a80e "migration/colo.c", line=806) at /home/workspace/colo-qemu/cpus.c:1899 #4 0x00005634017ba8e8 in colo_process_incoming_thread (opaque=0x563402d664c0) at migration/colo.c:806 #5 0x0000563401998b72 in qemu_thread_start (args=0x5634039f8370) at util/qemu-thread-posix.c:519 #6 0x00007f34ef9fd609 in start_thread (arg=<optimized out>) at pthread_create.c:477 #7 0x00007f34ef924293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95 The QEMU main thread is holding the lock: (gdb) p qemu_global_mutex $1 = {lock = {_data = {lock = 2, __count = 0, __owner = 11212, __nusers = 9, __kind = 0, __spins = 0, __elision = 0, __list = {_prev = 0x0, __next = 0x0}}, __size = "\002\000\000\000\000\000\000\000\314+\000\000\t", '\000' <repeats 26 times>, __align = 2}, file = 0x563401c07e4b "util/main-loop.c", line = 240, initialized = true} >From the call trace, we can see it is a deadlock bug. and the QEMU main thread holds the global mutex to wait until the COLO thread ends. and the colo thread wants to acquire the global mutex, which will cause a deadlock. So, we should release the qemu_global_mutex before waiting colo thread ends. Signed-off-by: Lei Rao <lei.rao@intel.com> Reviewed-by: Li Zhijian <lizhijian@cn.fujitsu.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com>	2021-11-03 09:38:53 +01:00
Rao, Lei	aa505f8e0e	Fixed qemu crash when guest power off in COLO mode This patch fixes the following: qemu-system-x86_64: invalid runstate transition: 'shutdown' -> 'running' Aborted (core dumped) The gdb bt as following: 0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50 1 0x00007faa3d613859 in __GI_abort () at abort.c:79 2 0x000055c5a21268fd in runstate_set (new_state=RUN_STATE_RUNNING) at vl.c:723 3 0x000055c5a1f8cae4 in vm_prepare_start () at /home/workspace/colo-qemu/cpus.c:2206 4 0x000055c5a1f8cb1b in vm_start () at /home/workspace/colo-qemu/cpus.c:2213 5 0x000055c5a2332bba in migration_iteration_finish (s=0x55c5a4658810) at migration/migration.c:3376 6 0x000055c5a2332f3b in migration_thread (opaque=0x55c5a4658810) at migration/migration.c:3527 7 0x000055c5a251d68a in qemu_thread_start (args=0x55c5a5491a70) at util/qemu-thread-posix.c:519 8 0x00007faa3d7e9609 in start_thread (arg=<optimized out>) at pthread_create.c:477 9 0x00007faa3d710293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95 Signed-off-by: Lei Rao <lei.rao@intel.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com>	2021-11-03 09:38:53 +01:00
Rao, Lei	ae4c209935	Some minor optimizations for COLO Signed-off-by: Lei Rao <lei.rao@intel.com> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com>	2021-11-03 09:38:53 +01:00
Juan Quintela	02abee3d51	migration: Zero migration compression counters Based on previous patch from yuxiating <yuxiating@huawei.com> Signed-off-by: Juan Quintela <quintela@redhat.com>	2021-11-03 09:38:53 +01:00
yuxiating	fa0b31d585	migration: initialise compression_counters for a new migration If the compression migration fails or is canceled, the query for the value of compression_counters during the next compression migration is wrong. Signed-off-by: yuxiating <yuxiating@huawei.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com>	2021-11-03 09:38:53 +01:00
Laurent Vivier	458fecca80	migration: provide an error message to migration_cancel() This avoids to call migrate_get_current() in the caller function whereas migration_cancel() already needs the pointer to the current migration state. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com>	2021-11-03 09:38:53 +01:00
Hyman Huang(黄勇)	826b8bc80c	migration/dirtyrate: implement dirty-bitmap dirtyrate calculation introduce dirty-bitmap mode as the third method of calc-dirty-rate. implement dirty-bitmap dirtyrate calculation, which can be used to measuring dirtyrate in the absence of dirty-ring. introduce "dirty_bitmap:-b" option in hmp calc_dirty_rate to indicate dirty bitmap method should be used for calculation. Signed-off-by: Hyman Huang(黄勇) <huangy81@chinatelecom.cn> Reviewed-by: Peter Xu <peterx@redhat.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com>	2021-11-01 22:56:44 +01:00
Hyman Huang(黄勇)	4998a37e4b	memory: introduce total_dirty_pages to stat dirty pages introduce global var total_dirty_pages to stat dirty pages along with memory_global_dirty_log_sync. Signed-off-by: Hyman Huang(黄勇) <huangy81@chinatelecom.cn> Reviewed-by: Peter Xu <peterx@redhat.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com>	2021-11-01 22:56:44 +01:00
David Hildenbrand	6fee3a1fd9	migration/ram: Handle RAMBlocks with a RamDiscardManager on background snapshots We already don't ever migrate memory that corresponds to discarded ranges as managed by a RamDiscardManager responsible for the mapped memory region of the RAMBlock. virtio-mem uses this mechanism to logically unplug parts of a RAMBlock. Right now, we still populate zeropages for the whole usable part of the RAMBlock, which is undesired because: 1. Even populating the shared zeropage will result in memory getting consumed for page tables. 2. Memory backends without a shared zeropage (like hugetlbfs and shmem) will populate an actual, fresh page, resulting in an unintended memory consumption. Discarded ("logically unplugged") parts have to remain discarded. As these pages are never part of the migration stream, there is no need to track modifications via userfaultfd WP reliably for these parts. Further, any writes to these ranges by the VM are invalid and the behavior is undefined. Note that Linux only supports userfaultfd WP on private anonymous memory for now, which usually results in the shared zeropage getting populated. The issue will become more relevant once userfaultfd WP supports shmem and hugetlb. Acked-by: Peter Xu <peterx@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com>	2021-11-01 22:56:44 +01:00
David Hildenbrand	f7b9dcfbcf	migration/ram: Factor out populating pages readable in ram_block_populate_pages() Let's factor out prefaulting/populating to make further changes easier to review and add a comment what we are actually expecting to happen. While at it, use the actual page size of the ramblock, which defaults to qemu_real_host_page_size for anonymous memory. Further, rename ram_block_populate_pages() to ram_block_populate_read() as well, to make it clearer what we are doing. In the future, we might want to use MADV_POPULATE_READ to speed up population. Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com>	2021-11-01 22:56:44 +01:00
David Hildenbrand	7648297d40	migration: Simplify alignment and alignment checks Let's use QEMU_ALIGN_DOWN() and friends to make the code a bit easier to read. Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com>	2021-11-01 22:56:44 +01:00
David Hildenbrand	9470c5e082	migration/postcopy: Handle RAMBlocks with a RamDiscardManager on the destination Currently, when someone (i.e., the VM) accesses discarded parts inside a RAMBlock with a RamDiscardManager managing the corresponding mapped memory region, postcopy will request migration of the corresponding page from the source. The source, however, will never answer, because it refuses to migrate such pages with undefined content ("logically unplugged"): the pages are never dirty, and get_queued_page() will consequently skip processing these postcopy requests. Especially reading discarded ("logically unplugged") ranges is supposed to work in some setups (for example with current virtio-mem), although it barely ever happens: still, not placing a page would currently stall the VM, as it cannot make forward progress. Let's check the state via the RamDiscardManager (the state e.g., of virtio-mem is migrated during precopy) and avoid sending a request that will never get answered. Place a fresh zero page instead to keep the VM working. This is the same behavior that would happen automatically without userfaultfd being active, when accessing virtual memory regions without populated pages -- "populate on demand". For now, there are valid cases (as documented in the virtio-mem spec) where a VM might read discarded memory; in the future, we will disallow that. Then, we might want to handle that case differently, e.g., warning the user that the VM seems to be mis-behaving. Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com>	2021-11-01 22:56:44 +01:00
David Hildenbrand	be39b4cd20	migration/ram: Handle RAMBlocks with a RamDiscardManager on the migration source We don't want to migrate memory that corresponds to discarded ranges as managed by a RamDiscardManager responsible for the mapped memory region of the RAMBlock. The content of these pages is essentially stale and without any guarantees for the VM ("logically unplugged"). Depending on the underlying memory type, even reading memory might populate memory on the source, resulting in an undesired memory consumption. Of course, on the destination, even writing a zeropage consumes memory, which we also want to avoid (similar to free page hinting). Currently, virtio-mem tries achieving that goal (not migrating "unplugged" memory that was discarded) by going via qemu_guest_free_page_hint() - but it's hackish and incomplete. For example, background snapshots still end up reading all memory, as they don't do bitmap syncs. Postcopy recovery code will re-add previously cleared bits to the dirty bitmap and migrate them. Let's consult the RamDiscardManager after setting up our dirty bitmap initially and when postcopy recovery code reinitializes it: clear corresponding bits in the dirty bitmaps (e.g., of the RAMBlock and inside KVM). It's important to fixup the dirty bitmap after our initial bitmap sync, such that the corresponding dirty bits in KVM are actually cleared. As colo is incompatible with discarding of RAM and inhibits it, we don't have to bother. Note: if a misbehaving guest would use discarded ranges after migration started we would still migrate that memory: however, then we already populated that memory on the migration source. Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com>	2021-11-01 22:56:44 +01:00
Peter Xu	60fd680193	migration: Add migrate_add_blocker_internal() An internal version that removes -only-migratable implications. It can be used for temporary migration blockers like dump-guest-memory. Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com>	2021-11-01 22:56:44 +01:00
Peter Xu	4c170330aa	migration: Make migration blocker work for snapshots too save_snapshot() checks migration blocker, which looks sane. At the meantime we should also teach the blocker add helper to fail if during a snapshot, just like for migrations. Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com> Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com>	2021-11-01 22:56:43 +01:00
Hyman Huang(é»„å‹‡)	0e21bf2460	migration/dirtyrate: implement dirty-ring dirtyrate calculation use dirty ring feature to implement dirtyrate calculation. introduce mode option in qmp calc_dirty_rate to specify what method should be used when calculating dirtyrate, either page-sampling or dirty-ring should be passed. introduce "dirty_ring:-r" option in hmp calc_dirty_rate to indicate dirty ring method should be used for calculation. Signed-off-by: Hyman Huang(é»„å‹‡) <huangy81@chinatelecom.cn> Message-Id: <7db445109bd18125ce8ec86816d14f6ab5de6a7d.1624040308.git.huangy81@chinatelecom.cn> Reviewed-by: Peter Xu <peterx@redhat.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com>	2021-11-01 22:56:43 +01:00
Hyman Huang(é»„å‹‡)	9865d0f68f	migration/dirtyrate: move init step of calculation to main thread since main thread may "query dirty rate" at any time, it's better to move init step into main thead so that synchronization overhead between "main" and "get_dirtyrate" can be reduced. Signed-off-by: Hyman Huang(é»„å‹‡) <huangy81@chinatelecom.cn> Message-Id: <109f8077518ed2f13068e3bfb10e625e964780f1.1624040308.git.huangy81@chinatelecom.cn> Reviewed-by: Peter Xu <peterx@redhat.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com>	2021-11-01 22:56:43 +01:00
Hyman Huang(é»„å‹‡)	15eb2d644c	migration/dirtyrate: adjust order of registering thread registering get_dirtyrate thread in advance so that both page-sampling and dirty-ring mode can be covered. Signed-off-by: Hyman Huang(é»„å‹‡) <huangy81@chinatelecom.cn> Message-Id: <d7727581a8e86d4a42fc3eacf7f310419b9ebf7e.1624040308.git.huangy81@chinatelecom.cn> Reviewed-by: Peter Xu <peterx@redhat.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com>	2021-11-01 22:56:43 +01:00
Hyman Huang(é»„å‹‡)	71864eadd9	migration/dirtyrate: introduce struct and adjust DirtyRateStat introduce "DirtyRateMeasureMode" to specify what method should be used to calculate dirty rate, introduce "DirtyRateVcpu" to store dirty rate for each vcpu. use union to store stat data of specific mode Signed-off-by: Hyman Huang(é»„å‹‡) <huangy81@chinatelecom.cn> Message-Id: <661c98c40f40e163aa58334337af8f3ddf41316a.1624040308.git.huangy81@chinatelecom.cn> Reviewed-by: Peter Xu <peterx@redhat.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com>	2021-11-01 22:56:43 +01:00
Hyman Huang(é»„å‹‡)	63b41db4bc	memory: make global_dirty_tracking a bitmask since dirty ring has been introduced, there are two methods to track dirty pages of vm. it seems that "logging" has a hint on the method, so rename the global_dirty_log to global_dirty_tracking would make description more accurate. dirty rate measurement may start or stop dirty tracking during calculation. this conflict with migration because stop dirty tracking make migration leave dirty pages out then that'll be a problem. make global_dirty_tracking a bitmask can let both migration and dirty rate measurement work fine. introduce GLOBAL_DIRTY_MIGRATION and GLOBAL_DIRTY_DIRTY_RATE to distinguish what current dirty tracking aims for, migration or dirty rate. Signed-off-by: Hyman Huang(é»„å‹‡) <huangy81@chinatelecom.cn> Message-Id: <9c9388657cfa0301bd2c1cfa36e7cf6da4aeca19.1624040308.git.huangy81@chinatelecom.cn> Reviewed-by: Peter Xu <peterx@redhat.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com>	2021-11-01 22:56:43 +01:00
Li Zhijian	b390afd8c5	migration/rdma: Fix out of order wrid destination: ../qemu/build/qemu-system-x86_64 -enable-kvm -netdev tap,id=hn0,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown -device e1000,netdev=hn0,mac=50:52:54:00:11:22 -boot c -drive if=none,file=./Fedora-rdma-server-migration.qcow2,id=drive-virtio-disk0 -device virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 -m 2048 -smp 2 -device piix3-usb-uhci -device usb-tablet -monitor stdio -vga qxl -spice streaming-video=filter,port=5902,disable-ticketing -incoming rdma:192.168.22.23:8888 qemu-system-x86_64: -spice streaming-video=filter,port=5902,disable-ticketing: warning: short-form boolean option 'disable-ticketing' deprecated Please use disable-ticketing=on instead QEMU 6.0.50 monitor - type 'help' for more information (qemu) trace-event qemu_rdma_block_for_wrid_miss on (qemu) dest_init RDMA Device opened: kernel name rxe_eth0 uverbs device name uverbs2, infiniband_verbs class device path /sys/class/infiniband_verbs/uverbs2, infiniband class device path /sys/class/infiniband/rxe_eth0, transport: (2) Ethernet qemu_rdma_block_for_wrid_miss A Wanted wrid CONTROL SEND (2000) but got CONTROL RECV (4000) source: ../qemu/build/qemu-system-x86_64 -enable-kvm -netdev tap,id=hn0,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown -device e1000,netdev=hn0,mac=50:52:54:00:11:22 -boot c -drive if=none,file=./Fedora-rdma-server.qcow2,id=drive-virtio-disk0 -device virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 -m 2048 -smp 2 -device piix3-usb-uhci -device usb-tablet -monitor stdio -vga qxl -spice streaming-video=filter,port=5901,disable-ticketing -S qemu-system-x86_64: -spice streaming-video=filter,port=5901,disable-ticketing: warning: short-form boolean option 'disable-ticketing' deprecated Please use disable-ticketing=on instead QEMU 6.0.50 monitor - type 'help' for more information (qemu) (qemu) trace-event qemu_rdma_block_for_wrid_miss on (qemu) migrate -d rdma:192.168.22.23:8888 source_resolve_host RDMA Device opened: kernel name rxe_eth0 uverbs device name uverbs2, infiniband_verbs class device path /sys/class/infiniband_verbs/uverbs2, infiniband class device path /sys/class/infiniband/rxe_eth0, transport: (2) Ethernet (qemu) qemu_rdma_block_for_wrid_miss A Wanted wrid WRITE RDMA (1) but got CONTROL RECV (4000) NOTE: we use soft RoCE as the rdma device. [root@iaas-rpma images]# rdma link show rxe_eth0/1 link rxe_eth0/1 state ACTIVE physical_state LINK_UP netdev eth0 This migration could not be completed when out of order(OOO) CQ event occurs. The send queue and receive queue shared a same completion queue, and qemu_rdma_block_for_wrid() will drop the CQs it's not interested in. But the dropped CQs by qemu_rdma_block_for_wrid() could be later CQs it wants. So in this case, qemu_rdma_block_for_wrid() will block forever. OOO cases will occur in both source side and destination side. And a forever blocking happens on only SEND and RECV are out of order. OOO between 'WRITE RDMA' and 'RECV' doesn't matter. below the OOO sequence: source destination rdma_write_one() qemu_rdma_registration_handle() 1. S1: post_recv X D1: post_recv Y 2. wait for recv CQ event X 3. D2: post_send X ---------------+ 4. wait for send CQ send event X (D2) \| 5. recv CQ event X reaches (D2) \| 6. +-S2: post_send Y \| 7. \| wait for send CQ event Y \| 8. \| recv CQ event Y (S2) (drop it) \| 9. +-send CQ event Y reaches (S2) \| 10. send CQ event X reaches (D2) -----+ 11. wait recv CQ event Y (dropped by (8)) Although a hardware IB works fine in my a hundred of runs, the IB specification doesn't guaratee the CQ order in such case. Here we introduce a independent send completion queue to distinguish ibv_post_send completion queue from the original mixed completion queue. It helps us to poll the specific CQE we are really interested in. Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com>	2021-11-01 12:49:29 +01:00
Li Zhijian	911965ace9	migration/rdma: advise prefetch write for ODP region The responder mr registering with ODP will sent RNR NAK back to the requester in the face of the page fault. --------- ibv_poll_cq wc.status=13 RNR retry counter exceeded! ibv_poll_cq wrid=WRITE RDMA! --------- ibv_advise_mr(3) helps to make pages present before the actual IO is conducted so that the responder does page fault as little as possible. Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com> Reviewed-by: Marcel Apfelbaum <marcel.apfelbaum@gmail.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com>	2021-10-19 08:39:04 +02:00
Li Zhijian	e2daccb0d0	migration/rdma: Try to register On-Demand Paging memory region Previously, for the fsdax mem-backend-file, it will register failed with Operation not supported. In this case, we can try to register it with On-Demand Paging[1] like what rpma_mr_reg() does on rpma[2]. [1]: https://community.mellanox.com/s/article/understanding-on-demand-paging--odp-x [2]: http://pmem.io/rpma/manpages/v0.9.0/rpma_mr_reg.3 CC: Marcel Apfelbaum <marcel.apfelbaum@gmail.com> Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com> Reviewed-by: Marcel Apfelbaum <marcel.apfelbaum@gmail.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com>	2021-10-19 08:39:04 +02:00
Li Zhijian	5ad15e8614	migration: allow enabling mutilfd for specific protocol only To: <quintela@redhat.com>, <dgilbert@redhat.com>, <qemu-devel@nongnu.org> CC: Li Zhijian <lizhijian@cn.fujitsu.com> Date: Sat, 31 Jul 2021 22:05:52 +0800 (5 weeks, 4 days, 17 hours ago) And change the default to true so that in '-incoming defer' case, user is able to change multifd capability. Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com>	2021-10-19 08:39:04 +02:00

1 2 3 4 5 ...

1493 Commits