qemu-e2k/migration
Fabiano Rosas ef796ee93b migration: Replace the return path retry logic
Replace the return path retry logic with finishing and restarting the
thread. This fixes a race when resuming the migration that leads to a
segfault.

Currently when doing postcopy we consider that an IO error on the
return path file could be due to a network intermittency. We then keep
the thread alive but have it do cleanup of the 'from_dst_file' and
wait on the 'postcopy_pause_rp' semaphore. When the user issues a
migrate resume, a new return path is opened and the thread is allowed
to continue.

There's a race condition in the above mechanism. It is possible for
the new return path file to be setup *before* the cleanup code in the
return path thread has had a chance to run, leading to the *new* file
being closed and the pointer set to NULL. When the thread is released
after the resume, it tries to dereference 'from_dst_file' and crashes:

Thread 7 "return path" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffd1dbf700 (LWP 9611)]
0x00005555560e4893 in qemu_file_get_error_obj (f=0x0, errp=0x0) at ../migration/qemu-file.c:154
154         return f->last_error;

(gdb) bt
 #0  0x00005555560e4893 in qemu_file_get_error_obj (f=0x0, errp=0x0) at ../migration/qemu-file.c:154
 #1  0x00005555560e4983 in qemu_file_get_error (f=0x0) at ../migration/qemu-file.c:206
 #2  0x0000555555b9a1df in source_return_path_thread (opaque=0x555556e06000) at ../migration/migration.c:1876
 #3  0x000055555602e14f in qemu_thread_start (args=0x55555782e780) at ../util/qemu-thread-posix.c:541
 #4  0x00007ffff38d76ea in start_thread (arg=0x7fffd1dbf700) at pthread_create.c:477
 #5  0x00007ffff35efa6f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Here's the race (important bit is open_return_path happening before
migration_release_dst_files):

migration                 | qmp                         | return path
--------------------------+-----------------------------+---------------------------------
			    qmp_migrate_pause()
			     shutdown(ms->to_dst_file)
			      f->last_error = -EIO
migrate_detect_error()
 postcopy_pause()
  set_state(PAUSED)
  wait(postcopy_pause_sem)
			    qmp_migrate(resume)
			    migrate_fd_connect()
			     resume = state == PAUSED
			     open_return_path <-- TOO SOON!
			     set_state(RECOVER)
			     post(postcopy_pause_sem)
							(incoming closes to_src_file)
							res = qemu_file_get_error(rp)
							migration_release_dst_files()
							ms->rp_state.from_dst_file = NULL
  post(postcopy_pause_rp_sem)
							postcopy_pause_return_path_thread()
							  wait(postcopy_pause_rp_sem)
							rp = ms->rp_state.from_dst_file
							goto retry
							qemu_file_get_error(rp)
							SIGSEGV
-------------------------------------------------------------------------------------------

We can keep the retry logic without having the thread alive and
waiting. The only piece of data used by it is the 'from_dst_file' and
it is only allowed to proceed after a migrate resume is issued and the
semaphore released at migrate_fd_connect().

Move the retry logic to outside the thread by waiting for the thread
to finish before pausing the migration.

Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-ID: <20230918172822.19052-8-farosas@suse.de>
2023-09-27 13:58:02 -04:00
..
block-dirty-bitmap.c migration: Move rate_limit_max and rate_limit_used to migration_stats 2023-05-18 18:40:51 +02:00
block.c block-migration: Ensure we don't crash during migration cleanup 2023-08-30 07:39:10 -04:00
block.h
channel-block.c io: follow coroutine AioContext in qio_channel_yield() 2023-09-07 20:32:11 -05:00
channel-block.h migration: introduce a QIOChannel impl for BlockDriverState VMState 2022-06-22 19:33:43 +01:00
channel.c migration: check magic value for deciding the mapping of channels 2023-02-06 19:22:57 +01:00
channel.h migration: check magic value for deciding the mapping of channels 2023-02-06 19:22:57 +01:00
colo-failover.c migration/colo: Improve an x-colo-lost-heartbeat error message 2023-02-23 14:10:17 +01:00
colo.c migration: process_incoming_migration_co(): move colo part to colo 2023-05-18 18:40:51 +02:00
dirtyrate.c migration/dirtyrate: Fix precision losses and g_usleep overshoot 2023-08-29 10:19:03 +08:00
dirtyrate.h migration/dirtyrate: Refactor dirty page rate calculation 2022-07-20 12:15:08 +01:00
exec.c *: Add missing includes of qemu/error-report.h 2023-03-22 15:06:57 +00:00
exec.h
fd.c bulk: Remove pointless QOM casts 2023-06-05 20:48:34 +02:00
fd.h
global_state.c migration: never fail in global_state_store() 2023-06-02 01:03:19 +02:00
meson.build meson: Replace softmmu_ss -> system_ss 2023-06-20 10:01:30 +02:00
migration-hmp-cmds.c migration: Extend query-migrate to provide dirty page limit info 2023-07-26 10:55:56 +02:00
migration-stats.c migration: spelling fixes 2023-07-25 17:13:20 +03:00
migration-stats.h migration: We don't need the field rate_limit_used anymore 2023-05-18 18:40:51 +02:00
migration.c migration: Replace the return path retry logic 2023-09-27 13:58:02 -04:00
migration.h migration: Replace the return path retry logic 2023-09-27 13:58:02 -04:00
multifd-zlib.c migration: spelling fixes 2023-07-25 17:13:20 +03:00
multifd-zstd.c migration: spelling fixes 2023-07-25 17:13:20 +03:00
multifd.c migration/multifd: Rename threadinfo.c functions 2023-07-26 10:55:56 +02:00
multifd.h multifd: Add the ramblock to MultiFDRecvParams 2023-05-10 18:48:11 +02:00
options.c migration: enforce multifd and postcopy preempt to be set before incoming 2023-07-26 10:55:56 +02:00
options.h migration: Introduce dirty-limit capability 2023-07-26 10:55:56 +02:00
page_cache.c migration: Fix cache_init()'s "Failed to allocate" error messages 2021-02-08 11:19:51 +00:00
page_cache.h migration: Clean up signed vs. unsigned XBZRLE cache-size 2021-02-08 11:19:51 +00:00
postcopy-ram.c migration: Fix race that dest preempt thread close too early 2023-09-27 13:58:02 -04:00
postcopy-ram.h migration: Allow postcopy_ram_supported_by_host() to report err 2023-04-27 10:18:25 +02:00
qemu-file.c migration/rdma: Split qemu_fopen_rdma() into input/output functions 2023-07-26 10:55:56 +02:00
qemu-file.h migration/rdma: Split qemu_fopen_rdma() into input/output functions 2023-07-26 10:55:56 +02:00
ram-compress.c ram-compress.c: Make target independent 2023-05-08 15:25:26 +02:00
ram-compress.h ram.c: Move core decompression code into its own file 2023-05-08 15:25:26 +02:00
ram.c migration: Implement dirty-limit convergence algo 2023-07-26 10:55:56 +02:00
ram.h migration/ram: Expose ramblock_is_ignored() as migrate_ram_is_ignored() 2023-07-12 09:25:37 +02:00
rdma.c io: follow coroutine AioContext in qio_channel_yield() 2023-09-07 20:32:11 -05:00
rdma.h
savevm.c migration: Add .save_prepare() handler to struct SaveVMHandlers 2023-09-11 08:34:06 +02:00
savevm.h migration: Add .save_prepare() handler to struct SaveVMHandlers 2023-09-11 08:34:06 +02:00
socket.c migration: Move migrate_use_zero_copy_send() to options.c 2023-04-24 15:01:46 +02:00
socket.h migration: Postcopy preemption preparation on channel creation 2022-07-20 12:15:08 +01:00
target.c migration: Add migration prefix to functions in target.c 2023-09-11 08:34:06 +02:00
threadinfo.c migration/multifd: Protect accesses to migration_threads 2023-07-26 10:55:56 +02:00
threadinfo.h migration/multifd: Protect accesses to migration_threads 2023-07-26 10:55:56 +02:00
tls.c migration: Drop unused parameter for migration_tls_client_create() 2023-05-03 11:24:20 +02:00
tls.h migration: Drop unused parameter for migration_tls_client_create() 2023-05-03 11:24:20 +02:00
trace-events migration: Implement dirty-limit convergence algo 2023-07-26 10:55:56 +02:00
trace.h
vmstate-types.c Move CPU softfloat unions to cpu-float.h 2022-04-06 14:31:43 +02:00
vmstate.c qemu-file: Rename qemu_file_transferred_ fast -> noflush 2023-07-26 10:55:56 +02:00
xbzrle.c migration/xbzrle: Use i386 host/cpuinfo.h 2023-05-23 16:51:18 -07:00
xbzrle.h migration/xbzrle: Use i386 host/cpuinfo.h 2023-05-23 16:51:18 -07:00
yank_functions.c bulk: Remove pointless QOM casts 2023-06-05 20:48:34 +02:00
yank_functions.h migration: Move the yank unregister of channel_close out 2021-07-26 12:45:03 +01:00