ef796ee93b
Replace the return path retry logic with finishing and restarting the thread. This fixes a race when resuming the migration that leads to a segfault. Currently when doing postcopy we consider that an IO error on the return path file could be due to a network intermittency. We then keep the thread alive but have it do cleanup of the 'from_dst_file' and wait on the 'postcopy_pause_rp' semaphore. When the user issues a migrate resume, a new return path is opened and the thread is allowed to continue. There's a race condition in the above mechanism. It is possible for the new return path file to be setup *before* the cleanup code in the return path thread has had a chance to run, leading to the *new* file being closed and the pointer set to NULL. When the thread is released after the resume, it tries to dereference 'from_dst_file' and crashes: Thread 7 "return path" received signal SIGSEGV, Segmentation fault. [Switching to Thread 0x7fffd1dbf700 (LWP 9611)] 0x00005555560e4893 in qemu_file_get_error_obj (f=0x0, errp=0x0) at ../migration/qemu-file.c:154 154 return f->last_error; (gdb) bt #0 0x00005555560e4893 in qemu_file_get_error_obj (f=0x0, errp=0x0) at ../migration/qemu-file.c:154 #1 0x00005555560e4983 in qemu_file_get_error (f=0x0) at ../migration/qemu-file.c:206 #2 0x0000555555b9a1df in source_return_path_thread (opaque=0x555556e06000) at ../migration/migration.c:1876 #3 0x000055555602e14f in qemu_thread_start (args=0x55555782e780) at ../util/qemu-thread-posix.c:541 #4 0x00007ffff38d76ea in start_thread (arg=0x7fffd1dbf700) at pthread_create.c:477 #5 0x00007ffff35efa6f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95 Here's the race (important bit is open_return_path happening before migration_release_dst_files): migration | qmp | return path --------------------------+-----------------------------+--------------------------------- qmp_migrate_pause() shutdown(ms->to_dst_file) f->last_error = -EIO migrate_detect_error() postcopy_pause() set_state(PAUSED) wait(postcopy_pause_sem) qmp_migrate(resume) migrate_fd_connect() resume = state == PAUSED open_return_path <-- TOO SOON! set_state(RECOVER) post(postcopy_pause_sem) (incoming closes to_src_file) res = qemu_file_get_error(rp) migration_release_dst_files() ms->rp_state.from_dst_file = NULL post(postcopy_pause_rp_sem) postcopy_pause_return_path_thread() wait(postcopy_pause_rp_sem) rp = ms->rp_state.from_dst_file goto retry qemu_file_get_error(rp) SIGSEGV ------------------------------------------------------------------------------------------- We can keep the retry logic without having the thread alive and waiting. The only piece of data used by it is the 'from_dst_file' and it is only allowed to proceed after a migrate resume is issued and the semaphore released at migrate_fd_connect(). Move the retry logic to outside the thread by waiting for the thread to finish before pausing the migration. Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: Fabiano Rosas <farosas@suse.de> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Message-ID: <20230918172822.19052-8-farosas@suse.de> |
||
---|---|---|
.. | ||
block-dirty-bitmap.c | ||
block.c | ||
block.h | ||
channel-block.c | ||
channel-block.h | ||
channel.c | ||
channel.h | ||
colo-failover.c | ||
colo.c | ||
dirtyrate.c | ||
dirtyrate.h | ||
exec.c | ||
exec.h | ||
fd.c | ||
fd.h | ||
global_state.c | ||
meson.build | ||
migration-hmp-cmds.c | ||
migration-stats.c | ||
migration-stats.h | ||
migration.c | ||
migration.h | ||
multifd-zlib.c | ||
multifd-zstd.c | ||
multifd.c | ||
multifd.h | ||
options.c | ||
options.h | ||
page_cache.c | ||
page_cache.h | ||
postcopy-ram.c | ||
postcopy-ram.h | ||
qemu-file.c | ||
qemu-file.h | ||
ram-compress.c | ||
ram-compress.h | ||
ram.c | ||
ram.h | ||
rdma.c | ||
rdma.h | ||
savevm.c | ||
savevm.h | ||
socket.c | ||
socket.h | ||
target.c | ||
threadinfo.c | ||
threadinfo.h | ||
tls.c | ||
tls.h | ||
trace-events | ||
trace.h | ||
vmstate-types.c | ||
vmstate.c | ||
xbzrle.c | ||
xbzrle.h | ||
yank_functions.c | ||
yank_functions.h |