iothread: fix iothread hang when stop too soon

Lukas reported an hard to reproduce QMP iothread hang on s390 that
QEMU might hang at pthread_join() of the QMP monitor iothread before
quitting:

  Thread 1
  #0  0x000003ffad10932c in pthread_join
  #1  0x0000000109e95750 in qemu_thread_join
      at /home/thuth/devel/qemu/util/qemu-thread-posix.c:570
  #2  0x0000000109c95a1c in iothread_stop
  #3  0x0000000109bb0874 in monitor_cleanup
  #4  0x0000000109b55042 in main

While the iothread is still in the main loop:

  Thread 4
  #0  0x000003ffad0010e4 in ??
  #1  0x000003ffad553958 in g_main_context_iterate.isra.19
  #2  0x000003ffad553d90 in g_main_loop_run
  #3  0x0000000109c9585a in iothread_run
      at /home/thuth/devel/qemu/iothread.c:74
  #4  0x0000000109e94752 in qemu_thread_start
      at /home/thuth/devel/qemu/util/qemu-thread-posix.c:502
  #5  0x000003ffad10825a in start_thread
  #6  0x000003ffad00dcf2 in thread_start

IMHO it's because there's a race between the main thread and iothread
when stopping the thread in following sequence:

    main thread                       iothread
    ===========                       ==============
                                      aio_poll()
    iothread_get_g_main_context
      set iothread->worker_context
    iothread_stop
      schedule iothread_stop_bh
                                        execute iothread_stop_bh [1]
                                          set iothread->running=false
                                          (since main_loop==NULL so
                                           skip to quit main loop.
                                           Note: although main_loop is
                                           NULL but worker_context is
                                           not!)
                                      atomic_read(&iothread->worker_context) [2]
                                        create main_loop object
                                        g_main_loop_run() [3]
    pthread_join() [4]

We can see that when execute iothread_stop_bh() at [1] it's possible
that main_loop is still NULL because it's only created until the first
check of the worker_context later at [2].  Then the iothread will hang
in the main loop [3] and it'll starve the main thread too [4].

Here the simple solution should be that we check again the "running"
variable before check against worker_context.

CC: Thomas Huth <thuth@redhat.com>
CC: Dr. David Alan Gilbert <dgilbert@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Lukáš Doktor <ldoktor@redhat.com>
CC: Markus Armbruster <armbru@redhat.com>
CC: Eric Blake <eblake@redhat.com>
CC: Paolo Bonzini <pbonzini@redhat.com>
Reported-by: Lukáš Doktor <ldoktor@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Tested-by: Thomas Huth <thuth@redhat.com>
Message-id: 20190129051432.22023-1-peterx@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This commit is contained in:
Peter Xu 2019-01-29 13:14:32 +08:00 committed by Stefan Hajnoczi
parent 22c5f44651
commit 6c95363d97

View File

@ -63,7 +63,11 @@ static void *iothread_run(void *opaque)
while (iothread->running) { while (iothread->running) {
aio_poll(iothread->ctx, true); aio_poll(iothread->ctx, true);
if (atomic_read(&iothread->worker_context)) { /*
* We must check the running state again in case it was
* changed in previous aio_poll()
*/
if (iothread->running && atomic_read(&iothread->worker_context)) {
GMainLoop *loop; GMainLoop *loop;
g_main_context_push_thread_default(iothread->worker_context); g_main_context_push_thread_default(iothread->worker_context);