qemu-e2k/block
Bin Wu 141cabe6f1 nbd: fix the co_queue multi-adding bug
When we tested the VM migartion between different hosts with NBD
devices, we found if we sent a cancel command after the drive_mirror
was just started, a coroutine re-enter error would occur. The stack
was as follow:

(gdb) bt
00)  0x00007fdfc744d885 in raise () from /lib64/libc.so.6
01)  0x00007fdfc744ee61 in abort () from /lib64/libc.so.6
02)  0x00007fdfca467cc5 in qemu_coroutine_enter (co=0x7fdfcaedb400, opaque=0x0)
at qemu-coroutine.c:118
03)  0x00007fdfca467f6c in qemu_co_queue_run_restart (co=0x7fdfcaedb400) at
qemu-coroutine-lock.c:59
04)  0x00007fdfca467be5 in coroutine_swap (from=0x7fdfcaf3c4e8,
to=0x7fdfcaedb400) at qemu-coroutine.c:96
05)  0x00007fdfca467cea in qemu_coroutine_enter (co=0x7fdfcaedb400, opaque=0x0)
at qemu-coroutine.c:123
06)  0x00007fdfca467f6c in qemu_co_queue_run_restart (co=0x7fdfcaedbdc0) at
qemu-coroutine-lock.c:59
07)  0x00007fdfca467be5 in coroutine_swap (from=0x7fdfcaf3c4e8,
to=0x7fdfcaedbdc0) at qemu-coroutine.c:96
08)  0x00007fdfca467cea in qemu_coroutine_enter (co=0x7fdfcaedbdc0, opaque=0x0)
at qemu-coroutine.c:123
09)  0x00007fdfca4a1fa4 in nbd_recv_coroutines_enter_all (s=0x7fdfcaef7dd0) at
block/nbd-client.c:41
10) 0x00007fdfca4a1ff9 in nbd_teardown_connection (client=0x7fdfcaef7dd0) at
block/nbd-client.c:50
11) 0x00007fdfca4a20f0 in nbd_reply_ready (opaque=0x7fdfcaef7dd0) at
block/nbd-client.c:92
12) 0x00007fdfca45ed80 in aio_dispatch (ctx=0x7fdfcae15e90) at aio-posix.c:144
13) 0x00007fdfca45ef1b in aio_poll (ctx=0x7fdfcae15e90, blocking=false) at
aio-posix.c:222
14) 0x00007fdfca448c34 in aio_ctx_dispatch (source=0x7fdfcae15e90, callback=0x0,
user_data=0x0) at async.c:212
15) 0x00007fdfc8f2f69a in g_main_context_dispatch () from
/usr/lib64/libglib-2.0.so.0
16) 0x00007fdfca45c391 in glib_pollfds_poll () at main-loop.c:190
17) 0x00007fdfca45c489 in os_host_main_loop_wait (timeout=1483677098) at
main-loop.c:235
18) 0x00007fdfca45c57b in main_loop_wait (nonblocking=0) at main-loop.c:484
19) 0x00007fdfca25f403 in main_loop () at vl.c:2249
20) 0x00007fdfca266fc2 in main (argc=42, argv=0x7ffff517d638,
envp=0x7ffff517d790) at vl.c:4814

We find the nbd_recv_coroutines_enter_all function (triggered by a cancel
command or a network connection breaking down) will enter a coroutine which
is waiting for the sending lock. If the lock is still held by another coroutine,
the entering coroutine will be added into the co_queue again. Latter, when the
lock is released, a coroutine re-enter error will occur.

This bug can be fixed simply by delaying the setting of recv_coroutine as
suggested by paolo. After applying this patch, we have tested the cancel
operation in mirror phase looply for more than 5 hous and everything is fine.
Without this patch, a coroutine re-enter error will occur in 5 minutes.

Signed-off-by: Bn Wu <wu.wubin@huawei.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-id: 1423552846-3896-1-git-send-email-wu.wubin@huawei.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2015-02-16 15:07:17 +00:00
..
accounting.c block: add accounting for merged requests 2015-02-06 17:24:21 +01:00
archipelago.c block: Rename BlockDriverCompletionFunc to BlockCompletionFunc 2014-10-20 13:41:27 +02:00
backup.c qmp: Add command 'blockdev-backup' 2015-01-13 11:47:56 +00:00
blkdebug.c blkdebug: Simplify and improve filename generation 2014-12-10 10:31:11 +01:00
blkverify.c block: Rename BlockDriverCompletionFunc to BlockCompletionFunc 2014-10-20 13:41:27 +02:00
block-backend.c block-backend: expose bs->bl.max_transfer_length 2015-02-06 17:24:21 +01:00
bochs.c block: Use g_new() & friends to avoid multiplying sizes 2014-08-20 11:51:28 +02:00
cloop.c cloop: Handle failure for potentially large allocations 2014-08-15 15:07:15 +02:00
commit.c block: let commit blockjob run in BDS AioContext 2014-11-03 11:41:49 +00:00
curl.c block/curl: Improve type safety of s->timeout. 2014-11-03 11:41:47 +00:00
dmg.c block/dmg: improve zeroes handling 2015-02-06 17:24:21 +01:00
gluster.c block: don't convert file size to sector size 2014-09-12 15:43:06 +02:00
iscsi.c block/iscsi: fix uninitialized variable 2015-01-03 09:22:13 +01:00
linux-aio.c linux-aio: simplify removal of completed iocbs from the list 2014-12-12 16:57:55 +00:00
Makefile.objs block/dmg: support bzip2 block entry types 2015-02-06 17:24:21 +01:00
mirror.c block: mirror - change string allocation to 2-bytes 2015-01-23 18:17:06 +01:00
nbd-client.c nbd: fix the co_queue multi-adding bug 2015-02-16 15:07:17 +00:00
nbd-client.h nbd: Drop BDS backpointer 2015-02-16 14:36:03 +00:00
nbd.c nbd: Drop BDS backpointer 2015-02-16 14:36:03 +00:00
nfs.c block/nfs: Add create_opts 2014-12-10 10:31:19 +01:00
null.c block: Rename BlockDriverCompletionFunc to BlockCompletionFunc 2014-10-20 13:41:27 +02:00
parallels.c block/parallels: fix access to not initialized memory in catalog_bitmap 2014-11-03 09:48:41 +00:00
qapi.c block: add event when disk usage exceeds threshold 2015-02-06 17:24:21 +01:00
qcow2-cache.c block: Give always priority to unused entries in the qcow2 L2 cache 2015-02-06 17:24:22 +01:00
qcow2-cluster.c qcow2: Add two more unalignment checks 2015-01-23 18:17:05 +01:00
qcow2-refcount.c qcow2: Rewrite qcow2_alloc_bytes() 2015-02-06 17:24:22 +01:00
qcow2-snapshot.c qcow2: Allow "full" discard 2014-11-03 11:41:47 +00:00
qcow2.c block: fix off-by-one error in qcow and qcow2 2015-02-06 17:24:21 +01:00
qcow2.h block/qcow2: Make get_refcount() global 2014-11-03 11:41:49 +00:00
qcow.c block: fix off-by-one error in qcow and qcow2 2015-02-06 17:24:21 +01:00
qed-check.c block: Use g_new() & friends to avoid multiplying sizes 2014-08-20 11:51:28 +02:00
qed-cluster.c
qed-gencb.c block: Rename BlockDriverCompletionFunc to BlockCompletionFunc 2014-10-20 13:41:27 +02:00
qed-l2-cache.c qed: do not evict in-use L2 table cache entries 2012-03-12 15:14:06 +01:00
qed-table.c block: Rename BlockDriverCompletionFunc to BlockCompletionFunc 2014-10-20 13:41:27 +02:00
qed.c qed: check for header size overflow 2015-02-06 17:24:21 +01:00
qed.h qed: Really remove unused field QEDAIOCB.finished 2015-02-06 17:24:21 +01:00
quorum.c block: Rename BlockDriverCompletionFunc to BlockCompletionFunc 2014-10-20 13:41:27 +02:00
raw_bsd.c block: Make essential BlockDriver objects public 2014-12-10 10:31:19 +01:00
raw-aio.h linux-aio: drop return code from laio_io_unplug and ioq_submit 2014-12-12 16:57:55 +00:00
raw-posix.c block/raw-posix.c: Fix raw_getlength() on Mac OS X block devices 2015-02-06 18:00:53 +01:00
raw-win32.c block: Make essential BlockDriver objects public 2014-12-10 10:31:19 +01:00
rbd.c block/rbd: fix memory leak 2014-12-12 13:16:56 +00:00
sheepdog.c block: Rename BlockDriverAIOCB* to BlockAIOCB* 2014-10-20 13:41:27 +02:00
snapshot.c snapshot: add bdrv_drain_all() to bdrv_snapshot_delete() to avoid concurrency problem 2014-11-03 09:48:42 +00:00
ssh.c ssh: Don't crash if either host or path is not specified. 2014-10-03 10:30:33 +01:00
stream.c block: let stream blockjob run in BDS AioContext 2014-11-03 11:41:49 +00:00
vdi.c block: remove BLOCK_OPT_NOCOW from vdi_create_opts 2014-12-10 10:31:20 +01:00
vhdx-endian.c block: VHDX endian fixes 2014-08-15 15:07:14 +02:00
vhdx-log.c block: Drop some superfluous casts from void * 2014-08-20 11:51:28 +02:00
vhdx.c block: vhdx - force FileOffsetMB field to '0' for certain block states 2015-01-23 12:41:32 -05:00
vhdx.h block: vhdx - update PAYLOAD_BLOCK_UNMAPPED value to match 1.00 spec 2014-12-12 15:42:22 +00:00
vmdk.c block: vmdk - move string allocations from stack to the heap 2015-01-23 18:17:05 +01:00
vpc.c block: remove BLOCK_OPT_NOCOW from vpc_create_opts 2014-12-10 10:31:21 +01:00
vvfat.c block: update string sizes for filename,backing_file,exact_filename 2015-01-23 18:17:06 +01:00
win32-aio.c block: Rename BlockDriverCompletionFunc to BlockCompletionFunc 2014-10-20 13:41:27 +02:00
write-threshold.c block: add event when disk usage exceeds threshold 2015-02-06 17:24:21 +01:00