2016-10-27 14:43:07 +08:00
|
|
|
|
COarse-grained LOck-stepping Virtual Machines for Non-stop Service
|
|
|
|
|
----------------------------------------
|
|
|
|
|
Copyright (c) 2016 Intel Corporation
|
|
|
|
|
Copyright (c) 2016 HUAWEI TECHNOLOGIES CO., LTD.
|
|
|
|
|
Copyright (c) 2016 Fujitsu, Corp.
|
|
|
|
|
|
|
|
|
|
This work is licensed under the terms of the GNU GPL, version 2 or later.
|
|
|
|
|
See the COPYING file in the top-level directory.
|
|
|
|
|
|
|
|
|
|
This document gives an overview of COLO's design and how to use it.
|
|
|
|
|
|
|
|
|
|
== Background ==
|
|
|
|
|
Virtual machine (VM) replication is a well known technique for providing
|
|
|
|
|
application-agnostic software-implemented hardware fault tolerance,
|
|
|
|
|
also known as "non-stop service".
|
|
|
|
|
|
|
|
|
|
COLO (COarse-grained LOck-stepping) is a high availability solution.
|
|
|
|
|
Both primary VM (PVM) and secondary VM (SVM) run in parallel. They receive the
|
|
|
|
|
same request from client, and generate response in parallel too.
|
|
|
|
|
If the response packets from PVM and SVM are identical, they are released
|
|
|
|
|
immediately. Otherwise, a VM checkpoint (on demand) is conducted.
|
|
|
|
|
|
|
|
|
|
== Architecture ==
|
|
|
|
|
|
|
|
|
|
The architecture of COLO is shown in the diagram below.
|
|
|
|
|
It consists of a pair of networked physical nodes:
|
|
|
|
|
The primary node running the PVM, and the secondary node running the SVM
|
|
|
|
|
to maintain a valid replica of the PVM.
|
|
|
|
|
PVM and SVM execute in parallel and generate output of response packets for
|
|
|
|
|
client requests according to the application semantics.
|
|
|
|
|
|
|
|
|
|
The incoming packets from the client or external network are received by the
|
|
|
|
|
primary node, and then forwarded to the secondary node, so that both the PVM
|
|
|
|
|
and the SVM are stimulated with the same requests.
|
|
|
|
|
|
|
|
|
|
COLO receives the outbound packets from both the PVM and SVM and compares them
|
|
|
|
|
before allowing the output to be sent to clients.
|
|
|
|
|
|
|
|
|
|
The SVM is qualified as a valid replica of the PVM, as long as it generates
|
|
|
|
|
identical responses to all client requests. Once the differences in the outputs
|
|
|
|
|
are detected between the PVM and SVM, COLO withholds transmission of the
|
|
|
|
|
outbound packets until it has successfully synchronized the PVM state to the SVM.
|
|
|
|
|
|
2016-11-01 11:38:12 +08:00
|
|
|
|
Primary Node Secondary Node
|
|
|
|
|
+------------+ +-----------------------+ +------------------------+ +------------+
|
|
|
|
|
| | | HeartBeat +<----->+ HeartBeat | | |
|
|
|
|
|
| Primary VM | +-----------+-----------+ +-----------+------------+ |Secondary VM|
|
|
|
|
|
| | | | | |
|
|
|
|
|
| | +-----------|-----------+ +-----------|------------+ | |
|
|
|
|
|
| | |QEMU +---v----+ | |QEMU +----v---+ | | |
|
|
|
|
|
| | | |Failover| | | |Failover| | | |
|
|
|
|
|
| | | +--------+ | | +--------+ | | |
|
|
|
|
|
| | | +---------------+ | | +---------------+ | | |
|
|
|
|
|
| | | | VM Checkpoint +-------------->+ VM Checkpoint | | | |
|
|
|
|
|
| | | +---------------+ | | +---------------+ | | |
|
|
|
|
|
|Requests<--------------------------\ /-----------------\ /--------------------->Requests|
|
|
|
|
|
| | | ^ ^ | | | | | | |
|
|
|
|
|
|Responses+---------------------\ /-|-|------------\ /-------------------------+Responses|
|
|
|
|
|
| | | | | | | | | | | | | | | |
|
|
|
|
|
| | | +-----------+ | | | | | | | | | | +----------+ | | |
|
|
|
|
|
| | | | COLO disk | | | | | | | | | | | | COLO disk| | | |
|
|
|
|
|
| | | | Manager +---------------------------->| Manager | | | |
|
|
|
|
|
| | | ++----------+ v v | | | | | v v | +---------++ | | |
|
|
|
|
|
| | | |+-----------+-+-+-++| | ++-+--+-+---------+ | | | |
|
|
|
|
|
| | | || COLO Proxy || | | COLO Proxy | | | | |
|
|
|
|
|
| | | || (compare packet || | |(adjust sequence | | | | |
|
|
|
|
|
| | | ||and mirror packet)|| | | and ACK) | | | | |
|
|
|
|
|
| | | |+------------+---+-+| | +-----------------+ | | | |
|
|
|
|
|
+------------+ +-----------------------+ +------------------------+ +------------+
|
|
|
|
|
+------------+ | | | | +------------+
|
|
|
|
|
| VM Monitor | | | | | | VM Monitor |
|
|
|
|
|
+------------+ | | | | +------------+
|
|
|
|
|
+---------------------------------------+ +----------------------------------------+
|
|
|
|
|
| Kernel | | | | | Kernel | |
|
|
|
|
|
+---------------------------------------+ +----------------------------------------+
|
|
|
|
|
| | | |
|
|
|
|
|
+--------------v+ +---------v---+--+ +------------------+ +v-------------+
|
|
|
|
|
| Storage | |External Network| | External Network | | Storage |
|
|
|
|
|
+---------------+ +----------------+ +------------------+ +--------------+
|
|
|
|
|
|
2016-10-27 14:43:07 +08:00
|
|
|
|
|
|
|
|
|
== Components introduction ==
|
|
|
|
|
|
|
|
|
|
You can see there are several components in COLO's diagram of architecture.
|
|
|
|
|
Their functions are described below.
|
|
|
|
|
|
|
|
|
|
HeartBeat:
|
|
|
|
|
Runs on both the primary and secondary nodes, to periodically check platform
|
|
|
|
|
availability. When the primary node suffers a hardware fail-stop failure,
|
|
|
|
|
the heartbeat stops responding, the secondary node will trigger a failover
|
|
|
|
|
as soon as it determines the absence.
|
|
|
|
|
|
|
|
|
|
COLO disk Manager:
|
|
|
|
|
When primary VM writes data into image, the colo disk manger captures this data
|
|
|
|
|
and sends it to secondary VM's which makes sure the context of secondary VM's
|
|
|
|
|
image is consistent with the context of primary VM 's image.
|
|
|
|
|
For more details, please refer to docs/block-replication.txt.
|
|
|
|
|
|
|
|
|
|
Checkpoint/Failover Controller:
|
|
|
|
|
Modifications of save/restore flow to realize continuous migration,
|
|
|
|
|
to make sure the state of VM in Secondary side is always consistent with VM in
|
|
|
|
|
Primary side.
|
|
|
|
|
|
|
|
|
|
COLO Proxy:
|
2019-02-20 13:27:26 +08:00
|
|
|
|
Delivers packets to Primary and Secondary, and then compare the responses from
|
2016-10-27 14:43:07 +08:00
|
|
|
|
both side. Then decide whether to start a checkpoint according to some rules.
|
2018-07-13 14:17:27 +02:00
|
|
|
|
Please refer to docs/colo-proxy.txt for more information.
|
2016-10-27 14:43:07 +08:00
|
|
|
|
|
|
|
|
|
Note:
|
|
|
|
|
HeartBeat has not been implemented yet, so you need to trigger failover process
|
|
|
|
|
by using 'x-colo-lost-heartbeat' command.
|
|
|
|
|
|
2018-09-03 12:39:00 +08:00
|
|
|
|
== COLO operation status ==
|
|
|
|
|
|
|
|
|
|
+-----------------+
|
|
|
|
|
| |
|
|
|
|
|
| Start COLO |
|
|
|
|
|
| |
|
|
|
|
|
+--------+--------+
|
|
|
|
|
|
|
|
|
|
|
| Main qmp command:
|
|
|
|
|
| migrate-set-capabilities with x-colo
|
|
|
|
|
| migrate
|
|
|
|
|
|
|
|
|
|
|
v
|
|
|
|
|
+--------+--------+
|
|
|
|
|
| |
|
|
|
|
|
| COLO running |
|
|
|
|
|
| |
|
|
|
|
|
+--------+--------+
|
|
|
|
|
|
|
|
|
|
|
| Main qmp command:
|
|
|
|
|
| x-colo-lost-heartbeat
|
|
|
|
|
| or
|
|
|
|
|
| some error happened
|
|
|
|
|
v
|
|
|
|
|
+--------+--------+
|
|
|
|
|
| | send qmp event:
|
|
|
|
|
| COLO failover | COLO_EXIT
|
|
|
|
|
| |
|
|
|
|
|
+-----------------+
|
|
|
|
|
|
|
|
|
|
COLO use the qmp command to switch and report operation status.
|
|
|
|
|
The diagram just shows the main qmp command, you can get the detail
|
|
|
|
|
in test procedure.
|
|
|
|
|
|
2016-10-27 14:43:07 +08:00
|
|
|
|
== Test procedure ==
|
|
|
|
|
1. Startup qemu
|
|
|
|
|
Primary:
|
2018-06-13 07:05:19 +02:00
|
|
|
|
# qemu-system-x86_64 -accel kvm -m 2048 -smp 2 -qmp stdio -name primary \
|
|
|
|
|
-device piix3-usb-uhci -vnc :7 \
|
2016-10-27 14:43:07 +08:00
|
|
|
|
-device usb-tablet -netdev tap,id=hn0,vhost=off \
|
|
|
|
|
-device virtio-net-pci,id=net-pci0,netdev=hn0 \
|
|
|
|
|
-drive if=virtio,id=primary-disk0,driver=quorum,read-pattern=fifo,vote-threshold=1,\
|
|
|
|
|
children.0.file.filename=1.raw,\
|
|
|
|
|
children.0.driver=raw -S
|
|
|
|
|
Secondary:
|
2018-06-13 07:05:19 +02:00
|
|
|
|
# qemu-system-x86_64 -accel kvm -m 2048 -smp 2 -qmp stdio -name secondary \
|
|
|
|
|
-device piix3-usb-uhci -vnc :7 \
|
2016-10-27 14:43:07 +08:00
|
|
|
|
-device usb-tablet -netdev tap,id=hn0,vhost=off \
|
|
|
|
|
-device virtio-net-pci,id=net-pci0,netdev=hn0 \
|
|
|
|
|
-drive if=none,id=secondary-disk0,file.filename=1.raw,driver=raw,node-name=node0 \
|
|
|
|
|
-drive if=virtio,id=active-disk0,driver=replication,mode=secondary,\
|
|
|
|
|
file.driver=qcow2,top-id=active-disk0,\
|
|
|
|
|
file.file.filename=/mnt/ramfs/active_disk.img,\
|
|
|
|
|
file.backing.driver=qcow2,\
|
|
|
|
|
file.backing.file.filename=/mnt/ramfs/hidden_disk.img,\
|
|
|
|
|
file.backing.backing=secondary-disk0 \
|
|
|
|
|
-incoming tcp:0:8888
|
|
|
|
|
|
|
|
|
|
2. On Secondary VM's QEMU monitor, issue command
|
|
|
|
|
{'execute':'qmp_capabilities'}
|
|
|
|
|
{ 'execute': 'nbd-server-start',
|
|
|
|
|
'arguments': {'addr': {'type': 'inet', 'data': {'host': 'xx.xx.xx.xx', 'port': '8889'} } }
|
|
|
|
|
}
|
2017-03-20 17:27:55 -07:00
|
|
|
|
{'execute': 'nbd-server-add', 'arguments': {'device': 'secondary-disk0', 'writable': true } }
|
2016-10-27 14:43:07 +08:00
|
|
|
|
|
|
|
|
|
Note:
|
|
|
|
|
a. The qmp command nbd-server-start and nbd-server-add must be run
|
|
|
|
|
before running the qmp command migrate on primary QEMU
|
|
|
|
|
b. Active disk, hidden disk and nbd target's length should be the
|
|
|
|
|
same.
|
|
|
|
|
c. It is better to put active disk and hidden disk in ramdisk.
|
|
|
|
|
|
|
|
|
|
3. On Primary VM's QEMU monitor, issue command:
|
|
|
|
|
{'execute':'qmp_capabilities'}
|
|
|
|
|
{ 'execute': 'human-monitor-command',
|
|
|
|
|
'arguments': {'command-line': 'drive_add -n buddy driver=replication,mode=primary,file.driver=nbd,file.host=xx.xx.xx.xx,file.port=8889,file.export=secondary-disk0,node-name=nbd_client0'}}
|
|
|
|
|
{ 'execute':'x-blockdev-change', 'arguments':{'parent': 'primary-disk0', 'node': 'nbd_client0' } }
|
|
|
|
|
{ 'execute': 'migrate-set-capabilities',
|
|
|
|
|
'arguments': {'capabilities': [ {'capability': 'x-colo', 'state': true } ] } }
|
|
|
|
|
{ 'execute': 'migrate', 'arguments': {'uri': 'tcp:xx.xx.xx.xx:8888' } }
|
|
|
|
|
|
|
|
|
|
Note:
|
|
|
|
|
a. There should be only one NBD Client for each primary disk.
|
|
|
|
|
b. xx.xx.xx.xx is the secondary physical machine's hostname or IP
|
|
|
|
|
c. The qmp command line must be run after running qmp command line in
|
|
|
|
|
secondary qemu.
|
|
|
|
|
|
|
|
|
|
4. After the above steps, you will see, whenever you make changes to PVM, SVM will be synced.
|
|
|
|
|
You can issue command '{ "execute": "migrate-set-parameters" , "arguments":{ "x-checkpoint-delay": 2000 } }'
|
|
|
|
|
to change the checkpoint period time
|
|
|
|
|
|
|
|
|
|
5. Failover test
|
|
|
|
|
You can kill Primary VM and run 'x_colo_lost_heartbeat' in Secondary VM's
|
|
|
|
|
monitor at the same time, then SVM will failover and client will not detect this
|
|
|
|
|
change.
|
|
|
|
|
|
|
|
|
|
Before issuing '{ "execute": "x-colo-lost-heartbeat" }' command, we have to
|
|
|
|
|
issue block related command to stop block replication.
|
|
|
|
|
Primary:
|
|
|
|
|
Remove the nbd child from the quorum:
|
|
|
|
|
{ 'execute': 'x-blockdev-change', 'arguments': {'parent': 'colo-disk0', 'child': 'children.1'}}
|
|
|
|
|
{ 'execute': 'human-monitor-command','arguments': {'command-line': 'drive_del blk-buddy0'}}
|
|
|
|
|
Note: there is no qmp command to remove the blockdev now
|
|
|
|
|
|
|
|
|
|
Secondary:
|
|
|
|
|
The primary host is down, so we should do the following thing:
|
|
|
|
|
{ 'execute': 'nbd-server-stop' }
|
|
|
|
|
|
|
|
|
|
== TODO ==
|
|
|
|
|
1. Support continuous VM replication.
|
|
|
|
|
2. Support shared storage.
|
|
|
|
|
3. Develop the heartbeat part.
|
|
|
|
|
4. Reduce checkpoint VM’s downtime while doing checkpoint.
|