QEMU Virtual NVDIMM
===================

This document explains how to use the virtual NVDIMM (vNVDIMM) feature,
which is available since QEMU v2.6.0.

QEMU currently implements only the persistent memory mode of the vNVDIMM
device, not the block window mode.

Basic Usage
-----------

The storage of a vNVDIMM device in QEMU is provided by a memory
backend (i.e. memory-backend-file or memory-backend-ram). A simple
way to create a vNVDIMM device at startup time is to use the
following command line options:

 -machine pc,nvdimm=on
 -m $RAM_SIZE,slots=$N,maxmem=$MAX_SIZE
 -object memory-backend-file,id=mem1,share=on,mem-path=$PATH,size=$NVDIMM_SIZE,readonly=off
 -device nvdimm,id=nvdimm1,memdev=mem1,unarmed=off

Where,

 - the "nvdimm" machine option enables the vNVDIMM feature.

 - "slots=$N" should be equal to or larger than the total number of
   normal RAM devices and vNVDIMM devices, e.g. $N should be >= 2 here.

 - "maxmem=$MAX_SIZE" should be equal to or larger than the total size
   of normal RAM devices and vNVDIMM devices, e.g. $MAX_SIZE should be
   >= $RAM_SIZE + $NVDIMM_SIZE here.

 - "object memory-backend-file,id=mem1,share=on,mem-path=$PATH,
   size=$NVDIMM_SIZE,readonly=off" creates a backend storage of size
   $NVDIMM_SIZE on a file $PATH. All accesses to the virtual NVDIMM device go
   to the file $PATH.

   "share=on/off" controls the visibility of guest writes. If
   "share=on", then guest writes will be applied to the backend
   file. If another guest uses the same backend file with option
   "share=on", then the above writes will be visible to it as well. If
   "share=off", then guest writes won't be applied to the backend
   file and thus will be invisible to other guests.

   "readonly=on/off" controls whether the file $PATH is opened read-only or
   read/write (default).

 - "device nvdimm,id=nvdimm1,memdev=mem1,unarmed=off" creates a read/write
   virtual NVDIMM device whose storage is provided by the above memory backend
   device.

   "unarmed" controls the ACPI NFIT NVDIMM Region Mapping Structure "NVDIMM
   State Flags" Bit 3 indicating that the device is "unarmed" and cannot accept
   persistent writes. Linux guest drivers set the device to read-only when this
   bit is present. Set unarmed to on when the memdev has readonly=on.

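As a rough sketch of how the options above fit together, the following
shell fragment derives "slots" and "maxmem" from the sizes and prints the
resulting options. The sizes, the backend path /tmp/nvdimm.img and the
helper name nvdimm_args are illustrative assumptions, not values required
by QEMU.

```shell
#!/bin/sh
# Illustrative values only; adjust them to the actual setup.
RAM_SIZE_MB=4096               # normal RAM, in MiB
NVDIMM_SIZE_MB=4096            # vNVDIMM backend size, in MiB
NVDIMM_PATH=/tmp/nvdimm.img    # hypothetical backend file

# One RAM device plus one vNVDIMM device needs at least two slots.
SLOTS=2
# maxmem has to cover the RAM plus every vNVDIMM device.
MAXMEM_MB=$((RAM_SIZE_MB + NVDIMM_SIZE_MB))

# Print one QEMU option per line (hypothetical helper).
nvdimm_args() {
    printf '%s\n' \
        "-machine pc,nvdimm=on" \
        "-m ${RAM_SIZE_MB}M,slots=${SLOTS},maxmem=${MAXMEM_MB}M" \
        "-object memory-backend-file,id=mem1,share=on,mem-path=${NVDIMM_PATH},size=${NVDIMM_SIZE_MB}M,readonly=off" \
        "-device nvdimm,id=nvdimm1,memdev=mem1,unarmed=off"
}

nvdimm_args
```

The printed lines are meant to be appended to a QEMU invocation.
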
Multiple vNVDIMM devices can be created if multiple pairs of "-object"
and "-device" are provided.

For the above command line options, if the guest OS has the proper NVDIMM
driver (e.g. "CONFIG_ACPI_NFIT=y" under Linux), it should be able to
detect an NVDIMM device which is in the persistent memory mode and whose
size is $NVDIMM_SIZE.

Note:

1. Prior to QEMU v2.8.0, if memory-backend-file is used and the actual
   backend file size is not equal to the size given by the "size" option,
   QEMU will truncate the backend file by ftruncate(2), which will
   corrupt the existing data in the backend file, especially for the
   shrink case.

   QEMU v2.8.0 and later check the backend file size and the "size"
   option. If they do not match, QEMU will report errors and abort in
   order to avoid the data corruption.

2. QEMU v2.6.0 only puts a basic alignment requirement on the "size"
   option of memory-backend-file, e.g. 4KB alignment on x86. However,
   QEMU v2.7.0 puts an additional alignment requirement, which may
   require a larger value than the basic one, e.g. 2MB on x86. This
   change breaks the usage of memory-backend-file that only satisfies
   the basic alignment.

   QEMU v2.8.0 and later remove the additional alignment on non-s390x
   architectures, so the broken memory-backend-file can work again.

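One way to stay clear of both notes is to create the backend file with
exactly the size that will be passed as "size=" and to verify it before
starting QEMU. The sketch below assumes GNU coreutils (truncate(1) and
"stat -c %s"); the helper name prepare_backend is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: create the backend file if it is missing, then
# check that its size matches the value intended for "size=" (in bytes).
prepare_backend() {
    path=$1
    size_bytes=$2
    if [ ! -e "$path" ]; then
        # Create a sparse file of exactly the requested size.
        truncate -s "$size_bytes" "$path"
    fi
    actual=$(stat -c %s "$path")
    if [ "$actual" -ne "$size_bytes" ]; then
        echo "size mismatch: $path is $actual bytes, expected $size_bytes" >&2
        return 1
    fi
    # Note 2: a multiple of 2 MiB also satisfies the stricter alignment
    # that QEMU v2.7.0 applied on x86.
    if [ $((size_bytes % (2 * 1024 * 1024))) -ne 0 ]; then
        echo "warning: $size_bytes is not 2 MiB aligned" >&2
    fi
    return 0
}
```

For example, "prepare_backend /tmp/nvdimm.img 4294967296" would prepare a
4 GiB file matching "size=4G"; a non-zero return value means the existing
file should not be used with that size.
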
Label
-----

QEMU v2.7.0 and later implement the label support for vNVDIMM devices.
To enable labels on vNVDIMM devices, users can simply add the
"label-size=$SZ" option to "-device nvdimm", e.g.

 -device nvdimm,id=nvdimm1,memdev=mem1,label-size=128K

Note:

1. The minimal label size is 128KB.

2. QEMU v2.7.0 and later store labels at the end of backend storage.
   If a memory backend file, which was previously used as the backend
   of a vNVDIMM device without labels, is now used for a vNVDIMM
   device with labels, the data in the label area at the end of the file
   will be inaccessible to the guest. If any useful data (e.g. the
   meta-data of the file system) was stored there, the latter usage
   may result in guest data corruption (e.g. breakage of the guest file
   system).

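Because the label area sits at the end of the backend storage, the offset
at which it starts can be worked out before reusing an existing file.
The following sketch does that arithmetic in bytes; the helper name
label_area_offset is hypothetical, and the check that the label area is
smaller than the backend is an assumption made for this example.

```shell
#!/bin/sh
# Print the byte offset at which the label area begins, given the backend
# size and the label size (both in bytes).
label_area_offset() {
    backend_size=$1
    label_size=$2
    min_label=$((128 * 1024))    # minimal label size from note 1
    if [ "$label_size" -lt "$min_label" ]; then
        echo "label-size must be at least $min_label bytes" >&2
        return 1
    fi
    # Assumption for this sketch: the label area cannot cover the whole backend.
    if [ "$label_size" -ge "$backend_size" ]; then
        echo "label-size must be smaller than the backend size" >&2
        return 1
    fi
    echo $((backend_size - label_size))
}

# 4 GiB backend with a 128 KiB label area.
label_area_offset $((4 * 1024 * 1024 * 1024)) $((128 * 1024))
```

Any data that an earlier, label-less use of the file stored at or beyond
the printed offset is what note 2 warns about.
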
Hotplug
-------

QEMU v2.8.0 and later implement the hotplug support for vNVDIMM
devices. Similarly to the RAM hotplug, the vNVDIMM hotplug is
accomplished by two monitor commands "object_add" and "device_add".

For example, the following commands add another 4GB vNVDIMM device to
the guest:

 (qemu) object_add memory-backend-file,id=mem2,share=on,mem-path=new_nvdimm.img,size=4G
 (qemu) device_add nvdimm,id=nvdimm2,memdev=mem2

Note:

1. Each hotplugged vNVDIMM device consumes one memory slot. Users
   should always ensure the memory option "-m ...,slots=N" specifies
   a sufficient number of slots, i.e.
     N >= number of RAM devices +
          number of statically plugged vNVDIMM devices +
          number of hotplugged vNVDIMM devices

2. A similar constraint applies to the memory option "-m ...,maxmem=M", i.e.
     M >= size of RAM devices +
          size of statically plugged vNVDIMM devices +
          size of hotplugged vNVDIMM devices

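The two inequalities above can be checked before starting the guest. The
sketch below is only an illustration of that arithmetic; the helper names
required_slots, required_maxmem_mb and fits are hypothetical and sizes
are given in MiB.

```shell
#!/bin/sh
# args: RAM devices, statically plugged vNVDIMMs, vNVDIMMs to be hotplugged
required_slots() {
    echo $(( $1 + $2 + $3 ))
}

# args: RAM size, total static vNVDIMM size, total hotplug vNVDIMM size (MiB)
required_maxmem_mb() {
    echo $(( $1 + $2 + $3 ))
}

# args: configured value, required value; succeeds if configured >= required
fits() {
    [ "$1" -ge "$2" ]
}

# Example: 4 GiB RAM, one static 4 GiB vNVDIMM, one 4 GiB vNVDIMM to hotplug.
need_slots=$(required_slots 1 1 1)
need_maxmem=$(required_maxmem_mb 4096 4096 4096)
echo "slots >= $need_slots, maxmem >= ${need_maxmem}M"
```

With these example numbers, "-m 4G,slots=2,maxmem=8G" would leave no room
for the hotplugged device, since "fits 2 $need_slots" fails.
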
Alignment
---------

QEMU uses mmap(2) to map vNVDIMM backends and aligns the mapping
address to the page size (getpagesize(2)) by default. However, some
types of backends may require an alignment different from the page
size. In that case, QEMU v2.12.0 and later provide the 'align' option of
memory-backend-file to allow users to specify the proper alignment.
For device dax (e.g., /dev/dax0.0), this alignment needs to match the
alignment requirement of the device dax. The NUM of the 'align=NUM' option
must be larger than or equal to the 'align' of the device dax.
One of the following commands can be used to show the 'align' of the
device dax:

    ndctl list -X
    daxctl list -R

In order to get the proper 'align' of the device dax, you need to install
the library 'libdaxctl'.

For example, if the device dax requires 2 MB alignment, the following
QEMU command line options can be used to make it (/dev/dax0.0) the
backend of a vNVDIMM:

 -object memory-backend-file,id=mem1,share=on,mem-path=/dev/dax0.0,size=4G,align=2M
 -device nvdimm,id=nvdimm1,memdev=mem1

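The rule that NUM must be larger than or equal to the 'align' of the
device dax can be expressed as a small check. In this sketch both values
are plain byte counts entered by hand (for instance after reading the
'align' field reported by the listing commands); the helper name align_ok
is hypothetical.

```shell
#!/bin/sh
# args: NUM intended for 'align=NUM', 'align' of the device dax (bytes)
align_ok() {
    [ "$1" -ge "$2" ]
}

DEV_ALIGN=$((2 * 1024 * 1024))   # example: device dax reports 2 MiB
NUM=$((2 * 1024 * 1024))         # value planned for align=NUM

if align_ok "$NUM" "$DEV_ALIGN"; then
    echo "align=$NUM satisfies the device dax alignment"
else
    echo "align=$NUM is smaller than the device dax alignment $DEV_ALIGN" >&2
fi
```
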
Guest Data Persistence
----------------------

Though QEMU supports multiple types of vNVDIMM backends on Linux,
the only backends that can guarantee the guest write persistence are:

A. DAX device (e.g., /dev/dax0.0), or
B. DAX file (mounted with the dax option)

When using B (a file supporting direct mapping of persistent memory)
as a backend, write persistence is guaranteed if the host kernel has
support for the MAP_SYNC flag in the mmap system call (available
since Linux 4.15 and on certain distro kernels) and additionally
both 'pmem' and 'share' flags are set to 'on' on the backend.

If these conditions are not satisfied, i.e. if either 'pmem' or 'share'
is not set, if the backend file does not support DAX or if MAP_SYNC
is not supported by the host kernel, write persistence is not
guaranteed after a system crash. For compatibility reasons, these
conditions are ignored if not satisfied. Currently, no way is
provided to test for them.
For more details, please refer to the mmap(2) man page:
http://man7.org/linux/man-pages/man2/mmap.2.html.

When using other types of backends, it's suggested to set the 'unarmed'
option of '-device nvdimm' to 'on', which sets the unarmed flag of the
guest NVDIMM region mapping structure. This unarmed flag indicates to
guest software that this vNVDIMM device contains a region that cannot
accept persistent writes. As a result, for example, the guest Linux
NVDIMM driver marks such a vNVDIMM device as read-only.

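The rules of this section can be summarised as a small decision helper.
This is a sketch under assumptions: the backend kinds "devdax" and
"daxfile" and both function names are made up for illustration, and
recommending unarmed=on for a DAX file whose conditions are not met is
this example's reading of the text, not something QEMU enforces.

```shell
#!/bin/sh
# args: backend kind (devdax | daxfile | anything else),
#       pmem (on/off), share (on/off), host MAP_SYNC support (yes/no)
persistence_guaranteed() {
    case "$1" in
        devdax)
            # Case A: DAX device.
            return 0
            ;;
        daxfile)
            # Case B: needs pmem=on, share=on and MAP_SYNC on the host.
            [ "$2" = on ] && [ "$3" = on ] && [ "$4" = yes ]
            ;;
        *)
            return 1
            ;;
    esac
}

# Print the suggested 'unarmed' setting for '-device nvdimm'.
unarmed_setting() {
    if persistence_guaranteed "$@"; then
        echo "unarmed=off"
    else
        echo "unarmed=on"
    fi
}

unarmed_setting daxfile on on yes
```
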
Backend File Setup Example
--------------------------

Here are two examples showing how to set up these persistent backends on
Linux using the tool ndctl [3].

A. DAX device

Use the following command to set up /dev/dax0.0 so that the entirety of
namespace0.0 can be exposed as an emulated NVDIMM to the guest:

    ndctl create-namespace -f -e namespace0.0 -m devdax

/dev/dax0.0 can then be used directly in the "mem-path" option.

B. DAX file

Individual files on a DAX host file system can be exposed as emulated
NVDIMMs. First an fsdax block device is created, partitioned, and then
mounted with the "dax" mount option:

    ndctl create-namespace -f -e namespace0.0 -m fsdax
    (partition /dev/pmem0 with name pmem0p1)
    mount -o dax /dev/pmem0p1 /mnt
    (create or copy a disk image file with qemu-img(1), cp(1), or dd(1)
     in /mnt)

Then the new file in /mnt can be used in the "mem-path" option.

NVDIMM Persistence
------------------

ACPI 6.2 Errata A added support for a new Platform Capabilities Structure
which allows the platform to communicate what features it supports related to
NVDIMM data persistence. Users can provide a persistence value to a guest via
the optional "nvdimm-persistence" machine command line option:

    -machine pc,accel=kvm,nvdimm,nvdimm-persistence=cpu

There are currently two valid values for this option:

"mem-ctrl" - The platform supports flushing dirty data from the memory
             controller to the NVDIMMs in the event of power loss.

"cpu"      - The platform supports flushing dirty data from the CPU cache to
             the NVDIMMs in the event of power loss. This implies that the
             platform also supports flushing dirty data through the memory
             controller on power loss.

If the vNVDIMM backend is in host persistent memory that can be accessed via
the SNIA NVM Programming Model [1] (e.g., Intel NVDIMM), it's suggested to set
the 'pmem' option of memory-backend-file to 'on'. When 'pmem' is 'on' and QEMU
is built with libpmem [2] support (configured with --enable-libpmem), QEMU
will take the necessary operations to guarantee the persistence of its own
writes to the vNVDIMM backend (e.g., in vNVDIMM label emulation and live
migration). If 'pmem' is 'on' while there is no libpmem support, QEMU will
exit and report a "lack of libpmem support" message, so that it never runs
without the requested persistence guarantee.
For example, to ensure the persistence for some backend file, use the QEMU
command line:

    -object memory-backend-file,id=nv_mem,mem-path=/XXX/yyy,size=4G,pmem=on

References
----------

[1] NVM Programming Model (NPM)
    Version 1.2
    https://www.snia.org/sites/default/files/technical_work/final/NVMProgrammingModel_v1.2.pdf
[2] Persistent Memory Development Kit (PMDK), formerly known as NVML project, home page:
    http://pmem.io/pmdk/
[3] ndctl-create-namespace - provision or reconfigure a namespace
    http://pmem.io/ndctl/ndctl-create-namespace.html