9ab3aad281
Add a machine command line option to allow the user to control the Platform Capabilities Structure in the virtualized NFIT. This Platform Capabilities Structure was added in ACPI 6.2 Errata A. Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
183 lines
7.2 KiB
Plaintext
183 lines
7.2 KiB
Plaintext
QEMU Virtual NVDIMM
|
|
===================
|
|
|
|
This document explains the usage of virtual NVDIMM (vNVDIMM) feature
|
|
which is available since QEMU v2.6.0.
|
|
|
|
The current QEMU only implements the persistent memory mode of vNVDIMM
|
|
device and not the block window mode.
|
|
|
|
Basic Usage
|
|
-----------
|
|
|
|
The storage of a vNVDIMM device in QEMU is provided by the memory
|
|
backend (i.e. memory-backend-file and memory-backend-ram). A simple
|
|
way to create a vNVDIMM device at startup time is done via the
|
|
following command line options:
|
|
|
|
-machine pc,nvdimm
|
|
-m $RAM_SIZE,slots=$N,maxmem=$MAX_SIZE
|
|
-object memory-backend-file,id=mem1,share=on,mem-path=$PATH,size=$NVDIMM_SIZE
|
|
-device nvdimm,id=nvdimm1,memdev=mem1
|
|
|
|
Where,
|
|
|
|
- the "nvdimm" machine option enables vNVDIMM feature.
|
|
|
|
- "slots=$N" should be equal to or larger than the total amount of
|
|
normal RAM devices and vNVDIMM devices, e.g. $N should be >= 2 here.
|
|
|
|
- "maxmem=$MAX_SIZE" should be equal to or larger than the total size
|
|
of normal RAM devices and vNVDIMM devices, e.g. $MAX_SIZE should be
|
|
>= $RAM_SIZE + $NVDIMM_SIZE here.
|
|
|
|
- "object memory-backend-file,id=mem1,share=on,mem-path=$PATH,size=$NVDIMM_SIZE"
|
|
creates a backend storage of size $NVDIMM_SIZE on a file $PATH. All
|
|
accesses to the virtual NVDIMM device go to the file $PATH.
|
|
|
|
"share=on/off" controls the visibility of guest writes. If
|
|
"share=on", then guest writes will be applied to the backend
|
|
file. If another guest uses the same backend file with option
|
|
"share=on", then above writes will be visible to it as well. If
|
|
"share=off", then guest writes won't be applied to the backend
|
|
file and thus will be invisible to other guests.
|
|
|
|
- "device nvdimm,id=nvdimm1,memdev=mem1" creates a virtual NVDIMM
|
|
device whose storage is provided by above memory backend device.
|
|
|
|
Multiple vNVDIMM devices can be created if multiple pairs of "-object"
|
|
and "-device" are provided.
|
|
|
|
For above command line options, if the guest OS has the proper NVDIMM
|
|
driver, it should be able to detect a NVDIMM device which is in the
|
|
persistent memory mode and whose size is $NVDIMM_SIZE.
|
|
|
|
Note:
|
|
|
|
1. Prior to QEMU v2.8.0, if memory-backend-file is used and the actual
|
|
backend file size is not equal to the size given by "size" option,
|
|
QEMU will truncate the backend file by ftruncate(2), which will
|
|
corrupt the existing data in the backend file, especially for the
|
|
shrink case.
|
|
|
|
QEMU v2.8.0 and later check the backend file size and the "size"
|
|
option. If they do not match, QEMU will report errors and abort in
|
|
order to avoid the data corruption.
|
|
|
|
2. QEMU v2.6.0 only puts a basic alignment requirement on the "size"
|
|
option of memory-backend-file, e.g. 4KB alignment on x86. However,
|
|
QEMU v.2.7.0 puts an additional alignment requirement, which may
|
|
require a larger value than the basic one, e.g. 2MB on x86. This
|
|
change breaks the usage of memory-backend-file that only satisfies
|
|
the basic alignment.
|
|
|
|
QEMU v2.8.0 and later remove the additional alignment on non-s390x
|
|
architectures, so the broken memory-backend-file can work again.
|
|
|
|
Label
|
|
-----
|
|
|
|
QEMU v2.7.0 and later implement the label support for vNVDIMM devices.
|
|
To enable label on vNVDIMM devices, users can simply add
|
|
"label-size=$SZ" option to "-device nvdimm", e.g.
|
|
|
|
-device nvdimm,id=nvdimm1,memdev=mem1,label-size=128K
|
|
|
|
Note:
|
|
|
|
1. The minimal label size is 128KB.
|
|
|
|
2. QEMU v2.7.0 and later store labels at the end of backend storage.
|
|
If a memory backend file, which was previously used as the backend
|
|
of a vNVDIMM device without labels, is now used for a vNVDIMM
|
|
device with label, the data in the label area at the end of file
|
|
will be inaccessible to the guest. If any useful data (e.g. the
|
|
meta-data of the file system) was stored there, the latter usage
|
|
may result guest data corruption (e.g. breakage of guest file
|
|
system).
|
|
|
|
Hotplug
|
|
-------
|
|
|
|
QEMU v2.8.0 and later implement the hotplug support for vNVDIMM
|
|
devices. Similarly to the RAM hotplug, the vNVDIMM hotplug is
|
|
accomplished by two monitor commands "object_add" and "device_add".
|
|
|
|
For example, the following commands add another 4GB vNVDIMM device to
|
|
the guest:
|
|
|
|
(qemu) object_add memory-backend-file,id=mem2,share=on,mem-path=new_nvdimm.img,size=4G
|
|
(qemu) device_add nvdimm,id=nvdimm2,memdev=mem2
|
|
|
|
Note:
|
|
|
|
1. Each hotplugged vNVDIMM device consumes one memory slot. Users
|
|
should always ensure the memory option "-m ...,slots=N" specifies
|
|
enough number of slots, i.e.
|
|
N >= number of RAM devices +
|
|
number of statically plugged vNVDIMM devices +
|
|
number of hotplugged vNVDIMM devices
|
|
|
|
2. The similar is required for the memory option "-m ...,maxmem=M", i.e.
|
|
M >= size of RAM devices +
|
|
size of statically plugged vNVDIMM devices +
|
|
size of hotplugged vNVDIMM devices
|
|
|
|
Alignment
|
|
---------
|
|
|
|
QEMU uses mmap(2) to maps vNVDIMM backends and aligns the mapping
|
|
address to the page size (getpagesize(2)) by default. However, some
|
|
types of backends may require an alignment different than the page
|
|
size. In that case, QEMU v2.12.0 and later provide 'align' option to
|
|
memory-backend-file to allow users to specify the proper alignment.
|
|
|
|
For example, device dax require the 2 MB alignment, so we can use
|
|
following QEMU command line options to use it (/dev/dax0.0) as the
|
|
backend of vNVDIMM:
|
|
|
|
-object memory-backend-file,id=mem1,share=on,mem-path=/dev/dax0.0,size=4G,align=2M
|
|
-device nvdimm,id=nvdimm1,memdev=mem1
|
|
|
|
Guest Data Persistence
|
|
----------------------
|
|
|
|
Though QEMU supports multiple types of vNVDIMM backends on Linux,
|
|
currently the only one that can guarantee the guest write persistence
|
|
is the device DAX on the real NVDIMM device (e.g., /dev/dax0.0), to
|
|
which all guest access do not involve any host-side kernel cache.
|
|
|
|
When using other types of backends, it's suggested to set 'unarmed'
|
|
option of '-device nvdimm' to 'on', which sets the unarmed flag of the
|
|
guest NVDIMM region mapping structure. This unarmed flag indicates
|
|
guest software that this vNVDIMM device contains a region that cannot
|
|
accept persistent writes. In result, for example, the guest Linux
|
|
NVDIMM driver, marks such vNVDIMM device as read-only.
|
|
|
|
Platform Capabilities
|
|
---------------------
|
|
|
|
ACPI 6.2 Errata A added support for a new Platform Capabilities Structure
|
|
which allows the platform to communicate what features it supports related to
|
|
NVDIMM data durability. Users can provide a capabilities value to a guest via
|
|
the optional "nvdimm-cap" machine command line option:
|
|
|
|
-machine pc,accel=kvm,nvdimm,nvdimm-cap=2
|
|
|
|
This "nvdimm-cap" field is an integer, and is the combined value of the
|
|
various capability bits defined in table 5-137 of the ACPI 6.2 Errata A spec.
|
|
|
|
Here is a quick summary of the three bits that are defined as of that spec:
|
|
|
|
Bit[0] - CPU Cache Flush to NVDIMM Durability on Power Loss Capable.
|
|
Bit[1] - Memory Controller Flush to NVDIMM Durability on Power Loss Capable.
|
|
Note: If bit 0 is set to 1 then this bit shall be set to 1 as well.
|
|
Bit[2] - Byte Addressable Persistent Memory Hardware Mirroring Capable.
|
|
|
|
So, a "nvdimm-cap" value of 2 would mean that the platform supports Memory
|
|
Controller Flush on Power Loss, a value of 3 would mean that the platform
|
|
supports CPU Cache Flush and Memory Controller Flush on Power Loss, etc.
|
|
|
|
For a complete list of the flags available and for more detailed descriptions,
|
|
please consult the ACPI spec.
|