Persistent memory, as described by the ACPI NFIT (NVDIMM Firmware
Interface Table), is the first known instance of a memory range
described by a unique "target" proximity domain. Where "initiator" and
"target" proximity domains is an approach that the ACPI HMAT
(Heterogeneous Memory Attributes Table) uses to described the unique
performance properties of a memory range relative to a given initiator
(e.g. CPU or DMA device).
Currently the numa-node for a /dev/pmemX block-device or /dev/daxX.Y
char-device follows the traditional notion of 'numa-node' where the
attribute conveys the closest online numa-node. That numa-node attribute
is useful for cpu-binding and memory-binding processes *near* the
device. However, when the memory range backing a 'pmem', or 'dax' device
is onlined (memory hot-add) the memory-only-numa-node representing that
address needs to be differentiated from the set of online nodes. In
other words, the numa-node association of the device depends on whether
you can bind processes *near* the cpu-numa-node in the offline
device-case, or bind process *on* the memory-range directly after the
backing address range is onlined.
Allow for the case that platform firmware describes persistent memory
with a unique proximity domain, i.e. when it is distinct from the
proximity of DRAM and CPUs that are on the same socket. Plumb the Linux
numa-node translation of that proximity through the libnvdimm region
device to namespaces that are in device-dax mode. With this in place the
proposed kmem driver [1] can optionally discover a unique numa-node
number for the address range as it transitions the memory from an
offline state managed by a device-driver to an online memory range
managed by the core-mm.
[1]: https://lore.kernel.org/lkml/20181022201317.8558C1D8@viggo.jf.intel.com
Reported-by: Fan Du <fan.du@intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Oliver O'Halloran" <oohall@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Move the responsibility of calling devm_request_resource() and
devm_memremap_pages() into the common device-dax driver. This is another
preparatory step to allowing an alternate personality driver for a
device-dax range.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
In support of multiple device-dax instances per device-dax-region and
allowing the 'kmem' driver to attach to dax-instances instead of the
current device-node access, convert the dax sub-system from a class to a
bus. Recall that the kmem driver takes reserved / special purpose
memories and assigns them to be managed by the core-mm.
Aside from the fact the device-dax instances are registered and probed
on a bus, two other lifetime-management changes are made:
1/ Delay attaching a cdev until driver probe time
2/ A new run_dax() helper is introduced to allow restoring dax-operation
after a kill_dax() event. So, at driver ->probe() time we run_dax()
and at ->remove() time we kill_dax() and invalidate all mappings.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Towards eliminating the dax_class, move the dax-device-attribute
enabling to a new bus.c file in the core. The amount of code
thrash of sub-sequent patches is reduced as no logic changes are made,
just pure code movement.
A temporary export of unregister_dex_dax() and dax_attribute_groups is
needed to preserve compilation, but those symbols become static again in
a follow-on patch.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The multi-resource implementation anticipated discontiguous sub-division
support. That has not yet materialized, delete the infrastructure and
related code.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Nothing consumes this attribute of a region and devres otherwise
remembers the value for de-allocation purposes.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Commit bbb3be170a "device-dax: fix sysfs duplicate warnings" arranged
for passing a dax instance-id to devm_create_dax_dev(), rather than
generating one internally. Remove the dax_region ida and related code.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Provide a replacement pgoff_to_phys() that translates an nfit_test
resource (allocated by vmalloc()) to a pfn.
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>