numa: update Documentation/vm/numa, add memoryless node info
Kamezawa Hiroyuki requested documentation for the numa_mem_id() and slab
related changes.  He suggested Documentation/vm/numa for this documentation.

Looking at this file, it seems to me to be hopelessly out of date relative
to current Linux NUMA support.  At the risk of going down a rathole, I have
made an attempt to rewrite the doc at a slightly higher level [I think] and
provide pointers to other in-tree documents and out-of-tree man pages that
cover the details.

Let the games begin.

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Nick Piggin <npiggin@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Eric Whitney <eric.whitney@hp.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
parent 3dd6b5fb43
commit b9498bfe86
Documentation/vm/numa
@@ -1,41 +1,149 @@
-Started Nov 1999 by Kanoj Sarcar <kanoj@sgi.com>
-
-The intent of this file is to have an uptodate, running commentary
-from different people about NUMA specific code in the Linux vm.
-
-What is NUMA? It is an architecture where the memory access times
-for different regions of memory from a given processor varies
-according to the "distance" of the memory region from the processor.
-Each region of memory to which access times are the same from any
-cpu, is called a node. On such architectures, it is beneficial if
-the kernel tries to minimize inter node communications. Schemes
-for this range from kernel text and read-only data replication
-across nodes, and trying to house all the data structures that
-key components of the kernel need on memory on that node.
-
-Currently, all the numa support is to provide efficient handling
-of widely discontiguous physical memory, so architectures which
-are not NUMA but can have huge holes in the physical address space
-can use the same code. All this code is bracketed by CONFIG_DISCONTIGMEM.
-
-The initial port includes NUMAizing the bootmem allocator code by
-encapsulating all the pieces of information into a bootmem_data_t
-structure. Node specific calls have been added to the allocator.
-In theory, any platform which uses the bootmem allocator should
-be able to put the bootmem and mem_map data structures anywhere
-it deems best.
-
-Each node's page allocation data structures have also been encapsulated
-into a pg_data_t. The bootmem_data_t is just one part of this. To
-make the code look uniform between NUMA and regular UMA platforms,
-UMA platforms have a statically allocated pg_data_t too (contig_page_data).
-For the sake of uniformity, the function num_online_nodes() is also defined
-for all platforms. As we run benchmarks, we might decide to NUMAize
-more variables like low_on_memory, nr_free_pages etc into the pg_data_t.
-
-The NUMA aware page allocation code currently tries to allocate pages
-from different nodes in a round robin manner. This will be changed to
-do concentratic circle search, starting from current node, once the
-NUMA port achieves more maturity. The call alloc_pages_node has been
-added, so that drivers can make the call and not worry about whether
-it is running on a NUMA or UMA platform.
+What is NUMA?
+
+This question can be answered from a couple of perspectives: the
+hardware view and the Linux software view.
+
+From the hardware perspective, a NUMA system is a computer platform that
+comprises multiple components or assemblies each of which may contain 0
+or more CPUs, local memory, and/or IO buses. For brevity and to
+disambiguate the hardware view of these physical components/assemblies
+from the software abstraction thereof, we'll call the components/assemblies
+'cells' in this document.
+
+Each of the 'cells' may be viewed as an SMP [symmetric multi-processor] subset
+of the system--although some components necessary for a stand-alone SMP system
+may not be populated on any given cell. The cells of the NUMA system are
+connected together with some sort of system interconnect--e.g., a crossbar or
+point-to-point link are common types of NUMA system interconnects. Both of
+these types of interconnects can be aggregated to create NUMA platforms with
+cells at multiple distances from other cells.
+
+For Linux, the NUMA platforms of interest are primarily what is known as Cache
+Coherent NUMA or ccNUMA systems. With ccNUMA systems, all memory is visible
+to and accessible from any CPU attached to any cell and cache coherency
+is handled in hardware by the processor caches and/or the system interconnect.
+
+Memory access time and effective memory bandwidth vary depending on how far
+away the cell containing the CPU or IO bus making the memory access is from the
+cell containing the target memory. For example, access to memory by CPUs
+attached to the same cell will experience faster access times and higher
+bandwidths than accesses to memory on other, remote cells. NUMA platforms
+can have cells at multiple remote distances from any given cell.
+
+Platform vendors don't build NUMA systems just to make software developers'
+lives interesting. Rather, this architecture is a means to provide scalable
+memory bandwidth. However, to achieve scalable memory bandwidth, system and
+application software must arrange for a large majority of the memory references
+[cache misses] to be to "local" memory--memory on the same cell, if any--or
+to the closest cell with memory.
+
+This leads to the Linux software view of a NUMA system:
+
+Linux divides the system's hardware resources into multiple software
+abstractions called "nodes". Linux maps the nodes onto the physical cells
+of the hardware platform, abstracting away some of the details for some
+architectures. As with physical cells, software nodes may contain 0 or more
+CPUs, memory and/or IO buses. And, again, memory accesses to memory on
+"closer" nodes--nodes that map to closer cells--will generally experience
+faster access times and higher effective bandwidth than accesses to more
+remote cells.
+
+For some architectures, such as x86, Linux will "hide" any node representing a
+physical cell that has no memory attached, and reassign any CPUs attached to
+that cell to a node representing a cell that does have memory. Thus, on
+these architectures, one cannot assume that all CPUs that Linux associates with
+a given node will see the same local memory access times and bandwidth.
+
+In addition, for some architectures, again x86 is an example, Linux supports
+the emulation of additional nodes. For NUMA emulation, Linux will carve up
+the existing nodes--or the system memory for non-NUMA platforms--into multiple
+nodes. Each emulated node will manage a fraction of the underlying cells'
+physical memory. NUMA emulation is useful for testing NUMA kernel and
+application features on non-NUMA platforms, and as a sort of memory resource
+management mechanism when used together with cpusets.
+[see Documentation/cgroups/cpusets.txt]
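On x86, for instance, emulation is requested with the numa=fake kernel boot
parameter. A sketch of its simplest form in kernels of this era, asking for
four equal-sized emulated nodes (the node count is only an example):

	numa=fake=4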
+
+For each node with memory, Linux constructs an independent memory management
+subsystem, complete with its own free page lists, in-use page lists, usage
+statistics and locks to mediate access. In addition, Linux constructs for
+each memory zone [one or more of DMA, DMA32, NORMAL, HIGH_MEMORY, MOVABLE],
+an ordered "zonelist". A zonelist specifies the zones/nodes to visit when a
+selected zone/node cannot satisfy the allocation request. This situation,
+when a zone has no available memory to satisfy a request, is called
+"overflow" or "fallback".
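The sketch below illustrates only the zonelist visit order using the kernel's
zonelist iterator; it is not the allocator's actual code, and the function
name is made up for this example. The real allocation path also checks
watermarks, zone modifiers, cpusets, etc.

	#include <linux/gfp.h>
	#include <linux/kernel.h>
	#include <linux/mmzone.h>
	#include <linux/topology.h>

	/* Walk the local node's zonelist in fallback ["overflow"] order. */
	static void walk_local_zonelist(void)
	{
		struct zonelist *zonelist =
			node_zonelist(numa_node_id(), GFP_KERNEL);
		struct zoneref *z;
		struct zone *zone;

		for_each_zone_zonelist(zone, z, zonelist, gfp_zone(GFP_KERNEL))
			printk(KERN_DEBUG "candidate: node %d zone %s\n",
			       zone_to_nid(zone), zone->name);
	}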
+
+Because some nodes contain multiple zones containing different types of
+memory, Linux must decide whether to order the zonelists such that allocations
+fall back to the same zone type on a different node, or to a different zone
+type on the same node. This is an important consideration because some zones,
+such as DMA or DMA32, represent relatively scarce resources. Linux chooses
+a default zonelist order based on the sizes of the various zone types relative
+to the total memory of the node and the total memory of the system. The
+default zonelist order may be overridden using the numa_zonelist_order kernel
+boot parameter or sysctl. [see Documentation/kernel-parameters.txt and
+Documentation/sysctl/vm.txt]
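For example, an administrator could request zone ordering explicitly; both
forms below are sketches of the documented interfaces, which accept the
values "default", "node" and "zone":

	numa_zonelist_order=zone                  [on the kernel command line]
	sysctl -w vm.numa_zonelist_order=zone     [at run time]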
+
+By default, Linux will attempt to satisfy memory allocation requests from the
+node to which the CPU that executes the request is assigned. Specifically,
+Linux will attempt to allocate from the first node in the appropriate zonelist
+for the node where the request originates. This is called "local allocation."
+If the "local" node cannot satisfy the request, the kernel will examine other
+nodes' zones in the selected zonelist looking for the first zone in the list
+that can satisfy the request.
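For illustration, a minimal kernel-code sketch of a local allocation; the
function name is hypothetical, and alloc_pages() with no node argument
behaves the same way by defaulting to the local node:

	#include <linux/gfp.h>
	#include <linux/topology.h>

	/* Allocate one page, preferring the calling CPU's node;
	 * fallback follows that node's zonelist as described above. */
	static struct page *grab_local_page(void)
	{
		return alloc_pages_node(numa_node_id(), GFP_KERNEL, 0);
	}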
+
+Local allocation will tend to keep subsequent access to the allocated memory
+"local" to the underlying physical resources and off the system interconnect--
+as long as the task on whose behalf the kernel allocated some memory does not
+later migrate away from that memory. The Linux scheduler is aware of the
+NUMA topology of the platform--embodied in the "scheduling domains" data
+structures [see Documentation/scheduler/sched-domains.txt]--and the scheduler
+attempts to minimize task migration to distant scheduling domains. However,
+the scheduler does not take a task's NUMA footprint into account directly.
+Thus, under sufficient imbalance, tasks can migrate between nodes, remote
+from their initial node and kernel data structures.
+
+System administrators and application designers can restrict a task's migration
+to improve NUMA locality using various CPU affinity command line interfaces,
+such as taskset(1) and numactl(1), and program interfaces such as
+sched_setaffinity(2). Further, one can modify the kernel's default local
+allocation behavior using Linux NUMA memory policy.
+[see Documentation/vm/numa_memory_policy.txt]
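A minimal user-space sketch of the sched_setaffinity(2) approach; the CPU
number and function name are illustrative only:

	#define _GNU_SOURCE
	#include <sched.h>

	/* Pin the calling task to CPU 0 so that its "local" allocations
	 * keep coming from CPU 0's node. */
	static int pin_to_cpu0(void)
	{
		cpu_set_t mask;

		CPU_ZERO(&mask);
		CPU_SET(0, &mask);
		return sched_setaffinity(0 /* self */, sizeof(mask), &mask);
	}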
+
+System administrators can restrict the CPUs and nodes' memories that a non-
+privileged user can specify in the scheduling or NUMA commands and functions
+using control groups and CPUsets. [see Documentation/cgroups/cpusets.txt]
+
+On architectures that do not hide memoryless nodes, Linux will include only
+zones [nodes] with memory in the zonelists. This means that for a memoryless
+node the "local memory node"--the node of the first zone in the CPU's node's
+zonelist--will not be the node itself. Rather, it will be the node that the
+kernel selected as the nearest node with memory when it built the zonelists.
+So, by default, local allocations will succeed with the kernel supplying the
+closest available memory. This is a consequence of the same mechanism that
+allows such allocations to fall back to other nearby nodes when a node that
+does contain memory overflows.
+
+Some kernel allocations do not want or cannot tolerate this allocation fallback
+behavior. Rather they want to be sure they get memory from the specified node
+or get notified that the node has no free memory. This is usually the case when
+a subsystem allocates per CPU memory resources, for example.
+
+A typical model for making such an allocation is to obtain the node id of the
+node to which the "current CPU" is attached using one of the kernel's
+numa_node_id() or cpu_to_node() functions and then request memory from only
+the node id returned. When such an allocation fails, the requesting subsystem
+may revert to its own fallback path. The slab kernel memory allocator is an
+example of this. Or, the subsystem may choose to disable or not to enable
+itself on allocation failure. The kernel profiling subsystem is an example of
+this.
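A sketch of this typical model, assuming a hypothetical subsystem with its
own fallback path; __GFP_THISNODE suppresses the zonelist fallback described
earlier:

	#include <linux/slab.h>
	#include <linux/topology.h>

	static void *alloc_on_this_node(size_t size)
	{
		/* Try only the current CPU's node... */
		void *p = kmalloc_node(size, GFP_KERNEL | __GFP_THISNODE,
				       numa_node_id());

		/* ...and revert to the subsystem's own fallback policy. */
		if (!p)
			p = kmalloc(size, GFP_KERNEL);
		return p;
	}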
+
+If the architecture supports--does not hide--memoryless nodes, then CPUs
+attached to memoryless nodes would always incur the fallback path overhead
+or some subsystems would fail to initialize if they attempted to allocate
+memory exclusively from a node without memory. To support such
+architectures transparently, kernel subsystems can use the numa_mem_id()
+or cpu_to_mem() function to locate the "local memory node" for the calling or
+specified CPU. Again, this is the same node from which default, local page
+allocations will be attempted.
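A companion sketch to the one above, using the interface this patch series
introduces so that the same pattern is safe on memoryless nodes; the function
name is again hypothetical:

	#include <linux/slab.h>
	#include <linux/topology.h>

	static void *alloc_on_local_memory_node(size_t size)
	{
		/* numa_mem_id() names the nearest node with memory, which
		 * is the node itself whenever the node has memory. */
		return kmalloc_node(size, GFP_KERNEL | __GFP_THISNODE,
				    numa_mem_id());
	}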