cpuset: various documentation fixes and updates
I noticed that the old commit 8f5aa26c75
("cpusets: update_cpumask documentation fix") is not a complete fix,
resulting in inconsistent paragraphs. This patch completes it and makes
other fixes and updates:
- s/migrate_all_tasks()/migrate_live_tasks()/
- describe more cpuset control files
- s/cpumask_t/struct cpumask/
- document that cpu hotplug and changes to 'sched_relax_domain_level' may
  cause a domain rebuild
- document various ways to query and modify cpusets
- the equivalent of "mount -t cpuset" is "mount -t cgroup -o cpuset,noprefix"
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
commit 3fd076dd95
parent 152de30bce
@@ -142,7 +142,7 @@ into the rest of the kernel, none in performance critical paths:
  - in fork and exit, to attach and detach a task from its cpuset.
  - in sched_setaffinity, to mask the requested CPUs by what's
    allowed in that tasks cpuset.
- - in sched.c migrate_all_tasks(), to keep migrating tasks within
+ - in sched.c migrate_live_tasks(), to keep migrating tasks within
    the CPUs allowed by their cpuset, if possible.
  - in the mbind and set_mempolicy system calls, to mask the requested
    Memory Nodes by what's allowed in that tasks cpuset.
@@ -175,6 +175,10 @@ files describing that cpuset:
  - mem_exclusive flag: is memory placement exclusive?
  - mem_hardwall flag: is memory allocation hardwalled
  - memory_pressure: measure of how much paging pressure in cpuset
+ - memory_spread_page flag: if set, spread page cache evenly on allowed nodes
+ - memory_spread_slab flag: if set, spread slab cache evenly on allowed nodes
+ - sched_load_balance flag: if set, load balance within CPUs on that cpuset
+ - sched_relax_domain_level: the searching range when migrating tasks
 
 In addition, the root cpuset only has the following file:
  - memory_pressure_enabled flag: compute memory_pressure?
@@ -252,7 +256,7 @@ is causing.
 
 This is useful both on tightly managed systems running a wide mix of
 submitted jobs, which may choose to terminate or re-prioritize jobs that
-are trying to use more memory than allowed on the nodes assigned them,
+are trying to use more memory than allowed on the nodes assigned to them,
 and with tightly coupled, long running, massively parallel scientific
 computing jobs that will dramatically fail to meet required performance
 goals if they start to use more memory than allowed to them.
@@ -378,7 +382,7 @@ as cpusets and sched_setaffinity.
 The algorithmic cost of load balancing and its impact on key shared
 kernel data structures such as the task list increases more than
 linearly with the number of CPUs being balanced. So the scheduler
 has support to partition the systems CPUs into a number of sched
 domains such that it only load balances within each sched domain.
 Each sched domain covers some subset of the CPUs in the system;
 no two sched domains overlap; some CPUs might not be in any sched
@@ -485,17 +489,22 @@ of CPUs allowed to a cpuset having 'sched_load_balance' enabled.
 The internal kernel cpuset to scheduler interface passes from the
 cpuset code to the scheduler code a partition of the load balanced
 CPUs in the system. This partition is a set of subsets (represented
-as an array of cpumask_t) of CPUs, pairwise disjoint, that cover all
-the CPUs that must be load balanced.
+as an array of struct cpumask) of CPUs, pairwise disjoint, that cover
+all the CPUs that must be load balanced.
 
-Whenever the 'sched_load_balance' flag changes, or CPUs come or go
-from a cpuset with this flag enabled, or a cpuset with this flag
-enabled is removed, the cpuset code builds a new such partition and
-passes it to the scheduler sched domain setup code, to have the sched
-domains rebuilt as necessary.
+The cpuset code builds a new such partition and passes it to the
+scheduler sched domain setup code, to have the sched domains rebuilt
+as necessary, whenever:
+ - the 'sched_load_balance' flag of a cpuset with non-empty CPUs changes,
+ - or CPUs come or go from a cpuset with this flag enabled,
+ - or 'sched_relax_domain_level' value of a cpuset with non-empty CPUs
+   and with this flag enabled changes,
+ - or a cpuset with non-empty CPUs and with this flag enabled is removed,
+ - or a cpu is offlined/onlined.
 
 This partition exactly defines what sched domains the scheduler should
-setup - one sched domain for each element (cpumask_t) in the partition.
+setup - one sched domain for each element (struct cpumask) in the
+partition.
 
 The scheduler remembers the currently active sched domain partitions.
 When the scheduler routine partition_sched_domains() is invoked from
@@ -559,7 +568,7 @@ domain, the largest value among those is used. Be careful, if one
 requests 0 and others are -1 then 0 is used.
 
 Note that modifying this file will have both good and bad effects,
-and whether it is acceptable or not will be depend on your situation.
+and whether it is acceptable or not depends on your situation.
 Don't modify this file if you are not sure.
 
 If your situation is:
@@ -600,19 +609,15 @@ to allocate a page of memory for that task.
 
 If a cpuset has its 'cpus' modified, then each task in that cpuset
 will have its allowed CPU placement changed immediately. Similarly,
-if a tasks pid is written to a cpusets 'tasks' file, in either its
-current cpuset or another cpuset, then its allowed CPU placement is
-changed immediately. If such a task had been bound to some subset
-of its cpuset using the sched_setaffinity() call, the task will be
-allowed to run on any CPU allowed in its new cpuset, negating the
-affect of the prior sched_setaffinity() call.
+if a tasks pid is written to another cpusets 'tasks' file, then its
+allowed CPU placement is changed immediately. If such a task had been
+bound to some subset of its cpuset using the sched_setaffinity() call,
+the task will be allowed to run on any CPU allowed in its new cpuset,
+negating the effect of the prior sched_setaffinity() call.
 
 In summary, the memory placement of a task whose cpuset is changed is
 updated by the kernel, on the next allocation of a page for that task,
-but the processor placement is not updated, until that tasks pid is
-rewritten to the 'tasks' file of its cpuset. This is done to avoid
-impacting the scheduler code in the kernel with a check for changes
-in a tasks processor placement.
+and the processor placement is updated immediately.
 
 Normally, once a page is allocated (given a physical page
 of main memory) then that page stays on whatever node it
@@ -681,10 +686,14 @@ and then start a subshell 'sh' in that cpuset:
 # The next line should display '/Charlie'
 cat /proc/self/cpuset
 
-In the future, a C library interface to cpusets will likely be
-available. For now, the only way to query or modify cpusets is
-via the cpuset file system, using the various cd, mkdir, echo, cat,
-rmdir commands from the shell, or their equivalent from C.
+There are ways to query or modify cpusets:
+ - via the cpuset file system directly, using the various cd, mkdir, echo,
+   cat, rmdir commands from the shell, or their equivalent from C.
+ - via the C library libcpuset.
+ - via the C library libcgroup.
+   (http://sourceforge.net/projects/libcg/)
+ - via the python application cset.
+   (http://developer.novell.com/wiki/index.php/Cpuset)
 
 The sched_setaffinity calls can also be done at the shell prompt using
 SGI's runon or Robert Love's taskset. The mbind and set_mempolicy
@@ -756,7 +765,7 @@ mount -t cpuset X /dev/cpuset
 
 is equivalent to
 
-mount -t cgroup -ocpuset X /dev/cpuset
+mount -t cgroup -ocpuset,noprefix X /dev/cpuset
 echo "/sbin/cpuset_release_agent" > /dev/cpuset/release_agent
 
 2.2 Adding/removing cpus