Documentation for /proc/sys/kernel/*    kernel version 2.2.10
        (c) 1998, 1999,  Rik van Riel <riel@nl.linux.org>
        (c) 2009,        Shen Feng <shen@cn.fujitsu.com>

For general info and legal blurb, please look in README.

==============================================================

This file contains documentation for the sysctl files in
/proc/sys/kernel/ and is valid for Linux kernel version 2.2.

The files in this directory can be used to tune and monitor
miscellaneous and general things in the operation of the Linux
kernel. Since some of the files _can_ be used to screw up your
system, it is advisable to read both documentation and source
before actually making adjustments.

Currently, these files might (depending on your configuration)
show up in /proc/sys/kernel:

- acct
- acpi_video_flags
- auto_msgmni
- bootloader_type             [ X86 only ]
- bootloader_version          [ X86 only ]
- callhome                    [ S390 only ]
- cap_last_cap
- core_pattern
- core_pipe_limit
- core_uses_pid
- ctrl-alt-del
- dmesg_restrict
- domainname
- hostname
- hotplug
- kptr_restrict
- kstack_depth_to_print       [ X86 only ]
- l2cr                        [ PPC only ]
- modprobe                    ==> Documentation/debugging-modules.txt
- modules_disabled
- msg_next_id                 [ sysv ipc ]
- msgmax
- msgmnb
- msgmni
- nmi_watchdog
- osrelease
- ostype
- overflowgid
- overflowuid
- panic
- panic_on_oops
- panic_on_unrecovered_nmi
- panic_on_stackoverflow
- pid_max
- powersave-nap               [ PPC only ]
- printk
- printk_delay
- printk_ratelimit
- printk_ratelimit_burst
- randomize_va_space
- real-root-dev               ==> Documentation/initrd.txt
- reboot-cmd                  [ SPARC only ]
- rtsig-max
- rtsig-nr
- sem
- sem_next_id                 [ sysv ipc ]
- sg-big-buff                 [ generic SCSI device (sg) ]
- shm_next_id                 [ sysv ipc ]
- shm_rmid_forced
- shmall
- shmmax                      [ sysv ipc ]
- shmmni
- stop-a                      [ SPARC only ]
- sysrq                       ==> Documentation/sysrq.txt
- tainted
- threads-max
- unknown_nmi_panic
- watchdog_thresh
- version

==============================================================

acct:

highwater lowwater frequency

If BSD-style process accounting is enabled these values control
its behaviour. If free space on the filesystem where the log lives
goes below <lowwater>%, accounting suspends. If free space gets
above <highwater>%, accounting resumes. <Frequency> determines
how often we check the amount of free space (value is in
seconds). Default:
4 2 30
That is, suspend accounting if <= 2% of the space is left free;
resume it once >= 4% is free; consider information about the
amount of free space valid for 30 seconds.

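The suspend/resume rule above can be modelled in a few lines of shell.
This is only an illustrative sketch, not the kernel's code: the function
name acct_next_state and its argument order are made up for this example,
and the thresholds are passed in rather than read from the acct file.

```sh
# acct_next_state STATE FREE_PCT HIGHWATER LOWWATER   (hypothetical helper)
# STATE is "active" or "suspended"; prints the state after one check.
acct_next_state() (
    state=$1 free=$2 high=$3 low=$4
    if [ "$state" = active ] && [ "$free" -le "$low" ]; then
        # Free space dropped to <lowwater>% or less: stop logging.
        echo suspended
    elif [ "$state" = suspended ] && [ "$free" -ge "$high" ]; then
        # Free space recovered to <highwater>% or more: start again.
        echo active
    else
        # Between the two marks nothing changes (hysteresis).
        echo "$state"
    fi
)

# With the default "4 2 30": 3% free keeps a suspended log suspended.
acct_next_state suspended 3 4 2
```

The gap between the two marks keeps accounting from flapping on and off
when free space hovers around a single threshold.
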
==============================================================

acpi_video_flags:

flags

See Doc*/kernel/power/video.txt; this sysctl allows the mode of
video boot to be set at run time.

==============================================================

auto_msgmni:

Enables/Disables automatic recomputing of msgmni upon memory add/remove
or upon ipc namespace creation/removal (see the msgmni description
above). Echoing "1" into this file enables msgmni automatic recomputing.
Echoing "0" turns it off. auto_msgmni default value is 1.

==============================================================

bootloader_type:

x86 bootloader identification

This gives the bootloader type number as indicated by the bootloader,
shifted left by 4, and OR'd with the low four bits of the bootloader
version. The reason for this encoding is that this used to match the
type_of_loader field in the kernel header; the encoding is kept for
backwards compatibility. That is, if the full bootloader type number
is 0x15 and the full version number is 0x234, this file will contain
the value 340 = 0x154.

See the type_of_loader and ext_loader_type fields in
Documentation/x86/boot.txt for additional information.

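The encoding described above is plain integer arithmetic and can be
reproduced with shell arithmetic expansion. The helper name
bootloader_type_value is hypothetical and exists only for this sketch;
it does not read anything from /proc.

```sh
# bootloader_type_value TYPE VERSION   (hypothetical helper)
# TYPE is the full bootloader type number, VERSION the full version number.
bootloader_type_value() {
    # Type shifted left by 4, OR'd with the low four bits of the version.
    echo $(( ($1 << 4) | ($2 & 0xf) ))
}

bootloader_type_value 0x15 0x234   # prints 340, i.e. 0x154
```
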
==============================================================

bootloader_version:

x86 bootloader version

The complete bootloader version number. In the example above, this
file will contain the value 564 = 0x234.

See the type_of_loader and ext_loader_ver fields in
Documentation/x86/boot.txt for additional information.

==============================================================

callhome:

Controls the kernel's callhome behavior in case of a kernel panic.

The s390 hardware allows an operating system to send a notification
to a service organization (callhome) in case of an operating system panic.

When the value in this file is 0 (which is the default behavior)
nothing happens in case of a kernel panic. If this value is set to "1"
the complete kernel oops message is sent to the IBM customer service
organization in case the mainframe the Linux operating system is running
on has a service contract with IBM.

==============================================================

cap_last_cap:

Highest valid capability of the running kernel. Exports
CAP_LAST_CAP from the kernel.

==============================================================

core_pattern:

core_pattern is used to specify a core dumpfile pattern name.
. max length 128 characters; default value is "core"
. core_pattern is used as a pattern template for the output filename;
  certain string patterns (beginning with '%') are substituted with
  their actual values.
. backward compatibility with core_uses_pid:
        If core_pattern does not include "%p" (default does not)
        and core_uses_pid is set, then .PID will be appended to
        the filename.
. corename format specifiers:
        %<NUL>   '%' is dropped
        %%       output one '%'
        %p       pid
        %P       global pid (init PID namespace)
        %u       uid
        %g       gid
        %d       dump mode, matches PR_SET_DUMPABLE and
                 /proc/sys/fs/suid_dumpable
        %s       signal number
        %t       UNIX time of dump
        %h       hostname
        %e       executable filename (may be shortened)
        %E       executable path
        %<OTHER> both are dropped
. If the first character of the pattern is a '|', the kernel will treat
  the rest of the pattern as a command to run. The core dump will be
  written to the standard input of that program instead of to a file.

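To make the substitution rules concrete, here is a small userspace model
of the expansion. It is a sketch under stated assumptions, not the
kernel's implementation: the function expand_core_pattern and its argument
order are invented for this example, and only %%, %p, %e and %s are
handled; every other specifier is treated as %<OTHER>.

```sh
# expand_core_pattern PATTERN PID EXE SIGNAL   (hypothetical helper)
expand_core_pattern() (
    pattern=$1 pid=$2 exe=$3 sig=$4
    out=
    while [ -n "$pattern" ]; do
        case $pattern in
            %%*) out="$out%";    pattern=${pattern#'%%'} ;;  # %% -> one '%'
            %p*) out="$out$pid"; pattern=${pattern#'%p'} ;;
            %e*) out="$out$exe"; pattern=${pattern#'%e'} ;;
            %s*) out="$out$sig"; pattern=${pattern#'%s'} ;;
            %?*) pattern=${pattern#'%'?} ;;   # %<OTHER>: both are dropped
            %)   pattern= ;;                  # %<NUL>: '%' is dropped
            *)   # ordinary character: copy it through unchanged
                 rest=${pattern#?}
                 out="$out${pattern%"$rest"}"
                 pattern=$rest ;;
        esac
    done
    printf '%s\n' "$out"
)

expand_core_pattern 'core.%e.%p' 1234 myprog 11   # prints core.myprog.1234
```

The pattern is consumed strictly left to right, so "%%p" yields a literal
"%p" rather than the pid.
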
==============================================================

core_pipe_limit:

This sysctl is only applicable when core_pattern is configured to pipe
core files to a user space helper (when the first character of
core_pattern is a '|', see above). When collecting cores via a pipe
to an application, it is occasionally useful for the collecting
application to gather data about the crashing process from its
/proc/pid directory. In order to do this safely, the kernel must wait
for the collecting process to exit, so as not to remove the crashing
process's proc files prematurely. This in turn creates the
possibility that a misbehaving userspace collecting process can block
the reaping of a crashed process simply by never exiting. This sysctl
defends against that. It defines how many concurrent crashing
processes may be piped to user space applications in parallel. If
this value is exceeded, then those crashing processes above that value
are noted via the kernel log and their cores are skipped. 0 is a
special value, indicating that unlimited processes may be captured in
parallel, but that no waiting will take place (i.e. the collecting
process is not guaranteed access to /proc/<crashing pid>/). This
value defaults to 0.

==============================================================

core_uses_pid:

The default coredump filename is "core". By setting
core_uses_pid to 1, the coredump filename becomes core.PID.
If core_pattern does not include "%p" (default does not)
and core_uses_pid is set, then .PID will be appended to
the filename.

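The interaction between core_uses_pid and core_pattern can be written down
as a short decision rule. The helper below is a hypothetical illustration:
it checks for the literal text "%p" and so treats an escaped "%%p" as if
it were the pid specifier, and it returns the pattern unexpanded.

```sh
# core_name_with_pid PATTERN CORE_USES_PID PID   (hypothetical helper)
core_name_with_pid() (
    pattern=$1 core_uses_pid=$2 pid=$3
    case $pattern in
        *%p*)
            # The pattern already names the pid; nothing is appended.
            printf '%s\n' "$pattern" ;;
        *)
            if [ "$core_uses_pid" -ne 0 ]; then
                printf '%s.%s\n' "$pattern" "$pid"
            else
                printf '%s\n' "$pattern"
            fi ;;
    esac
)

core_name_with_pid core 1 1234   # prints core.1234
```
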
==============================================================

ctrl-alt-del:

When the value in this file is 0, ctrl-alt-del is trapped and
sent to the init(1) program to handle a graceful restart.
When, however, the value is > 0, Linux's reaction to a Vulcan
Nerve Pinch (tm) will be an immediate reboot, without even
syncing its dirty buffers.

Note: when a program (like dosemu) has the keyboard in 'raw'
mode, the ctrl-alt-del is intercepted by the program before it
ever reaches the kernel tty layer, and it's up to the program
to decide what to do with it.

==============================================================

dmesg_restrict:

This toggle indicates whether unprivileged users are prevented
from using dmesg(8) to view messages from the kernel's log buffer.
When dmesg_restrict is set to (0) there are no restrictions. When
dmesg_restrict is set to (1), users must have CAP_SYSLOG to use
dmesg(8).

The kernel config option CONFIG_SECURITY_DMESG_RESTRICT sets the
default value of dmesg_restrict.

==============================================================

domainname & hostname:

These files can be used to set the NIS/YP domainname and the
hostname of your box in exactly the same way as the commands
domainname and hostname, i.e.:
# echo "darkstar" > /proc/sys/kernel/hostname
# echo "mydomain" > /proc/sys/kernel/domainname
has the same effect as
# hostname "darkstar"
# domainname "mydomain"

Note, however, that the classic darkstar.frop.org has the
hostname "darkstar" and DNS (Internet Domain Name Server)
domainname "frop.org", not to be confused with the NIS (Network
Information Service) or YP (Yellow Pages) domainname. These two
domain names are in general different. For a detailed discussion
see the hostname(1) man page.

==============================================================

hotplug:

Path for the hotplug policy agent.
Default value is "/sbin/hotplug".

==============================================================

kptr_restrict:

This toggle indicates whether restrictions are placed on
exposing kernel addresses via /proc and other interfaces. When
kptr_restrict is set to (0), there are no restrictions. When
kptr_restrict is set to (1), the default, kernel pointers
printed using the %pK format specifier will be replaced with 0's
unless the user has CAP_SYSLOG. When kptr_restrict is set to
(2), kernel pointers printed using %pK will be replaced with 0's
regardless of privileges.

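The three settings reduce to a small decision table, sketched here in
shell. The function pk_output is hypothetical and only models what a
reader would see; the 16-digit zero string assumes a 64-bit pointer width.

```sh
# pk_output KPTR_RESTRICT HAS_CAP_SYSLOG POINTER   (hypothetical helper)
# HAS_CAP_SYSLOG is "yes" or "no"; POINTER is the real address as text.
pk_output() {
    case $1 in
        0) echo "$3" ;;                       # no restrictions
        1) if [ "$2" = yes ]; then
               echo "$3"                      # privileged reader
           else
               echo 0000000000000000          # hidden from everyone else
           fi ;;
        2) echo 0000000000000000 ;;           # hidden regardless of privileges
    esac
}

pk_output 1 no ffffffff81000000   # prints 0000000000000000
```
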
==============================================================

kstack_depth_to_print: (X86 only)

Controls the number of words to print when dumping the raw
kernel stack.

==============================================================

l2cr: (PPC only)

This flag controls the L2 cache of G3 processor boards. If
0, the cache is disabled. Enabled if nonzero.

==============================================================

modules_disabled:

A toggle value indicating if modules are allowed to be loaded
in an otherwise modular kernel. This toggle defaults to off
(0), but can be set true (1). Once true, modules can be
neither loaded nor unloaded, and the toggle cannot be set back
to false.

==============================================================

msg_next_id, sem_next_id, and shm_next_id:

These three toggles allow the desired id to be specified for the next
allocated IPC object: message, semaphore or shared memory respectively.

By default they are equal to -1, which means generic allocation logic.
Possible values to set are in range {0..INT_MAX}.

Notes:
1) The kernel doesn't guarantee that the new object will have the desired
   id. So, it's up to userspace how to handle an object with the "wrong" id.
2) A toggle with a non-default value will be set back to -1 by the kernel
   after successful IPC object allocation.

==============================================================

nmi_watchdog:

Enables/Disables the NMI watchdog on x86 systems. When the value is
non-zero the NMI watchdog is enabled and will continuously test all
online cpus to determine whether or not they are still functioning
properly. Currently, passing "nmi_watchdog=" parameter at boot time is
required for this function to work.

If LAPIC NMI watchdog method is in use (nmi_watchdog=2 kernel
parameter), the NMI watchdog shares registers with oprofile. By
disabling the NMI watchdog, oprofile may have more registers to
utilize.

==============================================================

numa_balancing:

Enables/disables automatic page fault based NUMA memory
balancing. Memory is moved automatically to nodes
that access it often.

On NUMA machines, there is a performance penalty if remote memory is
accessed by a CPU. When this feature is enabled the kernel samples what
task thread is accessing memory by periodically unmapping pages and later
trapping a page fault. At the time of the page fault, it is determined if
the data being accessed should be migrated to a local memory node.

The unmapping of pages and trapping faults incur additional overhead that
ideally is offset by improved memory locality but there is no universal
guarantee. If the target workload is already bound to NUMA nodes then this
feature should be disabled. Otherwise, if the system overhead from the
feature is too high then the rate the kernel samples for NUMA hinting
faults may be controlled by the numa_balancing_scan_period_min_ms,
numa_balancing_scan_delay_ms, numa_balancing_scan_period_max_ms,
numa_balancing_scan_size_mb, numa_balancing_settle_count and
numa_balancing_migrate_deferred sysctls.

==============================================================

numa_balancing_scan_period_min_ms, numa_balancing_scan_delay_ms,
numa_balancing_scan_period_max_ms, numa_balancing_scan_size_mb:

Automatic NUMA balancing scans a task's address space and unmaps pages to
detect if pages are properly placed or if the data should be migrated to a
memory node local to where the task is running. Every "scan delay" the task
scans the next "scan size" number of pages in its address space. When the
end of the address space is reached the scanner restarts from the beginning.

In combination, the "scan delay" and "scan size" determine the scan rate.
When "scan delay" decreases, the scan rate increases. The scan delay and
hence the scan rate of every task is adaptive and depends on historical
behaviour. If pages are properly placed then the scan delay increases,
otherwise the scan delay decreases. The "scan size" is not adaptive but
the higher the "scan size", the higher the scan rate.

Higher scan rates incur higher system overhead as page faults must be
trapped and potentially data must be migrated. However, the higher the scan
rate, the more quickly a task's memory is migrated to a local node if the
workload pattern changes, which minimises the performance impact of remote
memory accesses. These sysctls control the thresholds for scan delays and
the number of pages scanned.

numa_balancing_scan_period_min_ms is the minimum time in milliseconds to
scan a task's virtual memory. It effectively controls the maximum scanning
rate for each task.

numa_balancing_scan_delay_ms is the starting "scan delay" used for a task
when it initially forks.

numa_balancing_scan_period_max_ms is the maximum time in milliseconds to
scan a task's virtual memory. It effectively controls the minimum scanning
rate for each task.

numa_balancing_scan_size_mb is how many megabytes worth of pages are
scanned for a given scan.

numa_balancing_settle_count is how many scan periods must complete before
the scheduler balancer stops pushing the task towards a preferred node. This
gives the scheduler a chance to place the task on an alternative node if the
preferred node is overloaded.

numa_balancing_migrate_deferred is how many page migrations get skipped
unconditionally, after a page migration is skipped because a page is shared
with other tasks. This reduces page migration overhead, and determines
how much stronger the scheduler's "move task near its memory" policy becomes,
versus the "move memory near its task" memory management policy, for workloads
with shared memory.

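As a rough back-of-the-envelope aid, the scan rate that follows from a
given "scan size" and "scan delay" can be estimated as size divided by
delay. This is a simplifying assumption made for this sketch: the real
scan delay is adaptive, and the helper name and the sample numbers below
are hypothetical rather than kernel defaults.

```sh
# numa_scan_rate_mb_per_s SCAN_SIZE_MB SCAN_DELAY_MS   (hypothetical helper)
# Prints the approximate scan rate in whole megabytes per second.
numa_scan_rate_mb_per_s() {
    echo $(( $1 * 1000 / $2 ))
}

numa_scan_rate_mb_per_s 256 1000   # 256 MB every 1000 ms -> prints 256
numa_scan_rate_mb_per_s 256 100    # shorter delay        -> prints 2560
```

The two calls show the relationship stated above: with the scan size
fixed, a ten times shorter scan delay gives a ten times higher scan rate.
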
==============================================================

osrelease, ostype & version:

# cat osrelease
2.1.88
# cat ostype
Linux
# cat version
#5 Wed Feb 25 21:49:24 MET 1998

The files osrelease and ostype should be clear enough. Version
needs a little more clarification however. The '#5' means that
this is the fifth kernel built from this source base and the
date behind it indicates the time the kernel was built.
The only way to tune these values is to rebuild the kernel :-)

==============================================================

overflowgid & overflowuid:

If your architecture did not always support 32-bit UIDs (i.e. arm,
i386, m68k, sh, and sparc32), a fixed UID and GID will be returned to
applications that use the old 16-bit UID/GID system calls, if the
actual UID or GID would exceed 65535.

These sysctls allow you to change the value of the fixed UID and GID.
The default is 65534.

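What a legacy 16-bit caller sees can be sketched as follows. The helper
legacy_uid is hypothetical; it models only the rule stated above and
takes the overflow value as an optional argument instead of reading
/proc/sys/kernel/overflowuid.

```sh
# legacy_uid UID [OVERFLOWUID]   (hypothetical helper)
# Prints the UID that an old 16-bit system call would report.
legacy_uid() {
    if [ "$1" -gt 65535 ]; then
        echo "${2:-65534}"     # does not fit in 16 bits: fixed value
    else
        echo "$1"              # fits: reported unchanged
    fi
}

legacy_uid 1000     # prints 1000
legacy_uid 70000    # prints 65534
```

The same rule applies to GIDs with overflowgid.
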
==============================================================

panic:

The value in this file represents the number of seconds the kernel
waits before rebooting on a panic. When you use the software watchdog,
the recommended setting is 60.

==============================================================

panic_on_unrecovered_nmi:

The default Linux behaviour on an NMI of either memory or unknown is
to continue operation. For many environments such as scientific
computing it is preferable that the box is taken out and the error
dealt with than an uncorrected parity/ECC error get propagated.

A small number of systems do generate NMI's for bizarre random reasons
such as power management so the default is off. This sysctl works like
the other panic controls in this directory.

==============================================================

panic_on_oops:

Controls the kernel's behaviour when an oops or BUG is encountered.

0: try to continue operation

1: panic immediately. If the `panic' sysctl is also non-zero then the
   machine will be rebooted.

==============================================================

panic_on_stackoverflow:

Controls the kernel's behavior when it detects an overflow of the
kernel, IRQ or exception stacks. User stacks are not covered.
This file shows up if CONFIG_DEBUG_STACKOVERFLOW is enabled.

0: try to continue operation.

1: panic immediately.

==============================================================

perf_cpu_time_max_percent:

Hints to the kernel how much CPU time it should be allowed to
use to handle perf sampling events. If the perf subsystem
is informed that its samples are exceeding this limit, it
will drop its sampling frequency to attempt to reduce its CPU
usage.

Some perf sampling happens in NMIs. If these samples
unexpectedly take too long to execute, the NMIs can become
stacked up next to each other so much that nothing else is
allowed to execute.

0: disable the mechanism. Do not monitor or correct perf's
   sampling rate no matter how much CPU time it takes.

1-100: attempt to throttle perf's sample rate to this
   percentage of CPU. Note: the kernel calculates an
   "expected" length of each sample event. 100 here means
   100% of that expected length. Even if this is set to
   100, you may still see sample throttling if this
   length is exceeded. Set to 0 if you truly do not care
   how much CPU is consumed.

==============================================================

pid_max:

PID allocation wrap value. When the kernel's next PID value
reaches this value, it wraps back to a minimum PID value.
PIDs of value pid_max or larger are not allocated.

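The wrap-around can be illustrated with a few lines of shell. This is a
simplified model, not the kernel's allocator: it ignores PIDs that are
still in use, the helper name next_pid is made up, and the default
minimum of 300 is an assumption chosen for the example.

```sh
# next_pid LAST_PID PID_MAX [MIN_PID]   (hypothetical helper)
next_pid() (
    next=$(( $1 + 1 ))
    # PIDs of value pid_max or larger are never handed out.
    if [ "$next" -ge "$2" ]; then
        next=${3:-300}
    fi
    echo "$next"
)

next_pid 100 32768      # prints 101
next_pid 32767 32768    # prints 300 (wrapped)
```
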
==============================================================

ns_last_pid:

The last pid allocated in the current pid namespace (the one in which
the task using this sysctl lives). When selecting a pid for the next
task on fork, the kernel tries to allocate a number starting from this one.

==============================================================

powersave-nap: (PPC only)

If set, Linux-PPC will use the 'nap' mode of powersaving,
otherwise the 'doze' mode will be used.

==============================================================

printk:

The four values in printk denote: console_loglevel,
default_message_loglevel, minimum_console_loglevel and
default_console_loglevel respectively.

These values influence printk() behavior when printing or
logging error messages. See 'man 2 syslog' for more info on
the different loglevels.

- console_loglevel: messages with a higher priority than
  this will be printed to the console
- default_message_loglevel: messages without an explicit priority
  will be printed with this priority
- minimum_console_loglevel: minimum (highest) value to which
  console_loglevel can be set
- default_console_loglevel: default value for console_loglevel

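Since a numerically lower loglevel means a higher priority, the
console_loglevel rule comes down to a single comparison. The helper below
is a hypothetical sketch: it takes the contents of the printk file as a
string argument, and the sample values "4 4 1 7" are illustrative only.

```sh
# goes_to_console MSG_LEVEL PRINTK_VALUES   (hypothetical helper)
# Succeeds (exit status 0) if a message at MSG_LEVEL reaches the console.
goes_to_console() (
    msg_level=$1
    # The first field of the printk file is console_loglevel.
    console_loglevel=${2%%[!0-9]*}
    [ "$msg_level" -lt "$console_loglevel" ]
)

if goes_to_console 3 "4 4 1 7"; then
    echo "level 3 is printed to the console"
fi
```

On a live system the second argument could be supplied with
"$(cat /proc/sys/kernel/printk)".
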
==============================================================

printk_delay:

Delay each printk message by printk_delay milliseconds.

Values from 0 to 10000 are allowed.

==============================================================

printk_ratelimit:

Some warning messages are rate limited. printk_ratelimit specifies
the minimum length of time between these messages (in seconds); by
default we allow one every 5 seconds.

A value of 0 will disable rate limiting.

==============================================================

printk_ratelimit_burst:

While long term we enforce one message per printk_ratelimit
seconds, we do allow a burst of messages to pass through.
printk_ratelimit_burst specifies the number of messages we can
send before ratelimiting kicks in.

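The interplay of printk_ratelimit and printk_ratelimit_burst as
described above can be modelled as a token bucket. This is only a
sketch of the documented behavior, not the kernel's code, and the
kernel's actual implementation may differ; the TokenBucket class and
its attribute names are hypothetical.

```python
class TokenBucket:
    """Model: one token is earned every `interval` seconds, at most
    `burst` tokens are stored, and each message spends one token."""

    def __init__(self, interval: float, burst: int, start: float = 0.0):
        self.interval = interval      # models printk_ratelimit
        self.burst = burst            # models printk_ratelimit_burst
        self.tokens = float(burst)    # begin with a full burst allowance
        self.last = start             # time of the previous call

    def allow(self, now: float) -> bool:
        if self.interval == 0:
            return True               # a value of 0 disables rate limiting
        earned = (now - self.last) / self.interval
        self.tokens = min(float(self.burst), self.tokens + earned)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

With an interval of 5 and a burst of 3, the first three calls at the
same instant are allowed, the fourth is suppressed, and one further
message is allowed once 5 seconds have passed.
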
==============================================================

randomize_va_space:

This option can be used to select the type of process address
space randomization that is used in the system, for architectures
that support this feature.

0 - Turn the process address space randomization off.  This is the
    default for architectures that do not support this feature anyway,
    and kernels that are booted with the "norandmaps" parameter.

1 - Make the addresses of mmap base, stack and VDSO page randomized.
    This, among other things, implies that shared libraries will be
    loaded to random addresses.  Also for PIE-linked binaries, the
    location of code start is randomized.  This is the default if the
    CONFIG_COMPAT_BRK option is enabled.

2 - Additionally enable heap randomization.  This is the default if
    CONFIG_COMPAT_BRK is disabled.

    There are a few legacy applications out there (such as some ancient
    versions of libc.so.5 from 1996) that assume that the brk area starts
    just after the end of the code+bss.  These applications break when
    the start of the brk area is randomized.  There are however no known
    non-legacy applications that would be broken this way, so for most
    systems it is safe to choose full randomization.

    Systems with ancient and/or broken binaries should be configured
    with CONFIG_COMPAT_BRK enabled, which excludes the heap from process
    address space randomization.

==============================================================

reboot-cmd: (Sparc only)

??? This seems to be a way to give an argument to the Sparc
ROM/Flash boot loader. Maybe to tell it what to do after
rebooting. ???

==============================================================

rtsig-max & rtsig-nr:

The file rtsig-max can be used to tune the maximum number
of POSIX realtime (queued) signals that can be outstanding
in the system.

rtsig-nr shows the number of RT signals currently queued.

==============================================================

sg-big-buff:

This file shows the size of the generic SCSI (sg) buffer.
You can't tune it just yet, but you could change it at
compile time by editing include/scsi/sg.h and changing
the value of SG_BIG_BUFF.

There shouldn't be any reason to change this value. If
you can come up with one, you probably know what you
are doing anyway :)

==============================================================

shmall:

This parameter sets the total number of shared memory pages that
can be used system-wide. Hence, SHMALL should always be at least
ceil(shmmax/PAGE_SIZE).

If you are not sure what the default PAGE_SIZE is on your Linux
system, you can run the following command:

# getconf PAGE_SIZE

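The lower bound ceil(shmmax/PAGE_SIZE) can be computed with integer
arithmetic. The sketch below assumes shmmax is given in bytes and
shmall in pages, as described above; min_shmall is a hypothetical
helper name.

```python
import os


def min_shmall(shmmax: int, page_size: int) -> int:
    """Smallest shmall (in pages) that can hold one segment of
    shmmax bytes, i.e. ceil(shmmax / page_size)."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    return -(-shmmax // page_size)    # integer ceiling division


def system_page_size() -> int:
    """Page size in bytes, the same value 'getconf PAGE_SIZE' prints."""
    return os.sysconf("SC_PAGE_SIZE")
```

For example, a shmmax of 33554432 bytes with 4096-byte pages needs a
shmall of at least 8192 pages.
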
==============================================================

shmmax:

This value can be used to query and set the run time limit
on the maximum shared memory segment size that can be created.
Shared memory segments up to 1GB are now supported in the
kernel. This value defaults to SHMMAX.

==============================================================

shm_rmid_forced:

Linux lets you set resource limits, including how much memory one
process can consume, via setrlimit(2).  Unfortunately, shared memory
segments are allowed to exist without association with any process, and
thus might not be counted against any resource limits.  If enabled,
shared memory segments are automatically destroyed when their attach
count becomes zero after a detach or a process termination.  Segments
that were created, but never attached to, are also destroyed on exit
from the process.  The only use left for IPC_RMID is to immediately
destroy an unattached segment.  Of course, this breaks the way things are
defined, so some applications might stop working.  Note that this
feature will do you no good unless you also configure your resource
limits (in particular, RLIMIT_AS and RLIMIT_NPROC).  Most systems don't
need this.

Note that if you change this from 0 to 1, already created segments
without users and whose creating process has died will be destroyed.

==============================================================

tainted:

Non-zero if the kernel has been tainted. Numeric values, which
can be ORed together:

   1 - A module with a non-GPL license has been loaded; this
       includes modules with no license.
       Set by modutils >= 2.4.9 and module-init-tools.
   2 - A module was force loaded by insmod -f.
       Set by modutils >= 2.4.9 and module-init-tools.
   4 - Unsafe SMP processors: SMP with CPUs not designed for SMP.
   8 - A module was forcibly unloaded from the system by rmmod -f.
  16 - A hardware machine check error occurred on the system.
  32 - A bad page was discovered on the system.
  64 - The user has asked that the system be marked "tainted".  This
       could be because they are running software that directly modifies
       the hardware, or for other reasons.
 128 - The system has died.
 256 - The ACPI DSDT has been overridden with one supplied by the user
       instead of using the one provided by the hardware.
 512 - A kernel warning has occurred.
1024 - A module from drivers/staging was loaded.
2048 - The system is working around a severe firmware bug.
4096 - An out-of-tree module has been loaded.

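Because the values are ORed together, a reading of the tainted file
has to be split back into its individual bits. The following sketch
does that for the flags listed above only; TAINT_FLAGS and
decode_tainted are hypothetical names, and bits not listed in this
document are ignored.

```python
from typing import List

# Bit value -> short description, taken from the list above.
TAINT_FLAGS = {
    1: "module with a non-GPL license (or no license) loaded",
    2: "module force loaded by insmod -f",
    4: "unsafe SMP processors",
    8: "module forcibly unloaded by rmmod -f",
    16: "hardware machine check error occurred",
    32: "bad page discovered",
    64: "user asked for the system to be marked tainted",
    128: "system has died",
    256: "ACPI DSDT overridden by the user",
    512: "kernel warning occurred",
    1024: "module from drivers/staging loaded",
    2048: "working around a severe firmware bug",
    4096: "out-of-tree module loaded",
}


def decode_tainted(value: int) -> List[str]:
    """Return the descriptions of every known bit set in value,
    in ascending bit order."""
    return [text for bit, text in sorted(TAINT_FLAGS.items())
            if value & bit]
```

A value of 4097, for instance, decodes to the non-GPL module flag (1)
plus the out-of-tree module flag (4096).
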
==============================================================

unknown_nmi_panic:

The value in this file affects how NMIs are handled. When the
value is non-zero, an unknown NMI is trapped and a panic occurs. At
that time, kernel debugging information is displayed on the console.

For example, the NMI switch found on most IA32 servers raises an
unknown NMI. If a system hangs, try pressing the NMI switch.

==============================================================

watchdog_thresh:

This value can be used to control the frequency of hrtimer and NMI
events and the soft and hard lockup thresholds. The default threshold
is 10 seconds.

The softlockup threshold is (2 * watchdog_thresh). Setting this
tunable to zero will disable lockup detection altogether.

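The relation between the tunable and the softlockup threshold is a
simple doubling, with zero meaning "disabled". A minimal sketch of
that rule follows; softlockup_threshold is a hypothetical helper name.

```python
from typing import Optional


def softlockup_threshold(watchdog_thresh: int) -> Optional[int]:
    """Soft lockup threshold in seconds for a given watchdog_thresh,
    or None when lockup detection is disabled (watchdog_thresh == 0)."""
    if watchdog_thresh < 0:
        raise ValueError("watchdog_thresh must not be negative")
    if watchdog_thresh == 0:
        return None
    return 2 * watchdog_thresh
```

With the default of 10 seconds this gives a softlockup threshold of
20 seconds.
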
==============================================================
|