linux/Documentation
Petr Holasek 90bd6fd31c ksm: allow trees per NUMA node
Here's a KSM series, based on mmotm 2013-01-23-17-04: starting with
Petr's v7 "KSM: numa awareness sysfs knob"; then fixing the two issues
we had with that, fully enabling KSM page migration on the way.

(A different kind of KSM/NUMA issue which I've certainly not begun to
address here: when KSM pages are unmerged, there's usually no sense in
preferring to allocate the new pages local to the caller's node.)

This patch:

Introduces new sysfs boolean knob /sys/kernel/mm/ksm/merge_across_nodes
which control merging pages across different numa nodes.  When it is set
to zero only pages from the same node are merged, otherwise pages from
all nodes can be merged together (default behavior).

Typical use-case could be a lot of KVM guests on NUMA machine and cpus
from more distant nodes would have significant increase of access
latency to the merged ksm page.  Sysfs knob was choosen for higher
variability when some users still prefers higher amount of saved
physical memory regardless of access latency.

Every numa node has its own stable & unstable trees because of faster
searching and inserting.  Changing of merge_across_nodes value is
possible only when there are not any ksm shared pages in system.

I've tested this patch on numa machines with 2, 4 and 8 nodes and
measured speed of memory access inside of KVM guests with memory pinned
to one of nodes with this benchmark:

  http://pholasek.fedorapeople.org/alloc_pg.c

Population standard deviations of access times in percentage of average
were following:

merge_across_nodes=1
2 nodes 1.4%
4 nodes 1.6%
8 nodes	1.7%

merge_across_nodes=0
2 nodes	1%
4 nodes	0.32%
8 nodes	0.018%

RFC: https://lkml.org/lkml/2011/11/30/91
v1: https://lkml.org/lkml/2012/1/23/46
v2: https://lkml.org/lkml/2012/6/29/105
v3: https://lkml.org/lkml/2012/9/14/550
v4: https://lkml.org/lkml/2012/9/23/137
v5: https://lkml.org/lkml/2012/12/10/540
v6: https://lkml.org/lkml/2012/12/23/154
v7: https://lkml.org/lkml/2012/12/27/225

Hugh notes that this patch brings two problems, whose solution needs
further support in mm/ksm.c, which follows in subsequent patches:

1) switching merge_across_nodes after running KSM is liable to oops
   on stale nodes still left over from the previous stable tree;

2) memory hotremove may migrate KSM pages, but there is no provision
   here for !merge_across_nodes to migrate nodes to the proper tree.

Signed-off-by: Petr Holasek <pholasek@redhat.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Izik Eidus <izik.eidus@ravellosystems.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:19 -08:00
..
ABI Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid 2013-02-21 17:41:38 -08:00
DocBook Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2013-02-21 17:40:58 -08:00
EDID
PCI PCI/MSI: Enable multiple MSIs with pci_enable_msi_block_auto() 2013-01-24 17:25:13 +01:00
RCU Merge branches 'urgent.2012.10.27a', 'doc.2012.11.16a', 'fixes.2012.11.13a', 'srcu.2012.10.27a', 'stall.2012.11.13a', 'tracing.2012.11.08a' and 'idle.2012.10.24a' into HEAD 2012-11-16 09:59:58 -08:00
accounting doc: Remove unnecessary declarations from Documentation/accounting/getdelays.c 2012-11-26 14:22:21 +01:00
acpi ACPI / Documentation: refer to correct file for acpi_platform_device_ids[] table 2013-02-13 13:43:02 +01:00
aoe aoe: allow user to disable target failure timeout 2012-12-17 17:15:25 -08:00
arm fbdev changes for 3.8: 2012-12-15 13:03:48 -08:00
arm64 arm64: Add simple earlyprintk support 2013-01-22 17:51:01 +00:00
auxdisplay
backlight backlight: lp855x_bl: support new LP8557 device 2013-02-21 17:22:25 -08:00
blackfin
block block: Kill bi_destructor 2012-09-09 10:35:39 +02:00
blockdev
bus-devices ARM: OMAP2+: gpmc: generic timing calculation 2012-11-09 18:07:11 +05:30
cdrom
cgroups cgroups: move cgroup_event_listener.c to tools/cgroup 2013-01-07 09:41:28 -08:00
connector
console
cpu-freq cpufreq: Update Documentation for cpus and related_cpus 2013-02-02 00:01:16 +01:00
cpuidle Honor state disabling in the cpuidle ladder governor 2012-09-04 01:35:44 +02:00
cris
crypto KEYS: Document asymmetric key type 2012-10-08 13:50:12 +10:30
development-process
device-mapper DM-RAID: Fix RAID10's check for sufficient redundancy 2013-01-24 12:02:36 +11:00
devicetree Merge branch 'akpm' (incoming from Andrew) 2013-02-21 17:38:49 -08:00
driver-model lib: devres: Introduce devm_ioremap_resource() 2013-01-22 09:41:43 -08:00
dvb get_dvb_firmware: fix download site for tda10046 firmware 2012-09-28 16:16:00 -03:00
early-userspace
extcon
fault-injection doc: fix quite a few typos within Documentation 2012-11-19 14:28:24 +01:00
fb
filesystems f2fs: update f2fs document to reflect SIT/NAT layout correctly 2013-01-04 09:42:59 +09:00
firmware_class firmware loader: document firmware cache mechanism 2012-11-14 15:07:18 -08:00
frv
hid HID: remove x bit from sensor doc 2012-12-14 08:48:59 +01:00
hwmon hwmon: (jc42) Add support for MCP98244 2013-02-06 09:58:07 -08:00
i2c Documentation: remove __dev* attributes. 2013-01-03 15:57:16 -08:00
i2o
ia64 doc: aliasing-test: close fd on write error 2012-09-01 09:57:10 -07:00
ide
infiniband IB/ipoib: Add rtnl_link_ops support 2012-09-20 16:49:17 -04:00
input Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid 2012-12-13 12:00:48 -08:00
ioctl wanrouter: completely decouple obsolete code from kernel. 2013-01-31 19:20:33 -05:00
isdn
ja_JP
kbuild kbuild: create a rule to run the pre-processor on *.dts files 2013-02-08 14:38:07 +00:00
kdump
ko_KR
laptops Documentation/laptops: remove depends on CONFIG_EXPERIMENTAL 2013-01-21 14:52:42 -08:00
leds leds-lp5523: add channel name in the platform data 2012-09-11 18:32:41 +08:00
m68k
make
memory-devices
mips
misc-devices doc: fix quite a few typos within Documentation 2012-11-19 14:28:24 +01:00
mmc mmc: core: Extend sysfs to ext_csd parameters for RPMB support 2012-12-06 13:54:48 -05:00
mn10300
mtd
namespaces
netlabel
networking tcp: remove Appropriate Byte Count support 2013-02-05 14:51:16 -05:00
nfc NFC: Update pn544 documentation 2013-01-10 01:27:46 +01:00
parisc
pcmcia
power suspend: enable freeze timeout configuration through sys 2013-02-09 22:32:48 +01:00
powerpc powerpc/hw-breakpoint: Use generic hw-breakpoint interfaces for new PPC ptrace flags 2012-11-15 13:00:23 +11:00
pps
prctl seccomp: Make syscall skipping and nr changes more consistent 2012-10-02 21:14:29 +10:00
pti
ptp
rapidio
s390
scheduler sched: Remove __ARCH_WANT_INTERRUPTS_ON_CTXSW 2012-09-13 16:52:04 +02:00
scsi [SCSI] hptiop: Support HighPoint RR4520/RR4522 HBA 2012-11-27 08:59:43 +04:00
security Documentation: fix Documentation/security/00-INDEX 2012-12-17 17:15:22 -08:00
serial tty: Update serial core API documentation 2013-01-15 21:57:44 -08:00
sh
sound ALSA: compress: add support for gapless playback 2013-02-14 12:30:22 +01:00
spi Documentation: remove __dev* attributes. 2013-01-03 15:57:16 -08:00
sysctl Documentation/sysctl/kernel.txt: document /proc/sys/shmall 2013-01-04 16:11:46 -08:00
target target: Simplify fabric sense data length handling 2012-09-17 17:12:58 -07:00
thermal Thermal: Add documentation for platform layer data 2012-11-05 14:00:09 +08:00
timers
trace ACPI and power management updates for 3.9-rc1 2013-02-20 11:26:56 -08:00
usb USB: report submission of active URBs 2012-11-11 18:10:46 -08:00
vDSO
video4linux Documentation: remove __dev* attributes. 2013-01-03 15:57:16 -08:00
virtual ARM: KVM: VGIC accept vcpu and dist base addresses from user space 2013-02-11 18:59:01 +00:00
vm ksm: allow trees per NUMA node 2013-02-23 17:50:19 -08:00
w1 w1: w1_therm: Add force-pullup option for "broken" sensors 2013-02-18 13:55:24 -08:00
watchdog watchdog: fix watchdog-test.c build warning 2012-08-29 17:12:58 +02:00
wimax
x86 Merge branch 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2013-02-21 18:06:55 -08:00
xtensa xtensa: initialize atomctl SR 2012-12-18 21:10:22 -08:00
zh_CN Driver core patches for 3.9-rc1 2013-02-21 12:05:51 -08:00
.gitignore
00-INDEX Documentation: update top level 00-INDEX file with new additions 2013-02-18 10:38:54 +01:00
BUG-HUNTING
Changes
CodingStyle Documentation: remove depends on CONFIG_EXPERIMENTAL 2013-01-11 11:38:03 -08:00
DMA-API-HOWTO.txt Documentation DMA-API-HOWTO.txt Add dma mapping error check usage examples 2012-10-24 17:07:43 +02:00
DMA-API.txt dma-debug: New interfaces to debug dma mapping errors 2012-10-24 17:06:43 +02:00
DMA-ISA-LPC.txt
DMA-attributes.txt common: DMA-mapping: add DMA_ATTR_FORCE_CONTIGUOUS attribute 2012-11-29 03:30:34 -08:00
HOWTO HOWTO: fix double words typo 2012-12-10 15:54:27 +01:00
IPMI.txt IPMI: Remove SMBus driver info from the docs 2012-10-16 18:07:12 -07:00
IRQ-affinity.txt
IRQ-domain.txt irqdomain: update documentation 2012-12-05 23:52:10 +00:00
IRQ.txt
Intel-IOMMU.txt
Makefile
ManagementStyle
SAK.txt
SM501.txt
SecurityBugs
SubmitChecklist
SubmittingDrivers
SubmittingPatches
VGA-softcursor.txt
applying-patches.txt
atomic_ops.txt Documentation: Memory barrier semantics of atomic_xchg() 2013-01-08 14:14:55 -08:00
bad_memory.txt
basic_profiling.txt
binfmt_misc.txt
braille-console.txt
bt8xxgpio.txt
btmrvl.txt
bus-virt-phys-mapping.txt
cachetlb.txt
circular-buffers.txt
clk.txt
coccinelle.txt
cpu-hotplug.txt doc: Add x86 CPU0 online/offline feature 2012-11-14 09:39:44 -08:00
cpu-load.txt
cputopology.txt
crc32.txt
dcdbas.txt
debugging-modules.txt
debugging-via-ohci1394.txt
dell_rbu.txt
devices.txt firmware: remove last vestiges of dabusb 2012-11-21 13:03:01 -08:00
digsig.txt
dma-buf-sharing.txt doc: fix quite a few typos within Documentation 2012-11-19 14:28:24 +01:00
dmaengine.txt
dontdiff x86: remove offsets.h from .gitignore and dontdiff 2012-11-19 14:10:53 +01:00
dynamic-debug-howto.txt dynamic_debug: dynamic hex dump 2013-01-17 12:19:09 -08:00
edac.txt
eisa.txt
email-clients.txt
flexible-arrays.txt
futex-requeue-pi.txt
gcov.txt
gpio.txt gpiolib: provide provision to register pin ranges 2012-11-11 19:06:00 +01:00
highuid.txt
hw_random.txt
hwspinlock.txt
init.txt
initrd.txt
intel_txt.txt Documentation: remove depends on CONFIG_EXPERIMENTAL 2013-01-11 11:38:03 -08:00
io-mapping.txt
io_ordering.txt
iostats.txt
irqflags-tracing.txt
isapnp.txt
java.txt
kernel-doc-nano-HOWTO.txt Kernel-doc: Convention: Use a "Return" section to describe return values 2012-11-27 21:08:57 +01:00
kernel-docs.txt
kernel-parameters.txt acpi, memory-hotplug: support getting hotplug info from SRAT 2013-02-23 17:50:14 -08:00
kmemcheck.txt
kmemleak.txt
kobject.txt Documentation: Fix "struct kobj_type" to include newer members. 2012-09-04 16:06:34 -07:00
kprobes.txt
kref.txt kref: Add kref_get_unless_zero documentation 2012-11-28 18:36:06 +10:00
ldm.txt
local_ops.txt
lockdep-design.txt
lockstat.txt
lockup-watchdogs.txt
logo.gif
logo.txt
magic-number.txt wanrouter: completely decouple obsolete code from kernel. 2013-01-31 19:20:33 -05:00
md.txt
media-framework.txt
memory-barriers.txt Documentation: Memory barrier semantics of atomic_xchg() 2013-01-08 14:14:55 -08:00
memory-hotplug.txt hotplug: update nodemasks management 2012-12-12 17:38:33 -08:00
mono.txt
mutex-design.txt
nommu-mmap.txt
numastat.txt
oops-tracing.txt
padata.txt
parport-lowlevel.txt
parport.txt
percpu-rw-semaphore.txt percpu-rw-semaphore: fix documentation typos 2012-09-26 19:56:15 +02:00
pi-futex.txt
pinctrl.txt drivers/pinctrl: grab default handles from device core 2013-01-23 16:39:51 +01:00
pnp.txt
preempt-locking.txt
printk-formats.txt lib/vsprintf.c: add %pa format specifier for phys_addr_t types 2013-02-21 17:22:20 -08:00
pwm.txt pwm: add devm_pwm_get() and devm_pwm_put() 2012-09-10 17:05:45 +02:00
ramoops.txt pstore/ftrace: Convert to its own enable/disable debugfs knob 2012-09-06 22:16:58 -07:00
rbtree.txt rbtree: move augmented rbtree functionality to rbtree_augmented.h 2012-10-09 16:22:40 +09:00
remoteproc.txt remoteproc: add rproc_report_crash function to notify rproc crashes 2012-09-18 12:53:22 +03:00
rfkill.txt
robust-futex-ABI.txt
robust-futexes.txt
rpmsg.txt Documentation: remove __dev* attributes. 2013-01-03 15:57:16 -08:00
rt-mutex-design.txt
rt-mutex.txt
rtc.txt rtc-proc: permit the /proc/driver/rtc device to use other devices 2012-10-06 03:05:01 +09:00
serial-console.txt
sgi-ioc4.txt
sgi-visws.txt
smsc_ece1099.txt mfd: smsc: Add support for smsc gpio io/keypad driver 2012-10-01 15:27:48 +02:00
sparse.txt Documentation/sparse.txt: document context annotations for lock checking 2012-12-17 17:15:23 -08:00
spinlocks.txt
stable_api_nonsense.txt
stable_kernel_rules.txt
static-keys.txt
svga.txt
sysfs-rules.txt
sysrq.txt sparc64: Add global PMU register dumping via sysrq. 2012-10-16 09:34:01 -07:00
unaligned-memory-access.txt
unicode.txt
unshare.txt
vfio.txt vfio: Trivial Documentation correction 2012-09-21 10:48:03 -06:00
vgaarbiter.txt
video-output.txt
vme_api.txt
volatile-considered-harmful.txt
workqueue.txt
xz.txt
zorro.txt