linux/Documentation
Andrea Arcangeli 2c653d0ee2 ksm: introduce ksm_max_page_sharing per page deduplication limit
Without a max deduplication limit for each KSM page, the list of the
rmap_items associated to each stable_node can grow infinitely large.

During the rmap walk each entry can take up to ~10usec to process
because of IPIs for the TLB flushing (both for the primary MMU and the
secondary MMUs with the MMU notifier).  With only 16GB of address space
shared in the same KSM page, that would amount to dozens of seconds of
kernel runtime.

A ~256 max deduplication factor will reduce the latencies of the rmap
walks on KSM pages to order of a few msec.  Just doing the
cond_resched() during the rmap walks is not enough, the list size must
have a limit too, otherwise the caller could get blocked in (schedule
friendly) kernel computations for seconds, unexpectedly.

There's room for optimization to significantly reduce the IPI delivery
cost during the page_referenced(), but at least for page_migration in
the KSM case (used by hard NUMA bindings, compaction and NUMA balancing)
it may be inevitable to send lots of IPIs if each rmap_item->mm is
active on a different CPU and there are lots of CPUs.  Even if we ignore
the IPI delivery cost, we've still to walk the whole KSM rmap list, so
we can't allow millions or billions (ulimited) number of entries in the
KSM stable_node rmap_item lists.

The limit is enforced efficiently by adding a second dimension to the
stable rbtree.  So there are three types of stable_nodes: the regular
ones (identical as before, living in the first flat dimension of the
stable rbtree), the "chains" and the "dups".

Every "chain" and all "dups" linked into a "chain" enforce the invariant
that they represent the same write protected memory content, even if
each "dup" will be pointed by a different KSM page copy of that content.
This way the stable rbtree lookup computational complexity is unaffected
if compared to an unlimited max_sharing_limit.  It is still enforced
that there cannot be KSM page content duplicates in the stable rbtree
itself.

Adding the second dimension to the stable rbtree only after the
max_page_sharing limit hits, provides for a zero memory footprint
increase on 64bit archs.  The memory overhead of the per-KSM page
stable_tree and per virtual mapping rmap_item is unchanged.  Only after
the max_page_sharing limit hits, we need to allocate a stable_tree
"chain" and rb_replace() the "regular" stable_node with the newly
allocated stable_node "chain".  After that we simply add the "regular"
stable_node to the chain as a stable_node "dup" by linking hlist_dup in
the stable_node_chain->hlist.  This way the "regular" (flat) stable_node
is converted to a stable_node "dup" living in the second dimension of
the stable rbtree.

During stable rbtree lookups the stable_node "chain" is identified as
stable_node->rmap_hlist_len == STABLE_NODE_CHAIN (aka
is_stable_node_chain()).

When dropping stable_nodes, the stable_node "dup" is identified as
stable_node->head == STABLE_NODE_DUP_HEAD (aka is_stable_node_dup()).

The STABLE_NODE_DUP_HEAD must be an unique valid pointer never used
elsewhere in any stable_node->head/node to avoid a clashes with the
stable_node->node.rb_parent_color pointer, and different from
&migrate_nodes.  So the second field of &migrate_nodes is picked and
verified as always safe with a BUILD_BUG_ON in case the list_head
implementation changes in the future.

The STABLE_NODE_DUP is picked as a random negative value in
stable_node->rmap_hlist_len.  rmap_hlist_len cannot become negative when
it's a "regular" stable_node or a stable_node "dup".

The stable_node_chain->nid is irrelevant.  The stable_node_chain->kpfn
is aliased in a union with a time field used to rate limit the
stable_node_chain->hlist prunes.

The garbage collection of the stable_node_chain happens lazily during
stable rbtree lookups (as for all other kind of stable_nodes), or while
disabling KSM with "echo 2 >/sys/kernel/mm/ksm/run" while collecting the
entire stable rbtree.

While the "regular" stable_nodes and the stable_node "dups" must wait
for their underlying tree_page to be freed before they can be freed
themselves, the stable_node "chains" can be freed immediately if the
stable_node->hlist turns empty.  This is because the "chains" are never
pointed by any page->mapping and they're effectively stable rbtree KSM
self contained metadata.

[akpm@linux-foundation.org: fix non-NUMA build]
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Tested-by: Petr Holasek <pholasek@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Evgheni Dereveanchin <ederevea@redhat.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Gavin Guo <gavin.guo@canonical.com>
Cc: Jay Vosburgh <jay.vosburgh@canonical.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-06 16:24:31 -07:00
..
ABI Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 2017-07-05 12:31:59 -07:00
EDID drm: use .hword to represent 16-bit numbers 2017-03-30 10:15:19 +02:00
PCI docs: update old references for DocBook from the documentation 2017-05-16 08:44:19 -03:00
RCU rcu: Remove debugfs tracing 2017-06-08 18:52:43 -07:00
accounting
acpi Merge branches 'acpi-button' and 'acpi-tools' 2017-05-22 20:29:06 +02:00
admin-guide mm: allow slab_nomerge to be set at build time 2017-07-06 16:24:31 -07:00
aoe
arm ARM: at91: Documentation: add armv7m families 2017-06-02 10:11:09 +02:00
arm64 arm64: documentation: document tagged pointer stack constraints 2017-05-09 17:43:18 +01:00
auxdisplay
backlight
blackfin
block block: remove bio_clone() and all references. 2017-06-18 12:40:59 -06:00
blockdev remove the mg_disk driver 2017-04-14 14:00:49 -06:00
bus-devices
cdrom
cgroup-v1
cma
connector
console
core-api There has been a fair amount of activity in the docs tree this time 2017-07-03 21:13:25 -07:00
cpu-freq cpufreq: intel_pstate: Document the current behavior and user interface 2017-05-14 02:06:03 +02:00
cpuidle
cris
crypto Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 2017-07-05 12:22:23 -07:00
dev-tools There has been a fair amount of activity in the docs tree this time 2017-07-03 21:13:25 -07:00
device-mapper - A major update for DM cache that reduces the latency for deciding 2017-05-03 10:31:20 -07:00
devicetree Merge branch 'for-4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata 2017-07-06 09:41:58 -07:00
dmaengine
doc-guide kernel-doc: describe the ``literal`` syntax 2017-05-16 08:44:24 -03:00
driver-api There has been a fair amount of activity in the docs tree this time 2017-07-03 21:13:25 -07:00
driver-model Char/Misc patches for 4.13-rc1 2017-07-03 20:55:59 -07:00
early-userspace Documentation: Fix dead URLs to ftp.kernel.org 2017-03-29 15:46:06 -06:00
extcon extcon: Remove porting compatibility of swich class 2017-04-06 10:55:24 +09:00
fault-injection
fb docs: update old references for DocBook from the documentation 2017-05-16 08:44:19 -03:00
features powerpc updates for 4.12 part 1. 2017-05-05 11:36:44 -07:00
filesystems There has been a fair amount of activity in the docs tree this time 2017-07-03 21:13:25 -07:00
firmware_class
fmc
fpga
frv
gpio
gpu docs: update old references for DocBook from the documentation 2017-05-16 08:44:19 -03:00
hid
hwmon hwmon: (pmbus) move header file out of I2C realm 2017-06-11 17:08:19 -07:00
i2c
ia64
ide
iio
infiniband IB/opa-vnic: Virtual Network Interface Controller (VNIC) documentation 2017-04-20 12:01:06 -04:00
input Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input 2017-05-26 16:45:13 -07:00
ioctl TEE driver infrastructure and OP-TEE drivers 2017-05-10 11:20:09 -07:00
isdn
kbuild Documentation, kbuild: fix typo "minimun" -> "minimum" 2017-05-18 10:49:44 -06:00
kdump Documentation: kdump: describe arm64 port 2017-04-05 18:32:32 +01:00
kernel-hacking There has been a fair amount of activity in the docs tree this time 2017-07-03 21:13:25 -07:00
laptops
leds
lightnvm lightnvm: physical block device (pblk) target 2017-04-16 10:06:33 -06:00
livepatch
locking
m68k
md Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md 2017-05-03 10:05:38 -07:00
media Docs: Use kernel-figure in vidioc-g-selection.rst 2017-06-23 13:45:56 -06:00
memory-devices
metag
mic
mips
misc-devices Documentation: misc-devices: Add Documentation for pci-endpoint-test driver 2017-04-28 10:23:19 -05:00
mmc MMC core: 2017-05-02 17:34:32 -07:00
mn10300
mtd
namespaces
netlabel
networking Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 2017-07-05 12:31:59 -07:00
nfc
nios2
nvdimm
nvmem
parisc
pcmcia
perf perf: qcom: Add L3 cache PMU driver 2017-04-03 18:53:50 +01:00
phy
platform
power power supply and reset changes for the v4.13 series 2017-07-04 14:25:14 -07:00
powerpc powerpc/fadump: update documentation about crashkernel parameter reuse 2017-05-08 17:15:11 -07:00
pps
process doc: Document suitability of IBM Verse for kernel development 2017-06-22 10:22:41 -06:00
pti
ptp
rapidio
s390 docs: add documentation for vfio-ccw 2017-03-31 12:55:11 +02:00
scheduler sched/deadline: Add documentation about GRUB reclaiming 2017-06-08 10:31:56 +02:00
scsi scsi: make asynchronous aborts mandatory 2017-04-06 13:07:33 -04:00
security docs: Fix some formatting issues in request-key.rst 2017-05-18 10:46:25 -06:00
serial tty: n_gsm: do not send/receive in ldisc close path 2017-06-03 18:48:52 +09:00
sh docs-rst: convert sh book to ReST 2017-05-16 08:44:18 -03:00
sound There has been a fair amount of activity in the docs tree this time 2017-07-03 21:13:25 -07:00
sparc
sphinx Docs: clean up some DocBook loose ends 2017-06-23 14:17:38 -06:00
sphinx-static
spi spi: Document SPI slave controller support 2017-05-26 13:11:00 +01:00
sysctl Replace 2 jiffies with sysctl netdev_budget_usecs to enable softirq tuning 2017-04-21 13:22:34 -04:00
target Documentation/target: add an example script to configure an iSCSI target 2017-05-01 22:21:35 -07:00
thermal Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/rzhang/linux 2017-05-12 11:58:45 -07:00
timers rcu: Eliminate NOCBs CPU-state Kconfig options 2017-06-08 18:52:43 -07:00
trace Char/Misc patches for 4.13-rc1 2017-07-03 20:55:59 -07:00
translations doc/kokr/howto: Only send regression fixes after -rc1 2017-06-22 10:25:22 -06:00
usb usb: gadget: add f_uac1 variant based on a new u_audio api 2017-06-19 09:22:47 +03:00
userspace-api doc: ReSTify no_new_privs.txt 2017-05-18 10:30:09 -06:00
virtual Second round of KVM/ARM Changes for v4.12. 2017-05-09 12:51:49 +02:00
vm ksm: introduce ksm_max_page_sharing per page deduplication limit 2017-07-06 16:24:31 -07:00
w1
watchdog iTCO_wdt: all versions count down twice 2017-05-19 10:42:11 +02:00
wimax
x86 x86/mce: Update bootlog description to reflect behavior on AMD 2017-06-14 07:32:10 +02:00
xtensa
.gitignore
00-INDEX Merge remote-tracking branch 'mauro-exp/docbook3' into death-to-docbook 2017-05-18 11:03:08 -06:00
Changes
CodingStyle
DMA-API-HOWTO.txt
DMA-API.txt Documentation: DMA API: fix a typo in a function name 2017-06-05 15:57:02 -06:00
DMA-ISA-LPC.txt
DMA-attributes.txt
IPMI.txt
IRQ-affinity.txt
IRQ-domain.txt Documentation: Update IRQ-domain.txt to document irq_domain_mapping 2017-05-22 22:29:45 +02:00
IRQ.txt
Intel-IOMMU.txt
Makefile docs: remove DocBook from the building system 2017-05-16 08:44:19 -03:00
SAK.txt
SM501.txt
SubmittingPatches
bcache.txt
bt8xxgpio.txt
btmrvl.txt
bus-virt-phys-mapping.txt
cachetlb.txt
cgroup-v2.txt cgroup: implement "nsdelegate" mount option 2017-06-28 14:45:21 -04:00
circular-buffers.txt
clk.txt
conf.py Docs: Fix breakage with Sphinx 1.5 and upper 2017-06-23 13:45:37 -06:00
cpu-load.txt
cputopology.txt docs: Fix a couple typos 2017-04-27 15:54:39 -06:00
crc32.txt
dcdbas.txt
debugging-modules.txt
debugging-via-ohci1394.txt docs: Fix a couple typos 2017-04-27 15:54:39 -06:00
dell_rbu.txt
digsig.txt
docutils.conf
dontdiff GCC plugin updates: 2017-07-05 11:46:59 -07:00
efi-stub.txt
eisa.txt
flexible-arrays.txt
futex-requeue-pi.txt
gcc-plugins.txt
highuid.txt
hw_random.txt
hwspinlock.txt
index.rst Make the main documentation title less Geocities 2017-06-23 14:02:27 -06:00
intel_txt.txt
io-mapping.txt
io_ordering.txt
iostats.txt
irqflags-tracing.txt
isa.txt
isapnp.txt
kernel-doc-nano-HOWTO.txt docs: update old references for DocBook from the documentation 2017-05-16 08:44:19 -03:00
kernel-per-CPU-kthreads.txt rcu: Eliminate NOCBs CPU-state Kconfig options 2017-06-08 18:52:43 -07:00
kobject.txt
kprobes.txt
kref.txt Revert "kref: double kref_put() in my_data_handler()" 2017-04-08 18:38:10 +02:00
kselftest.txt
ldm.txt
lockup-watchdogs.txt
logo.gif
logo.txt
lsm.txt docs-rst: convert lsm from DocBook to ReST 2017-05-16 08:44:19 -03:00
lzo.txt
mailbox.txt
memory-barriers.txt There has been a fair amount of activity in the docs tree this time 2017-07-03 21:13:25 -07:00
memory-hotplug.txt
men-chameleon-bus.txt
nommu-mmap.txt
ntb.txt
numastat.txt
padata.txt
parport-lowlevel.txt
percpu-rw-semaphore.txt
phy.txt
pi-futex.txt
pinctrl.txt pinctrl: core: Fix pinctrl_register_and_init() with pinctrl_enable() 2017-04-07 01:08:08 +02:00
pnp.txt
preempt-locking.txt
printk-formats.txt
pwm.txt
rbtree.txt
remoteproc.txt
rfkill.txt
robust-futex-ABI.txt
robust-futexes.txt
rpmsg.txt
rtc.txt
sgi-ioc4.txt
siphash.txt
smsc_ece1099.txt
static-keys.txt docs: Fix a couple typos 2017-04-27 15:54:39 -06:00
svga.txt
switchtec.txt switchtec: Add IOCTLs to the Switchtec driver 2017-04-12 12:23:37 -05:00
sync_file.txt
tee.txt
this_cpu_ops.txt
unaligned-memory-access.txt
vfio-mediated-device.txt docs: Fix a spelling error in vfio-mediated-device.txt 2017-04-27 15:54:39 -06:00
vfio.txt
video-output.txt
xillybus.txt
xz.txt
zorro.txt docs: Fix a couple typos 2017-04-27 15:54:39 -06:00