linux/Documentation
Kirill A. Shutemov dc6c9a35b6 mm: account pmd page tables to the process
Dave noticed that unprivileged process can allocate significant amount of
memory -- >500 MiB on x86_64 -- and stay unnoticed by oom-killer and
memory cgroup.  The trick is to allocate a lot of PMD page tables.  Linux
kernel doesn't account PMD tables to the process, only PTE.

The use-cases below use few tricks to allocate a lot of PMD page tables
while keeping VmRSS and VmPTE low.  oom_score for the process will be 0.

	#include <errno.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>
	#include <sys/mman.h>
	#include <sys/prctl.h>

	#define PUD_SIZE (1UL << 30)
	#define PMD_SIZE (1UL << 21)

	#define NR_PUD 130000

	int main(void)
	{
		char *addr = NULL;
		unsigned long i;

		prctl(PR_SET_THP_DISABLE);
		for (i = 0; i < NR_PUD ; i++) {
			addr = mmap(addr + PUD_SIZE, PUD_SIZE, PROT_WRITE|PROT_READ,
					MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
			if (addr == MAP_FAILED) {
				perror("mmap");
				break;
			}
			*addr = 'x';
			munmap(addr, PMD_SIZE);
			mmap(addr, PMD_SIZE, PROT_WRITE|PROT_READ,
					MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED, -1, 0);
			if (addr == MAP_FAILED)
				perror("re-mmap"), exit(1);
		}
		printf("PID %d consumed %lu KiB in PMD page tables\n",
				getpid(), i * 4096 >> 10);
		return pause();
	}

The patch addresses the issue by account PMD tables to the process the
same way we account PTE.

The main place where PMD tables is accounted is __pmd_alloc() and
free_pmd_range(). But there're few corner cases:

 - HugeTLB can share PMD page tables. The patch handles by accounting
   the table to all processes who share it.

 - x86 PAE pre-allocates few PMD tables on fork.

 - Architectures with FIRST_USER_ADDRESS > 0. We need to adjust sanity
   check on exit(2).

Accounting only happens on configuration where PMD page table's level is
present (PMD is not folded).  As with nr_ptes we use per-mm counter.  The
counter value is used to calculate baseline for badness score by
oom-killer.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: David Rientjes <rientjes@google.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-11 17:06:04 -08:00
..
ABI Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input 2015-02-11 09:32:08 -08:00
DocBook media updates for v3.20-rc1 2015-02-11 08:45:40 -08:00
EDID
PCI
RCU Merge branches 'doc.2015.01.07a', 'fixes.2015.01.15a', 'preempt.2015.01.06a', 'srcu.2015.01.06a', 'stall.2015.01.16a' and 'torture.2015.01.11a' into HEAD 2015-01-15 23:34:34 -08:00
accounting
acpi ACPI / Documentation: add a missing '=' 2015-01-22 01:21:45 +01:00
aoe
arm Merge branch 'for-linus' of git://ftp.arm.linux.org.uk/~rmk/linux-arm 2014-12-12 15:26:48 -08:00
arm64 arm64: Emulate CP15 Barrier instructions 2014-11-20 16:34:48 +00:00
auxdisplay
backlight
blackfin
block Merge branch 'for-3.19/core' of git://git.kernel.dk/linux-block 2014-12-13 14:14:23 -08:00
blockdev zram: report maximum used memory 2014-10-09 22:26:02 -04:00
bus-devices
cdrom
cgroups mm: memcontrol: default hierarchy interface for memory 2015-02-11 17:06:02 -08:00
connector
console
cpu-freq intel_pstate: Add num_pstates to sysfs 2015-01-30 01:52:17 +01:00
cpuidle
cris
crypto crypto: doc - userspace interface spec 2014-11-13 22:31:38 +08:00
development-process Documentation: remove outdated references to the linux-next wiki 2014-10-28 09:06:11 -04:00
device-mapper dm cache policy mq: simplify ability to promote sequential IO to the cache 2014-11-10 15:25:30 -05:00
devicetree MMC core: 2015-02-11 10:56:48 -08:00
dmaengine Documentation: dmanegine: move dmatest.txt to dmaengine folder 2014-11-06 11:17:37 +05:30
driver-model PCI changes for the v3.18 merge window: 2014-10-09 15:03:49 -04:00
dvb
early-userspace
extcon
fault-injection
fb
filesystems Merge branch 'akpm' (patches from Andrew) 2015-02-10 16:45:56 -08:00
firmware_class
fmc
frv
gpio This is the bulk of GPIO changes for the v3.19 series: 2014-12-14 14:05:05 -08:00
hid
hwmon hwmon: (ina2xx) Add ina231 compatible string 2015-01-25 21:23:59 -08:00
i2c Documentation: i2c: Use PM ops instead of legacy suspend/resume 2014-12-04 19:09:03 +01:00
i2o
ia64 kvm: Documentation: remove ia64 2014-11-20 11:08:55 +01:00
ide
infiniband
input Docs changes for the 3.19 merge window 2014-12-12 14:42:48 -08:00
ioctl cxl: Add documentation for userspace APIs 2014-10-08 20:16:19 +11:00
isdn
ja_JP
kbuild
kdump kernel: add panic_on_warn 2014-12-10 17:41:10 -08:00
ko_KR
laptops
leds
locking locking/Documentation: Update code path 2015-01-16 09:09:21 +01:00
m68k
memory-devices
metag
mic Documentation: Build mic/mpssd only for x86_64 2014-12-05 11:18:36 -05:00
mips
misc-devices
mmc
mn10300
mtd
namespaces
netlabel
networking tcp: helpers to mitigate ACK loops by rate-limiting out-of-window dupacks 2015-02-08 01:03:12 -08:00
nfc
nios2 Documentation: Add documentation for Nios2 architecture 2014-12-08 12:56:06 +08:00
parisc
pcmcia
phy
platform
power PM / sleep: Mention async suspend in PM_TRACE documentation 2015-01-30 01:29:46 +01:00
powerpc cxl: Add documentation for userspace APIs 2014-10-08 20:16:19 +11:00
pps
prctl Documentation: Restrict TSC test code to x86 2014-10-28 08:46:27 -04:00
pti
ptp ptp: restore the makefile for building the test program. 2014-10-24 16:07:10 -04:00
rapidio
s390 s390/docs: Remove sections that are not related to s390 2014-11-18 18:22:59 +01:00
scheduler
scsi Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2014-12-12 10:08:06 -08:00
security Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity into next 2014-11-19 21:36:07 +11:00
serial serial: Fix locking for uart driver set_termios() method 2014-11-05 18:53:54 -08:00
sh
sound ALSA: hda - Add "eapd" model string for AD1986A codec 2014-12-10 14:00:13 +01:00
spi Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/doc 2014-10-07 21:14:57 -04:00
sysctl mm: account pmd page tables to the process 2015-02-11 17:06:04 -08:00
target Documentation/target: Update fabric_ops to latest code 2015-01-06 13:46:49 -08:00
thermal Documentation: thermal: document of_cpufreq_cooling_register() 2015-01-06 14:39:17 -04:00
timers
tpm
trace Char/Misc driver patches for 3.19-rc1 2014-12-14 16:43:47 -08:00
usb Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2015-02-10 18:57:15 -08:00
vDSO vdso: don't require 64-bit math in standalone test 2014-10-25 10:53:44 -04:00
video4linux [media] Documentation/video4linux: remove obsolete text files 2015-01-29 19:16:30 -02:00
virtual Second round of changes for KVM for arm/arm64 for v3.19; fixes reboot 2014-12-15 13:06:40 +01:00
vm mm:add KPF_ZERO_PAGE flag for /proc/kpageflags 2015-02-11 17:06:00 -08:00
w1
watchdog
wimax
x86 x86, entry: Switch stacks on a paranoid entry from userspace 2015-01-02 10:22:45 -08:00
xtensa
zh_CN
00-INDEX
BUG-HUNTING
Changes Update old iproute2 and Xen Remus links 2014-12-09 13:38:13 -05:00
CodingStyle CodingStyle: add some more error handling guidelines 2014-12-02 08:55:32 -05:00
DMA-API-HOWTO.txt
DMA-API.txt
DMA-ISA-LPC.txt
DMA-attributes.txt
HOWTO Documentation: remove outdated references to the linux-next wiki 2014-10-28 09:06:11 -04:00
IPMI.txt ipmi: Add SMBus interface driver (SSIF) 2014-12-11 15:04:11 -06:00
IRQ-affinity.txt
IRQ-domain.txt irqdomain: Introduce new interfaces to support hierarchy irqdomains 2014-11-23 13:01:45 +01:00
IRQ.txt
Intel-IOMMU.txt
Makefile
ManagementStyle
SAK.txt
SM501.txt
SecurityBugs
SubmitChecklist
SubmittingDrivers
SubmittingPatches Documentation/SubmittingPatches: Reported-by tags and permission 2014-10-29 08:56:46 -04:00
VGA-softcursor.txt
applying-patches.txt
assoc_array.txt
atomic_ops.txt documentation: Add atomic_long_t to atomic_ops.txt 2014-11-13 10:34:54 -08:00
bad_memory.txt
basic_profiling.txt
bcache.txt
binfmt_misc.txt binfmt_misc: touch up documentation a bit 2014-10-14 02:18:16 +02:00
braille-console.txt
bt8xxgpio.txt
btmrvl.txt
bus-virt-phys-mapping.txt
cachetlb.txt rmap: drop support of non-linear mappings 2015-02-10 14:30:31 -08:00
circular-buffers.txt
clk.txt clk: Change clk_ops->determine_rate to return a clk_hw as the best parent 2014-12-03 16:21:37 -08:00
coccinelle.txt
cpu-hotplug.txt
cpu-load.txt
cputopology.txt
crc32.txt
dcdbas.txt
debugging-modules.txt
debugging-via-ohci1394.txt
dell_rbu.txt
devices.txt
digsig.txt
dma-buf-sharing.txt
dontdiff
dynamic-debug-howto.txt
edac.txt
efi-stub.txt
eisa.txt
email-clients.txt Documentation/email-clients.txt: add info about Claws Mail 2014-12-02 11:55:29 -05:00
flexible-arrays.txt
futex-requeue-pi.txt doc: Fix misnamed FUTEX_CMP_REQUEUE_PI op constants 2015-01-19 12:05:32 +01:00
gcov.txt
highuid.txt
hsi.txt
hw_random.txt
hwspinlock.txt
init.txt
initrd.txt
intel_txt.txt
io-mapping.txt
io_ordering.txt
iostats.txt
irqflags-tracing.txt
isapnp.txt
java.txt
kernel-doc-nano-HOWTO.txt
kernel-docs.txt
kernel-parameters.txt Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 2015-02-10 20:01:30 -08:00
kernel-per-CPU-kthreads.txt
kmemcheck.txt
kmemleak.txt Documentation: Add CONFIG_DEBUG_KMEMLEAK_DEFAULT_OFF case 2014-10-24 13:59:03 -04:00
kobject.txt kobject: grammar fix 2014-12-08 09:07:11 -05:00
kprobes.txt
kref.txt
kselftest.txt kselftest: Move the docs to the Documentation dir 2014-11-24 10:49:54 -07:00
ldm.txt
local_ops.txt percpu: update local_ops.txt to reflect this_cpu operations 2014-12-13 12:42:53 -08:00
lockup-watchdogs.txt
logo.gif
logo.txt
lzo.txt
magic-number.txt
mailbox.txt Documentation: Fix a typo in mailbox.txt 2014-11-03 11:54:50 -05:00
md.txt
media-framework.txt
memory-barriers.txt documentation: Fix smp typo in memory-barriers.txt 2015-01-07 08:59:32 -08:00
memory-hotplug.txt memory-hotplug: add sysfs valid_zones attribute 2014-10-09 22:25:52 -04:00
module-signing.txt
mono.txt
nommu-mmap.txt
numastat.txt
oops-tracing.txt livepatch: kernel: add TAINT_LIVEPATCH 2014-12-22 15:40:48 +01:00
padata.txt
parport-lowlevel.txt
parport.txt
percpu-rw-semaphore.txt
phy.txt phy: improved lookup method 2014-11-21 19:48:50 +05:30
pi-futex.txt
pinctrl.txt
pnp.txt
preempt-locking.txt
printk-formats.txt lib/vsprintf: add %*pE[achnops] format specifier 2014-10-14 02:18:26 +02:00
pwm.txt
ramoops.txt pstore-ram: Allow optional mapping with pgprot_noncached 2014-12-11 13:38:31 -08:00
rbtree.txt
remoteproc.txt
rfkill.txt rfkill: document rfkill module parameters 2015-01-09 23:22:12 +01:00
robust-futex-ABI.txt
robust-futexes.txt
rpmsg.txt
rtc.txt
serial-console.txt
sgi-ioc4.txt
smsc_ece1099.txt
sparse.txt
stable_api_nonsense.txt
stable_kernel_rules.txt
static-keys.txt
svga.txt
sysfs-rules.txt
sysrq.txt
this_cpu_ops.txt
unaligned-memory-access.txt
unicode.txt
unshare.txt
vfio.txt
vgaarbiter.txt
video-output.txt
vme_api.txt
volatile-considered-harmful.txt
workqueue.txt
xillybus.txt
xz.txt
zorro.txt