Go to file
Roman Gushchin 090892ba70 mm: memcg/slab: fix percpu slab vmstats flushing
commit 4a87e2a25d upstream.

Currently slab percpu vmstats are flushed twice: during the memcg
offlining and just before freeing the memcg structure.  Each time percpu
counters are summed, added to the atomic counterparts and propagated up
by the cgroup tree.

The second flushing is required due to how recursive vmstats are
implemented: counters are batched in percpu variables on a local level,
and once a percpu value is crossing some predefined threshold, it spills
over to atomic values on the local and each ascendant levels.  It means
that without flushing some numbers cached in percpu variables will be
dropped on floor each time a cgroup is destroyed.  And with uptime the
error on upper levels might become noticeable.

The first flushing aims to make counters on ancestor levels more
precise.  Dying cgroups may resume in the dying state for a long time.
After kmem_cache reparenting which is performed during the offlining
slab counters of the dying cgroup don't have any chances to be updated,
because any slab operations will be performed on the parent level.  It
means that the inaccuracy caused by percpu batching will not decrease up
to the final destruction of the cgroup.  By the original idea flushing
slab counters during the offlining should minimize the visible
inaccuracy of slab counters on the parent level.

The problem is that percpu counters are not zeroed after the first
flushing.  So every cached percpu value is summed twice.  It creates a
small error (up to 32 pages per cpu, but usually less) which accumulates
on parent cgroup level.  After creating and destroying of thousands of
child cgroups, slab counter on parent level can be way off the real
value.

For now, let's just stop flushing slab counters on memcg offlining.  It
can't be done correctly without scheduling a work on each cpu: reading
and zeroing it during css offlining can race with an asynchronous
update, which doesn't expect values to be changed underneath.

With this change, slab counters on parent level will become eventually
consistent.  Once all dying children are gone, values are correct.  And
if not, the error is capped by 32 * NR_CPUS pages per dying cgroup.

It's not perfect, as slab are reparented, so any updates after the
reparenting will happen on the parent level.  It means that if a slab
page was allocated, a counter on child level was bumped, then the page
was reparented and freed, the annihilation of positive and negative
counter values will not happen until the child cgroup is released.  It
makes slab counters different from others, and it might want us to
implement flushing in a correct form again.  But it's also a question of
performance: scheduling a work on each cpu isn't free, and it's an open
question if the benefit of having more accurate counters is worth it.

We might also consider flushing all counters on offlining, not only slab
counters.

So let's fix the main problem now: make the slab counters eventually
consistent, so at least the error won't grow with uptime (or more
precisely the number of created and destroyed cgroups).  And think about
the accuracy of counters separately.

Link: http://lkml.kernel.org/r/20191220042728.1045881-1-guro@fb.com
Fixes: bee07b33db ("mm: memcontrol: flush percpu slab vmstats on kmem offlining")
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-01-23 08:22:39 +01:00
Documentation mei: fix modalias documentation 2020-01-17 19:48:48 +01:00
LICENSES LICENSES: Rename other to deprecated 2019-05-03 06:34:32 -06:00
arch s390/setup: Fix secure ipl message 2020-01-23 08:22:38 +01:00
block block: fix an integer overflow in logical block size 2020-01-23 08:22:32 +01:00
certs PKCS#7: Refactor verify_pkcs7_signature() 2019-08-05 18:40:18 -04:00
crypto crypto: algif_skcipher - Use chunksize instead of blocksize 2020-01-17 19:48:46 +01:00
drivers clk: samsung: exynos5420: Keep top G3D clocks enabled 2020-01-23 08:22:39 +01:00
fs io_uring: only allow submit from owning task 2020-01-23 08:22:32 +01:00
include mm: memcg/slab: fix percpu slab vmstats flushing 2020-01-23 08:22:39 +01:00
init Merge branch 'next-lockdown' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security 2019-09-28 08:14:15 -07:00
ipc ipc/sem.c: convert to use built-in RCU list checking 2019-09-25 17:51:41 -07:00
kernel locking/lockdep: Fix buffer overrun problem in stack_trace[] 2020-01-23 08:22:39 +01:00
lib sbitmap: only queue kyber's wait callback if not already active 2020-01-12 12:21:44 +01:00
mm mm: memcg/slab: fix percpu slab vmstats flushing 2020-01-23 08:22:39 +01:00
net rxrpc: Fix missing security check on incoming calls 2020-01-17 19:49:05 +01:00
samples samples: bpf: fix syscall_tp due to unused syscall 2020-01-12 12:21:26 +01:00
scripts kbuild/deb-pkg: annotate libelf-dev dependency as :native 2020-01-17 19:49:07 +01:00
security tomoyo: Suppress RCU warning at list_for_each_entry_rcu(). 2020-01-17 19:49:05 +01:00
sound ALSA: usb-audio: fix sync-ep altsetting sanity check 2020-01-23 08:22:31 +01:00
tools perf report: Fix incorrectly added dimensions as switch perf data file 2020-01-23 08:22:39 +01:00
usr gen_initramfs_list.sh: fix 'bad variable name' error 2020-01-09 10:20:00 +01:00
virt KVM: arm/arm64: Properly handle faulting of device mappings 2019-12-31 16:46:24 +01:00
.clang-format clang-format: Update with the latest for_each macro list 2019-08-31 10:00:51 +02:00
.cocciconfig
.get_maintainer.ignore Opt out of scripts/get_maintainer.pl 2019-05-16 10:53:40 -07:00
.gitattributes
.gitignore Modules updates for v5.4 2019-09-22 10:34:46 -07:00
.mailmap ARM: SoC fixes 2019-11-10 13:41:59 -08:00
COPYING COPYING: use the new text with points to the license files 2018-03-23 12:41:45 -06:00
CREDITS MAINTAINERS: Remove Simon as Renesas SoC Co-Maintainer 2019-10-10 08:12:51 -07:00
Kbuild kbuild: do not descend to ./Kbuild when cleaning 2019-08-21 21:03:58 +09:00
Kconfig docs: kbuild: convert docs to ReST and rename to *.rst 2019-06-14 14:21:21 -06:00
MAINTAINERS MAINTAINERS: Append missed file to the database 2020-01-17 19:48:28 +01:00
Makefile Linux 5.4.13 2020-01-17 19:49:08 +01:00
README Drop all 00-INDEX files from Documentation/ 2018-09-09 15:08:58 -06:00

README

Linux kernel
============

There are several guides for kernel developers and users. These guides can
be rendered in a number of formats, like HTML and PDF. Please read
Documentation/admin-guide/README.rst first.

In order to build the documentation, use ``make htmldocs`` or
``make pdfdocs``.  The formatted documentation can also be read online at:

    https://www.kernel.org/doc/html/latest/

There are various text files in the Documentation/ subdirectory,
several of them using the Restructured Text markup notation.

Please read the Documentation/process/changes.rst file, as it contains the
requirements for building and running the kernel, and information about
the problems which may result by upgrading your kernel.