Go to file
Dave Chiluk de53fd7aed sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices
It has been observed, that highly-threaded, non-cpu-bound applications
running under cpu.cfs_quota_us constraints can hit a high percentage of
periods throttled while simultaneously not consuming the allocated
amount of quota. This use case is typical of user-interactive non-cpu
bound applications, such as those running in kubernetes or mesos when
run on multiple cpu cores.

This has been root caused to cpu-local run queue being allocated per cpu
bandwidth slices, and then not fully using that slice within the period.
At which point the slice and quota expires. This expiration of unused
slice results in applications not being able to utilize the quota for
which they are allocated.

The non-expiration of per-cpu slices was recently fixed by
'commit 512ac999d2 ("sched/fair: Fix bandwidth timer clock drift
condition")'. Prior to that it appears that this had been broken since
at least 'commit 51f2176d74 ("sched/fair: Fix unlocked reads of some
cfs_b->quota/period")' which was introduced in v3.16-rc1 in 2014. That
added the following conditional which resulted in slices never being
expired.

if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
	/* extend local deadline, drift is bounded above by 2 ticks */
	cfs_rq->runtime_expires += TICK_NSEC;

Because this was broken for nearly 5 years, and has recently been fixed
and is now being noticed by many users running kubernetes
(https://github.com/kubernetes/kubernetes/issues/67577) it is my opinion
that the mechanisms around expiring runtime should be removed
altogether.

This allows quota already allocated to per-cpu run-queues to live longer
than the period boundary. This allows threads on runqueues that do not
use much CPU to continue to use their remaining slice over a longer
period of time than cpu.cfs_period_us. However, this helps prevent the
above condition of hitting throttling while also not fully utilizing
your cpu quota.

This theoretically allows a machine to use slightly more than its
allotted quota in some periods. This overflow would be bounded by the
remaining quota left on each per-cpu runqueueu. This is typically no
more than min_cfs_rq_runtime=1ms per cpu. For CPU bound tasks this will
change nothing, as they should theoretically fully utilize all of their
quota in each period. For user-interactive tasks as described above this
provides a much better user/application experience as their cpu
utilization will more closely match the amount they requested when they
hit throttling. This means that cpu limits no longer strictly apply per
period for non-cpu bound applications, but that they are still accurate
over longer timeframes.

This greatly improves performance of high-thread-count, non-cpu bound
applications with low cfs_quota_us allocation on high-core-count
machines. In the case of an artificial testcase (10ms/100ms of quota on
80 CPU machine), this commit resulted in almost 30x performance
improvement, while still maintaining correct cpu quota restrictions.
That testcase is available at https://github.com/indeedeng/fibtest.

Fixes: 512ac999d2 ("sched/fair: Fix bandwidth timer clock drift condition")
Signed-off-by: Dave Chiluk <chiluk+linux@indeed.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: John Hammond <jhammond@indeed.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kyle Anderson <kwa@yelp.com>
Cc: Gabriel Munos <gmunoz@netflix.com>
Cc: Peter Oskolkov <posk@posk.io>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Brendan Gregg <bgregg@netflix.com>
Link: https://lkml.kernel.org/r/1563900266-19734-2-git-send-email-chiluk+linux@indeed.com
2019-08-08 09:09:30 +02:00
arch hexagon: switch to generic version of pte allocation 2019-07-21 09:53:00 -07:00
block docs conversion for v5.3-rc1 2019-07-16 12:21:41 -07:00
certs Revert "Merge tag 'keys-acl-20190703' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs" 2019-07-10 18:43:43 -07:00
crypto USB / PHY patches for 5.3-rc1 2019-07-11 15:40:06 -07:00
Documentation sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices 2019-08-08 09:09:30 +02:00
drivers media updates for v5.3-rc1 2019-07-22 09:01:47 -07:00
fs sched/fair: Don't free p->numa_faults with concurrent readers 2019-07-25 15:37:04 +02:00
include sched/core: Prevent race condition between cpuset and __sched_setscheduler() 2019-07-25 15:55:04 +02:00
init Merge branch 'work.mount0' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2019-07-19 10:42:02 -07:00
ipc Merge branch 'work.mount0' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2019-07-19 10:42:02 -07:00
kernel sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices 2019-08-08 09:09:30 +02:00
lib Kbuild updates for v5.3 (2nd) 2019-07-20 09:34:55 -07:00
LICENSES
mm Merge branch 'work.mount0' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2019-07-19 10:42:02 -07:00
net tcp: be more careful in tcp_fragment() 2019-07-21 20:41:24 -07:00
samples Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2019-07-19 10:06:06 -07:00
scripts Kbuild updates for v5.3 (2nd) 2019-07-20 09:34:55 -07:00
security Merge branch 'work.mount0' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2019-07-19 10:42:02 -07:00
sound sound fixes for 5.3-rc1 2019-07-18 09:36:51 -07:00
tools New feature to add support for NTB virtual MSI interrupts, the ability 2019-07-21 09:46:59 -07:00
usr Kbuild updates for v5.3 (2nd) 2019-07-20 09:34:55 -07:00
virt KVM: Boost vCPUs that are delivering interrupts 2019-07-20 09:00:45 +02:00
.clang-format
.cocciconfig
.get_maintainer.ignore
.gitattributes
.gitignore kbuild: create *.mod with full directory path and remove MODVERDIR 2019-07-18 02:19:31 +09:00
.mailmap
COPYING
CREDITS Remove references to dead website. 2019-07-19 12:22:04 -07:00
Kbuild
Kconfig
MAINTAINERS Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2019-07-20 11:24:49 -07:00
Makefile Linus 5.3-rc1 2019-07-21 14:05:38 -07:00
README

Linux kernel
============

There are several guides for kernel developers and users. These guides can
be rendered in a number of formats, like HTML and PDF. Please read
Documentation/admin-guide/README.rst first.

In order to build the documentation, use ``make htmldocs`` or
``make pdfdocs``.  The formatted documentation can also be read online at:

    https://www.kernel.org/doc/html/latest/

There are various text files in the Documentation/ subdirectory,
several of them using the Restructured Text markup notation.

Please read the Documentation/process/changes.rst file, as it contains the
requirements for building and running the kernel, and information about
the problems which may result by upgrading your kernel.