linux/kernel/sched
Wanpeng Li 2a42eb9594 sched/cputime: Accumulate vtime on top of nsec clocksource
Currently the cputime source used by vtime is jiffies. When we cross
a context boundary and jiffies have changed since the last snapshot, the
pending cputime is accounted to the switching out context.

This system works ok if the ticks are not aligned across CPUs. If they
instead are aligned (ie: all fire at the same time) and the CPUs run in
userspace, the jiffies change is only observed on tick exit and therefore
the user cputime is accounted as system cputime. This is because the
CPU that maintains timekeeping fires its tick at the same time as the
others. It updates jiffies in the middle of the tick and the other CPUs
see that update on IRQ exit:

    CPU 0 (timekeeper)                  CPU 1
    -------------------              -------------
                      jiffies = N
    ...                              run in userspace for a jiffy
    tick entry                       tick entry (sees jiffies = N)
    set jiffies = N + 1
    tick exit                        tick exit (sees jiffies = N + 1)
                                                account 1 jiffy as stime

Fix this with using a nanosec clock source instead of jiffies. The
cputime is then accumulated and flushed everytime the pending delta
reaches a jiffy in order to mitigate the accounting overhead.

[ fweisbec: changelog, rebase on struct vtime, field renames, add delta
  on cputime readers, keep idle vtime as-is (low overhead accounting),
  harmonize clock sources. ]

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
Tested-by: Luiz Capitulino <lcapitulino@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wanpeng Li <kernellwp@gmail.com>
Link: http://lkml.kernel.org/r/1498756511-11714-6-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-07-05 09:54:15 +02:00
..
Makefile Merge branch 'WIP.sched/core' into sched/core 2017-06-20 12:28:21 +02:00
autogroup.c sched/autogroup: Rename auto_group.[ch] to autogroup.[ch] 2017-02-08 09:01:11 +01:00
autogroup.h sched/headers: Prepare for new header dependencies before moving code to <linux/sched/autogroup.h> 2017-03-02 08:42:28 +01:00
clock.c sched/clock: Fix early boot preempt assumption in __set_sched_clock_stable() 2017-05-24 09:10:00 +02:00
completion.c sched/wait: Rename wait_queue_t => wait_queue_entry_t 2017-06-20 12:18:27 +02:00
core.c Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2017-07-03 13:08:04 -07:00
cpuacct.c sched/cputime: Convert kcpustat to nsecs 2017-02-01 09:13:47 +01:00
cpuacct.h sched/cpuacct: Simplify the cpuacct code 2016-03-21 11:00:28 +01:00
cpudeadline.c sched/core: Remove the tsk_cpus_allowed() wrapper 2017-03-02 08:42:24 +01:00
cpudeadline.h sched/deadline: Split cpudl_set() into cpudl_set() and cpudl_clear() 2016-09-05 13:29:43 +02:00
cpufreq.c cpufreq / sched: Pass flags to cpufreq_update_util() 2016-08-16 22:14:55 +02:00
cpufreq_schedutil.c Merge branches 'pm-cpufreq', 'pm-cpuidle' and 'pm-devfreq' 2017-06-15 01:51:33 +02:00
cpupri.c sched/core: Remove the tsk_cpus_allowed() wrapper 2017-03-02 08:42:24 +01:00
cpupri.h sched/cpupri: Remove unnecessary definitions in cpupri.h 2014-11-16 10:58:59 +01:00
cputime.c sched/cputime: Accumulate vtime on top of nsec clocksource 2017-07-05 09:54:15 +02:00
deadline.c sched/deadline: Move DL related code from sched/core.c to sched/deadline.c 2017-06-23 10:46:45 +02:00
debug.c sched/debug: Expose the number of RT/DL tasks that can migrate 2017-06-30 09:32:07 +02:00
fair.c sched/numa: Hide numa_wake_affine() from UP build 2017-06-29 08:25:52 +02:00
features.h sched/core: Implement new approach to scale select_idle_cpu() 2017-06-08 10:25:17 +02:00
idle.c sched/idle: Add deferrable vmstat_updater back 2017-06-08 10:32:09 +02:00
idle_task.c sched/core: Add wrappers for lockdep_(un)pin_lock() 2017-01-14 11:29:30 +01:00
loadavg.c sched/loadavg: Generalize "_idle" naming to "_nohz" 2017-06-22 11:30:01 +02:00
rt.c sched/rt: Move RT related code from sched/core.c to sched/rt.c 2017-06-23 10:46:45 +02:00
sched-pelt.h sched/fair: Move the PELT constants into a generated header 2017-04-14 10:26:37 +02:00
sched.h sched/rt: Move RT related code from sched/core.c to sched/rt.c 2017-06-23 10:46:45 +02:00
stats.c sched: use %*pb[l] to print bitmaps including cpumasks and nodemasks 2015-02-13 21:21:37 -08:00
stats.h sched/headers: Move cputime functionality from <linux/sched.h> and <linux/cputime.h> into <linux/sched/cputime.h> 2017-03-03 01:45:22 +01:00
stop_task.c sched/core: Add wrappers for lockdep_(un)pin_lock() 2017-01-14 11:29:30 +01:00
swait.c sched/headers: Prepare to move signal wakeup & sigpending methods from <linux/sched.h> into <linux/sched/signal.h> 2017-03-02 08:42:32 +01:00
topology.c sched/topology: Rename sched_group_cpus() 2017-05-15 10:15:34 +02:00
wait.c sched/wait: Disambiguate wq_entry->task_list and wq_head->task_list naming 2017-06-20 12:19:14 +02:00
wait_bit.c sched/wait: Disambiguate wq_entry->task_list and wq_head->task_list naming 2017-06-20 12:19:14 +02:00