Commit Graph

740 Commits

Author SHA1 Message Date
Mel Gorman 1378447598 sched/numa: Stagger NUMA balancing scan periods for new threads
Threads share an address space and each can change the protections of the
same address space to trap NUMA faults. This is redundant and potentially
counter-productive, as any one thread doing the update would suffice.
Potentially only one thread is required, but that thread may be idle, or
it may have no locality concerns and so pick an unsuitable scan rate.

This patch keeps the scan periods independent but staggers them based on
the number of address-space users when the thread is created.  The intent
is that threads will avoid scanning at the same time and have a chance
to adapt their scan rate later if necessary. This reduces the total scan
activity early in the lifetime of the threads.
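
As a rough illustration of the staggering idea (a standalone sketch with
made-up names, not the actual diff):

  /*
   * Delay a new thread's first NUMA scan in proportion to how many
   * threads already share the mm, so they don't all start at once.
   */
  static unsigned long first_scan_delay(unsigned int mm_users,
                                        unsigned long scan_delay_ms)
  {
          /* thread N starts roughly N scan delays after creation */
          return (unsigned long)mm_users * scan_delay_ms;
  }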

The difference in headline performance across a range of machines and
workloads is marginal but the system CPU usage is reduced as well as overall
scan activity.  The following is the time reported by the NAS Parallel
Benchmark using unbound OpenMP threads and the D size class:

			      4.17.0-rc1             4.17.0-rc1
				 vanilla           stagger-v1r1
	Time bt.D      442.77 (   0.00%)      419.70 (   5.21%)
	Time cg.D      171.90 (   0.00%)      180.85 (  -5.21%)
	Time ep.D       33.10 (   0.00%)       32.90 (   0.60%)
	Time is.D        9.59 (   0.00%)        9.42 (   1.77%)
	Time lu.D      306.75 (   0.00%)      304.65 (   0.68%)
	Time mg.D       54.56 (   0.00%)       52.38 (   4.00%)
	Time sp.D     1020.03 (   0.00%)      903.77 (  11.40%)
	Time ua.D      400.58 (   0.00%)      386.49 (   3.52%)

Note that it's not a universal win: we have no prior knowledge of which
thread matters, and the number of threads created often exceeds the size
of the node when the threads are not bound. However, there is a reduction
in overall system CPU usage:

				    4.17.0-rc1             4.17.0-rc1
				       vanilla           stagger-v1r1
	sys-time-bt.D         48.78 (   0.00%)       48.22 (   1.15%)
	sys-time-cg.D         25.31 (   0.00%)       26.63 (  -5.22%)
	sys-time-ep.D          1.65 (   0.00%)        0.62 (  62.42%)
	sys-time-is.D         40.05 (   0.00%)       24.45 (  38.95%)
	sys-time-lu.D         37.55 (   0.00%)       29.02 (  22.72%)
	sys-time-mg.D         47.52 (   0.00%)       34.92 (  26.52%)
	sys-time-sp.D        119.01 (   0.00%)      109.05 (   8.37%)
	sys-time-ua.D         51.52 (   0.00%)       45.13 (  12.40%)

NUMA scan activity is also reduced (vanilla vs. stagger-v1r1):

	NUMA alloc local               1042828     1342670
	NUMA base PTE updates        140481138    93577468
	NUMA huge PMD updates           272171      180766
	NUMA page range updates      279832690   186129660
	NUMA hint faults               1395972     1193897
	NUMA hint local faults          877925      855053
	NUMA hint local percent             62          71
	NUMA pages migrated           12057909     9158023

Similar observations are made for other thread-intensive workloads. System
CPU usage is lower even though the headline gains in performance tend to be
small. For example, specjbb 2005 shows almost no difference in performance
but scan activity is reduced by a third on a 4-socket box. I didn't find
a workload (thread intensive or otherwise) that suffered badly.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/20180504154109.mvrha2qo5wdl65vr@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-14 09:12:24 +02:00
Ingo Molnar dfd5c3ea64 Linux 4.17-rc5
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAlr4xw8eHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGNYoH/1d5zyMpVJVUKZ0K
 LuEctCGby1PjSvSOhmMuxFVagFAqfBJXmwWTeohLfLG48r/Yk0AsZQ5HH13/8baj
 k/T8UgUvKZKustndCRp+joQ3Pa1ZpcIFaWRvB8pKFCefJ/F/Lj4B4X1HYI7vLq0K
 /ZBXUdy3ry0lcVuypnaARYAb2O7l/nyZIjZ3FhiuyymWe7Jpo+G7VK922LOMSX/y
 VYFZCWa8nxN+yFhO0ao9X5k7ggIiUrEBtbfNrk19VtAn0hx+OYKW2KfJK/eHNey/
 CKrOT+KAxU8VU29AEIbYzlL3yrQmULcEoIDiqJ/6m5m6JwsEbP6EqQHs0TiuQFpq
 A0MO9rw=
 =yjUP
 -----END PGP SIGNATURE-----

Merge tag 'v4.17-rc5' into sched/core, to pick up fixes and dependencies

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-14 09:02:14 +02:00
Linus Torvalds 66e1c94db3 Merge branch 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86/pti updates from Thomas Gleixner:
 "A mixed bag of fixes and updates for the ghosts which are hunting us.

  The scheduler fixes have been pulled into that branch to avoid
  conflicts.

   - A set of fixes to address a kthread_parkme() race which caused lost
     wakeups and loss of state.

   - A deadlock fix for stop_machine() solved by moving the wakeups
     outside of the stopper_lock held region.

   - A set of Spectre V1 array access restrictions. The possibly
     problematic spots were discovered by Dan Carpenter's new checks in
     smatch.

   - Removal of an unused file which was forgotten when the rest of that
     functionality was removed"

* 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/vdso: Remove unused file
  perf/x86/cstate: Fix possible Spectre-v1 indexing for pkg_msr
  perf/x86/msr: Fix possible Spectre-v1 indexing in the MSR driver
  perf/x86: Fix possible Spectre-v1 indexing for x86_pmu::event_map()
  perf/x86: Fix possible Spectre-v1 indexing for hw_perf_event cache_*
  perf/core: Fix possible Spectre-v1 indexing for ->aux_pages[]
  sched/autogroup: Fix possible Spectre-v1 indexing for sched_prio_to_weight[]
  sched/core: Fix possible Spectre-v1 indexing for sched_prio_to_weight[]
  sched/core: Introduce set_special_state()
  kthread, sched/wait: Fix kthread_parkme() completion issue
  kthread, sched/wait: Fix kthread_parkme() wait-loop
  sched/fair: Fix the update of blocked load when newly idle
  stop_machine, sched: Fix migrate_swap() vs. active_balance() deadlock
2018-05-13 10:53:08 -07:00
Mel Gorman 789ba28013 Revert "sched/numa: Delay retrying placement for automatic NUMA balance after wake_affine()"
This reverts commit 7347fc87df.

Srikar Dronamraju pointed out that while the commit in question did show
a performance improvement on ppc64, it did so at the cost of disabling
active CPU migration by automatic NUMA balancing which was not the intent.
The issue was a serious flaw in the logic: active balancing was never
triggered if SD_WAKE_AFFINE was disabled on scheduler domains. Even when
it's enabled, the logic is still bizarre and against the original intent.

Investigation showed that fixing the patch in either of the ways he
suggested (using the correct comparison for jiffies values, or introducing
a new numa_migrate_deferred variable in task_struct) performs similarly to
a revert, with a mix of gains and losses depending on the workload, machine
and socket count.
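
For reference, this is the class of bug referred to above: raw comparisons
of jiffies values break at wraparound, which the time_before()/time_after()
macros handle (a sketch; retry_placement() is a made-up helper):

  /* buggy: a direct comparison fails when jiffies wraps around */
  if (jiffies > p->numa_migrate_retry)
          retry_placement(p);

  /* correct: wrap-safe comparison */
  if (time_after(jiffies, p->numa_migrate_retry))
          retry_placement(p);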

The original intent of the commit was to handle a problem whereby
wake_affine, idle balancing and automatic NUMA balancing disagree on the
appropriate placement for a task. This was particularly true for cases where
a single task was a massive waker of tasks but where the wake_wide logic did
not apply.  This was especially noticeable when a futex (used as a barrier)
woke all worker threads and tried pulling the wakees to the waker's node. In
that specific case, it could be handled by tuning MPI or OpenMP appropriately,
but the behavior is not illogical and was worth attempting to fix. However,
the approach was wrong. Given that we're at rc4 and a fix is not obvious,
it's better to play safe, revert this commit and retry later.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: efault@gmx.de
Cc: ggherdovich@suse.cz
Cc: hpa@zytor.com
Cc: matt@codeblueprint.co.uk
Cc: mpe@ellerman.id.au
Link: http://lkml.kernel.org/r/20180509163115.6fnnyeg4vdm2ct4v@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-12 08:37:56 +02:00
Viresh Kumar c976a862ba sched/fair: Avoid calling sync_entity_load_avg() unnecessarily
Call sync_entity_load_avg() directly from find_idlest_cpu() instead of
select_task_rq_fair(), as that's where the task's utilization value is
needed. Also, call sync_entity_load_avg() only after making sure the
sched domain spans at least one of the task's allowed CPUs.
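
Schematically (a simplified sketch, not the exact diff):

  static int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p,
                             int cpu, int prev_cpu, int sd_flag)
  {
          if (!cpumask_intersects(sched_domain_span(sd), &p->cpus_allowed))
                  return prev_cpu;

          /* moved here from select_task_rq_fair(): the utilization is
           * only consumed below, and only once the domain is known to
           * be usable for this task */
          sync_entity_load_avg(&p->se);

          /* ... walk the domain hierarchy using p's utilization ... */
          return cpu;
  }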

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: http://lkml.kernel.org/r/cd019d1753824c81130eae7b43e2bbcec47cc1ad.1524738578.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-04 10:00:08 +02:00
Viresh Kumar f1d88b4468 sched/fair: Rearrange select_task_rq_fair() to optimize it
Rearrange select_task_rq_fair() a bit to avoid executing some
conditional statements in a few specific code paths. This also gets rid
of the goto.

This shouldn't result in any functional changes.

Tested-by: Rohit Jain <rohit.k.jain@oracle.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: http://lkml.kernel.org/r/20831b8d237bf3a20e4e328286f678b425ff04c9.1524738578.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-04 10:00:07 +02:00
Vincent Guittot 457be908c8 sched/fair: Fix the update of blocked load when newly idle
With commit:

  31e77c93e4 ("sched/fair: Update blocked load when newly idle")

... we release the rq->lock when updating blocked load of idle CPUs.

This opens a time window during which another CPU can add a task to this
CPU's cfs_rq.

idle_balance()'s check for a newly added task is not on the common path.
Move the 'out' label so that this check is always executed.
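
Schematically (simplified, assumed structure):

  static int idle_balance(struct rq *this_rq)
  {
          int pulled_task = 0;

          /* ... balancing may release and retake this_rq->lock ... */
  out:
          /* this check must be on every exit path: while the lock was
           * dropped, another CPU may have enqueued a task here */
          if (this_rq->cfs.h_nr_running && !pulled_task)
                  pulled_task = 1;

          return pulled_task;
  }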

Reported-by: Heiner Kallweit <hkallweit1@gmail.com>
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 31e77c93e4 ("sched/fair: Update blocked load when newly idle")
Link: http://lkml.kernel.org/r/20180426103133.GA6953@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-03 07:38:03 +02:00
Davidlohr Bueso adcc8da885 sched/core: Simplify helpers for rq clock update skip requests
By renaming the functions we can get rid of the skip parameter
and improve code readability. It makes zero sense to have
things such as:

  rq_clock_skip_update(rq, false)

when the skip request is in fact never going to happen. Rename
things such that we end up with:

  rq_clock_skip_update(rq)
  rq_clock_cancel_skipupdate(rq)
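
A sketch of what the renamed helpers boil down to (simplified; the real
versions also assert that rq->lock is held):

  static inline void rq_clock_skip_update(struct rq *rq)
  {
          rq->clock_update_flags |= RQCF_REQ_SKIP;        /* request a skip */
  }

  static inline void rq_clock_cancel_skipupdate(struct rq *rq)
  {
          rq->clock_update_flags &= ~RQCF_REQ_SKIP;       /* withdraw it */
  }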

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Cc: matt@codeblueprint.co.uk
Cc: rostedt@goodmis.org
Link: http://lkml.kernel.org/r/20180404161539.nhadkff2aats74jh@linux-n805
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-04-05 09:20:46 +02:00
Patrick Bellasi d519329f72 sched/fair: Update util_est only on util_avg updates
The estimated utilization of a task is currently updated every time the
task is dequeued. However, to keep overheads under control, PELT signals
are effectively updated at maximum once every 1ms.

Thus, for really short-running tasks, it can happen that their util_avg
value has not been updated since their last enqueue.  If such tasks are
also frequently running (e.g. the kind of workload generated by
hackbench) it can also happen that their util_avg is updated only once
every few activations.

This means that updating util_est at every dequeue potentially introduces
unnecessary overhead, and it's also conceptually wrong if the util_avg
signal has never been updated during a task activation.

Let's introduce a throttling mechanism on a task's util_est updates
to sync them with util_avg updates. To make the solution memory
efficient, both in terms of space and load/store operations, we encode a
synchronization flag into the LSB of util_est.enqueued.
This makes util_est an even-values-only metric, which is still
considered good enough for its purpose.
The synchronization bit is (re)set by __update_load_avg_se() once the
PELT signal of a task has been updated during its last activation.

Such a throttling mechanism allows us to keep util_est overheads in the
wakeup hot path under control, thus making it a suitable mechanism to
enable even on systems running high-intensity workloads.
Accordingly, this patch now switches on the utilization estimation
scheduler feature by default.
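
The LSB encoding can be sketched as follows (an illustrative sketch; the
flag name is assumed):

  #define UTIL_AVG_UNCHANGED 0x1        /* LSB of util_est.enqueued */

  /* readers mask the flag out, so the metric only takes even values */
  static inline unsigned int task_util_est(unsigned int enqueued)
  {
          return enqueued & ~UTIL_AVG_UNCHANGED;
  }

  /* the bit tells dequeue whether PELT updated util_avg during this
   * activation; if it did not, the previous estimate is kept as-is */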

Suggested-by: Chris Redpath <chris.redpath@arm.com>
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@android.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: http://lkml.kernel.org/r/20180309095245.11071-5-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-20 08:11:09 +01:00
Patrick Bellasi f9be3e5961 sched/fair: Use util_est in LB and WU paths
When the scheduler looks at the CPU utilization, the current PELT value
for a CPU is returned straight away. In certain scenarios this can have
undesired side effects on task placement.

For example, since the task utilization is decayed at wakeup time, when
a long-sleeping big task is enqueued it does not immediately add a
significant contribution to the target CPU.
As a result we get a race window where other tasks can be placed
on the same CPU while it is still considered relatively empty.

In order to reduce this kind of race condition, this patch introduces the
required support to integrate the usage of the CPU's estimated utilization
in the wakeup path, via cpu_util_wake(), as well as in the load-balance
path, via cpu_util() which is used by update_sg_lb_stats().

The estimated utilization of a CPU is defined to be the maximum between
its PELT utilization and the sum of the estimated utilization (at
previous dequeue time) of all the tasks currently RUNNABLE on that CPU.
This allows us to properly represent the spare capacity of a CPU which,
for example, has just picked up a big task after a long sleep period.
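
In sketch form (simplified, without capacity clamping), the estimated
utilization of a CPU described above is:

  static unsigned long cpu_util_est(unsigned long pelt_util_avg,
                                    unsigned long util_est_enqueued_sum)
  {
          /* max of the decayed PELT signal and the sum of the RUNNABLE
           * tasks' estimated utilization at their last dequeue */
          return pelt_util_avg > util_est_enqueued_sum ?
                  pelt_util_avg : util_est_enqueued_sum;
  }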

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@android.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: http://lkml.kernel.org/r/20180309095245.11071-3-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-20 08:11:07 +01:00
Patrick Bellasi 7f65ea42eb sched/fair: Add util_est on top of PELT
The util_avg signal computed by PELT is too variable for some use-cases.
For example, a big task waking up after a long sleep period will have its
utilization almost completely decayed. This introduces some latency before
schedutil will be able to pick the best frequency to run a task.

The same issue can affect task placement. Indeed, since the task
utilization is already decayed at wakeup, when the task is enqueued on a
CPU, that CPU can temporarily be represented as almost empty even though
it is running a big task. This leads to a race condition where other
tasks can potentially be allocated on a CPU which has just started to
run a big task that slept for a relatively long period.

Moreover, the PELT utilization of a task can be updated every millisecond,
thus making it a continuously changing value for certain longer-running
tasks. This means that the instantaneous PELT utilization of a RUNNING
task is not really meaningful for properly supporting scheduler decisions.

For all these reasons, a more stable signal can do a better job of
representing the expected/estimated utilization of a task/cfs_rq.
Such a signal can be easily created on top of PELT by still using it as
an estimator which produces values to be aggregated on meaningful
events.

This patch adds a simple implementation of util_est, a new signal built on
top of PELT's util_avg where:

    util_est(task) = max(task::util_avg, f(task::util_avg@dequeue))

This allows us to remember how big a task was reported to be by PELT in
its previous activations via f(task::util_avg@dequeue), which is the new
_task_util_est(struct task_struct*) function added by this patch.

If a task should change its behavior and it runs longer in a new
activation, after a certain time its util_est will just track the
original PELT signal (i.e. task::util_avg).

The estimated utilization of a cfs_rq is defined only for root ones.
That's because the only sensible consumers of this signal are the
scheduler and schedutil when looking at the overall CPU utilization
due to FAIR tasks.

For this reason, the estimated utilization of a root cfs_rq is simply
defined as:

    util_est(cfs_rq) = max(cfs_rq::util_avg, cfs_rq::util_est::enqueued)

where:

    cfs_rq::util_est::enqueued = sum(_task_util_est(task))
                                 for each RUNNABLE task on that root cfs_rq

It's worth noting that the estimated utilization is tracked only for
objects of interest, specifically:

 - Tasks: to better support task placement decisions
 - root cfs_rqs: to better support both task placement decisions as
                 well as frequency selection
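
A minimal sketch of how cfs_rq::util_est::enqueued can be maintained
(assumed names, no under/overflow guards):

  struct cfs_rq_util_est {
          unsigned long enqueued;         /* sum over RUNNABLE tasks */
  };

  static void util_est_enqueue(struct cfs_rq_util_est *ue,
                               unsigned long task_est)
  {
          ue->enqueued += task_est;
  }

  static void util_est_dequeue(struct cfs_rq_util_est *ue,
                               unsigned long task_est)
  {
          ue->enqueued -= task_est;       /* estimate captured at dequeue */
  }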

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@android.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: http://lkml.kernel.org/r/20180309095245.11071-2-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-20 08:11:06 +01:00
Vincent Guittot 31e77c93e4 sched/fair: Update blocked load when newly idle
When a NEWLY_IDLE load balance is not triggered, we might still need to
update the blocked load. We can either kick an ilb so an idle CPU will
take care of updating the blocked load, or we can try to update it
locally before entering idle. In the latter case, we reuse part of
nohz_idle_balance().

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: brendan.jackman@arm.com
Cc: dietmar.eggemann@arm.com
Cc: morten.rasmussen@foss.arm.com
Cc: valentin.schneider@arm.com
Link: http://lkml.kernel.org/r/1518622006-16089-4-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-09 07:59:28 +01:00
Peter Zijlstra 47ea54121e sched/fair: Move idle_balance()
We're going to want to call nohz_idle_balance() or parts thereof from
idle_balance(). Since we already have a forward declaration of
idle_balance(), move it down below nohz_idle_balance(), avoiding the
need for a forward declaration of the latter.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-09 07:59:25 +01:00
Peter Zijlstra dd707247ab sched/nohz: Merge CONFIG_NO_HZ_COMMON blocks
Now that we have two back-to-back NO_HZ_COMMON blocks, merge them.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-09 07:59:24 +01:00
Peter Zijlstra af3fe03c56 sched/fair: Move rebalance_domains()
This pure code movement results in two #ifdef CONFIG_NO_HZ_COMMON
sections landing next to each other.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-09 07:59:23 +01:00
Peter Zijlstra 63928384fa sched/nohz: Optimize nohz_idle_balance()
Avoid calling update_blocked_averages() when the CPU does not in fact
have any blocked load, by re-using/extending update_nohz_stats().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-09 07:59:22 +01:00
Vincent Guittot 1936c53ce8 sched/fair: Reduce the periodic update duration
Instead of using cfs_rq_is_decayed(), which monitors all the *_avg
and *_sum signals, we create cfs_rq_has_blocked(), which only looks at
util_avg and load_avg. We are only interested in these 2 values, which
decay faster than the *_sum signals, so we can stop the periodic update
earlier.
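
In sketch form, the narrower check is roughly:

  /* only the fast-decaying signals decide whether another periodic
   * update pass is needed */
  static inline bool cfs_rq_has_blocked(struct cfs_rq *cfs_rq)
  {
          return cfs_rq->avg.load_avg || cfs_rq->avg.util_avg;
  }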

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: brendan.jackman@arm.com
Cc: dietmar.eggemann@arm.com
Cc: morten.rasmussen@foss.arm.com
Cc: valentin.schneider@arm.com
Link: http://lkml.kernel.org/r/1518517879-2280-3-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-09 07:59:22 +01:00
Vincent Guittot f643ea2207 sched/nohz: Stop NOHZ stats when decayed
Stop the periodic update of blocked load when all idle CPUs have fully
decayed. We introduce a new nohz.has_blocked flag that reflects whether
some idle CPUs have blocked load that has to be periodically updated.
nohz.has_blocked is set every time an idle CPU can have blocked load; it
is then cleared when no more blocked load has been detected during an
update. We don't need atomic operations, but we do have to take care of
the right ordering when updating nohz.idle_cpus_mask and nohz.has_blocked.
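
The ordering requirement can be illustrated with a standalone C11 sketch
(the kernel uses its own barriers and cpumask helpers):

  #include <stdatomic.h>

  static _Atomic int has_blocked;          /* stands in for nohz.has_blocked */
  static _Atomic unsigned long idle_mask;  /* stands in for nohz.idle_cpus_mask */

  static void enter_idle(int cpu)
  {
          /* set has_blocked before publishing the CPU as idle, so an
           * updater that sees the CPU in the mask also sees the flag */
          atomic_store(&has_blocked, 1);
          atomic_fetch_or(&idle_mask, 1UL << cpu);
  }

  static void update_blocked_loads(void)
  {
          /* clear before scanning, so that a flag set by a newly idle
           * CPU during the scan is not lost */
          atomic_store(&has_blocked, 0);
          /* ... walk idle_mask, decay loads, and set has_blocked again
           * if any CPU still carries blocked load ... */
  }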

Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: brendan.jackman@arm.com
Cc: dietmar.eggemann@arm.com
Cc: morten.rasmussen@foss.arm.com
Cc: valentin.schneider@arm.com
Link: http://lkml.kernel.org/r/1518517879-2280-2-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-09 07:59:21 +01:00
Peter Zijlstra ea14b57e8a sched/cpufreq: Provide migration hint
It was suggested that a migration hint might be useful for the
cpufreq governors.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-09 07:59:20 +01:00
Peter Zijlstra 00357f5ec5 sched/nohz: Clean up nohz enter/exit
The primary observation is that nohz enter/exit always happens on the
current CPU, therefore NOHZ_TICK_STOPPED does not in fact need to be
atomic.

Secondary is that we appear to have two nearly identical hooks in the
nohz enter code, set_cpu_sd_state_idle() and
nohz_balance_enter_idle(). Fold the whole set_cpu_sd_state thing into
nohz_balance_{enter,exit}_idle().

This removes an atomic op from both the enter and exit paths.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-09 07:59:19 +01:00
Peter Zijlstra e022e0d38a sched/fair: Update blocked load from NEWIDLE
Since we already iterate CPUs looking for work on NEWIDLE, use this
iteration to age the blocked load. If the domain for which this is
done completely spans the idle set, we can push the ILB-based aging
forward.

Suggested-by: Brendan Jackman <brendan.jackman@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-09 07:59:19 +01:00
Peter Zijlstra a4064fb614 sched/fair: Add NOHZ stats balancing
Teach the idle balancer about the need to update statistics which have
a different periodicity from regular balancing.

Suggested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-09 07:59:18 +01:00
Peter Zijlstra 4550487a99 sched/fair: Restructure nohz_balance_kick()
The current:

	if (nohz_kick_needed())
		nohz_balancer_kick()

is pointless complexity; fold them into a single call and avoid the
various conditions at the call site.

When we introduce multiple different needs to kick the ilb, the above
construct also becomes a problem.
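
Folded, the call site shrinks to a single unconditional call (a
before/after fragment):

  /* before: */
  if (nohz_kick_needed(rq))
          nohz_balancer_kick();

  /* after: the checks move inside the single entry point */
  nohz_balancer_kick(rq);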

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-09 07:59:17 +01:00
Peter Zijlstra b7031a02ec sched/fair: Add NOHZ_STATS_KICK
Split the NOHZ idle balancer into doing two separate actions:

 - update blocked load statistics

 - actually load-balance

Since the latter requires the former, ensure this happens. For now,
always set both bits at the same time.

Prepares for a future where we can toggle only the STATS bit.
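
The split can be pictured as two flag bits (values assumed):

  #define NOHZ_BALANCE_KICK  0x1   /* run an actual idle load balance */
  #define NOHZ_STATS_KICK    0x2   /* only refresh blocked-load statistics */
  #define NOHZ_KICK_MASK     (NOHZ_BALANCE_KICK | NOHZ_STATS_KICK)

  /* for now both bits are always raised together: */
  unsigned int flags = NOHZ_KICK_MASK;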

Suggested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-09 07:59:16 +01:00
Peter Zijlstra a22e47a4e3 sched/core: Convert nohz_flags to atomic_t
Using atomic_t allows us to use the more flexible bitops provided
there. Also, it's smaller.
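
A standalone illustration of the gained flexibility, with C11 atomics
standing in for the kernel's atomic_t API:

  #include <stdatomic.h>

  static _Atomic unsigned int nohz_flags;  /* was: unsigned long + bitops */

  static void kick_ilb(unsigned int flags)
  {
          atomic_fetch_or(&nohz_flags, flags);   /* set several bits at once */
  }

  static unsigned int fetch_kick_flags(unsigned int mask)
  {
          /* read and clear in a single read-modify-write operation */
          return atomic_fetch_and(&nohz_flags, ~mask) & mask;
  }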

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-09 07:59:16 +01:00
Norbert Manthey 13a453c241 sched/fair: Add ';' after label attributes
Due to using GCC defines for configuration, some labels might be unused in
certain configurations. While adding __maybe_unused to a label is
fine in general, the line has to be terminated with ';'. This is also
reflected in the GCC documentation, but GCC parsed the previous variant
without an error message.
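
The pattern in question, as a before/after fragment:

  /* before: GCC happened to accept the unterminated form */
  done: __maybe_unused
          return ret;

  /* after: the label attribute is properly terminated with ';' */
  done: __maybe_unused;
          return ret;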

This has been spotted while compiling with goto-cc, the compiler for the
CPROVER tool suite.

Signed-off-by: Norbert Manthey <nmanthey@amazon.de>
Signed-off-by: Michael Tautschnig <tautschn@amazon.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1519717660-16157-1-git-send-email-nmanthey@amazon.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-09 07:59:13 +01:00
Ingo Molnar 325ea10c08 sched/headers: Simplify and clean up header usage in the scheduler
Do the following cleanups and simplifications:

 - sched/sched.h already includes <asm/paravirt.h>, so no need to
   include it in sched/core.c again.

 - order the <linux/sched/*.h> headers alphabetically

 - add all <linux/sched/*.h> headers to kernel/sched/sched.h

 - remove all unnecessary includes from the .c files that
   are already included in kernel/sched/sched.h.

Finally, make all scheduler .c files use a single common header:

  #include "sched.h"

... which now contains a union of the relied upon headers.

This makes the various .c files easier to read and easier to handle.

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-04 12:39:29 +01:00
Ingo Molnar 97fb7a0a89 sched: Clean up and harmonize the coding style of the scheduler code base
A good number of small style inconsistencies have accumulated
in the scheduler core, so do a pass over them to harmonize
all these details:

 - fix spelling in comments,

 - use curly braces for multi-line statements,

 - remove unnecessary parentheses from integer literals,

 - capitalize consistently,

 - remove stray newlines,

 - add comments where necessary,

 - remove invalid/unnecessary comments,

 - align structure definitions and other data types vertically,

 - add missing newlines for increased readability,

 - fix vertical tabulation where it's misaligned,

 - harmonize preprocessor conditional block labeling
   and vertical alignment,

 - remove line-breaks where they uglify the code,

 - add newline after local variable definitions,

No change in functionality:

  md5:
     1191fa0a890cfa8132156d2959d7e9e2  built-in.o.before.asm
     1191fa0a890cfa8132156d2959d7e9e2  built-in.o.after.asm

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-03 15:50:21 +01:00
Frederic Weisbecker d84b31313e sched/isolation: Offload residual 1Hz scheduler tick
When a CPU runs in full dynticks mode, a 1Hz tick remains in order to
keep the scheduler stats alive. However, this residual tick is a burden
for bare-metal tasks that can't stand any interruption at all, or want
to minimize them.

The usual boot parameters "nohz_full=" or "isolcpus=nohz" will now
outsource these scheduler ticks to the global workqueue so that a
housekeeping CPU handles them remotely. The sched_class::task_tick()
implementations have been audited and look safe to be called remotely,
as the target runqueue and its current task are passed as parameters
and don't seem to be accessed locally.

Note that in the case of using isolcpus, it's still up to the user to
affine the global workqueues to the housekeeping CPUs through
/sys/devices/virtual/workqueue/cpumask or domain isolation
"isolcpus=nohz,domain".

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Wanpeng Li <kernellwp@gmail.com>
Link: http://lkml.kernel.org/r/1519186649-3242-6-git-send-email-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-02-21 09:49:09 +01:00
Mel Gorman 7347fc87df sched/numa: Delay retrying placement for automatic NUMA balance after wake_affine()
If wake_affine() pulls a task to another node for any reason and the node is
no longer preferred, then temporarily stop automatic NUMA balancing from
pulling the task back. Otherwise, tasks with a strong waker/wakee
relationship may constantly fight automatic NUMA balancing over where a
task should be placed.
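
A sketch of the mechanism (names approximate, interval made up; note that
the later revert of this commit found the actual jiffies handling flawed):

  /* wake_affine() moved the task off its preferred node: back off */
  if (target_nid != p->numa_preferred_nid)
          p->numa_migrate_retry = jiffies + msecs_to_jiffies(500);

  /* ... and NUMA balancing skips placement retries until that expires: */
  if (time_before(jiffies, p->numa_migrate_retry))
          return;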

Once again netperf is interesting here. The performance barely changes,
but the automatic NUMA balancing activity does (sdnuma-v1r23 vs.
delayretry-v1r23):

 Hmean     send-64         354.67 (   0.00%)      352.15 (  -0.71%)
 Hmean     send-128        702.91 (   0.00%)      693.84 (  -1.29%)
 Hmean     send-256       1350.07 (   0.00%)     1344.19 (  -0.44%)
 Hmean     send-1024      5124.38 (   0.00%)     4941.24 (  -3.57%)
 Hmean     send-2048      9687.44 (   0.00%)     9624.45 (  -0.65%)
 Hmean     send-3312     14577.64 (   0.00%)    14514.35 (  -0.43%)
 Hmean     send-4096     16393.62 (   0.00%)    16488.30 (   0.58%)
 Hmean     send-8192     26877.26 (   0.00%)    26431.63 (  -1.66%)
 Hmean     send-16384    38683.43 (   0.00%)    38264.91 (  -1.08%)
 Hmean     recv-64         354.67 (   0.00%)      352.15 (  -0.71%)
 Hmean     recv-128        702.91 (   0.00%)      693.84 (  -1.29%)
 Hmean     recv-256       1350.07 (   0.00%)     1344.19 (  -0.44%)
 Hmean     recv-1024      5124.38 (   0.00%)     4941.24 (  -3.57%)
 Hmean     recv-2048      9687.43 (   0.00%)     9624.45 (  -0.65%)
 Hmean     recv-3312     14577.59 (   0.00%)    14514.35 (  -0.43%)
 Hmean     recv-4096     16393.55 (   0.00%)    16488.20 (   0.58%)
 Hmean     recv-8192     26876.96 (   0.00%)    26431.29 (  -1.66%)
 Hmean     recv-16384    38682.41 (   0.00%)    38263.94 (  -1.08%)

 NUMA alloc hit                 1465986     1423090
 NUMA alloc miss                      0           0
 NUMA interleave hit                  0           0
 NUMA alloc local               1465897     1423003
 NUMA base PTE updates             1473        1420
 NUMA huge PMD updates                0           0
 NUMA page range updates           1473        1420
 NUMA hint faults                  1383        1312
 NUMA hint local faults             451         124
 NUMA hint local percent             32           9

There is a slight degradation in performance but there are slightly fewer
NUMA faults. There is a large drop in the percentage of local faults, but
the bulk of migrations for netperf are in small shared libraries, so it's
reflecting the fact that automatic NUMA balancing has backed off. This is
a case where, despite wake_affine() and automatic NUMA balancing fighting
over placement, there is a marginal benefit to rescheduling close to the
data quickly. However, it should be noted that wake_affine() and automatic
NUMA balancing constantly fighting each other is undesirable.

However, the benefit in other cases is large. This is the result for NAS
with the D class sizing on a 4-socket machine:

 nas-mpi
                           4.15.0                 4.15.0
                     sdnuma-v1r23       delayretry-v1r23
 Time cg.D      557.00 (   0.00%)      431.82 (  22.47%)
 Time ep.D       77.83 (   0.00%)       79.01 (  -1.52%)
 Time is.D       26.46 (   0.00%)       26.64 (  -0.68%)
 Time lu.D      727.14 (   0.00%)      597.94 (  17.77%)
 Time mg.D      191.35 (   0.00%)      146.85 (  23.26%)

                    4.15.0           4.15.0
              sdnuma-v1r23 delayretry-v1r23
 User        75665.20    70413.30
 System      20321.59     8861.67
 Elapsed       766.13      634.92

 Minor Faults                  16528502     7127941
 Major Faults                      4553        5068
 NUMA alloc local               6963197     6749135
 NUMA base PTE updates        366409093   107491434
 NUMA huge PMD updates           687556      198880
 NUMA page range updates      718437765   209317994
 NUMA hint faults              13643410     4601187
 NUMA hint local faults         9212593     3063996
 NUMA hint local percent             67          66

Note the massive reduction in system CPU usage even though the percentage
of local faults is barely affected. There is a massive reduction in the
number of PTE updates showing that automatic NUMA balancing has backed off.
A critical observation is also that there is a massive reduction in minor
faults which is due to far fewer NUMA hinting faults being trapped.

There were questions on NAS OMP and how it behaved relative to threads
being bound to CPUs. First, there are more gains than losses with this
patch applied, and a reduction in system CPU usage:

nas-omp
                      4.16.0-rc1             4.16.0-rc1
                     sdnuma-v2r1        delayretry-v2r1
Time bt.D      436.71 (   0.00%)      430.05 (   1.53%)
Time cg.D      201.02 (   0.00%)      180.87 (  10.02%)
Time ep.D       32.84 (   0.00%)       32.68 (   0.49%)
Time is.D        9.63 (   0.00%)        9.64 (  -0.10%)
Time lu.D      331.20 (   0.00%)      304.80 (   7.97%)
Time mg.D       54.87 (   0.00%)       52.72 (   3.92%)
Time sp.D     1108.78 (   0.00%)      917.10 (  17.29%)
Time ua.D      378.81 (   0.00%)      398.83 (  -5.28%)

           4.16.0-rc1      4.16.0-rc1
          sdnuma-v2r1 delayretry-v2r1
User       305633.08   296751.91
System        451.75      357.80
Elapsed      2595.73     2368.13

However, it does not close the gap between binding and being unbound. There
is a negligible difference between the performance of the baseline and a
patched kernel when threads are bound, so that comparison is not shown;
instead, the following compares bound vs. unbound with the patch applied:

                      4.16.0-rc1             4.16.0-rc1
                 delayretry-bind     delayretry-unbound
Time bt.D      385.02 (   0.00%)      430.05 ( -11.70%)
Time cg.D      144.02 (   0.00%)      180.87 ( -25.59%)
Time ep.D       32.85 (   0.00%)       32.68 (   0.52%)
Time is.D       10.52 (   0.00%)        9.64 (   8.37%)
Time lu.D      285.31 (   0.00%)      304.80 (  -6.83%)
Time mg.D       43.21 (   0.00%)       52.72 ( -22.01%)
Time sp.D      820.24 (   0.00%)      917.10 ( -11.81%)
Time ua.D      337.09 (   0.00%)      398.83 ( -18.32%)

            4.16.0-rc1          4.16.0-rc1
       delayretry-bind  delayretry-unbound
User       277731.25   296751.91
System        261.29      357.80
Elapsed      2100.55     2368.13

Unfortunately, while performance is improved by the patch, there is still
quite a long way to go before it's equivalent to hard binding.

Other workloads like hackbench, tbench, dbench and schbench are barely
affected. dbench shows a mix of gains and losses depending on the machine
although in general, the results are more stable.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Giovanni Gherdovich <ggherdovich@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180213133730.24064-7-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-02-21 08:49:45 +01:00
Mel Gorman 2c83362734 sched/fair: Consider SD_NUMA when selecting the most idle group to schedule on
find_idlest_group() compares a local group with each other group to select
the one that is most idle. When comparing groups in different NUMA domains,
a very slight imbalance is enough to select a remote NUMA node even if the
runnable load on both groups is 0 or close to 0. This ignores the cost of
remote accesses entirely and is a problem when selecting the CPU for a
newly forked task to run on.  A forking server is then almost guaranteed
to run on a remote node, incurring numerous remote accesses and
potentially causing automatic NUMA balancing to try to migrate the task
back or migrate the data to another node. Similar weirdness is observed
if a basic shell command pipes output to another, as each process in the
pipeline is likely to start on a different node and then get adjusted
later by wake_affine().

This patch adds an imbalance threshold to remote domains when considering
whether to select CPUs from them. If the local domain is selected, the
imbalance will still be used to try to select a CPU from a lower scheduler
domain's group instead of stacking tasks on the same CPU.
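
The idea reduces to an extra condition in find_idlest_group() along these
lines (a sketch, not the exact diff):

  /* treat a remote NUMA group as idler only if it beats the local
   * group by more than the allowed imbalance */
  if ((sd->flags & SD_NUMA) &&
      min_runnable_load + imbalance >= this_runnable_load)
          return NULL;    /* stay local */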

A variety of workloads and machines were tested and as expected, there is no
difference on UMA. The difference on NUMA can be dramatic. This is a comparison
of elapsed times running the git regression test suite. It's fork-intensive with
short-lived processes:

                                  4.15.0                 4.15.0
                            noexit-v1r23           sdnuma-v1r23
 Elapsed min          1706.06 (   0.00%)     1435.94 (  15.83%)
 Elapsed mean         1709.53 (   0.00%)     1436.98 (  15.94%)
 Elapsed stddev          2.16 (   0.00%)        1.01 (  53.38%)
 Elapsed coeffvar        0.13 (   0.00%)        0.07 (  44.54%)
 Elapsed max          1711.59 (   0.00%)     1438.01 (  15.98%)

               4.15.0      4.15.0
         noexit-v1r23 sdnuma-v1r23
 User         5434.12     5188.41
 System       4878.77     3467.09
 Elapsed     10259.06     8624.21

That shows a considerable reduction in elapsed times. It's important to
note that automatic NUMA balancing does not affect this load as processes
are too short-lived.

There is also a noticeable impact on hackbench, such as this example using
processes and pipes:

 hackbench-process-pipes
                               4.15.0                 4.15.0
                         noexit-v1r23           sdnuma-v1r23
 Amean     1        1.0973 (   0.00%)      0.9393 (  14.40%)
 Amean     4        1.3427 (   0.00%)      1.3730 (  -2.26%)
 Amean     7        1.4233 (   0.00%)      1.6670 ( -17.12%)
 Amean     12       3.0250 (   0.00%)      3.3013 (  -9.13%)
 Amean     21       9.0860 (   0.00%)      9.5343 (  -4.93%)
 Amean     30      14.6547 (   0.00%)     13.2433 (   9.63%)
 Amean     48      22.5447 (   0.00%)     20.4303 (   9.38%)
 Amean     79      29.2010 (   0.00%)     26.7853 (   8.27%)
 Amean     110     36.7443 (   0.00%)     35.8453 (   2.45%)
 Amean     141     45.8533 (   0.00%)     42.6223 (   7.05%)
 Amean     172     55.1317 (   0.00%)     50.6473 (   8.13%)
 Amean     203     64.4420 (   0.00%)     58.3957 (   9.38%)
 Amean     234     73.2293 (   0.00%)     67.1047 (   8.36%)
 Amean     265     80.5220 (   0.00%)     75.7330 (   5.95%)
 Amean     296     88.7567 (   0.00%)     82.1533 (   7.44%)

It's not a universal win, as there are occasions when spreading wide and
quickly is a benefit, but it's more of a win than a loss. For other
workloads there is little difference, but netperf is interesting. Without
the patch, the server and client start on different nodes but quickly get
migrated due to wake_affine(). Hence, the difference in overall performance
is marginal but detectable:

                                      4.15.0                 4.15.0
                                noexit-v1r23           sdnuma-v1r23
 Hmean     send-64         349.09 (   0.00%)      354.67 (   1.60%)
 Hmean     send-128        699.16 (   0.00%)      702.91 (   0.54%)
 Hmean     send-256       1316.34 (   0.00%)     1350.07 (   2.56%)
 Hmean     send-1024      5063.99 (   0.00%)     5124.38 (   1.19%)
 Hmean     send-2048      9705.19 (   0.00%)     9687.44 (  -0.18%)
 Hmean     send-3312     14359.48 (   0.00%)    14577.64 (   1.52%)
 Hmean     send-4096     16324.20 (   0.00%)    16393.62 (   0.43%)
 Hmean     send-8192     26112.61 (   0.00%)    26877.26 (   2.93%)
 Hmean     send-16384    37208.44 (   0.00%)    38683.43 (   3.96%)
 Hmean     recv-64         349.09 (   0.00%)      354.67 (   1.60%)
 Hmean     recv-128        699.16 (   0.00%)      702.91 (   0.54%)
 Hmean     recv-256       1316.34 (   0.00%)     1350.07 (   2.56%)
 Hmean     recv-1024      5063.99 (   0.00%)     5124.38 (   1.19%)
 Hmean     recv-2048      9705.16 (   0.00%)     9687.43 (  -0.18%)
 Hmean     recv-3312     14359.42 (   0.00%)    14577.59 (   1.52%)
 Hmean     recv-4096     16323.98 (   0.00%)    16393.55 (   0.43%)
 Hmean     recv-8192     26111.85 (   0.00%)    26876.96 (   2.93%)
 Hmean     recv-16384    37206.99 (   0.00%)    38682.41 (   3.97%)

However, what is very interesting is how automatic NUMA balancing behaves
(noexit-v1r23 vs. sdnuma-v1r23). Each netperf instance runs long enough
for balancing to activate:

 NUMA base PTE updates             4620        1473
 NUMA huge PMD updates                0           0
 NUMA page range updates           4620        1473
 NUMA hint faults                  4301        1383
 NUMA hint local faults            1309         451
 NUMA hint local percent             30          32
 NUMA pages migrated               1335         491
 AutoNUMA cost                      21%          6%

There is an unfortunate number of remote faults although tracing indicated
that the vast majority are in shared libraries. However, the tendency to
start tasks on the same node if there is capacity means that there were
far fewer PTE updates and faults incurred overall.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Giovanni Gherdovich <ggherdovich@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180213133730.24064-6-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-02-21 08:49:43 +01:00
Peter Zijlstra 24d0c1d6e6 sched/fair: Do not migrate due to a sync wakeup on exit
When a task exits, it notifies the parent that it has exited. This is a
sync wakeup and the exiting task may pull the parent towards the waker's
CPU. For simple workloads like using a shell, it was observed that the
shell is pulled across nodes by exiting processes. This is daft, as the
parent may be long-lived and properly placed. This patch special-cases a
sync wakeup on exit to avoid pulling tasks across nodes. Testing on a range
of workloads and machines showed very little difference in performance,
although there was a small 3% boost on some machines running a
shellscript-intensive workload (the git regression test suite).
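
The special case boils down to not treating a wakeup as sync when the
waker is exiting, roughly (a sketch using the existing WF_SYNC and
PF_EXITING flags):

  /* an exiting waker must not pull the wakee towards its CPU */
  int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);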

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Giovanni Gherdovich <ggherdovich@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180213133730.24064-5-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-02-21 08:49:42 +01:00
Mel Gorman 082f764a2f sched/fair: Do not migrate on wake_affine_weight() if weights are equal
wake_affine_weight() will consider migrating a task to, or near, the current
CPU if there is a load imbalance. If the CPUs share LLC then either CPU
is valid as a search-for-idle-sibling target and equally appropriate for
stacking two tasks on one CPU if an idle sibling is unavailable. If they do
not share cache then a cross-node migration potentially impacts locality
so while they are equal from a CPU capacity point of view, they are not
equal in terms of memory locality. In either case, it's more appropriate
to migrate only if there is a difference in their effective load.

This patch modifies wake_affine_weight() to consider migrating a task only
if there is a load imbalance for normal wakeups, but to allow potential
stacking if the loads are equal and it's a sync wakeup.
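
The decision rule can be sketched as follows (simplified from
wake_affine_weight(); not the exact code):

  if (sync && this_eff_load <= prev_eff_load)
          return this_cpu;        /* a sync wakeup tolerates equal loads */

  if (this_eff_load < prev_eff_load)
          return this_cpu;        /* otherwise require a real imbalance */

  return nr_cpumask_bits;         /* no affine migration */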

For the most part, the difference in performance is marginal. For example,
on a 4-socket server running netperf UDP_STREAM on localhost the differences
are as follows:

                                      4.15.0                 4.15.0
                                       16rc0          noequal-v1r23
 Hmean     send-64         355.47 (   0.00%)      349.50 (  -1.68%)
 Hmean     send-128        697.98 (   0.00%)      693.35 (  -0.66%)
 Hmean     send-256       1328.02 (   0.00%)     1318.77 (  -0.70%)
 Hmean     send-1024      5051.83 (   0.00%)     5051.11 (  -0.01%)
 Hmean     send-2048      9637.02 (   0.00%)     9601.34 (  -0.37%)
 Hmean     send-3312     14355.37 (   0.00%)    14414.51 (   0.41%)
 Hmean     send-4096     16464.97 (   0.00%)    16301.37 (  -0.99%)
 Hmean     send-8192     26722.42 (   0.00%)    26428.95 (  -1.10%)
 Hmean     send-16384    38137.81 (   0.00%)    38046.11 (  -0.24%)
 Hmean     recv-64         355.47 (   0.00%)      349.50 (  -1.68%)
 Hmean     recv-128        697.98 (   0.00%)      693.35 (  -0.66%)
 Hmean     recv-256       1328.02 (   0.00%)     1318.77 (  -0.70%)
 Hmean     recv-1024      5051.83 (   0.00%)     5051.11 (  -0.01%)
 Hmean     recv-2048      9636.95 (   0.00%)     9601.30 (  -0.37%)
 Hmean     recv-3312     14355.32 (   0.00%)    14414.48 (   0.41%)
 Hmean     recv-4096     16464.74 (   0.00%)    16301.16 (  -0.99%)
 Hmean     recv-8192     26721.63 (   0.00%)    26428.17 (  -1.10%)
 Hmean     recv-16384    38136.00 (   0.00%)    38044.88 (  -0.24%)
 Stddev    send-64           7.30 (   0.00%)        4.75 (  34.96%)
 Stddev    send-128         15.15 (   0.00%)       22.38 ( -47.66%)
 Stddev    send-256         13.99 (   0.00%)       19.14 ( -36.81%)
 Stddev    send-1024       105.73 (   0.00%)       67.38 (  36.27%)
 Stddev    send-2048       294.57 (   0.00%)      223.88 (  24.00%)
 Stddev    send-3312       302.28 (   0.00%)      271.74 (  10.10%)
 Stddev    send-4096       195.92 (   0.00%)      121.10 (  38.19%)
 Stddev    send-8192       399.71 (   0.00%)      563.77 ( -41.04%)
 Stddev    send-16384     1163.47 (   0.00%)     1103.68 (   5.14%)
 Stddev    recv-64           7.30 (   0.00%)        4.75 (  34.96%)
 Stddev    recv-128         15.15 (   0.00%)       22.38 ( -47.66%)
 Stddev    recv-256         13.99 (   0.00%)       19.14 ( -36.81%)
 Stddev    recv-1024       105.73 (   0.00%)       67.38 (  36.27%)
 Stddev    recv-2048       294.59 (   0.00%)      223.89 (  24.00%)
 Stddev    recv-3312       302.24 (   0.00%)      271.75 (  10.09%)
 Stddev    recv-4096       196.03 (   0.00%)      121.14 (  38.20%)
 Stddev    recv-8192       399.86 (   0.00%)      563.65 ( -40.96%)
 Stddev    recv-16384     1163.79 (   0.00%)     1103.86 (   5.15%)

The difference in overall performance is marginal, but note that most
measurements are less variable. There were similar observations for other
netperf comparisons. hackbench with sockets or threads, with processes or
threads, showed minor differences with some reduction in migration. tbench
showed only marginal differences that were within the noise. dbench,
regardless of filesystem, showed minor differences, all of which are
within the noise. Multiple machines, both UMA and NUMA, were tested
without any regressions showing up.

The biggest risk with a patch like this is affecting wakeup latencies.
However, the schbench load from Facebook, which is very sensitive to wakeup
latency, showed a mixed result with mostly improvements in wakeup latency:

                                      4.15.0                 4.15.0
                                       16rc0          noequal-v1r23
 Lat 50.00th-qrtle-1        38.00 (   0.00%)       38.00 (   0.00%)
 Lat 75.00th-qrtle-1        49.00 (   0.00%)       41.00 (  16.33%)
 Lat 90.00th-qrtle-1        52.00 (   0.00%)       50.00 (   3.85%)
 Lat 95.00th-qrtle-1        54.00 (   0.00%)       51.00 (   5.56%)
 Lat 99.00th-qrtle-1        63.00 (   0.00%)       60.00 (   4.76%)
 Lat 99.50th-qrtle-1        66.00 (   0.00%)       61.00 (   7.58%)
 Lat 99.90th-qrtle-1        78.00 (   0.00%)       65.00 (  16.67%)
 Lat 50.00th-qrtle-2        38.00 (   0.00%)       38.00 (   0.00%)
 Lat 75.00th-qrtle-2        42.00 (   0.00%)       43.00 (  -2.38%)
 Lat 90.00th-qrtle-2        46.00 (   0.00%)       48.00 (  -4.35%)
 Lat 95.00th-qrtle-2        49.00 (   0.00%)       50.00 (  -2.04%)
 Lat 99.00th-qrtle-2        55.00 (   0.00%)       57.00 (  -3.64%)
 Lat 99.50th-qrtle-2        58.00 (   0.00%)       60.00 (  -3.45%)
 Lat 99.90th-qrtle-2        65.00 (   0.00%)       68.00 (  -4.62%)
 Lat 50.00th-qrtle-4        41.00 (   0.00%)       41.00 (   0.00%)
 Lat 75.00th-qrtle-4        45.00 (   0.00%)       46.00 (  -2.22%)
 Lat 90.00th-qrtle-4        50.00 (   0.00%)       50.00 (   0.00%)
 Lat 95.00th-qrtle-4        54.00 (   0.00%)       53.00 (   1.85%)
 Lat 99.00th-qrtle-4        61.00 (   0.00%)       61.00 (   0.00%)
 Lat 99.50th-qrtle-4        65.00 (   0.00%)       64.00 (   1.54%)
 Lat 99.90th-qrtle-4        76.00 (   0.00%)       82.00 (  -7.89%)
 Lat 50.00th-qrtle-8        48.00 (   0.00%)       46.00 (   4.17%)
 Lat 75.00th-qrtle-8        55.00 (   0.00%)       54.00 (   1.82%)
 Lat 90.00th-qrtle-8        60.00 (   0.00%)       59.00 (   1.67%)
 Lat 95.00th-qrtle-8        63.00 (   0.00%)       63.00 (   0.00%)
 Lat 99.00th-qrtle-8        71.00 (   0.00%)       69.00 (   2.82%)
 Lat 99.50th-qrtle-8        74.00 (   0.00%)       73.00 (   1.35%)
 Lat 99.90th-qrtle-8        98.00 (   0.00%)       90.00 (   8.16%)
 Lat 50.00th-qrtle-16       56.00 (   0.00%)       55.00 (   1.79%)
 Lat 75.00th-qrtle-16       68.00 (   0.00%)       67.00 (   1.47%)
 Lat 90.00th-qrtle-16       77.00 (   0.00%)       78.00 (  -1.30%)
 Lat 95.00th-qrtle-16       82.00 (   0.00%)       84.00 (  -2.44%)
 Lat 99.00th-qrtle-16       90.00 (   0.00%)       93.00 (  -3.33%)
 Lat 99.50th-qrtle-16       93.00 (   0.00%)       97.00 (  -4.30%)
 Lat 99.90th-qrtle-16      110.00 (   0.00%)      110.00 (   0.00%)
 Lat 50.00th-qrtle-32       68.00 (   0.00%)       62.00 (   8.82%)
 Lat 75.00th-qrtle-32       90.00 (   0.00%)       83.00 (   7.78%)
 Lat 90.00th-qrtle-32      110.00 (   0.00%)      100.00 (   9.09%)
 Lat 95.00th-qrtle-32      122.00 (   0.00%)      111.00 (   9.02%)
 Lat 99.00th-qrtle-32      145.00 (   0.00%)      133.00 (   8.28%)
 Lat 99.50th-qrtle-32      154.00 (   0.00%)      143.00 (   7.14%)
 Lat 99.90th-qrtle-32     2316.00 (   0.00%)      515.00 (  77.76%)
 Lat 50.00th-qrtle-35       69.00 (   0.00%)       72.00 (  -4.35%)
 Lat 75.00th-qrtle-35       92.00 (   0.00%)       95.00 (  -3.26%)
 Lat 90.00th-qrtle-35      111.00 (   0.00%)      114.00 (  -2.70%)
 Lat 95.00th-qrtle-35      122.00 (   0.00%)      124.00 (  -1.64%)
 Lat 99.00th-qrtle-35      142.00 (   0.00%)      144.00 (  -1.41%)
 Lat 99.50th-qrtle-35      150.00 (   0.00%)      154.00 (  -2.67%)
 Lat 99.90th-qrtle-35     6104.00 (   0.00%)     5640.00 (   7.60%)

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Giovanni Gherdovich <ggherdovich@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180213133730.24064-4-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-02-21 08:49:08 +01:00
Mel Gorman eeb6039863 sched/fair: Defer calculation of 'prev_eff_load' in wake_affine_weight() until needed
On sync wakeups, the previous CPU's effective load may not be used, so delay
the calculation until it's needed.
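
For illustration only, a standalone sketch of the idea; the load values and
helper names below are stand-ins, not the kernel's:

  #include <stdbool.h>
  #include <stdio.h>

  static long this_load(void) { return 10; }   /* placeholder waker-CPU load */
  static long prev_load(void) { return 50; }   /* placeholder prev-CPU load */

  static int wake_affine_weight_sketch(int this_cpu, int prev_cpu, bool sync)
  {
      long this_eff_load = this_load();

      if (sync) {
          /* Discount the waker; if nothing is left, the wakee can run
           * here and prev_eff_load is never computed at all. */
          this_eff_load -= 10;
          if (this_eff_load <= 0)
              return this_cpu;
      }

      /* Only now pay for calculating the previous CPU's effective load. */
      long prev_eff_load = prev_load();

      return this_eff_load < prev_eff_load ? this_cpu : prev_cpu;
  }

  int main(void)
  {
      printf("target CPU: %d\n", wake_affine_weight_sketch(0, 4, true));
      return 0;
  }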

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Giovanni Gherdovich <ggherdovich@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180213133730.24064-3-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-02-21 08:49:07 +01:00
Mel Gorman 7ebb66a12f sched/fair: Avoid an unnecessary lookup of current CPU ID during wake_affine
The only caller of wake_affine() knows the CPU ID. Pass it in instead of
rechecking it.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Giovanni Gherdovich <ggherdovich@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180213133730.24064-2-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-02-21 08:49:07 +01:00
Vincent Guittot 387f77cc82 sched/fair: Remove stray space in #ifdef
Remove a useless space in # ifdef and align it with others.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1518512382-29426-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-02-13 10:32:36 +01:00
Mel Gorman 32e839dda3 sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
The select_idle_sibling() (SIS) rewrite in commit:

  10e2f1acd0 ("sched/core: Rewrite and improve select_idle_siblings()")

... replaced a domain iteration with a search that broadly speaking
does a wrapped walk of the scheduler domain sharing a last-level-cache.

While this had a number of improvements, one consequence is that two tasks
that share a waker/wakee relationship push each other around a socket. Even
though only two tasks may be active, all cores end up evenly used. This is great from
a search perspective and spreads a load across individual cores, but it has
adverse consequences for cpufreq. As each CPU has relatively low utilisation,
cpufreq may decide the utilisation is too low to use a higher P-state and
overall computation throughput suffers.

While individual cpufreq and cpuidle drivers may compensate by artificially
boosting P-state (at c0) or avoiding lower C-states (during idle), it does
not help if hardware-based cpufreq (e.g. HWP) is used.

This patch tracks a recently used CPU based on what CPU a task was running
on when it last was a waker, i.e. a CPU it was recently using when the task
is a wakee. During SIS, the recently used CPU is used as a target if it's
still allowed by the task and is idle.
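
As a rough standalone sketch of the mechanism (the helpers below are
simplified stand-ins for the scheduler's real idle/topology checks, not the
actual patch):

  #include <stdbool.h>
  #include <stdio.h>

  struct task {
      int recent_used_cpu;    /* updated when the task acts as a waker */
  };

  static bool cpu_is_idle(int cpu)                 { return cpu == 0; }
  static bool cpus_share_cache(int a, int b)       { return true; }
  static bool cpu_allowed(struct task *p, int cpu) { return true; }

  static int select_idle_sibling_sketch(struct task *p, int prev, int target)
  {
      int recent = p->recent_used_cpu;

      if (recent != prev && recent != target &&
          cpus_share_cache(recent, target) &&
          cpu_is_idle(recent) && cpu_allowed(p, recent)) {
          /* prev becomes the candidate for the next wakeup */
          p->recent_used_cpu = prev;
          return recent;
      }

      return target;  /* otherwise fall back to the normal LLC search */
  }

  int main(void)
  {
      struct task a = { .recent_used_cpu = 0 };
      printf("wake on CPU %d\n", select_idle_sibling_sketch(&a, 2, 3));
      return 0;
  }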

The benefit may be non-obvious so consider an example of two tasks
communicating back and forth. Task A may be an application doing IO where
task B is a kworker or kthread like journald. Task A may issue IO, wake
B and B wakes up A on completion.  With the existing scheme this may look
like the following (potentially different IDs if SMT is in use but a similar
principle applies).

 A (cpu 0)	wake	B (wakes on cpu 1)
 B (cpu 1)	wake	A (wakes on cpu 2)
 A (cpu 2)	wake	B (wakes on cpu 3)
 etc.

A careful reader may wonder why CPU 0 was not idle when B wakes A the
first time. It's simply because A can be rescheduled to another CPU, so the
pattern becomes prev == target when B tries to wake up A and the information
about CPU 0 has been lost.

With this patch, the pattern is more likely to be:

 A (cpu 0)	wake	B (wakes on cpu 1)
 B (cpu 1)	wake	A (wakes on cpu 0)
 A (cpu 0)	wake	B (wakes on cpu 1)
 etc

i.e. two communicating tasks are more likely to use just two cores instead
of all available cores sharing a LLC.

The most dramatic speedup was noticed on dbench using the XFS filesystem on
UMA as clients interact heavily with workqueues in that configuration. Note
that a similar speedup is not observed on ext4 as the wakeup pattern
is different:

                          4.15.0-rc9             4.15.0-rc9
                           waprev-v1        biasancestor-v1
 Hmean      1      287.54 (   0.00%)      817.01 ( 184.14%)
 Hmean      2     1268.12 (   0.00%)     1781.24 (  40.46%)
 Hmean      4     1739.68 (   0.00%)     1594.47 (  -8.35%)
 Hmean      8     2464.12 (   0.00%)     2479.56 (   0.63%)
 Hmean     64     1455.57 (   0.00%)     1434.68 (  -1.44%)

The results can be less dramatic on NUMA where automatic balancing interferes
with the test. It's also known that network benchmarks running on localhost
also benefit quite a bit from this patch (roughly 10% on netperf RR for UDP
and TCP depending on the machine). Hackbench also sees small improvements
(6-11% depending on machine and thread count). The Facebook schbench was also
tested but in most cases showed little or no difference in wakeup latencies.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180130104555.4125-5-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-02-06 10:20:37 +01:00
Mel Gorman 806486c377 sched/fair: Do not migrate if the prev_cpu is idle
wake_affine_idle() prefers to move a task to the current CPU if the
wakeup is due to an interrupt. The expectation is that the interrupt
data is cache hot and relevant to the waking task as well as avoiding
a search. However, there is no way to determine if there was cache hot
data on the previous CPU that may exceed the interrupt data. Furthermore,
round-robin delivery of interrupts can migrate tasks around a socket where
each CPU is under-utilised.  This can interact badly with cpufreq which
makes decisions based on per-cpu data. It has been observed on machines
with HWP that P-states are not boosted to their maximum levels even though
the workload is latency and throughput sensitive.

This patch uses the previous CPU for the task if it's idle and cache-affine
with the current CPU even if the current CPU is idle due to the wakeup
being related to the interrupt. This reduces migrations at the cost of
the interrupt data not being cache hot when the task wakes.
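
A minimal sketch of the decision, with stubbed idle and topology helpers
standing in for the real ones:

  #include <stdbool.h>
  #include <stdio.h>

  static bool idle_cpu(int cpu)              { return true; }
  static bool cpus_share_cache(int a, int b) { return a / 4 == b / 4; }

  /* Returns the CPU the task should wake on for the interrupt-wakeup case. */
  static int wake_affine_idle_sketch(int this_cpu, int prev_cpu)
  {
      if (idle_cpu(this_cpu) && cpus_share_cache(this_cpu, prev_cpu)) {
          /* Prefer prev_cpu when it is also idle: no migration and any
           * cache-hot data on prev_cpu is preserved. */
          return idle_cpu(prev_cpu) ? prev_cpu : this_cpu;
      }
      return prev_cpu;    /* simplified: other cases unchanged */
  }

  int main(void)
  {
      printf("wake on CPU %d\n", wake_affine_idle_sketch(1, 2));
      return 0;
  }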

A variety of workloads were tested on various machines and no adverse
impact was noticed that was outside noise. dbench on ext4 on UMA showed
roughly 10% reduction in the number of CPU migrations and it is a case
where interrupts are frequent for IO completions. In most cases, the
difference in performance is quite small but variability is often
reduced. For example, this is the result for pgbench running on a UMA
machine with different numbers of clients.

                          4.15.0-rc9             4.15.0-rc9
                            baseline              waprev-v1
 Hmean     1     22096.28 (   0.00%)    22734.86 (   2.89%)
 Hmean     4     74633.42 (   0.00%)    75496.77 (   1.16%)
 Hmean     7    115017.50 (   0.00%)   113030.81 (  -1.73%)
 Hmean     12   126209.63 (   0.00%)   126613.40 (   0.32%)
 Hmean     16   131886.91 (   0.00%)   130844.35 (  -0.79%)
 Stddev    1       636.38 (   0.00%)      417.11 (  34.46%)
 Stddev    4       614.64 (   0.00%)      583.24 (   5.11%)
 Stddev    7       542.46 (   0.00%)      435.45 (  19.73%)
 Stddev    12      173.93 (   0.00%)      171.50 (   1.40%)
 Stddev    16      671.42 (   0.00%)      680.30 (  -1.32%)
 CoeffVar  1         2.88 (   0.00%)        1.83 (  36.26%)

Note that the difference in performance is marginal but for low utilisation,
there is less variability.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180130104555.4125-4-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-02-06 10:20:36 +01:00
Mel Gorman 3b76c4a339 sched/fair: Restructure wake_affine*() to return a CPU id
This is a preparation patch that has wake_affine*() return a CPU ID instead of
a boolean. The intent is to allow the wake_affine() helpers to be avoided
if a decision is already made. This patch has no functional change.
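
A sketch of the interface shape after the change; the sentinel value and
helper names are illustrative, not the kernel's:

  #include <stdio.h>

  #define NO_DECISION 8   /* illustrative sentinel: "helper has no opinion" */

  static int wake_affine_idle_sketch(int this_cpu, int prev_cpu)
  {
      (void)this_cpu;
      (void)prev_cpu;
      /* ... idle checks elided; return a CPU id, or the sentinel ... */
      return NO_DECISION;
  }

  static int wake_affine_sketch(int this_cpu, int prev_cpu)
  {
      int target = wake_affine_idle_sketch(this_cpu, prev_cpu);

      if (target != NO_DECISION)
          return target;  /* fast path already picked a CPU */

      /* Fall back to the weight comparison, also returning a CPU id. */
      return prev_cpu;
  }

  int main(void)
  {
      printf("chosen CPU: %d\n", wake_affine_sketch(0, 3));
      return 0;
  }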

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180130104555.4125-3-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-02-06 10:20:35 +01:00
Mel Gorman 89a55f56fd sched/fair: Remove unnecessary parameters from wake_affine_idle()
wake_affine_idle() takes parameters it never uses so clean it up.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180130104555.4125-2-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-02-06 10:20:35 +01:00
Peter Zijlstra 2ed41a5502 sched/core: Optimize update_stats_*()
These functions are already gated by schedstats_enabled(); there is no
point in then issuing another static_branch check for every individual
update in them.
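
A standalone model of the point being made; the stats structure and helpers
below are made up for the example:

  #include <stdbool.h>
  #include <stdio.h>

  static bool stats_enabled(void) { return true; }    /* the single gate */

  struct stats { unsigned long wait_start, wait_count; };

  /* Raw helpers: no per-field "enabled?" branch of their own. */
  static void stat_set(unsigned long *f, unsigned long v) { *f = v; }
  static void stat_inc(unsigned long *f)                  { (*f)++; }

  static void update_stats_wait_start_sketch(struct stats *s, unsigned long now)
  {
      if (!stats_enabled())   /* checked once for the whole update */
          return;

      stat_set(&s->wait_start, now);
      stat_inc(&s->wait_count);
  }

  int main(void)
  {
      struct stats s = { 0, 0 };
      update_stats_wait_start_sketch(&s, 1000);
      printf("wait_start=%lu count=%lu\n", s.wait_start, s.wait_count);
      return 0;
  }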

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-02-06 10:20:32 +01:00
Linus Torvalds af8c5e2d60 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
 "The main changes in this cycle were:

   - Implement frequency/CPU invariance and OPP selection for
     SCHED_DEADLINE (Juri Lelli)

   - Tweak the task migration logic for better multi-tasking
     workload scalability (Mel Gorman)

   - Misc cleanups, fixes and improvements"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/deadline: Make bandwidth enforcement scale-invariant
  sched/cpufreq: Move arch_scale_{freq,cpu}_capacity() outside of #ifdef CONFIG_SMP
  sched/cpufreq: Remove arch_scale_freq_capacity()'s 'sd' parameter
  sched/cpufreq: Always consider all CPUs when deciding next freq
  sched/cpufreq: Split utilization signals
  sched/cpufreq: Change the worker kthread to SCHED_DEADLINE
  sched/deadline: Move CPU frequency selection triggering points
  sched/cpufreq: Use the DEADLINE utilization signal
  sched/deadline: Implement "runtime overrun signal" support
  sched/fair: Only immediately migrate tasks due to interrupts if prev and target CPUs share cache
  sched/fair: Correct obsolete comment about cpufreq_update_util()
  sched/fair: Remove impossible condition from find_idlest_group_cpu()
  sched/cpufreq: Don't pass flags to sugov_set_iowait_boost()
  sched/cpufreq: Initialize sg_cpu->flags to 0
  sched/fair: Consider RT/IRQ pressure in capacity_spare_wake()
  sched/fair: Use 'unsigned long' for utilization, consistently
  sched/core: Rework and clarify prepare_lock_switch()
  sched/fair: Remove unused 'curr' parameter from wakeup_gran
  sched/headers: Constify object_is_on_stack()
2018-01-30 11:55:56 -08:00
Peter Zijlstra ce48c14649 sched/core: Fix cpu.max vs. cpuhotplug deadlock
Tejun reported the following cpu-hotplug lock (percpu-rwsem) read recursion:

  tg_set_cfs_bandwidth()
    get_online_cpus()
      cpus_read_lock()

    cfs_bandwidth_usage_inc()
      static_key_slow_inc()
        cpus_read_lock()

Reported-by: Tejun Heo <tj@kernel.org>
Tested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180122215328.GP3397@worktop
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-24 10:03:44 +01:00
Juri Lelli 07881166a8 sched/deadline: Make bandwidth enforcement scale-invariant
Apply frequency and CPU scale-invariance correction factor to bandwidth
enforcement (similar to what we already do to fair utilization tracking).

Each delta_exec gets scaled considering the current frequency and maximum
CPU capacity, which means that the reservation runtime parameter (which
needs to be specified by profiling the task's execution at max frequency on
the biggest-capacity core) is thus scaled accordingly.
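
As a back-of-the-envelope illustration of the scaling (the scale values are
made up, and cap_scale() here is a local helper, not the kernel's):

  #include <stdio.h>

  #define CAPACITY_SHIFT  10
  #define CAPACITY_SCALE  (1UL << CAPACITY_SHIFT)    /* 1024 */

  static unsigned long cap_scale(unsigned long v, unsigned long scale)
  {
      return (v * scale) >> CAPACITY_SHIFT;
  }

  int main(void)
  {
      unsigned long delta_exec = 1000000;     /* ns actually executed */
      unsigned long scale_freq = 512;         /* half of max frequency */
      unsigned long scale_cpu  = 768;         /* mid-capacity CPU */

      /* Charge less runtime than wall-clock when running slower than the
       * profiling point (max frequency on the biggest-capacity core). */
      unsigned long scaled =
          cap_scale(cap_scale(delta_exec, scale_freq), scale_cpu);

      printf("charged %lu ns instead of %lu ns\n", scaled, delta_exec);
      return 0;
  }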

Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Claudio Scordino <claudio@evidence.eu.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luca Abeni <luca.abeni@santannapisa.it>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: alessio.balsini@arm.com
Cc: bristot@redhat.com
Cc: dietmar.eggemann@arm.com
Cc: joelaf@google.com
Cc: juri.lelli@redhat.com
Cc: mathieu.poirier@linaro.org
Cc: morten.rasmussen@arm.com
Cc: patrick.bellasi@arm.com
Cc: rjw@rjwysocki.net
Cc: rostedt@goodmis.org
Cc: tkjos@android.com
Cc: tommaso.cucinotta@santannapisa.it
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/20171204102325.5110-9-juri.lelli@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-10 12:53:35 +01:00
Juri Lelli 7673c8a4c7 sched/cpufreq: Remove arch_scale_freq_capacity()'s 'sd' parameter
The 'sd' parameter is never used in arch_scale_freq_capacity() (and it's hard
to see how information coming from scheduling domains might help with
frequency invariance scaling).

Remove it; also in anticipation of moving arch_scale_freq_capacity()
outside CONFIG_SMP.

Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: alessio.balsini@arm.com
Cc: bristot@redhat.com
Cc: claudio@evidence.eu.com
Cc: dietmar.eggemann@arm.com
Cc: joelaf@google.com
Cc: juri.lelli@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: mathieu.poirier@linaro.org
Cc: morten.rasmussen@arm.com
Cc: patrick.bellasi@arm.com
Cc: rjw@rjwysocki.net
Cc: rostedt@goodmis.org
Cc: tkjos@android.com
Cc: tommaso.cucinotta@santannapisa.it
Cc: vincent.guittot@linaro.org
Cc: viresh.kumar@linaro.org
Link: http://lkml.kernel.org/r/20171204102325.5110-7-juri.lelli@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-10 12:53:34 +01:00
Mel Gorman 7332dec055 sched/fair: Only immediately migrate tasks due to interrupts if prev and target CPUs share cache
If waking from an idle CPU due to an interrupt then it's possible that
the waker task will be pulled to wake on the current CPU. Unfortunately,
depending on the type of interrupt and IRQ configuration, there may not
be a strong relationship between the CPU an interrupt was delivered on
and the CPU a task was running on. For example, the interrupts could all
be delivered to CPUs on one particular node due to the machine topology
or IRQ affinity configuration. Another example is an interrupt for an IO
completion which can be delivered to any CPU where there is no guarantee
the data is either cache hot or even local.

This patch was motivated by the observation that an IO workload was
being pulled cross-node on a frequent basis when IO completed.  From a
wakeup latency perspective, it's still useful to know that an idle CPU is
immediately available for use, but let's only consider an automatic migration
if the CPUs share cache to limit damage due to NUMA migrations. Migrations
may still occur if wake_affine_weight determines it's appropriate.
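
A minimal sketch of the gate being added (the helpers are stand-ins; the
real decision also involves wake_affine_weight as noted above):

  #include <stdbool.h>
  #include <stdio.h>

  static bool idle_cpu(int cpu)              { return true; }
  static bool cpus_share_cache(int a, int b) { return a / 8 == b / 8; }

  static int interrupt_wakeup_target_sketch(int this_cpu, int prev_cpu)
  {
      /* Only pull the task to the interrupted CPU if that does not mean
       * crossing an LLC (and potentially a NUMA node). */
      if (idle_cpu(this_cpu) && cpus_share_cache(this_cpu, prev_cpu))
          return this_cpu;

      return prev_cpu;
  }

  int main(void)
  {
      /* CPUs 1 and 9 model different LLCs here: stay on prev_cpu. */
      printf("wake on CPU %d\n", interrupt_wakeup_target_sketch(1, 9));
      return 0;
  }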

These are the throughput results for dbench running on ext4 comparing
4.15-rc3 and this patch on a 2-socket machine where interrupts due to IO
completions can happen on any CPU.

                          4.15.0-rc3             4.15.0-rc3
                             vanilla            lessmigrate
Hmean     1        854.64 (   0.00%)      865.01 (   1.21%)
Hmean     2       1229.60 (   0.00%)     1274.44 (   3.65%)
Hmean     4       1591.81 (   0.00%)     1628.08 (   2.28%)
Hmean     8       1845.04 (   0.00%)     1831.80 (  -0.72%)
Hmean     16      2038.61 (   0.00%)     2091.44 (   2.59%)
Hmean     32      2327.19 (   0.00%)     2430.29 (   4.43%)
Hmean     64      2570.61 (   0.00%)     2568.54 (  -0.08%)
Hmean     128     2481.89 (   0.00%)     2499.28 (   0.70%)
Stddev    1         14.31 (   0.00%)        5.35 (  62.65%)
Stddev    2         21.29 (   0.00%)       11.09 (  47.92%)
Stddev    4          7.22 (   0.00%)        6.80 (   5.92%)
Stddev    8         26.70 (   0.00%)        9.41 (  64.76%)
Stddev    16        22.40 (   0.00%)       20.01 (  10.70%)
Stddev    32        45.13 (   0.00%)       44.74 (   0.85%)
Stddev    64        93.10 (   0.00%)       93.18 (  -0.09%)
Stddev    128      184.28 (   0.00%)      177.85 (   3.49%)

Note the small increase in throughput for low thread counts but also
note that the standard deviation for each sample during the test run is
lower. The throughput figures for dbench can be misleading so the benchmark
is actually modified to time the latency of the processing of one load
file with many samples taken. The difference in latency is

                           4.15.0-rc3             4.15.0-rc3
                              vanilla            lessmigrate
Amean      1         21.71 (   0.00%)       21.47 (   1.08%)
Amean      2         30.89 (   0.00%)       29.58 (   4.26%)
Amean      4         47.54 (   0.00%)       46.61 (   1.97%)
Amean      8         82.71 (   0.00%)       82.81 (  -0.12%)
Amean      16       149.45 (   0.00%)      145.01 (   2.97%)
Amean      32       265.49 (   0.00%)      248.43 (   6.42%)
Amean      64       463.23 (   0.00%)      463.55 (  -0.07%)
Amean      128      933.97 (   0.00%)      935.50 (  -0.16%)
Stddev     1          1.58 (   0.00%)        1.54 (   2.26%)
Stddev     2          2.84 (   0.00%)        2.95 (  -4.15%)
Stddev     4          6.78 (   0.00%)        6.85 (  -0.99%)
Stddev     8         16.85 (   0.00%)       16.37 (   2.85%)
Stddev     16        41.59 (   0.00%)       41.04 (   1.32%)
Stddev     32       111.05 (   0.00%)      105.11 (   5.35%)
Stddev     64       285.94 (   0.00%)      288.01 (  -0.72%)
Stddev     128      803.39 (   0.00%)      809.73 (  -0.79%)

It's a small improvement which is not surprising given that migrations to
a different node are not that common. However, it is noticeable
in the CPU migration statistics which are reduced by 24%.

There was a query for v1 of this patch about NAS so here are the results
for C-class using MPI for parallelisation on the same machine

nas-mpi
                      4.15.0-rc3             4.15.0-rc3
                         vanilla                  noirq
Time cg.C       24.25 (   0.00%)       23.17 (   4.45%)
Time ep.C        8.22 (   0.00%)        8.29 (  -0.85%)
Time ft.C       22.67 (   0.00%)       20.34 (  10.28%)
Time is.C        1.42 (   0.00%)        1.47 (  -3.52%)
Time lu.C       55.62 (   0.00%)       54.81 (   1.46%)
Time mg.C        7.93 (   0.00%)        7.91 (   0.25%)

          4.15.0-rc3  4.15.0-rc3
             vanilla  noirq-v1r1
User         3799.96     3748.34
System        672.10      626.15
Elapsed        91.91       79.49

lu.C sees a small gain, ft.C a large gain and ep.C and is.C see small
regressions but in terms of absolute time, the difference is small and
likely within run-to-run variance. System CPU usage is slightly reduced.

schbench from Facebook was also requested. This is a bit of a mixed bag but
it's important to note that this workload should not be heavily impacted
by wakeups from interrupt context.

                                 4.15.0-rc3             4.15.0-rc3
                                    vanilla             noirq-v1r1
Lat 50.00th-qrtle-1        41.00 (   0.00%)       41.00 (   0.00%)
Lat 75.00th-qrtle-1        42.00 (   0.00%)       42.00 (   0.00%)
Lat 90.00th-qrtle-1        43.00 (   0.00%)       44.00 (  -2.33%)
Lat 95.00th-qrtle-1        44.00 (   0.00%)       46.00 (  -4.55%)
Lat 99.00th-qrtle-1        57.00 (   0.00%)       58.00 (  -1.75%)
Lat 99.50th-qrtle-1        59.00 (   0.00%)       59.00 (   0.00%)
Lat 99.90th-qrtle-1        67.00 (   0.00%)       78.00 ( -16.42%)
Lat 50.00th-qrtle-2        40.00 (   0.00%)       51.00 ( -27.50%)
Lat 75.00th-qrtle-2        45.00 (   0.00%)       56.00 ( -24.44%)
Lat 90.00th-qrtle-2        53.00 (   0.00%)       59.00 ( -11.32%)
Lat 95.00th-qrtle-2        57.00 (   0.00%)       61.00 (  -7.02%)
Lat 99.00th-qrtle-2        67.00 (   0.00%)       71.00 (  -5.97%)
Lat 99.50th-qrtle-2        69.00 (   0.00%)       74.00 (  -7.25%)
Lat 99.90th-qrtle-2        83.00 (   0.00%)       77.00 (   7.23%)
Lat 50.00th-qrtle-4        51.00 (   0.00%)       51.00 (   0.00%)
Lat 75.00th-qrtle-4        57.00 (   0.00%)       56.00 (   1.75%)
Lat 90.00th-qrtle-4        60.00 (   0.00%)       59.00 (   1.67%)
Lat 95.00th-qrtle-4        62.00 (   0.00%)       62.00 (   0.00%)
Lat 99.00th-qrtle-4        73.00 (   0.00%)       72.00 (   1.37%)
Lat 99.50th-qrtle-4        76.00 (   0.00%)       74.00 (   2.63%)
Lat 99.90th-qrtle-4        85.00 (   0.00%)       78.00 (   8.24%)
Lat 50.00th-qrtle-8        54.00 (   0.00%)       58.00 (  -7.41%)
Lat 75.00th-qrtle-8        59.00 (   0.00%)       62.00 (  -5.08%)
Lat 90.00th-qrtle-8        65.00 (   0.00%)       66.00 (  -1.54%)
Lat 95.00th-qrtle-8        67.00 (   0.00%)       70.00 (  -4.48%)
Lat 99.00th-qrtle-8        78.00 (   0.00%)       79.00 (  -1.28%)
Lat 99.50th-qrtle-8        81.00 (   0.00%)       80.00 (   1.23%)
Lat 99.90th-qrtle-8       116.00 (   0.00%)       83.00 (  28.45%)
Lat 50.00th-qrtle-16       65.00 (   0.00%)       64.00 (   1.54%)
Lat 75.00th-qrtle-16       77.00 (   0.00%)       71.00 (   7.79%)
Lat 90.00th-qrtle-16       83.00 (   0.00%)       82.00 (   1.20%)
Lat 95.00th-qrtle-16       87.00 (   0.00%)       87.00 (   0.00%)
Lat 99.00th-qrtle-16       95.00 (   0.00%)       96.00 (  -1.05%)
Lat 99.50th-qrtle-16       99.00 (   0.00%)      103.00 (  -4.04%)
Lat 99.90th-qrtle-16      104.00 (   0.00%)      122.00 ( -17.31%)
Lat 50.00th-qrtle-32       71.00 (   0.00%)       73.00 (  -2.82%)
Lat 75.00th-qrtle-32       91.00 (   0.00%)       92.00 (  -1.10%)
Lat 90.00th-qrtle-32      108.00 (   0.00%)      107.00 (   0.93%)
Lat 95.00th-qrtle-32      118.00 (   0.00%)      115.00 (   2.54%)
Lat 99.00th-qrtle-32      134.00 (   0.00%)      129.00 (   3.73%)
Lat 99.50th-qrtle-32      138.00 (   0.00%)      133.00 (   3.62%)
Lat 99.90th-qrtle-32      149.00 (   0.00%)      146.00 (   2.01%)
Lat 50.00th-qrtle-39       83.00 (   0.00%)       81.00 (   2.41%)
Lat 75.00th-qrtle-39      105.00 (   0.00%)      102.00 (   2.86%)
Lat 90.00th-qrtle-39      120.00 (   0.00%)      119.00 (   0.83%)
Lat 95.00th-qrtle-39      129.00 (   0.00%)      128.00 (   0.78%)
Lat 99.00th-qrtle-39      153.00 (   0.00%)      149.00 (   2.61%)
Lat 99.50th-qrtle-39      166.00 (   0.00%)      156.00 (   6.02%)
Lat 99.90th-qrtle-39    12304.00 (   0.00%)    12848.00 (  -4.42%)

When heavily loaded (e.g. 99.50th-qrtle-39 indicates 39 threads), there
are small gains in many cases. Otherwise it depends on the quartile used
where it can be bad -- e.g. 75.00th-qrtle-2. However, even these results
are probably a coincidence. For this workload, much depends on what node
the threads get placed on and their relative locality, not on wakeups from
interrupt context. A larger component of how it behaves would be automatic
NUMA balancing, where a fault incurred to measure locality would be a much
larger contributor to latency than the wakeup path.

This is the results from an almost identical machine that happened to run
the same test.  They only differ in terms of storage which is irrelevant
for this test.

                                 4.15.0-rc3             4.15.0-rc3
                                    vanilla             noirq-v1r1
Lat 50.00th-qrtle-1        41.00 (   0.00%)       41.00 (   0.00%)
Lat 75.00th-qrtle-1        42.00 (   0.00%)       42.00 (   0.00%)
Lat 90.00th-qrtle-1        44.00 (   0.00%)       43.00 (   2.27%)
Lat 95.00th-qrtle-1        53.00 (   0.00%)       45.00 (  15.09%)
Lat 99.00th-qrtle-1        59.00 (   0.00%)       58.00 (   1.69%)
Lat 99.50th-qrtle-1        60.00 (   0.00%)       59.00 (   1.67%)
Lat 99.90th-qrtle-1        86.00 (   0.00%)       61.00 (  29.07%)
Lat 50.00th-qrtle-2        52.00 (   0.00%)       41.00 (  21.15%)
Lat 75.00th-qrtle-2        57.00 (   0.00%)       46.00 (  19.30%)
Lat 90.00th-qrtle-2        60.00 (   0.00%)       53.00 (  11.67%)
Lat 95.00th-qrtle-2        62.00 (   0.00%)       57.00 (   8.06%)
Lat 99.00th-qrtle-2        73.00 (   0.00%)       68.00 (   6.85%)
Lat 99.50th-qrtle-2        74.00 (   0.00%)       71.00 (   4.05%)
Lat 99.90th-qrtle-2        90.00 (   0.00%)       75.00 (  16.67%)
Lat 50.00th-qrtle-4        57.00 (   0.00%)       52.00 (   8.77%)
Lat 75.00th-qrtle-4        60.00 (   0.00%)       58.00 (   3.33%)
Lat 90.00th-qrtle-4        62.00 (   0.00%)       62.00 (   0.00%)
Lat 95.00th-qrtle-4        65.00 (   0.00%)       65.00 (   0.00%)
Lat 99.00th-qrtle-4        76.00 (   0.00%)       75.00 (   1.32%)
Lat 99.50th-qrtle-4        77.00 (   0.00%)       77.00 (   0.00%)
Lat 99.90th-qrtle-4        87.00 (   0.00%)       81.00 (   6.90%)
Lat 50.00th-qrtle-8        59.00 (   0.00%)       57.00 (   3.39%)
Lat 75.00th-qrtle-8        63.00 (   0.00%)       62.00 (   1.59%)
Lat 90.00th-qrtle-8        66.00 (   0.00%)       67.00 (  -1.52%)
Lat 95.00th-qrtle-8        68.00 (   0.00%)       70.00 (  -2.94%)
Lat 99.00th-qrtle-8        79.00 (   0.00%)       80.00 (  -1.27%)
Lat 99.50th-qrtle-8        80.00 (   0.00%)       84.00 (  -5.00%)
Lat 99.90th-qrtle-8        84.00 (   0.00%)       90.00 (  -7.14%)
Lat 50.00th-qrtle-16       65.00 (   0.00%)       65.00 (   0.00%)
Lat 75.00th-qrtle-16       77.00 (   0.00%)       75.00 (   2.60%)
Lat 90.00th-qrtle-16       84.00 (   0.00%)       83.00 (   1.19%)
Lat 95.00th-qrtle-16       88.00 (   0.00%)       87.00 (   1.14%)
Lat 99.00th-qrtle-16       97.00 (   0.00%)       96.00 (   1.03%)
Lat 99.50th-qrtle-16      100.00 (   0.00%)      104.00 (  -4.00%)
Lat 99.90th-qrtle-16      110.00 (   0.00%)      126.00 ( -14.55%)
Lat 50.00th-qrtle-32       70.00 (   0.00%)       71.00 (  -1.43%)
Lat 75.00th-qrtle-32       92.00 (   0.00%)       94.00 (  -2.17%)
Lat 90.00th-qrtle-32      110.00 (   0.00%)      110.00 (   0.00%)
Lat 95.00th-qrtle-32      121.00 (   0.00%)      118.00 (   2.48%)
Lat 99.00th-qrtle-32      135.00 (   0.00%)      137.00 (  -1.48%)
Lat 99.50th-qrtle-32      140.00 (   0.00%)      146.00 (  -4.29%)
Lat 99.90th-qrtle-32      150.00 (   0.00%)      160.00 (  -6.67%)
Lat 50.00th-qrtle-39       80.00 (   0.00%)       71.00 (  11.25%)
Lat 75.00th-qrtle-39      102.00 (   0.00%)       91.00 (  10.78%)
Lat 90.00th-qrtle-39      118.00 (   0.00%)      108.00 (   8.47%)
Lat 95.00th-qrtle-39      128.00 (   0.00%)      117.00 (   8.59%)
Lat 99.00th-qrtle-39      149.00 (   0.00%)      133.00 (  10.74%)
Lat 99.50th-qrtle-39      160.00 (   0.00%)      139.00 (  13.12%)
Lat 99.90th-qrtle-39    13808.00 (   0.00%)     4920.00 (  64.37%)

Despite being nearly identical, it showed a variety of major gains so
I'm not convinced that heavy emphasis should be placed on this particular
workload in terms of evaluating this particular patch. Further evidence of
this is the fact that testing on a UMA machine showed small gains/losses
even though the patch should be a no-op on UMA.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20171219085947.13136-2-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-10 11:30:31 +01:00
Joel Fernandes 9783be2c0e sched/fair: Correct obsolete comment about cpufreq_update_util()
Since the remote cpufreq callback work was merged, the cpufreq_update_util()
call can happen from remote CPUs. The comment about local CPUs is thus
obsolete. Update it accordingly.

Signed-off-by: Joel Fernandes <joelaf@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Android Kernel <kernel-team@android.com>
Cc: Atish Patra <atish.patra@oracle.com>
Cc: Chris Redpath <Chris.Redpath@arm.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: EAS Dev <eas-dev@lists.linaro.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Ramussen <morten.rasmussen@arm.com>
Cc: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Rohit Jain <rohit.k.jain@oracle.com>
Cc: Saravana Kannan <skannan@quicinc.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vikram Mulukutla <markivx@codeaurora.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: http://lkml.kernel.org/r/20171215153944.220146-2-joelaf@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-10 11:30:30 +01:00
Joel Fernandes 18cec7e0dd sched/fair: Remove impossible condition from find_idlest_group_cpu()
find_idlest_group_cpu() goes through CPUs of a group previously selected by
find_idlest_group(). find_idlest_group() returns NULL if the local group is the
selected one, and find_idlest_group_cpu() is not executed if the group to which
'cpu' belongs is chosen. So we're always guaranteed to call
find_idlest_group_cpu() with a group to which 'cpu' is non-local.

This makes one of the conditions in find_idlest_group_cpu() an impossible one,
which we can get rid of.

Signed-off-by: Joel Fernandes <joelaf@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Brendan Jackman <brendan.jackman@arm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Android Kernel <kernel-team@android.com>
Cc: Atish Patra <atish.patra@oracle.com>
Cc: Chris Redpath <Chris.Redpath@arm.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: EAS Dev <eas-dev@lists.linaro.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Ramussen <morten.rasmussen@arm.com>
Cc: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Rohit Jain <rohit.k.jain@oracle.com>
Cc: Saravana Kannan <skannan@quicinc.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vikram Mulukutla <markivx@codeaurora.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: http://lkml.kernel.org/r/20171215153944.220146-3-joelaf@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-10 11:30:30 +01:00
Joel Fernandes f453ae2200 sched/fair: Consider RT/IRQ pressure in capacity_spare_wake()
capacity_spare_wake() in the slow path influences choice of idlest groups,
as we search for groups with maximum spare capacity. In scenarios where
RT pressure is high, a sub-optimal group can be chosen and hurt
performance of the task being woken up.

Fix this by using capacity_of() instead of capacity_orig_of() in capacity_spare_wake().
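
A worked toy example of why the pressure-adjusted capacity matters (the
numbers below are made up):

  #include <stdio.h>

  static long spare(long capacity, long task_util)
  {
      return capacity - task_util > 0 ? capacity - task_util : 0;
  }

  int main(void)
  {
      long capacity_orig = 1024;  /* original capacity of the CPU */
      long rt_irq        = 700;   /* capacity consumed by RT/IRQ work */
      long capacity      = capacity_orig - rt_irq;  /* like capacity_of() */
      long task_util     = 200;   /* utilization of the waking task */

      /* Before: the CPU looks attractive despite heavy RT pressure. */
      printf("spare vs capacity_orig_of(): %ld\n", spare(capacity_orig, task_util));
      /* After: RT/IRQ pressure is accounted for. */
      printf("spare vs capacity_of():      %ld\n", spare(capacity, task_util));
      return 0;
  }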

Test results showing improvements from this change are below. More tests
were also done by myself and Matt Fleming to ensure no degradation in
different benchmarks.

1) Rohit ran barrier.c test (details below) with following improvements:
------------------------------------------------------------------------
This was Rohit's original use case for a patch he posted at [1]; however,
his recent tests showed that my patch can replace his slow-path
changes [1] and there's no need to selectively scan/skip CPUs in
find_idlest_group_cpu() in the slow path to get the improvement he sees.

barrier.c (OpenMP code) is used as a micro-benchmark. It does a number of
iterations and a barrier sync at the end of each for loop.

Here barrier.c is running along with ping on CPUs 0 and 1 as:
'ping -l 10000 -q -s 10 -f hostX'

barrier.c can be found at:
http://www.spinics.net/lists/kernel/msg2506955.html

Following are the results for iterations per second with this
micro-benchmark (higher is better), on a 44-core, 2-socket, 88-thread
Intel x86 machine:
+--------+------------------+---------------------------+
|Threads | Without patch    | With patch                |
|        |                  |                           |
+--------+--------+---------+-----------------+---------+
|        | Mean   | Std Dev | Mean            | Std Dev |
+--------+--------+---------+-----------------+---------+
|1       | 539.36 | 60.16   | 572.54 (+6.15%) | 40.95   |
|2       | 481.01 | 19.32   | 530.64 (+10.32%)| 56.16   |
|4       | 474.78 | 22.28   | 479.46 (+0.99%) | 18.89   |
|8       | 450.06 | 24.91   | 447.82 (-0.50%) | 12.36   |
|16      | 436.99 | 22.57   | 441.88 (+1.12%) | 7.39    |
|32      | 388.28 | 55.59   | 429.4  (+10.59%)| 31.14   |
|64      | 314.62 | 6.33    | 311.81 (-0.89%) | 11.99   |
+--------+--------+---------+-----------------+---------+

2) ping+hackbench test on bare-metal server (by Rohit)
-----------------------------------------------------
Here hackbench is running in threaded mode along
with ping running on CPUs 0 and 1 as:
'ping -l 10000 -q -s 10 -f hostX'

This test is running on a 2-socket, 20-core, 40-thread Intel x86 machine.
The number of loops is 10000 and runtime is in seconds (lower is better).

+--------------+-----------------+--------------------------+
|Task Groups   | Without patch   |  With patch              |
|              +-------+---------+----------------+---------+
|(Groups of 40)| Mean  | Std Dev |  Mean          | Std Dev |
+--------------+-------+---------+----------------+---------+
|1             | 0.851 | 0.007   |  0.828 (+2.77%)| 0.032   |
|2             | 1.083 | 0.203   |  1.087 (-0.37%)| 0.246   |
|4             | 1.601 | 0.051   |  1.611 (-0.62%)| 0.055   |
|8             | 2.837 | 0.060   |  2.827 (+0.35%)| 0.031   |
|16            | 5.139 | 0.133   |  5.107 (+0.63%)| 0.085   |
|25            | 7.569 | 0.142   |  7.503 (+0.88%)| 0.143   |
+--------------+-------+---------+----------------+---------+

[1] https://patchwork.kernel.org/patch/9991635/

Matt Fleming also ran several different hackbench tests and cyclic test
to sanity-check that the patch doesn't harm other use cases.

Tested-by: Matt Fleming <matt@codeblueprint.co.uk>
Tested-by: Rohit Jain <rohit.k.jain@oracle.com>
Signed-off-by: Joel Fernandes <joelaf@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Atish Patra <atish.patra@oracle.com>
Cc: Brendan Jackman <brendan.jackman@arm.com>
Cc: Chris Redpath <Chris.Redpath@arm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Ramussen <morten.rasmussen@arm.com>
Cc: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Saravana Kannan <skannan@quicinc.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vikram Mulukutla <markivx@codeaurora.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: http://lkml.kernel.org/r/20171214212158.188190-1-joelaf@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-10 11:30:28 +01:00
Patrick Bellasi f01415fdbf sched/fair: Use 'unsigned long' for utilization, consistently
Utilization and capacity are tracked as 'unsigned long'; however, some
functions using them return an 'int' which is ultimately assigned back to
'unsigned long' variables.

Since there is no scope for using a different, signed type,
consolidate the signature of functions returning utilization to always
use the native type.

This change improves code consistency, and it also benefits
code paths where utilizations should be clamped by avoiding
further type conversions or ugly type casts.
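
A small sketch of the kind of signature cleanup described (the function name
below is illustrative, not one of the kernel's):

  /* Before: returning 'int' forces a narrowing and a later re-widening. */
  /* int cpu_util_sketch(unsigned long util, unsigned long capacity); */

  /* After: return the native type, so clamping needs no casts. */
  unsigned long cpu_util_sketch(unsigned long util, unsigned long capacity)
  {
      return util < capacity ? util : capacity;
  }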

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Chris Redpath <chris.redpath@arm.com>
Reviewed-by: Brendan Jackman <brendan.jackman@arm.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@android.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: http://lkml.kernel.org/r/20171205171018.9203-2-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-10 11:30:28 +01:00