Commit Graph

839 Commits

Author SHA1 Message Date
Rik van Riel 6dc1a672ab sched/numa: Use effective_load() to balance NUMA loads
When CONFIG_FAIR_GROUP_SCHED is enabled, the load that a task places
on a CPU is determined by the group the task is in. The active groups
on the source and destination CPU can be different, resulting in a
different load contribution by the same task at its source and at its
destination. As a result, the load needs to be calculated separately
for each CPU, instead of estimated once with task_h_load().

Getting this calculation right allows some workloads to converge,
where previously the last thread could get stuck on another node,
without being able to migrate to its final destination.

Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: mgorman@suse.de
Cc: chegu_vinod@hp.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1403538378-31571-3-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-05 11:17:35 +02:00
Rik van Riel 28a2174519 sched/numa: Move power adjustment into load_too_imbalanced()
Currently the NUMA code scales the load on each node with the
amount of CPU power available on that node, but it does not
apply any adjustment to the load of the task that is being
moved over.

On systems with SMT/HT, this results in a task being weighed
much more heavily than a CPU core, and a task move that would
even out the load between nodes being disallowed.

The correct thing is to apply the power correction to the
numbers after we have first applied the move of the tasks'
loads to them.

This also allows us to do the power correction with a multiplication,
rather than a division.

Also drop two function arguments for load_too_unbalanced, since it
takes various factors from env already.

Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: chegu_vinod@hp.com
Cc: mgorman@suse.de
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1403538378-31571-2-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-05 11:17:34 +02:00
Rik van Riel f0b8a4afd6 sched/numa: Use group's max nid as task's preferred nid
From task_numa_placement, always try to consolidate the tasks
in a group on the group's top nid.

In case this task is part of a group that is interleaved over
multiple nodes, task_numa_migrate will set the task's preferred
nid to the best node it could find for the task, so this patch
will cause at most one run through task_numa_migrate.

Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: mgorman@suse.de
Cc: chegu_vinod@hp.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1403538095-31256-2-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-05 11:17:33 +02:00
Tim Chen 4486edd12b sched/fair: Implement fast idling of CPUs when the system is partially loaded
When a system is lightly loaded (i.e. no more than 1 job per cpu),
attempt to pull job to a cpu before putting it to idle is unnecessary and
can be skipped.  This patch adds an indicator so the scheduler can know
when there's no more than 1 active job is on any CPU in the system to
skip needless job pulls.

On a 4 socket machine with a request/response kind of workload from
clients, we saw about 0.13 msec delay when we go through a full load
balance to try pull job from all the other cpus.  While 0.1 msec was
spent on processing the request and generating a response, the 0.13 msec
load balance overhead was actually more than the actual work being done.
This overhead can be skipped much of the time for lightly loaded systems.

With this patch, we tested with a netperf request/response workload that
has the server busy with half the cpus in a 4 socket system.  We found
the patch eliminated 75% of the load balance attempts before idling a cpu.

The overhead of setting/clearing the indicator is low as we already gather
the necessary info while we call add_nr_running() and update_sd_lb_stats.()
We switch to full load balance load immediately if any cpu got more than
one job on its run queue in add_nr_running.  We'll clear the indicator
to avoid load balance when we detect no cpu's have more than one job
when we scan the work queues in update_sg_lb_stats().  We are aggressive
in turning on the load balance and opportunistic in skipping the load
balance.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Jason Low <jason.low2@hp.com>
Cc: "Paul E.McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Cc: Alex Shi <alex.shi@linaro.org>
Cc: Michel Lespinasse <walken@google.com>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1403551009.2970.613.camel@schen9-DESK
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-05 11:17:32 +02:00
Viresh Kumar 89abb5ad10 sched/idle: Drop !! while calculating 'broadcast'
We don't need 'broadcast' to be set to 'zero or one', but to 'zero or non-zero'
and so the extra operation to convert it to 'zero or one' can be skipped.

Also change type of 'broadcast' to unsigned int, i.e. type of
drv->states[*].flags.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: linaro-kernel@lists.linaro.org
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/0dfbe2976aa108c53e08d3477ea90f6360c1f54c.1403584026.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-05 11:17:31 +02:00
Mike Galbraith 4036ac1567 sched: Fix clock_gettime(CLOCK_[PROCESS/THREAD]_CPUTIME_ID) monotonicity
If a task has been dequeued, it has been accounted.  Do not project
cycles that may or may not ever be accounted to a dequeued task, as
that may make clock_gettime() both inaccurate and non-monotonic.

Protect update_rq_clock() from slight TSC skew while at it.

Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: kosaki.motohiro@jp.fujitsu.com
Cc: pjt@google.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1403588980.29711.11.camel@marge.simpson.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-05 11:17:30 +02:00
Ben Segall c06f04c704 sched: Fix potential near-infinite distribute_cfs_runtime() loop
distribute_cfs_runtime() intentionally only hands out enough runtime to
bring each cfs_rq to 1 ns of runtime, expecting the cfs_rqs to then take
the runtime they need only once they actually get to run. However, if
they get to run sufficiently quickly, the period timer is still in
distribute_cfs_runtime() and no runtime is available, causing them to
throttle. Then distribute has to handle them again, and this can go on
until distribute has handed out all of the runtime 1ns at a time, which
takes far too long.

Instead allow access to the same runtime that distribute is handing out,
accepting that corner cases with very low quota may be able to spend the
entire cfs_b->runtime during distribute_cfs_runtime, meaning that the
runtime directly handed out by distribute_cfs_runtime was over quota. In
addition, if a cfs_rq does manage to throttle like this, make sure the
existing distribute_cfs_runtime no longer loops over it again.

Signed-off-by: Ben Segall <bsegall@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140620222120.13814.21652.stgit@sword-of-the-dawn.mtv.corp.google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-05 11:17:29 +02:00
Viresh Kumar 541b82644d sched/core: Fix formatting issues in sched_can_stop_tick()
sched_can_stop_tick() is using 7 spaces instead of 8 spaces or a 'tab' at the
beginning of few lines. Which doesn't align well with the Coding Guidelines.

Also remove local variable 'rq' as it is used at only one place and we can
directly use this_rq() instead.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: fweisbec@gmail.com
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/afb781733e4a9ffbced5eb9fd25cc0aa5c6ffd7a.1403596966.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-05 11:17:28 +02:00
Ingo Molnar 51da9830d7 Merge branch 'timers/nohz' into sched/core
Merge these two, because upcoming patches will touch both areas.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-05 11:06:10 +02:00
Hillf Danton 5d5e2b1bcb sched: Fix CACHE_HOT_BUDY condition
When computing cache hot, we should check if the migration dst cpu is idle,
instead of the current cpu. Though they are same in normal balancing, that
is false nowadays in nohz idle balancing at least.

Signed-off-by: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Mike Galbraith <mgalbraith@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140607090452.4696E301D2@webmail.sinamail.sina.com.cn
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-06-18 18:29:59 +02:00
Rik van Riel bb97fc3164 sched/numa: Always try to migrate to preferred node at task_numa_placement() time
It is possible that at task_numa_placement() time, the task's
numa_preferred_nid does not change, but the task is not
actually running on the preferred node at the time.

In that case, we still want to attempt migration to the
preferred node.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: mgorman@suse.de
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140604163315.1dbc7b56@cuia.bos.redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-06-18 18:29:58 +02:00
Rik van Riel a43455a1d5 sched/numa: Ensure task_numa_migrate() checks the preferred node
The first thing task_numa_migrate() does is check to see if there is
CPU capacity available on the preferred node, in order to move the
task there.

However, if the preferred node is all busy, we would skip considering
that node for tasks swaps in the subsequent loop. This prevents NUMA
convergence of tasks on busy systems.

However, swapping locations with a task on our preferred nid, when
the preferred nid is busy, is perfectly fine.

The fix is to also look for a CPU on our preferred nid when it is
totally busy.

This changes "perf bench numa mem -p 4 -t 20 -m -0 -P 1000" from
not converging in 15 minutes on my 4 node system, to converging in
10-20 seconds.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: mgorman@suse.de
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140604160942.6969b101@cuia.bos.redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-06-18 18:29:57 +02:00
Frederic Weisbecker 3882ec6439 nohz: Use IPI implicit full barrier against rq->nr_running r/w
A full dynticks CPU is allowed to stop its tick when a single task runs.
Meanwhile when a new task gets enqueued, the CPU must be notified so that
it can restart its tick to maintain local fairness and other accounting
details.

This notification is performed by way of an IPI. Then when the target
receives the IPI, we expect it to see the new value of rq->nr_running.

Hence the following ordering scenario:

   CPU 0                   CPU 1

   write rq->running       get IPI
   smp_wmb()               smp_rmb()
   send IPI                read rq->nr_running

But Paul Mckenney says that nowadays IPIs imply a full barrier on
all architectures. So we can safely remove this pair and rely on the
implicit barriers that come along IPI send/receive. Lets
just comment on this new assumption.

Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2014-06-16 16:27:24 +02:00
Frederic Weisbecker fd2ac4f4a6 nohz: Use nohz own full kick on 2nd task enqueue
Now that we have a nohz full remote kick based on irq work, lets use
it to notify a CPU that it's exiting single task mode.

This unbloats a bit the scheduler IPI that the nohz code was abusing
for its cool "callable anywhere/anytime" properties.

Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2014-06-16 16:26:55 +02:00
Frederic Weisbecker 53c5fa16b4 nohz: Switch to nohz full remote kick on timer enqueue
When a new timer is enqueued on a full dynticks target, that CPU must
re-evaluate the next tick to handle the timer correctly.

This is currently performed through the scheduler IPI. Meanwhile this
happens at the cost of off-topic workarounds in that fast path to make
it call irq_exit().

As we plan to remove this hack off the scheduler IPI, lets use
the nohz full kick instead. Pretty much any IPI fits for that job
as long at it calls irq_exit(). The nohz full kick just happens to be
handy and readily available here.

If it happens to be too much an overkill in the future, we can still
turn that timer kick into an empty IPI.

Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2014-06-16 16:26:55 +02:00
Linus Torvalds b2e09f633a Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull more scheduler updates from Ingo Molnar:
 "Second round of scheduler changes:
   - try-to-wakeup and IPI reduction speedups, from Andy Lutomirski
   - continued power scheduling cleanups and refactorings, from Nicolas
     Pitre
   - misc fixes and enhancements"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/deadline: Delete extraneous extern for to_ratio()
  sched/idle: Optimize try-to-wake-up IPI
  sched/idle: Simplify wake_up_idle_cpu()
  sched/idle: Clear polling before descheduling the idle thread
  sched, trace: Add a tracepoint for IPI-less remote wakeups
  cpuidle: Set polling in poll_idle
  sched: Remove redundant assignment to "rt_rq" in update_curr_rt(...)
  sched: Rename capacity related flags
  sched: Final power vs. capacity cleanups
  sched: Remove remaining dubious usage of "power"
  sched: Let 'struct sched_group_power' care about CPU capacity
  sched/fair: Disambiguate existing/remaining "capacity" usage
  sched/fair: Change "has_capacity" to "has_free_capacity"
  sched/fair: Remove "power" from 'struct numa_stats'
  sched: Fix signedness bug in yield_to()
  sched/fair: Use time_after() in record_wakee()
  sched/balancing: Reduce the rate of needless idle load balancing
  sched/fair: Fix unlocked reads of some cfs_b->quota/period
2014-06-12 19:42:15 -07:00
Linus Torvalds 3737a12761 Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull more perf updates from Ingo Molnar:
 "A second round of perf updates:

   - wide reaching kprobes sanitization and robustization, with the hope
     of fixing all 'probe this function crashes the kernel' bugs, by
     Masami Hiramatsu.

   - uprobes updates from Oleg Nesterov: tmpfs support, corner case
     fixes and robustization work.

   - perf tooling updates and fixes from Jiri Olsa, Namhyung Ki, Arnaldo
     et al:
        * Add support to accumulate hist periods (Namhyung Kim)
        * various fixes, refactorings and enhancements"

* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (101 commits)
  perf: Differentiate exec() and non-exec() comm events
  perf: Fix perf_event_comm() vs. exec() assumption
  uprobes/x86: Rename arch_uprobe->def to ->defparam, minor comment updates
  perf/documentation: Add description for conditional branch filter
  perf/x86: Add conditional branch filtering support
  perf/tool: Add conditional branch filter 'cond' to perf record
  perf: Add new conditional branch filter 'PERF_SAMPLE_BRANCH_COND'
  uprobes: Teach copy_insn() to support tmpfs
  uprobes: Shift ->readpage check from __copy_insn() to uprobe_register()
  perf/x86: Use common PMU interrupt disabled code
  perf/ARM: Use common PMU interrupt disabled code
  perf: Disable sampled events if no PMU interrupt
  perf: Fix use after free in perf_remove_from_context()
  perf tools: Fix 'make help' message error
  perf record: Fix poll return value propagation
  perf tools: Move elide bool into perf_hpp_fmt struct
  perf tools: Remove elide setup for SORT_MODE__MEMORY mode
  perf tools: Fix "==" into "=" in ui_browser__warning assignment
  perf tools: Allow overriding sysfs and proc finding with env var
  perf tools: Consider header files outside perf directory in tags target
  ...
2014-06-12 19:18:49 -07:00
Ingo Molnar 535560d841 Merge commit '3cf2f34' into sched/core, to fix build error
Fix this dependency on the locking tree's smp_mb*() API changes:

  kernel/sched/idle.c:247:3: error: implicit declaration of function ‘smp_mb__after_atomic’ [-Werror=implicit-function-declaration]

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-06-12 13:46:37 +02:00
Linus Torvalds 14208b0ec5 Merge branch 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
 "A lot of activities on cgroup side.  Heavy restructuring including
  locking simplification took place to improve the code base and enable
  implementation of the unified hierarchy, which currently exists behind
  a __DEVEL__ mount option.  The core support is mostly complete but
  individual controllers need further work.  To explain the design and
  rationales of the the unified hierarchy

        Documentation/cgroups/unified-hierarchy.txt

  is added.

  Another notable change is css (cgroup_subsys_state - what each
  controller uses to identify and interact with a cgroup) iteration
  update.  This is part of continuing updates on css object lifetime and
  visibility.  cgroup started with reference count draining on removal
  way back and is now reaching a point where csses behave and are
  iterated like normal refcnted objects albeit with some complexities to
  allow distinguishing the state where they're being deleted.  The css
  iteration update isn't taken advantage of yet but is planned to be
  used to simplify memcg significantly"

* 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (77 commits)
  cgroup: disallow disabled controllers on the default hierarchy
  cgroup: don't destroy the default root
  cgroup: disallow debug controller on the default hierarchy
  cgroup: clean up MAINTAINERS entries
  cgroup: implement css_tryget()
  device_cgroup: use css_has_online_children() instead of has_children()
  cgroup: convert cgroup_has_live_children() into css_has_online_children()
  cgroup: use CSS_ONLINE instead of CGRP_DEAD
  cgroup: iterate cgroup_subsys_states directly
  cgroup: introduce CSS_RELEASED and reduce css iteration fallback window
  cgroup: move cgroup->serial_nr into cgroup_subsys_state
  cgroup: link all cgroup_subsys_states in their sibling lists
  cgroup: move cgroup->sibling and ->children into cgroup_subsys_state
  cgroup: remove cgroup->parent
  device_cgroup: remove direct access to cgroup->children
  memcg: update memcg_has_children() to use css_next_child()
  memcg: remove tasks/children test from mem_cgroup_force_empty()
  cgroup: remove css_parent()
  cgroup: skip refcnting on normal root csses and cgrp_dfl_root self css
  cgroup: use cgroup->self.refcnt for cgroup refcnting
  ...
2014-06-09 15:03:33 -07:00
Rik van Riel 1662867a9b numa,sched: fix load_to_imbalanced logic inversion
This function is supposed to return true if the new load imbalance is
worse than the old one.  It didn't.  I can only hope brown paper bags
are in style.

Now things converge much better on both the 4 node and 8 node systems.

I am not sure why this did not seem to impact specjbb performance on the
4 node system, which is the system I have full-time access to.

This bug was introduced recently, with commit e63da03639 ("sched/numa:
Allow task switch if load imbalance improves")

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-08 14:35:05 -07:00
Linus Torvalds 3f17ea6dea Merge branch 'next' (accumulated 3.16 merge window patches) into master
Now that 3.15 is released, this merges the 'next' branch into 'master',
bringing us to the normal situation where my 'master' branch is the
merge window.

* accumulated work in next: (6809 commits)
  ufs: sb mutex merge + mutex_destroy
  powerpc: update comments for generic idle conversion
  cris: update comments for generic idle conversion
  idle: remove cpu_idle() forward declarations
  nbd: zero from and len fields in NBD_CMD_DISCONNECT.
  mm: convert some level-less printks to pr_*
  MAINTAINERS: adi-buildroot-devel is moderated
  MAINTAINERS: add linux-api for review of API/ABI changes
  mm/kmemleak-test.c: use pr_fmt for logging
  fs/dlm/debug_fs.c: replace seq_printf by seq_puts
  fs/dlm/lockspace.c: convert simple_str to kstr
  fs/dlm/config.c: convert simple_str to kstr
  mm: mark remap_file_pages() syscall as deprecated
  mm: memcontrol: remove unnecessary memcg argument from soft limit functions
  mm: memcontrol: clean up memcg zoneinfo lookup
  mm/memblock.c: call kmemleak directly from memblock_(alloc|free)
  mm/mempool.c: update the kmemleak stack trace for mempool allocations
  lib/radix-tree.c: update the kmemleak stack trace for radix tree allocations
  mm: introduce kmemleak_update_trace()
  mm/kmemleak.c: use %u to print ->checksum
  ...
2014-06-08 11:31:16 -07:00
Linus Torvalds d54d14bfb4 Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
 "Four misc fixes: each was deemed serious enough to warrant v3.15
  inclusion"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/fair: Fix tg_set_cfs_bandwidth() deadlock on rq->lock
  sched/dl: Fix race in dl_task_timer()
  sched: Fix sched_policy < 0 comparison
  sched/numa: Fix use of spin_{un}lock_irq() when interrupts are disabled
2014-06-06 09:53:32 -07:00
Ingo Molnar ec00010972 Merge branch 'perf/urgent' into perf/core, to resolve conflict and to prepare for new patches
Conflicts:
	arch/x86/kernel/traps.c

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-06-06 07:55:06 +02:00
Paul Gortmaker f602d06327 sched/deadline: Delete extraneous extern for to_ratio()
There was a prototype for it added to kernel/sched/sched.h
at the same time the extern was added, so the extern in
the C file was never really ever needed.

See commit 332ac17ef5
("sched/deadline: Add bandwidth management for SCHED_DEADLINE
tasks") for details.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dario Faggioli <raistlin@linux.it>
Link: http://lkml.kernel.org/r/1400013605-18717-1-git-send-email-paul.gortmaker@windriver.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-06-05 14:22:24 +02:00
Peter Zijlstra e3baac47f0 sched/idle: Optimize try-to-wake-up IPI
[ This series reduces the number of IPIs on Andy's workload by something like
  99%. It's down from many hundreds per second to very few.

  The basic idea behind this series is to make TIF_POLLING_NRFLAG be a
  reliable indication that the idle task is polling.  Once that's done,
  the rest is reasonably straightforward. ]

When enqueueing tasks on remote LLC domains, we send an IPI to do the
work 'locally' and avoid bouncing all the cachelines over.

However, when the remote CPU is idle (and polling, say x86 mwait), we
don't need to send an IPI, we can simply kick the TIF word to wake it
up and have the 'idle' loop do the work.

So when _TIF_POLLING_NRFLAG is set, but _TIF_NEED_RESCHED is not (yet)
set, set _TIF_NEED_RESCHED and avoid sending the IPI.

Much-requested-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
[Edited by Andy Lutomirski, but this is mostly Peter Zijlstra's code.]
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Cc: nicolas.pitre@linaro.org
Cc: daniel.lezcano@linaro.org
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: umgwanakikbuti@gmail.com
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/ce06f8b02e7e337be63e97597fc4b248d3aa6f9b.1401902905.git.luto@amacapital.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-06-05 12:09:53 +02:00
Andy Lutomirski 67b9ca70c3 sched/idle: Simplify wake_up_idle_cpu()
Now that rq->idle's polling bit is a reliable indication that the cpu is
polling, use it.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: nicolas.pitre@linaro.org
Cc: daniel.lezcano@linaro.org
Cc: umgwanakikbuti@gmail.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/922f00761445a830ebb23d058e2ae53956ce2d73.1401902905.git.luto@amacapital.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-06-05 12:09:52 +02:00
Andy Lutomirski 82c65d60d6 sched/idle: Clear polling before descheduling the idle thread
Currently, the only real guarantee provided by the polling bit is
that, if you hold rq->lock and the polling bit is set, then you can
set need_resched to force a reschedule.

The only reason the lock is needed is that the idle thread might not
be running at all when setting its need_resched bit, and rq->lock
keeps it pinned.

This is easy to fix: just clear the polling bit before scheduling.
Now the idle thread's polling bit is only ever set when
rq->curr == rq->idle.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: nicolas.pitre@linaro.org
Cc: daniel.lezcano@linaro.org
Cc: umgwanakikbuti@gmail.com
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/b2059fcb4c613d520cb503b6fad6e47033c7c203.1401902905.git.luto@amacapital.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-06-05 12:09:51 +02:00
Andy Lutomirski dfc68f29ae sched, trace: Add a tracepoint for IPI-less remote wakeups
Remote wakeups of polling CPUs are a valuable performance
improvement; add a tracepoint to make it much easier to verify that
they're working.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: nicolas.pitre@linaro.org
Cc: daniel.lezcano@linaro.org
Cc: umgwanakikbuti@gmail.com
Cc: David Ahern <dsahern@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/16205aee116772aa686814f9b13bccb562108047.1401902905.git.luto@amacapital.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-06-05 12:09:50 +02:00
Giedrius Rekasius 0b07939cbf sched: Remove redundant assignment to "rt_rq" in update_curr_rt(...)
Variable "rt_rq" is used only in block "for_each_sched_rt_entity" so the
value assigned to it at the beginning of the update_curr_rt(...) gets
overwritten without ever being read. Remove redundant assignment and
move variable declaration to the block in which it is being used.

Signed-off-by: Giedrius Rekasius <giedrius.rekasius@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: kernel-janitors@vger.kernel.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/1401027811-30066-1-git-send-email-giedrius.rekasius@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-06-05 11:52:34 +02:00
Nicolas Pitre 5d4dfddd4f sched: Rename capacity related flags
It is better not to think about compute capacity as being equivalent
to "CPU power".  The upcoming "power aware" scheduler work may create
confusion with the notion of energy consumption if "power" is used too
liberally.

Let's rename the following feature flags since they do relate to capacity:

	SD_SHARE_CPUPOWER  -> SD_SHARE_CPUCAPACITY
	ARCH_POWER         -> ARCH_CAPACITY
	NONTASK_POWER      -> NONTASK_CAPACITY

Signed-off-by: Nicolas Pitre <nico@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: linaro-kernel@lists.linaro.org
Cc: Andy Fleming <afleming@freescale.com>
Cc: Anton Blanchard <anton@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Grant Likely <grant.likely@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: Rob Herring <robh+dt@kernel.org>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: devicetree@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Link: http://lkml.kernel.org/n/tip-e93lpnxb87owfievqatey6b5@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-06-05 11:52:32 +02:00
Nicolas Pitre ca8ce3d0b1 sched: Final power vs. capacity cleanups
It is better not to think about compute capacity as being equivalent
to "CPU power".  The upcoming "power aware" scheduler work may create
confusion with the notion of energy consumption if "power" is used too
liberally.

This contains the architecture visible changes.  Incidentally, only ARM
takes advantage of the available pow^H^H^Hcapacity scaling hooks and
therefore those changes outside kernel/sched/ are confined to one ARM
specific file.  The default arch_scale_smt_power() hook is not overridden
by anyone.

Replacements are as follows:

	arch_scale_freq_power  --> arch_scale_freq_capacity
	arch_scale_smt_power   --> arch_scale_smt_capacity
	SCHED_POWER_SCALE      --> SCHED_CAPACITY_SCALE
	SCHED_POWER_SHIFT      --> SCHED_CAPACITY_SHIFT

The local usage of "power" in arch/arm/kernel/topology.c is also changed
to "capacity" as appropriate.

Signed-off-by: Nicolas Pitre <nico@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: linaro-kernel@lists.linaro.org
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Grant Likely <grant.likely@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Brown <broonie@linaro.org>
Cc: Rob Herring <robh+dt@kernel.org>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Sudeep KarkadaNagesha <sudeep.karkadanagesha@arm.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: devicetree@vger.kernel.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/n/tip-48zba9qbznvglwelgq2cfygh@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-06-05 11:52:30 +02:00
Nicolas Pitre ced549fa5f sched: Remove remaining dubious usage of "power"
It is better not to think about compute capacity as being equivalent
to "CPU power".  The upcoming "power aware" scheduler work may create
confusion with the notion of energy consumption if "power" is used too
liberally.

This is the remaining "power" -> "capacity" rename for local symbols.
Those symbols visible to the rest of the kernel are not included yet.

Signed-off-by: Nicolas Pitre <nico@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: linaro-kernel@lists.linaro.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/n/tip-yyyhohzhkwnaotr3lx8zd5aa@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-06-05 11:52:29 +02:00
Nicolas Pitre 63b2ca30bd sched: Let 'struct sched_group_power' care about CPU capacity
It is better not to think about compute capacity as being equivalent
to "CPU power".  The upcoming "power aware" scheduler work may create
confusion with the notion of energy consumption if "power" is used too
liberally.

Since struct sched_group_power is really about compute capacity of sched
groups, let's rename it to struct sched_group_capacity. Similarly sgp
becomes sgc. Related variables and functions dealing with groups are also
adjusted accordingly.

Signed-off-by: Nicolas Pitre <nico@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: linaro-kernel@lists.linaro.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/n/tip-5yeix833vvgf2uyj5o36hpu9@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-06-05 11:52:26 +02:00
Nicolas Pitre 0fedc6c8e3 sched/fair: Disambiguate existing/remaining "capacity" usage
We have "power" (which should actually become "capacity") and "capacity"
which is a scaled down "capacity factor" in terms of unitary tasks.
Let's use "capacity_factor" to make room for proper usage of "capacity"
later.

Signed-off-by: Nicolas Pitre <nico@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: linaro-kernel@lists.linaro.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/n/tip-gk1co8sqdev3763opqm6ovml@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-06-05 11:52:25 +02:00
Nicolas Pitre 1b6a7495d3 sched/fair: Change "has_capacity" to "has_free_capacity"
The capacity of a CPU/group should be some intrinsic value that doesn't
change with task placement.  It is like a container which capacity is
stable regardless of the amount of liquid in it (its "utilization")...
unless the container itself is crushed that is, but that's another story.

Therefore let's rename "has_capacity" to "has_free_capacity" in order to
better convey the wanted meaning.

Signed-off-by: Nicolas Pitre <nico@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: linaro-kernel@lists.linaro.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/n/tip-djzkk027jm0e8x8jxy70opzh@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-06-05 11:52:22 +02:00
Nicolas Pitre 5ef20ca181 sched/fair: Remove "power" from 'struct numa_stats'
It is better not to think about compute capacity as being equivalent
to "CPU power".  The upcoming "power aware" scheduler work may create
confusion with the notion of energy consumption if "power" is used too
liberally.

To make things explicit and not create more confusion with the existing
"capacity" member, let's rename things as follows:

	power    -> compute_capacity
	capacity -> task_capacity

Note: none of those fields are actually used outside update_numa_stats().

Signed-off-by: Nicolas Pitre <nico@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: linaro-kernel@lists.linaro.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/n/tip-2e2ndymj5gyshyjq8am79f20@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-06-05 11:52:14 +02:00
Dan Carpenter fa93384f40 sched: Fix signedness bug in yield_to()
yield_to() is supposed to return -ESRCH if there is no task to
yield to, but because the type is bool that is the same as returning
true.

The only place I see which cares is kvm_vcpu_on_spin().

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Raghavendra <raghavendra.kt@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Gleb Natapov <gleb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: kvm@vger.kernel.org
Link: http://lkml.kernel.org/r/20140523102042.GA7267@mwanda
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-06-05 11:52:13 +02:00
Manuel Schölling 2538d960d0 sched/fair: Use time_after() in record_wakee()
To be future-proof and for better readability the time comparisons are modified
to use time_after() instead of plain, error-prone math.

Signed-off-by: Manuel Schölling <manuel.schoelling@gmx.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1400780723-24626-1-git-send-email-manuel.schoelling@gmx.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-06-05 11:52:02 +02:00
Tim Chen ed61bbc69c sched/balancing: Reduce the rate of needless idle load balancing
The current no_hz idle load balancer do load balancing for *all* idle cpus,
even though the time due to load balance for a particular
idle cpu could be still a while in the future.  This introduces a much
higher load balancing rate than what is necessary.  The patch
changes the behavior by only doing idle load balancing on
behalf of an idle cpu only when it is due for load balancing.

On SGI's systems with over 3000 cores, the cpu responsible for idle balancing
got overwhelmed with idle balancing, and introduces a lot of OS noise
to workloads.  This patch fixes the issue.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Acked-by: Russ Anderson <rja@sgi.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Jason Low <jason.low2@hp.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Len Brown <len.brown@intel.com>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Cc: Hedi Berriche <hedi@sgi.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: MichelLespinasse <walken@google.com>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1400621967.2970.280.camel@schen9-DESK
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-06-05 11:52:01 +02:00
Ben Segall 51f2176d74 sched/fair: Fix unlocked reads of some cfs_b->quota/period
sched_cfs_period_timer() reads cfs_b->period without locks before calling
do_sched_cfs_period_timer(), and similarly unthrottle_offline_cfs_rqs()
would read cfs_b->period without the right lock. Thus a simultaneous
change of bandwidth could cause corruption on any platform where ktime_t
or u64 writes/reads are not atomic.

Extend cfs_b->lock from do_sched_cfs_period_timer() to include the read of
cfs_b->period to solve that issue; unthrottle_offline_cfs_rqs() can just
use 1 rather than the exact quota, much like distribute_cfs_runtime()
does.

There is also an unlocked read of cfs_b->runtime_expires, but a race
there would only delay runtime expiry by a tick. Still, the comparison
should just be != anyway, which clarifies even that problem.

Signed-off-by: Ben Segall <bsegall@google.com>
Tested-by: Roman Gushchin <klamm@yandex-team.ru>
[peterz: Fix compile warn]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140519224945.20303.93530.stgit@sword-of-the-dawn.mtv.corp.google.com
Cc: pjt@google.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-06-05 11:52:00 +02:00
Roman Gushchin 09dc4ab039 sched/fair: Fix tg_set_cfs_bandwidth() deadlock on rq->lock
tg_set_cfs_bandwidth() sets cfs_b->timer_active to 0 to
force the period timer restart. It's not safe, because
can lead to deadlock, described in commit 927b54fccbf0:
"__start_cfs_bandwidth calls hrtimer_cancel while holding rq->lock,
waiting for the hrtimer to finish. However, if sched_cfs_period_timer
runs for another loop iteration, the hrtimer can attempt to take
rq->lock, resulting in deadlock."

Three CPUs must be involved:

  CPU0               CPU1                         CPU2
  take rq->lock      period timer fired
  ...                take cfs_b lock
  ...                ...                          tg_set_cfs_bandwidth()
  throttle_cfs_rq()  release cfs_b lock           take cfs_b lock
  ...                distribute_cfs_runtime()     timer_active = 0
  take cfs_b->lock   wait for rq->lock            ...
  __start_cfs_bandwidth()
  {wait for timer callback
   break if timer_active == 1}

So, CPU0 and CPU1 are deadlocked.

Instead of resetting cfs_b->timer_active, tg_set_cfs_bandwidth can
wait for period timer callbacks (ignoring cfs_b->timer_active) and
restart the timer explicitly.

Signed-off-by: Roman Gushchin <klamm@yandex-team.ru>
Reviewed-by: Ben Segall <bsegall@google.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/87wqdi9g8e.wl\%klamm@yandex-team.ru
Cc: pjt@google.com
Cc: chris.j.arges@canonical.com
Cc: gregkh@linuxfoundation.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-06-05 11:51:34 +02:00
Kirill Tkhai 0f397f2c90 sched/dl: Fix race in dl_task_timer()
Throttled task is still on rq, and it may be moved to other cpu
if user is playing with sched_setaffinity(). Therefore, unlocked
task_rq() access makes the race.

Juri Lelli reports he got this race when dl_bandwidth_enabled()
was not set.

Other thing, pointed by Peter Zijlstra:

   "Now I suppose the problem can still actually happen when
    you change the root domain and trigger a effective affinity
    change that way".

To fix that we do the same as made in __task_rq_lock(). We do not
use __task_rq_lock() itself, because it has a useful lockdep check,
which is not correct in case of dl_task_timer(). We do not need
pi_lock locked here. This case is an exception (PeterZ):

   "The only reason we don't strictly need ->pi_lock now is because
    we're guaranteed to have p->state == TASK_RUNNING here and are
    thus free of ttwu races".

Signed-off-by: Kirill Tkhai <tkhai@yandex.ru>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: <stable@vger.kernel.org> # v3.14+
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/3056991400578422@web14g.yandex.ru
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-06-05 11:51:12 +02:00
Richard Weinberger b14ed2c273 sched: Fix sched_policy < 0 comparison
attr.sched_policy is u32, therefore a comparison against < 0 is never true.
Fix this by casting sched_policy to int.

This issue was reported by coverity CID 1219934.

Fixes: dbdb22754f ("sched: Disallow sched_attr::sched_policy < 0")
Signed-off-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1401741514-7045-1-git-send-email-richard@nod.at
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-06-05 11:07:43 +02:00
Steven Rostedt e9dd685ce8 sched/numa: Fix use of spin_{un}lock_irq() when interrupts are disabled
As Peter Zijlstra told me, we have the following path:

do_exit()
  exit_itimers()
    itimer_delete()
      spin_lock_irqsave(&timer->it_lock, &flags);
      timer_delete_hook(timer);
        kc->timer_del(timer) := posix_cpu_timer_del()
          put_task_struct()
            __put_task_struct()
              task_numa_free()
                spin_lock(&grp->lock);

Which means that task_numa_free() can be called with interrupts
disabled, which means that we should not be using spin_lock_irq() but
spin_lock_irqsave() instead. Otherwise we are enabling interrupts while
holding an interrupt unsafe lock!

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner<tglx@linutronix.de>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140527182541.GH11096@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-06-05 11:07:41 +02:00
John Stultz c224815dac printk: Add printk_deferred_once
Two of the three prink_deferred uses are really printk_once style
uses, so add a printk_deferred_once macro to simplify those call
sites.

Signed-off-by: John Stultz <john.stultz@linaro.org>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jiri Bohac <jbohac@suse.cz>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04 16:54:17 -07:00
John Stultz aac74dc495 printk: rename printk_sched to printk_deferred
After learning we'll need some sort of deferred printk functionality in
the timekeeping core, Peter suggested we rename the printk_sched function
so it can be reused by needed subsystems.

This only changes the function name. No logic changes.

Signed-off-by: John Stultz <john.stultz@linaro.org>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jiri Bohac <jbohac@suse.cz>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04 16:54:17 -07:00
Linus Torvalds c84a1e32ee Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into next
Pull scheduler updates from Ingo Molnar:
 "The main scheduling related changes in this cycle were:

   - various sched/numa updates, for better performance

   - tree wide cleanup of open coded nice levels

   - nohz fix related to rq->nr_running use

   - cpuidle changes and continued consolidation to improve the
     kernel/sched/idle.c high level idle scheduling logic.  As part of
     this effort I pulled cpuidle driver changes from Rafael as well.

   - standardized idle polling amongst architectures

   - continued work on preparing better power/energy aware scheduling

   - sched/rt updates

   - misc fixlets and cleanups"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (49 commits)
  sched/numa: Decay ->wakee_flips instead of zeroing
  sched/numa: Update migrate_improves/degrades_locality()
  sched/numa: Allow task switch if load imbalance improves
  sched/rt: Fix 'struct sched_dl_entity' and dl_task_time() comments, to match the current upstream code
  sched: Consolidate open coded implementations of nice level frobbing into nice_to_rlimit() and rlimit_to_nice()
  sched: Initialize rq->age_stamp on processor start
  sched, nohz: Change rq->nr_running to always use wrappers
  sched: Fix the rq->next_balance logic in rebalance_domains() and idle_balance()
  sched: Use clamp() and clamp_val() to make sys_nice() more readable
  sched: Do not zero sg->cpumask and sg->sgp->power in build_sched_groups()
  sched/numa: Fix initialization of sched_domain_topology for NUMA
  sched: Call select_idle_sibling() when not affine_sd
  sched: Simplify return logic in sched_read_attr()
  sched: Simplify return logic in sched_copy_attr()
  sched: Fix exec_start/task_hot on migrated tasks
  arm64: Remove TIF_POLLING_NRFLAG
  metag: Remove TIF_POLLING_NRFLAG
  sched/idle: Make cpuidle_idle_call() void
  sched/idle: Reflow cpuidle_idle_call()
  sched/idle: Delay clearing the polling bit
  ...
2014-06-03 14:00:15 -07:00
Linus Torvalds 776edb5931 Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into next
Pull core locking updates from Ingo Molnar:
 "The main changes in this cycle were:

   - reduced/streamlined smp_mb__*() interface that allows more usecases
     and makes the existing ones less buggy, especially in rarer
     architectures

   - add rwsem implementation comments

   - bump up lockdep limits"

* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
  rwsem: Add comments to explain the meaning of the rwsem's count field
  lockdep: Increase static allocations
  arch: Mass conversion of smp_mb__*()
  arch,doc: Convert smp_mb__*()
  arch,xtensa: Convert smp_mb__*()
  arch,x86: Convert smp_mb__*()
  arch,tile: Convert smp_mb__*()
  arch,sparc: Convert smp_mb__*()
  arch,sh: Convert smp_mb__*()
  arch,score: Convert smp_mb__*()
  arch,s390: Convert smp_mb__*()
  arch,powerpc: Convert smp_mb__*()
  arch,parisc: Convert smp_mb__*()
  arch,openrisc: Convert smp_mb__*()
  arch,mn10300: Convert smp_mb__*()
  arch,mips: Convert smp_mb__*()
  arch,metag: Convert smp_mb__*()
  arch,m68k: Convert smp_mb__*()
  arch,m32r: Convert smp_mb__*()
  arch,ia64: Convert smp_mb__*()
  ...
2014-06-03 12:57:53 -07:00
Linus Torvalds 59a3d4c363 Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into next
Pull RCU changes from Ingo Molnar:
 "The main RCU changes in this cycle were:

   - RCU torture-test changes.

   - variable-name renaming cleanup.

   - update RCU documentation.

   - miscellaneous fixes.

   - patch to suppress RCU stall warnings while sysrq requests are being
     processed"

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (68 commits)
  rcu: Provide API to suppress stall warnings while sysrc runs
  rcu: Variable name changed in tree_plugin.h and used in tree.c
  torture: Remove unused definition
  torture: Remove __init from torture_init_begin/end
  torture: Check for multiple concurrent torture tests
  locktorture: Remove reference to nonexistent Kconfig parameter
  rcutorture: Run rcu_torture_writer at normal priority
  rcutorture: Note diffs from git commits
  rcutorture: Add missing destroy_timer_on_stack()
  rcutorture: Explicitly test synchronous grace-period primitives
  rcutorture:  Add tests for get_state_synchronize_rcu()
  rcutorture: Test RCU-sched primitives in TREE_PREEMPT_RCU kernels
  torture: Use elapsed time to detect hangs
  rcutorture: Check for rcu_torture_fqs creation errors
  torture: Better summary diagnostics for build failures
  torture: Notice if an all-zero cpumask is passed inside a critical section
  rcutorture: Make rcu_torture_reader() use cond_resched()
  sched,rcu: Make cond_resched() report RCU quiescent states
  percpu: Fix raw_cpu_inc_return()
  rcutorture: Export RCU grace-period kthread wait state to rcutorture
  ...
2014-06-03 12:35:05 -07:00
Linus Torvalds 32439700fe Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
 "Various fixlets, mostly related to the (root-only) SCHED_DEADLINE
  policy, but also a hotplug bug fix and a fix for a NR_CPUS related
  overallocation bug causing a suspend/resume regression"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched: Fix hotplug vs. set_cpus_allowed_ptr()
  sched/cpupri: Replace NR_CPUS arrays
  sched/deadline: Replace NR_CPUS arrays
  sched/deadline: Restrict user params max value to 2^63 ns
  sched/deadline: Change sched_getparam() behaviour vs SCHED_DEADLINE
  sched: Disallow sched_attr::sched_policy < 0
  sched: Make sched_setattr() correctly return -EFBIG
2014-06-01 18:26:59 -07:00