Commit Graph

1293 Commits

Author SHA1 Message Date
Linus Torvalds 051b29f279 Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fix from Thomas Gleixner:
 "A single fix to prevent math underflow in the numa balancing code"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/numa: Fix math underflow in task_tick_numa()
2015-11-15 09:35:33 -08:00
Rik van Riel 25b3e5a334 sched/numa: Fix math underflow in task_tick_numa()
The NUMA balancing code implements delays in scanning by
advancing curr->node_stamp beyond curr->se.sum_exec_runtime.

With unsigned math, that creates an underflow, which results
in task_numa_work being queued all the time, even when we
don't want to.

Avoiding the math underflow makes it possible to reduce CPU
overhead in the NUMA balancing code.

Reported-and-tested-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: mgorman@suse.de
Link: http://lkml.kernel.org/r/1446756983-28173-2-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-11-09 16:13:27 +01:00
Linus Torvalds 933425fb00 s390: A bunch of fixes and optimizations for interrupt and time
handling.
 
 PPC: Mostly bug fixes.
 
 ARM: No big features, but many small fixes and prerequisites including:
 - a number of fixes for the arch-timer
 - introducing proper level-triggered semantics for the arch-timers
 - a series of patches to synchronously halt a guest (prerequisite for
   IRQ forwarding)
 - some tracepoint improvements
 - a tweak for the EL2 panic handlers
 - some more VGIC cleanups getting rid of redundant state
 
 x86: quite a few changes:
 
 - support for VT-d posted interrupts (i.e. PCI devices can inject
 interrupts directly into vCPUs).  This introduces a new component (in
 virt/lib/) that connects VFIO and KVM together.  The same infrastructure
 will be used for ARM interrupt forwarding as well.
 
 - more Hyper-V features, though the main one Hyper-V synthetic interrupt
 controller will have to wait for 4.5.  These will let KVM expose Hyper-V
 devices.
 
 - nested virtualization now supports VPID (same as PCID but for vCPUs)
 which makes it quite a bit faster
 
 - for future hardware that supports NVDIMM, there is support for clflushopt,
 clwb, pcommit
 
 - support for "split irqchip", i.e. LAPIC in kernel + IOAPIC/PIC/PIT in
 userspace, which reduces the attack surface of the hypervisor
 
 - obligatory smattering of SMM fixes
 
 - on the guest side, stable scheduler clock support was rewritten to not
 require help from the hypervisor.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQEcBAABAgAGBQJWO2IQAAoJEL/70l94x66D/K0H/3AovAgYmJQToZlimsktMk6a
 f2xhdIqfU5lIQQh5uNBCfL3o9o8H9Py1ym7aEw3fmztPHHJYc91oTatt2UEKhmEw
 VtZHp/dFHt3hwaIdXmjRPEXiYctraKCyrhaUYdWmUYkoKi7lW5OL5h+S7frG2U6u
 p/hFKnHRZfXHr6NSgIqvYkKqtnc+C0FWY696IZMzgCksOO8jB1xrxoSN3tANW3oJ
 PDV+4og0fN/Fr1capJUFEc/fejREHneANvlKrLaa8ht0qJQutoczNADUiSFLcMPG
 iHljXeDsv5eyjMtUuIL8+MPzcrIt/y4rY41ZPiKggxULrXc6H+JJL/e/zThZpXc=
 =iv2z
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Paolo Bonzini:
 "First batch of KVM changes for 4.4.

  s390:
     A bunch of fixes and optimizations for interrupt and time handling.

  PPC:
     Mostly bug fixes.

  ARM:
     No big features, but many small fixes and prerequisites including:

      - a number of fixes for the arch-timer

      - introducing proper level-triggered semantics for the arch-timers

      - a series of patches to synchronously halt a guest (prerequisite
        for IRQ forwarding)

      - some tracepoint improvements

      - a tweak for the EL2 panic handlers

      - some more VGIC cleanups getting rid of redundant state

  x86:
     Quite a few changes:

      - support for VT-d posted interrupts (i.e. PCI devices can inject
        interrupts directly into vCPUs).  This introduces a new
        component (in virt/lib/) that connects VFIO and KVM together.
        The same infrastructure will be used for ARM interrupt
        forwarding as well.

      - more Hyper-V features, though the main one Hyper-V synthetic
        interrupt controller will have to wait for 4.5.  These will let
        KVM expose Hyper-V devices.

      - nested virtualization now supports VPID (same as PCID but for
        vCPUs) which makes it quite a bit faster

      - for future hardware that supports NVDIMM, there is support for
        clflushopt, clwb, pcommit

      - support for "split irqchip", i.e.  LAPIC in kernel +
        IOAPIC/PIC/PIT in userspace, which reduces the attack surface of
        the hypervisor

      - obligatory smattering of SMM fixes

      - on the guest side, stable scheduler clock support was rewritten
        to not require help from the hypervisor"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (123 commits)
  KVM: VMX: Fix commit which broke PML
  KVM: x86: obey KVM_X86_QUIRK_CD_NW_CLEARED in kvm_set_cr0()
  KVM: x86: allow RSM from 64-bit mode
  KVM: VMX: fix SMEP and SMAP without EPT
  KVM: x86: move kvm_set_irq_inatomic to legacy device assignment
  KVM: device assignment: remove pointless #ifdefs
  KVM: x86: merge kvm_arch_set_irq with kvm_set_msi_inatomic
  KVM: x86: zero apic_arb_prio on reset
  drivers/hv: share Hyper-V SynIC constants with userspace
  KVM: x86: handle SMBASE as physical address in RSM
  KVM: x86: add read_phys to x86_emulate_ops
  KVM: x86: removing unused variable
  KVM: don't pointlessly leave KVM_COMPAT=y in non-KVM configs
  KVM: arm/arm64: Merge vgic_set_lr() and vgic_sync_lr_elrsr()
  KVM: arm/arm64: Clean up vgic_retire_lr() and surroundings
  KVM: arm/arm64: Optimize away redundant LR tracking
  KVM: s390: use simple switch statement as multiplexer
  KVM: s390: drop useless newline in debugging data
  KVM: s390: SCA must not cross page boundaries
  KVM: arm: Do not indent the arguments of DECLARE_BITMAP
  ...
2015-11-05 16:26:26 -08:00
Linus Torvalds 69234acee5 Merge branch 'for-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
 "The cgroup core saw several significant updates this cycle:

   - percpu_rwsem for threadgroup locking is reinstated.  This was
     temporarily dropped due to down_write latency issues.  Oleg's
     rework of percpu_rwsem which is scheduled to be merged in this
     merge window resolves the issue.

   - On the v2 hierarchy, when controllers are enabled and disabled, all
     operations are atomic and can fail and revert cleanly.  This allows
     ->can_attach() failure which is necessary for cpu RT slices.

   - Tasks now stay associated with the original cgroups after exit
     until released.  This allows tracking resources held by zombies
     (e.g.  pids) and makes it easy to find out where zombies came from
     on the v2 hierarchy.  The pids controller was broken before these
     changes as zombies escaped the limits; unfortunately, updating this
     behavior required too many invasive changes and I don't think it's
     a good idea to backport them, so the pids controller on 4.3, the
     first version which included the pids controller, will stay broken
     at least until I'm sure about the cgroup core changes.

   - Optimization of a couple common tests using static_key"

* 'for-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (38 commits)
  cgroup: fix race condition around termination check in css_task_iter_next()
  blkcg: don't create "io.stat" on the root cgroup
  cgroup: drop cgroup__DEVEL__legacy_files_on_dfl
  cgroup: replace error handling in cgroup_init() with WARN_ON()s
  cgroup: add cgroup_subsys->free() method and use it to fix pids controller
  cgroup: keep zombies associated with their original cgroups
  cgroup: make css_set_rwsem a spinlock and rename it to css_set_lock
  cgroup: don't hold css_set_rwsem across css task iteration
  cgroup: reorganize css_task_iter functions
  cgroup: factor out css_set_move_task()
  cgroup: keep css_set and task lists in chronological order
  cgroup: make cgroup_destroy_locked() test cgroup_is_populated()
  cgroup: make css_sets pin the associated cgroups
  cgroup: relocate cgroup_[try]get/put()
  cgroup: move check_for_release() invocation
  cgroup: replace cgroup_has_tasks() with cgroup_is_populated()
  cgroup: make cgroup->nr_populated count the number of populated css_sets
  cgroup: remove an unused parameter from cgroup_task_migrate()
  cgroup: fix too early usage of static_branch_disable()
  cgroup: make cgroup_update_dfl_csses() migrate all target processes atomically
  ...
2015-11-05 14:51:32 -08:00
Linus Torvalds 53528695ff Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler changes from Ingo Molnar:
 "The main changes in this cycle were:

   - sched/fair load tracking fixes and cleanups (Byungchul Park)

   - Make load tracking frequency scale invariant (Dietmar Eggemann)

   - sched/deadline updates (Juri Lelli)

   - stop machine fixes, cleanups and enhancements for bugs triggered by
     CPU hotplug stress testing (Oleg Nesterov)

   - scheduler preemption code rework: remove PREEMPT_ACTIVE and related
     cleanups (Peter Zijlstra)

   - Rework the sched_info::run_delay code to fix races (Peter Zijlstra)

   - Optimize per entity utilization tracking (Peter Zijlstra)

   - ... misc other fixes, cleanups and smaller updates"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (57 commits)
  sched: Don't scan all-offline ->cpus_allowed twice if !CONFIG_CPUSETS
  sched: Move cpu_active() tests from stop_two_cpus() into migrate_swap_stop()
  sched: Start stopper early
  stop_machine: Kill cpu_stop_threads->setup() and cpu_stop_unpark()
  stop_machine: Kill smp_hotplug_thread->pre_unpark, introduce stop_machine_unpark()
  stop_machine: Change cpu_stop_queue_two_works() to rely on stopper->enabled
  stop_machine: Introduce __cpu_stop_queue_work() and cpu_stop_queue_two_works()
  stop_machine: Ensure that a queued callback will be called before cpu_stop_park()
  sched/x86: Fix typo in __switch_to() comments
  sched/core: Remove a parameter in the migrate_task_rq() function
  sched/core: Drop unlikely behind BUG_ON()
  sched/core: Fix task and run queue sched_info::run_delay inconsistencies
  sched/numa: Fix task_tick_fair() from disabling numa_balancing
  sched/core: Add preempt_count invariant check
  sched/core: More notrace annotations
  sched/core: Kill PREEMPT_ACTIVE
  sched/core, sched/x86: Kill thread_info::saved_preempt_count
  sched/core: Simplify preempt_count tests
  sched/core: Robustify preemption leak checks
  sched/core: Stop setting PREEMPT_ACTIVE
  ...
2015-11-03 18:03:50 -08:00
Ingo Molnar e4340bbb07 Merge branch 'linus' into core/rcu, to fix up a semantic conflict
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-28 13:17:20 +01:00
Peter Zijlstra 0aaafaabfc sched/core: Add missing lockdep_unpin() annotations
Luca and Wanpeng reported two missing annotations that led to
false lockdep complaints. Add the missing annotations.

Reported-by: Luca Abeni <luca.abeni@unitn.it>
Reported-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: cbce1a6867 ("sched,lockdep: Employ lock pinning")
Link: http://lkml.kernel.org/r/20151023095008.GY17308@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-23 12:02:10 +02:00
Oleg Nesterov e73e85f059 sched: Don't scan all-offline ->cpus_allowed twice if !CONFIG_CPUSETS
If CONFIG_CPUSETS=n then "case cpuset" changes the state and runs
the already failed for_each_cpu() loop again for no reason.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: heiko.carstens@de.ibm.com
Link: http://lkml.kernel.org/r/20151010185315.GA24100@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-20 10:25:57 +02:00
Peter Zijlstra 62694cd513 sched: Move cpu_active() tests from stop_two_cpus() into migrate_swap_stop()
The cpu_active() tests are not fundamentally part of stop_two_cpus(),
move then into the scheduler where they belong.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-20 10:25:56 +02:00
Peter Zijlstra 07f06cb3b5 sched: Start stopper early
Ensure the stopper thread is active 'early', because the load balancer
pretty much assumes that its available. And when 'online && active' the
load-balancer is fully available.

Not only the numa balancing stop_two_cpus() caller relies on it, but
also the self migration stuff does, and at CPU_ONLINE time the cpu
really is 'free' to run anything.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: heiko.carstens@de.ibm.com
Link: http://lkml.kernel.org/r/20151009160054.GA10176@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-20 10:25:55 +02:00
Ingo Molnar 6af597de62 Merge branch 'sched/urgent' into sched/core, to pick up fixes and resolve conflicts
Conflicts:
	kernel/sched/fair.c

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-20 10:18:16 +02:00
Luca Abeni 5aa5050787 sched/deadline: Fix migration of SCHED_DEADLINE tasks
Commit:

  9d51426242 ("sched/deadline: Reduce rq lock contention by eliminating locking of non-feasible target")

broke select_task_rq_dl() and find_lock_later_rq(), because it introduced
a comparison between the local task's deadline and dl.earliest_dl.curr of
the remote queue.

However, if the remote runqueue does not contain any SCHED_DEADLINE
task its earliest_dl.curr is 0 (always smaller than the deadline of
the local task) and the remote runqueue is not selected for pushing.

As a result, if an application creates multiple SCHED_DEADLINE
threads, they will never be pushed to runqueues that do not already
contain SCHED_DEADLINE tasks.

This patch fixes the issue by checking if dl.dl_nr_running == 0.

Signed-off-by: Luca Abeni <luca.abeni@unitn.it>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpeng.li@linux.intel.com>
Fixes: 9d51426242 ("sched/deadline: Reduce rq lock contention by eliminating locking of non-feasible target")
Link: http://lkml.kernel.org/r/1444982781-15608-1-git-send-email-luca.abeni@unitn.it
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-20 10:13:36 +02:00
Frederic Weisbecker 0baabb385e nohz: Revert "nohz: Set isolcpus when nohz_full is set"
This reverts:

  8cb9764fc8 ("nohz: Set isolcpus when nohz_full is set")

We assumed that full-nohz users always want scheduler isolation on full
dynticks CPUs, therefore we included full-nohz CPUs on cpu_isolated_map.

This means that tasks run by default on CPUs outside the nohz_full range
unless their affinity is explicity overwritten.

This suits pure isolation workloads but when the machine is needed to
run common workloads, the available sets of CPUs to run common tasks
becomes reduced.

We reach an extreme case when CONFIG_NO_HZ_FULL_ALL is enabled as it
leaves only CPU 0 for non-isolation tasks, which makes people think that
their supercomputer regressed to 90's UP - which is true in a sense.

Some full-nohz users appear to be interested in running normal workloads
either before or after an isolation workload. Full-nohz isn't optimized
toward normal workloads but it's still better than UP performance.

We are reaching a limitation in kernel presets here. Lets revert this
cpu_isolated_map inclusion and let userspace do its own scheduler
isolation using cpusets or explicit affinity settings.

Reported-by: Ingo Molnar <mingo@kernel.org>
Reported-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Link: http://lkml.kernel.org/r/1444663283-30068-1-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-20 10:13:36 +02:00
Yuyang Du 3e386d56ba sched/fair: Update task group's load_avg after task migration
When cfs_rq has cfs_rq->removed_load_avg set (when a task migrates from
this cfs_rq), we need to update its contribution to the group's load_avg.

This should not increase tg's update too much, because in most cases, the
cfs_rq has already decayed its load_avg.

Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Yuyang Du <yuyang.du@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1444699103-20272-2-git-send-email-yuyang.du@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-20 10:13:35 +02:00
Yuyang Du fde7d22e01 sched/fair: Fix overly small weight for interactive group entities
Commit:

  9d89c257df ("sched/fair: Rewrite runnable load and utilization average tracking")

led to an overly small weight for interactive group entities. The bad case
can be easily reproduced when a number of CPU hogs compete for the CPUs
at the same time (thanks to Mike). This is largly because the task group's
load average tracking cross CPUs lags behind the real changes.

To fix this we accelerate the group share distribution process by using
the load.weight of the cfs_rq. This may increase the entire group's
share, but we have to do so to protect the (fragile) interactive
tasks, especially from CPU hogs.

Reported-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Signed-off-by: Yuyang Du <yuyang.du@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1444699103-20272-1-git-send-email-yuyang.du@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-20 10:13:34 +02:00
Ingo Molnar c13dc31adb Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull RCU updates from Paul E. McKenney:

  - Miscellaneous fixes. (Paul E. McKenney, Boqun Feng, Oleg Nesterov, Patrick Marlier)

  - Improvements to expedited grace periods. (Paul E. McKenney)

  - Performance improvements to and locktorture tests for percpu-rwsem.
    (Oleg Nesterov, Paul E. McKenney)

  - Torture-test changes. (Paul E. McKenney, Davidlohr Bueso)

  - Documentation updates. (Paul E. McKenney)

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-19 10:09:54 +02:00
Tejun Heo 2e91fa7f6d cgroup: keep zombies associated with their original cgroups
cgroup_exit() is called when a task exits and disassociates the
exiting task from its cgroups and half-attach it to the root cgroup.
This is unnecessary and undesirable.

No controller actually needs an exiting task to be disassociated with
non-root cgroups.  Both cpu and perf_event controllers update the
association to the root cgroup from their exit callbacks just to keep
consistent with the cgroup core behavior.

Also, this disassociation makes it difficult to track resources held
by zombies or determine where the zombies came from.  Currently, pids
controller is completely broken as it uncharges on exit and zombies
always escape the resource restriction.  With cgroup association being
reset on exit, fixing it is pretty painful.

There's no reason to reset cgroup membership on exit.  The zombie can
be removed from its css_set so that it doesn't show up on
"cgroup.procs" and thus can't be migrated or interfere with cgroup
removal.  It can still pin and point to the css_set so that its cgroup
membership is maintained.  This patch makes cgroup core keep zombies
associated with their cgroups at the time of exit.

* Previous patches decoupled populated_cnt tracking from css_set
  lifetime, so a dying task can be simply unlinked from its css_set
  while pinning and pointing to the css_set.  This keeps css_set
  association from task side alive while hiding it from "cgroup.procs"
  and populated_cnt tracking.  The css_set reference is dropped when
  the task_struct is freed.

* ->exit() callback no longer needs the css arguments as the
  associated css never changes once PF_EXITING is set.  Removed.

* cpu and perf_events controllers no longer need ->exit() callbacks.
  There's no reason to explicitly switch away on exit.  The final
  schedule out is enough.  The callbacks are removed.

* On traditional hierarchies, nothing changes.  "/proc/PID/cgroup"
  still reports "/" for all zombies.  On the default hierarchy,
  "/proc/PID/cgroup" keeps reporting the cgroup that the task belonged
  to at the time of exit.  If the cgroup gets removed before the task
  is reaped, " (deleted)" is appended.

v2: Build brekage due to missing dummy cgroup_free() when
    !CONFIG_CGROUP fixed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
2015-10-15 16:41:53 -04:00
Daniel Bristot de Oliveira 9babcd7929 sched, tracing: Stop/start critical timings around the idle=poll idle loop
When using idle=poll, the preemptoff tracer is always showing
the idle task as the culprit for long latencies. That happens
because critical timings are not stopped before idle loop. This
patch stops critical timings before entering the idle loop,
starting it again after the idle loop.

This problem does not affect the irqsoff tracer because
interruptions are enabled before entering the idle loop.

Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Reviewed-by: Luis Claudio R. Goncalves <lgoncalv@redhat.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/10fc3705874aef11dbe152a068b591a7be1899b4.1444314899.git.bristot@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-12 09:45:25 +02:00
Davidlohr Bueso 84778472e1 sched: Export sched_setscheduler_nocheck
The new locktorture rtmutex_lock tests exercise priority boosting, which
means that they need to set some tasks to real-time priority.  To do this,
they use sched_setscheduler_nocheck().  However, this is not exported to
modules, which results in the following error when building locktorture
as a module:

ERROR: "sched_setscheduler_nocheck" [kernel/locking/locktorture.ko] undefined!

This commit therefore adds an EXPORT_SYMBOL_GPL() to allow this function
to be invoked from locktorture when built as a module.

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2015-10-06 11:07:54 -07:00
xiaofeng.yan 5a4fd03685 sched/core: Remove a parameter in the migrate_task_rq() function
The parameter "int next_cpu" in the following function is unused:

  migrate_task_rq(struct task_struct *p, int next_cpu)

Remove it.

Signed-off-by: xiaofeng.yan <yanxiaofeng@inspur.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/1442991360-31945-1-git-send-email-yanxiaofeng@inspur.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-06 17:08:23 +02:00
Geliang Tang ce03e4137b sched/core: Drop unlikely behind BUG_ON()
(1) For !CONFIG_BUG cases, the bug call is a no-op, so we couldn't care
    less and the change is ok.

(2) PPC and MIPS, which HAVE_ARCH_BUG_ON, do not rely on branch predictions
    as it seems to be pointless [1] and thus callers should not be trying to
    push an optimization in the first place.

(3) For CONFIG_BUG and !HAVE_ARCH_BUG_ON cases, BUG_ON() contains an
    unlikely compiler flag already.

Hence, we can drop unlikely behind BUG_ON().

  [1] http://lkml.iu.edu/hypermail/linux/kernel/1101.3/02289.html

Signed-off-by: Geliang Tang <geliangtang@163.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/6fa7125979f98bbeac26e268271769b6ca935c8d.1444051018.git.geliangtang@163.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-06 17:08:22 +02:00
Peter Zijlstra 1de64443d7 sched/core: Fix task and run queue sched_info::run_delay inconsistencies
Mike Meyer reported the following bug:

> During evaluation of some performance data, it was discovered thread
> and run queue run_delay accounting data was inconsistent with the other
> accounting data that was collected.  Further investigation found under
> certain circumstances execution time was leaking into the task and
> run queue accounting of run_delay.
>
> Consider the following sequence:
>
>     a. thread is running.
>     b. thread moves beween cgroups, changes scheduling class or priority.
>     c. thread sleeps OR
>     d. thread involuntarily gives up cpu.
>
> a. implies:
>
>     thread->sched_info.last_queued = 0
>
> a. and b. results in the following:
>
>     1. dequeue_task(rq, thread)
>
>            sched_info_dequeued(rq, thread)
>                delta = 0
>
>                sched_info_reset_dequeued(thread)
>                    thread->sched_info.last_queued = 0
>
>                thread->sched_info.run_delay += delta
>
>     2. enqueue_task(rq, thread)
>
>            sched_info_queued(rq, thread)
>
>                /* thread is still on cpu at this point. */
>                thread->sched_info.last_queued = task_rq(thread)->clock;
>
> c. results in:
>
>     dequeue_task(rq, thread)
>
>         sched_info_dequeued(rq, thread)
>
>             /* delta is execution time not run_delay. */
>             delta = task_rq(thread)->clock - thread->sched_info.last_queued
>
>         sched_info_reset_dequeued(thread)
>             thread->sched_info.last_queued = 0
>
>         thread->sched_info.run_delay += delta
>
>     Since thread was running between enqueue_task(rq, thread) and
>     dequeue_task(rq, thread), the delta above is really execution
>     time and not run_delay.
>
> d. results in:
>
>     __sched_info_switch(thread, next_thread)
>
>         sched_info_depart(rq, thread)
>
>             sched_info_queued(rq, thread)
>
>                 /* last_queued not updated due to being non-zero */
>                 return
>
>     Since thread was running between enqueue_task(rq, thread) and
>     __sched_info_switch(thread, next_thread), the execution time
>     between enqueue_task(rq, thread) and
>     __sched_info_switch(thread, next_thread) now will become
>     associated with run_delay due to when last_queued was last updated.
>

This alternative patch solves the problem by not calling
sched_info_{de,}queued() in {de,en}queue_task(). Therefore the
sched_info state is preserved and things work as expected.

By inlining the {de,en}queue_task() functions the new condition
becomes (mostly) a compile-time constant and we'll not emit any new
branch instructions.

It even shrinks the code (due to inlining {en,de}queue_task()):

$ size defconfig-build/kernel/sched/core.o defconfig-build/kernel/sched/core.o.orig
   text    data     bss     dec     hex filename
  64019   23378    2344   89741   15e8d defconfig-build/kernel/sched/core.o
  64149   23378    2344   89871   15f0f defconfig-build/kernel/sched/core.o.orig

Reported-by: Mike Meyer <Mike.Meyer@Teradata.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/20150930154413.GO3604@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-06 17:08:22 +02:00
Srikar Dronamraju b52da86e0a sched/numa: Fix task_tick_fair() from disabling numa_balancing
If static branch 'sched_numa_balancing' is enabled, it should kickstart
NUMA balancing through task_tick_numa(). However the following commit:

  2a595721a1 ("sched/numa: Convert sched_numa_balancing to a static_branch")

erroneously disables this.

Fix this anomaly by enabling task_tick_numa() when the static branch
'sched_numa_balancing' is enabled.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/1443752305-27413-1-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-06 17:08:21 +02:00
Peter Zijlstra e2bf1c4b17 sched/core: Add preempt_count invariant check
Ingo requested I keep my debug check for the preempt_count invariant.

Requested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-06 17:08:20 +02:00
Peter Zijlstra 499d79559f sched/core: More notrace annotations
preempt_schedule_common() is marked notrace, but it does not use
_notrace() preempt_count functions and __schedule() is also not marked
notrace, which means that its perfectly possible to end up in the
tracer from preempt_schedule_common().

Steve says:

  | Yep, there's some history to this. This was originally the issue that
  | caused function tracing to go into infinite recursion. But now we have
  | preempt_schedule_notrace(), which is used by the function tracer, and
  | that function must not be traced till preemption is disabled.
  |
  | Now if function tracing is running and we take an interrupt when
  | NEED_RESCHED is set, it calls
  |
  |   preempt_schedule_common() (not traced)
  |
  | But then that calls preempt_disable() (traced)
  |
  | function tracer calls preempt_disable_notrace() followed by
  | preempt_enable_notrace() which will see NEED_RESCHED set, and it will
  | call preempt_schedule_notrace(), which stops the recursion, but
  | still calls __schedule() here, and that means when we return, we call
  | the __schedule() from preempt_schedule_common().
  |
  | That said, I prefer this patch. Preemption is disabled before calling
  | __schedule(), and we get rid of a one round recursion with the
  | scheduler.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-06 17:08:20 +02:00
Peter Zijlstra da7142e2ed sched/core: Simplify preempt_count tests
Since we stopped setting PREEMPT_ACTIVE, there is no need to mask it
out of preempt_count() tests.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-06 17:08:17 +02:00
Peter Zijlstra 1dc0fffc48 sched/core: Robustify preemption leak checks
When we warn about a preempt_count leak; reset the preempt_count to
the known good value such that the problem does not ripple forward.

This is most important on x86 which has a per cpu preempt_count that is
not saved/restored (after this series). So if you schedule with an
invalid (!2*PREEMPT_DISABLE_OFFSET) preempt_count the next task is
messed up too.

Enforcing this invariant limits the borkage to just the one task.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-06 17:08:17 +02:00
Peter Zijlstra 3d8f74dd4c sched/core: Stop setting PREEMPT_ACTIVE
Now that nothing tests for PREEMPT_ACTIVE anymore, stop setting it.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-06 17:08:16 +02:00
Peter Zijlstra c73464b1c8 sched/core: Fix trace_sched_switch()
__trace_sched_switch_state() is the last remaining PREEMPT_ACTIVE
user, move trace_sched_switch() from prepare_task_switch() to
__schedule() and propagate the @preempt argument.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-06 17:08:15 +02:00
Peter Zijlstra fc13aebab7 sched/core: Add preempt argument to __schedule()
There is only a single PREEMPT_ACTIVE use in the regular __schedule()
path and that is to circumvent the task->state check. Since the code
setting PREEMPT_ACTIVE is the immediate caller of __schedule() we can
replace this with a function argument.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-06 17:08:15 +02:00
Peter Zijlstra 609ca06638 sched/core: Create preempt_count invariant
Assuming units of PREEMPT_DISABLE_OFFSET for preempt_count() numbers.

Now that TASK_DEAD no longer results in preempt_count() == 3 during
scheduling, we will always call context_switch() with preempt_count()
== 2.

However, we don't always end up with preempt_count() == 2 in
finish_task_switch() because new tasks get created with
preempt_count() == 1.

Create FORK_PREEMPT_COUNT and set it to 2 and use that in the right
places. Note that we cannot use INIT_PREEMPT_COUNT as that serves
another purpose (boot).

After this, preempt_count() is invariant across the context switch,
with exception of PREEMPT_ACTIVE.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-06 17:08:14 +02:00
Peter Zijlstra b99def8b96 sched/core: Rework TASK_DEAD preemption exception
TASK_DEAD is special in that the final schedule call from do_exit()
must be done with preemption disabled.

This means we end up scheduling with a preempt_count() higher than
usual (3 instead of the 'expected' 2).

Since future patches will want to rely on an invariant
preempt_count() value during schedule, fix this up.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-06 17:08:13 +02:00
Ingo Molnar fe19159225 Merge branch 'sched/urgent' into sched/core, to pick up fixes before applying new changes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-06 17:05:36 +02:00
Peter Zijlstra 95913d9791 sched/core: Fix TASK_DEAD race in finish_task_switch()
So the problem this patch is trying to address is as follows:

        CPU0                            CPU1

        context_switch(A, B)
                                        ttwu(A)
                                          LOCK A->pi_lock
                                          A->on_cpu == 0
        finish_task_switch(A)
          prev_state = A->state  <-.
          WMB                      |
          A->on_cpu = 0;           |
          UNLOCK rq0->lock         |
                                   |    context_switch(C, A)
                                   `--  A->state = TASK_DEAD
          prev_state == TASK_DEAD
            put_task_struct(A)
                                        context_switch(A, C)
                                        finish_task_switch(A)
                                          A->state == TASK_DEAD
                                            put_task_struct(A)

The argument being that the WMB will allow the load of A->state on CPU0
to cross over and observe CPU1's store of A->state, which will then
result in a double-drop and use-after-free.

Now the comment states (and this was true once upon a long time ago)
that we need to observe A->state while holding rq->lock because that
will order us against the wakeup; however the wakeup will not in fact
acquire (that) rq->lock; it takes A->pi_lock these days.

We can obviously fix this by upgrading the WMB to an MB, but that is
expensive, so we'd rather avoid that.

The alternative this patch takes is: smp_store_release(&A->on_cpu, 0),
which avoids the MB on some archs, but not important ones like ARM.

Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: <stable@vger.kernel.org> # v3.1+
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Cc: manfred@colorfullife.com
Cc: will.deacon@arm.com
Fixes: e4a52bcb9a ("sched: Remove rq->lock from the first half of ttwu()")
Link: http://lkml.kernel.org/r/20150929124509.GG3816@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-06 17:05:17 +02:00
Andrey Smetanin 9eec50b8bb kvm/x86: Hyper-V HV_X64_MSR_VP_RUNTIME support
HV_X64_MSR_VP_RUNTIME msr used by guest to get
"the time the virtual processor consumes running guest code,
and the time the associated logical processor spends running
hypervisor code on behalf of that guest."

Calculation of this time is performed by task_cputime_adjusted()
for vcpu task.

Necessary to support loading of winhv.sys in guest, which in turn is
required to support Windows VMBus.

Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Gleb Natapov <gleb@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-10-01 15:06:33 +02:00
Linus Torvalds 73f479b243 Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fix from Thomas Gleixner:
 "A single bug fix for the scheduler to prevent dequeueing of the idle
  task when setting the cpus allowed mask"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched: Fix crash trying to dequeue/enqueue the idle thread
2015-09-27 12:50:27 -04:00
Juergen Gross c6e1e7b5b7 sched/core: Make 'sched_domain_topology' declaration static
The 'sched_domain_topology' variable is only used within kernel/sched/core.c.
Make it static.

Signed-off-by: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1442918939-9907-1-git-send-email-jgross@suse.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-23 10:19:12 +02:00
Juri Lelli 269b26a5ef sched/rt: Make (do_)balance_runtime() return void
The return value of (do_)balance_runtime() is not consumed by anybody.
Make them return void.

Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1441188096-23021-5-git-send-email-juri.lelli@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-23 09:51:26 +02:00
Juri Lelli 2726d6ce38 sched/deadline: Unify dl_time_before() usage
Move dl_time_before() static definition in include/linux/sched/deadline.h
so that it can be used by different parties without being re-defined.

Reported-by: Luca Abeni <luca.abeni@unitn.it>
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1441188096-23021-3-git-send-email-juri.lelli@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-23 09:51:25 +02:00
Andrea Arcangeli ac5be6b47e userfaultfd: revert "userfaultfd: waitqueue: add nr wake parameter to __wake_up_locked_key"
This reverts commit 51360155ec and adapts
fs/userfaultfd.c to use the old version of that function.

It didn't look robust to call __wake_up_common with "nr == 1" when we
absolutely require wakeall semantics, but we've full control of what we
insert in the two waitqueue heads of the blocked userfaults.  No
exclusive waitqueue risks to be inserted into those two waitqueue heads
so we can as well stick to "nr == 1" of the old code and we can rely
purely on the fact no waitqueue inserted in one of the two waitqueue
heads we must enforce as wakeall, has wait->flags WQ_FLAG_EXCLUSIVE set.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Shuah Khan <shuahkh@osg.samsung.com>
Cc: Thierry Reding <treding@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-22 15:09:53 -07:00
Linus Torvalds 3ae839454e Mostly stable material, a lot of ARM fixes.
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQEcBAABAgAGBQJV+/ucAAoJEL/70l94x66DV8YH/1KDym/1GJ+/Br/YkHZnM53l
 3Q0PwSLu9cNcIL9lUuDLwGTaVj+y8ud1Hjr/uzvKwivktmUYVZhkdtnZmnanvGOM
 qKB9K3nFXCPx8uqy8Dn7fOwEKcg9FmDOTTkWy13HDnXO+V4crSVVt+rPw+6FUMld
 NV5tYdw9Lu7y3XrveDebPWaPtyDL7OAagzmeK47eMffxG7X9Hf1H2aT7HueRi7x/
 SkLIe3gmiOWmHVJDPE9TOmFYIj19gywDFysKes1gdVJLVUIXiELMT7SrvAYnToVB
 zISIEj7Zx4SINPxpf2dUn8REm7NsmJY+PffLIl/Nv+ozGggFQGFH0SMZ08p0bxw=
 =tfmn
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM fixes from Paolo Bonzini:
 "Mostly stable material, a lot of ARM fixes"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (22 commits)
  sched: access local runqueue directly in single_task_running
  arm/arm64: KVM: Remove 'config KVM_ARM_MAX_VCPUS'
  arm64: KVM: Remove all traces of the ThumbEE registers
  arm: KVM: Disable virtual timer even if the guest is not using it
  arm64: KVM: Disable virtual timer even if the guest is not using it
  arm/arm64: KVM: vgic: Check for !irqchip_in_kernel() when mapping resources
  KVM: s390: Replace incorrect atomic_or with atomic_andnot
  arm: KVM: Fix incorrect device to IPA mapping
  arm64: KVM: Fix user access for debug registers
  KVM: vmx: fix VPID is 0000H in non-root operation
  KVM: add halt_attempted_poll to VCPU stats
  kvm: fix zero length mmio searching
  kvm: fix double free for fast mmio eventfd
  kvm: factor out core eventfd assign/deassign logic
  kvm: don't try to register to KVM_FAST_MMIO_BUS for non mmio eventfd
  KVM: make the declaration of functions within 80 characters
  KVM: arm64: add workaround for Cortex-A57 erratum #852523
  KVM: fix polling for guest halt continued even if disable it
  arm/arm64: KVM: Fix PSCI affinity info return value for non valid cores
  arm64: KVM: set {v,}TCR_EL2 RES1 bits
  ...
2015-09-18 09:23:08 -07:00
Dominik Dingel 00cc163381 sched: access local runqueue directly in single_task_running
Commit 2ee507c472 ("sched: Add function single_task_running to let a task
check if it is the only task running on a cpu") referenced the current
runqueue with the smp_processor_id.  When CONFIG_DEBUG_PREEMPT is enabled,
that is only allowed if preemption is disabled or the currrent task is
bound to the local cpu (e.g. kernel worker).

With commit f781951299 ("kvm: add halt_poll_ns module parameter") KVM
calls single_task_running. If CONFIG_DEBUG_PREEMPT is enabled that
generates a lot of kernel messages.

To avoid adding preemption in that cases, as it would limit the usefulness,
we change single_task_running to access directly the cpu local runqueue.

Cc: Tim Chen <tim.c.chen@linux.intel.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Fixes: 2ee507c472
Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-09-18 13:47:59 +02:00
Leo Yan 79a89f92cb sched/fair: Remove unnecessary parameter for group_classify()
The group_classify() function does not use the "env" parameter, so remove it.
Also unify code to always use group_classify() to calculate group's
load type.

Signed-off-by: Leo Yan <leo.yan@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1442314605-14838-1-git-send-email-leo.yan@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-18 09:23:16 +02:00
Leo Yan 84fb5a182d sched/fair: Polish comments for LOAD_AVG_MAX
Macro LOAD_AVG_MAX is defined far away from the precompuated tables
for decay calculation in code; So explicitly comments for this.

Also fix one typo: s/LOAD_MAX_AVG/LOAD_AVG_MAX.

Signed-off-by: Leo Yan <leo.yan@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1442314657-14949-1-git-send-email-leo.yan@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-18 09:23:15 +02:00
Rik van Riel 4620f8c1fd sched/numa: Limit the amount of virtual memory scanned in task_numa_work()
Currently task_numa_work() scans up to numa_balancing_scan_size_mb worth
of memory per invocation, but only counts memory areas that have at
least one PTE that is still present and not marked for numa hint faulting.

It will skip over arbitarily large amounts of memory that are either
unused, full of swap ptes, or full of PTEs that were already marked
for NUMA hint faults but have not been faulted on yet.

This can cause excessive amounts of CPU use, due to there being
essentially no upper limit on the scan rate of very large processes
that are not yet in a phase where they are actively accessing old
memory pages (eg. they are still initializing their data).

Avoid that problem by placing an upper limit on the amount of virtual
memory that task_numa_work() scans in each invocation. This can be a
higher limit than "pages", to ensure the task still skips over unused
areas fairly quickly.

While we are here, also fix the "nr_pte_updates" logic, so it only
counts page ranges with ptes in them.

Reported-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20150911090027.4a7987bd@annuminas.surriel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-18 09:23:14 +02:00
Henrik Austad 20f9cd2acb sched/core: Make policy-testing consistent
Most of the policy-tests are done via the <class>_policy() helpers with
the notable exception of idle. A new wrapper for valid_policy() has also
been added to improve readability  in set_load_weight().

This commit does not change the logical behavior of the scheduler core.

Signed-off-by: Henrik Austad <henrik@austad.us>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/1441810841-4756-1-git-send-email-henrik@austad.us
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-18 09:23:13 +02:00
Peter Zijlstra de9b8f5dcb sched: Fix crash trying to dequeue/enqueue the idle thread
Sasha reports that his virtual machine tries to schedule the idle
thread since commit 6c37067e27 ("sched: Change the
sched_class::set_cpus_allowed() calling context").

Hit trace shows this happening from idle_thread_get()->init_idle(),
which is the _second_ init_idle() invocation on that task_struct, the
first being done through idle_init()->fork_idle(). (this code is
insane...)

Because we call init_idle() twice in a row, its ->sched_class ==
&idle_sched_class and ->on_rq = TASK_ON_RQ_QUEUED. This means
do_set_cpus_allowed() think we're queued and will call dequeue_task(),
which is implemented with BUG() for the idle class, seeing how
dequeueing the idle task is a daft thing.

Aside of the whole insanity of calling init_idle() _twice_, change the
code to call set_cpus_allowed_common() instead as this is 'obviously'
before the idle task gets ran etc..

Reported-by: Sasha Levin <sasha.levin@oracle.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 6c37067e27 ("sched: Change the sched_class::set_cpus_allowed() calling context")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-18 09:17:50 +02:00
Linus Torvalds c2ea72fd86 Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
 "A migrate_tasks() locking fix, and a late-coming nohz change plus a
  nohz debug check"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched: 'Annotate' migrate_tasks()
  nohz: Assert existing housekeepers when nohz full enabled
  nohz: Affine unpinned timers to housekeepers
2015-09-17 10:49:42 -07:00
Peter Zijlstra 006cdf025a sched/fair: Optimize per entity utilization tracking
Currently the load_{sum,avg} and util_{sum,avg} tracking is asymmetric
in that load tracking gets a 2^10 unit from the weight, but util gets
no such factor.

This results in more lost bits for util scaling and asymmetric scaling
rules.

Fix this by removing shifts, such that we gain the 2^10 factor from
scaling. There is no risk of overflowing the u32 as the max value is
now LOAD_AVG_MAX << 10, which is still well below UINT_MAX.

This further entangles the assumption that both LOAD and CAPACITY
shifts are the same (and 10) so put in an assertion for that.

This fixes the math for the LOAD_RESOLUTION != 0 case.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-13 09:53:02 +02:00
Dietmar Eggemann 6f2b04524f sched/fair: Defer calling scaling functions
Do not call the scaling functions in case time goes backwards or the
last update of the sched_avg structure has happened less than 1024ns
ago.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juri Lelli <Juri.Lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: daniel.lezcano@linaro.org <daniel.lezcano@linaro.org>
Cc: mturquette@baylibre.com <mturquette@baylibre.com>
Cc: pang.xunlei@zte.com.cn <pang.xunlei@zte.com.cn>
Cc: rjw@rjwysocki.net <rjw@rjwysocki.net>
Cc: sgurrappadi@nvidia.com <sgurrappadi@nvidia.com>
Cc: vincent.guittot@linaro.org <vincent.guittot@linaro.org>
Cc: yuyang.du@intel.com <yuyang.du@intel.com>
Link: http://lkml.kernel.org/r/55EDA2E9.8040900@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-13 09:53:01 +02:00