Commit Graph

18310 Commits

Author SHA1 Message Date
Rik van Riel a22b4b0123 sched/numa: Change scan period code to match intent
Reading through the scan period code and comment, it appears the
intent was to slow down NUMA scanning when a majority of accesses
are on the local node, specifically a local:remote ratio of 3:1.

However, the code actually tests local / (local + remote), and
the actual cut-off point was around 30% local accesses, well before
a task has actually converged on a node.

Changing the threshold to 7 means scanning slows down when a task
has around 70% of its accesses local, which appears to match the
intent of the code more closely.

Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: mgorman@suse.de
Cc: chegu_vinod@hp.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1403538095-31256-8-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-05 11:17:40 +02:00
Rik van Riel db015daedb sched/numa: Rework best node setting in task_numa_migrate()
Fix up the best node setting in task_numa_migrate() to deal with a task
in a pseudo-interleaved NUMA group, which is already running in the
best location.

Set the task's preferred nid to the current nid, so task migration is
not retried at a high rate.

Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: mgorman@suse.de
Cc: chegu_vinod@hp.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1403538095-31256-7-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-05 11:17:39 +02:00
Rik van Riel 0132c3e177 sched/numa: Examine a task move when examining a task swap
Running "perf bench numa mem -0 -m -P 1000 -p 8 -t 20" on a 4
node system results in 160 runnable threads on a system with 80
CPU threads.

Once a process has nearly converged, with 39 threads on one node
and 1 thread on another node, the remaining thread will be unable
to migrate to its preferred node through a task swap.

However, a simple task move would make the workload converge,
witout causing an imbalance.

Test for this unlikely occurrence, and attempt a task move to
the preferred nid when it happens.

 # Running main, "perf bench numa mem -p 8 -t 20 -0 -m -P 1000"

 ###
 # 160 tasks will execute (on 4 nodes, 80 CPUs):
 #         -1x     0MB global  shared mem operations
 #         -1x  1000MB process shared mem operations
 #         -1x     0MB thread  local  mem operations
 ###

 ###
 #
 #    0.0%  [0.2 mins]  0/0   1/1  36/2   0/0  [36/3 ] l:  0-0   (  0) {0-2}
 #    0.0%  [0.3 mins] 43/3  37/2  39/2  41/3  [ 6/10] l:  0-1   (  1) {1-2}
 #    0.0%  [0.4 mins] 42/3  38/2  40/2  40/2  [ 4/9 ] l:  1-2   (  1) [50.0%] {1-2}
 #    0.0%  [0.6 mins] 41/3  39/2  40/2  40/2  [ 2/9 ] l:  2-4   (  2) [50.0%] {1-2}
 #    0.0%  [0.7 mins] 40/2  40/2  40/2  40/2  [ 0/8 ] l:  3-5   (  2) [40.0%] (  41.8s converged)

Without this patch, this same perf bench numa mem run had to
rely on the scheduler load balancer to first balance out the
load (moving a random task), before a task swap could complete
the NUMA convergence.

The load balancer does not normally take action unless the load

difference exceeds 25%. Convergence times of over half an hour
have been observed without this patch.

With this patch, the NUMA balancing code will simply migrate the
task, if that does not cause an imbalance.

Also skip examining a CPU in detail if the improvement on that CPU
is no more than the best we already have.

Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: chegu_vinod@hp.com
Cc: mgorman@suse.de
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-ggthh0rnh0yua6o5o3p6cr1o@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-05 11:17:38 +02:00
Rik van Riel 1c5d3eb375 sched/numa: Simplify task_numa_compare()
When a task is part of a numa_group, the comparison should always use
the group weight, in order to make workloads converge.

Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: chegu_vinod@hp.com
Cc: mgorman@suse.de
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1403538378-31571-4-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-05 11:17:37 +02:00
Rik van Riel 6dc1a672ab sched/numa: Use effective_load() to balance NUMA loads
When CONFIG_FAIR_GROUP_SCHED is enabled, the load that a task places
on a CPU is determined by the group the task is in. The active groups
on the source and destination CPU can be different, resulting in a
different load contribution by the same task at its source and at its
destination. As a result, the load needs to be calculated separately
for each CPU, instead of estimated once with task_h_load().

Getting this calculation right allows some workloads to converge,
where previously the last thread could get stuck on another node,
without being able to migrate to its final destination.

Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: mgorman@suse.de
Cc: chegu_vinod@hp.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1403538378-31571-3-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-05 11:17:35 +02:00
Rik van Riel 28a2174519 sched/numa: Move power adjustment into load_too_imbalanced()
Currently the NUMA code scales the load on each node with the
amount of CPU power available on that node, but it does not
apply any adjustment to the load of the task that is being
moved over.

On systems with SMT/HT, this results in a task being weighed
much more heavily than a CPU core, and a task move that would
even out the load between nodes being disallowed.

The correct thing is to apply the power correction to the
numbers after we have first applied the move of the tasks'
loads to them.

This also allows us to do the power correction with a multiplication,
rather than a division.

Also drop two function arguments for load_too_unbalanced, since it
takes various factors from env already.

Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: chegu_vinod@hp.com
Cc: mgorman@suse.de
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1403538378-31571-2-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-05 11:17:34 +02:00
Rik van Riel f0b8a4afd6 sched/numa: Use group's max nid as task's preferred nid
From task_numa_placement, always try to consolidate the tasks
in a group on the group's top nid.

In case this task is part of a group that is interleaved over
multiple nodes, task_numa_migrate will set the task's preferred
nid to the best node it could find for the task, so this patch
will cause at most one run through task_numa_migrate.

Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: mgorman@suse.de
Cc: chegu_vinod@hp.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1403538095-31256-2-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-05 11:17:33 +02:00
Tim Chen 4486edd12b sched/fair: Implement fast idling of CPUs when the system is partially loaded
When a system is lightly loaded (i.e. no more than 1 job per cpu),
attempt to pull job to a cpu before putting it to idle is unnecessary and
can be skipped.  This patch adds an indicator so the scheduler can know
when there's no more than 1 active job is on any CPU in the system to
skip needless job pulls.

On a 4 socket machine with a request/response kind of workload from
clients, we saw about 0.13 msec delay when we go through a full load
balance to try pull job from all the other cpus.  While 0.1 msec was
spent on processing the request and generating a response, the 0.13 msec
load balance overhead was actually more than the actual work being done.
This overhead can be skipped much of the time for lightly loaded systems.

With this patch, we tested with a netperf request/response workload that
has the server busy with half the cpus in a 4 socket system.  We found
the patch eliminated 75% of the load balance attempts before idling a cpu.

The overhead of setting/clearing the indicator is low as we already gather
the necessary info while we call add_nr_running() and update_sd_lb_stats.()
We switch to full load balance load immediately if any cpu got more than
one job on its run queue in add_nr_running.  We'll clear the indicator
to avoid load balance when we detect no cpu's have more than one job
when we scan the work queues in update_sg_lb_stats().  We are aggressive
in turning on the load balance and opportunistic in skipping the load
balance.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Jason Low <jason.low2@hp.com>
Cc: "Paul E.McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Cc: Alex Shi <alex.shi@linaro.org>
Cc: Michel Lespinasse <walken@google.com>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1403551009.2970.613.camel@schen9-DESK
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-05 11:17:32 +02:00
Viresh Kumar 89abb5ad10 sched/idle: Drop !! while calculating 'broadcast'
We don't need 'broadcast' to be set to 'zero or one', but to 'zero or non-zero'
and so the extra operation to convert it to 'zero or one' can be skipped.

Also change type of 'broadcast' to unsigned int, i.e. type of
drv->states[*].flags.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: linaro-kernel@lists.linaro.org
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/0dfbe2976aa108c53e08d3477ea90f6360c1f54c.1403584026.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-05 11:17:31 +02:00
Mike Galbraith 4036ac1567 sched: Fix clock_gettime(CLOCK_[PROCESS/THREAD]_CPUTIME_ID) monotonicity
If a task has been dequeued, it has been accounted.  Do not project
cycles that may or may not ever be accounted to a dequeued task, as
that may make clock_gettime() both inaccurate and non-monotonic.

Protect update_rq_clock() from slight TSC skew while at it.

Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: kosaki.motohiro@jp.fujitsu.com
Cc: pjt@google.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1403588980.29711.11.camel@marge.simpson.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-05 11:17:30 +02:00
Ben Segall c06f04c704 sched: Fix potential near-infinite distribute_cfs_runtime() loop
distribute_cfs_runtime() intentionally only hands out enough runtime to
bring each cfs_rq to 1 ns of runtime, expecting the cfs_rqs to then take
the runtime they need only once they actually get to run. However, if
they get to run sufficiently quickly, the period timer is still in
distribute_cfs_runtime() and no runtime is available, causing them to
throttle. Then distribute has to handle them again, and this can go on
until distribute has handed out all of the runtime 1ns at a time, which
takes far too long.

Instead allow access to the same runtime that distribute is handing out,
accepting that corner cases with very low quota may be able to spend the
entire cfs_b->runtime during distribute_cfs_runtime, meaning that the
runtime directly handed out by distribute_cfs_runtime was over quota. In
addition, if a cfs_rq does manage to throttle like this, make sure the
existing distribute_cfs_runtime no longer loops over it again.

Signed-off-by: Ben Segall <bsegall@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140620222120.13814.21652.stgit@sword-of-the-dawn.mtv.corp.google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-05 11:17:29 +02:00
Viresh Kumar 541b82644d sched/core: Fix formatting issues in sched_can_stop_tick()
sched_can_stop_tick() is using 7 spaces instead of 8 spaces or a 'tab' at the
beginning of few lines. Which doesn't align well with the Coding Guidelines.

Also remove local variable 'rq' as it is used at only one place and we can
directly use this_rq() instead.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: fweisbec@gmail.com
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/afb781733e4a9ffbced5eb9fd25cc0aa5c6ffd7a.1403596966.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-05 11:17:28 +02:00
Peter Zijlstra a77353e5eb irq_work: Remove BUG_ON in irq_work_run()
Because of a collision with 8d056c48e4 ("CPU hotplug, smp: flush any
pending IPI callbacks before CPU offline"), which ends up calling
hotplug_cfd()->flush_smp_call_function_queue()->irq_work_run(), which
is not from IRQ context.

And since that already calls irq_work_run() from the hotplug path,
remove our entire hotplug handling.

Reported-by: Stephen Warren <swarren@wwwdotorg.org>
Tested-by: Stephen Warren <swarren@wwwdotorg.org>
Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-busatzs2gvz4v62258agipuf@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-05 11:17:26 +02:00
Ingo Molnar 51da9830d7 Merge branch 'timers/nohz' into sched/core
Merge these two, because upcoming patches will touch both areas.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-05 11:06:10 +02:00
Hillf Danton 5d5e2b1bcb sched: Fix CACHE_HOT_BUDY condition
When computing cache hot, we should check if the migration dst cpu is idle,
instead of the current cpu. Though they are same in normal balancing, that
is false nowadays in nohz idle balancing at least.

Signed-off-by: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Mike Galbraith <mgalbraith@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140607090452.4696E301D2@webmail.sinamail.sina.com.cn
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-06-18 18:29:59 +02:00
Rik van Riel bb97fc3164 sched/numa: Always try to migrate to preferred node at task_numa_placement() time
It is possible that at task_numa_placement() time, the task's
numa_preferred_nid does not change, but the task is not
actually running on the preferred node at the time.

In that case, we still want to attempt migration to the
preferred node.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: mgorman@suse.de
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140604163315.1dbc7b56@cuia.bos.redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-06-18 18:29:58 +02:00
Rik van Riel a43455a1d5 sched/numa: Ensure task_numa_migrate() checks the preferred node
The first thing task_numa_migrate() does is check to see if there is
CPU capacity available on the preferred node, in order to move the
task there.

However, if the preferred node is all busy, we would skip considering
that node for tasks swaps in the subsequent loop. This prevents NUMA
convergence of tasks on busy systems.

However, swapping locations with a task on our preferred nid, when
the preferred nid is busy, is perfectly fine.

The fix is to also look for a CPU on our preferred nid when it is
totally busy.

This changes "perf bench numa mem -p 4 -t 20 -m -0 -P 1000" from
not converging in 15 minutes on my 4 node system, to converging in
10-20 seconds.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: mgorman@suse.de
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140604160942.6969b101@cuia.bos.redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-06-18 18:29:57 +02:00
Frederic Weisbecker 3882ec6439 nohz: Use IPI implicit full barrier against rq->nr_running r/w
A full dynticks CPU is allowed to stop its tick when a single task runs.
Meanwhile when a new task gets enqueued, the CPU must be notified so that
it can restart its tick to maintain local fairness and other accounting
details.

This notification is performed by way of an IPI. Then when the target
receives the IPI, we expect it to see the new value of rq->nr_running.

Hence the following ordering scenario:

   CPU 0                   CPU 1

   write rq->running       get IPI
   smp_wmb()               smp_rmb()
   send IPI                read rq->nr_running

But Paul Mckenney says that nowadays IPIs imply a full barrier on
all architectures. So we can safely remove this pair and rely on the
implicit barriers that come along IPI send/receive. Lets
just comment on this new assumption.

Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2014-06-16 16:27:24 +02:00
Frederic Weisbecker fd2ac4f4a6 nohz: Use nohz own full kick on 2nd task enqueue
Now that we have a nohz full remote kick based on irq work, lets use
it to notify a CPU that it's exiting single task mode.

This unbloats a bit the scheduler IPI that the nohz code was abusing
for its cool "callable anywhere/anytime" properties.

Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2014-06-16 16:26:55 +02:00
Frederic Weisbecker 53c5fa16b4 nohz: Switch to nohz full remote kick on timer enqueue
When a new timer is enqueued on a full dynticks target, that CPU must
re-evaluate the next tick to handle the timer correctly.

This is currently performed through the scheduler IPI. Meanwhile this
happens at the cost of off-topic workarounds in that fast path to make
it call irq_exit().

As we plan to remove this hack off the scheduler IPI, lets use
the nohz full kick instead. Pretty much any IPI fits for that job
as long at it calls irq_exit(). The nohz full kick just happens to be
handy and readily available here.

If it happens to be too much an overkill in the future, we can still
turn that timer kick into an empty IPI.

Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2014-06-16 16:26:55 +02:00
Frederic Weisbecker 3d36aebc2e nohz: Support nohz full remote kick
Remotely kicking a full nohz CPU in order to make it re-evaluate its
next tick is currently implemented using the scheduler IPI.

However this bloats a scheduler fast path with an off-topic feature.
The scheduler tick was abused here for its cool "callable
anywhere/anytime" properties.

But now that the irq work subsystem can queue remote callbacks, it's
a perfect fit to safely queue IPIs when interrupts are disabled
without worrying about concurrent callers.

So lets implement remote kick on top of irq work. This is going to
be used when a new event requires the next tick to be recalculated:
more than 1 task competing on the CPU, timer armed, ...

Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2014-06-16 16:26:54 +02:00
Frederic Weisbecker 4788501606 irq_work: Implement remote queueing
irq work currently only supports local callbacks. However its code
is mostly ready to run remote callbacks and we have some potential user.

The full nohz subsystem currently open codes its own remote irq work
on top of the scheduler ipi when it wants a CPU to reevaluate its next
tick. However this ad hoc solution bloats the scheduler IPI.

Lets just extend the irq work subsystem to support remote queuing on top
of the generic SMP IPI to handle this kind of user. This shouldn't add
noticeable overhead.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2014-06-16 16:26:54 +02:00
Frederic Weisbecker b93e0b8fa8 irq_work: Split raised and lazy lists
An irq work can be handled from two places: from the tick if the work
carries the "lazy" flag and the tick is periodic, or from a self IPI.

We merge all these works in a single list and we use some per cpu latch
to avoid raising a self-IPI when one is already pending.

Now we could do away with this ugly latch if only the list was only made of
non-lazy works. Just enqueueing a work on the empty list would be enough
to know if we need to raise an IPI or not.

Also we are going to implement remote irq work queuing. Then the per CPU
latch will need to become atomic in the global scope. That's too bad
because, here as well, just enqueueing a work on an empty list of
non-lazy works would be enough to know if we need to raise an IPI or not.

So lets take a way out of this: split the works in two distinct lists,
one for the works that can be handled by the next tick and another
one for those handled by the IPI. Just checking if the latter is empty
when we queue a new work is enough to know if we need to raise an IPI.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2014-06-16 16:26:53 +02:00
Linus Torvalds 8841c8b3c4 One bug fix that goes back to 3.10. Accessing a non existent buffer
if "possible cpus" is greater than actual CPUs (including offline CPUs).
 
 Namhyung Kim did some reviews of the patches I sent this merge window and
 found a memory leak and had a few clean ups.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJTmcYfAAoJEKQekfcNnQGusVIH/2wvfh1D4Mu+qCUsG7HqMLWM
 wHhTHcJsiE5rBpfcvc+XoLLBMGn9IKCeClGG59KYJGzznbJHHLmk1dy4qdSqNAel
 POTEtGh+AUX0wFBZtVLl2AesFZRsbtaxoNqD0ZBLhkHCV9FBEKbsm8uDCtJIf8wR
 vDDz27nPXkiFWX+wM9Z+tFSL7rohxvwREKlQjfk5Z9plyUcURE8GDbrWl870ZSJX
 uYzSYzioCyYdy/cpasDuTX22/I+nNUxA4dWTeyciigYC+6IqLJRIwqEXJ4RjYmfm
 v9+hf6KkqF8CY6mDjObLkKmy0gubPBQ8Op1G/3ChluUrjkwdffgb6oiLb352yHY=
 =XD1A
 -----END PGP SIGNATURE-----

Merge tag 'trace-3.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing cleanups and bugfixes from Steven Rostedt:
 "One bug fix that goes back to 3.10.  Accessing a non existent buffer
  if "possible cpus" is greater than actual CPUs (including offline
  CPUs).

  Namhyung Kim did some reviews of the patches I sent this merge window
  and found a memory leak and had a few clean ups"

* tag 'trace-3.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Fix check of ftrace_trace_arrays list_empty() check
  tracing: Fix leak of per cpu max data in instances
  tracing: Cleanup saved_cmdlines_size changes
  ring-buffer: Check if buffer exists before polling
2014-06-12 21:07:25 -07:00
Linus Torvalds b2e09f633a Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull more scheduler updates from Ingo Molnar:
 "Second round of scheduler changes:
   - try-to-wakeup and IPI reduction speedups, from Andy Lutomirski
   - continued power scheduling cleanups and refactorings, from Nicolas
     Pitre
   - misc fixes and enhancements"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/deadline: Delete extraneous extern for to_ratio()
  sched/idle: Optimize try-to-wake-up IPI
  sched/idle: Simplify wake_up_idle_cpu()
  sched/idle: Clear polling before descheduling the idle thread
  sched, trace: Add a tracepoint for IPI-less remote wakeups
  cpuidle: Set polling in poll_idle
  sched: Remove redundant assignment to "rt_rq" in update_curr_rt(...)
  sched: Rename capacity related flags
  sched: Final power vs. capacity cleanups
  sched: Remove remaining dubious usage of "power"
  sched: Let 'struct sched_group_power' care about CPU capacity
  sched/fair: Disambiguate existing/remaining "capacity" usage
  sched/fair: Change "has_capacity" to "has_free_capacity"
  sched/fair: Remove "power" from 'struct numa_stats'
  sched: Fix signedness bug in yield_to()
  sched/fair: Use time_after() in record_wakee()
  sched/balancing: Reduce the rate of needless idle load balancing
  sched/fair: Fix unlocked reads of some cfs_b->quota/period
2014-06-12 19:42:15 -07:00
Linus Torvalds 3737a12761 Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull more perf updates from Ingo Molnar:
 "A second round of perf updates:

   - wide reaching kprobes sanitization and robustization, with the hope
     of fixing all 'probe this function crashes the kernel' bugs, by
     Masami Hiramatsu.

   - uprobes updates from Oleg Nesterov: tmpfs support, corner case
     fixes and robustization work.

   - perf tooling updates and fixes from Jiri Olsa, Namhyung Ki, Arnaldo
     et al:
        * Add support to accumulate hist periods (Namhyung Kim)
        * various fixes, refactorings and enhancements"

* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (101 commits)
  perf: Differentiate exec() and non-exec() comm events
  perf: Fix perf_event_comm() vs. exec() assumption
  uprobes/x86: Rename arch_uprobe->def to ->defparam, minor comment updates
  perf/documentation: Add description for conditional branch filter
  perf/x86: Add conditional branch filtering support
  perf/tool: Add conditional branch filter 'cond' to perf record
  perf: Add new conditional branch filter 'PERF_SAMPLE_BRANCH_COND'
  uprobes: Teach copy_insn() to support tmpfs
  uprobes: Shift ->readpage check from __copy_insn() to uprobe_register()
  perf/x86: Use common PMU interrupt disabled code
  perf/ARM: Use common PMU interrupt disabled code
  perf: Disable sampled events if no PMU interrupt
  perf: Fix use after free in perf_remove_from_context()
  perf tools: Fix 'make help' message error
  perf record: Fix poll return value propagation
  perf tools: Move elide bool into perf_hpp_fmt struct
  perf tools: Remove elide setup for SORT_MODE__MEMORY mode
  perf tools: Fix "==" into "=" in ui_browser__warning assignment
  perf tools: Allow overriding sysfs and proc finding with env var
  perf tools: Consider header files outside perf directory in tags target
  ...
2014-06-12 19:18:49 -07:00
Linus Torvalds c29deef32e Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull more locking changes from Ingo Molnar:
 "This is the second round of locking tree updates for v3.16, offering
  large system scalability improvements:

 - optimistic spinning for rwsems, from Davidlohr Bueso.

 - 'qrwlocks' core code and x86 enablement, from Waiman Long and PeterZ"

* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86, locking/rwlocks: Enable qrwlocks on x86
  locking/rwlocks: Introduce 'qrwlocks' - fair, queued rwlocks
  locking/mutexes: Documentation update/rewrite
  locking/rwsem: Fix checkpatch.pl warnings
  locking/rwsem: Fix warnings for CONFIG_RWSEM_GENERIC_SPINLOCK
  locking/rwsem: Support optimistic spinning
2014-06-12 18:48:15 -07:00
Linus Torvalds f9da455b93 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:

 1) Seccomp BPF filters can now be JIT'd, from Alexei Starovoitov.

 2) Multiqueue support in xen-netback and xen-netfront, from Andrew J
    Benniston.

 3) Allow tweaking of aggregation settings in cdc_ncm driver, from Bjørn
    Mork.

 4) BPF now has a "random" opcode, from Chema Gonzalez.

 5) Add more BPF documentation and improve test framework, from Daniel
    Borkmann.

 6) Support TCP fastopen over ipv6, from Daniel Lee.

 7) Add software TSO helper functions and use them to support software
    TSO in mvneta and mv643xx_eth drivers.  From Ezequiel Garcia.

 8) Support software TSO in fec driver too, from Nimrod Andy.

 9) Add Broadcom SYSTEMPORT driver, from Florian Fainelli.

10) Handle broadcasts more gracefully over macvlan when there are large
    numbers of interfaces configured, from Herbert Xu.

11) Allow more control over fwmark used for non-socket based responses,
    from Lorenzo Colitti.

12) Do TCP congestion window limiting based upon measurements, from Neal
    Cardwell.

13) Support busy polling in SCTP, from Neal Horman.

14) Allow RSS key to be configured via ethtool, from Venkata Duvvuru.

15) Bridge promisc mode handling improvements from Vlad Yasevich.

16) Don't use inetpeer entries to implement ID generation any more, it
    performs poorly, from Eric Dumazet.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1522 commits)
  rtnetlink: fix userspace API breakage for iproute2 < v3.9.0
  tcp: fixing TLP's FIN recovery
  net: fec: Add software TSO support
  net: fec: Add Scatter/gather support
  net: fec: Increase buffer descriptor entry number
  net: fec: Factorize feature setting
  net: fec: Enable IP header hardware checksum
  net: fec: Factorize the .xmit transmit function
  bridge: fix compile error when compiling without IPv6 support
  bridge: fix smatch warning / potential null pointer dereference
  via-rhine: fix full-duplex with autoneg disable
  bnx2x: Enlarge the dorq threshold for VFs
  bnx2x: Check for UNDI in uncommon branch
  bnx2x: Fix 1G-baseT link
  bnx2x: Fix link for KR with swapped polarity lane
  sctp: Fix sk_ack_backlog wrap-around problem
  net/core: Add VF link state control policy
  net/fsl: xgmac_mdio is dependent on OF_MDIO
  net/fsl: Make xgmac_mdio read error message useful
  net_sched: drr: warn when qdisc is not work conserving
  ...
2014-06-12 14:27:40 -07:00
Linus Torvalds 19c1940fea More ACPI and power management updates for 3.16-rc1
- I didn't remember correctly that the Hans de Goede's ACPI video
    patches actually didn't flip the video.use_native_backlight
    default, although we had discussed that and decided to do that.
    Since I said we would do that in the previous PM+ACPI pull
    request, make that change for real now.
 
  - ACPI bus check notifications for PCI host bridges don't cause
    the bus below the host bridge to be checked for changes as they
    should because of a mistake in the ACPI-based PCI hotplug (ACPIPHP)
    subsystem that forgets to add hotplug contexts to PCI host bridge
    ACPI device objects.  Create hotplug contexts for PCI host bridges
    too as appropriate.
 
  - Revert recent cpufreq commit related to the big.LITTLE cpufreq
    driver that breaks arm64 builds.
 
  - Fix for a regression in the ppc-corenet cpufreq driver introduced
    during the 3.15 cycle and causing the driver to use the remainder
    from do_div instead of the quotient.  From Ed Swarthout.
 
  - Resets triggered by panic activate a BUG_ON() in vmalloc.c on
    systems where the ACPI reset register is located in memory address
    space.  Fix from Randy Wright.
 
  - Fix for a problem with cpufreq governors that decisions made by
    them may be suboptimal due to the fact that deferrable timers are
    used by them for CPU load sampling.  From Srivatsa S Bhat.
 
  - Fix for a problem with the Tegra cpufreq driver where the CPU
    frequency is temporarily switched to a "stable" level that
    is different from both the initial and target frequencies
    during transitions which causes udelay() to expire earlier than
    it should sometimes.  From Viresh Kumar.
 
  - New trace points and rework of some existing trace points for
    system suspend/resume profiling from Todd Brandt.
 
  - Assorted cpufreq fixes and cleanups from Stratos Karafotis and
    Viresh Kumar.
 
  - Copyright notice update for suspend-and-cpuhotplug.txt from
    Srivatsa S Bhat.
 
 /
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABCAAGBQJTmeBNAAoJEILEb/54YlRxFo0QAIfp74wZO9ZPcrR+6IO1AEUb
 1qcVJYMFWvisG2JO9b7DUtxwgWHk8/NMgKv+bYxUAEni95mY7PqDTdJ+Qjk7DinJ
 jVo+mzooaQg+KYGQ503YOtqsGhNFM3lE6Jw01wbLytTCetkNCkTgr//7btBbyRKn
 13Ut3o2vH9n5EMoe1jql96onJH6AfBDEn7jc5Sk4rGL7MtKAMsWNTNSGVyLFA98l
 sghO8ZR0AqnBzoedr1eBxzo6ujUqjfYlIcxowZycpJJVX02eN+KGUbOJao2+6RB+
 J6wu/FoPv2VtJkNwSB8IMgZfqceecSIXeWBG5xC22cYbSQ/IDW2k72V+kLHUqd36
 LhlYLIsIxJQovqOgPdKeP5o6OVFd4EheWBiCfNBrmYU+x2av6I6ZjTscz3Robaxh
 AVG6yU8XR2GOpoVGW/+L7R2jZ1Qse1Io0r93hXvCsSXgMkq9HbueX3mZR605msfe
 liDk+fym357cKQUreSH1XF0Q79C1wpEJ6rTz0Qi6ZxkKB+dAYE3oPA+V0+cWSxbK
 WqaFjQwPtvrrduvLj5Z+qF/zRu4LXdTxiY59utBek/RoN6zUsMMpwsRCCdBfub2O
 alBOHUPRaiUywkQtqu7yP9j7iciNxEn1/tXo97b/1qC3RrOwLWOgd8dhpWe0i0Gp
 EmQkie8qCHXw5vCpaeUK
 =0lht
 -----END PGP SIGNATURE-----

Merge tag 'pm+acpi-3.16-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull more ACPI and power management updates from Rafael Wysocki:
 "These are fixups on top of the previous PM+ACPI pull request,
  regression fixes (ACPI hotplug, cpufreq ppc-corenet), other bug fixes
  (ACPI reset, cpufreq), new PM trace points for system suspend
  profiling and a copyright notice update.

  Specifics:

   - I didn't remember correctly that the Hans de Goede's ACPI video
     patches actually didn't flip the video.use_native_backlight
     default, although we had discussed that and decided to do that.
     Since I said we would do that in the previous PM+ACPI pull request,
     make that change for real now.

   - ACPI bus check notifications for PCI host bridges don't cause the
     bus below the host bridge to be checked for changes as they should
     because of a mistake in the ACPI-based PCI hotplug (ACPIPHP)
     subsystem that forgets to add hotplug contexts to PCI host bridge
     ACPI device objects.  Create hotplug contexts for PCI host bridges
     too as appropriate.

   - Revert recent cpufreq commit related to the big.LITTLE cpufreq
     driver that breaks arm64 builds.

   - Fix for a regression in the ppc-corenet cpufreq driver introduced
     during the 3.15 cycle and causing the driver to use the remainder
     from do_div instead of the quotient.  From Ed Swarthout.

   - Resets triggered by panic activate a BUG_ON() in vmalloc.c on
     systems where the ACPI reset register is located in memory address
     space.  Fix from Randy Wright.

   - Fix for a problem with cpufreq governors that decisions made by
     them may be suboptimal due to the fact that deferrable timers are
     used by them for CPU load sampling.  From Srivatsa S Bhat.

   - Fix for a problem with the Tegra cpufreq driver where the CPU
     frequency is temporarily switched to a "stable" level that is
     different from both the initial and target frequencies during
     transitions which causes udelay() to expire earlier than it should
     sometimes.  From Viresh Kumar.

   - New trace points and rework of some existing trace points for
     system suspend/resume profiling from Todd Brandt.

   - Assorted cpufreq fixes and cleanups from Stratos Karafotis and
     Viresh Kumar.

   - Copyright notice update for suspend-and-cpuhotplug.txt from
     Srivatsa S Bhat"

* tag 'pm+acpi-3.16-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  ACPI / hotplug / PCI: Add hotplug contexts to PCI host bridges
  PM / sleep: trace events for device PM callbacks
  cpufreq: cpufreq-cpu0: remove dependency on THERMAL and REGULATOR
  cpufreq: tegra: update comment for clarity
  cpufreq: intel_pstate: Remove duplicate CPU ID check
  cpufreq: Mark CPU0 driver with CPUFREQ_NEED_INITIAL_FREQ_CHECK flag
  PM / Documentation: Update copyright in suspend-and-cpuhotplug.txt
  cpufreq: governor: remove copy_prev_load from 'struct cpu_dbs_common_info'
  cpufreq: governor: Be friendly towards latency-sensitive bursty workloads
  PM / sleep: trace events for suspend/resume
  cpufreq: ppc-corenet-cpu-freq: do_div use quotient
  Revert "cpufreq: Enable big.LITTLE cpufreq driver on arm64"
  cpufreq: Tegra: implement intermediate frequency callbacks
  cpufreq: add support for intermediate (stable) frequencies
  ACPI / video: Change the default for video.use_native_backlight to 1
  ACPI: Fix bug when ACPI reset register is implemented in system memory
2014-06-12 13:14:19 -07:00
Ingo Molnar 535560d841 Merge commit '3cf2f34' into sched/core, to fix build error
Fix this dependency on the locking tree's smp_mb*() API changes:

  kernel/sched/idle.c:247:3: error: implicit declaration of function ‘smp_mb__after_atomic’ [-Werror=implicit-function-declaration]

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-06-12 13:46:37 +02:00
Rafael J. Wysocki d715a226b0 Merge branch 'pm-sleep'
* pm-sleep:
  PM / sleep: trace events for device PM callbacks
  PM / sleep: trace events for suspend/resume
2014-06-12 13:43:08 +02:00
Linus Torvalds 4251c2a670 Most of this is cleaning up various driver sysfs permissions so we can
re-add the perm check (we unified the module param and sysfs checks, but
 the module ones were stronger so we weakened them temporarily).
 
 Param parsing gets documented, and also "--" now forces args to be
 handed to init (and ignored by the kernel).
 
 Module NX/RO protections get tightened: we now set them before calling
 parse_args().
 
 Cheers,
 Rusty.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJTl+oJAAoJENkgDmzRrbjxtUEP/jIXml01jE2HquOJ/DfrCJOt
 ry5L5Iy8wVBRotTszrXqlD6+W8fLYsEdhM65Wof1H7X1qjaulqYZmrL7bQn4rIGN
 YPUmO5rOzECeAPNW5+e2JLnR4bmS99gVcWzJFCHUBd7Z8ceKaoIk7/XvUg6Mdjg7
 v0kJ5X+U9da2sVYYcZ71euth4ADLFDRNRexA1mPI6mKzJLOBgfvCBWZnkFVdBcjd
 VmL6ceFo/yP9Ed4pgG/4uXq1dZ4ZttpjPusDmNcjq+snOzsQb4tW+KB2Pr6iTwQy
 TDt7lQm5+xfUXgUG/S5L6PYn10P44Voo7AEJa+QK5YPSOY/eRVA0h4/ayP0vqDaJ
 LpZjqXbW77G4yOgEV9KRFLLXiFXykTh2TyCPYL5G2XVXQp1OmViu2f21JWJLFLgL
 mqOXYWdowOGVOOoTgwxIdxczCFCATJUaU5Ig6ay8C02E2mCwIV+IaGSdpsCiyjz/
 dNNumMxWg0NMo/c0YG4K3Ake6ZaGrwbnuJYijaEj6mgpifhh7k4yhFciXGLpkLnS
 Yuo4ORO0GX34z1+bX0iwrgMGPdy7+BnbXsDdWJsbsnwnKKes/Sp44fNl4lPwdM3n
 siaPsxmfAtl9EGqbkU1Fk+x5+X/Lv2I/7/nX5n53520RLkJJpbeMDfHUqpbrqeUN
 JNUTOZ9o72EqDVKnn175
 =IxSN
 -----END PGP SIGNATURE-----

Merge tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux

Pull module updates from Rusty Russell:
 "Most of this is cleaning up various driver sysfs permissions so we can
  re-add the perm check (we unified the module param and sysfs checks,
  but the module ones were stronger so we weakened them temporarily).

  Param parsing gets documented, and also "--" now forces args to be
  handed to init (and ignored by the kernel).

  Module NX/RO protections get tightened: we now set them before calling
  parse_args()"

* tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
  module: set nx before marking module MODULE_STATE_COMING.
  samples/kobject/: avoid world-writable sysfs files.
  drivers/hid/hid-picolcd_fb: avoid world-writable sysfs files.
  drivers/staging/speakup/: avoid world-writable sysfs files.
  drivers/regulator/virtual: avoid world-writable sysfs files.
  drivers/scsi/pm8001/pm8001_ctl.c: avoid world-writable sysfs files.
  drivers/hid/hid-lg4ff.c: avoid world-writable sysfs files.
  drivers/video/fbdev/sm501fb.c: avoid world-writable sysfs files.
  drivers/mtd/devices/docg3.c: avoid world-writable sysfs files.
  speakup: fix incorrect perms on speakup_acntsa.c
  cpumask.h: silence warning with -Wsign-compare
  Documentation: Update kernel-parameters.tx
  param: hand arguments after -- straight to init
  modpost: Fix resource leak in read_dump()
2014-06-11 16:09:14 -07:00
Linus Torvalds 9ee4d7a653 Merge branch 'akpm' (patches from Andrew Morton)
Merge leftovers from Andrew Morton:
 "A few leftovers: ocfs2, gcov, RTC"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  rtc: s5m: consolidate two device type switch statements
  rtc: s5m: add support for S2MPS14 RTC
  rtc: s5m: support different register layout
  rtc: s5m: use shorter time of register update
  rtc: s5m: remove undocumented time init on first boot
  mfd/rtc: sec/s5m: rename SEC* symbols to S5M
  gcov: add support for GCC 4.9
  ocfs2/o2net: incorrect to terminate accepting connections loop upon rejecting an invalid one
2014-06-10 15:34:55 -07:00
Yuan Pengfei a992bf836f gcov: add support for GCC 4.9
This patch handles the gcov-related changes in GCC 4.9:

  A new counter (time profile) is added. The total number is 9 now.

  A new profile merge function __gcov_merge_time_profile is added.

See gcc/gcov-io.h and libgcc/libgcov-merge.c

For the first change, the layout of struct gcov_info is affected.

For the second one, a dummy function is added to kernel/gcov/base.c
similarly.

Signed-off-by: Yuan Pengfei <coolypf@qq.com>
Acked-by: Peter Oberparleiter <oberpar@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-10 15:34:46 -07:00
Andy Lutomirski 23adbe12ef fs,userns: Change inode_capable to capable_wrt_inode_uidgid
The kernel has no concept of capabilities with respect to inodes; inodes
exist independently of namespaces.  For example, inode_capable(inode,
CAP_LINUX_IMMUTABLE) would be nonsense.

This patch changes inode_capable to check for uid and gid mappings and
renames it to capable_wrt_inode_uidgid, which should make it more
obvious what it does.

Fixes CVE-2014-4014.

Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Serge Hallyn <serge.hallyn@ubuntu.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: stable@vger.kernel.org
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-10 13:57:22 -07:00
Steven Rostedt (Red Hat) da9c3413a2 tracing: Fix check of ftrace_trace_arrays list_empty() check
The check that tests if ftrace_trace_arrays is empty in
top_trace_array(), uses the .prev pointer:

  if (list_empty(ftrace_trace_arrays.prev))

instead of testing the variable itself:

  if (list_empty(&ftrace_trace_arrays))

Although it is technically correct, it is awkward and confusing.
Use the proper method.

Link: http://lkml.kernel.org/r/87oay1bas8.fsf@sejong.aot.lge.com

Reported-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-06-10 13:53:50 -04:00
Steven Rostedt (Red Hat) f0b70cc48c tracing: Fix leak of per cpu max data in instances
The freeing of an instance, if max data is configured, there will be
per cpu data structures created. But these are not freed when the instance
is deleted, which causes a memory leak.

A new helper function is added that frees the individual buffers within a
trace array, instead of duplicating the code. This way changes made for one
are applied to the other (normal buffer vs max buffer).

Link: http://lkml.kernel.org/r/87k38pbake.fsf@sejong.aot.lge.com

Reported-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-06-10 12:06:30 -04:00
Andy Lutomirski a3c5493119 auditsc: audit_krule mask accesses need bounds checking
Fixes an easy DoS and possible information disclosure.

This does nothing about the broken state of x32 auditing.

eparis: If the admin has enabled auditd and has specifically loaded
audit rules.  This bug has been around since before git.  Wow...

Cc: stable@vger.kernel.org
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-10 08:44:40 -07:00
Namhyung Kim a6af8fbf17 tracing: Cleanup saved_cmdlines_size changes
The recent addition of saved_cmdlines_size file had some remaining
(minor - mostly coding style) issues.  Fix them by passing pointer
name to sizeof() and using scnprintf().

Link: http://lkml.kernel.org/p/1402384295-23680-1-git-send-email-namhyung@kernel.org

Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Yoshihiro YUNOMAE <yoshihiro.yunomae.ez@hitachi.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-06-10 09:51:10 -04:00
Steven Rostedt (Red Hat) 8b8b36834d ring-buffer: Check if buffer exists before polling
The per_cpu buffers are created one per possible CPU. But these do
not mean that those CPUs are online, nor do they even exist.

With the addition of the ring buffer polling, it assumes that the
caller polls on an existing buffer. But this is not the case if
the user reads trace_pipe from a CPU that does not exist, and this
causes the kernel to crash.

Simple fix is to check the cpu against buffer bitmask against to see
if the buffer was allocated or not and return -ENODEV if it is
not.

More updates were done to pass the -ENODEV back up to userspace.

Link: http://lkml.kernel.org/r/5393DB61.6060707@oracle.com

Reported-by: Sasha Levin <sasha.levin@oracle.com>
Cc: stable@vger.kernel.org # 3.10+
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-06-10 09:46:00 -04:00
Linus Torvalds 214b931320 Lots of tweaks, small fixes, optimizations, and some helper functions
to help out the rest of the kernel to ease their use of trace events.
 
 The big change for this release is the allowing of other tracers,
 such as the latency tracers, to be used in the trace instances and allow
 for function or function graph tracing to be in the top level
 simultaneously.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJTlbUqAAoJEKQekfcNnQGuP+8H+wTBG06beHsqe6XcaeXcKNkt
 Mimm0O04oQdw89CBWeJvXyOwRTtiN4M/4hxHXBTDtChxM9oUyWw1o0IpSMMuQ16O
 w9r3DfC8e1air+ufEuYWM0QNtyzHi8EfDSNia55ON5jvtkCZTXOEKZD+n8M9w28p
 I7PVgr0PDztsCpethCpg0M8beK9zuQPWMzsHAQCsKI06Xl5z33kPIJR15Exh+Kr1
 uVVTZW7JFVAPuSnteLSIx9pN6OjsVGzOZCljg+O+9/v/02u5nkMiS2nURxae86kg
 RTSiRYT6Hvl/MCBhdss/w5kgSk6BYiZ0hXbLtwetvre+vQrOR5CnDw2DxZ7e+gU=
 =oudH
 -----END PGP SIGNATURE-----

Merge tag 'trace-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing updates from Steven Rostedt:
 "Lots of tweaks, small fixes, optimizations, and some helper functions
  to help out the rest of the kernel to ease their use of trace events.

  The big change for this release is the allowing of other tracers, such
  as the latency tracers, to be used in the trace instances and allow
  for function or function graph tracing to be in the top level
  simultaneously"

* tag 'trace-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (44 commits)
  tracing: Fix memory leak on instance deletion
  tracing: Fix leak of ring buffer data when new instances creation fails
  tracing/kprobes: Avoid self tests if tracing is disabled on boot up
  tracing: Return error if ftrace_trace_arrays list is empty
  tracing: Only calculate stats of tracepoint benchmarks for 2^32 times
  tracing: Convert stddev into u64 in tracepoint benchmark
  tracing: Introduce saved_cmdlines_size file
  tracing: Add __get_dynamic_array_len() macro for trace events
  tracing: Remove unused variable in trace_benchmark
  tracing: Eliminate double free on failure of allocation on boot up
  ftrace/x86: Call text_ip_addr() instead of the duplicated code
  tracing: Print max callstack on stacktrace bug
  tracing: Move locking of trace_cmdline_lock into start/stop seq calls
  tracing: Try again for saved cmdline if failed due to locking
  tracing: Have saved_cmdlines use the seq_read infrastructure
  tracing: Add tracepoint benchmark tracepoint
  tracing: Print nasty banner when trace_printk() is in use
  tracing: Add funcgraph_tail option to print function name after closing braces
  tracing: Eliminate duplicate TRACE_GRAPH_PRINT_xx defines
  tracing: Add __bitmask() macro to trace events to cpumasks and other bitmasks
  ...
2014-06-09 16:39:15 -07:00
Linus Torvalds 14208b0ec5 Merge branch 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
 "A lot of activities on cgroup side.  Heavy restructuring including
  locking simplification took place to improve the code base and enable
  implementation of the unified hierarchy, which currently exists behind
  a __DEVEL__ mount option.  The core support is mostly complete but
  individual controllers need further work.  To explain the design and
  rationales of the the unified hierarchy

        Documentation/cgroups/unified-hierarchy.txt

  is added.

  Another notable change is css (cgroup_subsys_state - what each
  controller uses to identify and interact with a cgroup) iteration
  update.  This is part of continuing updates on css object lifetime and
  visibility.  cgroup started with reference count draining on removal
  way back and is now reaching a point where csses behave and are
  iterated like normal refcnted objects albeit with some complexities to
  allow distinguishing the state where they're being deleted.  The css
  iteration update isn't taken advantage of yet but is planned to be
  used to simplify memcg significantly"

* 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (77 commits)
  cgroup: disallow disabled controllers on the default hierarchy
  cgroup: don't destroy the default root
  cgroup: disallow debug controller on the default hierarchy
  cgroup: clean up MAINTAINERS entries
  cgroup: implement css_tryget()
  device_cgroup: use css_has_online_children() instead of has_children()
  cgroup: convert cgroup_has_live_children() into css_has_online_children()
  cgroup: use CSS_ONLINE instead of CGRP_DEAD
  cgroup: iterate cgroup_subsys_states directly
  cgroup: introduce CSS_RELEASED and reduce css iteration fallback window
  cgroup: move cgroup->serial_nr into cgroup_subsys_state
  cgroup: link all cgroup_subsys_states in their sibling lists
  cgroup: move cgroup->sibling and ->children into cgroup_subsys_state
  cgroup: remove cgroup->parent
  device_cgroup: remove direct access to cgroup->children
  memcg: update memcg_has_children() to use css_next_child()
  memcg: remove tasks/children test from mem_cgroup_force_empty()
  cgroup: remove css_parent()
  cgroup: skip refcnting on normal root csses and cgrp_dfl_root self css
  cgroup: use cgroup->self.refcnt for cgroup refcnting
  ...
2014-06-09 15:03:33 -07:00
Linus Torvalds da85d191f5 Merge branch 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue updates from Tejun Heo:
 "Lai simplified worker destruction path and internal workqueue locking
  and there are some other minor changes.

  Except for the removal of some long-deprecated interfaces which
  haven't had any in-kernel user for quite a while, there shouldn't be
  any difference to workqueue users"

* 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  kernel/workqueue.c: pr_warning/pr_warn & printk/pr_info
  workqueue: remove the confusing POOL_FREEZING
  workqueue: rename first_worker() to first_idle_worker()
  workqueue: remove unused work_clear_pending()
  workqueue: remove unused WORK_CPU_END
  workqueue: declare system_highpri_wq
  workqueue: use generic attach/detach routine for rescuers
  workqueue: separate pool-attaching code out from create_worker()
  workqueue: rename manager_mutex to attach_mutex
  workqueue: narrow the protection range of manager_mutex
  workqueue: convert worker_idr to worker_ida
  workqueue: separate iteration role from worker_idr
  workqueue: destroy worker directly in the idle timeout handler
  workqueue: async worker destruction
  workqueue: destroy_worker() should destroy idle workers only
  workqueue: use manager lock only to protect worker_idr
  workqueue: Remove deprecated system_nrt[_freezable]_wq
  workqueue: Remove deprecated flush[_delayed]_work_sync()
  kernel/workqueue.c: pr_warning/pr_warn & printk/pr_info
  workqueue: simplify wq_update_unbound_numa() by jumping to use_dfl_pwq if the target cpumask equals wq's
2014-06-09 14:56:49 -07:00
Rik van Riel 1662867a9b numa,sched: fix load_to_imbalanced logic inversion
This function is supposed to return true if the new load imbalance is
worse than the old one.  It didn't.  I can only hope brown paper bags
are in style.

Now things converge much better on both the 4 node and 8 node systems.

I am not sure why this did not seem to impact specjbb performance on the
4 node system, which is the system I have full-time access to.

This bug was introduced recently, with commit e63da03639 ("sched/numa:
Allow task switch if load imbalance improves")

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-08 14:35:05 -07:00
Linus Torvalds 3f17ea6dea Merge branch 'next' (accumulated 3.16 merge window patches) into master
Now that 3.15 is released, this merges the 'next' branch into 'master',
bringing us to the normal situation where my 'master' branch is the
merge window.

* accumulated work in next: (6809 commits)
  ufs: sb mutex merge + mutex_destroy
  powerpc: update comments for generic idle conversion
  cris: update comments for generic idle conversion
  idle: remove cpu_idle() forward declarations
  nbd: zero from and len fields in NBD_CMD_DISCONNECT.
  mm: convert some level-less printks to pr_*
  MAINTAINERS: adi-buildroot-devel is moderated
  MAINTAINERS: add linux-api for review of API/ABI changes
  mm/kmemleak-test.c: use pr_fmt for logging
  fs/dlm/debug_fs.c: replace seq_printf by seq_puts
  fs/dlm/lockspace.c: convert simple_str to kstr
  fs/dlm/config.c: convert simple_str to kstr
  mm: mark remap_file_pages() syscall as deprecated
  mm: memcontrol: remove unnecessary memcg argument from soft limit functions
  mm: memcontrol: clean up memcg zoneinfo lookup
  mm/memblock.c: call kmemleak directly from memblock_(alloc|free)
  mm/mempool.c: update the kmemleak stack trace for mempool allocations
  lib/radix-tree.c: update the kmemleak stack trace for radix tree allocations
  mm: introduce kmemleak_update_trace()
  mm/kmemleak.c: use %u to print ->checksum
  ...
2014-06-08 11:31:16 -07:00
Steven Rostedt (Red Hat) a9fcaaac37 tracing: Fix memory leak on instance deletion
When an instance is created, it also gets a snapshot ring buffer
allocated (with minimum of pages). But when it is deleted the snapshot
buffer is not. There was a helper function added to match the allocation
of these ring buffers to a way to free them, but it wasn't used by
the deletion of an instance. Using that helper function solves this
memory leak.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-06-06 23:17:28 -04:00
Joe Perches 6f8fd1d77e sysctl: convert use of typedef ctl_table to struct ctl_table
This typedef is unnecessary and should just be removed.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06 16:08:16 -07:00
Fabian Frederick 119ce5c8b9 kernel/seccomp.c: kernel-doc warning fix
+ fix small typo

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06 16:08:15 -07:00
Paul McQuade 46c0a8ca3e ipc, kernel: clear whitespace
trailing whitespace

Signed-off-by: Paul McQuade <paulmcquad@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06 16:08:14 -07:00
Paul McQuade 7153e40273 ipc, kernel: use Linux headers
Use #include <linux/uaccess.h> instead of <asm/uaccess.h>
Use #include <linux/types.h> instead of <asm/types.h>

Signed-off-by: Paul McQuade <paulmcquad@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06 16:08:14 -07:00