Commit Graph

14647 Commits

Author SHA1 Message Date
Paul E. McKenney 7b2e6011f1 rcu: Rename ->onofflock to ->orphan_lock
The ->onofflock field in the rcu_state structure at one time synchronized
CPU-hotplug operations for RCU.  However, its scope has decreased over time
so that it now only protects the lists of orphaned RCU callbacks.  This
commit therefore renames it to ->orphan_lock to reflect its current use.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-11-08 11:50:11 -08:00
Tao Ma 316eb661f1 cgroup: set 'start' with the right value in cgroup_path.
'start' is set to buf + buflen and do the '--' immediately.
Just set it to 'buf + buflen - 1' directly.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
2012-11-08 06:23:02 -08:00
Tejun Heo 5b805f2a76 Merge branch 'cgroup/for-3.7-fixes' into cgroup/for-3.8
This is to receive device_cgroup fixes so that further device_cgroup
changes can be made in cgroup/for-3.8.

Signed-off-by: Tejun Heo <tj@kernel.org>
2012-11-06 12:26:23 -08:00
Tejun Heo 1db1e31b1e Merge branch 'cgroup-rmdir-updates' into cgroup/for-3.8
Pull rmdir updates into for-3.8 so that further callback updates can
be put on top.  This pull created a trivial conflict between the
following two commits.

  8c7f6edbda ("cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them")
  ed95779340 ("cgroup: kill cgroup_subsys->__DEPRECATED_clear_css_refs")

The former added a field to cgroup_subsys and the latter removed one
from it.  They happen to be colocated causing the conflict.  Keeping
what's added and removing what's removed resolves the conflict.

Signed-off-by: Tejun Heo <tj@kernel.org>
2012-11-05 09:21:51 -08:00
Tejun Heo bcf6de1b91 cgroup: make ->pre_destroy() return void
All ->pre_destory() implementations return 0 now, which is the only
allowed return value.  Make it return void.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
2012-11-05 09:16:59 -08:00
Tejun Heo b25ed609d0 cgroup: remove CGRP_WAIT_ON_RMDIR, cgroup_exclude_rmdir() and cgroup_release_and_wakeup_rmdir()
CGRP_WAIT_ON_RMDIR is another kludge which was added to make cgroup
destruction rollback somewhat working.  cgroup_rmdir() used to drain
CSS references and CGRP_WAIT_ON_RMDIR and the associated waitqueue and
helpers were used to allow the task performing rmdir to wait for the
next relevant event.

Unfortunately, the wait is visible to controllers too and the
mechanism got exposed to memcg by 887032670d ("cgroup avoid permanent
sleep at rmdir").

Now that the draining and retries are gone, CGRP_WAIT_ON_RMDIR is
unnecessary.  Remove it and all the mechanisms supporting it.  Note
that memcontrol.c changes are essentially revert of 887032670d
("cgroup avoid permanent sleep at rmdir").

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Balbir Singh <bsingharora@gmail.com>
2012-11-05 09:16:59 -08:00
Tejun Heo 1a90dd508b cgroup: deactivate CSS's and mark cgroup dead before invoking ->pre_destroy()
Because ->pre_destroy() could fail and can't be called under
cgroup_mutex, cgroup destruction did something very ugly.

  1. Grab cgroup_mutex and verify it can be destroyed; fail otherwise.

  2. Release cgroup_mutex and call ->pre_destroy().

  3. Re-grab cgroup_mutex and verify it can still be destroyed; fail
     otherwise.

  4. Continue destroying.

In addition to being ugly, it has been always broken in various ways.
For example, memcg ->pre_destroy() expects the cgroup to be inactive
after it's done but tasks can be attached and detached between #2 and
#3 and the conditions that memcg verified in ->pre_destroy() might no
longer hold by the time control reaches #3.

Now that ->pre_destroy() is no longer allowed to fail.  We can switch
to the following.

  1. Grab cgroup_mutex and verify it can be destroyed; fail otherwise.

  2. Deactivate CSS's and mark the cgroup removed thus preventing any
     further operations which can invalidate the verification from #1.

  3. Release cgroup_mutex and call ->pre_destroy().

  4. Re-grab cgroup_mutex and continue destroying.

After this change, controllers can safely assume that ->pre_destroy()
will only be called only once for a given cgroup and, once
->pre_destroy() is called, the cgroup will stay dormant till it's
destroyed.

This removes the only reason ->pre_destroy() can fail - new task being
attached or child cgroup being created inbetween.  Error out path is
removed and ->pre_destroy() invocation is open coded in
cgroup_rmdir().

v2: cgroup_call_pre_destroy() removal moved to this patch per Michal.
    Commit message updated per Glauber.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Glauber Costa <glommer@parallels.com>
2012-11-05 09:16:59 -08:00
Tejun Heo 976c06bccc cgroup: use cgroup_lock_live_group(parent) in cgroup_create()
This patch makes cgroup_create() fail if @parent is marked removed.
This is to prepare for further updates to cgroup_rmdir() path.

Note that this change isn't strictly necessary.  cgroup can only be
created via mkdir and the removed marking and dentry removal happen
without releasing cgroup_mutex, so cgroup_create() can never race with
cgroup_rmdir().  Even after the scheduled updates to cgroup_rmdir(),
cgroup_mkdir() and cgroup_rmdir() are synchronized by i_mutex
rendering the added liveliness check unnecessary.

Do it anyway such that locking is contained inside cgroup proper and
we don't get nasty surprises if we ever grow another caller of
cgroup_create().

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
2012-11-05 09:16:59 -08:00
Tejun Heo e93160803f cgroup: kill CSS_REMOVED
CSS_REMOVED is one of the several contortions which were necessary to
support css reference draining on cgroup removal.  All css->refcnts
which need draining should be deactivated and verified to equal zero
atomically w.r.t. css_tryget().  If any one isn't zero, all refcnts
needed to be re-activated and css_tryget() shouldn't fail in the
process.

This was achieved by letting css_tryget() busy-loop until either the
refcnt is reactivated (failed removal attempt) or CSS_REMOVED is set
(committing to removal).

Now that css refcnt draining is no longer used, there's no need for
atomic rollback mechanism.  css_tryget() simply can look at the
reference count and fail if it's deactivated - it's never getting
re-activated.

This patch removes CSS_REMOVED and updates __css_tryget() to fail if
the refcnt is deactivated.  As deactivation and removal are a single
step now, they no longer need to be protected against css_tryget()
happening from irq context.  Remove local_irq_disable/enable() from
cgroup_rmdir().

Note that this removes css_is_removed() whose only user is VM_BUG_ON()
in memcontrol.c.  We can replace it with a check on the refcnt but
given that the only use case is a debug assert, I think it's better to
simply unexport it.

v2: Comment updated and explanation on local_irq_disable/enable()
    added per Michal Hocko.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
2012-11-05 09:16:58 -08:00
Tejun Heo ed95779340 cgroup: kill cgroup_subsys->__DEPRECATED_clear_css_refs
2ef37d3fe4 ("memcg: Simplify mem_cgroup_force_empty_list error
handling") removed the last user of __DEPRECATED_clear_css_refs.  This
patch removes __DEPRECATED_clear_css_refs and mechanisms to support
it.

* Conditionals dependent on __DEPRECATED_clear_css_refs removed.

* cgroup_clear_css_refs() can no longer fail.  All that needs to be
  done are deactivating refcnts, setting CSS_REMOVED and putting the
  base reference on each css.  Remove cgroup_clear_css_refs() and the
  failure path, and open-code the loops into cgroup_rmdir().

This patch keeps the two for_each_subsys() loops separate while open
coding them.  They can be merged now but there are scheduled changes
which need them to be separate, so keep them separate to reduce the
amount of churn.

local_irq_save/restore() from cgroup_clear_css_refs() are replaced
with local_irq_disable/enable() for simplicity.  This is safe as
cgroup_rmdir() is always called with IRQ enabled.  Note that this IRQ
switching is necessary to ensure that css_tryget() isn't called from
IRQ context on the same CPU while lower context is between CSS
deactivation and setting CSS_REMOVED as css_tryget() would hang
forever in such cases waiting for CSS to be re-activated or
CSS_REMOVED set.  This will go away soon.

v2: cgroup_call_pre_destroy() removal dropped per Michal.  Commit
    message updated to explain local_irq_disable/enable() conversion.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
2012-11-05 09:16:58 -08:00
Oleg Nesterov 19f5ee2716 uprobes: Kill arch_uprobe_enable/disable_step() hooks
Kill arch_uprobe_enable/disable_step() hooks, they do nothing and
nobody needs them.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2012-11-03 17:15:13 +01:00
Oleg Nesterov 65b2c8f0e5 uprobes/powerpc: Do not use arch_uprobe_*_step() helpers
No functional changes.

powerpc is the only user of arch_uprobe_enable/disable_step() helpers,
but they should die. They can not be used correctly, every arch needs
its own implementation (like x86 does). And they do not really help
even as initial-and-almost-working code, arch_uprobe_*_xol() hooks can
easily use user_enable/disable_single_step() directly.

Change arch_uprobe_*_step() to do nothing, and convert powerpc to use
ptrace helpers. This is equally wrong, powerpc needs the arch-specific
fixes.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2012-11-03 17:15:12 +01:00
Steven Rostedt 7bcfaf54f5 tracing: Add trace_options kernel command line parameter
Add trace_options to the kernel command line parameter to be able to
set options at early boot. For example, to enable stack dumps of
events, add the following:

  trace_options=stacktrace

This along with the trace_event option, you can get not only
traces of the events but also the stack dumps with them.

Requested-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-11-02 10:21:53 -04:00
Steven Rostedt 0d5c6e1c19 tracing: Use irq_work for wake ups and remove *_nowake_*() functions
Have the ring buffer commit function use the irq_work infrastructure to
wake up any waiters waiting on the ring buffer for new data. The irq_work
was created for such a purpose, where doing the actual wake up at the
time of adding data is too dangerous, as an event or function trace may
be in the midst of the work queue locks and cause deadlocks. The irq_work
will either delay the action to the next timer interrupt, or trigger an IPI
to itself forcing an interrupt to do the work (in a safe location).

With irq_work, all ring buffer commits can safely do wakeups, removing
the need for the ring buffer commit "nowake" variants, which were used
by events and function tracing. All commits can now safely use the
normal commit, and the "nowake" variants can be removed.

Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-11-02 10:21:52 -04:00
Steven Rostedt 02404baf1b tracing: Remove deprecated tracing_enabled file
The tracing_enabled file was used as a quick way to stop
tracers, and try to bring down overhead for things like
the latency tracers (irqsoff, wakeup, etc). But it didn't
work that well.

The tracing_on file was created as a really fast way to
stop recording into the ftrace ring buffer and can interact
with the kernel. That is a tracing_off() call in the kernel
can disable recording of events, and then from userspace one
could echo 1 into the tracing_on file to continue it. The
tracing_enabled function did too much to allow for this.

The tracing_on has taken over as a way to start and stop tracing
and the tracing_enabled file should not be used. But because of
its existance, it still confuses people. Over a year ago the
following commit was added:

 commit 6752ab4a9c
 Author: Steven Rostedt <srostedt@redhat.com>
 Date:   Tue Feb 8 13:54:06 2011 -0500

    tracing: Deprecate tracing_enabled for tracing_on

This commit added a WARN_ON() if the tracing_enabled file's variable
was changed. After this was added, only LatencyTop complained, and
they soon fixed their tool as there was no reason that LatencyTop
should touch this file as it was using the perf ring buffers which
this file does not interact with. But since that time no one else
has complained about this WARN_ON(). Thus it is safe to assume that
this file is no longer needed. Time to get rid of it.

Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-11-02 10:21:51 -04:00
Steven Rostedt 0fb9656d95 tracing: Make tracing_enabled be equal to tracing_on
The tracing_enabled file has been deprecated as it never was able
to serve its purpose well. The tracing_on file has taken over.
Instead of having code to keep tracing_enabled, have the tracing_enabled
file just set tracing_on, and remove the tracing_enabled variable.

This allows us to remove the tracing_enabled file. The reason that
the remove is in a different change set and not removed here is
in case we find some lonely userspace tool that requires the file
to exist. Then the removal patch will get reverted, but this one
will not.

Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-11-02 10:21:50 -04:00
Steven Rostedt c7b84ecada tracing: Remove unused function unregister_tracer()
The function register_tracer() is only used by kernel core code,
that never needs to remove the tracer. As trace_events have become
the main way to add new tracing to the kernel, the need to
unregister a tracer has diminished. Remove the unused function
unregister_tracer(). If a need arises where we need it, then we
can always add it back.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-11-02 10:21:50 -04:00
Steven Rostedt 15075cac42 tracing: Separate open function from set_event and available_events
The open function used by available_events is the same as set_event even
though it uses different seq functions. This causes a side effect of
writing into available_events clearing all events, even though
available_events is suppose to be read only.

There's no reason to keep a single function for just the open and have
both use different functions for everything else. It is a little
confusing and causes strange behavior. Just have each have their own
function.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-11-02 10:21:49 -04:00
Yoshihiro YUNOMAE 50ecf2c3af ring-buffer: Change unsigned long type of ring_buffer_oldest_event_ts() to u64
ring_buffer_oldest_event_ts() should return a value of u64 type, because
ring_buffer_per_cpu->buffer_page->buffer_data_page->time_stamp is u64 type.

Link: http://lkml.kernel.org/r/1349998076-15495-5-git-send-email-dhsharp@google.com

Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Vaibhav Nagarnaik <vnagarnaik@google.com>
Signed-off-by: Yoshihiro YUNOMAE <yoshihiro.yunomae.ez@hitachi.com>
Signed-off-by: David Sharp <dhsharp@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-11-02 10:21:48 -04:00
David Sharp 60303ed3f4 tracing: Reset ring buffer when changing trace_clocks
Because the "tsc" clock isn't in nanoseconds, the ring buffer must be
reset when changing clocks so that incomparable timestamps don't end up
in the same trace.

Tested: Confirmed switching clocks resets the trace buffer.

Google-Bug-Id: 6980623
Link: http://lkml.kernel.org/r/1349998076-15495-3-git-send-email-dhsharp@google.com

Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: David Sharp <dhsharp@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-11-02 10:21:47 -04:00
Richard Cochran 65f8f9a1c1 time: remove the timecompare code.
This patch removes the timecompare code from the kernel. The top five
reasons to do this are:

1. There are no more users of this code.
2. The original idea was a bit weak.
3. The original author has disappeared.
4. The code was not general purpose but tuned to a particular hardware,
5. There are better ways to accomplish clock synchronization.

Signed-off-by: Richard Cochran <richardcochran@gmail.com>
Acked-by: John Stultz <john.stultz@linaro.org>
Tested-by: Bob Liu <lliubbo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-01 11:41:35 -04:00
Chuansheng Liu b8f61116c1 tick: Correct the comments for tick_sched_timer()
In the comments of function tick_sched_timer(), the sentence
"timer->base->cpu_base->lock held" is not right.

In function __run_hrtimer(), before call timer->function(),
the cpu_base->lock has been unlocked.

Signed-off-by: liu chuansheng <chuansheng.liu@intel.com>
Cc: fei.li@intel.com
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1351098455.15558.1421.camel@cliu38-desktop-build
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-11-01 12:13:59 +01:00
Sankara Muthukrishnan f3de44edf3 irq: Set CPU affinity right on thread creation
As irq_thread_check_affinity is called ONLY inside the while loop in
the irq thread, the core affinity is set only when an interrupt
occurs. This patch sets the core affinity right after the irq thread
is created and before it waits for interrupts. In real-tiime targets
that do not typically change the core affinity of irqs during
run-time, this patch will save additional latency of an irq thread in
setting the core affinity during the first interrupt occurrence for
that irq.

Signed-off-by: Sankara S Muthukrishnan <sankara.m@ni.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/CAFQPvXeVZ858WFYimEU5uvLNxLDd6bJMmqWihFmbCf3ntokz0A@mail.gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-11-01 12:11:31 +01:00
Thomas Gleixner 293a7a0a16 genirq: Provide means to retrigger parent
Attempts to retrigger nested threaded IRQs currently fail because they
have no primary handler. In order to support retrigger of nested
IRQs, the parent IRQ needs to be retriggered.

To fix, when an IRQ needs to be resent, if the interrupt has a parent
IRQ and runs in the context of the parent IRQ, then resend the parent.

Also, handle_nested_irq() needs to clear the replay flag like the
other handlers, otherwise check_irq_resend() will set it and it will
never be cleared.  Without clearing, it results in the first resend
working fine, but check_irq_resend() returning early on subsequent
resends because the replay flag is still set.

Problem discovered on ARM/OMAP platforms where a nested IRQ that's
also a wakeup IRQ happens late in suspend and needed to be retriggered
during the resume process.

[khilman@ti.com: changelog edits, clear IRQS_REPLAY in handle_nested_irq()]

Reported-by: Kevin Hilman <khilman@ti.com>
Tested-by: Kevin Hilman <khilman@ti.com>
Cc: linux-arm-kernel@lists.infradead.org
Link: http://lkml.kernel.org/r/1350425269-11489-1-git-send-email-khilman@deeprootsystems.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-11-01 12:11:31 +01:00
Thomas Gleixner 59fa624519 futex: Handle futex_pi OWNER_DIED take over correctly
Siddhesh analyzed a failure in the take over of pi futexes in case the
owner died and provided a workaround.
See: http://sourceware.org/bugzilla/show_bug.cgi?id=14076

The detailed problem analysis shows:

Futex F is initialized with PTHREAD_PRIO_INHERIT and
PTHREAD_MUTEX_ROBUST_NP attributes.

T1 lock_futex_pi(F);

T2 lock_futex_pi(F);
   --> T2 blocks on the futex and creates pi_state which is associated
       to T1.

T1 exits
   --> exit_robust_list() runs
       --> Futex F userspace value TID field is set to 0 and
           FUTEX_OWNER_DIED bit is set.

T3 lock_futex_pi(F);
   --> Succeeds due to the check for F's userspace TID field == 0
   --> Claims ownership of the futex and sets its own TID into the
       userspace TID field of futex F
   --> returns to user space

T1 --> exit_pi_state_list()
       --> Transfers pi_state to waiter T2 and wakes T2 via
       	   rt_mutex_unlock(&pi_state->mutex)

T2 --> acquires pi_state->mutex and gains real ownership of the
       pi_state
   --> Claims ownership of the futex and sets its own TID into the
       userspace TID field of futex F
   --> returns to user space

T3 --> observes inconsistent state

This problem is independent of UP/SMP, preemptible/non preemptible
kernels, or process shared vs. private. The only difference is that
certain configurations are more likely to expose it.

So as Siddhesh correctly analyzed the following check in
futex_lock_pi_atomic() is the culprit:

	if (unlikely(ownerdied || !(curval & FUTEX_TID_MASK))) {

We check the userspace value for a TID value of 0 and take over the
futex unconditionally if that's true.

AFAICT this check is there as it is correct for a different corner
case of futexes: the WAITERS bit became stale.

Now the proposed change

-	if (unlikely(ownerdied || !(curval & FUTEX_TID_MASK))) {
+       if (unlikely(ownerdied ||
+                       !(curval & (FUTEX_TID_MASK | FUTEX_WAITERS)))) {

solves the problem, but it's not obvious why and it wreckages the
"stale WAITERS bit" case.

What happens is, that due to the WAITERS bit being set (T2 is blocked
on that futex) it enforces T3 to go through lookup_pi_state(), which
in the above case returns an existing pi_state and therefor forces T3
to legitimately fight with T2 over the ownership of the pi_state (via
pi_state->mutex). Probelm solved!

Though that does not work for the "WAITERS bit is stale" problem
because if lookup_pi_state() does not find existing pi_state it
returns -ERSCH (due to TID == 0) which causes futex_lock_pi() to
return -ESRCH to user space because the OWNER_DIED bit is not set.

Now there is a different solution to that problem. Do not look at the
user space value at all and enforce a lookup of possibly available
pi_state. If pi_state can be found, then the new incoming locker T3
blocks on that pi_state and legitimately races with T2 to acquire the
rt_mutex and the pi_state and therefor the proper ownership of the
user space futex.

lookup_pi_state() has the correct order of checks. It first tries to
find a pi_state associated with the user space futex and only if that
fails it checks for futex TID value = 0. If no pi_state is available
nothing can create new state at that point because this happens with
the hash bucket lock held.

So the above scenario changes to:

T1 lock_futex_pi(F);

T2 lock_futex_pi(F);
   --> T2 blocks on the futex and creates pi_state which is associated
       to T1.

T1 exits
   --> exit_robust_list() runs
       --> Futex F userspace value TID field is set to 0 and
           FUTEX_OWNER_DIED bit is set.

T3 lock_futex_pi(F);
   --> Finds pi_state and blocks on pi_state->rt_mutex

T1 --> exit_pi_state_list()
       --> Transfers pi_state to waiter T2 and wakes it via
       	   rt_mutex_unlock(&pi_state->mutex)

T2 --> acquires pi_state->mutex and gains ownership of the pi_state
   --> Claims ownership of the futex and sets its own TID into the
       userspace TID field of futex F
   --> returns to user space

This covers all gazillion points on which T3 might come in between
T1's exit_robust_list() clearing the TID field and T2 fixing it up. It
also solves the "WAITERS bit stale" problem by forcing the take over.

Another benefit of changing the code this way is that it makes it less
dependent on untrusted user space values and therefor minimizes the
possible wreckage which might be inflicted.

As usual after staring for too long at the futex code my brain hurts
so much that I really want to ditch that whole optimization of
avoiding the syscall for the non contended case for PI futexes and rip
out the maze of corner case handling code. Unfortunately we can't as
user space relies on that existing behaviour, but at least thinking
about it helps me to preserve my mental sanity. Maybe we should
nevertheless :)

Reported-and-tested-by: Siddhesh Poyarekar <siddhesh.poyarekar@gmail.com>
Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1210232138540.2756@ionos
Acked-by: Darren Hart <dvhart@linux.intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-11-01 12:06:54 +01:00
Vaibhav Nagarnaik 6f86ab9fca tracing: Cleanup unnecessary function declarations
The functions defined in include/trace/syscalls.h are not used directly
since struct ftrace_event_class was introduced. Remove them from the
header file and rearrange the ftrace_event_class declarations in
trace_syscalls.c.

Link: http://lkml.kernel.org/r/1339112785-21806-2-git-send-email-vnagarnaik@google.com

Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-10-31 16:45:34 -04:00
David Sharp 01e3e710a9 tracing: Trivial cleanup
Remove ftrace_format_syscall() declaration; it is neither defined nor
used. Also update a comment and formatting.

Link: http://lkml.kernel.org/r/1339112785-21806-1-git-send-email-vnagarnaik@google.com

Signed-off-by: David Sharp <dhsharp@google.com>
Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-10-31 16:45:33 -04:00
Steven Rostedt 7ffbd48d5c tracing: Cache comms only after an event occurred
Whenever an event is registered, the comm of tasks are saved at
every task switch instead of saving them at every event. But if
an event isn't executed much, the comm cache will be filled up
by tasks that did not record the event and you lose out on the comms
that did.

Here's an example, if you enable the following events:

echo 1 > /debug/tracing/events/kvm/kvm_cr/enable
echo 1 > /debug/tracing/events/net/net_dev_xmit/enable

Note, there's no kvm running on this machine so the first event will
never be triggered, but because it is enabled, the storing of comms
will continue. If we now disable the network event:

echo 0 > /debug/tracing/events/net/net_dev_xmit/enable

and look at the trace:

cat /debug/tracing/trace
            sshd-2672  [001] ..s2   375.731616: net_dev_xmit: dev=eth0 skbaddr=ffff88005cbb6de0 len=242 rc=0
            sshd-2672  [001] ..s1   375.731617: net_dev_xmit: dev=br0 skbaddr=ffff88005cbb6de0 len=242 rc=0
            sshd-2672  [001] ..s2   375.859356: net_dev_xmit: dev=eth0 skbaddr=ffff88005cbb6de0 len=242 rc=0
            sshd-2672  [001] ..s1   375.859357: net_dev_xmit: dev=br0 skbaddr=ffff88005cbb6de0 len=242 rc=0
            sshd-2672  [001] ..s2   375.947351: net_dev_xmit: dev=eth0 skbaddr=ffff88005cbb6de0 len=242 rc=0
            sshd-2672  [001] ..s1   375.947352: net_dev_xmit: dev=br0 skbaddr=ffff88005cbb6de0 len=242 rc=0
            sshd-2672  [001] ..s2   376.035383: net_dev_xmit: dev=eth0 skbaddr=ffff88005cbb6de0 len=242 rc=0
            sshd-2672  [001] ..s1   376.035383: net_dev_xmit: dev=br0 skbaddr=ffff88005cbb6de0 len=242 rc=0
            sshd-2672  [001] ..s2   377.563806: net_dev_xmit: dev=eth0 skbaddr=ffff88005cbb6de0 len=226 rc=0
            sshd-2672  [001] ..s1   377.563807: net_dev_xmit: dev=br0 skbaddr=ffff88005cbb6de0 len=226 rc=0
            sshd-2672  [001] ..s2   377.563834: net_dev_xmit: dev=eth0 skbaddr=ffff88005cbb6be0 len=114 rc=0
            sshd-2672  [001] ..s1   377.563842: net_dev_xmit: dev=br0 skbaddr=ffff88005cbb6be0 len=114 rc=0

We see that process 2672 which triggered the events has the comm "sshd".
But if we run hackbench for a bit and look again:

cat /debug/tracing/trace
           <...>-2672  [001] ..s2   375.731616: net_dev_xmit: dev=eth0 skbaddr=ffff88005cbb6de0 len=242 rc=0
           <...>-2672  [001] ..s1   375.731617: net_dev_xmit: dev=br0 skbaddr=ffff88005cbb6de0 len=242 rc=0
           <...>-2672  [001] ..s2   375.859356: net_dev_xmit: dev=eth0 skbaddr=ffff88005cbb6de0 len=242 rc=0
           <...>-2672  [001] ..s1   375.859357: net_dev_xmit: dev=br0 skbaddr=ffff88005cbb6de0 len=242 rc=0
           <...>-2672  [001] ..s2   375.947351: net_dev_xmit: dev=eth0 skbaddr=ffff88005cbb6de0 len=242 rc=0
           <...>-2672  [001] ..s1   375.947352: net_dev_xmit: dev=br0 skbaddr=ffff88005cbb6de0 len=242 rc=0
           <...>-2672  [001] ..s2   376.035383: net_dev_xmit: dev=eth0 skbaddr=ffff88005cbb6de0 len=242 rc=0
           <...>-2672  [001] ..s1   376.035383: net_dev_xmit: dev=br0 skbaddr=ffff88005cbb6de0 len=242 rc=0
           <...>-2672  [001] ..s2   377.563806: net_dev_xmit: dev=eth0 skbaddr=ffff88005cbb6de0 len=226 rc=0
           <...>-2672  [001] ..s1   377.563807: net_dev_xmit: dev=br0 skbaddr=ffff88005cbb6de0 len=226 rc=0
           <...>-2672  [001] ..s2   377.563834: net_dev_xmit: dev=eth0 skbaddr=ffff88005cbb6be0 len=114 rc=0
           <...>-2672  [001] ..s1   377.563842: net_dev_xmit: dev=br0 skbaddr=ffff88005cbb6be0 len=114 rc=0

The stored "sshd" comm has been flushed out and we get a useless "<...>".

But by only storing comms after a trace event occurred, we can run
hackbench all day and still get the same output.

Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-10-31 16:45:31 -04:00
Steven Rostedt 2b70e59043 tracing: Have tracing_sched_wakeup_trace() use standard unlock_commit
The functon tracing_sched_wakeup_trace() does an open coded unlock
commit and save stack. This is what the trace_nowake_buffer_unlock_commit()
is for.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-10-31 16:45:30 -04:00
Steven Rostedt 81698831bc tracing: Enable comm recording if trace_printk() is used
If comm recording is not enabled when trace_printk() is used then
you just get this type of output:

[ adding trace_printk("hello! %d", irq); in do_IRQ ]

           <...>-2843  [001] d.h.    80.812300: do_IRQ: hello! 14
           <...>-2734  [002] d.h2    80.824664: do_IRQ: hello! 14
           <...>-2713  [003] d.h.    80.829971: do_IRQ: hello! 14
           <...>-2814  [000] d.h.    80.833026: do_IRQ: hello! 14

By enabling the comm recorder when trace_printk is enabled:

       hackbench-6715  [001] d.h.   193.233776: do_IRQ: hello! 21
            sshd-2659  [001] d.h.   193.665862: do_IRQ: hello! 21
          <idle>-0     [001] d.h1   193.665996: do_IRQ: hello! 21

Suggested-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-10-31 16:45:29 -04:00
Steven Rostedt b382ede6b5 tracing: Expand ring buffer when trace_printk() is used
Since tracing is not used by 99% of Linux users, even though tracing
may be configured in, it does not make sense to allocate 1.4 Megs
per CPU for the ring buffers if they are not used. Thus, on boot up
the ring buffers are set to a minimal size until something needs the
and they are expanded.

This works well for events and tracers (function, etc), but for the
asynchronous use of trace_printk() which can write to the ring buffer
at any time, does not expand the buffers.

On boot up a check is made to see if any trace_printk() is used to
see if the trace_printk() temp buffer pages should be allocated. This
same code can be used to expand the buffers as well.

Suggested-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-10-31 16:45:28 -04:00
Slava Pestov 884bfe89a4 ring-buffer: Add a 'dropped events' counter
The existing 'overrun' counter is incremented when the ring
buffer wraps around, with overflow on (the default). We wanted
a way to count requests lost from the buffer filling up with
overflow off, too. I decided to add a new counter instead
of retro-fitting the existing one because it seems like a
different statistic to count conceptually, and also because
of how the code was structured.

Link: http://lkml.kernel.org/r/1310765038-26399-1-git-send-email-slavapestov@google.com

Signed-off-by: Slava Pestov <slavapestov@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-10-31 16:45:27 -04:00
Hiraku Toyooka f43c738bfa tracing: Change tracer's integer flags to bool
print_max and use_max_tr in struct tracer are "int" variables and
used like flags. This is wasteful, so change the type to "bool".

Link: http://lkml.kernel.org/r/20121002082710.9807.86393.stgit@falsita

Signed-off-by: Hiraku Toyooka <hiraku.toyooka.gu@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-10-31 16:45:25 -04:00
Steven Rostedt 6f4156723c tracing: Allow tracers to start at core initcall
There's times during debugging that it is helpful to see traces of early
boot functions. But the tracers are initialized at device_initcall()
which is quite late during the boot process. Setting the kernel command
line parameter ftrace=function will not show anything until the function
tracer is initialized. This prevents being able to trace functions before
device_initcall().

There's no reason that the tracers need to be initialized so late in the
boot process. Move them up to core_initcall() as they still need to come
after early_initcall() which initializes the tracing buffers.

Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-10-31 16:45:24 -04:00
Daniel Walter bcd83ea6cb tracing: Replace strict_strto* with kstrto*
* remove old string conversions with kstrto*

Link: http://lkml.kernel.org/r/20120926200838.GC1244@0x90.at

Signed-off-by: Daniel Walter <sahne@0x90.at>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-10-31 16:45:23 -04:00
Rusty Russell 59ef28b1f1 module: fix out-by-one error in kallsyms
Masaki found and patched a kallsyms issue: the last symbol in a
module's symtab wasn't transferred.  This is because we manually copy
the zero'th entry (which is always empty) then copy the rest in a loop
starting at 1, though from src[0].  His fix was minimal, I prefer to
rewrite the loops in more standard form.

There are two loops: one to get the size, and one to copy.  Make these
identical: always count entry 0 and any defined symbol in an allocated
non-init section.

This bug exists since the following commit was introduced.
   module: reduce symbol table for loaded modules (v2)
   commit: 4a4962263f

LKML: http://lkml.org/lkml/2012/10/24/27
Reported-by: Masaki Kimura <masaki.kimura.kz@hitachi.com>
Cc: stable@kernel.org
2012-10-31 13:56:37 +10:30
Mike Galbraith 5258f386ea sched/autogroup: Fix crash on reboot when autogroup is disabled
Due to these two commits:

  8323f26ce3 sched: Fix race in task_group()
  800d4d30c8 sched, autogroup: Stop going ahead if autogroup is disabled

... autogroup scheduling's dynamic knobs are wrecked.

With both patches applied, all you have to do to crash a box is
disable autogroup during boot up, then reboot.. boom, NULL pointer
dereference due to 800d4d30 not allowing autogroup to move things,
and 8323f26ce making that the only way to switch runqueues.

Remove most of the (dysfunctional) knobs and turn the remaining
sched_autogroup_enabled knob readonly.

If the user fiddles with cgroups hereafter, once tasks
are moved, autogroup won't mess with them again unless
they call setsid().

No knobs, no glitz, nada, just a cute little thing folks can
turn on if they don't want to muck about with cgroups and/or
systemd.

Signed-off-by: Mike Galbraith <efault@gmx.de>
Cc: Xiaotian Feng <xtfeng@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Xiaotian Feng <dannyfeng@tencent.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: <stable@vger.kernel.org> # v3.6
Link: http://lkml.kernel.org/r/1351451963.4999.8.camel@maggy.simpson.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-10-30 10:26:04 +01:00
Michael Neuling 0d855354ea perf, powerpc: Fix hw breakpoints returning -ENOSPC
I've been trying to get hardware breakpoints with perf to work
on POWER7 but I'm getting the following:

  % perf record -e mem:0x10000000 true

    Error: sys_perf_event_open() syscall returned with 28 (No space left on device).  /bin/dmesg may provide additional information.

    Fatal: No CONFIG_PERF_EVENTS=y kernel support configured?

  true: Terminated

(FWIW adding -a and it works fine)

Debugging it seems that __reserve_bp_slot() is returning ENOSPC
because it thinks there are no free breakpoint slots on this
CPU.

I have a 2 CPUs, so perf userspace is doing two perf_event_open
syscalls to add a counter to each CPU [1].  The first syscall
succeeds but the second is failing.

On this second syscall, fetch_bp_busy_slots() sets slots.pinned
to be 1, despite there being no breakpoint on this CPU.  This is
because the call the task_bp_pinned, checks all CPUs, rather
than just the current CPU. POWER7 only has one hardware
breakpoint per CPU (ie. HBP_NUM=1), so we return ENOSPC.

The following patch fixes this by checking the associated CPU
for each breakpoint in task_bp_pinned.  I'm not familiar with
this code, so it's provided as a reference to the above issue.

Signed-off-by: Michael Neuling <mikey@neuling.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Michael Ellerman <michael@ellerman.id.au>
Cc: Jovi Zhang <bookjovi@gmail.com>
Cc: K Prasad <prasad@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/1351268936-2956-1-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-10-30 10:07:58 +01:00
Frederic Weisbecker 3e1df4f506 cputime: Separate irqtime accounting from generic vtime
vtime_account() doesn't have the same role in
CONFIG_VIRT_CPU_ACCOUNTING and CONFIG_IRQ_TIME_ACCOUNTING.

In the first case it handles time accounting in any context. In
the second case it only handles irq time accounting.

So when vtime_account() is called from outside vtime_account_irq_*()
this call is pointless to CONFIG_IRQ_TIME_ACCOUNTING.

To fix the confusion, change vtime_account() to irqtime_account_irq()
in CONFIG_IRQ_TIME_ACCOUNTING. This way we ensure future account_vtime()
calls won't waste useless cycles in the irqtime APIs.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
2012-10-29 21:31:32 +01:00
Frederic Weisbecker fa5058f3b6 cputime: Specialize irq vtime hooks
With CONFIG_VIRT_CPU_ACCOUNTING, when vtime_account()
is called in irq entry/exit, we perform a check on the
context: if we are interrupting the idle task we
account the pending cputime to idle, otherwise account
to system time or its sub-areas: tsk->stime, hardirq time,
softirq time, ...

However this check for idle only concerns the hardirq entry
and softirq entry:

* Hardirq may directly interrupt the idle task, in which case
we need to flush the pending CPU time to idle.

* The idle task may be directly interrupted by a softirq if
it calls local_bh_enable(). There is probably no such call
in any idle task but we need to cover every case. Ksoftirqd
is not concerned because the idle time is flushed on context
switch and softirq in the end of hardirq have the idle time
already flushed from the hardirq entry.

In the other cases we always account to system/irq time:

* On hardirq exit we account the time to hardirq time.
* On softirq exit we account the time to softirq time.

To optimize this and avoid the indirect call to vtime_account()
and the checks it performs, specialize the vtime irq APIs and
only perform the check on irq entry. Irq exit can directly call
vtime_account_system().

CONFIG_IRQ_TIME_ACCOUNTING behaviour doesn't change and directly
maps to its own vtime_account() implementation. One may want
to take benefits from the new APIs to optimize irq time accounting
as well in the future.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
2012-10-29 21:31:32 +01:00
Frederic Weisbecker 11113334d1 vtime: Make vtime_account_system() irqsafe
vtime_account_system() currently has only one caller with
vtime_account() which is irq safe.

Now we are going to call it from other places like kvm where
irqs are not always disabled by the time we account the cputime.

So let's make it irqsafe. The arch implementation part is now
prefixed with "__".

vtime_account_idle() arch implementation is prefixed accordingly
to stay consistent.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
2012-10-29 21:31:31 +01:00
Greg Kroah-Hartman ca364d8388 Merge 3.7-rc3 into tty-next
This merges the tty changes in 3.7-rc3 into tty-next

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-10-29 09:00:57 -07:00
Lai Jiangshan cda4dc8130 rcutorture: Use DEFINE_STATIC_SRCU()
Use DEFINE_STATIC_SRCU() to simplify the rcutorture.c SRCU test code.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-10-27 15:39:20 -07:00
Oleg Nesterov 5d8f72b55c freezer: change ptrace_stop/do_signal_stop to use freezable_schedule()
try_to_freeze_tasks() and cgroup_freezer rely on scheduler locks
to ensure that a task doing STOPPED/TRACED -> RUNNING transition
can't escape freezing. This mostly works, but ptrace_stop() does
not necessarily call schedule(), it can change task->state back to
RUNNING and check freezing() without any lock/barrier in between.

We could add the necessary barrier, but this patch changes
ptrace_stop() and do_signal_stop() to use freezable_schedule().
This fixes the race, freezer_count() and freezer_should_skip()
carefully avoid the race.

And this simplifies the code, try_to_freeze_tasks/update_if_frozen
no longer need to use task_is_stopped_or_traced() checks with the
non trivial assumptions. We can rely on the mechanism which was
specially designed to mark the sleeping task as "frozen enough".

v2: As Tejun pointed out, we can also change get_signal_to_deliver()
and move try_to_freeze() up before 'relock' label.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2012-10-26 14:27:49 -07:00
Linus Torvalds 2ab3f29ddd Merge branch 'akpm' (Andrew's fixes)
Merge misc fixes from Andrew Morton:
 "18 total.  15 fixes and some updates to a device_cgroup patchset which
  bring it up to date with the version which I should have merged in the
  first place."

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (18 patches)
  fs/compat_ioctl.c: VIDEO_SET_SPU_PALETTE missing error check
  gen_init_cpio: avoid stack overflow when expanding
  drivers/rtc/rtc-imxdi.c: add missing spin lock initialization
  mm, numa: avoid setting zone_reclaim_mode unless a node is sufficiently distant
  pidns: limit the nesting depth of pid namespaces
  drivers/dma/dw_dmac: make driver's endianness configurable
  mm/mmu_notifier: allocate mmu_notifier in advance
  tools/testing/selftests/epoll/test_epoll.c: fix build
  UAPI: fix tools/vm/page-types.c
  mm/page_alloc.c:alloc_contig_range(): return early for err path
  rbtree: include linux/compiler.h for definition of __always_inline
  genalloc: stop crashing the system when destroying a pool
  backlight: ili9320: add missing SPI dependency
  device_cgroup: add proper checking when changing default behavior
  device_cgroup: stop using simple_strtoul()
  device_cgroup: rename deny_all to behavior
  cgroup: fix invalid rcu dereference
  mm: fix XFS oops due to dirty pages without buffers on s390
2012-10-25 16:05:57 -07:00
H. Peter Anvin 2008713c71 Makefile: Documentation for external tool should be correct
If one includes documentation for an external tool, it should be
correct.  This is not:

1. Overriding the input to rngd should typically be neither
   necessary nor desired.  This is especially so since newer
   versions of rngd support a number of different *types* of sources.
2. The default kernel-exported device is called /dev/hwrng not
   /dev/hwrandom nor /dev/hw_random (both of which were used in the
   past; however, kernel and udev seem to have converged on
   /dev/hwrng.)

Overall it is better if the documentation for rngd is kept with rngd
rather than in a kernel Makefile.

Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Jeff Garzik <jgarzik@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-25 16:00:53 -07:00
Andrew Vagin f230250577 pidns: limit the nesting depth of pid namespaces
'struct pid' is a "variable sized struct" - a header with an array of
upids at the end.

The size of the array depends on a level (depth) of pid namespaces.  Now a
level of pidns is not limited, so 'struct pid' can be more than one page.

Looks reasonable, that it should be less than a page.  MAX_PIS_NS_LEVEL is
not calculated from PAGE_SIZE, because in this case it depends on
architectures, config options and it will be reduced, if someone adds a
new fields in struct pid or struct upid.

I suggest to set MAX_PIS_NS_LEVEL = 32, because it saves ability to expand
"struct pid" and it's more than enough for all known for me use-cases.
When someone finds a reasonable use case, we can add a config option or a
sysctl parameter.

In addition it will reduce the effect of another problem, when we have
many nested namespaces and the oldest one starts dying.
zap_pid_ns_processe will be called for each namespace and find_vpid will
be called for each process in a namespace.  find_vpid will be called
minimum max_level^2 / 2 times.  The reason of that is that when we found a
bit in pidmap, we can't determine this pidns is top for this process or it
isn't.

vpid is a heavy operation, so a fork bomb, which create many nested
namespace, can make a system inaccessible for a long time.  For example my
system becomes inaccessible for a few minutes with 4000 processes.

[akpm@linux-foundation.org: return -EINVAL in response to excessive nesting, not -ENOMEM]
Signed-off-by: Andrew Vagin <avagin@openvz.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-25 14:37:53 -07:00
Jovi Zhang 0d13ac96b9 uprobes: Fix misleading log entry
There don't have any 'r' prefix in uprobe event naming, remove it.

Signed-off-by: Jovi Zhang <bookjovi@gmail.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
2012-10-25 16:02:51 +02:00
Linus Torvalds cbb525b447 Merge branch 'for-3.7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
 "This pull request contains three fixes.

  Two are reverts of task_lock() removal in cgroup fork path.  The
  optimizations incorrectly assumed that threadgroup_lock can protect
  process forks (as opposed to thread creations) too.  Further cleanup
  of cgroup fork path is scheduled.

  The third fixes cgroup emptiness notification loss."

* 'for-3.7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  Revert "cgroup: Remove task_lock() from cgroup_post_fork()"
  Revert "cgroup: Drop task_lock(parent) on cgroup_fork()"
  cgroup: notify_on_release may not be triggered in some cases
2012-10-24 16:35:13 -07:00
Linus Torvalds d579a35d0e Merge branch 'for-3.7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue fix from Tejun Heo:
 "This pull request contains one patch from Dan Magenheimer to fix
  cancel_delayed_work() regression introduced by its reimplementation
  using try_to_grab_pending().  The reimplementation made it incorrectly
  return %true when the work item is idle.

  There aren't too many consumers of the return value but it broke at
  least ramster."

* 'for-3.7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: cancel_delayed_work() should return %false if work item is idle
2012-10-24 16:33:22 -07:00
Dan Magenheimer c0158ca64d workqueue: cancel_delayed_work() should return %false if work item is idle
57b30ae77b ("workqueue: reimplement cancel_delayed_work() using
try_to_grab_pending()") made cancel_delayed_work() always return %true
unless someone else is also trying to cancel the work item, which is
broken - if the target work item is idle, the return value should be
%false.

try_to_grab_pending() indicates that the target work item was idle by
zero return value.  Use it for return.  Note that this brings
cancel_delayed_work() in line with __cancel_work_timer() in return
value handling.

Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
LKML-Reference: <444a6439-b1a4-4740-9e7e-bc37267cfe73@default>
2012-10-24 12:38:16 -07:00
Alan Cox 8ae763cd7e audit: remove bogus tty name check
tty name is an array not a pointer

Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-10-24 11:34:51 -07:00
Cyrill Gorcunov 99fb4a122e lockdep: Use KSYM_NAME_LEN'ed buffer for __get_key_name()
Not a big deal, but since other __get_key_name() callers
use it lets be consistent.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20121020190519.GH25467@moon
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-10-24 12:39:09 +02:00
Peter Zijlstra e9c84cb8d5 sched: Describe CFS load-balancer
Add some scribbles on how and why the load-balancer works..

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1341316406.23484.64.camel@twins
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-10-24 10:27:33 +02:00
Paul Turner f4e26b120b sched: Introduce temporary FAIR_GROUP_SCHED dependency for load-tracking
While per-entity load-tracking is generally useful, beyond computing shares
distribution, e.g. runnable based load-balance (in progress), governors,
power-management, etc.

These facilities are not yet consumers of this data.  This may be trivially
reverted when the information is required; but avoid paying the overhead for
calculations we will not use until then.

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120823141507.422162369@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-10-24 10:27:31 +02:00
Paul Turner 5b51f2f80b sched: Make __update_entity_runnable_avg() fast
__update_entity_runnable_avg forms the core of maintaining an entity's runnable
load average.  In this function we charge the accumulated run-time since last
update and handle appropriate decay.  In some cases, e.g. a waking task, this
time interval may be much larger than our period unit.

Fortunately we can exploit some properties of our series to perform decay for a
blocked update in constant time and account the contribution for a running
update in essentially-constant* time.

[*]: For any running entity they should be performing updates at the tick which
gives us a soft limit of 1 jiffy between updates, and we can compute up to a
32 jiffy update in a single pass.

C program to generate the magic constants in the arrays:

  #include <math.h>
  #include <stdio.h>

  #define N 32
  #define WMULT_SHIFT 32

  const long WMULT_CONST = ((1UL << N) - 1);
  double y;

  long runnable_avg_yN_inv[N];
  void calc_mult_inv() {
  	int i;
  	double yn = 0;

  	printf("inverses\n");
  	for (i = 0; i < N; i++) {
  		yn = (double)WMULT_CONST * pow(y, i);
  		runnable_avg_yN_inv[i] = yn;
  		printf("%2d: 0x%8lx\n", i, runnable_avg_yN_inv[i]);
  	}
  	printf("\n");
  }

  long mult_inv(long c, int n) {
  	return (c * runnable_avg_yN_inv[n]) >>  WMULT_SHIFT;
  }

  void calc_yn_sum(int n)
  {
  	int i;
  	double sum = 0, sum_fl = 0, diff = 0;

  	/*
  	 * We take the floored sum to ensure the sum of partial sums is never
  	 * larger than the actual sum.
  	 */
  	printf("sum y^n\n");
  	printf("   %8s  %8s %8s\n", "exact", "floor", "error");
  	for (i = 1; i <= n; i++) {
  		sum = (y * sum + y * 1024);
  		sum_fl = floor(y * sum_fl+ y * 1024);
  		printf("%2d: %8.0f  %8.0f %8.0f\n", i, sum, sum_fl,
  			sum_fl - sum);
  	}
  	printf("\n");
  }

  void calc_conv(long n) {
  	long old_n;
  	int i = -1;

  	printf("convergence (LOAD_AVG_MAX, LOAD_AVG_MAX_N)\n");
  	do {
  		old_n = n;
  		n = mult_inv(n, 1) + 1024;
  		i++;
  	} while (n != old_n);
  	printf("%d> %ld\n", i - 1, n);
  	printf("\n");
  }

  void main() {
  	y = pow(0.5, 1/(double)N);
  	calc_mult_inv();
  	calc_conv(1024);
  	calc_yn_sum(N);
  }

[ Compile with -lm ]
Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120823141507.277808946@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-10-24 10:27:30 +02:00
Paul Turner f269ae0469 sched: Update_cfs_shares at period edge
Now that our measurement intervals are small (~1ms) we can amortize the posting
of update_shares() to be about each period overflow.  This is a large cost
saving for frequently switching tasks.

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120823141507.200772172@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-10-24 10:27:29 +02:00
Paul Turner 48a1675323 sched: Refactor update_shares_cpu() -> update_blocked_avgs()
Now that running entities maintain their own load-averages the work we must do
in update_shares() is largely restricted to the periodic decay of blocked
entities.  This allows us to be a little less pessimistic regarding our
occupancy on rq->lock and the associated rq->clock updates required.

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120823141507.133999170@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-10-24 10:27:28 +02:00
Paul Turner 82958366cf sched: Replace update_shares weight distribution with per-entity computation
Now that the machinery in place is in place to compute contributed load in a
bottom up fashion; replace the shares distribution code within update_shares()
accordingly.

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120823141507.061208672@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-10-24 10:27:28 +02:00
Paul Turner f1b17280ef sched: Maintain runnable averages across throttled periods
With bandwidth control tracked entities may cease execution according to user
specified bandwidth limits.  Charging this time as either throttled or blocked
however, is incorrect and would falsely skew in either direction.

What we actually want is for any throttled periods to be "invisible" to
load-tracking as they are removed from the system for that interval and
contribute normally otherwise.

Do this by moderating the progression of time to omit any periods in which the
entity belonged to a throttled hierarchy.

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120823141506.998912151@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-10-24 10:27:27 +02:00
Paul Turner bb17f65571 sched: Normalize tg load contributions against runnable time
Entities of equal weight should receive equitable distribution of cpu time.
This is challenging in the case of a task_group's shares as execution may be
occurring on multiple cpus simultaneously.

To handle this we divide up the shares into weights proportionate with the load
on each cfs_rq.  This does not however, account for the fact that the sum of
the parts may be less than one cpu and so we need to normalize:
  load(tg) = min(runnable_avg(tg), 1) * tg->shares
Where runnable_avg is the aggregate time in which the task_group had runnable
children.

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Ben Segall <bsegall@google.com>.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120823141506.930124292@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-10-24 10:27:26 +02:00
Paul Turner 8165e145ce sched: Compute load contribution by a group entity
Unlike task entities who have a fixed weight, group entities instead own a
fraction of their parenting task_group's shares as their contributed weight.

Compute this fraction so that we can correctly account hierarchies and shared
entity nodes.

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120823141506.855074415@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-10-24 10:27:25 +02:00
Paul Turner c566e8e9e4 sched: Aggregate total task_group load
Maintain a global running sum of the average load seen on each cfs_rq belonging
to each task group so that it may be used in calculating an appropriate
shares:weight distribution.

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120823141506.792901086@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-10-24 10:27:24 +02:00
Paul Turner aff3e49884 sched: Account for blocked load waking back up
When a running entity blocks we migrate its tracked load to
cfs_rq->blocked_runnable_avg.  In the sleep case this occurs while holding
rq->lock and so is a natural transition.  Wake-ups however, are potentially
asynchronous in the presence of migration and so special care must be taken.

We use an atomic counter to track such migrated load, taking care to match this
with the previously introduced decay counters so that we don't migrate too much
load.

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120823141506.726077467@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-10-24 10:27:23 +02:00
Paul Turner 0a74bef8be sched: Add an rq migration call-back to sched_class
Since we are now doing bottom up load accumulation we need explicit
notification when a task has been re-parented so that the old hierarchy can be
updated.

Adds: migrate_task_rq(struct task_struct *p, int next_cpu)

(The alternative is to do this out of __set_task_cpu, but it was suggested that
this would be a cleaner encapsulation.)

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120823141506.660023400@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-10-24 10:27:23 +02:00
Paul Turner 9ee474f556 sched: Maintain the load contribution of blocked entities
We are currently maintaining:

  runnable_load(cfs_rq) = \Sum task_load(t)

For all running children t of cfs_rq.  While this can be naturally updated for
tasks in a runnable state (as they are scheduled); this does not account for
the load contributed by blocked task entities.

This can be solved by introducing a separate accounting for blocked load:

  blocked_load(cfs_rq) = \Sum runnable(b) * weight(b)

Obviously we do not want to iterate over all blocked entities to account for
their decay, we instead observe that:

  runnable_load(t) = \Sum p_i*y^i

and that to account for an additional idle period we only need to compute:

  y*runnable_load(t).

This means that we can compute all blocked entities at once by evaluating:

  blocked_load(cfs_rq)` = y * blocked_load(cfs_rq)

Finally we maintain a decay counter so that when a sleeping entity re-awakens
we can determine how much of its load should be removed from the blocked sum.

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120823141506.585389902@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-10-24 10:27:22 +02:00
Paul Turner 2dac754e10 sched: Aggregate load contributed by task entities on parenting cfs_rq
For a given task t, we can compute its contribution to load as:

  task_load(t) = runnable_avg(t) * weight(t)

On a parenting cfs_rq we can then aggregate:

  runnable_load(cfs_rq) = \Sum task_load(t), for all runnable children t

Maintain this bottom up, with task entities adding their contributed load to
the parenting cfs_rq sum.  When a task entity's load changes we add the same
delta to the maintained sum.

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120823141506.514678907@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-10-24 10:27:21 +02:00
Ben Segall 18bf2805d9 sched: Maintain per-rq runnable averages
Since runqueues do not have a corresponding sched_entity we instead embed a
sched_avg structure directly.

Signed-off-by: Ben Segall <bsegall@google.com>
Reviewed-by: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120823141506.442637130@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-10-24 10:27:20 +02:00
Paul Turner 9d85f21c94 sched: Track the runnable average on a per-task entity basis
Instead of tracking averaging the load parented by a cfs_rq, we can track
entity load directly. With the load for a given cfs_rq then being the sum
of its children.

To do this we represent the historical contribution to runnable average
within each trailing 1024us of execution as the coefficients of a
geometric series.

We can express this for a given task t as:

  runnable_sum(t) = \Sum u_i * y^i, runnable_avg_period(t) = \Sum 1024 * y^i
  load(t) = weight_t * runnable_sum(t) / runnable_avg_period(t)

Where: u_i is the usage in the last i`th 1024us period (approximately 1ms)
~ms and y is chosen such that y^k = 1/2.  We currently choose k to be 32 which
roughly translates to about a sched period.

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120823141506.372695337@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-10-24 10:27:18 +02:00
Chuansheng Liu 351f181f91 timers, sched: Correct the comments for tick_sched_timer()
In the comments of function tick_sched_timer(), the sentence
"timer->base->cpu_base->lock held" is not right.

In function __run_hrtimer(), before call timer->function(),
the cpu_base->lock has been unlocked.

Signed-off-by: liu chuansheng <chuansheng.liu@intel.com>
Cc: fei.li@intel.com
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1351098455.15558.1421.camel@cliu38-desktop-build
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-10-24 10:16:51 +02:00
Daniel Vetter 6b898c07cb console: use might_sleep in console_lock
Instead of BUG_ON(in_interrupt()), since that doesn't check for all
the newfangled stuff like preempt.

Note that this is valid since the console_sem is essentially used like
a real mutex with only two twists:
- we allow trylock from hardirq context
- across suspend/resume we lock the logical console_lock, but drop the
  semaphore protecting the locking state.

Now that doesn't guarantee that no one is playing tricks in
single-thread atomic contexts at suspend/resume/boot time, but
- I couldn't find anything suspicious with some grepping,
- might_sleep shouldn't die,
- and I think the upside of catching more potential issues is worth
  the risk of getting a might_sleep backtrace that would have been
  save (and then dealing with that fallout).

Cc: Dave Airlie <airlied@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-10-23 20:14:55 -07:00
Linus Torvalds e17b131583 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "Most of these are uprobes race fixes from Oleg, and their preparatory
  cleanups.  (It's larger than what I'd normally send for an -rc kernel,
  but they looked significant enough to not delay them.)

  There's also an oprofile fix and an uncore PMU fix."

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (22 commits)
  perf/x86: Disable uncore on virtualized CPUs
  oprofile, x86: Fix wrapping bug in op_x86_get_ctrl()
  ring-buffer: Check for uninitialized cpu buffer before resizing
  uprobes: Fix the racy uprobe->flags manipulation
  uprobes: Fix prepare_uprobe() race with itself
  uprobes: Introduce prepare_uprobe()
  uprobes: Fix handle_swbp() vs unregister() + register() race
  uprobes: Do not delete uprobe if uprobe_unregister() fails
  uprobes: Don't return success if alloc_uprobe() fails
  uprobes/x86: Only rep+nop can be emulated correctly
  uprobes: Simplify is_swbp_at_addr(), remove stale comments
  uprobes: Kill set_orig_insn()->is_swbp_at_addr()
  uprobes: Introduce copy_opcode(), kill read_opcode()
  uprobes: Kill set_swbp()->is_swbp_at_addr()
  uprobes: Restrict valid_vma(false) to skip VM_SHARED vmas
  uprobes: Change valid_vma() to demand VM_MAYEXEC rather than VM_EXEC
  uprobes: Change write_opcode() to use FOLL_FORCE
  uprobes: Move clear_thread_flag(TIF_UPROBE) to uprobe_notify_resume()
  uprobes: Kill UTASK_BP_HIT state
  uprobes: Fix UPROBE_SKIP_SSTEP checks in handle_swbp()
  ...
2012-10-24 04:07:51 +03:00
Paul E. McKenney 53bb857c37 rcu: Dump number of callbacks in stall warning messages
In theory, if a grace period manages to get started despite there being
no callbacks on any of the CPUs, all CPUs could go into dyntick-idle
mode, so that the grace period would never end.  This commit updates
the RCU CPU stall warning messages to detect this condition by summing
up the number of callbacks on all CPUs.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-10-23 14:55:27 -07:00
Paul E. McKenney eee0588261 rcu: Add grace-period information to RCU CPU stall warnings
This commit causes the last grace period started and completed to be
printed on RCU CPU stall warning messages in order to aid diagnosis.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-10-23 14:55:26 -07:00
Paul E. McKenney b637a328bd rcu: Print remote CPU's stacks in stall warnings
The RCU CPU stall warnings rely on trigger_all_cpu_backtrace() to
do NMI-based dump of the stack traces of all CPUs.  Unfortunately, a
number of architectures do not implement trigger_all_cpu_backtrace(), in
which case RCU falls back to just dumping the stack of the running CPU.
This is unhelpful in the case where the running CPU has detected that
some other CPU has stalled.

This commit therefore makes the running CPU dump the stacks of the
tasks running on the stalled CPUs.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-10-23 14:55:25 -07:00
Lai Jiangshan f2ebfbc991 srcu: Export process_srcu()
Because process_srcu() will be used in DEFINE_SRCU(), which is a macro
that could be expanded pretty much anywhere, it can no longer be static.
Note that process_srcu() is still internal to srcu.h.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-10-23 14:54:42 -07:00
Lai Jiangshan 4e87b2d7e8 srcu: Credit Lai Jiangshan with SRCU rewrite
Lai Jiangshan rewrote SRCU, so this commit ensures that he gets his
proper share of blame^Wcredit.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-10-23 14:54:41 -07:00
Paul E. McKenney 340f588bba rcu: Fix precedence error in cpu_needs_another_gp()
The fix introduced by a10d206e (rcu: Fix day-one dyntick-idle
stall-warning bug) has a C-language precedence error.  It turns out
that this error is harmless in that the same result is computed for all
inputs, but the code is nevertheless a potential source of confusion.
This commit therefore introduces parentheses in order to force the
execution of the code to reflect the intent.

Reported-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-10-23 14:54:09 -07:00
Antti P Miettinen 3705b88db0 rcu: Add a module parameter to force use of expedited RCU primitives
There have been some embedded applications that would benefit from
use of expedited grace-period primitives.  In some ways, this is
similar to synchronize_net() doing either a normal or an expedited
grace period depending on lock state, but with control outside of
the kernel.

This commit therefore adds rcu_expedited boot and sysfs parameters
that cause the kernel to substitute expedited primitives for the
normal grace-period primitives.

[ paulmck: Add trace/event/rcu.h to kernel/srcu.c to avoid build error.
	   Get rid of infinite loop through contention path.]

Signed-off-by: Antti P Miettinen <amiettinen@nvidia.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-10-23 14:54:08 -07:00
Frederic Weisbecker 4d9a5d4319 rcu: Remove rcu_switch()
It's only there to call rcu_user_hooks_switch(). Let's
just call rcu_user_hooks_switch() directly, we don't need this
function in the middle.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Richard Weinberger <richard@nod.at>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-10-23 14:54:06 -07:00
Paul E. McKenney 489832609a rcu: Make rcutorture give diagnostics if CPU offline fails
This commit causes rcutorture to print the errno if cpu_down() fails
when the rcutorture "verbose" module parameter is specified.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-10-23 14:46:47 -07:00
Paul E. McKenney abfd6e58ae rcu: Fix comment about _rcu_barrier()/orphanage exclusion
In the old days, _rcu_barrier() acquired ->onofflock to exclude
rcu_send_cbs_to_orphanage(), which allowed the latter to avoid memory
barriers in callback handling.  However, _rcu_barrier() recently started
doing get_online_cpus() to lock out CPU-hotplug operations entirely, which
means that the comment in rcu_send_cbs_to_orphanage() that talks about
->onofflock is now obsolete.  This commit therefore fixes the comment.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-10-23 14:46:47 -07:00
Daniel Vetter daee779718 console: implement lockdep support for console_lock
Dave Airlie recently discovered a locking bug in the fbcon layer,
where a timer_del_sync (for the blinking cursor) deadlocks with the
timer itself, since both (want to) hold the console_lock:

https://lkml.org/lkml/2012/8/21/36

Unfortunately the console_lock isn't a plain mutex and hence has no
lockdep support. Which resulted in a few days wasted of tracking down
this bug (complicated by the fact that printk doesn't show anything
when the console is locked) instead of noticing the bug much earlier
with the lockdep splat.

Hence I've figured I need to fix that for the next deadlock involving
console_lock - and with kms/drm growing ever more complex locking
that'll eventually happen.

Now the console_lock has rather funky semantics, so after a quick irc
discussion with Thomas Gleixner and Dave Airlie I've quickly ditched
the original idead of switching to a real mutex (since it won't work)
and instead opted to annotate the console_lock with lockdep
information manually.

There are a few special cases:
- The console_lock state is protected by the console_sem, and usually
  grabbed/dropped at _lock/_unlock time. But the suspend/resume code
  drops the semaphore without dropping the console_lock (see
  suspend_console/resume_console). But since the same thread that did
  the suspend will do the resume, we don't need to fix up anything.

- In the printk code there's a special trylock, only used to kick off
  the logbuffer printk'ing in console_unlock. But all that happens
  while lockdep is disable (since printk does a few other evil
  tricks). So no issue there, either.

- The console_lock can also be acquired form irq context (but only
  with a trylock). lockdep already handles that.

This all leaves us with annotating the normal console_lock, _unlock
and _trylock functions.

And yes, it works - simply unloading a drm kms driver resulted in
lockdep complaining about the deadlock in fbcon_deinit:

======================================================
[ INFO: possible circular locking dependency detected ]
3.6.0-rc2+ #552 Not tainted
-------------------------------------------------------
kms-reload/3577 is trying to acquire lock:
 ((&info->queue)){+.+...}, at: [<ffffffff81058c70>] wait_on_work+0x0/0xa7

but task is already holding lock:
 (console_lock){+.+.+.}, at: [<ffffffff81264686>] bind_con_driver+0x38/0x263

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (console_lock){+.+.+.}:
       [<ffffffff81087440>] lock_acquire+0x95/0x105
       [<ffffffff81040190>] console_lock+0x59/0x5b
       [<ffffffff81209cb6>] fb_flashcursor+0x2e/0x12c
       [<ffffffff81057c3e>] process_one_work+0x1d9/0x3b4
       [<ffffffff810584a2>] worker_thread+0x1a7/0x24b
       [<ffffffff8105ca29>] kthread+0x7f/0x87
       [<ffffffff813b1204>] kernel_thread_helper+0x4/0x10

-> #0 ((&info->queue)){+.+...}:
       [<ffffffff81086cb3>] __lock_acquire+0x999/0xcf6
       [<ffffffff81087440>] lock_acquire+0x95/0x105
       [<ffffffff81058cab>] wait_on_work+0x3b/0xa7
       [<ffffffff81058dd6>] __cancel_work_timer+0xbf/0x102
       [<ffffffff81058e33>] cancel_work_sync+0xb/0xd
       [<ffffffff8120a3b3>] fbcon_deinit+0x11c/0x1dc
       [<ffffffff81264793>] bind_con_driver+0x145/0x263
       [<ffffffff81264a45>] unbind_con_driver+0x14f/0x195
       [<ffffffff8126540c>] store_bind+0x1ad/0x1c1
       [<ffffffff8127cbb7>] dev_attr_store+0x13/0x1f
       [<ffffffff8116d884>] sysfs_write_file+0xe9/0x121
       [<ffffffff811145b2>] vfs_write+0x9b/0xfd
       [<ffffffff811147b7>] sys_write+0x3e/0x6b
       [<ffffffff813b0039>] system_call_fastpath+0x16/0x1b

other info that might help us debug this:

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(console_lock);
                               lock((&info->queue));
                               lock(console_lock);
  lock((&info->queue));

 *** DEADLOCK ***

v2: Mark the lockdep_map static, noticed by Jani Nikula.

Cc: Dave Airlie <airlied@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-10-22 16:12:20 -07:00
Rafael J. Wysocki 5efbe4279f PM / QoS: Introduce request and constraint data types for PM QoS flags
Introduce struct pm_qos_flags_request and struct pm_qos_flags
representing PM QoS flags request type and PM QoS flags constraint
type, respectively.  With these definitions the data structures
will be arranged so that the list member of a struct pm_qos_flags
object will contain the head of a list of struct pm_qos_flags_request
objects representing all of the "flags" requests present for the
given device.  Then, the effective_flags member of a struct
pm_qos_flags object will contain the bitwise OR of the flags members
of all the struct pm_qos_flags_request objects in the list.

Additionally, introduce helper function pm_qos_update_flags()
allowing the caller to manage the list of struct pm_qos_flags_request
pointed to by the list member of struct pm_qos_flags.

The flags are of type s32 so that the request's "value" field
is always of the same type regardless of what kind of request it
is (latency requests already have value fields of type s32).

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Jean Pihet <j-pihet@ti.com>
Acked-by: mark gross <markgross@thegnar.org>
2012-10-23 01:07:46 +02:00
Randy Dunlap 0390c88356 module_signing: fix printk format warning
Fix the warning:

  kernel/module_signing.c:195:2: warning: format '%lu' expects type 'long unsigned int', but argument 3 has type 'size_t'

by using the proper 'z' modifier for printing a size_t.

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-22 08:56:34 +03:00
Ingo Molnar ef8ff74ed8 Merge branch 'tip/perf/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace into perf/urgent
Pull ftrace ring-buffer resizing fix from Steve Rostedt.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-10-21 19:53:34 +02:00
Ingo Molnar f38787f4f9 Merge branch 'uprobes/core' of git://git.kernel.org/pub/scm/linux/kernel/git/oleg/misc into perf/urgent
Pull various uprobes bugfixes from Oleg Nesterov - mostly race and
failure path fixes.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-10-21 18:18:17 +02:00
Ingo Molnar 0acfd009be Merge branch 'nohz/core' of git://github.com/fweisbec/linux-dynticks into timers/core
Pull uncontroversial cleanup/refactoring nohz patches from Frederic Weisbecker.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-10-21 18:14:02 +02:00
Tejun Heo ead5c47371 cgroup_freezer: don't use cgroup_lock_live_group()
freezer_read/write() used cgroup_lock_live_group() to synchronize
against task migration into and out of the target cgroup.
cgroup_lock_live_group() grabs the internal cgroup lock and using it
from outside cgroup core leads to complex and fragile locking
dependency issues which are difficult to resolve.

Now that freezer_can_attach() is replaced with freezer_attach() and
update_if_frozen() updated, nothing requires excluding migration
against freezer state reads and changes.

This patch removes cgroup_lock_live_group() and the matching
cgroup_unlock() usages.  The prone-to-bitrot, already outdated and
unnecessary global lock hierarchy documentation is replaced with
documentation in local scope.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Li Zefan <lizefan@huawei.com>
2012-10-20 16:33:12 -07:00
Tejun Heo b4d18311d3 cgroup_freezer: prepare update_if_frozen() for locking change
Locking will change such that migration can happen while
freezer_read/write() is in progress.  This means that
update_if_frozen() can no longer assume that all tasks in the cgroup
coform to the current freezer state - newly migrated tasks which
haven't finished freezer_attach() yet might be in any state.

This patch updates update_if_frozen() such that it no longer verifies
task states against freezer state.  It now simply decides whether
FREEZING stage is complete.

This removal of verification makes it meaningless to call from
freezer_change_state().  Drop it and move the fast exit test from
freezer_read() - the only left caller - to update_if_frozen().

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Li Zefan <lizefan@huawei.com>
2012-10-20 16:33:08 -07:00
Tejun Heo 8755ade683 cgroup_freezer: allow moving tasks in and out of a frozen cgroup
cgroup_freezer is one of the few users of cgroup_subsys->can_attach()
and uses it to prevent tasks from being migrated into or out of a
frozen cgroup.  This makes cgroup_freezer cumbersome to use especially
when co-mounted with other controllers.

->can_attach() is problematic in general as it can make co-mounting
multiple cgroups difficult - migrating tasks may fail for reasons
completely irrelevant for other controllers.  freezer_can_attach() in
particular is more problematic because it messes with cgroup internal
locking to ensure that the state verification performed at
freezer_can_attach() stays valid until migration is complete.

This patch replaces freezer_can_attach() with freezer_attach() so that
tasks are always allowed to migrate - they are nudged into the
conforming state from freezer_attach().  This means that there can be
tasks which are being migrated which don't conform to the current
cgroup_freezer state until freezer_attach() is complete.  Under the
current locking scheme, the only such place is freezer_fork() which is
updated to handle such window.

While this patch doesn't remove the use of internal cgroup locking
from freezer_read/write() paths, it removes the requirement to keep
the freezer state constant while migrating and enables such change.

Note that this creates a userland visible behavior change - FROZEN
cgroup can no longer be used to lock migrations in and out of the
cgroup.  This behavior change is intended.  I don't think the feature
is necessary - userland should coordinate accesses to cgroup fs anyway
- and even if the feature is needed cgroup_freezer is the completely
wrong place to implement it.

Signed-off-by: Tejun Heo <tj@kernel.org>
LKML-Reference: <1350426526-14254-1-git-send-email-tj@kernel.org>
Cc: Matt Helsley <matthltc@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Li Zefan <lizefan@huawei.com>
2012-10-20 16:28:56 -07:00
Paul E. McKenney 62da192129 rcu: Accelerate callbacks for CPU initiating a grace period
Because grace-period initialization is carried out by a separate
kthread, it might happen on a different CPU than the one that
had the callback needing a grace period -- which is where the
callback acceleration needs to happen.

Fortunately, rcu_start_gp() holds the root rcu_node structure's
->lock, which prevents a new grace period from starting.  This
allows this function to safely determine that a grace period has
not yet started, which in turn allows it to fully accelerate any
callbacks that it has pending.  This commit adds this acceleration.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-10-20 13:47:10 -07:00
Kees Cook 31fd84b95e use clamp_t in UNAME26 fix
The min/max call needed to have explicit types on some architectures
(e.g. mn10300). Use clamp_t instead to avoid the warning:

  kernel/sys.c: In function 'override_release':
  kernel/sys.c:1287:10: warning: comparison of distinct pointer types lacks a cast [enabled by default]

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-19 18:51:17 -07:00
David Howells caabe24057 MODSIGN: Move the magic string to the end of a module and eliminate the search
Emit the magic string that indicates a module has a signature after the
signature data instead of before it.  This allows module_sig_check() to
be made simpler and faster by the elimination of the search for the
magic string.  Instead we just need to do a single memcmp().

This works because at the end of the signature data there is the
fixed-length signature information block.  This block then falls
immediately prior to the magic number.

From the contents of the information block, it is trivial to calculate
the size of the signature data and thus the size of the actual module
data.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-19 17:30:40 -07:00
Tejun Heo d878383211 Revert "cgroup: Remove task_lock() from cgroup_post_fork()"
This reverts commit 7e3aa30ac8.

The commit incorrectly assumed that fork path always performed
threadgroup_change_begin/end() and depended on that for
synchronization against task exit and cgroup migration paths instead
of explicitly grabbing task_lock().

threadgroup_change is not locked when forking a new process (as
opposed to a new thread in the same process) and even if it were it
wouldn't be effective as different processes use different threadgroup
locks.

Revert the incorrect optimization.

Signed-off-by: Tejun Heo <tj@kernel.org>
LKML-Reference: <20121008020000.GB2575@localhost>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: stable@vger.kernel.org
2012-10-19 14:09:35 -07:00
Tejun Heo 9bb71308b8 Revert "cgroup: Drop task_lock(parent) on cgroup_fork()"
This reverts commit 7e381b0eb1.

The commit incorrectly assumed that fork path always performed
threadgroup_change_begin/end() and depended on that for
synchronization against task exit and cgroup migration paths instead
of explicitly grabbing task_lock().

threadgroup_change is not locked when forking a new process (as
opposed to a new thread in the same process) and even if it were it
wouldn't be effective as different processes use different threadgroup
locks.

Revert the incorrect optimization.

Signed-off-by: Tejun Heo <tj@kernel.org>
LKML-Reference: <20121008020000.GB2575@localhost>
Acked-by: Li Zefan <lizefan@huawei.com>
Bitterly-Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: stable@vger.kernel.org
2012-10-19 14:08:49 -07:00
Cyrill Gorcunov bbc2e3ef87 pidns: remove recursion from free_pid_ns()
free_pid_ns() operates in a recursive fashion:

free_pid_ns(parent)
  put_pid_ns(parent)
    kref_put(&ns->kref, free_pid_ns);
      free_pid_ns

thus if there was a huge nesting of namespaces the userspace may trigger
avalanche calling of free_pid_ns leading to kernel stack exhausting and a
panic eventually.

This patch turns the recursion into an iterative loop.

Based on a patch by Andrew Vagin.

[akpm@linux-foundation.org: export put_pid_ns() to modules]
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Andrew Vagin <avagin@openvz.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-19 14:07:47 -07:00
Kees Cook 2702b1526c kernel/sys.c: fix stack memory content leak via UNAME26
Calling uname() with the UNAME26 personality set allows a leak of kernel
stack contents.  This fixes it by defensively calculating the length of
copy_to_user() call, making the len argument unsigned, and initializing
the stack buffer to zero (now technically unneeded, but hey, overkill).

CVE-2012-0957

Reported-by: PaX Team <pageexec@freemail.hu>
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: PaX Team <pageexec@freemail.hu>
Cc: Brad Spengler <spender@grsecurity.net>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-19 14:07:47 -07:00
Paul E. McKenney 85eae82a08 printk: Fix scheduling-while-atomic problem in console_cpu_notify()
The console_cpu_notify() function runs with interrupts disabled in the
CPU_DYING case.  It therefore cannot block, for example, as will happen
when it calls console_lock().  Therefore, remove the CPU_DYING leg of
the switch statement to avoid this problem.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-16 18:17:44 -07:00
Daisuke Nishimura 1f5320d597 cgroup: notify_on_release may not be triggered in some cases
notify_on_release must be triggered when the last process in a cgroup is
move to another. But if the first(and only) process in a cgroup is moved to
another, notify_on_release is not triggered.

	# mkdir /cgroup/cpu/SRC
	# mkdir /cgroup/cpu/DST
	#
	# echo 1 >/cgroup/cpu/SRC/notify_on_release
	# echo 1 >/cgroup/cpu/DST/notify_on_release
	#
	# sleep 300 &
	[1] 8629
	#
	# echo 8629 >/cgroup/cpu/SRC/tasks
	# echo 8629 >/cgroup/cpu/DST/tasks
	-> notify_on_release for /SRC must be triggered at this point,
	   but it isn't.

This is because put_css_set() is called before setting CGRP_RELEASABLE
in cgroup_task_migrate(), and is a regression introduce by the
commit:74a1166d(cgroups: make procs file writable), which was merged
into v3.0.

Cc: Ben Blum <bblum@andrew.cmu.edu>
Cc: <stable@vger.kernel.org> # v3.0.x and later
Acked-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Tejun Heo <tj@kernel.org>
2012-10-16 17:09:36 -07:00
Tejun Heo 3c426d5e11 cgroup_freezer: don't stall transition to FROZEN for PF_NOFREEZE or PF_FREEZER_SKIP tasks
cgroup_freezer doesn't transition from FREEZING to FROZEN if the
cgroup contains PF_NOFREEZE tasks or tasks sleeping with
PF_FREEZER_SKIP set.

Only kernel tasks can be non-freezable (PF_NOFREEZE) and there's
nothing cgroup_freezer or userland can do about or to it.  It's
pointless to stall the transition for PF_NOFREEZE tasks.

PF_FREEZER_SKIP indicates that the task can be skipped when
determining whether frozen state is reached.  A task with
PF_FREEZER_SKIP is guaranteed to perform try_to_freeze() after it
wakes up and can be considered frozen much like stopped or traced
tasks.  Note that a vfork parent uses PF_FREEZER_SKIP while waiting
for the child.

This updates update_if_frozen() such that it only considers freezable
tasks and treats %true freezer_should_skip() tasks as frozen.

This allows cgroups w/ kthreads and vfork parents successfully reach
FROZEN state.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
2012-10-16 15:03:14 -07:00
Tejun Heo 51f246ed95 cgroup_freezer: make it official that writes to freezer.state don't fail
try_to_freeze_cgroup() has condition checks which are intended to fail
the write operation to freezer.state if there are tasks which can't be
frozen.  The condition checks have been broken for quite some time
now.  freeze_task() returns %false if the target task can't be frozen,
so num_cant_freeze_now is never incremented.

In addition, strangely, cgroup freezing proceeds even after the write
is failed, which is rather broken.

This patch rips out the non-working code intended to fail the write to
freezer.state when the cgroup contains non-freezable tasks and makes
it official that writes to freezer.state succeed whether there are
non-freezable tasks in the cgroup or not.

This leaves is_task_frozen_enough() with only one user -
upste_if_frozen().  Collapse it into the caller.  Note that this
removes an extra call to freezing().

This doesn't cause any userland behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
2012-10-16 15:03:14 -07:00
Tejun Heo 5edee61ede cgroup: cgroup_subsys->fork() should be called after the task is added to css_set
cgroup core has a bug which violates a basic rule about event
notifications - when a new entity needs to be added, you add that to
the notification list first and then make the new entity conform to
the current state.  If done in the reverse order, an event happening
inbetween will be lost.

cgroup_subsys->fork() is invoked way before the new task is added to
the css_set.  Currently, cgroup_freezer is the only user of ->fork()
and uses it to make new tasks conform to the current state of the
freezer.  If FROZEN state is requested while fork is in progress
between cgroup_fork_callbacks() and cgroup_post_fork(), the child
could escape freezing - the cgroup isn't frozen when ->fork() is
called and the freezer couldn't see the new task on the css_set.

This patch moves cgroup_subsys->fork() invocation to
cgroup_post_fork() after the new task is added to the css_set.
cgroup_fork_callbacks() is removed.

Because now a task may be migrated during cgroup_subsys->fork(),
freezer_fork() is updated so that it adheres to the usual RCU locking
and the rather pointless comment on why locking can be different there
is removed (if it doesn't make anything simpler, why even bother?).

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: stable@vger.kernel.org
2012-10-16 15:03:14 -07:00
Ingo Molnar 8ed92e51f9 sched: Add WAKEUP_PREEMPTION feature flag, on by default
As per the recent discussion with Mike and Linus, make it easier to
test with/without this feature. No change in default behavior.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/n/tip-izoxq4haeg4mTognnDbwcevt@git.kernel.org
2012-10-16 10:05:27 +02:00
Frederic Weisbecker 94a5714020 tick: Conditionally build nohz specific code in tick handler
This optimize a bit the high res tick sched handler.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
2012-10-15 18:51:08 +02:00
Frederic Weisbecker 9e8f559b08 tick: Consolidate tick handling for high and low res handlers
Besides unifying code, this also adds the idle check before
processing idle accounting specifics on the low res handler.
This way we also generalize this part of the nohz code for
!CONFIG_HIGH_RES_TIMERS to prepare for the adaptive tickless
features.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
2012-10-15 18:42:25 +02:00
Frederic Weisbecker 5bb962269c tick: Consolidate timekeeping handling code
Unify the duplicated timekeeping handling code of low and high res tick
sched handlers.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
2012-10-15 18:35:11 +02:00
Linus Torvalds d25282d1c9 Merge branch 'modules-next' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux
Pull module signing support from Rusty Russell:
 "module signing is the highlight, but it's an all-over David Howells frenzy..."

Hmm "Magrathea: Glacier signing key". Somebody has been reading too much HHGTTG.

* 'modules-next' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux: (37 commits)
  X.509: Fix indefinite length element skip error handling
  X.509: Convert some printk calls to pr_devel
  asymmetric keys: fix printk format warning
  MODSIGN: Fix 32-bit overflow in X.509 certificate validity date checking
  MODSIGN: Make mrproper should remove generated files.
  MODSIGN: Use utf8 strings in signer's name in autogenerated X.509 certs
  MODSIGN: Use the same digest for the autogen key sig as for the module sig
  MODSIGN: Sign modules during the build process
  MODSIGN: Provide a script for generating a key ID from an X.509 cert
  MODSIGN: Implement module signature checking
  MODSIGN: Provide module signing public keys to the kernel
  MODSIGN: Automatically generate module signing keys if missing
  MODSIGN: Provide Kconfig options
  MODSIGN: Provide gitignore and make clean rules for extra files
  MODSIGN: Add FIPS policy
  module: signature checking hook
  X.509: Add a crypto key parser for binary (DER) X.509 certificates
  MPILIB: Provide a function to read raw data into an MPI
  X.509: Add an ASN.1 decoder
  X.509: Add simple ASN.1 grammar compiler
  ...
2012-10-14 13:39:34 -07:00
Linus Torvalds 6c536a17fa KGDB/KDB fixes and cleanups
Cleanups
    Clean up compile warnings in kgdboc.c and x86/kernel/kgdb.c
    Add module event hooks for simplified debugging with gdb
  Fixes
    Fix kdb to stop paging with 'q' on bta and dmesg
    Fix for data that scrolls off the vga console due to line wrapping
      when using the kdb pager
  New
    The debug core registers for kernel module events which allows a
      kernel aware gdb to automatically load symbols and break on entry
      to a kernel module
    Allow kgdboc=kdb to setup kdb on the vga console
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQIcBAABAgAGBQJQeB8KAAoJEIciOldedpOjpbIP/j+LXEkzXKKfi/3m79VQ87DB
 5iUmTS84t84pomHamXX175AC0gA/2mC0FbbcHpqjlhxF4awXcviCNIiTdtSOTbbu
 G102naLHY8i77X+XbHuN2utJeaRLw8rsfMMZGmjJnjfpc4LtsaH0YTkUzbt3qvba
 N6/QvknadzIrmoCJvHipdOdsSmL0YmTS22+koG4es9B5jvOqVH/W7jZs1qRlVw96
 VxG5Psx4LPB+RI+ZwF1WwbGxbtqKGwkVvkcGG1XIW7FQojHmjw+vUERQCjoFueJ5
 NkKfus98j85/+MvSTkWx3L1K46MHMCFbtJs9RWftJ8GtoNNnm7GDxasoIG2bJKyG
 HFD3IGPuKAokE/equF3eGTRHeEM0IUGwT3EnBqdKd73zud27WsHaSqC/1CPR+74v
 ojLQ2ft1QF+pEkGrhRTdQpLyVnvEmxu8q+j9z9n/HlGEVv8kZ6LGxDPjWB+um/Yi
 Cs0XAryYrL5gE5O+Vwna61luughtIYJwR7+DeVxnQYJ43x/0MtN/SoURnwvrCTEo
 9FeoMgZm1nLh6EW29ahIT/hMu4f0sM91Kiwrmc/zEWZgoB++wo1n470qQmUUrOx4
 CPD7zdmDrf6YxDG2QTHjCtVErO4aJ5zN4Dq0+YyodV545SZVn3t4qBDTVvKhq4Y6
 NIhZAxrv5RKABwtLcP9E
 =uf0L
 -----END PGP SIGNATURE-----

Merge tag 'for_linus-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/kgdb

Pull KGDB/KDB fixes and cleanups from Jason Wessel:
 "Cleanups
   - Clean up compile warnings in kgdboc.c and x86/kernel/kgdb.c
   - Add module event hooks for simplified debugging with gdb
 Fixes
   - Fix kdb to stop paging with 'q' on bta and dmesg
   - Fix for data that scrolls off the vga console due to line wrapping
     when using the kdb pager
 New
   - The debug core registers for kernel module events which allows a
     kernel aware gdb to automatically load symbols and break on entry
     to a kernel module
   - Allow kgdboc=kdb to setup kdb on the vga console"

* tag 'for_linus-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/kgdb:
  tty/console: fix warnings in drivers/tty/serial/kgdboc.c
  kdb,vt_console: Fix missed data due to pager overruns
  kdb: Fix dmesg/bta scroll to quit with 'q'
  kgdboc: Accept either kbd or kdb to activate the vga + keyboard kdb shell
  kgdb,x86: fix warning about unused variable
  mips,kgdb: fix recursive page fault with CONFIG_KPROBES
  kgdb: Add module event hooks
2012-10-13 11:16:58 +09:00
Linus Torvalds ade0899b29 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf updates from Ingo Molnar:
 "This tree includes some late late perf items that missed the first
  round:

  tools:

   - Bash auto completion improvements, now we can auto complete the
     tools long options, tracepoint event names, etc, from Namhyung Kim.

   - Look up thread using tid instead of pid in 'perf sched'.

   - Move global variables into a perf_kvm struct, from David Ahern.

   - Hists refactorings, preparatory for improved 'diff' command, from
     Jiri Olsa.

   - Hists refactorings, preparatory for event group viewieng work, from
     Namhyung Kim.

   - Remove double negation on optional feature macro definitions, from
     Namhyung Kim.

   - Remove several cases of needless global variables, on most
     builtins.

   - misc fixes

  kernel:

   - sysfs support for IBS on AMD CPUs, from Robert Richter.

   - Support for an upcoming Intel CPU, the Xeon-Phi / Knights Corner
     HPC blade PMU, from Vince Weaver.

   - misc fixes"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (46 commits)
  perf: Fix perf_cgroup_switch for sw-events
  perf: Clarify perf_cpu_context::active_pmu usage by renaming it to ::unique_pmu
  perf/AMD/IBS: Add sysfs support
  perf hists: Add more helpers for hist entry stat
  perf hists: Move he->stat.nr_events initialization to a template
  perf hists: Introduce struct he_stat
  perf diff: Removing the total_period argument from output code
  perf tool: Add hpp interface to enable/disable hpp column
  perf tools: Removing hists pair argument from output path
  perf hists: Separate overhead and baseline columns
  perf diff: Refactor diff displacement possition info
  perf hists: Add struct hists pointer to struct hist_entry
  perf tools: Complete tracepoint event names
  perf/x86: Add support for Intel Xeon-Phi Knights Corner PMU
  perf evlist: Remove some unused methods
  perf evlist: Introduce add_newtp method
  perf kvm: Move global variables into a perf_kvm struct
  perf tools: Convert to BACKTRACE_SUPPORT
  perf tools: Long option completion support for each subcommands
  perf tools: Complete long option names of perf command
  ...
2012-10-13 10:20:11 +09:00
Linus Torvalds 4e21fc138b Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal
Pull third pile of kernel_execve() patches from Al Viro:
 "The last bits of infrastructure for kernel_thread() et.al., with
  alpha/arm/x86 use of those.  Plus sanitizing the asm glue and
  do_notify_resume() on alpha, fixing the "disabled irq while running
  task_work stuff" breakage there.

  At that point the rest of kernel_thread/kernel_execve/sys_execve work
  can be done independently for different architectures.  The only
  pending bits that do depend on having all architectures converted are
  restrictred to fs/* and kernel/* - that'll obviously have to wait for
  the next cycle.

  I thought we'd have to wait for all of them done before we start
  eliminating the longjump-style insanity in kernel_execve(), but it
  turned out there's a very simple way to do that without flagday-style
  changes."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
  alpha: switch to saner kernel_execve() semantics
  arm: switch to saner kernel_execve() semantics
  x86, um: convert to saner kernel_execve() semantics
  infrastructure for saner ret_from_kernel_thread semantics
  make sure that kernel_thread() callbacks call do_exit() themselves
  make sure that we always have a return path from kernel_execve()
  ppc: eeh_event should just use kthread_run()
  don't bother with kernel_thread/kernel_execve for launching linuxrc
  alpha: get rid of switch_stack argument of do_work_pending()
  alpha: don't bother passing switch_stack separately from regs
  alpha: take SIGPENDING/NOTIFY_RESUME loop into signal.c
  alpha: simplify TIF_NEED_RESCHED handling
2012-10-13 10:05:52 +09:00
Linus Torvalds 8418263e35 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull third pile of VFS updates from Al Viro:
 "Stuff from Jeff Layton, mostly.  Sanitizing interplay between audit
  and namei, removing a lot of insanity from audit_inode() mess and
  getting things ready for his ESTALE patchset."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  procfs: don't need a PATH_MAX allocation to hold a string representation of an int
  vfs: embed struct filename inside of names_cache allocation if possible
  audit: make audit_inode take struct filename
  vfs: make path_openat take a struct filename pointer
  vfs: turn do_path_lookup into wrapper around struct filename variant
  audit: allow audit code to satisfy getname requests from its names_list
  vfs: define struct filename and have getname() return it
  vfs: unexport getname and putname symbols
  acct: constify the name arg to acct_on
  vfs: allocate page instead of names_cache buffer in mount_block_root
  audit: overhaul __audit_inode_child to accomodate retrying
  audit: optimize audit_compare_dname_path
  audit: make audit_compare_dname_path use parent_len helper
  audit: remove dirlen argument to audit_compare_dname_path
  audit: set the name_len in audit_inode for parent lookups
  audit: add a new "type" field to audit_names struct
  audit: reverse arguments to audit_inode_child
  audit: no need to walk list in audit_inode if name is NULL
  audit: pass in dentry to audit_copy_inode wherever possible
  audit: remove unnecessary NULL ptr checks from do_path_lookup
2012-10-13 10:04:42 +09:00
Jeff Layton adb5c2473d audit: make audit_inode take struct filename
Keep a pointer to the audit_names "slot" in struct filename.

Have all of the audit_inode callers pass a struct filename ponter to
audit_inode instead of a string pointer. If the aname field is already
populated, then we can skip walking the list altogether and just use it
directly.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-10-12 20:15:09 -04:00
Jeff Layton 669abf4e55 vfs: make path_openat take a struct filename pointer
...and fix up the callers. For do_file_open_root, just declare a
struct filename on the stack and fill out the .name field. For
do_filp_open, make it also take a struct filename pointer, and fix up its
callers to call it appropriately.

For filp_open, add a variant that takes a struct filename pointer and turn
filp_open into a wrapper around it.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-10-12 20:15:09 -04:00
Jeff Layton 7ac86265dc audit: allow audit code to satisfy getname requests from its names_list
Currently, if we call getname() on a userland string more than once,
we'll get multiple copies of the string and multiple audit_names
records.

Add a function that will allow the audit_names code to satisfy getname
requests using info from the audit_names list, avoiding a new allocation
and audit_names records.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-10-12 20:15:08 -04:00
Jeff Layton 91a27b2a75 vfs: define struct filename and have getname() return it
getname() is intended to copy pathname strings from userspace into a
kernel buffer. The result is just a string in kernel space. It would
however be quite helpful to be able to attach some ancillary info to
the string.

For instance, we could attach some audit-related info to reduce the
amount of audit-related processing needed. When auditing is enabled,
we could also call getname() on the string more than once and not
need to recopy it from userspace.

This patchset converts the getname()/putname() interfaces to return
a struct instead of a string. For now, the struct just tracks the
string in kernel space and the original userland pointer for it.

Later, we'll add other information to the struct as it becomes
convenient.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-10-12 20:14:55 -04:00
Al Viro a74fb73c12 infrastructure for saner ret_from_kernel_thread semantics
* allow kernel_execve() leave the actual return to userland to
caller (selected by CONFIG_GENERIC_KERNEL_EXECVE).  Callers
updated accordingly.
* architecture that does select GENERIC_KERNEL_EXECVE in its
Kconfig should have its ret_from_kernel_thread() do this:
	call schedule_tail
	call the callback left for it by copy_thread(); if it ever
returns, that's because it has just done successful kernel_execve()
	jump to return from syscall
IOW, its only difference from ret_from_fork() is that it does call the
callback.
* such an architecture should also get rid of ret_from_kernel_execve()
and __ARCH_WANT_KERNEL_EXECVE

This is the last part of infrastructure patches in that area - from
that point on work on different architectures can live independently.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-10-12 13:35:07 -04:00
Linus Torvalds 03d3602a83 Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer core update from Thomas Gleixner:
 - Bug fixes (one for a longstanding dead loop issue)
 - Rework of time related vsyscalls
 - Alarm timer updates
 - Jiffies updates to remove compile time dependencies

* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timekeeping: Cast raw_interval to u64 to avoid shift overflow
  timers: Fix endless looping between cascade() and internal_add_timer()
  time/jiffies: bring back unconditional LATCH definition
  time: Convert x86_64 to using new update_vsyscall
  time: Only do nanosecond rounding on GENERIC_TIME_VSYSCALL_OLD systems
  time: Introduce new GENERIC_TIME_VSYSCALL
  time: Convert CONFIG_GENERIC_TIME_VSYSCALL to CONFIG_GENERIC_TIME_VSYSCALL_OLD
  time: Move update_vsyscall definitions to timekeeper_internal.h
  time: Move timekeeper structure to timekeeper_internal.h for vsyscall changes
  jiffies: Remove compile time assumptions about CLOCK_TICK_RATE
  jiffies: Kill unused TICK_USEC_TO_NSEC
  alarmtimer: Rename alarmtimer_remove to alarmtimer_dequeue
  alarmtimer: Remove unused helpers & defines
  alarmtimer: Use hrtimer per-alarm instead of per-base
  alarmtimer: Implement minimum alarm interval for allowing suspend
2012-10-12 22:17:48 +09:00
Linus Torvalds 0588f1f934 Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
 "A CPU hotplug related crash fix and a nohz accounting fixlet."

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched: Update sched_domains_numa_masks[][] when new cpus are onlined
  sched: Ensure 'sched_domains_numa_levels' is safe to use in other functions
  nohz: Fix one jiffy count too far in idle cputime
2012-10-12 22:13:05 +09:00
Linus Torvalds 9d55ab71b7 Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU fixes from Ingo Molnar:
 "This tree includes a shutdown/cpu-hotplug deadlock fix and a
  documentation fix."

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  rcu: Advise most users not to enable RCU user mode
  rcu: Grace-period initialization excludes only RCU notifier
2012-10-12 22:12:07 +09:00
Jason Wessel 17b572e820 kdb,vt_console: Fix missed data due to pager overruns
It is possible to miss data when using the kdb pager.  The kdb pager
does not pay attention to the maximum column constraint of the screen
or serial terminal.  This result is not incrementing the shown lines
correctly and the pager will print more lines that fit on the screen.
Obviously that is less than useful when using a VGA console where you
cannot scroll back.

The pager will now look at the kdb_buffer string to see how many
characters are printed.  It might not be perfect considering you can
output ASCII that might move the cursor position, but it is a
substantially better approximation for viewing dmesg and trace logs.

This also means that the vt screen needs to set the kdb COLUMNS
variable.

Cc: <stable@vger.kernel.org>
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
2012-10-12 06:37:35 -05:00
Jason Wessel d1871b38fc kdb: Fix dmesg/bta scroll to quit with 'q'
If you press 'q' the pager should exit instead of printing everything
from dmesg which can really bog down a 9600 baud serial link.

The same is true for the bta command.

Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
2012-10-12 06:37:35 -05:00
Jason Wessel f30fed10c4 kgdb: Add module event hooks
Allow gdb to auto load kernel modules when it is attached,
which makes it trivially easy to debug module init functions
or pre-set breakpoints in a kernel module that has not loaded yet.

Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
2012-10-12 06:37:33 -05:00
Jeff Layton cfd4da1755 acct: constify the name arg to acct_on
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-10-12 00:32:03 -04:00
Jeff Layton 4fa6b5ecbf audit: overhaul __audit_inode_child to accomodate retrying
In order to accomodate retrying path-based syscalls, we need to add a
new "type" argument to audit_inode_child. This will tell us whether
we're looking for a child entry that represents a create or a delete.

If we find a parent, don't automatically assume that we need to create a
new entry. Instead, use the information we have to try to find an
existing entry first. Update it if one is found and create a new one if
not.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-10-12 00:32:03 -04:00
Jeff Layton e3d6b07b8b audit: optimize audit_compare_dname_path
In the cases where we already know the length of the parent, pass it as
a parm so we don't need to recompute it. In the cases where we don't
know the length, pass in AUDIT_NAME_FULL (-1) to indicate that it should
be determined.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-10-12 00:32:02 -04:00
Eric Paris 29e9a3467c audit: make audit_compare_dname_path use parent_len helper
Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-10-12 00:32:02 -04:00
Jeff Layton 563a0d1236 audit: remove dirlen argument to audit_compare_dname_path
All the callers set this to NULL now.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-10-12 00:32:01 -04:00
Jeff Layton bfcec70874 audit: set the name_len in audit_inode for parent lookups
Currently, this gets set mostly by happenstance when we call into
audit_inode_child. While that might be a little more efficient, it seems
wrong. If the syscall ends up failing before audit_inode_child ever gets
called, then you'll have an audit_names record that shows the full path
but has the parent inode info attached.

Fix this by passing in a parent flag when we call audit_inode that gets
set to the value of LOOKUP_PARENT. We can then fix up the pathname for
the audit entry correctly from the get-go.

While we're at it, clean up the no-op macro for audit_inode in the
!CONFIG_AUDITSYSCALL case.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-10-12 00:32:01 -04:00
Jeff Layton 78e2e802a8 audit: add a new "type" field to audit_names struct
For now, we just have two possibilities:

UNKNOWN: for a new audit_names record that we don't know anything about yet
NORMAL: for everything else

In later patches, we'll add other types so we can distinguish and update
records created under different circumstances.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-10-12 00:32:00 -04:00
Jeff Layton c43a25abba audit: reverse arguments to audit_inode_child
Most of the callers get called with an inode and dentry in the reverse
order. The compiler then has to reshuffle the arg registers and/or
stack in order to pass them on to audit_inode_child.

Reverse those arguments for a micro-optimization.

Reported-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-10-12 00:32:00 -04:00
Jeff Layton 9cec9d68ae audit: no need to walk list in audit_inode if name is NULL
If name is NULL then the condition in the loop will never be true. Also,
with this change, we can eliminate the check for n->name == NULL since
the equivalence check will never be true if it is.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-10-12 00:31:59 -04:00
Jeff Layton 1c2e51e8c1 audit: pass in dentry to audit_copy_inode wherever possible
In some cases, we were passing in NULL even when we have a dentry.

Reported-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-10-12 00:31:59 -04:00
Linus Torvalds 759e00b8a8 A second round of pinctrl patches for v3.7:
- Complement the Nomadik pinctrl driver with alternate Cx functions
   so it handles all oddities.
 - A patch to the IRQdomain to reform the simple irqdomain to handle
   IRQ descriptor allocation dynamically.
 - Use the above feature in the Nomadik pin controller.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.10 (GNU/Linux)
 
 iQIcBAABAgAGBQJQdonzAAoJEEEQszewGV1zpAoP/RfvLFoZe5q6FXFCUG+CbXmg
 PKSe58YR3iLkCPDgv0t/zpddmKkulg92LMvrJK1Rv5tuWODia9fQTRbqXGWehoPi
 0jnAIvjuBkDDYuHD+mr9vd+WO8Ts6pKasFwNLLZSMmu5vuV3rQvkPyMkC47amB8j
 ncMl16M5efxxfgEJo49TkaKCCJOp3aNRQdZlY9aCqDzGqGmLizOJituN5FAfzT60
 0IZpUC3tZwn4eMlMZy3C0WkNDpiUy8U10vXafHVapQ/y2t1lgRnMyncbioH/cOIQ
 jXbbHI9mKOoXf4sXWEzikEreB+WAnPVcfiLNzdHzv3SoW6UrJjY0FumGJ85MItIg
 HKwtcF2HHuJ1MaQI+DkLlhyWszXXjKP/zfRioBf0SkMZOtbvDA5aMmrSza6nqIF1
 zCHu33ywc8AJbEBgHfVYZlAfvqkMNnI+oerrAdodtbYY0+8hey8EKeHkTJH3grk4
 mCtVFtFGhbyNmoqM2YKgLqS8TqxDMfYhj1e3GX0kCgqbQEWbX6gCyqXOeDMl+gst
 9kHPfHhaqKvBShWspU0yOU88M72KWlLt+CwiB1WA1eAW/lBwFiWl21PUe6RKAjpt
 E0hX77+UdNm5Af9yVETC/K5q77lQnkjBdCDXbioRcCh2ifKFjyCtMQiW5FIw3Qc3
 7UGdkdWTf7vhtPqmIxgF
 =UKY/
 -----END PGP SIGNATURE-----

Merge tag 'pinctrl-for-3.7-late' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl

Pull second set of pinctrl patches from Linus Walleij:
 "Here is a late pinctrl pull request with stuff that wasn't quite
  tested at the first pull request.

  The main reason to not hold off is that the modifications to
  irq_domain_add_simple() as reviewed by Rob Herring introduce new
  infrastructure for irqdomains that will be useful for the next cycle:
  instead of sprinkling irq descriptor allocation all over the kernel
  wherever a "legacy" domain is registered, which is necessary for any
  platform using sparse IRQs, and many irq chips are say GPIO
  controllers which may be used with several systems, some with sparse
  IRQs some not, we push this into the irq_domain_add_simple() so we can
  atleast do mistakes in one place.

  The irq_domain_add_simple() is currently unused in the kernel, so I
  need to provide a user.  The Nomadik stuff that goes with are changes
  to the driver I use day-to-day to make use of this facility (and a
  dependency), so see it as a way to eat my own dogfood: if this blows
  up the egg hits my face.

  A second round of pinctrl patches for v3.7:
   - Complement the Nomadik pinctrl driver with alternate Cx functions
     so it handles all oddities.
   - A patch to the IRQdomain to reform the simple irqdomain to handle
     IRQ descriptor allocation dynamically.
   - Use the above feature in the Nomadik pin controller."

* tag 'pinctrl-for-3.7-late' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
  pinctrl/nomadik: use simple or linear IRQ domain
  irqdomain: augment add_simple() to allocate descs
  pinctrl/nomadik: support other alternate-C functions
2012-10-12 12:35:05 +09:00
Linus Torvalds 79360ddd73 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull pile 2 of vfs updates from Al Viro:
 "Stuff in this one - assorted fixes, lglock tidy-up, death to
  lock_super().

  There'll be a VFS pile tomorrow (with patches from Jeff Layton,
  sanitizing getname() and related parts of audit and preparing for
  ESTALE fixes), but I'd rather push the stuff in this one ASAP - some
  of the bugs closed here are quite unpleasant."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  vfs: bogus warnings in fs/namei.c
  consitify do_mount() arguments
  lglock: add DEFINE_STATIC_LGLOCK()
  lglock: make the per_cpu locks static
  lglock: remove unused DEFINE_LGLOCK_LOCKDEP()
  MAX_LFS_FILESIZE definition for 64bit needs LL...
  tmpfs,ceph,gfs2,isofs,reiserfs,xfs: fix fh_len checking
  vfs: drop lock/unlock super
  ufs: drop lock/unlock super
  sysv: drop lock/unlock super
  hpfs: drop lock/unlock super
  fat: drop lock/unlock super
  ext3: drop lock/unlock super
  exofs: drop lock/unlock super
  dup3: Return an error when oldfd == newfd.
  fs: handle failed audit_log_start properly
  fs: prevent use after free in auditing when symlink following was denied
2012-10-12 10:52:03 +09:00
Linus Torvalds 8213a2f3ee Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal
Pull pile 2 of execve and kernel_thread unification work from Al Viro:
 "Stuff in there: kernel_thread/kernel_execve/sys_execve conversions for
  several more architectures plus assorted signal fixes and cleanups.

  There'll be more (in particular, real fixes for the alpha
  do_notify_resume() irq mess)..."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal: (43 commits)
  alpha: don't open-code trace_report_syscall_{enter,exit}
  Uninclude linux/freezer.h
  m32r: trim masks
  avr32: trim masks
  tile: don't bother with SIGTRAP in setup_frame
  microblaze: don't bother with SIGTRAP in setup_rt_frame()
  mn10300: don't bother with SIGTRAP in setup_frame()
  frv: no need to raise SIGTRAP in setup_frame()
  x86: get rid of duplicate code in case of CONFIG_VM86
  unicore32: remove pointless test
  h8300: trim _TIF_WORK_MASK
  parisc: decide whether to go to slow path (tracesys) based on thread flags
  parisc: don't bother looping in do_signal()
  parisc: fix double restarts
  bury the rest of TIF_IRET
  sanitize tsk_is_polling()
  bury _TIF_RESTORE_SIGMASK
  unicore32: unobfuscate _TIF_WORK_MASK
  mips: NOTIFY_RESUME is not needed in TIF masks
  mips: merge the identical "return from syscall" per-ABI code
  ...

Conflicts:
	arch/arm/include/asm/thread_info.h
2012-10-12 10:49:08 +09:00
Al Viro fb45550d76 make sure that kernel_thread() callbacks call do_exit() themselves
Most of them never returned anyway - only two functions had to be
changed.  That allows to simplify their callers a whole lot.

Note that this does *not* apply to kthread_run() callbacks - all of
those had been called from the same kernel_thread() callback, which
did do_exit() already.  This is strictly about very few low-level
kernel_thread() callbacks (there are only 6 of those, mostly as part
of kthread.h and kmod.h exported mechanisms, plus kernel_init()
itself).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-10-11 21:42:36 -04:00
Vaibhav Nagarnaik 8e49f418c9 ring-buffer: Check for uninitialized cpu buffer before resizing
With a system where, num_present_cpus < num_possible_cpus, even if all
CPUs are online, non-present CPUs don't have per_cpu buffers allocated.
If per_cpu/<cpu>/buffer_size_kb is modified for such a CPU, it can cause
a panic due to NULL dereference in ring_buffer_resize().

To fix this, resize operation is allowed only if the per-cpu buffer has
been initialized.

Link: http://lkml.kernel.org/r/1349912427-6486-1-git-send-email-vnagarnaik@google.com

Cc: stable@vger.kernel.org # 3.5+
Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-10-11 12:21:48 -04:00
Rusty Russell d5b719365e MODSIGN: Make mrproper should remove generated files.
It doesn't, because the clean targets don't include kernel/Makefile, and
because two files were missing from the list.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-10-10 20:06:36 +10:30
David Howells e7d113bcf2 MODSIGN: Use utf8 strings in signer's name in autogenerated X.509 certs
Place an indication that the certificate should use utf8 strings into the
x509.genkey template generated by kernel/Makefile.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-10-10 20:06:35 +10:30
David Howells 5e8cb1e441 MODSIGN: Use the same digest for the autogen key sig as for the module sig
Use the same digest type for the autogenerated key signature as for the module
signature so that the hash algorithm is guaranteed to be present in the kernel.

Without this, the X.509 certificate loader may reject the X.509 certificate so
generated because it was self-signed and the signature will be checked against
itself - but this won't work if the digest algorithm must be loaded as a
module.

The symptom is that the key fails to load with the following message emitted
into the kernel log:

	MODSIGN: Problem loading in-kernel X.509 certificate (-65)

the error in brackets being -ENOPKG.  What you should see is something like:

	MODSIGN: Loaded cert 'Magarathea: Glacier signing key: 9588321144239a119d3406d4c4cf1fbae1836fa0'

Note that this doesn't apply to certificates that are not self-signed as we
don't check those currently as they require the parent CA certificate to be
available.

Reported-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-10-10 20:06:34 +10:30
David Howells 48ba2462ac MODSIGN: Implement module signature checking
Check the signature on the module against the keys compiled into the kernel or
available in a hardware key store.

Currently, only RSA keys are supported - though that's easy enough to change,
and the signature is expected to contain raw components (so not a PGP or
PKCS#7 formatted blob).

The signature blob is expected to consist of the following pieces in order:

 (1) The binary identifier for the key.  This is expected to match the
     SubjectKeyIdentifier from an X.509 certificate.  Only X.509 type
     identifiers are currently supported.

 (2) The signature data, consisting of a series of MPIs in which each is in
     the format of a 2-byte BE word sizes followed by the content data.

 (3) A 12 byte information block of the form:

	struct module_signature {
		enum pkey_algo		algo : 8;
		enum pkey_hash_algo	hash : 8;
		enum pkey_id_type	id_type : 8;
		u8			__pad;
		__be32			id_length;
		__be32			sig_length;
	};

     The three enums are defined in crypto/public_key.h.

     'algo' contains the public-key algorithm identifier (0->DSA, 1->RSA).

     'hash' contains the digest algorithm identifier (0->MD4, 1->MD5, 2->SHA1,
      etc.).

     'id_type' contains the public-key identifier type (0->PGP, 1->X.509).

     '__pad' should be 0.

     'id_length' should contain in the binary identifier length in BE form.

     'sig_length' should contain in the signature data length in BE form.

     The lengths are in BE order rather than CPU order to make dealing with
     cross-compilation easier.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (minor Kconfig fix)
2012-10-10 20:06:10 +10:30
David Howells 631cc66eb9 MODSIGN: Provide module signing public keys to the kernel
Include a PGP keyring containing the public keys required to perform module
verification in the kernel image during build and create a special keyring
during boot which is then populated with keys of crypto type holding the public
keys found in the PGP keyring.

These can be seen by root:

[root@andromeda ~]# cat /proc/keys
07ad4ee0 I-----     1 perm 3f010000     0     0 crypto    modsign.0: RSA 87b9b3bd []
15c7f8c3 I-----     1 perm 1f030000     0     0 keyring   .module_sign: 1/4
...

It is probably worth permitting root to invalidate these keys, resulting in
their removal and preventing further modules from being loaded with that key.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-10-10 20:01:22 +10:30
David Howells d441108c6f MODSIGN: Automatically generate module signing keys if missing
Automatically generate keys for module signing if they're absent so that
allyesconfig doesn't break.  The builder should consider generating their own
key and certificate, however, so that the keys are appropriately named.

The private key for the module signer should be placed in signing_key.priv
(unencrypted!) and the public key in an X.509 certificate as signing_key.x509.

If a transient key is desired for signing the modules, a config file for
'openssl req' can be placed in x509.genkey, looking something like the
following:

	[ req ]
	default_bits = 4096
	distinguished_name = req_distinguished_name
	prompt = no
	x509_extensions = myexts

	[ req_distinguished_name ]
	O = Magarathea
	CN = Glacier signing key
	emailAddress = slartibartfast@magrathea.h2g2

	[ myexts ]
	basicConstraints=critical,CA:FALSE
	keyUsage=digitalSignature
	subjectKeyIdentifier=hash
	authorityKeyIdentifier=hash

The build process will use this to configure:

	openssl req -new -nodes -utf8 -sha1 -days 36500 -batch \
		-x509 -config x509.genkey \
		-outform DER -out signing_key.x509 \
		-keyout signing_key.priv

to generate the key.

Note that it is required that the X.509 certificate have a subjectKeyIdentifier
and an authorityKeyIdentifier.  Without those, the certificate will be
rejected.  These can be used to check the validity of a certificate.

Note that 'make distclean' will remove signing_key.{priv,x509} and x509.genkey,
whether or not they were generated automatically.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-10-10 20:01:21 +10:30
David Howells 1d0059f3a4 MODSIGN: Add FIPS policy
If we're in FIPS mode, we should panic if we fail to verify the signature on a
module or we're asked to load an unsigned module in signature enforcing mode.
Possibly FIPS mode should automatically enable enforcing mode.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-10-10 20:01:19 +10:30
Rusty Russell 106a4ee258 module: signature checking hook
We do a very simple search for a particular string appended to the module
(which is cache-hot and about to be SHA'd anyway).  There's both a config
option and a boot parameter which control whether we accept or fail with
unsigned modules and modules that are signed with an unknown key.

If module signing is enabled, the kernel will be tainted if a module is
loaded that is unsigned or has a signature for which we don't have the
key.

(Useful feedback and tweaks by David Howells <dhowells@redhat.com>)

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-10-10 20:00:55 +10:30
Linus Walleij 2854d167cc irqdomain: augment add_simple() to allocate descs
Currently we rely on all IRQ chip instances to dynamically
allocate their IRQ descriptors unless they use the linear
IRQ domain. So for irqdomain_add_legacy() and
irqdomain_add_simple() the caller need to make sure that
descriptors are allocated.

Let's slightly augment the yet unused irqdomain_add_simple()
to also allocate descriptors as a means to simplify usage
and avoid code duplication throughout the kernel.

We warn if descriptors cannot be allocated, e.g. if a
platform has the bad habit of hogging descriptors at boot
time.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Grant Likely <grant.likely@secretlab.ca>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Lee Jones <lee.jones@linaro.org>
Reviewed-by: Rob Herring <rob.herring@calxeda.com>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2012-10-10 08:57:26 +02:00
Sasha Levin d1c7d97ad5 fs: handle failed audit_log_start properly
audit_log_start() may return NULL, this is unchecked by the caller in
audit_log_link_denied() and could cause a NULL ptr deref.

Introduced by commit a51d9eaa ("fs: add link restriction audit reporting").

Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-10-09 23:33:37 -04:00
Linus Torvalds 42859eea96 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal
Pull generic execve() changes from Al Viro:
 "This introduces the generic kernel_thread() and kernel_execve()
  functions, and switches x86, arm, alpha, um and s390 over to them."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal: (26 commits)
  s390: convert to generic kernel_execve()
  s390: switch to generic kernel_thread()
  s390: fold kernel_thread_helper() into ret_from_fork()
  s390: fold execve_tail() into start_thread(), convert to generic sys_execve()
  um: switch to generic kernel_thread()
  x86, um/x86: switch to generic sys_execve and kernel_execve
  x86: split ret_from_fork
  alpha: introduce ret_from_kernel_execve(), switch to generic kernel_execve()
  alpha: switch to generic kernel_thread()
  alpha: switch to generic sys_execve()
  arm: get rid of execve wrapper, switch to generic execve() implementation
  arm: optimized current_pt_regs()
  arm: introduce ret_from_kernel_execve(), switch to generic kernel_execve()
  arm: split ret_from_fork, simplify kernel_thread() [based on patch by rmk]
  generic sys_execve()
  generic kernel_execve()
  new helper: current_pt_regs()
  preparation for generic kernel_thread()
  um: kill thread->forking
  um: let signal_delivered() do SIGTRAP on singlestepping into handler
  ...
2012-10-10 12:02:25 +09:00
Dan Carpenter 5b3900cd40 timekeeping: Cast raw_interval to u64 to avoid shift overflow
We fixed a bunch of integer overflows in timekeeping code during the 3.6
cycle.  I did an audit based on that and found this potential overflow.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: John Stultz <johnstul@us.ibm.com>
Link: http://lkml.kernel.org/r/20121009071823.GA19159@elgon.mountain
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
2012-10-09 21:27:14 +02:00
Hildner, Christian 26cff4e2aa timers: Fix endless looping between cascade() and internal_add_timer()
Adding two (or more) timers with large values for "expires" (they have
to reside within tv5 in the same list) leads to endless looping
between cascade() and internal_add_timer() in case CONFIG_BASE_SMALL
is one and jiffies are crossing the value 1 << 18. The bug was
introduced between 2.6.11 and 2.6.12 (and survived for quite some
time).

This patch ensures that when cascade() is called timers within tv5 are
not added endlessly to their own list again, instead they are added to
the next lower tv level tv4 (as expected).

Signed-off-by: Christian Hildner <christian.hildner@siemens.com>
Reviewed-by: Jan Kiszka <jan.kiszka@siemens.com>
Link: http://lkml.kernel.org/r/98673C87CB31274881CFFE0B65ECC87B0F5FC1963E@DEFTHW99EA4MSX.ww902.siemens.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
2012-10-09 21:27:14 +02:00
Thomas Gleixner db8c246937 Merge branch 'fortglx/3.7/time' of git://git.linaro.org/people/jstultz/linux into timers/core 2012-10-09 21:20:05 +02:00
Haggai Eran 6bdb913f0a mm: wrap calls to set_pte_at_notify with invalidate_range_start and invalidate_range_end
In order to allow sleeping during invalidate_page mmu notifier calls, we
need to avoid calling when holding the PT lock.  In addition to its direct
calls, invalidate_page can also be called as a substitute for a change_pte
call, in case the notifier client hasn't implemented change_pte.

This patch drops the invalidate_page call from change_pte, and instead
wraps all calls to change_pte with invalidate_range_start and
invalidate_range_end calls.

Note that change_pte still cannot sleep after this patch, and that clients
implementing change_pte should not take action on it in case the number of
outstanding invalidate_range_start calls is larger than one, otherwise
they might miss a later invalidation.

Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Cc: Andrea Arcangeli <andrea@qumranet.com>
Cc: Sagi Grimberg <sagig@mellanox.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: Or Gerlitz <ogerlitz@mellanox.com>
Cc: Haggai Eran <haggaie@mellanox.com>
Cc: Shachar Raindel <raindel@mellanox.com>
Cc: Liran Liss <liranl@mellanox.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 16:22:58 +09:00
Michel Lespinasse 9826a516ff mm: interval tree updates
Update the generic interval tree code that was introduced in "mm: replace
vma prio_tree with an interval tree".

Changes:

- fixed 'endpoing' typo noticed by Andrew Morton

- replaced include/linux/interval_tree_tmpl.h, which was used as a
  template (including it automatically defined the interval tree
  functions) with include/linux/interval_tree_generic.h, which only
  defines a preprocessor macro INTERVAL_TREE_DEFINE(), which itself
  defines the interval tree functions when invoked. Now that is a very
  long macro which is unfortunate, but it does make the usage sites
  (lib/interval_tree.c and mm/interval_tree.c) a bit nicer than previously.

- make use of RB_DECLARE_CALLBACKS() in the INTERVAL_TREE_DEFINE() macro,
  instead of duplicating that code in the interval tree template.

- replaced vma_interval_tree_add(), which was actually handling the
  nonlinear and interval tree cases, with vma_interval_tree_insert_after()
  which handles only the interval tree case and has an API that is more
  consistent with the other interval tree handling functions.
  The nonlinear case is now handled explicitly in kernel/fork.c dup_mmap().

Signed-off-by: Michel Lespinasse <walken@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Daniel Santos <daniel.santos@pobox.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 16:22:40 +09:00
Michel Lespinasse 6b2dbba8b6 mm: replace vma prio_tree with an interval tree
Implement an interval tree as a replacement for the VMA prio_tree.  The
algorithms are similar to lib/interval_tree.c; however that code can't be
directly reused as the interval endpoints are not explicitly stored in the
VMA.  So instead, the common algorithm is moved into a template and the
details (node type, how to get interval endpoints from the node, etc) are
filled in using the C preprocessor.

Once the interval tree functions are available, using them as a
replacement to the VMA prio tree is a relatively simple, mechanical job.

Signed-off-by: Michel Lespinasse <walken@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 16:22:39 +09:00
Davidlohr Bueso 01dc52ebdf oom: remove deprecated oom_adj
The deprecated /proc/<pid>/oom_adj is scheduled for removal this month.

Signed-off-by: Davidlohr Bueso <dave@gnu.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 16:22:24 +09:00
Konstantin Khlebnikov 314e51b985 mm: kill vma flag VM_RESERVED and mm->reserved_vm counter
A long time ago, in v2.4, VM_RESERVED kept swapout process off VMA,
currently it lost original meaning but still has some effects:

 | effect                 | alternative flags
-+------------------------+---------------------------------------------
1| account as reserved_vm | VM_IO
2| skip in core dump      | VM_IO, VM_DONTDUMP
3| do not merge or expand | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP
4| do not mlock           | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP

This patch removes reserved_vm counter from mm_struct.  Seems like nobody
cares about it, it does not exported into userspace directly, it only
reduces total_vm showed in proc.

Thus VM_RESERVED can be replaced with VM_IO or pair VM_DONTEXPAND | VM_DONTDUMP.

remap_pfn_range() and io_remap_pfn_range() set VM_IO|VM_DONTEXPAND|VM_DONTDUMP.
remap_vmalloc_range() set VM_DONTEXPAND | VM_DONTDUMP.

[akpm@linux-foundation.org: drivers/vfio/pci/vfio_pci.c fixup]
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Carsten Otte <cotte@de.ibm.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Eric Paris <eparis@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Morris <james.l.morris@oracle.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Venkatesh Pallipadi <venki@google.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 16:22:19 +09:00
Konstantin Khlebnikov e9714acf8c mm: kill vma flag VM_EXECUTABLE and mm->num_exe_file_vmas
Currently the kernel sets mm->exe_file during sys_execve() and then tracks
number of vmas with VM_EXECUTABLE flag in mm->num_exe_file_vmas, as soon
as this counter drops to zero kernel resets mm->exe_file to NULL.  Plus it
resets mm->exe_file at last mmput() when mm->mm_users drops to zero.

VMA with VM_EXECUTABLE flag appears after mapping file with flag
MAP_EXECUTABLE, such vmas can appears only at sys_execve() or after vma
splitting, because sys_mmap ignores this flag.  Usually binfmt module sets
mm->exe_file and mmaps executable vmas with this file, they hold
mm->exe_file while task is running.

comment from v2.6.25-6245-g925d1c4 ("procfs task exe symlink"),
where all this stuff was introduced:

> The kernel implements readlink of /proc/pid/exe by getting the file from
> the first executable VMA.  Then the path to the file is reconstructed and
> reported as the result.
>
> Because of the VMA walk the code is slightly different on nommu systems.
> This patch avoids separate /proc/pid/exe code on nommu systems.  Instead of
> walking the VMAs to find the first executable file-backed VMA we store a
> reference to the exec'd file in the mm_struct.
>
> That reference would prevent the filesystem holding the executable file
> from being unmounted even after unmapping the VMAs.  So we track the number
> of VM_EXECUTABLE VMAs and drop the new reference when the last one is
> unmapped.  This avoids pinning the mounted filesystem.

exe_file's vma accounting is hooked into every file mmap/unmmap and vma
split/merge just to fix some hypothetical pinning fs from umounting by mm,
which already unmapped all its executable files, but still alive.

Seems like currently nobody depends on this behaviour.  We can try to
remove this logic and keep mm->exe_file until final mmput().

mm->exe_file is still protected with mm->mmap_sem, because we want to
change it via new sys_prctl(PR_SET_MM_EXE_FILE).  Also via this syscall
task can change its mm->exe_file and unpin mountpoint explicitly.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Carsten Otte <cotte@de.ibm.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Eric Paris <eparis@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Morris <james.l.morris@oracle.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Venkatesh Pallipadi <venki@google.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 16:22:18 +09:00
Konstantin Khlebnikov 2dd8ad81e3 mm: use mm->exe_file instead of first VM_EXECUTABLE vma->vm_file
Some security modules and oprofile still uses VM_EXECUTABLE for retrieving
a task's executable file.  After this patch they will use mm->exe_file
directly.  mm->exe_file is protected with mm->mmap_sem, so locking stays
the same.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Acked-by: Chris Metcalf <cmetcalf@tilera.com>			[arch/tile]
Acked-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>	[tomoyo]
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Carsten Otte <cotte@de.ibm.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Eric Paris <eparis@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Acked-by: James Morris <james.l.morris@oracle.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Venkatesh Pallipadi <venki@google.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 16:22:18 +09:00
Srivatsa S. Bhat 075663d198 CPU hotplug, debug: detect imbalance between get_online_cpus() and put_online_cpus()
The synchronization between CPU hotplug readers and writers is achieved
by means of refcounting, safeguarded by the cpu_hotplug.lock.

get_online_cpus() increments the refcount, whereas put_online_cpus()
decrements it.  If we ever hit an imbalance between the two, we end up
compromising the guarantees of the hotplug synchronization i.e, for
example, an extra call to put_online_cpus() can end up allowing a
hotplug reader to execute concurrently with a hotplug writer.

So, add a WARN_ON() in put_online_cpus() to detect such cases where the
refcount can go negative, and also attempt to fix it up, so that we can
continue to run.

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 16:22:15 +09:00
Catalin Marinas 7ac57a89de Kconfig: clean up the "#if defined(arch)" list for exception-trace sysctl entry
Introduce SYSCTL_EXCEPTION_TRACE config option and selec it in the
architectures requiring support for the "exception-trace" debug_table
entry in kernel/sysctl.c.

Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 16:22:14 +09:00
Paul E. McKenney a4fbe35a12 rcu: Grace-period initialization excludes only RCU notifier
Kirill noted the following deadlock cycle on shutdown involving padata:

> With commit 755609a908 I've got deadlock on
> poweroff.
>
> It guess it happens because of race for cpu_hotplug.lock:
>
>       CPU A                                   CPU B
> disable_nonboot_cpus()
> _cpu_down()
> cpu_hotplug_begin()
>  mutex_lock(&cpu_hotplug.lock);
> __cpu_notify()
> padata_cpu_callback()
> __padata_remove_cpu()
> padata_replace()
> synchronize_rcu()
>                                       rcu_gp_kthread()
>                                       get_online_cpus();
>                                       mutex_lock(&cpu_hotplug.lock);

It would of course be good to eliminate grace-period delays from
CPU-hotplug notifiers, but that is a separate issue.  Deadlock is
not an appropriate diagnostic for excessive CPU-hotplug latency.

Fortunately, grace-period initialization does not actually need to
exclude all of the CPU-hotplug operation, but rather only RCU's own
CPU_UP_PREPARE and CPU_DEAD CPU-hotplug notifiers.  This commit therefore
introduces a new per-rcu_state onoff_mutex that provides the required
concurrency control in place of the get_online_cpus() that was previously
in rcu_gp_init().

Reported-by: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Kirill A. Shutemov <kirill@shutemov.name>
2012-10-08 09:06:38 -07:00
Oleg Nesterov 71434f2fcb uprobes: Fix the racy uprobe->flags manipulation
Multiple threads can manipulate uprobe->flags, this is obviously
unsafe. For example mmap can set UPROBE_COPY_INSN while register
tries to set UPROBE_RUN_HANDLER, the latter can also race with
can_skip_sstep() which clears UPROBE_SKIP_SSTEP.

Change this code to use bitops.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2012-10-07 21:19:43 +02:00
Oleg Nesterov 4710f05fd1 uprobes: Fix prepare_uprobe() race with itself
install_breakpoint() is called under mm->mmap_sem, this protects
set_swbp() but not prepare_uprobe(). Two or more different tasks
can call install_breakpoint()->prepare_uprobe() at the same time,
this leads to numerous problems if UPROBE_COPY_INSN is not set.

Just for example, the second copy_insn() can corrupt the already
analyzed/fixuped uprobe->arch.insn and race with handle_swbp().

This patch simply adds uprobe->copy_mutex to serialize this code.
We could probably reuse ->consumer_rwsem, but this would mean that
consumer->handler() can not use mm->mmap_sem, not good.

Note: this is another temporary ugly hack until we move this logic
into uprobe_register().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2012-10-07 21:19:43 +02:00
Oleg Nesterov cb9a19fe4a uprobes: Introduce prepare_uprobe()
Preparation. Extract the copy_insn/arch_uprobe_analyze_insn code
from install_breakpoint() into the new helper, prepare_uprobe().

And move uprobe->flags defines from uprobes.h to uprobes.c, nobody
else can use them anyway.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2012-10-07 21:19:42 +02:00
Oleg Nesterov 142b18ddc8 uprobes: Fix handle_swbp() vs unregister() + register() race
Strictly speaking this race was added by me in 56bb4cf6. However
I think that this bug is just another indication that we should
move copy_insn/uprobe_analyze_insn code from install_breakpoint()
to uprobe_register(), there are a lot of other reasons for that.
Until then, add a hack to close the race.

A task can hit uprobe U1, but before it calls find_uprobe() this
uprobe can be unregistered *AND* another uprobe U2 can be added to
uprobes_tree at the same inode/offset. In this case handle_swbp()
will use the not-fully-initialized U2, in particular its arch.insn
for xol.

Add the additional !UPROBE_COPY_INSN check into handle_swbp(),
if this flag is not set we simply restart as if the new uprobe was
not inserted yet. This is not very nice, we need barriers, but we
will remove this hack when we change uprobe_register().

Note: with or without this patch install_breakpoint() can race with
itself, yet another reson to kill UPROBE_COPY_INSN altogether. And
even the usage of uprobe->flags is not safe. See the next patches.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2012-10-07 21:19:41 +02:00
Oleg Nesterov 076a365b3d uprobes: Do not delete uprobe if uprobe_unregister() fails
delete_uprobe() must not be called if register_for_each_vma(false)
fails to remove all breakpoints, __uprobe_unregister() is correct.
The problem is that register_for_each_vma(false) always returns 0
and thus this logic does not work.

1. Change verify_opcode() to return 0 rather than -EINVAL when
   unregister detects the !is_swbp insn, we can treat this case
   as success and currently unregister paths ignore the error
   code anyway.

2. Change remove_breakpoint() to propagate the error code from
   write_opcode().

3. Change register_for_each_vma(is_register => false) to remove
   as much breakpoints as possible but return non-zero if
   remove_breakpoint() fails at least once.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2012-10-07 21:19:41 +02:00
Oleg Nesterov a5f658b71b uprobes: Don't return success if alloc_uprobe() fails
If alloc_uprobe() fails uprobe_register() should return ENOMEM, not 0.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2012-10-07 21:19:41 +02:00
Linus Torvalds dc92b1f9ab Merge branch 'virtio-next' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux
Pull virtio changes from Rusty Russell:
 "New workflow: same git trees pulled by linux-next get sent straight to
  Linus.  Git is awkward at shuffling patches compared with quilt or mq,
  but that doesn't happen often once things get into my -next branch."

* 'virtio-next' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux: (24 commits)
  lguest: fix occasional crash in example launcher.
  virtio-blk: Disable callback in virtblk_done()
  virtio_mmio: Don't attempt to create empty virtqueues
  virtio_mmio: fix off by one error allocating queue
  drivers/virtio/virtio_pci.c: fix error return code
  virtio: don't crash when device is buggy
  virtio: remove CONFIG_VIRTIO_RING
  virtio: add help to CONFIG_VIRTIO option.
  virtio: support reserved vqs
  virtio: introduce an API to set affinity for a virtqueue
  virtio-ring: move queue_index to vring_virtqueue
  virtio_balloon: not EXPERIMENTAL any more.
  virtio-balloon: dependency fix
  virtio-blk: fix NULL checking in virtblk_alloc_req()
  virtio-blk: Add REQ_FLUSH and REQ_FUA support to bio path
  virtio-blk: Add bio-based IO path for virtio-blk
  virtio: console: fix error handling in init() function
  tools: Fix pthread flag for Makefile of trace-agent used by virtio-trace
  tools: Add guest trace agent as a user tool
  virtio/console: Allocate scatterlist according to the current pipe size
  ...
2012-10-07 21:04:56 +09:00
Linus Torvalds 7f60ba388f 1. We no longer ad-hoc to the function tracer "high level" infrastructure
and no longer use its debugfs knobs. The change slightly touches
    kernel/trace directory, but it got the needed ack from Steven Rostedt:
    http://lkml.org/lkml/2012/8/21/688
 2. Added maintainers entry;
 3. A bunch of fixes, nothing special.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.19 (GNU/Linux)
 
 iQIcBAABAgAGBQJQbjoFAAoJEGgI9fZJve1bZgUP/A/ZwFGfdnochDgRhK5p7ljY
 baRZpgSh2B+BxIDTEPLfVh6HbOivmYJ8WF0unD9kKTzCCS71ZUMiLB25G/bV4lnZ
 fawAOhGfOLG3rXmldxf6nJllHr9JpoSmVEHypvjFNcbjYZ04zhe7jM+YsaWmBw68
 eHXQkOSdfPPpKXZ2B0Eef/EoGWhORW0kTD7xFlorsxYAkksSheY0PC0nYgCFhvCZ
 168y9pi4T4lucr4s44x8AJ/r/5BQ1jEQAY/A2qUE/iBfRFP4XyE1Oao4OHtVDYdU
 KjVPA1VYmwkKSfnkiVFrpb/94IyrKslblgR8nX0kK3L/ccFYjQix4nd9jR+n857s
 xfAuj9nfhUO6fI5qoaVSOBufxKyPp1S7X8INEAJ7WQ0c9VoMv00biK9M77ifDGZg
 ll/Ecq1CADtcbOnQXf6qwGwRKmpR+qgPkIzpNXcuGMuM4AEPwtckOhCyXFr37Txk
 6ZoGM8IIaBJ0yXxHkfpUA7l9ZF0gXR+qHMQCwpUS8tIMx35On+IbybEaKbniKEi1
 AURgQ7ZimVYAHPi0Y0L00+EKI3IPVQJvCFH7SG+wUfLWcbEtNbTv3MAer5o3DANJ
 GMnWBwNw9ClTydWKI0GMNmnWpFukWhd4OXleyl2+q4qRJi3HhNacrok3s/2r+CnT
 QRg8i/0SDvxGuXazrTZT
 =1HAE
 -----END PGP SIGNATURE-----

Merge tag 'for-v3.7' of git://git.infradead.org/users/cbou/linux-pstore

Pull pstore changes from Anton Vorontsov:

 1) We no longer ad-hoc to the function tracer "high level"
    infrastructure and no longer use its debugfs knobs.  The change
    slightly touches kernel/trace directory, but it got the needed ack
    from Steven Rostedt:

      http://lkml.org/lkml/2012/8/21/688

 2) Added maintainers entry;

 3) A bunch of fixes, nothing special.

* tag 'for-v3.7' of git://git.infradead.org/users/cbou/linux-pstore:
  pstore: Avoid recursive spinlocks in the oops_in_progress case
  pstore/ftrace: Convert to its own enable/disable debugfs knob
  pstore/ram: Add missing platform_device_unregister
  MAINTAINERS: Add pstore maintainers
  pstore/ram: Mark ramoops_pstore_write_buf() as notrace
  pstore/ram: Fix printk format warning
  pstore/ram: Fix possible NULL dereference
2012-10-07 17:30:50 +09:00
T Makphaibulchoke 4965f5667f kernel/resource.c: fix stack overflow in __reserve_region_with_split()
Using a recursive call add a non-conflicting region in
__reserve_region_with_split() could result in a stack overflow in the case
that the recursive calls are too deep.  Convert the recursive calls to an
iterative loop to avoid the problem.

Tested on a machine containing 135 regions.  The kernel no longer panicked
with stack overflow.

Also tested with code arbitrarily adding regions with no conflict,
embedding two consecutive conflicts and embedding two non-consecutive
conflicts.

Signed-off-by: T Makphaibulchoke <tmac@hp.com>
Reviewed-by: Ram Pai <linuxram@us.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@gmail.com>
Cc: Wei Yang <weiyang@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-06 03:05:31 +09:00
Jesper Juhl 0324b5a450 taskstats: cgroupstats_user_cmd() may leak on error
If prepare_reply() succeeds we have allocated memory for 'rep_skb'.  If
nla_reserve() then subsequently fails and returns NULL we fail to release
the memory we allocated, thus causing a leak.

Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Cc: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-06 03:05:31 +09:00
Wei Yongjun de4ec99c32 kdump: remove unneeded include
The inclusion of <generated/utsrelease.h> is unnecessary.

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Reviewed-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-06 03:05:19 +09:00
Denys Vlasenko 5ab1c309b3 coredump: pass siginfo_t* to do_coredump() and below, not merely signr
This is a preparatory patch for the introduction of NT_SIGINFO elf note.

With this patch we pass "siginfo_t *siginfo" instead of "int signr" to
do_coredump() and put it into coredump_params.  It will be used by the
next patch.  Most changes are simple s/signr/siginfo->si_signo/.

Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Amerigo Wang <amwang@redhat.com>
Cc: "Jonathan M. Foote" <jmfoote@cert.org>
Cc: Roland McGrath <roland@hack.frob.com>
Cc: Pedro Alves <palves@redhat.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-06 03:05:16 +09:00
Alex Kelly 179899fd5d coredump: update coredump-related headers
Create a new header file, fs/coredump.h, which contains functions only
used by the new coredump.c.  It also moves do_coredump to the
include/linux/coredump.h header file, for consistency.

Signed-off-by: Alex Kelly <alex.page.kelly@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-06 03:05:15 +09:00
Alex Kelly 046d662f48 coredump: make core dump functionality optional
Adds an expert Kconfig option, CONFIG_COREDUMP, which allows disabling of
core dump.  This saves approximately 2.6k in the compiled kernel, and
complements CONFIG_ELF_CORE, which now depends on it.

CONFIG_COREDUMP also disables coredump-related sysctls, except for
suid_dumpable and related functions, which are necessary for ptrace.

[akpm@linux-foundation.org: fix binfmt_aout.c build]
Signed-off-by: Alex Kelly <alex.page.kelly@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-06 03:05:15 +09:00
hongfeng 6c0c0d4d10 poweroff: fix bug in orderly_poweroff()
orderly_poweroff is trying to poweroff platform in two steps:

step 1: Call user space application to poweroff
step 2: If user space poweroff fail, then do a force power off if force param
        is set.

The bug here is, step 1 is always successful with param UMH_NO_WAIT, which obey
the design goal of orderly_poweroff.

We have two choices here:
UMH_WAIT_EXEC which means wait for the exec, but not the process;
UMH_WAIT_PROC which means wait for the process to complete.
we need to trade off the two choices:

If using UMH_WAIT_EXEC, there is potential issue comments by Serge E.
Hallyn: The exec will have started, but may for whatever (very unlikely)
reason fail.

If using UMH_WAIT_PROC, there is potential issue comments by Eric W.
Biederman: If the caller is not running in a kernel thread then we can
easily get into a case where the user space caller will block waiting for
us when we are waiting for the user space caller.

Thanks for their excellent ideas, based on the above discussion, we
finally choose UMH_WAIT_EXEC, which is much more safe, if the user
application really fails, we just complain the application itself, it
seems a better choice here.

Signed-off-by: Feng Hong <hongfeng@marvell.com>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-06 03:04:48 +09:00
Shawn Guo f96972f2dc kernel/sys.c: call disable_nonboot_cpus() in kernel_restart()
As kernel_power_off() calls disable_nonboot_cpus(), we may also want to
have kernel_restart() call disable_nonboot_cpus().  Doing so can help
machines that require boot cpu be the last alive cpu during reboot to
survive with kernel restart.

This fixes one reboot issue seen on imx6q (Cortex-A9 Quad).  The machine
requires that the restart routine be run on the primary cpu rather than
secondary ones.  Otherwise, the secondary core running the restart
routine will fail to come to online after reboot.

Signed-off-by: Shawn Guo <shawn.guo@linaro.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-06 03:04:47 +09:00
Peter Zijlstra 95cf59ea72 perf: Fix perf_cgroup_switch for sw-events
Jiri reported that he could trigger the WARN_ON_ONCE() in
perf_cgroup_switch() using sw-events. This is because sw-events share
a cpuctx with multiple PMUs.

Use the ->unique_pmu pointer to limit the pmu iteration to unique
cpuctx instances.

Reported-and-Tested-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-so7wi2zf3jjzrwcutm2mkz0j@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-10-05 13:59:07 +02:00
Peter Zijlstra 3f1f33206c perf: Clarify perf_cpu_context::active_pmu usage by renaming it to ::unique_pmu
Stephane thought the perf_cpu_context::active_pmu name confusing and
suggested using 'unique_pmu' instead.

This pointer is a pointer to a 'random' pmu sharing the cpuctx
instance, therefore limiting a for_each_pmu loop to those where
cpuctx->unique_pmu matches the pmu we get a loop over unique cpuctx
instances.

Suggested-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-kxyjqpfj2fn9gt7kwu5ag9ks@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-10-05 13:59:06 +02:00
Tang Chen 301a5cba28 sched: Update sched_domains_numa_masks[][] when new cpus are onlined
Once array sched_domains_numa_masks[] []is defined, it is never updated.

When a new cpu on a new node is onlined, the coincident member in
sched_domains_numa_masks[][] is not initialized, and all the masks are 0.
As a result, the build_overlap_sched_groups() will initialize a NULL
sched_group for the new cpu on the new node, which will lead to kernel panic:

[ 3189.403280] Call Trace:
[ 3189.403286]  [<ffffffff8106c36f>] warn_slowpath_common+0x7f/0xc0
[ 3189.403289]  [<ffffffff8106c3ca>] warn_slowpath_null+0x1a/0x20
[ 3189.403292]  [<ffffffff810b1d57>] build_sched_domains+0x467/0x470
[ 3189.403296]  [<ffffffff810b2067>] partition_sched_domains+0x307/0x510
[ 3189.403299]  [<ffffffff810b1ea2>] ? partition_sched_domains+0x142/0x510
[ 3189.403305]  [<ffffffff810fcc93>] cpuset_update_active_cpus+0x83/0x90
[ 3189.403308]  [<ffffffff810b22a8>] cpuset_cpu_active+0x38/0x70
[ 3189.403316]  [<ffffffff81674b87>] notifier_call_chain+0x67/0x150
[ 3189.403320]  [<ffffffff81664647>] ? native_cpu_up+0x18a/0x1b5
[ 3189.403328]  [<ffffffff810a044e>] __raw_notifier_call_chain+0xe/0x10
[ 3189.403333]  [<ffffffff81070470>] __cpu_notify+0x20/0x40
[ 3189.403337]  [<ffffffff8166663e>] _cpu_up+0xe9/0x131
[ 3189.403340]  [<ffffffff81666761>] cpu_up+0xdb/0xee
[ 3189.403348]  [<ffffffff8165667c>] store_online+0x9c/0xd0
[ 3189.403355]  [<ffffffff81437640>] dev_attr_store+0x20/0x30
[ 3189.403361]  [<ffffffff8124aa63>] sysfs_write_file+0xa3/0x100
[ 3189.403368]  [<ffffffff811ccbe0>] vfs_write+0xd0/0x1a0
[ 3189.403371]  [<ffffffff811ccdb4>] sys_write+0x54/0xa0
[ 3189.403375]  [<ffffffff81679c69>] system_call_fastpath+0x16/0x1b
[ 3189.403377] ---[ end trace 1e6cf85d0859c941 ]---
[ 3189.403398] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018

This patch registers a new notifier for cpu hotplug notify chain, and
updates sched_domains_numa_masks every time a new cpu is onlined or offlined.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
[ fixed compile warning ]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1348578751-16904-3-git-send-email-tangchen@cn.fujitsu.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-10-05 13:54:48 +02:00
Tang Chen 5f7865f3e4 sched: Ensure 'sched_domains_numa_levels' is safe to use in other functions
We should temporarily reset 'sched_domains_numa_levels' to 0 after
it is reset to 'level' in sched_init_numa(). If it fails to allocate
memory for array sched_domains_numa_masks[][], the array will contain
less then 'level' members. This could be dangerous when we use it to
iterate array sched_domains_numa_masks[][] in other functions.

This patch set sched_domains_numa_levels to 0 before initializing
array sched_domains_numa_masks[][], and reset it to 'level' when
sched_domains_numa_masks[][] is fully initialized.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1348578751-16904-2-git-send-email-tangchen@cn.fujitsu.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-10-05 13:54:46 +02:00
Frederic Weisbecker 2b17c545a4 nohz: Fix one jiffy count too far in idle cputime
When we stop the tick in idle, we save the current jiffies value
in ts->idle_jiffies. This snapshot is substracted from the later
value of jiffies when the tick is restarted and the resulting
delta is accounted as idle cputime. This is how we handle the
idle cputime accounting without the tick.

But sometimes we need to schedule the next tick to some time in
the future instead of completely stopping it. In this case, a
tick may happen before we restart the periodic behaviour and
from that tick we account one jiffy to idle cputime as usual but
we also increment the ts->idle_jiffies snapshot by one so that
when we compute the delta to account, we substract the one jiffy
we just accounted.

To prepare for stopping the tick outside idle, we introduced a
check that prevents from fixing up that ts->idle_jiffies if we
are not running the idle task. But we use idle_cpu() for that
and this is a problem if we run the tick while another CPU
remotely enqueues a ttwu to our runqueue:

CPU 0:                            CPU 1:

tick_sched_timer() {              ttwu_queue_remote()
       if (idle_cpu(CPU 0))
           ts->idle_jiffies++;
}

Here, idle_cpu() notes that &rq->wake_list is not empty and
hence won't consider the CPU as idle. As a result,
ts->idle_jiffies won't be incremented. But this is wrong because
we actually account the current jiffy to idle cputime. And that
jiffy won't get substracted from the nohz time delta. So in the
end, this jiffy is accounted twice.

Fix this by changing idle_cpu(smp_processor_id()) with
is_idle_task(current). This way the jiffy is substracted
correctly even if a ttwu operation is enqueued on the CPU.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: <stable@vger.kernel.org> # 3.5+
Link: http://lkml.kernel.org/r/1349308004-3482-1-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-10-05 10:22:20 +02:00
Linus Torvalds ecefbd94b8 KVM updates for the 3.7 merge window
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJQbY/2AAoJEI7yEDeUysxlymQQAIv5svpAI/FUe3FhvBi3IW2h
 WWMIpbdhHyocaINT18qNp8prO0iwoaBfgsnU8zuB34MrbdUgiwSHgM6T4Ff4NGa+
 R4u+gpyKYwxNQYKeJyj04luXra/krxwHL1u9OwN7o44JuQXAmzrw2tZ9ad1ArvL3
 eoZ6kGsPcdHPZMZWw2jN5xzBsRtqybm0GPPQh1qPXdn8UlPPd1X7owvbaud2y4+e
 StVIpGY6wrsO36f7UcA4Gm1EP/1E6Lm5KMXJyHgM9WBRkEfp92jTY5+XKv91vK8Z
 VKUd58QMdZE5NCNBkAR9U5N9aH0oSXnFU/g8hgiwGvrhS3IsSkKUePE6sVyMVTIO
 VptKRYe0AdmD/g25p6ApJsguV7ITlgoCPaE4rMmRcW9/bw8+iY098r7tO7w11H8M
 TyFOXihc3B+rlH8WdzOblwxHMC4yRuiPIktaA3WwbX7eA7Xv/ZRtdidifXKtgsVE
 rtubVqwGyYcHoX1Y+JiByIW1NN0pYncJhPEdc8KbRe2wKs3amA9rio1mUpBYYBPO
 B0ygcITftyXbhcTtssgcwBDGXB0AAGqI7wqdtJhFeIrKwHXD7fNeAGRwO8oKxmlj
 0aPwo9fDtpI+e6BFTohEgjZBocRvXXNWLnDSFB0E7xDR31bACck2FG5FAp1DxdS7
 lb/nbAsXf9UJLgGir4I1
 =kN6V
 -----END PGP SIGNATURE-----

Merge tag 'kvm-3.7-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Avi Kivity:
 "Highlights of the changes for this release include support for vfio
  level triggered interrupts, improved big real mode support on older
  Intels, a streamlines guest page table walker, guest APIC speedups,
  PIO optimizations, better overcommit handling, and read-only memory."

* tag 'kvm-3.7-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (138 commits)
  KVM: s390: Fix vcpu_load handling in interrupt code
  KVM: x86: Fix guest debug across vcpu INIT reset
  KVM: Add resampling irqfds for level triggered interrupts
  KVM: optimize apic interrupt delivery
  KVM: MMU: Eliminate pointless temporary 'ac'
  KVM: MMU: Avoid access/dirty update loop if all is well
  KVM: MMU: Eliminate eperm temporary
  KVM: MMU: Optimize is_last_gpte()
  KVM: MMU: Simplify walk_addr_generic() loop
  KVM: MMU: Optimize pte permission checks
  KVM: MMU: Update accessed and dirty bits after guest pagetable walk
  KVM: MMU: Move gpte_access() out of paging_tmpl.h
  KVM: MMU: Optimize gpte_access() slightly
  KVM: MMU: Push clean gpte write protection out of gpte_access()
  KVM: clarify kvmclock documentation
  KVM: make processes waiting on vcpu mutex killable
  KVM: SVM: Make use of asm.h
  KVM: VMX: Make use of asm.h
  KVM: VMX: Make lto-friendly
  KVM: x86: lapic: Clean up find_highest_vector() and count_vectors()
  ...

Conflicts:
	arch/s390/include/asm/processor.h
	arch/x86/kvm/i8259.c
2012-10-04 09:30:33 -07:00
Linus Torvalds 88265322c1 Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull security subsystem updates from James Morris:
 "Highlights:

   - Integrity: add local fs integrity verification to detect offline
     attacks
   - Integrity: add digital signature verification
   - Simple stacking of Yama with other LSMs (per LSS discussions)
   - IBM vTPM support on ppc64
   - Add new driver for Infineon I2C TIS TPM
   - Smack: add rule revocation for subject labels"

Fixed conflicts with the user namespace support in kernel/auditsc.c and
security/integrity/ima/ima_policy.c.

* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (39 commits)
  Documentation: Update git repository URL for Smack userland tools
  ima: change flags container data type
  Smack: setprocattr memory leak fix
  Smack: implement revoking all rules for a subject label
  Smack: remove task_wait() hook.
  ima: audit log hashes
  ima: generic IMA action flag handling
  ima: rename ima_must_appraise_or_measure
  audit: export audit_log_task_info
  tpm: fix tpm_acpi sparse warning on different address spaces
  samples/seccomp: fix 31 bit build on s390
  ima: digital signature verification support
  ima: add support for different security.ima data types
  ima: add ima_inode_setxattr/removexattr function and calls
  ima: add inode_post_setattr call
  ima: replace iint spinblock with rwlock/read_lock
  ima: allocating iint improvements
  ima: add appraise action keywords and default rules
  ima: integrity appraisal extension
  vfs: move ima_file_free before releasing the file
  ...
2012-10-02 21:38:48 -07:00
Linus Torvalds aab174f0df Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs update from Al Viro:

 - big one - consolidation of descriptor-related logics; almost all of
   that is moved to fs/file.c

   (BTW, I'm seriously tempted to rename the result to fd.c.  As it is,
   we have a situation when file_table.c is about handling of struct
   file and file.c is about handling of descriptor tables; the reasons
   are historical - file_table.c used to be about a static array of
   struct file we used to have way back).

   A lot of stray ends got cleaned up and converted to saner primitives,
   disgusting mess in android/binder.c is still disgusting, but at least
   doesn't poke so much in descriptor table guts anymore.  A bunch of
   relatively minor races got fixed in process, plus an ext4 struct file
   leak.

 - related thing - fget_light() partially unuglified; see fdget() in
   there (and yes, it generates the code as good as we used to have).

 - also related - bits of Cyrill's procfs stuff that got entangled into
   that work; _not_ all of it, just the initial move to fs/proc/fd.c and
   switch of fdinfo to seq_file.

 - Alex's fs/coredump.c spiltoff - the same story, had been easier to
   take that commit than mess with conflicts.  The rest is a separate
   pile, this was just a mechanical code movement.

 - a few misc patches all over the place.  Not all for this cycle,
   there'll be more (and quite a few currently sit in akpm's tree)."

Fix up trivial conflicts in the android binder driver, and some fairly
simple conflicts due to two different changes to the sock_alloc_file()
interface ("take descriptor handling from sock_alloc_file() to callers"
vs "net: Providing protocol type via system.sockprotoname xattr of
/proc/PID/fd entries" adding a dentry name to the socket)

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (72 commits)
  MAX_LFS_FILESIZE should be a loff_t
  compat: fs: Generic compat_sys_sendfile implementation
  fs: push rcu_barrier() from deactivate_locked_super() to filesystems
  btrfs: reada_extent doesn't need kref for refcount
  coredump: move core dump functionality into its own file
  coredump: prevent double-free on an error path in core dumper
  usb/gadget: fix misannotations
  fcntl: fix misannotations
  ceph: don't abuse d_delete() on failure exits
  hypfs: ->d_parent is never NULL or negative
  vfs: delete surplus inode NULL check
  switch simple cases of fget_light to fdget
  new helpers: fdget()/fdput()
  switch o2hb_region_dev_write() to fget_light()
  proc_map_files_readdir(): don't bother with grabbing files
  make get_file() return its argument
  vhost_set_vring(): turn pollstart/pollstop into bool
  switch prctl_set_mm_exe_file() to fget_light()
  switch xfs_find_handle() to fget_light()
  switch xfs_swapext() to fget_light()
  ...
2012-10-02 20:25:04 -07:00
James Morris 61d335dd27 Merge branch 'security-next-keys' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/security-keys into next-queue
As requested by David.
2012-10-03 13:00:17 +10:00
Linus Torvalds 16642a2e7b Power management updates for 3.7-rc1
* Improved system suspend/resume and runtime PM handling for the SH TMU, CMT
   and MTU2 clock event devices (also used by ARM/shmobile).
 
 * Generic PM domains framework extensions related to cpuidle support and
   domain objects lookup using names.
 
 * ARM/shmobile power management updates including improved support for the
   SH7372's A4S power domain containing the CPU core.
 
 * cpufreq changes related to AMD CPUs support from Matthew Garrett, Andre
   Przywara and Borislav Petkov.
 
 * cpu0 cpufreq driver from Shawn Guo.
 
 * cpufreq governor fixes related to the relaxing of limit from Michal Pecio.
 
 * OMAP cpufreq updates from Axel Lin and Richard Zhao.
 
 * cpuidle ladder governor fixes related to the disabling of states from
   Carsten Emde and me.
 
 * Runtime PM core updates related to the interactions with the system suspend
   core from Alan Stern and Kevin Hilman.
 
 * Wakeup sources modification allowing more helper functions to be called from
   interrupt context from John Stultz and additional diagnostic code from Todd
   Poynor.
 
 * System suspend error code path fix from Feng Hong.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.18 (GNU/Linux)
 
 iQIcBAABAgAGBQJQa1rRAAoJEKhOf7ml8uNsYZ0P/2RZ71sgLWcUCfr0yHaiZeOd
 2GxEYSZ+9BZJHADgoAK/bHRTv8crm40Y2RkbaWbxPDRNuE4SutbvNTGTlJSAguSD
 yHkU/6AFC7u8Jwq+afsWIdGX7eHd78zPpj6EVtVtjHM903WDwbMU2vUz7tQ+fFa+
 ZZ7eydq9j0ec0OoH3UeNhet7JSOpT5BSLgjmIkHMBgIvTxNVDbkB31QUxnUxocxn
 k6S2wQaUSJJWGMLksRRNrhwLq+cGYwTsaOtG/KzRLH1raUyn33B5pcZr0aqhOkjg
 ClaCks3V8o3vRghSwOPB5aVXzjBKvM3UnSyJNIl+FeCeyWuwSNbkEFdA/e7oPuxG
 UsW6dcHiuVo6Ir4+zhd9+lN+/AcPTChO5b7lbU8qRF4ce04czWlUY/KzJjaM+YOE
 CKGq6eX9AHwFjE+h4+VcCXgmzcioiS8Y/CPz13u8N1y0zzwW+ftjb12K+7lVBEG1
 fhrePKHgLw3kJ9LqGpR+4vVur7C+rCf6WwCReTY2vXXVYJ+SuKWTRI4zAjTPXtHa
 i9dpMRASpF+ScRYBcgwIpv789WuHATFKqdBSinZUKBaxQZ5flJ2qIrfqN5VeAejh
 oQs/zZCdIuAtFKqVycQ0L42YxFNKgPFKQErUCSu3M5OuZLlLVLu7yQvIo2Xmo9qf
 Hcrpvo5K+w29YkiwGP9e
 =rbCk
 -----END PGP SIGNATURE-----

Merge tag 'pm-for-3.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management updates from Rafael J Wysocki:

 - Improved system suspend/resume and runtime PM handling for the SH
   TMU, CMT and MTU2 clock event devices (also used by ARM/shmobile).

 - Generic PM domains framework extensions related to cpuidle support
   and domain objects lookup using names.

 - ARM/shmobile power management updates including improved support for
   the SH7372's A4S power domain containing the CPU core.

 - cpufreq changes related to AMD CPUs support from Matthew Garrett,
   Andre Przywara and Borislav Petkov.

 - cpu0 cpufreq driver from Shawn Guo.

 - cpufreq governor fixes related to the relaxing of limit from Michal
   Pecio.

 - OMAP cpufreq updates from Axel Lin and Richard Zhao.

 - cpuidle ladder governor fixes related to the disabling of states from
   Carsten Emde and me.

 - Runtime PM core updates related to the interactions with the system
   suspend core from Alan Stern and Kevin Hilman.

 - Wakeup sources modification allowing more helper functions to be
   called from interrupt context from John Stultz and additional
   diagnostic code from Todd Poynor.

 - System suspend error code path fix from Feng Hong.

Fixed up conflicts in cpufreq/powernow-k8 that stemmed from the
workqueue fixes conflicting fairly badly with the removal of support for
hardware P-state chips.  The changes were independent but somewhat
intertwined.

* tag 'pm-for-3.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (76 commits)
  Revert "PM QoS: Use spinlock in the per-device PM QoS constraints code"
  PM / Runtime: let rpm_resume() succeed if RPM_ACTIVE, even when disabled, v2
  cpuidle: rename function name "__cpuidle_register_driver", v2
  cpufreq: OMAP: Check IS_ERR() instead of NULL for omap_device_get_by_hwmod_name
  cpuidle: remove some empty lines
  PM: Prevent runtime suspend during system resume
  PM QoS: Use spinlock in the per-device PM QoS constraints code
  PM / Sleep: use resume event when call dpm_resume_early
  cpuidle / ACPI : move cpuidle_device field out of the acpi_processor_power structure
  ACPI / processor: remove pointless variable initialization
  ACPI / processor: remove unused function parameter
  cpufreq: OMAP: remove loops_per_jiffy recalculate for smp
  sections: fix section conflicts in drivers/cpufreq
  cpufreq: conservative: update frequency when limits are relaxed
  cpufreq / ondemand: update frequency when limits are relaxed
  properly __init-annotate pm_sysrq_init()
  cpufreq: Add a generic cpufreq-cpu0 driver
  PM / OPP: Initialize OPP table from device tree
  ARM: add cpufreq transiton notifier to adjust loops_per_jiffy for smp
  cpufreq: Remove support for hardware P-state chips from powernow-k8
  ...
2012-10-02 18:32:35 -07:00
Linus Torvalds aecdc33e11 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking changes from David Miller:

 1) GRE now works over ipv6, from Dmitry Kozlov.

 2) Make SCTP more network namespace aware, from Eric Biederman.

 3) TEAM driver now works with non-ethernet devices, from Jiri Pirko.

 4) Make openvswitch network namespace aware, from Pravin B Shelar.

 5) IPV6 NAT implementation, from Patrick McHardy.

 6) Server side support for TCP Fast Open, from Jerry Chu and others.

 7) Packet BPF filter supports MOD and XOR, from Eric Dumazet and Daniel
    Borkmann.

 8) Increate the loopback default MTU to 64K, from Eric Dumazet.

 9) Use a per-task rather than per-socket page fragment allocator for
    outgoing networking traffic.  This benefits processes that have very
    many mostly idle sockets, which is quite common.

    From Eric Dumazet.

10) Use up to 32K for page fragment allocations, with fallbacks to
    smaller sizes when higher order page allocations fail.  Benefits are
    a) less segments for driver to process b) less calls to page
    allocator c) less waste of space.

    From Eric Dumazet.

11) Allow GRO to be used on GRE tunnels, from Eric Dumazet.

12) VXLAN device driver, one way to handle VLAN issues such as the
    limitation of 4096 VLAN IDs yet still have some level of isolation.
    From Stephen Hemminger.

13) As usual there is a large boatload of driver changes, with the scale
    perhaps tilted towards the wireless side this time around.

Fix up various fairly trivial conflicts, mostly caused by the user
namespace changes.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1012 commits)
  hyperv: Add buffer for extended info after the RNDIS response message.
  hyperv: Report actual status in receive completion packet
  hyperv: Remove extra allocated space for recv_pkt_list elements
  hyperv: Fix page buffer handling in rndis_filter_send_request()
  hyperv: Fix the missing return value in rndis_filter_set_packet_filter()
  hyperv: Fix the max_xfer_size in RNDIS initialization
  vxlan: put UDP socket in correct namespace
  vxlan: Depend on CONFIG_INET
  sfc: Fix the reported priorities of different filter types
  sfc: Remove EFX_FILTER_FLAG_RX_OVERRIDE_IP
  sfc: Fix loopback self-test with separate_tx_channels=1
  sfc: Fix MCDI structure field lookup
  sfc: Add parentheses around use of bitfield macro arguments
  sfc: Fix null function pointer in efx_sriov_channel_type
  vxlan: virtual extensible lan
  igmp: export symbol ip_mc_leave_group
  netlink: add attributes to fdb interface
  tg3: unconditionally select HWMON support when tg3 is enabled.
  Revert "net: ti cpsw ethernet: allow reading phy interface mode from DT"
  gre: fix sparse warning
  ...
2012-10-02 13:38:27 -07:00
David Howells 3a50597de8 KEYS: Make the session and process keyrings per-thread
Make the session keyring per-thread rather than per-process, but still
inherited from the parent thread to solve a problem with PAM and gdm.

The problem is that join_session_keyring() will reject attempts to change the
session keyring of a multithreaded program but gdm is now multithreaded before
it gets to the point of starting PAM and running pam_keyinit to create the
session keyring.  See:

	https://bugs.freedesktop.org/show_bug.cgi?id=49211

The reason that join_session_keyring() will only change the session keyring
under a single-threaded environment is that it's hard to alter the other
thread's credentials to effect the change in a multi-threaded program.  The
problems are such as:

 (1) How to prevent two threads both running join_session_keyring() from
     racing.

 (2) Another thread's credentials may not be modified directly by this process.

 (3) The number of threads is uncertain whilst we're not holding the
     appropriate spinlock, making preallocation slightly tricky.

 (4) We could use TIF_NOTIFY_RESUME and key_replace_session_keyring() to get
     another thread to replace its keyring, but that means preallocating for
     each thread.

A reasonable way around this is to make the session keyring per-thread rather
than per-process and just document that if you want a common session keyring,
you must get it before you spawn any threads - which is the current situation
anyway.

Whilst we're at it, we can the process keyring behave in the same way.  This
means we can clean up some of the ickyness in the creds code.

Basically, after this patch, the session, process and thread keyrings are about
inheritance rules only and not about sharing changes of keyring.

Reported-by: Mantas M. <grawity@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Ray Strode <rstrode@redhat.com>
2012-10-02 19:24:29 +01:00
Linus Torvalds 437589a74b Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull user namespace changes from Eric Biederman:
 "This is a mostly modest set of changes to enable basic user namespace
  support.  This allows the code to code to compile with user namespaces
  enabled and removes the assumption there is only the initial user
  namespace.  Everything is converted except for the most complex of the
  filesystems: autofs4, 9p, afs, ceph, cifs, coda, fuse, gfs2, ncpfs,
  nfs, ocfs2 and xfs as those patches need a bit more review.

  The strategy is to push kuid_t and kgid_t values are far down into
  subsystems and filesystems as reasonable.  Leaving the make_kuid and
  from_kuid operations to happen at the edge of userspace, as the values
  come off the disk, and as the values come in from the network.
  Letting compile type incompatible compile errors (present when user
  namespaces are enabled) guide me to find the issues.

  The most tricky areas have been the places where we had an implicit
  union of uid and gid values and were storing them in an unsigned int.
  Those places were converted into explicit unions.  I made certain to
  handle those places with simple trivial patches.

  Out of that work I discovered we have generic interfaces for storing
  quota by projid.  I had never heard of the project identifiers before.
  Adding full user namespace support for project identifiers accounts
  for most of the code size growth in my git tree.

  Ultimately there will be work to relax privlige checks from
  "capable(FOO)" to "ns_capable(user_ns, FOO)" where it is safe allowing
  root in a user names to do those things that today we only forbid to
  non-root users because it will confuse suid root applications.

  While I was pushing kuid_t and kgid_t changes deep into the audit code
  I made a few other cleanups.  I capitalized on the fact we process
  netlink messages in the context of the message sender.  I removed
  usage of NETLINK_CRED, and started directly using current->tty.

  Some of these patches have also made it into maintainer trees, with no
  problems from identical code from different trees showing up in
  linux-next.

  After reading through all of this code I feel like I might be able to
  win a game of kernel trivial pursuit."

Fix up some fairly trivial conflicts in netfilter uid/git logging code.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (107 commits)
  userns: Convert the ufs filesystem to use kuid/kgid where appropriate
  userns: Convert the udf filesystem to use kuid/kgid where appropriate
  userns: Convert ubifs to use kuid/kgid
  userns: Convert squashfs to use kuid/kgid where appropriate
  userns: Convert reiserfs to use kuid and kgid where appropriate
  userns: Convert jfs to use kuid/kgid where appropriate
  userns: Convert jffs2 to use kuid and kgid where appropriate
  userns: Convert hpfs to use kuid and kgid where appropriate
  userns: Convert btrfs to use kuid/kgid where appropriate
  userns: Convert bfs to use kuid/kgid where appropriate
  userns: Convert affs to use kuid/kgid wherwe appropriate
  userns: On alpha modify linux_to_osf_stat to use convert from kuids and kgids
  userns: On ia64 deal with current_uid and current_gid being kuid and kgid
  userns: On ppc convert current_uid from a kuid before printing.
  userns: Convert s390 getting uid and gid system calls to use kuid and kgid
  userns: Convert s390 hypfs to use kuid and kgid where appropriate
  userns: Convert binder ipc to use kuids
  userns: Teach security_path_chown to take kuids and kgids
  userns: Add user namespace support to IMA
  userns: Convert EVM to deal with kuids and kgids in it's hmac computation
  ...
2012-10-02 11:11:09 -07:00
Linus Torvalds 68d47a137c Merge branch 'for-3.7-hierarchy' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup hierarchy update from Tejun Heo:
 "Currently, different cgroup subsystems handle nested cgroups
  completely differently.  There's no consistency among subsystems and
  the behaviors often are outright broken.

  People at least seem to agree that the broken hierarhcy behaviors need
  to be weeded out if any progress is gonna be made on this front and
  that the fallouts from deprecating the broken behaviors should be
  acceptable especially given that the current behaviors don't make much
  sense when nested.

  This patch makes cgroup emit warning messages if cgroups for
  subsystems with broken hierarchy behavior are nested to prepare for
  fixing them in the future.  This was put in a separate branch because
  more related changes were expected (didn't make it this round) and the
  memory cgroup wanted to pull in this and make changes on top."

* 'for-3.7-hierarchy' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them
2012-10-02 10:52:28 -07:00
Linus Torvalds c0e8a139a5 Merge branch 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:

 - xattr support added.  The implementation is shared with tmpfs.  The
   usage is restricted and intended to be used to manage per-cgroup
   metadata by system software.  tmpfs changes are routed through this
   branch with Hugh's permission.

 - cgroup subsystem ID handling simplified.

* 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: Define CGROUP_SUBSYS_COUNT according the configuration
  cgroup: Assign subsystem IDs during compile time
  cgroup: Do not depend on a given order when populating the subsys array
  cgroup: Wrap subsystem selection macro
  cgroup: Remove CGROUP_BUILTIN_SUBSYS_COUNT
  cgroup: net_prio: Do not define task_netpioidx() when not selected
  cgroup: net_cls: Do not define task_cls_classid() when not selected
  cgroup: net_cls: Move sock_update_classid() declaration to cls_cgroup.h
  cgroup: trivial fixes for Documentation/cgroups/cgroups.txt
  xattr: mark variable as uninitialized to make both gcc and smatch happy
  fs: add missing documentation to simple_xattr functions
  cgroup: add documentation on extended attributes usage
  cgroup: rename subsys_bits to subsys_mask
  cgroup: add xattr support
  cgroup: revise how we re-populate root directory
  xattr: extract simple_xattr code from tmpfs
2012-10-02 10:50:47 -07:00
Linus Torvalds 033d9959ed Merge branch 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue changes from Tejun Heo:
 "This is workqueue updates for v3.7-rc1.  A lot of activities this
  round including considerable API and behavior cleanups.

   * delayed_work combines a timer and a work item.  The handling of the
     timer part has always been a bit clunky leading to confusing
     cancelation API with weird corner-case behaviors.  delayed_work is
     updated to use new IRQ safe timer and cancelation now works as
     expected.

   * Another deficiency of delayed_work was lack of the counterpart of
     mod_timer() which led to cancel+queue combinations or open-coded
     timer+work usages.  mod_delayed_work[_on]() are added.

     These two delayed_work changes make delayed_work provide interface
     and behave like timer which is executed with process context.

   * A work item could be executed concurrently on multiple CPUs, which
     is rather unintuitive and made flush_work() behavior confusing and
     half-broken under certain circumstances.  This problem doesn't
     exist for non-reentrant workqueues.  While non-reentrancy check
     isn't free, the overhead is incurred only when a work item bounces
     across different CPUs and even in simulated pathological scenario
     the overhead isn't too high.

     All workqueues are made non-reentrant.  This removes the
     distinction between flush_[delayed_]work() and
     flush_[delayed_]_work_sync().  The former is now as strong as the
     latter and the specified work item is guaranteed to have finished
     execution of any previous queueing on return.

   * In addition to the various bug fixes, Lai redid and simplified CPU
     hotplug handling significantly.

   * Joonsoo introduced system_highpri_wq and used it during CPU
     hotplug.

  There are two merge commits - one to pull in IRQ safe timer from
  tip/timers/core and the other to pull in CPU hotplug fixes from
  wq/for-3.6-fixes as Lai's hotplug restructuring depended on them."

Fixed a number of trivial conflicts, but the more interesting conflicts
were silent ones where the deprecated interfaces had been used by new
code in the merge window, and thus didn't cause any real data conflicts.

Tejun pointed out a few of them, I fixed a couple more.

* 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (46 commits)
  workqueue: remove spurious WARN_ON_ONCE(in_irq()) from try_to_grab_pending()
  workqueue: use cwq_set_max_active() helper for workqueue_set_max_active()
  workqueue: introduce cwq_set_max_active() helper for thaw_workqueues()
  workqueue: remove @delayed from cwq_dec_nr_in_flight()
  workqueue: fix possible stall on try_to_grab_pending() of a delayed work item
  workqueue: use hotcpu_notifier() for workqueue_cpu_down_callback()
  workqueue: use __cpuinit instead of __devinit for cpu callbacks
  workqueue: rename manager_mutex to assoc_mutex
  workqueue: WORKER_REBIND is no longer necessary for idle rebinding
  workqueue: WORKER_REBIND is no longer necessary for busy rebinding
  workqueue: reimplement idle worker rebinding
  workqueue: deprecate __cancel_delayed_work()
  workqueue: reimplement cancel_delayed_work() using try_to_grab_pending()
  workqueue: use mod_delayed_work() instead of __cancel + queue
  workqueue: use irqsafe timer for delayed_work
  workqueue: clean up delayed_work initializers and add missing one
  workqueue: make deferrable delayed_work initializer names consistent
  workqueue: cosmetic whitespace updates for macro definitions
  workqueue: deprecate system_nrt[_freezable]_wq
  workqueue: deprecate flush[_delayed]_work_sync()
  ...
2012-10-02 09:54:49 -07:00
Andy Lutomirski 87b526d349 seccomp: Make syscall skipping and nr changes more consistent
This fixes two issues that could cause incompatibility between
kernel versions:

 - If a tracer uses SECCOMP_RET_TRACE to select a syscall number
   higher than the largest known syscall, emulate the unknown
   vsyscall by returning -ENOSYS.  (This is unlikely to make a
   noticeable difference on x86-64 due to the way the system call
   entry works.)

 - On x86-64 with vsyscall=emulate, skipped vsyscalls were buggy.

This updates the documentation accordingly.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Acked-by: Will Drewry <wad@chromium.org>
Signed-off-by: James Morris <james.l.morris@oracle.com>
2012-10-02 21:14:29 +10:00
Linus Torvalds 3498d13b80 TTY merge for 3.7-rc1
As we skipped the merge window for 3.6-rc1 for the tty tree, everything
 is now settled down and working properly, so we are ready for 3.7-rc1.
 Here's the patchset, it's big, but the large changes are removing a
 firmware file and adding a staging tty driver (it depended on the tty
 core changes, so it's going through this tree instead of the staging
 tree.)
 
 All of these patches have been in the linux-next tree for a while.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.19 (GNU/Linux)
 
 iEYEABECAAYFAlBp36oACgkQMUfUDdst+yk4WgCdEy13hot8fI2Lqnc7W0LKu7GX
 4p8AoLTjzrXhLosxdijskDQ9X1OtjrxU
 =S5Ng
 -----END PGP SIGNATURE-----

Merge tag 'tty-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty

Pull TTY changes from Greg Kroah-Hartman:
 "As we skipped the merge window for 3.6-rc1 for the tty tree,
  everything is now settled down and working properly, so we are ready
  for 3.7-rc1.  Here's the patchset, it's big, but the large changes are
  removing a firmware file and adding a staging tty driver (it depended
  on the tty core changes, so it's going through this tree instead of
  the staging tree.)

  All of these patches have been in the linux-next tree for a while.

  Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>"

Fix up more-or-less trivial conflicts in
 - drivers/char/pcmcia/synclink_cs.c:
    tty NULL dereference fix vs tty_port_cts_enabled() helper function
 - drivers/staging/{Kconfig,Makefile}:
    add-add conflict (dgrp driver added close to other staging drivers)
 - drivers/staging/ipack/devices/ipoctal.c:
    "split ipoctal_channel from iopctal" vs "TTY: use tty_port_register_device"

* tag 'tty-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (235 commits)
  tty/serial: Add kgdb_nmi driver
  tty/serial/amba-pl011: Quiesce interrupts in poll_get_char
  tty/serial/amba-pl011: Implement poll_init callback
  tty/serial/core: Introduce poll_init callback
  kdb: Turn KGDB_KDB=n stubs into static inlines
  kdb: Implement disable_nmi command
  kernel/debug: Mask KGDB NMI upon entry
  serial: pl011: handle corruption at high clock speeds
  serial: sccnxp: Make 'default' choice in switch last
  serial: sccnxp: Remove mask termios caps for SW flow control
  serial: sccnxp: Report actual baudrate back to core
  serial: samsung: Add poll_get_char & poll_put_char
  Powerpc 8xx CPM_UART setting MAXIDL register proportionaly to baud rate
  Powerpc 8xx CPM_UART maxidl should not depend on fifo size
  Powerpc 8xx CPM_UART too many interrupts
  Powerpc 8xx CPM_UART desynchronisation
  serial: set correct baud_base for EXSYS EX-41092 Dual 16950
  serial: omap: fix the reciever line error case
  8250: blacklist Winbond CIR port
  8250_pnp: do pnp probe before legacy probe
  ...
2012-10-01 12:26:52 -07:00
Linus Torvalds 81f56e5375 Linux support for the 64-bit ARM architecture (AArch64)
Features currently supported:
 - 39-bit address space for user and kernel (each)
 - 4KB and 64KB page configurations
 - Compat (32-bit) user applications (ARMv7, EABI only)
 - Flattened Device Tree (mandated for all AArch64 platforms)
 - ARM generic timers
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.9 (GNU/Linux)
 
 iQIcBAABAgAGBQJQabRiAAoJEGvWsS0AyF7xXgcQAK+FTXt0ikdQYMkV5AIZXb9i
 xHRhuiZWx2vKyk0mCqpyGLY58GSmSb6uTBg/2P2Ej7vXdH/RB2goPzjlspfjkDL4
 o8RJp7eQ07Uz3KRDYEJgMP8xKZid6KFG93RJ6TjjpKZLuDBdwiG1GP1vb0jVcWfo
 ttZrj/aI8lMcqrh3Vq5qefP7GWP1OVATqeaGTiT7oo38pXwF3t237xfBr2iDGFBp
 ZgIRddrxpa7JYUesfJDDDdGHvLq7Vh2jJV+io9qasBZDrtppGJIhZ0vUni2DgIi7
 r4i1LcynDN4JaG0maZ4U/YQm74TCD4BqxV8GJ7zwLPTWeN+of+skjhPSLOkA+0fp
 I+sWjXlv200gDfJZ9qnUld2kFpoDfJi2b7fNDouSDd2OhmVOVWG3jnVP4Z7meVSb
 O8BYzWDdsAiabuwciUY3OsmW6424lT93b2v86Vncs4unKMvEjOPxYZbUxhqX8f2j
 gsmWwwD/yS4THx2B6OyW9VT3I5J6miqs2Glt/GG6vPWT5AKQJn9jCxKaBGhPMPIs
 xe5/GycBYjdk/Y8qRjegxFbEqzQuiRzmkeFn5jwjmBLqpGNbZDpvMaL6adhAKM5/
 v6UIKa91ra4fC9N0h6G61pOc9N9DbT8wPbCbdYY0RMTMRuLDZDgAM3Bvz0r2APdD
 96leNy6vx684hbkCSLJs
 =buJB
 -----END PGP SIGNATURE-----

Merge tag 'arm64-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux-aarch64

Pull arm64 support from Catalin Marinas:
 "Linux support for the 64-bit ARM architecture (AArch64)

  Features currently supported:
   - 39-bit address space for user and kernel (each)
   - 4KB and 64KB page configurations
   - Compat (32-bit) user applications (ARMv7, EABI only)
   - Flattened Device Tree (mandated for all AArch64 platforms)
   - ARM generic timers"

* tag 'arm64-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux-aarch64: (35 commits)
  arm64: ptrace: remove obsolete ptrace request numbers from user headers
  arm64: Do not set the SMP/nAMP processor bit
  arm64: MAINTAINERS update
  arm64: Build infrastructure
  arm64: Miscellaneous header files
  arm64: Generic timers support
  arm64: Loadable modules
  arm64: Miscellaneous library functions
  arm64: Performance counters support
  arm64: Add support for /proc/sys/debug/exception-trace
  arm64: Debugging support
  arm64: Floating point and SIMD
  arm64: 32-bit (compat) applications support
  arm64: User access library functions
  arm64: Signal handling support
  arm64: VDSO support
  arm64: System calls handling
  arm64: ELF definitions
  arm64: SMP support
  arm64: DMA mapping API
  ...
2012-10-01 11:51:57 -07:00
Linus Torvalds da8347969f Merge branch 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86/asm changes from Ingo Molnar:
 "The one change that stands out is the alternatives patching change
  that prevents us from ever patching back instructions from SMP to UP:
  this simplifies things and speeds up CPU hotplug.

  Other than that it's smaller fixes, cleanups and improvements."

* 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86: Unspaghettize do_trap()
  x86_64: Work around old GAS bug
  x86: Use REP BSF unconditionally
  x86: Prefer TZCNT over BFS
  x86/64: Adjust types of temporaries used by ffs()/fls()/fls64()
  x86: Drop unnecessary kernel_eflags variable on 64-bit
  x86/smp: Don't ever patch back to UP if we unplug cpus
2012-10-01 10:46:27 -07:00
Linus Torvalds 2fff56641b Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer changes from Ingo Molnar:
 "Timer enhancements, generalizations and cleanups from Tejun Heo, in
  preparation for workqueue facility enhancements."

* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timer: Implement TIMER_IRQSAFE
  timer: Clean up timer initializers
  timer: Relocate declarations of init_timer_on_stack_key()
  timer: Generalize timer->base flags handling
2012-10-01 10:45:16 -07:00
Linus Torvalds 0b981cb94b Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler changes from Ingo Molnar:
 "Continued quest to clean up and enhance the cputime code by Frederic
  Weisbecker, in preparation for future tickless kernel features.

  Other than that, smallish changes."

Fix up trivial conflicts due to additions next to each other in arch/{x86/}Kconfig

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
  cputime: Make finegrained irqtime accounting generally available
  cputime: Gather time/stats accounting config options into a single menu
  ia64: Reuse system and user vtime accounting functions on task switch
  ia64: Consolidate user vtime accounting
  vtime: Consolidate system/idle context detection
  cputime: Use a proper subsystem naming for vtime related APIs
  sched: cpu_power: enable ARCH_POWER
  sched/nohz: Clean up select_nohz_load_balancer()
  sched: Fix load avg vs. cpu-hotplug
  sched: Remove __ARCH_WANT_INTERRUPTS_ON_CTXSW
  sched: Fix nohz_idle_balance()
  sched: Remove useless code in yield_to()
  sched: Add time unit suffix to sched sysctl knobs
  sched/debug: Limit sd->*_idx range on sysctl
  sched: Remove AFFINE_WAKEUPS feature flag
  s390: Remove leftover account_tick_vtime() header
  cputime: Consolidate vtime handling on context switch
  sched: Move cputime code to its own file
  cputime: Generalize CONFIG_VIRT_CPU_ACCOUNTING
  tile: Remove SD_PREFER_LOCAL leftover
  ...
2012-10-01 10:43:39 -07:00
Linus Torvalds 7e92daaefa Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf update from Ingo Molnar:
 "Lots of changes in this cycle as well, with hundreds of commits from
  over 30 contributors.  Most of the activity was on the tooling side.

  Higher level changes:

   - New 'perf kvm' analysis tool, from Xiao Guangrong.

   - New 'perf trace' system-wide tracing tool

   - uprobes fixes + cleanups from Oleg Nesterov.

   - Lots of patches to make perf build on Android out of box, from
     Irina Tirdea

   - Extend ftrace function tracing utility to be more dynamic for its
     users.  It allows for data passing to the callback functions, as
     well as reading regs as if a breakpoint were to trigger at function
     entry.

     The main goal of this patch series was to allow kprobes to use
     ftrace as an optimized probe point when a probe is placed on an
     ftrace nop.  With lots of help from Masami Hiramatsu, and going
     through lots of iterations, we finally came up with a good
     solution.

   - Add cpumask for uncore pmu, use it in 'stat', from Yan, Zheng.

   - Various tracing updates from Steve Rostedt

   - Clean up and improve 'perf sched' performance by elliminating lots
     of needless calls to libtraceevent.

   - Event group parsing support, from Jiri Olsa

   - UI/gtk refactorings and improvements from Namhyung Kim

   - Add support for non-tracepoint events in perf script python, from
     Feng Tang

   - Add --symbols to 'script', similar to the one in 'report', from
     Feng Tang.

  Infrastructure enhancements and fixes:

   - Convert the trace builtins to use the growing evsel/evlist
     tracepoint infrastructure, removing several open coded constructs
     like switch like series of strcmp to dispatch events, etc.
     Basically what had already been showcased in 'perf sched'.

   - Add evsel constructor for tracepoints, that uses libtraceevent just
     to parse the /format events file, use it in a new 'perf test' to
     make sure the libtraceevent format parsing regressions can be more
     readily caught.

   - Some strange errors were happening in some builds, but not on the
     next, reported by several people, problem was some parser related
     files, generated during the build, didn't had proper make deps, fix
     from Eric Sandeen.

   - Introduce struct and cache information about the environment where
     a perf.data file was captured, from Namhyung Kim.

   - Fix handling of unresolved samples when --symbols is used in
     'report', from Feng Tang.

   - Add union member access support to 'probe', from Hyeoncheol Lee.

   - Fixups to die() removal, from Namhyung Kim.

   - Render fixes for the TUI, from Namhyung Kim.

   - Don't enable annotation in non symbolic view, from Namhyung Kim.

   - Fix pipe mode in 'report', from Namhyung Kim.

   - Move related stats code from stat to util/, will be used by the
     'stat' kvm tool, from Xiao Guangrong.

   - Remove die()/exit() calls from several tools.

   - Resolve vdso callchains, from Jiri Olsa

   - Don't pass const char pointers to basename, so that we can
     unconditionally use libgen.h and thus avoid ifdef BIONIC lines,
     from David Ahern

   - Refactor hist formatting so that it can be reused with the GTK
     browser, From Namhyung Kim

   - Fix build for another rbtree.c change, from Adrian Hunter.

   - Make 'perf diff' command work with evsel hists, from Jiri Olsa.

   - Use the only field_sep var that is set up: symbol_conf.field_sep,
     fix from Jiri Olsa.

   - .gitignore compiled python binaries, from Namhyung Kim.

   - Get rid of die() in more libtraceevent places, from Namhyung Kim.

   - Rename libtraceevent 'private' struct member to 'priv' so that it
     works in C++, from Steven Rostedt

   - Remove lots of exit()/die() calls from tools so that the main perf
     exit routine can take place, from David Ahern

   - Fix x86 build on x86-64, from David Ahern.

   - {int,str,rb}list fixes from Suzuki K Poulose

   - perf.data header fixes from Namhyung Kim

   - Allow user to indicate objdump path, needed in cross environments,
     from Maciek Borzecki

   - Fix hardware cache event name generation, fix from Jiri Olsa

   - Add round trip test for sw, hw and cache event names, catching the
     problem Jiri fixed, after Jiri's patch, the test passes
     successfully.

   - Clean target should do clean for lib/traceevent too, fix from David
     Ahern

   - Check the right variable for allocation failure, fix from Namhyung
     Kim

   - Set up evsel->tp_format regardless of evsel->name being set
     already, fix from Namhyung Kim

   - Oprofile fixes from Robert Richter.

   - Remove perf_event_attr needless version inflation, from Jiri Olsa

   - Introduce libtraceevent strerror like error reporting facility,
     from Namhyung Kim

   - Add pmu mappings to perf.data header and use event names from cmd
     line, from Robert Richter

   - Fix include order for bison/flex-generated C files, from Ben
     Hutchings

   - Build fixes and documentation corrections from David Ahern

   - Assorted cleanups from Robert Richter

   - Let O= makes handle relative paths, from Steven Rostedt

   - perf script python fixes, from Feng Tang.

   - Initial bash completion support, from Frederic Weisbecker

   - Allow building without libelf, from Namhyung Kim.

   - Support DWARF CFI based unwind to have callchains when %bp based
     unwinding is not possible, from Jiri Olsa.

   - Symbol resolution fixes, while fixing support PPC64 files with an
     .opt ELF section was the end goal, several fixes for code that
     handles all architectures and cleanups are included, from Cody
     Schafer.

   - Assorted fixes for Documentation and build in 32 bit, from Robert
     Richter

   - Cache the libtraceevent event_format associated to each evsel
     early, so that we avoid relookups, i.e.  calling pevent_find_event
     repeatedly when processing tracepoint events.

     [ This is to reduce the surface contact with libtraceevents and
        make clear what is that the perf tools needs from that lib: so
        far parsing the common and per event fields.  ]

   - Don't stop the build if the audit libraries are not installed, fix
     from Namhyung Kim.

   - Fix bfd.h/libbfd detection with recent binutils, from Markus
     Trippelsdorf.

   - Improve warning message when libunwind devel packages not present,
     from Jiri Olsa"

* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (282 commits)
  perf trace: Add aliases for some syscalls
  perf probe: Print an enum type variable in "enum variable-name" format when showing accessible variables
  perf tools: Check libaudit availability for perf-trace builtin
  perf hists: Add missing period_* fields when collapsing a hist entry
  perf trace: New tool
  perf evsel: Export the event_format constructor
  perf evsel: Introduce rawptr() method
  perf tools: Use perf_evsel__newtp in the event parser
  perf evsel: The tracepoint constructor should store sys:name
  perf evlist: Introduce set_filter() method
  perf evlist: Renane set_filters method to apply_filters
  perf test: Add test to check we correctly parse and match syscall open parms
  perf evsel: Handle endianity in intval method
  perf evsel: Know if byte swap is needed
  perf tools: Allow handling a NULL cpu_map as meaning "all cpus"
  perf evsel: Improve tracepoint constructor setup
  tools lib traceevent: Fix error path on pevent_parse_event
  perf test: Fix build failure
  trace: Move trace event enable from fs_initcall to core_initcall
  tracing: Add an option for disabling markers
  ...
2012-10-01 10:28:49 -07:00
Linus Torvalds 7a68294278 Merge branch 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull trivial irq core update from Ingo Molnar:
 "Two symbol exports for modular irq-chip drivers"

* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  genirq: Export dummy_irq_chip
  genirq: Export irq_set_chip_and_handler_name()
2012-10-01 10:28:09 -07:00
Linus Torvalds 627312b9a8 Merge branch 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core locking changes from Ingo Molnar:
 "It includes a lockdep improvement plus a spinlock inlining Kconfig
  cleanup."

* 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  locking: Adjust spin lock inlining Kconfig options
  lockdep: Check if nested lock is actually held
2012-10-01 10:27:18 -07:00
Linus Torvalds 94095a1fff Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core kernel fixes from Ingo Molnar:
 "This is a complex task_work series from Oleg that fixes the bug that
  this VFS commit tried to fix:

    d35abdb288 hold task_lock around checks in keyctl

  but solves the problem without the lockup regression that d35abdb288
  introduced in v3.6.

  This series came late in v3.6 and I did not feel confident about it so
  late in the cycle.  Might be worth backporting to -stable if it proves
  itself upstream."

* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  task_work: Simplify the usage in ptrace_notify() and get_signal_to_deliver()
  task_work: Revert "hold task_lock around checks in keyctl"
  task_work: task_work_add() should not succeed after exit_task_work()
  task_work: Make task_work_add() lockless
2012-10-01 10:25:54 -07:00
Al Viro 16a8016372 sanitize tsk_is_polling()
Make default just return 0.  The current default (checking
TIF_POLLING_NRFLAG) is taken to architectures that need it;
ones that don't do polling in their idle threads don't need
to defined TIF_POLLING_NRFLAG at all.

ia64 defined both TS_POLLING (used by its tsk_is_polling())
and TIF_POLLING_NRFLAG (not used at all).  Killed the latter...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-10-01 09:58:13 -04:00
Al Viro 2aa3a7f866 preparation for generic kernel_thread()
Let architectures select GENERIC_KERNEL_THREAD and have their copy_thread()
treat NULL regs as "it came from kernel_thread(), sp argument contains
the function new thread will be calling and stack_size - the argument for
that function".  Switching the architectures begins shortly...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-09-30 13:35:55 -04:00
Oleg Nesterov ec75fba93e uprobes: Simplify is_swbp_at_addr(), remove stale comments
After the previous change is_swbp_at_addr() is always called with
current->mm. Remove this check and move it close to its single caller.

Also, remove the obsolete comment about is_swbp_at_addr() and
uprobe_state.count.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2012-09-29 21:21:54 +02:00
Oleg Nesterov ed6f6a50dc uprobes: Kill set_orig_insn()->is_swbp_at_addr()
Unlike set_swbp(), set_orig_insn()->is_swbp_at_addr() makes sense,
although it can't prevent all confusions.

But the usage of is_swbp_at_addr() is equally confusing, and it adds
the extra get_user_pages() we can avoid.

This patch removes set_orig_insn()->is_swbp_at_addr() but changes
write_opcode() to do the necessary checks before replace_page().

Perhaps it also makes sense to ensure PAGE_MAPPING_ANON in unregister
case.

find_active_uprobe() becomes the only user of is_swbp_at_addr(),
we can change its semantics.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2012-09-29 21:21:54 +02:00
Oleg Nesterov cceb55aab7 uprobes: Introduce copy_opcode(), kill read_opcode()
No functional changes, preparations.

1. Extract the kmap-and-memcpy code from read_opcode() into the
   new trivial helper, copy_opcode(). The next patch will add
   another user.

2. read_opcode() becomes really trivial, fold it into its single
   caller, is_swbp_at_addr().

3. Remove "auprobe" argument from write_opcode(), it is not used
   since f403072c6.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2012-09-29 21:21:54 +02:00
Oleg Nesterov e97f65a17d uprobes: Kill set_swbp()->is_swbp_at_addr()
A separate patch for better documentation.

set_swbp()->is_swbp_at_addr() is not needed for correctness, it is
harmless to do the unnecessary __replace_page(old_page, new_page)
when these 2 pages are identical.

And it can not be counted as optimization. mmap/register races are
very unlikely, while in the likely case is_swbp_at_addr() adds the
extra get_user_pages() even if the caller is uprobe_mmap(current->mm)
and returns false.

Note also that the semantics/usage of is_swbp_at_addr() in uprobe.c
is confusing. set_swbp() uses it to detect the case when this insn
was already modified by uprobes, that is why it should always compare
the opcode with UPROBE_SWBP_INSN even if the hardware (like powerpc)
has other trap insns. It doesn't matter if this breakpoint was in fact
installed by gdb or application itself, we are going to "steal" this
breakpoint anyway and execute the original insn from vm_file even if
it no longer matches the memory.

OTOH, handle_swbp()->find_active_uprobe() uses is_swbp_at_addr() to
figure out whether we need to send SIGTRAP or not if we can not find
uprobe, so in this case it should return true for all trap variants,
not only for UPROBE_SWBP_INSN.

This patch removes set_swbp()->is_swbp_at_addr(), the next patches
will remove it from set_orig_insn() which is similar to set_swbp()
in this respect. So the only caller will be handle_swbp() and we
can make its semantics clear.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2012-09-29 21:21:54 +02:00
Oleg Nesterov e40cfce626 uprobes: Restrict valid_vma(false) to skip VM_SHARED vmas
valid_vma(false) ignores ->vm_flags, this is not actually right.
We should never try to write into MAP_SHARED mapping, this can
confuse an apllication which actually writes to ->vm_file.

With this patch valid_vma(false) ignores VM_WRITE only but checks
other (immutable) bits checked by valid_vma(true). This can also
speedup uprobe_munmap() and uprobe_unregister().

Note: even after this patch _unregister can confuse the probed
application if it does mprotect(PROT_WRITE) after _register and
installs "int3", but this is hardly possible to avoid and this
doesn't differ from gdb case.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2012-09-29 21:21:54 +02:00
Oleg Nesterov 78a320542e uprobes: Change valid_vma() to demand VM_MAYEXEC rather than VM_EXEC
uprobe_register() or uprobe_mmap() requires VM_READ | VM_EXEC, this
is not right. An apllication can do mprotect(PROT_EXEC) later and
execute this code.

Change valid_vma(is_register => true) to check VM_MAYEXEC instead.
No need to check VM_MAYREAD, it is always set.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2012-09-29 21:21:53 +02:00
Oleg Nesterov 75ed82ea53 uprobes: Change write_opcode() to use FOLL_FORCE
write_opcode()->get_user_pages() needs FOLL_FORCE to ensure we can
read the page even if the probed task did mprotect(PROT_NONE) after
uprobe_register(). Without FOLL_WRITE, FOLL_FORCE doesn't have any
side effect but allows to read the !VM_READ memory.

Otherwiese the subsequent uprobe_unregister()->set_orig_insn() fails
and we leak "int3". If that task does mprotect(PROT_READ | EXEC) and
execute the probed insn later it will be killed.

Note: in fact this is also needed for _register, see the next patch.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2012-09-29 21:21:53 +02:00
Oleg Nesterov db023ea595 uprobes: Move clear_thread_flag(TIF_UPROBE) to uprobe_notify_resume()
Move clear_thread_flag(TIF_UPROBE) from do_notify_resume() to
uprobe_notify_resume() for !CONFIG_UPROBES case.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2012-09-29 21:21:53 +02:00
Oleg Nesterov 1b08e90721 uprobes: Kill UTASK_BP_HIT state
Kill UTASK_BP_HIT state, it buys nothing but complicates the code.
It is only used in uprobe_notify_resume() to decide who should be
called, we can check utask->active_uprobe != NULL instead. And this
allows us to simplify handle_swbp(), no need to clear utask->state.

Likewise we could kill UTASK_SSTEP, but UTASK_BP_HIT is worse and
imho should die. The problem is, it creates the special case when
task->utask is NULL, we can't distinguish RUNNING and BP_HIT. With
this patch utask == NULL always means RUNNING.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2012-09-29 21:21:53 +02:00
Oleg Nesterov 0578a97098 uprobes: Fix UPROBE_SKIP_SSTEP checks in handle_swbp()
If handle_swbp()->add_utask() fails but UPROBE_SKIP_SSTEP is set,
cleanup_ret: path do not restart the insn, this is wrong. Remove
this check and add the additional label for can_skip_sstep() = T
case.

Note also that UPROBE_SKIP_SSTEP can be false positive, we simply
can not trust it unless arch_uprobe_skip_sstep() was already called.

Also, move another UPROBE_SKIP_SSTEP check before can_skip_sstep()
into this helper, this looks more clean and understandable.

Note: probably we should rename "skip" to "emulate" and I think
that "clear UPROBE_SKIP_SSTEP" should be moved to arch_can_skip.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2012-09-29 21:21:52 +02:00
Oleg Nesterov 746a9e6ba2 uprobes: Do not setup ->active_uprobe/state prematurely
handle_swbp() sets utask->active_uprobe before handler_chain(),
and UTASK_SSTEP before pre_ssout(). This complicates the code
for no reason,  arch_ hooks or consumer->handler() should not
(and can't) use this info.

Change handle_swbp() to initialize them after pre_ssout(), and
remove the no longer needed cleanup-utask code.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
cked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2012-09-29 21:21:52 +02:00
Oleg Nesterov 79d54b249c uprobes: Do not leak UTASK_BP_HIT if find_active_uprobe() fails
If handle_swbp()->find_active_uprobe() fails we return with
utask->state = UTASK_BP_HIT.

Change handle_swbp() to reset utask->state at the start. Note
that we do this unconditionally, see the next patch(es).

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2012-09-29 21:21:52 +02:00
David S. Miller 6a06e5e1bb Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/team/team.c
	drivers/net/usb/qmi_wwan.c
	net/batman-adv/bat_iv_ogm.c
	net/ipv4/fib_frontend.c
	net/ipv4/route.c
	net/l2tp/l2tp_netlink.c

The team, fib_frontend, route, and l2tp_netlink conflicts were simply
overlapping changes.

qmi_wwan and bat_iv_ogm were of the "use HEAD" variety.

With help from Antonio Quartulli.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-28 14:40:49 -04:00
Masami Hiramatsu d55cb6cf14 ftrace: Allow stealing pages from pipe buffer
Use generic steal operation on pipe buffer to allow stealing
ring buffer's read page from pipe buffer.

Note that this could reduce the performance of splice on the
splice_write side operation without affinity setting.
Since the ring buffer's read pages are allocated on the
tracing-node, but the splice user does not always execute
splice write side operation on the same node. In this case,
the page will be accessed from the another node.
Thus, it is strongly recommended to assign the splicing
thread to corresponding node.

Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-09-28 15:05:12 +09:30
Rusty Russell 9bb9c3be56 module: wait when loading a module which is currently initializing.
The original module-init-tools module loader used a fnctl lock on the
.ko file to avoid attempts to simultaneously load a module.
Unfortunately, you can't get an exclusive fcntl lock on a read-only
fd, making this not work for read-only mounted filesystems.
module-init-tools has a hacky sleep-and-loop for this now.

It's not that hard to wait in the kernel, and only return -EEXIST once
the first module has finished loading (or continue loading the module
if the first one failed to initialize for some reason).  It's also
consistent with what we do for dependent modules which are still loading.

Suggested-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-09-28 14:31:03 +09:30
Rusty Russell 6f13909f4f module: fix symbol waiting when module fails before init
We use resolve_symbol_wait(), which blocks if the module containing
the symbol is still loading.  However:

1) The module_wq we use is only woken after calling the modules' init
   function, but there are other failure paths after the module is
   placed in the linked list where we need to do the same thing.

2) wake_up() only wakes one waiter, and our waitqueue is shared by all
   modules, so we need to wake them all.

3) wake_up_all() doesn't imply a memory barrier: I feel happier calling
   it after we've grabbed and dropped the module_mutex, not just after
   the state assignment.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-09-28 14:31:03 +09:30
David Howells 786d35d45c Make most arch asm/module.h files use asm-generic/module.h
Use the mapping of Elf_[SPE]hdr, Elf_Addr, Elf_Sym, Elf_Dyn, Elf_Rel/Rela,
ELF_R_TYPE() and ELF_R_SYM() to either the 32-bit version or the 64-bit version
into asm-generic/module.h for all arches bar MIPS.

Also, use the generic definition mod_arch_specific where possible.

To this end, I've defined three new config bools:

 (*) HAVE_MOD_ARCH_SPECIFIC

     Arches define this if they don't want to use the empty generic
     mod_arch_specific struct.

 (*) MODULES_USE_ELF_RELA

     Arches define this if their modules can contain RELA records.  This causes
     the Elf_Rela mapping to be emitted and allows apply_relocate_add() to be
     defined by the arch rather than have the core emit an error message.

 (*) MODULES_USE_ELF_REL

     Arches define this if their modules can contain REL records.  This causes
     the Elf_Rel mapping to be emitted and allows apply_relocate() to be
     defined by the arch rather than have the core emit an error message.

Note that it is possible to allow both REL and RELA records: m68k and mips are
two arches that do this.

With this, some arch asm/module.h files can be deleted entirely and replaced
with a generic-y marker in the arch Kbuild file.

Additionally, I have removed the bits from m32r and score that handle the
unsupported type of relocation record as that's now handled centrally.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-09-28 14:31:03 +09:30
Matthew Garrett c99af3752b module: taint kernel when lve module is loaded
Cloudlinux have a product called lve that includes a kernel module. This
was previously GPLed but is now under a proprietary license, but the
module continues to declare MODULE_LICENSE("GPL") and makes use of some
EXPORT_SYMBOL_GPL symbols. Forcibly taint it in order to avoid this.

Signed-off-by: Matthew Garrett <mjg59@srcf.ucam.org>
Cc: Alex Lyashkov <umka@cloudlinux.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: stable@kernel.org
2012-09-28 14:31:02 +09:30
James Morris bf53083445 Linux 3.6-rc7
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.18 (GNU/Linux)
 
 iQEcBAABAgAGBQJQX7MuAAoJEHm+PkMAQRiG0h0IAJURkrMCAQUxA+Ik66ReH89s
 LQcVd0U9uL4UUOi7f5WR64Vf9Cfu6VVGX9ZKSvjpNskvlQaUQPMIt4pMe6g4X4dI
 u0bApEy4XZz3nGabUAghIU8jJ8cDmhCG6kPpSiS7pi7KHc0yIa4WFtJRrIpGaIWT
 xuK38YOiOHcSDRlLyWZzainMncQp/ixJdxnqVMTonkVLk0q0b84XzOr4/qlLE5lU
 i+TsK3PRKdQXgvZ4CebL+srPBwWX1dmgP3VkeBloQbSSenSeELICbFWavn2ml+sF
 GXi4dO93oNquL/Oy5SwI666T4uNcrRPaS+5X+xSZgBW/y2aQVJVJuNZg6ZP/uWk=
 =0v2l
 -----END PGP SIGNATURE-----

Merge tag 'v3.6-rc7' into next

Linux 3.6-rc7

Requested by David Howells so he can merge his key susbsystem work into
my tree with requisite -linus changesets.
2012-09-28 13:37:32 +10:00
Al Viro 2903ff019b switch simple cases of fget_light to fdget
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-09-26 22:20:08 -04:00
Al Viro e10ce27f0d switch prctl_set_mm_exe_file() to fget_light()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-09-26 21:10:12 -04:00
Al Viro 864bdb3b6c new helper: daemonize_descriptors()
descriptor-related parts of daemonize, done right.  As the
result we simplify the locking rules for ->files - we
hold task_lock in *all* cases when we modify ->files.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-09-26 21:10:00 -04:00
Al Viro 7cf4dc3c8d move files_struct-related bits from kernel/exit.c to fs/file.c
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-09-26 21:08:54 -04:00
Al Viro ab72a7028c events: don't use get_unused_fd_flags() when get_unused_fd() will do
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-09-26 21:08:52 -04:00
Anton Vorontsov ad394f66fa kdb: Implement disable_nmi command
This command disables NMI-entry. If NMI source has been previously shared
with a serial console ("debug port"), this effectively releases the port
from KDB exclusive use, and makes the console available for normal use.

Of course, NMI can be reenabled, enable_nmi modparam is used for that:

	echo 1 > /sys/module/kdb/parameters/enable_nmi

Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
Acked-by: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-09-26 13:42:25 -07:00
Anton Vorontsov 5a14fead07 kernel/debug: Mask KGDB NMI upon entry
The new arch callback should manage NMIs that usually cause KGDB to
enter. That is, not all NMIs should be enabled/disabled, but only
those that issue kgdb_handle_exception().

We must mask it as serial-line interrupt can be used as an NMI, so
if the original KGDB-entry cause was say a breakpoint, then every
input to KDB console will cause KGDB to reenter, which we don't want.

Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
Acked-by: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-09-26 13:42:25 -07:00
Paul E. McKenney cb349ca954 rcu: Apply micro-optimization and int/bool fixes to RCU's idle handling
Checking "user" before "is_idle_task()" allows better optimizations
in cases where inlining is possible.  Also, "bool" should be passed
"true" or "false" rather than "1" or "0".  This commit therefore makes
these changes, as noted in Josh's review.

Reported-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-09-26 15:47:18 +02:00
Frederic Weisbecker 1fd2b4425a rcu: Userspace RCU extended QS selftest
Provide a config option that enables the userspace
RCU extended quiescent state on every CPUs by default.

This is for testing purpose.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-09-26 15:47:16 +02:00
Frederic Weisbecker 20ab65e33f rcu: Exit RCU extended QS on user preemption
When exceptions or irq are about to resume userspace, if
the task needs to be rescheduled, the arch low level code
calls schedule() directly.

If we call it, it is because we have the TIF_RESCHED flag:

- It can be set after random local calls to set_need_resched()
(RCU, drm, ...)

- A wake up happened and the CPU needs preemption. This can
  happen in several ways:

    * Remotely: the remote waking CPU has set TIF_RESCHED and send the
      wakee an IPI to schedule the new task.
    * Remotely enqueued: the remote waking CPU sends an IPI to the target
      and the wake up is made by the target.
    * Locally: waking CPU == wakee CPU and the wakeup is done locally.
      set_need_resched() is called without IPI.

In the case of local and remotely enqueued wake ups, the tick can
be restarted when we enqueue the new task and RCU can exit the
extended quiescent state at the same time. Then by the time we reach
irq exit path and we call schedule, we are not in RCU user mode.

But if we call schedule() only because something called set_need_resched(),
RCU may still be in user mode when we reach schedule.

Also if a wake up is done remotely, the CPU might see the TIF_RESCHED
flag and call schedule while the IPI has not yet happen to restart the
tick and exit RCU user mode.

We need to manually protect against these corner cases.

Create a new API schedule_user() that calls schedule() inside
rcu_user_exit()-rcu_user_enter() in order to protect it. Archs
will need to rely on it now to implement user preemption safely.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-09-26 15:47:11 +02:00
Frederic Weisbecker 90a340ed53 rcu: Exit RCU extended QS on kernel preemption after irq/exception
When an exception or an irq exits, and we are going to resume into
interrupted kernel code, the low level architecture code calls
preempt_schedule_irq() if there is a need to reschedule.

If the interrupt/exception occured between a call to rcu_user_enter()
(from syscall exit, exception exit, do_notify_resume exit, ...) and
a real resume to userspace (iret,...), preempt_schedule_irq() can be
called whereas RCU thinks we are in userspace. But preempt_schedule_irq()
is going to run kernel code and may be some RCU read side critical
section. We must exit the userspace extended quiescent state before
we call it.

To solve this, just call rcu_user_exit() in the beginning of
preempt_schedule_irq().

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-09-26 15:47:09 +02:00
Frederic Weisbecker 04e7e95153 rcu: Switch task's syscall hooks on context switch
Clear the syscalls hook of a task when it's scheduled out so that if
the task migrates, it doesn't run the syscall slow path on a CPU
that might not need it.

Also set the syscalls hook on the next task if needed.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-09-26 15:47:02 +02:00
Frederic Weisbecker 1e1a689f10 rcu: Ignore userspace extended quiescent state by default
By default we don't want to enter into RCU extended quiescent
state while in userspace because doing this produces some overhead
(eg: use of syscall slowpath). Set it off by default and ready to
run when some feature like adaptive tickless need it.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-09-26 15:47:01 +02:00
Frederic Weisbecker c5d900bf67 rcu: Allow rcu_user_enter()/exit() to nest
Allow calls to rcu_user_enter() even if we are already
in userspace (as seen by RCU) and allow calls to rcu_user_exit()
even if we are already in the kernel.

This makes the APIs more flexible to be called from architectures.
Exception entries for example won't need to know if they come from
userspace before calling rcu_user_exit().

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-09-26 15:46:55 +02:00
Frederic Weisbecker 2b1d5024e1 rcu: Settle config for userspace extended quiescent state
Create a new config option under the RCU menu that put
CPUs under RCU extended quiescent state (as in dynticks
idle mode) when they run in userspace. This require
some contribution from architectures to hook into kernel
and userspace boundaries.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-09-26 15:44:04 +02:00
Paul E. McKenney 9a0c6fef42 rcu: Make RCU_FAST_NO_HZ handle adaptive ticks
The current implementation of RCU_FAST_NO_HZ tries reasonably hard to rid
the current CPU of RCU callbacks.  This is appropriate when the CPU is
entering idle, where it doesn't have much useful to do anyway, but is most
definitely not what you want when transitioning to user-mode execution.
This commit therefore detects the adaptive-tick case, and refrains from
burning CPU time getting rid of RCU callbacks in that case.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-09-26 15:44:02 +02:00
Frederic Weisbecker 19dd1591fc rcu: New rcu_user_enter_after_irq() and rcu_user_exit_after_irq() APIs
In some cases, it is necessary to enter or exit userspace-RCU-idle mode
from an interrupt handler, for example, if some other CPU sends this
CPU a resched IPI.  In this case, the current CPU would enter the IPI
handler in userspace-RCU-idle mode, but would need to exit the IPI handler
after having exited that mode.

To allow this to work, this commit adds two new APIs to TREE_RCU:

- rcu_user_enter_after_irq(). This must be called from an interrupt between
rcu_irq_enter() and rcu_irq_exit().  After the irq calls rcu_irq_exit(),
the irq handler will return into an RCU extended quiescent state.
In theory, this interrupt is never a nested interrupt, but in practice
it might interrupt softirq, which looks to RCU like a nested interrupt.

- rcu_user_exit_after_irq(). This must be called from a non-nesting
interrupt, interrupting an RCU extended quiescent state, also
between rcu_irq_enter() and rcu_irq_exit(). After the irq calls
rcu_irq_exit(), the irq handler will return in an RCU non-quiescent
state.

[ Combined with "Allow calls to rcu_exit_user_irq from nesting irqs." ]

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-09-26 15:44:01 +02:00
Frederic Weisbecker adf5091e6c rcu: New rcu_user_enter() and rcu_user_exit() APIs
RCU currently insists that only idle tasks can enter RCU idle mode, which
prohibits an adaptive tickless kernel (AKA nohz cpusets), which in turn
would mean that usermode execution would always take scheduling-clock
interrupts, even when there is only one task runnable on the CPU in
question.

This commit therefore adds rcu_user_enter() and rcu_user_exit(), which
allow non-idle tasks to enter RCU idle mode.  These are quite similar
to rcu_idle_enter() and rcu_idle_exit(), respectively, except that they
omit the idle-task checks.

[ Updated to use "user" flag rather than separate check functions. ]

[ paulmck: Updated to drop exports of new functions based on Josh's patch
  getting rid of the need for them. ]

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-09-26 15:43:50 +02:00
Paul E. McKenney 593d1006cd Merge remote-tracking branch 'tip/core/rcu' into next.2012.09.25b
Resolved conflict in kernel/sched/core.c using Peter Zijlstra's
approach from https://lkml.org/lkml/2012/9/5/585.
2012-09-25 10:03:56 -07:00
Paul E. McKenney 5217192b85 Merge remote-tracking branch 'tip/smp/hotplug' into next.2012.09.25b
The conflicts between kernel/rcutree.h and kernel/rcutree_plugin.h
were due to adjacent insertions and deletions, which were resolved
by simply accepting the changes on both branches.
2012-09-25 10:01:45 -07:00
Frederic Weisbecker a7e1a9e3af vtime: Consolidate system/idle context detection
Move the code that finds out to which context we account the
cputime into generic layer.

Archs that consider the whole time spent in the idle task as idle
time (ia64, powerpc) can rely on the generic vtime_account()
and implement vtime_account_system() and vtime_account_idle(),
letting the generic code to decide when to call which API.

Archs that have their own meaning of idle time, such as s390
that only considers the time spent in CPU low power mode as idle
time, can just override vtime_account().

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
2012-09-25 15:42:37 +02:00
Frederic Weisbecker bf9fae9f5e cputime: Use a proper subsystem naming for vtime related APIs
Use a naming based on vtime as a prefix for virtual based
cputime accounting APIs:

- account_system_vtime() -> vtime_account()
- account_switch_vtime() -> vtime_task_switch()

It makes it easier to allow for further declension such
as vtime_account_system(), vtime_account_idle(), ... if we
want to find out the context we account to from generic code.

This also make it better to know on which subsystem these APIs
refer to.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
2012-09-25 15:31:31 +02:00
Paul E. McKenney bda4ec9f6a Merge branches 'bigrt.2012.09.23a', 'doctorture.2012.09.23a', 'fixes.2012.09.23a', 'hotplug.2012.09.23a' and 'idlechop.2012.09.23a' into HEAD
bigrt.2012.09.23a contains additional commits to reduce scheduling latency
	from RCU on huge systems (many hundrends or thousands of CPUs).

doctorture.2012.09.23a contains documentation changes and rcutorture fixes.

fixes.2012.09.23a contains miscellaneous fixes.

hotplug.2012.09.23a contains CPU-hotplug-related changes.

idle.2012.09.23a fixes architectures for which RCU no longer considered
	the idle loop to be a quiescent state due to earlier
	adaptive-dynticks changes.  Affected architectures are alpha,
	cris, frv, h8300, m32r, m68k, mn10300, parisc, score, xtensa,
	and ia64.
2012-09-24 20:02:22 -07:00
Eric Dumazet 5640f76858 net: use a per task frag allocator
We currently use a per socket order-0 page cache for tcp_sendmsg()
operations.

This page is used to build fragments for skbs.

Its done to increase probability of coalescing small write() into
single segments in skbs still in write queue (not yet sent)

But it wastes a lot of memory for applications handling many mostly
idle sockets, since each socket holds one page in sk->sk_sndmsg_page

Its also quite inefficient to build TSO 64KB packets, because we need
about 16 pages per skb on arches where PAGE_SIZE = 4096, so we hit
page allocator more than wanted.

This patch adds a per task frag allocator and uses bigger pages,
if available. An automatic fallback is done in case of memory pressure.

(up to 32768 bytes per frag, thats order-3 pages on x86)

This increases TCP stream performance by 20% on loopback device,
but also benefits on other network devices, since 8x less frags are
mapped on transmit and unmapped on tx completion. Alexander Duyck
mentioned a probable performance win on systems with IOMMU enabled.

Its possible some SG enabled hardware cant cope with bigger fragments,
but their ndo_start_xmit() should already handle this, splitting a
fragment in sub fragments, since some arches have PAGE_SIZE=65536

Successfully tested on various ethernet devices.
(ixgbe, igb, bnx2x, tg3, mellanox mlx4)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Cc: Vijay Subramanian <subramanian.vijay@gmail.com>
Cc: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Vijay Subramanian <subramanian.vijay@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-24 16:31:37 -04:00
Ezequiel Garcia 8781915ad2 trace: Move trace event enable from fs_initcall to core_initcall
This patch splits trace event initialization in two stages:
 * ftrace enable
 * sysfs event entry creation

This allows to capture trace events from an earlier point
by using 'trace_event' kernel parameter and is important
to trace boot-up allocations.

Note that, in order to enable events at core_initcall,
it's necessary to move init_ftrace_syscalls() from
core_initcall to early_initcall.

Link: http://lkml.kernel.org/r/1347461277-25302-1-git-send-email-elezegarcia@gmail.com

Signed-off-by: Ezequiel Garcia <elezegarcia@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-09-24 14:13:02 -04:00
Mandeep Singh Baines 5224c3a315 tracing: Add an option for disabling markers
In our application, we have trace markers spread through user-space.
We have markers in GL, X, etc. These are super handy for Chrome's
about:tracing feature (Chrome + system + kernel trace view), but
can be very distracting when you're trying to debug a kernel issue.

I normally, use "grep -v tracing_mark_write" but it would be nice
if I could just temporarily disable markers all together.

Link: http://lkml.kernel.org/r/1347066739-26285-1-git-send-email-msb@chromium.org

CC: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Mandeep Singh Baines <msb@chromium.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-09-24 14:10:44 -04:00
John Stultz 92bb1fcf57 time: Only do nanosecond rounding on GENERIC_TIME_VSYSCALL_OLD systems
We only do rounding to the next nanosecond so we don't see minor
1ns inconsistencies in the vsyscall implementations. Since we're
changing the vsyscall implementations to avoid this, conditionalize
the rounding only to the GENERIC_TIME_VSYSCALL_OLD architectures.

Cc: Tony Luck <tony.luck@intel.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2012-09-24 12:38:08 -04:00
John Stultz 576094b7f0 time: Introduce new GENERIC_TIME_VSYSCALL
Now that we moved everyone over to GENERIC_TIME_VSYSCALL_OLD,
introduce the new declaration and config option for the new
update_vsyscall method.

Cc: Tony Luck <tony.luck@intel.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2012-09-24 12:38:08 -04:00
John Stultz 7063942116 time: Convert CONFIG_GENERIC_TIME_VSYSCALL to CONFIG_GENERIC_TIME_VSYSCALL_OLD
To help migrate archtectures over to the new update_vsyscall method,
redfine CONFIG_GENERIC_TIME_VSYSCALL as CONFIG_GENERIC_TIME_VSYSCALL_OLD

Cc: Tony Luck <tony.luck@intel.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2012-09-24 12:38:07 -04:00
John Stultz 189374aed6 time: Move update_vsyscall definitions to timekeeper_internal.h
Since users will need to include timekeeper_internal.h, move
update_vsyscall definitions to timekeeper_internal.h.

Cc: Tony Luck <tony.luck@intel.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2012-09-24 12:38:06 -04:00
John Stultz d7b4202e05 time: Move timekeeper structure to timekeeper_internal.h for vsyscall changes
We're going to need to access the timekeeper in update_vsyscall,
so make the structure available for those who need it.

Cc: Tony Luck <tony.luck@intel.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2012-09-24 12:38:05 -04:00
John Stultz b3c869d35b jiffies: Remove compile time assumptions about CLOCK_TICK_RATE
CLOCK_TICK_RATE is used to accurately caclulate exactly how
a tick will be at a given HZ.

This is useful, because while we'd expect NSEC_PER_SEC/HZ,
the underlying hardware will have some granularity limit,
so we won't be able to have exactly HZ ticks per second.

This slight error can cause timekeeping quality problems
when using the jiffies or other jiffies driven clocksources.
Thus we currently use compile time CLOCK_TICK_RATE value to
generate SHIFTED_HZ and NSEC_PER_JIFFIES, which we then use
to adjust the jiffies clocksource to correct this error.

Unfortunately though, since CLOCK_TICK_RATE is a compile
time value, and the jiffies clocksource is registered very
early during boot, there are a number of cases where there
are different possible hardware timers that have different
tick rates. This causes problems in cases like ARM where
there are numerous different types of hardware, each having
their own compile-time CLOCK_TICK_RATE, making it hard to
accurately support different hardware with a single kernel.

For the most part, this doesn't matter all that much, as not
too many systems actually utilize the jiffies or jiffies driven
clocksource. Usually there are other highres clocksources
who's granularity error is negligable.

Even so, we have some complicated calcualtions that we do
everywhere to handle these edge cases.

This patch removes the compile time SHIFTED_HZ value, and
introduces a register_refined_jiffies() function. This results
in the default jiffies clock as being assumed a perfect HZ
freq, and allows archtectures that care about jiffies accuracy
to call register_refined_jiffies() with the tick rate, specified
dynamically at boot.

This allows us, where necessary, to not have a compile time
CLOCK_TICK_RATE constant, simplifies the jiffies code, and
still provides a way to have an accurate jiffies clock.

NOTE: Since this patch does not add register_refinied_jiffies()
calls for every arch, it may cause time quality regressions
in some cases. Its likely these will not be noticable, but
if they are an issue, adding the following to the end of
setup_arch() should resolve the regression:
	register_refinied_jiffies(CLOCK_TICK_RATE)

Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2012-09-24 12:38:05 -04:00
John Stultz a65bcc12ad alarmtimer: Rename alarmtimer_remove to alarmtimer_dequeue
Now that alarmtimer_remove has been simplified, change
its name to _dequeue to better match its paired _enqueue
function.

Cc: Arve Hjønnevåg <arve@android.com>
Cc: Colin Cross <ccross@android.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2012-09-24 12:38:03 -04:00
John Stultz dae373be9f alarmtimer: Use hrtimer per-alarm instead of per-base
Arve Hjønnevåg reported numerous crashes from the
"BUG_ON(timer->state != HRTIMER_STATE_CALLBACK)" check
in __run_hrtimer after it called alarmtimer_fired.

It ends up the alarmtimer code was not properly handling
possible failures of hrtimer_try_to_cancel, and because
these faulres occur when the underlying base hrtimer is
being run, this limits the ability to properly handle
modifications to any alarmtimers on that base.

Because much of the logic duplicates the hrtimer logic,
it seems that we might as well have a per-alarmtimer
hrtimer, and avoid the extra complextity of trying to
multiplex many alarmtimers off of one hrtimer.

Thus this patch moves the hrtimer to the alarm structure
and simplifies the management logic.

Changelog:
v2:
* Includes a fix for double alarm_start calls found by
  Arve

Cc: Arve Hjønnevåg <arve@android.com>
Cc: Colin Cross <ccross@android.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reported-by: Arve Hjønnevåg <arve@android.com>
Tested-by: Arve Hjønnevåg <arve@android.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2012-09-24 12:38:02 -04:00
Todd Poynor 59a93c27c4 alarmtimer: Implement minimum alarm interval for allowing suspend
alarmtimer suspend return -EBUSY if the next alarm will fire in less
than 2 seconds.  This allows one RTC seconds tick to occur subsequent
to this check before the alarm wakeup time is set, ensuring the wakeup
time is still in the future (assuming the RTC does not tick one more
second prior to setting the alarm).

If suspend is rejected due to an imminent alarm, hold a wakeup source
for 2 seconds to process the alarm prior to reattempting suspend.

If setting the alarm incurs an -ETIME for an alarm set in the past,
or any other problem setting the alarm, abort suspend and hold a
wakelock for 1 second while the alarm is allowed to be serviced or
other hopefully transient conditions preventing the alarm clear up.

Signed-off-by: Todd Poynor <toddpoynor@google.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2012-09-24 12:38:01 -04:00
Peter Zijlstra 5d18023294 sched: Fix load avg vs cpu-hotplug
Rabik and Paul reported two different issues related to the same few
lines of code.

Rabik's issue is that the nr_uninterruptible migration code is wrong in
that he sees artifacts due to this (Rabik please do expand in more
detail).

Paul's issue is that this code as it stands relies on us using
stop_machine() for unplug, we all would like to remove this assumption
so that eventually we can remove this stop_machine() usage altogether.

The only reason we'd have to migrate nr_uninterruptible is so that we
could use for_each_online_cpu() loops in favour of
for_each_possible_cpu() loops, however since nr_uninterruptible() is the
only such loop and its using possible lets not bother at all.

The problem Rabik sees is (probably) caused by the fact that by
migrating nr_uninterruptible we screw rq->calc_load_active for both rqs
involved.

So don't bother with fancy migration schemes (meaning we now have to
keep using for_each_possible_cpu()) and instead fold any nr_active delta
after we migrate all tasks away to make sure we don't have any skewed
nr_active accounting.

[ paulmck: Move call to calc_load_migration to CPU_DEAD to avoid
miscounting noted by Rakib. ]

Reported-by: Rakib Mullick <rakib.mullick@gmail.com>
Reported-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
2012-09-23 07:43:56 -07:00
Paul E. McKenney 0d8ee37e2f rcu: Disallow callback registry on offline CPUs
Posting a callback after the CPU_DEAD notifier effectively leaks
that callback unless/until that CPU comes back online.  Silence is
unhelpful when attempting to track down such leaks, so this commit emits
a WARN_ON_ONCE() and unconditionally leaks the callback when an offline
CPU attempts to register a callback.  The rdp->nxttail[RCU_NEXT_TAIL] is
set to NULL in the CPU_DEAD notifier and restored in the CPU_UP_PREPARE
notifier, allowing _call_rcu() to determine exactly when posting callbacks
is illegal.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-09-23 07:43:55 -07:00
Paul E. McKenney 1331e7a1bb rcu: Remove _rcu_barrier() dependency on __stop_machine()
Currently, _rcu_barrier() relies on preempt_disable() to prevent
any CPU from going offline, which in turn depends on CPU hotplug's
use of __stop_machine().

This patch therefore makes _rcu_barrier() use get_online_cpus() to
block CPU-hotplug operations.  This has the added benefit of removing
the need for _rcu_barrier() to adopt callbacks:  Because CPU-hotplug
operations are excluded, there can be no callbacks to adopt.  This
commit simplifies the code accordingly.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-09-23 07:43:55 -07:00
Paul E. McKenney 86f343b50b rcu: Fix CONFIG_RCU_FAST_NO_HZ stall warning message
The print_cpu_stall_fast_no_hz() function attempts to print -1 when
the ->idle_gp_timer is not pending, but unsigned arithmetic causes it
to instead print ULONG_MAX, which is 4294967295 on 32-bit systems and
18446744073709551615 on 64-bit systems.  Neither of these are the most
reader-friendly values, so this commit instead causes "timer not pending"
to be printed when ->idle_gp_timer is not pending.

Reported-by: Paul Walmsley <paul@pwsan.com>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-09-23 07:42:52 -07:00
Li Zhong 22a767269a rcu: Move TINY_RCU quiescent state out of extended quiescent state
TINY_RCU's rcu_idle_enter_common() invokes rcu_sched_qs() in order
to inform the RCU core of the quiescent state implied by idle entry.
Of course, idle is also an extended quiescent state, so that the call
to rcu_sched_qs() speeds up RCU's invoking of any callbacks that might
be queued.  This speed-up is important when entering into dyntick-idle
mode -- if there are no further scheduling-clock interrupts, the callbacks
might never be invoked, which could result in a system hang.

However, processing callbacks does event tracing, which in turn
implies RCU read-side critical sections, which are illegal in extended
quiescent states.  This patch therefore moves the call to rcu_sched_qs()
so that it precedes the point at which we inform lockdep that RCU has
entered an extended quiescent state.

Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-09-23 07:42:52 -07:00
Paul E. McKenney 803b0ebae9 time: RCU permitted to stop idle entry via softirq
The can_stop_idle_tick() function complains if a softirq vector is
raised too late in the idle-entry process, presumably in order to
prevent dangling softirq invocations from being delayed across the
full idle period, which might be indefinitely long -- and if softirq
was asserted any later than the call to this function, such a delay
might well happen.

However, RCU needs to be able to use softirq to stop idle entry in
order to be able to drain RCU callbacks from the current CPU, which in
turn enables faster entry into dyntick-idle mode, which in turn reduces
power consumption.  Because RCU takes this action at a well-defined
point in the idle-entry path, it is safe for RCU to take this approach.

This commit therefore silences the error message that is sometimes
produced when the going-idle CPU suddenly finds that it has an RCU_SOFTIRQ
to process.  The error message will continue to be issued for other
softirq vectors.

Reported-by: Sedat Dilek <sedat.dilek@gmail.com>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-09-23 07:42:52 -07:00
Paul E. McKenney 7a11e2058f rcu: Move TINY_PREEMPT_RCU away from raw_local_irq_save()
The use of raw_local_irq_save() is unnecessary, given that local_irq_save()
really does disable interrupts.  Also, it appears to interfere with lockdep.
Therefore, this commit moves to local_irq_save().

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Fengguang Wu <fengguang.wu@intel.com>
2012-09-23 07:42:51 -07:00
Paul E. McKenney fdab649b1a rcu: Remove redundant memory barrier from __call_rcu()
The first memory barrier in __call_rcu() is supposed to order any
updates done beforehand by the caller against the actual queuing
of the callback.  However, the second memory barrier (which is intended
to order incrementing the queue lengths before queuing the callback)
is also between the caller's updates and the queuing of the callback.
The second memory barrier can therefore serve both purposes.

This commit therefore removes the first memory barrier.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-09-23 07:42:51 -07:00
Paul E. McKenney c96ea7cfdd rcu: Avoid spurious RCU CPU stall warnings
If a given CPU avoids the idle loop but also avoids starting a new
RCU grace period for a full minute, RCU can issue spurious RCU CPU
stall warnings.  This commit fixes this issue by adding a check for
ongoing grace period to avoid these spurious stall warnings.

Reported-by: Becky Bruce <bgillbruce@gmail.com>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-09-23 07:42:51 -07:00
Paul E. McKenney c8020a67e6 rcu: Protect rcu_node accesses during CPU stall warnings
The print_other_cpu_stall() function accesses a number of rcu_node
fields without protection from the ->lock.  In theory, this is not
a problem because the fields accessed are all integers, but in
practice the compiler can get nasty.  Therefore, the commit extends
the existing critical section to cover the entire loop body.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-09-23 07:42:51 -07:00
Paul E. McKenney 5fd4dc068c rcu: Avoid rcu_print_detail_task_stall_rnp() segfault
The rcu_print_detail_task_stall_rnp() function invokes
rcu_preempt_blocked_readers_cgp() to verify that there are some preempted
RCU readers blocking the current grace period outside of the protection
of the rcu_node structure's ->lock.  This means that the last blocked
reader might exit its RCU read-side critical section and remove itself
from the ->blkd_tasks list before the ->lock is acquired, resulting in
a segmentation fault when the subsequent code attempts to dereference
the now-NULL gp_tasks pointer.

This commit therefore moves the test under the lock.  This will not
have measurable effect on lock contention because this code is invoked
only when printing RCU CPU stall warnings, in other words, in the common
case, never.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-09-23 07:42:50 -07:00
Paul E. McKenney 115f7a7ca0 rcu: Apply for_each_rcu_flavor() to increment_cpu_stall_ticks()
The increment_cpu_stall_ticks() function listed each RCU flavor
explicitly, with an ifdef to handle preemptible RCU.  This commit
therefore applies for_each_rcu_flavor() to save a line of code.

Because this commit switches from a code-based enumeration of the
flavors of RCU to an rcu_state-list-based enumeration, it is no longer
possible to apply __get_cpu_var() to the per-CPU rcu_data structures.
We instead use __this_cpu_var() on the rcu_state structure's ->rda field
that references the corresponding rcu_data structures.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-09-23 07:42:50 -07:00
Paul E. McKenney b065a85354 rcu: Fix obsolete rcu_initiate_boost() header comment
Commit 1217ed1b (rcu: permit rcu_read_unlock() to be called while holding
runqueue locks) made rcu_initiate_boost() restore irq state when releasing
the rcu_node structure's ->lock, but failed to update the header comment
accordingly.  This commit therefore brings the header comment up to date.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-09-23 07:42:50 -07:00
Paul E. McKenney a82dcc7602 rcu: Make offline-CPU checking allow for indefinite delays
The rcu_implicit_offline_qs() function implicitly assumed that execution
would progress predictably when interrupts are disabled, which is of course
not guaranteed when running on a hypervisor.  Furthermore, this function
is short, and is called from one place only in a short function.

This commit therefore ensures that the timing is checked before
checking the condition, which guarantees correct behavior even given
indefinite delays.  It also inlines rcu_implicit_offline_qs() into
rcu_implicit_dynticks_qs().

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-09-23 07:42:50 -07:00
Paul E. McKenney 5cc900cf55 rcu: Improve boost selection when moving tasks to root rcu_node
The rcu_preempt_offline_tasks() moves all tasks queued on a given leaf
rcu_node structure to the root rcu_node, which is done when the last CPU
corresponding the the leaf rcu_node structure goes offline.  Now that
RCU-preempt's synchronize_rcu_expedited() implementation blocks CPU-hotplug
operations during the initialization of each rcu_node structure's
->boost_tasks pointer, rcu_preempt_offline_tasks() can do a better job
of setting the root rcu_node's ->boost_tasks pointer.

The key point is that rcu_preempt_offline_tasks() runs as part of the
CPU-hotplug process, so that a concurrent synchronize_rcu_expedited()
is guaranteed to either have not started on the one hand (in which case
there is no boosting on behalf of the expedited grace period) or to be
completely initialized on the other (in which case, in the absence of
other priority boosting, all ->boost_tasks pointers will be initialized).
Therefore, if rcu_preempt_offline_tasks() finds that the ->boost_tasks
pointer is equal to the ->exp_tasks pointer, it can be sure that it is
correctly placed.

In the case where there was boosting ongoing at the time that the
synchronize_rcu_expedited() function started, different nodes might start
boosting the tasks blocking the expedited grace period at different times.
In this mixed case, the root node will either be boosting tasks for
the expedited grace period already, or it will start as soon as it gets
done boosting for the normal grace period -- but in this latter case,
the root node's tasks needed to be boosted in any case.

This commit therefore adds a check of the ->boost_tasks pointer against
the ->exp_tasks pointer to the list that prevents updating ->boost_tasks.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-09-23 07:42:50 -07:00
Paul E. McKenney b4270ee356 rcu: Permit RCU_NONIDLE() to be used from interrupt context
There is a need to use RCU from interrupt context, but either before
rcu_irq_enter() is called or after rcu_irq_exit() is called.  If the
interrupt occurs from idle, then lockdep-RCU will complain about such
uses, as they appear to be illegal uses of RCU from the idle loop.
In other environments, RCU_NONIDLE() could be used to properly protect
the use of RCU, but RCU_NONIDLE() currently cannot be invoked except
from process context.

This commit therefore modifies RCU_NONIDLE() to permit its use more
globally.

Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-09-23 07:42:49 -07:00
Paul E. McKenney 1e3fd2b38c rcu: Properly initialize ->boost_tasks on CPU offline
When rcu_preempt_offline_tasks() clears tasks from a leaf rcu_node
structure, it does not NULL out the structure's ->boost_tasks field.
This commit therefore fixes this issue.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-09-23 07:42:49 -07:00
Paul E. McKenney 818615c4cd rcu: Pull TINY_RCU dyntick-idle tracing into non-idle region
Because TINY_RCU's idle detection keys directly off of the nesting
level, rather than from a separate variable as in TREE_RCU, the
TINY_RCU dyntick-idle tracing on transition to idle must happen
before the change to the nesting level.  This commit therefore makes
this change by passing the desired new value (rather than the old value)
of the nesting level in to rcu_idle_enter_common().

[ paulmck: Add fix for wrong-variable bug spotted by
  Michael Wang <wangyun@linux.vnet.ibm.com>. ]

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-09-23 07:42:49 -07:00
Paul E. McKenney e3ebfb96f3 rcu: Add PROVE_RCU_DELAY to provoke difficult races
There have been some recent bugs that were triggered only when
preemptible RCU's __rcu_read_unlock() was preempted just after setting
->rcu_read_lock_nesting to INT_MIN, which is a low-probability event.
Therefore, reproducing those bugs (to say nothing of gaining confidence
in alleged fixes) was quite difficult.  This commit therefore creates
a new debug-only RCU kernel config option that forces a short delay
in __rcu_read_unlock() to increase the probability of those sorts of
bugs occurring.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-09-23 07:42:49 -07:00
Paul E. McKenney 60f53782c5 rcu: Prevent initialization race in rcutorture kthreads
When you do something like "t = kthread_run(...)", it is possible that
the kthread will start running before the assignment to "t" happens.
If the child kthread expects to find a pointer to its task_struct in "t",
it will then be fatally disappointed.  This commit therefore switches
such cases to kthread_create() followed by wake_up_process(), guaranteeing
that the assignment happens before the child kthread starts running.

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-09-23 07:42:23 -07:00
Paul E. McKenney 2caa1e4432 rcu: Switch rcutorture to pr_alert() and friends
Drop a few characters by switching kernel/rcutorture.c from
"printk(KERN_ALERT" to "pr_alert(".

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-09-23 07:42:23 -07:00
Paul E. McKenney 13dbf9140c rcu: Track CPU-hotplug duration statistics
Many rcutorture runs include CPU-hotplug operations in their stress
testing.  This commit accumulates statistics on the durations of these
operations in deference to the recent concern about the overhead and
latency of these operations.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-09-23 07:42:22 -07:00
Paul E. McKenney ab840f7a06 rcu: Update rcutorture defaults
A number of new features have been added to rcutorture over the years, but
the defaults have not been updated to include them.  This commit therefore
turns on a couple of them that have proven helpful and trustworthy, namely
periodic progress reports and testing of NO_HZ.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-09-23 07:42:22 -07:00
Paul E. McKenney b17c7035f3 rcu: Shrink RCU based on number of CPUs
Currently, rcu_init_geometry() only reshapes RCU's combining trees
if the leaf fanout is changed at boot time.  This means that by
default, kernels compiled with (say) NR_CPUS=4096 will keep oversized
data structures, even when running on systems with (say) four CPUs.

This commit therefore checks to see if the maximum number of CPUs on
the actual running system (nr_cpu_ids) differs from NR_CPUS, and if so
reshapes the combining trees accordingly.

Reported-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-09-23 07:41:56 -07:00
Paul E. McKenney 4dbd6bb38d rcu: Handle unbalanced rcu_node configurations with few CPUs
If CONFIG_RCU_FANOUT_EXACT=y, if there are not enough CPUs (according
to nr_cpu_ids) to require more than a single rcu_node structure, but if
NR_CPUS is larger than would fit into a single rcu_node structure, then
the current rcu_init_levelspread() code is subject to integer overflow
in the eight-bit ->levelspread[] array in the rcu_state structure.

In this case, the solution is -not- to increase the size of the
elements in this array because the values in that array should be
constrained to the number of bits in an unsigned long.  Instead, this
commit replaces NR_CPUS with nr_cpu_ids in the rcu_init_levelspread()
function's initialization of the cprv local variable.  This results in
all of the arithmetic being consistently based off of the nr_cpu_ids
value, thus avoiding the overflow, which was caused by the mixing of
nr_cpu_ids and NR_CPUS.

Reported-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-09-23 07:41:56 -07:00
Paul E. McKenney d7d6a11e86 rcu: Simplify quiescent-state detection
The current quiescent-state detection algorithm is needlessly
complex.  It records the grace-period number corresponding to
the quiescent state at the time of the quiescent state, which
works, but it seems better to simply erase any record of previous
quiescent states at the time that the CPU notices the new grace
period.  This has the further advantage of removing another piece
of RCU for which lockless reasoning is required.

Therefore, this commit makes this change.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-09-23 07:41:56 -07:00
Paul E. McKenney 1943c89de7 rcu: Reduce synchronize_rcu_expedited() latency
The synchronize_rcu_expedited() function disables interrupts across a
scan of all leaf rcu_node structures, which is not good for real-time
scheduling latency on large systems (hundreds or especially thousands
of CPUs).  This commit therefore holds off CPU-hotplug operations using
get_online_cpus(), and removes the prior acquisiion of the ->onofflock
(which required disabling interrupts).

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-09-23 07:41:56 -07:00
Paul E. McKenney bcfa57ce10 rcu: Eliminate signed overflow in synchronize_rcu_expedited()
In the C language, signed overflow is undefined.  It is true that
twos-complement arithmetic normally comes to the rescue, but if the
compiler can subvert this any time it has any information about the values
being compared.  For example, given "if (a - b > 0)", if the compiler
has enough information to realize that (for example) the value of "a"
is positive and that of "b" is negative, the compiler is within its
rights to optimize to a simple "if (1)", which might not be what you want.

This commit therefore converts synchronize_rcu_expedited()'s work-done
detection counter from signed to unsigned.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-09-23 07:41:56 -07:00
Paul E. McKenney 25d30cf425 rcu: Adjust for unconditional ->completed assignment
Now that the rcu_node structures' ->completed fields are unconditionally
assigned at grace-period cleanup time, they should already have the
correct value for the new grace period at grace-period initialization
time.  This commit therefore inserts a WARN_ON_ONCE() to verify this
invariant.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-09-23 07:41:55 -07:00
Paul E. McKenney 661a85dc0d rcu: Add random PROVE_RCU_DELAY to grace-period initialization
Preemption greatly raised the probability of certain types of race
conditions, so this commit adds an anti-heisenbug to greatly increase
the collision cross section, also known as the probability of occurrence.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-09-23 07:41:55 -07:00
Paul E. McKenney 5d4b865949 rcu: Fix day-zero grace-period initialization/cleanup race
The current approach to grace-period initialization is vulnerable to
extremely low-probability races.  These races stem from the fact that
the old grace period is marked completed on the same traversal through
the rcu_node structure that is marking the start of the new grace period.
This means that some rcu_node structures will believe that the old grace
period is still in effect at the same time that other rcu_node structures
believe that the new grace period has already started.

These sorts of disagreements can result in too-short grace periods,
as shown in the following scenario:

1.	CPU 0 completes a grace period, but needs an additional
	grace period, so starts initializing one, initializing all
	the non-leaf rcu_node structures and the first leaf rcu_node
	structure.  Because CPU 0 is both completing the old grace
	period and starting a new one, it marks the completion of
	the old grace period and the start of the new grace period
	in a single traversal of the rcu_node structures.

	Therefore, CPUs corresponding to the first rcu_node structure
	can become aware that the prior grace period has completed, but
	CPUs corresponding to the other rcu_node structures will see
	this same prior grace period as still being in progress.

2.	CPU 1 passes through a quiescent state, and therefore informs
	the RCU core.  Because its leaf rcu_node structure has already
	been initialized, this CPU's quiescent state is applied to the
	new (and only partially initialized) grace period.

3.	CPU 1 enters an RCU read-side critical section and acquires
	a reference to data item A.  Note that this CPU believes that
	its critical section started after the beginning of the new
	grace period, and therefore will not block this new grace period.

4.	CPU 16 exits dyntick-idle mode.  Because it was in dyntick-idle
	mode, other CPUs informed the RCU core of its extended quiescent
	state for the past several grace periods.  This means that CPU 16
	is not yet aware that these past grace periods have ended.  Assume
	that CPU 16 corresponds to the second leaf rcu_node structure --
	which has not yet been made aware of the new grace period.

5.	CPU 16 removes data item A from its enclosing data structure
	and passes it to call_rcu(), which queues a callback in the
	RCU_NEXT_TAIL segment of the callback queue.

6.	CPU 16 enters the RCU core, possibly because it has taken a
	scheduling-clock interrupt, or alternatively because it has
	more than 10,000 callbacks queued.  It notes that the second
	most recent grace period has completed (recall that because it
	corresponds to the second as-yet-uninitialized rcu_node structure,
	it cannot yet become aware that the most recent grace period has
	completed), and therefore advances its callbacks.  The callback
	for data item A is therefore in the RCU_NEXT_READY_TAIL segment
	of the callback queue.

7.	CPU 0 completes initialization of the remaining leaf rcu_node
	structures for the new grace period, including the structure
	corresponding to CPU 16.

8.	CPU 16 again enters the RCU core, again, possibly because it has
	taken a scheduling-clock interrupt, or alternatively because
	it now has more than 10,000 callbacks queued.	It notes that
	the most recent grace period has ended, and therefore advances
	its callbacks.	The callback for data item A is therefore in
	the RCU_DONE_TAIL segment of the callback queue.

9.	All CPUs other than CPU 1 pass through quiescent states.  Because
	CPU 1 already passed through its quiescent state, the new grace
	period completes.  Note that CPU 1 is still in its RCU read-side
	critical section, still referencing data item A.

10.	Suppose that CPU 2 wais the last CPU to pass through a quiescent
	state for the new grace period, and suppose further that CPU 2
	did not have any callbacks queued, therefore not needing an
	additional grace period.  CPU 2 therefore traverses all of the
	rcu_node structures, marking the new grace period as completed,
	but does not initialize a new grace period.

11.	CPU 16 yet again enters the RCU core, yet again possibly because
	it has taken a scheduling-clock interrupt, or alternatively
	because it now has more than 10,000 callbacks queued.	It notes
	that the new grace period has ended, and therefore advances
	its callbacks.	The callback for data item A is therefore in
	the RCU_DONE_TAIL segment of the callback queue.  This means
	that this callback is now considered ready to be invoked.

12.	CPU 16 invokes the callback, freeing data item A while CPU 1
	is still referencing it.

This scenario represents a day-zero bug for TREE_RCU.  This commit
therefore ensures that the old grace period is marked completed in
all leaf rcu_node structures before a new grace period is marked
started in any of them.

That said, it would have been insanely difficult to force this race to
happen before the grace-period initialization process was preemptible.
Therefore, this commit is not a candidate for -stable.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>

Conflicts:

	kernel/rcutree.c
2012-09-23 07:41:55 -07:00
Paul E. McKenney 7e5c2dfb4d rcu: Make rcutree module parameters visible in sysfs
The module parameters blimit, qhimark, and qlomark (and more
recently, rcu_fanout_leaf) have permission masks of zero, so
that their values are not visible from sysfs.  This is unnecessary
and inconvenient to administrators who might like an easy way to
see what these values are on a running system.  This commit therefore
sets their permission masks to 0444, allowing them to be read but
not written.

Reported-by: Rusty Russell <rusty@ozlabs.org>
Reported-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-09-23 07:41:55 -07:00
Paul E. McKenney d40011f601 rcu: Control grace-period duration from sysfs
Although almost everyone is well-served by the defaults, some uses of RCU
benefit from shorter grace periods, while others benefit more from the
greater efficiency provided by longer grace periods.  Situations requiring
a large number of grace periods to elapse (and wireshark startup has
been called out as an example of this) are helped by lower-latency
grace periods.  Furthermore, in some embedded applications, people are
willing to accept a small degradation in update efficiency (due to there
being more of the shorter grace-period operations) in order to gain the
lower latency.

In contrast, those few systems with thousands of CPUs need longer grace
periods because the CPU overhead of a grace period rises roughly
linearly with the number of CPUs.  Such systems normally do not make
much use of facilities that require large numbers of grace periods to
elapse, so this is a good tradeoff.

Therefore, this commit allows the durations to be controlled from sysfs.
There are two sysfs parameters, one named "jiffies_till_first_fqs" that
specifies the delay in jiffies from the end of grace-period initialization
until the first attempt to force quiescent states, and the other named
"jiffies_till_next_fqs" that specifies the delay (again in jiffies)
between subsequent attempts to force quiescent states.  They both default
to three jiffies, which is compatible with the old hard-coded behavior.

At some future time, it may be possible to automatically increase the
grace-period length with the number of CPUs, but we do not yet have
sufficient data to do a good job.  Preliminary data indicates that we
should add an addiitonal jiffy to each of the delays for every 200 CPUs
in the system, but more experimentation is needed.  For now, the number
of systems with more than 1,000 CPUs is small enough that this can be
relegated to boot-time hand tuning.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-09-23 07:41:54 -07:00
Paul E. McKenney 394f2769aa rcu: Prevent force_quiescent_state() memory contention
Large systems running RCU_FAST_NO_HZ kernels see extreme memory
contention on the rcu_state structure's ->fqslock field.  This
can be avoided by disabling RCU_FAST_NO_HZ, either at compile time
or at boot time (via the nohz kernel boot parameter), but large
systems will no doubt become sensitive to energy consumption.
This commit therefore uses a combining-tree approach to spread the
memory contention across new cache lines in the leaf rcu_node structures.
This can be thought of as a tournament lock that has only a try-lock
acquisition primitive.

The effect on small systems is minimal, because such systems have
an rcu_node "tree" consisting of a single node.  In addition, this
functionality is not used on fastpaths.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-09-23 07:41:54 -07:00
Paul E. McKenney 4605c0143c rcu: Adjust debugfs tracing for kthread-based quiescent-state forcing
Moving quiescent-state forcing into a kthread dispenses with the need
for the ->n_rp_need_fqs field, so this commit removes it.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-09-23 07:41:54 -07:00
Paul E. McKenney b4be093fee rcu: Allow RCU quiescent-state forcing to be preempted
RCU quiescent-state forcing is currently carried out without preemption
points, which can result in excessive latency spikes on large systems
(many hundreds or thousands of CPUs).  This patch therefore inserts
a voluntary preemption point into force_qs_rnp(), which should greatly
reduce the magnitude of these spikes.

Reported-by: Mike Galbraith <mgalbraith@suse.de>
Reported-by: Dimitri Sivanich <sivanich@sgi.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-09-23 07:41:54 -07:00
Paul E. McKenney 4cdfc175c2 rcu: Move quiescent-state forcing into kthread
As the first step towards allowing quiescent-state forcing to be
preemptible, this commit moves RCU quiescent-state forcing into the
same kthread that is now used to initialize and clean up after grace
periods.  This is yet another step towards keeping scheduling
latency down to a dull roar.

Updated to change from raw_spin_lock_irqsave() to raw_spin_lock_irq()
and to remove the now-unused rcu_state structure fields as suggested by
Peter Zijlstra.

Reported-by: Mike Galbraith <mgalbraith@suse.de>
Reported-by: Dimitri Sivanich <sivanich@sgi.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-09-23 07:41:54 -07:00
Dimitri Sivanich b402b73b3a rcu: Segregate rcu_state fields to improve cache locality
The fields in the rcu_state structure that are protected by the
root rcu_node structure's ->lock can share a cache line with the
fields protected by ->onofflock.  This can result in excessive
memory contention on large systems, so this commit applies
____cacheline_internodealigned_in_smp to the ->onofflock field in
order to segregate them.

Signed-off-by: Dimitri Sivanich <sivanich@sgi.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Dimitri Sivanich <sivanich@sgi.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-09-23 07:41:53 -07:00
Paul E. McKenney b626c1b689 rcu: Provide OOM handler to motivate lazy RCU callbacks
In kernels built with CONFIG_RCU_FAST_NO_HZ=y, CPUs can accumulate a
large number of lazy callbacks, which as the name implies will be slow
to be invoked.  This can be a problem on small-memory systems, where the
default 6-second sleep for CPUs having only lazy RCU callbacks could well
be fatal.  This commit therefore installs an OOM hander that ensures that
every CPU with lazy callbacks has at least one non-lazy callback, in turn
ensuring timely advancement for these callbacks.

Updated to fix bug that disabled OOM killing, noted by Lai Jiangshan.

Updated to push the for_each_rcu_flavor() loop into rcu_oom_notify_cpu(),
thus reducing the number of IPIs, as suggested by Steven Rostedt.  Also
to make the for_each_online_cpu() loop be preemptible.  (Later, it might
be good to use smp_call_function(), as suggested by Peter Zijlstra.)

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Sasha Levin <levinsasha928@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-09-23 07:41:53 -07:00
Paul E. McKenney bfa00b4c40 rcu: Prevent offline CPUs from executing RCU core code
Earlier versions of RCU invoked the RCU core from the CPU_DYING notifier
in order to note a quiescent state for the outgoing CPU.  Because the
CPU is marked "offline" during the execution of the CPU_DYING notifiers,
the RCU core had to tolerate being invoked from an offline CPU.  However,
commit b1420f1c (Make rcu_barrier() less disruptive) left only tracing
code in the CPU_DYING notifier, so the RCU core need no longer execute
on offline CPUs.  This commit therefore enforces this restriction.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-09-23 07:41:53 -07:00