Commit Graph

24233 Commits

Author SHA1 Message Date
Peter Zijlstra bebe5b5143 futex: Futex_unlock_pi() determinism
The problem with returning -EAGAIN when the waiter state mismatches is that
it becomes very hard to proof a bounded execution time on the
operation. And seeing that this is a RT operation, this is somewhat
important.

While in practise; given the previous patch; it will be very unlikely to
ever really take more than one or two rounds, proving so becomes rather
hard.

However, now that modifying wait_list is done while holding both hb->lock
and wait_lock, the scenario can be avoided entirely by acquiring wait_lock
while still holding hb-lock. Doing a hand-over, without leaving a hole.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: dvhart@infradead.org
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170322104152.112378812@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-03-23 19:10:10 +01:00
Peter Zijlstra cfafcd117d futex: Rework futex_lock_pi() to use rt_mutex_*_proxy_lock()
By changing futex_lock_pi() to use rt_mutex_*_proxy_lock() all wait_list
modifications are done under both hb->lock and wait_lock.

This closes the obvious interleave pattern between futex_lock_pi() and
futex_unlock_pi(), but not entirely so. See below:

Before:

futex_lock_pi()			futex_unlock_pi()
  unlock hb->lock

				  lock hb->lock
				  unlock hb->lock

				  lock rt_mutex->wait_lock
				  unlock rt_mutex_wait_lock
				    -EAGAIN

  lock rt_mutex->wait_lock
  list_add
  unlock rt_mutex->wait_lock

  schedule()

  lock rt_mutex->wait_lock
  list_del
  unlock rt_mutex->wait_lock

				  <idem>
				    -EAGAIN

  lock hb->lock


After:

futex_lock_pi()			futex_unlock_pi()

  lock hb->lock
  lock rt_mutex->wait_lock
  list_add
  unlock rt_mutex->wait_lock
  unlock hb->lock

  schedule()
				  lock hb->lock
				  unlock hb->lock
  lock hb->lock
  lock rt_mutex->wait_lock
  list_del
  unlock rt_mutex->wait_lock

				  lock rt_mutex->wait_lock
				  unlock rt_mutex_wait_lock
				    -EAGAIN

  unlock hb->lock


It does however solve the earlier starvation/live-lock scenario which got
introduced with the -EAGAIN since unlike the before scenario; where the
-EAGAIN happens while futex_unlock_pi() doesn't hold any locks; in the
after scenario it happens while futex_unlock_pi() actually holds a lock,
and then it is serialized on that lock.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: dvhart@infradead.org
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170322104152.062785528@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-03-23 19:10:09 +01:00
Peter Zijlstra 38d589f2fd futex,rt_mutex: Restructure rt_mutex_finish_proxy_lock()
With the ultimate goal of keeping rt_mutex wait_list and futex_q waiters
consistent it's necessary to split 'rt_mutex_futex_lock()' into finer
parts, such that only the actual blocking can be done without hb->lock
held.

Split split_mutex_finish_proxy_lock() into two parts, one that does the
blocking and one that does remove_waiter() when the lock acquire failed.

When the rtmutex was acquired successfully the waiter can be removed in the
acquisiton path safely, since there is no concurrency on the lock owner.

This means that, except for futex_lock_pi(), all wait_list modifications
are done with both hb->lock and wait_lock held.

[bigeasy@linutronix.de: fix for futex_requeue_pi_signal_restart]

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: dvhart@infradead.org
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170322104152.001659630@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-03-23 19:10:09 +01:00
Peter Zijlstra 50809358dd futex,rt_mutex: Introduce rt_mutex_init_waiter()
Since there's already two copies of this code, introduce a helper now
before adding a third one.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: dvhart@infradead.org
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170322104151.950039479@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-03-23 19:10:09 +01:00
Peter Zijlstra 16ffa12d74 futex: Pull rt_mutex_futex_unlock() out from under hb->lock
There's a number of 'interesting' problems, all caused by holding
hb->lock while doing the rt_mutex_unlock() equivalient.

Notably:

 - a PI inversion on hb->lock; and,

 - a SCHED_DEADLINE crash because of pointer instability.

The previous changes:

 - changed the locking rules to cover {uval,pi_state} with wait_lock.

 - allow to do rt_mutex_futex_unlock() without dropping wait_lock; which in
   turn allows to rely on wait_lock atomicity completely.

 - simplified the waiter conundrum.

It's now sufficient to hold rtmutex::wait_lock and a reference on the
pi_state to protect the state consistency, so hb->lock can be dropped
before calling rt_mutex_futex_unlock().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: dvhart@infradead.org
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170322104151.900002056@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-03-23 19:10:08 +01:00
Peter Zijlstra 73d786bd04 futex: Rework inconsistent rt_mutex/futex_q state
There is a weird state in the futex_unlock_pi() path when it interleaves
with a concurrent futex_lock_pi() at the point where it drops hb->lock.

In this case, it can happen that the rt_mutex wait_list and the futex_q
disagree on pending waiters, in particular rt_mutex will find no pending
waiters where futex_q thinks there are. In this case the rt_mutex unlock
code cannot assign an owner.

The futex side fixup code has to cleanup the inconsistencies with quite a
bunch of interesting corner cases.

Simplify all this by changing wake_futex_pi() to return -EAGAIN when this
situation occurs. This then gives the futex_lock_pi() code the opportunity
to continue and the retried futex_unlock_pi() will now observe a coherent
state.

The only problem is that this breaks RT timeliness guarantees. That
is, consider the following scenario:

  T1 and T2 are both pinned to CPU0. prio(T2) > prio(T1)

    CPU0

    T1
      lock_pi()
      queue_me()  <- Waiter is visible

    preemption

    T2
      unlock_pi()
	loops with -EAGAIN forever

Which is undesirable for PI primitives. Future patches will rectify
this.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: dvhart@infradead.org
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170322104151.850383690@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-03-23 19:10:08 +01:00
Peter Zijlstra bf92cf3a51 futex: Cleanup refcounting
Add a put_pit_state() as counterpart for get_pi_state() so the refcounting
becomes consistent.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: dvhart@infradead.org
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170322104151.801778516@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-03-23 19:10:08 +01:00
Peter Zijlstra 734009e96d futex: Change locking rules
Currently futex-pi relies on hb->lock to serialize everything. But hb->lock
creates another set of problems, especially priority inversions on RT where
hb->lock becomes a rt_mutex itself.

The rt_mutex::wait_lock is the most obvious protection for keeping the
futex user space value and the kernel internal pi_state in sync.

Rework and document the locking so rt_mutex::wait_lock is held accross all
operations which modify the user space value and the pi state.

This allows to invoke rt_mutex_unlock() (including deboost) without holding
hb->lock as a next step.

Nothing yet relies on the new locking rules.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: dvhart@infradead.org
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170322104151.751993333@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-03-23 19:10:07 +01:00
Peter Zijlstra 5293c2efda futex,rt_mutex: Provide futex specific rt_mutex API
Part of what makes futex_unlock_pi() intricate is that
rt_mutex_futex_unlock() -> rt_mutex_slowunlock() can drop
rt_mutex::wait_lock.

This means it cannot rely on the atomicy of wait_lock, which would be
preferred in order to not rely on hb->lock so much.

The reason rt_mutex_slowunlock() needs to drop wait_lock is because it can
race with the rt_mutex fastpath, however futexes have their own fast path.

Since futexes already have a bunch of separate rt_mutex accessors, complete
that set and implement a rt_mutex variant without fastpath for them.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: dvhart@infradead.org
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170322104151.702962446@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-03-23 19:10:07 +01:00
Peter Zijlstra fffa954fb5 futex: Remove rt_mutex_deadlock_account_*()
These are unused and clutter up the code.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: dvhart@infradead.org
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170322104151.652692478@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-03-23 19:10:07 +01:00
Peter Zijlstra 1b367ece0d futex: Use smp_store_release() in mark_wake_futex()
Since the futex_q can dissapear the instruction after assigning NULL,
this really should be a RELEASE barrier. That stops loads from hitting
dead memory too.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: dvhart@infradead.org
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170322104151.604296452@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-03-23 19:10:06 +01:00
Peter Zijlstra 499f5aca2c futex: Cleanup variable names for futex_top_waiter()
futex_top_waiter() returns the top-waiter on the pi_mutex. Assinging
this to a variable 'match' totally obscures the code.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: dvhart@infradead.org
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170322104151.554710645@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-03-23 19:10:06 +01:00
Peter Zijlstra bf7b3ac2e3 locking/ww_mutex: Improve test to cover acquire context changes
Currently each thread starts an acquire context only once, and
performs all its loop iterations under it.

This means that the Wound/Wait relations between threads are fixed.

To make things a little more realistic and cover more of the
functionality with the test, open a new acquire context for each loop.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-16 09:57:09 +01:00
Thomas Gleixner 383776fa75 locking/lockdep: Handle statically initialized PER_CPU locks properly
If a PER_CPU struct which contains a spin_lock is statically initialized
via:

DEFINE_PER_CPU(struct foo, bla) = {
	.lock = __SPIN_LOCK_UNLOCKED(bla.lock)
};

then lockdep assigns a seperate key to each lock because the logic for
assigning a key to statically initialized locks is to use the address as
the key. With per CPU locks the address is obvioulsy different on each CPU.

That's wrong, because all locks should have the same key.

To solve this the following modifications are required:

 1) Extend the is_kernel/module_percpu_addr() functions to hand back the
    canonical address of the per CPU address, i.e. the per CPU address
    minus the per CPU offset.

 2) Check the lock address with these functions and if the per CPU check
    matches use the returned canonical address as the lock key, so all per
    CPU locks have the same key.

 3) Move the static_obj(key) check into look_up_lock_class() so this check
    can be avoided for statically initialized per CPU locks.  That's
    required because the canonical address fails the static_obj(key) check
    for obvious reasons.

Reported-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
[ Merged Dan's fixups for !MODULES and !SMP into this patch. ]
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dan Murphy <dmurphy@ti.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20170227143736.pectaimkjkan5kow@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-16 09:57:08 +01:00
J. R. Okajima 6419c4af77 locking/lockdep: Add new check to lock_downgrade()
Commit:

  f8319483f5 ("locking/lockdep: Provide a type check for lock_is_held")

didn't fully cover rwsems as downgrade_write() was left out.

Introduce lock_downgrade() and use it to add new checks.

See-also: http://marc.info/?l=linux-kernel&m=148581164003149&w=2
Originally-written-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: J. R. Okajima <hooanon05g@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1486053497-9948-3-git-send-email-hooanon05g@gmail.com
[ Rewrote the changelog. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-16 09:57:07 +01:00
J. R. Okajima e969970be0 locking/lockdep: Factor out the validate_held_lock() helper function
Behaviour should not change.

Signed-off-by: J. R. Okajima <hooanon05g@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1486053497-9948-2-git-send-email-hooanon05g@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-16 09:57:07 +01:00
J. R. Okajima 41c2c5b86a locking/lockdep: Factor out the find_held_lock() helper function
A simple consolidataion to factor out repeated patterns.

The behaviour should not change.

Signed-off-by: J. R. Okajima <hooanon05g@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1486053497-9948-1-git-send-email-hooanon05g@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-16 09:57:06 +01:00
Niklas Cassel 17fcbd590d locking/rwsem: Fix down_write_killable() for CONFIG_RWSEM_GENERIC_SPINLOCK=y
We hang if SIGKILL has been sent, but the task is stuck in down_read()
(after do_exit()), even though no task is doing down_write() on the
rwsem in question:

  INFO: task libupnp:21868 blocked for more than 120 seconds.
  libupnp         D    0 21868      1 0x08100008
  ...
  Call Trace:
  __schedule()
  schedule()
  __down_read()
  do_exit()
  do_group_exit()
  __wake_up_parent()

This bug has already been fixed for CONFIG_RWSEM_XCHGADD_ALGORITHM=y in
the following commit:

 04cafed7fc ("locking/rwsem: Fix down_write_killable()")

... however, this bug also exists for CONFIG_RWSEM_GENERIC_SPINLOCK=y.

Signed-off-by: Niklas Cassel <niklas.cassel@axis.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <mhocko@suse.com>
Cc: <stable@vger.kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Niklas Cassel <niklass@axis.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: d47996082f ("locking/rwsem: Introduce basis for down_write_killable()")
Link: http://lkml.kernel.org/r/1487981873-12649-1-git-send-email-niklass@axis.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-16 09:28:30 +01:00
Peter Zijlstra 9bbb25afeb futex: Add missing error handling to FUTEX_REQUEUE_PI
Thomas spotted that fixup_pi_state_owner() can return errors and we
fail to unlock the rt_mutex in that case.

Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Darren Hart <dvhart@linux.intel.com>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: dvhart@infradead.org
Cc: bristot@redhat.com
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/20170304093558.867401760@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-03-14 21:45:36 +01:00
Peter Zijlstra c236c8e95a futex: Fix potential use-after-free in FUTEX_REQUEUE_PI
While working on the futex code, I stumbled over this potential
use-after-free scenario. Dmitry triggered it later with syzkaller.

pi_mutex is a pointer into pi_state, which we drop the reference on in
unqueue_me_pi(). So any access to that pointer after that is bad.

Since other sites already do rt_mutex_unlock() with hb->lock held, see
for example futex_lock_pi(), simply move the unlock before
unqueue_me_pi().

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Darren Hart <dvhart@linux.intel.com>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: dvhart@infradead.org
Cc: bristot@redhat.com
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/20170304093558.801744246@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-03-14 21:45:36 +01:00
Linus Torvalds 5a45a5a881 Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:

 - a fix for the kexec/purgatory regression which was introduced in the
   merge window via an innocent sparse fix. We could have reverted that
   commit, but on deeper inspection it turned out that the whole
   machinery is neither documented nor robust. So a proper cleanup was
   done instead

 - the fix for the TLB flush issue which was discovered recently

 - a simple typo fix for a reboot quirk

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/tlb: Fix tlb flushing when lguest clears PGE
  kexec, x86/purgatory: Unbreak it and clean it up
  x86/reboot/quirks: Fix typo in ASUS EeeBook X205TA reboot quirk
2017-03-12 14:18:49 -07:00
Thomas Gleixner 40c50c1fec kexec, x86/purgatory: Unbreak it and clean it up
The purgatory code defines global variables which are referenced via a
symbol lookup in the kexec code (core and arch).

A recent commit addressing sparse warnings made these static and thereby
broke kexec_file.

Why did this happen? Simply because the whole machinery is undocumented and
lacks any form of forward declarations. The variable names are unspecific
and lack a prefix, so adding forward declarations creates shadow variables
in the core code. Aside of that the code relies on magic constants and
duplicate struct definitions with no way to ensure that these things stay
in sync. The section placement of the purgatory variables happened by
chance and not by design.

Unbreak kexec and cleanup the mess:

 - Add proper forward declarations and document the usage
 - Use common struct definition
 - Use the proper common defines instead of magic constants
 - Add a purgatory_ prefix to have a proper name space
 - Use ARRAY_SIZE() instead of a homebrewn reimplementation
 - Add proper sections to the purgatory variables [ From Mike ]

Fixes: 72042a8c7b ("x86/purgatory: Make functions and variables static")
Reported-by: Mike Galbraith <<efault@gmx.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Nicholas Mc Guire <der.herr@hofr.at>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: "Tobin C. Harding" <me@tobin.cc>
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1703101315140.3681@nanos
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-03-10 20:55:09 +01:00
Linus Torvalds 8fe3ccaed0 Merge branch 'akpm' (patches from Andrew)
Merge fixes from Andrew Morton:
 "26 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (26 commits)
  userfaultfd: remove wrong comment from userfaultfd_ctx_get()
  fat: fix using uninitialized fields of fat_inode/fsinfo_inode
  sh: cayman: IDE support fix
  kasan: fix races in quarantine_remove_cache()
  kasan: resched in quarantine_remove_cache()
  mm: do not call mem_cgroup_free() from within mem_cgroup_alloc()
  thp: fix another corner case of munlock() vs. THPs
  rmap: fix NULL-pointer dereference on THP munlocking
  mm/memblock.c: fix memblock_next_valid_pfn()
  userfaultfd: selftest: vm: allow to build in vm/ directory
  userfaultfd: non-cooperative: userfaultfd_remove revalidate vma in MADV_DONTNEED
  userfaultfd: non-cooperative: fix fork fctx->new memleak
  mm/cgroup: avoid panic when init with low memory
  drivers/md/bcache/util.h: remove duplicate inclusion of blkdev.h
  mm/vmstats: add thp_split_pud event for clarity
  include/linux/fs.h: fix unsigned enum warning with gcc-4.2
  userfaultfd: non-cooperative: release all ctx in dup_userfaultfd_complete
  userfaultfd: non-cooperative: robustness check
  userfaultfd: non-cooperative: rollback userfaultfd_exit
  x86, mm: unify exit paths in gup_pte_range()
  ...
2017-03-10 08:34:42 -08:00
Andrea Arcangeli dd0db88d80 userfaultfd: non-cooperative: rollback userfaultfd_exit
Patch series "userfaultfd non-cooperative further update for 4.11 merge
window".

Unfortunately I noticed one relevant bug in userfaultfd_exit while doing
more testing.  I've been doing testing before and this was also tested
by kbuild bot and exercised by the selftest, but this bug never
reproduced before.

I dropped userfaultfd_exit as result.  I dropped it because of
implementation difficulty in receiving signals in __mmput and because I
think -ENOSPC as result from the background UFFDIO_COPY should be enough
already.

Before I decided to remove userfaultfd_exit, I noticed userfaultfd_exit
wasn't exercised by the selftest and when I tried to exercise it, after
moving it to a more correct place in __mmput where it would make more
sense and where the vma list is stable, it resulted in the
event_wait_completion in D state.  So then I added the second patch to
be sure even if we call userfaultfd_event_wait_completion too late
during task exit(), we won't risk to generate tasks in D state.  The
same check exists in handle_userfault() for the same reason, except it
makes a difference there, while here is just a robustness check and it's
run under WARN_ON_ONCE.

While looking at the userfaultfd_event_wait_completion() function I
looked back at its callers too while at it and I think it's not ok to
stop executing dup_fctx on the fcs list because we relay on
userfaultfd_event_wait_completion to execute
userfaultfd_ctx_put(fctx->orig) which is paired against
userfaultfd_ctx_get(fctx->orig) in dup_userfault just before
list_add(fcs).  This change only takes care of fctx->orig but this area
also needs further review looking for similar problems in fctx->new.

The only patch that is urgent is the first because it's an use after
free during a SMP race condition that affects all processes if
CONFIG_USERFAULTFD=y.  Very hard to reproduce though and probably
impossible without SLUB poisoning enabled.

This patch (of 3):

I once reproduced this oops with the userfaultfd selftest, it's not
easily reproducible and it requires SLUB poisoning to reproduce.

    general protection fault: 0000 [#1] SMP
    Modules linked in:
    CPU: 2 PID: 18421 Comm: userfaultfd Tainted: G               ------------ T 3.10.0+ #15
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.1-0-g8891697-prebuilt.qemu-project.org 04/01/2014
    task: ffff8801f83b9440 ti: ffff8801f833c000 task.ti: ffff8801f833c000
    RIP: 0010:[<ffffffff81451299>]  [<ffffffff81451299>] userfaultfd_exit+0x29/0xa0
    RSP: 0018:ffff8801f833fe80  EFLAGS: 00010202
    RAX: ffff8801f833ffd8 RBX: 6b6b6b6b6b6b6b6b RCX: ffff8801f83b9440
    RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8800baf18600
    RBP: ffff8801f833fee8 R08: 0000000000000000 R09: 0000000000000001
    R10: 0000000000000000 R11: ffffffff8127ceb3 R12: 0000000000000000
    R13: ffff8800baf186b0 R14: ffff8801f83b99f8 R15: 00007faed746c700
    FS:  0000000000000000(0000) GS:ffff88023fc80000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 00007faf0966f028 CR3: 0000000001bc6000 CR4: 00000000000006e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Call Trace:
      do_exit+0x297/0xd10
      SyS_exit+0x17/0x20
      tracesys+0xdd/0xe2
    Code: 00 00 66 66 66 66 90 55 48 89 e5 41 54 53 48 83 ec 58 48 8b 1f 48 85 db 75 11 eb 73 66 0f 1f 44 00 00 48 8b 5b 10 48 85 db 74 64 <4c> 8b a3 b8 00 00 00 4d 85 e4 74 eb 41 f6 84 24 2c 01 00 00 80
    RIP  [<ffffffff81451299>] userfaultfd_exit+0x29/0xa0
     RSP <ffff8801f833fe80>
    ---[ end trace 9fecd6dcb442846a ]---

In the debugger I located the "mm" pointer in the stack and walking
mm->mmap->vm_next through the end shows the vma->vm_next list is fully
consistent and it is null terminated list as expected.  So this has to
be an SMP race condition where userfaultfd_exit was running while the
vma list was being modified by another CPU.

When userfaultfd_exit() run one of the ->vm_next pointers pointed to
SLAB_POISON (RBX is the vma pointer and is 0x6b6b..).

The reason is that it's not running in __mmput but while there are still
other threads running and it's not holding the mmap_sem (it can't as it
has to wait the even to be received by the manager).  So this is an use
after free that was happening for all processes.

One more implementation problem aside from the race condition:
userfaultfd_exit has really to check a flag in mm->flags before walking
the vma or it's going to slowdown the exit() path for regular tasks.

One more implementation problem: at that point signals can't be
delivered so it would also create a task in D state if the manager
doesn't read the event.

The major design issue: it overall looks superfluous as the manager can
check for -ENOSPC in the background transfer:

	if (mmget_not_zero(ctx->mm)) {
[..]
	} else {
		return -ENOSPC;
	}

It's safer to roll it back and re-introduce it later if at all.

[rppt@linux.vnet.ibm.com: documentation fixup after removal of UFFD_EVENT_EXIT]
  Link: http://lkml.kernel.org/r/1488345437-4364-1-git-send-email-rppt@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/20170224181957.19736-2-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Acked-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-03-09 17:01:09 -08:00
Masahiro Yamada 505d3085d7 scripts/spelling.txt: add "overide" pattern and fix typo instances
Fix typos and add the following to the scripts/spelling.txt:

  overide||override

While we are here, fix the doubled "address" in the touched line
Documentation/devicetree/bindings/regulator/ti-abb-regulator.txt.

Also, fix the comment block style in the touched hunks in
drivers/media/dvb-frontends/drx39xyj/drx_driver.h.

Link: http://lkml.kernel.org/r/1481573103-11329-21-git-send-email-yamada.masahiro@socionext.com
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-03-09 17:01:09 -08:00
Masahiro Yamada 8a1115ff6b scripts/spelling.txt: add "disble(d)" pattern and fix typo instances
Fix typos and add the following to the scripts/spelling.txt:

  disble||disable
  disbled||disabled

I kept the TSL2563_INT_DISBLED in /drivers/iio/light/tsl2563.c
untouched.  The macro is not referenced at all, but this commit is
touching only comment blocks just in case.

Link: http://lkml.kernel.org/r/1481573103-11329-20-git-send-email-yamada.masahiro@socionext.com
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-03-09 17:01:09 -08:00
Linus Torvalds c1aa905a30 Power management updates for v4.11-rc2
- Three fixes for intel_pstate problems related to the passive
    mode (in which it acts as a regular cpufreq scaling driver), two
    for the handling of global P-state limits and one for the handling
    of the cpu_frequency tracepoint in that mode (Rafael Wysocki).
 
  - Three fixes for the handling of P-state limits in intel_pstate in
    the active mode (Rafael Wysocki).
 
  - Introduction of a new cpufreq.off=1 kernel command line argument
    that will disable cpufreq entirely if passed to the kernel and
    is simply hooked up to the existing code used by Xen (Len Brown).
 
  - Fix for the schedutil cpufreq governor to prevent it from using
    stale raw frequency values in configurations with mutiple CPUs
    sharing one policy object and a cleanup for it reducing its
    overhead slightly (Viresh Kumar).
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJYwc9bAAoJEILEb/54YlRxU24P/2RPFpRYNtGNfbnvmyNjcA5H
 MLQ/i0HEqhBVsfsj2Kmy6sRkY/X/kqnOiD2PzGcaOMTtMx2FcLNHzNqH0P0I7A4W
 9wzUcq0SC5FzTJcLryN4gtJa3tfBDHjr6peAcZyji5t3DbGa+mygVwH8IGyr4feo
 u7PE/6eXLfkRIySwG/kCI/YVE8tWuhjDHbjhR1pjgyrMjhDbPP480Mgsi4eDRRTO
 BwGVyQJXoMo9e7/vZ8Y8Fkt7PMxYeyeE1yAGHkJzjkFfdrkZrn3IUPfgVgxy8IPV
 N3w1BB+H84duEswPZH2patdIKQxXG3r0eaGHTXZIwy+5sHyc3mUMJ1FeHR67gasd
 Z4p4hylTP+dGZGDo4G7PbVX4V6zEP+LSxgOQYjbo4n6k3nJJOrIF4wt+l5+tQz5m
 Y+V6XVHgZm1LyFgjj06jMeVSmk6SS7X0qBHJ4WcfGzV/J+vStXmIvAmljs+cd/u2
 R9z1W/rxelZFKT+4lRCuztpBYfJvlE5nXyM1XwC6Rz6QjFax0pJAHUPFznWyq24C
 GlMAvQWPapDEoOvDrS/QKeqX1ROOYf2FwHs+uvPPCvOxMjL9NCQsQ34tqCM3MiL/
 ebk3uZ0xC1e0LaOB87xr7tfsag12MyZzSCwX0D8nBLBLvntpa/xQwE5+4vu9ZguD
 xlarnIsEh6bTwTVH6DJP
 =Gwp7
 -----END PGP SIGNATURE-----

Merge tag 'pm-4.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fixes from Rafael Wysocki:
 "These fix several issues in the intel_pstate driver and one issue in
  the schedutil cpufreq governor, clean up that governor a bit and hook
  up existing code for disabling cpufreq to a new kernel command line
  option.

  Specifics:

   - Three fixes for intel_pstate problems related to the passive mode
     (in which it acts as a regular cpufreq scaling driver), two for the
     handling of global P-state limits and one for the handling of the
     cpu_frequency tracepoint in that mode (Rafael Wysocki).

   - Three fixes for the handling of P-state limits in intel_pstate in
     the active mode (Rafael Wysocki).

   - Introduction of a new cpufreq.off=1 kernel command line argument
     that will disable cpufreq entirely if passed to the kernel and is
     simply hooked up to the existing code used by Xen (Len Brown).

   - Fix for the schedutil cpufreq governor to prevent it from using
     stale raw frequency values in configurations with mutiple CPUs
     sharing one policy object and a cleanup for it reducing its
     overhead slightly (Viresh Kumar)"

* tag 'pm-4.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  cpufreq: intel_pstate: Do not reinit performance limits in ->setpolicy
  cpufreq: intel_pstate: Fix intel_pstate_verify_policy()
  cpufreq: intel_pstate: Fix global settings in active mode
  cpufreq: Add the "cpufreq.off=1" cmdline option
  cpufreq: schedutil: Pass sg_policy to get_next_freq()
  cpufreq: schedutil: move cached_raw_freq to struct sugov_policy
  cpufreq: intel_pstate: Avoid triggering cpu_frequency tracepoint unnecessarily
  cpufreq: intel_pstate: Fix intel_cpufreq_verify_policy()
  cpufreq: intel_pstate: Do not use performance_limits in passive mode
2017-03-09 16:30:37 -08:00
Linus Torvalds bd0f9b356d sched/headers: fix up header file dependency on <linux/sched/signal.h>
The scheduler header file split and cleanups ended up exposing a few
nasty header file dependencies, and in particular it showed how we in
<linux/wait.h> ended up depending on "signal_pending()", which now comes
from <linux/sched/signal.h>.

That's a very subtle and annoying dependency, which already caused a
semantic merge conflict (see commit e58bc92783 "Pull overlayfs updates
from Miklos Szeredi", which added that fixup in the merge commit).

It turns out that we can avoid this dependency _and_ improve code
generation by moving the guts of the fairly nasty helper #define
__wait_event_interruptible_locked() to out-of-line code.  The code that
includes the signal_pending() check is all in the slow-path where we
actually go to sleep waiting for the event anyway, so using a helper
function is the right thing to do.

Using a helper function is also what we already did for the non-locked
versions, see the "__wait_event*()" macros and the "prepare_to_wait*()"
set of helper functions.

We might want to try to unify all these macro games, we have a _lot_ of
subtly different wait-event loops.  But this is the minimal patch to fix
the annoying header dependency.

Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-03-08 10:36:03 -08:00
Linus Torvalds 8a9172356f Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Ingo Molnar:
 "This includes a fix for lockups caused by incorrect nsecs related
  cleanup, and a capabilities check fix for timerfd"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  jiffies: Revert bogus conversion of NSEC_PER_SEC to TICK_NSEC
  timerfd: Only check CAP_WAKE_ALARM when it is needed
2017-03-07 14:45:22 -08:00
Linus Torvalds 609b07b72d Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
 "A fix for KVM's scheduler clock which (erroneously) was always marked
  unstable, a fix for RT/DL load balancing, plus latency fixes"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/clock, x86/tsc: Rework the x86 'unstable' sched_clock() interface
  sched/core: Fix pick_next_task() for RT,DL
  sched/fair: Make select_idle_cpu() more aggressive
2017-03-07 14:42:34 -08:00
Linus Torvalds c3abcabe81 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "This includes a fix for a crash if certain special addresses are
  kprobed, plus does a rename of two Kconfig variables that were a minor
  misnomer"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/core: Rename CONFIG_[UK]PROBE_EVENT to CONFIG_[UK]PROBE_EVENTS
  kprobes/x86: Fix kernel panic when certain exception-handling addresses are probed
2017-03-07 14:38:16 -08:00
Linus Torvalds 500e1af252 Merge branch 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fixes from Ingo Molnar:

 - Change the new refcount_t warnings from WARN() to WARN_ONCE()

 - two ww_mutex fixes

 - plus a new lockdep self-consistency check for a bug that triggered in
   practice

* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  locking/ww_mutex: Adjust the lock number for stress test
  locking/lockdep: Add nest_lock integrity test
  locking/ww_mutex: Replace cpu_relax() with cond_resched() for tests
  locking/refcounts: Change WARN() to WARN_ONCE()
2017-03-07 14:33:11 -08:00
Linus Torvalds 304362a8bc Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull namespace fix from Eric Biederman:
 "This fixes a race between put_ucounts and get_ucounts that can cause a
  use after free. The fix works by simplifying the code and so there is
  not even a temptation to be clever and play spinlock vs atomic
  reference games"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  ucount: Remove the atomicity from ucount->count
2017-03-07 10:06:25 -08:00
Linus Torvalds f26db9649a There was some breakage with the changes for jump labels in the 4.11 merge
window. Namely powerpc broke as jump labels uses the two LSB bits as flags
 in initialization. A check was added to make sure that all jump label
 entries were 4 bytes aligned, but powerpc didn't work that way for modules.
 Adding an alignment in the module linker script appeared to be the best
 solution.
 
 Jump labels also added an anonymous union to access those LSB bits as a
 normal long. But because this structure had static initialization, it broke
 older compilers that could not statically initialize anonymous unions
 without brackets.
 
 The command line parameter for setting function graph filter broke the
 "EMPTY_HASH" descriptor by modifying it instead of creating a new hash to
 hold the entries.
 
 The command line parameter ftrace_graph_max_depth was added to allow its
 setting at boot time. It uses existing code and only the command line hook
 was added. This is not really a fix, but as it uses existing code without
 affecting anything else, I added it to this release. It was ready before the
 merge window closed, but I wanted to let it sit in linux-next for a couple
 of days first.
 -----BEGIN PGP SIGNATURE-----
 
 iQExBAABCAAbBQJYvNrAFBxyb3N0ZWR0QGdvb2RtaXMub3JnAAoJEMm5BfJq2Y3L
 JGQIAMkayeZ0OCyYHRPR4EcCrdE3fATmt1huJWHrMPnT4/fLabL8XQqrOpnOBMq1
 GFZb1SMkBmvGtAHF4GbvCxnIUfDQko6BTQAd8EMea1WM8+Kb66/BLgJawjWIU9I0
 dNYre9ONgR2NOzkz6nfKRXnmy0lRcOweBb09YYGSzY11Md7d8T3T4TUrPNZdYrO9
 8ZMbF4qRd9KLMRHcsWqvhWhBISxWnmtUSlthfweukKgDMy8OKpb7pR0ckjtYwsWX
 RF41jqLqzSUqtd/nE2Sj/aT8XOP4pfrKEUuNM4SBj8q5jmNcZuqi8Q9wItu3LWR2
 jqM/9UKTzaCr9cchwuvUC0i+jWc=
 =kDql
 -----END PGP SIGNATURE-----

Merge tag 'trace-v4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fixes from Steven Rostedt:
 "There was some breakage with the changes for jump labels in the 4.11
  merge window:

   - powerpc broke as jump labels uses the two LSB bits as flags in
     initialization.

     A check was added to make sure that all jump label entries were 4
     bytes aligned, but powerpc didn't work that way for modules. Adding
     an alignment in the module linker script appeared to be the best
     solution.

   - Jump labels also added an anonymous union to access those LSB bits
     as a normal long. But because this structure had static
     initialization, it broke older compilers that could not statically
     initialize anonymous unions without brackets.

   - The command line parameter for setting function graph filter broke
     the "EMPTY_HASH" descriptor by modifying it instead of creating a
     new hash to hold the entries.

   - The command line parameter ftrace_graph_max_depth was added to
     allow its setting at boot time. It uses existing code and only the
     command line hook was added.

     This is not really a fix, but as it uses existing code without
     affecting anything else, I added it to this release. It was ready
     before the merge window closed, but I wanted to let it sit in
     linux-next for a couple of days first"

* tag 'trace-v4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  ftrace/graph: Add ftrace_graph_max_depth kernel parameter
  tracing: Add #undef to fix compile error
  jump_label: Add comment about initialization order for anonymous unions
  jump_label: Fix anonymous union initialization
  module: set __jump_table alignment to 8
  ftrace/graph: Do not modify the EMPTY_HASH for the function_graph filter
  tracing: Fix code comment for ftrace_ops_get_func()
2017-03-07 09:37:28 -08:00
Frederic Weisbecker fa3aa7a54f jiffies: Revert bogus conversion of NSEC_PER_SEC to TICK_NSEC
commit 93825f2ec7 converted NSEC_PER_SEC to TICK_NSEC because the author
confused NSEC_PER_JIFFY with NSEC_PER_SEC.

As a result, the calculation of refined jiffies got broken, triggering
lockups.

Fixes: 93825f2ec7 ("jiffies: Reuse TICK_NSEC instead of NSEC_PER_JIFFY")
Reported-and-tested-by: Meelis Roos <mroos@linux.ee>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1488880534-3777-1-git-send-email-fweisbec@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-03-07 11:03:28 +01:00
Eric W. Biederman 040757f738 ucount: Remove the atomicity from ucount->count
Always increment/decrement ucount->count under the ucounts_lock.  The
increments are there already and moving the decrements there means the
locking logic of the code is simpler.  This simplification in the
locking logic fixes a race between put_ucounts and get_ucounts that
could result in a use-after-free because the count could go zero then
be found by get_ucounts and then be freed by put_ucounts.

A bug presumably this one was found by a combination of syzkaller and
KASAN.  JongWhan Kim reported the syzkaller failure and Dmitry Vyukov
spotted the race in the code.

Cc: stable@vger.kernel.org
Fixes: f6b2db1a3e ("userns: Make the count of user namespaces per user")
Reported-by: JongHwan Kim <zzoru007@gmail.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2017-03-06 15:26:37 -06:00
Viresh Kumar 655cb1ebff cpufreq: schedutil: Pass sg_policy to get_next_freq()
get_next_freq() uses sg_cpu only to get sg_policy, which the callers of
get_next_freq() already have. Pass sg_policy instead of sg_cpu to
get_next_freq(), to make it more efficient.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-03-05 23:58:48 +01:00
Viresh Kumar 6c4f0fa643 cpufreq: schedutil: move cached_raw_freq to struct sugov_policy
cached_raw_freq applies to the entire cpufreq policy and not individual
CPUs. Apart from wasting per-cpu memory, it is actually wrong to keep it
in struct sugov_cpu as we may end up comparing next_freq with a stale
cached_raw_freq of a random CPU.

Move cached_raw_freq to struct sugov_policy.

Fixes: 5cbea46984 (cpufreq: schedutil: map raw required frequency to driver frequency)
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-03-05 23:58:48 +01:00
Linus Torvalds 8d70eeb84a Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Fix double-free in batman-adv, from Sven Eckelmann.

 2) Fix packet stats for fast-RX path, from Joannes Berg.

 3) Netfilter's ip_route_me_harder() doesn't handle request sockets
    properly, fix from Florian Westphal.

 4) Fix sendmsg deadlock in rxrpc, from David Howells.

 5) Add missing RCU locking to transport hashtable scan, from Xin Long.

 6) Fix potential packet loss in mlxsw driver, from Ido Schimmel.

 7) Fix race in NAPI handling between poll handlers and busy polling,
    from Eric Dumazet.

 8) TX path in vxlan and geneve need proper RCU locking, from Jakub
    Kicinski.

 9) SYN processing in DCCP and TCP need to disable BH, from Eric
    Dumazet.

10) Properly handle net_enable_timestamp() being invoked from IRQ
    context, also from Eric Dumazet.

11) Fix crash on device-tree systems in xgene driver, from Alban Bedel.

12) Do not call sk_free() on a locked socket, from Arnaldo Carvalho de
    Melo.

13) Fix use-after-free in netvsc driver, from Dexuan Cui.

14) Fix max MTU setting in bonding driver, from WANG Cong.

15) xen-netback hash table can be allocated from softirq context, so use
    GFP_ATOMIC. From Anoob Soman.

16) Fix MAC address change bug in bgmac driver, from Hari Vyas.

17) strparser needs to destroy strp_wq on module exit, from WANG Cong.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (69 commits)
  strparser: destroy workqueue on module exit
  sfc: fix IPID endianness in TSOv2
  sfc: avoid max() in array size
  rds: remove unnecessary returned value check
  rxrpc: Fix potential NULL-pointer exception
  nfp: correct DMA direction in XDP DMA sync
  nfp: don't tell FW about the reserved buffer space
  net: ethernet: bgmac: mac address change bug
  net: ethernet: bgmac: init sequence bug
  xen-netback: don't vfree() queues under spinlock
  xen-netback: keep a local pointer for vif in backend_disconnect()
  netfilter: nf_tables: don't call nfnetlink_set_err() if nfnetlink_send() fails
  netfilter: nft_set_rbtree: incorrect assumption on lower interval lookups
  netfilter: nf_conntrack_sip: fix wrong memory initialisation
  can: flexcan: fix typo in comment
  can: usb_8dev: Fix memory leak of priv->cmd_msg_buffer
  can: gs_usb: fix coding style
  can: gs_usb: Don't use stack memory for USB transfers
  ixgbe: Limit use of 2K buffers on architectures with 256B or larger cache lines
  ixgbe: update the rss key on h/w, when ethtool ask for it
  ...
2017-03-04 17:31:39 -08:00
Linus Torvalds 1827adb11a Merge branch 'WIP.sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull sched.h split-up from Ingo Molnar:
 "The point of these changes is to significantly reduce the
  <linux/sched.h> header footprint, to speed up the kernel build and to
  have a cleaner header structure.

  After these changes the new <linux/sched.h>'s typical preprocessed
  size goes down from a previous ~0.68 MB (~22K lines) to ~0.45 MB (~15K
  lines), which is around 40% faster to build on typical configs.

  Not much changed from the last version (-v2) posted three weeks ago: I
  eliminated quirks, backmerged fixes plus I rebased it to an upstream
  SHA1 from yesterday that includes most changes queued up in -next plus
  all sched.h changes that were pending from Andrew.

  I've re-tested the series both on x86 and on cross-arch defconfigs,
  and did a bisectability test at a number of random points.

  I tried to test as many build configurations as possible, but some
  build breakage is probably still left - but it should be mostly
  limited to architectures that have no cross-compiler binaries
  available on kernel.org, and non-default configurations"

* 'WIP.sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (146 commits)
  sched/headers: Clean up <linux/sched.h>
  sched/headers: Remove #ifdefs from <linux/sched.h>
  sched/headers: Remove the <linux/topology.h> include from <linux/sched.h>
  sched/headers, hrtimer: Remove the <linux/wait.h> include from <linux/hrtimer.h>
  sched/headers, x86/apic: Remove the <linux/pm.h> header inclusion from <asm/apic.h>
  sched/headers, timers: Remove the <linux/sysctl.h> include from <linux/timer.h>
  sched/headers: Remove <linux/magic.h> from <linux/sched/task_stack.h>
  sched/headers: Remove <linux/sched.h> from <linux/sched/init.h>
  sched/core: Remove unused prefetch_stack()
  sched/headers: Remove <linux/rculist.h> from <linux/sched.h>
  sched/headers: Remove the 'init_pid_ns' prototype from <linux/sched.h>
  sched/headers: Remove <linux/signal.h> from <linux/sched.h>
  sched/headers: Remove <linux/rwsem.h> from <linux/sched.h>
  sched/headers: Remove the runqueue_is_locked() prototype
  sched/headers: Remove <linux/sched.h> from <linux/sched/hotplug.h>
  sched/headers: Remove <linux/sched.h> from <linux/sched/debug.h>
  sched/headers: Remove <linux/sched.h> from <linux/sched/nohz.h>
  sched/headers: Remove <linux/sched.h> from <linux/sched/stat.h>
  sched/headers: Remove the <linux/gfp.h> include from <linux/sched.h>
  sched/headers: Remove <linux/rtmutex.h> from <linux/sched.h>
  ...
2017-03-03 10:16:38 -08:00
Todd Brandt 65a50c6562 ftrace/graph: Add ftrace_graph_max_depth kernel parameter
Early trace callgraphs can be extremely large on systems with
several seconds of boot time. The max_depth parameter limits how
deep the graph trace goes and reduces the output size. This
parameter is the same as the max_graph_depth file in tracefs.

Link: http://lkml.kernel.org/r/1488499935-23216-1-git-send-email-todd.e.brandt@linux.intel.com

Signed-off-by: Todd Brandt <todd.e.brandt@linux.intel.com>
[ changed comments about debugfs to tracefs ]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-03-03 09:45:01 -05:00
Steven Rostedt (VMware) 92ad18ec26 ftrace/graph: Do not modify the EMPTY_HASH for the function_graph filter
On boot up, if the kernel command line sets a graph funtion with the kernel
command line options "ftrace_graph_filter" or "ftrace_graph_notrace" then it
updates the corresponding function graph hash, ftrace_graph_hash or
ftrace_graph_notrace_hash respectively. Unfortunately, at boot up, these
variables are pointers to the "EMPTY_HASH" which is a constant used as a
placeholder when a hash has no entities. The problem was that the comand
line version to set the hashes updated the actual EMPTY_HASH instead of
creating a new hash for the function graph. This broke the EMPTY_HASH
because not only did it modify a constant (not sure how that was allowed to
happen, except maybe because it was done at early boot, const variables were
still mutable), but it made the filters have functions listed in them when
they were actually empty.

The kernel command line function needs to allocate a new hash for the
function graph filters and assign the necessary variables to that new hash
instead.

Link: http://lkml.kernel.org/r/1488420091.7212.17.camel@linux.intel.com

Cc: Namhyung Kim <namhyung@kernel.org>
Fixes: b9b0c831be ("ftrace: Convert graph filter to use hash tables")
Reported-by: Todd Brandt <todd.e.brandt@linux.intel.com>
Tested-by: Todd Brandt <todd.e.brandt@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-03-03 09:44:17 -05:00
Linus Torvalds 080e4168c0 More power management updates for v4.11-rc1
- Fix for a cpuidle menu governor problem that started to take an
    unnecessary spinlock after one of the recent updates and that
    did not play well with the RT patch (Rafael Wysocki).
 
  - Fix for the new intel_pstate operation mode switching feature
    added recently that did not reinitialize P-state limits properly
    when switching operation modes (Rafael Wysocki).
 
  - Removal of unused global notifiers from the PM QoS framework
    (Viresh Kumar).
 
  - Generic power domains framework update to make it handle
    asynchronous invocations of PM callbacks in the "noirq" phases
    of system suspend/hibernation correctly (Ulf Hansson).
 
  - Two hibernation core cleanups (Rafael Wysocki).
 
  - intel_idle cleanup related to the sysfs interface (Len Brown).
 
  - Off-by-one bug fix in the OPP (Operating Performance Points)
    framework (Andrzej Hajda).
 
  - OPP framework's documentation fix (Viresh Kumar).
 
  - cpufreq qoriq driver cleanup (Tang Yuantian).
 
  - Fixes for typos in comments in the device runtime PM framework
    (Christophe Jaillet).
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJYuLeGAAoJEILEb/54YlRxJvcP/0BRmh8Hn4Itx/NIWNg71X6j
 U+v8Pn8T3MP33gCcleLYlgre2JUIAUDmhdK99+UOx+/abhjMhQSaF3HhTOwYaPtQ
 6njoHVS0NnfqUf+x5kp+EpRxBVNucYVbdRTVd1DsHIeLLz/96DFOzb/R7tko/pKx
 pFMWvNdotHLLgXOG1UvdRimwTDlFMffxFzD8Se53LPjRXS0S73A5VWfqZOye44Re
 j3W1AJ0Idgq5uduA6J8x1MWbaxDq1h+j6CSUm05yvqrINzxXwXt0Hv6stCQTo+Gb
 YMdiBd8MujNyAgcchw3jiDQ8Vp+zmfLPcHrfPe//SSefj26eB8LyVNSYelvbUdOz
 cNjvyErva37MmaegCL9QC7WbLM+A7VE6bU6YzDCi/rR8jYMJ51Fb9jGiYb/oimry
 OLlblEekikUsskWv4hGV1JVt5VhmUMlagWtexxn+lMszATcZro0tfXu/vgQWksYs
 noUnwuWJWxvj2aNMsvbzW3HLlTGSmYl2UxJ7IymQQaTDblwF9Kg61rm3+5coUctd
 ifceynDVp9Gju25faYgZ+Dq9+o8ktlOGOHRRPdLIRNJ/T+4tUDnlGkdbPb+Tfn03
 XUIzYCu74U8/oW8gOk6t0WpmWzvxEXNgdirdEIR6y3loYIC0Jr3v4gyD975Eug74
 Hzfrdg7ignAmWV+nf6UY
 =SeUF
 -----END PGP SIGNATURE-----

Merge tag 'pm-extra-4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull more power management updates deom Rafael Wysocki:
 "These fix two bugs introduced by recent power management updates (in
  the cpuidle menu governor and intel_pstate) and a few other issues,
  clean up things and remove unused code.

  Specifics:

   - Fix for a cpuidle menu governor problem that started to take an
     unnecessary spinlock after one of the recent updates and that did
     not play well with the RT patch (Rafael Wysocki).

   - Fix for the new intel_pstate operation mode switching feature added
     recently that did not reinitialize P-state limits properly when
     switching operation modes (Rafael Wysocki).

   - Removal of unused global notifiers from the PM QoS framework
     (Viresh Kumar).

   - Generic power domains framework update to make it handle
     asynchronous invocations of PM callbacks in the "noirq" phases of
     system suspend/hibernation correctly (Ulf Hansson).

   - Two hibernation core cleanups (Rafael Wysocki).

   - intel_idle cleanup related to the sysfs interface (Len Brown).

   - Off-by-one bug fix in the OPP (Operating Performance Points)
     framework (Andrzej Hajda).

   - OPP framework's documentation fix (Viresh Kumar).

   - cpufreq qoriq driver cleanup (Tang Yuantian).

   - Fixes for typos in comments in the device runtime PM framework
     (Christophe Jaillet)"

* tag 'pm-extra-4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  PM / OPP: Documentation: Fix opp-microvolt in examples
  intel_idle: stop exposing platform acronyms in sysfs
  cpufreq: intel_pstate: Fix limits issue with operation mode switching
  PM / hibernate: Define pr_fmt() and use pr_*() instead of printk()
  PM / hibernate: Untangle power_down()
  cpuidle: menu: Avoid taking spinlock for accessing QoS values
  PM / QoS: Remove global notifiers
  PM / runtime: Fix some typos
  cpufreq: qoriq: clean up unused code
  PM / OPP: fix off-by-one bug in dev_pm_opp_get_max_volt_latency loop
  PM / Domains: Power off masters immediately in the power off sequence
  PM / Domains: Rename is_async to one_dev_on for genpd_power_off()
  PM / Domains: Move genpd_power_off() above genpd_power_on()
2017-03-02 17:33:52 -08:00
Ingo Molnar 50ff9d1300 sched/headers: Remove <linux/magic.h> from <linux/sched/task_stack.h>
It's not used by any of the scheduler methods, but <linux/sched/task_stack.h>
needs it to pick up STACK_END_MAGIC.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-03 01:45:39 +01:00
Ingo Molnar cd9c513be3 sched/headers: Remove <linux/rwsem.h> from <linux/sched.h>
This is a stray header that is not needed by anything in sched.h,
so remove it.

Update files that relied on the stray inclusion.

This reduces the size of the header dependency graph.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-03 01:45:36 +01:00
Ingo Molnar 1050b27c52 sched/headers: Move cputime functionality from <linux/sched.h> and <linux/cputime.h> into <linux/sched/cputime.h>
Move cputime related functionality out of <linux/sched.h>, as most code
that includes <linux/sched.h> does not use that functionality.

Move data types that are not included in task_struct directly to
the signal definitions, into <linux/sched/signal.h>.

Also merge the (small) existing <linux/cputime.h> header into <linux/sched/cputime.h>.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-03 01:45:22 +01:00
Ingo Molnar 56cd697366 sched/headers: Move the task_lock()/unlock() APIs to <linux/sched/task.h>
The task_lock()/task_unlock() APIs are not realated to core scheduling,
they are task lifetime APIs, i.e. they belong into <linux/sched/task.h>.

Move them.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-03 01:45:22 +01:00
Ingo Molnar 6bfbaa51ed sched/headers, RCU: Move rcu_copy_process() from <linux/sched/task.h> to kernel/fork.c
Move rcu_copy_process() into kernel/fork.c, which is the only
user of this inline function.

This simplifies <linux/sched/task.h> to the level that <linux/sched.h>
does not have to be included in it anymore - which change is done
in a subsequent patch.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-03 01:43:46 +01:00
Ingo Molnar c3edc4010e sched/headers: Move task_struct::signal and task_struct::sighand types and accessors into <linux/sched/signal.h>
task_struct::signal and task_struct::sighand are pointers, which would normally make it
straightforward to not define those types in sched.h.

That is not so, because the types are accompanied by a myriad of APIs (macros and inline
functions) that dereference them.

Split the types and the APIs out of sched.h and move them into a new header, <linux/sched/signal.h>.

With this change sched.h does not know about 'struct signal' and 'struct sighand' anymore,
trying to put accessors into sched.h as a test fails the following way:

  ./include/linux/sched.h: In function ‘test_signal_types’:
  ./include/linux/sched.h:2461:18: error: dereferencing pointer to incomplete type ‘struct signal_struct’
                    ^

This reduces the size and complexity of sched.h significantly.

Update all headers and .c code that relied on getting the signal handling
functionality from <linux/sched.h> to include <linux/sched/signal.h>.

The list of affected files in the preparatory patch was partly generated by
grepping for the APIs, and partly by doing coverage build testing, both
all[yes|mod|def|no]config builds on 64-bit and 32-bit x86, and an array of
cross-architecture builds.

Nevertheless some (trivial) build breakage is still expected related to rare
Kconfig combinations and in-flight patches to various kernel code, but most
of it should be handled by this patch.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-03 01:43:37 +01:00
Rafael J. Wysocki 9b5e9cb164 Merge branches 'pm-cpuidle', 'pm-cpufreq' and 'pm-sleep'
* pm-cpuidle:
  intel_idle: stop exposing platform acronyms in sysfs
  cpuidle: menu: Avoid taking spinlock for accessing QoS values

* pm-cpufreq:
  cpufreq: intel_pstate: Fix limits issue with operation mode switching
  cpufreq: qoriq: clean up unused code

* pm-sleep:
  PM / hibernate: Define pr_fmt() and use pr_*() instead of printk()
  PM / hibernate: Untangle power_down()
2017-03-03 00:43:11 +01:00