Commit Graph

431786 Commits

Author SHA1 Message Date
Sebastian Andrzej Siewior 88f8541ddd gpio: omap: use raw locks for locking
This patch converts gpio_bank.lock from a spin_lock into a
raw_spin_lock. The call path to access this lock is always under a
raw_spin_lock, for instance:
- __setup_irq() holds &desc->lock with irq off
  + __irq_set_trigger()
   + omap_gpio_irq_type()

- handle_level_irq() (runs with irqs off therefore raw locks)
  + mask_ack_irq()
   + omap_gpio_mask_irq()

This fixes the obvious backtrace on -RT. However the locking vs. context
problem remains, and it is not limited to -RT:
- omap_gpio_irq_type() is called with IRQs off and has a conditional
  call to pm_runtime_get_sync() which may sleep. Whether it happens or
  not, pm_runtime_get_sync() should not be called with irqs off.

- omap_gpio_debounce() is holding the lock with IRQs off.
  + omap2_set_gpio_debounce()
   + clk_prepare_enable()
    + clk_prepare() this one might sleep.
  The number of users of gpiod_set_debounce() / gpio_set_debounce()
  looks low, but this is still not good.
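
In diff form, the conversion is essentially (a sketch; context abbreviated):

 struct gpio_bank {
 	...
-	spinlock_t lock;
+	raw_spinlock_t lock;
 };

-	spin_lock_irqsave(&bank->lock, flags);
+	raw_spin_lock_irqsave(&bank->lock, flags);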

Cc: stable-rt@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2020-10-14 00:59:23 +03:00
Thomas Gleixner 9673232a79 workqueue: Prevent deadlock/stall on RT
Austin reported an XFS deadlock/stall on RT where scheduled work never
gets executed and tasks are waiting for each other forever.

The underlying problem is the modification of the RT code to the
handling of workers which are about to go to sleep. In mainline a
worker thread which goes to sleep wakes an idle worker if there is
more work to do. This happens from the guts of the schedule()
function. On RT this must happen outside the scheduler, and the accessed
data structures are not protected against scheduling due to the
spinlock-to-rtmutex conversion. So the naive solution was to move the code
outside of the scheduler and protect the data structures by the pool
lock. That approach turned out to be a little naive, as we cannot call
into that code when the thread blocks on a lock, since it is not allowed
to block on two locks in parallel. So we don't call into the worker
wakeup magic when the worker is blocked on a lock, which causes the
deadlock/stall observed by Austin and Mike.

Looking deeper into that worker code it turns out that the only
relevant data structure which needs to be protected is the list of
idle workers which can be woken up.

So the solution is to protect the list manipulation operations with
preempt_enable/disable pairs on RT and call unconditionally into the
worker code even when the worker is blocked on a lock. The preemption
protection is safe as there is nothing which can fiddle with the list
outside of thread context.
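
A sketch of the resulting pattern (locking context simplified; not the
literal patch):

	/* Safe under preempt_disable(): the idle list is only
	 * manipulated from thread context. */
	preempt_disable();
	if (!list_empty(&pool->idle_list)) {
		struct worker *w = list_first_entry(&pool->idle_list,
						    struct worker, entry);
		wake_up_process(w->task);
	}
	preempt_enable();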

Reported-and-tested-by: Austin Schuh <austin@peloton-tech.com>
Reported-and-tested-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: http://vger.kernel.org/r/alpine.DEB.2.10.1406271249510.5170@nanos
Cc: Richard Weinberger <richard.weinberger@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: stable-rt@vger.kernel.org
2020-10-14 00:59:23 +03:00
Steven Rostedt 504c1e6c4a sched: Do not clear PF_NO_SETAFFINITY flag in select_fallback_rq()
I talked with Peter Zijlstra about this, and he told me that the clearing
of the PF_NO_SETAFFINITY flag was to deal with the optimization of
migrate_disable/enable() that ignores tasks that have that flag set. But
that optimization was removed when I did a rework of the cpu hotplug code.

I found that ignoring tasks that had that flag set would cause those tasks
to not sync with the hotplug code and cause the kernel to crash. Thus we
needed to not treat them specially, and those tasks had to go through the
same work as tasks without that flag set.

Now that those tasks are not treated special, there's no reason to clear the
flag.

This may still need to be tested, as the migrate_me() code does not ignore
that flag.
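
The change then boils down to dropping the clearing in select_fallback_rq()
(sketch; surrounding context omitted):

-	p->flags &= ~PF_NO_SETAFFINITY;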

Cc: stable-rt@vger.kernel.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Clark Williams <williams@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140701111444.0cfebaa1@gandalf.local.home
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2020-10-14 00:59:23 +03:00
Sebastian Andrzej Siewior cd93a88a67 disable preempt lazy on x86-64
it still explodes

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2020-10-14 00:59:23 +03:00
Sebastian Andrzej Siewior 28abbe8efe md: disable bcache
It uses anon semaphores
|drivers/md/bcache/request.c: In function ‘cached_dev_write_complete’:
|drivers/md/bcache/request.c:1007:2: error: implicit declaration of function ‘up_read_non_owner’ [-Werror=implicit-function-declaration]
|  up_read_non_owner(&dc->writeback_lock);
|  ^
|drivers/md/bcache/request.c: In function ‘request_write’:
|drivers/md/bcache/request.c:1033:2: error: implicit declaration of function ‘down_read_non_owner’ [-Werror=implicit-function-declaration]
|  down_read_non_owner(&dc->writeback_lock);
|  ^

either we get rid of those or we have to introduce them…
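
The usual RT way to disable such a driver is a Kconfig guard (sketch):

 config BCACHE
 	tristate "Block device as cache"
+	depends on !PREEMPT_RT_FULL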

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2020-10-14 00:59:23 +03:00
Steven Rostedt 646a8ab0b1 rt,ntp: Move call to schedule_delayed_work() to helper thread
The ntp code for notify_cmos_timer() is called from a hard interrupt
context. Under PREEMPT_RT_FULL, schedule_delayed_work() takes spinlocks
that have been converted to mutexes, thus calling schedule_delayed_work()
from interrupt context is not safe.

Add a helper thread that does the call to schedule_delayed_work() and wake
that thread up instead of calling schedule_delayed_work() directly.
This is only for CONFIG_PREEMPT_RT_FULL; otherwise the code still calls
schedule_delayed_work() directly in irq context.
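
A sketch of the helper-thread pattern (the thread's name and the exact
wiring are assumptions, not the literal patch; the real code also needs a
wakeup flag to avoid missed wakeups):

	/* CONFIG_PREEMPT_RT_FULL: defer to thread context */
	static int cmos_delay_thread(void *ignore)
	{
		while (!kthread_should_stop()) {
			set_current_state(TASK_INTERRUPTIBLE);
			schedule();
			schedule_delayed_work(&sync_cmos_work, 0);
		}
		return 0;
	}

notify_cmos_timer() then merely does wake_up_process() on this thread from
hard interrupt context.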

Note: There are a few places in the kernel that do this. Perhaps the RT
code should have a dedicated thread that does the checks. Just register
a notifier on boot up for your check and wake up the thread when
needed. This will be a todo.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2020-10-14 00:59:23 +03:00
Sebastian Andrzej Siewior b1fcb3c08e a few open coded completions
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2020-10-14 00:59:23 +03:00
Thomas Gleixner 0d3b12ccc6 completion: Use simple wait queues
Completions have no long lasting callbacks and therefore do not need
the complex waitqueue variant. Use simple waitqueues, which reduces the
contention on the waitqueue lock.
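
At its core this is a type change in struct completion (sketch, using the
simple-wait types of this tree):

 struct completion {
 	unsigned int done;
-	wait_queue_head_t wait;
+	struct swait_head wait;
 };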

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2020-10-14 00:59:23 +03:00
Thomas Gleixner a4203240fa rcu-more-swait-conversions.patch
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

Merged Steven's

 static void rcu_nocb_gp_cleanup(struct rcu_state *rsp, struct rcu_node *rnp) {
-       swait_wake(&rnp->nocb_gp_wq[rnp->completed & 0x1]);
+       wake_up_all(&rnp->nocb_gp_wq[rnp->completed & 0x1]);
 }

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2020-10-14 00:59:23 +03:00
Sebastian Andrzej Siewior 4437a7dee2 kernel/treercu: use a simple waitqueue
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2020-10-14 00:59:23 +03:00
Paul Gortmaker ca0179a36a simple-wait: rename and export the equivalent of waitqueue_active()
The function "swait_head_has_waiters()" was internalized into
wait-simple.c but it parallels the waitqueue_active of normal
waitqueue support. Given that there are over 150 waitqueue_active
users in drivers/ fs/ kernel/ and the like, lets make it globally
visible, and rename it to parallel the waitqueue_active accordingly.
We'll need to do this if we expect to expand its usage beyond RT.
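
Sketch of the rename; the new name is an assumption derived from the
description above:

-static inline bool swait_head_has_waiters(struct swait_head *h);
+extern bool swaitqueue_active(struct swait_head *h);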

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2020-10-14 00:59:23 +03:00
Thomas Gleixner 65870d64d3 wait-simple: Rework for use with completions
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2020-10-14 00:59:23 +03:00
Thomas Gleixner 99076b3731 wait-simple: Simple waitqueue implementation
wait_queue is a Swiss army knife and in most of the cases the
complexity is not needed. For RT, waitqueues are a constant source of
trouble as we can't convert the head lock to a raw spinlock due to
fancy and long lasting callbacks.

Provide a slim version, which allows RT to replace wait queues. This
should go mainline as well, as it lowers memory consumption and
runtime overhead.
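
The slim head boils down to (sketch close to the wait-simple
implementation):

 struct swait_head {
 	raw_spinlock_t		lock;
 	struct list_head	list;
 };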

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

smp_mb() added by Steven Rostedt to fix a race condition with swait
wakeups vs adding items to the list.
2020-10-14 00:59:22 +03:00
Sebastian Andrzej Siewior e49d664aa7 wait.h: include atomic.h
|  CC      init/main.o
|In file included from include/linux/mmzone.h:9:0,
|                 from include/linux/gfp.h:4,
|                 from include/linux/kmod.h:22,
|                 from include/linux/module.h:13,
|                 from init/main.c:15:
|include/linux/wait.h: In function ‘wait_on_atomic_t’:
|include/linux/wait.h:982:2: error: implicit declaration of function ‘atomic_read’ [-Werror=implicit-function-declaration]
|  if (atomic_read(val) == 0)
|  ^
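
The fix is to pull in the header directly (sketch of include/linux/wait.h;
neighbouring includes shown for context):

 #include <linux/list.h>
 #include <linux/stddef.h>
 #include <linux/spinlock.h>
+#include <linux/atomic.h>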

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2020-10-14 00:59:22 +03:00
Sebastian Andrzej Siewior 0e144af33c drm/i915: drop trace_i915_gem_ring_dispatch on rt
This tracepoint is responsible for:

|[<814cc358>] __schedule_bug+0x4d/0x59
|[<814d24cc>] __schedule+0x88c/0x930
|[<814d3b90>] ? _raw_spin_unlock_irqrestore+0x40/0x50
|[<814d3b95>] ? _raw_spin_unlock_irqrestore+0x45/0x50
|[<810b57b5>] ? task_blocks_on_rt_mutex+0x1f5/0x250
|[<814d27d9>] schedule+0x29/0x70
|[<814d3423>] rt_spin_lock_slowlock+0x15b/0x278
|[<814d3786>] rt_spin_lock+0x26/0x30
|[<a00dced9>] gen6_gt_force_wake_get+0x29/0x60 [i915]
|[<a00e183f>] gen6_ring_get_irq+0x5f/0x100 [i915]
|[<a00b2a33>] ftrace_raw_event_i915_gem_ring_dispatch+0xe3/0x100 [i915]
|[<a00ac1b3>] i915_gem_do_execbuffer.isra.13+0xbd3/0x1430 [i915]
|[<810f8943>] ? trace_buffer_unlock_commit+0x43/0x60
|[<8113e8d2>] ? ftrace_raw_event_kmem_alloc+0xd2/0x180
|[<8101d063>] ? native_sched_clock+0x13/0x80
|[<a00acf29>] i915_gem_execbuffer2+0x99/0x280 [i915]
|[<a00114a3>] drm_ioctl+0x4c3/0x570 [drm]
|[<8101d0d9>] ? sched_clock+0x9/0x10
|[<a00ace90>] ? i915_gem_execbuffer+0x480/0x480 [i915]
|[<810f1c18>] ? rb_commit+0x68/0xa0
|[<810f1c6c>] ? ring_buffer_unlock_commit+0x1c/0xa0
|[<81197467>] do_vfs_ioctl+0x97/0x540
|[<81021318>] ? ftrace_raw_event_sys_enter+0xd8/0x130
|[<811979a1>] sys_ioctl+0x91/0xb0
|[<814db931>] tracesys+0xe1/0xe6

Chris Wilson does not like to move i915_trace_irq_get() out of the macro

|No. This enables the IRQ, as well as making a number of
|very expensively serialised read, unconditionally.

so it is gone now on RT.

Cc: stable-rt@vger.kernel.org
Reported-by: Joakim Hernberg <jbh@alchemy.lu>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2020-10-14 00:59:22 +03:00
Sebastian Andrzej Siewior 5550291744 gpu/i915: don't open code these things
The open-coded part is gone since 1f83fee0 ("drm/i915: clear up wedged
transitions"); the owner check is still there.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2020-10-14 00:59:22 +03:00
Thomas Gleixner 11fd979912 mmci: Remove bogus local_irq_save()
On !RT the interrupt handler runs with interrupts disabled. On RT it runs
in a thread, so there is no need to disable interrupts at all.
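
In diff form (sketch; surrounding statements abbreviated):

 static irqreturn_t mmci_pio_irq(int irq, void *dev_id)
 {
 	...
-	local_irq_save(flags);
 	status = readl(host->base + MMCISTATUS);
 	...
-	local_irq_restore(flags);
 }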

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2020-10-14 00:59:22 +03:00
Sebastian Andrzej Siewior b83d9c8fd4 i2c/omap: drop the lock hard irq context
The lock is taken while reading two registers. On RT the lock is taken
both in hard irq context, where it might sleep, and in the threaded irq.
The threaded irq runs in oneshot mode, so the hard irq does not run until
the thread completes, so there is no reason to grab the lock.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2020-10-14 00:59:22 +03:00
Sebastian Andrzej Siewior 16a70d5345 leds: trigger: disable CPU trigger on -RT
as it triggers:
|CPU: 0 PID: 0 Comm: swapper Not tainted 3.12.8-rt10 #141
|[<c0014aa4>] (unwind_backtrace+0x0/0xf8) from [<c0012788>] (show_stack+0x1c/0x20)
|[<c0012788>] (show_stack+0x1c/0x20) from [<c043c8dc>] (dump_stack+0x20/0x2c)
|[<c043c8dc>] (dump_stack+0x20/0x2c) from [<c004c5e8>] (__might_sleep+0x13c/0x170)
|[<c004c5e8>] (__might_sleep+0x13c/0x170) from [<c043f270>] (__rt_spin_lock+0x28/0x38)
|[<c043f270>] (__rt_spin_lock+0x28/0x38) from [<c043fa00>] (rt_read_lock+0x68/0x7c)
|[<c043fa00>] (rt_read_lock+0x68/0x7c) from [<c036cf74>] (led_trigger_event+0x2c/0x5c)
|[<c036cf74>] (led_trigger_event+0x2c/0x5c) from [<c036e0bc>] (ledtrig_cpu+0x54/0x5c)
|[<c036e0bc>] (ledtrig_cpu+0x54/0x5c) from [<c000ffd8>] (arch_cpu_idle_exit+0x18/0x1c)
|[<c000ffd8>] (arch_cpu_idle_exit+0x18/0x1c) from [<c00590b8>] (cpu_startup_entry+0xa8/0x234)
|[<c00590b8>] (cpu_startup_entry+0xa8/0x234) from [<c043b2cc>] (rest_init+0xb8/0xe0)
|[<c043b2cc>] (rest_init+0xb8/0xe0) from [<c061ebe0>] (start_kernel+0x2c4/0x380)

Cc: stable-rt@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2020-10-14 00:59:22 +03:00
Thomas Gleixner 322d3dd7f7 powerpc-preempt-lazy-support.patch
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2020-10-14 00:59:22 +03:00
Thomas Gleixner de01f48d58 arm-preempt-lazy-support.patch
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2020-10-14 00:59:22 +03:00
Thomas Gleixner 5ccf423693 x86-preempt-lazy.patch
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2020-10-14 00:59:22 +03:00
Thomas Gleixner b33545443e sched: Add support for lazy preemption
It has become an obsession to mitigate the determinism vs. throughput
loss of RT. Looking at the mainline semantics of preemption points
gives a hint why RT sucks throughput wise for ordinary SCHED_OTHER
tasks. One major issue is the wakeup of tasks which are right away
preempting the waking task while the waking task holds a lock on which
the woken task will block right after having preempted the waker. In
mainline this is prevented due to the implicit preemption disable of
spin/rw_lock held regions. On RT this is not possible due to the fully
preemptible nature of sleeping spinlocks.

Though for a SCHED_OTHER task preempting another SCHED_OTHER task this
is really not a correctness issue. RT folks are concerned about
SCHED_FIFO/RR tasks preemption and not about the purely fairness
driven SCHED_OTHER preemption latencies.

So I introduced a lazy preemption mechanism which only applies to
SCHED_OTHER tasks preempting another SCHED_OTHER task. Aside from the
existing preempt_count each task now sports a preempt_lazy_count
which is manipulated on lock acquisition and release. This is slightly
incorrect as for laziness reasons I coupled this to
migrate_disable/enable so some other mechanisms get the same treatment
(e.g. get_cpu_light).

Now on the scheduler side, instead of setting NEED_RESCHED this sets
NEED_RESCHED_LAZY in case of a SCHED_OTHER/SCHED_OTHER preemption and
therefore allows the waking task to exit the lock-held region before
the woken task preempts. That also works better for cross CPU wakeups,
as the other side can stay in the adaptive spinning loop.

For RT class preemption there is no change. This simply sets
NEED_RESCHED and forgoes the lazy preemption counter.
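
A sketch of the resulting wakeup-preemption decision (heavily simplified;
p is the woken task, curr the task currently on the runqueue):

	if (rt_task(p))
		set_tsk_need_resched(curr);	/* RT class: unchanged */
	else
		set_tsk_thread_flag(curr, TIF_NEED_RESCHED_LAZY);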

Initial tests do not expose any observable latency increase, but
history shows that I've been proven wrong before :)

The lazy preemption mode is on by default, but with
CONFIG_SCHED_DEBUG enabled it can be disabled via:

 # echo NO_PREEMPT_LAZY >/sys/kernel/debug/sched_features

and reenabled via

 # echo PREEMPT_LAZY >/sys/kernel/debug/sched_features

The test results so far are very machine and workload dependent, but
there is a clear trend that it enhances the non-RT workload
performance.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2020-10-14 00:59:22 +03:00
Sebastian Andrzej Siewior 1b1950518f rcu: make RCU_BOOST default on RT
Since it is no longer invoked from the softirq, people run into OOM more
often if the priority of the RCU thread is too low. Making boosting the
default on RT should help in those cases, and it can be switched off if
someone knows better.

Cc: stable-rt@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2020-10-14 00:59:22 +03:00
Paul E. McKenney 21f9c8f24c rcu: Eliminate softirq processing from rcutree
Running RCU out of softirq is a problem for some workloads that would
like to manage RCU core processing independently of other softirq work,
for example, setting kthread priority.  This commit therefore moves the
RCU core work from softirq to a per-CPU/per-flavor SCHED_OTHER kthread
named rcuc.  The SCHED_OTHER approach avoids the scalability problems
that appeared with the earlier attempt to move RCU core processing
from softirq to kthreads.  That said, kernels built with RCU_BOOST=y
will run the rcuc kthreads at the RCU-boosting priority.

Reported-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Mike Galbraith <bitbucket@online.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2020-10-14 00:59:22 +03:00
Thomas Gleixner e69d00b006 rcu: Disable RCU_FAST_NO_HZ on RT
This uses a timer_list timer from the irq disabled guts of the idle
code. Disable it for now to prevent wreckage.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable-rt@vger.kernel.org
2020-10-14 00:59:22 +03:00
Nicholas Mc Guire f335b3cd6c softirq: make migrate disable/enable conditioned on softirq_nestcnt transition
This patch removes the recursive calls to migrate_disable/enable in
local_bh_disable/enable

the softirq-local-lock.patch introduces local_bh_disable/enable which
increments/decrements the current->softirq_nestcnt and disables/enables
migration as well. As softirq_nestcnt (include/linux/sched.h, conditioned
on CONFIG_PREEMPT_RT_BASE) already tracks the nesting level of the
recursive calls to local_bh_disable/enable (all in kernel/softirq.c), there
is no need to do it twice.

migrate_disable/enable thus can be conditioned on softirq_nestcnt, making
a transition from 0 to 1 disable migration and 1 to 0 re-enable it.

No change of functional behavior; this does noticeably reduce the observed
nesting level of migrate_disable/enable
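
The resulting transitions, sketched (the real functions do more work,
elided here):

	void local_bh_disable(void)
	{
		if (current->softirq_nestcnt++ == 0)
			migrate_disable();		/* 0 -> 1 */
	}

	void local_bh_enable(void)
	{
		/* softirq processing elided */
		if (--current->softirq_nestcnt == 0)
			migrate_enable();		/* 1 -> 0 */
	}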

Signed-off-by: Nicholas Mc Guire <der.herr@hofr.at>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2020-10-14 00:59:22 +03:00
Thomas Gleixner e1bb862d7e softirq: Adapt NOHZ softirq pending check to new RT scheme
We can't rely on ksoftirqd anymore. We need to check the tasks
which run a particular softirq, and if such a task is PI-blocked, ignore
the other pending bits of that task as well.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2020-10-14 00:59:22 +03:00
Nicholas Mc Guire d2f09d0ed6 API cleanup - use local_lock not __local_lock for soft
trivial API cleanup - kernel/softirq.c was mimicking local_lock.

No change of functional behavior

Signed-off-by: Nicholas Mc Guire <der.herr@hofr.at>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2020-10-14 00:59:22 +03:00
Thomas Gleixner f4d0ede105 softirq: Split softirq locks
The 3.x RT series removed the split softirq implementation in favour
of pushing softirq processing into the context of the thread which
raised it. However, this prevents us from handling the various softirqs
at different priorities. Now instead of reintroducing the split
softirq threads we split the locks which serialize the softirq
processing.

If a softirq is raised in context of a thread, then the softirq is
noted on a per thread field, if the thread is in a bh disabled
region. If the softirq is raised from hard interrupt context, then the
bit is set in the flag field of ksoftirqd and ksoftirqd is invoked.
When a thread leaves a bh disabled region, then it tries to execute
the softirqs which have been raised in its own context. It acquires
the per softirq / per cpu lock for the softirq and then checks
whether the softirq is still pending in the per cpu
local_softirq_pending() field. If yes, it runs the softirq. If no,
then some other task executed it already. This allows for zero config
softirq elevation in the context of user space tasks or interrupt
threads.
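
A sketch of the leave-bh-disabled path (field and helper names assumed,
heavily simplified):

	u32 raised = current->softirqs_raised;
	int nr;

	for (nr = 0; nr < NR_SOFTIRQS; nr++) {
		if (!(raised & (1U << nr)))
			continue;
		lock_softirq(nr);		/* per softirq / per cpu lock */
		if (local_softirq_pending() & (1U << nr))
			do_single_softirq(nr);	/* otherwise someone else ran it */
		unlock_softirq(nr);
	}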

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2020-10-14 00:59:22 +03:00
Thomas Gleixner c8bec60fb0 softirq: Split handling function
Split out the inner handling function, so RT can reuse it.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2020-10-14 00:59:22 +03:00
Thomas Gleixner 2285aafeb8 softirq: Make serving softirqs a task flag
Avoid the percpu softirq_runner pointer magic by using a task flag.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2020-10-14 00:59:22 +03:00
Steven Rostedt 3466500c8c softirq: Init softirq local lock after per cpu section is set up
I discovered this bug when booting 3.4-rt on my powerpc box. It crashed
with the following report:

------------[ cut here ]------------
kernel BUG at /work/rt/stable-rt.git/kernel/rtmutex_common.h:75!
Oops: Exception in kernel mode, sig: 5 [#1]
PREEMPT SMP NR_CPUS=64 NUMA PA Semi PWRficient
Modules linked in:
NIP: c0000000004aa03c LR: c0000000004aa01c CTR: c00000000009b2ac
REGS: c00000003e8d7950 TRAP: 0700   Not tainted  (3.4.11-test-rt19)
MSR: 9000000000029032 <SF,HV,EE,ME,IR,DR,RI>  CR: 24000082  XER: 20000000
SOFTE: 0
TASK = c00000003e8fdcd0[11] 'ksoftirqd/1' THREAD: c00000003e8d4000 CPU: 1
GPR00: 0000000000000001 c00000003e8d7bd0 c000000000d6cbb0 0000000000000000
GPR04: c00000003e8fdcd0 0000000000000000 0000000024004082 c000000000011454
GPR08: 0000000000000000 0000000080000001 c00000003e8fdcd1 0000000000000000
GPR12: 0000000024000084 c00000000fff0280 ffffffffffffffff 000000003ffffad8
GPR16: ffffffffffffffff 000000000072c798 0000000000000060 0000000000000000
GPR20: 0000000000642741 000000000072c858 000000003ffffaf0 0000000000000417
GPR24: 000000000072dcd0 c00000003e7ff990 0000000000000000 0000000000000001
GPR28: 0000000000000000 c000000000792340 c000000000ccec78 c000000001182338
NIP [c0000000004aa03c] .wakeup_next_waiter+0x44/0xb8
LR [c0000000004aa01c] .wakeup_next_waiter+0x24/0xb8
Call Trace:
[c00000003e8d7bd0] [c0000000004aa01c] .wakeup_next_waiter+0x24/0xb8 (unreliable)
[c00000003e8d7c60] [c0000000004a0320] .rt_spin_lock_slowunlock+0x8c/0xe4
[c00000003e8d7ce0] [c0000000004a07cc] .rt_spin_unlock+0x54/0x64
[c00000003e8d7d60] [c0000000000636bc] .__thread_do_softirq+0x130/0x174
[c00000003e8d7df0] [c00000000006379c] .run_ksoftirqd+0x9c/0x1a4
[c00000003e8d7ea0] [c000000000080b68] .kthread+0xa8/0xb4
[c00000003e8d7f90] [c00000000001c2f8] .kernel_thread+0x54/0x70
Instruction dump:
60000000 e86d01c8 38630730 4bff7061 60000000 ebbf0008 7c7c1b78 e81d0040
7fe00278 7c000074 7800d182 68000001 <0b000000> e88d01c8 387d0010 38840738

The rtmutex_common.h:75 is:

rt_mutex_top_waiter(struct rt_mutex *lock)
{
	struct rt_mutex_waiter *w;

	w = plist_first_entry(&lock->wait_list, struct rt_mutex_waiter,
			       list_entry);
	BUG_ON(w->lock != lock);

	return w;
}

Where the waiter->lock is corrupted. I saw various other random bugs
that all had to do with the softirq lock and plist. As plist needs to be
initialized before it is used I investigated how this lock is
initialized. It's initialized with:

void __init softirq_early_init(void)
{
	local_irq_lock_init(local_softirq_lock);
}

Where:

#define local_irq_lock_init(lvar)					\
	do {								\
		int __cpu;						\
		for_each_possible_cpu(__cpu)				\
			spin_lock_init(&per_cpu(lvar, __cpu).lock);	\
	} while (0)

As the softirq lock is a local_irq_lock, which is a per_cpu lock, the
initialization is done to all per_cpu versions of the lock. But let's
look at where the softirq_early_init() is called from.

In init/main.c: start_kernel()

/*
 * Interrupts are still disabled. Do necessary setups, then
 * enable them
 */
	softirq_early_init();
	tick_init();
	boot_cpu_init();
	page_address_init();
	printk(KERN_NOTICE "%s", linux_banner);
	setup_arch(&command_line);
	mm_init_owner(&init_mm, &init_task);
	mm_init_cpumask(&init_mm);
	setup_command_line(command_line);
	setup_nr_cpu_ids();
	setup_per_cpu_areas();
	smp_prepare_boot_cpu();	/* arch-specific boot-cpu hooks */

One of the first things that is called is the initialization of the
softirq lock. But if you look further down, we see the per_cpu areas
have not been set up yet. Thus initializing a local_irq_lock() before
the per_cpu section is set up may not work, as it is initializing the
per cpu locks before the per cpu areas exist.

By moving the softirq_early_init() right after setup_per_cpu_areas(),
the kernel boots fine.
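
In diff form, against the start_kernel() sequence quoted above (sketch):

-	softirq_early_init();
 	tick_init();
 	...
 	setup_per_cpu_areas();
+	softirq_early_init();
 	smp_prepare_boot_cpu();	/* arch-specific boot-cpu hooks */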

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Clark Williams <clark@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Cc: Carsten Emde <cbe@osadl.org>
Cc: vomlehn@texas.net
Link: http://lkml.kernel.org/r/1349362924.6755.18.camel@gandalf.local.home
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2020-10-14 00:59:22 +03:00
Thomas Gleixner 44eb2c6fd2 softirq: Check preemption after reenabling interrupts
raise_softirq_irqoff() disables interrupts and wakes the softirq
daemon, but after reenabling interrupts there is no preemption check,
so the execution of the softirq thread might be delayed arbitrarily.

In principle we could add that check to local_irq_enable/restore, but
that's overkill as the raise_softirq_irqoff() sections are the only
ones which show this behaviour.
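
Sketched at one such call site (the _rt helper name follows this tree's
convention and is assumed here):

 	local_irq_save(flags);
 	raise_softirq_irqoff(NET_RX_SOFTIRQ);
 	local_irq_restore(flags);
+	preempt_check_resched_rt();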

Reported-by: Carsten Emde <cbe@osadl.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable-rt@vger.kernel.org
2020-10-14 00:59:21 +03:00
Yong Zhang fa863e9259 perf: Make swevent hrtimer run in irq instead of softirq
Otherwise we get a deadlock like below:

[ 1044.042749] BUG: scheduling while atomic: ksoftirqd/21/141/0x00010003
[ 1044.042752] INFO: lockdep is turned off.
[ 1044.042754] Modules linked in:
[ 1044.042757] Pid: 141, comm: ksoftirqd/21 Tainted: G        W    3.4.0-rc2-rt3-23676-ga723175-dirty #29
[ 1044.042759] Call Trace:
[ 1044.042761]  <IRQ>  [<ffffffff8107d8e5>] __schedule_bug+0x65/0x80
[ 1044.042770]  [<ffffffff8168978c>] __schedule+0x83c/0xa70
[ 1044.042775]  [<ffffffff8106bdd2>] ? prepare_to_wait+0x32/0xb0
[ 1044.042779]  [<ffffffff81689a5e>] schedule+0x2e/0xa0
[ 1044.042782]  [<ffffffff81071ebd>] hrtimer_wait_for_timer+0x6d/0xb0
[ 1044.042786]  [<ffffffff8106bb30>] ? wake_up_bit+0x40/0x40
[ 1044.042790]  [<ffffffff81071f20>] hrtimer_cancel+0x20/0x40
[ 1044.042794]  [<ffffffff8111da0c>] perf_swevent_cancel_hrtimer+0x3c/0x50
[ 1044.042798]  [<ffffffff8111da31>] task_clock_event_stop+0x11/0x40
[ 1044.042802]  [<ffffffff8111da6e>] task_clock_event_del+0xe/0x10
[ 1044.042805]  [<ffffffff8111c568>] event_sched_out+0x118/0x1d0
[ 1044.042809]  [<ffffffff8111c649>] group_sched_out+0x29/0x90
[ 1044.042813]  [<ffffffff8111ed7e>] __perf_event_disable+0x18e/0x200
[ 1044.042817]  [<ffffffff8111c343>] remote_function+0x63/0x70
[ 1044.042821]  [<ffffffff810b0aae>] generic_smp_call_function_single_interrupt+0xce/0x120
[ 1044.042826]  [<ffffffff81022bc7>] smp_call_function_single_interrupt+0x27/0x40
[ 1044.042831]  [<ffffffff8168d50c>] call_function_single_interrupt+0x6c/0x80
[ 1044.042833]  <EOI>  [<ffffffff811275b0>] ? perf_event_overflow+0x20/0x20
[ 1044.042840]  [<ffffffff8168b970>] ? _raw_spin_unlock_irq+0x30/0x70
[ 1044.042844]  [<ffffffff8168b976>] ? _raw_spin_unlock_irq+0x36/0x70
[ 1044.042848]  [<ffffffff810702e2>] run_hrtimer_softirq+0xc2/0x200
[ 1044.042853]  [<ffffffff811275b0>] ? perf_event_overflow+0x20/0x20
[ 1044.042857]  [<ffffffff81045265>] __do_softirq_common+0xf5/0x3a0
[ 1044.042862]  [<ffffffff81045c3d>] __thread_do_softirq+0x15d/0x200
[ 1044.042865]  [<ffffffff81045dda>] run_ksoftirqd+0xfa/0x210
[ 1044.042869]  [<ffffffff81045ce0>] ? __thread_do_softirq+0x200/0x200
[ 1044.042873]  [<ffffffff81045ce0>] ? __thread_do_softirq+0x200/0x200
[ 1044.042877]  [<ffffffff8106b596>] kthread+0xb6/0xc0
[ 1044.042881]  [<ffffffff8168b97b>] ? _raw_spin_unlock_irq+0x3b/0x70
[ 1044.042886]  [<ffffffff8168d994>] kernel_thread_helper+0x4/0x10
[ 1044.042889]  [<ffffffff8107d98c>] ? finish_task_switch+0x8c/0x110
[ 1044.042894]  [<ffffffff8168b97b>] ? _raw_spin_unlock_irq+0x3b/0x70
[ 1044.042897]  [<ffffffff8168bd5d>] ? retint_restore_args+0xe/0xe
[ 1044.042900]  [<ffffffff8106b4e0>] ? kthreadd+0x1e0/0x1e0
[ 1044.042902]  [<ffffffff8168d990>] ? gs_change+0xb/0xb

Signed-off-by: Yong Zhang <yong.zhang0@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/1341476476-5666-1-git-send-email-yong.zhang0@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2020-10-14 00:59:21 +03:00
Thomas Gleixner c3fd676d4a rt: rwsem/rwlock: lockdep annotations
rwlocks and rwsems on RT do not allow multiple readers. Annotate the
lockdep acquire functions accordingly.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable-rt@vger.kernel.org
2020-10-14 00:59:21 +03:00
Yong Zhang 98e8f4ca44 lockdep: Selftest: Only do hardirq context test for raw spinlock
On -rt there is no softirq context any more and rwlock is sleepable,
disable softirq context test and rwlock+irq test.

Signed-off-by: Yong Zhang <yong.zhang0@gmail.com>
Cc: Yong Zhang <yong.zhang@windriver.com>
Link: http://lkml.kernel.org/r/1334559716-18447-3-git-send-email-yong.zhang0@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2020-10-14 00:59:21 +03:00
Peter Zijlstra 08295f4108 crypto: Convert crypto notifier chain to SRCU
The crypto notifier deadlocks on RT. Though this can be a real deadlock
on mainline as well, due to FIFO fair rwsems.

The involved parties here are:

[   82.172678] swapper/0       S 0000000000000001     0     1      0 0x00000000
[   82.172682]  ffff88042f18fcf0 0000000000000046 ffff88042f18fc80 ffffffff81491238
[   82.172685]  0000000000011cc0 0000000000011cc0 ffff88042f18c040 ffff88042f18ffd8
[   82.172688]  0000000000011cc0 0000000000011cc0 ffff88042f18ffd8 0000000000011cc0
[   82.172689] Call Trace:
[   82.172697]  [<ffffffff81491238>] ? _raw_spin_unlock_irqrestore+0x6c/0x7a
[   82.172701]  [<ffffffff8148fd3f>] schedule+0x64/0x66
[   82.172704]  [<ffffffff8148ec6b>] schedule_timeout+0x27/0xd0
[   82.172708]  [<ffffffff81043c0c>] ? unpin_current_cpu+0x1a/0x6c
[   82.172713]  [<ffffffff8106e491>] ? migrate_enable+0x12f/0x141
[   82.172716]  [<ffffffff8148fbbd>] wait_for_common+0xbb/0x11f
[   82.172719]  [<ffffffff810709f2>] ? try_to_wake_up+0x182/0x182
[   82.172722]  [<ffffffff8148fc96>] wait_for_completion_interruptible+0x1d/0x2e
[   82.172726]  [<ffffffff811debfd>] crypto_wait_for_test+0x49/0x6b
[   82.172728]  [<ffffffff811ded32>] crypto_register_alg+0x53/0x5a
[   82.172730]  [<ffffffff811ded6c>] crypto_register_algs+0x33/0x72
[   82.172734]  [<ffffffff81ad7686>] ? aes_init+0x12/0x12
[   82.172737]  [<ffffffff81ad76ea>] aesni_init+0x64/0x66
[   82.172741]  [<ffffffff81000318>] do_one_initcall+0x7f/0x13b
[   82.172744]  [<ffffffff81ac4d34>] kernel_init+0x199/0x22c
[   82.172747]  [<ffffffff81ac44ef>] ? loglevel+0x31/0x31
[   82.172752]  [<ffffffff814987c4>] kernel_thread_helper+0x4/0x10
[   82.172755]  [<ffffffff81491574>] ? retint_restore_args+0x13/0x13
[   82.172759]  [<ffffffff81ac4b9b>] ? start_kernel+0x3ca/0x3ca
[   82.172761]  [<ffffffff814987c0>] ? gs_change+0x13/0x13

[   82.174186] cryptomgr_test  S 0000000000000001     0    41      2 0x00000000
[   82.174189]  ffff88042c971980 0000000000000046 ffffffff81d74830 0000000000000292
[   82.174192]  0000000000011cc0 0000000000011cc0 ffff88042c96eb80 ffff88042c971fd8
[   82.174195]  0000000000011cc0 0000000000011cc0 ffff88042c971fd8 0000000000011cc0
[   82.174195] Call Trace:
[   82.174198]  [<ffffffff8148fd3f>] schedule+0x64/0x66
[   82.174201]  [<ffffffff8148ec6b>] schedule_timeout+0x27/0xd0
[   82.174204]  [<ffffffff81043c0c>] ? unpin_current_cpu+0x1a/0x6c
[   82.174206]  [<ffffffff8106e491>] ? migrate_enable+0x12f/0x141
[   82.174209]  [<ffffffff8148fbbd>] wait_for_common+0xbb/0x11f
[   82.174212]  [<ffffffff810709f2>] ? try_to_wake_up+0x182/0x182
[   82.174215]  [<ffffffff8148fc96>] wait_for_completion_interruptible+0x1d/0x2e
[   82.174218]  [<ffffffff811e4883>] cryptomgr_notify+0x280/0x385
[   82.174221]  [<ffffffff814943de>] notifier_call_chain+0x6b/0x98
[   82.174224]  [<ffffffff8108a11c>] ? rt_down_read+0x10/0x12
[   82.174227]  [<ffffffff810677cd>] __blocking_notifier_call_chain+0x70/0x8d
[   82.174230]  [<ffffffff810677fe>] blocking_notifier_call_chain+0x14/0x16
[   82.174234]  [<ffffffff811dd272>] crypto_probing_notify+0x24/0x50
[   82.174236]  [<ffffffff811dd7a1>] crypto_alg_mod_lookup+0x3e/0x74
[   82.174238]  [<ffffffff811dd949>] crypto_alloc_base+0x36/0x8f
[   82.174241]  [<ffffffff811e9408>] cryptd_alloc_ablkcipher+0x6e/0xb5
[   82.174243]  [<ffffffff811dd591>] ? kzalloc.clone.5+0xe/0x10
[   82.174246]  [<ffffffff8103085d>] ablk_init_common+0x1d/0x38
[   82.174249]  [<ffffffff8103852a>] ablk_ecb_init+0x15/0x17
[   82.174251]  [<ffffffff811dd8c6>] __crypto_alloc_tfm+0xc7/0x114
[   82.174254]  [<ffffffff811e0caa>] ? crypto_lookup_skcipher+0x1f/0xe4
[   82.174256]  [<ffffffff811e0dcf>] crypto_alloc_ablkcipher+0x60/0xa5
[   82.174258]  [<ffffffff811e5bde>] alg_test_skcipher+0x24/0x9b
[   82.174261]  [<ffffffff8106d96d>] ? finish_task_switch+0x3f/0xfa
[   82.174263]  [<ffffffff811e6b8e>] alg_test+0x16f/0x1d7
[   82.174267]  [<ffffffff811e45ac>] ? cryptomgr_probe+0xac/0xac
[   82.174269]  [<ffffffff811e45d8>] cryptomgr_test+0x2c/0x47
[   82.174272]  [<ffffffff81061161>] kthread+0x7e/0x86
[   82.174275]  [<ffffffff8106d9dd>] ? finish_task_switch+0xaf/0xfa
[   82.174278]  [<ffffffff814987c4>] kernel_thread_helper+0x4/0x10
[   82.174281]  [<ffffffff81491574>] ? retint_restore_args+0x13/0x13
[   82.174284]  [<ffffffff810610e3>] ? __init_kthread_worker+0x8c/0x8c
[   82.174287]  [<ffffffff814987c0>] ? gs_change+0x13/0x13

[   82.174329] cryptomgr_probe D 0000000000000002     0    47      2 0x00000000
[   82.174332]  ffff88042c991b70 0000000000000046 ffff88042c991bb0 0000000000000006
[   82.174335]  0000000000011cc0 0000000000011cc0 ffff88042c98ed00 ffff88042c991fd8
[   82.174338]  0000000000011cc0 0000000000011cc0 ffff88042c991fd8 0000000000011cc0
[   82.174338] Call Trace:
[   82.174342]  [<ffffffff8148fd3f>] schedule+0x64/0x66
[   82.174344]  [<ffffffff814901ad>] __rt_mutex_slowlock+0x85/0xbe
[   82.174347]  [<ffffffff814902d2>] rt_mutex_slowlock+0xec/0x159
[   82.174351]  [<ffffffff81089c4d>] rt_mutex_fastlock.clone.8+0x29/0x2f
[   82.174353]  [<ffffffff81490372>] rt_mutex_lock+0x33/0x37
[   82.174356]  [<ffffffff8108a0f2>] __rt_down_read+0x50/0x5a
[   82.174358]  [<ffffffff8108a11c>] ? rt_down_read+0x10/0x12
[   82.174360]  [<ffffffff8108a11c>] rt_down_read+0x10/0x12
[   82.174363]  [<ffffffff810677b5>] __blocking_notifier_call_chain+0x58/0x8d
[   82.174366]  [<ffffffff810677fe>] blocking_notifier_call_chain+0x14/0x16
[   82.174369]  [<ffffffff811dd272>] crypto_probing_notify+0x24/0x50
[   82.174372]  [<ffffffff811debd6>] crypto_wait_for_test+0x22/0x6b
[   82.174374]  [<ffffffff811decd3>] crypto_register_instance+0xb4/0xc0
[   82.174377]  [<ffffffff811e9b76>] cryptd_create+0x378/0x3b6
[   82.174379]  [<ffffffff811de512>] ? __crypto_lookup_template+0x5b/0x63
[   82.174382]  [<ffffffff811e4545>] cryptomgr_probe+0x45/0xac
[   82.174385]  [<ffffffff811e4500>] ? crypto_alloc_pcomp+0x1b/0x1b
[   82.174388]  [<ffffffff81061161>] kthread+0x7e/0x86
[   82.174391]  [<ffffffff8106d9dd>] ? finish_task_switch+0xaf/0xfa
[   82.174394]  [<ffffffff814987c4>] kernel_thread_helper+0x4/0x10
[   82.174398]  [<ffffffff81491574>] ? retint_restore_args+0x13/0x13
[   82.174401]  [<ffffffff810610e3>] ? __init_kthread_worker+0x8c/0x8c
[   82.174403]  [<ffffffff814987c0>] ? gs_change+0x13/0x13

cryptomgr_test spawns the cryptomgr_probe thread from the notifier
call. The probe thread fires the same notifier as the test thread and
deadlocks on the rwsem on RT.

Now this is a potential deadlock in mainline as well, because we have
FIFO fair rwsems. If another thread blocks with a down_write() on the
notifier chain before the probe thread issues the down_read(), it will
block the probe thread and the whole party is deadlocked.
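
The conversion itself is small (sketch; crypto_chain is the notifier head
in crypto/api.c):

-BLOCKING_NOTIFIER_HEAD(crypto_chain);
+SRCU_NOTIFIER_HEAD(crypto_chain);

with the blocking_notifier_* calls on crypto_chain becoming their
srcu_notifier_* counterparts.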

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2020-10-14 00:59:21 +03:00
Sebastian Andrzej Siewior 739cec90a6 net: Add a mutex around devnet_rename_seq
On RT write_seqcount_begin() disables preemption, while device_rename()
allocates memory with GFP_KERNEL and later grabs the sysfs_mutex
mutex. Serialize with a mutex and use the non-preemption-disabling
__write_seqcount_begin().

To avoid writer starvation, let the reader grab the mutex and release
it when it detects a writer in progress. This keeps the normal case
(no reader on the fly) fast.
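
The reader side, sketched (the mutex name is an assumption):

	retry:
		seq = read_seqcount_begin(&devnet_rename_seq);
		/* copy the device name */
		if (read_seqcount_retry(&devnet_rename_seq, seq)) {
			mutex_lock(&devnet_rename_mutex);	/* wait out the writer */
			mutex_unlock(&devnet_rename_mutex);
			goto retry;
		}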

[ tglx: Instead of replacing the seqcount by a mutex, add the mutex ]

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2020-10-14 00:59:21 +03:00
Thomas Gleixner dd624a5df8 net: Use local_bh_disable in netif_rx_ni()
This code triggers the new WARN in __raise_softirq_irqsoff(), though it
actually looks at the softirq pending bit and calls into the softirq
code, but that does not fit well with the context related softirq model
of RT. It's correct on mainline though, and going through
local_bh_disable/enable here is not going to hurt badly.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2020-10-14 00:59:21 +03:00
Thomas Gleixner f46faf6ae0 net: netfilter: Serialize xt_write_recseq sections on RT
The netfilter code relies only on the implicit semantics of
local_bh_disable() for serializing xt_write_recseq sections. RT breaks
that and needs explicit serialization here.
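
Sketched with a local lock (the lock name is an assumption):

	static DEFINE_LOCAL_IRQ_LOCK(xt_write_lock);

	local_lock(xt_write_lock);
	addend = xt_write_recseq_begin();
	/* table traversal */
	xt_write_recseq_end(addend);
	local_unlock(xt_write_lock);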

Reported-by: Peter LaDow <petela@gocougs.wsu.edu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable-rt@vger.kernel.org
2020-10-14 00:59:21 +03:00
Thomas Gleixner 4d98d1f9c3 net: Another local_irq_disable/kmalloc headache
Replace it by a local lock. Though that's pretty inefficient :(

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2020-10-14 00:59:21 +03:00
Priyanka Jain 82509650d3 net,RT: Remove preemption disabling in netif_rx()
1) enqueue_to_backlog() (called from netif_rx) should be
  bound to a particular CPU. This can be achieved by
  disabling migration. No need to disable preemption

2) Fixes crash "BUG: scheduling while atomic: ksoftirqd"
  in case of RT.
  If preemption is disabled, enqueue_to_backlog() is called
  in atomic context. And if the backlog exceeds its count,
  kfree_skb() is called. But on RT, kfree_skb() might
  get scheduled out, so it expects a non-atomic context.

3) When CONFIG_PREEMPT_RT_FULL is not defined,
 migrate_enable() and migrate_disable() map to
 preempt_enable() and preempt_disable(), so there is no
 change in functionality in the non-RT case.

-Replace preempt_enable(), preempt_disable() with
 migrate_enable(), migrate_disable() respectively
-Replace get_cpu(), put_cpu() with get_cpu_light(),
 put_cpu_light() respectively
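
In diff form (sketch):

-	preempt_disable();
+	migrate_disable();
 	...
-	cpu = get_cpu();
+	cpu = get_cpu_light();
 	...
-	put_cpu();
+	put_cpu_light();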

Signed-off-by: Priyanka Jain <Priyanka.Jain@freescale.com>
Acked-by: Rajan Srivastava <Rajan.Srivastava@freescale.com>
Cc: <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/1337227511-2271-1-git-send-email-Priyanka.Jain@freescale.com
Cc: stable-rt@vger.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2020-10-14 00:59:21 +03:00
John Kacur d61a0d53b5 scsi: qla2xxx: Use local_irq_save_nort() in qla2x00_poll
RT triggers the following:

[   11.307652]  [<ffffffff81077b27>] __might_sleep+0xe7/0x110
[   11.307663]  [<ffffffff8150e524>] rt_spin_lock+0x24/0x60
[   11.307670]  [<ffffffff8150da78>] ? rt_spin_lock_slowunlock+0x78/0x90
[   11.307703]  [<ffffffffa0272d83>] qla24xx_intr_handler+0x63/0x2d0 [qla2xxx]
[   11.307736]  [<ffffffffa0262307>] qla2x00_poll+0x67/0x90 [qla2xxx]

Function qla2x00_poll does local_irq_save() before calling qla24xx_intr_handler
which has a spinlock. Since spinlocks are sleepable on RT, it is not allowed
to call them with interrupts disabled. Therefore we use local_irq_save_nort()
instead which saves flags without disabling interrupts.

This fix needs to be applied to v3.0-rt, v3.2-rt and v3.4-rt
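
In diff form (sketch of qla2x00_poll()):

-	local_irq_save(flags);
+	local_irq_save_nort(flags);
 	...
-	local_irq_restore(flags);
+	local_irq_restore_nort(flags);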

Suggested-by: Thomas Gleixner
Signed-off-by: John Kacur <jkacur@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: David Sommerseth <davids@redhat.com>
Link: http://lkml.kernel.org/r/1335523726-10024-1-git-send-email-jkacur@redhat.com
Cc: stable-rt@vger.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2020-10-14 00:59:21 +03:00
Tiejun Chen 3ac75ddc51 cpu_down: move migrate_enable() back
Commit 08c1ab68, "hotplug-use-migrate-disable.patch", intends to
use migrate_enable()/migrate_disable() to replace that combination
of preempt_enable() and preempt_disable(), but actually in the
!CONFIG_PREEMPT_RT_FULL case, migrate_enable()/migrate_disable()
are still equal to preempt_enable()/preempt_disable(). So the
cpu_hotplug_begin()/cpu_unplug_begin(cpu) that follows would go into
schedule() and trigger schedule_debug() like this:

_cpu_down()
	|
	+ migrate_disable() = preempt_disable()
	|
	+ cpu_hotplug_begin() or cpu_unplug_begin()
		|
		+ schedule()
			|
			+ __schedule()
				|
				+ preempt_disable();
				|
				+ __schedule_bug() is true!

So we should move migrate_enable() back, as in the original scheme.

Cc: stable-rt@vger.kernel.org
Signed-off-by: Tiejun Chen <tiejun.chen@windriver.com>
2020-10-14 00:59:21 +03:00
Sebastian Andrzej Siewior 925c8d77e6 kernel/hotplug: restore original cpu mask on cpu down
If a task which is allowed to run only on CPU X puts CPU Y down, then it
will afterwards be allowed to run on all CPUs but CPU Y once it comes back
from the kernel. This patch ensures that we don't lose the initial setting
unless the CPU the task is running on is going down.
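
A sketch of the approach (error handling and the exact restore condition
simplified):

	cpumask_var_t save;

	if (!alloc_cpumask_var(&save, GFP_KERNEL))
		return;				/* sketch: error path simplified */
	cpumask_copy(save, &current->cpus_allowed);

	/* ... CPU down sequence, which may reset the affinity ... */

	/* restore, unless the task was bound to the dying CPU only */
	set_cpus_allowed_ptr(current, save);
	free_cpumask_var(save);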

Cc: stable-rt@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2020-10-14 00:59:21 +03:00
Sebastian Andrzej Siewior 869f5656ff kernel/cpu: fix cpu down problem if kthread's cpu is going down
If a kthread is pinned to CPUx and CPUx is going down, then we get into
trouble:
- first the unplug thread is created
- it will set itself to hp->unplug. As a result, every task that is
  going to take a lock, has to leave the CPU.
- the CPU_DOWN_PREPARE notifiers are started. The worker thread will
  start a new process for the "high priority worker".
  Now the kthread would like to take a lock but since it can't leave the CPU
  it will never complete its task.

We could fire the unplug thread after the notifier but then the cpu is
no longer marked "online" and the unplug thread will run on CPU0 which
was fixed before :)

So instead the unplug thread is started and kept waiting until the
notifiers complete their work.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2020-10-14 00:59:21 +03:00
Steven Rostedt 42244f4dc2 cpu hotplug: Document why PREEMPT_RT uses a spinlock
The patch:

    cpu: Make hotplug.lock a "sleeping" spinlock on RT

    Tasks can block on hotplug.lock in pin_current_cpu(), but their
    state might be != RUNNING. So the mutex wakeup will set the state
    unconditionally to RUNNING. That might cause spurious unexpected
    wakeups. We could provide a state preserving mutex_lock() function,
    but this is semantically backwards. So instead we convert the
    hotplug.lock() to a spinlock for RT, which has the state preserving
    semantics already.

Fixed a bug where the hotplug lock on PREEMPT_RT can be taken after a
task set its state to TASK_UNINTERRUPTIBLE and before it called
schedule. If the hotplug_lock used a mutex, and there was contention,
the current task's state would be set to TASK_RUNNING and the
schedule call would not sleep. This caused unexpected results.

Although the patch had a description of the change, the code had no
comments about it. This causes confusion to those that review the code,
and as PREEMPT_RT is held in a quilt queue and not git, it's not as easy
to see why a change was made. Even if it was in git, the code should
still have a comment for something as subtle as this.

Document the rationale for using a spinlock on PREEMPT_RT in the hotplug
lock code.

Reported-by: Nicholas Mc Guire <der.herr@hofr.at>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2020-10-14 00:59:21 +03:00
Steven Rostedt 17618e0e1f cpu/rt: Rework cpu down for PREEMPT_RT
Bringing a CPU down is a pain with the PREEMPT_RT kernel because
tasks can be preempted in many more places than in non-RT. In
order to handle per_cpu variables, tasks may be pinned to a CPU
for a while, and even sleep. But these tasks need to be off the CPU
if that CPU is going down.

Several synchronization methods have been tried, but when stressed
they failed. This is a new approach.

A sync_tsk thread is still created and tasks may still block on a
lock when the CPU is going down, but how that works is a bit different.
When cpu_down() starts, it will create the sync_tsk and wait on it
to be informed that the tasks currently pinned on the CPU are no longer
pinned. But new tasks that are about to be pinned will still be allowed
to do so at this time.

Then the notifiers are called. Several notifiers will bring down tasks
that will enter these locations. Some of these tasks will take locks
of other tasks that are on the CPU. If we don't let those other tasks
continue, but make them block until CPU down is done, the tasks that
the notifiers are waiting on will never complete as they are waiting
for the locks held by the tasks that are blocked.

Thus we still let the task pin the CPU until the notifiers are done.
After the notifiers run, we then make new tasks entering the pinned
CPU sections grab a mutex and wait. This mutex is now a per CPU mutex
in the hotplug_pcp descriptor.

To help things along, a new function in the scheduler code is created
called migrate_me(). This function will try to migrate the current task
off the CPU that is going down if possible. When the sync_tsk is created,
all tasks will then try to migrate off the CPU going down. There are
several cases where this won't work, but it helps in most cases.

After the notifiers are called and if a task can't migrate off but enters
the pin CPU sections, it will be forced to wait on the hotplug_pcp mutex
until the CPU down is complete. Then the scheduler will force the migration
anyway.

Also, I found that THREAD_BOUND tasks need to also be accounted for in the
pinned CPU handling, and migrate_disable no longer treats them specially.
This helps fix issues with ksoftirqd and workqueue that unbind on CPU down.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2020-10-14 00:59:21 +03:00
Steven Rostedt 109bb45d4d cpu: Make hotplug.lock a "sleeping" spinlock on RT
Tasks can block on hotplug.lock in pin_current_cpu(), but their state
might be != RUNNING. So the mutex wakeup will set the state
unconditionally to RUNNING. That might cause spurious unexpected
wakeups. We could provide a state preserving mutex_lock() function,
but this is semantically backwards. So instead we convert the
hotplug.lock() to a spinlock for RT, which has the state preserving
semantics already.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Carsten Emde <C.Emde@osadl.org>
Cc: John Kacur <jkacur@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Clark Williams <clark.williams@gmail.com>
Cc: stable-rt@vger.kernel.org
Link: http://lkml.kernel.org/r/1330702617.25686.265.camel@gandalf.stny.rr.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2020-10-14 00:59:21 +03:00