Commit Graph

892082 Commits

Author SHA1 Message Date
Thomas Gleixner 2cac4318fb net: Use cpu_chill() instead of cpu_relax()
Retry loops on RT might loop forever when the modifying side was
preempted. Use cpu_chill() instead of cpu_relax() to let the system
make progress.
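
The converted pattern, as a minimal sketch (the lock and loop body are illustrative, not the actual net/ code):

   while (!spin_trylock(&some_lock)) {
           /* cpu_relax() spins forever on RT if the lock holder was
            * preempted; cpu_chill() sleeps so the holder can run. */
           cpu_chill();            /* was: cpu_relax(); */
   }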

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2023-03-25 04:21:28 +03:00
Thomas Gleixner 15387f991a fs: namespace: Use cpu_chill() in trylock loops
Retry loops on RT might loop forever when the modifying side was
preempted. Use cpu_chill() instead of cpu_relax() to let the system
make progress.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2023-03-25 04:21:28 +03:00
Thomas Gleixner 125c578f0b block: Use cpu_chill() for retry loops
Retry loops on RT might loop forever when the modifying side was
preempted. Steven also observed a live lock when there was a
concurrent priority boosting going on.

Use cpu_chill() instead of cpu_relax() to let the system
make progress.

[bigeasy: After all those changes that occurred over the years, this one hunk
is left and should not cause any starvation on -RT anymore]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2023-03-25 04:21:28 +03:00
Thomas Gleixner 9b9cfa4b04 rt: Introduce cpu_chill()
Retry loops on RT might loop forever when the modifying side was
preempted. Add cpu_chill() to replace cpu_relax(). cpu_chill()
defaults to cpu_relax() for non RT. On RT it puts the looping task to
sleep for a tick so the preempted task can make progress.

Steven Rostedt changed it to use a hrtimer instead of msleep():
|
|Ulrich Obergfell pointed out that cpu_chill() calls msleep() which is woken
|up by the ksoftirqd running the TIMER softirq. But as the cpu_chill() is
|called from softirq context, it may block the ksoftirqd() from running, in
|which case, it may never wake up the msleep() causing the deadlock.

+ bigeasy later changed to schedule_hrtimeout()
|If a task calls cpu_chill() and gets woken up by a regular or spurious
|wakeup and has a signal pending, then it exits the sleep loop in
|do_nanosleep() and sets up the restart block. If restart->nanosleep.type is
|not TI_NONE then this results in accessing a stale user pointer from a
|previously interrupted syscall and a copy to user based on the stale
|pointer or a BUG() when 'type' is not supported in nanosleep_copyout().

+ bigeasy: add PF_NOFREEZE:
| [....] Waiting for /dev to be fully populated...
| =====================================
| [ BUG: udevd/229 still has locks held! ]
| 3.12.11-rt17 #23 Not tainted
| -------------------------------------
| 1 lock held by udevd/229:
|  #0:  (&type->i_mutex_dir_key#2){+.+.+.}, at: lookup_slow+0x28/0x98
|
| stack backtrace:
| CPU: 0 PID: 229 Comm: udevd Not tainted 3.12.11-rt17 #23
| (unwind_backtrace+0x0/0xf8) from (show_stack+0x10/0x14)
| (show_stack+0x10/0x14) from (dump_stack+0x74/0xbc)
| (dump_stack+0x74/0xbc) from (do_nanosleep+0x120/0x160)
| (do_nanosleep+0x120/0x160) from (hrtimer_nanosleep+0x90/0x110)
| (hrtimer_nanosleep+0x90/0x110) from (cpu_chill+0x30/0x38)
| (cpu_chill+0x30/0x38) from (dentry_kill+0x158/0x1ec)
| (dentry_kill+0x158/0x1ec) from (dput+0x74/0x15c)
| (dput+0x74/0x15c) from (lookup_real+0x4c/0x50)
| (lookup_real+0x4c/0x50) from (__lookup_hash+0x34/0x44)
| (__lookup_hash+0x34/0x44) from (lookup_slow+0x38/0x98)
| (lookup_slow+0x38/0x98) from (path_lookupat+0x208/0x7fc)
| (path_lookupat+0x208/0x7fc) from (filename_lookup+0x20/0x60)
| (filename_lookup+0x20/0x60) from (user_path_at_empty+0x50/0x7c)
| (user_path_at_empty+0x50/0x7c) from (user_path_at+0x14/0x1c)
| (user_path_at+0x14/0x1c) from (vfs_fstatat+0x48/0x94)
| (vfs_fstatat+0x48/0x94) from (SyS_stat64+0x14/0x30)
| (SyS_stat64+0x14/0x30) from (ret_fast_syscall+0x0/0x48)
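
Putting the pieces together, a minimal sketch of the resulting cpu_chill()
(reconstructed from the notes above, not copied from the patch; the exact
timeout and flag handling are assumptions):

   #ifdef CONFIG_PREEMPT_RT
   void cpu_chill(void)
   {
           ktime_t chill_time = ktime_set(0, TICK_NSEC);   /* "a tick", per above */
           unsigned int freeze_flag = current->flags & PF_NOFREEZE;

           set_current_state(TASK_UNINTERRUPTIBLE);
           current->flags |= PF_NOFREEZE;  /* avoid the udevd splat above */
           schedule_hrtimeout(&chill_time, HRTIMER_MODE_REL);
           if (!freeze_flag)
                   current->flags &= ~PF_NOFREEZE;
   }
   #else
   #define cpu_chill()     cpu_relax()
   #endif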

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:28 +03:00
Mike Galbraith 3ebd0b93c2 sunrpc: Make svc_xprt_do_enqueue() use get_cpu_light()
|BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:915
|in_atomic(): 1, irqs_disabled(): 0, pid: 3194, name: rpc.nfsd
|Preemption disabled at:[<ffffffffa06bf0bb>] svc_xprt_received+0x4b/0xc0 [sunrpc]
|CPU: 6 PID: 3194 Comm: rpc.nfsd Not tainted 3.18.7-rt1 #9
|Hardware name: MEDION MS-7848/MS-7848, BIOS M7848W08.404 11/06/2014
| ffff880409630000 ffff8800d9a33c78 ffffffff815bdeb5 0000000000000002
| 0000000000000000 ffff8800d9a33c98 ffffffff81073c86 ffff880408dd6008
| ffff880408dd6000 ffff8800d9a33cb8 ffffffff815c3d84 ffff88040b3ac000
|Call Trace:
| [<ffffffff815bdeb5>] dump_stack+0x4f/0x9e
| [<ffffffff81073c86>] __might_sleep+0xe6/0x150
| [<ffffffff815c3d84>] rt_spin_lock+0x24/0x50
| [<ffffffffa06beec0>] svc_xprt_do_enqueue+0x80/0x230 [sunrpc]
| [<ffffffffa06bf0bb>] svc_xprt_received+0x4b/0xc0 [sunrpc]
| [<ffffffffa06c03ed>] svc_add_new_perm_xprt+0x6d/0x80 [sunrpc]
| [<ffffffffa06b2693>] svc_addsock+0x143/0x200 [sunrpc]
| [<ffffffffa072e69c>] write_ports+0x28c/0x340 [nfsd]
| [<ffffffffa072d2ac>] nfsctl_transaction_write+0x4c/0x80 [nfsd]
| [<ffffffff8117ee83>] vfs_write+0xb3/0x1d0
| [<ffffffff8117f889>] SyS_write+0x49/0xb0
| [<ffffffff815c4556>] system_call_fastpath+0x16/0x1b

Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:28 +03:00
Thomas Gleixner 2af2485dd8 scsi/fcoe: Make RT aware.
Do not disable preemption while taking sleeping locks. All users look safe
for migrate_disable() only.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2023-03-25 04:21:28 +03:00
Thomas Gleixner fde5499a8f md: raid5: Make raid5_percpu handling RT aware
__raid_run_ops() disables preemption with get_cpu() around the access
to the raid5_percpu variables. That causes scheduling while atomic
spews on RT.

Serialize the access to the percpu data with a lock and keep the code
preemptible.
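
A sketch of the pattern (lock placement and field names are illustrative,
not the exact patch):

   struct raid5_percpu *percpu;

   percpu = this_cpu_ptr(conf->percpu);
   spin_lock(&percpu->lock);       /* serialises access; code stays preemptible */
   /* ... __raid_run_ops() work on the percpu scratch data ... */
   spin_unlock(&percpu->lock);     /* was: get_cpu()/put_cpu() around this */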

Reported-by: Udo van den Heuvel <udovdh@xs4all.nl>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Udo van den Heuvel <udovdh@xs4all.nl>
2023-03-25 04:21:28 +03:00
Sebastian Andrzej Siewior bad3589dc7 block/mq: don't complete requests via IPI
The IPI runs in hardirq context and the completion path takes sleeping locks.
Assume caches are shared and complete requests on the local CPU.
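
A sketch of the resulting policy check (helper name as in recent mainline,
function shape simplified and illustrative):

   static inline bool blk_mq_complete_need_ipi(struct request *rq)
   {
           if (IS_ENABLED(CONFIG_PREEMPT_RT))
                   return false;   /* IPI runs in hardirq: complete locally */
           /* ... usual shared-cache / same-CPU checks ... */
           return true;
   }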

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:28 +03:00
Sebastian Andrzej Siewior 4abe6b99f0 block/mq: do not invoke preempt_disable()
preempt_disable() and get_cpu() don't play well with the sleeping
locks acquired later in the code path.
It seems to be enough to replace them with get_cpu_light() and
migrate_disable().
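
For reference, the substitution pattern (get_cpu_light()/put_cpu_light() are
the RT patch-set helpers; the surrounding code is illustrative):

   cpu = get_cpu_light();  /* migrate_disable() instead of preempt_disable() */
   /* ... code that may take sleeping locks on RT ... */
   put_cpu_light();        /* migrate_enable() */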

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:28 +03:00
Thomas Gleixner 457d5cccfb mm/vmalloc: Another preempt disable region which sucks
Avoid the preempt disable version of get_cpu_var(). The inner lock should
provide enough serialisation.
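
A sketch of the conversion (names follow mm/vmalloc.c but are used
illustratively here):

   struct vmap_block_queue *vbq;

   vbq = this_cpu_ptr(&vmap_block_queue);  /* was: &get_cpu_var(vmap_block_queue) */
   spin_lock(&vbq->lock);                  /* the inner lock serialises */
   list_add_tail_rcu(&vb->free_list, &vbq->free);
   spin_unlock(&vbq->lock);                /* no put_cpu_var() needed */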

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2023-03-25 04:21:28 +03:00
Thomas Gleixner dfd47e14f8 fs/epoll: Do not disable preemption on RT
ep_call_nested() takes a sleeping lock, so we can't disable preemption.
The light version is enough since ep_call_nested() doesn't mind being
invoked twice on the same CPU.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2023-03-25 04:21:27 +03:00
Scott Wood a5850930e9 rcutorture: Avoid problematic critical section nesting on RT
rcutorture was generating some nesting scenarios that are not
reasonable.  Constrain the state selection to avoid them.

Example #1:

1. preempt_disable()
2. local_bh_disable()
3. preempt_enable()
4. local_bh_enable()

On PREEMPT_RT, BH disabling takes a local lock only when called in
non-atomic context.  Thus, atomic context must be retained until after BH
is re-enabled.  Likewise, if BH is initially disabled in non-atomic
context, it cannot be re-enabled in atomic context.

Example #2:

1. rcu_read_lock()
2. local_irq_disable()
3. rcu_read_unlock()
4. local_irq_enable()

If the thread is preempted between steps 1 and 2,
rcu_read_unlock_special.b.blocked will be set, but it won't be
acted on in step 3 because IRQs are disabled.  Thus, reporting of the
quiescent state will be delayed beyond the local_irq_enable().

For now, these scenarios will continue to be tested on non-PREEMPT_RT
kernels, until debug checks are added to ensure that they are not
happening elsewhere.

Signed-off-by: Scott Wood <swood@redhat.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:27 +03:00
Julia Cartwright 56ca6ad996 rcu: enable rcu_normal_after_boot by default for RT
The forcing of an expedited grace period is an expensive and very
RT-application unfriendly operation, as it forcibly preempts all running
tasks on CPUs which are preventing the gp from expiring.

By default, as a policy decision, disable the expediting of grace
periods (after boot) on configurations which enable PREEMPT_RT.
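
A sketch of the change in kernel/rcu/update.c (the config symbol spelling is
an assumption for this tree):

   static int rcu_normal_after_boot = IS_ENABLED(CONFIG_PREEMPT_RT);
   module_param(rcu_normal_after_boot, int, 0);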

Suggested-by: Luiz Capitulino <lcapitulino@redhat.com>
Acked-by: Paul E. McKenney <paulmck@linux.ibm.com>
Signed-off-by: Julia Cartwright <julia@ni.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:27 +03:00
Sebastian Andrzej Siewior 277efec1fa srcu: replace local_irqsave() with a locallock
There are two instances which disable interrupts in order to get a
stable this_cpu_ptr() pointer. The restore part is coupled with
spin_unlock_irqrestore(), which does not work on RT.
Replace the local_irq_save() call with the appropriate local_lock()
version of it.
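
A sketch of the pattern, using the local_lock API with illustrative names
(the actual srcu fields and lock differ):

   static DEFINE_PER_CPU(local_lock_t, srcu_llock) = INIT_LOCAL_LOCK(srcu_llock);

   local_lock_irqsave(&srcu_llock, flags);         /* was: local_irq_save(flags); */
   sdp = this_cpu_ptr(ssp->sda);                   /* pointer is now stable */
   spin_lock(&sdp->lock);
   /* ... */
   spin_unlock(&sdp->lock);                        /* no longer _irqrestore */
   local_unlock_irqrestore(&srcu_llock, flags);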

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:27 +03:00
Scott Wood fc049aaf7b rcu: Use rcuc threads on PREEMPT_RT as we did
While switching to the reworked RCU-thread code, enabling the thread
processing on -RT was forgotten.
Besides restoring behaviour that used to be the default on RT, this
avoids a deadlock on scheduler locks.

Signed-off-by: Scott Wood <swood@redhat.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:27 +03:00
Sebastian Andrzej Siewior 27de62e778 locking: Make spinlock_t and rwlock_t a RCU section on RT
On !RT a locked spinlock_t or rwlock_t disables preemption, which
implies an RCU read section. There is code that relies on that behaviour.

Add an explicit RCU read section on RT while a sleeping lock (a lock
which would disable preemption on !RT) is acquired.
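
For illustration, the kind of reliance meant here (names are invented):

   spin_lock(&gc->lock);               /* !RT: preempt off => implicit RCU section */
   entry = rcu_dereference(gc->cache); /* relies on that implicit section */
   /* ... use entry ... */
   spin_unlock(&gc->lock);
   /* On RT the sleeping-lock acquire/release now does an explicit
    * rcu_read_lock()/rcu_read_unlock() pair, keeping this pattern valid. */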

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:27 +03:00
Sebastian Andrzej Siewior f8d8e5803c locking: don't check for __LINUX_SPINLOCK_TYPES_H on -RT archs
Upstream uses arch_spinlock_t within spinlock_t and requests that the
spinlock_types.h header file is included first.
On -RT we have the rt_mutex with its raw_lock wait_lock, which needs the
architecture's spinlock_types.h header file for its definition. However,
we need rt_mutex first because it is used to build the spinlock_t, so
that check does not work for us.
Therefore I am dropping that check.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:27 +03:00
Thomas Gleixner c8ff2f06be futex: workaround migrate_disable/enable in different context
migrate_enable() invokes __schedule() and it expects a preempt count of one.
Holding a raw_spinlock_t with disabled interrupts should not allow scheduling.

These little hacks ensure that we don't schedule while we lock the hb lock
with interrupts enabled and unlock it with interrupts disabled.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
[XXX: As per PeterZ's suggestion
	set_thread_flag(TIF_NEED_RESCHED); preempt_fold_need_resched()
 would trigger a scheduler invocation on the last preempt_enable(), which in
 turn would allow us to drop this.
]
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:27 +03:00
Thomas Gleixner 6ff7d19cf2 trace: Add migrate-disabled counter to tracing output
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2023-03-25 04:21:27 +03:00
Scott Wood a4aa572645 sched: migrate_enable: Remove __schedule() call
We can rely on preempt_enable() to schedule. Besides simplifying the
code, this potentially permits sequences such as the following:

migrate_disable();
preempt_disable();
migrate_enable();
preempt_enable();

Suggested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Scott Wood <swood@redhat.com>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:27 +03:00
Scott Wood e2e316ff2b sched: migrate_enable: Use per-cpu cpu_stop_work
Commit e6c287b1512d ("sched: migrate_enable: Use stop_one_cpu_nowait()")
adds a busy wait to deal with an edge case where the migrated thread
can resume running on another CPU before the stopper has consumed
cpu_stop_work.  However, this is done with preemption disabled and can
potentially lead to deadlock.

While it is not guaranteed that the cpu_stop_work will be consumed before
the migrating thread resumes and exits the stack frame, it is guaranteed
that nothing other than the stopper can run on the old cpu between the
migrating thread scheduling out and the cpu_stop_work being consumed.
Thus, we can store cpu_stop_work in per-cpu data without it being
reused too early.
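
A sketch of the storage change (the surrounding call is simplified; arg is
the usual migration argument structure):

   static DEFINE_PER_CPU(struct cpu_stop_work, migrate_enable_work);

   /* Preemption is still disabled here, so smp_processor_id() is stable. */
   stop_one_cpu_nowait(smp_processor_id(), migration_cpu_stop, &arg,
                       this_cpu_ptr(&migrate_enable_work));
   /* No busy wait: only the stopper can run on this CPU between our
    * schedule-out and the work item being consumed. */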

Fixes: e6c287b1512d ("sched: migrate_enable: Use stop_one_cpu_nowait()")
Suggested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Scott Wood <swood@redhat.com>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:27 +03:00
Scott Wood 5be92193b8 sched: migrate_enable: Use stop_one_cpu_nowait()
migrate_enable() can be called with current->state != TASK_RUNNING.
Avoid clobbering the existing state by using stop_one_cpu_nowait().
Since we're stopping the current cpu, we know that we won't get
past __schedule() until migration_cpu_stop() has run (at least up to
the point of migrating us to another cpu).

Signed-off-by: Scott Wood <swood@redhat.com>
[bigeasy: spin until the request has been processed]
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:27 +03:00
Sebastian Andrzej Siewior fe6fe368b0 sched/core: migrate_enable() must access takedown_cpu_task on !HOTPLUG_CPU
The variable takedown_cpu_task is never declared/used on !HOTPLUG_CPU
except for migrate_enable(). This leads to a link error.

Don't use takedown_cpu_task in !HOTPLUG_CPU.

Reported-by: Dick Hollenbeck <dick@softplc.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:26 +03:00
Sebastian Andrzej Siewior 1158230ec6 kernel/sched/core: add migrate_disable()
[bristot@redhat.com: rt: Increase/decrease the nr of migratory tasks when enabling/disabling migration
 Link: https://lkml.kernel.org/r/e981d271cbeca975bca710e2fbcc6078c09741b0.1498482127.git.bristot@redhat.com
]
[swood@redhat.com: fixups and optimisations
 Link: https://lkml.kernel.org/r/20190727055638.20443-1-swood@redhat.com
 Link: https://lkml.kernel.org/r/20191012065214.28109-1-swood@redhat.com
]
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:26 +03:00
Sebastian Andrzej Siewior 85b10d0ec5 ptrace: fix ptrace vs tasklist_lock race
As explained by Alexander Fyodorov <halcy@yandex.ru>:

|read_lock(&tasklist_lock) in ptrace_stop() is converted to mutex on RT kernel,
|and it can remove __TASK_TRACED from task->state (by moving it to
|task->saved_state). If parent does wait() on child followed by a sys_ptrace
|call, the following race can happen:
|
|- child sets __TASK_TRACED in ptrace_stop()
|- parent does wait() which eventually calls wait_task_stopped() and returns
|  child's pid
|- child blocks on read_lock(&tasklist_lock) in ptrace_stop() and moves
|  __TASK_TRACED flag to saved_state
|- parent calls sys_ptrace, which calls ptrace_check_attach() and wait_task_inactive()

The patch is based on his initial patch, with an additional check added
for the case where __TASK_TRACED was moved to ->saved_state. The pi_lock
is taken in case the caller is interrupted between looking into ->state
and ->saved_state.
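
A sketch of that check (the helper name is illustrative, reconstructed from
the description rather than copied from the patch):

   static bool task_is_traced_rt(struct task_struct *task)
   {
           unsigned long flags;
           bool traced = task->state & __TASK_TRACED;

           if (!traced) {
                   /* On RT the flag may have moved to ->saved_state while
                    * the task sleeps on the converted tasklist_lock. */
                   raw_spin_lock_irqsave(&task->pi_lock, flags);
                   traced = task->saved_state & __TASK_TRACED;
                   raw_spin_unlock_irqrestore(&task->pi_lock, flags);
           }
           return traced;
   }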

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:26 +03:00
Sebastian Andrzej Siewior 70c7cce455 locking/rtmutex: re-init the wait_lock in rt_mutex_init_proxy_locked()
We could provide a key class for lockdep (and fix up all callers) or
move the init to all callers (as it was) in order to avoid lockdep
seeing a double-lock of the wait_lock.

Reported-by: Fernando Lopez-Lezcano <nando@ccrma.Stanford.EDU>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:26 +03:00
Scott Wood 714e8359dd locking/rt-mutex: Flush block plug on __down_read()
__down_read() bypasses the rtmutex frontend to call
rt_mutex_slowlock_locked() directly, and thus it needs to call
blk_schedule_flush_plug() itself.

Cc: stable-rt@vger.kernel.org
Signed-off-by: Scott Wood <swood@redhat.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:26 +03:00
Mikulas Patocka 867ae8be55 locking/rt-mutex: fix deadlock in device mapper / block-IO
When some block device driver creates a bio and submits it to another
block device driver, the bio is added to current->bio_list (in order to
avoid unbounded recursion).

However, this queuing of bios can cause deadlocks; to avoid them,
device mapper registers a function, flush_current_bio_list. This function
is called when a device mapper driver blocks. It redirects bios queued on
current->bio_list to helper workqueues, so that these bios can proceed
even if the driver is blocked.

The problem with CONFIG_PREEMPT_RT is that when the device mapper
driver blocks, it won't call flush_current_bio_list (because
tsk_is_pi_blocked returns true in sched_submit_work), so deadlocks in the
block device stack can happen.

Note that we can't call blk_schedule_flush_plug if tsk_is_pi_blocked
returns true - that would cause
BUG_ON(rt_mutex_real_waiter(task->pi_blocked_on)) in
task_blocks_on_rt_mutex when flush_current_bio_list attempts to take a
spinlock.

So the proper fix is to call blk_schedule_flush_plug in rt_mutex_fastlock,
when the fast acquire fails and the task is about to block.

CC: stable-rt@vger.kernel.org
[bigeasy: The deadlock is not device-mapper specific; it can also occur
          in plain EXT4]
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:26 +03:00
Sebastian Andrzej Siewior 154f5666ec rtmutex: add ww_mutex addon for mutex-rt
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:26 +03:00
Thomas Gleixner 4b3a456867 rtmutex: wire up RT's locking
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:26 +03:00
Thomas Gleixner 993b8b95ab rtmutex: add rwlock implementation based on rtmutex
The implementation is bias-based, similar to the rwsem implementation.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:26 +03:00
Thomas Gleixner 8a55693ca5 rtmutex: add rwsem implementation based on rtmutex
The RT-specific R/W semaphore implementation restricts the number of readers
to one because a writer cannot block on multiple readers and inherit its
priority or budget.

The single reader restriction is painful in various ways:

 - Performance bottleneck for multi-threaded applications in the page fault
   path (mmap sem)

 - Progress blocker for drivers which are carefully crafted to avoid the
   potential reader/writer deadlock in mainline.

The analysis of the writer code paths shows that properly written RT tasks
should not take them. Syscalls like mmap() and file accesses which take the
mmap sem write-locked have unbounded latencies which are completely unrelated
to the mmap sem. Other R/W sem users like graphics drivers are not suitable
for RT tasks either.

So there is little risk of hurting RT tasks when the RT rwsem implementation
is changed in the following way:

 - Allow concurrent readers

 - Make writers block until the last reader has left the critical section.
   This blocking is not subject to priority/budget inheritance.

 - Readers blocked on a writer inherit their priority/budget in the normal
   way.

There is a drawback with this scheme. R/W semaphores become writer-unfair,
though the applications which have triggered writer starvation (mostly on
mmap_sem) in the past are not really the typical workloads running on an RT
system. So while it's unlikely to hit writer starvation, it's possible. If
there are unexpected workloads on RT systems triggering it, we need to
rethink the approach.
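
A very rough sketch of the scheme, with invented names (the real
implementation differs in detail; in particular it does not poll):

   struct rwsem_rt {
           atomic_t        readers;        /* active readers */
           struct rt_mutex rtmutex;        /* serialises writers and entry */
   };

   static void rt_down_read(struct rwsem_rt *sem)
   {
           /* Readers blocked on a writer sleep on the rtmutex and therefore
            * boost the writer's priority/budget in the normal way. */
           rt_mutex_lock(&sem->rtmutex);
           atomic_inc(&sem->readers);
           rt_mutex_unlock(&sem->rtmutex);
   }

   static void rt_down_write(struct rwsem_rt *sem)
   {
           rt_mutex_lock(&sem->rtmutex);   /* keep new readers and writers out */
           /* Wait until the last reader has left; this wait is deliberately
            * not subject to priority/budget inheritance. */
           while (atomic_read(&sem->readers) > 0)
                   schedule_timeout_uninterruptible(1);
   }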

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:26 +03:00
Thomas Gleixner c4457270a5 rtmutex: add mutex implementation based on rtmutex
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:26 +03:00
Sebastian Andrzej Siewior dfa76b1102 rtmutex: trylock is okay on -RT
A non-RT kernel could deadlock on rt_mutex_trylock() in softirq context. On
-RT we don't run softirqs in IRQ context but in thread context, so it is
not an issue here.
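
A sketch of the resulting check (the internal fast/slow helper names are
assumptions):

   int __lockfunc rt_mutex_trylock(struct rt_mutex *lock)
   {
   #ifdef CONFIG_PREEMPT_RT
           if (WARN_ON_ONCE(in_irq() || in_nmi()))
   #else
           /* !RT: softirqs run in IRQ context, so trylock may deadlock there. */
           if (WARN_ON_ONCE(in_irq() || in_nmi() || in_serving_softirq()))
   #endif
                   return 0;

           return rt_mutex_fasttrylock(lock, rt_mutex_slowtrylock);
   }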

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:26 +03:00
Peter Zijlstra 749c611945 locking/rtmutex: Clean ->pi_blocked_on in the error case
The function rt_mutex_wait_proxy_lock() cleans ->pi_blocked_on in case
of failure (timeout, signal). The same cleanup is required in
__rt_mutex_start_proxy_lock().
In both cases the task was interrupted by a signal or timeout while
acquiring the lock, and after the interruption it no longer blocks on
the lock.

Fixes: 1a1fb985f2 ("futex: Handle early deadlock return correctly")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:25 +03:00
Thomas Gleixner 44f08144d7 sched: Use the proper LOCK_OFFSET for cond_resched()
RT does not increment the preempt count when a 'sleeping' spinlock is
locked. Update PREEMPT_LOCK_OFFSET for that case.
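
A sketch of the adjustment (per the description; exact placement in the
preempt header is assumed):

   #ifdef CONFIG_PREEMPT_RT
   /* A held 'sleeping' spinlock does not touch the preempt count. */
   # define PREEMPT_LOCK_OFFSET    0
   #else
   # define PREEMPT_LOCK_OFFSET    PREEMPT_OFFSET
   #endif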

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2023-03-25 04:21:25 +03:00
Thomas Gleixner d3e4782a4e rtmutex: add sleeping lock implementation
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:25 +03:00
Thomas Gleixner 63a7b1d0ca rtmutex: export lockdep-less version of rt_mutex's lock, trylock and unlock
Required for the lock implementations built on top of rtmutex.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:25 +03:00
Thomas Gleixner 2b74b5dd97 rtmutex: Provide rt_mutex_slowlock_locked()
This is the inner part of rt_mutex_slowlock(), required for rwsem-rt.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:25 +03:00
Sebastian Andrzej Siewior b50141e955 rbtree: don't include the rcu header
The RCU header pulls in spinlock.h and fails due to not-yet-defined types:

|In file included from include/linux/spinlock.h:275:0,
|                 from include/linux/rcupdate.h:38,
|                 from include/linux/rbtree.h:34,
|                 from include/linux/rtmutex.h:17,
|                 from include/linux/spinlock_types.h:18,
|                 from kernel/bounds.c:13:
|include/linux/rwlock_rt.h:16:38: error: unknown type name ‘rwlock_t’
| extern void __lockfunc rt_write_lock(rwlock_t *rwlock);
|                                      ^

This patch moves the required RCU function from the rcupdate.h header file into
a new header file which can be included by both users.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:25 +03:00
Thomas Gleixner bbfb297717 rtmutex: Avoid include hell
Include only the required raw types. This avoids pulling in the
complete spinlock header which in turn requires rtmutex.h at some point.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2023-03-25 04:21:25 +03:00
Thomas Gleixner 8d03371f8a spinlock: Split the lock types header
Split raw_spinlock into its own file and the remaining spinlock_t into
its own non-RT header. The non-RT header will be replaced later by sleeping
spinlocks.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2023-03-25 04:21:25 +03:00
Thomas Gleixner 9b58817f04 rtmutex: Make lock_killable work
Locking an rt mutex killable does not work because signal handling is
restricted to TASK_INTERRUPTIBLE.

Use signal_pending_state() unconditionally.
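
A sketch of the slowpath check after the change (loop shape simplified):

   if (unlikely(signal_pending_state(state, current))) {
           /* Covers TASK_KILLABLE (fatal signals only) as well as
            * TASK_INTERRUPTIBLE, instead of hard-coding the latter. */
           ret = -EINTR;
           break;
   }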

Cc: stable-rt@vger.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:25 +03:00
Thomas Gleixner 64474cb7c1 rtmutex: Add rtmutex_lock_killable()
Add "killable" type to rtmutex. We need this since rtmutex are used as
"normal" mutexes which do use this type.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2023-03-25 04:21:25 +03:00
Wolfgang M. Reimer f4d2cdae88 locking: locktorture: Do NOT include rwlock.h directly
Including rwlock.h directly will cause kernel builds to fail
if CONFIG_PREEMPT_RT is defined. The correct header file
(rwlock_rt.h OR rwlock.h) will be included by spinlock.h which
is included by locktorture.c anyway.

Cc: stable-rt@vger.kernel.org
Signed-off-by: Wolfgang M. Reimer <linuxball@gmail.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:25 +03:00
Grygorii Strashko bf31352d9f pid.h: include atomic.h
This patch fixes a build error:
  CC      kernel/pid_namespace.o
In file included from kernel/pid_namespace.c:11:0:
include/linux/pid.h: In function 'get_pid':
include/linux/pid.h:78:3: error: implicit declaration of function 'atomic_inc' [-Werror=implicit-function-declaration]
   atomic_inc(&pid->count);
   ^
which happens when
 CONFIG_PROVE_LOCKING=n
 CONFIG_DEBUG_SPINLOCK=n
 CONFIG_DEBUG_MUTEXES=n
 CONFIG_DEBUG_LOCK_ALLOC=n
 CONFIG_PID_NS=y

Vanilla gets this via spinlock.h.
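
The fix, per the title, is to make the dependency explicit (sketch of the
one-line change):

   /* include/linux/pid.h */
   #include <linux/atomic.h>       /* for atomic_inc(&pid->count) */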

Signed-off-by: Grygorii Strashko <Grygorii.Strashko@linaro.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:25 +03:00
Thomas Gleixner cc08805ba8 futex: Ensure lock/unlock symmetry versus pi_lock and hash bucket lock
In exit_pi_state_list() we have the following locking construct:

   spin_lock(&hb->lock);
   raw_spin_lock_irq(&curr->pi_lock);

   ...
   spin_unlock(&hb->lock);

In !RT this works, but on RT the migrate_enable() function which is
called from spin_unlock() sees atomic context due to the held pi_lock
and just decrements the migrate_disable_atomic counter of the
task. Now the next call to migrate_disable() sees the counter being
negative and issues a warning. That check should be in
migrate_enable() already.

Fix this by dropping pi_lock before unlocking hb->lock and reacquiring
pi_lock after that again. This is safe as the loop code reevaluates
head again under the pi_lock.
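
In sketch form:

   raw_spin_unlock_irq(&curr->pi_lock);  /* drop pi_lock first ... */
   spin_unlock(&hb->lock);               /* ... so this unlock sees !atomic */
   raw_spin_lock_irq(&curr->pi_lock);    /* reacquire; the loop re-reads head */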

Reported-by: Yong Zhang <yong.zhang@windriver.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:25 +03:00
Steven Rostedt 421d4b32b1 futex: Fix bug on when a requeued RT task times out
Requeue with timeout causes a bug with PREEMPT_RT.

The bug comes from a timed-out condition.

	TASK 1				TASK 2
	------				------
    futex_wait_requeue_pi()
	futex_wait_queue_me()
	<timed out>

					double_lock_hb();

	raw_spin_lock(pi_lock);
	if (current->pi_blocked_on) {
	} else {
	    current->pi_blocked_on = PI_WAKE_INPROGRESS;
	    raw_spin_unlock(pi_lock);
	    spin_lock(hb->lock); <-- blocked!

					plist_for_each_entry_safe(this) {
					    rt_mutex_start_proxy_lock();
						task_blocks_on_rt_mutex();
						BUG_ON(task->pi_blocked_on)!!!!

The BUG_ON() actually has a check for PI_WAKE_INPROGRESS, but the
problem is that, after TASK 1 sets PI_WAKE_INPROGRESS, it then tries to
grab the hb->lock, which it fails to do. As the hb->lock is a mutex,
it will block and set "pi_blocked_on" to the hb->lock.

When TASK 2 goes to requeue it, the check for PI_WAKE_INPROGRESS fails
because TASK 1's pi_blocked_on is no longer set to that, but instead
set to the hb->lock.

The fix:

When calling rt_mutex_start_proxy_lock(), a check is made to see
if the proxy task's pi_blocked_on is set. If so, exit out early.
Otherwise set it to a new flag, PI_REQUEUE_INPROGRESS, which notifies
the proxy task that it is being requeued, and it will handle things
appropriately.
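
A sketch of that early exit, reconstructed from the description (shape
simplified):

   /* In rt_mutex_start_proxy_lock(): */
   raw_spin_lock(&task->pi_lock);
   if (task->pi_blocked_on) {
           /* Task already blocked (e.g. on hb->lock): bail out early. */
           raw_spin_unlock(&task->pi_lock);
           return -EAGAIN;
   }
   task->pi_blocked_on = PI_REQUEUE_INPROGRESS;
   raw_spin_unlock(&task->pi_lock);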

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2023-03-25 04:21:24 +03:00
Thomas Gleixner 83fddd55bf rtmutex: Handle the various new futex race conditions
RT opens a few new interesting race conditions in the rtmutex/futex
combo due to the futex hash bucket lock being a 'sleeping' spinlock and
therefore not disabling preemption.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2023-03-25 04:21:24 +03:00
Sebastian Andrzej Siewior 597fbf406e net/core: use local_bh_disable() in netif_rx_ni()
In 2004 netif_rx_ni() gained a preempt_disable() section around
netif_rx() plus a do_softirq() call with a check for pending softirqs.
The do_softirq() part is required because netif_rx() raises the softirq
but does not invoke it. The preempt_disable() is required to remain on the
same CPU which added the skb to the per-CPU list.
All this can be avoided by putting it into a local_bh_disable()ed
section. The local_bh_enable() part will invoke do_softirq() if
required.
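
A sketch of the resulting function (the internal helper name is an
assumption):

   int netif_rx_ni(struct sk_buff *skb)
   {
           int err;

           local_bh_disable();             /* keeps us on this CPU */
           err = netif_rx_internal(skb);   /* raises NET_RX_SOFTIRQ */
           local_bh_enable();              /* runs do_softirq() if pending */

           return err;
   }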

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:24 +03:00