Commit Graph

32282 Commits

Author SHA1 Message Date
Alibek Omarov c1ce54bceb Linux 5.4.193 with MCST patches (6.2) 2023-03-25 04:34:12 +03:00
Sebastian Andrzej Siewior df6e402903 sched: Switch wait_task_inactive to HRTIMER_MODE_REL_HARD
[ Upstream commit 39609ed79d420e0b966e16a1d695733c2d3b9a7f ]

With PREEMPT_RT enabled all hrtimers callbacks will be invoked in
softirq mode unless they are explicitly marked as HRTIMER_MODE_HARD.
During boot kthread_bind() is used for the creation of per-CPU threads
and then hangs in wait_task_inactive() if the ksoftirqd is not
yet up and running.
The hang disappeared since commit
   26c7295be0c5e ("kthread: Do not preempt current task if it is going to call schedule()")

but enabling function tracing on boot reliably leads to the boot-time freeze
again.
The timer in wait_task_inactive() cannot be abused via a user interface to
create a mass wakeup of several tasks at the same time, which would lead to
long sections with disabled interrupts.
Therefore it is safe to make the timer HRTIMER_MODE_REL_HARD.

Switch the timer to HRTIMER_MODE_REL_HARD.
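
For reference, a rough sketch of the affected spot in wait_task_inactive()
(kernel/sched/core.c) after the change; paraphrased, not the exact hunk:

    if (unlikely(queued)) {
        ktime_t to = NSEC_PER_SEC / HZ;

        set_current_state(TASK_UNINTERRUPTIBLE);
        /* Expire in hard interrupt context even on PREEMPT_RT. */
        schedule_hrtimeout(&to, HRTIMER_MODE_REL_HARD);
        continue;
    }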

Cc: stable-rt@vger.kernel.org
Link: https://lkml.kernel.org/r/20210826170408.vm7rlj7odslshwch@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
2023-03-25 04:21:37 +03:00
Andrew Halaney 22562d5988 locking/rwsem-rt: Remove might_sleep() in __up_read()
[ Upstream commit b2ed0a4302faf2bb09e97529dd274233c082689b ]

There's no chance of sleeping here; the reader is giving up the
lock and possibly waking up the writer who is waiting on it.

Reported-by: Chunyu Hu <chuhu@redhat.com>
Signed-off-by: Andrew Halaney <ahalaney@redhat.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
2023-03-25 04:21:37 +03:00
Sebastian Andrzej Siewior a43acd1e98 locking/rwsem-rt: Add __down_read_interruptible()
The stable tree backported a patch which adds __down_read_interruptible() for
the generic rwsem implementation.

Add the RT version of __down_read_interruptible().

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:37 +03:00
Sebastian Andrzej Siewior 18d5a2ded1 Revert "hrtimer: Allow raw wakeups during boot"
This change is no longer needed since commit
   26c7295be0c5e ("kthread: Do not preempt current task if it is going to call schedule()")

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2023-03-25 04:21:37 +03:00
Sebastian Andrzej Siewior 5564fa35b3 timers: Don't block on ->expiry_lock for TIMER_IRQSAFE
PREEMPT_RT does not spin and wait until a running timer completes its
callback but instead it blocks on a sleeping lock to prevent a deadlock.

This blocking cannot be done for the workqueue's IRQSAFE timer, which will
be canceled in an IRQ-off region. The cancellation has to happen in an
IRQ-off region because changing the PENDING bit and clearing the timer must
not be interrupted, to avoid a busy-loop.

The callback invocation of an IRQSAFE timer is not preempted on PREEMPT_RT,
so there is no need to synchronize on timer_base::expiry_lock.

Don't acquire timer_base::expiry_lock for TIMER_IRQSAFE flagged
timers.
Add a lockdep annotation to ensure that this function is always invoked
in preemptible context on PREEMPT_RT.
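
A sketch of the resulting check, assuming the helper and field names used in
kernel/time/timer.c (del_timer_wait_running(), timer_base::expiry_lock); not
the literal hunk:

    static void del_timer_wait_running(struct timer_list *timer)
    {
        u32 tf = READ_ONCE(timer->flags);

        /* IRQSAFE callbacks are not preempted on RT: nothing to wait for. */
        if (!(tf & (TIMER_MIGRATING | TIMER_IRQSAFE))) {
            struct timer_base *base = get_timer_base(tf);

            atomic_inc(&base->timer_waiters);
            spin_lock_bh(&base->expiry_lock);
            atomic_dec(&base->timer_waiters);
            spin_unlock_bh(&base->expiry_lock);
        }
    }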

Reported-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: stable-rt@vger.kernel.org
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2023-03-25 04:21:36 +03:00
Oleg Nesterov 6bd4eef1d8 ptrace: fix ptrace_unfreeze_traced() race with rt-lock
The patch "ptrace: fix ptrace vs tasklist_lock race" changed
ptrace_freeze_traced() to take task->saved_state into account, but
ptrace_unfreeze_traced() has the same problem and needs a similar fix:
it should check/update both ->state and ->saved_state.
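
A simplified sketch of the idea, assuming the RT tree's
task_struct::saved_state field; not the literal patch:

    spin_lock_irq(&task->sighand->siglock);
    raw_spin_lock_irqsave(&task->pi_lock, flags);
    /* Restore whichever state field actually holds __TASK_TRACED. */
    if (task->state == __TASK_TRACED)
        task->state = TASK_TRACED;
    else if (task->saved_state == __TASK_TRACED)
        task->saved_state = TASK_TRACED;
    raw_spin_unlock_irqrestore(&task->pi_lock, flags);
    spin_unlock_irq(&task->sighand->siglock);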

Reported-by: Luis Claudio R. Goncalves <lgoncalv@redhat.com>
Fixes: "ptrace: fix ptrace vs tasklist_lock race"
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: stable-rt@vger.kernel.org
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2023-03-25 04:21:36 +03:00
Sebastian Andrzej Siewior 36c2e4d093 rwsem: Provide down_read_non_owner() and up_read_non_owner() for -RT
The rwsem implementation on -RT allows multiple readers and there is no
owner tracking anymore.
We can provide down_read_non_owner() and up_read_non_owner() by skipping
the owner check bits which are only available in the !RT implementation.
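
A minimal sketch of the -RT variants, assuming they can reuse the internal
__down_read()/__up_read() helpers of the RT rwsem implementation:

    void down_read_non_owner(struct rw_semaphore *sem)
    {
        might_sleep();
        __down_read(sem);
    }
    EXPORT_SYMBOL(down_read_non_owner);

    void up_read_non_owner(struct rw_semaphore *sem)
    {
        __up_read(sem);
    }
    EXPORT_SYMBOL(up_read_non_owner);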

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2023-03-25 04:21:36 +03:00
Sebastian Andrzej Siewior 58d58190a1 workqueue: Sync with upstream
This is an all-in-one patch reverting the following commits:
  workqueue: Don't assume that the callback has interrupts disabled
  sched/swait: Add swait_event_lock_irq()
  workqueue: Use swait for wq_manager_wait
  workqueue: Convert the locks to raw type

and introducing the following commits from upstream:
  workqueue: Use rcuwait for wq_manager_wait
  workqueue: Convert the pool::lock and wq_mayday_lock to raw_spinlock_t

as a replacement.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2023-03-25 04:21:36 +03:00
Matt Fleming f958e52689 signal: Prevent double-free of user struct
The way user struct reference counting works changed significantly with,

  fda31c50292a ("signal: avoid double atomic counter increments for user accounting")

Now user structs are only freed once the last pending signal is
dequeued. Make sigqueue_free_current() follow this new convention to
avoid freeing the user struct multiple times and triggering this
warning:

 refcount_t: underflow; use-after-free.
 WARNING: CPU: 0 PID: 6794 at lib/refcount.c:288 refcount_dec_not_one+0x45/0x50
 Call Trace:
  refcount_dec_and_lock_irqsave+0x16/0x60
  free_uid+0x31/0xa0
  __dequeue_signal+0x17c/0x190
  dequeue_signal+0x5a/0x1b0
  do_sigtimedwait+0x208/0x250
  __x64_sys_rt_sigtimedwait+0x6f/0xd0
  do_syscall_64+0x72/0x200
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
Reported-by: Daniel Wagner <wagi@monom.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2023-03-25 04:21:36 +03:00
汪勇10269566 8ea11731c2 printk: Force a line break on pr_cont(" ")
Since the printk rework, pr_cont("\n") will not lead to a line break.
A new line will only be created if
- cpu != c->cpu_owner || !(flags & LOG_CONT)
- c->len + len > sizeof(c->buf)

Flush the buffer to enforce a new line on pr_cont().

[bigeasy: reword commit message ]

Signed-off-by: 汪勇10269566 <wang.yong12@zte.com.cn>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2023-03-25 04:21:36 +03:00
John Ogness cbf31f0ca6 printk: console must not schedule for drivers
Even though the printk kthread is always preemptible, it is still not
allowed to call cond_resched() from within console drivers. The
task may become non-preemptible in the console driver call chain. For
example, vt_console_print() takes a spinlock and then can call into
fbcon_redraw(), which can conditionally invoke cond_resched():

|BUG: sleeping function called from invalid context at kernel/printk/printk.c:2322
|in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 177, name: printk
|CPU: 0 PID: 177 Comm: printk Not tainted 5.6.2-00011-ga536059557f1d9 #1
|Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
|Call Trace:
| dump_stack+0x66/0x8b
| ___might_sleep+0x102/0x120
| console_conditional_schedule+0x24/0x30
| fbcon_redraw+0x96/0x1c0
| fbcon_scroll+0x556/0xd70
| con_scroll+0x147/0x1e0
| lf+0x9e/0xb0
| vt_console_print+0x253/0x3d0
| printk_kthread_func+0x1d5/0x3b0

Disable cond_resched() for the call into the console drivers.

Reported-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2023-03-25 04:21:35 +03:00
Clark Williams 894114838d sysfs: Add /sys/kernel/realtime entry
Add a /sys/kernel entry to indicate that the kernel is a
realtime kernel.

Clark says that he needs this for udev rules: udev needs to evaluate
whether it's a PREEMPT_RT kernel a few thousand times, and parsing uname
output is too slow for that.

Are there better solutions? Should it exist and return 0 on !-rt?
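
A sketch of what such an entry looks like in kernel/ksysfs.c, assuming the
KERNEL_ATTR_RO() pattern already used there:

    #ifdef CONFIG_PREEMPT_RT_FULL
    static ssize_t realtime_show(struct kobject *kobj,
                                 struct kobj_attribute *attr, char *buf)
    {
        return sprintf(buf, "%d\n", 1);
    }
    KERNEL_ATTR_RO(realtime);
    #endif

The attribute then has to be added to the kernel_attrs[] array so that it
shows up as /sys/kernel/realtime.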

Signed-off-by: Clark Williams <williams@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
2023-03-25 04:21:35 +03:00
Ingo Molnar 77c67bcd8a genirq: Disable irqpoll on -rt
irqpoll creates long latencies for no value.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2023-03-25 04:21:35 +03:00
Thomas Gleixner f9d7455782 signals: Allow rt tasks to cache one sigqueue struct
To avoid allocation, allow rt tasks to cache one sigqueue struct in the
task struct.
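
A rough sketch of the caching scheme; the sigqueue_cache field and the helper
names are assumptions for illustration, and the allocation/free paths that
use them are omitted:

    static struct sigqueue *sigqueue_from_cache(struct task_struct *t)
    {
        struct sigqueue *q = t->sigqueue_cache;

        /* Claim the cached entry atomically, if there is one. */
        return (cmpxchg(&t->sigqueue_cache, q, NULL) == q) ? q : NULL;
    }

    static bool sigqueue_to_cache(struct task_struct *t, struct sigqueue *q)
    {
        /* Park the sigqueue for reuse instead of freeing it. */
        return cmpxchg(&t->sigqueue_cache, NULL, q) == NULL;
    }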

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2023-03-25 04:21:35 +03:00
Josh Cartwright e1267649bf genirq: update irq_set_irqchip_state documentation
On -rt kernels, the use of migrate_disable()/migrate_enable() is
sufficient to guarantee a task isn't moved to another CPU.  Update the
irq_set_irqchip_state() documentation to reflect this.

Signed-off-by: Josh Cartwright <joshc@ni.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:33 +03:00
Sebastian Andrzej Siewior 3af424b7f4 tracing: make preempt_lazy and migrate_disable counter smaller
The migrate_disable counter should not exceed 255, so it is enough to
store it in an 8-bit field.
With this change we can move the `preempt_lazy_count' member into the
gap, so the whole struct shrinks by 4 bytes to 12 bytes in total.
Remove the `padding' field; it is not needed.
Update the tracing fields in trace_define_common_fields() (it was
missing the preempt_lazy_count field).
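
The resulting common-field layout is roughly (a sketch of struct trace_entry
in the RT tree after this change):

    struct trace_entry {
        unsigned short  type;
        unsigned char   flags;
        unsigned char   preempt_count;
        int             pid;
        unsigned char   migrate_disable;
        unsigned char   preempt_lazy_count;
    };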

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:33 +03:00
Thomas Gleixner 78ed450f36 sched: Add support for lazy preemption
It has become an obsession to mitigate the determinism vs. throughput
loss of RT. Looking at the mainline semantics of preemption points
gives a hint why RT sucks throughput wise for ordinary SCHED_OTHER
tasks. One major issue is the wakeup of tasks which right away preempt
the waking task while the waking task holds a lock on which the woken
task will block right after having preempted the waker. In mainline
this is prevented by the implicit preemption disable of spin/rwlock
held regions. On RT this is not possible due to the fully preemptible
nature of sleeping spinlocks.

Though for a SCHED_OTHER task preempting another SCHED_OTHER task this
is really not a correctness issue. RT folks are concerned about
SCHED_FIFO/RR tasks preemption and not about the purely fairness
driven SCHED_OTHER preemption latencies.

So I introduced a lazy preemption mechanism which only applies to
SCHED_OTHER tasks preempting another SCHED_OTHER task. Aside from the
existing preempt_count, each task now sports a preempt_lazy_count
which is manipulated on lock acquisition and release. This is slightly
incorrect as, for laziness reasons, I coupled this to
migrate_disable/enable so some other mechanisms get the same treatment
(e.g. get_cpu_light).

Now, on the scheduler side, instead of setting NEED_RESCHED this sets
NEED_RESCHED_LAZY in case of a SCHED_OTHER/SCHED_OTHER preemption and
therefore allows the waking task to exit the lock-held region before
the woken task preempts. That also works better for cross-CPU wakeups
as the other side can stay in the adaptive spinning loop.

For RT class preemption there is no change. This simply sets
NEED_RESCHED and forgoes the lazy preemption counter.

Initial tests do not expose any observable latency increase, but
history shows that I've been proven wrong before :)

The lazy preemption mode is on by default, but with
CONFIG_SCHED_DEBUG enabled it can be disabled via:

 # echo NO_PREEMPT_LAZY >/sys/kernel/debug/sched_features

and re-enabled via

 # echo PREEMPT_LAZY >/sys/kernel/debug/sched_features

The test results so far are very machine- and workload-dependent, but
there is a clear trend that it enhances non-RT workload performance.
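
A condensed sketch of the scheduler-side hook; helper names such as
set_tsk_need_resched_lazy() are assumptions here, and the real patch covers
more paths:

    void resched_curr_lazy(struct rq *rq)
    {
        if (!sched_feat(PREEMPT_LAZY)) {
            /* Lazy mode disabled: fall back to a normal resched request. */
            resched_curr(rq);
            return;
        }

        if (test_tsk_need_resched(rq->curr))
            return;

        /* Defer SCHED_OTHER preemption until the lazy count drops to zero. */
        set_tsk_need_resched_lazy(rq->curr);
    }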

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2023-03-25 04:21:32 +03:00
David Miller 53755fc090 bpf/stackmap: Dont trylock mmap_sem with PREEMPT_RT and interrupts disabled
In an RT kernel down_read_trylock() cannot be used from NMI context, and
up_read_non_owner() is problematic as well.

So in such a configuration, simply elide the annotated stackmap and
just report the raw IPs.

In the longer term, it might be possible to provide an atomic-friendly
version of the page cache traversal which will at least provide the
information whether the pages are resident and don't need to be paged in.
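
A condensed sketch of the check in stack_map_get_build_id_offset(),
paraphrased with assumed local variable names:

    if (irqs_disabled()) {
        if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
            /*
             * RT cannot down_read_trylock(&mm->mmap_sem) with interrupts
             * disabled: force the fallback which reports raw IPs only.
             */
            irq_work_busy = true;
        } else {
            /* !RT: defer the up_read() to irq_work as before. */
            work = this_cpu_ptr(&up_read_work);
        }
    }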

[ tglx: Use IS_ENABLED() to avoid the #ifdeffery, fixup the irq work
  	callback and add a comment ]

Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:32 +03:00
Thomas Gleixner 36ee95a49d bpf, lpm: Make locking RT friendly
The LPM trie map cannot be used in contexts like perf, kprobes and tracing
as this map type dynamically allocates memory.

The memory allocation happens with a raw spinlock held which is a truly
spinning lock on a PREEMPT RT enabled kernel which disables preemption and
interrupts.

As RT does not allow memory allocation from such a section for various
reasons, convert the raw spinlock to a regular spinlock.

On an RT enabled kernel these locks are substituted by 'sleeping' spinlocks
which provide the proper protection but keep the code preemptible.

On a non-RT kernel regular spinlocks map to raw spinlocks, i.e. this does
not cause any functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:32 +03:00
Thomas Gleixner b1e96321f4 bpf: Prepare hashtab locking for PREEMPT_RT
PREEMPT_RT forbids certain operations like memory allocations (even with
GFP_ATOMIC) from atomic contexts. This is required because even with
GFP_ATOMIC the memory allocator calls into code paths which acquire locks
with long held lock sections. To ensure the deterministic behaviour these
locks are regular spinlocks, which are converted to 'sleepable' spinlocks
on RT. The only true atomic contexts on an RT kernel are the low level
hardware handling, scheduling, low level interrupt handling, NMIs etc. None
of these contexts should ever do memory allocations.

As regular device interrupt handlers and soft interrupts are forced into
thread context, the existing code which does
  spin_lock*(); alloc(GPF_ATOMIC); spin_unlock*();
just works.

In theory the BPF locks could be converted to regular spinlocks as well,
but the bucket locks and percpu_freelist locks can be taken from arbitrary
contexts (perf, kprobes, tracepoints) which are required to be atomic
contexts even on RT. These mechanisms require preallocated maps, so there
is no need to invoke memory allocations within the lock held sections.

BPF maps which need dynamic allocation are only used from (forced) thread
context on RT and can therefore use regular spinlocks which in turn allows
to invoke memory allocations from the lock held section.

To achieve this make the hash bucket lock a union of a raw and a regular
spinlock and initialize and lock/unlock either the raw spinlock for
preallocated maps or the regular variant for maps which require memory
allocations.

On a non-RT kernel this distinction is neither possible nor required:
spinlock maps to raw_spinlock and the extra code and conditional is
optimized out by the compiler. No functional change.
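
A trimmed sketch of the bucket lock union and one helper, close to but
abbreviated from the change in kernel/bpf/hashtab.c:

    struct bucket {
        struct hlist_nulls_head head;
        union {
            raw_spinlock_t raw_lock;
            spinlock_t     lock;
        };
    };

    static inline bool htab_use_raw_lock(const struct bpf_htab *htab)
    {
        return !IS_ENABLED(CONFIG_PREEMPT_RT) || htab_is_prealloc(htab);
    }

    static inline unsigned long htab_lock_bucket(const struct bpf_htab *htab,
                                                 struct bucket *b)
    {
        unsigned long flags;

        if (htab_use_raw_lock(htab))
            raw_spin_lock_irqsave(&b->raw_lock, flags);
        else
            spin_lock_irqsave(&b->lock, flags);
        return flags;
    }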

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:32 +03:00
Thomas Gleixner 45fa008a46 bpf: Factor out hashtab bucket lock operations
As a preparation for making the BPF locking RT friendly, factor out the
hash bucket lock operations into inline functions. This allows the necessary
RT modifications to be done in one place instead of sprinkling them all over
the place. No functional change.

The now unused htab argument of the lock/unlock functions will be used in
the next step which adds PREEMPT_RT support.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:32 +03:00
Thomas Gleixner 0c01246e9e bpf: Replace open coded recursion prevention in sys_bpf()
The required protection is that the caller cannot be migrated to a
different CPU as these functions end up in places which take either a hash
bucket lock or might trigger a kprobe inside the memory allocator. Both
scenarios can lead to deadlocks. The deadlock prevention is per CPU by
incrementing a per CPU variable which temporarily blocks the invocation of
BPF programs from perf and kprobes.

Replace the open coded preempt_[dis|en]able and __this_cpu_[inc|dec] pairs
with the new helper functions. These functions are already prepared to make
BPF work on PREEMPT_RT enabled kernels. No functional change for !RT
kernels.
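
The helpers look roughly like this (a sketch of the upstream
bpf_disable_instrumentation()/bpf_enable_instrumentation() pair):

    static inline void bpf_disable_instrumentation(void)
    {
        migrate_disable();
        if (IS_ENABLED(CONFIG_PREEMPT_RT))
            this_cpu_inc(bpf_prog_active);
        else
            __this_cpu_inc(bpf_prog_active);
    }

    static inline void bpf_enable_instrumentation(void)
    {
        if (IS_ENABLED(CONFIG_PREEMPT_RT))
            this_cpu_dec(bpf_prog_active);
        else
            __this_cpu_dec(bpf_prog_active);
        migrate_enable();
    }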

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:32 +03:00
David Miller f6eb4c80c0 bpf: Use bpf_prog_run_pin_on_cpu() at simple call sites.
All of these cases are strictly of the form:

	preempt_disable();
	BPF_PROG_RUN(...);
	preempt_enable();

Replace this with bpf_prog_run_pin_on_cpu() which wraps BPF_PROG_RUN()
with:

	migrate_disable();
	BPF_PROG_RUN(...);
	migrate_enable();

On non RT enabled kernels this maps to preempt_disable/enable() and on RT
enabled kernels this solely prevents migration, which is sufficient as
there is no requirement to prevent reentrancy to any BPF program from a
preempting task. The only requirement is that the program stays on the same
CPU.

Therefore, this is a trivially correct transformation.

The seccomp loop does not need protection over the whole loop. It only needs
protection per BPF filter program.
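
For reference, a sketch of the wrapper roughly as it exists in
include/linux/filter.h:

    static inline u32 bpf_prog_run_pin_on_cpu(const struct bpf_prog *prog,
                                              const void *ctx)
    {
        u32 ret;

        migrate_disable();
        ret = BPF_PROG_RUN(prog, ctx);
        migrate_enable();
        return ret;
    }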

[ tglx: Converted to bpf_prog_run_pin_on_cpu() ]

Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:31 +03:00
Thomas Gleixner 60f1dcfbe0 bpf: Dont iterate over possible CPUs with interrupts disabled
pcpu_freelist_populate() is disabling interrupts and then iterates over the
possible CPUs. The reason why this disables interrupts is to silence
lockdep because the invoked ___pcpu_freelist_push() takes spin locks.

Neither the interrupt disabling nor the locking are required in this
function because it's called during initialization and the resulting map is
not yet visible to anything.

Split out the actual push assignment into an inline function, call it from
the loop and remove the interrupt disabling.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:31 +03:00
Thomas Gleixner d1a724bf1e perf/bpf: Remove preempt disable around BPF invocation
The BPF invocation from the perf event overflow handler does not require
disabling preemption because this is called from NMI or at least hard
interrupt context, which is already non-preemptible.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:31 +03:00
Thomas Gleixner ded67a610a bpf/trace: Remove redundant preempt_disable from trace_call_bpf()
Similar to __bpf_trace_run(), this is redundant because trace_call_bpf() is
invoked from a trace point via __DO_TRACE() which already disables preemption
_before_ invoking any of the functions which are attached to a trace point.

Remove it and add a cant_sleep() check.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:31 +03:00
Alexei Starovoitov 3afab04a11 bpf: disable preemption for bpf progs attached to uprobe
trace_call_bpf() no longer disables preemption on its own.
All callers of this function have to do it explicitly.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:31 +03:00
Thomas Gleixner bd22595aad bpf/trace: Remove EXPORT from trace_call_bpf()
All callers are built in. There is no point in exporting this.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:31 +03:00
Thomas Gleixner 98a5660da8 bpf/tracing: Remove redundant preempt_disable() in __bpf_trace_run()
__bpf_trace_run() disables preemption around the BPF_PROG_RUN() invocation.

This is redundant because __bpf_trace_run() is invoked from a trace point
via __DO_TRACE() which already disables preemption _before_ invoking any of
the functions which are attached to a trace point.

Remove it and add a cant_sleep() check.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:31 +03:00
Thomas Gleixner f067b3a092 bpf: Update locking comment in hashtab code
The comment where the bucket lock is acquired says:

  /* bpf_map_update_elem() can be called in_irq() */

which is not really helpful and, aside from that, it does not explain the
subtle details of the hash bucket locks, especially in the context of BPF
and perf, kprobes and tracing.

Add a comment at the top of the file which explains the protection scopes
and the details how potential deadlocks are prevented.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:31 +03:00
Thomas Gleixner 789b06d8de bpf: Enforce preallocation for instrumentation programs on RT
Aside from the general unsafety of run-time map allocation for
instrumentation-type programs, RT enabled kernels have another constraint:

The instrumentation programs are invoked with preemption disabled, but the
memory allocator spinlocks cannot be acquired in atomic context because
they are converted to 'sleeping' spinlocks on RT.

Therefore enforce map preallocation for these program types when RT is
enabled.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:30 +03:00
Thomas Gleixner dbbbf81012 bpf: Tighten the requirements for preallocated hash maps
The assumption that only programs attached to perf NMI events can deadlock
on memory allocators is wrong. Assume the following simplified callchain:

 kmalloc() from regular non BPF context
  cache empty
   freelist empty
    lock(zone->lock);
     tracepoint or kprobe
      BPF()
       update_elem()
        lock(bucket)
          kmalloc()
           cache empty
            freelist empty
             lock(zone->lock);  <- DEADLOCK

There are other ways which do not involve locking to create wreckage:

 kmalloc() from regular non BPF context
  local_irq_save();
   ...
    obj = slab_first();
     kprobe()
      BPF()
       update_elem()
        lock(bucket)
         kmalloc()
          local_irq_save();
           ...
            obj = slab_first(); <- Same object as above ...

So preallocation _must_ be enforced for all variants of intrusive
instrumentation.

Unfortunately immediate enforcement would break backwards compatibility, so
for now such programs are still allowed to run, but a one-time warning is
emitted in dmesg and the verifier emits a warning in the verifier log as
well, so developers are made aware of this and can fix their programs
before the enforcement becomes mandatory.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:30 +03:00
Mike Galbraith e776981391 cpuset: Convert callback_lock to raw_spinlock_t
The two commits below add up to a cpuset might_sleep() splat for RT:

8447a0fee9 cpuset: convert callback_mutex to a spinlock
344736f29b cpuset: simplify cpuset_node_allowed API

BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:995
in_atomic(): 0, irqs_disabled(): 1, pid: 11718, name: cset
CPU: 135 PID: 11718 Comm: cset Tainted: G            E   4.10.0-rt1-rt #4
Hardware name: Intel Corporation BRICKLAND/BRICKLAND, BIOS BRHSXSD1.86B.0056.R01.1409242327 09/24/2014
Call Trace:
 ? dump_stack+0x5c/0x81
 ? ___might_sleep+0xf4/0x170
 ? rt_spin_lock+0x1c/0x50
 ? __cpuset_node_allowed+0x66/0xc0
 ? ___slab_alloc+0x390/0x570 <disables IRQs>
 ? anon_vma_fork+0x8f/0x140
 ? copy_page_range+0x6cf/0xb00
 ? anon_vma_fork+0x8f/0x140
 ? __slab_alloc.isra.74+0x5a/0x81
 ? anon_vma_fork+0x8f/0x140
 ? kmem_cache_alloc+0x1b5/0x1f0
 ? anon_vma_fork+0x8f/0x140
 ? copy_process.part.35+0x1670/0x1ee0
 ? _do_fork+0xdd/0x3f0
 ? _do_fork+0xdd/0x3f0
 ? do_syscall_64+0x61/0x170
 ? entry_SYSCALL64_slow_path+0x25/0x25

The latter ensured that a NUMA box WILL take callback_lock in atomic
context by removing the allocator and reclaim path __GFP_HARDWALL
usage which prevented such contexts from taking callback_mutex.

One option would be to reinstate __GFP_HARDWALL protections for
RT, however, as the 8447a0fee9 changelog states:

The callback_mutex is only used to synchronize reads/updates of cpusets'
flags and cpu/node masks. These operations should always proceed fast so
there's no reason why we can't use a spinlock instead of the mutex.

Cc: stable-rt@vger.kernel.org
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:30 +03:00
Thomas Gleixner 73ccba39d9 lockdep: Make it RT aware
Teach lockdep that we don't really do softirqs on -RT.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2023-03-25 04:21:29 +03:00
Thomas Gleixner ad4a585864 random: Make it work on rt
Delegate the random insertion to the forced threaded interrupt
handler. Store the return IP of the hard interrupt handler in the irq
descriptor and feed it into the random generator as a source of
entropy.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2023-03-25 04:21:29 +03:00
Thomas Gleixner 721360ec3a panic: skip get_random_bytes for RT_FULL in init_oops_id
Disable it on -RT. If this is invoked from irq context we will have problems
acquiring the sleeping lock.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2023-03-25 04:21:29 +03:00
Sebastian Andrzej Siewior 87f8ee14f4 irqwork: push most work into softirq context
Initially we deferred all irqwork into softirq because we didn't want the
latency spikes if perf or another user was busy and delayed the RT task.
The NOHZ trigger (nohz_full_kick_work) was the first user that did not work
as expected if it did not run in the original irqwork context, so we had to
bring it back somehow for it. push_irq_work_func is the second one that
requires this.

This patch adds the IRQ_WORK_HARD_IRQ flag which makes sure the callback runs
in raw-irq context. Everything else is deferred into softirq context. Without
-RT we have the original behavior.

This patch incorporates tglx's original work, reworked a little to bring back
arch_irq_work_raise() if possible, and a few fixes from Steven Rostedt and
Mike Galbraith.

[bigeasy: melt tglx's irq_work_tick_soft() which splits irq_work_tick() into a
          hard and soft variant]
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:29 +03:00
Thomas Gleixner 9b9cfa4b04 rt: Introduce cpu_chill()
Retry loops on RT might loop forever when the modifying side was
preempted. Add cpu_chill() to replace cpu_relax(). cpu_chill()
defaults to cpu_relax() for non RT. On RT it puts the looping task to
sleep for a tick so the preempted task can make progress.

Steven Rostedt changed it to use a hrtimer instead of msleep():
|
|Ulrich Obergfell pointed out that cpu_chill() calls msleep() which is woken
|up by the ksoftirqd running the TIMER softirq. But as the cpu_chill() is
|called from softirq context, it may block the ksoftirqd() from running, in
|which case, it may never wake up the msleep() causing the deadlock.

+ bigeasy later changed to schedule_hrtimeout()
|If a task calls cpu_chill() and gets woken up by a regular or spurious
|wakeup and has a signal pending, then it exits the sleep loop in
|do_nanosleep() and sets up the restart block. If restart->nanosleep.type is
|not TI_NONE then this results in accessing a stale user pointer from a
|previously interrupted syscall and a copy to user based on the stale
|pointer or a BUG() when 'type' is not supported in nanosleep_copyout().

+ bigeasy: add PF_NOFREEZE:
| [....] Waiting for /dev to be fully populated...
| =====================================
| [ BUG: udevd/229 still has locks held! ]
| 3.12.11-rt17 #23 Not tainted
| -------------------------------------
| 1 lock held by udevd/229:
|  #0:  (&type->i_mutex_dir_key#2){+.+.+.}, at: lookup_slow+0x28/0x98
|
| stack backtrace:
| CPU: 0 PID: 229 Comm: udevd Not tainted 3.12.11-rt17 #23
| (unwind_backtrace+0x0/0xf8) from (show_stack+0x10/0x14)
| (show_stack+0x10/0x14) from (dump_stack+0x74/0xbc)
| (dump_stack+0x74/0xbc) from (do_nanosleep+0x120/0x160)
| (do_nanosleep+0x120/0x160) from (hrtimer_nanosleep+0x90/0x110)
| (hrtimer_nanosleep+0x90/0x110) from (cpu_chill+0x30/0x38)
| (cpu_chill+0x30/0x38) from (dentry_kill+0x158/0x1ec)
| (dentry_kill+0x158/0x1ec) from (dput+0x74/0x15c)
| (dput+0x74/0x15c) from (lookup_real+0x4c/0x50)
| (lookup_real+0x4c/0x50) from (__lookup_hash+0x34/0x44)
| (__lookup_hash+0x34/0x44) from (lookup_slow+0x38/0x98)
| (lookup_slow+0x38/0x98) from (path_lookupat+0x208/0x7fc)
| (path_lookupat+0x208/0x7fc) from (filename_lookup+0x20/0x60)
| (filename_lookup+0x20/0x60) from (user_path_at_empty+0x50/0x7c)
| (user_path_at_empty+0x50/0x7c) from (user_path_at+0x14/0x1c)
| (user_path_at+0x14/0x1c) from (vfs_fstatat+0x48/0x94)
| (vfs_fstatat+0x48/0x94) from (SyS_stat64+0x14/0x30)
| (SyS_stat64+0x14/0x30) from (ret_fast_syscall+0x0/0x48)
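
Putting the pieces above together, the helper ends up roughly like this
(a sketch of the RT tree's cpu_chill(); details may differ):

    void cpu_chill(void)
    {
        ktime_t chill_time = ktime_set(0, NSEC_PER_MSEC);
        unsigned int freeze_flag = current->flags & PF_NOFREEZE;

        /* Sleep for roughly a tick instead of busy-looping. */
        set_current_state(TASK_UNINTERRUPTIBLE);
        current->flags |= PF_NOFREEZE;
        schedule_hrtimeout(&chill_time, HRTIMER_MODE_REL_HARD);
        if (!freeze_flag)
            current->flags &= ~PF_NOFREEZE;
    }
    EXPORT_SYMBOL(cpu_chill);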

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:28 +03:00
Scott Wood a5850930e9 rcutorture: Avoid problematic critical section nesting on RT
rcutorture was generating some nesting scenarios that are not
reasonable.  Constrain the state selection to avoid them.

Example #1:

1. preempt_disable()
2. local_bh_disable()
3. preempt_enable()
4. local_bh_enable()

On PREEMPT_RT, BH disabling takes a local lock only when called in
non-atomic context.  Thus, atomic context must be retained until after BH
is re-enabled.  Likewise, if BH is initially disabled in non-atomic
context, it cannot be re-enabled in atomic context.

Example #2:

1. rcu_read_lock()
2. local_irq_disable()
3. rcu_read_unlock()
4. local_irq_enable()

If the thread is preempted between steps 1 and 2,
rcu_read_unlock_special.b.blocked will be set, but it won't be
acted on in step 3 because IRQs are disabled.  Thus, reporting of the
quiescent state will be delayed beyond the local_irq_enable().

For now, these scenarios will continue to be tested on non-PREEMPT_RT
kernels, until debug checks are added to ensure that they are not
happening elsewhere.

Signed-off-by: Scott Wood <swood@redhat.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:27 +03:00
Julia Cartwright 56ca6ad996 rcu: enable rcu_normal_after_boot by default for RT
The forcing of an expedited grace period is an expensive and very
RT-application unfriendly operation, as it forcibly preempts all running
tasks on CPUs which are preventing the gp from expiring.

By default, as a policy decision, disable the expediting of grace
periods (after boot) on configurations which enable PREEMPT_RT.

Suggested-by: Luiz Capitulino <lcapitulino@redhat.com>
Acked-by: Paul E. McKenney <paulmck@linux.ibm.com>
Signed-off-by: Julia Cartwright <julia@ni.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:27 +03:00
Sebastian Andrzej Siewior 277efec1fa srcu: replace local_irqsave() with a locallock
There are two instances which disable interrupts in order to obtain a
stable this_cpu_ptr() pointer. The restore part is coupled with
spin_unlock_irqrestore(), which does not work on RT.
Replace the local_irq_save() call with the appropriate local_lock()
version of it.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:27 +03:00
Scott Wood fc049aaf7b rcu: Use rcuc threads on PREEMPT_RT as we did
While switching to the reworked RCU-thread code, enabling the thread
processing on -RT was forgotten.
Besides restoring behaviour that used to be the default on RT, this avoids
a deadlock on scheduler locks.

Signed-off-by: Scott Wood <swood@redhat.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:27 +03:00
Sebastian Andrzej Siewior 27de62e778 locking: Make spinlock_t and rwlock_t a RCU section on RT
On !RT a locked spinlock_t and rwlock_t disables preemption, which
implies an RCU read section. There is code that relies on that behaviour.

Add an explicit RCU read section on RT while a sleeping lock (a lock
which would disable preemption on !RT) is acquired.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:27 +03:00
Thomas Gleixner c8ff2f06be futex: workaround migrate_disable/enable in different context
migrate_enable() invokes __schedule() and it expects a preempt count of one.
Holding a raw_spinlock_t with disabled interrupts should not allow scheduling.

These little hacks ensure that we don't schedule while we lock the hb lock with
interrupts enabled and unlock it with interrupts disabled.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
[XXX: As per PeterZ's suggestion
	set_thread_flag(TIF_NEED_RESCHED); preempt_fold_need_resched()
 would trigger a scheduler invocation on the last preempt_enable(), which in
 turn would allow dropping this.
]
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:27 +03:00
Thomas Gleixner 6ff7d19cf2 trace: Add migrate-disabled counter to tracing output
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2023-03-25 04:21:27 +03:00
Scott Wood a4aa572645 sched: migrate_enable: Remove __schedule() call
We can rely on preempt_enable() to schedule.  Besides simplifying the
code, this potentially allows sequences such as the following to be
permitted:

migrate_disable();
preempt_disable();
migrate_enable();
preempt_enable();

Suggested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Scott Wood <swood@redhat.com>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:27 +03:00
Scott Wood e2e316ff2b sched: migrate_enable: Use per-cpu cpu_stop_work
Commit e6c287b1512d ("sched: migrate_enable: Use stop_one_cpu_nowait()")
adds a busy wait to deal with an edge case where the migrated thread
can resume running on another CPU before the stopper has consumed
cpu_stop_work.  However, this is done with preemption disabled and can
potentially lead to deadlock.

While it is not guaranteed that the cpu_stop_work will be consumed before
the migrating thread resumes and exits the stack frame, it is guaranteed
that nothing other than the stopper can run on the old cpu between the
migrating thread scheduling out and the cpu_stop_work being consumed.
Thus, we can store cpu_stop_work in per-cpu data without it being
reused too early.

Fixes: e6c287b1512d ("sched: migrate_enable: Use stop_one_cpu_nowait()")
Suggested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Scott Wood <swood@redhat.com>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:27 +03:00
Scott Wood 5be92193b8 sched: migrate_enable: Use stop_one_cpu_nowait()
migrate_enable() can be called with current->state != TASK_RUNNING.
Avoid clobbering the existing state by using stop_one_cpu_nowait().
Since we're stopping the current cpu, we know that we won't get
past __schedule() until migration_cpu_stop() has run (at least up to
the point of migrating us to another cpu).

Signed-off-by: Scott Wood <swood@redhat.com>
[bigeasy: spin until the request has been processed]
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:27 +03:00
Sebastian Andrzej Siewior fe6fe368b0 sched/core: migrate_enable() must access takedown_cpu_task on !HOTPLUG_CPU
The variable takedown_cpu_task is never declared/used on !HOTPLUG_CPU
except for migrate_enable(). This leads to a link error.

Don't use takedown_cpu_task in !HOTPLUG_CPU.

Reported-by: Dick Hollenbeck <dick@softplc.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:26 +03:00