Commit Graph

32282 Commits

Author SHA1 Message Date
Alibek Omarov c1ce54bceb Linux 5.4.193 with MCST patches (6.2) 2023-03-25 04:34:12 +03:00
Sebastian Andrzej Siewior df6e402903 sched: Switch wait_task_inactive to HRTIMER_MODE_REL_HARD
[ Upstream commit 39609ed79d420e0b966e16a1d695733c2d3b9a7f ]

With PREEMPT_RT enabled all hrtimers callbacks will be invoked in
softirq mode unless they are explicitly marked as HRTIMER_MODE_HARD.
During boot kthread_bind() is used for the creation of per-CPU threads
and then hangs in wait_task_inactive() if the ksoftirqd is not
yet up and running.
The hang disappeared since commit
   26c7295be0c5e ("kthread: Do not preempt current task if it is going to call schedule()")

but enabling function tracing on boot reliably leads to the boot-time freeze
again.
The timer in wait_task_inactive() cannot be abused via a user interface to
create a mass wakeup of several tasks at the same time, which would lead to
long sections with disabled interrupts.
Therefore it is safe to make the timer HRTIMER_MODE_REL_HARD.

Switch the timer to HRTIMER_MODE_REL_HARD.
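
For reference, a rough sketch of the affected spot in wait_task_inactive()
(kernel/sched/core.c) after the change; paraphrased, not the exact hunk:

    if (unlikely(queued)) {
        ktime_t to = NSEC_PER_SEC / HZ;

        set_current_state(TASK_UNINTERRUPTIBLE);
        /* Expire in hard interrupt context even on PREEMPT_RT. */
        schedule_hrtimeout(&to, HRTIMER_MODE_REL_HARD);
        continue;
    }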

Cc: stable-rt@vger.kernel.org
Link: https://lkml.kernel.org/r/20210826170408.vm7rlj7odslshwch@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
2023-03-25 04:21:37 +03:00
Andrew Halaney 22562d5988 locking/rwsem-rt: Remove might_sleep() in __up_read()
[ Upstream commit b2ed0a4302faf2bb09e97529dd274233c082689b ]

There's no chance of sleeping here; the reader is giving up the
lock and possibly waking up the writer who is waiting on it.

Reported-by: Chunyu Hu <chuhu@redhat.com>
Signed-off-by: Andrew Halaney <ahalaney@redhat.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
2023-03-25 04:21:37 +03:00
Sebastian Andrzej Siewior a43acd1e98 locking/rwsem-rt: Add __down_read_interruptible()
The stable tree backported a patch which adds __down_read_interruptible() for
the generic rwsem implementation.

Add the RT version of __down_read_interruptible().

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:37 +03:00
Sebastian Andrzej Siewior 18d5a2ded1 Revert "hrtimer: Allow raw wakeups during boot"
This change is no longer needed since commit
   26c7295be0c5e ("kthread: Do not preempt current task if it is going to call schedule()")

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2023-03-25 04:21:37 +03:00
Sebastian Andrzej Siewior 5564fa35b3 timers: Don't block on ->expiry_lock for TIMER_IRQSAFE
PREEMPT_RT does not spin and wait until a running timer completes its
callback but instead it blocks on a sleeping lock to prevent a deadlock.

This blocking cannot be done for the workqueue's IRQSAFE timer, which will
be canceled in an IRQ-off region. The cancellation has to happen in an
IRQ-off region because changing the PENDING bit and clearing the timer must
not be interrupted, to avoid a busy-loop.

The callback invocation of an IRQSAFE timer is not preempted on PREEMPT_RT,
so there is no need to synchronize on timer_base::expiry_lock.

Don't acquire timer_base::expiry_lock for TIMER_IRQSAFE flagged
timers.
Add a lockdep annotation to ensure that this function is always invoked
in preemptible context on PREEMPT_RT.
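
A sketch of the resulting check, assuming the helper and field names used in
kernel/time/timer.c (del_timer_wait_running(), timer_base::expiry_lock); not
the literal hunk:

    static void del_timer_wait_running(struct timer_list *timer)
    {
        u32 tf = READ_ONCE(timer->flags);

        /* IRQSAFE callbacks are not preempted on RT: nothing to wait for. */
        if (!(tf & (TIMER_MIGRATING | TIMER_IRQSAFE))) {
            struct timer_base *base = get_timer_base(tf);

            atomic_inc(&base->timer_waiters);
            spin_lock_bh(&base->expiry_lock);
            atomic_dec(&base->timer_waiters);
            spin_unlock_bh(&base->expiry_lock);
        }
    }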

Reported-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: stable-rt@vger.kernel.org
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2023-03-25 04:21:36 +03:00
Oleg Nesterov 6bd4eef1d8 ptrace: fix ptrace_unfreeze_traced() race with rt-lock
The patch "ptrace: fix ptrace vs tasklist_lock race" changed
ptrace_freeze_traced() to take task->saved_state into account, but
ptrace_unfreeze_traced() has the same problem and needs a similar fix:
it should check/update both ->state and ->saved_state.
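
A simplified sketch of the idea, assuming the RT tree's
task_struct::saved_state field; not the literal patch:

    spin_lock_irq(&task->sighand->siglock);
    raw_spin_lock_irqsave(&task->pi_lock, flags);
    /* Restore whichever state field actually holds __TASK_TRACED. */
    if (task->state == __TASK_TRACED)
        task->state = TASK_TRACED;
    else if (task->saved_state == __TASK_TRACED)
        task->saved_state = TASK_TRACED;
    raw_spin_unlock_irqrestore(&task->pi_lock, flags);
    spin_unlock_irq(&task->sighand->siglock);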

Reported-by: Luis Claudio R. Goncalves <lgoncalv@redhat.com>
Fixes: "ptrace: fix ptrace vs tasklist_lock race"
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: stable-rt@vger.kernel.org
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2023-03-25 04:21:36 +03:00
Sebastian Andrzej Siewior 36c2e4d093 rwsem: Provide down_read_non_owner() and up_read_non_owner() for -RT
The rwsem implementation on -RT allows multiple readers and there is no
owner tracking anymore.
We can provide down_read_non_owner() and up_read_non_owner() by skipping
the owner check bits which are only available in the !RT implementation.
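
A minimal sketch of the -RT variants, assuming they can reuse the internal
__down_read()/__up_read() helpers of the RT rwsem implementation:

    void down_read_non_owner(struct rw_semaphore *sem)
    {
        might_sleep();
        __down_read(sem);
    }
    EXPORT_SYMBOL(down_read_non_owner);

    void up_read_non_owner(struct rw_semaphore *sem)
    {
        __up_read(sem);
    }
    EXPORT_SYMBOL(up_read_non_owner);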

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2023-03-25 04:21:36 +03:00
Sebastian Andrzej Siewior 58d58190a1 workqueue: Sync with upstream
This is an all-in-one patch reverting the following commits:
  workqueue: Don't assume that the callback has interrupts disabled
  sched/swait: Add swait_event_lock_irq()
  workqueue: Use swait for wq_manager_wait
  workqueue: Convert the locks to raw type

and introducing the following commits from upstream:
  workqueue: Use rcuwait for wq_manager_wait
  workqueue: Convert the pool::lock and wq_mayday_lock to raw_spinlock_t

as a replacement.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2023-03-25 04:21:36 +03:00
Matt Fleming f958e52689 signal: Prevent double-free of user struct
The way user struct reference counting works changed significantly with,

  fda31c50292a ("signal: avoid double atomic counter increments for user accounting")

Now user structs are only freed once the last pending signal is
dequeued. Make sigqueue_free_current() follow this new convention to
avoid freeing the user struct multiple times and triggering this
warning:

 refcount_t: underflow; use-after-free.
 WARNING: CPU: 0 PID: 6794 at lib/refcount.c:288 refcount_dec_not_one+0x45/0x50
 Call Trace:
  refcount_dec_and_lock_irqsave+0x16/0x60
  free_uid+0x31/0xa0
  __dequeue_signal+0x17c/0x190
  dequeue_signal+0x5a/0x1b0
  do_sigtimedwait+0x208/0x250
  __x64_sys_rt_sigtimedwait+0x6f/0xd0
  do_syscall_64+0x72/0x200
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
Reported-by: Daniel Wagner <wagi@monom.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2023-03-25 04:21:36 +03:00
汪勇10269566 8ea11731c2 printk: Force a line break on pr_cont(" ")
Since the printk rework, pr_cont("\n") will not lead to a line break.
A new line will only be created if
- cpu != c->cpu_owner || !(flags & LOG_CONT)
- c->len + len > sizeof(c->buf)

Flush the buffer to enforce a new line on pr_cont().

[bigeasy: reword commit message ]

Signed-off-by: 汪勇10269566 <wang.yong12@zte.com.cn>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2023-03-25 04:21:36 +03:00
John Ogness cbf31f0ca6 printk: console must not schedule for drivers
Even though the printk kthread is always preemptible, it is still not
allowed to call cond_resched() from within console drivers. The
task may become non-preemptible in the console driver call chain. For
example, vt_console_print() takes a spinlock and then can call into
fbcon_redraw(), which can conditionally invoke cond_resched():

|BUG: sleeping function called from invalid context at kernel/printk/printk.c:2322
|in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 177, name: printk
|CPU: 0 PID: 177 Comm: printk Not tainted 5.6.2-00011-ga536059557f1d9 #1
|Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
|Call Trace:
| dump_stack+0x66/0x8b
| ___might_sleep+0x102/0x120
| console_conditional_schedule+0x24/0x30
| fbcon_redraw+0x96/0x1c0
| fbcon_scroll+0x556/0xd70
| con_scroll+0x147/0x1e0
| lf+0x9e/0xb0
| vt_console_print+0x253/0x3d0
| printk_kthread_func+0x1d5/0x3b0

Disable cond_resched() for the call into the console drivers.

Reported-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2023-03-25 04:21:35 +03:00
Clark Williams 894114838d sysfs: Add /sys/kernel/realtime entry
Add a /sys/kernel entry to indicate that the kernel is a
realtime kernel.

Clark says that he needs this for udev rules: udev needs to evaluate
whether it's a PREEMPT_RT kernel a few thousand times, and parsing uname
output is too slow for that.

Are there better solutions? Should it exist and return 0 on !-rt?
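
A sketch of what such an entry looks like in kernel/ksysfs.c, assuming the
KERNEL_ATTR_RO() pattern already used there:

    #ifdef CONFIG_PREEMPT_RT_FULL
    static ssize_t realtime_show(struct kobject *kobj,
                                 struct kobj_attribute *attr, char *buf)
    {
        return sprintf(buf, "%d\n", 1);
    }
    KERNEL_ATTR_RO(realtime);
    #endif

The attribute then has to be added to the kernel_attrs[] array so that it
shows up as /sys/kernel/realtime.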

Signed-off-by: Clark Williams <williams@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
2023-03-25 04:21:35 +03:00
Ingo Molnar 77c67bcd8a genirq: Disable irqpoll on -rt
irqpoll creates long latencies for no value.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2023-03-25 04:21:35 +03:00
Thomas Gleixner f9d7455782 signals: Allow rt tasks to cache one sigqueue struct
To avoid allocation, allow rt tasks to cache one sigqueue struct in the
task struct.
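
A rough sketch of the caching scheme; the sigqueue_cache field and the helper
names are assumptions for illustration, and the allocation/free paths that
use them are omitted:

    static struct sigqueue *sigqueue_from_cache(struct task_struct *t)
    {
        struct sigqueue *q = t->sigqueue_cache;

        /* Claim the cached entry atomically, if there is one. */
        return (cmpxchg(&t->sigqueue_cache, q, NULL) == q) ? q : NULL;
    }

    static bool sigqueue_to_cache(struct task_struct *t, struct sigqueue *q)
    {
        /* Park the sigqueue for reuse instead of freeing it. */
        return cmpxchg(&t->sigqueue_cache, NULL, q) == NULL;
    }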

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2023-03-25 04:21:35 +03:00
Josh Cartwright e1267649bf genirq: update irq_set_irqchip_state documentation
On -rt kernels, the use of migrate_disable()/migrate_enable() is
sufficient to guarantee a task isn't moved to another CPU.  Update the
irq_set_irqchip_state() documentation to reflect this.

Signed-off-by: Josh Cartwright <joshc@ni.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:33 +03:00
Sebastian Andrzej Siewior 3af424b7f4 tracing: make preempt_lazy and migrate_disable counter smaller
The migrate_disable counter should not exceed 255, so it is enough to
store it in an 8-bit field.
With this change we can move the `preempt_lazy_count' member into the
gap, so the whole struct shrinks by 4 bytes to 12 bytes in total.
Remove the `padding' field; it is not needed.
Update the tracing fields in trace_define_common_fields() (it was
missing the preempt_lazy_count field).
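
The resulting common-field layout is roughly (a sketch of struct trace_entry
in the RT tree after this change):

    struct trace_entry {
        unsigned short  type;
        unsigned char   flags;
        unsigned char   preempt_count;
        int             pid;
        unsigned char   migrate_disable;
        unsigned char   preempt_lazy_count;
    };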

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:33 +03:00
Thomas Gleixner 78ed450f36 sched: Add support for lazy preemption
It has become an obsession to mitigate the determinism vs. throughput
loss of RT. Looking at the mainline semantics of preemption points
gives a hint why RT sucks throughput wise for ordinary SCHED_OTHER
tasks. One major issue is the wakeup of tasks which right away preempt
the waking task while the waking task holds a lock on which the woken
task will block right after having preempted the waker. In mainline
this is prevented by the implicit preemption disable of spin/rwlock
held regions. On RT this is not possible due to the fully preemptible
nature of sleeping spinlocks.

Though for a SCHED_OTHER task preempting another SCHED_OTHER task this
is really not a correctness issue. RT folks are concerned about
SCHED_FIFO/RR tasks preemption and not about the purely fairness
driven SCHED_OTHER preemption latencies.

So I introduced a lazy preemption mechanism which only applies to
SCHED_OTHER tasks preempting another SCHED_OTHER task. Aside from the
existing preempt_count, each task now sports a preempt_lazy_count
which is manipulated on lock acquisition and release. This is slightly
incorrect as, for laziness reasons, I coupled this to
migrate_disable/enable so some other mechanisms get the same treatment
(e.g. get_cpu_light).

Now, on the scheduler side, instead of setting NEED_RESCHED this sets
NEED_RESCHED_LAZY in case of a SCHED_OTHER/SCHED_OTHER preemption and
therefore allows the waking task to exit the lock-held region before
the woken task preempts. That also works better for cross-CPU wakeups
as the other side can stay in the adaptive spinning loop.

For RT class preemption there is no change. This simply sets
NEED_RESCHED and forgoes the lazy preemption counter.

Initial tests do not expose any observable latency increase, but
history shows that I've been proven wrong before :)

The lazy preemption mode is on by default, but with
CONFIG_SCHED_DEBUG enabled it can be disabled via:

 # echo NO_PREEMPT_LAZY >/sys/kernel/debug/sched_features

and re-enabled via

 # echo PREEMPT_LAZY >/sys/kernel/debug/sched_features

The test results so far are very machine- and workload-dependent, but
there is a clear trend that it enhances non-RT workload performance.
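
A condensed sketch of the scheduler-side hook; helper names such as
set_tsk_need_resched_lazy() are assumptions here, and the real patch covers
more paths:

    void resched_curr_lazy(struct rq *rq)
    {
        if (!sched_feat(PREEMPT_LAZY)) {
            /* Lazy mode disabled: fall back to a normal resched request. */
            resched_curr(rq);
            return;
        }

        if (test_tsk_need_resched(rq->curr))
            return;

        /* Defer SCHED_OTHER preemption until the lazy count drops to zero. */
        set_tsk_need_resched_lazy(rq->curr);
    }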

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2023-03-25 04:21:32 +03:00
David Miller 53755fc090 bpf/stackmap: Dont trylock mmap_sem with PREEMPT_RT and interrupts disabled
In an RT kernel down_read_trylock() cannot be used from NMI context, and
up_read_non_owner() is problematic as well.

So in such a configuration, simply elide the annotated stackmap and
just report the raw IPs.

In the longer term, it might be possible to provide an atomic-friendly
version of the page cache traversal which will at least provide the
information whether the pages are resident and don't need to be paged in.
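
A condensed sketch of the check in stack_map_get_build_id_offset(),
paraphrased with assumed local variable names:

    if (irqs_disabled()) {
        if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
            /*
             * RT cannot down_read_trylock(&mm->mmap_sem) with interrupts
             * disabled: force the fallback which reports raw IPs only.
             */
            irq_work_busy = true;
        } else {
            /* !RT: defer the up_read() to irq_work as before. */
            work = this_cpu_ptr(&up_read_work);
        }
    }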

[ tglx: Use IS_ENABLED() to avoid the #ifdeffery, fixup the irq work
  	callback and add a comment ]

Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:32 +03:00
Thomas Gleixner 36ee95a49d bpf, lpm: Make locking RT friendly
The LPM trie map cannot be used in contexts like perf, kprobes and tracing
as this map type dynamically allocates memory.

The memory allocation happens with a raw spinlock held which is a truly
spinning lock on a PREEMPT RT enabled kernel which disables preemption and
interrupts.

As RT does not allow memory allocation from such a section for various
reasons, convert the raw spinlock to a regular spinlock.

On an RT enabled kernel these locks are substituted by 'sleeping' spinlocks
which provide the proper protection but keep the code preemptible.

On a non-RT kernel regular spinlocks map to raw spinlocks, i.e. this does
not cause any functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:32 +03:00
Thomas Gleixner b1e96321f4 bpf: Prepare hashtab locking for PREEMPT_RT
PREEMPT_RT forbids certain operations like memory allocations (even with
GFP_ATOMIC) from atomic contexts. This is required because even with
GFP_ATOMIC the memory allocator calls into code paths which acquire locks
with long held lock sections. To ensure the deterministic behaviour these
locks are regular spinlocks, which are converted to 'sleepable' spinlocks
on RT. The only true atomic contexts on an RT kernel are the low level
hardware handling, scheduling, low level interrupt handling, NMIs etc. None
of these contexts should ever do memory allocations.

As regular device interrupt handlers and soft interrupts are forced into
thread context, the existing code which does
  spin_lock*(); alloc(GPF_ATOMIC); spin_unlock*();
just works.

In theory the BPF locks could be converted to regular spinlocks as well,
but the bucket locks and percpu_freelist locks can be taken from arbitrary
contexts (perf, kprobes, tracepoints) which are required to be atomic
contexts even on RT. These mechanisms require preallocated maps, so there
is no need to invoke memory allocations within the lock held sections.

BPF maps which need dynamic allocation are only used from (forced) thread
context on RT and can therefore use regular spinlocks which in turn allows
to invoke memory allocations from the lock held section.

To achieve this make the hash bucket lock a union of a raw and a regular
spinlock and initialize and lock/unlock either the raw spinlock for
preallocated maps or the regular variant for maps which require memory
allocations.

On a non-RT kernel this distinction is neither possible nor required:
spinlock maps to raw_spinlock and the extra code and conditional is
optimized out by the compiler. No functional change.
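
A trimmed sketch of the bucket lock union and one helper, close to but
abbreviated from the change in kernel/bpf/hashtab.c:

    struct bucket {
        struct hlist_nulls_head head;
        union {
            raw_spinlock_t raw_lock;
            spinlock_t     lock;
        };
    };

    static inline bool htab_use_raw_lock(const struct bpf_htab *htab)
    {
        return !IS_ENABLED(CONFIG_PREEMPT_RT) || htab_is_prealloc(htab);
    }

    static inline unsigned long htab_lock_bucket(const struct bpf_htab *htab,
                                                 struct bucket *b)
    {
        unsigned long flags;

        if (htab_use_raw_lock(htab))
            raw_spin_lock_irqsave(&b->raw_lock, flags);
        else
            spin_lock_irqsave(&b->lock, flags);
        return flags;
    }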

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:32 +03:00
Thomas Gleixner 45fa008a46 bpf: Factor out hashtab bucket lock operations
As a preparation for making the BPF locking RT friendly, factor out the
hash bucket lock operations into inline functions. This allows the necessary
RT modifications to be done in one place instead of sprinkling them all over
the place. No functional change.

The now unused htab argument of the lock/unlock functions will be used in
the next step which adds PREEMPT_RT support.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:32 +03:00
Thomas Gleixner 0c01246e9e bpf: Replace open coded recursion prevention in sys_bpf()
The required protection is that the caller cannot be migrated to a
different CPU as these functions end up in places which take either a hash
bucket lock or might trigger a kprobe inside the memory allocator. Both
scenarios can lead to deadlocks. The deadlock prevention is per CPU by
incrementing a per CPU variable which temporarily blocks the invocation of
BPF programs from perf and kprobes.

Replace the open coded preempt_[dis|en]able and __this_cpu_[inc|dec] pairs
with the new helper functions. These functions are already prepared to make
BPF work on PREEMPT_RT enabled kernels. No functional change for !RT
kernels.
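
The helpers look roughly like this (a sketch of the upstream
bpf_disable_instrumentation()/bpf_enable_instrumentation() pair):

    static inline void bpf_disable_instrumentation(void)
    {
        migrate_disable();
        if (IS_ENABLED(CONFIG_PREEMPT_RT))
            this_cpu_inc(bpf_prog_active);
        else
            __this_cpu_inc(bpf_prog_active);
    }

    static inline void bpf_enable_instrumentation(void)
    {
        if (IS_ENABLED(CONFIG_PREEMPT_RT))
            this_cpu_dec(bpf_prog_active);
        else
            __this_cpu_dec(bpf_prog_active);
        migrate_enable();
    }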

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:32 +03:00
David Miller f6eb4c80c0 bpf: Use bpf_prog_run_pin_on_cpu() at simple call sites.
All of these cases are strictly of the form:

	preempt_disable();
	BPF_PROG_RUN(...);
	preempt_enable();

Replace this with bpf_prog_run_pin_on_cpu() which wraps BPF_PROG_RUN()
with:

	migrate_disable();
	BPF_PROG_RUN(...);
	migrate_enable();

On non RT enabled kernels this maps to preempt_disable/enable() and on RT
enabled kernels this solely prevents migration, which is sufficient as
there is no requirement to prevent reentrancy to any BPF program from a
preempting task. The only requirement is that the program stays on the same
CPU.

Therefore, this is a trivially correct transformation.

The seccomp loop does not need protection over the whole loop. It only needs
protection per BPF filter program.
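
For reference, a sketch of the wrapper roughly as it exists in
include/linux/filter.h:

    static inline u32 bpf_prog_run_pin_on_cpu(const struct bpf_prog *prog,
                                              const void *ctx)
    {
        u32 ret;

        migrate_disable();
        ret = BPF_PROG_RUN(prog, ctx);
        migrate_enable();
        return ret;
    }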

[ tglx: Converted to bpf_prog_run_pin_on_cpu() ]

Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:31 +03:00
Thomas Gleixner 60f1dcfbe0 bpf: Dont iterate over possible CPUs with interrupts disabled
pcpu_freelist_populate() is disabling interrupts and then iterates over the
possible CPUs. The reason why this disables interrupts is to silence
lockdep because the invoked ___pcpu_freelist_push() takes spin locks.

Neither the interrupt disabling nor the locking are required in this
function because it's called during initialization and the resulting map is
not yet visible to anything.

Split out the actual push assignment into an inline function, call it from
the loop and remove the interrupt disabling.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:31 +03:00
Thomas Gleixner d1a724bf1e perf/bpf: Remove preempt disable around BPF invocation
The BPF invocation from the perf event overflow handler does not require
disabling preemption because this is called from NMI or at least hard
interrupt context, which is already non-preemptible.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:31 +03:00
Thomas Gleixner ded67a610a bpf/trace: Remove redundant preempt_disable from trace_call_bpf()
Similar to __bpf_trace_run(), this is redundant because trace_call_bpf() is
invoked from a trace point via __DO_TRACE() which already disables preemption
_before_ invoking any of the functions which are attached to a trace point.

Remove it and add a cant_sleep() check.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:31 +03:00
Alexei Starovoitov 3afab04a11 bpf: disable preemption for bpf progs attached to uprobe
trace_call_bpf() no longer disables preemption on its own.
All callers of this function have to do it explicitly.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:31 +03:00
Thomas Gleixner bd22595aad bpf/trace: Remove EXPORT from trace_call_bpf()
All callers are built in. There is no point in exporting this.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:31 +03:00
Thomas Gleixner 98a5660da8 bpf/tracing: Remove redundant preempt_disable() in __bpf_trace_run()
__bpf_trace_run() disables preemption around the BPF_PROG_RUN() invocation.

This is redundant because __bpf_trace_run() is invoked from a trace point
via __DO_TRACE() which already disables preemption _before_ invoking any of
the functions which are attached to a trace point.

Remove it and add a cant_sleep() check.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:31 +03:00
Thomas Gleixner f067b3a092 bpf: Update locking comment in hashtab code
The comment where the bucket lock is acquired says:

  /* bpf_map_update_elem() can be called in_irq() */

which is not really helpful and, aside from that, it does not explain the
subtle details of the hash bucket locks, especially in the context of BPF
and perf, kprobes and tracing.

Add a comment at the top of the file which explains the protection scopes
and the details how potential deadlocks are prevented.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:31 +03:00
Thomas Gleixner 789b06d8de bpf: Enforce preallocation for instrumentation programs on RT
Aside from the general unsafety of run-time map allocation for
instrumentation-type programs, RT enabled kernels have another constraint:

The instrumentation programs are invoked with preemption disabled, but the
memory allocator spinlocks cannot be acquired in atomic context because
they are converted to 'sleeping' spinlocks on RT.

Therefore enforce map preallocation for these program types when RT is
enabled.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:30 +03:00
Thomas Gleixner dbbbf81012 bpf: Tighten the requirements for preallocated hash maps
The assumption that only programs attached to perf NMI events can deadlock
on memory allocators is wrong. Assume the following simplified callchain:

 kmalloc() from regular non BPF context
  cache empty
   freelist empty
    lock(zone->lock);
     tracepoint or kprobe
      BPF()
       update_elem()
        lock(bucket)
          kmalloc()
           cache empty
            freelist empty
             lock(zone->lock);  <- DEADLOCK

There are other ways which do not involve locking to create wreckage:

 kmalloc() from regular non BPF context
  local_irq_save();
   ...
    obj = slab_first();
     kprobe()
      BPF()
       update_elem()
        lock(bucket)
         kmalloc()
          local_irq_save();
           ...
            obj = slab_first(); <- Same object as above ...

So preallocation _must_ be enforced for all variants of intrusive
instrumentation.

Unfortunately immediate enforcement would break backwards compatibility, so
for now such programs are still allowed to run, but a one-time warning is
emitted in dmesg and the verifier emits a warning in the verifier log as
well, so developers are made aware of this and can fix their programs
before the enforcement becomes mandatory.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:30 +03:00
Mike Galbraith e776981391 cpuset: Convert callback_lock to raw_spinlock_t
The two commits below add up to a cpuset might_sleep() splat for RT:

8447a0fee9 cpuset: convert callback_mutex to a spinlock
344736f29b cpuset: simplify cpuset_node_allowed API

BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:995
in_atomic(): 0, irqs_disabled(): 1, pid: 11718, name: cset
CPU: 135 PID: 11718 Comm: cset Tainted: G            E   4.10.0-rt1-rt #4
Hardware name: Intel Corporation BRICKLAND/BRICKLAND, BIOS BRHSXSD1.86B.0056.R01.1409242327 09/24/2014
Call Trace:
 ? dump_stack+0x5c/0x81
 ? ___might_sleep+0xf4/0x170
 ? rt_spin_lock+0x1c/0x50
 ? __cpuset_node_allowed+0x66/0xc0
 ? ___slab_alloc+0x390/0x570 <disables IRQs>
 ? anon_vma_fork+0x8f/0x140
 ? copy_page_range+0x6cf/0xb00
 ? anon_vma_fork+0x8f/0x140
 ? __slab_alloc.isra.74+0x5a/0x81
 ? anon_vma_fork+0x8f/0x140
 ? kmem_cache_alloc+0x1b5/0x1f0
 ? anon_vma_fork+0x8f/0x140
 ? copy_process.part.35+0x1670/0x1ee0
 ? _do_fork+0xdd/0x3f0
 ? _do_fork+0xdd/0x3f0
 ? do_syscall_64+0x61/0x170
 ? entry_SYSCALL64_slow_path+0x25/0x25

The latter ensured that a NUMA box WILL take callback_lock in atomic
context by removing the allocator and reclaim path __GFP_HARDWALL
usage which prevented such contexts from taking callback_mutex.

One option would be to reinstate __GFP_HARDWALL protections for
RT, however, as the 8447a0fee9 changelog states:

The callback_mutex is only used to synchronize reads/updates of cpusets'
flags and cpu/node masks. These operations should always proceed fast so
there's no reason why we can't use a spinlock instead of the mutex.

Cc: stable-rt@vger.kernel.org
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:30 +03:00
Thomas Gleixner 73ccba39d9 lockdep: Make it RT aware
Teach lockdep that we don't really do softirqs on -RT.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2023-03-25 04:21:29 +03:00
Thomas Gleixner ad4a585864 random: Make it work on rt
Delegate the random insertion to the forced threaded interrupt
handler. Store the return IP of the hard interrupt handler in the irq
descriptor and feed it into the random generator as a source of
entropy.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2023-03-25 04:21:29 +03:00
Thomas Gleixner 721360ec3a panic: skip get_random_bytes for RT_FULL in init_oops_id
Disable it on -RT. If this is invoked from irq context we will have problems
acquiring the sleeping lock.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2023-03-25 04:21:29 +03:00
Sebastian Andrzej Siewior 87f8ee14f4 irqwork: push most work into softirq context
Initially we deferred all irqwork into softirq because we didn't want the
latency spikes if perf or another user was busy and delayed the RT task.
The NOHZ trigger (nohz_full_kick_work) was the first user that did not work
as expected if it did not run in the original irqwork context, so we had to
bring it back somehow for it. push_irq_work_func is the second one that
requires this.

This patch adds the IRQ_WORK_HARD_IRQ flag which makes sure the callback runs
in raw-irq context. Everything else is deferred into softirq context. Without
-RT we have the original behavior.

This patch incorporates tglx's original work, reworked a little to bring back
arch_irq_work_raise() if possible, and a few fixes from Steven Rostedt and
Mike Galbraith.

[bigeasy: melt tglx's irq_work_tick_soft() which splits irq_work_tick() into a
          hard and soft variant]
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:29 +03:00
Thomas Gleixner 9b9cfa4b04 rt: Introduce cpu_chill()
Retry loops on RT might loop forever when the modifying side was
preempted. Add cpu_chill() to replace cpu_relax(). cpu_chill()
defaults to cpu_relax() for non RT. On RT it puts the looping task to
sleep for a tick so the preempted task can make progress.

Steven Rostedt changed it to use a hrtimer instead of msleep():
|
|Ulrich Obergfell pointed out that cpu_chill() calls msleep() which is woken
|up by the ksoftirqd running the TIMER softirq. But as the cpu_chill() is
|called from softirq context, it may block the ksoftirqd() from running, in
|which case, it may never wake up the msleep() causing the deadlock.

+ bigeasy later changed to schedule_hrtimeout()
|If a task calls cpu_chill() and gets woken up by a regular or spurious
|wakeup and has a signal pending, then it exits the sleep loop in
|do_nanosleep() and sets up the restart block. If restart->nanosleep.type is
|not TI_NONE then this results in accessing a stale user pointer from a
|previously interrupted syscall and a copy to user based on the stale
|pointer or a BUG() when 'type' is not supported in nanosleep_copyout().

+ bigeasy: add PF_NOFREEZE:
| [....] Waiting for /dev to be fully populated...
| =====================================
| [ BUG: udevd/229 still has locks held! ]
| 3.12.11-rt17 #23 Not tainted
| -------------------------------------
| 1 lock held by udevd/229:
|  #0:  (&type->i_mutex_dir_key#2){+.+.+.}, at: lookup_slow+0x28/0x98
|
| stack backtrace:
| CPU: 0 PID: 229 Comm: udevd Not tainted 3.12.11-rt17 #23
| (unwind_backtrace+0x0/0xf8) from (show_stack+0x10/0x14)
| (show_stack+0x10/0x14) from (dump_stack+0x74/0xbc)
| (dump_stack+0x74/0xbc) from (do_nanosleep+0x120/0x160)
| (do_nanosleep+0x120/0x160) from (hrtimer_nanosleep+0x90/0x110)
| (hrtimer_nanosleep+0x90/0x110) from (cpu_chill+0x30/0x38)
| (cpu_chill+0x30/0x38) from (dentry_kill+0x158/0x1ec)
| (dentry_kill+0x158/0x1ec) from (dput+0x74/0x15c)
| (dput+0x74/0x15c) from (lookup_real+0x4c/0x50)
| (lookup_real+0x4c/0x50) from (__lookup_hash+0x34/0x44)
| (__lookup_hash+0x34/0x44) from (lookup_slow+0x38/0x98)
| (lookup_slow+0x38/0x98) from (path_lookupat+0x208/0x7fc)
| (path_lookupat+0x208/0x7fc) from (filename_lookup+0x20/0x60)
| (filename_lookup+0x20/0x60) from (user_path_at_empty+0x50/0x7c)
| (user_path_at_empty+0x50/0x7c) from (user_path_at+0x14/0x1c)
| (user_path_at+0x14/0x1c) from (vfs_fstatat+0x48/0x94)
| (vfs_fstatat+0x48/0x94) from (SyS_stat64+0x14/0x30)
| (SyS_stat64+0x14/0x30) from (ret_fast_syscall+0x0/0x48)
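
Putting the pieces above together, the helper ends up roughly like this
(a sketch of the RT tree's cpu_chill(); details may differ):

    void cpu_chill(void)
    {
        ktime_t chill_time = ktime_set(0, NSEC_PER_MSEC);
        unsigned int freeze_flag = current->flags & PF_NOFREEZE;

        /* Sleep for roughly a tick instead of busy-looping. */
        set_current_state(TASK_UNINTERRUPTIBLE);
        current->flags |= PF_NOFREEZE;
        schedule_hrtimeout(&chill_time, HRTIMER_MODE_REL_HARD);
        if (!freeze_flag)
            current->flags &= ~PF_NOFREEZE;
    }
    EXPORT_SYMBOL(cpu_chill);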

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:28 +03:00
Scott Wood a5850930e9 rcutorture: Avoid problematic critical section nesting on RT
rcutorture was generating some nesting scenarios that are not
reasonable.  Constrain the state selection to avoid them.

Example #1:

1. preempt_disable()
2. local_bh_disable()
3. preempt_enable()
4. local_bh_enable()

On PREEMPT_RT, BH disabling takes a local lock only when called in
non-atomic context.  Thus, atomic context must be retained until after BH
is re-enabled.  Likewise, if BH is initially disabled in non-atomic
context, it cannot be re-enabled in atomic context.

Example #2:

1. rcu_read_lock()
2. local_irq_disable()
3. rcu_read_unlock()
4. local_irq_enable()

If the thread is preempted between steps 1 and 2,
rcu_read_unlock_special.b.blocked will be set, but it won't be
acted on in step 3 because IRQs are disabled.  Thus, reporting of the
quiescent state will be delayed beyond the local_irq_enable().

For now, these scenarios will continue to be tested on non-PREEMPT_RT
kernels, until debug checks are added to ensure that they are not
happening elsewhere.

Signed-off-by: Scott Wood <swood@redhat.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:27 +03:00
Julia Cartwright 56ca6ad996 rcu: enable rcu_normal_after_boot by default for RT
The forcing of an expedited grace period is an expensive and very
RT-application unfriendly operation, as it forcibly preempts all running
tasks on CPUs which are preventing the gp from expiring.

By default, as a policy decision, disable the expediting of grace
periods (after boot) on configurations which enable PREEMPT_RT.

Suggested-by: Luiz Capitulino <lcapitulino@redhat.com>
Acked-by: Paul E. McKenney <paulmck@linux.ibm.com>
Signed-off-by: Julia Cartwright <julia@ni.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:27 +03:00
Sebastian Andrzej Siewior 277efec1fa srcu: replace local_irqsave() with a locallock
There are two instances which disable interrupts in order to obtain a
stable this_cpu_ptr() pointer. The restore part is coupled with
spin_unlock_irqrestore(), which does not work on RT.
Replace the local_irq_save() call with the appropriate local_lock()
version of it.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:27 +03:00
Scott Wood fc049aaf7b rcu: Use rcuc threads on PREEMPT_RT as we did
While switching to the reworked RCU-thread code, enabling the thread
processing on -RT was forgotten.
Besides restoring behaviour that used to be the default on RT, this avoids
a deadlock on scheduler locks.

Signed-off-by: Scott Wood <swood@redhat.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:27 +03:00
Sebastian Andrzej Siewior 27de62e778 locking: Make spinlock_t and rwlock_t a RCU section on RT
On !RT a locked spinlock_t and rwlock_t disables preemption, which
implies an RCU read section. There is code that relies on that behaviour.

Add an explicit RCU read section on RT while a sleeping lock (a lock
which would disable preemption on !RT) is acquired.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:27 +03:00
Thomas Gleixner c8ff2f06be futex: workaround migrate_disable/enable in different context
migrate_enable() invokes __schedule() and it expects a preempt count of one.
Holding a raw_spinlock_t with disabled interrupts should not allow scheduling.

These little hacks ensure that we don't schedule while we lock the hb lock with
interrupts enabled and unlock it with interrupts disabled.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
[XXX: As per PeterZ's suggestion
	set_thread_flag(TIF_NEED_RESCHED); preempt_fold_need_resched()
 would trigger a scheduler invocation on the last preempt_enable(), which in
 turn would allow dropping this.
]
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:27 +03:00
Thomas Gleixner 6ff7d19cf2 trace: Add migrate-disabled counter to tracing output
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2023-03-25 04:21:27 +03:00
Scott Wood a4aa572645 sched: migrate_enable: Remove __schedule() call
We can rely on preempt_enable() to schedule.  Besides simplifying the
code, this potentially allows sequences such as the following to be
permitted:

migrate_disable();
preempt_disable();
migrate_enable();
preempt_enable();

Suggested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Scott Wood <swood@redhat.com>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:27 +03:00
Scott Wood e2e316ff2b sched: migrate_enable: Use per-cpu cpu_stop_work
Commit e6c287b1512d ("sched: migrate_enable: Use stop_one_cpu_nowait()")
adds a busy wait to deal with an edge case where the migrated thread
can resume running on another CPU before the stopper has consumed
cpu_stop_work.  However, this is done with preemption disabled and can
potentially lead to deadlock.

While it is not guaranteed that the cpu_stop_work will be consumed before
the migrating thread resumes and exits the stack frame, it is guaranteed
that nothing other than the stopper can run on the old cpu between the
migrating thread scheduling out and the cpu_stop_work being consumed.
Thus, we can store cpu_stop_work in per-cpu data without it being
reused too early.

Fixes: e6c287b1512d ("sched: migrate_enable: Use stop_one_cpu_nowait()")
Suggested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Scott Wood <swood@redhat.com>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:27 +03:00
Scott Wood 5be92193b8 sched: migrate_enable: Use stop_one_cpu_nowait()
migrate_enable() can be called with current->state != TASK_RUNNING.
Avoid clobbering the existing state by using stop_one_cpu_nowait().
Since we're stopping the current cpu, we know that we won't get
past __schedule() until migration_cpu_stop() has run (at least up to
the point of migrating us to another cpu).

Signed-off-by: Scott Wood <swood@redhat.com>
[bigeasy: spin until the request has been processed]
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:27 +03:00
Sebastian Andrzej Siewior fe6fe368b0 sched/core: migrate_enable() must access takedown_cpu_task on !HOTPLUG_CPU
The variable takedown_cpu_task is never declared/used on !HOTPLUG_CPU
except for migrate_enable(). This leads to a link error.

Don't use takedown_cpu_task in !HOTPLUG_CPU.

Reported-by: Dick Hollenbeck <dick@softplc.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:26 +03:00