Commit Graph

892207 Commits

Author SHA1 Message Date
Thomas Gleixner 3d0991b2f3 mm: Enable SLUB for RT
Avoid the memory allocation in IRQ section

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
[bigeasy: factor out everything except the kcalloc() workaround]
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:22 +03:00
Ingo Molnar 8dd0922414 mm/vmstat: Protect per cpu variables with preempt disable on RT
Disable preemption on -RT for the vmstat code. On vanilla the code runs in
IRQ-off regions while on -RT it does not. "preempt_disable" ensures that the
same resources are not updated in parallel due to preemption.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2023-03-25 04:21:22 +03:00
Thomas Gleixner 179fc64984 preempt: Provide preempt_*_(no)rt variants
RT needs a few preempt_disable/enable points which are not necessary
otherwise. Implement variants to avoid #ifdeffery.
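
A minimal sketch of what such variants can look like (illustrative only, not
the exact patch):

  #ifdef CONFIG_PREEMPT_RT
  # define preempt_disable_rt()		preempt_disable()
  # define preempt_enable_rt()		preempt_enable()
  # define preempt_disable_nort()	barrier()
  # define preempt_enable_nort()		barrier()
  #else
  # define preempt_disable_rt()		barrier()
  # define preempt_enable_rt()		barrier()
  # define preempt_disable_nort()	preempt_disable()
  # define preempt_enable_nort()		preempt_enable()
  #endif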

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2023-03-25 04:21:22 +03:00
Sebastian Andrzej Siewior cc47dc4fc7 mm/page_alloc: Use migrate_disable() in drain_local_pages_wq()
The drain_local_pages_wq() uses preempt_disable() to ensure that there
will be no CPU migration while drain_local_pages() is invoked which
might happen if the CPU is going down.
drain_local_pages() acquires a sleeping lock on RT which can not be
acquired with disabled preemption.

Use migrate_disable() instead of preempt_disable(): On RT it ensures
that the CPU won't go down and on !RT it is replaced with
preempt_disable().
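
The resulting shape is roughly the following sketch (the real function takes
additional arguments and handles more details):

  static void drain_local_pages_wq(struct work_struct *work)
  {
  	/*
  	 * Pin the task to this CPU without disabling preemption:
  	 * migrate_disable() still allows the sleeping locks taken by
  	 * drain_local_pages() on RT and maps to preempt_disable() on !RT.
  	 */
  	migrate_disable();
  	drain_local_pages(NULL);
  	migrate_enable();
  }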

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:22 +03:00
Luiz Capitulino 17b38f7b98 mm: perform lru_add_drain_all() remotely
lru_add_drain_all() works by scheduling lru_add_drain_cpu() to run
on all CPUs that have non-empty LRU pagevecs and then waiting for
the scheduled work to complete. However, workqueue threads may never
have the chance to run on a CPU that's running a SCHED_FIFO task.
This causes lru_add_drain_all() to block forever.

This commit solves this problem by changing lru_add_drain_all()
to drain the LRU pagevecs of remote CPUs. This is done by grabbing
swapvec_lock and calling lru_add_drain_cpu().

PS: This is based on an idea and initial implementation by
    Rik van Riel.
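
A rough sketch of the remote drain loop, omitting CPU hotplug protection and
the non-empty-pagevec checks; local_lock_on()/local_unlock_on() stand in for
whatever per-CPU lock helpers back swapvec_lock and are assumptions here:

  void lru_add_drain_all(void)
  {
  	int cpu;

  	for_each_online_cpu(cpu) {
  		/* take the remote CPU's pagevec lock instead of
  		 * scheduling work on that CPU */
  		local_lock_on(swapvec_lock, cpu);
  		lru_add_drain_cpu(cpu);
  		local_unlock_on(swapvec_lock, cpu);
  	}
  }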

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:22 +03:00
Ingo Molnar 3dae63e7b1 mm/swap: Convert to percpu locked
Replace global locks (get_cpu + local_irq_save) with "local_locks()".
Currently there is one for "rotate" and one for "swap".

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2023-03-25 04:21:22 +03:00
Ingo Molnar 4901761f49 mm: page_alloc: rt-friendly per-cpu pages
rt-friendly per-cpu pages: convert the irqs-off per-cpu locking
method into a preemptible, explicit-per-cpu-locks method.

Contains fixes from:
	 Peter Zijlstra <a.p.zijlstra@chello.nl>
	 Thomas Gleixner <tglx@linutronix.de>

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2023-03-25 04:21:22 +03:00
Thomas Gleixner 1bdce1f597 mm/SLUB: delay giving back empty slubs to IRQ enabled regions
__free_slab() is invoked with disabled interrupts which increases the
irq-off time while __free_pages() is doing the work.
Allow __free_slab() to be invoked with enabled interrupts and move
everything from interrupts-off invocations to a temporary per-CPU list
so it can be processed later.
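
A sketch of the deferral, assuming a per-CPU list structure along these lines:

  struct slub_free_list {
  	raw_spinlock_t		lock;
  	struct list_head	list;
  };
  static DEFINE_PER_CPU(struct slub_free_list, slub_free_list);

  /* called instead of __free_slab() while interrupts are off */
  static void defer_free_slab(struct page *page)
  {
  	struct slub_free_list *f = this_cpu_ptr(&slub_free_list);
  	unsigned long flags;

  	raw_spin_lock_irqsave(&f->lock, flags);
  	list_add(&page->lru, &f->list);
  	raw_spin_unlock_irqrestore(&f->lock, flags);
  	/* the list is drained later, with interrupts enabled */
  }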

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:22 +03:00
Thomas Gleixner 4dd9f4c8bc mm/SLxB: change list_lock to raw_spinlock_t
The list_lock is used with IRQs off on RT. Make it a raw_spinlock_t
otherwise the interrupts won't be disabled on -RT.  The locking rules remain
the same on !RT.
This patch changes it for SLAB and SLUB since both share the same header
file for the struct kmem_cache_node definition.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:22 +03:00
Peter Zijlstra a826a9cad9 Split IRQ-off and zone->lock while freeing pages from PCP list #2
Split the IRQ-off section for accessing the PCP list from the zone->lock
section while freeing pages.
Introduce isolate_pcp_pages(), which separates the pages from the PCP
list onto a temporary list and then frees the temporary list via
free_pcppages_bulk().
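
The split looks roughly like this sketch (signatures approximate):

  static void drain_pcp(struct zone *zone, struct per_cpu_pages *pcp)
  {
  	LIST_HEAD(dst);
  	unsigned long flags;
  	int count;

  	local_irq_save(flags);			/* protects the PCP list only */
  	count = pcp->count;
  	isolate_pcp_pages(count, pcp, &dst);	/* detach onto a private list */
  	local_irq_restore(flags);

  	/* zone->lock is taken in here, with interrupts enabled again */
  	free_pcppages_bulk(zone, count, &dst);
  }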

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:21 +03:00
Peter Zijlstra ea19afca31 Split IRQ-off and zone->lock while freeing pages from PCP list #1
Split the IRQ-off section for accessing the PCP list from the zone->lock
section while freeing pages.
Introduce isolate_pcp_pages(), which separates the pages from the PCP
list onto a temporary list and then frees the temporary list via
free_pcppages_bulk().

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:21 +03:00
Oleg Nesterov 896cf0cb5d signal/x86: Delay calling signals in atomic
On x86_64 we must disable preemption before we enable interrupts
for stack faults, int3 and debugging, because the current task is using
a per CPU debug stack defined by the IST. If we schedule out, another task
can come in and use the same stack and cause the stack to be corrupted
and crash the kernel on return.

When CONFIG_PREEMPT_RT is enabled, spin_locks become mutexes, and
one of these is the spin lock used in signal handling.

Some of the debug code (int3) causes do_trap() to send a signal.
This function takes a spin lock that has been converted to a mutex
and may sleep. If this happens, the above issue with the corrupted
stack becomes possible.

Instead of sending the signal right away, for PREEMPT_RT and x86_64,
the signal information is stored in the task's task_struct and
TIF_NOTIFY_RESUME is set. Then on exit of the trap, the signal resume
code will send the signal when preemption is enabled.
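
Sketched, the sending side of the deferral looks like this (the forced_info
field is added by the patch and shown here as an assumption):

  static int defer_signal_if_atomic(struct kernel_siginfo *info)
  {
  	if (!in_atomic())
  		return 0;		/* deliver normally */

  	/* The sighand lock is a sleeping lock on RT, so stash the
  	 * siginfo and let the notify-resume path deliver it later. */
  	current->forced_info = *info;
  	set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
  	return 1;
  }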

[ rostedt: Switched from #ifdef CONFIG_PREEMPT_RT to
  ARCH_RT_DELAYS_SIGNAL_SEND and added comments to the code. ]

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
[bigeasy: also needed on 32bit as per Yang Shi <yang.shi@linaro.org>]
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:21 +03:00
Sebastian Andrzej Siewior 647eaa4930 softirq: Add preemptible softirq
Add preemptible softirq for RT's needs. By removing the softirq count
from the preempt counter, the softirq becomes preemptible. A per-CPU
lock ensures that there is no parallel softirq processing and that
per-CPU variables are not accessed in parallel by multiple threads.

local_bh_enable() will process all softirq work that has been raised in
its BH-disabled section once the BH counter gets to 0.
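
A very rough sketch of the per-CPU serialization (initialisation omitted;
the per-task softirq_disable_cnt counter is an assumption here):

  static DEFINE_PER_CPU(spinlock_t, softirq_lock);

  void rt_local_bh_disable(void)
  {
  	/* Only the 0 -> 1 transition takes the (sleeping) per-CPU lock;
  	 * nested sections just bump the counter. */
  	if (current->softirq_disable_cnt++ == 0) {
  		migrate_disable();
  		spin_lock(this_cpu_ptr(&softirq_lock));
  	}
  }

  void rt_local_bh_enable(void)
  {
  	if (--current->softirq_disable_cnt == 0) {
  		do_softirq();	/* run softirqs raised inside the section */
  		spin_unlock(this_cpu_ptr(&softirq_lock));
  		migrate_enable();
  	}
  }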

[+ rcu_read_lock() as part of local_bh_disable() by Scott Wood]
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:21 +03:00
Sebastian Andrzej Siewior 8cf972c7b3 locallock: Include header for the `current' macro
Include the header for `current' macro so that
CONFIG_KERNEL_HEADER_TEST=y passes.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:21 +03:00
Thomas Gleixner 5b3302566a rt: Add local irq locks
Introduce locallock. For !RT this maps to preempt_disable()/
local_irq_disable() so there is not much that changes. For RT this will
map to a spinlock. This makes preemption possible and the locked "resource"
gets the lockdep annotation it wouldn't have otherwise. The locks are
recursive for owner == current. Also, all lock users call migrate_disable()
which ensures that the task is not migrated to another CPU while the lock
is held and the owner is preempted.
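
Using the later mainline local_lock API as an approximation, the usage
pattern looks like this sketch:

  #include <linux/local_lock.h>

  static DEFINE_PER_CPU(local_lock_t, res_lock) = INIT_LOCAL_LOCK(res_lock);
  static DEFINE_PER_CPU(int, res);

  static void touch_res(void)
  {
  	/* !RT: preempt_disable(); RT: per-CPU spinlock + migrate_disable() */
  	local_lock(&res_lock);
  	this_cpu_inc(res);
  	local_unlock(&res_lock);
  }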

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2023-03-25 04:21:21 +03:00
Sebastian Andrzej Siewior 266005de95 x86: Disable HAVE_ARCH_JUMP_LABEL
__text_poke() does:
|        local_irq_save(flags);
…
|        ptep = get_locked_pte(poking_mm, poking_addr, &ptl);

which does not work on -RT because the PTE-lock is a spinlock_t typed lock.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:21 +03:00
Sebastian Andrzej Siewior d03792e5cb efi: Allow efi=runtime
In case the command line option "efi=noruntime" is the default at build time,
the user can override it with `efi=runtime' and enable the runtime services again.

Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:21 +03:00
Sebastian Andrzej Siewior 2a59f5dff1 efi: Disable runtime services on RT
Based on measurements the EFI functions get_variable /
get_next_variable take up to 2us, which looks okay.
The functions get_time, set_time take around 10ms. Those 10ms are too
much. Even one ms would be too much.
Ard mentioned that SetVariable might even trigger larger latencies if
the firmware erases flash blocks on NOR.

The time functions are used by efi-rtc and can be triggered at
runtime (either via explicit read/write or NTP sync).

The variable write could be used by pstore.
These functions can be disabled without much of a loss. The poweroff /
reboot hooks may be provided by PSCI.

Disable EFI's runtime wrappers.

This was observed on "EFI v2.60 by SoftIron Overdrive 1000".

Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:21 +03:00
Sebastian Andrzej Siewior cb5b81f8f0 md: disable bcache
It uses anon semaphores
|drivers/md/bcache/request.c: In function ‘cached_dev_write_complete’:
|drivers/md/bcache/request.c:1007:2: error: implicit declaration of function ‘up_read_non_owner’ [-Werror=implicit-function-declaration]
|  up_read_non_owner(&dc->writeback_lock);
|  ^
|drivers/md/bcache/request.c: In function ‘request_write’:
|drivers/md/bcache/request.c:1033:2: error: implicit declaration of function ‘down_read_non_owner’ [-Werror=implicit-function-declaration]
|  down_read_non_owner(&dc->writeback_lock);
|  ^

either we get rid of those or we have to introduce them…

Link: http://lkml.kernel.org/r/20130820111602.3cea203c@gandalf.local.home
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:21 +03:00
Sebastian Andrzej Siewior 2c6cd632e3 net/core: disable NET_RX_BUSY_POLL on RT
napi_busy_loop() disables preemption and performs a NAPI poll. We can't acquire
sleeping locks with disabled preemption so we would have to work around this
and add explicit locking for synchronisation against ksoftirqd.
Without explicit synchronisation a low priority process would "own" the NAPI
state (by setting NAPIF_STATE_SCHED) and could be scheduled out (no
preempt_disable() and BH is preemptible on RT).
In case a network packet arrives, the interrupt handler would set
NAPIF_STATE_MISSED and the system would wait until the task owning the NAPI
state is scheduled in again.
Should a task with RT priority busy poll, it would consume the CPU instead of
allowing tasks with lower priority to run.

The NET_RX_BUSY_POLL is disabled by default (the system wide sysctls for
poll/read are set to zero) so disable NET_RX_BUSY_POLL on RT to avoid wrong
locking context on RT. Should this feature be considered useful on RT systems
then it could be enabled again with proper locking and synchronisation.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:21 +03:00
Thomas Gleixner 915de11543 sched: Disable CONFIG_RT_GROUP_SCHED on RT
Carsten reported problems when running:

  taskset 01 chrt -f 1 sleep 1

from within rc.local on a F15 machine. The task stays running and
never gets on the run queue because some of the run queues have
rt_throttled=1 which does not go away. Works fine from an ssh login
shell. Disabling CONFIG_RT_GROUP_SCHED solves that as well.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2023-03-25 04:21:21 +03:00
Sebastian Andrzej Siewior 94462480e5 rcu: make RCU_BOOST default on RT
Since it is no longer invoked from the softirq, people run into OOM more
often if the priority of the RCU thread is too low. Making boosting the
default on RT should help in those cases and it can be switched off if
someone knows better.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:21 +03:00
Ingo Molnar 0923af59b9 mm: Allow only SLUB on RT
Memory allocation disables interrupts as part of the allocation and freeing
process. For -RT it is important that this section remains short and does not
depend on the size of the request or the internal state of the memory allocator.
At the beginning the SLAB memory allocator was adopted for RT's needs and it
required substantial changes. Later, with the addition of the SLUB memory
allocator we adopted this one as well and the changes were smaller. More
importantly, due to the design of the SLUB allocator, it performs better and its
worst-case latency is smaller. In the end only SLUB remained supported.

Disable SLAB and SLOB on -RT. Only SLUB is adapted to -RT needs.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:20 +03:00
Thomas Gleixner e3bbca899d kconfig: Disable config options which are not RT compatible
Disable stuff which is known to have issues on RT

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2023-03-25 04:21:20 +03:00
Sebastian Andrzej Siewior 2ddbdd8512 fs/dcache: use swait_queue instead of waitqueue
__d_lookup_done() invokes wake_up_all() while holding a hlist_bl_lock()
which disables preemption. As a workaround convert it to swait.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:20 +03:00
Sebastian Andrzej Siewior e852fda5d8 fs/dcache: bring back explicit INIT_HLIST_BL_HEAD init
Commit 3d375d7859 ("mm: update callers to use HASH_ZERO flag") removed
INIT_HLIST_BL_HEAD and uses the ZERO flag instead for the init. However
on RT we also have a spinlock which needs an init call, so we can't use
that.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:20 +03:00
Clark Williams b0408fdcf9 fscache: initialize cookie hash table raw spinlocks
The fscache cookie mechanism uses a hash table of hlist_bl_head structures. The
PREEMPT_RT patchset adds a raw spinlock to this structure and so on PREEMPT_RT
the structures get used uninitialized, causing warnings about bad magic numbers
when spinlock debugging is turned on.

Use the init function for fscache cookies.

Signed-off-by: Clark Williams <williams@redhat.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:20 +03:00
Paul Gortmaker 902a0c0b45 list_bl: Make list head locking RT safe
As per changes in include/linux/jbd_common.h for avoiding the
bit_spin_locks on RT ("fs: jbd/jbd2: Make state lock and journal
head lock rt safe") we do the same thing here.

We use the non atomic __set_bit and __clear_bit inside the scope of
the lock to preserve the ability of the existing LIST_DEBUG code to
use the zero'th bit in the sanity checks.
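
Roughly, the RT variant pairs a raw spinlock with the list head and keeps the
debug bit inside the locked region. A sketch of the idea (the final patch
reportedly uses #define wrappers, see the note further below):

  struct hlist_bl_head {
  	struct hlist_bl_node *first;
  #ifdef CONFIG_PREEMPT_RT
  	raw_spinlock_t lock;
  #endif
  };

  static inline void hlist_bl_lock(struct hlist_bl_head *b)
  {
  #ifdef CONFIG_PREEMPT_RT
  	raw_spin_lock(&b->lock);
  	/* keep bit 0 usable for the LIST_DEBUG sanity checks */
  	__set_bit(0, (unsigned long *)b);
  #else
  	bit_spin_lock(0, (unsigned long *)b);
  #endif
  }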

As a bit spinlock, we had no lockdep visibility into the usage
of the list head locking.  Now, if we were to implement it as a
standard non-raw spinlock, we would see:

BUG: sleeping function called from invalid context at kernel/rtmutex.c:658
in_atomic(): 1, irqs_disabled(): 0, pid: 122, name: udevd
5 locks held by udevd/122:
 #0:  (&sb->s_type->i_mutex_key#7/1){+.+.+.}, at: [<ffffffff811967e8>] lock_rename+0xe8/0xf0
 #1:  (rename_lock){+.+...}, at: [<ffffffff811a277c>] d_move+0x2c/0x60
 #2:  (&dentry->d_lock){+.+...}, at: [<ffffffff811a0763>] dentry_lock_for_move+0xf3/0x130
 #3:  (&dentry->d_lock/2){+.+...}, at: [<ffffffff811a0734>] dentry_lock_for_move+0xc4/0x130
 #4:  (&dentry->d_lock/3){+.+...}, at: [<ffffffff811a0747>] dentry_lock_for_move+0xd7/0x130
Pid: 122, comm: udevd Not tainted 3.4.47-rt62 #7
Call Trace:
 [<ffffffff810b9624>] __might_sleep+0x134/0x1f0
 [<ffffffff817a24d4>] rt_spin_lock+0x24/0x60
 [<ffffffff811a0c4c>] __d_shrink+0x5c/0xa0
 [<ffffffff811a1b2d>] __d_drop+0x1d/0x40
 [<ffffffff811a24be>] __d_move+0x8e/0x320
 [<ffffffff811a278e>] d_move+0x3e/0x60
 [<ffffffff81199598>] vfs_rename+0x198/0x4c0
 [<ffffffff8119b093>] sys_renameat+0x213/0x240
 [<ffffffff817a2de5>] ? _raw_spin_unlock+0x35/0x60
 [<ffffffff8107781c>] ? do_page_fault+0x1ec/0x4b0
 [<ffffffff817a32ca>] ? retint_swapgs+0xe/0x13
 [<ffffffff813eb0e6>] ? trace_hardirqs_on_thunk+0x3a/0x3f
 [<ffffffff8119b0db>] sys_rename+0x1b/0x20
 [<ffffffff817a3b96>] system_call_fastpath+0x1a/0x1f

Since we are only taking the lock during short lived list operations,
let's assume for now that it being raw won't be a significant latency
concern.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
[julia@ni.com: Use #define instead of static inline to avoid false positive from
               lockdep]
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:20 +03:00
Sebastian Andrzej Siewior ae0bba2146 fs/dcache: disable preemption on i_dir_seq's write side
i_dir_seq is an open-coded seqcount. Based on the code it looks like we
could have two writers in parallel despite the fact that the d_lock is
held. The problem is that during the write process on RT the preemption
is still enabled and if this process is interrupted by a reader with RT
priority then we lock up.
To avoid that lockup I am disabling preemption during the update.
The rename of i_dir_seq ensures that new write sides are caught in the
future.
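
Illustrative only (the real writer side is cmpxchg() based and the renamed
field name is an assumption), the shape of the change is:

  static unsigned start_dir_add(struct inode *dir)
  {
  	preempt_disable_rt();		/* no-op on !RT */
  	return dir->__i_dir_seq++;	/* odd value: update in progress */
  }

  static void end_dir_add(struct inode *dir, unsigned n)
  {
  	dir->__i_dir_seq = n + 2;	/* even again: update done */
  	preempt_enable_rt();
  }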

Cc: stable-rt@vger.kernel.org
Reported-by: Oleg.Karfich@wago.com
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:20 +03:00
Sebastian Andrzej Siewior 1bc5345a56 fs/nfs: turn rmdir_sem into a semaphore
The RW semaphore had a reader side which used the _non_owner version
because it most likely took the reader lock in one thread and released it
in another which would cause lockdep to complain if the "regular"
version was used.
On -RT we need the owner because the rw lock is turned into a rtmutex.
The semaphores, on the other hand, are "plain simple" and should work as
expected. We can't have multiple readers but on -RT we don't allow
multiple readers anyway so that is not a loss.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:20 +03:00
Sebastian Andrzej Siewior cac578e31b userfaultfd: Use a seqlock instead of seqcount
On RT write_seqcount_begin() disables preemption, which leads to a warning
in add_wait_queue() while the spinlock_t is acquired.
The waitqueue can't be converted to swait_queue because
userfaultfd_wake_function() is used as a custom wake function.

Use a seqlock instead of a seqcount to avoid the preempt_disable() section
during add_wait_queue().

Cc: stable-rt@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:20 +03:00
Sebastian Andrzej Siewior c587de4dc8 net/Qdisc: use a seqlock instead of a seqcount
The seqcount disables preemption on -RT while it is held, which we can't
remove. Also we don't want the reader to spin for ages if the writer is
scheduled out. The seqlock on the other hand will serialize / sleep on
the lock while the writer is active.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:20 +03:00
Sebastian Andrzej Siewior 6e91cb9b19 NFSv4: replace seqcount_t with a seqlock_t
The raw_write_seqcount_begin() in nfs4_reclaim_open_state() causes a
preempt_disable() on -RT. The spin_lock()/spin_unlock() in that section does
not work.
The lockdep part was removed in commit
  abbec2da13 ("NFS: Use raw_write_seqcount_begin/end int nfs4_reclaim_open_state")
because lockdep complained.
The whole seqcount thing was introduced in commit
  c137afabe3 ("NFSv4: Allow the state manager to mark an open_owner as being recovered").
The recovery thread runs only once.
write_seqlock() does not work on !RT because it disables preemption and the
writer side is preemptible (it has to remain so despite the fact that it will
block readers).

Reported-by: kernel test robot <xiaolong.ye@intel.com>
Link: https://lkml.kernel.org/r/20161021164727.24485-1-bigeasy@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:20 +03:00
Thomas Gleixner a6ffaaf304 seqlock: Prevent rt starvation
If a low prio writer gets preempted while holding the seqlock write
locked, a high prio reader spins forever on RT.

To prevent this let the reader grab the spinlock, so it blocks and
eventually boosts the writer. This way the writer can proceed and
endless spinning is prevented.

For seqcount writers we disable preemption over the update code
path. Thanks to Al Viro for disentangling some VFS code to make that
possible.

Nicholas Mc Guire:
- spin_lock+unlock => spin_unlock_wait
- __write_seqcount_begin => __raw_write_seqcount_begin
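
A sketch of the RT reader path, assuming the pre-5.10 seqlock_t layout and
omitting memory barriers; as noted above, the later revision uses
spin_unlock_wait() rather than a lock/unlock pair:

  static inline unsigned read_seqbegin(seqlock_t *sl)
  {
  	unsigned ret;

  repeat:
  	ret = READ_ONCE(sl->seqcount.sequence);
  	if (unlikely(ret & 1)) {
  		/* writer active: block on its lock so PI boosts it */
  		spin_lock(&sl->lock);
  		spin_unlock(&sl->lock);
  		goto repeat;
  	}
  	return ret;
  }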

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2023-03-25 04:21:20 +03:00
Sebastian Andrzej Siewior e96faa4aee dma-buf: Use seqlock_t instead of disabling preemption
"dma reservation" disables preemption while acquiring the write access
for "seqcount".

Replace the seqcount with a seqlock_t which provides seqcount-like
semantics and a lock for the writer.

Link: https://lkml.kernel.org/r/f410b429-db86-f81c-7c67-f563fa808b62@free.fr
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:20 +03:00
Thomas Gleixner 88199aa670 signal: Revert ptrace preempt magic
Upstream commit '53da1d9456fe7f8 fix ptrace slowness' is nothing more
than a bandaid around the ptrace design trainwreck. It's not a
correctness issue, it's merely a cosmetic bandaid.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2023-03-25 04:21:19 +03:00
Thomas Gleixner 40c5c3896a timekeeping: Split jiffies seqlock
Replace the jiffies_lock seqlock with a simple seqcount and a raw spinlock so
it can be taken in atomic context on RT.
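
The split, sketched with the resulting pair of primitives (mirroring what the
code ends up looking like; details approximate):

  static seqcount_t jiffies_seq = SEQCNT_ZERO(jiffies_seq);
  static DEFINE_RAW_SPINLOCK(jiffies_lock);

  static void do_update_jiffies64(u64 delta)
  {
  	raw_spin_lock(&jiffies_lock);		/* writer lock, atomic-safe */
  	write_seqcount_begin(&jiffies_seq);
  	jiffies_64 += delta;
  	write_seqcount_end(&jiffies_seq);
  	raw_spin_unlock(&jiffies_lock);
  }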

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2023-03-25 04:21:19 +03:00
Sebastian Andrzej Siewior e6598a0bd6 mm: Warn on memory allocation in non-preemptible context on RT
The memory allocation via kmalloc(, GFP_ATOMIC) in atomic context
(disabled preemption or interrupts) is not allowed on RT because the
buddy allocator is using sleeping locks which can't be acquired in this
context.
Such an allocation may not trigger a warning in the buddy allocator
if it is always satisfied by the SLUB allocator.

Add a warning on RT if a memory allocation is attempted in a
non-preemptible region.
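
The check itself can be as small as this sketch (placed in the allocator
paths by the actual patch):

  static inline void rt_alloc_ctx_check(void)
  {
  	/* On RT the allocator slow paths take sleeping locks, so any
  	 * allocation from a non-preemptible region is a bug. */
  	if (IS_ENABLED(CONFIG_PREEMPT_RT))
  		WARN_ON_ONCE(!preemptible());
  }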

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
2023-03-25 04:21:19 +03:00
Rob Herring 5150ed530f of: Rework and simplify phandle cache to use a fixed size
The phandle cache was added to speed up of_find_node_by_phandle() by
avoiding walking the whole DT to find a matching phandle. The
implementation has several shortcomings:

  - The cache is designed to work on a linear set of phandle values.
    This is true for dtc generated DTs, but not for other cases such as
    Power.
  - The cache isn't enabled until of_core_init() and a typical system
    may see hundreds of calls to of_find_node_by_phandle() before that
    point.
  - The cache is freed and re-allocated when the number of phandles
    changes.
  - It takes a raw spinlock around a memory allocation which breaks on
    RT.

Change the implementation to a fixed size and use hash_32() as the
cache index. This greatly simplifies the implementation. It avoids
the need for any re-alloc of the cache and taking a reference on nodes
in the cache. We only have a single source of removing cache entries
which is of_detach_node().
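
A sketch of the fixed-size cache; of_find_node_by_phandle_slow() is a
hypothetical stand-in for the full-tree walk and locking is omitted:

  #define OF_PHANDLE_CACHE_BITS	7
  #define OF_PHANDLE_CACHE_SZ	(1 << OF_PHANDLE_CACHE_BITS)	/* 128 entries */

  static struct device_node *phandle_cache[OF_PHANDLE_CACHE_SZ];

  struct device_node *of_find_node_by_phandle(phandle handle)
  {
  	u32 idx = hash_32(handle, OF_PHANDLE_CACHE_BITS);
  	struct device_node *np = phandle_cache[idx];

  	if (np && np->phandle == handle)
  		return np;				/* cache hit */

  	np = of_find_node_by_phandle_slow(handle);	/* walk the tree */
  	if (np)
  		phandle_cache[idx] = np;	/* no reference taken, see above */
  	return np;
  }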

Using hash_32() removes any assumption on phandle values improving
the hit rate for non-linear phandle values. The effect on linear values
using hash_32() is about a 10% collision rate. The chances of thrashing on
colliding values seems to be low.

To compare performance, I used a RK3399 board which is a pretty typical
system. I found that just measuring boot time as done previously is
noisy and may be impacted by other things. Also bringing up secondary
cores causes some issues with measuring, so I booted with 'nr_cpus=1'.
With no caching, calls to of_find_node_by_phandle() take about 20124 us
for 1248 calls. There's an additional 288 calls before time keeping is
up. Using the average time per hit/miss with the cache, we can calculate
these calls to take 690 us (277 hit / 11 miss) with a 128 entry cache
and 13319 us with no cache or an uninitialized cache.

Comparing the 3 implementations the time spent in
of_find_node_by_phandle() is:

no cache:        20124 us (+ 13319 us)
128 entry cache:  5134 us (+ 690 us)
current cache:     819 us (+ 13319 us)

We could move the allocation of the cache earlier to improve the
current cache, but that just further complicates the situation as it
needs to be after slab is up, so we can't do it when unflattening (which
uses memblock).

Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Segher Boessenkool <segher@kernel.crashing.org>
Cc: Frank Rowand <frowand.list@gmail.com>
Signed-off-by: Rob Herring <robh@kernel.org>
Link: https://lkml.kernel.org/r/20191211232345.24810-1-robh@kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:19 +03:00
Sebastian Andrzej Siewior a66f88a0d5 tpm: remove tpm_dev_wq_lock
Added in commit

  9e1b74a63f ("tpm: add support for nonblocking operation")

but it was never actually used.

Cc: Philip Tricca <philip.b.tricca@intel.com>
Cc: Tadeusz Struk <tadeusz.struk@intel.com>
Cc: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:19 +03:00
Sebastian Andrzej Siewior 34b227ff95 mm: workingset: replace IRQ-off check with a lockdep assert.
Commit

  68d48e6a2d ("mm: workingset: add vmstat counter for shadow nodes")

introduced an IRQ-off check to ensure that a lock is held which also
disabled interrupts. This does not work the same way on -RT because none
of the locks, that are held, disable interrupts.
Replace this check with a lockdep assert which ensures that the lock is
held.

Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:19 +03:00
Sebastian Andrzej Siewior 95faf05ec7 cgroup: Acquire cgroup_rstat_lock with enabled interrupts
There is no need to disable interrupts while cgroup_rstat_lock is
acquired. The lock is never used in-IRQ context so a simple spin_lock()
is enough for synchronisation purposes.

Acquire cgroup_rstat_lock without disabling interrupts and ensure that
cgroup_rstat_cpu_lock is acquired with disabled interrupts (this one is
acquired in-IRQ context).
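
The resulting locking pattern, sketched (per-CPU lock initialisation happens
at boot and is omitted):

  static DEFINE_SPINLOCK(cgroup_rstat_lock);
  static DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_cpu_lock);

  static void flush_one_cpu(int cpu)
  {
  	raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);

  	spin_lock(&cgroup_rstat_lock);		/* never taken in IRQ context */
  	raw_spin_lock_irq(cpu_lock);		/* this one is, keep IRQs off */
  	/* ... flush this CPU's stats ... */
  	raw_spin_unlock_irq(cpu_lock);
  	spin_unlock(&cgroup_rstat_lock);
  }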

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:19 +03:00
Sebastian Andrzej Siewior 76880e1fa4 cgroup: Remove `may_sleep' from cgroup_rstat_flush_locked()
cgroup_rstat_flush_locked() is always invoked with `may_sleep' set to
true so that this case can be made default and the parameter removed.

Remove the `may_sleep' parameter.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:19 +03:00
Sebastian Andrzej Siewior bff81c5d86 cgroup: Consolidate users of cgroup_rstat_lock.
cgroup_rstat_flush_irqsafe() has no users, remove it.
cgroup_rstat_flush_hold() and cgroup_rstat_flush_release() are only used within
this file. Make them static.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:19 +03:00
Sebastian Andrzej Siewior 9cb4141c73 cgroup: Remove ->css_rstat_flush()
I was looking at the lifetime of the ->css_rstat_flush() callback to see if
cgroup_rstat_cpu_lock should remain a raw_spinlock_t. I didn't find any
users; it has been unused since it was introduced in commit
  8f53470bab ("cgroup: Add cgroup_subsys->css_rstat_flush()")

Remove the css_rstat_flush callback because it has no users.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:19 +03:00
Sebastian Andrzej Siewior baf1451e1b workqueue: Convert the locks to raw type
After all the workqueue and the timer rework, we can finally make the
worker_pool lock raw.
The lock is not held over an unbounded period of time/iterations.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:19 +03:00
Sebastian Andrzej Siewior a2c9e4039a workqueue: Use swait for wq_manager_wait
In order for the workqueue code to use raw_spinlock_t typed locking, no
spinlock_t typed lock may be acquired. A wait_queue_head uses
a spinlock_t lock for its list protection.

Use a swait based queue head to avoid raw_spinlock_t -> spinlock_t
locking.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:19 +03:00
Sebastian Andrzej Siewior a1468913de sched/swait: Add swait_event_lock_irq()
The swait_event_lock_irq() is inspired by wait_event_lock_irq(). This is
required by the workqueue code once it switches to swait.
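
Sketched after wait_event_lock_irq(), the helper can look like this (not the
exact macro):

  #define swait_event_lock_irq(wq, condition, lock)			\
  do {									\
  	DECLARE_SWAITQUEUE(__wait);					\
  	for (;;) {							\
  		prepare_to_swait_exclusive(&wq, &__wait,		\
  					   TASK_UNINTERRUPTIBLE);	\
  		if (condition)						\
  			break;						\
  		spin_unlock_irq(&lock);	/* drop the lock while sleeping */ \
  		schedule();						\
  		spin_lock_irq(&lock);					\
  	}								\
  	finish_swait(&wq, &__wait);					\
  } while (0)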

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:18 +03:00
Sebastian Andrzej Siewior b065527a55 workqueue: Don't assume that the callback has interrupts disabled
Due to the TIMER_IRQSAFE flag, the timer callback is invoked with
disabled interrupts. On -RT the callback is invoked in softirq context
with enabled interrupts. Since the interrupts are threaded, there are
no in_irq() users. The local_bh_disable() around the threaded
handler ensures that there is either a timer or a threaded handler
active on the CPU.

Disable interrupts before __queue_work() is invoked from the timer
callback.
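
Roughly, the callback ends up doing this (sketch):

  void delayed_work_timer_fn(struct timer_list *t)
  {
  	struct delayed_work *dwork = from_timer(dwork, t, timer);
  	unsigned long flags;

  	/* __queue_work() relies on interrupts being disabled, which the
  	 * softirq-context callback on RT no longer guarantees. */
  	local_irq_save(flags);
  	__queue_work(dwork->cpu, dwork->wq, &dwork->work);
  	local_irq_restore(flags);
  }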

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:18 +03:00
Sebastian Andrzej Siewior 9e86cc66fa Use CONFIG_PREEMPTION
This is an all-in-one patch of the current `PREEMPTION' branch.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-03-25 04:21:18 +03:00