There's two problems when installing cgroup events on CPUs: firstly
list_update_cgroup_event() only tries to set cpuctx->cgrp for the
first event, if that mismatches on @cgrp we'll not try again for later
additions.
Secondly, when we install a cgroup event into an active context, only
issue an event reprogram when the event matches the current cgroup
context. This avoids a pointless event reprogramming.
Signed-off-by: leilei.lin <leilei.lin@alibaba-inc.com>
[ Improved the changelog and comments. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: brendan.d.gregg@gmail.com
Cc: eranian@gmail.com
Cc: linux-kernel@vger.kernel.org
Cc: yang_oliver@hotmail.com
Link: http://lkml.kernel.org/r/20180306093637.28247-1-linxiulei@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The event schedule order (as per perf_event_sched_in()) is:
- cpu pinned
- task pinned
- cpu flexible
- task flexible
But perf_rotate_context() will unschedule cpu-flexible even if it
doesn't need a rotation.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Similar to how first programming cpu=-1 and then cpu=# is wrong, so is
rotating both. It was especially wrong when we were still programming
the PMU in this same order, because in that scenario we might never
actually end up running cpu=# events at all.
Cure this by using the active_list to pick the rotation event; since
at programming we already select the left-most event.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexey Budankov <alexey.budankov@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Carrillo-Cisneros <davidcc@google.com>
Cc: Dmitri Prokhorov <Dmitry.Prohorov@intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Valery Cherepennikov <valery.cherepennikov@intel.com>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The last argument is, and always must be, the same.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexey Budankov <alexey.budankov@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Carrillo-Cisneros <davidcc@google.com>
Cc: Dmitri Prokhorov <Dmitry.Prohorov@intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Valery Cherepennikov <valery.cherepennikov@intel.com>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When an event group contains more events than can be scheduled on the
hardware, iterating the full event group for ctx_sched_out is a waste
of time.
Keep track of the events that got programmed on the hardware, such
that we can iterate this smaller list in order to schedule them out.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexey Budankov <alexey.budankov@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Carrillo-Cisneros <davidcc@google.com>
Cc: Dmitri Prokhorov <Dmitry.Prohorov@intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Valery Cherepennikov <valery.cherepennikov@intel.com>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now that all the grouping is done with RB trees, we no longer need
group_entry and can replace the whole thing with sibling_list.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexey Budankov <alexey.budankov@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Carrillo-Cisneros <davidcc@google.com>
Cc: Dmitri Prokhorov <Dmitry.Prohorov@intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Valery Cherepennikov <valery.cherepennikov@intel.com>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Scheduling in events with cpu=-1 before events with cpu=# changes
semantics and is undesirable in that it would priorize these events.
Given that groups->index is across all groups we actually have an
inter-group ordering, meaning we can merge-sort two groups, which is
just what we need to preserve semantics.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexey Budankov <alexey.budankov@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Carrillo-Cisneros <davidcc@google.com>
Cc: Dmitri Prokhorov <Dmitry.Prohorov@intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Valery Cherepennikov <valery.cherepennikov@intel.com>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Change event groups into RB trees sorted by CPU and then by a 64bit
index, so that multiplexing hrtimer interrupt handler would be able
skipping to the current CPU's list and ignore groups allocated for the
other CPUs.
New API for manipulating event groups in the trees is implemented as well
as adoption on the API in the current implementation.
pinned_group_sched_in() and flexible_group_sched_in() API are
introduced to consolidate code enabling the whole group from pinned
and flexible groups appropriately.
Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: David Carrillo-Cisneros <davidcc@google.com>
Cc: Dmitri Prokhorov <Dmitry.Prohorov@intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Valery Cherepennikov <valery.cherepennikov@intel.com>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/372f9c8b-0cfe-4240-e44d-83d863d40813@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Mark reported his arm64 perf fuzzer runs sometimes splat like:
armv8pmu_read_counter+0x1e8/0x2d8
armpmu_event_update+0x8c/0x188
armpmu_read+0xc/0x18
perf_output_read+0x550/0x11e8
perf_event_read_event+0x1d0/0x248
perf_event_exit_task+0x468/0xbb8
do_exit+0x690/0x1310
do_group_exit+0xd0/0x2b0
get_signal+0x2e8/0x17a8
do_signal+0x144/0x4f8
do_notify_resume+0x148/0x1e8
work_pending+0x8/0x14
which asserts that we only call pmu::read() on ACTIVE events.
The above callchain does:
perf_event_exit_task()
perf_event_exit_task_context()
task_ctx_sched_out() // INACTIVE
perf_event_exit_event()
perf_event_set_state(EXIT) // EXIT
sync_child_event()
perf_event_read_event()
perf_output_read()
perf_output_read_group()
leader->pmu->read()
Which results in doing a pmu::read() on an !ACTIVE event.
I _think_ this is 'new' since we added attr.inherit_stat, which added
the perf_event_read_event() to the exit path, without that
perf_event_read_output() would only trigger from samples and for
@event to trigger a sample, it's leader _must_ be ACTIVE too.
Still, adding this check makes it consistent with the @sub case for
the siblings.
Reported-and-Tested-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull perf updates from Thomas Gleixner:
"Another set of perf updates:
- Fix a Skylake Uncore event format declaration
- Prevent perf pipe mode from crahsing which was caused by a missing
buffer allocation
- Make the perf top popup message which tells the user that it uses
fallback mode on older kernels a debug message.
- Make perf context rescheduling work correcctly
- Robustify the jump error drawing in perf browser mode so it does
not try to create references to NULL initialized offset entries
- Make trigger_on() robust so it does not enable the trigger before
everything is set up correctly to handle it
- Make perf auxtrace respect the --no-itrace option so it does not
try to queue AUX data for decoding.
- Prevent having different number of field separators in CVS output
lines when a counter is not supported.
- Make the perf kallsyms man page usage behave like it does for all
other perf commands.
- Synchronize the kernel headers"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/core: Fix ctx_event_type in ctx_resched()
perf tools: Fix trigger class trigger_on()
perf auxtrace: Prevent decoding when --no-itrace
perf stat: Fix CVS output format for non-supported counters
tools headers: Sync x86's cpufeatures.h
tools headers: Sync copy of kvm UAPI headers
perf record: Fix crash in pipe mode
perf annotate browser: Be more robust when drawing jump arrows
perf top: Fix annoying fallback message on older kernels
perf kallsyms: Fix the usage on the man page
perf/x86/intel/uncore: Fix Skylake UPI event format
Pull locking fix from Thomas Gleixner:
"rt_mutex_futex_unlock() grew a new irq-off call site, but the function
assumes that its always called from irq enabled context.
Use (un)lock_irqsafe() to handle the new call site correctly"
* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
rtmutex: Make rt_mutex_futex_unlock() safe for irq-off callsites
Pull RCU updates from Paul E. McKenney:
- Miscellaneous fixes, perhaps most notably removing obsolete
code whose only purpose in life was to gather information for
the now-removed RCU debugfs facility. Other notable changes
include removing NO_HZ_FULL_ALL in favor of the nohz_full kernel
boot parameter, minor optimizations for expedited grace periods,
some added tracing, creating an RCU-specific workqueue using Tejun's
new WQ_MEM_RECLAIM flag, and several cleanups to code and comments.
- SRCU cleanups and optimizations.
- Torture-test updates, perhaps most notably the adding of ARMv8
support, but also including numerous cleanups and usability fixes.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The replace_filter_string() frees the current string and then copies a given
string. But in the two locations that it was used, the allocation happened
right after the filter was allocated (nothing to replace). There's no need
for this to be a helper function. Embedding the allocation in the two places
where it was called will make changing the code in the future easier.
Also make the variable consistent (always use "filter_string" as the name,
as it was used in one instance as "filter_str")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
replace_system_preds() creates a filter list to free even when it doesn't
really need to have it. Only save filters that require synchronize_sched()
in the filter list to free. This will allow the code to be updated a bit
easier in the future.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The __alloc_filter() function does nothing more that allocate the filter.
There's no reason to have it as a helper function.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The filter code does open code string appending to produce an error message.
Instead it can be simplified by using trace_seq function helpers.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
There's no reason to BUG if there's a bug in the filtering code. Simply do a
warning and return.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Synthetic events can be done within the recording of other events. Notify
the ring buffer via ring_buffer_nest_start() and ring_buffer_nest_end() that
this is intended and not to block it due to its recursion protection.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The ring-buffer code has recusion protection in case tracing ends up tracing
itself, the ring-buffer will detect that it was called at the same context
(normal, softirq, interrupt or NMI), and not continue to record the event.
With the histogram synthetic events, they are called while tracing another
event at the same context. The recusion protection triggers because it
detects tracing at the same context and stops it.
Add ring_buffer_nest_start() and ring_buffer_nest_end() that will notify the
ring buffer that a trace is about to happen within another trace and that it
is intended, and not to trigger the recursion blocking.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The default clock if timestamps are used in a histogram is "global".
If timestamps aren't used, the clock is irrelevant.
Use the "clock=" param only if you want to override the default
"global" clock for a histogram with timestamps.
Link: http://lkml.kernel.org/r/427bed1389c5d22aa40c3e0683e30cc3d151e260.1516069914.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Rajvi Jingar <rajvi.jingar@intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Allow tracing code outside of trace.c to access tracing_set_clock().
Some applications may require a particular clock in order to function
properly, such as latency calculations.
Also, add an accessor returning the current clock string.
Link: http://lkml.kernel.org/r/6d1c53e9ee2163f54e1849f5376573f54f0e6009.1516069914.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
With the addition of variables and actions, it's become necessary to
provide more detailed error information to users about syntax errors.
Add a 'last error' facility accessible via the erroring event's 'hist'
file. Reading the hist file after an error will display more detailed
information about what went wrong, if information is available. This
extended error information will be available until the next hist
trigger command for that event.
# echo xxx > /sys/kernel/debug/tracing/events/sched/sched_wakeup/trigger
echo: write error: Invalid argument
# cat /sys/kernel/debug/tracing/events/sched/sched_wakeup/hist
ERROR: Couldn't yyy: zzz
Last command: xxx
Also add specific error messages for variable and action errors.
Link: http://lkml.kernel.org/r/64e9c422fc8aeafcc2f7a3b4328c0cffe7969129.1516069914.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Add support for alias=$somevar where alias can be used as
onmatch.xxx($alias).
Aliases are a way of creating a new name for an existing variable, for
flexibly in making naming more clear in certain cases. For example in
the below the user perhaps feels that using $new_lat in the synthetic
event invocation is opaque or doesn't fit well stylistically with
previous triggers, so creates an alias of $new_lat named $latency and
uses that in the call instead:
# echo 'hist:keys=next_pid:new_lat=common_timestamp.usecs' >
/sys/kernel/debug/tracing/events/sched/sched_switch/trigger
# echo 'hist:keys=pid:latency=$new_lat:
onmatch(sched.sched_switch).wake2($latency,pid)' >
/sys/kernel/debug/tracing/events/synthetic/wake1/trigger
Link: http://lkml.kernel.org/r/ef20a65d921af3a873a6f1e8c71407c926d5586f.1516069914.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The existing code only allows for one space before and after the 'if'
specifying the filter for a hist trigger. Add code to make that more
permissive as far as whitespace goes. Specifically, we want to allow
spaces in the trigger itself now that we have additional syntax
(onmatch/onmax) where spaces are more natural e.g. spaces after commas
in param lists.
Link: http://lkml.kernel.org/r/1053090c3c308d4f431accdeb59dff4b511d4554.1516069914.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Add an 'onmax(var).save(field,...)' hist trigger action which is
invoked whenever an event exceeds the current maximum.
The end result is that the trace event fields or variables specified
as the onmax.save() params will be saved if 'var' exceeds the current
maximum for that hist trigger entry. This allows context from the
event that exhibited the new maximum to be saved for later reference.
When the histogram is displayed, additional fields displaying the
saved values will be printed.
As an example the below defines a couple of hist triggers, one for
sched_wakeup and another for sched_switch, keyed on pid. Whenever a
sched_wakeup occurs, the timestamp is saved in the entry corresponding
to the current pid, and when the scheduler switches back to that pid,
the timestamp difference is calculated. If the resulting latency
exceeds the current maximum latency, the specified save() values are
saved:
# echo 'hist:keys=pid:ts0=common_timestamp.usecs \
if comm=="cyclictest"' >> \
/sys/kernel/debug/tracing/events/sched/sched_wakeup/trigger
# echo 'hist:keys=next_pid:\
wakeup_lat=common_timestamp.usecs-$ts0:\
onmax($wakeup_lat).save(next_comm,prev_pid,prev_prio,prev_comm) \
if next_comm=="cyclictest"' >> \
/sys/kernel/debug/tracing/events/sched/sched_switch/trigger
When the histogram is displayed, the max value and the saved values
corresponding to the max are displayed following the rest of the
fields:
# cat /sys/kernel/debug/tracing/events/sched/sched_switch/hist
{ next_pid: 3728 } hitcount: 199 \
max: 123 next_comm: cyclictest prev_pid: 0 \
prev_prio: 120 prev_comm: swapper/3
{ next_pid: 3730 } hitcount: 1321 \
max: 15 next_comm: cyclictest prev_pid: 0 \
prev_prio: 120 prev_comm: swapper/1
{ next_pid: 3729 } hitcount: 1973\
max: 25 next_comm: cyclictest prev_pid: 0 \
prev_prio: 120 prev_comm: swapper/0
Totals:
Hits: 3493
Entries: 3
Dropped: 0
Link: http://lkml.kernel.org/r/006907f71b1e839bb059337ec3c496f84fcb71de.1516069914.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Add an 'onmatch(matching.event).<synthetic_event_name>(param list)'
hist trigger action which is invoked with the set of variables or
event fields named in the 'param list'. The result is the generation
of a synthetic event that consists of the values contained in those
variables and/or fields at the time the invoking event was hit.
As an example the below defines a simple synthetic event using a
variable defined on the sched_wakeup_new event, and shows the event
definition with unresolved fields, since the sched_wakeup_new event
with the testpid variable hasn't been defined yet:
# echo 'wakeup_new_test pid_t pid; int prio' >> \
/sys/kernel/debug/tracing/synthetic_events
# cat /sys/kernel/debug/tracing/synthetic_events
wakeup_new_test pid_t pid; int prio
The following hist trigger both defines a testpid variable and
specifies an onmatch() trace action that uses that variable along with
a non-variable field to generate a wakeup_new_test synthetic event
whenever a sched_wakeup_new event occurs, which because of the 'if
comm == "cyclictest"' filter only happens when the executable is
cyclictest:
# echo 'hist:testpid=pid:keys=$testpid:\
onmatch(sched.sched_wakeup_new).wakeup_new_test($testpid, prio) \
if comm=="cyclictest"' >> \
/sys/kernel/debug/tracing/events/sched/sched_wakeup_new/trigger
Creating and displaying a histogram based on those events is now just
a matter of using the fields and new synthetic event in the
tracing/events/synthetic directory, as usual:
# echo 'hist:keys=pid,prio:sort=pid,prio' >> \
/sys/kernel/debug/tracing/events/synthetic/wakeup_new_test/trigger
Link: http://lkml.kernel.org/r/8c2a574bcb7530c876629c901ecd23911b14afe8.1516069914.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Rajvi Jingar <rajvi.jingar@intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Users should be able to directly specify event fields in hist trigger
'actions' rather than being forced to explicitly create a variable for
that purpose.
Add support allowing fields to be used directly in actions, which
essentially does just that - creates 'invisible' variables for each
bare field specified in an action. If a bare field refers to a field
on another (matching) event, it even creates a special histogram for
the purpose (since variables can't be defined on an existing histogram
after histogram creation).
Here's a simple example that demonstrates both. Basically the
onmatch() action creates a list of variables corresponding to the
parameters of the synthetic event to be generated, and then uses those
values to generate the event. So for the wakeup_latency synthetic
event 'call' below the first param, $wakeup_lat, is a variable defined
explicitly on sched_switch, where 'next_pid' is just a normal field on
sched_switch, and prio is a normal field on sched_waking.
Since the mechanism works on variables, those two normal fields just
have 'invisible' variables created internally for them. In the case of
'prio', which is on another event, we actually need to create an
additional hist trigger and define the invisible variable on that, since
once a hist trigger is defined, variables can't be added to it later.
echo 'wakeup_latency u64 lat; pid_t pid; int prio' >>
/sys/kernel/debug/tracing/synthetic_events
echo 'hist:keys=pid:ts0=common_timestamp.usecs >>
/sys/kernel/debug/tracing/events/sched/sched_waking/trigger
echo 'hist:keys=next_pid:wakeup_lat=common_timestamp.usecs-$ts0:
onmatch(sched.sched_waking).wakeup_latency($wakeup_lat,next_pid,prio)
>> /sys/kernel/debug/tracing/events/sched/sched_switch/trigger
Link: http://lkml.kernel.org/r/8e8dcdac1ea180ed7a3689e1caeeccede9dc42b3.1516069914.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Synthetic events are user-defined events generated from hist trigger
variables saved from one or more other events.
To define a synthetic event, the user writes a simple specification
consisting of the name of the new event along with one or more
variables and their type(s), to the tracing/synthetic_events file.
For instance, the following creates a new event named 'wakeup_latency'
with 3 fields: lat, pid, and prio:
# echo 'wakeup_latency u64 lat; pid_t pid; int prio' >> \
/sys/kernel/debug/tracing/synthetic_events
Reading the tracing/synthetic_events file lists all the
currently-defined synthetic events, in this case the event we defined
above:
# cat /sys/kernel/debug/tracing/synthetic_events
wakeup_latency u64 lat; pid_t pid; int prio
At this point, the synthetic event is ready to use, and a histogram
can be defined using it:
# echo 'hist:keys=pid,prio,lat.log2:sort=pid,lat' >> \
/sys/kernel/debug/tracing/events/synthetic/wakeup_latency/trigger
The new event is created under the tracing/events/synthetic/ directory
and looks and behaves just like any other event:
# ls /sys/kernel/debug/tracing/events/synthetic/wakeup_latency
enable filter format hist id trigger
Although a histogram can be defined for it, nothing will happen until
an action tracing that event via the trace_synth() function occurs.
The trace_synth() function is very similar to all the other trace_*
invocations spread throughout the kernel, except in this case the
trace_ function and its corresponding tracepoint isn't statically
generated but defined by the user at run-time.
How this can be automatically hooked up via a hist trigger 'action' is
discussed in a subsequent patch.
Link: http://lkml.kernel.org/r/c68df2284b7d172669daf9be29db62ad49bbc559.1516069914.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
[fix noderef.cocci warnings, sizeof pointer for kcalloc of event->fields]
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Add a hook for executing extra actions whenever a histogram entry is
added or updated.
The default 'action' when a hist entry is added to a histogram is to
update the set of values associated with it. Some applications may
want to perform additional actions at that point, such as generate
another event, or compare and save a maximum.
Add a simple framework for doing that; specific actions will be
implemented on top of it in later patches.
Link: http://lkml.kernel.org/r/9482ba6a3eaf5ca6e60954314beacd0e25c05b24.1516069914.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Add the necessary infrastructure to allow the variables defined on one
event to be referenced in another. This allows variables set by a
previous event to be referenced and used in expressions combining the
variable values saved by that previous event and the event fields of
the current event. For example, here's how a latency can be
calculated and saved into yet another variable named 'wakeup_lat':
# echo 'hist:keys=pid,prio:ts0=common_timestamp ...
# echo 'hist:keys=next_pid:wakeup_lat=common_timestamp-$ts0 ...
In the first event, the event's timetamp is saved into the variable
ts0. In the next line, ts0 is subtracted from the second event's
timestamp to produce the latency.
Further users of variable references will be described in subsequent
patches, such as for instance how the 'wakeup_lat' variable above can
be displayed in a latency histogram.
Link: http://lkml.kernel.org/r/b1d3e6975374e34d501ff417c20189c3f9b2c7b8.1516069914.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Some accessor functions, such as for variable references, require
access to a corrsponding tracing_map_elt.
Add a tracing_map_elt param to the function signature and update the
accessor functions accordingly.
Link: http://lkml.kernel.org/r/e0f292b068e9e4948da1d5af21b5ae0efa9b5717.1516069914.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Up until now, hist triggers only needed per-element support for saving
'comm' data, which was saved directly as a private data pointer.
In anticipation of the need to save other data besides 'comm', add a
new hist_elt_data struct for the purpose, and switch the current
'comm'-related code over to that.
Link: http://lkml.kernel.org/r/4502c338c965ddf5fc19fb1ec4764391e001ed4b.1516069914.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Add support for simple addition, subtraction, and unary expressions
(-(expr) and expr, where expr = b-a, a+b, a+b+c) to hist triggers, in
order to support a minimal set of useful inter-event calculations.
These operations are needed for calculating latencies between events
(timestamp1-timestamp0) and for combined latencies (latencies over 3
or more events).
In the process, factor out some common code from key and value
parsing.
Link: http://lkml.kernel.org/r/9a9308ead4fe32a433d9c7e95921fb798394f6b2.1516069914.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
[kbuild test robot fix, add static to parse_atom()]
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
[ Replaced '//' comments with normal /* */ comments ]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Named triggers must also have the same set of variables in order to be
considered compatible - update the trigger match test to account for
that.
The reason for this requirement is that named triggers with variables
are meant to allow one or more events to set the same variable.
Link: http://lkml.kernel.org/r/a17eae6328a99917f9d5c66129c9fcd355279ee9.1516069914.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Add support for saving the value of a current event's event field by
assigning it to a variable that can be read by a subsequent event.
The basic syntax for saving a variable is to simply prefix a unique
variable name not corresponding to any keyword along with an '=' sign
to any event field.
Both keys and values can be saved and retrieved in this way:
# echo 'hist:keys=next_pid:vals=$ts0:ts0=common_timestamp ...
# echo 'hist:timer_pid=common_pid:key=$timer_pid ...'
If a variable isn't a key variable or prefixed with 'vals=', the
associated event field will be saved in a variable but won't be summed
as a value:
# echo 'hist:keys=next_pid:ts1=common_timestamp:...
Multiple variables can be assigned at the same time:
# echo 'hist:keys=pid:vals=$ts0,$b,field2:ts0=common_timestamp,b=field1 ...
Multiple (or single) variables can also be assigned at the same time
using separate assignments:
# echo 'hist:keys=pid:vals=$ts0:ts0=common_timestamp:b=field1:c=field2 ...
Variables set as above can be used by being referenced from another
event, as described in a subsequent patch.
Link: http://lkml.kernel.org/r/fc93c4944d9719dbcb1d0067be627d44e98e2adc.1516069914.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Baohong Liu <baohong.liu@intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Appending .usecs onto a common_timestamp field will cause the
timestamp value to be in microseconds instead of the default
nanoseconds. A typical latency histogram using usecs would look like
this:
# echo 'hist:keys=pid,prio:ts0=common_timestamp.usecs ...
# echo 'hist:keys=next_pid:wakeup_lat=common_timestamp.usecs-$ts0 ...
This also adds an external trace_clock_in_ns() to trace.c for the
timestamp conversion.
Link: http://lkml.kernel.org/r/4e813705a170b3e13e97dc3135047362fb1a39f3.1516069914.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
In order to allow information to be passed between trace events, add
support for per-element variables to tracing_map. This provides a
means for histograms to associate a value or values with an entry when
it's saved or updated, and retrieved by a subsequent event occurrences.
Variables can be set using tracing_map_set_var() and read using
tracing_map_read_var(). tracing_map_var_set() returns true or false
depending on whether or not the variable has been set or not, which is
important for event-matching applications.
tracing_map_read_var_once() reads the variable and resets it to the
'unset' state, implementing read-once variables, which are also
important for event-matching uses.
Link: http://lkml.kernel.org/r/7fa001108252556f0c6dd9d63145eabfe3370d1a.1516069914.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Add support for a timestamp event field. This is actually a 'pseudo-'
event field in that it behaves like it's part of the event record, but
is really part of the corresponding ring buffer event.
To make use of the timestamp field, users can specify
"common_timestamp" as a field name for any histogram. Note that this
doesn't make much sense on its own either as either a key or value,
but needs to be supported even so, since follow-on patches will add
support for making use of this field in time deltas. The
common_timestamp 'field' is not a bona fide event field - so you won't
find it in the event description - but rather it's a synthetic field
that can be used like a real field.
Note that the use of this field requires the ring buffer be put into
'absolute timestamp' mode, which saves the complete timestamp for each
event rather than an offset. This mode will be enabled if and only if
a histogram makes use of the "common_timestamp" field.
Link: http://lkml.kernel.org/r/97afbd646ed146e26271f3458b4b33e16d7817c2.1516069914.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Baohong Liu <baohong.liu@intel.com>
[kasan use-after-free fix]
Signed-off-by: Vedang Patel <vedang.patel@intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Add a new option flag indicating whether or not the ring buffer is in
'absolute timestamp' mode.
Currently this is only set/unset by hist triggers that make use of a
common_timestamp. As such, there's no reason to make this writeable
for users - its purpose is only to allow users to determine
unequivocally whether or not the ring buffer is in that mode (although
absolute timestamps can coexist with the normal delta timestamps, when
the ring buffer is in absolute mode, timestamps written while absolute
mode is in effect take up more space in the buffer, and are not as
efficient).
Link: http://lkml.kernel.org/r/e8aa7b1cde1cf15014e66545d06ac6ef2ebba456.1516069914.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
RINGBUF_TYPE_TIME_STAMP is defined but not used, and from what I can
gather was reserved for something like an absolute timestamp feature
for the ring buffer, if not a complete replacement of the current
time_delta scheme.
This code redefines RINGBUF_TYPE_TIME_STAMP to implement absolute time
stamps. Another way to look at it is that it essentially forces
extended time_deltas for all events.
The motivation for doing this is to enable time_deltas that aren't
dependent on previous events in the ring buffer, making it feasible to
use the ring_buffer_event timetamps in a more random-access way, for
purposes other than serial event printing.
To set/reset this mode, use tracing_set_timestamp_abs() from the
previous interface patch.
Link: http://lkml.kernel.org/r/477b362dba1ce7fab9889a1a8e885a62c472f041.1516069914.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Define a new function, tracing_set_time_stamp_abs(), which can be used
to enable or disable the use of absolute timestamps rather than time
deltas for a trace array.
Only the interface is added here; a subsequent patch will add the
underlying implementation.
Link: http://lkml.kernel.org/r/ce96119de44c7fe0ee44786d15254e9b493040d3.1516069914.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Baohong Liu <baohong.liu@intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
We now have the logic to detect and remove duplicates in the
tracing_map hash table. The code which merges duplicates in the
histogram is redundant now. So, modify this code just to detect
duplicates. The duplication detection code is still kept to ensure
that any rare race condition which might cause duplicates does not go
unnoticed.
Link: http://lkml.kernel.org/r/55215cf59e2674391bdaf772fdafc4c393352b03.1516069914.git.tom.zanussi@linux.intel.com
Signed-off-by: Vedang Patel <vedang.patel@intel.com>
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
A duplicate in the tracing_map hash table is when 2 different entries
have the same key and, as a result, the key_hash. This is possible due
to a race condition in the algorithm. This race condition is inherent to
the algorithm and not a bug. This was fine because, until now, we were
only interested in the sum of all the values related to a particular
key (the duplicates are dealt with in tracing_map_sort_entries()). But,
with the inclusion of variables[1], we are interested in individual
values. So, it will not be clear what value to choose when
there are duplicates. So, the duplicates need to be removed.
The duplicates can occur in the code in the following scenarios:
- A thread is in the process of adding a new element. It has
successfully executed cmpxchg() and inserted the key. But, it is still
not done acquiring the trace_map_elt struct, populating it and storing
the pointer to the struct in the value field of tracing_map hash table.
If another thread comes in at this time and wants to add an element with
the same key, it will not see the current element and add a new one.
- There are multiple threads trying to execute cmpxchg at the same time,
one of the threads will succeed and the others will fail. The ones which
fail will go ahead increment 'idx' and add a new element there creating
a duplicate.
This patch detects and avoids the first condition by asking the thread
which detects the duplicate to loop one more time. There is also a
possibility of infinite loop if the thread which is trying to insert
goes to sleep indefinitely and the one which is trying to insert a new
element detects a duplicate. Which is why, the thread loops for
map_size iterations before returning NULL.
The second scenario is avoided by preventing the threads which failed
cmpxchg() from incrementing idx. This way, they will loop
around and check if the thread which succeeded in executing cmpxchg()
had the same key.
[1] http://lkml.kernel.org/r/cover.1498510759.git.tom.zanussi@linux.intel.com
Link: http://lkml.kernel.org/r/e178e89ec399240331d383bd5913d649713110f4.1516069914.git.tom.zanussi@linux.intel.com
Signed-off-by: Vedang Patel <vedang.patel@intel.com>
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
When the length of the NTP tick changes significantly, e.g. when an
NTP/PTP application is correcting the initial offset of the clock, a
large value may accumulate in the NTP error before the multiplier
converges to the correct value. It may then take a very long time (hours
or even days) before the error is corrected. This causes the clock to
have an unstable frequency offset, which has a negative impact on the
stability of synchronization with precise time sources (e.g. NTP/PTP
using hardware timestamping or the PTP KVM clock).
Use division to determine the correct multiplier directly from the NTP
tick length and replace the iterative approach. This removes the last
major source of the NTP error. The only remaining source is now limited
resolution of the multiplier, which is corrected by adding 1 to the
multiplier when the system clock is behind the NTP time.
Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Stephen Boyd <stephen.boyd@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1520620971-9567-3-git-send-email-john.stultz@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When the timekeeping multiplier is changed, the NTP error is updated to
correct the clock for the delay between the tick and the update of the
clock. This error is corrected in later updates and the clock appears as
if the frequency was changed exactly on the tick.
Remove this correction to keep the point where the frequency is
effectively changed at the time of the update. This removes a major
source of the NTP error.
Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Stephen Boyd <stephen.boyd@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1520620971-9567-2-git-send-email-john.stultz@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The BUG and stack protector reports were still using a raw %p. This
changes it to %pB for more meaningful output.
Link: http://lkml.kernel.org/r/20180301225704.GA34198@beast
Fixes: ad67b74d24 ("printk: hash addresses printed with %p")
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Richard Weinberger <richard.weinberger@gmail.com>,
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, as reported by Eric, an invalid si_code value 0 is
passed in many signals delivered to userspace in response to faults
and other kernel errors. Typically 0 is passed when the fault is
insufficiently diagnosable or when there does not appear to be any
sensible alternative value to choose.
This appears to violate POSIX, and is intuitively wrong for at
least two reasons arising from the fact that 0 == SI_USER:
1) si_code is a union selector, and SI_USER (and si_code <= 0 in
general) implies the existence of a different set of fields
(siginfo._kill) from that which exists for a fault signal
(siginfo._sigfault). However, the code raising the signal
typically writes only the _sigfault fields, and the _kill
fields make no sense in this case.
Thus when userspace sees si_code == 0 (SI_USER) it may
legitimately inspect fields in the inactive union member _kill
and obtain garbage as a result.
There appears to be software in the wild relying on this,
albeit generally only for printing diagnostic messages.
2) Software that wants to be robust against spurious signals may
discard signals where si_code == SI_USER (or <= 0), or may
filter such signals based on the si_uid and si_pid fields of
siginfo._sigkill. In the case of fault signals, this means
that important (and usually fatal) error conditions may be
silently ignored.
In practice, many of the faults for which arm64 passes si_code == 0
are undiagnosable conditions such as exceptions with syndrome
values in ESR_ELx to which the architecture does not yet assign any
meaning, or conditions indicative of a bug or error in the kernel
or system and thus that are unrecoverable and should never occur in
normal operation.
The approach taken in this patch is to translate all such
undiagnosable or "impossible" synchronous fault conditions to
SIGKILL, since these are at least probably localisable to a single
process. Some of these conditions should really result in a kernel
panic, but due to the lack of diagnostic information it is
difficult to be certain: this patch does not add any calls to
panic(), but this could change later if justified.
Although si_code will not reach userspace in the case of SIGKILL,
it is still desirable to pass a nonzero value so that the common
siginfo handling code can detect incorrect use of si_code == 0
without false positives. In this case the si_code dependent
siginfo fields will not be correctly initialised, but since they
are not passed to userspace I deem this not to matter.
A few faults can reasonably occur in realistic userspace scenarios,
and _should_ raise a regular, handleable (but perhaps not
ignorable/blockable) signal: for these, this patch attempts to
choose a suitable standard si_code value for the raised signal in
each case instead of 0.
arm64 was the only arch to define a BUS_FIXME code, so after this
patch nobody defines it. This patch therefore also removes the
relevant code from siginfo_layout().
Cc: James Morse <james.morse@arm.com>
Reported-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
tasklet_action() + tasklet_hi_action() are almost identical. Move the
common code from both function into __tasklet_action_common() and let
both functions invoke it with different arguments.
[ bigeasy: Splitted out from RT's "tasklet: Prevent tasklets from going
into infinite spin in RT" and added commit message]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Julia Cartwright <juliac@eso.teric.us>
Link: https://lkml.kernel.org/r/20180227164808.10093-3-bigeasy@linutronix.de
__tasklet_schedule() and __tasklet_hi_schedule() are almost identical.
Move the common code from both function into __tasklet_schedule_common()
and let both functions invoke it with different arguments.
[ bigeasy: Splitted out from RT's "tasklet: Prevent tasklets from going
into infinite spin in RT" and added commit message. Use
this_cpu_ptr(headp) in __tasklet_schedule_common() as suggested
by Julia Cartwright ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Julia Cartwright <juliac@eso.teric.us>
Link: https://lkml.kernel.org/r/20180227164808.10093-2-bigeasy@linutronix.de
Returning the base of the allocated interrupt range from irq_sim_init() and
devm_irq_sim_init() allows users to handle the logic of associating irq
numbers with any other driver-specific resources without having to use
irq_sim_irqnum().
Signed-off-by: Bartosz Golaszewski <brgl@bgdev.pl>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Link: https://lkml.kernel.org/r/20180304121018.640-4-brgl@bgdev.pl
As discussed with Marc Zyngier: irq_sim_init() and its devres variant
should return the base of the allocated interrupt range on success rather
than 0.
Make devm_irq_sim_init() check for an error code. This is a preparatory
change for modifying irq_sim_init() itself.
Signed-off-by: Bartosz Golaszewski <brgl@bgdev.pl>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Link: https://lkml.kernel.org/r/20180304121018.640-3-brgl@bgdev.pl
kfree() is used in the irq_sim code but slab.h is pulled in indirectly via
irq.h. Include it explicitly.
Signed-off-by: Bartosz Golaszewski <brgl@bgdev.pl>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Link: https://lkml.kernel.org/r/20180304121018.640-2-brgl@bgdev.pl
When running rcutorture with TREE03 config, CONFIG_PROVE_LOCKING=y, and
kernel cmdline argument "rcutorture.gp_exp=1", lockdep reports a
HARDIRQ-safe->HARDIRQ-unsafe deadlock:
================================
WARNING: inconsistent lock state
4.16.0-rc4+ #1 Not tainted
--------------------------------
inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
takes:
__schedule+0xbe/0xaf0
{IN-HARDIRQ-W} state was registered at:
_raw_spin_lock+0x2a/0x40
scheduler_tick+0x47/0xf0
...
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(&rq->lock);
<Interrupt>
lock(&rq->lock);
*** DEADLOCK ***
1 lock held by rcu_torture_rea/724:
rcu_torture_read_lock+0x0/0x70
stack backtrace:
CPU: 2 PID: 724 Comm: rcu_torture_rea Not tainted 4.16.0-rc4+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-20171110_100015-anatol 04/01/2014
Call Trace:
lock_acquire+0x90/0x200
? __schedule+0xbe/0xaf0
_raw_spin_lock+0x2a/0x40
? __schedule+0xbe/0xaf0
__schedule+0xbe/0xaf0
preempt_schedule_irq+0x2f/0x60
retint_kernel+0x1b/0x2d
RIP: 0010:rcu_read_unlock_special+0x0/0x680
? rcu_torture_read_unlock+0x60/0x60
__rcu_read_unlock+0x64/0x70
rcu_torture_read_unlock+0x17/0x60
rcu_torture_reader+0x275/0x450
? rcutorture_booster_init+0x110/0x110
? rcu_torture_stall+0x230/0x230
? kthread+0x10e/0x130
kthread+0x10e/0x130
? kthread_create_worker_on_cpu+0x70/0x70
? call_usermodehelper_exec_async+0x11a/0x150
ret_from_fork+0x3a/0x50
This happens with the following even sequence:
preempt_schedule_irq();
local_irq_enable();
__schedule():
local_irq_disable(); // irq off
...
rcu_note_context_switch():
rcu_note_preempt_context_switch():
rcu_read_unlock_special():
local_irq_save(flags);
...
raw_spin_unlock_irqrestore(...,flags); // irq remains off
rt_mutex_futex_unlock():
raw_spin_lock_irq();
...
raw_spin_unlock_irq(); // accidentally set irq on
<return to __schedule()>
rq_lock():
raw_spin_lock(); // acquiring rq->lock with irq on
which means rq->lock becomes a HARDIRQ-unsafe lock, which can cause
deadlocks in scheduler code.
This problem was introduced by commit 02a7c234e5 ("rcu: Suppress
lockdep false-positive ->boost_mtx complaints"). That brought the user
of rt_mutex_futex_unlock() with irq off.
To fix this, replace the *lock_irq() in rt_mutex_futex_unlock() with
*lock_irq{save,restore}() to make it safe to call rt_mutex_futex_unlock()
with irq off.
Fixes: 02a7c234e5 ("rcu: Suppress lockdep false-positive ->boost_mtx complaints")
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
Link: https://lkml.kernel.org/r/20180309065630.8283-1-boqun.feng@gmail.com
When pinning a file under the BPF virtual file system (traditionally
/sys/fs/bpf), using a dot in the name of the location to pin at is not
allowed. For example, trying to pin at "/sys/fs/bpf/foo.bar" will be
rejected with -EPERM.
This check was introduced at the same time as the BPF file system
itself, with commit b2197755b2 ("bpf: add support for persistent
maps/progs"). At this time, it was checked in a function called
"bpf_dname_reserved()", which made clear that using a dot was reserved
for future extensions.
This function disappeared and the check was moved elsewhere with commit
0c93b7d85d ("bpf: reject invalid names right in ->lookup()"), and the
meaning of the dot ban was lost.
The present commit simply adds a comment in the source to explain to the
reader that the usage of dots is reserved for future usage.
Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
In ctx_resched(), EVENT_FLEXIBLE should be sched_out when EVENT_PINNED is
added. However, ctx_resched() calculates ctx_event_type before checking
this condition. As a result, pinned events will NOT get higher priority
than flexible events.
The following shows this issue on an Intel CPU (where ref-cycles can
only use one hardware counter).
1. First start:
perf stat -C 0 -e ref-cycles -I 1000
2. Then, in the second console, run:
perf stat -C 0 -e ref-cycles:D -I 1000
The second perf uses pinned events, which is expected to have higher
priority. However, because it failed in ctx_resched(). It is never
run.
This patch fixes this by calculating ctx_event_type after re-evaluating
event_type.
Reported-by: Ephraim Park <ephiepark@fb.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <jolsa@redhat.com>
Cc: <kernel-team@fb.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Fixes: 487f05e18a ("perf/core: Optimize event rescheduling on active contexts")
Link: http://lkml.kernel.org/r/20180306055504.3283731-1-songliubraving@fb.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Since the return type of the function is bool, the internal
'ret' variable should be bool too.
Signed-off-by: Gaurav Jindal<gauravjindal1104@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180221125407.GA14292@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When NEWLY_IDLE load balance is not triggered, we might need to update the
blocked load anyway. We can kick an ilb so an idle CPU will take care of
updating blocked load or we can try to update them locally before entering
idle. In the latter case, we reuse part of the nohz_idle_balance.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: brendan.jackman@arm.com
Cc: dietmar.eggemann@arm.com
Cc: morten.rasmussen@foss.arm.com
Cc: valentin.schneider@arm.com
Link: http://lkml.kernel.org/r/1518622006-16089-4-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We're going to want to call nohz_idle_balance() or parts thereof from
idle_balance(). Since we already have a forward declaration of
idle_balance() move it down such that it's below nohz_idle_balance()
avoiding the need for a forward declaration for that.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now that we have two back-to-back NO_HZ_COMMON blocks, merge them.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This pure code movement results in two #ifdef CONFIG_NO_HZ_COMMON
sections landing next to each other.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Avoid calling update_blocked_averages() when it does not in fact have
any by re-using/extending update_nohz_stats().
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Instead of using the cfs_rq_is_decayed() which monitors all *_avg
and *_sum, we create a cfs_rq_has_blocked() which only takes care of
util_avg and load_avg. We are only interested by these 2 values which are
decaying faster than the *_sum so we can stop the periodic update earlier.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: brendan.jackman@arm.com
Cc: dietmar.eggemann@arm.com
Cc: morten.rasmussen@foss.arm.com
Cc: valentin.schneider@arm.com
Link: http://lkml.kernel.org/r/1518517879-2280-3-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Stopped the periodic update of blocked load when all idle CPUs have fully
decayed. We introduce a new nohz.has_blocked that reflect if some idle
CPUs has blocked load that have to be periodiccally updated. nohz.has_blocked
is set everytime that a Idle CPU can have blocked load and it is then clear
when no more blocked load has been detected during an update. We don't need
atomic operation but only to make cure of the right ordering when updating
nohz.idle_cpus_mask and nohz.has_blocked.
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: brendan.jackman@arm.com
Cc: dietmar.eggemann@arm.com
Cc: morten.rasmussen@foss.arm.com
Cc: valentin.schneider@arm.com
Link: http://lkml.kernel.org/r/1518517879-2280-2-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
It was suggested that a migration hint might be usefull for the
CPU-freq governors.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The primary observation is that nohz enter/exit is always from the
current CPU, therefore NOHZ_TICK_STOPPED does not in fact need to be
an atomic.
Secondary is that we appear to have 2 nearly identical hooks in the
nohz enter code, set_cpu_sd_state_idle() and
nohz_balance_enter_idle(). Fold the whole set_cpu_sd_state thing into
nohz_balance_{enter,exit}_idle.
Removes an atomic op from both enter and exit paths.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Since we already iterate CPUs looking for work on NEWIDLE, use this
iteration to age the blocked load. If the domain for which this is
done completely spand the idle set, we can push the ILB based aging
forward.
Suggested-by: Brendan Jackman <brendan.jackman@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Teach the idle balancer about the need to update statistics which have
a different periodicity from regular balancing.
Suggested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The current:
if (nohz_kick_needed())
nohz_balancer_kick()
is pointless complexity, fold them into a single call and avoid the
various conditions at the call site.
When we introduce multiple different needs to kick the ilb, the above
construct also becomes a problem.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Split the NOHZ idle balancer into doing two separate actions:
- update blocked load statistic
- actually load-balance
Since the latter requires the former, ensure this happens. For now
always tag both bits at the same time.
Prepares for a future where we can toggle only the STATS bit.
Suggested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Using atomic_t allows us to use the more flexible bitops provided
there. Also its smaller.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Instead of trying to duplicate scheduler state to track if an RT task
is running, directly use the scheduler runqueue state for it.
This vastly simplifies things and fixes a number of bugs related to
sugov and the scheduler getting out of sync wrt this state.
As a consequence we not also update the remove cfs/dl state when
iterating the shared mask.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Due to using GCC defines for configuration, some labels might be unused in
certain configurations. While adding a __maybe_unused to the label is
fine in general, the line has to be terminated with ';'. This is also
reflected in the GCC documentation, but GCC parsed the previous variant
without an error message.
This has been spotted while compiling with goto-cc, the compiler for the
CPROVER tool suite.
Signed-off-by: Norbert Manthey <nmanthey@amazon.de>
Signed-off-by: Michael Tautschnig <tautschn@amazon.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1519717660-16157-1-git-send-email-nmanthey@amazon.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Audit link denied events generate duplicate PATH records which disagree
in different ways from symlink and hardlink denials.
audit_log_link_denied() should not directly generate PATH records.
See: https://github.com/linux-audit/audit-kernel/issues/21
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Audit link denied events emit disjointed records when audit is disabled.
No records should be emitted when audit is disabled.
See: https://github.com/linux-audit/audit-kernel/issues/21
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
otherwise kernel can oops later in seq_release() due to dereferencing null
file->private_data which is only set if seq_open() succeeds.
BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
IP: seq_release+0xc/0x30
Call Trace:
close_pdeo+0x37/0xd0
proc_reg_release+0x5d/0x60
__fput+0x9d/0x1d0
____fput+0x9/0x10
task_work_run+0x75/0x90
do_exit+0x252/0xa00
do_group_exit+0x36/0xb0
SyS_exit_group+0xf/0x10
Fixes: 516fb7f2e7 ("/proc/module: use the same logic as /proc/kallsyms for address exposure")
Cc: Jessica Yu <jeyu@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: stable@vger.kernel.org # 4.15+
Signed-off-by: Leon Yu <chianglungyu@gmail.com>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
This commit adds new field "addr" to bpf_perf_event_data which could be
read and used by bpf programs attached to perf events. The value of the
field is copied from bpf_perf_event_data_kern.addr and contains the
address value recorded by specifying sample_type with PERF_SAMPLE_ADDR
when calling perf_event_open.
Signed-off-by: Teng Qin <qinteng@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
commit d178bc3a70 ("user namespace: usb:
make usb urbs user namespace aware (v2)") changed kill_pid_info_as_uid
to kill_pid_info_as_cred, saving and passing a cred structure instead of
uids. Since the secid can be obtained from the cred, drop the secid fields
from the usb_dev_state and async structures, and drop the secid argument to
kill_pid_info_as_cred. Replace the secid argument to security_task_kill
with the cred. Update SELinux, Smack, and AppArmor to use the cred, which
avoids the need for Smack and AppArmor to use a secid at all in this hook.
Further changes to Smack might still be required to take full advantage of
this change, since it should now be possible to perform capability
checking based on the supplied cred. The changes to Smack and AppArmor
have only been compile-tested.
Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov>
Acked-by: Paul Moore <paul@paul-moore.com>
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: John Johansen <john.johansen@canonical.com>
Signed-off-by: James Morris <james.morris@microsoft.com>
devm_memremap_pages() was re-worked in e8d5134833 "memremap: change
devm_memremap_pages interface to use struct dev_pagemap" to take a
caller allocated struct dev_pagemap as a function parameter. A call to
devres_free() was left in the error cleanup path which results in a
kernel panic if the remap fails for some reason. Remove it to fix the
panic and let devm_memremap_pages() fail gracefully.
Fixes: e8d5134833 ("memremap: change devm_memremap_pages interface...")
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
If you pass in an invalid audit boot parameter value, e.g. "audit=off",
the kernel panics very early in boot before the regular console is
initialized. Unless you have earlyprintk enabled, there is no
indication of what the problem is on the console.
Convert the panic() calls to pr_err(), and leave auditing enabled if an
invalid parameter value was passed in.
Modify the parameter to also accept "on" or "off" as valid values, and
update the documentation accordingly.
Signed-off-by: Greg Edwards <gedwards@ddn.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
All of the conflicts were cases of overlapping changes.
In net/core/devlink.c, we have to make care that the
resouce size_params have become a struct member rather
than a pointer to such an object.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Use an appropriate TSQ pacing shift in mac80211, from Toke
Høiland-Jørgensen.
2) Just like ipv4's ip_route_me_harder(), we have to use skb_to_full_sk
in ip6_route_me_harder, from Eric Dumazet.
3) Fix several shutdown races and similar other problems in l2tp, from
James Chapman.
4) Handle missing XDP flush properly in tuntap, for real this time.
From Jason Wang.
5) Out-of-bounds access in powerpc ebpf tailcalls, from Daniel
Borkmann.
6) Fix phy_resume() locking, from Andrew Lunn.
7) IFLA_MTU values are ignored on newlink for some tunnel types, fix
from Xin Long.
8) Revert F-RTO middle box workarounds, they only handle one dimension
of the problem. From Yuchung Cheng.
9) Fix socket refcounting in RDS, from Ka-Cheong Poon.
10) Don't allow ppp unit registration to an unregistered channel, from
Guillaume Nault.
11) Various hv_netvsc fixes from Stephen Hemminger.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (98 commits)
hv_netvsc: propagate rx filters to VF
hv_netvsc: filter multicast/broadcast
hv_netvsc: defer queue selection to VF
hv_netvsc: use napi_schedule_irqoff
hv_netvsc: fix race in napi poll when rescheduling
hv_netvsc: cancel subchannel setup before halting device
hv_netvsc: fix error unwind handling if vmbus_open fails
hv_netvsc: only wake transmit queue if link is up
hv_netvsc: avoid retry on send during shutdown
virtio-net: re enable XDP_REDIRECT for mergeable buffer
ppp: prevent unregistered channels from connecting to PPP units
tc-testing: skbmod: fix match value of ethertype
mlxsw: spectrum_switchdev: Check success of FDB add operation
net: make skb_gso_*_seglen functions private
net: xfrm: use skb_gso_validate_network_len() to check gso sizes
net: sched: tbf: handle GSO_BY_FRAGS case in enqueue
net: rename skb_gso_validate_mtu -> skb_gso_validate_network_len
rds: Incorrect reference counting in TCP socket creation
net: ethtool: don't ignore return from driver get_fecparam method
vrf: check forwarding on the original netdevice when generating ICMP dest unreachable
...
Pull timer fixes from Thomas Gleixner:
"A small set of fixes from the timer departement:
- Add a missing timer wheel clock forward when migrating timers off a
unplugged CPU to prevent operating on a stale clock base and
missing timer deadlines.
- Use the proper shift count to extract data from a register value to
prevent evaluating unrelated bits
- Make the error return check in the FSL timer driver work correctly.
Checking an unsigned variable for less than zero does not really
work well.
- Clarify the confusing comments in the ARC timer code"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timers: Forward timer base before migrating timers
clocksource/drivers/arc_timer: Update some comments
clocksource/drivers/mips-gic-timer: Use correct shift count to extract data
clocksource/drivers/fsl_ftm_timer: Fix error return checking
Make it easier to concatenate all the scheduler .c files for single-module
compilation.
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Merge these two small .c modules as they implement two aspects
of idle task handling.
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Do the following cleanups and simplifications:
- sched/sched.h already includes <asm/paravirt.h>, so no need to
include it in sched/core.c again.
- order the <linux/sched/*.h> headers alphabetically
- add all <linux/sched/*.h> headers to kernel/sched/sched.h
- remove all unnecessary includes from the .c files that
are already included in kernel/sched/sched.h.
Finally, make all scheduler .c files use a single common header:
#include "sched.h"
... which now contains a union of the relied upon headers.
This makes the various .c files easier to read and easier to handle.
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull libnvdimm fixes from Dan Williams:
"A 4.16 regression fix, three fixes for -stable, and a cleanup fix:
- During the merge window support for the new ACPI NVDIMM Platform
Capabilities structure disabled support for "deep flush", a
force-unit- access like mechanism for persistent memory. Restore
that mechanism.
- VFIO like RDMA is yet one more memory registration / pinning
interface that is incompatible with Filesystem-DAX. Disable long
term pins of Filesystem-DAX mappings via VFIO.
- The Filesystem-DAX detection to prevent long terms pins mistakenly
also disabled Device-DAX pins which are not subject to the same
block- map collision concerns.
- Similar to the setup path, softlockup warnings can trigger in the
shutdown path for large persistent memory namespaces. Teach
for_each_device_pfn() to perform cond_resched() in all cases.
- Boaz noticed that the might_sleep() in dax_direct_access() is stale
as of the v4.15 kernel.
These have received a build success notification from the 0day robot,
and the longterm pin fixes have appeared in -next. However, I recently
rebased the tree to remove some other fixes that need to be reworked
after review feedback.
* 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
memremap: fix softlockup reports at teardown
libnvdimm: re-enable deep flush for pmem devices via fsync()
vfio: disable filesystem-dax page pinning
dax: fix vma_is_fsdax() helper
dax: ->direct_access does not sleep anymore
A good number of small style inconsistencies have accumulated
in the scheduler core, so do a pass over them to harmonize
all these details:
- fix speling in comments,
- use curly braces for multi-line statements,
- remove unnecessary parentheses from integer literals,
- capitalize consistently,
- remove stray newlines,
- add comments where necessary,
- remove invalid/unnecessary comments,
- align structure definitions and other data types vertically,
- add missing newlines for increased readability,
- fix vertical tabulation where it's misaligned,
- harmonize preprocessor conditional block labeling
and vertical alignment,
- remove line-breaks where they uglify the code,
- add newline after local variable definitions,
No change in functionality:
md5:
1191fa0a890cfa8132156d2959d7e9e2 built-in.o.before.asm
1191fa0a890cfa8132156d2959d7e9e2 built-in.o.after.asm
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
- Fixed style error: Missing space before the open parenthesis
- Fixed style warnings: 2x Missing blank line after declaration
One warning left: else after return
(I don't feel comfortable fixing that without side effects)
Signed-off-by: Mario Leinweber <marioleinweber@web.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/20180302182007.28691-1-marioleinweber@web.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The cond_resched() currently in the setup path needs to be duplicated in
the teardown path. Rather than require each instance of
for_each_device_pfn() to open code the same sequence, embed it in the
helper.
Link: https://github.com/intel/ixpdimm_sw/issues/11
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: <stable@vger.kernel.org>
Fixes: 7138970383 ("mm, zone_device: Replace {get, put}_zone_device_page()...")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Since commit afcc90f862 ("usercopy: WARN() on slab cache usercopy
region violations"), MIPS systems booting with a compat root filesystem
emit a warning when copying compat siginfo to userspace:
WARNING: CPU: 0 PID: 953 at mm/usercopy.c:81 usercopy_warn+0x98/0xe8
Bad or missing usercopy whitelist? Kernel memory exposure attempt
detected from SLAB object 'task_struct' (offset 1432, size 16)!
Modules linked in:
CPU: 0 PID: 953 Comm: S01logging Not tainted 4.16.0-rc2 #10
Stack : ffffffff808c0000 0000000000000000 0000000000000001 65ac85163f3bdc4a
65ac85163f3bdc4a 0000000000000000 90000000ff667ab8 ffffffff808c0000
00000000000003f8 ffffffff808d0000 00000000000000d1 0000000000000000
000000000000003c 0000000000000000 ffffffff808c8ca8 ffffffff808d0000
ffffffff808d0000 ffffffff80810000 fffffc0000000000 ffffffff80785c30
0000000000000009 0000000000000051 90000000ff667eb0 90000000ff667db0
000000007fe0d938 0000000000000018 ffffffff80449958 0000000020052798
ffffffff808c0000 90000000ff664000 90000000ff667ab0 00000000100c0000
ffffffff80698810 0000000000000000 0000000000000000 0000000000000000
0000000000000000 0000000000000000 ffffffff8010d02c 65ac85163f3bdc4a
...
Call Trace:
[<ffffffff8010d02c>] show_stack+0x9c/0x130
[<ffffffff80698810>] dump_stack+0x90/0xd0
[<ffffffff80137b78>] __warn+0x100/0x118
[<ffffffff80137bdc>] warn_slowpath_fmt+0x4c/0x70
[<ffffffff8021e4a8>] usercopy_warn+0x98/0xe8
[<ffffffff8021e68c>] __check_object_size+0xfc/0x250
[<ffffffff801bbfb8>] put_compat_sigset+0x30/0x88
[<ffffffff8011af24>] setup_rt_frame_n32+0xc4/0x160
[<ffffffff8010b8b4>] do_signal+0x19c/0x230
[<ffffffff8010c408>] do_notify_resume+0x60/0x78
[<ffffffff80106f50>] work_notifysig+0x10/0x18
---[ end trace 88fffbf69147f48a ]---
Commit 5905429ad8 ("fork: Provide usercopy whitelisting for
task_struct") noted that:
"While the blocked and saved_sigmask fields of task_struct are copied to
userspace (via sigmask_to_save() and setup_rt_frame()), it is always
copied with a static length (i.e. sizeof(sigset_t))."
However, this is not true in the case of compat signals, whose sigset
is copied by put_compat_sigset and receives size as an argument.
At most call sites, put_compat_sigset is copying a sigset from the
current task_struct. This triggers a warning when
CONFIG_HARDENED_USERCOPY is active. However, by marking this function as
static inline, the warning can be avoided because in all of these cases
the size is constant at compile time, which is allowed. The only site
where this is not the case is handling the rt_sigpending syscall, but
there the copy is being made from a stack local variable so does not
trigger the warning.
Move put_compat_sigset to compat.h, and mark it static inline. This
fixes the WARN on MIPS.
Fixes: afcc90f862 ("usercopy: WARN() on slab cache usercopy region violations")
Signed-off-by: Matt Redfearn <matt.redfearn@mips.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: "Dmitry V . Levin" <ldv@altlinux.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: kernel-hardening@lists.openwall.com
Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/18639/
Signed-off-by: James Hogan <jhogan@kernel.org>
Daniel Borkmann says:
====================
pull-request: bpf 2018-02-28
The following pull-request contains BPF updates for your *net* tree.
The main changes are:
1) Add schedule points and reduce the number of loop iterations
the test_bpf kernel module is performing in order to not hog
the CPU for too long, from Eric.
2) Fix an out of bounds access in tail calls in the ppc64 BPF
JIT compiler, from Daniel.
3) Fix a crash on arm64 on unaligned BPF xadd operations that
could be triggered via interpreter and JIT, from Daniel.
Please not that once you merge net into net-next at some point, there
is a minor merge conflict in test_verifier.c since test cases had
been added at the end in both trees. Resolution is trivial: keep all
the test cases from both trees.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull printk fix from Petr Mladek:
"Make sure that we wake up userspace loggers. This fixes a race
introduced by the console waiter logic during this merge window"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk:
printk: Wake klogd when passing console_lock owner
On CPU hotunplug the enqueued timers of the unplugged CPU are migrated to a
live CPU. This happens from the control thread which initiated the unplug.
If the CPU on which the control thread runs came out from a longer idle
period then the base clock of that CPU might be stale because the control
thread runs prior to any event which forwards the clock.
In such a case the timers from the unplugged CPU are queued on the live CPU
based on the stale clock which can cause large delays due to increased
granularity of the outer timer wheels which are far away from base:;clock.
But there is a worse problem than that. The following sequence of events
illustrates it:
- CPU0 timer1 is queued expires = 59969 and base->clk = 59131.
The timer is queued at wheel level 2, with resulting expiry time = 60032
(due to level granularity).
- CPU1 enters idle @60007, with next timer expiry @60020.
- CPU0 is hotplugged at @60009
- CPU1 exits idle and runs the control thread which migrates the
timers from CPU0
timer1 is now queued in level 0 for immediate handling in the next
softirq because the requested expiry time 59969 is before CPU1 base->clk
60007
- CPU1 runs code which forwards the base clock which succeeds because the
next expiring timer. which was collected at idle entry time is still set
to 60020.
So it forwards beyond 60007 and therefore misses to expire the migrated
timer1. That timer gets expired when the wheel wraps around again, which
takes between 63 and 630ms depending on the HZ setting.
Address both problems by invoking forward_timer_base() for the control CPUs
timer base. All other places, which might run into a similar problem
(mod_timer()/add_timer_on()) already invoke forward_timer_base() to avoid
that.
[ tglx: Massaged comment and changelog ]
Fixes: a683f390b9 ("timers: Forward the wheel clock whenever possible")
Co-developed-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Lingutla Chandrasekhar <clingutla@codeaurora.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: linux-arm-msm@vger.kernel.org
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20180118115022.6368-1-clingutla@codeaurora.org
Surprisingly there is no simple way to see if the IRQ line in question
is wakeup source or not.
Note that wakeup might be an OOB (out-of-band) source like GPIO line
which makes things slightly more complicated.
Add a sysfs node to cover this case.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Tony Lindgren <tony@atomide.com>
Cc: Grygorii Strashko <grygorii.strashko@ti.com>
Cc: "Rafael J . Wysocki" <rafael.j.wysocki@intel.com>
Link: https://lkml.kernel.org/r/20180226155043.67937-1-andriy.shevchenko@linux.intel.com
Show unadorned pointers in lockdep reports - lockdep is a debugging
facility and hashing pointers there doesn't make a whole lotta sense.
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20180226134926.23069-1-bp@alien8.de
wake_klogd is a local variable in console_unlock(). The information
is lost when the console_lock owner using the busy wait added by
the commit dbdda842fe ("printk: Add console owner and waiter
logic to load balance console writes"). The following race is
possible:
CPU0 CPU1
console_unlock()
for (;;)
/* calling console for last message */
printk()
log_store()
log_next_seq++;
/* see new message */
if (seen_seq != log_next_seq) {
wake_klogd = true;
seen_seq = log_next_seq;
}
console_lock_spinning_enable();
if (console_trylock_spinning())
/* spinning */
if (console_lock_spinning_disable_and_check()) {
printk_safe_exit_irqrestore(flags);
return;
console_unlock()
if (seen_seq != log_next_seq) {
/* already seen */
/* nothing to do */
Result: Nobody would wakeup klogd.
One solution would be to make a global variable from wake_klogd.
But then we would need to manipulate it under a lock or so.
This patch wakes klogd also when console_lock is passed to the
spinning waiter. It looks like the right way to go. Also userspace
should have a chance to see and store any "flood" of messages.
Note that the very late klogd wake up was a historic solution.
It made sense on single CPU systems or when sys_syslog() operations
were synchronized using the big kernel lock like in v2.1.113.
But it is questionable these days.
Fixes: dbdda842fe ("printk: Add console owner and waiter logic to load balance console writes")
Link: http://lkml.kernel.org/r/20180226155734.dzwg3aovqnwtvkoy@pathway.suse.cz
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: linux-kernel@vger.kernel.org
Cc: Tejun Heo <tj@kernel.org>
Suggested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Pull x86 fixes from Thomas Gleixner:
"Yet another pile of melted spectrum related changes:
- sanitize the array_index_nospec protection mechanism: Remove the
overengineered array_index_nospec_mask_check() magic and allow
const-qualified types as index to avoid temporary storage in a
non-const local variable.
- make the microcode loader more robust by properly propagating error
codes. Provide information about new feature bits after micro code
was updated so administrators can act upon.
- optimizations of the entry ASM code which reduce code footprint and
make the code simpler and faster.
- fix the {pmd,pud}_{set,clear}_flags() implementations to work
properly on paravirt kernels by removing the address translation
operations.
- revert the harmful vmexit_fill_RSB() optimization
- use IBRS around firmware calls
- teach objtool about retpolines and add annotations for indirect
jumps and calls.
- explicitly disable jumplabel patching in __init code and handle
patching failures properly instead of silently ignoring them.
- remove indirect paravirt calls for writing the speculation control
MSR as these calls are obviously proving the same attack vector
which is tried to be mitigated.
- a few small fixes which address build issues with recent compiler
and assembler versions"
* 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits)
KVM/VMX: Optimize vmx_vcpu_run() and svm_vcpu_run() by marking the RDMSR path as unlikely()
KVM/x86: Remove indirect MSR op calls from SPEC_CTRL
objtool, retpolines: Integrate objtool with retpoline support more closely
x86/entry/64: Simplify ENCODE_FRAME_POINTER
extable: Make init_kernel_text() global
jump_label: Warn on failed jump_label patching attempt
jump_label: Explicitly disable jump labels in __init code
x86/entry/64: Open-code switch_to_thread_stack()
x86/entry/64: Move ASM_CLAC to interrupt_entry()
x86/entry/64: Remove 'interrupt' macro
x86/entry/64: Move the switch_to_thread_stack() call to interrupt_entry()
x86/entry/64: Move ENTER_IRQ_STACK from interrupt macro to interrupt_entry
x86/entry/64: Move PUSH_AND_CLEAR_REGS from interrupt macro to helper function
x86/speculation: Move firmware_restrict_branch_speculation_*() from C to CPP
objtool: Add module specific retpoline rules
objtool: Add retpoline validation
objtool: Use existing global variables for options
x86/mm/sme, objtool: Annotate indirect call in sme_encrypt_execute()
x86/boot, objtool: Annotate indirect jump in secondary_startup_64()
x86/paravirt, objtool: Annotate indirect calls
...
Daniel Borkmann says:
====================
pull-request: bpf-next 2018-02-26
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Various improvements for BPF kselftests: i) skip unprivileged tests
when kernel.unprivileged_bpf_disabled sysctl knob is set, ii) count
the number of skipped tests from unprivileged, iii) when a test case
had an unexpected error then print the actual but also the unexpected
one for better comparison, from Joe.
2) Add a sample program for collecting CPU state statistics with regards
to how long the CPU resides in cstate and pstate levels. Based on
cpu_idle and cpu_frequency trace points, from Leo.
3) Various x64 BPF JIT optimizations to further shrink the generated
image size in order to make it more icache friendly. When tested on
the Cilium generated programs, image size reduced by approx 4-5% in
best case mainly due to how LLVM emits unsigned 32 bit constants,
from Daniel.
4) Improvements and fixes on the BPF sockmap sample programs: i) fix
the sockmap's Makefile to include nlattr.o for libbpf, ii) detach
the sock ops programs from the cgroup before exit, from Prashant.
5) Avoid including xdp.h in filter.h by just forward declaring the
struct xdp_rxq_info in filter.h, from Jesper.
6) Fix the BPF kselftests Makefile for cgroup_helpers.c by only declaring
it a dependency for test_dev_cgroup.c but not every other test case
where it is not needed, from Jesper.
7) Adjust rlimit RLIMIT_MEMLOCK for test_tcpbpf_user selftest since the
default is insufficient for creating the 'global_map' used in the
corresponding BPF program, from Yonghong.
8) Likewise, for the xdp_redirect sample, Tushar ran into the same when
invoking xdp_redirect and xdp_monitor at the same time, therefore
in order to have the sample generically work bump the limit here,
too. Fix from Tushar.
9) Avoid an unnecessary NULL check in BPF_CGROUP_RUN_PROG_INET_SOCK()
since sk is always guaranteed to be non-NULL, from Yafang.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull x86 fixes from Thomas Gleixner:
"A small set of fixes:
- UAPI data type correction for hyperv
- correct the cpu cores field in /proc/cpuinfo on CPU hotplug
- return proper error code in the resctrl file system failure path to
avoid silent subsequent failures
- correct a subtle accounting issue in the new vector allocation code
which went unnoticed for a while and caused suspend/resume
failures"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/topology: Update the 'cpu cores' field in /proc/cpuinfo correctly across CPU hotplug operations
x86/topology: Fix function name in documentation
x86/intel_rdt: Fix incorrect returned value when creating rdgroup sub-directory in resctrl file system
x86/apic/vector: Handle vector release on CPU unplug correctly
genirq/matrix: Handle CPU offlining proper
x86/headers/UAPI: Use __u64 instead of u64 in <uapi/asm/hyperv.h>
RCU's expedited grace periods can participate in out-of-memory deadlocks
due to all available system_wq kthreads being blocked and there not being
memory available to create more. This commit prevents such deadlocks
by allocating an RCU-specific workqueue_struct at early boot time, and
providing it with a rescuer to ensure forward progress. This uses the
shiny new init_rescuer() function provided by Tejun (but indirectly).
This commit also causes SRCU to use this new RCU-specific
workqueue_struct. Note that SRCU's use of workqueues never blocks them
waiting for readers, so this should be safe from a forward-progress
viewpoint. Note that this moves SRCU from system_power_efficient_wq
to a normal workqueue. In the unlikely event that this results in
measurable degradation, a separate power-efficient workqueue will be
creates for SRCU.
Reported-by: Prateek Sood <prsood@codeaurora.org>
Reported-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Tejun Heo <tj@kernel.org>
Pull networking fixes from David Miller:
1) Fix TTL offset calculation in mac80211 mesh code, from Peter Oh.
2) Fix races with procfs in ipt_CLUSTERIP, from Cong Wang.
3) Memory leak fix in lpm_trie BPF map code, from Yonghong Song.
4) Need to use GFP_ATOMIC in BPF cpumap allocations, from Jason Wang.
5) Fix potential deadlocks in netfilter getsockopt() code paths, from
Paolo Abeni.
6) Netfilter stackpointer size checks really are needed to validate
user input, from Florian Westphal.
7) Missing timer init in x_tables, from Paolo Abeni.
8) Don't use WQ_MEM_RECLAIM in mac80211 hwsim, from Johannes Berg.
9) When an ibmvnic device is brought down then back up again, it can be
sent queue entries from a previous session, handle this properly
instead of crashing. From Thomas Falcon.
10) Fix TCP checksum on LRO buffers in mlx5e, from Gal Pressman.
11) When we are dumping filters in cls_api, the output SKB is empty, and
the filter we are dumping is too large for the space in the SKB, we
should return -EMSGSIZE like other netlink dump operations do.
Otherwise userland has no signal that is needs to increase the size
of its read buffer. From Roman Kapl.
12) Several XDP fixes for virtio_net, from Jesper Dangaard Brouer.
13) Module refcount leak in netlink when a dump start fails, from Jason
Donenfeld.
14) Handle sub-optimal GSO sizes better in TCP BBR congestion control,
from Eric Dumazet.
15) Releasing bpf per-cpu arraymaps can take a long time, add a
condtional scheduling point. From Eric Dumazet.
16) Implement retpolines for tail calls in x64 and arm64 bpf JITs. From
Daniel Borkmann.
17) Fix page leak in gianfar driver, from Andy Spencer.
18) Missed clearing of estimator scratch buffer, from Eric Dumazet.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (76 commits)
net_sched: gen_estimator: fix broken estimators based on percpu stats
gianfar: simplify FCS handling and fix memory leak
ipv6 sit: work around bogus gcc-8 -Wrestrict warning
macvlan: fix use-after-free in macvlan_common_newlink()
bpf, arm64: fix out of bounds access in tail call
bpf, x64: implement retpoline for tail call
rxrpc: Fix send in rxrpc_send_data_packet()
net: aquantia: Fix error handling in aq_pci_probe()
bpf: fix rcu lockdep warning for lpm_trie map_free callback
bpf: add schedule points in percpu arrays management
regulatory: add NUL to request alpha2
ibmvnic: Fix early release of login buffer
net/smc9194: Remove bogus CONFIG_MAC reference
net: ipv4: Set addr_type in hash_keys for forwarded case
tcp_bbr: better deal with suboptimal GSO
smsc75xx: fix smsc75xx_set_features()
netlink: put module reference if dump start fails
selftests/bpf/test_maps: exit child process without error in ENOMEM case
selftests/bpf: update gitignore with test_libbpf_open
selftests/bpf: tcpbpf_kern: use in6_* macros from glibc
..
Pull security subsystem fixes from James Morris:
- keys fixes via David Howells:
"A collection of fixes for Linux keyrings, mostly thanks to Eric
Biggers:
- Fix some PKCS#7 verification issues.
- Fix handling of unsupported crypto in X.509.
- Fix too-large allocation in big_key"
- Seccomp updates via Kees Cook:
"These are fixes for the get_metadata interface that landed during
-rc1. While the new selftest is strictly not a bug fix, I think
it's in the same spirit of avoiding bugs"
- an IMA build fix from Randy Dunlap
* 'fixes-v4.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
integrity/security: fix digsig.c build error with header file
KEYS: Use individual pages in big_key for crypto buffers
X.509: fix NULL dereference when restricting key with unsupported_sig
X.509: fix BUG_ON() when hash algorithm is unsupported
PKCS#7: fix direct verification of SignerInfo signature
PKCS#7: fix certificate blacklisting
PKCS#7: fix certificate chain verification
seccomp: add a selftest for get_metadata
ptrace, seccomp: tweak get_metadata behavior slightly
seccomp, ptrace: switch get_metadata types to arch independent
The requirements around atomic_add() / atomic64_add() resp. their
JIT implementations differ across architectures. E.g. while x86_64
seems just fine with BPF's xadd on unaligned memory, on arm64 it
triggers via interpreter but also JIT the following crash:
[ 830.864985] Unable to handle kernel paging request at virtual address ffff8097d7ed6703
[...]
[ 830.916161] Internal error: Oops: 96000021 [#1] SMP
[ 830.984755] CPU: 37 PID: 2788 Comm: test_verifier Not tainted 4.16.0-rc2+ #8
[ 830.991790] Hardware name: Huawei TaiShan 2280 /BC11SPCD, BIOS 1.29 07/17/2017
[ 830.998998] pstate: 80400005 (Nzcv daif +PAN -UAO)
[ 831.003793] pc : __ll_sc_atomic_add+0x4/0x18
[ 831.008055] lr : ___bpf_prog_run+0x1198/0x1588
[ 831.012485] sp : ffff00001ccabc20
[ 831.015786] x29: ffff00001ccabc20 x28: ffff8017d56a0f00
[ 831.021087] x27: 0000000000000001 x26: 0000000000000000
[ 831.026387] x25: 000000c168d9db98 x24: 0000000000000000
[ 831.031686] x23: ffff000008203878 x22: ffff000009488000
[ 831.036986] x21: ffff000008b14e28 x20: ffff00001ccabcb0
[ 831.042286] x19: ffff0000097b5080 x18: 0000000000000a03
[ 831.047585] x17: 0000000000000000 x16: 0000000000000000
[ 831.052885] x15: 0000ffffaeca8000 x14: 0000000000000000
[ 831.058184] x13: 0000000000000000 x12: 0000000000000000
[ 831.063484] x11: 0000000000000001 x10: 0000000000000000
[ 831.068783] x9 : 0000000000000000 x8 : 0000000000000000
[ 831.074083] x7 : 0000000000000000 x6 : 000580d428000000
[ 831.079383] x5 : 0000000000000018 x4 : 0000000000000000
[ 831.084682] x3 : ffff00001ccabcb0 x2 : 0000000000000001
[ 831.089982] x1 : ffff8097d7ed6703 x0 : 0000000000000001
[ 831.095282] Process test_verifier (pid: 2788, stack limit = 0x0000000018370044)
[ 831.102577] Call trace:
[ 831.105012] __ll_sc_atomic_add+0x4/0x18
[ 831.108923] __bpf_prog_run32+0x4c/0x70
[ 831.112748] bpf_test_run+0x78/0xf8
[ 831.116224] bpf_prog_test_run_xdp+0xb4/0x120
[ 831.120567] SyS_bpf+0x77c/0x1110
[ 831.123873] el0_svc_naked+0x30/0x34
[ 831.127437] Code: 97fffe97 17ffffec 00000000 f9800031 (885f7c31)
Reason for this is because memory is required to be aligned. In
case of BPF, we always enforce alignment in terms of stack access,
but not when accessing map values or packet data when the underlying
arch (e.g. arm64) has CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS set.
xadd on packet data that is local to us anyway is just wrong, so
forbid this case entirely. The only place where xadd makes sense in
fact are map values; xadd on stack is wrong as well, but it's been
around for much longer. Specifically enforce strict alignment in case
of xadd, so that we handle this case generically and avoid such crashes
in the first place.
Fixes: 17a5267067 ("bpf: verifier (add verifier core)")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJaj38KAAoJEAx081l5xIa+AY8P/0oX+UPtjNxVqUTzeejxxZG7
EpmcJWP2SENnkOSdiyPMLI4SIOgv0B+73hX6ATbsVx9nseqxAJyoAFJZCQy7ioS3
RjB6wXi/WQrxrXc3MU5FUp8AfPLvZx2BlAHGqyuk3V2f3fIjl0tWmMuxgdc0WX1j
wzzHNBEKoXG5WVVEOXJZq5xd8s35QTdhqGpqrvl1ruHtqmnls8n67qPB9F7F6lHm
Iwi6MlvIxwoLIuWj0cJyOoUdw0Z6/MQ+Of8zW1E0NJIfgfa9LKjtIRUacJvOndRP
Oq9XUCI/6gmNswmdktz65w1SfuU/cq9j46FuBh23QYNvYfuYgtvL0xhQPYF08vtK
83X1Sop8Pzz9f2jCL2TPKLF37TetNpMT1gTP/NsGirRc+cvZTMBl1+OcWO47oTYZ
TZ70L7GSJOdJV/n5vdCE5bSBS/thvLC5tyUGgRH+y7E6Lt2HouVN3ulkKb/stuQ3
ee9NbI16YXZepK3+Z4YUdFziC40BO7K0LGlyAjs9G95LBRQNq9jNJLXTog5vSUJa
3DFjEqQ558iciGkmYx4cQhlCqYvzuNClutz2D4RN7LqA5wHKqt4LWwTgjnUk9Z82
lvNm3IGB+HiXEWpQmEuQeMqC+Xwxfdx3n+s3I7TpztdbgIWJM4KqAa4OKKK2NUM6
qxEYwcQ2P84obOwBkVu3
=LRdE
-----END PGP SIGNATURE-----
Merge tag 'drm-fixes-for-v4.16-rc3' of git://people.freedesktop.org/~airlied/linux
Pull drm fixes from Dave Airlie:
"A bunch of fixes for rc3:
Exynos:
- fixes for using monotonic timestamps
- register definitions
- removal of unused file
ipu-v3L
- minor changes
- make some register arrays const+static
- fix some leaks
meson:
- fix for vsync
atomic:
- fix for memory leak
EDID parser:
- add quirks for some more non-desktop devices
- 6-bit panel fix.
drm_mm:
- fix a bug in the core drm mm hole handling
cirrus:
- fix lut loading regression
Lastly there is a deadlock fix around runtime suspend for secondary
GPUs.
There was a deadlock between one thread trying to wait for a workqueue
job to finish in the runtime suspend path, and the workqueue job it
was waiting for in turn waiting for a runtime_get_sync to return.
The fixes avoids it by not doing the runtime sync in the workqueue as
then we always wait for all those tasks to complete before we runtime
suspend"
* tag 'drm-fixes-for-v4.16-rc3' of git://people.freedesktop.org/~airlied/linux: (25 commits)
drm/tve200: fix kernel-doc documentation comment include
drm/edid: quirk Sony PlayStation VR headset as non-desktop
drm/edid: quirk Windows Mixed Reality headsets as non-desktop
drm/edid: quirk Oculus Rift headsets as non-desktop
drm/meson: fix vsync buffer update
drm: Handle unexpected holes in color-eviction
drm: exynos: Use proper macro definition for HDMI_I2S_PIN_SEL_1
drm/exynos: remove exynos_drm_rotator.h
drm/exynos: g2d: Delete an error message for a failed memory allocation in two functions
drm/exynos: fix comparison to bitshift when dealing with a mask
drm/exynos: g2d: use monotonic timestamps
drm/edid: Add 6 bpc quirk for CPT panel in Asus UX303LA
gpu: ipu-csi: add 10/12-bit grayscale support to mbus_code_to_bus_cfg
gpu: ipu-cpmem: add 16-bit grayscale support to ipu_cpmem_set_image
gpu: ipu-v3: prg: fix device node leak in ipu_prg_lookup_by_phandle
gpu: ipu-v3: pre: fix device node leak in ipu_pre_lookup_by_phandle
drm/amdgpu: Fix deadlock on runtime suspend
drm/radeon: Fix deadlock on runtime suspend
drm/nouveau: Fix deadlock on runtime suspend
drm: Allow determining if current task is output poll worker
...
Evidently the __mutex_owner() function was never intended for use
outside the core mutex code, so build a thing locking wrapper around
the mutex code which allows us to track the mutex owner.
One, arguably positive, side effect is that this allows us to hide
the audit_cmd_mutex inside of kernel/audit.c behind the lock/unlock
functions.
Reported-by: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
At CPU hotunplug the corresponding per cpu matrix allocator is shut down and
the allocated interrupt bits are discarded under the assumption that all
allocated bits have been either migrated away or shut down through the
managed interrupts mechanism.
This is not true because interrupts which are not started up might have a
vector allocated on the outgoing CPU. When the interrupt is started up
later or completely shutdown and freed then the allocated vector is handed
back, triggering warnings or causing accounting issues which result in
suspend failures and other issues.
Change the CPU hotplug mechanism of the matrix allocator so that the
remaining allocations at unplug time are preserved and global accounting at
hotplug is correctly readjusted to take the dormant vectors into account.
Fixes: 2f75d9e1c9 ("genirq: Implement bitmap matrix allocator")
Reported-by: Yuriy Vostrikov <delamonpansie@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Yuriy Vostrikov <delamonpansie@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20180222112316.849980972@linutronix.de
Commit 9a3efb6b66 ("bpf: fix memory leak in lpm_trie map_free callback function")
fixed a memory leak and removed unnecessary locks in map_free callback function.
Unfortrunately, it introduced a lockdep warning. When lockdep checking is turned on,
running tools/testing/selftests/bpf/test_lpm_map will have:
[ 98.294321] =============================
[ 98.294807] WARNING: suspicious RCU usage
[ 98.295359] 4.16.0-rc2+ #193 Not tainted
[ 98.295907] -----------------------------
[ 98.296486] /home/yhs/work/bpf/kernel/bpf/lpm_trie.c:572 suspicious rcu_dereference_check() usage!
[ 98.297657]
[ 98.297657] other info that might help us debug this:
[ 98.297657]
[ 98.298663]
[ 98.298663] rcu_scheduler_active = 2, debug_locks = 1
[ 98.299536] 2 locks held by kworker/2:1/54:
[ 98.300152] #0: ((wq_completion)"events"){+.+.}, at: [<00000000196bc1f0>] process_one_work+0x157/0x5c0
[ 98.301381] #1: ((work_completion)(&map->work)){+.+.}, at: [<00000000196bc1f0>] process_one_work+0x157/0x5c0
Since actual trie tree removal happens only after no other
accesses to the tree are possible, replacing
rcu_dereference_protected(*slot, lockdep_is_held(&trie->lock))
with
rcu_dereference_protected(*slot, 1)
fixed the issue.
Fixes: 9a3efb6b66 ("bpf: fix memory leak in lpm_trie map_free callback function")
Reported-by: Eric Dumazet <edumazet@google.com>
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
syszbot managed to trigger RCU detected stalls in
bpf_array_free_percpu()
It takes time to allocate a huge percpu map, but even more time to free
it.
Since we run in process context, use cond_resched() to yield cpu if
needed.
Fixes: a10423b87a ("bpf: introduce BPF_MAP_TYPE_PERCPU_ARRAY map")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Merge misc fixes from Andrew Morton:
"16 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
mm: don't defer struct page initialization for Xen pv guests
lib/Kconfig.debug: enable RUNTIME_TESTING_MENU
vmalloc: fix __GFP_HIGHMEM usage for vmalloc_32 on 32b systems
selftests/memfd: add run_fuse_test.sh to TEST_FILES
bug.h: work around GCC PR82365 in BUG()
mm/swap.c: make functions and their kernel-doc agree (again)
mm/zpool.c: zpool_evictable: fix mismatch in parameter name and kernel-doc
ida: do zeroing in ida_pre_get()
mm, swap, frontswap: fix THP swap if frontswap enabled
certs/blacklist_nohashes.c: fix const confusion in certs blacklist
kernel/relay.c: limit kmalloc size to KMALLOC_MAX_SIZE
mm, mlock, vmscan: no more skipping pagevecs
mm: memcontrol: fix NR_WRITEBACK leak in memcg and system stats
Kbuild: always define endianess in kconfig.h
include/linux/sched/mm.h: re-inline mmdrop()
tools: fix cross-compile var clobbering
Each read from a file in efivarfs results in two calls to EFI
(one to get the file size, another to get the actual data).
On X86 these EFI calls result in broadcast system management
interrupts (SMI) which affect performance of the whole system.
A malicious user can loop performing reads from efivarfs bringing
the system to its knees.
Linus suggested per-user rate limit to solve this.
So we add a ratelimit structure to "user_struct" and initialize
it for the root user for no limit. When allocating user_struct for
other users we set the limit to 100 per second. This could be used
for other places that want to limit the rate of some detrimental
user action.
In efivarfs if the limit is exceeded when reading, we take an
interruptible nap for 50ms and check the rate limit again.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Previously if users passed a small size for the input structure size, they
would get get odd behavior. It doesn't make sense to pass a structure
smaller than at least filter_off size, so let's just give -EINVAL in this
case.
This changes userspace visible behavior, but was only introduced in commit
26500475ac ("ptrace, seccomp: add support for retrieving seccomp
metadata") in 4.16-rc2, so should be safe to change if merged before then.
Reported-by: Eugene Syromiatnikov <esyr@redhat.com>
Signed-off-by: Tycho Andersen <tycho@tycho.ws>
CC: Kees Cook <keescook@chromium.org>
CC: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
If there is a memory allocation error when trying to change an audit
kernel feature value, the ignored allocation error will trigger a NULL
pointer dereference oops on subsequent use of that pointer. Return
instead.
Passes audit-testsuite.
See: https://github.com/linux-audit/audit-kernel/issues/76
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
[PM: not necessary (other funcs check for NULL), but a good practice]
Signed-off-by: Paul Moore <paul@paul-moore.com>
chan->n_subbufs is set by the user and relay_create_buf() does a kmalloc()
of chan->n_subbufs * sizeof(size_t *).
kmalloc_slab() will generate a warning when this fails if
chan->subbufs * sizeof(size_t *) > KMALLOC_MAX_SIZE.
Limit chan->n_subbufs to the maximum allowed kmalloc() size.
Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1802061216100.122576@chino.kir.corp.google.com
Fixes: f6302f1bcd ("relay: prevent integer overflow in relay_open()")
Signed-off-by: David Rientjes <rientjes@google.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As Peter points out, Doing a CALL+RET for just the decrement is a bit silly.
Fixes: d70f2a14b7 ("include/linux/sched/mm.h: uninline mmdrop_async(), etc")
Acked-by: Peter Zijlstra (Intel) <peterz@infraded.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
drivers, a memory leak on non-blocking commits, a crash on color-eviction.
The is also meson and edid fixes, plus a fix for a doc warning.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJajY3SAAoJEEN0HIUfOBk0/5IP/jTa0VKe7UurEzj9Vzgt4USu
tVre4MGN42peY2PbVSsBmvHAOeyII7la1/NkiFi8wZKQ2MXw43NenKOcRLDW0r9b
6U8Tlq3sU//NdUDAiLLx9hKb+i31ag+wodvULt0PKtEWDsxWDSRZUo792as2YUkC
VxHuIQywNABohn2Ya8Og1dON25GD7zRzNzH7O+g+fds/Qvav0504u2v10jBKJC0D
IB2oc3ZtJR8n0dFpzhnEB7YkxyvkrsWZQ1LtutGFgrr54F0KVHvAm/VMZ5qzyCRi
kvJN81OFo0xpdE7ZMSQ5YAvcPsEC5ifSNaaxpawsM904H7fS6FNhHMg7cGGi1f7R
B8YbLrdy+mBnQPNNbPcDPQA+YN/tRv4rRmmdLdkDbdY1GM/JJ4C7PTuLL6mX1iWU
DuHiaFS0KZGoS0XCVbvhLkPt5fsmvp+QxBpeNAtxgOdn2pRquDmGZ1jTVEG2mw5U
rqoPURa3urqdSwj8ba0jbJo6WBAmb1uWeyJ7xpyUVhR9SR30+URYVWwJEPDOgTnQ
PaEzjobntgDLaq5NbhpEvmYmylv1SPkucGtkCtwPxIrrh5Z84pZTJ1th2ogfn3Ti
VL25dTlzFpsjEMgC72wCi0eiP7qLVTX9vHYZBzkeIjIWDH0rCnCFxvjwmD/aVUbz
Ex1/fGNEVkFupcYLu7m4
=555h
-----END PGP SIGNATURE-----
Merge tag 'drm-misc-fixes-2018-02-21' of git://anongit.freedesktop.org/drm/drm-misc into drm-fixes
Fixes for 4.16. I contains fixes for deadlock on runtime suspend on few
drivers, a memory leak on non-blocking commits, a crash on color-eviction.
The is also meson and edid fixes, plus a fix for a doc warning.
* tag 'drm-misc-fixes-2018-02-21' of git://anongit.freedesktop.org/drm/drm-misc:
drm/tve200: fix kernel-doc documentation comment include
drm/meson: fix vsync buffer update
drm: Handle unexpected holes in color-eviction
drm/edid: Add 6 bpc quirk for CPT panel in Asus UX303LA
drm/amdgpu: Fix deadlock on runtime suspend
drm/radeon: Fix deadlock on runtime suspend
drm/nouveau: Fix deadlock on runtime suspend
drm: Allow determining if current task is output poll worker
workqueue: Allow retrieval of current task's work struct
drm/atomic: Fix memleak on ERESTARTSYS during non-blocking commits
Daniel Borkmann says:
====================
pull-request: bpf 2018-02-20
The following pull-request contains BPF updates for your *net* tree.
The main changes are:
1) Fix a memory leak in LPM trie's map_free() callback function, where
the trie structure itself was not freed since initial implementation.
Also a synchronize_rcu() was needed in order to wait for outstanding
programs accessing the trie to complete, from Yonghong.
2) Fix sock_map_alloc()'s error path in order to correctly propagate
the -EINVAL error in case of too large allocation requests. This
was just recently introduced when fixing close hooks via ULP layer,
fix from Eric.
3) Do not use GFP_ATOMIC in __cpu_map_entry_alloc(). Reason is that this
will not work with the recent __ptr_ring_init_queue_alloc() conversion
to kvmalloc_array(), where in case of fallback to vmalloc() that GFP
flag is invalid, from Jason.
4) Fix two recent syzkaller warnings: i) fix bpf_prog_array_copy_to_user()
when a prog query with a big number of ids was performed where we'd
otherwise trigger a warning from allocator side, ii) fix a missing
mlock precharge on arraymaps, from Daniel.
5) Two fixes for bpftool in order to avoid breaking JSON output when used
in batch mode, from Quentin.
6) Move a pr_debug() in libbpf in order to avoid having an otherwise
uninitialized variable in bpf_program__reloc_text(), from Jeremy.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
A domain cgroup isn't allowed to be turned threaded if its subtree is
populated or domain controllers are enabled. cgroup_enable_threaded()
depended on cgroup_can_be_thread_root() test to enforce this rule. A
parent which has populated domain descendants or have domain
controllers enabled can't become a thread root, so the above rules are
enforced automatically.
However, for the root cgroup which can host mixed domain and threaded
children, cgroup_can_be_thread_root() doesn't check any of those
conditions and thus first level cgroups ends up escaping those rules.
This patch fixes the bug by adding explicit checks for those rules in
cgroup_enable_threaded().
Reported-by: Michael Kerrisk (man-pages) <mtk.manpages@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 8cfd8147df ("cgroup: implement cgroup v2 thread support")
Cc: stable@vger.kernel.org # v4.14+
Convert init_kernel_text() to a global function and use it in a few
places instead of manually comparing _sinittext and _einittext.
Note that kallsyms.h has a very similar function called
is_kernel_inittext(), but its end check is inclusive. I'm not sure
whether that's intentional behavior, so I didn't touch it.
Suggested-by: Jason Baron <jbaron@akamai.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/4335d02be8d45ca7d265d2f174251d0b7ee6c5fd.1519051220.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently when the jump label code encounters an address which isn't
recognized by kernel_text_address(), it just silently fails.
This can be dangerous because jump labels are used in a variety of
places, and are generally expected to work. Convert the silent failure
to a warning.
This won't warn about attempted writes to tracepoints in __init code
after initmem has been freed, as those are already guarded by the
entry->code check.
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/de3a271c93807adb7ed48f4e946b4f9156617680.1519051220.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
After initmem has been freed, any jump labels in __init code are
prevented from being written to by the kernel_text_address() check in
__jump_label_update(). However, this check is quite broad. If
kernel_text_address() were to return false for any other reason, the
jump label write would fail silently with no warning.
For jump labels in module init code, entry->code is set to zero to
indicate that the entry is disabled. Do the same thing for core kernel
init code. This makes the behavior more consistent, and will also make
it more straightforward to detect non-init jump label write failures in
the next patch.
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/c52825c73f3a174e8398b6898284ec20d4deb126.1519051220.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now that the 1Hz tick is offloaded to workqueues, we can safely remove
the residual code that used to handle it locally.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Wanpeng Li <kernellwp@gmail.com>
Link: http://lkml.kernel.org/r/1519186649-3242-7-git-send-email-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When a CPU runs in full dynticks mode, a 1Hz tick remains in order to
keep the scheduler stats alive. However this residual tick is a burden
for bare metal tasks that can't stand any interruption at all, or want
to minimize them.
The usual boot parameters "nohz_full=" or "isolcpus=nohz" will now
outsource these scheduler ticks to the global workqueue so that a
housekeeping CPU handles those remotely. The sched_class::task_tick()
implementations have been audited and look safe to be called remotely
as the target runqueue and its current task are passed in parameter
and don't seem to be accessed locally.
Note that in the case of using isolcpus, it's still up to the user to
affine the global workqueues to the housekeeping CPUs through
/sys/devices/virtual/workqueue/cpumask or domains isolation
"isolcpus=nohz,domain".
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Wanpeng Li <kernellwp@gmail.com>
Link: http://lkml.kernel.org/r/1519186649-3242-6-git-send-email-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
As we prepare for offloading the residual 1hz scheduler ticks to
workqueue, let's affine those to housekeepers so that they don't
interrupt the CPUs that don't want to be disturbed.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Wanpeng Li <kernellwp@gmail.com>
Link: http://lkml.kernel.org/r/1519186649-3242-5-git-send-email-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This check is racy but provides a good heuristic to determine whether
a CPU may need a remote tick or not.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Wanpeng Li <kernellwp@gmail.com>
Link: http://lkml.kernel.org/r/1519186649-3242-4-git-send-email-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
It makes this function more self-explanatory about what it does and how
to use it.
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Wanpeng Li <kernellwp@gmail.com>
Link: http://lkml.kernel.org/r/1519186649-3242-3-git-send-email-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Do that rename in order to normalize the hrtick namespace.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Wanpeng Li <kernellwp@gmail.com>
Link: http://lkml.kernel.org/r/1519186649-3242-2-git-send-email-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
If wake_affine() pulls a task to another node for any reason and the node is
no longer preferred then temporarily stop automatic NUMA balancing pulling
the task back. Otherwise, tasks with a strong waker/wakee relationship
may constantly fight automatic NUMA balancing over where a task should
be placed.
Once again netperf is interesting here. The performance barely changes
but automatic NUMA balancing is interesting:
Hmean send-64 354.67 ( 0.00%) 352.15 ( -0.71%)
Hmean send-128 702.91 ( 0.00%) 693.84 ( -1.29%)
Hmean send-256 1350.07 ( 0.00%) 1344.19 ( -0.44%)
Hmean send-1024 5124.38 ( 0.00%) 4941.24 ( -3.57%)
Hmean send-2048 9687.44 ( 0.00%) 9624.45 ( -0.65%)
Hmean send-3312 14577.64 ( 0.00%) 14514.35 ( -0.43%)
Hmean send-4096 16393.62 ( 0.00%) 16488.30 ( 0.58%)
Hmean send-8192 26877.26 ( 0.00%) 26431.63 ( -1.66%)
Hmean send-16384 38683.43 ( 0.00%) 38264.91 ( -1.08%)
Hmean recv-64 354.67 ( 0.00%) 352.15 ( -0.71%)
Hmean recv-128 702.91 ( 0.00%) 693.84 ( -1.29%)
Hmean recv-256 1350.07 ( 0.00%) 1344.19 ( -0.44%)
Hmean recv-1024 5124.38 ( 0.00%) 4941.24 ( -3.57%)
Hmean recv-2048 9687.43 ( 0.00%) 9624.45 ( -0.65%)
Hmean recv-3312 14577.59 ( 0.00%) 14514.35 ( -0.43%)
Hmean recv-4096 16393.55 ( 0.00%) 16488.20 ( 0.58%)
Hmean recv-8192 26876.96 ( 0.00%) 26431.29 ( -1.66%)
Hmean recv-16384 38682.41 ( 0.00%) 38263.94 ( -1.08%)
NUMA alloc hit 1465986 1423090
NUMA alloc miss 0 0
NUMA interleave hit 0 0
NUMA alloc local 1465897 1423003
NUMA base PTE updates 1473 1420
NUMA huge PMD updates 0 0
NUMA page range updates 1473 1420
NUMA hint faults 1383 1312
NUMA hint local faults 451 124
NUMA hint local percent 32 9
There is a slight degrading in performance but there are slightly fewer
NUMA faults. There is a large drop in the percentage of local faults but
the bulk of migrations for netperf are in small shared libraries so it's
reflecting the fact that automatic NUMA balancing has backed off. This is
a case where despite wake_affine() and automatic NUMA balancing fighting
for placement that there is a marginal benefit to rescheduling to local
data quickly. However, it should be noted that wake_affine() and automatic
NUMA balancing fighting each other constantly is undesirable.
However, the benefit in other cases is large. This is the result for NAS
with the D class sizing on a 4-socket machine:
nas-mpi
4.15.0 4.15.0
sdnuma-v1r23 delayretry-v1r23
Time cg.D 557.00 ( 0.00%) 431.82 ( 22.47%)
Time ep.D 77.83 ( 0.00%) 79.01 ( -1.52%)
Time is.D 26.46 ( 0.00%) 26.64 ( -0.68%)
Time lu.D 727.14 ( 0.00%) 597.94 ( 17.77%)
Time mg.D 191.35 ( 0.00%) 146.85 ( 23.26%)
4.15.0 4.15.0
sdnuma-v1r23delayretry-v1r23
User 75665.20 70413.30
System 20321.59 8861.67
Elapsed 766.13 634.92
Minor Faults 16528502 7127941
Major Faults 4553 5068
NUMA alloc local 6963197 6749135
NUMA base PTE updates 366409093 107491434
NUMA huge PMD updates 687556 198880
NUMA page range updates 718437765 209317994
NUMA hint faults 13643410 4601187
NUMA hint local faults 9212593 3063996
NUMA hint local percent 67 66
Note the massive reduction in system CPU usage even though the percentage
of local faults is barely affected. There is a massive reduction in the
number of PTE updates showing that automatic NUMA balancing has backed off.
A critical observation is also that there is a massive reduction in minor
faults which is due to far fewer NUMA hinting faults being trapped.
There were questions on NAS OMP and how it behaved related to threads
being bound to CPUs. First, there are more gains than losses with this
patch applied and a reduction in system CPU usage:
nas-omp
4.16.0-rc1 4.16.0-rc1
sdnuma-v2r1 delayretry-v2r1
Time bt.D 436.71 ( 0.00%) 430.05 ( 1.53%)
Time cg.D 201.02 ( 0.00%) 180.87 ( 10.02%)
Time ep.D 32.84 ( 0.00%) 32.68 ( 0.49%)
Time is.D 9.63 ( 0.00%) 9.64 ( -0.10%)
Time lu.D 331.20 ( 0.00%) 304.80 ( 7.97%)
Time mg.D 54.87 ( 0.00%) 52.72 ( 3.92%)
Time sp.D 1108.78 ( 0.00%) 917.10 ( 17.29%)
Time ua.D 378.81 ( 0.00%) 398.83 ( -5.28%)
4.16.0-rc1 4.16.0-rc1
sdnuma-v2r1delayretry-v2r1
User 305633.08 296751.91
System 451.75 357.80
Elapsed 2595.73 2368.13
However, it does not close the gap between binding and being unbound. There
is negligible difference between the performance of the baseline and a
patched kernel when threads are bound so it is not presented here:
4.16.0-rc1 4.16.0-rc1
delayretry-bind delayretry-unbound
Time bt.D 385.02 ( 0.00%) 430.05 ( -11.70%)
Time cg.D 144.02 ( 0.00%) 180.87 ( -25.59%)
Time ep.D 32.85 ( 0.00%) 32.68 ( 0.52%)
Time is.D 10.52 ( 0.00%) 9.64 ( 8.37%)
Time lu.D 285.31 ( 0.00%) 304.80 ( -6.83%)
Time mg.D 43.21 ( 0.00%) 52.72 ( -22.01%)
Time sp.D 820.24 ( 0.00%) 917.10 ( -11.81%)
Time ua.D 337.09 ( 0.00%) 398.83 ( -18.32%)
4.16.0-rc1 4.16.0-rc1
delayretry-binddelayretry-unbound
User 277731.25 296751.91
System 261.29 357.80
Elapsed 2100.55 2368.13
Unfortunately, while performance is improved by the patch, there is still
quite a long way to go before it's equivalent to hard binding.
Other workloads like hackbench, tbench, dbench and schbench are barely
affected. dbench shows a mix of gains and losses depending on the machine
although in general, the results are more stable.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Giovanni Gherdovich <ggherdovich@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180213133730.24064-7-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
find_idlest_group() compares a local group with each other group to select
the one that is most idle. When comparing groups in different NUMA domains,
a very slight imbalance is enough to select a remote NUMA node even if the
runnable load on both groups is 0 or close to 0. This ignores the cost of
remote accesses entirely and is a problem when selecting the CPU for a
newly forked task to run on. This is problematic when a forking server
is almost guaranteed to run on a remote node incurring numerous remote
accesses and potentially causing automatic NUMA balancing to try migrate
the task back or migrate the data to another node. Similar weirdness is
observed if a basic shell command pipes output to another as each process
in the pipeline is likely to start on different nodes and then get adjusted
later by wake_affine().
This patch adds imbalance to remote domains when considering whether to
select CPUs from remote domains. If the local domain is selected, imbalance
will still be used to try select a CPU from a lower scheduler domain's group
instead of stacking tasks on the same CPU.
A variety of workloads and machines were tested and as expected, there is no
difference on UMA. The difference on NUMA can be dramatic. This is a comparison
of elapsed times running the git regression test suite. It's fork-intensive with
short-lived processes:
4.15.0 4.15.0
noexit-v1r23 sdnuma-v1r23
Elapsed min 1706.06 ( 0.00%) 1435.94 ( 15.83%)
Elapsed mean 1709.53 ( 0.00%) 1436.98 ( 15.94%)
Elapsed stddev 2.16 ( 0.00%) 1.01 ( 53.38%)
Elapsed coeffvar 0.13 ( 0.00%) 0.07 ( 44.54%)
Elapsed max 1711.59 ( 0.00%) 1438.01 ( 15.98%)
4.15.0 4.15.0
noexit-v1r23 sdnuma-v1r23
User 5434.12 5188.41
System 4878.77 3467.09
Elapsed 10259.06 8624.21
That shows a considerable reduction in elapsed times. It's important to
note that automatic NUMA balancing does not affect this load as processes
are too short-lived.
There is also a noticable impact on hackbench such as this example using
processes and pipes:
hackbench-process-pipes
4.15.0 4.15.0
noexit-v1r23 sdnuma-v1r23
Amean 1 1.0973 ( 0.00%) 0.9393 ( 14.40%)
Amean 4 1.3427 ( 0.00%) 1.3730 ( -2.26%)
Amean 7 1.4233 ( 0.00%) 1.6670 ( -17.12%)
Amean 12 3.0250 ( 0.00%) 3.3013 ( -9.13%)
Amean 21 9.0860 ( 0.00%) 9.5343 ( -4.93%)
Amean 30 14.6547 ( 0.00%) 13.2433 ( 9.63%)
Amean 48 22.5447 ( 0.00%) 20.4303 ( 9.38%)
Amean 79 29.2010 ( 0.00%) 26.7853 ( 8.27%)
Amean 110 36.7443 ( 0.00%) 35.8453 ( 2.45%)
Amean 141 45.8533 ( 0.00%) 42.6223 ( 7.05%)
Amean 172 55.1317 ( 0.00%) 50.6473 ( 8.13%)
Amean 203 64.4420 ( 0.00%) 58.3957 ( 9.38%)
Amean 234 73.2293 ( 0.00%) 67.1047 ( 8.36%)
Amean 265 80.5220 ( 0.00%) 75.7330 ( 5.95%)
Amean 296 88.7567 ( 0.00%) 82.1533 ( 7.44%)
It's not a universal win as there are occasions when spreading wide and
quickly is a benefit but it's more of a win than it is a loss. For other
workloads, there is little difference but netperf is interesting. Without
the patch, the server and client starts on different nodes but quickly get
migrated due to wake_affine. Hence, the difference is overall performance
is marginal but detectable:
4.15.0 4.15.0
noexit-v1r23 sdnuma-v1r23
Hmean send-64 349.09 ( 0.00%) 354.67 ( 1.60%)
Hmean send-128 699.16 ( 0.00%) 702.91 ( 0.54%)
Hmean send-256 1316.34 ( 0.00%) 1350.07 ( 2.56%)
Hmean send-1024 5063.99 ( 0.00%) 5124.38 ( 1.19%)
Hmean send-2048 9705.19 ( 0.00%) 9687.44 ( -0.18%)
Hmean send-3312 14359.48 ( 0.00%) 14577.64 ( 1.52%)
Hmean send-4096 16324.20 ( 0.00%) 16393.62 ( 0.43%)
Hmean send-8192 26112.61 ( 0.00%) 26877.26 ( 2.93%)
Hmean send-16384 37208.44 ( 0.00%) 38683.43 ( 3.96%)
Hmean recv-64 349.09 ( 0.00%) 354.67 ( 1.60%)
Hmean recv-128 699.16 ( 0.00%) 702.91 ( 0.54%)
Hmean recv-256 1316.34 ( 0.00%) 1350.07 ( 2.56%)
Hmean recv-1024 5063.99 ( 0.00%) 5124.38 ( 1.19%)
Hmean recv-2048 9705.16 ( 0.00%) 9687.43 ( -0.18%)
Hmean recv-3312 14359.42 ( 0.00%) 14577.59 ( 1.52%)
Hmean recv-4096 16323.98 ( 0.00%) 16393.55 ( 0.43%)
Hmean recv-8192 26111.85 ( 0.00%) 26876.96 ( 2.93%)
Hmean recv-16384 37206.99 ( 0.00%) 38682.41 ( 3.97%)
However, what is very interesting is how automatic NUMA balancing behaves.
Each netperf instance runs long enough for balancing to activate:
NUMA base PTE updates 4620 1473
NUMA huge PMD updates 0 0
NUMA page range updates 4620 1473
NUMA hint faults 4301 1383
NUMA hint local faults 1309 451
NUMA hint local percent 30 32
NUMA pages migrated 1335 491
AutoNUMA cost 21% 6%
There is an unfortunate number of remote faults although tracing indicated
that the vast majority are in shared libraries. However, the tendency to
start tasks on the same node if there is capacity means that there were
far fewer PTE updates and faults incurred overall.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Giovanni Gherdovich <ggherdovich@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180213133730.24064-6-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When a task exits, it notifies the parent that it has exited. This is a
sync wakeup and the exiting task may pull the parent towards the wakers
CPU. For simple workloads like using a shell, it was observed that the
shell is pulled across nodes by exiting processes. This is daft as the
parent may be long-lived and properly placed. This patch special cases a
sync wakeup on exit to avoid pulling tasks across nodes. Testing on a range
of workloads and machines showed very little differences in performance
although there was a small 3% boost on some machines running a shellscript
intensive workload (git regression test suite).
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Giovanni Gherdovich <ggherdovich@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180213133730.24064-5-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
wake_affine_weight() will consider migrating a task to, or near, the current
CPU if there is a load imbalance. If the CPUs share LLC then either CPU
is valid as a search-for-idle-sibling target and equally appropriate for
stacking two tasks on one CPU if an idle sibling is unavailable. If they do
not share cache then a cross-node migration potentially impacts locality
so while they are equal from a CPU capacity point of view, they are not
equal in terms of memory locality. In either case, it's more appropriate
to migrate only if there is a difference in their effective load.
This patch modifies wake_affine_weight() to only consider migrating a task
if there is a load imbalance for normal wakeups but will allow potential
stacking if the loads are equal and it's a sync wakeup.
For the most part, the different in performance is marginal. For example,
on a 4-socket server running netperf UDP_STREAM on localhost the differences
are as follows:
4.15.0 4.15.0
16rc0 noequal-v1r23
Hmean send-64 355.47 ( 0.00%) 349.50 ( -1.68%)
Hmean send-128 697.98 ( 0.00%) 693.35 ( -0.66%)
Hmean send-256 1328.02 ( 0.00%) 1318.77 ( -0.70%)
Hmean send-1024 5051.83 ( 0.00%) 5051.11 ( -0.01%)
Hmean send-2048 9637.02 ( 0.00%) 9601.34 ( -0.37%)
Hmean send-3312 14355.37 ( 0.00%) 14414.51 ( 0.41%)
Hmean send-4096 16464.97 ( 0.00%) 16301.37 ( -0.99%)
Hmean send-8192 26722.42 ( 0.00%) 26428.95 ( -1.10%)
Hmean send-16384 38137.81 ( 0.00%) 38046.11 ( -0.24%)
Hmean recv-64 355.47 ( 0.00%) 349.50 ( -1.68%)
Hmean recv-128 697.98 ( 0.00%) 693.35 ( -0.66%)
Hmean recv-256 1328.02 ( 0.00%) 1318.77 ( -0.70%)
Hmean recv-1024 5051.83 ( 0.00%) 5051.11 ( -0.01%)
Hmean recv-2048 9636.95 ( 0.00%) 9601.30 ( -0.37%)
Hmean recv-3312 14355.32 ( 0.00%) 14414.48 ( 0.41%)
Hmean recv-4096 16464.74 ( 0.00%) 16301.16 ( -0.99%)
Hmean recv-8192 26721.63 ( 0.00%) 26428.17 ( -1.10%)
Hmean recv-16384 38136.00 ( 0.00%) 38044.88 ( -0.24%)
Stddev send-64 7.30 ( 0.00%) 4.75 ( 34.96%)
Stddev send-128 15.15 ( 0.00%) 22.38 ( -47.66%)
Stddev send-256 13.99 ( 0.00%) 19.14 ( -36.81%)
Stddev send-1024 105.73 ( 0.00%) 67.38 ( 36.27%)
Stddev send-2048 294.57 ( 0.00%) 223.88 ( 24.00%)
Stddev send-3312 302.28 ( 0.00%) 271.74 ( 10.10%)
Stddev send-4096 195.92 ( 0.00%) 121.10 ( 38.19%)
Stddev send-8192 399.71 ( 0.00%) 563.77 ( -41.04%)
Stddev send-16384 1163.47 ( 0.00%) 1103.68 ( 5.14%)
Stddev recv-64 7.30 ( 0.00%) 4.75 ( 34.96%)
Stddev recv-128 15.15 ( 0.00%) 22.38 ( -47.66%)
Stddev recv-256 13.99 ( 0.00%) 19.14 ( -36.81%)
Stddev recv-1024 105.73 ( 0.00%) 67.38 ( 36.27%)
Stddev recv-2048 294.59 ( 0.00%) 223.89 ( 24.00%)
Stddev recv-3312 302.24 ( 0.00%) 271.75 ( 10.09%)
Stddev recv-4096 196.03 ( 0.00%) 121.14 ( 38.20%)
Stddev recv-8192 399.86 ( 0.00%) 563.65 ( -40.96%)
Stddev recv-16384 1163.79 ( 0.00%) 1103.86 ( 5.15%)
The difference in overall performance is marginal but note that most
measurements are less variable. There were similar observations for other
netperf comparisons. hackbench with sockets or threads with processes or
threads showed minor difference with some reduction of migration. tbench
showed only marginal differences that were within the noise. dbench,
regardless of filesystem, showed minor differences all of which are
within noise. Multiple machines, both UMA and NUMA were tested without
any regressions showing up.
The biggest risk with a patch like this is affecting wakeup latencies.
However, the schbench load from Facebook which is very sensitive to wakeup
latency showed a mixed result with mostly improvements in wakeup latency:
4.15.0 4.15.0
16rc0 noequal-v1r23
Lat 50.00th-qrtle-1 38.00 ( 0.00%) 38.00 ( 0.00%)
Lat 75.00th-qrtle-1 49.00 ( 0.00%) 41.00 ( 16.33%)
Lat 90.00th-qrtle-1 52.00 ( 0.00%) 50.00 ( 3.85%)
Lat 95.00th-qrtle-1 54.00 ( 0.00%) 51.00 ( 5.56%)
Lat 99.00th-qrtle-1 63.00 ( 0.00%) 60.00 ( 4.76%)
Lat 99.50th-qrtle-1 66.00 ( 0.00%) 61.00 ( 7.58%)
Lat 99.90th-qrtle-1 78.00 ( 0.00%) 65.00 ( 16.67%)
Lat 50.00th-qrtle-2 38.00 ( 0.00%) 38.00 ( 0.00%)
Lat 75.00th-qrtle-2 42.00 ( 0.00%) 43.00 ( -2.38%)
Lat 90.00th-qrtle-2 46.00 ( 0.00%) 48.00 ( -4.35%)
Lat 95.00th-qrtle-2 49.00 ( 0.00%) 50.00 ( -2.04%)
Lat 99.00th-qrtle-2 55.00 ( 0.00%) 57.00 ( -3.64%)
Lat 99.50th-qrtle-2 58.00 ( 0.00%) 60.00 ( -3.45%)
Lat 99.90th-qrtle-2 65.00 ( 0.00%) 68.00 ( -4.62%)
Lat 50.00th-qrtle-4 41.00 ( 0.00%) 41.00 ( 0.00%)
Lat 75.00th-qrtle-4 45.00 ( 0.00%) 46.00 ( -2.22%)
Lat 90.00th-qrtle-4 50.00 ( 0.00%) 50.00 ( 0.00%)
Lat 95.00th-qrtle-4 54.00 ( 0.00%) 53.00 ( 1.85%)
Lat 99.00th-qrtle-4 61.00 ( 0.00%) 61.00 ( 0.00%)
Lat 99.50th-qrtle-4 65.00 ( 0.00%) 64.00 ( 1.54%)
Lat 99.90th-qrtle-4 76.00 ( 0.00%) 82.00 ( -7.89%)
Lat 50.00th-qrtle-8 48.00 ( 0.00%) 46.00 ( 4.17%)
Lat 75.00th-qrtle-8 55.00 ( 0.00%) 54.00 ( 1.82%)
Lat 90.00th-qrtle-8 60.00 ( 0.00%) 59.00 ( 1.67%)
Lat 95.00th-qrtle-8 63.00 ( 0.00%) 63.00 ( 0.00%)
Lat 99.00th-qrtle-8 71.00 ( 0.00%) 69.00 ( 2.82%)
Lat 99.50th-qrtle-8 74.00 ( 0.00%) 73.00 ( 1.35%)
Lat 99.90th-qrtle-8 98.00 ( 0.00%) 90.00 ( 8.16%)
Lat 50.00th-qrtle-16 56.00 ( 0.00%) 55.00 ( 1.79%)
Lat 75.00th-qrtle-16 68.00 ( 0.00%) 67.00 ( 1.47%)
Lat 90.00th-qrtle-16 77.00 ( 0.00%) 78.00 ( -1.30%)
Lat 95.00th-qrtle-16 82.00 ( 0.00%) 84.00 ( -2.44%)
Lat 99.00th-qrtle-16 90.00 ( 0.00%) 93.00 ( -3.33%)
Lat 99.50th-qrtle-16 93.00 ( 0.00%) 97.00 ( -4.30%)
Lat 99.90th-qrtle-16 110.00 ( 0.00%) 110.00 ( 0.00%)
Lat 50.00th-qrtle-32 68.00 ( 0.00%) 62.00 ( 8.82%)
Lat 75.00th-qrtle-32 90.00 ( 0.00%) 83.00 ( 7.78%)
Lat 90.00th-qrtle-32 110.00 ( 0.00%) 100.00 ( 9.09%)
Lat 95.00th-qrtle-32 122.00 ( 0.00%) 111.00 ( 9.02%)
Lat 99.00th-qrtle-32 145.00 ( 0.00%) 133.00 ( 8.28%)
Lat 99.50th-qrtle-32 154.00 ( 0.00%) 143.00 ( 7.14%)
Lat 99.90th-qrtle-32 2316.00 ( 0.00%) 515.00 ( 77.76%)
Lat 50.00th-qrtle-35 69.00 ( 0.00%) 72.00 ( -4.35%)
Lat 75.00th-qrtle-35 92.00 ( 0.00%) 95.00 ( -3.26%)
Lat 90.00th-qrtle-35 111.00 ( 0.00%) 114.00 ( -2.70%)
Lat 95.00th-qrtle-35 122.00 ( 0.00%) 124.00 ( -1.64%)
Lat 99.00th-qrtle-35 142.00 ( 0.00%) 144.00 ( -1.41%)
Lat 99.50th-qrtle-35 150.00 ( 0.00%) 154.00 ( -2.67%)
Lat 99.90th-qrtle-35 6104.00 ( 0.00%) 5640.00 ( 7.60%)
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Giovanni Gherdovich <ggherdovich@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180213133730.24064-4-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
On sync wakeups, the previous CPU effective load may not be used so delay
the calculation until it's needed.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Giovanni Gherdovich <ggherdovich@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180213133730.24064-3-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The only caller of wake_affine() knows the CPU ID. Pass it in instead of
rechecking it.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Giovanni Gherdovich <ggherdovich@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180213133730.24064-2-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The default values for nreader and nwriter are apparently not all that
user-friendly, resulting in people doing scalability tests that ran all
runs at large scale. This commit therefore makes both the nreaders and
nwriters module default to the number of CPUs, and adds a comment to
rcuperf.c stating that the number of CPUs should be specified using the
nr_cpus kernel boot parameter. This commit also eliminates the redundant
rcuperf scripting specification of default values for these parameters.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The rcu_torture_writer() function adapts to requested testing from module
parameters as well as the function pointers in the structure referenced
by cur_ops. However, as long as the module parameters do not conflict
with the function pointers, this adaptation is silent. This silence can
result in confusion as to exactly what was tested, which could in turn
result in untested RCU code making its way into mainline.
This commit therefore makes rcu_torture_writer() announce exactly which
portions of RCU's API it ends up testing.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
During boot, normal grace periods are processed as expedited. When
rcutorture is built into the kernel, it starts during boot and thus
detects that normal grace periods are unconditionally expedited.
Therefore, rcutorture concludes that there is no point in trying
to dynamically enable expediting, do it disables this aspect of testing,
which is a bit of an overreaction to the temporary boot-time expediting.
This commit therefore rechecks forced expediting throughout the test,
enabling dynamic expediting if normal grace periods are processed
normally at any point.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Currently the rcu_torture_fakewriter() function invokes cur_ops->sync()
and cur_ops->exp_sync() without first checking to see if they are in
fact non-NULL. This results in kernel NULL pointer dereferences when
testing RCU implementations that choose not to provide the full set of
primitives. Given that it is perfectly reasonable to have specialized
RCU implementations that provide only a subset of the RCU API, this is
a bug in rcutorture.
This commit therefore makes rcu_torture_fakewriter() check function
pointers before invoking them, thus allowing it to test subsetted
RCU implementations.
Reported-by: Lihao Liang <lianglihao@huawei.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit moves to __func__ for function names and for KBUILD_MODNAME
for module names, all in the name of better resilience to change.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit replaces array-allocation calls to kzalloc() with
equivalent calls to kcalloc().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The code in srcu_gp_end() inserts a delay every 0x3ff grace periods in
order to prevent SRCU grace-period work from consuming an entire CPU
when there is a long sequence of expedited SRCU grace-period requests.
However, all of SRCU's grace-period work is carried out in workqueues,
which are in turn within kthreads, which are automatically throttled as
needed by the scheduler. In particular, if there is plenty of idle time,
there is no point in throttling.
This commit therefore removes the expedited SRCU grace-period throttling.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Of course, compilers will optimize out a dead code. Anyway, remove
any dead code for better readibility.
Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Currently, given a multi-level srcu_node tree, SRCU can scan the full
set of srcu_data structures at each level when cleaning up after a grace
period. This, though harmless otherwise, represents pointless overhead.
This commit therefore eliminates this overhead by scanning the srcu_data
structures only when traversing the leaf srcu_node structures.
Signed-off-by: Ildar Ismagilov <devix84@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
SRCU checks each srcu_data structure's grace-period number for counter
wrap four times per cycle by default. This frequency guarantees that
normal comparisons will detect potential wrap. However, the expedited
grace-period number is not checked. The consquences are not too horrible
(a failure to expedite a grace period when requested), but it would be
good to avoid such things. This commit therefore adds this check to
the expedited grace-period number.
Signed-off-by: Ildar Ismagilov <devix84@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit moves to __func__ for function names in the name of better
resilience to change.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit reworks the first loop in sync_rcu_exp_select_cpus()
to avoid doing unnecssary stores to other CPUs' rcu_data
structures. This speeds up that first loop by roughly a factor of
two on an old x86 system. In the case where the system is mostly
idle, this loop incurs a large fraction of the overhead of the
synchronize_rcu_expedited(). There is less benefit on busy systems
because the overhead of the smp_call_function_single() in the second
loop dominates in that case.
However, it is not unusual to do configuration chances involving
RCU grace periods (both expedited and normal) while the system is
mostly idle, so this optimization is worth doing.
While we are in the area, this commit also adds parentheses to arguments
used by the for_each_leaf_node_possible_cpu() macro.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
If a CPU is transitioning to or from offline state, an expedited
grace period may undergo a timed wait. This timed wait can unduly
delay grace periods, so this commit adds a trace statement to make
it visible.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit adds more tracing of expedited grace periods to enable
improved debugging of slowdowns.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The srcu_funnel_exp_start() function checks to see if the srcu_struct
structure's expedited grace period counter needs updating to reflect a
newly arrived request for an expedited SRCU grace period. Unfortunately,
the check is backwards, so this commit reverses the sense of the test.
Signed-off-by: Ildar Ismagilov <devix84@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Commits c0b334c5bf and ea9b0c8a26 introduced new sparse warnings
by accessing rcu_node->lock directly and ignoring the __private
marker. Introduce a new wrapper and use it. Also fix a similar problem
in srcutree.c introduced by a3883df393.
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
RCU's nxttail has been optimized to be a rcu_segcblist, which is
a multi-tailed linked list with macros defined for the indexes for
each tail. The indexes have been defined in linux/rcu_segcblist.h,
so this commit removes the redundant definitions in kernel/rcu/tree.h.
Signed-off-by: Liu Changcheng <changcheng.liu@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The kernel/rcu/rcu.h file has a pair of consecutive #ifdefs on
CONFIG_TINY_RCU, so this commit consolidates them, thus saving a few
lines of code.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
It is not always obvious that the stack dump from a starved grace-period
kthread isn't instead that of a CPU stalling the current grace period.
This commit therefore adds a pr_err() flagging these dumps.
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The debugfs interface displayed statistics on RCU-pending checks but
this interface has since been removed. This commit therefore removes the
no-longer-used rcu_state structure's ->n_force_qs_lh and ->n_force_qs_ngp
fields along with their updates. (Though the ->n_force_qs_ngp field
was actually not used at all, embarrassingly enough.)
If this information proves necessary in the future, the corresponding
event traces will be added.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The debugfs interface displayed statistics on RCU-pending checks
but this interface has since been removed. This commit therefore
removes the no-longer-used rcu_data structure's ->n_rcu_pending,
->n_rp_core_needs_qs, ->n_rp_report_qs, ->n_rp_cb_ready,
->n_rp_cpu_needs_gp, ->n_rp_gp_completed, ->n_rp_gp_started,
->n_rp_nocb_defer_wakeup, and ->n_rp_need_nothing fields along with
their updates.
If this information proves necessary in the future, the corresponding
event traces will be added.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The debugfs interface displayed statistics on RCU callback invocation but
this interface has since been removed. This commit therefore removes the
no-longer-used rcu_data structure's ->n_cbs_invoked and ->n_nocbs_invoked
fields along with their updates.
If this information proves necessary in the future, the corresponding
event traces will be added.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The debugfs interface displayed statistics on RCU priority boosting,
but this interface has since been removed. This commit therefore
removes the no-longer-used rcu_data structure's ->n_tasks_boosted,
->n_exp_boosts, and ->n_exp_boosts and their updates.
If this information proves necessary in the future, the corresponding
event traces will be added.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
In CONFIG_RCU_NOCB_CPU=y kernels, if the boot parameters indicate that
none of the CPUs should in fact be offloaded, the following somewhat
obtuse message appears:
Offload RCU callbacks from CPUs: .
This commit therefore makes the message at least grammatically correct
in this case:
Offload RCU callbacks from CPUs: (none)
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Pull perf updates from Thomas Gleixner:
"Perf tool updates and kprobe fixes:
- perf_mmap overwrite mode fixes/overhaul, prep work to get 'perf
top' using it, making it bearable to use it in large core count
systems such as Knights Landing/Mill Intel systems (Kan Liang)
- s/390 now uses syscall.tbl, just like x86-64 to generate the
syscall table id -> string tables used by 'perf trace' (Hendrik
Brueckner)
- Use strtoull() instead of home grown function (Andy Shevchenko)
- Synchronize kernel ABI headers, v4.16-rc1 (Ingo Molnar)
- Document missing 'perf data --force' option (Sangwon Hong)
- Add perf vendor JSON metrics for ARM Cortex-A53 Processor (William
Cohen)
- Improve error handling and error propagation of ftrace based
kprobes so failures when installing kprobes are not silently
ignored and create disfunctional tracepoints"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits)
kprobes: Propagate error from disarm_kprobe_ftrace()
kprobes: Propagate error from arm_kprobe_ftrace()
Revert "tools include s390: Grab a copy of arch/s390/include/uapi/asm/unistd.h"
perf s390: Rework system call table creation by using syscall.tbl
perf s390: Grab a copy of arch/s390/kernel/syscall/syscall.tbl
tools/headers: Synchronize kernel ABI headers, v4.16-rc1
perf test: Fix test trace+probe_libc_inet_pton.sh for s390x
perf data: Document missing --force option
perf tools: Substitute yet another strtoull()
perf top: Check the latency of perf_top__mmap_read()
perf top: Switch default mode to overwrite mode
perf top: Remove lost events checking
perf hists browser: Add parameter to disable lost event warning
perf top: Add overwrite fall back
perf evsel: Expose the perf_missing_features struct
perf top: Check per-event overwrite term
perf mmap: Discard legacy interface for mmap read
perf test: Update mmap read functions for backward-ring-buffer test
perf mmap: Introduce perf_mmap__read_event()
perf mmap: Introduce perf_mmap__read_done()
...
Pull irq updates from Thomas Gleixner:
"A small set of updates mostly for irq chip drivers:
- MIPS GIC fix for spurious, masked interrupts
- fix for a subtle IPI bug in GICv3
- do not probe GICv3 ITSs that are marked as disabled
- multi-MSI support for GICv2m
- various small cleanups"
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqdomain: Re-use DEFINE_SHOW_ATTRIBUTE() macro
irqchip/bcm: Remove hashed address printing
irqchip/gic-v2m: Add PCI Multi-MSI support
irqchip/gic-v3: Ignore disabled ITS nodes
irqchip/gic-v3: Use wmb() instead of smb_wmb() in gic_raise_softirq()
irqchip/gic-v3: Change pr_debug message to pr_devel
irqchip/mips-gic: Avoid spuriously handling masked interrupts
Introduce a helper to retrieve the current task's work struct if it is
a workqueue worker.
This allows us to fix a long-standing deadlock in several DRM drivers
wherein the ->runtime_suspend callback waits for a specific worker to
finish and that worker in turn calls a function which waits for runtime
suspend to finish. That function is invoked from multiple call sites
and waiting for runtime suspend to finish is the correct thing to do
except if it's executing in the context of the worker.
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Dave Airlie <airlied@redhat.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lyude Paul <lyude@redhat.com>
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Link: https://patchwork.freedesktop.org/patch/msgid/2d8f603074131eb87e588d2b803a71765bd3a2fd.1518338788.git.lukas@wunner.de
In case of threaded interrupts the thread follows the affinity setting of
the hard interrupt. The related function uses the affinity mask which was
set by either from user space or via one of the kernel mechanisms. This
mask can be wider than the resulting effective affinity of the hard
interrupt. As a consequence the thread might become affine to a completely
different CPU.
Use the effective interrupt affinity if the architecture supports it, so
the hard interrupt and the thread stay on the same CPU.
Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
...instead of open coding file operations followed by custom ->open()
callbacks per each attribute.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Improve error handling when disarming ftrace-based kprobes. Like with
arm_kprobe_ftrace(), propagate any errors from disarm_kprobe_ftrace() so
that we do not disable/unregister kprobes that are still armed. In other
words, unregister_kprobe() and disable_kprobe() should not report success
if the kprobe could not be disarmed.
disarm_all_kprobes() keeps its current behavior and attempts to
disarm all kprobes. It returns the last encountered error and gives a
warning if not all probes could be disarmed.
This patch is based on Petr Mladek's original patchset (patches 2 and 3)
back in 2015, which improved kprobes error handling, found here:
https://lkml.org/lkml/2015/2/26/452
However, further work on this had been paused since then and the patches
were not upstreamed.
Based-on-patches-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: David S . Miller <davem@davemloft.net>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Joe Lawrence <joe.lawrence@redhat.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Miroslav Benes <mbenes@suse.cz>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: live-patching@vger.kernel.org
Link: http://lkml.kernel.org/r/20180109235124.30886-3-jeyu@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Improve error handling when arming ftrace-based kprobes. Specifically, if
we fail to arm a ftrace-based kprobe, register_kprobe()/enable_kprobe()
should report an error instead of success. Previously, this has lead to
confusing situations where register_kprobe() would return 0 indicating
success, but the kprobe would not be functional if ftrace registration
during the kprobe arming process had failed. We should therefore take any
errors returned by ftrace into account and propagate this error so that we
do not register/enable kprobes that cannot be armed. This can happen if,
for example, register_ftrace_function() finds an IPMODIFY conflict (since
kprobe_ftrace_ops has this flag set) and returns an error. Such a conflict
is possible since livepatches also set the IPMODIFY flag for their ftrace_ops.
arm_all_kprobes() keeps its current behavior and attempts to arm all
kprobes. It returns the last encountered error and gives a warning if
not all probes could be armed.
This patch is based on Petr Mladek's original patchset (patches 2 and 3)
back in 2015, which improved kprobes error handling, found here:
https://lkml.org/lkml/2015/2/26/452
However, further work on this had been paused since then and the patches
were not upstreamed.
Based-on-patches-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: David S . Miller <davem@davemloft.net>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Joe Lawrence <joe.lawrence@redhat.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Miroslav Benes <mbenes@suse.cz>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: live-patching@vger.kernel.org
Link: http://lkml.kernel.org/r/20180109235124.30886-2-jeyu@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
syzkaller recently triggered OOM during percpu map allocation;
while there is work in progress by Dennis Zhou to add __GFP_NORETRY
semantics for percpu allocator under pressure, there seems also a
missing bpf_map_precharge_memlock() check in array map allocation.
Given today the actual bpf_map_charge_memlock() happens after the
find_and_alloc_map() in syscall path, the bpf_map_precharge_memlock()
is there to bail out early before we go and do the map setup work
when we find that we hit the limits anyway. Therefore add this for
array map as well.
Fixes: 6c90598174 ("bpf: pre-allocate hash map elements")
Fixes: a10423b87a ("bpf: introduce BPF_MAP_TYPE_PERCPU_ARRAY map")
Reported-by: syzbot+adb03f3f0bb57ce3acda@syzkaller.appspotmail.com
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Dennis Zhou <dennisszhou@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Commit 6f1982fedd ("sched/isolation: Handle the nohz_full= parameter")
broke CONFIG_NO_HZ_FULL_ALL=y kernels. This breakage is due to the code
under CONFIG_NO_HZ_FULL_ALL failing to invoke the shiny new housekeeping
functions. This means that rcutorture scenario TREE04 now emits RCU CPU
stall warnings due to the RCU grace-period kthreads not being awakened
at a time of their choosing, or perhaps even not at all:
[ 27.731422] rcu_bh kthread starved for 21001 jiffies! g18446744073709551369 c18446744073709551368 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x402 ->cpu=3
[ 27.731423] rcu_bh I14936 9 2 0x80080000
[ 27.731435] Call Trace:
[ 27.731440] __schedule+0x31a/0x6d0
[ 27.731442] schedule+0x31/0x80
[ 27.731446] schedule_timeout+0x15a/0x320
[ 27.731453] ? call_timer_fn+0x130/0x130
[ 27.731457] rcu_gp_kthread+0x66c/0xea0
[ 27.731458] ? rcu_gp_kthread+0x66c/0xea0
Because no one has complained about CONFIG_NO_HZ_FULL_ALL=y being broken,
I hypothesize that no one is in fact using it, other than rcutorture.
This commit therefore eliminates CONFIG_NO_HZ_FULL_ALL and updates
rcutorture's config files to instead use the nohz_full= kernel parameter
to put the desired CPUs into nohz_full mode.
Fixes: 6f1982fedd ("sched/isolation: Handle the nohz_full= parameter")
Reported-by: kernel test robot <xiaolong.ye@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Wanpeng Li <kernellwp@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Since rcu_boot_init_percpu_data() is only called at boot time,
there is no data race and spinlock is not needed.
Signed-off-by: Lihao Liang <lianglihao@huawei.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
If audit is disabled, who cares if there is a bug indicating syscall in
process or names already recorded. Bail immediately on audit disabled.
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
The audit entry filter has been long deprecated with userspace support
finally removed in audit-v2.6.7 and plans to remove kernel support have
existed since kernel-v2.6.31.
Remove it.
Since removing the audit entry filter, test for early return before
setting up any context state.
Passes audit-testsuite.
See: https://github.com/linux-audit/audit-kernel/issues/6
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Pull locking fixes from Ingo Molnar:
"This contains two qspinlock fixes and three documentation and comment
fixes"
* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking/semaphore: Update the file path in documentation
locking/atomic/bitops: Document and clarify ordering semantics for failed test_and_{}_bit()
locking/qspinlock: Ensure node->count is updated before initialising node
locking/qspinlock: Ensure node is initialised before updating prev->next
Documentation/locking/mutex-design: Update to reflect latest changes
This array appears to be completely unused, remove it.
Signed-off-by: Joe Stringer <joe@wand.net.nz>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
A bug was introduced in 8fae477056
("audit: add support for session ID user filter")
See: https://github.com/linux-audit/audit-kernel/issues/4
When setting a session ID filter, the session ID filter field overwrote
the quick pointer reference to the arch field, potentially causing the
arch field to be misinterpreted.
Passes audit-testsuite.
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Since the Linux Audit project has transitioned completely over to
github, update the MAINTAINERS file and the primary audit source file to
reflect that reality.
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
syzkaller tried to perform a prog query in perf_event_query_prog_array()
where struct perf_event_query_bpf had an ids_len of 1,073,741,353 and
thus causing a warning due to failed kcalloc() allocation out of the
bpf_prog_array_copy_to_user() helper. Given we cannot attach more than
64 programs to a perf event, there's no point in allowing huge ids_len.
Therefore, allow a buffer that would fix the maximum number of ids and
also add a __GFP_NOWARN to the temporary ids buffer.
Fixes: f371b304f1 ("bpf/tracing: allow user space to query prog array on the same tp")
Fixes: 0911287ce3 ("bpf: fix bpf_prog_array_copy_to_user() issues")
Reported-by: syzbot+cab5816b0edbabf598b3@syzkaller.appspotmail.com
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
There're several implications after commit 0bf7800f17 ("ptr_ring:
try vmalloc() when kmalloc() fails") with the using of vmalloc() since
can't allow GFP_ATOMIC but mandate GFP_KERNEL. This will lead a WARN
since cpumap try to call with GFP_ATOMIC. Fortunately, entry
allocation of cpumap can only be done through syscall path which means
GFP_ATOMIC is not necessary, so fixing this by replacing GFP_ATOMIC
with GFP_KERNEL.
Reported-by: syzbot+1a240cdb1f4cc88819df@syzkaller.appspotmail.com
Fixes: 0bf7800f17 ("ptr_ring: try vmalloc() when kmalloc() fails")
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: akpm@linux-foundation.org
Cc: dhowells@redhat.com
Cc: hannes@cmpxchg.org
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
In case user program provides silly parameters, we want
a map_alloc() handler to return an error, not a NULL pointer,
otherwise we crash later in find_and_alloc_map()
Fixes: 1aa12bdf1b ("bpf: sockmap, add sock close() hook to remove socks")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
There is a memory leak happening in lpm_trie map_free callback
function trie_free. The trie structure itself does not get freed.
Also, trie_free function did not do synchronize_rcu before freeing
various data structures. This is incorrect as some rcu_read_lock
region(s) for lookup, update, delete or get_next_key may not complete yet.
The fix is to add synchronize_rcu in the beginning of trie_free.
The useless spin_lock is removed from this function as well.
Fixes: b95a5c4db0 ("bpf: add a longest prefix match trie map implementation")
Reported-by: Mathieu Malaterre <malat@debian.org>
Reported-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Mathieu Malaterre <malat@debian.org>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This patch starts to convert pernet_subsys, registered
from postcore initcalls.
audit_net_init() creates netlink socket, while audit_net_exit()
destroys it. The rest of the pernet_list are not interested
in the socket, so we make audit_net_ops async.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When queuing on the qspinlock, the count field for the current CPU's head
node is incremented. This needn't be atomic because locking in e.g. IRQ
context is balanced and so an IRQ will return with node->count as it
found it.
However, the compiler could in theory reorder the initialisation of
node[idx] before the increment of the head node->count, causing an
IRQ to overwrite the initialised node and potentially corrupt the lock
state.
Avoid the potential for this harmful compiler reordering by placing a
barrier() between the increment of the head node->count and the subsequent
node initialisation.
Signed-off-by: Will Deacon <will.deacon@arm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1518528177-19169-3-git-send-email-will.deacon@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When a locker ends up queuing on the qspinlock locking slowpath, we
initialise the relevant mcs node and publish it indirectly by updating
the tail portion of the lock word using xchg_tail. If we find that there
was a pre-existing locker in the queue, we subsequently update their
->next field to point at our node so that we are notified when it's our
turn to take the lock.
This can be roughly illustrated as follows:
/* Initialise the fields in node and encode a pointer to node in tail */
tail = initialise_node(node);
/*
* Exchange tail into the lockword using an atomic read-modify-write
* operation with release semantics
*/
old = xchg_tail(lock, tail);
/* If there was a pre-existing waiter ... */
if (old & _Q_TAIL_MASK) {
prev = decode_tail(old);
smp_read_barrier_depends();
/* ... then update their ->next field to point to node.
WRITE_ONCE(prev->next, node);
}
The conditional update of prev->next therefore relies on the address
dependency from the result of xchg_tail ensuring order against the
prior initialisation of node. However, since the release semantics of
the xchg_tail operation apply only to the write portion of the RmW,
then this ordering is not guaranteed and it is possible for the CPU
to return old before the writes to node have been published, consequently
allowing us to point prev->next to an uninitialised node.
This patch fixes the problem by making the update of prev->next a RELEASE
operation, which also removes the reliance on dependency ordering.
Signed-off-by: Will Deacon <will.deacon@arm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1518528177-19169-2-git-send-email-will.deacon@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Since schedutil kernel thread directly set priority to 0, the macro
SUGOV_KTHREAD_PRIORITY is not used. So remove it.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vikram Mulukutla <markivx@codeaurora.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: http://lkml.kernel.org/r/1518097702-9665-1-git-send-email-leo.yan@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Mark noticed that he had sporadic "spinlock recursion" warnings from
the DEBUG_SPINLOCK code. Now rq->lock is special in that the owner
changes in the middle of a context switch.
It so happens that we fix up the lock.owner too late, @prev can run
(remotely) the moment prev->on_cpu is cleared, this then allows @prev
to again try and acquire this rq->lock and trigger this warning.
So we have to switch lock.owner before clearing prev->on_cpu.
Do this by moving the DEBUG_SPINLOCK annotation from after switch_to()
to before switch_to() and collect all lockdep annotations there into
prepare_lock_switch() to mirror the existing finish_lock_switch().
Debugged-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
rq->clock_task may be updated between the two calls of
rq_clock_task() in update_curr_rt(). Calling rq_clock_task() only
once makes it more accurate and efficient, taking update_curr() as
reference.
Signed-off-by: Wen Yang <wen.yang99@zte.com.cn>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Jiang Biao <jiang.biao2@zte.com.cn>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: zhong.weidong@zte.com.cn
Link: http://lkml.kernel.org/r/1517882008-44552-1-git-send-email-wen.yang99@zte.com.cn
Signed-off-by: Ingo Molnar <mingo@kernel.org>
rq->clock_task may be updated between the two calls of
rq_clock_task() in update_curr_dl(). Calling rq_clock_task() only
once makes it more accurate and efficient, taking update_curr() as
reference.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Wen Yang <wen.yang99@zte.com.cn>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Jiang Biao <jiang.biao2@zte.com.cn>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: zhong.weidong@zte.com.cn
Link: http://lkml.kernel.org/r/1517882148-44599-1-git-send-email-wen.yang99@zte.com.cn
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Remove a useless space in # ifdef and align it with others.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1518512382-29426-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
While adding cgroup2 interface for the cpu controller, 0d5936344f
("sched: Implement interface for cgroup unified hierarchy") forgot to
update input validation and left it to reject cpu.max config if any
descendant has set a higher value.
cgroup2 officially supports delegation and a descendant must not be
able to restrict what its ancestors can configure. For absolute
limits such as cpu.max and memory.max, this means that the config at
each level should only act as the upper limit at that level and
shouldn't interfere with what other cgroups can configure.
This patch updates config validation on cgroup2 so that the cpu
controller follows the same convention.
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 0d5936344f ("sched: Implement interface for cgroup unified hierarchy")
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org # v4.15+
This is the mindless scripted replacement of kernel use of POLL*
variables as described by Al, done by this script:
for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done
done
with de-mangling cleanups yet to come.
NOTE! On almost all architectures, the EPOLL* constants have the same
values as the POLL* constants do. But they keyword here is "almost".
For various bad reasons they aren't the same, and epoll() doesn't
actually work quite correctly in some cases due to this on Sparc et al.
The next patch from Al will sort out the final differences, and we
should be all done.
Scripted-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ARM:
- Include icache invalidation optimizations, improving VM startup time
- Support for forwarded level-triggered interrupts, improving
performance for timers and passthrough platform devices
- A small fix for power-management notifiers, and some cosmetic changes
PPC:
- Add MMIO emulation for vector loads and stores
- Allow HPT guests to run on a radix host on POWER9 v2.2 CPUs without
requiring the complex thread synchronization of older CPU versions
- Improve the handling of escalation interrupts with the XIVE interrupt
controller
- Support decrement register migration
- Various cleanups and bugfixes.
s390:
- Cornelia Huck passed maintainership to Janosch Frank
- Exitless interrupts for emulated devices
- Cleanup of cpuflag handling
- kvm_stat counter improvements
- VSIE improvements
- mm cleanup
x86:
- Hypervisor part of SEV
- UMIP, RDPID, and MSR_SMI_COUNT emulation
- Paravirtualized TLB shootdown using the new KVM_VCPU_PREEMPTED bit
- Allow guests to see TOPOEXT, GFNI, VAES, VPCLMULQDQ, and more AVX512
features
- Show vcpu id in its anonymous inode name
- Many fixes and cleanups
- Per-VCPU MSR bitmaps (already merged through x86/pti branch)
- Stable KVM clock when nesting on Hyper-V (merged through x86/hyperv)
-----BEGIN PGP SIGNATURE-----
iQEcBAABCAAGBQJafvMtAAoJEED/6hsPKofo6YcH/Rzf2RmshrWaC3q82yfIV0Qz
Z8N8yJHSaSdc3Jo6cmiVj0zelwAxdQcyjwlT7vxt5SL2yML+/Q0st9Hc3EgGGXPm
Il99eJEl+2MYpZgYZqV8ff3mHS5s5Jms+7BITAeh6Rgt+DyNbykEAvzt+MCHK9cP
xtsIZQlvRF7HIrpOlaRzOPp3sK2/MDZJ1RBE7wYItK3CUAmsHim/LVYKzZkRTij3
/9b4LP1yMMbziG+Yxt1o682EwJB5YIat6fmDG9uFeEVI5rWWN7WFubqs8gCjYy/p
FX+BjpOdgTRnX+1m9GIj0Jlc/HKMXryDfSZS07Zy4FbGEwSiI5SfKECub4mDhuE=
=C/uD
-----END PGP SIGNATURE-----
Merge tag 'kvm-4.16-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Radim Krčmář:
"ARM:
- icache invalidation optimizations, improving VM startup time
- support for forwarded level-triggered interrupts, improving
performance for timers and passthrough platform devices
- a small fix for power-management notifiers, and some cosmetic
changes
PPC:
- add MMIO emulation for vector loads and stores
- allow HPT guests to run on a radix host on POWER9 v2.2 CPUs without
requiring the complex thread synchronization of older CPU versions
- improve the handling of escalation interrupts with the XIVE
interrupt controller
- support decrement register migration
- various cleanups and bugfixes.
s390:
- Cornelia Huck passed maintainership to Janosch Frank
- exitless interrupts for emulated devices
- cleanup of cpuflag handling
- kvm_stat counter improvements
- VSIE improvements
- mm cleanup
x86:
- hypervisor part of SEV
- UMIP, RDPID, and MSR_SMI_COUNT emulation
- paravirtualized TLB shootdown using the new KVM_VCPU_PREEMPTED bit
- allow guests to see TOPOEXT, GFNI, VAES, VPCLMULQDQ, and more
AVX512 features
- show vcpu id in its anonymous inode name
- many fixes and cleanups
- per-VCPU MSR bitmaps (already merged through x86/pti branch)
- stable KVM clock when nesting on Hyper-V (merged through
x86/hyperv)"
* tag 'kvm-4.16-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (197 commits)
KVM: PPC: Book3S: Add MMIO emulation for VMX instructions
KVM: PPC: Book3S HV: Branch inside feature section
KVM: PPC: Book3S HV: Make HPT resizing work on POWER9
KVM: PPC: Book3S HV: Fix handling of secondary HPTEG in HPT resizing code
KVM: PPC: Book3S PR: Fix broken select due to misspelling
KVM: x86: don't forget vcpu_put() in kvm_arch_vcpu_ioctl_set_sregs()
KVM: PPC: Book3S PR: Fix svcpu copying with preemption enabled
KVM: PPC: Book3S HV: Drop locks before reading guest memory
kvm: x86: remove efer_reload entry in kvm_vcpu_stat
KVM: x86: AMD Processor Topology Information
x86/kvm/vmx: do not use vm-exit instruction length for fast MMIO when running nested
kvm: embed vcpu id to dentry of vcpu anon inode
kvm: Map PFN-type memory regions as writable (if possible)
x86/kvm: Make it compile on 32bit and with HYPYERVISOR_GUEST=n
KVM: arm/arm64: Fixup userspace irqchip static key optimization
KVM: arm/arm64: Fix userspace_irqchip_in_use counting
KVM: arm/arm64: Fix incorrect timer_is_pending logic
MAINTAINERS: update KVM/s390 maintainers
MAINTAINERS: add Halil as additional vfio-ccw maintainer
MAINTAINERS: add David as a reviewer for KVM/s390
...
Pull networking fixes from David Miller:
1) Make allocations less aggressive in x_tables, from Minchal Hocko.
2) Fix netfilter flowtable Kconfig deps, from Pablo Neira Ayuso.
3) Fix connection loss problems in rtlwifi, from Larry Finger.
4) Correct DRAM dump length for some chips in ath10k driver, from Yu
Wang.
5) Fix ABORT handling in rxrpc, from David Howells.
6) Add SPDX tags to Sun networking drivers, from Shannon Nelson.
7) Some ipv6 onlink handling fixes, from David Ahern.
8) Netem packet scheduler interval calcualtion fix from Md. Islam.
9) Don't put crypto buffers on-stack in rxrpc, from David Howells.
10) Fix handling of error non-delivery status in netlink multicast
delivery over multiple namespaces, from Nicolas Dichtel.
11) Missing xdp flush in tuntap driver, from Jason Wang.
12) Synchonize RDS protocol netns/module teardown with rds object
management, from Sowini Varadhan.
13) Add nospec annotations to mpls, from Dan Williams.
14) Fix SKB truesize handling in TIPC, from Hoang Le.
15) Interrupt masking fixes in stammc from Niklas Cassel.
16) Don't allow ptr_ring objects to be sized outside of kmalloc's
limits, from Jason Wang.
17) Don't allow SCTP chunks to be built which will have a length
exceeding the chunk header's 16-bit length field, from Alexey
Kodanev.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (82 commits)
ibmvnic: Remove skb->protocol checks in ibmvnic_xmit
bpf: fix rlimit in reuseport net selftest
sctp: verify size of a new chunk in _sctp_make_chunk()
s390/qeth: fix SETIP command handling
s390/qeth: fix underestimated count of buffer elements
ptr_ring: try vmalloc() when kmalloc() fails
ptr_ring: fail early if queue occupies more than KMALLOC_MAX_SIZE
net: stmmac: remove redundant enable of PMT irq
net: stmmac: rename GMAC_INT_DEFAULT_MASK for dwmac4
net: stmmac: discard disabled flags in interrupt status register
ibmvnic: Reset long term map ID counter
tools/libbpf: handle issues with bpf ELF objects containing .eh_frames
selftests/bpf: add selftest that use test_libbpf_open
selftests/bpf: add test program for loading BPF ELF files
tools/libbpf: improve the pr_debug statements to contain section numbers
bpf: Sync kernel ABI header with tooling header for bpf_common.h
net: phy: fix phy_start to consider PHY_IGNORE_INTERRUPT
net: thunder: change q_len's type to handle max ring size
tipc: fix skb truesize/datasize ratio control
net/sched: cls_u32: fix cls_u32 on filter replace
...
as well as the removing of function probes.
This fixes the code with Al's suggestions. I also added a few selftests
to test the broken cases such that they wont happen again.
-----BEGIN PGP SIGNATURE-----
iQHIBAABCgAyFiEEPm6V/WuN2kyArTUe1a05Y9njSUkFAlp9/g8UHHJvc3RlZHRA
Z29vZG1pcy5vcmcACgkQ1a05Y9njSUkFaAwAl+BpQOOLY/vdivHZApyX75qc+Ysn
yzMtz/9FffYsZiWaPp84iKCHpRKXzHIkmcNlZNWmitoHE6DY53+4l/CrlLgEius8
FApkXnptFkhklfW1EhkXjt9pawcqFJhbYlGld93Bn/jvPOznpIhxDKcLqTJkD2M4
DUwJLv0p8vh2uUSMUPLrNGDrHb6lu/sO0zEcf1HyWPEMo/6r903aG99cKuCLOnjF
1jSwQ0DivTz+BjXrt+OVObzm1RFCvBP9SK0WRoPTzsmBx6EWEexCNiwyXxFJfRXn
GfYRMDllkJ7i5ATCppdDHZyIdoCAF+iFLh5UK/Kl/cG6w8c2FATm3HOBkgr12as6
LVSS475FfxjaMoT73fYszQdHevc9dI3aIh4GYjThH/mw2KSwBq+1GjRf4Jqa/1pt
crlDsQMQTXMx9pGN+DOZR7NDkj9xTyFlkXq8oJ9jJDLlB/7rymr751KiQohbvhVA
EklF4P8zQNtMMhqSPIPlQVo6rdQhawsibpku
=tVDz
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"Al Viro discovered some breakage with the parsing of the
set_ftrace_filter as well as the removing of function probes.
This fixes the code with Al's suggestions. I also added a few
selftests to test the broken cases such that they wont happen
again"
* tag 'trace-v4.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
selftests/ftrace: Add more tests for removing of function probes
selftests/ftrace: Add some missing glob checks
selftests/ftrace: Have reset_ftrace_filter handle multiple instances
selftests/ftrace: Have reset_ftrace_filter handle modules
tracing: Fix parsing of globs with a wildcard at the beginning
ftrace: Remove incorrect setting of glob search field
Daniel Borkmann says:
====================
pull-request: bpf 2018-02-09
The following pull-request contains BPF updates for your *net* tree.
The main changes are:
1) Two fixes for BPF sockmap in order to break up circular map references
from programs attached to sockmap, and detaching related sockets in
case of socket close() event. For the latter we get rid of the
smap_state_change() and plug into ULP infrastructure, which will later
also be used for additional features anyway such as TX hooks. For the
second issue, dependency chain is broken up via map release callback
to free parse/verdict programs, all from John.
2) Fix a libbpf relocation issue that was found while implementing XDP
support for Suricata project. Issue was that when clang was invoked
with default target instead of bpf target, then various other e.g.
debugging relevant sections are added to the ELF file that contained
relocation entries pointing to non-BPF related sections which libbpf
trips over instead of skipping them. Test cases for libbpf are added
as well, from Jesper.
3) Various misc fixes for bpftool and one for libbpf: a small addition
to libbpf to make sure it recognizes all standard section prefixes.
Then, the Makefile in bpftool/Documentation is improved to explicitly
check for rst2man being installed on the system as we otherwise risk
installing empty man pages; the man page for bpftool-map is corrected
and a set of missing bash completions added in order to avoid shipping
bpftool where the completions are only partially working, from Quentin.
4) Fix applying the relocation to immediate load instructions in the
nfp JIT which were missing a shift, from Jakub.
5) Two fixes for the BPF kernel selftests: handle CONFIG_BPF_JIT_ALWAYS_ON=y
gracefully in test_bpf.ko module and mark them as FLAG_EXPECTED_FAIL
in this case; and explicitly delete the veth devices in the two tests
test_xdp_{meta,redirect}.sh before dismantling the netnses as when
selftests are run in batch mode, then workqueue to handle destruction
might not have finished yet and thus veth creation in next test under
same dev name would fail, from Yonghong.
6) Fix test_kmod.sh to check the test_bpf.ko module path before performing
an insmod, and fallback to modprobe. Especially the latter is useful
when having a device under test that has the modules installed instead,
from Naresh.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Al Viro reported:
For substring - sure, but what about something like "*a*b" and "a*b"?
AFAICS, filter_parse_regex() ends up with identical results in both
cases - MATCH_GLOB and *search = "a*b". And no way for the caller
to tell one from another.
Testing this with the following:
# cd /sys/kernel/tracing
# echo '*raw*lock' > set_ftrace_filter
bash: echo: write error: Invalid argument
With this patch:
# echo '*raw*lock' > set_ftrace_filter
# cat set_ftrace_filter
_raw_read_trylock
_raw_write_trylock
_raw_read_unlock
_raw_spin_unlock
_raw_write_unlock
_raw_spin_trylock
_raw_spin_lock
_raw_write_lock
_raw_read_lock
Al recommended not setting the search buffer to skip the first '*' unless we
know we are not using MATCH_GLOB. This implements his suggested logic.
Link: http://lkml.kernel.org/r/20180127170748.GF13338@ZenIV.linux.org.uk
Cc: stable@vger.kernel.org
Fixes: 60f1d5e3ba ("ftrace: Support full glob matching")
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Suggsted-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
__unregister_ftrace_function_probe() will incorrectly parse the glob filter
because it resets the search variable that was setup by filter_parse_regex().
Al Viro reported this:
After that call of filter_parse_regex() we could have func_g.search not
equal to glob only if glob started with '!' or '*'. In the former case
we would've buggered off with -EINVAL (not = 1). In the latter we
would've set func_g.search equal to glob + 1, calculated the length of
that thing in func_g.len and proceeded to reset func_g.search back to
glob.
Suppose the glob is e.g. *foo*. We end up with
func_g.type = MATCH_MIDDLE_ONLY;
func_g.len = 3;
func_g.search = "*foo";
Feeding that to ftrace_match_record() will not do anything sane - we
will be looking for names containing "*foo" (->len is ignored for that
one).
Link: http://lkml.kernel.org/r/20180127031706.GE13338@ZenIV.linux.org.uk
Cc: stable@vger.kernel.org
Fixes: 3ba0092971 ("ftrace: Introduce ftrace_glob structure")
Reviewed-by: Dmitry Safonov <0x7f454c46@gmail.com>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Pull scheduler updates from Ingo Molnar:
- membarrier updates (Mathieu Desnoyers)
- SMP balancing optimizations (Mel Gorman)
- stats update optimizations (Peter Zijlstra)
- RT scheduler race fixes (Steven Rostedt)
- misc fixes and updates
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
sched/fair: Do not migrate if the prev_cpu is idle
sched/fair: Restructure wake_affine*() to return a CPU id
sched/fair: Remove unnecessary parameters from wake_affine_idle()
sched/rt: Make update_curr_rt() more accurate
sched/rt: Up the root domain ref count when passing it around via IPIs
sched/rt: Use container_of() to get root domain in rto_push_irq_work_func()
sched/core: Optimize update_stats_*()
sched/core: Optimize ttwu_stat()
membarrier/selftest: Test private expedited sync core command
membarrier/arm64: Provide core serializing command
membarrier/x86: Provide core serializing command
membarrier: Provide core serializing command, *_SYNC_CORE
lockin/x86: Implement sync_core_before_usermode()
locking: Introduce sync_core_before_usermode()
membarrier/selftest: Test global expedited command
membarrier: Provide GLOBAL_EXPEDITED command
membarrier: Document scheduler barrier requirements
powerpc, membarrier: Skip memory barrier in switch_mm()
membarrier/selftest: Test private expedited command
Pull networking fixes from David Miller:
1) Fix error path in netdevsim, from Jakub Kicinski.
2) Default values listed in tcp_wmem and tcp_rmem documentation were
inaccurate, from Tonghao Zhang.
3) Fix route leaks in SCTP, both for ipv4 and ipv6. From Alexey Kodanev
and Tommi Rantala.
4) Fix "MASK < Y" meant to be "MASK << Y" in xgbe driver, from Wolfram
Sang.
5) Use after free in u32_destroy_key(), from Paolo Abeni.
6) Fix two TX issues in be2net driver, from Suredh Reddy.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (25 commits)
be2net: Handle transmit completion errors in Lancer
be2net: Fix HW stall issue in Lancer
RDS: IB: Fix null pointer issue
nfp: fix kdoc warnings on nested structures
sample/bpf: fix erspan metadata
net: erspan: fix erspan config overwrite
net: erspan: fix metadata extraction
cls_u32: fix use after free in u32_destroy_key()
net: amd-xgbe: fix comparison to bitshift when dealing with a mask
net: phy: Handle not having GPIO enabled in the kernel
ibmvnic: fix empty firmware version and errors cleanup
sctp: fix dst refcnt leak in sctp_v4_get_dst
sctp: fix dst refcnt leak in sctp_v6_get_dst()
dwc-xlgmac: remove Jie Deng as co-maintainer
doc: Change the min default value of tcp_wmem/tcp_rmem.
samples/bpf: use bpf_set_link_xdp_fd
libbpf: add missing SPDX-License-Identifier
libbpf: add error reporting in XDP
libbpf: add function to setup XDP
tools: add netlink.h and if_link.h in tools uapi
...
A pipe's size is represented as an 'unsigned int'. As expected, writing a
value greater than UINT_MAX to /proc/sys/fs/pipe-max-size fails with
EINVAL. However, the F_SETPIPE_SZ fcntl silently truncates such values to
32 bits, rather than failing with EINVAL as expected. (It *does* fail
with EINVAL for values above (1 << 31) but <= UINT_MAX.)
Fix this by moving the check against UINT_MAX into round_pipe_size() which
is called in both cases.
Link: http://lkml.kernel.org/r/20180111052902.14409-6-ebiggers3@gmail.com
Signed-off-by: Eric Biggers <ebiggers@google.com>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Joe Lawrence <joe.lawrence@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Luis R . Rodriguez" <mcgrof@kernel.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Willy Tarreau <w@1wt.eu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
pipe_proc_fn() is no longer needed, as it only calls through to
proc_dopipe_max_size(). Just put proc_dopipe_max_size() in the ctl_table
entry directly, and remove the unneeded EXPORT_SYMBOL() and the ENOSYS
stub for it.
(The reason the ENOSYS stub isn't needed is that the pipe-max-size
ctl_table entry is located directly in 'kern_table' rather than being
registered separately. Therefore, the entry is already only defined when
the kernel is built with sysctl support.)
Link: http://lkml.kernel.org/r/20180111052902.14409-3-ebiggers3@gmail.com
Signed-off-by: Eric Biggers <ebiggers@google.com>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Joe Lawrence <joe.lawrence@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Luis R . Rodriguez" <mcgrof@kernel.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Willy Tarreau <w@1wt.eu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "pipe: buffer limits fixes and cleanups", v2.
This series simplifies the sysctl handler for pipe-max-size and fixes
another set of bugs related to the pipe buffer limits:
- The root user wasn't allowed to exceed the limits when creating new
pipes.
- There was an off-by-one error when checking the limits, so a limit of
N was actually treated as N - 1.
- F_SETPIPE_SZ accepted values over UINT_MAX.
- Reading the pipe buffer limits could be racy.
This patch (of 7):
Before validating the given value against pipe_min_size,
do_proc_dopipe_max_size_conv() calls round_pipe_size(), which rounds the
value up to pipe_min_size. Therefore, the second check against
pipe_min_size is redundant. Remove it.
Link: http://lkml.kernel.org/r/20180111052902.14409-2-ebiggers3@gmail.com
Signed-off-by: Eric Biggers <ebiggers@google.com>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Joe Lawrence <joe.lawrence@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Luis R . Rodriguez" <mcgrof@kernel.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Willy Tarreau <w@1wt.eu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make iomem_is_exclusive return bool due to this particular function only
using either one or zero as its return value.
No functional change.
Link: http://lkml.kernel.org/r/1513266622-15860-5-git-send-email-baiyaowei@cmss.chinamobile.com
Signed-off-by: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make current_cpuset_is_being_rebound return bool due to this particular
function only using either one or zero as its return value.
No functional change.
Link: http://lkml.kernel.org/r/1513266622-15860-4-git-send-email-baiyaowei@cmss.chinamobile.com
Signed-off-by: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The file was converted from print_symbol() to %pf some time ago in
commit ef26f20cd1 ("genirq: Print threaded handler in spurious debug
output"). kallsyms does not seem to be needed anymore.
Link: http://lkml.kernel.org/r/20171208025616.16267-10-sergey.senozhatsky@gmail.com
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
hrtimer does not seem to use any of kallsyms functions/defines.
Link: http://lkml.kernel.org/r/20171208025616.16267-9-sergey.senozhatsky@gmail.com
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently KCOV_ENABLE does not check if the current task is already
associated with another kcov descriptor. As the result it is possible
to associate a single task with more than one kcov descriptor, which
later leads to a memory leak of the old descriptor. This relation is
really meant to be one-to-one (task has only one back link).
Extend validation to detect such misuse.
Link: http://lkml.kernel.org/r/20180122082520.15716-1-dvyukov@google.com
Fixes: 5c9a8750a6 ("kernel: add kcov code coverage")
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Reported-by: Shankara Pailoor <sp3485@columbia.edu>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This reverts commit ba62bafe94 ("kernel/relay.c: fix potential memory leak").
This commit introduced a double free bug, because 'chan' is already
freed by the line:
kref_put(&chan->kref, relay_destroy_channel);
This bug was found by syzkaller, using the BLKTRACESETUP ioctl.
Link: http://lkml.kernel.org/r/20180127004759.101823-1-ebiggers3@gmail.com
Fixes: ba62bafe94 ("kernel/relay.c: fix potential memory leak")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Zhouyi Zhou <yizhouzhou@ict.ac.cn>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: <stable@vger.kernel.org> [4.7+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are several functions that do find_task_by_vpid() followed by
get_task_struct(). We can use a helper function instead.
Link: http://lkml.kernel.org/r/1509602027-11337-1-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
All other places that deals with namespaces have an explanation of why
the restriction is there.
The description added in this commit was based on commit e66eded830
("userns: Don't allow CLONE_NEWUSER | CLONE_FS").
Link: http://lkml.kernel.org/r/20171112151637.13258-1-marcos.souza.org@gmail.com
Signed-off-by: Marcos Paulo de Souza <marcos.souza.org@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Thus reducing one indentation level while maintaining the same rationale.
Link: http://lkml.kernel.org/r/20171117002929.5155-1-marcos.souza.org@gmail.com
Signed-off-by: Marcos Paulo de Souza <marcos.souza.org@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This reverts commit 92266d6ef6 ("async: simplify lowest_in_progress()")
which was simply wrong: In the case where domain is NULL, we now use the
wrong offsetof() in the list_first_entry macro, so we don't actually
fetch the ->cookie value, but rather the eight bytes located
sizeof(struct list_head) further into the struct async_entry.
On 64 bit, that's the data member, while on 32 bit, that's a u64 built
from func and data in some order.
I think the bug happens to be harmless in practice: It obviously only
affects callers which pass a NULL domain, and AFAICT the only such
caller is
async_synchronize_full() ->
async_synchronize_full_domain(NULL) ->
async_synchronize_cookie_domain(ASYNC_COOKIE_MAX, NULL)
and the ASYNC_COOKIE_MAX means that in practice we end up waiting for
the async_global_pending list to be empty - but it would break if
somebody happened to pass (void*)-1 as the data element to
async_schedule, and of course also if somebody ever does a
async_synchronize_cookie_domain(, NULL) with a "finite" cookie value.
Maybe the "harmless in practice" means this isn't -stable material. But
I'm not completely confident my quick git grep'ing is enough, and there
might be affected code in one of the earlier kernels that has since been
removed, so I'll leave the decision to the stable guys.
Link: http://lkml.kernel.org/r/20171128104938.3921-1-linux@rasmusvillemoes.dk
Fixes: 92266d6ef6 "async: simplify lowest_in_progress()"
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Adam Wallis <awallis@codeaurora.org>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: <stable@vger.kernel.org> [3.10+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Nearly all modern compilers support a stack-protector option, and nearly
all modern distributions enable the kernel stack-protector, so enabling
this by default in kernel builds would make sense. However, Kconfig does
not have knowledge of available compiler features, so it isn't safe to
force on, as this would unconditionally break builds for the compilers or
architectures that don't have support. Instead, this introduces a new
option, CONFIG_CC_STACKPROTECTOR_AUTO, which attempts to discover the best
possible stack-protector available, and will allow builds to proceed even
if the compiler doesn't support any stack-protector.
This option is made the default so that kernels built with modern
compilers will be protected-by-default against stack buffer overflows,
avoiding things like the recent BlueBorne attack. Selection of a specific
stack-protector option remains available, including disabling it.
Additionally, tiny.config is adjusted to use CC_STACKPROTECTOR_NONE, since
that's the option with the least code size (and it used to be the default,
so we have to explicitly choose it there now).
Link: http://lkml.kernel.org/r/1510076320-69931-4-git-send-email-keescook@chromium.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Tested-by: Laura Abbott <labbott@redhat.com>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Require struct page by default for filesystem DAX to remove a number of
surprising failure cases. This includes failures with direct I/O, gdb and
fork(2).
* Add support for the new Platform Capabilities Structure added to the NFIT in
ACPI 6.2a. This new table tells us whether the platform supports flushing
of CPU and memory controller caches on unexpected power loss events.
* Revamp vmem_altmap and dev_pagemap handling to clean up code and better
support future future PCI P2P uses.
* Deprecate the ND_IOCTL_SMART_THRESHOLD command whose payload has become
out-of-sync with recent versions of the NVDIMM_FAMILY_INTEL spec, and
instead rely on the generic ND_CMD_CALL approach used by the two other IOCTL
families, NVDIMM_FAMILY_{HPE,MSFT}.
* Enhance nfit_test so we can test some of the new things added in version 1.6
of the DSM specification. This includes testing firmware download and
simulating the Last Shutdown State (LSS) status.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJaeOg0AAoJEJ/BjXdf9fLBAFoQAI/IgcgJ2h9lfEpgjBRTC44t
2p8dxwT1Ofw3Y1aR/tI8nYRXjRtAGuP4UIeRVnb1CL/N7PagJyoMGU+6hmzg+ptY
c7cEDvw6nZOhrFwXx/xn7R53sYG8zH+UE6+jTR/PP/G4mQJfFCg4iF9R72Y7z0n7
aurf82Kz137NPUy6dNr4V9bmPMJWAaOci9WOj5SKddR5ZSNbjoxylTwQRvre5y4r
7HQTScEkirABOdSf1JoXTSUXCH/RC9UFFXR03ScHstGb1HjCj3KdcicVc50Q++Ub
qsEudhE6i44PEW1Hh4Qkg6hjHMEa8qHP+ShBuRuVaUmlghYTQn66niJAYLZilwdz
EVjE7vR+toHA5g3YCalEmYVutUEhIDkh/xfpd7vM6ZorUGJy95a2elEJs2fHBffC
gEhnCip7FROPcK5RDNUM8hBgnG/q5wwWPQMKY+6rKDZQx3mXssCrKp2Vlx7kBwMG
rpblkEpYjPonbLEHxsSU8yTg9Uq55ciIWgnOToffcjZvjbihi8WUVlHcwHUMPf/o
DWElg+4qmG0Sdd4S2NeAGwTl1Ewrf2RrtUGMjHtH4OUFs1wo6ZmfrxFzzMfoZ1Od
ko/s65v4uwtTzECh2o+XQaNsReR5YETXxmA40N/Jpo7/7twABIoZ/ASvj/3ZBYj+
sie+u2rTod8/gQWSfHpJ
=MIMX
-----END PGP SIGNATURE-----
Merge tag 'libnvdimm-for-4.16' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm updates from Ross Zwisler:
- Require struct page by default for filesystem DAX to remove a number
of surprising failure cases. This includes failures with direct I/O,
gdb and fork(2).
- Add support for the new Platform Capabilities Structure added to the
NFIT in ACPI 6.2a. This new table tells us whether the platform
supports flushing of CPU and memory controller caches on unexpected
power loss events.
- Revamp vmem_altmap and dev_pagemap handling to clean up code and
better support future future PCI P2P uses.
- Deprecate the ND_IOCTL_SMART_THRESHOLD command whose payload has
become out-of-sync with recent versions of the NVDIMM_FAMILY_INTEL
spec, and instead rely on the generic ND_CMD_CALL approach used by
the two other IOCTL families, NVDIMM_FAMILY_{HPE,MSFT}.
- Enhance nfit_test so we can test some of the new things added in
version 1.6 of the DSM specification. This includes testing firmware
download and simulating the Last Shutdown State (LSS) status.
* tag 'libnvdimm-for-4.16' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (37 commits)
libnvdimm, namespace: remove redundant initialization of 'nd_mapping'
acpi, nfit: fix register dimm error handling
libnvdimm, namespace: make min namespace size 4K
tools/testing/nvdimm: force nfit_test to depend on instrumented modules
libnvdimm/nfit_test: adding support for unit testing enable LSS status
libnvdimm/nfit_test: add firmware download emulation
nfit-test: Add platform cap support from ACPI 6.2a to test
libnvdimm: expose platform persistence attribute for nd_region
acpi: nfit: add persistent memory control flag for nd_region
acpi: nfit: Add support for detect platform CPU cache flush on power loss
device-dax: Fix trailing semicolon
libnvdimm, btt: fix uninitialized err_lock
dax: require 'struct page' by default for filesystem dax
ext2: auto disable dax instead of failing mount
ext4: auto disable dax instead of failing mount
mm, dax: introduce pfn_t_special()
mm: Fix devm_memremap_pages() collision handling
mm: Fix memory size alignment in devm_memremap_pages_release()
memremap: merge find_dev_pagemap into get_dev_pagemap
memremap: change devm_memremap_pages interface to use struct dev_pagemap
...
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJad5lgAAoJEFmIoMA60/r8s2kQAI3PztawDpaCP9Z12pkbBHSt
Ho0xTyk9rCZi9kQJbNjc+a+QrlA3QmTHXIXerB3LSWoh7M+XhsECjem92eHpgLNS
JvYPhTfOrCr0vdiAmOz6hD0AqN/psrbfzgiJhSwomsGEFS77k7kERSJckRv81sxb
Aj5F/WjucAgLorwm4auveAJEQ7atE7/6pkXzoqYm4G6NLOb46jUcRGndrnvXZBlz
fws8fBM4BHyi7i25CYQl24tFq1CGax1rIPgLg+4KnH76bQk/N6Ju0sGVSzfh+hG8
SIerK9bJbzGRAuNKoxB3aO1dyzsK3x9WztE2mG98w5trOISPIR1FqnvC/225FWAU
d6eIXiC7wKnEx+DElNTzCjzfHc7SAJoupO32H7CoiTe5zPUlWlxJ1zLYkK1gt50q
m8PRBiYTglxyznzrO0drtcdjEzvbdZNRrsYnul4wi1vSHzjk6F6XLtzT10XWM1M1
1pXLB8384FTj0Hu4bq6Y3Aivkmz0Sf+eQM2NaOwe+Zj7/1VV0d3lvi4LUXkqzLCA
FoXPJSMxG2Qu+iflCeYRQBJjExaZH3eNLZ3dT6QpcJrjaFVedd9u5DeeFqNL27zV
bhr8TdqrR4p4rc8EBAGoCapw96IxLZROKB3gxbrZVOpfIZpzthwHbElHX6aqUgF4
w/EV1JWs36WXWaxFk8wd
=ttq9
-----END PGP SIGNATURE-----
Merge tag 'pci-v4.16-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
Pull PCI updates from Bjorn Helgaas:
- skip AER driver error recovery callbacks for correctable errors
reported via ACPI APEI, as we already do for errors reported via the
native path (Tyler Baicar)
- fix DPC shared interrupt handling (Alex Williamson)
- print full DPC interrupt number (Keith Busch)
- enable DPC only if AER is available (Keith Busch)
- simplify DPC code (Bjorn Helgaas)
- calculate ASPM L1 substate parameter instead of hardcoding it (Bjorn
Helgaas)
- enable Latency Tolerance Reporting for ASPM L1 substates (Bjorn
Helgaas)
- move ASPM internal interfaces out of public header (Bjorn Helgaas)
- allow hot-removal of VGA devices (Mika Westerberg)
- speed up unplug and shutdown by assuming Thunderbolt controllers
don't support Command Completed events (Lukas Wunner)
- add AtomicOps support for GPU and Infiniband drivers (Felix Kuehling,
Jay Cornwall)
- expose "ari_enabled" in sysfs to help NIC naming (Stuart Hayes)
- clean up PCI DMA interface usage (Christoph Hellwig)
- remove PCI pool API (replaced with DMA pool) (Romain Perier)
- deprecate pci_get_bus_and_slot(), which assumed PCI domain 0 (Sinan
Kaya)
- move DT PCI code from drivers/of/ to drivers/pci/ (Rob Herring)
- add PCI-specific wrappers for dev_info(), etc (Frederick Lawler)
- remove warnings on sysfs mmap failure (Bjorn Helgaas)
- quiet ROM validation messages (Alex Deucher)
- remove redundant memory alloc failure messages (Markus Elfring)
- fill in types for compile-time VGA and other I/O port resources
(Bjorn Helgaas)
- make "pci=pcie_scan_all" work for Root Ports as well as Downstream
Ports to help AmigaOne X1000 (Bjorn Helgaas)
- add SPDX tags to all PCI files (Bjorn Helgaas)
- quirk Marvell 9128 DMA aliases (Alex Williamson)
- quirk broken INTx disable on Ceton InfiniTV4 (Bjorn Helgaas)
- fix CONFIG_PCI=n build by adding dummy pci_irqd_intx_xlate() (Niklas
Cassel)
- use DMA API to get MSI address for DesignWare IP (Niklas Cassel)
- fix endpoint-mode DMA mask configuration (Kishon Vijay Abraham I)
- fix ARTPEC-6 incorrect IS_ERR() usage (Wei Yongjun)
- add support for ARTPEC-7 SoC (Niklas Cassel)
- add endpoint-mode support for ARTPEC (Niklas Cassel)
- add Cadence PCIe host and endpoint controller driver (Cyrille
Pitchen)
- handle multiple INTx status bits being set in dra7xx (Vignesh R)
- translate dra7xx hwirq range to fix INTD handling (Vignesh R)
- remove deprecated Exynos PHY initialization code (Jaehoon Chung)
- fix MSI erratum workaround for HiSilicon Hip06/Hip07 (Dongdong Liu)
- fix NULL pointer dereference in iProc BCMA driver (Ray Jui)
- fix Keystone interrupt-controller-node lookup (Johan Hovold)
- constify qcom driver structures (Julia Lawall)
- rework Tegra config space mapping to increase space available for
endpoints (Vidya Sagar)
- simplify Tegra driver by using bus->sysdata (Manikanta Maddireddy)
- remove PCI_REASSIGN_ALL_BUS usage on Tegra (Manikanta Maddireddy)
- add support for Global Fabric Manager Server (GFMS) event to
Microsemi Switchtec switch driver (Logan Gunthorpe)
- add IDs for Switchtec PSX 24xG3 and PSX 48xG3 (Kelvin Cao)
* tag 'pci-v4.16-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (140 commits)
PCI: cadence: Add EndPoint Controller driver for Cadence PCIe controller
dt-bindings: PCI: cadence: Add DT bindings for Cadence PCIe endpoint controller
PCI: endpoint: Fix EPF device name to support multi-function devices
PCI: endpoint: Add the function number as argument to EPC ops
PCI: cadence: Add host driver for Cadence PCIe controller
dt-bindings: PCI: cadence: Add DT bindings for Cadence PCIe host controller
PCI: Add vendor ID for Cadence
PCI: Add generic function to probe PCI host controllers
PCI: generic: fix missing call of pci_free_resource_list()
PCI: OF: Add generic function to parse and allocate PCI resources
PCI: Regroup all PCI related entries into drivers/pci/Makefile
PCI/DPC: Reformat DPC register definitions
PCI/DPC: Add and use DPC Status register field definitions
PCI/DPC: Squash dpc_rp_pio_get_info() into dpc_process_rp_pio_error()
PCI/DPC: Remove unnecessary RP PIO register structs
PCI/DPC: Push dpc->rp_pio_status assignment into dpc_rp_pio_get_info()
PCI/DPC: Squash dpc_rp_pio_print_error() into dpc_rp_pio_get_info()
PCI/DPC: Make RP PIO log size check more generic
PCI/DPC: Rename local "status" to "dpc_status"
PCI/DPC: Squash dpc_rp_pio_print_tlp_header() into dpc_rp_pio_print_error()
...
When a program is attached to a map we increment the program refcnt
to ensure that the program is not removed while it is potentially
being referenced from sockmap side. However, if this same program
also references the map (this is a reasonably common pattern in
my programs) then the verifier will also increment the maps refcnt
from the verifier. This is to ensure the map doesn't get garbage
collected while the program has a reference to it.
So we are left in a state where the map holds the refcnt on the
program stopping it from being removed and releasing the map refcnt.
And vice versa the program holds a refcnt on the map stopping it
from releasing the refcnt on the prog.
All this is fine as long as users detach the program while the
map fd is still around. But, if the user omits this detach command
we are left with a dangling map we can no longer release.
To resolve this when the map fd is released decrement the program
references and remove any reference from the map to the program.
This fixes the issue with possibly dangling map and creates a
user side API constraint. That is, the map fd must be held open
for programs to be attached to a map.
Fixes: 174a79ff95 ("bpf: sockmap with sk redirect support")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
The selftests test_maps program was leaving dangling BPF sockmap
programs around because not all psock elements were removed from
the map. The elements in turn hold a reference on the BPF program
they are attached to causing BPF programs to stay open even after
test_maps has completed.
The original intent was that sk_state_change() would be called
when TCP socks went through TCP_CLOSE state. However, because
socks may be in SOCK_DEAD state or the sock may be a listening
socket the event is not always triggered.
To resolve this use the ULP infrastructure and register our own
proto close() handler. This fixes the above case.
Fixes: 174a79ff95 ("bpf: sockmap with sk redirect support")
Reported-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This patch adds perf_uprobe support with similar pattern as previous
patch (for kprobe).
Two functions, create_local_trace_uprobe() and
destroy_local_trace_uprobe(), are created so a uprobe can be created
and attached to the file descriptor created by perf_event_open().
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Yonghong Song <yhs@fb.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Cc: <daniel@iogearbox.net>
Cc: <davem@davemloft.net>
Cc: <kernel-team@fb.com>
Cc: <rostedt@goodmis.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20171206224518.3598254-7-songliubraving@fb.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
A new PMU type, perf_kprobe is added. Based on attr from perf_event_open(),
perf_kprobe creates a kprobe (or kretprobe) for the perf_event. This
kprobe is private to this perf_event, and thus not added to global
lists, and not available in tracefs.
Two functions, create_local_trace_kprobe() and
destroy_local_trace_kprobe() are added to created and destroy these
local trace_kprobe.
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Yonghong Song <yhs@fb.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Cc: <daniel@iogearbox.net>
Cc: <davem@davemloft.net>
Cc: <kernel-team@fb.com>
Cc: <rostedt@goodmis.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20171206224518.3598254-6-songliubraving@fb.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The select_idle_sibling() (SIS) rewrite in commit:
10e2f1acd0 ("sched/core: Rewrite and improve select_idle_siblings()")
... replaced a domain iteration with a search that broadly speaking
does a wrapped walk of the scheduler domain sharing a last-level-cache.
While this had a number of improvements, one consequence is that two tasks
that share a waker/wakee relationship push each other around a socket. Even
though two tasks may be active, all cores are evenly used. This is great from
a search perspective and spreads a load across individual cores, but it has
adverse consequences for cpufreq. As each CPU has relatively low utilisation,
cpufreq may decide the utilisation is too low to used a higher P-state and
overall computation throughput suffers.
While individual cpufreq and cpuidle drivers may compensate by artifically
boosting P-state (at c0) or avoiding lower C-states (during idle), it does
not help if hardware-based cpufreq (e.g. HWP) is used.
This patch tracks a recently used CPU based on what CPU a task was running
on when it last was a waker a CPU it was recently using when a task is a
wakee. During SIS, the recently used CPU is used as a target if it's still
allowed by the task and is idle.
The benefit may be non-obvious so consider an example of two tasks
communicating back and forth. Task A may be an application doing IO where
task B is a kworker or kthread like journald. Task A may issue IO, wake
B and B wakes up A on completion. With the existing scheme this may look
like the following (potentially different IDs if SMT is in use but similar
principal applies).
A (cpu 0) wake B (wakes on cpu 1)
B (cpu 1) wake A (wakes on cpu 2)
A (cpu 2) wake B (wakes on cpu 3)
etc.
A careful reader may wonder why CPU 0 was not idle when B wakes A the
first time and it's simply due to the fact that A can be rescheduled to
another CPU and the pattern is that prev == target when B tries to wakeup A
and the information about CPU 0 has been lost.
With this patch, the pattern is more likely to be:
A (cpu 0) wake B (wakes on cpu 1)
B (cpu 1) wake A (wakes on cpu 0)
A (cpu 0) wake B (wakes on cpu 1)
etc
i.e. two communicating casts are more likely to use just two cores instead
of all available cores sharing a LLC.
The most dramatic speedup was noticed on dbench using the XFS filesystem on
UMA as clients interact heavily with workqueues in that configuration. Note
that a similar speedup is not observed on ext4 as the wakeup pattern
is different:
4.15.0-rc9 4.15.0-rc9
waprev-v1 biasancestor-v1
Hmean 1 287.54 ( 0.00%) 817.01 ( 184.14%)
Hmean 2 1268.12 ( 0.00%) 1781.24 ( 40.46%)
Hmean 4 1739.68 ( 0.00%) 1594.47 ( -8.35%)
Hmean 8 2464.12 ( 0.00%) 2479.56 ( 0.63%)
Hmean 64 1455.57 ( 0.00%) 1434.68 ( -1.44%)
The results can be less dramatic on NUMA where automatic balancing interferes
with the test. It's also known that network benchmarks running on localhost
also benefit quite a bit from this patch (roughly 10% on netperf RR for UDP
and TCP depending on the machine). Hackbench also seens small improvements
(6-11% depending on machine and thread count). The facebook schbench was also
tested but in most cases showed little or no different to wakeup latencies.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180130104555.4125-5-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
wake_affine_idle() prefers to move a task to the current CPU if the
wakeup is due to an interrupt. The expectation is that the interrupt
data is cache hot and relevant to the waking task as well as avoiding
a search. However, there is no way to determine if there was cache hot
data on the previous CPU that may exceed the interrupt data. Furthermore,
round-robin delivery of interrupts can migrate tasks around a socket where
each CPU is under-utilised. This can interact badly with cpufreq which
makes decisions based on per-cpu data. It has been observed on machines
with HWP that p-states are not boosted to their maximum levels even though
the workload is latency and throughput sensitive.
This patch uses the previous CPU for the task if it's idle and cache-affine
with the current CPU even if the current CPU is idle due to the wakup
being related to the interrupt. This reduces migrations at the cost of
the interrupt data not being cache hot when the task wakes.
A variety of workloads were tested on various machines and no adverse
impact was noticed that was outside noise. dbench on ext4 on UMA showed
roughly 10% reduction in the number of CPU migrations and it is a case
where interrupts are frequent for IO competions. In most cases, the
difference in performance is quite small but variability is often
reduced. For example, this is the result for pgbench running on a UMA
machine with different numbers of clients.
4.15.0-rc9 4.15.0-rc9
baseline waprev-v1
Hmean 1 22096.28 ( 0.00%) 22734.86 ( 2.89%)
Hmean 4 74633.42 ( 0.00%) 75496.77 ( 1.16%)
Hmean 7 115017.50 ( 0.00%) 113030.81 ( -1.73%)
Hmean 12 126209.63 ( 0.00%) 126613.40 ( 0.32%)
Hmean 16 131886.91 ( 0.00%) 130844.35 ( -0.79%)
Stddev 1 636.38 ( 0.00%) 417.11 ( 34.46%)
Stddev 4 614.64 ( 0.00%) 583.24 ( 5.11%)
Stddev 7 542.46 ( 0.00%) 435.45 ( 19.73%)
Stddev 12 173.93 ( 0.00%) 171.50 ( 1.40%)
Stddev 16 671.42 ( 0.00%) 680.30 ( -1.32%)
CoeffVar 1 2.88 ( 0.00%) 1.83 ( 36.26%)
Note that the different in performance is marginal but for low utilisation,
there is less variability.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180130104555.4125-4-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>