latency-hist.patch

This patch provides a recording mechanism to store data of potential
sources of system latencies. The recordings separately determine the
latency caused by a delayed timer expiration, by a delayed wakeup of the
related user space program and by the sum of both. The histograms can be
enabled and reset individually. The data are accessible via the debug
filesystem. For details please consult Documentation/trace/histograms.txt.

Signed-off-by: Carsten Emde <C.Emde@osadl.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This commit is contained in:
Carsten Emde 2011-07-19 14:03:41 +01:00 committed by Alibek Omarov
parent 351766fb51
commit bae0de293a
10 changed files with 1616 additions and 0 deletions

View File

@ -0,0 +1,186 @@
Using the Linux Kernel Latency Histograms
This document gives a short explanation how to enable, configure and use
latency histograms. Latency histograms are primarily relevant in the
context of real-time enabled kernels (CONFIG_PREEMPT/CONFIG_PREEMPT_RT)
and are used in the quality management of the Linux real-time
capabilities.
* Purpose of latency histograms
A latency histogram continuously accumulates the frequencies of latency
data. There are two types of histograms
- potential sources of latencies
- effective latencies
* Potential sources of latencies
Potential sources of latencies are code segments where interrupts,
preemption or both are disabled (aka critical sections). To create
histograms of potential sources of latency, the kernel stores the time
stamp at the start of a critical section, determines the time elapsed
when the end of the section is reached, and increments the frequency
counter of that latency value - irrespective of whether any concurrently
running process is affected by latency or not.
- Configuration items (in the Kernel hacking/Tracers submenu)
CONFIG_INTERRUPT_OFF_LATENCY
CONFIG_PREEMPT_OFF_LATENCY
* Effective latencies
Effective latencies are actually occuring during wakeup of a process. To
determine effective latencies, the kernel stores the time stamp when a
process is scheduled to be woken up, and determines the duration of the
wakeup time shortly before control is passed over to this process. Note
that the apparent latency in user space may be somewhat longer, since the
process may be interrupted after control is passed over to it but before
the execution in user space takes place. Simply measuring the interval
between enqueuing and wakeup may also not appropriate in cases when a
process is scheduled as a result of a timer expiration. The timer may have
missed its deadline, e.g. due to disabled interrupts, but this latency
would not be registered. Therefore, the offsets of missed timers are
recorded in a separate histogram. If both wakeup latency and missed timer
offsets are configured and enabled, a third histogram may be enabled that
records the overall latency as a sum of the timer latency, if any, and the
wakeup latency. This histogram is called "timerandwakeup".
- Configuration items (in the Kernel hacking/Tracers submenu)
CONFIG_WAKEUP_LATENCY
CONFIG_MISSED_TIMER_OFSETS
* Usage
The interface to the administration of the latency histograms is located
in the debugfs file system. To mount it, either enter
mount -t sysfs nodev /sys
mount -t debugfs nodev /sys/kernel/debug
from shell command line level, or add
nodev /sys sysfs defaults 0 0
nodev /sys/kernel/debug debugfs defaults 0 0
to the file /etc/fstab. All latency histogram related files are then
available in the directory /sys/kernel/debug/tracing/latency_hist. A
particular histogram type is enabled by writing non-zero to the related
variable in the /sys/kernel/debug/tracing/latency_hist/enable directory.
Select "preemptirqsoff" for the histograms of potential sources of
latencies and "wakeup" for histograms of effective latencies etc. The
histogram data - one per CPU - are available in the files
/sys/kernel/debug/tracing/latency_hist/preemptoff/CPUx
/sys/kernel/debug/tracing/latency_hist/irqsoff/CPUx
/sys/kernel/debug/tracing/latency_hist/preemptirqsoff/CPUx
/sys/kernel/debug/tracing/latency_hist/wakeup/CPUx
/sys/kernel/debug/tracing/latency_hist/wakeup/sharedprio/CPUx
/sys/kernel/debug/tracing/latency_hist/missed_timer_offsets/CPUx
/sys/kernel/debug/tracing/latency_hist/timerandwakeup/CPUx
The histograms are reset by writing non-zero to the file "reset" in a
particular latency directory. To reset all latency data, use
#!/bin/sh
TRACINGDIR=/sys/kernel/debug/tracing
HISTDIR=$TRACINGDIR/latency_hist
if test -d $HISTDIR
then
cd $HISTDIR
for i in `find . | grep /reset$`
do
echo 1 >$i
done
fi
* Data format
Latency data are stored with a resolution of one microsecond. The
maximum latency is 10,240 microseconds. The data are only valid, if the
overflow register is empty. Every output line contains the latency in
microseconds in the first row and the number of samples in the second
row. To display only lines with a positive latency count, use, for
example,
grep -v " 0$" /sys/kernel/debug/tracing/latency_hist/preemptoff/CPU0
#Minimum latency: 0 microseconds.
#Average latency: 0 microseconds.
#Maximum latency: 25 microseconds.
#Total samples: 3104770694
#There are 0 samples greater or equal than 10240 microseconds
#usecs samples
0 2984486876
1 49843506
2 58219047
3 5348126
4 2187960
5 3388262
6 959289
7 208294
8 40420
9 4485
10 14918
11 18340
12 25052
13 19455
14 5602
15 969
16 47
17 18
18 14
19 1
20 3
21 2
22 5
23 2
25 1
* Wakeup latency of a selected process
To only collect wakeup latency data of a particular process, write the
PID of the requested process to
/sys/kernel/debug/tracing/latency_hist/wakeup/pid
PIDs are not considered, if this variable is set to 0.
* Details of the process with the highest wakeup latency so far
Selected data of the process that suffered from the highest wakeup
latency that occurred in a particular CPU are available in the file
/sys/kernel/debug/tracing/latency_hist/wakeup/max_latency-CPUx.
In addition, other relevant system data at the time when the
latency occurred are given.
The format of the data is (all in one line):
<PID> <Priority> <Latency> (<Timeroffset>) <Command> \
<- <PID> <Priority> <Command> <Timestamp>
The value of <Timeroffset> is only relevant in the combined timer
and wakeup latency recording. In the wakeup recording, it is
always 0, in the missed_timer_offsets recording, it is the same
as <Latency>.
When retrospectively searching for the origin of a latency and
tracing was not enabled, it may be helpful to know the name and
some basic data of the task that (finally) was switching to the
late real-tlme task. In addition to the victim's data, also the
data of the possible culprit are therefore displayed after the
"<-" symbol.
Finally, the timestamp of the time when the latency occurred
in <seconds>.<microseconds> after the most recent system boot
is provided.
These data are also reset when the wakeup histogram is reset.

View File

@ -111,6 +111,9 @@ struct hrtimer {
enum hrtimer_restart (*function)(struct hrtimer *);
struct hrtimer_clock_base *base;
unsigned long state;
#ifdef CONFIG_MISSED_TIMER_OFFSETS_HIST
ktime_t praecox;
#endif
#ifdef CONFIG_TIMER_STATS
int start_pid;
void *start_site;

View File

@ -1575,6 +1575,12 @@ struct task_struct {
unsigned long trace;
/* bitmask and counter of trace recursion */
unsigned long trace_recursion;
#ifdef CONFIG_WAKEUP_LATENCY_HIST
u64 preempt_timestamp_hist;
#ifdef CONFIG_MISSED_TIMER_OFFSETS_HIST
long timer_offset;
#endif
#endif
#endif /* CONFIG_TRACING */
#ifdef CONFIG_MEMCG /* memcg uses this to do batch job */
struct memcg_batch_info {

View File

@ -0,0 +1,72 @@
#undef TRACE_SYSTEM
#define TRACE_SYSTEM hist
#if !defined(_TRACE_HIST_H) || defined(TRACE_HEADER_MULTI_READ)
#define _TRACE_HIST_H
#include "latency_hist.h"
#include <linux/tracepoint.h>
#if !defined(CONFIG_PREEMPT_OFF_HIST) && !defined(CONFIG_INTERRUPT_OFF_HIST)
#define trace_preemptirqsoff_hist(a, b)
#else
TRACE_EVENT(preemptirqsoff_hist,
TP_PROTO(int reason, int starthist),
TP_ARGS(reason, starthist),
TP_STRUCT__entry(
__field(int, reason)
__field(int, starthist)
),
TP_fast_assign(
__entry->reason = reason;
__entry->starthist = starthist;
),
TP_printk("reason=%s starthist=%s", getaction(__entry->reason),
__entry->starthist ? "start" : "stop")
);
#endif
#ifndef CONFIG_MISSED_TIMER_OFFSETS_HIST
#define trace_hrtimer_interrupt(a, b, c, d)
#else
TRACE_EVENT(hrtimer_interrupt,
TP_PROTO(int cpu, long long offset, struct task_struct *curr,
struct task_struct *task),
TP_ARGS(cpu, offset, curr, task),
TP_STRUCT__entry(
__field(int, cpu)
__field(long long, offset)
__array(char, ccomm, TASK_COMM_LEN)
__field(int, cprio)
__array(char, tcomm, TASK_COMM_LEN)
__field(int, tprio)
),
TP_fast_assign(
__entry->cpu = cpu;
__entry->offset = offset;
memcpy(__entry->ccomm, curr->comm, TASK_COMM_LEN);
__entry->cprio = curr->prio;
memcpy(__entry->tcomm, task != NULL ? task->comm : "<none>",
task != NULL ? TASK_COMM_LEN : 7);
__entry->tprio = task != NULL ? task->prio : -1;
),
TP_printk("cpu=%d offset=%lld curr=%s[%d] thread=%s[%d]",
__entry->cpu, __entry->offset, __entry->ccomm,
__entry->cprio, __entry->tcomm, __entry->tprio)
);
#endif
#endif /* _TRACE_HIST_H */
/* This part must be outside protection */
#include <trace/define_trace.h>

View File

@ -0,0 +1,29 @@
#ifndef _LATENCY_HIST_H
#define _LATENCY_HIST_H
enum hist_action {
IRQS_ON,
PREEMPT_ON,
TRACE_STOP,
IRQS_OFF,
PREEMPT_OFF,
TRACE_START,
};
static char *actions[] = {
"IRQS_ON",
"PREEMPT_ON",
"TRACE_STOP",
"IRQS_OFF",
"PREEMPT_OFF",
"TRACE_START",
};
static inline char *getaction(int action)
{
if (action >= 0 && action <= sizeof(actions)/sizeof(actions[0]))
return actions[action];
return "unknown";
}
#endif /* _LATENCY_HIST_H */

View File

@ -53,6 +53,7 @@
#include <asm/uaccess.h>
#include <trace/events/timer.h>
#include <trace/events/hist.h>
/*
* The timer bases:
@ -1024,6 +1025,17 @@ int __hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim,
timer_stats_hrtimer_set_start_info(timer);
#ifdef CONFIG_MISSED_TIMER_OFFSETS_HIST
{
ktime_t now = new_base->get_time();
if (ktime_to_ns(tim) < ktime_to_ns(now))
timer->praecox = now;
else
timer->praecox = ktime_set(0, 0);
}
#endif
leftmost = enqueue_hrtimer(timer, new_base);
/*
@ -1298,6 +1310,8 @@ static void __run_hrtimer(struct hrtimer *timer, ktime_t *now)
#ifdef CONFIG_HIGH_RES_TIMERS
static enum hrtimer_restart hrtimer_wakeup(struct hrtimer *timer);
/*
* High resolution timer interrupt
* Called with interrupts disabled
@ -1341,6 +1355,15 @@ retry:
timer = container_of(node, struct hrtimer, node);
trace_hrtimer_interrupt(raw_smp_processor_id(),
ktime_to_ns(ktime_sub(ktime_to_ns(timer->praecox) ?
timer->praecox : hrtimer_get_expires(timer),
basenow)),
current,
timer->function == hrtimer_wakeup ?
container_of(timer, struct hrtimer_sleeper,
timer)->task : NULL);
/*
* The immediate goal for using the softexpires is
* minimizing wakeups, not running timers at the

View File

@ -192,6 +192,24 @@ config IRQSOFF_TRACER
enabled. This option and the preempt-off timing option can be
used together or separately.)
config INTERRUPT_OFF_HIST
bool "Interrupts-off Latency Histogram"
depends on IRQSOFF_TRACER
help
This option generates continuously updated histograms (one per cpu)
of the duration of time periods with interrupts disabled. The
histograms are disabled by default. To enable them, write a non-zero
number to
/sys/kernel/debug/tracing/latency_hist/enable/preemptirqsoff
If PREEMPT_OFF_HIST is also selected, additional histograms (one
per cpu) are generated that accumulate the duration of time periods
when both interrupts and preemption are disabled. The histogram data
will be located in the debug file system at
/sys/kernel/debug/tracing/latency_hist/irqsoff
config PREEMPT_TRACER
bool "Preemption-off Latency Tracer"
default n
@ -216,6 +234,24 @@ config PREEMPT_TRACER
enabled. This option and the irqs-off timing option can be
used together or separately.)
config PREEMPT_OFF_HIST
bool "Preemption-off Latency Histogram"
depends on PREEMPT_TRACER
help
This option generates continuously updated histograms (one per cpu)
of the duration of time periods with preemption disabled. The
histograms are disabled by default. To enable them, write a non-zero
number to
/sys/kernel/debug/tracing/latency_hist/enable/preemptirqsoff
If INTERRUPT_OFF_HIST is also selected, additional histograms (one
per cpu) are generated that accumulate the duration of time periods
when both interrupts and preemption are disabled. The histogram data
will be located in the debug file system at
/sys/kernel/debug/tracing/latency_hist/preemptoff
config SCHED_TRACER
bool "Scheduling Latency Tracer"
select GENERIC_TRACER
@ -226,6 +262,74 @@ config SCHED_TRACER
This tracer tracks the latency of the highest priority task
to be scheduled in, starting from the point it has woken up.
config WAKEUP_LATENCY_HIST
bool "Scheduling Latency Histogram"
depends on SCHED_TRACER
help
This option generates continuously updated histograms (one per cpu)
of the scheduling latency of the highest priority task.
The histograms are disabled by default. To enable them, write a
non-zero number to
/sys/kernel/debug/tracing/latency_hist/enable/wakeup
Two different algorithms are used, one to determine the latency of
processes that exclusively use the highest priority of the system and
another one to determine the latency of processes that share the
highest system priority with other processes. The former is used to
improve hardware and system software, the latter to optimize the
priority design of a given system. The histogram data will be
located in the debug file system at
/sys/kernel/debug/tracing/latency_hist/wakeup
and
/sys/kernel/debug/tracing/latency_hist/wakeup/sharedprio
If both Scheduling Latency Histogram and Missed Timer Offsets
Histogram are selected, additional histogram data will be collected
that contain, in addition to the wakeup latency, the timer latency, in
case the wakeup was triggered by an expired timer. These histograms
are available in the
/sys/kernel/debug/tracing/latency_hist/timerandwakeup
directory. They reflect the apparent interrupt and scheduling latency
and are best suitable to determine the worst-case latency of a given
system. To enable these histograms, write a non-zero number to
/sys/kernel/debug/tracing/latency_hist/enable/timerandwakeup
config MISSED_TIMER_OFFSETS_HIST
depends on HIGH_RES_TIMERS
select GENERIC_TRACER
bool "Missed Timer Offsets Histogram"
help
Generate a histogram of missed timer offsets in microseconds. The
histograms are disabled by default. To enable them, write a non-zero
number to
/sys/kernel/debug/tracing/latency_hist/enable/missed_timer_offsets
The histogram data will be located in the debug file system at
/sys/kernel/debug/tracing/latency_hist/missed_timer_offsets
If both Scheduling Latency Histogram and Missed Timer Offsets
Histogram are selected, additional histogram data will be collected
that contain, in addition to the wakeup latency, the timer latency, in
case the wakeup was triggered by an expired timer. These histograms
are available in the
/sys/kernel/debug/tracing/latency_hist/timerandwakeup
directory. They reflect the apparent interrupt and scheduling latency
and are best suitable to determine the worst-case latency of a given
system. To enable these histograms, write a non-zero number to
/sys/kernel/debug/tracing/latency_hist/enable/timerandwakeup
config ENABLE_DEFAULT_TRACERS
bool "Trace process context switches and events"
depends on !GENERIC_TRACER

View File

@ -34,6 +34,10 @@ obj-$(CONFIG_FUNCTION_TRACER) += trace_functions.o
obj-$(CONFIG_IRQSOFF_TRACER) += trace_irqsoff.o
obj-$(CONFIG_PREEMPT_TRACER) += trace_irqsoff.o
obj-$(CONFIG_SCHED_TRACER) += trace_sched_wakeup.o
obj-$(CONFIG_INTERRUPT_OFF_HIST) += latency_hist.o
obj-$(CONFIG_PREEMPT_OFF_HIST) += latency_hist.o
obj-$(CONFIG_WAKEUP_LATENCY_HIST) += latency_hist.o
obj-$(CONFIG_MISSED_TIMER_OFFSETS_HIST) += latency_hist.o
obj-$(CONFIG_NOP_TRACER) += trace_nop.o
obj-$(CONFIG_STACK_TRACER) += trace_stack.o
obj-$(CONFIG_MMIOTRACE) += trace_mmiotrace.o

1178
kernel/trace/latency_hist.c Normal file

File diff suppressed because it is too large Load Diff

View File

@ -17,6 +17,7 @@
#include <linux/fs.h>
#include "trace.h"
#include <trace/events/hist.h>
static struct trace_array *irqsoff_trace __read_mostly;
static int tracer_enabled __read_mostly;
@ -439,11 +440,13 @@ void start_critical_timings(void)
{
if (preempt_trace() || irq_trace())
start_critical_timing(CALLER_ADDR0, CALLER_ADDR1);
trace_preemptirqsoff_hist(TRACE_START, 1);
}
EXPORT_SYMBOL_GPL(start_critical_timings);
void stop_critical_timings(void)
{
trace_preemptirqsoff_hist(TRACE_STOP, 0);
if (preempt_trace() || irq_trace())
stop_critical_timing(CALLER_ADDR0, CALLER_ADDR1);
}
@ -453,6 +456,7 @@ EXPORT_SYMBOL_GPL(stop_critical_timings);
#ifdef CONFIG_PROVE_LOCKING
void time_hardirqs_on(unsigned long a0, unsigned long a1)
{
trace_preemptirqsoff_hist(IRQS_ON, 0);
if (!preempt_trace() && irq_trace())
stop_critical_timing(a0, a1);
}
@ -461,6 +465,7 @@ void time_hardirqs_off(unsigned long a0, unsigned long a1)
{
if (!preempt_trace() && irq_trace())
start_critical_timing(a0, a1);
trace_preemptirqsoff_hist(IRQS_OFF, 1);
}
#else /* !CONFIG_PROVE_LOCKING */
@ -486,6 +491,7 @@ inline void print_irqtrace_events(struct task_struct *curr)
*/
void trace_hardirqs_on(void)
{
trace_preemptirqsoff_hist(IRQS_ON, 0);
if (!preempt_trace() && irq_trace())
stop_critical_timing(CALLER_ADDR0, CALLER_ADDR1);
}
@ -495,11 +501,13 @@ void trace_hardirqs_off(void)
{
if (!preempt_trace() && irq_trace())
start_critical_timing(CALLER_ADDR0, CALLER_ADDR1);
trace_preemptirqsoff_hist(IRQS_OFF, 1);
}
EXPORT_SYMBOL(trace_hardirqs_off);
void trace_hardirqs_on_caller(unsigned long caller_addr)
{
trace_preemptirqsoff_hist(IRQS_ON, 0);
if (!preempt_trace() && irq_trace())
stop_critical_timing(CALLER_ADDR0, caller_addr);
}
@ -509,6 +517,7 @@ void trace_hardirqs_off_caller(unsigned long caller_addr)
{
if (!preempt_trace() && irq_trace())
start_critical_timing(CALLER_ADDR0, caller_addr);
trace_preemptirqsoff_hist(IRQS_OFF, 1);
}
EXPORT_SYMBOL(trace_hardirqs_off_caller);
@ -518,12 +527,14 @@ EXPORT_SYMBOL(trace_hardirqs_off_caller);
#ifdef CONFIG_PREEMPT_TRACER
void trace_preempt_on(unsigned long a0, unsigned long a1)
{
trace_preemptirqsoff_hist(PREEMPT_ON, 0);
if (preempt_trace() && !irq_trace())
stop_critical_timing(a0, a1);
}
void trace_preempt_off(unsigned long a0, unsigned long a1)
{
trace_preemptirqsoff_hist(PREEMPT_ON, 1);
if (preempt_trace() && !irq_trace())
start_critical_timing(a0, a1);
}