Impact: fix callsites with dynamic format strings
Since its new binary implementation, trace_printk() internally uses static
containers for the format strings on each callsites. But the value is
assigned once at build time, which means that it can't take dynamic
formats.
So this patch unearthes the raw trace_printk implementation for the callers
that will need trace_printk to be able to carry these dynamic format
strings. The trace_printk() macro will use the appropriate implementation
for each callsite. Most of the time however, the binary implementation will
still be used.
The other impact of this patch is that mmiotrace_printk() will use the old
implementation because it calls the low level trace_vprintk and we can't
guess here whether the format passed in it is dynamic or not.
Some parts of this patch have been written by Steven Rostedt (most notably
the part that chooses the appropriate implementation for each callsites).
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: do not confuse user on small trace buffer sizes
When the system boots up, the trace buffer is small to conserve memory.
It is only two pages per online CPU. When the tracer is used, it expands
to the default value.
This can confuse the user if they look at the buffer size and see only
7, but then later they see 1408.
# cat /debug/tracing/buffer_size_kb
7
# echo sched_switch > /debug/tracing/current_tracer
# cat /debug/tracing/buffer_size_kb
1408
This patch tries to help remove this confustion by showing that the
buffer has not been expanded.
# cat /debug/tracing/buffer_size_kb
7 (expanded: 1408)
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: speed up and remove possible races
The get_online_cpus was added to the ring buffer because the original
design would free the ring buffer on a CPU that was being taken
off line. The final design kept the ring buffer around even when the
CPU was taken off line. This is to allow a user to still read the
information on that ring buffer.
Most of the get_online_cpus are no longer needed since the ring buffer will
not disappear from the use cases.
Reported-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
The hotplug code in the ring buffers is for use with CPU hotplug,
not generic hotplug.
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: prevent races with ring_buffer_expanded
This patch places the expanding of the tracing buffer under the
protection of the trace_types_lock mutex. It is highly unlikely
that there would be any contention, but better safe than sorry.
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: cleanup
Some of the comments about the trace buffer resizing is gobbledygook.
And I wonder why people question if I'm a native English speaker.
This patch makes the comments make a bit more sense.
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: save on memory
Currently, a ring buffer was allocated for each "possible_cpus". On
some systems, this is the same as NR_CPUS. Thus, if a system defined
NR_CPUS = 64 but it only had 1 CPU, we could have possibly 63 useless
ring buffers taking up space. With a default buffer of 3 megs, this
could be quite drastic.
This patch changes the ring buffer code to only allocate ring buffers
for online CPUs. If a CPU goes off line, we do not free the buffer.
This is because the user may still have trace data in that buffer
that they would like to look at.
Perhaps in the future we could add code to delete a ring buffer if
the CPU is offline and the ring buffer becomes empty.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: fix to task live locking on reading trace_pipe on one CPU
The same code is used for both trace_pipe (all CPUS) and the per_cpu
trace_pipe file. When there is no data to read, it will check for
signals and wait on the trace wait queue.
The problem happens with the per_cpu wait. The trace_wait code checks
all CPUs. Thus, if there's data in another CPU buffer, then it will
exit the wait, without checking for signals or waiting on the wait queue.
It would then try to read the empty buffer, and since that will just
return nothing, then it will try to wait again. Unfortunately, that will
again fail due to there still being data in the other buffers. This
ends up with a live lock for the task.
This patch fixes the trace_wait to be aware that the iterator may only
be waiting on a single buffer.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
To save memory, the tracer ring buffers are set to a minimum.
The activating of a trace expands the ring buffer size. This patch
adds this expanding, when an event is activated.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: less memory impact on systems not using tracer
When the kernel boots up that has tracing configured, it allocates
the default size of the ring buffer. This currently happens to be
1.4Megs per possible CPU. This is quite a bit of wasted memory if
the system is never using the tracer.
The current solution is to keep the ring buffers to a minimum size
until the user uses them. Once a tracer is piped into the current_tracer
the ring buffer will be expanded to the default size. If the user
changes the size of the ring buffer, it will take the size given
by the user immediately.
If the user adds a "ftrace=" to the kernel command line, then the ring
buffers will be set to the default size on initialization.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: prevent locking up by lockdep tracer
The lockdep tracer uses trace_vprintk and thus trace_vprintk can not
call back into lockdep without locking up.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Using the function_graph tracer in recent kernels generates a spew of
preemption BUGs. Fix this by not requiring trace_clock_local() users
to disable preemption themselves.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: clean up
There existed a lot of <space><tab>'s in the tracing code. This
patch removes them.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: clean up / comments
The comments that described the ftrace macros to manipulate the
TRACE_EVENT and TRACE_FORMAT macros no longer match the code.
This patch updates them.
Reported-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: clean up
In trying to stay consistant with the C style format in the TRACE_EVENT
macro, it makes more sense to do the printk after the assigning of
the variables.
Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
The event directory files type and available_types were no longer
needed with the new TRACE_EVENT_FORMAT macros, they were deleted.
But by accident the available_events file was also removed.
This patch brings it back.
Reported-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: fix to prevent crash on calling NULL function pointer
The ftrace internal records have their format exported via the event
system under the ftrace subsystem. These are only for exporting the
format to allow binary readers to be able to parse them in a binary
output.
The ftrace subsystem events can only be enabled via the ftrace tracers
and do not have a registering function. The event files expect the
event record to have registering function and will call it directly.
Passing in a ftrace subsystem event will cause the kernel to crash
because it will execute a NULL pointer.
This patch prevents the ftrace subsystem from being viewable to the
event enabling files.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: clean up
The offsetof and sizeof are of type size_t, and instead of typecasting
them to unsigned int for printk formatting, one could just use %zu.
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
"for (++cpu ; cpu < num_possible_cpus(); cpu++)" statement assumes
possible cpus have continuous number - but that's a wrong assumption.
Insted, cpumask_next() should be used.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Steven Rostedt <srostedt@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <20090310104437.A480.A69D9226@jp.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: clean up
The TRACE_EVENT_FORMAT macro is no longer used by trace points
and only the DECLARE_TRACE, TRACE_FORMAT or TRACE_EVENT macros should
be used by them. Although the TRACE_EVENT_FORMAT macro is still used
by the internal tracing utility, it should not be used in core
kernel code.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: clean up and enhancement
The TRACE_EVENT_FORMAT macro looks quite ugly and is limited in its
ability to save data as well as to print the record out. Working with
Ingo Molnar, we came up with a new format that is much more pleasing to
the eye of C developers. This new macro is more C style than the old
macro, and is more obvious to what it does.
Here's the example. The only updated macro in this patch is the
sched_switch trace point.
The old method looked like this:
TRACE_EVENT_FORMAT(sched_switch,
TP_PROTO(struct rq *rq, struct task_struct *prev,
struct task_struct *next),
TP_ARGS(rq, prev, next),
TP_FMT("task %s:%d ==> %s:%d",
prev->comm, prev->pid, next->comm, next->pid),
TRACE_STRUCT(
TRACE_FIELD(pid_t, prev_pid, prev->pid)
TRACE_FIELD(int, prev_prio, prev->prio)
TRACE_FIELD_SPECIAL(char next_comm[TASK_COMM_LEN],
next_comm,
TP_CMD(memcpy(TRACE_ENTRY->next_comm,
next->comm,
TASK_COMM_LEN)))
TRACE_FIELD(pid_t, next_pid, next->pid)
TRACE_FIELD(int, next_prio, next->prio)
),
TP_RAW_FMT("prev %d:%d ==> next %s:%d:%d")
);
The above method is hard to read and requires two format fields.
The new method:
/*
* Tracepoint for task switches, performed by the scheduler:
*
* (NOTE: the 'rq' argument is not used by generic trace events,
* but used by the latency tracer plugin. )
*/
TRACE_EVENT(sched_switch,
TP_PROTO(struct rq *rq, struct task_struct *prev,
struct task_struct *next),
TP_ARGS(rq, prev, next),
TP_STRUCT__entry(
__array( char, prev_comm, TASK_COMM_LEN )
__field( pid_t, prev_pid )
__field( int, prev_prio )
__array( char, next_comm, TASK_COMM_LEN )
__field( pid_t, next_pid )
__field( int, next_prio )
),
TP_printk("task %s:%d [%d] ==> %s:%d [%d]",
__entry->prev_comm, __entry->prev_pid, __entry->prev_prio,
__entry->next_comm, __entry->next_pid, __entry->next_prio),
TP_fast_assign(
memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN);
__entry->prev_pid = prev->pid;
__entry->prev_prio = prev->prio;
memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);
__entry->next_pid = next->pid;
__entry->next_prio = next->prio;
)
);
This macro is called TRACE_EVENT, it is broken up into 5 parts:
TP_PROTO: the proto type of the trace point
TP_ARGS: the arguments of the trace point
TP_STRUCT_entry: the structure layout of the entry in the ring buffer
TP_printk: the printk format
TP_fast_assign: the method used to write the entry into the ring buffer
The structure is the definition of how the event will be saved in the
ring buffer. The printk is used by the internal tracing in case of
an oops, and the kernel needs to print out the format of the record
to the console. This the TP_printk gives a means to show the records
in a human readable format. It is also used to print out the data
from the trace file.
The TP_fast_assign is executed directly. It is basically like a C function,
where the __entry is the handle to the record.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: clean up
This removes the custom made STR(x) macros in the tracer and uses
the generic __stringify macro instead.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: clean up
The macros TPPROTO, TPARGS, TPFMT, TPRAWFMT, and TPCMD all look a bit
ugly. This patch adds an underscore to their names.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: fix compiler warnings
On x86_64 sizeof and offsetof are treated as long, where as on x86_32
they are int. This patch typecasts them to unsigned int to avoid
one arch giving warnings while the other does not.
Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: improve workqueue tracer output
Currently, /sys/kernel/debug/tracing/trace_stat/workqueues can display
wrong and strange thread names.
Why?
Currently, ftrace has tracing_record_cmdline()/trace_find_cmdline()
convenience function that implements a task->comm string cache.
This can avoid unnecessary memcpy overhead and the workqueue tracer
uses it.
However, in general, any trace statistics feature shouldn't use
tracing_record_cmdline() because trace statistics can display
very old process. Then comm cache can return wrong string because
recent process overrides the cache.
Fortunately, workqueue trace guarantees that displayed processes
are live. Thus we can search comm string from PID at display time.
<before>
% cat workqueues
# CPU INSERTED EXECUTED NAME
# | | | |
7 431913 431913 kondemand/7
7 0 0 tail
7 21 21 git
7 0 0 ls
7 9 9 cat
7 832632 832632 unix_chkpwd
7 236292 236292 ls
Note: tail, git, ls, cat unix_chkpwd are obiously not workqueue thread.
<after>
% cat workqueues
# CPU INSERTED EXECUTED NAME
# | | | |
7 510 510 kondemand/7
7 0 0 kmpathd/7
7 15 15 ata/7
7 0 0 aio/7
7 11 11 kblockd/7
7 1063 1063 work_on_cpu/7
7 167 167 events/7
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Steven Rostedt <srostedt@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Remove a few leftovers and clean up the code a bit.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <1236356510-8381-5-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: faster and lighter tracing
Now that we have trace_bprintk() which is faster and consume lesser
memory than trace_printk() and has the same purpose, we can now drop
the old implementation in favour of the binary one from trace_bprintk(),
which means we move all the implementation of trace_bprintk() to
trace_printk(), so the Api doesn't change except that we must now use
trace_seq_bprintk() to print the TRACE_PRINT entries.
Some changes result of this:
- Previously, trace_bprintk depended of a single tracer and couldn't
work without. This tracer has been dropped and the whole implementation
of trace_printk() (like the module formats management) is now integrated
in the tracing core (comes with CONFIG_TRACING), though we keep the file
trace_printk (previously trace_bprintk.c) where we can find the module
management. Thus we don't overflow trace.c
- changes some parts to use trace_seq_bprintk() to print TRACE_PRINT entries.
- change a bit trace_printk/trace_vprintk macros to support non-builtin formats
constants, and fix 'const' qualifiers warnings. But this is all transparent for
developers.
- etc...
V2:
- Rebase against last changes
- Fix mispell on the changelog
V3:
- Rebase against last changes (moving trace_printk() to kernel.h)
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <1236356510-8381-5-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: add a generic printk() for tracing, like trace_printk()
trace_bprintk() uses the infrastructure to record events on ring_buffer.
[ fweisbec@gmail.com: ported to latest -tip, made it work if
!CONFIG_MODULES, never free the format strings from modules
because we can't keep track of them and conditionnaly create
the ftrace format strings section (reported by Steven Rostedt) ]
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <1236356510-8381-4-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: save on memory for tracing
Current tracers are typically using a struct(like struct ftrace_entry,
struct ctx_switch_entry, struct special_entr etc...)to record a binary
event. These structs can only record a their own kind of events.
A new kind of tracer need a new struct and a lot of code too handle it.
So we need a generic binary record for events. This infrastructure
is for this purpose.
[fweisbec@gmail.com: rebase against latest -tip, make it safe while sched
tracing as reported by Steven Rostedt]
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <1236356510-8381-3-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix deadlock while using set_ftrace_pid
Reproducer:
# cd /sys/kernel/debug/tracing
# echo $$ > set_ftrace_pid
then, console becomes hung.
Details:
when writing set_ftracepid, kernel callstack is following
ftrace_pid_write()
mutex_lock(&ftrace_lock);
ftrace_update_pid_func()
mutex_lock(&ftrace_lock);
mutex_unlock(&ftrace_lock);
mutex_unlock(&ftrace_lock);
then, system always deadlocks when ftrace_pid_write() is called.
In past days, ftrace_pid_write() used ftrace_start_lock, but
commit e6ea44e9b4 consolidated
ftrace_start_lock to ftrace_lock.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Steven Rostedt <srostedt@redhat.com>
LKML-Reference: <20090306151155.0778.A69D9226@jp.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: allow user apps to read binary format of basic ftrace entries
Currently, only defined raw events export their formats so a binary
reader can parse them. There's no reason that the default ftrace entries
can't export their formats.
This patch adds a subsystem called "ftrace" in the events directory
that includes the ftrace entries for basic ftrace recorded items.
These only have three files in the events directory:
type : printf
available_types : printf
format : format for the event entry
For example:
# cat /debug/tracing/events/ftrace/wakeup/format
name: wakeup
ID: 3
format:
field:unsigned char type; offset:0; size:1;
field:unsigned char flags; offset:1; size:1;
field:unsigned char preempt_count; offset:2; size:1;
field:int pid; offset:4; size:4;
field:int tgid; offset:8; size:4;
field:unsigned int prev_pid; offset:12; size:4;
field:unsigned char prev_prio; offset:16; size:1;
field:unsigned char prev_state; offset:17; size:1;
field:unsigned int next_pid; offset:20; size:4;
field:unsigned char next_prio; offset:24; size:1;
field:unsigned char next_state; offset:25; size:1;
field:unsigned int next_cpu; offset:28; size:4;
print fmt: "%u:%u:%u ==+ %u:%u:%u [%03u]"
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: clean up
Move the macro that creates the event format file to a separate header.
This will allow the default ftrace events to use this same macro
to create the formats to read those events.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: cleanup
All file_operations structures should be constant. No one is going to
change them.
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Clean up menu structure, introduce TRACING_SUPPORT switch that signals
whether an architecture supports various instrumentation mechanisms.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: decrease hangs risks with the graph tracer on slow systems
Since the function graph tracer can spend too much time on timer
interrupts, it's better now to use the more lightweight local
clock. Anyway, the function graph traces are more reliable on a
per cpu trace.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <49af243d.06e9300a.53ad.ffff840c@mx.google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Use a more generic name - this also allows the prototype to move
to kernel.h and be generally available to kernel developers who
want to do some quick tracing.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The latency tracers (irqsoff, preemptoff, preemptirqsoff, and wakeup)
are pretty useless with the default output format. This patch makes them
automatically enable the latency format when they are selected. They
also record the state of the latency option, and if it was not enabled
when selected, they disable it on reset.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: clean up
Both print_lat_fmt and print_trace_fmt do pretty much the same thing
except for one different function call. This patch consolidates the
two functions and adds an if statement to perform the difference.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: clean up
The trace and latency_trace function pointers are identical for
every tracer but the function tracer. The differences in the function
tracer are trivial (latency output puts paranthesis around parent).
This patch removes the latency_trace pointer and all prints will
now just use the trace output function pointer.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
With the removal of the latency_trace file, we lost the ability
to see some of the finer details in a trace. Like the state of
interrupts enabled, the preempt count, need resched, and if we
are in an interrupt handler, softirq handler or not.
This patch simply creates an option to bring back the old format.
This also removes the warning about an unused variable that held
the latency_trace file operations.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
The buffer used by trace_seq was updated incorrectly. Instead
of consuming what was actually read, it consumed the rest of the
buffer on reads.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: fix trace read to conform to standards
Andrew Morton, Theodore Tso and H. Peter Anvin brought to my attention
that a userspace read should not return -EFAULT if it succeeded in
copying anything. It should only return -EFAULT if it failed to copy
at all.
This patch modifies the check of copy_from_user and updates the return
code appropriately.
I also used H. Peter Anvin's short cut rule to just test ret == count.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
If a partial ring_buffer_page_read happens, then some of the
incremental timestamps may be lost. This patch writes the
recent timestamp into the page that is passed back to the caller.
A partial ring_buffer_page_read is where the full page would not
be written back to the user, and instead, just part of the page
is copied to the user. A full page would be a page swap with the
ring buffer and the timestamps would be correct.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: fix to ftrace_dump output corruption
The commit: b04cc6b1f6
tracing/core: introduce per cpu tracing files
added a new field to the iterator called cpu_file. This was a handle
to differentiate between the per cpu trace output files and the
all cpu "trace" file. The all cpu "trace" file required setting this
to TRACE_PIPE_ALL_CPU.
The problem is that the ftrace_dump sets up its own iterator but was
not updated to handle this change. The result was only CPU 0 printing
out on crash and a lot of "<0>"'s also being printed.
Reported-by: Thomas Gleixner <tglx@linuxtronix.de>
Tested-by: Darren Hart <dvhtc@us.ibm.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: new feature
This patch creates a directory of files that correspond to the
per CPU ring buffers. These are binary files and are made to
be used with splice. This is the fastest way to extract data from
the ftrace ring buffers.
Thanks to Jiaying Zhang for pushing me to get this code fixed,
and to Eduard - Gabriel Munteanu for his splice code that helped
me debug my code.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: dont leave holes in read buffer page
The ring_buffer_read_page swaps a given page with the reader page
of the ring buffer, if certain conditions are set:
1) requested length is big enough to hold entire page data
2) a writer is not currently on the page
3) the page is not partially consumed.
Instead of swapping with the supplied page. It copies the data to
the supplied page instead. But currently the data is copied in the
same offset as the source page. This causes a hole at the start
of the reader page. This complicates the use of this function.
Instead, it should copy the data at the beginning of the function
and update the index fields accordingly.
Other small clean ups are also done in this patch.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: fix to possible alignment problems on some archs.
Some arch compilers include an NULL char array in the sizeof field.
Since the ring_buffer_event type includes one of these, it is better
to use the "offsetof" instead, to avoid strange bugs on these archs.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
The ring_buffer_read_page was broken if it were to only copy part
of the page. This patch fixes that up as well as adds a parameter
to allow a length field, in order to only copy part of the buffer page.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: fix ring_buffer_read_page
After a page is swapped into the ring buffer, the write field must
also be reset.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
The registering of events had the return value check backwards.
A zero returned is success, the check had it as a failure.
This patch also fixes a missing "\n" in the warning that the check
failed.
Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
To be able to identify the trace in the binary format output, the
id of the trace event (which is dynamically assigned) must also be listed.
This patch adds the name of the trace point as well as the id assigned.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
This patch includes the ftrace header to the event formats files:
# cat /debug/tracing/events/sched/sched_switch/format
field:unsigned char type; offset:0; size:1;
field:unsigned char flags; offset:1; size:1;
field:unsigned char preempt_count; offset:2; size:1;
field:int pid; offset:4; size:4;
field:int tgid; offset:8; size:4;
field:pid_t prev_pid; offset:12; size:4;
field:int prev_prio; offset:16; size:4;
field special:char next_comm[TASK_COMM_LEN]; offset:20; size:16;
field:pid_t next_pid; offset:36; size:4;
field:int next_prio; offset:40; size:4;
A blank line is used as a deliminator between the ftrace header and the
trace point fields.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
This patch adds the "format" file to the trace point event directory.
This is based off of work by Tom Zanussi, in which a file is exported
to be tread from user land such that a user space app may read the
binary record stored in the ring buffer.
# cat /debug/tracing/events/sched/sched_switch/format
field:pid_t prev_pid; offset:12; size:4;
field:int prev_prio; offset:16; size:4;
field special:char next_comm[TASK_COMM_LEN]; offset:20; size:16;
field:pid_t next_pid; offset:36; size:4;
field:int next_prio; offset:40; size:4;
Idea-from: Tom Zanussi <tzanussi@gmail.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: clean up
The trace_seq functions may be used separately outside of the ftrace
iterator. The trace_seq_reset is needed for these operations.
This patch also renames trace_seq_reset to the more appropriate
trace_seq_init.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
The trace event objects are currently not proctected against
reentrancy. This patch adds a mutex around the modifications of
the trace event fields.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Tom Zanussi pointed out that the simple TRACE_FIELD was not enough to
record trace data that required memcpy. This patch addresses this issue
by adding a TRACE_FIELD_SPECIAL. The format is similar to TRACE_FIELD
but looks like so:
TRACE_FIELD_SPECIAL(type_item, item, cmd)
What TRACE_FIELD gave was:
TRACE_FIELD(type, item, assign)
The TRACE_FIELD would be used in declaring a structure:
struct {
type item;
};
And later assign it via:
entry->item = assign;
What TRACE_FIELD_SPECIAL gives us is:
In the declaration of the structure:
struct {
type_item;
};
And the assignment:
cmd;
This change log will explain the one example used in the patch:
TRACE_EVENT_FORMAT(sched_switch,
TPPROTO(struct rq *rq, struct task_struct *prev,
struct task_struct *next),
TPARGS(rq, prev, next),
TPFMT("task %s:%d ==> %s:%d",
prev->comm, prev->pid, next->comm, next->pid),
TRACE_STRUCT(
TRACE_FIELD(pid_t, prev_pid, prev->pid)
TRACE_FIELD(int, prev_prio, prev->prio)
TRACE_FIELD_SPECIAL(char next_comm[TASK_COMM_LEN],
next_comm,
TPCMD(memcpy(TRACE_ENTRY->next_comm,
next->comm,
TASK_COMM_LEN)))
TRACE_FIELD(pid_t, next_pid, next->pid)
TRACE_FIELD(int, next_prio, next->prio)
),
TPRAWFMT("prev %d:%d ==> next %s:%d:%d")
);
The struct will be create as:
struct {
pid_t prev_pid;
int prev_prio;
char next_comm[TASK_COMM_LEN];
pid_t next_pid;
int next_prio;
};
Note the TRACE_ENTRY in the cmd part of TRACE_SPECIAL. TRACE_ENTRY will
be set by the tracer to point to the structure inside the trace buffer.
entry->prev_pid = prev->pid;
entry->prev_prio = prev->prio;
memcpy(entry->next_comm, next->comm, TASK_COMM_LEN);
entry->next_pid = next->pid;
entry->next_prio = next->prio
Reported-by: Tom Zanussi <tzanussi@gmail.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
This patch adds the interface to enable the C style trace points.
In the directory /debugfs/tracing/events/subsystem/event
We now have three files:
enable : values 0 or 1 to enable or disable the trace event.
available_types: values 'raw' and 'printf' which indicate the tracing
types available for the trace point. If a developer does not
use the TRACE_EVENT_FORMAT macro and just uses the TRACE_FORMAT
macro, then only 'printf' will be available. This file is
read only.
type: values 'raw' or 'printf'. This indicates which type of tracing
is active for that trace point. 'printf' is the default and
if 'raw' is not available, this file is read only.
# echo raw > /debug/tracing/events/sched/sched_wakeup/type
# echo 1 > /debug/tracing/events/sched/sched_wakeup/enable
Will enable the C style tracing for the sched_wakeup trace point.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: lower overhead tracing
The current event tracer can automatically pick up trace points
that are registered with the TRACE_FORMAT macro. But it required
a printf format string and parsing. Although, this adds the ability
to get guaranteed information like task names and such, it took
a hit in overhead processing. This processing can add about 500-1000
nanoseconds overhead, but in some cases that too is considered
too much and we want to shave off as much from this overhead as
possible.
Tom Zanussi recently posted tracing patches to lkml that are based
on a nice idea about capturing the data via C structs using
STRUCT_ENTER, STRUCT_EXIT type of macros.
I liked that method very much, but did not like the implementation
that required a developer to add data/code in several disjoint
locations.
This patch extends the event_tracer macros to do a similar "raw C"
approach that Tom Zanussi did. But instead of having the developers
needing to tweak a bunch of code all over the place, they can do it
all in one macro - preferably placed near the code that it is
tracing. That makes it much more likely that tracepoints will be
maintained on an ongoing basis by the code they modify.
The new macro TRACE_EVENT_FORMAT is created for this approach. (Note,
a developer may still utilize the more low level DECLARE_TRACE macros
if they don't care about getting their traces automatically in the event
tracer.)
They can also use the existing TRACE_FORMAT if they don't need to code
the tracepoint in C, but just want to use the convenience of printf.
So if the developer wants to "hardwire" a tracepoint in the fastest
possible way, and wants to acquire their data via a user space utility
in a raw binary format, or wants to see it in the trace output but not
sacrifice any performance, then they can implement the faster but
more complex TRACE_EVENT_FORMAT macro.
Here's what usage looks like:
TRACE_EVENT_FORMAT(name,
TPPROTO(proto),
TPARGS(args),
TPFMT(fmt, fmt_args),
TRACE_STUCT(
TRACE_FIELD(type1, item1, assign1)
TRACE_FIELD(type2, item2, assign2)
[...]
),
TPRAWFMT(raw_fmt)
);
Note name, proto, args, and fmt, are all identical to what TRACE_FORMAT
uses.
name: is the unique identifier of the trace point
proto: The proto type that the trace point uses
args: the args in the proto type
fmt: printf format to use with the event printf tracer
fmt_args: the printf argments to match fmt
TRACE_STRUCT starts the ability to create a structure.
Each item in the structure is defined with a TRACE_FIELD
TRACE_FIELD(type, item, assign)
type: the C type of item.
item: the name of the item in the stucture
assign: what to assign the item in the trace point callback
raw_fmt is a way to pretty print the struct. It must match
the order of the items are added in TRACE_STUCT
An example of this would be:
TRACE_EVENT_FORMAT(sched_wakeup,
TPPROTO(struct rq *rq, struct task_struct *p, int success),
TPARGS(rq, p, success),
TPFMT("task %s:%d %s",
p->comm, p->pid, success?"succeeded":"failed"),
TRACE_STRUCT(
TRACE_FIELD(pid_t, pid, p->pid)
TRACE_FIELD(int, success, success)
),
TPRAWFMT("task %d success=%d")
);
This creates us a unique struct of:
struct {
pid_t pid;
int success;
};
And the way the call back would assign these values would be:
entry->pid = p->pid;
entry->success = success;
The nice part about this is that the creation of the assignent is done
via macro magic in the event tracer. Once the TRACE_EVENT_FORMAT is
created, the developer will then have a faster method to record
into the ring buffer. They do not need to worry about the tracer itself.
The developer would only need to touch the files in include/trace/*.h
Again, I would like to give special thanks to Tom Zanussi for this
nice idea.
Idea-from: Tom Zanussi <tzanussi@gmail.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Right now all tracers must manage their own trace buffers. This was
to enforce tracers to be independent in case we finally decide to
allow each tracer to have their own trace buffer.
But now we are adding event tracing that writes to the current tracer's
buffer. This adds an interface to allow events to write to the current
tracer buffer without having to manage its own. Since event tracing
has no "tracer", and is just a way to hook into any other tracer.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
This patch makes the event files, set_event and available_events
aware of the subsystem.
Now you can enable an entire subsystem with:
echo 'irq:*' > set_event
Note: the '*' is not needed.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
If a trace point header defines TRACE_SYSTEM, then it will add the
following trace points into that event system.
If include/trace/irq_event_types.h has:
#define TRACE_SYSTEM irq
at the top and
#undef TRACE_SYSTEM
at the bottom, then a directory "irq" will be created in the
/debug/tracing/events directory. Inside that directory will contain the
two trace points that are defined in include/trace/irq_event_types.h.
Only adding the above to irq and not to sched, we get:
# ls /debug/tracing/events/
irq sched_process_exit sched_signal_send sched_wakeup_new
sched_kthread_stop sched_process_fork sched_switch
sched_kthread_stop_ret sched_process_free sched_wait_task
sched_migrate_task sched_process_wait sched_wakeup
# ls /debug/tracing/events/irq
irq_handler_entry irq_handler_exit
If we add #define TRACE_SYSTEM sched to the trace/sched_event_types.h
then the rest of the trace events will be put in a sched directory
within the events directory.
I've been playing with this idea of the subsystem for a while, but
recently Tom Zanussi posted some patches to lkml that included this
method. Tom's approach was clean and got me to finally put some effort
to clean up the event trace points.
Thanks to Tom Zanussi for demonstrating how nice the subsystem
method is.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: clean up
To further facilitate the ease of adding trace points for developers, this
patch creates include/trace/trace_events.h and
include/trace/trace_event_types.h.
The former file will hold the trace/<type>.h files and the latter will hold
the trace/<type>_event_types.h files.
To create new tracepoints and to have them automatically
appear in the event tracer, a developer makes the trace/<type>.h file
which includes <linux/tracepoint.h> and the trace/<type>_event_types.h file.
The trace/<type>_event_types.h file will hold the TRACE_FORMAT
macros.
Then add the trace/<type>.h file to trace/trace_events.h,
and add the trace/<type>_event_types.h to the trace_event_types.h file.
No need to modify files elsewhere.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: clean up
kcalloc is a better approach to allocate a NULL array.
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: fix compile warning and clean up
When I first wrote __tracing_open, instead of passing the error
code via the ERR_PTR macros, I lazily used a separate parameter
to hold the return for errors.
When Frederic Weisbecker updated that function, he used the Linux
kernel ERR_PTR for the returns. This caused the parameter return
to possibly not be initialized on error. gcc correctly pointed this
out with a warning.
This patch converts the entire function to use the Linux kernel
ERR_PTR macro methods.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: fix to possible race conditions
There's some uses of current_tracer that is not protected by the
trace_types_lock. There is a small chance that a sysadmin changes
the tracer while the current_tracer is being referenced.
If the race is hit, it is unlikely to cause any harm since the
tracers are constant and are not freed. But some strang side
effects may occur.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
This patch adds the tracer dependent options dynamically to the
options directory when the tracer is activated. These options are
removed when the tracer is deactivated.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
This patch creates an options directory in the debugfs, that contains
the available tracing options. These files contain 1 or 0, where 1
is the option is enabled and 0 it is disabled.
Simply echoing in 1 will enable the option and 0 will disable it.
This patch only contains the core options, not the tracer options.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: implement new tracing timestamp APIs
Add three trace clock variants, with differing scalability/precision
tradeoffs:
- local: CPU-local trace clock
- medium: scalable global clock with some jitter
- global: globally monotonic, serialized clock
Make the ring-buffer use the local trace clock internally.
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: add new tracepoints
Add them to the generic IRQ code, that way every architecture
gets these new tracepoints, not just x86.
Using Steve's new 'TRACE_FORMAT', I can get function graph
trace as follows using the original two IRQ tracepoints:
3) | handle_IRQ_event() {
3) | /* (irq_handler_entry) irq=28 handler=eth0 */
3) | e1000_intr_msi() {
3) 2.460 us | __napi_schedule();
3) 9.416 us | }
3) | /* (irq_handler_exit) irq=28 handler=eth0 return=handled */
3) + 22.935 us | }
Signed-off-by: Jason Baron <jbaron@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Mathieu Desnoyers <compudj@krystal.dyndns.org>
Cc: "Frank Ch. Eigler" <fche@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: restructure the VFS layout of per CPU trace buffers
The per cpu trace files are all in a single directory:
/debug/tracing/per_cpu. In case of a large number of cpu, the
content of this directory becomes messy so we create now one
directory per cpu inside /debug/tracing/per_cpu which contain
each their own trace_pipe and trace files.
Ie:
/debug/tracing$ ls -R per_cpu
per_cpu:
cpu0 cpu1
per_cpu/cpu0:
trace trace_pipe
per_cpu/cpu1:
trace trace_pipe
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
There's been a bit confusion to whether DEFINE/DECLARE_TRACE_FMT should
be a DEFINE or a DECLARE. Ingo Molnar suggested simply calling it
TRACE_FORMAT.
Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Now that several per-cpu files can be read or spliced at the
same, we want the read/splice callbacks for tracing files to be
reentrants.
Until now, a single global mutex (trace_types_lock) serialized
the access to tracing_read_pipe(), tracing_splice_read_pipe(),
and the seq helpers.
Ie: it means that if a user tries to read trace_pipe0 and
trace_pipe1 at the same time, the access to the function
tracing_read_pipe() is contended and one reader must wait for
the other to finish its read call.
The trace_type_lock mutex is mostly here to serialize the access
to the global current tracer (current_trace), which can be
changed concurrently. Although the iter struct keeps a private
pointer to this tracer, its callbacks can be changed by another
function.
The method used here is to not keep anymore private reference to
the tracer inside the iterator but to make a copy of it inside
the iterator. Then it checks on subsequents read calls if the
tracer has changed. This is not costly because the current
tracer is not expected to be changed often, so we use a branch
prediction for that.
Moreover, we add a private mutex to the iterator (there is one
iterator per file descriptor) to serialize the accesses in case
of multiple consumers per file descriptor (which would be a
silly idea from the user). Note that this is not to protect the
ring buffer, since the ring buffer already serializes the
readers accesses. This is to prevent from traces weirdness in
case of concurrent consumers. But these mutexes can be dropped
anyway, that would not result in any crash. Just tell me what
you think about it.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: split up tracing output per cpu
Currently, on the tracing debugfs directory, three files are
available to the user to let him extracting the trace output:
- trace is an iterator through the ring-buffer. It's a reader
but not a consumer It doesn't block when no more traces are
available.
- trace pretty similar to the former, except that it adds more
informations such as prempt count, irq flag, ...
- trace_pipe is a reader and a consumer, it will also block
waiting for traces if necessary (heh, yes it's a pipe).
The traces coming from different cpus are curretly mixed up
inside these files. Sometimes it messes up the informations,
sometimes it's useful, depending on what does the tracer
capture.
The tracing_cpumask file is useful to filter the output and
select only the traces captured a custom defined set of cpus.
But still it is not enough powerful to extract at the same time
one trace buffer per cpu.
So this patch creates a new directory: /debug/tracing/per_cpu/.
Inside this directory, you will now find one trace_pipe file and
one trace file per cpu.
Which means if you have two cpus, you will have:
trace0
trace1
trace_pipe0
trace_pipe1
And of course, reading these files will have the same effect
than with the usual tracing files, except that you will only see
the traces from the given cpu.
The original all-in-one cpu trace file are still available on
their original place.
Until now, only one consumer was allowed on trace_pipe to avoid
racy consuming on the ring-buffer. Now the approach changed a
bit, you can have only one consumer per cpu.
Which means you are allowed to read concurrently trace_pipe0 and
trace_pipe1 But you can't have two readers on trace_pipe0 or
trace_pipe1.
Following the same logic, if there is one reader on the common
trace_pipe, you can not have at the same time another reader on
trace_pipe0 or in trace_pipe1. Because in trace_pipe is already
a consumer in all cpu buffers in essence.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: remove old debug/tracing API
/debug/tracing/latency_trace is an old legacy format we kept from
the old latency tracer. Remove the file for now. If there's any
useful bit missing then we'll propagate any useful output bits into
the /debug/tracing/trace output.
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix CPU hotplug lockup
bts_hotcpu_handler() is called with irqs disabled, so using mutex_lock()
is a no-no.
All the BTS codepaths here are atomic (they do not schedule), so using
a spinlock is the right solution.
Cc: Markus Metzger <markus.t.metzger@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch adds the directory /debug/tracing/events/ that will contain
all the registered trace points.
# ls /debug/tracing/events/
sched_kthread_stop sched_process_fork sched_switch
sched_kthread_stop_ret sched_process_free sched_wait_task
sched_migrate_task sched_process_wait sched_wakeup
sched_process_exit sched_signal_send sched_wakeup_new
# ls /debug/tracing/events/sched_switch/
enable
# cat /debug/tracing/events/sched_switch/enable
1
# cat /debug/tracing/set_event
sched_switch
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
This patch changes the trace/sched.h to use the DECLARE_TRACE_FMT
such that they are automatically registered with the event tracer.
And it also adds the tracing sched headers to kernel/trace/events.c
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
This patch creates the event tracing infrastructure of ftrace.
It will create the files:
/debug/tracing/available_events
/debug/tracing/set_event
The available_events will list the trace points that have been
registered with the event tracer.
set_events will allow the user to enable or disable an event hook.
example:
# echo sched_wakeup > /debug/tracing/set_event
Will enable the sched_wakeup event (if it is registered).
# echo "!sched_wakeup" >> /debug/tracing/set_event
Will disable the sched_wakeup event (and only that event).
# echo > /debug/tracing/set_event
Will disable all events (notice the '>')
# cat /debug/tracing/available_events > /debug/tracing/set_event
Will enable all registered event hooks.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Fix an invalid memory reference problem when cpu hotplug support is
disabled and the hw-branch-tracer is set as current tracer.
Initializing the tracer calls bts_trace_init() which has already
been freed at this time.
Reported-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Markus Metzger <markus.t.metzger@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: added precaution on failure detection
Break out of the modifying loop as soon as a failure is detected.
This is just an added precaution found by code review and was not
found by any bug chasing.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
This patch creates the weak functions: ftrace_arch_code_modify_prepare
and ftrace_arch_code_modify_post_process that are called before and
after the stop machine is called to modify the kernel text.
If the arch needs to do pre or post processing, it only needs to define
these functions.
[ Update: Ingo Molnar suggested using the name ftrace_arch_code_modify_*
over using ftrace_arch_modify_* ]
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: trace only functions matching a pattern
The set_graph_function file let one to trace only one or several
chosen functions and follow all their code flow.
Currently, only a constant function name is allowed so this patch
allows the ftrace_regex functions:
- matches all functions that end with "name":
echo *name > set_graph_function
- matches all functions that begin with "name":
echo name* > set_graph_function
- matches all functions that contains "name":
echo *name* > set_graph_function
Example:
echo mutex* > set_graph_function
0) | mutex_lock_nested() {
0) 0.563 us | __might_sleep();
0) 2.072 us | }
0) | mutex_unlock() {
0) 1.036 us | __mutex_unlock_slowpath();
0) 2.433 us | }
0) | mutex_unlock() {
0) 0.691 us | __mutex_unlock_slowpath();
0) 1.787 us | }
0) | mutex_lock_interruptible_nested() {
0) 0.548 us | __might_sleep();
0) 1.945 us | }
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: prevent deadlock if ring buffer gets corrupted
This patch adds a paranoid check to make sure the ring buffer consumer
does not go into an infinite loop. Since the ring buffer has been set
to read only, the consumer should not loop for more than the ring buffer
size. A check is added to make sure the consumer does not loop more than
the ring buffer size.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: fix output of function tracer to be useful
The function tracer is pretty useless if KALLSYMS is not configured.
Unless you are good at reading hex values, the function tracer should
select the KALLSYMS configuration.
Also, the dynamic function tracer will fail its self test if KALLSYMS
is not selected.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: fix to prevent hard lockup on self tests
If one of the tracers are broken and is constantly filling the ring
buffer while the test of the ring buffer is running, it will hang
the box. The reason is that the test is a consumer that will not
stop till the ring buffer is empty. But if the tracer is broken and
is constantly producing input to the buffer, this test will never
end. The result is a lockup of the box.
This happened when KALLSYMS was not defined and the dynamic ftrace
test constantly filled the ring buffer, because the filter failed
and all functions were being traced. Something was being called
that constantly filled the buffer.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
There is nothing really arch specific of the push and pop functions
used by the function graph tracer. This patch moves them to generic
code.
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: api and pipe waiting change
Currently, the waiting used in tracing_read_pipe() is done through a
100 msecs schedule_timeout() loop which periodically check if there
are traces on the buffer.
This can cause small latencies for programs which are reading the incoming
events.
This patch makes the reader waiting for the trace_wait waitqueue except
for few tracers such as the sched and functions tracers which might be
already hold the runqueue lock while waking up the reader.
This is performed through a new callback wait_pipe() on struct tracer.
If none is implemented on a specific tracer, the default waiting for
trace_wait queue is attached.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: clean up
The traceon and traceoff function probes are confusing to developers
to what happens when a counter is not specified. This should help
clear things up.
# echo "*:traceoff" > set_ftrace_filter
# cat /debug/tracing/set_ftrace_filter
#### all functions enabled ####
do_fork:traceoff:unlimited
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: cleanup
Fix incorrect hint message in code and typos in comments.
Signed-off-by: Wenji Huang <wenji.huang@oracle.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
This patch is to fix the return value of trace_selftest_startup_sysprof
and trace_selftest_startup_branch on failure.
Signed-off-by: Wenji Huang <wenji.huang@oracle.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Pass tsk to tracing_record_cmdline instead of current.
Signed-off-by: Wenji Huang <wenji.huang@oracle.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: clean up
Ingo Molnar did not like the _hook naming convention used by the
select function tracer. Luis Claudio R. Goncalves suggested using
the "_probe" extension. This patch implements the change of
calling the functions and variables "_hook" and replacing them
with "_probe".
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Ingo Molnar pointed out some coding style issues with the recent ftrace
updates. This patch cleans them up.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
This patch adds a pretty print version of traceon and traceoff
output for set_ftrace_filter.
# echo 'sys_open:traceon:4' > set_ftrace_filter
# cat set_ftrace_filter
#### all functions enabled ####
sys_open:traceon:count=4
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
This patch adds a call back for the tracers that have hooks to
selected functions. This allows the tracer to show better output
in the set_ftrace_filter file.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
This patch adds output to show what functions have tracer hooks
attached to them.
# echo 'sys_open:traceon:4' > /debug/tracing/set_ftrace_filter
# cat set_ftrace_filter
#### all functions enabled ####
sys_open:ftrace_traceon:0000000000000004
# echo 'do_fork:traceoff:' > set_ftrace_filter
# cat set_ftrace_filter
#### all functions enabled ####
sys_open:ftrace_traceon:0000000000000002
do_fork:ftrace_traceoff:ffffffffffffffff
Note the 4 changed to a 2. This is because The code was executed twice
since the traceoff was added. If a cat is done again:
#### all functions enabled ####
sys_open:ftrace_traceon
do_fork:ftrace_traceoff:ffffffffffffffff
The number disappears. That is because it will not print a NULL.
Callbacks to allow the tracer to pretty print will be implemented soon.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
This patch adds the new function selection commands traceon and
traceoff. traceon sets the function to enable the ring buffers
while traceoff disables the ring buffers. You can pass in the
number of times you want the command to be executed when the function
is hit. It will only execute if the state of the buffers are not
already in that state.
Example:
# echo do_fork:traceon:4
Will enable the ring buffers if they are disabled every time it
hits do_fork, up to 4 times.
# echo sys_close:traceoff
This will disable the ring buffers every time (unlimited) when
sys_close is called.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: new feature
Currently, the function tracer only gives you an ability to hook
a tracer to all functions being traced. The dynamic function trace
allows you to pick and choose which of those functions will be
traced, but all functions being traced will call all tracers that
registered with the function tracer.
This patch adds a new feature that allows a tracer to hook to specific
functions, even when all functions are being traced. It allows for
different functions to call different tracer hooks.
The way this is accomplished is by a special function that will hook
to the function tracer and will set up a hash table knowing which
tracer hook to call with which function. This is the most general
and easiest method to accomplish this. Later, an arch may choose
to supply their own method in changing the mcount call of a function
to call a different tracer. But that will be an exercise for the
future.
To register a function:
struct ftrace_hook_ops {
void (*func)(unsigned long ip,
unsigned long parent_ip,
void **data);
int (*callback)(unsigned long ip, void **data);
void (*free)(void **data);
};
int register_ftrace_function_hook(char *glob, struct ftrace_hook_ops *ops,
void *data);
glob is a simple glob to search for the functions to hook.
ops is a pointer to the operations (listed below)
data is the default data to be passed to the hook functions when traced
ops:
func is the hook function to call when the functions are traced
callback is a callback function that is called when setting up the hash.
That is, if the tracer needs to do something special for each
function, that is being traced, and wants to give each function
its own data. The address of the entry data is passed to this
callback, so that the callback may wish to update the entry to
whatever it would like.
free is a callback for when the entry is freed. In case the tracer
allocated any data, it is give the chance to free it.
To unregister we have three functions:
void
unregister_ftrace_function_hook(char *glob, struct ftrace_hook_ops *ops,
void *data)
This will unregister all hooks that match glob, point to ops, and
have its data matching data. (note, if glob is NULL, blank or '*',
all functions will be tested).
void
unregister_ftrace_function_hook_func(char *glob,
struct ftrace_hook_ops *ops)
This will unregister all functions matching glob that has an entry
pointing to ops.
void unregister_ftrace_function_hook_all(char *glob)
This simply unregisters all funcs.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: clean up
Now that ftrace_lock is a mutex, there is no reason to have three
different mutexes protecting similar data. All the mutex paths
are not in hot paths, so having a mutex to cover more data is
not a problem.
This patch removes the ftrace_sysctl_lock and ftrace_start_lock
and uses the ftrace_lock to protect the locations that were protected
by these locks. By doing so, this change also removes some of
the lock nesting that was taking place.
There are still more mutexes in ftrace.c that can probably be
consolidated, but they can be dealt with later. We need to be careful
about the way the locks are nested, and by consolidating, we can cause
a recursive deadlock.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: clean up
The older versions of ftrace required doing the ftrace list
search under atomic context. Now all the calls are in non-atomic
context. There is no reason to keep the ftrace_lock as a spinlock.
This patch converts it to a mutex.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Allow for other tracers to add their own commands for function
selection. This interface gives a trace the ability to name a
command for function selection. Right now it is pretty limited
in what it offers, but this is a building step for more features.
The :mod: command is converted to this interface and also serves
as a template for other implementations.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: fix to prevent empty set_ftrace_filter and no ftrace output
The function filter is used to only trace a given set of functions.
The filter is enabled when a function name is echoed into the
set_ftrace_filter file. But if the name has a typo and the function
is not found, the filter is enabled, but no function is listed.
This makes a confusing situation where set_ftrace_filter is empty
but no functions ever get enabled for tracing.
For example:
# cat /debug/tracing/set_ftrace_filter
#### all functions enabled ####
# echo bad_name > set_ftrace_filter
# cat /debug/tracing/set_ftrace_filter
# echo function > current_tracer
# cat trace
# tracer: nop
#
# TASK-PID CPU# TIMESTAMP FUNCTION
# | | | | |
This patch changes that to only enable filtering if a function
is set to be filtered on. Now, the filter is not enabled if
a bad name is echoed into set_ftrace_filter.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
This patch adds a "command" syntax to the function filtering files:
/debugfs/tracing/set_ftrace_filter
/debugfs/tracing/set_ftrace_notrace
Of the format: <function>:<command>:<parameter>
The command is optional, and dependent on the command, so are
the parameters.
echo do_fork > set_ftrace_filter
Will only trace 'do_fork'.
echo 'sched_*' > set_ftrace_filter
Will only trace functions starting with the letters 'sched_'.
echo '*:mod:ext3' > set_ftrace_filter
Will trace only the ext3 module functions.
echo '*write*:mod:ext3' > set_ftrace_notrace
Will prevent the ext3 functions with the letters 'write' in
the name from being traced.
echo '!*_allocate:mod:ext3' > set_ftrace_filter
Will remove the functions in ext3 that end with the letters
'_allocate' from the ftrace filter.
Although this patch implements the 'command' format, only the
'mod' command is supported. More commands to follow.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: clean up
ftrace_match_records does a lot of things that other features
can use. This patch breaks up ftrace_match_records and pulls
out ftrace_setup_glob and ftrace_match_record.
ftrace_setup_glob prepares a simple glob expression for use with
ftrace_match_record. ftrace_match_record compares a single record
with a glob type.
Breaking this up will allow for more features to run on individual
records.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: clean up
ftrace_match is too generic of a name. What it really does is
search all records and matches the records with the given string,
and either sets or unsets the functions to be traced depending
on if the parameter 'enable' is set or not.
This allows us to make another function called ftrace_match that
can be used to test a single record.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: clean up
To iterate over all the functions that dynamic trace knows about
it requires two for loops. One to iterate over the pages and the
other to iterate over the records within the page.
There are several duplications of these loops in ftrace.c. This
patch creates the macros do_for_each_ftrace_rec and
while_for_each_ftrace_rec to handle this logic, and removes the
duplicate code.
While making this change, I also discovered and fixed a small
bug that one of the iterations should exit the loop after it found the
record it was searching for. This used a break when it should have
used a goto, since there were two loops it needed to break out
from. No real harm was done by this bug since it would only continue
to search the other records, and the code was in a slow path anyway.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: clean up, make set_ftrace_filter less confusing
The set_ftrace_filter shows only the functions that will be traced.
But when it is empty, it will trace all functions. This can be a bit
confusing.
This patch makes set_ftrace_filter show:
#### all functions enabled ####
When all functions will be traced, and we do not filter only a select
few.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
The function bts_trace_init() references a variable
bts_hotcpu_notifier which is marked
as __cpuinitdata. Thus causes section mismatch. This patch fixes it.
LD kernel/trace/built-in.o
WARNING: kernel/trace/built-in.o(.text+0xc90c): Section mismatch in
reference from the function bts_trace_init() to the variable
.cpuinit.data:bts_hotcpu_notifier
The function bts_trace_init() references
the variable __cpuinitdata bts_hotcpu_notifier.
This is often because bts_trace_init lacks a __cpuinitdata
annotation or the annotation of bts_hotcpu_notifier is wrong.
WARNING: kernel/trace/built-in.o(.text+0xc92a): Section mismatch in
reference from the function bts_trace_reset() to the variable
.cpuinit.data:bts_hotcpu_notifier
The function bts_trace_reset() references
the variable __cpuinitdata bts_hotcpu_notifier.
This is often because bts_trace_reset lacks a __cpuinitdata
annotation or the annotation of bts_hotcpu_notifier is wrong.
Signed-off-by: Rakib Mullick <rakib.mullick@gmail.com>
Cc: markus.t.metzger@gmail.com
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cosmetic change in Kconfig menu layout
This patch was originally suggested by Peter Zijlstra, but seems it
was forgotten.
CONFIG_MMIOTRACE and CONFIG_MMIOTRACE_TEST were selectable
directly under the Kernel hacking / debugging menu in the kernel
configuration system. They were present only for x86 and x86_64.
Other tracers that use the ftrace tracing framework are in their own
sub-menu. This patch moves the mmiotrace configuration options there.
Since the Kconfig file, where the tracer menu is, is not architecture
specific, HAVE_MMIOTRACE_SUPPORT is introduced and provided only by
x86/x86_64. CONFIG_MMIOTRACE now depends on it.
Signed-off-by: Pekka Paalanen <pq@iki.fi>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: enhances lost events counting in mmiotrace
The tracing framework, or the ring buffer facility it uses, has a switch
to stop recording data. When recording is off, the trace events will be
lost. The framework does not count these, so mmiotrace has to count them
itself.
Signed-off-by: Pekka Paalanen <pq@iki.fi>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Convert the c/p state "power" tracer to use tracepoints. Avoids a
function call when the tracer is disabled.
Signed-off-by: Jason Baron <jbaron@redhat.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: clean up
While reviewing the ring buffer code, I thougth I saw a bug with
if (!__raw_spin_trylock(&cpu_buffer->lock))
goto out_unlock;
But I forgot that we use a variable "lock_taken" that is set if
the spinlock is taken, and only unlock it if that variable is set.
To avoid further confusion from other reviewers, this patch
renames the label out_unlock with out_reset, which is the more
appropriate name.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Add the missing pair tracing_{start,stop}_record_cmdline() to record well
the cmdline associated with pid.
Changes in v2:
- fix a build error, the sched_switch tracer is needed to record the
cmdline.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Fix these sparse warnings:
kernel/trace/ring_buffer.c:70:37: warning: incorrect type in argument 2 (different signedness)
kernel/trace/ring_buffer.c:84:39: warning: incorrect type in argument 2 (different signedness)
kernel/trace/ring_buffer.c:96:43: warning: incorrect type in argument 2 (different signedness)
kernel/trace/ring_buffer.c:2475:13: warning: incorrect type in argument 2 (different signedness)
kernel/trace/ring_buffer.c:2475:13: warning: incorrect type in argument 2 (different signedness)
kernel/trace/ring_buffer.c:2478:42: warning: incorrect type in argument 2 (different signedness)
kernel/trace/ring_buffer.c:2478:42: warning: incorrect type in argument 2 (different signedness)
kernel/trace/ring_buffer.c:2500:40: warning: incorrect type in argument 3 (different signedness)
kernel/trace/ring_buffer.c:2505:44: warning: incorrect type in argument 2 (different signedness)
kernel/trace/ring_buffer.c:2507:46: warning: incorrect type in argument 2 (different signedness)
kernel/trace/trace.c:2130:40: warning: incorrect type in argument 3 (different signedness)
kernel/trace/trace.c:2280:40: warning: incorrect type in argument 3 (different signedness)
Signed-off-by: Hannes Eder <hannes@hanneseder.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: make global variables and a global function static
The function '__trace_userstack' does not seem to have a caller, so it
is commented out.
Fix this sparse warnings:
kernel/trace/trace.c:82:5: warning: symbol 'tracing_disabled' was not declared. Should it be static?
kernel/trace/trace.c:600:10: warning: symbol 'trace_record_cmdline_disabled' was not declared. Should it be static?
kernel/trace/trace.c:957:6: warning: symbol '__trace_userstack' was not declared. Should it be static?
kernel/trace/trace.c:1694:5: warning: symbol 'tracing_release' was not declared. Should it be static?
Signed-off-by: Hannes Eder <hannes@hanneseder.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch is to make the function return early on failure, and give
correct return value on success.
Signed-off-by: Wenji Huang <wenji.huang@oracle.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
The C99 specification states in section 6.11.5:
The placement of a storage-class specifier other than at the beginning
of the declaration specifiers in a declaration is an obsolescent
feature.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: change API and init bpage when copy
ring_buffer_read_page()/rb_remove_entries() may be called for
a partially consumed page.
Add a parameter for rb_remove_entries() and make it update
cpu_buffer->entries correctly for partially consumed pages.
ring_buffer_read_page() now returns the offset to the next event.
Init the bpage's time_stamp when return value is 0.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: Fix bug
I found several very very curious line.
It's so curious that it may be brought by typing mistake.
When (cpu_buffer->reader_page == cpu_buffer->commit_page):
1) We haven't copied it for bpage is changed:
bpage = cpu_buffer->reader_page->page;
memcpy(bpage->data, cpu_buffer->reader_page->page->data + read ... )
2) We need update cpu_buffer->reader_page->read, but
"cpu_buffer->reader_page += read;" is not right.
[
This bug was a typo. The commit->reader_page is a page pointer
and not an index into the page. The line should have been
commit->reader_page->read += read. The other changes
by Lai are nice clean ups to the code. - SDR
]
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Ingo Molnar suggested a series of clean ups for the splice code.
This patch implements those suggestions.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
This moves the pipe waiting code from tracing_read_pipe() into
tracing_wait_pipe(), which is useful to implement other fops, like
splice_read.
Signed-off-by: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Added and implemented tracing_pipe_fops->splice_read(). This allows
userspace programs to get tracing data more efficiently.
Signed-off-by: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
When one cats the trace file, the leaf functions are printed without brackets:
function();
whereas in the trace_pipe file we'll see the following:
function() {
}
This is because the ring_buffer handling is not the same between those two files.
On the trace file, when an entry is printed, the iterator advanced and then we can
check the next entry.
There is no iterator with trace_pipe, the current entry to print has been peeked
and not consumed. So checking the next entry will still return the current one while
we don't consume it.
This patch introduces a new value for the output callbacks to ask the tracing
core to not consume the current entry after printing it.
We need it because we will have to consume the current entry ourself to check
the next one.
Now the trace_pipe is able to handle well the leaf functions.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: build fix
The BLK_DEV_IO_TRACE entry used to be in block/Kconfig - which
file itself was dependent on CONFIG_BLOCK. But now the entry is
in kernel/trace/Kconfig - which is present even on !CONFIG_BLOCK.
So add a 'depends on BLOCK' to BLK_DEV_IO_TRACE.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: simplification
Instead of requiring that plugins have the sequence:
my_tracer_stop(my_trace_array);
unregister_tracer(my_tracer);
it should be possible just do a:
unregister_tracer(my_tracer);
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Move the power tracer headers to trace/power.h to keep ftrace.h and power bits
more easy to maintain as separated topics.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Making it more easy to do a basic regression test for this tracer.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: clean up
Fixed several typos in the comments.
Signed-off-by: Wenji Huang <wenji.huang@oracle.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: clean up
Now that a generic in_nmi is available, this patch removes the
special code in the ring_buffer and implements the in_nmi generic
version instead.
With this change, I was also able to rename the "arch_ftrace_nmi_enter"
back to "ftrace_nmi_enter" and remove the code from the ring buffer.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: prevent deadlock in NMI
The ring buffers are not yet totally lockless with writing to
the buffer. When a writer crosses a page, it grabs a per cpu spinlock
to protect against a reader. The spinlocks taken by a writer are not
to protect against other writers, since a writer can only write to
its own per cpu buffer. The spinlocks protect against readers that
can touch any cpu buffer. The writers are made to be reentrant
with the spinlocks disabling interrupts.
The problem arises when an NMI writes to the buffer, and that write
crosses a page boundary. If it grabs a spinlock, it can be racing
with another writer (since disabling interrupts does not protect
against NMIs) or with a reader on the same CPU. Luckily, most of the
users are not reentrant and protects against this issue. But if a
user of the ring buffer becomes reentrant (which is what the ring
buffers do allow), if the NMI also writes to the ring buffer then
we risk the chance of a deadlock.
This patch moves the ftrace_nmi_enter called by nmi_enter() to the
ring buffer code. It replaces the current ftrace_nmi_enter that is
used by arch specific code to arch_ftrace_nmi_enter and updates
the Kconfig to handle it.
When an NMI is called, it will set a per cpu variable in the ring buffer
code and will clear it when the NMI exits. If a write to the ring buffer
crosses page boundaries inside an NMI, a trylock is used on the spin
lock instead. If the spinlock fails to be acquired, then the entry
is discarded.
This bug appeared in the ftrace work in the RT tree, where event tracing
is reentrant. This workaround solved the deadlocks that appeared there.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: fix to prevent developers from using entry->cpu
With the new ring buffer infrastructure, the cpu for the entry is
implicit with which CPU buffer it is on.
The original code use to record the current cpu into the generic
entry header, which can be retrieved by entry->cpu. When the
ring buffer was introduced, the users were convert to use the
the cpu number of which cpu ring buffer was in use (this was passed
to the tracers by the iterator: iter->cpu).
Unfortunately, the cpu item in the entry structure was never removed.
This allowed for developers to use it instead of the proper iter->cpu,
unknowingly, using an uninitialized variable. This was not the fault
of the developers, since it would seem like the logical place to
retrieve the cpu identifier.
This patch removes the cpu item from the entry structure and fixes
all the users that should have been using iter->cpu.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: cleanup
To make it easy for ftrace plugin writers, as this was open coded in
the existing plugins
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Acked-by: Frédéric Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: new API
These new functions do what previously was being open coded, reducing
the number of details ftrace plugin writers have to worry about.
It also standardizes the handling of stacktrace, userstacktrace and
other trace options we may introduce in the future.
With this patch, for instance, the blk tracer (and some others already
in the tree) can use the "userstacktrace" /d/tracing/trace_options
facility.
$ codiff /tmp/vmlinux.before /tmp/vmlinux.after
linux-2.6-tip/kernel/trace/trace.c:
trace_vprintk | -5
trace_graph_return | -22
trace_graph_entry | -26
trace_function | -45
__ftrace_trace_stack | -27
ftrace_trace_userstack | -29
tracing_sched_switch_trace | -66
tracing_stop | +1
trace_seq_to_user | -1
ftrace_trace_special | -63
ftrace_special | +1
tracing_sched_wakeup_trace | -70
tracing_reset_online_cpus | -1
13 functions changed, 2 bytes added, 355 bytes removed, diff: -353
linux-2.6-tip/block/blktrace.c:
__blk_add_trace | -58
1 function changed, 58 bytes removed, diff: -58
linux-2.6-tip/kernel/trace/trace.c:
trace_buffer_lock_reserve | +88
trace_buffer_unlock_commit | +86
2 functions changed, 174 bytes added, diff: +174
/tmp/vmlinux.after:
16 functions changed, 176 bytes added, 413 bytes removed, diff: -237
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Acked-by: Frédéric Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Ingo Molnar suggested using goto logic to keep the indentation
down and to be able to remove the nasty line breaks. This actually
makes the code a bit more readable.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: simplification of tracers
As all tracers are doing this we might as well do it in
register_ftrace_event and save one branch each time we call these
callbacks.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
As they actually all return these enumerators.
Reported-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: bugfix and cleanup
Some callsites were returning either TRACE_ITER_PARTIAL_LINE if the
trace_seq routines (trace_seq_printf, etc) returned 0 meaning its buffer
was full, or zero otherwise.
But...
/* Return values for print_line callback */
enum print_line_t {
TRACE_TYPE_PARTIAL_LINE = 0, /* Retry after flushing the seq */
TRACE_TYPE_HANDLED = 1,
TRACE_TYPE_UNHANDLED = 2 /* Relay to other output functions */
};
In other cases the return value was not being relayed at all.
Most of the time it didn't hurt because the page wasn't get filled, but
for correctness sake, handle the return values everywhere.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
"ftrace: use struct pid" commit 978f3a45d9
converted ftrace_pid_trace to "struct pid*".
But we can't use do_each_pid_task() without rcu_read_lock() even if
we know the pid itself can't go away (it was pinned in ftrace_pid_write).
The exiting task can detach itself from this pid at any moment.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: API change
The trace_seq and trace_entry are in trace_iterator, where there are
more fields that may be needed by tracers, so just pass the
tracer_iterator as is already the case for struct tracer->print_line.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: make trace_event more convenient for tracers
All tracers (for the moment) that use the struct trace_event want to
have the context info printed before their own output: the pid/cmdline,
cpu, and timestamp.
But some other tracers that want to implement their trace_event
callbacks will not necessary need these information or they may want to
format them as they want.
This patch adds a new default-enabled trace option:
TRACE_ITER_CONTEXT_INFO When disabled through:
echo nocontext-info > /debugfs/tracing/trace_options
The pid, cpu and timestamps headers will not be printed.
IE with the sched_switch tracer with context-info (default):
bash-2935 [001] 100.356561: 2935:120:S ==> [001] 0:140:R <idle>
<idle>-0 [000] 100.412804: 0:140:R + [000] 11:115:S events/0
<idle>-0 [000] 100.412816: 0:140:R ==> [000] 11:115:R events/0
events/0-11 [000] 100.412829: 11:115:S ==> [000] 0:140:R <idle>
Without context-info:
2935:120:S ==> [001] 0:140:R <idle>
0:140:R + [000] 11:115:S events/0
0:140:R ==> [000] 11:115:R events/0
11:115:S ==> [000] 0:140:R <idle>
A tracer can disable it at runtime by clearing the bit
TRACE_ITER_CONTEXT_INFO in trace_flags.
The print routines were renamed to trace_print_context and
trace_print_lat_context, so that they can be used by tracers if they
want to use them for one of the trace_event callbacks.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Now that we have a working ftrace=<tracer> function, make the boot
tracer get activated by it. This way we can turn it on or off without
recompiling the kernel, as well as keeping the selftests on. The
selftests are disabled whenever a default tracer starts running.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Peter Zijlstra started the functionality to start up a default
tracing at bootup. This patch finishes the work.
Now if you add 'ftrace=<tracer>' to the command line, when that tracer
is registered on bootup, that tracer is selected and starts tracing.
Note, all selftests for tracers that are registered after this tracer
is disabled. This prevents the selftests from disturbing the running
tracer, or the running tracer from disturbing the selftest.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: New way of using the blktrace infrastructure
This drops the requirement of userspace utilities to use the blktrace
facility.
Configuration is done thru sysfs, adding a "trace" directory to the
partition directory where blktrace can be enabled for the associated
request_queue.
The same filters present in the IOCTL interface are present as sysfs
device attributes.
The /sys/block/sdX/sdXN/trace/enable file allows tracing without any
filters.
The other files in this directory: pid, act_mask, start_lba and end_lba
can be used with the same meaning as with the IOCTL interface.
Using the sysfs interface will only setup the request_queue->blk_trace
fields, tracing will only take place when the "blk" tracer is selected
via the ftrace interface, as in the following example:
To see the trace, one can use the /d/tracing/trace file or the
/d/tracign/trace_pipe file, with semantics defined in the ftrace
documentation in Documentation/ftrace.txt.
[root@f10-1 ~]# cat /t/trace
kjournald-305 [000] 3046.491224: 8,1 A WBS 6367 + 8 <- (8,1) 6304
kjournald-305 [000] 3046.491227: 8,1 Q R 6367 + 8 [kjournald]
kjournald-305 [000] 3046.491236: 8,1 G RB 6367 + 8 [kjournald]
kjournald-305 [000] 3046.491239: 8,1 P NS [kjournald]
kjournald-305 [000] 3046.491242: 8,1 I RBS 6367 + 8 [kjournald]
kjournald-305 [000] 3046.491251: 8,1 D WB 6367 + 8 [kjournald]
kjournald-305 [000] 3046.491610: 8,1 U WS [kjournald] 1
<idle>-0 [000] 3046.511914: 8,1 C RS 6367 + 8 [6367]
[root@f10-1 ~]#
The default line context (prefix) format is the one described in the ftrace
documentation, with the blktrace specific bits using its existing format,
described in blkparse(8).
If one wants to have the classic blktrace formatting, this is possible by
using:
[root@f10-1 ~]# echo blk_classic > /t/trace_options
[root@f10-1 ~]# cat /t/trace
8,1 0 3046.491224 305 A WBS 6367 + 8 <- (8,1) 6304
8,1 0 3046.491227 305 Q R 6367 + 8 [kjournald]
8,1 0 3046.491236 305 G RB 6367 + 8 [kjournald]
8,1 0 3046.491239 305 P NS [kjournald]
8,1 0 3046.491242 305 I RBS 6367 + 8 [kjournald]
8,1 0 3046.491251 305 D WB 6367 + 8 [kjournald]
8,1 0 3046.491610 305 U WS [kjournald] 1
8,1 0 3046.511914 0 C RS 6367 + 8 [6367]
[root@f10-1 ~]#
Using the ftrace standard format allows more flexibility, such
as the ability of asking for backtraces via trace_options:
[root@f10-1 ~]# echo noblk_classic > /t/trace_options
[root@f10-1 ~]# echo stacktrace > /t/trace_options
[root@f10-1 ~]# cat /t/trace
kjournald-305 [000] 3318.826779: 8,1 A WBS 6375 + 8 <- (8,1) 6312
kjournald-305 [000] 3318.826782:
<= submit_bio
<= submit_bh
<= sync_dirty_buffer
<= journal_commit_transaction
<= kjournald
<= kthread
<= child_rip
kjournald-305 [000] 3318.826836: 8,1 Q R 6375 + 8 [kjournald]
kjournald-305 [000] 3318.826837:
<= generic_make_request
<= submit_bio
<= submit_bh
<= sync_dirty_buffer
<= journal_commit_transaction
<= kjournald
<= kthread
Please read the ftrace documentation to use aditional, standardized
tracing filters such as /d/tracing/trace_cpumask, etc.
See also /d/tracing/trace_mark to add comments in the trace stream,
that is equivalent to the /d/block/sdaN/msg interface.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Fix kmemtrace printk warnings:
kernel/trace/kmemtrace.c:142: warning: format '%4ld' expects type 'long int', but argument 3 has type 'size_t'
kernel/trace/kmemtrace.c:147: warning: format '%4ld' expects type 'long int', but argument 3 has type 'size_t'
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Acked-by: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch brings various bugfixes:
- Drop the first irrelevant task switch on the very beginning of a trace.
- Drop the OVERHEAD word from the headers, the DURATION word is sufficient
and will not overlap other columns.
- Make the headers fit well their respective columns whatever the
selected options.
Ie, default options:
# tracer: function_graph
#
# CPU DURATION FUNCTION CALLS
# | | | | | | |
1) 0.646 us | }
1) | mem_cgroup_del_lru_list() {
1) 0.624 us | lookup_page_cgroup();
1) 1.970 us | }
echo funcgraph-proc > trace_options
# tracer: function_graph
#
# CPU TASK/PID DURATION FUNCTION CALLS
# | | | | | | | | |
0) bash-2937 | 0.895 us | }
0) bash-2937 | 0.888 us | __rcu_read_unlock();
0) bash-2937 | 0.864 us | conv_uni_to_pc();
0) bash-2937 | 1.015 us | __rcu_read_lock();
echo nofuncgraph-cpu > trace_options
echo nofuncgraph-proc > trace_options
# tracer: function_graph
#
# DURATION FUNCTION CALLS
# | | | | | |
3.752 us | native_pud_val();
0.616 us | native_pud_val();
0.624 us | native_pmd_val();
About features, one can now disable the duration (this will hide the
overhead too for convenient reasons and because on doesn't need
overhead if it hasn't the duration):
echo nofuncgraph-duration > trace_options
# tracer: function_graph
#
# FUNCTION CALLS
# | | | |
cap_vm_enough_memory() {
__vm_enough_memory() {
vm_acct_memory();
}
}
}
And at last, an option to print the absolute time:
//Restart from default options
echo funcgraph-abstime > trace_options
# tracer: function_graph
#
# TIME CPU DURATION FUNCTION CALLS
# | | | | | | | |
261.339774 | 1) + 42.823 us | }
261.339775 | 1) 1.045 us | _spin_lock_irq();
261.339777 | 1) 0.940 us | _spin_lock_irqsave();
261.339778 | 1) 0.752 us | _spin_unlock_irqrestore();
261.339780 | 1) 0.857 us | _spin_unlock_irq();
261.339782 | 1) | flush_to_ldisc() {
261.339783 | 1) | tty_ldisc_ref() {
261.339783 | 1) | tty_ldisc_try() {
261.339784 | 1) 1.075 us | _spin_lock_irqsave();
261.339786 | 1) 0.842 us | _spin_unlock_irqrestore();
261.339788 | 1) 4.211 us | }
261.339788 | 1) 5.662 us | }
The format is seconds.usecs.
I guess no one needs the nanosec precision here, the main goal is to have
an overview about the general timings of events, and to see the place when
the trace switches from one cpu to another.
ie:
274.874760 | 1) 0.676 us | _spin_unlock();
274.874762 | 1) 0.609 us | native_load_sp0();
274.874763 | 1) 0.602 us | native_load_tls();
274.878739 | 0) 0.722 us | }
274.878740 | 0) 0.714 us | native_pmd_val();
274.878741 | 0) 0.730 us | native_pmd_val();
Here there is a 4000 usecs difference when we switch the cpu.
Changes in V2:
- Completely fix the first pointless task switch.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The logic in the tracing_start/stop code prevents the WARN_ON
from ever detecting if a start/stop pair was mismatched.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>