Despite that the tracepoint record is always present when the
PERF_SAMPLE_TP_RECORD flag is set, gcc raises a warning,
thinking it might not be initialized:
kernel/perf_counter.c: In function ‘perf_counter_output’:
kernel/perf_counter.c:2650: warning: ‘tp’ may be used uninitialized in this function
Then, initialize it to NULL and always check if it's not NULL
before dereference it.
Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
LKML-Reference: <1249698400-5441-2-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Reimplement the software counters to deal with fast moving
event sources (such as tracepoints). This means being able
to generate multiple overflows from a single 'event' as well
as support throttling.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch implements the kernel side support for ftrace event
record sampling.
A new counter sampling attribute is added:
PERF_SAMPLE_TP_RECORD
which requests ftrace events record sampling. In this case
if a PERF_TYPE_TRACEPOINT counter is active and a tracepoint
fires, we emit the tracepoint binary record to the
perfcounter event buffer, as a sample.
Result, after setting PERF_SAMPLE_TP_RECORD attribute from perf
record:
perf record -f -F 1 -a -e workqueue:workqueue_execution
perf report -D
0x21e18 [0x48]: event: 9
.
. ... raw event: size 72 bytes
. 0000: 09 00 00 00 01 00 48 00 d0 c7 00 81 ff ff ff ff ......H........
. 0010: 0a 00 00 00 0a 00 00 00 21 00 00 00 00 00 00 00 ........!......
. 0020: 2b 00 01 02 0a 00 00 00 0a 00 00 00 65 76 65 6e +...........eve
. 0030: 74 73 2f 31 00 00 00 00 00 00 00 00 0a 00 00 00 ts/1...........
. 0040: e0 b1 31 81 ff ff ff ff .......
.
0x21e18 [0x48]: PERF_EVENT_SAMPLE (IP, 1): 10: 0xffffffff8100c7d0 period: 33
The raw ftrace binary record starts at offset 0020.
Translation:
struct trace_entry {
type = 0x2b = 43;
flags = 1;
preempt_count = 2;
pid = 0xa = 10;
tgid = 0xa = 10;
}
thread_comm = "events/1"
thread_pid = 0xa = 10;
func = 0xffffffff8131b1e0 = flush_to_ldisc()
What will come next?
- Userspace support ('perf trace'), 'flight data recorder' mode
for perf trace, etc.
- The unconditional copy from the profiling callback brings
some costs however if someone wants no such sampling to
occur, and needs to be fixed in the future. For that we need
to have an instant access to the perf counter attribute.
This is a matter of a flag to add in the struct ftrace_event.
- Take care of the events recursivity! Don't ever try to record
a lock event for example, it seems some locking is used in
the profiling fast path and lead to a tracing recursivity.
That will be fixed using raw spinlock or recursivity
protection.
- [...]
- Profit! :-)
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Tom Zanussi <tzanussi@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Gabriel Munteanu <eduard.munteanu@linux360.ro>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Adds possible second part to the assign argument of TP_EVENT().
TP_perf_assign(
__perf_count(foo);
__perf_addr(bar);
)
Which, when specified make the swcounter increment with @foo instead
of the usual 1, and report @bar for PERF_SAMPLE_ADDR (data address
associated with the event) when this triggers a counter overflow.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Merge reason: Merge up to almost-rc6 to pick up latest perfcounters
(on which we'll queue up a dependent fix)
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This takes care of the following entry from Dan's list:
kernel/irq/resend.c +73 check_irq_resend(17) warning: variable derefenced before check 'desc->chip'
Reported-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Eugene Teo <eteo@redhat.com>
Cc: Julia Lawall <julia@diku.dk>
LKML-Reference: <200908062146.03638.bzolnier@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When the process exits we don't have to run new cputimer nor
use running one (as it not accounts when tsk->exit_state != 0)
to get process CPU times. As there is only one thread we can
just use CPU times fields from task and signal structs.
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Roland McGrath <roland@redhat.com>
Cc: Vitaly Mayatskikh <vmayatsk@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Don't return when we find the first bootconsole - it can leave
other bootconsoles still installed, and they can be used and
cause problems later (if they are in the init section, and
eventually released), and cause problems. Make sure we remove
all of them.
Signed-off-by: Sonic Zhang <Sonic.Zhang@analog.com>
Signed-off-by: Robin Getz <rgetz@analog.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Dan Carpenter sent me a fix to prevent pred from being used if
it couldn't be allocated. This updates his patch for the same
problem in the tracing tree (which has changed this code quite
substantially).
Reported-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Tom Zanussi <tzanussi@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <1249746576.6453.30.camel@tropicana>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The original report:
create_logical_pred() could sometimes return NULL.
It's a static checker complaining rather than problems at runtime...
If filter_add_subsystem_pred() fails due to ENOSPC or ENOMEM,
the pred doesn't get freed, while as a side effect it does for
other errors. Make it so the caller always frees the pred for
any error.
Signed-off-by: Tom Zanussi <tzanussi@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <1249746593.6453.32.camel@tropicana>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Dan Carpenter sent me a fix to prevent pred from being used if
it couldn't be allocated. I noticed the same problem also
existed for the create_pred() case and added a fix for that.
Reported-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Tom Zanussi <tzanussi@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <1249746549.6453.29.camel@tropicana>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Don't move it if target node is -1.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <4A785B5D.4070702@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'perfcounters-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
perf_counter: Fix double list iteration in per task precise stats
perf: Auto-detect libelf
perf symbol: Fix symbol parsing in certain cases: use the build-id as a symlink
perf_counter/powerpc: Check oprofile_cpu_type for NULL before using it
ftrace: Fix perf-tracepoint OOPS
perf report: Add missing command line options to man page
perf: Auto-detect libbfd
perf report: Make --sort comm,dso,symbol the default
While looking at Jens Rosenboom bug report
(http://lkml.org/lkml/2009/7/27/35) about strange sys_futex call done from
a dying "ps" program, we found following problem.
clone() syscall has special support for TID of created threads. This
support includes two features.
One (CLONE_CHILD_SETTID) is to set an integer into user memory with the
TID value.
One (CLONE_CHILD_CLEARTID) is to clear this same integer once the created
thread dies.
The integer location is a user provided pointer, provided at clone()
time.
kernel keeps this pointer value into current->clear_child_tid.
At execve() time, we should make sure kernel doesnt keep this user
provided pointer, as full user memory is replaced by a new one.
As glibc fork() actually uses clone() syscall with CLONE_CHILD_SETTID and
CLONE_CHILD_CLEARTID set, chances are high that we might corrupt user
memory in forked processes.
Following sequence could happen:
1) bash (or any program) starts a new process, by a fork() call that
glibc maps to a clone( ... CLONE_CHILD_SETTID | CLONE_CHILD_CLEARTID
...) syscall
2) When new process starts, its current->clear_child_tid is set to a
location that has a meaning only in bash (or initial program) context
(&THREAD_SELF->tid)
3) This new process does the execve() syscall to start a new program.
current->clear_child_tid is left unchanged (a non NULL value)
4) If this new program creates some threads, and initial thread exits,
kernel will attempt to clear the integer pointed by
current->clear_child_tid from mm_release() :
if (tsk->clear_child_tid
&& !(tsk->flags & PF_SIGNALED)
&& atomic_read(&mm->mm_users) > 1) {
u32 __user * tidptr = tsk->clear_child_tid;
tsk->clear_child_tid = NULL;
/*
* We don't check the error code - if userspace has
* not set up a proper pointer then tough luck.
*/
<< here >> put_user(0, tidptr);
sys_futex(tidptr, FUTEX_WAKE, 1, NULL, NULL, 0);
}
5) OR : if new program is not multi-threaded, but spied by /proc/pid
users (ps command for example), mm_users > 1, and the exiting program
could corrupt 4 bytes in a persistent memory area (shm or memory mapped
file)
If current->clear_child_tid points to a writeable portion of memory of the
new program, kernel happily and silently corrupts 4 bytes of memory, with
unexpected effects.
Fix is straightforward and should not break any sane program.
Reported-by: Jens Rosenboom <jens@mcbone.net>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sonny Rao <sonnyrao@us.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ulrich Drepper <drepper@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use CONFIG_HOTPLUG_CPU, not CONFIG_CPU_HOTPLUG
When hot-unpluging a cpu, it will leak memory allocated at cpu hotplug,
but only if CPUMASK_OFFSTACK=y, which is default to n.
The bug was introduced by 8969a5ede0
("generic-ipi: remove kmalloc()").
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I noticed oprofile memleaked in linux-2.6 current tree,
and tracked this ring-buffer leak.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
LKML-Reference: <4A7C06B9.2090302@gmail.com>
Cc: stable@kernel.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Brice Goglin reported this crash with per task precise stats:
> I finally managed to test the threaded perfcounter statistics (thanks a
> lot for implementing it). I am running 2.6.31-rc5 (with the AMD
> magny-cours patches but I don't think they matter here). I am trying to
> measure local/remote memory accesses per thread during the well-known
> stream benchmark. It's compiled with OpenMP using 16 threads on a
> quad-socket quad-core barcelona machine.
>
> Command line is:
> /mnt/scratch/bgoglin/cpunode/linux-2.6.31/tools/perf/perf record -f -s
> -e r1000001e0 -e r1000002e0 -e r1000004e0 -e r1000008e0 ./stream
>
> It seems to work fine with a single -e <counter> on the command line
> while it crashes when there are at least 2 of them.
> It seems to work fine without -s as well.
A silly copy-paste resulted in a messed up iteration which would
cause the OOPS.
Reported-by: Brice Goglin <Brice.Goglin@inria.fr>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Tested-by: Brice Goglin <Brice.Goglin@inria.fr>
LKML-Reference: <1249574786.32113.550.camel@twins>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When calling rb_buffer_peek() from ring_buffer_consume() and a
padding event is returned, the function rb_advance_reader() is
called twice. This may lead to missing samples or under high
workloads to the warning below. This patch fixes this. If a padding
event is returned by rb_buffer_peek() it will be consumed by the
calling function now.
Also, I simplified some code in ring_buffer_consume().
------------[ cut here ]------------
WARNING: at /dev/shm/.source/linux/kernel/trace/ring_buffer.c:2289 rb_advance_reader+0x2e/0xc5()
Hardware name: Anaheim
Modules linked in:
Pid: 29, comm: events/2 Tainted: G W 2.6.31-rc3-oprofile-x86_64-standard-00059-g5050dc2 #1
Call Trace:
[<ffffffff8106776f>] ? rb_advance_reader+0x2e/0xc5
[<ffffffff81039ffe>] warn_slowpath_common+0x77/0x8f
[<ffffffff8103a025>] warn_slowpath_null+0xf/0x11
[<ffffffff8106776f>] rb_advance_reader+0x2e/0xc5
[<ffffffff81068bda>] ring_buffer_consume+0xa0/0xd2
[<ffffffff81326933>] op_cpu_buffer_read_entry+0x21/0x9e
[<ffffffff810be3af>] ? __find_get_block+0x4b/0x165
[<ffffffff8132749b>] sync_buffer+0xa5/0x401
[<ffffffff810be3af>] ? __find_get_block+0x4b/0x165
[<ffffffff81326c1b>] ? wq_sync_buffer+0x0/0x78
[<ffffffff81326c76>] wq_sync_buffer+0x5b/0x78
[<ffffffff8104aa30>] worker_thread+0x113/0x1ac
[<ffffffff8104dd95>] ? autoremove_wake_function+0x0/0x38
[<ffffffff8104a91d>] ? worker_thread+0x0/0x1ac
[<ffffffff8104dc9a>] kthread+0x88/0x92
[<ffffffff8100bdba>] child_rip+0xa/0x20
[<ffffffff8104dc12>] ? kthread+0x0/0x92
[<ffffffff8100bdb0>] ? child_rip+0x0/0x20
---[ end trace f561c0a58fcc89bd ]---
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: <stable@kernel.org>
Signed-off-by: Robert Richter <robert.richter@amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
If we disable modules, we get the following warning in ftrace events
file:
kernel/trace/trace_events.c:912: attention : ‘remove_subsystem_dir’ defined but not used
remove_subystem_dir() is useless if !CONFIG_MODULES, then move it to
the appropriate #ifdef section of trace_events.c
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
The function graph events helpers which insert the function entry and
return events into the ring buffer currently reside in trace.c
But this file is quite overloaded and the right place for these helpers
is in the function graph tracer file.
Then move them to trace_functions_graph.c
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
The sched events helpers which insert the sched switch and wakeup
events into the ring buffer currently reside in trace.c
But this file is quite overloaded and the right place for these helpers
is in the sched switch tracer file.
Then move them to trace_functions.c
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Make the stacktrace event insertion helpers globals.
This has two effects:
- Prepare for moving the sched events insertion helpers to
the sched switch tracer file.
- Move some ifdef outside function definitions
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
In order to prepare the moving of the function graph tracer insertion
helpers from trace.c to trace_functions_graph.c, we need to export the
ftrace_cpu_disabled variable.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
print_graph_cpu() is little over-designed.
And "log10_all" may be wrong when there are holes in cpu_online_mask:
the max online cpu id > cpumask_weight(cpu_online_mask)
So change it by using a static column length for the cpu matching
nr_cpu_ids number of decimal characters.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <4A6EEE5E.2000001@cn.fujitsu.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Not all tracepoints are created equal, in specific the ftrace
tracepoints are created with TRACE_EVENT_FORMAT() which does
not generate the needed bits to tie them into perf counters.
For those events, don't create the 'id' file and fail
->profile_enable when their ID is specified through other
means.
Reported-by: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <1249497664.5890.4.camel@laptop>
[ v2: fix build error in the !CONFIG_EVENT_PROFILE case ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
In the event of a lock steal or owner died,
rt_mutex_start_proxy_lock() will give the rt_mutex to the
waiting task, but it fails to release the wait_lock. This leads
to subsequent deadlocks when other tasks try to acquire the
rt_mutex.
I also removed a few extra blank lines that really spaced this
routine out. I must have been high on the \n when I wrote this
originally...
Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Dinakar Guniguntala <dino@in.ibm.com>
Cc: John Stultz <johnstul@linux.vnet.ibm.com>
LKML-Reference: <4A79D7F1.4000405@us.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The commit:
commit e0fdace10e
Author: David Miller <davem@davemloft.net>
Date: Fri Aug 1 01:11:22 2008 -0700
debug_locks: set oops_in_progress if we will log messages.
Otherwise lock debugging messages on runqueue locks can deadlock the
system due to the wakeups performed by printk().
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Will permanently set oops_in_progress on any lockdep failure.
When this triggers it will cause any read from the ring buffer to
permanently disable the ring buffer (not to mention no locking of
printk).
This patch removes the check. It keeps the print in NMI which makes
sense. This is probably OK, since the ring buffer should not cause
something to set oops_in_progress anyway.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The function ring_buffer_discard_commit inversed the code path
of the result of try_to_discard. It should skip incrementing the
entry counter if try_to_discard succeeded. But instead, it increments
the entry conder if it succeeded to discard, and does not increment
it if it fails.
The result of this bug is that filtering will make the stat counters
incorrect.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently SELinux enforcement of controls on the ability to map low memory
is determined by the mmap_min_addr tunable. This patch causes SELinux to
ignore the tunable and instead use a seperate Kconfig option specific to how
much space the LSM should protect.
The tunable will now only control the need for CAP_SYS_RAWIO and SELinux
permissions will always protect the amount of low memory designated by
CONFIG_LSM_MMAP_MIN_ADDR.
This allows users who need to disable the mmap_min_addr controls (usual reason
being they run WINE as a non-root user) to do so and still have SELinux
controls preventing confined domains (like a web server) from being able to
map some area of low memory.
Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>
* 'perfcounters-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
perf_counter: Set the CONFIG_PERF_COUNTERS default to y if CONFIG_PROFILING=y
perf: Fix read buffer overflow
perf top: Add mwait_idle_with_hints to skip_symbols[]
perf tools: Fix faulty check
perf report: Update for the new FORK/EXIT events
perf_counter: Full task tracing
perf_counter: Collapse inherit on read()
tracing, perf_counter: Add help text to CONFIG_EVENT_PROFILE
perf_counter tools: Fix link errors with older toolchains
* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
sched: Fix race in cpupri introduced by cpumask_var changes
sched: Fix latencytop and sleep profiling vs group scheduling
* 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
posix-timers: Fix oops in clock_nanosleep() with CLOCK_MONOTONIC_RAW
* 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
tracing: Fix missing function_graph events when we splice_read from trace_pipe
tracing: Fix invalid function_graph entry
trace: stop tracer in oops_enter()
ftrace: Only update $offset when we update $ref_func
ftrace: Fix the conditional that updates $ref_func
tracing: only truncate ftrace files when O_TRUNC is set
tracing: show proper address for trace-printk format
The state machine described in the comments wasn't updated with
a follow-on fix. Address that and cleanup the corresponding
commentary in the function.
Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
LKML-Reference: <4A737C2A.9090001@us.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Two important aspects of the schedule_work() function are not
yet documented:
- that it is allowed to pass a struct work_struct * to this
function that is already on the kernel-global workqueue;
- the meaning of its return value.
The patch below documents both aspects.
Signed-off-by: Bart Van Assche <bart.vanassche@gmail.com>
Cc: "Greg Kroah-Hartman" <gregkh@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
LKML-Reference: <200907301900.54202.bart.vanassche@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Prevent calling do_nanosleep() with clockid
CLOCK_MONOTONIC_RAW, it may cause oops, such as NULL pointer
dereference.
Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <johnstul@us.ibm.com>
Cc: <stable@kernel.org>
LKML-Reference: <4A764FF3.50607@ct.jp.nec.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The unit is KB, so sizeof(struct circular_queue) should be
divided by 1024.
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Cc: akpm@linux-foundation.org
Cc: torvalds@linux-foundation.org
Cc: davem@davemloft.net
Cc: Ming Lei <tom.leiming@gmail.com>
Cc: a.p.zijlstra@chello.nl
LKML-Reference: <1249220616-7190-1-git-send-email-tom.leiming@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
We still can apply DaveM's generation count optimization to
BFS, based on the following idea:
- before doing each BFS, increase the global generation id
by 1
- if one node in the graph has been visited, mark it as
visited by storing the current global generation id into
the node's dep_gen_id field
- so we can decide if one node has been visited already, by
comparing the node's dep_gen_id with the global generation id.
By applying DaveM's generation count optimization to current
implementation of BFS, we gain the following advantages:
- we save MAX_LOCKDEP_ENTRIES/8 bytes memory;
- we remove the bitmap_zero(bfs_accessed, MAX_LOCKDEP_ENTRIES);
in each BFS, which is very time-consuming since
MAX_LOCKDEP_ENTRIES may be very large.(16384UL)
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: "David S. Miller" <davem@davemloft.net>
LKML-Reference: <1248274089-6358-1-git-send-email-tom.leiming@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
spin_lock_nest_lock() allows to take many instances of the same
class, this can easily lead to overflow of MAX_LOCK_DEPTH.
To avoid this overflow, we'll stop accounting instances but
start reference counting the class in the held_lock structure.
[ We could maintain a list of instances, if we'd move the hlock
stuff into __lock_acquired(), but that would require
significant modifications to the current code. ]
We restrict this mode to spin_lock_nest_lock() only, because it
degrades the lockdep quality due to lost of instance.
For lockstat this means we don't track lock statistics for any
but the first lock in the series.
Currently nesting is limited to 11 bits because that was the
spare space available in held_lock. This yields a 2048
instances maximium.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add a lockdep helper to validate that we indeed are the owner
of a lock.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
fixes a few comments and whitespaces that annoyed me.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This build bug:
In file included from kernel/sched.c:1765:
kernel/sched_rt.c: In function ‘has_pushable_tasks’:
kernel/sched_rt.c:1069: error: ‘struct rt_rq’ has no member named ‘pushable_tasks’
kernel/sched_rt.c: In function ‘pick_next_task_rt’:
kernel/sched_rt.c:1084: error: ‘struct rq’ has no member named ‘post_schedule’
Triggers because both pushable_tasks and post_schedule are
SMP-only fields.
Move pushable_tasks() to the SMP section and #ifdef the post_schedule use.
Cc: Gregory Haskins <ghaskins@novell.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20090729150422.17691.55590.stgit@dev.haskins.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Fix:
kernel/built-in.o: In function `lockdep_stats_show':
lockdep_proc.c:(.text+0x48202): undefined reference to `max_bfs_queue_depth'
As max_bfs_queue_depth is only available under
CONFIG_PROVE_LOCKING=y.
Cc: Ming Lei <tom.leiming@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Like sched_migrate_task(), set_cpus_allowed_ptr() should hold
onto the migration thread too.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
A frequent mistake appears to be to call task_of() on a
scheduler entity that is not actually a task, which can result
in a wild pointer.
Add a check to catch these mistakes.
Suggested-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Reflect "active" cpus in the rq->rd->online field, instead of
the online_map.
The motivation is that things that use the root-domain code
(such as cpupri) only care about cpus classified as "active"
anyway. By synchronizing the root-domain state with the active
map, we allow several optimizations.
For instance, we can remove an extra cpumask_and from the
scheduler hotpath by utilizing rq->rd->online (since it is now
a cached version of cpu_active_map & rq->rd->span).
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Max Krasnyansky <maxk@qualcomm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20090730145723.25226.24493.stgit@dev.haskins.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
We currently have an explicit "needs_post" vtable method which
returns a stack variable for whether we should later run
post-schedule. This leads to an awkward exchange of the
variable as it bubbles back up out of the context switch. Peter
Zijlstra observed that this information could be stored in the
run-queue itself instead of handled on the stack.
Therefore, we revert to the method of having context_switch
return void, and update an internal rq->post_schedule variable
when we require further processing.
In addition, we fix a race condition where we try to access
current->sched_class without holding the rq->lock. This is
technically racy, as the sched-class could change out from
under us. Instead, we reference the per-rq post_schedule
variable with the runqueue unlocked, but with preemption
disabled to see if we need to reacquire the rq->lock.
Finally, we clean the code up slightly by removing the #ifdef
CONFIG_SMP conditionals from the schedule() call, and implement
some inline helper functions instead.
This patch passes checkpatch, and rt-migrate.
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20090729150422.17691.55590.stgit@dev.haskins.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
We need to add the new prio to the cpupri accounting before
removing the old prio. This is because removing the old prio
first will open a race window where the cpu will be removed
from pri_active. In this case the cpu will not be visible for
RT push and pulls. This could cause a RT task to not migrate
appropriately, and create a very large latency.
This bug was found with the use of ftrace sched events and
trace_printk.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20090729042526.438281019@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The current method for pushing RT tasks after scheduling only
happens after a context switch. But we found cases where a task
is set up on a run queue to be pushed but the push never
happens because the schedule chooses the same task.
This bug was found with the help of Gregory Haskins and the use
of ftrace (trace_printk). It tooks several days for both of us
analyzing the code and the trace output to find this.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20090729042526.205923666@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When cgroup group scheduling is built in, skip some code paths
if we don't have any (but the root) cgroups configured.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Commit ec4e0e2fe0 ("fix
inconsistency when redistribute per-cpu tg->cfs_rq shares")
broke cgroup smp fairness.
In order to avoid starvation of newly placed tasks, we never
quite set the share of an empty cpu group-task to 0, but
instead we set it as if there's a single NICE-0 task present.
If however we actually set this in cfs_rq[cpu]->shares, that
means the total shares for that group will be slightly inflated
every time we balance, causing the observed unfairness.
Fix this by setting cfs_rq[cpu]->shares to 0 but actually
setting the effective weight of the related se to the inflated
number.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1248696557.6987.1615.camel@twins>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Background:
Several race conditions in the scheduler have cropped up
recently, which Steven and I have tracked down using ftrace.
The most recent one turns out to be a race in how the scheduler
determines a suitable migration target for RT tasks, introduced
recently with commit:
commit 68e74568fb
Date: Tue Nov 25 02:35:13 2008 +1030
sched: convert struct cpupri_vec cpumask_var_t.
The original design of cpupri allowed lockless readers to
quickly determine a best-estimate target. Races between the
pri_active bitmap and the vec->mask were handled in the
original code because we would detect and return "0" when this
occured. The design was predicated on the *effective*
atomicity (*) of caching the result of cpus_and() between the
cpus_allowed and the vec->mask.
Commit 68e74568 changed the behavior such that vec->mask is
accessed multiple times. This introduces a subtle race, the
result of which means we can have a result that returns "1",
but with an empty bitmap.
*) yes, we know cpus_and() is not a locked operator across the
entire composite array, but it is implicitly atomic on a
per-word basis which is all the design required to work.
Implementation:
Rather than forgoing the lockless design, or reverting to a
stack-based cpumask_t, we simply check for when the race has
been encountered and continue processing in the event that the
race is hit. This renders the removal race as if the priority
bit had been atomically cleared as well, and allows the
algorithm to execute correctly.
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
CC: Rusty Russell <rusty@rustcorp.com.au>
CC: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20090730145728.25226.92769.stgit@dev.haskins.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The latencytop and sleep accounting code assumes that any
scheduler entity represents a task, this is not so.
Cc: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
In order to be able to distinguish between no samples due to
inactivity and no samples due to task ended, Arjan asked for
PERF_EVENT_EXIT events. This is useful to the boot delay
instrumentation (bootchart) app.
This patch changes the PERF_EVENT_FORK to be emitted on every
clone, and adds PERF_EVENT_EXIT to be emitted on task exit,
after the task's counters have been closed.
This task tracing is controlled through: attr.comm || attr.mmap
and through the new attr.task field.
Suggested-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Anton Blanchard <anton@samba.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
[ cleaned up perf_counter.h a bit ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Currently the counter value returned by read() is the value of
the parent counter, to which child counters are only fed back
on child exit.
Thus read() can return rather erratic (and meaningless) numbers
depending on the state of the child processes.
Change this by always iterating the full child hierarchy on
read() and sum all counters.
Suggested-by: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When debugging a recent lockup bug i found various deficiencies
in how our current lockup detection helpers work:
- SysRq-L is not very efficient as it uses a workqueue, hence
it cannot punch through hard lockups and cannot see through
most soft lockups either.
- The SysRq-L code depends on the NMI watchdog - which is off
by default.
- We dont print backtraces from the RCU code's built-in
'RCU state machine is stuck' debug code. This debug
code tends to be one of the first (and only) mechanisms
that show that a lockup has occured.
This patch changes the code so taht we:
- Trigger the NMI backtrace code from SysRq-L instead of using
a workqueue (which cannot punch through hard lockups)
- Trigger print-all-CPU-backtraces from the RCU lockup detection
code
Also decouple the backtrace printing code from the NMI watchdog:
- Dont use variable size cpumasks (it might not be initialized
and they are a bit more fragile anyway)
- Trigger an NMI immediately via an IPI, instead of waiting
for the NMI tick to occur. This is a lot faster and can
produce more relevant backtraces. It will also work if the
NMI watchdog is disabled.
- Dont print the 'dazed and confused' message when we print
a backtrace from the NMI
- Do a show_regs() plus a dump_stack() to get maximum info
out of the dump. Worst-case we get two stacktraces - which
is not a big deal. Sometimes, if register content is
corrupted, the precise stack walker in show_regs() wont
give us a full backtrace - in this case dump_stack() will
do it.
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The previous commit ("do_sigaltstack: avoid copying 'stack_t' as a
structure to user space") fixed a real bug. This one just cleans up the
copy from user space to that gcc can generate better code for it (and so
that it looks the same as the later copy back to user space).
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ulrich Drepper correctly points out that there is generally padding in
the structure on 64-bit hosts, and that copying the structure from
kernel to user space can leak information from the kernel stack in those
padding bytes.
Avoid the whole issue by just copying the three members one by one
instead, which also means that the function also can avoid the need for
a stack frame. This also happens to match how we copy the new structure
from user space, so it all even makes sense.
[ The obvious solution of adding a memset() generates horrid code, gcc
does really stupid things. ]
Reported-by: Ulrich Drepper <drepper@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use kernel_text_address() for checking probe address instead of
__kernel_text_address(), because __kernel_text_address() returns true
for init functions even after relaseing those functions.
That will hit a BUG() in text_poke().
Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When profile= is used, a large buffer is allocated early at boot. This
can be larger than what the page allocator can provide so it prints a
warning. However, the caller is able to handle the situation so this
patch suppresses the warning.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After commit ec64f51545 ("cgroup: fix
frequent -EBUSY at rmdir"), cgroup's rmdir (especially against memcg)
doesn't return -EBUSY by temporary ref counts. That commit expects all
refs after pre_destroy() is temporary but...it wasn't. Then, rmdir can
wait permanently. This patch tries to fix that and change followings.
- set CGRP_WAIT_ON_RMDIR flag before pre_destroy().
- clear CGRP_WAIT_ON_RMDIR flag when the subsys finds racy case.
if there are sleeping ones, wakes them up.
- rmdir() sleeps only when CGRP_WAIT_ON_RMDIR flag is set.
Tested-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Reported-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Reviewed-by: Paul Menage <menage@google.com>
Acked-by: Balbir Sigh <balbir@linux.vnet.ibm.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The bug was introduced by commit cc31edceee
("cgroups: convert tasks file to use a seq_file with shared pid array").
We cache a pid array for all threads that are opening the same "tasks"
file, but the pids in the array are always from the namespace of the
last process that opened the file, so all other threads will read pids
from that namespace instead of their own namespaces.
To fix it, we maintain a list of pid arrays, which is keyed by pid_ns.
The list will be of length 1 at most time.
Reported-by: Paul Menage <menage@google.com>
Idea-by: Paul Menage <menage@google.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Reviewed-by: Serge Hallyn <serue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Setting
"crashkernel=512M-2G:64M,2G-:128M"
does not work but it turns to work if it has a trailing-whitespace,
like
"crashkernel=512M-2G:64M,2G-:128M ".
It was because of a bug in the parser, running over the cmdline.
This patch adds a check of the termination.
Reported-by: Jin Dongming <jin.dongming@np.css.fujitsu.com>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Tested-by: Jin Dongming <jin.dongming@np.css.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix a post-2.6.31 regression which was introduced by
2ff05b2b4e ("oom: move oom_adj value from
task_struct to mm_struct").
After moving the oom_adj value from the task struct to the mm_struct, the
oom_adj value was no longer properly inherited by child processes.
Copying over the oom_adj value at fork time fixes that bug.
[kosaki.motohiro@jp.fujitsu.com: test for current->mm before dereferencing it]
Signed-off-by: Rik van Riel <riel@redhat.com>
Reported-by: Paul Menage <manage@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
About a half events are missing when we splice_read
from trace_pipe. They are unexpectedly consumed because we ignore
the TRACE_TYPE_NO_CONSUME return value used by the function graph
tracer when it needs to consume the events by itself to walk on
the ring buffer.
The same problem appears with ftrace_dump()
Example of an output before this patch:
1) | ktime_get_real() {
1) 2.846 us | read_hpet();
1) 4.558 us | }
1) 6.195 us | }
After this patch:
0) | ktime_get_real() {
0) | getnstimeofday() {
0) 1.960 us | read_hpet();
0) 3.597 us | }
0) 5.196 us | }
The fix also applies on 2.6.30
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: stable@kernel.org
LKML-Reference: <4A6EEC52.90704@cn.fujitsu.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
When print_graph_entry() computes a function call entry event, it needs
to also check the next entry to guess if it matches the return event of
the current function entry.
In order to look at this next event, it needs to consume the current
entry before going ahead in the ring buffer.
However, if the current event that gets consumed is the last one in the
ring buffer head page, the ring_buffer may reuse the page for writers.
The consumed entry will then become invalid because of possible
racy overwriting.
Me must then handle this entry by making a copy of it.
The fix also applies on 2.6.30
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: stable@kernel.org
LKML-Reference: <4A6EEAEC.3050508@cn.fujitsu.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Commit 63706172f3 ("kthreads: rework
kthread_stop()") removed the limitation that the thread function mysr
not call do_exit() itself, but forgot to update the comment.
Since that commit it is OK to use kthread_stop() even if kthread can
exit itself.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The check_modstruct_version() needs to look up the symbol "module_layout"
in the kernel, but it does so literally and not by a C identifier. The
trouble is that it does not include a symbol prefix for those ports that
need it (like the Blackfin and H8300 port). So make sure we tack on the
MODULE_SYMBOL_PREFIX define to the front of it.
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If trace_printk_on_oops is set we lose interesting trace information
when the tracer is enabled across oops handling and printing. We want
the trace which might give us information _WHY_ we oopsed.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Some cleanups of the lockdep code after the BFS series:
- Remove the last traces of the generation id
- Fixup comment style
- Move the bfs routines into lockdep.c
- Cleanup the bfs routines
[ tom.leiming@gmail.com: Fix crash ]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1246201486-7308-11-git-send-email-tom.leiming@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add BFS statistics to the existing lockdep stats.
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1246201486-7308-10-git-send-email-tom.leiming@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Also account the BFS memory usage.
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
[ fix build for !PROVE_LOCKING ]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1246201486-7308-9-git-send-email-tom.leiming@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Implement lockdep_count_{for,back}ward using BFS.
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1246201486-7308-8-git-send-email-tom.leiming@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Since the shortest lock dependencies' path may be obtained by BFS,
we print the shortest one by print_shortest_lock_dependencies().
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1246201486-7308-7-git-send-email-tom.leiming@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch uses BFS to implement find_usage_*wards(),which
was originally writen by DFS.
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1246201486-7308-6-git-send-email-tom.leiming@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch uses BFS to implement check_noncircular() and
prints the generated shortest circle if exists.
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1246201486-7308-5-git-send-email-tom.leiming@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
1,introduce match() to BFS in order to make it usable to
match different pattern;
2,also rename some functions to make them more suitable.
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1246201486-7308-4-git-send-email-tom.leiming@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
1,replace %MAX_CIRCULAR_QUE_SIZE with &(MAX_CIRCULAR_QUE_SIZE-1)
since we define MAX_CIRCULAR_QUE_SIZE as power of 2;
2,use bitmap to mark if a lock is accessed in BFS in order to
clear it quickly, because we may search a graph many times.
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1246201486-7308-3-git-send-email-tom.leiming@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Currently lockdep will print the 1st circle detected if it
exists when acquiring a new (next) lock.
This patch prints the shortest path from the next lock to be
acquired to the previous held lock if a circle is found.
The patch still uses the current method to check circle, and
once the circle is found, breadth-first search algorithem is
used to compute the shortest path from the next lock to the
previous lock in the forward lock dependency graph.
Printing the shortest path will shorten the dependency chain,
and make troubleshooting for possible circular locking easier.
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1246201486-7308-2-git-send-email-tom.leiming@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
migration_init() returns the return value of the hotplug notifier. In
the success case this is NOTIFY_OK which is 1. initcall_debug
evaluates that as an error code because init calls are expected to
return 0 on success.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The current code will truncate the ftrace files contents if O_APPEND
is not set and the file is opened in write mode. This is incorrect.
It should only truncate the file if O_TRUNC is set. Otherwise
if one of these files is opened by a C program with fopen "r+",
it will incorrectly truncate the file.
Reported-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Since the trace_printk may use pointers to the format fields
in the buffer, they are exported via debugfs/tracing/printk_formats.
This is used by utilities that read the ring buffer in binary format.
It helps the utilities map the address of the format in the binary
buffer to what the printf format looks like.
Unfortunately, the way the output code works, it exports the address
of the pointer to the format address, and not the format address
itself. This makes the file totally useless in trying to figure
out what format string a binary address belongs to.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Every time we cat a trace_stat file, we leak memory allocated by
seq_open().
Also fix memory leak in a failure path in tracing_stat_open().
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4A67D92B.4060704@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Every time we cat set_graph_function, we leak memory allocated
by seq_open().
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4A67D907.2010500@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Every time we cat stack_trace, we leak memory allocated by seq_open().
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4A67D8E8.3020500@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Since genirq: Delegate irq affinity setting to the irq thread
(591d2fb02e) compilation with
CONFIG_SMP=n fails with following error:
/usr/src/linux-2.6/kernel/irq/manage.c:
In function 'irq_thread_check_affinity':
/usr/src/linux-2.6/kernel/irq/manage.c:475:
error: 'struct irq_desc' has no member named 'affinity'
make[4]: *** [kernel/irq/manage.o] Error 1
That commit adds a new function irq_thread_check_affinity() which
uses struct irq_desc.affinity which is only available for CONFIG_SMP=y.
Move that function under #ifdef CONFIG_SMP.
[ tglx@brownpaperbag: compile and boot tested on UP and SMP ]
Signed-off-by: Bruno Premont <bonbons@linux-vserver.org>
LKML-Reference: <20090722222232.2eb3e1c4@neptune.home>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* 'perf-counters-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/peterz/linux-2.6-perf: (31 commits)
perf_counter tools: Give perf top inherit option
perf_counter tools: Fix vmlinux symbol generation breakage
perf_counter: Detect debugfs location
perf_counter: Add tracepoint support to perf list, perf stat
perf symbol: C++ demangling
perf: avoid structure size confusion by using a fixed size
perf_counter: Fix throttle/unthrottle event logging
perf_counter: Improve perf stat and perf record option parsing
perf_counter: PERF_SAMPLE_ID and inherited counters
perf_counter: Plug more stack leaks
perf: Fix stack data leak
perf_counter: Remove unused variables
perf_counter: Make call graph option consistent
perf_counter: Add perf record option to log addresses
perf_counter: Log vfork as a fork event
perf_counter: Synthesize VDSO mmap event
perf_counter: Make sure we dont leak kernel memory to userspace
perf_counter tools: Fix index boundary check
perf_counter: Fix the tracepoint channel to perfcounters
perf_counter, x86: Extend perf_counter Pentium M support
...
* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
sched: fix nr_uninterruptible accounting of frozen tasks really
sched: fix load average accounting vs. cpu hotplug
sched: Account for vruntime wrapping
the "reserved" field was not initialized to zero, resulting in 4 bytes
of stack data leaking to userspace....
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Right now we only print PERF_EVENT_THROTTLE + 1 (ie PERF_EVENT_UNTHROTTLE).
Fix this to print both a throttle and unthrottle event.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20090722130546.GE9029@kryten>
Anton noted that for inherited counters the counter-id as provided by
PERF_SAMPLE_ID isn't mappable to the id found through PERF_RECORD_ID
because each inherited counter gets its own id.
His suggestion was to always return the parent counter id, since that
is the primary counter id as exposed. However, these inherited
counters have a unique identifier so that events like
PERF_EVENT_PERIOD and PERF_EVENT_THROTTLE can be specific about which
counter gets modified, which is important when trying to normalize the
sample streams.
This patch removes PERF_EVENT_PERIOD in favour of PERF_SAMPLE_PERIOD,
which is more useful anyway, since changing periods became a lot more
common than initially thought -- rendering PERF_EVENT_PERIOD the less
useful solution (also, PERF_SAMPLE_PERIOD reports the more accurate
value, since it reports the value used to trigger the overflow,
whereas PERF_EVENT_PERIOD simply reports the requested period changed,
which might only take effect on the next cycle).
This still leaves us PERF_EVENT_THROTTLE to consider, but since that
_should_ be a rare occurrence, and linking it to a primary id is the
most useful bit to diagnose the problem, we introduce a
PERF_SAMPLE_STREAM_ID, for those few cases where the full
reconstruction is important.
[Does change the ABI a little, but I see no other way out]
Suggested-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1248095846.15751.8781.camel@twins>
the "reserved" field was not initialized to zero, resulting in 4 bytes
of stack data leaking to userspace....
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
commit ca109491f (hrtimer: removing all ur callback modes) moved all
hrtimer callbacks into hard interrupt context when high resolution
timers are active. That breaks code which relied on the assumption
that the callback happens in softirq context.
Provide a generic infrastructure which combines tasklets and hrtimers
together to provide an in-softirq hrtimer experience.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: torvalds@linux-foundation.org
Cc: kaber@trash.net
Cc: David Miller <davem@davemloft.net>
LKML-Reference: <1248265724.27058.1366.camel@twins>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Support for graceful handling of sleep states (S3/S4/S5) after an Intel(R) TXT launch.
Without this patch, attempting to place the system in one of the ACPI sleep
states (S3/S4/S5) will cause the TXT hardware to treat this as an attack and
will cause a system reset, with memory locked. Not only may the subsequent
memory scrub take some time, but the platform will be unable to enter the
requested power state.
This patch calls back into the tboot so that it may properly and securely clean
up system state and clear the secrets-in-memory flag, after which it will place
the system into the requested sleep state using ACPI information passed by the kernel.
arch/x86/kernel/smpboot.c | 2 ++
drivers/acpi/acpica/hwsleep.c | 3 +++
kernel/cpu.c | 7 ++++++-
3 files changed, 11 insertions(+), 1 deletion(-)
Signed-off-by: Joseph Cihula <joseph.cihula@intel.com>
Signed-off-by: Shane Wang <shane.wang@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
irq_set_thread_affinity() calls set_cpus_allowed_ptr() which might
sleep, but irq_set_thread_affinity() is called with desc->lock held
and can be called from hard interrupt context as well. The code has
another bug as it does not hold a ref on the task struct as required
by set_cpus_allowed_ptr().
Just set the IRQTF_AFFINITY bit in action->thread_flags. The next time
the thread runs it migrates itself. Solves all of the above problems
nicely.
Add kerneldoc to irq_set_thread_affinity() while at it.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
LKML-Reference: <new-submission>
Currently a subsystem filter should be applicable to all events
under the subsystem, and if it failed, all the event filters
will be cleared. Those behaviors make subsys filter much less
useful:
# echo 'vec == 1' > irq/softirq_entry/filter
# echo 'irq == 5' > irq/filter
bash: echo: write error: Invalid argument
# cat irq/softirq_entry/filter
none
I'd expect it set the filter for irq_handler_entry/exit, and
not touch softirq_entry/exit.
The basic idea is, try to see if the filter can be applied
to which events, and then just apply to the those events:
# echo 'vec == 1' > softirq_entry/filter
# echo 'irq == 5' > filter
# cat irq_handler_entry/filter
irq == 5
# cat softirq_entry/filter
vec == 1
Changelog for v2:
- do some cleanups to address Frederic's comments.
Inspired-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <4A63D485.7030703@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
'\n' is already appended, and what we need is just an extra
space for the '\0'.
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
LKML-Reference: <4A3EED63.3090908@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When a dynamic array is defined, we add __data_loc_foo in
trace_entry to record the offset of the array, but the
size of the array is not recorded, which causes 2 problems:
- the event filter just compares the first 2 chars of the strings.
- parsers can't parse dynamic arrays.
So we encode the size of each dynamic array in the higher 16 bits
of __data_loc_foo, while the offset is in lower 16 bits.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4A5E964A.9000403@cn.fujitsu.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Writing a zero length string to sys/.../current_clocksource will cause
a NULL pointer dereference if the clock events system is in one shot
(highres or nohz) mode.
Pointed-out-by: Dan Carpenter <error27@gmail.com>
LKML-Reference: <alpine.DEB.2.00.0907191545580.12306@bicker>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
timer->expires may be uninitialized, so check timer_pending() before
touching timer->expires to pacify kmemcheck.
Signed-off-by: Pavel Roskin <proski@gnu.org>
LKML-Reference: <20090718204602.5191.360.stgit@mj.roinet.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
might_sleep() is called late-ish in cond_resched(), after the
need_resched()/preempt enabled/system running tests are
checked.
It's better to check the sleeps while atomic earlier and not
depend on some environment datas that reduce the chances to
detect a problem.
Also define cond_resched_*() helpers as macros, so that the
FILE/LINE reported in the sleeping while atomic warning
displays the real origin and not sched.h
Changes in v2:
- Call __might_sleep() directly instead of might_sleep() which
may call cond_resched()
- Turn cond_resched() into a macro so that the file:line
couple reported refers to the caller of cond_resched() and
not __cond_resched() itself.
Changes in v3:
- Also propagate this __might_sleep() pull up to
cond_resched_lock() and cond_resched_softirq()
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1247725694-6082-6-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add a preempt count base offset to compare against the current
preempt level count. It prepares to pull up the might_sleep
check from cond_resched() to cond_resched_lock() and
cond_resched_bh().
For these two helpers, we need to respectively ensure that once
we'll unlock the given spinlock / reenable local softirqs, we
will reach a sleepable state.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
[ Move and rename preempt_count_equals() ]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1247725694-6082-4-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cover the off case for __might_sleep(), so that we avoid
#ifdefs in files that make use of it. Especially, this prepares
for the __might_sleep() pull up on cond_resched().
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1247725694-6082-3-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Remove the outdated comment from __cond_resched() related to
the now removed Big Kernel Semaphore.
Reported-by: Arnd Bergmann <arnd@arndb.de>
Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1247725694-6082-2-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The schedule() function is a loop that reschedules the current
task while the TIF_NEED_RESCHED flag is set:
void schedule(void)
{
need_resched:
/* schedule code */
if (need_resched())
goto need_resched;
}
And cond_resched() repeat this loop:
do {
add_preempt_count(PREEMPT_ACTIVE);
schedule();
sub_preempt_count(PREEMPT_ACTIVE);
} while(need_resched());
This loop is needless because schedule() already did the check
and nothing can set TIF_NEED_RESCHED between schedule() exit
and the loop check in need_resched().
Then remove this needless loop.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1247725694-6082-1-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
commit e3c8ca8336 (sched: do not count frozen tasks toward load) broke
the nr_uninterruptible accounting on freeze/thaw. On freeze the task
is excluded from accounting with a check for (task->flags &
PF_FROZEN), but that flag is cleared before the task is thawed. So
while we prevent that the task with state TASK_UNINTERRUPTIBLE
is accounted to nr_uninterruptible on freeze we decrement
nr_uninterruptible on thaw.
Use a separate flag which is handled by the freezing task itself. Set
it before calling the scheduler with TASK_UNINTERRUPTIBLE state and
clear it after we return from frozen state.
Cc: <stable@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The new load average code clears rq->calc_load_active on
CPU_ONLINE. That's wrong as the new onlined CPU might have got a
scheduler tick already and accounted the delta to the stale value of
the time we offlined the CPU.
Clear the value when we cleanup the dead CPU instead.
Also move the update of the calc_load_update time for the newly online
CPU to CPU_UP_PREPARE to avoid that the CPU plays catch up with the
stale update time value.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
When profile= is used, a large buffer is allocated early at
boot. This can be larger than what the page allocator can
provide so it prints a warning. However, the caller is able to
handle the situation so this patch suppresses the warning.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Linux Memory Management List <linux-mm@kvack.org>
Cc: Heinz Diehl <htd@fancy-poultry.org>
Cc: David Miller <davem@davemloft.net>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Andrew Morton <akpm@linux-foundation.org>
LKML-Reference: <1247656992-19846-3-git-send-email-mel@csn.ul.ie>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Right now we don't output vfork events. Even though we should
always see an exec after a vfork, we may get perfcounter
samples between the vfork and exec. These samples can lead to
some confusion when parsing perfcounter data.
To keep things consistent we should always log a fork event. It
will result in a little more log data, but is less confusing to
trace parsing tools.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20090716104817.589309391@samba.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
There are a few places we are leaking tiny amounts of kernel
memory to userspace. This happens when writing out strings
because we always align the end to 64 bits.
To avoid this we should always use an appropriately sized
temporary buffer and ensure it is zeroed.
Since d_path assembles the string from the end of the buffer
backwards, we need to add 64 bits after the buffer to allow for
alignment.
We also need to copy arch_vma_name to the temporary buffer,
because if we use it directly we may end up copying to
userspace a number of bytes after the end of the string
constant.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20090716104817.273972048@samba.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
I spotted two sites that didn't take vruntime wrap-around into
account. Fix these by creating a comparison helper that does do
so.
Signed-off-by: Fabio Checconi <fabio@gandalf.sssup.it>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
We can directly use %pf input format instead of kallsyms_lookup()
and %s input format
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
We can directly use %pF input format instead of sprint_symbol()
and %s input format.
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Rewrite the __ftrace_replace_code() function, simplify it, but don't
change the code's logic.
First, we get the state we want to set, if the record has the same
state, then do nothing, otherwise enable/disable it.
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
ftrace_trace_onoff_callback() will return an error even if we do the
right operation, for example:
# echo _spin_*:traceon:10 > set_ftrace_filter
-bash: echo: write error: Invalid argument
# cat set_ftrace_filter
#### all functions enabled ####
_spin_trylock_bh:traceon:count=10
_spin_unlock_irq:traceon:count=10
_spin_unlock_bh:traceon:count=10
_spin_lock_irq:traceon:count=10
_spin_unlock:traceon:count=10
_spin_trylock:traceon:count=10
_spin_unlock_irqrestore:traceon:count=10
_spin_lock_irqsave:traceon:count=10
_spin_lock_bh:traceon:count=10
_spin_lock:traceon:count=10
We want to set _spin_*:traceon:10 to set_ftrace_filter, it complains
with "Invalid argument", but the operation is successful.
This is because ftrace_process_regex() returns the number of functions that
matched the pattern. If the number is not 0, this value is returned
by ftrace_regex_write() whereas we want to return the number of bytes
virtually written.
Also the file offset pointer is not updated in this case.
If the number of matched functions is lower than the number of bytes written
by the user, this results to a reprocessing of the string given by the user with
a lower size, leading to a malformed ftrace regex and then a -EINVAL returned.
So, this patch fixes it by returning 0 if no error occured.
The fix also applies on 2.6.30
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: stable@kernel.org
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
kernel/trace/ring_buffer.c: In function 'rb_tail_page_update':
kernel/trace/ring_buffer.c:849: warning: value computed is not used
kernel/trace/ring_buffer.c:850: warning: value computed is not used
Add "(void)"s to fix this warning, because we don't need here to handle
the fail case of cmpxchg, it's fine if an interrupt already did the
job.
Changed from V1:
Add a comment(which is written by Steven) for it.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/peterz/linux-2.6-sched:
sched: Fix bug in SCHED_IDLE interaction with group scheduling
sched: Fix rt_rq->pushable_tasks initialization in init_rt_rq()
sched: Reset sched stats on fork()
sched_rt: Fix overload bug on rt group scheduling
sched: Documentation/sched-rt-group: Fix style issues & bump version
* 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
hrtimer: Fix migration expiry check
hrtimer: migration: do not check expiry time on current CPU
* 'core-fixes-for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
futexes: Fix infinite loop in get_futex_key() on huge page
The per cpu variable stat is freeded if we fail to allocate a name
on start up. This was due to stat at first being allocated in the
initial design. But since then, it has become a static per cpu variable
but the free on error was not removed.
Also added __init annotation to the function that this is in.
[ Impact: prevent possible memory corruption on low mem at boot up ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Fix a missed rename in EVENT_PROFILE support so that it gets
built and allows tracepoint tracing from the 'perf' tool.
Fix a typo in the (never before built & enabled) portion in
perf_counter.c as well, and update that code to the
attr.config changes as well.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Ben Gamari <bgamari.foss@gmail.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <1246869094-21237-1-git-send-email-chris@chris-wilson.co.uk>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This makes generic netlink network namespace aware. No
generic netlink families except for the controller family
are made namespace aware, they need to be checked one by
one and then set the family->netnsok member to true.
A new function genlmsg_multicast_netns() is introduced to
allow sending a multicast message in a given namespace,
for example when it applies to an object that lives in
that namespace, a new function genlmsg_multicast_allns()
to send a message to all network namespaces (for objects
that do not have an associated netns).
The function genlmsg_multicast() is changed to multicast
the message in just init_net, which is currently correct
for all generic netlink families since they only work in
init_net right now. Some will later want to work in all
net namespaces because they do not care about the netns
at all -- those will have to be converted to use one of
the new functions genlmsg_multicast_allns() or
genlmsg_multicast_netns() whenever they are made netns
aware in some way.
After this patch families can easily decide whether or
not they should be available in all net namespaces. Many
genl families us it for objects not related to networking
and should therefore be available in all namespaces, but
that will have to be done on a per family basis.
Note that this doesn't touch on the checkpoint/restart
problem where network namespaces could be used, genl
families and multicast groups are numbered globally and
I see no easy way of changing that, especially since it
must be possible to multicast to all network namespaces
for those families that do not care about netns.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
* 'kmemleak' of git://linux-arm.org/linux-2.6:
kmemleak: Remove alloc_bootmem annotations introduced in the past
kmemleak: Add callbacks to the bootmem allocator
kmemleak: Allow partial freeing of memory blocks
kmemleak: Trace the kmalloc_large* functions in slub
kmemleak: Scan objects allocated during a scanning episode
kmemleak: Do not acquire scan_mutex in kmemleak_open()
kmemleak: Remove the reported leaks number limitation
kmemleak: Add more cond_resched() calls in the scanning thread
kmemleak: Renice the scanning thread to +10
* Remove smp_lock.h from files which don't need it (including some headers!)
* Add smp_lock.h to files which do need it
* Make smp_lock.h include conditional in hardirq.h
It's needed only for one kernel_locked() usage which is under CONFIG_PREEMPT
This will make hardirq.h inclusion cheaper for every PREEMPT=n config
(which includes allmodconfig/allyesconfig, BTW)
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
get_futex_key() can infinitely loop if it is called on a
virtual address that is within a huge page but not aligned to
the beginning of that page. The call to get_user_pages_fast
will return the struct page for a sub-page within the huge page
and the check for page->mapping will always fail.
The fix is to call compound_head on the page before checking
that it's mapped.
Signed-off-by: Sonny Rao <sonnyrao@us.ibm.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@kernel.org
Cc: anton@samba.org
Cc: rajamony@us.ibm.com
Cc: speight@us.ibm.com
Cc: mstephen@us.ibm.com
Cc: grimm@us.ibm.com
Cc: mikey@ozlabs.au.ibm.com
LKML-Reference: <20090710231313.GA23572@us.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
One of the isolation modifications for SCHED_IDLE is the
unitization of sleeper credit. However the check for this
assumes that the sched_entity we're placing always belongs to a
task.
This is potentially not true with group scheduling and leaves
us rummaging randomly when we try to pull the policy.
Signed-off-by: Paul Turner <pjt@google.com>
Cc: peterz@infradead.org
LKML-Reference: <alpine.DEB.1.00.0907101649570.29914@kitami.corp.google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'core-fixes-for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
dma-debug: Fix the overlap() function to be correct and readable
oprofile: reset bt_lost_no_mapping with other stats
x86/oprofile: rename kernel parameter for architectural perfmon to arch_perfmon
signals: declare sys_rt_tgsigqueueinfo in syscalls.h
rcu: Mark Hierarchical RCU no longer experimental
dma-debug: Put all hash-chain locks into the same lock class
dma-debug: fix off-by-one error in overlap function
Optimize cond_resched() by removing one conditional.
Currently cond_resched() checks system_state ==
SYSTEM_RUNNING in order to avoid scheduling before the
scheduler is running.
We can however, as per suggestion of Matt, use
PREEMPT_ACTIVE to accomplish that very same.
Suggested-by: Matt Mackall <mpm@selenic.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Matt Mackall <mpm@selenic.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
tracing: Fix trace_print_seq()
kprobes: No need to unlock kprobe_insn_mutex
tracing/fastboot: Document the need of initcall_debug
trace_export: Repair missed fields
tracing: Fix stack tracer sysctl handling
The timer migration expiry check should prevent the migration of a
timer to another CPU when the timer expires before the next event is
scheduled on the other CPU. Migrating the timer might delay it because
we can not reprogram the clock event device on the other CPU. But the
code implementing that check has two flaws:
- for !HIGHRES the check compares the expiry value with the clock
events device expiry value which is wrong for CLOCK_REALTIME based
timers.
- the check is racy. It holds the hrtimer base lock of the target CPU,
but the clock event device expiry value can be modified
nevertheless, e.g. by an timer interrupt firing.
The !HIGHRES case is easy to fix as we can enqueue the timer on the
cpu which was selected by the load balancer. It runs the idle
balancing code once per jiffy anyway. So the maximum delay for the
timer is the same as when we keep the tick on the current cpu going.
In the HIGHRES case we can get the next expiry value from the hrtimer
cpu_base of the target CPU and serialize the update with the cpu_base
lock. This moves the lock section in hrtimer_interrupt() so we can set
next_event to KTIME_MAX while we are handling the expired timers and
set it to the next expiry value after we handled the timers under the
base lock. While the expired timers are processed timer migration is
blocked because the expiry time of the timer is always <= KTIME_MAX.
Also remove the now useless clockevents_get_next_event() function.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The timer migration code needs to check whether the expiry time of the
timer is before the programmed clock event expiry time when the timer
is enqueued on another CPU because we can not reprogram the timer
device on the other CPU. The current logic checks the expiry time even
if we enqueue on the current CPU when nohz_get_load_balancer() returns
current CPU. This might lead to an endless loop in the expiry check
code when the expiry time of the timer is before the current
programmed next event.
Check whether nohz_get_load_balancer() returns current CPU and skip
the expiry check if this is the case.
The bug was triggered from the networking code. The patch fixes the
regression http://bugzilla.kernel.org/show_bug.cgi?id=13738
(Soft-Lockup/Race in networking in 2.6.31-rc1+195)
Cc: Arun Bharadwaj <arun@linux.vnet.ibm.com
Tested-by: Joao Correia <joaomiguelcorreia@gmail.com>
Tested-by: Andres Freund <andres@anarazel.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
When logging to console is disabled from userspace using klogctl()
and later re-enabled, console_loglevel gets set to the default
log level instead to the previous value.
This means that if the kernel was booted with 'quiet', the boot is
suddenly no longer quiet after logging to console gets re-enabled.
Save the current console_loglevel when logging is disabled and
restore to that value. If the log level is set to a specific value
while disabled, this is interpreted as an implicit re-enabling of
the logging.
The problem that prompted this patch is described in:
http://lkml.org/lkml/2009/6/28/234
There are two variations possible on the patch below:
1) If klogctl(7) is called while logging is not disabled, then set level
to default (partially preserving current functionality):
case 7: /* Enable logging to console */
- console_loglevel = default_console_loglevel;
+ if (saved_console_loglevel == -1)
+ console_loglevel = default_console_loglevel;
+ else {
+ console_loglevel = saved_console_loglevel;
+ saved_console_loglevel = -1;
+ }
2) If klogctl(8) is called while logging is disabled, then don't enable
logging, but remember the requested value for when logging does get
enabled again:
case 8: /* Set level of messages printed to console */
[...]
- console_loglevel = len;
+ if (saved_console_loglevel == -1)
+ console_loglevel = len;
+ else
+ saved_console_loglevel = len;
Yet another option would be to ignore the request.
Signed-off-by: Frans Pop <elendil@planet.nl>
Cc: cryptsetup@packages.debian.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
LKML-Reference: <200907061331.49930.elendil@planet.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Today, when a console is registered without CON_PRINTBUFFER,
end users never see the announcement of it being added, and
never know if they missed something, if the console is really
at the start or not, and just leads to general confusion.
This re-orders existing code, to make sure the console is
added, before the "console [%s%d] enabled" is printed out -
ensuring that this message is _always_ seen.
This has the desired/intended side effect of making sure that
"console enabled:" messages are printed on the bootconsole, and
the real console. This does cause the same line is printed
twice if the bootconsole and real console are the same device,
but if they are on different devices, the message is printed to
both consoles.
Signed-off-by : Robin Getz <rgetz@blackfin.uclinux.org>
Cc: "Andrew Morton" <akpm@linux-foundation.org>
Cc: "Linus Torvalds" <torvalds@linux-foundation.org>
LKML-Reference: <200907091308.37370.rgetz@blackfin.uclinux.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The stat entries can be freed when the stat file is being read.
The worse is, the ptr can be freed immediately after it's returned
from workqueue_stat_start/next().
Add a refcnt to struct cpu_workqueue_stats to avoid use-after-free.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <4A51B16F.6010608@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add stat_release() callback to struct tracer_stat, so a stat tracer
can release it's entries after the stat file has been read out.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <4A51B16A.6020708@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
So we have:
- kmemtrace_print_alloc/free() for kmemtrace default output
- kmemtrace_print_alloc/free_user() for binary output used
by kmemtrace-user.
Suggested-by: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <4A51B288.70505@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Remove the obsolete seq_print_ip_sym() usage and replace it
by the %pf format in order to print function symbols.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
LKML-Reference: <1247107590-6428-3-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Remove the obsolete seq_print_ip_sym() usage and replace it
by the %pf format in order to print function symbols.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
LKML-Reference: <1247107590-6428-2-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Remove empty subsystem and its directory when module unload.
Before patch:
# rmmod trace-events-sample.ko
# ls sample
enable filter
After patch:
# rmmod trace-events-sample.ko
# ls sample
ls: cannot access sample: No such file or directory
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Acked-by: Tom Zanussi <tzanussi@gmail.com>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <4A55A8BE.9010707@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
No need to save preds to event_subsystem, because it's not used.
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Acked-by: Tom Zanussi <tzanussi@gmail.com>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <4A55A83C.1030005@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
init_rt_rq() initializes only rq->rt.pushable_tasks, and not the
pushable_tasks field of the passed rt_rq. The plist is not used
uninitialized since the only pushable_tasks plists used are the
ones of root rt_rqs; anyway reinitializing the list on every group
creation corrupts the root plist, losing its previous contents.
Signed-off-by: Fabio Checconi <fabio@gandalf.sssup.it>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20090615185638.GK21741@gandalf.sssup.it>
CC: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The sched_stat fields are currently not reset upon fork.
Ingo's recent commit 6c594c21fc
did reset nr_migrations, but it didn't reset any of the
others.
This patch resets all sched_stat fields on fork.
Signed-off-by: Lucas De Marchi <lucas.de.marchi@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <193b0f820907090457s7a3662f4gcdecdc22fcae857b@mail.gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Fixes an easily triggerable BUG() when setting process affinities.
Make sure to count the number of migratable tasks in the same place:
the root rt_rq. Otherwise the number doesn't make sense and we'll hit
the BUG in set_cpus_allowed_rt().
Also, make sure we only count tasks, not groups (this is probably
already taken care of by the fact that rt_se->nr_cpus_allowed will be 0
for groups, but be more explicit)
Tested-by: Thomas Gleixner <tglx@linutronix.de>
CC: stable@kernel.org
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Gregory Haskins <ghaskins@novell.com>
LKML-Reference: <1247067476.9777.57.camel@twins>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Instead of open coding the unclone context thingy, put it in
a common function.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
kmemleak_alloc() calls were added in some places where alloc_bootmem was
called. Since now kmemleak tracks bootmem allocations, these explicit
calls should be run.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Commit 5fd29d6ccb ("printk: clean up
handling of log-levels and newlines") changed printk semantics. printk
lines with multiple KERN_<level> prefixes are no longer emitted as
before the patch.
<level> is now included in the output on each additional use.
Remove all uses of multiple KERN_<level>s in formats.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix various silly problems wrt mnt_namespace.h:
- exit_mnt_ns() isn't used, remove it
- done that, sched.h and nsproxy.h inclusions aren't needed
- mount.h inclusion was need for vfsmount_lock, but no longer
- remove mnt_namespace.h inclusion from files which don't use anything
from mnt_namespace.h
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch converts the ring buffers into a completely lockless
buffer recording system. The read side still takes locks since
we still serialize readers. But the writers are the ones that
must be lockless (those can happen in NMIs).
The main change is to the "head_page" pointer. We write to the
tail, and read from the head. The "head_page" pointer in the cpu
buffer is now just a reference to where to look. The real head
page is now kept in the head_page->list->prev->next pointer.
That is, in the list head of the previous page we set flags.
The list pages are allocated to be aligned such that the lowest
significant bits are always zero pointing to the list. This gives
us play to put in flags to their pointers.
bit 0: set when the page is a head page
bit 1: set when the writer is moving the page (for overwrite mode)
cmpxchg is used to update the pointer.
When the writer wraps the buffer and the tail meets the head,
in overwrite mode, the writer must move the head page forward.
It first uses cmpxchg to change the pointer flag from 1 to 2.
Once this is done, the reader on another CPU will not take the
page from the buffer.
The writers need to protect against interrupts (we don't bother with
disabling interrupts because NMIs are allowed to write too).
After the writer sets the pointer flag to 2, it takes care to
manage interrupts coming in. This is discribed in detail within the
comments of the code.
Changes in version 2:
- Let reader reset entries value of header page.
- Fix tail page passing commit page on reader page test.
- Always increment entries and write counter in rb_tail_page_update
- Add safety check in rb_set_commit_to_write to break out of infinite loop
- add mask in rb_is_reader_page
[ Impact: lock free writing to the ring buffer ]
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
This patch changes the ring buffer data pages from using a link list
head pointer, to making each buffer page point to another buffer page
and never back to a "head".
This makes the handling of the ring buffer less complex, since the
traversing of the ring buffer pages no longer needs to account for the
head pointer.
This change also is needed to make the ring buffer lockless.
[
Changes in version 2:
- Added change that Lai Jiangshan mentioned.
From: Lai Jiangshan <laijs@cn.fujitsu.com>
Date: Thu, 11 Jun 2009 11:25:48 +0800
LKML-Reference: <4A30793C.6090208@cn.fujitsu.com>
I'm not sure whether these 4 lines:
bpage = list_entry(pages.next, struct buffer_page, list);
list_del_init(&bpage->list);
cpu_buffer->pages = &bpage->list;
list_splice(&pages, cpu_buffer->pages);
equal to these 2 lines:
cpu_buffer->pages = pages.next;
list_del(&pages);
If there are equivalent, I think the second one
are simpler. It may be not a really necessarily cleanup.
What I asked is: if there are equivalent, could you use these two line:
cpu_buffer->pages = pages.next;
list_del(&pages);
]
[ Impact: simplify the ring buffer to help make it lockless ]
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
do_execve() and ptrace_attach() return -EINTR if
mutex_lock_interruptible(->cred_guard_mutex) fails.
This is not right, change the code to return ERESTARTNOINTR.
Perhaps we should also change proc_pid_attr_write().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Acked-by: Roland McGrath <roland@redhat.com>
Cc: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
These warnings were observed on MIPS32 using 2.6.31-rc1 and gcc-4.2.0:
mm/page_alloc.c: In function 'alloc_pages_exact':
mm/page_alloc.c:1986: warning: passing argument 1 of 'virt_to_phys' makes pointer from integer without a cast
drivers/usb/mon/mon_bin.c: In function 'mon_alloc_buff':
drivers/usb/mon/mon_bin.c:1264: warning: passing argument 1 of 'virt_to_phys' makes pointer from integer without a cast
[akpm@linux-foundation.org: fix kernel/perf_counter.c too]
Signed-off-by: Kevin Cernekee <cernekee@gmail.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The kerneldoc comment describing suspend_device_irqs() is currently
misleading, because generally the function doesn't really disable
interrupt lines at the chip level. Replace it with a more accurate
one.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
LKML-Reference: <200907050022.35117.rjw@sisk.pl>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Pull linus#master to merge PER_CPU_DEF_ATTRIBUTES and alpha build fix
changes. As alpha in percpu tree uses 'weak' attribute instead of
inline assembly, there's no need for __used attribute.
Conflicts:
arch/alpha/include/asm/percpu.h
arch/mn10300/kernel/vmlinux.lds.S
include/linux/percpu-defs.h
Currently by default the output of kmemtrace is binary format instead
of human-readable output.
This patch makes the following changes:
- We'll see human-readable output by default
- We'll see binary output if 'bin' option is set
Note: you may probably need to explicitly disable context-info binary
output:
# echo 0 > options/context-info
# echo 1 > options/bin
# cat trace_pipe
v2:
- use %pF to print call_site
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <4A4DD0A0.5060500@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Today, register_console() assumes the following usage:
- The first console to register with a flag set to CON_BOOT
is the one and only bootconsole.
- If another register_console() is called with an additional
CON_BOOT, it is silently rejected.
- As soon as a console without the CON_BOOT set calls
registers the bootconsole is automatically unregistered.
- Once there is a "real" console - register_console() will
silently reject any consoles with it's CON_BOOT flag set.
In many systems (alpha, blackfin, microblaze, mips, powerpc,
sh, & x86), there are early_printk implementations, which use
the CON_BOOT which come out serial ports, vga, usb, & memory
buffers.
In many embedded systems, it would be nice to have two
bootconsoles - in case the primary fails, you always have
access to a backup memory buffer - but this requires at least
two CON_BOOT consoles...
This patch enables that functionality.
With the change applied, on boot you get (if you try to
re-enable a boot console after the "real" console has been
registered):
root:/> dmesg | grep console
bootconsole [early_shadow0] enabled
bootconsole [early_BFuart0] enabled
Kernel command line: root=/dev/mtdblock0 rw earlyprintk=serial,uart0,57600 console=ttyBF0,57600 nmi_debug=regs
console handover:boot [early_BFuart0] boot [early_shadow0] -> real [ttyBF0]
Too late to register bootconsole early_shadow0
or:
root:/> dmesg | grep console
Kernel command line: root=/dev/mtdblock0 rw console=ttyBF0,57600
console [ttyBF0] enabled
Signed-off-by: Robin Getz <rgetz@blackfin.uclinux.org>
Cc: "Linus Torvalds" <torvalds@linux-foundation.org>
Cc: "Andrew Morton" <akpm@linux-foundation.org>
Cc: "Mike Frysinger" <vapier.adi@gmail.com>
Cc: "Paul Mundt" <lethal@linux-sh.org>
LKML-Reference: <200907012108.38030.rgetz@blackfin.uclinux.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This adds the synchronize_sched_expedited() primitive that
implements the "big hammer" expedited RCU grace periods.
This primitive is placed in kernel/sched.c rather than
kernel/rcupdate.c due to its need to interact closely with the
migration_thread() kthread.
The idea is to wake up this kthread with req->task set to NULL,
in response to which the kthread reports the quiescent state
resulting from the kthread having been scheduled.
Because this patch needs to fallback to the slow versions of
the primitives in response to some races with CPU onlining and
offlining, a new synchronize_rcu_bh() primitive is added as
well.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: akpm@linux-foundation.org
Cc: torvalds@linux-foundation.org
Cc: davem@davemloft.net
Cc: dada1@cosmosbay.com
Cc: zbr@ioremap.net
Cc: jeff.chua.linux@gmail.com
Cc: paulus@samba.org
Cc: laijs@cn.fujitsu.com
Cc: jengelh@medozas.de
Cc: r000n@r000n.net
Cc: benh@kernel.crashing.org
Cc: mathieu.desnoyers@polymtl.ca
LKML-Reference: <12459460982947-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
We will lose something if trace_seq->buffer[0] is 0, because the copy length
is calculated by strlen() in seq_puts(), so using seq_write() instead of
seq_puts().
There have a example:
after reboot:
# echo kmemtrace > current_tracer
# echo 0 > options/kmem_minimalistic
# cat trace
# tracer: kmemtrace
#
#
Nothing is exported, because the first byte of trace_seq->buffer[ ]
is KMEMTRACE_USER_ALLOC.
( the value of KMEMTRACE_USER_ALLOC is zero, seeing
kmemtrace_print_alloc_user() in kernel/trace/kmemtrace.c)
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <4A4B2351.5010300@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
We already have ftrace= boot option, and this adds a similar
boot option for trace events, so allow trace events to be
enabled at boot, for boot debugging purpose.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <4A4ACE29.3010407@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Use struct list instead of struct hlist for managing
insn_pages, because insn_pages doesn't use hash table.
Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Jim Keniston <jkenisto@us.ibm.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
LKML-Reference: <20090630210814.17851.64651.stgit@localhost.localdomain>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Remove needless kprobe_insn_mutex unlocking during safety check
in garbage collection, because if someone releases a dirty slot
during safety check (which ensures other cpus doesn't execute
all dirty slots), the safety check must be fail. So, we need to
hold the mutex while checking safety.
Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Jim Keniston <jkenisto@us.ibm.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
LKML-Reference: <20090630210809.17851.28781.stgit@localhost.localdomain>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'kmemleak' of git://linux-arm.org/linux-2.6:
kmemleak: Inform kmemleak about pid_hash
kmemleak: Do not warn if an unknown object is freed
kmemleak: Do not report new leaked objects if the scanning was stopped
kmemleak: Slightly change the policy on newly allocated objects
kmemleak: Do not trigger a scan when reading the debug/kmemleak file
kmemleak: Simplify the reports logged by the scanning thread
kmemleak: Enable task stacks scanning by default
kmemleak: Allow the early log buffer to be configurable.
* 'perfcounters-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (47 commits)
perf report: Add --symbols parameter
perf report: Add --comms parameter
perf report: Add --dsos parameter
perf_counter tools: Adjust only prelinked symbol's addresses
perf_counter: Provide a way to enable counters on exec
perf_counter tools: Reduce perf stat measurement overhead/skew
perf stat: Use percentages for scaling output
perf_counter, x86: Update x86_pmu after WARN()
perf stat: Micro-optimize the code: memcpy is only required if no event is selected and !null_run
perf stat: Improve output
perf stat: Fix multi-run stats
perf stat: Add -n/--null option to run without counters
perf_counter tools: Remove dead code
perf_counter: Complete counter swap
perf report: Print sorted callchains per histogram entries
perf_counter tools: Prepare a small callchain framework
perf record: Fix unhandled io return value
perf_counter tools: Add alias for 'l1d' and 'l1i'
perf-report: Add bare minimum PERF_EVENT_READ parsing
perf-report: Add modes for inherited stats and no-samples
...
The file opened in acct_on and freshly stored in the ns->bacct struct can
be closed in acct_file_reopen by a concurrent call after we release
acct_lock and before we call mntput(file->f_path.mnt).
Record file->f_path.mnt in a local variable and use this variable only.
Signed-off-by: Renaud Lottiaux <renaud.lottiaux@kerlabs.com>
Signed-off-by: Louis Rilling <louis.rilling@kerlabs.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When the 32-bit signed quantities get assigned to the u64 resource_size_t,
they are incorrectly sign-extended.
Addresses http://bugzilla.kernel.org/show_bug.cgi?id=13253
Addresses http://bugzilla.kernel.org/show_bug.cgi?id=9905
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Reported-by: Leann Ogasawara <leann@ubuntu.com>
Cc: Pierre Ossman <drzeus@drzeus.cx>
Reported-by: <pablomme@googlemail.com>
Tested-by: <pablomme@googlemail.com>
Cc: <stable@kernel.org>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This provides a way to mark a counter to be enabled on the next
exec. This is useful for measuring the total activity of a
program without including overhead from the process that
launches it.
This also changes the perf stat command to use this new
facility.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <19017.43927.838745.689203@cargo.ozlabs.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Kmemleak does not track alloc_bootmem calls but the pid_hash allocated
in pidhash_init() would need to be scanned as it contains pointers to
struct pid objects.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
To use boot tracer, one should pass initcall_debug as well as
ftrace=initcall to the command line.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <4A48735E.9050002@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Hide __raw_get_cpu_var() as well - thus all the direct
references to runqueues will abstracted out.
Signed-off-by: Hitoshi Mitake <mitake@dcl.info.waseda.ac.jp>
LKML-Reference: <20090629.144457.886429910353660979.mitake@dcl.info.waseda.ac.jp>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, delay: tsc based udelay should have rdtsc_barrier
x86, setup: correct include file in <asm/boot.h>
x86, setup: Fix typo "CONFIG_x86_64" in <asm/boot.h>
x86, mce: percpu mcheck_timer should be pinned
x86: Add sysctl to allow panic on IOCK NMI error
x86: Fix uv bau sending buffer initialization
x86, mce: Fix mce resume on 32bit
x86: Move init_gbpages() to setup_arch()
x86: ensure percpu lpage doesn't consume too much vmalloc space
x86: implement percpu_alloc kernel parameter
x86: fix pageattr handling for lpage percpu allocator and re-enable it
x86: reorganize cpa_process_alias()
x86: prepare setup_pcpu_lpage() for pageattr fix
x86: rename remap percpu first chunk allocator to lpage
x86: fix duplicate free in setup_pcpu_remap() failure path
percpu: fix too lazy vunmap cache flushing
x86: Set cpu_llc_id on AMD CPUs
* 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
timer stats: Optimize by adding quick check to avoid function calls
timers: Fix timer_migration interface which accepts any number as input
* 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
ftrace: Fix the output of profile
ring-buffer: Make it generally available
ftrace: Remove duplicate newline
tracing: Fix trace_buf_size boot option
ftrace: Fix t_hash_start()
ftrace: Don't manipulate @pos in t_start()
ftrace: Don't increment @pos in g_start()
tracing: Reset iterator in t_start()
trace_stat: Don't increment @pos in seq start()
tracing_bprintk: Don't increment @pos in t_start()
tracing/events: Don't increment @pos in s_start()
Some fields for struct ftrace_graph_ret are missed
when they are exported to user.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <4A448FB6.5000302@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This made my machine completely frozen:
# echo 1 > /proc/sys/kernel/stack_tracer_enabled
# echo 2 > /proc/sys/kernel/stack_tracer_enabled
The cause is register_ftrace_function() was called twice.
Also fix ftrace_enabled sysctl, though seems nothing bad happened
as I tested it.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <4A448D17.9010305@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Complete the counter swap by indeed switching the times too and
updating the userpage after modifying the counter values.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
LKML-Reference: <1246014623.31755.195.camel@twins>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The first entry of the ftrace profile was always skipped when
reading trace_stat/functionX.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <4A443D59.4080307@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch introduces a new sysctl:
/proc/sys/kernel/panic_on_io_nmi
which defaults to 0 (off).
When enabled, the kernel panics when the kernel receives an NMI
caused by an IO error.
The IO error triggered NMI indicates a serious system
condition, which could result in IO data corruption. Rather
than contiuing, panicing and dumping might be a better choice,
so one can figure out what's causing the IO error.
This could be especially important to companies running IO
intensive applications where corruption must be avoided, e.g. a
bank's databases.
[ SuSE has been shipping it for a while, it was done at the
request of a large database vendor, for their users. ]
Signed-off-by: Kurt Garloff <garloff@suse.de>
Signed-off-by: Roberto Angelino <robertangelino@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
LKML-Reference: <20090624213211.GA11291@kroah.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The PERF_EVENT_READ implementation made me realize we don't
actually need the sample_type int the output sample, since
we already have that in the perf_counter_attr information.
Therefore, remove the PERF_EVENT_MISC_OVERFLOW bit and the
event->type overloading, and imply put counter overflow
samples in a PERF_EVENT_SAMPLE type.
This also fixes the issue that event->type was only 32-bit
and sample_type had 64 usable bits.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
With the introduction of PERF_EVENT_READ we have the
possibility to provide accurate counter values for
individual tasks in a task hierarchy.
However, due to the lazy context switching used for similar
counter contexts our current per task counts are way off.
In order to maintain some of the lazy switch benefits we
don't disable it out-right, but simply iterate the active
counters and flip the values between the contexts.
This only reads the counters but does not need to reprogram
the full PMU.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Provide a read() like event which can be used to log the
counter value at specific sites such as child->parent
folding on exit.
In order to be useful, we log the counter parent ID, not the
actual counter ID, since userspace can only relate parent
IDs to perf_counter_attr constructs.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Update the mmap control page with the needed information to
use the userspace RDPMC instruction for self monitoring.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add the needed time scale to the self-profile mmap information.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Yanmin noticed that fault_in_user_writeable() requests 4 pages instead
of one.
That's the result of blindly trusting Linus' proposal :) I even looked
up the prototype to verify the correctness: the argument in question
is confusingly enough named "len" while in reality it means number of
pages.
Pointed-out-by: Yanmin Zhang <yanmin_zhang@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
In hunting down the cause for the hwlat_detector ring buffer spew in
my failed -next builds it became obvious that folks are now treating
ring_buffer as something that is generic independent of tracing and thus,
suitable for public driver consumption.
Given that there are only a few minor areas in ring_buffer that have any
reliance on CONFIG_TRACING or CONFIG_FUNCTION_TRACER, provide stubs for
those and make it generally available.
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Cc: Jon Masters <jcm@jonmasters.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <20090625053012.GB19944@linux-sh.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
another race fix in jfs_check_acl()
Get "no acls for this inode" right, fix shmem breakage
inline functions left without protection of ifdef (acl)
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current:
audit: inode watches depend on CONFIG_AUDIT not CONFIG_AUDIT_SYSCALL
Even though one cannot make use of the audit watch code without
CONFIG_AUDIT_SYSCALL the spaghetti nature of the audit code means that
the audit rule filtering requires that it at least be compiled.
Thus build the audit_watch code when we build auditfilter like it was
before cfcad62c74
Clearly this is a point of potential future cleanup..
Reported-by: Frans Pop <elendil@planet.nl>
Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
commit 64d1304a64 (futex: setup writeable mapping for futex ops which
modify user space data) did address only half of the problem of write
access faults.
The patch was made on two wrong assumptions:
1) access_ok(VERIFY_WRITE,...) would actually check write access.
On x86 it does _NOT_. It's a pure address range check.
2) a RW mapped region can not go away under us.
That's wrong as well. Nobody can prevent another thread to call
mprotect(PROT_READ) on that region where the futex resides. If that
call hits between the get_user_pages_fast() verification and the
actual write access in the atomic region we are toast again.
The solution is to not rely on access_ok and get_user() for any write
access related fault on private and shared futexes. Instead we need to
fault it in with verification of write access.
There is no generic non destructive write mechanism which would fault
the user page in trough a #PF, but as we already know that we will
fault we can as well call get_user_pages() directly and avoid the #PF
overhead.
If get_user_pages() returns -EFAULT we know that we can not fix it
anymore and need to bail out to user space.
Remove a bunch of confusing comments on this issue as well.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@kernel.org
The ->ptrace_may_access() methods are named confusingly - the real
ptrace_may_access() returns a bool, while these security checks have
a retval convention.
Rename it to ptrace_access_check, to reduce the confusion factor.
[ Impact: cleanup, no code changed ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: James Morris <jmorris@namei.org>
Remove Classic RCU, given that the combination of Tree RCU and
the proposed Bloatwatch RCU do everything that Classic RCU can
with fewer bugs.
Tree RCU has been default in x86 builds for almost six months,
and seems to be quite reliable, so there does not seem to be
much justification for keeping the Classic RCU code and config
complexity around anymore.
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Cc: akpm@linux-foundation.org
Cc: niv@us.ibm.com
Cc: dvhltc@us.ibm.com
Cc: dipankar@in.ibm.com
Cc: dhowells@redhat.com
Cc: lethal@linux-sh.org
Cc: kernel@wantstofly.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
We should be able to specify [KMG] when setting trace_buf_size
boot option, as documented in kernel-parameters.txt
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <4A41F2DB.4020102@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When the kernel is configured with CONFIG_TIMER_STATS but timer
stats are runtime disabled we still get calls to
__timer_stats_timer_set_start_info which initializes some
fields in the corresponding struct timer_list.
So add some quick checks in the the timer stats setup functions
to avoid function calls to __timer_stats_timer_set_start_info
when timer stats are disabled.
In an artificial workload that does nothing but playing ping
pong with a single tcp packet via loopback this decreases cpu
consumption by 1 - 1.5%.
This is part of a modified function trace output on SLES11:
perl-2497 [00] 28630647177732388 [+ 125]: sk_reset_timer <-tcp_v4_rcv
perl-2497 [00] 28630647177732513 [+ 125]: mod_timer <-sk_reset_timer
perl-2497 [00] 28630647177732638 [+ 125]: __timer_stats_timer_set_start_info <-mod_timer
perl-2497 [00] 28630647177732763 [+ 125]: __mod_timer <-mod_timer
perl-2497 [00] 28630647177732888 [+ 125]: __timer_stats_timer_set_start_info <-__mod_timer
perl-2497 [00] 28630647177733013 [+ 93]: lock_timer_base <-__mod_timer
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Mustafa Mesanovic <mustafa.mesanovic@de.ibm.com>
Cc: Arjan van de Ven <arjan@infradead.org>
LKML-Reference: <20090623153811.GA4641@osiris.boeblingen.de.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When the output of set_ftrace_filter is larger than PAGE_SIZE,
t_hash_start() will be called the 2nd time, and then we start
from the head of a hlist, which is wrong and causes some entries
to be outputed twice.
The worse is, if the hlist is large enough, reading set_ftrace_filter
won't stop but in a dead loop.
Reviewed-by: Liming Wang <liming.wang@windriver.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <4A41876E.2060407@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
It's rather confusing that in t_start(), in some cases @pos is
incremented, and in some cases it's decremented and then incremented.
This patch rewrites t_start() in a much more general way.
Thus we fix a bug that if ftrace_filtered == 1, functions have tracer
hooks won't be printed, because the branch is always unreachable:
static void *t_start(...)
{
...
if (!p)
return t_hash_start(m, pos);
return p;
}
Before:
# echo 'sys_open' > /mnt/tracing/set_ftrace_filter
# echo 'sys_write:traceon:4' >> /mnt/tracing/set_ftrace_filter
sys_open
After:
# echo 'sys_open' > /mnt/tracing/set_ftrace_filter
# echo 'sys_write:traceon:4' >> /mnt/tracing/set_ftrace_filter
sys_open
sys_write:traceon:count=4
Reviewed-by: Liming Wang <liming.wang@windriver.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <4A41874B.4090507@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
It's wrong to increment @pos in g_start(). It causes some entries
lost when reading set_graph_function, if the output of the file
is larger than PAGE_SIZE.
Reviewed-by: Liming Wang <liming.wang@windriver.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <4A418738.7090401@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The iterator is m->private, but it's not reset to trace_types in
t_start(). If the output is larger than PAGE_SIZE and t_start()
is called the 2nd time, things will go wrong.
Reviewed-by: Liming Wang <liming.wang@windriver.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <4A418728.5020506@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
It's wrong to increment @pos in stat_seq_start(). It causes some
stat entries lost when reading stat file, if the output of the file
is larger than PAGE_SIZE.
Reviewed-by: Liming Wang <liming.wang@windriver.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <4A418716.90209@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
It's wrong to increment @pos in t_start(), otherwise we'll lose
some entries when reading printk_formats, if the output is larger
than PAGE_SIZE.
Reported-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Reviewed-by: Liming Wang <liming.wang@windriver.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <4A4186FA.1020106@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
While testing syscall tracepoints posted by Jason, I found 3 entries
were missing when reading available_events. The output size of
available_events is < 4 pages, which means we lost 1 entry per page.
The cause is, it's wrong to increment @pos in s_start().
Actually there's another bug here -- reading avaiable_events/set_events
can race with module unload:
# cat available_events |
s_start() |
s_stop() |
| # rmmod foo.ko
s_start() |
call = list_entry(m->private) |
@call might be freed and accessing it will lead to crash.
Reviewed-by: Liming Wang <liming.wang@windriver.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <4A4186DD.6090405@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Percpu variable definition is about to be updated such that all percpu
symbols including the static ones must be unique. Update percpu
variable definitions accordingly.
* as,cfq: rename ioc_count uniquely
* cpufreq: rename cpu_dbs_info uniquely
* xen: move nesting_count out of xen_evtchn_do_upcall() and rename it
* mm: move ratelimits out of balance_dirty_pages_ratelimited_nr() and
rename it
* ipv4,6: rename cookie_scratch uniquely
* x86 perf_counter: rename prev_left to pmc_prev_left, irq_entry to
pmc_irq_entry and nmi_entry to pmc_nmi_entry
* perf_counter: rename disable_count to perf_disable_count
* ftrace: rename test_event_disable to ftrace_test_event_disable
* kmemleak: rename test_pointer to kmemleak_test_pointer
* mce: rename next_interval to mce_next_interval
[ Impact: percpu usage cleanups, no duplicate static percpu var names ]
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: linux-mm <linux-mm@kvack.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Steven Rostedt <srostedt@redhat.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Andi Kleen <andi@firstfloor.org>
There are a few places where ___cacheline_aligned* is used with
DEFINE_PER_CPU(). Use DEFINE_PER_CPU_SHARED_ALIGNED() instead.
DEFINE_PER_CPU_SHARED_ALIGNED() applies alignment only on SMPs. While
all other converted places used _in_smp variant or only get compiled
for SMP, net/rds used unconditional ____cacheline_aligned. I don't
see any reason these data structures should be aligned on UP and thus
converted together.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Andy Grover <andy.grover@oracle.com>
This patch makes most !CONFIG_HAVE_SETUP_PER_CPU_AREA archs use
dynamic percpu allocator. The first chunk is allocated using
embedding helper and 8k is reserved for modules. This ensures that
the new allocator behaves almost identically to the original allocator
as long as static percpu variables are concerned, so it shouldn't
introduce much breakage.
s390 and alpha use custom SHIFT_PERCPU_PTR() to work around addressing
range limit the addressing model imposes. Unfortunately, this breaks
if the address is specified using a variable, so for now, the two
archs aren't converted.
The following architectures are affected by this change.
* sh
* arm
* cris
* mips
* sparc(32)
* blackfin
* avr32
* parisc (broken, under investigation)
* m32r
* powerpc(32)
As this change makes the dynamic allocator the default one,
CONFIG_HAVE_DYNAMIC_PER_CPU_AREA is replaced with its invert -
CONFIG_HAVE_LEGACY_PER_CPU_AREA, which is added to yet-to-be converted
archs. These archs implement their own setup_per_cpu_areas() and the
conversion is not trivial.
* powerpc(64)
* sparc(64)
* ia64
* alpha
* s390
Boot and batch alloc/free tests on x86_32 with debug code (x86_32
doesn't use default first chunk initialization). Compile tested on
sparc(32), powerpc(32), arm and alpha.
Kyle McMartin reported that this change breaks parisc. The problem is
still under investigation and he is okay with pushing this patch
forward and fixing parisc later.
[ Impact: use dynamic allocator for most archs w/o custom percpu setup ]
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Mikael Starvik <starvik@axis.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Bryan Wu <cooloney@kernel.org>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Grant Grundler <grundler@parisc-linux.org>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
If syscall removes the root of subtree being watched, we
definitely do not want the rules refering that subtree
to be destroyed without the syscall in question having
a chance to match them.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
A number of places in the audit system we send an op= followed by a string
that includes spaces. Somehow this works but it's just wrong. This patch
moves all of those that I could find to be quoted.
Example:
Change From: type=CONFIG_CHANGE msg=audit(1244666690.117:31): auid=0 ses=1
subj=unconfined_u:unconfined_r:auditctl_t:s0-s0:c0.c1023 op=remove rule
key="number2" list=4 res=0
Change To: type=CONFIG_CHANGE msg=audit(1244666690.117:31): auid=0 ses=1
subj=unconfined_u:unconfined_r:auditctl_t:s0-s0:c0.c1023 op="remove rule"
key="number2" list=4 res=0
Signed-off-by: Eric Paris <eparis@redhat.com>
audit_get_nd() is only used by audit_watch and could be more cleanly
implemented by having the audit watch functions call it when needed rather
than making the generic audit rule parsing code deal with those objects.
Signed-off-by: Eric Paris <eparis@redhat.com>
In preparation for converting audit to use fsnotify instead of inotify we
seperate the inode watching code into it's own file. This is similar to
how the audit tree watching code is already seperated into audit_tree.c
Signed-off-by: Eric Paris <eparis@redhat.com>
audit_receive_skb is hard to clearly parse what it is doing to the netlink
message. Clean the function up so it is easy and clear to see what is going
on.
Signed-off-by: Eric Paris <eparis@redhat.com>
The audit handling of netlink messages is all over the place. Clean things
up, use predetermined macros, generally make it more readable.
Signed-off-by: Eric Paris <eparis@redhat.com>
audit_update_watch() runs all of the rules for a given watch and duplicates
them, attaches a new watch to them, and then when it finishes that process
and has called free on all of the old rules (ok maybe still inside the rcu
grace period) it proceeds to use the last element from list_for_each_entry_safe()
as if it were a krule rather than being the audit_watch which was anchoring
the list to output a message about audit rules changing.
This patch unfies the audit message from two different places into a helper
function and calls it from the correct location in audit_update_rules(). We
will now get an audit message about the config changing for each rule (with
each rules filterkey) rather than the previous garbage.
Signed-off-by: Eric Paris <eparis@redhat.com>
The audit execve record splitting code estimates the length of the message
generated. But it forgot to include the "" that wrap each string in its
estimation. This means that execve messages with lots of tiny (1-2 byte)
arguments could still cause records greater than 8k to be emitted. Simply
fix the estimate.
Signed-off-by: Eric Paris <eparis@redhat.com>
When an audit watch is added to a parent the temporary watch inside the
original krule from userspace is freed. Yet the original watch is used after
the real watch was created in audit_add_rules()
Signed-off-by: Eric Paris <eparis@redhat.com>
We don't need to add usage counts for swcounter and attr usage
models for inherited counters since the parent counter will
always have one, which suffices to generate the needed output.
This avoids up to 3 global atomic increments per inherited
counter.
LKML-Reference: <new-submission>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Teach perf_counter_alloc() about inheritance so that we can
optimize the inherit path in the next patch.
Remove the child_counter->atrr.inherit = 1 line because the
only way to get there is if parent_counter->attr.inherit == 1
and we copy the attrs.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Similar to tracepoints, use an enable variable to reduce
overhead when unused.
Only look for a counter of a particular event type when we know
there is at least one in the system.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
SLAB uses get/put_online_cpus() which use a mutex which is itself only
initialized when cpu_hotplug_init() is called. Currently we hang suring
boot in SLAB due to doing that too late.
Reported by James Bottomley and Sachin Sant (and possibly others).
Debugged by Benjamin Herrenschmidt.
This just removes the dynamic initialization of the data structures, and
replaces it with a static one, avoiding this dependency entirely, and
removing one unnecessary special initcall.
Tested-by: Sachin Sant <sachinp@in.ibm.com>
Tested-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Tested-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
genirq, irq.h: Fix kernel-doc warnings
genirq: fix comment to say IRQ_WAKE_THREAD
* 'perfcounters-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (49 commits)
perfcounter: Handle some IO return values
perf_counter: Push perf_sample_data through the swcounter code
perf_counter tools: Define and use our own u64, s64 etc. definitions
perf_counter: Close race in perf_lock_task_context()
perf_counter, x86: Improve interactions with fast-gup
perf_counter: Simplify and fix task migration counting
perf_counter tools: Add a data file header
perf_counter: Update userspace callchain sampling uses
perf_counter: Make callchain samples extensible
perf report: Filter to parent set by default
perf_counter tools: Handle lost events
perf_counter: Add event overlow handling
fs: Provide empty .set_page_dirty() aop for anon inodes
perf_counter: tools: Makefile tweaks for 64-bit powerpc
perf_counter: powerpc: Add processor back-end for MPC7450 family
perf_counter: powerpc: Make powerpc perf_counter code safe for 32-bit kernels
perf_counter: powerpc: Change how processor-specific back-ends get selected
perf_counter: powerpc: Use unsigned long for register and constraint values
perf_counter: powerpc: Enable use of software counters on 32-bit powerpc
perf_counter tools: Add and use isprint()
...
* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
sched: Fix out of scope variable access in sched_slice()
sched: Hide runqueues from direct refer at source code level
sched: Remove unneeded __ref tag
sched, x86: Fix cpufreq + sched_clock() TSC scaling
* 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (24 commits)
tracing/urgent: warn in case of ftrace_start_up inbalance
tracing/urgent: fix unbalanced ftrace_start_up
function-graph: add stack frame test
function-graph: disable when both x86_32 and optimize for size are configured
ring-buffer: have benchmark test print to trace buffer
ring-buffer: do not grab locks in nmi
ring-buffer: add locks around rb_per_cpu_empty
ring-buffer: check for less than two in size allocation
ring-buffer: remove useless compile check for buffer_page size
ring-buffer: remove useless warn on check
ring-buffer: use BUF_PAGE_HDR_SIZE in calculating index
tracing: update sample event documentation
tracing/filters: fix race between filter setting and module unload
tracing/filters: free filter_string in destroy_preds()
ring-buffer: use commit counters for commit pointer accounting
ring-buffer: remove unused variable
ring-buffer: have benchmark test handle discarded events
ring-buffer: prevent adding write in discarded area
tracing/filters: strloc should be unsigned short
tracing/filters: operand can be negative
...
Fix up kmemcheck-induced conflict in kernel/trace/ring_buffer.c manually
Push the perf_sample_data further outwards to the swcounter interface,
to abstract it away some more.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Prevent from further ftrace_start_up inbalances so that we avoid
future nop patching omissions with dynamic ftrace.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Perfcounter reports the following stats for a wide system
profiling:
#
# (2364 samples)
#
# Overhead Symbol
# ........ ......
#
15.40% [k] mwait_idle_with_hints
8.29% [k] read_hpet
5.75% [k] ftrace_caller
3.60% [k] ftrace_call
[...]
This snapshot has been taken while neither the function tracer nor
the function graph tracer was running.
With dynamic ftrace, such results show a wrong ftrace behaviour
because all calls to ftrace_caller or ftrace_graph_caller (the patched
calls to mcount) are supposed to be patched into nop if none of those
tracers are running.
The problem occurs after the first run of the function tracer. Once we
launch it a second time, the callsites will never be nopped back,
unless you set custom filters.
For example it happens during the self tests at boot time.
The function tracer selftest runs, and then the dynamic tracing is
tested too. After that, the callsites are left un-nopped.
This is because the reset callback of the function tracer tries to
unregister two ftrace callbacks in once: the common function tracer
and the function tracer with stack backtrace, regardless of which
one is currently in use.
It then creates an unbalance on ftrace_start_up value which is expected
to be zero when the last ftrace callback is unregistered. When it
reaches zero, the FTRACE_DISABLE_CALLS is set on the next ftrace
command, triggering the patching into nop. But since it becomes
unbalanced, ie becomes lower than zero, if the kernel functions
are patched again (as in every further function tracer runs), they
won't ever be nopped back.
Note that ftrace_call and ftrace_graph_call are still patched back
to ftrace_stub in the off case, but not the callers of ftrace_call
and ftrace_graph_caller. It means that the tracing is well deactivated
but we waste a useless call into every kernel function.
This patch just unregisters the right ftrace_ops for the function
tracer on its reset callback and ignores the other one which is
not registered, fixing the unbalance. The problem also happens
is .30
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: stable@kernel.org
The bug is ancient.
If we trace the sub-thread of our natural child and this sub-thread exits,
we update parent->signal->cxxx fields. But we should not do this until
the whole thread-group exits, otherwise we account this thread (and all
other live threads) twice.
Add the task_detached() check. No need to check thread_group_empty(),
wait_consider_task()->delay_group_leader() already did this.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Roland McGrath <roland@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Vitaly Mayatskikh <vmayatsk@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
perf_lock_task_context() is buggy because it can return a dead
context.
the RCU read lock in perf_lock_task_context() only guarantees
the memory won't get freed, it doesn't guarantee the object is
valid (in our case refcount > 0).
Therefore we can return a locked object that can get freed the
moment we release the rcu read lock.
perf_pin_task_context() then increases the refcount and does an
unlock on freed memory.
That increased refcount will cause a double free, in case it
started out with 0.
Ammend this by including the get_ctx() functionality in
perf_lock_task_context() (all users already did this later
anyway), and return a NULL context when the found one is
already dead.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The task migrations counter was causing rare and hard to decypher
memory corruptions under load. After a day of debugging and bisection
we found that the problem was introduced with:
3f731ca: perf_counter: Fix cpu migration counter
Turning them off fixes the crashes. Incidentally, the whole
perf_counter_task_migration() logic can be done simpler as well,
by injecting a proper sw-counter event.
This cleanup also fixed the crashes. The precise failure mode is
not completely clear yet, but we are clearly not unhappy about
having a fix ;-)
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
In case gcc does something funny with the stack frames, or the return
from function code, we would like to detect that.
An arch may implement passing of a variable that is unique to the
function and can be saved on entering a function and can be tested
when exiting the function. Usually the frame pointer can be used for
this purpose.
This patch also implements this for x86. Where it passes in the stack
frame of the parent function, and will test that frame on exit.
There was a case in x86_32 with optimize for size (-Os) where, for a
few functions, gcc would align the stack frame and place a copy of the
return address into it. The function graph tracer modified the copy and
not the actual return address. On return from the funtion, it did not go
to the tracer hook, but returned to the parent. This broke the function
graph tracer, because the return of the parent (where gcc did not do
this funky manipulation) returned to the location that the child function
was suppose to. This caused strange kernel crashes.
This test detected the problem and pointed out where the issue was.
This modifies the parameters of one of the functions that the arch
specific code calls, so it includes changes to arch code to accommodate
the new prototype.
Note, I notice that the parsic arch implements its own push_return_trace.
This is now a generic function and the ftrace_push_return_trace should be
used instead. This patch does not touch that code.
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
On x86_32, when optimize for size is set, gcc may align the frame pointer
and make a copy of the the return address inside the stack frame.
The return address that is located in the stack frame may not be
the one used to return to the calling function. This will break the
function graph tracer.
The function graph tracer replaces the return address with a jump to a hook
function that can trace the exit of the function. If it only replaces
a copy, then the hook will not be called when the function returns.
Worse yet, when the parent function returns, the function graph tracer
will return back to the location of the child function which will
easily crash the kernel with weird results.
To see the problem, when i386 is compiled with -Os we get:
c106be03: 57 push %edi
c106be04: 8d 7c 24 08 lea 0x8(%esp),%edi
c106be08: 83 e4 e0 and $0xffffffe0,%esp
c106be0b: ff 77 fc pushl 0xfffffffc(%edi)
c106be0e: 55 push %ebp
c106be0f: 89 e5 mov %esp,%ebp
c106be11: 57 push %edi
c106be12: 56 push %esi
c106be13: 53 push %ebx
c106be14: 81 ec 8c 00 00 00 sub $0x8c,%esp
c106be1a: e8 f5 57 fb ff call c1021614 <mcount>
When it is compiled with -O2 instead we get:
c10896f0: 55 push %ebp
c10896f1: 89 e5 mov %esp,%ebp
c10896f3: 83 ec 28 sub $0x28,%esp
c10896f6: 89 5d f4 mov %ebx,0xfffffff4(%ebp)
c10896f9: 89 75 f8 mov %esi,0xfffffff8(%ebp)
c10896fc: 89 7d fc mov %edi,0xfffffffc(%ebp)
c10896ff: e8 d0 08 fa ff call c1029fd4 <mcount>
The compile with -Os will align the stack pointer then set up the
frame pointer (%ebp), and it copies the return address back into
the stack frame. The change to the return address in mcount is done
to the copy and not the real place holder of the return address.
Then compile with -O2 sets up the frame pointer first, this makes
the change to the return address by mcount affect where the function
will jump on exit.
Reported-by: Jake Edge <jake@lwn.net>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Enable gcov profiling of the entire kernel on x86_64. Required changes
include disabling profiling for:
* arch/kernel/acpi/realmode and arch/kernel/boot/compressed:
not linked to main kernel
* arch/vdso, arch/kernel/vsyscall_64 and arch/kernel/hpet:
profiling causes segfaults during boot (incompatible context)
Signed-off-by: Peter Oberparleiter <oberpar@linux.vnet.ibm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Li Wei <W.Li@Sun.COM>
Cc: Michael Ellerman <michaele@au1.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Heiko Carstens <heicars2@linux.vnet.ibm.com>
Cc: Martin Schwidefsky <mschwid2@linux.vnet.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: WANG Cong <xiyou.wangcong@gmail.com>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Enable the use of GCC's coverage testing tool gcov [1] with the Linux
kernel. gcov may be useful for:
* debugging (has this code been reached at all?)
* test improvement (how do I change my test to cover these lines?)
* minimizing kernel configurations (do I need this option if the
associated code is never run?)
The profiling patch incorporates the following changes:
* change kbuild to include profiling flags
* provide functions needed by profiling code
* present profiling data as files in debugfs
Note that on some architectures, enabling gcc's profiling option
"-fprofile-arcs" for the entire kernel may trigger compile/link/
run-time problems, some of which are caused by toolchain bugs and
others which require adjustment of architecture code.
For this reason profiling the entire kernel is initially restricted
to those architectures for which it is known to work without changes.
This restriction can be lifted once an architecture has been tested
and found compatible with gcc's profiling. Profiling of single files
or directories is still available on all platforms (see config help
text).
[1] http://gcc.gnu.org/onlinedocs/gcc/Gcov.html
Signed-off-by: Peter Oberparleiter <oberpar@linux.vnet.ibm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Li Wei <W.Li@Sun.COM>
Cc: Michael Ellerman <michaele@au1.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Heiko Carstens <heicars2@linux.vnet.ibm.com>
Cc: Martin Schwidefsky <mschwid2@linux.vnet.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: WANG Cong <xiyou.wangcong@gmail.com>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Call constructors (gcc-generated initcall-like functions) during kernel
start and module load. Constructors are e.g. used for gcov data
initialization.
Disable constructor support for usermode Linux to prevent conflicts with
host glibc.
Signed-off-by: Peter Oberparleiter <oberpar@linux.vnet.ibm.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: WANG Cong <xiyou.wangcong@gmail.com>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Li Wei <W.Li@Sun.COM>
Cc: Michael Ellerman <michaele@au1.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Heiko Carstens <heicars2@linux.vnet.ibm.com>
Cc: Martin Schwidefsky <mschwid2@linux.vnet.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
clone_nsproxy() does useless copying of old nsproxy -- every pointer will
be rewritten to new ns or to old ns. Remove copying, rename
clone_nsproxy(), create_nsproxy() will be used by C/R code to create fresh
nsproxy on restart.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
create_uts_ns() will be used by C/R to create fresh uts_ns.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
copy_pid_ns() is a perfect example of a case where unwinding leads to more
code and makes it less clear. Watch the diffstat.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Reviewed-by: Serge Hallyn <serue@us.ibm.com>
Acked-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Reviewed-by: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
create_pid_namespace() creates everything, but caller has to assign parent
pidns by hand, which is unnatural. At the moment of call new ->level has
to be taken from somewhere and parent pidns is already available.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Acked-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Reviewed-by: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
find_task_by_pid_type_ns is only used to implement find_task_by_vpid and
find_task_by_pid_ns, but both of them pass PIDTYPE_PID as first argument.
So just fold find_task_by_pid_type_ns into find_task_by_pid_ns and use
find_task_by_pid_ns to implement find_task_by_vpid.
While we're at it also remove the exports for find_task_by_pid_ns and
find_task_by_vpid - we don't have any modular callers left as the only
modular caller of he old pre pid namespace find_task_by_pid (gfs2) was
switched to pid_task which operates on a struct pid pointer instead of a
pid_t. Given the confusion about pid_t values vs namespace that's
generally the better option anyway and I think we're better of restricting
modules to do it that way.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remoce the unused variable 'val' from __do_proc_dointvec()
The integer has been declared and used as 'val = -val' and there is no
reference to it anywhere.
Signed-off-by: Sukanto Ghosh <sukanto.cse.iitb@gmail.com>
Cc: Jaswinder Singh Rajput <jaswinder@kernel.org>
Cc: Sukanto Ghosh <sukanto.cse.iitb@gmail.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that kthread_stop() can be used even if the task has already exited,
we can kill the "wait_to_die:" loop in migration_thread(). But we must
pin rq->migration_thread after creation.
Actually, I don't think CPU_UP_CANCELED or CPU_DEAD should wait for
->migration_thread exit. Perhaps we can simplify this code a bit more.
migration_call() can set ->should_stop and forget about this thread. But
we need a new helper in kthred.c for that.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Vitaliy Gusev <vgusev@openvz.org
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Based on Eric's patch which in turn was based on my patch.
kthread_stop() has the nasty problems:
- it runs unpredictably long with the global semaphore held.
- it deadlocks if kthread itself does kthread_stop() before it obeys
the kthread_should_stop() request.
- it is not useable if kthread exits on its own, see for example the
ugly "wait_to_die:" hack in migration_thread()
- it is not possible to just tell kthread it should stop, we must always
wait for its exit.
With this patch kthread() allocates all neccesary data (struct kthread) on
its own stack, globals kthread_stop_xxx are deleted. ->vfork_done is used
as a pointer into "struct kthread", this means kthread_stop() can easily
wait for kthread's exit.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Vitaliy Gusev <vgusev@openvz.org
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We use two completions two create the kernel thread, this is a bit ugly.
kthread() wakes up create_kthread() via ->started, then create_kthread()
wakes up the caller kthread_create() via ->done. But kthread() does not
need to wait for kthread(), it can just return. Instead kthread() itself
can wake up the caller of kthread_create().
Kill kthread_create_info->started, ->done is enough. This improves the
scalability a bit and sijmplifies the code.
The only problem if kernel_thread() fails, in that case create_kthread()
must do complete(&create->done).
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Vitaliy Gusev <vgusev@openvz.org
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
do_wait:
current->state = TASK_INTERRUPTIBLE;
read_lock(&tasklist_lock);
... search for the task to reap ...
In theory, the ->state changing can leak into the critical section. Since
the child can change its status under read_lock(tasklist) in parallel
(finish_stop/ptrace_stop), we can miss the wakeup if __wake_up_parent()
sees us in TASK_RUNNING state. Add the barrier.
Also, use __set_current_state() to set TASK_RUNNING.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
do_wait() does BUG_ON(tsk->signal != current->signal), this looks like a
raher obsolete check. At least, I don't think do_wait() is the best place
to verify that all threads have the same ->signal. Remove it.
Also, change the code to use while_each_thread().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that we don't pass &retval down to other helpers we can simplify
the code more.
- kill tsk_result, just use retval
- add the "notask" label right after the main loop, and
s/got end/goto notask/ after the fastpath pid check.
This way we don't need to initialize retval before this
check and the code becomes a bit more clean, if this pid
has no attached tasks we should just skip the list search.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Introduce "struct wait_opts" which holds the parameters for misc helpers
in do_wait() pathes.
This adds 13 lines to kernel/exit.c, but saves 256 bytes from .o and imho
makes the code much more readable.
This patch temporary uglifies rusage/siginfo code a little bit, will be
addressed by further cleanups.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
No functional changes, preparation for the next patch.
ptrace_do_wait() adds WUNTRACED to options for wait_task_stopped() which
should always accept the stopped tracee, even if do_wait() was called
without WUNTRACED.
Change wait_task_stopped() to check "ptrace || WUNTRACED" instead. This
makes the code more explicit, and "int options" argument becomes const in
do_wait() pathes.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The forked child can have TIF_SIGPENDING if it was copied from parent's
ti->flags. But this is harmless and actually almost never happens,
because copy_process() can't succeed if signal_pending() == T.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is no reason for thread_group_cputime() in wait_task_zombie(), there
must be no other threads.
This call was previously needed to collect the per-cpu data which we do
not have any longer.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Roland McGrath <roland@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Vitaly Mayatskikh <vmayatsk@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change ptrace_getsiginfo/ptrace_setsiginfo to use lock_task_sighand()
without tasklist_lock. Perhaps it makes sense to make a single helper
with "bool rw" argument.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If the non-traced sub-thread calls do_notify_parent_cldstop(), we send the
notification to group_leader->real_parent and we report group_leader's
pid.
But, if group_leader is traced we use the wrong ->parent->nsproxy->pid_ns,
the tracer and parent can live in different namespaces. Change the code
to use "parent" instead of tsk->parent.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Roland McGrath <roland@redhat.com>
Acked-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change wait_task_zombie() to use ->real_parent instead of ->parent. We
could even use current afaics, but ->real_parent is more clean.
We know that the child is not ptrace_reparented() and thus they are equal.
But we should avoid using task_struct->parent, we are going to remove it.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Use rcu_read_lock() instead of tasklist_lock to find/get the task
in ptrace_get_task_struct().
- Make it static, it has no callers outside of ptrace.c.
- The comment doesn't match the reality, this helper does not do
any checks. Beacuse it is really trivial and static I removed the
whole comment.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove the "Nasty, nasty" lock dance in ptrace_attach()/ptrace_traceme() -
from now task_lock() has nothing to do with ptrace at all.
With the recent changes nobody uses task_lock() to serialize with ptrace,
but in fact it was never needed and it was never used consistently.
However ptrace_attach() calls __ptrace_may_access() and needs task_lock()
to pin task->mm for get_dumpable(). But we can call __ptrace_may_access()
before we take tasklist_lock, ->cred_exec_mutex protects us against
do_execve() path which can change creds and MMF_DUMP* flags.
(ugly, but we can't use ptrace_may_access() because it hides the error
code, so we have to take task_lock() and use __ptrace_may_access()).
NOTE: this change assumes that LSM hooks, security_ptrace_may_access() and
security_ptrace_traceme(), can be called without task_lock() held.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Chris Wright <chrisw@sous-sol.org>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ptrace_attach() and ptrace_traceme() are the last functions which look as
if the untraced task can have task->ptrace != 0, this must not be
possible. Change the code to just check ->ptrace != 0 and s/|=/=/ to set
PT_PTRACED.
Also, a couple of trivial whitespace cleanups in ptrace_attach().
And move ptrace_traceme() up near ptrace_attach() to keep them close to
each other.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Chris Wright <chrisw@sous-sol.org>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Add PF_KTHREAD check to prevent attaching to the kernel thread
with a borrowed ->mm.
With or without this change we can race with daemonize() which
can set PF_KTHREAD or clear ->mm after ptrace_attach() does the
check, but this doesn't matter because reparent_to_kthreadd()
does ptrace_unlink().
- Kill "!task->mm" check. We don't really care about ->mm != NULL,
and the task can call exit_mm() right after we drop task_lock().
What we need is to make sure we can't attach after exit_notify(),
check task->exit_state != 0 instead.
Also, move the "already traced" check down for cosmetic reasons.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Chris Wright <chrisw@sous-sol.org>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
No functional changes.
- Nobody except ptrace.c & co should use ptrace flags directly, we have
task_ptrace() for that.
- No need to specially check PT_PTRACED, we must not have other PT_ bits
set without PT_PTRACED. And no need to know this flag exists.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
"Search in the siblings" should use ->real_parent, not ->parent. If the
task is traced then ->parent == tracer, while the task's parent is always
->real_parent.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
allow_signal() checks ->mm == NULL. Not sure why. Perhaps to make sure
current is the kernel thread. But this helper must not be used unless we
are the kernel thread, kill this check.
Also, document the fact that the CLONE_SIGHAND kthread must not use
allow_signal(), unless the caller really wants to change the parent's
->sighand->action as well.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We don't have an interface to reset mem.limit or memsw.limit now.
This patch allows to reset mem.limit or memsw.limit when they are being
set to -1.
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The 'noprefix' option was introduced for backwards-compatibility of
cpuset, but actually it can be used when mounting other subsystems.
This results in possibility of name collision, and now the collision can
really happen, because we have 'stat' file in both memory and cpuacct
subsystem:
# mount -t cgroup -o noprefix,memory,cpuacct xxx /mnt
Cgroup will happily mount the 2 subsystems, but only 'stat' file of memory
subsys can be seen.
We don't want users to use nopreifx, and also want to avoid name
collision, so we change to allow noprefix only if mounting just the cpuset
subsystem.
[akpm@linux-foundation.org: fix shift for cpuset_subsys_id >= 32]
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Statistics for softirq doesn't exist.
It will be helpful like statistics for interrupts.
This patch introduces counting the number of softirq,
which will be exported in /proc/softirqs.
When softirq handler consumes much CPU time,
/proc/stat is like the following.
$ while :; do cat /proc/stat | head -n1 ; sleep 10 ; done
cpu 88 0 408 739665 583 28 2 0 0
cpu 450 0 1090 740970 594 28 1294 0 0
^^^^
softirq
In such a situation,
/proc/softirqs shows us which softirq handler is invoked.
We can see the increase rate of softirqs.
<before>
$ cat /proc/softirqs
CPU0 CPU1 CPU2 CPU3
HI 0 0 0 0
TIMER 462850 462805 462782 462718
NET_TX 0 0 0 365
NET_RX 2472 2 2 40
BLOCK 0 0 381 1164
TASKLET 0 0 0 224
SCHED 462654 462689 462698 462427
RCU 3046 2423 3367 3173
<after>
$ cat /proc/softirqs
CPU0 CPU1 CPU2 CPU3
HI 0 0 0 0
TIMER 463361 465077 465056 464991
NET_TX 53 0 1 365
NET_RX 3757 2 2 40
BLOCK 0 0 398 1170
TASKLET 0 0 0 224
SCHED 463074 464318 464612 463330
RCU 3505 2948 3947 3673
When CPU TIME of softirq is high,
the rates of increase is the following.
TIMER : 220/sec : CPU1-3
NET_TX : 5/sec : CPU0
NET_RX : 120/sec : CPU0
SCHED : 40-200/sec : all CPU
RCU : 45-58/sec : all CPU
The rates of increase in an idle mode is the following.
TIMER : 250/sec
SCHED : 250/sec
RCU : 2/sec
It seems many softirqs for receiving packets and rcu are invoked. This
gives us help for checking system.
Signed-off-by: Keika Kobayashi <kobayashi.kk@ncos.nec.co.jp>
Reviewed-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Eric Dumazet <dada1@cosmosbay.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Alternative method of mmap() data output handling that provides
better overflow management and a more reliable data stream.
Unlike the previous method, that didn't have any user->kernel
feedback and relied on userspace keeping up, this method relies on
userspace writing its last read position into the control page.
It will ensure new output doesn't overwrite not-yet read events,
new events for which there is no space left are lost and the
overflow counter is incremented, providing exact event loss
numbers.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Currently the output of the ring buffer benchmark/test prints to
the console. This test runs for ten seconds every ten seconds and
ouputs the result after every iteration. This needlessly fills up
the logs.
This patch makes the ring buffer benchmark/test print to the ftrace
buffer using trace_printk. To view the test results, you must examine
the debug/tracing/trace file.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
If ftrace_dump_on_oops is set, and an NMI detects a lockup, then it
will need to read from the ring buffer. But the read side of the
ring buffer still takes locks. This patch adds a check on the read
side that if it is in an NMI, then it will disable the ring buffer
and not take any locks.
Reads can still happen on a disabled ring buffer.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The checking of whether the buffer is empty or not needs to be serialized
among the readers. Add the reader spin lock around it.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The ring buffer must have at least two pages allocated for the
reader page swap to work.
The page count check will miss the case of a zero size passed in.
Even though a zero size ring buffer would probably fail an allocation,
making the min size check for less than two instead of equal to one makes
the code a bit more robust.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The original version of the ring buffer had a hack to map the
page struct that held the pages of the buffer to also be the structure
that the ring buffer would keep the pages in a link list.
This overlap of the page struct was very dangerous and that hack was
removed a while ago.
But there was a check to make sure the buffer_page never became bigger
than the page struct, and would fail the compile if it did. The
check was only meaningful when we had the hack. Now that we have separate
allocated descriptors for the buffer pages, we can remove this check.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* 'linux-next' of git://git.infradead.org/ubifs-2.6:
UBIFS: start using hrtimers
hrtimer: export ktime_add_safe
UBIFS: do not forget to register BDI device
UBIFS: allow sync option in rootflags
UBIFS: remove dead code
UBIFS: use anonymous device
UBIFS: return proper error code if the compr is not present
UBIFS: return error if link and unlink race
UBIFS: reset no_space flag after inode deletion
Access to local variable lw is aliased by usage of pointer load.
Access to pointer load in calc_delta_mine() happens when lw is
already out of scope.
[ Reported by static code analysis. ]
Signed-off-by: Christian Engelmayer <christian.engelmayer@frequentis.com>
LKML-Reference: <20090616103512.0c846e51@frequentis.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
There are some points which refer the per-cpu value "runqueues" directly.
sched.c provides nice abstraction, such as cpu_rq() and this_rq(),
so we should use these macros when looking runqueues.
Signed-off-by: Hitoshi Mitake <mitake@dcl.info.waseda.ac.jp>
LKML-Reference: <20090617.222055.374768827975756908.mitake@dcl.info.waseda.ac.jp>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Those two functions no longer call alloc_bootmmem_cpumask_var(),
so no need to tag them with __init_refok.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
LKML-Reference: <4A35DD5B.9050106@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* akpm: (182 commits)
fbdev: bf54x-lq043fb: use kzalloc over kmalloc/memset
fbdev: *bfin*: fix __dev{init,exit} markings
fbdev: *bfin*: drop unnecessary calls to memset
fbdev: bfin-t350mcqb-fb: drop unused local variables
fbdev: blackfin has __raw I/O accessors, so use them in fb.h
fbdev: s1d13xxxfb: add accelerated bitblt functions
tcx: use standard fields for framebuffer physical address and length
fbdev: add support for handoff from firmware to hw framebuffers
intelfb: fix a bug when changing video timing
fbdev: use framebuffer_release() for freeing fb_info structures
radeon: P2G2CLK_ALWAYS_ONb tested twice, should 2nd be P2G2CLK_DAC_ALWAYS_ONb?
s3c-fb: CPUFREQ frequency scaling support
s3c-fb: fix resource releasing on error during probing
carminefb: fix possible access beyond end of carmine_modedb[]
acornfb: remove fb_mmap function
mb862xxfb: use CONFIG_OF instead of CONFIG_PPC_OF
mb862xxfb: restrict compliation of platform driver to PPC
Samsung SoC Framebuffer driver: add Alpha Channel support
atmel-lcdc: fix pixclock upper bound detection
offb: use framebuffer_alloc() to allocate fb_info struct
...
Manually fix up conflicts due to kmemcheck in mm/slab.c
Round the slow work queue's cull and OOM timeouts to whole second boundary
with round_jiffies(). The slow work queue uses a pair of timers to cull
idle threads and, after OOM, to delay new thread creation.
This patch also extracts the mod_timer() logic for the cull timer into a
separate helper function.
By rounding non-time-critical timers such as these to whole seconds, they
will be batched up to fire at the same time rather than being spread out.
This allows the CPU wake up less, which saves power.
Signed-off-by: Chris Peterson <cpeterso@cpeterso.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>