JOBCTL_TRAPPING indicates that ptracer is waiting for tracee to
(re)transit into TRACED. task_clear_jobctl_pending() must be called
when either tracee enters TRACED or the transition is cancelled for
some reason. The former is achieved by explicitly calling
task_clear_jobctl_pending() in ptrace_stop() and the latter by calling
it at the end of do_signal_stop().
Calling task_clear_jobctl_trapping() at the end of do_signal_stop()
limits the scope TRAPPING can be used and is fragile in that seemingly
unrelated changes to tracee's control flow can lead to stuck TRAPPING.
We already have task_clear_jobctl_pending() calls on those cancelling
events to clear JOBCTL_STOP_PENDING. Cancellations can be handled by
making those call sites use JOBCTL_PENDING_MASK instead and updating
task_clear_jobctl_pending() such that task_clear_jobctl_trapping() is
called automatically if no stop/trap is pending.
This patch makes the above changes and removes the fallback
task_clear_jobctl_trapping() call from do_signal_stop().
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
This patch introduces JOBCTL_PENDING_MASK and replaces
task_clear_jobctl_stop_pending() with task_clear_jobctl_pending()
which takes an extra @mask argument.
JOBCTL_PENDING_MASK is currently equal to JOBCTL_STOP_PENDING but
future patches will add more bits. recalc_sigpending_tsk() is updated
to use JOBCTL_PENDING_MASK instead.
task_clear_jobctl_pending() takes @mask which in subset of
JOBCTL_PENDING_MASK and clears the relevant jobctl bits. If
JOBCTL_STOP_PENDING is set, other STOP bits are cleared together. All
task_clear_jobctl_stop_pending() users are updated to call
task_clear_jobctl_pending() with JOBCTL_STOP_PENDING which is
functionally identical to task_clear_jobctl_stop_pending().
This patch doesn't cause any functional change.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
In ptrace_stop(), after arch hook is done, the task state and jobctl
bits are updated while holding siglock. The ordering requirement
there is that TASK_TRACED is set before JOBCTL_TRAPPING is cleared to
prevent ptracer waiting on TRAPPING doesn't end up waking up TRACED is
actually set and sees TASK_RUNNING in wait(2).
Move set_current_state(TASK_TRACED) to the top of the block and
reorganize comments. This makes the ordering more obvious
(TASK_TRACED before other updates) and helps future updates to group
stop participation.
This patch doesn't cause any functional change.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
signal->group_stop currently hosts mostly group stop related flags;
however, it's gonna be used for wider purposes and the GROUP_STOP_
flag prefix becomes confusing. Rename signal->group_stop to
signal->jobctl and rename all GROUP_STOP_* flags to JOBCTL_*.
Bit position macros JOBCTL_*_BIT are defined and JOBCTL_* flags are
defined in terms of them to allow using bitops later.
While at it, reassign JOBCTL_TRAPPING to bit 22 to better accomodate
future additions.
This doesn't cause any functional change.
-v2: JOBCTL_*_BIT macros added as suggested by Linus.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
ERESTART* is always wrong without TIF_SIGPENDING. Teach sys_pause()
to handle the spurious wakeup correctly.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cleanup. Remove the unneeded goto's, we can simply read blocked.sig[0]
unconditionally and then copy-to-user it if oset != NULL.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Matt Fleming <matt.fleming@linux.intel.com>
As Tejun and Linus pointed out, "nand" is the wrong name for "x & ~y",
it should be "andn". Rename signandsets() as suggested.
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
do_sigtimedwait() changes current->blocked and thus it needs
set_current_blocked()->retarget_shared_pending().
We could use set_current_blocked() directly. It is fine to change
->real_blocked from all-zeroes to ->blocked and vice versa lockless,
but this is not immediately clear, looks racy, and needs a huge
comment to explain why this is correct.
To keep the things simple this patch adds the new static helper,
__set_task_blocked() which should be called with ->siglock held. This
way we can change both ->real_blocked and ->blocked atomically under
->siglock as the current code does. This is more understandable.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Matt Fleming <matt.fleming@linux.intel.com>
Factor out the common code in sys_rt_sigtimedwait/compat_sys_rt_sigtimedwait
to the new helper, do_sigtimedwait().
Add the comment to document the extra tick we add to timespec_to_jiffies(ts),
thanks to Linus who explained this to me.
Perhaps it would be better to move compat_sys_rt_sigtimedwait() into
signal.c under CONFIG_COMPAT, then we can make do_sigtimedwait() static.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Matt Fleming <matt.fleming@linux.intel.com>
No functional changes, cleanup compat_sys_rt_sigtimedwait() and
sys_rt_sigtimedwait().
Calculate the timeout before we take ->siglock, this simplifies and
lessens the code. Use timespec_valid() to check the timespec.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Matt Fleming <matt.fleming@linux.intel.com>
sys_rt_sigprocmask() looks unnecessarily complicated, simplify it.
We can just read current->blocked lockless unconditionally before
anything else and then copy-to-user it if needed. At worst we
copy 4 words on mips.
We could copy-to-user the old mask first and simplify the code even
more, but the patch tries to keep the current behaviour: we change
current->block even if copy_to_user(oset) fails.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Matt Fleming <matt.fleming@linux.intel.com>
Acked-by: Tejun Heo <tj@kernel.org>
In short, almost every changing of current->blocked is wrong, or at least
can lead to the unexpected results.
For example. Two threads T1 and T2, T1 sleeps in sigtimedwait/pause/etc.
kill(tgid, SIG) can pick T2 for TIF_SIGPENDING. If T2 calls sigprocmask()
and blocks SIG before it notices the pending signal, nobody else can handle
this pending shared signal.
I am not sure this is bug, but at least this looks strange imho. T1 should
not sleep forever, there is a signal which should wake it up.
This patch moves the code which actually changes ->blocked into the new
helper, set_current_blocked() and changes this code to call
retarget_shared_pending() as exit_signals() does. We should only care about
the signals we just blocked, we use "newset & ~current->blocked" as a mask.
We do not check !sigisemptyset(newblocked), retarget_shared_pending() is
cheap unless mask & shared_pending.
Note: for this particular case we could simply change sigprocmask() to
return -EINTR if signal_pending(), but then we should change other callers
and, more importantly, if we need this fix then set_current_blocked() will
have more callers and some of them can't restart. See the next patch as a
random example.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Matt Fleming <matt.fleming@linux.intel.com>
Acked-by: Tejun Heo <tj@kernel.org>
No functional changes, preparation to simplify the review of the next change.
1. We can read current->block lockless, nobody else can ever change this mask.
2. Calculate the resulting sigset_t outside of ->siglock into the temporary
variable, then take ->siglock and change ->blocked.
Also, kill the stale comment about BKL.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Matt Fleming <matt.fleming@linux.intel.com>
Acked-by: Tejun Heo <tj@kernel.org>
retarget_shared_pending() blindly does recalc_sigpending_and_wake() for
every sub-thread, this is suboptimal. We can check t->blocked and stop
looping once every bit in shared_pending has the new target.
Note: we do not take task_is_stopped_or_traced(t) into account, we are
not trying to speed up the signal delivery or to avoid the unnecessary
(but harmless) signal_wake_up(0) in this unlikely case.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Matt Fleming <matt.fleming@linux.intel.com>
Acked-by: Tejun Heo <tj@kernel.org>
exit_signals() checks signal_pending() before retarget_shared_pending() but
this is suboptimal. We can avoid the while_each_thread() loop in case when
there are no shared signals visible to us.
Add the "shared_pending.signal & ~blocked" check. We don't use tsk->blocked
directly but pass ~blocked as an argument, this is needed for the next patch.
Note: we can optimize this more. while_each_thread(t) can check t->blocked
into account and stop after every pending signal has the new target, see the
next patch.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Matt Fleming <matt.fleming@linux.intel.com>
Acked-by: Tejun Heo <tj@kernel.org>
No functional changes. Move the notify-other-threads code from exit_signals()
to the new helper, retarget_shared_pending().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Matt Fleming <matt.fleming@linux.intel.com>
Acked-by: Tejun Heo <tj@kernel.org>
Add kernel-doc to syscalls in signal.c.
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
General coding style and comment fixes; no code changes:
- Use multi-line-comment coding style.
- Put some function signatures completely on one line.
- Hyphenate some words.
- Spell Posix as POSIX.
- Correct typos & spellos in some comments.
- Drop trailing whitespace.
- End sentences with periods.
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch moves SIGNAL_STOP_DEQUEUED from signal_struct->flags to
task_struct->group_stop, and thus makes it per-thread.
Like SIGNAL_STOP_DEQUEUED, GROUP_STOP_DEQUEUED can be false-positive
after return from get_signal_to_deliver(), this is fine. The only
purpose of this bit is: we can drop ->siglock after __dequeue_signal()
returns the sig_kernel_stop() signal and before we call
do_signal_stop(), in this case we must not miss SIGCONT if it comes in
between.
But, unlike SIGNAL_STOP_DEQUEUED, GROUP_STOP_DEQUEUED can not be
false-positive in do_signal_stop() if multiple threads dequeue the
sig_kernel_stop() signal at the same time.
Consider two threads T1 and T2, SIGTTIN has a hanlder.
- T1 dequeues SIGTSTP and sets SIGNAL_STOP_DEQUEUED, then
it drops ->siglock
- SIGCONT comes and clears SIGNAL_STOP_DEQUEUED, SIGTSTP
should be cancelled.
- T2 dequeues SIGTTIN and sets SIGNAL_STOP_DEQUEUED again.
Since we have a handler we should not stop, T2 returns
to usermode to run the handler.
- T1 continues, calls do_signal_stop() and wrongly starts
the group stop because SIGNAL_STOP_DEQUEUED was restored
in between.
With or without this change:
- we need to do something with ptrace_signal() which can
return SIGSTOP, but this needs another discussion
- SIGSTOP can be lost if it races with the mt exec, will
be fixed later.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
PF_EXITING or TASK_STOPPED has already called task_participate_group_stop()
and cleared its ->group_stop. No need to do task_clear_group_stop_pending()
when we start the new group stop.
Add a small comment to explain the !task_is_stopped() check. Note that this
check is not exactly right and it can lead to unnecessary stop later if the
thread is TASK_PTRACED. What we need is task_participated_in_group_stop(),
this will be solved later.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
prepare_signal(SIGCONT) should never set TIF_SIGPENDING or wake up
the TASK_INTERRUPTIBLE threads. We are going to call complete_signal()
which should pick the right thread correctly. All we need is to wake
up the TASK_STOPPED threads.
If the task was stopped, it can't return to usermode without taking
->siglock. Otherwise we don't care, and the spurious TIF_SIGPENDING
can't be useful.
The comment says:
* If there is a handler for SIGCONT, we must make
* sure that no thread returns to user mode before
* we post the signal
It is not clear what this means. Probably, "when there's only a single
thread" and this continues to be true. Otherwise, even if this SIGCONT
is not private, with or without this change only one thread can dequeue
SIGCONT, other threads can happily return to user mode before before
that thread handles this signal.
Note also that wake_up_state(t, __TASK_STOPPED) can't race with the task
which changes its state, TASK_STOPPED state is protected by ->siglock as
well.
In short: when it comes to signal delivery, SIGCONT is the normal signal
and does not need any special support.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Commit da48524eb2 ("Prevent rt_sigqueueinfo and rt_tgsigqueueinfo
from spoofing the signal code") made the check on si_code too strict.
There are several legitimate places where glibc wants to queue a
negative si_code different from SI_QUEUE:
- This was first noticed with glibc's aio implementation, which wants
to queue a signal with si_code SI_ASYNCIO; the current kernel
causes glibc's tst-aio4 test to fail because rt_sigqueueinfo()
fails with EPERM.
- Further examination of the glibc source shows that getaddrinfo_a()
wants to use SI_ASYNCNL (which the kernel does not even define).
The timer_create() fallback code wants to queue signals with SI_TIMER.
As suggested by Oleg Nesterov <oleg@redhat.com>, loosen the check to
forbid only the problematic SI_TKILL case.
Reported-by: Klaus Dittrich <kladit@arcor.de>
Acked-by: Julien Tinnes <jln@google.com>
Cc: <stable@kernel.org>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Changelog:
Dec 8: Fixed bug in my check_kill_permission pointed out by
Eric Biederman.
Dec 13: Apply Eric's suggestion to pass target task into kill_ok_by_cred()
for clarity
Dec 31: address comment by Eric Biederman:
don't need cred/tcred in check_kill_permission.
Jan 1: use const cred struct.
Jan 11: Per Bastian Blank's advice, clean up kill_ok_by_cred().
Feb 16: kill_ok_by_cred: fix bad parentheses
Feb 23: per akpm, let compiler inline kill_ok_by_cred
Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Daniel Lezcano <daniel.lezcano@free.fr>
Acked-by: David Howells <dhowells@redhat.com>
Cc: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Just as group_exit_code shouldn't be generated when a PTRACE_CONT'd
task re-enters job control stop, notifiction for the event should be
suppressed too. The logic is the same as the group_exit_code
generation suppression in do_signal_stop(), if SIGNAL_STOP_STOPPED is
already set, the task is re-entering job control stop without
intervening SIGCONT and the notifications should be suppressed.
Test case follows.
#include <stdio.h>
#include <unistd.h>
#include <signal.h>
#include <time.h>
#include <sys/ptrace.h>
#include <sys/wait.h>
static const struct timespec ts100ms = { .tv_nsec = 100000000 };
static pid_t tracee, tracer;
static const char *pid_who(pid_t pid)
{
return pid == tracee ? "tracee" : (pid == tracer ? "tracer" : "mommy ");
}
static void sigchld_sigaction(int signo, siginfo_t *si, void *ucxt)
{
printf("%s: SIG status=%02d code=%02d (%s)\n",
pid_who(getpid()), si->si_status, si->si_code,
pid_who(si->si_pid));
}
int main(void)
{
const struct sigaction chld_sa = { .sa_sigaction = sigchld_sigaction,
.sa_flags = SA_SIGINFO|SA_RESTART };
siginfo_t si;
sigaction(SIGCHLD, &chld_sa, NULL);
tracee = fork();
if (!tracee) {
tracee = getpid();
while (1)
pause();
}
kill(tracee, SIGSTOP);
waitid(P_PID, tracee, &si, WSTOPPED);
tracer = fork();
if (!tracer) {
tracer = getpid();
ptrace(PTRACE_ATTACH, tracee, NULL, NULL);
waitid(P_PID, tracee, &si, WSTOPPED);
ptrace(PTRACE_CONT, tracee, NULL, (void *)(long)si.si_status);
waitid(P_PID, tracee, &si, WSTOPPED);
ptrace(PTRACE_CONT, tracee, NULL, (void *)(long)si.si_status);
waitid(P_PID, tracee, &si, WSTOPPED);
printf("tracer: detaching\n");
ptrace(PTRACE_DETACH, tracee, NULL, NULL);
return 0;
}
while (1)
pause();
return 0;
}
Before the patch, the parent gets the second notification for the
tracee after the tracer detaches. si_status is zero because
group_exit_code is not set by the group stop completion which
triggered this notification.
mommy : SIG status=19 code=05 (tracee)
tracer: SIG status=00 code=05 (tracee)
tracer: SIG status=19 code=04 (tracee)
tracer: SIG status=00 code=05 (tracee)
tracer: detaching
mommy : SIG status=00 code=05 (tracee)
mommy : SIG status=00 code=01 (tracer)
^C
After the patch, the duplicate notification is gone.
mommy : SIG status=19 code=05 (tracee)
tracer: SIG status=00 code=05 (tracee)
tracer: SIG status=19 code=04 (tracee)
tracer: SIG status=00 code=05 (tracee)
tracer: detaching
mommy : SIG status=00 code=01 (tracer)
^C
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
With recent changes, job control and ptrace stopped states are
properly separated and accessible to the real parent and the ptracer
respectively; however, notifications of job control stopped/continued
events to the real parent while ptraced are still missing.
A ptracee participates in group stop in ptrace_stop() but the
completion isn't notified. If participation results in completion of
group stop, notify the real parent of the event. The ptrace and group
stops are separate and can be handled as such.
However, when the real parent and the ptracer are in the same thread
group, only the ptrace stop event is visible through wait(2) and the
duplicate notifications are different from the current behavior and
are confusing. Suppress group stop notification in such cases.
The continued state is shared between the real parent and the ptracer
but is only meaningful to the real parent. Always notify the real
parent and notify the ptracer too for backward compatibility. Similar
to stop notification, if the real parent is the ptracer, suppress a
duplicate notification.
Test case follows.
#include <stdio.h>
#include <unistd.h>
#include <time.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/ptrace.h>
#include <sys/wait.h>
int main(void)
{
const struct timespec ts100ms = { .tv_nsec = 100000000 };
pid_t tracee, tracer;
siginfo_t si;
int i;
tracee = fork();
if (tracee == 0) {
while (1) {
printf("tracee: SIGSTOP\n");
raise(SIGSTOP);
nanosleep(&ts100ms, NULL);
printf("tracee: SIGCONT\n");
raise(SIGCONT);
nanosleep(&ts100ms, NULL);
}
}
waitid(P_PID, tracee, &si, WSTOPPED | WNOHANG | WNOWAIT);
tracer = fork();
if (tracer == 0) {
nanosleep(&ts100ms, NULL);
ptrace(PTRACE_ATTACH, tracee, NULL, NULL);
for (i = 0; i < 11; i++) {
si.si_pid = 0;
waitid(P_PID, tracee, &si, WSTOPPED);
if (si.si_pid && si.si_code == CLD_TRAPPED)
ptrace(PTRACE_CONT, tracee, NULL,
(void *)(long)si.si_status);
}
printf("tracer: EXITING\n");
return 0;
}
while (1) {
si.si_pid = 0;
waitid(P_PID, tracee, &si, WSTOPPED | WCONTINUED | WEXITED);
if (si.si_pid)
printf("mommy : WAIT status=%02d code=%02d\n",
si.si_status, si.si_code);
}
return 0;
}
Before this patch, while ptraced, the real parent doesn't get
notifications for job control events, so although it can access those
events, the later waitid(2) call never wakes up.
tracee: SIGSTOP
mommy : WAIT status=19 code=05
tracee: SIGCONT
tracee: SIGSTOP
tracee: SIGCONT
tracee: SIGSTOP
tracee: SIGCONT
tracee: SIGSTOP
tracer: EXITING
mommy : WAIT status=19 code=05
^C
After this patch, it works as expected.
tracee: SIGSTOP
mommy : WAIT status=19 code=05
tracee: SIGCONT
mommy : WAIT status=18 code=06
tracee: SIGSTOP
mommy : WAIT status=19 code=05
tracee: SIGCONT
mommy : WAIT status=18 code=06
tracee: SIGSTOP
mommy : WAIT status=19 code=05
tracee: SIGCONT
mommy : WAIT status=18 code=06
tracee: SIGSTOP
tracer: EXITING
mommy : WAIT status=19 code=05
^C
-v2: Oleg pointed out that
* Group stop notification to the real parent should also happen
when ptracer detach races with ptrace_stop().
* real_parent_is_ptracer() should be testing thread group
equality not the task itself as wait(2) and stop/cont
notifications are normally thread-group wide.
Both issues are fixed accordingly.
-v3: real_parent_is_ptracer() updated to test child->real_parent
instead of child->group_leader->real_parent per Oleg's
suggestion.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
The stopped notifications in do_signal_stop() and exit_signals() are
always for the completion of job control. The one in do_signal_stop()
may be delivered to the ptracer if PTRACE_ATTACH races with
notification and the one in exit_signals() if task exits while
ptraced.
In both cases, the notifications are meaningless and confusing to the
ptracer as it never accesses the group stop state while the real
parent would miss notifications for the events it is watching.
Make sure these notifications always go to the real parent by calling
do_notify_parent_cld_stop() with %false @for_ptrace.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Currently, do_notify_parent_cldstop() determines whether the
notification is for the real parent or ptracer. Move the decision to
the caller by adding @for_ptrace parameter to
do_notify_parent_cldstop(). All the callers are updated to pass
task_ptrace(target_task), so this patch doesn't cause any behavior
difference.
While at it, add function comment to do_notify_parent_cldstop().
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
While ptraced, a task may be resumed while the containing process is
still job control stopped. If the task receives another stop signal
in this state, it will still initiate group stop, which generates
group_exit_code, which the real parent would be able to see once the
ptracer detaches.
In this scenario, the real parent may see two consecutive CLD_STOPPED
events from two stop signals without intervening SIGCONT, which
normally is impossible.
Test case follows.
#include <stdio.h>
#include <unistd.h>
#include <sys/ptrace.h>
#include <sys/wait.h>
int main(void)
{
pid_t tracee;
siginfo_t si;
tracee = fork();
if (!tracee)
while (1)
pause();
kill(tracee, SIGSTOP);
waitid(P_PID, tracee, &si, WSTOPPED);
if (!fork()) {
ptrace(PTRACE_ATTACH, tracee, NULL, NULL);
waitid(P_PID, tracee, &si, WSTOPPED);
ptrace(PTRACE_CONT, tracee, NULL, (void *)(long)si.si_status);
waitid(P_PID, tracee, &si, WSTOPPED);
ptrace(PTRACE_CONT, tracee, NULL, (void *)(long)si.si_status);
waitid(P_PID, tracee, &si, WSTOPPED);
ptrace(PTRACE_DETACH, tracee, NULL, NULL);
return 0;
}
while (1) {
si.si_pid = 0;
waitid(P_PID, tracee, &si, WSTOPPED | WNOHANG);
if (si.si_pid)
printf("st=%02d c=%02d\n", si.si_status, si.si_code);
}
return 0;
}
Before the patch, the latter waitid() in polling mode reports the
second stopped event generated by the implied SIGSTOP of
PTRACE_ATTACH.
st=19 c=05
^C
After the patch, the second event is not reported.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Currently, if the task is STOPPED on ptrace attach, it's left alone
and the state is silently changed to TRACED on the next ptrace call.
The behavior breaks the assumption that arch_ptrace_stop() is called
before any task is poked by ptrace and is ugly in that a task
manipulates the state of another task directly.
With GROUP_STOP_PENDING, the transitions between TASK_STOPPED and
TRACED can be made clean. The tracer can use the flag to tell the
tracee to retry stop on attach and detach. On retry, the tracee will
enter the desired state in the correct way. The lower 16bits of
task->group_stop is used to remember the signal number which caused
the last group stop. This is used while retrying for ptrace attach as
the original group_exit_code could have been consumed with wait(2) by
then.
As the real parent may wait(2) and consume the group_exit_code
anytime, the group_exit_code needs to be saved separately so that it
can be used when switching from regular sleep to ptrace_stop(). This
is recorded in the lower 16bits of task->group_stop.
If a task is already stopped and there's no intervening SIGCONT, a
ptrace request immediately following a successful PTRACE_ATTACH should
always succeed even if the tracer doesn't wait(2) for attach
completion; however, with this change, the tracee might still be
TASK_RUNNING trying to enter TASK_TRACED which would cause the
following request to fail with -ESRCH.
This intermediate state is hidden from the ptracer by setting
GROUP_STOP_TRAPPING on attach and making ptrace_check_attach() wait
for it to clear on its signal->wait_chldexit. Completing the
transition or getting killed clears TRAPPING and wakes up the tracer.
Note that the STOPPED -> RUNNING -> TRACED transition is still visible
to other threads which are in the same group as the ptracer and the
reverse transition is visible to all. Please read the comments for
details.
Oleg:
* Spotted a race condition where a task may retry group stop without
proper bookkeeping. Fixed by redoing bookkeeping on retry.
* Spotted that the transition is visible to userland in several
different ways. Most are fixed with GROUP_STOP_TRAPPING. Unhandled
corner case is documented.
* Pointed out not setting GROUP_STOP_SIGMASK on an already stopped
task would result in more consistent behavior.
* Pointed out that calling ptrace_stop() from do_signal_stop() in
TASK_STOPPED can race with group stop start logic and then confuse
the TRAPPING wait in ptrace_check_attach(). ptrace_stop() is now
called with TASK_RUNNING.
* Suggested using signal->wait_chldexit instead of bit wait.
* Spotted a race condition between TRACED transition and clearing of
TRAPPING.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Jan Kratochvil <jan.kratochvil@redhat.com>
A ptraced task would still stop at do_signal_stop() when it's stopping
for stop signals and do_signal_stop() behaves the same whether the
task is ptraced or not. However, in addition to stopping,
ptrace_stop() also does ptrace specific stuff like calling
architecture specific callbacks, so this behavior makes the code more
fragile and difficult to understand.
This patch makes do_signal_stop() test whether the task is ptraced and
use ptrace_stop() if so. This renders tracehook_notify_jctl() rather
pointless as the ptrace notification is now handled by ptrace_stop()
regardless of the return value from the tracehook. It probably is a
good idea to update it.
This doesn't solve the whole problem as tasks already in stopped state
would stay in the regular stop when ptrace attached. That part will
be handled by the next patch.
Oleg pointed out that this makes a userland-visible change. Before,
SIGCONT would be able to wake up a task in group stop even if the task
is ptraced if the tracer hasn't issued another ptrace command
afterwards (as the next ptrace commands transitions the state into
TASK_TRACED which ignores SIGCONT wakeups). With this and the next
patch, SIGCONT may race with the transition into TASK_TRACED and is
ignored if the tracee already entered TASK_TRACED.
Another userland visible change of this and the next patch is that the
ptracee's state would now be TASK_TRACED where it used to be
TASK_STOPPED, which is visible via fs/proc.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Jan Kratochvil <jan.kratochvil@redhat.com>
Currently, ptrace_stop() unconditionally participates in group stop
bookkeeping. This is unnecessary and inaccurate. Make it only
participate if the task is trapping for group stop - ie. if @why is
CLD_STOPPED. As ptrace_stop() currently is not used when trapping for
group stop, this equals to disabling group stop participation from
ptrace_stop().
A visible behavior change is increased likelihood of delayed group
stop completion if the thread group contains one or more ptraced
tasks.
This is to preapre for further cleanup of the interaction between
group stop and ptrace.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Currently task->signal->group_stop_count is used to decide whether to
stop for group stop. However, if there is a task in the group which
is taking a long time to stop, other tasks which are continued by
ptrace would repeatedly stop for the same group stop until the group
stop is complete.
Conversely, if a ptraced task is in TASK_TRACED state, the debugger
won't get notified of group stops which is inconsistent compared to
the ptraced task in any other state.
This patch introduces GROUP_STOP_PENDING which tracks whether a task
is yet to stop for the group stop in progress. The flag is set when a
group stop starts and cleared when the task stops the first time for
the group stop, and consulted whenever whether the task should
participate in a group stop needs to be determined. Note that now
tasks in TASK_TRACED also participate in group stop.
This results in the following behavior changes.
* For a single group stop, a ptracer would see at most one stop
reported.
* A ptracee in TASK_TRACED now also participates in group stop and the
tracer would get the notification. However, as a ptraced task could
be in TASK_STOPPED state or any ptrace trap could consume group
stop, the notification may still be missing. These will be
addressed with further patches.
* A ptracee may start a group stop while one is still in progress if
the tracer let it continue with stop signal delivery. Group stop
code handles this correctly.
Oleg:
* Spotted that a task might skip signal check even when its
GROUP_STOP_PENDING is set. Fixed by updating
recalc_sigpending_tsk() to check GROUP_STOP_PENDING instead of
group_stop_count.
* Pointed out that task->group_stop should be cleared whenever
task->signal->group_stop_count is cleared. Fixed accordingly.
* Pointed out the behavior inconsistency between TASK_TRACED and
RUNNING and the last behavior change.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
task->signal->group_stop_count is used to track the progress of group
stop. It's initialized to the number of tasks which need to stop for
group stop to finish and each stopping or trapping task decrements.
However, each task doesn't keep track of whether it decremented the
counter or not and if woken up before the group stop is complete and
stops again, it can decrement the counter multiple times.
Please consider the following example code.
static void *worker(void *arg)
{
while (1) ;
return NULL;
}
int main(void)
{
pthread_t thread;
pid_t pid;
int i;
pid = fork();
if (!pid) {
for (i = 0; i < 5; i++)
pthread_create(&thread, NULL, worker, NULL);
while (1) ;
return 0;
}
ptrace(PTRACE_ATTACH, pid, NULL, NULL);
while (1) {
waitid(P_PID, pid, NULL, WSTOPPED);
ptrace(PTRACE_SINGLESTEP, pid, NULL, (void *)(long)SIGSTOP);
}
return 0;
}
The child creates five threads and the parent continuously traps the
first thread and whenever the child gets a signal, SIGSTOP is
delivered. If an external process sends SIGSTOP to the child, all
other threads in the process should reliably stop. However, due to
the above bug, the first thread will often end up consuming
group_stop_count multiple times and SIGSTOP often ends up stopping
none or part of the other four threads.
This patch adds a new field task->group_stop which is protected by
siglock and uses GROUP_STOP_CONSUME flag to track which task is still
to consume group_stop_count to fix this bug.
task_clear_group_stop_pending() and task_participate_group_stop() are
added to help manipulating group stop states. As ptrace_stop() now
also uses task_participate_group_stop(), it will set
SIGNAL_STOP_STOPPED if it completes a group stop.
There still are many issues regarding the interaction between group
stop and ptrace. Patches to address them will follow.
- Oleg spotted duplicate GROUP_STOP_CONSUME. Dropped.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
To prepare for cleanup of the interaction between group stop and
ptrace, add @why to ptrace_stop(). Existing users are updated such
that there is no behavior change.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Roland McGrath <roland@redhat.com>
tracehook_notify_jctl() aids in determining whether and what to report
to the parent when a task is stopped or continued. The function also
adds an extra requirement that siglock may be released across it,
which is currently unused and quite difficult to satisfy in
well-defined manner.
As job control and the notifications are about to receive major
overhaul, remove the tracehook and open code it. If ever necessary,
let's factor it out after the overhaul.
* Oleg spotted incorrect CLD_CONTINUED/STOPPED selection when ptraced.
Fixed.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
do_signal_stop() is used only by get_signal_to_deliver() and after a
successful signal stop, it always calls try_to_freeze(), so the
try_to_freeze() loop around schedule() in do_signal_stop() is
superflous and confusing. Remove it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Roland McGrath <roland@redhat.com>
After a task receives SIGCONT, its parent is notified via SIGCHLD with
its siginfo describing what the notified event is. If SIGCONT is
received while the child process is stopped, the code should be
CLD_CONTINUED. If SIGCONT is recieved while the child process is in
the process of being stopped, it should be CLD_STOPPED. Which code to
use is determined in prepare_signal() and recorded in signal->flags
using SIGNAL_CLD_CONTINUED|STOP flags.
get_signal_deliver() should test these flags and then notify
accoringly; however, it incorrectly tested SIGNAL_STOP_CONTINUED
instead of SIGNAL_CLD_CONTINUED, thus incorrectly notifying
CLD_CONTINUED if the signal is delivered before the task is wait(2)ed
and CLD_STOPPED if the state was fetched already.
Fix it by testing SIGNAL_CLD_CONTINUED. While at it, uncompress the
?: test into if/else clause for better readability.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Roland McGrath <roland@redhat.com>
Userland should be able to trust the pid and uid of the sender of a
signal if the si_code is SI_TKILL.
Unfortunately, the kernel has historically allowed sigqueueinfo() to
send any si_code at all (as long as it was negative - to distinguish it
from kernel-generated signals like SIGILL etc), so it could spoof a
SI_TKILL with incorrect siginfo values.
Happily, it looks like glibc has always set si_code to the appropriate
SI_QUEUE, so there are probably no actual user code that ever uses
anything but the appropriate SI_QUEUE flag.
So just tighten the check for si_code (we used to allow any negative
value), and add a (one-time) warning in case there are binaries out
there that might depend on using other si_code values.
Signed-off-by: Julien Tinnes <jln@google.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
lock_task_sighand() grabs sighand->siglock in case of returning non-NULL
but unlock_task_sighand() releases it unconditionally. This leads sparse
to complain about the lock context imbalance. Rename and wrap
lock_task_sighand() using __cond_lock() macro to make sparse happy.
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The original hwpoison code added a new siginfo field si_addr_lsb to
pass the granuality of the fault address to user space. Unfortunately
this field was never copied to user space. Fix this here.
I added explicit checks for the MCEERR codes to avoid having
to patch all potential callers to initialize the field.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Commit 8f92054e7c ("CRED: Fix __task_cred()'s lockdep check and banner
comment") fixed the lockdep checks on __task_cred(). This has shown up
a place in the signalling code where a lock should be held - namely that
check_kill_permission() requires its callers to hold the RCU lock.
Fix group_send_sig_info() to get the RCU read lock around its call to
check_kill_permission().
Without this patch, the following warning can occur:
===================================================
[ INFO: suspicious rcu_dereference_check() usage. ]
---------------------------------------------------
kernel/signal.c:660 invoked rcu_dereference_check() without protection!
...
Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change zap_other_threads() to return the number of other sub-threads found
on ->thread_group list.
Other changes are cosmetic:
- change the code to use while_each_thread() helper
- remove the obsolete comment about SIGKILL/SIGSTOP
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Roland McGrath <roland@redhat.com>
Cc: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrew Tridgell reports that aio_read(SIGEV_SIGNAL) can fail if the
notification from the helper thread races with setresuid(), see
http://samba.org/~tridge/junkcode/aio_uid.c
This happens because check_kill_permission() doesn't permit sending a
signal to the task with the different cred->xids. But there is not any
security reason to check ->cred's when the task sends a signal (private or
group-wide) to its sub-thread. Whatever we do, any thread can bypass all
security checks and send SIGKILL to all threads, or it can block a signal
SIG and do kill(gettid(), SIG) to deliver this signal to another
sub-thread. Not to mention that CLONE_THREAD implies CLONE_VM.
Change check_kill_permission() to avoid the credentials check when the
sender and the target are from the same thread group.
Also, move "cred = current_cred()" down to avoid calling get_current()
twice.
Note: David Howells pointed out we could relax this even more, the
CLONE_SIGHAND (without CLONE_THREAD) case probably does not need
these checks too.
Roland said:
: The glibc (libpthread) that does set*id across threads has
: been in use for a while (2.3.4?), probably in distro's using kernels as old
: or older than any active -stable streams. In the race in question, this
: kernel bug is breaking valid POSIX application expectations.
Reported-by: Andrew Tridgell <tridge@samba.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Roland McGrath <roland@redhat.com>
Acked-by: David Howells <dhowells@redhat.com>
Cc: Eric Paris <eparis@parisplace.org>
Cc: Jakub Jelinek <jakub@redhat.com>
Cc: James Morris <jmorris@namei.org>
Cc: Roland McGrath <roland@redhat.com>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: <stable@kernel.org> [all kernel versions]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch contains the hooks and instrumentation into kernel which
live outside the kernel/debug directory, which the kdb core
will call to run commands like lsmod, dmesg, bt etc...
CC: linux-arch@vger.kernel.org
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Martin Hicks <mort@sgi.com>
Make sure compiler won't do weird things with limits. E.g. fetching them
twice may return 2 different values after writable limits are implemented.
I.e. either use rlimit helpers added in commit 3e10e716ab ("resource:
add helpers for fetching rlimits") or ACCESS_ONCE if not applicable.
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This makes sure that we pick the synchronous signals caused by a
processor fault over any pending regular asynchronous signals sent to
use by [t]kill().
This is not strictly required semantics, but it makes it _much_ easier
for programs like Wine that expect to find the fault information in the
signal stack.
Without this, if a non-synchronous signal gets picked first, the delayed
asynchronous signal will have its signal context pointing to the new
signal invocation, rather than the instruction that caused the SIGSEGV
or SIGBUS in the first place.
This is not all that pretty, and we're discussing making the synchronous
signals more explicit rather than have these kinds of implicit
preferences of SIGSEGV and friends. See for example
http://bugzilla.kernel.org/show_bug.cgi?id=15395
for some of the discussion. But in the meantime this is a simple and
fairly straightforward work-around, and the whole
if (x & Y)
x &= Y;
thing can be compiled into (and gcc does do it) just three instructions:
movq %rdx, %rax
andl $Y, %eax
cmovne %rax, %rdx
so it is at least a simple solution to a subtle issue.
Reported-and-tested-by: Pavel Vilim <wylda@volny.cz>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When print-fatal-signals is enabled it's possible to dump any memory
reachable by the kernel to the log by simply jumping to that address from
user space.
Or crash the system if there's some hardware with read side effects.
The fatal signals handler will dump 16 bytes at the execution address,
which is fully controlled by ring 3.
In addition when something jumps to a unmapped address there will be up to
16 additional useless page faults, which might be potentially slow (and at
least is not very efficient)
Fortunately this option is off by default and only there on i386.
But fix it by checking for kernel addresses and also stopping when there's
a page fault.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
sys: Fix missing rcu protection for __task_cred() access
signals: Fix more rcu assumptions
signal: Fix racy access to __task_cred in kill_pid_info_as_uid()
Move the call to do_signal_stop() down, after tracehook call. This makes
->group_stop_count condition visible to tracers before do_signal_stop()
will participate in this group-stop.
Currently the patch has no effect, tracehook_get_signal() always returns 0.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Trivial, s/0/SI_USER/ in collect_signal() for grep.
This is a bit confusing, we don't know the source of this signal.
But we don't care, and "info->si_code = 0" is imho worse.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change send_signal() to use si_fromuser(). From now SEND_SIG_NOINFO
triggers the "from_ancestor_ns" check.
This fixes reparent_thread()->group_send_sig_info(pdeath_signal)
behaviour, before this patch send_signal() does not detect the
cross-namespace case when the child of the dying parent belongs to the
sub-namespace.
This patch can affect the behaviour of send_sig(), kill_pgrp() and
kill_pid() when the caller sends the signal to the sub-namespace with
"priv == 0" but surprisingly all callers seem to use them correctly,
including disassociate_ctty(on_exit).
Except: drivers/staging/comedi/drivers/addi-data/*.c incorrectly use
send_sig(priv => 0). But his is minor and should be fixed anyway.
Reported-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Reviewed-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
No changes in compiled code. The patch adds the new helper, si_fromuser()
and changes check_kill_permission() to use this helper.
The real effect of this patch is that from now we "officially" consider
SEND_SIG_NOINFO signal as "from user-space" signals. This is already true
if we look at the code which uses SEND_SIG_NOINFO, except __send_signal()
has another opinion - see the next patch.
The naming of these special SEND_SIG_XXX siginfo's is really bad
imho. From __send_signal()'s pov they mean
SEND_SIG_NOINFO from user
SEND_SIG_PRIV from kernel
SEND_SIG_FORCED no info
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Reviewed-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
1) Remove the misleading comment in __sigqueue_alloc() which claims
that holding a spinlock is equivalent to rcu_read_lock().
2) Add a rcu_read_lock/unlock around the __task_cred() access
in __sigqueue_alloc()
This needs to be revisited to remove the remaining users of
read_lock(&tasklist_lock) but that's outside the scope of this patch.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
LKML-Reference: <20091210004703.269843657@linutronix.de>
kill_pid_info_as_uid() accesses __task_cred() without being in a RCU
read side critical section. tasklist_lock is not protecting that when
CONFIG_TREE_PREEMPT_RCU=y.
Convert the whole tasklist_lock section to rcu and use
lock_task_sighand to prevent the exit race.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
LKML-Reference: <20091210004703.232302055@linutronix.de>
Acked-by: Oleg Nesterov <oleg@redhat.com>
When the system has too many timers or too many aggregate
queued signals, the EAGAIN error is returned to application
from kernel, including timer_create() [POSIX.1b].
It means that the app exceeded the limit of pending signals,
but in general application writers do not expect this
outcome and the current silent failure can cause rare app
failures under very high load.
This patch adds a new message when we reach the limit
and if print_fatal_signals is enabled:
task/1234: reached RLIMIT_SIGPENDING, dropping signal
If you see this message and your system behaved unexpectedly,
you can run following command to lift the limit:
# ulimit -i unlimited
With help from Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>.
Signed-off-by: Naohiro Ooiwa <nooiwa@miraclelinux.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: oleg@redhat.com
LKML-Reference: <4AF6E7E2.9080406@miraclelinux.com>
[ Modified a few small details, gave surrounding code some love. ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
__fatal_signal_pending inlines to one instruction on x86, probably two
instructions on other machines. It takes two longer x86 instructions just
to call it and test its return value, not to mention the function itself.
On my random x86_64 config, this saved 70 bytes of text (59 of those being
__fatal_signal_pending itself).
Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Introduce do_send_sig_info() and convert group_send_sig_info(),
send_sig_info(), do_send_specific() to use this helper.
Hopefully it will have more users soon, it allows to specify
specific/group behaviour via "bool group" argument.
Shaves 80 bytes from .text.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: stephane eranian <eranian@googlemail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This changes tracehook_notify_jctl() so it's called with the siglock held,
and changes its argument and return value definition. These clean-ups
make it a better fit for what new tracing hooks need to check.
Tracing needs the siglock here, held from the time TASK_STOPPED was set,
to avoid potential SIGCONT races if it wants to allow any blocking in its
tracing hooks.
This also folds the finish_stop() function into its caller
do_signal_stop(). The function is short, called only once and only
unconditionally. It aids readability to fold it in.
[oleg@redhat.com: do not call tracehook_notify_jctl() in TASK_STOPPED state]
[oleg@redhat.com: introduce tracehook_finish_jctl() helper]
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The bug is old, it wasn't cause by recent changes.
Test case:
static void *tfunc(void *arg)
{
int pid = (long)arg;
assert(ptrace(PTRACE_ATTACH, pid, NULL, NULL) == 0);
kill(pid, SIGKILL);
sleep(1);
return NULL;
}
int main(void)
{
pthread_t th;
long pid = fork();
if (!pid)
pause();
signal(SIGCHLD, SIG_IGN);
assert(pthread_create(&th, NULL, tfunc, (void*)pid) == 0);
int r = waitpid(-1, NULL, __WNOTHREAD);
printf("waitpid: %d %m\n", r);
return 0;
}
Before the patch this program hangs, after this patch waitpid() correctly
fails with errno == -ECHILD.
The problem is, __ptrace_detach() reaps the EXIT_ZOMBIE tracee if its
->real_parent is our sub-thread and we ignore SIGCHLD. But in this case
we should wake up other threads which can sleep in do_wait().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Vitaly Mayatskikh <vmayatsk@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The previous commit ("do_sigaltstack: avoid copying 'stack_t' as a
structure to user space") fixed a real bug. This one just cleans up the
copy from user space to that gcc can generate better code for it (and so
that it looks the same as the later copy back to user space).
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ulrich Drepper correctly points out that there is generally padding in
the structure on 64-bit hosts, and that copying the structure from
kernel to user space can leak information from the kernel stack in those
padding bytes.
Avoid the whole issue by just copying the three members one by one
instead, which also means that the function also can avoid the need for
a stack frame. This also happens to match how we copy the new structure
from user space, so it all even makes sense.
[ The obvious solution of adding a memset() generates horrid code, gcc
does really stupid things. ]
Reported-by: Ulrich Drepper <drepper@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If the non-traced sub-thread calls do_notify_parent_cldstop(), we send the
notification to group_leader->real_parent and we report group_leader's
pid.
But, if group_leader is traced we use the wrong ->parent->nsproxy->pid_ns,
the tracer and parent can live in different namespaces. Change the code
to use "parent" instead of tsk->parent.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Roland McGrath <roland@redhat.com>
Acked-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
No functional changes.
- Nobody except ptrace.c & co should use ptrace flags directly, we have
task_ptrace() for that.
- No need to specially check PT_PTRACED, we must not have other PT_ bits
set without PT_PTRACED. And no need to know this flag exists.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This false positive is due to field padding in struct sigqueue. When
this dynamically allocated structure is copied to the stack (in arch-
specific delivery code), kmemcheck sees a read from the padding, which
is, naturally, uninitialized.
Hide the false positive using the __GFP_NOTRACK_FALSE_POSITIVE flag.
Also made the rlimit override code a bit clearer by introducing a new
variable.
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
* 'tracing-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (244 commits)
Revert "x86, bts: reenable ptrace branch trace support"
tracing: do not translate event helper macros in print format
ftrace/documentation: fix typo in function grapher name
tracing/events: convert block trace points to TRACE_EVENT(), fix !CONFIG_BLOCK
tracing: add protection around module events unload
tracing: add trace_seq_vprint interface
tracing: fix the block trace points print size
tracing/events: convert block trace points to TRACE_EVENT()
ring-buffer: fix ret in rb_add_time_stamp
ring-buffer: pass in lockdep class key for reader_lock
tracing: add annotation to what type of stack trace is recorded
tracing: fix multiple use of __print_flags and __print_symbolic
tracing/events: fix output format of user stack
tracing/events: fix output format of kernel stack
tracing/trace_stack: fix the number of entries in the header
ring-buffer: discard timestamps that are at the start of the buffer
ring-buffer: try to discard unneeded timestamps
ring-buffer: fix bug in ring_buffer_discard_commit
ftrace: do not profile functions when disabled
tracing: make trace pipe recognize latency format flag
...
sys_kill has the per thread counterpart sys_tgkill. sigqueueinfo is
missing a thread directed counterpart. Such an interface is important
for migrating applications from other OSes which have the per thread
delivery implemented.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Roland McGrath <roland@redhat.com>
Acked-by: Ulrich Drepper <drepper@redhat.com>
Split out the code from do_tkill to make it reusable by the follow up
patch which implements sys_rt_tgsigqueueinfo
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Don't flush inherited SIGKILL during execve() in SELinux's post cred commit
hook. This isn't really a security problem: if the SIGKILL came before the
credentials were changed, then we were right to receive it at the time, and
should honour it; if it came after the creds were changed, then we definitely
should honour it; and in any case, all that will happen is that the process
will be scrapped before it ever returns to userspace.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>
Impact: clean up
Create a sub directory in include/trace called events to keep the
trace point headers in their own separate directory. Only headers that
declare trace points should be defined in this directory.
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Zhao Lei <zhaolei@cn.fujitsu.com>
Cc: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
This patch lowers the number of places a developer must modify to add
new tracepoints. The current method to add a new tracepoint
into an existing system is to write the trace point macro in the
trace header with one of the macros TRACE_EVENT, TRACE_FORMAT or
DECLARE_TRACE, then they must add the same named item into the C file
with the macro DEFINE_TRACE(name) and then add the trace point.
This change cuts out the needing to add the DEFINE_TRACE(name).
Every file that uses the tracepoint must still include the trace/<type>.h
file, but the one C file must also add a define before the including
of that file.
#define CREATE_TRACE_POINTS
#include <trace/mytrace.h>
This will cause the trace/mytrace.h file to also produce the C code
necessary to implement the trace point.
Note, if more than one trace/<type>.h is used to create the C code
it is best to list them all together.
#define CREATE_TRACE_POINTS
#include <trace/foo.h>
#include <trace/bar.h>
#include <trace/fido.h>
Thanks to Mathieu Desnoyers and Christoph Hellwig for coming up with
the cleaner solution of the define above the includes over my first
design to have the C code include a "special" header.
This patch converts sched, irq and lockdep and skb to use this new
method.
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Zhao Lei <zhaolei@cn.fujitsu.com>
Cc: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When sending a signal to a descendant namespace, set ->si_pid to 0 since
the sender does not have a pid in the receiver's namespace.
Note:
- If rt_sigqueueinfo() sets si_code to SI_USER when sending a
signal across a pid namespace boundary, the value in ->si_pid
will be cleared to 0.
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Daniel Lezcano <daniel.lezcano@free.fr>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Normally SIG_DFL signals to global and container-init are dropped early.
But if a signal is blocked when it is posted, we cannot drop the signal
since the receiver may install a handler before unblocking the signal.
Once this signal is queued however, the receiver container-init has no way
of knowing if the signal was sent from an ancestor or descendant
namespace. This patch ensures that contianer-init drops all SIG_DFL
signals in get_signal_to_deliver() except SIGKILL/SIGSTOP.
If SIGSTOP/SIGKILL originate from a descendant of container-init they are
never queued (i.e dropped in sig_ignored() in an earler patch).
If SIGSTOP/SIGKILL originate from parent namespace, the signal is queued
and container-init processes the signal.
IOW, if get_signal_to_deliver() sees a sig_kernel_only() signal for global
or container-init, the signal must have been generated internally or must
have come from an ancestor ns and we process the signal.
Further, the signal_group_exit() check was needed to cover the case of a
multi-threaded init sending SIGKILL to other threads when doing an exit()
or exec(). But since the new sig_kernel_only() check covers the SIGKILL,
the signal_group_exit() check is no longer needed and can be removed.
Finally, now that we have all pieces in place, set SIGNAL_UNKILLABLE for
container-inits.
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Daniel Lezcano <daniel.lezcano@free.fr>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Drop early any SIG_DFL or SIG_IGN signals to container-init from within
the same container. But queue SIGSTOP and SIGKILL to the container-init
if they are from an ancestor container.
Blocked, fatal signals (i.e when SIG_DFL is to terminate) from within the
container can still terminate the container-init. That will be addressed
in the next patch.
Note: To be bisect-safe, SIGNAL_UNKILLABLE will be set for container-inits
in a follow-on patch. Until then, this patch is just a preparatory
step.
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Daniel Lezcano <daniel.lezcano@free.fr>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
send_signal() (or its helper) needs to determine the pid namespace of the
sender. But a signal sent via kill_pid_info_as_uid() comes from within
the kernel and send_signal() does not need to determine the pid namespace
of the sender. So define a helper for send_signal() which takes an
additional parameter, 'from_ancestor_ns' and have kill_pid_info_as_uid()
use that helper directly.
The 'from_ancestor_ns' parameter will be used in a follow-on patch.
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Daniel Lezcano <daniel.lezcano@free.fr>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(This is a modified version of the patch submitted by Oleg Nesterov
http://lkml.org/lkml/2008/11/18/249 and tries to address comments that
came up in that discussion)
init ignores the SIG_DFL signals but we queue them anyway, including
SIGKILL. This is mostly OK, the signal will be dropped silently when
dequeued, but the pending SIGKILL has 2 bad implications:
- it implies fatal_signal_pending(), so we confuse things
like wait_for_completion_killable/lock_page_killable.
- for the sub-namespace inits, the pending SIGKILL can
mask (legacy_queue) the subsequent SIGKILL from the
parent namespace which must kill cinit reliably.
(preparation, cinits don't have SIGNAL_UNKILLABLE yet)
The patch can't help when init is ptraced, but ptracing of init is not
"safe" anyway.
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Daniel Lezcano <daniel.lezcano@free.fr>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Container-init must behave like global-init to processes within the
container and hence it must be immune to unhandled fatal signals from
within the container (i.e SIG_DFL signals that terminate the process).
But the same container-init must behave like a normal process to processes
in ancestor namespaces and so if it receives the same fatal signal from a
process in ancestor namespace, the signal must be processed.
Implementing these semantics requires that send_signal() determine pid
namespace of the sender but since signals can originate from workqueues/
interrupt-handlers, determining pid namespace of sender may not always be
possible or safe.
This patchset implements the design/simplified semantics suggested by
Oleg Nesterov. The simplified semantics for container-init are:
- container-init must never be terminated by a signal from a
descendant process.
- container-init must never be immune to SIGKILL from an ancestor
namespace (so a process in parent namespace must always be able
to terminate a descendant container).
- container-init may be immune to unhandled fatal signals (like
SIGUSR1) even if they are from ancestor namespace. SIGKILL/SIGSTOP
are the only reliable signals to a container-init from ancestor
namespace.
This patch:
Based on an earlier patch submitted by Oleg Nesterov and comments from
Roland McGrath (http://lkml.org/lkml/2008/11/19/258).
The handler parameter is currently unused in the tracehook functions.
Besides, the tracehook functions are called with siglock held, so the
functions can check the handler if they later need to.
Removing the parameter simiplifies changes to sig_ignored() in a follow-on
patch.
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Daniel Lezcano <daniel.lezcano@free.fr>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch fixes bug #12208:
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=12208
Subject : uml is very slow on 2.6.28 host
This turned out to be not a scheduler regression, but an already
existing problem in ptrace being triggered by subtle scheduler
changes.
The problem is this:
- task A is ptracing task B
- task B stops on a trace event
- task A is woken up and preempts task B
- task A calls ptrace on task B, which does ptrace_check_attach()
- this calls wait_task_inactive(), which sees that task B is still on the runq
- task A goes to sleep for a jiffy
- ...
Since UML does lots of the above sequences, those jiffies quickly add
up to make it slow as hell.
This patch solves this by not rescheduling in read_unlock() after
ptrace_stop() has woken up the tracer.
Thanks to Oleg Nesterov and Ingo Molnar for the feedback.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We're going to split the process wide cpu accounting into two parts:
- clocks; which can take all the time they want since they run
from user context.
- timers; which need constant time tracing but can affort the overhead
because they're default off -- and rare.
The clock readout will go back to a full sum of the thread group, for this
we need to re-add the exit stats that were removed in the initial itimer
rework (f06febc9: timers: fix itimer/many thread hang).
Furthermore, since that full sum can be rather slow for large thread groups
and we have the complete dead task stats, revert the do_notify_parent time
computation.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
With print-fatal-signals=1 on a kernel with CONFIG_PREEMPT=y, sending an
unexpected signal to a process causes a BUG: using smp_processor_id() in
preemptible code.
get_signal_to_deliver() releases the siglock before calling
print_fatal_signal(), which calls show_regs(), which calls
smp_processor_id(), which is not supposed to be called from a
preemptible thread.
Make sure show_regs() runs with preemption disabled.
Signed-off-by: Ed Swierk <eswierk@aristanetworks.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Convert all system calls to return a long. This should be a NOP since all
converted types should have the same size anyway.
With the exception of sys_exit_group which returned void. But that doesn't
matter since the system call doesn't return.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>