Commit Graph

2229 Commits

Author SHA1 Message Date
Thomas Gleixner 746976a301 NTP: remove clock_was_set() call to prevent deadlock
The clock_was_set() call in seconds_overflow() which happens only when
leap seconds are inserted / deleted is wrong in two aspects:

1. it results in a call to on_each_cpu() with interrupts disabled
2. it is potential deadlock source vs. call_lock in smp_call_function()

The only possible side effect of the removal might be, that an absolute
CLOCK_REALTIME timer fires 1 second too late, in the rare case of leap
second deletion and an absolute CLOCK_REALTIME timer which expires in
the affected time frame. It will never fire too early.

This was probably observed by the reporter of a June 30th -> July 1st
hang: http://lkml.org/lkml/2007/7/3/103

A similar problem was observed by Dave Jones, who provided a screen shot
with a lockdep back trace, which allowed to analyse the problem.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-03 13:54:27 -07:00
Rafael J. Wysocki 2391dae3e3 PM: introduce set_target method in pm_ops
Commit 52ade9b3b9 changed the suspend code
ordering to execute pm_ops->prepare() after the device model per-device
.suspend() calls in order to fix some ACPI-related issues.  Unfortunately, it
broke the at91 platform which assumed that pm_ops->prepare() would be called
before suspending devices.

at91 used pm_ops->prepare() to get notified of the target system sleep state,
so that it could use this information while suspending devices.  However, with
the current suspend code ordering pm_ops->prepare() is called too late for
this purpose.  Thus, at91 needs an additional method in 'struct pm_ops' that
will be used for notifying the platform of the target system sleep state.
Moreover, in the future such a method will also be needed by ACPI.

This patch adds the .set_target() method to 'struct pm_ops' and makes the
suspend code call it, if implemented, before executing the device model
per-device .suspend() calls.  It also modifies the at91 code to use
pm_ops->set_target() instead of pm_ops->prepare().

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: David Brownell <dbrownell@users.sourceforge.net>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Len Brown <lenb@kernel.org>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-01 12:29:44 -07:00
Masami Hiramatsu a66e356c04 relayfs: fix overwrites
When I use relayfs with "overwrite" mode, read() still sets incorrect
number of consumed bytes.

Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Acked-by: Tom Zanussi <zanussi@us.ibm.com>
Acked-by: David Wilder <dwilder@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-28 11:38:18 -07:00
David Wilder 8d62fdebda relay file read: start-pos fix
Fix a bug in the relay read interface causing the number of consumed bytes
to be set incorrectly.

Signed-off-by: Tom Zanussi <zanussi@us.ibm.com>
Signed-off-by: David Wilder <dwilder@us.ibm.com>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-28 11:34:54 -07:00
Thomas Gleixner a06381fec7 FUTEX: Restore the dropped ERSCH fix
The return value of futex_find_get_task() needs to be -ESRCH in case
that the search fails.  This was part of the original futex fixes and
got accidentally dropped, when the futex-tidy-up patch was split out.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Stable Team <stable@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-24 12:08:53 -07:00
Tony Jones 7b018b2888 audit: fix oops removing watch if audit disabled
Removing a watched file will oops if audit is disabled (auditctl -e 0).

To reproduce:
- auditctl -e 1
- touch /tmp/foo
- auditctl -w /tmp/foo
- auditctl -e 0
- rm /tmp/foo (or mv)

Signed-off-by: Tony Jones <tonyj@suse.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-24 08:59:12 -07:00
Christoph Lameter 92c4ca5c3a sched: fix next_interval determination in idle_balance()
The intervals of domains that do not have SD_BALANCE_NEWIDLE must be
considered for the calculation of the time of the next balance.  Otherwise
we may defer rebalancing forever.

Siddha also spotted that the conversion of the balance interval
to jiffies is missing. Fix that to.

From: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>

also continue the loop if !(sd->flags & SD_LOAD_BALANCE).

Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

It did in fact trigger under all three of mainline, CFS, and -rt including CFS
-- see below for a couple of emails from last Friday giving results for these
three on the AMD box (where it happened) and on a single-quad NUMA-Q system
(where it did not, at least not with such severity).

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-24 08:59:11 -07:00
Cedric Le Goater 4e71e474c7 fix refcounting of nsproxy object when unshared
When a namespace is unshared, a refcount on the previous nsproxy is
abusively taken, leading to a memory leak of nsproxy objects.

Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-24 08:59:10 -07:00
Thomas Gleixner 58229a1899 posix-timers: Prevent softirq starvation by small intervals and SIG_IGN
posix-timers which deliver an ignored signal are currently rearmed in
the timer softirq: This is necessary because the timer needs to be
delivered again when SIG_IGN is removed. This is not a problem, when
the interval is reasonable.

With high resolution timers enabled one might arm a posix timer with a
very small interval and ignore the signal. This might lead to a
softirq starvation when the interval is so small that the timer is
requeued onto the softirq pending list right away.

This problem was pointed out by Jan Kiszka. Thanks Jan !

The correct solution would be to stop the timer, when the signal is
ignored and rearm it when SIG_IGN is removed. Unfortunately this
requires modification in sigaction and involves non trivial sighand
locking. It's too late in the release cycle for such a change.

For now we just keep the timer running and enforce that the timer only
fires every jiffie. This does not break anything as we keep the
overrun counter correct. It adds a little inaccuracy to the
timer_gettime() interface, but...

The more complex change is necessary anyway to fix another short
coming of the current implementation, which I discovered while looking
at this problem: A pending signal is discarded when SIG_IGN is set. In
case that a posixtimer signal is pending then it is discarded as well,
but when SIG_IGN is removed later nothing rearms the timer. This is
not new, it's that way since posix timers have been merged. So nothing
to worry about right now.

I have a working solution to fix all of this, but the impact is too
large for both stable and 2.6.22. I'm going to send it out for review
in the next days.

This should go into 2.6.21.stable as well.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Jan Kiszka <jan.kiszka@web.de>
Cc: Ulrich Drepper <drepper@redhat.com>
Cc: Stable Team <stable@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-21 15:57:04 -07:00
Linus Torvalds fa490cfd15 Fix possible runqueue lock starvation in wait_task_inactive()
Miklos Szeredi reported very long pauses (several seconds, sometimes
more) on his T60 (with a Core2Duo) which he managed to track down to
wait_task_inactive()'s open-coded busy-loop.

He observed that an interrupt on one core tries to acquire the
runqueue-lock but does not succeed in doing so for a very long time -
while wait_task_inactive() on the other core loops waiting for the first
core to deschedule a task (which it wont do while spinning in an
interrupt handler).

This rewrites wait_task_inactive() to do all its waiting optimistically
without any locks taken at all, and then just double-check the end
result with the proper runqueue lock held over just a very short
section.  If there were races in the optimistic wait, of a preemption
event scheduled the process away, we simply re-synchronize, and start
over.

So the code now looks like this:

	repeat:
		/* Unlocked, optimistic looping! */
		rq = task_rq(p);
		while (task_running(rq, p))
			cpu_relax();

		/* Get the *real* values */
		rq = task_rq_lock(p, &flags);
		running = task_running(rq, p);
		array = p->array;
		task_rq_unlock(rq, &flags);

		/* Check them.. */
		if (unlikely(running)) {
			cpu_relax();
			goto repeat;
		}

		/* Preempted away? Yield if so.. */
		if (unlikely(array)) {
			yield();
			goto repeat;
		}

Basically, that first "while()" loop is done entirely without any
locking at all (and doesn't check for the case where the target process
might have been preempted away), and so it's possibly "incorrect", but
we don't really care.  Both the runqueue used, and the "task_running()"
check might be the wrong tests, but they won't oops - they just mean
that we could possibly get the wrong results due to lack of locking and
exit the loop early in the case of a race condition.

So once we've exited the loop, we then get the proper (and careful) rq
lock, and check the running/runnable state _safely_.  And if it turns
out that our quick-and-dirty and unsafe loop was wrong after all, we
just go back and try it all again.

(The patch also adds a lot of comments, which is the actual bulk of it
all, to make it more obvious why we can do these things without holding
the locks).

Thanks to Miklos for all the testing and tracking it down.

Tested-by: Miklos Szeredi <miklos@szeredi.hu>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-18 11:52:55 -07:00
Ingo Molnar a0f98a1cb7 sched: fix SysRq-N (normalize RT tasks)
Gene Heskett reported the following problem while testing CFS: SysRq-N
is not always effective in normalizing tasks back to SCHED_OTHER.

The reason for that turns out to be the following bug:

 - normalize_rt_tasks() uses for_each_process() to iterate through all
   tasks in the system.  The problem is, this method does not iterate
   through all tasks, it iterates through all thread groups.

The proper mechanism to enumerate over all threads is to use a
do_each_thread() + while_each_thread() loop.

Reported-by: Gene Heskett <gene.heskett@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-18 11:52:55 -07:00
Benjamin Herrenschmidt caec4e8dc8 Fix signalfd interaction with thread-private signals
Don't let signalfd dequeue private signals off other threads (in the
case of things like SIGILL or SIGSEGV, trying to do so would result
in undefined behaviour on who actually gets the signal, since they
are force unblocked).

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-18 10:18:32 -07:00
Thomas Gleixner bd197234b0 Revert "futex_requeue_pi optimization"
This reverts commit d0aa7a70bf.

It not only introduced user space visible changes to the futex syscall,
it is also non-functional and there is no way to fix it proper before
the 2.6.22 release.

The breakage report ( http://lkml.org/lkml/2007/5/12/17 ) went
unanswered, and unfortunately it turned out that the concept is not
feasible at all.  It violates the rtmutex semantics badly by introducing
a virtual owner, which hacks around the coupling of the user-space
pi_futex and the kernel internal rt_mutex representation.

At the moment the only safe option is to remove it fully as it contains
user-space visible changes to broken kernel code, which we do not want
to expose in the 2.6.22 release.

The patch reverts the original patch mostly 1:1, but contains a couple
of trivial manual cleanups which were necessary due to patches, which
touched the same area of code later.

Verified against the glibc tests and my own PI futex tests.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Ulrich Drepper <drepper@redhat.com>
Cc: Pierre Peiffer <pierre.peiffer@bull.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-18 09:48:41 -07:00
Rafael J. Wysocki 2f41dddbbd swsusp: Fix userland interface
Fix oops caused by 'cat /dev/snapshot', reported by Arkadiusz Miskiewicz,
and make it impossible to thaw tasks with the help of the swsusp userland
interface while there is a snapshot image ready to save.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-16 13:16:15 -07:00
Paul Jackson 3e903e7b16 cpuset: zero malloc - fix for old cpusets
The cpuset code to present a list of tasks using a cpuset to user space could
write to an array that it had kmalloc'd, after a kmalloc request of zero size.

The problem was that the code didn't check for writes past the allocated end
of the array until -after- the first write.

This is a race condition that is likely rare -- it would only show up if a
cpuset went from being empty to having a task in it, during the brief time
between the allocation and the first write.

Prior to roughly 2.6.22 kernels, this was also a benign problem, because a
zero kmalloc returned a few usable bytes anyway, and no harm was done with the
bogus write.

With the 2.6.22 kernel changes to make issue a warning if code tries to write
to the location returned from a zero size allocation, this problem is no
longer benign.  This cpuset code would occassionally trigger that warning.

The fix is trivial -- check before storing into the array, not after, whether
the array is big enough to hold the store.

Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: "Serge E. Hallyn" <serue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Paul Menage <menage@google.com>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-16 13:16:15 -07:00
Alexey Kuznetsov 778e9a9c3e pi-futex: fix exit races and locking problems
1. New entries can be added to tsk->pi_state_list after task completed
   exit_pi_state_list(). The result is memory leakage and deadlocks.

2. handle_mm_fault() is called under spinlock. The result is obvious.

3. results in self-inflicted deadlock inside glibc.
   Sometimes futex_lock_pi returns -ESRCH, when it is not expected
   and glibc enters to for(;;) sleep() to simulate deadlock. This problem
   is quite obvious and I think the patch is right. Though it looks like
   each "if" in futex_lock_pi() got some stupid special case "else if". :-)

4. sometimes futex_lock_pi() returns -EDEADLK,
   when nobody has the lock. The reason is also obvious (see comment
   in the patch), but correct fix is far beyond my comprehension.
   I guess someone already saw this, the chunk:

                        if (rt_mutex_trylock(&q.pi_state->pi_mutex))
                                ret = 0;

   is obviously from the same opera. But it does not work, because the
   rtmutex is really taken at this point: wake_futex_pi() of previous
   owner reassigned it to us. My fix works. But it looks very stupid.
   I would think about removal of shift of ownership in wake_futex_pi()
   and making all the work in context of process taking lock.

From: Thomas Gleixner <tglx@linutronix.de>

Fix 1) Avoid the tasklist lock variant of the exit race fix by adding
    an additional state transition to the exit code.

    This fixes also the issue, when a task with recursive segfaults
    is not able to release the futexes.

Fix 2) Cleanup the lookup_pi_state() failure path and solve the -ESRCH
    problem finally.

Fix 3) Solve the fixup_pi_state_owner() problem which needs to do the fixup
    in the lock protected section by using the in_atomic userspace access
    functions.

    This removes also the ugly lock drop / unqueue inside of fixup_pi_state()

Fix 4) Fix a stale lock in the error path of futex_wake_pi()

Added some error checks for verification.

The -EDEADLK problem is solved by the rtmutex fixups.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ulrich Drepper <drepper@redhat.com>
Cc: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-08 17:23:34 -07:00
Thomas Gleixner 1a539a8728 rt-mutex: fix chain walk early wakeup bug
Alexey Kuznetsov found some problems in the pi-futex code.

One of the root causes is:

When a wakeup happens, we do not to stop the chain walk so we follow a not
longer relevant locking chain.

Drop out when this happens.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: Ulrich Drepper <drepper@redhat.com>
Cc: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-08 17:23:34 -07:00
Thomas Gleixner c0d1d2bf5a rt-mutex: fix stale return value
Alexey Kuznetsov found some problems in the pi-futex code.

The major problem is a stale return value in rt_mutex_slowlock():

When the pi chain walk returns -EDEADLK, but the waiter was woken up during
the phases where the locks were dropped, the rtmutex could be acquired, but
due to the stale return value -EDEADLK returned to the caller.

Reset the return value in the retry path.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: Ulrich Drepper <drepper@redhat.com>
Cc: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-08 17:23:34 -07:00
Roland McGrath b74d0deb96 Restrict clearing TIF_SIGPENDING
This patch should get a few birds.  It prevents sigaction calls from
clearing TIF_SIGPENDING in other threads, which could leak -ERESTART*.
And It fixes ptrace_stop not to clear it, which done at the syscall exit
stop could leak -ERESTART*.  It probably removes the harm from signalfd,
at least assuming it never calls dequeue_signal on kernel threads that
might have used block_all_signals.

Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-07 08:52:15 -07:00
Ingo Molnar c1a834dc70 timer stats: speedups
Make timer-stats have almost zero overhead when enabled in the config but
not used.  (this way distros can enable it more easily)

Also update the documentation about overhead of timer_stats - it was
written for the first version which had a global lock and a linear list
walk based lookup ;-)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-01 08:18:30 -07:00
Bjorn Steinbrink 9fcc15ec3c timer statistics: fix race
Fix two races in the timer stats lookup code.  One by ensuring that the
initialization of a new entry is finished upon insertion of that entry.
The other by cleaning up the hash table when the entries array is cleared,
so that we don't have any "pre-inserted" entries.

Thanks to Eric Dumazet for reminding me of the memory barriers.

Signed-off-by: Bjorn Steinbrink <B.Steinbrink@gmx.de>
Signed-off-by: Ian Kumlien <pomac@vapor.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Eric Dumazet <dada1@cosmosbay.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-01 08:18:30 -07:00
Ulrich Drepper f0ede66fca fix compat futex code for private futexes
When the private futex support was added the compat code wasn't changed.
The result is that code using compat code which fail, e.g., because the
timeout values are not correctly passed.  The following patch should fix
that.

Signed-off-by: Ulrich Drepper <drepper@redhat.com>
Cc: Eric Dumazet <dada1@cosmosbay.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-01 08:18:28 -07:00
Kyle McMartin 7a74fc4925 fix possible null ptr deref in kallsyms_lookup
ugh, this function gets called by our unwinder. recursive backtrace for
the win... bisection to find this one was "fun."

Signed-off-by: Kyle McMartin <kyle@parisc-linux.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-30 10:51:38 -07:00
Thomas Gleixner eaad084bb0 NOHZ: prevent multiplication overflow - stop timer for huge timeouts
get_next_timer_interrupt() returns a delta of (LONG_MAX > 1) in case
there is no timer pending. On 64 bit machines this results in a
multiplication overflow in tick_nohz_stop_sched_tick().

Reported by: Dave Miller <davem@davemloft.net>

Make the return value a constant and limit the return value to a 32 bit
value.

When the max timeout value is returned, we can safely stop the tick
timer device. The max jiffies delta results in a 12 days timeout for
HZ=1000.

In the long term the get_next_timer_interrupt() code needs to be
reworked to return ktime instead of jiffies, but we have to wait until
the last users of the original NO_IDLE_HZ code are converted.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-29 18:11:10 -07:00
Linus Torvalds 92ea77275b Fix crash with irqpoll due to the IRQF_IRQPOLL flag testing
With irqpoll enabled, trying to test the IRQF_IRQPOLL flag in the
actions would cause a NULL pointer dereference if no action was
installed (for example, the driver might have been unloaded with
interrupts still pending).

So be a bit more careful about testing the flag by making sure to test
for that case.

(The actual _change_ is trivial, the patch is more than a one-liner
because I rewrote the testing to also be much more readable.

Original (discarded) bugfix by Bernhard Walle.

Cc: Bernhard Walle <bwalle@suse.de>
Tested-by: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-24 08:37:14 -07:00
Thomas Gleixner 98d8256739 Prevent going idle with softirq pending
The NOHZ patch contains a check for softirqs pending when a CPU goes idle.
The BUG is unrelated to NOHZ, it just was made visible by the NOHZ patch.
The BUG showed up mainly on P4 / hyperthreading enabled machines which lead
the investigations into the wrong direction in the first place.  The real
cause is in cond_resched_softirq():

cond_resched_softirq() is enabling softirqs without invoking the softirq
daemon when softirqs are pending.  This leads to the warning message in the
NOHZ idle code:

t1 runs softirq disabled code on CPU#0
interrupt happens, softirq is raised, but deferred (softirqs disabled)
t1 calls cond_resched_softirq()
	enables softirqs via _local_bh_enable()
	calls schedule()
t2 runs
t1 is migrated to CPU#1
t2 is done and invokes idle()
NOHZ detects the pending softirq

Fix: change _local_bh_enable() to local_bh_enable() so the softirq
daemon is invoked.

Thanks to Anant Nitya for debugging this with great patience !

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-23 20:14:15 -07:00
OGAWA Hirofumi 6373da1fb7 power: Fix sizeof(PAGE_SIZE) typo
Fix sizeof(PAGE_SIZE) typo.  It should be just PAGE_SIZE for zeroing the
swsusp_header.

Signed-off-by: OGAWA Hirofumi <hogawa@miraclelinux.com>
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-23 20:14:14 -07:00
Oleg Nesterov 14441960e8 simplify cleanup_workqueue_thread()
cleanup_workqueue_thread() and cwq_should_stop() are overcomplicated.

Convert the code to use kthread_should_stop/kthread_stop as was
suggested by Gautham and Srivatsa.

In particular this patch removes the (unlikely) busy-wait loop from the
exit path, it was a temporary and ugly kludge (if not a bug).

Note: the current code was designed to solve another old problem:
work->func can't share locks with hotplug callbacks.  I think this could
be done, see

	http://marc.info/?l=linux-kernel&m=116905366428633

but this needs some more complications to preserve CPU affinity of
cwq->thread during cpu_up().  A freezer-based hotplug looks more
appealing.

[akpm@linux-foundation.org: make it more tolerant of gcc borkenness]
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Zilvinas Valinskas <zilvinas@wilibox.com>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-23 20:14:13 -07:00
Roland McGrath 7bb44adef3 recalc_sigpending_tsk fixes
Steve Hawkes discovered a problem where recalc_sigpending_tsk was called in
do_sigaction but no signal_wake_up call was made, preventing later signals
from waking up blocked threads with TIF_SIGPENDING already set.

In fact, the few other calls to recalc_sigpending_tsk outside the signals
code are also subject to this problem in other race conditions.

This change makes recalc_sigpending_tsk private to the signals code.  It
changes the outside calls, as well as do_sigaction, to use the new
recalc_sigpending_and_wake instead.

Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: <Steve.Hawkes@motorola.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-23 20:14:12 -07:00
Thomas Gleixner 3528231606 NOHZ: Rate limit the local softirq pending warning output
The warning in the NOHZ code, which triggers when a CPU goes idle with
softirqs pending can fill up the logs quite quickly.  Rate limit the output
until we found the root cause of that problem.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-23 20:14:11 -07:00
Thomas Gleixner 72fcde9662 Ignore bogus ACPI info for offline CPUs
Booting a SMP kernel with maxcpus=1 on a SMP system leads to a hard hang,
because ACPI ignores the maxcpus setting and sends timer broadcast info for
the offline CPUs.  This results in a stuck for ever call to
smp_call_function_single() on an offline CPU.

Ignore the bogus information and print a kernel error to remind ACPI
folks to fix it.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-23 20:14:11 -07:00
Gautham R Shenoy 88f18ba028 freezer: move frozen_process() to kernel/power/process.c
Other than refrigerator, no one else calls frozen_process().  So move it from
include/linux/freezer.h to kernel/power/process.c.

Also, since a task can be marked as frozen by itself, we don't need to pass
the (struct task_struct *p) parameter to frozen_process().

Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-23 20:14:11 -07:00
Oleg Nesterov a076e4bca2 freezer: fix kthread_create vs freezer theoretical race
kthread() sleeps in TASK_INTERRUPTIBLE state waiting for the first wakeup.  In
theory, this wakeup may come from freeze_process()->signal_wake_up(), so the
task can disappear even before kthread_create() sets its ->comm.

Change kthread() to use TASK_UNINTERRUPTIBLE.

[akpm@linux-foundation.org: s/BUG_ON/WARN_ON+recover]
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-23 20:14:11 -07:00
Rafael J. Wysocki 49b12d4f5e freezer: take kernel_execve into consideration
Kernel threads can become userland processes by calling kernel_execve().

In particular, this may happen right after the try_to_freeze_tasks()
called with FREEZER_USER_SPACE has returned, so try_to_freeze_tasks()
needs to take userspace processes into consideration even if it is
called with FREEZER_KERNEL_THREADS.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-23 20:14:11 -07:00
Rafael J. Wysocki ba96a0c880 freezer: fix vfork problem
Currently try_to_freeze_tasks() has to wait until all of the vforked processes
exit and for this reason every user can make it fail.  To fix this problem we
can introduce the additional process flag PF_FREEZER_SKIP to be used by tasks
that do not want to be counted as freezable by the freezer and want to have
TIF_FREEZE set nevertheless.  Then, this flag can be set by tasks using
sys_vfork() before they call wait_for_completion(&vfork) and cleared after
they have woken up.  After clearing it, the tasks should call try_to_freeze()
as soon as possible.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-23 20:14:11 -07:00
Rafael J. Wysocki 33e1c288da freezer: close potential race between refrigerator and thaw_tasks
If the freezing of tasks fails and a task is preempted in refrigerator()
before calling frozen_process(), then thaw_tasks() may run before this task is
frozen.  In that case the task will freeze and no one will thaw it.

To fix this race we can call freezing(current) in refrigerator() along with
frozen_process(current) under the task_lock() which also should be taken in
the error path of try_to_freeze_tasks() as well as in thaw_process().
Moreover, if thaw_process() additionally clears TIF_FREEZE for tasks that are
not frozen, we can be sure that all tasks are thawed and there are no pending
"freeze" requests after thaw_tasks() has run.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-23 20:14:10 -07:00
Alexey Dobriyan e8edc6e03a Detach sched.h from mm.h
First thing mm.h does is including sched.h solely for can_do_mlock() inline
function which has "current" dereference inside. By dealing with can_do_mlock()
mm.h can be detached from sched.h which is good. See below, why.

This patch
a) removes unconditional inclusion of sched.h from mm.h
b) makes can_do_mlock() normal function in mm/mlock.c
c) exports can_do_mlock() to not break compilation
d) adds sched.h inclusions back to files that were getting it indirectly.
e) adds less bloated headers to some files (asm/signal.h, jiffies.h) that were
   getting them indirectly

Net result is:
a) mm.h users would get less code to open, read, preprocess, parse, ... if
   they don't need sched.h
b) sched.h stops being dependency for significant number of files:
   on x86_64 allmodconfig touching sched.h results in recompile of 4083 files,
   after patch it's only 3744 (-8.3%).

Cross-compile tested on

	all arm defconfigs, all mips defconfigs, all powerpc defconfigs,
	alpha alpha-up
	arm
	i386 i386-up i386-defconfig i386-allnoconfig
	ia64 ia64-up
	m68k
	mips
	parisc parisc-up
	powerpc powerpc-up
	s390 s390-up
	sparc sparc-up
	sparc64 sparc64-up
	um-x86_64
	x86_64 x86_64-up x86_64-defconfig x86_64-allnoconfig

as well as my two usual configs.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-21 09:18:19 -07:00
Rafael J. Wysocki 8d98a690f5 swsusp: fix sysfs interface
The sysfs files /sys/power/disk and /sys/power/state do not work as
documented, since they allow the user to write only a few initial
characters of the input string to trigger the option (eg.  'echo pl >
/sys/power/disk' activates the platform mode of hibernation).  Fix it.

Special thanks to Peter Moulder <Peter.Moulder@infotech.monash.edu.au> for
pointing out the problem.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-17 05:23:05 -07:00
Dan Aloni 71ce92f3fa make sysctl/kernel/core_pattern and fs/exec.c agree on maximum core filename size
Make sysctl/kernel/core_pattern and fs/exec.c agree on maximum core
filename size and change it to 128, so that extensive patterns such as
'/local/cores/%e-%h-%s-%t-%p.core' won't result in truncated filename
generation.

Signed-off-by: Dan Aloni <da-x@monatomic.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-17 05:23:05 -07:00
Christoph Lameter a35afb830f Remove SLAB_CTOR_CONSTRUCTOR
SLAB_CTOR_CONSTRUCTOR is always specified. No point in checking it.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Steven French <sfrench@us.ibm.com>
Cc: Michael Halcrow <mhalcrow@us.ibm.com>
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Dave Kleikamp <shaggy@austin.ibm.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Anton Altaparmakov <aia21@cantab.net>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jan Kara <jack@ucw.cz>
Cc: David Chinner <dgc@sgi.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-17 05:23:04 -07:00
Linus Torvalds 52ade9b3b9 Fix ACPI suspend / device suspend ordering problem
In commit e3c7db621b we fixed the resume
ordering, so that the ACPI low-level resume code was called before the
actual driver resume was called. However, that broke the nesting logic
of suspend and resume, and we continued to suspend the devices _after_
we the ACPI device suspend code was called.

That resulted in us saving PCI state for devices that had already been
changed by ACPI, and in some cases disabled entirely (causing the PCI
save_state to be all-ones).  Which in turn caused the wrong state to be
written back on resume.

This moves the ACPI device suspend to after the device model per-device
suspend() calls. This fixes the bogus state save.

Thanks to Lukáš Hejtmánek for testing.

Acked-by: Lukas Hejtmanek <xhejtman@ics.muni.cz>
Acked-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Len Brown <len.brown@intel.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-16 15:33:19 -07:00
Al Viro 327b9eebbf audit_match_signal() and friends are used only if CONFIG_AUDITSYSCALL is set
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-15 18:56:37 -07:00
Thomas Gleixner 8f89441b37 clocksource: fix lock order in the resume path
lockdep complains about the lock nesting of clocksource and watchdog lock
in the resume path.

Change the resume marker to a bit operation and remove the lock from this
path.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-15 08:54:00 -07:00
Thomas Gleixner d10ff3fb62 timekeeping fix patch got mis-applied
The time keeping code move to kernel/time/timekeeping.c broke the
clocksource resume logic patch, which got applied to the old file by a
fuzzy application.  Fix it up and move the clocksource_resume() call to
the appropriate place.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
[ tssk, tssk, everybody should use --fuzz=0 ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-14 12:13:11 -07:00
Heiko Carstens 8df767dd75 compat signalfd and timerfd are cond syscalls
Add missing cond_syscall statements for compat_sys_signalfd and
compat_sys_timerfd.

Cc: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-12 10:55:40 -07:00
Linus Torvalds 2a383c63ff Merge branch 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6
* 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6:
  [IA64] Quicklist support for IA64
  [IA64] fix Kprobes reentrancy
  [IA64] SN: validate smp_affinity mask on intr redirect
  [IA64] drivers/char/snsc_event.c:206: warning: unused variable `p'
  [IA64] mca.c:121: warning: 'cpe_poll_timer' defined but not used
  [IA64] Fix - Section mismatch: reference to .init.data:mvec_name
  [IA64] more warning cleanups
  [IA64] Wire up epoll_pwait and utimensat
  [IA64] Fix warnings resulting from type-checking in dev_dbg()
  [IA64] typo s/kenrel/kernel/
2007-05-11 12:53:21 -07:00
Linus Torvalds 853da00220 Merge branch 'audit.b38' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current
* 'audit.b38' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current:
  [PATCH] Abnormal End of Processes
  [PATCH] match audit name data
  [PATCH] complete message queue auditing
  [PATCH] audit inode for all xattr syscalls
  [PATCH] initialize name osid
  [PATCH] audit signal recipients
  [PATCH] add SIGNAL syscall class (v3)
  [PATCH] auditing ptrace
2007-05-11 09:57:16 -07:00
John Keller 25d61578da [IA64] SN: validate smp_affinity mask on intr redirect
On SN, only allow one bit to be set in the smp_affinty mask when
redirecting an interrupt.  Currently setting multiple bits is allowed, but
only the first bit is used in determining the CPU to redirect to.  This has
caused confusion among some customers.

[akpm@linux-foundation.org: fixes]
Signed-off-by: John Keller <jpk@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2007-05-11 09:35:38 -07:00
Davide Libenzi e1ad7468c7 signal/timer/event: eventfd core
This is a very simple and light file descriptor, that can be used as event
wait/dispatch by userspace (both wait and dispatch) and by the kernel
(dispatch only).  It can be used instead of pipe(2) in all cases where those
would simply be used to signal events.  Their kernel overhead is much lower
than pipes, and they do not consume two fds.  When used in the kernel, it can
offer an fd-bridge to enable, for example, functionalities like KAIO or
syslets/threadlets to signal to an fd the completion of certain operations.
But more in general, an eventfd can be used by the kernel to signal readiness,
in a POSIX poll/select way, of interfaces that would otherwise be incompatible
with it.  The API is:

int eventfd(unsigned int count);

The eventfd API accepts an initial "count" parameter, and returns an eventfd
fd.  It supports poll(2) (POLLIN, POLLOUT, POLLERR), read(2) and write(2).

The POLLIN flag is raised when the internal counter is greater than zero.

The POLLOUT flag is raised when at least a value of "1" can be written to the
internal counter.

The POLLERR flag is raised when an overflow in the counter value is detected.

The write(2) operation can never overflow the counter, since it blocks (unless
O_NONBLOCK is set, in which case -EAGAIN is returned).

But the eventfd_signal() function can do it, since it's supposed to not sleep
during its operation.

The read(2) function reads the __u64 counter value, and reset the internal
value to zero.  If the value read is equal to (__u64) -1, an overflow happened
on the internal counter (due to 2^64 eventfd_signal() posts that has never
been retired - unlickely, but possible).

The write(2) call writes an __u64 count value, and adds it to the current
counter.  The eventfd fd supports O_NONBLOCK also.

On the kernel side, we have:

struct file *eventfd_fget(int fd);
int eventfd_signal(struct file *file, unsigned int n);

The eventfd_fget() should be called to get a struct file* from an eventfd fd
(this is an fget() + check of f_op being an eventfd fops pointer).

The kernel can then call eventfd_signal() every time it wants to post an event
to userspace.  The eventfd_signal() function can be called from any context.
An eventfd() simple test and bench is available here:

http://www.xmailserver.org/eventfd-bench.c

This is the eventfd-based version of pipetest-4 (pipe(2) based):

http://www.xmailserver.org/pipetest-4.c

Not that performance matters much in the eventfd case, but eventfd-bench
shows almost as double as performance than pipetest-4.

[akpm@linux-foundation.org: fix i386 build]
[akpm@linux-foundation.org: add sys_eventfd to sys_ni.c]
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-11 08:29:36 -07:00
Davide Libenzi 83f5d12669 signal/timer/event: timerfd compat code
This patch implements the necessary compat code for the timerfd system call.

Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-11 08:29:36 -07:00
Davide Libenzi b215e28399 signal/timer/event: timerfd core
This patch introduces a new system call for timers events delivered though
file descriptors.  This allows timer event to be used with standard POSIX
poll(2), select(2) and read(2).  As a consequence of supporting the Linux
f_op->poll subsystem, they can be used with epoll(2) too.

The system call is defined as:

int timerfd(int ufd, int clockid, int flags, const struct itimerspec *utmr);

The "ufd" parameter allows for re-use (re-programming) of an existing timerfd
w/out going through the close/open cycle (same as signalfd).  If "ufd" is -1,
s new file descriptor will be created, otherwise the existing "ufd" will be
re-programmed.

The "clockid" parameter is either CLOCK_MONOTONIC or CLOCK_REALTIME.  The time
specified in the "utmr->it_value" parameter is the expiry time for the timer.

If the TFD_TIMER_ABSTIME flag is set in "flags", this is an absolute time,
otherwise it's a relative time.

If the time specified in the "utmr->it_interval" is not zero (.tv_sec == 0,
tv_nsec == 0), this is the period at which the following ticks should be
generated.

The "utmr->it_interval" should be set to zero if only one tick is requested.
Setting the "utmr->it_value" to zero will disable the timer, or will create a
timerfd without the timer enabled.

The function returns the new (or same, in case "ufd" is a valid timerfd
descriptor) file, or -1 in case of error.

As stated before, the timerfd file descriptor supports poll(2), select(2) and
epoll(2).  When a timer event happened on the timerfd, a POLLIN mask will be
returned.

The read(2) call can be used, and it will return a u32 variable holding the
number of "ticks" that happened on the interface since the last call to
read(2).  The read(2) call supportes the O_NONBLOCK flag too, and EAGAIN will
be returned if no ticks happened.

A quick test program, shows timerfd working correctly on my amd64 box:

http://www.xmailserver.org/timerfd-test.c

[akpm@linux-foundation.org: add sys_timerfd to sys_ni.c]
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-11 08:29:36 -07:00
Davide Libenzi fba2afaaec signal/timer/event: signalfd core
This patch series implements the new signalfd() system call.

I took part of the original Linus code (and you know how badly it can be
broken :), and I added even more breakage ;) Signals are fetched from the same
signal queue used by the process, so signalfd will compete with standard
kernel delivery in dequeue_signal().  If you want to reliably fetch signals on
the signalfd file, you need to block them with sigprocmask(SIG_BLOCK).  This
seems to be working fine on my Dual Opteron machine.  I made a quick test
program for it:

http://www.xmailserver.org/signafd-test.c

The signalfd() system call implements signal delivery into a file descriptor
receiver.  The signalfd file descriptor if created with the following API:

int signalfd(int ufd, const sigset_t *mask, size_t masksize);

The "ufd" parameter allows to change an existing signalfd sigmask, w/out going
to close/create cycle (Linus idea).  Use "ufd" == -1 if you want a brand new
signalfd file.

The "mask" allows to specify the signal mask of signals that we are interested
in.  The "masksize" parameter is the size of "mask".

The signalfd fd supports the poll(2) and read(2) system calls.  The poll(2)
will return POLLIN when signals are available to be dequeued.  As a direct
consequence of supporting the Linux poll subsystem, the signalfd fd can use
used together with epoll(2) too.

The read(2) system call will return a "struct signalfd_siginfo" structure in
the userspace supplied buffer.  The return value is the number of bytes copied
in the supplied buffer, or -1 in case of error.  The read(2) call can also
return 0, in case the sighand structure to which the signalfd was attached,
has been orphaned.  The O_NONBLOCK flag is also supported, and read(2) will
return -EAGAIN in case no signal is available.

If the size of the buffer passed to read(2) is lower than sizeof(struct
signalfd_siginfo), -EINVAL is returned.  A read from the signalfd can also
return -ERESTARTSYS in case a signal hits the process.  The format of the
struct signalfd_siginfo is, and the valid fields depends of the (->code &
__SI_MASK) value, in the same way a struct siginfo would:

struct signalfd_siginfo {
	__u32 signo;	/* si_signo */
	__s32 err;	/* si_errno */
	__s32 code;	/* si_code */
	__u32 pid;	/* si_pid */
	__u32 uid;	/* si_uid */
	__s32 fd;	/* si_fd */
	__u32 tid;	/* si_fd */
	__u32 band;	/* si_band */
	__u32 overrun;	/* si_overrun */
	__u32 trapno;	/* si_trapno */
	__s32 status;	/* si_status */
	__s32 svint;	/* si_int */
	__u64 svptr;	/* si_ptr */
	__u64 utime;	/* si_utime */
	__u64 stime;	/* si_stime */
	__u64 addr;	/* si_addr */
};

[akpm@linux-foundation.org: fix signalfd_copyinfo() on i386]
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-11 08:29:36 -07:00
Sukadev Bhattiprolu 0800d30832 Use task_pgrp() task_session() in copy_process()
Use task_pgrp() and task_session() in copy_process(), and avoid find_pid()
call when attaching the task to its process group and session.

Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Cc: <containers@lists.osdl.org>
Acked-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-11 08:29:36 -07:00
Sukadev Bhattiprolu 85868995d9 Use struct pid parameter in copy_process()
Modify copy_process() to take a struct pid * parameter instead of a pid_t.
This simplifies the code a bit and also avoids having to call find_pid() to
convert the pid_t to a struct pid.

Changelog:
	- Fixed Badari Pulavarty's comments and passed in &init_struct_pid
	  from fork_idle().
	- Fixed Eric Biederman's comments and simplified this patch and
	  used a new patch to remove the likely(pid) check.

Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: <containers@lists.osdl.org>
Acked-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-11 08:29:35 -07:00
Sukadev Bhattiprolu 820e45db23 statically initialize struct pid for swapper
Statically initialize a struct pid for the swapper process (pid_t == 0) and
attach it to init_task.  This is needed so task_pid(), task_pgrp() and
task_session() interfaces work on the swapper process also.

Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: <containers@lists.osdl.org>
Acked-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-11 08:29:35 -07:00
Sukadev Bhattiprolu e713d0dab2 attach_pid() with struct pid parameter
attach_pid() currently takes a pid_t and then uses find_pid() to find the
corresponding struct pid.  Sometimes we already have the struct pid.  We can
then skip find_pid() if attach_pid() were to take a struct pid parameter.

Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Cc: <containers@lists.osdl.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-11 08:29:35 -07:00
Daniel Walker 3e88c553db use defines in sys_getpriority/sys_setpriority
Switch to the defines for these two checks, instead of hard coding the
values.

[akpm@linux-foundation.org: add missing include]
Signed-off-by: Daniel Walker <dwalker@mvista.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-11 08:29:35 -07:00
Benjamin Herrenschmidt a12bb44471 stop_machine() now uses hard_irq_disable
Add a call to hard_irq_disable() to stop_machine so that we make sure IRQs are
really disabled and not only lazy-disabled on archs like powerpc as some users
of stop_machine() may rely on that.

[akpm@linux-foundation.org: build fix]
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-11 08:29:34 -07:00
Eric Dumazet 6eaeeaba39 getrusage(): fill ru_inblock and ru_oublock fields if possible
If CONFIG_TASK_IO_ACCOUNTING is defined, we update io accounting counters for
each task.

This patch permits reporting of values using the well known getrusage()
syscall, filling ru_inblock and ru_oublock instead of null values.

As TASK_IO_ACCOUNTING currently counts bytes counts, we approximate blocks
count doing : nr_blocks = nr_bytes / 512

Example of use :
----------------------
After patch is applied, /usr/bin/time command can now give a good
approximation of IO that the process had to do.

$ /usr/bin/time grep tototo /usr/include/*
Command exited with non-zero status 1
0.00user 0.02system 0:02.11elapsed 1%CPU (0avgtext+0avgdata 0maxresident)k
24288inputs+0outputs (0major+259minor)pagefaults 0swaps

$ /usr/bin/time dd if=/dev/zero of=/tmp/testfile count=1000
1000+0 enregistrements lus
1000+0 enregistrements écrits
512000 octets (512 kB) copiés, 0,00326601 seconde, 157 MB/s
0.00user 0.00system 0:00.00elapsed 80%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+3000outputs (0major+299minor)pagefaults 0swaps

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-11 08:29:34 -07:00
Steve Grubb 0a4ff8c259 [PATCH] Abnormal End of Processes
Hi,

I have been working on some code that detects abnormal events based on audit
system events. One kind of event that we currently have no visibility for is
when a program terminates due to segfault - which should never happen on a
production machine. And if it did, you'd want to investigate it. Attached is a
patch that collects these events and sends them into the audit system.

Signed-off-by: Steve Grubb <sgrubb@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2007-05-11 05:38:26 -04:00
Amy Griffis 5712e88f2b [PATCH] match audit name data
Make more effort to detect previously collected names, so we don't log
multiple PATH records for a single filesystem object. Add
audit_inc_name_count() to reduce duplicate code.

Signed-off-by: Amy Griffis <amy.griffis@hp.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2007-05-11 05:38:26 -04:00
Amy Griffis 4fc03b9beb [PATCH] complete message queue auditing
Handle the edge cases for POSIX message queue auditing. Collect inode
info when opening an existing mq, and for send/receive operations. Remove
audit_inode_update() as it has really evolved into the equivalent of
audit_inode().

Signed-off-by: Amy Griffis <amy.griffis@hp.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2007-05-11 05:38:26 -04:00
Amy Griffis e41e8bde43 [PATCH] initialize name osid
Audit contexts can be reused, so initialize a name's osid to the
default in audit_getname(). This ensures we don't log a bogus object
label when no inode data is collected for a name.

Signed-off-by: Amy Griffis <amy.griffis@hp.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2007-05-11 05:38:25 -04:00
Amy Griffis e54dc2431d [PATCH] audit signal recipients
When auditing syscalls that send signals, log the pid and security
context for each target process. Optimize the data collection by
adding a counter for signal-related rules, and avoiding allocating an
aux struct unless we have more than one target process. For process
groups, collect pid/context data in blocks of 16. Move the
audit_signal_info() hook up in check_kill_permission() so we audit
attempts where permission is denied.

Signed-off-by: Amy Griffis <amy.griffis@hp.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2007-05-11 05:38:25 -04:00
Al Viro a5cb013da7 [PATCH] auditing ptrace
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2007-05-11 05:38:25 -04:00
akpm@linux-foundation.org e9910846fd timer: revert parenthesis fix in tbase_get_deferrable() etc
On 09-05-2007 21:10, Pallipadi, Venkatesh wrote:
...
> On a 64 bit system, converting pointer to int causes unnecessary
> compiler warning, and intermediate long conversion was to avoid that.
> I will have to rephrase my comment to remove 32 bit value and use int,
> as that is what the function returns.

So, this patch reverts all changes done by my previous patch.

I apologize for my wrong comment about "logical error" here.

Cc: "Pallipadi, Venkatesh" <venkatesh.pallipadi@intel.com>
Cc: Satyam Sharma <satyam.sharma@gmail.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Jarek Poplawski <jarkao2@o2.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-10 09:26:53 -07:00
Linus Torvalds aabded9c3a Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc
* 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc:
  [POWERPC] Further fixes for the removal of 4level-fixup hack from ppc32
  [POWERPC] EEH: log all PCI-X and PCI-E AER registers
  [POWERPC] EEH: capture and log pci state on error
  [POWERPC] EEH: Split up long error msg
  [POWERPC] EEH: log error only after driver notification.
  [POWERPC] fsl_soc: Make mac_addr const in fs_enet_of_init().
  [POWERPC] Don't use SLAB/SLUB for PTE pages
  [POWERPC] Spufs support for 64K LS mappings on 4K kernels
  [POWERPC] Add ability to 4K kernel to hash in 64K pages
  [POWERPC] Introduce address space "slices"
  [POWERPC] Small fixes & cleanups in segment page size demotion
  [POWERPC] iSeries: Make HVC_ISERIES the default
  [POWERPC] iSeries: suppress build warning in lparmap.c
  [POWERPC] Mark pages that don't exist as nosave
  [POWERPC] swsusp: Introduce register_nosave_region_late
2007-05-09 12:56:01 -07:00
Linus Torvalds 9a9136e270 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial
* git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial: (25 commits)
  sound: convert "sound" subdirectory to UTF-8
  MAINTAINERS: Add cxacru website/mailing list
  include files: convert "include" subdirectory to UTF-8
  general: convert "kernel" subdirectory to UTF-8
  documentation: convert the Documentation directory to UTF-8
  Convert the toplevel files CREDITS and MAINTAINERS to UTF-8.
  remove broken URLs from net drivers' output
  Magic number prefix consistency change to Documentation/magic-number.txt
  trivial: s/i_sem /i_mutex/
  fix file specification in comments
  drivers/base/platform.c: fix small typo in doc
  misc doc and kconfig typos
  Remove obsolete fat_cvf help text
  Fix occurrences of "the the "
  Fix minor typoes in kernel/module.c
  Kconfig: Remove reference to external mqueue library
  Kconfig: A couple of grammatical fixes in arch/i386/Kconfig
  Correct comments in genrtc.c to refer to correct /proc file.
  Fix more "deprecated" spellos.
  Fix "deprecated" typoes.
  ...

Fix trivial comment conflict in kernel/relay.c.
2007-05-09 12:54:17 -07:00
Roman Zippel f7e4217b00 rename thread_info to stack
This finally renames the thread_info field in task structure to stack, so that
the assumptions about this field are gone and archs have more freedom about
placing the thread_info structure.

Nonbroken archs which have a proper thread pointer can do the access to both
current thread and task structure via a single pointer.

It'll allow for a few more cleanups of the fork code, from which e.g.  ia64
could benefit.

Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
[akpm@linux-foundation.org: build fix]
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Ian Molton <spyro@f2s.com>
Cc: Haavard Skinnemoen <hskinnemoen@atmel.com>
Cc: Mikael Starvik <starvik@axis.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: Greg Ungerer <gerg@uclinux.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp>
Cc: Richard Curnow <rc@rc0.org.uk>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp>
Cc: Andi Kleen <ak@muc.de>
Cc: Chris Zankel <chris@zankel.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:56 -07:00
Roman Zippel c9f4f06d31 wrap access to thread_info
Recently a few direct accesses to the thread_info in the task structure snuck
back, so this wraps them with the appropriate wrapper.

Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:56 -07:00
Thomas Gleixner b52f52a093 clocksource: fix resume logic
We need to make sure that the clocksources are resumed, when timekeeping is
resumed.  The current resume logic does not guarantee this.

Add a resume function pointer to the clocksource struct, so clocksource
drivers which need to reinitialize the clocksource can provide a resume
function.

Add a resume function, which calls the maybe available clocksource resume
functions and resets the watchdog function, so a stable TSC can be used
accross suspend/resume.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:56 -07:00
Christoph Lameter 77461ab332 Make vm statistics update interval configurable
Make it configurable.  Code in mm makes the vm statistics intervals
independent from the cache reaper use that opportunity to make it
configurable.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:56 -07:00
Rafael J. Wysocki 455c017ae3 microcode: use suspend-related CPU hotplug notifications
Make the microcode driver use the suspend-related CPU hotplug notifications
to handle the CPU hotplug events occuring during system-wide suspend and
resume transitions.  Remove the global variable suspend_cpu_hotplug
previously used for this purpose.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:56 -07:00
Rafael J. Wysocki 8bb7844286 Add suspend-related notifications for CPU hotplug
Since nonboot CPUs are now disabled after tasks and devices have been
frozen and the CPU hotplug infrastructure is used for this purpose, we need
special CPU hotplug notifications that will help the CPU-hotplug-aware
subsystems distinguish normal CPU hotplug events from CPU hotplug events
related to a system-wide suspend or resume operation in progress.  This
patch introduces such notifications and causes them to be used during
suspend and resume transitions.  It also changes all of the
CPU-hotplug-aware subsystems to take these notifications into consideration
(for now they are handled in the same way as the corresponding "normal"
ones).

[oleg@tv-sign.ru: cleanups]
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:56 -07:00
Jarek Poplawski 38a23e311b timer: parenthesis fix in tbase_get_deferrable() etc
Signed-off-by: Jarek Poplawski <jarkao2@o2.pl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:55 -07:00
Eric Dumazet 34f01cc1f5 FUTEX: new PRIVATE futexes
Analysis of current linux futex code :
  --------------------------------------

A central hash table futex_queues[] holds all contexts (futex_q) of waiting
threads.

Each futex_wait()/futex_wait() has to obtain a spinlock on a hash slot to
perform lookups or insert/deletion of a futex_q.

When a futex_wait() is done, calling thread has to :

1) - Obtain a read lock on mmap_sem to be able to validate the user pointer
     (calling find_vma()). This validation tells us if the futex uses
     an inode based store (mapped file), or mm based store (anonymous mem)

2) - compute a hash key

3) - Atomic increment of reference counter on an inode or a mm_struct

4) - lock part of futex_queues[] hash table

5) - perform the test on value of futex.
	(rollback is value != expected_value, returns EWOULDBLOCK)
	(various loops if test triggers mm faults)

6) queue the context into hash table, release the lock got in 4)

7) - release the read_lock on mmap_sem

   <block>

8) Eventually unqueue the context (but rarely, as this part  may be done
   by the futex_wake())

Futexes were designed to improve scalability but current implementation has
various problems :

- Central hashtable :

  This means scalability problems if many processes/threads want to use
  futexes at the same time.
  This means NUMA unbalance because this hashtable is located on one node.

- Using mmap_sem on every futex() syscall :

  Even if mmap_sem is a rw_semaphore, up_read()/down_read() are doing atomic
  ops on mmap_sem, dirtying cache line :
    - lot of cache line ping pongs on SMP configurations.

  mmap_sem is also extensively used by mm code (page faults, mmap()/munmap())
  Highly threaded processes might suffer from mmap_sem contention.

  mmap_sem is also used by oprofile code. Enabling oprofile hurts threaded
  programs because of contention on the mmap_sem cache line.

- Using an atomic_inc()/atomic_dec() on inode ref counter or mm ref counter:
  It's also a cache line ping pong on SMP. It also increases mmap_sem hold time
  because of cache misses.

Most of these scalability problems come from the fact that futexes are in
one global namespace.  As we use a central hash table, we must make sure
they are all using the same reference (given by the mm subsystem).  We
chose to force all futexes be 'shared'.  This has a cost.

But fact is POSIX defined PRIVATE and SHARED, allowing clear separation,
and optimal performance if carefuly implemented.  Time has come for linux
to have better threading performance.

The goal is to permit new futex commands to avoid :
 - Taking the mmap_sem semaphore, conflicting with other subsystems.
 - Modifying a ref_count on mm or an inode, still conflicting with mm or fs.

This is possible because, for one process using PTHREAD_PROCESS_PRIVATE
futexes, we only need to distinguish futexes by their virtual address, no
matter the underlying mm storage is.

If glibc wants to exploit this new infrastructure, it should use new
_PRIVATE futex subcommands for PTHREAD_PROCESS_PRIVATE futexes.  And be
prepared to fallback on old subcommands for old kernels.  Using one global
variable with the FUTEX_PRIVATE_FLAG or 0 value should be OK.

PTHREAD_PROCESS_SHARED futexes should still use the old subcommands.

Compatibility with old applications is preserved, they still hit the
scalability problems, but new applications can fly :)

Note : the same SHARED futex (mapped on a file) can be used by old binaries
*and* new binaries, because both binaries will use the old subcommands.

Note : Vast majority of futexes should be using PROCESS_PRIVATE semantic,
as this is the default semantic. Almost all applications should benefit
of this changes (new kernel and updated libc)

Some bench results on a Pentium M 1.6 GHz (SMP kernel on a UP machine)

/* calling futex_wait(addr, value) with value != *addr */
433 cycles per futex(FUTEX_WAIT) call (mixing 2 futexes)
424 cycles per futex(FUTEX_WAIT) call (using one futex)
334 cycles per futex(FUTEX_WAIT_PRIVATE) call (mixing 2 futexes)
334 cycles per futex(FUTEX_WAIT_PRIVATE) call (using one futex)
For reference :
187 cycles per getppid() call
188 cycles per umask() call
181 cycles per ni_syscall() call

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Pierre Peiffer <pierre.peiffer@bull.net>
Cc: "Ulrich Drepper" <drepper@gmail.com>
Cc: "Nick Piggin" <nickpiggin@yahoo.com.au>
Cc: "Ingo Molnar" <mingo@elte.hu>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:55 -07:00
Pierre Peiffer d0aa7a70bf futex_requeue_pi optimization
This patch provides the futex_requeue_pi functionality, which allows some
threads waiting on a normal futex to be requeued on the wait-queue of a
PI-futex.

This provides an optimization, already used for (normal) futexes, to be used
with the PI-futexes.

This optimization is currently used by the glibc in pthread_broadcast, when
using "normal" mutexes.  With futex_requeue_pi, it can be used with
PRIO_INHERIT mutexes too.

Signed-off-by: Pierre Peiffer <pierre.peiffer@bull.net>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Ulrich Drepper <drepper@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:55 -07:00
Pierre Peiffer c19384b5b2 Make futex_wait() use an hrtimer for timeout
This patch modifies futex_wait() to use an hrtimer + schedule() in place of
schedule_timeout().

schedule_timeout() is tick based, therefore the timeout granularity is the
tick (1 ms, 4 ms or 10 ms depending on HZ).  By using a high resolution timer
for timeout wakeup, we can attain a much finer timeout granularity (in the
microsecond range).  This parallels what is already done for futex_lock_pi().

The timeout passed to the syscall is no longer converted to jiffies and is
therefore passed to do_futex() and futex_wait() as an absolute ktime_t
therefore keeping nanosecond resolution.

Also this removes the need to pass the nanoseconds timeout part to
futex_lock_pi() in val2.

In futex_wait(), if there is no timeout then a regular schedule() is
performed.  Otherwise, an hrtimer is fired before schedule() is called.

[akpm@linux-foundation.org: fix `make headers_check']
Signed-off-by: Sebastien Dugue <sebastien.dugue@bull.net>
Signed-off-by: Pierre Peiffer <pierre.peiffer@bull.net>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Ulrich Drepper <drepper@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:55 -07:00
Pierre Peiffer ec92d08292 futex priority based wakeup
Today, all threads waiting for a given futex are woken in FIFO order (first
waiter woken first) instead of priority order.

This patch makes use of plist (pirotity ordered lists) instead of simple list
in futex_hash_bucket.

All non-RT threads are stored with priority MAX_RT_PRIO, causing them to be
woken last, in FIFO order (RT-threads are woken first, in priority order).

Signed-off-by: Sebastien Dugue <sebastien.dugue@bull.net>
Signed-off-by: Pierre Peiffer <pierre.peiffer@bull.net>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Ulrich Drepper <drepper@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:55 -07:00
Oleg Nesterov 6e84d644b5 make cancel_rearming_delayed_work() reliable
Thanks to Jarek Poplawski for the ideas and for spotting the bug in the
initial draft patch.

cancel_rearming_delayed_work() currently has many limitations, because it
requires that dwork always re-arms itself via queue_delayed_work().  So it
hangs forever if dwork doesn't do this, or cancel_rearming_delayed_work/
cancel_delayed_work was already called.  It uses flush_workqueue() in a
loop, so it can't be used if workqueue was freezed, and it is potentially
live- lockable on busy system if delay is small.

With this patch cancel_rearming_delayed_work() doesn't make any assumptions
about dwork, it can re-arm itself via queue_delayed_work(), or
queue_work(), or do nothing.

As a "side effect", cancel_work_sync() was changed to handle re-arming works
as well.

Disadvantages:

	- this patch adds wmb() to insert_work().

	- slowdowns the fast path (when del_timer() succeeds on entry) of
	  cancel_rearming_delayed_work(), because wait_on_work() is called
	  unconditionally. In that case, compared to the old version, we are
	  doing "unneeded" lock/unlock for each online CPU.

	  On the other hand, this means we don't need to use cancel_work_sync()
	  after cancel_rearming_delayed_work().

	- complicates the code (.text grows by 130 bytes).

[akpm@linux-foundation.org: fix speling]
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: David Chinner <dgc@sgi.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Gautham Shenoy <ego@in.ibm.com>
Acked-by: Jarek Poplawski <jarkao2@o2.pl>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:53 -07:00
Gautham R Shenoy 7b0834c26f Remove kthread_bind() call from _cpu_down()
We are anyway kthread_stop()ping other per-cpu kernel threads after
move_task_off_dead_cpu(), so we can do it with the stop_machine_run thread
as well.

I just checked with Vatsa if there was any subtle reason why they
had put in the kthread_bind() in cpu.c. Vatsa cannot seem to recollect
any and I can't see any. So let us just remove the kthread_bind.

Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:53 -07:00
Oleg Nesterov 10ab825bde change kernel threads to ignore signals instead of blocking them
Currently kernel threads use sigprocmask(SIG_BLOCK) to protect against
signals.  This doesn't prevent the signal delivery, this only blocks
signal_wake_up().  Every "killall -33 kthreadd" means a "struct siginfo"
leak.

Change kthreadd_setup() to set all handlers to SIG_IGN instead of blocking
them (make a new helper ignore_signals() for that).  If the kernel thread
needs some signal, it should use allow_signal() anyway, and in that case it
should not use CLONE_SIGHAND.

Note that we can't change daemonize() (should die!) in the same way,
because it can be used along with CLONE_SIGHAND.  This means that
allow_signal() still should unblock the signal to work correctly with
daemonize()ed threads.

However, disallow_signal() doesn't block the signal any longer but ignores
it.

NOTE: with or without this patch the kernel threads are not protected from
handle_stop_signal(), this seems harmless, but not good.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:53 -07:00
Oleg Nesterov 5de18d1697 worker_thread: don't play with SIGCHLD and numa policy
worker_thread() inherits ignored SIGCHLD and numa_default_policy() from its
parent, kthreadd.  No need to setup this again.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:53 -07:00
Oleg Nesterov 90cce03d9b wait_for_helper: remove unneeded do_sigaction()
allow_signal(SIGCHLD) does all necessary job, no need to call do_sigaction()
prior to.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:53 -07:00
Eric W. Biederman 49d769d52e Change reparent_to_init to reparent_to_kthreadd
When a kernel thread calls daemonize, instead of reparenting the thread to
init reparent the thread to kthreadd next to the threads created by
kthread_create.

This is really just a stop gap until daemonize goes away, but it does
ensure no kernel threads are under init and they are all in one place that
is easy to find.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:53 -07:00
Eric W. Biederman 73c279927f kthread: don't depend on work queues
Currently there is a circular reference between work queue initialization
and kthread initialization.  This prevents the kthread infrastructure from
initializing until after work queues have been initialized.

We want the properties of tasks created with kthread_create to be as close
as possible to the init_task and to not be contaminated by user processes.
The later we start our kthreadd that creates these tasks the harder it is
to avoid contamination from user processes and the more of a mess we have
to clean up because the defaults have changed on us.

So this patch modifies the kthread support to not use work queues but to
instead use a simple list of structures, and to have kthreadd start from
init_task immediately after our kernel thread that execs /sbin/init.

By being a true child of init_task we only have to change those process
settings that we want to have different from init_task, such as our process
name, the cpus that are allowed, blocking all signals and setting SIGCHLD
to SIG_IGN so that all of our children are reaped automatically.

By being a true child of init_task we also naturally get our ppid set to 0
and do not wind up as a child of PID == 1.  Ensuring that tasks generated
by kthread_create will not slow down the functioning of the wait family of
functions.

[akpm@linux-foundation.org: use interruptible sleeps]
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:53 -07:00
Oleg Nesterov c93465181f ____call_usermodehelper: don't flush_signals()
____call_usermodehelper() has no reason for flush_signals().  It is a fresh
forked process which is going to exec a user-space application or exit on
failure.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:53 -07:00
Oleg Nesterov 28e53bddf8 unify flush_work/flush_work_keventd and rename it to cancel_work_sync
flush_work(wq, work) doesn't need the first parameter, we can use cwq->wq
(this was possible from the very beginnig, I missed this).  So we can unify
flush_work_keventd and flush_work.

Also, rename flush_work() to cancel_work_sync() and fix all callers.
Perhaps this is not the best name, but "flush_work" is really bad.

(akpm: this is why the earlier patches bypassed maintainers)

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Jeff Garzik <jeff@garzik.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Tejun Heo <htejun@gmail.com>
Cc: Auke Kok <auke-jan.h.kok@intel.com>,
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:53 -07:00
Oleg Nesterov a4798833d2 zap_other_threads: remove unneeded ->exit_signal change
We already depend on fact that all sub-threads have ->exit_signal == -1, no
need to set it in zap_other_threads().

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:53 -07:00
Oleg Nesterov 85f4186af9 worker_thread: fix racy try_to_freeze() usage
worker_thread() can miss freeze_process()->signal_wake_up() if it happens
between try_to_freeze() and prepare_to_wait().  We should check freezing()
before entering schedule().

This race was introduced by me in

	[PATCH 1/1] workqueue: don't migrate pending works from the dead CPU

Looks like mm/vmscan.c:kswapd() has the same race.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:53 -07:00
Oleg Nesterov b9aac8e0d3 worker_thread: don't play with signals
worker_thread() doesn't need to "Block and flush all signals", this was
already done by its caller, kthread().

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:53 -07:00
Oleg Nesterov 23b2e5991a workqueue: kill NOAUTOREL works
We don't have any users, and it is not so trivial to use NOAUTOREL works
correctly.  It is better to simplify API.

Delete NOAUTOREL support and rename work_release to work_clear_pending to
avoid a confusion.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:52 -07:00
Oleg Nesterov 1634c48f8b make cancel_rearming_delayed_work() work on any workqueue, not just keventd_wq
cancel_rearming_delayed_workqueue(wq, dwork) doesn't need the first
parameter.  We don't hang on un-queued dwork any longer, and work->data
doesn't change its type.  This means we can always figure out "wq" from
dwork when it is needed.

Remove this parameter, and rename the function to
cancel_rearming_delayed_work().  Re-create an inline "obsolete"
cancel_rearming_delayed_workqueue(wq) which just calls
cancel_rearming_delayed_work().

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:52 -07:00
Oleg Nesterov a848e3b67c workqueue: introduce wq_per_cpu() helper
Cleanup.  A number of per_cpu_ptr(wq->cpu_wq, cpu) users have to check that
cpu is valid for this wq.  Make a simple helper.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:52 -07:00
Oleg Nesterov 63bc036252 unify queue_delayed_work() and queue_delayed_work_on()
Change queue_delayed_work() to use queue_delayed_work_on() to avoid the code
duplication (saves 133 bytes).

Q: queue_delayed_work() enqueues &dwork->work directly when delay == 0, why?

[jirislaby@gmail.com: oops fix]
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:52 -07:00
Oleg Nesterov ed7c0feede make queue_delayed_work() friendly to flush_fork()
Currently typeof(delayed_work->work.data) is

	"struct workqueue_struct" when the timer is pending

	"struct cpu_workqueue_struct" whe the work is queued

This makes impossible to use flush_fork(delayed_work->work) in addition
to cancel_delayed_work/cancel_rearming_delayed_work, not good.

Change queue_delayed_work/delayed_work_timer_fn to use cwq, not wq. This
complicates (and uglifies) these functions a little bit, but alows us to
use flush_fork(dwork) and imho makes the whole code more consistent.

Also, document the fact that cancel_rearming_delayed_work() doesn't garantee
the completion of work->func() upon return.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:52 -07:00
Oleg Nesterov 06ba38a9a0 workqueues: shift kthread_bind() from CPU_UP_PREPARE to CPU_ONLINE
CPU_UP_PREPARE binds cwq->thread to the new CPU.  So CPU_UP_CANCELED tries to
wake up the task which is bound to the failed CPU.

With this patch we don't bind cwq->thread until CPU becomes online.  The first
wake_up() after kthread_create() is a bit special, make a simple helper for
that.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:52 -07:00
Oleg Nesterov c12920d190 workqueue: make init_workqueues() __init
The only caller of init_workqueues() is do_basic_setup().

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:52 -07:00
Oleg Nesterov cce1a1656c workqueue: introduce workqueue_struct->singlethread
Add explicit workqueue_struct->singlethread flag.  This lessens .text a
little, but most importantly this allows us to manipulate wq->list without
changine the meaning of is_single_threaded().

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:52 -07:00
Oleg Nesterov b1f4ec172f workqueue: introduce cpu_singlethread_map
The code like

	if (is_single_threaded(wq))
		do_something(singlethread_cpu);
	else {
		for_each_cpu_mask(cpu, cpu_populated_map)
			do_something(cpu);
	}

looks very annoying. We can add "static cpumask_t cpu_singlethread_map" and
simplify the code. Lessens .text a bit, and imho makes the code more readable.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:52 -07:00