Commit Graph

1355 Commits

Author SHA1 Message Date
Ingo Molnar 48f24c4da1 [PATCH] sched: clean up fallout of recent changes
Clean up some of the impact of recent (and not so recent) scheduler
changes:

 - turning macros into nice inline functions
 - sanitizing and unifying variable definitions
 - whitespace, style consistency, 80-lines, comment correctness, spelling
   and curly braces police

Due to the macro hell and variable placement simplifications there's even 26
bytes of .text saved:

   text    data     bss     dec     hex filename
  25510    4153     192   29855    749f sched.o.before
  25484    4153     192   29829    7485 sched.o.after

[akpm@osdl.org: build fix]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:10 -07:00
Paul Mackerras 829035fd70 [PATCH] lockdep: irqtrace subsystem, move account_system_vtime() calls into kernel/softirq.c
At the moment, powerpc and s390 have their own versions of do_softirq which
include local_bh_disable() and __local_bh_enable() calls.  They end up
calling __do_softirq (in kernel/softirq.c) which also does
local_bh_disable/enable.

Apparently the two levels of disable/enable trigger a warning from some
validation code that Ingo is working on, and he would like to see the outer
level removed.  But to do that, we have to move the account_system_vtime
calls that are currently in the arch do_softirq() implementations for
powerpc and s390 into the generic __do_softirq() (this is a no-op for other
archs because account_system_vtime is defined to be an empty inline
function on all other archs).  This patch does that.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:10 -07:00
Ingo Molnar 60be6b9a41 [PATCH] lockdep: annotate on-stack completions
lockdep needs to have the waitqueue lock initialized for on-stack waitqueues
implicitly initialized by DECLARE_COMPLETION().  Annotate on-stack completions
accordingly.

Has no effect on non-lockdep kernels.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:09 -07:00
Ingo Molnar 366c7f554e [PATCH] lockdep: annotate enable_in_hardirq()
Make use of local_irq_enable_in_hardirq() API to annotate places that enable
hardirqs in hardirq context.

Has no effect on non-lockdep kernels.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:09 -07:00
Ingo Molnar ad33945175 [PATCH] lockdep: annotate ->mmap_sem
Teach special (recursive) locking code to the lock validator.  Has no effect
on non-lockdep kernels.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:08 -07:00
Ingo Molnar 5436552448 [PATCH] lockdep: annotate hrtimer base locks
Teach special (recursive) locking code to the lock validator.  Has no effect
on non-lockdep kernels.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:07 -07:00
Ingo Molnar fcb993712f [PATCH] lockdep: annotate scheduler runqueue locks
Teach per-CPU runqueue locks and recursive locking code to the lock validator.
 Has no effect on non-lockdep kernels.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:07 -07:00
Ingo Molnar d730e882a1 [PATCH] lockdep: annotate timer base locks
Split the per-CPU timer base locks up into separate lock classes, because they
are used recursively.

Has no effect on non-lockdep kernels.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:07 -07:00
Ingo Molnar eb4542b98c [PATCH] lockdep: annotate waitqueues
Create one lock class for all waitqueue locks in the kernel.  Has no effect on
non-lockdep kernels.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:07 -07:00
Ingo Molnar 243c7621aa [PATCH] lockdep: annotate genirq
Teach special (recursive) locking code to the lock validator.  Has no effect
on non-lockdep kernels.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:06 -07:00
Ingo Molnar 8b8f319fc7 [PATCH] lockdep: annotate futex
Teach special (recursive) locking code to the lock validator.  Introduces
double_lock_hb() to unify double- hash-bucket-lock taking.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:06 -07:00
Ingo Molnar a0f1ccfd8d [PATCH] lockdep: do not recurse in printk
Make printk()-ing from within the lock validation code safer by using the
lockdep-recursion counter.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:05 -07:00
Ingo Molnar ef5d4707b9 [PATCH] lockdep: prove mutex locking correctness
Use the lock validator framework to prove mutex locking correctness.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:04 -07:00
Ingo Molnar 8a25d5debf [PATCH] lockdep: prove spinlock rwlock locking correctness
Use the lock validator framework to prove spinlock and rwlock locking
correctness.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:04 -07:00
Ingo Molnar 4ea2176dfa [PATCH] lockdep: prove rwsem locking correctness
Use the lock validator framework to prove rwsem locking correctness.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:04 -07:00
Ingo Molnar a8f24a3978 [PATCH] lockdep: procfs
Lock validator /proc/lockdep and /proc/lockdep_stats support.
(FIXME: should go into debugfs)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:04 -07:00
Ingo Molnar 6c9076ec9c [PATCH] lockdep: allow read_lock() recursion of same class
From: Ingo Molnar <mingo@elte.hu>

lockdep so far only allowed read-recursion for the same lock instance.
This is enough in the overwhelming majority of cases, but a hostap case
triggered and reported by Miles Lane relies on same-class
different-instance recursion.  So we relax the restriction on read-lock
recursion.

(This change does not allow rwsem read-recursion, which is still
forbidden.)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:04 -07:00
Ingo Molnar fbb9ce9530 [PATCH] lockdep: core
Do 'make oldconfig' and accept all the defaults for new config options -
reboot into the kernel and if everything goes well it should boot up fine and
you should have /proc/lockdep and /proc/lockdep_stats files.

Typically if the lock validator finds some problem it will print out
voluminous debug output that begins with "BUG: ..." and which syslog output
can be used by kernel developers to figure out the precise locking scenario.

What does the lock validator do?  It "observes" and maps all locking rules as
they occur dynamically (as triggered by the kernel's natural use of spinlocks,
rwlocks, mutexes and rwsems).  Whenever the lock validator subsystem detects a
new locking scenario, it validates this new rule against the existing set of
rules.  If this new rule is consistent with the existing set of rules then the
new rule is added transparently and the kernel continues as normal.  If the
new rule could create a deadlock scenario then this condition is printed out.

When determining validity of locking, all possible "deadlock scenarios" are
considered: assuming arbitrary number of CPUs, arbitrary irq context and task
context constellations, running arbitrary combinations of all the existing
locking scenarios.  In a typical system this means millions of separate
scenarios.  This is why we call it a "locking correctness" validator - for all
rules that are observed the lock validator proves it with mathematical
certainty that a deadlock could not occur (assuming that the lock validator
implementation itself is correct and its internal data structures are not
corrupted by some other kernel subsystem).  [see more details and conditionals
of this statement in include/linux/lockdep.h and
Documentation/lockdep-design.txt]

Furthermore, this "all possible scenarios" property of the validator also
enables the finding of complex, highly unlikely multi-CPU multi-context races
via single single-context rules, increasing the likelyhood of finding bugs
drastically.  In practical terms: the lock validator already found a bug in
the upstream kernel that could only occur on systems with 3 or more CPUs, and
which needed 3 very unlikely code sequences to occur at once on the 3 CPUs.
That bug was found and reported on a single-CPU system (!).  So in essence a
race will be found "piecemail-wise", triggering all the necessary components
for the race, without having to reproduce the race scenario itself!  In its
short existence the lock validator found and reported many bugs before they
actually caused a real deadlock.

To further increase the efficiency of the validator, the mapping is not per
"lock instance", but per "lock-class".  For example, all struct inode objects
in the kernel have inode->inotify_mutex.  If there are 10,000 inodes cached,
then there are 10,000 lock objects.  But ->inotify_mutex is a single "lock
type", and all locking activities that occur against ->inotify_mutex are
"unified" into this single lock-class.  The advantage of the lock-class
approach is that all historical ->inotify_mutex uses are mapped into a single
(and as narrow as possible) set of locking rules - regardless of how many
different tasks or inode structures it took to build this set of rules.  The
set of rules persist during the lifetime of the kernel.

To see the rough magnitude of checking that the lock validator does, here's a
portion of /proc/lockdep_stats, fresh after bootup:

 lock-classes:                            694 [max: 2048]
 direct dependencies:                  1598 [max: 8192]
 indirect dependencies:               17896
 all direct dependencies:             16206
 dependency chains:                    1910 [max: 8192]
 in-hardirq chains:                      17
 in-softirq chains:                     105
 in-process chains:                    1065
 stack-trace entries:                 38761 [max: 131072]
 combined max dependencies:         2033928
 hardirq-safe locks:                     24
 hardirq-unsafe locks:                  176
 softirq-safe locks:                     53
 softirq-unsafe locks:                  137
 irq-safe locks:                         59
 irq-unsafe locks:                      176

The lock validator has observed 1598 actual single-thread locking patterns,
and has validated all possible 2033928 distinct locking scenarios.

More details about the design of the lock validator can be found in
Documentation/lockdep-design.txt, which can also found at:

   http://redhat.com/~mingo/lockdep-patches/lockdep-design.txt

[bunk@stusta.de: cleanups]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:03 -07:00
Ingo Molnar de30a2b355 [PATCH] lockdep: irqtrace subsystem, core
Accurate hard-IRQ-flags and softirq-flags state tracing.

This allows us to attach extra functionality to IRQ flags on/off
events (such as trace-on/off).

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:03 -07:00
Ingo Molnar 8637c09901 [PATCH] lockdep: stacktrace subsystem, core
Framework to generate and save stacktraces quickly, without printing anything
to the console.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:02 -07:00
Ingo Molnar e4d9191885 [PATCH] lockdep: locking init debugging improvement
Locking init improvement:

 - introduce and use __SPIN_LOCK_UNLOCKED for array initializations,
   to pass in the name string of locks, used by debugging

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:02 -07:00
Ingo Molnar 9cebb55268 [PATCH] lockdep: mutex section binutils workaround
Work around weird section nesting build bug causing smp-alternatives failures
under certain circumstances.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:01 -07:00
Ingo Molnar 9a11b49a80 [PATCH] lockdep: better lock debugging
Generic lock debugging:

 - generalized lock debugging framework. For example, a bug in one lock
   subsystem turns off debugging in all lock subsystems.

 - got rid of the caller address passing (__IP__/__IP_DECL__/etc.) from
   the mutex/rtmutex debugging code: it caused way too much prototype
   hackery, and lockdep will give the same information anyway.

 - ability to do silent tests

 - check lock freeing in vfree too.

 - more finegrained debugging options, to allow distributions to
   turn off more expensive debugging features.

There's no separate 'held mutexes' list anymore - but there's a 'held locks'
stack within lockdep, which unifies deadlock detection across all lock
classes.  (this is independent of the lockdep validation stuff - lockdep first
checks whether we are holding a lock already)

Here are the current debugging options:

CONFIG_DEBUG_MUTEXES=y
CONFIG_DEBUG_LOCK_ALLOC=y

which do:

 config DEBUG_MUTEXES
          bool "Mutex debugging, basic checks"

 config DEBUG_LOCK_ALLOC
         bool "Detect incorrect freeing of live mutexes"

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:01 -07:00
Ingo Molnar fb7e42413a [PATCH] lockdep: remove mutex deadlock checking code
With the lock validator we detect mutex deadlocks (and more), the mutex
deadlock checking code is both redundant and slower.  So remove it.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:01 -07:00
Ingo Molnar 36596243da [PATCH] lockdep: remove DEBUG_BUG_ON()
cleanup: remove unused DEBUG_BUG_ON() defines.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:01 -07:00
Ingo Molnar 9e7f4d451e [PATCH] lockdep: rename DEBUG_WARN_ON()
Rename DEBUG_WARN_ON() to the less generic DEBUG_LOCKS_WARN_ON() name, so that
it's clear that this is a lock-debugging internal mechanism.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:01 -07:00
Ingo Molnar c4e05116a2 [PATCH] lockdep: clean up rwsems
Clean up rwsems.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:01 -07:00
Ingo Molnar 4d435f9d8f [PATCH] lockdep: add is_module_address()
Add is_module_address() method - to be used by lockdep.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:00 -07:00
Christoph Lameter 9614634fe6 [PATCH] ZVC/zone_reclaim: Leave 1% of unmapped pagecache pages for file I/O
It turns out that it is advantageous to leave a small portion of unmapped file
backed pages if all of a zone's pages (or almost all pages) are allocated and
so the page allocator has to go off-node.

This allows recently used file I/O buffers to stay on the node and
reduces the times that zone reclaim is invoked if file I/O occurs
when we run out of memory in a zone.

The problem is that zone reclaim runs too frequently when the page cache is
used for file I/O (read write and therefore unmapped pages!) alone and we have
almost all pages of the zone allocated.  Zone reclaim may remove 32 unmapped
pages.  File I/O will use these pages for the next read/write requests and the
unmapped pages increase.  After the zone has filled up again zone reclaim will
remove it again after only 32 pages.  This cycle is too inefficient and there
are potentially too many zone reclaim cycles.

With the 1% boundary we may still remove all unmapped pages for file I/O in
zone reclaim pass.  However.  it will take a large number of read and writes
to get back to 1% again where we trigger zone reclaim again.

The zone reclaim 2.6.16/17 does not show this behavior because we have a 30
second timeout.

[akpm@osdl.org: rename the /proc file and the variable]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:26:59 -07:00
Benjamin Herrenschmidt 5a43a066b1 [PATCH] genirq: Allow fasteoi handler to retrigger disabled interrupts
Make the fasteoi handler mark disabled interrupts as pending if they
happen anyway. This allow implementation of a delayed disable scheme
with the fasteoi handler.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2006-07-03 19:54:59 +10:00
Thomas Gleixner 284c66806e [PATCH] genirq:fixup missing SA_PERCPU replacement
The irqflags consolidation converted SA_PERCPU_IRQ to IRQF_PERCPU but
did not define the new constant.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-02 17:29:22 -07:00
Thomas Gleixner d061daa0e3 [PATCH] genirq: ARM dyntick cleanup
Linus: "The hacks in kernel/irq/handle.c are really horrid. REALLY
horrid."

They are indeed. Move the dyntick quirks to ARM where they belong.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-02 17:29:21 -07:00
Linus Torvalds b4b9034132 Merge branch 'genirq' of master.kernel.org:/home/rmk/linux-2.6-arm
* 'genirq' of master.kernel.org:/home/rmk/linux-2.6-arm: (24 commits)
  [ARM] 3683/2:  ARM: Convert at91rm9200 to generic irq handling
  [ARM] 3682/2:  ARM: Convert ixp4xx to generic irq handling
  [ARM] 3702/1: ARM: Convert ixp23xx to generic irq handling
  [ARM] 3701/1: ARM: Convert plat-omap to generic irq handling
  [ARM] 3700/1: ARM: Convert lh7a40x to generic irq handling
  [ARM] 3699/1: ARM: Convert s3c2410 to generic irq handling
  [ARM] 3698/1: ARM: Convert sa1100 to generic irq handling
  [ARM] 3697/1: ARM: Convert shark to generic irq handling
  [ARM] 3696/1: ARM: Convert clps711x to generic irq handling
  [ARM] 3694/1: ARM: Convert ecard driver to generic irq handling
  [ARM] 3693/1: ARM: Convert omap1 to generic irq handling
  [ARM] 3691/1: ARM: Convert imx to generic irq handling
  [ARM] 3688/1: ARM: Convert clps7500 to generic irq handling
  [ARM] 3687/1: ARM: Convert integrator to generic irq handling
  [ARM] 3685/1: ARM: Convert pxa to generic irq handling
  [ARM] 3684/1: ARM: Convert l7200 to generic irq handling
  [ARM] 3681/1: ARM: Convert ixp2000 to generic irq handling
  [ARM] 3680/1: ARM: Convert footbridge to generic irq handling
  [ARM] 3695/1: ARM drivers/pcmcia: Fixup includes
  [ARM] 3689/1: ARM drivers/input/touchscreen: Fixup includes
  ...

Manual conflict resolved in kernel/irq/handle.c (butt-ugly ARM tickless
code).
2006-07-02 15:07:45 -07:00
Thomas Gleixner 3cca53b02a [PATCH] irq-flags: generic irq: Use the new IRQF_ constants
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-02 13:58:49 -07:00
Thomas Gleixner f8b5473fcb [ARM] 3690/1: genirq: Introduce and make use of dummy irq chip
Patch from Thomas Gleixner

From: Thomas Gleixner <tglx@linutronix.de>

ARM has a couple of really dumb interrupt controllers.
Implement a generic one and fixup the ARM migration. ARM reused
the no_irq_chip for this purpose, but this does not work out
for platforms which are not converted to the new interrupt
type handling model.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2006-07-01 22:30:08 +01:00
Thomas Gleixner a2166abd06 [ARM] 3679/1: ARM: Make ARM dyntick implementation work with genirq
Patch from Thomas Gleixner

From: Thomas Gleixner <tglx@linutronix.de>

Make the ARM dyntick implementation work with the generic
irq code. This hopefully goes away once we consolidated the
dyntick implementations.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2006-07-01 22:30:07 +01:00
Linus Torvalds fc25465f09 Merge branch 'audit.b22' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current
* 'audit.b22' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current:
  [PATCH] audit syscall classes
  [PATCH] audit: support for object context filters
  [PATCH] audit: rename AUDIT_SE_* constants
  [PATCH] add rule filterkey
2006-07-01 09:59:08 -07:00
Bjorn Helgaas e8c4b9d003 [PATCH] IRQ: warning message cleanup
Make warnings more consistent.

Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-01 09:55:57 -07:00
Bjorn Helgaas 17311c03c3 [PATCH] IRQ: Use SA_PERCPU_IRQ, not IRQ_PER_CPU, for irqaction.flags
IRQ_PER_CPU is a bit in the struct irq_desc "status" field, not in the
struct irqaction "flags", so the previous code checked the wrong bit.

SA_PERCPU_IRQ is only used by drivers/char/mmtimer.c for SGI ia64 boxes.

Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-01 09:55:57 -07:00
Ingo Molnar ed6f7b10e6 [PATCH] pi-futex: futex_wake() lockup fix
Fix futex_wake() exit condition bug when handling the robust-list with PI
futexes on them.

(reported by Ulrich Drepper, debugged by the lock validator.)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Ulrich Drepper <drepper@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-01 09:55:57 -07:00
Vernon Mauery a99e4e413e [PATCH] pi-futex: fix mm_struct memory leak
lock_queue was getting called essentially twice in a row and was
continually incrementing the mm_count ref count, thus causing a memory
leak.

Dinakar Guniguntala provided a proper fix for the problem that simply grabs
the spinlock for the hash bucket queue rather than calling lock_queue.

The second time we do a queue_lock in futex_lock_pi, we really only need to
take the hash bucket lock.

Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com>
Signed-off-by: Vernon Mauery <vernux@us.ibm.com>
Acked-by: Paul E. McKenney <paulmck@us.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-01 09:55:57 -07:00
Al Viro b915543b46 [PATCH] audit syscall classes
Allow to tie upper bits of syscall bitmap in audit rules to kernel-defined
sets of syscalls.  Infrastructure, a couple of classes (with 32bit counterparts
for biarch targets) and actual tie-in on i386, amd64 and ia64.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-07-01 07:44:10 -04:00
Darrel Goeddel 6e5a2d1d32 [PATCH] audit: support for object context filters
This patch introduces object audit filters based on the elements
of the SELinux context.

Signed-off-by: Darrel Goeddel <dgoeddel@trustedcs.com>
Acked-by:  Stephen Smalley <sds@tycho.nsa.gov>

 kernel/auditfilter.c           |   25 +++++++++++++++++++++++++
 kernel/auditsc.c               |   40 ++++++++++++++++++++++++++++++++++++++++
 security/selinux/ss/services.c |   18 +++++++++++++++++-
 3 files changed, 82 insertions(+), 1 deletion(-)
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-07-01 05:44:19 -04:00
Darrel Goeddel 3a6b9f85c6 [PATCH] audit: rename AUDIT_SE_* constants
This patch renames some audit constant definitions and adds
additional definitions used by the following patch.  The renaming
avoids ambiguity with respect to the new definitions.

Signed-off-by: Darrel Goeddel <dgoeddel@trustedcs.com>

 include/linux/audit.h          |   15 ++++++++----
 kernel/auditfilter.c           |   50 ++++++++++++++++++++---------------------
 kernel/auditsc.c               |   10 ++++----
 security/selinux/ss/services.c |   32 +++++++++++++-------------
 4 files changed, 56 insertions(+), 51 deletions(-)
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-07-01 05:44:08 -04:00
Amy Griffis 5adc8a6adc [PATCH] add rule filterkey
Add support for a rule key, which can be used to tie audit records to audit
rules.  This is useful when a watched file is accessed through a link or
symlink, as well as for general audit log analysis.

Because this patch uses a string key instead of an integer key, there is a bit
of extra overhead to do the kstrdup() when a rule fires.  However, we're also
allocating memory for the audit record buffer, so it's probably not that
significant.  I went ahead with a string key because it seems more
user-friendly.

Note that the user must ensure that filterkeys are unique.  The kernel only
checks for duplicate rules.

Signed-off-by: Amy Griffis <amy.griffis@hpd.com>
2006-07-01 05:43:06 -04:00
Linus Torvalds 22a3e233ca Merge git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial
* git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial:
  Remove obsolete #include <linux/config.h>
  remove obsolete swsusp_encrypt
  arch/arm26/Kconfig typos
  Documentation/IPMI typos
  Kconfig: Typos in net/sched/Kconfig
  v9fs: do not include linux/version.h
  Documentation/DocBook/mtdnand.tmpl: typo fixes
  typo fixes: specfic -> specific
  typo fixes in Documentation/networking/pktgen.txt
  typo fixes: occuring -> occurring
  typo fixes: infomation -> information
  typo fixes: disadvantadge -> disadvantage
  typo fixes: aquire -> acquire
  typo fixes: mecanism -> mechanism
  typo fixes: bandwith -> bandwidth
  fix a typo in the RTC_CLASS help text
  smb is no longer maintained

Manually merged trivial conflict in arch/um/kernel/vmlinux.lds.S
2006-06-30 15:39:30 -07:00
Andrew Morton e7b384043e [PATCH] cond_resched() fix
Fix a bug identified by Zou Nan hai <nanhai.zou@intel.com>:

If the system is in state SYSTEM_BOOTING, and need_resched() is true,
cond_resched() returns true even though it didn't reschedule.  Consequently
need_resched() remains true and JBD locks up.

Fix that by teaching cond_resched() to only return true if it really did call
schedule().

cond_resched_lock() and cond_resched_softirq() have a problem too.  If we're
in SYSTEM_BOOTING state and need_resched() is true, these functions will drop
the lock and will then try to call schedule(), but the SYSTEM_BOOTING state
will prevent schedule() from being called.  So on return, need_resched() will
still be true, but cond_resched_lock() has to return 1 to tell the caller that
the lock was dropped.  The caller will probably lock up.

Bottom line: if these functions dropped the lock, they _must_ call schedule()
to clear need_resched().   Make it so.

Also, uninline __cond_resched().  It's largeish, and slowpath.

Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-30 11:25:38 -07:00
David Quigley 8f95dc58d0 [PATCH] SELinux: add security hook call to kill_proc_info_as_uid
This patch adds a call to the extended security_task_kill hook introduced by
the prior patch to the kill_proc_info_as_uid function so that these signals
can be properly mediated by security modules.  It also updates the existing
hook call in check_kill_permission.

Signed-off-by: David Quigley <dpquigl@tycho.nsa.gov>
Signed-off-by: James Morris <jmorris@namei.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-30 11:25:37 -07:00
Christoph Lameter 34aa1330f9 [PATCH] zoned vm counters: zone_reclaim: remove /proc/sys/vm/zone_reclaim_interval
The zone_reclaim_interval was necessary because we were not able to determine
how many unmapped pages exist in a zone.  Therefore we had to scan in
intervals to figure out if any pages were unmapped.

With the zoned counters and NR_ANON_PAGES we now know the number of pagecache
pages and the number of mapped pages in a zone.  So we can simply skip the
reclaim if there is an insufficient number of unmapped pages.  We use
SWAP_CLUSTER_MAX as the boundary.

Drop all support for /proc/sys/vm/zone_reclaim_interval.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-30 11:25:35 -07:00
Jörn Engel 6ab3d5624e Remove obsolete #include <linux/config.h>
Signed-off-by: Jörn Engel <joern@wohnheim.fh-wedel.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-06-30 19:25:36 +02:00
Pavel Machek e02169b682 remove obsolete swsusp_encrypt
Remove SWSUSP_ENCRYPT config option; it is no longer implemented.

Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-06-30 18:59:59 +02:00
Adrian Bunk 80f7228b59 typo fixes: occuring -> occurring
Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-06-30 18:27:16 +02:00
Dave Jones ae90dd5dbe Move workqueue exports to where the functions are defined.
Signed-off-by: Dave Jones <davej@redhat.com>
2006-06-30 01:40:45 -04:00
Venkatesh Pallipadi 7a6bc1cdd5 [CPUFREQ] Add queue_delayed_work_on() interface for workqueues.
Add queue_delayed_work_on() interface for workqueues.

Signed-off-by: Alexey Starikovskiy <alexey.y.starikovskiy@intel.com>
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Dave Jones <davej@redhat.com>
2006-06-30 01:33:31 -04:00
Darrel Goeddel c7bdb545d2 [NETLINK]: Encapsulate eff_cap usage within security framework.
This patch encapsulates the usage of eff_cap (in netlink_skb_params) within
the security framework by extending security_netlink_recv to include a required
capability parameter and converting all direct usage of eff_caps outside
of the lsm modules to use the interface.  It also updates the SELinux
implementation of the security_netlink_send and security_netlink_recv
hooks to take advantage of the sid in the netlink_skb_params struct.
This also enables SELinux to perform auditing of netlink capability checks.
Please apply, for 2.6.18 if possible.

Signed-off-by: Darrel Goeddel <dgoeddel@trustedcs.com>
Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov>
Acked-by:  James Morris <jmorris@namei.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-29 16:57:55 -07:00
Linus Torvalds 3aa590c6b7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc
* git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc: (43 commits)
  [POWERPC] Use little-endian bit from firmware ibm,pa-features property
  [POWERPC] Make sure smp_processor_id works very early in boot
  [POWERPC] U4 DART improvements
  [POWERPC] todc: add support for Time-Of-Day-Clock
  [POWERPC] Make lparcfg.c work when both iseries and pseries are selected
  [POWERPC] Fix idr locking in init_new_context
  [POWERPC] mpc7448hpc2 (taiga) board config file
  [POWERPC] Add tsi108 pci and platform device data register function
  [POWERPC] Add general support for mpc7448hpc2 (Taiga) platform
  [POWERPC] Correct the MAX_CONTEXT definition
  powerpc: minor cleanups for mpc86xx
  [POWERPC] Make sure we select CONFIG_NEW_LEDS if ADB_PMU_LED is set
  [POWERPC] Simplify the code defining the 64-bit CPU features
  [POWERPC] powerpc: kconfig warning fix
  [POWERPC] Consolidate some of kernel/misc*.S
  [POWERPC] Remove unused function call_with_mmu_off
  [POWERPC] update asm-powerpc/time.h
  [POWERPC] Clean up it_lp_queue.h
  [POWERPC] Skip the "copy down" of the kernel if it is already at zero.
  [POWERPC] Add the use of the firmware soft-reset-nmi to kdump.
  ...
2006-06-29 11:32:34 -07:00
Linus Torvalds 1903ac54f8 Merge master.kernel.org:/pub/scm/linux/kernel/git/gregkh/pci-2.6
* master.kernel.org:/pub/scm/linux/kernel/git/gregkh/pci-2.6:
  [PATCH] i386: export memory more than 4G through /proc/iomem
  [PATCH] 64bit Resource: finally enable 64bit resource sizes
  [PATCH] 64bit Resource: convert a few remaining drivers to use resource_size_t where needed
  [PATCH] 64bit resource: change pnp core to use resource_size_t
  [PATCH] 64bit resource: change pci core and arch code to use resource_size_t
  [PATCH] 64bit resource: change resource core to use resource_size_t
  [PATCH] 64bit resource: introduce resource_size_t for the start and end of struct resource
  [PATCH] 64bit resource: fix up printks for resources in misc drivers
  [PATCH] 64bit resource: fix up printks for resources in arch and core code
  [PATCH] 64bit resource: fix up printks for resources in pcmcia drivers
  [PATCH] 64bit resource: fix up printks for resources in video drivers
  [PATCH] 64bit resource: fix up printks for resources in ide drivers
  [PATCH] 64bit resource: fix up printks for resources in mtd drivers
  [PATCH] 64bit resource: fix up printks for resources in pci core and hotplug drivers
  [PATCH] 64bit resource: fix up printks for resources in networks drivers
  [PATCH] 64bit resource: fix up printks for resources in sound drivers
  [PATCH] 64bit resource: C99 changes for struct resource declarations

Fixed up trivial conflict in drivers/ide/pci/cmd64x.c (the printk that
was changed by the 64-bit resources had been deleted in the meantime ;)
2006-06-29 10:49:17 -07:00
Ingo Molnar 47c2a3aa44 [PATCH] genirq: add chip->eoi(), fastack -> fasteoi
Clean up the fastack concept by turning it into fasteoi and introducing the
->eoi() method for chips.

This also allows the cleanup of an i386 EOI quirk - now the quirk is
cleanly separated from the pure ACK implementation.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:26 -07:00
Benjamin Herrenschmidt 98bb244b68 [PATCH] genirq: fasteoi handler: handle interrupt disabling
Note when a disable interrupt happened with the fasteoi handler as well so
that delayed disable can be implemented with fasteoi-type controllers.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:25 -07:00
Ingo Molnar 43f7775944 [PATCH] genirq: more verbose debugging on unexpected IRQ vectors
One frequent sign of IRQ handling bugs is the appearance of unexpected
vectors.  Print out all the IRQ state in that case.  We dont want this patch
upstream, but it is useful during initial testing.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:25 -07:00
Ingo Molnar f1c2662cbc [PATCH] genirq: cleanup: no_irq_type -> no_irq_chip rename
Rename no_irq_type to no_irq_chip.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:25 -07:00
Thomas Gleixner e76de9f8eb [PATCH] genirq: add SA_TRIGGER support
Enable drivers to request an IRQ with a given irq-flow (trigger/polarity)
setting.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:24 -07:00
Thomas Gleixner ba9a2331ba [PATCH] genirq: add irq-wake (power-management) support
Enable platforms to set the irq-wake (power-management) properties of an IRQ.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:24 -07:00
Ingo Molnar 7a55713ab4 [PATCH] genirq: add handle_bad_irq()
Handle bad IRQ vectors via the irqchip mechanism.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:24 -07:00
Thomas Gleixner dd87eb3a24 [PATCH] genirq: add irq-chip support
Enable platforms to use the irq-chip and irq-flow abstractions: allow setting
of the chip, the type and provide highlevel handlers for common irq-flows.

[rostedt@goodmis.org: misroute-irq: Don't call desc->chip->end because of edge interrupts]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:24 -07:00
Thomas Gleixner 6a6de9ef58 [PATCH] genirq: core
Core genirq support: add the irq-chip and irq-flow abstractions.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:24 -07:00
Ingo Molnar a34db9b28a [PATCH] genirq: update copyrights
Update/add copyrights in the generic IRQ code.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:24 -07:00
Thomas Gleixner 94d39e1f6e [PATCH] genirq: add IRQ_NOAUTOEN support
Enable platforms to disable the automatic enabling of freshly set up irqs.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:24 -07:00
Thomas Gleixner 6550c775cb [PATCH] genirq: add IRQ_NOREQUEST support
Enable platforms to disable request_irq() for certain interrupts.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:24 -07:00
Thomas Gleixner 3418d72404 [PATCH] genirq: add IRQ_NOPROBE support
Introduce IRQ_NOPROBE: enables platforms to control chip-probing.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:24 -07:00
Thomas Gleixner a4633adcdb [PATCH] genirq: add genirq sw IRQ-retrigger
Enable platforms that do not have a hardware-assisted hardirq-resend mechanism
to resend them via a softirq-driven IRQ emulation mechanism.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:23 -07:00
Ingo Molnar 77a5afecdb [PATCH] genirq: cleanup: no_irq_type cleanups
Clean up no_irq_type: share the NOP functions where possible, and properly
name the ack_bad() function.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:23 -07:00
Ingo Molnar 8d28bc751b [PATCH] genirq: doc: handle_IRQ_event() and __do_IRQ() comments
Document handle_IRQ_event() and __do_IRQ().

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:23 -07:00
Ingo Molnar c0ad90a32f [PATCH] genirq: add ->retrigger() irq op to consolidate hw_irq_resend()
Add ->retrigger() irq op to consolidate hw_irq_resend() implementations.
(Most architectures had it defined to NOP anyway.)

NOTE: ia64 needs testing. i386 and x86_64 tested.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:23 -07:00
Thomas Gleixner 096c8131c5 [PATCH] genirq: debug: better debug printout in enable_irq()
Make enable_irq() debug printouts user-readable.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:23 -07:00
Ingo Molnar 0d7012a968 [PATCH] genirq: cleanup: turn ARCH_HAS_IRQ_PER_CPU into CONFIG_IRQ_PER_CPU
Cleanup: change ARCH_HAS_IRQ_PER_CPU into a Kconfig method.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:23 -07:00
Ingo Molnar cd916d31cc [PATCH] genirq: cleanup: merge pending_irq_cpumask[] into irq_desc[]
Consolidation: remove the pending_irq_cpumask[NR_IRQS] array and move it into
the irq_desc[NR_IRQS].pending_mask field.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:22 -07:00
Ingo Molnar 4a733ee126 [PATCH] genirq: cleanup: merge irq_dir[], smp_affinity_entry[] into irq_desc[]
Consolidation: remove the irq_dir[NR_IRQS] and the smp_affinity_entry[NR_IRQS]
arrays and move them into the irq_desc[] array.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:22 -07:00
Ingo Molnar 34ffdb7233 [PATCH] genirq: cleanup: reduce irq_desc_t use, mark it obsolete
Cleanup: remove irq_desc_t use from the generic IRQ code, and mark it
obsolete.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:22 -07:00
Ingo Molnar 06fcb0c6fb [PATCH] genirq: cleanup: misc code cleanups
Assorted code cleanups to the generic IRQ code.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:22 -07:00
Ingo Molnar 2e60bbb6d5 [PATCH] genirq: cleanup: remove fastcall
Now that i386 defaults to regparm, explicit uses of fastcall are not needed
anymore.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:22 -07:00
Ingo Molnar a8553acd6c [PATCH] genirq: cleanup: remove irq_descp()
Cleanup: remove irq_descp() - explicit use of irq_desc[] is shorter and more
readable.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:22 -07:00
Ingo Molnar a53da52fd7 [PATCH] genirq: cleanup: merge irq_affinity[] into irq_desc[]
Consolidation: remove the irq_affinity[NR_IRQS] array and move it into the
irq_desc[NR_IRQS].affinity field.

[akpm@osdl.org: sparc64 build fix]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:22 -07:00
Ingo Molnar 74ffd553a3 [PATCH] genirq: sem2mutex probe_sem -> probing_active
Convert the irq auto-probing semaphore to a mutex.  (This allows us to find
probing API usage bugs sooner, via the mutex debugging code.)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:21 -07:00
Ingo Molnar d1bef4ed5f [PATCH] genirq: rename desc->handler to desc->chip
This patch-queue improves the generic IRQ layer to be truly generic, by adding
various abstractions and features to it, without impacting existing
functionality.

While the queue can be best described as "fix and improve everything in the
generic IRQ layer that we could think of", and thus it consists of many
smaller features and lots of cleanups, the one feature that stands out most is
the new 'irq chip' abstraction.

The irq-chip abstraction is about describing and coding and IRQ controller
driver by mapping its raw hardware capabilities [and quirks, if needed] in a
straightforward way, without having to think about "IRQ flow"
(level/edge/etc.) type of details.

This stands in contrast with the current 'irq-type' model of genirq
architectures, which 'mixes' raw hardware capabilities with 'flow' details.
The patchset supports both types of irq controller designs at once, and
converts i386 and x86_64 to the new irq-chip design.

As a bonus side-effect of the irq-chip approach, chained interrupt controllers
(master/slave PIC constructs, etc.) are now supported by design as well.

The end result of this patchset intends to be simpler architecture-level code
and more consolidation between architectures.

We reused many bits of code and many concepts from Russell King's ARM IRQ
layer, the merging of which was one of the motivations for this patchset.

This patch:

rename desc->handler to desc->chip.

Originally i did not want to do this, because it's a big patch.  But having
both "desc->handler", "desc->handle_irq" and "action->handler" caused a
large degree of confusion and made the code appear alot less clean than it
truly is.

I have also attempted a dual approach as well by introducing a
desc->chip alias - but that just wasnt robust enough and broke
frequently.

So lets get over with this quickly.  The conversion was done automatically
via scripts and converts all the code in the kernel.

This renaming patch is the first one amongst the patches, so that the
remaining patches can stay flexible and can be merged and split up
without having some big monolithic patch act as a merge barrier.

[akpm@osdl.org: build fix]
[akpm@osdl.org: another build fix]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:21 -07:00
Andrew Morton 84860f9979 [PATCH] load_module() cleanup
Undo bizarre declaration in load_module().

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-28 14:59:04 -07:00
Arjan van de Ven f71d20e961 [PATCH] Add EXPORT_UNUSED_SYMBOL and EXPORT_UNUSED_SYMBOL_GPL
Temporarily add EXPORT_UNUSED_SYMBOL and EXPORT_UNUSED_SYMBOL_GPL.  These
will be used as a transition measure for symbols that aren't used in the
kernel and are on the way out.  When a module uses such a symbol, a warning
is printk'd at modprobe time.

The main reason for removing unused exports is size: eacho export takes
roughly between 100 and 150 bytes of kernel space in the binary.  This
patch gives users the option to immediately get this size gain via a config
option.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-28 14:59:04 -07:00
David Wilder c0ce7d0886 [POWERPC] Add the use of the firmware soft-reset-nmi to kdump.
With this patch, kdump uses the firmware soft-reset NMI for two purposes:
1) Initiate the kdump (take a crash dump) by issuing a soft-reset.
2) Break a CPU out of a deadlock condition that is detected during kdump
processing.

When a soft-reset is initiated each CPU will enter
system_reset_exception() and set its corresponding bit in the global
bit-array cpus_in_sr then call die(). When die() finds the CPU's bit set
in cpu_in_sr crash_kexec() is called to initiate a crash dump. The first
CPU to enter crash_kexec() is called the "crashing CPU". All other CPUs
are "secondary CPUs". The secondary CPU's pass through to
crash_kexec_secondary() and sleep. The crashing CPU waits for all CPUs
to enter via soft-reset then boots the kdump kernel (see
crash_soft_reset_check())

When the system crashes due to a panic or exception, crash_kexec() is
called by panic() or die(). The crashing CPU sends an IPI to all other
CPUs to notify them of the pending shutdown. If a CPU is in a deadlock
or hung state with interrupts disabled, the IPI will not be delivered.
The result being, that the kdump kernel is not booted. This problem is
solved with the use of a firmware generated soft-reset. After the
crashing_cpu has issued the IPI, it waits for 10 sec for all CPUs to
enter crash_ipi_callback(). A CPU signifies its entry to
crash_ipi_callback() by setting its corresponding bit in the
cpus_in_crash bit array. After 10 sec, if one or more CPUs have not set
their bit in cpus_in_crash we assume that the CPU(s) is deadlocked. The
operator is then prompted to generate a soft-reset to break the
deadlock. Each CPU enters the soft reset handler as described above.

Two conditions must be handled at this point:
1) The system crashed because the operator generated a soft-reset. See
2) The system had crashed before the soft-reset was generated ( in the
case of a Panic or oops).

The first CPU to enter crash_kexec() uses the state of the kexec_lock to
determine this state. If kexec_lock is already held then condition 2 is
true and crash_kexec_secondary() is called, else; this CPU is flagged as
the crashing CPU, the kexec_lock is acquired and crash_kexec() proceeds
as described above.

Each additional CPUs responding to the soft-reset will pass through
crash_kexec() to kexec_secondary(). All secondary CPUs call
crash_ipi_callback() readying them self's for the shutdown. When ready
they clear their bit in cpus_in_sr. The crashing CPU waits in
kexec_secondary() until all other CPUs have cleared their bits in
cpus_in_sr. The kexec kernel boot is then started.

Signed-off-by: Haren Myneni <haren@us.ibm.com>
Signed-off-by: David Wilder <dwilder@us.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2006-06-28 15:18:52 +10:00
Jesper Juhl 9a66a53f55 [PATCH] Remove redundant NULL checks before [kv]free - in kernel/
Remove redundant kfree NULL checks from kernel/

Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:48 -07:00
Sebastien Dugue 59e0e0ace7 [PATCH] futex_requeue() optimization
In futex_requeue(), when the 2 futexes keys hash to the same bucket, there
is no need to move the futex_q to the end of the bucket list.

Signed-off-by: Sebastien Dugue <sebastien.dugue@bull.net>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:48 -07:00
Thomas Gleixner 95e02ca9bb [PATCH] rtmutex: Propagate priority settings into PI lock chains
When the priority of a task, which is blocked on a lock, changes we must
propagate this change into the PI lock chain.  Therefor the chain walk code
is changed to get rid of the references to current to avoid false positives
in the deadlock detector, as setscheduler might be called by a task which
holds the lock on which the task whose priority is changed is blocked.

Also add some comments about the get/put_task_struct usage to avoid
confusion.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:48 -07:00
Thomas Gleixner 0bafd214e4 [PATCH] rtmutex: Modify rtmutex-tester to test the setscheduler propagation
Make test suite setscheduler calls asynchronously.  Remove the waits in the
test cases and add a new testcase to verify the correctness of the
setscheduler priority propagation.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:47 -07:00
Thomas Gleixner e74c69f46d [PATCH] Drop tasklist lock in do_sched_setscheduler
There is no need to hold tasklist_lock across the setscheduler call, when
we pin the task structure with get_task_struct().  Interrupts are disabled
in setscheduler anyway and the permission checks do not need interrupts
disabled.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:47 -07:00
Ingo Molnar c87e2837be [PATCH] pi-futex: futex_lock_pi/futex_unlock_pi support
This adds the actual pi-futex implementation, based on rt-mutexes.

[dino@in.ibm.com: fix an oops-causing race]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:47 -07:00
Ingo Molnar 0cdbee9920 [PATCH] pi-futex: rt mutex futex api
Add proxy-locking rt-mutex functionality needed by pi-futexes.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:47 -07:00
Thomas Gleixner 61a8712286 [PATCH] pi-futex: rt mutex tester
RT-mutex tester: scriptable tester for rt mutexes, which allows userspace
scripting of mutex unit-tests (and dynamic tests as well), using the actual
rt-mutex implementation of the kernel.

[akpm@osdl.org: fixlet]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:47 -07:00
Ingo Molnar e7eebaf6a8 [PATCH] pi-futex: rt mutex debug
Runtime debugging functionality for rt-mutexes.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:47 -07:00
Ingo Molnar 23f78d4a03 [PATCH] pi-futex: rt mutex core
Core functions for the rt-mutex subsystem.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:47 -07:00
Ingo Molnar b29739f902 [PATCH] pi-futex: scheduler support for pi
Add framework to boost/unboost the priority of RT tasks.

This consists of:

 - caching the 'normal' priority in ->normal_prio
 - providing a functions to set/get the priority of the task
 - make sched_setscheduler() aware of boosting

The effective_prio() cleanups also fix a priority-calculation bug pointed out
by Andrey Gelman, in set_user_nice().

has_rt_policy() fix: Peter Williams <pwil3058@bigpond.net.au>

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: Andrey Gelman <agelman@012.net.il>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:46 -07:00
Ingo Molnar e2970f2fb6 [PATCH] pi-futex: futex code cleanups
We are pleased to announce "lightweight userspace priority inheritance" (PI)
support for futexes.  The following patchset and glibc patch implements it,
ontop of the robust-futexes patchset which is included in 2.6.16-mm1.

We are calling it lightweight for 3 reasons:

 - in the user-space fastpath a PI-enabled futex involves no kernel work
   (or any other PI complexity) at all.  No registration, no extra kernel
   calls - just pure fast atomic ops in userspace.

 - in the slowpath (in the lock-contention case), the system call and
   scheduling pattern is in fact better than that of normal futexes, due to
   the 'integrated' nature of FUTEX_LOCK_PI.  [more about that further down]

 - the in-kernel PI implementation is streamlined around the mutex
   abstraction, with strict rules that keep the implementation relatively
   simple: only a single owner may own a lock (i.e.  no read-write lock
   support), only the owner may unlock a lock, no recursive locking, etc.

  Priority Inheritance - why, oh why???
  -------------------------------------

Many of you heard the horror stories about the evil PI code circling Linux for
years, which makes no real sense at all and is only used by buggy applications
and which has horrible overhead.  Some of you have dreaded this very moment,
when someone actually submits working PI code ;-)

So why would we like to see PI support for futexes?

We'd like to see it done purely for technological reasons.  We dont think it's
a buggy concept, we think it's useful functionality to offer to applications,
which functionality cannot be achieved in other ways.  We also think it's the
right thing to do, and we think we've got the right arguments and the right
numbers to prove that.  We also believe that we can address all the
counter-arguments as well.  For these reasons (and the reasons outlined below)
we are submitting this patch-set for upstream kernel inclusion.

What are the benefits of PI?

  The short reply:
  ----------------

User-space PI helps achieving/improving determinism for user-space
applications.  In the best-case, it can help achieve determinism and
well-bound latencies.  Even in the worst-case, PI will improve the statistical
distribution of locking related application delays.

  The longer reply:
  -----------------

Firstly, sharing locks between multiple tasks is a common programming
technique that often cannot be replaced with lockless algorithms.  As we can
see it in the kernel [which is a quite complex program in itself], lockless
structures are rather the exception than the norm - the current ratio of
lockless vs.  locky code for shared data structures is somewhere between 1:10
and 1:100.  Lockless is hard, and the complexity of lockless algorithms often
endangers to ability to do robust reviews of said code.  I.e.  critical RT
apps often choose lock structures to protect critical data structures, instead
of lockless algorithms.  Furthermore, there are cases (like shared hardware,
or other resource limits) where lockless access is mathematically impossible.

Media players (such as Jack) are an example of reasonable application design
with multiple tasks (with multiple priority levels) sharing short-held locks:
for example, a highprio audio playback thread is combined with medium-prio
construct-audio-data threads and low-prio display-colory-stuff threads.  Add
video and decoding to the mix and we've got even more priority levels.

So once we accept that synchronization objects (locks) are an unavoidable fact
of life, and once we accept that multi-task userspace apps have a very fair
expectation of being able to use locks, we've got to think about how to offer
the option of a deterministic locking implementation to user-space.

Most of the technical counter-arguments against doing priority inheritance
only apply to kernel-space locks.  But user-space locks are different, there
we cannot disable interrupts or make the task non-preemptible in a critical
section, so the 'use spinlocks' argument does not apply (user-space spinlocks
have the same priority inversion problems as other user-space locking
constructs).  Fact is, pretty much the only technique that currently enables
good determinism for userspace locks (such as futex-based pthread mutexes) is
priority inheritance:

Currently (without PI), if a high-prio and a low-prio task shares a lock [this
is a quite common scenario for most non-trivial RT applications], even if all
critical sections are coded carefully to be deterministic (i.e.  all critical
sections are short in duration and only execute a limited number of
instructions), the kernel cannot guarantee any deterministic execution of the
high-prio task: any medium-priority task could preempt the low-prio task while
it holds the shared lock and executes the critical section, and could delay it
indefinitely.

  Implementation:
  ---------------

As mentioned before, the userspace fastpath of PI-enabled pthread mutexes
involves no kernel work at all - they behave quite similarly to normal
futex-based locks: a 0 value means unlocked, and a value==TID means locked.
(This is the same method as used by list-based robust futexes.) Userspace uses
atomic ops to lock/unlock these mutexes without entering the kernel.

To handle the slowpath, we have added two new futex ops:

  FUTEX_LOCK_PI
  FUTEX_UNLOCK_PI

If the lock-acquire fastpath fails, [i.e.  an atomic transition from 0 to TID
fails], then FUTEX_LOCK_PI is called.  The kernel does all the remaining work:
if there is no futex-queue attached to the futex address yet then the code
looks up the task that owns the futex [it has put its own TID into the futex
value], and attaches a 'PI state' structure to the futex-queue.  The pi_state
includes an rt-mutex, which is a PI-aware, kernel-based synchronization
object.  The 'other' task is made the owner of the rt-mutex, and the
FUTEX_WAITERS bit is atomically set in the futex value.  Then this task tries
to lock the rt-mutex, on which it blocks.  Once it returns, it has the mutex
acquired, and it sets the futex value to its own TID and returns.  Userspace
has no other work to perform - it now owns the lock, and futex value contains
FUTEX_WAITERS|TID.

If the unlock side fastpath succeeds, [i.e.  userspace manages to do a TID ->
0 atomic transition of the futex value], then no kernel work is triggered.

If the unlock fastpath fails (because the FUTEX_WAITERS bit is set), then
FUTEX_UNLOCK_PI is called, and the kernel unlocks the futex on the behalf of
userspace - and it also unlocks the attached pi_state->rt_mutex and thus wakes
up any potential waiters.

Note that under this approach, contrary to other PI-futex approaches, there is
no prior 'registration' of a PI-futex.  [which is not quite possible anyway,
due to existing ABI properties of pthread mutexes.]

Also, under this scheme, 'robustness' and 'PI' are two orthogonal properties
of futexes, and all four combinations are possible: futex, robust-futex,
PI-futex, robust+PI-futex.

  glibc support:
  --------------

Ulrich Drepper and Jakub Jelinek have written glibc support for PI-futexes
(and robust futexes), enabling robust and PI (PTHREAD_PRIO_INHERIT) POSIX
mutexes.  (PTHREAD_PRIO_PROTECT support will be added later on too, no
additional kernel changes are needed for that).  [NOTE: The glibc patch is
obviously inofficial and unsupported without matching upstream kernel
functionality.]

the patch-queue and the glibc patch can also be downloaded from:

  http://redhat.com/~mingo/PI-futex-patches/

Many thanks go to the people who helped us create this kernel feature: Steven
Rostedt, Esben Nielsen, Benedikt Spranger, Daniel Walker, John Cooper, Arjan
van de Ven, Oleg Nesterov and others.  Credits for related prior projects goes
to Dirk Grambow, Inaky Perez-Gonzalez, Bill Huey and many others.

Clean up the futex code, before adding more features to it:

 - use u32 as the futex field type - that's the ABI
 - use __user and pointers to u32 instead of unsigned long
 - code style / comment style cleanups
 - rename hash-bucket name from 'bh' to 'hb'.

I checked the pre and post futex.o object files to make sure this
patch has no code effects.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: Ulrich Drepper <drepper@redhat.com>
Cc: Jakub Jelinek <jakub@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:46 -07:00
Steven Rostedt 66e5393a78 [PATCH] BUG() if setscheduler is called from interrupt context
Thomas Gleixner is adding the call to a rtmutex function in setscheduler.
This call grabs a spin_lock that is not always protected by interrupts
disabled.  So this means that setscheduler cant be called from interrupt
context.

To prevent this from happening in the future, this patch adds a
BUG_ON(in_interrupt()) in that function.  (Thanks to akpm <aka.  Andrew
Morton> for this suggestion).

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:46 -07:00
Oleg Nesterov 9fea80e4d9 [PATCH] sched: uninline task_rq_lock()
Saves 543 bytes from sched.o (gcc 3.3.3).

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Con Kolivas <kernel@kolivas.org>
Cc: Peter Williams <pwil3058@bigpond.net.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:45 -07:00
Siddha, Suresh B 5c45bf279d [PATCH] sched: mc/smt power savings sched policy
sysfs entries 'sched_mc_power_savings' and 'sched_smt_power_savings' in
/sys/devices/system/cpu/ control the MC/SMT power savings policy for the
scheduler.

Based on the values (1-enable, 0-disable) for these controls, sched groups
cpu power will be determined for different domains.  When power savings
policy is enabled and under light load conditions, scheduler will minimize
the physical packages/cpu cores carrying the load and thus conserving
power(with a perf impact based on the workload characteristics...  see OLS
2005 CMP kernel scheduler paper for more details..)

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Con Kolivas <kernel@kolivas.org>
Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:45 -07:00
Srivatsa Vaddagiri 369381694d [PATCH] sched_domai: Allocate sched_group structures dynamically
As explained here:
	http://marc.theaimsgroup.com/?l=linux-kernel&m=114327539012323&w=2

there is a problem with sharing sched_group structures between two
separate sched_group structures for different sched_domains.

The patch has been tested and found to avoid the kernel lockup problem
described in above URL.

Signed-off-by: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:45 -07:00
Srivatsa Vaddagiri 15f0b676a4 [PATCH] sched_domai: Use kmalloc_node
The sched group structures used to represent various nodes need to be
allocated from respective nodes (as suggested here also:

	http://uwsg.ucs.indiana.edu/hypermail/linux/kernel/0603.3/0051.html)

Signed-off-by: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:45 -07:00
Srivatsa Vaddagiri d3a5aa9858 [PATCH] sched_domai: Don't use GFP_ATOMIC
Replace GFP_ATOMIC allocation for sched_group_nodes with GFP_KERNEL based
allocation.

Signed-off-by: Srivatsa Vaddagiri <vatsa@in.ibm.com
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:45 -07:00
Srivatsa Vaddagiri 51888ca25a [PATCH] sched_domain: handle kmalloc failure
Try to handle mem allocation failures in build_sched_domains by bailing out
and cleaning up thus-far allocated memory.  The patch has a direct consequence
that we disable load balancing completely (even at sibling level) upon *any*
memory allocation failure.

[Lee.Schermerhorn@hp.com: bugfix]
Signed-off-by: Srivatsa Vaddagir <vatsa@in.ibm.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:45 -07:00
Peter Williams 615052dc3b [PATCH] sched: Avoid unnecessarily moving highest priority task move_tasks()
Problem:

To help distribute high priority tasks evenly across the available CPUs
move_tasks() does not, under some circumstances, skip tasks whose load
weight is bigger than the designated amount.  Because the highest priority
task on the busiest queue may be on the expired array it may be moved as a
result of this mechanism.  Apart from not being the most desirable way to
redistribute the high priority tasks (we'd rather move the second highest
priority task), there is a risk that this could set up a loop with this
task bouncing backwards and forwards between the two queues.  (This latter
possibility can be demonstrated by running a nice==-20 CPU bound task on an
otherwise quiet 2 CPU system.)

Solution:

Modify the mechanism so that it does not override skip for the highest
priority task on the CPU.  Of course, if there are more than one tasks at
the highest priority then it will allow the override for one of them as
this is a desirable redistribution of high priority tasks.

Signed-off-by: Peter Williams <pwil3058@bigpond.com.au>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:44 -07:00
Peter Williams 50ddd96917 [PATCH] sched: modify move_tasks() to improve load balancing outcomes
Problem:

The move_tasks() function is designed to move UP TO the amount of load it
is asked to move and in doing this it skips over tasks looking for ones
whose load weights are less than or equal to the remaining load to be
moved.  This is (in general) a good thing but it has the unfortunate result
of breaking one of the original load balancer's good points: namely, that
(within the limits imposed by the active/expired array model and the fact
the expired is processed first) it moves high priority tasks before low
priority ones and this means there's a good chance (see active/expired
problem for why it's only a chance) that the highest priority task on the
queue but not actually on the CPU will be moved to the other CPU where (as
a high priority task) it may preempt the current task.

Solution:

Modify move_tasks() so that high priority tasks are not skipped when moving
them will make them the highest priority task on their new run queue.

Signed-off-by: Peter Williams <pwil3058@bigpond.com.au>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:44 -07:00
Peter Williams 2dd73a4f09 [PATCH] sched: implement smpnice
Problem:

The introduction of separate run queues per CPU has brought with it "nice"
enforcement problems that are best described by a simple example.

For the sake of argument suppose that on a single CPU machine with a
nice==19 hard spinner and a nice==0 hard spinner running that the nice==0
task gets 95% of the CPU and the nice==19 task gets 5% of the CPU.  Now
suppose that there is a system with 2 CPUs and 2 nice==19 hard spinners and
2 nice==0 hard spinners running.  The user of this system would be entitled
to expect that the nice==0 tasks each get 95% of a CPU and the nice==19
tasks only get 5% each.  However, whether this expectation is met is pretty
much down to luck as there are four equally likely distributions of the
tasks to the CPUs that the load balancing code will consider to be balanced
with loads of 2.0 for each CPU.  Two of these distributions involve one
nice==0 and one nice==19 task per CPU and in these circumstances the users
expectations will be met.  The other two distributions both involve both
nice==0 tasks being on one CPU and both nice==19 being on the other CPU and
each task will get 50% of a CPU and the user's expectations will not be
met.

Solution:

The solution to this problem that is implemented in the attached patch is
to use weighted loads when determining if the system is balanced and, when
an imbalance is detected, to move an amount of weighted load between run
queues (as opposed to a number of tasks) to restore the balance.  Once
again, the easiest way to explain why both of these measures are necessary
is to use a simple example.  Suppose that (in a slight variation of the
above example) that we have a two CPU system with 4 nice==0 and 4 nice=19
hard spinning tasks running and that the 4 nice==0 tasks are on one CPU and
the 4 nice==19 tasks are on the other CPU.  The weighted loads for the two
CPUs would be 4.0 and 0.2 respectively and the load balancing code would
move 2 tasks resulting in one CPU with a load of 2.0 and the other with
load of 2.2.  If this was considered to be a big enough imbalance to
justify moving a task and that task was moved using the current
move_tasks() then it would move the highest priority task that it found and
this would result in one CPU with a load of 3.0 and the other with a load
of 1.2 which would result in the movement of a task in the opposite
direction and so on -- infinite loop.  If, on the other hand, an amount of
load to be moved is calculated from the imbalance (in this case 0.1) and
move_tasks() skips tasks until it find ones whose contributions to the
weighted load are less than this amount it would move two of the nice==19
tasks resulting in a system with 2 nice==0 and 2 nice=19 on each CPU with
loads of 2.1 for each CPU.

One of the advantages of this mechanism is that on a system where all tasks
have nice==0 the load balancing calculations would be mathematically
identical to the current load balancing code.

Notes:

struct task_struct:

has a new field load_weight which (in a trade off of space for speed)
stores the contribution that this task makes to a CPU's weighted load when
it is runnable.

struct runqueue:

has a new field raw_weighted_load which is the sum of the load_weight
values for the currently runnable tasks on this run queue.  This field
always needs to be updated when nr_running is updated so two new inline
functions inc_nr_running() and dec_nr_running() have been created to make
sure that this happens.  This also offers a convenient way to optimize away
this part of the smpnice mechanism when CONFIG_SMP is not defined.

int try_to_wake_up():

in this function the value SCHED_LOAD_BALANCE is used to represent the load
contribution of a single task in various calculations in the code that
decides which CPU to put the waking task on.  While this would be a valid
on a system where the nice values for the runnable tasks were distributed
evenly around zero it will lead to anomalous load balancing if the
distribution is skewed in either direction.  To overcome this problem
SCHED_LOAD_SCALE has been replaced by the load_weight for the relevant task
or by the average load_weight per task for the queue in question (as
appropriate).

int move_tasks():

The modifications to this function were complicated by the fact that
active_load_balance() uses it to move exactly one task without checking
whether an imbalance actually exists.  This precluded the simple
overloading of max_nr_move with max_load_move and necessitated the addition
of the latter as an extra argument to the function.  The internal
implementation is then modified to move up to max_nr_move tasks and
max_load_move of weighted load.  This slightly complicates the code where
move_tasks() is called and if ever active_load_balance() is changed to not
use move_tasks() the implementation of move_tasks() should be simplified
accordingly.

struct sched_group *find_busiest_group():

Similar to try_to_wake_up(), there are places in this function where
SCHED_LOAD_SCALE is used to represent the load contribution of a single
task and the same issues are created.  A similar solution is adopted except
that it is now the average per task contribution to a group's load (as
opposed to a run queue) that is required.  As this value is not directly
available from the group it is calculated on the fly as the queues in the
groups are visited when determining the busiest group.

A key change to this function is that it is no longer to scale down
*imbalance on exit as move_tasks() uses the load in its scaled form.

void set_user_nice():

has been modified to update the task's load_weight field when it's nice
value and also to ensure that its run queue's raw_weighted_load field is
updated if it was runnable.

From: "Siddha, Suresh B" <suresh.b.siddha@intel.com>

With smpnice, sched groups with highest priority tasks can mask the imbalance
between the other sched groups with in the same domain.  This patch fixes some
of the listed down scenarios by not considering the sched groups which are
lightly loaded.

a) on a simple 4-way MP system, if we have one high priority and 4 normal
   priority tasks, with smpnice we would like to see the high priority task
   scheduled on one cpu, two other cpus getting one normal task each and the
   fourth cpu getting the remaining two normal tasks.  but with current
   smpnice extra normal priority task keeps jumping from one cpu to another
   cpu having the normal priority task.  This is because of the
   busiest_has_loaded_cpus, nr_loaded_cpus logic..  We are not including the
   cpu with high priority task in max_load calculations but including that in
   total and avg_load calcuations..  leading to max_load < avg_load and load
   balance between cpus running normal priority tasks(2 Vs 1) will always show
   imbalanace as one normal priority and the extra normal priority task will
   keep moving from one cpu to another cpu having normal priority task..

b) 4-way system with HT (8 logical processors).  Package-P0 T0 has a
   highest priority task, T1 is idle.  Package-P1 Both T0 and T1 have 1 normal
   priority task each..  P2 and P3 are idle.  With this patch, one of the
   normal priority tasks on P1 will be moved to P2 or P3..

c) With the current weighted smp nice calculations, it doesn't always make
   sense to look at the highest weighted runqueue in the busy group..
   Consider a load balance scenario on a DP with HT system, with Package-0
   containing one high priority and one low priority, Package-1 containing one
   low priority(with other thread being idle)..  Package-1 thinks that it need
   to take the low priority thread from Package-0.  And find_busiest_queue()
   returns the cpu thread with highest priority task..  And ultimately(with
   help of active load balance) we move high priority task to Package-1.  And
   same continues with Package-0 now, moving high priority task from package-1
   to package-0..  Even without the presence of active load balance, load
   balance will fail to balance the above scenario..  Fix find_busiest_queue
   to use "imbalance" when it is lightly loaded.

[kernel@kolivas.org: sched: store weighted load on up]
[kernel@kolivas.org: sched: add discrete weighted cpu load function]
[suresh.b.siddha@intel.com: sched: remove dead code]
Signed-off-by: Peter Williams <pwil3058@bigpond.com.au>
Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Con Kolivas <kernel@kolivas.org>
Cc: John Hawkes <hawkes@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:44 -07:00
Kirill Korotaev efc30814a8 [PATCH] sched: CPU hotplug race vs. set_cpus_allowed()
There is a race between set_cpus_allowed() and move_task_off_dead_cpu().
__migrate_task() doesn't report any err code, so task can be left on its
runqueue if its cpus_allowed mask changed so that dest_cpu is not longer a
possible target.  Also, chaning cpus_allowed mask requires rq->lock being
held.

Signed-off-by: Kirill Korotaev <dev@openvz.org>
Acked-By: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:44 -07:00
Steven Rostedt cc94abfcbc [PATCH] unnecessary long index i in sched
Unless we expect to have more than 2G CPUs, there's no reason to have 'i'
as a long long here.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:44 -07:00
Con Kolivas 72d2854d4e [PATCH] sched: fix interactive ceiling code
The relationship between INTERACTIVE_SLEEP and the ceiling is not perfect
and not explicit enough.  The sleep boost is not supposed to be any larger
than without this code and the comment is not clear enough about what
exactly it does, just the reason it does it.  Fix it.

There is a ceiling to the priority beyond which tasks that only ever sleep
for very long periods cannot surpass.  Fix it.

Prevent the on-runqueue bonus logic from defeating the idle sleep logic.

Opportunity to micro-optimise.

Signed-off-by: Con Kolivas <kernel@kolivas.org>
Signed-off-by: Mike Galbraith <efault@gmx.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:44 -07:00
Steven Rostedt d444886e14 [PATCH] sched: simplify bitmap definition
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:44 -07:00
Chen, Kenneth W c96d145e71 [PATCH] sched: fix smt nice lock contention and optimization
Initial report and lock contention fix from Chris Mason:

Recent benchmarks showed some performance regressions between 2.6.16 and
2.6.5.  We tracked down one of the regressions to lock contention in
schedule heavy workloads (~70,000 context switches per second)

kernel/sched.c:dependent_sleeper() was responsible for most of the lock
contention, hammering on the run queue locks.  The patch below is more of a
discussion point than a suggested fix (although it does reduce lock
contention significantly).  The dependent_sleeper code looks very expensive
to me, especially for using a spinlock to bounce control between two
different siblings in the same cpu.

It is further optimized:

* perform dependent_sleeper check after next task is determined
* convert wake_sleeping_dependent to use trylock
* skip smt runqueue check if trylock fails
* optimize double_rq_lock now that smt nice is converted to trylock
* early exit in searching first SD_SHARE_CPUPOWER domain
* speedup fast path of dependent_sleeper

[akpm@osdl.org: cleanup]
Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Con Kolivas <kernel@kolivas.org>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Chris Mason <mason@suse.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:44 -07:00
Chandra Seetharaman 26c2143b63 [PATCH] cpu hotplug: make cpu_notifier related notifier calls __cpuinit only
Make notifier_calls associated with cpu_notifier as __cpuinit.

__cpuinit makes sure that the function is init time only unless
CONFIG_HOTPLUG_CPU is defined.

[akpm@osdl.org: section fix]
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:41 -07:00
Chandra Seetharaman 65edc68c34 [PATCH] cpu hotplug: make [un]register_cpu_notifier init time only
CPUs come online only at init time (unless CONFIG_HOTPLUG_CPU is defined).
So, cpu_notifier functionality need to be available only at init time.

This patch makes register_cpu_notifier() available only at init time, unless
CONFIG_HOTPLUG_CPU is defined.

This patch exports register_cpu_notifier() and unregister_cpu_notifier() only
if CONFIG_HOTPLUG_CPU is defined.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:41 -07:00
Chandra Seetharaman 054cc8a2d8 [PATCH] cpu hotplug: revert initdata patch submitted for 2.6.17
This patch reverts notifier_block changes made in 2.6.17

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:41 -07:00
Chandra Seetharaman 9c7b216d23 [PATCH] cpu hotplug: revert init patch submitted for 2.6.17
In 2.6.17, there was a problem with cpu_notifiers and XFS.  I provided a
band-aid solution to solve that problem.  In the process, i undid all the
changes you both were making to ensure that these notifiers were available
only at init time (unless CONFIG_HOTPLUG_CPU is defined).

We deferred the real fix to 2.6.18.  Here is a set of patches that fixes the
XFS problem cleanly and makes the cpu notifiers available only at init time
(unless CONFIG_HOTPLUG_CPU is defined).

If CONFIG_HOTPLUG_CPU is defined then cpu notifiers are available at run
time.

This patch reverts the notifier_call changes made in 2.6.17

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:40 -07:00
Paul E. McKenney c32e066057 [PATCH] rcutorture: add call_rcu_bh() operations
Add operations for the call_rcu_bh() variant of RCU.  Also add an
rcu_batches_completed_bh() function, which is needed by rcutorture.

Signed-off-by: Paul E. McKenney <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:40 -07:00
Paul E. McKenney 72e9bb5492 [PATCH] rcutorture: add ops vector and Classic RCU ops
Add an ops vector to rcutorture, and add the ops for Classic RCU.  Update
the rcutorture documentation to reflect slight change to the dmesg formats.

Signed-off-by: Paul E. McKenney <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:40 -07:00
Paul E. McKenney 29766f1eb3 [PATCH] rcutorture: catchup doc fixes for idle-hz tests
This just catches the RCU torture documentation up with the recent fixes
that test RCU for architectures that turn of the scheduling-clock interrupt
for idle CPUs and the addition of a SUCCESS/FAILURE indication, fixing up
an obsolete comment as well.

Signed-off-by: Paul E. McKenney <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:40 -07:00
Randy Dunlap 1dbe83c344 [PATCH] fix kernel-doc in kernel/ dir
Fix kernel-doc parameters in kernel/

Warning(/var/linsrc/linux-2617-g9//kernel/auditsc.c:1376): No description found for parameter 'u_abs_timeout'
Warning(/var/linsrc/linux-2617-g9//kernel/auditsc.c:1420): No description found for parameter 'u_msg_prio'
Warning(/var/linsrc/linux-2617-g9//kernel/auditsc.c:1420): No description found for parameter 'u_abs_timeout'
Warning(/var/linsrc/linux-2617-g9//kernel/acct.c:526): No description found for parameter 'pacct'

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:39 -07:00
Ingo Molnar 34af946a22 [PATCH] spin/rwlock init cleanups
locking init cleanups:

 - convert " = SPIN_LOCK_UNLOCKED" to spin_lock_init() or DEFINE_SPINLOCK()
 - convert rwlocks in a similar manner

this patch was generated automatically.

Motivation:

 - cleanliness
 - lockdep needs control of lock initialization, which the open-coded
   variants do not give
 - it's also useful for -rt and for lock debugging in general

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:39 -07:00
Randy Dunlap a7807a32bb [PATCH] poison: add & use more constants
Add more poison values to include/linux/poison.h.  It's not clear to me
whether some others should be added or not, so I haven't added any of
these:

./include/linux/libata.h:#define ATA_TAG_POISON		0xfafbfcfdU
./arch/ppc/8260_io/fcc_enet.c:1918:	memset((char *)(&(immap->im_dprambase[(mem_addr+64)])), 0x88, 32);
./drivers/usb/mon/mon_text.c:429:	memset(mem, 0xe5, sizeof(struct mon_event_text));
./drivers/char/ftape/lowlevel/ftape-ctl.c:738:		memset(ft_buffer[i]->address, 0xAA, FT_BUFF_SIZE);
./drivers/block/sx8.c:/* 0xf is just arbitrary, non-zero noise; this is sorta like poisoning */

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:38 -07:00
Ingo Molnar e6e5494cb2 [PATCH] vdso: randomize the i386 vDSO by moving it into a vma
Move the i386 VDSO down into a vma and thus randomize it.

Besides the security implications, this feature also helps debuggers, which
can COW a vma-backed VDSO just like a normal DSO and can thus do
single-stepping and other debugging features.

It's good for hypervisors (Xen, VMWare) too, which typically live in the same
high-mapped address space as the VDSO, hence whenever the VDSO is used, they
get lots of guest pagefaults and have to fix such guest accesses up - which
slows things down instead of speeding things up (the primary purpose of the
VDSO).

There's a new CONFIG_COMPAT_VDSO (default=y) option, which provides support
for older glibcs that still rely on a prelinked high-mapped VDSO.  Newer
distributions (using glibc 2.3.3 or later) can turn this option off.  Turning
it off is also recommended for security reasons: attackers cannot use the
predictable high-mapped VDSO page as syscall trampoline anymore.

There is a new vdso=[0|1] boot option as well, and a runtime
/proc/sys/vm/vdso_enabled sysctl switch, that allows the VDSO to be turned
on/off.

(This version of the VDSO-randomization patch also has working ELF
coredumping, the previous patch crashed in the coredumping code.)

This code is a combined work of the exec-shield VDSO randomization
code and Gerd Hoffmann's hypervisor-centric VDSO patch. Rusty Russell
started this patch and i completed it.

[akpm@osdl.org: cleanups]
[akpm@osdl.org: compile fix]
[akpm@osdl.org: compile fix 2]
[akpm@osdl.org: compile fix 3]
[akpm@osdl.org: revernt MAXMEM change]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@infradead.org>
Cc: Gerd Hoffmann <kraxel@suse.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Zachary Amsden <zach@vmware.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:38 -07:00
KAMEZAWA Hiroyuki 2842f11419 [PATCH] catch valid mem range at onlining memory
This patch allows hot-add memory which is not aligned to section.

Now, hot-added memory has to be aligned to section size.  Considering big
section sized archs, this is not useful.

When hot-added memory is registerd as iomem resoruce by iomem resource
patch, we can make use of that information to detect valid memory range.

Note: With this, not-aligned memory can be registerd. To allow hot-add
      memory with holes, we have to do more work around add_memory().
      (It doesn't allows add memory to already existing mem section.)

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:36 -07:00
Andrew Morton 5c31f2738a [PATCH] pm_trace is dangerous
CONFIG_PM_TRACES scrogs your RTC.  Mark it as experimental, and defaulting to
`off'.

Also beef up the help message a bit.

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:35 -07:00
Randy Dunlap 7f32a25f63 [PATCH] kernel/acct: fix function definition
kernel/acct.c:579:19: warning: non-ANSI function declaration of function 'acct_process'

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:35 -07:00
Greg Kroah-Hartman 6550e07f41 [PATCH] 64bit Resource: finally enable 64bit resource sizes
Introduce the Kconfig entry and actually switch to a 64bit value, if
wanted, for resource_size_t.

Based on a patch series originally from Vivek Goyal <vgoyal@in.ibm.com>

Cc: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-06-27 09:24:00 -07:00
Greg Kroah-Hartman d75fc8bbcc [PATCH] 64bit resource: change resource core to use resource_size_t
Based on a patch series originally from Vivek Goyal <vgoyal@in.ibm.com>

Cc: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-06-27 09:23:59 -07:00
Greg Kroah-Hartman 685143ac1f [PATCH] 64bit resource: fix up printks for resources in arch and core code
This is needed if we wish to change the size of the resource structures.

Based on an original patch from Vivek Goyal <vgoyal@in.ibm.com> and
Andrew Morton.

(tweaked by Andy Isaacson <adi@hexapodia.org>)

Cc: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Andy Isaacson <adi@hexapodia.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-06-27 09:23:59 -07:00
Linus Torvalds 2a2ed2db35 Merge git://git.kernel.org/pub/scm/linux/kernel/git/sam/kbuild
* git://git.kernel.org/pub/scm/linux/kernel/git/sam/kbuild: (40 commits)
  kbuild: trivial fixes in Makefile
  kbuild: adding symbols in Kconfig and defconfig to TAGS
  kbuild: replace abort() with exit(1)
  kbuild: support for %.symtypes files
  kbuild: fix silentoldconfig recursion
  kbuild: add option for stripping modules while installing them
  kbuild: kill some false positives from modpost
  kbuild: export-symbol usage report generator
  kbuild: fix make -rR breakage
  kbuild: append -dirty for updated but uncommited changes
  kbuild: append git revision for all untagged commits
  kbuild: fix module.symvers parsing in modpost
  kbuild: ignore make's built-in rules & variables
  kbuild: bugfix with initramfs
  kbuild: modpost build fix
  kbuild: check license compatibility when building modules
  kbuild: export-type enhancement to modpost.c
  kbuild: add dependency on kernel.release to the package targets
  kbuild: `make kernelrelease' speedup
  kconfig: KCONFIG_OVERWRITECONFIG
  ...
2006-06-26 11:05:15 -07:00
Linus Torvalds 81a07d7588 Merge branch 'x86-64'
* x86-64: (83 commits)
  [PATCH] x86_64: x86_64 stack usage debugging
  [PATCH] x86_64: (resend) x86_64 stack overflow debugging
  [PATCH] x86_64: msi_apic.c build fix
  [PATCH] x86_64: i386/x86-64 Add nmi watchdog support for new Intel CPUs
  [PATCH] x86_64: Avoid broadcasting NMI IPIs
  [PATCH] x86_64: fix apic error on bootup
  [PATCH] x86_64: enlarge window for stack growth
  [PATCH] x86_64: Minor string functions optimizations
  [PATCH] x86_64: Move export symbols to their C functions
  [PATCH] x86_64: Standardize i386/x86_64 handling of NMI_VECTOR
  [PATCH] x86_64: Fix modular pc speaker
  [PATCH] x86_64: remove sys32_ni_syscall()
  [PATCH] x86_64: Do not use -ffunction-sections for modules
  [PATCH] x86_64: Add cpu_relax to apic_wait_icr_idle
  [PATCH] x86_64: adjust kstack_depth_to_print default
  [PATCH] i386/x86-64: adjust /proc/interrupts column headings
  [PATCH] x86_64: Fix race in cpu_local_* on preemptible kernels
  [PATCH] x86_64: Fix fast check in safe_smp_processor_id
  [PATCH] x86_64: x86_64 setup.c - printing cmp related boottime information
  [PATCH] i386/x86-64/ia64: Move polling flag into thread_info_status
  ...

Manual resolve of trivial conflict in arch/i386/kernel/Makefile
2006-06-26 10:51:09 -07:00
Andi Kleen 495ab9c045 [PATCH] i386/x86-64/ia64: Move polling flag into thread_info_status
During some profiling I noticed that default_idle causes a lot of
memory traffic. I think that is caused by the atomic operations
to clear/set the polling flag in thread_info. There is actually
no reason to make this atomic - only the idle thread does it
to itself, other CPUs only read it. So I moved it into ti->status.

Converted i386/x86-64/ia64 for now because that was the easiest
way to fix ACPI which also manipulates these flags in its idle
function.

Cc: Nick Piggin <npiggin@novell.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Len Brown <len.brown@intel.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 10:48:21 -07:00
Jan Beulich 83f4fcce7f [PATCH] x86_64: allow unwinder to build without module support
Add proper conditionals to be able to build with CONFIG_MODULES=n.

Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 10:48:18 -07:00
Jan Beulich c33bd9aac0 [PATCH] i386/x86-64: fall back to old-style call trace if no unwinding
If no unwinding is possible at all for a certain exception instance,
fall back to the old style call trace instead of not showing any trace
at all.

Also, allow setting the stack trace mode at the command line.

Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 10:48:18 -07:00
Jan Beulich 4552d5dc08 [PATCH] x86_64: reliable stack trace support
These are the generic bits needed to enable reliable stack traces based
on Dwarf2-like (.eh_frame) unwind information. Subsequent patches will
enable x86-64 and i386 to make use of this.

Thanks to Andi Kleen and Ingo Molnar, who pointed out several possibilities
for improvement.

Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 10:48:17 -07:00
Andi Kleen bebfa1013e [PATCH] x86_64: Add compat_printk and sysctl to turn off compat layer warnings
Sometimes e.g. with crashme the compat layer warnings can be noisy.
Add a way to turn them off by gating all output through compat_printk
that checks a global sysctl. The default is not changed.

Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 10:48:16 -07:00
Peter Williams b78709cfd4 [PATCH] sched: fix SCHED_FIFO bug in sys_sched_rr_get_interval()
The introduction of SCHED_BATCH scheduling class with a value of 3 means
that the expression (p->policy & SCHED_FIFO) will return true if policy
is SCHED_BATCH or SCHED_FIFO.

Unfortunately, this expression is used in sys_sched_rr_get_interval()
and in the absence of a comment to say that this is intentional I
presume that it is unintentional and erroneous.

The fix is to change the expression to (p->policy == SCHED_FIFO).

Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 10:02:41 -07:00
Oleg Nesterov cf2dfbfbf4 [PATCH] coredump: copy_process: don't check SIGNAL_GROUP_EXIT
After the previous patch SIGNAL_GROUP_EXIT implies a pending SIGKILL, we
can remove this check from copy_process() because we already checked
!signal_pending().

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:27 -07:00
Oleg Nesterov d5f70c00ad [PATCH] coredump: kill ptrace related stuff
With this patch zap_process() sets SIGNAL_GROUP_EXIT while sending SIGKILL to
the thread group.  This means that a TASK_TRACED task

	1. Will be awakened by signal_wake_up(1)

	2. Can't sleep again via ptrace_notify()

	3. Can't go to do_signal_stop() after return
	   from ptrace_stop() in get_signal_to_deliver()

So we can remove all ptrace related stuff from coredump path.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:27 -07:00
Eric W. Biederman df26c40e56 [PATCH] proc: Cleanup proc_fd_access_allowed
In process of getting proc_fd_access_allowed to work it has developed a few
warts.  In particular the special case that always allows introspection and
the special case to allow inspection of kernel threads.

The special case for introspection is needed for /proc/self/mem.

The special case for kernel threads really should be overridable
by security modules.

So consolidate these checks into ptrace.c:may_attach().

The check to always allow introspection is trivial.

The check to allow access to kernel threads, and zombies is a little
trickier.  mem_read and mem_write already verify an mm exists so it isn't
needed twice.  proc_fd_access_allowed only doesn't want a check to verify
task->mm exits, s it prevents all access to kernel threads.  So just move
the task->mm check into ptrace_attach where it is needed for practical
reasons.

I did a quick audit and none of the security modules in the kernel seem to
care if they are passed a task without an mm into security_ptrace.  So the
above move should be safe and it allows security modules to come up with
more restrictive policy.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:26 -07:00
Eric W. Biederman 13b41b0949 [PATCH] proc: Use struct pid not struct task_ref
Incrementally update my proc-dont-lock-task_structs-indefinitely patches so
that they work with struct pid instead of struct task_ref.

Mostly this is a straight 1-1 substitution.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:26 -07:00
Eric W. Biederman 99f8955183 [PATCH] proc: don't lock task_structs indefinitely
Every inode in /proc holds a reference to a struct task_struct.  If a
directory or file is opened and remains open after the the task exits this
pinning continues.  With 8K stacks on a 32bit machine the amount pinned per
file descriptor is about 10K.

Normally I would figure a reasonable per user process limit is about 100
processes.  With 80 processes, with a 1000 file descriptors each I can trigger
the 00M killer on a 32bit kernel, because I have pinned about 800MB of useless
data.

This patch replaces the struct task_struct pointer with a pointer to a struct
task_ref which has a struct task_struct pointer.  The so the pinning of dead
tasks does not happen.

The code now has to contend with the fact that the task may now exit at any
time.  Which is a little but not muh more complicated.

With this change it takes about 1000 processes each opening up 1000 file
descriptors before I can trigger the OOM killer.  Much better.

[mlp@google.com: task_mmu small fixes]
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Paul Jackson <pj@sgi.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Albert Cahalan <acahalan@gmail.com>
Signed-off-by: Prasanna Meda <mlp@google.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:25 -07:00
Eric W. Biederman 48e6484d49 [PATCH] proc: Rewrite the proc dentry flush on exit optimization
To keep the dcache from filling up with dead /proc entries we flush them on
process exit.  However over the years that code has gotten hairy with a
dentry_pointer and a lock in task_struct and misdocumented as a correctness
feature.

I have rewritten this code to look and see if we have a corresponding entry in
the dcache and if so flush it on process exit.  This removes the extra fields
in the task_struct and allows me to trivially handle the case of a
/proc/<tgid>/task/<pid> entry as well as the current /proc/<pid> entries.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:24 -07:00
Anil S Keshavamurthy e6f47f978b [PATCH] Notify page fault call chain
With this patch Kprobes now registers for page fault notifications only when
their is an active probe registered.  Once all the active probes are
unregistered their is no need to be notified of page faults and kprobes
unregisters itself from the page fault notifications.  Hence we will have ZERO
side effects when no probes are active.

Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:22 -07:00
Anil S Keshavamurthy 3d5631e063 [PATCH] Kprobes registers for notify page fault
Kprobes now registers for page fault notifications.

Signed-off-by: Anil S Keshavamurthy <anil.s.keshavmurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:22 -07:00
mao, bibo 3672165677 [PATCH] Kprobe: multi kprobe posthandler for booster
If there are multi kprobes on the same probepoint, there will be one extra
aggr_kprobe on the head of kprobe list.  The aggr_kprobe has
aggr_post_handler/aggr_break_handler whether the other kprobe
post_hander/break_handler is NULL or not.  This patch modifies this, only
when there is one or more kprobe in the list whose post_handler is not
NULL, post_handler of aggr_kprobe will be set as aggr_post_handler.

[soshima@redhat.com: !CONFIG_PREEMPT fix]
Signed-off-by: bibo, mao <bibo.mao@intel.com>
Cc: Masami Hiramatsu <hiramatu@sdl.hitachi.co.jp>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: "Keshavamurthy, Anil S" <anil.s.keshavamurthy@intel.com>
Cc: Prasanna S Panchamukhi <prasanna@in.ibm.com>
Cc: Jim Keniston <jkenisto@us.ibm.com>
Cc: Yumiko Sugita <sugita@sdl.hitachi.co.jp>
Cc: Hideo Aoki <haoki@redhat.com>
Signed-off-by: Satoshi Oshima <soshima@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:22 -07:00
Roman Zippel 19923c190e [PATCH] fix and optimize clock source update
This fixes the clock source updates in update_wall_time() to correctly
track the time coming in via current_tick_length().  Optimize the fast
paths to be as short as possible to keep the overhead low.

Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Acked-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:21 -07:00
john stultz a275254975 [PATCH] time: rename clocksource functions
As suggested by Roman Zippel, change clocksource functions to use
clocksource_xyz rather then xyz_clocksource to avoid polluting the
namespace.

Signed-off-by: John Stultz <johnstul@us.ibm.com>
Cc: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:21 -07:00
john stultz 5d0cf410e9 [PATCH] Time: i386 Clocksource Drivers
Implement the time sources for i386 (acpi_pm, cyclone, hpet, pit, and tsc).
With this patch, the conversion of the i386 arch to the generic timekeeping
code should be complete.

The patch should be fairly straight forward, only adding the new clocksources.

[hirofumi@mail.parknet.co.jp: acpi_pm cleanup]
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:21 -07:00
john stultz cf3c769b4b [PATCH] Time: Introduce arch generic time accessors
Introduces clocksource switching code and the arch generic time accessor
functions that use the clocksource infrastructure.

Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:20 -07:00
john stultz 5eb6d20533 [PATCH] Time: Use clocksource abstraction for NTP adjustments
Instead of incrementing xtime by tick_nsec + ntp adjustments, use the
clocksource abstraction to increment and scale time.  Using the clocksource
abstraction allows other clocksources to be used consistently in the face of
late or lost ticks, while preserving the existing behavior via the jiffies
clocksource.

This removes the need to keep time_phase adjustments as we just use the
current_tick_length() function as the NTP interface and accumulate time using
shifted nanoseconds.

The basics of this design was by Roman Zippel, however it is my own
interpretation and implementation, so the credit should go to him and the
blame to me.

Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:20 -07:00
john stultz 260a42309b [PATCH] Time: Let user request precision from current_tick_length()
Change the current_tick_length() function so it takes an argument which
specifies how much precision to return in shifted nanoseconds.  This provides
a simple way to convert between NTPs internal nanoseconds shifted by
(SHIFT_SCALE - 10) to other shifted nanosecond units that are used by the
clocksource abstraction.

Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:20 -07:00
john stultz ad596171ed [PATCH] Time: Use clocksource infrastructure for update_wall_time
Modify the update_wall_time function so it increments time using the
clocksource abstraction instead of jiffies.  Since the only clocksource driver
currently provided is the jiffies clocksource, this should result in no
functional change.  Additionally, a timekeeping_init and timekeeping_resume
function has been added to initialize and maintain some of the new timekeping
state.

[hirofumi@mail.parknet.co.jp: fixlet]
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:20 -07:00
john stultz 734efb467b [PATCH] Time: Clocksource Infrastructure
This introduces the clocksource management infrastructure.  A clocksource is a
driver-like architecture generic abstraction of a free-running counter.  This
code defines the clocksource structure, and provides management code for
registering, selecting, accessing and scaling clocksources.

Additionally, this includes the trivial jiffies clocksource, a lowest common
denominator clocksource, provided mainly for use as an example.

[hirofumi@mail.parknet.co.jp: Don't enable IRQ too early]
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:20 -07:00
Ingo Molnar 81615b624a [PATCH] Convert kernel/cpu.c to mutexes
Convert kernel/cpu.c from semaphore to mutex.

I've reviewed all lock_cpu_hotplug() critical sections, and they all seem to
fit mutex semantics.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:16 -07:00
Ingo Molnar 1fb00c6cbd [PATCH] work around ppc64 bootup bug by making mutex-debugging save/restore irqs
It seems ppc64 wants to lock mutexes in early bootup code, with interrupts
disabled, and they expect interrupts to stay disabled, else they crash.

Work around this bug by making mutex debugging variants save/restore irq
flags.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:16 -07:00
Linus Torvalds 3448097fcc Revert "swsusp special saveable pages support" commits
This reverts commits

  3e3318dee0 [PATCH] swsusp: x86_64 mark special saveable/unsaveable pages
  b6370d96e0 [PATCH] swsusp: i386 mark special saveable/unsaveable pages
  ce4ab0012b [PATCH] swsusp: add architecture special saveable pages support

because not only do they apparently cause page faults on x86, the
infrastructure doesn't compile on powerpc.

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 18:41:00 -07:00
Linus Torvalds 72cf2709bf Fix PM_TRACE dependency: works only on 32-bit x86 for now
Not that x86-64 and other architecture support should be difficult to
add (trivial fixups to the data format and add the proper linker script
entry).

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 10:04:15 -07:00
KaiGai Kohei 77787bfb44 [PATCH] pacct: none-delayed process accounting accumulation
In current 2.6.17 implementation, signal_struct refered from task_struct is
used for per-process data structure.  The pacct facility also uses it as a
per-process data structure to store stime, utime, minflt, majflt.  But those
members are saved in __exit_signal().  It's too late.

For example, if some threads exits at same time, pacct facility has a
possibility to drop accountings for a part of those threads.  (see, the
following 'The results of original 2.6.17 kernel') I think accounting
information should be completely collected into the per-process data structure
before writing out an accounting record.

This patch fixes this matter.  Accumulation of stime, utime, minflt and majflt
are done before generating accounting record.

[mingo@elte.hu: fix acct_collect() siglock bug found by lockdep]
Signed-off-by: KaiGai Kohei <kaigai@ak.jp.nec.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 10:01:25 -07:00
KaiGai Kohei f6ec29a42d [PATCH] pacct: avoidance to refer the last thread as a representation of the process
When pacct facility generate an 'ac_flag' field in accounting record, it
refers a task_struct of the thread which died last in the process.  But any
other task_structs are ignored.

Therefore, pacct facility drops ASU flag even if root-privilege operations are
used by any other threads except the last one.  In addition, AFORK flag is
always set when the thread of group-leader didn't die last, although this
process has called execve() after fork().

We have a same matter in ac_exitcode.  The recorded ac_exitcode is an exit
code of the last thread in the process.  There is a possibility this exitcode
is not the group leader's one.
2006-06-25 10:01:25 -07:00
KaiGai Kohei 0e4648141a [PATCH] pacct: add pacct_struct to fix some pacct bugs.
The pacct facility need an i/o operation when an accounting record is
generated.  There is a possibility to wake OOM killer up.  If OOM killer is
activated, it kills some processes to make them release process memory
regions.

But acct_process() is called in the killed processes context before calling
exit_mm(), so those processes cannot release own memory.  In the results, any
processes stop in this point and it finally cause a system stall.
2006-06-25 10:01:25 -07:00
Randy Dunlap 9e37bd301e [PATCH] kthread: move kernel-doc and put it into DocBook
Move kthread API kernel-doc from kthread.h to kthread.c & fix it.
Add kthread API to kernel-api DocBook.

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 10:01:24 -07:00
Randy Dunlap fa9799e33d [PATCH] ktime/hrtimer: fix kernel-doc comments
Fix kernel-doc formatting in ktime.h and hrtimer.[ch] files.

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 10:01:23 -07:00
Heiko Carstens fc75cdfa5b [PATCH] cpu hotplug: fix CPU_UP_CANCEL handling
If a cpu hotplug callback fails on CPU_UP_PREPARE, all callbacks will be
called with CPU_UP_CANCELED.  A few of these callbacks assume that on
CPU_UP_PREPARE a pointer to task has been stored in a percpu array.  This
assumption is not true if CPU_UP_PREPARE fails and the following calls to
kthread_bind() in CPU_UP_CANCELED will cause an addressing exception
because of passing a NULL pointer.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 10:01:22 -07:00
Serge E. Hallyn 8bdd1d1250 [PATCH] kthread: convert stop_machine into a kthread
- Update stop_machine.c to spawn stop_machine as kthreads rather than the
  deprecated kernel_threads.

- Update stop_machine to use the more efficient kthread_bind() before
  running task in place of set_cpus_allowed() after.

[akpm@osdl.org: remove now-wrong set_cpus_allowed()]
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 10:01:22 -07:00
Anton Blanchard 2aa92581fb [PATCH] Link error when futexes are disabled on 64bit architectures
If futexes are disabled we fail to link on ppc64.

Signed-off-by: Anton Blanchard <anton@samba.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 10:01:17 -07:00
akpm@osdl.org 838cd153a5 [PATCH] N32 sigset and __COMPAT_ENDIAN_SWAP__
I'm testing glibc on MIPS64, little-endian, N32, O32 and N64 multilibs.

Among the NPTL test failures seen are some arising from sigsuspend problems
for N32: it blocks the wrong signals, so SIGCANCEL (SIGRTMIN) is blocked
despite glibc's carefully excluding it from sets of signals to block.
Specifically, testing suggests it blocks signal N^32 instead of signal N,
so (in the example tested) blocking SIGUSR1 (17) blocks signal 49 instead.

glibc's sigset_t uses an array of unsigned long, as does the kernel.
In both cases, signal N+1 is represented as
(1UL << (N % (8 * sizeof (unsigned long)))) in word number
(N / (8 * sizeof (unsigned long))).

Thus the N32 glibc uses an array of 32-bit words and the N64 kernel uses an
array of 64-bit words.  For little-endian, the layout is the same, with
signals 1-32 in the first 4 bytes, signals 33-64 in the second, etc.; for
big-endian, userspace has that layout while in the kernel each 8 bytes have
the two halves swapped from the userspace layout.

The N32 sigsuspend syscall uses sigset_from_compat to convert the userspace
sigset to kernel format.  If __COMPAT_ENDIAN_SWAP__ is *not* set, this uses
logic of the form

  set->sig[0] = compat->sig[0] | (((long)compat->sig[1]) << 32 )

to convert the userspace sigset to a kernel one.  This looks correct to me
for both big and little endian, given that in userspace compat->sig[1] will
represent signals 33-64, and so will the high 32 bits of set->sig[0] in the
kernel.  If however __COMPAT_ENDIAN_SWAP__ *is* set, as it is for
__MIPSEL__, it uses

  set->sig[0] = compat->sig[1] | (((long)compat->sig[0]) << 32 );

which seems incorrect for both big and little endian, and would
explain the observed symptoms.

This code is the only use of __COMPAT_ENDIAN_SWAP__, so if incorrect
then that macro serves no purpose, in which case something like the
following patch would seem appropriate to remove it.

Signed-off-by: Joseph Myers <joseph@codesourcery.com>
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 10:01:15 -07:00
Stephen Hemminger eab03ac7bd [PATCH] Get rid of /proc/sys/proc
The table is empty, why does it still exist?

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 10:01:15 -07:00
Jan Engelhardt 3b9c04106b [PATCH] printk time parameter
Currently, enabling/disabling printk timestamps is only possible through
reboot (bootparam) or recompile.  I normally do not run with timestamps
(since syslog handles that in a good manner), but for measuring small
kernel delays (e.g.  irq probing - see parport thread) I needed subsecond
precision, but then again, just for some minutes rather than all kernel
messages to come.  The following patch adds a module_param() with which the
timestamps can be en-/disabled in a live system through
/sys/modules/printk/parameters/printk_time.

Signed-off-by: Jan Engelhardt <jengelh@gmx.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 10:01:13 -07:00
Matt Helsley 11e64757f9 [PATCH] Remove unecessary NULL check in kernel/acct.c
copy_process() appears to be the only caller of acct_clear_integrals() and
does not pass in NULL task pointers.  Remove the unecessary check.

Signed-off-by: Matt Helsley <matthltc@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 10:01:09 -07:00
Andreas Mohr 3b364b8d58 [PATCH] constify parts of kernel/power/
Signed-off-by: Andreas Mohr <andi@lisas.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 10:01:08 -07:00
Andrew Morton b61367732f [PATCH] schedule_on_each_cpu(): reduce kmalloc() size
schedule_on_each_cpu() presently does a large kmalloc - 96 kbytes on 1024 CPU
64-bit.

Rework it so that we do one 8192-byte allocation and then a pile of tiny ones,
via alloc_percpu().  This has a much higher chance of success (100% in the
current VM).

This also has the effect of reducing the memory requirements from NR_CPUS*n to
num_possible_cpus()*n.

Cc: Christoph Lameter <clameter@engr.sgi.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 10:01:07 -07:00
Adrian Bunk 83cc5ed3c4 [PATCH] kernel/sys.c: cleanups
- proper prototypes for the following functions:
  - ctrl_alt_del()  (in include/linux/reboot.h)
  - getrusage()     (in include/linux/resource.h)
- make the following needlessly global functions static:
  - kernel_restart_prepare()
  - kernel_kexec()

[akpm@osdl.org: compile fix]
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 10:01:06 -07:00
Michael Ellerman 76a8ad2939 [PATCH] Make printk work for really early debugging
Currently printk is no use for early debugging because it refuses to
actually print anything to the console unless
cpu_online(smp_processor_id()) is true.

The stated explanation is that console drivers may require per-cpu
resources, or otherwise barf, because the system is not yet setup
correctly.  Fair enough.

However some console drivers might be quite happy running early during
boot, in fact we have one, and so it'd be nice if printk understood that.

So I added a flag (which I would have called CON_BOOT, but that's taken)
called CON_ANYTIME, which indicates that a console is happy to be called
anytime, even if the cpu is not yet online.

Tested on a Power 5 machine, with both a CON_ANYTIME driver and a bogus
console driver that BUG()s if called while offline.  No problems AFAICT.
Built for i386 UP & SMP.

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 10:01:05 -07:00
Alan Stern bbb1747d4e [PATCH] Allow raw_notifier callouts to unregister themselves
Since raw_notifier chains don't benefit from any centralized locking
protections, they shouldn't suffer from the associated limitations.  Under
some circumstances it might make sense for a raw_notifier callout routine
to unregister itself from the notifier chain.  This patch (as678) changes
the notifier core to allow for such things.

Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 10:01:01 -07:00
Paul Mackerras bfe5d83419 [PATCH] Define __raw_get_cpu_var and use it
There are several instances of per_cpu(foo, raw_smp_processor_id()), which
is semantically equivalent to __get_cpu_var(foo) but without the warning
that smp_processor_id() can give if CONFIG_DEBUG_PREEMPT is enabled.  For
those architectures with optimized per-cpu implementations, namely ia64,
powerpc, s390, sparc64 and x86_64, per_cpu() turns into more and slower
code than __get_cpu_var(), so it would be preferable to use __get_cpu_var
on those platforms.

This defines a __raw_get_cpu_var(x) macro which turns into per_cpu(x,
raw_smp_processor_id()) on architectures that use the generic per-cpu
implementation, and turns into __get_cpu_var(x) on the architectures that
have an optimized per-cpu implementation.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 10:01:01 -07:00
Jesper Juhl f867d2a2e5 [PATCH] ensure NULL deref can't possibly happen in is_exported()
If CONFIG_KALLSYMS is defined and if it should happen that is_exported() is
given a NULL 'mod' and lookup_symbol(name, __start___ksymtab,
__stop___ksymtab) returns 0, then we'll end up dereferencing a NULL
pointer.

Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 10:00:59 -07:00
Linus Torvalds eb71c87a49 Add some basic resume trace facilities
Considering that there isn't a lot of hw we can depend on during resume,
this is about as good as it gets.

This is x86-only for now, although the basic concept (and most of the
code) will certainly work on almost any platform.

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-24 14:44:01 -07:00
Eric Sesterhenn 125e18745f [PATCH] More BUG_ON conversion
Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Bartlomiej Zolnierkiewicz <B.Zolnierkiewicz@elka.pw.edu.pl>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Acked-by: "Salyzyn, Mark" <mark_salyzyn@adaptec.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:43:08 -07:00
Jan Beulich 908dcecda1 [PATCH] adjust handle_IRR_event() return type
Correct the return type of handle_IRQ_event() (inconsistency noticed during
Xen development), and remove redundant declarations.  The return type
adjustment required breaking out the definition of irqreturn_t into a
separate header, in order to satisfy current include order dependencies.

Signed-off-by: Jan Beulich <jbeulich@novell.com>

Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Ian Molton <spyro@f2s.com>
Cc: Mikael Starvik <starvik@axis.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Hirokazu Takata <takata.hirokazu@renesas.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:43:08 -07:00
Porpoise 3439dd86e3 [PATCH] When CONFIG_BASE_SMALL=1, cascade() may enter an infinite loop
When CONFIG_BASE_SAMLL=1, cascade() in may enter the infinite loop.
Because of CONFIG_BASE_SMALL=1(TVR_BITS=6 and TVN_BITS=4), the list
base->tv5 may cascade into base->tv5.  So, the kernel enters the infinite
loop in the function cascade().

I created a test module to verify this bug, and a patch to fix it.

#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/init.h>
#include <linux/timer.h>
#if 0
#include <linux/kdb.h>
#else
#define kdb_printf printk
#endif

#define TVN_BITS (CONFIG_BASE_SMALL ? 4 : 6)
#define TVR_BITS (CONFIG_BASE_SMALL ? 6 : 8)
#define TVN_SIZE (1 << TVN_BITS)
#define TVR_SIZE (1 << TVR_BITS)
#define TVN_MASK (TVN_SIZE - 1)
#define TVR_MASK (TVR_SIZE - 1)

#define TV_SIZE(N)  (N*TVN_BITS  + TVR_BITS)

struct timer_list timer0;
struct timer_list dummy_timer1;
struct timer_list dummy_timer2;

void dummy_timer_fun(unsigned long data) {
}
unsigned long j=0;
void check_timer_base(unsigned long data)
{
        kdb_printf("check_timer_base %08x\n",jiffies);
        mod_timer(&timer0,(jiffies & (~0xFFF)) + 0x1FFF);
}

int init_module(void)
{
        init_timer(&timer0);
        timer0.data = (unsigned long)0;
        timer0.function = check_timer_base;
        mod_timer(&timer0,jiffies+1);

        init_timer(&dummy_timer1);
        dummy_timer1.data = (unsigned long)0;
        dummy_timer1.function = dummy_timer_fun;

        init_timer(&dummy_timer2);
        dummy_timer2.data = (unsigned long)0;
        dummy_timer2.function = dummy_timer_fun;

        j=jiffies;
        j&=(~((1<<TV_SIZE(3))-1));
        j+=(1<<TV_SIZE(3));
        j+=(1<<TV_SIZE(4));

        kdb_printf("mod_timer %08x\n",j);

        mod_timer(&dummy_timer1, j );
        mod_timer(&dummy_timer2, j );

        return 0;
}

void cleanup_module()
{
        del_timer_sync(&timer0);
        del_timer_sync(&dummy_timer1);
        del_timer_sync(&dummy_timer2);
}

(Cleanups from Oleg)

[oleg@tv-sign.ru: use list_replace_init()]
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Matt Mackall <mpm@selenic.com>
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:43:08 -07:00
Oleg Nesterov 626ab0e69d [PATCH] list: use list_replace_init() instead of list_splice_init()
list_splice_init(list, head) does unneeded job if it is known that
list_empty(head) == 1.  We can use list_replace_init() instead.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:43:07 -07:00
Randy Dunlap 862f5f0133 [PATCH] Doc: add audit & acct to DocBook
Fix one audit kernel-doc description (one parameter was missing).
Add audit*.c interfaces to DocBook.
Add BSD accounting interfaces to DocBook.

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:43:07 -07:00
Paul E. McKenney d83015b8f6 [PATCH] Make RCU API inaccessible to non-GPL Linux kernel modules
Remove synchronize_kernel() (deprecated 2-APR-2005 in
http://lkml.org/lkml/2005/4/3/11) and makes the RCU API inaccessible to
non-GPL Linux kernel modules (as was announced more than one year ago in
http://lkml.org/lkml/2005/4/3/8).  Tested on x86 and ppc64.

Signed-off-by: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:43:07 -07:00
Jes Sorensen 55f4e8d156 [PATCH] kernel/sys.c doesn't need init.h
kernel/sys.c doesn't have anything in it relying on linux/init.h -
remove the include.

Signed-off-by: Jes Sorensen <jes@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:43:07 -07:00
Andrew Morton 57ae250861 [PATCH] CONFIG_NET=n build fix
Cc: Greg KH <greg@kroah.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:43:06 -07:00
Andreas Mohr 83d4e6e7fb [PATCH] make noirqdebug/irqfixup __read_mostly, add (un)likely()
Signed-off-by: Andreas Mohr <andi@lisas.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:43:05 -07:00
Daniel Walker 89d0cf01c0 [PATCH] invert irq/migration.c brach prediction
If you get to that point in the code it means that desc->move_irq is set,
pending_irq_cpumask[irq] and cpu_online_map should have a value.  Still
pretty good chance anding those two you'll still have a value.  So these
two branch predictors should be inverted.

Signed-off-by: Daniel Walker <dwalker@mvista.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:43:04 -07:00
Ingo Molnar 8e0a43d8fa [PATCH] cond_resched() might_sleep() fix
add the __might_sleep() check back to cond_resched().

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:43:04 -07:00
Prasanna Meda 6e66726047 [PATCH] dup fd error fix
Set errorp in dup_fd, it will be used in sys_unshare also.

Signed-off-by: Prasanna Meda <mlp@google.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:43:04 -07:00
Andrew Morton 0ae26f1b31 [PATCH] mmput() might sleep
exit_aio() and exit_mmap() can sleep.  But it's easy to accidentally call
mmput() from inside locks.

Cc: Dave Peterson <dsp@llnl.gov>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:43:03 -07:00
Jeff Moyer c330dda908 [PATCH] Add a sysfs file to determine if a kexec kernel is loaded
Create two files in /sys/kernel, kexec_loaded and kexec_crash_loaded.  Each
file contains a simple boolean value indicating whether the relevant kernel
has been loaded into memory.  The motivation for this is geared around
support.

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:43:02 -07:00
Rafael J. Wysocki 968808b895 [PATCH] swsusp: use less memory during resume
Make swsusp allocate only as much memory as needed to store the image data
and metadata during resume.

Without this patch swsusp additionally allocates many page frames that will
conflict with the "original" locations of the image data and are considered
as "unsafe", treating them as "eaten" pages (ie.  allocated but unusable).

The patch makes swsusp allocate as many pages as it'll need to store the
data read from the image in one shot, creating a list of allocated "safe"
pages, and use the observation that all pages allocated by it are marked
with the PG_nosave and PG_nosave_free flags set.   Namely, when it's about
to load an image page, swsusp can check whether the page frame
corresponding to the "original" location of this page has been allocated
(ie.  if the page frame has the PG_nosave and PG_nosave_free flags set) and
if so, it can load the page directly into this page frame.   Otherwise it
uses an allocated "safe" page from the list to store the data that will be
copied to their "original" location later on.

This allows us to save many page copyings and page allocations during
resume and in the future it may allow us to load images greater than 50% of
the normal zone.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: "Pavel Machek" <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:43:00 -07:00
Adrian Bunk 7bff24e255 [PATCH] kernel/power/snapshot.c: cleanups
- make needlessly global functions static
- make dummy functions static inline

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:42:59 -07:00
Rafael J. Wysocki a938c356d5 [PATCH] swsusp: take lowmem reserves into account
swsusp allocates memory from the normal zone, so it cannot use lowmem
reserve pages from the lower zones.  Therefore it should not count these
pages as available to it.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:42:59 -07:00
Shaohua Li ce4ab0012b [PATCH] swsusp: add architecture special saveable pages support
1. Add architecture specific pages save/restore support.  Next two patches
   will use this to save/restore 'ACPI NVS' pages.

2. Allow reserved pages 'nosave'.  This could avoid save/restore BIOS
   reserved pages.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Nigel Cunningham <nigel@suspend2.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:42:59 -07:00
Zhang Yanmin 1b61b910e9 [PATCH] x86: kernel irq balance doesn't work
On i386, kernel irq balance doesn't work.

1) In function do_irq_balance, after kernel finds the min_loaded cpu but
   before calling set_pending_irq to really pin the selected_irq to the
   target cpu, kernel does a cpus_and with irq_affinity[selected_irq].
   Later on, when the irq is acked, kernel would calls
   move_native_irq=>desc->handler->set_affinity to change the irq affinity.
    However, every function pointed by
   hw_interrupt_type->set_affinity(unsigned int irq, cpumask_t cpumask)
   always changes irq_affinity[irq] to cpumask.  Next time when recalling
   do_irq_balance, it has to do cpu_ands again with
   irq_affinity[selected_irq], but irq_affinity[selected_irq] already
   becomes one cpu selected by the first irq balance.

2) Function balance_irq in file arch/i386/kernel/io_apic.c has the same
   issue.

[akpm@osdl.org: cleanups]
Signed-off-by: Zhang Yanmin <yanmin.zhang@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:42:57 -07:00