Commit Graph

2289 Commits

Author SHA1 Message Date
Jan Beulich 98011f569e mm: fix improper .init-type section references
.. which modpost started warning about.

Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16 09:05:36 -07:00
KAMEZAWA Hiroyuki f0c0b2b808 change zonelist order: zonelist order selection logic
Make zonelist creation policy selectable from sysctl/boot option v6.

This patch makes NUMA's zonelist (of pgdat) order selectable.
Available order are Default(automatic)/ Node-based / Zone-based.

[Default Order]
The kernel selects Node-based or Zone-based order automatically.

[Node-based Order]
This policy treats the locality of memory as the most important parameter.
Zonelist order is created by each zone's locality. This means lower zones
(ex. ZONE_DMA) can be used before higher zone (ex. ZONE_NORMAL) exhausion.
IOW. ZONE_DMA will be in the middle of zonelist.
current 2.6.21 kernel uses this.

Pros.
 * A user can expect local memory as much as possible.
Cons.
 * lower zone will be exhansted before higher zone. This may cause OOM_KILL.

Maybe suitable if ZONE_DMA is relatively big and you never see OOM_KILL
because of ZONE_DMA exhaution and you need the best locality.

(example)
assume 2 node NUMA. node(0) has ZONE_DMA/ZONE_NORMAL, node(1) has ZONE_NORMAL.

*node(0)'s memory allocation order:

 node(0)'s NORMAL -> node(0)'s DMA -> node(1)'s NORMAL.

*node(1)'s memory allocation order:

 node(1)'s NORMAL -> node(0)'s NORMAL -> node(0)'s DMA.

[Zone-based order]
This policy treats the zone type as the most important parameter.
Zonelist order is created by zone-type order. This means lower zone
never be used bofere higher zone exhaustion.
IOW. ZONE_DMA will be always at the tail of zonelist.

Pros.
 * OOM_KILL(bacause of lower zone) occurs only if the whole zones are exhausted.
Cons.
 * memory locality may not be best.

(example)
assume 2 node NUMA. node(0) has ZONE_DMA/ZONE_NORMAL, node(1) has ZONE_NORMAL.

*node(0)'s memory allocation order:

 node(0)'s NORMAL -> node(1)'s NORMAL -> node(0)'s DMA.

*node(1)'s memory allocation order:

 node(1)'s NORMAL -> node(0)'s NORMAL -> node(0)'s DMA.

bootoption "numa_zonelist_order=" and proc/sysctl is supporetd.

command:
%echo N > /proc/sys/vm/numa_zonelist_order

Will rebuild zonelist in Node-based order.

command:
%echo Z > /proc/sys/vm/numa_zonelist_order

Will rebuild zonelist in Zone-based order.

Thanks to Lee Schermerhorn, he gives me much help and codes.

[Lee.Schermerhorn@hp.com: add check_highest_zone to build_zonelists_in_zone_order]
[akpm@linux-foundation.org: build fix]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Andi Kleen <ak@suse.de>
Cc: "jesse.barnes@intel.com" <jesse.barnes@intel.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16 09:05:35 -07:00
Yinghai Lu 18a8bd949d serial: convert early_uart to earlycon for 8250
Beacuse SERIAL_PORT_DFNS is removed from include/asm-i386/serial.h and
include/asm-x86_64/serial.h.  the serial8250_ports need to be probed late in
serial initializing stage.  the console_init=>serial8250_console_init=>
register_console=>serial8250_console_setup will return -ENDEV, and console
ttyS0 can not be enabled at that time.  need to wait till uart_add_one_port in
drivers/serial/serial_core.c to call register_console to get console ttyS0.
that is too late.

Make early_uart to use early_param, so uart console can be used earlier.  Make
it to be bootconsole with CON_BOOT flag, so can use console handover feature.
and it will switch to corresponding normal serial console automatically.

new command line will be:
	console=uart8250,io,0x3f8,9600n8
	console=uart8250,mmio,0xff5e0000,115200n8
or
	earlycon=uart8250,io,0x3f8,9600n8
	earlycon=uart8250,mmio,0xff5e0000,115200n8

it will print in very early stage:
	Early serial console at I/O port 0x3f8 (options '9600n8')
	console [uart0] enabled
later for console it will print:
	console handover: boot [uart0] -> real [ttyS0]

Signed-off-by: <yinghai.lu@sun.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Bjorn Helgaas <bjorn.helgaas@hp.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Gerd Hoffmann <kraxel@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16 09:05:35 -07:00
Yinghai Lu d37bf60de0 console: console handover to preferred console
for earlyprintk=ttyS0,9600 console=tty0 console=ttyS0,9600n8

the handover will happen from earlyser0 to tty0.  but what we want is to
hand over to ttyS0.

Later with serial-convert-early_uart-to-earlycon-for-8250.patch,

	console=tty0 console=uart8250,io,0x3f8,9600n8

will handover to ttyS0 instead of tty0.

Signed-off-by: Yinghai Lu <yinghai.lu@sun.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Bjorn Helgaas <bjorn.helgaas@hp.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Gerd Hoffmann <kraxel@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16 09:05:34 -07:00
Yinghai Lu eaa944afb2 console: more buf for index parsing
Change name to buf according to the usage as name + index

Signed-off-by: Yinghai Lu <yinghai.lu@sun.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Bjorn Helgaas <bjorn.helgaas@hp.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Gerd Hoffmann <kraxel@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16 09:05:34 -07:00
Avi Kivity ac076758b9 HOTPLUG: Adapt cpuset hotplug callback to CPU_DYING
CPU_DYING is called in atomic context, so don't try to take any locks.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-07-16 12:05:49 +03:00
Avi Kivity db912f9639 HOTPLUG: Add CPU_DYING notifier
KVM wants a notification when a cpu is about to die, so it can disable
hardware extensions, but at a time when user processes cannot be scheduled
on the cpu, so it doesn't try to use virtualization extensions after they
have been disabled.

This adds a CPU_DYING notification.  The notification is called in atomic
context on the doomed cpu.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-07-16 12:05:49 +03:00
Ingo Molnar e4af30be8f [PATCH] sched: prettify prio_to_wmult[]
prettify the prio_to_wmult[] array. (this could have saved us from the typos)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-16 09:46:31 +02:00
Ingo Molnar 5714d2de93 [PATCH] sched: document prio_to_wmult[]
document prio_to_wmult[].

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-16 09:46:31 +02:00
Ingo Molnar f9153ee6c7 [PATCH] sched: improve weight-array comments
improve the comments around the wmult array (which controls the weight
of niced tasks). Clarify that to achieve a 10% difference in CPU
utilization, a weight multiplier of 1.25 has to be used.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-16 09:46:30 +02:00
Thomas Gleixner 4fd885170b CFS: Fix missing digit off in wmult table
Roman Zippel noticed another inconsistency of the wmult table.

wmult[16] has a missing digit.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-13 16:45:43 -07:00
Linus Torvalds 12a2296054 Merge branch 'splice-2.6.23' of git://git.kernel.dk/data/git/linux-2.6-block
* 'splice-2.6.23' of git://git.kernel.dk/data/git/linux-2.6-block:
  splice: fix offset mangling with direct splicing (sendfile)
  security: revalidate rw permissions for sys_splice and sys_vmsplice
  relay: fixup kerneldoc comment
  relay: fix bogus cast in subbuf_splice_actor()
2007-07-13 10:51:07 -07:00
Linus Torvalds aba2da66cf Merge git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched
* git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched:
  [PATCH] sched: small topology.h cleanup
  [PATCH] sched: fix show_task()/show_tasks() output
  [PATCH] sched: remove stale version info from kernel/sched_debug.c
  [PATCH] sched: allow larger granularity
  [PATCH] sched: fix prio_to_wmult[] for nice 1

[ I re-did the commits to get rid of some bogus merge commit that
  Ingo had. - Linus ]

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-13 10:13:37 -07:00
Ingo Molnar 4bd77321a8 [PATCH] sched: fix show_task()/show_tasks() output
fix show_task()/show_tasks() output:

- there's no sibling info anymore

- the fields were not aligned properly with the description

- get rid of the lazy-TLB output: it's been quite some time since
  we last had a bug there, and when we had a bug it wasnt helped a
  bit by this debug output.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-13 10:11:17 -07:00
Ingo Molnar 45f384a64f [PATCH] sched: remove stale version info from kernel/sched_debug.c
kernel/sched_debug.c referred to CFS -v20, but there's no CFS versioning
needed within the upstream kernel.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-13 10:10:41 -07:00
Ingo Molnar a5968df873 [PATCH] sched: allow larger granularity
Allow granularity up to 100 msecs, instead of 10 msecs.
(needed on larger boxes)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-13 10:10:08 -07:00
Mike Galbraith e127031f4f [PATCH] sched: fix prio_to_wmult[] for nice 1
There's a typo in the values in prio_to_wmult[] for nice level 1.  While
it did not cause bad CPU distribution, but caused more rescheduling
between nice-0 and nice-1 tasks than necessary.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-13 10:09:02 -07:00
Tom Zanussi d3f35d98b3 relay: fixup kerneldoc comment
Change comment from kerneldoc to normal.

Signed-off-by: Tom Zanussi <zanussi@us.ibm.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-07-13 14:14:28 +02:00
Tom Zanussi 24da24de2e relay: fix bogus cast in subbuf_splice_actor()
The current code that sets the read position in subbuf_splice_actor may
give erroneous results if the buffer size isn't a power of 2.  This
patch fixes the problem.

Signed-off-by: Tom Zanussi <zanussi@us.ibm.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-07-13 14:14:26 +02:00
Linus Torvalds bb50cbbd4b Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/selinux-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/selinux-2.6:
  security: unexport mmap_min_addr
  SELinux: use SECINITSID_NETMSG instead of SECINITSID_UNLABELED for NetLabel
  security: Protection for exploiting null dereference using mmap
  SELinux: Use %lu for inode->i_no when printing avc
  SELinux: allow preemption between transition permission checks
  selinux: introduce schedule points in policydb_destroy()
  selinux: add selinuxfs structure for object class discovery
  selinux: change sel_make_dir() to specify inode counter.
  selinux: rename sel_remove_bools() for more general usage.
  selinux: add support for querying object classes and permissions from the running policy
2007-07-12 13:46:48 -07:00
Eric Paris ed03218951 security: Protection for exploiting null dereference using mmap
Add a new security check on mmap operations to see if the user is attempting
to mmap to low area of the address space.  The amount of space protected is
indicated by the new proc tunable /proc/sys/vm/mmap_min_addr and defaults to
0, preserving existing behavior.

This patch uses a new SELinux security class "memprotect."  Policy already
contains a number of allow rules like a_t self:process * (unconfined_t being
one of them) which mean that putting this check in the process class (its
best current fit) would make it useless as all user processes, which we also
want to protect against, would be allowed. By taking the memprotect name of
the new class it will also make it possible for us to move some of the other
memory protect permissions out of 'process' and into the new class next time
we bump the policy version number (which I also think is a good future idea)

Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
Acked-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>
2007-07-11 22:52:29 -04:00
Tejun Heo 7b595756ec sysfs: kill unnecessary attribute->owner
sysfs is now completely out of driver/module lifetime game.  After
deletion, a sysfs node doesn't access anything outside sysfs proper,
so there's no reason to hold onto the attribute owners.  Note that
often the wrong modules were accounted for as owners leading to
accessing removed modules.

This patch kills now unnecessary attribute->owner.  Note that with
this change, userland holding a sysfs node does not prevent the
backing module from being unloaded.

For more info regarding lifetime rule cleanup, please read the
following message.

  http://article.gmane.org/gmane.linux.kernel/510293

(tweaked by Greg to not delete the field just yet, to make it easier to
merge things properly.)

Signed-off-by: Tejun Heo <htejun@gmail.com>
Cc: Cornelia Huck <cornelia.huck@de.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-07-11 16:09:06 -07:00
Jens Axboe cac36bb06e pipe: change the ->pin() operation to ->confirm()
The name 'pin' was badly chosen, it doesn't pin a pipe buffer
in the most commonly used sense in the kernel. So change the
name to 'confirm', after debating this issue with Hugh
Dickins a bit.

A good return from ->confirm() means that the buffer is really
there, and that the contents are good.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-07-10 08:04:15 +02:00
Jens Axboe 1db60cf205 relay: use splice_to_pipe() instead of open-coding the pipe loop
It cleans up the relay splice implementation a lot, and gets rid of
a lot of internal pipe knowledge that should not be in there.

Plus fixes for padding and partial first page (and lots more) from
Tom Zanussi.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-07-10 08:04:15 +02:00
Jens Axboe d6b29d7cee splice: divorce the splice structure/function definitions from the pipe header
We need to move even more stuff into the header so that folks can use
the splice_to_pipe() implementation instead of open-coding a lot of
pipe knowledge (see relay implementation), so move to our own header
file finally.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-07-10 08:04:14 +02:00
Tom Zanussi ebf9909343 splice: relay support
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-07-10 08:04:14 +02:00
Ingo Molnar c31f2e8a42 sched: add CFS credits
add credits for recent major scheduler contributions:

  Con Kolivas, for pioneering the fair-scheduling approach
  Peter Williams, for smpnice
  Mike Galbraith, for interactivity tuning of CFS
  Srivatsa Vaddagiri, for group scheduling enhancements

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09 18:52:01 +02:00
Ingo Molnar 0fec171cdb sched: clean up sleep_on() APIs
clean up the sleep_on() APIs:

 - do not use fastcall
 - replace fragile macro magic with proper inline functions

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09 18:52:01 +02:00
Ingo Molnar 9761eea851 sched: style cleanups
4 small style cleanups to sched.c: checkpatch.pl is now happy about
the totality of sched.c [ignoring false positives] - yay! ;-)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09 18:52:00 +02:00
Ingo Molnar 23bdd703a5 sched: do not set softirqs to nice +19
do not set softirqs to nice +19. _If_ for whatever reason
we missed to process some high-prio softirq and woke up
ksoftirqd, we should give it a fair chance to actually
get some work done, even if the system is under load.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09 18:52:00 +02:00
Ingo Molnar 43ae34cb4c sched: scheduler debugging, core
scheduler debugging core: implement /proc/sched_debug and
/proc/<PID>/sched files for scheduler debugging.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09 18:52:00 +02:00
Ingo Molnar 77e54a1f88 sched: add CFS debug sysctls
add CFS debug sysctls: only tweakable if SCHED_DEBUG is enabled.
This allows for faster debugging of scheduler problems.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09 18:52:00 +02:00
Ingo Molnar b2cfba19f6 sched: remove unused rq types from sched.c
remove unused rq types from sched.c, now that we switched
over to CFS.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09 18:52:00 +02:00
Ingo Molnar 634fa8c97c sched: remove interactivity types
remove now unused interactivity-heuristics related defined and
types of the old scheduler.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09 18:52:00 +02:00
Ingo Molnar dff06c157b sched: clean up include files in sched.c
clean up include files in sched.c, they were still old-style <asm/>.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09 18:52:00 +02:00
Balbir Singh 172ba844a8 sched: update delay-accounting to use CFS's precise stats
update delay-accounting to use CFS's precise stats.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09 18:52:00 +02:00
Ingo Molnar 1b9f19c212 sched: turn on the use of unstable events
make use of sched-clock-unstable events.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09 18:51:59 +02:00
Ingo Molnar bb29ab2686 sched: x86, track TSC-unstable events
track TSC-unstable events and propagate it to the scheduler code.
Also allow sched_clock() to be used when the TSC is unstable,
the rq_clock() wrapper creates a reliable clock out of it.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09 18:51:59 +02:00
Ingo Molnar dd41f596cd sched: cfs core code
apply the CFS core code.

this change switches over the scheduler core to CFS's modular
design and makes use of kernel/sched_fair/rt/idletask.c to implement
Linux's scheduling policies.

thanks to Andrew Morton and Thomas Gleixner for lots of detailed review
feedback and for fixlets.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
2007-07-09 18:51:59 +02:00
Ingo Molnar f3479f10c5 sched: remove the sleep-bonus interactivity code
remove the sleep-bonus interactivity code from the core scheduler.

scheduling policy is implemented in the policy modules, and CFS does
not need such type of heuristics.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09 18:51:59 +02:00
Ingo Molnar c18a17329b sched: remove expired_starving()
remove the expired_starving() heuristics from the core scheduler.

CFS does not need it, and this did not really work well in practice
anyway, due to the rq->nr_running multiplier to STARVATION_LIMIT.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09 18:51:59 +02:00
Ingo Molnar f2ac58ee61 sched: remove sleep_type
remove the sleep_type heuristics from the core scheduler - scheduling
policy is implemented in the scheduling-policy modules. (and CFS does
not use this type of sleep-type heuristics)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09 18:51:59 +02:00
Ingo Molnar 45bf76df48 sched: cfs, add load-calculation methods
add the new load-calculation methods of CFS.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09 18:51:59 +02:00
Ingo Molnar 14531189f0 sched: clean up __normal_prio() position
clean up: move __normal_prio() in head of normal_prio().

no code changed.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09 18:51:59 +02:00
Ingo Molnar 71f8bd4600 sched: cleanup: move dequeue/enqueue_task()
cleanup: move dequeue/enqueue_task() to a more logical place, to
not split up __normal_prio()/normal_prio().

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09 18:51:59 +02:00
Ingo Molnar c24d20dbef sched: move around resched_task()
move resched_task()/resched_cpu() into the 'public interfaces'
section of sched.c, for use by kernel/sched_fair/rt/idletask.c

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09 18:51:59 +02:00
Ingo Molnar e05606d330 sched: clean up the rt priority macros
clean up the rt priority macros, pointed out by Andrew Morton.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09 18:51:59 +02:00
Ingo Molnar 138a8aeb5b sched: add cfs_rq ops
add the set_task_cfs_rq() abstraction needed by CONFIG_FAIR_GROUP_SCHED.

(not activated yet)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09 18:51:58 +02:00
Ingo Molnar 41b86e9c51 sched: make posix-cpu-timers use CFS's accounting information
update the posix-cpu-timers code to use CFS's CPU accounting information.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09 18:51:58 +02:00
Ingo Molnar 20d315d42a sched: add rq_clock()/__rq_clock()
add rq_clock()/__rq_clock(), a robust wrapper around sched_clock(),
used by CFS. It protects against common type of sched_clock() problems
(caused by hardware): time warps forwards and backwards.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09 18:51:58 +02:00
Ingo Molnar 6aa645ea5f sched: cfs rq data types
add the CFS rq data types to sched.c.

(the old scheduler fields are still intact, they are removed
 by a later patch)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09 18:51:58 +02:00
Ingo Molnar fa72e9e484 sched: cfs core, kernel/sched_idletask.c
add kernel/sched_idletask.c - which implements the idle thread
scheduling class. This further simplifies sched.c (under CFS),
for example a number of 'if (p == rq->idle)' type of special-cases
can be removed from sched.c, and schedule() gets simpler too.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09 18:51:58 +02:00
Ingo Molnar bb44e5d1c6 sched: cfs core, kernel/sched_rt.c
add kernel/sched_rt.c: SCHED_FIFO/SCHED_RR support. The behavior
and semantics of SCHED_FIFO/SCHED_RR tasks is unchanged.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09 18:51:58 +02:00
Ingo Molnar bf0f6f24a1 sched: cfs core, kernel/sched_fair.c
add kernel/sched_fair.c - which implements the bulk of CFS's
behavioral changes for SCHED_OTHER tasks.

see Documentation/sched-design-CFS.txt about details.

Authors:

 Ingo Molnar <mingo@elte.hu>
 Dmitry Adamushko <dmitry.adamushko@gmail.com>
 Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
 Mike Galbraith <efault@gmx.de>

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
2007-07-09 18:51:58 +02:00
Ingo Molnar 425e0968a2 sched: move code into kernel/sched_stats.h
create sched_stats.h and move sched.c schedstats code into it.
This cleans up sched.c a bit.

no code changes are caused by this patch.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09 18:51:58 +02:00
Ingo Molnar 1df21055e3 sched: add init_idle_bootup_task()
add the init_idle_bootup_task() callback to the bootup thread,
unused at the moment. (CFS will use it to switch the scheduling
class of the boot thread to the idle class)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09 18:51:58 +02:00
Ingo Molnar f64f61145a sched: remove sched_exit()
remove sched_exit(): the elaborate dance of us trying to recover
timeslices given to child tasks never really worked.

CFS does not need it either.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09 18:51:58 +02:00
Ingo Molnar c65cc87052 sched: uninline set_task_cpu()
uninline set_task_cpu(): CFS will add more code to it.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09 18:51:58 +02:00
Ingo Molnar 0437e109e1 sched: zap the migration init / cache-hot balancing code
the SMP load-balancer uses the boot-time migration-cost estimation
code to attempt to improve the quality of balancing. The reason for
this code is that the discrete priority queues do not preserve
the order of scheduling accurately, so the load-balancer skips
tasks that were running on a CPU 'recently'.

this code is fundamental fragile: the boot-time migration cost detector
doesnt really work on systems that had large L3 caches, it caused boot
delays on large systems and the whole cache-hot concept made the
balancing code pretty undeterministic as well.

(and hey, i wrote most of it, so i can say it out loud that it sucks ;-)

under CFS the same purpose of cache affinity can be achieved without
any special cache-hot special-case: tasks are sorted in the 'timeline'
tree and the SMP balancer picks tasks from the left side of the
tree, thus the most cache-cold task is balanced automatically.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09 18:51:57 +02:00
Ingo Molnar d15bcfdbe1 sched: rename idle_type/SCHED_IDLE
enum idle_type (used by the load-balancer) clashes with the
SCHED_IDLE name that we want to introduce. 'CPU_IDLE' instead
of 'SCHED_IDLE' is more descriptive as well.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09 18:51:57 +02:00
Thomas Gleixner 746976a301 NTP: remove clock_was_set() call to prevent deadlock
The clock_was_set() call in seconds_overflow() which happens only when
leap seconds are inserted / deleted is wrong in two aspects:

1. it results in a call to on_each_cpu() with interrupts disabled
2. it is potential deadlock source vs. call_lock in smp_call_function()

The only possible side effect of the removal might be, that an absolute
CLOCK_REALTIME timer fires 1 second too late, in the rare case of leap
second deletion and an absolute CLOCK_REALTIME timer which expires in
the affected time frame. It will never fire too early.

This was probably observed by the reporter of a June 30th -> July 1st
hang: http://lkml.org/lkml/2007/7/3/103

A similar problem was observed by Dave Jones, who provided a screen shot
with a lockdep back trace, which allowed to analyse the problem.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-03 13:54:27 -07:00
Rafael J. Wysocki 2391dae3e3 PM: introduce set_target method in pm_ops
Commit 52ade9b3b9 changed the suspend code
ordering to execute pm_ops->prepare() after the device model per-device
.suspend() calls in order to fix some ACPI-related issues.  Unfortunately, it
broke the at91 platform which assumed that pm_ops->prepare() would be called
before suspending devices.

at91 used pm_ops->prepare() to get notified of the target system sleep state,
so that it could use this information while suspending devices.  However, with
the current suspend code ordering pm_ops->prepare() is called too late for
this purpose.  Thus, at91 needs an additional method in 'struct pm_ops' that
will be used for notifying the platform of the target system sleep state.
Moreover, in the future such a method will also be needed by ACPI.

This patch adds the .set_target() method to 'struct pm_ops' and makes the
suspend code call it, if implemented, before executing the device model
per-device .suspend() calls.  It also modifies the at91 code to use
pm_ops->set_target() instead of pm_ops->prepare().

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: David Brownell <dbrownell@users.sourceforge.net>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Len Brown <lenb@kernel.org>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-01 12:29:44 -07:00
Masami Hiramatsu a66e356c04 relayfs: fix overwrites
When I use relayfs with "overwrite" mode, read() still sets incorrect
number of consumed bytes.

Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Acked-by: Tom Zanussi <zanussi@us.ibm.com>
Acked-by: David Wilder <dwilder@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-28 11:38:18 -07:00
David Wilder 8d62fdebda relay file read: start-pos fix
Fix a bug in the relay read interface causing the number of consumed bytes
to be set incorrectly.

Signed-off-by: Tom Zanussi <zanussi@us.ibm.com>
Signed-off-by: David Wilder <dwilder@us.ibm.com>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-28 11:34:54 -07:00
Thomas Gleixner a06381fec7 FUTEX: Restore the dropped ERSCH fix
The return value of futex_find_get_task() needs to be -ESRCH in case
that the search fails.  This was part of the original futex fixes and
got accidentally dropped, when the futex-tidy-up patch was split out.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Stable Team <stable@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-24 12:08:53 -07:00
Tony Jones 7b018b2888 audit: fix oops removing watch if audit disabled
Removing a watched file will oops if audit is disabled (auditctl -e 0).

To reproduce:
- auditctl -e 1
- touch /tmp/foo
- auditctl -w /tmp/foo
- auditctl -e 0
- rm /tmp/foo (or mv)

Signed-off-by: Tony Jones <tonyj@suse.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-24 08:59:12 -07:00
Christoph Lameter 92c4ca5c3a sched: fix next_interval determination in idle_balance()
The intervals of domains that do not have SD_BALANCE_NEWIDLE must be
considered for the calculation of the time of the next balance.  Otherwise
we may defer rebalancing forever.

Siddha also spotted that the conversion of the balance interval
to jiffies is missing. Fix that to.

From: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>

also continue the loop if !(sd->flags & SD_LOAD_BALANCE).

Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

It did in fact trigger under all three of mainline, CFS, and -rt including CFS
-- see below for a couple of emails from last Friday giving results for these
three on the AMD box (where it happened) and on a single-quad NUMA-Q system
(where it did not, at least not with such severity).

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-24 08:59:11 -07:00
Cedric Le Goater 4e71e474c7 fix refcounting of nsproxy object when unshared
When a namespace is unshared, a refcount on the previous nsproxy is
abusively taken, leading to a memory leak of nsproxy objects.

Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-24 08:59:10 -07:00
Thomas Gleixner 58229a1899 posix-timers: Prevent softirq starvation by small intervals and SIG_IGN
posix-timers which deliver an ignored signal are currently rearmed in
the timer softirq: This is necessary because the timer needs to be
delivered again when SIG_IGN is removed. This is not a problem, when
the interval is reasonable.

With high resolution timers enabled one might arm a posix timer with a
very small interval and ignore the signal. This might lead to a
softirq starvation when the interval is so small that the timer is
requeued onto the softirq pending list right away.

This problem was pointed out by Jan Kiszka. Thanks Jan !

The correct solution would be to stop the timer, when the signal is
ignored and rearm it when SIG_IGN is removed. Unfortunately this
requires modification in sigaction and involves non trivial sighand
locking. It's too late in the release cycle for such a change.

For now we just keep the timer running and enforce that the timer only
fires every jiffie. This does not break anything as we keep the
overrun counter correct. It adds a little inaccuracy to the
timer_gettime() interface, but...

The more complex change is necessary anyway to fix another short
coming of the current implementation, which I discovered while looking
at this problem: A pending signal is discarded when SIG_IGN is set. In
case that a posixtimer signal is pending then it is discarded as well,
but when SIG_IGN is removed later nothing rearms the timer. This is
not new, it's that way since posix timers have been merged. So nothing
to worry about right now.

I have a working solution to fix all of this, but the impact is too
large for both stable and 2.6.22. I'm going to send it out for review
in the next days.

This should go into 2.6.21.stable as well.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Jan Kiszka <jan.kiszka@web.de>
Cc: Ulrich Drepper <drepper@redhat.com>
Cc: Stable Team <stable@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-21 15:57:04 -07:00
Linus Torvalds fa490cfd15 Fix possible runqueue lock starvation in wait_task_inactive()
Miklos Szeredi reported very long pauses (several seconds, sometimes
more) on his T60 (with a Core2Duo) which he managed to track down to
wait_task_inactive()'s open-coded busy-loop.

He observed that an interrupt on one core tries to acquire the
runqueue-lock but does not succeed in doing so for a very long time -
while wait_task_inactive() on the other core loops waiting for the first
core to deschedule a task (which it wont do while spinning in an
interrupt handler).

This rewrites wait_task_inactive() to do all its waiting optimistically
without any locks taken at all, and then just double-check the end
result with the proper runqueue lock held over just a very short
section.  If there were races in the optimistic wait, of a preemption
event scheduled the process away, we simply re-synchronize, and start
over.

So the code now looks like this:

	repeat:
		/* Unlocked, optimistic looping! */
		rq = task_rq(p);
		while (task_running(rq, p))
			cpu_relax();

		/* Get the *real* values */
		rq = task_rq_lock(p, &flags);
		running = task_running(rq, p);
		array = p->array;
		task_rq_unlock(rq, &flags);

		/* Check them.. */
		if (unlikely(running)) {
			cpu_relax();
			goto repeat;
		}

		/* Preempted away? Yield if so.. */
		if (unlikely(array)) {
			yield();
			goto repeat;
		}

Basically, that first "while()" loop is done entirely without any
locking at all (and doesn't check for the case where the target process
might have been preempted away), and so it's possibly "incorrect", but
we don't really care.  Both the runqueue used, and the "task_running()"
check might be the wrong tests, but they won't oops - they just mean
that we could possibly get the wrong results due to lack of locking and
exit the loop early in the case of a race condition.

So once we've exited the loop, we then get the proper (and careful) rq
lock, and check the running/runnable state _safely_.  And if it turns
out that our quick-and-dirty and unsafe loop was wrong after all, we
just go back and try it all again.

(The patch also adds a lot of comments, which is the actual bulk of it
all, to make it more obvious why we can do these things without holding
the locks).

Thanks to Miklos for all the testing and tracking it down.

Tested-by: Miklos Szeredi <miklos@szeredi.hu>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-18 11:52:55 -07:00
Ingo Molnar a0f98a1cb7 sched: fix SysRq-N (normalize RT tasks)
Gene Heskett reported the following problem while testing CFS: SysRq-N
is not always effective in normalizing tasks back to SCHED_OTHER.

The reason for that turns out to be the following bug:

 - normalize_rt_tasks() uses for_each_process() to iterate through all
   tasks in the system.  The problem is, this method does not iterate
   through all tasks, it iterates through all thread groups.

The proper mechanism to enumerate over all threads is to use a
do_each_thread() + while_each_thread() loop.

Reported-by: Gene Heskett <gene.heskett@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-18 11:52:55 -07:00
Benjamin Herrenschmidt caec4e8dc8 Fix signalfd interaction with thread-private signals
Don't let signalfd dequeue private signals off other threads (in the
case of things like SIGILL or SIGSEGV, trying to do so would result
in undefined behaviour on who actually gets the signal, since they
are force unblocked).

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-18 10:18:32 -07:00
Thomas Gleixner bd197234b0 Revert "futex_requeue_pi optimization"
This reverts commit d0aa7a70bf.

It not only introduced user space visible changes to the futex syscall,
it is also non-functional and there is no way to fix it proper before
the 2.6.22 release.

The breakage report ( http://lkml.org/lkml/2007/5/12/17 ) went
unanswered, and unfortunately it turned out that the concept is not
feasible at all.  It violates the rtmutex semantics badly by introducing
a virtual owner, which hacks around the coupling of the user-space
pi_futex and the kernel internal rt_mutex representation.

At the moment the only safe option is to remove it fully as it contains
user-space visible changes to broken kernel code, which we do not want
to expose in the 2.6.22 release.

The patch reverts the original patch mostly 1:1, but contains a couple
of trivial manual cleanups which were necessary due to patches, which
touched the same area of code later.

Verified against the glibc tests and my own PI futex tests.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Ulrich Drepper <drepper@redhat.com>
Cc: Pierre Peiffer <pierre.peiffer@bull.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-18 09:48:41 -07:00
Rafael J. Wysocki 2f41dddbbd swsusp: Fix userland interface
Fix oops caused by 'cat /dev/snapshot', reported by Arkadiusz Miskiewicz,
and make it impossible to thaw tasks with the help of the swsusp userland
interface while there is a snapshot image ready to save.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-16 13:16:15 -07:00
Paul Jackson 3e903e7b16 cpuset: zero malloc - fix for old cpusets
The cpuset code to present a list of tasks using a cpuset to user space could
write to an array that it had kmalloc'd, after a kmalloc request of zero size.

The problem was that the code didn't check for writes past the allocated end
of the array until -after- the first write.

This is a race condition that is likely rare -- it would only show up if a
cpuset went from being empty to having a task in it, during the brief time
between the allocation and the first write.

Prior to roughly 2.6.22 kernels, this was also a benign problem, because a
zero kmalloc returned a few usable bytes anyway, and no harm was done with the
bogus write.

With the 2.6.22 kernel changes to make issue a warning if code tries to write
to the location returned from a zero size allocation, this problem is no
longer benign.  This cpuset code would occassionally trigger that warning.

The fix is trivial -- check before storing into the array, not after, whether
the array is big enough to hold the store.

Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: "Serge E. Hallyn" <serue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Paul Menage <menage@google.com>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-16 13:16:15 -07:00
Alexey Kuznetsov 778e9a9c3e pi-futex: fix exit races and locking problems
1. New entries can be added to tsk->pi_state_list after task completed
   exit_pi_state_list(). The result is memory leakage and deadlocks.

2. handle_mm_fault() is called under spinlock. The result is obvious.

3. results in self-inflicted deadlock inside glibc.
   Sometimes futex_lock_pi returns -ESRCH, when it is not expected
   and glibc enters to for(;;) sleep() to simulate deadlock. This problem
   is quite obvious and I think the patch is right. Though it looks like
   each "if" in futex_lock_pi() got some stupid special case "else if". :-)

4. sometimes futex_lock_pi() returns -EDEADLK,
   when nobody has the lock. The reason is also obvious (see comment
   in the patch), but correct fix is far beyond my comprehension.
   I guess someone already saw this, the chunk:

                        if (rt_mutex_trylock(&q.pi_state->pi_mutex))
                                ret = 0;

   is obviously from the same opera. But it does not work, because the
   rtmutex is really taken at this point: wake_futex_pi() of previous
   owner reassigned it to us. My fix works. But it looks very stupid.
   I would think about removal of shift of ownership in wake_futex_pi()
   and making all the work in context of process taking lock.

From: Thomas Gleixner <tglx@linutronix.de>

Fix 1) Avoid the tasklist lock variant of the exit race fix by adding
    an additional state transition to the exit code.

    This fixes also the issue, when a task with recursive segfaults
    is not able to release the futexes.

Fix 2) Cleanup the lookup_pi_state() failure path and solve the -ESRCH
    problem finally.

Fix 3) Solve the fixup_pi_state_owner() problem which needs to do the fixup
    in the lock protected section by using the in_atomic userspace access
    functions.

    This removes also the ugly lock drop / unqueue inside of fixup_pi_state()

Fix 4) Fix a stale lock in the error path of futex_wake_pi()

Added some error checks for verification.

The -EDEADLK problem is solved by the rtmutex fixups.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ulrich Drepper <drepper@redhat.com>
Cc: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-08 17:23:34 -07:00
Thomas Gleixner 1a539a8728 rt-mutex: fix chain walk early wakeup bug
Alexey Kuznetsov found some problems in the pi-futex code.

One of the root causes is:

When a wakeup happens, we do not to stop the chain walk so we follow a not
longer relevant locking chain.

Drop out when this happens.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: Ulrich Drepper <drepper@redhat.com>
Cc: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-08 17:23:34 -07:00
Thomas Gleixner c0d1d2bf5a rt-mutex: fix stale return value
Alexey Kuznetsov found some problems in the pi-futex code.

The major problem is a stale return value in rt_mutex_slowlock():

When the pi chain walk returns -EDEADLK, but the waiter was woken up during
the phases where the locks were dropped, the rtmutex could be acquired, but
due to the stale return value -EDEADLK returned to the caller.

Reset the return value in the retry path.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: Ulrich Drepper <drepper@redhat.com>
Cc: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-08 17:23:34 -07:00
Roland McGrath b74d0deb96 Restrict clearing TIF_SIGPENDING
This patch should get a few birds.  It prevents sigaction calls from
clearing TIF_SIGPENDING in other threads, which could leak -ERESTART*.
And It fixes ptrace_stop not to clear it, which done at the syscall exit
stop could leak -ERESTART*.  It probably removes the harm from signalfd,
at least assuming it never calls dequeue_signal on kernel threads that
might have used block_all_signals.

Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-07 08:52:15 -07:00
Ingo Molnar c1a834dc70 timer stats: speedups
Make timer-stats have almost zero overhead when enabled in the config but
not used.  (this way distros can enable it more easily)

Also update the documentation about overhead of timer_stats - it was
written for the first version which had a global lock and a linear list
walk based lookup ;-)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-01 08:18:30 -07:00
Bjorn Steinbrink 9fcc15ec3c timer statistics: fix race
Fix two races in the timer stats lookup code.  One by ensuring that the
initialization of a new entry is finished upon insertion of that entry.
The other by cleaning up the hash table when the entries array is cleared,
so that we don't have any "pre-inserted" entries.

Thanks to Eric Dumazet for reminding me of the memory barriers.

Signed-off-by: Bjorn Steinbrink <B.Steinbrink@gmx.de>
Signed-off-by: Ian Kumlien <pomac@vapor.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Eric Dumazet <dada1@cosmosbay.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-01 08:18:30 -07:00
Ulrich Drepper f0ede66fca fix compat futex code for private futexes
When the private futex support was added the compat code wasn't changed.
The result is that code using compat code which fail, e.g., because the
timeout values are not correctly passed.  The following patch should fix
that.

Signed-off-by: Ulrich Drepper <drepper@redhat.com>
Cc: Eric Dumazet <dada1@cosmosbay.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-01 08:18:28 -07:00
Kyle McMartin 7a74fc4925 fix possible null ptr deref in kallsyms_lookup
ugh, this function gets called by our unwinder. recursive backtrace for
the win... bisection to find this one was "fun."

Signed-off-by: Kyle McMartin <kyle@parisc-linux.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-30 10:51:38 -07:00
Thomas Gleixner eaad084bb0 NOHZ: prevent multiplication overflow - stop timer for huge timeouts
get_next_timer_interrupt() returns a delta of (LONG_MAX > 1) in case
there is no timer pending. On 64 bit machines this results in a
multiplication overflow in tick_nohz_stop_sched_tick().

Reported by: Dave Miller <davem@davemloft.net>

Make the return value a constant and limit the return value to a 32 bit
value.

When the max timeout value is returned, we can safely stop the tick
timer device. The max jiffies delta results in a 12 days timeout for
HZ=1000.

In the long term the get_next_timer_interrupt() code needs to be
reworked to return ktime instead of jiffies, but we have to wait until
the last users of the original NO_IDLE_HZ code are converted.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-29 18:11:10 -07:00
Linus Torvalds 92ea77275b Fix crash with irqpoll due to the IRQF_IRQPOLL flag testing
With irqpoll enabled, trying to test the IRQF_IRQPOLL flag in the
actions would cause a NULL pointer dereference if no action was
installed (for example, the driver might have been unloaded with
interrupts still pending).

So be a bit more careful about testing the flag by making sure to test
for that case.

(The actual _change_ is trivial, the patch is more than a one-liner
because I rewrote the testing to also be much more readable.

Original (discarded) bugfix by Bernhard Walle.

Cc: Bernhard Walle <bwalle@suse.de>
Tested-by: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-24 08:37:14 -07:00
Thomas Gleixner 98d8256739 Prevent going idle with softirq pending
The NOHZ patch contains a check for softirqs pending when a CPU goes idle.
The BUG is unrelated to NOHZ, it just was made visible by the NOHZ patch.
The BUG showed up mainly on P4 / hyperthreading enabled machines which lead
the investigations into the wrong direction in the first place.  The real
cause is in cond_resched_softirq():

cond_resched_softirq() is enabling softirqs without invoking the softirq
daemon when softirqs are pending.  This leads to the warning message in the
NOHZ idle code:

t1 runs softirq disabled code on CPU#0
interrupt happens, softirq is raised, but deferred (softirqs disabled)
t1 calls cond_resched_softirq()
	enables softirqs via _local_bh_enable()
	calls schedule()
t2 runs
t1 is migrated to CPU#1
t2 is done and invokes idle()
NOHZ detects the pending softirq

Fix: change _local_bh_enable() to local_bh_enable() so the softirq
daemon is invoked.

Thanks to Anant Nitya for debugging this with great patience !

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-23 20:14:15 -07:00
OGAWA Hirofumi 6373da1fb7 power: Fix sizeof(PAGE_SIZE) typo
Fix sizeof(PAGE_SIZE) typo.  It should be just PAGE_SIZE for zeroing the
swsusp_header.

Signed-off-by: OGAWA Hirofumi <hogawa@miraclelinux.com>
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-23 20:14:14 -07:00
Oleg Nesterov 14441960e8 simplify cleanup_workqueue_thread()
cleanup_workqueue_thread() and cwq_should_stop() are overcomplicated.

Convert the code to use kthread_should_stop/kthread_stop as was
suggested by Gautham and Srivatsa.

In particular this patch removes the (unlikely) busy-wait loop from the
exit path, it was a temporary and ugly kludge (if not a bug).

Note: the current code was designed to solve another old problem:
work->func can't share locks with hotplug callbacks.  I think this could
be done, see

	http://marc.info/?l=linux-kernel&m=116905366428633

but this needs some more complications to preserve CPU affinity of
cwq->thread during cpu_up().  A freezer-based hotplug looks more
appealing.

[akpm@linux-foundation.org: make it more tolerant of gcc borkenness]
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Zilvinas Valinskas <zilvinas@wilibox.com>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-23 20:14:13 -07:00
Roland McGrath 7bb44adef3 recalc_sigpending_tsk fixes
Steve Hawkes discovered a problem where recalc_sigpending_tsk was called in
do_sigaction but no signal_wake_up call was made, preventing later signals
from waking up blocked threads with TIF_SIGPENDING already set.

In fact, the few other calls to recalc_sigpending_tsk outside the signals
code are also subject to this problem in other race conditions.

This change makes recalc_sigpending_tsk private to the signals code.  It
changes the outside calls, as well as do_sigaction, to use the new
recalc_sigpending_and_wake instead.

Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: <Steve.Hawkes@motorola.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-23 20:14:12 -07:00
Thomas Gleixner 3528231606 NOHZ: Rate limit the local softirq pending warning output
The warning in the NOHZ code, which triggers when a CPU goes idle with
softirqs pending can fill up the logs quite quickly.  Rate limit the output
until we found the root cause of that problem.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-23 20:14:11 -07:00
Thomas Gleixner 72fcde9662 Ignore bogus ACPI info for offline CPUs
Booting a SMP kernel with maxcpus=1 on a SMP system leads to a hard hang,
because ACPI ignores the maxcpus setting and sends timer broadcast info for
the offline CPUs.  This results in a stuck for ever call to
smp_call_function_single() on an offline CPU.

Ignore the bogus information and print a kernel error to remind ACPI
folks to fix it.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-23 20:14:11 -07:00
Gautham R Shenoy 88f18ba028 freezer: move frozen_process() to kernel/power/process.c
Other than refrigerator, no one else calls frozen_process().  So move it from
include/linux/freezer.h to kernel/power/process.c.

Also, since a task can be marked as frozen by itself, we don't need to pass
the (struct task_struct *p) parameter to frozen_process().

Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-23 20:14:11 -07:00
Oleg Nesterov a076e4bca2 freezer: fix kthread_create vs freezer theoretical race
kthread() sleeps in TASK_INTERRUPTIBLE state waiting for the first wakeup.  In
theory, this wakeup may come from freeze_process()->signal_wake_up(), so the
task can disappear even before kthread_create() sets its ->comm.

Change kthread() to use TASK_UNINTERRUPTIBLE.

[akpm@linux-foundation.org: s/BUG_ON/WARN_ON+recover]
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-23 20:14:11 -07:00
Rafael J. Wysocki 49b12d4f5e freezer: take kernel_execve into consideration
Kernel threads can become userland processes by calling kernel_execve().

In particular, this may happen right after the try_to_freeze_tasks()
called with FREEZER_USER_SPACE has returned, so try_to_freeze_tasks()
needs to take userspace processes into consideration even if it is
called with FREEZER_KERNEL_THREADS.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-23 20:14:11 -07:00
Rafael J. Wysocki ba96a0c880 freezer: fix vfork problem
Currently try_to_freeze_tasks() has to wait until all of the vforked processes
exit and for this reason every user can make it fail.  To fix this problem we
can introduce the additional process flag PF_FREEZER_SKIP to be used by tasks
that do not want to be counted as freezable by the freezer and want to have
TIF_FREEZE set nevertheless.  Then, this flag can be set by tasks using
sys_vfork() before they call wait_for_completion(&vfork) and cleared after
they have woken up.  After clearing it, the tasks should call try_to_freeze()
as soon as possible.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-23 20:14:11 -07:00
Rafael J. Wysocki 33e1c288da freezer: close potential race between refrigerator and thaw_tasks
If the freezing of tasks fails and a task is preempted in refrigerator()
before calling frozen_process(), then thaw_tasks() may run before this task is
frozen.  In that case the task will freeze and no one will thaw it.

To fix this race we can call freezing(current) in refrigerator() along with
frozen_process(current) under the task_lock() which also should be taken in
the error path of try_to_freeze_tasks() as well as in thaw_process().
Moreover, if thaw_process() additionally clears TIF_FREEZE for tasks that are
not frozen, we can be sure that all tasks are thawed and there are no pending
"freeze" requests after thaw_tasks() has run.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-23 20:14:10 -07:00
Alexey Dobriyan e8edc6e03a Detach sched.h from mm.h
First thing mm.h does is including sched.h solely for can_do_mlock() inline
function which has "current" dereference inside. By dealing with can_do_mlock()
mm.h can be detached from sched.h which is good. See below, why.

This patch
a) removes unconditional inclusion of sched.h from mm.h
b) makes can_do_mlock() normal function in mm/mlock.c
c) exports can_do_mlock() to not break compilation
d) adds sched.h inclusions back to files that were getting it indirectly.
e) adds less bloated headers to some files (asm/signal.h, jiffies.h) that were
   getting them indirectly

Net result is:
a) mm.h users would get less code to open, read, preprocess, parse, ... if
   they don't need sched.h
b) sched.h stops being dependency for significant number of files:
   on x86_64 allmodconfig touching sched.h results in recompile of 4083 files,
   after patch it's only 3744 (-8.3%).

Cross-compile tested on

	all arm defconfigs, all mips defconfigs, all powerpc defconfigs,
	alpha alpha-up
	arm
	i386 i386-up i386-defconfig i386-allnoconfig
	ia64 ia64-up
	m68k
	mips
	parisc parisc-up
	powerpc powerpc-up
	s390 s390-up
	sparc sparc-up
	sparc64 sparc64-up
	um-x86_64
	x86_64 x86_64-up x86_64-defconfig x86_64-allnoconfig

as well as my two usual configs.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-21 09:18:19 -07:00
Rafael J. Wysocki 8d98a690f5 swsusp: fix sysfs interface
The sysfs files /sys/power/disk and /sys/power/state do not work as
documented, since they allow the user to write only a few initial
characters of the input string to trigger the option (eg.  'echo pl >
/sys/power/disk' activates the platform mode of hibernation).  Fix it.

Special thanks to Peter Moulder <Peter.Moulder@infotech.monash.edu.au> for
pointing out the problem.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-17 05:23:05 -07:00
Dan Aloni 71ce92f3fa make sysctl/kernel/core_pattern and fs/exec.c agree on maximum core filename size
Make sysctl/kernel/core_pattern and fs/exec.c agree on maximum core
filename size and change it to 128, so that extensive patterns such as
'/local/cores/%e-%h-%s-%t-%p.core' won't result in truncated filename
generation.

Signed-off-by: Dan Aloni <da-x@monatomic.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-17 05:23:05 -07:00
Christoph Lameter a35afb830f Remove SLAB_CTOR_CONSTRUCTOR
SLAB_CTOR_CONSTRUCTOR is always specified. No point in checking it.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Steven French <sfrench@us.ibm.com>
Cc: Michael Halcrow <mhalcrow@us.ibm.com>
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Dave Kleikamp <shaggy@austin.ibm.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Anton Altaparmakov <aia21@cantab.net>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jan Kara <jack@ucw.cz>
Cc: David Chinner <dgc@sgi.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-17 05:23:04 -07:00