Commit Graph

548 Commits

Author SHA1 Message Date
Paul Jackson 03a285f580 [PATCH] cpuset: skip rcu check if task is in root cpuset
For systems that aren't using cpusets, but have them CONFIG_CPUSET enabled in
their kernel (eventually this may be most distribution kernels), this patch
removes even the minimal rcu_read_lock() from the memory page allocation path.

Actually, it removes that rcu call for any task that is in the root cpuset
(top_cpuset), which on systems not actively using cpusets, is all tasks.

We don't need the rcu check for tasks in the top_cpuset, because the
top_cpuset is statically allocated, so at no risk of being freed out from
underneath us.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:13:45 -08:00
Paul Jackson 7edc59628b [PATCH] cpuset: mark number_of_cpusets read_mostly
Mark cpuset global 'number_of_cpusets' as __read_mostly.

This global is accessed everytime a zone is considered in the zonelist loops
beneath __alloc_pages, looking for a free memory page.  If number_of_cpusets
is just one, then we can short circuit the mems_allowed check.

Since this global is read alot on a hot path, and written rarely, it is an
excellent candidate for __read_mostly.

Thanks to Christoph Lameter for the suggestion.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:13:45 -08:00
Paul Jackson 6b9c2603ce [PATCH] cpuset: use rcu directly optimization
Optimize the cpuset impact on page allocation, the most performance critical
cpuset hook in the kernel.

On each page allocation, the cpuset hook needs to check for a possible change
in the current tasks cpuset.  It can now handle the common case, of no change,
without taking any spinlock or semaphore, thanks to RCU.

Convert a spinlock on the current task to an rcu_read_lock(), saving
approximately a memory barrier and an atomic op, depending on architecture.

This is done by adding rcu_assign_pointer() and synchronize_rcu() calls to the
write side of the task->cpuset pointer, in cpuset.c:attach_task(), to delay
freeing up a detached cpuset until after any critical sections referencing
that pointer.

Thanks to Andi Kleen, Nick Piggin and Eric Dumazet for ideas.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:13:45 -08:00
Paul Jackson c417f0242e [PATCH] cpuset: remove test for null cpuset from alloc code path
Remove a couple of more lines of code from the cpuset hooks in the page
allocation code path.

There was a check for a NULL cpuset pointer in the routine
cpuset_update_task_memory_state() that was only needed during system boot,
after the memory subsystem was initialized, before the cpuset subsystem was
initialized, to catch a NULL task->cpuset pointer.

Add a cpuset_init_early() routine, just before the mem_init() call in
init/main.c, that sets up just enough of the init tasks cpuset structure to
render cpuset_update_task_memory_state() calls harmless.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:13:44 -08:00
Paul Jackson 04c19fa6f1 [PATCH] cpuset: migrate all tasks in cpuset at once
Given the mechanism in the previous patch to handle rebinding the per-vma
mempolicies of all tasks in a cpuset that changes its memory placement, it is
now easier to handle the page migration requirements of such tasks at the same
time.

The previous code didn't actually attempt to migrate the pages of the tasks in
a cpuset whose memory placement changed until the next time each such task
tried to allocate memory.  This was undesirable, as users invoking memory page
migration exected to happen when the placement changed, not some unspecified
time later when the task needed more memory.

It is now trivial to handle the page migration at the same time as the per-vma
rebinding is done.

The routine cpuset.c:update_nodemask(), which handles changing a cpusets
memory placement ('mems') now checks for the special case of being asked to
write a placement that is the same as before.  It was harmless enough before
to just recompute everything again, even though nothing had changed.  But page
migration is a heavy weight operation - moving pages about.  So now it is
worth avoiding that if asked to move a cpuset to its current location.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:13:44 -08:00
Paul Jackson 4225399a66 [PATCH] cpuset: rebind vma mempolicies fix
Fix more of longstanding bug in cpuset/mempolicy interaction.

NUMA mempolicies (mm/mempolicy.c) are constrained by the current tasks cpuset
to just the Memory Nodes allowed by that cpuset.  The kernel maintains
internal state for each mempolicy, tracking what nodes are used for the
MPOL_INTERLEAVE, MPOL_BIND or MPOL_PREFERRED policies.

When a tasks cpuset memory placement changes, whether because the cpuset
changed, or because the task was attached to a different cpuset, then the
tasks mempolicies have to be rebound to the new cpuset placement, so as to
preserve the cpuset-relative numbering of the nodes in that policy.

An earlier fix handled such mempolicy rebinding for mempolicies attached to a
task.

This fix rebinds mempolicies attached to vma's (address ranges in a tasks
address space.) Due to the need to hold the task->mm->mmap_sem semaphore while
updating vma's, the rebinding of vma mempolicies has to be done when the
cpuset memory placement is changed, at which time mmap_sem can be safely
acquired.  The tasks mempolicy is rebound later, when the task next attempts
to allocate memory and notices that its task->cpuset_mems_generation is
out-of-date with its cpusets mems_generation.

Because walking the tasklist to find all tasks attached to a changing cpuset
requires holding tasklist_lock, a spinlock, one cannot update the vma's of the
affected tasks while doing the tasklist scan.  In general, one cannot acquire
a semaphore (which can sleep) while already holding a spinlock (such as
tasklist_lock).  So a list of mm references has to be built up during the
tasklist scan, then the tasklist lock dropped, then for each mm, its mmap_sem
acquired, and the vma's in that mm rebound.

Once the tasklist lock is dropped, affected tasks may fork new tasks, before
their mm's are rebound.  A kernel global 'cpuset_being_rebound' is set to
point to the cpuset being rebound (there can only be one; cpuset modifications
are done under a global 'manage_sem' semaphore), and the mpol_copy code that
is used to copy a tasks mempolicies during fork catches such forking tasks,
and ensures their children are also rebound.

When a task is moved to a different cpuset, it is easier, as there is only one
task involved.  It's mm->vma's are scanned, using the same
mpol_rebind_policy() as used above.

It may happen that both the mpol_copy hook and the update done via the
tasklist scan update the same mm twice.  This is ok, as the mempolicies of
each vma in an mm keep track of what mems_allowed they are relative to, and
safely no-op a second request to rebind to the same nodes.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:13:44 -08:00
Paul Jackson 202f72d5d1 [PATCH] cpuset: number_of_cpusets optimization
Easy little optimization hack to avoid actually having to call
cpuset_zone_allowed() and check mems_allowed, in the main page allocation
routine, __alloc_pages().  This saves several CPU cycles per page allocation
on systems not using cpusets.

A counter is updated each time a cpuset is created or removed, and whenever
there is only one cpuset in the system, it must be the root cpuset, which
contains all CPUs and all Memory Nodes.  In that case, when the counter is
one, all allocations are allowed.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:13:44 -08:00
Paul Jackson 74cb21553f [PATCH] cpuset: numa_policy_rebind cleanup
Cleanup, reorganize and make more robust the mempolicy.c code to rebind
mempolicies relative to the containing cpuset after a tasks memory placement
changes.

The real motivator for this cleanup patch is to lay more groundwork for the
upcoming patch to correctly rebind NUMA mempolicies that are attached to vma's
after the containing cpuset memory placement changes.

NUMA mempolicies are constrained by the cpuset their task is a member of.
When either (1) a task is moved to a different cpuset, or (2) the 'mems'
mems_allowed of a cpuset is changed, then the NUMA mempolicies have embedded
node numbers (for MPOL_BIND, MPOL_INTERLEAVE and MPOL_PREFERRED) that need to
be recalculated, relative to their new cpuset placement.

The old code used an unreliable method of determining what was the old
mems_allowed constraining the mempolicy.  It just looked at the tasks
mems_allowed value.  This sort of worked with the present code, that just
rebinds the -task- mempolicy, and leaves any -vma- mempolicies broken,
referring to the old nodes.  But in an upcoming patch, the vma mempolicies
will be rebound as well.  Then the order in which the various task and vma
mempolicies are updated will no longer be deterministic, and one can no longer
count on the task->mems_allowed holding the old value for as long as needed.
It's not even clear if the current code was guaranteed to work reliably for
task mempolicies.

So I added a mems_allowed field to each mempolicy, stating exactly what
mems_allowed the policy is relative to, and updated synchronously and reliably
anytime that the mempolicy is rebound.

Also removed a useless wrapper routine, numa_policy_rebind(), and had its
caller, cpuset_update_task_memory_state(), call directly to the rewritten
policy_rebind() routine, and made that rebind routine extern instead of
static, and added a "mpol_" prefix to its name, making it
mpol_rebind_policy().

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:13:44 -08:00
Paul Jackson 909d75a3b7 [PATCH] cpuset: implement cpuset_mems_allowed
Provide a cpuset_mems_allowed() method, which the sys_migrate_pages() code
needed, to obtain the mems_allowed vector of a cpuset, and replaced the
workaround in sys_migrate_pages() to call this new method.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:13:44 -08:00
Paul Jackson cf2a473c40 [PATCH] cpuset: combine refresh_mems and update_mems
The important code paths through alloc_pages_current() and alloc_page_vma(),
by which most kernel page allocations go, both called
cpuset_update_current_mems_allowed(), which in turn called refresh_mems().
-Both- of these latter two routines did a tasklock, got the tasks cpuset
pointer, and checked for out of date cpuset->mems_generation.

That was a silly duplication of code and waste of CPU cycles on an important
code path.

Consolidated those two routines into a single routine, called
cpuset_update_task_memory_state(), since it updates more than just
mems_allowed.

Changed all callers of either routine to call the new consolidated routine.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:13:43 -08:00
Paul Jackson b4b2641843 [PATCH] cpuset: fork hook fix
Fix obscure, never seen in real life, cpuset fork race.  The cpuset_fork()
call in fork.c was setting up the correct task->cpuset pointer after the
tasklist_lock was dropped, which briefly exposed the newly forked process with
an unsafe (copied from parent without locks or usage counter increment) cpuset
pointer.

In theory, that exposed cpuset pointer could have been pointing at a cpuset
that was already freed and removed, and in theory another task that had been
sitting on the tasklist_lock waiting to scan the task list could have raced
down the entire tasklist, found our new child at the far end, and dereferenced
that bogus cpuset pointer.

To fix, setup up the correct cpuset pointer in the new child by calling
cpuset_fork() before the new task is linked into the tasklist, and with that,
add a fork failure case, to dereference that cpuset, if the fork fails along
the way, after cpuset_fork() was called.

Had to remove a BUG_ON() from cpuset_exit(), because it was no longer valid -
the call to cpuset_exit() from a failed fork would not have PF_EXITING set.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:13:43 -08:00
Paul Jackson 59dac16fb9 [PATCH] cpuset: update_nodemask code reformat
Restructure code layout of the kernel/cpuset.c update_nodemask() routine,
removing embedded returns and nested if's in favor of goto completion labels.
This is being done in anticipation of adding more logic to this routine, which
will favor the goto style structure.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:13:43 -08:00
Paul Jackson c5b2aff896 [PATCH] cpuset: minor spacing initializer fixes
Four trivial cpuset fixes: remove extra spaces, remove useless initializers,
mark one __read_mostly.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:13:43 -08:00
Paul Jackson 3e0d98b9f1 [PATCH] cpuset: memory pressure meter
Provide a simple per-cpuset metric of memory pressure, tracking the -rate-
that the tasks in a cpuset call try_to_free_pages(), the synchronous
(direct) memory reclaim code.

This enables batch managers monitoring jobs running in dedicated cpusets to
efficiently detect what level of memory pressure that job is causing.

This is useful both on tightly managed systems running a wide mix of
submitted jobs, which may choose to terminate or reprioritize jobs that are
trying to use more memory than allowed on the nodes assigned them, and with
tightly coupled, long running, massively parallel scientific computing jobs
that will dramatically fail to meet required performance goals if they
start to use more memory than allowed to them.

This patch just provides a very economical way for the batch manager to
monitor a cpuset for signs of memory pressure.  It's up to the batch
manager or other user code to decide what to do about it and take action.

==> Unless this feature is enabled by writing "1" to the special file
    /dev/cpuset/memory_pressure_enabled, the hook in the rebalance
    code of __alloc_pages() for this metric reduces to simply noticing
    that the cpuset_memory_pressure_enabled flag is zero.  So only
    systems that enable this feature will compute the metric.

Why a per-cpuset, running average:

    Because this meter is per-cpuset, rather than per-task or mm, the
    system load imposed by a batch scheduler monitoring this metric is
    sharply reduced on large systems, because a scan of the tasklist can be
    avoided on each set of queries.

    Because this meter is a running average, instead of an accumulating
    counter, a batch scheduler can detect memory pressure with a single
    read, instead of having to read and accumulate results for a period of
    time.

    Because this meter is per-cpuset rather than per-task or mm, the
    batch scheduler can obtain the key information, memory pressure in a
    cpuset, with a single read, rather than having to query and accumulate
    results over all the (dynamically changing) set of tasks in the cpuset.

A per-cpuset simple digital filter (requires a spinlock and 3 words of data
per-cpuset) is kept, and updated by any task attached to that cpuset, if it
enters the synchronous (direct) page reclaim code.

A per-cpuset file provides an integer number representing the recent
(half-life of 10 seconds) rate of direct page reclaims caused by the tasks
in the cpuset, in units of reclaims attempted per second, times 1000.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:13:42 -08:00
Paul Jackson 5966514db6 [PATCH] cpuset: mempolicy one more nodemask conversion
Finish converting mm/mempolicy.c from bitmaps to nodemasks.  The previous
conversion had left one routine using bitmaps, since it involved a
corresponding change to kernel/cpuset.c

Fix that interface by replacing with a simple macro that calls nodes_subset(),
or if !CONFIG_CPUSET, returns (1).

Signed-off-by: Paul Jackson <pj@sgi.com>
Cc: Christoph Lameter <christoph@lameter.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:13:42 -08:00
Paul E. McKenney 2d89c92907 [PATCH] Simpler signal-exit concurrency handling
Some simplification in checking signal delivery against concurrent exit.
Instead of using get_task_struct_rcu(), which increments the task_struct
reference count, check the reference count after acquiring sighand lock.

Signed-off-by: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:13:40 -08:00
Ingo Molnar e56d090310 [PATCH] RCU signal handling
RCU tasklist_lock and RCU signal handling: send signals RCU-read-locked
instead of tasklist_lock read-locked.  This is a scalability improvement on
SMP and a preemption-latency improvement under PREEMPT_RCU.

Signed-off-by: Paul E. McKenney <paulmck@us.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: William Irwin <wli@holomorphy.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:13:40 -08:00
Ravikiran G Thirumalai 22fc6eccbf [PATCH] Change maxaligned_in_smp alignemnt macros to internodealigned_in_smp macros
____cacheline_maxaligned_in_smp is currently used to align critical structures
and avoid false sharing.  It uses per-arch L1_CACHE_SHIFT_MAX and people find
L1_CACHE_SHIFT_MAX useless.

However, we have been using ____cacheline_maxaligned_in_smp to align
structures on the internode cacheline size.  As per Andi's suggestion,
following patch kills ____cacheline_maxaligned_in_smp and introduces
INTERNODE_CACHE_SHIFT, which defaults to L1_CACHE_SHIFT for all arches.
Arches needing L3/Internode cacheline alignment can define
INTERNODE_CACHE_SHIFT in the arch asm/cache.h.  Patch replaces
____cacheline_maxaligned_in_smp with ____cacheline_internodealigned_in_smp

With this patch, L1_CACHE_SHIFT_MAX can be killed

Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Shai Fultheim <shai@scalex86.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:13:38 -08:00
Paul Jackson 45b07ef31d [PATCH] cpusets: swap migration interface
Add a boolean "memory_migrate" to each cpuset, represented by a file
containing "0" or "1" in each directory below /dev/cpuset.

It defaults to false (file contains "0").  It can be set true by writing
"1" to the file.

If true, then anytime that a task is attached to the cpuset so marked, the
pages of that task will be moved to that cpuset, preserving, to the extent
practical, the cpuset-relative placement of the pages.

Also anytime that a cpuset so marked has its memory placement changed (by
writing to its "mems" file), the tasks in that cpuset will have their pages
moved to the cpusets new nodes, preserving, to the extent practical, the
cpuset-relative placement of the moved pages.

Signed-off-by: Paul Jackson <pj@sgi.com>
Cc: Christoph Lameter <christoph@lameter.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:12:43 -08:00
Christoph Lameter 39743889aa [PATCH] Swap Migration V5: sys_migrate_pages interface
sys_migrate_pages implementation using swap based page migration

This is the original API proposed by Ray Bryant in his posts during the first
half of 2005 on linux-mm@kvack.org and linux-kernel@vger.kernel.org.

The intent of sys_migrate is to migrate memory of a process.  A process may
have migrated to another node.  Memory was allocated optimally for the prior
context.  sys_migrate_pages allows to shift the memory to the new node.

sys_migrate_pages is also useful if the processes available memory nodes have
changed through cpuset operations to manually move the processes memory.  Paul
Jackson is working on an automated mechanism that will allow an automatic
migration if the cpuset of a process is changed.  However, a user may decide
to manually control the migration.

This implementation is put into the policy layer since it uses concepts and
functions that are also needed for mbind and friends.  The patch also provides
a do_migrate_pages function that may be useful for cpusets to automatically
move memory.  sys_migrate_pages does not modify policies in contrast to Ray's
implementation.

The current code here is based on the swap based page migration capability and
thus is not able to preserve the physical layout relative to it containing
nodeset (which may be a cpuset).  When direct page migration becomes available
then the implementation needs to be changed to do a isomorphic move of pages
between different nodesets.  The current implementation simply evicts all
pages in source nodeset that are not in the target nodeset.

Patch supports ia64, i386 and x86_64.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:12:42 -08:00
Christoph Lameter 15316ba81a [PATCH] add schedule_on_each_cpu()
swap migration's isolate_lru_page() currently uses an IPI to notify other
processors that the lru caches need to be drained if the page cannot be
found on the LRU.  The IPI interrupt may interrupt a processor that is just
processing lru requests and cause a race condition.

This patch introduces a new function run_on_each_cpu() that uses the
keventd() to run the LRU draining on each processor.  Processors disable
preemption when dealing the LRU caches (these are per processor) and thus
executing LRU draining from another process is safe.

Thanks to Lee Schermerhorn <lee.schermerhorn@hp.com> for finding this race
condition.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:12:40 -08:00
Rohit Seth 8ad4b1fb82 [PATCH] Make high and batch sizes of per_cpu_pagelists configurable
As recently there has been lot of traffic on the right values for batch and
high water marks for per_cpu_pagelists.  This patch makes these two
variables configurable through /proc interface.

A new tunable /proc/sys/vm/percpu_pagelist_fraction is added.  This entry
controls the fraction of pages at most in each zone that are allocated for
each per cpu page list.  The min value for this is 8.  It means that we
don't allow more than 1/8th of pages in each zone to be allocated in any
single per_cpu_pagelist.

The batch value of each per cpu pagelist is also updated as a result.  It
is set to pcp->high/4.  The upper limit of batch is (PAGE_SHIFT * 8)

Signed-off-by: Rohit Seth <rohit.seth@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:12:40 -08:00
Andrew Morton 9d0243bca3 [PATCH] drop-pagecache
Add /proc/sys/vm/drop_caches.  When written to, this will cause the kernel to
discard as much pagecache and/or reclaimable slab objects as it can.  THis
operation requires root permissions.

It won't drop dirty data, so the user should run `sync' first.

Caveats:

a) Holds inode_lock for exorbitant amounts of time.

b) Needs to be taught about NUMA nodes: propagate these all the way through
   so the discarding can be controlled on a per-node basis.

This is a debugging feature: useful for getting consistent results between
filesystem benchmarks.  We could possibly put it under a config option, but
it's less than 300 bytes.

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:12:40 -08:00
David S. Miller 0aec63e67c [PATCH] Fix posix-cpu-timers sched_time accumulation
I've spent the past 3 days digging into a glibc testsuite failure in
current CVS, specifically libc/rt/tst-cputimer1.c The thr1 and thr2
timers fire too early in the second pass of this test.  The second
pass is noteworthy because it makes use of intervals, whereas the
first pass does not.

All throughout the posix-cpu-timers.c code, the calculation of the
process sched_time sum is implemented roughly as:

	unsigned long long sum;

	sum = tsk->signal->sched_time;
	t = tsk;
	do {
		sum += t->sched_time;
		t = next_thread(t);
	} while (t != tsk);

In fact this is the exact scheme used by check_process_timers().

In the case of check_process_timers(), current->sched_time has just
been updated (via scheduler_tick(), which is invoked by
update_process_times(), which subsequently invokes
run_posix_cpu_timers()) So there is no special processing necessary
wrt. that.

In other contexts, we have to allot for the fact that tsk->sched_time
might be a bit out of date if we are current.  And the
posix-cpu-timers.c code uses current_sched_time() to deal with that.

Unfortunately it does so in an erroneous and inconsistent manner in
one spot which is what results in the early timer firing.

In cpu_clock_sample_group_locked(), it does this:

		cpu->sched = p->signal->sched_time;
		/* Add in each other live thread.  */
		while ((t = next_thread(t)) != p) {
			cpu->sched += t->sched_time;
		}
		if (p->tgid == current->tgid) {
			/*
			 * We're sampling ourselves, so include the
			 * cycles not yet banked.  We still omit
			 * other threads running on other CPUs,
			 * so the total can always be behind as
			 * much as max(nthreads-1,ncpus) * (NSEC_PER_SEC/HZ).
			 */
			cpu->sched += current_sched_time(current);
		} else {
			cpu->sched += p->sched_time;
		}

The problem is the "p->tgid == current->tgid" test.  If "p" is
not current, and the tgids are the same, we will add the process
t->sched_time twice into cpu->sched and omit "p"'s sched_time
which is very very very wrong.

posix-cpu-timers.c has a helper function, sched_ns(p) which takes care
of this, so my fix is to use that here instead of this special tgid
test.

The fact that current can be one of the sub-threads of "p" points out
that we could make things a little bit more accurate, perhaps by using
sched_ns() on every thread we process in these loops.  It also points
out that we don't use the most accurate value for threads in the group
actively running other cpus (and this is mentioned in the comment).

But that is a future enhancement, and this fix here definitely makes
sense.

Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 20:23:04 -08:00
Jayachandran C 6fe2e70bbe [PATCH] kernel/module.c: removed dead code
This patch fixes an issue reported by Coverity in kernel/module.c

Error reported: Cannot reach this line of code "else return ptr;"

Patch description:
  This is the error path, so 'err' will be negative, the else case
  is not required, this patch removes it.

Signed-off-by: Jayachandran C. <c.jayachandran@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 08:33:59 -08:00
Martin Schwidefsky 347a8dc3b8 [PATCH] s390: cleanup Kconfig
Sanitize some s390 Kconfig options.  We have ARCH_S390, ARCH_S390X,
ARCH_S390_31, 64BIT, S390_SUPPORT and COMPAT.  Replace these 6 options by
S390, 64BIT and COMPAT.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 08:33:53 -08:00
Martin Schwidefsky 089545f0c7 [PATCH] s390: cputime_t fixes
There are some more places where the use of cputime_t instead of an integer
type and the associated macros is necessary for the virtual cputime accounting
on s390.  Affected are the s390 specific appldata code and BSD process
accounting.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 08:33:49 -08:00
Rafael J. Wysocki 277c6e2ad7 [PATCH] swsusp: save image header first
This makes the swsusp_info structure become the header of the image in the
literal sense (ie.  it is saved to the swap and read before any other image
data with the help of the swsusp's swap map structure, so generally it is
treated in the same way as the rest of the image).

The main thing it does is to make swsusp_header contain the offset of the swap
map used to track the image data pages rather than the offset of swsusp_info.
 Simultaneously, swsusp_info becomes the first image page written to the swap.

The other changes are generally consequences of the above with a few
exceptions (there's some consolidation in the image reading part as a few
functions turn into trivial wrappers around something else).

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 08:33:43 -08:00
Rafael J. Wysocki 1adf6c8ea9 [PATCH] swsusp: improve handling of swap partitions
This changes the handling of swap partitions by swsusp to avoid locking of the
swap devices that are not used for suspend and, consequently, simplifies the
code.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 08:33:43 -08:00
Rafael J. Wysocki ca0aec0f7a [PATCH] swsusp: make image size limit tunable
Make the suspend image size limit tunable via /sys/power/image_size.

It is necessary for systems on which there is a limited amount of swap
available for suspend.  It can also be useful for optimizing performance of
swsusp on systems with 1 GB of RAM or more.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 08:33:42 -08:00
Rafael J. Wysocki b3a93a255e [PATCH] swsusp: limit image size
Limit the size of the suspend image to approx.  500 MB, which should
improve the overall performance of swsusp on systems with more than 1 GB of
RAM.

It introduces the constant IMAGE_SIZE that can be set to the preferred size
of the image (in MB) and modifies the memory-shrinking part of swsusp to
take this constant into account (500 is the default value of IMAGE_SIZE).

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 08:33:42 -08:00
Pavel Machek c050ca7870 [PATCH] swsusp: Drop duplicate prototypes
These two prototypes are already present in sched.h, remove duplicate
version.

Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 08:33:42 -08:00
Rafael J. Wysocki e5e2fa7857 [PATCH] swsusp: fix enough_free_mem
This patch fixes a problem with the function enough_free_mem() used by
swsusp to verify if there is a sufficient number of memory pages available
to it to create and save the suspend image.

Namely, enough_free_mem() uses nr_free_pages() to obtain the number of free
memory pages, which is incorrect, because this function returns the total
number of free pages, including free highmem pages, and the highmem pages
cannot be used by swsusp for storing the image data.

The patch makes enough_free_mem() avoid counting the free highmem
pages as available to swsusp.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 08:33:40 -08:00
Rafael J. Wysocki 72a97e0839 [PATCH] swsusp: improve freeing of memory
This patch makes swsusp free only as much memory as needed to complete the
suspend and not as much as possible.   In the most of cases this should speed
up the suspend and make the system much more responsive after resume,
especially if a GUI (eg.  X Windows) is used.

If needed, the old behavior (ie to free as much memory as possible during
suspend) can be restored by unsetting FAST_FREE in power.h

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 08:33:40 -08:00
Rafael J. Wysocki 7088a5c001 [PATCH] swsusp: introduce the swap map structure
This patch introduces the swap map structure that can be used by swsusp for
keeping tracks of data pages written to the swap.   The structure itself is
described in a comment within the patch.

The overall idea is to reduce the amount of metadata written to the swap and
to write and read the image pages sequentially, in a file-alike way.  This
makes the swap-handling part of swsusp fairly independent of its
snapshot-handling part and will hopefully allow us to completely separate
these two parts in the future.

This patch is needed to remove the suspend image size limit imposed by the
limited size of the swsusp_info structure, which is essential for x86-64
systems with more than 512 MB of RAM.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 08:33:40 -08:00
Rafael J. Wysocki f2d97f0296 [PATCH] swsusp: remove encryption
This patch removes the image encryption that is only used by swsusp instead of
zeroing the image after resume in order to prevent someone from reading some
confidential data from it in the future and it does not protect the image from
being read by an unauthorized person before resume.  The functionality it
provides should really belong to the user space and will possibly be
reimplemented after the swap-handling functionality of swsusp is moved to the
user space.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 08:33:40 -08:00
Ivan Kokshaysky eee45269b0 [PATCH] Alpha: convert to generic irq framework (generic part)
Thanks to Christoph for doing most of the work.

This allows automatic SMP IRQ affinity assignment other than default "all
interrupts on all CPUs" which is rather expensive.  This might be useful if
the hardware can be programmed to distribute interrupts among different
CPUs, like Alpha does.

Signed-off-by: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Richard Henderson <rth@twiddle.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 08:33:40 -08:00
David Howells 7ee1dd3fee [PATCH] FRV: Make futex code compilable on nommu [try #2]
Make the futex code compilable and usable on NOMMU by making the attempt to
handle page faults conditional on CONFIG_MMU.  If this is not enabled, then
we can assume that EFAULT returned from futex_atomic_op_inuser() is not
recoverable, and that the address lies outside of valid memory.

handle_mm_fault() is made to BUG if called on NOMMU without attempting to
invoke the actual handler (__handle_mm_fault).

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 08:33:33 -08:00
Andrew Morton a576219aca [PATCH] swsusp: resume_store() retval fix
- This function returns -EINVAL all the time.  Fix.

- Decruftify it a bit too.

- Writing to it doesn't seem to do what it's suppoed to do.

Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 08:33:21 -08:00
Linus Torvalds db9edfd7e3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-2.6
Trivial manual merge fixup for usb_find_interface clashes.
2006-01-04 18:44:12 -08:00
Linus Torvalds 25c862cc9e Merge git://git.kernel.org/pub/scm/linux/kernel/git/sam/kbuild 2006-01-04 16:36:52 -08:00
akpm@osdl.org f743ca5e10 [PATCH] kobject_uevent CONFIG_NET=n fix
lib/lib.a(kobject_uevent.o)(.text+0x25f): In function `kobject_uevent':
: undefined reference to `__alloc_skb'
lib/lib.a(kobject_uevent.o)(.text+0x2a1): In function `kobject_uevent':
: undefined reference to `skb_over_panic'
lib/lib.a(kobject_uevent.o)(.text+0x31d): In function `kobject_uevent':
: undefined reference to `skb_over_panic'
lib/lib.a(kobject_uevent.o)(.text+0x356): In function `kobject_uevent':
: undefined reference to `netlink_broadcast'
lib/lib.a(kobject_uevent.o)(.init.text+0x9): In function `kobject_uevent_init':
: undefined reference to `netlink_kernel_create'
make: *** [.tmp_vmlinux1] Error 1

Netlink is unconditionally enabled if CONFIG_NET, so that's OK.

kobject_uevent.o is compiled even if !CONFIG_HOTPLUG, which is lazy.

Let's compound the sin.

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-01-04 16:18:08 -08:00
Kay Sievers 312c004d36 [PATCH] driver core: replace "hotplug" by "uevent"
Leave the overloaded "hotplug" word to susbsystems which are handling
real devices. The driver core does not "plug" anything, it just exports
the state to userspace and generates events.

Signed-off-by: Kay Sievers <kay.sievers@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-01-04 16:18:08 -08:00
Kay Sievers 0f76e5acf9 [PATCH] add uevent_helper control in /sys/kernel/
This deprecates the /proc/sys/kernel/hotplug file, as all
this stuff should be in /sys some day, right? :)
In /sys/kernel/ we have now uevent_seqnum and uevent_helper.
The seqnum is no longer used by udev, as the version for this
kernel depends on netlink which events will never get
out-of-order.

Recent udev versions disable the /sbin/hotplug helper with
an init script, cause it leads to OOM on big boxes by running
hundreds of shells in parallel. It should be done now by:
  echo "" > /sys/kernel/uevent_helper

(Note that "-n" does not work, cause neighter proc nor sysfs
support truncate().)

Signed-off-by: Kay Sievers <kay.sievers@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-01-04 16:18:07 -08:00
Kay Sievers 0296b22813 [PATCH] remove CONFIG_KOBJECT_UEVENT option
It makes zero sense to have hotplug, but not the netlink
events enabled today. Remove this option and merge the
kobject_uevent.h header into the kobject.h header file.

Signed-off-by: Kay Sievers <kay.sievers@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-01-04 16:18:07 -08:00
Adrian Bunk f4b09ebc8b update the email address of Randy Dunlap
This patch removes all references to the bouncing address
rddunlap@osdl.org and one dead web page from the kernel.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Acked-by: Randy Dunlap <rdunlap@xenotime.net>
2006-01-03 13:37:51 +01:00
febf7ea4be gitignore: ignore more generated files
Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
2006-01-03 11:35:26 +01:00
Linus Torvalds de9e007d91 sysctl: make sure to terminate strings with a NUL
This is a slightly more complete fix for the previous minimal sysctl
string fix.  It always terminates the returned string with a NUL, even
if the full result wouldn't fit in the user-supplied buffer.

The returned length is the full untruncated length, so that you can
tell when truncation has occurred.

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-12-31 17:00:29 -08:00
Yi Yang 82c9df8201 [PATCH] Fix false old value return of sysctl
For the sysctl syscall, if the user wants to get the old value of a
sysctl entry and set a new value for it in the same syscall, the old
value is always overwritten by the new value if the sysctl entry is of
string type and if the user sets its strategy to sysctl_string.  This
issue lies in the strategy being run twice if the strategy is set to
sysctl_string, the general strategy sysctl_string always returns 0 if
success.

Such strategy routines as sysctl_jiffies and sysctl_jiffies_ms return 1
because they do read and write for the sysctl entry.

The strategy routine sysctl_string return 0 although it actually read
and write the sysctl entry.

According to my analysis, if a strategy routine do read and write, it
should return 1, if it just does some necessary check but not read and
write, it should return 0, for example sysctl_intvec.

Signed-off-by: Yi Yang <yang.y.yi@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-12-30 17:22:08 -08:00
Linus Torvalds 8febdd85ad sysctl: don't overflow the user-supplied buffer with '\0'
If the string was too long to fit in the user-supplied buffer,
the sysctl layer would zero-terminate it by writing past the
end of the buffer. Don't do that.

Noticed by Yi Yang <yang.y.yi@gmail.com>

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-12-30 17:18:53 -08:00