Commit Graph

197 Commits

Author SHA1 Message Date
Martin Hicks 53e9a6159f [PATCH] VM: zone reclaim atomic ops cleanup
Christoph Lameter and Marcelo Tosatti asked to get rid of the
atomic_inc_and_test() to cleanup the atomic ops in the zone reclaim code.

Signed-off-by: Martin Hicks <mort@sgi.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05 00:05:44 -07:00
Martin Hicks bce5f6ba34 [PATCH] VM: add capabilites check to set_zone_reclaim
Add a capability check to sys_set_zone_reclaim().  This syscall is not
something that should be available to a user.

Signed-off-by:  Martin Hicks <mort@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05 00:05:44 -07:00
Nick Piggin 242e546862 [PATCH] mm: remove atomic
This bitop does not need to be atomic because it is performed when there will
be no references to the page (ie.  the page is being freed).

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05 00:05:44 -07:00
Nick Piggin 9a61c349b2 [PATCH] mm: remap ZERO_PAGE mappings
filemap_xip's nopage routine maps the ZERO_PAGE into readonly mappings, if it
has no data page to map there: then if the hole in the file is later filled,
__xip_unmap uses an rmap technique to replace the ZERO_PAGEs mapped for that
offset by the newly allocated file page, so that established mappings will see
the newly written data.

However, on MIPS (alone) there's not one but as many as eight ZERO_PAGEs,
chosen for coloring by user virtual address; and if mremap has meanwhile been
used to move a mapping containing a ZERO_PAGE, it will generally not match the
ZERO_PAGE(address) __xip_unmap is looking for.

To maintain XIP's established mappings correctly on MIPS, we need Nick's fix
to mremap's move_one_page (originally presented as an optimization), to
replace the ZERO_PAGE appropriate to the old address by the ZERO_PAGE
appropriate to the new address.

(But when I first saw this, I was thinking the ZERO_PAGEs themselves would get
corrupted, very bad.  Now I think it's the other way round, that the
established mappings will fail to see the newly written data: incorrect, but
not corrupting everything else.  Whether filemap_xip's technique is generally
safe, I'd hesitate to say in a hurry: it's interesting, but we've never tried
to do that in tmpfs.)

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05 00:05:44 -07:00
Nick Piggin 4d7670e0f6 [PATCH] mm: cleanup rmap
Thanks to Bill Irwin for pointing this out.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05 00:05:43 -07:00
Nick Piggin 2822c1aa57 [PATCH] mm: micro-optimise rmap
Microoptimise page_add_anon_rmap.  Although these expressions are used only in
the taken branch of the if() statement, the compiler can't reorder them inside
because atomic_inc_and_test is a barrier.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05 00:05:43 -07:00
Nick Piggin c3dce2d89c [PATCH] mm: comment rmap
Just be clear that VM_RESERVED pages here are a bug, and the test is not there
because they are expected.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05 00:05:43 -07:00
Christoph Lameter 6e21c8f145 [PATCH] /proc/<pid>/numa_maps to show on which nodes pages reside
This patch was recently discussed on linux-mm:
http://marc.theaimsgroup.com/?t=112085728500002&r=1&w=2

I inherited a large code base from Ray for page migration.  There was a
small patch in there that I find to be very useful since it allows the
display of the locality of the pages in use by a process.  I reworked that
patch and came up with a /proc/<pid>/numa_maps that gives more information
about the vma's of a process.  numa_maps is indexes by the start address
found in /proc/<pid>/maps.  F.e.  with this patch you can see the page use
of the "getty" process:

margin:/proc/12008 # cat maps
00000000-00004000 r--p 00000000 00:00 0
2000000000000000-200000000002c000 r-xp 00000000 08:04 516                /lib/ld-2.3.3.so
2000000000038000-2000000000040000 rw-p 00028000 08:04 516                /lib/ld-2.3.3.so
2000000000040000-2000000000044000 rw-p 2000000000040000 00:00 0
2000000000058000-2000000000260000 r-xp 00000000 08:04 54707842           /lib/tls/libc.so.6.1
2000000000260000-2000000000268000 ---p 00208000 08:04 54707842           /lib/tls/libc.so.6.1
2000000000268000-2000000000274000 rw-p 00200000 08:04 54707842           /lib/tls/libc.so.6.1
2000000000274000-2000000000280000 rw-p 2000000000274000 00:00 0
2000000000280000-20000000002b4000 r--p 00000000 08:04 9126923            /usr/lib/locale/en_US.utf8/LC_CTYPE
2000000000300000-2000000000308000 r--s 00000000 08:04 60071467           /usr/lib/gconv/gconv-modules.cache
2000000000318000-2000000000328000 rw-p 2000000000318000 00:00 0
4000000000000000-4000000000008000 r-xp 00000000 08:04 29576399           /sbin/mingetty
6000000000004000-6000000000008000 rw-p 00004000 08:04 29576399           /sbin/mingetty
6000000000008000-600000000002c000 rw-p 6000000000008000 00:00 0          [heap]
60000fff7fffc000-60000fff80000000 rw-p 60000fff7fffc000 00:00 0
60000ffffff44000-60000ffffff98000 rw-p 60000ffffff44000 00:00 0          [stack]
a000000000000000-a000000000020000 ---p 00000000 00:00 0                  [vdso]

cat numa_maps
2000000000000000 default MaxRef=43 Pages=11 Mapped=11 N0=4 N1=3 N2=2 N3=2
2000000000038000 default MaxRef=1 Pages=2 Mapped=2 Anon=2 N0=2
2000000000040000 default MaxRef=1 Pages=1 Mapped=1 Anon=1 N0=1
2000000000058000 default MaxRef=43 Pages=61 Mapped=61 N0=14 N1=15 N2=16 N3=16
2000000000268000 default MaxRef=1 Pages=2 Mapped=2 Anon=2 N0=2
2000000000274000 default MaxRef=1 Pages=3 Mapped=3 Anon=3 N0=3
2000000000280000 default MaxRef=8 Pages=3 Mapped=3 N0=3
2000000000300000 default MaxRef=8 Pages=2 Mapped=2 N0=2
2000000000318000 default MaxRef=1 Pages=1 Mapped=1 Anon=1 N2=1
4000000000000000 default MaxRef=6 Pages=2 Mapped=2 N1=2
6000000000004000 default MaxRef=1 Pages=1 Mapped=1 Anon=1 N0=1
6000000000008000 default MaxRef=1 Pages=1 Mapped=1 Anon=1 N0=1
60000fff7fffc000 default MaxRef=1 Pages=1 Mapped=1 Anon=1 N0=1
60000ffffff44000 default MaxRef=1 Pages=1 Mapped=1 Anon=1 N0=1

getty uses ld.so.  The first vma is the code segment which is used by 43
other processes and the pages are evenly distributed over the 4 nodes.

The second vma is the process specific data portion for ld.so.  This is
only one page.

The display format is:

<startaddress>	 Links to information in /proc/<pid>/map
<memory policy>  This can be "default" "interleave={}", "prefer=<node>" or "bind={<zones>}"
MaxRef=		<maximum reference to a page in this vma>
Pages=		<Nr of pages in use>
Mapped=		<Nr of pages with mapcount >
Anon=		<nr of anonymous pages>
Nx=		<Nr of pages on Node x>

The content of the proc-file is self-evident.  If this would be tied into
the sparsemem system then the contents of this file would not be too
useful.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05 00:05:43 -07:00
Hugh Dickins 839b9685e8 [PATCH] rmap: don't test rss
Remove the three get_mm_counter(mm, rss) tests from rmap.c: there was a
time when testing rss was important to avoid a particular race between
dup_mmap and the anonmm rmap; but now it's just a rather silly pseudo-
optimization, made even more obscure by the get_mm_counter macro.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05 00:05:42 -07:00
Hugh Dickins 3279ffd97f [PATCH] delete from_swap_cache BUG_ONs
Three of the four BUG_ONs in delete_from_swap_cache are immediately
repeated in __delete_from_swap_cache: delete those and add the one.  But
perhaps mm/ is altogether overprovisioned with historic BUGs?

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05 00:05:42 -07:00
Hugh Dickins 5d337b9194 [PATCH] swap: swap_lock replace list+device
The idea of a swap_device_lock per device, and a swap_list_lock over them all,
is appealing; but in practice almost every holder of swap_device_lock must
already hold swap_list_lock, which defeats the purpose of the split.

The only exceptions have been swap_duplicate, valid_swaphandles and an
untrodden path in try_to_unuse (plus a few places added in this series).
valid_swaphandles doesn't show up high in profiles, but swap_duplicate does
demand attention.  However, with the hold time in get_swap_pages so much
reduced, I've not yet found a load and set of swap device priorities to show
even swap_duplicate benefitting from the split.  Certainly the split is mere
overhead in the common case of a single swap device.

So, replace swap_list_lock and swap_device_lock by spinlock_t swap_lock
(generally we seem to prefer an _ in the name, and not hide in a macro).

If someone can show a regression in swap_duplicate, then probably we should
add a hashlock for the swap_map entries alone (shorts being anatomic), so as
to help the case of the single swap device too.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05 00:05:42 -07:00
Hugh Dickins 048c27fd72 [PATCH] swap: scan_swap_map latency breaks
The get_swap_page/scan_swap_map latency can be so bad that even those without
preemption configured deserve relief: periodically cond_resched.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05 00:05:41 -07:00
Hugh Dickins 52b7efdbe5 [PATCH] swap: scan_swap_map drop swap_device_lock
get_swap_page has often shown up on latency traces, doing lengthy scans while
holding two spinlocks.  swap_list_lock is already dropped, now scan_swap_map
drop swap_device_lock before scanning the swap_map.

While scanning for an empty cluster, don't worry that racing tasks may
allocate what was free and free what was allocated; but when allocating an
entry, check it's still free after retaking the lock.  Avoid dropping the lock
in the expected common path.  No barriers beyond the locks, just let the
cookie crumble; highest_bit limit is volatile, but benign.

Guard against swapoff: must check SWP_WRITEOK before allocating, must raise
SWP_SCANNING reference count while in scan_swap_map, swapoff wait for that to
fall - just use schedule_timeout, we don't want to burden scan_swap_map
itself, and it's very unlikely that anyone can really still be in
scan_swap_map once swapoff gets this far.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05 00:05:41 -07:00
Hugh Dickins 7dfad4183b [PATCH] swap: scan_swap_map restyled
Rewrite scan_swap_map to allocate in just the same way as before (taking the
next free entry SWAPFILE_CLUSTER-1 times, then restarting at the lowest wholly
empty cluster, falling back to lowest entry if none), but with a view towards
dropping the lock in the next patch.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05 00:05:41 -07:00
Hugh Dickins fb4f88dcab [PATCH] swap: get_swap_page drop swap_list_lock
Rewrite get_swap_page to allocate in just the same sequence as before, but
without holding swap_list_lock across its scan_swap_map.  Decrement
nr_swap_pages and update swap_list.next in advance, while still holding
swap_list_lock.  Skip full devices by testing highest_bit.  Swapoff hold
swap_device_lock as well as swap_list_lock to clear SWP_WRITEOK.  Reduces lock
contention when there are parallel swap devices of the same priority.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05 00:05:41 -07:00
Hugh Dickins 89d09a2c80 [PATCH] swap: freeing update swap_list.next
This makes negligible difference in practice: but swap_list.next should not be
updated to a higher prio in the general helper swap_info_get, but rather in
swap_entry_free; and then only in the case when entry is actually freed.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05 00:05:41 -07:00
Hugh Dickins 6eb396dc4a [PATCH] swap: swap unsigned int consistency
The swap header's unsigned int last_page determines the range of swap pages,
but swap_info has been using int or unsigned long in some cases: use unsigned
int throughout (except, in several places a local unsigned long is useful to
avoid overflows when adding).

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Jens Axboe <axboe@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05 00:05:41 -07:00
Hugh Dickins 53092a7402 [PATCH] swap: show span of swap extents
The "Adding %dk swap" message shows the number of swap extents, as a guide to
how fragmented the swapfile may be.  But a useful further guide is what total
extent they span across (sometimes scarily large).

And there's no need to keep nr_extents in swap_info: it's unused after the
initial message, so save a little space by keeping it on stack.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05 00:05:40 -07:00
Hugh Dickins 11d31886db [PATCH] swap: swap extent list is ordered
There are several comments that swap's extent_list.prev points to the lowest
extent: that's not so, it's extent_list.next which points to it, as you'd
expect.  And a couple of loops in add_swap_extent which go all the way through
the list, when they should just add to the other end.

Fix those up, and let map_swap_page search the list forwards: profiles shows
it to be twice as quick that way - because prefetch works better on how the
structs are typically kmalloc'ed?  or because usually more is written to than
read from swap, and swap is allocated ascendingly?

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05 00:05:40 -07:00
Hugh Dickins 4cd3bb10ff [PATCH] swap: move destroy_swap_extents calls
sys_swapon's call to destroy_swap_extents on failure is made after the final
swap_list_unlock, which is faintly unsafe: another sys_swapon might already be
setting up that swap_info_struct.  Calling it earlier, before taking
swap_list_lock, is safe.  sys_swapoff's call to destroy_swap_extents was safe,
but likewise move it earlier, before taking the locks (once try_to_unuse has
completed, nothing can be needing the swap extents).

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05 00:05:40 -07:00
Hugh Dickins e2244ec2ef [PATCH] swap: correct swapfile nr_good_pages
If a regular swapfile lies on a filesystem whose blocksize is less than
PAGE_SIZE, then setup_swap_extents may have to cut the number of usable swap
pages; but sys_swapon's nr_good_pages was not expecting that.  Also,
setup_swap_extents takes no account of badpages listed in the swap header: not
worth doing so, but ensure nr_badpages is 0 for a regular swapfile.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05 00:05:40 -07:00
Hugh Dickins b0d9bcd4bb [PATCH] swap: update swapfile i_sem comment
Update swap extents comment: nowadays we guard with S_SWAPFILE not i_sem.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05 00:05:40 -07:00
Dave Hansen 28ae55c98e [PATCH] sparsemem extreme: hotplug preparation
This splits up sparse_index_alloc() into two pieces.  This is needed
because we'll allocate the memory for the second level in a different place
from where we actually consume it to keep the allocation from happening
underneath a lock

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Bob Picco <bob.picco@hp.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05 00:05:38 -07:00
Bob Picco 3e347261a8 [PATCH] sparsemem extreme implementation
With cleanups from Dave Hansen <haveblue@us.ibm.com>

SPARSEMEM_EXTREME makes mem_section a one dimensional array of pointers to
mem_sections.  This two level layout scheme is able to achieve smaller
memory requirements for SPARSEMEM with the tradeoff of an additional shift
and load when fetching the memory section.  The current SPARSEMEM
implementation is a one dimensional array of mem_sections which is the
default SPARSEMEM configuration.  The patch attempts isolates the
implementation details of the physical layout of the sparsemem section
array.

SPARSEMEM_EXTREME requires bootmem to be functioning at the time of
memory_present() calls.  This is not always feasible, so architectures
which do not need it may allocate everything statically by using
SPARSEMEM_STATIC.

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Bob Picco <bob.picco@hp.com>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05 00:05:38 -07:00
Bob Picco 802f192e4a [PATCH] SPARSEMEM EXTREME
A new option for SPARSEMEM is ARCH_SPARSEMEM_EXTREME.  Architecture
platforms with a very sparse physical address space would likely want to
select this option.  For those architecture platforms that don't select the
option, the code generated is equivalent to SPARSEMEM currently in -mm.
I'll be posting a patch on ia64 ml which uses this new SPARSEMEM feature.

ARCH_SPARSEMEM_EXTREME makes mem_section a one dimensional array of
pointers to mem_sections.  This two level layout scheme is able to achieve
smaller memory requirements for SPARSEMEM with the tradeoff of an
additional shift and load when fetching the memory section.  The current
SPARSEMEM -mm implementation is a one dimensional array of mem_sections
which is the default SPARSEMEM configuration.  The patch attempts isolates
the implementation details of the physical layout of the sparsemem section
array.

ARCH_SPARSEMEM_EXTREME depends on 64BIT and is by default boolean false.

I've boot tested under aim load ia64 configured for ARCH_SPARSEMEM_EXTREME.
 I've also boot tested a 4 way Opteron machine with !ARCH_SPARSEMEM_EXTREME
and tested with aim.

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Bob Picco <bob.picco@hp.com>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05 00:05:38 -07:00
Nick Piggin d992895ba2 [PATCH] Lazy page table copies in fork()
Defer copying of ptes until fault time when it is possible to reconstruct
the pte from backing store. Idea from Andi Kleen and Nick Piggin.

Thanks to input from Rik van Riel and Linus and to Hugh for correcting
my blundering.

Ray Fucillo <fucillo@intersystems.com> reports:

  "I applied this latest patch to a 2.6.12 kernel and found that it does
   resolve the problem.  Prior to the patch on this machine, I was
   seeing about 23ms spent in fork for ever 100MB of shared memory
   segment.

   After applying the patch, fork is taking about 1ms regardless of the
   shared memory size."

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-29 17:25:04 -07:00
Linus Torvalds cc314eef01 Fix nasty ncpfs symlink handling bug.
This bug could cause oopses and page state corruption, because ncpfs
used the generic page-cache symlink handlign functions.  But those
functions only work if the page cache is guaranteed to be "stable", ie a
page that was installed when the symlink walk was started has to still
be installed in the page cache at the end of the walk.

We could have fixed ncpfs to not use the generic helper routines, but it
is in many ways much cleaner to instead improve on the symlink walking
helper routines so that they don't require that absolute stability.

We do this by allowing "follow_link()" to return a error-pointer as a
cookie, which is fed back to the cleanup "put_link()" routine.  This
also simplifies NFS symlink handling.

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-19 18:02:56 -07:00
David Gibson c7546f8f03 [PATCH] Fix hugepage crash on failing mmap()
This patch fixes a crash in the hugepage code.  unmap_hugepage_area() was
assuming that (due to prefault) PTEs must exist for all the area in
question.  However, this may not be the case, if mmap() encounters an error
before the prefault and calls unmap_region() to clean up any partial
mapping.

Depending on the hugepage configuration, this crash can be triggered by an
unpriveleged user.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Cc: William Lee Irwin III <wli@holomorphy.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-05 12:22:37 -07:00
Simon Derr 2f60f8d357 [PATCH] __vm_enough_memory() signedness fix
We have found what seems to be a small bug in __vm_enough_memory() when
sysctl_overcommit_memory is set to OVERCOMMIT_NEVER.

When this bug occurs the systems fails to boot, with /sbin/init whining
about fork() returning ENOMEM.

We hunted down the problem to this:

The deferred update mecanism used in vm_acct_memory(), on a SMP system,
allows the vm_committed_space counter to have a negative value.

This should not be a problem since this counter is known to be inaccurate.

But in __vm_enough_memory() this counter is compared to the `allowed'
variable, which is an unsigned long.  This comparison is broken since it
will consider the negative values of vm_committed_space to be huge positive
values, resulting in a memory allocation failure.

Signed-off-by: <Jean-Marc.Saffroy@ext.bull.net>
Signed-off-by: <Simon.Derr@bull.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-04 21:43:14 -07:00
Hugh Dickins 1c5ad84516 [PATCH] fix VmSize and VmData after mremap
mremap's move_vma is applying __vm_stat_account to the old vma which may
have already been freed: move it to just before the do_munmap.

mremapping to and fro with CONFIG_DEBUG_SLAB=y showed /proc/<pid>/status
VmSize and VmData wrapping just like in kernel bugzilla #4842, and fixed by
this patch - worth including in 2.6.13, though not yet confirmed that it
fixes that specific report from Frank van Maarseveen.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-04 13:11:15 -07:00
Linus Torvalds a68d2ebc15 Fix up recent get_user_pages() handling
The VM_FAULT_WRITE thing is an extra bit, not a valid return value, and
has to be treated as such by get_user_pages().

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-03 10:07:09 -07:00
Nick Piggin f33ea7f404 [PATCH] fix get_user_pages bug
Checking pte_dirty instead of pte_write in __follow_page is problematic
for s390, and for copy_one_pte which leaves dirty when clearing write.

So revert __follow_page to check pte_write as before, and make
do_wp_page pass back a special extra VM_FAULT_WRITE bit to say it has
done its full job: once get_user_pages receives this value, it no longer
requires pte_write in __follow_page.

But most callers of handle_mm_fault, in the various architectures, have
switch statements which do not expect this new case.  To avoid changing
them all in a hurry, make an inline wrapper function (using the old
name) that masks off the new bit, and use the extended interface with
double underscores.

Yes, we do have a call to do_wp_page from do_swap_page, but no need to
change that: in rare case it's needed, another do_wp_page will follow.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
[ Cleanups by Nick Piggin ]
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-03 09:12:05 -07:00
Eric Dumazet ba17101b41 [PATCH] sys_set_mempolicy() doesnt check if mode < 0
A kernel BUG() is triggered by a call to set_mempolicy() with a negative
first argument.  This is because the mode is declared as an int, and the
validity check doesnt check < 0 values.  Alternatively, mode could be
declared as unsigned int or unsigned long.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-01 21:38:00 -07:00
Hugh Dickins 690dbe1ced [PATCH] x86_64: access of some bad address
x86_64 has a large sparse gate area between VSYSCALL_START and
VSYSCALL_END, not all of it presently backed by pmds.  Alexander Nyberg has
found that in some circumstances gdb may try to ptrace here, and hit
get_user_pages BUG_ON.  It seems odd that gdb should be accessing here, but
it certainly shouldn't crash in this way: relax BUG_ON to -EFAULT.  Fixes
kernel bugzilla #4801.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-01 21:38:00 -07:00
Linus Torvalds 4ceb5db975 Fix get_user_pages() race for write access
There's no real guarantee that handle_mm_fault() will always be able to
break a COW situation - if an update from another thread ends up
modifying the page table some way, handle_mm_fault() may end up
requiring us to re-try the operation.

That's normally fine, but get_user_pages() ended up re-trying it as a
read, and thus a write access could in theory end up losing the dirty
bit or be done on a page that had not been properly COW'ed.

This makes get_user_pages() always retry write accesses as write
accesses by making "follow_page()" require that a writable follow has
the dirty bit set.  That simplifies the code and solves the race: if the
COW break fails for some reason, we'll just loop around and try again.

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-01 11:14:49 -07:00
Martin J. Bligh e310fd4325 [PATCH] Fix NUMA node sizing in nr_free_zone_pages
We are iterating over all nodes in nr_free_zone_pages().  Because the
fallback zonelists contain all nodes in the system, and we walk all the
zonelists, we're counting memory multiple times (once for each node).  This
caused us to make a size estimate of 32GB for an 8GB AMD64 box, which makes
all the dirty ratio calculations, etc incorrect.

There's still a further bug to fix from e820 holes causing overestimation
as well, but this fix is separate, and good as is, and fixes one class of
problems.  Problem found by Badari, and tested by Ram Pai - thanks!

Signed-off-by:  Martin J. Bligh <mbligh@mbligh.org>
Signed-off-by:  Matt Dobson <colpatch@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-30 10:14:46 -07:00
Andy Whitcroft 12b1c5f382 [PATCH] Remove bogus warning in page_alloc.c
Originally __free_pages_bulk used the relative page number within a zone to
define its buddies.  This meant that to maintain the "maximally aligned"
requirements (that an allocation of size N will be aligned at least to N
physically) zones had to also be aligned to 1<<MAX_ORDER pages.  When
__free_pages_bulk was updated to use the relative page frame numbers of the
free'd pages to pair buddies this released the alignment constraint on the
'left' edge of the zone.  This allows _either_ edge of the zone to contain
partial MAX_ORDER sized buddies.  These simply never will have matching
buddies and thus will never make it to the 'top' of the pyramid.

The patch below removes a now redundant check ensuring that the mem_map was
aligned to MAX_ORDER.

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Cc: Christoph Lameter <christoph@lameter.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-27 16:25:54 -07:00
suzuki 165cd40235 [PATCH] madvise() does not always return -EBADF on non-file mapped area
The madvise() system call returns -EBADF for areas which does not map to
files, only for *behaviour* request MADV_WILLNEED.

According to man pages, madvise returns :

EBADF - the map exists, but the area maps something that isn't a file.

Fixes bug 2995.

Signed-off-by: Suzuki K P <suzuki@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-27 16:25:54 -07:00
Andrew Morton 1aaf18ff9d [PATCH] check_user_page_readable() deadlock fix
Fix bug identifued by Richard Purdie <rpurdie@rpsys.net>.

oprofile calls check_user_page_readable() from interrupt context, so we
deadlock over various VFS locks.

But check_user_page_readable() doesn't imply either a read or a write of the
page's contents.  Change __follow_page() so that check_user_page_readable()
can tell __follow_page() that we're not accessing the page's contents, and use
that info to avoid the troublesome lock-takings.

Also, make follow_page() inline for the single callsite in memory.c to save a
bit of stack space.

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-27 16:25:53 -07:00
Andi Kleen 90c5029e47 [PATCH] Undo mempolicy shared policy rbtree microoptimization
All mempolicy changes must be inside the spinlock and readding the rb_erase
prevents a crash while doing:

> echo "1" > /tmp/numatest
> numactl --length=0x4000 --shm /tmp/numatest --localalloc
> numactl --length=0x2000 --offset=0 --shm /tmp/numatest --membind=0
> numactl --length=0x2000 --offset=0x2000 --shm /tmp/numatest --membind=1
> ipcs
> ipcrm -M "the_key_value_of_this_shm_area"

Based on a patch by John Blackwood

Cc: <john.blackwood@ccur.com>
Cc: <andrea@suse.de>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-27 16:25:52 -07:00
Carsten Otte afa597ba20 [PATCH] execute-in-place fixes
This patch includes feedback from Andrew and Christoph. Thanks for
taking time to review.

Use of empty_zero_page was eliminated to fix compilation for architectures
that don't have it.

This patch removes setting pages up-to-date in ext2_get_xip_page and all
bug checks to verify that the page is indeed up to date.  Setting the page
state on mapping to userland is bogus.  None of the code patchs involved
with these pages in mm cares about the page state.

still on my ToDo list: identify a place outside second extended where
__inode_direct_access should reside

Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-15 09:54:50 -07:00
Geert Uytterhoeven 082ff0a999 [PATCH] mm/filemap_xip.c compilation fix
mm/filemap_xip.c: In function `__xip_unmap':
mm/filemap_xip.c:194: request for member `pte' in something not a structure or union

Apparently pte_pfn() takes a pte_t, not a pointer to a pte_t.  From looking
at asm/page.h, it seems to be the same on ia32 or ppc (iff
STRICT_MM_TYPECHECKS is enabled, which is disabled by default on ppc).

Acked-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-12 16:01:00 -07:00
Alexey Dobriyan 0db925af1d [PATCH] propagate __nocast annotations
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-07 18:23:46 -07:00
Anton Blanchard 42639269f9 [PATCH] mm: quieten OOM killer noise
We now print statistics when invoking the OOM killer, however this
information is not rate limited and you can get into situations where the
console is continually spammed.

For example, when a task is exiting the OOM killer will simply return
(waiting for that task to exit and clear up memory).  If the VM continually
calls back into the OOM killer we get thousands of copies of show_mem() on
the console.

Use printk_ratelimit() to quieten it.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-07 18:23:36 -07:00
Marcelo Tosatti 37b173a4d0 [PATCH] remove completly bogus comment inside __alloc_pages() try_to_free_pages handling
Remove completly bogus comment from did_some_progress != 0 handling (that
same comment is a few lines below on did_some_progress = 0 case, where it
belongs).

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-07 18:23:35 -07:00
Marcelo Tosatti 79b9ce311e [PATCH] print order information when OOM killing
Dump the current allocation order when OOM killing.

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-07 18:23:35 -07:00
Christoph Lameter 83b78bd2d3 [PATCH] Fix broken kmalloc_node in rc1/rc2
This patch used to be in Andrew's tree before the NUMA slab allocator went
in. Either this patch or the NUMA slab allocator is needed in order for
kmalloc_node to work correctly.

pcibus_to_node may be used to generate the node information passed to
kmalloc_node. pcibus_to_node returns -1 if it was not able to determine
on which node a pcibus is located. For that case kmalloc_node must
work like kmalloc.

Signed-off-by: Christoph Lameter <christoph@lameter.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-06 10:52:45 -07:00
Pekka J Enberg 687a21cee1 [PATCH] rename wakeup_bdflush to wakeup_pdflush
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-28 21:20:31 -07:00
Bob Picco 3212c6be25 [PATCH] fix WANT_PAGE_VIRTUAL in memmap_init
I spotted this issue while in memmap_init last week.  I can't say the
change has any test coverage by me.  start_pfn was formerly used in main
"for" loop.  The fix is replace start_pfn with pfn.

Signed-off-by: Bob Picco <bob.picco@hp.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-27 15:11:42 -07:00
Linus Torvalds 2031d0f586 Merge Christoph's freeze cleanup patch 2005-06-25 17:16:53 -07:00
Christoph Lameter 3e1d1d28d9 [PATCH] Cleanup patch for process freezing
1. Establish a simple API for process freezing defined in linux/include/sched.h:

   frozen(process)		Check for frozen process
   freezing(process)		Check if a process is being frozen
   freeze(process)		Tell a process to freeze (go to refrigerator)
   thaw_process(process)	Restart process
   frozen_process(process)	Process is frozen now

2. Remove all references to PF_FREEZE and PF_FROZEN from all
   kernel sources except sched.h

3. Fix numerous locations where try_to_freeze is manually done by a driver

4. Remove the argument that is no longer necessary from two function calls.

5. Some whitespace cleanup

6. Clear potential race in refrigerator (provides an open window of PF_FREEZE
   cleared before setting PF_FROZEN, recalc_sigpending does not check
   PF_FROZEN).

This patch does not address the problem of freeze_processes() violating the rule
that a task may only modify its own flags by setting PF_FREEZE. This is not clean
in an SMP environment. freeze(process) is therefore not SMP safe!

Signed-off-by: Christoph Lameter <christoph@lameter.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25 17:10:13 -07:00
Nick Wilson 8c0e33c133 [PATCH] Use ALIGN to remove duplicate code
This patch makes use of ALIGN() to remove duplicate round-up code.

Signed-off-by: Nick Wilson <njw@osdl.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25 16:25:02 -07:00
Vivek Goyal 92aa63a5a1 [PATCH] kdump: Retrieve saved max pfn
This patch retrieves the max_pfn being used by previous kernel and stores it
in a safe location (saved_max_pfn) before it is overwritten due to user
defined memory map.  This pfn is used to make sure that user does not try to
read the physical memory beyond saved_max_pfn.

Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25 16:24:52 -07:00
Badari Pulavarty b0cfbd995d [PATCH] fix for generic_file_write iov problem
Here is the fix for the problem described in

	http://bugzilla.kernel.org/show_bug.cgi?id=4721

Basically, problem is generic_file_buffered_write() is accessing beyond end
of the iov[] vector after handling the last vector.  If we happen to cross
page boundary, we get a fault.

I think this simple patch is good enough.  If we really don't want to
depend on the "count", then we need pass nr_segs to
filemap_set_next_iovec() and decrement it and check it.

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25 16:24:39 -07:00
Pavel Machek 648be31881 [PATCH] swsusp: kill config_pm_disk
CONFIG_PM_DISK is long gone, but it still managed to survived at few
places.

Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25 16:24:32 -07:00
Hugh Dickins 2d15cab85b [PATCH] mm: fix remap_pte_range BUG
Out-of-tree user of remap_pfn_range hit kernel BUG at mm/memory.c:1112!  It
passes an unrounded size to remap_pfn_range, which was okay before 2.6.12,
but misses remap_pte_range's new end condition.  An audit of all the other
ptwalks confirms that this is the only one so exposed.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25 16:24:26 -07:00
Hifumi Hisashi 1e8a81c5a3 [PATCH] Fix the error handling in direct I/O
Fix a bug on error handling in the direct I/O function.

Currently, if a file is opened with the O_DIRECT|O_SYNC flag, the write()
syscall cannot receive the EIO error after an I/O error (SCSI cable is
disconnected etc.).

Return values of other points that call generic_osync_inode() are treated
appropriately.

Signed-off-by: Hisashi Hifumi  <hifumi.hisashi@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25 16:24:25 -07:00
Carsten Otte fe77ba6f4f [PATCH] xip: madvice/fadvice: execute in place
Make sys_madvice/fadvice return sane with xip.

Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-24 00:06:42 -07:00
Carsten Otte eb6fe0c388 [PATCH] xip: reduce code duplication
This patch reworks filemap_xip.c with the goal to reduce code duplication
from mm/filemap.c.  It applies agains 2.6.12-rc6-mm1.  Instead of
implementing the aio functions, this one implements the synchronous
read/write functions only.  For readv and writev, the generic fallback is
used.  For aio, we rely on the application doing the fallback.  Since our
"synchronous" function does memcpy immediately anyway, there is no
performance difference between using the fallbacks or implementing each
operation.

Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-24 00:06:41 -07:00
Carsten Otte ceffc07852 [PATCH] xip: fs/mm: execute in place
- generic_file* file operations do no longer have a xip/non-xip split
- filemap_xip.c implements a new set of fops that require get_xip_page
  aop to work proper. all new fops are exported GPL-only (don't like to
  see whatever code use those except GPL modules)
- __xip_unmap now uses page_check_address, which is no longer static
  in rmap.c, and defined in linux/rmap.h
- mm/filemap.h is now much more clean, plainly having just Linus'
  inline funcs moved here from filemap.c
- fix includes in filemap_xip to make it build cleanly on i386

Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-24 00:06:41 -07:00
Martin Waitz 3d41088fa3 [PATCH] DocBook: update comments
This patch updates some comments to match code changes.

Signed-off-by: Martin Waitz <tali@admingilde.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-24 00:06:40 -07:00
Christoph Lameter 45778ca819 [PATCH] Remove f_error field from struct file
The following patch removes the f_error field and all checks of f_error.

Trond said:

  f_error was introduced for NFS, and made sense when we were guaranteed
  always to have a file pointer around when write errors occurred.  Since
  then, we have (for various reasons) had to introduce the nfs_open_context in
  order to track the file read/write state, and it made sense to move our
  f_error tracking there too.

Signed-off-by: Christoph Lameter <christoph@lameter.com>
Acked-by: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23 09:45:33 -07:00
Benjamin LaHaise 01890a4c12 [PATCH] mempool - only init waitqueue in slow path
Here's a small patch to improve the performance of mempool_alloc by only
initializing the wait queue when we're about to wait.

Signed-off-by: Benjamin LaHaise <benjamin.c.lahaise@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23 09:45:29 -07:00
Pekka Enberg 3bc1ee3e8f [PATCH] remove redundant vm_flags clearing from madvise.c
This patch removes redundant VM_ClearReadHint from mm/madvice.c which was
left there by Prasanna's patch.

Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23 09:45:19 -07:00
Paulo Marques 543537bd92 [PATCH] create a kstrdup library function
This patch creates a new kstrdup library function and changes the "local"
implementations in several places to use this function.

Most of the changes come from the sound and net subsystems.  The sound part
had already been acknowledged by Takashi Iwai and the net part by David S.
Miller.

I left UML alone for now because I would need more time to read the code
carefully before making changes there.

Signed-off-by: Paulo Marques <pmarques@grupopie.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23 09:45:18 -07:00
Christoph Lameter 1946089a10 [PATCH] NUMA aware block device control structure allocation
Patch to allocate the control structures for for ide devices on the node of
the device itself (for NUMA systems).  The patch depends on the Slab API
change patch by Manfred and me (in mm) and the pcidev_to_node patch that I
posted today.

Does some realignment too.

Signed-off-by: Justin M. Forbes <jmforbes@linuxtx.org>
Signed-off-by: Christoph Lameter <christoph@lameter.com>
Signed-off-by: Pravin Shelar <pravin@calsoftinc.com>
Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23 09:45:09 -07:00
Andy Whitcroft 29751f6991 [PATCH] sparsemem hotplug base
Make sparse's initalization be accessible at runtime.  This allows sparse
mappings to be created after boot in a hotplug situation.

This patch is separated from the previous one just to give an indication how
much of the sparse infrastructure is *just* for hotplug memory.

The section_mem_map doesn't really store a pointer.  It stores something that
is convenient to do some math against to get a pointer.  It isn't valid to
just do *section_mem_map, so I don't think it should be stored as a pointer.

There are a couple of things I'd like to store about a section.  First of all,
the fact that it is !NULL does not mean that it is present.  There could be
such a combination where section_mem_map *is* NULL, but the math gets you
properly to a real mem_map.  So, I don't think that check is safe.

Since we're storing 32-bit-aligned structures, we have a few bits in the
bottom of the pointer to play with.  Use one bit to encode whether there's
really a mem_map there, and the other one to tell whether there's a valid
section there.  We need to distinguish between the two because sometimes
there's a gap between when a section is discovered to be present and when we
can get the mem_map for it.

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Jack Steiner <steiner@sgi.com>
Signed-off-by: Bob Picco <bob.picco@hp.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23 09:45:05 -07:00
Andy Whitcroft 641c767389 [PATCH] sparsemem swiss cheese numa layouts
The part of the sparsemem patch which modifies memmap_init_zone() has recently
become a problem.  It changes behavior so that there is a call to
pfn_to_page() for each individual page inside of a node's range:
node_start_pfn through node_end_pfn.  It used to simply do this once, at the
beginning of the node, but having sparsemem's non-contiguous mem_map[]s inside
of a node made it necessary to change.

Mike Kravetz recently wrote a patch which made the NUMA code accept some new
kinds of layouts.  The system's memory was laid out like this, with node 0's
memory in two pieces: one before and one after node 1's memory:

	Node 0: +++++     +++++
	Node 1:      +++++

Previous behavior before Mike's patch was to assign nodes like this:

	Node 0: 00000     XXXXX
	Node 1:      11111

Where the 'X' areas were simply thrown away.  The new behavior was to make the
pg_data_t span node 0 across all of its areas, including areas that are really
node 1's: Node 0: 000000000000000 Node 1: 11111

This wastes a little bit of mem_map space, but ends up being OK, and more
fully utilizes the system's memory.  memmap_init_zone() initializes all of the
"struct page"s for node 0, even for the "hole", but those never get used,
because there is no pfn_to_page() that resolves to those pages.  However, only
calling pfn_to_page() once, memmap_init_zone() always uses the pages that were
allocated for node0->node_mem_map because:

	struct page *start = pfn_to_page(start_pfn);
	// effectively start = &node->node_mem_map[0]
	for (page = start; page < (start + size); page++) {
		init_page_here();...
		page++;
	}

Slow, and wasteful, but generally harmless.

But, modify that to call pfn_to_page() for each loop iteration (like sparsemem
does):

	for (pfn = start_pfn; pfn < < (start_pfn + size); pfn++++) {
		page = pfn_to_page(pfn);
	}

And you end up trying to initialize node 1's pages too early, along with bogus
data from node 0.  This patch checks for those weird layouts and declines to
touch the pages, making the more frequent pfn_to_page() calls OK to do.

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23 09:45:05 -07:00
Andy Whitcroft d41dee369b [PATCH] sparsemem memory model
Sparsemem abstracts the use of discontiguous mem_maps[].  This kind of
mem_map[] is needed by discontiguous memory machines (like in the old
CONFIG_DISCONTIGMEM case) as well as memory hotplug systems.  Sparsemem
replaces DISCONTIGMEM when enabled, and it is hoped that it can eventually
become a complete replacement.

A significant advantage over DISCONTIGMEM is that it's completely separated
from CONFIG_NUMA.  When producing this patch, it became apparent in that NUMA
and DISCONTIG are often confused.

Another advantage is that sparse doesn't require each NUMA node's ranges to be
contiguous.  It can handle overlapping ranges between nodes with no problems,
where DISCONTIGMEM currently throws away that memory.

Sparsemem uses an array to provide different pfn_to_page() translations for
each SECTION_SIZE area of physical memory.  This is what allows the mem_map[]
to be chopped up.

In order to do quick pfn_to_page() operations, the section number of the page
is encoded in page->flags.  Part of the sparsemem infrastructure enables
sharing of these bits more dynamically (at compile-time) between the
page_zone() and sparsemem operations.  However, on 32-bit architectures, the
number of bits is quite limited, and may require growing the size of the
page->flags type in certain conditions.  Several things might force this to
occur: a decrease in the SECTION_SIZE (if you want to hotplug smaller areas of
memory), an increase in the physical address space, or an increase in the
number of used page->flags.

One thing to note is that, once sparsemem is present, the NUMA node
information no longer needs to be stored in the page->flags.  It might provide
speed increases on certain platforms and will be stored there if there is
room.  But, if out of room, an alternate (theoretically slower) mechanism is
used.

This patch introduces CONFIG_FLATMEM.  It is used in almost all cases where
there used to be an #ifndef DISCONTIG, because SPARSEMEM and DISCONTIGMEM
often have to compile out the same areas of code.

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Martin Bligh <mbligh@aracnet.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Bob Picco <bob.picco@hp.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23 09:45:04 -07:00
Andy Whitcroft af705362ab [PATCH] generify memory present
Allow architectures to indicate that they will be providing hooks to indice
installed memory areas, memory_present().  Provide prototypes for the i386
implementation.

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Martin Bligh <mbligh@aracnet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23 09:45:04 -07:00
Dave Hansen 785dcd44b6 [PATCH] mm/Kconfig: give DISCONTIG more help text
This gives DISCONTIGMEM a bit more help text to explain what it does, not just
when to choose it.

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23 09:45:03 -07:00
Dave Hansen e1785e85b9 [PATCH] mm/Kconfig: hide "Memory Model" selection menu
I got some feedback from users who think that the new "Memory Model" menu is a
little invasive.  This patch will hide that menu, except when
CONFIG_EXPERIMENTAL is enabled *or* when an individual architecture wants it.

An individual arch may want to enable it because they've removed their
arch-specific DISCONTIG prompt in favor of the mm/Kconfig one.

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23 09:45:03 -07:00
Dave Hansen 44d0f805c7 [PATCH] sparsemem: fix minor "defaults" issue in mm/Kconfig
The following patch applies on top of 2.6.12-rc2-mm1.  It fixes a minor
user interaction issue, and an early reference to SPARSEMEM.

This "choice" menu would always default to FLATMEM, as it was listed first.
 Move it to the end so that the other defaults have a chance first.

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23 09:45:03 -07:00
Dave Hansen 93b7504e3e [PATCH] Introduce new Kconfig option for NUMA or DISCONTIG
There is some confusion that arose when working on SPARSEMEM patch between
what is needed for DISCONTIG vs. NUMA.

Multiple pg_data_t's are needed for DISCONTIGMEM or NUMA, independently.
All of the current NUMA implementations require an implementation of
DISCONTIG.  Because of this, quite a lot of code which is really needed for
NUMA is actually under DISCONTIG #ifdefs.  For SPARSEMEM, we changed some
of these #ifdefs to CONFIG_NUMA, but that broke the DISCONTIG=y and NUMA=n
case.

Introducing this new NEED_MULTIPLE_NODES config option allows code that is
needed for both NUMA or DISCONTIG to be separated out from code that is
specific to DISCONTIG.

One great advantage of this approach is that it doesn't require every
architecture to be converted over.  All of the current implementations
should "just work", only the ones implementing SPARSEMEM will have to be
fixed up.

The change to free_area_init() makes it work inside, or out of the new
config option.

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23 09:45:03 -07:00
Dave Hansen 3a9da7655d [PATCH] create mm/Kconfig for arch-independent memory options
With sparsemem being introduced, we need a central place for new
memory-related .config options: mm/Kconfig.  This allows us to remove many
of the duplicated arch-specific options.

The new option, CONFIG_FLATMEM, is there to enable us to detangle NUMA and
DISCONTIGMEM.  This is a requirement for sparsemem because sparsemem uses
the NUMA code without the presence of DISCONTIGMEM.  The sparsemem patches
use CONFIG_FLATMEM in generic code, so this patch is a requirement before
applying them.

Almost all places that used to do '#ifndef CONFIG_DISCONTIGMEM' should use
'#ifdef CONFIG_FLATMEM' instead.

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23 09:45:01 -07:00
Dave Hansen 348f8b6c48 [PATCH] sparsemem base: reorganize page->flags bit operations
Generify the value fields in the page_flags.  The aim is to allow the location
and size of these fields to be varied.  Additionally we want to move away from
fixed allocations per field whilst still enforcing the overall bit utilisation
limits.  We rely on the compiler to spot and optimise the accessor functions.

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23 09:45:01 -07:00
Dave Hansen 6f167ec721 [PATCH] sparsemem base: simple NUMA remap space allocator
Introduce a simple allocator for the NUMA remap space.  This space is very
scarce, used for structures which are best allocated node local.

This mechanism is also used on non-NUMA ia64 systems with a vmem_map to keep
the pgdat->node_mem_map initialized in a consistent place for all
architectures.

Issues:
o alloc_remap takes a node_id where we might expect a pgdat which was intended
  to allow us to allocate the pgdat's using this mechanism; which we do not yet
  do.  Could have alloc_remap_node() and alloc_remap_nid() for this purpose.

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23 09:45:01 -07:00
Christoph Lameter b7c84c6ada [PATCH] boot_pageset must not be freed.
The boot_pageset needs to be preserved for hotplugging and for off line
processors and nodes. Otherwise pointers will point into memory that has
now a different use. /proc/zoneinfo is currently showing strange results
if processors / nodes are not present.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-22 20:42:32 -07:00
Denis Vlasenko c0d62219a4 [PATCH] Kill stray newline
OOM killer prints a stray newline.

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21 18:46:21 -07:00
Abhijit Karmarkar b4955ce3dd [PATCH] msync: check pte dirty earlier
It's common practice to msync a large address range regularly, in which
often only a few ptes have actually been dirtied since the previous pass.

sync_pte_range then goes much faster if it tests whether pte is dirty
before locating and accessing each struct page cacheline; and it is hardly
slowed by ptep_clear_flush_dirty repeating that test in the opposite case,
when every pte actually is dirty.

But beware, s390's pte_dirty always says false, since its dirty bit is kept
in the storage key, located via the struct page address.  So skip this
optimization in its case: use a pte_maybe_dirty macro which just says true
if page_test_and_clear_dirty is implemented.

Signed-off-by: Abhijit Karmarkar <abhijitk@veritas.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21 18:46:21 -07:00
Hugh Dickins c475a8ab62 [PATCH] can_share_swap_page: use page_mapcount
Remember that ironic get_user_pages race?  when the raised page_count on a
page swapped out led do_wp_page to decide that it had to copy on write, so
substituted a different page into userspace.  2.6.7 onwards have Andrea's
solution, where try_to_unmap_one backs out if it finds page_count raised.

Which works, but is unsatisfying (rmap.c has no other page_count heuristics),
and was found a few months ago to hang an intensive page migration test.  A
year ago I was hesitant to engage page_mapcount, now it seems the right fix.

So remove the page_count hack from try_to_unmap_one; and use activate_page in
unuse_mm when dropping lock, to replace its secondary effect of helping
swapoff to make progress in that case.

Simplify can_share_swap_page (now called only on anonymous pages) to check
page_mapcount + page_swapcount == 1: still needs the page lock to stabilize
their (pessimistic) sum, but does not need swapper_space.tree_lock for that.

In do_swap_page, move swap_free and unlock_page below page_add_anon_rmap, to
keep sum on the high side, and correct when can_share_swap_page called.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21 18:46:21 -07:00
Hugh Dickins d296e9cd02 [PATCH] do_wp_page: cannot share file page
A small optimization to do_wp_page's check for whether to avoid copy by
reusing the page already mapped.  It can never share a cached file page,
nor can it share a reserved page (often the empty zero page), so it's a
waste of time to lock and unlock in those cases.  Which nowadays can both
be neatly excluded by a preliminary PageAnon test.

Christoph has reported that a preliminary page_count test proved valuable
for scalability here, but PageAnon covers more common cases all at once.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21 18:46:21 -07:00
Hugh Dickins 08ef472937 [PATCH] get_user_pages: kill get_page_map
Since its birth, get_user_pages has been calling a misguided get_page_map
function.  follow_page has already returned NULL if the pfn is invalid, we
cannot reach an invalid pfn from a validated struct page.

Remove get_page_map, and the messy rewind in get_user_pages to cope with
its failure.  Oh, and could we please call that "struct page *page" like
everywhere else, instead of "struct page *map"?

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21 18:46:21 -07:00
Hugh Dickins 334795eca4 [PATCH] bad_page: clear reclaim and slab
Since free_pages_check complains if PG_reclaim or PG_slab is set, bad_page
ought to clear them to avoid repetitive reports (Nikita noticed this too).
Let prep_new_page check page_count and PG_slab as free_pages_check does.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21 18:46:19 -07:00
Hugh Dickins 91612e0df2 [PATCH] mbind: check_range use standard ptwalk
Strict mbind's check for currently mapped pages being on node has been
using a slow loop which re-evaluates pgd, pud, pmd, pte for each entry:
replace that by a standard four-level page table walk like others in mm.
Since mmap_sem is held for writing, page_table_lock can be taken at the
inner level to limit latency.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21 18:46:19 -07:00
Hugh Dickins 941150a326 [PATCH] mbind: fix verify_pages pte_page
Strict mbind's check that pages already mapped are on right node has been
using pte_page without checking if pfn_valid, and without page_table_lock
to prevent spurious failures when try_to_unmap_one intervenes between the
pte_present and the pte_page.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21 18:46:19 -07:00
Hugh Dickins 0edd73b334 [PATCH] shmem: restore superblock info
To improve shmem scalability, we allowed tmpfs instances which don't need
their blocks or inodes limited not to count them, and not to allocate any
sbinfo.  Which was okay when the only use for the sbinfo was accounting
blocks and inodes; but since then a couple of unrelated projects extending
tmpfs want to store other data in the sbinfo.  Whether either extension
reaches mainline is beside the point: I'm guilty of a bad design decision,
and should restore sbinfo to make any such future extensions easier.

So, once again allocate a shmem_sb_info for every shmem/tmpfs instance, and
now let max_blocks 0 indicate unlimited blocks, and max_inodes 0 unlimited
inodes.  Brent Casavant verified (many months ago) that this does not
perceptibly impact the scalability (since the unlimited sbinfo cacheline is
repeatedly accessed but only once dirtied).

And merge shmem_set_size into its sole caller shmem_remount_fs.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21 18:46:18 -07:00
Christoph Lameter 2caaad41e4 [PATCH] Reduce size of huge boot per_cpu_pageset
Reduce size of the huge per_cpu_pageset structure in __initdata introduced
into mm1 with the pageset localization patchset.  Use one specially
configured pageset per cpu for all zones and nodes during bootup.

- Avoid duplication of pageset initialization code.
- do the adding to the pageset list before potential free_pages_bulk
  in free_hot_cold_page (otherwise we would have to hold a page
  in a pageset during the period that the boot pagesets are in use).
- remove mistaken __cpuinitdata attribute and revert back to __initdata
  for the boot pageset. A boot pageset is not necessary for cpu hotplug.

Tested for UP SMP NUMA on x86_64 (2.6.12-rc6-mm1): UP SMP NUMA Tested on
IA64 (2.6.12-rc5-mm2): NUMA (2.6.12-rc6-mm1 broken for IA64 because of
sparsemem patches)

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21 18:46:18 -07:00
Christoph Lameter 4ae7c03943 [PATCH] Periodically drain non local pagesets
The pageset array can potentially acquire a huge amount of memory on large
NUMA systems.  F.e.  on a system with 512 processors and 256 nodes there
will be 256*512 pagesets.  If each pageset only holds 5 pages then we are
talking about 655360 pages.With a 16K page size on IA64 this results in
potentially 10 Gigabytes of memory being trapped in pagesets.  The typical
cases are much less for smaller systems but there is still the potential of
memory being trapped in off node pagesets.  Off node memory may be rarely
used if local memory is available and so we may potentially have memory in
seldom used pagesets without this patch.

The slab allocator flushes its per cpu caches every 2 seconds.  The
following patch flushes the off node pageset caches in the same way by
tying into the slab flush.

The patch also changes /proc/zoneinfo to include the number of pages
currently in each pageset.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21 18:46:18 -07:00
Janet Morgan 578c2fd6a7 [PATCH] add OOM debug
This patch provides more debug info when the system is OOM.  It displays
memory stats (basically sysrq-m info) from __alloc_pages() when page
allocation fails and during OOM kill.

Thanks to Dave Jones for coming up with the idea.

Signed-off-by: Janet Morgan <janetmor@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21 18:46:17 -07:00
Benjamin LaHaise c2f29ea111 [PATCH] __read_page_state(): pass unsigned long instead of unsigned
By making the offset argument of __read_page_state an unsigned long instead of
unsigned, we can avoid forcing the compiler to sign extend a usually constant
argument.  This saves 1 instruction on x86-64.

Signed-off-by: Benjamin LaHaise <benjamin.c.lahaise@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21 18:46:17 -07:00
Benjamin LaHaise 83e5d8f725 [PATCH] __mod_page_state(): pass unsigned long instead of unsigned
By making the offset argument of __mod_page_state an unsigned long instead
of unsigned, we can avoid forcing the compiler to sign extend a usually
constant argument.  This saves 1 instruction on x86-64.

Signed-off-by: Benjamin LaHaise <benjamin.c.lahaise@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21 18:46:17 -07:00
Darren Hart 1ad539b2bd [PATCH] vm: try_to_free_pages unused argument
try_to_free_pages accepts a third argument, order, but hasn't used it since
before 2.6.0.  The following patch removes the argument and updates all the
calls to try_to_free_pages.

Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21 18:46:17 -07:00
Chris Wright 73219d1780 [PATCH] mmap topdown fix for large stack limit, large allocation
The topdown changes in 2.6.12-rc1 can cause large allocations with large
stack limit to fail, despite there being space available.  The
mmap_base-len is only valid when len >= mmap_base.  However, nothing in
topdown allocator checks this.  It's only (now) caught at higher level,
which will cause allocation to simply fail.  The following change restores
the fallback to bottom-up path, which will allow large allocations with
large stack limit to potentially still succeed.

Signed-off-by: Chris Wright <chrisw@osdl.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21 18:46:16 -07:00
Wolfgang Wander 1363c3cd86 [PATCH] Avoiding mmap fragmentation
Ingo recently introduced a great speedup for allocating new mmaps using the
free_area_cache pointer which boosts the specweb SSL benchmark by 4-5% and
causes huge performance increases in thread creation.

The downside of this patch is that it does lead to fragmentation in the
mmap-ed areas (visible via /proc/self/maps), such that some applications
that work fine under 2.4 kernels quickly run out of memory on any 2.6
kernel.

The problem is twofold:

  1) the free_area_cache is used to continue a search for memory where
     the last search ended.  Before the change new areas were always
     searched from the base address on.

     So now new small areas are cluttering holes of all sizes
     throughout the whole mmap-able region whereas before small holes
     tended to close holes near the base leaving holes far from the base
     large and available for larger requests.

  2) the free_area_cache also is set to the location of the last
     munmap-ed area so in scenarios where we allocate e.g.  five regions of
     1K each, then free regions 4 2 3 in this order the next request for 1K
     will be placed in the position of the old region 3, whereas before we
     appended it to the still active region 1, placing it at the location
     of the old region 2.  Before we had 1 free region of 2K, now we only
     get two free regions of 1K -> fragmentation.

The patch addresses thes issues by introducing yet another cache descriptor
cached_hole_size that contains the largest known hole size below the
current free_area_cache.  If a new request comes in the size is compared
against the cached_hole_size and if the request can be filled with a hole
below free_area_cache the search is started from the base instead.

The results look promising: Whereas 2.6.12-rc4 fragments quickly and my
(earlier posted) leakme.c test program terminates after 50000+ iterations
with 96 distinct and fragmented maps in /proc/self/maps it performs nicely
(as expected) with thread creation, Ingo's test_str02 with 20000 threads
requires 0.7s system time.

Taking out Ingo's patch (un-patch available per request) by basically
deleting all mentions of free_area_cache from the kernel and starting the
search for new memory always at the respective bases we observe: leakme
terminates successfully with 11 distinctive hardly fragmented areas in
/proc/self/maps but thread creating is gringdingly slow: 30+s(!) system
time for Ingo's test_str02 with 20000 threads.

Now - drumroll ;-) the appended patch works fine with leakme: it ends with
only 7 distinct areas in /proc/self/maps and also thread creation seems
sufficiently fast with 0.71s for 20000 threads.

Signed-off-by: Wolfgang Wander <wwc@rentec.com>
Credit-to: "Richard Purdie" <rpurdie@rpsys.net>
Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu> (partly)
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21 18:46:16 -07:00
Christoph Lameter e7c8d5c995 [PATCH] node local per-cpu-pages
This patch modifies the way pagesets in struct zone are managed.

Each zone has a per-cpu array of pagesets.  So any particular CPU has some
memory in each zone structure which belongs to itself.  Even if that CPU is
not local to that zone.

So the patch relocates the pagesets for each cpu to the node that is nearest
to the cpu instead of allocating the pagesets in the (possibly remote) target
zone.  This means that the operations to manage pages on remote zone can be
done with information available locally.

We play a macro trick so that non-NUMA pmachines avoid the additional
pointer chase on the page allocator fastpath.

AIM7 benchmark on a 32 CPU SGI Altix

w/o patches:
Tasks    jobs/min  jti  jobs/min/task      real       cpu
    1      484.68  100       484.6769     12.01      1.97   Fri Mar 25 11:01:42 2005
  100    27140.46   89       271.4046     21.44    148.71   Fri Mar 25 11:02:04 2005
  200    30792.02   82       153.9601     37.80    296.72   Fri Mar 25 11:02:42 2005
  300    32209.27   81       107.3642     54.21    451.34   Fri Mar 25 11:03:37 2005
  400    34962.83   78        87.4071     66.59    588.97   Fri Mar 25 11:04:44 2005
  500    31676.92   75        63.3538     91.87    742.71   Fri Mar 25 11:06:16 2005
  600    36032.69   73        60.0545     96.91    885.44   Fri Mar 25 11:07:54 2005
  700    35540.43   77        50.7720    114.63   1024.28   Fri Mar 25 11:09:49 2005
  800    33906.70   74        42.3834    137.32   1181.65   Fri Mar 25 11:12:06 2005
  900    34120.67   73        37.9119    153.51   1325.26   Fri Mar 25 11:14:41 2005
 1000    34802.37   74        34.8024    167.23   1465.26   Fri Mar 25 11:17:28 2005

with slab API changes and pageset patch:

Tasks    jobs/min  jti  jobs/min/task      real       cpu
    1      485.00  100       485.0000     12.00      1.96   Fri Mar 25 11:46:18 2005
  100    28000.96   89       280.0096     20.79    150.45   Fri Mar 25 11:46:39 2005
  200    32285.80   79       161.4290     36.05    293.37   Fri Mar 25 11:47:16 2005
  300    40424.15   84       134.7472     43.19    438.42   Fri Mar 25 11:47:59 2005
  400    39155.01   79        97.8875     59.46    590.05   Fri Mar 25 11:48:59 2005
  500    37881.25   82        75.7625     76.82    730.19   Fri Mar 25 11:50:16 2005
  600    39083.14   78        65.1386     89.35    872.79   Fri Mar 25 11:51:46 2005
  700    38627.83   77        55.1826    105.47   1022.46   Fri Mar 25 11:53:32 2005
  800    39631.94   78        49.5399    117.48   1169.94   Fri Mar 25 11:55:30 2005
  900    36903.70   79        41.0041    141.94   1310.78   Fri Mar 25 11:57:53 2005
 1000    36201.23   77        36.2012    160.77   1458.31   Fri Mar 25 12:00:34 2005

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com>
Signed-off-by: Shai Fultheim <Shai@Scalex86.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21 18:46:16 -07:00
David Gibson 63551ae0fe [PATCH] Hugepage consolidation
A lot of the code in arch/*/mm/hugetlbpage.c is quite similar.  This patch
attempts to consolidate a lot of the code across the arch's, putting the
combined version in mm/hugetlb.c.  There are a couple of uglyish hacks in
order to covert all the hugepage archs, but the result is a very large
reduction in the total amount of code.  It also means things like hugepage
lazy allocation could be implemented in one place, instead of six.

Tested, at least a little, on ppc64, i386 and x86_64.

Notes:
	- this patch changes the meaning of set_huge_pte() to be more
	  analagous to set_pte()
	- does SH4 need s special huge_ptep_get_and_clear()??

Acked-by: William Lee Irwin <wli@holomorphy.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21 18:46:15 -07:00
Martin Hicks 1e7e5a9048 [PATCH] VM: rate limit early reclaim
When early zone reclaim is turned on the LRU is scanned more frequently when a
zone is low on memory.  This limits when the zone reclaim can be called by
skipping the scan if another thread (either via kswapd or sync reclaim) is
already reclaiming from the zone.

Signed-off-by: Martin Hicks <mort@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21 18:46:14 -07:00
Martin Hicks 0c35bbadc5 [PATCH] VM: add __GFP_NORECLAIM
When using the early zone reclaim, it was noticed that allocating new pages
that should be spread across the whole system caused eviction of local pages.

This adds a new GFP flag to prevent early reclaim from happening during
certain allocation attempts.  The example that is implemented here is for page
cache pages.  We want page cache pages to be spread across the whole system,
and we don't want page cache pages to evict other pages to get local memory.

Signed-off-by:  Martin Hicks <mort@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21 18:46:14 -07:00
Martin Hicks 753ee72896 [PATCH] VM: early zone reclaim
This is the core of the (much simplified) early reclaim.  The goal of this
patch is to reclaim some easily-freed pages from a zone before falling back
onto another zone.

One of the major uses of this is NUMA machines.  With the default allocator
behavior the allocator would look for memory in another zone, which might be
off-node, before trying to reclaim from the current zone.

This adds a zone tuneable to enable early zone reclaim.  It is selected on a
per-zone basis and is turned on/off via syscall.

Adding some extra throttling on the reclaim was also required (patch
4/4).  Without the machine would grind to a crawl when doing a "make -j"
kernel build.  Even with this patch the System Time is higher on
average, but it seems tolerable.  Here are some numbers for kernbench
runs on a 2-node, 4cpu, 8Gig RAM Altix in the "make -j" run:

			wall  user   sys   %cpu  ctx sw.  sleeps
			----  ----   ---   ----   ------  ------
No patch		1009  1384   847   258   298170   504402
w/patch, no reclaim     880   1376   667   288   254064   396745
w/patch & reclaim       1079  1385   926   252   291625   548873

These numbers are the average of 2 runs of 3 "make -j" runs done right
after system boot.  Run-to-run variability for "make -j" is huge, so
these numbers aren't terribly useful except to seee that with reclaim
the benchmark still finishes in a reasonable amount of time.

I also looked at the NUMA hit/miss stats for the "make -j" runs and the
reclaim doesn't make any difference when the machine is thrashing away.

Doing a "make -j8" on a single node that is filled with page cache pages
takes 700 seconds with reclaim turned on and 735 seconds without reclaim
(due to remote memory accesses).

The simple zone_reclaim syscall program is at
http://www.bork.org/~mort/sgi/zone_reclaim.c

Signed-off-by: Martin Hicks <mort@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21 18:46:14 -07:00