Commit Graph

4489 Commits

Author SHA1 Message Date
Andi Kleen aa50d3a7aa Encode huge page size for VM_FAULT_HWPOISON errors
This fixes a problem introduced with the hugetlb hwpoison handling

The user space SIGBUS signalling wants to know the size of the hugepage
that caused a HWPOISON fault.

Unfortunately the architecture page fault handlers do not have easy
access to the struct page.

Pass the information out in the fault error code instead.

I added a separate VM_FAULT_HWPOISON_LARGE bit for this case and encode
the hpage index in some free upper bits of the fault code. The small
page hwpoison keeps stays with the VM_FAULT_HWPOISON name to minimize
changes.

Also add code to hugetlb.h to convert that index into a page shift.

Will be used in a further patch.

Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: fengguang.wu@intel.com
Signed-off-by: Andi Kleen <ak@linux.intel.com>
2010-10-08 09:32:46 +02:00
Andi Kleen d5bd910696 hugepage: move is_hugepage_on_freelist inside ifdef to avoid warning
Fixes warning reported by Stephen Rothwell

mm/hugetlb.c:2950: warning: 'is_hugepage_on_freelist' defined but not used

for the !CONFIG_MEMORY_FAILURE case.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
2010-10-08 09:32:46 +02:00
Andi Kleen 4e1c19750a Clean up __page_set_anon_rmap
Linus asked for a cleanup of __page_set_anon_rmap to make
it look more like the cleaner huge pages version.

Factor out the duplicated PageAnon check into a single check
at the beginning of the function.

Remove obsolete comments and rewrite them into standard English.

No functional changes.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
2010-10-08 09:32:46 +02:00
Naoya Horiguchi 6a90181c7b HWPOISON, hugetlb: fix unpoison for hugepage
Currently unpoisoning hugepages doesn't work correctly because
clearing PG_HWPoison is done outside if (TestClearPageHWPoison).
This patch fixes it.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
2010-10-08 09:32:45 +02:00
Naoya Horiguchi d950b95882 HWPOISON, hugetlb: soft offlining for hugepage
This patch extends soft offlining framework to support hugepage.
When memory corrected errors occur repeatedly on a hugepage,
we can choose to stop using it by migrating data onto another hugepage
and disabling the original (maybe half-broken) one.

ChangeLog since v4:
- branch soft_offline_page() for hugepage

ChangeLog since v3:
- remove comment about "ToDo: hugepage soft-offline"

ChangeLog since v2:
- move refcount handling into isolate_lru_page()

ChangeLog since v1:
- add double check in isolating hwpoisoned hugepage
- define free/non-free checker for hugepage
- postpone calling put_page() for hugepage in soft_offline_page()

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
2010-10-08 09:32:45 +02:00
Naoya Horiguchi 8c6c2ecb44 HWPOSION, hugetlb: recover from free hugepage error when !MF_COUNT_INCREASED
Currently error recovery for free hugepage works only for MF_COUNT_INCREASED.
This patch enables !MF_COUNT_INCREASED case.

Free hugepages can be handled directly by alloc_huge_page() and
dequeue_hwpoisoned_huge_page(), and both of them are protected
by hugetlb_lock, so there is no race between them.

Note that this patch defines the refcount of HWPoisoned hugepage
dequeued from freelist is 1, deviated from present 0, thereby we
can avoid race between unpoison and memory failure on free hugepage.
This is reasonable because unlikely to free buddy pages, free hugepage
is governed by hugetlbfs even after error handling finishes.
And it also makes unpoison code added in the later patch cleaner.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
2010-10-08 09:32:45 +02:00
Naoya Horiguchi a9869b837c hugetlb: move refcounting in hugepage allocation inside hugetlb_lock
Currently alloc_huge_page() raises page refcount outside hugetlb_lock.
but it causes race when dequeue_hwpoison_huge_page() runs concurrently
with alloc_huge_page().
To avoid it, this patch moves set_page_refcounted() in hugetlb_lock.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
2010-10-08 09:32:45 +02:00
Naoya Horiguchi 6de2b1aab9 HWPOISON, hugetlb: add free check to dequeue_hwpoison_huge_page()
This check is necessary to avoid race between dequeue and allocation,
which can cause a free hugepage to be dequeued twice and get kernel unstable.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
2010-10-08 09:32:45 +02:00
Naoya Horiguchi 290408d4a2 hugetlb: hugepage migration core
This patch extends page migration code to support hugepage migration.
One of the potential users of this feature is soft offlining which
is triggered by memory corrected errors (added by the next patch.)

Todo:
- there are other users of page migration such as memory policy,
  memory hotplug and memocy compaction.
  They are not ready for hugepage support for now.

ChangeLog since v4:
- define migrate_huge_pages()
- remove changes on isolation/putback_lru_page()

ChangeLog since v2:
- refactor isolate/putback_lru_page() to handle hugepage
- add comment about race on unmap_and_move_huge_page()

ChangeLog since v1:
- divide migration code path for hugepage
- define routine checking migration swap entry for hugetlb
- replace "goto" with "if/else" in remove_migration_pte()

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
2010-10-08 09:32:45 +02:00
Naoya Horiguchi 0ebabb416f hugetlb: redefine hugepage copy functions
This patch modifies hugepage copy functions to have only destination
and source hugepages as arguments for later use.
The old ones are renamed from copy_{gigantic,huge}_page() to
copy_user_{gigantic,huge}_page().
This naming convention is consistent with that between copy_highpage()
and copy_user_highpage().

ChangeLog since v4:
- add blank line between local declaration and code
- remove unnecessary might_sleep()

ChangeLog since v2:
- change copy_huge_page() from macro to inline dummy function
  to avoid compile warning when !CONFIG_HUGETLB_PAGE.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
2010-10-08 09:32:44 +02:00
Naoya Horiguchi bf50bab2b3 hugetlb: add allocate function for hugepage migration
We can't use existing hugepage allocation functions to allocate hugepage
for page migration, because page migration can happen asynchronously with
the running processes and page migration users should call the allocation
function with physical addresses (not virtual addresses) as arguments.

ChangeLog since v3:
- unify alloc_buddy_huge_page() and alloc_buddy_huge_page_node()

ChangeLog since v2:
- remove unnecessary get/put_mems_allowed() (thanks to David Rientjes)

ChangeLog since v1:
- add comment on top of alloc_huge_page_no_vma()

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
2010-10-08 09:32:44 +02:00
Naoya Horiguchi 998b4382c1 hugetlb: fix metadata corruption in hugetlb_fault()
Since the PageHWPoison() check is for avoiding hwpoisoned page remained
in pagecache mapping to the process, it should be done in "found in pagecache"
branch, not in the common path.
Otherwise, metadata corruption occurs if memory failure happens between
alloc_huge_page() and lock_page() because page fault fails with metadata
changes remained (such as refcount, mapcount, etc.)

This patch moves the check to "found in pagecache" branch and fix the problem.

ChangeLog since v2:
- remove retry check in "new allocation" path.
- make description more detailed
- change patch name from "HWPOISON, hugetlb: move PG_HWPoison bit check"

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
2010-10-08 09:32:44 +02:00
Ingo Molnar 153db80f8c Merge commit 'v2.6.36-rc7' into core/memblock
Merge reason: Update from -rc3 to -rc7.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-10-08 09:15:00 +02:00
Linus Torvalds 6b0cd00bc3 Merge branch 'hwpoison-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6
* 'hwpoison-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6:
  HWPOISON: Stop shrinking at right page count
  HWPOISON: Report correct address granuality for AO huge page errors
  HWPOISON: Copy si_addr_lsb to user
  page-types.c: fix name of unpoison interface
2010-10-07 13:59:32 -07:00
Kirill A. Shutemov ad4ca5f4b7 memcg: fix thresholds with use_hierarchy == 1
We need to check parent's thresholds if parent has use_hierarchy == 1 to
be sure that parent's threshold events will be triggered even if parent
itself is not active (no MEM_CGROUP_EVENTS).

Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-07 13:31:21 -07:00
Robin Holt f241e6607b mm: alloc_large_system_hash() printk overflow on 16TB boot
During boot of a 16TB system, the following is printed:
Dentry cache hash table entries: -2147483648 (order: 22, 17179869184 bytes)

Signed-off-by: Robin Holt <holt@sgi.com>
Reviewed-by: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-07 13:31:21 -07:00
Andi Kleen 47f43e7efa HWPOISON: Stop shrinking at right page count
When we call the slab shrinker to free a page we need to stop at
page count one because the caller always holds a single reference, not zero.

This avoids useless looping over slab shrinkers and freeing too much
memory.

Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
2010-10-07 09:47:10 +02:00
Andi Kleen 0d9ee6a2d4 HWPOISON: Report correct address granuality for AO huge page errors
The SIGBUS user space signalling is supposed to report the
address granuality of a corruption. Pass this information correctly
for huge pages by querying the hpage order.

Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
2010-10-07 09:45:44 +02:00
Pekka Enberg 92a5bbc11f SLUB: Fix memory hotplug with !NUMA
This patch fixes the following build breakage when memory hotplug is enabled on
UMA configurations:

  /home/test/linux-2.6/mm/slub.c: In function 'kmem_cache_init':
  /home/test/linux-2.6/mm/slub.c:3031:2: error: 'slab_memory_callback'
  undeclared (first use in this function)
  /home/test/linux-2.6/mm/slub.c:3031:2: note: each undeclared
  identifier is reported only once for each function it appears in
  make[2]: *** [mm/slub.o] Error 1
  make[1]: *** [mm] Error 2
  make: *** [sub-make] Error 2

Reported-by: Zimny Lech <napohybelskurwysynom2010@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2010-10-06 21:16:42 +03:00
Christoph Lameter a5a84755c5 slub: Move functions to reduce #ifdefs
There is a lot of #ifdef/#endifs that can be avoided if functions would be in different
places. Move them around and reduce #ifdef.

Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2010-10-06 16:54:37 +03:00
Christoph Lameter ab4d5ed5ee slub: Enable sysfs support for !CONFIG_SLUB_DEBUG
Currently disabling CONFIG_SLUB_DEBUG also disabled SYSFS support meaning
that the slabs cannot be tuned without DEBUG.

Make SYSFS support independent of CONFIG_SLUB_DEBUG

Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2010-10-06 16:54:36 +03:00
Pekka Enberg 15b7c51420 SLUB: Optimize slab_free() debug check
This patch optimizes slab_free() debug check to use "c->node != NUMA_NO_NODE"
instead of "c->node >= 0" because the former generates smaller code on x86-64:

  Before:

    4736:       48 39 70 08             cmp    %rsi,0x8(%rax)
    473a:       75 26                   jne    4762 <kfree+0xa2>
    473c:       44 8b 48 10             mov    0x10(%rax),%r9d
    4740:       45 85 c9                test   %r9d,%r9d
    4743:       78 1d                   js     4762 <kfree+0xa2>

  After:

    4736:       48 39 70 08             cmp    %rsi,0x8(%rax)
    473a:       75 23                   jne    475f <kfree+0x9f>
    473c:       83 78 10 ff             cmpl   $0xffffffffffffffff,0x10(%rax)
    4740:       74 1d                   je     475f <kfree+0x9f>

This patch also cleans up __slab_alloc() to use NUMA_NO_NODE instead of "-1"
for enabling debugging for a per-CPU cache.

Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2010-10-06 16:52:26 +03:00
Yinghai Lu f1af98c762 memblock: Fix wraparound in find_region()
When trying to find huge range for crashkernel, get

[    0.000000] ------------[ cut here ]------------
[    0.000000] WARNING: at arch/x86/mm/memblock.c:248 memblock_x86_reserve_range+0x40/0x7a()
[    0.000000] Hardware name: Sun Fire x4800
[    0.000000] memblock_x86_reserve_range: wrong range [0xffffffff37000000, 0x137000000)
[    0.000000] Modules linked in:
[    0.000000] Pid: 0, comm: swapper Not tainted 2.6.36-rc5-tip-yh-01876-g1cac214-dirty #59
[    0.000000] Call Trace:
[    0.000000]  [<ffffffff82816f7e>] ? memblock_x86_reserve_range+0x40/0x7a
[    0.000000]  [<ffffffff81078c2d>] warn_slowpath_common+0x85/0x9e
[    0.000000]  [<ffffffff81078d38>] warn_slowpath_fmt+0x6e/0x70
[    0.000000]  [<ffffffff8281e77c>] ? memblock_find_region+0x40/0x78
[    0.000000]  [<ffffffff8281eb1f>] ? memblock_find_base+0x9a/0xb9
[    0.000000]  [<ffffffff82816f7e>] memblock_x86_reserve_range+0x40/0x7a
[    0.000000]  [<ffffffff8280452c>] setup_arch+0x99d/0xb2a
[    0.000000]  [<ffffffff810a3e02>] ? trace_hardirqs_off+0xd/0xf
[    0.000000]  [<ffffffff81cec7d8>] ? _raw_spin_unlock_irqrestore+0x3d/0x4c
[    0.000000]  [<ffffffff827ffcec>] start_kernel+0xde/0x3f1
[    0.000000]  [<ffffffff827ff2d4>] x86_64_start_reservations+0xa0/0xa4
[    0.000000]  [<ffffffff827ff3de>] x86_64_start_kernel+0x106/0x10d
[    0.000000] ---[ end trace a7919e7f17c0a725 ]---
[    0.000000] Reserving 8192MB of memory at 17592186041200MB for crashkernel (System RAM: 526336MB)

This is caused by a wraparound in the test due to size > end;
explicitly check for this condition and fail.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <4CAA4DD3.1080401@kernel.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2010-10-05 21:45:35 -07:00
Hugh Dickins 4e31635c36 ksm: fix bad user data when swapping
Building under memory pressure, with KSM on 2.6.36-rc5, collapsed with
an internal compiler error: typically indicating an error in swapping.

Perhaps there's a timing issue which makes it now more likely, perhaps
it's just a long time since I tried for so long: this bug goes back to
KSM swapping in 2.6.33.

Notice how reuse_swap_page() allows an exclusive page to be reused, but
only does SetPageDirty if it can delete it from swap cache right then -
if it's currently under Writeback, it has to be left in cache and we
don't SetPageDirty, but the page can be reused.  Fine, the dirty bit
will get set in the pte; but notice how zap_pte_range() does not bother
to transfer pte_dirty to page_dirty when unmapping a PageAnon.

If KSM chooses to share such a page, it will look like a clean copy of
swapcache, and not be written out to swap when its memory is needed;
then stale data read back from swap when it's needed again.

We could fix this in reuse_swap_page() (or even refuse to reuse a
page under writeback), but it's more honest to fix my oversight in
KSM's write_protect_page().  Several days of testing on three machines
confirms that this fixes the issue they showed.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-04 11:09:53 -07:00
Hugh Dickins 4829b906cc ksm: fix page_address_in_vma anon_vma oops
2.6.36-rc1 commit 21d0d443cd "rmap:
resurrect page_address_in_vma anon_vma check" was right to resurrect
that check; but now that it's comparing anon_vma->roots instead of
just anon_vmas, there's a danger of oopsing on a NULL anon_vma.

In most cases no NULL anon_vma ever gets here; but it turns out that
occasionally KSM, when enabled on a forked or forking process, will
itself call page_address_in_vma() on a "half-KSM" page left over from
an earlier failed attempt to merge - whose page_anon_vma() is NULL.

It's my bug that those should be getting here at all: I thought they
were already dealt with, this oops proves me wrong, I'll fix it in
the next release - such pages are effectively pinned until their
process exits, since rmap cannot find their ptes (though swapoff can).

For now just work around it by making page_address_in_vma() safe (and
add a comment on why that check is wanted anyway).  A similar check
in __page_check_anon_rmap() is safe because do_page_add_anon_rmap()
already excluded KSM pages.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-04 11:09:53 -07:00
Namhyung Kim 5d1f57e4d3 slub: Move NUMA-related functions under CONFIG_NUMA
Make kmalloc_cache_alloc_node_notrace(), kmalloc_large_node()
and __kmalloc_node_track_caller() to be compiled only when
CONFIG_NUMA is selected.

Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2010-10-02 10:47:53 +03:00
Namhyung Kim 3478973ded slub: Add lock release annotation
The unfreeze_slab() releases page's PG_locked bit but was missing
proper annotation. The deactivate_slab() needs to be marked also
since it calls unfreeze_slab() without grabbing the lock.

Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2010-10-02 10:47:53 +03:00
Namhyung Kim a5dd5c117c slub: Fix signedness warnings
The bit-ops routines require its arg to be a pointer to unsigned long.
This leads sparse to complain about different signedness as follows:

 mm/slub.c:2425:49: warning: incorrect type in argument 2 (different signedness)
 mm/slub.c:2425:49:    expected unsigned long volatile *addr
 mm/slub.c:2425:49:    got long *map

Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2010-10-02 10:47:52 +03:00
Christoph Lameter 62e346a830 slub: extract common code to remove objects from partial list without locking
There are a couple of places where repeat the same statements when removing
a page from the partial list. Consolidate that into __remove_partial().

Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2010-10-02 10:44:10 +03:00
Christoph Lameter f7cb193362 SLUB: Pass active and inactive redzone flags instead of boolean to debug functions
Pass the actual values used for inactive and active redzoning to the
functions that check the objects. Avoids a lot of the ? : things to
lookup the values in the functions.

Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2010-10-02 10:44:10 +03:00
Christoph Lameter 7340cc8414 slub: reduce differences between SMP and NUMA
Reduce the #ifdefs and simplify bootstrap by making SMP and NUMA as much alike
as possible. This means that there will be an additional indirection to get to
the kmem_cache_node field under SMP.

Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2010-10-02 10:44:10 +03:00
Pekka Enberg ed59ecbf89 Revert "Slub: UP bandaid"
This reverts commit 5249d039500f05a5ab379286b1d23ab9b04d3f2c. It's not needed
after commit bbddff0545 ("percpu: use percpu
allocator on UP too").
2010-10-02 10:28:55 +03:00
Tejun Heo ed6c1115c8 percpu: clear memory allocated with the km allocator
Percpu allocator should clear memory before returning it but the km
allocator forgot to do it.  Fix it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
2010-10-02 10:28:42 +03:00
Tejun Heo 9b8327bb24 percpu: use percpu allocator on UP too
On UP, percpu allocations were redirected to kmalloc.  This has the
following problems.

* For certain amount of allocations (determined by
  PERCPU_DYNAMIC_EARLY_SLOTS and PERCPU_DYNAMIC_EARLY_SIZE), percpu
  allocator can be used before the usual kernel memory allocator is
  brought online.  On SMP, this is used to initialize the kernel
  memory allocator.

* percpu allocator honors alignment upto PAGE_SIZE but kmalloc()
  doesn't.  For example, workqueue makes use of larger alignments for
  cpu_workqueues.

Currently, users of percpu allocators need to handle UP differently,
which is somewhat fragile and ugly.  Other than small amount of
memory, there isn't much to lose by enabling percpu allocator on UP.
It can simply use kernel memory based chunk allocation which was added
for SMP archs w/o MMUs.

This patch removes mm/percpu_up.c, builds mm/percpu.c on UP too and
makes UP build use percpu-km.  As percpu addresses and kernel
addresses are always identity mapped and static percpu variables don't
need any special treatment, nothing is arch dependent and mm/percpu.c
implements generic setup_per_cpu_areas() for UP.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
2010-10-02 10:26:05 +03:00
Tejun Heo 0bc1406241 vmalloc: pcpu_get/free_vm_areas() aren't needed on UP
These functions are used only by percpu memory allocator on SMP.
Don't build them on UP.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Nick Piggin <npiggin@kernel.dk>
2010-10-02 10:25:25 +03:00
Pekka Enberg 84c1cf6246 SLUB: Fix merged slab cache names
As explained by Linus "I'm Proud to be an American" Torvalds:

  Looking at the merging code, I actually think it's totally
  buggy. If you have something like this:

   - load module A: create slab cache A

   - load module B: create slab cache B that can merge with A

   - unload module A

   - "cat /proc/slabinfo": BOOM. Oops.

  exactly because the name is not handled correctly, and you'll have
  module B holding open a slab cache that has a name pointer that points
  to module A that no longer exists.

This patch fixes the problem by using kstrdup() to allocate dynamic memory for
->name of "struct kmem_cache" as suggested by Christoph Lameter.

Acked-by: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Pekka Enberg <penberg@kernel.org>

Conflicts:

	mm/slub.c
2010-10-02 10:24:29 +03:00
Christoph Lameter db210e70e5 Slub: UP bandaid
Since the percpu allocator does not provide early allocation in UP mode (only
in SMP configurations) use __get_free_page() to improvise a compound page
allocation that can be later freed via kfree().

Compound pages will be released when the cpu caches are resized.

Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2010-10-02 10:24:29 +03:00
David Rientjes a016471a16 slub: fix SLUB_RESILIENCY_TEST for dynamic kmalloc caches
Now that the kmalloc_caches array is dynamically allocated at boot,
SLUB_RESILIENCY_TEST needs to be fixed to pass the correct type.

Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2010-10-02 10:24:29 +03:00
Christoph Lameter 8de66a0c02 slub: Fix up missing kmalloc_cache -> kmem_cache_node case for memoryhotplug
Memory hotplug allocates and frees per node structures. Use the correct name.

Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2010-10-02 10:24:28 +03:00
Christoph Lameter 7d550c56a2 slub: Add dummy functions for the !SLUB_DEBUG case
On Wed, 25 Aug 2010, Randy Dunlap wrote:
> mm/slub.c:1732: error: implicit declaration of function 'slab_pre_alloc_hook'
> mm/slub.c:1751: error: implicit declaration of function 'slab_post_alloc_hook'
> mm/slub.c:1881: error: implicit declaration of function 'slab_free_hook'
> mm/slub.c:1886: error: implicit declaration of function 'slab_free_hook_irq'

Empty functions are missing if the runtime debuggability option is compiled
out.

Provide the fall back functions to empty hooks if SLUB_DEBUG is not set.

Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2010-10-02 10:24:28 +03:00
David Rientjes 8df275af8d slob: fix gfp flags for order-0 page allocations
kmalloc_node() may allocate higher order slob pages, but the __GFP_COMP
bit is only passed to the page allocator and not represented in the
tracepoint event.  The bit should be passed to trace_kmalloc_node() as
well.

Acked-by: Matt Mackall <mpm@selenic.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2010-10-02 10:24:28 +03:00
Christoph Lameter c1d508365e slub: Move gfpflag masking out of the hotpath
Move the gfpflags masking into the hooks for checkers and into the slowpaths.
gfpflag masking requires access to a global variable and thus adds an
additional cacheline reference to the hotpaths.

If no hooks are active then the gfpflag masking will result in
code that the compiler can toss out.

Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2010-10-02 10:24:27 +03:00
Christoph Lameter c016b0bdee slub: Extract hooks for memory checkers from hotpaths
Extract the code that memory checkers and other verification tools use from
the hotpaths. Makes it easier to add new ones and reduces the disturbances
of the hotpaths.

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2010-10-02 10:24:27 +03:00
Christoph Lameter 51df114281 slub: Dynamically size kmalloc cache allocations
kmalloc caches are statically defined and may take up a lot of space just
because the sizes of the node array has to be dimensioned for the largest
node count supported.

This patch makes the size of the kmem_cache structure dynamic throughout by
creating a kmem_cache slab cache for the kmem_cache objects. The bootstrap
occurs by allocating the initial one or two kmem_cache objects from the
page allocator.

C2->C3
	- Fix various issues indicated by David
	- Make create kmalloc_cache return a kmem_cache * pointer.

Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2010-10-02 10:24:27 +03:00
Christoph Lameter 6c182dc0de slub: Remove static kmem_cache_cpu array for boot
The percpu allocator can now handle allocations during early boot.
So drop the static kmem_cache_cpu array.

Cc: Tejun Heo <tj@kernel.org>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2010-10-02 10:24:26 +03:00
Christoph Lameter 55136592fe slub: Remove dynamic dma slab allocation
Remove the dynamic dma slab allocation since this causes too many issues with
nested locks etc etc. The change avoids passing gfpflags into many functions.

V3->V4:
- Create dma caches in kmem_cache_init() instead of kmem_cache_init_late().

Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2010-10-02 10:24:26 +03:00
Christoph Lameter 1537066c69 slub: Force no inlining of debug functions
Compiler folds the debgging functions into the critical paths.
Avoid that by adding noinline to the functions that check for
problems.

Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2010-10-02 10:24:26 +03:00
Larry Woodman 5ec1055aa5 Avoid pgoff overflow in remap_file_pages
Thomas Pollet noticed that the remap_file_pages() system call in
fremap.c has a potential overflow in the first part of the if statement
below, which could cause it to process bogus input parameters.
Specifically the pgoff + size parameters could be wrap thereby
preventing the system call from failing when it should.

Reported-by: Thomas Pollet <thomas.pollet@gmail.com>
Signed-off-by: Larry Woodman <lwoodman@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-09-25 09:34:58 -07:00
Linus Torvalds e92b05dec8 fremap: get rid of broken 'end' variable
Thomas Pollet points out that the 'end' variable is broken.  It was
computed based on start/size before they were page-aligned, and as such
doesn't actually match any of the other actions we take.  The overflow
test on end was also redundant, since we had already tested it with the
properly aligned version.

So just get rid of it entirely.  The one remaining use for that broken
variable can just use 'start+size' like all the other cases already did.

Reported-by: Thomas Pollet <thomas.pollet@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-09-24 14:13:57 -07:00
Naoya Horiguchi a850ea3037 hugetlb, rmap: add BUG_ON(!PageLocked) in hugetlb_add_anon_rmap()
Confirming page lock is held in hugetlb_add_anon_rmap() may be useful
to detect possible future problems.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-09-23 17:29:19 -07:00
Naoya Horiguchi 56c9cfb13c hugetlb, rmap: fix confusing page locking in hugetlb_cow()
The "if (!trylock_page)" block in the avoidcopy path of hugetlb_cow()
looks confusing and is buggy.  Originally this trylock_page() was
intended to make sure that old_page is locked even when old_page !=
pagecache_page, because then only pagecache_page is locked.

This patch fixes it by moving page locking into hugetlb_fault().

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-09-23 17:29:18 -07:00
Naoya Horiguchi cd67f0d2a9 hugetlb, rmap: use hugepage_add_new_anon_rmap() in hugetlb_cow()
Obviously, setting anon_vma for COWed hugepage should be done
by hugepage_add_new_anon_rmap() to scan vmas faster.
This patch fixes it.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-09-23 17:29:18 -07:00
Naoya Horiguchi 433abed6c6 hugetlb, rmap: always use anon_vma root pointer
This patch applies Andrea's fix given by the following patch into hugepage
rmapping code:

  commit 288468c334
  Author: Andrea Arcangeli <aarcange@redhat.com>
  Date:   Mon Aug 9 17:19:09 2010 -0700

This patch uses anon_vma->root and avoids unnecessary overwriting when
anon_vma is already set up.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-09-23 17:29:18 -07:00
Linus Torvalds 57aebd7739 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
  percpu: fix pcpu_last_unit_cpu
2010-09-23 08:06:55 -07:00
Andrea Arcangeli 2aeadc30de mmap: call unlink_anon_vmas() in __split_vma() in case of error
If __split_vma fails because of an out of memory condition the
anon_vma_chain isn't teardown and freed potentially leading to rmap walks
accessing freed vma information plus there's a memleak.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Johannes Weiner <jweiner@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-09-22 17:22:40 -07:00
David Rientjes e85bfd3aa7 oom: filter unkillable tasks from tasklist dump
/proc/sys/vm/oom_dump_tasks is enabled by default, so it's necessary to
limit as much information as possible that it should emit.

The tasklist dump should be filtered to only those tasks that are eligible
for oom kill.  This is already done for memcg ooms, but this patch extends
it to both cpuset and mempolicy ooms as well as init.

In addition to suppressing irrelevant information, this also reduces
confusion since users currently don't know which tasks in the tasklist
aren't eligible for kill (such as those attached to cpusets or bound to
mempolicies with a disjoint set of mems or nodes, respectively) since that
information is not shown.

Signed-off-by: David Rientjes <rientjes@google.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-09-22 17:22:39 -07:00
Minchan Kim d1908362ae vmscan: check all_unreclaimable in direct reclaim path
M.  Vefa Bicakci reported 2.6.35 kernel hang up when hibernation on his
32bit 3GB mem machine.
(https://bugzilla.kernel.org/show_bug.cgi?id=16771). Also he bisected
the regression to

  commit bb21c7ce18
  Author: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
  Date:   Fri Jun 4 14:15:05 2010 -0700

     vmscan: fix do_try_to_free_pages() return value when priority==0 reclaim failure

At first impression, this seemed very strange because the above commit
only chenged function return value and hibernate_preallocate_memory()
ignore return value of shrink_all_memory().  But it's related.

Now, page allocation from hibernation code may enter infinite loop if the
system has highmem.  The reasons are that vmscan don't care enough OOM
case when oom_killer_disabled.

The problem sequence is following as.

1. hibernation
2. oom_disable
3. alloc_pages
4. do_try_to_free_pages
       if (scanning_global_lru(sc) && !all_unreclaimable)
               return 1;

If kswapd is not freozen, it would set zone->all_unreclaimable to 1 and
then shrink_zones maybe return true(ie, all_unreclaimable is true).  So at
last, alloc_pages could go to _nopage_.  If it is, it should have no
problem.

This patch adds all_unreclaimable check to protect in direct reclaim path,
too.  It can care of hibernation OOM case and help bailout
all_unreclaimable case slightly.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Reported-by: M. Vefa Bicakci <bicave@superonline.com>
Reported-by: <caiqian@redhat.com>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: <caiqian@redhat.com>
Acked-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-09-22 17:22:39 -07:00
David Rientjes f19e8aa11a oom: always return a badness score of non-zero for eligible tasks
A task's badness score is roughly a proportion of its rss and swap
compared to the system's capacity.  The scale ranges from 0 to 1000 with
the highest score chosen for kill.  Thus, this scale operates on a
resolution of 0.1% of RAM + swap.  Admin tasks are also given a 3% bonus,
so the badness score of an admin task using 3% of memory, for example,
would still be 0.

It's possible that an exceptionally large number of tasks will combine to
exhaust all resources but never have a single task that uses more than
0.1% of RAM and swap (or 3.0% for admin tasks).

This patch ensures that the badness score of any eligible task is never 0
so the machine doesn't unnecessarily panic because it cannot find a task
to kill.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-09-22 17:22:38 -07:00
Linus Torvalds b68e9d4581 Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
  bdi: Fix warnings in __mark_inode_dirty for /dev/zero and friends
  char: Mark /dev/zero and /dev/kmem as not capable of writeback
  bdi: Initialize noop_backing_dev_info properly
  cfq-iosched: fix a kernel OOPs when usb key is inserted
  block: fix blk_rq_map_kern bio direction flag
  cciss: freeing uninitialized data on error path
2010-09-22 09:12:37 -07:00
Jan Kara 976e48f8a5 bdi: Initialize noop_backing_dev_info properly
Properly initialize this backing dev info so that writeback code does not
barf when getting to it e.g. via sb->s_bdi.

Cc: stable@kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-09-22 09:48:47 +02:00
Tejun Heo 46b30ea9bc percpu: fix pcpu_last_unit_cpu
pcpu_first/last_unit_cpu are used to track which cpu has the first and
last units assigned.  This in turn is used to determine the span of a
chunk for man/unmap cache flushes and whether an address belongs to
the first chunk or not in per_cpu_ptr_to_phys().

When the number of possible CPUs isn't power of two, a chunk may
contain unassigned units towards the end of a chunk.  The logic to
determine pcpu_last_unit_cpu was incorrect when there was an unused
unit at the end of a chunk.  It failed to ignore the unused unit and
assigned the unused marker NR_CPUS to pcpu_last_unit_cpu.

This was discovered through kdump failure which was caused by
malfunctioning per_cpu_ptr_to_phys() on a kvm setup with 50 possible
CPUs by CAI Qian.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: CAI Qian <caiqian@redhat.com>
Cc: stable@kernel.org
2010-09-21 08:12:25 +02:00
Hugh Dickins 31c4a3d3a0 mm: further fix swapin race condition
Commit 4969c1192d ("mm: fix swapin race condition") is now agreed to
be incomplete.  There's a race, not very much less likely than the
original race envisaged, in which it is further necessary to check that
the swapcache page's swap has not changed.

Here's the reasoning: cast in terms of reuse_swap_page(), but probably
could be reformulated to rely on try_to_free_swap() instead, or on
swapoff+swapon.

A, faults into do_swap_page(): does page1 = lookup_swap_cache(swap1) and
comes through the lock_page(page1).

B, a racing thread of the same process, faults on the same address: does
page1 = lookup_swap_cache(swap1) and now waits in lock_page(page1), but
for whatever reason is unlucky not to get the lock any time soon.

A carries on through do_swap_page(), a write fault, but cannot reuse the
swap page1 (another reference to swap1).  Unlocks the page1 (but B
doesn't get it yet), does COW in do_wp_page(), page2 now in that pte.

C, perhaps the parent of A+B, comes in and write faults the same swap
page1 into its mm, reuse_swap_page() succeeds this time, swap1 is freed.

kswapd comes in after some time (B still unlucky) and swaps out some
pages from A+B and C: it allocates the original swap1 to page2 in A+B,
and some other swap2 to the original page1 now in C.  But does not
immediately free page1 (actually it couldn't: B holds a reference),
leaving it in swap cache for now.

B at last gets the lock on page1, hooray! Is PageSwapCache(page1)? Yes.
Is pte_same(*page_table, orig_pte)? Yes, because page2 has now been
given the swap1 which page1 used to have.  So B proceeds to insert page1
into A+B's page_table, though its content now belongs to C, quite
different from what A wrote there.

B ought to have checked that page1's swap was still swap1.

Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-09-20 10:44:37 -07:00
Cliff Wickman 3ee48b6af4 mm, x86: Saving vmcore with non-lazy freeing of vmas
During the reading of /proc/vmcore the kernel is doing
ioremap()/iounmap() repeatedly. And the buildup of un-flushed
vm_area_struct's is causing a great deal of overhead. (rb_next()
is chewing up most of that time).

This solution is to provide function set_iounmap_nonlazy(). It
causes a subsequent call to iounmap() to immediately purge the
vma area (with try_purge_vmap_area_lazy()).

With this patch we have seen the time for writing a 250MB
compressed dump drop from 71 seconds to 44 seconds.

Signed-off-by: Cliff Wickman <cpw@sgi.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: kexec@lists.infradead.org
Cc: <stable@kernel.org>
LKML-Reference: <E1OwHZ4-0005WK-Tw@eag09.americas.sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-09-17 09:11:56 +02:00
Christoph Hellwig dd3932eddf block: remove BLKDEV_IFL_WAIT
All the blkdev_issue_* helpers can only sanely be used for synchronous
caller.  To issue cache flushes or barriers asynchronously the caller needs
to set up a bio by itself with a completion callback to move the asynchronous
state machine ahead.  So drop the BLKDEV_IFL_WAIT flag that is always
specified when calling blkdev_issue_* and also remove the now unused flags
argument to blkdev_issue_flush and blkdev_issue_zeroout.  For
blkdev_issue_discard we need to keep it for the secure discard flag, which
gains a more descriptive name and loses the bitops vs flag confusion.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-09-16 20:52:58 +02:00
Yinghai Lu 3661ca66a4 memblock: Fix section mismatch warnings
Stephen found a bunch of section mismatch warnings with the
new memblock changes.

Use __init_memblock to replace __init in memblock.c and remove
__init in memblock.h. We should not use __init in header files.

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Tested-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Yinghai Lu <Yinghai@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
LKML-Reference: <4C912709.2090201@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-09-15 22:17:13 +02:00
Linus Torvalds ff3cb3fec3 Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
  block: Range check cpu in blk_cpu_to_group
  scatterlist: prevent invalid free when alloc fails
  writeback: Fix lost wake-up shutting down writeback thread
  writeback: do not lose wakeup events when forking bdi threads
  cciss: fix reporting of max queue depth since init
  block: switch s390 tape_block and mg_disk to elevator_change()
  block: add function call to switch the IO scheduler from a driver
  fs/bio-integrity.c: return -ENOMEM on kmalloc failure
  bio-integrity.c: remove dependency on __GFP_NOFAIL
  BLOCK: fix bio.bi_rw handling
  block: put dev->kobj in blk_register_queue fail path
  cciss: handle allocation failure
  cfq-iosched: Documentation help for new tunables
  cfq-iosched: blktrace print per slice sector stats
  cfq-iosched: Implement tunable group_idle
  cfq-iosched: Do group share accounting in IOPS when slice_idle=0
  cfq-iosched: Do not idle if slice_idle=0
  cciss: disable doorbell reset on reset_devices
  blkio: Fix return code for mkdir calls
2010-09-10 07:26:27 -07:00
Christoph Hellwig 349f429eec swap: do not send discards as barriers
The swap code already uses synchronous discards, no need to add I/O barriers.

tj: superflous newlines removed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Hugh Dickins <hughd@google.com>
Tested-by: Nigel Cunningham <nigel@tuxonice.net>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-09-10 12:35:39 +02:00
Tejun Heo 9329ba9704 percpu: update comments to reflect that percpu allocations are always zero-filled
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Stephane Eranian <eranian@google.com>
2010-09-10 11:01:56 +02:00
Tejun Heo fc1481a956 percpu: clear memory allocated with the km allocator
Percpu allocator should clear memory before returning it but the km
allocator forgot to do it.  Fix it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
2010-09-10 10:56:24 +02:00
Mel Gorman 9ee493ce0a mm: page allocator: drain per-cpu lists after direct reclaim allocation fails
When under significant memory pressure, a process enters direct reclaim
and immediately afterwards tries to allocate a page.  If it fails and no
further progress is made, it's possible the system will go OOM.  However,
on systems with large amounts of memory, it's possible that a significant
number of pages are on per-cpu lists and inaccessible to the calling
process.  This leads to a process entering direct reclaim more often than
it should increasing the pressure on the system and compounding the
problem.

This patch notes that if direct reclaim is making progress but allocations
are still failing that the system is already under heavy pressure.  In
this case, it drains the per-cpu lists and tries the allocation a second
time before continuing.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-09-09 18:57:25 -07:00
Christoph Lameter aa45484031 mm: page allocator: calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
Ordinarily watermark checks are based on the vmstat NR_FREE_PAGES as it is
cheaper than scanning a number of lists.  To avoid synchronization
overhead, counter deltas are maintained on a per-cpu basis and drained
both periodically and when the delta is above a threshold.  On large CPU
systems, the difference between the estimated and real value of
NR_FREE_PAGES can be very high.  If NR_FREE_PAGES is much higher than
number of real free page in buddy, the VM can allocate pages below min
watermark, at worst reducing the real number of pages to zero.  Even if
the OOM killer kills some victim for freeing memory, it may not free
memory if the exit path requires a new page resulting in livelock.

This patch introduces a zone_page_state_snapshot() function (courtesy of
Christoph) that takes a slightly more accurate view of an arbitrary vmstat
counter.  It is used to read NR_FREE_PAGES while kswapd is awake to avoid
the watermark being accidentally broken.  The estimate is not perfect and
may result in cache line bounces but is expected to be lighter than the
IPI calls necessary to continually drain the per-cpu counters while kswapd
is awake.

Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-09-09 18:57:25 -07:00
Mel Gorman 72853e2991 mm: page allocator: update free page counters after pages are placed on the free list
When allocating a page, the system uses NR_FREE_PAGES counters to
determine if watermarks would remain intact after the allocation was made.
This check is made without interrupts disabled or the zone lock held and
so is race-prone by nature.  Unfortunately, when pages are being freed in
batch, the counters are updated before the pages are added on the list.
During this window, the counters are misleading as the pages do not exist
yet.  When under significant pressure on systems with large numbers of
CPUs, it's possible for processes to make progress even though they should
have been stalled.  This is particularly problematic if a number of the
processes are using GFP_ATOMIC as the min watermark can be accidentally
breached and in extreme cases, the system can livelock.

This patch updates the counters after the pages have been added to the
list.  This makes the allocator more cautious with respect to preserving
the watermarks and mitigates livelock possibilities.

[akpm@linux-foundation.org: avoid modifying incoming args]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-09-09 18:57:25 -07:00
KAMEZAWA Hiroyuki 5ee28a4476 vmstat: update zone stat threshold when onlining a cpu
refresh_zone_stat_thresholds() calculates parameter based on the number of
online cpus.  It's called at cpu offlining but needs to be called at
onlining, too.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-09-09 18:57:25 -07:00
Hugh Dickins 3399446632 swap: discard while swapping only if SWAP_FLAG_DISCARD
Tests with recent firmware on Intel X25-M 80GB and OCZ Vertex 60GB SSDs
show a shift since I last tested in December: in part because of firmware
updates, in part because of the necessary move from barriers to awaiting
completion at the block layer.  While discard at swapon still shows as
slightly beneficial on both, discarding 1MB swap cluster when allocating
is now disadvanteous: adds 25% overhead on Intel, adds 230% on OCZ (YMMV).

Surrender: discard as presently implemented is more hindrance than help
for swap; but might prove useful on other devices, or with improvements.
So continue to do the discard at swapon, but make discard while swapping
conditional on a SWAP_FLAG_DISCARD to sys_swapon() (which has been using
only the lower 16 bits of int flags).

We can add a --discard or -d to swapon(8), and a "discard" to swap in
/etc/fstab: matching the mount option for btrfs, ext4, fat, gfs2, nilfs2.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Nigel Cunningham <nigel@tuxonice.net>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <jaxboe@fusionio.com>
Cc: James Bottomley <James.Bottomley@hansenpartnership.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-09-09 18:57:25 -07:00
Christoph Hellwig 8f2ae0faa3 swap: do not send discards as barriers
The swap code already uses synchronous discards, no need to add I/O
barriers.

This fixes the worst of the terrible slowdown in swap allocation for
hibernation, reported on 2.6.35 by Nigel Cunningham; but does not entirely
eliminate that regression.

[tj@kernel.org: superflous newlines removed]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Nigel Cunningham <nigel@tuxonice.net>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Jens Axboe <jaxboe@fusionio.com>
Cc: James Bottomley <James.Bottomley@hansenpartnership.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-09-09 18:57:25 -07:00
Hugh Dickins b73d7fcecd swap: prevent reuse during hibernation
Move the hibernation check from scan_swap_map() into try_to_free_swap():
to catch not only the common case when hibernation's allocation itself
triggers swap reuse, but also the less likely case when concurrent page
reclaim (shrink_page_list) might happen to try_to_free_swap from a page.

Hibernation already clears __GFP_IO from the gfp_allowed_mask, to stop
reclaim from going to swap: check that to prevent swap reuse too.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Ondrej Zary <linux@rainbow-software.org>
Cc: Andrea Gelmini <andrea.gelmini@gmail.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Nigel Cunningham <nigel@tuxonice.net>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-09-09 18:57:25 -07:00
Hugh Dickins 910321ea81 swap: revert special hibernation allocation
Please revert 2.6.36-rc commit d2997b1042
"hibernation: freeze swap at hibernation".  It complicated matters by
adding a second swap allocation path, just for hibernation; without in any
way fixing the issue that it was intended to address - page reclaim after
fixing the hibernation image might free swap from a page already imaged as
swapcache, letting its swap be reallocated to store a different page of
the image: resulting in data corruption if the imaged page were freed as
clean then swapped back in.  Pages freed to si->swap_map were still in
danger of being reallocated by the alternative allocation path.

I guess it inadvertently fixed slow SSD swap allocation for hibernation,
as reported by Nigel Cunningham: by missing out the discards that occur on
the usual swap allocation path; but that was unintentional, and needs a
separate fix.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Ondrej Zary <linux@rainbow-software.org>
Cc: Andrea Gelmini <andrea.gelmini@gmail.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Nigel Cunningham <nigel@tuxonice.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-09-09 18:57:25 -07:00
Gary King ac8456d6f9 bounce: call flush_dcache_page() after bounce_copy_vec()
I have been seeing problems on Tegra 2 (ARMv7 SMP) systems with HIGHMEM
enabled on 2.6.35 (plus some patches targetted at 2.6.36 to perform cache
maintenance lazily), and the root cause appears to be that the mm bouncing
code is calling flush_dcache_page before it copies the bounce buffer into
the bio.

The bounced page needs to be flushed after data is copied into it, to
ensure that architecture implementations can synchronize instruction and
data caches if necessary.

Signed-off-by: Gary King <gking@nvidia.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Acked-by: Jens Axboe <axboe@kernel.dk>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-09-09 18:57:25 -07:00
KAMEZAWA Hiroyuki 0dcc48c15f memory hotplug: fix next block calculation in is_removable
next_active_pageblock() is for finding next _used_ freeblock.  It skips
several blocks when it finds there are a chunk of free pages lager than
pageblock.  But it has 2 bugs.

  1. We have no lock. page_order(page) - pageblock_order can be minus.
  2. pageblocks_stride += is wrong. it should skip page_order(p) of pages.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-09-09 18:57:24 -07:00
Minchan Kim bc69304574 mm: compaction: handle active and inactive fairly in too_many_isolated
Iram reported that compaction's too_many_isolated() loops forever.
(http://www.spinics.net/lists/linux-mm/msg08123.html)

The meminfo when the situation happened was inactive anon is zero.  That's
because the system has no memory pressure until then.  While all anon
pages were in the active lru, compaction could select active lru as well
as inactive lru.  That's a different thing from vmscan's isolated.  So we
has been two too_many_isolated.

While compaction can isolate pages in both active and inactive, current
implementation of too_many_isolated only considers inactive.  It made
Iram's problem.

This patch handles active and inactive fairly.  That's because we can't
expect where from and how many compaction would isolated pages.

This patch changes (nr_isolated > nr_inactive) with
nr_isolated > (nr_active + nr_inactive) / 2.

Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Reported-by: Iram Shahzad <iram.shahzad@jp.fujitsu.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-09-09 18:57:24 -07:00
Andrea Arcangeli 152e0659fc mm: avoid warning when COMPACTION is selected
COMPACTION enables MIGRATION, but MIGRATION spawns a warning if numa or
memhotplug aren't selected.  However MIGRATION doesn't depend on them.  I
guess it's just trying to be strict doing a double check on who's enabling
it, but it doesn't know that compaction also enables MIGRATION.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-09-09 18:57:24 -07:00
Andrea Arcangeli 4969c1192d mm: fix swapin race condition
The pte_same check is reliable only if the swap entry remains pinned (by
the page lock on swapcache).  We've also to ensure the swapcache isn't
removed before we take the lock as try_to_free_swap won't care about the
page pin.

One of the possible impacts of this patch is that a KSM-shared page can
point to the anon_vma of another process, which could exit before the page
is freed.

This can leave a page with a pointer to a recycled anon_vma object, or
worse, a pointer to something that is no longer an anon_vma.

[riel@redhat.com: changelog help]
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-09-09 18:57:24 -07:00
Stefan Bader 39aa3cb3e8 mm: Move vma_stack_continue into mm.h
So it can be used by all that need to check for that.

Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-09-09 09:05:06 -07:00
Tejun Heo 3c9a024fde percpu: fix build breakage on s390 and cleanup build configuration tests
Commit bbddff05 (percpu: use percpu allocator on UP too) incorrectly
excluded pcpu_build_alloc_info() on SMP configurations which use
generic setup_per_cpu_area() like s390.  The config ifdefs are
becoming confusing.  Fix and clean it up by,

* Move pcpu_build_alloc_info() right on top of its two users -
  pcpu_{embed|page}_first_chunk() which are already in CONFIG_SMP
  block.

* Define BUILD_{EMBED|PAGE}_FIRST_CHUNK which indicate whether each
  first chunk function needs to be included and use them to control
  inclusion of the three functions to reduce confusion.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Sachin Sant <sachinp@in.ibm.com>
2010-09-09 18:00:15 +02:00
Tejun Heo bbddff0545 percpu: use percpu allocator on UP too
On UP, percpu allocations were redirected to kmalloc.  This has the
following problems.

* For certain amount of allocations (determined by
  PERCPU_DYNAMIC_EARLY_SLOTS and PERCPU_DYNAMIC_EARLY_SIZE), percpu
  allocator can be used before the usual kernel memory allocator is
  brought online.  On SMP, this is used to initialize the kernel
  memory allocator.

* percpu allocator honors alignment upto PAGE_SIZE but kmalloc()
  doesn't.  For example, workqueue makes use of larger alignments for
  cpu_workqueues.

Currently, users of percpu allocators need to handle UP differently,
which is somewhat fragile and ugly.  Other than small amount of
memory, there isn't much to lose by enabling percpu allocator on UP.
It can simply use kernel memory based chunk allocation which was added
for SMP archs w/o MMUs.

This patch removes mm/percpu_up.c, builds mm/percpu.c on UP too and
makes UP build use percpu-km.  As percpu addresses and kernel
addresses are always identity mapped and static percpu variables don't
need any special treatment, nothing is arch dependent and mm/percpu.c
implements generic setup_per_cpu_areas() for UP.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
2010-09-08 11:11:23 +02:00
Tejun Heo 4f8b02b4e5 vmalloc: pcpu_get/free_vm_areas() aren't needed on UP
These functions are used only by percpu memory allocator on SMP.
Don't build them on UP.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Nick Piggin <npiggin@kernel.dk>
Reviewed-by: Chrsitoph Lameter <cl@linux.com>
2010-09-08 11:10:47 +02:00
Linus Torvalds ce7db282a3 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
  percpu: fix a mismatch between code and comment
  percpu: fix a memory leak in pcpu_extend_area_map()
  percpu: add __percpu notations to UP allocator
  percpu: handle __percpu notations in UP accessors
2010-09-07 14:08:37 -07:00
Ingo Molnar daab7fc734 Merge commit 'v2.6.36-rc3' into x86/memblock
Conflicts:
	arch/x86/kernel/trampoline.c
	mm/memblock.c

Merge reason: Resolve the conflicts, update to latest upstream.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-08-31 09:45:46 +02:00
Linus Torvalds 997396a73a Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  ceph: fix get_ticket_handler() error handling
  ceph: don't BUG on ENOMEM during mds reconnect
  ceph: ceph_mdsc_build_path() returns an ERR_PTR
  ceph: Fix warnings
  ceph: ceph_get_inode() returns an ERR_PTR
  ceph: initialize fields on new dentry_infos
  ceph: maintain i_head_snapc when any caps are dirty, not just for data
  ceph: fix osd request lru adjustment when sending request
  ceph: don't improperly set dir complete when holding EXCL cap
  mm: exporting account_page_dirty
  ceph: direct requests in snapped namespace based on nonsnap parent
  ceph: queue cap snap writeback for realm children on snap update
  ceph: include dirty xattrs state in snapped caps
  ceph: fix xattr cap writeback
  ceph: fix multiple mds session shutdown
2010-08-28 14:07:20 -07:00
Hugh Dickins f18194275c mm: fix hang on anon_vma->root->lock
After several hours, kbuild tests hang with anon_vma_prepare() spinning on
a newly allocated anon_vma's lock - on a box with CONFIG_TREE_PREEMPT_RCU=y
(which makes this very much more likely, but it could happen without).

The ever-subtle page_lock_anon_vma() now needs a further twist: since
anon_vma_prepare() and anon_vma_fork() are liable to change the ->root
of a reused anon_vma structure at any moment, page_lock_anon_vma()
needs to check page_mapped() again before succeeding, otherwise
page_unlock_anon_vma() might address a different root->lock.

Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-28 13:54:12 -07:00
Yinghai Lu a9ce6bc151 x86, memblock: Replace e820_/_early string with memblock_
1.include linux/memblock.h directly. so later could reduce e820.h reference.
2 this patch is done by sed scripts mainly

-v2: use MEMBLOCK_ERROR instead of -1ULL or -1UL

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2010-08-27 11:13:47 -07:00
Yinghai Lu 72d7c3b33c x86: Use memblock to replace early_res
1. replace find_e820_area with memblock_find_in_range
2. replace reserve_early with memblock_x86_reserve_range
3. replace free_early with memblock_x86_free_range.
4. NO_BOOTMEM will switch to use memblock too.
5. use _e820, _early wrap in the patch, in following patch, will
   replace them all
6. because memblock_x86_free_range support partial free, we can remove some special care
7. Need to make sure that memblock_find_in_range() is called after memblock_x86_fill()
   so adjust some calling later in setup.c::setup_arch()
   -- corruption_check and mptable_update

-v2: Move reserve_brk() early
    Before fill_memblock_area, to avoid overlap between brk and memblock_find_in_range()
    that could happen We have more then 128 RAM entry in E820 tables, and
    memblock_x86_fill() could use memblock_find_in_range() to find a new place for
    memblock.memory.region array.
    and We don't need to use extend_brk() after fill_memblock_area()
    So move reserve_brk() early before fill_memblock_area().
-v3: Move find_smp_config early
    To make sure memblock_find_in_range not find wrong place, if BIOS doesn't put mptable
    in right place.
-v4: Treat RESERVED_KERN as RAM in memblock.memory. and they are already in
    memblock.reserved already..
    use __NOT_KEEP_MEMBLOCK to make sure memblock related code could be freed later.
-v5: Generic version __memblock_find_in_range() is going from high to low, and for 32bit
    active_region for 32bit does include high pages
    need to replace the limit with memblock.default_alloc_limit, aka get_max_mapped()
-v6: Use current_limit instead
-v7: check with MEMBLOCK_ERROR instead of -1ULL or -1L
-v8: Set memblock_can_resize early to handle EFI with more RAM entries
-v9: update after kmemleak changes in mainline

Suggested-by: David S. Miller <davem@davemloft.net>
Suggested-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2010-08-27 11:12:29 -07:00
Yinghai Lu edbe7d23b4 memblock: Add find_memory_core_early()
According to node range in early_node_map[] with __memblock_find_in_range
to find free range.

Will be used by memblock_x86_find_in_range_node()

memblock_x86_find_in_range_node will be used to find right buffer for NODE_DATA

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2010-08-27 11:10:54 -07:00
Yinghai Lu f88eff74aa bootmem, x86: Add weak version of reserve_bootmem_generic
It will be used memblock_x86_to_bootmem converting

It is an wrapper for reserve_bootmem, and x86 64bit is using special one.

Also clean up that version for x86_64. We don't need to take care of numa
path for that, bootmem can handle it how

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2010-08-27 11:08:13 -07:00
Yinghai Lu 7950c407c0 memblock: Add memblock_free/reserve_reserved_regions()
So we can avoid export memblock_reserved_init_regions()
Suggested by Ben.

-v2: use __init_memblock attribute

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2010-08-27 11:07:56 -07:00
Namhyung Kim 54157c4447 percpu: fix a mismatch between code and comment
When pcpu_build_alloc_info() searches best_upa value, it ignores current value
if the number of waste units exceeds 1/3 of the number of total cpus. But the
comment on the code says that it will ignore if wastage is over 25%.
Modify the comment.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2010-08-27 11:36:19 +02:00
Huang Shijie a002d14842 percpu: fix a memory leak in pcpu_extend_area_map()
The original code did not free the old map.  This patch fixes it.

tj: use @old as memcpy source instead of @chunk->map, and indentation
    and description update

Signed-off-by: Huang Shijie <shijie8@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@kernel.org
2010-08-27 11:36:08 +02:00
Artem Bityutskiy 6628bc74f1 writeback: do not lose wakeup events when forking bdi threads
This patch fixes the following issue:

INFO: task mount.nfs4:1120 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
mount.nfs4    D 00000000fffc6a21     0  1120   1119 0x00000000
 ffff880235643948 0000000000000046 ffffffff00000000 ffffffff00000000
 ffff880235643fd8 ffff880235314760 00000000001d44c0 ffff880235643fd8
 00000000001d44c0 00000000001d44c0 00000000001d44c0 00000000001d44c0
Call Trace:
 [<ffffffff813bc747>] schedule_timeout+0x34/0xf1
 [<ffffffff813bc530>] ? wait_for_common+0x3f/0x130
 [<ffffffff8106b50b>] ? trace_hardirqs_on+0xd/0xf
 [<ffffffff813bc5c3>] wait_for_common+0xd2/0x130
 [<ffffffff8104159c>] ? default_wake_function+0x0/0xf
 [<ffffffff813beaa0>] ? _raw_spin_unlock+0x26/0x2a
 [<ffffffff813bc6bb>] wait_for_completion+0x18/0x1a
 [<ffffffff81101a03>] sync_inodes_sb+0xca/0x1bc
 [<ffffffff811056a6>] __sync_filesystem+0x47/0x7e
 [<ffffffff81105798>] sync_filesystem+0x47/0x4b
 [<ffffffff810e7ffd>] generic_shutdown_super+0x22/0xd2
 [<ffffffff810e80f8>] kill_anon_super+0x11/0x4f
 [<ffffffffa00d06d7>] nfs4_kill_super+0x3f/0x72 [nfs]
 [<ffffffff810e7b68>] deactivate_locked_super+0x21/0x41
 [<ffffffff810e7fd6>] deactivate_super+0x40/0x45
 [<ffffffff810fc66c>] mntput_no_expire+0xb8/0xed
 [<ffffffff810fc73b>] release_mounts+0x9a/0xb0
 [<ffffffff810fc7bb>] put_mnt_ns+0x6a/0x7b
 [<ffffffffa00d0fb2>] nfs_follow_remote_path+0x19a/0x296 [nfs]
 [<ffffffffa00d11ca>] nfs4_try_mount+0x75/0xaf [nfs]
 [<ffffffffa00d1790>] nfs4_get_sb+0x276/0x2ff [nfs]
 [<ffffffff810e7dba>] vfs_kern_mount+0xb8/0x196
 [<ffffffff810e7ef6>] do_kern_mount+0x48/0xe8
 [<ffffffff810fdf68>] do_mount+0x771/0x7e8
 [<ffffffff810fe062>] sys_mount+0x83/0xbd
 [<ffffffff810089c2>] system_call_fastpath+0x16/0x1b

The reason of this hang was a race condition: when the flusher thread is
forking a bdi thread, we use 'kthread_run()', so we run it _before_ we make it
visible in 'bdi->wb.task'. The bdi thread runs, does all works, and goes sleep.
'bdi->wb.task' is still NULL. And this is a dangerous time window.

If at this time someone queues a work for this bdi, he does not see the bdi
thread and wakes up the forker thread instead! But the forker has already
forked this bdi thread, but just did not make it visible yet!

The result is that we lose the wake up event for this bdi thread and the NFS4
code waits forever.

To fix the problem, we should use 'ktrhead_create()' for creating bdi threads,
then make them visible in 'bdi->wb.task', and only after this wake them up.
This is exactly what this patch does.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-27 09:16:18 +02:00
Linus Torvalds 871eae4891 Merge branch '2.6.36-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/xfsdev
* '2.6.36-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/xfsdev:
  xfs: do not discard page cache data on EAGAIN
  xfs: don't do memory allocation under the CIL context lock
  xfs: Reduce log force overhead for delayed logging
  xfs: dummy transactions should not dirty VFS state
  xfs: ensure f_ffree returned by statfs() is non-negative
  xfs: handle negative wbc->nr_to_write during sync writeback
  writeback: write_cache_pages doesn't terminate at nr_to_write <= 0
  xfs: fix untrusted inode number lookup
  xfs: ensure we mark all inodes in a freed cluster XFS_ISTALE
  xfs: unlock items before allowing the CIL to commit
2010-08-25 08:39:07 -07:00
Luck, Tony 8ca3eb0809 guard page for stacks that grow upwards
pa-risc and ia64 have stacks that grow upwards. Check that
they do not run into other mappings. By making VM_GROWSUP
0x0 on architectures that do not ever use it, we can avoid
some unpleasant #ifdefs in check_stack_guard_page().

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-24 12:13:20 -07:00