Commit Graph

12391 Commits

Author SHA1 Message Date
Mel Gorman 09a913a7a9 sched/numa: avoid trapping faults and attempting migration of file-backed dirty pages
change_pte_range is called from task work context to mark PTEs for
receiving NUMA faulting hints.  If the marked pages are dirty then
migration may fail.  Some filesystems cannot migrate dirty pages without
blocking so are skipped in MIGRATE_ASYNC mode which just wastes CPU.
Even when they can, it can be a waste of cycles when the pages are
shared forcing higher scan rates.  This patch avoids marking shared
dirty pages for hinting faults but also will skip a migration if the
page was dirtied after the scanner updated a clean page.

This is most noticeable running the NASA Parallel Benchmark when backed
by btrfs, the default root filesystem for some distributions, but also
noticeable when using XFS.

The following are results from a 4-socket machine running a 4.16-rc4
kernel with some scheduler patches that are pending for the next merge
window.

                        4.16.0-rc4             4.16.0-rc4
                 schedtip-20180309          nodirty-v1
  Time cg.D      459.07 (   0.00%)      444.21 (   3.24%)
  Time ep.D       76.96 (   0.00%)       77.69 (  -0.95%)
  Time is.D       25.55 (   0.00%)       27.85 (  -9.00%)
  Time lu.D      601.58 (   0.00%)      596.87 (   0.78%)
  Time mg.D      107.73 (   0.00%)      108.22 (  -0.45%)

is.D regresses slightly in terms of absolute time but note that that
particular load varies quite a bit from run to run.  The more relevant
observation is the total system CPU usage.

            4.16.0-rc4  4.16.0-rc4
          schedtip-20180309 nodirty-v1
  User        71471.91    70627.04
  System      11078.96     8256.13
  Elapsed       661.66      632.74

That is a substantial drop in system CPU usage and overall the workload
completes faster.  The NUMA balancing statistics are also interesting

  NUMA base PTE updates        111407972   139848884
  NUMA huge PMD updates           206506      264869
  NUMA page range updates      217139044   275461812
  NUMA hint faults               4300924     3719784
  NUMA hint local faults         3012539     3416618
  NUMA hint local percent             70          91
  NUMA pages migrated            1517487     1358420

While more PTEs are scanned due to changes in what faults are gathered,
it's clear that a far higher percentage of faults are local as the bulk
of the remote hits were dirty pages that, in this case with btrfs, had
no chance of migrating.

The following is a comparison when using XFS as that is a more realistic
filesystem choice for a data partition

                        4.16.0-rc4             4.16.0-rc4
                 schedtip-20180309          nodirty-v1r47
  Time cg.D      485.28 (   0.00%)      442.62 (   8.79%)
  Time ep.D       77.68 (   0.00%)       77.54 (   0.18%)
  Time is.D       26.44 (   0.00%)       24.79 (   6.24%)
  Time lu.D      597.46 (   0.00%)      597.11 (   0.06%)
  Time mg.D      142.65 (   0.00%)      105.83 (  25.81%)

That is a reasonable gain on two relatively long-lived workloads.  While
not presented, there is also a substantial drop in system CPu usage and
the NUMA balancing stats show similar improvements in locality as btrfs
did.

Link: http://lkml.kernel.org/r/20180326094334.zserdec62gwmmfqf@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:31 -07:00
Tejun Heo 18be460eeb mm/hmm.c: remove superfluous RCU protection around radix tree lookup
hmm_devmem_find() requires rcu_read_lock_held() but there's nothing which
actually uses the RCU protection.  The only caller is
hmm_devmem_pages_create() which already grabs the mutex and does
superfluous rcu_read_lock/unlock() around the function.

This doesn't add anything and just adds to confusion.  Remove the RCU
protection and open-code the radix tree lookup.  If this needs to become
more sophisticated in the future, let's add them back when necessary.

Link: http://lkml.kernel.org/r/20180314194515.1661824-4-tj@kernel.org
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:31 -07:00
Jérôme Glisse f88a1e90c6 mm/hmm: use device driver encoding for HMM pfn
Users of hmm_vma_fault() and hmm_vma_get_pfns() provide a flags array and
pfn shift value allowing them to define their own encoding for HMM pfn
that are fill inside the pfns array of the hmm_range struct.  With this
device driver can get pfn that match their own private encoding out of HMM
without having to do any conversion.

[rcampbell@nvidia.com: don't ignore specific pte fault flag in hmm_vma_fault()]
  Link: http://lkml.kernel.org/r/20180326213009.2460-2-jglisse@redhat.com
[rcampbell@nvidia.com: clarify fault logic for device private memory]
  Link: http://lkml.kernel.org/r/20180326213009.2460-3-jglisse@redhat.com
Link: http://lkml.kernel.org/r/20180323005527.758-16-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:31 -07:00
Jérôme Glisse 2aee09d8c1 mm/hmm: change hmm_vma_fault() to allow write fault on page basis
This changes hmm_vma_fault() to not take a global write fault flag for a
range but instead rely on caller to populate HMM pfns array with proper
fault flag ie HMM_PFN_VALID if driver want read fault for that address or
HMM_PFN_VALID and HMM_PFN_WRITE for write.

Moreover by setting HMM_PFN_DEVICE_PRIVATE the device driver can ask for
device private memory to be migrated back to system memory through page
fault.

This is more flexible API and it better reflects how device handles and
reports fault.

Link: http://lkml.kernel.org/r/20180323005527.758-15-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:31 -07:00
Jérôme Glisse 53f5c3f489 mm/hmm: factor out pte and pmd handling to simplify hmm_vma_walk_pmd()
No functional change, just create one function to handle pmd and one to
handle pte (hmm_vma_handle_pmd() and hmm_vma_handle_pte()).

Link: http://lkml.kernel.org/r/20180323005527.758-14-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:31 -07:00
Jérôme Glisse 33cd47dcbb mm/hmm: move hmm_pfns_clear() closer to where it is used
Move hmm_pfns_clear() closer to where it is used to make it clear it is
not use by page table walkers.

Link: http://lkml.kernel.org/r/20180323005527.758-13-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:30 -07:00
Jérôme Glisse b2744118a6 mm/hmm: rename HMM_PFN_DEVICE_UNADDRESSABLE to HMM_PFN_DEVICE_PRIVATE
Make naming consistent across code, DEVICE_PRIVATE is the name use outside
HMM code so use that one.

Link: http://lkml.kernel.org/r/20180323005527.758-12-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:30 -07:00
Jérôme Glisse 5504ed2969 mm/hmm: do not differentiate between empty entry or missing directory
There is no point in differentiating between a range for which there is
not even a directory (and thus entries) and empty entry (pte_none() or
pmd_none() returns true).

Simply drop the distinction ie remove HMM_PFN_EMPTY flag and merge now
duplicate hmm_vma_walk_hole() and hmm_vma_walk_clear() functions.

Link: http://lkml.kernel.org/r/20180323005527.758-11-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:30 -07:00
Jérôme Glisse 855ce7d252 mm/hmm: cleanup special vma handling (VM_SPECIAL)
Special vma (one with any of the VM_SPECIAL flags) can not be access by
device because there is no consistent model across device drivers on those
vma and their backing memory.

This patch directly use hmm_range struct for hmm_pfns_special() argument
as it is always affecting the whole vma and thus the whole range.

It also make behavior consistent after this patch both hmm_vma_fault() and
hmm_vma_get_pfns() returns -EINVAL when facing such vma.  Previously
hmm_vma_fault() returned 0 and hmm_vma_get_pfns() return -EINVAL but both
were filling the HMM pfn array with special entry.

Link: http://lkml.kernel.org/r/20180323005527.758-10-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:30 -07:00
Jérôme Glisse ff05c0c6bb mm/hmm: use uint64_t for HMM pfn instead of defining hmm_pfn_t to ulong
All device driver we care about are using 64bits page table entry.  In
order to match this and to avoid useless define convert all HMM pfn to
directly use uint64_t.  It is a first step on the road to allow driver to
directly use pfn value return by HMM (saving memory and CPU cycles use for
conversion between the two).

Link: http://lkml.kernel.org/r/20180323005527.758-9-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:30 -07:00
Jérôme Glisse 86586a41b8 mm/hmm: remove HMM_PFN_READ flag and ignore peculiar architecture
Only peculiar architecture allow write without read thus assume that any
valid pfn do allow for read.  Note we do not care for write only because
it does make sense with thing like atomic compare and exchange or any
other operations that allow you to get the memory value through them.

Link: http://lkml.kernel.org/r/20180323005527.758-8-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:30 -07:00
Jérôme Glisse 08232a4544 mm/hmm: use struct for hmm_vma_fault(), hmm_vma_get_pfns() parameters
Both hmm_vma_fault() and hmm_vma_get_pfns() were taking a hmm_range struct
as parameter and were initializing that struct with others of their
parameters.  Have caller of those function do this as they are likely to
already do and only pass this struct to both function this shorten
function signature and make it easier in the future to add new parameters
by simply adding them to the structure.

Link: http://lkml.kernel.org/r/20180323005527.758-7-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:30 -07:00
Jérôme Glisse c719547f03 mm/hmm: hmm_pfns_bad() was accessing wrong struct
The private field of mm_walk struct point to an hmm_vma_walk struct and
not to the hmm_range struct desired.  Fix to get proper struct pointer.

Link: http://lkml.kernel.org/r/20180323005527.758-6-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:30 -07:00
Jérôme Glisse c01cbba2aa mm/hmm: unregister mmu_notifier when last HMM client quit
This code was lost in translation at one point.  This properly call
mmu_notifier_unregister_no_release() once last user is gone.  This fix the
zombie mm_struct as without this patch we do not drop the refcount we have
on it.

Link: http://lkml.kernel.org/r/20180323005527.758-5-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:30 -07:00
Ralph Campbell e1401513c6 mm/hmm: HMM should have a callback before MM is destroyed
hmm_mirror_register() registers a callback for when the CPU pagetable is
modified.  Normally, the device driver will call hmm_mirror_unregister()
when the process using the device is finished.  However, if the process
exits uncleanly, the struct_mm can be destroyed with no warning to the
device driver.

Link: http://lkml.kernel.org/r/20180323005527.758-4-jglisse@redhat.com
Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:30 -07:00
Steven Rostedt d51d1e6450 mm, vmscan, tracing: use pointer to reclaim_stat struct in trace event
The trace event trace_mm_vmscan_lru_shrink_inactive() currently has 12
parameters! Seven of them are from the reclaim_stat structure.  This
structure is currently local to mm/vmscan.c.  By moving it to the global
vmstat.h header, we can also reference it from the vmscan tracepoints.
In moving it, it brings down the overhead of passing so many arguments
to the trace event.  In the future, we may limit the number of arguments
that a trace event may pass (ideally just 6, but more realistically it
may be 8).

Before this patch, the code to call the trace event is this:

 0f 83 aa fe ff ff       jae    ffffffff811e6261 <shrink_inactive_list+0x1e1>
 48 8b 45 a0             mov    -0x60(%rbp),%rax
 45 8b 64 24 20          mov    0x20(%r12),%r12d
 44 8b 6d d4             mov    -0x2c(%rbp),%r13d
 8b 4d d0                mov    -0x30(%rbp),%ecx
 44 8b 75 cc             mov    -0x34(%rbp),%r14d
 44 8b 7d c8             mov    -0x38(%rbp),%r15d
 48 89 45 90             mov    %rax,-0x70(%rbp)
 8b 83 b8 fe ff ff       mov    -0x148(%rbx),%eax
 8b 55 c0                mov    -0x40(%rbp),%edx
 8b 7d c4                mov    -0x3c(%rbp),%edi
 8b 75 b8                mov    -0x48(%rbp),%esi
 89 45 80                mov    %eax,-0x80(%rbp)
 65 ff 05 e4 f7 e2 7e    incl   %gs:0x7ee2f7e4(%rip)        # 15bd0 <__preempt_count>
 48 8b 05 75 5b 13 01    mov    0x1135b75(%rip),%rax        # ffffffff8231bf68 <__tracepoint_mm_vmscan_lru_shrink_inactive+0x28>
 48 85 c0                test   %rax,%rax
 74 72                   je     ffffffff811e646a <shrink_inactive_list+0x3ea>
 48 89 c3                mov    %rax,%rbx
 4c 8b 10                mov    (%rax),%r10
 89 f8                   mov    %edi,%eax
 48 89 85 68 ff ff ff    mov    %rax,-0x98(%rbp)
 89 f0                   mov    %esi,%eax
 48 89 85 60 ff ff ff    mov    %rax,-0xa0(%rbp)
 89 c8                   mov    %ecx,%eax
 48 89 85 78 ff ff ff    mov    %rax,-0x88(%rbp)
 89 d0                   mov    %edx,%eax
 48 89 85 70 ff ff ff    mov    %rax,-0x90(%rbp)
 8b 45 8c                mov    -0x74(%rbp),%eax
 48 8b 7b 08             mov    0x8(%rbx),%rdi
 48 83 c3 18             add    $0x18,%rbx
 50                      push   %rax
 41 54                   push   %r12
 41 55                   push   %r13
 ff b5 78 ff ff ff       pushq  -0x88(%rbp)
 41 56                   push   %r14
 41 57                   push   %r15
 ff b5 70 ff ff ff       pushq  -0x90(%rbp)
 4c 8b 8d 68 ff ff ff    mov    -0x98(%rbp),%r9
 4c 8b 85 60 ff ff ff    mov    -0xa0(%rbp),%r8
 48 8b 4d 98             mov    -0x68(%rbp),%rcx
 48 8b 55 90             mov    -0x70(%rbp),%rdx
 8b 75 80                mov    -0x80(%rbp),%esi
 41 ff d2                callq  *%r10

After the patch:

 0f 83 a8 fe ff ff       jae    ffffffff811e626d <shrink_inactive_list+0x1cd>
 8b 9b b8 fe ff ff       mov    -0x148(%rbx),%ebx
 45 8b 64 24 20          mov    0x20(%r12),%r12d
 4c 8b 6d a0             mov    -0x60(%rbp),%r13
 65 ff 05 f5 f7 e2 7e    incl   %gs:0x7ee2f7f5(%rip)        # 15bd0 <__preempt_count>
 4c 8b 35 86 5b 13 01    mov    0x1135b86(%rip),%r14        # ffffffff8231bf68 <__tracepoint_mm_vmscan_lru_shrink_inactive+0x28>
 4d 85 f6                test   %r14,%r14
 74 2a                   je     ffffffff811e6411 <shrink_inactive_list+0x371>
 49 8b 06                mov    (%r14),%rax
 8b 4d 8c                mov    -0x74(%rbp),%ecx
 49 8b 7e 08             mov    0x8(%r14),%rdi
 49 83 c6 18             add    $0x18,%r14
 4c 89 ea                mov    %r13,%rdx
 45 89 e1                mov    %r12d,%r9d
 4c 8d 45 b8             lea    -0x48(%rbp),%r8
 89 de                   mov    %ebx,%esi
 51                      push   %rcx
 48 8b 4d 98             mov    -0x68(%rbp),%rcx
 ff d0                   callq  *%rax

Link: http://lkml.kernel.org/r/2559d7cb-ec60-1200-2362-04fa34fd02bb@fb.com
Link: http://lkml.kernel.org/r/20180322121003.4177af15@gandalf.local.home
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reported-by: Alexei Starovoitov <ast@fb.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexei Starovoitov <ast@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:30 -07:00
Andrey Ryabinin e3c1ac586c mm/vmscan: don't mess with pgdat->flags in memcg reclaim
memcg reclaim may alter pgdat->flags based on the state of LRU lists in
cgroup and its children.  PGDAT_WRITEBACK may force kswapd to sleep
congested_wait(), PGDAT_DIRTY may force kswapd to writeback filesystem
pages.  But the worst here is PGDAT_CONGESTED, since it may force all
direct reclaims to stall in wait_iff_congested().  Note that only kswapd
have powers to clear any of these bits.  This might just never happen if
cgroup limits configured that way.  So all direct reclaims will stall as
long as we have some congested bdi in the system.

Leave all pgdat->flags manipulations to kswapd.  kswapd scans the whole
pgdat, only kswapd can clear pgdat->flags once node is balanced, thus
it's reasonable to leave all decisions about node state to kswapd.

Why only kswapd? Why not allow to global direct reclaim change these
flags? It is because currently only kswapd can clear these flags.  I'm
less worried about the case when PGDAT_CONGESTED falsely not set, and
more worried about the case when it falsely set.  If direct reclaimer
sets PGDAT_CONGESTED, do we have guarantee that after the congestion
problem is sorted out, kswapd will be woken up and clear the flag? It
seems like there is no such guarantee.  E.g.  direct reclaimers may
eventually balance pgdat and kswapd simply won't wake up (see
wakeup_kswapd()).

Moving pgdat->flags manipulation to kswapd, means that cgroup2 recalim
now loses its congestion throttling mechanism.  Add per-cgroup
congestion state and throttle cgroup2 reclaimers if memcg is in
congestion state.

Currently there is no need in per-cgroup PGDAT_WRITEBACK and PGDAT_DIRTY
bits since they alter only kswapd behavior.

The problem could be easily demonstrated by creating heavy congestion in
one cgroup:

    echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control
    mkdir -p /sys/fs/cgroup/congester
    echo 512M > /sys/fs/cgroup/congester/memory.max
    echo $$ > /sys/fs/cgroup/congester/cgroup.procs
    /* generate a lot of diry data on slow HDD */
    while true; do dd if=/dev/zero of=/mnt/sdb/zeroes bs=1M count=1024; done &
    ....
    while true; do dd if=/dev/zero of=/mnt/sdb/zeroes bs=1M count=1024; done &

and some job in another cgroup:

    mkdir /sys/fs/cgroup/victim
    echo 128M > /sys/fs/cgroup/victim/memory.max

    # time cat /dev/sda > /dev/null
    real    10m15.054s
    user    0m0.487s
    sys     1m8.505s

According to the tracepoint in wait_iff_congested(), the 'cat' spent 50%
of the time sleeping there.

With the patch, cat don't waste time anymore:

    # time cat /dev/sda > /dev/null
    real    5m32.911s
    user    0m0.411s
    sys     0m56.664s

[aryabinin@virtuozzo.com: congestion state should be per-node]
  Link: http://lkml.kernel.org/r/20180406135215.10057-1-aryabinin@virtuozzo.com
[ayabinin@virtuozzo.com: make congestion state per-cgroup-per-node instead of just per-cgroup[
  Link: http://lkml.kernel.org/r/20180406180254.8970-2-aryabinin@virtuozzo.com
Link: http://lkml.kernel.org/r/20180323152029.11084-5-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Tejun Heo <tj@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:30 -07:00
Andrey Ryabinin d108c7721f mm/vmscan: don't change pgdat state on base of a single LRU list state
We have separate LRU list for each memory cgroup.  Memory reclaim
iterates over cgroups and calls shrink_inactive_list() every inactive
LRU list.  Based on the state of a single LRU shrink_inactive_list() may
flag the whole node as dirty,congested or under writeback.  This is
obviously wrong and hurtful.  It's especially hurtful when we have
possibly small congested cgroup in system.  Than *all* direct reclaims
waste time by sleeping in wait_iff_congested().  And the more memcgs in
the system we have the longer memory allocation stall is, because
wait_iff_congested() called on each lru-list scan.

Sum reclaim stats across all visited LRUs on node and flag node as
dirty, congested or under writeback based on that sum.  Also call
congestion_wait(), wait_iff_congested() once per pgdat scan, instead of
once per lru-list scan.

This only fixes the problem for global reclaim case.  Per-cgroup reclaim
may alter global pgdat flags too, which is wrong.  But that is separate
issue and will be addressed in the next patch.

This change will not have any effect on a systems with all workload
concentrated in a single cgroup.

[aryabinin@virtuozzo.com: check nr_writeback against all nr_taken, not just file]
  Link: http://lkml.kernel.org/r/20180406180254.8970-1-aryabinin@virtuozzo.com
Link: http://lkml.kernel.org/r/20180323152029.11084-4-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:30 -07:00
Andrey Ryabinin c4fd4fa580 mm/vmscan: remove redundant current_may_throttle() check
Only kswapd can have non-zero nr_immediate, and current_may_throttle()
is always true for kswapd (PF_LESS_THROTTLE bit is never set) thus it's
enough to check stat.nr_immediate only.

Link: http://lkml.kernel.org/r/20180315164553.17856-4-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:29 -07:00
Andrey Ryabinin 894befec4d mm/vmscan: update stale comments
Update some comments that became stale since transiton from per-zone to
per-node reclaim.

Link: http://lkml.kernel.org/r/20180315164553.17856-2-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:29 -07:00
Roman Gushchin d79f7aa496 mm: treat indirectly reclaimable memory as free in overcommit logic
Indirectly reclaimable memory can consume a significant part of total
memory and it's actually reclaimable (it will be released under actual
memory pressure).

So, the overcommit logic should treat it as free.

Otherwise, it's possible to cause random system-wide memory allocation
failures by consuming a significant amount of memory by indirectly
reclaimable memory, e.g.  dentry external names.

If overcommit policy GUESS is used, it might be used for denial of
service attack under some conditions.

The following program illustrates the approach.  It causes the kernel to
allocate an unreclaimable kmalloc-256 chunk for each stat() call, so
that at some point the overcommit logic may start blocking large
allocation system-wide.

  int main()
  {
  	char buf[256];
  	unsigned long i;
  	struct stat statbuf;

  	buf[0] = '/';
  	for (i = 1; i < sizeof(buf); i++)
  		buf[i] = '_';

  	for (i = 0; 1; i++) {
  		sprintf(&buf[248], "%8lu", i);
  		stat(buf, &statbuf);
  	}

  	return 0;
  }

This patch in combination with related indirectly reclaimable memory
patches closes this issue.

Link: http://lkml.kernel.org/r/20180313130041.8078-1-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:29 -07:00
Roman Gushchin 034ebf65c3 mm: treat indirectly reclaimable memory as available in MemAvailable
Adjust /proc/meminfo MemAvailable calculation by adding the amount of
indirectly reclaimable memory (rounded to the PAGE_SIZE).

Link: http://lkml.kernel.org/r/20180305133743.12746-4-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:29 -07:00
Roman Gushchin eb59254608 mm: introduce NR_INDIRECTLY_RECLAIMABLE_BYTES
Patch series "indirectly reclaimable memory", v2.

This patchset introduces the concept of indirectly reclaimable memory
and applies it to fix the issue of when a big number of dentries with
external names can significantly affect the MemAvailable value.

This patch (of 3):

Introduce a concept of indirectly reclaimable memory and adds the
corresponding memory counter and /proc/vmstat item.

Indirectly reclaimable memory is any sort of memory, used by the kernel
(except of reclaimable slabs), which is actually reclaimable, i.e.  will
be released under memory pressure.

The counter is in bytes, as it's not always possible to count such
objects in pages.  The name contains BYTES by analogy to
NR_KERNEL_STACK_KB.

Link: http://lkml.kernel.org/r/20180305133743.12746-2-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:29 -07:00
Linus Torvalds 49a695ba72 powerpc updates for 4.17
Notable changes:
 
  - Support for 4PB user address space on 64-bit, opt-in via mmap().
 
  - Removal of POWER4 support, which was accidentally broken in 2016 and no one
    noticed, and blocked use of some modern instructions.
 
  - Workarounds so that the hypervisor can enable Transactional Memory on Power9.
 
  - A series to disable the DAWR (Data Address Watchpoint Register) on Power9.
 
  - More information displayed in the meltdown/spectre_v1/v2 sysfs files.
 
  - A vpermxor (Power8 Altivec) implementation for the raid6 Q Syndrome.
 
  - A big series to make the allocation of our pacas (per cpu area), kernel page
    tables, and per-cpu stacks NUMA aware when using the Radix MMU on Power9.
 
 And as usual many fixes, reworks and cleanups.
 
 Thanks to:
   Aaro Koskinen, Alexandre Belloni, Alexey Kardashevskiy, Alistair Popple, Andy
   Shevchenko, Aneesh Kumar K.V, Anshuman Khandual, Balbir Singh, Benjamin
   Herrenschmidt, Christophe Leroy, Christophe Lombard, Cyril Bur, Daniel Axtens,
   Dave Young, Finn Thain, Frederic Barrat, Gustavo Romero, Horia Geantă,
   Jonathan Neuschäfer, Kees Cook, Larry Finger, Laurent Dufour, Laurent Vivier,
   Logan Gunthorpe, Madhavan Srinivasan, Mark Greer, Mark Hairgrove, Markus
   Elfring, Mathieu Malaterre, Matt Brown, Matt Evans, Mauricio Faria de
   Oliveira, Michael Neuling, Naveen N. Rao, Nicholas Piggin, Paul Mackerras,
   Philippe Bergheaud, Ram Pai, Rob Herring, Sam Bobroff, Segher Boessenkool,
   Simon Guo, Simon Horman, Stewart Smith, Sukadev Bhattiprolu, Suraj Jitindar
   Singh, Thiago Jung Bauermann, Vaibhav Jain, Vaidyanathan Srinivasan, Vasant
   Hegde, Wei Yongjun.
 -----BEGIN PGP SIGNATURE-----
 
 iQIwBAABCAAaBQJayKxDExxtcGVAZWxsZXJtYW4uaWQuYXUACgkQUevqPMjhpYAr
 JQ/6A9Xs4zHDn9OeT9esEIxciETqUlrP0Wp64c4JVC7EkG1E7xRDZ4Xb4m8R2nNt
 9sPhtNO1yCtEk6kFQtPNB0N8v6pud4I6+aMcYnn+tP8mJRYQ4x9bYaF3Hw98IKmE
 Kd6TglmsUQvh2GpwPiF93KpzzWu1HB2kZzzqJcAMTMh7C79Qz00BjrTJltzXB2jx
 tJ+B4lVy8BeU8G5nDAzJEEwb5Ypkn8O40rS/lpAwVTYOBJ8Rbyq8Fj82FeREK9YO
 4EGaEKPkC/FdzX7OJV3v2/nldCd8pzV471fAoGuBUhJiJBMBoBybcTHIdDex7LlL
 zMLV1mUtGo8iolRPhL8iCH+GGifZz2WzstYCozz7hgIraWtc/frq9rZp6q0LdH/K
 trk7UbPGlVb92ecWZVpZyEcsMzKrCgZqnAe9wRNh1uEKScEdzd/bmRaMhENUObRh
 Hili6AVvmSKExpy7k2sZP/oUMaeC15/xz8Lk7l8a/iCkYhNmPYh5iSXM5+UKpcRT
 FYOcO0o3DwXsN46Whow3nJ7TqAsDy9/ecPUG71JQi3ZrHnRrm8jxkn8MCG5pZ1Fi
 KvKDxlg6RiJo3DF9/fSOpJUokvMwqBS5dJo4eh5eiDy94aBTqmBKFecvPxQm7a0L
 l3uXCF/6JuXEvMukFjGBO4RiYhw8i+B2uKsh81XUh7HKrgE=
 =HAB1
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-4.17-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc updates from Michael Ellerman:
 "Notable changes:

   - Support for 4PB user address space on 64-bit, opt-in via mmap().

   - Removal of POWER4 support, which was accidentally broken in 2016
     and no one noticed, and blocked use of some modern instructions.

   - Workarounds so that the hypervisor can enable Transactional Memory
     on Power9.

   - A series to disable the DAWR (Data Address Watchpoint Register) on
     Power9.

   - More information displayed in the meltdown/spectre_v1/v2 sysfs
     files.

   - A vpermxor (Power8 Altivec) implementation for the raid6 Q
     Syndrome.

   - A big series to make the allocation of our pacas (per cpu area),
     kernel page tables, and per-cpu stacks NUMA aware when using the
     Radix MMU on Power9.

  And as usual many fixes, reworks and cleanups.

  Thanks to: Aaro Koskinen, Alexandre Belloni, Alexey Kardashevskiy,
  Alistair Popple, Andy Shevchenko, Aneesh Kumar K.V, Anshuman Khandual,
  Balbir Singh, Benjamin Herrenschmidt, Christophe Leroy, Christophe
  Lombard, Cyril Bur, Daniel Axtens, Dave Young, Finn Thain, Frederic
  Barrat, Gustavo Romero, Horia Geantă, Jonathan Neuschäfer, Kees Cook,
  Larry Finger, Laurent Dufour, Laurent Vivier, Logan Gunthorpe,
  Madhavan Srinivasan, Mark Greer, Mark Hairgrove, Markus Elfring,
  Mathieu Malaterre, Matt Brown, Matt Evans, Mauricio Faria de Oliveira,
  Michael Neuling, Naveen N. Rao, Nicholas Piggin, Paul Mackerras,
  Philippe Bergheaud, Ram Pai, Rob Herring, Sam Bobroff, Segher
  Boessenkool, Simon Guo, Simon Horman, Stewart Smith, Sukadev
  Bhattiprolu, Suraj Jitindar Singh, Thiago Jung Bauermann, Vaibhav
  Jain, Vaidyanathan Srinivasan, Vasant Hegde, Wei Yongjun"

* tag 'powerpc-4.17-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (207 commits)
  powerpc/64s/idle: Fix restore of AMOR on POWER9 after deep sleep
  powerpc/64s: Fix POWER9 DD2.2 and above in cputable features
  powerpc/64s: Fix pkey support in dt_cpu_ftrs, add CPU_FTR_PKEY bit
  powerpc/64s: Fix dt_cpu_ftrs to have restore_cpu clear unwanted LPCR bits
  Revert "powerpc/64s/idle: POWER9 ESL=0 stop avoid save/restore overhead"
  powerpc: iomap.c: introduce io{read|write}64_{lo_hi|hi_lo}
  powerpc: io.h: move iomap.h include so that it can use readq/writeq defs
  cxl: Fix possible deadlock when processing page faults from cxllib
  powerpc/hw_breakpoint: Only disable hw breakpoint if cpu supports it
  powerpc/mm/radix: Update command line parsing for disable_radix
  powerpc/mm/radix: Parse disable_radix commandline correctly.
  powerpc/mm/hugetlb: initialize the pagetable cache correctly for hugetlb
  powerpc/mm/radix: Update pte fragment count from 16 to 256 on radix
  powerpc/mm/keys: Update documentation and remove unnecessary check
  powerpc/64s/idle: POWER9 ESL=0 stop avoid save/restore overhead
  powerpc/64s/idle: Consolidate power9_offline_stop()/power9_idle_stop()
  powerpc/powernv: Always stop secondaries before reboot/shutdown
  powerpc: hard disable irqs in smp_send_stop loop
  powerpc: use NMI IPI for smp_send_stop
  powerpc/powernv: Fix SMT4 forcing idle code
  ...
2018-04-07 12:08:19 -07:00
Linus Torvalds 3b54765cca Merge branch 'akpm' (patches from Andrew)
Merge updates from Andrew Morton:

 - a few misc things

 - ocfs2 updates

 - the v9fs maintainers have been missing for a long time. I've taken
   over v9fs patch slinging.

 - most of MM

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (116 commits)
  mm,oom_reaper: check for MMF_OOM_SKIP before complaining
  mm/ksm: fix interaction with THP
  mm/memblock.c: cast constant ULLONG_MAX to phys_addr_t
  headers: untangle kmemleak.h from mm.h
  include/linux/mmdebug.h: make VM_WARN* non-rvals
  mm/page_isolation.c: make start_isolate_page_range() fail if already isolated
  mm: change return type to vm_fault_t
  mm, oom: remove 3% bonus for CAP_SYS_ADMIN processes
  mm, page_alloc: wakeup kcompactd even if kswapd cannot free more memory
  kernel/fork.c: detect early free of a live mm
  mm: make counting of list_lru_one::nr_items lockless
  mm/swap_state.c: make bool enable_vma_readahead and swap_vma_readahead() static
  block_invalidatepage(): only release page if the full page was invalidated
  mm: kernel-doc: add missing parameter descriptions
  mm/swap.c: remove @cold parameter description for release_pages()
  mm/nommu: remove description of alloc_vm_area
  zram: drop max_zpage_size and use zs_huge_class_size()
  zsmalloc: introduce zs_huge_class_size()
  mm: fix races between swapoff and flush dcache
  fs/direct-io.c: minor cleanups in do_blockdev_direct_IO
  ...
2018-04-06 14:19:26 -07:00
Tetsuo Handa 97b1255cb2 mm,oom_reaper: check for MMF_OOM_SKIP before complaining
I got "oom_reaper: unable to reap pid:" messages when the victim thread
was blocked inside free_pgtables() (which occurred after returning from
unmap_vmas() and setting MMF_OOM_SKIP).  We don't need to complain when
exit_mmap() already set MMF_OOM_SKIP.

  Killed process 7558 (a.out) total-vm:4176kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
  oom_reaper: unable to reap pid:7558 (a.out)
  a.out           D13272  7558   6931 0x00100084
  Call Trace:
   schedule+0x2d/0x80
   rwsem_down_write_failed+0x2bb/0x440
   call_rwsem_down_write_failed+0x13/0x20
   down_write+0x49/0x60
   unlink_file_vma+0x28/0x50
   free_pgtables+0x36/0x100
   exit_mmap+0xbb/0x180
   mmput+0x50/0x110
   copy_process.part.41+0xb61/0x1fe0
   _do_fork+0xe6/0x560
   do_syscall_64+0x74/0x230
   entry_SYSCALL_64_after_hwframe+0x42/0xb7

Link: http://lkml.kernel.org/r/201803221946.DHG65638.VFJHFtOSQLOMOF@I-love.SAKURA.ne.jp
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:27 -07:00
Claudio Imbrenda 77da2ba064 mm/ksm: fix interaction with THP
This patch fixes a corner case for KSM.  When two pages belong or
belonged to the same transparent hugepage, and they should be merged,
KSM fails to split the page, and therefore no merging happens.

This bug can be reproduced by:
* making sure ksm is running (in case disabling ksmtuned)
* enabling transparent hugepages
* allocating a THP-aligned 1-THP-sized buffer
  e.g. on amd64: posix_memalign(&p, 1<<21, 1<<21)
* filling it with the same values
  e.g. memset(p, 42, 1<<21)
* performing madvise to make it mergeable
  e.g. madvise(p, 1<<21, MADV_MERGEABLE)
* waiting for KSM to perform a few scans

The expected outcome is that the all the pages get merged (1 shared and
the rest sharing); the actual outcome is that no pages get merged (1
unshared and the rest volatile)

The reason of this behaviour is that we increase the reference count
once for both pages we want to merge, but if they belong to the same
hugepage (or compound page), the reference counter used in both cases is
the one of the head of the compound page.  This means that
split_huge_page will find a value of the reference counter too high and
will fail.

This patch solves this problem by testing if the two pages to merge
belong to the same hugepage when attempting to merge them.  If so, the
hugepage is split safely.  This means that the hugepage is not split if
not necessary.

Link: http://lkml.kernel.org/r/1521548069-24758-1-git-send-email-imbrenda@linux.vnet.ibm.com
Signed-off-by: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com>
Co-authored-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:27 -07:00
Stefan Agner 644d87dccd mm/memblock.c: cast constant ULLONG_MAX to phys_addr_t
This fixes a warning shown when phys_addr_t is 32-bit int when compiling
with clang:

  mm/memblock.c:927:15: warning: implicit conversion from 'unsigned long long'
        to 'phys_addr_t' (aka 'unsigned int') changes value from
        18446744073709551615 to 4294967295 [-Wconstant-conversion]
                                  r->base : ULLONG_MAX;
                                            ^~~~~~~~~~
  ./include/linux/kernel.h:30:21: note: expanded from macro 'ULLONG_MAX'
  #define ULLONG_MAX      (~0ULL)
                           ^~~~~

Link: http://lkml.kernel.org/r/20180319005645.29051-1-stefan@agner.ch
Signed-off-by: Stefan Agner <stefan@agner.ch>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:27 -07:00
Randy Dunlap 514c603249 headers: untangle kmemleak.h from mm.h
Currently <linux/slab.h> #includes <linux/kmemleak.h> for no obvious
reason.  It looks like it's only a convenience, so remove kmemleak.h
from slab.h and add <linux/kmemleak.h> to any users of kmemleak_* that
don't already #include it.  Also remove <linux/kmemleak.h> from source
files that do not use it.

This is tested on i386 allmodconfig and x86_64 allmodconfig.  It would
be good to run it through the 0day bot for other $ARCHes.  I have
neither the horsepower nor the storage space for the other $ARCHes.

Update: This patch has been extensively build-tested by both the 0day
bot & kisskb/ozlabs build farms.  Both of them reported 2 build failures
for which patches are included here (in v2).

[ slab.h is the second most used header file after module.h; kernel.h is
  right there with slab.h. There could be some minor error in the
  counting due to some #includes having comments after them and I didn't
  combine all of those. ]

[akpm@linux-foundation.org: security/keys/big_key.c needs vmalloc.h, per sfr]
Link: http://lkml.kernel.org/r/e4309f98-3749-93e1-4bb7-d9501a39d015@infradead.org
Link: http://kisskb.ellerman.id.au/kisskb/head/13396/
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Reported-by: Michael Ellerman <mpe@ellerman.id.au>	[2 build failures]
Reported-by: Fengguang Wu <fengguang.wu@intel.com>	[2 build failures]
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Wei Yongjun <weiyongjun1@huawei.com>
Cc: Luis R. Rodriguez <mcgrof@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mimi Zohar <zohar@linux.vnet.ibm.com>
Cc: John Johansen <john.johansen@canonical.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:27 -07:00
Mike Kravetz 2c7452a075 mm/page_isolation.c: make start_isolate_page_range() fail if already isolated
start_isolate_page_range() is used to set the migrate type of a set of
pageblocks to MIGRATE_ISOLATE while attempting to start a migration
operation.  It assumes that only one thread is calling it for the
specified range.  This routine is used by CMA, memory hotplug and
gigantic huge pages.  Each of these users synchronize access to the
range within their subsystem.  However, two subsystems (CMA and gigantic
huge pages for example) could attempt operations on the same range.  If
this happens, one thread may 'undo' the work another thread is doing.
This can result in pageblocks being incorrectly left marked as
MIGRATE_ISOLATE and therefore not available for page allocation.

What is ideally needed is a way to synchronize access to a set of
pageblocks that are undergoing isolation and migration.  The only thing
we know about these pageblocks is that they are all in the same zone.  A
per-node mutex is too coarse as we want to allow multiple operations on
different ranges within the same zone concurrently.  Instead, we will
use the migration type of the pageblocks themselves as a form of
synchronization.

start_isolate_page_range sets the migration type on a set of page-
blocks going in order from the one associated with the smallest pfn to
the largest pfn.  The zone lock is acquired to check and set the
migration type.  When going through the list of pageblocks check if
MIGRATE_ISOLATE is already set.  If so, this indicates another thread is
working on this pageblock.  We know exactly which pageblocks we set, so
clean up by undo those and return -EBUSY.

This allows start_isolate_page_range to serve as a synchronization
mechanism and will allow for more general use of callers making use of
these interfaces.  Update comments in alloc_contig_range to reflect this
new functionality.

Each CPU holds the associated zone lock to modify or examine the
migration type of a pageblock.  And, it will only examine/update a
single pageblock per lock acquire/release cycle.

Link: http://lkml.kernel.org/r/20180309224731.16978-1-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:27 -07:00
David Rientjes d46078b288 mm, oom: remove 3% bonus for CAP_SYS_ADMIN processes
Since the 2.6 kernel, the oom killer has slightly biased away from
CAP_SYS_ADMIN processes by discounting some of its memory usage in
comparison to other processes.

This has always been implicit and nothing exactly relies on the
behavior.

Gaurav notices that __task_cred() can dereference a potentially freed
pointer if the task under consideration is exiting because a reference
to the task_struct is not held.

Remove the CAP_SYS_ADMIN bias so that all processes are treated equally.

If any CAP_SYS_ADMIN process would like to be biased against, it is
always allowed to adjust /proc/pid/oom_score_adj.

Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1803071548510.6996@chino.kir.corp.google.com
Signed-off-by: David Rientjes <rientjes@google.com>
Reported-by: Gaurav Kohli <gkohli@codeaurora.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:27 -07:00
David Rientjes 5ecd9d403a mm, page_alloc: wakeup kcompactd even if kswapd cannot free more memory
Kswapd will not wakeup if per-zone watermarks are not failing or if too
many previous attempts at background reclaim have failed.

This can be true if there is a lot of free memory available.  For high-
order allocations, kswapd is responsible for waking up kcompactd for
background compaction.  If the zone is not below its watermarks or
reclaim has recently failed (lots of free memory, nothing left to
reclaim), kcompactd does not get woken up.

When __GFP_DIRECT_RECLAIM is not allowed, allow kcompactd to still be
woken up even if kswapd will not reclaim.  This allows high-order
allocations, such as thp, to still trigger background compaction even
when the zone has an abundance of free memory.

Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1803111659420.209721@chino.kir.corp.google.com
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:27 -07:00
Kirill Tkhai 0c7c1bed7e mm: make counting of list_lru_one::nr_items lockless
During the reclaiming slab of a memcg, shrink_slab iterates over all
registered shrinkers in the system, and tries to count and consume
objects related to the cgroup.  In case of memory pressure, this behaves
bad: I observe high system time and time spent in list_lru_count_one()
for many processes on RHEL7 kernel.

This patch makes list_lru_node::memcg_lrus rcu protected, that allows to
skip taking spinlock in list_lru_count_one().

Shakeel Butt with the patch observes significant perf graph change.  He
says:

========================================================================
Setup: running a fork-bomb in a memcg of 200MiB on a 8GiB and 4 vcpu
VM and recording the trace with 'perf record -g -a'.

The trace without the patch:

+  34.19%     fb.sh  [kernel.kallsyms]  [k] queued_spin_lock_slowpath
+  30.77%     fb.sh  [kernel.kallsyms]  [k] _raw_spin_lock
+   3.53%     fb.sh  [kernel.kallsyms]  [k] list_lru_count_one
+   2.26%     fb.sh  [kernel.kallsyms]  [k] super_cache_count
+   1.68%     fb.sh  [kernel.kallsyms]  [k] shrink_slab
+   0.59%     fb.sh  [kernel.kallsyms]  [k] down_read_trylock
+   0.48%     fb.sh  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
+   0.38%     fb.sh  [kernel.kallsyms]  [k] shrink_node_memcg
+   0.32%     fb.sh  [kernel.kallsyms]  [k] queue_work_on
+   0.26%     fb.sh  [kernel.kallsyms]  [k] count_shadow_nodes

With the patch:

+   0.16%     swapper  [kernel.kallsyms]    [k] default_idle
+   0.13%     oom_reaper  [kernel.kallsyms]    [k] mutex_spin_on_owner
+   0.05%     perf  [kernel.kallsyms]    [k] copy_user_generic_string
+   0.05%     init.real  [kernel.kallsyms]    [k] wait_consider_task
+   0.05%     kworker/0:0  [kernel.kallsyms]    [k] finish_task_switch
+   0.04%     kworker/2:1  [kernel.kallsyms]    [k] finish_task_switch
+   0.04%     kworker/3:1  [kernel.kallsyms]    [k] finish_task_switch
+   0.04%     kworker/1:0  [kernel.kallsyms]    [k] finish_task_switch
+   0.03%     binary  [kernel.kallsyms]    [k] copy_page
========================================================================

Thanks Shakeel for the testing.

[ktkhai@virtuozzo.com: v2]
  Link: http://lkml.kernel.org/r/151203869520.3915.2587549826865799173.stgit@localhost.localdomain
Link: http://lkml.kernel.org/r/150583358557.26700.8490036563698102569.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Tested-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:27 -07:00
Colin Ian King f5c754d63d mm/swap_state.c: make bool enable_vma_readahead and swap_vma_readahead() static
The bool enable_vma_readahead and swap_vma_readahead() are local to the
source and do not need to be in global scope, so make them static.

Cleans up sparse warnings:

  mm/swap_state.c:41:6: warning: symbol 'enable_vma_readahead' was not declared. Should it be static?
  mm/swap_state.c:742:13: warning: symbol 'swap_vma_readahead' was not declared. Should it be static?

Link: http://lkml.kernel.org/r/20180223164852.5159-1-colin.king@canonical.com
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:27 -07:00
Mike Rapoport e8b098fc57 mm: kernel-doc: add missing parameter descriptions
Link: http://lkml.kernel.org/r/1519585191-10180-4-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:27 -07:00
Mike Rapoport 002843de36 mm/swap.c: remove @cold parameter description for release_pages()
The 'cold' parameter was removed from release_pages function by commit
c6f92f9fbe ("mm: remove cold parameter for release_pages").

Update the description to match the code.

Link: http://lkml.kernel.org/r/1519585191-10180-3-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:26 -07:00
Mike Rapoport e48e3c590a mm/nommu: remove description of alloc_vm_area
The alloc_mm_area in nommu is a stub, but its description states it
allocates kernel address space.  Remove the description to make the code
and the documentation agree.

Link: http://lkml.kernel.org/r/1519585191-10180-2-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:26 -07:00
Sergey Senozhatsky 010b495e2f zsmalloc: introduce zs_huge_class_size()
Patch series "zsmalloc/zram: drop zram's max_zpage_size", v3.

ZRAM's max_zpage_size is a bad thing.  It forces zsmalloc to store
normal objects as huge ones, which results in bigger zsmalloc memory
usage.  Drop it and use actual zsmalloc huge-class value when decide if
the object is huge or not.

This patch (of 2):

Not every object can be share its zspage with other objects, e.g.  when
the object is as big as zspage or nearly as big a zspage.  For such
objects zsmalloc has a so called huge class - every object which belongs
to huge class consumes the entire zspage (which consists of a physical
page).  On x86_64, PAGE_SHIFT 12 box, the first non-huge class size is
3264, so starting down from size 3264, objects can share page(-s) and
thus minimize memory wastage.

ZRAM, however, has its own statically defined watermark for huge
objects, namely "3 * PAGE_SIZE / 4 = 3072", and forcibly stores every
object larger than this watermark (3072) as a PAGE_SIZE object, in other
words, to a huge class, while zsmalloc can keep some of those objects in
non-huge classes.  This results in increased memory consumption.

zsmalloc knows better if the object is huge or not.  Introduce
zs_huge_class_size() function which tells if the given object can be
stored in one of non-huge classes or not.  This will let us to drop
ZRAM's huge object watermark and fully rely on zsmalloc when we decide
if the object is huge.

[sergey.senozhatsky.work@gmail.com: add pool param to zs_huge_class_size()]
  Link: http://lkml.kernel.org/r/20180314081833.1096-2-sergey.senozhatsky@gmail.com
Link: http://lkml.kernel.org/r/20180306070639.7389-2-sergey.senozhatsky@gmail.com
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:26 -07:00
Huang Ying cb9f753a37 mm: fix races between swapoff and flush dcache
Thanks to commit 4b3ef9daa4 ("mm/swap: split swap cache into 64MB
trunks"), after swapoff the address_space associated with the swap
device will be freed.  So page_mapping() users which may touch the
address_space need some kind of mechanism to prevent the address_space
from being freed during accessing.

The dcache flushing functions (flush_dcache_page(), etc) in architecture
specific code may access the address_space of swap device for anonymous
pages in swap cache via page_mapping() function.  But in some cases
there are no mechanisms to prevent the swap device from being swapoff,
for example,

  CPU1					CPU2
  __get_user_pages()			swapoff()
    flush_dcache_page()
      mapping = page_mapping()
        ...				  exit_swap_address_space()
        ...				    kvfree(spaces)
        mapping_mapped(mapping)

The address space may be accessed after being freed.

But from cachetlb.txt and Russell King, flush_dcache_page() only care
about file cache pages, for anonymous pages, flush_anon_page() should be
used.  The implementation of flush_dcache_page() in all architectures
follows this too.  They will check whether page_mapping() is NULL and
whether mapping_mapped() is true to determine whether to flush the
dcache immediately.  And they will use interval tree (mapping->i_mmap)
to find all user space mappings.  While mapping_mapped() and
mapping->i_mmap isn't used by anonymous pages in swap cache at all.

So, to fix the race between swapoff and flush dcache, __page_mapping()
is add to return the address_space for file cache pages and NULL
otherwise.  All page_mapping() invoking in flush dcache functions are
replaced with page_mapping_file().

[akpm@linux-foundation.org: simplify page_mapping_file(), per Mike]
Link: http://lkml.kernel.org/r/20180305083634.15174-1-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Chen Liqin <liqin.linux@gmail.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Chris Zankel <chris@zankel.net>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Ley Foon Tan <lftan@altera.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:26 -07:00
Dan Williams 05ea88608d mm, hugetlbfs: introduce ->pagesize() to vm_operations_struct
When device-dax is operating in huge-page mode we want it to behave like
hugetlbfs and report the MMU page mapping size that is being enforced by
the vma.

Similar to commit 31383c6865 "mm, hugetlbfs: introduce ->split() to
vm_operations_struct" it would be messy to teach vma_mmu_pagesize()
about device-dax page mapping sizes in the same (hstate) way that
hugetlbfs communicates this attribute.  Instead, these patches introduce
a new ->pagesize() vm operation.

Link: http://lkml.kernel.org/r/151996254734.27922.15813097401404359642.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reported-by: Jane Chu <jane.chu@oracle.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:26 -07:00
Dan Williams 09135cc594 mm, powerpc: use vma_kernel_pagesize() in vma_mmu_pagesize()
Patch series "mm, smaps: MMUPageSize for device-dax", v3.

Similar to commit 31383c6865 ("mm, hugetlbfs: introduce ->split() to
vm_operations_struct") here is another occasion where we want
special-case hugetlbfs/hstate enabling to also apply to device-dax.

This prompts the question what other hstate conversions we might do
beyond ->split() and ->pagesize(), but this appears to be the last of
the usages of hstate_vma() in generic/non-hugetlbfs specific code paths.

This patch (of 3):

The current powerpc definition of vma_mmu_pagesize() open codes looking
up the page size via hstate.  It is identical to the generic
vma_kernel_pagesize() implementation.

Now, vma_kernel_pagesize() is growing support for determining the page
size of Device-DAX vmas in addition to the existing Hugetlbfs page size
determination.

Ideally, if the powerpc vma_mmu_pagesize() used vma_kernel_pagesize() it
would automatically benefit from any new vma-type support that is added
to vma_kernel_pagesize().  However, the powerpc vma_mmu_pagesize() is
prevented from calling vma_kernel_pagesize() due to a circular header
dependency that requires vma_mmu_pagesize() to be defined before
including <linux/hugetlb.h>.

Break this circular dependency by defining the default vma_mmu_pagesize()
as a __weak symbol to be overridden by the powerpc version.

Link: http://lkml.kernel.org/r/151996254179.27922.2213728278535578744.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Jane Chu <jane.chu@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:26 -07:00
Mario Leinweber 2923117b71 mm/gup.c: fix coding style issues.
- Fixed style error: 8 spaces -> 1 tab.
- Fixed style warning: Corrected misleading indentation.

Link: http://lkml.kernel.org/r/20180302210254.31888-1-marioleinweber@web.de
Signed-off-by: Mario Leinweber <marioleinweber@web.de>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:26 -07:00
Aaron Lu 97334162e4 mm/free_pcppages_bulk: prefetch buddy while not holding lock
When a page is freed back to the global pool, its buddy will be checked
to see if it's possible to do a merge.  This requires accessing buddy's
page structure and that access could take a long time if it's cache
cold.

This patch adds a prefetch to the to-be-freed page's buddy outside of
zone->lock in hope of accessing buddy's page structure later under
zone->lock will be faster.  Since we *always* do buddy merging and check
an order-0 page's buddy to try to merge it when it goes into the main
allocator, the cacheline will always come in, i.e.  the prefetched data
will never be unused.

Normally, the number of prefetch will be pcp->batch(default=31 and has
an upper limit of (PAGE_SHIFT * 8)=96 on x86_64) but in the case of
pcp's pages get all drained, it will be pcp->count which has an upper
limit of pcp->high.  pcp->high, although has a default value of 186
(pcp->batch=31 * 6), can be changed by user through
/proc/sys/vm/percpu_pagelist_fraction and there is no software upper
limit so could be large, like several thousand.  For this reason, only
the first pcp->batch number of page's buddy structure is prefetched to
avoid excessive prefetching.

In the meantime, there are two concerns:

 1. the prefetch could potentially evict existing cachelines, especially
    for L1D cache since it is not huge

 2. there is some additional instruction overhead, namely calculating
    buddy pfn twice

For 1, it's hard to say, this microbenchmark though shows good result
but the actual benefit of this patch will be workload/CPU dependant;

For 2, since the calculation is a XOR on two local variables, it's
expected in many cases that cycles spent will be offset by reduced
memory latency later.  This is especially true for NUMA machines where
multiple CPUs are contending on zone->lock and the most time consuming
part under zone->lock is the wait of 'struct page' cacheline of the
to-be-freed pages and their buddies.

Test with will-it-scale/page_fault1 full load:

  kernel      Broadwell(2S)  Skylake(2S)   Broadwell(4S)  Skylake(4S)
  v4.16-rc2+  9034215        7971818       13667135       15677465
  patch2/3    9536374 +5.6%  8314710 +4.3% 14070408 +3.0% 16675866 +6.4%
  this patch 10180856 +6.8%  8506369 +2.3% 14756865 +4.9% 17325324 +3.9%

Note: this patch's performance improvement percent is against patch2/3.

(Changelog stolen from Dave Hansen and Mel Gorman's comments at
http://lkml.kernel.org/r/148a42d8-8306-2f2f-7f7c-86bc118f8ccd@intel.com)

[aaron.lu@intel.com: use helper function, avoid disordering pages]
  Link: http://lkml.kernel.org/r/20180301062845.26038-4-aaron.lu@intel.com
  Link: http://lkml.kernel.org/r/20180320113146.GB24737@intel.com
[aaron.lu@intel.com: v4]
  Link: http://lkml.kernel.org/r/20180301062845.26038-4-aaron.lu@intel.com
  Link: http://lkml.kernel.org/r/20180309082431.GB30868@intel.com
Link: http://lkml.kernel.org/r/20180301062845.26038-4-aaron.lu@intel.com
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Suggested-by: Ying Huang <ying.huang@intel.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Kemi Wang <kemi.wang@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:26 -07:00
Aaron Lu 0a5f4e5b45 mm/free_pcppages_bulk: do not hold lock when picking pages to free
When freeing a batch of pages from Per-CPU-Pages(PCP) back to buddy, the
zone->lock is held and then pages are chosen from PCP's migratetype
list.  While there is actually no need to do this 'choose part' under
lock since it's PCP pages, the only CPU that can touch them is us and
irq is also disabled.

Moving this part outside could reduce lock held time and improve
performance.  Test with will-it-scale/page_fault1 full load:

  kernel      Broadwell(2S)  Skylake(2S)   Broadwell(4S)  Skylake(4S)
  v4.16-rc2+  9034215        7971818       13667135       15677465
  this patch  9536374 +5.6%  8314710 +4.3% 14070408 +3.0% 16675866 +6.4%

What the test does is: starts $nr_cpu processes and each will repeatedly
do the following for 5 minutes:

 - mmap 128M anonymouse space

 - write access to that space

 - munmap.

The score is the aggregated iteration.

https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault1.c

Link: http://lkml.kernel.org/r/20180301062845.26038-3-aaron.lu@intel.com
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Kemi Wang <kemi.wang@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:26 -07:00
Aaron Lu 77ba9062e4 mm/free_pcppages_bulk: update pcp->count inside
Matthew Wilcox found that all callers of free_pcppages_bulk() currently
update pcp->count immediately after so it's natural to do it inside
free_pcppages_bulk().

No functionality or performance change is expected from this patch.

Link: http://lkml.kernel.org/r/20180301062845.26038-2-aaron.lu@intel.com
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Kemi Wang <kemi.wang@intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:26 -07:00
David Rientjes bc3106b26c mm, compaction: drain pcps for zone when kcompactd fails
It's possible for free pages to become stranded on per-cpu pagesets
(pcps) that, if drained, could be merged with buddy pages on the zone's
free area to form large order pages, including up to MAX_ORDER.

Consider a verbose example using the tools/vm/page-types tool at the
beginning of a ZONE_NORMAL ('B' indicates a buddy page and 'S' indicates
a slab page).  Pages on pcps do not have any page flags set.

  109954  1       _______S________________________________________________________
  109955  2       __________B_____________________________________________________
  109957  1       ________________________________________________________________
  109958  1       __________B_____________________________________________________
  109959  7       ________________________________________________________________
  109960  1       __________B_____________________________________________________
  109961  9       ________________________________________________________________
  10996a  1       __________B_____________________________________________________
  10996b  3       ________________________________________________________________
  10996e  1       __________B_____________________________________________________
  10996f  1       ________________________________________________________________
  ...
  109f8c  1       __________B_____________________________________________________
  109f8d  2       ________________________________________________________________
  109f8f  2       __________B_____________________________________________________
  109f91  f       ________________________________________________________________
  109fa0  1       __________B_____________________________________________________
  109fa1  7       ________________________________________________________________
  109fa8  1       __________B_____________________________________________________
  109fa9  1       ________________________________________________________________
  109faa  1       __________B_____________________________________________________
  109fab  1       _______S________________________________________________________

The compaction migration scanner is attempting to defragment this memory
since it is at the beginning of the zone.  It has done so quite well,
all movable pages have been migrated.  From pfn [0x109955, 0x109fab),
there are only buddy pages and pages without flags set.

These pages may be stranded on pcps that could otherwise allow this
memory to be coalesced if freed back to the zone free area.  It is
possible that some of these pages may not be on pcps and that something
has called alloc_pages() and used the memory directly, but we rely on
the absence of __GFP_MOVABLE in these cases to allocate from
MIGATE_UNMOVABLE pageblocks to try to keep these MIGRATE_MOVABLE
pageblocks as free as possible.

These buddy and pcp pages, spanning 1,621 pages, could be coalesced and
allow for three transparent hugepages to be dynamically allocated.
Running the numbers for all such spans on the system, it was found that
there were over 400 such spans of only buddy pages and pages without
flags set at the time this /proc/kpageflags sample was collected.
Without this support, there were _no_ order-9 or order-10 pages free.

When kcompactd fails to defragment memory such that a cc.order page can
be allocated, drain all pcps for the zone back to the buddy allocator so
this stranding cannot occur.  Compaction for that order will
subsequently be deferred, which acts as a ratelimit on this drain.

Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1803010340100.88270@chino.kir.corp.google.com
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:26 -07:00
Howard McLauchlan 4f6923fbb3 mm: make should_failslab always available for fault injection
should_failslab() is a convenient function to hook into for directed
error injection into kmalloc().  However, it is only available if a
config flag is set.

The following BCC script, for example, fails kmalloc() calls after a
btrfs umount:

    from bcc import BPF

    prog = r"""
    BPF_HASH(flag);

    #include <linux/mm.h>

    int kprobe__btrfs_close_devices(void *ctx) {
            u64 key = 1;
            flag.update(&key, &key);
            return 0;
    }

    int kprobe__should_failslab(struct pt_regs *ctx) {
            u64 key = 1;
            u64 *res;
            res = flag.lookup(&key);
            if (res != 0) {
                bpf_override_return(ctx, -ENOMEM);
            }
            return 0;
    }
    """
    b = BPF(text=prog)

    while 1:
        b.kprobe_poll()

This patch refactors the should_failslab implementation so that the
function is always available for error injection, independent of flags.

This change would be similar in nature to commit f5490d3ec921 ("block:
Add should_fail_bio() for bpf error injection").

Link: http://lkml.kernel.org/r/20180222020320.6944-1-hmclauchlan@fb.com
Signed-off-by: Howard McLauchlan <hmclauchlan@fb.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Alexei Starovoitov <ast@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:26 -07:00
Dou Liyang 14298d3663 mm/page_poison.c: make early_page_poison_param() __init
The early_param() is only called during kernel initialization, So Linux
marks the function of it with __init macro to save memory.

But it forgot to mark the early_page_poison_param().  So, Make it __init
as well.

Link: http://lkml.kernel.org/r/20180117034757.27024-1-douly.fnst@cn.fujitsu.com
Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:26 -07:00
Dou Liyang 1173194e1e mm/page_owner.c: make early_page_owner_param() __init
The early_param() is only called during kernel initialization, So Linux
marks the functions of it with __init macro to save memory.

But it forgot to mark the early_page_owner_param().  So, Make it __init
as well.

Link: http://lkml.kernel.org/r/20180117034736.26963-1-douly.fnst@cn.fujitsu.com
Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:26 -07:00
Dou Liyang 8bd30c1090 mm/kmemleak.c: make kmemleak_boot_config() __init
The early_param() is only called during kernel initialization, So Linux
marks the functions of it with __init macro to save memory.

But it forgot to mark the kmemleak_boot_config().  So, Make it __init as
well.

Link: http://lkml.kernel.org/r/20180117034720.26897-1-douly.fnst@cn.fujitsu.com
Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:26 -07:00