linux

Michal Hocko a49bd4d716 mm, numa: rework do_pages_move

Patch series "unclutter thp migration"

Motivation:

THP migration is hacked into the generic migration with rather surprising semantic. The migration allocation callback is supposed to check whether the THP can be migrated at once and if that is not the case then it allocates a simple page to migrate. unmap_and_move then fixes that up by splitting the THP into small pages while moving the head page to the newly allocated order-0 page. Remaining pages are moved to the LRU list by split_huge_page. The same happens if the THP allocation fails. This is really ugly and error prone [2].

I also believe that split_huge_page to the LRU lists is inherently wrong because all tail pages are not migrated. Some callers will just work around that by retrying (e.g. memory hotplug). There are other pfn walkers which are simply broken though. e.g. madvise_inject_error will migrate head and then advances next pfn by the huge page size. do_move_page_to_node_array, queue_pages_range (migrate_pages, mbind), will simply split the THP before migration if the THP migration is not supported then falls back to single page migration but it doesn't handle tail pages if the THP migration path is not able to allocate a fresh THP so we end up with ENOMEM and fail the whole migration which is a questionable behavior. Page compaction doesn't try to migrate large pages so it should be immune.

The first patch reworks do_pages_move which relies on a very ugly calling semantic when the return status is pushed to the migration path via private pointer. It uses pre allocated fixed size batching to achieve that. We simply cannot do the same if a THP is to be split during the migration path which is done in the patch 3. Patch 2 is follow up cleanup which removes the mentioned return status calling convention ugliness.

On a side note: There are some semantic issues I have encountered on the way when working on patch 1 but I am not addressing them here. E.g. trying to move THP tail pages will result in either success or EBUSY (the later one more likely once we isolate head from the LRU list). Hugetlb reports EACCESS on tail pages. Some errors are reported via status parameter but migration failures are not even though the original `reason' argument suggests there was an intention to do so. From a quick look into git history this never worked. I have tried to keep the semantic unchanged.

Then there is a relatively minor thing that the page isolation might fail because of pages not being on the LRU - e.g. because they are sitting on the per-cpu LRU caches. Easily fixable.

This patch (of 3):

do_pages_move is supposed to move user defined memory (an array of addresses) to the user defined numa nodes (an array of nodes one for each address). The user provided status array then contains resulting numa node for each address or an error. The semantic of this function is little bit confusing because only some errors are reported back. Notably migrate_pages error is only reported via the return value. This patch doesn't try to address these semantic nuances but rather change the underlying implementation.

Currently we are processing user input (which can be really large) in batches which are stored to a temporarily allocated page. Each address is resolved to its struct page and stored to page_to_node structure along with the requested target numa node. The array of these structures is then conveyed down the page migration path via private argument. new_page_node then finds the corresponding structure and allocates the proper target page.

What is the problem with the current implementation and why to change it? Apart from being quite ugly it also doesn't cope with unexpected pages showing up on the migration list inside migrate_pages path. That doesn't happen currently but the follow up patch would like to make the thp migration code more clear and that would need to split a THP into the list for some cases.

How does the new implementation work? Well, instead of batching into a fixed size array we simply batch all pages that should be migrated to the same node and isolate all of them into a linked list which doesn't require any additional storage. This should work reasonably well because page migration usually migrates larger ranges of memory to a specific node. So the common case should work equally well as the current implementation. Even if somebody constructs an input where the target numa nodes would be interleaved we shouldn't see a large performance impact because page migration alone doesn't really benefit from batching. mmap_sem batching for the lookup is quite questionable and isolate_lru_page which would benefit from batching is not using it even in the current implementation.

Link: http://lkml.kernel.org/r/20180103082555.14592-2-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Reale <ar@linux.vnet.ibm.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

2018-04-11 10:28:32 -07:00

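
The batching idea described in the commit message (collect consecutive requests that share a target node, then hand the whole run over at once) can be illustrated with a small userspace sketch. This is not the kernel code: `batch_by_node`, `flush_fn` and the callback arguments are hypothetical names chosen for the example, and plain array indices stand in for the linked list of isolated pages.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback: called once per run of entries that share a node. */
typedef void (*flush_fn)(int node, size_t first, size_t count, void *ctx);

/*
 * Walk the requested target nodes in input order and group consecutive
 * entries with the same node into one batch. Each batch is passed to
 * flush() (if non-NULL) as soon as the target node changes or the input
 * ends. Returns the number of batches produced.
 */
size_t batch_by_node(const int *nodes, size_t n, flush_fn flush, void *ctx)
{
    size_t batches = 0;
    size_t start = 0;   /* index of the first entry in the current batch */

    assert(nodes != NULL || n == 0);

    for (size_t i = 1; i <= n; i++) {
        /* i == n is checked first, so nodes[i] is never read out of bounds */
        if (i == n || nodes[i] != nodes[start]) {
            if (flush)
                flush(nodes[start], start, i - start, ctx);
            batches++;
            start = i;
        }
    }
    return batches;
}
```

With the input `{0, 0, 1, 1, 0}` the sketch produces three batches. A fully interleaved input such as `{0, 1, 0, 1}` degrades to one batch per entry, which is the case the commit message argues should not cost much.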
Documentation	Documentation/vm/hmm.txt: typos and syntaxes fixes	2018-04-11 10:28:31 -07:00
LICENSES	LICENSES: Add MPL-1.1 license	2018-01-06 10:59:44 -07:00
arch	c6x changes 4.17	2018-04-10 11:50:14 -07:00
block	for-4.17/block-20180402	2018-04-05 14:27:02 -07:00
certs	certs/blacklist_nohashes.c: fix const confusion in certs blacklist	2018-02-21 15:35:43 -08:00
crypto	MIPS changes for 4.17	2018-04-10 11:39:22 -07:00
drivers	mm: check __highest_present_section_nr directly in memory_dev_init()	2018-04-11 10:28:31 -07:00
firmware	kbuild: remove all dummy assignments to obj-	2017-11-18 11:46:06 +09:00
fs	dcache: account external names as indirectly reclaimable memory	2018-04-11 10:28:29 -07:00
include	mm: memcg: make sure memory.events is uptodate when waking pollers	2018-04-11 10:28:31 -07:00
init	New features:	2018-04-10 11:27:30 -07:00
ipc	Merge branch 'userns-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace	2018-04-03 19:15:32 -07:00
kernel	New features:	2018-04-10 11:27:30 -07:00
lib	New features:	2018-04-10 11:27:30 -07:00
mm	mm, numa: rework do_pages_move	2018-04-11 10:28:32 -07:00
net	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net	2018-04-09 17:04:10 -07:00
samples	VFIO updates for v4.17-rc1	2018-04-06 19:44:27 -07:00
scripts	Leaking-addresses patches for 4.17-rc1	2018-04-07 11:56:33 -07:00
security	New features:	2018-04-10 11:27:30 -07:00
sound	sound fixes for 4.17-rc1	2018-04-10 10:16:04 -07:00
tools	New features:	2018-04-10 11:27:30 -07:00
usr	kbuild: rename built-in.o to built-in.a	2018-03-26 02:01:19 +09:00
virt	KVM/ARM updates for v4.17	2018-03-28 16:09:09 +02:00
.cocciconfig	…
.get_maintainer.ignore	…
.gitattributes	.gitattributes: set git diff driver for C source code files	2016-10-07 18:46:30 -07:00
.gitignore	kbuild: move include/config/ksym/* to include/ksym/*	2018-03-26 02:01:23 +09:00
.mailmap	Merge candidates for 4.17 merge window	2018-04-06 17:35:43 -07:00
COPYING	COPYING: use the new text with points to the license files	2018-03-23 12:41:45 -06:00
CREDITS	MAINTAINERS/CREDITS: Drop METAG ARCHITECTURE	2018-03-05 16:34:24 +00:00
Kbuild	Kbuild updates for v4.15	2017-11-17 17:45:29 -08:00
Kconfig	License cleanup: add SPDX GPL-2.0 license identifier to files with no license	2017-11-02 11:10:55 +01:00
MAINTAINERS	mm/hmm: documentation editorial update to HMM documentation	2018-04-11 10:28:30 -07:00
Makefile	Kconfig updates for v4.17	2018-04-03 16:28:01 -07:00
README	Docs: Added a pointer to the formatted docs to README	2018-03-21 09:02:53 -06:00

README

Linux kernel
============

There are several guides for kernel developers and users. These guides can
be rendered in a number of formats, like HTML and PDF. Please read
Documentation/admin-guide/README.rst first.

In order to build the documentation, use ``make htmldocs`` or
``make pdfdocs``.  The formatted documentation can also be read online at:

    https://www.kernel.org/doc/html/latest/

There are various text files in the Documentation/ subdirectory,
several of them using the reStructuredText markup notation.
See Documentation/00-INDEX for a list of what is contained in each file.

Please read the Documentation/process/changes.rst file, as it contains the
requirements for building and running the kernel, and information about
the problems which may result from upgrading your kernel.