linux/drivers
Michal Hocko f1dd2cd13c mm, memory_hotplug: do not associate hotadded memory to zones until online
The current memory hotplug implementation relies on having all the
struct pages associate with a zone/node during the physical hotplug
phase (arch_add_memory->__add_pages->__add_section->__add_zone).  In the
vast majority of cases this means that they are added to ZONE_NORMAL.
This has been so since 9d99aaa31f ("[PATCH] x86_64: Support memory
hotadd without sparsemem") and it wasn't a big deal back then because
movable onlining didn't exist yet.

Much later memory hotplug wanted to (ab)use ZONE_MOVABLE for movable
onlining 511c2aba8f ("mm, memory-hotplug: dynamic configure movable
memory and portion memory") and then things got more complicated.
Rather than reconsidering the zone association which was no longer
needed (because the memory hotplug already depended on SPARSEMEM) a
convoluted semantic of zone shifting has been developed.  Only the
currently last memblock or the one adjacent to the zone_movable can be
onlined movable.  This essentially means that the online type changes as
the new memblocks are added.

Let's simulate memory hot online manually
  $ echo 0x100000000 > /sys/devices/system/memory/probe
  $ grep . /sys/devices/system/memory/memory32/valid_zones
  Normal Movable

  $ echo $((0x100000000+(128<<20))) > /sys/devices/system/memory/probe
  $ grep . /sys/devices/system/memory/memory3?/valid_zones
  /sys/devices/system/memory/memory32/valid_zones:Normal
  /sys/devices/system/memory/memory33/valid_zones:Normal Movable

  $ echo $((0x100000000+2*(128<<20))) > /sys/devices/system/memory/probe
  $ grep . /sys/devices/system/memory/memory3?/valid_zones
  /sys/devices/system/memory/memory32/valid_zones:Normal
  /sys/devices/system/memory/memory33/valid_zones:Normal
  /sys/devices/system/memory/memory34/valid_zones:Normal Movable

  $ echo online_movable > /sys/devices/system/memory/memory34/state
  $ grep . /sys/devices/system/memory/memory3?/valid_zones
  /sys/devices/system/memory/memory32/valid_zones:Normal
  /sys/devices/system/memory/memory33/valid_zones:Normal Movable
  /sys/devices/system/memory/memory34/valid_zones:Movable Normal

This is an awkward semantic because an udev event is sent as soon as the
block is onlined and an udev handler might want to online it based on
some policy (e.g.  association with a node) but it will inherently race
with new blocks showing up.

This patch changes the physical online phase to not associate pages with
any zone at all.  All the pages are just marked reserved and wait for
the onlining phase to be associated with the zone as per the online
request.  There are only two requirements

	- existing ZONE_NORMAL and ZONE_MOVABLE cannot overlap

	- ZONE_NORMAL precedes ZONE_MOVABLE in physical addresses

the latter one is not an inherent requirement and can be changed in the
future.  It preserves the current behavior and made the code slightly
simpler.  This is subject to change in future.

This means that the same physical online steps as above will lead to the
following state: Normal Movable

  /sys/devices/system/memory/memory32/valid_zones:Normal Movable
  /sys/devices/system/memory/memory33/valid_zones:Normal Movable

  /sys/devices/system/memory/memory32/valid_zones:Normal Movable
  /sys/devices/system/memory/memory33/valid_zones:Normal Movable
  /sys/devices/system/memory/memory34/valid_zones:Normal Movable

  /sys/devices/system/memory/memory32/valid_zones:Normal Movable
  /sys/devices/system/memory/memory33/valid_zones:Normal Movable
  /sys/devices/system/memory/memory34/valid_zones:Movable

Implementation:
The current move_pfn_range is reimplemented to check the above
requirements (allow_online_pfn_range) and then updates the respective
zone (move_pfn_range_to_zone), the pgdat and links all the pages in the
pfn range with the zone/node.  __add_pages is updated to not require the
zone and only initializes sections in the range.  This allowed to
simplify the arch_add_memory code (s390 could get rid of quite some of
code).

devm_memremap_pages is the only user of arch_add_memory which relies on
the zone association because it only hooks into the memory hotplug only
half way.  It uses it to associate the new memory with ZONE_DEVICE but
doesn't allow it to be {on,off}lined via sysfs.  This means that this
particular code path has to call move_pfn_range_to_zone explicitly.

The original zone shifting code is kept in place and will be removed in
the follow up patch for an easier review.

Please note that this patch also changes the original behavior when
offlining a memory block adjacent to another zone (Normal vs.  Movable)
used to allow to change its movable type.  This will be handled later.

[richard.weiyang@gmail.com: simplify zone_intersects()]
  Link: http://lkml.kernel.org/r/20170616092335.5177-1-richard.weiyang@gmail.com
[richard.weiyang@gmail.com: remove duplicate call for set_page_links]
  Link: http://lkml.kernel.org/r/20170616092335.5177-2-richard.weiyang@gmail.com
[akpm@linux-foundation.org: remove unused local `i']
Link: http://lkml.kernel.org/r/20170515085827.16474-12-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Tested-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Reza Arbab <arbab@linux.vnet.ibm.com>
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> # For s390 bits
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Daniel Kiper <daniel.kiper@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Tobias Regnery <tobias.regnery@gmail.com>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-06 16:24:32 -07:00
..
accessibility
acpi arm64 updates for 4.13: 2017-07-05 17:09:27 -07:00
amba
android
ata Merge branch 'for-4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata 2017-07-06 09:41:58 -07:00
atm net: convert sock.sk_wmem_alloc from atomic_t to refcount_t 2017-07-01 07:39:08 -07:00
auxdisplay
base mm, memory_hotplug: do not associate hotadded memory to zones until online 2017-07-06 16:24:32 -07:00
bcma
block zram: count same page write as page_stored 2017-07-06 16:24:31 -07:00
bluetooth Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 2017-07-05 12:31:59 -07:00
bus ARM: SoC driver updates 2017-07-04 14:47:47 -07:00
cdrom block: don't set bounce limit in blk_init_queue 2017-06-27 12:13:45 -06:00
char arm64 updates for 4.13: 2017-07-05 17:09:27 -07:00
clk ARM: Device-tree updates 2017-07-04 14:37:25 -07:00
clocksource ARM: SoC driver updates 2017-07-04 14:47:47 -07:00
connector
cpufreq ARM: SoC driver updates 2017-07-04 14:47:47 -07:00
cpuidle Power management updates for v4.13-rc1 2017-07-04 13:39:41 -07:00
crypto Cavium CNN55XX: fix broken default Kconfig entry 2017-07-05 13:03:05 -07:00
dax
dca
devfreq PM / devfreq: exynos-ppmu: Staticize event list 2017-06-12 10:12:07 +09:00
dio
dma dmaengine: omap-dma: port_window support correction for both direction 2017-06-20 11:45:01 +08:00
dma-buf
edac EDAC, pnd2: Fix Apollo Lake DIMM detection 2017-06-29 10:37:50 +02:00
eisa
extcon extcon: int3496: Switch to devm_acpi_dev_add_driver_gpios() 2017-06-12 10:00:24 +09:00
firewire networking: make skb_push & __skb_push return void pointers 2017-06-16 11:48:40 -04:00
firmware arm64 updates for 4.13: 2017-07-05 17:09:27 -07:00
fmc
fpga
fsi
gpio driver core patches for 4.13-rc1 2017-07-03 20:27:48 -07:00
gpu Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2017-07-03 13:08:04 -07:00
hid driver core patches for 4.13-rc1 2017-07-03 20:27:48 -07:00
hsi HSI changes for the v4.13 series 2017-07-04 14:28:22 -07:00
hv
hwmon hwmon: (aspeed-pwm-tacho) Poll with short sleeps. 2017-06-24 08:58:06 -07:00
hwspinlock
hwtracing Char/Misc patches for 4.13-rc1 2017-07-03 20:55:59 -07:00
i2c Char/Misc patches for 4.13-rc1 2017-07-03 20:55:59 -07:00
ide block: Change argument type of scsi_req_init() 2017-06-20 19:27:14 -06:00
idle intel_idle: Use more common logging style 2017-06-29 22:58:35 +02:00
iio hwmon updates for v4.13: 2017-07-04 11:48:27 -07:00
infiniband Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 2017-07-05 12:31:59 -07:00
input Input: synaptics-rmi4 - only read the F54 query registers which are used 2017-06-23 00:08:48 -07:00
iommu Merge branch 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2017-07-03 16:50:31 -07:00
ipack
irqchip arm64 updates for 4.13: 2017-07-05 17:09:27 -07:00
isdn Merge branch 'work.misc-set_fs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2017-07-05 13:13:32 -07:00
leds LED fixes for 4.12-rc6 2017-06-18 08:51:35 +09:00
lguest
lightnvm lightnvm: pblk: set line bitmap check under debug 2017-06-30 11:08:18 -06:00
macintosh Merge branch 'work.misc-set_fs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2017-07-05 13:13:32 -07:00
mailbox ACPICA: Add support for new PCCT subtables 2017-06-12 14:58:39 +02:00
mcb
md Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2017-07-03 13:08:04 -07:00
media networking: make skb_put & friends return void pointers 2017-06-16 11:48:39 -04:00
memory ARM: SoC driver updates 2017-07-04 14:47:47 -07:00
memstick
message
mfd Merge remote-tracking branches 'regulator/topic/settle', 'regulator/topic/tps65910' and 'regulator/topic/tps65917' into regulator-next 2017-07-03 16:52:21 +01:00
misc Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 2017-07-05 12:31:59 -07:00
mmc MMC core: 2017-07-04 11:11:56 -07:00
mtd There has been a fair amount of activity in the docs tree this time 2017-07-03 21:13:25 -07:00
mux
net Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 2017-07-05 12:31:59 -07:00
nfc NFC 4.13 pull request 2017-07-01 14:30:39 -07:00
ntb ntb: no sleep in ntb_async_tx_submit 2017-06-19 14:24:41 -04:00
nubus
nvdimm block: don't bother with bounce limits for make_request drivers 2017-06-27 12:13:45 -06:00
nvme Merge branch 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2017-07-03 16:50:31 -07:00
nvmem
of Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-06-15 11:59:32 -04:00
oprofile
parisc parisc: ->mapping_error 2017-07-05 21:46:42 +02:00
parport
pci Power management updates for v4.13-rc1 2017-07-04 13:39:41 -07:00
pcmcia pcmcia: ds: convert to use DRIVER_ATTR_RO 2017-06-12 16:14:30 +02:00
perf Merge branch 'aarch64/for-next/ras-apei' into aarch64/for-next/core 2017-06-26 10:54:27 +01:00
phy phy: bcm-ns-usb3: add MDIO driver using proper bus layer 2017-06-16 13:22:26 +05:30
pinctrl Revert "pinctrl: rockchip: avoid hardirq-unsafe functions in irq_chip" 2017-06-29 15:03:24 +02:00
platform Power management updates for v4.13-rc1 2017-07-04 13:39:41 -07:00
pnp ACPI / PM: Consolidate device wakeup settings code 2017-06-28 01:52:32 +02:00
power power supply and reset changes for the v4.13 series 2017-07-04 14:25:14 -07:00
powercap powercap/RAPL: prevent overridding bits outside of the mask 2017-06-28 00:38:34 +02:00
pps
ps3
ptp ptp: Add a ptp clock driver for Broadcom DTE 2017-06-15 12:07:15 -04:00
pwm
rapidio
ras arm64 updates for 4.13: 2017-07-05 17:09:27 -07:00
regulator Merge remote-tracking branches 'regulator/topic/settle', 'regulator/topic/tps65910' and 'regulator/topic/tps65917' into regulator-next 2017-07-03 16:52:21 +01:00
remoteproc
reset ARM: SoC driver updates 2017-07-04 14:47:47 -07:00
rpmsg Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 2017-07-05 12:31:59 -07:00
rtc sched/wait: Disambiguate wq_entry->task_list and wq_head->task_list naming 2017-06-20 12:19:14 +02:00
s390 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 2017-07-05 12:31:59 -07:00
sbus block: don't set bounce limit in blk_init_queue 2017-06-27 12:13:45 -06:00
scsi Merge branch 'for-4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata 2017-07-06 09:41:58 -07:00
sfi
sh drivers/sh/intc/virq.c: delete an error message for a failed memory allocation in add_virq_to_pirq() 2017-07-06 16:24:30 -07:00
sn
soc ARM: SoC driver updates 2017-07-04 14:47:47 -07:00
spi Merge remote-tracking branches 'spi/topic/spidev', 'spi/topic/st-ssc4' and 'spi/topic/stm32' into spi-next 2017-07-03 16:21:12 +01:00
spmi
ssb
staging Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 2017-07-05 12:31:59 -07:00
target Merge branch 'work.read_write' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2017-07-05 14:35:57 -07:00
tc
tee
thermal USB/PHY patches for 4.13-rc1 2017-07-03 19:30:55 -07:00
thunderbolt
tty Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 2017-07-05 12:31:59 -07:00
uio
usb Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 2017-07-05 12:31:59 -07:00
uwb driver core patches for 4.13-rc1 2017-07-03 20:27:48 -07:00
vfio sched/wait: Rename wait_queue_t => wait_queue_entry_t 2017-06-20 12:18:27 +02:00
vhost Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 2017-07-05 12:31:59 -07:00
video video: fbdev: udlfb: drop log level for blanking 2017-06-14 12:40:36 +02:00
virt
virtio virtio_balloon: disable VIOMMU support 2017-06-18 23:13:35 +03:00
vlynq
vme
w1
watchdog
xen Merge branch 'ras-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2017-07-03 18:33:03 -07:00
zorro
Kconfig
Makefile