linux/arch
Matt Fleming df13086490 x86/asm/64: Align start of __clear_user() loop to 16-bytes
commit bb5570ad3b upstream.

x86 CPUs can suffer severe performance drops if a tight loop, such as
the ones in __clear_user(), straddles a 16-byte instruction fetch
window, or worse, a 64-byte cacheline. This issues was discovered in the
SUSE kernel with the following commit,

  1153933703 ("x86/asm/64: Micro-optimize __clear_user() - Use immediate constants")

which increased the code object size from 10 bytes to 15 bytes and
caused the 8-byte copy loop in __clear_user() to be split across a
64-byte cacheline.

Aligning the start of the loop to 16-bytes makes this fit neatly inside
a single instruction fetch window again and restores the performance of
__clear_user() which is used heavily when reading from /dev/zero.

Here are some numbers from running libmicro's read_z* and pread_z*
microbenchmarks which read from /dev/zero:

  Zen 1 (Naples)

  libmicro-file
                                        5.7.0-rc6              5.7.0-rc6              5.7.0-rc6
                                                    revert-1153933703d9+               align16+
  Time mean95-pread_z100k       9.9195 (   0.00%)      5.9856 (  39.66%)      5.9938 (  39.58%)
  Time mean95-pread_z10k        1.1378 (   0.00%)      0.7450 (  34.52%)      0.7467 (  34.38%)
  Time mean95-pread_z1k         0.2623 (   0.00%)      0.2251 (  14.18%)      0.2252 (  14.15%)
  Time mean95-pread_zw100k      9.9974 (   0.00%)      6.0648 (  39.34%)      6.0756 (  39.23%)
  Time mean95-read_z100k        9.8940 (   0.00%)      5.9885 (  39.47%)      5.9994 (  39.36%)
  Time mean95-read_z10k         1.1394 (   0.00%)      0.7483 (  34.33%)      0.7482 (  34.33%)

Note that this doesn't affect Haswell or Broadwell microarchitectures
which seem to avoid the alignment issue by executing the loop straight
out of the Loop Stream Detector (verified using perf events).

Fixes: 1153933703 ("x86/asm/64: Micro-optimize __clear_user() - Use immediate constants")
Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org> # v4.19+
Link: https://lkml.kernel.org/r/20200618102002.30034-1-matt@codeblueprint.co.uk
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-06-30 15:37:08 -04:00
..
alpha alpha: fix memory barriers so that they conform to the specification 2020-06-22 09:31:21 +02:00
arc ARC: [plat-eznps]: Restrict to CONFIG_ISA_ARCOMPACT 2020-06-07 13:18:50 +02:00
arm ARM: imx5: add missing put_device() call in imx_suspend_alloc_ocram() 2020-06-30 15:37:00 -04:00
arm64 arm64: sve: Fix build failure when ARM64_SVE=y and SYSCTL=n 2020-06-30 15:37:05 -04:00
c6x
csky csky: Fixup abiv2 syscall_trace break a4 & a5 2020-06-17 16:40:21 +02:00
h8300
hexagon hexagon: define ioremap_uc 2020-05-10 10:31:31 +02:00
ia64 mm/memory_hotplug: shrink zones when offlining memory 2020-01-09 10:19:56 +01:00
m68k m68k/PCI: Fix a memory leak in an error handling path 2020-06-24 17:50:16 +02:00
microblaze microblaze: Prevent the overflow of the start 2020-02-24 08:37:02 +01:00
mips MIPS: Fix IRQ tracing when call handle_fpe() and handle_msa_fpe() 2020-06-22 09:31:09 +02:00
nds32 asm-generic/nds32: don't redefine cacheflush primitives 2020-01-17 19:48:43 +01:00
nios2
openrisc openrisc: Fix issue with argument clobbering for clone/fork 2020-06-24 17:50:37 +02:00
parisc parisc: Fix kernel panic in mem_init() 2020-06-03 08:21:28 +02:00
powerpc powerpc: Fix kernel crash in show_instructions() w/DEBUG_VIRTUAL 2020-06-24 17:50:45 +02:00
riscv RISC-V: Don't allow write+exec only page mapping request in mmap 2020-06-30 15:37:06 -04:00
s390 s390/vdso: fix vDSO clock_getres() 2020-06-30 15:37:05 -04:00
sh pinctrl: sh-pfc: sh7269: Fix CAN function GPIOs 2020-02-24 08:36:41 +01:00
sparc fix a braino in "sparc32: fix register window handling in genregs32_[gs]et()" 2020-06-30 15:36:47 -04:00
um um: ensure `make ARCH=um mrproper` removes arch/$(SUBARCH)/include/generated/ 2020-05-02 08:48:53 +02:00
unicore32
x86 x86/asm/64: Align start of __clear_user() loop to 16-bytes 2020-06-30 15:37:08 -04:00
xtensa xtensa: Implement copy_thread_tls 2020-01-14 20:08:35 +01:00
.gitignore
Kconfig asm-generic/tlb: add missing CONFIG symbol 2020-02-24 08:37:02 +01:00