glibc/sysdeps
H.J. Lu a057f5f8cd X86-64: Use non-temporal store in memcpy on large data
The large memcpy micro benchmark in glibc shows a regression with
large data on Haswell machines.  Using non-temporal stores in memcpy
on large data can improve performance significantly.  This patch adds
a threshold for using non-temporal stores, set to 6 times the shared
cache size.  When the size is above the threshold, non-temporal stores
are used, except when the destination and source overlap, since the
destination may be in cache when the source is loaded.
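
The decision described above can be sketched in C.  This is only an
illustration under stated assumptions: the real code is assembly in
memmove-vec-unaligned-erms.S, and the helper names below
(non_temporal_threshold, regions_overlap, use_non_temporal_store) are
hypothetical, not glibc symbols.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: threshold as described in the commit message,
   6 times the shared cache size.  */
static size_t
non_temporal_threshold (size_t shared_cache_size)
{
  return shared_cache_size * 6;
}

/* Hypothetical helper: true if [dst, dst + n) and [src, src + n)
   share at least one byte.  */
static bool
regions_overlap (const void *dst, const void *src, size_t n)
{
  uintptr_t d = (uintptr_t) dst;
  uintptr_t s = (uintptr_t) src;
  return d < s ? (s - d < n) : (d - s < n);
}

/* Use non-temporal stores only for sizes above the threshold and only
   when the two regions do not overlap.  */
static bool
use_non_temporal_store (const void *dst, const void *src, size_t n,
                        size_t threshold)
{
  return n > threshold && !regions_overlap (dst, src, n);
}
```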

For sizes below 8 vector register widths, we load all data into
registers and store them together.  Larger sizes use only forward and
backward loops, which move 4 vector registers at a time, to support
overlapping addresses.  For the forward loop, we load the last 4
vector register widths of data and the first vector register width of
data into vector registers before the loop and store them after the
loop.  For the backward loop, we load the first 4 vector register
widths of data and the last vector register width of data into vector
registers before the loop and store them after the loop.
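
A simplified C model of the forward loop may make the ordering
clearer.  It is a sketch, not the glibc code: a 16-byte struct stands
in for a vector register, the destination alignment step of the real
assembly is omitted, and the names vec_t, load, store and move_forward
are hypothetical.  It assumes the size is above 8 * VEC and that the
destination does not lie above the source when the regions overlap.

```c
#include <stddef.h>
#include <string.h>

enum { VEC = 16 };                      /* assumed vector width in bytes */
typedef struct { unsigned char b[VEC]; } vec_t;

static vec_t
load (const unsigned char *p)
{
  vec_t v;
  memcpy (v.b, p, VEC);
  return v;
}

static void
store (unsigned char *p, vec_t v)
{
  memcpy (p, v.b, VEC);
}

/* Forward move for n > 8 * VEC, with dst below src if they overlap.  */
static void
move_forward (unsigned char *dst, const unsigned char *src, size_t n)
{
  vec_t tail[4], body[4], head;

  /* Load the last 4 vectors and the first vector before the loop, so
     later stores cannot clobber them.  */
  for (int i = 0; i < 4; i++)
    tail[i] = load (src + n - (4 - i) * VEC);
  head = load (src);

  unsigned char *d = dst + VEC;
  const unsigned char *s = src + VEC;
  size_t left = n - VEC;

  /* Move 4 vectors per iteration: all loads first, then all stores.  */
  while (left > 4 * VEC)
    {
      for (int i = 0; i < 4; i++)
        body[i] = load (s + i * VEC);
      for (int i = 0; i < 4; i++)
        store (d + i * VEC, body[i]);
      s += 4 * VEC;
      d += 4 * VEC;
      left -= 4 * VEC;
    }

  /* Store the saved vectors after the loop; the tail covers the at
     most 4 * VEC bytes the loop left unwritten.  */
  for (int i = 0; i < 4; i++)
    store (dst + n - (4 - i) * VEC, tail[i]);
  store (dst, head);
}
```

The backward loop mirrors this: it preloads the first 4 vectors and
the last vector, walks from the end toward the start, and stores the
saved vectors after the loop.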

	[BZ #19928]
	* sysdeps/x86_64/cacheinfo.c (__x86_shared_non_temporal_threshold):
	New.
	(init_cacheinfo): Set __x86_shared_non_temporal_threshold to 6
	times of shared cache size.
	* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S
	(VMOVNT): New.
	* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S
	(VMOVNT): Likewise.
	* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S
	(VMOVNT): Likewise.
	(VMOVU): Changed to movups for smaller code sizes.
	(VMOVA): Changed to movaps for smaller code sizes.
	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Update
	comments.
	(PREFETCH): New.
	(PREFETCH_SIZE): Likewise.
	(PREFETCHED_LOAD_SIZE): Likewise.
	(PREFETCH_ONE_SET): Likewise.
	Rewrite to use forward and backward loops, which move 4 vector
	registers at a time, to support overlapping addresses and use
	non temporal store if size is above the threshold and there is
	no overlap between destination and source.
2016-04-12 08:10:47 -07:00
..
aarch64 Add _STRING_INLINE_unaligned and string_private.h 2016-02-18 14:55:29 -02:00
alpha Update Alpha libm-test-ulps 2016-01-25 10:43:41 -08:00
arm Fix building glibc master with NDEBUG and --with-cpu. 2016-03-15 23:23:24 -04:00
generic Fix crash on getauxval call without HAVE_AUX_VECTOR 2016-04-10 23:58:43 +02:00
gnu Update copyright dates with scripts/update-copyrights. 2016-01-04 16:05:18 +00:00
hppa hppa: fix dladdr [BZ #19415] 2016-01-08 02:19:26 -05:00
i386 When disabling SSE, make sure -fpmath is not set to use SSE either 2016-04-09 22:14:24 -04:00
ia64 Update copyright dates with scripts/update-copyrights. 2016-01-04 16:05:18 +00:00
ieee754 Increase internal precision of ldbl-128ibm decimal printf [BZ #19853] 2016-03-31 12:14:33 -05:00
init_array Update copyright dates with scripts/update-copyrights. 2016-01-04 16:05:18 +00:00
m68k Add _STRING_INLINE_unaligned and string_private.h 2016-02-18 14:55:29 -02:00
mach hurd: Add c++-types expected result 2016-03-20 22:16:34 +01:00
microblaze Update copyright dates with scripts/update-copyrights. 2016-01-04 16:05:18 +00:00
mips Fix MIPS64 memcpy regression. 2016-01-28 01:52:05 +00:00
nacl Fix build with HAVE_AUX_VECTOR 2016-04-11 10:27:25 +02:00
nios2 Maintenance patch for nios2: update ULPS file and localplt.data changes. 2016-01-21 22:58:03 -08:00
nptl New pthread_barrier algorithm to fulfill barrier destruction requirements. 2016-01-15 21:20:34 +01:00
posix Fix flag test in waitid compatibility layer 2016-03-13 21:44:09 +01:00
powerpc powerpc: Add optimized P8 strspn 2016-04-07 15:51:28 -05:00
pthread Update copyright dates with scripts/update-copyrights. 2016-01-04 16:05:18 +00:00
s390 S390: Use ahi instead of aghi in 32bit _dl_runtime_resolve. 2016-04-01 10:42:54 +02:00
sh Update copyright dates with scripts/update-copyrights. 2016-01-04 16:05:18 +00:00
sparc Add _STRING_INLINE_unaligned and string_private.h 2016-02-18 14:55:29 -02:00
tile Update copyright dates with scripts/update-copyrights. 2016-01-04 16:05:18 +00:00
unix VDSO support for MIPS 2016-04-12 11:05:13 +01:00
wordsize-32 Update copyright dates with scripts/update-copyrights. 2016-01-04 16:05:18 +00:00
wordsize-64 Update copyright dates with scripts/update-copyrights. 2016-01-04 16:05:18 +00:00
x86 Remove Fast_Copy_Backward from Intel Core processors 2016-04-01 15:09:14 -07:00
x86_64 X86-64: Use non-temporal store in memcpy on large data 2016-04-12 08:10:47 -07:00