41 Commits

Author SHA1 Message Date
Andrew Senkevich
72276d6e88 Added memcpy/memmove family optimized with AVX512 for KNL hardware.
Added AVX512 implementations of memcpy, mempcpy, memmove, memcpy_chk,
mempcpy_chk, memmove_chk.
It shows an average improvement of more than 30% over the AVX versions on KNL
hardware (performance results in the thread
<https://sourceware.org/ml/libc-alpha/2016-01/msg00258.html>).

    * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Added new files.
    * sysdeps/x86_64/multiarch/ifunc-impl-list.c: Added new tests.
    * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: New file.
    * sysdeps/x86_64/multiarch/mempcpy-avx512-no-vzeroupper.S: Likewise.
    * sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: Likewise.
    * sysdeps/x86_64/multiarch/memcpy.S: Added new IFUNC branch.
    * sysdeps/x86_64/multiarch/memcpy_chk.S: Likewise.
    * sysdeps/x86_64/multiarch/memmove.c: Likewise.
    * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise.
    * sysdeps/x86_64/multiarch/mempcpy.S: Likewise.
    * sysdeps/x86_64/multiarch/mempcpy_chk.S: Likewise.
2016-01-16 00:49:45 +03:00
Andrew Senkevich
83d776f979 Added memset optimized with AVX512 for KNL hardware.
It shows an improvement of up to 28% over AVX2 memset (performance results
attached at <https://sourceware.org/ml/libc-alpha/2015-12/msg00052.html>).

    * sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S: New file.
    * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Added new file.
    * sysdeps/x86_64/multiarch/ifunc-impl-list.c: Added new tests.
    * sysdeps/x86_64/multiarch/memset.S: Added new IFUNC branch.
    * sysdeps/x86_64/multiarch/memset_chk.S: Likewise.
    * sysdeps/x86/cpu-features.h (bit_Prefer_No_VZEROUPPER,
    index_Prefer_No_VZEROUPPER): New.
    * sysdeps/x86/cpu-features.c (init_cpu_features): Set the
    Prefer_No_VZEROUPPER for Knights Landing.
2015-12-19 02:47:28 +03:00
Joseph Myers
c871b9b096 Remove -mavx2 configure tests.
There are configure tests for the -mavx2 compiler option.  AVX2
support was added in GCC 4.7, so these tests are now obsolete; this
patch removes them.

Tested for x86_64 and x86 (testsuite, and that installed stripped
shared libraries are unchanged by the patch).

	* sysdeps/i386/configure.ac (libc_cv_cc_avx2): Remove configure
	test.
	* sysdeps/i386/configure: Regenerated.
	* sysdeps/x86_64/configure.ac (libc_cv_cc_avx2): Remove configure
	test.
	* sysdeps/x86_64/configure: Regenerated.
	* config.h.in (HAVE_AVX2_SUPPORT): Remove #undef.
	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
	memset-avx2 unconditionally instead of conditionally on
	[$(config-cflags-avx2) = yes].
	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
	(__libc_ifunc_impl_list) [HAVE_AVX2_SUPPORT]: Make code
	unconditional.
	* sysdeps/x86_64/multiarch/memset.S [HAVE_AVX2_SUPPORT]: Likewise.
	* sysdeps/x86_64/multiarch/memset_chk.S
	[IS_IN (libc) && SHARED && HAVE_AVX2_SUPPORT]: Change conditional
	to [IS_IN (libc) && SHARED].
2015-10-28 13:29:03 +00:00
Joseph Myers
3b7aa5bf59 Remove configure tests for SSE4 support.
GCC added support for -msse4 in version 4.3.  Thus the configure tests
for it are obsolete, and this patch removes them.

Tested for x86_64 and x86 (testsuite, and that installed stripped
shared libraries are unchanged by this patch).

	* sysdeps/i386/configure.ac (libc_cv_cc_sse4): Remove configure
	test.
	* sysdeps/i386/configure: Regenerated.
	* sysdeps/i386/i686/multiarch/Makefile
	[$(config-cflags-sse4) = yes]: Make code unconditional.
	* sysdeps/i386/i686/multiarch/strcspn.S [HAVE_SSE4_SUPPORT]:
	Likewise.
	* sysdeps/i386/i686/multiarch/strspn.S [HAVE_SSE4_SUPPORT]:
	Likewise.
	* sysdeps/x86_64/configure.ac (libc_cv_cc_sse4): Remove configure
	test.
	* sysdeps/x86_64/configure: Regenerated.
	* sysdeps/x86_64/multiarch/Makefile [$(config-cflags-sse4) = yes]:
	Make code unconditional.
	* sysdeps/x86_64/multiarch/strcspn.S [HAVE_SSE4_SUPPORT]:
	Likewise.
	* sysdeps/x86_64/multiarch/strspn.S [HAVE_SSE4_SUPPORT]: Likewise.
	* config.h.in (HAVE_SSE4_SUPPORT): Remove #undef.
2015-10-06 20:47:40 +00:00
H.J. Lu
e2e4f56056 Add _dl_x86_cpu_features to rtld_global
This patch adds _dl_x86_cpu_features to rtld_global in x86 ld.so
and initializes it early before __libc_start_main is called so that
cpu_features is always available when it is used and we can avoid
calling __init_cpu_features in IFUNC selectors.

	* sysdeps/i386/dl-machine.h: Include <cpu-features.c>.
	(dl_platform_init): Call init_cpu_features.
	* sysdeps/i386/dl-procinfo.c (_dl_x86_cpu_features): New.
	* sysdeps/i386/i686/cacheinfo.c
	(DISABLE_PREFERRED_MEMORY_INSTRUCTION): Removed.
	* sysdeps/i386/i686/multiarch/Makefile (aux): Remove init-arch.
	* sysdeps/i386/i686/multiarch/Versions: Removed.
	* sysdeps/i386/i686/multiarch/ifunc-defines.sym (KIND_OFFSET):
	Removed.
	* sysdeps/i386/ldsodefs.h: Include <cpu-features.h>.
	* sysdeps/unix/sysv/linux/x86/Makefile
	(libpthread-sysdep_routines): Remove init-arch.
	* sysdeps/unix/sysv/linux/x86_64/dl-procinfo.c: Include
	<sysdeps/x86_64/dl-procinfo.c> instead of
	<sysdeps/generic/dl-procinfo.c>.
	* sysdeps/x86/Makefile [$(subdir) == csu] (gen-as-const-headers):
	Add cpu-features-offsets.sym and rtld-global-offsets.sym.
	[$(subdir) == elf] (sysdep-dl-routines): Add dl-get-cpu-features.
	[$(subdir) == elf] (tests): Add tst-get-cpu-features.
	[$(subdir) == elf] (tests-static): Add
	tst-get-cpu-features-static.
	* sysdeps/x86/Versions: New file.
	* sysdeps/x86/cpu-features-offsets.sym: Likewise.
	* sysdeps/x86/cpu-features.c: Likewise.
	* sysdeps/x86/cpu-features.h: Likewise.
	* sysdeps/x86/dl-get-cpu-features.c: Likewise.
	* sysdeps/x86/libc-start.c: Likewise.
	* sysdeps/x86/rtld-global-offsets.sym: Likewise.
	* sysdeps/x86/tst-get-cpu-features-static.c: Likewise.
	* sysdeps/x86/tst-get-cpu-features.c: Likewise.
	* sysdeps/x86_64/dl-procinfo.c: Likewise.
	* sysdeps/x86_64/cacheinfo.c (__cpuid_count): Removed.
	Assume USE_MULTIARCH is defined and don't check it.
	(is_intel): Replace __cpu_features with GLRO(dl_x86_cpu_features).
	(is_amd): Likewise.
	(max_cpuid): Likewise.
	(intel_check_word): Likewise.
	(__cache_sysconf): Don't call __init_cpu_features.
	(__x86_preferred_memory_instruction): Removed.
	(init_cacheinfo): Don't call __init_cpu_features. Replace
	__cpu_features with GLRO(dl_x86_cpu_features).
	* sysdeps/x86_64/dl-machine.h: Include <cpu-features.c>.
	(dl_platform_init): Call init_cpu_features.
	* sysdeps/x86_64/ldsodefs.h: Include <cpu-features.h>.
	* sysdeps/x86_64/multiarch/Makefile (aux): Remove init-arch.
	* sysdeps/x86_64/multiarch/Versions: Removed.
	* sysdeps/x86_64/multiarch/cacheinfo.c: Likewise.
	* sysdeps/x86_64/multiarch/init-arch.c: Likewise.
	* sysdeps/x86_64/multiarch/ifunc-defines.sym (KIND_OFFSET):
	Removed.
	* sysdeps/x86_64/multiarch/init-arch.h: Rewrite.
2015-08-13 03:41:22 -07:00
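The ordering that e2e4f56056 describes (fill in the CPU feature data once, early, so that selectors only read it) can be illustrated with a minimal, portable sketch. This is not the glibc code: `struct cpu_features_sketch`, `init_features_sketch`, `select_memset` and both memset variants are hypothetical names, and a plain function pointer stands in for a real IFUNC resolver.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the feature data kept in rtld_global.  */
struct cpu_features_sketch
{
  int initialized;
  int has_avx2;
};

static struct cpu_features_sketch features;

/* Called once during early startup, before any selector runs.
   The caller supplies the detection result; real code would query CPUID.  */
void
init_features_sketch (int has_avx2)
{
  features.has_avx2 = has_avx2;
  features.initialized = 1;
}

/* Two interchangeable implementations; the "AVX2" one is only a stub.  */
void *
memset_generic (void *s, int c, size_t n)
{
  unsigned char *p = s;
  while (n-- > 0)
    *p++ = (unsigned char) c;
  return s;
}

void *
memset_avx2_stub (void *s, int c, size_t n)
{
  return memset_generic (s, c, n);
}

typedef void *(*memset_fn) (void *, int, size_t);

/* The selector never triggers initialization itself; it relies on the
   feature data having been set up earlier and only reads it.  */
memset_fn
select_memset (void)
{
  assert (features.initialized);
  return features.has_avx2 ? memset_avx2_stub : memset_generic;
}
```

Keeping initialization out of the selector means every selector stays a cheap read of already-populated data, which is the benefit the commit message names.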
Ling Ma
05f3633da4 Improve 64bit memcpy performance for Haswell CPU with AVX instruction
In this patch we take advantage of HSW memory bandwidth, reduce branch
mispredictions by avoiding branch instructions, and force the destination
to be aligned for AVX instructions.

The CPU2006 403.gcc benchmark indicates this patch improves performance
by 2% to 10%.
2014-07-30 08:02:35 -07:00
H.J. Lu
f2fef657d8 Enable AVX2 optimized memset only if -mavx2 works
* config.h.in (HAVE_AVX2_SUPPORT): New #undef.
	* sysdeps/i386/configure.ac: Set HAVE_AVX2_SUPPORT and
	config-cflags-avx2.
	* sysdeps/x86_64/configure.ac: Likewise.
	* sysdeps/i386/configure: Regenerated.
	* sysdeps/x86_64/configure: Likewise.
	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
	memset-avx2 only if config-cflags-avx2 is yes.
	* sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list):
	Tests for memset_chk and memset only if HAVE_AVX2_SUPPORT is
	defined.
	* sysdeps/x86_64/multiarch/memset.S: Define multiple versions
	only if HAVE_AVX2_SUPPORT is defined.
	* sysdeps/x86_64/multiarch/memset_chk.S: Likewise.
2014-07-14 07:58:27 -07:00
Ling Ma
5c74e47cd6 Add x86_64 memset optimized for AVX2
In this patch we take advantage of HSW memory bandwidth, reduce branch
mispredictions by avoiding branch instructions, and force the destination
to be aligned for AVX and AVX2 instructions.

The CPU2006 403.gcc benchmark indicates this patch improves performance
by 26% to 59%.

	* sysdeps/x86_64/multiarch/Makefile: Add memset-avx2.
	* sysdeps/x86_64/multiarch/memset-avx2.S: New file.
	* sysdeps/x86_64/multiarch/memset.S: Likewise.
	* sysdeps/x86_64/multiarch/memset_chk.S: Likewise.
	* sysdeps/x86_64/multiarch/rtld-memset.S: Likewise.
2014-06-19 15:14:08 -07:00
Ondřej Bílka
584b18eb4d Add strstr with unaligned loads. Fixes bug 12100.
The SSE4.2 version of strstr used the pcmpistr instruction, which is quite
inefficient. A faster way is to look for pairs of characters: this needs
only SSE2, is faster than pcmpistr, and in real strings the pairs we look
for are relatively rare.

To keep linear time complexity we use the buy-or-rent technique, which
switches to the two-way algorithm when superlinear behaviour is detected.
2013-12-14 20:08:13 +01:00
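The idea in 584b18eb4d can be sketched in scalar C. This is only an illustration under stated assumptions: `pair_strstr` is a hypothetical name, the pair test is done one byte at a time rather than with SSE2, the budget constants are arbitrary, and the library `strstr` stands in for the two-way algorithm as the linear-time fallback.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch: locate the first two characters of NEEDLE,
   then verify the rest; switch strategy if verification work piles up.  */
const char *
pair_strstr (const char *hay, const char *needle)
{
  assert (hay != NULL && needle != NULL);

  size_t nlen = strlen (needle);
  if (nlen == 0)
    return hay;
  if (nlen == 1)
    return strchr (hay, needle[0]);

  size_t wasted = 0;		/* Bytes compared in failed verifications.  */
  for (const char *p = hay; p[0] != '\0' && p[1] != '\0'; p++)
    {
      if (p[0] != needle[0] || p[1] != needle[1])
	continue;

      /* The pair matched; check the remaining characters.  A NUL in the
	 haystack mismatches, so this never reads past the terminator.  */
      size_t i = 2;
      while (i < nlen && p[i] == needle[i])
	i++;
      if (i == nlen)
	return p;

      /* "Rent" the cheap scan while wasted work stays proportional to
	 the distance covered; otherwise "buy" the linear algorithm.
	 The constants 4 and 64 are assumptions chosen for illustration.  */
      wasted += i;
      if (wasted > 4 * (size_t) (p - hay) + 64)
	return strstr (p + 1, needle);
    }
  return NULL;
}
```

Because the fallback is entered only once wasted work exceeds a fixed multiple of the distance scanned, the total cost stays linear as long as the fallback itself is linear.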
Ondřej Bílka
dc1a95c730 Faster strrchr. 2013-09-26 19:23:01 +02:00
Ondřej Bílka
8f02859f17 Add unaligned strcmp. 2013-09-03 16:27:10 +02:00
Ondrej Bilka
2d48b41c8f Faster memcpy on x64.
We add a new memcpy version that uses unaligned loads, which are fast
on modern processors. This allows a second improvement: avoiding a
computed jump, which is a relatively expensive operation.

Tests available here:
http://kam.mff.cuni.cz/~ondra/memcpy_profile_result27_04_13.tar.bz2
2013-05-20 08:24:41 +02:00
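One way unaligned accesses can remove a size-indexed computed jump is to cover a whole range of lengths with two overlapping loads and stores. The sketch below illustrates that general technique for lengths 8 to 16; it is not taken from the patch, and `copy_8_to_16` is a hypothetical name. The fixed-size `memcpy` calls stand in for single unaligned 8-byte moves.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy N bytes, 8 <= N <= 16, without branching on the exact length.
   The first and last 8 bytes are loaded before anything is stored,
   and the two stores may overlap in the middle.  */
void *
copy_8_to_16 (void *dst, const void *src, size_t n)
{
  assert (n >= 8 && n <= 16);

  uint64_t head, tail;
  memcpy (&head, src, 8);
  memcpy (&tail, (const char *) src + n - 8, 8);
  memcpy (dst, &head, 8);
  memcpy ((char *) dst + n - 8, &tail, 8);
  return dst;
}
```

Every length in the range takes the same straight-line path, so no jump table indexed by the length is needed.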
Ondrej Bilka
37bb363f03 Faster strlen on x64. 2013-03-18 07:39:12 +01:00
Ondrej Bilka
80f844c9d8 Remove Prefer_SSE_for_memop on x64 2013-03-11 15:39:08 +01:00
Ondrej Bilka
87bd9bc4bd Revert " * sysdeps/x86_64/strlen.S: Replace with new SSE2 based implementation"
This reverts commit b79188d71716b6286866e06add976fe84100595e.
2013-03-06 22:27:18 +01:00
Ondrej Bilka
b79188d717 * sysdeps/x86_64/strlen.S: Replace with new SSE2 based implementation
which is faster on all x86_64 architectures.
	Tested on AMD, Intel Nehalem, SNB, IVB.
2013-03-06 21:54:01 +01:00
Carlos O'Donell
1a0994f535 BZ#14059: Fix AVX and FMA4 detection.
Fix AVX and FMA4 detection by following the guidelines
set out by Intel and AMD for detecting these features.
2012-05-17 06:59:28 -07:00
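The vendor guidelines referred to in 1a0994f535 amount to checking that the operating system has enabled the extended register state before trusting the CPUID feature bit. A minimal sketch of that check for AVX follows, assuming GCC or Clang on x86; the function and macro names are hypothetical and this is not the glibc implementation. FMA4 would need the same operating-system check in addition to its own CPUID bit.

```c
#include <assert.h>
#include <stdint.h>

#define CPUID_1_ECX_OSXSAVE (1u << 27)	/* OS has enabled XSAVE/XGETBV.  */
#define CPUID_1_ECX_AVX     (1u << 28)	/* CPU implements AVX.  */
#define XCR0_SSE_STATE      (1u << 1)	/* OS saves XMM registers.  */
#define XCR0_AVX_STATE      (1u << 2)	/* OS saves upper YMM halves.  */

/* Pure decision logic: AVX is usable only if the CPU has it, the OS
   uses XSAVE, and the OS saves both XMM and YMM state.  */
int
avx_usable (uint32_t cpuid1_ecx, uint64_t xcr0)
{
  uint32_t need_ecx = CPUID_1_ECX_OSXSAVE | CPUID_1_ECX_AVX;
  uint64_t need_xcr0 = XCR0_SSE_STATE | XCR0_AVX_STATE;

  return (cpuid1_ecx & need_ecx) == need_ecx
	 && (xcr0 & need_xcr0) == need_xcr0;
}

#if defined (__GNUC__) && (defined (__x86_64__) || defined (__i386__))
# include <cpuid.h>

/* Hardware query; only built on x86 with GCC-compatible compilers.  */
int
avx_usable_on_this_cpu (void)
{
  unsigned int eax, ebx, ecx, edx;

  if (!__get_cpuid (1, &eax, &ebx, &ecx, &edx))
    return 0;
  /* XGETBV faults unless the OS has set OSXSAVE, so test that first.  */
  if ((ecx & CPUID_1_ECX_OSXSAVE) == 0)
    return 0;

  uint32_t lo, hi;
  __asm__ volatile ("xgetbv" : "=a" (lo), "=d" (hi) : "c" (0));
  return avx_usable (ecx, ((uint64_t) hi << 32) | lo);
}
#endif
```

Splitting the bit logic from the hardware query keeps the decision testable on any machine.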
Ulrich Drepper
1d3e4b618a Optimized wcschr and wcscpy for x86-64 and x86-32 2011-12-17 14:39:23 -05:00
Liubov Dmitrieva
ce7dd29f28 Optimized strnlen and wcscmp for x86-64 2011-10-23 14:56:04 -04:00
Liubov Dmitrieva
be13f7bff6 Optimized memcmp and wmemcmp for x86-64 and x86-32 2011-10-15 11:10:08 -04:00
Liubov Dmitrieva
a5f524e479 Add Atom-optimized strchr and strrchr for x86-64 2011-09-05 21:34:03 -04:00
Liubov Dmitrieva
99710781cc Improve 64 bit strcat functions with SSE2/SSSE3 2011-07-19 17:11:54 -04:00
H.J. Lu
8912479f9e Improved st{r,p}{,n}cpy for SSE2 and SSSE3 on x86-64 2011-06-24 15:14:22 -04:00
H.J. Lu
ff02d5280b Use IFUNC on x86-64 memset 2010-11-08 03:41:34 -05:00
H.J. Lu
623aac7f84 Unroll x86-64 strlen 2010-08-26 22:09:34 -07:00
Roland McGrath
8b2b771538 Clean up warnings in new x86_64/multiarch code. 2010-08-25 12:13:08 -07:00
Richard Henderson
73f27d5e72 Clean up SSE variable shifts 2010-08-24 11:35:01 -07:00
Ulrich Drepper
e9f82e0d1d Add optimized strncasecmp versions for x86-64. 2010-08-14 22:04:01 -07:00
Ulrich Drepper
73507d3ae0 Add support for SSSE3 and SSE4.2 versions of strcasecmp on x86-64. 2010-07-31 21:41:09 -07:00
Ulrich Drepper
cc9f2e47a0 Speed up SSE4.2 strcasestr by avoiding indirect function call. 2010-07-16 15:37:38 -07:00
H.J. Lu
6fb8cbcb58 Improve 64bit memcpy/memmove for Atom, Core 2 and Core i7
This patch includes optimized 64bit memcpy/memmove for Atom, Core 2 and
Core i7.  It improves memcpy by up to 3X on Atom, up to 4X on Core 2 and
up to 1X on Core i7.  It also improves memmove by up to 3X on Atom, up to
4X on Core 2 and up to 2X on Core i7.
2010-06-30 08:26:11 -07:00
H.J. Lu
404a6e3201 x86-64 SSE4 optimized memcmp
This is a 64bit SSE4 optimized memcmp. It improves memcmp by up to 3X
on Intel Core i7.
2010-04-14 00:12:53 -07:00
H.J. Lu
001659f4d5 Implement SSE4.2 optimized strchr and strrchr. 2009-10-22 22:47:12 -07:00
Ulrich Drepper
0fda545d5f Add SSSE3-optimized implementation of str{,n}cmp for x86-64. 2009-08-07 22:51:02 -07:00
H.J. Lu
7956a3d27c Add SSE2 support to str{,n}cmp for x86-64. 2009-07-26 13:32:28 -07:00
H.J. Lu
2b7a8664fa SSE4.2 strstr/strcasestr for x86-64.
This patch implements SSE4.2 strstr/strcasestr, using the Knuth-Morris-Pratt
string searching algorithm.
2009-07-20 21:06:50 -07:00
H.J. Lu
06e51c8f3d Add SSE4.2 support for strcspn, strpbrk, and strspn on x86-64. 2009-07-03 02:48:56 -07:00
H.J. Lu
ab6a873fe0 SSSE3 strcpy/stpcpy for x86-64
This patch adds SSSE3 strcpy/stpcpy. I got up to a 4X speedup on Core 2
and Core i7.  I disabled it on Atom since the SSSE3 version is slower for
shorter (<64 byte) data.
2009-07-02 03:39:03 -07:00
H.J. Lu
772f4e6a1b Add SSE4.2 support for strcmp and strncmp on x86-64. 2009-06-22 20:38:41 -07:00
Ulrich Drepper
3ab2d57a4d Optimize x86-64 strlen for SSE4.2.
The SSE4.2 implementation is used in the DSO only.  The patch also adds
some infrastructure to be used in similar code later on.
2009-06-05 11:32:00 -07:00
Ulrich Drepper
425ce2edb9 * config.h.in (USE_MULTIARCH): Define.
* configure.in: Handle --enable-multi-arch.
	* elf/dl-runtime.c (_dl_fixup): Handle STT_GNU_IFUNC.
	(_dl_fixup_profile): Likewise.
	* elf/do-lookup.c (dl_lookup_x): Likewise.
	* sysdeps/x86_64/dl-machine.h: Handle STT_GNU_IFUNC.
	* elf/elf.h (STT_GNU_IFUNC): Define.
	* include/libc-symbols.h (libc_ifunc): Define.
	* sysdeps/x86_64/cacheinfo.c: If USE_MULTIARCH is defined, use the
	framework in init-arch.h to get CPUID values.
	* sysdeps/x86_64/multiarch/Makefile: New file.
	* sysdeps/x86_64/multiarch/init-arch.c: New file.
	* sysdeps/x86_64/multiarch/init-arch.h: New file.
	* sysdeps/x86_64/multiarch/sched_cpucount.c: New file.

	* config.make.in (experimental-malloc): Define.
	* configure.in: Handle --enable-experimental-malloc.
	* malloc/Makefile: Handle experimental-malloc flag.
	* malloc/malloc.c: Implement PER_THREAD and ATOMIC_FASTBINS features.
	* malloc/arena.c: Likewise.
	* malloc/hooks.c: Likewise.
	* malloc/malloc.h: Define M_ARENA_TEST and M_ARENA_MAX.
2009-03-13 23:53:18 +00:00