Commit Graph

404 Commits

Author SHA1 Message Date
Tamar Christina 024edf0895 AArch64: Fix left fold sum reduction RTL patterns [PR104049]
As the discussion in the PR pointed out, the RTL we have for the REDUC_PLUS
patterns is wrong.  The UNSPECs are modelled as returning a vector, and an
expand pattern then emits a vec_select of the 0th element to get the scalar.

This is incorrect: the instruction itself already returns only a single scalar,
and declaring that it returns a vector allows combine to push a subreg into
the pattern, which causes reload to make duplicate moves.

This patch corrects this by removing the weird indirection and making the RTL
pattern model the actual semantics of the instruction directly.
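
A minimal sketch of the kind of code affected (hypothetical, not necessarily
the new vadd_reduc-1.c test): the ADDV result is already scalar, so no extra
vector-to-scalar moves should be emitted around it.

#include <arm_neon.h>

int
sum_plus_bias (int32x4_t v, int bias)
{
  /* vaddvq_s32 performs the across-vector sum reduction.  */
  return vaddvq_s32 (v) + bias;
}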

gcc/ChangeLog:

	PR target/104049
	* config/aarch64/aarch64-simd.md
	(aarch64_reduc_plus_internal<mode>): Fix RTL and rename to...
	(reduc_plus_scal_<mode>): ... This.
	(reduc_plus_scal_v4sf): Moved.
	(aarch64_reduc_plus_internalv2si): Fix RTL and rename to...
	(reduc_plus_scal_v2si): ... This.

gcc/testsuite/ChangeLog:

	PR target/104049
	* gcc.target/aarch64/vadd_reduc-1.c: New test.
	* gcc.target/aarch64/vadd_reduc-2.c: New test.
2022-04-07 08:27:53 +01:00
Richard Sandiford 83d7e720cd aarch64: Extend vec_concat patterns to 8-byte vectors
This patch extends the previous support for 16-byte vec_concat
so that it supports pairs of 4-byte elements.  This too isn't
strictly a regression fix, since the 8-byte forms weren't affected
by the same problems as the 16-byte forms, but it leaves things in
a more consistent state.
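
As a hypothetical illustration (not one of the new vec-init tests), an 8-byte
vector built from two 4-byte elements can now use the extended vec_concat
patterns:

#include <arm_neon.h>

float32x2_t
make_pair (float a, float b)
{
  /* A 64-bit vector formed from two 32-bit scalars.  */
  return (float32x2_t) { a, b };
}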

gcc/
	* config/aarch64/iterators.md (VDCSIF): New mode iterator.
	(VDBL): Handle SF.
	(single_wx, single_type, single_dtype, dblq): New mode attributes.
	* config/aarch64/aarch64-simd.md (load_pair_lanes<mode>): Extend
	from VDC to VDCSIF.
	(store_pair_lanes<mode>): Likewise.
	(*aarch64_combine_internal<mode>): Likewise.
	(*aarch64_combine_internal_be<mode>): Likewise.
	(*aarch64_combinez<mode>): Likewise.
	(*aarch64_combinez_be<mode>): Likewise.
	* config/aarch64/aarch64.cc (aarch64_classify_address): Handle
	8-byte modes for ADDR_QUERY_LDP_STP_N.
	(aarch64_print_operand): Likewise for %y.

gcc/testsuite/
	* gcc.target/aarch64/vec-init-13.c: New test.
	* gcc.target/aarch64/vec-init-14.c: Likewise.
	* gcc.target/aarch64/vec-init-15.c: Likewise.
	* gcc.target/aarch64/vec-init-16.c: Likewise.
	* gcc.target/aarch64/vec-init-17.c: Likewise.
2022-02-09 16:57:06 +00:00
Richard Sandiford bce43c0493 aarch64: Remove move_lo/hi_quad expanders
This patch is the second of two to remove the old
move_lo/hi_quad expanders and move_hi_quad insns.

gcc/
	* config/aarch64/aarch64-simd.md (@aarch64_split_simd_mov<mode>):
	Use aarch64_combine instead of move_lo/hi_quad.  Tabify.
	(move_lo_quad_<mode>, aarch64_simd_move_hi_quad_<mode>): Delete.
	(aarch64_simd_move_hi_quad_be_<mode>, move_hi_quad_<mode>): Delete.
	(vec_pack_trunc_<mode>): Take general_operand elements and use
	aarch64_combine rather than move_lo/hi_quad to combine them.
	(vec_pack_trunc_df): Likewise.
2022-02-09 16:57:06 +00:00
Richard Sandiford 4057266ce5 aarch64: Add a general vec_concat expander
After previous patches, we have a (mostly new) group of vec_concat
patterns as well as vestiges of the old move_lo/hi_quad patterns.
(A previous patch removed the move_lo_quad insns, but we still
have the move_hi_quad insns and both sets of expanders.)

This patch is the first of two to remove the old move_lo/hi_quad
stuff.  It isn't technically a regression fix, but it seemed
better to make the changes now rather than leave things in
a half-finished and inconsistent state.

This patch defines an aarch64_vec_concat expander that coerces the
element operands into a valid form, including the ones added by the
previous patch.  This in turn lets us get rid of one move_lo/hi_quad
pair.

As a side-effect, it also means that vcombines of 2 vectors make
better use of the available forms, like vec_inits of 2 scalars
already do.
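
For example (a hypothetical sketch, not the new vec-init-12.c test), a
vcombine of two 64-bit vectors can now pick whichever vec_concat form fits
best, e.g. when one half comes from memory:

#include <arm_neon.h>

int32x4_t
combine_mem_hi (int32x2_t lo, const int32x2_t *hi)
{
  /* Concatenate a register half with a half loaded from memory.  */
  return vcombine_s32 (lo, *hi);
}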

gcc/
	* config/aarch64/aarch64-protos.h (aarch64_split_simd_combine):
	Delete.
	* config/aarch64/aarch64-simd.md (@aarch64_combinez<mode>): Rename
	to...
	(*aarch64_combinez<mode>): ...this.
	(@aarch64_combinez_be<mode>): Rename to...
	(*aarch64_combinez_be<mode>): ...this.
	(@aarch64_vec_concat<mode>): New expander.
	(aarch64_combine<mode>): Use it.
	(@aarch64_simd_combine<mode>): Delete.
	* config/aarch64/aarch64.cc (aarch64_split_simd_combine): Delete.
	(aarch64_expand_vector_init): Use aarch64_vec_concat.

gcc/testsuite/
	* gcc.target/aarch64/vec-init-12.c: New test.
2022-02-09 16:57:05 +00:00
Richard Sandiford 85ac2fe44f aarch64: Add more vec_combine patterns
vec_combine is really one instruction on aarch64, provided that
the lowpart element is in the same register as the destination
vector.  This patch adds patterns for that.

The patch fixes a regression from GCC 8.  Before the patch:

int64x2_t s64q_1(int64_t a0, int64_t a1) {
  if (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
    return (int64x2_t) { a1, a0 };
  else
    return (int64x2_t) { a0, a1 };
}

generated:

        fmov    d0, x0
        ins     v0.d[1], x1
        ins     v0.d[1], x1
        ret

whereas GCC 8 generated the more respectable:

        dup     v0.2d, x0
        ins     v0.d[1], x1
        ret

gcc/
	* config/aarch64/predicates.md (aarch64_reg_or_mem_pair_operand):
	New predicate.
	* config/aarch64/aarch64-simd.md (*aarch64_combine_internal<mode>)
	(*aarch64_combine_internal_be<mode>): New patterns.

gcc/testsuite/
	* gcc.target/aarch64/vec-init-9.c: New test.
	* gcc.target/aarch64/vec-init-10.c: Likewise.
	* gcc.target/aarch64/vec-init-11.c: Likewise.
2022-02-09 16:57:05 +00:00
Richard Sandiford aeef5c57f1 aarch64: Remove redundant vec_concat patterns
move_lo_quad_internal_<mode> and move_lo_quad_internal_be_<mode>
partially duplicate the later aarch64_combinez{,_be}<mode> patterns.
The duplication itself is a regression.

The only substantive differences between the two are:

* combinez uses vector MOV (ORR) instead of element MOV (DUP).
  The former seems more likely to be handled via renaming.

* combinez disparages the GPR->FPR alternative whereas move_lo_quad
  gave it equal cost.  The new test gives a token example of when
  the combinez behaviour helps; a hypothetical sketch of such a case is
  shown below.
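
A hypothetical sketch of such a case (not necessarily the new vec-init-8.c
test): combining a vector with zero, where keeping the value in FP registers
and using a vector MOV for the zeroing is preferable.

#include <arm_neon.h>

int32x4_t
combine_with_zero (int32x2_t x)
{
  /* The high half is all zeros.  */
  return vcombine_s32 (x, vdup_n_s32 (0));
}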

gcc/
	* config/aarch64/aarch64-simd.md (move_lo_quad_internal_<mode>)
	(move_lo_quad_internal_be_<mode>): Delete.
	(move_lo_quad_<mode>): Use aarch64_combine<Vhalf> instead of the above.

gcc/testsuite/
	* gcc.target/aarch64/vec-init-8.c: New test.
2022-02-09 16:57:04 +00:00
Richard Sandiford 958448a944 aarch64: Generalise adjacency check for load_pair_lanes
This patch generalises the load_pair_lanes<mode> guard so that
it uses aarch64_check_consecutive_mems to check for consecutive
mems.  It also allows the pattern to be used for STRICT_ALIGNMENT
targets if the alignment is high enough.

The main aim is to avoid an inline test, for the sake of a later patch
that needs to repeat it.  Reusing aarch64_check_consecutive_mems seemed
simpler than writing an entirely new function.
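
As a hypothetical example of what load_pair_lanes targets: two adjacent
64-bit loads combined into a single Q-register load.

#include <arm_neon.h>

float64x2_t
load_adjacent (const double *p)
{
  /* p[0] and p[1] are consecutive in memory.  */
  return (float64x2_t) { p[0], p[1] };
}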

gcc/
	* config/aarch64/aarch64-protos.h (aarch64_mergeable_load_pair_p):
	Declare.
	* config/aarch64/aarch64-simd.md (load_pair_lanes<mode>): Use
	aarch64_mergeable_load_pair_p instead of inline check.
	* config/aarch64/aarch64.cc (aarch64_expand_vector_init): Likewise.
	(aarch64_check_consecutive_mems): Allow the reversed parameter
	to be null.
	(aarch64_mergeable_load_pair_p): New function.
2022-02-09 16:57:03 +00:00
Richard Sandiford fabc5d9bce aarch64: Generalise vec_set predicate
The aarch64_simd_vec_set<mode> define_insn takes memory operands,
so this patch makes the vec_set<mode> optab expander do the same.
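
A hypothetical illustration: setting a lane directly from memory, which the
vec_set expander can now hand to the insn without first forcing the value
into a register.

#include <arm_neon.h>

int32x4_t
set_lane_from_mem (int32x4_t v, const int *p)
{
  /* Insert the value at *p into lane 2 of v.  */
  return vsetq_lane_s32 (*p, v, 2);
}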

gcc/
	* config/aarch64/aarch64-simd.md (vec_set<mode>): Allow the
	element to be an aarch64_simd_nonimmediate_operand.
2022-02-09 16:57:02 +00:00
Richard Sandiford c48a6819d1 aarch64: Tighten general_operand predicates
This patch fixes some cases in which *general_operand was used instead of
*nonimmediate_operand by patterns that don't accept immediates.
This avoids some complications with later patches.

gcc/
	* config/aarch64/aarch64-simd.md (aarch64_simd_vec_set<mode>): Use
	aarch64_simd_nonimmediate_operand instead of
	aarch64_simd_general_operand.
	(@aarch64_combinez<mode>): Use nonimmediate_operand instead of
	general_operand.
	(@aarch64_combinez_be<mode>): Likewise.
2022-02-09 16:57:02 +00:00
Richard Sandiford 7e4f89a23e aarch64: Add missing movmisalign patterns
The Advanced SIMD movmisalign patterns didn't handle 16-bit
FP modes, which meant that the vector loop for:

  void
  test (_Float16 *data)
  {
    _Pragma ("omp simd")
    for (int i = 0; i < 8; ++i)
      data[i] = 1.0;
  }

would be versioned for alignment.

This was causing some new failures in aarch64/sve/single_5.c:

FAIL: gcc.target/aarch64/sve/single_5.c scan-assembler-not \\tb
FAIL: gcc.target/aarch64/sve/single_5.c scan-assembler-not \\tcmp
FAIL: gcc.target/aarch64/sve/single_5.c scan-assembler-times \\tstr\\tq[0-9]+, 10

but I didn't look into what changed from earlier releases.
Adding the missing modes removes some existing xfails.

gcc/
	* config/aarch64/aarch64-simd.md (movmisalign<mode>): Extend from
	VALL to VALL_F16.

gcc/testsuite/
	* gcc.target/aarch64/sve/single_5.c: Remove some XFAILs.
2022-02-03 10:44:00 +00:00
Richard Sandiford 6a77052660 aarch64: Remove VALL_F16MOV iterator
The VALL_F16MOV iterator now has the same modes as VALL_F16,
in the same order.  This patch removes the former in favour
of the latter.

This doesn't fix a bug as such, but it's ultra-safe (no change in
object code) and it saves a follow-up patch from having to make
a false choice between the iterators.

gcc/
	* config/aarch64/iterators.md (VALL_F16MOV): Delete.
	* config/aarch64/aarch64-simd.md (mov<mode>): Use VALL_F16 instead
	of VALL_F16MOV.
2022-02-03 10:44:00 +00:00
Tamar Christina ab95fe61fe AArch64: use canonical ordering for complex mul, fma and fms
After the first patch in the series, this updates the optabs to expect the
canonical sequence.

gcc/ChangeLog:

	PR tree-optimization/102819
	PR tree-optimization/103169
	* config/aarch64/aarch64-simd.md (cml<fcmac1><conj_op><mode>4): Use
	canonical order.
	* config/aarch64/aarch64-sve.md (cml<fcmac1><conj_op><mode>4): Likewise.
2022-02-02 10:51:38 +00:00
Jakub Jelinek 7adcbafe45 Update copyright years. 2022-01-03 10:42:10 +01:00
Przemyslaw Wirkus 0a68862e78 aarch64: fix: ls64 tests fail on aarch64_be [PR103729]
This patch sorts out an issue with the LS64 intrinsics tests failing on
aarch64_be targets.

gcc/ChangeLog:

	PR target/103729
	* config/aarch64/aarch64-simd.md (aarch64_movv8di): Allow big endian
	targets to move V8DI.
2021-12-16 10:50:29 +00:00
Przemyslaw Wirkus fdcddba8f2 aarch64: Add LS64 extension and intrinsics
This patch adds support for LS64 (the Armv8.7-A Load/Store 64 Byte extension),
which is part of the Armv8.7-A architecture.  The changes include the missing
plumbing for TARGET_LS64, the LS64 data structure and the intrinsics defined
in ACLE.  The machine description of the intrinsics uses the new V8DI mode
added in a separate patch.
__ARM_FEATURE_LS64 is defined if the Armv8.7-A LS64 instructions for atomic
64-byte access to device memory are supported.

A new compiler-internal type is added, wrapping the ACLE struct data512_t:

typedef struct {
  uint64_t val[8];
} __arm_data512_t;
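
A sketch of how the intrinsics might be used (signatures assumed from the
ACLE description above; requires a target with LS64 enabled): an atomic
64-byte copy between two device-memory locations.

#include <arm_acle.h>

void
copy_64_bytes (void *dst, const void *src)
{
  /* Single-copy-atomic 64-byte load followed by a 64-byte store.  */
  data512_t d = __arm_ld64b (src);
  __arm_st64b (dst, d);
}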

gcc/ChangeLog:

	* config/aarch64/aarch64-builtins.c (enum aarch64_builtins):
	Define AARCH64_LS64_BUILTIN_LD64B, AARCH64_LS64_BUILTIN_ST64B,
	AARCH64_LS64_BUILTIN_ST64BV, AARCH64_LS64_BUILTIN_ST64BV0.
	(aarch64_init_ls64_builtin_decl): Helper function.
	(aarch64_init_ls64_builtins): Helper function.
	(aarch64_init_ls64_builtins_types): Helper function.
	(aarch64_general_init_builtins): Init LS64 intrinsics for
	TARGET_LS64.
	(aarch64_expand_builtin_ls64): LS64 intrinsics expander.
	(aarch64_general_expand_builtin): Handle aarch64_expand_builtin_ls64.
	(ls64_builtins_data): New helper struct.
	(v8di_UP): New define.
	* config/aarch64/aarch64-c.c (aarch64_update_cpp_builtins): Define
	__ARM_FEATURE_LS64.
	* config/aarch64/aarch64.c (aarch64_classify_address): Enforce the
	V8DI range (7-bit signed scaled) for both ends of the range.
	* config/aarch64/aarch64-simd.md (movv8di): New pattern.
	(aarch64_movv8di): New pattern.
	* config/aarch64/aarch64.h (AARCH64_ISA_LS64): New define.
	(TARGET_LS64): New define.
	* config/aarch64/aarch64.md: Add UNSPEC_LD64B, UNSPEC_ST64B,
	UNSPEC_ST64BV and UNSPEC_ST64BV0.
	(ld64b): New define_insn.
	(st64b): New define_insn.
	(st64bv): New define_insn.
	(st64bv0): New define_insn.
	* config/aarch64/arm_acle.h (data512_t): New type derived from
	__arm_data512_t.
	(__arm_data512_t): New internal type.
	(__arm_ld64b): New intrinsic.
	(__arm_st64b): New intrinsic.
	(__arm_st64bv): New intrinsic.
	(__arm_st64bv0): New intrinsic.
	* config/arm/types.md: Add new type ls64.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/acle/ls64_asm.c: New test.
	* gcc.target/aarch64/acle/ls64_ld64b.c: New test.
	* gcc.target/aarch64/acle/ls64_ld64b-2.c: New test.
	* gcc.target/aarch64/acle/ls64_ld64b-3.c: New test.
	* gcc.target/aarch64/acle/ls64_st64b.c: New test.
	* gcc.target/aarch64/acle/ls64_ld_st_o0.c: New test.
	* gcc.target/aarch64/acle/ls64_st64b-2.c: New test.
	* gcc.target/aarch64/acle/ls64_st64bv.c: New test.
	* gcc.target/aarch64/acle/ls64_st64bv-2.c: New test.
	* gcc.target/aarch64/acle/ls64_st64bv-3.c: New test.
	* gcc.target/aarch64/acle/ls64_st64bv0.c: New test.
	* gcc.target/aarch64/acle/ls64_st64bv0-2.c: New test.
	* gcc.target/aarch64/acle/ls64_st64bv0-3.c: New test.
	* gcc.target/aarch64/pragma_cpp_predefs_2.c: Add checks
	for __ARM_FEATURE_LS64.
2021-12-14 14:52:27 +00:00
Tamar Christina 9b8830b6f3 AArch64: Optimize right shift rounding narrowing
This optimizes rounding-shift-right-narrow instructions into a
rounding-add-narrow-high where one vector is 0, when the shift amount is half
the width of the original input type.

i.e.

uint32x4_t foo (uint64x2_t a, uint64x2_t b)
{
  return vrshrn_high_n_u64 (vrshrn_n_u64 (a, 32), b, 32);
}

now generates:

foo:
        movi    v3.4s, 0
        raddhn  v0.2s, v2.2d, v3.2d
        raddhn2 v0.4s, v2.2d, v3.2d

instead of:

foo:
        rshrn   v0.2s, v0.2d, 32
        rshrn2  v0.4s, v1.2d, 32
        ret

On Arm cores this is an improvement in both latency and throughput.
Because a vector zero is needed, I created a new method,
aarch64_gen_shareable_zero, that creates zeros using V4SI and then takes a subreg
of the zero to the desired type.  This allows CSE to share all the zero
constants.

gcc/ChangeLog:

	* config/aarch64/aarch64-protos.h (aarch64_gen_shareable_zero): New.
	* config/aarch64/aarch64-simd.md (aarch64_rshrn<mode>,
	aarch64_rshrn2<mode>): Generate rounding halving add when appropriate.
	* config/aarch64/aarch64.c (aarch64_gen_shareable_zero): New.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/advsimd-intrinsics/shrn-1.c: New test.
	* gcc.target/aarch64/advsimd-intrinsics/shrn-2.c: New test.
	* gcc.target/aarch64/advsimd-intrinsics/shrn-3.c: New test.
	* gcc.target/aarch64/advsimd-intrinsics/shrn-4.c: New test.
2021-12-02 14:39:43 +00:00
Richard Sandiford e32b9eb32d vect: Add support for fmax and fmin reductions
This patch adds support for reductions involving calls to fmax*()
and fmin*(), without the -ffast-math flags that allow them to be
converted to MAX_EXPR and MIN_EXPR.

gcc/
	* doc/md.texi (reduc_fmin_scal_@var{m}): Document.
	(reduc_fmax_scal_@var{m}): Likewise.
	* optabs.def (reduc_fmax_scal_optab): New optab.
	(reduc_fmin_scal_optab): Likewise
	* internal-fn.def (REDUC_FMAX, REDUC_FMIN): New functions.
	* tree-vect-loop.c (reduction_fn_for_scalar_code): Handle
	CASE_CFN_FMAX and CASE_CFN_FMIN.
	(neutral_op_for_reduction): Likewise.
	(needs_fold_left_reduction_p): Likewise.
	* config/aarch64/iterators.md (FMAXMINV): New iterator.
	(fmaxmin): Handle UNSPEC_FMAXNMV and UNSPEC_FMINNMV.
	* config/aarch64/aarch64-simd.md (reduc_<optab>_scal_<mode>): Fix
	unspec mode.
	(reduc_<fmaxmin>_scal_<mode>): New pattern.
	* config/aarch64/aarch64-sve.md (reduc_<fmaxmin>_scal_<mode>):
	Likewise.

gcc/testsuite/
	* gcc.dg/vect/vect-fmax-1.c: New test.
	* gcc.dg/vect/vect-fmax-2.c: Likewise.
	* gcc.dg/vect/vect-fmax-3.c: Likewise.
	* gcc.dg/vect/vect-fmin-1.c: New test.
	* gcc.dg/vect/vect-fmin-2.c: Likewise.
	* gcc.dg/vect/vect-fmin-3.c: Likewise.
	* gcc.target/aarch64/fmaxnm_1.c: Likewise.
	* gcc.target/aarch64/fmaxnm_2.c: Likewise.
	* gcc.target/aarch64/fminnm_1.c: Likewise.
	* gcc.target/aarch64/fminnm_2.c: Likewise.
	* gcc.target/aarch64/sve/fmaxnm_2.c: Likewise.
	* gcc.target/aarch64/sve/fmaxnm_3.c: Likewise.
	* gcc.target/aarch64/sve/fminnm_2.c: Likewise.
	* gcc.target/aarch64/sve/fminnm_3.c: Likewise.
2021-11-30 09:52:25 +00:00
Andrew Pinski c744ae0897 [COMMITTED] aarch64: [PR103170] Fix aarch64_simd_dup<mode>
The problem here is that aarch64_simd_dup<mode> uses
the vw iterator rather than the vwcore iterator.  This causes
problems for the V4SF and V2DF modes.  I changed both
aarch64_simd_dup<mode> patterns to be consistent.

Committed as obvious after a bootstrap/test on aarch64-linux-gnu.

	PR target/103170

gcc/ChangeLog:

	* config/aarch64/aarch64-simd.md (aarch64_simd_dup<mode>):
	Use vwcore iterator for the r constraint output string.

gcc/testsuite/ChangeLog:

	* gcc.c-torture/compile/vector-dup-1.c: New test.
2021-11-10 22:06:23 +00:00
Tamar Christina 5ba247ade1 AArch64: Remove shuffle pattern for rounding variant.
This removes the patterns that optimized the rounding shift and narrow.
The optimization is valid only for the truncating shift and narrow; for the
rounding shift and narrow we need a different pattern, which I will submit
separately.

This wasn't noticed before because the benchmarks did not run conformance
checks as part of the run; we do now, and this passes again.

gcc/ChangeLog:

	* config/aarch64/aarch64-simd.md (*aarch64_topbits_shuffle<mode>_le
	,*aarch64_topbits_shuffle<mode>_be): Remove.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/shrn-combine-8.c: Update.
	* gcc.target/aarch64/shrn-combine-9.c: Update.
2021-11-10 15:10:09 +00:00
Richard Sandiford 6d331688fc aarch64: Tweak FMAX/FMIN iterators
There was some duplication between the maxmin_uns (uns for unspec
rather than unsigned) int attribute and the optab int attribute.
The difficulty for FMAXNM and FMINNM is that the instructions
really correspond to two things: the smax/smin optabs for floats
(used only for fast-math-like flags) and the fmax/fmin optabs
(used for built-in functions).  The optab attribute was
consistently for the former but maxmin_uns had a mixture of both.

This patch renames maxmin_uns to fmaxmin and only uses it
for the fmax and fmin optabs.  The reductions that previously
used the maxmin_uns attribute now use the optab attribute instead.

FMAX and FMIN are awkward in that they don't correspond to any
optab.  It's nevertheless useful to define them alongside the
“real” optabs.  Previously they were known as “smax_nan” and
“smin_nan”, but the problem with those names is that smax and
smin are only used for floats if NaNs don't matter.  This patch
therefore uses fmax_nan and fmin_nan instead.

There is still some inconsistency, in that the optab attribute
handles UNSPEC_COND_FMAX but the fmaxmin attribute handles
UNSPEC_FMAX.  This is because the SVE FP instructions, being
predicated, have to use unspecs in cases where the Advanced
SIMD ones could use rtl codes.

At least there are no duplicate entries though, so this seemed
like the best compromise for now.

gcc/
	* config/aarch64/iterators.md (optab): Use fmax_nan instead of
	smax_nan and fmin_nan instead of smin_nan.
	(maxmin_uns): Rename to...
	(fmaxmin): ...this and make the same changes.  Remove entries
	unrelated to fmax* and fmin*.
	* config/aarch64/aarch64.md (<maxmin_uns><mode>3): Rename to...
	(<fmaxmin><mode>3): ...this.
	* config/aarch64/aarch64-simd.md (aarch64_<maxmin_uns>p<mode>):
	Rename to...
	(aarch64_<optab>p<mode>): ...this.
	(<maxmin_uns><mode>3): Rename to...
	(<fmaxmin><mode>3): ...this.
	(reduc_<maxmin_uns>_scal_<mode>): Rename to...
	(reduc_<optab>_scal_<mode>): ...this and update gen* call.
	(aarch64_reduc_<maxmin_uns>_internal<mode>): Rename to...
	(aarch64_reduc_<optab>_internal<mode>): ...this.
	(aarch64_reduc_<maxmin_uns>_internalv2si): Rename to...
	(aarch64_reduc_<optab>_internalv2si): ...this.
	* config/aarch64/aarch64-sve.md (<maxmin_uns><mode>3): Rename to...
	(<fmaxmin><mode>3): ...this.
	* config/aarch64/aarch64-simd-builtins.def (smax_nan, smin_nan)
	Rename to...
	(fmax_nan, fmin_nan): ...this.
	* config/aarch64/arm_neon.h (vmax_f32, vmax_f64, vmaxq_f32, vmaxq_f64)
	(vmin_f32, vmin_f64, vminq_f32, vminq_f64, vmax_f16, vmaxq_f16)
	(vmin_f16, vminq_f16): Update accordingly.
2021-11-10 12:38:43 +00:00
Jonathan Wright 66f206b853 aarch64: Add machine modes for Neon vector-tuple types
Until now, GCC has used large integer machine modes (OI, CI and XI)
to model Neon vector-tuple types. This is suboptimal for many
reasons, the most notable of which are:

 1) Large integer modes are opaque and modifying one vector in the
    tuple requires a lot of inefficient set/get gymnastics. The
    result is a lot of superfluous move instructions.
 2) Large integer modes do not map well to types that are tuples of
    64-bit vectors - we need additional zero-padding which again
    results in superfluous move instructions.

This patch adds new machine modes that better model the C-level Neon
vector-tuple types. The approach is somewhat similar to that already
used for SVE vector-tuple types.

All of the AArch64 backend patterns and builtins that manipulate Neon
vector tuples are updated to use the new machine modes. This has the
effect of significantly reducing the amount of boiler-plate code in
the arm_neon.h header.

While this patch increases the quality of code generated in many
instances, there is still room for significant improvement - which
will be attempted in subsequent patches.
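
As a hypothetical illustration of the C-level types involved: a
float32x4x2_t tuple loaded with LD2, one vector of the tuple modified, and
the tuple stored back with ST2.  With the new modes this no longer has to go
through an opaque OI integer mode.

#include <arm_neon.h>

void
scale_even_lanes (float *p, float32x4_t k)
{
  /* Deinterleaving load into a two-vector tuple.  */
  float32x4x2_t t = vld2q_f32 (p);
  /* Modify only the first vector of the tuple.  */
  t.val[0] = vmulq_f32 (t.val[0], k);
  /* Interleaving store of the tuple.  */
  vst2q_f32 (p, t);
}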

gcc/ChangeLog:

2021-08-09  Jonathan Wright  <jonathan.wright@arm.com>
	    Richard Sandiford  <richard.sandiford@arm.com>

	* config/aarch64/aarch64-builtins.c (v2x8qi_UP): Define.
	(v2x4hi_UP): Likewise.
	(v2x4hf_UP): Likewise.
	(v2x4bf_UP): Likewise.
	(v2x2si_UP): Likewise.
	(v2x2sf_UP): Likewise.
	(v2x1di_UP): Likewise.
	(v2x1df_UP): Likewise.
	(v2x16qi_UP): Likewise.
	(v2x8hi_UP): Likewise.
	(v2x8hf_UP): Likewise.
	(v2x8bf_UP): Likewise.
	(v2x4si_UP): Likewise.
	(v2x4sf_UP): Likewise.
	(v2x2di_UP): Likewise.
	(v2x2df_UP): Likewise.
	(v3x8qi_UP): Likewise.
	(v3x4hi_UP): Likewise.
	(v3x4hf_UP): Likewise.
	(v3x4bf_UP): Likewise.
	(v3x2si_UP): Likewise.
	(v3x2sf_UP): Likewise.
	(v3x1di_UP): Likewise.
	(v3x1df_UP): Likewise.
	(v3x16qi_UP): Likewise.
	(v3x8hi_UP): Likewise.
	(v3x8hf_UP): Likewise.
	(v3x8bf_UP): Likewise.
	(v3x4si_UP): Likewise.
	(v3x4sf_UP): Likewise.
	(v3x2di_UP): Likewise.
	(v3x2df_UP): Likewise.
	(v4x8qi_UP): Likewise.
	(v4x4hi_UP): Likewise.
	(v4x4hf_UP): Likewise.
	(v4x4bf_UP): Likewise.
	(v4x2si_UP): Likewise.
	(v4x2sf_UP): Likewise.
	(v4x1di_UP): Likewise.
	(v4x1df_UP): Likewise.
	(v4x16qi_UP): Likewise.
	(v4x8hi_UP): Likewise.
	(v4x8hf_UP): Likewise.
	(v4x8bf_UP): Likewise.
	(v4x4si_UP): Likewise.
	(v4x4sf_UP): Likewise.
	(v4x2di_UP): Likewise.
	(v4x2df_UP): Likewise.
	(TYPES_GETREGP): Delete.
	(TYPES_SETREGP): Likewise.
	(TYPES_LOADSTRUCT_U): Define.
	(TYPES_LOADSTRUCT_P): Likewise.
	(TYPES_LOADSTRUCT_LANE_U): Likewise.
	(TYPES_LOADSTRUCT_LANE_P): Likewise.
	(TYPES_STORE1P): Move for consistency.
	(TYPES_STORESTRUCT_U): Define.
	(TYPES_STORESTRUCT_P): Likewise.
	(TYPES_STORESTRUCT_LANE_U): Likewise.
	(TYPES_STORESTRUCT_LANE_P): Likewise.
	(aarch64_simd_tuple_types): Define.
	(aarch64_lookup_simd_builtin_type): Handle tuple type lookup.
	(aarch64_init_simd_builtin_functions): Update frontend lookup
	for builtin functions after handling arm_neon.h pragma.
	(register_tuple_type): Manually set modes of single-integer
	tuple types. Record tuple types.
	* config/aarch64/aarch64-modes.def
	(ADV_SIMD_D_REG_STRUCT_MODES): Define D-register tuple modes.
	(ADV_SIMD_Q_REG_STRUCT_MODES): Define Q-register tuple modes.
	(SVE_MODES): Give single-vector modes priority over vector-
	tuple modes.
	(VECTOR_MODES_WITH_PREFIX): Set partial-vector mode order to
	be after all single-vector modes.
	* config/aarch64/aarch64-simd-builtins.def: Update builtin
	generator macros to reflect modifications to the backend
	patterns.
	* config/aarch64/aarch64-simd.md (aarch64_simd_ld2<mode>):
	Use vector-tuple mode iterator and rename to...
	(aarch64_simd_ld2<vstruct_elt>): This.
	(aarch64_simd_ld2r<mode>): Use vector-tuple mode iterator and
	rename to...
	(aarch64_simd_ld2r<vstruct_elt>): This.
	(aarch64_vec_load_lanesoi_lane<mode>): Use vector-tuple mode
	iterator and rename to...
	(aarch64_vec_load_lanes<mode>_lane<vstruct_elt>): This.
	(vec_load_lanesoi<mode>): Use vector-tuple mode iterator and
	rename to...
	(vec_load_lanes<mode><vstruct_elt>): This.
	(aarch64_simd_st2<mode>): Use vector-tuple mode iterator and
	rename to...
	(aarch64_simd_st2<vstruct_elt>): This.
	(aarch64_vec_store_lanesoi_lane<mode>): Use vector-tuple mode
	iterator and rename to...
	(aarch64_vec_store_lanes<mode>_lane<vstruct_elt>): This.
	(vec_store_lanesoi<mode>): Use vector-tuple mode iterator and
	rename to...
	(vec_store_lanes<mode><vstruct_elt>): This.
	(aarch64_simd_ld3<mode>): Use vector-tuple mode iterator and
	rename to...
	(aarch64_simd_ld3<vstruct_elt>): This.
	(aarch64_simd_ld3r<mode>): Use vector-tuple mode iterator and
	rename to...
	(aarch64_simd_ld3r<vstruct_elt>): This.
	(aarch64_vec_load_lanesci_lane<mode>): Use vector-tuple mode
	iterator and rename to...
	(vec_load_lanesci<mode>): This.
	(aarch64_simd_st3<mode>): Use vector-tuple mode iterator and
	rename to...
	(aarch64_simd_st3<vstruct_elt>): This.
	(aarch64_vec_store_lanesci_lane<mode>): Use vector-tuple mode
	iterator and rename to...
	(vec_store_lanesci<mode>): This.
	(aarch64_simd_ld4<mode>): Use vector-tuple mode iterator and
	rename to...
	(aarch64_simd_ld4<vstruct_elt>): This.
	(aarch64_simd_ld4r<mode>): Use vector-tuple mode iterator and
	rename to...
	(aarch64_simd_ld4r<vstruct_elt>): This.
	(aarch64_vec_load_lanesxi_lane<mode>): Use vector-tuple mode
	iterator and rename to...
	(vec_load_lanesxi<mode>): This.
	(aarch64_simd_st4<mode>): Use vector-tuple mode iterator and
	rename to...
	(aarch64_simd_st4<vstruct_elt>): This.
	(aarch64_vec_store_lanesxi_lane<mode>): Use vector-tuple mode
	iterator and rename to...
	(vec_store_lanesxi<mode>): This.
	(mov<mode>): Define for Neon vector-tuple modes.
	(aarch64_ld1x3<VALLDIF:mode>): Use vector-tuple mode iterator
	and rename to...
	(aarch64_ld1x3<vstruct_elt>): This.
	(aarch64_ld1_x3_<mode>): Use vector-tuple mode iterator and
	rename to...
	(aarch64_ld1_x3_<vstruct_elt>): This.
	(aarch64_ld1x4<VALLDIF:mode>): Use vector-tuple mode iterator
	and rename to...
	(aarch64_ld1x4<vstruct_elt>): This.
	(aarch64_ld1_x4_<mode>): Use vector-tuple mode iterator and
	rename to...
	(aarch64_ld1_x4_<vstruct_elt>): This.
	(aarch64_st1x2<VALLDIF:mode>): Use vector-tuple mode iterator
	and rename to...
	(aarch64_st1x2<vstruct_elt>): This.
	(aarch64_st1_x2_<mode>): Use vector-tuple mode iterator and
	rename to...
	(aarch64_st1_x2_<vstruct_elt>): This.
	(aarch64_st1x3<VALLDIF:mode>): Use vector-tuple mode iterator
	and rename to...
	(aarch64_st1x3<vstruct_elt>): This.
	(aarch64_st1_x3_<mode>): Use vector-tuple mode iterator and
	rename to...
	(aarch64_st1_x3_<vstruct_elt>): This.
	(aarch64_st1x4<VALLDIF:mode>): Use vector-tuple mode iterator
	and rename to...
	(aarch64_st1x4<vstruct_elt>): This.
	(aarch64_st1_x4_<mode>): Use vector-tuple mode iterator and
	rename to...
	(aarch64_st1_x4_<vstruct_elt>): This.
	(*aarch64_mov<mode>): Define for vector-tuple modes.
	(*aarch64_be_mov<mode>): Likewise.
	(aarch64_ld<VSTRUCT:nregs>r<VALLDIF:mode>): Use vector-tuple
	mode iterator and rename to...
	(aarch64_ld<nregs>r<vstruct_elt>): This.
	(aarch64_ld2<mode>_dreg): Use vector-tuple mode iterator and
	rename to...
	(aarch64_ld2<vstruct_elt>_dreg): This.
	(aarch64_ld3<mode>_dreg): Use vector-tuple mode iterator and
	rename to...
	(aarch64_ld3<vstruct_elt>_dreg): This.
	(aarch64_ld4<mode>_dreg): Use vector-tuple mode iterator and
	rename to...
	(aarch64_ld4<vstruct_elt>_dreg): This.
	(aarch64_ld<VSTRUCT:nregs><VDC:mode>): Use vector-tuple mode
	iterator and rename to...
	(aarch64_ld<nregs><vstruct_elt>): Use vector-tuple mode
	iterator and rename to...
	(aarch64_ld<VSTRUCT:nregs><VQ:mode>): Use vector-tuple mode
	(aarch64_ld1x2<VQ:mode>): Delete.
	(aarch64_ld1x2<VDC:mode>): Use vector-tuple mode iterator and
	rename to...
	(aarch64_ld1x2<vstruct_elt>): This.
	(aarch64_ld<VSTRUCT:nregs>_lane<VALLDIF:mode>): Use vector-
	tuple mode iterator and rename to...
	(aarch64_ld<nregs>_lane<vstruct_elt>): This.
	(aarch64_get_dreg<VSTRUCT:mode><VDC:mode>): Delete.
	(aarch64_get_qreg<VSTRUCT:mode><VQ:mode>): Likewise.
	(aarch64_st2<mode>_dreg): Use vector-tuple mode iterator and
	rename to...
	(aarch64_st2<vstruct_elt>_dreg): This.
	(aarch64_st3<mode>_dreg): Use vector-tuple mode iterator and
	rename to...
	(aarch64_st3<vstruct_elt>_dreg): This.
	(aarch64_st4<mode>_dreg): Use vector-tuple mode iterator and
	rename to...
	(aarch64_st4<vstruct_elt>_dreg): This.
	(aarch64_st<VSTRUCT:nregs><VDC:mode>): Use vector-tuple mode
	iterator and rename to...
	(aarch64_st<nregs><vstruct_elt>): This.
	(aarch64_st<VSTRUCT:nregs><VQ:mode>): Use vector-tuple mode
	iterator and rename to aarch64_st<nregs><vstruct_elt>.
	(aarch64_st<VSTRUCT:nregs>_lane<VALLDIF:mode>): Use vector-
	tuple mode iterator and rename to...
	(aarch64_st<nregs>_lane<vstruct_elt>): This.
	(aarch64_set_qreg<VSTRUCT:mode><VQ:mode>): Delete.
	(aarch64_simd_ld1<mode>_x2): Use vector-tuple mode iterator
	and rename to...
	(aarch64_simd_ld1<vstruct_elt>_x2): This.
	* config/aarch64/aarch64.c (aarch64_advsimd_struct_mode_p):
	Refactor to include new vector-tuple modes.
	(aarch64_classify_vector_mode): Add cases for new vector-
	tuple modes.
	(aarch64_advsimd_partial_struct_mode_p): Define.
	(aarch64_advsimd_full_struct_mode_p): Likewise.
	(aarch64_advsimd_vector_array_mode): Likewise.
	(aarch64_sve_data_mode): Change location in file.
	(aarch64_array_mode): Handle case of Neon vector-tuple modes.
	(aarch64_hard_regno_nregs): Handle case of partial Neon
	vector structures.
	(aarch64_classify_address): Refactor to include handling of
	Neon vector-tuple modes.
	(aarch64_print_operand): Print "d" for "%R" for a partial
	Neon vector structure.
	(aarch64_expand_vec_perm_1): Use new vector-tuple mode.
	(aarch64_modes_tieable_p): Prevent tieing Neon partial struct
	modes with scalar machines modes larger than 8 bytes.
	(aarch64_can_change_mode_class): Don't allow changes between
	partial and full Neon vector-structure modes.
	* config/aarch64/arm_neon.h (vst2_lane_f16): Use updated
	builtin and remove boiler-plate code for opaque mode.
	(vst2_lane_f32): Likewise.
	(vst2_lane_f64): Likewise.
	(vst2_lane_p8): Likewise.
	(vst2_lane_p16): Likewise.
	(vst2_lane_p64): Likewise.
	(vst2_lane_s8): Likewise.
	(vst2_lane_s16): Likewise.
	(vst2_lane_s32): Likewise.
	(vst2_lane_s64): Likewise.
	(vst2_lane_u8): Likewise.
	(vst2_lane_u16): Likewise.
	(vst2_lane_u32): Likewise.
	(vst2_lane_u64): Likewise.
	(vst2q_lane_f16): Likewise.
	(vst2q_lane_f32): Likewise.
	(vst2q_lane_f64): Likewise.
	(vst2q_lane_p8): Likewise.
	(vst2q_lane_p16): Likewise.
	(vst2q_lane_p64): Likewise.
	(vst2q_lane_s8): Likewise.
	(vst2q_lane_s16): Likewise.
	(vst2q_lane_s32): Likewise.
	(vst2q_lane_s64): Likewise.
	(vst2q_lane_u8): Likewise.
	(vst2q_lane_u16): Likewise.
	(vst2q_lane_u32): Likewise.
	(vst2q_lane_u64): Likewise.
	(vst3_lane_f16): Likewise.
	(vst3_lane_f32): Likewise.
	(vst3_lane_f64): Likewise.
	(vst3_lane_p8): Likewise.
	(vst3_lane_p16): Likewise.
	(vst3_lane_p64): Likewise.
	(vst3_lane_s8): Likewise.
	(vst3_lane_s16): Likewise.
	(vst3_lane_s32): Likewise.
	(vst3_lane_s64): Likewise.
	(vst3_lane_u8): Likewise.
	(vst3_lane_u16): Likewise.
	(vst3_lane_u32): Likewise.
	(vst3_lane_u64): Likewise.
	(vst3q_lane_f16): Likewise.
	(vst3q_lane_f32): Likewise.
	(vst3q_lane_f64): Likewise.
	(vst3q_lane_p8): Likewise.
	(vst3q_lane_p16): Likewise.
	(vst3q_lane_p64): Likewise.
	(vst3q_lane_s8): Likewise.
	(vst3q_lane_s16): Likewise.
	(vst3q_lane_s32): Likewise.
	(vst3q_lane_s64): Likewise.
	(vst3q_lane_u8): Likewise.
	(vst3q_lane_u16): Likewise.
	(vst3q_lane_u32): Likewise.
	(vst3q_lane_u64): Likewise.
	(vst4_lane_f16): Likewise.
	(vst4_lane_f32): Likewise.
	(vst4_lane_f64): Likewise.
	(vst4_lane_p8): Likewise.
	(vst4_lane_p16): Likewise.
	(vst4_lane_p64): Likewise.
	(vst4_lane_s8): Likewise.
	(vst4_lane_s16): Likewise.
	(vst4_lane_s32): Likewise.
	(vst4_lane_s64): Likewise.
	(vst4_lane_u8): Likewise.
	(vst4_lane_u16): Likewise.
	(vst4_lane_u32): Likewise.
	(vst4_lane_u64): Likewise.
	(vst4q_lane_f16): Likewise.
	(vst4q_lane_f32): Likewise.
	(vst4q_lane_f64): Likewise.
	(vst4q_lane_p8): Likewise.
	(vst4q_lane_p16): Likewise.
	(vst4q_lane_p64): Likewise.
	(vst4q_lane_s8): Likewise.
	(vst4q_lane_s16): Likewise.
	(vst4q_lane_s32): Likewise.
	(vst4q_lane_s64): Likewise.
	(vst4q_lane_u8): Likewise.
	(vst4q_lane_u16): Likewise.
	(vst4q_lane_u32): Likewise.
	(vst4q_lane_u64): Likewise.
	(vtbl3_s8): Likewise.
	(vtbl3_u8): Likewise.
	(vtbl3_p8): Likewise.
	(vtbl4_s8): Likewise.
	(vtbl4_u8): Likewise.
	(vtbl4_p8): Likewise.
	(vld1_u8_x3): Likewise.
	(vld1_s8_x3): Likewise.
	(vld1_u16_x3): Likewise.
	(vld1_s16_x3): Likewise.
	(vld1_u32_x3): Likewise.
	(vld1_s32_x3): Likewise.
	(vld1_u64_x3): Likewise.
	(vld1_s64_x3): Likewise.
	(vld1_f16_x3): Likewise.
	(vld1_f32_x3): Likewise.
	(vld1_f64_x3): Likewise.
	(vld1_p8_x3): Likewise.
	(vld1_p16_x3): Likewise.
	(vld1_p64_x3): Likewise.
	(vld1q_u8_x3): Likewise.
	(vld1q_s8_x3): Likewise.
	(vld1q_u16_x3): Likewise.
	(vld1q_s16_x3): Likewise.
	(vld1q_u32_x3): Likewise.
	(vld1q_s32_x3): Likewise.
	(vld1q_u64_x3): Likewise.
	(vld1q_s64_x3): Likewise.
	(vld1q_f16_x3): Likewise.
	(vld1q_f32_x3): Likewise.
	(vld1q_f64_x3): Likewise.
	(vld1q_p8_x3): Likewise.
	(vld1q_p16_x3): Likewise.
	(vld1q_p64_x3): Likewise.
	(vld1_u8_x2): Likewise.
	(vld1_s8_x2): Likewise.
	(vld1_u16_x2): Likewise.
	(vld1_s16_x2): Likewise.
	(vld1_u32_x2): Likewise.
	(vld1_s32_x2): Likewise.
	(vld1_u64_x2): Likewise.
	(vld1_s64_x2): Likewise.
	(vld1_f16_x2): Likewise.
	(vld1_f32_x2): Likewise.
	(vld1_f64_x2): Likewise.
	(vld1_p8_x2): Likewise.
	(vld1_p16_x2): Likewise.
	(vld1_p64_x2): Likewise.
	(vld1q_u8_x2): Likewise.
	(vld1q_s8_x2): Likewise.
	(vld1q_u16_x2): Likewise.
	(vld1q_s16_x2): Likewise.
	(vld1q_u32_x2): Likewise.
	(vld1q_s32_x2): Likewise.
	(vld1q_u64_x2): Likewise.
	(vld1q_s64_x2): Likewise.
	(vld1q_f16_x2): Likewise.
	(vld1q_f32_x2): Likewise.
	(vld1q_f64_x2): Likewise.
	(vld1q_p8_x2): Likewise.
	(vld1q_p16_x2): Likewise.
	(vld1q_p64_x2): Likewise.
	(vld1_s8_x4): Likewise.
	(vld1q_s8_x4): Likewise.
	(vld1_s16_x4): Likewise.
	(vld1q_s16_x4): Likewise.
	(vld1_s32_x4): Likewise.
	(vld1q_s32_x4): Likewise.
	(vld1_u8_x4): Likewise.
	(vld1q_u8_x4): Likewise.
	(vld1_u16_x4): Likewise.
	(vld1q_u16_x4): Likewise.
	(vld1_u32_x4): Likewise.
	(vld1q_u32_x4): Likewise.
	(vld1_f16_x4): Likewise.
	(vld1q_f16_x4): Likewise.
	(vld1_f32_x4): Likewise.
	(vld1q_f32_x4): Likewise.
	(vld1_p8_x4): Likewise.
	(vld1q_p8_x4): Likewise.
	(vld1_p16_x4): Likewise.
	(vld1q_p16_x4): Likewise.
	(vld1_s64_x4): Likewise.
	(vld1_u64_x4): Likewise.
	(vld1_p64_x4): Likewise.
	(vld1q_s64_x4): Likewise.
	(vld1q_u64_x4): Likewise.
	(vld1q_p64_x4): Likewise.
	(vld1_f64_x4): Likewise.
	(vld1q_f64_x4): Likewise.
	(vld2_s64): Likewise.
	(vld2_u64): Likewise.
	(vld2_f64): Likewise.
	(vld2_s8): Likewise.
	(vld2_p8): Likewise.
	(vld2_p64): Likewise.
	(vld2_s16): Likewise.
	(vld2_p16): Likewise.
	(vld2_s32): Likewise.
	(vld2_u8): Likewise.
	(vld2_u16): Likewise.
	(vld2_u32): Likewise.
	(vld2_f16): Likewise.
	(vld2_f32): Likewise.
	(vld2q_s8): Likewise.
	(vld2q_p8): Likewise.
	(vld2q_s16): Likewise.
	(vld2q_p16): Likewise.
	(vld2q_p64): Likewise.
	(vld2q_s32): Likewise.
	(vld2q_s64): Likewise.
	(vld2q_u8): Likewise.
	(vld2q_u16): Likewise.
	(vld2q_u32): Likewise.
	(vld2q_u64): Likewise.
	(vld2q_f16): Likewise.
	(vld2q_f32): Likewise.
	(vld2q_f64): Likewise.
	(vld3_s64): Likewise.
	(vld3_u64): Likewise.
	(vld3_f64): Likewise.
	(vld3_s8): Likewise.
	(vld3_p8): Likewise.
	(vld3_s16): Likewise.
	(vld3_p16): Likewise.
	(vld3_s32): Likewise.
	(vld3_u8): Likewise.
	(vld3_u16): Likewise.
	(vld3_u32): Likewise.
	(vld3_f16): Likewise.
	(vld3_f32): Likewise.
	(vld3_p64): Likewise.
	(vld3q_s8): Likewise.
	(vld3q_p8): Likewise.
	(vld3q_s16): Likewise.
	(vld3q_p16): Likewise.
	(vld3q_s32): Likewise.
	(vld3q_s64): Likewise.
	(vld3q_u8): Likewise.
	(vld3q_u16): Likewise.
	(vld3q_u32): Likewise.
	(vld3q_u64): Likewise.
	(vld3q_f16): Likewise.
	(vld3q_f32): Likewise.
	(vld3q_f64): Likewise.
	(vld3q_p64): Likewise.
	(vld4_s64): Likewise.
	(vld4_u64): Likewise.
	(vld4_f64): Likewise.
	(vld4_s8): Likewise.
	(vld4_p8): Likewise.
	(vld4_s16): Likewise.
	(vld4_p16): Likewise.
	(vld4_s32): Likewise.
	(vld4_u8): Likewise.
	(vld4_u16): Likewise.
	(vld4_u32): Likewise.
	(vld4_f16): Likewise.
	(vld4_f32): Likewise.
	(vld4_p64): Likewise.
	(vld4q_s8): Likewise.
	(vld4q_p8): Likewise.
	(vld4q_s16): Likewise.
	(vld4q_p16): Likewise.
	(vld4q_s32): Likewise.
	(vld4q_s64): Likewise.
	(vld4q_u8): Likewise.
	(vld4q_u16): Likewise.
	(vld4q_u32): Likewise.
	(vld4q_u64): Likewise.
	(vld4q_f16): Likewise.
	(vld4q_f32): Likewise.
	(vld4q_f64): Likewise.
	(vld4q_p64): Likewise.
	(vld2_dup_s8): Likewise.
	(vld2_dup_s16): Likewise.
	(vld2_dup_s32): Likewise.
	(vld2_dup_f16): Likewise.
	(vld2_dup_f32): Likewise.
	(vld2_dup_f64): Likewise.
	(vld2_dup_u8): Likewise.
	(vld2_dup_u16): Likewise.
	(vld2_dup_u32): Likewise.
	(vld2_dup_p8): Likewise.
	(vld2_dup_p16): Likewise.
	(vld2_dup_p64): Likewise.
	(vld2_dup_s64): Likewise.
	(vld2_dup_u64): Likewise.
	(vld2q_dup_s8): Likewise.
	(vld2q_dup_p8): Likewise.
	(vld2q_dup_s16): Likewise.
	(vld2q_dup_p16): Likewise.
	(vld2q_dup_s32): Likewise.
	(vld2q_dup_s64): Likewise.
	(vld2q_dup_u8): Likewise.
	(vld2q_dup_u16): Likewise.
	(vld2q_dup_u32): Likewise.
	(vld2q_dup_u64): Likewise.
	(vld2q_dup_f16): Likewise.
	(vld2q_dup_f32): Likewise.
	(vld2q_dup_f64): Likewise.
	(vld2q_dup_p64): Likewise.
	(vld3_dup_s64): Likewise.
	(vld3_dup_u64): Likewise.
	(vld3_dup_f64): Likewise.
	(vld3_dup_s8): Likewise.
	(vld3_dup_p8): Likewise.
	(vld3_dup_s16): Likewise.
	(vld3_dup_p16): Likewise.
	(vld3_dup_s32): Likewise.
	(vld3_dup_u8): Likewise.
	(vld3_dup_u16): Likewise.
	(vld3_dup_u32): Likewise.
	(vld3_dup_f16): Likewise.
	(vld3_dup_f32): Likewise.
	(vld3_dup_p64): Likewise.
	(vld3q_dup_s8): Likewise.
	(vld3q_dup_p8): Likewise.
	(vld3q_dup_s16): Likewise.
	(vld3q_dup_p16): Likewise.
	(vld3q_dup_s32): Likewise.
	(vld3q_dup_s64): Likewise.
	(vld3q_dup_u8): Likewise.
	(vld3q_dup_u16): Likewise.
	(vld3q_dup_u32): Likewise.
	(vld3q_dup_u64): Likewise.
	(vld3q_dup_f16): Likewise.
	(vld3q_dup_f32): Likewise.
	(vld3q_dup_f64): Likewise.
	(vld3q_dup_p64): Likewise.
	(vld4_dup_s64): Likewise.
	(vld4_dup_u64): Likewise.
	(vld4_dup_f64): Likewise.
	(vld4_dup_s8): Likewise.
	(vld4_dup_p8): Likewise.
	(vld4_dup_s16): Likewise.
	(vld4_dup_p16): Likewise.
	(vld4_dup_s32): Likewise.
	(vld4_dup_u8): Likewise.
	(vld4_dup_u16): Likewise.
	(vld4_dup_u32): Likewise.
	(vld4_dup_f16): Likewise.
	(vld4_dup_f32): Likewise.
	(vld4_dup_p64): Likewise.
	(vld4q_dup_s8): Likewise.
	(vld4q_dup_p8): Likewise.
	(vld4q_dup_s16): Likewise.
	(vld4q_dup_p16): Likewise.
	(vld4q_dup_s32): Likewise.
	(vld4q_dup_s64): Likewise.
	(vld4q_dup_u8): Likewise.
	(vld4q_dup_u16): Likewise.
	(vld4q_dup_u32): Likewise.
	(vld4q_dup_u64): Likewise.
	(vld4q_dup_f16): Likewise.
	(vld4q_dup_f32): Likewise.
	(vld4q_dup_f64): Likewise.
	(vld4q_dup_p64): Likewise.
	(vld2_lane_u8): Likewise.
	(vld2_lane_u16): Likewise.
	(vld2_lane_u32): Likewise.
	(vld2_lane_u64): Likewise.
	(vld2_lane_s8): Likewise.
	(vld2_lane_s16): Likewise.
	(vld2_lane_s32): Likewise.
	(vld2_lane_s64): Likewise.
	(vld2_lane_f16): Likewise.
	(vld2_lane_f32): Likewise.
	(vld2_lane_f64): Likewise.
	(vld2_lane_p8): Likewise.
	(vld2_lane_p16): Likewise.
	(vld2_lane_p64): Likewise.
	(vld2q_lane_u8): Likewise.
	(vld2q_lane_u16): Likewise.
	(vld2q_lane_u32): Likewise.
	(vld2q_lane_u64): Likewise.
	(vld2q_lane_s8): Likewise.
	(vld2q_lane_s16): Likewise.
	(vld2q_lane_s32): Likewise.
	(vld2q_lane_s64): Likewise.
	(vld2q_lane_f16): Likewise.
	(vld2q_lane_f32): Likewise.
	(vld2q_lane_f64): Likewise.
	(vld2q_lane_p8): Likewise.
	(vld2q_lane_p16): Likewise.
	(vld2q_lane_p64): Likewise.
	(vld3_lane_u8): Likewise.
	(vld3_lane_u16): Likewise.
	(vld3_lane_u32): Likewise.
	(vld3_lane_u64): Likewise.
	(vld3_lane_s8): Likewise.
	(vld3_lane_s16): Likewise.
	(vld3_lane_s32): Likewise.
	(vld3_lane_s64): Likewise.
	(vld3_lane_f16): Likewise.
	(vld3_lane_f32): Likewise.
	(vld3_lane_f64): Likewise.
	(vld3_lane_p8): Likewise.
	(vld3_lane_p16): Likewise.
	(vld3_lane_p64): Likewise.
	(vld3q_lane_u8): Likewise.
	(vld3q_lane_u16): Likewise.
	(vld3q_lane_u32): Likewise.
	(vld3q_lane_u64): Likewise.
	(vld3q_lane_s8): Likewise.
	(vld3q_lane_s16): Likewise.
	(vld3q_lane_s32): Likewise.
	(vld3q_lane_s64): Likewise.
	(vld3q_lane_f16): Likewise.
	(vld3q_lane_f32): Likewise.
	(vld3q_lane_f64): Likewise.
	(vld3q_lane_p8): Likewise.
	(vld3q_lane_p16): Likewise.
	(vld3q_lane_p64): Likewise.
	(vld4_lane_u8): Likewise.
	(vld4_lane_u16): Likewise.
	(vld4_lane_u32): Likewise.
	(vld4_lane_u64): Likewise.
	(vld4_lane_s8): Likewise.
	(vld4_lane_s16): Likewise.
	(vld4_lane_s32): Likewise.
	(vld4_lane_s64): Likewise.
	(vld4_lane_f16): Likewise.
	(vld4_lane_f32): Likewise.
	(vld4_lane_f64): Likewise.
	(vld4_lane_p8): Likewise.
	(vld4_lane_p16): Likewise.
	(vld4_lane_p64): Likewise.
	(vld4q_lane_u8): Likewise.
	(vld4q_lane_u16): Likewise.
	(vld4q_lane_u32): Likewise.
	(vld4q_lane_u64): Likewise.
	(vld4q_lane_s8): Likewise.
	(vld4q_lane_s16): Likewise.
	(vld4q_lane_s32): Likewise.
	(vld4q_lane_s64): Likewise.
	(vld4q_lane_f16): Likewise.
	(vld4q_lane_f32): Likewise.
	(vld4q_lane_f64): Likewise.
	(vld4q_lane_p8): Likewise.
	(vld4q_lane_p16): Likewise.
	(vld4q_lane_p64): Likewise.
	(vqtbl2_s8): Likewise.
	(vqtbl2_u8): Likewise.
	(vqtbl2_p8): Likewise.
	(vqtbl2q_s8): Likewise.
	(vqtbl2q_u8): Likewise.
	(vqtbl2q_p8): Likewise.
	(vqtbl3_s8): Likewise.
	(vqtbl3_u8): Likewise.
	(vqtbl3_p8): Likewise.
	(vqtbl3q_s8): Likewise.
	(vqtbl3q_u8): Likewise.
	(vqtbl3q_p8): Likewise.
	(vqtbl4_s8): Likewise.
	(vqtbl4_u8): Likewise.
	(vqtbl4_p8): Likewise.
	(vqtbl4q_s8): Likewise.
	(vqtbl4q_u8): Likewise.
	(vqtbl4q_p8): Likewise.
	(vqtbx2_s8): Likewise.
	(vqtbx2_u8): Likewise.
	(vqtbx2_p8): Likewise.
	(vqtbx2q_s8): Likewise.
	(vqtbx2q_u8): Likewise.
	(vqtbx2q_p8): Likewise.
	(vqtbx3_s8): Likewise.
	(vqtbx3_u8): Likewise.
	(vqtbx3_p8): Likewise.
	(vqtbx3q_s8): Likewise.
	(vqtbx3q_u8): Likewise.
	(vqtbx3q_p8): Likewise.
	(vqtbx4_s8): Likewise.
	(vqtbx4_u8): Likewise.
	(vqtbx4_p8): Likewise.
	(vqtbx4q_s8): Likewise.
	(vqtbx4q_u8): Likewise.
	(vqtbx4q_p8): Likewise.
	(vst1_s64_x2): Likewise.
	(vst1_u64_x2): Likewise.
	(vst1_f64_x2): Likewise.
	(vst1_s8_x2): Likewise.
	(vst1_p8_x2): Likewise.
	(vst1_s16_x2): Likewise.
	(vst1_p16_x2): Likewise.
	(vst1_s32_x2): Likewise.
	(vst1_u8_x2): Likewise.
	(vst1_u16_x2): Likewise.
	(vst1_u32_x2): Likewise.
	(vst1_f16_x2): Likewise.
	(vst1_f32_x2): Likewise.
	(vst1_p64_x2): Likewise.
	(vst1q_s8_x2): Likewise.
	(vst1q_p8_x2): Likewise.
	(vst1q_s16_x2): Likewise.
	(vst1q_p16_x2): Likewise.
	(vst1q_s32_x2): Likewise.
	(vst1q_s64_x2): Likewise.
	(vst1q_u8_x2): Likewise.
	(vst1q_u16_x2): Likewise.
	(vst1q_u32_x2): Likewise.
	(vst1q_u64_x2): Likewise.
	(vst1q_f16_x2): Likewise.
	(vst1q_f32_x2): Likewise.
	(vst1q_f64_x2): Likewise.
	(vst1q_p64_x2): Likewise.
	(vst1_s64_x3): Likewise.
	(vst1_u64_x3): Likewise.
	(vst1_f64_x3): Likewise.
	(vst1_s8_x3): Likewise.
	(vst1_p8_x3): Likewise.
	(vst1_s16_x3): Likewise.
	(vst1_p16_x3): Likewise.
	(vst1_s32_x3): Likewise.
	(vst1_u8_x3): Likewise.
	(vst1_u16_x3): Likewise.
	(vst1_u32_x3): Likewise.
	(vst1_f16_x3): Likewise.
	(vst1_f32_x3): Likewise.
	(vst1_p64_x3): Likewise.
	(vst1q_s8_x3): Likewise.
	(vst1q_p8_x3): Likewise.
	(vst1q_s16_x3): Likewise.
	(vst1q_p16_x3): Likewise.
	(vst1q_s32_x3): Likewise.
	(vst1q_s64_x3): Likewise.
	(vst1q_u8_x3): Likewise.
	(vst1q_u16_x3): Likewise.
	(vst1q_u32_x3): Likewise.
	(vst1q_u64_x3): Likewise.
	(vst1q_f16_x3): Likewise.
	(vst1q_f32_x3): Likewise.
	(vst1q_f64_x3): Likewise.
	(vst1q_p64_x3): Likewise.
	(vst1_s8_x4): Likewise.
	(vst1q_s8_x4): Likewise.
	(vst1_s16_x4): Likewise.
	(vst1q_s16_x4): Likewise.
	(vst1_s32_x4): Likewise.
	(vst1q_s32_x4): Likewise.
	(vst1_u8_x4): Likewise.
	(vst1q_u8_x4): Likewise.
	(vst1_u16_x4): Likewise.
	(vst1q_u16_x4): Likewise.
	(vst1_u32_x4): Likewise.
	(vst1q_u32_x4): Likewise.
	(vst1_f16_x4): Likewise.
	(vst1q_f16_x4): Likewise.
	(vst1_f32_x4): Likewise.
	(vst1q_f32_x4): Likewise.
	(vst1_p8_x4): Likewise.
	(vst1q_p8_x4): Likewise.
	(vst1_p16_x4): Likewise.
	(vst1q_p16_x4): Likewise.
	(vst1_s64_x4): Likewise.
	(vst1_u64_x4): Likewise.
	(vst1_p64_x4): Likewise.
	(vst1q_s64_x4): Likewise.
	(vst1q_u64_x4): Likewise.
	(vst1q_p64_x4): Likewise.
	(vst1_f64_x4): Likewise.
	(vst1q_f64_x4): Likewise.
	(vst2_s64): Likewise.
	(vst2_u64): Likewise.
	(vst2_f64): Likewise.
	(vst2_s8): Likewise.
	(vst2_p8): Likewise.
	(vst2_s16): Likewise.
	(vst2_p16): Likewise.
	(vst2_s32): Likewise.
	(vst2_u8): Likewise.
	(vst2_u16): Likewise.
	(vst2_u32): Likewise.
	(vst2_f16): Likewise.
	(vst2_f32): Likewise.
	(vst2_p64): Likewise.
	(vst2q_s8): Likewise.
	(vst2q_p8): Likewise.
	(vst2q_s16): Likewise.
	(vst2q_p16): Likewise.
	(vst2q_s32): Likewise.
	(vst2q_s64): Likewise.
	(vst2q_u8): Likewise.
	(vst2q_u16): Likewise.
	(vst2q_u32): Likewise.
	(vst2q_u64): Likewise.
	(vst2q_f16): Likewise.
	(vst2q_f32): Likewise.
	(vst2q_f64): Likewise.
	(vst2q_p64): Likewise.
	(vst3_s64): Likewise.
	(vst3_u64): Likewise.
	(vst3_f64): Likewise.
	(vst3_s8): Likewise.
	(vst3_p8): Likewise.
	(vst3_s16): Likewise.
	(vst3_p16): Likewise.
	(vst3_s32): Likewise.
	(vst3_u8): Likewise.
	(vst3_u16): Likewise.
	(vst3_u32): Likewise.
	(vst3_f16): Likewise.
	(vst3_f32): Likewise.
	(vst3_p64): Likewise.
	(vst3q_s8): Likewise.
	(vst3q_p8): Likewise.
	(vst3q_s16): Likewise.
	(vst3q_p16): Likewise.
	(vst3q_s32): Likewise.
	(vst3q_s64): Likewise.
	(vst3q_u8): Likewise.
	(vst3q_u16): Likewise.
	(vst3q_u32): Likewise.
	(vst3q_u64): Likewise.
	(vst3q_f16): Likewise.
	(vst3q_f32): Likewise.
	(vst3q_f64): Likewise.
	(vst3q_p64): Likewise.
	(vst4_s64): Likewise.
	(vst4_u64): Likewise.
	(vst4_f64): Likewise.
	(vst4_s8): Likewise.
	(vst4_p8): Likewise.
	(vst4_s16): Likewise.
	(vst4_p16): Likewise.
	(vst4_s32): Likewise.
	(vst4_u8): Likewise.
	(vst4_u16): Likewise.
	(vst4_u32): Likewise.
	(vst4_f16): Likewise.
	(vst4_f32): Likewise.
	(vst4_p64): Likewise.
	(vst4q_s8): Likewise.
	(vst4q_p8): Likewise.
	(vst4q_s16): Likewise.
	(vst4q_p16): Likewise.
	(vst4q_s32): Likewise.
	(vst4q_s64): Likewise.
	(vst4q_u8): Likewise.
	(vst4q_u16): Likewise.
	(vst4q_u32): Likewise.
	(vst4q_u64): Likewise.
	(vst4q_f16): Likewise.
	(vst4q_f32): Likewise.
	(vst4q_f64): Likewise.
	(vst4q_p64): Likewise.
	(vtbx4_s8): Likewise.
	(vtbx4_u8): Likewise.
	(vtbx4_p8): Likewise.
	(vld1_bf16_x2): Likewise.
	(vld1q_bf16_x2): Likewise.
	(vld1_bf16_x3): Likewise.
	(vld1q_bf16_x3): Likewise.
	(vld1_bf16_x4): Likewise.
	(vld1q_bf16_x4): Likewise.
	(vld2_bf16): Likewise.
	(vld2q_bf16): Likewise.
	(vld2_dup_bf16): Likewise.
	(vld2q_dup_bf16): Likewise.
	(vld3_bf16): Likewise.
	(vld3q_bf16): Likewise.
	(vld3_dup_bf16): Likewise.
	(vld3q_dup_bf16): Likewise.
	(vld4_bf16): Likewise.
	(vld4q_bf16): Likewise.
	(vld4_dup_bf16): Likewise.
	(vld4q_dup_bf16): Likewise.
	(vst1_bf16_x2): Likewise.
	(vst1q_bf16_x2): Likewise.
	(vst1_bf16_x3): Likewise.
	(vst1q_bf16_x3): Likewise.
	(vst1_bf16_x4): Likewise.
	(vst1q_bf16_x4): Likewise.
	(vst2_bf16): Likewise.
	(vst2q_bf16): Likewise.
	(vst3_bf16): Likewise.
	(vst3q_bf16): Likewise.
	(vst4_bf16): Likewise.
	(vst4q_bf16): Likewise.
	(vld2_lane_bf16): Likewise.
	(vld2q_lane_bf16): Likewise.
	(vld3_lane_bf16): Likewise.
	(vld3q_lane_bf16): Likewise.
	(vld4_lane_bf16): Likewise.
	(vld4q_lane_bf16): Likewise.
	(vst2_lane_bf16): Likewise.
	(vst2q_lane_bf16): Likewise.
	(vst3_lane_bf16): Likewise.
	(vst3q_lane_bf16): Likewise.
	(vst4_lane_bf16): Likewise.
	(vst4q_lane_bf16): Likewise.
	* config/aarch64/geniterators.sh: Modify iterator regex to
	match new vector-tuple modes.
	* config/aarch64/iterators.md (insn_count): Extend mode
	attribute with vector-tuple type information.
	(nregs): Likewise.
	(Vendreg): Likewise.
	(Vetype): Likewise.
	(Vtype): Likewise.
	(VSTRUCT_2D): New mode iterator.
	(VSTRUCT_2DNX): Likewise.
	(VSTRUCT_2DX): Likewise.
	(VSTRUCT_2Q): Likewise.
	(VSTRUCT_2QD): Likewise.
	(VSTRUCT_3D): Likewise.
	(VSTRUCT_3DNX): Likewise.
	(VSTRUCT_3DX): Likewise.
	(VSTRUCT_3Q): Likewise.
	(VSTRUCT_3QD): Likewise.
	(VSTRUCT_4D): Likewise.
	(VSTRUCT_4DNX): Likewise.
	(VSTRUCT_4DX): Likewise.
	(VSTRUCT_4Q): Likewise.
	(VSTRUCT_4QD): Likewise.
	(VSTRUCT_D): Likewise.
	(VSTRUCT_Q): Likewise.
	(VSTRUCT_QD): Likewise.
	(VSTRUCT_ELT): New mode attribute.
	(vstruct_elt): Likewise.
	* genmodes.c (VECTOR_MODE): Add default prefix and order
	parameters.
	(VECTOR_MODE_WITH_PREFIX): Define.
	(make_vector_mode): Add mode prefix and order parameters.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/advsimd-intrinsics/bf16_vldN_lane_2.c:
	Relax incorrect register number requirement.
	* gcc.target/aarch64/sve/pcs/struct_3_256.c: Accept
	equivalent codegen with fmov.
2021-11-04 14:54:36 +00:00
Tamar Christina 1d5c43db79 AArch64: Add better costing for vector constants and operations
This patch adds extended costing to cost the creation and manipulation of
constants.  The default values provided are based on architectural
expectations, and each cost model can be individually tweaked as needed.

The changes in this patch cover:

* Construction of PARALLEL or CONST_VECTOR:
  Adds better costing for vectors of constants, based on the constant being
  created and the instruction that can be used to create it, i.e. a movi is
  cheaper than a literal load, etc. (see the sketch after this list).
* Construction of a vector through a vec_dup.
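
A hypothetical sketch of the distinction being costed: the first constant can
be materialised with a single movi, while the second generally needs a
literal-pool load.

#include <arm_neon.h>

uint32x4_t
cheap_constant (void)
{
  /* All lanes equal: a single MOVI.  */
  return vdupq_n_u32 (1);
}

uint32x4_t
expensive_constant (void)
{
  /* Distinct lanes: typically an LDR from the literal pool.  */
  return (uint32x4_t) { 1, 2, 3, 4 };
}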

gcc/ChangeLog:

	* config/arm/aarch-common-protos.h (struct vector_cost_table): Add
	movi, dup and extract costing fields.
	* config/aarch64/aarch64-cost-tables.h (qdf24xx_extra_costs,
	thunderx_extra_costs, thunderx2t99_extra_costs,
	thunderx3t110_extra_costs, tsv110_extra_costs, a64fx_extra_costs): Use
	them.
	* config/arm/aarch-cost-tables.h (generic_extra_costs,
	cortexa53_extra_costs, cortexa57_extra_costs, cortexa76_extra_costs,
	exynosm1_extra_costs, xgene1_extra_costs): Likewise
	* config/aarch64/aarch64-simd.md (aarch64_simd_dup<mode>): Add r->w dup.
	* config/aarch64/aarch64.c (aarch64_rtx_costs): Add extra costs.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/vect-cse-codegen.c: New test.
2021-11-01 13:49:46 +00:00
Tamar Christina 3db4440d4c AArch64: Combine cmeq 0 + not into cmtst
This turns a bitwise inverse of an equality comparison with 0 into a compare of
bitwise nonzero (cmtst).

We already have one pattern for cmtst; this adds an additional one which does
not require an extra bitwise AND.

i.e.

#include <arm_neon.h>

uint8x8_t bar(int16x8_t abs_row0, int16x8_t row0) {
  uint16x8_t row0_diff =
    vreinterpretq_u16_s16(veorq_s16(abs_row0, vshrq_n_s16(row0, 15)));
  uint8x8_t abs_row0_gt0 =
    vmovn_u16(vcgtq_u16(vreinterpretq_u16_s16(abs_row0), vdupq_n_u16(0)));
  return abs_row0_gt0;
}

now generates:

bar:
        cmtst   v0.8h, v0.8h, v0.8h
        xtn     v0.8b, v0.8h
        ret

instead of:

bar:
        cmeq    v0.8h, v0.8h, #0
        not     v0.16b, v0.16b
        xtn     v0.8b, v0.8h
        ret

gcc/ChangeLog:

	* config/aarch64/aarch64-simd.md (*aarch64_cmtst_same_<mode>): New.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/mvn-cmeq0-1.c: New test.
2021-10-20 17:11:52 +01:00
Tamar Christina 52da40ffe2 AArch64: Add pattern xtn+xtn2 to uzp1
This turns truncate operations on a hi/lo pair into a single permute at half
the bit size of the input, simply ignoring the top bits (which are truncated
out).

i.e.

void d2 (short * restrict a, int *b, int n)
{
    for (int i = 0; i < n; i++)
      a[i] = b[i];
}

now generates:

.L4:
        ldp     q0, q1, [x3]
        add     x3, x3, 32
        uzp1    v0.8h, v0.8h, v1.8h
        str     q0, [x5], 16
        cmp     x4, x3
        bne     .L4

instead of

.L4:
        ldp     q0, q1, [x3]
        add     x3, x3, 32
        xtn     v0.4h, v0.4s
        xtn2    v0.8h, v1.4s
        str     q0, [x5], 16
        cmp     x4, x3
        bne     .L4

gcc/ChangeLog:

	* config/aarch64/aarch64-simd.md (*aarch64_narrow_trunc<mode>): New.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/narrow_high_combine.c: Update case.
	* gcc.target/aarch64/xtn-combine-1.c: New test.
	* gcc.target/aarch64/xtn-combine-2.c: New test.
	* gcc.target/aarch64/xtn-combine-3.c: New test.
	* gcc.target/aarch64/xtn-combine-4.c: New test.
	* gcc.target/aarch64/xtn-combine-5.c: New test.
	* gcc.target/aarch64/xtn-combine-6.c: New test.
2021-10-20 17:10:25 +01:00
Tamar Christina ea464fd2d4 AArch64: Add pattern for sshr to cmlt
This optimizes a signed right shift by BITSIZE-1 into a cmlt operation, which
is preferable because compares generally have a higher throughput than shifts.

On AArch64 the result of the shift would have been either -1 or 0, which is
exactly the result of the compare.

i.e.

void e (int * restrict a, int *b, int n)
{
    for (int i = 0; i < n; i++)
      b[i] = a[i] >> 31;
}

now generates:

.L4:
        ldr     q0, [x0, x3]
        cmlt    v0.4s, v0.4s, #0
        str     q0, [x1, x3]
        add     x3, x3, 16
        cmp     x4, x3
        bne     .L4

instead of:

.L4:
        ldr     q0, [x0, x3]
        sshr    v0.4s, v0.4s, 31
        str     q0, [x1, x3]
        add     x3, x3, 16
        cmp     x4, x3
        bne     .L4

Thanks,
Tamar

gcc/ChangeLog:

	* config/aarch64/aarch64-simd.md (aarch64_simd_ashr<mode>): Add cmp
	case.
	* config/aarch64/constraints.md (D1): New.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/shl-combine-2.c: New test.
	* gcc.target/aarch64/shl-combine-3.c: New test.
	* gcc.target/aarch64/shl-combine-4.c: New test.
	* gcc.target/aarch64/shl-combine-5.c: New test.
2021-10-20 17:09:00 +01:00
Tamar Christina 41812e5e35 AArch64: Add combine patterns for narrowing shift of half top bits (shuffle)
When doing a (narrowing) right shift by half the width of the original type,
we are essentially shuffling the top bits of the input down.

If we have a hi/lo pair we can just use a single shuffle instead of needing two
shifts.

i.e.

typedef short int16_t;
typedef unsigned short uint16_t;

void foo (uint16_t * restrict a, int16_t * restrict d, int n)
{
    for( int i = 0; i < n; i++ )
      d[i] = (a[i] * a[i]) >> 16;
}

now generates:

.L4:
        ldr     q0, [x0, x3]
        umull   v1.4s, v0.4h, v0.4h
        umull2  v0.4s, v0.8h, v0.8h
        uzp2    v0.8h, v1.8h, v0.8h
        str     q0, [x1, x3]
        add     x3, x3, 16
        cmp     x4, x3
        bne     .L4

instead of

.L4:
        ldr     q0, [x0, x3]
        umull   v1.4s, v0.4h, v0.4h
        umull2  v0.4s, v0.8h, v0.8h
        sshr    v1.4s, v1.4s, 16
        sshr    v0.4s, v0.4s, 16
        xtn     v1.4h, v1.4s
        xtn2    v1.8h, v0.4s
        str     q1, [x1, x3]
        add     x3, x3, 16
        cmp     x4, x3
        bne     .L4

Thanks,
Tamar

gcc/ChangeLog:

	* config/aarch64/aarch64-simd.md
	(*aarch64_<srn_op>topbits_shuffle<mode>_le): New.
	(*aarch64_topbits_shuffle<mode>_le): New.
	(*aarch64_<srn_op>topbits_shuffle<mode>_be): New.
	(*aarch64_topbits_shuffle<mode>_be): New.
	* config/aarch64/predicates.md
	(aarch64_simd_shift_imm_vec_exact_top): New.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/shrn-combine-10.c: New test.
	* gcc.target/aarch64/shrn-combine-5.c: New test.
	* gcc.target/aarch64/shrn-combine-6.c: New test.
	* gcc.target/aarch64/shrn-combine-7.c: New test.
	* gcc.target/aarch64/shrn-combine-8.c: New test.
	* gcc.target/aarch64/shrn-combine-9.c: New test.
2021-10-20 17:07:54 +01:00
Tamar Christina e33aef11e1 aarch64: Add combine patterns for right shift and narrow
This adds a simple pattern for combining right shifts and narrows into
shifted narrows.

i.e.

typedef short int16_t;
typedef unsigned short uint16_t;

void foo (uint16_t * restrict a, int16_t * restrict d, int n)
{
    for( int i = 0; i < n; i++ )
      d[i] = (a[i] * a[i]) >> 10;
}

now generates:

.L4:
        ldr     q0, [x0, x3]
        umull   v1.4s, v0.4h, v0.4h
        umull2  v0.4s, v0.8h, v0.8h
        shrn    v1.4h, v1.4s, 10
        shrn2   v1.8h, v0.4s, 10
        str     q1, [x1, x3]
        add     x3, x3, 16
        cmp     x4, x3
        bne     .L4

instead of:

.L4:
        ldr     q0, [x0, x3]
        umull   v1.4s, v0.4h, v0.4h
        umull2  v0.4s, v0.8h, v0.8h
        sshr    v1.4s, v1.4s, 10
        sshr    v0.4s, v0.4s, 10
        xtn     v1.4h, v1.4s
        xtn2    v1.8h, v0.4s
        str     q1, [x1, x3]
        add     x3, x3, 16
        cmp     x4, x3
        bne     .L4
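
Per element this is simply a shift followed by a truncation, which shrn
performs in one instruction.  A scalar sketch (not the committed test):

/* Shift the 32-bit product right by 10 and truncate to 16 bits in a
   single step, as shrn does for each lane.  */
unsigned short shift_narrow (unsigned short a)
{
  return (unsigned short) (((unsigned int) a * a) >> 10);
}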

Thanks,
Tamar

gcc/ChangeLog:

	* config/aarch64/aarch64-simd.md (*aarch64_<srn_op>shrn<mode>_vect,
	*aarch64_<srn_op>shrn<mode>2_vect_le,
	*aarch64_<srn_op>shrn<mode>2_vect_be): New.
	* config/aarch64/iterators.md (srn_op): New.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/shrn-combine-1.c: New test.
	* gcc.target/aarch64/shrn-combine-2.c: New test.
	* gcc.target/aarch64/shrn-combine-3.c: New test.
	* gcc.target/aarch64/shrn-combine-4.c: New test.
2021-10-20 17:06:31 +01:00
Tejas Belagod e2e0b85c1e PR101609: Use the correct iterator for AArch64 vector right shift pattern
Loops containing long long shifts fail to vectorize due to the vectorizer
not being able to recognize long long right shifts. This is due to a bug
in the iterator used for the vashr and vlshr patterns in aarch64-simd.md.
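
The affected case is a vector shift whose amounts come from a register
(one amount per lane).  A hypothetical reduction of the failing loop (not
the actual vect-shr-reg.c testcase):

void f (long long *restrict out, long long *restrict a,
        long long *restrict shift, int n)
{
  for (int i = 0; i < n; i++)
    out[i] = a[i] >> shift[i];   /* per-element right shift, via vashr<mode>3 */
}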

2021-08-09  Tejas Belagod  <tejas.belagod@arm.com>

gcc/ChangeLog
	PR target/101609
	* config/aarch64/aarch64-simd.md (vlshr<mode>3, vashr<mode>3): Use
	the right iterator.

gcc/testsuite/ChangeLog
	* gcc.target/aarch64/vect-shr-reg.c: New testcase.
	* gcc.target/aarch64/vect-shr-reg-run.c: Likewise.
2021-08-09 12:54:14 +01:00
Jonathan Wright 3bc9db6a98 simplify-rtx: Push sign/zero-extension inside vec_duplicate
As a general principle, vec_duplicate should be as close to the root
of an expression as possible. Where unary operations have
vec_duplicate as an argument, these operations should be pushed
inside the vec_duplicate.

This patch modifies unary operation simplification to push
sign/zero-extension of a scalar inside vec_duplicate.

This patch also updates all RTL patterns in aarch64-simd.md to use
the new canonical form.

gcc/ChangeLog:

2021-07-19  Jonathan Wright  <jonathan.wright@arm.com>

	* config/aarch64/aarch64-simd.md: Push sign/zero-extension
	inside vec_duplicate for all patterns.
	* simplify-rtx.c (simplify_context::simplify_unary_operation_1):
	Push sign/zero-extension inside vec_duplicate.
2021-07-27 10:42:33 +01:00
Tamar Christina 1ab2270036 AArch64: correct dot-product RTL patterns for aarch64.
The previous fix for this problem was wrong due to a subtle difference between
where NEON expects the RMW values and where the intrinsics expect them.

The insn pattern is modeled after the intrinsics and so needs an expand pattern
for the vectorizer optab to switch the RTL operands.

However, operands[3] is not expected to be written to, so the current pattern is
bogus.

Instead I rewrite the RTL to be in canonical ordering and merge them.
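
For context, the optab in question is what the vectorizer uses for loops of
this shape (an illustrative example, not part of the patch):

int dot (signed char *restrict a, signed char *restrict b, int n)
{
  int res = 0;
  for (int i = 0; i < n; i++)
    res += a[i] * b[i];   /* can vectorize to sdot when +dotprod is available */
  return res;
}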

gcc/ChangeLog:

	* config/aarch64/aarch64-simd-builtins.def (sdot, udot): Rename to..
	(sdot_prod, udot_prod): ... This.
	* config/aarch64/aarch64-simd.md (aarch64_<sur>dot<vsi2qi>): Merged
	into...
	(<sur>dot_prod<vsi2qi>): ... this.
	(aarch64_<sur>dot_lane<vsi2qi>, aarch64_<sur>dot_laneq<vsi2qi>):
	Change operands order.
	(<sur>sadv16qi): Use new operands order.
	* config/aarch64/arm_neon.h (vdot_u32, vdotq_u32, vdot_s32,
	vdotq_s32): Use new RTL ordering.
2021-07-26 10:23:21 +01:00
Tamar Christina 2050ac1a54 AArch64: correct usdot vectorizer and intrinsics optabs
There's a slight mismatch between the vectorizer optabs and the intrinsics
patterns for NEON.  The vectorizer expects operands[3] and operands[0] to be
the same but the aarch64 intrinsics expanders expect operands[0] and
operands[1] to be the same.

This means we need different patterns here.  This adds a separate usdot
vectorizer pattern which just shuffles around the RTL params.

There's also an inconsistency between the usdot and (u|s)dot intrinsics RTL
patterns which is not corrected here.

gcc/ChangeLog:

	* config/aarch64/aarch64-builtins.c (TYPES_TERNOP_SUSS,
	aarch64_types_ternop_suss_qualifiers): New.
	* config/aarch64/aarch64-simd-builtins.def (usdot_prod): Use it.
	* config/aarch64/aarch64-simd.md (usdot_prod<vsi2qi>): Re-organize RTL.
	* config/aarch64/arm_neon.h (vusdot_s32, vusdotq_s32): Use it.
2021-07-26 10:22:23 +01:00
Jonathan Wright b7e450c973 aarch64: Refactor TBL/TBX RTL patterns
Rename two-source-register TBL/TBX RTL patterns so that their names
better reflect what they do, rather than confusing them with tbl3 or
tbx4 patterns. Also use the correct "neon_tbl2" type attribute for
both patterns.

Rename single-source-register TBL/TBX patterns for consistency.
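
The behaviour of the intrinsics themselves is unchanged by the rename; an
illustrative use (not part of the patch):

#include <arm_neon.h>

/* Single-table lookup: each byte of idx selects a byte of table, with
   out-of-range indices producing 0.  Now expands via the renamed
   aarch64_qtbl1 pattern.  */
uint8x16_t permute (uint8x16_t table, uint8x16_t idx)
{
  return vqtbl1q_u8 (table, idx);
}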

gcc/ChangeLog:

2021-07-08  Jonathan Wright  <jonathan.wright@arm.com>

	* config/aarch64/aarch64-simd-builtins.def: Use two variant
	generators for all TBL/TBX intrinsics and rename to
	consistent forms: qtbl[1234] or qtbx[1234].
	* config/aarch64/aarch64-simd.md (aarch64_tbl1<mode>):
	Rename to...
	(aarch64_qtbl1<mode>): This.
	(aarch64_tbx1<mode>): Rename to...
	(aarch64_qtbx1<mode>): This.
	(aarch64_tbl2v16qi): Delete.
	(aarch64_tbl3<mode>): Rename to...
	(aarch64_qtbl2<mode>): This.
	(aarch64_tbx4<mode>): Rename to...
	(aarch64_qtbx2<mode>): This.
	* config/aarch64/aarch64.c (aarch64_expand_vec_perm_1): Use
	renamed qtbl1 and qtbl2 RTL patterns.
	* config/aarch64/arm_neon.h (vqtbl1_p8): Use renamed qtbl1
	RTL pattern.
	(vqtbl1_s8): Likewise.
	(vqtbl1_u8): Likewise.
	(vqtbl1q_p8): Likewise.
	(vqtbl1q_s8): Likewise.
	(vqtbl1q_u8): Likewise.
	(vqtbx1_s8): Use renamed qtbx1 RTL pattern.
	(vqtbx1_u8): Likewise.
	(vqtbx1_p8): Likewise.
	(vqtbx1q_s8): Likewise.
	(vqtbx1q_u8): Likewise.
	(vqtbx1q_p8): Likewise.
	(vtbl1_s8): Use renamed qtbl1 RTL pattern.
	(vtbl1_u8): Likewise.
	(vtbl1_p8): Likewise.
	(vtbl2_s8): Likewise
	(vtbl2_u8): Likewise.
	(vtbl2_p8): Likewise.
	(vtbl3_s8): Use renamed qtbl2 RTL pattern.
	(vtbl3_u8): Likewise.
	(vtbl3_p8): Likewise.
	(vtbl4_s8): Likewise.
	(vtbl4_u8): Likewise.
	(vtbl4_p8): Likewise.
	(vtbx2_s8): Use renamed qtbx2 RTL pattern.
	(vtbx2_u8): Likewise.
	(vtbx2_p8): Likewise.
	(vqtbl2_s8): Use renamed qtbl2 RTL pattern.
	(vqtbl2_u8): Likewise.
	(vqtbl2_p8): Likewise.
	(vqtbl2q_s8): Likewise.
	(vqtbl2q_u8): Likewise.
	(vqtbl2q_p8): Likewise.
	(vqtbx2_s8): Use renamed qtbx2 RTL pattern.
	(vqtbx2_u8): Likewise.
	(vqtbx2_p8): Likewise.
	(vqtbx2q_s8): Likewise.
	(vqtbx2q_u8): Likewise.
	(vqtbx2q_p8): Likewise.
	(vtbx4_s8): Likewise.
	(vtbx4_u8): Likewise.
	(vtbx4_p8): Likewise.
2021-07-20 10:02:41 +01:00
Tamar Christina 5402023f05 Revert "AArch64: Correct dot-product auto-vect optab RTL"
This reverts commit 6d1cdb2782.
2021-07-15 13:16:00 +01:00
Tamar Christina 6d1cdb2782 AArch64: Correct dot-product auto-vect optab RTL
The current RTL for the vectorizer dot-product patterns is incorrect.
Operands[3] isn't an output parameter, so we can't write to it.

This fixes the issue and reduces the amount of RTL.

gcc/ChangeLog:

	* config/aarch64/aarch64-simd-builtins.def (udot, sdot): Rename to...
	(sdot_prod, udot_prod): ...These.
	* config/aarch64/aarch64-simd.md (<sur>dot_prod<vsi2qi>): Remove.
	(aarch64_<sur>dot<vsi2qi>): Rename to...
	(<sur>dot_prod<vsi2qi>): ...This.
	* config/aarch64/arm_neon.h (vdot_u32, vdotq_u32, vdot_s32, vdotq_s32):
	Update builtins.
2021-07-14 15:41:31 +01:00
Tamar Christina 752045ed1e AArch64: Add support for sign differing dot-product usdot for NEON and SVE.
Hi All,

This adds optabs implementing usdot_prod.

The following testcase:

#define N 480
#define SIGNEDNESS_1 unsigned
#define SIGNEDNESS_2 signed
#define SIGNEDNESS_3 signed
#define SIGNEDNESS_4 unsigned

SIGNEDNESS_1 int __attribute__ ((noipa))
f (SIGNEDNESS_1 int res, SIGNEDNESS_3 char *restrict a,
   SIGNEDNESS_4 char *restrict b)
{
  for (__INTPTR_TYPE__ i = 0; i < N; ++i)
    {
      int av = a[i];
      int bv = b[i];
      SIGNEDNESS_2 short mult = av * bv;
      res += mult;
    }
  return res;
}

Generates for NEON

f:
        movi    v0.4s, 0
        mov     x3, 0
        .p2align 3,,7
.L2:
        ldr     q1, [x2, x3]
        ldr     q2, [x1, x3]
        usdot   v0.4s, v1.16b, v2.16b
        add     x3, x3, 16
        cmp     x3, 480
        bne     .L2
        addv    s0, v0.4s
        fmov    w1, s0
        add     w0, w0, w1
        ret

and for SVE

f:
        mov     x3, 0
        cntb    x5
        mov     w4, 480
        mov     z1.b, #0
        whilelo p0.b, wzr, w4
        mov     z3.b, #0
        ptrue   p1.b, all
        .p2align 3,,7
.L2:
        ld1b    z2.b, p0/z, [x1, x3]
        ld1b    z0.b, p0/z, [x2, x3]
        add     x3, x3, x5
        sel     z0.b, p0, z0.b, z3.b
        whilelo p0.b, w3, w4
        usdot   z1.s, z0.b, z2.b
        b.any   .L2
        uaddv   d0, p1, z1.s
        fmov    x1, d0
        add     w0, w0, w1
        ret

instead of

f:
        movi    v0.4s, 0
        mov     x3, 0
        .p2align 3,,7
.L2:
        ldr     q2, [x1, x3]
        ldr     q1, [x2, x3]
        add     x3, x3, 16
        sxtl    v4.8h, v2.8b
        sxtl2   v3.8h, v2.16b
        uxtl    v2.8h, v1.8b
        uxtl2   v1.8h, v1.16b
        mul     v2.8h, v2.8h, v4.8h
        mul     v1.8h, v1.8h, v3.8h
        saddw   v0.4s, v0.4s, v2.4h
        saddw2  v0.4s, v0.4s, v2.8h
        saddw   v0.4s, v0.4s, v1.4h
        saddw2  v0.4s, v0.4s, v1.8h
        cmp     x3, 480
        bne     .L2
        addv    s0, v0.4s
        fmov    w1, s0
        add     w0, w0, w1
        ret

and

f:
        mov     x3, 0
        cnth    x5
        mov     w4, 480
        mov     z1.b, #0
        whilelo p0.h, wzr, w4
        ptrue   p2.b, all
        .p2align 3,,7
.L2:
        ld1sb   z2.h, p0/z, [x1, x3]
        punpklo p1.h, p0.b
        ld1b    z0.h, p0/z, [x2, x3]
        add     x3, x3, x5
        mul     z0.h, p2/m, z0.h, z2.h
        sunpklo z2.s, z0.h
        sunpkhi z0.s, z0.h
        add     z1.s, p1/m, z1.s, z2.s
        punpkhi p1.h, p0.b
        whilelo p0.h, w3, w4
        add     z1.s, p1/m, z1.s, z0.s
        b.any   .L2
        uaddv   d0, p2, z1.s
        fmov    x1, d0
        add     w0, w0, w1
        ret

gcc/ChangeLog:

	* config/aarch64/aarch64-simd.md (aarch64_usdot<vsi2qi>): Rename to...
	(usdot_prod<vsi2qi>): ... This.
	* config/aarch64/aarch64-simd-builtins.def (usdot): Rename to...
	(usdot_prod): ...This.
	* config/aarch64/arm_neon.h (vusdot_s32, vusdotq_s32): Likewise.
	* config/aarch64/aarch64-sve.md (@aarch64_<sur>dot_prod<vsi2qi>):
	Rename to...
	(@<sur>dot_prod<vsi2qi>): ...This.
	* config/aarch64/aarch64-sve-builtins-base.cc
	(svusdot_impl::expand): Use it.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/simd/vusdot-autovec.c: New test.
	* gcc.target/aarch64/sve/vusdot-autovec.c: New test.
2021-07-14 15:19:32 +01:00
Jonathan Wright dbfc149b63 aarch64: Model zero-high-half semantics of ADDHN/SUBHN instructions
Model the zero-high-half semantics of the narrowing arithmetic Neon
instructions in the aarch64_<sur><addsub>hn<mode> RTL pattern.
Modeling these semantics allows for better RTL combinations while
also removing some register allocation issues as the compiler now
knows that the operation is totally destructive.

Add new tests to narrow_zero_high_half.c to verify the benefit of
this change.
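
As an illustration (not part of the patch): these narrowing intrinsics
produce a 64-bit result, and on AArch64 a write to a D register zeroes the
upper 64 bits of the corresponding Q register, which is the behaviour the
RTL now describes.

#include <arm_neon.h>

/* addhn: add the 32-bit lanes and keep the high 16 bits of each sum; the
   64-bit write clears the upper half of the destination register.  */
int16x4_t narrow_add (int32x4_t a, int32x4_t b)
{
  return vaddhn_s32 (a, b);
}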

gcc/ChangeLog:

2021-06-14  Jonathan Wright  <jonathan.wright@arm.com>

	* config/aarch64/aarch64-simd.md (aarch64_<sur><addsub>hn<mode>):
	Change to an expander that emits the correct instruction
	depending on endianness.
	(aarch64_<sur><addsub>hn<mode>_insn_le): Define.
	(aarch64_<sur><addsub>hn<mode>_insn_be): Define.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/narrow_zero_high_half.c: Add new tests.
2021-06-16 14:22:42 +01:00
Jonathan Wright d0889b5d37 aarch64: Model zero-high-half semantics of [SU]QXTN instructions
Split the aarch64_<su>qmovn<mode> pattern into separate scalar and
vector variants. Further split the vector RTL pattern into big/
little endian variants that model the zero-high-half semantics of the
underlying instruction. Modeling these semantics allows for better
RTL combinations while also removing some register allocation issues
as the compiler now knows that the operation is totally destructive.

Add new tests to narrow_zero_high_half.c to verify the benefit of
this change.

gcc/ChangeLog:

2021-06-14  Jonathan Wright  <jonathan.wright@arm.com>

	* config/aarch64/aarch64-simd-builtins.def: Split generator
	for aarch64_<su>qmovn builtins into scalar and vector
	variants.
	* config/aarch64/aarch64-simd.md (aarch64_<su>qmovn<mode>_insn_le):
	Define.
	(aarch64_<su>qmovn<mode>_insn_be): Define.
	(aarch64_<su>qmovn<mode>): Split into scalar and vector
	variants. Change vector variant to an expander that emits the
	correct instruction depending on endianness.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/narrow_zero_high_half.c: Add new tests.
2021-06-16 14:22:22 +01:00
Jonathan Wright c86a303968 aarch64: Model zero-high-half semantics of SQXTUN instruction in RTL
Split the aarch64_sqmovun<mode> pattern into separate scalar and
vector variants. Further split the vector pattern into big/little
endian variants that model the zero-high-half semantics of the
underlying instruction. Modeling these semantics allows for better
RTL combinations while also removing some register allocation issues
as the compiler now knows that the operation is totally destructive.

Add new tests to narrow_zero_high_half.c to verify the benefit of
this change.

gcc/ChangeLog:

2021-06-14  Jonathan Wright  <jonathan.wright@arm.com>

	* config/aarch64/aarch64-simd-builtins.def: Split generator
	for aarch64_sqmovun builtins into scalar and vector variants.
	* config/aarch64/aarch64-simd.md (aarch64_sqmovun<mode>):
	Split into scalar and vector variants. Change vector variant
	to an expander that emits the correct instruction depending
	on endianness.
	(aarch64_sqmovun<mode>_insn_le): Define.
	(aarch64_sqmovun<mode>_insn_be): Define.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/narrow_zero_high_half.c: Add new tests.
2021-06-16 14:22:08 +01:00
Jonathan Wright d8a88cdae9 aarch64: Model zero-high-half semantics of XTN instruction in RTL
Modeling the zero-high-half semantics of the XTN narrowing
instruction in RTL indicates to the compiler that this is a totally
destructive operation. This enables more RTL simplifications and also
prevents some register allocation issues.

Add new tests to narrow_zero_high_half.c to verify the benefit of
this change.

gcc/ChangeLog:

2021-06-11  Jonathan Wright  <jonathan.wright@arm.com>

	* config/aarch64/aarch64-simd.md (aarch64_xtn<mode>_insn_le):
	Define - modeling zero-high-half semantics.
	(aarch64_xtn<mode>): Change to an expander that emits the
	appropriate instruction depending on endianness.
	(aarch64_xtn<mode>_insn_be): Define - modeling zero-high-half
	semantics.
	(aarch64_xtn2<mode>_le): Rename to...
	(aarch64_xtn2<mode>_insn_le): This.
	(aarch64_xtn2<mode>_be): Rename to...
	(aarch64_xtn2<mode>_insn_be): This.
	(vec_pack_trunc_<mode>): Emit truncation instruction instead
	of aarch64_xtn.
	* config/aarch64/iterators.md (Vnarrowd): Add Vnarrowd mode
	attribute iterator.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/narrow_zero_high_half.c: Add new tests.
2021-06-16 14:21:52 +01:00
Jonathan Wright 4536433820 aarch64: Use correct type attributes for RTL generating XTN(2)
Use the correct "neon_move_narrow_q" type attribute in RTL patterns
that generate XTN/XTN2 instructions.

This makes a material difference because these instructions can be
executed on both SIMD pipes in the Cortex-A57 core model, whereas the
"neon_shift_imm_narrow_q" attribute (in use until now) would suggest
to the scheduler that they could only execute on one of the two
pipes.

gcc/ChangeLog:

2021-05-18  Jonathan Wright  <jonathan.wright@arm.com>

	* config/aarch64/aarch64-simd.md: Use "neon_move_narrow_q"
	type attribute in patterns generating XTN(2).
2021-05-19 14:45:31 +01:00
Jonathan Wright 577d5819e0 aarch64: Use an expander for quad-word vec_pack_trunc pattern
The existing vec_pack_trunc RTL pattern emits an opaque two-
instruction assembly code sequence that prevents proper instruction
scheduling. This commit changes the pattern to an expander that emits
individual xtn and xtn2 instructions.

This commit also consolidates the duplicate truncation patterns.
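
In scalar terms vec_pack_trunc truncates two wide vectors and concatenates
the results (a sketch of the semantics for 4-lane int inputs; which input
lands in which half depends on endianness):

/* The expander now emits this as an xtn for one half and an xtn2 for the
   other, which the scheduler is free to interleave with other work.  */
void pack_trunc (short *restrict d, const int *restrict a, const int *restrict b)
{
  for (int i = 0; i < 4; i++)
    d[i] = (short) a[i];
  for (int i = 0; i < 4; i++)
    d[i + 4] = (short) b[i];
}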

gcc/ChangeLog:

2021-05-17  Jonathan Wright  <jonathan.wright@arm.com>

	* config/aarch64/aarch64-simd.md (aarch64_simd_vec_pack_trunc_<mode>):
	Remove as duplicate of...
	(aarch64_xtn<mode>): This.
	(aarch64_xtn2<mode>_le): Move position in file.
	(aarch64_xtn2<mode>_be): Move position in file.
	(aarch64_xtn2<mode>): Move position in file.
	(vec_pack_trunc_<mode>): Define as an expander.
2021-05-19 14:45:17 +01:00
Jonathan Wright ddbdb9a384 aarch64: Refactor aarch64_<sur>q<r>shr<u>n_n<mode> RTL pattern
Split the aarch64_<sur>q<r>shr<u>n_n<mode> pattern into separate
scalar and vector variants. Further split the vector pattern into
big/little endian variants that model the zero-high-half semantics
of the underlying instruction - allowing for more combinations with
the write-to-high-half variant (aarch64_<sur>q<r>shr<u>n2_n<mode>.)

gcc/ChangeLog:

2021-05-14  Jonathan Wright  <jonathan.wright@arm.com>

	* config/aarch64/aarch64-simd-builtins.def: Split builtin
	generation for aarch64_<sur>q<r>shr<u>n_n<mode> pattern into
	separate scalar and vector generators.
	* config/aarch64/aarch64-simd.md
	(aarch64_<sur>q<r>shr<u>n_n<mode>): Define as an expander and
	split into...
	(aarch64_<sur>q<r>shr<u>n_n<mode>_insn_le): This and...
	(aarch64_<sur>q<r>shr<u>n_n<mode>_insn_be): This.
	* config/aarch64/iterators.md: Define SD_HSDI iterator.
2021-05-19 14:44:39 +01:00
Jonathan Wright 778ac63fe2 aarch64: Relax aarch64_sqxtun2<mode> RTL pattern
Use UNSPEC_SQXTUN instead of UNSPEC_SQXTUN2 in aarch64_sqxtun2<mode>
patterns. This allows for more aggressive combinations and
ultimately better code generation. The now redundant UNSPEC_SQXTUN2
is removed.

gcc/ChangeLog:

2021-05-14  Jonathan Wright  <jonathan.wright@arm.com>

	* config/aarch64/aarch64-simd.md: Use UNSPEC_SQXTUN instead
	of UNSPEC_SQXTUN2.
	* config/aarch64/iterators.md: Remove UNSPEC_SQXTUN2.
2021-05-19 14:44:26 +01:00
Jonathan Wright 4e26303e0b aarch64: Relax aarch64_<sur>q<r>shr<u>n2_n<mode> RTL pattern
Implement saturating right-shift and narrow high Neon intrinsic RTL
patterns using a vec_concat of a register_operand and a VQSHRN_N
unspec - instead of just a VQSHRN_N unspec. This more relaxed pattern
allows for more aggressive combinations and ultimately better code
generation.
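
An illustrative user of the write-to-high-half form (not part of the patch):

#include <arm_neon.h>

/* Narrow the 32-bit lanes of hi with rounding and saturation and place
   them in the upper half of the result, keeping lo in the lower half;
   typically a single sqrshrn2 instruction.  */
int16x8_t narrow_high (int16x4_t lo, int32x4_t hi)
{
  return vqrshrn_high_n_s32 (lo, hi, 8);
}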

gcc/ChangeLog:

2021-03-04  Jonathan Wright  <jonathan.wright@arm.com>

	* config/aarch64/aarch64-simd.md (aarch64_<sur>q<r>shr<u>n2_n<mode>):
	Implement as an expand emitting a big/little endian
	instruction pattern.
	(aarch64_<sur>q<r>shr<u>n2_n<mode>_insn_le): Define.
	(aarch64_<sur>q<r>shr<u>n2_n<mode>_insn_be): Define.
2021-05-19 14:44:10 +01:00
Jonathan Wright 3eddaad02d aarch64: Relax aarch64_<sur><addsub>hn2<mode> RTL pattern
Implement v[r]addhn2 and v[r]subhn2 Neon intrinsic RTL patterns using
a vec_concat of a register_operand and an ADDSUBHN unspec - instead
of just an ADDSUBHN2 unspec. This more relaxed pattern allows for
more aggressive combinations and ultimately better code generation.

This patch also removes the now redundant [R]ADDHN2 and [R]SUBHN2
unspecs and their iterator.

gcc/ChangeLog:

2021-03-03  Jonathan Wright  <jonathan.wright@arm.com>

	* config/aarch64/aarch64-simd.md (aarch64_<sur><addsub>hn2<mode>):
	Implement as an expand emitting a big/little endian
	instruction pattern.
	(aarch64_<sur><addsub>hn2<mode>_insn_le): Define.
	(aarch64_<sur><addsub>hn2<mode>_insn_be): Define.
	* config/aarch64/iterators.md: Remove UNSPEC_[R]ADDHN2 and
	UNSPEC_[R]SUBHN2 unspecs and ADDSUBHN2 iterator.
2021-05-19 14:43:55 +01:00
Kyrylo Tkachov ff3809b459 aarch64: Make sqdmlal2 patterns match canonical RTL
The sqdmlal2 patterns are hidden beneath the SBINQOPS iterator and unfortunately they don't match
canonical RTL because the simple accumulate operand comes in the first arm of the SS_PLUS.
This patch splits the SS_PLUS and SS_MINUS forms with the SS_PLUS operands set up to match
the canonical form, where the complex operand comes first.

gcc/ChangeLog:

	* config/aarch64/aarch64-simd.md
	(aarch64_sqdml<SBINQOPS:as>l2_lane<mode>_internal): Split into...
	(aarch64_sqdmlsl2_lane<mode>_internal): ... This...
	(aarch64_sqdmlal2_lane<mode>_internal): ... And this.
	(aarch64_sqdml<SBINQOPS:as>l2_laneq<mode>_internal): Split into ...
	(aarch64_sqdmlsl2_laneq<mode>_internal): ... This...
	(aarch64_sqdmlal2_laneq<mode>_internal): ... And this.
	(aarch64_sqdml<SBINQOPS:as>l2_n<mode>_internal): Split into...
	(aarch64_sqdmlsl2_n<mode>_internal): ... This...
	(aarch64_sqdmlal2_n<mode>_internal): ... And this.
2021-05-14 15:31:25 +01:00
Kyrylo Tkachov 543c0cbca0 aarch64: Merge sqdmlal2 and sqdmlsl2 expanders
The various sqdmlal2 and sqdmlsl2 expanders perform almost identical functions and can be
merged using code iterators and attributes to reduce the code in the MD file.
No behavioural change is expected.

gcc/ChangeLog:

	* config/aarch64/aarch64-simd.md (aarch64_sqdmlal2<mode>): Merge into...
	(aarch64_sqdml<SBINQOPS:as>l2<mode>): ... This.
	(aarch64_sqdmlsl2<mode>): Delete.
	(aarch64_sqdmlal2_lane<mode>): Merge this...
	(aarch64_sqdmlsl2_lane<mode>): ... And this...
	(aarch64_sqdml<SBINQOPS:as>l2_lane<mode>): ... Into this.
	(aarch64_sqdmlal2_laneq<mode>): Merge this...
	(aarch64_sqdmlsl2_laneq<mode>): ... And this...
	(aarch64_sqdml<SBINQOPS:as>l2_laneq<mode>): ... Into this.
	(aarch64_sqdmlal2_n<mode>): Merge this...
	(aarch64_sqdmlsl2_n<mode>): ... And this...
	(aarch64_sqdml<SBINQOPS:as>l2_n<mode>): ... Into this.
2021-05-14 09:56:45 +01:00
Richard Sandiford 28de75d276 aarch64: A couple of mul_laneq tweaks
This patch removes the duplication between the mul_laneq<mode>3
and the older mul-lane patterns.  The older patterns were previously
divided into two based on whether the indexed operand had the same mode
as the other operands or whether it had the opposite length from the
other operands (64-bit vs. 128-bit).  However, it seemed easier to
divide them instead based on whether the indexed operand was 64-bit or
128-bit, since that maps directly to the arm_neon.h “q” conventions.

Also, it looks like the older patterns were missing cases for
V8HF<->V4HF combinations, which meant that vmul_laneq_f16 and
vmulq_lane_f16 didn't produce single instructions.

There was a typo in the V2SF entry for VCONQ, but in practice
no patterns were using that entry until now.

The test passes for both endiannesses, but endianness does change
the mapping between regexps and functions.

gcc/
	* config/aarch64/iterators.md (VMUL_CHANGE_NLANES): Delete.
	(VMULD): New iterator.
	(VCOND): Handle V4HF and V8HF.
	(VCONQ): Fix entry for V2SF.
	* config/aarch64/aarch64-simd.md (mul_lane<mode>3): Use VMULD
	instead of VMUL.  Use a 64-bit vector mode for the indexed operand.
	(*aarch64_mul3_elt_<vswap_width_name><mode>): Merge with...
	(mul_laneq<mode>3): ...this define_insn.  Use VMUL instead of VDQSF.
	Use a 128-bit vector mode for the indexed operand.  Use stype for
	the scheduling type.

gcc/testsuite/
	* gcc.target/aarch64/fmul_lane_1.c: New test.
2021-05-11 12:17:33 +01:00
Jonathan Wright d388179a79 aarch64: Use RTL builtins for FP ml[as][q]_laneq intrinsics
Rewrite floating-point vml[as][q]_laneq Neon intrinsics to use RTL
builtins rather than relying on the GCC vector extensions. Using RTL
builtins allows control over the emission of fmla/fmls instructions
(which we don't want here.)

With this commit, the code generated by these intrinsics changes from
a fused multiply-add/subtract instruction to an fmul followed by an
fadd/fsub instruction. If the programmer really wants fmla/fmls
instructions, they can use the vfm[as] intrinsics.
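
For example (illustrative, not from the patch), the laneq form now expands
to an unfused multiply and add, while the fused alternative remains
available through the vfm[as] intrinsics:

#include <arm_neon.h>

/* acc + a * b[0]: now emitted as fmul followed by fadd.  */
float32x4_t mla (float32x4_t acc, float32x4_t a, float32x4_t b)
{
  return vmlaq_laneq_f32 (acc, a, b, 0);
}

/* Fused variant, if fmla is really wanted.  */
float32x4_t fused (float32x4_t acc, float32x4_t a, float32x4_t b)
{
  return vfmaq_laneq_f32 (acc, a, b, 0);
}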

gcc/ChangeLog:

2021-02-17  Jonathan Wright  <jonathan.wright@arm.com>

	* config/aarch64/aarch64-simd-builtins.def: Add
	float_ml[as][q]_laneq builtin generator macros.
	* config/aarch64/aarch64-simd.md (mul_laneq<mode>3): Define.
	(aarch64_float_mla_laneq<mode>): Define.
	(aarch64_float_mls_laneq<mode>): Define.
	* config/aarch64/arm_neon.h (vmla_laneq_f32): Use RTL builtin
	instead of GCC vector extensions.
	(vmlaq_laneq_f32): Likewise.
	(vmls_laneq_f32): Likewise.
	(vmlsq_laneq_f32): Likewise.
2021-04-30 18:41:25 +01:00
Jonathan Wright 1baf4ed878 aarch64: Use RTL builtins for FP ml[as][q]_lane intrinsics
Rewrite floating-point vml[as][q]_lane Neon intrinsics to use RTL
builtins rather than relying on the GCC vector extensions. Using RTL
builtins allows control over the emission of fmla/fmls instructions
(which we don't want here.)

With this commit, the code generated by these intrinsics changes from
a fused multiply-add/subtract instruction to an fmul followed by an
fadd/fsub instruction. If the programmer really wants fmla/fmls
instructions, they can use the vfm[as] intrinsics.

gcc/ChangeLog:

2021-02-16  Jonathan Wright  <jonathan.wright@arm.com>

	* config/aarch64/aarch64-simd-builtins.def: Add
	float_ml[as]_lane builtin generator macros.
	* config/aarch64/aarch64-simd.md (*aarch64_mul3_elt<mode>):
	Rename to...
	(mul_lane<mode>3): This, and re-order arguments.
	(aarch64_float_mla_lane<mode>): Define.
	(aarch64_float_mls_lane<mode>): Define.
	* config/aarch64/arm_neon.h (vmla_lane_f32): Use RTL builtin
	instead of GCC vector extensions.
	(vmlaq_lane_f32): Likewise.
	(vmls_lane_f32): Likewise.
	(vmlsq_lane_f32): Likewise.
2021-04-30 18:41:11 +01:00