As the discussion in the PR pointed out, the RTL we have for the REDUC_PLUS
patterns is wrong. The UNSPECs are modelled as returning a vector, and then
in an expand pattern we emit a vec_select of the 0th element to get the scalar.
This is incorrect: the instruction itself already returns a single scalar,
and declaring that it returns a vector allows combine to push a subreg into
the pattern, which causes reload to make duplicate moves.
This patch corrects this by removing the weird indirection and making the RTL
pattern model the correct semantics of the instruction immediately.
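For illustration, the kind of reduction affected looks like this (a
hypothetical example, not the exact testcase):

#include <arm_neon.h>

/* ADDV already produces a scalar result directly, so no separate
   vec_select of lane 0 should be needed.  */
int32_t
sum (int32x4_t x)
{
  return vaddvq_s32 (x);
}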
gcc/ChangeLog:
PR target/104049
* config/aarch64/aarch64-simd.md
(aarch64_reduc_plus_internal<mode>): Fix RTL and rename to...
(reduc_plus_scal_<mode>): ... This.
(reduc_plus_scal_v4sf): Moved.
(aarch64_reduc_plus_internalv2si): Fix RTL and rename to...
(reduc_plus_scal_v2si): ... This.
gcc/testsuite/ChangeLog:
PR target/104049
* gcc.target/aarch64/vadd_reduc-1.c: New test.
* gcc.target/aarch64/vadd_reduc-2.c: New test.
This patch extends the previous support for 16-byte vec_concat
so that it supports pairs of 4-byte elements. This too isn't
strictly a regression fix, since the 8-byte forms weren't affected
by the same problems as the 16-byte forms, but it leaves things in
a more consistent state.
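For example, a sketch of the kind of code this now handles (hypothetical,
assuming little-endian):

#include <arm_neon.h>

/* Two 4-byte elements concatenated into an 8-byte vector.  */
float32x2_t
concat (float a, float b)
{
  return (float32x2_t) { a, b };
}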
gcc/
* config/aarch64/iterators.md (VDCSIF): New mode iterator.
(VDBL): Handle SF.
(single_wx, single_type, single_dtype, dblq): New mode attributes.
* config/aarch64/aarch64-simd.md (load_pair_lanes<mode>): Extend
from VDC to VDCSIF.
(store_pair_lanes<mode>): Likewise.
(*aarch64_combine_internal<mode>): Likewise.
(*aarch64_combine_internal_be<mode>): Likewise.
(*aarch64_combinez<mode>): Likewise.
(*aarch64_combinez_be<mode>): Likewise.
* config/aarch64/aarch64.cc (aarch64_classify_address): Handle
8-byte modes for ADDR_QUERY_LDP_STP_N.
(aarch64_print_operand): Likewise for %y.
gcc/testsuite/
* gcc.target/aarch64/vec-init-13.c: New test.
* gcc.target/aarch64/vec-init-14.c: Likewise.
* gcc.target/aarch64/vec-init-15.c: Likewise.
* gcc.target/aarch64/vec-init-16.c: Likewise.
* gcc.target/aarch64/vec-init-17.c: Likewise.
This patch is the second of two to remove the old
move_lo/hi_quad expanders and move_hi_quad insns.
gcc/
* config/aarch64/aarch64-simd.md (@aarch64_split_simd_mov<mode>):
Use aarch64_combine instead of move_lo/hi_quad. Tabify.
(move_lo_quad_<mode>, aarch64_simd_move_hi_quad_<mode>): Delete.
(aarch64_simd_move_hi_quad_be_<mode>, move_hi_quad_<mode>): Delete.
(vec_pack_trunc_<mode>): Take general_operand elements and use
aarch64_combine rather than move_lo/hi_quad to combine them.
(vec_pack_trunc_df): Likewise.
After previous patches, we have a (mostly new) group of vec_concat
patterns as well as vestiges of the old move_lo/hi_quad patterns.
(A previous patch removed the move_lo_quad insns, but we still
have the move_hi_quad insns and both sets of expanders.)
This patch is the first of two to remove the old move_lo/hi_quad
stuff. It isn't technically a regression fix, but it seemed
better to make the changes now rather than leave things in
a half-finished and inconsistent state.
This patch defines an aarch64_vec_concat expander that coerces the
element operands into a valid form, including the ones added by the
previous patch. This in turn lets us get rid of one move_lo/hi_quad
pair.
As a side-effect, it also means that vcombines of 2 vectors make
better use of the available forms, like vec_inits of 2 scalars
already do.
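For example, a vcombine of two 64-bit vectors (a hypothetical sketch of
the kind of code that benefits):

#include <arm_neon.h>

int32x4_t
combine (int32x2_t a, int32x2_t b)
{
  return vcombine_s32 (a, b);
}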
gcc/
* config/aarch64/aarch64-protos.h (aarch64_split_simd_combine):
Delete.
* config/aarch64/aarch64-simd.md (@aarch64_combinez<mode>): Rename
to...
(*aarch64_combinez<mode>): ...this.
(@aarch64_combinez_be<mode>): Rename to...
(*aarch64_combinez_be<mode>): ...this.
(@aarch64_vec_concat<mode>): New expander.
(aarch64_combine<mode>): Use it.
(@aarch64_simd_combine<mode>): Delete.
* config/aarch64/aarch64.cc (aarch64_split_simd_combine): Delete.
(aarch64_expand_vector_init): Use aarch64_vec_concat.
gcc/testsuite/
* gcc.target/aarch64/vec-init-12.c: New test.
vec_combine is really one instruction on aarch64, provided that
the lowpart element is in the same register as the destination
vector. This patch adds patterns for that.
The patch fixes a regression from GCC 8. Before the patch:
int64x2_t s64q_1(int64_t a0, int64_t a1) {
if (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
return (int64x2_t) { a1, a0 };
else
return (int64x2_t) { a0, a1 };
}
generated:
fmov d0, x0
ins v0.d[1], x1
ins v0.d[1], x1
ret
whereas GCC 8 generated the more respectable:
dup v0.2d, x0
ins v0.d[1], x1
ret
gcc/
* config/aarch64/predicates.md (aarch64_reg_or_mem_pair_operand):
New predicate.
* config/aarch64/aarch64-simd.md (*aarch64_combine_internal<mode>)
(*aarch64_combine_internal_be<mode>): New patterns.
gcc/testsuite/
* gcc.target/aarch64/vec-init-9.c: New test.
* gcc.target/aarch64/vec-init-10.c: Likewise.
* gcc.target/aarch64/vec-init-11.c: Likewise.
move_lo_quad_internal_<mode> and move_lo_quad_internal_be_<mode>
partially duplicate the later aarch64_combinez{,_be}<mode> patterns.
The duplication itself is a regression.
The only substantive differences between the two are:
* combinez uses vector MOV (ORR) instead of element MOV (DUP).
The former seems more likely to be handled via renaming.
* combinez disparages the GPR->FPR alternative whereas move_lo_quad
gave it equal cost. The new test gives a token example of when
the combinez behaviour helps.
gcc/
* config/aarch64/aarch64-simd.md (move_lo_quad_internal_<mode>)
(move_lo_quad_internal_be_<mode>): Delete.
(move_lo_quad_<mode>): Use aarch64_combine<Vhalf> instead of the above.
gcc/testsuite/
* gcc.target/aarch64/vec-init-8.c: New test.
This patch generalises the load_pair_lanes<mode> guard so that
it uses aarch64_check_consecutive_mems to check for consecutive
mems. It also allows the pattern to be used for STRICT_ALIGNMENT
targets if the alignment is high enough.
The main aim is to avoid an inline test, for the sake of a later patch
that needs to repeat it. Reusing aarch64_check_consecutive_mems seemed
simpler than writing an entirely new function.
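A sketch of the kind of mergeable load pair in question (hypothetical
example):

#include <arm_neon.h>

/* Two consecutive 8-byte loads that can be merged into a single
   16-byte vector load.  */
float64x2_t
load_pair (double *p)
{
  return (float64x2_t) { p[0], p[1] };
}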
gcc/
* config/aarch64/aarch64-protos.h (aarch64_mergeable_load_pair_p):
Declare.
* config/aarch64/aarch64-simd.md (load_pair_lanes<mode>): Use
aarch64_mergeable_load_pair_p instead of inline check.
* config/aarch64/aarch64.cc (aarch64_expand_vector_init): Likewise.
(aarch64_check_consecutive_mems): Allow the reversed parameter
to be null.
(aarch64_mergeable_load_pair_p): New function.
The aarch64_simd_vec_set<mode> define_insn takes memory operands,
so this patch makes the vec_set<mode> optab expander do the same.
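For example (hypothetical), the element can now be loaded straight from
memory rather than going through a general register first:

#include <arm_neon.h>

int32x4_t
set_lane (int32x4_t v, int *p)
{
  return vsetq_lane_s32 (*p, v, 2);   /* can use LD1 {v0.s}[2], [x0] */
}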
gcc/
* config/aarch64/aarch64-simd.md (vec_set<mode>): Allow the
element to be an aarch64_simd_nonimmediate_operand.
This patch fixes some cases in which *general_operand was used over
*nonimmediate_operand by patterns that don't accept immediates.
This avoids some complication with later patches.
gcc/
* config/aarch64/aarch64-simd.md (aarch64_simd_vec_set<mode>): Use
aarch64_simd_nonimmediate_operand instead of
aarch64_simd_general_operand.
(@aarch64_combinez<mode>): Use nonimmediate_operand instead of
general_operand.
(@aarch64_combinez_be<mode>): Likewise.
The Advanced SIMD movmisalign patterns didn't handle 16-bit
FP modes, which meant that the vector loop for:
void
test (_Float16 *data)
{
_Pragma ("omp simd")
for (int i = 0; i < 8; ++i)
data[i] = 1.0;
}
would be versioned for alignment.
This was causing some new failures in aarch64/sve/single_5.c:
FAIL: gcc.target/aarch64/sve/single_5.c scan-assembler-not \\tb
FAIL: gcc.target/aarch64/sve/single_5.c scan-assembler-not \\tcmp
FAIL: gcc.target/aarch64/sve/single_5.c scan-assembler-times \\tstr\\tq[0-9]+, 10
but I didn't look into what changed from earlier releases.
Adding the missing modes removes some existing xfails.
gcc/
* config/aarch64/aarch64-simd.md (movmisalign<mode>): Extend from
VALL to VALL_F16.
gcc/testsuite/
* gcc.target/aarch64/sve/single_5.c: Remove some XFAILs.
The VALL_F16MOV iterator now has the same modes as VALL_F16,
in the same order. This patch removes the former in favour
of the latter.
This doesn't fix a bug as such, but it's ultra-safe (no change in
object code) and it saves a follow-up patch from having to make
a false choice between the iterators.
gcc/
* config/aarch64/iterators.md (VALL_F16MOV): Delete.
* config/aarch64/aarch64-simd.md (mov<mode>): Use VALL_F16 instead
of VALL_F16MOV.
After the first patch in the series, this updates the optabs to expect the
canonical sequence.
gcc/ChangeLog:
PR tree-optimization/102819
PR tree-optimization/103169
* config/aarch64/aarch64-simd.md (cml<fcmac1><conj_op><mode>4): Use
canonical order.
* config/aarch64/aarch64-sve.md (cml<fcmac1><conj_op><mode>4): Likewise.
This patch sorts out an issue with the LS64 intrinsics tests failing on
aarch64_be targets.
gcc/ChangeLog:
PR target/103729
* config/aarch64/aarch64-simd.md (aarch64_movv8di): Allow big endian
targets to move V8DI.
This patch adds support for LS64 (the Armv8.7-A Load/Store 64 Byte
extension). Changes include the missing plumbing for TARGET_LS64, the LS64
data structure and the intrinsics defined in ACLE. The machine description
of the intrinsics uses the new V8DI mode added in a separate patch.
__ARM_FEATURE_LS64 is defined if the Armv8.7-A LS64 instructions for atomic
64-byte access to device memory are supported.
A new compiler internal type is added wrapping the ACLE struct data512_t:
typedef struct {
uint64_t val[8];
} __arm_data512_t;
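A sketch of intended usage (assuming a compiler invoked with LS64 support
enabled, e.g. -march=armv8.7-a):

#include <arm_acle.h>

data512_t
read_dev (const void *addr)
{
  return __arm_ld64b (addr);    /* ld64b */
}

void
write_dev (void *addr, data512_t value)
{
  __arm_st64b (addr, value);    /* st64b */
}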
gcc/ChangeLog:
* config/aarch64/aarch64-builtins.c (enum aarch64_builtins):
Define AARCH64_LS64_BUILTIN_LD64B, AARCH64_LS64_BUILTIN_ST64B,
AARCH64_LS64_BUILTIN_ST64BV, AARCH64_LS64_BUILTIN_ST64BV0.
(aarch64_init_ls64_builtin_decl): Helper function.
(aarch64_init_ls64_builtins): Helper function.
(aarch64_init_ls64_builtins_types): Helper function.
(aarch64_general_init_builtins): Init LS64 intrinsics for
TARGET_LS64.
(aarch64_expand_builtin_ls64): LS64 intrinsics expander.
(aarch64_general_expand_builtin): Handle aarch64_expand_builtin_ls64.
(ls64_builtins_data): New helper struct.
(v8di_UP): New define.
* config/aarch64/aarch64-c.c (aarch64_update_cpp_builtins): Define
__ARM_FEATURE_LS64.
* config/aarch64/aarch64.c (aarch64_classify_address): Enforce the
V8DI range (7-bit signed scaled) for both ends of the range.
* config/aarch64/aarch64-simd.md (movv8di): New pattern.
(aarch64_movv8di): New pattern.
* config/aarch64/aarch64.h (AARCH64_ISA_LS64): New define.
(TARGET_LS64): New define.
* config/aarch64/aarch64.md: Add UNSPEC_LD64B, UNSPEC_ST64B,
UNSPEC_ST64BV and UNSPEC_ST64BV0.
(ld64b): New define_insn.
(st64b): New define_insn.
(st64bv): New define_insn.
(st64bv0): New define_insn.
* config/aarch64/arm_acle.h (data512_t): New type derived from
__arm_data512_t.
(__arm_data512_t): New internal type.
(__arm_ld64b): New intrinsic.
(__arm_st64b): New intrinsic.
(__arm_st64bv): New intrinsic.
(__arm_st64bv0): New intrinsic.
* config/arm/types.md: Add new type ls64.
gcc/testsuite/ChangeLog:
* gcc.target/aarch64/acle/ls64_asm.c: New test.
* gcc.target/aarch64/acle/ls64_ld64b.c: New test.
* gcc.target/aarch64/acle/ls64_ld64b-2.c: New test.
* gcc.target/aarch64/acle/ls64_ld64b-3.c: New test.
* gcc.target/aarch64/acle/ls64_st64b.c: New test.
* gcc.target/aarch64/acle/ls64_ld_st_o0.c: New test.
* gcc.target/aarch64/acle/ls64_st64b-2.c: New test.
* gcc.target/aarch64/acle/ls64_st64bv.c: New test.
* gcc.target/aarch64/acle/ls64_st64bv-2.c: New test.
* gcc.target/aarch64/acle/ls64_st64bv-3.c: New test.
* gcc.target/aarch64/acle/ls64_st64bv0.c: New test.
* gcc.target/aarch64/acle/ls64_st64bv0-2.c: New test.
* gcc.target/aarch64/acle/ls64_st64bv0-3.c: New test.
* gcc.target/aarch64/pragma_cpp_predefs_2.c: Add checks
for __ARM_FEATURE_LS64.
This optimizes the rounding shift right and narrow instructions into a
rounding add narrow high (raddhn) with a zero vector, when the shift amount
is half the width of the original input type.
i.e.
uint32x4_t foo (uint64x2_t a, uint64x2_t b)
{
return vrshrn_high_n_u64 (vrshrn_n_u64 (a, 32), b, 32);
}
now generates:
foo:
movi v3.4s, 0
raddhn v0.2s, v0.2d, v3.2d
raddhn2 v0.4s, v1.2d, v3.2d
ret
instead of:
foo:
rshrn v0.2s, v0.2d, 32
rshrn2 v0.4s, v1.2d, 32
ret
On Arm cores this is an improvement in both latency and throughput.
Because a vector zero is needed I created a new method
aarch64_gen_shareable_zero that creates zeros using V4SI and then takes a subreg
of the zero to the desired type. This allows CSE to share all the zero
constants.
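A rough sketch of the idea (simplified, not necessarily the exact
implementation):

rtx
aarch64_gen_shareable_zero (machine_mode mode)
{
  /* Always materialise the zero in V4SImode ...  */
  rtx tmp = gen_reg_rtx (V4SImode);
  emit_move_insn (tmp, CONST0_RTX (V4SImode));
  /* ... and reinterpret it in MODE, so CSE sees one shared constant.  */
  return lowpart_subreg (mode, tmp, V4SImode);
}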
gcc/ChangeLog:
* config/aarch64/aarch64-protos.h (aarch64_gen_shareable_zero): New.
* config/aarch64/aarch64-simd.md (aarch64_rshrn<mode>,
aarch64_rshrn2<mode>): Generate rounding add narrow high when appropriate.
* config/aarch64/aarch64.c (aarch64_gen_shareable_zero): New.
gcc/testsuite/ChangeLog:
* gcc.target/aarch64/advsimd-intrinsics/shrn-1.c: New test.
* gcc.target/aarch64/advsimd-intrinsics/shrn-2.c: New test.
* gcc.target/aarch64/advsimd-intrinsics/shrn-3.c: New test.
* gcc.target/aarch64/advsimd-intrinsics/shrn-4.c: New test.
The problem here is that aarch64_simd_dup<mode> uses
the vw iterator rather than the vwcore iterator. This causes
problems for the V4SF and V2DF modes. I changed both
aarch64_simd_dup<mode> patterns to be consistent.
Committed as obvious after a bootstrap/test on aarch64-linux-gnu.
PR target/103170
gcc/ChangeLog:
* config/aarch64/aarch64-simd.md (aarch64_simd_dup<mode>):
Use vwcore iterator for the r constraint output string.
gcc/testsuite/ChangeLog:
* gcc.c-torture/compile/vector-dup-1.c: New test.
This removes the patterns that optimized the rounding shift and narrow.
The optimization is valid only for the truncating shift and narrow; for the
rounding shift and narrow we need a different pattern that I will submit
separately.
This wasn't noticed before because the benchmarks did not run conformance
testing as part of the run; they now do, and with this change conformance
passes again.
gcc/ChangeLog:
* config/aarch64/aarch64-simd.md (*aarch64_topbits_shuffle<mode>_le,
*aarch64_topbits_shuffle<mode>_be): Remove.
gcc/testsuite/ChangeLog:
* gcc.target/aarch64/shrn-combine-8.c: Update.
* gcc.target/aarch64/shrn-combine-9.c: Update.
There was some duplication between the maxmin_uns (uns for unspec
rather than unsigned) int attribute and the optab int attribute.
The difficulty for FMAXNM and FMINNM is that the instructions
really correspond to two things: the smax/smin optabs for floats
(used only for fast-math-like flags) and the fmax/fmin optabs
(used for built-in functions). The optab attribute was
consistently for the former but maxmin_uns had a mixture of both.
This patch renames maxmin_uns to fmaxmin and only uses it
for the fmax and fmin optabs. The reductions that previously
used the maxmin_uns attribute now use the optab attribute instead.
FMAX and FMIN are awkward in that they don't correspond to any
optab. It's nevertheless useful to define them alongside the
“real” optabs. Previously they were known as “smax_nan” and
“smin_nan”, but the problem with those names is that smax and
smin are only used for floats if NaNs don't matter. This patch
therefore uses fmax_nan and fmin_nan instead.
There is still some inconsistency, in that the optab attribute
handles UNSPEC_COND_FMAX but the fmaxmin attribute handles
UNSPEC_FMAX. This is because the SVE FP instructions, being
predicated, have to use unspecs in cases where the Advanced
SIMD ones could use rtl codes.
At least there are no duplicate entries though, so this seemed
like the best compromise for now.
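For illustration (hypothetical example): __builtin_fmax goes through the
fmax optab and so keeps its NaN semantics, which on AArch64 maps to FMAXNM;
smax is only used for floats under fast-math-style flags.

double
use_fmax (double a, double b)
{
  return __builtin_fmax (a, b);   /* fmaxnm d0, d0, d1 */
}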
gcc/
* config/aarch64/iterators.md (optab): Use fmax_nan instead of
smax_nan and fmin_nan instead of smin_nan.
(maxmin_uns): Rename to...
(fmaxmin): ...this and make the same changes. Remove entries
unrelated to fmax* and fmin*.
* config/aarch64/aarch64.md (<maxmin_uns><mode>3): Rename to...
(<fmaxmin><mode>3): ...this.
* config/aarch64/aarch64-simd.md (aarch64_<maxmin_uns>p<mode>):
Rename to...
(aarch64_<optab>p<mode>): ...this.
(<maxmin_uns><mode>3): Rename to...
(<fmaxmin><mode>3): ...this.
(reduc_<maxmin_uns>_scal_<mode>): Rename to...
(reduc_<optab>_scal_<mode>): ...this and update gen* call.
(aarch64_reduc_<maxmin_uns>_internal<mode>): Rename to...
(aarch64_reduc_<optab>_internal<mode>): ...this.
(aarch64_reduc_<maxmin_uns>_internalv2si): Rename to...
(aarch64_reduc_<optab>_internalv2si): ...this.
* config/aarch64/aarch64-sve.md (<maxmin_uns><mode>3): Rename to...
(<fmaxmin><mode>3): ...this.
* config/aarch64/aarch64-simd-builtins.def (smax_nan, smin_nan):
Rename to...
(fmax_nan, fmin_nan): ...this.
* config/aarch64/arm_neon.h (vmax_f32, vmax_f64, vmaxq_f32, vmaxq_f64)
(vmin_f32, vmin_f64, vminq_f32, vminq_f64, vmax_f16, vmaxq_f16)
(vmin_f16, vminq_f16): Update accordingly.
This patch adds extended costing to cost the creation of constants and the
manipulation of constants. The default values provided are based on
architectural expectations, and each cost model can be individually tweaked
as needed.
The changes in this patch covers:
* Construction of PARALLEL or CONST_VECTOR:
Adds better costing for vectors of constants, based on the constant
being created and the instruction that can be used to create it, i.e. a movi
is cheaper than a literal load etc. (see the sketch below).
* Construction of a vector through a vec_dup.
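A sketch of the distinction being costed (hypothetical example):

#include <arm_neon.h>

/* movi-encodable, cheap.  */
int32x4_t cheap (void) { return vdupq_n_s32 (1); }

/* Not movi-encodable: needs a literal load or a GPR move + dup,
   which the cost tables can now reflect.  */
int32x4_t expensive (void) { return vdupq_n_s32 (0x12345678); }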
gcc/ChangeLog:
* config/arm/aarch-common-protos.h (struct vector_cost_table): Add
movi, dup and extract costing fields.
* config/aarch64/aarch64-cost-tables.h (qdf24xx_extra_costs,
thunderx_extra_costs, thunderx2t99_extra_costs,
thunderx3t110_extra_costs, tsv110_extra_costs, a64fx_extra_costs): Use
them.
* config/arm/aarch-cost-tables.h (generic_extra_costs,
cortexa53_extra_costs, cortexa57_extra_costs, cortexa76_extra_costs,
exynosm1_extra_costs, xgene1_extra_costs): Likewise.
* config/aarch64/aarch64-simd.md (aarch64_simd_dup<mode>): Add r->w dup.
* config/aarch64/aarch64.c (aarch64_rtx_costs): Add extra costs.
gcc/testsuite/ChangeLog:
* gcc.target/aarch64/vect-cse-codegen.c: New test.
This turns a bitwise inverse of an equality comparison with 0 into a compare of
bitwise nonzero (cmtst).
We already have one pattern for cmtst; this adds an additional one which
does not require an additional bitwise and.
i.e.
#include <arm_neon.h>
uint8x8_t bar(int16x8_t abs_row0, int16x8_t row0) {
uint16x8_t row0_diff =
vreinterpretq_u16_s16(veorq_s16(abs_row0, vshrq_n_s16(row0, 15)));
uint8x8_t abs_row0_gt0 =
vmovn_u16(vcgtq_u16(vreinterpretq_u16_s16(abs_row0), vdupq_n_u16(0)));
return abs_row0_gt0;
}
now generates:
bar:
cmtst v0.8h, v0.8h, v0.8h
xtn v0.8b, v0.8h
ret
instead of:
bar:
cmeq v0.8h, v0.8h, #0
not v0.16b, v0.16b
xtn v0.8b, v0.8h
ret
gcc/ChangeLog:
* config/aarch64/aarch64-simd.md (*aarch64_cmtst_same_<mode>): New.
gcc/testsuite/ChangeLog:
* gcc.target/aarch64/mvn-cmeq0-1.c: New test.
This turns truncate operations with a hi/lo pair into a single permute of
half the bit size of the input, simply ignoring the top bits (which are
truncated out).
i.e.
void d2 (short * restrict a, int *b, int n)
{
for (int i = 0; i < n; i++)
a[i] = b[i];
}
now generates:
.L4:
ldp q0, q1, [x3]
add x3, x3, 32
uzp1 v0.8h, v0.8h, v1.8h
str q0, [x5], 16
cmp x4, x3
bne .L4
instead of
.L4:
ldp q0, q1, [x3]
add x3, x3, 32
xtn v0.4h, v0.4s
xtn2 v0.8h, v1.4s
str q0, [x5], 16
cmp x4, x3
bne .L4
gcc/ChangeLog:
* config/aarch64/aarch64-simd.md (*aarch64_narrow_trunc<mode>): New.
gcc/testsuite/ChangeLog:
* gcc.target/aarch64/narrow_high_combine.c: Update case.
* gcc.target/aarch64/xtn-combine-1.c: New test.
* gcc.target/aarch64/xtn-combine-2.c: New test.
* gcc.target/aarch64/xtn-combine-3.c: New test.
* gcc.target/aarch64/xtn-combine-4.c: New test.
* gcc.target/aarch64/xtn-combine-5.c: New test.
* gcc.target/aarch64/xtn-combine-6.c: New test.
This optimizes a signed right shift by BITSIZE-1 into a cmlt operation,
which is preferable because compares generally have a higher throughput
than shifts. On AArch64 the result of the shift would have been either -1
or 0, which is exactly the result of the compare.
i.e.
void e (int * restrict a, int *b, int n)
{
for (int i = 0; i < n; i++)
b[i] = a[i] >> 31;
}
now generates:
.L4:
ldr q0, [x0, x3]
cmlt v0.4s, v0.4s, #0
str q0, [x1, x3]
add x3, x3, 16
cmp x4, x3
bne .L4
instead of:
.L4:
ldr q0, [x0, x3]
sshr v0.4s, v0.4s, 31
str q0, [x1, x3]
add x3, x3, 16
cmp x4, x3
bne .L4
gcc/ChangeLog:
* config/aarch64/aarch64-simd.md (aarch64_simd_ashr<mode>): Add cmlt
case.
* config/aarch64/constraints.md (D1): New.
gcc/testsuite/ChangeLog:
* gcc.target/aarch64/shl-combine-2.c: New test.
* gcc.target/aarch64/shl-combine-3.c: New test.
* gcc.target/aarch64/shl-combine-4.c: New test.
* gcc.target/aarch64/shl-combine-5.c: New test.
When doing a (narrowing) right shift by half the width of the original
type, we are essentially shuffling the top bits from the first number down.
If we have a hi/lo pair we can just use a single shuffle instead of needing two
shifts.
i.e.
typedef short int16_t;
typedef unsigned short uint16_t;
void foo (uint16_t * restrict a, int16_t * restrict d, int n)
{
for( int i = 0; i < n; i++ )
d[i] = (a[i] * a[i]) >> 16;
}
now generates:
.L4:
ldr q0, [x0, x3]
umull v1.4s, v0.4h, v0.4h
umull2 v0.4s, v0.8h, v0.8h
uzp2 v0.8h, v1.8h, v0.8h
str q0, [x1, x3]
add x3, x3, 16
cmp x4, x3
bne .L4
instead of
.L4:
ldr q0, [x0, x3]
umull v1.4s, v0.4h, v0.4h
umull2 v0.4s, v0.8h, v0.8h
sshr v1.4s, v1.4s, 16
sshr v0.4s, v0.4s, 16
xtn v1.4h, v1.4s
xtn2 v1.8h, v0.4s
str q1, [x1, x3]
add x3, x3, 16
cmp x4, x3
bne .L4
gcc/ChangeLog:
* config/aarch64/aarch64-simd.md
(*aarch64_<srn_op>topbits_shuffle<mode>_le): New.
(*aarch64_topbits_shuffle<mode>_le): New.
(*aarch64_<srn_op>topbits_shuffle<mode>_be): New.
(*aarch64_topbits_shuffle<mode>_be): New.
* config/aarch64/predicates.md
(aarch64_simd_shift_imm_vec_exact_top): New.
gcc/testsuite/ChangeLog:
* gcc.target/aarch64/shrn-combine-10.c: New test.
* gcc.target/aarch64/shrn-combine-5.c: New test.
* gcc.target/aarch64/shrn-combine-6.c: New test.
* gcc.target/aarch64/shrn-combine-7.c: New test.
* gcc.target/aarch64/shrn-combine-8.c: New test.
* gcc.target/aarch64/shrn-combine-9.c: New test.
Loops containing long long shifts fail to vectorize due to the vectorizer
not being able to recognize long long right shifts. This is due to a bug
in the iterator used for the vashr and vlshr patterns in aarch64-simd.md.
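A sketch of the kind of loop affected (hypothetical; the new tests are
along these lines):

void
foo (long long *a, long long *b, long long *c, int n)
{
  for (int i = 0; i < n; i++)
    a[i] = b[i] >> c[i];
}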
2021-08-09 Tejas Belagod <tejas.belagod@arm.com>
gcc/ChangeLog
PR target/101609
* config/aarch64/aarch64-simd.md (vlshr<mode>3, vashr<mode>3): Use
the right iterator.
gcc/testsuite/ChangeLog
* gcc.target/aarch64/vect-shr-reg.c: New testcase.
* gcc.target/aarch64/vect-shr-reg-run.c: Likewise.
As a general principle, vec_duplicate should be as close to the root
of an expression as possible. Where unary operations have
vec_duplicate as an argument, these operations should be pushed
inside the vec_duplicate.
This patch modifies unary operation simplification to push
sign/zero-extension of a scalar inside vec_duplicate.
This patch also updates all RTL patterns in aarch64-simd.md to use
the new canonical form.
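For example (illustrative modes), the form

(zero_extend:V8HI (vec_duplicate:V8QI (reg:QI x)))

is now simplified to the canonical

(vec_duplicate:V8HI (zero_extend:HI (reg:QI x)))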
gcc/ChangeLog:
2021-07-19 Jonathan Wright <jonathan.wright@arm.com>
* config/aarch64/aarch64-simd.md: Push sign/zero-extension
inside vec_duplicate for all patterns.
* simplify-rtx.c (simplify_context::simplify_unary_operation_1):
Push sign/zero-extension inside vec_duplicate.
The previous fix for this problem was wrong due to a subtle difference between
where NEON expects the RMW values and where the intrinsics expect them.
The insn pattern is modeled after the intrinsics and so needs an expand for
the vectorizer optab to switch the RTL.
However, operand[3] is not expected to be written to, so the current pattern
is bogus.
Instead I rewrite the RTL to be in canonical ordering and merge them.
gcc/ChangeLog:
* config/aarch64/aarch64-simd-builtins.def (sdot, udot): Rename to...
(sdot_prod, udot_prod): ... This.
* config/aarch64/aarch64-simd.md (aarch64_<sur>dot<vsi2qi>): Merged
into...
(<sur>dot_prod<vsi2qi>): ... this.
(aarch64_<sur>dot_lane<vsi2qi>, aarch64_<sur>dot_laneq<vsi2qi>):
Change operands order.
(<sur>sadv16qi): Use new operands order.
* config/aarch64/arm_neon.h (vdot_u32, vdotq_u32, vdot_s32,
vdotq_s32): Use new RTL ordering.
There's a slight mismatch between the vectorizer optabs and the intrinsics
patterns for NEON. The vectorizer expects operands[3] and operands[0] to be
the same but the aarch64 intrinsics expanders expect operands[0] and
operands[1] to be the same.
This means we need different patterns here. This adds a separate usdot
vectorizer pattern which just shuffles around the RTL params.
There's also an inconsistency between the usdot and (u|s)dot intrinsics RTL
patterns which is not corrected here.
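A hypothetical usage example (requires the i8mm extension, e.g.
-march=armv8.6-a):

#include <arm_neon.h>

int32x2_t
do_usdot (int32x2_t acc, uint8x8_t a, int8x8_t b)
{
  return vusdot_s32 (acc, a, b);
}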
gcc/ChangeLog:
* config/aarch64/aarch64-builtins.c (TYPES_TERNOP_SUSS,
aarch64_types_ternop_suss_qualifiers): New.
* config/aarch64/aarch64-simd-builtins.def (usdot_prod): Use it.
* config/aarch64/aarch64-simd.md (usdot_prod<vsi2qi>): Re-organize RTL.
* config/aarch64/arm_neon.h (vusdot_s32, vusdotq_s32): Use it.
The current RTL for the vectorizer dot-product patterns is incorrect.
Operand3 isn't an output parameter so we can't write to it.
This fixes the issue and reduces the number of RTL patterns.
gcc/ChangeLog:
* config/aarch64/aarch64-simd-builtins.def (udot, sdot): Rename to...
(sdot_prod, udot_prod): ...These.
* config/aarch64/aarch64-simd.md (<sur>dot_prod<vsi2qi>): Remove.
(aarch64_<sur>dot<vsi2qi>): Rename to...
(<sur>dot_prod<vsi2qi>): ...This.
* config/aarch64/arm_neon.h (vdot_u32, vdotq_u32, vdot_s32, vdotq_s32):
Update builtins.
Model the zero-high-half semantics of the narrowing arithmetic Neon
instructions in the aarch64_<sur><addsub>hn<mode> RTL pattern.
Modeling these semantics allows for better RTL combinations while
also removing some register allocation issues as the compiler now
knows that the operation is totally destructive.
Add new tests to narrow_zero_high_half.c to verify the benefit of
this change.
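A sketch of the kind of test added: combining the narrowed result with a
zero vector should now be free, because the instruction itself already
zeroes the high half of the destination.

#include <arm_neon.h>

int16x8_t
addhn_zero_high (int32x4_t a, int32x4_t b)
{
  return vcombine_s16 (vaddhn_s32 (a, b), vdup_n_s16 (0));
}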
gcc/ChangeLog:
2021-06-14 Jonathan Wright <jonathan.wright@arm.com>
* config/aarch64/aarch64-simd.md (aarch64_<sur><addsub>hn<mode>):
Change to an expander that emits the correct instruction
depending on endianness.
(aarch64_<sur><addsub>hn<mode>_insn_le): Define.
(aarch64_<sur><addsub>hn<mode>_insn_be): Define.
gcc/testsuite/ChangeLog:
* gcc.target/aarch64/narrow_zero_high_half.c: Add new tests.
Split the aarch64_<su>qmovn<mode> pattern into separate scalar and
vector variants. Further split the vector RTL pattern into big/
little endian variants that model the zero-high-half semantics of the
underlying instruction. Modeling these semantics allows for better
RTL combinations while also removing some register allocation issues
as the compiler now knows that the operation is totally destructive.
Add new tests to narrow_zero_high_half.c to verify the benefit of
this change.
gcc/ChangeLog:
2021-06-14 Jonathan Wright <jonathan.wright@arm.com>
* config/aarch64/aarch64-simd-builtins.def: Split generator
for aarch64_<su>qmovn builtins into scalar and vector
variants.
* config/aarch64/aarch64-simd.md (aarch64_<su>qmovn<mode>_insn_le):
Define.
(aarch64_<su>qmovn<mode>_insn_be): Define.
(aarch64_<su>qmovn<mode>): Split into scalar and vector
variants. Change vector variant to an expander that emits the
correct instruction depending on endianness.
gcc/testsuite/ChangeLog:
* gcc.target/aarch64/narrow_zero_high_half.c: Add new tests.
Split the aarch64_sqmovun<mode> pattern into separate scalar and
vector variants. Further split the vector pattern into big/little
endian variants that model the zero-high-half semantics of the
underlying instruction. Modeling these semantics allows for better
RTL combinations while also removing some register allocation issues
as the compiler now knows that the operation is totally destructive.
Add new tests to narrow_zero_high_half.c to verify the benefit of
this change.
gcc/ChangeLog:
2021-06-14 Jonathan Wright <jonathan.wright@arm.com>
* config/aarch64/aarch64-simd-builtins.def: Split generator
for aarch64_sqmovun builtins into scalar and vector variants.
* config/aarch64/aarch64-simd.md (aarch64_sqmovun<mode>):
Split into scalar and vector variants. Change vector variant
to an expander that emits the correct instruction depending
on endianness.
(aarch64_sqmovun<mode>_insn_le): Define.
(aarch64_sqmovun<mode>_insn_be): Define.
gcc/testsuite/ChangeLog:
* gcc.target/aarch64/narrow_zero_high_half.c: Add new tests.
Modeling the zero-high-half semantics of the XTN narrowing
instruction in RTL indicates to the compiler that this is a totally
destructive operation. This enables more RTL simplifications and also
prevents some register allocation issues.
Add new tests to narrow_zero_high_half.c to verify the benefit of
this change.
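As above, a sketch of the kind of test added (hypothetical):

#include <arm_neon.h>

/* No instruction should be needed for the zero high half.  */
int16x8_t
xtn_zero_high (int32x4_t a)
{
  return vcombine_s16 (vmovn_s32 (a), vdup_n_s16 (0));
}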
gcc/ChangeLog:
2021-06-11 Jonathan Wright <jonathan.wright@arm.com>
* config/aarch64/aarch64-simd.md (aarch64_xtn<mode>_insn_le):
Define - modeling zero-high-half semantics.
(aarch64_xtn<mode>): Change to an expander that emits the
appropriate instruction depending on endianness.
(aarch64_xtn<mode>_insn_be): Define - modeling zero-high-half
semantics.
(aarch64_xtn2<mode>_le): Rename to...
(aarch64_xtn2<mode>_insn_le): This.
(aarch64_xtn2<mode>_be): Rename to...
(aarch64_xtn2<mode>_insn_be): This.
(vec_pack_trunc_<mode>): Emit truncation instruction instead
of aarch64_xtn.
* config/aarch64/iterators.md (Vnarrowd): Add Vnarrowd mode
attribute iterator.
gcc/testsuite/ChangeLog:
* gcc.target/aarch64/narrow_zero_high_half.c: Add new tests.
Use the correct "neon_move_narrow_q" type attribute in RTL patterns
that generate XTN/XTN2 instructions.
This makes a material difference because these instructions can be
executed on both SIMD pipes in the Cortex-A57 core model, whereas the
"neon_shift_imm_narrow_q" attribute (in use until now) would suggest
to the scheduler that they could only execute on one of the two
pipes.
gcc/ChangeLog:
2021-05-18 Jonathan Wright <jonathan.wright@arm.com>
* config/aarch64/aarch64-simd.md: Use "neon_move_narrow_q"
type attribute in patterns generating XTN(2).
The existing vec_pack_trunc RTL pattern emits an opaque two-
instruction assembly code sequence that prevents proper instruction
scheduling. This commit changes the pattern to an expander that emits
individual xtn and xtn2 instructions.
This commit also consolidates the duplicate truncation patterns.
gcc/ChangeLog:
2021-05-17 Jonathan Wright <jonathan.wright@arm.com>
* config/aarch64/aarch64-simd.md (aarch64_simd_vec_pack_trunc_<mode>):
Remove as duplicate of...
(aarch64_xtn<mode>): This.
(aarch64_xtn2<mode>_le): Move position in file.
(aarch64_xtn2<mode>_be): Move position in file.
(aarch64_xtn2<mode>): Move position in file.
(vec_pack_trunc_<mode>): Define as an expander.
Split the aarch64_<sur>q<r>shr<u>n_n<mode> pattern into separate
scalar and vector variants. Further split the vector pattern into
big/little endian variants that model the zero-high-half semantics
of the underlying instruction - allowing for more combinations with
the write-to-high-half variant (aarch64_<sur>q<r>shr<u>n2_n<mode>.)
gcc/ChangeLog:
2021-05-14 Jonathan Wright <jonathan.wright@arm.com>
* config/aarch64/aarch64-simd-builtins.def: Split builtin
generation for aarch64_<sur>q<r>shr<u>n_n<mode> pattern into
separate scalar and vector generators.
* config/aarch64/aarch64-simd.md
(aarch64_<sur>q<r>shr<u>n_n<mode>): Define as an expander and
split into...
(aarch64_<sur>q<r>shr<u>n_n<mode>_insn_le): This and...
(aarch64_<sur>q<r>shr<u>n_n<mode>_insn_be): This.
* config/aarch64/iterators.md: Define SD_HSDI iterator.
Use UNSPEC_SQXTUN instead of UNSPEC_SQXTUN2 in aarch64_sqxtun2<mode>
patterns. This allows for more aggressive combinations and
ultimately better code generation. The now redundant UNSPEC_SQXTUN2
is removed.
gcc/ChangeLog:
2021-05-14 Jonathan Wright <jonathan.wright@arm.com>
* config/aarch64/aarch64-simd.md: Use UNSPEC_SQXTUN instead
of UNSPEC_SQXTUN2.
* config/aarch64/iterators.md: Remove UNSPEC_SQXTUN2.
Implement saturating right-shift and narrow high Neon intrinsic RTL
patterns using a vec_concat of a register_operand and a VQSHRN_N
unspec - instead of just a VQSHRN_N unspec. This more relaxed pattern
allows for more aggressive combinations and ultimately better code
generation.
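A hypothetical example of the combination this enables: narrowing the low
half and writing the high half can share one destination register.

#include <arm_neon.h>

int16x8_t
qshrn_pair (int32x4_t a, int32x4_t b)
{
  return vqshrn_high_n_s32 (vqshrn_n_s32 (a, 8), b, 8);
}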
gcc/ChangeLog:
2021-03-04 Jonathan Wright <jonathan.wright@arm.com>
* config/aarch64/aarch64-simd.md (aarch64_<sur>q<r>shr<u>n2_n<mode>):
Implement as an expand emitting a big/little endian
instruction pattern.
(aarch64_<sur>q<r>shr<u>n2_n<mode>_insn_le): Define.
(aarch64_<sur>q<r>shr<u>n2_n<mode>_insn_be): Define.
Implement v[r]addhn2 and v[r]subhn2 Neon intrinsic RTL patterns using
a vec_concat of a register_operand and an ADDSUBHN unspec - instead
of just an ADDSUBHN2 unspec. This more relaxed pattern allows for
more aggressive combinations and ultimately better code generation.
This patch also removes the now redundant [R]ADDHN2 and [R]SUBHN2
unspecs and their iterator.
gcc/ChangeLog:
2021-03-03 Jonathan Wright <jonathan.wright@arm.com>
* config/aarch64/aarch64-simd.md (aarch64_<sur><addsub>hn2<mode>):
Implement as an expand emitting a big/little endian
instruction pattern.
(aarch64_<sur><addsub>hn2<mode>_insn_le): Define.
(aarch64_<sur><addsub>hn2<mode>_insn_be): Define.
* config/aarch64/iterators.md: Remove UNSPEC_[R]ADDHN2 and
UNSPEC_[R]SUBHN2 unspecs and ADDSUBHN2 iterator.
The sqdmlal2 patterns are hidden beneath the SBINQOPS iterator and
unfortunately they don't match canonical RTL, because the simple accumulate
operand comes in the first arm of the SS_PLUS.
This patch splits the SS_PLUS and SS_MINUS forms, with the SS_PLUS operands
set up to match the canonical form, where the complex operand comes first.
gcc/ChangeLog:
* config/aarch64/aarch64-simd.md
(aarch64_sqdml<SBINQOPS:as>l2_lane<mode>_internal): Split into...
(aarch64_sqdmlsl2_lane<mode>_internal): ... This...
(aarch64_sqdmlal2_lane<mode>_internal): ... And this.
(aarch64_sqdml<SBINQOPS:as>l2_laneq<mode>_internal): Split into ...
(aarch64_sqdmlsl2_laneq<mode>_internal): ... This...
(aarch64_sqdmlal2_laneq<mode>_internal): ... And this.
(aarch64_sqdml<SBINQOPS:as>l2_n<mode>_internal): Split into...
(aarch64_sqdmlsl2_n<mode>_internal): ... This...
(aarch64_sqdmlal2_n<mode>_internal): ... And this.
The various sqdmlal2 and sqdmlsl2 expanders perform almost identical
functions and can be merged using code iterators and attributes to reduce
the code in the MD file.
No behavioural change is expected.
gcc/ChangeLog:
* config/aarch64/aarch64-simd.md (aarch64_sqdmlal2<mode>): Merge into...
(aarch64_sqdml<SBINQOPS:as>l2<mode>): ... This.
(aarch64_sqdmlsl2<mode>): Delete.
(aarch64_sqdmlal2_lane<mode>): Merge this...
(aarch64_sqdmlsl2_lane<mode>): ... And this...
(aarch64_sqdml<SBINQOPS:as>l2_lane<mode>): ... Into this.
(aarch64_sqdmlal2_laneq<mode>): Merge this...
(aarch64_sqdmlsl2_laneq<mode>): ... And this...
(aarch64_sqdml<SBINQOPS:as>l2_laneq<mode>): ... Into this.
(aarch64_sqdmlal2_n<mode>): Merge this...
(aarch64_sqdmlsl2_n<mode>): ... And this...
(aarch64_sqdml<SBINQOPS:as>l2_n<mode>): ... Into this.
This patch removes the duplication between the mul_laneq<mode>3
and the older mul-lane patterns. The older patterns were previously
divided into two based on whether the indexed operand had the same mode
as the other operands or whether it had the opposite length from the
other operands (64-bit vs. 128-bit). However, it seemed easier to
divide them instead based on whether the indexed operand was 64-bit or
128-bit, since that maps directly to the arm_neon.h “q” conventions.
Also, it looks like the older patterns were missing cases for
V8HF<->V4HF combinations, which meant that vmul_laneq_f16 and
vmulq_lane_f16 didn't produce single instructions.
There was a typo in the V2SF entry for VCONQ, but in practice
no patterns were using that entry until now.
The test passes for both endiannesses, but endianness does change
the mapping between regexps and functions.
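For example, one of the previously-missing combinations (hypothetical,
needs FP16 support, e.g. -march=armv8.2-a+fp16); this should now compile
to a single fmul:

#include <arm_neon.h>

float16x4_t
mul_laneq (float16x4_t a, float16x8_t b)
{
  return vmul_laneq_f16 (a, b, 7);
}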
gcc/
* config/aarch64/iterators.md (VMUL_CHANGE_NLANES): Delete.
(VMULD): New iterator.
(VCOND): Handle V4HF and V8HF.
(VCONQ): Fix entry for V2SF.
* config/aarch64/aarch64-simd.md (mul_lane<mode>3): Use VMULD
instead of VMUL. Use a 64-bit vector mode for the indexed operand.
(*aarch64_mul3_elt_<vswap_width_name><mode>): Merge with...
(mul_laneq<mode>3): ...this define_insn. Use VMUL instead of VDQSF.
Use a 128-bit vector mode for the indexed operand. Use stype for
the scheduling type.
gcc/testsuite/
* gcc.target/aarch64/fmul_lane_1.c: New test.
Rewrite floating-point vml[as][q]_laneq Neon intrinsics to use RTL
builtins rather than relying on the GCC vector extensions. Using RTL
builtins allows control over the emission of fmla/fmls instructions
(which we don't want here.)
With this commit, the code generated by these intrinsics changes from
a fused multiply-add/subtract instruction to an fmul followed by an
fadd/fsub instruction. If the programmer really wants fmla/fmls
instructions, they can use the vfm[as] intrinsics.
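A hypothetical illustration of the new code generation (fmul followed by
fadd rather than a single fmla):

#include <arm_neon.h>

float32x2_t
mla_laneq (float32x2_t acc, float32x2_t a, float32x4_t b)
{
  return vmla_laneq_f32 (acc, a, b, 3);
}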
gcc/ChangeLog:
2021-02-17 Jonathan Wright <jonathan.wright@arm.com>
* config/aarch64/aarch64-simd-builtins.def: Add
float_ml[as][q]_laneq builtin generator macros.
* config/aarch64/aarch64-simd.md (mul_laneq<mode>3): Define.
(aarch64_float_mla_laneq<mode>): Define.
(aarch64_float_mls_laneq<mode>): Define.
* config/aarch64/arm_neon.h (vmla_laneq_f32): Use RTL builtin
instead of GCC vector extensions.
(vmlaq_laneq_f32): Likewise.
(vmls_laneq_f32): Likewise.
(vmlsq_laneq_f32): Likewise.
Rewrite floating-point vml[as][q]_lane Neon intrinsics to use RTL
builtins rather than relying on the GCC vector extensions. Using RTL
builtins allows control over the emission of fmla/fmls instructions
(which we don't want here.)
With this commit, the code generated by these intrinsics changes from
a fused multiply-add/subtract instruction to an fmul followed by an
fadd/fsub instruction. If the programmer really wants fmla/fmls
instructions, they can use the vfm[as] intrinsics.
gcc/ChangeLog:
2021-02-16 Jonathan Wright <jonathan.wright@arm.com>
* config/aarch64/aarch64-simd-builtins.def: Add
float_ml[as]_lane builtin generator macros.
* config/aarch64/aarch64-simd.md (*aarch64_mul3_elt<mode>):
Rename to...
(mul_lane<mode>3): This, and re-order arguments.
(aarch64_float_mla_lane<mode>): Define.
(aarch64_float_mls_lane<mode>): Define.
* config/aarch64/arm_neon.h (vmla_lane_f32): Use RTL builtin
instead of GCC vector extensions.
(vmlaq_lane_f32): Likewise.
(vmls_lane_f32): Likewise.
(vmlsq_lane_f32): Likewise.