The MMA build built-ins currently use individual lxv instructions to
load up the registers of a __vector_pair or __vector_quad. If the
memory addresses of the built-in operands refer to adjacent locations,
then in some cases we can use an lxvp to load two registers at once.
The patch below adds support for checking whether memory addresses are
adjacent and emitting an lxvp instead of two lxv instructions.
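A minimal sketch of the kind of source that benefits (assuming the
documented MMA build built-in signature; whether an lxvp is emitted
depends on the addresses being provably adjacent):

#include <altivec.h>

/* src[0] and src[1] are adjacent in memory, so the two lxv loads
   feeding the build built-in can become a single lxvp.  */
void
foo (__vector_pair *dst, vector unsigned char *src)
{
  __builtin_vsx_build_pair (dst, src[0], src[1]);
}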
2021-07-14 Peter Bergner <bergner@linux.ibm.com>
gcc/
* config/rs6000/rs6000.c (adjacent_mem_locations): Return the lower
addressed memory rtx, if any.
(rs6000_split_multireg_move): Fix code formatting.
Handle MMA build built-ins with operands in adjacent memory locations.
gcc/testsuite/
* gcc.target/powerpc/mma-builtin-9.c: New test.
An upcoming change to rs6000_split_multireg_move requires it to be
moved later in the file to fix a declaration issue.
2021-07-14 Peter Bergner <bergner@linux.ibm.com>
gcc/
* config/rs6000/rs6000.c (rs6000_split_multireg_move): Move to later
in the file.
Here during CTAD we're incorrectly treating T&& as a forwarding
reference even though T is a template parameter of the class template.
This happens because the template parameter T in the out-of-line
definition of the constructor doesn't have the flag
TEMPLATE_TYPE_PARM_FOR_CLASS set, and during duplicate_decls the
redeclaration (which is in terms of this unflagged T) prevails.
To fix this, we could perhaps be more consistent about setting the flag,
but it appears we don't really need this flag to make the determination.
Since the template parameters of a synthesized guide consist of the
template parameters of the class template followed by those of the
constructor (if any), it should suffice to look at the index of the
template parameter to determine whether it comes from the class
template or the constructor (template). This patch replaces the
TEMPLATE_TYPE_PARM_FOR_CLASS flag with this approach.
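A reduced example along the lines of the PR (hypothetical, not the
actual testcase):

template<class T>
struct Tuple {
  Tuple(T&&);      // T&& is a plain rvalue reference here: T belongs
};                 // to the class template, not the constructor

template<class T>
Tuple<T>::Tuple(T&&) { }   // out-of-line definition with the unflagged T

int i = 0;
Tuple t1(42);   // OK: deduces Tuple<int>
// Tuple t2(i); // should be rejected; treating T&& as a forwarding
                // reference would wrongly deduce Tuple<int&>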
PR c++/88252
gcc/cp/ChangeLog:
* cp-tree.h (TEMPLATE_TYPE_PARM_FOR_CLASS): Remove.
* pt.c (push_template_decl): Remove TEMPLATE_TYPE_PARM_FOR_CLASS
handling.
(redeclare_class_template): Likewise.
(forwarding_reference_p): Define.
(maybe_adjust_types_for_deduction): Use it instead. Add 'tparms'
parameter.
(unify_one_argument): Pass tparms to
maybe_adjust_types_for_deduction.
(try_one_overload): Likewise.
(unify): Likewise.
(rewrite_template_parm): Remove TEMPLATE_TYPE_PARM_FOR_CLASS
handling.
gcc/testsuite/ChangeLog:
* g++.dg/cpp1z/class-deduction96.C: New test.
The uses of vec<T> in get_all_loop_exits and process_conditional were memory
leaks, as .release() was never called for them. The other changes are some
cases that did have proper release handling, but it's simpler to leave
releasing to the auto_vec destructor.
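The idiom, in a minimal sketch (generic names, not code from the patch):

/* Before: manual lifetime management; any early return that skips the
   .release () call leaks the heap storage.  */
vec<basic_block> blocks = vNULL;
blocks.safe_push (bb);
blocks.release ();

/* After: auto_vec releases its storage in its destructor.  */
auto_vec<basic_block> blocks;
blocks.safe_push (bb);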
gcc/ChangeLog:
* sel-sched-ir.h (get_all_loop_exits): Use auto_vec.
gcc/cp/ChangeLog:
* class.c (struct find_final_overrider_data): Use auto_vec.
(find_final_overrider): Remove explicit release.
* coroutines.cc (process_conditional): Use auto_vec.
* cp-gimplify.c (struct cp_genericize_data): Use auto_vec.
(cp_genericize_tree): Remove explicit release.
* parser.c (cp_parser_objc_at_property_declaration): Use
auto_delete_vec.
* semantics.c (omp_reduction_lookup): Use auto_vec.
As I was discussing with richi, I don't think it makes sense to protect
calls to pure/const functions from DCE just because they aren't explicitly
declared noexcept. PR100382 indicates that there are different
considerations for Go, which has non-call exceptions, so we still turn
the flag off for that specific testcase.
gcc/c-family/ChangeLog:
* c-opts.c (c_common_post_options): Set -fdelete-dead-exceptions.
gcc/ChangeLog:
* doc/invoke.texi: -fdelete-dead-exceptions is on by default for
C++.
gcc/testsuite/ChangeLog:
* g++.dg/torture/pr100382.C: Pass -fno-delete-dead-exceptions.
The lines being removed have been updated and merged into a new
condition, but when resolving some conflicts I accidentally
reintroduced them, causing some test failures.
This removes them.
Committed as the changes were previously approved in
https://gcc.gnu.org/pipermail/gcc-patches/2021-July/574977.html
but the hunk was misapplied during a rebase.
gcc/ChangeLog:
* tree-vect-patterns.c (vect_recog_dot_prod_pattern):
Remove erroneous line.
gcc/testsuite/ChangeLog:
* gcc.dg/vect/vect-reduc-dot-11.c: Expect pass.
* gcc.dg/vect/vect-reduc-dot-15.c: Likewise.
* gcc.dg/vect/vect-reduc-dot-19.c: Likewise.
* gcc.dg/vect/vect-reduc-dot-21.c: Likewise.
This PR gave me a hard time: I saw multiple issues starting with
different revisions. But ultimately the root cause seems to be
the following, and the attached patch fixes all issues I've found
here.
In cxx_eval_array_reference we create a new constexpr context for the
CP_AGGREGATE_TYPE_P case, but we also have to create it for the
non-aggregate case. In this test, we are evaluating
((B *)this)->a = rhs->a
which means that we set ctx.object to ((B *)this)->a. Then we proceed
to evaluate the initializer, rhs->a. For *rhs, we eval rhs, a PARM_DECL,
for which we have (const B &) &c.arr[0] in the hash table. Then
cxx_fold_indirect_ref gives us c.arr[0]. c is evaluated to {.arr={}} so
c.arr is {}. Now we want c.arr[0], so we end up in cxx_eval_array_reference
and since we're initializing from {}, we call build_value_init which
gives us an AGGR_INIT_EXPR that calls 'constexpr B::B()'. Then we
evaluate this AGGR_INIT_EXPR and since its first argument is dummy,
we take ctx.object instead. But that is the wrong object, we're not
initializing ((B *)this)->a here. And so we wound up with an
initializer for A, and then crash in cxx_eval_component_reference:
gcc_assert (DECL_CONTEXT (part) == TYPE_MAIN_VARIANT (TREE_TYPE (whole)));
where DECL_CONTEXT (part) is B (as it should be) but the type of whole
was A.
So create a new object when there already was one and the element type
is not a scalar.
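A hypothetical sketch of the shape of code involved (not the actual
testcase):

struct A { int i = 0; };
struct B { constexpr B () {} A a; };
struct C { B arr[1]; };

constexpr int
f ()
{
  C c;            // c.arr evaluates to {}
  B b;
  b = c.arr[0];   // the copy assignment evaluates this->a = rhs->a,
                  // forcing value-initialization of c.arr[0]
  return b.a.i;
}
static_assert (f () == 0, "");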
PR c++/101371
gcc/cp/ChangeLog:
* constexpr.c (cxx_eval_array_reference): Create a new .object
and .ctor for the non-aggregate non-scalar case too when
value-initializing.
gcc/testsuite/ChangeLog:
* g++.dg/cpp1y/constexpr-101371-2.C: New test.
* g++.dg/cpp1y/constexpr-101371.C: New test.
The current RTL for the vectorizer dot-product patterns is incorrect:
operand 3 isn't an output parameter, so we can't write to it.
This fixes the issue and reduces the amount of RTL.
gcc/ChangeLog:
* config/aarch64/aarch64-simd-builtins.def (udot, sdot): Rename to...
(sdot_prod, udot_prod): ...These.
* config/aarch64/aarch64-simd.md (<sur>dot_prod<vsi2qi>): Remove.
(aarch64_<sur>dot<vsi2qi>): Rename to...
(<sur>dot_prod<vsi2qi>): ...This.
* config/aarch64/arm_neon.h (vdot_u32, vdotq_u32, vdot_s32, vdotq_s32):
Update builtins.
The RTL generated from <sup>dot_prod<vsi2qi> is invalid, as operand 3 cannot
be written to; it's a normal input. For the expander it's just another
operand, but the caller does not expect it to be written to.
gcc/ChangeLog:
* config/arm/neon.md (<sup>dot_prod<vsi2qi>): Drop statements.
This adds testcases to test for auto-vectorization detection of the new
sign-differing dot product.
gcc/ChangeLog:
* doc/sourcebuild.texi (arm_v8_2a_i8mm_neon_hw): Document.
gcc/testsuite/ChangeLog:
* lib/target-supports.exp
(check_effective_target_arm_v8_2a_imm8_neon_ok_nocache,
check_effective_target_arm_v8_2a_i8mm_neon_hw,
check_effective_target_vect_usdot_qi): New.
* gcc.dg/vect/vect-reduc-dot-9.c: New test.
* gcc.dg/vect/vect-reduc-dot-10.c: New test.
* gcc.dg/vect/vect-reduc-dot-11.c: New test.
* gcc.dg/vect/vect-reduc-dot-12.c: New test.
* gcc.dg/vect/vect-reduc-dot-13.c: New test.
* gcc.dg/vect/vect-reduc-dot-14.c: New test.
* gcc.dg/vect/vect-reduc-dot-15.c: New test.
* gcc.dg/vect/vect-reduc-dot-16.c: New test.
* gcc.dg/vect/vect-reduc-dot-17.c: New test.
* gcc.dg/vect/vect-reduc-dot-18.c: New test.
* gcc.dg/vect/vect-reduc-dot-19.c: New test.
* gcc.dg/vect/vect-reduc-dot-20.c: New test.
* gcc.dg/vect/vect-reduc-dot-21.c: New test.
* gcc.dg/vect/vect-reduc-dot-22.c: New test.
This adds optabs implementing usdot_prod.
The following testcase:
#define N 480
#define SIGNEDNESS_1 unsigned
#define SIGNEDNESS_2 signed
#define SIGNEDNESS_3 signed
#define SIGNEDNESS_4 unsigned
SIGNEDNESS_1 int __attribute__ ((noipa))
f (SIGNEDNESS_1 int res, SIGNEDNESS_3 char *restrict a,
SIGNEDNESS_4 char *restrict b)
{
for (__INTPTR_TYPE__ i = 0; i < N; ++i)
{
int av = a[i];
int bv = b[i];
SIGNEDNESS_2 short mult = av * bv;
res += mult;
}
return res;
}
Generates
f:
vmov.i32 q8, #0 @ v4si
add r3, r2, #480
.L2:
vld1.8 {q10}, [r2]!
vld1.8 {q9}, [r1]!
vusdot.s8 q8, q9, q10
cmp r3, r2
bne .L2
vadd.i32 d16, d16, d17
vpadd.i32 d16, d16, d16
vmov.32 r3, d16[0]
add r0, r0, r3
bx lr
instead of
f:
vmov.i32 q8, #0 @ v4si
add r3, r2, #480
.L2:
vld1.8 {q9}, [r2]!
vld1.8 {q11}, [r1]!
cmp r3, r2
vmull.s8 q10, d18, d22
vmull.s8 q9, d19, d23
vaddw.s16 q8, q8, d20
vaddw.s16 q8, q8, d21
vaddw.s16 q8, q8, d18
vaddw.s16 q8, q8, d19
bne .L2
vadd.i32 d16, d16, d17
vpadd.i32 d16, d16, d16
vmov.32 r3, d16[0]
add r0, r0, r3
bx lr
for NEON. I couldn't figure out whether the MVE instruction vmlaldav.s16
could be used to emulate this. Because it would require additional widening
to work, I left MVE out of this patch set, but perhaps someone should take
a look.
gcc/ChangeLog:
* config/arm/neon.md (usdot_prod<vsi2qi>): New.
gcc/testsuite/ChangeLog:
* gcc.target/arm/simd/vusdot-autovec.c: New test.
This patch adds support for a dot product where the signs of the
multiplication arguments differ, i.e. one is signed and one is unsigned
but the precisions are the same.
#define N 480
#define SIGNEDNESS_1 unsigned
#define SIGNEDNESS_2 signed
#define SIGNEDNESS_3 signed
#define SIGNEDNESS_4 unsigned
SIGNEDNESS_1 int __attribute__ ((noipa))
f (SIGNEDNESS_1 int res, SIGNEDNESS_3 char *restrict a,
SIGNEDNESS_4 char *restrict b)
{
for (__INTPTR_TYPE__ i = 0; i < N; ++i)
{
int av = a[i];
int bv = b[i];
SIGNEDNESS_2 short mult = av * bv;
res += mult;
}
return res;
}
The operations are performed as if the operands were extended to a 32-bit value.
As such this operation isn't valid if there is an intermediate conversion
to an unsigned value, i.e. if SIGNEDNESS_2 is unsigned.
Moreover, if the signs of SIGNEDNESS_3 and SIGNEDNESS_4 are flipped, the
same optab is used but the operands are swapped in the optab expansion.
To support this the patch extends the dot-product detection to optionally
ignore operands with different signs and stores this information in the optab
subtype which is now made a bitfield.
The subtype now additionally controls which optab an EXPR can expand to.
gcc/ChangeLog:
* optabs.def (usdot_prod_optab): New.
* doc/md.texi: Document it and clarify other dot prod optabs.
* optabs-tree.h (enum optab_subtype): Add optab_vector_mixed_sign.
* optabs-tree.c (optab_for_tree_code): Support usdot_prod_optab.
* optabs.c (expand_widen_pattern_expr): Likewise.
* tree-cfg.c (verify_gimple_assign_ternary): Likewise.
* tree-vect-loop.c (vectorizable_reduction): Query dot-product kind.
* tree-vect-patterns.c (vect_supportable_direct_optab_p): Take optional
optab subtype.
(vect_widened_op_tree): Optionally ignore
mismatched types.
(vect_recog_dot_prod_pattern): Support usdot_prod_optab.
UINTR is available only in 64-bit mode. Since the codegen target is
unknown when the gcc driver is processing -march=native, to properly
handle UINTR for -march=native:
1. Pass "arch [32|64]" and "tune [32|64]" to host_detect_local_cpu to
indicate 32-bit and 64-bit codegen.
2. Change ix86_option_override_internal to enable UINTR only in 64-bit
mode for -march=CPU when PTA_CPU includes PTA_UINTR.
gcc/
PR target/101395
* config/i386/driver-i386.c (host_detect_local_cpu): Check
"arch [32|64]" and "tune [32|64]" for 32-bit and 64-bit codegen.
Enable UINTR only for 64-bit codegen.
* config/i386/i386-options.c
(ix86_option_override_internal::DEF_PTA): Skip PTA_UINTR if not
in 64-bit mode.
* config/i386/i386.h (ARCH_ARG): New.
(CC1_CPU_SPEC): Pass "[arch|tune] 32" for 32-bit codegen and
"[arch|tune] 64" for 64-bit codegen.
gcc/testsuite/
PR target/101395
* gcc.target/i386/pr101395-1.c: New test.
* gcc.target/i386/pr101395-2.c: Likewise.
* gcc.target/i386/pr101395-3.c: Likewise.
This adds a conditional noexcept to the C++20 constructor. The
std::to_address call cannot throw, so only taking the difference of the
two iterators can throw.
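The constructor ends up along these lines (a sketch, with constraints
simplified relative to the actual header):

template<contiguous_iterator _It, sized_sentinel_for<_It> _End>
  constexpr
  basic_string_view(_It __first, _End __last)
  noexcept(noexcept(__last - __first))
  : _M_len(__last - __first), _M_str(std::to_address(__first))
  { }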
Signed-off-by: Jonathan Wakely <jwakely@redhat.com>
libstdc++-v3/ChangeLog:
* include/std/string_view (basic_string_view(It, End)): Add
noexcept-specifier.
* testsuite/21_strings/basic_string_view/cons/char/range.cc:
Check noexcept-specifier. Also check construction without CTAD.
The following fixes the IV adjustment for the gap in a negative-stride
SLP vectorization. The adjustment was in the wrong direction; the patch
now applies it in the correct direction.
2021-07-14 Richard Biener <rguenther@suse.de>
PR tree-optimization/101445
* tree-vect-stmts.c (vectorizable_load): Do the gap adjustment
of the IV in the correct direction for negative stride
accesses.
* gcc.dg/vect/pr101445.c: New testcase.
pot_dummy_types is a hash_set from whose traversal the code prints some type
lines. hash_set normally uses default_hash_traits which for pointer types
(the hash set hashes const char *) uses pointer_hash, which hashes the
pointer addresses except for the least significant 3 bits.
With address space randomization, that results in non-determinism in the
-fdump-go-specs= generated file, each invocation can have different order of
the lines emitted from pot_dummy_types traversal.
This patch fixes it by hashing the string contents instead, making the
hashes reproducible.
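A sketch of the shape of the fix (the new traits type hashes the string
contents rather than the pointer):

/* Hash const char * entries by their contents so traversal order does
   not depend on address space layout.  */
struct godump_str_hash : string_hash, ggc_remove<const char *> {};

hash_set<const char *, false, godump_str_hash> pot_dummy_types;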
2021-07-14 Jakub Jelinek <jakub@redhat.com>
PR go/101407
* godump.c (godump_str_hash): New type.
(godump_container::pot_dummy_types): Use string_hash instead of
ptr_hash in the hash_set.
The following adds support for re-using the vector reduction def
from the main loop in vectorized epilogue loops on architectures
which use different vector sizes for the epilogue. That's only
x86 as far as I am aware.
2021-07-13 Richard Biener <rguenther@suse.de>
* tree-vect-loop.c (vect_find_reusable_accumulator): Handle
vector types where the old vector type has a multiple of
the new vector type elements.
(vect_create_partial_epilog): New function, split out from...
(vect_create_epilog_for_reduction): ... here.
(vect_transform_cycle_phi): Reduce the re-used accumulator
to the new vector type.
* gcc.target/i386/vect-reduc-1.c: New testcase.
Odd-numbered indices describing argument access sizes in the fnspec
string can only hold 't' or a digit, as tested in the beginning of the
case. When checking that the size-supplying argument does not have
additional information associated with it, the test that excludes the
't' possibility looks for it at the even position in the fnspec
string. Oops.
This might yield false positives and negatives if a function has a
fnspec in which an argument uses a 't' access-size, and ('t' - '1')
happens to be the index of an argument described in an fnspec string.
Assuming ASCII encoding, 't' - '1' is 116 - 49 = 67, so it would take a
function with at least 68 arguments described in fnspec. Still, probably
worth fixing.
for gcc/ChangeLog
* tree-ssa-alias.c (attr_fnspec::verify): Fix index in
non-'t'-sized arg check.
If an artificial label created for a landing pad ends up being
dropped in favor of a user-supplied label, the user-supplied label
inherits the landing pad index, but the post_landing_pad field is not
adjusted to point to the new label.
This patch fixes the problem, and adds verification that we don't
remove a label that's still used as a landing pad.
The circumstance in which this problem can be hit was unusual: removal
of a block with an unreachable label moves the label to some other
unrelated block, in case its address is taken. In the case at hand
(pr42739.C, complicated by wrappers and cleanups), the chosen block
happened to be an EH landing pad. (A followup patch will change that.)
for gcc/ChangeLog
* tree-cfg.c (cleanup_dead_labels_eh): Update
post_landing_pad label upon change of landing pad block's
primary label.
(cleanup_dead_labels): Check that a removed label is not that
of a landing pad.
Add a new RTL simplification for the case of a VEC_SELECT selecting
the low part of a vector. The simplification returns a SUBREG.
The primary goal of this patch is to enable better combinations of
Neon RTL patterns - specifically allowing generation of 'write-to-
high-half' narrowing instructions.
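For example (a sketch; the register number is arbitrary), on a
little-endian target

(vec_select:SI (reg:V4SI 96)
               (parallel [(const_int 0)]))

selects the low part of the vector and can be simplified to

(subreg:SI (reg:V4SI 96) 0)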
Adding this RTL simplification means that the expected results for a
number of tests need to be updated:
* aarch64 Neon: Update the scan-assembler regex for intrinsics tests
to expect a scalar register instead of lane 0 of a vector.
* aarch64 SVE: Likewise.
* arm MVE: Use lane 1 instead of lane 0 for lane-extraction
intrinsics tests (as the move instructions get optimized away for
lane 0.)
This patch also adds new code generation tests to
narrow_high_combine.c to verify the benefit of this RTL
simplification.
gcc/ChangeLog:
2021-06-08 Jonathan Wright <jonathan.wright@arm.com>
* combine.c (combine_simplify_rtx): Add vec_select -> subreg
simplification.
* config/aarch64/aarch64.md (*zero_extend<SHORT:mode><GPI:mode>2_aarch64):
Add Neon to general purpose register case for zero-extend
pattern.
* config/arm/vfp.md (*arm_movsi_vfp): Remove "*" from *t -> r
case to prevent some cases opting to go through memory.
* cse.c (fold_rtx): Add vec_select -> subreg simplification.
* rtl.c (rtvec_series_p): Define predicate to determine
whether a vector contains a linear series of integers.
* rtl.h (rtvec_series_p): Define.
* rtlanal.c (vec_series_lowpart_p): Define predicate to
determine if a vector selection is equivalent to the low part
of the vector.
* rtlanal.h (vec_series_lowpart_p): Define.
* simplify-rtx.c (simplify_context::simplify_binary_operation_1):
Add vec_select -> subreg simplification.
gcc/testsuite/ChangeLog:
* gcc.target/aarch64/extract_zero_extend.c: Remove dump scan
for RTL pattern match.
* gcc.target/aarch64/narrow_high_combine.c: Add new tests.
* gcc.target/aarch64/simd/vmulx_laneq_f64_1.c: Update
scan-assembler regex to look for a scalar register instead of
lane 0 of a vector.
* gcc.target/aarch64/simd/vmulxd_laneq_f64_1.c: Likewise.
* gcc.target/aarch64/simd/vmulxs_lane_f32_1.c: Likewise.
* gcc.target/aarch64/simd/vmulxs_laneq_f32_1.c: Likewise.
* gcc.target/aarch64/simd/vqdmlalh_lane_s16.c: Likewise.
* gcc.target/aarch64/simd/vqdmlals_lane_s32.c: Likewise.
* gcc.target/aarch64/simd/vqdmlslh_lane_s16.c: Likewise.
* gcc.target/aarch64/simd/vqdmlsls_lane_s32.c: Likewise.
* gcc.target/aarch64/simd/vqdmullh_lane_s16.c: Likewise.
* gcc.target/aarch64/simd/vqdmullh_laneq_s16.c: Likewise.
* gcc.target/aarch64/simd/vqdmulls_lane_s32.c: Likewise.
* gcc.target/aarch64/simd/vqdmulls_laneq_s32.c: Likewise.
* gcc.target/aarch64/sve/dup_lane_1.c: Likewise.
* gcc.target/aarch64/sve/extract_1.c: Likewise.
* gcc.target/aarch64/sve/extract_2.c: Likewise.
* gcc.target/aarch64/sve/extract_3.c: Likewise.
* gcc.target/aarch64/sve/extract_4.c: Likewise.
* gcc.target/aarch64/sve/live_1.c: Update scan-assembler regex
cases to look for 'b' and 'h' registers instead of 'w'.
* gcc.target/arm/crypto-vsha1cq_u32.c: Update scan-assembler
regex to reflect lane 0 vector extractions being simplified
to scalar register moves.
* gcc.target/arm/crypto-vsha1h_u32.c: Likewise.
* gcc.target/arm/crypto-vsha1mq_u32.c: Likewise.
* gcc.target/arm/crypto-vsha1pq_u32.c: Likewise.
* gcc.target/arm/mve/intrinsics/vgetq_lane_f16.c: Extract
lane 1 as the moves for lane 0 now get optimized away.
* gcc.target/arm/mve/intrinsics/vgetq_lane_f32.c: Likewise.
* gcc.target/arm/mve/intrinsics/vgetq_lane_s16.c: Likewise.
* gcc.target/arm/mve/intrinsics/vgetq_lane_s32.c: Likewise.
* gcc.target/arm/mve/intrinsics/vgetq_lane_s8.c: Likewise.
* gcc.target/arm/mve/intrinsics/vgetq_lane_u16.c: Likewise.
* gcc.target/arm/mve/intrinsics/vgetq_lane_u32.c: Likewise.
* gcc.target/arm/mve/intrinsics/vgetq_lane_u8.c: Likewise.
Copy the test for _mm_testz_si128, _mm_testc_si128,
_mm_testnzc_si128, _mm_test_all_ones, _mm_test_all_zeros,
_mm_test_mix_ones_zeros from gcc/testsuite/gcc.target/i386.
2021-07-13 Paul A. Clarke <pc@us.ibm.com>
gcc/testsuite
* gcc.target/powerpc/sse4_1-ptest-1.c: Copy from
gcc/testsuite/gcc.target/i386.
The use of npos triggers a diagnostic as described in PR c++/101361.
This change replaces the use of npos with the exact length, which is
already known. We can further simplify it by inlining the effects of
compare and substr, avoiding the redundant range checks in the latter.
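The member then reduces to something like this (a sketch):

constexpr bool
ends_with(basic_string_view __x) const noexcept
{
  const auto __len = this->size();
  const auto __xlen = __x.size();
  return __len >= __xlen
    && traits_type::compare(end() - __xlen, __x.data(), __xlen) == 0;
}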
Signed-off-by: Jonathan Wakely <jwakely@redhat.com>
libstdc++-v3/ChangeLog:
PR c++/101361
* include/std/string_view (ends_with): Use traits_type::compare
directly.
Allow gimple_could_trap_p (which previously took a non-const gimple)
to be called from functions that take a const gimple (such as
gimple_has_side_effects), and update its prototypes. Pre-approved
as obvious.
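The prototypes become (a sketch of the gimple.h declarations):

extern bool gimple_could_trap_p_1 (const gimple *, bool, bool);
extern bool gimple_could_trap_p (const gimple *);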
2021-07-13 Roger Sayle <roger@nextmovesoftware.com>
Richard Biener <rguenther@suse.de>
gcc/ChangeLog
* gimple.c (gimple_could_trap_p_1): Make S argument a
"const gimple*". Preserve constness in call to
gimple_asm_volatile_p.
(gimple_could_trap_p): Make S argument a "const gimple*".
* gimple.h (gimple_could_trap_p_1, gimple_could_trap_p):
Update function prototypes.
When I added the new C++23 constructor I added a conditional include of
<bits/ranges_base.h>, which was already being included unconditionally.
This removes the unconditional include but changes the condition for the
other one, so it's used for C++20 as well.
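The include then looks something like this (a sketch):

#if __cplusplus >= 202002L
# include <bits/ranges_base.h>  // concepts used by the C++20/23 constructors
#endif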
Signed-off-by: Jonathan Wakely <jwakely@redhat.com>
libstdc++-v3/ChangeLog:
* include/std/string_view: Only include <bits/ranges_base.h>
once, and only for C++20 and later.
This patch adds support for reusing a main loop's reduction accumulator
in an epilogue loop. This in turn lets the loops share a single piece
of vector->scalar reduction code.
The patch has the following restrictions:
(1) The epilogue reduction can only operate on a single vector
(e.g. ncopies must be 1 for non-SLP reductions, and the group size
must be <= the element count for SLP reductions).
(2) Both loops must use the same vector mode for their accumulators.
This means that the patch is restricted to targets that support
--param vect-partial-vector-usage=1.
(3) The reduction must be a standard “tree code” reduction.
However, these restrictions could be lifted in future. For example,
if the main loop operates on 128-bit vectors and the epilogue loop
operates on 64-bit vectors, we could in future reduce the 128-bit
vector by one stage and use the 64-bit result as the starting point
for the epilogue result.
The patch tries to handle chained SLP reductions, unchained SLP
reductions and non-SLP reductions. It also handles cases in which
the epilogue loop is entered directly (rather than via the main loop)
and cases in which the epilogue loop can be skipped.
vect_get_main_loop_result is a bit more general than the current
patch needs.
gcc/
* tree-vectorizer.h (vect_reusable_accumulator): New structure.
(_loop_vec_info::main_loop_edge): New field.
(_loop_vec_info::skip_main_loop_edge): Likewise.
(_loop_vec_info::skip_this_loop_edge): Likewise.
(_loop_vec_info::reusable_accumulators): Likewise.
(_stmt_vec_info::reduc_scalar_results): Likewise.
(_stmt_vec_info::reused_accumulator): Likewise.
(vect_get_main_loop_result): Declare.
* tree-vectorizer.c (vec_info::new_stmt_vec_info): Initialize
reduc_scalar_inputs.
(vec_info::free_stmt_vec_info): Free reduc_scalar_inputs.
* tree-vect-loop-manip.c (vect_get_main_loop_result): New function.
(vect_do_peeling): Fill an epilogue loop's main_loop_edge,
skip_main_loop_edge and skip_this_loop_edge fields.
* tree-vect-loop.c (INCLUDE_ALGORITHM): Define.
(vect_emit_reduction_init_stmts): New function.
(get_initial_def_for_reduction): Use it.
(get_initial_defs_for_reduction): Likewise. Change the vinfo
parameter to a loop_vec_info.
(vect_create_epilog_for_reduction): Store the scalar results
in the reduc_info. If an epilogue loop is reusing an accumulator
from the main loop, and if the epilogue loop can also be skipped,
try to place the reduction code in the join block. Record
accumulators that could potentially be reused by epilogue loops.
(vect_transform_cycle_phi): When vectorizing epilogue loops,
try to reuse accumulators from the main loop. Record the initial
value in reduc_info for non-SLP reductions too.
gcc/testsuite/
* gcc.target/aarch64/sve/reduc_9.c: New test.
* gcc.target/aarch64/sve/reduc_9_run.c: Likewise.
* gcc.target/aarch64/sve/reduc_10.c: Likewise.
* gcc.target/aarch64/sve/reduc_10_run.c: Likewise.
* gcc.target/aarch64/sve/reduc_11.c: Likewise.
* gcc.target/aarch64/sve/reduc_11_run.c: Likewise.
* gcc.target/aarch64/sve/reduc_12.c: Likewise.
* gcc.target/aarch64/sve/reduc_12_run.c: Likewise.
* gcc.target/aarch64/sve/reduc_13.c: Likewise.
* gcc.target/aarch64/sve/reduc_13_run.c: Likewise.
* gcc.target/aarch64/sve/reduc_14.c: Likewise.
* gcc.target/aarch64/sve/reduc_14_run.c: Likewise.
* gcc.target/aarch64/sve/reduc_15.c: Likewise.
* gcc.target/aarch64/sve/reduc_15_run.c: Likewise.
After previous patches, we can now easily provide the neutral op
as an argument to get_initial_def_for_reduction. This in turn
allows the adjustment calculation to be moved outside of
get_initial_def_for_reduction, which is the main motivation
of the patch.
gcc/
* tree-vect-loop.c (get_initial_def_for_reduction): Remove
adjustment handling. Take the neutral value as an argument,
in place of the code argument.
(vect_transform_cycle_phi): Update accordingly. Handle the
initial values of cond reductions separately from code reductions.
Choose the adjustment here rather than in
get_initial_def_for_reduction. Sink the splat of vec_initial_def.
This patch generalises the interface to neutral_op_for_slp_reduction
so that it can be used for non-SLP reductions too. This isn't much
of a win on its own, but it helps later patches.
gcc/
* tree-vect-loop.c (neutral_op_for_slp_reduction): Replace with...
(neutral_op_for_reduction): ...this, providing a more general
interface.
(vect_create_epilog_for_reduction): Update accordingly.
(vectorizable_reduction): Likewise.
(vect_transform_cycle_phi): Likewise.
Similarly to the previous patch, this one passes the reduc_info
to get_initial_def_for_reduction, rather than a stmt_vec_info that
lacks the metadata. This again becomes useful later.
gcc/
* tree-vect-loop.c (get_initial_def_for_reduction): Take the
reduc_info instead of the original stmt_vec_info.
(vect_transform_cycle_phi): Update accordingly.
This patch passes the reduc_info to get_initial_defs_for_reduction,
so that the function can get general information from there rather
than from the first SLP statement. This isn't a win on its own,
but it becomes important with later patches.
gcc/
* tree-vect-loop.c (get_initial_defs_for_reduction): Take the
reduc_info as an additional parameter.
(vect_transform_cycle_phi): Update accordingly.
This patch adds a helper function called vect_phi_initial_value
for returning the incoming value of a given loop phi. The main
reason for adding it is to ensure that the right preheader edge
is used when vectorising nested loops. (PHI_ARG_DEF_FROM_EDGE
itself doesn't assert that the given edge is for the right block,
although I guess that would be good to add separately.)
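A sketch of the helper (assuming the usual loop and PHI accessors):

/* Return the incoming value of PHI on the loop preheader edge,
   checking that the edge really enters PHI's block.  */
inline tree
vect_phi_initial_value (gphi *phi)
{
  basic_block bb = gimple_bb (phi);
  edge pe = loop_preheader_edge (bb->loop_father);
  gcc_assert (pe->dest == bb);
  return PHI_ARG_DEF_FROM_EDGE (phi, pe);
}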
gcc/
* tree-vectorizer.h: Include tree-ssa-operands.h.
(vect_phi_initial_value): New function.
* tree-vect-loop.c (neutral_op_for_slp_reduction): Use it.
(get_initial_defs_for_reduction, info_for_reduction): Likewise.
(vect_create_epilog_for_reduction, vectorizable_reduction): Likewise.
(vect_transform_cycle_phi, vectorizable_induction): Likewise.
Vector reduction accumulators can differ in signedness from the
final scalar result. The conversions to handle that case were
distributed through vect_create_epilog_for_reduction; this patch
does the conversion up-front instead.
gcc/
* tree-vect-loop.c (vect_create_epilog_for_reduction): Convert
the phi results to vectype after creating them. Remove later
conversion code that thus becomes redundant.
vect_create_epilog_for_reduction had a variable called new_phis.
It collected the statements that produce the exit block definitions
of the vector reduction accumulators. Although those statements
are indeed phis initially, they are often replaced with normal
statements later, leading to puzzling code like:
FOR_EACH_VEC_ELT (new_phis, i, new_phi)
{
int bit_offset;
if (gimple_code (new_phi) == GIMPLE_PHI)
vec_temp = PHI_RESULT (new_phi);
else
vec_temp = gimple_assign_lhs (new_phi);
Also, although the array collects statements, in practice all users want
the lhs instead.
This patch therefore replaces new_phis with a vector of gimple values
called “reduc_inputs”.
Also, reduction chains and ncopies>1 were handled with identical code
(and there was a comment saying so). The patch unites them into
a single “if”.
gcc/
* tree-vect-loop.c (vect_create_epilog_for_reduction): Replace
the new_phis vector with a reduc_inputs vector. Combine handling
of reduction chains and ncopies > 1.
This patch constructs an array_slice of the scalar statements that
produce live-out reduction results in the original unvectorised loop.
There are three cases:
- SLP reduction chains: the final SLP stmt is live-out
- full SLP reductions: all SLP stmts are live-out
- non-SLP reductions: the single scalar stmt is live-out
This is a slight simplification on its own, mostly because it means
“group_size” has a consistent meaning throughout the function.
The main justification though is that it helps with later patches.
gcc/
* tree-vect-loop.c (vect_create_epilog_for_reduction): Truncate
scalar_results to group_size elements after reducing down from
N*group_size elements. Construct an array_slice of the live-out
stmts and assert that there is one stmt per scalar result.
vect_create_epilog_for_reduction only handles two cases: single-loop
reductions and double reductions. “nested cycles” (i.e. reductions
in the inner loop when vectorising an outer loop) are handled elsewhere
and don't need a vector->scalar reduction.
The function had variables called nested_in_vect_loop and double_reduc
and asserted that nested_in_vect_loop implied double_reduc, but it
still had code to handle nested_in_vect_loop && !double_reduc.
This patch removes that and uses double_reduc everywhere.
gcc/
* tree-vect-loop.c (vect_create_epilog_for_reduction): Remove
nested_in_vect_loop and use double_reduc everywhere. Remove dead
assignment to "loop".
-msve-vector-bits=128 causes the AArch64 port to list 128-bit Advanced
SIMD as the first-choice mode for vectorisation, with SVE being used for
things that Advanced SIMD can't handle as easily. However, ifcvt would
not then try to use SVE's predicated FP arithmetic, leading to tests
like TSVC ControlFlow-flt failing to vectorise.
The mask load/store code did try other vector modes, but could also be
improved to make sure that SVEness sticks when computing derived modes.
(Unlike mode_for_vector, related_vector_mode always returns a vector
mode, so there's no need to check VECTOR_MODE_P as well.)
gcc/
* internal-fn.c (vectorized_internal_fn_supported_p): Handle
vector types first. For scalar types, consider both the preferred
vector mode and the alternative vector modes.
* optabs-query.c (can_vec_mask_load_store_p): Use the same
structure as above, in particular using related_vector_mode
for modes provided by autovectorize_vector_modes.
gcc/testsuite/
* gcc.target/aarch64/sve/cond_arith_6.c: New test.
The following testcase is miscompiled because VN during cunrolli changes
the __bos argument from the address of a larger field to the address of a
smaller field, and so __builtin_object_size (, 1) then folds into a smaller
value than the actually available size.
copy_reference_ops_from_ref has a hack for this, but it was using
cfun->after_inlining as a check for whether the hack can be ignored, and
cunrolli is after_inlining.
This patch uses a property to make the check exact (set at the end of the
objsz pass that doesn't do insert_min_max_p) and additionally, based on
discussions in the PR, moves the objsz pass earlier, to just after IPA.
2021-07-13 Jakub Jelinek <jakub@redhat.com>
Richard Biener <rguenther@suse.de>
PR tree-optimization/101419
* tree-pass.h (PROP_objsz): Define.
(make_pass_early_object_sizes): Declare.
* passes.def (pass_all_early_optimizations): Rename pass_object_sizes
there to pass_early_object_sizes, drop parameter.
(pass_all_optimizations): Move pass_object_sizes right after pass_ccp,
drop parameter, move pass_post_ipa_warn right after that.
* tree-object-size.c (pass_object_sizes::execute): Rename to...
(object_sizes_execute): ... this. Add insert_min_max_p argument.
(pass_data_object_sizes): Move after object_sizes_execute.
(pass_object_sizes): Likewise. In execute method call
object_sizes_execute, drop set_pass_param method and insert_min_max_p
non-static data member and its initializer in the ctor.
(pass_data_early_object_sizes, pass_early_object_sizes,
make_pass_early_object_sizes): New.
* tree-ssa-sccvn.c (copy_reference_ops_from_ref): Use
(cfun->curr_properties & PROP_objsz) instead of cfun->after_inlining.
* gcc.dg/builtin-object-size-10.c: Pass -fdump-tree-early_objsz-details
instead of -fdump-tree-objsz1-details in dg-options and adjust names
of dump file in scan-tree-dump.
* gcc.dg/pr101419.c: New test.
sem.h is included in between # pragma GCC visibility push(hidden)
and # pragma GCC visibility pop and includes limits.h there, which
causes trouble since recent glibcs introduced a declaration of sysconf
in limits.h. libgomp assumes it is compiled by gcc,
so we don't really need to include limits.h there and can use
-__INT_MAX__ - 1 instead (which clang and icc have also supported for years).
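The replacement is a one-liner (a sketch; -__INT_MAX__ - 1 spells INT_MIN
without needing limits.h):

/* INT_MIN, expressed without including <limits.h> inside the
   hidden-visibility region.  */
#define SEM_WAIT (-__INT_MAX__ - 1)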
2021-07-13 Jakub Jelinek <jakub@redhat.com>
Florian Weimer <fweimer@redhat.com>
* config/linux/sem.h: Don't include limits.h.
(SEM_WAIT): Define to -__INT_MAX__ - 1 instead of INT_MIN.
* config/linux/affinity.c: Include limits.h.
This constraint was undocumented before, but it may be used in the Linux
kernel to resolve a code model issue, so the LLVM community suggested we
document it, making it a supported, documented, non-internal machine
constraint.
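A hypothetical example of the constraint in use (the asm body and symbol
are illustrative only; 'S' accepts a symbolic reference):

extern int ext_sym;

void *
get_addr (void)
{
  void *p;
  /* The "S" constraint passes the symbol itself to the asm, letting it
     pick the addressing sequence.  */
  __asm__ ("lla %0, %1" : "=r" (p) : "S" (&ext_sym));
  return p;
}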
gcc/ChangeLog:
PR target/101275
* config/riscv/constraints.md ("S"): Update description and remove
@internal.
* doc/md.texi (Machine Constraints): Document the 'S' constraints
for RISC-V.
I noticed that vec-splati-runnable.c did not have an abort call after one
of the tests. If the test was run with optimization, the optimizer could
delete some of the tests and throw off the count. However, because the
value being loaded in that test is undefined, I do not check which value
was loaded; I just store it into a volatile global variable.
2021-07-12 Michael Meissner <meissner@linux.ibm.com>
gcc/testsuite/
* gcc.target/powerpc/vec-splati-runnable.c: Run test with -O2
optimization. Do not check what XXSPLTIDP generates if the value
is undefined.
The function rs6000_const_f32_to_i32 called REAL_VALUE_TO_TARGET_SINGLE
with a long long type and returned the result. This patch changes the type
to long, which is the proper type for REAL_VALUE_TO_TARGET_SINGLE.
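The corrected function then has this shape (a sketch, not the verbatim
source):

/* REAL_VALUE_TO_TARGET_SINGLE stores its result into a long, so use
   and return long rather than long long.  */
long
rs6000_const_f32_to_i32 (rtx operand)
{
  long value;
  const struct real_value *rv = CONST_DOUBLE_REAL_VALUE (operand);
  REAL_VALUE_TO_TARGET_SINGLE (*rv, value);
  return value;
}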
2021-07-12 Michael Meissner <meissner@linux.ibm.com>
gcc/
* config/rs6000/altivec.md (xxspltiw_v4sf): Change local variable
value to long.
* config/rs6000/rs6000-protos.h (rs6000_const_f32_to_i32): Change
return type to long.
* config/rs6000/rs6000.c (rs6000_const_f32_to_i32): Change return
type to long.