* config/i386/avx512fintrin.h (_mm512_set_epi16, _mm512_set_epi8,
_mm512_setzero): New intrinsics.
* gcc.target/i386/avx512f-set-v32hi-1.c: New test.
* gcc.target/i386/avx512f-set-v32hi-2.c: New test.
* gcc.target/i386/avx512f-set-v32hi-3.c: New test.
* gcc.target/i386/avx512f-set-v32hi-4.c: New test.
* gcc.target/i386/avx512f-set-v32hi-5.c: New test.
* gcc.target/i386/avx512f-set-v64qi-1.c: New test.
* gcc.target/i386/avx512f-set-v64qi-2.c: New test.
* gcc.target/i386/avx512f-set-v64qi-3.c: New test.
* gcc.target/i386/avx512f-set-v64qi-4.c: New test.
* gcc.target/i386/avx512f-set-v64qi-5.c: New test.
* gcc.target/i386/avx512f-setzero-1.c: New test.
From-SVN: r260310
In the testcase in this patch we create an SLP vector with only two
elements. Our current vector initialisation code will first duplicate
the first element to both lanes, then overwrite the top lane with a new
value.
This duplication can be clunky and wasteful.
Better would be to simply use the fact that we will always be
overwriting the remaining bits, and move the first element to the correct
place (implicitly zeroing all other bits).
This reduces the code generated for this case, and can allow more
efficient addressing modes and other second-order benefits for AArch64
code that has been vectorized to V2DI mode.
Note that the change is generic enough to catch the case for any vector
mode, but is expected to be most useful for 2x64-bit vectorization.
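As a hedged illustration only (this is not the new testcase; the function
and names are made up), the kind of code affected builds a two-element
vector from two independent 64-bit scalars:

/* The two stores form a group that the SLP vectorizer can turn into a
   single V2DImode store whose vector is built from the two scalar
   registers.  */
void
f (unsigned long long *out, unsigned long long a, unsigned long long b, int n)
{
  for (int i = 0; i < n; i += 2)
    {
      out[i] = a;      /* becomes lane 0 of the vector */
      out[i + 1] = b;  /* becomes lane 1 of the vector */
    }
}

With the change, lane 0 can be set with a plain move that implicitly zeroes
the rest of the register, instead of duplicating 'a' into both lanes first.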
Unfortunately, on its own, this would cause failures in
gcc.target/aarch64/load_v2vec_lanes_1.c and
gcc.target/aarch64/store_v2vec_lanes.c, which expect to see many vec_merge
and vec_duplicate patterns for their simplifications to apply. To fix this,
we add a special case to the AArch64 code: if we are loading from two memory
addresses, use the load_pair_lanes patterns directly.
We also need a new pattern in simplify-rtx.c:simplify_ternary_operation
to catch:
(vec_merge:OUTER
(vec_duplicate:OUTER x:INNER)
(subreg:OUTER y:INNER 0)
(const_int N))
And simplify it to:
(vec_concat:OUTER x:INNER y:INNER) or (vec_concat y x)
This is similar to the existing patterns which are tested in this
function, without requiring the second operand to also be a vec_duplicate.
* config/aarch64/aarch64.c (aarch64_expand_vector_init): Modify
code generation for cases where splatting a value is not useful.
* simplify-rtx.c (simplify_ternary_operation): Simplify
vec_merge across a vec_duplicate and a paradoxical subreg forming a vector
mode to a vec_concat.
* gcc.target/aarch64/vect-slp-dup.c: New.
Co-Authored-By: Kyrylo Tkachov <kyrylo.tkachov@arm.com>
From-SVN: r260309
2018-05-17 Richard Biener <rguenther@suse.de>
PR tree-optimization/85757
* tree-ssa-dse.c (dse_classify_store): Record a PHI def and
remove defs that only feed that PHI from further processing.
* gcc.dg/tree-ssa/ssa-dse-34.c: New testcase.
From-SVN: r260306
DWARF5 defines a small header for .debug_str_offsets. Since we only use
it for split dwarf .dwo files, we don't need to keep track of the actual
index offset in an attribute.
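For reference, the header is small; a sketch of its layout for 32-bit DWARF
(illustrative struct only, not code from dwarf2out.c):

#include <stdint.h>

struct str_offsets_header
{
  uint32_t unit_length;  /* size of the data following this field */
  uint16_t version;      /* 5 */
  uint16_t padding;      /* 0 */
};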
gcc/ChangeLog
* dwarf2out.c (count_index_strings): New function.
(output_indirect_strings): Call count_index_strings and generate
header for dwarf_version >= 5.
From-SVN: r260298
We already emit DWARF5 attributes and tables for indirect addresses
and string offsets, but still use GNU forms. Add a new helper function
dwarf_FORM () for emitting the right form.
Currently we only use the uleb128 forms. But DWARF5 also allows
1-, 2-, 3- and 4-byte forms (DW_FORM_strx[1234] and DW_FORM_addrx[1234]),
which might be more space-efficient.
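A minimal sketch of what such a helper can look like (the cases handled and
the exact shape are assumptions, not the committed implementation):

static enum dwarf_form
dwarf_FORM (enum dwarf_form form)
{
  switch (form)
    {
    case DW_FORM_addrx:
      if (dwarf_version < 5)
        return DW_FORM_GNU_addr_index;
      break;
    case DW_FORM_strx:
      if (dwarf_version < 5)
        return DW_FORM_GNU_str_index;
      break;
    default:
      break;
    }
  return form;
}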
gcc/ChangeLog
* dwarf2out.c (dwarf_FORM): New function.
(set_indirect_string): Use dwarf_FORM.
(reset_indirect_string): Likewise.
(size_of_die): Likewise.
(value_format): Likewise.
(output_die): Likewise.
(add_skeleton_AT_string): Likewise.
(output_macinfo_op): Likewise.
(index_string): Likewise.
(output_index_string_offset): Likewise.
(output_index_string): Likewise.
From-SVN: r260297
gcc/ChangeLog:
2018-05-16 Carl Love <cel@us.ibm.com>
* config/rs6000/rs6000.md (prefetch): Generate ISA 2.06 instructions
dcbt and dcbtstt with TH=16 if operands[2] is 0 and Power 8 or newer.
From-SVN: r260296
gcc/testsuite/ChangeLog:
2018-05-16 Carl Love <cel@us.ibm.com>
* gcc.target/powerpc/vsx-vector-6-be.c: Remove file.
* gcc.target/powerpc/vsx-vector-6-be.p7.c: New test file.
* gcc.target/powerpc/vsx-vector-6-be.p8.c: New test file.
* gcc.target/powerpc/vsx-vector-6-le.c (dg-final): Update counts for
xvcmpeqdp., xvcmpgtdp., xvcmpgedp., xxlxor, xvrdpi.
From-SVN: r260294
This patch improves register allocation of fma by preferring to update the
accumulator register. This is done by adding fma insns with operand 1 as the
accumulator. The register allocator considers copy preferences only in operand
order, so if the first operand is dead, it has the highest chance of being
reused as the destination. As a result, code using fma often has a better
register allocation. Performance of SPECFP2017 improves by over 0.5% on some
implementations, while it had no effect on others. The fma code is more
readable too; in a simple example we now generate:
fmadd s16, s2, s1, s16
fmadd s7, s17, s16, s7
fmadd s6, s16, s7, s6
fmadd s5, s7, s6, s5
instead of:
fmadd s16, s16, s2, s1
fmadd s7, s7, s16, s6
fmadd s6, s6, s7, s5
fmadd s5, s5, s6, s4
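A hedged source-level sketch of the kind of dependent chain behind the code
above (the function and variable names are invented):

float
chain (float x, float y, float acc0, float acc1, float acc2, float acc3)
{
  acc0 += x * y;        /* each fma overwrites its own accumulator,  */
  acc1 += acc0 * y;     /* so if the accumulator is dead after the   */
  acc2 += acc0 * acc1;  /* statement its register can be reused as   */
  acc3 += acc1 * acc2;  /* the destination, avoiding extra moves.    */
  return acc3;
}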
gcc/
* config/aarch64/aarch64.md (fma<mode>4): Change into expand pattern.
(fnma<mode>4): Likewise.
(fms<mode>4): Likewise.
(fnms<mode>4): Likewise.
(aarch64_fma<mode>4): Rename insn, reorder accumulator operand.
(aarch64_fnma<mode>4): Likewise.
(aarch64_fms<mode>4): Likewise.
(aarch64_fnms<mode>4): Likewise.
(aarch64_fnmadd<mode>4): Likewise.
From-SVN: r260292
2018-05-16 Richard Biener <rguenther@suse.de>
* tree-vectorizer.h (struct stmt_info_for_cost): Add where member.
(dump_stmt_cost): Declare.
(add_stmt_cost): Dump cost we add.
(add_stmt_costs): New function.
(vect_model_simple_cost, vect_model_store_cost, vect_model_load_cost):
No longer exported.
(vect_analyze_stmt): Adjust prototype.
(vectorizable_condition): Likewise.
(vectorizable_live_operation): Likewise.
(vectorizable_reduction): Likewise.
(vectorizable_induction): Likewise.
* tree-vect-loop.c (vect_analyze_loop_operations): Create local
cost vector to pass to vectorizable_ and record afterwards.
(vect_model_reduction_cost): Take cost vector argument and adjust.
(vect_model_induction_cost): Likewise.
(vectorizable_reduction): Likewise.
(vectorizable_induction): Likewise.
(vectorizable_live_operation): Likewise.
* tree-vect-slp.c (vect_create_new_slp_node): Initialize
SLP_TREE_NUMBER_OF_VEC_STMTS.
(vect_analyze_slp_cost_1): Remove.
(vect_analyze_slp_cost): Likewise.
(vect_slp_analyze_node_operations): Take visited args and
a target cost vector. Avoid processing already visited stmt sets.
(vect_slp_analyze_operations): Use a local cost vector to gather
costs and register those of non-discarded instances.
(vect_bb_vectorization_profitable_p): Use add_stmt_costs.
(vect_schedule_slp_instance): Remove copying of
SLP_TREE_NUMBER_OF_VEC_STMTS. Instead assert that it is not
zero.
* tree-vect-stmts.c (record_stmt_cost): Remove path directly
adding cost. Record cost entry location.
(vect_prologue_cost_for_slp_op): Function to compute cost of
a constant or invariant generated for SLP vect in the prologue,
split out from vect_analyze_slp_cost_1.
(vect_model_simple_cost): Make static. Adjust for SLP costing.
(vect_model_promotion_demotion_cost): Likewise.
(vect_model_store_cost): Likewise, make static.
(vect_model_load_cost): Likewise.
(vectorizable_bswap): Add cost vector arg and adjust.
(vectorizable_call): Likewise.
(vectorizable_simd_clone_call): Likewise.
(vectorizable_conversion): Likewise.
(vectorizable_assignment): Likewise.
(vectorizable_shift): Likewise.
(vectorizable_operation): Likewise.
(vectorizable_store): Likewise.
(vectorizable_load): Likewise.
(vectorizable_condition): Likewise.
(vectorizable_comparison): Likewise.
(can_vectorize_live_stmts): Likewise.
(vect_analyze_stmt): Likewise.
(vect_transform_stmt): Adjust calls to vectorizable_*.
* tree-vectorizer.c: Include gimple-pretty-print.h.
(dump_stmt_cost): New function.
From-SVN: r260289
2018-05-16 Richard Biener <rguenther@suse.de>
* params.def (PARAM_DSE_MAX_ALIAS_QUERIES_PER_STORE): New param.
* doc/invoke.texi (dse-max-alias-queries-per-store): Document.
* tree-ssa-dse.c: Include tree-ssa-loop.h.
(check_name): New callback.
(dse_classify_store): Track cycles via a visited bitmap of PHI
defs and simplify handling of in-loop and across loop dead stores
and properly fail for loop-variant refs. Handle byte-tracking with
multiple defs. Use PARAM_DSE_MAX_ALIAS_QUERIES_PER_STORE for
limiting the walk.
* gcc.dg/tree-ssa/ssa-dse-32.c: New testcase.
* gcc.dg/tree-ssa/ssa-dse-33.c: Likewise.
* gcc.dg/uninit-pr81897-2.c: Use -fno-tree-dse.
From-SVN: r260288
The SLP unrolling factor is calculated by finding the smallest
scalar type for each SLP statement and taking the number of required
lanes from the vector versions of those scalar types. E.g. for an
int32->int64 conversion, it's the vector of int32s rather than the
vector of int64s that determines the unroll factor.
We rely on tree-vect-patterns.c to replace boolean operations like:
bool a, b, c;
a = b & c;
with integer operations of whatever the best size is in context.
E.g. if b and c are fed by comparisons of ints, a, b and c will become
the appropriate size for an int comparison. For most targets this means
that a, b and c will end up as int-sized themselves, but on targets like
SVE and AVX512 with packed vector booleans, they'll instead become a
small bitfield like :1, padded to a byte for memory purposes.
The SLP code would then take these scalar types and try to calculate
the vector type for them, causing the unroll factor to be much higher
than necessary.
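A hedged, invented example of the kind of statement involved (not one of the
new tests): the comparison results feed a boolean AND, and the scalar type
chosen for the boolean temporaries depends on the target:

void
f (const int *b, const int *c, bool *a, int n)
{
  for (int i = 0; i < n; i++)
    a[i] = (b[i] < 0) & (c[i] < 0);
}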
This patch tries to make the SLP code use the same approach as the
loop vectorizer, by splitting out the code that calculates the
statement vector type and the vector type that should be used for
the number of units.
2018-05-16 Richard Sandiford <richard.sandiford@linaro.org>
gcc/
* tree-vectorizer.h (vect_get_vector_types_for_stmt): Declare.
(vect_get_mask_type_for_stmt): Likewise.
* tree-vect-slp.c (vect_two_operations_perm_ok_p): New function,
split out from...
(vect_build_slp_tree_1): ...here. Use vect_get_vector_types_for_stmt
to determine the statement's vector type and the vector type that
should be used for calculating nunits. Deal with cases in which
the type has to be deferred.
(vect_slp_analyze_node_operations): Use vect_get_vector_types_for_stmt
and vect_get_mask_type_for_stmt to calculate STMT_VINFO_VECTYPE.
* tree-vect-loop.c (vect_determine_vf_for_stmt_1)
(vect_determine_vf_for_stmt): New functions, split out from...
(vect_determine_vectorization_factor): ...here.
* tree-vect-stmts.c (vect_get_vector_types_for_stmt)
(vect_get_mask_type_for_stmt): New functions, split out from
vect_determine_vectorization_factor.
gcc/testsuite/
* gcc.target/aarch64/sve/vcond_10.c: New test.
* gcc.target/aarch64/sve/vcond_10_run.c: Likewise.
* gcc.target/aarch64/sve/vcond_11.c: Likewise.
* gcc.target/aarch64/sve/vcond_11_run.c: Likewise.
From-SVN: r260287
PR lto/85583
* lto-partition.c (account_reference_p): Do not account
references from aliases; do not account references from
external initializers.
From-SVN: r260266
Constrain constructors and member functions of random number engines so
that functions taking seed sequences can only be called with types that
meet the seed sequence requirements.
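A hedged usage sketch (invented example) of what remains valid after the
change:

#include <random>

int
main ()
{
  std::seed_seq sq{1, 2, 3};
  std::mt19937 from_seq (sq);   // seed-sequence overload: still accepted
  std::mt19937 from_val (42u);  // result_type overload: unaffected
  // Types that do not meet the seed sequence requirements no longer
  // match the constructors and seed functions taking a seed sequence.
}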
PR libstdc++/85749
* include/bits/random.h (__detail::__is_seed_seq): New SFINAE helper.
(linear_congruential_engine, mersenne_twister_engine)
(subtract_with_carry_engine, discard_block_engine)
(independent_bits_engine, shuffle_order_engine): Use __is_seed_seq to
constrain function templates taking seed sequences.
* include/bits/random.tcc (linear_congruential_engine::seed(_Sseq&))
(mersenne_twister_engine::seed(_Sseq&))
(subtract_with_carry_engine::seed(_Sseq&)): Change return types to
match declarations.
* include/ext/random (simd_fast_mersenne_twister_engine): Use
__is_seed_seq to constrain function templates taking seed sequences.
* include/ext/random.tcc (simd_fast_mersenne_twister_engine::seed):
Change return type to match declaration.
* testsuite/26_numerics/random/discard_block_engine/cons/seed_seq2.cc:
New.
* testsuite/26_numerics/random/independent_bits_engine/cons/
seed_seq2.cc: New.
* testsuite/26_numerics/random/linear_congruential_engine/cons/
seed_seq2.cc: New.
* testsuite/26_numerics/random/mersenne_twister_engine/cons/
seed_seq2.cc: New.
* testsuite/26_numerics/random/pr60037-neg.cc: Adjust dg-error lineno.
* testsuite/26_numerics/random/shuffle_order_engine/cons/seed_seq2.cc:
New.
* testsuite/26_numerics/random/subtract_with_carry_engine/cons/
seed_seq2.cc: New.
* testsuite/ext/random/simd_fast_mersenne_twister_engine/cons/
seed_seq2.cc: New.
From-SVN: r260263
The correct definition seems to be has_root_directory() for all systems
we care about.
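A hedged usage sketch of the resulting behaviour on a POSIX-style system
(paths chosen for illustration):

#include <filesystem>
#include <cassert>

int
main ()
{
  std::filesystem::path abs ("/usr/bin");
  std::filesystem::path rel ("usr/bin");
  assert (abs.is_absolute ());  // has a root directory
  assert (rel.is_relative ());  // no root directory
}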
PR libstdc++/83891
* include/bits/fs_path.h (path::is_absolute()): Use same definition
for all operating systems.
* include/experimental/bits/fs_path.h (path::is_absolute()): Likewise.
* testsuite/27_io/filesystem/path/query/is_absolute.cc: New.
* testsuite/27_io/filesystem/path/query/is_relative.cc: Fix comment.
* testsuite/experimental/filesystem/path/query/is_absolute.cc: New.
From-SVN: r260259
The path::operator/=(const Source&) and path::append overloads were
still following the semantics of the Filesystem TS, not C++17. Only
the path::operator/=(const path&) overload was correct.
This change adds more tests for path::operator/=(const path&) and adds
new tests to verify that the other append operations have equivalent
behaviour.
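A hedged sketch of the C++17 behaviour the change implements (filenames
chosen for illustration, not taken from the testsuite):

#include <filesystem>
#include <cassert>

int
main ()
{
  std::filesystem::path p ("foo");
  p /= "/bar";           // C++17: the right operand is absolute, so it
  assert (p == "/bar");  // replaces p; the TS semantics gave "foo/bar".
}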
PR libstdc++/84159
* include/bits/fs_path.h (path::operator/=, path::append): Construct
temporary path before calling _M_append.
(path::_M_append): Change parameter to path and implement C++17
semantics.
* testsuite/27_io/filesystem/path/append/path.cc: Add helper function
and more examples from the standard.
* testsuite/27_io/filesystem/path/append/source.cc: New.
* testsuite/27_io/filesystem/path/decompose/filename.cc: Add comment.
* testsuite/27_io/filesystem/path/nonmember/append.cc: New.
From-SVN: r260255
2018-05-15 Richard Biener <rguenther@suse.de>
* tree-ssa-dse.c (dse_classify_store): Remove use_stmt parameter,
add by_clobber_p one. Change algorithm to collect all defs
representing uses we need to walk and try reducing them to
a single one before failing.
(dse_dom_walker::dse_optimize_stmt): Adjust.
* gcc.dg/tree-ssa/ssa-dse-31.c: New testcase.
From-SVN: r260253
For older DWARF and -gsplit-dwarf we want to emit DW_OP_GNU_addr_index
and DW_OP_GNU_const_index, but for DWARF5 we should use DW_OP_addrx
and DW_OP_constx.
gcc/ChangeLog:
* dwarf2out.c (dwarf_OP): Handle DW_OP_addrx and DW_OP_constx.
(size_of_loc_descr): Likewise.
(output_loc_operands): Likewise.
(output_loc_operands_raw): Likewise.
(dw_addr_op): Use dwarf_OP () for DW_OP_constx and DW_OP_addrx.
(resolve_addr_in_expr): Handle DW_OP_addrx and DW_OP_constx.
(hash_loc_operands): Likewise.
(compare_loc_operands): Likewise.
From-SVN: r260252
The length in the .debug_addr unit header was calculated using the number
of elements in the addr_index_table. This is wrong because the entries in
the table are refcounted and only those with a refcount > 0 are actually
put in the index. Add a helper function count_index_addrs to get the
correct number of addresses in the index.
gcc/ChangeLog:
* dwarf2out.c (count_index_addrs): New function.
(dwarf2out_finish): Use count_index_addrs to calculate addrs_length.
From-SVN: r260251
2018-05-15 Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org>
PR ipa/85734
* ipa-pure-const.c (warn_function_malloc): Pass value of known_finite param
as true in call to suggest_attribute.
testsuite/
* gcc.dg/ipa/pr85734.c: New test.
From-SVN: r260249
* tree.c (build_cp_fntype_variant): New.
(build_ref_qualified_type, build_exception_variant)
(strip_typedefs, cxx_copy_lang_qualifiers): Use it.
(cxx_type_hash_eq, cp_check_qualified_type): Check
TYPE_HAS_LATE_RETURN_TYPE.
(cp_build_type_attribute_variant): Check cxx_type_hash_eq.
(cp_build_qualified_type_real): No need to preserve C++ qualifiers.
* class.c (build_clone): Use cxx_copy_lang_qualifiers.
(adjust_clone_args): Likewise.
* decl.c (grokfndecl): Add late_return_type_p parameter. Use
build_cp_fntype_variant.
(grokdeclarator): Pass late_return_type_p to grokfndecl.
(check_function_type): Use cxx_copy_lang_qualifiers.
(static_fn_type): Use cxx_copy_lang_qualifiers.
* decl2.c (build_memfn_type, maybe_retrofit_in_chrg)
(cp_reconstruct_complex_type, coerce_new_type, coerce_delete_type)
(change_return_type): Use cxx_copy_lang_qualifiers.
* mangle.c (write_type): Use cxx_copy_lang_qualifiers.
* parser.c (cp_parser_lambda_declarator_opt): Represent an explicit
return type on the declarator like a normal trailing return type.
* pt.c (tsubst_function_type): Use build_cp_fntype_variant.
(copy_default_args_to_explicit_spec): Use cxx_copy_lang_qualifiers.
* typeck.c (merge_types): Use build_cp_fntype_variant.
From-SVN: r260238
For some reason I made both an @item and an @itemx for
-mreadonly-in-sdata. This fixes it.
* doc/invoke.texi (RS/6000 and PowerPC Options): Delete @itemx for
-mreadonly-in-sdata.
From-SVN: r260237
PR libstdc++/81256
* include/bits/fstream.tcc (basic_filebuf::close): Do not swallow
exceptions from _M_terminate_output().
* include/std/fstream (basic_filebuf::~basic_filebuf): Swallow any
exceptions from close().
* testsuite/27_io/basic_filebuf/close/81256.cc: New.
From-SVN: r260236
When the AESE, AESD, AESMC and AESIMC instructions are generated through the
appropriate arm_neon.h intrinsics, we really want to keep them together when
the AESE feeds into an AESMC and fusion is supported by the target CPU.
We have macro-fusion hooks and scheduling model forwarding paths defined to facilitate that.
It is, however, not always enough.
This patch adds another mechanism for doing that.
When we can detect during combine that the required dependency exists
(AESE -> AESMC, AESD -> AESIMC), we just keep them together with a combine
pattern throughout the rest of compilation.
We won't ever want to split them.
The testcases generate 4 AESE(D) instructions in a block followed by 4
AES(I)MC instructions that consume the corresponding results, and also add a
bunch of computations in-between so that the AESE and AESMC instructions are
not trivially back-to-back, thus exercising the compiler's ability to bring
them together.
With this patch all 4 pairs are fused, whereas before a couple of fusions
would be missed due to intervening arithmetic and memory instructions.
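A hedged sketch of the intrinsic-level pattern involved (simplified; the
actual testcases contain four such pairs plus extra computations):

#include <arm_neon.h>

/* An AESE whose result feeds an AESMC: the pair the new combine
   pattern keeps together for the rest of compilation.  */
uint8x16_t
aes_round (uint8x16_t data, uint8x16_t key)
{
  return vaesmcq_u8 (vaeseq_u8 (data, key));
}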
* config/aarch64/aarch64-simd.md (*aarch64_crypto_aese_fused):
New pattern.
(aarch64_crypto_aesd_fused): Likewise.
* gcc.target/aarch64/crypto-fuse-1.c: New test.
* gcc.target/aarch64/crypto-fuse-2.c: Likewise.
From-SVN: r260234
Remove the remaining uses of '*' from aarch64.md.
Using '*' in alternatives is typically incorrect as it tells the register
allocator to ignore those alternatives. Also add a missing '?' so we
prefer a floating point register for same-size int<->fp conversions.
gcc/
* config/aarch64/aarch64.md (mov<mode>): Remove '*' in alternatives.
(movsi_aarch64): Likewise.
(load_pairsi): Likewise.
(load_pairdi): Likewise.
(store_pairsi): Likewise.
(store_pairdi): Likewise.
(load_pairsf): Likewise.
(load_pairdf): Likewise.
(store_pairsf): Likewise.
(store_pairdf): Likewise.
(zero_extend): Likewise.
(trunc): Swap alternatives.
(fcvt_target): Add '?' to prefer w over r.
testsuite/
* gcc.target/aarch64/vmov_n_1.c: Update test.
* gcc.target/aarch64/vfp-1.c: Update test.
From-SVN: r260233
PR target/85756
* config/i386/i386.md: Disallow non-commutative arithmetic in
the last peephole2 for mem {+,-,&,|,^}= x; mem != 0 after cmpelim
optimization. Use COMMUTATIVE_ARITH_P test rather than != MINUS
in the peephole2 before it.
testsuite/ChangeLog:
* gcc.c-torture/execute/pr85756.c: New test.
From-SVN: r260231