Similar to AArch64, the Arm implementation of 128-bit atomics is broken.
For 128-bit atomics we rely on pthread barriers to correctly guard the address
in the pointer to get correct memory ordering. However for 128-bit atomics the
address under the lock is different from the original pointer.
This means that one of the values under the atomic operation is not protected
properly and so we fail when the user has requested sequential
consistency as there's no barrier to enforce this requirement.
As such users have resorted to adding an
#ifdef GCC
<emit barrier>
#endif
around the use of these atomics.
This corrects the issue by issuing a barrier only when __ATOMIC_SEQ_CST was
requested. I have hand verified that the barriers are inserted
for atomic seq cst.
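For illustration, a minimal sketch (not part of the patch) of the kind of 16-byte access that goes through libatomic's locked path and needs the extra barrier when sequential consistency is requested:
struct pair { unsigned long long lo, hi; };
struct pair shared;

struct pair
load_seq_cst (void)
{
  struct pair v;
  /* Dispatches to libatomic's 16-byte load on targets without a native
     128-bit atomic load.  */
  __atomic_load (&shared, &v, __ATOMIC_SEQ_CST);
  return v;
}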
libatomic/ChangeLog:
PR target/102218
* config/arm/host-config.h (pre_seq_barrier, post_seq_barrier,
pre_post_seq_barrier): Require barrier on __ATOMIC_SEQ_CST.
The AArch64 implementation of 128-bit atomics is broken.
For 128-bit atomics we rely on pthread barriers to correctly guard the address
in the pointer to get correct memory ordering. However for 128-bit atomics the
address under the lock is different from the original pointer.
This means that one of the values under the atomic operation is not protected
properly and so we fail when the user has requested sequential
consistency as there's no barrier to enforce this requirement.
As such users have resorted to adding an
#ifdef GCC
<emit barrier>
#endif
around the use of these atomics.
This corrects the issue by issuing a barrier only when __ATOMIC_SEQ_CST was
requested. To remedy this performance hit I think we should revisit using a
similar approach to out-line-atomics for the 128-bit atomics.
Note that I believe I need the empty file due to the include_next chain but
I am not entirely sure. I have hand verified that the barriers are inserted
for atomic seq cst.
libatomic/ChangeLog:
PR target/102218
* config/aarch64/aarch64-config.h: New file.
* config/aarch64/host-config.h: New file.
I've revisited the earlier two workarounds for dwarf2out_register_external_die
getting duplicate entries. It turns out that r11-525-g03d90a20a1afcb
added dref_queue pruning to lto_input_tree but decl reading uses that
to stream in DECL_INITIAL even when in the middle of SCC streaming.
When that SCC then gets thrown away we can end up with debug nodes
registered which isn't supposed to happen. The following adjusts
the DECL_INITIAL streaming to go the in-SCC way, using lto_input_tree_1,
since no SCCs are expected at this point, just refs.
PR lto/106540
PR lto/106334
* dwarf2out.cc (dwarf2out_register_external_die): Restore
original assert.
* lto-streamer-in.cc (lto_read_tree_1): Use lto_input_tree_1
to input DECL_INITIAL, avoiding committing drefs.
Since this testcase is not exactly SSA specific and it would
be a good idea to compile it at more than just -O1, moving
it to gcc.c-torture/compile accomplishes that.
Committed as obvious after a test on x86_64-linux-gnu.
gcc/testsuite/ChangeLog:
* gcc.dg/tree-ssa/pr93776.c: Moved to...
* gcc.c-torture/compile/pr93776.c: ...here.
Adding -march=cascadelake to the command line options of the new cmpti2.c
testcase triggers TImode STV and produces vector code that doesn't match
the scalar implementation that this test was intended to check. Adding
-mno-stv to the options fixes this. Committed as obvious.
2022-08-07 Roger Sayle <roger@nextmovesoftware.com>
gcc/testsuite/ChangeLog
* gcc.target/i386/cmpti2.c: Add -mno-stv to dg-options.
We claim we support P0415R1 (constexpr complex), but e.g.
#include <complex>
constexpr bool
foo ()
{
std::complex<double> a (1.0, 2.0);
a += 3.0;
a.real (6.0);
return a.real () == 6.0 && a.imag () == 2.0;
}
static_assert (foo ());
fails with
test.C:12:20: error: non-constant condition for static assertion
12 | static_assert (foo ());
| ~~~~^~
test.C:12:20: in ‘constexpr’ expansion of ‘foo()’
test.C:8:10: in ‘constexpr’ expansion of ‘a.std::complex<double>::real(6.0e+0)’
test.C:12:20: error: modification of ‘__real__ a.std::complex<double>::_M_value’ is not a constant expression
The problem is we don't handle REALPART_EXPR and IMAGPART_EXPR
in cxx_eval_store_expression.
The following patch attempts to support it (with a requirement
that those are the outermost expressions; ARRAY_REF/COMPONENT_REF
etc. are just not possible on the result of these, and BIT_FIELD_REF
would be theoretically possible if trying to extract some bits
from one part of a complex int, but I don't see how it could appear
in the FE trees).
For these references, the code handles value being COMPLEX_CST,
COMPLEX_EXPR or CONSTRUCTOR_NO_CLEARING empty CONSTRUCTOR (what we use
to represent uninitialized values for C++20 and later) and the
code starts by rewriting it to COMPLEX_EXPR, so that we can freely
adjust the individual parts and later on possibly optimize it back
to COMPLEX_CST if both halves are constant.
2022-08-07 Jakub Jelinek <jakub@redhat.com>
PR c++/88174
* constexpr.cc (cxx_eval_store_expression): Handle REALPART_EXPR
and IMAGPART_EXPR. Change ctors from releasing_vec to
auto_vec<tree *>, adjust all uses. For !preeval, update ctors
vector.
* g++.dg/cpp1y/constexpr-complex1.C: New test.
This patch tweaks i386.md's *cmp<dwi>_doubleword splitter's predicate to
allow general_operand, not just x86_64_hilo_general_operand, to improve
code generation. As a general rule, i386.md's _doubleword splitters should
be post-reload splitters that require integer immediate operands to be
x86_64_hilo_int_operand, so that each part is a valid word mode immediate
constant. As an exception to this rule, doubleword patterns that must be
split before reload, because they require additional scratch registers,
can take advantage of this ability to create new pseudos, to accept
any immediate constant, and call force_reg on the high and/or low parts
if they are not suitable immediate operands in word mode.
The benefit is shown in the new cmpti3.c test case below.
__int128 x;
int foo()
{
__int128 t = 0x1234567890abcdefLL;
return x == t;
}
where GCC with -O2 currently generates:
movabsq $1311768467294899695, %rax
xorl %edx, %edx
xorq x(%rip), %rax
xorq x+8(%rip), %rdx
orq %rdx, %rax
sete %al
movzbl %al, %eax
ret
but with this patch now generates:
movabsq $1311768467294899695, %rax
xorq x(%rip), %rax
orq x+8(%rip), %rax
sete %al
movzbl %al, %eax
ret
2022-08-07 Roger Sayle <roger@nextmovesoftware.com>
gcc/ChangeLog
* config/i386/i386.md (*cmp<dwi>_doubleword): Change predicate
from x86_64_hilo_general_operand to general_operand. Call
force_reg on parts that are not x86_64_immediate_operand.
gcc/testsuite/ChangeLog
* gcc.target/i386/cmpti1.c: New test case.
* gcc.target/i386/cmpti2.c: Likewise.
* gcc.target/i386/cmpti3.c: Likewise.
This patch adds a new warning to -fanalyzer for jumps through NULL
function pointers.
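For illustration, a minimal sketch (not the committed function-ptr-5.c test) of the kind of code the new warning is meant to diagnose:
void (*handler) (void);

void
test (void)
{
  handler = 0;
  /* The analyzer can prove handler is NULL here, so this call would be
     flagged by -Wanalyzer-jump-through-null.  */
  handler ();
}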
gcc/analyzer/ChangeLog:
PR analyzer/105947
* analyzer.opt (Wanalyzer-jump-through-null): New option.
* engine.cc (class jump_through_null): New.
(exploded_graph::process_node): Complain about jumps through NULL
function pointers.
gcc/ChangeLog:
PR analyzer/105947
* doc/invoke.texi: Add -Wanalyzer-jump-through-null.
gcc/testsuite/ChangeLog:
PR analyzer/105947
* gcc.dg/analyzer/function-ptr-5.c: New test.
Signed-off-by: David Malcolm <dmalcolm@redhat.com>
This patch to the middle-end's RTL expansion reorders the code in
emit_store_flag_1 so that the backend has more control over how best
to expand/split double word equality/inequality comparisons against
zero or minus one. With the current implementation, the middle-end
always decides to lower this idiom during RTL expansion using SUBREGs
and word mode instructions, without ever consulting the backend's
machine description. Hence on x86_64, a TImode comparison against zero
is always expanded as:
(parallel [
(set (reg:DI 91)
(ior:DI (subreg:DI (reg:TI 88) 0)
(subreg:DI (reg:TI 88) 8)))
(clobber (reg:CC 17 flags))])
(set (reg:CCZ 17 flags)
(compare:CCZ (reg:DI 91)
(const_int 0 [0])))
This patch, which makes no changes to the code itself, simply reorders
the clauses in emit_store_flag_1 so that the middle-end first attempts
expansion using the target's doubleword mode cstore optab/expander,
and only if this fails, falls back to lowering to word mode operations.
On x86_64, this allows the expander to produce:
(set (reg:CCZ 17 flags)
(compare:CCZ (reg:TI 88)
(const_int 0 [0])))
which is a candidate for scalar-to-vector transformations (and
combine simplifications etc.). On targets that don't define a cstore
pattern for doubleword integer modes, there should be no change in
behaviour. For those that do, the current behaviour can be restored
(if desired) by restricting the expander/insn to not apply when the
comparison is EQ or NE, and operand[2] is either const0_rtx or
constm1_rtx.
This change just keeps RTL expansion more consistent (in philosophy).
For other doubleword comparisons, such as with operators LT and GT,
or with constants other than zero or -1, the wishes of the backend
are respected, and only if the optab expansion fails are the default
fall-back implementations using narrower integer mode operations
(and conditional jumps) used.
2022-08-05 Roger Sayle <roger@nextmovesoftware.com>
gcc/ChangeLog
* expmed.cc (emit_store_flag_1): Move code to expand double word
equality and inequality against zero or -1, using word operations,
to after trying to use the backend's cstore<mode>4 optab/expander.
This excludes value_replacement and store_elim from diamonds as they don't
handle the form properly.
gcc/ChangeLog:
PR middle-end/106534
* tree-ssa-phiopt.cc (tree_ssa_phiopt_worker): Guard the
value_replacement and store_elim from diamonds.
This fixes odd SUCCEEDED dumps from the backthreader registry that
can happen even though register_jump_thread cancelled the thread
as invalid.
* tree-ssa-threadbackward.cc (back_threader::maybe_register_path):
Check whether the registry register_path rejected the path.
(back_threader_registry::register_path): Return whether
register_jump_thread succeeded.
An unsupported_range temporary is instantiated in every Value_Range
for completeness' sake and should be mostly a NOP. However, it's
showing up in the callgrind stats, because it's not inline. This
fixes the oversight.
PR tree-optimization/106514
gcc/ChangeLog:
* value-range.cc (unsupported_range::unsupported_range): Move...
* value-range.h (unsupported_range::unsupported_range): ...here.
(unsupported_range::set_undefined): New.
Loop distribution currently gives up if the outer loop of a loop
nest it analyzes contains a stmt with side-effects instead of
continuing to analyze the innermost loop. The following fixes that
by continuing anyway.
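A rough sketch (not the new testcase) of the shape in question, where the printf gives the outer loop side-effects but the innermost loop is still distributable:
extern int printf (const char *, ...);

void
f (int *a, int *b, int n)
{
  for (int i = 0; i < n; i++)
    {
      printf ("outer %d\n", i);   /* side-effect in the outer loop */
      for (int j = 0; j < n; j++)
        {
          a[j] = 0;               /* inner loop still splits into two memset-like loops */
          b[j] = 1;
        }
    }
}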
PR tree-optimization/106533
* tree-loop-distribution.cc (loop_distribution::execute): Continue
analyzing the inner loops when find_seed_stmts_for_distribution
fails.
* gcc.dg/tree-ssa/ldist-39.c: New testcase.
Set the return value to 0 when modulo is supported, and to 1 when not supported.
gcc/testsuite/
* lib/target-supports.exp (check_p9modulo_hw_available): Correct return
value.
The problem here was a disconnect between the splittable_const_int_operand
predicate and the function riscv_build_integer_1 for 32 bits with zbs enabled.
The splittable_const_int_operand predicate had a check for TARGET_64BIT which
was not needed so this patch removed it.
Committed as obvious after a build for riscv32-elf configured with --with-arch=rv32imac_zba_zbb_zbc_zbs.
Thanks,
Andrew Pinski
gcc/ChangeLog:
* config/riscv/predicates.md (splittable_const_int_operand):
Remove the check for TARGET_64BIT for single bit const values.
The P2499R0 paper was recently approved for C++23.
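A minimal illustration (not the committed tests) of the effect, assuming C++23 and a hypothetical consumer function:
#include <string_view>
#include <vector>

void take (std::string_view sv);          // hypothetical consumer

void
demo (const std::vector<char>& v)
{
  // take (v);                            // now ill-formed: the range constructor is explicit
  take (std::string_view (v));            // explicit construction still works
}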
libstdc++-v3/ChangeLog:
* include/std/string_view (basic_string_view(Range&&)): Add
explicit as per P2499R0.
* testsuite/21_strings/basic_string_view/cons/char/range_c++20.cc:
Adjust implicit conversions. Check implicit conversions fail.
* testsuite/21_strings/basic_string_view/cons/wchar_t/range_c++20.cc:
Likewise.
This library defect was recently approved for C++23.
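A minimal sketch (not the committed lwg3719.cc tests) of what the change enables in C++23 mode:
#include <filesystem>
#include <iterator>

bool
is_end (const std::filesystem::directory_iterator& it)
{
  // directory_iterator can now be compared directly against std::default_sentinel.
  return it == std::default_sentinel;
}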
libstdc++-v3/ChangeLog:
* include/bits/fs_dir.h (directory_iterator): Add comparison
with std::default_sentinel_t. Remove redundant operator!= for
C++20.
(recursive_directory_iterator): Likewise.
* include/bits/iterator_concepts.h [!__cpp_lib_concepts]
(default_sentinel_t, default_sentinel): Define even if concepts
are not supported.
* include/bits/regex.h (regex_iterator): Add comparison with
std::default_sentinel_t. Remove redundant operator!= for C++20.
(regex_token_iterator): Likewise.
(regex_token_iterator::_M_end_of_seq()): Add noexcept.
* testsuite/27_io/filesystem/iterators/lwg3719.cc: New test.
* testsuite/28_regex/iterators/regex_iterator/lwg3719.cc:
New test.
* testsuite/28_regex/iterators/regex_token_iterator/lwg3719.cc:
New test.
compute_ranges_in_block loops over the import list and then checks the
same bit in exports. It is more efficient to loop over the intersection
of the two bitmaps.
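An illustrative fragment, assuming GCC's bitmap.h API; the names (m_imports, m_exports, process_name) are placeholders, not the actual code in gimple-range-path.cc:
bitmap_iterator bi;
unsigned i;

/* Before: walk all imports and test each bit in exports.  */
EXECUTE_IF_SET_IN_BITMAP (m_imports, 0, i, bi)
  if (bitmap_bit_p (m_exports, i))
    process_name (i);   /* placeholder for the per-name work */

/* After: walk only the intersection of the two bitmaps.  */
EXECUTE_IF_AND_IN_BITMAP (m_imports, m_exports, 0, i, bi)
  process_name (i);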
PR tree-optimization/106514
* gimple-range-path.cc (path_range_query::compute_ranges_in_block):
Use EXECUTE_IF_AND_IN_BITMAP to loop over 2 bitmaps.
This adds a match.pd rule that drops the bitwise nots when both arguments to a
subtract are inverted, i.e. for:
float g(float a, float b)
{
return ~(int)a - ~(int)b;
}
we instead generate
float g(float a, float b)
{
return (int)b - (int)a;
}
This is valid because ~x = -x - 1, so ~a - ~b = (-a - 1) - (-b - 1) = b - a.
We already do a limited version of this in the fold_binary fold functions, but
this makes a more general version in match.pd that applies more often.
gcc/ChangeLog:
* match.pd: New bit_not rule.
gcc/testsuite/ChangeLog:
* gcc.dg/subnot.c: New test.
For the diamond PHI form in tree_ssa_phiopt_worker we need to
extract edge e2 sooner. This changes it so we extract it at the
same time we determine we have a diamond shape.
gcc/ChangeLog:
PR middle-end/106519
* tree-ssa-phiopt.cc (tree_ssa_phiopt_worker): Check final phi edge for
diamond shapes.
gcc/testsuite/ChangeLog:
PR middle-end/106519
* gcc.dg/pr106519.c: New test.
This patch adds a new optimization to match.pd. The pattern, -x & 1,
now gets simplified to x & 1, reducing the number of instructions
produced.
This patch also adds tests for the optimization rule.
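Illustrative only (not the committed pr106243 tests); both functions should now compile to the same code:
/* Negation preserves the low bit, so -x & 1 is equivalent to x & 1.  */
int f (int x) { return -x & 1; }
int g (int x) { return x & 1; }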
Bootstrapped/regtested on x86_64-pc-linux-gnu.
PR tree-optimization/106243
gcc/ChangeLog:
* match.pd (-x & 1): New simplification.
gcc/testsuite/ChangeLog:
* gcc.dg/pr106243-1.c: New test.
* gcc.dg/pr106243.c: New test.
The LC SSA rewrite performs SSA verification at start but the VN
run performed on the unrolled-and-jammed body can leave us with
invalid SSA form until CFG cleanup is run. So make sure we do that
before rewriting into LC SSA.
PR tree-optimization/106521
* gimple-loop-jam.cc (tree_loop_unroll_and_jam): Perform
CFG cleanup manually before rewriting into LC SSA.
* gcc.dg/torture/pr106521.c: New testcase.
I've tried to understand how the greedy search works, given the
bitmap dances and the split into resolve_phi. I've summarized
the intent of the algorithm as
// For further greedy searching we want to remove interesting
// names defined in BB but add ones on the PHI edges for the
// respective edges.
but the implementation differs in detail. In particular when
there is more than one interesting PHI in BB it seems to only consider
the first for translating defs across edges. It also only applies
the loop crossing restriction when there is an interesting PHI.
The following preserves the loop crossing restriction to the case
of interesting PHIs but merges resolve_phi back, changing interesting
as outlined with the intent above. It should get more threading
cases when there are multiple interesting PHI defs in a block.
It might be a bit faster due to less bitmap operations but in the
end the main intent was to make what happens more obvious.
* tree-ssa-threadbackward.cc (populate_worklist): Remove.
(back_threader::resolve_phi): Likewise.
(back_threader::find_paths_to_names): Rewrite greedy search.
The P2549R1 paper was accepted for C++23. I already implemented it for
our <expected>, but I didn't rename the private data members, only the
public member functions. This renames the data members for consistency
with the working draft.
libstdc++-v3/ChangeLog:
* include/std/expected (unexpected::_M_val): Rename to _M_unex.
(bad_expected_access::_M_val): Likewise.
My P2467R1 proposal was accepted for C++23 so there's an official value
for this macro now.
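A minimal usage sketch (not from the patch) showing how code can test for the final value:
#include <version>
#include <fstream>

#if defined(__cpp_lib_ios_noreplace) && __cpp_lib_ios_noreplace >= 202207L
std::ofstream
open_new (const char* name)
{
  // noreplace makes the open fail if the file already exists.
  return std::ofstream (name, std::ios::out | std::ios::noreplace);
}
#endif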
libstdc++-v3/ChangeLog:
* include/bits/ios_base.h (__cpp_lib_ios_noreplace): Update
value to 202207L.
* include/std/version (__cpp_lib_ios_noreplace): Likewise.
* testsuite/27_io/basic_ofstream/open/char/noreplace.cc: Check
for new value.
* testsuite/27_io/basic_ofstream/open/wchar_t/noreplace.cc:
Likewise.
When using a mutex and condition variable, the notifying thread needs to
increment _M_ver while holding the mutex lock, and the waiting thread
needs to re-check after locking the mutex. This avoids a missed
notification as described in the PR.
By moving the increment of _M_ver to the base _M_notify we can make the
use of the mutex local to the use of the condition variable, and
simplify the code a little. We can use a relaxed store because the mutex
already provides sequential consistency. Also we don't need to check
whether __addr == &_M_ver because we know that's always true for
platforms that use a condition variable, and so we also know that we
always need to use notify_all() not notify_one().
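A generic sketch of the pattern being fixed (not the libstdc++ code itself): the notifier bumps the counter under the mutex and the waiter re-checks it after locking, so a notification between the waiter's first check and its wait cannot be lost:
#include <condition_variable>
#include <mutex>

struct waiter
{
  std::mutex mtx;
  std::condition_variable cv;
  unsigned ver = 0;

  void notify ()
  {
    { std::lock_guard<std::mutex> lk (mtx); ++ver; }
    cv.notify_all ();                          // all waiters share one counter
  }

  void wait (unsigned old)
  {
    std::unique_lock<std::mutex> lk (mtx);
    cv.wait (lk, [&]{ return ver != old; });   // re-check under the lock
  }
};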
Reviewed-by: Thomas Rodgers <trodgers@redhat.com>
libstdc++-v3/ChangeLog:
PR libstdc++/106183
* include/bits/atomic_wait.h (__waiter_pool_base::_M_notify):
Move increment of _M_ver here.
[!_GLIBCXX_HAVE_PLATFORM_WAIT]: Lock mutex around increment.
Use relaxed memory order and always notify all waiters.
(__waiter_base::_M_do_wait) [!_GLIBCXX_HAVE_PLATFORM_WAIT]:
Check value again after locking mutex.
(__waiter_base::_M_notify): Remove increment of _M_ver.
The tuple pretty printer uses 1-based indices, which is quite confusing
considering that access to the same values with the std::get functions
uses 0-based indices. This patch changes the pretty printer since
this is not a guaranteed API.
libstdc++-v3/ChangeLog:
* python/libstdcxx/v6/printers.py (class StdTuplePrinter): Use
zero-based indices just like std::get takes.
dg.exp=pr104612.c fails with an ICE on s390x, because copysignv2sf3
produces an insn that vsel<mode> is supposed to recognize, but can't,
because it's not defined for V2SF. Fix by defining it for all vector
modes supported by copysign<mode>3.
gcc/ChangeLog:
* config/s390/vector.md (V_HW_FT): New iterator.
* config/s390/vx-builtins.md (vsel<mode>): Use V_HW_FT instead
of V_HW.
Testing has shown that using the load vector pair and store vector pair
instructions for block moves has some performance issues on power10.
A patch on June 11th modified the code so that GCC would not set
-mblock-ops-vector-pair by default if we are tuning for power10, but it would
set the option if we were tuning for a different machine and have load and store
vector pair instructions enabled.
This patch eliminates the code setting -mblock-ops-vector-pair. If you want to
generate load vector pair and store vector pair instructions for block moves,
you must use -mblock-ops-vector-pair.
2022-08-03 Michael Meissner <meissner@linux.ibm.com>
gcc/
* config/rs6000/rs6000.cc (rs6000_option_override_internal): Remove code
setting -mblock-ops-vector-pair.
When killing a def in the path ranger, there is no need to walk the set
of existing equivalences clearing bits. An equivalence match requires
that both ssa-names be in each other's set. As killing_def
creates a new empty set containing only the current def, it already
ensures false equivalences won't happen.
PR tree-optimization/106514
* value-relation.cc (path_oracle::killing_def): Do not walk the
equivalence set clearing bits.
The regexps in the test btf-int-1.c were not working properly with the
commenting style of at least one target: powerpc64le-linux-gnu. This
patch changes the test to use better regexps.
Tested in bpf-unknown-none, x86_64-linux-gnu and powerpc64le-linux-gnu.
Pushed to master as obvious.
gcc/testsuite/ChangeLog:
PR testsuite/106515
* gcc.dg/debug/btf/btf-int-1.c: Fix regexps in
scan-assembler-times.
This patch adds support for three-way min/max recognition in phi-opts.
Concretely for e.g.
#include <stdint.h>
uint8_t three_min (uint8_t xc, uint8_t xm, uint8_t xy) {
uint8_t xk;
if (xc < xm) {
xk = (uint8_t) (xc < xy ? xc : xy);
} else {
xk = (uint8_t) (xm < xy ? xm : xy);
}
return xk;
}
we generate:
<bb 2> [local count: 1073741824]:
_5 = MIN_EXPR <xc_1(D), xy_3(D)>;
_7 = MIN_EXPR <xm_2(D), _5>;
return _7;
instead of
<bb 2>:
if (xc_2(D) < xm_3(D))
goto <bb 3>;
else
goto <bb 4>;
<bb 3>:
xk_5 = MIN_EXPR <xc_2(D), xy_4(D)>;
goto <bb 5>;
<bb 4>:
xk_6 = MIN_EXPR <xm_3(D), xy_4(D)>;
<bb 5>:
# xk_1 = PHI <xk_5(3), xk_6(4)>
return xk_1;
The same function also immediately deals with turning a minimization problem
into a maximization one if the results are inverted. We do this here since
doing it in match.pd would end up changing the shape of the BBs and adding
additional instructions which would prevent various optimizations from working.
gcc/ChangeLog:
* tree-ssa-phiopt.cc (minmax_replacement): Optionally search for the phi
sequence of a three-way conditional.
(replace_phi_edge_with_variable): Support diamonds.
(tree_ssa_phiopt_worker): Detect diamond phi structure for three-way
min/max.
(strip_bit_not, invert_minmax_code): New.
gcc/testsuite/ChangeLog:
* gcc.dg/tree-ssa/split-path-1.c: Disable phi-opts so we don't optimize
code away.
* gcc.dg/tree-ssa/minmax-10.c: New test.
* gcc.dg/tree-ssa/minmax-11.c: New test.
* gcc.dg/tree-ssa/minmax-12.c: New test.
* gcc.dg/tree-ssa/minmax-13.c: New test.
* gcc.dg/tree-ssa/minmax-14.c: New test.
* gcc.dg/tree-ssa/minmax-15.c: New test.
* gcc.dg/tree-ssa/minmax-16.c: New test.
* gcc.dg/tree-ssa/minmax-3.c: New test.
* gcc.dg/tree-ssa/minmax-4.c: New test.
* gcc.dg/tree-ssa/minmax-5.c: New test.
* gcc.dg/tree-ssa/minmax-6.c: New test.
* gcc.dg/tree-ssa/minmax-7.c: New test.
* gcc.dg/tree-ssa/minmax-8.c: New test.
* gcc.dg/tree-ssa/minmax-9.c: New test.
In upstream dmd, the compiler front-end and run-time have been merged
together into one repository. Both dmd and libdruntime now track that.
D front-end changes:
- Deprecated `scope(failure)' blocks that contain `return' statements.
- Deprecated using integers for `version' or `debug' conditions.
- Deprecated returning a discarded void value from a function.
- `new' can now allocate an associative array.
D runtime changes:
- Added avx512f detection to core.cpuid module.
Phobos changes:
- Changed std.experimental.logger.core.sharedLog to return
shared(Logger).
gcc/d/ChangeLog:
* dmd/MERGE: Merge upstream dmd d7772a2369.
* dmd/VERSION: Bump version to v2.100.1.
* d-codegen.cc (get_frameinfo): Check whether decision to generate
closure changed since semantic finished.
* d-lang.cc (d_handle_option): Remove handling of -fdebug=level and
-fversion=level.
* decl.cc (DeclVisitor::visit (VarDeclaration *)): Generate evaluation
of noreturn variable initializers before throw.
* expr.cc (ExprVisitor::visit (AssignExp *)): Don't generate
assignment for noreturn types, only evaluate for side effects.
* lang.opt (fdebug=): Undocument -fdebug=level.
(fversion=): Undocument -fversion=level.
libphobos/ChangeLog:
* configure: Regenerate.
* configure.ac (libtool_VERSION): Update to 4:0:0.
* libdruntime/MERGE: Merge upstream druntime d7772a2369.
* libdruntime/Makefile.am (DRUNTIME_DSOURCES): Add
core/internal/array/duplication.d.
* libdruntime/Makefile.in: Regenerate.
* src/MERGE: Merge upstream phobos 5748ca43f.
* testsuite/libphobos.gc/nocollect.d:
A SET operation that writes memory may have the same value as an
earlier store but if the alias sets of the new and earlier store do
not conflict then the set is not truly redundant. This can happen,
for example, if objects of different types share a stack slot.
To fix this we define a new function in cselib that first checks for
equality and if that is successful then finds the earlier store in the
value history and checks the alias sets.
The routine is used in two places elsewhere in the compiler:
cfgcleanup and postreload.
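A hand-written sketch (not from the PR) of the situation described, where two locals of different types with disjoint lifetimes may be assigned the same stack slot:
extern void use_long (long *);
extern void use_double (double *);

void
f (void)
{
  {
    long l = 0;
    use_long (&l);
  }
  {
    double d = 0;   /* may reuse l's stack slot; the store can look
                       redundant even though the alias sets differ */
    use_double (&d);
  }
}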
gcc/ChangeLog:
PR rtl-optimization/106187
* alias.h (mems_same_for_tbaa_p): Declare.
* alias.cc (mems_same_for_tbaa_p): New function.
* dse.cc (record_store): Use it instead of open-coding
alias check.
* cselib.h (cselib_redundant_set_p): Declare.
* cselib.cc: Include alias.h
(cselib_redundant_set_p): New function.
* cfgcleanup.cc (mark_effect): Use cselib_redundant_set_p instead
of rtx_equal_for_cselib_p.
* postreload.cc (reload_cse_simplify): Use cselib_redundant_set_p.
(reload_cse_noop_set_p): Delete.
The option prints TOP N counters in a stable format
usable for comparison (diff).
gcc/ChangeLog:
* doc/gcov-dump.texi: Document the new option.
* gcov-dump.cc (main): Parse the new option.
(print_usage): Show the option.
(tag_counters): Sort key:value pairs of TOP N counter.
This patch adds a peephole2 to i386.md to implement the suggestion in
PR target/47949, of using xchg instead of mov for moving values to/from
the %rax/%eax register, controlled by -Oz, as the xchg instruction is
one byte shorter than the move it is replacing.
The new test case is taken from the PR:
int foo(int x) { return x; }
where previously we'd generate:
foo: mov %edi,%eax // 2 bytes
ret
but with this patch, using -Oz, we generate:
foo: xchg %eax,%edi // 1 byte
ret
On the CSiBE benchmark, this saves a total of 10238 bytes (reducing
the -Oz total from 3661796 bytes to 3651558 bytes, a 0.28% saving).
Interestingly, some modern architectures (such as Zen 3) implement
xchg using zero latency register renaming (just like mov), so in theory
this transformation could be enabled when optimizing for speed, if
benchmarking shows the improved code density produces consistently
better performance. However, this is architecture dependent, and
there may be interactions using xchg (instead of a single_set) in the
late RTL passes (such as cprop_hardreg), so for now I've restricted
this to -Oz.
2022-08-03 Roger Sayle <roger@nextmovesoftware.com>
Uroš Bizjak <ubizjak@gmail.com>
gcc/ChangeLog
PR target/47949
* config/i386/i386.md (peephole2): New peephole2 to convert
SWI48 moves to/from %rax/%eax where the src is dead to xchg,
when optimizing for minimal size with -Oz.
gcc/testsuite/ChangeLog
PR target/47949
* gcc.target/i386/pr47949.c: New test case.
This patch adds an extra optimization to *cmp<dwi>_doubleword to improve
the code generated for comparisons against -1. Hypothetically, if a
comparison against -1 reached this splitter we'd currently generate code
that looks like:
notq %rdx ; 3 bytes
notq %rax ; 3 bytes
orq %rdx, %rax ; 3 bytes
setne %al
With this patch we would instead generate the superior:
andq %rdx, %rax ; 3 bytes
cmpq $-1, %rax ; 4 bytes
setne %al
which is both faster and smaller, and also what's currently generated
thanks to the middle-end splitting double word comparisons against
zero and minus one during RTL expansion. Should that change, this would
become a missed-optimization regression, but this patch also (potentially)
helps suitable comparisons created by CSE and combine.
2022-08-03 Roger Sayle <roger@nextmovesoftware.com>
gcc/ChangeLog
* config/i386/i386.md (*cmp<dwi>_doubleword): Add a special case
to split comparisons against -1 using AND and CMP -1 instructions.
This patch improves TImode STV by adding support for logical shifts by
integer constants that are multiples of 8. For the test case:
unsigned __int128 a, b;
void foo() { a = b << 16; }
on x86_64, gcc -O2 currently generates:
movq b(%rip), %rax
movq b+8(%rip), %rdx
shldq $16, %rax, %rdx
salq $16, %rax
movq %rax, a(%rip)
movq %rdx, a+8(%rip)
ret
with this patch we now generate:
movdqa b(%rip), %xmm0
pslldq $2, %xmm0
movaps %xmm0, a(%rip)
ret
2022-08-03 Roger Sayle <roger@nextmovesoftware.com>
gcc/ChangeLog
* config/i386/i386-features.cc (compute_convert_gain): Add gain
for converting suitable TImode shift to a V1TImode shift.
(timode_scalar_chain::convert_insn): Add support for converting
suitable ASHIFT and LSHIFTRT.
(timode_scalar_to_vector_candidate_p): Consider logical shifts
by integer constants that are multiples of 8 to be candidates.
gcc/testsuite/ChangeLog
* gcc.target/i386/sse4_1-stv-7.c: New test case.
This patch implements some additional zero-extension and sign-extension
related optimizations in simplify-rtx.cc. The original motivation comes
from PR rtl-optimization/71775, where in comment #2 Andrew Pinski sees:
Failed to match this instruction:
(set (reg:DI 88 [ _1 ])
(sign_extend:DI (subreg:SI (ctz:DI (reg/v:DI 86 [ x ])) 0)))
On many platforms the result of DImode CTZ is constrained to be a
small unsigned integer (between 0 and 64), hence the truncation to
32-bits (using a SUBREG) and the following sign extension back to
64-bits are effectively a no-op, so the above should ideally (often)
be simplified to "(set (reg:DI 88) (ctz:DI (reg/v:DI 86 [ x ]))".
To implement this, and some closely related transformations, we build
upon the existing val_signbit_known_clear_p predicate. In the first
chunk, nonzero_bits knows that FFS and ABS can't leave the sign bit
set, so the simplification of ABS (ABS (x)) and ABS (FFS (x))
can itself be simplified. The second transformation is that we can
canonicalize SIGN_EXTEND to ZERO_EXTEND (as in the PR 71775 case above)
when the operand's sign-bit is known to be clear. The final two chunks
are for SIGN_EXTEND of a truncating SUBREG, and ZERO_EXTEND of a
truncating SUBREG respectively. The nonzero_bits of a truncating
SUBREG pessimistically thinks that the upper bits may have an
arbitrary value (by taking the SUBREG), so we need to look deeper at the
SUBREG's operand to confirm that the high bits are known to be zero.
Unfortunately, for PR rtl-optimization/71775, ctz:DI on x86_64 with
default architecture options is undefined at zero, so we can't be sure
the upper bits of reg:DI 88 will be sign extended (all zeros or all ones).
nonzero_bits knows this, so the above transformations don't trigger,
but the transformations themselves are perfectly valid for other
operations such as FFS, POPCOUNT and PARITY, and on other targets/-march
settings where CTZ is defined at zero.
2022-08-03 Roger Sayle <roger@nextmovesoftware.com>
Segher Boessenkool <segher@kernel.crashing.org>
Richard Sandiford <richard.sandiford@arm.com>
gcc/ChangeLog
* simplify-rtx.cc (simplify_unary_operation_1) <ABS>: Add
optimizations for CLRSB, PARITY, POPCOUNT, SS_ABS and LSHIFTRT
that are all positive to complement the existing FFS and
idempotent ABS simplifications.
<SIGN_EXTEND>: Canonicalize SIGN_EXTEND to ZERO_EXTEND when
val_signbit_known_clear_p is true of the operand.
Simplify sign extensions of SUBREG truncations of operands
that are already suitably (zero) extended.
<ZERO_EXTEND>: Simplify zero extensions of SUBREG truncations
of operands that are already suitably zero extended.
Previously, all gimple_cond types were understood; with float values,
this is no longer true. We should gracefully do nothing if the
gcond type is not supported.
PR tree-optimization/106510
gcc/
* gimple-range-fold.cc (fur_source::register_outgoing_edges):
Check for unsupported statements early.
gcc/testsuite/
* gcc.dg/pr106510.c: New.