The problem with this testcase was that since my patch for PR97900 we
weren't preserving DECL_UID identity for parameters of instantiations of
templated functions, so using those parameters as the keys for the
defarg_inst map broke. I think this was always fragile given the
possibility of redeclarations, so instead of reverting that change let's
switch to keying off the function.
Memory use compiling stdc++.h is not noticeably different.
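For reference, the kind of code involved looks roughly like this (an
illustrative sketch, not the actual lambda-defarg10.C test; the names are
made up):
...
template <typename T>
T f (T t = [] { return T (); } ())  // default argument instantiated per use
{ return t; }

int x = f<int> ();  // tsubst_default_argument caches the instantiated default arg
...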
PR c++/103186
gcc/cp/ChangeLog:
* pt.cc (defarg_inst): Use tree_vec_map_cache_hasher.
(defarg_insts_for): New.
(tsubst_default_argument): Adjust.
gcc/testsuite/ChangeLog:
* g++.dg/cpp0x/lambda/lambda-defarg10.C: New test.
On a GT 1030, with driver version 470.94 and -mptx=3.1 I run into:
...
FAIL: libgomp.oacc-c/../libgomp.oacc-c-c++-common/parallel-dims.c \
-DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 -foffload=nvptx-none \
-O2 execution test
...
which minimizes to the same test-case as listed in commit "[nvptx]
Update default ptx isa to 6.3".
The problem is again that the first diverging branch is not handled as such in
SASS, which causes problems with a subsequent shfl insn, but given that we
have -mptx=3.1 we can't use the bar.warp.sync insn.
Given that the default is now -mptx=6.3, and consequently -mptx=3.1 is of
lesser importance, implement the next best thing: abort when detecting
non-convergence using this insn:
...
{ .reg.b32 act;
vote.ballot.b32 act,1;
.reg.pred uni;
setp.eq.b32 uni,act,0xffffffff;
@ !uni trap;
@ !uni exit;
}
...
Interestingly, the effect of this is that rather than aborting, the test-case
now passes.
Tested on x86_64 with nvptx accelerator.
gcc/ChangeLog:
2022-01-31 Tom de Vries <tdevries@suse.de>
* config/nvptx/nvptx.cc (nvptx_single): Use nvptx_uniform_warp_check.
* config/nvptx/nvptx.md (define_c_enum "unspecv"): Add
UNSPECV_UNIFORM_WARP_CHECK.
(define_insn "nvptx_uniform_warp_check"): New define_insn.
On a GT 1030 (sm_61), with driver version 470.94 I run into:
...
FAIL: libgomp.oacc-c/../libgomp.oacc-c-c++-common/parallel-dims.c \
-DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 -foffload=nvptx-none \
-O2 execution test
...
which minimizes to the same test-case as listed in commit "[nvptx] Update
default ptx isa to 6.3".
The first divergent branch looks like:
...
{
.reg .u32 %x;
mov.u32 %x,%tid.x;
setp.ne.u32 %r59,%x,0;
}
@ %r59 bra $L15;
mov.u64 %r48,%ar0;
mov.u32 %r22,2;
ld.u64 %r53,[%r48];
mov.u32 %r55,%r22;
mov.u32 %r54,1;
$L15:
...
and when inspecting the generated SASS, the branch is not setup as a divergent
branch, but instead as a regular branch.
This causes us to execute a shfl.sync insn in divergent mode, which is likely
to cause trouble given a remark in the ptx isa version 6.3, which mentions
that for .target sm_6x or below, all threads must execute the same
shfl.sync instruction in convergence.
Fix this by placing a "bar.warp.sync 0xffffffff" at the desired convergence
point (in the example above, after $L15).
Tested on x86_64 with nvptx accelerator.
gcc/ChangeLog:
2022-01-31 Tom de Vries <tdevries@suse.de>
* config/nvptx/nvptx.cc (nvptx_single): Use nvptx_warpsync.
* config/nvptx/nvptx.md (define_c_enum "unspecv"): Add
UNSPECV_WARPSYNC.
(define_insn "nvptx_warpsync"): New define_insn.
With the following example, minimized from parallel-dims.c:
...
int
main (void)
{
int vectors_max = -1;
#pragma acc parallel num_gangs (1) num_workers (1) copy (vectors_max)
{
for (int i = 0; i < 2; i++)
for (int j = 0; j < 2; j++)
#pragma acc loop vector reduction (max: vectors_max)
for (int k = 0; k < 32; k++)
vectors_max = k;
}
if (vectors_max != 31)
__builtin_abort ();
return 0;
}
...
I run into (T400, driver version 470.94):
...
FAIL: libgomp.oacc-c/../libgomp.oacc-c-c++-common/parallel-dims.c \
-DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 -foffload=nvptx-none -O2 \
execution test
...
The FAIL does not happen with GOMP_NVPTX_JIT=-O0.
The problem seems to be that the shfl insns for the vector reduction are not
executed uniformly by the warp. Enforcing this by using shfl.sync fixes the
problem.
Fix this by setting the ptx isa to 6.3 by default, which allows the use of
shfl.sync.
Tested on x86_64 with nvptx accelerator.
gcc/ChangeLog:
2022-01-27 Tom de Vries <tdevries@suse.de>
* config/nvptx/nvptx.opt (mptx): Set to PTX_VERSION_6_3 by default.
In ptx isa 6.0, a new barrier instruction was added, and bar.sync was
redefined as barrier.sync.aligned.
The aligned modifier indicates that all threads in a CTA will execute the same
barrier instruction.
This seems fine for a form "bar.sync 0".
But a "bar.sync %rx,64" (as used for vector length > 32) may execute a
different barrier depending on the value of %rx, so we can't assume it's
aligned.
Fix this by using "barrier.sync %rx,64" instead.
Tested on x86_64 with nvptx accelerator.
gcc/ChangeLog:
2022-01-27 Tom de Vries <tdevries@suse.de>
* config/nvptx/nvptx-opts.h (enum ptx_version): Add PTX_VERSION_6_0.
* config/nvptx/nvptx.h (TARGET_PTX_6_0): New macro.
* config/nvptx/nvptx.md (define_insn "nvptx_barsync"): Use barrier
insn for TARGET_PTX_6_0.
When running libgomp test-case reduction-7.c on an nvptx accelerator
(T400, driver version 470.86) and GOMP_NVPTX_JIT=-O0, I run into:
...
reduction-7.exe:reduction-7.c:312: v_p_2: \
Assertion `out[j * 32 + i] == (i + j) * 2' failed.
FAIL: libgomp.oacc-c/../libgomp.oacc-c-c++-common/reduction-7.c \
-DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 -foffload=nvptx-none \
-O0 execution test
...
During investigation I found ptx code like this:
...
@ %r163 bra $L262;
$L262:
...
There's a known problem with executing this type of code, and a workaround is
in place to address this: prevent_branch_around_nothing. The workaround does
not trigger though because it doesn't handle the nop insn.
Fix this by handling the nop insn in prevent_branch_around_nothing.
Tested libgomp on x86_64 with nvptx accelerator.
gcc/ChangeLog:
2022-01-27 Tom de Vries <tdevries@suse.de>
PR target/100428
* config/nvptx/nvptx.cc (prevent_branch_around_nothing): Handle nop
insn.
The ptx insn atom doesn't support local memory. In case of doing an atomic
operation on local memory, we run into:
...
operation not supported on global/shared address space
...
This is the cuGetErrorString message for CUDA_ERROR_INVALID_ADDRESS_SPACE.
The message is somewhat confusing, given that the operation is actually not
supported on the local address space.
Fix this by falling back on a non-atomic version when detecting
a frame-related memory operand.
This only solves some cases that are detected at compile-time. It does
however fix the openacc private-atomic-* test-cases.
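The new test-case exercises something along these lines (a sketch, not the
actual stack-atomics-run.c):
...
int
main (void)
{
  int v = 0;
  __atomic_fetch_add (&v, 1, __ATOMIC_SEQ_CST);  /* atomic on a stack address */
  if (v != 1)
    __builtin_abort ();
  return 0;
}
...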
Tested on x86_64 with nvptx accelerator.
gcc/ChangeLog:
2022-01-27 Tom de Vries <tdevries@suse.de>
* config/nvptx/nvptx.md (define_insn "atomic_compare_and_swap<mode>_1")
(define_insn "atomic_exchange<mode>")
(define_insn "atomic_fetch_add<mode>")
(define_insn "atomic_fetch_addsf")
(define_insn "atomic_fetch_<logic><mode>"): Output non-atomic version
if the memory operand is frame-related.
gcc/testsuite/ChangeLog:
2022-01-31 Tom de Vries <tdevries@suse.de>
* gcc.target/nvptx/stack-atomics-run.c: New test.
libgomp/ChangeLog:
2022-01-27 Tom de Vries <tdevries@suse.de>
* testsuite/libgomp.oacc-c-c++-common/private-atomic-1.c: Remove
PR83812 workaround.
* testsuite/libgomp.oacc-fortran/private-atomic-1-vector.f90: Same.
* testsuite/libgomp.oacc-fortran/private-atomic-1-worker.f90: Same.
When I run the libgomp test-case reduction-cplx-dbl.c on an nvptx accelerator
(T400, driver version 470.86), I run into:
...
FAIL: libgomp.oacc-c/../libgomp.oacc-c-c++-common/reduction-cplx-dbl.c \
-DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 -foffload=nvptx-none -O0 \
execution test
FAIL: libgomp.oacc-c/../libgomp.oacc-c-c++-common/reduction-cplx-dbl.c \
-DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 -foffload=nvptx-none -O2 \
execution test
...
The problem is in this code generated for a gang reduction:
...
$L39:
atom.global.cas.b32 %r59, [__reduction_lock], 0, 1;
setp.ne.u32 %r116, %r59, 0;
@%r116 bra $L39;
ld.f64 %r60, [%r44];
ld.f64 %r61, [%r44+8];
ld.f64 %r64, [%r44];
ld.f64 %r65, [%r44+8];
add.f64 %r117, %r64, %r22;
add.f64 %r118, %r65, %r41;
st.f64 [%r44], %r117;
st.f64 [%r44+8], %r118;
atom.global.cas.b32 %r119, [__reduction_lock], 1, 0;
...
which is taking and releasing a lock, but missing the appropriate barriers to
protect the loads and stores inside the lock.
Fix this by adding membar.gl barriers.
Likewise, add membar.cta barriers if we protect shared memory loads and
stores (even though the worker-partitioning part of the test-case is not
failing).
Tested on x86_64 with nvptx accelerator.
gcc/ChangeLog:
2022-01-27 Tom de Vries <tdevries@suse.de>
* config/nvptx/nvptx.cc (enum nvptx_builtins): Add
NVPTX_BUILTIN_MEMBAR_GL and NVPTX_BUILTIN_MEMBAR_CTA.
(VOID): New macro.
(nvptx_init_builtins): Add MEMBAR_GL and MEMBAR_CTA.
(nvptx_expand_builtin): Handle NVPTX_BUILTIN_MEMBAR_GL and
NVPTX_BUILTIN_MEMBAR_CTA.
(nvptx_lockfull_update): Add level parameter. Emit barriers.
(nvptx_reduction_update, nvptx_goacc_reduction_fini): Update call to
nvptx_lockfull_update.
* config/nvptx/nvptx.md (define_c_enum "unspecv"): Add
UNSPECV_MEMBAR_GL.
(define_expand "nvptx_membar_gl"): New expand.
(define_insn "*nvptx_membar_gl"): New insn.
This matches the memory order in libc++.
libstdc++-v3/ChangeLog:
* include/bits/atomic_wait.h: Change memory order from
Acquire/Release with relaxed loads to SeqCst+Release for
accesses to the waiter's count.
As the minimal GCC version that can build the current master
is 4.8, it does not make sense to mention anything for older
versions.
gcc/ChangeLog:
* doc/install.texi: Remove option for GCC < 4.8.
The following testcase fails -fcompare-debug, because expand_vector_comparison
since r11-1786-g1ac9258cca8030745d3c0b8f63186f0adf0ebc27 sets
vec_cond_expr_only when it sees some use other than VEC_COND_EXPR that uses
the lhs in its condition.
Obviously we should ignore debug stmts when doing so, e.g. by not pushing
them to uses.
That would be a 2 liner change, but while looking at it, I'm also worried
about VEC_COND_EXPRs that would use the lhs in more than one operand,
like VEC_COND_EXPR <lhs, lhs, something> or VEC_COND_EXPR <lhs, something, lhs>
(sure, they ought to be folded, but what if they weren't). Because if
something like that happens, then FOR_EACH_IMM_USE_FAST would push the same
stmt multiple times and expand_vector_condition can return true even when
it modifies it (for vector bool masking).
And lastly, it seems quite wasteful to safe_push statements that will just
cause vec_cond_expr_only = false; and break; in the second loop, both for
cases like 1000 immediate non-VEC_COND_EXPR uses and for cases like
999 VEC_COND_EXPRs with lhs in cond followed by a single non-VEC_COND_EXPR
use. So this patch only pushes VEC_COND_EXPRs there.
2022-02-01 Jakub Jelinek <jakub@redhat.com>
PR middle-end/104307
* tree-vect-generic.cc (expand_vector_comparison): Don't push debug
stmts to uses vector, just set vec_cond_expr_only to false for
non-VEC_COND_EXPRs instead of pushing them into uses. Treat
VEC_COND_EXPRs that use lhs not just in rhs1, but rhs2 or rhs3 too
like non-VEC_COND_EXPRs.
* gcc.target/i386/pr104307.c: New test.
When propagating a multi-word register into an access with a smaller
mode, the can_change_mode backend hook is already consulted for the
original register. This check, however, is also required for the
intermediate copy in copy_regno, which might use a different register class.
gcc/ChangeLog:
PR rtl-optimization/101260
* regcprop.cc (maybe_mode_change): Invoke mode_change_ok also for
copy_regno.
gcc/testsuite/ChangeLog:
PR rtl-optimization/101260
* gcc.target/s390/pr101260.c: New testcase.
These operations should raise an invalid operation exception at runtime.
So they should not be folded during compilation unless -fno-trapping-math
is used.
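A sketch of the kind of case involved (illustrative, not necessarily the
exact new test):
...
#include <fenv.h>

int
main (void)
{
  double inf = __builtin_inf ();
  double r = inf - inf;            /* NaN result from non-NaN operands */
  if (!fetestexcept (FE_INVALID))  /* the invalid operation exception must
                                      still be raised at runtime */
    __builtin_abort ();
  return r != r ? 0 : 1;
}
...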
gcc/
PR middle-end/95115
* fold-const.cc (const_binop): Do not fold NaN result from
non-NaN operands.
gcc/testsuite
* gcc.dg/pr95115.c: New test.
When running libgomp test-case broadcast-many.c on an nvptx accelerator
(T400, driver version 470.86), I run into:
...
libgomp: The Nvidia accelerator has insufficient resources to launch \
'main$_omp_fn$0' with num_workers = 32 and vector_length = 32; \
recompile the program with 'num_workers = x and vector_length = y' on \
that offloaded region or '-fopenacc-dim=:x:y' where x * y <= 896.
FAIL: libgomp.oacc-c/../libgomp.oacc-c-c++-common/broadcast-many.c \
-DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 -foffload=nvptx-none \
-O0 execution test
...
The error does not occur when using GOMP_NVPTX_JIT=-O0.
Fix this by using 896 / 32 == 28 workers for ACC_DEVICE_TYPE_nvidia.
Likewise for some other test-cases.
Tested libgomp on x86_64 with nvptx accelerator.
libgomp/ChangeLog:
2022-01-27 Tom de Vries <tdevries@suse.de>
* testsuite/libgomp.oacc-c-c++-common/broadcast-many.c: Reduce
num_workers for nvidia accelerator to fix libgomp error 'insufficient
resources'.
* testsuite/libgomp.oacc-c-c++-common/par-loop-comb-reduction-4.c:
Same.
* testsuite/libgomp.oacc-c-c++-common/reduction-7.c: Same.
When running the libgomp testsuite with GOMP_NVPTX_JIT=-O0 using an nvptx
accelerator (Nvidia T400, 2GB), I run into:
...
libgomp: cuCtxSynchronize error: unspecified launch failure \
(perhaps abort was called)
libgomp: cuMemFree_v2 error: unspecified launch failure
libgomp: device finalization failed
FAIL: libgomp.fortran/examples-4/declare_target-1.f90 -O0 execution test
...
The test-case contains:
...
! Reduced from 25 to 23, otherwise execution runs out of thread stack on
! Nvidia Titan V.
if (fib (23) /= fib_wrapper (23)) stop 2
...
Fix this by reducing the fib/fib_wrapper argument from 23 to 22.
Same for declare_target-2.f90.
Tested on x86_64 with nvptx accelerator.
libgomp/ChangeLog:
2022-01-27 Tom de Vries <tdevries@suse.de>
* testsuite/libgomp.fortran/examples-4/declare_target-1.f90: Reduce
recursion depth.
* testsuite/libgomp.fortran/examples-4/declare_target-2.f90: Same.
As mentioned in PR56888 comment 21:
...
-fno-tree-loop-distribute-patterns is the reliable way to not
transform loops into library calls.
...
However, since commit 6f966f0614 ("ldist: Recognize strlen and rawmemchr like
loops") a strlen or rawmemchr library call may be introduced by ldist.
This caused regressions in testcases
gcc.c-torture/execute/builtins/strlen{,-2,-3}.c for nvptx.
Fix this by not calling transform_reduction_loop from
loop_distribution::execute for -fno-tree-loop-distribute-patterns.
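The loops in question have the classic strlen shape, roughly (a sketch of
the idiom ldist recognizes, not the actual builtins/strlen*.c tests):
...
__SIZE_TYPE__
my_strlen (const char *s)
{
  __SIZE_TYPE__ i;
  for (i = 0; s[i]; i++)
    ;  /* ldist may replace this loop with a call to strlen */
  return i;
}
...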
Tested regressed test-cases as well as gcc.dg/tree-ssa/ldist-*.c on
nvptx.
gcc/ChangeLog:
2022-01-31 Tom de Vries <tdevries@suse.de>
* tree-loop-distribution.cc (generate_reduction_builtin_1): Check for
-ftree-loop-distribute-patterns.
(loop_distribution::execute): Don't call transform_reduction_loop for
-fno-tree-loop-distribute-patterns.
gcc/testsuite/ChangeLog:
2022-01-31 Tom de Vries <tdevries@suse.de>
* gcc.dg/tree-ssa/ldist-strlen-4.c: New test.
The OEP_* enums were moved to tree-core.h in
r0-124973-g5e351e960763 but the comment was correct
when it was added to fold-const.h in
r10-4231-g7f4a8ee03d40. This fixes the reference
to the OEP_* enums to point at tree-core.h.
Committed as obvious after a bootstrap/test on x86_64-linux.
gcc/ChangeLog:
* fold-const.h (operand_compare::operand_equal_p):
Fix comment about OEP_* flags.
Here we ICE in unify_array_domain when we're trying to deduce the type
of an array, as in
auto(*p)[i] = (int(*)[i])0;
but unify_array_domain doesn't handle arbitrarily complex bounds. Another
test is, e.g.,
auto (*b)[0/0] = &a;
where the type of the array is
<<< Unknown tree: template_type_parm >>>[0:(sizetype) ((ssizetype) (0 / 0) - 1)]
It seems to me that we need not handle these.
PR c++/102414
PR c++/101874
gcc/cp/ChangeLog:
* decl.cc (create_array_type_for_decl): Use template_placeholder_p.
Sorry on a variable-length array of auto.
gcc/testsuite/ChangeLog:
* g++.dg/cpp23/auto-array3.C: New test.
* g++.dg/cpp23/auto-array4.C: New test.
Weird things are going to happen if you define your std::initializer_list
as a union. In this case, we crash in output_constructor_regular_field.
Let's not allow such a definition in the first place.
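The kind of definition involved looks roughly like this (a sketch, not the
exact initlist128.C test):
...
namespace std
{
  template <class T>
  union initializer_list  // now rejected: initializer_list must not be a union
  {
    const T *ptr;
    decltype (sizeof 0) len;
  };
}

auto x = { 1, 2, 3 };
...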
PR c++/102434
gcc/cp/ChangeLog:
* class.cc (finish_struct): Don't allow union initializer_list.
gcc/testsuite/ChangeLog:
* g++.dg/cpp0x/initlist128.C: New test.
Here during deduction guide generation for the nested class template
B<char(int)>::C, the computation of outer_args yields the template
arguments relative to the primary template for B (i.e. {char(int)})
but what we really want is those relative to C's enclosing scope, the
partial specialization of B (i.e. {char, int}).
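The situation is roughly the following (an illustrative sketch, not the
actual class-deduction106.C test):
...
template <class T> struct B;

template <class R, class A>    // partial specialization; its args are {char, int}
struct B<R (A)>
{
  template <class U>
  struct C { C (U) { } };      // nested class template
};

B<char (int)>::C c (42);       // CTAD; outer args must be those of the partial spec
...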
PR c++/104294
gcc/cp/ChangeLog:
* pt.cc (ctor_deduction_guides_for): Correct computation of
outer_args.
gcc/testsuite/ChangeLog:
* g++.dg/cpp1z/class-deduction106.C: New test.
As reported by Martin, while David has added the OPTION_GLIBC define to aix
and Iain to darwin, all the other non-linux targets now fail because
the macro used in rs6000.md isn't defined.
One possibility is to define this macro in option-defaults.h which on rs6000
targets is included last, then we don't need to define it in aix/darwin
headers and for targets using linux.h or linux64.h it will DTRT too.
The other option is the first 2 hunks + changing the 3
if (!OPTION_GLIBC)
FAIL;
cases in rs6000.md to e.g.
#ifdef OPTION_GLIBC
if (!OPTION_GLIBC)
#endif
FAIL;
or to:
#ifdef OPTION_GLIBC
if (!OPTION_GLIBC)
#else
if (true)
#endif
FAIL;
(the latter case if Richi wants to push the -Wunreachable-code changes for
GCC 13).
2022-01-31 Jakub Jelinek <jakub@redhat.com>
PR target/104298
* config/rs6000/aix.h (OPTION_GLIBC): Remove.
* config/rs6000/darwin.h (OPTION_GLIBC): Likewise.
* config/rs6000/option-defaults.h (OPTION_GLIBC): Define to 0
if not already defined.
libiberty/
PR demangler/98886
PR demangler/99935
* rust-demangle.c (struct rust_demangler): Add a recursion
counter.
(demangle_path): Increment/decrement the recursion counter upon
entry and exit. Fail if the counter exceeds a fixed limit.
(demangle_type): Likewise.
(rust_demangle_callback): Initialise the recursion counter,
disabling if requested by the option flags.
> > PR tree-optimization/103514
> > * match.pd (a & b) ^ (a == b) -> !(a | b): New optimization.
> > * match.pd (a & b) == (a ^ b) -> !(a | b): New optimization.
> > * gcc.dg/tree-ssa/pr103514.c: Testcase for this optimization.
> >
> > 1) https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103514
> Note the bug was filed and fixed during stage3, review just didn't happen in
> a reasonable timeframe.
>
> I'm going to ACK this for the trunk and go ahead and commit it for you.
The testcase FAILs on short-circuit targets like powerpc64le-linux.
While the first 2 functions are identical, the last two look like:
<bb 2> :
if (a_5(D) != 0)
goto <bb 3>; [INV]
else
goto <bb 4>; [INV]
<bb 3> :
if (b_6(D) != 0)
goto <bb 5>; [INV]
else
goto <bb 4>; [INV]
<bb 4> :
<bb 5> :
# iftmp.1_4 = PHI <1(3), 0(4)>
_1 = a_5(D) == b_6(D);
_2 = (int) _1;
_3 = _2 ^ iftmp.1_4;
_9 = _2 != iftmp.1_4;
return _9;
instead of the expected:
<bb 2> :
_3 = a_8(D) & b_9(D);
_4 = (int) _3;
_5 = a_8(D) == b_9(D);
_6 = (int) _5;
_1 = a_8(D) | b_9(D);
_2 = ~_1;
_7 = (int) _2;
_10 = ~_1;
return _10;
so no wonder it doesn't match. E.g. x86_64-linux will also use jumps
if it isn't just a && b but a && b && c && d (it will do
a & b and c & d tests and jump based on those).
As it is too late at this point to implement this optimization even for the
short-circuiting targets (I'm not even sure which pass would be best),
this patch just forces non-short-circuiting for the test.
2022-01-31 Jakub Jelinek <jakub@redhat.com>
PR tree-optimization/103514
* gcc.dg/tree-ssa/pr103514.c: Add
--param logical-op-non-short-circuit=1 to dg-options.
libatomic/ChangeLog:
* acinclude.m4: Detect *_ld_is_mold and use it.
* configure: Regenerate.
libgomp/ChangeLog:
* acinclude.m4: Detect *_ld_is_mold and use it.
* configure: Regenerate.
libitm/ChangeLog:
* acinclude.m4: Detect *_ld_is_mold and use it.
* configure: Regenerate.
libstdc++-v3/ChangeLog:
* acinclude.m4: Detect *_ld_is_mold and use it.
* configure: Regenerate.
We were passing down the original type to recursive invocations
of multiple_of_p for say (int)(unsigned * unsigned).
2022-01-24 Richard Biener <rguenther@suse.de>
PR tree-optimization/100499
* fold-const.cc (multiple_of_p): Pass the correct type of
the expression to the recursive invocation of multiple_of_p
for conversions and use CASE_CONVERT.
This is what has been done for ages on SPARC/Solaris and makes it possible
to use 64-bit atomic instructions even in 32-bit mode.
gcc/
PR target/104189
* config/sparc/linux64.h (TARGET_DEFAULT): Add MASK_V8PLUS.
There are a few cases where we know we're dealing with (poly-)integer
constants, so remove the use of multiple_of_p in those cases to make
the PR100499 fix less impactful.
2022-01-24 Richard Biener <rguenther@suse.de>
PR tree-optimization/100499
* tree-cfg.cc (verify_gimple_assign_ternary): Use multiple_p
on poly-ints instead of multiple_of_p.
* tree-ssa.cc (maybe_rewrite_mem_ref_base): Likewise.
(non_rewritable_mem_ref_base): Likewise.
(non_rewritable_lvalue_p): Likewise.
(execute_update_addresses_taken): Likewise.
These tests have always been failing for my autotester running a
cris-elf simulator; when unrestrained they take about 20 minutes each,
compared to the (doubled) timeout of 720 seconds, out of a total of 2h40min
for the whole of the libstdc++-v3 testsuite. The tests cover counter
overflow and are already disabled for LP64 targets.
* testsuite/27_io/basic_istream/get/char/lwg3464.cc: Don't run on
simulator targets.
* testsuite/27_io/basic_istream/get/wchar_t/lwg3464.cc: Likewise.
This test fails everywhere, because an unescaped ? in a regexp doesn't match a literal ?.
It should use \\? instead. I've also changed those .s in there.
2022-01-29 Jakub Jelinek <jakub@redhat.com>
PR tree-optimization/95424
* gcc.dg/tree-ssa/divide-7.c: Fix up regexps in scan-tree-dump{,-not}.
On Fri, Jan 28, 2022 at 11:38:23AM -0700, Jeff Law wrote:
> Thanks. Given the original submission and most of the review work was done
> prior to stage3 closing, I went ahead and installed this on the trunk.
Unfortunately this breaks quite a lot of things.
The main problem is that GIMPLE allows EQ_EXPR etc. only with BOOLEAN_TYPE
or with TYPE_PRECISION == 1 integral type (or vector boolean).
Violating this causes verification failures in tree-cfg.cc in some cases,
in other cases wrong-code issues because before it is verified we e.g.
transform
1U / x
into
x == 1U
and later into
x (because we assume that == type must be one of the above cases and
when it is the same type as the type of the first operand, for boolean-ish
cases it should be equivalent).
Fixed by changing that
(eq @1 { build_one_cst (type); })
into
(convert (eq:boolean_type_node @1 { build_one_cst (type); }))
Note, I'm not 100% sure if :boolean_type_node is required in that case,
I see some spots in match.pd that look exactly like this, while there is
e.g. (convert (le ...)) that supposedly does the right thing too.
The signed integer 1/X case doesn't need changes, for
(cond (le ...) ...)
le gets correctly boolean_type_node and cond should use type.
I've also reformatted it, some lines were too long, match.pd uses
indentation by 1 column instead of 2 etc.
2022-01-29 Jakub Jelinek <jakub@redhat.com>
Andrew Pinski <apinski@marvell.com>
PR tree-optimization/104279
PR tree-optimization/104280
PR tree-optimization/104281
* match.pd (1 / X -> X == 1 for unsigned X): Build eq with
boolean_type_node and convert to type. Formatting fixes.
* gcc.dg/torture/pr104279.c: New test.
* gcc.dg/torture/pr104280.c: New test.
* gcc.dg/torture/pr104281.c: New test.
This patch adds the missed pattern described in bug 103514 [1] to match.pd.
[1] also includes a proof of correctness for the patch.
1) https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103514
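For boolean operands the simplification looks roughly like this (a sketch
along the lines of the new test-case):
...
bool
f (bool a, bool b)
{
  return (a & b) ^ (a == b);   // simplifies to !(a | b)
}

bool
g (bool a, bool b)
{
  return (a & b) == (a ^ b);   // likewise simplifies to !(a | b)
}
...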
gcc/
PR tree-optimization/103514
* match.pd (a & b) ^ (a == b) -> !(a | b): New optimization.
(a & b) == (a ^ b) -> !(a | b): New optimization.
gcc/testsuite
* gcc.dg/tree-ssa/pr103514.c: Testcase for this optimization.
Here we're emitting a -Wignored-qualifiers warning for an intermediate
compiler-generated cast of nullptr to 'method-type* const' as part of
value initialization of a const pmf. This patch suppresses the warning
by instead casting to the corresponding unqualified type.
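The warning showed up for code along these lines (a sketch, not the exact
Wignored-qualifiers2.C test):
...
struct A { void f (); };

typedef void (A::*pmf) ();

const pmf p = pmf ();  // value-initialization of a const pmf; the internal
                       // cast of nullptr used to trigger -Wignored-qualifiers
...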
PR c++/92752
gcc/cp/ChangeLog:
* typeck.cc (build_ptrmemfunc): Cast a nullptr constant to the
unqualified pointer type not the qualified one.
gcc/testsuite/ChangeLog:
* g++.dg/warn/Wignored-qualifiers2.C: New test.
Co-authored-by: Jason Merrill <jason@redhat.com>
A recent patch added tests for OPTION_GLIBC that is defined in
linux.h and linux64.h. This broke bootstrap for powerpc Darwin.
Fixed by adding a definition of OPTION_GLIBC to 0.
Signed-off-by: Iain Sandoe <iain@sandoe.co.uk>
gcc/ChangeLog:
* config/rs6000/darwin.h (OPTION_GLIBC): Define to 0.
This patch implements an optimization for the following C++ code:
int f(int x) {
return 1 / x;
}
int f(unsigned int x) {
return 1 / x;
}
Before this patch, x86-64 gcc -std=c++20 -O3 produces the following assembly:
f(int):
xor edx, edx
mov eax, 1
idiv edi
ret
f(unsigned int):
xor edx, edx
mov eax, 1
div edi
ret
In comparison, clang++ -std=c++20 -O3 produces the following assembly:
f(int):
lea ecx, [rdi + 1]
xor eax, eax
cmp ecx, 3
cmovb eax, edi
ret
f(unsigned int):
xor eax, eax
cmp edi, 1
sete al
ret
Clang's output is more efficient as it avoids expensive div operations.
With this patch, GCC now produces the following assembly:
f(int):
lea eax, [rdi + 1]
cmp eax, 2
mov eax, 0
cmovbe eax, edi
ret
f(unsigned int):
xor eax, eax
cmp edi, 1
sete al
ret
which is virtually identical to Clang's assembly output. Any slight differences
in the output for f(int) are possibly related to a different missed optimization.
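An equivalent source-level formulation of the simplification (illustrative;
the actual change is a match.pd pattern):
...
int
f_signed (int x)
{
  /* 1 / x is x for x in {-1, 1} and 0 for |x| > 1 (x == 0 is undefined),
     i.e.  (unsigned) x + 1 <= 2 ? x : 0.  */
  return (unsigned) x + 1 <= 2 ? x : 0;
}

unsigned
f_unsigned (unsigned x)
{
  /* 1u / x is 1 when x == 1 and 0 otherwise, i.e. x == 1.  */
  return x == 1;
}
...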
v2: https://gcc.gnu.org/pipermail/gcc-patches/2022-January/587751.html
Changes from v2:
1. Refactor from using a switch statement to using the built-in
if-else statement.
v1: https://gcc.gnu.org/pipermail/gcc-patches/2022-January/587634.html
Changes from v1:
1. Refactor common if conditions.
2. Use build_[minus_]one_cst (type) to get -1/1 of the correct type.
3. Match only for TRUNC_DIV_EXPR and TYPE_PRECISION (type) > 1.
gcc/ChangeLog:
PR tree-optimization/95424
* match.pd: Simplify 1 / X where X is an integer.