gcc/libgomp/testsuite/libgomp.c
Tom de Vries a624388b95 [nvptx] Add warp sync at simt exit
Consider this code (with N defined to 1024):
...
  float v = 0.0;
  #pragma omp target map(tofrom: v)
  #pragma omp parallel for simd
  for (int i = 0 ; i < N; i++)
    {
      #pragma omp atomic update
      v = v + 1.0;
    }
...

It hangs when executing on target board unix/-foffload=-misa=sm_75, using
drivers 470.103.01 and 510.54 on a T400 board (sm_75).

I'm tentatively identifying the problem as a bug in -muniform-simt for
architectures that support Independent Thread Scheduling (sm_70 and later).

The problem -muniform-simt is trying to address is to make sure that a
register produced outside an openmp simd region is available when used in any
lane inside an simd region.

The solution is to, outside an simd region, execute in all warp lanes, thus
producing consistent values in result registers in each warp thread.

This approach doesn't work when executing in all warp lanes multiplies the
side effects from 1 to 32 separate side effects, which is the case for atomic
insns.  So atomic insns are rewritten to execute only in lane 0, and if
there are any results, those are propagated to the other threads in the warp.
[ And likewise for system calls malloc, free, vprintf. ]

Now, consider a non-atomic update: ld, add, store.  The store has side
effects, are those multiplied or not?

Pre-sm_70 we can assume that at the end of an SIMT region, any divergent
control flow has reconverged, and we have a uniform warp, executing in lock
step.  So:
- the load will load the same value into the result register across the warp,
- the add will write the same value into the result register across the warp,
- the store will write the same value to the same memory location, 32 times,
  at once, having the result of a single store.
So, no side-effect multiplication (well, at least that's the observation).

Starting sm_70, the threads in a warp are no longer guaranteed to reconverge
after divergence.  There's a "Convergence Optimizer" that can can identify
that it is safe for a warp to reconverge, but that works only as long as the
code does not contain "synchronizing operations".

Consequently, the ld, add, store sequence can be executed by a non-uniform
warp, which means the side effects can have multiplied, and the registers are
no longer guarantueed to be in sync.

The atomic update in the example above is translated using an atom.cas loop,
which means that we have divergence (because only one thread is allowed to
succeed at a time) and the "Convergence Optimizer" doesn't reconverge probably
because the atom.cas counts as a "synchronizing operation".  So, it seems
plausible that the root cause for the mentioned hang is the problem described
above.

Fix this by adding an explicit warp sync at simt exit.

Note that we're assuming here that the warp will stay uniform until the next
SIMT region entry.

Tested on x86_64 with nvptx accelerator.

gcc/ChangeLog:

2022-03-09  Tom de Vries  <tdevries@suse.de>

	PR target/104916
	PR target/104783
	* config/nvptx/nvptx.md (define_expand "omp_simt_exit"): Emit warp
	sync (or uniform warp check for mptx < 6.0).

libgomp/ChangeLog:

2022-03-15  Tom de Vries  <tdevries@suse.de>

	PR target/104916
	PR target/104783
	* testsuite/libgomp.c/pr104783-2.c: New test.
2022-03-22 14:35:34 +01:00
..
appendix-a
examples-4
address-space-1.c Host and offload targets have no common meaning of address spaces 2022-01-13 11:16:20 +01:00
affinity-1.c Update copyright years. 2022-01-03 10:42:10 +01:00
affinity-2.c
atomic-1.c
atomic-2.c
atomic-3.c
atomic-4.c
atomic-5.c
atomic-6.c
atomic-10.c
atomic-11.c
atomic-12.c
atomic-13.c
atomic-14.c
atomic-15.c
atomic-16.c
atomic-17.c
autopar-1.c
autopar-2.c
autopar-3.c
autopar-4.c
autopar-5.c
autopar-6.c
autopar-7.c
autopar-8.c
barrier-1.c
c.exp
cancel-for-1.c
cancel-for-2.c
cancel-parallel-1.c
cancel-parallel-2.c
cancel-parallel-3.c
cancel-sections-1.c
collapse-1.c
collapse-2.c
collapse-3.c
copyin-1.c
copyin-2.c
copyin-3.c
critical-1.c
critical-2.c
debug-1.c
declare-variant-1.c
declare-variant-2.c openmp: Fix up handling of kind(host) and kind(nohost) in ACCEL_COMPILERs [PR103384] 2021-11-24 10:30:32 +01:00
declare-variant-3-sm30.c [libgomp, testsuite, nvptx] Add -mptx=_ in declare-variant-3-sm*.c 2022-02-28 10:10:51 +01:00
declare-variant-3-sm35.c [libgomp, testsuite, nvptx] Add -mptx=_ in declare-variant-3-sm*.c 2022-02-28 10:10:51 +01:00
declare-variant-3-sm53.c [libgomp, testsuite, nvptx] Add -mptx=_ in declare-variant-3-sm*.c 2022-02-28 10:10:51 +01:00
declare-variant-3-sm70.c [libgomp, testsuite, nvptx] Add -mptx=_ in declare-variant-3-sm*.c 2022-02-28 10:10:51 +01:00
declare-variant-3-sm75.c [libgomp, testsuite, nvptx] Add -mptx=_ in declare-variant-3-sm*.c 2022-02-28 10:10:51 +01:00
declare-variant-3-sm80.c [libgomp, testsuite, nvptx] Add -mptx=_ in declare-variant-3-sm*.c 2022-02-28 10:10:51 +01:00
declare-variant-3.h [libgomp, testsuite, nvptx] Add libgomp.c/declare-variant-3-sm*.c 2022-02-24 11:41:03 +01:00
depend-1.c
depend-2.c
depend-3.c
depend-4.c
depend-5.c
depend-6.c
depend-7.c
depend-8.c
depend-9.c
depend-10.c
doacross-1.c
doacross-2.c
doacross-3.c
icv-1.c
icv-2.c
lib-1.c
lib-2.c
linear-1.c
lock-1.c
lock-2.c
lock-3.c
loop-1.c
loop-2.c
loop-3.c
loop-4.c
loop-5.c
loop-6.c
loop-7.c
loop-8.c
loop-9.c
loop-10.c
loop-11.c
loop-12.c
loop-16.c
loop-17.c
loop-18.c
loop-19.c
loop-20.c
loop-21.c
loop-22.c
loop-23.c
loop-24.c
loop-25.c
loop-26.c openmp: Allow non-rectangular loops with pointer iterators 2021-10-27 09:22:07 +02:00
loop-27.c openmp: Allow non-rectangular loops with pointer iterators 2021-10-27 09:22:07 +02:00
nested-1.c
nested-2.c
nested-3.c
nestedfn-1.c
nestedfn-2.c
nestedfn-3.c
nestedfn-4.c
nestedfn-5.c
nestedfn-6.c
nqueens-1.c
omp_hello.c
omp_matvec.c
omp_orphan.c
omp_reduction.c
omp_workshare1.c
omp_workshare2.c
omp_workshare3.c
omp_workshare4.c
omp-loop01.c
omp-loop02.c
omp-loop03.c
omp-nested-1.c
omp-nested-2.c
omp-nested-3.c
omp-parallel-for.c
omp-parallel-if.c
omp-single-1.c
omp-single-2.c
omp-single-3.c
ordered-1.c
ordered-2.c
ordered-3.c
ordered-5.c
parallel-1.c
parloops-exit-first-loop-alt-2.c
parloops-exit-first-loop-alt-3.c
parloops-exit-first-loop-alt-4.c
parloops-exit-first-loop-alt-5.c
parloops-exit-first-loop-alt-6.c
parloops-exit-first-loop-alt-7.c
parloops-exit-first-loop-alt.c
places-1.c openmp: Add support for OMP_PLACES=ll_caches 2021-10-15 12:06:51 +02:00
places-2.c openmp: Add support for OMP_PLACES=ll_caches 2021-10-15 12:06:51 +02:00
places-3.c openmp: Add support for OMP_PLACES=ll_caches 2021-10-15 12:06:51 +02:00
places-4.c openmp: Add support for OMP_PLACES=ll_caches 2021-10-15 12:06:51 +02:00
places-5.c openmp: Add support for OMP_PLACES=numa_domains 2021-10-15 12:16:50 +02:00
places-6.c openmp: Fix up handling of OMP_PLACES=threads(1) 2021-10-15 16:25:25 +02:00
places-7.c openmp: Fix up handling of OMP_PLACES=threads(1) 2021-10-15 16:25:25 +02:00
places-8.c openmp: Fix up handling of OMP_PLACES=threads(1) 2021-10-15 16:25:25 +02:00
places-9.c openmp: Fix up handling of OMP_PLACES=threads(1) 2021-10-15 16:25:25 +02:00
places-10.c openmp: Fix up handling of OMP_PLACES=threads(1) 2021-10-15 16:25:25 +02:00
pr24455-1.c
pr24455.c
pr26171.c
pr26943-1.c
pr26943-2.c
pr26943-3.c
pr26943-4.c
pr29947-1.c
pr29947-2.c
pr30494.c
pr32362-1.c
pr32362-2.c
pr32362-3.c
pr32468.c
pr33880.c
pr34513.c
pr35130.c
pr35196.c
pr35549.c
pr35625.c
pr36802-1.c
pr36802-2.c
pr36802-3.c
pr38650.c
pr39154.c
pr39591-1.c
pr39591-2.c
pr39591-3.c
pr42029.c
pr42942.c
pr43893.c
pr46032-2.c
pr46032.c
pr46193.c
pr46886.c
pr48591.c
pr49897-1.c
pr49897-2.c
pr49898-1.c
pr49898-2.c
pr52547.c
pr58392.c
pr58756.c
pr61200.c
pr64734.c
pr66133.c
pr66714.c
pr68960.c
pr69110.c
pr69805.c
pr70680-1.c
pr70680-2.c
pr79940.c
pr80394.c
pr80809-1.c
pr80809-2.c
pr80809-3.c
pr80853.c
pr81687-1.c
pr81687-2.c
pr81778.c
pr86416-1.c
pr86416-2.c
pr86660.c
pr89002.c
pr90779.c
pr90811.c
pr93566.c
pr95620.c
pr99555-1.c [libgomp, nvptx] Fix hang in gomp_team_barrier_wait_end 2022-02-22 15:48:03 +01:00
pr104385.c libgomp: Fix segfault with posthumous orphan tasks [PR104385] 2022-02-08 09:30:17 +01:00
pr104783-2.c [nvptx] Add warp sync at simt exit 2022-03-22 14:35:34 +01:00
pr104783.c [nvptx] Disable warp sync in simt region 2022-03-10 12:20:44 +01:00
pr104952-1.c [openmp] Fix SIMT reduction using TRUTH_{AND,OR}IF_EXPR 2022-03-18 15:45:13 +01:00
pr104952-2.c [openmp] Fix SIMT reduction using TRUTH_{AND,OR}IF_EXPR 2022-03-18 15:45:13 +01:00
priority.c
private-1.c
reduction-1.c
reduction-2.c
reduction-3.c
reduction-4.c
reduction-5.c
reduction-6.c
reduction-7.c
reduction-8.c
reduction-9.c
reduction-10.c
reduction-11.c
reduction-12.c
reduction-13.c
reduction-14.c
reduction-15.c
scan-1.c
scan-2.c
scan-3.c
scan-4.c
scan-5.c
scan-6.c
scan-7.c
scan-8.c
scan-9.c
scan-10.c
scan-11.c vectorizer: Fix up -fsimd-cost-model= handling 2021-10-12 09:28:10 +02:00
scan-12.c vectorizer: Fix up -fsimd-cost-model= handling 2021-10-12 09:28:10 +02:00
scan-13.c vectorizer: Fix up -fsimd-cost-model= handling 2021-10-12 09:28:10 +02:00
scan-14.c vectorizer: Fix up -fsimd-cost-model= handling 2021-10-12 09:28:10 +02:00
scan-15.c vectorizer: Fix up -fsimd-cost-model= handling 2021-10-12 09:28:10 +02:00
scan-16.c vectorizer: Fix up -fsimd-cost-model= handling 2021-10-12 09:28:10 +02:00
scan-17.c vectorizer: Fix up -fsimd-cost-model= handling 2021-10-12 09:28:10 +02:00
scan-18.c vectorizer: Fix up -fsimd-cost-model= handling 2021-10-12 09:28:10 +02:00
scan-19.c vectorizer: Fix up -fsimd-cost-model= handling 2021-10-12 09:28:10 +02:00
scan-20.c vectorizer: Fix up -fsimd-cost-model= handling 2021-10-12 09:28:10 +02:00
scan-21.c vectorizer: Fix up -fsimd-cost-model= handling 2021-10-12 09:28:10 +02:00
scan-22.c vectorizer: Fix up -fsimd-cost-model= handling 2021-10-12 09:28:10 +02:00
sections-1.c
sections-2.c
shared-1.c
shared-2.c
shared-3.c
simd-1.c
simd-2.c
simd-3.c
simd-4.c
simd-5.c
simd-6.c
simd-7.c
simd-8.c
simd-9.c
simd-10.c
simd-11.c
simd-12.c
simd-13.c
single-1.c
single-2.c
sort-1.c Update copyright years. 2022-01-03 10:42:10 +01:00
static-chunk-size-one.c
switch-conversion-2.c
switch-conversion.c
target-3.c OpenMP: Add strictly nested API call check [PR102972] 2021-10-30 23:45:32 +02:00
target-4.c
target-5.c OpenMP: Add strictly nested API call check [PR102972] 2021-10-30 23:45:32 +02:00
target-6.c OpenMP: Add strictly nested API call check [PR102972] 2021-10-30 23:45:32 +02:00
target-7.c
target-8.c
target-9.c
target-11.c
target-12.c
target-14.c
target-15.c
target-16.c
target-17.c
target-18.c
target-19.c
target-20.c
target-21.c
target-22.c
target-23.c OpenMP 5.0: Remove array section base-pointer mapping semantics and other front-end adjustments 2021-12-09 00:01:10 +08:00
target-24.c
target-25.c
target-26.c
target-27.c
target-28.c
target-29.c OpenMP 5.0: Remove array section base-pointer mapping semantics and other front-end adjustments 2021-12-09 00:01:10 +08:00
target-30.c
target-31.c
target-32.c
target-33.c
target-34.c
target-35.c
target-36.c
target-37.c
target-38.c
target-39.c
target-40.c
target-41.c
target-42.c
target-43.c 'libgomp.c/target-43.c': '-latomic' for nvptx offloading 2021-09-06 11:51:13 +02:00
target-44.c Add 'default' to -foffload=; document that flag [PR67300] 2021-06-29 16:00:04 +02:00
target-critical-1.c
target-has-device-addr-3.c C, C++, Fortran, OpenMP: Add 'has_device_addr' clause to 'target' construct. 2022-02-09 23:47:12 -08:00
target-link-1.c
target-print-1.c
target-teams-1.c OpenMP: Add strictly nested API call check [PR102972] 2021-10-30 23:45:32 +02:00
task-1.c
task-2.c
task-3.c
task-4.c
task-5.c
task-6.c
task-reduction-1.c
task-reduction-2.c
task-reduction-3.c
task-reduction-4.c
teams-1.c OpenMP: Add strictly nested API call check [PR102972] 2021-10-30 23:45:32 +02:00
teams-2.c
teams-3.c
teams-4.c openmp: Honor OpenMP 5.1 num_teams lower bound 2021-11-12 12:41:22 +01:00
teams-5.c libgomp: Add a testcase for omp_get_num_teams inside of target inside of host teams 2021-11-15 08:58:39 +01:00
thread-limit-1.c
thread-limit-2.c OpenMP: Add strictly nested API call check [PR102972] 2021-10-30 23:45:32 +02:00
thread-limit-3.c OpenMP: Add strictly nested API call check [PR102972] 2021-10-30 23:45:32 +02:00
thread-limit-4.c OpenMP: Add strictly nested API call check [PR102972] 2021-10-30 23:45:32 +02:00
thread-limit-5.c OpenMP: Add strictly nested API call check [PR102972] 2021-10-30 23:45:32 +02:00
udr-2.c
udr-3.c
uns-outer-4.c
usleep.h
vla-1.c