This patch adds the 'has_device_addr' clause to the OpenMP 'target' construct
which was introduced in OpenMP 5.1 (OpenMP API 5.1 specification pp. 197ff):
has_device_addr(list)
"The has_device_addr clause indicates that its list items already have device
addresses and therefore they may be directly accessed from a target device.
If the device address of a list item is not for the device on which the target
region executes, accessing the list item inside the region results in
unspecified behavior. The list items may include array sections." (p. 200)
"A list item may not be specified in both an is_device_ptr clause and a
has_device_addr clause on the directive." (p. 202)
"A list item that appears in an is_device_ptr or a has_device_addr clause must
not be specified in any data-sharing attribute clause on the same target
construct." (p. 203)
gcc/c-family/ChangeLog:
* c-omp.cc (c_omp_split_clauses): Added OMP_CLAUSE_HAS_DEVICE_ADDR case.
* c-pragma.h (enum pragma_kind): Added 5.1 in comment.
(enum pragma_omp_clause): Added PRAGMA_OMP_CLAUSE_HAS_DEVICE_ADDR.
gcc/c/ChangeLog:
* c-parser.cc (c_parser_omp_clause_name): Parse 'has_device_addr'
clause.
(c_parser_omp_variable_list): Handle array sections.
(c_parser_omp_clause_has_device_addr): Added.
(c_parser_omp_all_clauses): Added PRAGMA_OMP_CLAUSE_HAS_DEVICE_ADDR
case.
(c_parser_omp_target_exit_data): Added HAS_DEVICE_ADDR to
OMP_CLAUSE_MASK.
* c-typeck.cc (handle_omp_array_sections): Handle clause restrictions.
(c_finish_omp_clauses): Handle array sections.
gcc/cp/ChangeLog:
* parser.cc (cp_parser_omp_clause_name): Parse 'has_device_addr' clause.
(cp_parser_omp_var_list_no_open): Handle array sections.
(cp_parser_omp_all_clauses): Added PRAGMA_OMP_CLAUSE_HAS_DEVICE_ADDR
case.
(cp_parser_omp_target_update): Added HAS_DEVICE_ADDR to OMP_CLAUSE_MASK.
* semantics.cc (handle_omp_array_sections): Handle clause restrictions.
(finish_omp_clauses): Handle array sections.
gcc/fortran/ChangeLog:
* dump-parse-tree.cc (show_omp_clauses): Added OMP_LIST_HAS_DEVICE_ADDR
case.
* gfortran.h: Added OMP_LIST_HAS_DEVICE_ADDR.
* openmp.cc (enum omp_mask2): Added OMP_CLAUSE_HAS_DEVICE_ADDR.
(gfc_match_omp_clauses): Parse HAS_DEVICE_ADDR clause.
(resolve_omp_clauses): Same.
* trans-openmp.cc (gfc_trans_omp_variable_list): Added
OMP_LIST_HAS_DEVICE_ADDR case.
(gfc_trans_omp_clauses): Firstprivatize of array descriptors.
gcc/ChangeLog:
* gimplify.cc (gimplify_scan_omp_clauses): Added cases for
OMP_CLAUSE_HAS_DEVICE_ADDR
and handle array sections.
(gimplify_adjust_omp_clauses): Added OMP_CLAUSE_HAS_DEVICE_ADDR case.
* omp-low.cc (scan_sharing_clauses): Handle OMP_CLAUSE_HAS_DEVICE_ADDR.
(lower_omp_target): Same.
* tree-core.h (enum omp_clause_code): Same.
* tree-nested.cc (convert_nonlocal_omp_clauses): Same.
(convert_local_omp_clauses): Same.
* tree-pretty-print.cc (dump_omp_clause): Same.
* tree.cc: Same.
libgomp/ChangeLog:
* libgomp.texi: Updated entry for HAS_DEVICE_ADDR.
* target.c (copy_firstprivate_data): Copy only if host address is not
NULL.
* testsuite/libgomp.c++/target-has-device-addr-2.C: New test.
* testsuite/libgomp.c++/target-has-device-addr-4.C: New test.
* testsuite/libgomp.c++/target-has-device-addr-5.C: New test.
* testsuite/libgomp.c++/target-has-device-addr-6.C: New test.
* testsuite/libgomp.c-c++-common/target-has-device-addr-1.c: New test.
* testsuite/libgomp.c/target-has-device-addr-3.c: New test.
* testsuite/libgomp.fortran/target-has-device-addr-1.f90: New test.
* testsuite/libgomp.fortran/target-has-device-addr-2.f90: New test.
* testsuite/libgomp.fortran/target-has-device-addr-3.f90: New test.
* testsuite/libgomp.fortran/target-has-device-addr-4.f90: New test.
gcc/testsuite/ChangeLog:
* c-c++-common/gomp/clauses-1.c: Added has_device_addr to test cases.
* g++.dg/gomp/attrs-1.C: Added has_device_addr to test cases.
* g++.dg/gomp/attrs-2.C: Added has_device_addr to test cases.
* c-c++-common/gomp/target-has-device-addr-1.c: New test.
* c-c++-common/gomp/target-has-device-addr-2.c: New test.
* c-c++-common/gomp/target-is-device-ptr-1.c: New test.
* c-c++-common/gomp/target-is-device-ptr-2.c: New test.
* gfortran.dg/gomp/is_device_ptr-3.f90: New test.
* gfortran.dg/gomp/target-has-device-addr-1.f90: New test.
* gfortran.dg/gomp/target-has-device-addr-2.f90: New test.
The following patch fixes crashes with posthumous orphan tasks.
When a parent task finishes, gomp_clear_parent clears the parent
pointers of its children tasks present in the parent->children_queue.
But children that are still waiting for dependencies aren't in that
queue yet, they will be added there only when the sibling they are
waiting for exits. Unfortunately we were adding those tasks into
the queues with the original task->parent which then causes crashes
because that task is gone and freed. The following patch fixes that
by clearing the parent field when we schedule such task for running
by adding it into the queues and we know that the sibling task which
is about to finish has NULL parent.
2022-02-08 Jakub Jelinek <jakub@redhat.com>
PR libgomp/104385
* task.c (gomp_task_run_post_handle_dependers): If parent is NULL,
clear task->parent.
* testsuite/libgomp.c/pr104385.c: New test.
The ptx insn atom doesn't support local memory. In case of doing an atomic
operation on local memory, we run into:
...
operation not supported on global/shared address space
...
This is the cuGetErrorString message for CUDA_ERROR_INVALID_ADDRESS_SPACE.
The message is somewhat confusing given that actually the operation is not
supported on local address space.
Fix this by falling back on a non-atomic version when detecting
a frame-related memory operand.
This only solves some cases that are detected at compile-time. It does
however fix the openacc private-atomic-* test-cases.
Tested on x86_64 with nvptx accelerator.
gcc/ChangeLog:
2022-01-27 Tom de Vries <tdevries@suse.de>
* config/nvptx/nvptx.md (define_insn "atomic_compare_and_swap<mode>_1")
(define_insn "atomic_exchange<mode>")
(define_insn "atomic_fetch_add<mode>")
(define_insn "atomic_fetch_addsf")
(define_insn "atomic_fetch_<logic><mode>"): Output non-atomic version
if memory operands is frame-relative.
gcc/testsuite/ChangeLog:
2022-01-31 Tom de Vries <tdevries@suse.de>
* gcc.target/nvptx/stack-atomics-run.c: New test.
libgomp/ChangeLog:
2022-01-27 Tom de Vries <tdevries@suse.de>
* testsuite/libgomp.oacc-c-c++-common/private-atomic-1.c: Remove
PR83812 workaround.
* testsuite/libgomp.oacc-fortran/private-atomic-1-vector.f90: Same.
* testsuite/libgomp.oacc-fortran/private-atomic-1-worker.f90: Same.
When running libgomp test-case broadcast-many.c on an nvptx accelerator
(T400, driver version 470.86), I run into:
...
libgomp: The Nvidia accelerator has insufficient resources to launch \
'main$_omp_fn$0' with num_workers = 32 and vector_length = 32; \
recompile the program with 'num_workers = x and vector_length = y' on \
that offloaded region or '-fopenacc-dim=❌y' where x * y <= 896.
FAIL: libgomp.oacc-c/../libgomp.oacc-c-c++-common/broadcast-many.c \
-DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 -foffload=nvptx-none \
-O0 execution test
...
The error does not occur when using GOMP_NVPTX_JIT=-O0.
Fix this by using 896 / 32 == 28 workers for ACC_DEVICE_TYPE_nvidia.
Likewise for some other test-cases.
Tested libgomp on x86_64 with nvptx accelerator.
libgomp/ChangeLog:
2022-01-27 Tom de Vries <tdevries@suse.de>
* testsuite/libgomp.oacc-c-c++-common/broadcast-many.c: Reduce
num_workers for nvidia accelerator to fix libgomp error 'insufficient
resources'.
* testsuite/libgomp.oacc-c-c++-common/par-loop-comb-reduction-4.c:
Same.
* testsuite/libgomp.oacc-c-c++-common/reduction-7.c: Same.
When running the libgomp testsuite with GOMP_NVPTX_JIT=-O0 using an nvptx
accelerator (Nvidia T400, 2GB), I run into:
...
libgomp: cuCtxSynchronize error: unspecified launch failure \
(perhaps abort was called)
libgomp: cuMemFree_v2 error: unspecified launch failure
libgomp: device finalization failed
FAIL: libgomp.fortran/examples-4/declare_target-1.f90 -O0 execution test
...
The test-case contains:
...
! Reduced from 25 to 23, otherwise execution runs out of thread stack on
! Nvidia Titan V.
if (fib (23) /= fib_wrapper (23)) stop 2
...
Fix this by reducing the fib/fib_wrapper argument from 23 to 22.
Same for declare_target-2.f90.
Tested on x86_64 with nvptx accelerator.
libgomp/ChangeLog:
2022-01-27 Tom de Vries <tdevries@suse.de>
* testsuite/libgomp.fortran/examples-4/declare_target-1.f90: Reduce
recursion depth.
* testsuite/libgomp.fortran/examples-4/declare_target-2.f90: Same.
libatomic/ChangeLog:
* acinclude.m4: Detect *_ld_is_mold and use it.
* configure: Regenerate.
libgomp/ChangeLog:
* acinclude.m4: Detect *_ld_is_mold and use it.
* configure: Regenerate.
libitm/ChangeLog:
* acinclude.m4: Detect *_ld_is_mold and use it.
* configure: Regenerate.
libstdc++-v3/ChangeLog:
* acinclude.m4: Detect *_ld_is_mold and use it.
* configure: Regenerate.
Rather than rubber-stamp whatever requested vs. actual device kernel launch
configuration happens, actually (again) verify the requested values (modulo
expected variations).
This better highlights that "AMD GCN has an upper limit of 'num_workers(16)'",
and the deficiency that "AMD GCN uses the autovectorizer for the vector
dimension: the use of a function call in vector-partitioned code [...] is not
currently supported".
And, this removes several instances of race conditions, where variables are
concurrently written to in OpenACC gang-redundant mode.
libgomp/
* testsuite/libgomp.oacc-c-c++-common/loop-gwv-1.c: Strengthen.
* testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c: Likewise.
* testsuite/libgomp.oacc-c-c++-common/loop-red-gwv-1.c: Likewise.
* testsuite/libgomp.oacc-c-c++-common/loop-red-v-1.c: Likewise.
* testsuite/libgomp.oacc-c-c++-common/loop-red-v-2.c: Likewise.
* testsuite/libgomp.oacc-c-c++-common/loop-red-w-1.c: Likewise.
* testsuite/libgomp.oacc-c-c++-common/loop-red-w-2.c: Likewise.
* testsuite/libgomp.oacc-c-c++-common/loop-red-wv-1.c: Likewise.
* testsuite/libgomp.oacc-c-c++-common/loop-v-1.c: Likewise.
* testsuite/libgomp.oacc-c-c++-common/loop-w-1.c: Likewise.
* testsuite/libgomp.oacc-c-c++-common/loop-wv-1.c: Likewise.
* testsuite/libgomp.oacc-c-c++-common/routine-gwv-1.c: Likewise.
* testsuite/libgomp.oacc-c-c++-common/routine-v-1.c: Likewise.
* testsuite/libgomp.oacc-c-c++-common/routine-w-1.c: Likewise.
* testsuite/libgomp.oacc-c-c++-common/routine-wv-1.c: Likewise.
Currently omp_get_device_num does not work on gcn targets with more than one
offload device. The reason is that GOMP_DEVICE_NUM_VAR is static in
icv-device.c and thus "__gomp_device_num" is not visible in the offload image.
This patch removes "static" such that "__gomp_device_num" is now part of the
offload image and can now be found in GOMP_OFFLOAD_load_image in the plugin.
This is not an issue for nvptx. There, "__gomp_device_num" is in the offload
image even with "static".
libgomp/ChangeLog:
* config/gcn/icv-device.c: Make GOMP_DEVICE_NUM_VAR public (remove
"static") to make the device num available in the offload image.
libgomp/
* testsuite/libgomp.oacc-fortran/privatized-ref-1.f95: New test.
* testsuite/libgomp.oacc-c++/privatized-ref-2.C: New test.
* testsuite/libgomp.oacc-c++/privatized-ref-3.C: New test.
Co-authored-by: Thomas Schwinge <thomas@codesourcery.com>
libgomp/
* plugin/plugin-gcn.c (parse_target_attributes): Automatically set
the number of teams and threads if necessary.
(gcn_exec): Automatically set the number of gangs and workers if
necessary.
Co-Authored-By: Andrew Stubbs <ams@codesourcery.com>
This patch adds support for OpenMP 5.0 allocate clause for fortran. It does not
yet support the allocator-modifier as specified in OpenMP 5.1. The allocate
clause is already supported in C/C++.
gcc/fortran/ChangeLog:
* dump-parse-tree.c (show_omp_clauses): Handle OMP_LIST_ALLOCATE.
* gfortran.h (OMP_LIST_ALLOCATE): New enum value.
* openmp.c (enum omp_mask1): Add OMP_CLAUSE_ALLOCATE.
(gfc_match_omp_clauses): Handle OMP_CLAUSE_ALLOCATE
(OMP_PARALLEL_CLAUSES, OMP_DO_CLAUSES, OMP_SECTIONS_CLAUSES)
(OMP_TASK_CLAUSES, OMP_TASKLOOP_CLAUSES, OMP_TARGET_CLAUSES)
(OMP_TEAMS_CLAUSES, OMP_DISTRIBUTE_CLAUSES)
(OMP_SINGLE_CLAUSES): Add OMP_CLAUSE_ALLOCATE.
(OMP_TASKGROUP_CLAUSES): New.
(gfc_match_omp_taskgroup): Use OMP_TASKGROUP_CLAUSES instead of
OMP_CLAUSE_TASK_REDUCTION.
(resolve_omp_clauses): Handle OMP_LIST_ALLOCATE.
(resolve_omp_do): Avoid warning when loop iteration variable is
in allocate clause.
* trans-openmp.c (gfc_trans_omp_clauses): Handle translation of
allocate clause.
(gfc_split_omp_clauses): Update for OMP_LIST_ALLOCATE.
gcc/testsuite/ChangeLog:
* gfortran.dg/gomp/allocate-1.f90: New test.
* gfortran.dg/gomp/allocate-2.f90: New test.
* gfortran.dg/gomp/allocate-3.f90: New test.
* gfortran.dg/gomp/collapse1.f90: Update error message.
* gfortran.dg/gomp/openmp-simd-4.f90: Likewise.
* gfortran.dg/gomp/clauses-1.f90: Uncomment allocate clause.
libgomp/ChangeLog:
* testsuite/libgomp.fortran/allocate-1.c: New test.
* testsuite/libgomp.fortran/allocate-1.f90: New test.
* libgomp.texi: Remove string that says that allocate clause
support is for C/C++ only.
After recent commit be661959a6
"libgomp/testsuite: Improve omp_get_device_num() tests", we're now iterating
over all OpenMP target devices. Intel MIC (emulated) offloading still doesn't
properly implement device-side 'omp_get_device_num', and we thus regress:
PASS: libgomp.c/../libgomp.c-c++-common/target-45.c (test for excess errors)
[-PASS:-]{+FAIL:+} libgomp.c/../libgomp.c-c++-common/target-45.c execution test
PASS: libgomp.c++/../libgomp.c-c++-common/target-45.c (test for excess errors)
[-PASS:-]{+FAIL:+} libgomp.c++/../libgomp.c-c++-common/target-45.c execution test
PASS: libgomp.fortran/target10.f90 -O0 (test for excess errors)
[-PASS:-]{+FAIL:+} libgomp.fortran/target10.f90 -O0 execution test
PASS: libgomp.fortran/target10.f90 -O1 (test for excess errors)
[-PASS:-]{+FAIL:+} libgomp.fortran/target10.f90 -O1 execution test
PASS: libgomp.fortran/target10.f90 -O2 (test for excess errors)
[-PASS:-]{+FAIL:+} libgomp.fortran/target10.f90 -O2 execution test
PASS: libgomp.fortran/target10.f90 -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions (test for excess errors)
[-PASS:-]{+FAIL:+} libgomp.fortran/target10.f90 -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions execution test
PASS: libgomp.fortran/target10.f90 -O3 -g (test for excess errors)
[-PASS:-]{+FAIL:+} libgomp.fortran/target10.f90 -O3 -g execution test
PASS: libgomp.fortran/target10.f90 -Os (test for excess errors)
[-PASS:-]{+FAIL:+} libgomp.fortran/target10.f90 -Os execution test
Improve the XFAILing added in commit bb75b22aba
"Allow matching Intel MIC in OpenMP 'declare variant'" for the case that *any*
Intel MIC offload device is available.
libgomp/
* testsuite/libgomp.c-c++-common/on_device_arch.h
(any_device_arch, any_device_arch_intel_mic): New.
* testsuite/lib/libgomp.exp
(check_effective_target_offload_device_any_intel_mic): New.
* testsuite/libgomp.c-c++-common/target-45.c: Use it.
* testsuite/libgomp.fortran/target10.f90: Likewise.
In OpenACC 'kernels' decomposition, we're improperly nesting synchronous and
asynchronous data and compute regions, giving rise to data races when the
asynchronicity is actually executed, as is visible in at least on test case
with GCN offloading.
The proper fix is to correctly use the asynchronous interfaces, making the
currently synchronous data regions fully asynchronous (see also
<https://gcc.gnu.org/PR97390> "[OpenACC] 'async' clause on 'data' construct",
which is to share the same implementation), but that's for later; for now add
some more synchronization.
gcc/
* omp-oacc-kernels-decompose.cc (add_wait): New function, split out
of...
(add_async_clauses_and_wait): ...here. Call new outlined function.
(decompose_kernels_region_body): Add wait at the end of
explicitly-asynchronous kernels regions.
libgomp/
* testsuite/libgomp.oacc-c-c++-common/f-asyncwait-1.c: Remove GCN
offloading execution XFAIL.
Co-Authored-By: Thomas Schwinge <thomas@codesourcery.com>
... as otherwise 'gcc/omp-low.c:lower_omp_target' has to create a temporary:
13073 else if (is_gimple_reg (var))
13074 {
13075 gcc_assert (offloaded);
13076 tree avar = create_tmp_var (TREE_TYPE (var));
13077 mark_addressable (avar);
..., which (a) is only implemented for actualy *offloaded* regions (but not
data regions), and (b) the subsequently synthesized code for writing to and
later reading back from the temporary fundamentally conflicts with OpenACC
'async' (as used by OpenACC 'kernels' decomposition). That's all not trivial
to make work, so let's just avoid this case.
gcc/
PR middle-end/100280
* omp-oacc-kernels-decompose.cc (maybe_build_inner_data_region):
Mark variables used in synthesized data clauses as addressable.
gcc/testsuite/
PR middle-end/100280
* c-c++-common/goacc/kernels-decompose-pr100280-1.c: New.
* c-c++-common/goacc/classify-kernels-parloops.c: Likewise.
* c-c++-common/goacc/classify-kernels-unparallelized-parloops.c:
Likewise.
* c-c++-common/goacc/classify-kernels-unparallelized.c: Test
'--param openacc-kernels=decompose'.
* c-c++-common/goacc/classify-kernels.c: Likewise.
* c-c++-common/goacc/kernels-decompose-2.c: Update.
* c-c++-common/goacc/kernels-decompose-ice-1.c: Remove.
* c-c++-common/goacc/kernels-decompose-ice-2.c: Likewise.
* gfortran.dg/goacc/classify-kernels-parloops.f95: New.
* gfortran.dg/goacc/classify-kernels-unparallelized-parloops.f95:
Likewise.
* gfortran.dg/goacc/classify-kernels-unparallelized.f95: Test
'--param openacc-kernels=decompose'.
* gfortran.dg/goacc/classify-kernels.f95: Likewise.
libgomp/
PR middle-end/100280
* testsuite/libgomp.oacc-c-c++-common/declare-vla-kernels-decompose-ice-1.c:
Update.
* testsuite/libgomp.oacc-c-c++-common/f-asyncwait-1.c: Likewise.
* testsuite/libgomp.oacc-c-c++-common/kernels-decompose-1.c:
Likewise.
Suggested-by: Julian Brown <julian@codesourcery.com>
Related to r12-6208-gebc853deb7cc0487de9ef6e891a007ba853d1933
"libgomp: Fix GOMP_DEVICE_NUM_VAR stringification during offload image load"
That commit fixed an issue with omp_get_device_num() on gcn/nvptx that
resulted in having always the value 0.
This commit modifies the tests to iterate over all devices such that on a
multi-nonhost-device system it had detected that always-zero issue.
libgomp/ChangeLog:
* testsuite/libgomp.c-c++-common/target-45.c: Iterate over all devices.
* testsuite/libgomp.fortran/target10.f90: Likewise.
In the patch that implemented omp_get_device_num(), there was an error where
the stringification of GOMP_DEVICE_NUM_VAR, which is the macro expanding to
the actual symbol used, was erroneously using the STRINGX() macro in the
libgomp offload image symbol search, and expansion of the variable name
string through the additional layer of preprocessor symbol was not properly
achieved.
This patch fixes this by changing to properly use XSTRING(), also from
include/symcat.h.
libgomp/ChangeLog:
* plugin/plugin-gcn.c (GOMP_OFFLOAD_load_image): Change uses of STRINGX
into XSTRING when looking for GOMP_DEVICE_NUM_VAR in offload image.
* plugin/plugin-nvptx.c (GOMP_OFFLOAD_load_image): Likewise.
Up to now the libgomp GCN plugin has been finding the offload variables
by using a symbol lookup, but the AMD runtime requires that the symbols are
global for that to work. This was ensured by mkoffload as a post-procssing
step, but the LLVM 13 assembler no longer accepts this in the case where the
variable was previously declared differently.
This patch switches to locating the symbols directly from the
offload_var_table, which means that only one symbol needs to be forced
global.
This changes breaks the libgomp image compatibility so GOMP_VERSION_GCN has
also been bumped.
gcc/ChangeLog:
* config/gcn/mkoffload.c (process_asm): Process the variable table
completely differently.
(process_obj): Encode the varaible data differently.
include/ChangeLog:
* gomp-constants.h (GOMP_VERSION_GCN): Bump.
libgomp/ChangeLog:
* plugin/plugin-gcn.c (struct gcn_image_desc): Remove global_variables.
(GOMP_OFFLOAD_load_image): Locate the offload variables via the
table, not individual symbols.
Some testcases for libgomp.c++ only works for non-shared address space offloading,
because it exercises the zero-length array section behavior for offloaded
address space, testing for NULL/non-NULL cases.
libgomp/ChangeLog:
* testsuite/libgomp.c++/target-lambda-1.C: Only run under
"target offload_device_nonshared_as"
* testsuite/libgomp.c++/target-this-3.C: Likewise.
* testsuite/libgomp.c++/target-this-4.C: Likewise.