gcc/libgomp/testsuite/libgomp.oacc-c-c++-common
Julian Brown 2a3f9f6532 openacc: Shared memory layout optimisation
This patch implements an algorithm to lay out local data-share (LDS)
space.  It currently works for AMD GCN.  At the moment, LDS is used for
three things:

  1. Gang-private variables
  2. Reduction temporaries (accumulators)
  3. Broadcasting for worker partitioning

After the patch is applied, (2) and (3) are placed at preallocated
locations in LDS, and (1) continues to be handled by the backend, as it
was before this patch. LDS now looks like this:

  +--------------+ (gang-private size + 1024, = 1536)
  | free space   |
  |    ...       |
  | - - - - - - -|
  | worker bcast |
  +--------------+
  | reductions   |
  +--------------+ <<< -mgang-private-size=<number> (def. 512)
  | gang-private |
  |    vars      |
  +--------------+ (32)
  | low LDS vars |
  +--------------+ LDS base

So, gang-private space is fixed at a constant amount at compile time
(which can be increased with a command-line switch if necessary
for some given code). The layout algorithm takes out a slice of the
remainder of usable space for reduction vars, and uses the rest for
worker partitioning.
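
As a rough illustration of the arithmetic in the diagram above, the
region boundaries can be computed as in the sketch below. This is not
the GCN backend code: the function name lds_layout, the constant names,
and the reduction_bytes parameter are hypothetical, and the 32-byte and
1024-byte figures are simply read off the diagram.

```python
LOW_LDS_BYTES = 32    # top of the "low LDS vars" region, per the diagram
EXTRA_BYTES = 1024    # space above the gang-private area, per the diagram


def lds_layout(gang_private_size=512, reduction_bytes=0):
    """Return (start, end) byte ranges for each LDS region.

    gang_private_size mirrors -mgang-private-size (default 512).
    reduction_bytes is a hypothetical input: the slice that the
    layout step takes out for reduction temporaries.
    """
    top = gang_private_size + EXTRA_BYTES
    red_lo = gang_private_size
    red_hi = red_lo + reduction_bytes
    if red_hi > top:
        raise ValueError("reduction temporaries do not fit in LDS")
    return {
        "low": (0, LOW_LDS_BYTES),
        "gang_private": (LOW_LDS_BYTES, gang_private_size),
        "reductions": (red_lo, red_hi),
        # Whatever remains is available for worker broadcasts.
        "broadcast": (red_hi, top),
    }
```

With the default gang-private size and, say, 64 bytes of reduction
temporaries, the broadcast region would span offsets 576 to 1536.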

The partitioning algorithm works as follows.

 1. An "adjacency" set is built up for each basic block that might
    do a broadcast. This is calculated by starting at each such block,
    and doing a recursive DFS walk over successors to find the next
    block (or blocks) that *also* does a broadcast
    (dfs_broadcast_reachable_1).

 2. The adjacency set is inverted to get adjacent predecessor blocks also.

 3. Blocks that will perform a broadcast are sorted by size of that
    broadcast: the biggest blocks are handled first.

 4. A splay tree structure is used to calculate the spans of LDS memory
    that are already allocated by the blocks adjacent to this one
    (merge_ranges{,_1}).

 5. The current block's broadcast space is allocated from the first free
    span not allocated in the splay tree structure calculated above
    (first_fit_range). This seems to work quite nicely and efficiently
    with the splay tree structure.

 6. Continue with the next-biggest broadcast block until we're done.

In this way, "adjacent" broadcasts will not use the same piece of
LDS memory.
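
The steps above can be sketched roughly as follows. This is a
simplified model, not the code in omp-oacc-neuter-broadcast.cc: it
swaps the splay tree for a plain sorted list of ranges, ignores
alignment, and the names layout_broadcasts, first_fit, sizes and succs
are hypothetical. It assumes the DFS stops at the first broadcasting
block found on each path.

```python
def first_fit(used, size, lo, hi):
    """Lowest offset in [lo, hi) where `size` bytes avoid every used range."""
    pos = lo
    for start, end in sorted(used):
        if start - pos >= size:
            break                      # the gap before this range is big enough
        pos = max(pos, end)            # otherwise skip past the range
    if pos + size > hi:
        raise MemoryError("broadcast does not fit in the available space")
    return pos


def layout_broadcasts(sizes, succs, lo, hi):
    """sizes: block -> broadcast bytes (broadcasting blocks only).
    succs: block -> list of CFG successor blocks (all blocks)."""
    # Step 1: DFS over successors to the next broadcasting block(s).
    adjacent = {b: set() for b in sizes}
    for b in sizes:
        seen, stack = set(), list(succs.get(b, ()))
        while stack:
            n = stack.pop()
            if n in seen:
                continue
            seen.add(n)
            if n in sizes:
                adjacent[b].add(n)     # stop at a broadcasting block
            else:
                stack.extend(succs.get(n, ()))
    # Step 2: invert, so adjacent predecessors are included too.
    for b in list(adjacent):
        for n in list(adjacent[b]):
            adjacent[n].add(b)
    # Steps 3-6: biggest first, first fit around already-placed neighbours.
    offsets = {}
    for b in sorted(sizes, key=lambda k: sizes[k], reverse=True):
        used = [(offsets[n], offsets[n] + sizes[n])
                for n in adjacent[b] if n in offsets]
        offsets[b] = first_fit(used, sizes[b], lo, hi)
    return offsets
```

For a chain A -> B -> C -> D in which A, B and D broadcast 64, 128 and
32 bytes respectively, B is placed first at the base, and A and D, which
are not adjacent to each other, can both reuse the space just above B.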

PR96334 "openacc: Unshare reduction temporaries for GCN" got merged in:

The GCN backend uses tree nodes like MEM((__lds TYPE *) <constant>)
for reduction temporaries. Unlike e.g. var decls and SSA names, these
nodes must not be shared during gimplification, but in some
circumstances they are. Such sharing is detected when appropriate
--enable-checking options are used. This patch unshares such nodes when
they are used more than once.

gcc/
	* config/gcn/gcn-protos.h
	(gcn_goacc_create_worker_broadcast_record): Update prototype.
	* config/gcn/gcn-tree.c (gcn_goacc_get_worker_red_decl): Use
	preallocated block of LDS memory.  Do not cache/share decls for
	reduction temporaries between invocations.
	(gcn_goacc_reduction_teardown): Unshare VAR on second use.
	(gcn_goacc_create_worker_broadcast_record): Add OFFSET parameter
	and return temporary LDS space at that offset.  Return pointer in
	"sender" case.
	* config/gcn/gcn.c (acc_lds_size, gang_private_hwm, lds_allocs):
	New global vars.
	(ACC_LDS_SIZE): Define as acc_lds_size.
	(gcn_init_machine_status): Don't initialise lds_allocated,
	lds_allocs, reduc_decls fields of machine function struct.
	(gcn_option_override): Handle default size for gang-private
	variables and -mgang-private-size option.
	(gcn_expand_prologue): Use LDS_SIZE instead of LDS_SIZE-1 when
	initialising M0_REG.
	(gcn_shared_mem_layout): New function.
	(gcn_print_lds_decl): Update comment. Use global lds_allocs map and
	gang_private_hwm variable.
	(TARGET_GOACC_SHARED_MEM_LAYOUT): Define target hook.
	* config/gcn/gcn.h (machine_function): Remove lds_allocated,
	lds_allocs, reduc_decls. Add reduction_base, reduction_limit.
	* config/gcn/gcn.opt (gang_private_size_opt): New global.
	(mgang-private-size=): New option.
	* doc/tm.texi.in (TARGET_GOACC_SHARED_MEM_LAYOUT): Place
	documentation hook.
	* doc/tm.texi: Regenerate.
	* omp-oacc-neuter-broadcast.cc (targhooks.h, diagnostic-core.h):
	Add includes.
	(build_sender_ref): Handle sender_decl being pointer.
	(worker_single_copy): Add PLACEMENT and ISOLATE_BROADCASTS
	parameters.  Pass placement argument to
	create_worker_broadcast_record hook invocations.  Handle
	sender_decl being pointer and isolate_broadcasts inserting extra
	barriers.
	(blk_offset_map_t): Add typedef.
	(neuter_worker_single): Add BLK_OFFSET_MAP parameter.  Pass
	preallocated range to worker_single_copy call.
	(dfs_broadcast_reachable_1): New function.
	(idx_decl_pair_t, used_range_vec_t): New typedefs.
	(sort_size_descending): New function.
	(addr_range): New class.
	(splay_tree_compare_addr_range, splay_tree_free_key)
	(first_fit_range, merge_ranges_1, merge_ranges): New functions.
	(execute_omp_oacc_neuter_broadcast): Rename to...
	(oacc_do_neutering): ... this.  Add BOUNDS_LO, BOUNDS_HI
	parameters.  Arrange layout of shared memory for broadcast
	operations.
	(execute_omp_oacc_neuter_broadcast): New function.
	(pass_omp_oacc_neuter_broadcast::gate): Remove num_workers==1
	handling from here.  Enable pass for all OpenACC routines in order
	to call shared memory-layout hook.
	* target.def (create_worker_broadcast_record): Add OFFSET
	parameter.
	(shared_mem_layout): New hook.
libgomp/
	* testsuite/libgomp.oacc-c-c++-common/broadcast-many.c: Update.
2021-09-17 21:04:30 +02:00
abort-1.c
abort-2.c
abort-3.c
abort-4.c
abort-5.c
acc_free-pr92503-1.c
acc_free-pr92503-2.c
acc_free-pr92503-3-2.c
acc_free-pr92503-3.c
acc_free-pr92503-4-2.c
acc_free-pr92503-4.c
acc_get_property-aux.c
acc_get_property-gcn.c
acc_get_property-host.c
acc_get_property-nvptx.c
acc_get_property.c
acc_map_data-device_already-1.c
acc_map_data-device_already-2.c
acc_map_data-device_already-3.c
acc_map_data-host_already-1.c
acc_map_data-host_already-2.c
acc_map_data-host_already-3.c
acc_on_device-1.c
acc_prof-dispatch-1.c
acc_prof-init-1.c [OpenACC] Clarify sequencing of 'async' data copying vs. profiling events in 'libgomp.oacc-c-c++-common/acc_prof-{init,parallel}-1.c' 2021-07-27 11:16:25 +02:00
acc_prof-init-2.c
acc_prof-kernels-1.c amdgcn: Enable OpenACC worker partitioning for AMD GCN 2021-08-09 15:08:44 +02:00
acc_prof-parallel-1.c [OpenACC] Clarify sequencing of 'async' data copying vs. profiling events in 'libgomp.oacc-c-c++-common/acc_prof-{init,parallel}-1.c' 2021-07-27 11:16:25 +02:00
acc_prof-valid_bytes-1.c
acc_prof-version-1.c
acc_set_cuda_stream-1.c
acc_unmap_data-pr92840-1.c
acc_unmap_data-pr92840-2.c
acc_unmap_data-pr92840-3.c
acc-on-device-2.c
acc-on-device.c
async_queue-1.c
async-data-1-1.c Fix OpenACC "ephemeral" asynchronous host-to-device copies 2021-07-27 11:16:27 +02:00
async-data-1-2.c Don't use libgomp 'cbuf' buffering with OpenACC 'async' 2021-07-27 11:16:37 +02:00
asyncwait-1.c
asyncwait-nop-1.c
atomic_capture-1.c
atomic_capture-2.c
atomic_capture-3.c
atomic_rw-1.c
atomic_update-1.c
broadcast-1.c
broadcast-many.c openacc: Shared memory layout optimisation 2021-09-17 21:04:30 +02:00
cache-1.c
clauses-1.c
clauses-2.c
collapse-1.c
collapse-2.c
collapse-3.c
collapse-4.c
combined-directives-1.c
combined-reduction.c
context-1.c
context-2.c
context-3.c
context-4.c
crash-1.c
data-1.c
data-2-lib.c
data-2.c
data-3.c
data-clauses-kernels-ipa-pta.c
data-clauses-kernels.c
data-clauses-parallel-ipa-pta.c
data-clauses-parallel.c
data-clauses.h
data-firstprivate-1.c
declare-1.c
declare-2.c
declare-3.c
declare-4.c
declare-5.c
declare-vla-kernels-decompose-ice-1.c
declare-vla-kernels-decompose.c
declare-vla.c
deep-copy-1.c
deep-copy-2.c
deep-copy-3.c
deep-copy-4.c
deep-copy-5.c
deep-copy-6.c
deep-copy-7.c
deep-copy-8.c
deep-copy-9.c
deep-copy-10.c
deep-copy-11.c
deep-copy-14.c
default-1.c
deviceptr-1.c
enter_exit-lib.c
enter-data.c
f-asyncwait-1.c
f-asyncwait-2.c
f-asyncwait-3.c
firstprivate-1.c
firstprivate-mappings-1.c
function-not-offloaded.c
gang-reduction-var-assignment.c
gang-static-1.c
gang-static-2.c
gomp-debug-env.c
host_data-1.c
host_data-2.c
host_data-4.c
host_data-5.c
host_data-6.c
host_data-7.c
if-1.c
insufficient-resources.c
kernels-alias-ipa-pta-2.c
kernels-alias-ipa-pta-3.c
kernels-alias-ipa-pta.c
kernels-decompose-1.c
kernels-empty.c
kernels-loop-2.c
kernels-loop-3.c
kernels-loop-and-seq-2.c
kernels-loop-and-seq-3.c
kernels-loop-and-seq-4.c
kernels-loop-and-seq-5.c
kernels-loop-and-seq-6.c
kernels-loop-and-seq.c
kernels-loop-clauses.c
kernels-loop-collapse.c
kernels-loop-data-2.c
kernels-loop-data-enter-exit-2.c
kernels-loop-data-enter-exit.c
kernels-loop-data-update.c
kernels-loop-data.c
kernels-loop-g.c
kernels-loop-mod-not-zero.c
kernels-loop-n.c
kernels-loop-nest.c
kernels-loop.c
kernels-parallel-loop-data-enter-exit.c
kernels-private-vars-local-worker-1.c
kernels-private-vars-local-worker-2.c
kernels-private-vars-local-worker-3.c
kernels-private-vars-local-worker-4.c
kernels-private-vars-local-worker-5.c
kernels-private-vars-loop-gang-1.c
kernels-private-vars-loop-gang-2.c
kernels-private-vars-loop-gang-3.c
kernels-private-vars-loop-gang-4.c
kernels-private-vars-loop-gang-5.c
kernels-private-vars-loop-gang-6.c
kernels-private-vars-loop-vector-1.c
kernels-private-vars-loop-vector-2.c
kernels-private-vars-loop-worker-1.c
kernels-private-vars-loop-worker-2.c
kernels-private-vars-loop-worker-3.c
kernels-private-vars-loop-worker-4.c
kernels-private-vars-loop-worker-5.c
kernels-private-vars-loop-worker-6.c
kernels-private-vars-loop-worker-7.c
kernels-reduction-1.c
kernels-reduction.c
lib-1.c
lib-2.c
lib-3.c
lib-4.c
lib-5.c
lib-6.c
lib-7.c
lib-8.c
lib-9.c
lib-10.c
lib-11.c
lib-12.c
lib-13.c
lib-14.c
lib-15.c
lib-16.c
lib-19.c
lib-20.c
lib-23.c
lib-24.c
lib-25.c
lib-26.c
lib-27.c
lib-31.c
lib-32.c
lib-33.c
lib-34.c
lib-35.c
lib-36.c
lib-37.c
lib-39.c
lib-40.c
lib-41.c
lib-42.c
lib-44.c
lib-45.c
lib-46.c
lib-48.c
lib-49.c
lib-51.c
lib-52.c
lib-53.c
lib-54.c
lib-55.c
lib-56.c
lib-57.c
lib-58.c
lib-59.c
lib-60.c
lib-61.c
lib-62.c
lib-63.c
lib-64.c
lib-65.c
lib-66.c
lib-67.c
lib-68.c
lib-69.c
lib-70.c
lib-72.c
lib-73.c
lib-74.c
lib-75.c
lib-76.c
lib-78.c
lib-79.c
lib-81.c
lib-82.c
lib-83.c
lib-84.c
lib-85.c
lib-86.c
lib-87.c
lib-88.c
lib-89.c
lib-90.c
lib-91.c
lib-92.c
lib-94.c Fix OpenACC 'async'/'wait' issues in 'libgomp.oacc-c-c++-common/lib-{94,95}.c', 'libgomp.oacc-fortran/lib-16{,-2}.f90' 2021-07-27 11:16:24 +02:00
lib-95.c Fix OpenACC 'async'/'wait' issues in 'libgomp.oacc-c-c++-common/lib-{94,95}.c', 'libgomp.oacc-fortran/lib-16{,-2}.f90' 2021-07-27 11:16:24 +02:00
loop-auto-1.c
loop-default-runtime.c
loop-default.h
loop-dim-default.c amdgcn: Enable OpenACC worker partitioning for AMD GCN 2021-08-09 15:08:44 +02:00
loop-g-1.c
loop-g-2.c
loop-gwv-1.c
loop-gwv-2.c
loop-red-g-1.c
loop-red-gwv-1.c
loop-red-v-1.c
loop-red-v-2.c
loop-red-w-1.c
loop-red-w-2.c
loop-red-wv-1.c
loop-v-1.c
loop-w-1.c
loop-wv-1.c
map-data-1.c
mapping-1.c
mdc-refcount-1.c
mdc-refcount-2.c
mdc-refcount-3.c
mode-transitions.c Address '?:' issues in 'libgomp.oacc-c-c++-common/mode-transitions.c' 2021-08-16 12:12:09 +02:00
nested-1.c
nested-2.c
no_create-1.c
no_create-2.c
no_create-3.c
no_create-4.c
no_create-5.c
nvptx-merged-loop.c
nvptx-sese-1.c
offset-1.c
par-loop-comb-reduction-1.c
par-loop-comb-reduction-2.c
par-loop-comb-reduction-3.c
par-loop-comb-reduction-4.c
par-reduction-1.c
par-reduction-2.c
parallel-dims.c amdgcn: Enable OpenACC worker partitioning for AMD GCN 2021-08-09 15:08:44 +02:00
parallel-empty.c
parallel-loop-1.c
parallel-loop-1.h
parallel-loop-2.h
parallel-reduction.c
pointer-align-1.c
pr70289.c
pr70373.c
pr70688.c
pr83046.c
pr83589.c
pr83920.c
pr84217.c
pr84955-1.c
pr84955.c
pr85381-2.c
pr85381-3.c
pr85381-4.c
pr85381-5.c
pr85381.c
pr85422.c
pr85486-2.c [OpenACC] Extract 'pass_oacc_loop_designation' out of 'pass_oacc_device_lower' 2021-07-29 09:19:44 +02:00
pr85486-3.c [OpenACC] Extract 'pass_oacc_loop_designation' out of 'pass_oacc_device_lower' 2021-07-29 09:19:44 +02:00
pr85486.c [OpenACC] Extract 'pass_oacc_loop_designation' out of 'pass_oacc_device_lower' 2021-07-29 09:19:44 +02:00
pr85782.c
pr87835.c
pr88288.c
pr88941.c
pr88946.c
pr89376.c
pr90009.c
pr92726-1.c
pr92843-1.c
pr92848-1-d-a.c
pr92848-1-d-p.c
pr92848-1-r-a.c
pr92848-1-r-p.c
pr92854-1.c
pr92877-1.c
pr92970-1.c
pr92984-1.c
pr95270-1.c
pr95270-2.c
present-1.c
present-2.c
private-atomic-1-gang.c
private-atomic-1.c
private-variables.c
reduction-1.c
reduction-2.c
reduction-3.c
reduction-4.c
reduction-5.c
reduction-6.c
reduction-7.c
reduction-8.c
reduction-cplx-dbl.c
reduction-cplx-flt.c
reduction-dbl.c
reduction-flt.c
reduction-initial-1.c
reduction.h
refcounting-1.c
refcounting-2.c
routine-1.c
routine-4.c
routine-g-1.c
routine-gwv-1.c
routine-nohost-1.c [OpenACC] Extract 'pass_oacc_loop_designation' out of 'pass_oacc_device_lower' 2021-07-29 09:19:44 +02:00
routine-nohost-2_2.c OpenACC 'nohost' clause 2021-07-21 23:58:11 +02:00
routine-nohost-2.c OpenACC 'nohost' clause 2021-07-21 23:58:11 +02:00
routine-v-1.c
routine-w-1.c
routine-wv-1.c
routine-wv-2.c amdgcn: Enable OpenACC worker partitioning for AMD GCN 2021-08-09 15:08:44 +02:00
static-variable-1.c Adjust 'libgomp.oacc-c-c++-common/static-variable-1.c' 2021-08-13 22:53:58 +02:00
struct-1.c
struct-3-1-1.c
struct-copyout-1.c
struct-copyout-2.c
structured-detach-underflow.c
structured-dynamic-lifetimes-1-lib.c
structured-dynamic-lifetimes-1.c
structured-dynamic-lifetimes-2-lib.c
structured-dynamic-lifetimes-2.c
structured-dynamic-lifetimes-3-lib.c
structured-dynamic-lifetimes-3.c
structured-dynamic-lifetimes-4-lib.c
structured-dynamic-lifetimes-4.c
structured-dynamic-lifetimes-5-lib.c
structured-dynamic-lifetimes-5.c
structured-dynamic-lifetimes-6-lib.c
structured-dynamic-lifetimes-6.c
structured-dynamic-lifetimes-7-lib.c
structured-dynamic-lifetimes-7.c
structured-dynamic-lifetimes-8-lib.c
structured-dynamic-lifetimes-8.c
subr.h
subr.ptx
subset-subarray-mappings-1-d-a.c
subset-subarray-mappings-1-d-p.c
subset-subarray-mappings-1-r-a.c
subset-subarray-mappings-1-r-p.c
subset-subarray-mappings-2.c
switch-conversion-2.c
switch-conversion.c
tile-1.c
timer.h
unmap-infinity-1.c
update-1.c
variable-not-offloaded.c
vector-length-64-1.c
vector-length-64-2.c
vector-length-64-3.c
vector-length-128-1.c [OpenACC] Extract 'pass_oacc_loop_designation' out of 'pass_oacc_device_lower' 2021-07-29 09:19:44 +02:00
vector-length-128-2.c [OpenACC] Extract 'pass_oacc_loop_designation' out of 'pass_oacc_device_lower' 2021-07-29 09:19:44 +02:00
vector-length-128-3.c [OpenACC] Extract 'pass_oacc_loop_designation' out of 'pass_oacc_device_lower' 2021-07-29 09:19:44 +02:00
vector-length-128-4.c [OpenACC] Extract 'pass_oacc_loop_designation' out of 'pass_oacc_device_lower' 2021-07-29 09:19:44 +02:00
vector-length-128-5.c [OpenACC] Extract 'pass_oacc_loop_designation' out of 'pass_oacc_device_lower' 2021-07-29 09:19:44 +02:00
vector-length-128-6.c [OpenACC] Extract 'pass_oacc_loop_designation' out of 'pass_oacc_device_lower' 2021-07-29 09:19:44 +02:00
vector-length-128-7.c [OpenACC] Extract 'pass_oacc_loop_designation' out of 'pass_oacc_device_lower' 2021-07-29 09:19:44 +02:00
vector-length-128-10.c
vector-loop.c
vector-type-1.c
vprop-2.c
vprop.c
vred2d-128.c
zero_length_subarrays.c