Add an "early rematerialisation" pass

This patch looks for pseudo registers that are live across a call
and for which no call-preserved hard registers exist.  It then
recomputes the pseudos as necessary to ensure that they are no
longer live across a call.  The comment at the head of the file
describes the approach.
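As a rough illustration only, the toy model below shows the intended effect on straight-line code. It is not the algorithm in early-remat.c, whose diff is suppressed on this page. The instruction kinds, struct toy_insn, early_remat_toy and the other helpers are hypothetical. Every register is assumed to have a mode with no call-preserved hard registers, and every candidate is assumed to be a cheap constant definition. Rather than reproducing the real pass's placement logic, the toy recomputes a value just before its first use after an intervening call.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy IR: a register defined from a constant, a use of a
   register, or a call.  All registers are assumed to be in a mode that
   has no call-preserved hard registers.  */
enum toy_kind { TOY_DEF_CONST, TOY_USE, TOY_CALL };

struct toy_insn
{
  enum toy_kind kind;
  int reg;    /* Register defined or used (ignored for TOY_CALL).  */
  int value;  /* Constant for TOY_DEF_CONST.  */
};

#define TOY_MAX_INSNS 64
#define TOY_MAX_REGS 16

/* Return true if REG is used at or after index START before being
   redefined.  */
static bool
toy_live_at (const struct toy_insn *code, int n, int start, int reg)
{
  for (int i = start; i < n; ++i)
    {
      if (code[i].kind == TOY_USE && code[i].reg == reg)
        return true;
      if (code[i].kind == TOY_DEF_CONST && code[i].reg == reg)
        return false;
    }
  return false;
}

/* Return true if any register is live across any call in CODE.  */
static bool
toy_any_live_across_call (const struct toy_insn *code, int n)
{
  for (int i = 0; i < n; ++i)
    if (code[i].kind == TOY_CALL)
      for (int r = 0; r < TOY_MAX_REGS; ++r)
        if (toy_live_at (code, n, i + 1, r))
          return true;
  return false;
}

/* Copy IN to OUT, re-emitting a register's constant definition before
   its first use after a call, so that no register stays live across a
   call.  Definitions made dead by this are left for a later DCE pass.
   Return the length of OUT.  */
static int
early_remat_toy (const struct toy_insn *in, int n, struct toy_insn *out)
{
  bool has_def[TOY_MAX_REGS] = { false };
  bool stale[TOY_MAX_REGS] = { false };  /* A call occurred since the def.  */
  int def_value[TOY_MAX_REGS] = { 0 };
  int len = 0;

  assert (2 * n <= TOY_MAX_INSNS);
  for (int i = 0; i < n; ++i)
    {
      int r = in[i].reg;
      if (in[i].kind == TOY_USE && has_def[r] && stale[r])
        {
          /* Recompute instead of keeping R live across the call.  */
          out[len++] = (struct toy_insn) { TOY_DEF_CONST, r, def_value[r] };
          stale[r] = false;
        }
      out[len++] = in[i];
      if (in[i].kind == TOY_DEF_CONST)
        {
          has_def[r] = true;
          stale[r] = false;
          def_value[r] = in[i].value;
        }
      else if (in[i].kind == TOY_CALL)
        for (int j = 0; j < TOY_MAX_REGS; ++j)
          stale[j] = true;
    }
  return len;
}
```

For the input "r1 := 5; use r1; call; use r1; call; call; use r1", r1 is live across all three calls. The toy output recomputes r1 before each of the two uses that follow a call, giving nine instructions with no register live across a call.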

A new target hook selects which modes should be treated in this way.
By default none are, in which case the pass is skipped very early.
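The gating can be modelled roughly as follows. This is a hypothetical sketch rather than GCC code. A 64-bit mask stands in for the sbitmap that TARGET_SELECT_EARLY_REMAT_MODES fills in, and the toy_mode values are invented. toy_sve_select only mirrors the shape of aarch64_select_early_remat_modes, which sets the bit for every SVE mode.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical modes; the real hook iterates over NUM_MACHINE_MODES.  */
enum toy_mode { TOY_SI, TOY_DI, TOY_SVE_DATA, TOY_SVE_PRED, TOY_NUM_MODES };

/* Stand-in for an sbitmap with one bit per mode.  */
typedef uint64_t toy_mode_set;
typedef void (*toy_select_hook) (toy_mode_set *);

/* Like the default hook: select no modes.  */
static void
toy_default_select (toy_mode_set *modes)
{
  (void) modes;
}

/* Like the aarch64 hook: select every (toy) SVE mode.  */
static void
toy_sve_select (toy_mode_set *modes)
{
  for (int i = 0; i < TOY_NUM_MODES; ++i)
    if (i == TOY_SVE_DATA || i == TOY_SVE_PRED)
      *modes |= (toy_mode_set) 1 << i;
}

static bool
toy_mode_selected (toy_mode_set set, enum toy_mode m)
{
  assert (m < TOY_NUM_MODES);
  return (set >> m) & 1;
}

/* Ask the target for its modes; an empty set means the pass is skipped
   before doing any other work.  */
static bool
toy_pass_should_run (toy_select_hook hook, toy_mode_set *selected)
{
  *selected = 0;
  hook (selected);
  return *selected != 0;
}
```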

It might also be worth looking for cases like:

   C1: R1 := f (...)
   ...
   C2: R2 := f (...)
   C3: R1 := R2

and giving the same value number to C1 and C3, effectively treating
it like:

   C1: R1 := f (...)
   ...
   C2: R2 := f (...)
   C3: R1 := f (...)
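One way to give C1 and C3 the same value number is to let a register copy inherit the value number currently held by its source register. The sketch below uses a hypothetical, simplified candidate representation (struct vn_candidate, number_candidates). Each candidate either evaluates an expression identified by a string or copies another register. The sketch assumes that equal expression strings always compute equal values, and it does not show how early-remat.c actually numbers its candidates.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define VN_MAX_REGS 16
#define VN_MAX_VALUES 16

/* Hypothetical candidate: DEST := EXPR, or DEST := SRC_REG if EXPR is
   NULL.  */
struct vn_candidate
{
  int dest;
  const char *expr;
  int src_reg;
};

/* Store a value number for each of the N candidates in VN and return the
   number of distinct values.  Candidates with the same expression share
   a number; a copy reuses the number held by its source register.  */
static int
number_candidates (const struct vn_candidate *c, int n, int *vn)
{
  const char *value_expr[VN_MAX_VALUES];
  int reg_vn[VN_MAX_REGS];
  int num_values = 0;

  for (int r = 0; r < VN_MAX_REGS; ++r)
    reg_vn[r] = -1;  /* No known value yet.  */

  for (int i = 0; i < n; ++i)
    {
      int v = -1;
      if (c[i].expr == NULL)
        v = reg_vn[c[i].src_reg];
      else
        for (int k = 0; k < num_values && v < 0; ++k)
          if (value_expr[k] && strcmp (value_expr[k], c[i].expr) == 0)
            v = k;
      if (v < 0)
        {
          /* New value.  A copy of a register with no known value gets a
             NULL expression, so it never matches by expression.  */
          assert (num_values < VN_MAX_VALUES);
          value_expr[num_values] = c[i].expr;
          v = num_values++;
        }
      vn[i] = v;
      reg_vn[c[i].dest] = v;
    }
  return num_values;
}
```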

Another (much more expensive) enhancement would be to apply value
numbering to all pseudo registers (not just rematerialisation
candidates), so that we can handle things like:

  C1: R1 := f (...R2...)
  ...
  C2: R1 := f (...R3...)

where R2 and R3 hold the same value.  But the current pass seems
to catch the vast majority of cases.
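The more general enhancement resembles classic local value numbering over every pseudo. Here an expression's key includes the value numbers of its operands, so f (...R2...) and f (...R3...) match whenever R2 and R3 already share a number. The following is a hypothetical single-operand model, not code from early-remat.c. The names struct lvn_insn, local_value_number and LVN_NO_OPERAND are invented for illustration.

```c
#include <assert.h>
#include <string.h>

#define LVN_MAX_REGS 16
#define LVN_MAX_VALUES 32
#define LVN_NO_OPERAND (-1)

/* Hypothetical instruction: DEST := OP (ARG), or DEST := OP if ARG is
   LVN_NO_OPERAND.  */
struct lvn_insn
{
  int dest;
  const char *op;
  int arg;
};

struct lvn_value
{
  const char *op;
  int arg_vn;
};

/* Store a value number for each of the N instructions in VN and return
   the number of distinct values.  */
static int
local_value_number (const struct lvn_insn *code, int n, int *vn)
{
  struct lvn_value table[LVN_MAX_VALUES];
  int reg_vn[LVN_MAX_REGS];
  int num_values = 0;

  /* Give each register a distinct incoming value (-2, -3, ...) so that
     two different undefined registers never compare equal.  -1 is
     reserved for "no operand".  */
  for (int r = 0; r < LVN_MAX_REGS; ++r)
    reg_vn[r] = -2 - r;

  for (int i = 0; i < n; ++i)
    {
      int arg_vn = (code[i].arg == LVN_NO_OPERAND
                    ? -1 : reg_vn[code[i].arg]);
      int v = -1;
      for (int k = 0; k < num_values && v < 0; ++k)
        if (table[k].arg_vn == arg_vn && strcmp (table[k].op, code[i].op) == 0)
          v = k;
      if (v < 0)
        {
          assert (num_values < LVN_MAX_VALUES);
          table[num_values] = (struct lvn_value) { code[i].op, arg_vn };
          v = num_values++;
        }
      vn[i] = v;
      reg_vn[code[i].dest] = v;
    }
  return num_values;
}
```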

2018-01-13  Richard Sandiford  <richard.sandiford@linaro.org>

gcc/
	* Makefile.in (OBJS): Add early-remat.o.
	* target.def (select_early_remat_modes): New hook.
	* doc/tm.texi.in (TARGET_SELECT_EARLY_REMAT_MODES): New hook.
	* doc/tm.texi: Regenerate.
	* targhooks.h (default_select_early_remat_modes): Declare.
	* targhooks.c (default_select_early_remat_modes): New function.
	* timevar.def (TV_EARLY_REMAT): New timevar.
	* passes.def (pass_early_remat): New pass.
	* tree-pass.h (make_pass_early_remat): Declare.
	* early-remat.c: New file.
	* config/aarch64/aarch64.c (aarch64_select_early_remat_modes): New
	function.
	(TARGET_SELECT_EARLY_REMAT_MODES): Define.

gcc/testsuite/
	* gcc.target/aarch64/sve/spill_1.c: Also test that no predicates
	are spilled.
	* gcc.target/aarch64/sve/spill_2.c: New test.
	* gcc.target/aarch64/sve/spill_3.c: Likewise.
	* gcc.target/aarch64/sve/spill_4.c: Likewise.
	* gcc.target/aarch64/sve/spill_5.c: Likewise.
	* gcc.target/aarch64/sve/spill_6.c: Likewise.
	* gcc.target/aarch64/sve/spill_7.c: Likewise.

From-SVN: r256636
Richard Sandiford 2018-01-13 18:00:51 +00:00 committed by Richard Sandiford
parent d1d20a49a7
commit 5cce817119
20 changed files with 2945 additions and 0 deletions

gcc/ChangeLog

@@ -1,3 +1,19 @@
2018-01-13  Richard Sandiford  <richard.sandiford@linaro.org>

	* Makefile.in (OBJS): Add early-remat.o.
	* target.def (select_early_remat_modes): New hook.
	* doc/tm.texi.in (TARGET_SELECT_EARLY_REMAT_MODES): New hook.
	* doc/tm.texi: Regenerate.
	* targhooks.h (default_select_early_remat_modes): Declare.
	* targhooks.c (default_select_early_remat_modes): New function.
	* timevar.def (TV_EARLY_REMAT): New timevar.
	* passes.def (pass_early_remat): New pass.
	* tree-pass.h (make_pass_early_remat): Declare.
	* early-remat.c: New file.
	* config/aarch64/aarch64.c (aarch64_select_early_remat_modes): New
	function.
	(TARGET_SELECT_EARLY_REMAT_MODES): Define.

2018-01-13  Richard Sandiford  <richard.sandiford@linaro.org>
	    Alan Hayward  <alan.hayward@arm.com>
	    David Sherwood  <david.sherwood@arm.com>

gcc/Makefile.in

@@ -1277,6 +1277,7 @@ OBJS = \
	dwarf2asm.o \
	dwarf2cfi.o \
	dwarf2out.o \
	early-remat.o \
	emit-rtl.o \
	et-forest.o \
	except.o \

gcc/config/aarch64/aarch64.c

@@ -17130,6 +17130,22 @@ aarch64_can_change_mode_class (machine_mode from,
  return true;
}

/* Implement TARGET_SELECT_EARLY_REMAT_MODES. */

static void
aarch64_select_early_remat_modes (sbitmap modes)
{
  /* SVE values are not normally live across a call, so it should be
     worth doing early rematerialization even in VL-specific mode. */
  for (int i = 0; i < NUM_MACHINE_MODES; ++i)
    {
      machine_mode mode = (machine_mode) i;
      unsigned int vec_flags = aarch64_classify_vector_mode (mode);
      if (vec_flags & VEC_ANY_SVE)
        bitmap_set_bit (modes, i);
    }
}

/* Target-specific selftests. */

#if CHECKING_P
@@ -17596,6 +17612,9 @@ aarch64_libgcc_floating_mode_supported_p
#undef TARGET_CAN_CHANGE_MODE_CLASS
#define TARGET_CAN_CHANGE_MODE_CLASS aarch64_can_change_mode_class

#undef TARGET_SELECT_EARLY_REMAT_MODES
#define TARGET_SELECT_EARLY_REMAT_MODES aarch64_select_early_remat_modes

#if CHECKING_P
#undef TARGET_RUN_TARGET_SELFTESTS
#define TARGET_RUN_TARGET_SELFTESTS selftest::aarch64_run_selftests

gcc/doc/tm.texi

@@ -2774,6 +2774,17 @@ details.
With LRA, the default is to use @var{mode} unmodified.
@end deftypefn

@deftypefn {Target Hook} void TARGET_SELECT_EARLY_REMAT_MODES (sbitmap @var{modes})
On some targets, certain modes cannot be held in registers around a
standard ABI call and are relatively expensive to spill to the stack.
The early rematerialization pass can help in such cases by aggressively
recomputing values after calls, so that they don't need to be spilled.

This hook returns the set of such modes by setting the associated bits
in @var{modes}. The default implementation selects no modes, which has
the effect of disabling the early rematerialization pass.
@end deftypefn

@deftypefn {Target Hook} bool TARGET_CLASS_LIKELY_SPILLED_P (reg_class_t @var{rclass})
A target hook which returns @code{true} if pseudos that have been assigned
to registers of class @var{rclass} would likely be spilled because

gcc/doc/tm.texi.in

@@ -2307,6 +2307,8 @@ Do not define this macro if you do not define
@hook TARGET_SECONDARY_MEMORY_NEEDED_MODE

@hook TARGET_SELECT_EARLY_REMAT_MODES

@hook TARGET_CLASS_LIKELY_SPILLED_P

@hook TARGET_CLASS_MAX_NREGS

gcc/early-remat.c (new file, 2611 additions)

File diff suppressed because it is too large.

gcc/passes.def

@@ -460,6 +460,7 @@ along with GCC; see the file COPYING3. If not see
NEXT_PASS (pass_sms);
NEXT_PASS (pass_live_range_shrinkage);
NEXT_PASS (pass_sched);
NEXT_PASS (pass_early_remat);
NEXT_PASS (pass_ira);
NEXT_PASS (pass_reload);
NEXT_PASS (pass_postreload);

gcc/target.def

@@ -5572,6 +5572,19 @@ reload from using some alternatives, like @code{TARGET_PREFERRED_RELOAD_CLASS}."
(rtx x, reg_class_t rclass),
default_preferred_output_reload_class)
DEFHOOK
(select_early_remat_modes,
"On some targets, certain modes cannot be held in registers around a\n\
standard ABI call and are relatively expensive to spill to the stack.\n\
The early rematerialization pass can help in such cases by aggressively\n\
recomputing values after calls, so that they don't need to be spilled.\n\
\n\
This hook returns the set of such modes by setting the associated bits\n\
in @var{modes}. The default implementation selects no modes, which has\n\
the effect of disabling the early rematerialization pass.",
void, (sbitmap modes),
default_select_early_remat_modes)
DEFHOOK
(class_likely_spilled_p,
"A target hook which returns @code{true} if pseudos that have been assigned\n\

gcc/targhooks.c

@@ -82,6 +82,7 @@ along with GCC; see the file COPYING3. If not see
#include "params.h"
#include "real.h"
#include "langhooks.h"
#include "sbitmap.h"

bool
default_legitimate_address_p (machine_mode mode ATTRIBUTE_UNUSED,
@@ -2329,4 +2330,11 @@ default_stack_clash_protection_final_dynamic_probe (rtx residual ATTRIBUTE_UNUSE
  return 0;
}

/* The default implementation of TARGET_SELECT_EARLY_REMAT_MODES. */

void
default_select_early_remat_modes (sbitmap)
{
}

#include "gt-targhooks.h"

gcc/targhooks.h

@@ -287,5 +287,6 @@ extern unsigned int default_min_arithmetic_precision (void);
extern enum flt_eval_method
default_excess_precision (enum excess_precision_type ATTRIBUTE_UNUSED);
extern bool default_stack_clash_protection_final_dynamic_probe (rtx);
extern void default_select_early_remat_modes (sbitmap);

#endif /* GCC_TARGHOOKS_H */

gcc/testsuite/ChangeLog

@@ -1,3 +1,14 @@
2018-01-13  Richard Sandiford  <richard.sandiford@linaro.org>

	* gcc.target/aarch64/sve/spill_1.c: Also test that no predicates
	are spilled.
	* gcc.target/aarch64/sve/spill_2.c: New test.
	* gcc.target/aarch64/sve/spill_3.c: Likewise.
	* gcc.target/aarch64/sve/spill_4.c: Likewise.
	* gcc.target/aarch64/sve/spill_5.c: Likewise.
	* gcc.target/aarch64/sve/spill_6.c: Likewise.
	* gcc.target/aarch64/sve/spill_7.c: Likewise.

2018-01-13  Richard Sandiford  <richard.sandiford@linaro.org>
	    Alan Hayward  <alan.hayward@arm.com>
	    David Sherwood  <david.sherwood@arm.com>

gcc/testsuite/gcc.target/aarch64/sve/spill_1.c

@@ -26,3 +26,5 @@ TEST_LOOP (uint64_t, 511);
/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #511\n} 2 } } */
/* { dg-final { scan-assembler-not {\tldr\tz[0-9]} } } */
/* { dg-final { scan-assembler-not {\tstr\tz[0-9]} } } */
/* { dg-final { scan-assembler-not {\tldr\tp[0-9]} } } */
/* { dg-final { scan-assembler-not {\tstr\tp[0-9]} } } */

gcc/testsuite/gcc.target/aarch64/sve/spill_2.c

@@ -0,0 +1,39 @@
/* { dg-do compile } */
/* { dg-options "-O2 -ftree-vectorize -msve-vector-bits=scalable" } */

#include <stdint.h>

void consumer (void *);

#define TEST_LOOP(TYPE) \
  void \
  multi_loop_##TYPE (TYPE *x, TYPE val) \
  { \
    for (int i = 0; i < 7; ++i) \
      x[i] += val; \
    consumer (x); \
    for (int i = 0; i < 7; ++i) \
      x[i] += val; \
    consumer (x); \
    for (int i = 0; i < 7; ++i) \
      x[i] += val; \
    consumer (x); \
  }

/* One iteration is enough. */
TEST_LOOP (uint8_t);
TEST_LOOP (uint16_t);
/* Two iterations are enough. Complete unrolling makes sense
   even at -O2. */
TEST_LOOP (uint32_t);
/* Four iterations are needed; ought to stay a loop. */
TEST_LOOP (uint64_t);

/* { dg-final { scan-assembler-times {\twhilelo\tp[0-9]\.b} 3 } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-9]\.h} 3 } } */
/* { dg-final { scan-assembler {\twhilelo\tp[0-9]\.s} } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-9]\.d} 6 } } */
/* { dg-final { scan-assembler-not {\tldr\tz[0-9]} } } */
/* { dg-final { scan-assembler-not {\tstr\tz[0-9]} } } */
/* { dg-final { scan-assembler-not {\tldr\tp[0-9]} } } */
/* { dg-final { scan-assembler-not {\tstr\tp[0-9]} } } */

gcc/testsuite/gcc.target/aarch64/sve/spill_3.c

@@ -0,0 +1,48 @@
/* { dg-do compile } */
/* { dg-options "-O2 -ftree-vectorize -msve-vector-bits=scalable" } */

#include <stdint.h>

void consumer (void *);

#define TEST_LOOP(TYPE) \
  void \
  multi_loop_##TYPE (TYPE *x, TYPE val1, TYPE val2, int n) \
  { \
    for (int i = 0; i < n; ++i) \
      { \
        x[i * 2] += val1; \
        x[i * 2 + 1] += val2; \
      } \
    consumer (x); \
    for (int i = 0; i < n; ++i) \
      { \
        x[i * 2] += val1; \
        x[i * 2 + 1] += val2; \
      } \
    consumer (x); \
    for (int i = 0; i < n; ++i) \
      { \
        x[i * 2] += val1; \
        x[i * 2 + 1] += val2; \
      } \
    consumer (x); \
  }

/* One iteration is enough. */
TEST_LOOP (uint8_t);
TEST_LOOP (uint16_t);
/* Two iterations are enough. Complete unrolling makes sense
   even at -O2. */
TEST_LOOP (uint32_t);
/* Four iterations are needed; ought to stay a loop. */
TEST_LOOP (uint64_t);

/* { dg-final { scan-assembler {\tld1b\tz[0-9]\.b} } } */
/* { dg-final { scan-assembler {\tld1h\tz[0-9]\.h} } } */
/* { dg-final { scan-assembler {\tld1w\tz[0-9]\.s} } } */
/* { dg-final { scan-assembler {\tld1d\tz[0-9]\.d} } } */
/* { dg-final { scan-assembler-not {\tldr\tz[0-9]} } } */
/* { dg-final { scan-assembler-not {\tstr\tz[0-9]} } } */
/* { dg-final { scan-assembler-not {\tldr\tp[0-9]} } } */
/* { dg-final { scan-assembler-not {\tstr\tp[0-9]} } } */

gcc/testsuite/gcc.target/aarch64/sve/spill_4.c

@@ -0,0 +1,36 @@
/* { dg-do compile } */
/* { dg-options "-O2 -ftree-vectorize" } */

#include <stdint.h>

void consumer (void *);

#define TEST_LOOP(TYPE, VAL) \
  void \
  multi_loop_##TYPE (TYPE *x) \
  { \
    for (int i = 0; i < 100; ++i) \
      x[i] += VAL; \
    consumer (x); \
    for (int i = 0; i < 100; ++i) \
      x[i] += VAL; \
    consumer (x); \
    for (int i = 0; i < 100; ++i) \
      x[i] += VAL; \
    consumer (x); \
  }

TEST_LOOP (uint16_t, 0x1234);
TEST_LOOP (uint32_t, 0x12345);
TEST_LOOP (uint64_t, 0x123456);

/* { dg-final { scan-assembler-times {\tptrue\tp[0-9]+\.h,} 3 } } */
/* { dg-final { scan-assembler-times {\tptrue\tp[0-9]+\.s,} 3 } } */
/* { dg-final { scan-assembler-times {\tptrue\tp[0-9]+\.d,} 3 } } */
/* { dg-final { scan-assembler-times {\tld1rh\tz[0-9]+\.h,} 3 } } */
/* { dg-final { scan-assembler-times {\tld1rw\tz[0-9]+\.s,} 3 } } */
/* { dg-final { scan-assembler-times {\tld1rd\tz[0-9]+\.d,} 3 } } */
/* { dg-final { scan-assembler-not {\tldr\tz[0-9]} } } */
/* { dg-final { scan-assembler-not {\tstr\tz[0-9]} } } */
/* { dg-final { scan-assembler-not {\tldr\tp[0-9]} } } */
/* { dg-final { scan-assembler-not {\tstr\tp[0-9]} } } */

gcc/testsuite/gcc.target/aarch64/sve/spill_5.c

@@ -0,0 +1,34 @@
/* { dg-do compile } */
/* { dg-options "-O2 -ftree-vectorize -ffast-math" } */

#include <stdint.h>

void consumer (void *);

#define TEST_LOOP(TYPE, VAL) \
  void \
  multi_loop_##TYPE (TYPE *x) \
  { \
    for (int i = 0; i < 100; ++i) \
      x[i] += VAL + i; \
    consumer (x); \
    for (int i = 0; i < 100; ++i) \
      x[i] += VAL + i; \
    consumer (x); \
    for (int i = 0; i < 100; ++i) \
      x[i] += VAL + i; \
    consumer (x); \
  }

TEST_LOOP (uint8_t, 3);
TEST_LOOP (uint16_t, 4);
TEST_LOOP (uint32_t, 5);
TEST_LOOP (uint64_t, 6);
TEST_LOOP (float, 2.5f);
TEST_LOOP (double, 3.5);

/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\..,} 18 } } */
/* { dg-final { scan-assembler-not {\tldr\tz[0-9]} } } */
/* { dg-final { scan-assembler-not {\tstr\tz[0-9]} } } */
/* { dg-final { scan-assembler-not {\tldr\tp[0-9]} } } */
/* { dg-final { scan-assembler-not {\tstr\tp[0-9]} } } */

gcc/testsuite/gcc.target/aarch64/sve/spill_6.c

@@ -0,0 +1,44 @@
/* { dg-do compile } */
/* { dg-options "-O2 -ftree-vectorize -msve-vector-bits=scalable" } */

#include <stdint.h>

void consumer (void *);

#define TEST_LOOP(TYPE, VAL) \
  void \
  multi_loop_##TYPE (TYPE *x1, TYPE *x2, TYPE *x3, TYPE *x4, int which) \
  { \
    if (which) \
      { \
        for (int i = 0; i < 7; ++i) \
          x1[i] += VAL; \
        consumer (x1); \
        for (int i = 0; i < 7; ++i) \
          x2[i] -= VAL; \
        consumer (x2); \
      } \
    else \
      { \
        for (int i = 0; i < 7; ++i) \
          x3[i] &= VAL; \
        consumer (x3); \
      } \
    for (int i = 0; i < 7; ++i) \
      x4[i] |= VAL; \
    consumer (x4); \
  }

TEST_LOOP (uint8_t, 0x12);
TEST_LOOP (uint16_t, 0x1234);
TEST_LOOP (uint32_t, 0x12345);
TEST_LOOP (uint64_t, 0x123456);

/* { dg-final { scan-assembler {\tld1b\tz[0-9]+\.b,} } } */
/* { dg-final { scan-assembler {\tld1h\tz[0-9]+\.h,} } } */
/* { dg-final { scan-assembler {\tld1w\tz[0-9]+\.s,} } } */
/* { dg-final { scan-assembler {\tld1d\tz[0-9]+\.d,} } } */
/* { dg-final { scan-assembler-not {\tldr\tz[0-9]} } } */
/* { dg-final { scan-assembler-not {\tstr\tz[0-9]} } } */
/* { dg-final { scan-assembler-not {\tldr\tp[0-9]} } } */
/* { dg-final { scan-assembler-not {\tstr\tp[0-9]} } } */

gcc/testsuite/gcc.target/aarch64/sve/spill_7.c

@@ -0,0 +1,46 @@
/* { dg-do compile } */
/* { dg-options "-O2 -ftree-vectorize -ffast-math" } */

#include <stdint.h>

void consumer (void *);

#define TEST_LOOP(TYPE, VAL) \
  void \
  multi_loop_##TYPE (TYPE *x, int n) \
  { \
    for (int k = 0; k < 4; ++k) \
      { \
        for (int j = 0; j < n; ++j) \
          { \
            for (int i = 0; i < 100; ++i) \
              x[i] += VAL + i; \
            asm volatile (""); \
          } \
        for (int j = 0; j < n; ++j) \
          consumer (x); \
        for (int j = 0; j < n; ++j) \
          { \
            for (int i = 0; i < 100; ++i) \
              x[i] += VAL + i; \
            asm volatile (""); \
          } \
        consumer (x); \
        for (int i = 0; i < 100; ++i) \
          x[i] += VAL + i; \
        consumer (x); \
      } \
  }

TEST_LOOP (uint8_t, 3);
TEST_LOOP (uint16_t, 4);
TEST_LOOP (uint32_t, 5);
TEST_LOOP (uint64_t, 6);
TEST_LOOP (float, 2.5f);
TEST_LOOP (double, 3.5);

/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\..,} 18 } } */
/* { dg-final { scan-assembler-not {\tldr\tz[0-9]} } } */
/* { dg-final { scan-assembler-not {\tstr\tz[0-9]} } } */
/* { dg-final { scan-assembler-not {\tldr\tp[0-9]} } } */
/* { dg-final { scan-assembler-not {\tstr\tp[0-9]} } } */

gcc/timevar.def

@@ -253,6 +253,7 @@ DEFTIMEVAR (TV_MODE_SWITCH , "mode switching")
DEFTIMEVAR (TV_SMS , "sms modulo scheduling")
DEFTIMEVAR (TV_LIVE_RANGE_SHRINKAGE , "live range shrinkage")
DEFTIMEVAR (TV_SCHED , "scheduling")
DEFTIMEVAR (TV_EARLY_REMAT , "early rematerialization")
DEFTIMEVAR (TV_IRA , "integrated RA")
DEFTIMEVAR (TV_LRA , "LRA non-specific")
DEFTIMEVAR (TV_LRA_ELIMINATE , "LRA virtuals elimination")

gcc/tree-pass.h

@@ -576,6 +576,7 @@ extern rtl_opt_pass *make_pass_mode_switching (gcc::context *ctxt);
extern rtl_opt_pass *make_pass_sms (gcc::context *ctxt);
extern rtl_opt_pass *make_pass_sched (gcc::context *ctxt);
extern rtl_opt_pass *make_pass_live_range_shrinkage (gcc::context *ctxt);
extern rtl_opt_pass *make_pass_early_remat (gcc::context *ctxt);
extern rtl_opt_pass *make_pass_ira (gcc::context *ctxt);
extern rtl_opt_pass *make_pass_reload (gcc::context *ctxt);
extern rtl_opt_pass *make_pass_clean_state (gcc::context *ctxt);