Rather than rely on recursion during the middle of register allocation,
lower indirect registers to loads and stores off the indirect base into
plain temps.
For an x86_64 host, with sufficient registers, this results in identical
code, modulo the actual register assignments.
For an i686 host, with insufficient registers, this means that temps can
be (temporarily) spilled to the stack in order to satisfy an allocation.
This as opposed to the possibility of not being able to spill, to allocate
a register for the indirect base, in order to perform a spill.
Reviewed-by: Aurelien Jarno <aurelien@aurel32.net>
Signed-off-by: Richard Henderson <rth@twiddle.net>
Instead of using -1 as end of chain, use 0, and link through the 0
entry as a fully circular double-linked list.
Reviewed-by: Aurelien Jarno <aurelien@aurel32.net>
Signed-off-by: Richard Henderson <rth@twiddle.net>
TCG backends do not need most of exec-all.h; extract what they actually
need to a separate file or move it directly to tcg.h. The next patch
will stop including exec-all.h from everywhere.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The TCG code is quite performance sensitive, but at the same time can
also be quite tricky. That is why asserts that can be enabled with the
--enable-debug-tcg configure option.
This used to work the following way:
| #include "config.h"
|
| ...
|
| #if !defined(CONFIG_DEBUG_TCG) && !defined(NDEBUG)
| /* define it to suppress various consistency checks (faster) */
| #define NDEBUG
| #endif
|
| ...
|
| #include <assert.h>
Since commit 757e725b (tcg: Clean up includes) "config.h" as been
replaced by "qemu/osdep.h" which itself includes <assert.h>. As a
consequence the assertions are always enabled, even when using
--disable-debug-tcg, causing a performance regression, especially on
targets with many registers. For instance on qemu-system-ppc the
speed difference is about 15%.
tcg_debug_assert is controlled directly by CONFIG_DEBUG_TCG and already
uses in some places. This patch replaces all the calls to assert into
calss to tcg_debug_assert.
Cc: Peter Maydell <peter.maydell@linaro.org>
Cc: Richard Henderson <rth@twiddle.net>
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
Message-id: 1461228530-14852-1-git-send-email-aurelien@aurel32.net
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Clean up includes so that osdep.h is included first and headers
which it implies are not included manually.
This commit was created with scripts/clean-includes.
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Message-id: 1453832250-766-16-git-send-email-peter.maydell@linaro.org
Rather than allow arbitrary shift+trunc, only concern ourselves
with low and high parts. This is all that was being used anyway.
Signed-off-by: Richard Henderson <rth@twiddle.net>
They behave the same as ext32s_i64 and ext32u_i64 from the constant
folding and zero propagation point of view, except that they can't
be replaced by a mov, so we don't compute the affected value.
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
Signed-off-by: Richard Henderson <rth@twiddle.net>
The op is sometimes named trunc_shr_i32 and sometimes trunc_shr_i64_i32,
and the name in the README doesn't match the name offered to the
frontends.
Always use the long name to make it clear it is a size changing op.
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
Signed-off-by: Richard Henderson <rth@twiddle.net>
Instead of using an enum which could be either a copy or a const, track
them separately. This will be used in the next patch.
Constants are tracked through a bool. Copies are tracked by initializing
temp's next_copy and prev_copy to itself, allowing to simplify the code
a bit.
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
Signed-off-by: Richard Henderson <rth@twiddle.net>
Add two accessor functions temp_is_const and temp_is_copy, to make the
code more readable and make code change easier.
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
Signed-off-by: Richard Henderson <rth@twiddle.net>
The tcg_temp_info structure uses 24 bytes per temp. Now that we emulate
vector registers on most guests, it's not uncommon to have more than 100
used temps. This means we have initialize more than 2kB at least twice
per TB, often more when there is a few goto_tb.
Instead used a TCGTempSet bit array to track which temps are in used in
the current basic block. This means there are only around 16 bytes to
initialize.
This improves the boot time of a MIPS guest on an x86-64 host by around
7% and moves out tcg_optimize from the the top of the profiler list.
[rth: Handle TCG_CALL_DUMMY_ARG]
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
Signed-off-by: Richard Henderson <rth@twiddle.net>
By convention, on a 64-bit host TCG internally stores 32-bit constants
as sign-extended. This is not the case in the optimizer when a 32-bit
constant is folded.
This doesn't seem to have more consequences than suboptimal code
generation. For instance the x86 backend assumes sign-extended constants,
and in some rare cases uses a 32-bit unsigned immediate 0xffffffff
instead of a 8-bit signed immediate 0xff for the constant -1. This is
with a ppc guest:
before
------
---- 0x9f29cc
movi_i32 tmp1,$0xffffffff
movi_i32 tmp2,$0x0
add2_i32 tmp0,CA,CA,tmp2,r6,tmp2
add2_i32 tmp0,CA,tmp0,CA,tmp1,tmp2
mov_i32 r10,tmp0
0x7fd8c7dfe90c: xor %ebp,%ebp
0x7fd8c7dfe90e: mov %ebp,%r11d
0x7fd8c7dfe911: mov 0x18(%r14),%r9d
0x7fd8c7dfe915: add %r9d,%r10d
0x7fd8c7dfe918: adc %ebp,%r11d
0x7fd8c7dfe91b: add $0xffffffff,%r10d
0x7fd8c7dfe922: adc %ebp,%r11d
0x7fd8c7dfe925: mov %r11d,0x134(%r14)
0x7fd8c7dfe92c: mov %r10d,0x28(%r14)
after
-----
---- 0x9f29cc
movi_i32 tmp1,$0xffffffffffffffff
movi_i32 tmp2,$0x0
add2_i32 tmp0,CA,CA,tmp2,r6,tmp2
add2_i32 tmp0,CA,tmp0,CA,tmp1,tmp2
mov_i32 r10,tmp0
0x7f37010d490c: xor %ebp,%ebp
0x7f37010d490e: mov %ebp,%r11d
0x7f37010d4911: mov 0x18(%r14),%r9d
0x7f37010d4915: add %r9d,%r10d
0x7f37010d4918: adc %ebp,%r11d
0x7f37010d491b: add $0xffffffffffffffff,%r10d
0x7f37010d491f: adc %ebp,%r11d
0x7f37010d4922: mov %r11d,0x134(%r14)
0x7f37010d4929: mov %r10d,0x28(%r14)
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
Message-Id: <1436544211-2769-2-git-send-email-aurelien@aurel32.net>
Signed-off-by: Richard Henderson <rth@twiddle.net>
Due to a copy&paste, the new op value is tested against mov_i32 instead
of movi_i32. The test is therefore always false. Fix that.
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
Message-Id: <1436544211-2769-1-git-send-email-aurelien@aurel32.net>
Signed-off-by: Richard Henderson <rth@twiddle.net>
The tcg_constant_folding folding ends up doing all the optimizations
(which is a good thing to avoid looping on all ops multiple time), so
make it clear and just rename it tcg_optimize.
Cc: Richard Henderson <rth@twiddle.net>
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
Message-Id: <1433447607-31184-6-git-send-email-aurelien@aurel32.net>
Signed-off-by: Richard Henderson <rth@twiddle.net>
Most of the calls to tcg_opt_gen_mov are preceeded by a test to check if
the source temp is a constant. Fold that into the tcg_opt_gen_mov
function.
Cc: Richard Henderson <rth@twiddle.net>
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
Message-Id: <1433495958-9508-1-git-send-email-aurelien@aurel32.net>
Signed-off-by: Richard Henderson <rth@twiddle.net>
Each call to tcg_opt_gen_mov is preceeded by a test to check if the
source and destination temps are copies. Fold that into the
tcg_opt_gen_mov function.
Cc: Richard Henderson <rth@twiddle.net>
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
Message-Id: <1433447607-31184-4-git-send-email-aurelien@aurel32.net>
Signed-off-by: Richard Henderson <rth@twiddle.net>
We can get the opcode using the TCGOp pointer. It needs to be
dereferenced, but it's anyway done a few lines below to write
the new value.
Cc: Richard Henderson <rth@twiddle.net>
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
Message-Id: <1433447607-31184-3-git-send-email-aurelien@aurel32.net>
Signed-off-by: Richard Henderson <rth@twiddle.net>
We can get the opcode using the TCGOp pointer. It needs to be
dereferenced, but it's anyway done a few lines below to write
the new value.
Cc: Richard Henderson <rth@twiddle.net>
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
Message-Id: <1433447607-31184-2-git-send-email-aurelien@aurel32.net>
Signed-off-by: Richard Henderson <rth@twiddle.net>
At the tcg opcode level, not at the tcg-op.h generator level.
This requires minor changes through all of the tcg backends,
but none of the cpu translators.
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Richard Henderson <rth@twiddle.net>
As seen with ubuntu-5.10-live-powerpc.iso.
Reported-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Tested-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Reviewed-by: Bastian Koppelmann <kbastian@mail.uni-paderborn.de>
Signed-off-by: Richard Henderson <rth@twiddle.net>
Rather reserving space in the op stream for optimization,
let the optimizer add ops as necessary.
Reviewed-by: Bastian Koppelmann <kbastian@mail.uni-paderborn.de>
Signed-off-by: Richard Henderson <rth@twiddle.net>
With the linked list scheme we need not leave nops in the stream
that we need to process later.
Reviewed-by: Bastian Koppelmann <kbastian@mail.uni-paderborn.de>
Signed-off-by: Richard Henderson <rth@twiddle.net>
The previous setup required ops and args to be completely sequential,
and was error prone when it came to both iteration and optimization.
Reviewed-by: Bastian Koppelmann <kbastian@mail.uni-paderborn.de>
Signed-off-by: Richard Henderson <rth@twiddle.net>
With the "old" ldst ops we didn't know the real width of the
result of the load, but with the "new" ldst ops we do.
Signed-off-by: Richard Henderson <rth@twiddle.net>
Since all backends have been converted, remove the compatibility code.
Acked-by: Claudio Fontana <claudio.fontana@huawei.com>
Signed-off-by: Richard Henderson <rth@twiddle.net>
For a 64-bit host, the high bits of a register after a 32-bit operation
are undefined. Adjust the temps mask for all 32-bit ops to reflect that.
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Richard Henderson <rth@twiddle.net>
If either the high or low pair can be resolved, we can
simplify to either a constant or to a 32-bit comparison.
Signed-off-by: Richard Henderson <rth@twiddle.net>
Avoid allocating a tcg temporary to hold the constant address,
and instead place it directly into the op_call arguments.
At the same time, convert to the newly introduced tcg_out_call
backend function, rather than invoking tcg_out_op for the call.
Signed-off-by: Richard Henderson <rth@twiddle.net>
By inspection, for a deposit(x, y, 0, 64), we'd have a shift of (1<<64)
and everything else falls apart. But we can reuse the existing deposit
logic to get this right.
Signed-off-by: Richard Henderson <rth@twiddle.net>
The TCG result would be undefined, but we can at least produce one
plausible result and avoid triggering the wrath of analysis tools.
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Richard Henderson <rth@twiddle.net>
Recognize 0 operand to andc, and -1 operands to and, orc, eqv.
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Aurelien Jarno <aurelien@aurel32.net>
Signed-off-by: Richard Henderson <rth@twiddle.net>
Like we already do for SUB and XOR.
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Aurelien Jarno <aurelien@aurel32.net>
Signed-off-by: Richard Henderson <rth@twiddle.net>
Given, of course, an appropriate constant. These could be generated
from the "canonical" operation for inversion on the guest, or via
other optimizations.
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Aurelien Jarno <aurelien@aurel32.net>
Signed-off-by: Richard Henderson <rth@twiddle.net>
The shl_i32 op might set some bits of the unused 32 high bits of the
mask. Fix that by clearing the unused 32 high bits for all 32-bit ops
except load/store which operate on tl values.
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
Signed-off-by: Richard Henderson <rth@twiddle.net>
Known-zero bits optimization is a great idea that helps to generate more
optimized code. However the current implementation only works in very few
cases as the computed mask is not saved.
Fix this to make it really working.
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
Signed-off-by: Richard Henderson <rth@twiddle.net>
32-bit versions of sar and shr ops should not propagate known-zero bits
from the unused 32 high bits. For sar it could even lead to wrong code
being generated.
Cc: qemu-stable@nongnu.org
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
Signed-off-by: Richard Henderson <rth@twiddle.net>
Use them in places where mulu2 and muls2 are used.
Optimize mulx2 with dead low part to mulxh.
Reviewed-by: Aurelien Jarno <aurelien@aurel32.net>
Signed-off-by: Richard Henderson <rth@twiddle.net>
When setcond2 is rewritten into setcond, the state of the destination
temp should be reset, so that a copy of the previous value is not
used instead of the result.
Reported-by: Michael Tokarev <mjt@tls.msk.ru>
Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
This adds two optimizations using the non-zero bit mask. In some cases
involving shifts or ANDs the value can become zero, and can thus be
optimized to a move of zero. Second, useless zero-extension or an
AND with constant can be detected that would only zero bits that are
already zero.
The main advantage of this optimization is that it turns zero-extensions
into moves, thus enabling much better copy propagation (around 1% code
reduction). Here is for example a "test $0xff0000,%ecx + je" before
optimization:
mov_i64 tmp0,rcx
movi_i64 tmp1,$0xff0000
discard cc_src
and_i64 cc_dst,tmp0,tmp1
movi_i32 cc_op,$0x1c
ext32u_i64 tmp0,cc_dst
movi_i64 tmp12,$0x0
brcond_i64 tmp0,tmp12,eq,$0x0
and after (without patch on the left, with on the right):
movi_i64 tmp1,$0xff0000 movi_i64 tmp1,$0xff0000
discard cc_src discard cc_src
and_i64 cc_dst,rcx,tmp1 and_i64 cc_dst,rcx,tmp1
movi_i32 cc_op,$0x1c movi_i32 cc_op,$0x1c
ext32u_i64 tmp0,cc_dst
movi_i64 tmp12,$0x0 movi_i64 tmp12,$0x0
brcond_i64 tmp0,tmp12,eq,$0x0 brcond_i64 cc_dst,tmp12,eq,$0x0
Other similar cases: "test %eax, %eax + jne" where eax is already 32-bit
(after optimization, without patch on the left, with on the right):
discard cc_src discard cc_src
mov_i64 cc_dst,rax mov_i64 cc_dst,rax
movi_i32 cc_op,$0x1c movi_i32 cc_op,$0x1c
ext32u_i64 tmp0,cc_dst
movi_i64 tmp12,$0x0 movi_i64 tmp12,$0x0
brcond_i64 tmp0,tmp12,ne,$0x0 brcond_i64 rax,tmp12,ne,$0x0
"test $0x1, %dl + je":
movi_i64 tmp1,$0x1 movi_i64 tmp1,$0x1
discard cc_src discard cc_src
and_i64 cc_dst,rdx,tmp1 and_i64 cc_dst,rdx,tmp1
movi_i32 cc_op,$0x1a movi_i32 cc_op,$0x1a
ext8u_i64 tmp0,cc_dst
movi_i64 tmp12,$0x0 movi_i64 tmp12,$0x0
brcond_i64 tmp0,tmp12,eq,$0x0 brcond_i64 cc_dst,tmp12,eq,$0x0
In some cases TCG even outsmarts GCC. :) Here the input code has
"and $0x2,%eax + movslq %eax,%rbx + test %rbx, %rbx" and the optimizer,
thanks to copy propagation, does the following:
movi_i64 tmp12,$0x2 movi_i64 tmp12,$0x2
and_i64 rax,rax,tmp12 and_i64 rax,rax,tmp12
mov_i64 cc_dst,rax mov_i64 cc_dst,rax
ext32s_i64 tmp0,rax -> nop
mov_i64 rbx,tmp0 -> mov_i64 rbx,cc_dst
and_i64 cc_dst,rbx,rbx -> nop
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>