Commit Graph

1655 Commits

Author SHA1 Message Date
Peter Maydell
a7ce790a02 tcg/tcg-op.h: Add multiple include guard
The tcg-op.h header was missing the usual guard against multiple
inclusion; add it.

(Spotted by lgtm.com's static analyzer.)

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Tested-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-id: 20181108125256.30986-1-peter.maydell@linaro.org
2018-11-08 15:15:32 +00:00
Richard Henderson
e6cd4bb59b tcg: Split CONFIG_ATOMIC128
GCC7+ will no longer advertise support for 16-byte __atomic operations
if only cmpxchg is supported, as for x86_64.  Fortunately, x86_64 still
has support for __sync_compare_and_swap_16 and we can make use of that.
AArch64 does not have, nor ever has had such support, so open-code it.

Reviewed-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2018-10-18 19:46:36 -07:00
Emilio G. Cota
72fd2efbbd tcg: distribute tcg_time into TCG contexts
When we implemented per-vCPU TCG contexts, we forgot to also
distribute the tcg_time counter, which has remained as a global
accessed without any serialization, leading to potentially missed
counts.

Fix it by distributing the field over the TCG contexts, embedding
it into TCGProfile with a field called "cpu_exec_time", which is more
descriptive than "tcg_time". Add a function to query this value
directly, and for completeness, fill in the field in
tcg_profile_snapshot, even though its callers do not use it.

Signed-off-by: Emilio G. Cota <cota@braap.org>
Message-Id: <20181010144853.13005-5-cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2018-10-18 18:58:10 -07:00
Emilio G. Cota
dd1d7da23b tcg: plug holes in struct TCGProfile
This plugs two 4-byte holes in 64-bit.

Signed-off-by: Emilio G. Cota <cota@braap.org>
Message-Id: <20181010144853.13005-4-cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2018-10-18 18:58:10 -07:00
Emilio G. Cota
c1f543b739 tcg: fix use of uninitialized variable under CONFIG_PROFILER
We forgot to initialize n in commit 15fa08f845 ("tcg: Dynamically
allocate TCGOps", 2017-12-29).

Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Message-Id: <20181010144853.13005-3-cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2018-10-18 18:58:10 -07:00
Richard Henderson
d7f425fdea tcg: Implement CPU_LOG_TB_NOCHAIN during expansion
Rather than test NOCHAIN before linking, do not emit the
goto_tb opcode at all.  We already do this for goto_ptr.

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2018-10-18 18:58:10 -07:00
Roman Kapl
93bf9a4273 tcg/i386: fix vector operations on 32-bit hosts
The TCG backend uses LOWREGMASK to get the low 3 bits of register numbers.
This was defined as no-op for 32-bit x86, with the assumption that we have
eight registers anyway. This assumption is not true once we have xmm regs.

Since LOWREGMASK was a no-op, xmm register indidices were wrong in opcodes
and have overflown into other opcode fields, wreaking havoc.

To trigger these problems, you can try running the "movi d8, #0x0" AArch64
instruction on 32-bit x86. "vpxor %xmm0, %xmm0, %xmm0" should be generated,
but instead TCG generated "vpxor %xmm0, %xmm0, %xmm2".

Fixes: 770c2fc7bb ("Add vector operations")
Signed-off-by: Roman Kapl <rka@sysgo.com>
Message-Id: <20180824131734.18557-1-rka@sysgo.com>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2018-09-26 09:02:51 -07:00
Richard Henderson
1fb57da72a tcg/optimize: Do not skip default processing of dup_vec
If we do not opimize away dup_vec, we must mark its output as changed.

Fixes: 170ba88f45
Reported-by: Laurent Desnogues <laurent.desnogues@gmail.com>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Laurent Desnogues <laurent.desnogues@gmail.com>
Tested-by: Laurent Desnogues <laurent.desnogues@gmail.com>
Message-id: 20180805233258.31892-1-richard.henderson@linaro.org
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2018-08-06 14:57:48 +01:00
Richard Henderson
672189cd58 tcg/i386: Mark xmm registers call-clobbered
When host vector registers and operations were introduced, I failed
to mark the registers call clobbered as required by the ABI.

Fixes: 770c2fc7bb
Cc: qemu-stable@nongnu.org
Reported-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2018-07-23 09:21:14 -07:00
Alex Bennée
e65a5f227d tcg/aarch64: limit mul_vec size
In AdvSIMD we can only do 32x32 integer multiples although SVE is
capable of larger 64 bit multiples. As a result we can end up
generating invalid opcodes. Fix this by only reprting we can emit
mul vector ops if the size is small enough.

Fixes a crash on:

  sve-all-short-v8.3+sve@vq3/insn_mul_z_zi___INC.risu.bin

When running on AArch64 hardware.

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Message-Id: <20180719154248.29669-1-alex.bennee@linaro.org>
[rth: Removed the tcg_debug_assert -- there are plenty of other
cases that we do not diagnose within the insn encoding helpers.]
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2018-07-19 09:07:31 -07:00
Richard Henderson
499748d768 tcg: Restrict check_size_impl to multiples of the line size
Normally this is automatic in the size restrictions that are placed
on vector sizes coming from the implementation.  However, for the
legitimate size tuple [oprsz=8, maxsz=32], we need to clear the final
24 bytes of the vector register.  Without this check, do_dup selects
TCG_TYPE_V128 and clears only 16 bytes.

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Tested-by: Alex Bennée <alex.bennee@linaro.org>
Message-id: 20180705191929.30773-2-richard.henderson@linaro.org
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2018-07-09 14:51:34 +01:00
Richard Henderson
9f75462065 tcg: Reduce max TB opcode count
Also, assert that we don't overflow any of two different offsets into
the TB. Both unwind and goto_tb both record a uint16_t for later use.

This fixes an arm-softmmu test case utilizing NEON in which there is
a TB generated that runs to 7800 opcodes, and compiles to 96k on an
x86_64 host.  This overflows the 16-bit offset in which we record the
goto_tb reset offset.  Because of that overflow, we install a jump
destination that goes to neverland.  Boom.

With this reduced op count, the same TB compiles to about 48k for
aarch64, ppc64le, and x86_64 hosts, and neither assertion fires.

Cc: qemu-stable@nongnu.org
Reported-by: "Jason A. Donenfeld" <Jason@zx2c4.com>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2018-06-15 09:39:53 -10:00
Emilio G. Cota
0ac20318ce tcg: remove tb_lock
Use mmap_lock in user-mode to protect TCG state and the page descriptors.
In !user-mode, each vCPU has its own TCG state, so no locks needed.
Per-page locks are used to protect the page descriptors.

Per-TB locks are used in both modes to protect TB jumps.

Some notes:

- tb_lock is removed from notdirty_mem_write by passing a
  locked page_collection to tb_invalidate_phys_page_fast.

- tcg_tb_lookup/remove/insert/etc have their own internal lock(s),
  so there is no need to further serialize access to them.

- do_tb_flush is run in a safe async context, meaning no other
  vCPU threads are running. Therefore acquiring mmap_lock there
  is just to please tools such as thread sanitizer.

- Not visible in the diff, but tb_invalidate_phys_page already
  has an assert_memory_lock.

- cpu_io_recompile is !user-only, so no mmap_lock there.

- Added mmap_unlock()'s before all siglongjmp's that could
  be called in user-mode while mmap_lock is held.
  + Added an assert for !have_mmap_lock() after returning from
    the longjmp in cpu_exec, just like we do in cpu_exec_step_atomic.

Performance numbers before/after:

Host: AMD Opteron(tm) Processor 6376

                 ubuntu 17.04 ppc64 bootup+shutdown time

  700 +-+--+----+------+------------+-----------+------------*--+-+
      |    +    +      +            +           +           *B    |
      |         before ***B***                            ** *    |
      |tb lock removal ###D###                         ***        |
  600 +-+                                           ***         +-+
      |                                           **         #    |
      |                                        *B*          #D    |
      |                                     *** *         ##      |
  500 +-+                                ***           ###      +-+
      |                             * ***           ###           |
      |                            *B*          # ##              |
      |                          ** *          #D#                |
  400 +-+                      **            ##                 +-+
      |                      **           ###                     |
      |                    **           ##                        |
      |                  **         # ##                          |
  300 +-+  *           B*          #D#                          +-+
      |    B         ***        ###                               |
      |    *       **       ####                                  |
      |     *   ***      ###                                      |
  200 +-+   B  *B     #D#                                       +-+
      |     #B* *   ## #                                          |
      |     #*    ##                                              |
      |    + D##D#     +            +           +            +    |
  100 +-+--+----+------+------------+-----------+------------+--+-+
           1    8      16      Guest CPUs       48           64
  png: https://imgur.com/HwmBHXe

              debian jessie aarch64 bootup+shutdown time

  90 +-+--+-----+-----+------------+------------+------------+--+-+
     |    +     +     +            +            +            +    |
     |         before ***B***                                B    |
  80 +tb lock removal ###D###                              **D  +-+
     |                                                   **###    |
     |                                                 **##       |
  70 +-+                                             ** #       +-+
     |                                             ** ##          |
     |                                           **  #            |
  60 +-+                                       *B  ##           +-+
     |                                       **  ##               |
     |                                    ***  #D                 |
  50 +-+                               ***   ##                 +-+
     |                             * **   ###                     |
     |                           **B*  ###                        |
  40 +-+                     ****  # ##                         +-+
     |                   ****     #D#                             |
     |             ***B**      ###                                |
  30 +-+    B***B**        ####                                 +-+
     |    B *   *     # ###                                       |
     |     B       ###D#                                          |
  20 +-+   D  ##D##                                             +-+
     |      D#                                                    |
     |    +     +     +            +            +            +    |
  10 +-+--+-----+-----+------------+------------+------------+--+-+
          1     8     16      Guest CPUs        48           64
  png: https://imgur.com/iGpGFtv

The gains are high for 4-8 CPUs. Beyond that point, however, unrelated
lock contention significantly hurts scalability.

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2018-06-15 08:18:48 -10:00
Emilio G. Cota
128ed2278c tcg: move tb_ctx.tb_phys_invalidate_count to tcg_ctx
Thereby making it per-TCGContext. Once we remove tb_lock, this will
avoid an atomic increment every time a TB is invalidated.

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2018-06-15 07:42:55 -10:00
Emilio G. Cota
be2cdc5e35 tcg: track TBs with per-region BST's
This paves the way for enabling scalable parallel generation of TCG code.

Instead of tracking TBs with a single binary search tree (BST), use a
BST for each TCG region, protecting it with a lock. This is as scalable
as it gets, since each TCG thread operates on a separate region.

The core of this change is the introduction of struct tcg_region_tree,
which contains a pointer to a GTree and an associated lock to serialize
accesses to it. We then allocate an array of tcg_region_tree's, adding
the appropriate padding to avoid false sharing based on
qemu_dcache_linesize.

Given a tc_ptr, we first find the corresponding region_tree. This
is done by special-casing the first and last regions first, since they
might be of size != region.size; otherwise we just divide the offset
by region.stride. I was worried about this division (several dozen
cycles of latency), but profiling shows that this is not a fast path.
Note that region.stride is not required to be a power of two; it
is only required to be a multiple of the host's page size.

Note that with this design we can also provide consistent snapshots
about all region trees at once; for instance, tcg_tb_foreach
acquires/releases all region_tree locks before/after iterating over them.
For this reason we now drop tb_lock in dump_exec_info().

As an alternative I considered implementing a concurrent BST, but this
can be tricky to get right, offers no consistent snapshots of the BST,
and performance and scalability-wise I don't think it could ever beat
having separate GTrees, given that our workload is insert-mostly (all
concurrent BST designs I've seen focus, understandably, on making
lookups fast, which comes at the expense of convoluted, non-wait-free
insertions/removals).

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2018-06-15 07:42:55 -10:00
John Arbuckle
1019242af1 tcg/i386: Use byte form of xgetbv instruction
The assembler in most versions of Mac OS X is pretty old and does not
support the xgetbv instruction.  To go around this problem, the raw
encoding of the instruction is used instead.

Signed-off-by: John Arbuckle <programmingkidx@gmail.com>
Message-Id: <20180604215102.11002-1-programmingkidx@gmail.com>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2018-06-15 07:42:55 -10:00
Peter Maydell
163670542f tcg-next queue
-----BEGIN PGP SIGNATURE-----
 
 iQEcBAABAgAGBQJbEdLqAAoJEGTfOOivfiFfQaEH/Rq96S5bo94495KmRJY9e/jw
 lV321YYI7nx7sHtViG/B3iTkvnxzZPWcc7XbBMxyV5xmMQ/5zjS/ynZPFyy/cYRn
 zLM4W0SJ38EqhHTZpkkvw9Nle8UbNWKm5PgND2TyE4hmeuQ98OrQ6Y1GvP4MFpXs
 uQErbmMjYHMq7thbfCO6ulJjjEliRy3AJ2C3fCCCUgBQrJt6JeqbGr/Zzi2y88M9
 IhoK8RbJiWT2O5Tl95q2NOQvr11WbFlu/K0nuaVgbfTwd2tp3ygmRKPpeZ24qA52
 qtwgcIjWHHkkC5s1qaP8oW4FtoMQZdsaOwSOPw0ZBnG+VA7P/h33fWr9f5SistA=
 =UVdE
 -----END PGP SIGNATURE-----

Merge remote-tracking branch 'remotes/rth/tags/tcg-next-pull-request' into staging

tcg-next queue

# gpg: Signature made Sat 02 Jun 2018 00:12:42 BST
# gpg:                using RSA key 64DF38E8AF7E215F
# gpg: Good signature from "Richard Henderson <richard.henderson@linaro.org>"
# Primary key fingerprint: 7A48 1E78 868B 4DB6 A85A  05C0 64DF 38E8 AF7E 215F

* remotes/rth/tags/tcg-next-pull-request:
  tcg: Pass tb and index to tcg_gen_exit_tb separately

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2018-06-04 11:28:31 +01:00
Richard Henderson
07ea28b418 tcg: Pass tb and index to tcg_gen_exit_tb separately
Do the cast to uintptr_t within the helper, so that the compiler
can type check the pointer argument.  We can also do some more
sanity checking of the index argument.

Reviewed-by: Laurent Vivier <laurent@vivier.eu>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2018-06-01 15:15:27 -07:00
Philippe Mathieu-Daudé
23c11b04dc target: Do not include "exec/exec-all.h" if it is not necessary
Code change produced with:
    $ git grep '#include "exec/exec-all.h"' | \
      cut -d: -f-1 | \
      xargs egrep -L "(cpu_address_space_init|cpu_loop_|tlb_|tb_|GETPC|singlestep|TranslationBlock)" | \
      xargs sed -i.bak '/#include "exec\/exec-all.h"/d'

Signed-off-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Message-Id: <20180528232719.4721-10-f4bug@amsat.org>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-06-01 14:15:10 +02:00
Emilio G. Cota
1d34982155 tcg: fix s/compliment/complement/ typos
Signed-off-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
2018-05-20 08:25:23 +03:00
Peter Maydell
f5583c527f target-arm queue:
* hw/arm/iotkit.c: fix minor memory leak
  * softfloat: fix wrong-exception-flags bug for multiply-add corner case
  * arm: isolate and clean up DTB generation
  * implement Arm v8.1-Atomics extension
  * Fix some bugs and missing instructions in the v8.2-FP16 extension
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABCAAGBQJa9IUCAAoJEDwlJe0UNgzeEGMQAKKjVRzZ7MBgvxQj0FJSWhSP
 BZkATf3ktid255PRpIssBZiY9oM+uY6n+/IRozAGvfDBp9eQOkrZczZjfW5hpe0B
 YsQadtk5cUOXqQzRTegSMPOoMmz8f5GaGOk4R6AEXJEX+Rug/zbOn9Q8Yx7JTd7o
 yBvU1+fys3galSiB88cffA95B9fwGfLsM7rP6OC4yNdUBYwjHf3wtY53WsxtWqX9
 oX4keEiROQkrOfbSy9wYPZzu/0iRo8v35+7wIZhvNSlf02k6yJ7a+w0C4EQIRhWm
 5zciE+aMYr7nOGpj7AEJLrRekhwnD6Ppje6aUd15yrxfNRZkpk/FeECWnaOPDis7
 QNijx5Zqg6+GyItQKi5U4vFVReMj09OB7xDyAq77xDeBj4l3lg2DNkRfRhqQZAcv
 2r4EW+pfLNj76Ah1qtQ410fprw462Sopb6bHmeuFbf1QFbQvJ4CL1+7Jl3ExrDX4
 2+iQb4sQghWDxhDLfRSLxQ7K+bX+mNfGdFW8h+jPShD/+JY42dTKkFZEl4ghNgMD
 mpj8FrQuIkSMqnDmPfoTG5MVTMERacqPU7GGM7/fxudIkByO3zTiLxJ/E+Iy8HvX
 29xKoOBjKT5FJrwJABsN6VpA3EuyAARgQIZ/dd6N5GZdgn2KAIHuaI+RHFOesKFd
 dJGM6sdksnsAAz28aUEJ
 =uXY+
 -----END PGP SIGNATURE-----

Merge remote-tracking branch 'remotes/pmaydell/tags/pull-target-arm-20180510' into staging

target-arm queue:
 * hw/arm/iotkit.c: fix minor memory leak
 * softfloat: fix wrong-exception-flags bug for multiply-add corner case
 * arm: isolate and clean up DTB generation
 * implement Arm v8.1-Atomics extension
 * Fix some bugs and missing instructions in the v8.2-FP16 extension

# gpg: Signature made Thu 10 May 2018 18:44:34 BST
# gpg:                using RSA key 3C2525ED14360CDE
# gpg: Good signature from "Peter Maydell <peter.maydell@linaro.org>"
# gpg:                 aka "Peter Maydell <pmaydell@gmail.com>"
# gpg:                 aka "Peter Maydell <pmaydell@chiark.greenend.org.uk>"
# Primary key fingerprint: E1A5 C593 CD41 9DE2 8E83  15CF 3C25 25ED 1436 0CDE

* remotes/pmaydell/tags/pull-target-arm-20180510: (21 commits)
  target/arm: Clear SVE high bits for FMOV
  target/arm: Fix float16 to/from int16
  target/arm: Implement vector shifted FCVT for fp16
  target/arm: Implement vector shifted SCVF/UCVF for fp16
  target/arm: Enable ARM_FEATURE_V8_ATOMICS for user-only
  target/arm: Implement CAS and CASP
  target/arm: Fill in disas_ldst_atomic
  target/arm: Introduce ARM_FEATURE_V8_ATOMICS and initial decode
  target/riscv: Use new atomic min/max expanders
  tcg: Use GEN_ATOMIC_HELPER_FN for opposite endian atomic add
  tcg: Introduce atomic helpers for integer min/max
  target/xtensa: Use new min/max expanders
  target/arm: Use new min/max expanders
  tcg: Introduce helpers for integer min/max
  atomic.h: Work around gcc spurious "unused value" warning
  make sure that we aren't overwriting mc->get_hotplug_handler by accident
  arm/boot: split load_dtb() from arm_load_kernel()
  platform-bus-device: use device plug callback instead of machine_done notifier
  pc: simplify MachineClass::get_hotplug_handler handling
  softfloat: Handle default NaN mode after pickNaNMulAdd, not before
  ...

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>

# Conflicts:
#	target/riscv/translate.c
2018-05-11 17:41:54 +01:00
Richard Henderson
5507c2bf35 tcg: Introduce atomic helpers for integer min/max
Given that this atomic operation will be used by both risc-v
and aarch64, let's not duplicate code across the two targets.

Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Message-id: 20180508151437.4232-5-richard.henderson@linaro.org
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2018-05-10 18:10:57 +01:00
Richard Henderson
b87fb8cd9f tcg: Introduce helpers for integer min/max
These operations are re-invented by several targets so far.
Several supported hosts have insns for these, so place the
expanders out-of-line for a future introduction of tcg opcodes.

Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Message-id: 20180508151437.4232-2-richard.henderson@linaro.org
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2018-05-10 18:10:57 +01:00
Richard Henderson
abebf92597 tcg: Limit the number of ops in a TB
In 6001f7729e we partially attempt to address the branch
displacement overflow caused by 15fa08f845.

However, gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vqtbX.c
is a testcase that contains a TB so large as to overflow anyway.
The limit here of 8000 ops produces a maximum output TB size of
24112 bytes on a ppc64le host with that test case.  This is still
much less than the maximum forward branch distance of 32764 bytes.

Cc: qemu-stable@nongnu.org
Fixes: 15fa08f845 ("tcg: Dynamically allocate TCGOps")
Reviewed-by: Laurent Vivier <laurent@vivier.eu>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2018-05-09 08:30:57 -07:00
Peter Maydell
7eb30ef0ba tcg/i386: Fix dup_vec in non-AVX2 codepath
The VPUNPCKLD* instructions are all "non-destructive source",
indicated by "NDS" in the encoding string in the x86 ISA manual.
This means that they take two source operands, one of which is
encoded in the VEX.vvvv field. We were incorrectly treating them
as if they were destructive-source and passing 0 as the 'v'
argument of tcg_out_vex_modrm(). This meant we were always
using %xmm0 as one of the source operands, causing incorrect
results if the register allocator happened to want to use
something else. For instance the input AArch64 insn:
 DUP v26.16b, w21
which becomes TCG IR ops:
 dup_vec v128,e8,tmp2,x21
 st_vec v128,e8,tmp2,env,$0xa40
was assembled to:
0x607c568c:  c4 c1 7a 7e 86 e8 00 00  vmovq    0xe8(%r14), %xmm0
0x607c5694:  00
0x607c5695:  c5 f9 60 c8              vpunpcklbw %xmm0, %xmm0, %xmm1
0x607c5699:  c5 f9 61 c9              vpunpcklwd %xmm1, %xmm0, %xmm1
0x607c569d:  c5 f9 70 c9 00           vpshufd  $0, %xmm1, %xmm1
0x607c56a2:  c4 c1 7a 7f 8e 40 0a 00  vmovdqu  %xmm1, 0xa40(%r14)
0x607c56aa:  00

when the vpunpcklwd insn should be "%xmm1, %xmm1, %xmm1".
This resulted in our incorrectly setting the output vector to
q26=0000320000003200:0000320000003200
when given an input of x21 == 0000000002803200
rather than the expected all-zeroes.

Pass the correct source register number to tcg_out_vex_modrm()
for these insns.

Fixes: 770c2fc7bb
Cc: qemu-stable@nongnu.org
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Message-Id: <20180504153431.5169-1-peter.maydell@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2018-05-09 08:30:57 -07:00
Laurent Vivier
6001f7729e tcg: workaround branch instruction overflow in tcg_out_qemu_ld/st
ppc64 uses a BC instruction to call the tcg_out_qemu_ld/st
slow path. BC instruction uses a relative address encoded
on 14 bits.

The slow path functions are added at the end of the generated
instructions buffer, in the reverse order of the callers.
So more we have slow path functions more the distance between
the caller (BC) and the function increases.

This patch changes the behavior to generate the functions in
the same order of the callers.

Cc: qemu-stable@nongnu.org
Fixes: 15fa08f845 ("tcg: Dynamically allocate TCGOps")
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Message-Id: <20180429235840.16659-1-lvivier@redhat.com>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2018-05-01 11:56:55 -07:00
Richard Henderson
5bfa803448 tcg: Improve TCGv_ptr support
Drop TCGV_PTR_TO_NAT and TCGV_NAT_TO_PTR internal macros.

Add tcg_temp_local_new_ptr, tcg_gen_brcondi_ptr, tcg_gen_ext_i32_ptr,
tcg_gen_trunc_i64_ptr, tcg_gen_extu_ptr_i64, tcg_gen_trunc_ptr_i32.

Use inlines instead of macros where possible.

Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2018-05-01 11:56:16 -07:00
Richard Henderson
9a938d86b0 tcg: Allow wider vectors for cmp and mul
In db432672, we allow wide inputs for operations such as add.
However, in 212be173 and 3774030a we didn't do the same for
compare and multiply.

Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2018-05-01 11:56:16 -07:00
Henry Wertz
3f814b8037 tcg/arm: Fix memory barrier encoding
I found with qemu 2.11.x or newer that I would get an illegal instruction
error running some Intel binaries on my ARM chromebook.  On investigation,
I found it was quitting on memory barriers.

qemu instruction:
mb $0x31
was translating as:
0x604050cc:  5bf07ff5  blpl     #0x600250a8

After patch it gives:
0x604050cc:  f57ff05b  dmb      ish

In short, I found INSN_DMB_ISH (memory barrier for ARMv7) appeared to be
correct based on online docs, but due to some endian-related shenanigans it
had to be byte-swapped to suit qemu; it appears INSN_DMB_MCR (memory
barrier for ARMv6) also should be byte swapped  (and this patch does so).
I have not checked for correctness of aarch64's barrier instruction.

Cc: qemu-stable@nongnu.org
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Henry Wertz <hwertz10@gmail.com>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2018-05-01 11:56:07 -07:00
Richard Henderson
d103021269 tcg: Document INDEX_mul[us]h_*
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2018-05-01 11:41:56 -07:00
Peter Maydell
161dfd1e7f tcg/mips: Handle large offsets from target env to tlb_table
The MIPS TCG target makes the assumption that the offset from the
target env pointer to the tlb_table is less than about 64K. This
used to be true, but gradual addition of features to the Arm
target means that it's no longer true there. This results in
the build-time assertion failing:

In file included from /home/pm215/qemu/include/qemu/osdep.h:36:0,
                 from /home/pm215/qemu/tcg/tcg.c:28:
/home/pm215/qemu/tcg/mips/tcg-target.inc.c: In function ‘tcg_out_tlb_load’:
/home/pm215/qemu/include/qemu/compiler.h:90:36: error: static assertion failed: "not expecting: offsetof(CPUArchState, tlb_table[NB_MMU_MODES - 1][1]) > 0x7ff0 + 0x7fff"
 #define QEMU_BUILD_BUG_MSG(x, msg) _Static_assert(!(x), msg)
                                    ^
/home/pm215/qemu/include/qemu/compiler.h:98:30: note: in expansion of macro ‘QEMU_BUILD_BUG_MSG’
 #define QEMU_BUILD_BUG_ON(x) QEMU_BUILD_BUG_MSG(x, "not expecting: " #x)
                              ^
/home/pm215/qemu/tcg/mips/tcg-target.inc.c:1236:9: note: in expansion of macro ‘QEMU_BUILD_BUG_ON’
         QEMU_BUILD_BUG_ON(offsetof(CPUArchState,
         ^
/home/pm215/qemu/rules.mak:66: recipe for target 'tcg/tcg.o' failed

An ideal long term approach would be to rearrange the CPU state
so that the tlb_table was not so far along it, but this is tricky
because it would move it from the "not cleared on CPU reset" part
of the struct to the "cleared on CPU reset" part. As a simple fix
for the 2.12 release, make the MIPS TCG target handle an arbitrary
offset by emitting more add instructions. This will mean an extra
instruction in the fastpath for TCG loads and stores for the
affected guests (currently just aarch64-softmmu).

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Message-id: 20180413142336.32163-1-peter.maydell@linaro.org
2018-04-16 11:51:37 +01:00
Richard Henderson
9743cd5736 tcg: Introduce tcg_set_insn_start_param
The parameters for tcg_gen_insn_start are target_ulong, which may be split
into two TCGArg parameters for storage in the opcode on 32-bit hosts.

Fixes the ARM target and its direct use of tcg_set_insn_param, which would
set the wrong argument in the 64-on-32 case.

Cc: qemu-stable@nongnu.org
Reported-by: alarson@ddci.com
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Message-id: 20180410003558.2470-1-richard.henderson@linaro.org
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2018-04-10 13:02:26 +01:00
Richard Henderson
f2f1dde751 tcg: Mark muluh_i64 and mulsh_i64 as 64-bit ops
Failure to do so results in the tcg optimizer sign-extending
any constant fold from 32-bits.  This turns out to be visible
in the RISC-V testsuite using a host that emits these opcodes
(e.g. any non-x86_64).

Reported-by: Michael Clark <mjc@sifive.com>
Reviewed-by: Emilio G. Cota <cota@braap.org>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2018-03-28 12:45:16 +08:00
Richard Henderson
adb196cbd5 tcg: Add choose_vector_size
This unifies 5 copies of checks for supported vector size,
and in the process fixes a missing check in tcg_gen_gvec_2s.

This lead to an assertion failure for 64-bit vector multiply,
which is not available in the AVX instruction set.

Suggested-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2018-03-16 00:55:04 +08:00
Richard Henderson
7f34ed4bcd tcg/i386: Support INDEX_op_dup2_vec for -m32
Unknown why -m32 was passing with gcc but not clang; it should have
failed for both.  This would be used for tcg_gen_dup_i64_vec, and
visible with the right TB and an aarch64 guest.

Reported-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2018-03-16 00:55:04 +08:00
Richard Henderson
b2e3ae9452 tcg: Improve tcg_gen_muli_i32/i64
Convert multiplication by power of two to left shift.

Reviewed-by: Emilio G. Cota <cota@braap.org>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2018-03-16 00:55:04 +08:00
Richard Henderson
14e4c1e235 tcg/aarch64: Add vector operations
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2018-02-08 15:54:08 +00:00
Richard Henderson
770c2fc7bb tcg/i386: Add vector operations
The x86 vector instruction set is extremely irregular.  With newer
editions, Intel has filled in some of the blanks.  However, we don't
get many 64-bit operations until SSE4.2, introduced in 2009.

The subsequent edition was for AVX1, introduced in 2011, which added
three-operand addressing, and adjusts how all instructions should be
encoded.

Given the relatively narrow 2 year window between possible to support
and desirable to support, and to vastly simplify code maintainence,
I am only planning to support AVX1 and later cpus.

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2018-02-08 15:54:08 +00:00
Richard Henderson
170ba88f45 tcg/optimize: Handle vector opcodes during optimize
Trivial move and constant propagation.  Some identity and constant
function folding, but nothing that requires knowledge of the size
of the vector element.

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2018-02-08 15:54:06 +00:00
Richard Henderson
22fc352703 tcg: Add generic vector helpers with a scalar operand
Use dup to convert a non-constant scalar to a third vector.

Add addition, multiplication, and logical operations with an immediate.
Add addition, subtraction, multiplication, and logical operations with
a non-constant scalar.  Allow for the front-end to build operations in
which the scalar operand comes first.

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2018-02-08 15:54:06 +00:00
Richard Henderson
f49b12c6e6 tcg: Add generic helpers for saturating arithmetic
No vector ops as yet.  SSE only has direct support for 8- and 16-bit
saturation; handling 32- and 64-bit saturation is much more expensive.

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2018-02-08 15:54:06 +00:00
Richard Henderson
3774030a3e tcg: Add generic vector ops for multiplication
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2018-02-08 15:54:06 +00:00
Richard Henderson
212be173f0 tcg: Add generic vector ops for comparisons
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2018-02-08 15:54:05 +00:00
Richard Henderson
d0ec97967f tcg: Add generic vector ops for constant shifts
Opcodes are added for scalar and vector shifts, but considering the
varied semantics of these do not expose them to the front ends.  Do
go ahead and provide them in case they are needed for backend expansion.

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2018-02-08 15:54:05 +00:00
Richard Henderson
db432672dc tcg: Add generic vector expanders
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2018-02-08 15:54:05 +00:00
Richard Henderson
474b2e8f0f tcg: Standardize integral arguments to expanders
Some functions use intN_t arguments, some use uintN_t, some just
used "unsigned".  To aid putting function pointers in tables, we
need consistency.

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2018-02-08 15:54:05 +00:00
Richard Henderson
d2fd745fe8 tcg: Add types and basic operations for host vectors
Nothing uses or enables them yet.

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2018-02-08 15:54:04 +00:00
Richard Henderson
da73a4abca tcg: Allow multiple word entries into the constant pool
This will be required for storing vector constants.

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2018-02-08 15:53:34 +00:00
Richard Henderson
030ffe39dd tcg/ppc: Allow a 32-bit offset to the constant pool
We recently relaxed the limit of the number of opcodes that can
appear in a TranslationBlock.  In certain cases this has resulted
in relocation overflow.

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2018-01-16 08:21:56 -08:00
Richard Henderson
4a64e0fd68 tcg/ppc: Support tlb offsets larger than 64k
AArch64 with SVE has an offset of 80k to the 8th TLB.

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2018-01-16 08:21:56 -08:00