Commit Graph

433 Commits

Author SHA1 Message Date
Alex Bennée 027d9a7d29 cpu: atomically modify cpu->exit_request
ThreadSanitizer picks up potential races although we already use
barriers to ensure things are in the correct order when processing exit
requests. For true C11 defined behaviour across threads we need to use
relaxed atomic_set/atomic_read semantics to reassure tsan.

Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Message-Id: <20160930213106.20186-9-alex.bennee@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-10-04 10:00:26 +02:00
Sergey Fedorov 3359baad36 tcg: Make tb_flush() thread safe
Use async_safe_run_on_cpu() to make tb_flush() thread safe.  This is
possible now that code generation does not happen in the middle of
execution.

It can happen that multiple threads schedule a safe work to flush the
translation buffer. To keep statistics and debugging output sane, always
check if the translation buffer has already been flushed.

Signed-off-by: Sergey Fedorov <serge.fdrv@gmail.com>
Signed-off-by: Sergey Fedorov <sergey.fedorov@linaro.org>
[AJB: minor re-base fixes]
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Message-Id: <1470158864-17651-13-git-send-email-alex.bennee@linaro.org>
Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-27 11:57:30 +02:00
Richard Henderson be2208e2a5 cpu-exec: Check -dfilter for -d cpu
Signed-off-by: Richard Henderson <rth@twiddle.net>
2016-09-16 08:12:11 -07:00
Sergey Fedorov b34de45fc4 tcg: rename tb_find_physical()
In fact, this function does not exactly perform a lookup by physical
address as it is descibed for comment on get_page_addr_code(). Thus
it may be a bit confusing to have "physical" in it's name. So rename it
to tb_htable_lookup() to better reflect its actual functionality.

Signed-off-by: Sergey Fedorov <serge.fdrv@gmail.com>
Signed-off-by: Sergey Fedorov <sergey.fedorov@linaro.org>
Message-Id: <20160715175852.30749-13-sergey.fedorov@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-13 19:08:43 +02:00
Sergey Fedorov bd2710d5da tcg: Merge tb_find_slow() and tb_find_fast()
These functions are not too big and can be merged together. This makes
locking scheme more clear and easier to follow.

Signed-off-by: Sergey Fedorov <serge.fdrv@gmail.com>
Signed-off-by: Sergey Fedorov <sergey.fedorov@linaro.org>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Message-Id: <20160715175852.30749-12-sergey.fedorov@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-13 19:08:43 +02:00
Sergey Fedorov 74d356dd48 tcg: Avoid bouncing tb_lock between tb_gen_code() and tb_add_jump()
Signed-off-by: Sergey Fedorov <serge.fdrv@gmail.com>
Signed-off-by: Sergey Fedorov <sergey.fedorov@linaro.org>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Message-Id: <20160715175852.30749-11-sergey.fedorov@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-13 19:08:43 +02:00
Alex Bennée 518615c650 tcg: cpu-exec: remove tb_lock from the hot-path
Lock contention in the hot path of moving between existing patched
TranslationBlocks is the main drag in multithreaded performance. This
patch pushes the tb_lock() usage down to the two places that really need
it:

  - code generation (tb_gen_code)
  - jump patching (tb_add_jump)

The rest of the code doesn't really need to hold a lock as it is either
using per-CPU structures, atomically updated or designed to be used in
concurrent read situations (qht_lookup).

To keep things simple I removed the #ifdef CONFIG_USER_ONLY stuff as the
locks become NOPs anyway until the MTTCG work is completed.

Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Richard Henderson <rth@twiddle.net>
Reviewed-by: Sergey Fedorov <sergey.fedorov@linaro.org>
Signed-off-by: Sergey Fedorov <sergey.fedorov@linaro.org>

Message-Id: <20160715175852.30749-10-sergey.fedorov@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-13 19:08:43 +02:00
Paolo Bonzini 6d21e4208f tcg: Prepare TB invalidation for lockless TB lookup
When invalidating a translation block, set an invalid flag into the
TranslationBlock structure first.  It is also necessary to check whether
the target TB is still valid after acquiring 'tb_lock' but before calling
tb_add_jump() since TB lookup is to be performed out of 'tb_lock' in
future. Note that we don't have to check 'last_tb'; an already invalidated
TB will not be executed anyway and it is thus safe to patch it.

Suggested-by: Sergey Fedorov <serge.fdrv@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-13 19:08:43 +02:00
Sergey Fedorov 118b07308a tcg: Prepare safe access to tb_flushed out of tb_lock
Ensure atomicity and ordering of CPU's 'tb_flushed' access for future
translation block lookup out of 'tb_lock'.

This field can only be touched from another thread by tb_flush() in user
mode emulation. So the only access to be sequential atomic is:
 * a single write in tb_flush();
 * reads/writes out of 'tb_lock'.

In future, before enabling MTTCG in system mode, tb_flush() must be safe
and this field becomes unnecessary.

Signed-off-by: Sergey Fedorov <serge.fdrv@gmail.com>
Signed-off-by: Sergey Fedorov <sergey.fedorov@linaro.org>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Message-Id: <20160715175852.30749-5-sergey.fedorov@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-13 19:08:42 +02:00
Sergey Fedorov 89a16b1e42 tcg: Prepare safe tb_jmp_cache lookup out of tb_lock
Ensure atomicity of CPU's 'tb_jmp_cache' access for future translation
block lookup out of 'tb_lock'.

Note that this patch does *not* make CPU's TLB invalidation safe if it
is done from some other thread while the CPU is in its execution loop.

Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sergey Fedorov <serge.fdrv@gmail.com>
Signed-off-by: Sergey Fedorov <sergey.fedorov@linaro.org>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Message-Id: <20160715175852.30749-4-sergey.fedorov@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-13 19:08:42 +02:00
Sergey Fedorov 4b7e69509d tcg: Pass last_tb by value to tb_find_fast()
This is a small clean up. tb_find_fast() is a final consumer of this
variable so no need to pass it by reference. 'last_tb' is always updated
by subsequent cpu_loop_exec_tb() in cpu_exec().

This change also simplifies calling cpu_exec_nocache() in
cpu_handle_exception().

Signed-off-by: Sergey Fedorov <serge.fdrv@gmail.com>
Signed-off-by: Sergey Fedorov <sergey.fedorov@linaro.org>
Message-Id: <20160715175852.30749-3-sergey.fedorov@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-13 19:08:42 +02:00
Sergey Fedorov ca7d8e1c9c cpu-exec: Move down some declarations in cpu_exec()
This will fix a compiler warning with -Wclobbered:

http://lists.nongnu.org/archive/html/qemu-devel/2016-07/msg03347.html

Reported-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Sergey Fedorov <serge.fdrv@gmail.com>
Signed-off-by: Sergey Fedorov <sergey.fedorov@linaro.org>
Message-Id: <20160715193123.28113-1-sergey.fedorov@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-17 09:59:22 +02:00
Emilio G. Cota 909eaac9bb tb hash: track translated blocks with qht
Having a fixed-size hash table for keeping track of all translation blocks
is suboptimal: some workloads are just too big or too small to get maximum
performance from the hash table. The MRU promotion policy helps improve
performance when the hash table is a little undersized, but it cannot
make up for severely undersized hash tables.

Furthermore, frequent MRU promotions result in writes that are a scalability
bottleneck. For scalability, lookups should only perform reads, not writes.
This is not a big deal for now, but it will become one once MTTCG matures.

The appended fixes these issues by using qht as the implementation of
the TB hash table. This solution is superior to other alternatives considered,
namely:

- master: implementation in QEMU before this patchset
- xxhash: before this patch, i.e. fixed buckets + xxhash hashing + MRU.
- xxhash-rcu: fixed buckets + xxhash + RCU list + MRU.
              MRU is implemented here by adding an intermediate struct
              that contains the u32 hash and a pointer to the TB; this
              allows us, on an MRU promotion, to copy said struct (that is not
              at the head), and put this new copy at the head. After a grace
              period, the original non-head struct can be eliminated, and
              after another grace period, freed.
- qht-fixed-nomru: fixed buckets + xxhash + qht without auto-resize +
                   no MRU for lookups; MRU for inserts.
The appended solution is the following:
- qht-dyn-nomru: dynamic number of buckets + xxhash + qht w/ auto-resize +
                 no MRU for lookups; MRU for inserts.

The plots below compare the considered solutions. The Y axis shows the
boot time (in seconds) of a debian jessie image with arm-softmmu; the X axis
sweeps the number of buckets (or initial number of buckets for qht-autoresize).
The plots in PNG format (and with errorbars) can be seen here:
  http://imgur.com/a/Awgnq

Each test runs 5 times, and the entire QEMU process is pinned to a
single core for repeatability of results.

                            Host: Intel Xeon E5-2690

  28 ++------------+-------------+-------------+-------------+------------++
     A*****        +             +             +             master **A*** +
  27 ++    *                                                 xxhash ##B###++
     |      A******A******                               xxhash-rcu $$C$$$ |
  26 C$$                  A******A******            qht-fixed-nomru*%%D%%%++
     D%%$$                              A******A******A*qht-dyn-mru A*E****A
  25 ++ %%$$                                          qht-dyn-nomru &&F&&&++
     B#####%                                                               |
  24 ++    #C$$$$$                                                        ++
     |      B###  $                                                        |
     |          ## C$$$$$$                                                 |
  23 ++           #       C$$$$$$                                         ++
     |             B######       C$$$$$$                                %%%D
  22 ++                  %B######       C$$$$$$C$$$$$$C$$$$$$C$$$$$$C$$$$$$C
     |                    D%%%%%%B######      @E@@@@@@    %%%D%%%@@@E@@@@@@E
  21 E@@@@@@E@@@@@@F&&&@@@E@@@&&&D%%%%%%B######B######B######B######B######B
     +             E@@@   F&&&   +      E@     +      F&&&   +             +
  20 ++------------+-------------+-------------+-------------+------------++
     14            16            18            20            22            24
                             log2 number of buckets

                                 Host: Intel i7-4790K

  14.5 ++------------+------------+-------------+------------+------------++
       A**           +            +             +            master **A*** +
    14 ++ **                                                 xxhash ##B###++
  13.5 ++   **                                           xxhash-rcu $$C$$$++
       |                                            qht-fixed-nomru %%D%%% |
    13 ++     A******                                   qht-dyn-mru @@E@@@++
       |             A*****A******A******             qht-dyn-nomru &&F&&& |
  12.5 C$$                               A******A******A*****A******    ***A
    12 ++ $$                                                        A***  ++
       D%%% $$                                                             |
  11.5 ++  %%                                                             ++
       B###  %C$$$$$$                                                      |
    11 ++  ## D%%%%% C$$$$$                                               ++
       |     #      %      C$$$$$$                                         |
  10.5 F&&&&&&B######D%%%%%       C$$$$$$C$$$$$$C$$$$$$C$$$$$C$$$$$$    $$$C
    10 E@@@@@@E@@@@@@B#####B######B######E@@@@@@E@@@%%%D%%%%%D%%%###B######B
       +             F&&          D%%%%%%B######B######B#####B###@@@D%%%   +
   9.5 ++------------+------------+-------------+------------+------------++
       14            16           18            20           22            24
                              log2 number of buckets

Note that the original point before this patch series is X=15 for "master";
the little sensitivity to the increased number of buckets is due to the
poor hashing function in master.

xxhash-rcu has significant overhead due to the constant churn of allocating
and deallocating intermediate structs for implementing MRU. An alternative
would be do consider failed lookups as "maybe not there", and then
acquire the external lock (tb_lock in this case) to really confirm that
there was indeed a failed lookup. This, however, would not be enough
to implement dynamic resizing--this is more complex: see
"Resizable, Scalable, Concurrent Hash Tables via Relativistic
Programming" by Triplett, McKenney and Walpole. This solution was
discarded due to the very coarse RCU read critical sections that we have
in MTTCG; resizing requires waiting for readers after every pointer update,
and resizes require many pointer updates, so this would quickly become
prohibitive.

qht-fixed-nomru shows that MRU promotion is advisable for undersized
hash tables.

However, qht-dyn-mru shows that MRU promotion is not important if the
hash table is properly sized: there is virtually no difference in
performance between qht-dyn-nomru and qht-dyn-mru.

Before this patch, we're at X=15 on "xxhash"; after this patch, we're at
X=15 @ qht-dyn-nomru. This patch thus matches the best performance that we
can achieve with optimum sizing of the hash table, while keeping the hash
table scalable for readers.

The improvement we get before and after this patch for booting debian jessie
with arm-softmmu is:

- Intel Xeon E5-2690: 10.5% less time
- Intel i7-4790K: 5.2% less time

We could get this same improvement _for this particular workload_ by
statically increasing the size of the hash table. But this would hurt
workloads that do not need a large hash table. The dynamic (upward)
resizing allows us to start small and enlarge the hash table as needed.

A quick note on downsizing: the table is resized back to 2**15 buckets
on every tb_flush; this makes sense because it is not guaranteed that the
table will reach the same number of TBs later on (e.g. most bootup code is
thrown away after boot); it makes sense to grow the hash table as
more code blocks are translated. This also avoids the complication of
having to build downsizing hysteresis logic into qht.

Reviewed-by: Sergey Fedorov <serge.fedorov@linaro.org>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Message-Id: <1465412133-3029-15-git-send-email-cota@braap.org>
Signed-off-by: Richard Henderson <rth@twiddle.net>
2016-06-11 17:11:16 -07:00
Emilio G. Cota 42bd32287f tb hash: hash phys_pc, pc, and flags with xxhash
For some workloads such as arm bootup, tb_phys_hash is performance-critical.
The is due to the high frequency of accesses to the hash table, originated
by (frequent) TLB flushes that wipe out the cpu-private tb_jmp_cache's.
More info:
  https://lists.nongnu.org/archive/html/qemu-devel/2016-03/msg05098.html

To dig further into this I modified an arm image booting debian jessie to
immediately shut down after boot. Analysis revealed that quite a bit of time
is unnecessarily spent in tb_phys_hash: the cause is poor hashing that
results in very uneven loading of chains in the hash table's buckets;
the longest observed chain had ~550 elements.

The appended addresses this with two changes:

1) Use xxhash as the hash table's hash function. xxhash is a fast,
   high-quality hashing function.

2) Feed the hashing function with not just tb_phys, but also pc and flags.

This improves performance over using just tb_phys for hashing, since that
resulted in some hash buckets having many TB's, while others getting very few;
with these changes, the longest observed chain on a single hash bucket is
brought down from ~550 to ~40.

Tests show that the other element checked for in tb_find_physical,
cs_base, is always a match when tb_phys+pc+flags are a match,
so hashing cs_base is wasteful. It could be that this is an ARM-only
thing, though. UPDATE:
On Tue, Apr 05, 2016 at 08:41:43 -0700, Richard Henderson wrote:
> The cs_base field is only used by i386 (in 16-bit modes), and sparc (for a TB
> consisting of only a delay slot).
> It may well still turn out to be reasonable to ignore cs_base for hashing.

BTW, after this change the hash table should not be called "tb_hash_phys"
anymore; this is addressed later in this series.

This change gives consistent bootup time improvements. I tested two
host machines:
- Intel Xeon E5-2690: 11.6% less time
- Intel i7-4790K: 19.2% less time

Increasing the number of hash buckets yields further improvements. However,
using a larger, fixed number of buckets can degrade performance for other
workloads that do not translate as many blocks (600K+ for debian-jessie arm
bootup). This is dealt with later in this series.

Reviewed-by: Sergey Fedorov <sergey.fedorov@linaro.org>
Reviewed-by: Richard Henderson <rth@twiddle.net>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Message-Id: <1465412133-3029-8-git-send-email-cota@braap.org>
Signed-off-by: Richard Henderson <rth@twiddle.net>
2016-06-11 23:10:19 +00:00
Sergey Fedorov c88c67e58b cpu-exec: Fix direct jump to TB spanning page
It is not safe to make a direct jump to a TB spanning two pages in
system emulation because the mapping for the second page can get changed
but we don't take care of direct jumps in this case.

However in user mode emulation, this is not the case because there's
only static address translation and TBs are always invalidated properly.

Fixes: 5b053a4a28 ("tcg: Clean up direct block chaining safety checks")

Reported-by: Max Filippov <jcmvbkbc@gmail.com>
Signed-off-by: Sergey Fedorov <serge.fdrv@gmail.com>
Signed-off-by: Sergey Fedorov <sergey.fedorov@linaro.org>
Tested-by: Max Filippov <jcmvbkbc@gmail.com>
Message-id: 1463404380-29302-1-git-send-email-sergey.fedorov@linaro.org
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2016-05-26 13:14:29 +01:00
Paolo Bonzini 63c915526d cpu: move exec-all.h inclusion out of cpu.h
exec-all.h contains TCG-specific definitions.  It is not needed outside
TCG-specific files such as translate.c, exec.c or *helper.c.

One generic function had snuck into include/exec/exec-all.h; move it to
include/qom/cpu.h.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-05-19 16:42:29 +02:00
Sergey Fedorov 8b1fe3f439 cpu-exec: Clean up 'interrupt_request' reloading in cpu_handle_interrupt()
Suggested-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Sergey Fedorov <serge.fdrv@gmail.com>
Signed-off-by: Sergey Fedorov <sergey.fedorov@linaro.org>
Message-Id: <1463071937-26607-1-git-send-email-sergey.fedorov@linaro.org>
Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Richard Henderson <rth@twiddle.net>
2016-05-12 14:07:16 -10:00
Sergey Fedorov ba048a4ae1 cpu-exec: Remove unused 'x86_cpu' and 'env' from cpu_exec()
Signed-off-by: Sergey Fedorov <serge.fdrv@gmail.com>
Signed-off-by: Sergey Fedorov <sergey.fedorov@linaro.org>
Reviewed-by: Richard Henderson <rth@twiddle.net>
Message-Id: <1462962111-32237-6-git-send-email-sergey.fedorov@linaro.org>
Signed-off-by: Richard Henderson <rth@twiddle.net>
2016-05-12 14:06:42 -10:00
Sergey Fedorov 928de9ee14 cpu-exec: Move TB execution stuff out of cpu_exec()
Simplify cpu_exec() by extracting TB execution code outside of
cpu_exec() into a new static inline function cpu_loop_exec_tb().

Signed-off-by: Sergey Fedorov <serge.fdrv@gmail.com>
Signed-off-by: Sergey Fedorov <sergey.fedorov@linaro.org>
Reviewed-by: Richard Henderson <rth@twiddle.net>
Message-Id: <1462962111-32237-5-git-send-email-sergey.fedorov@linaro.org>
Signed-off-by: Richard Henderson <rth@twiddle.net>
2016-05-12 14:06:42 -10:00
Sergey Fedorov c385e6e497 cpu-exec: Move interrupt handling out of cpu_exec()
Simplify cpu_exec() by extracting interrupt handling code outside of
cpu_exec() into a new static inline function cpu_handle_interrupt().

Signed-off-by: Sergey Fedorov <serge.fdrv@gmail.com>
Signed-off-by: Sergey Fedorov <sergey.fedorov@linaro.org>
Reviewed-by: Richard Henderson  <rth@twiddle.net>
Message-Id: <1462962111-32237-4-git-send-email-sergey.fedorov@linaro.org>
Signed-off-by: Richard Henderson <rth@twiddle.net>
2016-05-12 14:06:42 -10:00
Sergey Fedorov ea284766ec cpu-exec: Move exception handling out of cpu_exec()
Simplify cpu_exec() by extracting exception handling code out of
cpu_exec() into a new static inline function cpu_handle_exception().
Also make cpu_handle_debug_exception() inline as it is used only once.

Signed-off-by: Sergey Fedorov <serge.fdrv@gmail.com>
Signed-off-by: Sergey Fedorov <sergey.fedorov@linaro.org>
Reviewed-by: Richard Henderson <rth@twiddle.net>
Message-Id: <1462962111-32237-3-git-send-email-sergey.fedorov@linaro.org>
Signed-off-by: Richard Henderson <rth@twiddle.net>
2016-05-12 14:06:42 -10:00
Sergey Fedorov 8b2d34e997 cpu-exec: Move halt handling out of cpu_exec()
Simplify cpu_exec() by extracting CPU halt state handling code out of
cpu_exec() into a new static inline function cpu_handle_halt().

Signed-off-by: Sergey Fedorov <serge.fdrv@gmail.com>
Signed-off-by: Sergey Fedorov <sergey.fedorov@linaro.org>
Reviewed-by: Richard Henderson <rth@twiddle.net>
Message-Id: <1462962111-32237-2-git-send-email-sergey.fedorov@linaro.org>
Signed-off-by: Richard Henderson <rth@twiddle.net>
2016-05-12 14:06:42 -10:00
Sergey Fedorov c6f0d9f84c cpu-exec: Remove relic orphaned comment
This comment should have been deleted by commit 0ac087f1f3 ("removed
unused code") but somehow it is still here. There's no point to keep it.

Signed-off-by: Sergey Fedorov <serge.fdrv@gmail.com>
Signed-off-by: Sergey Fedorov <sergey.fedorov@linaro.org>
Message-Id: <1462286050-21778-1-git-send-email-sergey.fedorov@linaro.org>
Signed-off-by: Richard Henderson <rth@twiddle.net>
2016-05-12 14:06:42 -10:00
Sergey Fedorov 3213525f8a tcg: Remove needless CPUState::current_tb
This field was used for telling cpu_interrupt() to unlink a chain of TBs
being executed when it worked that way. Now, cpu_interrupt() don't do
this anymore. So we don't need this field anymore.

Signed-off-by: Sergey Fedorov <serge.fdrv@gmail.com>
Signed-off-by: Sergey Fedorov <sergey.fedorov@linaro.org>
Message-Id: <1462273462-14036-1-git-send-email-sergey.fedorov@linaro.org>
Signed-off-by: Richard Henderson <rth@twiddle.net>
2016-05-12 14:06:42 -10:00
Sergey Fedorov a0522c7a55 cpu-exec: Move TB chaining into tb_find_fast()
Move tb_add_jump() call and surrounding code from cpu_exec() into
tb_find_fast(). That simplifies cpu_exec() a little by hiding the direct
chaining optimization details into tb_find_fast(). It also allows to
move tb_lock()/tb_unlock() pair into tb_find_fast(), putting it closer
to tb_find_slow() which also manipulates the lock.

Suggested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sergey Fedorov <serge.fdrv@gmail.com>
Signed-off-by: Sergey Fedorov <sergey.fedorov@linaro.org>
Signed-off-by: Richard Henderson <rth@twiddle.net>
[rth: Fixed rebase typo in nochain test.]
2016-05-12 14:06:42 -10:00
Sergey Fedorov 6f789be56d tcg: Rework tb_invalidated_flag
'tb_invalidated_flag' was meant to catch two events:
 * some TB has been invalidated by tb_phys_invalidate();
 * the whole translation buffer has been flushed by tb_flush().

Then it was checked:
 * in cpu_exec() to ensure that the last executed TB can be safely
   linked to directly call the next one;
 * in cpu_exec_nocache() to decide if the original TB should be provided
   for further possible invalidation along with the temporarily
   generated TB.

It is always safe to patch an invalidated TB since it is not going to be
used anyway. It is also safe to call tb_phys_invalidate() for an already
invalidated TB. Thus, setting this flag in tb_phys_invalidate() is
simply unnecessary. Moreover, it can prevent from pretty proper linking
of TBs, if any arbitrary TB has been invalidated. So just don't touch it
in tb_phys_invalidate().

If this flag is only used to catch whether tb_flush() has been called
then rename it to 'tb_flushed'. Declare it as 'bool' and stick to using
only 'true' and 'false' to set its value. Also, instead of setting it in
tb_gen_code(), just after tb_flush() has been called, do it right inside
of tb_flush().

In cpu_exec(), this flag is used to track if tb_flush() has been called
and have made 'next_tb' (a reference to the last executed TB) invalid
for linking it to directly call the next TB. tb_flush() can be called
during the CPU execution loop from tb_gen_code(), during TB execution or
by another thread while 'tb_lock' is released. Catch for translation
buffer flush reliably by resetting this flag once before first TB lookup
and each time we find it set before trying to add a direct jump. Don't
touch in in tb_find_physical().

Each vCPU has its own execution loop in multithreaded mode and thus
should have its own copy of the flag to be able to reset it with its own
'next_tb' and don't affect any other vCPU execution thread. So make this
flag per-vCPU and move it to CPUState.

In cpu_exec_nocache(), we only need to check if tb_flush() has been
called from tb_gen_code() called by cpu_exec_nocache() itself. To do
this reliably, preserve the old value of the flag, reset it before
calling tb_gen_code(), check afterwards, and combine the saved value
back to the flag.

This patch is based on the patch "tcg: move tb_invalidated_flag to
CPUState" from Paolo Bonzini <pbonzini@redhat.com>.

Signed-off-by: Sergey Fedorov <serge.fdrv@gmail.com>
Signed-off-by: Sergey Fedorov <sergey.fedorov@linaro.org>
Signed-off-by: Richard Henderson <rth@twiddle.net>
2016-05-12 14:06:42 -10:00
Sergey Fedorov 819af24b9c tcg: Clean up from 'next_tb'
The value returned from tcg_qemu_tb_exec() is the value passed to the
corresponding tcg_gen_exit_tb() at translation time of the last TB
attempted to execute. It is a little confusing to store it in a variable
named 'next_tb'. In fact, it is a combination of 4-byte aligned pointer
and additional information in its two least significant bits. Break it
down right away into two variables named 'last_tb' and 'tb_exit' which
are a pointer to the last TB attempted to execute and the TB exit
reason, correspondingly. This simplifies the code and improves its
readability.

Correct a misleading documentation comment for tcg_qemu_tb_exec() and
fix logging in cpu_tb_exec(). Also rename a misleading 'next_tb' in
another couple of places.

Signed-off-by: Sergey Fedorov <serge.fdrv@gmail.com>
Signed-off-by: Sergey Fedorov <sergey.fedorov@linaro.org>
Signed-off-by: Richard Henderson <rth@twiddle.net>
2016-05-12 14:06:42 -10:00
Paolo Bonzini 7687bf52e5 cpu-exec: elide more icount code if CONFIG_USER_ONLY
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[Alex Bennée: #ifndef replay code to match elided functions]
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sergey Fedorov <sergey.fedorov@linaro.org>
Signed-off-by: Richard Henderson <rth@twiddle.net>
2016-05-12 14:06:42 -10:00
Alex Bennée 1279f323d6 tcg: reorganize tb_find_physical loop
Put some comments and improve code structure. This should help reading
the code.

Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
[Sergey Fedorov: provide commit message; bring back resetting of
tb_invalidated_flag]
Signed-off-by: Sergey Fedorov <sergey.fedorov@linaro.org>
Reviewed-by: Richard Henderson  <rth@twiddle.net>
Signed-off-by: Richard Henderson <rth@twiddle.net>
2016-05-12 14:06:42 -10:00
Sergey Fedorov 5b053a4a28 tcg: Clean up direct block chaining safety checks
We don't take care of direct jumps when address mapping changes. Thus we
must be sure to generate direct jumps so that they always keep valid
even if address mapping changes. Luckily, we can only allow to execute a
TB if it was generated from the pages which match with current mapping.

Document tcg_gen_goto_tb() declaration and note the reason for
destination PC limitations.

Some targets with variable length instructions allow TB to straddle a
page boundary. However, we make sure that both of TB pages match the
current address mapping when looking up TBs. So it is safe to do direct
jumps into the both pages. Correct the checks for some of those targets.

Given that, we can safely patch a TB which spans two pages. Remove the
unnecessary check in cpu_exec() and allow such TBs to be patched.

Signed-off-by: Sergey Fedorov <serge.fdrv@gmail.com>
Signed-off-by: Sergey Fedorov <sergey.fedorov@linaro.org>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Richard Henderson <rth@twiddle.net>
2016-05-12 14:06:41 -10:00
Emilio G. Cota 89fee74a0f tb: consistently use uint32_t for tb->flags
We are inconsistent with the type of tb->flags: usage varies loosely
between int and uint64_t. Settle to uint32_t everywhere, which is
superior to both: at least one target (aarch64) uses the most significant
bit in the u32, and uint64_t is wasteful.

Compile-tested for all targets.

Suggested-by: Laurent Desnogues <laurent.desnogues@gmail.com>
Suggested-by: Richard Henderson <rth@twiddle.net>
Tested-by: Edgar E. Iglesias <edgar.iglesias@xilinx.com>
Reviewed-by: Edgar E. Iglesias <edgar.iglesias@xilinx.com>
Reviewed-by: Laurent Desnogues <laurent.desnogues@gmail.com>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <rth@twiddle.net>
Message-Id: <1460049562-23517-1-git-send-email-cota@braap.org>
2016-05-12 14:06:40 -10:00
Alex Bennée d977e1c2db qemu-log: dfilter-ise exec, out_asm, op and opt_op
This ensures the code generation debug code will honour -dfilter if set.
For the "exec" tracing I've added a new inline macro for efficiency's
sake.

Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Aurelien Jarno <aurelien@aureL32.net>
Reviewed-by: Richard Henderson <rth@twiddle.net>
Message-Id: <1458052224-9316-8-git-send-email-alex.bennee@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-22 22:20:18 +01:00
Peter Maydell 1a83063522 qemu-log: Improve the "exec" TB execution logging
Improve the TB execution logging so that it is easier to identify
what is happening from trace logs:
 * move the "Trace" logging of executed TBs into cpu_tb_exec()
   so that it is emitted if and only if we actually execute a TB,
   and for consistency for the CPU state logging
 * log when we link two TBs together via tb_add_jump()
 * log when cpu_tb_exec() returns early from a chain of TBs

The new style logging looks like this:

Trace 0x7fb7cc822ca0 [ffffffc0000dce00]
Linking TBs 0x7fb7cc822ca0 [ffffffc0000dce00] index 0 -> 0x7fb7cc823110 [ffffffc0000dce10]
Trace 0x7fb7cc823110 [ffffffc0000dce10]
Trace 0x7fb7cc823420 [ffffffc000302688]
Trace 0x7fb7cc8234a0 [ffffffc000302698]
Trace 0x7fb7cc823520 [ffffffc0003026a4]
Trace 0x7fb7cc823560 [ffffffc0000dce44]
Linking TBs 0x7fb7cc823560 [ffffffc0000dce44] index 1 -> 0x7fb7cc8235d0 [ffffffc0000dce70]
Trace 0x7fb7cc8235d0 [ffffffc0000dce70]
Stopped execution of TB chain before 0x7fb7cc8235d0 [ffffffc0000dce70]
Trace 0x7fb7cc8235d0 [ffffffc0000dce70]
Trace 0x7fb7cc822fd0 [ffffffc0000dd52c]

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
[AJB: reword patch title, Abandoned->Stopped]
Reviewed-by: Aurelien Jarno <aurelien@aurel32.net>
Reviewed-by: Richard Henderson <rth@twiddle.net>
Message-Id: <1458052224-9316-6-git-send-email-alex.bennee@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-22 22:20:18 +01:00
Paolo Bonzini 508127e243 log: do not unnecessarily include qom/cpu.h
Split the bits that require it to exec/log.h.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Message-id: 1452174932-28657-8-git-send-email-den@openvz.org
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2016-02-03 09:19:10 +00:00
Peter Maydell 7b31bbc2e6 exec: Clean up includes
Clean up includes so that osdep.h is included first and headers
which it implies are not included manually.

This commit was created with scripts/clean-includes.

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Message-id: 1453832250-766-4-git-send-email-peter.maydell@linaro.org
2016-01-29 15:07:22 +00:00
Peter Maydell 9319738080 So here it is, let's see what happens.
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJWPHM6AAoJEL/70l94x66DK5YIAJTNthYWL8eNhQ1iek6CLlV+
 etVXm3JDmkV0zOfYVHLBb44VLZ6I1ocas+57F/kmz7SKpMLiI6bMXRxhTSkiO4D+
 3N36cWQf3fq+P0DmxuikMlYGz8V6QQ5PQE2xJKV0ZIWAkiqInxilkN3qt81sNR+A
 A9Ohom3sc0eGHyYJcVDK4krbnNSAZjIB2yMWperw61x+GYAhxjA02HPUgB32KK6q
 KrdnKmnRu9Cw6y4wTCbbDITJztPexZYsX2DOJh30wC0eNcE+MZ7J2im8Frpxe+Ml
 C8MUuvSqLOyeu9tUfrXGzd6kMtEKrmU+fh2nNbxJbtfowDjkW2jcIEgC0UjkGE4=
 =BF1q
 -----END PGP SIGNATURE-----

Merge remote-tracking branch 'remotes/bonzini/tags/for-upstream-replay' into staging

So here it is, let's see what happens.

# gpg: Signature made Fri 06 Nov 2015 09:30:34 GMT using RSA key ID 78C7AE83
# gpg: Good signature from "Paolo Bonzini <bonzini@gnu.org>"
# gpg:                 aka "Paolo Bonzini <pbonzini@redhat.com>"

* remotes/bonzini/tags/for-upstream-replay:
  replay: recording of the user input
  replay: command line options
  replay: replay blockers for devices
  replay: initialization and deinitialization
  replay: ptimer
  bottom halves: introduce bh call function
  replay: checkpoints
  icount: improve counting for record/replay
  replay: shutdown event
  replay: recording and replaying clock ticks
  replay: asynchronous events infrastructure
  replay: interrupts and exceptions
  cpu: replay instructions sequence
  cpu-exec: allow temporary disabling icount
  replay: introduce icount event
  replay: introduce mutex to protect the replay log
  replay: internal functions for replay log
  replay: global variables and function stubs

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2015-11-06 11:31:40 +00:00
Pavel Dovgalyuk 6f0609697f replay: interrupts and exceptions
This patch includes modifications of common cpu files. All interrupts and
exceptions occured during recording are written into the replay log.
These events allow correct replaying the execution by kicking cpu thread
when one of these events is found in the log.

Signed-off-by: Pavel Dovgalyuk <pavel.dovgaluk@ispras.ru>
Message-Id: <20150917162416.8676.57647.stgit@PASHA-ISP.def.inno>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-11-06 10:16:00 +01:00
Pavel Dovgalyuk 56c0269a9e cpu-exec: allow temporary disabling icount
This patch is required for deterministic replay to generate an exception
by trying executing an instruction without changing icount.
It adds new flag to TB for disabling icount while translating it.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

Signed-off-by: Pavel Dovgalyuk <pavel.dovgaluk@ispras.ru>
Message-Id: <20150917162359.8676.77011.stgit@PASHA-ISP.def.inno>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-11-05 12:19:09 +01:00
Stefan Weil 0448f5f8b8 cpu-exec: Fix compiler warning (-Werror=clobbered)
Reloading of local variables after sigsetjmp is only needed for some
buggy compilers.

The code which should reload these variables causes compiler warnings
with gcc 4.7 when compiler optimizations are enabled:

cpu-exec.c:204:15: error:
 variable ‘cpu’ might be clobbered by ‘longjmp’ or ‘vfork’ [-Werror=clobbered]
cpu-exec.c:207:15: error:
 variable ‘cc’ might be clobbered by ‘longjmp’ or ‘vfork’ [-Werror=clobbered]
cpu-exec.c:202:28: error:
 argument ‘env’ might be clobbered by ‘longjmp’ or ‘vfork’ [-Werror=clobbered]

Now this code is only used for compilers which need it
(and gcc 4.5.x, x > 0 which does not need it but won't give warnings).

There were bug reports for clang and gcc 4.5.0, while gcc 4.5.1
was reported to work fine without the reload code. For clang it
is not clear which versions are affected, so simply keep the status quo
for all clang compilations. This can be improved later.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
Message-Id: <1443266606-21400-1-git-send-email-sw@weilnetz.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-11-04 15:56:04 +01:00
Richard Henderson 89a82cd4b6 cpu-exec: Add "nochain" debug flag
Respect it to avoid linking TBs together.

Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Richard Henderson <rth@twiddle.net>
2015-10-19 11:04:39 -10:00
Pavel Dovgalyuk 6220e900bc i386: partial revert of interrupt poll fix
Processing CPU_INTERRUPT_POLL requests in cpu_has_work functions
break the determinism of cpu_exec. This patch is required to make
interrupts processing deterministic.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

Signed-off-by: Pavel Dovgalyuk <pavel.dovgaluk@ispras.ru>
Message-Id: <20150917162331.8676.15286.stgit@PASHA-ISP.def.inno>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-09-25 12:04:44 +02:00
Peter Crosthwaite 5abf9495ca cpu-exec: Migrate some generic fns to cpu-exec-common
The goal is to split the functions such that cpu-exec is CPU specific
content, while cpus-exec-common.c is generic code only. The function
interface to cpu-exec needs to be virtualised to prepare support for
multi-arch and moving these definitions out saves bloating the QOM
interface. So move these definitions out of cpu-exec to a new module,
cpu-exec-common.

Signed-off-by: Peter Crosthwaite <crosthwaite.peter@gmail.com>
Message-Id: <3cefeb3fbbb33031670951a0e74de2778529da3f.1441614289.git.crosthwaite.peter@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-09-16 17:33:33 +02:00
Peter Maydell a2aa09e181 * Support for jemalloc
* qemu_mutex_lock_iothread "No such process" fix
 * cutils: qemu_strto* wrappers
 * iohandler.c simplification
 * Many other fixes and misc patches.
 
 And some MTTCG work (with Emilio's fixes squashed):
 * Signal-free TCG kick
 * Removing spinlock in favor of QemuMutex
 * User-mode emulation multi-threading fixes/docs
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJV8Tk7AAoJEL/70l94x66Ds3QH/3bi0RRR2NtKIXAQrGo5tfuD
 NPMu1K5Hy+/26AC6mEVNRh4kh7dPH5E4NnDGbxet1+osvmpjxAjc2JrxEybhHD0j
 fkpzqynuBN6cA2Gu5GUNoKzxxTmi2RrEYigWDZqCftRXBeO2Hsr1etxJh9UoZw5H
 dgpU3j/n0Q8s08jUJ1o789knZI/ckwL4oXK4u2KhSC7ZTCWhJT7Qr7c0JmiKReaF
 JEYAsKkQhICVKRVmC8NxML8U58O8maBjQ62UN6nQpVaQd0Yo/6cstFTZsRrHMHL3
 7A2Tyg862cMvp+1DOX3Bk02yXA+nxnzLF8kUe0rYo6llqDBDStzqyn1j9R0qeqA=
 =nB06
 -----END PGP SIGNATURE-----

Merge remote-tracking branch 'remotes/bonzini/tags/for-upstream' into staging

* Support for jemalloc
* qemu_mutex_lock_iothread "No such process" fix
* cutils: qemu_strto* wrappers
* iohandler.c simplification
* Many other fixes and misc patches.

And some MTTCG work (with Emilio's fixes squashed):
* Signal-free TCG kick
* Removing spinlock in favor of QemuMutex
* User-mode emulation multi-threading fixes/docs

# gpg: Signature made Thu 10 Sep 2015 09:03:07 BST using RSA key ID 78C7AE83
# gpg: Good signature from "Paolo Bonzini <bonzini@gnu.org>"
# gpg:                 aka "Paolo Bonzini <pbonzini@redhat.com>"

* remotes/bonzini/tags/for-upstream: (44 commits)
  cutils: work around platform differences in strto{l,ul,ll,ull}
  cpu-exec: fix lock hierarchy for user-mode emulation
  exec: make mmap_lock/mmap_unlock globally available
  tcg: comment on which functions have to be called with mmap_lock held
  tcg: add memory barriers in page_find_alloc accesses
  remove unused spinlock.
  replace spinlock by QemuMutex.
  cpus: remove tcg_halt_cond and tcg_cpu_thread globals
  cpus: protect work list with work_mutex
  scripts/dump-guest-memory.py: fix after RAMBlock change
  configure: Add support for jemalloc
  add macro file for coccinelle
  configure: factor out adding disas configure
  vhost-scsi: fix wrong vhost-scsi firmware path
  checkpatch: remove tests that are not relevant outside the kernel
  checkpatch: adapt some tests to QEMU
  CODING_STYLE: update mixed declaration rules
  qmp: Add example usage of strto*l() qemu wrapper
  cutils: Add qemu_strtoull() wrapper
  cutils: Add qemu_strtoll() wrapper
  ...

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2015-09-14 16:13:16 +01:00
Pavel Dovgalyuk 1c3c8af1fb cpu-exec: introduce loop exit with restore function
This patch introduces loop exit function, which also
restores guest CPU state according to the value of host
program counter.

Reviewed-by: Aurelien Jarno <aurelien@aurel32.net>
Signed-off-by: Pavel Dovgalyuk <pavel.dovgaluk@ispras.ru>
Message-Id: <20150710095702.13280.97477.stgit@PASHA-ISP>
Signed-off-by: Richard Henderson <rth@twiddle.net>
2015-09-11 08:16:16 -07:00
Paolo Bonzini 9fd1a94888 cpu-exec: fix lock hierarchy for user-mode emulation
tb_lock has to be taken inside the mmap_lock (example:
tb_invalidate_phys_range is called by target_mmap), but
tb_link_page is taking the mmap_lock and it is called
with the tb_lock held.

To fix this, take the mmap_lock in tb_find_slow, not
in tb_link_page.

Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-09-09 15:34:56 +02:00
KONRAD Frederic 677ef6230b replace spinlock by QemuMutex.
spinlock is only used in two cases:
  * cpu-exec.c: to protect TranslationBlock
  * mem_helper.c: for lock helper in target-i386 (which seems broken).

It's a pthread_mutex_t in user-mode, so we can use QemuMutex directly,
with an #ifdef.  The #ifdef will be removed when multithreaded TCG
will need the mutex as well.

Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com>
Message-Id: <1439220437-23957-5-git-send-email-fred.konrad@greensocs.com>
Signed-off-by: Emilio G. Cota <cota@braap.org>
[Merge Emilio G. Cota's patch to remove volatile. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-09-09 15:34:55 +02:00
Paolo Bonzini e0c382113f tcg: signal-free qemu_cpu_kick
Signals are slow and do not exist on Win32.  The previous patches
have done most of the legwork to introduce memory barriers (some
of them were even there already for the sake of Windows!) and
we can now set the flags directly in the iothread.

qemu_cpu_kick_thread is not used anymore on TCG, since the TCG thread is
never outside usermode while the CPU is running (not halted).  Instead run
the content of the signal handler (now in qemu_cpu_kick_no_halt) directly.
qemu_cpu_kick_no_halt is also used in qemu_mutex_lock_iothread to avoid
the overhead of qemu_cond_broadcast.

Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-09-09 15:34:54 +02:00
Paolo Bonzini aed807c8e2 tcg: synchronize exit_request and tcg_current_cpu accesses
Synchronize the remaining pair of accesses in cpu_signal.  These should
be necessary on Windows as well, at least in theory.  Probably
SuspendProcess and ResumeProcess introduce some implicit memory
barrier.

Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-09-09 15:34:54 +02:00
Paolo Bonzini ab096a75cd tcg: synchronize cpu->exit_request and cpu->tcg_exit_req accesses
Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-09-09 15:34:53 +02:00
Paolo Bonzini b0a46fa796 tcg: assign cpu->current_tb in a simpler place
TCG has not been reading cpu->current_tb from signal handlers for years.
The code that synchronized cpu_exec with the signal handler is not
needed anymore.

Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-09-09 15:34:53 +02:00