With a vfio assigned device we lay down a base MemoryRegion registered
as an IO region, giving us read & write accessors. If the region
supports mmap, we lay down a higher priority sub-region MemoryRegion
on top of the base layer initialized as a RAM device pointer to the
mmap. Finally, if we have any quirks for the device (ie. address
ranges that need additional virtualization support), we put another IO
sub-region on top of the mmap MemoryRegion. When this is flattened,
we now potentially have sub-page mmap MemoryRegions exposed which
cannot be directly mapped through KVM.
This is as expected, but a subtle detail of this is that we end up
with two different access mechanisms through QEMU. If we disable the
mmap MemoryRegion, we make use of the IO MemoryRegion and service
accesses using pread and pwrite to the vfio device file descriptor.
If the mmap MemoryRegion is enabled and results in one of these
sub-page gaps, QEMU handles the access as RAM, using memcpy to the
mmap. Using either pread/pwrite or the mmap directly should be
correct, but using memcpy causes us problems. I expect that not only
does memcpy not necessarily honor the original width and alignment in
performing a copy, but it potentially also uses processor instructions
not intended for MMIO spaces. It turns out that this has been a
problem for Realtek NIC assignment, which has such a quirk that
creates a sub-page mmap MemoryRegion access.
To resolve this, we disable memory_access_is_direct() for ram_device
regions since QEMU assumes that it can use memcpy for those regions.
Instead we access through MemoryRegionOps, which replaces the memcpy
with simple de-references of standard sizes to the host memory.
With this patch we attempt to provide unrestricted access to the RAM
device, allowing byte through qword access as well as unaligned
access. The assumption here is that accesses initiated by the VM are
driven by a device specific driver, which knows the device
capabilities. If unaligned accesses are not supported by the device,
we don't want them to work in a VM by performing multiple aligned
accesses to compose the unaligned access. A down-side of this
philosophy is that the xp command from the monitor attempts to use
the largest available access weidth, unaware of the underlying
device. Using memcpy had this same restriction, but at least now an
operator can dump individual registers, even if blocks of device
memory may result in access widths beyond the capabilities of a
given device (RTL NICs only support up to dword).
Reported-by: Thorsten Kohfeldt <thorsten.kohfeldt@gmx.de>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Setting skip_dump on a MemoryRegion allows us to modify one specific
code path, but the restriction we're trying to address encompasses
more than that. If we have a RAM MemoryRegion backed by a physical
device, it not only restricts our ability to dump that region, but
also affects how we should manipulate it. Here we recognize that
MemoryRegions do not change to sometimes allow dumps and other times
not, so we replace setting the skip_dump flag with a new initializer
so that we know exactly the type of region to which we're applying
this behavior.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
softmmu requires more functions to be thread-safe, because translation
blocks can be invalidated from e.g. notdirty callbacks. Probably the
same holds for user-mode emulation, it's just that no one has ever
tried to produce a coherent locking there.
This patch will guide the introduction of more tb_lock and tb_unlock
calls for system emulation.
Note that after this patch some (most) of the mentioned functions are
still called outside tb_lock/tb_unlock. The next one will rectify this.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Richard Henderson <rth@twiddle.net>
Message-Id: <20161027151030.20863-7-alex.bennee@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This adds asserts to check the locking on the various translation
engines structures. There are two sets of structures that are protected
by locks.
The first the l1map and PageDesc structures used to track which
translation blocks are associated with which physical addresses. In
user-mode this is covered by the mmap_lock.
The second case are TB context related structures which are protected by
tb_lock which is also user-mode only.
Currently the asserts do nothing in SoftMMU mode but this will change
for MTTCG.
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Richard Henderson <rth@twiddle.net>
Message-Id: <20161027151030.20863-4-alex.bennee@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When we cannot emulate an atomic operation within a parallel
context, this exception allows us to stop the world and try
again in a serial context.
Reviewed-by: Emilio G. Cota <cota@braap.org>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Richard Henderson <rth@twiddle.net>
x2APIC support to APIC code, cpu_exec_init() refactor on all
architectures, and other x86 changes.
-----BEGIN PGP SIGNATURE-----
iQIcBAABCAAGBQJYDmYyAAoJECgHk2+YTcWmoSUP/2ga+b9YmPuyL7XC+12pff0I
Z8gdjUzbMUNcCI0JMZCTGUJbs3BapLcnsA7ypmt88s9kG02WeDMhNx1BfYiAFgLU
kPLQlXAM7awEdGagd3sTCiFojSUZ7GxYHjd5fuhPoOAXvXM8im6zJl18ZcsnStjO
/J8JGoGDHq1XJlz+RIjnGamojJWCiO/+iiD+rFmVSic8zjHPDYq14sIk/QJX+DaF
azLiOI6DAlX3kyrN5ZshhIRQ3COzzUMUSDF/ZaYHjudUco5MBnwj/oLQniTq+ZUd
hCu7dr5TpLxI7q1yltyd0UIl/+aZGbE8tEvoXAtc735iK4m2CTckT7ql6x3xI+Ir
PmpPgIswHqfCiCXm8imLj6ZI47kRA1x4x4AudLaNVKP7jO82485sS9HWpOadYsaU
jvek2SqfqvH+vce4FzwlLEcXGDb73MT/XkIUvd7SfPIbs9umgdZc03U4SHfAWr0i
lAIRs4Ym0AAS2WSE4E09wvdUUr9oxaQBMhw3JAiNmg7hLfyINTP+D/IhtlAVXXEA
F9D7fky5lDwfKvIwPxPJbDD5bCBV9AmxhiahIhv3epu4Kg4orf1inkrx0IZWSbB0
7+JZ7j8asuizfibkeZAN9rxVwmz32makJNsnjzZHlnaPxTvIDzvRkNceBnhC5vKq
3yfxgl4agXmMjveraAtt
=T2kg
-----END PGP SIGNATURE-----
Merge remote-tracking branch 'remotes/ehabkost/tags/x86-pull-request' into staging
x86 and CPU queue, 2016-10-24
x2APIC support to APIC code, cpu_exec_init() refactor on all
architectures, and other x86 changes.
# gpg: Signature made Mon 24 Oct 2016 20:51:14 BST
# gpg: using RSA key 0x2807936F984DC5A6
# gpg: Good signature from "Eduardo Habkost <ehabkost@redhat.com>"
# Primary key fingerprint: 5A32 2FD5 ABC4 D3DB ACCF D1AA 2807 936F 984D C5A6
* remotes/ehabkost/tags/x86-pull-request:
exec: call cpu_exec_exit() from a CPU unrealize common function
exec: move cpu_exec_init() calls to realize functions
exec: split cpu_exec_init()
pc: q35: Bump max_cpus to 288
pc: Require IRQ remapping and EIM if there could be x2APIC CPUs
pc: Add 'etc/boot-cpus' fw_cfg file for machine with more than 255 CPUs
Increase MAX_CPUMASK_BITS from 255 to 288
pc: Clarify FW_CFG_MAX_CPUS usage comment
pc: kvm_apic: Pass APIC ID depending on xAPIC/x2APIC mode
pc: apic_common: Reset APIC ID to initial ID when switching into x2APIC mode
pc: apic_common: Restore APIC ID to initial ID on reset
pc: apic_common: Extend APIC ID property to 32bit
pc: Leave max apic_id_limit only in legacy cpu hotplug code
acpi: cphp: Force switch to modern cpu hotplug if APIC ID > 254
pc: acpi: x2APIC support for SRAT table
pc: acpi: x2APIC support for MADT table and _MAT method
Conflicts:
target-arm/cpu.c
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Modify all CPUs to call it from XXX_cpu_realizefn() function.
Remove all the cannot_destroy_with_object_finalize_yet as
unsafe references have been moved to cpu_exec_realizefn().
(tested with QOM command provided by commit 4c315c27)
for arm:
Setting of cpu->mp_affinity is moved from arm_cpu_initfn()
to arm_cpu_realizefn() as setting of cpu_index is now done
in cpu_exec_realizefn(). To avoid to overwrite an user defined
value, we set it to an invalid value by default, and update
it in realize function only if the value is still invalid.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Igor Mammedov <imammedo@redhat.com>
Reviewed-by: Eduardo Habkost <ehabkost@redhat.com>
Reviewed-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Support target CPUs having a page size which isn't knownn
at compile time. To use this, the CPU implementation should:
* define TARGET_PAGE_BITS_VARY
* not define TARGET_PAGE_BITS
* define TARGET_PAGE_BITS_MIN to the smallest value it
might possibly want for TARGET_PAGE_BITS
* call set_preferred_target_page_bits() in its realize
function to indicate the actual preferred target page
size for the CPU (and report any error from it)
In CONFIG_USER_ONLY, the CPU implementation should continue
to define TARGET_PAGE_BITS appropriately for the guest
OS page size.
Machines which want to take advantage of having the page
size something larger than TARGET_PAGE_BITS_MIN must
set the MachineClass minimum_page_bits field to a value
which they guarantee will be no greater than the preferred
page size for any CPU they create.
Note that changing the target page size by setting
minimum_page_bits is a migration compatibility break
for that machine.
For debugging purposes, attempts to use TARGET_PAGE_SIZE
before it has been finally confirmed will assert.
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Richard Henderson <rth@twiddle.net>
This speeds up MEMORY_LISTENER_CALL noticeably. Right now,
with many PCI devices you have N regions added to M AddressSpaces
(M = # PCI devices with bus-master enabled) and each call looks
up the whole listener list, with at least M listeners in it.
Because most of the regions in N are BARs, which are also roughly
proportional to M, the whole thing is O(M^3). This changes it
to O(M^2), which is the best we can do without rewriting the
whole thing.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Store the page size in each RAMBlock, we need it later.
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>
Use async_safe_run_on_cpu() to make tb_flush() thread safe. This is
possible now that code generation does not happen in the middle of
execution.
It can happen that multiple threads schedule a safe work to flush the
translation buffer. To keep statistics and debugging output sane, always
check if the translation buffer has already been flushed.
Signed-off-by: Sergey Fedorov <serge.fdrv@gmail.com>
Signed-off-by: Sergey Fedorov <sergey.fedorov@linaro.org>
[AJB: minor re-base fixes]
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Message-Id: <1470158864-17651-13-git-send-email-alex.bennee@linaro.org>
Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add a mutex for the CPU list to system emulation, as it will be used to
manage safe work. Abstract manipulation of the CPU list in new functions
cpu_list_add and cpu_list_remove.
Reviewed-by: Richard Henderson <rth@twiddle.net>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Migrating a VM during reboot sometimes results in differences
between the source and destination in the SMRAM area.
This is because migration_bitmap_sync() only fetches from KVM
the dirty log of address_space_memory. SMRAM memory slots
are ignored and the modifications to SMRAM are not sent to the
destination.
Reported-by: He Rongguang <herongguang.he@huawei.com>
Reviewed-by: He Rongguang <herongguang.he@huawei.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The new interface can be used to replace the old notify_started() and
notify_stopped(). Meanwhile it provides explicit flags so that IOMMUs
can know what kind of notifications it is requested for.
Acked-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <1474606948-14391-3-git-send-email-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
IOMMU Notifier list is used for notifying IO address mapping changes.
Currently VFIO is the only user.
However it is possible that future consumer like vhost would like to
only listen to part of its notifications (e.g., cache invalidations).
This patch introduced IOMMUNotifier and IOMMUNotfierFlag bits for a
finer grained control of it.
IOMMUNotifier contains a bitfield for the notify consumer describing
what kind of notification it is interested in. Currently two kinds of
notifications are defined:
- IOMMU_NOTIFIER_MAP: for newly mapped entries (additions)
- IOMMU_NOTIFIER_UNMAP: for entries to be removed (cache invalidates)
When registering the IOMMU notifier, we need to specify one or multiple
types of messages to listen to.
When notifications are triggered, its type will be checked against the
notifier's type bits, and only notifiers with registered bits will be
notified.
(For any IOMMU implementation, an in-place mapping change should be
notified with an UNMAP followed by a MAP.)
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <1474606948-14391-2-git-send-email-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The return address argument to the softmmu template helpers was
confused. In the legacy case, we wanted to indicate that there
is no return address, and so passed in NULL. However, we then
immediately subtracted GETPC_ADJ from NULL, resulting in a non-zero
value, indicating the presence of an (invalid) return address.
Push the GETPC_ADJ subtraction down to the only point it's required:
immediately before use within cpu_restore_state_from_tb, after all
NULL pointer checks have been completed.
This makes GETPC and GETRA identical. Remove GETRA as the lesser
used macro, replacing all uses with GETPC.
Signed-off-by: Richard Henderson <rth@twiddle.net>
When invalidating a translation block, set an invalid flag into the
TranslationBlock structure first. It is also necessary to check whether
the target TB is still valid after acquiring 'tb_lock' but before calling
tb_add_jump() since TB lookup is to be performed out of 'tb_lock' in
future. Note that we don't have to check 'last_tb'; an already invalidated
TB will not be executed anyway and it is thus safe to patch it.
Suggested-by: Sergey Fedorov <serge.fdrv@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Instead of using -1 as end of chain, use 0, and link through the 0
entry as a fully circular double-linked list.
Reviewed-by: Aurelien Jarno <aurelien@aurel32.net>
Signed-off-by: Richard Henderson <rth@twiddle.net>
For i386, the ABI specifies that 'long long' (8 byte values)
need only be 4 aligned, but we were requiring them to be
8-aligned. This meant we were laying out the target_epoll_event
structure wrongly. Add a suitable ifdef to abitypes.h to
specify the i386-specific alignment requirement.
Reported-by: Icenowy Zheng <icenowy@aosc.xyz>
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Laurent Vivier <laurent@vivier.eu>
Signed-off-by: Riku Voipio <riku.voipio@linaro.org>
Header guard symbols should match their file name to make guard
collisions less likely. Offenders found with
scripts/clean-header-guards.pl -vn.
Cleaned up with scripts/clean-header-guards.pl, followed by some
renaming of new guard symbols picked by the script to better ones.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Richard Henderson <rth@twiddle.net>
Tracked down with an ugly, brittle and probably buggy Perl script.
Also move includes converted to <...> up so they get included before
ours where that's obviously okay.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Tested-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Richard Henderson <rth@twiddle.net>
There are functions tlb_fill(), cpu_unaligned_access() and
do_unaligned_access() that are called with access type and mmu index
arguments. But these arguments are named 'is_write' and 'is_user' in their
declarations. The patches fix the arguments to avoid a confusion.
Signed-off-by: Sergey Sorokin <afarallax@yandex.ru>
Reviewed-by: Eduardo Habkost <ehabkost@redhat.com>
Acked-by: David Gibson <david@gibson.dropbear.id.au>
Message-id: 1465907177-1399402-1-git-send-email-afarallax@yandex.ru
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Some architectures (e.g. ARMv8) need the address which is aligned
to a size more than the size of the memory access.
To support such check it's enough the current costless alignment
check implementation in QEMU, but we need to support
an alignment size specifying.
Signed-off-by: Sergey Sorokin <afarallax@yandex.ru>
Message-Id: <1466705806-679898-1-git-send-email-afarallax@yandex.ru>
Signed-off-by: Richard Henderson <rth@twiddle.net>
[rth: Assert in tcg_canonicalize_memop. Leave get_alignment_bits
available for, though unused by, user-mode. Retain logging difference
based on ALIGNED_ONLY.]
It doesn't make sense to pass a NULL ops argument to
memory_region_init_rom_device(), because the effect will
be that if the guest tries to write to the memory region
then QEMU will segfault. Catch the bug earlier by sanity
checking the arguments to this function, and remove the
misleading documentation that suggests that passing NULL
might be sensible.
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Message-id: 1467122287-24974-4-git-send-email-peter.maydell@linaro.org
Provide a new helper function memory_region_init_rom() for memory
regions which are read-only (and unlike those created by
memory_region_init_rom_device() don't have special behaviour
for writes). This has the same behaviour as calling
memory_region_init_ram() and then memory_region_set_readonly()
(which is what we do today in boards with pure ROMs) but is a
more easily discoverable API for the purpose.
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Message-id: 1467122287-24974-2-git-send-email-peter.maydell@linaro.org
The IOMMU driver may change behavior depending on whether a notifier
client is present. In the case of POWER, this represents a change in
the visibility of the IOTLB, for other drivers such as intel-iommu and
future AMD-Vi emulation, notifier support is not yet enabled and this
provides the opportunity to flag that incompatibility.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Peter Xu <peterx@redhat.com>
Tested-by: Peter Xu <peterx@redhat.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
[new log & extracted from [PATCH qemu v17 12/12] spapr_iommu, vfio, memory: Notify IOMMU about starting/stopping listening]
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This function needs to be converted to QOM hook and virtualised for
multi-arch. This rename interferes, as cpu-qom will not have access
to the renaming causing name divergence. This rename doesn't really do
anything anyway so just delete it.
Signed-off-by: Peter Crosthwaite <crosthwaite.peter@gmail.com>
Message-Id: <69bd25a8678b8b31b91cd9760c777bed1aafb44e.1437212383.git.crosthwaite.peter@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Peter Crosthwaite <crosthwaitepeter@gmail.com>
Every IOMMU has some granularity which MemoryRegionIOMMUOps::translate
uses when translating, however this information is not available outside
the translate context for various checks.
This adds a get_min_page_size callback to MemoryRegionIOMMUOps and
a wrapper for it so IOMMU users (such as VFIO) can know the minimum
actual page size supported by an IOMMU.
As IOMMU MR represents a guest IOMMU, this uses TARGET_PAGE_SIZE
as fallback.
This removes vfio_container_granularity() and uses new helper in
memory_region_iommu_replay() when replaying IOMMU mappings on added
IOMMU memory region.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
[dwg: Removed an unnecessary calculation]
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
The event is described in "trace-events". Note that the "MO_AMASK" flag
is not traced, since it does not seem to affect the visible semantics of
instructions.
[s/inline inline/inline/ to fix clang build.
--Stefan]
Signed-off-by: Lluís Vilanova <vilanova@ac.upc.edu>
Reviewed-by: Richard Henderson <rth@twiddle.net>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Message-id: 146549350711.18437.726780393247474362.stgit@fimbulvetr.bsc.es
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Having a fixed-size hash table for keeping track of all translation blocks
is suboptimal: some workloads are just too big or too small to get maximum
performance from the hash table. The MRU promotion policy helps improve
performance when the hash table is a little undersized, but it cannot
make up for severely undersized hash tables.
Furthermore, frequent MRU promotions result in writes that are a scalability
bottleneck. For scalability, lookups should only perform reads, not writes.
This is not a big deal for now, but it will become one once MTTCG matures.
The appended fixes these issues by using qht as the implementation of
the TB hash table. This solution is superior to other alternatives considered,
namely:
- master: implementation in QEMU before this patchset
- xxhash: before this patch, i.e. fixed buckets + xxhash hashing + MRU.
- xxhash-rcu: fixed buckets + xxhash + RCU list + MRU.
MRU is implemented here by adding an intermediate struct
that contains the u32 hash and a pointer to the TB; this
allows us, on an MRU promotion, to copy said struct (that is not
at the head), and put this new copy at the head. After a grace
period, the original non-head struct can be eliminated, and
after another grace period, freed.
- qht-fixed-nomru: fixed buckets + xxhash + qht without auto-resize +
no MRU for lookups; MRU for inserts.
The appended solution is the following:
- qht-dyn-nomru: dynamic number of buckets + xxhash + qht w/ auto-resize +
no MRU for lookups; MRU for inserts.
The plots below compare the considered solutions. The Y axis shows the
boot time (in seconds) of a debian jessie image with arm-softmmu; the X axis
sweeps the number of buckets (or initial number of buckets for qht-autoresize).
The plots in PNG format (and with errorbars) can be seen here:
http://imgur.com/a/Awgnq
Each test runs 5 times, and the entire QEMU process is pinned to a
single core for repeatability of results.
Host: Intel Xeon E5-2690
28 ++------------+-------------+-------------+-------------+------------++
A***** + + + master **A*** +
27 ++ * xxhash ##B###++
| A******A****** xxhash-rcu $$C$$$ |
26 C$$ A******A****** qht-fixed-nomru*%%D%%%++
D%%$$ A******A******A*qht-dyn-mru A*E****A
25 ++ %%$$ qht-dyn-nomru &&F&&&++
B#####% |
24 ++ #C$$$$$ ++
| B### $ |
| ## C$$$$$$ |
23 ++ # C$$$$$$ ++
| B###### C$$$$$$ %%%D
22 ++ %B###### C$$$$$$C$$$$$$C$$$$$$C$$$$$$C$$$$$$C
| D%%%%%%B###### @E@@@@@@ %%%D%%%@@@E@@@@@@E
21 E@@@@@@E@@@@@@F&&&@@@E@@@&&&D%%%%%%B######B######B######B######B######B
+ E@@@ F&&& + E@ + F&&& + +
20 ++------------+-------------+-------------+-------------+------------++
14 16 18 20 22 24
log2 number of buckets
Host: Intel i7-4790K
14.5 ++------------+------------+-------------+------------+------------++
A** + + + master **A*** +
14 ++ ** xxhash ##B###++
13.5 ++ ** xxhash-rcu $$C$$$++
| qht-fixed-nomru %%D%%% |
13 ++ A****** qht-dyn-mru @@E@@@++
| A*****A******A****** qht-dyn-nomru &&F&&& |
12.5 C$$ A******A******A*****A****** ***A
12 ++ $$ A*** ++
D%%% $$ |
11.5 ++ %% ++
B### %C$$$$$$ |
11 ++ ## D%%%%% C$$$$$ ++
| # % C$$$$$$ |
10.5 F&&&&&&B######D%%%%% C$$$$$$C$$$$$$C$$$$$$C$$$$$C$$$$$$ $$$C
10 E@@@@@@E@@@@@@B#####B######B######E@@@@@@E@@@%%%D%%%%%D%%%###B######B
+ F&& D%%%%%%B######B######B#####B###@@@D%%% +
9.5 ++------------+------------+-------------+------------+------------++
14 16 18 20 22 24
log2 number of buckets
Note that the original point before this patch series is X=15 for "master";
the little sensitivity to the increased number of buckets is due to the
poor hashing function in master.
xxhash-rcu has significant overhead due to the constant churn of allocating
and deallocating intermediate structs for implementing MRU. An alternative
would be do consider failed lookups as "maybe not there", and then
acquire the external lock (tb_lock in this case) to really confirm that
there was indeed a failed lookup. This, however, would not be enough
to implement dynamic resizing--this is more complex: see
"Resizable, Scalable, Concurrent Hash Tables via Relativistic
Programming" by Triplett, McKenney and Walpole. This solution was
discarded due to the very coarse RCU read critical sections that we have
in MTTCG; resizing requires waiting for readers after every pointer update,
and resizes require many pointer updates, so this would quickly become
prohibitive.
qht-fixed-nomru shows that MRU promotion is advisable for undersized
hash tables.
However, qht-dyn-mru shows that MRU promotion is not important if the
hash table is properly sized: there is virtually no difference in
performance between qht-dyn-nomru and qht-dyn-mru.
Before this patch, we're at X=15 on "xxhash"; after this patch, we're at
X=15 @ qht-dyn-nomru. This patch thus matches the best performance that we
can achieve with optimum sizing of the hash table, while keeping the hash
table scalable for readers.
The improvement we get before and after this patch for booting debian jessie
with arm-softmmu is:
- Intel Xeon E5-2690: 10.5% less time
- Intel i7-4790K: 5.2% less time
We could get this same improvement _for this particular workload_ by
statically increasing the size of the hash table. But this would hurt
workloads that do not need a large hash table. The dynamic (upward)
resizing allows us to start small and enlarge the hash table as needed.
A quick note on downsizing: the table is resized back to 2**15 buckets
on every tb_flush; this makes sense because it is not guaranteed that the
table will reach the same number of TBs later on (e.g. most bootup code is
thrown away after boot); it makes sense to grow the hash table as
more code blocks are translated. This also avoids the complication of
having to build downsizing hysteresis logic into qht.
Reviewed-by: Sergey Fedorov <serge.fedorov@linaro.org>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Message-Id: <1465412133-3029-15-git-send-email-cota@braap.org>
Signed-off-by: Richard Henderson <rth@twiddle.net>
For some workloads such as arm bootup, tb_phys_hash is performance-critical.
The is due to the high frequency of accesses to the hash table, originated
by (frequent) TLB flushes that wipe out the cpu-private tb_jmp_cache's.
More info:
https://lists.nongnu.org/archive/html/qemu-devel/2016-03/msg05098.html
To dig further into this I modified an arm image booting debian jessie to
immediately shut down after boot. Analysis revealed that quite a bit of time
is unnecessarily spent in tb_phys_hash: the cause is poor hashing that
results in very uneven loading of chains in the hash table's buckets;
the longest observed chain had ~550 elements.
The appended addresses this with two changes:
1) Use xxhash as the hash table's hash function. xxhash is a fast,
high-quality hashing function.
2) Feed the hashing function with not just tb_phys, but also pc and flags.
This improves performance over using just tb_phys for hashing, since that
resulted in some hash buckets having many TB's, while others getting very few;
with these changes, the longest observed chain on a single hash bucket is
brought down from ~550 to ~40.
Tests show that the other element checked for in tb_find_physical,
cs_base, is always a match when tb_phys+pc+flags are a match,
so hashing cs_base is wasteful. It could be that this is an ARM-only
thing, though. UPDATE:
On Tue, Apr 05, 2016 at 08:41:43 -0700, Richard Henderson wrote:
> The cs_base field is only used by i386 (in 16-bit modes), and sparc (for a TB
> consisting of only a delay slot).
> It may well still turn out to be reasonable to ignore cs_base for hashing.
BTW, after this change the hash table should not be called "tb_hash_phys"
anymore; this is addressed later in this series.
This change gives consistent bootup time improvements. I tested two
host machines:
- Intel Xeon E5-2690: 11.6% less time
- Intel i7-4790K: 19.2% less time
Increasing the number of hash buckets yields further improvements. However,
using a larger, fixed number of buckets can degrade performance for other
workloads that do not translate as many blocks (600K+ for debian-jessie arm
bootup). This is dealt with later in this series.
Reviewed-by: Sergey Fedorov <sergey.fedorov@linaro.org>
Reviewed-by: Richard Henderson <rth@twiddle.net>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Message-Id: <1465412133-3029-8-git-send-email-cota@braap.org>
Signed-off-by: Richard Henderson <rth@twiddle.net>
This will be used by upcoming changes for hashing the tb hash.
Add this into a separate file to include the copyright notice from
xxhash.
Reviewed-by: Sergey Fedorov <sergey.fedorov@linaro.org>
Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Message-Id: <1465412133-3029-7-git-send-email-cota@braap.org>
Signed-off-by: Richard Henderson <rth@twiddle.net>
The function cpu_resume_from_signal() is now always called with a
NULL puc argument, and is rather misnamed since it is never called
from a signal handler. It is essentially forcing an exit to the
top level cpu loop but without raising any exception, so rename
it to cpu_loop_exit_noexc() and drop the useless unused argument.
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Sergey Fedorov <sergey.fedorov@linaro.org>
Acked-by: Eduardo Habkost <ehabkost@redhat.com>
Acked-by: Riku Voipio <riku.voipio@linaro.org>
Message-id: 1463494687-25947-4-git-send-email-peter.maydell@linaro.org
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIVAwUAV1gdMrRIkN7ePJvAAQhLcg/+Kby99taEuewItrA1yDs75jxOlLqaJopd
cVzo4LFRFPhIn4UEKqRQS0CGoIeU/DYOmObvuUzJxs2LyUoHoqmQOwEm5obC2a85
JrHo/NOppYBbyvvIEAAXzZDCZo0KZKVclrlT+AX5obpOSNSvAnKvEuLWq1aQ9WGN
n4AzHuFEl885cd4nFd8VK/xth89bqz6U/z8CjgIuw3mczp1XNrK5IJJwAy5epHay
GCBr9XHooW3SU971WS20RTRS0D33tKPHgCU3ZeZ3rKh4g3JNj6/ixdVgzi9NqFsQ
5DzAj/iBGhN1LtCOednRS6tUt32Bhy8G/g4O3GiXdejagAmNe2wz31cveNJ8S3W5
DK8SZAnJlz06zN5uIpOVQgDOqfXZkCp7ndq779QJoHOAnuOjJBcUbhw1myz2R3eR
6208tStWl3R0+ATEK8CZ7ejg1cUHvdzyqGJA+1nC2HaFUrBWipxN8jf2fz9vO/wG
G7zNbahvVgyJWO7bPNK4mxkb6qkWCETnCnLJsq2ZbmtPEMcINjD8vNWLNvFGVG8b
2HbinDrzh0Z9Zik5gLZfiVyP5HFaWSrJn9QRVIgaUjuIH9n3/25sl9OvW/JLjxJ+
h2P17CLnAK6dhUYc4R3wQTx2X/N2FvO4DD8iMYOcgDY6fhZ2b6EEyE9yBgQrIDbF
gU1AlC/CX+M=
=AXqa
-----END PGP SIGNATURE-----
Merge remote-tracking branch 'remotes/riku/tags/pull-linux-user-20160608' into staging
linux-user pull request for June 2016
# gpg: Signature made Wed 08 Jun 2016 14:27:14 BST
# gpg: using RSA key 0xB44890DEDE3C9BC0
# gpg: Good signature from "Riku Voipio <riku.voipio@iki.fi>"
# gpg: aka "Riku Voipio <riku.voipio@linaro.org>"
* remotes/riku/tags/pull-linux-user-20160608: (44 commits)
linux-user: In fork_end(), remove correct CPUs from CPU list
linux-user: Special-case ERESTARTSYS in target_strerror()
linux-user: Make target_strerror() return 'const char *'
linux-user: Correct signedness of target_flock l_start and l_len fields
linux-user: Use safe_syscall wrapper for ioctl
linux-user: Use safe_syscall wrapper for accept and accept4 syscalls
linux-user: Use safe_syscall wrapper for semop
linux-user: Use safe_syscall wrapper for epoll_wait syscalls
linux-user: Use safe_syscall wrapper for poll and ppoll syscalls
linux-user: Use safe_syscall wrapper for sleep syscalls
linux-user: Use safe_syscall wrapper for rt_sigtimedwait syscall
linux-user: Use safe_syscall wrapper for flock
linux-user: Use safe_syscall wrapper for mq_timedsend and mq_timedreceive
linux-user: Use safe_syscall wrapper for msgsnd and msgrcv
linux-user: Use safe_syscall wrapper for send* and recv* syscalls
linux-user: Use safe_syscall wrapper for connect syscall
linux-user: Use safe_syscall wrapper for readv and writev syscalls
linux-user: Fix error conversion in 64-bit fadvise syscall
linux-user: Fix NR_fadvise64 and NR_fadvise64_64 for 32-bit guests
linux-user: Fix handling of arm_fadvise64_64 syscall
...
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Conflicts:
configure
scripts/qemu-binfmt-conf.sh
The target_to_host_bitmask() and host_to_target_bitmask() functions
and the associated struct bitmask_transtbl are completely generic,
but for historical reasons the target related fields and parameters
are named 'x86' and the host related fields are named 'alpha'.
Rename them to 'target' and 'host'.
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
The thunk_type_size_array() and thunk_type_align_array() functions
are only provided if NO_THUNK_TYPE_SIZE is not defined. However
nothing in the codebase defines that, and so in fact these functions
are always present. Drop the unnecessary #ifdefs.
(Over a decade ago thunk.h used to be included by some softmmu
files, which defined NO_THUNK_TYPE_SIZE, but these includes are
long gone; see for instance commit f193c7979c2f7.)
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
The WORDS_ALIGNED #define is not used anywhere, and hasn't been since
2013 when commit 612d590ebc rewrote the various ld<type>_<endian>_p
functions to not use it. Remove the #define and the comment describing it.
Also remove the line in the comment about TARGET_WORDS_ALIGNED, since
it has never actually existed.
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
Clean up includes so that osdep.h is included first and headers
which it implies are not included manually.
This commit was created with scripts/clean-includes.
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
Let users of qemu_get_ram_ptr and qemu_ram_ptr_length pass in an
address that is relative to the MemoryRegion. This basically means
what address_space_translate returns.
Because the semantics of the second parameter change, rename the
function to qemu_map_ram_ptr.
Reviewed-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move the old qemu_ram_addr_from_host to memory_region_from_host and
make it return an offset within the region. For qemu_ram_addr_from_host
return the ram_addr_t directly, similar to what it was before
commit 1b5ec23 ("memory: return MemoryRegion from qemu_ram_addr_from_host",
2013-07-04).
Reviewed-by: Marc-André Lureau <marcandre.lureau@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Of the two callers, one does not use it, and the other can compute
it itself based on the other output argument (offset) and the RAMBlock.
Reviewed-by: Marc-André Lureau <marcandre.lureau@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Remove direct uses of ram_addr_t and optimize memory_region_{get,set}_fd
now that a MemoryRegion knows its RAMBlock directly.
Reviewed-by: Marc-André Lureau <marcandre.lureau@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The collision check does nothing and hasn't been used. Remove the
variable together with related code.
Signed-off-by: Fam Zheng <famz@redhat.com>
Message-Id: <1458900629-2334-2-git-send-email-famz@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
On the one hand, we have already qemu_get_ram_block() whose function
is similar. On the other hand, we can directly use mr->ram_block but
searching RAMblock by ram_addr which is a kind of waste.
Signed-off-by: Gonglei <arei.gonglei@huawei.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Message-Id: <1462845901-89716-2-git-send-email-arei.gonglei@huawei.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>