Currently the pseries machine code always attempts to set the size of the
guests's hash page table to 16MB. However, because of the way the POWER
MMU works, a suitable hash page table size should really depend on memory
size. 16MB will be excessive for guests with <1GB and RAM, and may not be
enough for guests with >2GB of RAM (depending on guest page size and
other factors).
The usual given rule of thumb is that the hash table should be 1/64 of
the size of memory, but in fact the Linux guests we are aiming at don't
really need that much. This patch, therefore, changes the hash table
allocation code to aim for 1/128 of the size of RAM (rounding up). When
using KVM, this size may still be adjusted by the host kernel if it is
unable to allocate a suitable (contiguous) table.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alexander Graf <agraf@suse.de>
In the paravirtualized environment provided by PAPR, there is a standard
locking scheme so that hypercalls updating the hash page table from
different guest threads don't corrupt the haah table state. We implement
this HVLOCK bit in out page table hypercalls. However, it is not necessary
in our case, since the hypercalls all run in the qemu environment under the
big qemu lock.
Therefore, this patch removes the locking code. This has the additional
advantage of freeing up a hash PTE bit which will be useful for migration
support.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alexander Graf <agraf@suse.de>
Report from smatch:
ppc405_uc.c:209 dcr_read_pob(12) error: buffer overflow 'pob->besr' 2 <= 2
ppc405_uc.c:232 dcr_write_pob(12) error: buffer overflow 'pob->besr' 2 <= 2
The old code reads and writes besr[POB0_BESR1 - POB0_BESR0] or besr[2]
which is one too much.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Reviewed-by: Andreas Färber <afaerber@suse.de>
Signed-off-by: Alexander Graf <agraf@suse.de>
The kvmppc_reset_htab() function invokes the KVM_PPC_ALLOCATE_HTAB vm ioctl
to request KVM to allocate and reset a hash page table for the guest - it
returns the size of hash table allocated, or 0 to indicate that qemu needs
to allocate the hash table itself. In practice qemu needs to allocate the
htab for full emulation and with Book3sPR KVM, but the kernel has to
allocate it for Book3sHV KVM (the hash table needs to be physically
contiguous in that case).
Unfortunately, the logic in this function is incorrect for some existing
kernels. Specifically:
* at least some PR KVM versions advertise the relevant capability but
don't actually implement the ioctl(), returning ENOTTY.
* For old kernels which don't have the capability, we currently return 0.
This is correct for PV KVM, where we need to allocate the htab, but not for
HV KVM - kernels of this era always allocate a 16MB hash table per guest.
This patch corrects both of these edge cases.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alexander Graf <agraf@suse.de>
Currently the ibm,int-on and ibm,int-off RTAS functions are implemented as
no-ops. This is because when implemented as specified in PAPR they caused
Linux (which calls both int-on/off and set-xive) to end up with interrupts
masked when they should not be. Since Linux's set-xive calls make the
int-on/off calls redundant, making them nops worked around the problem.
In fact, the problem was caused because there was a subtle bug in set-xive,
PAPR specifies that as well as updating the current priority, it also needs
to update the saved priority used by int-on/off. With this bug fixed the
problem goes away. This patch implements this more correct fix.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alexander Graf <agraf@suse.de>
On the pseries machine the IOMMU (aka TCE tables) is always active for all
PCI and VIO devices. Mostly to simplify the SLOF firmware, we implement an
extension which allows the IOMMU to be temporarily disabled for certain
devices.
Currently this is implemented by setting the device's DMAContext pointer to
NULL (thus reverting to qemu's default no-IOMMU DMA behaviour), then
replacing it when bypass mode is disabled.
This approach causes a bunch of complications though. It complexifies the
management of the DMAContext lifetimes, it's problematic for savevm/loadvm,
and it means that while bypass is active we have nowhere to store the
device's LIOBN (Logical IO Bus Number, used to identify DMA address
spaces). At present we regenerate the LIOBN from other address information
but this restricts how we can allocate LIOBNs.
This patch gives up on this approach, replacing it with the much simpler
one of having a 'bypass' boolean flag in the TCE state structure.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alexander Graf <agraf@suse.de>
The general device state structure for PAPR VIO emulated devices includes a
'flags' field which was never used. This patch removes it.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alexander Graf <agraf@suse.de>
Currently the XICS interrupt controller emulation uses a custom enum to
specify whether a given interrupt is level-sensitive or message-triggered.
This enum makes life awkward for saving the state, and isn't particularly
useful since there are only two possibilities. This patch replaces the
enum with a simple bool.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alexander Graf <agraf@suse.de>
The XICS interrupt controller emulation uses some C bitfield variables in
its internal state structure. This makes like awkward for saving the state
because we don't have easy VMSTATE helpers for bitfields.
This patch removes the bitfields, instead using explicit bit masking in a
single status variable.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alexander Graf <agraf@suse.de>
The H_CEDE hypercall implementation for the pseries machine doesn't trigger
quite the right path in the main cpu exec loop. We should set exit_request
to pop up one extra level and recheck state, and we should set the
exception_index to EXCP_HLT (H_CEDE is roughly equivalent to the hlt
instruction on x86).
In practice, this doesn't really matter except for KVM, and KVM implements
H_CEDE internally so we never hit this code path. But we might as well
get it right, just in case it matters some day.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alexander Graf <agraf@suse.de>
The XICS interrupt controller used on the pseries machine currently has no
reset handler. We can get away with this under some circumstances, but
it's not correct, and can cause failures if the XICS happens to be in the
wrong state at the time of reset.
This patch adds a hook to properly reset the XICS state.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alexander Graf <agraf@suse.de>
The emulated PCI host bridge on the pseries machine incorporates an IOMMU
(PAPR TCE table). Currently the mappings in this IOMMU are not cleared
when we reset the system. This patch fixes this bug. To do this it adds
a new reset function to the IOMMU emulation code. The VIO devices already
reset their TCE tables, but they do so by destroying and re-creating their
DMA context. This doesn't work for the PCI host bridge, because the
infrastructure for PCI IOMMUs has already copied/cached the DMA pointer
context into the subordinate PCI device structures.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alexander Graf <agraf@suse.de>
When we reset the system, the reset method for VIO bus devices resets
the state of their request queue (if present) as it should. However
it was not resetting the state of their TCE table (DMA translation) if
present. It was also not resetting the state of the per-device signal
mask set with H_VIO_SIGNAL. This patch corrects both bugs, and also
removes some small code duplication in the reset paths.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alexander Graf <agraf@suse.de>
This adds support for then new "reset htab" ioctl which allows qemu
to properly cleanup the MMU hash table when the guest is reset. With
the corresponding kernel support, reset of a guest now works properly.
This also paves the way for indicating a different size hash table
to the kernel and for the kernel to be able to impose limits on
the requested size.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alexander Graf <agraf@suse.de>
A number of things need to occur during reset of the PAPR
paravirtualized platform in a specific order. For example, the hash
table needs to be cleared before the CPUs are reset, so that they
initialize their register state correctly, and the CPUs need to have
their main reset called before we set up the entry point state on the
boot cpu. We also need to have the main qdev reset happen before the
creation and installation of the device tree for the new boot, because
we need the state of the devices settled to correctly construct the
device tree.
We currently do the pseries once-per-reset initializations done from a
reset handler. However we can't adequately control when this handler
is called during the reset - in particular we can't guarantee it
happens after all the qdev resets (since qdevs might be registered
after the machine init function has executed).
This patch uses the new QEMUMachine reset method to to fix this
problem, ensuring the various order dependent reset steps happen in
the correct order.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Andreas Färber <afaerber@suse.de>
Signed-off-by: Alexander Graf <agraf@suse.de>
The current pseries machine init function iterates over the CPUs at several
points, doing various bits of initialization. This is messy; these can
and should be merged into a single iteration doing all the necessary per
cpu initialization. Worse, some of these initializations were setting up
state which should be set on every reset, not just at machine init time.
A few of the initializations simply weren't necessary at all.
This patch, therefore, moves those things that need to be to the
per-cpu reset handler, and combines the remainder into two loops over
the cpus (which also creates them). The second loop is for setting up
hash table information, and will be removed in a subsequent patch also
making other fixes to the hash table setup.
This exposes a bug in our start-cpu RTAS routine (called by the guest to
start up CPUs other than CPU0) under kvm. Previously, this function did
not make a call to ensure that it's changes to the new cpu's state were
pushed into KVM in-kernel state. We sort-of got away with this because
some of the initializations had already placed the secondary CPUs into the
right starting state for the sorts of Linux guests we've been running.
Nonetheless the start-cpu RTAS call's behaviour was not correct and could
easily have been broken by guest changes. This patch also fixes it.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Andreas Färber <afaerber@suse.de>
Signed-off-by: Alexander Graf <agraf@suse.de>
At least when invoked with high enough 'level' arguments,
kvm_arch_put_registers() is supposed to copy essentially all the cpu state
as encoded in qemu's internal structures into the kvm state. Currently
the ppc version does not do this - it never calls KVM_SET_SREGS, for
example, and therefore never sets the SDR1 and various other important
though rarely changed registers.
Instead, the code paths which need to set these registers need to
explicitly make (conditional) kvm calls which transfer the changes to kvm.
This breaks the usual model of handling state updates in qemu, where code
just changes the internal model and has it flushed out to kvm automatically
at some later point.
This patch fixes this for Book S ppc CPUs by adding a suitable call to
KVM_SET_SREGS and als to KVM_SET_ONE_REG to set the HIOR (the only register
that is set with that call so far). This lets us remove the hacks to
explicitly set these registers from the kvmppc_set_papr() function.
The problem still exists for Book E CPUs (which use a different version of
the kvm_sregs structure). But fixing that has some complications of its
own so can be left to another day.
Lkewise, there is still some ugly code for setting the PVR through special
calls to SET_SREGS which is left in for now. The PVR needs to be set
especially early because it can affect what other features are available
on the CPU, so I need to do more thinking to see if it can be integrated
into the normal paths or not.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alexander Graf <agraf@suse.de>
We can finally get rid of the ugly HANDLE_NAN{1,2,3} macros.
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
Signed-off-by: Alexander Graf <agraf@suse.de>
Use the new softfloat float32_muladd() function to implement the vmaddfp
and vnmsubfp instructions. As a bonus we can get rid of the call to the
HANDLE_NAN3 macro, as the NaN handling is directly done at the softfloat
level.
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
Signed-off-by: Alexander Graf <agraf@suse.de>
Use the new softfloat float32_min() and float32_max() to implement the
vminfp and vmaxfp instructions. As a bonus we can get rid of the call to
the HANDLE_NAN2 macro, as the NaN handling is directly done at the
softfloat level.
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
Signed-off-by: Alexander Graf <agraf@suse.de>
Commit e024e881bb provided a pickNaN()
function for PowerPC, implementing the correct NaN propagation rules.
Therefore there is no need to test the operands manually, we can rely
on the softfloat code to do that.
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
Signed-off-by: Alexander Graf <agraf@suse.de>
Place it in alphabetical order, there is a separate section for sharing
ppc4xx devices now.
Signed-off-by: Andreas Färber <afaerber@suse.de>
Acked-by: Edgar E. Iglesias <edgar.iglesias@gmail.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Place it in alphabetical order and add new Devices section ppc4xx to
share file rules with 405 and virtex_ml507.
Signed-off-by: Andreas Färber <afaerber@suse.de>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Andreas Färber <afaerber@suse.de>
Cc: Alexander Graf <agraf@suse.de>
Cc: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
If the call to xc_hvm_track_dirty_vram() fails, then we set dirtybit on all the
video ram. This case happens during migration.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
This patch add some calls to xen_modified_memory to notify Xen about dirtybits
during migration.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Reviewed-by: Avi Kivity <avi@redhat.com>
This new helper/hook is used in the next patch to add an extra call in a single
place.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Reviewed-by: Avi Kivity <avi@redhat.com>
This function is to be used during live migration. Every write access to the
guest memory should call this funcion so the Xen tools knows which pages are
dirty.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
This command is used during a migration of a guest under Xen. It calls
memory_global_dirty_log_start or memory_global_dirty_log_stop according to the
argument pass to the command.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Reviewed-by: Luiz Capitulino <lcapitulino@redhat.com>
Currently it is assumed PCI device BAR access < 4G memory. If there is such a
device whose BAR size is larger than 4G, it must access > 4G memory address.
This patch enable the 64bits big BAR support on qemu.
Signed-off-by: Xudong Hao <xudong.hao@intel.com>
Signed-off-by: Xiantao Zhang <xiantao.zhang@intel.com>
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
The Xen platform device will unplug any NICs if requested by the guest (PVonHVM)
including a NIC that would have been passthrough. This patch makes sure that a
passthrough device will not be unplug.
Reported-by: "Zhang, Yang Z" <yang.z.zhang@intel.com>
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
The uint64_to_float32() conversion function was incorrectly always
returning numbers with the sign bit set (ie negative numbers). Correct
this so we return positive numbers instead.
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
In float16_to_float32, when returning an infinity, just pass zero
as the mantissa argument to packFloat32(), rather than shifting
a value which we know must be zero.
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
We cannot cast directly from pointer to uint64.
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Alex Barcelo <abarcelo@ac.upc.edu>
Reported-by: Alex Barcelo <abarcelo@ac.upc.edu>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Enabled for all softmmu guests supporting PCI on Linux hosts. Note
that currently only x86 hosts have the kernel side VFIO IOMMU support
for this. PPC (g3beige) is the only non-x86 guest known to work.
ARM (veratile) hangs in firmware, others untested.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
This adds the core of the QEMU VFIO-based PCI device assignment driver.
To make use of this driver, enable CONFIG_VFIO, CONFIG_VFIO_IOMMU_TYPE1,
and CONFIG_VFIO_PCI in your host Linux kernel config. Load the vfio-pci
module. To assign device 0000:05:00.0 to a guest, do the following:
for dev in $(ls /sys/bus/pci/devices/0000:05:00.0/iommu_group/devices); do
vendor=$(cat /sys/bus/pci/devices/$dev/vendor)
device=$(cat /sys/bus/pci/devices/$dev/device)
if [ -e /sys/bus/pci/devices/$dev/driver ]; then
echo $dev > /sys/bus/pci/devices/$dev/driver/unbind
fi
echo $vendor $device > /sys/bus/pci/drivers/vfio-pci/new_id
done
See Documentation/vfio.txt in the Linux kernel tree for further
description of IOMMU groups and VFIO.
Then launch qemu including the option:
-device vfio-pci,host=0000:05:00.0
Legacy PCI interrupts (INTx) currently makes use of a kludge where we
trap BAR accesses and assume the access is in response to an interrupt,
therefore de-asserting and unmasking the interrupt. It's not quite as
targetted as using the EOI for this, but it's self contained and seems
to work across all architectures. The side-effect is a significant
performance slow-down for device in INTx mode. Some devices, like
graphics cards, don't really use their interrupt, so this can be turned
off with the x-intx=off option, which disables INTx alltogether. This
should be considered an experimental option until we refine this code.
Both MSI and MSI-X are supported and avoid these issues.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Based on Linux as of 1a95620.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
This patch implements Supervisor Mode Execution Prevention (SMEP) and
Supervisor Mode Access Prevention (SMAP) for x86. The purpose of the
patch, obviously, is to help kernel developers debug the support for
those features.
A fair bit of the code relates to the handling of CPUID features. The
CPUID code probably would get greatly simplified if all the feature
bit words were unified into a single vector object, but in the
interest of producing a minimal patch for SMEP/SMAP, and because I had
very limited time for this project, I followed the existing style.
[ v2: don't change the definition of the qemu64 CPU shorthand, since
that breaks loading old snapshots. Per Anthony Liguori this can be
fixed once the CPU feature set is snapshot.
Change the coding style slightly to conform to checkpatch.pl. ]
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
The -cpu configuration interface is based on a list of feature names or
properties, on a single namespace, so there's no need to mention on
which CPUID leaf/register each flag is located.
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Reviewed-by: Don Slutz <Don@CloudSwitch.com>
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
Instead of having duplicate feature names on the ext2_feature array for
the AMD feature bit aliases, we keep the feature names only on the
feature_name[] array, and copy the corresponding bits to
cpuid_ext2_features in case the CPU vendor is AMD.
This will:
- Make sure we don't set the feature bit aliases on Intel CPUs;
- Make it easier to convert feature bits to CPU properties, as now we
have a single bit on the x86_def_t struct for each CPU feature.
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Reviewed-by: Don Slutz <Don@CloudSwitch.com>
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
Both constants have the same value, but CPUID_EXT2_AMD_ALIASES is
defined without using magic numbers.
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Reviewed-by: Don Slutz <Don@CloudSwitch.com>
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
Instea of using a hardcoded hex constant, define CPUID_EXT2_AMD_ALIASES
as the set of CPUID[8000_0001].EDX bits that on AMD are the same as the
bits of CPUID[1].EDX.
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Reviewed-By: Igor Mammedov <imammedo@redhat.com>
Reviewed-by: Don Slutz <Don@CloudSwitch.com>
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
Bit 10 of CPUID[8000_0001].EDX is not defined as an alias of
CPUID[1].EDX[10], so do not duplicate it on
kvm_arch_get_supported_cpuid().
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Reviewed-By: Igor Mammedov <imammedo@redhat.com>
Reviewed-by: Don Slutz <Don@CloudSwitch.com>
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
Add a test for each of report/ignore/stop. The tests use blkdebug
to generate an error in the middle of a script. The error is
recoverable (once = "on") so that we can test resuming a job after
stopping for an error.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
iotests.py provides a convenience function that uses Python keyword
arguments to represent QMP command arguments. However, almost all
QMP commands use dashes for argument names (the sole exception is
block_set_io_throttle), and dashes are not allowed in a keyword
argument name. Hence provide automatic conversion of underscores
to dashes.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Currently it is impossible to write a blkdebug script that ping-pongs
between two states, because the second set-state rule will use the
state that is set in the first. If you have
[set-state]
event = "..."
state = "1"
new_state = "2"
[set-state]
event = "..."
state = "2"
new_state = "1"
for example the state will remain locked at 1. This can be fixed
by first processing all rules, and then setting the state.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This patch adds support for error management to streaming.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>