Commit Graph

183 Commits

Author SHA1 Message Date
Alexey Kardashevskiy ec132efaa8 spapr: Support NVIDIA V100 GPU with NVLink2
NVIDIA V100 GPUs have on-board RAM which is mapped into the host memory
space and accessible as normal RAM via an NVLink bus. The VFIO-PCI driver
implements special regions for such GPUs and emulates an NVLink bridge.
NVLink2-enabled POWER9 CPUs also provide address translation services
which includes an ATS shootdown (ATSD) register exported via the NVLink
bridge device.

This adds a quirk to VFIO to map the GPU memory and create an MR;
the new MR is stored in a PCI device as a QOM link. The sPAPR PCI uses
this to get the MR and map it to the system address space.
Another quirk does the same for ATSD.

This adds additional steps to sPAPR PHB setup:

1. Search for specific GPUs and NPUs, collect findings in
sPAPRPHBState::nvgpus, manage system address space mappings;

2. Add device-specific properties such as "ibm,npu", "ibm,gpu",
"memory-block", "link-speed" to advertise the NVLink2 function to
the guest;

3. Add "mmio-atsd" to vPHB to advertise the ATSD capability;

4. Add new memory blocks (with extra "linux,memory-usable" to prevent
the guest OS from accessing the new memory until it is onlined) and
npuphb# nodes representing an NPU unit for every vPHB as the GPU driver
uses it for link discovery.

This allocates space for GPU RAM and ATSD like we do for MMIOs by
adding 2 new parameters to the phb_placement() hook. Older machine types
set these to zero.

This puts new memory nodes in a separate NUMA node to as the GPU RAM
needs to be configured equally distant from any other node in the system.
Unlike the host setup which assigns numa ids from 255 downwards, this
adds new NUMA nodes after the user configures nodes or from 1 if none
were configured.

This adds requirement similar to EEH - one IOMMU group per vPHB.
The reason for this is that ATSD registers belong to a physical NPU
so they cannot invalidate translations on GPUs attached to another NPU.
It is guaranteed by the host platform as it does not mix NVLink bridges
or GPUs from different NPU in the same IOMMU group. If more than one
IOMMU group is detected on a vPHB, this disables ATSD support for that
vPHB and prints a warning.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for vfio portions]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Message-Id: <20190312082103.130561-1-aik@ozlabs.ru>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2019-04-26 10:41:23 +10:00
Greg Kurz 4560116e42 spapr_pci: Fix broken naming of PCI bus
Recent commit 5cf0d326a0 fixed a regression which was preventing the
guest to access the extended config space of a PCIe device. This was
done by introducing a new PCI bus subtype for PAPR. The original fix
was causing PCI busses to be named "spapr-pci-host-bridge-root-bus.N"
instead of "pci.N", which was making upper layers unhappy of course.
This got worked around by hardcoding the PCI bus name to "pci.0", but
this only works for the default PHB. And we're now hitting:

# qemu-system-ppc64 \
             -device spapr-pci-host-bridge,index=1 \
             -device e1000e,bus=pci.0 \
             -device e1000e,bus=pci.1
qemu-system-ppc64: -device e1000e,bus=pci.1: Bus 'pci.1' not found

David already posted some patches [1] to control PCI extended config
space accesses with a new flag in the base PCI bus class instead of
subtyping. These patches are a bit more intrusive though, and
are targetted for 4.1.

When no name is passed to pci_register_bus(), the core device code
generates a lowercase name based on the QOM typename. The typename
for the base PCI bus class is "PCI", hence the "pci.0", "pci.1"
bus names. Rename the type of the PAPR PCI bus to "pci", so that
the QOM code can generate proper names. This is a hack but it is
enough to fix the regression. And all this will be reworked properly
in 4.1.

[1] https://patchwork.ozlabs.org/project/qemu-devel/list/?series=100486

Fixes: 5cf0d326a0
Signed-off-by: Greg Kurz <groug@kaod.org>
Message-Id: <155500034416.646888.1307366522340665522.stgit@bahia.lab.toulouse-stg.fr.ibm.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2019-04-12 12:23:02 +10:00
Greg Kurz 5cf0d326a0 spapr_pci: Fix extended config space accesses
The PAPR PHB acts as a legacy PCI bus but it allows PCIe extended
config space accesses anyway (for pseries-2.9 and newer machine
types).

Introduce a specific PCI bus subtype to inform the common PCI code
about that.

Fixes: c2077e2ca0
Signed-off-by: Greg Kurz <groug@kaod.org>
Message-Id: <155414130834.574858.16502276132110219890.stgit@bahia.lan>
[dwg: Apply fix so we don't rename the default pci bus, breaking everything]
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2019-04-09 15:03:10 +10:00
Markus Armbruster e366d181ce spapr: Remove NULL checks on error_propagate() calls
Patch created mechanically by rerunning:

  $  spatch --sp-file scripts/coccinelle/error_propagate_null.cocci \
	    --macro-file scripts/cocci-macro-file.h \
	    --dir . --in-place

Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <20190318190148.18283-1-armbru@redhat.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2019-03-19 15:24:15 +11:00
David Gibson ce2918cbc3 spapr: Use CamelCase properly
The qemu coding standard is to use CamelCase for type and structure names,
and the pseries code follows that... sort of.  There are quite a lot of
places where we bend the rules in order to preserve the capitalization of
internal acronyms like "PHB", "TCE", "DIMM" and most commonly "sPAPR".

That was a bad idea - it frequently leads to names ending up with hard to
read clusters of capital letters, and means they don't catch the eye as
type identifiers, which is kind of the point of the CamelCase convention in
the first place.

In short, keeping type identifiers look like CamelCase is more important
than preserving standard capitalization of internal "words".  So, this
patch renames a heap of spapr internal type names to a more standard
CamelCase.

In addition to case changes, we also make some other identifier renames:
  VIOsPAPR* -> SpaprVio*
    The reverse word ordering was only ever used to mitigate the capital
    cluster, so revert to the natural ordering.
  VIOsPAPRVTYDevice -> SpaprVioVty
  VIOsPAPRVLANDevice -> SpaprVioVlan
    Brevity, since the "Device" didn't add useful information
  sPAPRDRConnector -> SpaprDrc
  sPAPRDRConnectorClass -> SpaprDrcClass
    Brevity, and makes it clearer this is the same thing as a "DRC"
    mentioned in many other places in the code

This is 100% a mechanical search-and-replace patch.  It will, however,
conflict with essentially any and all outstanding patches touching the
spapr code.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2019-03-12 14:33:05 +11:00
David Hildenbrand 07578b0ad6 qdev: Let the hotplug_handler_unplug() caller delete the device
When unplugging a device, at one point the device will be destroyed
via object_unparent(). This will, one the one hand, unrealize the
removed device hierarchy, and on the other hand, destroy/free the
device hierarchy.

When chaining hotplug handlers, we want to overwrite a bus hotplug
handler by the machine hotplug handler, to be able to perform
some part of the plug/unplug and to forward the calls to the bus hotplug
handler.

For now, the bus hotplug handler would trigger an object_unparent(), not
allowing us to perform some unplug action on a device after we forwarded
the call to the bus hotplug handler. The device would be gone at that
point.

machine_unplug_handler(dev)
    /* eventually do unplug stuff */
    bus_unplug_handler(dev)
    /* dev is gone, we can't do more unplug stuff */

So move the object_unparent() to the original caller of the unplug. For
now, keep the unrealize() at the original places of the
object_unparent(). For implicitly chained hotplug handlers (e.g. pc
code calling acpi hotplug handlers), the object_unparent() has to be
done by the outermost caller. So when calling hotplug_handler_unplug()
from inside an unplug handler, nothing is to be done.

hotplug_handler_unplug(dev) -> calls machine_unplug_handler()
    machine_unplug_handler(dev) {
        /* eventually do unplug stuff */
        bus_unplug_handler(dev) -> calls unrealize(dev)
        /* we can do more unplug stuff but device already unrealized */
    }
object_unparent(dev)

In the long run, every unplug action should be factored out of the
unrealize() function into the unplug handler (especially for PCI). Then
we can get rid of the additonal unrealize() calls and object_unparent()
will properly unrealize the device hierarchy after the device has been
unplugged.

hotplug_handler_unplug(dev) -> calls machine_unplug_handler()
    machine_unplug_handler(dev) {
        /* eventually do unplug stuff */
        bus_unplug_handler(dev) -> only unplugs, does not unrealize
        /* we can do more unplug stuff */
    }
object_unparent(dev) -> will unrealize

The original approach was suggested by Igor Mammedov for the PCI
part, but I extended it to all hotplug handlers. I consider this one
step into the right direction.

To summarize:
- object_unparent() on synchronous unplugs is done by common code
-- "Caller of hotplug_handler_unplug"
- object_unparent() on asynchronous unplugs ("unplug requests") has to
  be done manually
-- "Caller of hotplug_handler_unplug"

Reviewed-by: Igor Mammedov <imammedo@redhat.com>
Acked-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Message-Id: <20190228122849.4296-2-david@redhat.com>
Reviewed-by: Greg Kurz <groug@kaod.org>
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
2019-03-06 11:51:08 -03:00
Greg Kurz bb2bdd812e spapr: add hotplug hooks for PHB hotplug
Hotplugging PHBs is a machine-level operation, but PHBs reside on the
main system bus, so we register spapr machine as the handler for the
main system bus.

Provide the usual pre-plug, plug and unplug-request handlers.

Move the checking of the PHB index to the pre-plug handler. It is okay
to do that and assert in the realize function because the pre-plug
handler is always called, even for the oldest machine types we support.

Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
(Fixed interrupt controller phandle in "interrupt-map" and
 TCE table size in "ibm,dma-window" FDT fragment, Greg Kurz)
Signed-off-by: Greg Kurz <groug@kaod.org>
Message-Id: <155059672926.1466090.13612804072190051439.stgit@bahia.lab.toulouse-stg.fr.ibm.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2019-02-26 09:21:25 +11:00
Michael Roth f130928d2a spapr_pci: add ibm, my-drc-index property for PHB hotplug
This is needed to denote a boot-time PHB as being hot-pluggable.

Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Greg Kurz <groug@kaod.org>
Message-Id: <155059672420.1466090.15147504040270659866.stgit@bahia.lab.toulouse-stg.fr.ibm.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2019-02-26 09:21:25 +11:00
Michael Roth 0a0a66cd1b spapr_pci: provide node start offset via spapr_populate_pci_dt()
PHB hotplug re-uses PHB device tree generation code and passes
it to a guest via RTAS. Doing this requires knowledge of where
exactly in the device tree the node describing the PHB begins.

Provide this via a new optional pointer that can be used to
store the PHB node's start offset.

Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Greg Kurz <groug@kaod.org>
Message-Id: <155059671912.1466090.10891589403973703473.stgit@bahia.lab.toulouse-stg.fr.ibm.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2019-02-26 09:21:25 +11:00
Greg Kurz ef28b98d58 spapr_pci: add PHB unrealize
To support PHB hotplug we need to clean up lingering references,
memory, child properties, etc. prior to the PHB object being
finalized. Generally this will be called as a result of calling
object_unparent() on the PHB object, which in turn would normally
be called as the result of an unplug() operation.

When the PHB is finalized, child objects will be unparented in
turn, and finalized if the PHB was the only reference holder. so
we don't bother to explicitly unparent child objects of the PHB,
with the notable exception of DRCs. This is needed to avoid a QEMU
crash when unplugging a PHB and resetting the machine before the
guest could handle the event. The DRCs are removed from the QOM tree
by  pci_unregister_root_bus() and we must make sure we're not leaving
stale aliases under the global /dr-connector path.

The formula that gives the number of DMA windows is moved to an
inline function in the hw/pci-host/spapr.h header because it
will have other users.

The unrealize function is able to cope with partially realized PHBs.
It is hence used to implement proper rollback on the realize error
path.

Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Message-Id: <155059669881.1466090.13515030705986041517.stgit@bahia.lab.toulouse-stg.fr.ibm.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2019-02-26 09:21:25 +11:00
Greg Kurz 09d876ce2c spapr/drc: Drop spapr_drc_attach() fdt argument
All DRC subtypes have been converted to generate the FDT fragment at
configure connector time instead of attach time. The fdt and fdt_offset
arguments of spapr_drc_attach() aren't needed anymore. Drop them and
make the implementation of the dt_populate() method mandatory.

Signed-off-by: Greg Kurz <groug@kaod.org>
Message-Id: <155059667853.1466090.16527852453054217565.stgit@bahia.lab.toulouse-stg.fr.ibm.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2019-02-26 09:21:25 +11:00
Greg Kurz 46fd02990d spapr/pci: Generate FDT fragment at configure connector time
Signed-off-by: Greg Kurz <groug@kaod.org>
Message-Id: <155059667346.1466090.326696113231137772.stgit@bahia.lab.toulouse-stg.fr.ibm.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2019-02-26 09:21:25 +11:00
Michael Roth 94d1cc5f03 qdev: pass an Object * to qbus_set_hotplug_handler()
Certain devices types, like memory/CPU, are now being handled using a
hotplug interface provided by a top-level MachineClass. Hotpluggable
host bridges are another such device where it makes sense to use a
machine-level hotplug handler. However, unlike those devices,
host-bridges have a parent bus (the main system bus), and devices with
a parent bus use a different mechanism for registering their hotplug
handlers: qbus_set_hotplug_handler(). This interface currently expects
a handler to be a subclass of DeviceClass, but this is not the case
for MachineClass, which derives directly from ObjectClass.

Internally, the interface only requires an ObjectClass, so expose that
in qbus_set_hotplug_handler().

Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Acked-by: Halil Pasic <pasic@linux.ibm.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Message-Id: <154999589921.690774.3640149277362188566.stgit@bahia.lan>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2019-02-17 21:54:02 +11:00
Greg Kurz 925969c3e2 spapr_pci: Fix interrupt leak in rtas_ibm_change_msi() error path
Now that IRQ allocation has been split in two (first allocate IRQ numbers,
then claim them), if the claiming fails, we must release the IRQs.

Fixes: 4fe75a8ccd "spapr: split the IRQ allocation sequence"
Signed-off-by: Greg Kurz <groug@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2019-02-17 21:54:02 +11:00
Greg Kurz 5c7adcf422 spapr: Rename xics to intc in interrupt controller agnostic code
All this code is used with both the XICS and XIVE interrupt controllers.

Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2019-02-17 21:54:02 +11:00
Alexey Kardashevskiy 382b6f2225 spapr_pci: Fix endianness in assigned-addresses property
reg->phys_hi and assigned->phys_hi are big endian but we do an extra
byteswap anyway when copying reg->phys_hi to assigned->phys_hi.
To make things slightly more messy, we also add a relocatable bit (b_n())
although in the right endianness.

This fixes endianness of assigned->phys_hi.

This is unlikely to produce any visible difference though as we should end up
there only in the case of PCI hotplug and even then I am not sure if
(d->io_regions[i].addr == PCI_BAR_UNMAPPED) == true.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2019-02-04 18:44:20 +11:00
David Hildenbrand d8e81d6e60 spapr/pci: Fix primary bus number for PCI bridges
While looking at the s390x implementation, looks like spapr has a
similar BUG when building the topology.

The primary bus number corresponds always to the bus number of the
bus the bridge is attached to.

Right now, if we have two bridges attached to the same bus (e.g. root
bus) this is however not the case. The first bridge will have primary
bus 0, the second bridge primary bus 1, which is wrong. Fix the assignment.

While at it, drop setting the PCI_SUBORDINATE_BUS temporarily to 0xff.
Setting it temporarily to that value (as discussed e.g. in [1]), is
only relevant for a running system that probes the buses. The value is
effectively unused for us just doing a DFS.

[1] http://www.science.unitn.it/~fiorella/guidelinux/tlk/node76.html

Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2019-02-04 18:44:18 +11:00
Greg Kurz 999c9caf2e spapr: move spapr_create_phb() to core machine code
This function is only used when creating the default PHB. Let's rename
it and move it to the core machine code for clarity.

Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2019-01-09 09:28:14 +11:00
David Hildenbrand 27c1da5129 spapr_pci: perform unplug via the hotplug handler
Introduce and use the "unplug" callback.

This is a preparation for multi-stage hotplug handlers, whereby the bus
hotplug handler is overwritten by the machine hotplug handler. This handler
will then pass control to the bus hotplug handler. So to get this running
cleanly, we also have to make sure to go via the hotplug handler chain when
actually unplugging a device after an unplug request. Lookup the hotplug
handler and call "unplug".

Reviewed-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Igor Mammedov <imammedo@redhat.com>
Acked-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2018-12-20 11:19:12 -05:00
Greg Kurz 4fc4c6a53d spapr_pci: convert g_malloc() to g_new()
When allocating an array, it is a recommended coding practice to call
g_new(FooType, n) instead of g_malloc(n * sizeof(FooType)) because
it takes care to avoid overflow when calculating the size of the
allocated block and it returns FooType *, which allows the compiler
to perform type checking.

Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2018-11-08 12:04:40 +11:00
Markus Armbruster 4b5766488f error: Fix use of error_prepend() with &error_fatal, &error_abort
From include/qapi/error.h:

  * Pass an existing error to the caller with the message modified:
  *     error_propagate(errp, err);
  *     error_prepend(errp, "Could not frobnicate '%s': ", name);

Fei Li pointed out that doing error_propagate() first doesn't work
well when @errp is &error_fatal or &error_abort: the error_prepend()
is never reached.

Since I doubt fixing the documentation will stop people from getting
it wrong, introduce error_propagate_prepend(), in the hope that it
lures people away from using its constituents in the wrong order.
Update the instructions in error.h accordingly.

Convert existing error_prepend() next to error_propagate to
error_propagate_prepend().  If any of these get reached with
&error_fatal or &error_abort, the error messages improve.  I didn't
check whether that's the case anywhere.

Cc: Fei Li <fli@suse.com>
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20181017082702.5581-2-armbru@redhat.com>
2018-10-19 14:51:34 +02:00
Cédric Le Goater 0976efd51b spapr_pci: add an extra 'nr_msis' argument to spapr_populate_pci_dt
So that we don't have to call qdev_get_machine() to get the machine
class and the sPAPRIrq backend holding the number of MSIs.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: Greg Kurz <groug@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2018-09-25 11:12:25 +10:00
Cédric Le Goater e39de895f6 spapr: introduce a spapr_irq class 'nr_msis' attribute
The number of MSI interrupts a sPAPR machine can allocate is in direct
relation with the number of interrupts of the sPAPRIrq backend. Define
statically this value at the sPAPRIrq class level and use it for the
"ibm,pe-total-#msi" property of the sPAPR PHB.

According to the PAPR specs, "ibm,pe-total-#msi" defines the maximum
number of MSIs that are available to the PE. We choose to advertise
the maximum number of MSIs that are available to the machine for
simplicity of the model and to avoid segmenting the MSI interrupt pool
which can be easily shared. If the pool limit is reached, it can be
extended dynamically.

Finally, remove XICS_IRQS_SPAPR which is now unused.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: Greg Kurz <groug@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2018-09-25 11:12:25 +10:00
Greg Kurz bc9b1f10f2 spapr_pci: fix potential NULL pointer dereference
Commit 2c88b098e7 added a call to SPAPR_MACHINE_GET_CLASS(spapr) in
spapr_phb_realize() before we check spapr isn't NULL. This causes QEMU
to crash when starting a non-pseries machine with a sPAPR PHB.

This could be fixed by setting the smc variable after the null check,
but it seems more explicit to use a ternary operator to skip the call
to SPAPR_MACHINE_GET_CLASS() if spapr is NULL, since spapr_phb_realize()
will return immediately in this case.

This was reported by Coverity (CID 1395170 and 1395183).

Fixes: 2c88b098e7
Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2018-08-28 11:31:23 +10:00
Cédric Le Goater 2c88b098e7 spapr_pci: factorize the use of SPAPR_MACHINE_GET_CLASS()
It should save us some CPU cycles as these routines perform a lot of
checks.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2018-08-21 14:28:45 +10:00
Cédric Le Goater 82cffa2eb2 spapr: introduce a fixed IRQ number space
This proposal introduces a new IRQ number space layout using static
numbers for all devices, depending on a device index, and a bitmap
allocator for the MSI IRQ numbers which are negotiated by the guest at
runtime.

As the VIO device model does not have a device index but a "reg"
property, we introduce a formula to compute an IRQ number from a "reg"
value. It should minimize most of the collisions.

The previous layout is kept in pre-3.1 machines raising the
'legacy_irq_allocation' machine class flag.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: Greg Kurz <groug@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2018-08-21 14:28:45 +10:00
Cédric Le Goater 4fe75a8ccd spapr: split the IRQ allocation sequence
Today, when a device requests for IRQ number in a sPAPR machine, the
spapr_irq_alloc() routine first scans the ICSState status array to
find an empty slot and then performs the assignement of the selected
numbers. Split this sequence in two distinct routines : spapr_irq_find()
for lookups and spapr_irq_claim() for claiming the IRQ numbers.

This will ease the introduction of a static layout of IRQ numbers.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2018-06-21 21:22:53 +10:00
David Gibson 30f79dc13f spapr_pci: Remove unhelpful pagesize warning
By default, the IOMMU model built into the spapr virtual PCI host bridge
supports 4kiB and 64kiB IOMMU page sizes.  However this can be overridden
which may be desirable to allow larger IOMMU page sizes when running a
guest with hugepage backing and passthrough devices.  For that reason a
warning was printed when the device wasn't configured to allow the pagesize
with which guest RAM is backed.

Experience has proven, however, that this message is more confusing than
useful.  Worse it sometimes makes little sense when the host-available page
sizes don't match those available on the guest, which can happen with
a POWER8 guest running on a POWER9 KVM host.

Long term we do want better handling to allow large IOMMU page sizes to be
used, but for now this parameter and warning don't really accomplish it.
So, remove the message, pending a better solution.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2018-06-12 10:44:36 +10:00
Greg Kurz 9cbe305b60 spapr_pci: fix MSI/MSIX selection
In various place we don't correctly check if the device supports MSI or
MSI-X. This can cause devices to be advertised with MSI support, even
if they only support MSI-X (like virtio-pci-* devices for example):

                ethernet@0 {
                        ibm,req#msi = <0x1>; <--- wrong!
			.
			ibm,loc-code = "qemu_virtio-net-pci:0000:00:00.0";
			.
			ibm,req#msi-x = <0x3>;
                };

Worse, this can also cause the "ibm,change-msi" RTAS call to corrupt the
PCI status and cause migration to fail:

  qemu-system-ppc64: get_pci_config_device: Bad config data: i=0x6
    read: 0 device: 10 cmask: 10 wmask: 0 w1cmask:0
                              ^^
           PCI_STATUS_CAP_LIST bit which is assumed to be constant

This patch changes spapr_populate_pci_child_dt() to properly check for
MSI support using msi_present(): this ensures that PCIDevice::msi_cap
was set by msi_init() and that msi_nr_vectors_allocated() will look at
the right place in the config space.

Checking PCIDevice::msix_entries_nr is enough for MSI-X but let's add
a call to msix_present() there as well for consistency.

It also changes rtas_ibm_change_msi() to select the appropriate MSI
type in Function 1 instead of always selecting plain MSI. This new
behaviour is compliant with LoPAPR 1.1, as described in "Table 71.
ibm,change-msi Argument Call Buffer":

  Function 1: If Number Outputs is equal to 3, request to set to a new
           number of MSIs (including set to 0).
           If the “ibm,change-msix-capable” property exists and Number
           Outputs is equal to 4, request is to set to a new number of
           MSI or MSI-X (platform choice) interrupts (including set to
           0).

Since MSI is the the platform default (LoPAPR 6.2.3 MSI Option), let's
check for MSI support first.

And finally, it checks the input parameters are valid, as described in
LoPAPR 1.1 "R1–7.3.10.5.1–3":

  For the MSI option: The platform must return a Status of -3 (Parameter
  error) from ibm,change-msi, with no change in interrupt assignments if
  the PCI configuration address does not support MSI and Function 3 was
  requested (that is, the “ibm,req#msi” property must exist for the PCI
  configuration address in order to use Function 3), or does not support
  MSI-X and Function 4 is requested (that is, the “ibm,req#msi-x” property
  must exist for the PCI configuration address in order to use Function 4),
  or if neither MSIs nor MSI-Xs are supported and Function 1 is requested.

This ensures that the ret_intr_type variable contains a valid MSI type
for this device, and that spapr_msi_setmsg() won't corrupt the PCI status.

Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2018-01-29 14:24:41 +11:00
Michael S. Tsirkin acc95bc850 Merge remote-tracking branch 'origin/master' into HEAD
Resolve conflicts around apb.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2018-01-11 22:03:50 +02:00
Greg Kurz 2b3db9dd34 spapr_pci: use warn_report()
These two are definitely warnings. Let's use the appropriate API.

Signed-off-by: Greg Kurz <groug@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2018-01-10 12:52:59 +11:00
Greg Kurz bb2d8ab636 spapr: fix LSI interrupt specifiers in the device tree
LoPAPR 1.1 B.6.9.1.2 describes the "#interrupt-cells" property of the
PowerPC External Interrupt Source Controller node as follows:

“#interrupt-cells”

  Standard property name to define the number of cells in an interrupt-
  specifier within an interrupt domain.

  prop-encoded-array: An integer, encoded as with encode-int, that denotes
  the number of cells required to represent an interrupt specifier in its
  child nodes.

  The value of this property for the PowerPC External Interrupt option shall
  be 2. Thus all interrupt specifiers (as used in the standard “interrupts”
  property) shall consist of two cells, each containing an integer encoded
  as with encode-int. The first integer represents the interrupt number the
  second integer is the trigger code: 0 for edge triggered, 1 for level
  triggered.

This patch fixes the interrupt specifiers in the "interrupt-map" property
of the PHB node, that were setting the second cell to 8 (confusion with
IRQ_TYPE_LEVEL_LOW ?) instead of 1.

VIO devices and RTAS event sources use the same format for interrupt
specifiers: while here, we introduce a common helper to handle the
encoding details.

Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Tested-by: Cédric Le Goater <clg@kaod.org>
--
v3: - reference public LoPAPR instead of internal PAPR+ in changelog
    - change helper name to spapr_dt_xics_irq()

v2: - drop the erroneous changes to the "interrupts" prop in PCI device nodes
    - introduce a common helper to encode interrupt specifiers
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2017-12-15 09:49:24 +11:00
Cédric Le Goater 7718375584 spapr: introduce a spapr_qirq() helper
xics_get_qirq() is only used by the sPAPR machine. Let's move it there
and change its name to reflect its scope. It will be useful for XIVE
support which will use its own set of qirqs.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2017-12-15 09:49:24 +11:00
Cédric Le Goater 60c6823b9b spapr: move the IRQ allocation routines under the machine
Also change the prototype to use a sPAPRMachineState and prefix them
with spapr_irq_. It will let us synchronise the IRQ allocation with
the XIVE interrupt mode when available.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Greg Kurz <groug@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2017-12-15 09:49:24 +11:00
David Gibson fd56e0612b pci: Eliminate redundant PCIDevice::bus pointer
The bus pointer in PCIDevice is basically redundant with QOM information.
It's always initialized to the qdev_get_parent_bus(), the only difference
is the type.

Therefore this patch eliminates the field, instead creating a pci_get_bus()
helper to do the type mangling to derive it conveniently from the QOM
Device object underneath.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Eduardo Habkost <ehabkost@redhat.com>
Reviewed-by: Marcel Apfelbaum <marcel@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
2017-12-05 19:13:45 +02:00
David Gibson 1115ff6d26 pci: Rename root bus initialization functions for clarity
pci_bus_init(), pci_bus_new_inplace(), pci_bus_new() and pci_register_bus()
are misleadingly named.  They're not used for initializing *any* PCI bus,
but only for a root PCI bus.

Non-root buses - i.e. ones under a logical PCI to PCI bridge - are instead
created with a direct qbus_create_inplace() (see pci_bridge_initfn()).

This patch renames the functions to make it clear they're only used for
a root bus.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Marcel Apfelbaum <marcel@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
2017-12-05 19:13:45 +02:00
Greg Kurz f7d6bfcdc0 spapr_pci: fail gracefully with non-pseries machine types
QEMU currently crashes when the user tries to add an spapr-pci-host-bridge
on a non-pseries machine:

$ qemu-system-ppc64 -M ppce500 -device spapr-pci-host-bridge,index=1
hw/ppc/spapr_pci.c:1535:spapr_phb_realize:
Object 0x1003dacae60 is not an instance of type spapr-machine
Aborted (core dumped)

The same thing happens with the deprecated but still available child type
spapr-pci-vfio-host-bridge.

Fix both by checking the machine type with object_dynamic_cast().

Reviewed-by: Daniel Henrique Barboza <danielhb@linux.vnet.ibm.com>
Signed-off-by: Greg Kurz <groug@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2017-10-17 10:34:01 +11:00
Peter Maydell ab16152926 Migration pull 2017-09-27
-----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJZy64HAAoJEAUWMx68W/3nTqwP/A5Gx4Qwkv5KKdpM0YLq//d+
 OODmzl7Ni3a5Up1ETqGdLb84estrgY+5DISp73Rkt4a5tbT7+XKrhb4qD+93NnTe
 zynY9in4C1jGxYm7YzeOhwSeIiuLZMTCLQlGdYw7/nunIFwkItUEvAFx3AG1WCJe
 2Mk0lvmg4LikruDDMdzqZaJu7h5RU5sQjA7SsyrTBdsN7tNWl3rKLYGXwgzv0uz5
 n2xkUgzvvnj1Bk/Adojkn05yxA86xKD/4rhFED9fjNVSjAGHMrHIWOJ70V26Cg5w
 3gJ+5mesWsH+erf0JFYv0S38SyFbmIOE39Nn13D/d0o1x89P8B8cgqbi3ADTKM77
 875wuIVnZzi2vIwVdxXQ9GHQ79cpXwr2fOfQ2rjT6Ll95K+u/MQG86fQiO0eJW+0
 KwQVCwwh+HmCUcCogMuxAc9+F8C8qolwCi/9QXwS2yLBElHKaWDIMyTce36cW9d7
 cZaKIOeSJUGNFoaWZnXN88MRuOYbdywTl+GddVAW3+VJCTYV2oi0o5fsTfxXy5AV
 y7uYo/pcSj2gSZJ5GairMlB6p5iXnE8yusi1e4ZKA1x1TaSHSb6zR59lRUFr+j/L
 JhUCfA85v5/elGqgkYp6UhSzFDJ2ID2oSEMQTIzfVrinOXtnf2KEh33YMbUH5qyo
 yHVEu12uPe9rE6A0vWlu
 =/+LV
 -----END PGP SIGNATURE-----

Merge remote-tracking branch 'remotes/dgilbert/tags/pull-migration-20170927a' into staging

Migration pull 2017-09-27

# gpg: Signature made Wed 27 Sep 2017 14:56:23 BST
# gpg:                using RSA key 0x0516331EBC5BFDE7
# gpg: Good signature from "Dr. David Alan Gilbert (RH2) <dgilbert@redhat.com>"
# gpg: WARNING: This key is not certified with sufficiently trusted signatures!
# gpg:          It is not certain that the signature belongs to the owner.
# Primary key fingerprint: 45F5 C71B 4A0C B7FB 977A  9FA9 0516 331E BC5B FDE7

* remotes/dgilbert/tags/pull-migration-20170927a:
  migration: Route more error paths
  migration: Route errors up through vmstate_save
  migration: wire vmstate_save_state errors up to vmstate_subsection_save
  migration: Check field save returns
  migration: check pre_save return in vmstate_save_state
  migration: pre_save return int
  migration: disable auto-converge during bulk block migration

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2017-09-27 22:44:51 +01:00
Dr. David Alan Gilbert 44b1ff319c migration: pre_save return int
Modify the pre_save method on VMStateDescription to return an int
rather than void so that it potentially can fail.

Changed zillions of devices to make them return 0; the only
case I've made it return non-0 is hw/intc/s390_flic_kvm.c that already
had an error_report/return case.

Note: If you add an error exit in your pre_save you must emit
an error_report to say why.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Message-Id: <20170925112917.21340-2-dgilbert@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
2017-09-27 11:35:59 +01:00
Greg Kurz 30b3bc5aa9 spapr_pci: make index property mandatory
PHBs can be created with an index property, in which case the machine
code automatically sets all the MMIO windows at addresses derived from
the index. Alternatively, they can be manually created without index,
but the user has to provide addresses for all MMIO windows.

The non-index way happens to be more trouble than it's worth: it's
difficult to use, keeps requiring (potentially incompatible) changes
when some new parameter needs adding, and is awkward to check for
collisions. It currently even has a bug that prevents to use two
non-index PHBs because their child DRCs are all derived from the
same index == -1 value, and, thus, collide.

This patch hence makes the index property mandatory. As a consequence,
the PHB's memory regions and BUID are now always configured according
to the index, and it is no longer possible to set them from the command
line.

This DOES BREAK backwards compat, but we don't think the non-index
PHB feature was used in practice (at least libvirt doesn't) and the
simplification is worth it.

Signed-off-by: Greg Kurz <groug@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2017-09-27 13:05:41 +10:00
Greg Kurz 96dbc9af35 spapr_pci: don't create 64-bit MMIO window if we don't need to
When running a pseries-2.2 or older machine type, we get the following
lines in info mtree:

address-space: memory
...
ffffffffffffffff-ffffffffffffffff (prio 0, i/o): alias
 pci@800000020000000.mmio64-alias @pci@800000020000000.mmio
  ffffffffffffffff-ffffffffffffffff

address-space: cpu-memory
...
ffffffffffffffff-ffffffffffffffff (prio 0, i/o): alias
 pci@800000020000000.mmio64-alias @pci@800000020000000.mmio
  ffffffffffffffff-ffffffffffffffff

The same thing occurs when running a pseries-2.7 with

    -global spapr-pci-host-bridge.mem_win_size=2147483648

This happens because we always create a 64-bit MMIO window, even if
we didn't explicitely requested it (ie, mem64_win_size == 0) and the
32-bit window is below 2GiB. It doesn't seem to have an impact on the
guest though because spapr_populate_pci_dt() doesn't advertise the
bogus windows when mem64_win_size == 0.

Since these memory regions don't induce any state, we can safely
choose to not create them when their address is equal to -1,
without breaking migration from existing setups.

Signed-off-by: Greg Kurz <groug@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2017-09-15 10:29:48 +10:00
Greg Kurz 1d36da769a spapr_pci: convert sprintf() to g_strdup_printf()
In order to follow a QEMU common practice.

Signed-off-by: Greg Kurz <groug@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2017-09-15 10:29:48 +10:00
Greg Kurz 9ba255365e spapr_pci: handle FDT creation errors with _FDT()
libfdt failures when creating the FDT should cause QEMU to terminate.

Let's use the _FDT() macro which does just that instead of propagating
the error to the caller. spapr_populate_pci_child_dt() no longer needs
to return a value in this case.

Note that, on the way, this get rids of the following nonsensical lines:

    g_assert(!ret);
    if (ret) {

Signed-off-by: Greg Kurz <groug@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2017-09-15 10:29:48 +10:00
Greg Kurz 99372e785e spapr_pci: use the common _FDT() helper
All other users in hw/ppc already consider an error when building
the FDT to be fatal, even on hotplug paths. There's no valid reason
for spapr_pci to behave differently. So let's used the common _FDT()
helper which terminates QEMU when libfdt fails.

Signed-off-by: Greg Kurz <groug@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2017-09-15 10:29:48 +10:00
Greg Kurz 549ce59e2b spapr_pci: use g_strdup_printf()
Building strings with g_strdup_printf() instead of snprintf() is
a QEMU common practice.

Signed-off-by: Greg Kurz <groug@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2017-09-15 10:29:48 +10:00
Greg Kurz d049bde69d spapr_pci: drop useless check in spapr_populate_pci_child_dt()
spapr_phb_get_loc_code() either returns a non-null pointer, or aborts
if g_strdup_printf() failed to allocate memory.

Signed-off-by: Greg Kurz <groug@kaod.org>
[dwg: Grammatical fix to commit message]
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2017-09-15 10:29:48 +10:00
Greg Kurz 8f68760561 spapr_pci: drop useless check in spapr_phb_vfio_get_loc_code()
g_strdup_printf() either returns a non-null pointer, or aborts if it
failed to allocate memory.

Signed-off-by: Greg Kurz <groug@kaod.org>
[dwg: Grammatical fix to commit message]
Acked-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2017-09-15 10:29:48 +10:00
Greg Kurz dba95ebbf8 spapr_pci: parent the MSI memory region to the PHB
This memory region should be owned by the PHB. This ensures the PHB
cannot be finalized as long as the the region is guest visible, or
used by a CPU or a device.

Signed-off-by: Greg Kurz <groug@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2017-09-08 09:30:54 +10:00
Greg Kurz 5c3d70e970 spapr_pci: use memory_region_add_subregion() with DMA windows
Passing a null priority to memory_region_add_subregion_overlap() is
strictly equivalent to calling memory_region_add_subregion().

Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2017-09-08 09:30:54 +10:00
Alexey Kardashevskiy 18f2330ef5 spapr_pci: Fix obsolete comment about MSIX encoding in addr/data
f1c2dc7c86 "spapr-pci: rework MSI/MSIX" (07/2013) changed MSIX encoding
but forgot to change the comment so this changes it.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2017-07-25 11:14:25 +10:00