qemu-e2k/hw
Ankit Agrawal b64b7ed8bb qom: new object to associate device to NUMA node
NVIDIA GPU's support MIG (Mult-Instance GPUs) feature [1], which allows
partitioning of the GPU device resources (including device memory) into
several (upto 8) isolated instances. Each of the partitioned memory needs
a dedicated NUMA node to operate. The partitions are not fixed and they
can be created/deleted at runtime.

Unfortunately Linux OS does not provide a means to dynamically create/destroy
NUMA nodes and such feature implementation is not expected to be trivial. The
nodes that OS discovers at the boot time while parsing SRAT remains fixed. So
we utilize the Generic Initiator (GI) Affinity structures that allows
association between nodes and devices. Multiple GI structures per BDF is
possible, allowing creation of multiple nodes by exposing unique PXM in each
of these structures.

Implement the mechanism to build the GI affinity structures as Qemu currently
does not. Introduce a new acpi-generic-initiator object to allow host admin
link a device with an associated NUMA node. Qemu maintains this association
and use this object to build the requisite GI Affinity Structure.

When multiple NUMA nodes are associated with a device, it is required to
create those many number of acpi-generic-initiator objects, each representing
a unique device:node association.

Following is one of a decoded GI affinity structure in VM ACPI SRAT.
[0C8h 0200   1]                Subtable Type : 05 [Generic Initiator Affinity]
[0C9h 0201   1]                       Length : 20

[0CAh 0202   1]                    Reserved1 : 00
[0CBh 0203   1]           Device Handle Type : 01
[0CCh 0204   4]             Proximity Domain : 00000007
[0D0h 0208  16]                Device Handle : 00 00 20 00 00 00 00 00 00 00 00
00 00 00 00 00
[0E0h 0224   4]        Flags (decoded below) : 00000001
                                     Enabled : 1
[0E4h 0228   4]                    Reserved2 : 00000000

[0E8h 0232   1]                Subtable Type : 05 [Generic Initiator Affinity]
[0E9h 0233   1]                       Length : 20

An admin can provide a range of acpi-generic-initiator objects, each
associating a device (by providing the id through pci-dev argument)
to the desired NUMA node (using the node argument). Currently, only PCI
device is supported.

For the grace hopper system, create a range of 8 nodes and associate that
with the device using the acpi-generic-initiator object. While a configuration
of less than 8 nodes per device is allowed, such configuration will prevent
utilization of the feature to the fullest. The following sample creates 8
nodes per PCI device for a VM with 2 PCI devices and link them to the
respecitve PCI device using acpi-generic-initiator objects:

-numa node,nodeid=2 -numa node,nodeid=3 -numa node,nodeid=4 \
-numa node,nodeid=5 -numa node,nodeid=6 -numa node,nodeid=7 \
-numa node,nodeid=8 -numa node,nodeid=9 \
-device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.0,addr=04.0,rombar=0,id=dev0 \
-object acpi-generic-initiator,id=gi0,pci-dev=dev0,node=2 \
-object acpi-generic-initiator,id=gi1,pci-dev=dev0,node=3 \
-object acpi-generic-initiator,id=gi2,pci-dev=dev0,node=4 \
-object acpi-generic-initiator,id=gi3,pci-dev=dev0,node=5 \
-object acpi-generic-initiator,id=gi4,pci-dev=dev0,node=6 \
-object acpi-generic-initiator,id=gi5,pci-dev=dev0,node=7 \
-object acpi-generic-initiator,id=gi6,pci-dev=dev0,node=8 \
-object acpi-generic-initiator,id=gi7,pci-dev=dev0,node=9 \

-numa node,nodeid=10 -numa node,nodeid=11 -numa node,nodeid=12 \
-numa node,nodeid=13 -numa node,nodeid=14 -numa node,nodeid=15 \
-numa node,nodeid=16 -numa node,nodeid=17 \
-device vfio-pci-nohotplug,host=0009:01:01.0,bus=pcie.0,addr=05.0,rombar=0,id=dev1 \
-object acpi-generic-initiator,id=gi8,pci-dev=dev1,node=10 \
-object acpi-generic-initiator,id=gi9,pci-dev=dev1,node=11 \
-object acpi-generic-initiator,id=gi10,pci-dev=dev1,node=12 \
-object acpi-generic-initiator,id=gi11,pci-dev=dev1,node=13 \
-object acpi-generic-initiator,id=gi12,pci-dev=dev1,node=14 \
-object acpi-generic-initiator,id=gi13,pci-dev=dev1,node=15 \
-object acpi-generic-initiator,id=gi14,pci-dev=dev1,node=16 \
-object acpi-generic-initiator,id=gi15,pci-dev=dev1,node=17 \

Link: https://www.nvidia.com/en-in/technologies/multi-instance-gpu [1]
Cc: Jonathan Cameron <qemu-devel@nongnu.org>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Acked-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
Message-Id: <20240308145525.10886-2-ankita@nvidia.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-03-12 17:56:55 -04:00
..
9pfs
acpi qom: new object to associate device to NUMA node 2024-03-12 17:56:55 -04:00
adc
alpha
arm hw/xen: Rename 'ram_memory' global variable as 'xen_memory' 2024-03-09 18:51:45 +01:00
audio hw/audio/virtio-sound: return correct command response size 2024-03-12 17:56:55 -04:00
avr
block
char hw/char/xen_console: Fix missing ERRP_GUARD() for error_prepend() 2024-03-09 18:51:45 +01:00
core pcie: Support PCIe Gen5/Gen6 link speeds 2024-03-12 17:56:55 -04:00
cpu
cris
cxl hw/cxl/cxl-host: Fix missing ERRP_GUARD() in cxl_fixed_memory_window_config() 2024-03-12 17:56:55 -04:00
display hw/display/macfb: Fix missing ERRP_GUARD() in macfb_nubus_realize() 2024-03-12 17:56:55 -04:00
dma
fsi
gpio hw/gpio: Implement STM32L4x5 GPIO 2024-03-07 12:19:25 +00:00
hppa
hyperv vmbus: Print a warning when enabled without the recommended set of features 2024-03-08 14:18:56 +01:00
i2c hw/i2c: Implement Broadcom Serial Controller (BSC) 2024-03-05 13:22:55 +00:00
i386 hw/i386/pc: Inline pc_cmos_init() into pc_cmos_init_late() and remove it 2024-03-12 17:56:55 -04:00
ide
input
intc hw/intc: Check @errp to handle the error of IOAPICCommonClass.realize() 2024-03-12 17:56:55 -04:00
ipack
ipmi
isa
loongarch loongarch: Change the UEFI loading mode to loongarch 2024-02-29 19:32:45 +08:00
m68k hw/m68k/mcf5208: add support for reset 2024-03-09 19:17:01 +01:00
mem hw/mem/cxl_type3: Fix missing ERRP_GUARD() in ct3_realize() 2024-03-12 17:56:55 -04:00
microblaze
mips mips: do not list individual devices from configs/ 2024-03-08 15:51:22 +01:00
misc hw/misc/xlnx-versal-trng: Check returned bool in trng_prop_fault_event_set() 2024-03-12 17:56:55 -04:00
net hw/pci: Always call pcie_sriov_pf_reset() 2024-03-12 17:56:55 -04:00
nios2
nubus
nvme hw/pci: Always call pcie_sriov_pf_reset() 2024-03-12 17:56:55 -04:00
nvram
openrisc
pci hw/pci: Always call pcie_sriov_pf_reset() 2024-03-12 17:56:55 -04:00
pci-bridge hw/pci-bridge/cxl_upstream: Fix missing ERRP_GUARD() in cxl_usp_realize() 2024-03-12 17:56:55 -04:00
pci-host
pcmcia
ppc mac_newworld: change timebase frequency from 100MHz to 25MHz for mac99 machine 2024-03-09 19:17:01 +01:00
rdma
remote hw/remote/remote-obj: hw/misc/ivshmem: Fix missing ERRP_GUARD() for error_prepend() 2024-03-09 18:51:45 +01:00
riscv target/riscv: fix ACPI MCFG table 2024-03-08 21:00:37 +10:00
rtc hw/rtc/sun4v-rtc: Relicense to GPLv2-or-later 2024-03-07 12:54:56 +00:00
rx
s390x
scsi trivial patches for 2024-03-09 2024-03-09 20:12:05 +00:00
sd
sensor
sh4
smbios Implement SMBIOS type 9 v2.6 2024-03-12 17:56:55 -04:00
sparc
sparc64
ssi
timer
tpm hw/tpm: Remove HOST_PAGE_ALIGN from tpm_ppi_init 2024-02-29 11:35:36 -10:00
tricore
ufs
usb hw/usb/bus.c: PCAP adding 0xA in Windows version 2024-03-01 08:27:33 +01:00
vfio hw/vfio/iommufd: Fix missing ERRP_GUARD() in iommufd_cdev_getfd() 2024-03-12 17:56:55 -04:00
virtio hw/virtio: Add support for VDPA network simulation devices 2024-03-12 17:56:55 -04:00
watchdog
xen hw/xen: Extract 'xen_igd.h' from 'xen_pt.h' 2024-03-09 18:51:45 +01:00
xenpv
xtensa
Kconfig
meson.build