
Re: [Xen-devel] PVH CPU hotplug design document



On 12/01/17 12:13, Roger Pau Monné wrote:
> Hello,
>
> Below is a draft of a design document for PVHv2 CPU hotplug. It should cover
> both vCPU and pCPU hotplug. It's mainly centered around the hardware domain,
> since for unprivileged PVH guests the vCPU hotplug mechanism is already
> described in Boris series [0], and it's shared with HVM.
>
> The aim here is to find a way to use ACPI vCPU hotplug for the hardware
> domain, while still being able to properly detect and notify Xen of pCPU
> hotplug.
>
> Thanks, Roger.
>
> [0] https://lists.xenproject.org/archives/html/xen-devel/2017-01/msg00060.html
>
> ---8<---
> % CPU hotplug support for PVH
> % Roger Pau Monné <roger.pau@xxxxxxxxxx>
> % Draft B
>
> # Revision History
>
> | Version | Date        | Changes                                           |
> |---------|-------------|---------------------------------------------------|
> | Draft A | 5 Jan 2017  | Initial draft.                                    |
> |---------|-------------|---------------------------------------------------|
> | Draft B | 12 Jan 2017 | Removed the XXX comments and clarify some         |
> |         |             | sections.                                         |
> |         |             |                                                   |
> |         |             | Added a sample of the SSDT ASL code that would be |
> |         |             | appended to the hardware domain.                  |
>
> # Preface
>
> This document aims to describe the interface to use in order to implement CPU
> hotplug for PVH guests. This applies to hotplug of both physical and virtual
> CPUs.
>
> # Introduction
>
> One of the design goals of PVH is to be able to remove as much Xen PV specific
> code as possible, thus limiting the number of Xen PV interfaces used by
> guests, and tending to use native interfaces (as used by bare metal) as much
> as possible. This is in line with the efforts also done by Xen on ARM and
> helps reduce the burden of maintaining huge amounts of Xen PV code inside
> guest kernels.
>
> This however presents some challenges due to the model used by the Xen
> Hypervisor, where some devices are handled by Xen while others are left for
> the hardware domain to manage. The fact that Xen lacks an AML parser also
> makes it harder, since it cannot get the full hardware description from
> dynamic ACPI tables (DSDT, SSDT) without the hardware domain's collaboration.
>
> One such issue is CPU enumeration and hotplug, for both the hardware and
> unprivileged domains. The aim is to be able to use the same enumeration and
> hotplug interface for all PVH guests, regardless of their privilege.
>
> This document aims to describe the interface used in order to fulfill the
> following actions:
>
>  * Virtual CPU (vCPU) enumeration at boot time.
>  * Hotplug of vCPUs.
>  * Hotplug of physical CPUs (pCPUs) to Xen.
>
> # Prior work
>
> ## PV CPU hotplug
>
> CPU hotplug for Xen PV guests is implemented using xenstore and hypercalls.
> The guest has to set up a watch event on the "cpu/" xenstore node, and react
> to changes in this directory. CPUs are added by creating a new node and
> setting its "availability" to online:
>
>     cpu/X/availability = "online"
>
> Where X is the vCPU ID. This is an out-of-band method, that relies on Xen
> specific interfaces in order to perform CPU hotplug.

It is also worth pointing out the shortcomings of this model, i.e. that
there is no mechanism to prevent a guest from onlining more processors if
it ignores the xenstore values.
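
To illustrate why this is purely cooperative, here is a rough Python sketch
of the guest side of the watch. The function name and the set-based
bookkeeping are made up for illustration and are not taken from any real
guest kernel. Nothing outside the guest enforces the result:

```python
import re

# Matches the per-vCPU node described in the draft: cpu/X/availability
_CPU_NODE = re.compile(r"^cpu/(\d+)/availability$")


def handle_cpu_watch(path, value, online):
    """Return the new set of online vCPU IDs after one xenstore watch event.

    `online` is the set of vCPU IDs the guest currently runs on. This is a
    hypothetical helper: a real guest would call into its CPU bring-up code.
    """
    match = _CPU_NODE.match(path)
    if match is None:
        return set(online)          # not a node we care about
    vcpu = int(match.group(1))
    if value == "online":
        return set(online) | {vcpu}
    if value == "offline":
        return set(online) - {vcpu}
    return set(online)              # unknown value: ignore
```

A guest that never registers the watch, or that skips the "offline" branch,
keeps whatever vCPUs it likes, which is exactly the shortcoming.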

>
> ## QEMU CPU hotplug using ACPI
>
> The ACPI tables provided to HVM guests contain processor objects, as created
> by libacpi. The number of processor objects in the ACPI namespace matches the
> maximum number of processors supported by HVM guests (up to 128 at the time of
> writing). Processors currently disabled are marked as such in the MADT and in
> their \_MAT and \_STA methods.
>
> A PRST operation region in I/O space is also defined, with a size of 128 bits,
> that's used as a bitmap of enabled vCPUs on the system. A PRSC method is
> provided in order to check for updates to the PRST region and trigger
> notifications on the affected processor objects. The execution of the PRSC
> method is done by a GPE event. Then OSPM checks the value returned by \_STA
> for the ACPI\_STA\_DEVICE\_PRESENT flag in order to check if the vCPU has been
> enabled.

Is it worth describing the toolstack side of hotplug?  It is equally
relevant IMO.
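
For reference, the PRST/PRSC behaviour quoted above boils down to diffing
two bitmaps. A minimal Python model follows (the function name is mine).
The notify codes are the ones the sample ASL further down uses: 1 (Device
Check) when a bit becomes set, 3 (Eject Request) when it is cleared:

```python
NOTIFY_DEVICE_CHECK = 1   # ACPI Notify value used on hot-add
NOTIFY_EJECT_REQUEST = 3  # ACPI Notify value used on hot-remove


def prsc_events(old_bitmap, new_bitmap, nr_vcpus=128):
    """Model of what PRSC does: compare the cached and current PRST bitmaps.

    Returns a list of (vcpu_id, notify_code) for every vCPU whose bit
    changed. Bitmaps are plain integers, bit N standing for vCPU N.
    """
    events = []
    changed = old_bitmap ^ new_bitmap
    for vcpu in range(nr_vcpus):
        if (changed >> vcpu) & 1:
            now_enabled = (new_bitmap >> vcpu) & 1
            code = NOTIFY_DEVICE_CHECK if now_enabled else NOTIFY_EJECT_REQUEST
            events.append((vcpu, code))
    return events
```

The toolstack side would be whatever flips the bits in the bitmap and raises
the GPE, which is the part the document does not currently describe.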

>
> ## Native CPU hotplug
>
> OSPM waits for a notification from ACPI on the processor object and when an
> event is received the return value from \_STA is checked in order to see if
> ACPI\_STA\_DEVICE\_PRESENT has been enabled. This notification is triggered
> from the method of a GPE block.
>
> # PVH CPU hotplug
>
> The aim as stated in the introduction is to use a method as similar as
> possible to bare metal CPU hotplug for PVH. This is feasible for unprivileged
> domains, since the ACPI tables can be created by the toolstack and provided to
> the guest. Then a minimal I/O or memory handler will be added to Xen in order
> to report the bitmap of enabled vCPUs. There's already a [series][0] posted to
> xen-devel that implements this functionality for unprivileged PVH guests.
>
> This however has proven to be quite difficult to implement for the hardware
> domain, since it has to manage both pCPUs and vCPUs. The hardware domain
> should be able to notify Xen of the addition of new pCPUs, so that they can be
> used by the Hypervisor, and also be able to hotplug new vCPUs for its own
> usage. Since Xen cannot access the dynamic (AML) ACPI tables, because it lacks
> an AML parser, it is the duty of the hardware domain to parse those tables and
> notify Xen of relevant events.
>
> There are several related issues here that prevent a straightforward solution
> to this issue:
>
>  * Xen cannot parse AML tables, and thus cannot get notifications from ACPI
>    events. And even in the case that Xen could parse those tables, there can
>    only be one OSPM registered with ACPI

There can indeed only be one OSPM, which is the entity that executes AML
methods and receives external interrupts from ACPI-related things.

However, dom0 being OSPM does not prohibit Xen from reading and parsing
the AML (should we choose to include that functionality in the
hypervisor).  Xen is fine to do anything it wants in terms of reading
and interpreting the tables, so long as it doesn't start executing AML
bytecode.


>  * Xen can provide a valid MADT table to the hardware domain that describes the
>    environment in which the hardware domain is running, but it cannot prevent
>    the hardware domain from seeing the real processor devices in the ACPI
>    namespace, neither Xen can provide the hardware domain with processor

", nor can Xen provide the..."

>    devices that match the vCPUs at the moment.
>
> [0]: https://lists.xenproject.org/archives/html/xen-devel/2017-01/msg00060.html
>
> ## Proposed solution using the STAO
>
> The general idea of this method is to use the STAO in order to hide the pCPUs
> from the hardware domain, and provide processor objects for vCPUs in an extra
> SSDT table.
>
> This method requires one change to the STAO, in order to be able to notify the
> hardware domain of which processors found in ACPI tables are pCPUs. The
> description of the new STAO field is as follows:
>
>  |   Field            | Byte Length | Byte Offset |     Description          |
>  |--------------------|:-----------:|:-----------:|--------------------------|
>  | Processor List [n] |      -      |      -      | A list of ACPI numbers,  |
>  |                    |             |             | where each number is the |
>  |                    |             |             | Processor UID of a       |
>  |                    |             |             | physical CPU, and should |
>  |                    |             |             | be treated specially by  |
>  |                    |             |             | the OSPM                 |
>
> The list of UIDs in this new field would be matched against the ACPI Processor
> UID field found in local/x2 APIC MADT structs and Processor objects in the
> ACPI namespace, and the OSPM should either ignore those objects, or in case it
> implements pCPU hotplug, it should notify Xen of changes to these objects.
>
> The contents of the MADT provided to the hardware domain are also going to be
> different from the contents of the MADT as found in native ACPI. The local/x2
> APIC entries for all the pCPUs are going to be marked as disabled.
>
> Extra entries are going to be added for each vCPU available to the hardware
> domain, up to the maximum number of supported vCPUs. Note that supported vCPUs
> might be different than enabled vCPUs, so it's possible that some of these
> entries are also going to be marked as disabled. The entries for vCPUs on the
> MADT are going to use a processor local x2 APIC structure, and the ACPI
> processor ID of the first vCPU is going to be UINT32_MAX - HVM_MAX_VCPUS, in
> order to avoid clashes with IDs of pCPUs.

This is slightly problematic.  There is no restriction (so far as I am
aware) on which ACPI IDs the firmware picks for its objects.  They need
not be consecutive, logical, or start from 0.

If STAO is being extended to list the IDs of the physical processor
objects, we should go one step further and explicitly list the IDs of
the virtual processor objects.  This leaves us flexibility if we have to
avoid awkward firmware ID layouts.

It is also worth stating that this puts an upper limit on nr_pcpus +
nr_dom0_vcpus (but 4 billion processors really ought to be enough for
anyone...)
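
To make the arithmetic concrete, here is a small Python sketch of the ID
scheme in the draft, assuming HVM_MAX_VCPUS is 128 (the limit quoted earlier
in the document). The helper names are mine. The clash check is the part that
matters, since the firmware is free to use any UIDs it likes:

```python
UINT32_MAX = 0xFFFFFFFF
HVM_MAX_VCPUS = 128  # assumption: the 128 vCPU limit mentioned in the draft


def vcpu_acpi_uid(vcpu_id):
    """ACPI processor UID for a vCPU under the draft's numbering scheme."""
    if not 0 <= vcpu_id < HVM_MAX_VCPUS:
        raise ValueError("vCPU ID out of range")
    return UINT32_MAX - HVM_MAX_VCPUS + vcpu_id


def clashing_uids(pcpu_uids):
    """Return the firmware pCPU UIDs that fall inside the vCPU UID range."""
    first = vcpu_acpi_uid(0)
    last = vcpu_acpi_uid(HVM_MAX_VCPUS - 1)
    return sorted(uid for uid in pcpu_uids if first <= uid <= last)
```

With these values vCPU 0 gets UID 4294967167 and vCPU 1 gets 4294967168,
matching the _UID values in the sample SSDT. A firmware that happens to
number its processors from the top of the 32-bit range would clash, which is
why listing the virtual UIDs explicitly in the STAO seems safer.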

> In order to be able to perform vCPU hotplug, the vCPUs must have an ACPI
> processor object in the ACPI namespace, so that the OSPM can request
> notifications and get the value of the \_STA and \_MAT methods. This can be
> problematic because Xen doesn't know the ACPI name of the other processor
> objects, so blindly adding new ones can create namespace clashes.
>
> This can be solved by using a different ACPI name in order to describe vCPUs
> in the ACPI namespace. Most hardware vendors tend to use CPU or PR prefixes
> for the processor objects, so using a 'VP' (i.e. Virtual Processor) prefix
> should prevent clashes.

One system I have to hand (with more than 255 pcpus) uses Cxxx.

To avoid namespace collisions, I can't see any option but to parse the
DSDT/SSDTs to at least confirm that VPxx is available to use.
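
Assuming something has already walked the DSDT/SSDTs and collected the
4-character object names found under \_SB (that walk is the hard part, and is
not shown), the availability check itself is trivial. A Python sketch, where
the fallback prefixes other than VP are purely hypothetical:

```python
def pick_vcpu_prefix(existing_names, candidates=("VP", "XV", "XP")):
    """Pick a 2-letter prefix for vCPU processor objects.

    `existing_names` holds the NameSegs already present in the target scope.
    A prefix is rejected if any existing name starts with it, which is
    stricter than needed but avoids reasoning about individual suffixes.
    Returns None when every candidate is taken.
    """
    taken = {name.upper() for name in existing_names}
    for prefix in candidates:
        if not any(name.startswith(prefix) for name in taken):
            return prefix
    return None
```

Note that two characters of prefix leave only two for the index, so the
draft's VPxx scheme also needs a rule for how the suffix is encoded.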

>
> A Xen GPE device block will be used in order to deliver events related to the
> vCPUs available to the guest, since Xen doesn't know if there are any bits
> available in the native GPEs. A SCI interrupt will be injected into the guest
> in order to trigger the event.
>
> The following snippet is a representation of the ASL SSDT code that is
> proposed for the hardware domain:
>
>     DefinitionBlock ("SSDT.aml", "SSDT", 5, "Xen", "HVM", 0)
>     {
>         Scope (\_SB)
>         {
>            OperationRegion(XEN, SystemMemory, 0xDEADBEEF, 40)
>            Field(XEN, ByteAcc, NoLock, Preserve) {
>                NCPU, 16, /* Number of vCPUs */
>                MSUA, 32, /* MADT checksum address */
>                MAPA, 32, /* MADT LAPIC0 address */
>            }
>         }
>         Scope ( \_SB ) {
>             OperationRegion ( MSUM, SystemMemory, \_SB.MSUA, 1 )
>             Field ( MSUM, ByteAcc, NoLock, Preserve ) {
>                 MSU, 8
>             }
>             Method ( PMAT, 2 ) {
>                 If ( LLess(Arg0, NCPU) ) {
>                     Return ( ToBuffer(Arg1) )
>                 }
>                 Return ( Buffer() {0, 8, 0xff, 0xff, 0, 0, 0, 0} )
>             }
>             Processor ( VP00, 0, 0x0000b010, 0x06 ) {
>                 Name ( _HID, "ACPI0007" )
>                 Name ( _UID, 4294967167 )
>                 OperationRegion ( MATR, SystemMemory, Add(\_SB.MAPA, 0), 8 )
>                 Field ( MATR, ByteAcc, NoLock, Preserve ) {
>                     MAT, 64
>                 }
>                 Field ( MATR, ByteAcc, NoLock, Preserve ) {
>                     Offset(4),
>                     FLG, 1
>                 }
>                 Method ( _MAT, 0 ) {
>                     Return ( ToBuffer(MAT) )
>                 }
>                 Method ( _STA ) {
>                     If ( FLG ) {
>                         Return ( 0xF )
>                     }
>                     Return ( 0x0 )
>                 }
>                 Method ( _EJ0, 1, NotSerialized ) {
>                     Sleep ( 0xC8 )
>                 }
>             }
>             Processor ( VP01, 1, 0x0000b010, 0x06 ) {
>                 Name ( _HID, "ACPI0007" )
>                 Name ( _UID, 4294967168 )
>                 OperationRegion ( MATR, SystemMemory, Add(\_SB.MAPA, 8), 8 )
>                 Field ( MATR, ByteAcc, NoLock, Preserve ) {
>                     MAT, 64
>                 }
>                 Field ( MATR, ByteAcc, NoLock, Preserve ) {
>                     Offset(4),
>                     FLG, 1
>                 }
>                 Method ( _MAT, 0 ) {
>                     Return ( PMAT (1, MAT) )
>                 }
>                 Method ( _STA ) {
>                     If ( LLess(1, \_SB.NCPU) ) {
>                         If ( FLG ) {
>                             Return ( 0xF )
>                         }
>                     }
>                     Return ( 0x0 )
>                 }
>                 Method ( _EJ0, 1, NotSerialized ) {
>                     Sleep ( 0xC8 )
>                 }
>             }
>             OperationRegion ( PRST, SystemIO, 0xaf00, 1 )

This also has a chance of collision, both with the system ACPI
controller, and also with PCIe devices advertising IO-BARs.  (All
graphics cards ever have IO-BARs, because Windows refuses to bind a
graphics driver to a PCI graphics device if the PCI device doesn't have
at least one IO-BAR.  Because PCIe requires 4k alignment on the upstream
bridge IO-windows, there is a surprisingly low limit on the number of
graphics cards you can put in a server and have functioning to Windows'
satisfaction.)

As with the other risks of collisions, Xen is going to have to search
the system to find a free area to use.
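
As a sketch of what that search amounts to, assuming Xen has already gathered
the occupied I/O port ranges as (start, length) pairs (from the FADT, bridge
windows, BARs and so on; gathering them is not shown), a first-fit scan in
Python would look like this. The function and parameter names are
illustrative only:

```python
def find_free_io_range(used, size, align=1, limit=0x10000):
    """First-fit search for `size` free I/O ports below `limit`.

    `used` is an iterable of (start, length) pairs, possibly overlapping.
    Returns the start of the first suitably aligned gap, or None.
    """
    def align_up(value):
        return -(-value // align) * align

    candidate = 0
    for start, length in sorted(used):
        candidate = align_up(candidate)
        if candidate + size <= start:
            return candidate              # gap before this range is enough
        candidate = max(candidate, start + length)
    candidate = align_up(candidate)
    if candidate + size <= limit:
        return candidate
    return None
```

The result is only as good as the list of occupied ranges, and it can go
stale if dom0 later reassigns BARs, so this needs some thought too.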

>             Field ( PRST, ByteAcc, NoLock, Preserve ) {
>                 PRS, 2
>             }
>             Method ( PRSC, 0 ) {
>                 Store ( ToBuffer(PRS), Local0 )
>                 Store ( DerefOf(Index(Local0, 0)), Local1 )
>                 And ( Local1, 1, Local2 )
>                 If ( LNotEqual(Local2, \_SB.VP00.FLG) ) {
>                     Store ( Local2, \_SB.VP00.FLG )
>                     If ( LEqual(Local2, 1) ) {
>                         Notify ( VP00, 1 )
>                         Subtract ( \_SB.MSU, 1, \_SB.MSU )
>                     }
>                     Else {
>                         Notify ( VP00, 3 )
>                         Add ( \_SB.MSU, 1, \_SB.MSU )
>                     }
>                 }
>                 ShiftRight ( Local1, 1, Local1 )
>                 And ( Local1, 1, Local2 )
>                 If ( LNotEqual(Local2, \_SB.VP01.FLG) ) {
>                     Store ( Local2, \_SB.VP01.FLG )
>                     If ( LEqual(Local2, 1) ) {
>                         Notify ( VP01, 1 )
>                         Subtract ( \_SB.MSU, 1, \_SB.MSU )
>                     }
>                     Else {
>                         Notify ( VP01, 3 )
>                         Add ( \_SB.MSU, 1, \_SB.MSU )
>                     }
>                 }
>                 Return ( One )
>             }
>         }
>         Device ( \_SB.GPEX ) {
>             Name ( _HID, "ACPI0006" )
>             Name ( _UID, "XENGPE" )
>             Name ( _CRS, ResourceTemplate() {
>                 IO (Decode16, 0xafe0 , 0xafe0, 0x00, 0x4)
>             } )
>             Method ( _E02 ) {
>                 \_SB.PRSC ()
>             }
>         }
>     }
>
> Since the position of the XEN data memory area is not known, the hypervisor
> will have to replace the address 0xdeadbeef with the actual memory address
> where this structure has been copied. This will involve a memory search of the
> AML code resulting from the compilation of the above ASL snippet.

This is also slightly risky.  If we need to do this, can we get iasl to
emit a relocation list for the compiled table?
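
To show where the risk lies, here is a Python sketch of the search-and-patch
approach. The function name is mine. It relies on two things I am sure of:
the checksum byte lives at offset 9 of the standard ACPI table header, and
all bytes of a valid table sum to zero modulo 256. Everything else is
illustrative:

```python
import struct

CHECKSUM_OFFSET = 9  # checksum byte within the standard ACPI table header


def patch_placeholder(table, placeholder, address):
    """Replace a 32-bit placeholder in a compiled table and fix the checksum.

    Refuses to patch unless the little-endian placeholder occurs exactly
    once, because a blind search cannot tell an address operand from any
    other byte sequence that happens to match.
    """
    needle = struct.pack("<I", placeholder)
    if table.count(needle) != 1:
        raise ValueError("placeholder must occur exactly once in the table")
    patched = bytearray(table.replace(needle, struct.pack("<I", address)))
    patched[CHECKSUM_OFFSET] = 0
    patched[CHECKSUM_OFFSET] = (-sum(patched)) & 0xFF
    return bytes(patched)
```

The exactly-once check catches duplicates, but it cannot catch a single
false match, hence the question about a proper relocation list.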

~Andrew

>
> In order to implement this, the hypervisor build is going to use part of
> libacpi and the iasl compiler.
>

