Re: [Xen-devel] HVMlite ABI specification DRAFT A



On 04/02/16 17:48, Roger Pau Monné wrote:
> Hello,
>
> I've Cced a bunch of people who have expressed interest in the HVMlite 
> design/implementation, both from a Xen or OS point of view. If you 
> would like to be removed, please say so and I will remove you in 
> further iterations. The same applies if you want to be added to the Cc.
>
> This is an initial draft on the HVMlite design and implementation. I've 
> mixed certain aspects of the design with the implementation, because I 
> think we are quite tied by the implementation possibilities in certain 
> aspects, so not speaking about it would make the document incomplete. I 
> might be wrong on that, so feel free to comment otherwise if you would 
> prefer a different approach. At least this should get the conversation 
> started into a couple of pending items regarding HVMlite. I don't want 
> to spoil the fun, but IMHO they are:
>
>  - Local APIC: should we _always_ provide a local APIC to HVMlite 
>    guests?

I think it would be best to offer an LAPIC by default (to be helpful to
most modern OSes), but leave the option for an administrator to disable
it if they specifically don't want one.

>  - HVMlite hardware domain: can we get rid of the PHYSDEV ops and PIRQ 
>    event channels?
>  - HVMlite PCI-passthrough: can we get rid of pciback/pcifront?

+1000, for both.

>
> The document is still far from complete, and I've only tried to 
> represent the points where there's consensus (like the boot ABI) or 
> parts where feedback is needed in order to reach a consensus (like the 
> items pointed above). I'm of course not as knowledgeable as some people 
> on the Cc, so please correct me if you think there are mistakes or 
> simply impossible goals.
>
> Roger.
> ---
>
> Xen HVMlite ABI
> ===============

Any chance this can end up living in docs/specs/HVMLite-ABI.$FOO,
alongside the existing formal specs?

Would it also be possible to write a feature document in
docs/features/HVMLite.$FOO ?

>
> Boot ABI
> --------
>
> Since the Xen entry point into the kernel can be different from the
> native entry point, a `ELFNOTE` is used in order to tell the domain
> builder how to load and jump into the kernel entry point:
>
>     ELFNOTE(Xen, XEN_ELFNOTE_PHYS32_ENTRY,          .long,  xen_start32)
>
> The presence of the `XEN_ELFNOTE_PHYS32_ENTRY` note indicates that the
> kernel supports the boot ABI described in this document.
>
> The domain builder must load the kernel into the guest memory space and
> jump into the entry point defined at `XEN_ELFNOTE_PHYS32_ENTRY` with the
> following machine state:

Given multiple possible entries, the domain builder might have multiple
starting options available.

I would reword this to "When starting an HVMLite domain, the domain
builder shall load ...", which allows the domain builder to choose an
alternative entry method, at its discretion.

>
>  * `ebx`: contains the physical memory address where the loader has placed
>    the boot start info structure.
>
>  * `cr0`: bit 0 (PE) must be set. All the other writeable bits are cleared.
>
>  * `cr4`: all bits are cleared.
>
>  * `cs`: must be a 32-bit read/execute code segment with a base of '0'
>    and a limit of '0xFFFFFFFF'. The selector value is unspecified.
>
>  * `ds`, `es`: must be a 32-bit read/write data segment with a base of
>    '0' and a limit of '0xFFFFFFFF'. The selector values are all unspecified.
>
>  * `tr`: must be a 32-bit TSS (active) with a base of '0' and a limit
>    of '0x67'.
>
>  * `eflags`: bit 17 (VM) must be cleared. Bit 9 (IF) must be cleared.
>    Bit 8 (TF) must be cleared. Other bits are all unspecified.

I would also specify that the direction flag shall be clear, to prevent
all kernels needing to `cld` on entry.

>
> All other processor registers and flag bits are unspecified. The OS is in
> charge of setting up its own stack, GDT and IDT.
>
> The format of the boot start info structure is the following (pointed to
> by %ebx):
>
>     struct hvm_start_info {
>     #define HVM_START_MAGIC_VALUE 0x336ec578
>         uint32_t magic;             /* Contains the magic value 0x336ec578      */
>                                     /* ("xEn3" with the 0x80 bit of the "E" set).*/
>         uint32_t flags;             /* SIF_xxx flags.                           */
>         uint32_t cmdline_paddr;     /* Physical address of the command line.    */
>         uint32_t nr_modules;        /* Number of modules passed to the kernel.  */
>         uint32_t modlist_paddr;     /* Physical address of an array of          */
>                                     /* hvm_modlist_entry.                       */
>     };

For both paddr values, zero indicates "not provided".

>
>     struct hvm_modlist_entry {
>         uint32_t paddr;             /* Physical address of the module.          */
>         uint32_t size;              /* Size of the module in bytes.             */
>     };
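
As a quick sanity check of the layout, here is a minimal sketch (not
normative, and the helpers are hypothetical) of how an early boot path
might consume the structure, assuming the paddrs are directly
dereferenceable that early in boot:

    /* Minimal sketch of consuming hvm_start_info at entry.
     * early_printk() is a hypothetical placeholder. */
    #include <stdint.h>

    extern void early_printk(const char *fmt, ...);

    void consume_start_info(uint32_t ebx /* %ebx at entry */)
    {
        const struct hvm_start_info *si = (void *)(uintptr_t)ebx;

        if ( si->magic != HVM_START_MAGIC_VALUE )
            return;                          /* not booted via this ABI */

        if ( si->cmdline_paddr )             /* 0 == not provided */
            early_printk("cmdline: %s\n",
                         (const char *)(uintptr_t)si->cmdline_paddr);

        if ( si->nr_modules && si->modlist_paddr )
        {
            const struct hvm_modlist_entry *mod =
                (void *)(uintptr_t)si->modlist_paddr;
            uint32_t i;

            for ( i = 0; i < si->nr_modules; i++ )
                early_printk("module %u at %#x, %u bytes\n",
                             i, mod[i].paddr, mod[i].size);
        }
    }
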
>
> Other relevant information needed in order to boot a guest kernel
> (console page address, xenstore event channel...) can be obtained
> using HVMPARAMS, just like it's done on HVM guests.
>
> The setup of the hypercall page is also performed in the same way
> as HVM guests, using the hypervisor cpuid leaves and msr ranges.
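
For completeness, a rough sketch of that sequence: find the Xen cpuid
leaves, install the hypercall page via the advertised MSR, then query
HVM params.  cpuid()/wrmsr()/setup_console() are assumed helpers, the
header paths depend on how the OS imports Xen's public headers, and
HYPERVISOR_hvm_op() is the usual wrapper through the hypercall page:

    #include <stdint.h>
    #include <xen/hvm/hvm_op.h>     /* HVMOP_get_param, struct xen_hvm_param */
    #include <xen/hvm/params.h>     /* HVM_PARAM_CONSOLE_EVTCHN, ...         */

    extern uint8_t hypercall_page[4096];    /* page reserved by the kernel */

    void xen_setup(void)
    {
        uint32_t base, eax, ebx, ecx, edx;

        /* Scan for the Xen leaves: signature "XenVMMXenVMM" in ebx/ecx/edx. */
        for ( base = 0x40000000; base < 0x40010000; base += 0x100 )
        {
            cpuid(base, &eax, &ebx, &ecx, &edx);
            if ( ebx == 0x566e6558 && ecx == 0x65584d4d && edx == 0x4d4d566e )
                break;
        }

        /* Leaf base+2: ebx contains the MSR used to install the hypercall page. */
        cpuid(base + 2, &eax, &ebx, &ecx, &edx);
        wrmsr(ebx, (uint64_t)(uintptr_t)hypercall_page);

        /* With hypercalls available, fetch e.g. the console event channel. */
        {
            struct xen_hvm_param p = {
                .domid = DOMID_SELF,
                .index = HVM_PARAM_CONSOLE_EVTCHN,
            };

            if ( HYPERVISOR_hvm_op(HVMOP_get_param, &p) == 0 )
                setup_console(p.value);     /* hypothetical */
        }
    }
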
>
> Hardware description
> --------------------
>
> Hardware description can come from two different sources, just like on (PV)HVM
> guests.
>
> Description of PV devices will always come from xenbus, and in fact
> xenbus is the only hardware description that is guaranteed to always be
> provided to HVMlite guests.
>
> Description of physical hardware devices will always come from ACPI, in the
> absence of any physical hardware device no ACPI tables will be provided. The
> presence of ACPI tables can be detected by finding the RSDP, just like on
> bare metal.
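
For reference, the usual bare metal RSDP scan works unmodified: look for
the "RSD PTR " signature on 16-byte boundaries in the EBDA and in
0xE0000-0xFFFFF.  A minimal sketch, with phys_to_virt() as a hypothetical
mapping helper:

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    extern void *phys_to_virt(uint32_t paddr);

    static const uint8_t *scan_range(uint32_t start, uint32_t end)
    {
        uint32_t p;

        for ( p = start; p < end; p += 16 )
        {
            const uint8_t *c = phys_to_virt(p);
            uint8_t sum = 0;
            unsigned int i;

            if ( memcmp(c, "RSD PTR ", 8) )
                continue;
            for ( i = 0; i < 20; i++ )      /* checksum of the ACPI 1.0 part */
                sum += c[i];
            if ( sum == 0 )
                return c;
        }

        return NULL;
    }

    const uint8_t *find_rsdp(void)
    {
        /* The EBDA segment is stored at physical address 0x40e. */
        uint32_t ebda = (uint32_t)*(uint16_t *)phys_to_virt(0x40e) << 4;
        const uint8_t *rsdp = ebda ? scan_range(ebda, ebda + 1024) : NULL;

        return rsdp ? rsdp : scan_range(0xe0000, 0x100000);
    }
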
>
> Non-PV devices exposed to the guest
> -----------------------------------
>
> The initial idea was to simply not provide any emulated devices to a HVMlite
> guest as the default option. We have however identified certain situations
> where emulated devices could be interesting, both from a performance and
> easy implementation point of view. The following list tries to encompass
> the different identified scenarios:
>
>  * 1. HVMlite with no emulated devices at all
>    ------------------------------------------
>    This is the current implementation inside of Xen, everything is disabled
>    by default and the guest has access to the PV devices only. This is of
>    course the most secure design because it has the smallest attack surface.
>
>  * 2. HVMlite with PCI-passthrough
>    -------------------------------
>    The current model of PCI-passthrough in PV guests is complex and requires
>    heavy modifications to the guest OS. Going forward we would like to remove
>    this limitation, by providing an interface that's the same as found on bare
>    metal. In order to do this, at least an emulated local APIC and IO APIC
>    should be provided to guests, together with access to a PCI-Root complex.
>    As said in the 'Hardware description' section above, this will also require
>    ACPI. So this proposed scenario will require the following elements that are
>    not present in the minimal (or default) HVMlite implementation: ACPI, local
>    APIC, IO APIC and PCI-Root complex.

The IOAPIC is only required when doing passthrough of non-VF devices. 
If the passthrough usecase is restricted to SRIOV VFs only, the IOAPIC
can be omitted, as the SRIOV spec forbids the use of legacy line
interrupts for VFs.  Again with security in mind, it should be possible
for an admin to specify this configuration if they really wish to reduce
the emulated attack surface in Xen.

Independently of the HVMLite angle, having a minimal host bridge in Xen
solves a lot of our current architectural problems with existing PCI
Passthrough, and in particular allows for device model disaggregation,
which will also be of interest for the plain HVM case.

>
>  * 3. HVMlite hardware domain
>    --------------------------
>    The aim is that a HVMlite hardware domain is going to work exactly like a
>    HVMlite domain with passed-through devices. This means that the domain will
>    need access to the same set of emulated devices, and that some ACPI tables
>    must be fixed in order to reflect the reality of the container the hardware
>    domain is running on. The ACPI section contains more detailed information
>    about which/how these tables are going to be fixed.
>
>    Note that in this scenario the hardware domain will *always* have a local
>    APIC and IO APIC, and that the usage of PHYSDEV operations and PIRQ event
>    channels is going to be removed in favour of the bare metal mechanisms.

We do need to cater for at least the RTC for the hardware domain.  This
can be done by not using the FADT "reduced" flag and actually wiring up
the legacy IO ports, which ought to be sufficient.

>
> There have been some opinions that the current model (1) should be replaced
> with (2) without any passed-through devices, so that at least a local APIC is
> provided. Should then a RSDT, FADT and MADT be provided? We would then be
> able to switch the CPU enumeration to the one used on bare metal (ie: using
> the data in the MADT).
>
> ACPI
> ----
>
> ACPI tables will be provided to the hardware domain or to unprivileged
> domains that have passed-through PCI devices. In the case of unprivileged
> guests ACPI tables are going to be created by the toolstack and will only
> contain the set of devices available to the guest, which will at least be
> the following: local APIC, IO APIC, the passed-through device. In order to
> provide this information from ACPI the following tables are needed as a
> minimum: RSDT, FADT, MADT and DSDT.
>
> In the case of the hardware domain, Xen has traditionally passed through the
> native ACPI tables to the guest. This is something that of course we still
> want to do, but in the case of HVMlite Xen will have to make sure that
> the data passed in the ACPI tables to the hardware domain contain the accurate
> hardware description. This means that at least certain tables will have to
> be modified/mangled before being presented to the guest:
>
>  * MADT: the number of local APIC entries needs to be fixed to match the
>          number of vCPUs available to the guest. The address of the IO APIC(s)
>          also needs to be fixed in order to match the emulated ones that we are
>          going to provide.
>
>  * DSDT: certain devices reported in the DSDT may not be available to the
>          guest, but since the DSDT is a run-time generated table we cannot fix
>          it. In order to cope with this, a STAO table will be provided that
>          should be able to signal which devices are not available to the
>          hardware domain. This is in line with the Xen/ACPI implementation for
>          ARM.
>
>  * MPST, PMTT, SBTT and SRAT: won't be initially presented to the guest, until
>                               we get our act together on the vNUMA stuff.

and SLIT.
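
Regarding the MADT fixup above: the entries that need rewriting are just
the type 0 (local APIC) and type 1 (IO APIC) subtables.  A rough sketch
of walking them, with the struct layouts taken from the ACPI spec
(counting enabled local APICs here; a real fixup would rewrite or drop
entries to match the vCPU count and patch the IO APIC address):

    #include <stdint.h>

    struct acpi_sdt_header {                /* common 36-byte ACPI table header */
        char     signature[4];
        uint32_t length;
        uint8_t  revision, checksum;
        char     oem_id[6], oem_table_id[8];
        uint32_t oem_revision, creator_id, creator_revision;
    } __attribute__((packed));

    struct acpi_madt {
        struct acpi_sdt_header h;
        uint32_t lapic_addr;
        uint32_t flags;
        uint8_t  entries[];                 /* variable-length subtables */
    } __attribute__((packed));

    struct madt_entry {
        uint8_t type;                       /* 0 = local APIC, 1 = IO APIC, ... */
        uint8_t length;
        uint8_t data[];
    } __attribute__((packed));

    unsigned int madt_count_lapics(const struct acpi_madt *madt)
    {
        uint32_t off = sizeof(*madt);
        unsigned int count = 0;

        while ( off + sizeof(struct madt_entry) <= madt->h.length )
        {
            const struct madt_entry *e =
                (const void *)((const uint8_t *)madt + off);

            if ( e->length < sizeof(*e) )
                break;                      /* malformed table */

            /* Local APIC entry: data[0] = ACPI ID, data[1] = APIC ID,
             * data[2..5] = flags (bit 0 = enabled). */
            if ( e->type == 0 && (e->data[2] & 1) )
                count++;

            off += e->length;
        }

        return count;
    }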

>
> NB: there are corner cases that I'm not sure how to solve properly. Currently
> the hardware domain has some 'hacks' regarding ACPI and Xen. At least I'm
> aware of the following:
>
>  * 1. Reporting CPU PM info back to Xen: this comes from the DSDT table, and
>    since this table is only available to the hardware domain it has to report
>    the PM info back to Xen so that Xen can perform proper PM.
>  * 2. Doing proper shutdown (S5) requires the usage of a hypercall, which is
>    mixed with native ACPICA code in most OSes. This is awkward and requires
>    the usage of hooks into ACPICA which we have not yet managed to upstream.
>  * 3. Reporting the PCI devices it finds to the hypervisor: this is not very
>    intrusive in general, so I'm not that pushed to remove it. It's generally
>    easy in any OS to add some kind of hook that's executed every time a PCI
>    device is discovered.
>  * 4. Report PCI memory-mapped configuration areas to Xen: my opinion
>    regarding this one is the same as (3), it's not really intrusive so I'm not
>    very pushed to remove it.
>
> I would ideally like to get rid of (2) in the list above, since I'm quite sure
> we are never going to be able to merge the needed hooks into ACPICA. AFAICT
> Xen should be able to parse the FADT table and find the address of the PM1a
> and PM1b control registers and trap on access.

Doing this would require more of (1), as the exact values written to the
PM1a and PM1b control registers are specified in the DSDT, iirc.
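
That said, locating the registers to trap is straightforward: the (ACPI
1.0) control block addresses sit at fixed offsets in the FADT.  A minimal
sketch, with trap_ioport() as a hypothetical stand-in for whatever Xen
would actually use (a complete version would prefer the X_PM1*_CNT_BLK
GAS fields when present):

    #include <stdint.h>
    #include <string.h>

    #define FADT_PM1A_CNT_BLK 64            /* byte offset of PM1a_CNT_BLK */
    #define FADT_PM1B_CNT_BLK 68            /* byte offset of PM1b_CNT_BLK */

    extern void trap_ioport(uint32_t port, unsigned int bytes);

    static uint32_t fadt_read32(const uint8_t *fadt, unsigned int off)
    {
        uint32_t v;

        memcpy(&v, fadt + off, sizeof(v));
        return v;
    }

    void trap_pm1_control(const uint8_t *fadt)
    {
        uint32_t pm1a = fadt_read32(fadt, FADT_PM1A_CNT_BLK);
        uint32_t pm1b = fadt_read32(fadt, FADT_PM1B_CNT_BLK);

        if ( pm1a )
            trap_ioport(pm1a, 2);           /* PM1 control registers are 16 bit */
        if ( pm1b )
            trap_ioport(pm1b, 2);
    }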

>
> (1) is also quite nasty, but I don't see any possible way to get rid of it.

Sadly not.

>
> AP startup
> ----------
>
> AP startup is performed using hypercalls. The following VCPU operations
> are used in order to bring up secondary vCPUs:
>
>  * VCPUOP_initialise is used to set the initial state of the vCPU. The
>    argument passed to the hypercall must be of the type vcpu_hvm_context.
>    See public/hvm/hvm_vcpu.h for the layout of the structure. Note that
>    this hypercall allows starting the vCPU in several modes (16/32/64bits),
>    regardless of the mode the BSP is currently running on.
>
>  * VCPUOP_up is used to launch the vCPU once the initial state has been
>    set using VCPUOP_initialise.
>
>  * VCPUOP_down is used to bring down a vCPU.
>
>  * VCPUOP_is_up is used to scan the number of available vCPUs.
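
Roughly, from the BSP's point of view the sequence looks like the sketch
below.  The field names follow public/hvm/hvm_vcpu.h from memory, so
treat the details as indicative only; HYPERVISOR_vcpu_op() is the usual
three-argument wrapper and ap_entry_32 is the kernel's own AP entry
point:

    #include <stdint.h>
    #include <string.h>
    #include <xen/vcpu.h>               /* VCPUOP_initialise, VCPUOP_up */
    #include <xen/hvm/hvm_vcpu.h>       /* struct vcpu_hvm_context      */

    extern void ap_entry_32(void);

    int bring_up_ap(int vcpu)
    {
        struct vcpu_hvm_context ctx;
        int rc;

        memset(&ctx, 0, sizeof(ctx));
        ctx.mode = VCPU_HVM_MODE_32B;       /* start this AP in 32-bit mode */
        ctx.cpu_regs.x86_32.eip = (uint32_t)(uintptr_t)ap_entry_32;
        ctx.cpu_regs.x86_32.cr0 = 0x1;      /* PE set, everything else clear */
        ctx.cpu_regs.x86_32.cs_limit = 0xffffffff;
        ctx.cpu_regs.x86_32.ds_limit = 0xffffffff;
        /* ... remaining segment/attribute state as required ... */

        rc = HYPERVISOR_vcpu_op(VCPUOP_initialise, vcpu, &ctx);
        if ( rc == 0 )
            rc = HYPERVISOR_vcpu_op(VCPUOP_up, vcpu, NULL);

        return rc;
    }
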
>
> Additionally, if a local APIC is available CPU bringup can also be performed
> using the hardware native AP startup sequence (IPIs). In this case the
> hypercall interface will still be provided, as a faster and more convenient
> way of starting APs.

+1

~Andrew
