
Re: [Xen-devel] [Draft F] Xen on ARM vITS Handling



On Thu, Jun 11, 2015 at 3:10 PM, Ian Campbell <ian.campbell@xxxxxxxxxx> wrote:
> Draft F follows. Also at:
> http://xenbits.xen.org/people/ianc/vits/draftF.{pdf,html}
>
> Here's a quick update based on feedback prior to meeting on #xenarm at
> 12:00AM BST / 7:00AM EDT / 4:30PM IST (which is ~1:20 from now)
>
> Ian.
>
> % Xen on ARM vITS Handling
> % Ian Campbell <ian.campbell@xxxxxxxxxx>
> % Draft F
>
> # Changelog
>
> ## Since Draft E
>
> * Discussion of `struct pending_irq`
> * Fix handling of enable/disable, requiring switching back to trapping
>   the virtual cfg table again. get_vlpi_cfg is no longer needed.
> * Fix p2m_lookup to also use get_page_from_gfn.
>
> ## Since Draft D
>
> * Fixed assumptions about vLPI->pLPI mapping, which is not
>   possible. This led to changes to the model for enabling and
>   disabling pLPI and vLPI and the handling of the virtual LPI
>   configuration table, resolving _Unresolved Issue 1_.
> * Made the pLPI and vLPI interrupt priorities explicit.
> * Attempted to clarify the trust issues regarding in-guest data
>   structures.
> * Mandate a particular cacheability for tables in guest memory.
>
> ## Since Draft C
>
> * _Major_ rework, in an attempt to simplify everything into something
>   more likely to be achievable for 4.6.
>     * Made some simplifying assumptions.
>     * Reduced the scope of some support.
>     * Command emulation is now mostly trivial.
>     * Expanded detail on host setup, allowing other assumptions to be
>       made during emulation.
> * Many other things lost in the noise of the above.
>
> ## Since Draft B
>
> * Details of command translation (thanks to Julien and Vijay)
> * Added background on LPI Translation and Pending tables
> * Added background on Collections
> * Settled on `N:N` scheme for vITS:pITS mapping.
> * Rejigged section nesting a bit.
> * Since we now think translation should be cheap, settle on
>   translation at scheduling time.
> * Lazy `INVALL` and `SYNC`
>
> ## Since Draft A
>
> * Added discussion of when/where command translation occurs.
> * Contention on scheduler lock, suggestion to use SOFTIRQ.
> * Handling of domain shutdown.
> * More detailed discussion of multiple vs single vits pros/cons.
>
> # Introduction
>
> ARM systems containing a GIC version 3 or later may contain one or
> more ITS logical blocks. An ITS is used to route Message Signalled
> interrupts from devices into an LPI injection on the processor.
>
> The following summarises the ITS hardware design and serves as a set
> of assumptions for the vITS software design. For full details of the
> ITS see the "GIC Architecture Specification".
>
> ## Locality-specific Peripheral Interrupts (`LPI`)
>
> This is a new class of message signalled interrupts introduced in
> GICv3. They occupy the interrupt ID space from `8192..(2^32)-1`.
>
> The number of LPIs supported by an ITS is exposed via
> `GITS_TYPER.IDbits` (as number of bits - 1); it may be up to
> 2^32. _Note_: This field also contains the number of Event IDs
> supported by the ITS.
>
> ### LPI Configuration Table
>
> Each LPI has an associated configuration byte in the LPI Configuration
> Table (managed via the GIC Redistributor and placed at
> `GICR_PROPBASER` or `GICR_VPROPBASER`). This byte configures:
>
> * The LPI's priority;
> * Whether the LPI is enabled or disabled.
>
> Software updates the Configuration Table directly but must then issue
> an invalidate command (per-device `INV` ITS command, global `INVALL`
> ITS command or write `GICR_INVLPIR`) for the effect to be guaranteed
> to become visible (possibly requiring an ITS `SYNC` command to ensure
> completion of the `INV` or `INVALL`). Note that it is valid for an
> implementation to reread the configuration table at any time (IOW it
> is _not_ guaranteed that a change to the LPI Configuration Table won't
> be visible until an invalidate is issued).
>
> ### LPI Pending Table
>
> Each LPI also has an associated bit in the LPI Pending Table (managed
> by the GIC redistributor). This bit signals whether the LPI is pending
> or not.
>
> This region may contain out of date information and the mechanism to
> synchronise is `IMPLEMENTATION DEFINED`.
>
> ## Interrupt Translation Service (`ITS`)
>
> ### Device Identifiers
>
> Each device using the ITS is associated with a unique "Device
> Identifier".
>
> The device IDs are properties of the implementation and are typically
> described via system firmware, e.g. the ACPI IORT table or via device
> tree.
>
> The number of device ids in a system depends on the implementation and
> can be discovered via `GITS_TYPER.Devbits`. This field allows an ITS
> to have up to 2^32 devices.
>
> ### Events
>
> Each device can generate "Events" (called `ID` in the spec). These
> correspond to possible interrupt sources in the device (e.g. MSI
> offset).
>
> The maximum number of interrupt sources is device specific. It is
> usually discovered either from firmware tables (e.g. DT or ACPI) or
> from bus specific mechanisms (e.g. PCI config space).
>
> The maximum number of event IDs supported by an ITS is exposed via
> `GITS_TYPER.IDbits` (as number of bits - 1); it may be up to
> 2^32. _Note_: This field also contains the number of `LPIs` supported
> by the ITS.
>
> ### Interrupt Collections
>
> Each interrupt is a member of an "Interrupt Collection". This allows
> software to manage large numbers of physical interrupts with a small
> number of commands rather than issuing one command per interrupt.
>
> On a system with N processors, the ITS must provide at least N+1
> collections.
>
> An ITS may support some number of internal collections (indicated by
> `GITS_TYPER.HCC`) and external ones which require memory provisioned
> by the Operating System via a `GITS_BASERn` register.
>
> ### Target Addresses
>
> The Target Address corresponds to a specific GIC re-distributor. The
> format of this field depends on the value of the `GITS_TYPER.PTA` bit:
>
> * 1: the base address of the re-distributor target is used
> * 0: a unique processor number is used. The mapping between the
>   processor affinity value (`MPIDR`) and the processor number is
>   discoverable via `GICR_TYPER.ProcessorNumber`.
>
> This value is up to the ITS implementer (`GITS_TYPER` is a read-only
> register).
>
> ### Device Table
>
> A Device Table is configured in each ITS which maps incoming device
> identifiers into an ITS Interrupt Translation Table.
>
> ### Interrupt Translation Table (`ITT`) and Collection Table
>
> An `Event` generated by a `Device` is translated into an `LPI` via a
> per-Device Interrupt Translation Table. The structure of this table is
> described in GIC Spec 4.9.12.
>
> The ITS translation table maps an event ID from the originating device
> into a physical interrupt (`LPI`) and an Interrupt Collection.
>
> The Collection is in turn looked up in the Collection Table to produce
> a Target Address, indicating a redistributor (AKA CPU) to which the
> LPI is delivered.
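
As a rough illustration of the lookup chain just described (Device to ITT, Event to LPI plus Collection, Collection to Target Address), here is a small C model. It is only a sketch: the real tables have hardware-defined layouts, and every name below (`its_sketch`, `its_translate`, etc.) is made up for this example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory model; real ITS tables are hardware defined. */
struct itt_entry_sketch {
    bool valid;
    uint32_t lpi;          /* LPI produced for this event */
    uint16_t collection;   /* index into the collection table */
};

struct its_dev_sketch {
    const struct itt_entry_sketch *itt;  /* per-device ITT */
    size_t nr_events;
};

struct its_sketch {
    const struct its_dev_sketch *devices;  /* device table */
    size_t nr_devices;
    const uint32_t *collection_target;     /* collection table */
    size_t nr_collections;
};

/* (device, event) -> (lpi, target); returns false if any lookup fails. */
static bool its_translate(const struct its_sketch *its, uint32_t device,
                          uint32_t event, uint32_t *lpi, uint32_t *target)
{
    if (device >= its->nr_devices)
        return false;

    const struct its_dev_sketch *dev = &its->devices[device];
    if (dev->itt == NULL || event >= dev->nr_events)
        return false;

    const struct itt_entry_sketch *e = &dev->itt[event];
    if (!e->valid || e->collection >= its->nr_collections)
        return false;

    *lpi = e->lpi;
    *target = its->collection_target[e->collection];
    return true;
}
```

The point is just that the device ID selects the ITT, while the event ID selects the entry within it.
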
>
> ### OS Provisioned Memory Regions
>
> The ITS hardware design provides mechanisms for an ITS to be provided
> with various blocks of memory by the OS for ITS internal use. These
> include the per-device ITT (established with `MAPD`) and memory
> regions for Device Tables, Virtual Processors and Interrupt
> Collections. Up to 8 such regions can be requested by the ITS and
> provisioned by the OS via the `GITS_BASERn` registers.
>
> ### ITS Configuration
>
> The ITS is configured and managed, including establishing and
> configuring the Translation Tables and Collection Table, via an in
> memory ring shared between the CPU and the ITS controller. The ring is
> managed via the `GITS_CBASER` register and indexed by `GITS_CWRITER`
> and `GITS_CREADR` registers.
>
> A processor adds commands to the shared ring and then updates
> `GITS_CWRITER` to make them visible to the ITS controller.
>
> The ITS controller processes commands from the ring and then updates
> `GITS_CREADR` to indicate to the processor that the command has been
> processed.
>
> Commands are processed sequentially.
>
> Commands sent on the ring include operational commands:
>
> * Routing interrupts to processors;
> * Generating interrupts;
> * Clearing the pending state of interrupts;
> * Synchronising the command queue
>
> and maintenance commands:
>
> * Map device/collection/processor;
> * Map virtual interrupt;
> * Clean interrupts;
> * Discard interrupts;
>
> The field `GITS_CBASER.Size` encodes the number of 4KB pages, minus 1,
> making up the command queue. This field is 8 bits which means the
> maximum size is 2^8 * 4KB = 1MB. Given that each command is 32 bytes,
> there is a maximum of 32768 commands in the queue.
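
The arithmetic above can be written down as a short C sketch. It assumes the field holds the page count minus one (which is what makes 2^8 pages reachable from an 8-bit field); the helper names are hypothetical.

```c
#include <stdint.h>

#define ITS_CMDQ_PAGE_SIZE 4096u  /* 4KB pages */
#define ITS_CMD_SIZE       32u    /* bytes per ITS command */

/* Queue size in bytes for a given GITS_CBASER.Size value. */
static inline uint32_t its_cmdq_bytes(uint8_t size_field)
{
    /* Assumption: the field encodes (number of 4KB pages - 1). */
    return ((uint32_t)size_field + 1u) * ITS_CMDQ_PAGE_SIZE;
}

/* Number of 32-byte command slots in a queue of that size. */
static inline uint32_t its_cmdq_slots(uint8_t size_field)
{
    return its_cmdq_bytes(size_field) / ITS_CMD_SIZE;
}
```

With the maximum field value 0xff this gives 1MB and 32768 slots, matching the numbers quoted.
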
>
> The ITS provides no specific completion notification
> mechanism. Completion is monitored by a combination of a `SYNC`
> command and either polling `GITS_CREADR` or notification via an
> interrupt generated via the `INT` command.
>
> Note that the interrupt generation via `INT` requires an originating
> device ID to be supplied (which is then translated via the ITS into an
> LPI). No specific device ID is defined for this purpose and so the OS
> software is expected to fabricate one.
>
> Possible ways of inventing such a device ID are:
>
> * Enumerate all device ids in the system and pick another one;
> * Use a PCI BDF associated with a non-existent device function (such
>   as an unused one relating to the PCI root-bridge) and translate that
>   (via firmware tables) into a suitable device id;
> * ???
>
> # LPI Handling in Xen
>
> ## IRQ descriptors
>
> Currently all SGI/PPI/SPI interrupts are covered by a single static
> array of `struct irq_desc` with ~1024 entries (the maximum interrupt
> number in that set of interrupt types).
>
> The addition of LPIs in GICv3 means that the largest potential
> interrupt specifier is much larger.
>
> Therefore a second dynamically allocated array will be added to cover
> the range `8192..nr_lpis`. The `irq_to_desc` function will determine
> which array to use (static `0..1024` or dynamic `8192..end` lpi desc
> array) based on the input irq number. Two arrays are used to avoid a
> wasteful allocation covering the (unused/unusable) `1024..8191` range.
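
A minimal sketch of the two-array lookup, to make the split concrete. This is not the Xen code: `irq_desc_sketch`, `lpi_descs` and `nr_lpi_descs` are stand-ins invented here, and the dynamic array is assumed to be allocated elsewhere at boot.

```c
#include <stddef.h>

#define NR_STATIC_IRQS 1024u   /* SGI/PPI/SPI, covered by the static array */
#define FIRST_LPI      8192u   /* first interrupt ID in the LPI range */

struct irq_desc_sketch {       /* stand-in for Xen's struct irq_desc */
    unsigned int irq;
};

static struct irq_desc_sketch static_descs[NR_STATIC_IRQS];
static struct irq_desc_sketch *lpi_descs;  /* dynamically allocated */
static unsigned int nr_lpi_descs;          /* entries in lpi_descs */

static struct irq_desc_sketch *irq_to_desc_sketch(unsigned int irq)
{
    if (irq < NR_STATIC_IRQS)
        return &static_descs[irq];

    if (lpi_descs != NULL && irq >= FIRST_LPI &&
        irq - FIRST_LPI < nr_lpi_descs)
        return &lpi_descs[irq - FIRST_LPI];

    /* Inside the 1024..8191 hole, or beyond the allocated LPIs. */
    return NULL;
}
```

Returning NULL for the hole means callers must be prepared for a failed lookup, which the current single-array code presumably never has to handle.
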
>
> ## Virtual LPI interrupt injection
>
> A physical interrupt which is routed to a guest vCPU has the
> `_IRQ_GUEST` flag set in the `irq_desc` status mask. Such interrupts
> have an associated instance of `struct irq_guest` which contains the
> target `struct domain` pointer and virtual interrupt number.
>
> In Xen a virtual interrupt (either arising from a physical interrupt
> or completely virtual) is ultimately injected to a VCPU using the
> `vgic_vcpu_inject_irq` function, or `vgic_vcpu_inject_lpi`.
>
> This mechanism will likely need updating to handle the injection of
> virtual LPIs. In particular rather than `GICD_ITARGETSRn` or
> `GICD_IROUTERn` routing of LPIs is performed via the ITS collections
> mechanism. This is discussed below (in _vITS_:_Virtual LPI injection_).
>
> # Scope
>
> The ITS is rather complicated, especially when combined with
> virtualisation. To simplify things we initially omit the following
> functionality:
>
> - Interrupt -> vCPU -> pCPU affinity. The management of physical vs
>   virtual Collections is a feature of GICv4, thus is omitted in this
>   design for GICv3. Physical interrupts which occur on a pCPU where
>   the target vCPU is not already resident will be forwarded (via IPI)
>   to the correct pCPU for injection via the existing
>   `vgic_vcpu_inject_irq` mechanism (extended to handle LPI injection
>   correctly).
> - Clearing of the pending state of an LPI under various circumstances
>   (`MOVI`, `DISCARD`, `CLEAR` commands) is not done. This will result
>   in guests seeing some perhaps spurious interrupts.
> - vITS functionality will only be available on 64-bit ARM hosts,
>   avoiding the need to worry about fast access to guest owned data
>   structures (64-bit uses a direct map). (NB: 32-bit guests on 64-bit
>   hosts can be considered to have access)
>
> # pITS
>
> ## Assumptions
>
> It is assumed that `GITS_TYPER.IDbits` is large enough that there are
> sufficient LPIs available to cover the sum of the number of possible
> events generated by each device in the system (that is the sum of the
> actual events for each bit of hardware, rather than the notional
> per-device maximum from `GITS_TYPER.IDbits`).
>
> This assumption avoids the need to do memory allocations and interrupt
> routing at run time, e.g. during command processing by allowing us to
> setup everything up front.
>
> ## Driver
>
> The physical driver will provide functions for enabling, disabling,
> routing etc. a specified interrupt, via the usual Xen APIs for doing
> such things.
>
> This will likely involve interacting with the physical ITS command
> queue etc. In this document such interactions are considered internal
> to the driver (i.e. we care that the API to enable an interrupt
> exists, not how it is implemented).
>
> The physical ITS will be provisioned with whatever tables it requests
> via its `GITS_BASERn` registers.
>
> ## Collections
>
> The `pITS` will be configured at start of day with 1 Collection mapped
> to each physical processor, using the `MAPC` command on the physical
> ITS.
>
> ## Per Device Information
>
> Each physical device in the system which can be used together with an
> ITS (whether using passthrough or not) will have associated with it a
> data structure:
>
>     struct its_device {
>         struct pits *pits;
>         uintNN_t phys_device_id;
>         uintNN_t virt_device_id;
>         unsigned int *events;
>         unsigned int nr_events;
>         struct page_info *pitt;
>         unsigned int nr_pitt_pages;
>         /* Other fields relating to pITS maintenance but unrelated to vITS */
>     };
>
> Where:
>
> - `pits`: Pointer to the associated physical ITS.
> - `phys_device_id`: The physical device ID of the physical device
> - `virt_device_id`: The virtual device ID if the device is accessible
>   to a domain
> - `events`: An array mapping a per-device event number into a physical
>   LPI.
> - `nr_events`: The number of events which this device is able to
>   generate.
> - `pitt`, `nr_pitt_pages`: Records allocation of pages for physical
>   ITT (not directly accessible).
>
> During its lifetime this structure may be referenced by several
> different mappings (e.g. physical and virtual device id maps, virtual
> collection device id).
>
> ## Device Discovery/Registration and Configuration
>
> Per device information will be discovered based on firmware tables (DT
> or ACPI) and information provided by dom0 (e.g. reading associated PCI
> cfg space, registration via PHYSDEVOP_pci_device_add or new custom
> hypercalls).
>
> This information shall include at least:
>
> - The Device ID of the device.
> - The maximum number of Events which the device is capable of
>   generating.
>
> When a device is discovered/registered (i.e. when all necessary
> information is available) then:
>
> - `struct its_device` and the embedded `events` array will be
>   allocated (the latter with `nr_events` elements).
> - The `struct its_device` will be inserted into a mapping (possibly an
>   R-B tree) from its physical Device ID to the `struct its_device`.
> - `nr_events` physical LPIs will be allocated and recorded in the
>   `events` array.
> - An ITT table will be allocated for the device and the appropriate
>   `MAPD` command will be issued to the physical ITS. The location will
>   be recorded in `struct its_device.pitt`.
> - Each Event which the device may generate will be mapped to the
>   corresponding LPI in the `events` array and a collection, by issuing
>   a series of `MAPVI` commands. Events will be assigned to physical
>   collections in a round-robin fashion.
>
> This setup must occur for a given device before any ITS interrupts may
> be configured for the device and certainly before a device is passed
> through to a guest. This implies that dom0 cannot use MSIs on a PCI
> device before having called `PHYSDEVOP_pci_device_add`.
>
> # Device Assignment
>
> Each domain will have an associated mapping from virtual device ids
> into a data structure describing the physical device, including a
> reference to the relevant `struct its_device`.
>
> The number of possible device IDs may be large so a simple array or
> list is likely unsuitable. A tree (e.g. Red-Black) may be a suitable
> data structure. Currently we do not need to perform lookups in this
> tree on any hot paths.
>
> _Note_: In the context of virtualised device ids (especially for domU)
> it may be possible to arrange for the upper bound on the number of
> device IDs to be lower allowing a more efficient data structure to be
> used. This is left for a future improvement.
>
> When a device is assigned to a domain (including to domain 0) the
> mapping for the new virtual device ID will be entered into the tree.
>
> During assignment all LPIs associated with the device will be routed
> to the guest (i.e. `route_irq_to_guest` will be called for each LPI in
> the `struct its_device.events` array) and the pLPI will be enabled in
> the physical LPI configuration table with a priority of `GIC_PRI_IRQ`
> (not any priority from the guest).
>
> # vITS
>
> A guest domain which is allowed to use ITS functionality (i.e. has
> been assigned pass-through devices which can generate MSIs) will be
> presented with a virtualised ITS.
>
> Accesses to the vITS registers will trap to Xen and be emulated and a
> virtualised Command Queue will be provided.
>
> Commands entered onto the virtual Command Queue will be translated
> into physical commands, as described later in this document.
>
> There are other aspects to virtualising the ITS (LPI collection
> management, assignment of LPI ranges to guests, device
> management). However these are only considered here to the extent
> needed for describing the vITS emulation.
>
> ## Xen interaction with guest OS provisioned vITS memory
>
> Memory which the guest provisions to the vITS (ITT via `MAPD` or other
> tables via `GITS_BASERn`) needs careful handling in Xen.
>
> ### Trust
>
> Xen cannot trust data in data structures contained in such memory,
> since a guest can trample over it at will. Therefore Xen either
> must take great care when accessing data structures stored in such
> memory to validate the contents e.g. not trust that values are within
> the required limits or it must take steps to restrict guest access to
> the memory when it is provisioned. Since the data structures are
> simple and most accessors need to do bounds check anyway it is
> considered sufficient to simply do the necessary checks on access.
>
> **Any information read from memory which has been provisioned by the guest
>    OS should not be trusted and must be carefully checked (e.g. ranges
>    etc) before use.**
>
> ### Mapping
>
> Most data structures stored in this shared memory are accessed on the
> hot interrupt injection path and must therefore be quickly accessible
> from within Xen. Since we have restricted vits support to 64-bit hosts
> only, `map_domain_page` is fast enough to be used on the fly and
> therefore we do not need to be concerned about unbounded amounts of
> permanently mapped memory consumed by each `MAPD` command.
>
> Although `map_domain_page` is fast, `p2m_lookup` (translation from IPA
> to PA) is not necessarily so. For now we accept this; as a future
> extension a sparse mapping of the guest device table in vmap space
> could be considered, with limits on the total amount of vmap space which
> we allow each domain to consume.
>
> The `GITS_BASERn` registers allow for the guest to specify cache
> attributes for the memory. For now we require that these have the same
> attributes as hypercall arguments in general (see `public/arch-arm.h`)
>
> In addition while `GITS_BASERn` allows the Cacheability to be
> specified as `Device-nGnRnE` we require that the tables provided be in
> normal guest RAM (not MMIO, not granted memory etc), that is it must
> have type `p2m_ram_rw`.
>
> ## vITS properties
>
> The vITS implementation shall have:
>
> - `GITS_TYPER.HCC == nr_vcpus + 1`.
> - `GITS_TYPER.PTA == 0`. Target addresses are linear processor numbers.
> - `GITS_TYPER.Devbits == See below`.
> - `GITS_TYPER.IDbits == See below`.
> - `GITS_TYPER.ITT Entry Size == 7`, meaning 8 bytes, which is the size
>   of `struct vitt` (defined below).
>
> `GITS_TYPER.Devbits` and `GITS_TYPER.IDbits` will need to be chosen to
> reflect the host and guest configurations (number of LPIs, maximum
> device ID etc).
>
> Other fields (not mentioned here) will be set to some sensible (or
> mandated) value.
>
> The `GITS_BASER0` will be setup to request sufficient memory for a
> device table consisting of entries of:
>
>     struct vdevice_table {
>         uint64_t vitt_ipa;
>         uint32_t vitt_size;
>         uint32_t padding;
>     };

      How about adding a valid bit to know if the entry is valid or not?

>     BUILD_BUG_ON(sizeof(struct vdevice_table) != 16);
>
> On write to `GITS_BASER0` the relevant details of the Device Table
> (IPA, size, cache attributes to use when mapping) will be recorded in
> `struct domain`.
>
> All other `GITS_BASERn.Valid == 0`.
>
> ## vITS to pITS mapping
>
> A physical system may have multiple physical ITSs.
>
> With the simplified vits command model presented here only a single
> `vits` is required.
>
> In the future a more complex arrangement may be desired. Since the
> choice of model is internal to the hypervisor/tools and is
> communicated to the guest via firmware tables we are not tied to this
> model as an ABI if we decide to change.
>
> When constructing dom0 it will therefore be necessary to rewrite any
> DTS properties which refer to an ITS to point to the single provided
> ITS, as well as dropping all ITS nodes and replacing them with a
> single node representing the vITS.
>
> ## Mapping from `vLPI` back to `pLPI`
>
> While we have arranged for a (`pDevice`,`pEvent`) to map to a single
> `pLPI` we cannot guarantee that a given `vLPI` is mapped by a single
> (`vDevice`,`vEvent`) since the guest may setup multiple ITT tables
> such that this is not the case. Enforcing that this is the case is
> prohibitively expensive.
>
> Therefore it is not in general possible to associate a `vLPI` with a
> `pLPI`.
>
> ## Per-domain `struct pending_irq` for `vLPI`s
>
> Internally Xen uses a `struct pending_irq` to track the status of any
> pending virtual IRQ, including a virtual LPI.
>
> Upon domain creation an array of such `struct pending_irq`'s will be
> allocated to cover the range `8192..nr_lpis` (for the number of LPIs
> which the guest is configured with) and a pointer to this array will be
> stored in the `struct domain`. The function `irq_to_pending` will be
> modified to lookup interrupts in the LPI range in this array.
>
> ## Handling of unrouted/spurious LPIs
>
> Since there is no 1:1 link between a `vLPI` and `pLPI` enabling and
> disabling of physical LPIs cannot be driven from the state of an
> associated vLPI.
>
> Each `pLPI` is routed and enabled during device assignment, therefore
> it is possible to receive a physical LPI which has yet to be routed
> (via a `vITS`) to a `vLPI`.

Why do we need to enable LPIs during device assignment?
Can't we do it only on LPI configuration update, which is trapped in
Xen as mentioned in 7.8 (## Enabling and disabling LPIs)?

>
> Similarly if a guest routes multiple Events to a single `vLPI` the
> interrupt may already be pending when we attempt to deliver it.
>
> Such `pLPI`s shall be ignored and left in the priority dropped state
> (per the read from `GICC_IAR`). They will not be `EOI`-d in order to
> avoid a possible interrupt storm.
>
> On device deassignment (including as part of domain destroy) after
> resetting the device it will be necessary to EOI any interrupts in
> such a state by walking over all events in the corresponding `struct
> its_device`.
>
> ## Enabling and disabling LPIs
>
> Two new functions `vgic_enable_lpi` and `vgic_disable_lpi` will be
> provided which are analogous to `vgic_enable_irqs` and
> `vgic_disable_irqs` but work for the LPI interface. (Alternatively,
> refactoring the existing functions to work for all cases would be
> acceptable too).
>
> A `vLPI` which has not yet been enabled will automatically be queued, by
> the existing vgic injection machinery, until a call to
> `vgic_enable_lpi` is made (in response to a trapped access to the
> virtual cfg table).
>
> ## LPI Configuration Table Virtualisation
>
> A guest's write accesses to its LPI Configuration Table (which is just
> an area of guest RAM which the guest has nominated) will be trapped to
> the hypervisor, using stage 2 MMU permissions, in order for changes to
> be propagated into the host interrupt configuration.
>
> On write `bit[0]` of the written byte is the enable/disable state for
> the irq and is handled thus, for each byte in the written value:
>
>     lpi = lpi corresponding to byte offset (addr - table_base);
>
>     pending_irq = irq_to_pending(lpi);
>     pending_irq->priority = byte & 0xfc; /* XXX: or byte >> 2 */
>
>     if ( byte & 0x1 )
>         vgic_enable_lpi(current, lpi);
>     else
>         vgic_disable_lpi(current, lpi);
>
> Note that physical interrupts are always configured with a priority of
> `GIC_PRI_IRQ`, regardless of the priority of any virtual interrupt.
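
For reference, a self-contained sketch of the per-byte decode in the pseudo-code above. It keeps the `byte & 0xfc` form of the priority (the XXX about shifting is left open) and assumes that byte offset 0 of the table corresponds to LPI 8192; both the helper names and that offset convention are assumptions of this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

#define FIRST_LPI 8192u

struct lpi_cfg_sketch {
    uint8_t priority;  /* upper six bits of the byte, left unshifted */
    bool enabled;      /* bit[0] of the byte */
};

/* Split one LPI Configuration Table byte into its two fields. */
static struct lpi_cfg_sketch decode_lpi_cfg(uint8_t byte)
{
    struct lpi_cfg_sketch cfg;

    cfg.priority = byte & 0xfc;
    cfg.enabled = (byte & 0x1) != 0;
    return cfg;
}

/* Assumption: the byte at offset 0 describes LPI 8192. */
static uint32_t lpi_from_cfg_offset(uint32_t table_offset)
{
    return FIRST_LPI + table_offset;
}
```

A trapped multi-byte write would simply run this once per byte written.
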
>
> ## LPI Pending Table Virtualisation
>
> According to GIC spec 4.8.5 this table is not necessarily in sync and
> the mechanism to force a sync is `IMPLEMENTATION DEFINED`, hence we
> don't need to do anything.
>
> ## Device Table Virtualisation
>
> The IPA, size and cacheability attributes of the guest device table
> will be recorded in `struct domain` upon write to `GITS_BASER0`.
>
> In order to lookup an entry for `device`:
>
>     define {get,set}_vdevice_entry(domain, device, struct vdevice_table *entry):
>         offset = device*sizeof(struct vdevice_table)
>         if offset > <DT size>: error
>
>         dt_entry = <DT base IPA> + device*sizeof(struct vdevice_table)
>         paddr = p2m_lookup(domain, dt_entry, p2m_ram)
>         page = get_page_from_gfn(current->domain, paddr>>PAGE_SHIFT, &p2mt, P2M_ALLOC);
>         if !page: error
>         if !page_is_ram(p2mt): put_page(page); error;
>
>         dt_mapping = map_domain_page(page)
>
>         if (set)
>              dt_mapping[<appropriate page offset from device>] = *entry;
>         else
>              *entry = dt_mapping[<appropriate page offset>];
>
>         unmap_domain_page(dt_mapping)
>         put_page(page)
>
> Since everything is based upon IPA (guest addresses) a malicious guest
> can only reference its own RAM here.
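
One detail worth spelling out is the range check: the whole 16-byte entry has to fit inside the table, not just its first byte. A sketch of that check on its own, without the p2m/mapping steps (the function name and parameters are invented for illustration):

```c
#include <stdbool.h>
#include <stdint.h>

struct vdevice_table {
    uint64_t vitt_ipa;
    uint32_t vitt_size;
    uint32_t padding;
};

/*
 * Compute the IPA of the device table entry for `device`.
 * Returns false if the entry would not lie wholly inside the table.
 */
static bool vdevice_entry_ipa(uint64_t dt_base_ipa, uint64_t dt_size,
                              uint32_t device, uint64_t *ipa)
{
    const uint64_t entry_size = sizeof(struct vdevice_table);
    uint64_t offset = (uint64_t)device * entry_size;

    if (dt_size < entry_size || offset > dt_size - entry_size)
        return false;

    *ipa = dt_base_ipa + offset;
    return true;
}
```

With a 4KB table this accepts devices 0..255 and rejects 256. The same shape of check would apply to the `vitt` lookup.
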
>
> ## ITT Virtualisation
>
> The location of a vITT will have been recorded in the domain Device
> Table by a `MAPD` command and is looked up as above.
>
> The `vitt` is a `struct vitt`:
>
>     struct vitt {
>         uint16_t valid:1;
>         uint16_t pad:15;
>         uint16_t collection;
>         uint32_t vlpi;
>     };
>     BUILD_BUG_ON(sizeof(struct vitt) != 8);
>
> A lookup proceeds as for the device table: the offset is range
> checked against the `vitt_size` from the device table. To lookup
> `event` on `device`:
>
>     define {get,set}_vitt_entry(domain, device, event, struct vitt *entry):
>         get_vdevice_entry(domain, device, &dt)
>
>         offset = event*sizeof(struct vitt);
>         if offset > dt.vitt_size: error
>
>         vitt_entry = dt.vitt_ipa + event*sizeof(struct vitt)
>         paddr = p2m_lookup(domain, vitt_entry, p2m_ram)
>         page = get_page_from_gfn(current->domain, paddr>>PAGE_SHIFT, &p2mt, P2M_ALLOC);
>         if !page: error
>         if !page_is_ram(p2mt): put_page(page); error;
>
>         vitt_mapping = map_domain_page(page)
>
>         if (set)
>              vitt_mapping[<appropriate page offset from event>] = *entry;
>         else
>              *entry = vitt_mapping[<appropriate page offset>];
>
>         unmap_domain_page(vitt_mapping)
>         put_page(page)
>
> Again since this is IPA based a malicious guest can only point things
> to its own ram.
>
> ## Collection Table Virtualisation
>
> A pointer to a dynamically allocated array `its_collections` mapping
> collection ID to vcpu ID will be added to `struct domain`. The array
> shall have `nr_vcpus + 1` entries and resets to ~0 (or another
> explicitly invalid vcpu nr).
>
> ## Virtual LPI injection
>
> As discussed above the `vgic_vcpu_inject_irq` functionality will need
> to be extended to cover this new case, most likely via a new
> `vgic_vcpu_inject_lpi` frontend function. `vgic_vcpu_inject_irq` will
> also require some refactoring to allow the priority to be passed in
> from the caller (since `LPI` priority comes from the `LPI` CFG table,
> while `SPI` and `PPI` priority is configured via other means).
>
> `vgic_vcpu_inject_lpi` receives a `struct domain *` and a virtual
> interrupt number (corresponding to a vLPI) and needs to figure out
> which vcpu this should map to.
>
> To do this it must look up the Collection ID associated (via the vITS)
> with that LPI.
>
> Proposal: Add a new `its_device` field to `struct irq_guest`, a
> pointer to the associated `struct its_device`. The existing `struct
> irq_guest.virq` field contains the event ID (perhaps use a `union`
> to give a more appropriate name) and _not_ the virtual LPI. Injection
> then consists of:
>
>         d = irq_guest->domain
>         virq = irq_guest->virq
>         its_device = irq_guest->its_device
>
>         get_vitt_entry(d, its_device->virt_device_id, virq, &vitt)
>         vcpu = d->its_collections[vitt.collection]
>
>         if !is_valid_lpi(vitt.vlpi): error
>
>         vgic_vcpu_inject_lpi(&d->vcpus[vcpu], vitt.vlpi)
>
> If the LPI is currently disabled then it will be queued by
> `vgic_vcpu_inject_lpi` and injected in response to a subsequent
> `vgic_enable_lpi` call.
>
> ## Command Queue Virtualisation
>
> The command translation/emulation in this design has been arranged to
> be as cheap as possible (e.g. in many cases the actions are NOPs),
> avoiding previous concerns about the length of time which an emulated
> write to a `CWRITER` register may block the vcpu.
>
> The vits will simply track its reader and writer pointers. On write
> to `CWRITER` it will immediately and synchronously process all
> commands in the queue and update its state accordingly.
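
To make the reader/writer handling concrete, here is a sketch of the synchronous drain on a `CWRITER` write, with offsets kept in bytes and wrap-around at the end of the ring. The structure, the callback and the decision to ignore a misaligned or out-of-range write are all assumptions of this sketch, not something the draft specifies.

```c
#include <stddef.h>
#include <stdint.h>

#define VITS_CMD_SIZE 32u   /* bytes per command */

struct vits_cmdq_sketch {
    uint32_t size;     /* ring size in bytes, a non-zero multiple of 32 */
    uint32_t creadr;   /* byte offset of the next command to process */
    uint32_t cwriter;  /* byte offset just past the last queued command */
};

/*
 * Emulate a guest write to CWRITER: record the new write offset, then
 * process every command between the reader and the writer.
 * Returns the number of commands consumed.
 */
static unsigned int vits_write_cwriter(struct vits_cmdq_sketch *q,
                                       uint32_t new_cwriter,
                                       void (*handle)(uint32_t offset))
{
    unsigned int n = 0;

    /* Sketch policy: silently ignore a bogus offset from the guest. */
    if (q->size == 0 || new_cwriter >= q->size ||
        new_cwriter % VITS_CMD_SIZE != 0)
        return 0;

    q->cwriter = new_cwriter;
    while (q->creadr != q->cwriter) {
        if (handle != NULL)
            handle(q->creadr);        /* emulate one command */
        q->creadr += VITS_CMD_SIZE;
        if (q->creadr == q->size)     /* wrap at the end of the ring */
            q->creadr = 0;
        n++;
    }
    return n;
}
```

Since the loop advances `creadr` one command at a time, a preemption check could be dropped in between iterations without changing the structure.
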
>
> It might be possible to implement a rudimentary form of preemption by
> periodically (as determined by `hypercall_preempt_check()`) returning
> to the guest without incrementing PC but with updated internal
> `CREADR` state, meaning it will reexecute the write to `CWRITER` and
> we can pickup where we left off for another iteration. This at least
> lets us schedule other vcpus etc and prevents a monopoly.
>
> ## ITS Command Translation
>
> This section is based on the section 5.13 of GICv3 specification
> (PRD03-GENC-010745 24.0) and provides concrete ideas of how this can
> be interpreted for Xen.
>
> The ITS provides 12 commands in order to manage interrupt collections,
> devices and interrupts. Possible command parameters are:
>
> - Device ID (`Device`) (called `device` in the spec).
> - Event ID (`Event`) (called `ID` in the spec). This is an index into
>   a device's `ITT`.
> - Collection ID (`Collection`) (called `collection` in the spec)
> - LPI ID (`LPI`) (called `pID` in the spec)
> - Target Address (`TA`) (called `TA` in the spec)
>
> These parameters need to be validated and translated from Virtual (`v`
> prefix) to Physical (`p` prefix).
>
> Note, we differ from the naming in the GIC spec for clarity, in
> particular we use `Event` not `ID` and `LPI` not `pID` to reduce
> confusion, especially when `v` and `p` prefixes are used due to
> virtualisation.
>
> ### Parameter Validation / Translation
>
> Each command contains parameters that need to be validated before any
> usage in Xen or passing to the hardware.
>
> #### Device ID (`Device`)
>
> The corresponding ITT is obtained by looking up as described above.
>
> The physical `struct its_device` can be found by looking up in the
> domain's device map.
>
> If lookup fails or the resulting device table entry is invalid then
> the Device is invalid.
>
> #### Event ID (`Event`)
>
> Validated against emulated `GITS_TYPER.IDbits`.
>
> It is not necessary to translate a `vEvent`.
>
> #### LPI (`LPI`)
>
> Validated against emulated `GITS_TYPER.IDbits`.
>
> It is not necessary to translate a `vLPI` into a `pLPI` since the
> tables all contain `vLPI`. (Translation from `pLPI` to `vLPI` happens
> via `struct irq_guest` when we receive the IRQ).
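
For what it's worth, the two range checks could be as small as the sketch below. It assumes the `IDbits` field holds the number of implemented ID bits minus one, and that a valid `vLPI` must also be at least 8192 (the first LPI INTID). The `id_*` names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ID_FIRST_LPI 8192u  /* LPI INTIDs start at 8192 in GICv3 */

/* idbits is the raw field value: implemented ID bits minus one. */
static bool id_event_valid(uint32_t event, unsigned int idbits)
{
    assert(idbits < 32);  /* 5-bit field, value chosen by the emulation */
    return (uint64_t)event < ((uint64_t)1 << (idbits + 1));
}

/* A vLPI must fit in the ID space and lie in the LPI range (assumption). */
static bool id_vlpi_valid(uint32_t vlpi, unsigned int idbits)
{
    return vlpi >= ID_FIRST_LPI && id_event_valid(vlpi, idbits);
}
```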
>
> #### Interrupt Collection (`Collection`)
>
> The `Collection` is validated against the size of the per-domain
> `its_collections` array (i.e. nr_vcpus + 1) and then translated by a
> simple lookup in that array.
>
>      vcpu_nr = d->its_collections[Collection]
>
> A result > `nr_vcpus` is invalid.
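
A sketch of the lookup with both checks spelled out. The fixed vcpu count, the `COL_INVALID` marker for unmapped slots and the `col_*` names are assumptions; the sketch rejects any stored value that does not name an existing vcpu, which is slightly stricter than the wording above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define COL_NR_VCPUS 4u          /* hypothetical domain size */
#define COL_INVALID  UINT16_MAX  /* marker for an unmapped collection */

struct col_domain {
    uint16_t its_collections[COL_NR_VCPUS + 1];
};

/* Translate a guest Collection ID to a vcpu number; false if invalid. */
static bool col_to_vcpu(const struct col_domain *d, uint32_t collection,
                        uint16_t *vcpu_nr)
{
    assert(d != NULL && vcpu_nr != NULL);
    if (collection >= COL_NR_VCPUS + 1)
        return false;  /* outside the per-domain array */
    if (d->its_collections[collection] >= COL_NR_VCPUS)
        return false;  /* unmapped, or not a vcpu of this domain */
    *vcpu_nr = d->its_collections[collection];
    return true;
}
```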
>
> #### Target Address (`TA`)
>
> This parameter is used in commands which manage collections. It is a
> unique identifier per processor.
>
> We have chosen to implement `GITS_TYPER.PTA` as 0, hence `vTA` simply
> corresponds to the `vcpu_id`, so only needs bounds checking against
> `nr_vcpus`.
>
> ### Commands
>
> To be read with reference to spec for each command (which includes
> error checks etc which are omitted here).
>
> It is assumed that inputs will be bounds and validity checked as
> described above, thus error handling is omitted for brevity (i.e. if
> get and/or set fail then so be it). In general invalid commands are
> simply ignored.
>
> #### `MAPD`: Map a physical device to an ITT.
>
> _Format_: `MAPD Device, Valid, ITT Address, ITT Size`.
>
> _Spec_: 5.13.11
>
> `MAPD` is sent with the `Valid` bit set if the mapping is to be added
> and clear when the mapping is removed.
>
> When the `Valid` bit is set the range `ITT Address` to `ITT
> Address` + `ITT Size` need not be validated. This is done in
> `{get,set}_vdevice_entry` when calling the `p2m_lookup`
> function. Validating the memory at `MAPD` time would serve no purpose
> since the guest could subsequently balloon it out or grant map over it etc.
>
> The domain's device table is updated with the provided information.
>
> The `vitt_mapd` field is set according to the `Valid` flag in the
> command:
>
>     dt_entry.vitt_mapd = Valid
>     dt_entry.vitt_ipa = ITT Address
>     dt_entry.vitt_size = ITT Size
>     set_vdevice_entry(current->domain, Device, &dt_entry)
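
Spelled out in C, the `MAPD` handling might look like the sketch below. The device table here is a plain in-memory array purely to keep the example self-contained; the table size, the struct layout and the `mapd_*` names are assumptions, with only the `vitt_*` field names taken from the draft.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAPD_NR_DEVICES 16u  /* hypothetical device table size */

struct mapd_dt_entry {
    uint64_t vitt_ipa;   /* guest address of the ITT */
    uint32_t vitt_size;  /* ITT Size as given in the command */
    bool     vitt_mapd;  /* tracks the Valid flag */
};

struct mapd_domain {
    struct mapd_dt_entry dt[MAPD_NR_DEVICES];
};

/* Toy stand-in for set_vdevice_entry(): bounds check, then store. */
static bool mapd_set_vdevice_entry(struct mapd_domain *d, uint32_t device,
                                   const struct mapd_dt_entry *e)
{
    assert(d != NULL && e != NULL);
    if (device >= MAPD_NR_DEVICES)
        return false;
    d->dt[device] = *e;
    return true;
}

/* MAPD: record (or clear) the ITT location without touching the memory. */
static bool mapd_handle(struct mapd_domain *d, uint32_t device, bool valid,
                        uint64_t itt_addr, uint32_t itt_size)
{
    struct mapd_dt_entry e = { 0, 0, false };

    e.vitt_mapd = valid;
    if (valid) {
        e.vitt_ipa = itt_addr;
        e.vitt_size = itt_size;
    }
    return mapd_set_vdevice_entry(d, device, &e);
}
```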
>
> #### `MAPC`: Map an interrupt collection to a target processor
>
> _Format_: `MAPC Collection, TA`
>
> _Spec_: 5.13.12
>
> The updated `vTA` (a vcpu number) is recorded in the `its_collections`
> array of the domain struct:
>
>     d->its_collections[Collection] = TA
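
With the bounds checks from the validation section folded in, `MAPC` could be sketched as follows. The fixed vcpu count and the `mapc_*` names are assumptions for illustration, and an invalid command is simply ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAPC_NR_VCPUS 4u  /* hypothetical domain size */

struct mapc_domain {
    uint16_t its_collections[MAPC_NR_VCPUS + 1];
};

/* MAPC: with PTA == 0 the vTA is just a vcpu_id, so bounds check it. */
static bool mapc_handle(struct mapc_domain *d, uint32_t collection,
                        uint32_t ta)
{
    assert(d != NULL);
    if (collection >= MAPC_NR_VCPUS + 1)
        return false;  /* no such slot in its_collections */
    if (ta >= MAPC_NR_VCPUS)
        return false;  /* not a vcpu of this domain */
    d->its_collections[collection] = (uint16_t)ta;
    return true;
}
```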
>
> #### `MAPI`: Map an interrupt to an interrupt collection.
>
> _Format_: `MAPI Device, LPI, Collection`
>
> _Spec_: 5.13.13
>
> After validation:
>
>     vitt.valid = True
>     vitt.collection = Collection
>     vitt.vlpi = LPI
>     set_vitt_entry(current->domain, Device, LPI, &vitt)
>
> #### `MAPVI`: Map an input identifier to a physical interrupt and an interrupt collection.
>
> _Format_: `MAPVI Device, Event, LPI, Collection`
>
>     vitt.valid = True
>     vitt.collection = Collection
>     vitt.vlpi = LPI
>     set_vitt_entry(current->domain, Device, Event, &vitt)
>
> #### `MOVI`: Redirect interrupt to an interrupt collection
>
> _Format_: `MOVI Device, Event, Collection`
>
> _Spec_: 5.13.15
>
>     get_vitt_entry(current->domain, Device, Event, &vitt)
>     vitt.collection = Collection
>     set_vitt_entry(current->domain, Device, Event, &vitt)
>
>     XXX consider helper which sets field without mapping/unmapping
>     twice.
>
> This command is supposed to move any pending interrupts associated
> with `Event` to the vcpu implied by the new `Collection`, which is
> tricky. For now we ignore this requirement (as we do for
> `GICD_IROUTERn` and `GICD_TARGETRn` for other interrupt types).
>
> #### `DISCARD`: Discard interrupt requests
>
> _Format_: `DISCARD Device, Event`
>
> _Spec_: 5.13.16
>
>     get_vitt_entry(current->domain, Device, Event, &vitt)
>     vitt.valid = False
>     set_vitt_entry(current->domain, Device, Event, &vitt)
>
>     XXX consider helper which sets field without mapping/unmapping
>     twice.
>
> This command is supposed to clear the pending state of any associated
> interrupt. This requirement is ignored (guest may see a spurious
> interrupt).
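
On the two XXX notes about a helper for `MOVI` and `DISCARD`: one possible shape is a single update routine that locates the entry once and applies a small callback in place. The sketch below uses an ordinary array for the vITT, so the map/unmap of guest memory is not modelled at all; the struct layout and all `vitt_*` function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct vitt_entry {
    bool     valid;
    uint16_t collection;
    uint32_t vlpi;
};

typedef void (*vitt_update_fn)(struct vitt_entry *e, uint32_t arg);

/* Locate the entry once and modify it in place; false if not usable. */
static bool vitt_update(struct vitt_entry *table, size_t nr_events,
                        uint32_t event, vitt_update_fn fn, uint32_t arg)
{
    assert(table != NULL && fn != NULL);
    if (event >= nr_events || !table[event].valid)
        return false;
    fn(&table[event], arg);
    return true;
}

/* MOVI: retarget the event to a new collection. */
static void vitt_set_collection(struct vitt_entry *e, uint32_t collection)
{
    e->collection = (uint16_t)collection;
}

/* DISCARD: drop the mapping; pending state is not touched. */
static void vitt_invalidate(struct vitt_entry *e, uint32_t unused)
{
    (void)unused;
    e->valid = false;
}
```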
>
> #### `INV`: Clean any caches associated with interrupt
>
> _Format_: `INV Device, Event`
>
> _Spec_: 5.13.17
>
> Since LPI Configuration table updates are not trapped and the config
> is read on use, there is nothing to do here.
>
> #### `INVALL`: Clean any caches associated with an interrupt collection
>
> _Format_: `INVALL Collection`
>
> _Spec_: 5.13.19
>
> Since LPI Configuration table updates are not trapped and the config
> is read on use, there is nothing to do here.
>
> #### `INT`: Generate an interrupt
>
> _Format_: `INT Device, Event`
>
> _Spec_: 5.13.20
>
> The `vitt` entry corresponding to `Device,Event` is looked up and then:
>
>     get_vitt_entry(current->domain, Device, Event, &vitt)
>     vgic_vcpu_inject_lpi(current->domain, vitt.vlpi)
>
> _Note_: Where (Device,Event) is real this may need consideration of
> interactions with real LPIs being delivered: Julien had concerns about
> Xen's internal IRQ state tracking. If this is a problem then we may need
> changes to IRQ state tracking, or to inject as a real IRQ and let
> physical IRQ injection handle it, or write to `GICR_SETLPIR`?
>
> #### `CLEAR`: Clear the pending state of an interrupt
>
> _Format_: `CLEAR Device, Event`
>
> _Spec_: 5.13.21
>
> Should clear pending state of LPI. Ignore (guest may see a spurious
> interrupt).
>
> #### `SYNC`: Wait for completion of any outstanding ITS actions for collection
>
> _Format_: `SYNC TA`
>
> _Spec_: 5.13.22
>
> This command can be ignored.
>
> # GICv4 Direct Interrupt Injection
>
> GICv4 will directly mark the LPIs pending in the virtual pending table
> which is per-redistributor (i.e. per-vCPU).
>
> LPIs will be received by the guest in the same way as SPIs, i.e. trap in
> IRQ mode then read `ICC_IAR1_EL1` (for GICv3).
>
> Therefore GICv4 will not require one vITS per pITS.
>
> # Event Channels
>
> It has been proposed that it might be nice to inject event channels as
> LPIs in the future. Whether or not that would involve any sort of vITS
> is unclear, but if it did then it would likely be a separate emulation
> to the vITS emulation used with a pITS and as such is not considered
> further here.
>
> # Glossary
>
> * _MSI_: Message Signalled Interrupt
> * _ITS_: Interrupt Translation Service
> * _GIC_: Generic Interrupt Controller
> * _LPI_: Locality-specific Peripheral Interrupt
>
> # References
>
> "GIC Architecture Specification" PRD03-GENC-010745 24.0.
>
> "IO Remapping Table System Software on ARM Platforms" ARM DEN 0049A.
>

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel

 

