
Re: [Xen-devel] (v2) VT-d Posted-interrupt (PI) design for XEN



Hi Jan & other maintainers,

Would it help you continue the review if I sent out an RFC patch for this
feature?

Thanks,
Feng

> -----Original Message-----
> From: Wu, Feng
> Sent: Wednesday, March 18, 2015 8:44 PM
> To: xen-devel@xxxxxxxxxxxxx
> Cc: Keir Fraser (keir@xxxxxxx); Jan Beulich (JBeulich@xxxxxxxx); Tian, Kevin;
> Zhang, Yang Z; Wu, Feng
> Subject: (v2) VT-d Posted-interrupt (PI) design for XEN
> 
> VT-d Posted-interrupt (PI) design for XEN
> 
> Background
> ==========
> As virtualization matures, device assignment is required more and more
> often. However, today when a VM is running with assigned devices (such as
> a NIC), external interrupt handling for the assigned devices always needs
> VMM intervention.
> 
> VT-d Posted-interrupt is an enhanced method of handling interrupts
> in a virtualized environment. Interrupt posting is the process by
> which an interrupt request is recorded in a memory-resident
> posted-interrupt-descriptor structure by the root-complex, followed by
> an optional notification event issued to the CPU complex.
> 
> With VT-d Posted-interrupt we get the following advantages:
> - Direct delivery of external interrupts to running vCPUs without VMM
> intervention.
> - Reduced interrupt migration complexity. On vCPU migration, software
> can atomically co-migrate all interrupts targeting the migrating vCPU. For
> virtual machines with assigned devices, migrating a vCPU across pCPUs
> otherwise either incurs the overhead of forwarding interrupts in software
> (e.g. via VMM-generated IPIs) or requires independently migrating each
> interrupt targeting the vCPU to the new pCPU. With VT-d PI enabled, the
> destination vCPU of an external interrupt from an assigned device is stored
> in the IRTE (i.e. the Posted-interrupt Descriptor Address). When the vCPU is
> migrated to another pCPU, we set the new pCPU in the 'NDST' field of the
> Posted-interrupt descriptor, which makes the interrupt migration automatic.
> 
> 
> Posted-interrupt Introduction
> ========================
> There are two components to the Posted-interrupt architecture:
> Processor Support and Root-Complex Support
> 
> - Processor Support
> Posted-interrupt processing is a feature by which a processor processes
> the virtual interrupts by recording them as pending on the virtual-APIC
> page.
> 
> Posted-interrupt processing is enabled by setting the "process posted
> interrupts" VM-execution control. The processing is performed in response
> to the arrival of an interrupt with the posted-interrupt notification vector.
> In response to such an interrupt, the processor processes virtual interrupts
> recorded in a data structure called a posted-interrupt descriptor.
> 
> For more information about APICv and CPU-side Posted-interrupt, please
> refer to Chapter 29, in particular Section 29.6, of the Intel SDM:
> http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf
> 
> - Root-Complex Support
> Interrupt posting is the process by which an interrupt request (from IOAPIC
> or MSI/MSIx capable sources) is recorded in a memory-resident
> posted-interrupt-descriptor structure by the root-complex, followed by
> an optional notification event issued to the CPU complex. The interrupt
> request arriving at the root-complex carries the identity of the interrupt
> request source and a 'remapping-index'. The remapping-index is used to
> look up an entry in the memory-resident interrupt-remap-table. Unlike
> with interrupt-remapping, the interrupt-remap-table-entry for a
> posted-interrupt specifies a virtual-vector and a pointer to the
> posted-interrupt descriptor. The virtual-vector specifies the vector of
> the interrupt to be
> recorded in the posted-interrupt descriptor. The posted-interrupt descriptor
> hosts storage for the virtual-vectors and contains the attributes of the
> notification event (interrupt) to be issued to the CPU complex to inform
> CPU/software about pending interrupts recorded in the posted-interrupt
> descriptor.
> 
> For more information about VT-d PI, please refer to
> http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/vt-directed-io-spec.html
> 
> Important Definitions
> ==================
> There are some changes to IRTE and posted-interrupt descriptor after
> VT-d PI is introduced:
> IRTE:
> Posted-interrupt Descriptor Address: the address of the posted-interrupt
> descriptor
> Virtual Vector: the guest vector of the interrupt
> URG: indicates if the interrupt is urgent
> 
> Posted-interrupt descriptor:
> The Posted Interrupt Descriptor hosts the following fields:
> Posted Interrupt Request (PIR): Provide storage for posting (recording)
> interrupts (one bit per vector, for up to 256 vectors).
> 
> Outstanding Notification (ON): Indicate if there is a notification event
> outstanding (not processed by processor or software) for this Posted
> Interrupt Descriptor. When this field is 0, hardware modifies it from 0 to 1
> when generating a notification event, and the entity receiving the
> notification event (processor or software) resets it as part of posted
> interrupt processing.
> 
> Suppress Notification (SN): Indicate if a notification event is to be
> suppressed (not generated) for non-urgent interrupt requests (interrupts
> processed through an IRTE with URG=0).
> 
> Notification Vector (NV): Specify the vector for notification event
> (interrupt).
> 
> Notification Destination (NDST): Specify the physical APIC-ID of the
> destination logical processor for the notification event.
> 
> Design Overview
> ==============
> In this design, we will cover the following items:
> 1. Add a variable to control whether VT-d posted-interrupt is enabled.
> 2. VT-d PI feature detection.
> 3. Extend the posted-interrupt descriptor structure with the VT-d PI
> specific fields.
> 4. Extend the IRTE structure to support VT-d PI.
> 5. Introduce a new global vector which is used for waking up the blocked vCPU.
> 6. Update the IRTE when the guest modifies the interrupt configuration
> (MSI/MSIx configuration).
> 7. Update the posted-interrupt descriptor during vCPU scheduling (when the
> state of the vCPU transitions among RUNSTATE_running / RUNSTATE_blocked /
> RUNSTATE_runnable / RUNSTATE_offline).
> 8. Wake up a blocked vCPU when an interrupt is posted for it (wakeup
> notification handler).
> 9. New Xen boot command-line option that lets the user control the VT-d PI
> feature.
> 10. Multicast/broadcast and lowest-priority interrupt considerations.
> 
> 
> Implementation details
> ===================
> - New variable to control VT-d PI
> 
> Like the variable 'iommu_intremap' for interrupt remapping, it is
> straightforward to add a new one, 'iommu_intpost', for posted-interrupt.
> 'iommu_intpost' is set only when interrupt remapping and VT-d
> posted-interrupt are both enabled.
> 
> - VT-d PI feature detection.
> Bit 59 in VT-d Capability Register is used to report VT-d Posted-interrupt
> support.
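
A minimal sketch of the detection and gating described above. The helper
names cap_intr_post() and compute_intpost() are hypothetical and only
illustrate the bit test; they are not taken from the Xen tree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit 59 of the VT-d Capability Register reports posted-interrupt support. */
#define CAP_PI_SHIFT 59

/* Hypothetical helper: true if the capability value advertises VT-d PI. */
static bool cap_intr_post(uint64_t cap)
{
    return (cap >> CAP_PI_SHIFT) & 1;
}

/*
 * Hypothetical helper: the value 'iommu_intpost' would take, i.e. set only
 * when interrupt remapping is enabled and the hardware reports VT-d PI.
 */
static bool compute_intpost(bool intremap, uint64_t cap)
{
    return intremap && cap_intr_post(cap);
}
```
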
> 
> - Extend the posted-interrupt descriptor structure with the VT-d PI
> specific fields.
> Here is the new structure for the posted-interrupt descriptor:
> 
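> struct pi_desc {
>     DECLARE_BITMAP(pir, NR_VECTORS);  /* Posted Interrupt Request bits */
>     union {
>         struct {
>             u64 on     : 1,   /* Outstanding Notification */
>                 sn     : 1,   /* Suppress Notification */
>                 rsvd_1 : 13,
>                 ndm    : 1,
>                 nv     : 8,   /* Notification Vector */
>                 rsvd_2 : 8,
>                 ndst   : 32;  /* Notification Destination */
>         };
>         u64 control;
>     };
>     u32 rsvd[6];
> } __attribute__ ((aligned (64)));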
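
To make the ON/SN semantics concrete, here is a small software model of how
an interrupt gets posted into the descriptor. It is only a sketch: it uses a
portable stand-in type (pi_desc_sketch) instead of the bit-field struct, it
assumes the bit-fields are allocated LSB-first (so ON is bit 0 and SN is bit 1
of 'control'), and it is not atomic, whereas the real hardware update is.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_VECTORS 256
#define PI_ON (1ULL << 0)   /* assumed position of 'on' in 'control' */
#define PI_SN (1ULL << 1)   /* assumed position of 'sn' in 'control' */

/* Portable stand-in for struct pi_desc: 256-bit PIR, control word, padding. */
struct pi_desc_sketch {
    uint64_t pir[NR_VECTORS / 64];
    uint64_t control;
    uint32_t rsvd[6];
};

_Static_assert(sizeof(struct pi_desc_sketch) == 64,
               "descriptor must occupy exactly 64 bytes");

/*
 * Model of posting a non-urgent interrupt: record the vector in PIR, then
 * decide whether a notification event would be generated. Returns true if
 * a notification is sent (ON goes from 0 to 1), false if it is suppressed
 * by SN or already outstanding.
 */
static bool pi_post_vector(struct pi_desc_sketch *d, uint8_t vector)
{
    d->pir[vector / 64] |= 1ULL << (vector % 64);

    if ((d->control & PI_SN) || (d->control & PI_ON))
        return false;

    d->control |= PI_ON;
    return true;
}
```
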
> 
> - Extend IRTE structure to support VT-d PI.
> 
> Here is the new structure for IRTE:
> /* interrupt remap entry */
> struct iremap_entry {
>   union {
>     u64 lo_val;
>     struct {
>         u64 p       : 1,
>             fpd     : 1,
>             dm      : 1,
>             rh      : 1,
>             tm      : 1,
>             dlm     : 3,
>             avail   : 4,
>             res_1   : 4,
>             vector  : 8,
>             res_2   : 8,
>             dst     : 32;
>     }lo;
>     struct {
>         u64 p       : 1,
>             fpd     : 1,
>             res_1   : 6,
>             avail   : 4,
>             res_2   : 2,
>             urg     : 1,
>             im      : 1,
>             vector  : 8,
>             res_3   : 14,
>             pda_l   : 26;
>     }lo_intpost;
>   };
>   union {
>     u64 hi_val;
>     struct {
>         u64 sid     : 16,
>             sq      : 2,
>             svt     : 2,
>             res_1   : 44;
>     }hi;
>     struct {
>         u64 sid     : 16,
>             sq      : 2,
>             svt     : 2,
>             res_1   : 12,
>             pda_h   : 32;
>     }hi_intpost;
>   };
> };
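
The descriptor address is split across 'pda_l' (26 bits) and 'pda_h'
(32 bits). A sketch of that split, assuming, as the field widths and the
64-byte alignment of the descriptor suggest, that 'pda_l' holds address bits
31:6 and 'pda_h' holds bits 63:32. The names pda_parts, pda_split() and
pda_join() are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* The two halves of the descriptor address as stored in the IRTE. */
struct pda_parts {
    uint32_t pda_l;   /* address bits 31:6  (26 bits) */
    uint32_t pda_h;   /* address bits 63:32 (32 bits) */
};

/* Split a 64-byte-aligned descriptor address into the two IRTE fields. */
static struct pda_parts pda_split(uint64_t desc_addr)
{
    struct pda_parts p;

    /* The low 6 bits are not stored, so they must be zero. */
    assert((desc_addr & 0x3f) == 0);

    p.pda_l = (uint32_t)((desc_addr >> 6) & 0x3ffffff);
    p.pda_h = (uint32_t)(desc_addr >> 32);
    return p;
}

/* Rebuild the descriptor address from the two IRTE fields. */
static uint64_t pda_join(uint32_t pda_l, uint32_t pda_h)
{
    return ((uint64_t)pda_h << 32) | ((uint64_t)pda_l << 6);
}
```
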
> 
> - Introduce a new global vector which is used to wake up the blocked vCPU.
> 
> Currently, there is a global vector 'posted_intr_vector', which is used as
> the global notification vector for all vCPUs in the system. This vector is
> stored in the VMCS, and the CPU treats it as a _special_ vector: it is used
> to notify the related pCPU when an interrupt is recorded in the
> posted-interrupt descriptor.
> 
> Because this existing global vector is _special_ to the CPU, the CPU handles
> it differently from normal vectors. Please refer to Section 29.6 in the
> Intel SDM
> http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf
> for more information about how the CPU handles it.
> 
> With VT-d PI, the VT-d engine can issue a notification event when an
> assigned device raises an interrupt. We need to add a new global vector to
> wake up the blocked vCPU; please refer to the later section in this design
> for how this new global vector is used.
> 
> - Update IRTE when guest modifies the interrupt configuration (MSI/MSIx
> configuration).
> After VT-d PI is introduced, the format of IRTE is changed as follows:
>       Descriptor Address: the address of the posted-interrupt descriptor
>       Virtual Vector: the guest vector of the interrupt
>       URG: indicates if the interrupt is urgent
>       Other fields continue to have the same meaning
> 
> 'Descriptor Address' tells the destination vCPU of this interrupt, since
> each vCPU has a dedicated posted-interrupt descriptor.
> 
> 'Virtual Vector' tells the guest vector of the interrupt.
> 
> When the guest changes the configuration of an interrupt, such as its CPU
> affinity or its vector, we need to update the associated IRTE accordingly.
> 
> - Update posted-interrupt descriptor during vCPU scheduling
> 
> The basic idea here is:
> 1. When vCPU's state is RUNSTATE_running,
>         - Set 'NV' to 'posted_intr_vector'.
>         - Clear 'SN' to accept posted-interrupts.
>         - Set 'NDST' to the pCPU on which the vCPU will be running.
> 2. When vCPU's state is RUNSTATE_blocked,
>         - Set 'NV' to 'pi_wakeup_vector', so we can wake up the
>           related vCPU when a posted-interrupt happens for it.
>           Please refer to the above section about the new global vector.
>         - Clear 'SN' to accept posted-interrupts.
> 3. When vCPU's state is RUNSTATE_runnable/RUNSTATE_offline,
>         - Set 'SN' to suppress non-urgent interrupts
>           (currently, we only support non-urgent interrupts).
>           When the vCPU is in RUNSTATE_runnable or RUNSTATE_offline,
>           there is no need to accept posted-interrupt notification events:
>           since we don't change the scheduler's behavior when the interrupt
>           occurs, we still need to wait for the vCPU to be scheduled again.
>           When external interrupts from assigned devices occur, they
>           are recorded in PIR and will be synced to IRR before VM-Entry.
>         - Set 'NV' to 'posted_intr_vector'.
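
The three cases above can be sketched as a pure function on the 64-bit
control word. This is an illustration only: the enum is a stand-in for Xen's
RUNSTATE_* values, the two vector numbers are placeholders, the bit positions
assume LSB-first bit-field allocation, and a real implementation would have
to update the word atomically because hardware can modify 'ON' concurrently.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for Xen's RUNSTATE_* constants. */
enum runstate {
    RUNSTATE_running, RUNSTATE_runnable, RUNSTATE_blocked, RUNSTATE_offline
};

/* Placeholder vector numbers, chosen only for this sketch. */
#define POSTED_INTR_VECTOR 0xf2
#define PI_WAKEUP_VECTOR   0xf1

#define PI_SN         (1ULL << 1)
#define PI_NV_SHIFT   16
#define PI_NV_MASK    (0xffULL << PI_NV_SHIFT)
#define PI_NDST_SHIFT 32
#define PI_NDST_MASK  (0xffffffffULL << PI_NDST_SHIFT)

static uint64_t set_nv(uint64_t control, uint8_t nv)
{
    return (control & ~PI_NV_MASK) | ((uint64_t)nv << PI_NV_SHIFT);
}

static uint64_t set_ndst(uint64_t control, uint32_t apic_id)
{
    return (control & ~PI_NDST_MASK) | ((uint64_t)apic_id << PI_NDST_SHIFT);
}

/* Return the control word the descriptor should hold in the given state. */
static uint64_t pi_control_for_runstate(uint64_t control, enum runstate state,
                                        uint32_t dest_apic_id)
{
    switch (state) {
    case RUNSTATE_running:
        control = set_nv(control, POSTED_INTR_VECTOR);
        control &= ~PI_SN;
        control = set_ndst(control, dest_apic_id);
        break;
    case RUNSTATE_blocked:
        control = set_nv(control, PI_WAKEUP_VECTOR);
        control &= ~PI_SN;
        break;
    case RUNSTATE_runnable:
    case RUNSTATE_offline:
        control |= PI_SN;
        control = set_nv(control, POSTED_INTR_VECTOR);
        break;
    }
    return control;
}
```
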
> 
> - How to wake up a blocked vCPU when an interrupt is posted for it (wakeup
> notification handler).
> 
> Here is the scenario for the usage of the new global vector:
> 
> 1. vCPU0 is running on pCPU0.
> 2. vCPU0 is blocked and vCPU1 is currently running on pCPU0.
> 3. An external interrupt from an assigned device occurs for vCPU0. If we
> still use 'posted_intr_vector' as the notification vector for vCPU0, the
> notification event for vCPU0 (the event will go to pCPU0) will be consumed
> by vCPU1 incorrectly (remember this is a special vector to the CPU). In the
> worst case, vCPU0 will never be woken up again, since the wakeup event
> for it is always consumed by other vCPUs incorrectly. So we need to
> introduce another global vector, named 'pi_wakeup_vector', to wake up the
> blocked vCPU.
> 
> Once 'pi_wakeup_vector' is used for vCPU0, the VT-d engine issues the
> notification event using this new vector. This new vector is not a SPECIAL
> one to the CPU; it is just a normal vector. The CPU simply receives a normal
> external interrupt, so we get control in the handler of this new vector.
> There the hypervisor can take action, such as waking up the blocked vCPU.
> 
> Here is what we do for the blocked vCPU:
> 1. Define a per-cpu list 'blocked_vcpu_on_cpu', which stores the vCPUs
> blocked on that pCPU.
> 2. When the vCPU's state changes to RUNSTATE_blocked, insert the vCPU
> into the per-cpu list belonging to the pCPU it was running on.
> 3. When the vCPU is unblocked, remove the vCPU from the related pCPU list.
> 
> In the handler of 'pi_wakeup_vector', we do:
> 1. Get the physical CPU.
> 2. Iterate over the list 'blocked_vcpu_on_cpu' of the current pCPU; for
> each entry with 'ON' set, unblock the associated vCPU.
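
A simplified model of the list handling and the handler steps above. The
types and the functions pi_block(), pi_unblock() and pi_wakeup_handler() are
hypothetical, a plain array stands in for real per-cpu data, and locking is
left out entirely, so this only shows the control flow.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NR_PCPUS 4
#define PI_ON (1ULL << 0)   /* assumed position of 'on' in 'control' */

struct vcpu_sketch {
    uint64_t pi_control;        /* control word of this vCPU's descriptor */
    bool blocked;
    struct vcpu_sketch *next;   /* link in the per-pCPU blocked list */
};

/* One singly linked list of blocked vCPUs per pCPU. */
static struct vcpu_sketch *blocked_vcpu_on_cpu[NR_PCPUS];

/* vCPU enters RUNSTATE_blocked: add it to the list of its last pCPU. */
static void pi_block(struct vcpu_sketch *v, unsigned int cpu)
{
    assert(cpu < NR_PCPUS);
    v->blocked = true;
    v->next = blocked_vcpu_on_cpu[cpu];
    blocked_vcpu_on_cpu[cpu] = v;
}

/* vCPU is unblocked: unlink it from the list it was queued on. */
static void pi_unblock(struct vcpu_sketch *v, unsigned int cpu)
{
    struct vcpu_sketch **pp;

    assert(cpu < NR_PCPUS);
    for (pp = &blocked_vcpu_on_cpu[cpu]; *pp; pp = &(*pp)->next) {
        if (*pp == v) {
            *pp = v->next;
            break;
        }
    }
    v->next = NULL;
    v->blocked = false;
}

/* Handler body: wake every blocked vCPU on this pCPU whose ON bit is set. */
static unsigned int pi_wakeup_handler(unsigned int cpu)
{
    struct vcpu_sketch *v, *next;
    unsigned int woken = 0;

    assert(cpu < NR_PCPUS);
    for (v = blocked_vcpu_on_cpu[cpu]; v; v = next) {
        next = v->next;   /* saved first, since pi_unblock() unlinks v */
        if (v->pi_control & PI_ON) {
            pi_unblock(v, cpu);
            woken++;
        }
    }
    return woken;
}
```
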
> 
> - New Xen boot command-line option that lets the user control the VT-d PI
> feature.
> 
> Like 'intremap' for interrupt remapping, we add a new boot command-line
> option 'intpost' for posted-interrupts.
> 
> - Multicast/broadcast and lowest priority interrupts consideration.
> 
> With VT-d PI, the destination vCPU information of an external interrupt
> from assigned devices is stored in the IRTE, which leads to the following
> design considerations:
> 1. Multicast/broadcast interrupts cannot be posted.
> 2. For lowest-priority interrupts, newer Intel CPUs/chipsets/root-complexes
> (starting from Nehalem) ignore the TPR value and instead support two other
> ways (configurable by the BIOS) of handling lowest-priority interrupts:
>       A) Round robin: In this method, the chipset simply delivers
> lowest-priority interrupts in a round-robin manner across all the available
> logical CPUs. While this provides good load balancing, it is not always the
> best thing to do, as interrupts from the same device (like a NIC) will start
> running on all the CPUs, thrashing caches and taking locks. This led to the
> next scheme.
>       B) Vector hashing: In this method, hardware applies a hash function
> to the vector value in the interrupt request and uses that hash to pick a
> logical CPU to route the lowest-priority interrupt to. This way, a given
> vector always goes to the same logical CPU, avoiding the thrashing problem
> above.
> 
> So the gist of the above is that lowest-priority interrupts have never been
> delivered as "lowest priority" in physical hardware.
> 
> I will emulate vector hashing for posted interrupts in Xen.
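
One possible shape for the emulation, purely as a sketch: the design does not
fix the hash function, so taking the vector modulo the number of candidate
vCPUs is an assumption made here, and pick_dest_vcpu() with its 32-bit
candidate mask is a hypothetical interface.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Pick the destination for a lowest-priority interrupt among the candidate
 * vCPUs in 'dest_mask' (bit i set means vCPU i is a candidate). The hash is
 * simply vector % number_of_candidates, so a given vector always lands on
 * the same vCPU for a given mask. Returns the vCPU index, or -1 if the
 * mask is empty.
 */
static int pick_dest_vcpu(uint32_t dest_mask, uint8_t vector)
{
    unsigned int count = 0, idx, i;

    for (i = 0; i < 32; i++)
        if (dest_mask & (1u << i))
            count++;

    if (count == 0)
        return -1;

    idx = vector % count;

    /* Walk to the idx-th set bit of the mask. */
    for (i = 0; i < 32; i++) {
        if (!(dest_mask & (1u << i)))
            continue;
        if (idx-- == 0)
            return (int)i;
    }
    return -1;
}
```
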
> 
> ================================
> 
> Any comments about this design are highly appreciated!
> 
> Thanks,
> Feng
