
Re: [Xen-devel] VT-d Posted-interrupt (PI) design for XEN




> -----Original Message-----
> From: Andrew Cooper [mailto:andrew.cooper3@xxxxxxxxxx]
> Sent: Thursday, March 05, 2015 2:48 AM
> To: Wu, Feng; xen-devel@xxxxxxxxxxxxx
> Cc: Zhang, Yang Z; Tian, Kevin; Jan Beulich
> Subject: Re: [Xen-devel] VT-d Posted-interrupt (PI) design for XEN
> 
> On 04/03/15 13:30, Wu, Feng wrote:
> > VT-d Posted-interrupt (PI) design for XEN
> 
> Thank you very much for this!
> 
> >
> > Background
> > ==========
> > With the development of virtualization, there are more and more device
> > assignment requirements. However, today when a VM is running with
> > assigned devices (such as, NIC), external interrupt handling for the
> > assigned devices always needs VMM intervention.
> >
> > VT-d Posted-interrupt is a more enhanced method to handle interrupts
> > in the virtualization environment. Interrupt posting is the process by
> > which an interrupt request is recorded in a memory-resident
> > posted-interrupt-descriptor structure by the root-complex, followed by
> > an optional notification event issued to the CPU complex.
> >
> > With VT-d Posted-interrupt we can get the following advantages:
> > - Directly delivery of external interrupts to running vCPUs without VMM
> > intervention
> > - Decease the interrupt migration complexity. On vCPU migration, software
> > can atomically co-migrate all interrupts targeting the migrating vCPU.
> 
> I presume you mean "Decrease" ?

Yes!

> 
> "Decease" means something quite different.

Sorry for the typo. 

> 
> >
> >
> > Posted-interrupt Introduction
> > ========================
> > There are two components to the Posted-interrupt architecture:
> > Processor Support and Root-Complex Support
> >
> > - Processor Support
> > Posted-interrupt processing is a feature by which a processor processes
> > the virtual interrupts by recording them as pending on the virtual-APIC
> > page.
> >
> > Posted-interrupt processing is enabled by setting the "process posted
> > interrupts" VM-execution control. The processing is performed in response
> > to the arrival of an interrupt with the posted-interrupt notification
> > vector. In response to such an interrupt, the processor processes virtual
> > interrupts recorded in a data structure called a posted-interrupt
> > descriptor.
> >
> > More information about APICv and CPU-side Posted-interrupt, please refer
> > to Chapter 29, and Section 29.6 in the Intel SDM:
> >
> > http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf
> >
> > - Root-Complex Support
> > Interrupt posting is the process by which an interrupt request (from IOAPIC
> > or MSI/MSIx capable sources) is recorded in a memory-resident
> > posted-interrupt-descriptor structure by the root-complex, followed by
> > an optional notification event issued to the CPU complex. The interrupt
> > request arriving at the root-complex carry the identity of the interrupt
> > request source and a 'remapping-index'. The remapping-index is used to
> > look-up an entry from the memory-resident interrupt-remap-table. Unlike
> > with interrupt-remapping, the interrupt-remap-table-entry for a posted-
> > interrupt, specifies a virtual-vector and a pointer to the posted-interrupt
> > descriptor. The virtual-vector specifies the vector of the interrupt to be
> > recorded in the posted-interrupt descriptor. The posted-interrupt descriptor
> > hosts storage for the virtual-vectors and contains the attributes of the
> > notification event (interrupt) to be issued to the CPU complex to inform
> > CPU/software about pending interrupts recorded in the posted-interrupt
> > descriptor.
> >
> > More information about VT-d PI, please refer to
> >
> > http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/vt-directed-io-spec.html
> >
> >
> > Design Overview
> > ==============
> > In this design, we will cover the following items:
> > 1. Add a variant to control whether enable VT-d posted-interrupt or not.
> > 2. VT-d PI feature detection.
> > 3. Extend posted-interrupt descriptor structure to cover VT-d PI specific
> >    stuff.
> > 4. Extend IRTE structure to support VT-d PI.
> > 5. Introduce a new global vector which is used for waking up the HLT'ed
> >    vCPU.
> > 6. Update IRTE when guest modifies the interrupt configuration (MSI/MSIx
> >    configuration).
> > 7. Update posted-interrupt descriptor during vCPU scheduling (when the
> >    state of the vCPU is transmitted among RUNSTATE_running /
> >    RUNSTATE_blocked / RUNSTATE_runnable / RUNSTATE_offline).
> > 8. New boot command line for Xen, which controls VT-d PI feature by user.
> > 9. Multicast/broadcast and lowest priority interrupts consideration.
> >
> >
> > Implementation details
> > ===================
> > - New variant to control VT-d PI
> 
> I know what you are trying to say, but "New variant" does not express
> what you mean.
> 
> "A new control relating to VT-d PI" perhaps?
> 
> > Like variant 'iommu_intremap' for interrupt remapping, it is very
> > straightforward to add a new one 'iommu_intpost' for posted-interrupt.
> > 'iommu_intpost' is set only when interrupt remapping and VT-d
> > posted-interrupt are both enabled.
> 
> I would avoid mixing names such as PI and intpost.  If anything, it
> should be "iommu_postint" to keep the naming consistent.  (Here and
> elsewhere).
> 

My original idea was that 'iommu_intpost' is consistent with 'iommu_intremap'.
We could also call this feature 'interrupt posting', just like 'interrupt
remapping', but I think your suggestion is also good.


> >
> > - VT-d PI feature detection.
> > Bit 59 in VT-d Capability Register is used to report VT-d
> > Posted-interrupt support.
> >
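The detection step above can be sketched as follows. This is only a minimal
illustration: the bit position (59) comes from the text, while the macro and
function names are hypothetical and not taken from the Xen tree. The second
helper mirrors the rule that 'iommu_intpost' is set only when interrupt
remapping is enabled as well.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit 59 of the VT-d Capability Register reports posted-interrupt support. */
#define VTD_CAP_PI_SHIFT 59   /* hypothetical macro name */

static_assert(VTD_CAP_PI_SHIFT < 64, "capability register is 64 bits wide");

/* Hypothetical helper: does the raw capability value advertise VT-d PI? */
static inline bool vtd_cap_has_intpost(uint64_t cap)
{
    return (cap >> VTD_CAP_PI_SHIFT) & 1;
}

/* Posting is usable only if interrupt remapping is enabled too. */
static inline bool vtd_intpost_usable(uint64_t cap, bool intremap_enabled)
{
    return intremap_enabled && vtd_cap_has_intpost(cap);
}
```
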
> > - Extend posted-interrupt descriptor structure to cover VT-d PI specific
> > stuff.
> > Here is the new structure for posted-interrupt descriptor:
> >
> > struct pi_desc {
> >     DECLARE_BITMAP(pir, NR_VECTORS);
> >     union {
> >         struct {
> >             u64 on     : 1,
> >                 sn     : 1,
> >                 rsvd_1 : 13,
> >                 ndm    : 1,
> >                 nv     : 8,
> >                 rsvd_2 : 8,
> >                 ndst   : 32;
> >         };
> >         u64 control;
> >     };
> >     u32 rsvd[6];
> > } __attribute__ ((aligned (64)));
> 
> Is there a pending update to the system programming guide?  According to
> 325384.pdf, only the Outstanding Notification is defined, and all others
> are reserved for software use.
> 
> I however noticed that these fields match up with the description of a
> posted interrupt descriptor in the VT-d spec.  Are they supposed to be
> the same structure in memory used by both the cpu and root complex, or
> independent structures which happen to look very similar?

In 325384.pdf, the format of the posted-interrupt descriptor is the one from
before VT-d PI was introduced. With VT-d PI, we extended the structure to
the format defined in the VT-d spec above.
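
To make the extended layout concrete, here is a standalone sketch of the
descriptor with a size check. It is not the Xen definition: uint64_t and a
plain array stand in for u64 and DECLARE_BITMAP, NR_VECTORS is assumed to be
256, and the helper name is hypothetical. The bit positions in the comment
assume GCC/Clang bit-field allocation on x86 (lowest bit first).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NR_VECTORS 256   /* assumed value */

struct pi_desc {
    uint64_t pir[NR_VECTORS / 64];   /* stand-in for DECLARE_BITMAP(pir, NR_VECTORS) */
    union {
        struct {
            uint64_t on     : 1,     /* outstanding notification */
                     sn     : 1,     /* suppress notification */
                     rsvd_1 : 13,
                     ndm    : 1,     /* notification destination mode */
                     nv     : 8,     /* notification vector */
                     rsvd_2 : 8,
                     ndst   : 32;    /* notification destination */
        };
        uint64_t control;
    };
    uint32_t rsvd[6];
} __attribute__((aligned(64)));

/* 32 bytes of PIR + 8 bytes of control + 24 reserved bytes = 64 bytes. */
static_assert(sizeof(struct pi_desc) == 64, "descriptor must be 64 bytes");
static_assert(offsetof(struct pi_desc, control) == 32, "control follows PIR");

/* Hypothetical helper: build a control word through the bit-field view. */
static inline uint64_t pi_desc_make_control(uint8_t nv, unsigned int sn,
                                            uint32_t ndst)
{
    struct pi_desc d = { .control = 0 };

    d.nv = nv;
    d.sn = sn & 1;
    d.ndst = ndst;
    return d.control;   /* sn at bit 1, nv at bits 16-23, ndst at bits 32-63 */
}
```
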

> 
> >
> > - Extend IRTE structure to support VT-d PI.
> > Here is the new structure for IRTE:
> > /* interrupt remap entry */
> > struct iremap_entry {
> >   union {
> >     u64 lo_val;
> >     struct {
> >         u64 p       : 1,
> >             fpd     : 1,
> >             dm      : 1,
> >             rh      : 1,
> >             tm      : 1,
> >             dlm     : 3,
> >             avail   : 4,
> >             res_1   : 4,
> >             vector  : 8,
> >             res_2   : 8,
> >             dst     : 32;
> >     }lo;
> >     struct {
> >         u64 p       : 1,
> >             fpd     : 1,
> >             res_1   : 6,
> >             avail   : 4,
> >             res_2   : 2,
> >             urg     : 1,
> >             pst     : 1,
> >             vector  : 8,
> >             res_3   : 14,
> >             pda_l   : 26;
> >     }lo_intpost;
> >   };
> >   union {
> >     u64 hi_val;
> >     struct {
> >         u64 sid     : 16,
> >             sq      : 2,
> >             svt     : 2,
> >             res_1   : 44;
> >     }hi;
> >     struct {
> >         u64 sid     : 16,
> >             sq      : 2,
> >             svt     : 2,
> >             res_1   : 12,
> >             pda_h   : 32;
> >     }hi_intpost;
> >   };
> > };
> 
> None of the bitfields contain the IM field (bit 15) which is stated as
> the qualification between the two interpretations of the IRTE.

Oh, I defined this according to an old version of the VT-d PI spec. 'pst' is
in fact the 'IM' bit in the latest spec. I will change this.

> 
> Also, I feel that the structure would be better laid out as:
> 
> struct iremap_entry {
>     union {
>         struct { u64 lo, hi; };
>         struct { <bitfields> } norm; (names subject to improvement)
>         struct { <bitfields> } post;
>     };
> };
> 
> Which does not duplicate the lo and hi u64s in sub-unions.  (This will
> involve some refactoring of the existing code.)

This is a good suggestion. I thought about this before as well, but it needs
some changes to the existing code. It may need more thought as to whether it
is worth it.
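
For reference, the flattened layout might look roughly like the sketch
below. This is an illustration only, not a patch: the view names 'norm' and
'post' follow the suggestion above, 'im' replaces 'pst' as discussed, the
reserved-field names are renumbered to avoid clashes, and the helper is
hypothetical. It relies on C11 anonymous structs/unions and on GCC/Clang
bit-field allocation on x86.

```c
#include <assert.h>
#include <stdint.h>

struct iremap_entry {
    union {
        struct { uint64_t lo, hi; };            /* raw 128-bit view */
        struct {                                /* remapped format */
            uint64_t p : 1, fpd : 1, dm : 1, rh : 1, tm : 1, dlm : 3,
                     avail : 4, res_1 : 4, vector : 8, res_2 : 8, dst : 32;
            uint64_t sid : 16, sq : 2, svt : 2, res_3 : 44;
        } norm;
        struct {                                /* posted format */
            uint64_t p : 1, fpd : 1, res_1 : 6, avail : 4, res_2 : 2,
                     urg : 1, im : 1, vector : 8, res_3 : 14, pda_l : 26;
            uint64_t sid : 16, sq : 2, svt : 2, res_4 : 12, pda_h : 32;
        } post;
    };
};

static_assert(sizeof(struct iremap_entry) == 16, "an IRTE is 128 bits");

/* Hypothetical helper: low qword of a present, posted-format entry. */
static inline uint64_t irte_post_lo(uint8_t vector)
{
    struct iremap_entry e = { .lo = 0, .hi = 0 };

    e.post.p = 1;
    e.post.im = 1;          /* bit 15 selects the posted interpretation */
    e.post.vector = vector;
    return e.lo;
}
```

The raw lo/hi pair stays available for whole-entry copies, while each format
gets a single named view instead of two sub-unions.
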

> 
> >
> > - Introduce a new global vector which is used to wake up the HLT'ed vCPU.
> > Currently, there is a global vector 'posted_intr_vector', which is used
> > as the global notification vector for all vCPUs in the system. This
> > vector is stored in VMCS and CPU considers it as a special vector, uses
> > it to notify the related pCPU when an interrupt is recorded in the
> > posted-interrupt descriptor.
> >
> > After having VT-d PI, VT-d engine can issue notification event when the
> > assigned devices issue interrupts. We need add a new global vector to
> > wakeup the HLT'ed vCPU, please refer to the following scenario for the
> > usage of this new global vector:
> >
> > 1. vCPU0 is running on pCPU0
> > 2. vCPU0 is HLT'ed and vCPU1 is currently running on pCPU0
> 
> I don't understand what you are trying to express with this scenario.
> vCPU0 cannot be running on pCPU0 and also halted with vCPU1 running on
> pCPU0.
> 
> A vCPU is either running, in which case it has an associated pCPU, or it
> is not running and has no specific pCPU affiliation.
> 

Here I just want to show why and when we need the extra global vector.
Please see more explanation about this in the reply to Jan!

Thanks for all the comments!

Thanks,
Feng

> ~Andrew
> 
> > 3. An external interrupt from an assigned device occurs for vCPU0, if we
> > still use 'posted_intr_vector' as the notification vector for vCPU0, the
> > notification event for vCPU0 (the event will go to pCPU1) will be consumed
> > by vCPU1 incorrectly. The worst case is that vCPU0 will never be woken up
> > again since the wakeup event for it is always consumed by other vCPUs
> > incorrectly. So we need to introduce another global vector, named
> > 'pi_wakeup_vector', to wake up the HLT'ed vCPU.
> >
> > - Update IRTE when guest modifies the interrupt configuration (MSI/MSIx
> > configuration).
> > After VT-d PI is introduced, the format of IRTE is changed as follows:
> >     Descriptor Address: the address of the posted-interrupt descriptor
> >     Virtual Vector: the guest vector of the interrupt
> >     URG: indicates if the interrupt is urgent
> >     Other fields continue to have the same meaning
> >
> > 'Descriptor Address' tells the destination vCPU of this interrupt, since
> > each vCPU has a dedicated posted-interrupt descriptor.
> >
> > 'Virtual Vector' tells the guest vector of the interrupt.
> >
> > When guest changes the configuration of the interrupts, such as, the
> > cpu affinity, or the vector, we need to update the associated IRTE
> > accordingly.
> >
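A rough sketch of such an IRTE update, working on the raw low/high qwords
with the bit positions from the 'lo_intpost'/'hi_intpost' layout earlier in
this mail. The function names are hypothetical. The split of the descriptor
address (bits 6-31 into the 26-bit low field, bits 32-63 into the high
field) is an assumption derived from the field widths and the 64-byte
alignment of the descriptor; please check it against the VT-d spec.

```c
#include <assert.h>
#include <stdint.h>

#define IRTE_P           (1ULL << 0)
#define IRTE_IM          (1ULL << 15)   /* posted format selector */
#define IRTE_VEC_SHIFT   16
#define IRTE_PDA_L_SHIFT 38
#define IRTE_PDA_L_MASK  0x3ffffffULL   /* 26 bits */

/* Low qword for a posted entry: present, IM set, guest vector, PDA low. */
static inline uint64_t irte_post_lo(uint64_t pi_desc_addr, uint8_t guest_vec)
{
    assert((pi_desc_addr & 63) == 0);   /* descriptor is 64-byte aligned */
    return IRTE_P | IRTE_IM |
           ((uint64_t)guest_vec << IRTE_VEC_SHIFT) |
           (((pi_desc_addr >> 6) & IRTE_PDA_L_MASK) << IRTE_PDA_L_SHIFT);
}

/* High qword: keep sid/sq/svt (bits 0-19), replace the PDA high half. */
static inline uint64_t irte_post_hi(uint64_t old_hi, uint64_t pi_desc_addr)
{
    return (old_hi & 0xfffffULL) | (pi_desc_addr & 0xffffffff00000000ULL);
}

/* Recover the descriptor address from a posted entry. */
static inline uint64_t irte_get_pda(uint64_t lo, uint64_t hi)
{
    return ((lo >> IRTE_PDA_L_SHIFT) << 6) | (hi & 0xffffffff00000000ULL);
}
```

When the guest retargets an interrupt to another vCPU, only the descriptor
address changes; when it changes the vector, only the vector field does.
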
> > - Update posted-interrupt descriptor during vCPU scheduling
> > The basic idea here is:
> > 1. When vCPU's state is RUNSTATE_running,
> >         - Set 'NV' to 'posted_intr_vector'.
> >         - Clear 'SN' to accept posted-interrupts.
> >         - Set 'NDST' to the pCPU on which the vCPU will be running.
> > 2. When vCPU's state is RUNSTATE_blocked,
> >         - Set 'NV' to 'pi_wakeup_vector', so we can wake up the
> >           related vCPU when posted-interrupt happens for it.
> >           Please refer to the above section about the new global vector.
> >         - Clear 'SN' to accept posted-interrupts
> > 3. When vCPU's state is RUNSTATE_runnable/RUNSTATE_offline,
> >         - Set 'SN' to suppress non-urgent interrupts
> >           (Currently, we only support non-urgent interrupts).
> >           When vCPU is in RUNSTATE_runnable or RUNSTATE_offline,
> >           it is not necessary to accept posted-interrupt notification
> >           events, since we don't change the behavior of the scheduler
> >           when the interrupt occurs; we still need to wait for the next
> >           scheduling of the vCPU. When external interrupts from assigned
> >           devices occur, the interrupts are recorded in PIR, and will be
> >           synced to IRR before VM-Entry.
> >         - Set 'NV' to 'posted_intr_vector'.
> >
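The three cases above could be expressed on the raw control word roughly as
below. This is a sketch under assumptions: the enum is a stand-in for Xen's
RUNSTATE_* constants, the two vector numbers are placeholders, and the
function name is hypothetical. A real implementation would have to apply the
new value atomically (e.g. with cmpxchg), since hardware can set 'ON' in the
same word concurrently.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-ins for Xen's runstate constants. */
enum runstate { RUNSTATE_running, RUNSTATE_runnable,
                RUNSTATE_blocked, RUNSTATE_offline };

#define POSTED_INTR_VECTOR 0xf2ULL   /* placeholder value */
#define PI_WAKEUP_VECTOR   0xf1ULL   /* placeholder value */

#define PI_SN         (1ULL << 1)
#define PI_NV_SHIFT   16
#define PI_NV_MASK    (0xffULL << PI_NV_SHIFT)
#define PI_NDST_SHIFT 32
#define PI_NDST_MASK  (0xffffffffULL << PI_NDST_SHIFT)

static_assert((PI_NV_MASK & PI_NDST_MASK) == 0, "NV and NDST must not overlap");

/* Compute the new control word; 'ON' (bit 0) is left untouched. */
static inline uint64_t pi_control_for_runstate(uint64_t control,
                                               enum runstate rs,
                                               uint32_t dest_pcpu)
{
    control &= ~PI_NV_MASK;

    switch (rs) {
    case RUNSTATE_running:
        control &= ~(PI_SN | PI_NDST_MASK);
        control |= POSTED_INTR_VECTOR << PI_NV_SHIFT;
        control |= (uint64_t)dest_pcpu << PI_NDST_SHIFT;
        break;
    case RUNSTATE_blocked:
        control &= ~PI_SN;                        /* still accept postings */
        control |= PI_WAKEUP_VECTOR << PI_NV_SHIFT;
        break;
    case RUNSTATE_runnable:
    case RUNSTATE_offline:
        control |= PI_SN;                         /* suppress notifications */
        control |= POSTED_INTR_VECTOR << PI_NV_SHIFT;
        break;
    }
    return control;
}
```
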
> > - New boot command line for Xen, which controls VT-d PI feature by user.
> > Like 'intremap' for interrupt remapping, we add a new boot command line
> > 'intpost' for posted-interrupts.
> >
> > - Multicast/broadcast and lowest priority interrupts consideration
> > With VT-d PI, the destination vCPU information of an external interrupt
> > from assigned devices is stored in IRTE, this makes the following
> > consideration of the design:
> > 1. Multicast/broadcast interrupts cannot be posted.
> > 2. For lowest-priority interrupts, new Intel CPU/Chipset/root-complex
> > (starting from Nehalem) ignore the TPR value, and instead support two
> > other ways (configurable by BIOS) to handle lowest priority interrupts:
> >     A) Round robin: In this method, the chipset simply delivers lowest
> > priority interrupts in a round-robin manner across all the available
> > logical CPUs. While this provides good load balancing, this was not
> > always the best thing to do, as interrupts from the same device (like a
> > NIC) will start running on all the CPUs, thrashing caches and taking
> > locks. This led to the next scheme.
> >     B) Vector hashing: In this method, hardware would apply a hash
> > function on the vector value in the interrupt request, and use that hash
> > to pick a logical CPU to route the lowest priority interrupt. This way,
> > a given vector always goes to the same logical CPU, avoiding the
> > thrashing problem above.
> >
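To illustrate the idea behind scheme B only: the hash that real hardware
applies is not described in this thread, so the modulo below is purely a
placeholder, and the function name is hypothetical. The point is just that
the choice depends only on the vector, so a given vector always lands on the
same logical CPU.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Pick the index of the destination among n_dest candidate logical CPUs.
 * Placeholder hash (vector modulo candidate count), not the hardware one.
 */
static inline unsigned int lowest_prio_dest(uint8_t vector, unsigned int n_dest)
{
    assert(n_dest > 0);
    return vector % n_dest;
}
```
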
> > So, gist of above is that, lowest priority interrupts has never been
> > delivered as "lowest priority" in physical hardware.
> >
> > For KVM enabling work of VT-d PI, we divide this into two stage:
> > Stage 1: Only support single-CPU lowest-priority interrupts (configured via
> > /proc/irq or irqbalance). This is simple and clear.
> > Stage 2: After all the patches are merged, I will add the vector hashing
> > support for lowest-priority on VT-d PI.
> >
> > On Xen side, what is your opinion about support lowest-priority interrupts
> > for VT-d PI?
> >
> > ================================
> >
> > Any comments about this design are highly appreciated!
> >
> > Thanks,
> > Feng
> >
> > _______________________________________________
> > Xen-devel mailing list
> > Xen-devel@xxxxxxxxxxxxx
> > http://lists.xen.org/xen-devel
> 




 

