
Re: [Xen-devel] (v2) VT-d Posted-interrupt (PI) design for XEN




> -----Original Message-----
> From: Konrad Rzeszutek Wilk [mailto:konrad.wilk@xxxxxxxxxx]
> Sent: Friday, March 20, 2015 3:12 AM
> To: Wu, Feng
> Cc: xen-devel@xxxxxxxxxxxxx; Zhang, Yang Z; Tian, Kevin; Keir Fraser
> (keir@xxxxxxx); Jan Beulich (JBeulich@xxxxxxxx)
> Subject: Re: [Xen-devel] (v2) VT-d Posted-interrupt (PI) design for XEN
> 
> On Thu, Mar 19, 2015 at 03:03:55AM +0000, Wu, Feng wrote:
> > Thanks for the comments!
> >
> > > -----Original Message-----
> > > From: Konrad Rzeszutek Wilk [mailto:konrad.wilk@xxxxxxxxxx]
> > > Sent: Thursday, March 19, 2015 12:10 AM
> > > To: Wu, Feng
> > > Cc: xen-devel@xxxxxxxxxxxxx; Zhang, Yang Z; Tian, Kevin; Keir Fraser
> > > (keir@xxxxxxx); Jan Beulich (JBeulich@xxxxxxxx)
> > > Subject: Re: [Xen-devel] (v2) VT-d Posted-interrupt (PI) design for XEN
> > >
> > > On Wed, Mar 18, 2015 at 12:44:21PM +0000, Wu, Feng wrote:
> > > > VT-d Posted-interrupt (PI) design for XEN
> > > >
> > > > Background
> > > > ==========
> > > > > With the development of virtualization, there are more and more device
> > > > > assignment requirements. However, today when a VM is running with
> > > > > assigned devices (such as a NIC), external interrupt handling for the
> > > > > assigned devices always needs VMM intervention.
> > > >
> > > > > VT-d Posted-interrupt is an enhanced method to handle interrupts
> > > > in the virtualization environment. Interrupt posting is the process by
> > > > which an interrupt request is recorded in a memory-resident
> > > > posted-interrupt-descriptor structure by the root-complex, followed by
> > > > an optional notification event issued to the CPU complex.
> > > >
> > > > With VT-d Posted-interrupt we can get the following advantages:
> > > > - Direct delivery of external interrupts to running vCPUs without VMM
> > > > intervention
> > >
> > >
> > > > I haven't dug deep into what Xen has currently - but I would assume that
> > > > this is exactly what we have now in Xen?
> >
> > Here is what Xen currently does for external interrupts from assigned
> > devices:
> >
> > When a VM is running and an external interrupt from an assigned device
> > occurs for it, a VM-Exit happens, then:
> >
> > vmx_do_extint() --> do_IRQ() --> __do_IRQ_guest() --> hvm_do_IRQ_dpci() -->
> > raise_softirq_for(pirq_dpci) --> raise_softirq(HVM_DPCI_SOFTIRQ)
> >
> > softirq HVM_DPCI_SOFTIRQ is bound to dpci_softirq()
> >
> > dpci_softirq() --> hvm_dirq_assist() --> vmsi_deliver_pirq() -->
> > vmsi_deliver() --> vmsi_inj_irq() --> vlapic_set_irq()
> 
> <nods> This would be fantastic to put in the design document to help
> people make sure that their expectations are in line.

Sure!

> 
> >
> > vlapic_set_irq() does the following things:
> > 1. If CPU-side posted-interrupt is supported (I think it has been supported
> > since Xen 4.3 or Xen 4.4; sorry, I don't quite remember the exact version),
> > call vmx_deliver_posted_intr() to deliver the virtual interrupt via the
> > posted-interrupt infrastructure.
> 
> The benefit is that if an interrupt comes for VCPU0 instead of
> VCPU1 we can inject the interrupt in the VCPU1 without having it
> do a VMEXIT.
>
> However if we pin the vCPUs, then CPU-side posted interrupts do not
> help - we still have to process the interrupt in the Xen hypervisor.
> 
> > 2. Else, if CPU-side posted-interrupt is not supported, set the related vIRR
> > in the vLAPIC page and call vcpu_kick() to kick the related vCPU. Before
> > VM-Entry, vmx_intr_assist() will help to inject the interrupt to guests.
> >
> > However, after VT-d PI is supported, when a guest is running in non-root mode
> > and an external interrupt from an assigned device occurs for it, _no_ VM-Exit
> > is needed; the guest can handle this totally in non-root mode, thus avoiding
> > all the above code flow.
> 
> <nods> However it does require for Linux PVHVM guests to not use the
> vector callback mechanism - or rather - not use the event mechanism.
> 
> What you require for this to work on the Linux side is for the PCIe
> device to use the 'baremetal' mechanism to setup MSIs (program the
> IOAPIC, etc). It would be worth mentioning this in the document too.

Thanks for the suggestion. In fact, there is some information about this in
this design doc; please refer to the section "Update IRTE when guest modifies
the interrupt configuration (MSI/MSIx configuration)".

When guests update the MSI/MSIx information, Xen will get control and the guest
interrupt information will get updated in the related IRTE.
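
To illustrate what "updated in the related IRTE" could look like, here is a
rough sketch in C. It is only an illustration, not the actual patch: the names
irte_post and irte_set_posted() are made up for this mail, and I use explicit
shifts instead of the bit-fields so the layout does not depend on the compiler.
The bit positions are derived from the 'lo_intpost' / 'hi_intpost' layout in
the design.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the 128-bit IRTE; not the Xen structure. */
struct irte_post {
    uint64_t lo;
    uint64_t hi;
};

/* Bit positions derived from 'lo_intpost' / 'hi_intpost' in the design. */
#define IRTE_P            (1ULL << 0)
#define IRTE_URG          (1ULL << 14)
#define IRTE_IM           (1ULL << 15)  /* 1 = posted format */
#define IRTE_VEC_SHIFT    16
#define IRTE_PDA_L_SHIFT  38            /* 26 bits: descriptor address 31:6 */
#define IRTE_PDA_H_SHIFT  32            /* 32 bits: descriptor address 63:32 */

/*
 * Fill in the posted-format fields after the guest reprograms MSI/MSI-X.
 * pi_desc_maddr is the machine address of the target vCPU's descriptor.
 * SID/SQ/SVT in the low 20 bits of 'hi' are left untouched.
 */
static void irte_set_posted(struct irte_post *e, uint64_t pi_desc_maddr,
                            uint8_t guest_vector, int urgent)
{
    assert(e != 0);
    assert((pi_desc_maddr & 63) == 0);  /* descriptor is 64-byte aligned */

    e->lo = IRTE_P | IRTE_IM
          | (urgent ? IRTE_URG : 0)
          | ((uint64_t)guest_vector << IRTE_VEC_SHIFT)
          | (((pi_desc_maddr >> 6) & 0x3ffffffULL) << IRTE_PDA_L_SHIFT);
    e->hi = (e->hi & 0xfffffULL)
          | ((pi_desc_maddr >> 32) << IRTE_PDA_H_SHIFT);
}
```

A real implementation would also have to write the 128-bit entry atomically and
flush the interrupt entry cache, which this sketch leaves out.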

> 
> >
> > >
> > > > Hm, actually we seem to be still invoking the hypervisor on the
> > > > interrupts - except that if we need to dispatch it to another CPU
> > > > using a normal vector to do so - which would still cause the
> > > > hypervisor to be invoked? Or does it actually go straight in the
> > > > guest?
> > >
> >
> > As I mentioned above, if the guest is running, we don't need to invoke the
> > hypervisor.
> >
> > > So what kind of support do we currently have in Xen from posted
> > > interrupt? Could you add a bit about this in the background please?
> >
> > Good suggestion.
> >
> > Currently, Xen only supports the CPU-side posted-interrupt. As I mentioned
> > above, function vlapic_set_irq() can use this to deliver virtual interrupts.
> > Basically, there are several methods to deliver virtual interrupts to guests:
> > - Event delivery before VM-Entry via __vmx_inject_exception(); this is the
> >   oldest way.
> > - After APICv was enabled, we had hardware support for virtual interrupt
> >   delivery: virtual interrupts are stored in the virtual LAPIC page, and
> >   after VM-Entry, guests can evaluate these virtual interrupts and handle
> >   them in non-root mode.
> > - As an enhancement to APICv, CPU-side posted-interrupt was introduced. As
> >   commented above, with this new feature we don't need to kick the vCPU;
> >   the virtual interrupts are delivered directly to it.
> >
> > About APICv and CPU-side Posted-interrupt, please refer to Chapter 29, and
> > Section 29.6 in the Intel SDM:
> > http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf
> >
> > >
> > > > > - Decrease the interrupt migration complexity. On vCPU migration, software
> > > > > can atomically co-migrate all interrupts targeting the migrating vCPU. For
> > > > > virtual machines with assigned devices, migrating a vCPU across pCPUs
> > > > > either incurs the overhead of forwarding interrupts in software (e.g. via
> > > > > VMM-generated IPIs), or the complexity of independently migrating each
> > > > > interrupt targeting the vCPU to the new pCPU. However, after enabling VT-d
> > > > > PI, the destination vCPU of an external interrupt from assigned devices is
> > > > > stored in the IRTE (i.e. Posted-interrupt Descriptor Address); when the
> > > > > vCPU is migrated to another pCPU, we will set this new pCPU in the 'NDST'
> > > > > field of the Posted-interrupt descriptor, which makes the interrupt
> > > > > migration automatic.
> > > >
> > > >
> > > > Posted-interrupt Introduction
> > > > ========================
> > > > There are two components to the Posted-interrupt architecture:
> > > > Processor Support and Root-Complex Support
> > > >
> > > > - Processor Support
> > > > Posted-interrupt processing is a feature by which a processor processes
> > > > the virtual interrupts by recording them as pending on the virtual-APIC
> > > > page.
> > > >
> > > > > Posted-interrupt processing is enabled by setting the "process posted
> > > > > interrupts" VM-execution control. The processing is performed in response
> > > > > to the arrival of an interrupt with the posted-interrupt notification
> > > > > vector. In response to such an interrupt, the processor processes virtual
> > > > > interrupts recorded in a data structure called a posted-interrupt
> > > > > descriptor.
> > > > >
> > > > > For more information about APICv and CPU-side Posted-interrupt, please
> > > > > refer to Chapter 29, and Section 29.6 in the Intel SDM:
> > > > > http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf
> > > >
> > > > - Root-Complex Support
> > > > > Interrupt posting is the process by which an interrupt request (from IOAPIC
> > > > > or MSI/MSIx capable sources) is recorded in a memory-resident
> > > > > posted-interrupt-descriptor structure by the root-complex, followed by
> > > > > an optional notification event issued to the CPU complex. The interrupt
> > > > > request arriving at the root-complex carries the identity of the interrupt
> > > > > request source and a 'remapping-index'. The remapping-index is used to
> > > > > look up an entry from the memory-resident interrupt-remap-table. Unlike
> > > > > with interrupt-remapping, the interrupt-remap-table-entry for a
> > > > > posted-interrupt specifies a virtual-vector and a pointer to the
> > > > > posted-interrupt descriptor. The virtual-vector specifies the vector of
> > > > > the interrupt to be recorded in the posted-interrupt descriptor. The
> > > > > posted-interrupt descriptor hosts storage for the virtual-vectors and
> > > > > contains the attributes of the notification event (interrupt) to be issued
> > > > > to the CPU complex to inform CPU/software about pending interrupts recorded
> > > > > in the posted-interrupt descriptor.
> > > > >
> > > > > For more information about VT-d PI, please refer to
> > > > > http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/vt-directed-io-spec.html
> > > >
> > > > Important Definitions
> > > > ==================
> > > > There are some changes to IRTE and posted-interrupt descriptor after
> > > > VT-d PI is introduced:
> > >
> > > s/is/was/
> > >
> > > > IRTE:
> > > > > Posted-interrupt Descriptor Address: the address of the posted-interrupt
> > > > > descriptor
> > > > > Virtual Vector: the guest vector of the interrupt
> > > > > URG: indicates if the interrupt is urgent
> > > > >
> > > > > Posted-interrupt descriptor:
> > > > > The Posted Interrupt Descriptor hosts the following fields:
> > > > > Posted Interrupt Request (PIR): Provides storage for posting (recording)
> > > > > interrupts (one bit per vector, for up to 256 vectors).
> > > > >
> > > > > Outstanding Notification (ON): Indicates if there is a notification event
> > > > > outstanding (not processed by processor or software) for this Posted
> > > > > Interrupt Descriptor. When this field is 0, hardware modifies it from 0 to
> > > > > 1 when generating a notification event, and the entity receiving the
> > > > > notification event (processor or software) resets it as part of posted
> > > > > interrupt processing.
> > > > >
> > > > > Suppress Notification (SN): Indicates if a notification event is to be
> > > > > suppressed (not generated) for non-urgent interrupt requests (interrupts
> > > > > processed through an IRTE with URG=0).
> > > > >
> > > > > Notification Vector (NV): Specifies the vector for the notification event
> > > > > (interrupt).
> > > > >
> > > > > Notification Destination (NDST): Specifies the physical APIC-ID of the
> > > > > destination logical processor for the notification event.
> > > >
> > > > Design Overview
> > > > ==============
> > > > In this design, we will cover the following items:
> > > > > 1. Add a variable to control whether to enable VT-d posted-interrupt or not.
> > > > > 2. VT-d PI feature detection.
> > > > > 3. Extend posted-interrupt descriptor structure to cover VT-d PI specific stuff.
> > >
> > > stuff? Perhaps features?
> > > > > 4. Extend IRTE structure to support VT-d PI.
> > > > > 5. Introduce a new global vector which is used for waking up the blocked vCPU.
> > > > > 6. Update IRTE when guest modifies the interrupt configuration (MSI/MSIx
> > > > >    configuration).
> > > > > 7. Update posted-interrupt descriptor during vCPU scheduling (when the state
> > > > >    of the vCPU transitions among RUNSTATE_running / RUNSTATE_blocked /
> > > > >    RUNSTATE_runnable / RUNSTATE_offline).
> > > > > 8. How to wakeup blocked vCPU when an interrupt is posted for it (wakeup
> > > > >    notification handler).
> > > > > 9. New boot command line for Xen, which controls VT-d PI feature by user.
> > > > > 10. Multicast/broadcast and lowest priority interrupts consideration.
> > > >
> > > >
> > > > Implementation details
> > > > ===================
> > > > - New variable to control VT-d PI
> > > >
> > > > > Like variable 'iommu_intremap' for interrupt remapping, it is very
> > > > > straightforward to add a new one 'iommu_intpost' for posted-interrupt.
> > > > > 'iommu_intpost' is set only when interrupt remapping and VT-d
> > > > > posted-interrupt are both enabled.
> > > > >
> > > > > - VT-d PI feature detection.
> > > > > Bit 59 in VT-d Capability Register is used to report VT-d Posted-interrupt
> > > > > support.
> > > > >
> > > > > - Extend posted-interrupt descriptor structure to cover VT-d PI specific stuff.
> > > > Here is the new structure for posted-interrupt descriptor:
> > > >
> > > > struct pi_desc {
> > > >      DECLARE_BITMAP(pir, NR_VECTORS);
> > > >      union {
> > > >         struct
> > > >         {
> > > >         u64 on     : 1,
> > > >             sn     : 1,
> > > >             rsvd_1 : 13,
> > > >             ndm    : 1,
> > > >             nv     : 8,
> > > >             rsvd_2 : 8,
> > > >             ndst   : 32;
> > > >         };
> > > >         u64 control;
> > > >     };
> > > >     u32 rsvd[6];
> > > >  } __attribute__ ((aligned (64)));
> > > >
> > > > - Extend IRTE structure to support VT-d PI.
> > > >
> > > > Here is the new structure for IRTE:
> > > > /* interrupt remap entry */
> > > > struct iremap_entry {
> > > >   union {
> > > >     u64 lo_val;
> > > >     struct {
> > > >         u64 p       : 1,
> > > >             fpd     : 1,
> > > >             dm      : 1,
> > > >             rh      : 1,
> > > >             tm      : 1,
> > > >             dlm     : 3,
> > > >             avail   : 4,
> > > >             res_1   : 4,
> > > >             vector  : 8,
> > > >             res_2   : 8,
> > > >             dst     : 32;
> > > >     }lo;
> > > >     struct {
> > > >         u64 p       : 1,
> > > >             fpd     : 1,
> > > >             res_1   : 6,
> > > >             avail   : 4,
> > > >             res_2   : 2,
> > > >             urg     : 1,
> > > >             im      : 1,
> > > >             vector  : 8,
> > > >             res_3   : 14,
> > > >             pda_l   : 26;
> > > >     }lo_intpost;
> > > >   };
> > > >   union {
> > > >     u64 hi_val;
> > > >     struct {
> > > >         u64 sid     : 16,
> > > >             sq      : 2,
> > > >             svt     : 2,
> > > >             res_1   : 44;
> > > >     }hi;
> > > >     struct {
> > > >         u64 sid     : 16,
> > > >             sq      : 2,
> > > >             svt     : 2,
> > > >             res_1   : 12,
> > > >             pda_h   : 32;
> > > >     }hi_intpost;
> > > >   };
> > > > };
> > > >
> > > > > - Introduce a new global vector which is used to wake up the blocked vCPU.
> > > > >
> > > > > Currently, there is a global vector 'posted_intr_vector', which is used as the
> > >
> > > s/Currently/In Xen 4.6 and earlier/
> > > > > global notification vector for all vCPUs in the system. This vector is
> > > > > stored in the VMCS, and the CPU considers it a _special_ vector, using it
> > > > > to notify the related pCPU when an interrupt is recorded in the
> > > > > posted-interrupt descriptor.
> > > > >
> > > > > This existing global vector is a _special_ vector to the CPU; the CPU
> > > > > handles it in a _special_ way compared to normal vectors. Please refer to
> > > > > Section 29.6 in the Intel SDM
> > > > > http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf
> > > > > for more information about how the CPU handles it.
> > > > >
> > > > > With VT-d PI, the VT-d engine can issue a notification event when the
> > > > > assigned devices issue interrupts. We need to add a new global vector to
> > > > > wake up the blocked vCPU; please refer to the later section in this design
> > > > > for how to use this new global vector.
> > >
> > > > Ah, so this is what Xen has right now - and the changes that this design
> > > > outlines deal with blocked guests.
> >
> > No, this is what I add for enabling VT-d PI. We discussed a lot about this
> > new global vector and its usage scenario after posting version 1 of this
> > design. Do you have any question about this?
> 
> No, you clarified it in your answers to my questions! thank you.
> >
> > > >
> > > > > - Update IRTE when guest modifies the interrupt configuration (MSI/MSIx
> > > > > configuration).
> > > > > After VT-d PI is introduced, the format of IRTE is changed as follows:
> > > > >         Descriptor Address: the address of the posted-interrupt descriptor
> > > > >         Virtual Vector: the guest vector of the interrupt
> > > > >         URG: indicates if the interrupt is urgent
> > > > >         Other fields continue to have the same meaning
> > > > >
> > > > > 'Descriptor Address' tells the destination vCPU of this interrupt, since
> > > > > each vCPU has a dedicated posted-interrupt descriptor.
> > > > >
> > > > > 'Virtual Vector' tells the guest vector of the interrupt.
> > > > >
> > > > > When the guest changes the configuration of the interrupts, such as the
> > > > > CPU affinity or the vector, we need to update the associated IRTE
> > > > > accordingly.
> > > >
> > > > - Update posted-interrupt descriptor during vCPU scheduling
> > > >
> > > > The basic idea here is:
> > > > 1. When vCPU's state is RUNSTATE_running,
> > > >         - Set 'NV' to 'posted_intr_vector'.
> > > >         - Clear 'SN' to accept posted-interrupts.
> > > >         - Set 'NDST' to the pCPU on which the vCPU will be running.
> > > > 2. When vCPU's state is RUNSTATE_blocked,
> > > >         - Set 'NV' to ' pi_wakeup_vector ', so we can wake up the
> > > >           related vCPU when posted-interrupt happens for it.
> > > >           Please refer to the above section about the new global
> vector.
> > > >         - Clear 'SN' to accept posted-interrupts
> > > > > 3. When vCPU's state is RUNSTATE_runnable/RUNSTATE_offline,
> > > > >         - Set 'SN' to suppress non-urgent interrupts
> > > > >           (Currently, we only support non-urgent interrupts)
> > > > >          When vCPU is in RUNSTATE_runnable or RUNSTATE_offline,
> > > > >          it does not need to accept posted-interrupt notification events,
> > > > >          since we don't change the behavior of the scheduler when the
> > > > >          interrupt occurs, we still need wait the next scheduling of the vCPU.
> > >
> > > still need to wait for the next..
> > > >          When external interrupts from assigned devices occur, the
> > > interrupts
> > > >          are recorded in PIR, and will be synced to IRR before
> VM-Entry.
> > > >         - Set 'NV' to 'posted_intr_vector'.
> > > >
> > > > > - How to wakeup blocked vCPU when an interrupt is posted for it (wakeup
> > > > > notification handler).
> > > > >
> > > > > Here is the scenario for the usage of the new global vector:
> > > > >
> > > > > 1. vCPU0 is running on pCPU0
> > > > > 2. vCPU0 is blocked and vCPU1 is currently running on pCPU0
> > > > > 3. An external interrupt from an assigned device occurs for vCPU0. If we
> > > > > still use 'posted_intr_vector' as the notification vector for vCPU0, the
> > > > > notification event for vCPU0 (the event will go to pCPU0) will be consumed
> > > > > by vCPU1 incorrectly (remember this is a special vector to the CPU). The
> > > > > worst case is that vCPU0 will never be woken up again, since the wakeup
> > > > > event for it is always consumed by other vCPUs incorrectly. So we need to
> > > > > introduce another global vector, named 'pi_wakeup_vector', to wake up the
> > > > > blocked vCPU.
> > > > >
> > > > > After using 'pi_wakeup_vector' for vCPU0, the VT-d engine will issue the
> > > > > notification event using this new vector. This new vector is not a SPECIAL
> > > > > one to the CPU; it is just a normal vector. The CPU just receives a normal
> > > > > external interrupt, and we get control in the handler of this new vector.
> > > > > In this case, the hypervisor can do something in it, such as waking up the
> > > > > blocked vCPU.
> > > >
> > > > > Here is what we do for the blocked vCPU:
> > > > > 1. Define a per-cpu list 'blocked_vcpu_on_cpu', which stores the blocked
> > > > > vCPUs on the pCPU.
> > > > > 2. When the vCPU's state is changed to RUNSTATE_blocked, insert the vCPU
> > > > > into the per-cpu list belonging to the pCPU it was running on.
> > > > > 3. When the vCPU is unblocked, remove the vCPU from the related pCPU list.
> > > > >
> > > > > In the handler of 'pi_wakeup_vector', we do:
> > > > > 1. Get the physical CPU.
> > > > > 2. Iterate the list 'blocked_vcpu_on_cpu' of the current pCPU; if 'ON' is
> > > > > set, we unblock the associated vCPU.
> > > >
> > > > - New boot command line for Xen, which controls VT-d PI feature by user.
> > > >
> > > > Like 'intremap' for interrupt remapping, we add a new boot command line
> > > > 'intpost' for posted-interrupts.
> > >
> > > Earlier you mentioned "iommu_intpost" ?
> >
> > 'intpost' is a Xen command line parameter, while 'iommu_intpost' is a
> > variable in the code, just like 'intremap' and 'iommu_intremap'.
> 
> Why not piggyback on 'iommu' ? It might be worth mentioning the
> reasoning why you choose a new name instead of adding new options for
> the 'iommu'.

Oh, sorry, there is a mistake in my previous description. In fact, 'intpost' is
an option for the 'iommu' command line, just like 'intremap'.
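
As a rough illustration of how such a sub-option could be handled, here is a
small sketch in C. It is not the Xen parser: the struct iommu_opts and the
function parse_iommu_opts() are hypothetical, and the "no-" prefix handling is
an assumption of this sketch. The last step reflects the rule from the design
that 'iommu_intpost' is only set when interrupt remapping is enabled as well.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical holder for the two flags discussed in this thread. */
struct iommu_opts {
    bool intremap;
    bool intpost;
};

/* Parse a comma-separated list such as "intremap,intpost" or "no-intpost". */
static void parse_iommu_opts(const char *s, struct iommu_opts *o)
{
    assert(s != NULL && o != NULL);

    while (*s != '\0') {
        size_t len = strcspn(s, ",");
        const char *tok = s;
        size_t tlen = len;
        bool val = true;

        /* Assumed convention: a "no-" prefix turns the option off. */
        if (tlen >= 3 && strncmp(tok, "no-", 3) == 0) {
            val = false;
            tok += 3;
            tlen -= 3;
        }

        if (tlen == 8 && strncmp(tok, "intremap", 8) == 0)
            o->intremap = val;
        else if (tlen == 7 && strncmp(tok, "intpost", 7) == 0)
            o->intpost = val;
        /* Unknown tokens are ignored in this sketch. */

        s += len;
        if (*s == ',')
            s++;
    }

    /* Posted interrupts depend on interrupt remapping being enabled. */
    if (!o->intremap)
        o->intpost = false;
}
```
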

Thanks,
Feng

> >
> > Thanks,
> > Feng
> >
> > >
> > > >
> > > > - Multicast/broadcast and lowest priority interrupts consideration.
> > > >
> > > > > With VT-d PI, the destination vCPU information of an external interrupt
> > > > > from assigned devices is stored in the IRTE. This leads to the following
> > > > > design considerations:
> > > > > 1. Multicast/broadcast interrupts cannot be posted.
> > > > > 2. For lowest-priority interrupts, new Intel CPU/Chipset/root-complex
> > > > > (starting from Nehalem) ignore the TPR value, and instead support two other
> > > > > ways (configurable by BIOS) to handle lowest priority interrupts:
> > > > >         A) Round robin: In this method, the chipset simply delivers lowest
> > > > > priority interrupts in a round-robin manner across all the available
> > > > > logical CPUs. While this provides good load balancing, this was not always
> > > > > the best thing to do, as interrupts from the same device (like a NIC) will
> > > > > start running on all the CPUs, thrashing caches and taking locks. This led
> > > > > to the next scheme.
> > > > >         B) Vector hashing: In this method, hardware applies a hash function
> > > > > to the vector value in the interrupt request, and uses that hash to pick a
> > > > > logical CPU to route the lowest priority interrupt. This way, a given
> > > > > vector always goes to the same logical CPU, avoiding the thrashing problem
> > > > > above.
> > > > >
> > > > > So, the gist of the above is that lowest priority interrupts have never
> > > > > been delivered as "lowest priority" in physical hardware.
> > > > >
> > > > > I will emulate vector hashing for posted-interrupt for XEN.
> > > >
> > > > ================================
> > > >
> > > > Any comments about this design are highly appreciated!
> > > >
> > > > Thanks,
> > > > Feng
> > > >
> > > > _______________________________________________
> > > > Xen-devel mailing list
> > > > Xen-devel@xxxxxxxxxxxxx
> > > > http://lists.xen.org/xen-devel

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

