Xen project Mailing List

Re: [Xen-devel] (v2) VT-d Posted-interrupt (PI) design for XEN

To: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>

Date: Mon, 23 Mar 2015 08:04:06 +0000

Accept-language: en-US

Cc: "Tian, Kevin" <kevin.tian@xxxxxxxxx>, "Wu, Feng" <feng.wu@xxxxxxxxx>, "xen-devel@xxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxx>, "Jan Beulich \(JBeulich@xxxxxxxx\)" <JBeulich@xxxxxxxx>, "Zhang, Yang Z" <yang.z.zhang@xxxxxxxxx>, "Keir Fraser \(keir@xxxxxxx\)" <keir@xxxxxxx>

Delivery-date: Mon, 23 Mar 2015 08:04:48 +0000

List-id: Xen developer discussion <xen-devel.lists.xen.org>

Thread-index: AQHQYniQW3pNQ6ZQokmUzanl/lSkUJ0ptc0A

Thread-topic: [Xen-devel] (v2) VT-d Posted-interrupt (PI) design for XEN

> -----Original Message-----
> From: Konrad Rzeszutek Wilk [mailto:konrad.wilk@xxxxxxxxxx]
> Sent: Friday, March 20, 2015 3:12 AM
> To: Wu, Feng
> Cc: xen-devel@xxxxxxxxxxxxx; Zhang, Yang Z; Tian, Kevin; Keir Fraser
> (keir@xxxxxxx); Jan Beulich (JBeulich@xxxxxxxx)
> Subject: Re: [Xen-devel] (v2) VT-d Posted-interrupt (PI) design for XEN
>
> On Thu, Mar 19, 2015 at 03:03:55AM +0000, Wu, Feng wrote:
> > Thanks for the comments!
> >
> > > -----Original Message-----
> > > From: Konrad Rzeszutek Wilk [mailto:konrad.wilk@xxxxxxxxxx]
> > > Sent: Thursday, March 19, 2015 12:10 AM
> > > To: Wu, Feng
> > > Cc: xen-devel@xxxxxxxxxxxxx; Zhang, Yang Z; Tian, Kevin; Keir Fraser
> > > (keir@xxxxxxx); Jan Beulich (JBeulich@xxxxxxxx)
> > > Subject: Re: [Xen-devel] (v2) VT-d Posted-interrupt (PI) design for XEN
> > >
> > > On Wed, Mar 18, 2015 at 12:44:21PM +0000, Wu, Feng wrote:
> > > > VT-d Posted-interrupt (PI) design for XEN
> > > >
> > > > Background
> > > > ==========
> > > > With the development of virtualization, there are more and more device
> > > > assignment requirements. However, today when a VM is running with
> > > > assigned devices (such as a NIC), external interrupt handling for the
> > > > assigned devices always needs VMM intervention.
> > > >
> > > > VT-d Posted-interrupt is an enhanced method to handle interrupts
> > > > in the virtualization environment. Interrupt posting is the process by
> > > > which an interrupt request is recorded in a memory-resident
> > > > posted-interrupt-descriptor structure by the root-complex, followed by
> > > > an optional notification event issued to the CPU complex.
> > > >
> > > > With VT-d Posted-interrupt we can get the following advantages:
> > > > - Direct delivery of external interrupts to running vCPUs without VMM
> > > > intervention
> > >
> > > I haven't dug deep into what Xen has currently - but I would assume that
> > > this is exactly what we have now in Xen?
> >
> > Here is what Xen currently does for external interrupts from assigned
> > devices:
> >
> > When a VM is running and an external interrupt from an assigned device
> > occurs for it, a VM-Exit happens, then:
> >
> > vmx_do_extint() --> do_IRQ() --> __do_IRQ_guest() --> hvm_do_IRQ_dpci() -->
> > raise_softirq_for(pirq_dpci) --> raise_softirq(HVM_DPCI_SOFTIRQ)
> >
> > softirq HVM_DPCI_SOFTIRQ is bound to dpci_softirq()
> >
> > dpci_softirq() --> hvm_dirq_assist() --> vmsi_deliver_pirq() -->
> > vmsi_deliver() --> vmsi_inj_irq() --> vlapic_set_irq()
>
> <nods> This would be fantastic to put in the design document to help
> people make sure that their expectations are in line.

Sure!

> >
> > vlapic_set_irq() does the following things:
> > 1. If CPU-side posted-interrupt is supported (I think it is supported from
> > Xen 4.3 or Xen 4.4, sorry, I don't quite remember the exact version), call
> > vmx_deliver_posted_intr() to deliver the virtual interrupt via the
> > posted-interrupt infrastructure.
>
> The benefit is that if an interrupt comes for VCPU0 instead of
> VCPU1 we can inject the interrupt in the VCPU1 without having it
> do an VMEXIT.
>
> However if we pin the vCPUs, then CPU-side posted interrupts do not
> help - we still have to process the interrupt in the Xen hypervisor.
>
> > 2. Else, if CPU-side posted-interrupt is not supported, set the related
> > vIRR in the vLAPIC page and call vcpu_kick() to kick the related vCPU.
> > Before VM-Entry, vmx_intr_assist() will help to inject the interrupt to
> > guests.
> >
> > However, after VT-d PI is supported, when a guest is running in non-root
> > mode and an external interrupt from an assigned device occurs for it, _no_
> > VM-Exit is needed. The guest can handle this totally in non-root mode, thus
> > avoiding all the above code flow.
>
> <nods> However it does require for Linux PVHVM guests to not use the
> vector callback mechanism - or rather - not use the event mechanism.
>
> What you require for this to work on the Linux side is for the PCIe
> device to use the 'baremetal' mechanism to setup MSIs (program the
> IOAPIC, etc). It would be worth mentioning this in the document too.

Thanks for the suggestion. In fact, there is some information about this in
this design doc, please refer to section "Update IRTE when guest modifies the
interrupt configuration (MSI/MSIx configuration)." When guests update the
MSI/MSIx information, Xen will get control and the guest interrupt information
will get updated in the related IRTE.

>
> >
> > >
> > > Hm, actually we seem to be still invoking the hypervisor on the
> > > interrupts - except that if we need to dispatch it to another CPU
> > > using a normal vector to do so - which would still cause the
> > > hypervisor to be invoked? Or does it actually go straight in the
> > > guest?
> > >
> >
> > Like what I mentioned above, if the guest is running, we don't need to
> > invoke the hypervisor.
> >
> > > So what kind of support do we currently have in Xen from posted
> > > interrupt? Could you add a bit about this in the background please?
> >
> > Good suggestion.
> >
> > Currently, Xen only supports the CPU-side posted-interrupt. Like what I
> > mentioned above, function vlapic_set_irq() can use this to deliver virtual
> > interrupts. Basically there are several methods to deliver virtual
> > interrupts to guests:
> > - Event delivery before VM-Entry via __vmx_inject_exception(), this is the
> > oldest way.
> > - After APICv was enabled, we had hardware support for virtual interrupt
> > delivery. Virtual interrupts are stored in the virtual LAPIC page; after
> > VM-Entry, guests can evaluate these virtual interrupts and handle them in
> > non-root mode.
> > - As an enhancement to APICv, CPU-side posted-interrupt was introduced. Like
> > the above comments, with this new feature, we don't need to kick the vCPU
> > and can deliver the virtual interrupts directly to it.
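
As a rough illustration of the three delivery methods listed above, here is a
minimal sketch of how the choice between them could be expressed. The enum and
the function name are hypothetical and used only for illustration; they are not
Xen identifiers, and the real logic lives in vlapic_set_irq() and the VMX code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for illustration only; not Xen identifiers. */
enum virq_delivery {
    DELIVER_INJECT_BEFORE_VMENTRY, /* oldest way: event injection before VM-Entry */
    DELIVER_APICV_VIRR_AND_KICK,   /* APICv: set vIRR in the vLAPIC page, kick the vCPU */
    DELIVER_CPU_POSTED_INTR        /* CPU-side PI: post the interrupt, no kick needed */
};

/* Pick the most capable delivery method the hardware offers.
 * CPU-side posted-interrupt is treated as an enhancement to APICv,
 * so it is only chosen when APICv is available as well. */
enum virq_delivery pick_delivery(bool has_apicv, bool has_cpu_pi)
{
    if (has_apicv && has_cpu_pi)
        return DELIVER_CPU_POSTED_INTR;
    if (has_apicv)
        return DELIVER_APICV_VIRR_AND_KICK;
    return DELIVER_INJECT_BEFORE_VMENTRY;
}
```

None of these three paths removes the VM-Exit taken when the physical interrupt
from the assigned device arrives; that is the part VT-d PI is meant to avoid.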
> >
> > About APICv and CPU-side Posted-interrupt, please refer to Chapter 29, and
> > Section 29.6 in the Intel SDM:
> > http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf
> >
> > > >
> > > > - Decrease the interrupt migration complexity. On vCPU migration,
> > > > software can atomically co-migrate all interrupts targeting the
> > > > migrating vCPU. For virtual machines with assigned devices, migrating a
> > > > vCPU across pCPUs either incurs the overhead of forwarding interrupts in
> > > > software (e.g. via VMM generated IPIs), or the complexity of
> > > > independently migrating each interrupt targeting the vCPU to the new
> > > > pCPU. However, after enabling VT-d PI, the destination vCPU of an
> > > > external interrupt from assigned devices is stored in the IRTE (i.e.
> > > > Posted-interrupt Descriptor Address). When the vCPU is migrated to
> > > > another pCPU, we will set this new pCPU in the 'NDST' field of the
> > > > Posted-interrupt descriptor, which makes the interrupt migration
> > > > automatic.
> > > >
> > > >
> > > > Posted-interrupt Introduction
> > > > =============================
> > > > There are two components to the Posted-interrupt architecture:
> > > > Processor Support and Root-Complex Support
> > > >
> > > > - Processor Support
> > > > Posted-interrupt processing is a feature by which a processor processes
> > > > the virtual interrupts by recording them as pending on the virtual-APIC
> > > > page.
> > > >
> > > > Posted-interrupt processing is enabled by setting the "process posted
> > > > interrupts" VM-execution control. The processing is performed in
> > > > response to the arrival of an interrupt with the posted-interrupt
> > > > notification vector. In response to such an interrupt, the processor
> > > > processes virtual interrupts recorded in a data structure called a
> > > > posted-interrupt descriptor.
> > > >
> > > > For more information about APICv and CPU-side Posted-interrupt, please
> > > > refer to Chapter 29, and Section 29.6 in the Intel SDM:
> > > >
> > > > http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf
> > > >
> > > > - Root-Complex Support
> > > > Interrupt posting is the process by which an interrupt request (from
> > > > IOAPIC or MSI/MSIx capable sources) is recorded in a memory-resident
> > > > posted-interrupt-descriptor structure by the root-complex, followed by
> > > > an optional notification event issued to the CPU complex. The interrupt
> > > > request arriving at the root-complex carries the identity of the
> > > > interrupt request source and a 'remapping-index'. The remapping-index is
> > > > used to look up an entry from the memory-resident interrupt-remap-table.
> > > > Unlike with interrupt-remapping, the interrupt-remap-table-entry for a
> > > > posted-interrupt specifies a virtual-vector and a pointer to the
> > > > posted-interrupt descriptor. The virtual-vector specifies the vector of
> > > > the interrupt to be recorded in the posted-interrupt descriptor. The
> > > > posted-interrupt descriptor hosts storage for the virtual-vectors and
> > > > contains the attributes of the notification event (interrupt) to be
> > > > issued to the CPU complex to inform CPU/software about pending
> > > > interrupts recorded in the posted-interrupt descriptor.
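
To make the root-complex behaviour described above more concrete, here is a
minimal software model of the posting step. It is only a sketch under stated
assumptions: the struct pi_desc_model and post_interrupt() names are
hypothetical, the PIR, ON, SN, NV and NDST fields are the descriptor fields
defined in this design, the URG flag comes from the IRTE, and real hardware
performs the descriptor update atomically, which this model does not attempt.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified model of a posted-interrupt descriptor. */
struct pi_desc_model {
    uint64_t pir[4]; /* Posted Interrupt Requests: 256 bits, one per vector */
    bool     on;     /* Outstanding Notification */
    bool     sn;     /* Suppress Notification (non-urgent requests only) */
    uint8_t  nv;     /* Notification Vector */
    uint32_t ndst;   /* Notification Destination (physical APIC ID) */
};

/* Record 'vector' in the descriptor. Returns true when a notification
 * event (vector 'nv' sent to APIC 'ndst') would be generated. */
bool post_interrupt(struct pi_desc_model *pd, uint8_t vector, bool urgent)
{
    /* Step 1: record the virtual vector as pending. */
    pd->pir[vector / 64] |= (uint64_t)1 << (vector % 64);

    /* Step 2: notify only if no notification is already outstanding
     * and the request is either urgent or not suppressed. */
    if (!pd->on && (urgent || !pd->sn)) {
        pd->on = true;
        return true;
    }
    return false;
}
```

Note that the vector is always recorded in PIR, even when the notification is
suppressed; SN only controls whether the CPU is told about it right away.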
> > > >
> > > > For more information about VT-d PI, please refer to
> > > >
> > > > http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/vt-directed-io-spec.html
> > > >
> > > > Important Definitions
> > > > =====================
> > > > There are some changes to IRTE and posted-interrupt descriptor after
> > > > VT-d PI is introduced:
> > >
> > > s/is/was/
> > >
> > > > IRTE:
> > > > Posted-interrupt Descriptor Address: the address of the posted-interrupt
> > > > descriptor
> > > > Virtual Vector: the guest vector of the interrupt
> > > > URG: indicates if the interrupt is urgent
> > > >
> > > > Posted-interrupt descriptor:
> > > > The Posted Interrupt Descriptor hosts the following fields:
> > > > Posted Interrupt Request (PIR): Provide storage for posting (recording)
> > > > interrupts (one bit per vector, for up to 256 vectors).
> > > >
> > > > Outstanding Notification (ON): Indicate if there is a notification event
> > > > outstanding (not processed by processor or software) for this Posted
> > > > Interrupt Descriptor. When this field is 0, hardware modifies it from 0
> > > > to 1 when generating a notification event, and the entity receiving the
> > > > notification event (processor or software) resets it as part of posted
> > > > interrupt processing.
> > > >
> > > > Suppress Notification (SN): Indicate if a notification event is to be
> > > > suppressed (not generated) for non-urgent interrupt requests (interrupts
> > > > processed through an IRTE with URG=0).
> > > >
> > > > Notification Vector (NV): Specify the vector for notification event
> > > > (interrupt).
> > > >
> > > > Notification Destination (NDST): Specify the physical APIC-ID of the
> > > > destination logical processor for the notification event.
> > > >
> > > > Design Overview
> > > > ===============
> > > > In this design, we will cover the following items:
> > > > 1. Add a variable to control whether to enable VT-d posted-interrupt or
> > > > not.
> > > > 2. VT-d PI feature detection.
> > > > 3. Extend posted-interrupt descriptor structure to cover VT-d PI specific
> > > > stuff.
> > >
> > > stuff? Perhaps features?
> > >
> > > > 4. Extend IRTE structure to support VT-d PI.
> > > > 5. Introduce a new global vector which is used for waking up the blocked
> > > > vCPU.
> > > > 6. Update IRTE when guest modifies the interrupt configuration (MSI/MSIx
> > > > configuration).
> > > > 7. Update posted-interrupt descriptor during vCPU scheduling (when the
> > > > state of the vCPU transitions among RUNSTATE_running / RUNSTATE_blocked /
> > > > RUNSTATE_runnable / RUNSTATE_offline).
> > > > 8. How to wakeup blocked vCPU when an interrupt is posted for it (wakeup
> > > > notification handler).
> > > > 9. New boot command line for Xen, which controls VT-d PI feature by
> > > > user.
> > > > 10. Multicast/broadcast and lowest priority interrupts consideration.
> > > >
> > > >
> > > > Implementation details
> > > > ======================
> > > > - New variable to control VT-d PI
> > > >
> > > > Like variable 'iommu_intremap' for interrupt remapping, it is very
> > > > straightforward to add a new one 'iommu_intpost' for posted-interrupt.
> > > > 'iommu_intpost' is set only when interrupt remapping and VT-d
> > > > posted-interrupt are both enabled.
> > > >
> > > > - VT-d PI feature detection.
> > > > Bit 59 in VT-d Capability Register is used to report VT-d
> > > > Posted-interrupt support.
> > > >
> > > > - Extend posted-interrupt descriptor structure to cover VT-d PI specific
> > > > stuff.
> > > > Here is the new structure for posted-interrupt descriptor:
> > > >
> > > > struct pi_desc {
> > > >     DECLARE_BITMAP(pir, NR_VECTORS);
> > > >     union {
> > > >         struct
> > > >         {
> > > >             u64 on     : 1,
> > > >                 sn     : 1,
> > > >                 rsvd_1 : 13,
> > > >                 ndm    : 1,
> > > >                 nv     : 8,
> > > >                 rsvd_2 : 8,
> > > >                 ndst   : 32;
> > > >         };
> > > >         u64 control;
> > > >     };
> > > >     u32 rsvd[6];
> > > > } __attribute__ ((aligned (64)));
> > > >
> > > > - Extend IRTE structure to support VT-d PI.
> > > >
> > > > Here is the new structure for IRTE:
> > > > /* interrupt remap entry */
> > > > struct iremap_entry {
> > > >     union {
> > > >         u64 lo_val;
> > > >         struct {
> > > >             u64 p      : 1,
> > > >                 fpd    : 1,
> > > >                 dm     : 1,
> > > >                 rh     : 1,
> > > >                 tm     : 1,
> > > >                 dlm    : 3,
> > > >                 avail  : 4,
> > > >                 res_1  : 4,
> > > >                 vector : 8,
> > > >                 res_2  : 8,
> > > >                 dst    : 32;
> > > >         }lo;
> > > >         struct {
> > > >             u64 p      : 1,
> > > >                 fpd    : 1,
> > > >                 res_1  : 6,
> > > >                 avail  : 4,
> > > >                 res_2  : 2,
> > > >                 urg    : 1,
> > > >                 im     : 1,
> > > >                 vector : 8,
> > > >                 res_3  : 14,
> > > >                 pda_l  : 26;
> > > >         }lo_intpost;
> > > >     };
> > > >     union {
> > > >         u64 hi_val;
> > > >         struct {
> > > >             u64 sid   : 16,
> > > >                 sq    : 2,
> > > >                 svt   : 2,
> > > >                 res_1 : 44;
> > > >         }hi;
> > > >         struct {
> > > >             u64 sid   : 16,
> > > >                 sq    : 2,
> > > >                 svt   : 2,
> > > >                 res_1 : 12,
> > > >                 pda_h : 32;
> > > >         }hi_intpost;
> > > >     };
> > > > };
> > > >
> > > > - Introduce a new global vector which is used to wake up the blocked
> > > > vCPU.
> > > >
> > > > Currently, there is a global vector 'posted_intr_vector', which is used
> > > > as the
> > >
> > > s/Currently/In Xen 4.6 and earlier/
> > >
> > > > global notification vector for all vCPUs in the system. This vector is
> > > > stored in VMCS and CPU considers it as a _special_ vector, uses it to
> > > > notify the related pCPU when an interrupt is recorded in the
> > > > posted-interrupt descriptor.
> > > >
> > > > This existing global vector is a _special_ vector to the CPU; the CPU
> > > > handles it in a _special_ way compared to normal vectors. Please refer
> > > > to 29.6 in the Intel SDM
> > > >
> > > > http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf
> > > > for more information about how the CPU handles it.
> > > >
> > > > After having VT-d PI, the VT-d engine can issue a notification event
> > > > when the assigned devices issue interrupts. We need to add a new global
> > > > vector to wake up the blocked vCPU; please refer to the later section in
> > > > this design for how to use this new global vector.
> > >
> > > Ah, so this is what Xen has right now - and the changes that this design
> > > outlines here deal with blocked guests.
> >
> > No, this is what I add for enabling VT-d PI. We discussed a lot about this
> > new global vector and its usage scenario after posting version 1 of this
> > design. Do you have any question about this?
>
> No, you clarified it in your answers to my questions! thank you.
> >
> > >
> > > > - Update IRTE when guest modifies the interrupt configuration (MSI/MSIx
> > > > configuration).
> > > > After VT-d PI is introduced, the format of IRTE is changed as follows:
> > > >     Descriptor Address: the address of the posted-interrupt descriptor
> > > >     Virtual Vector: the guest vector of the interrupt
> > > >     URG: indicates if the interrupt is urgent
> > > >     Other fields continue to have the same meaning
> > > >
> > > > 'Descriptor Address' tells the destination vCPU of this interrupt, since
> > > > each vCPU has a dedicated posted-interrupt descriptor.
> > > >
> > > > 'Virtual Vector' tells the guest vector of the interrupt.
> > > >
> > > > When the guest changes the configuration of the interrupts, such as the
> > > > CPU affinity or the vector, we need to update the associated IRTE
> > > > accordingly.
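
As a sketch of what "update the associated IRTE" could involve, the snippet
below fills in the posted-format fields from a descriptor address and a guest
vector. The struct irte_post_fields type and both function names are
hypothetical. The split of the address into pda_l (26 bits) and pda_h (32 bits)
follows the field widths in the iremap_entry layout proposed in this design,
under the assumption that pda_l carries address bits 31:6 and pda_h carries
bits 63:32, which works because the descriptor is 64-byte aligned.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified view of the posted-format IRTE fields that
 * change when a guest reprograms MSI/MSI-X. */
struct irte_post_fields {
    uint8_t  vector; /* guest (virtual) vector */
    uint8_t  urg;    /* urgent flag */
    uint32_t pda_l;  /* descriptor address bits 31:6 (26 bits used) */
    uint32_t pda_h;  /* descriptor address bits 63:32 */
};

/* Point the entry at the destination vCPU's posted-interrupt descriptor.
 * Returns 0 on success, -1 if the address is not 64-byte aligned. */
int irte_set_posted(struct irte_post_fields *f, uint64_t pi_desc_maddr,
                    uint8_t guest_vector)
{
    if (pi_desc_maddr & 0x3f)
        return -1;

    f->vector = guest_vector;
    f->urg    = 0; /* only non-urgent interrupts are used in this design */
    f->pda_l  = (uint32_t)((pi_desc_maddr >> 6) & 0x3ffffff);
    f->pda_h  = (uint32_t)(pi_desc_maddr >> 32);
    return 0;
}

/* Reassemble the descriptor address from the two fields. */
uint64_t irte_get_pda(const struct irte_post_fields *f)
{
    return ((uint64_t)f->pda_h << 32) | ((uint64_t)f->pda_l << 6);
}
```

A change of guest CPU affinity then amounts to calling irte_set_posted() with
the descriptor address of the new destination vCPU, and a change of vector
amounts to calling it with the new guest vector.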
> > > >
> > > > - Update posted-interrupt descriptor during vCPU scheduling
> > > >
> > > > The basic idea here is:
> > > > 1. When vCPU's state is RUNSTATE_running,
> > > >    - Set 'NV' to 'posted_intr_vector'.
> > > >    - Clear 'SN' to accept posted-interrupts.
> > > >    - Set 'NDST' to the pCPU on which the vCPU will be running.
> > > > 2. When vCPU's state is RUNSTATE_blocked,
> > > >    - Set 'NV' to 'pi_wakeup_vector', so we can wake up the
> > > >      related vCPU when a posted-interrupt happens for it.
> > > >      Please refer to the above section about the new global vector.
> > > >    - Clear 'SN' to accept posted-interrupts
> > > > 3. When vCPU's state is RUNSTATE_runnable/RUNSTATE_offline,
> > > >    - Set 'SN' to suppress non-urgent interrupts
> > > >      (Currently, we only support non-urgent interrupts)
> > > >      When vCPU is in RUNSTATE_runnable or RUNSTATE_offline,
> > > >      it is not needed to accept posted-interrupt notification events,
> > > >      since we don't change the behavior of the scheduler when the
> > > >      interrupt occurs, we still need wait the next scheduling of the
> > > >      vCPU.
> > >
> > > still need to wait for the next..
> > >
> > > >      When external interrupts from assigned devices occur, the
> > > >      interrupts are recorded in PIR, and will be synced to IRR before
> > > >      VM-Entry.
> > > >    - Set 'NV' to 'posted_intr_vector'.
> > > >
> > > > - How to wakeup blocked vCPU when an interrupt is posted for it (wakeup
> > > > notification handler).
> > > >
> > > > Here is the scenario for the usage of the new global vector:
> > > >
> > > > 1. vCPU0 is running on pCPU0
> > > > 2. vCPU0 is blocked and vCPU1 is currently running on pCPU0
> > > > 3. An external interrupt from an assigned device occurs for vCPU0. If we
> > > > still use 'posted_intr_vector' as the notification vector for vCPU0, the
> > > > notification event for vCPU0 (the event will go to pCPU0) will be
> > > > consumed by vCPU1 incorrectly (remember this is a special vector to the
> > > > CPU). The worst case is that vCPU0 will never be woken up again since
> > > > the wakeup event for it is always consumed by other vCPUs incorrectly.
> > > > So we need to introduce another global vector, named 'pi_wakeup_vector',
> > > > to wake up the blocked vCPU.
> > > >
> > > > After using 'pi_wakeup_vector' for vCPU0, the VT-d engine will issue the
> > > > notification event using this new vector. This new vector is not a
> > > > SPECIAL one to the CPU, it is just a normal vector. The CPU just
> > > > receives a normal external interrupt, then we can get control in the
> > > > handler of this new vector. In this case, the hypervisor can do
> > > > something in it, such as waking up the blocked vCPU.
> > > >
> > > > Here is what we do for the blocked vCPU:
> > > > 1. Define a per-cpu list 'blocked_vcpu_on_cpu', which stores the blocked
> > > > vCPUs on the pCPU.
> > > > 2. When the vCPU's state is changed to RUNSTATE_blocked, insert the vCPU
> > > > into the per-cpu list belonging to the pCPU it was running on.
> > > > 3. When the vCPU is unblocked, remove the vCPU from the related pCPU
> > > > list.
> > > >
> > > > In the handler of 'pi_wakeup_vector', we do:
> > > > 1. Get the physical CPU.
> > > > 2. Iterate the list 'blocked_vcpu_on_cpu' of the current pCPU, if 'ON'
> > > > is set, we unblock the associated vCPU.
> > > >
> > > > - New boot command line for Xen, which controls VT-d PI feature by user.
> > > >
> > > > Like 'intremap' for interrupt remapping, we add a new boot command line
> > > > 'intpost' for posted-interrupts.
> > >
> > > Earlier you mentioned "iommu_intpost" ?
> >
> > 'intpost' is a Xen command line parameter, while 'iommu_intpost' is a
> > variable in the code, just like 'intremap' and 'iommu_intremap'.
>
> Why not piggyback on 'iommu' ? It might be worth mentioning the
> reasoning why you choose a new name instead of adding new options for
> the 'iommu'.

Oh, sorry, there is a mistake in my previous description. In fact, 'intpost'
is an option for the 'iommu' command line, just like 'intremap'.

Thanks,
Feng

> >
> > Thanks,
> > Feng
> >
> > > >
> > > > - Multicast/broadcast and lowest priority interrupts consideration.
> > > >
> > > > With VT-d PI, the destination vCPU information of an external interrupt
> > > > from assigned devices is stored in the IRTE. This leads to the following
> > > > design considerations:
> > > > 1. Multicast/broadcast interrupts cannot be posted.
> > > > 2. For lowest-priority interrupts, new Intel CPU/Chipset/root-complex
> > > > (starting from Nehalem) ignore the TPR value, and instead support two
> > > > other ways (configurable by BIOS) to handle lowest priority interrupts:
> > > > A) Round robin: In this method, the chipset simply delivers lowest
> > > > priority interrupts in a round-robin manner across all the available
> > > > logical CPUs. While this provides good load balancing, this was not
> > > > always the best thing to do, as interrupts from the same device (like a
> > > > NIC) will start running on all the CPUs, thrashing caches and taking
> > > > locks. This led to the next scheme.
> > > > B) Vector hashing: In this method, hardware would apply a hash function
> > > > on the vector value in the interrupt request, and use that hash to pick
> > > > a logical CPU to route the lowest priority interrupt. This way, a given
> > > > vector always goes to the same logical CPU, avoiding the thrashing
> > > > problem above.
> > > >
> > > > So, the gist of the above is that lowest priority interrupts have never
> > > > been delivered as "lowest priority" in physical hardware.
> > > >
> > > > I will emulate vector hashing for posted-interrupt for XEN.
> > > >
> > > > ================================
> > > >
> > > > Any comments about this design are highly appreciated!
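
On the vector hashing emulation mentioned at the end of the design, here is a
minimal sketch of how a destination could be picked for a lowest-priority
interrupt. The design does not say which hash function would be used, so the
plain modulo below is only an assumption, and pick_lowest_prio_dest() is a
hypothetical name. The property that matters is that a given vector always maps
to the same destination for a fixed candidate set.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pick one destination among 'n' candidate vCPU ids for a lowest-priority
 * interrupt. The hash is a plain modulo on the vector (an assumption made
 * for illustration). Returns -1 if there is no candidate. */
int pick_lowest_prio_dest(const int *candidates, size_t n, uint8_t vector)
{
    if (candidates == NULL || n == 0)
        return -1;
    return candidates[vector % n];
}
```

Because the result depends only on the vector and the candidate set, interrupts
from one device keep landing on the same vCPU, which is the point of scheme B
above; the chosen vCPU's descriptor address would then go into the IRTE.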
> > > >
> > > > Thanks,
> > > > Feng
> > > >
> > > > _______________________________________________
> > > > Xen-devel mailing list
> > > > Xen-devel@xxxxxxxxxxxxx
> > > > http://lists.xen.org/xen-devel
