
Re: [Xen-devel] [PATCH v2 3/5] VT-d PI: restrict the vcpu number on a given pcpu



On Thu, May 11, 2017 at 7:04 AM, Chao Gao <chao.gao@xxxxxxxxx> wrote:
> Currently, a blocked vCPU is put on its pCPU's PI blocking list. If
> too many vCPUs are blocked on a given pCPU, the list can grow very
> long. A simple analysis shows that with 32k domains, each with 128
> vCPUs, about 4M vCPUs could end up blocked on a single pCPU's PI
> blocking list. When a wakeup interrupt arrives, the list is traversed
> to find the specific vCPUs to wake, and in such a case the traversal
> would take a long time.
>
> To mitigate this issue, this patch limits the number of vCPUs blocked
> on a given pCPU, taking into consideration the performance of the
> common case, the current HVM vCPU count, and the current pCPU count.
> With this method, the common case stays fast, and in the extreme
> cases the list length is kept under control.
>
> The change in vmx_pi_unblock_vcpu() is for the following case:
> the vcpu is running -> it tries to block (this patch may change NDST
> to another pCPU), but the notification comes in time, so the vcpu
> goes back to the running state -> VM-entry (we should set NDST again,
> reverting the change made to NDST in vmx_vcpu_block()).
>
> Signed-off-by: Chao Gao <chao.gao@xxxxxxxxx>
> ---
>  xen/arch/x86/hvm/vmx/vmx.c | 78 +++++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 71 insertions(+), 7 deletions(-)
>
> diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
> index efff6cd..c0d0b58 100644
> --- a/xen/arch/x86/hvm/vmx/vmx.c
> +++ b/xen/arch/x86/hvm/vmx/vmx.c
> @@ -100,16 +100,70 @@ void vmx_pi_per_cpu_init(unsigned int cpu)
>      spin_lock_init(&per_cpu(vmx_pi_blocking, cpu).lock);
>  }
>
> +/*
> + * Choose an appropriate pcpu to receive the wakeup interrupt.
> + * By default, the local pcpu is chosen as the destination. But if the
> + * number of vcpus blocked on the local pcpu exceeds a limit, another
> + * pcpu is chosen instead.
> + *
> + * Currently, (v_tot/p_tot) + K is chosen as the limit, where v_tot is
> + * the total number of hvm vcpus on the system, p_tot is the total
> + * number of pcpus in the system, and K is a fixed number. Experiments
> + * show that the maximum time to wake a vcpu from a 128-entry blocking
> + * list is acceptable, so 128 is chosen as the fixed number K.
> + *
> + * This policy makes sure:
> + * 1) for common cases, the limit won't be reached and the local pcpu
> + * is used, which is beneficial to performance (at least, it avoids an
> + * IPI when unblocking a vcpu).
> + * 2) for the worst case, the blocking list length scales with the
> + * vcpu count divided by the pcpu count.
> + */
> +#define PI_LIST_FIXED_NUM 128
> +#define PI_LIST_LIMIT     (atomic_read(&num_hvm_vcpus) / num_online_cpus() + \
> +                           PI_LIST_FIXED_NUM)
> +
> +static unsigned int vmx_pi_choose_dest_cpu(struct vcpu *v)
> +{
> +    int count, limit = PI_LIST_LIMIT;
> +    unsigned int dest = v->processor;
> +
> +    count = atomic_read(&per_cpu(vmx_pi_blocking, dest).counter);
> +    while ( unlikely(count >= limit) )
> +    {
> +        dest = cpumask_cycle(dest, &cpu_online_map);
> +        count = atomic_read(&per_cpu(vmx_pi_blocking, dest).counter);
> +    }
> +    return dest;
> +}
> +
>  static void vmx_vcpu_block(struct vcpu *v)
>  {
>      unsigned long flags;
> -    unsigned int dest;
> +    unsigned int dest, dest_cpu;
>      spinlock_t *old_lock;
> -    spinlock_t *pi_blocking_list_lock =
> -               &per_cpu(vmx_pi_blocking, v->processor).lock;
>      struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
> +    spinlock_t *pi_blocking_list_lock;
> +
> +    /*
> +     * After a pCPU goes down, its per-cpu PI blocking list is cleared.
> +     * To make sure the given vCPU is added to the chosen pCPU's PI
> +     * blocking list before that list is cleared, retry if the chosen
> +     * pCPU is found to have gone down. Also retry, choosing another
> +     * pCPU, if the list length is found to have reached the limit.
> +     */
> + retry:
> +    dest_cpu = vmx_pi_choose_dest_cpu(v);
> +    pi_blocking_list_lock = &per_cpu(vmx_pi_blocking, dest_cpu).lock;
>
>      spin_lock_irqsave(pi_blocking_list_lock, flags);
> +    if ( unlikely((!cpu_online(dest_cpu)) ||
> +                  (atomic_read(&per_cpu(vmx_pi_blocking, dest_cpu).counter) >=
> +                   PI_LIST_LIMIT)) )
> +    {
> +        spin_unlock_irqrestore(pi_blocking_list_lock, flags);
> +        goto retry;
> +    }

Algorithmically I think this is on the right track. But all these
atomic reads and writes are a mess.  Atomic accesses aren't free; and
the vast majority of the time you're operating with the
pi_blocking_list_lock held anyway.

Why not do something like this at the top of vmx_vcpu_block()
(replacing dest_cpu with pi_cpu for clarity)?

    pi_cpu = v->processor;
retry:
    pi_blocking_list_lock = &per_cpu(vmx_pi_blocking, pi_cpu).lock;
    spin_lock_irqsave(pi_blocking_list_lock, flags);
    /*
     * Since pi_cpu may now be a cpu other than the one v is currently
     * running on, check to make sure that it's still up.
     */
    if ( unlikely((!cpu_online(pi_cpu)) ||
                  pi_over_limit(per_cpu(vmx_pi_blocking, pi_cpu).counter)) )
    {
        pi_cpu = cpumask_cycle(pi_cpu, &cpu_online_map);
        spin_unlock_irqrestore(pi_blocking_list_lock, flags);
        goto retry;
    }

where we define pi_over_limit() like this:

static bool pi_over_limit(int count)
{
    /* Compare w/ constant first to save an atomic read in the common case */
    return (count > PI_LIST_FIXED_NUM) &&
           (count > (atomic_read(&num_hvm_vcpus) / num_online_cpus()) +
                    PI_LIST_FIXED_NUM);
}

That way, in the incredibly common case where count < 128, you simply
grab the lock once and don't do *any* atomic reads, rather than doing
at least four atomic reads, as the patch does now.
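
A minimal sketch of what the list-insertion path could then look like,
assuming the counter is demoted to a plain int that is only ever read
and modified with pi_blocking_list_lock held (pi_cpu as above; the list
field names follow the existing code):

    spin_lock_irqsave(pi_blocking_list_lock, flags);
    /* (online/limit check as above) */
    per_cpu(vmx_pi_blocking, pi_cpu).counter++;
    list_add_tail(&v->arch.hvm_vmx.pi_blocking.list,
                  &per_cpu(vmx_pi_blocking, pi_cpu).list);
    spin_unlock_irqrestore(pi_blocking_list_lock, flags);

The wakeup side would then decrement the counter under the same lock
when it removes an entry, so the two stay consistent without any atomic
ops.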

[snip]
> @@ -163,6 +224,7 @@ static void vmx_pi_unblock_vcpu(struct vcpu *v)
>      unsigned long flags;
>      spinlock_t *pi_blocking_list_lock;
>      struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
> +    unsigned int dest = cpu_physical_id(v->processor);
>
>      /*
>       * Set 'NV' field back to posted_intr_vector, so the
> @@ -170,6 +232,8 @@ static void vmx_pi_unblock_vcpu(struct vcpu *v)
>       * it is running in non-root mode.
>       */
>      write_atomic(&pi_desc->nv, posted_intr_vector);
> +    write_atomic(&pi_desc->ndst,
> +                 x2apic_enabled ? dest : MASK_INSR(dest, PI_xAPIC_NDST_MASK));

Just checking -- if an interrupt is raised between these two lines,
what will happen?  Will the interrupt be queued up to be delivered to
the vcpu the next time it does a VMENTRY?
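
To spell out the window I'm asking about:

    write_atomic(&pi_desc->nv, posted_intr_vector);
    /* <-- an interrupt raised here sees the normal notification
     *     vector in NV, but NDST still holds whatever destination
     *     vmx_vcpu_block() wrote, not the pCPU v now runs on. */
    write_atomic(&pi_desc->ndst,
                 x2apic_enabled ? dest : MASK_INSR(dest, PI_xAPIC_NDST_MASK));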

 -George
