
Re: [Xen-devel] [PATCH v3 3/3] VT-d PI: restrict the vcpu number on a given pcpu



>>> On 24.05.17 at 08:56, <chao.gao@xxxxxxxxx> wrote:
> Currently, a blocked vCPU is put on its pCPU's pi blocking list. If
> too many vCPUs are blocked on a given pCPU, the list can grow too
> long. A simple analysis shows that with 32k domains and 128 vCPUs
> per domain, about 4M vCPUs may be blocked on one pCPU's PI blocking
> list. When a wakeup interrupt arrives, the list is traversed to find
> the specific vCPUs to wake up, and in that case the traversal can
> consume much time.
> 
> To mitigate this issue, this patch limits the vcpu number on a given
> pCPU,

This would be a bug, but I think it's the description which is wrong
(or at least imprecise): You don't limit the number of vCPU-s _run_
on any pCPU, but those tracked on any pCPU's blocking list. Please
say so here to avoid confusion.

> taking factors such as performance of the common case, the current
> hvm vcpu count, and the current pcpu count into consideration. With
> this method, the common case works fast and, in extreme cases, the
> list length is kept under control.
> 
> The change in vmx_pi_unblock_vcpu() is for the following case:
> vcpu is running -> try to block (this patch may change NSDT to
> another pCPU) but notification comes in time, thus the vcpu

What does "but notification comes in time" mean?

> goes back to running station -> VM-entry (we should set NSDT again,

s/station/state/ ?

> reverting the change we make to NSDT in vmx_vcpu_block())

Overall I'm not sure I really understand what you try to explain
here.

> --- a/xen/arch/x86/hvm/vmx/vmx.c
> +++ b/xen/arch/x86/hvm/vmx/vmx.c
> @@ -100,16 +100,62 @@ void vmx_pi_per_cpu_init(unsigned int cpu)
>      spin_lock_init(&per_cpu(vmx_pi_blocking, cpu).lock);
>  }
>  
> +/*
> + * By default, the local pcpu (meaning the one the vcpu is currently
> + * running on) is chosen as the destination of the wakeup interrupt.
> + * But if the vcpu number of the pcpu exceeds a limit, another pcpu
> + * is chosen, until we find a suitable one.
> + *
> + * Currently, choose (v_tot/p_tot) + K as the limit of vcpu count, where
> + * v_tot is the total number of hvm vcpus on the system, p_tot is the total
> + * number of pcpus in the system, and K is a fixed number. Experiments show
> + * the maximum time to wake up a vcpu from a 128-entry blocking list is
> + * about 22us, which is tolerable. So choose 128 as the fixed number K.

Giving any kind of absolute time value requires also stating on what
hardware this was measured.

> + * This policy makes sure:
> + * 1) for common cases, the limit won't be reached and the local pcpu is used
> + * which is beneficial to performance (at least, avoid an IPI when unblocking
> + * vcpu).
> + * 2) for the worst case, the blocking list length scales with the vcpu count
> + * divided by the pcpu count.
> + */
> +#define PI_LIST_FIXED_NUM 128
> +#define PI_LIST_LIMIT     (atomic_read(&num_hvm_vcpus) / num_online_cpus() + \
> +                           PI_LIST_FIXED_NUM)
> +
> +static bool pi_over_limit(int count)

Can a caller validly pass a negative argument? Otherwise unsigned int
please.

> +{
> +    /* Compare w/ constant first to save an atomic read in the common case */

As an atomic read is just a normal read on x86, does this really matter?

> +    return ((count > PI_LIST_FIXED_NUM) &&
> +            (count > (atomic_read(&num_hvm_vcpus) / num_online_cpus()) +
> +                PI_LIST_FIXED_NUM));

Right above you've #define-d PI_LIST_LIMIT - why do you open code
it here? Also note that the outer pair of parentheses is pointless (and
hampering readability).
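
I.e. something along these lines (a rough sketch only, also folding
in the unsigned-int remark further up; untested):

    static bool pi_over_limit(unsigned int count)
    {
        /*
         * Compare with the constant first; the && short-circuits, so
         * the atomic read and the division are skipped in the common
         * case.
         */
        return count > PI_LIST_FIXED_NUM && count > PI_LIST_LIMIT;
    }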

>  static void vmx_vcpu_block(struct vcpu *v)
>  {
>      unsigned long flags;
> -    unsigned int dest;
> +    unsigned int dest, pi_cpu;
>      spinlock_t *old_lock;
> -    spinlock_t *pi_blocking_list_lock =
> -             &per_cpu(vmx_pi_blocking, v->processor).lock;
>      struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
> +    spinlock_t *pi_blocking_list_lock;
> +
> +    pi_cpu = v->processor;
> + retry:
> +    pi_blocking_list_lock = &per_cpu(vmx_pi_blocking, pi_cpu).lock;
>  
>      spin_lock_irqsave(pi_blocking_list_lock, flags);
> +    /*
> +     * Since pi_cpu may now be one other than the one v is currently
> +     * running on, check to make sure that it's still up.
> +     */
> +    if ( unlikely((!cpu_online(pi_cpu)) ||

But this check comes too late then: You've already used per-CPU
data of an offline CPU by the time you make it here. I'm also not
sure you really need the lock here. A read_atomic() of the counter
would suffice afaics (of course the writers then would need to
use write_atomic() or add_sized()).
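
I.e. something like (sketch only, untested, and assuming the writers
get adjusted accordingly):

    if ( !cpu_online(pi_cpu) ||
         pi_over_limit(read_atomic(&per_cpu(vmx_pi_blocking,
                                            pi_cpu).counter)) )
        /* move on to another pCPU before acquiring its lock */

done ahead of the spin_lock_irqsave().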

> +             pi_over_limit(per_cpu(vmx_pi_blocking, pi_cpu).counter)) )

Indentation.

> +    {
> +        pi_cpu = cpumask_cycle(pi_cpu, &cpu_online_map);

With this, how could the CPU be offline by the time you make it
back to the check above?

> +        spin_unlock_irqrestore(pi_blocking_list_lock, flags);
> +        goto retry;

Please try (hard) to avoid the goto here (we generally accept its
use only for keeping error handling reasonably simple).
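
E.g. by turning the selection into a loop (again just a sketch,
untested, and still subject to the ordering concern above):

    for ( pi_cpu = v->processor; ;
          pi_cpu = cpumask_cycle(pi_cpu, &cpu_online_map) )
    {
        pi_blocking_list_lock = &per_cpu(vmx_pi_blocking, pi_cpu).lock;
        spin_lock_irqsave(pi_blocking_list_lock, flags);
        /* Stop once we've found an online pCPU with a short enough list. */
        if ( likely(cpu_online(pi_cpu) &&
                    !pi_over_limit(per_cpu(vmx_pi_blocking,
                                           pi_cpu).counter)) )
            break;
        spin_unlock_irqrestore(pi_blocking_list_lock, flags);
    }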

> @@ -134,6 +180,13 @@ static void vmx_vcpu_block(struct vcpu *v)
>      ASSERT(pi_desc->ndst ==
>             (x2apic_enabled ? dest : MASK_INSR(dest, PI_xAPIC_NDST_MASK)));
>  
> +    if ( unlikely(pi_cpu != v->processor) )
> +    {
> +        dest = cpu_physical_id(pi_cpu);
> +        write_atomic(&pi_desc->ndst,
> +                 (x2apic_enabled ? dest : MASK_INSR(dest, PI_xAPIC_NDST_MASK)));
> +    }

This kind of contradicts the ASSERT() visible in patch context above.
I _think_ it is nevertheless correct, but it would be good to attach a
comment here explaining this.

Also there's no point in parenthesizing the argument of a function
call.
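
Perhaps something along these lines (exact wording up to you):

    /*
     * The vCPU is to be tracked on a remote pCPU's blocking list, so
     * NDST needs pointing there, deliberately overriding the value
     * the ASSERT() above has just checked.
     */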

> @@ -170,6 +224,8 @@ static void vmx_pi_unblock_vcpu(struct vcpu *v)
>       * it is running in non-root mode.
>       */
>      write_atomic(&pi_desc->nv, posted_intr_vector);
> +    write_atomic(&pi_desc->ndst,
> +                 x2apic_enabled ? dest : MASK_INSR(dest, PI_xAPIC_NDST_MASK));

The explanation of why this is needed should also be put here
rather than (only) in the commit message.
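
E.g. (just a sketch; adjust as you see fit):

    /*
     * Restore NDST to the pCPU the vCPU is running on, in case
     * vmx_vcpu_block() pointed it at another pCPU's blocking list
     * and the vCPU was woken up before it actually blocked.
     */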

Jan
