[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH, RFC] x86/HVM: batch vCPU wakeups



At 09:33 +0100 on 09 Sep (1410251617), Jan Beulich wrote:
> Mass wakeups (via vlapic_ipi()) can take enormous amounts of time,
> especially when many of the remote pCPU-s are in deep C-states. For
> 64-vCPU Windows Server 2012 R2 guests on Ivybridge hardware,
> accumulated times of over 2ms were observed. Considering that Windows
> broadcasts IPIs from its timer interrupt

Bleargh. :(

> RFC for two reasons:
> 1) I started to consider elimination of the event-check IPIs in a more
>    general way when MONITOR/MWAIT is in use: As long as the remote CPU
>    is known to be MWAITing (or in the process of resuming after MWAIT),
>    the write of softirq_pending(cpu) ought to be sufficient to wake it.
>    This would yield the patch here pointless on MONITOR/MWAIT capable
>    hardware.

That's a promising line.   I guess the number of 64-way systems that
don't support MONITOR/MWAIT is pretty small. :)

> 2) The condition when to enable batching in vlapic_ipi() is already
>    rather complex, but it is nevertheless unclear whether it would
>    benefit from further tuning (as mentioned above).

The test you chose seems plausible enough to me.  I suppose we care
enough about single-destination IPIs on idle hardware that we need
this check at all; but I doubt that IPIs to exactly two destinations,
for example, are worth worrying about.

The implementation looks OK in general; a few nits below.

> --- a/xen/arch/x86/hvm/vlapic.c
> +++ b/xen/arch/x86/hvm/vlapic.c
> @@ -447,12 +447,30 @@ void vlapic_ipi(
>  
>      default: {
>          struct vcpu *v;
> -        for_each_vcpu ( vlapic_domain(vlapic), v )
> +        const struct domain *d = vlapic_domain(vlapic);
> +        bool_t batch = d->max_vcpus > 2 &&
> +                       (short_hand
> +                        ? short_hand != APIC_DEST_SELF
> +                        : vlapic_x2apic_mode(vlapic)
> +                          ? dest_mode ? hweight16(dest) > 1
> +                                      : dest == 0xffffffff
> +                          : dest_mode
> +                            ? hweight8(dest &
> +                                       GET_xAPIC_DEST_FIELD(
> +                                           vlapic_get_reg(vlapic,
> +                                                          APIC_DFR))) > 1
> +                            : dest == 0xff);

Yoiks.  Could you extract some of this into a helper function, say
is_unicast_dest(), just for readability?

>  void cpu_raise_softirq(unsigned int cpu, unsigned int nr)
>  {
> -    if ( !test_and_set_bit(nr, &softirq_pending(cpu))
> -         && (cpu != smp_processor_id()) )
> +    unsigned int this_cpu = smp_processor_id();

The name clashes with the this_cpu() macro (obviously OK for compiling
but harder to read) and can be dropped since...

> +
> +    if ( test_and_set_bit(nr, &softirq_pending(cpu))
> +         || (cpu == this_cpu) )
> +        return;
> +
> +    if ( !per_cpu(batching, this_cpu) || in_irq() )

this_cpu(batching) is the preferred way of addressing this on Xen
anyway - though I guess maybe it's _slightly_ less efficient to
recalculate get_cpu_info() here than to indirect through the
per_cpu_offsets[]?  I haven't looked at the asm.

>          smp_send_event_check_cpu(cpu);
> +    else
> +        set_bit(nr, &per_cpu(batch_mask, this_cpu));
> +}
> +
> +void cpu_raise_softirq_batch_begin(void)
> +{
> +    ++per_cpu(batching, smp_processor_id());

This one should definitely be using this_cpu().

> +}
> +
> +void cpu_raise_softirq_batch_finish(void)
> +{
> +    unsigned int cpu, this_cpu = smp_processor_id();
> +    cpumask_t *mask = &per_cpu(batch_mask, this_cpu);

Again, this_cpu()?

> +    ASSERT(per_cpu(batching, this_cpu));
> +    for_each_cpu ( cpu, mask )
> +        if ( !softirq_pending(cpu) )
> +            cpumask_clear_cpu(cpu, mask);
> +    smp_send_event_check_mask(mask);
> +    cpumask_clear(mask);
> +    --per_cpu(batching, this_cpu);
>  }

One other thing that occurs to me is that we might do the batching in
the caller (by gathering a mask of pcpus in the vlapic code) but that
sounds messy.  And it's not like softirq is so particularly
latency-sensitive that we'd want to avoid this much overhead in the
common case.  So on second thoughts this way is better. :)

Cheers,

Tim.

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.