
Re: [Xen-devel] [PATCH, RFC] x86/HVM: batch vCPU wakeups



On Tue, Sep 09, 2014 at 09:33:37AM +0100, Jan Beulich wrote:
> Mass wakeups (via vlapic_ipi()) can take enormous amounts of time,
> especially when many of the remote pCPU-s are in deep C-states. For
> 64-vCPU Windows Server 2012 R2 guests on Ivybridge hardware,
> accumulated times of over 2ms were observed. Considering that Windows
> broadcasts IPIs from its timer interrupt, which at least at certain
> times can run at 1kHz, it is clear that this can result in good
> guest behavior.

I guess that should say "it is clear that this *can't* result in good
guest behaviour".


-- Pasi


> In fact, on said hardware guests with significantly
> beyond 40 vCPU-s simply hung when e.g. ServerManager gets started.
> With the patch here, average broadcast times for a 64-vCPU guest went
> to a measured maximum of 310us (which is still quite a lot).
> 
> This isn't just helping to reduce the number of ICR writes when the
> host APICs run in clustered mode, but also to reduce them by
> suppressing the sends altogether when - by the time
> cpu_raise_softirq_batch_finish() is reached - the remote CPU already
> managed to handle the softirq. Plus - when using MONITOR/MWAIT - the
> update of softirq_pending(cpu), being on the monitored cache line -
> should make the remote CPU wake up ahead of the ICR being sent,
> allowing the wait-for-ICR-idle latencies to be reduced (to perhaps a
> large part due to overlapping the wakeups of multiple CPUs).
> 
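
Side note, in case it helps others follow the suppression logic: below is a
minimal standalone model of the idea (explicitly not Xen code - fake_pending[],
send_ipi_to() and the fixed CPU count are just stand-ins for softirq_pending(),
smp_send_event_check_mask() and the real per-CPU batch state):

/* Collect targets in a local mask while "batching"; at flush time drop any
 * CPU whose pending flag has meanwhile been cleared, then send one IPI to
 * whatever is left. */
#include <stdio.h>

#define NR_CPUS 8

static int fake_pending[NR_CPUS];    /* stand-in for softirq_pending(cpu) */
static unsigned long batch_mask;     /* stand-in for the per-CPU batch_mask */
static unsigned int batching;        /* stand-in for the per-CPU batching count */

static void send_ipi_to(unsigned long mask)  /* stand-in for the event-check IPI */
{
    printf("IPI sent to mask %#lx\n", mask);
}

static void raise_on(unsigned int cpu)
{
    if ( fake_pending[cpu] )
        return;                      /* already pending: no wakeup needed at all */
    fake_pending[cpu] = 1;
    if ( batching )
        batch_mask |= 1UL << cpu;    /* defer the wakeup until the flush */
    else
        send_ipi_to(1UL << cpu);     /* unbatched: wake the CPU immediately */
}

static void batch_finish(void)
{
    unsigned int cpu;

    for ( cpu = 0; cpu < NR_CPUS; cpu++ )
        if ( !fake_pending[cpu] )
            batch_mask &= ~(1UL << cpu); /* already handled: suppress the send */
    send_ipi_to(batch_mask);
    batch_mask = 0;
    --batching;
}

int main(void)
{
    ++batching;                      /* cpu_raise_softirq_batch_begin() */
    raise_on(1);
    raise_on(2);
    raise_on(3);
    fake_pending[2] = 0;             /* pretend CPU2 handled its softirq already */
    batch_finish();                  /* -> one "IPI" covering only CPUs 1 and 3 */
    return 0;
}

Running it, the flush sends a single "IPI" for CPUs 1 and 3 only, while CPU2
gets dropped because its pending flag was already clear again by then.
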
> Of course this necessarily increases the latencies for the remote
> CPU wakeup at least slightly. To weigh between the effects, the
> condition to enable batching in vlapic_ipi() may need further tuning.
> 
> Signed-off-by: Jan Beulich <jbeulich@xxxxxxxx>
> ---
> RFC for two reasons:
> 1) I started to consider elimination of the event-check IPIs in a more
>    general way when MONITOR/MWAIT is in use: As long as the remote CPU
>    is known to be MWAITing (or in the process of resuming after MWAIT),
>    the write of softirq_pending(cpu) ought to be sufficient to wake it.
>    This would yield the patch here pointless on MONITOR/MWAIT capable
>    hardware.
> 2) The condition when to enable batching in vlapic_ipi() is already
>    rather complex, but it is nevertheless unclear whether it would
>    benefit from further tuning (as mentioned above).
> 
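
Regarding RFC point 1, a toy standalone model of that idea (again not Xen
code; the in_mwait[] flag is purely hypothetical "known to be MWAITing or
resuming after MWAIT" state) might be:

/* If the target CPU is known to be monitoring the cache line holding its
 * pending word, the plain store already wakes it and the explicit IPI can
 * be skipped. */
#include <stdio.h>

#define NR_CPUS 4

static unsigned long pending[NR_CPUS];
static int in_mwait[NR_CPUS];          /* hypothetical "known to be MWAITing" flag */

static void send_event_check(unsigned int cpu)
{
    printf("IPI to CPU%u\n", cpu);
}

static void raise_on(unsigned int cpu, unsigned int nr)
{
    unsigned long old = pending[cpu];

    pending[cpu] |= 1UL << nr;         /* store to the (monitored) line */
    if ( old & (1UL << nr) )
        return;                        /* was already pending */
    if ( !in_mwait[cpu] )              /* only non-MWAITing CPUs still need an IPI */
        send_event_check(cpu);
}

int main(void)
{
    in_mwait[2] = 1;
    raise_on(1, 0);                    /* -> IPI to CPU1 */
    raise_on(2, 0);                    /* -> no IPI: the store alone wakes CPU2 */
    return 0;
}

I.e. on MONITOR/MWAIT capable hardware the write to the monitored line would
do the wakeup on its own, which is why the batching here would then indeed
become pointless there.
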
> --- a/xen/arch/x86/hvm/vlapic.c
> +++ b/xen/arch/x86/hvm/vlapic.c
> @@ -447,12 +447,30 @@ void vlapic_ipi(
>  
>      default: {
>          struct vcpu *v;
> -        for_each_vcpu ( vlapic_domain(vlapic), v )
> +        const struct domain *d = vlapic_domain(vlapic);
> +        bool_t batch = d->max_vcpus > 2 &&
> +                       (short_hand
> +                        ? short_hand != APIC_DEST_SELF
> +                        : vlapic_x2apic_mode(vlapic)
> +                          ? dest_mode ? hweight16(dest) > 1
> +                                      : dest == 0xffffffff
> +                          : dest_mode
> +                            ? hweight8(dest &
> +                                       GET_xAPIC_DEST_FIELD(
> +                                           vlapic_get_reg(vlapic,
> +                                                          APIC_DFR))) > 1
> +                            : dest == 0xff);
> +
> +        if ( batch )
> +            cpu_raise_softirq_batch_begin();
> +        for_each_vcpu ( d, v )
>          {
>              if ( vlapic_match_dest(vcpu_vlapic(v), vlapic,
>                                     short_hand, dest, dest_mode) )
>                  vlapic_accept_irq(v, icr_low);
>          }
> +        if ( batch )
> +            cpu_raise_softirq_batch_finish();
>          break;
>      }
>      }
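
If it helps readability, the nested conditional above can (as I read it) be
spelled out as the if/else chain below; the helper name and the exact
parameter types are made up for illustration, but the functions and macros
are the ones the hunk already uses:

/* Illustrative restatement of the "batch" condition above. */
static bool_t vlapic_ipi_wants_batch(struct vlapic *vlapic,
                                     const struct domain *d,
                                     unsigned int short_hand,
                                     unsigned int dest,
                                     unsigned int dest_mode)
{
    if ( d->max_vcpus <= 2 )
        return 0;                        /* too few vCPUs for batching to pay off */

    if ( short_hand )                    /* "all incl. self" / "all but self" ... */
        return short_hand != APIC_DEST_SELF;   /* ... but never for a self-IPI */

    if ( vlapic_x2apic_mode(vlapic) )
        return dest_mode ? hweight16(dest) > 1 /* logical: more than one target bit */
                         : dest == 0xffffffff; /* physical: broadcast only */

    return dest_mode                           /* xAPIC logical: >1 destination */
           ? hweight8(dest & GET_xAPIC_DEST_FIELD(vlapic_get_reg(vlapic,
                                                                 APIC_DFR))) > 1
           : dest == 0xff;                     /* xAPIC physical: broadcast only */
}
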
> --- a/xen/common/softirq.c
> +++ b/xen/common/softirq.c
> @@ -23,6 +23,9 @@ irq_cpustat_t irq_stat[NR_CPUS];
>  
>  static softirq_handler softirq_handlers[NR_SOFTIRQS];
>  
> +static DEFINE_PER_CPU(cpumask_t, batch_mask);
> +static DEFINE_PER_CPU(unsigned int, batching);
> +
>  static void __do_softirq(unsigned long ignore_mask)
>  {
>      unsigned int i, cpu;
> @@ -70,22 +73,56 @@ void open_softirq(int nr, softirq_handle
>  
>  void cpumask_raise_softirq(const cpumask_t *mask, unsigned int nr)
>  {
> -    int cpu;
> -    cpumask_t send_mask;
> +    unsigned int cpu = smp_processor_id();
> +    cpumask_t send_mask, *raise_mask;
> +
> +    if ( !per_cpu(batching, cpu) || in_irq() )
> +    {
> +        cpumask_clear(&send_mask);
> +        raise_mask = &send_mask;
> +    }
> +    else
> +        raise_mask = &per_cpu(batch_mask, cpu);
>  
> -    cpumask_clear(&send_mask);
>      for_each_cpu(cpu, mask)
>          if ( !test_and_set_bit(nr, &softirq_pending(cpu)) )
> -            cpumask_set_cpu(cpu, &send_mask);
> +            cpumask_set_cpu(cpu, raise_mask);
>  
> -    smp_send_event_check_mask(&send_mask);
> +    if ( raise_mask == &send_mask )
> +        smp_send_event_check_mask(raise_mask);
>  }
>  
>  void cpu_raise_softirq(unsigned int cpu, unsigned int nr)
>  {
> -    if ( !test_and_set_bit(nr, &softirq_pending(cpu))
> -         && (cpu != smp_processor_id()) )
> +    unsigned int this_cpu = smp_processor_id();
> +
> +    if ( test_and_set_bit(nr, &softirq_pending(cpu))
> +         || (cpu == this_cpu) )
> +        return;
> +
> +    if ( !per_cpu(batching, this_cpu) || in_irq() )
>          smp_send_event_check_cpu(cpu);
> +    else
> +        set_bit(nr, &per_cpu(batch_mask, this_cpu));
> +}
> +
> +void cpu_raise_softirq_batch_begin(void)
> +{
> +    ++per_cpu(batching, smp_processor_id());
> +}
> +
> +void cpu_raise_softirq_batch_finish(void)
> +{
> +    unsigned int cpu, this_cpu = smp_processor_id();
> +    cpumask_t *mask = &per_cpu(batch_mask, this_cpu);
> +
> +    ASSERT(per_cpu(batching, this_cpu));
> +    for_each_cpu ( cpu, mask )
> +        if ( !softirq_pending(cpu) )
> +            cpumask_clear_cpu(cpu, mask);
> +    smp_send_event_check_mask(mask);
> +    cpumask_clear(mask);
> +    --per_cpu(batching, this_cpu);
>  }
>  
>  void raise_softirq(unsigned int nr)
> --- a/xen/include/xen/softirq.h
> +++ b/xen/include/xen/softirq.h
> @@ -30,6 +30,9 @@ void cpumask_raise_softirq(const cpumask
>  void cpu_raise_softirq(unsigned int cpu, unsigned int nr);
>  void raise_softirq(unsigned int nr);
>  
> +void cpu_raise_softirq_batch_begin(void);
> +void cpu_raise_softirq_batch_finish(void);
> +
>  /*
>   * Process pending softirqs on this CPU. This should be called periodically
>   * when performing work that prevents softirqs from running in a timely manner.
> 
> 



_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

