
Re: [Xen-devel] [PATCH for-4.12 v2 16/17] xen/arm: Implement Set/Way operations



CC'ing Dario

Dario, please take a look at the preemption question below.


On Fri, 7 Dec 2018, Julien Grall wrote:
> On 06/12/2018 23:32, Stefano Stabellini wrote:
> > On Tue, 4 Dec 2018, Julien Grall wrote:
> > > Set/Way operations are used to perform maintenance on a given cache.
> > > At the moment, Set/Way operations are not trapped and therefore a guest
> > > OS will directly act on the local cache. However, a vCPU may migrate to
> > > another pCPU in the middle of the operation. This will leave the cache
> > > with stale data (Set/Way operations are not propagated), potentially
> > > causing a crash. This may be the cause of the heisenbug noticed in
> > > Osstest [1].
> > > 
> > > Furthermore, Set/Way operations are not available on system caches. This
> > > means that an OS, such as 32-bit Linux, relying on those operations to
> > > fully clean the cache before disabling the MMU may break because data may
> > > sit in system caches and not in RAM.
> > > 
> > > For more details about Set/Way, see the talk "The Art of Virtualizing
> > > Cache Maintenance" given at Xen Summit 2018 [2].
> > > 
> > > In the context of Xen, we need to trap Set/Way operations and emulate
> > > them. From the Arm ARM (B1.14.4 in DDI 0406C.c), Set/Way operations are
> > > difficult to virtualize. So we can assume that a guest OS using them will
> > > suffer the consequences (i.e. slowness) until developers remove all usage
> > > of Set/Way.
> > > 
> > > As the software is not allowed to infer the Set/Way to Physical Address
> > > mapping, Xen will need to go through the guest P2M and clean &
> > > invalidate all the entries mapped.
> > > 
> > > Because Set/Way operations happen in batches (a loop over all Set/Way of a
> > > cache), Xen would need to go through the P2M for every instruction. This is
> > > quite expensive and would severely impact the guest OS. The implementation
> > > re-uses the KVM policy to limit the number of flushes:
> > >      - If we trap a Set/Way operation, we enable VM trapping (i.e.
> > >        HCR_EL2.TVM) to detect the cache being turned on/off, and do a full
> > >        clean.
> > >      - We clean the caches when they are turned on and off.
> > >      - Once the caches are enabled, we stop trapping VM instructions.
> > > 
> > > [1] https://lists.xenproject.org/archives/html/xen-devel/2017-09/msg03191.html
> > > [2] https://fr.slideshare.net/xen_com_mgr/virtualizing-cache
> > > 
> > > Signed-off-by: Julien Grall <julien.grall@xxxxxxx>
> > > 
> > > ---
> > >      Changes in v2:
> > >          - Fix emulation for Set/Way cache flush arm64 sysreg
> > >          - Add support for preemption
> > >          - Check cache status on every VM traps in Arm64
> > >          - Remove spurious change
> > > ---
> > >   xen/arch/arm/arm64/vsysreg.c | 17 ++++++++
> > >   xen/arch/arm/p2m.c           | 92 ++++++++++++++++++++++++++++++++++++++++++++
> > >   xen/arch/arm/traps.c         | 25 +++++++++++-
> > >   xen/arch/arm/vcpreg.c        | 22 +++++++++++
> > >   xen/include/asm-arm/domain.h |  8 ++++
> > >   xen/include/asm-arm/p2m.h    | 20 ++++++++++
> > >   6 files changed, 183 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/xen/arch/arm/arm64/vsysreg.c b/xen/arch/arm/arm64/vsysreg.c
> > > index 16ac9c344a..8a85507d9d 100644
> > > --- a/xen/arch/arm/arm64/vsysreg.c
> > > +++ b/xen/arch/arm/arm64/vsysreg.c
> > > @@ -34,9 +34,14 @@
> > >   static bool vreg_emulate_##reg(struct cpu_user_regs *regs,          \
> > >                                  uint64_t *r, bool read)              \
> > >   {                                                                   \
> > > +    struct vcpu *v = current;                                       \
> > > +    bool cache_enabled = vcpu_has_cache_enabled(v);                 \
> > > +                                                                    \
> > >       GUEST_BUG_ON(read);                                             \
> > >       WRITE_SYSREG64(*r, reg);                                        \
> > >                                                                       \
> > > +    p2m_toggle_cache(v, cache_enabled);                             \
> > > +                                                                    \
> > >       return true;                                                    \
> > >   }
> > >   @@ -85,6 +90,18 @@ void do_sysreg(struct cpu_user_regs *regs,
> > >           break;
> > >         /*
> > > +     * HCR_EL2.TSW
> > > +     *
> > > +     * ARMv8 (DDI 0487B.b): Table D1-42
> > > +     */
> > > +    case HSR_SYSREG_DCISW:
> > > +    case HSR_SYSREG_DCCSW:
> > > +    case HSR_SYSREG_DCCISW:
> > > +        if ( !hsr.sysreg.read )
> > > +            p2m_set_way_flush(current);
> > > +        break;
> > > +
> > > +    /*
> > >        * HCR_EL2.TVM
> > >        *
> > >        * ARMv8 (DDI 0487D.a): Table D1-38
> > > diff --git a/xen/arch/arm/p2m.c b/xen/arch/arm/p2m.c
> > > index ca9f0d9ebe..8ee6ff7bd7 100644
> > > --- a/xen/arch/arm/p2m.c
> > > +++ b/xen/arch/arm/p2m.c
> > > @@ -3,6 +3,7 @@
> > >   #include <xen/iocap.h>
> > >   #include <xen/lib.h>
> > >   #include <xen/sched.h>
> > > +#include <xen/softirq.h>
> > >     #include <asm/event.h>
> > >   #include <asm/flushtlb.h>
> > > @@ -1620,6 +1621,97 @@ int p2m_cache_flush_range(struct domain *d, gfn_t *pstart, gfn_t end)
> > >       return rc;
> > >   }
> > >   +/*
> > > + * Clean & invalidate RAM associated to the guest vCPU.
> > > + *
> > > + * The function can only work with the current vCPU and should be called
> > > + * with IRQ enabled as the vCPU could get preempted.
> > > + */
> > > +void p2m_flush_vm(struct vcpu *v)
> > > +{
> > > +    int rc;
> > > +    gfn_t start = _gfn(0);
> > > +
> > > +    ASSERT(v == current);
> > > +    ASSERT(local_irq_is_enabled());
> > > +    ASSERT(v->arch.need_flush_to_ram);
> > > +
> > > +    do
> > > +    {
> > > +        rc = p2m_cache_flush_range(v->domain, &start, _gfn(ULONG_MAX));
> > > +        if ( rc == -ERESTART )
> > > +            do_softirq();
> > 
> > Shouldn't we store somewhere where we left off? Specifically the value
> > of `start' when rc == -ERESTART? Maybe we could change need_flush_to_ram to
> > gfn_t and use it to store `start'?
> 
> It is not necessary on Arm. The hypervisor stack is per-vCPU and we will
> always return to where we were preempted.

Ah, right! Even better.


> > > +    } while ( rc == -ERESTART );
> > > +
> > > +    if ( rc != 0 )
> > > +        gprintk(XENLOG_WARNING,
> > > +                "P2M has not been correctly cleaned (rc = %d)\n",
> > > +                rc);
> > > +
> > > +    v->arch.need_flush_to_ram = false;
> > > +}
> > > +
> > > +/*
> > > + * See note at ARMv7 ARM B1.14.4 (DDI 0406C.c) (TL;DR: S/W ops are not
> > > + * easily virtualized).
> > > + *
> > > + * Main problems:
> > > + *  - S/W ops are local to a CPU (not broadcast)
> > > + *  - We have line migration behind our back (speculation)
> > > + *  - System caches don't support S/W at all (damn!)
> > > + *
> > > + * In the face of the above, the best we can do is to try and convert
> > > + * S/W ops to VA ops. Because the guest is not allowed to infer the S/W
> > > + * to PA mapping, it can only use S/W to nuke the whole cache, which is
> > > + * rather a good thing for us.
> > > + *
> > > + * Also, it is only used when turning caches on/off ("The expected
> > > + * usage of the cache maintenance instructions that operate by set/way
> > > + * is associated with the powerdown and powerup of caches, if this is
> > > + * required by the implementation.").
> > > + *
> > > + * We use the following policy:
> > > + *  - If we trap a S/W operation, we enabled VM trapping to detect
> > > + *  caches being turned on/off, and do a full clean.
> > > + *
> > > + *  - We flush the caches on both caches being turned on and off.
> > > + *
> > > + *  - Once the caches are enabled, we stop trapping VM ops.
> > > + */
> > > +void p2m_set_way_flush(struct vcpu *v)
> > > +{
> > > +    /* This function can only work with the current vCPU. */
> > > +    ASSERT(v == current);
> > > +
> > > +    if ( !(v->arch.hcr_el2 & HCR_TVM) )
> > > +    {
> > > +        v->arch.need_flush_to_ram = true;
> > > +        vcpu_hcr_set_flags(v, HCR_TVM);
> > > +    }
> > > +}
> > > +
> > > +void p2m_toggle_cache(struct vcpu *v, bool was_enabled)
> > > +{
> > > +    bool now_enabled = vcpu_has_cache_enabled(v);
> > > +
> > > +    /* This function can only work with the current vCPU. */
> > > +    ASSERT(v == current);
> > > +
> > > +    /*
> > > +     * If switching the MMU+caches on, need to invalidate the caches.
> > > +     * If switching it off, need to clean the caches.
> > > +     * Clean + invalidate does the trick always.
> > > +     */
> > > +    if ( was_enabled != now_enabled )
> > > +    {
> > > +        v->arch.need_flush_to_ram = true;
> > > +    }
> > > +
> > > +    /* Caches are now on, stop trapping VM ops (until a S/W op) */
> > > +    if ( now_enabled )
> > > +        vcpu_hcr_clear_flags(v, HCR_TVM);
> > > +}
> > > +
> > >   mfn_t gfn_to_mfn(struct domain *d, gfn_t gfn)
> > >   {
> > >       return p2m_lookup(d, gfn, NULL);
> > > diff --git a/xen/arch/arm/traps.c b/xen/arch/arm/traps.c
> > > index 02665cc7b4..221c762ada 100644
> > > --- a/xen/arch/arm/traps.c
> > > +++ b/xen/arch/arm/traps.c
> > > @@ -97,7 +97,7 @@ register_t get_default_hcr_flags(void)
> > >   {
> > >       return  (HCR_PTW|HCR_BSU_INNER|HCR_AMO|HCR_IMO|HCR_FMO|HCR_VM|
> > >                (vwfi != NATIVE ? (HCR_TWI|HCR_TWE) : 0) |
> > > -             HCR_TSC|HCR_TAC|HCR_SWIO|HCR_TIDCP|HCR_FB);
> > > +             HCR_TSC|HCR_TAC|HCR_SWIO|HCR_TIDCP|HCR_FB|HCR_TSW);
> > >   }
> > >     static enum {
> > > @@ -2258,10 +2258,33 @@ static void check_for_pcpu_work(void)
> > >       }
> > >   }
> > >   +/*
> > > + * Process pending work for the vCPU. Any call should be fast or
> > > + * implement preemption.
> > > + */
> > > +static void check_for_vcpu_work(void)
> > > +{
> > > +    struct vcpu *v = current;
> > > +
> > > +    if ( likely(!v->arch.need_flush_to_ram) )
> > > +        return;
> > > +
> > > +    /*
> > > +     * Give a chance for the pCPU to process work before handling the vCPU
> > > +     * pending work.
> > > +     */
> > > +    check_for_pcpu_work();
> > 
> > This is a bit awkward: basically we need to call check_for_pcpu_work()
> > before check_for_vcpu_work(), and after it. This is basically what we
> > are doing:
> > 
> >    check_for_pcpu_work()
> >    check_for_vcpu_work()
> >    check_for_pcpu_work()
> 
> Not really, we only do that if we have vCPU work to do (see the check
> !v->arch.need_flush_to_ram). So we call it twice only when we need to do some
> vCPU work (at the moment only the p2m flush).
> 
> We can't avoid the one after check_for_vcpu_work() because you may have
> softirqs pending and already signalled (i.e. via an interrupt).

I understand this.


> So you may not execute them before returning to the guest, introducing a
> long delay. That's why we execute the rest of the code with interrupts
> masked. If softirq_pending() returns 0 then you know there are no more
> softirqs pending to handle. Any new ones will be signalled via an
> interrupt that can only come up when IRQs are unmasked.
>
> The one before executing the vCPU work can potentially be avoided. The reason I
> added it is that it can take some time before p2m_flush_vm() will call
> do_softirq(). As we do this on the return to guest, we may already have been
> executing for some time in the hypervisor. So this gives us a chance to
> preempt if the vCPU has consumed its time slice.

It is difficult to tell whether this one is important or whether it would
be best avoided.

For Dario: basically we have a long-running operation to perform, and we
thought that the best place for it would be on the path returning to the
guest (leave_hypervisor_tail). The operation can interrupt itself by
checking softirq_pending() once in a while to avoid blocking the pCPU
for too long.

The question is: is it better to check softirq_pending() even before
starting? Or is checking every so often during the operation good enough?
Does it even matter?
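
To make the question concrete, below is a condensed sketch of the
return-to-guest flow as the patch shapes it (names are taken from the hunks
quoted above; check_for_vcpu_work() is inlined here purely for illustration,
and the numbered comments mark the two preemption points being discussed):

    void leave_hypervisor_tail(void)
    {
        local_irq_disable();

        /* vCPU work may be long-running, so it has to handle preemption. */
        if ( current->arch.need_flush_to_ram )
        {
            check_for_pcpu_work();    /* (1) the optional check *before* starting */

            local_irq_enable();
            p2m_flush_vm(current);    /* (2) loops, calling do_softirq() whenever
                                             p2m_cache_flush_range() returns -ERESTART */
            local_irq_disable();
        }

        check_for_pcpu_work();        /* softirqs raised or signalled in the meantime */

        vgic_sync_to_lrs();
    }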


 
> > Sounds like we should come up with something better but I don't have a
> > concrete suggestion :-)
> > 
> > 
> > > +    local_irq_enable();
> > > +    p2m_flush_vm(v);
> > > +    local_irq_disable();
> > > +}
> > > +
> > >   void leave_hypervisor_tail(void)
> > >   {
> > >       local_irq_disable();
> > >   +    check_for_vcpu_work();
> > >       check_for_pcpu_work();
> > >         vgic_sync_to_lrs();
> > > diff --git a/xen/arch/arm/vcpreg.c b/xen/arch/arm/vcpreg.c
> > > index 550c25ec3f..cdc91cdf5b 100644
> > > --- a/xen/arch/arm/vcpreg.c
> > > +++ b/xen/arch/arm/vcpreg.c
> > > @@ -51,9 +51,14 @@
> > >   #define TVM_REG(sz, func, reg...)                                        \
> > >   static bool func(struct cpu_user_regs *regs, uint##sz##_t *r, bool read)  \
> > >   {                                                                          \
> > > +    struct vcpu *v = current;                                               \
> > > +    bool cache_enabled = vcpu_has_cache_enabled(v);                         \
> > > +                                                                            \
> > >       GUEST_BUG_ON(read);                                                    \
> > >       WRITE_SYSREG##sz(*r, reg);                                             \
> > >                                                                              \
> > > +    p2m_toggle_cache(v, cache_enabled);                                     \
> > > +                                                                            \
> > >       return true;                                                           \
> > >   }
> > > @@ -71,6 +76,8 @@ static bool func(struct cpu_user_regs *regs, uint##sz##_t *r, bool read)    \
> > >   static bool vreg_emulate_##xreg(struct cpu_user_regs *regs, uint32_t *r,   \
> > >                                   bool read, bool hi)                        \
> > >   {                                                                          \
> > > +    struct vcpu *v = current;                                               \
> > > +    bool cache_enabled = vcpu_has_cache_enabled(v);                         \
> > >       register_t reg = READ_SYSREG(xreg);                                    \
> > >                                                                              \
> > >       GUEST_BUG_ON(read);                                                    \
> > > @@ -86,6 +93,8 @@ static bool vreg_emulate_##xreg(struct cpu_user_regs *regs, uint32_t *r,    \
> > >       }                                                                      \
> > >       WRITE_SYSREG(reg, xreg);                                               \
> > >                                                                              \
> > > +    p2m_toggle_cache(v, cache_enabled);                                     \
> > > +                                                                            \
> > >       return true;                                                           \
> > >   }                                                                          \
> > >                                                                              \
> > > @@ -186,6 +195,19 @@ void do_cp15_32(struct cpu_user_regs *regs, const union hsr hsr)
> > >           break;
> > >         /*
> > > +     * HCR_EL2.TSW
> > > +     *
> > > +     * ARMv7 (DDI 0406C.b): B1.14.6
> > > +     * ARMv8 (DDI 0487B.b): Table D1-42
> > > +     */
> > > +    case HSR_CPREG32(DCISW):
> > > +    case HSR_CPREG32(DCCSW):
> > > +    case HSR_CPREG32(DCCISW):
> > > +        if ( !cp32.read )
> > > +            p2m_set_way_flush(current);
> > > +        break;
> > > +
> > > +    /*
> > >        * HCR_EL2.TVM
> > >        *
> > >        * ARMv8 (DDI 0487D.a): Table D1-38
> > > diff --git a/xen/include/asm-arm/domain.h b/xen/include/asm-arm/domain.h
> > > index 175de44927..f16b973e0d 100644
> > > --- a/xen/include/asm-arm/domain.h
> > > +++ b/xen/include/asm-arm/domain.h
> > > @@ -202,6 +202,14 @@ struct arch_vcpu
> > >       struct vtimer phys_timer;
> > >       struct vtimer virt_timer;
> > >       bool   vtimer_initialized;
> > > +
> > > +    /*
> > > +     * The full P2M may require some cleaning (e.g when emulation
> > > +     * set/way). As the action can take a long time, it requires
> > > +     * preemption. So this is deferred until we return to the guest.
> > 
> > The reason for delaying the call to p2m_flush_vm until we return to the
> > guest is that we need to call do_softirq to check whether we need to be
> > preempted and we cannot make that call in the middle of the vcpreg.c
> > handlers, right?
> We need to make sure that do_softirq() is called without any lock held. With
> the current code, it would technically be possible to call do_softirq() directly
> in the vcpreg.c handlers. But I think it is not entirely future-proof.
> 
> So it would be better to call do_softirq() with as little on the call stack as
> possible, as it is easier to ensure that the callers are not taking any lock.
> 
> The infrastructure added should be re-usable for other sorts of work (e.g. ITS,
> ioreq) in the future.

I think it makes sense, and it is easier to think about and to understand
than calling do_softirq in the middle of another complex function. I would
ask you to improve the last sentence of this comment a bit, something like:

"It is deferred until we return to guest, where we can more easily check
for softirqs and preempt the vcpu safely."

It would almost be worth generalizing it even further, introducing a new
tasklet-like concept to schedule long work before returning to guest.
An idea for the future.
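
To sketch what I mean (all names here are hypothetical, nothing like this
exists in the tree today): each vCPU could carry a list of deferred work
items, and check_for_vcpu_work() would drain it with the same
-ERESTART/do_softirq() dance used for the p2m flush:

    /* Hypothetical sketch only -- this is not an existing Xen interface. */
    struct vcpu_deferred_work {
        struct list_head list;
        /* Called with IRQs enabled; returns -ERESTART when it needs resuming. */
        int (*fn)(struct vcpu *v, void *data);
        void *data;
    };

    static void check_for_vcpu_work(void)
    {
        struct vcpu *v = current;
        struct vcpu_deferred_work *work, *tmp;

        list_for_each_entry_safe(work, tmp, &v->arch.deferred_work, list)
        {
            int rc;

            local_irq_enable();
            do {
                rc = work->fn(v, work->data);
                if ( rc == -ERESTART )
                    do_softirq();
            } while ( rc == -ERESTART );
            local_irq_disable();

            list_del(&work->list);
            xfree(work);
        }
    }

The set/way emulation would then queue one of these instead of setting
need_flush_to_ram directly. Anyway, just food for thought.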

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel

 

