
Re: [Xen-devel] [PATCH for-4.12 v2 16/17] xen/arm: Implement Set/Way operations



CC'ing Dario

Dario, please take a look at the preemption question below.


On Fri, 7 Dec 2018, Julien Grall wrote:
> On 06/12/2018 23:32, Stefano Stabellini wrote:
> > On Tue, 4 Dec 2018, Julien Grall wrote:
> > > Set/Way operations are used to perform maintenance on a given cache.
> > > At the moment, Set/Way operations are not trapped and therefore a guest
> > > OS will directly act on the local cache. However, a vCPU may migrate to
> > > another pCPU in the middle of the operation. This will leave the cache
> > > with stale data (Set/Way operations are not propagated), potentially
> > > causing a crash. This may be the cause of the heisenbug noticed in
> > > Osstest [1].
> > > 
> > > Furthermore, Set/Way operations are not available on system caches. This
> > > means that an OS, such as 32-bit Linux, relying on those operations to
> > > fully clean the cache before disabling the MMU may break because data may
> > > sit in system caches and not in RAM.
> > > 
> > > For more details about Set/Way, see the talk "The Art of Virtualizing
> > > Cache Maintenance" given at Xen Summit 2018 [2].
> > > 
> > > In the context of Xen, we need to trap Set/Way operations and emulate
> > > them. From the Arm ARM (B1.14.4 in DDI 0406C.c), Set/Way operations are
> > > difficult to virtualize. So we can assume that a guest OS using them will
> > > suffer the consequences (i.e. slowness) until developers remove all usage
> > > of Set/Way.
> > > 
> > > As the software is not allowed to infer the Set/Way to Physical Address
> > > mapping, Xen will need to go through the guest P2M and clean &
> > > invalidate all the entries mapped.
> > > 
> > > Because Set/Way operations happen in batches (a loop over all Set/Way of a
> > > cache), Xen would need to go through the P2M for every instruction. This is
> > > quite expensive and would severely impact the guest OS. The implementation
> > > re-uses the KVM policy to limit the number of flushes:
> > >      - If we trap a Set/Way operation, we enable VM trapping (i.e.
> > >        HCR_EL2.TVM) to detect the cache being turned on/off, and do a full
> > >        clean.
> > >      - We clean the caches when they are turned on and off.
> > >      - Once the caches are enabled, we stop trapping VM instructions.
> > > 
> > > [1] https://lists.xenproject.org/archives/html/xen-devel/2017-09/msg03191.html
> > > [2] https://fr.slideshare.net/xen_com_mgr/virtualizing-cache
> > > 
> > > Signed-off-by: Julien Grall <julien.grall@xxxxxxx>
> > > 
> > > ---
> > >      Changes in v2:
> > >          - Fix emulation for Set/Way cache flush arm64 sysreg
> > >          - Add support for preemption
> > >          - Check cache status on every VM traps in Arm64
> > >          - Remove spurious change
> > > ---
> > >   xen/arch/arm/arm64/vsysreg.c | 17 ++++++++
> > >   xen/arch/arm/p2m.c           | 92 ++++++++++++++++++++++++++++++++++++++++++++
> > >   xen/arch/arm/traps.c         | 25 +++++++++++-
> > >   xen/arch/arm/vcpreg.c        | 22 +++++++++++
> > >   xen/include/asm-arm/domain.h |  8 ++++
> > >   xen/include/asm-arm/p2m.h    | 20 ++++++++++
> > >   6 files changed, 183 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/xen/arch/arm/arm64/vsysreg.c b/xen/arch/arm/arm64/vsysreg.c
> > > index 16ac9c344a..8a85507d9d 100644
> > > --- a/xen/arch/arm/arm64/vsysreg.c
> > > +++ b/xen/arch/arm/arm64/vsysreg.c
> > > @@ -34,9 +34,14 @@
> > >   static bool vreg_emulate_##reg(struct cpu_user_regs *regs,          \
> > >                                  uint64_t *r, bool read)              \
> > >   {                                                                   \
> > > +    struct vcpu *v = current;                                       \
> > > +    bool cache_enabled = vcpu_has_cache_enabled(v);                 \
> > > +                                                                    \
> > >       GUEST_BUG_ON(read);                                             \
> > >       WRITE_SYSREG64(*r, reg);                                        \
> > >                                                                       \
> > > +    p2m_toggle_cache(v, cache_enabled);                             \
> > > +                                                                    \
> > >       return true;                                                    \
> > >   }
> > >   @@ -85,6 +90,18 @@ void do_sysreg(struct cpu_user_regs *regs,
> > >           break;
> > >         /*
> > > +     * HCR_EL2.TSW
> > > +     *
> > > +     * ARMv8 (DDI 0487B.b): Table D1-42
> > > +     */
> > > +    case HSR_SYSREG_DCISW:
> > > +    case HSR_SYSREG_DCCSW:
> > > +    case HSR_SYSREG_DCCISW:
> > > +        if ( !hsr.sysreg.read )
> > > +            p2m_set_way_flush(current);
> > > +        break;
> > > +
> > > +    /*
> > >        * HCR_EL2.TVM
> > >        *
> > >        * ARMv8 (DDI 0487D.a): Table D1-38
> > > diff --git a/xen/arch/arm/p2m.c b/xen/arch/arm/p2m.c
> > > index ca9f0d9ebe..8ee6ff7bd7 100644
> > > --- a/xen/arch/arm/p2m.c
> > > +++ b/xen/arch/arm/p2m.c
> > > @@ -3,6 +3,7 @@
> > >   #include <xen/iocap.h>
> > >   #include <xen/lib.h>
> > >   #include <xen/sched.h>
> > > +#include <xen/softirq.h>
> > >     #include <asm/event.h>
> > >   #include <asm/flushtlb.h>
> > > @@ -1620,6 +1621,97 @@ int p2m_cache_flush_range(struct domain *d, gfn_t *pstart, gfn_t end)
> > >       return rc;
> > >   }
> > >   +/*
> > > + * Clean & invalidate RAM associated to the guest vCPU.
> > > + *
> > > + * The function can only work with the current vCPU and should be called
> > > + * with IRQ enabled as the vCPU could get preempted.
> > > + */
> > > +void p2m_flush_vm(struct vcpu *v)
> > > +{
> > > +    int rc;
> > > +    gfn_t start = _gfn(0);
> > > +
> > > +    ASSERT(v == current);
> > > +    ASSERT(local_irq_is_enabled());
> > > +    ASSERT(v->arch.need_flush_to_ram);
> > > +
> > > +    do
> > > +    {
> > > +        rc = p2m_cache_flush_range(v->domain, &start, _gfn(ULONG_MAX));
> > > +        if ( rc == -ERESTART )
> > > +            do_softirq();
> > 
> > Shouldn't we store somewhere where we left off? Specifically the value
> > of `start' when rc == -ERESTART? Maybe we could change need_flush_to_ram to
> > gfn_t and use it to store `start'?
> 
> It is not necessary on Arm. The hypervisor stack is per-vCPU and we will
> always return to where we were preempted.

Ah, right! Even better.


> > > +    } while ( rc == -ERESTART );
> > > +
> > > +    if ( rc != 0 )
> > > +        gprintk(XENLOG_WARNING,
> > > +                "P2M has not been correctly cleaned (rc = %d)\n",
> > > +                rc);
> > > +
> > > +    v->arch.need_flush_to_ram = false;
> > > +}
> > > +
> > > +/*
> > > + * See note at ARMv7 ARM B1.14.4 (DDI 0406C.c) (TL;DR: S/W ops are not
> > > + * easily virtualized).
> > > + *
> > > + * Main problems:
> > > + *  - S/W ops are local to a CPU (not broadcast)
> > > + *  - We have line migration behind our back (speculation)
> > > + *  - System caches don't support S/W at all (damn!)
> > > + *
> > > + * In the face of the above, the best we can do is to try and convert
> > > + * S/W ops to VA ops. Because the guest is not allowed to infer the S/W
> > > + * to PA mapping, it can only use S/W to nuke the whole cache, which is
> > > + * rather a good thing for us.
> > > + *
> > > + * Also, it is only used when turning caches on/off ("The expected
> > > + * usage of the cache maintenance instructions that operate by set/way
> > > + * is associated with the powerdown and powerup of caches, if this is
> > > + * required by the implementation.").
> > > + *
> > > + * We use the following policy:
> > > + *  - If we trap a S/W operation, we enabled VM trapping to detect
> > > + *  caches being turned on/off, and do a full clean.
> > > + *
> > > + *  - We flush the caches on both caches being turned on and off.
> > > + *
> > > + *  - Once the caches are enabled, we stop trapping VM ops.
> > > + */
> > > +void p2m_set_way_flush(struct vcpu *v)
> > > +{
> > > +    /* This function can only work with the current vCPU. */
> > > +    ASSERT(v == current);
> > > +
> > > +    if ( !(v->arch.hcr_el2 & HCR_TVM) )
> > > +    {
> > > +        v->arch.need_flush_to_ram = true;
> > > +        vcpu_hcr_set_flags(v, HCR_TVM);
> > > +    }
> > > +}
> > > +
> > > +void p2m_toggle_cache(struct vcpu *v, bool was_enabled)
> > > +{
> > > +    bool now_enabled = vcpu_has_cache_enabled(v);
> > > +
> > > +    /* This function can only work with the current vCPU. */
> > > +    ASSERT(v == current);
> > > +
> > > +    /*
> > > +     * If switching the MMU+caches on, need to invalidate the caches.
> > > +     * If switching it off, need to clean the caches.
> > > +     * Clean + invalidate does the trick always.
> > > +     */
> > > +    if ( was_enabled != now_enabled )
> > > +    {
> > > +        v->arch.need_flush_to_ram = true;
> > > +    }
> > > +
> > > +    /* Caches are now on, stop trapping VM ops (until a S/W op) */
> > > +    if ( now_enabled )
> > > +        vcpu_hcr_clear_flags(v, HCR_TVM);
> > > +}
> > > +
> > >   mfn_t gfn_to_mfn(struct domain *d, gfn_t gfn)
> > >   {
> > >       return p2m_lookup(d, gfn, NULL);
> > > diff --git a/xen/arch/arm/traps.c b/xen/arch/arm/traps.c
> > > index 02665cc7b4..221c762ada 100644
> > > --- a/xen/arch/arm/traps.c
> > > +++ b/xen/arch/arm/traps.c
> > > @@ -97,7 +97,7 @@ register_t get_default_hcr_flags(void)
> > >   {
> > >       return  (HCR_PTW|HCR_BSU_INNER|HCR_AMO|HCR_IMO|HCR_FMO|HCR_VM|
> > >                (vwfi != NATIVE ? (HCR_TWI|HCR_TWE) : 0) |
> > > -             HCR_TSC|HCR_TAC|HCR_SWIO|HCR_TIDCP|HCR_FB);
> > > +             HCR_TSC|HCR_TAC|HCR_SWIO|HCR_TIDCP|HCR_FB|HCR_TSW);
> > >   }
> > >     static enum {
> > > @@ -2258,10 +2258,33 @@ static void check_for_pcpu_work(void)
> > >       }
> > >   }
> > >   +/*
> > > + * Process pending work for the vCPU. Any call should be fast or
> > > + * implement preemption.
> > > + */
> > > +static void check_for_vcpu_work(void)
> > > +{
> > > +    struct vcpu *v = current;
> > > +
> > > +    if ( likely(!v->arch.need_flush_to_ram) )
> > > +        return;
> > > +
> > > +    /*
> > > +     * Give a chance for the pCPU to process work before handling the vCPU
> > > +     * pending work.
> > > +     */
> > > +    check_for_pcpu_work();
> > 
> > This is a bit awkward: basically we need to call check_for_pcpu_work()
> > before check_for_vcpu_work(), and after it. This is basically what we
> > are doing:
> > 
> >    check_for_pcpu_work()
> >    check_for_vcpu_work()
> >    check_for_pcpu_work()
> 
> Not really, we only do that if we have vCPU work to do (see the check
> !v->arch.need_flush_to_ram). So we call it twice only when we need to do some
> vCPU work (at the moment only the p2m flush).
> 
> We can't avoid the one after check_for_vcpu_work() because you may have
> softirqs pending and already signalled (i.e. via an interrupt).

I understand this.


> So you may not execute them before returning to the guest, introducing a
> long delay. That's why we execute the rest of the code with interrupts
> masked. If softirq_pending() returns 0 then you know there are no more
> softirqs pending to handle. Any new ones will be signalled via an
> interrupt that can only come up when IRQs are unmasked.
>
> The one before executing the vCPU work can potentially be avoided. The reason I
> added it is that it can take some time before p2m_flush_vm() will call
> do_softirq(). As we do this on the return to guest, we may already have been
> executing for some time in the hypervisor. So this gives us a chance to
> preempt if the vCPU has consumed its time slice.

It is difficult to tell whether this one is important or whether it would
be best avoided.

For Dario: basically we have a long-running operation to perform, and we
thought that the best place for it would be on the path returning to the
guest (leave_hypervisor_tail). The operation can interrupt itself by
checking softirq_pending() once in a while to avoid blocking the pCPU
for too long.

The question is: is it better to check softirq_pending() even before
starting? Or is checking every so often during the operation good enough?
Does it even matter?
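
To make the question concrete, below is a condensed sketch of the
return-to-guest flow as the patch shapes it (names are taken from the hunks
quoted above; check_for_vcpu_work() is inlined here purely for illustration,
and the numbered comments mark the two preemption points being discussed):

    void leave_hypervisor_tail(void)
    {
        local_irq_disable();

        /* vCPU work may be long-running, so it has to handle preemption. */
        if ( current->arch.need_flush_to_ram )
        {
            check_for_pcpu_work();    /* (1) the optional check *before* starting */

            local_irq_enable();
            p2m_flush_vm(current);    /* (2) loops, calling do_softirq() whenever
                                             p2m_cache_flush_range() returns -ERESTART */
            local_irq_disable();
        }

        check_for_pcpu_work();        /* softirqs raised or signalled in the meantime */

        vgic_sync_to_lrs();
    }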


 
> > Sounds like we should come up with something better but I don't have a
> > concrete suggestion :-)
> > 
> > 
> > > +    local_irq_enable();
> > > +    p2m_flush_vm(v);
> > > +    local_irq_disable();
> > > +}
> > > +
> > >   void leave_hypervisor_tail(void)
> > >   {
> > >       local_irq_disable();
> > >   +    check_for_vcpu_work();
> > >       check_for_pcpu_work();
> > >         vgic_sync_to_lrs();
> > > diff --git a/xen/arch/arm/vcpreg.c b/xen/arch/arm/vcpreg.c
> > > index 550c25ec3f..cdc91cdf5b 100644
> > > --- a/xen/arch/arm/vcpreg.c
> > > +++ b/xen/arch/arm/vcpreg.c
> > > @@ -51,9 +51,14 @@
> > >   #define TVM_REG(sz, func, reg...)                                        \
> > >   static bool func(struct cpu_user_regs *regs, uint##sz##_t *r, bool read)  \
> > >   {                                                                          \
> > > +    struct vcpu *v = current;                                               \
> > > +    bool cache_enabled = vcpu_has_cache_enabled(v);                         \
> > > +                                                                            \
> > >       GUEST_BUG_ON(read);                                                    \
> > >       WRITE_SYSREG##sz(*r, reg);                                             \
> > >                                                                              \
> > > +    p2m_toggle_cache(v, cache_enabled);                                     \
> > > +                                                                            \
> > >       return true;                                                           \
> > >   }
> > > @@ -71,6 +76,8 @@ static bool func(struct cpu_user_regs *regs, uint##sz##_t *r, bool read)    \
> > >   static bool vreg_emulate_##xreg(struct cpu_user_regs *regs, uint32_t *r,   \
> > >                                   bool read, bool hi)                        \
> > >   {                                                                          \
> > > +    struct vcpu *v = current;                                               \
> > > +    bool cache_enabled = vcpu_has_cache_enabled(v);                         \
> > >       register_t reg = READ_SYSREG(xreg);                                    \
> > >                                                                              \
> > >       GUEST_BUG_ON(read);                                                    \
> > > @@ -86,6 +93,8 @@ static bool vreg_emulate_##xreg(struct cpu_user_regs *regs, uint32_t *r,    \
> > >       }                                                                      \
> > >       WRITE_SYSREG(reg, xreg);                                               \
> > >                                                                              \
> > > +    p2m_toggle_cache(v, cache_enabled);                                     \
> > > +                                                                            \
> > >       return true;                                                           \
> > >   }                                                                          \
> > >                                                                              \
> > > @@ -186,6 +195,19 @@ void do_cp15_32(struct cpu_user_regs *regs, const union hsr hsr)
> > >           break;
> > >         /*
> > > +     * HCR_EL2.TSW
> > > +     *
> > > +     * ARMv7 (DDI 0406C.b): B1.14.6
> > > +     * ARMv8 (DDI 0487B.b): Table D1-42
> > > +     */
> > > +    case HSR_CPREG32(DCISW):
> > > +    case HSR_CPREG32(DCCSW):
> > > +    case HSR_CPREG32(DCCISW):
> > > +        if ( !cp32.read )
> > > +            p2m_set_way_flush(current);
> > > +        break;
> > > +
> > > +    /*
> > >        * HCR_EL2.TVM
> > >        *
> > >        * ARMv8 (DDI 0487D.a): Table D1-38
> > > diff --git a/xen/include/asm-arm/domain.h b/xen/include/asm-arm/domain.h
> > > index 175de44927..f16b973e0d 100644
> > > --- a/xen/include/asm-arm/domain.h
> > > +++ b/xen/include/asm-arm/domain.h
> > > @@ -202,6 +202,14 @@ struct arch_vcpu
> > >       struct vtimer phys_timer;
> > >       struct vtimer virt_timer;
> > >       bool   vtimer_initialized;
> > > +
> > > +    /*
> > > +     * The full P2M may require some cleaning (e.g when emulation
> > > +     * set/way). As the action can take a long time, it requires
> > > +     * preemption. So this is deferred until we return to the guest.
> > 
> > The reason for delaying the call to p2m_flush_vm until we return to the
> > guest is that we need to call do_softirq to check whether we need to be
> > preempted and we cannot make that call in the middle of the vcpreg.c
> > handlers, right?
> We need to make sure that do_softirq() is called without any lock held. With
> the current code, it would technically be possible to call do_softirq() directly
> in the vcpreg.c handlers. But I think it is not entirely future-proof.
> 
> So it would be better to call do_softirq() with as little on the call stack as
> possible, as it is easier to ensure that the callers are not taking any lock.
> 
> The infrastructure added should be re-usable for other sorts of work (e.g. ITS,
> ioreq) in the future.

I think it makes sense, and it is easier to think about and to understand
than calling do_softirq in the middle of another complex function. I would
ask you to improve the last sentence of this comment a bit, something like:

"It is deferred until we return to guest, where we can more easily check
for softirqs and preempt the vcpu safely."

It would almost be worth generalizing it even further, introducing a new
tasklet-like concept to schedule long work before returning to guest.
An idea for the future.
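
To sketch what I mean (all names here are hypothetical, nothing like this
exists in the tree today): each vCPU could carry a list of deferred work
items, and check_for_vcpu_work() would drain it with the same
-ERESTART/do_softirq() dance used for the p2m flush:

    /* Hypothetical sketch only -- this is not an existing Xen interface. */
    struct vcpu_deferred_work {
        struct list_head list;
        /* Called with IRQs enabled; returns -ERESTART when it needs resuming. */
        int (*fn)(struct vcpu *v, void *data);
        void *data;
    };

    static void check_for_vcpu_work(void)
    {
        struct vcpu *v = current;
        struct vcpu_deferred_work *work, *tmp;

        list_for_each_entry_safe(work, tmp, &v->arch.deferred_work, list)
        {
            int rc;

            local_irq_enable();
            do {
                rc = work->fn(v, work->data);
                if ( rc == -ERESTART )
                    do_softirq();
            } while ( rc == -ERESTART );
            local_irq_disable();

            list_del(&work->list);
            xfree(work);
        }
    }

The set/way emulation would then queue one of these instead of setting
need_flush_to_ram directly. Anyway, just food for thought.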

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel

 

