[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH v3 7/7] xen/x86: use PCID feature



On 23/03/18 11:51, Jan Beulich wrote:
>>>> On 21.03.18 at 13:51, <jgross@xxxxxxxx> wrote:
>> Avoid flushing the complete TLB when switching %cr3 for mitigation of
>> Meltdown by using the PCID feature if available.
>>
>> We are using 4 PCID values for a 64 bit pv domain subject to XPTI and
>> 2 values for the non-XPTI case:
>>
>> - guest active and in kernel mode
>> - guest active and in user mode
>> - hypervisor active and guest in user mode (XPTI only)
>> - hypervisor active and guest in kernel mode (XPTI only)
> 
> Before committing to this route, Jun, Kevin, can we please get
> confirmation that PCID isn't (and isn't going to be) subject to the
> same speculation issues in the pipeline that the U/S bit is (having
> caused Meltdown in the first place)? To me it seems a pretty
> likely thing to play all the same games.

Really? This would assume either the processor is capable to deal with
multiple matching TLB entries when speculating or that there can be
only one entry per virtual address present in the TLB at the same time
in spite of different PCIDs.

And why aren't you asking for the same problem with VPIDs? This should
be comparable to the PCID problem you are suspecting.

> 
>> ---
>>  docs/misc/xen-command-line.markdown | 12 +++++++++
>>  xen/arch/x86/debug.c                |  3 ++-
>>  xen/arch/x86/domain_page.c          |  2 +-
>>  xen/arch/x86/domctl.c               |  4 +++
>>  xen/arch/x86/flushtlb.c             | 49 ++++++++++++++++++++++++++++------
>>  xen/arch/x86/mm.c                   | 34 +++++++++++++++++++++---
>>  xen/arch/x86/pv/dom0_build.c        |  1 +
>>  xen/arch/x86/pv/domain.c            | 52 
>> +++++++++++++++++++++++++++++++++++++
>>  xen/include/asm-x86/domain.h        | 14 +++++++---
>>  xen/include/asm-x86/pv/domain.h     |  2 ++
>>  xen/include/asm-x86/x86-defns.h     |  1 +
>>  11 files changed, 158 insertions(+), 16 deletions(-)
> 
> Having had the discussion previously, I'm missing a change to
> smp.c:new_tlbflush_clock_period() here.

I can add that. I did not do this as I haven't treated FLUSH_TLB
differently to FLUSH_TLB_GLOBAL (trying this even without any other
change to staging led to degfaults in dom0). So such a change to
new_tlbflush_clock_period() should be a separate patch I believe.

> 
>> --- a/xen/arch/x86/debug.c
>> +++ b/xen/arch/x86/debug.c
>> @@ -97,7 +97,8 @@ dbg_pv_va2mfn(dbgva_t vaddr, struct domain *dp, uint64_t 
>> pgd3val)
>>      l3_pgentry_t l3e, *l3t;
>>      l2_pgentry_t l2e, *l2t;
>>      l1_pgentry_t l1e, *l1t;
>> -    unsigned long cr3 = (pgd3val ? pgd3val : dp->vcpu[0]->arch.cr3);
>> +    unsigned long cr3 = (pgd3val ? pgd3val
>> +                                 : (dp->vcpu[0]->arch.cr3 & 
>> ~X86_CR3_NOFLUSH));
> 
> What about the PCID portion? You want the address of the page
> here, so I think you should use a "white-listing" masking operation
> instead of a blacklisting one.

The PCID portion is no problem here as the value is converted into a
mfn.

I can do the modification you are asking for, of course.

> 
> Also, as you touch this anyway, it would have been nice to drop
> the unnecessary middle argument at the same time.

Okay.

> 
>> --- a/xen/arch/x86/flushtlb.c
>> +++ b/xen/arch/x86/flushtlb.c
>> @@ -102,7 +102,19 @@ void write_cr3_cr4(unsigned long cr3, unsigned long cr4)
>>      t = pre_flush();
>>  
>>      if ( read_cr4() & X86_CR4_PGE )
>> +        /*
>> +         * X86_CR4_PGE set means PCID being inactive.
>> +         * We have to purge the TLB via flipping cr4.pge.
>> +         */
>>          write_cr4(cr4 & ~X86_CR4_PGE);
>> +    else if ( cpu_has_invpcid )
>> +        /*
>> +         * If we are using PCID purge the TLB via INVPCID as loading cr3
>> +         * will affect the current PCID only.
>> +         * If INVPCID is not supported we don't use PCIDs so loading cr3
>> +         * will purge the TLB (we are in the "global pages off" branch).
>> +         */
>> +        invpcid_flush_all();
> 
> Since with CR4.PGE correctness-wise clear it doesn't matter whether
> you use invpcid_flush_all() or invpcid_flush_all_nonglobals() here,
> does it also not matter performance-wise?

I didn't check that. I'll have a try.

> 
>> @@ -131,14 +143,35 @@ unsigned int flush_area_local(const void *va, unsigned 
>> int flags)
>>      {
>>          if ( order == 0 )
>>          {
>> -            /*
>> -             * We don't INVLPG multi-page regions because the 2M/4M/1G
>> -             * region may not have been mapped with a superpage. Also there
>> -             * are various errata surrounding INVLPG usage on superpages, 
>> and
>> -             * a full flush is in any case not *that* expensive.
>> -             */
> 
> This comment really explains the order == 0 check above, so I
> don't think it should be moved.

Okay.

> 
>> --- a/xen/arch/x86/mm.c
>> +++ b/xen/arch/x86/mm.c
>> @@ -497,12 +497,38 @@ void free_shared_domheap_page(struct page_info *page)
>>      free_domheap_page(page);
>>  }
>>  
>> +/*
>> + * Return additional PCID specific cr3 bits.
>> + *
>> + * Note that X86_CR3_NOFLUSH will not be readable in cr3. Anyone consuming
>> + * v->arch.cr3 should mask away X86_CR3_NOFLUSH and X86_CR3_PCIDMASK in case
> 
> I stand to my previous comment, which was left unanswered afaics:

Uuh, sorry for that.

> "Is it a good idea to suppress the flush this way for every reader
>  of v->arch.cr3? For example, what about the use in
>  dbg_pv_va2mfn()? I think it should be the consumers of the field
>  to decide whether to OR in that flag (which isn't part of the
>  register value anyway)."
> To be more precise, I can see the pcid to be put here (which will
> require consumers to be aware anyway), but I don't think the non-
> register-value no-flush indicator belongs here. IOW I think after
> writing the value into %cr3, the value read back should match the
> stored value.

This will make restore_all_guest more complicated. v->arch.cr3 is copied
to cpu_info->xen_cr3 there and this value is then used for %cr3. I
really don't want to add complex logic there to add the no-flush
indicator in case PCIDs are active.

>> + * the value is used to address the root page table.
>> + */
>> +static unsigned long get_pcid_bits(struct vcpu *v, bool is_xen)
>> +{
>> +    return X86_CR3_NOFLUSH | (is_xen ? PCID_PV_XEN : 0) |
>> +           ((v->arch.flags & TF_kernel_mode) ? PCID_PV_PRIV : PCID_PV_USER);
>> +}
>> +
>>  void make_cr3(struct vcpu *v, mfn_t mfn)
>>  {
>> +    struct domain *d = v->domain;
>> +
>>      v->arch.cr3 = mfn_x(mfn) << PAGE_SHIFT;
>> -    if ( v == current && this_cpu(root_pgt) && is_pv_vcpu(v) &&
>> -         !is_pv_32bit_vcpu(v) )
>> -        get_cpu_info()->root_pgt_changed = true;
>> +    if ( is_pv_domain(d) )
>> +    {
>> +        if ( d->arch.pv_domain.xpti && v == current )
>> +        {
>> +            struct cpu_info *cpu_info = get_cpu_info();
>> +
>> +            cpu_info->root_pgt_changed = true;
>> +            if ( d->arch.pv_domain.pcid )
>> +                cpu_info->pv_cr3 = __pa(this_cpu(root_pgt)) |
>> +                                   get_pcid_bits(v, false);
> 
> Just like we seem to have agreed (on an earlier patch) that
> setting root_pgt_changed elsewhere may be better, perhaps
> that extends to pv_cr3 as well? It would certainly be nice if
> this was written with the final intended value right away, yet
> the lack of an "else" here suggests there is at least one place
> where this doesn't happen, but pv_cr3 is also written with a
> non-zero value.

Yes. I'll look into moving it to _toggle_guest_pt().

> 
>> +        }
>> +        if ( d->arch.pv_domain.pcid )
>> +            v->arch.cr3 |= get_pcid_bits(v, d->arch.pv_domain.xpti);
> 
> It is certainly at least confusing that you pass "xpti" as argument
> for a parameter named "is_xen". The question is what mode you
> want us to be in when running with PCID but no XPTI: Should Xen
> use its own PCID then? That would seem reasonable to me, but
> would seem to require passing true here. Yet then this would
> require switching CR3 on the way out to guests and back in from
> guests even in that case (just without copying the root page table).
> 
> If in that mode Xen isn't meant to use its own PCID, the command
> line option "pcid=noxpti" would seem at least misleading to me then,
> as you wouldn't really use different PCIDs in that mode, but only
> disable global pages (which probably hurts performance) and use
> INVPCID for flushing (which probably helps performance).

The idea is to use different PCID values for guest kernel and user
mode. This removes the need for global guest user pages. I don't
want to use global guest user pages together with PCID as flushing
global pages from the TLB with PCID enabled requires flushing either
the complete TLB or you'd have to use INVLPG in all possible address
spaces (so you'd need to have multiple %cr3 switches).

I'll add some comments in this regard.

> 
>> --- a/xen/arch/x86/pv/domain.c
>> +++ b/xen/arch/x86/pv/domain.c
>> @@ -95,6 +95,58 @@ void xpti_domain_init(struct domain *d)
>>      }
>>  }
>>  
>> +static __read_mostly enum {
>> +    PCID_OFF,
>> +    PCID_ALL,
>> +    PCID_XPTI,
>> +    PCID_NOXPTI
>> +} opt_pcid = PCID_XPTI;
>> +
>> +static __init int parse_pcid(const char *s)
>> +{
>> +    int rc = 0;
>> +
>> +    if ( !strcmp(s, "off") )
>> +        opt_pcid = PCID_OFF;
>> +    else if ( !strcmp(s, "all") )
>> +        opt_pcid = PCID_ALL;
>> +    else if ( !strcmp(s, "xpti") )
>> +        opt_pcid = PCID_XPTI;
>> +    else if ( !strcmp(s, "noxpti") )
>> +        opt_pcid = PCID_NOXPTI;
> 
> I'd prefer if you used parse_bool() and parse_boolean() here, so
> that the syntax of this new option fits with all other boolean ones
> we have.

Okay.

> 
>> +void pcid_domain_init(struct domain *d)
>> +{
>> +    if ( !is_pv_domain(d) || is_pv_32bit_domain(d) ||
>> +         !cpu_has_invpcid || !cpu_has_pcid )
>> +        return;
>> +
>> +    switch ( opt_pcid )
>> +    {
>> +    case PCID_OFF:
>> +        d->arch.pv_domain.pcid = false;
> 
> As for the earlier patch, with the return above implying the value
> to be zero this store ought to be unnecessary.

Okay.

> 
>> @@ -619,14 +626,15 @@ unsigned long pv_guest_cr4_fixup(const struct vcpu *, 
>> unsigned long guest_cr4);
>>        | (mmu_cr4_features                                   \
>>           & (X86_CR4_PGE | X86_CR4_PSE | X86_CR4_SMEP |      \
>>              X86_CR4_SMAP | X86_CR4_OSXSAVE |                \
>> -            X86_CR4_FSGSBASE))                              \
>> -      | ((v)->domain->arch.vtsc ? X86_CR4_TSD : 0))         \
>> +            X86_CR4_FSGSBASE | X86_CR4_PCIDE))              \
> 
> Why? Afaics you never set the bit in mmu_cr4_features.

Oops, this is a leftover from v2.

> 
>> --- a/xen/include/asm-x86/x86-defns.h
>> +++ b/xen/include/asm-x86/x86-defns.h
>> @@ -46,6 +46,7 @@
>>   * Intel CPU flags in CR3
>>   */
>>  #define X86_CR3_NOFLUSH (_AC(1, ULL) << 63)
>> +#define X86_CR3_PCIDMASK _AC(0x0fff, ULL) /* Mask for PCID */
> 
> As per an earlier comment you also want X86_CR3_ADDR_MASK here
> (and as an implication from the suggested name I think it would be
> better if you also added another underscore in the one you already
> add).

Yes.


Juergen

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel

 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.