[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH v3 7/7] xen/x86: use PCID feature



>>> On 21.03.18 at 13:51, <jgross@xxxxxxxx> wrote:
> Avoid flushing the complete TLB when switching %cr3 for mitigation of
> Meltdown by using the PCID feature if available.
> 
> We are using 4 PCID values for a 64 bit pv domain subject to XPTI and
> 2 values for the non-XPTI case:
> 
> - guest active and in kernel mode
> - guest active and in user mode
> - hypervisor active and guest in user mode (XPTI only)
> - hypervisor active and guest in kernel mode (XPTI only)

Before committing to this route, Jun, Kevin, can we please get
confirmation that PCID isn't (and isn't going to be) subject to the
same speculation issues in the pipeline that the U/S bit is (having
caused Meltdown in the first place)? To me it seems a pretty
likely thing to play all the same games.

> ---
>  docs/misc/xen-command-line.markdown | 12 +++++++++
>  xen/arch/x86/debug.c                |  3 ++-
>  xen/arch/x86/domain_page.c          |  2 +-
>  xen/arch/x86/domctl.c               |  4 +++
>  xen/arch/x86/flushtlb.c             | 49 ++++++++++++++++++++++++++++------
>  xen/arch/x86/mm.c                   | 34 +++++++++++++++++++++---
>  xen/arch/x86/pv/dom0_build.c        |  1 +
>  xen/arch/x86/pv/domain.c            | 52 
> +++++++++++++++++++++++++++++++++++++
>  xen/include/asm-x86/domain.h        | 14 +++++++---
>  xen/include/asm-x86/pv/domain.h     |  2 ++
>  xen/include/asm-x86/x86-defns.h     |  1 +
>  11 files changed, 158 insertions(+), 16 deletions(-)

Having had the discussion previously, I'm missing a change to
smp.c:new_tlbflush_clock_period() here.

> --- a/xen/arch/x86/debug.c
> +++ b/xen/arch/x86/debug.c
> @@ -97,7 +97,8 @@ dbg_pv_va2mfn(dbgva_t vaddr, struct domain *dp, uint64_t 
> pgd3val)
>      l3_pgentry_t l3e, *l3t;
>      l2_pgentry_t l2e, *l2t;
>      l1_pgentry_t l1e, *l1t;
> -    unsigned long cr3 = (pgd3val ? pgd3val : dp->vcpu[0]->arch.cr3);
> +    unsigned long cr3 = (pgd3val ? pgd3val
> +                                 : (dp->vcpu[0]->arch.cr3 & 
> ~X86_CR3_NOFLUSH));

What about the PCID portion? You want the address of the page
here, so I think you should use a "white-listing" masking operation
instead of a blacklisting one.

Also, as you touch this anyway, it would have been nice to drop
the unnecessary middle argument at the same time.

> --- a/xen/arch/x86/flushtlb.c
> +++ b/xen/arch/x86/flushtlb.c
> @@ -102,7 +102,19 @@ void write_cr3_cr4(unsigned long cr3, unsigned long cr4)
>      t = pre_flush();
>  
>      if ( read_cr4() & X86_CR4_PGE )
> +        /*
> +         * X86_CR4_PGE set means PCID being inactive.
> +         * We have to purge the TLB via flipping cr4.pge.
> +         */
>          write_cr4(cr4 & ~X86_CR4_PGE);
> +    else if ( cpu_has_invpcid )
> +        /*
> +         * If we are using PCID purge the TLB via INVPCID as loading cr3
> +         * will affect the current PCID only.
> +         * If INVPCID is not supported we don't use PCIDs so loading cr3
> +         * will purge the TLB (we are in the "global pages off" branch).
> +         */
> +        invpcid_flush_all();

Since with CR4.PGE correctness-wise clear it doesn't matter whether
you use invpcid_flush_all() or invpcid_flush_all_nonglobals() here,
does it also not matter performance-wise?

> @@ -131,14 +143,35 @@ unsigned int flush_area_local(const void *va, unsigned 
> int flags)
>      {
>          if ( order == 0 )
>          {
> -            /*
> -             * We don't INVLPG multi-page regions because the 2M/4M/1G
> -             * region may not have been mapped with a superpage. Also there
> -             * are various errata surrounding INVLPG usage on superpages, and
> -             * a full flush is in any case not *that* expensive.
> -             */

This comment really explains the order == 0 check above, so I
don't think it should be moved.

> --- a/xen/arch/x86/mm.c
> +++ b/xen/arch/x86/mm.c
> @@ -497,12 +497,38 @@ void free_shared_domheap_page(struct page_info *page)
>      free_domheap_page(page);
>  }
>  
> +/*
> + * Return additional PCID specific cr3 bits.
> + *
> + * Note that X86_CR3_NOFLUSH will not be readable in cr3. Anyone consuming
> + * v->arch.cr3 should mask away X86_CR3_NOFLUSH and X86_CR3_PCIDMASK in case

I stand to my previous comment, which was left unanswered afaics:
"Is it a good idea to suppress the flush this way for every reader
 of v->arch.cr3? For example, what about the use in
 dbg_pv_va2mfn()? I think it should be the consumers of the field
 to decide whether to OR in that flag (which isn't part of the
 register value anyway)."
To be more precise, I can see the pcid to be put here (which will
require consumers to be aware anyway), but I don't think the non-
register-value no-flush indicator belongs here. IOW I think after
writing the value into %cr3, the value read back should match the
stored value.

> + * the value is used to address the root page table.
> + */
> +static unsigned long get_pcid_bits(struct vcpu *v, bool is_xen)
> +{
> +    return X86_CR3_NOFLUSH | (is_xen ? PCID_PV_XEN : 0) |
> +           ((v->arch.flags & TF_kernel_mode) ? PCID_PV_PRIV : PCID_PV_USER);
> +}
> +
>  void make_cr3(struct vcpu *v, mfn_t mfn)
>  {
> +    struct domain *d = v->domain;
> +
>      v->arch.cr3 = mfn_x(mfn) << PAGE_SHIFT;
> -    if ( v == current && this_cpu(root_pgt) && is_pv_vcpu(v) &&
> -         !is_pv_32bit_vcpu(v) )
> -        get_cpu_info()->root_pgt_changed = true;
> +    if ( is_pv_domain(d) )
> +    {
> +        if ( d->arch.pv_domain.xpti && v == current )
> +        {
> +            struct cpu_info *cpu_info = get_cpu_info();
> +
> +            cpu_info->root_pgt_changed = true;
> +            if ( d->arch.pv_domain.pcid )
> +                cpu_info->pv_cr3 = __pa(this_cpu(root_pgt)) |
> +                                   get_pcid_bits(v, false);

Just like we seem to have agreed (on an earlier patch) that
setting root_pgt_changed elsewhere may be better, perhaps
that extends to pv_cr3 as well? It would certainly be nice if
this was written with the final intended value right away, yet
the lack of an "else" here suggests there is at least one place
where this doesn't happen, but pv_cr3 is also written with a
non-zero value.

> +        }
> +        if ( d->arch.pv_domain.pcid )
> +            v->arch.cr3 |= get_pcid_bits(v, d->arch.pv_domain.xpti);

It is certainly at least confusing that you pass "xpti" as argument
for a parameter named "is_xen". The question is what mode you
want us to be in when running with PCID but no XPTI: Should Xen
use its own PCID then? That would seem reasonable to me, but
would seem to require passing true here. Yet then this would
require switching CR3 on the way out to guests and back in from
guests even in that case (just without copying the root page table).

If in that mode Xen isn't meant to use its own PCID, the command
line option "pcid=noxpti" would seem at least misleading to me then,
as you wouldn't really use different PCIDs in that mode, but only
disable global pages (which probably hurts performance) and use
INVPCID for flushing (which probably helps performance).

> --- a/xen/arch/x86/pv/domain.c
> +++ b/xen/arch/x86/pv/domain.c
> @@ -95,6 +95,58 @@ void xpti_domain_init(struct domain *d)
>      }
>  }
>  
> +static __read_mostly enum {
> +    PCID_OFF,
> +    PCID_ALL,
> +    PCID_XPTI,
> +    PCID_NOXPTI
> +} opt_pcid = PCID_XPTI;
> +
> +static __init int parse_pcid(const char *s)
> +{
> +    int rc = 0;
> +
> +    if ( !strcmp(s, "off") )
> +        opt_pcid = PCID_OFF;
> +    else if ( !strcmp(s, "all") )
> +        opt_pcid = PCID_ALL;
> +    else if ( !strcmp(s, "xpti") )
> +        opt_pcid = PCID_XPTI;
> +    else if ( !strcmp(s, "noxpti") )
> +        opt_pcid = PCID_NOXPTI;

I'd prefer if you used parse_bool() and parse_boolean() here, so
that the syntax of this new option fits with all other boolean ones
we have.

> +void pcid_domain_init(struct domain *d)
> +{
> +    if ( !is_pv_domain(d) || is_pv_32bit_domain(d) ||
> +         !cpu_has_invpcid || !cpu_has_pcid )
> +        return;
> +
> +    switch ( opt_pcid )
> +    {
> +    case PCID_OFF:
> +        d->arch.pv_domain.pcid = false;

As for the earlier patch, with the return above implying the value
to be zero this store ought to be unnecessary.

> @@ -619,14 +626,15 @@ unsigned long pv_guest_cr4_fixup(const struct vcpu *, 
> unsigned long guest_cr4);
>        | (mmu_cr4_features                                   \
>           & (X86_CR4_PGE | X86_CR4_PSE | X86_CR4_SMEP |      \
>              X86_CR4_SMAP | X86_CR4_OSXSAVE |                \
> -            X86_CR4_FSGSBASE))                              \
> -      | ((v)->domain->arch.vtsc ? X86_CR4_TSD : 0))         \
> +            X86_CR4_FSGSBASE | X86_CR4_PCIDE))              \

Why? Afaics you never set the bit in mmu_cr4_features.

> --- a/xen/include/asm-x86/x86-defns.h
> +++ b/xen/include/asm-x86/x86-defns.h
> @@ -46,6 +46,7 @@
>   * Intel CPU flags in CR3
>   */
>  #define X86_CR3_NOFLUSH (_AC(1, ULL) << 63)
> +#define X86_CR3_PCIDMASK _AC(0x0fff, ULL) /* Mask for PCID */

As per an earlier comment you also want X86_CR3_ADDR_MASK here
(and as an implication from the suggested name I think it would be
better if you also added another underscore in the one you already
add).

Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel

 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.