[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH v3 7/7] xen/x86: use PCID feature



On 23/03/18 14:46, Jan Beulich wrote:
>>>> On 23.03.18 at 12:29, <jgross@xxxxxxxx> wrote:
>> On 23/03/18 11:51, Jan Beulich wrote:
>>>>>> On 21.03.18 at 13:51, <jgross@xxxxxxxx> wrote:
>>>> Avoid flushing the complete TLB when switching %cr3 for mitigation of
>>>> Meltdown by using the PCID feature if available.
>>>>
>>>> We are using 4 PCID values for a 64 bit pv domain subject to XPTI and
>>>> 2 values for the non-XPTI case:
>>>>
>>>> - guest active and in kernel mode
>>>> - guest active and in user mode
>>>> - hypervisor active and guest in user mode (XPTI only)
>>>> - hypervisor active and guest in kernel mode (XPTI only)
>>>
>>> Before committing to this route, Jun, Kevin, can we please get
>>> confirmation that PCID isn't (and isn't going to be) subject to the
>>> same speculation issues in the pipeline that the U/S bit is (having
>>> caused Meltdown in the first place)? To me it seems a pretty
>>> likely thing to play all the same games.
>>
>> Really? This would assume either the processor is capable to deal with
>> multiple matching TLB entries when speculating or that there can be
>> only one entry per virtual address present in the TLB at the same time
>> in spite of different PCIDs.
> 
> Hmm, yes, good point.
> 
>> And why aren't you asking for the same problem with VPIDs? This should
>> be comparable to the PCID problem you are suspecting.
> 
> Since Linux is using the approach, I'm not really suspecting a
> problem. I'd just like to be sure there is none / not going to be
> one. None of this is spelled out in the doc after all.
> 
>>>> ---
>>>>  docs/misc/xen-command-line.markdown | 12 +++++++++
>>>>  xen/arch/x86/debug.c                |  3 ++-
>>>>  xen/arch/x86/domain_page.c          |  2 +-
>>>>  xen/arch/x86/domctl.c               |  4 +++
>>>>  xen/arch/x86/flushtlb.c             | 49 
>>>> ++++++++++++++++++++++++++++------
>>>>  xen/arch/x86/mm.c                   | 34 +++++++++++++++++++++---
>>>>  xen/arch/x86/pv/dom0_build.c        |  1 +
>>>>  xen/arch/x86/pv/domain.c            | 52 
>>>> +++++++++++++++++++++++++++++++++++++
>>>>  xen/include/asm-x86/domain.h        | 14 +++++++---
>>>>  xen/include/asm-x86/pv/domain.h     |  2 ++
>>>>  xen/include/asm-x86/x86-defns.h     |  1 +
>>>>  11 files changed, 158 insertions(+), 16 deletions(-)
>>>
>>> Having had the discussion previously, I'm missing a change to
>>> smp.c:new_tlbflush_clock_period() here.
>>
>> I can add that. I did not do this as I haven't treated FLUSH_TLB
>> differently to FLUSH_TLB_GLOBAL (trying this even without any other
>> change to staging led to degfaults in dom0). So such a change to
>> new_tlbflush_clock_period() should be a separate patch I believe.
> 
> Ah yes, you don't have the "flushing too much" issue anymore in
> this version, or at least not in as obvious a way. Or wait - it's in
> patch 4. You still do an including-global flush there regardless of
> whether FLUSH_TLB_GLOBAL was actually requested. Anyway,
> we'll save this aspect for later.
> 
>>>> --- a/xen/arch/x86/debug.c
>>>> +++ b/xen/arch/x86/debug.c
>>>> @@ -97,7 +97,8 @@ dbg_pv_va2mfn(dbgva_t vaddr, struct domain *dp, uint64_t 
>>>> pgd3val)
>>>>      l3_pgentry_t l3e, *l3t;
>>>>      l2_pgentry_t l2e, *l2t;
>>>>      l1_pgentry_t l1e, *l1t;
>>>> -    unsigned long cr3 = (pgd3val ? pgd3val : dp->vcpu[0]->arch.cr3);
>>>> +    unsigned long cr3 = (pgd3val ? pgd3val
>>>> +                                 : (dp->vcpu[0]->arch.cr3 & 
>>>> ~X86_CR3_NOFLUSH));
>>>
>>> What about the PCID portion? You want the address of the page
>>> here, so I think you should use a "white-listing" masking operation
>>> instead of a blacklisting one.
>>
>> The PCID portion is no problem here as the value is converted into a
>> mfn.
>>
>> I can do the modification you are asking for, of course.
> 
> Well, even if the bits are shifter out in the end, the code could
> look more correct. Plus by masking to just the address, future
> new meaning assigned to currently unused bits would not be
> a problem for this piece of code anymore.
> 
>>>> +/*
>>>> + * Return additional PCID specific cr3 bits.
>>>> + *
>>>> + * Note that X86_CR3_NOFLUSH will not be readable in cr3. Anyone consuming
>>>> + * v->arch.cr3 should mask away X86_CR3_NOFLUSH and X86_CR3_PCIDMASK in 
>>>> case
>>>
>>> I stand to my previous comment, which was left unanswered afaics:
>>
>> Uuh, sorry for that.
>>
>>> "Is it a good idea to suppress the flush this way for every reader
>>>  of v->arch.cr3? For example, what about the use in
>>>  dbg_pv_va2mfn()? I think it should be the consumers of the field
>>>  to decide whether to OR in that flag (which isn't part of the
>>>  register value anyway)."
>>> To be more precise, I can see the pcid to be put here (which will
>>> require consumers to be aware anyway), but I don't think the non-
>>> register-value no-flush indicator belongs here. IOW I think after
>>> writing the value into %cr3, the value read back should match the
>>> stored value.
>>
>> This will make restore_all_guest more complicated. v->arch.cr3 is copied
>> to cpu_info->xen_cr3 there and this value is then used for %cr3. I
>> really don't want to add complex logic there to add the no-flush
>> indicator in case PCIDs are active.
> 
> Valid point. Looking at all present uses of ->arch.cr3, it's probably
> indeed better the way you have it. However, I'm now wondering
> about something else: make_cr3() leaves PCID as zero for HVM
> and idle domains, but runs Xen with PCIDs 2 and 3 for (some) PV
> domains. That looks like an undesirable setup though - it would
> seem better to run Xen (with full page tables) with PCID 0 at all
> times.
> 
> Then we'd have e.g.
> PCID 0        Xen (full page tables)
> PCID x        PV guest priv
> PCID y        PV guest user

So this would need another way to switch between guest and xen %cr3.
Or would you want to use different %cr3 values with the same PCID
without flushing the TLB in between? This seems to be a way to ask for
problems...

In case you'd use the same %cr3 (guest kernel one, I guess) for both
cases: are you really sure there is no problem in any hypervisor path
accessing guest data which would result in using guest kernel access
rights when coming from user mode (BTW: that was the security note I
had in v2 of my series).

> Global pages in PCID 0 could then still be permitted, and wouldn't
> ever need flushing except when FLUSH_TLB_GLOBAL is requested.
> 
> As to the use of two separate PCIDs for PV kernel and user modes
> - while this helps isolation, it prevents recovering the non-XPTI
> property of user mode TLB entries surviving in-guest mode switches.

I don't get that. With PCID the guest's kernel _and_ user entries
will survive in-guest mode switches as there is no TLB flushing
involved (the no-flush bit is set in v->arch.cr3 for both modes).

The only downside are guest kernel accesses to user pages: they will
need additional TLB entries as the PCID is different.

> I wonder whether this is part of the reason you see PCID have a
> negative effect in the non-XPTI case.
> 
> So in the end the question is: Why not use just two PCIDs, and
> allow global pages just like we do now, with the added benefit
> that we no longer need to flush Xen's global TLB entries just
> because we want to get rid of PV guest user ones.

I can't see how that would work without either needing some more TLB
flushes in order to prevent stale TLB entries or loosing the Meltdown
mitigation.

Which %cr3/PCID combination should be used in hypervisor, guest kernel
and guest user mode? Which pages would be global?

> 
>>>> +        }
>>>> +        if ( d->arch.pv_domain.pcid )
>>>> +            v->arch.cr3 |= get_pcid_bits(v, d->arch.pv_domain.xpti);
>>>
>>> It is certainly at least confusing that you pass "xpti" as argument
>>> for a parameter named "is_xen". The question is what mode you
>>> want us to be in when running with PCID but no XPTI: Should Xen
>>> use its own PCID then? That would seem reasonable to me, but
>>> would seem to require passing true here. Yet then this would
>>> require switching CR3 on the way out to guests and back in from
>>> guests even in that case (just without copying the root page table).
>>>
>>> If in that mode Xen isn't meant to use its own PCID, the command
>>> line option "pcid=noxpti" would seem at least misleading to me then,
>>> as you wouldn't really use different PCIDs in that mode, but only
>>> disable global pages (which probably hurts performance) and use
>>> INVPCID for flushing (which probably helps performance).
>>
>> The idea is to use different PCID values for guest kernel and user
>> mode. This removes the need for global guest user pages.
> 
> Partly only. Global guest user pages serve two purposes: They
> survive a round trip user -> kernel -> user (as long as no Xen
> context switch occurs in the middle), and they also allow the
> kernel to utilize TLB entries user mode has just created e.g. for
> system call input. Your current use of PCIDs retains only the
> first property.

Right.

> 
>> I don't
>> want to use global guest user pages together with PCID as flushing
>> global pages from the TLB with PCID enabled requires flushing either
>> the complete TLB or you'd have to use INVLPG in all possible address
>> spaces (so you'd need to have multiple %cr3 switches).
> 
> Well, yes, flushing _individual_ pages is a problem in that case.
> As to multiple CR3 switches - are these all that bad really with
> the no-flush bit set? With the reduced number of PCIDs in actual
> use (as discussed above) "all possible address spaces" would
> mean just two. And I could imagine that in a number of cases
> just one INVLPG (with the right PCID active) might suffice.
> 
> One complicating factor is that we don't want to introduce
> Xen TLB entries for other than what we map in the minimal page
> tables into PV guest PCID space, which would happen if we
> simply switched PCID around an INVLPG.
> 
> What I don't understand in any event is why you need separate
> PCIDs for Xen depending on whether the active PV guest is in
> kernel or user mode.

Main reason are the different page tables anchored in %cr3.

> As a side note, I've noticed only now that get_pcid_bits()'s
> first parameter wants to be constified.

Will change that.


Juergen

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel

 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.