
Re: [PATCH] x86/pv: Flush TLB in response to paging structure changes



On 21.10.2020 17:39, Andrew Cooper wrote:
> On 21/10/2020 14:56, Jan Beulich wrote:
>> On 21.10.2020 15:07, Andrew Cooper wrote:
>>> @@ -4037,6 +4037,9 @@ long do_mmu_update(
>>>                          break;
>>>                      rc = mod_l2_entry(va, l2e_from_intpte(req.val), mfn,
>>>                                        cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
>>> +                    /* Paging structure may be changed.  Flush linear range. */
>>> +                    if ( !rc )
>>> +                        flush_flags_all |= FLUSH_TLB;
>>>                      break;
>>>  
>>>                  case PGT_l3_page_table:
>>> @@ -4044,6 +4047,9 @@ long do_mmu_update(
>>>                          break;
>>>                      rc = mod_l3_entry(va, l3e_from_intpte(req.val), mfn,
>>>                                        cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
>>> +                    /* Paging structure may be changed.  Flush linear range. */
>>> +                    if ( !rc )
>>> +                        flush_flags_all |= FLUSH_TLB;
>>>                      break;
>>>  
>>>                  case PGT_l4_page_table:
>>> @@ -4051,27 +4057,28 @@ long do_mmu_update(
>>>                          break;
>>>                      rc = mod_l4_entry(va, l4e_from_intpte(req.val), mfn,
>>>                                        cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
>>> -                    if ( !rc && pt_owner->arch.pv.xpti )
>>> +                    /* Paging structure may be changed.  Flush linear range. */
>>> +                    if ( !rc )
>>>                      {
>>> -                        bool local_in_use = false;
>>> +                        bool local_in_use = mfn_eq(
>>> +                            pagetable_get_mfn(curr->arch.guest_table), mfn);
>>>  
>>> -                        if ( mfn_eq(pagetable_get_mfn(curr->arch.guest_table),
>>> -                                    mfn) )
>>> -                        {
>>> -                            local_in_use = true;
>>> -                            get_cpu_info()->root_pgt_changed = true;
>>> -                        }
>>> +                        flush_flags_all |= FLUSH_TLB;
>>> +
>>> +                        if ( local_in_use )
>>> +                            flush_flags_local |= FLUSH_TLB | FLUSH_ROOT_PGTBL;
>> Aiui here (and in the code consuming the flags) you rely on
>> flush_flags_local, when non-zero, always being a superset of
>> flush_flags_all. I think this is a trap to fall into when later
>> wanting to change this code, but as per below this won't hold
>> anymore anyway, I think. Hence here I think you want to set
>> FLUSH_TLB unconditionally, and above for L3 and L2 you want to
>> set it in both variables. Or, if I'm wrong below, a comment to
>> that effect may help people avoid falling into this trap.
>>
>> An alternative would be to have
>>
>>     flush_flags_local |= (flush_flags_all & FLUSH_TLB);
>>
>> before doing the actual flush.
> 
> Honestly, this is what I meant by stating that the asymmetry is a total
> mess.
> 
> I originally named 'all' as 'remote', but that is even less accurate,
> as it may still contain the current CPU.
> 
> Our matrix of complexity:
> 
> * FLUSH_TLB for L2+ structure changes
> * FLUSH_TLB_GLOBAL/FLUSH_ROOT_PGTBL for XPTI
> 
> with optimisations to skip GLOBAL/ROOT_PGTBL on the local CPU if none of
> the updates hit the L4-in-use, and to skip the remote if we hold all
> references on the L4.
> 
> Everything is complicated because pt_owner may not be current, for
> toolstack operations constructing a new domain.

Which is a case where I wonder why we flush in the first place.
The L4 under construction can't be in use yet, and hence
updates to it shouldn't need syncing. OTOH
pt_owner->dirty_cpumask is obviously empty at that point, i.e.
special-casing this likely isn't really worth it.
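
Purely for illustration, a hypothetical short-circuit might have
looked like the below, with creation_finished serving merely as a
plausible marker for "still under construction"; as said though, the
empty dirty_cpumask already makes the remote flush degenerate:

    /*
     * Hypothetical: no vCPU of pt_owner has run yet, so there can be
     * no stale translations anywhere.  Not actually worth it, as
     * pt_owner->dirty_cpumask is empty at this point and hence the
     * flush over the remote mask is a no-op anyway.
     */
    if ( unlikely(!pt_owner->creation_finished) )
        flush_flags_all = 0;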

>>> @@ -4173,18 +4180,36 @@ long do_mmu_update(
>>>      if ( va )
>>>          unmap_domain_page(va);
>>>  
>>> -    if ( sync_guest )
>>> +    /*
>>> +     * Flushing needs to occur for one of several reasons.
>>> +     *
>>> +     * 1) An update to an L2 or higher occurred.  This potentially changes the
>>> +     *    pagetable structure, requiring a flush of the linear range.
>>> +     * 2) An update to an L4 occurred, and XPTI is enabled.  All CPUs running
>>> +     *    on a copy of this L4 need refreshing.
>>> +     */
>>> +    if ( flush_flags_all || flush_flags_local )
>> Minor remark: At least in x86 code it is more efficient to use
>> | instead of || in such cases, to avoid relying on the compiler
>> to carry out this small optimization.
> 
> This transformation should not be recommended in general.  There are
> cases, including this one, where it has at best no effect, and is at
> worst wrong.
> 
> I don't care about people using ancient compilers.  They've got far
> bigger (== more impactful) problems than (the absence of) this
> transformation, and the TLB flush will dwarf anything the compiler does
> here.
> 
> However, the hand-"optimised" form prevents a compiler from spotting
> that the entire second clause is actually redundant for now.

Oh, you mean non-zero flush_flags_local implying non-zero
flush_flags_all? Not sure why a compiler recognizing this shouldn't
be able to optimize away | when it is able to optimize away || in
this case. Anyway - minor point, as said.
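
For reference, the two spellings in question:

    if ( flush_flags_all || flush_flags_local )  /* logical OR, may short-circuit */
    if ( flush_flags_all | flush_flags_local )   /* bitwise OR, single test of the OR-ed value */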

> I specifically didn't encode the dependency, to avoid subtle bugs
> if/when someone alters the logic.

Good point, but let me point out that you've encoded another subtle
dependency, which I didn't spell out completely before because with
the suggested adjustments it disappears: You may not omit FLUSH_TLB
from flush_flags_local when !local_in_use, as the L4 being changed may
be one recursively hanging off the L4 we're running on.
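
In code, the adjustment I'm suggesting for the L4 case would be along
these lines (a sketch only, not tested):

                    if ( !rc )
                    {
                        /*
                         * The modified L4 may be (recursively) reachable
                         * from the one we're running on, so the local TLB
                         * needs flushing regardless of local_in_use.
                         */
                        flush_flags_all |= FLUSH_TLB;
                        flush_flags_local |= FLUSH_TLB;

                        if ( local_in_use )
                            flush_flags_local |= FLUSH_ROOT_PGTBL;
                    }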

>>>      {
>>> +        cpumask_t *mask = pt_owner->dirty_cpumask;
>>> +
>>>          /*
>>> -         * Force other vCPU-s of the affected guest to pick up L4 entry
>>> -         * changes (if any).
>>> +         * Local flushing may be asymmetric with remote.  If there is local
>>> +         * flushing to do, perform it separately and omit the current CPU from
>>> +         * pt_owner->dirty_cpumask.
>>>           */
>>> -        unsigned int cpu = smp_processor_id();
>>> -        cpumask_t *mask = per_cpu(scratch_cpumask, cpu);
>>> +        if ( flush_flags_local )
>>> +        {
>>> +            unsigned int cpu = smp_processor_id();
>>> +
>>> +            mask = per_cpu(scratch_cpumask, cpu);
>>> +            cpumask_copy(mask, pt_owner->dirty_cpumask);
>>> +            __cpumask_clear_cpu(cpu, mask);
>>> +
>>> +            flush_local(flush_flags_local);
>>> +        }
>>>  
>>> -        cpumask_andnot(mask, pt_owner->dirty_cpumask, cpumask_of(cpu));
>> I understand you're of the opinion that cpumask_copy() +
>> __cpumask_clear_cpu() is more efficient than cpumask_andnot()?
>> (I guess I agree for high enough CPU counts.)
> 
> Its faster in all cases, even low CPU counts.

Is it? Data for BTR with a memory operand is available in the
Optimization Reference Manual only for some Atom CPUs, and its latency
and throughput aren't very good there. Anyway - again a minor aspect,
but I'll keep this in mind for future code I write, as so far I've
preferred the "andnot" form in such cases.
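
For reference, the two equivalent forms under discussion:

    /* Form 1: a single pass over the mask (ANDN per word). */
    cpumask_andnot(mask, pt_owner->dirty_cpumask, cpumask_of(cpu));

    /* Form 2: copy, then clear a single bit (BTR on a memory operand). */
    cpumask_copy(mask, pt_owner->dirty_cpumask);
    __cpumask_clear_cpu(cpu, mask);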

Jan
