
Re: [Xen-devel] [PATCH v1] x86/mm: Suppresses vm_events caused by page-walks



>>> On 27.08.18 at 15:02, <andrew.cooper3@xxxxxxxxxx> wrote:
> On 27/08/18 13:53, Razvan Cojocaru wrote:
>> On 8/27/18 3:37 PM, Andrew Cooper wrote:
>>> On 27/08/18 13:12, Jan Beulich wrote:
>>>>>> For NPT, isn't there an error code bit telling you whether the
>>>>>> request was a user or "system" one? If not, some cheating
>>>>>> would be needed (derive from CPL, accepting that e.g.
>>>>>> descriptor table accesses would get mis-attributed), but
>>>>>> that's still not going to involve looking at the PTE flags.
>>>>> The alternative would be to simply walk (without enforcing any flags,
>>>>> and so making the pfec walk parameter unnecessary) to the respective
>>>>> address, and query for _PAGE_ACCESSED and _PAGE_DIRTY only.
>>>>>
>>>>> If _PAGE_ACCESSED is not set, set it and exit.
>>>>> If _PAGE_ACCESSED is set, set _PAGE_DIRTY also and exit.
>>>> Since it's ambiguous in the NPT case - are you talking about
>>>> setting the flags in the guest or host page tables? The
>>>> former, I'm afraid, might not be acceptable (as not always
>>>> being architecturally correct). In any case, it feels as if we'd
>>>> been here before, in that this reminds me of you meaning
>>>> to imply from a second walk (with A already set) that it must
>>>> be a write access. I thought we had settled on such an
>>>> implication not being generally correct.
>>> The problem we are trying to solve is that when operating in
>>> non-root mode, the cpu pagewalk, when trying to set a guest A/D bit in a
>>> write-protected EPT page, takes an EPT violation for a write to a
>>> read-only page.
>>>
>>> Manually setting the A/D bits (as appropriate) and restarting the
>>> instruction is sufficient for it to complete correctly.
>>>
>>> At the moment, every time this happens, a request is sent to the
>>> introspection agent, and the agent calculates that it was due to
>>> pagetable protection, and instructs Xen to emulate the instruction. 
>>> This accounts for 97% (?) of the VMExits, and is unrelated to any of the
>>> real protections which introspection is trying to achieve.
>>>
>>> What Razvan is looking to do is to have Xen skip the "send to the
>>> introspection agent" part as an optimisation, because hardware tells Xen
>>> (as part of the VMExit) when this condition has occurred, and the
>>> vm_event logic has already asked Xen to try and fix up this condition
>>> automatically.
>>>
>>> What can actually be done depends on how A/D bits behave in real hardware.
>>>
>>> Setting access bits for non-leaf entries is definitely fine, and
>>> speculatively setting the access bit is also explicitly permitted by the
>>> spec.  However, I can't find any comment on speculative dirty bits (from
>>> either Intel or AMD), and I've not encountered such a behaviour with the
>>> pagetable work I've been doing.

Yeah, a description of the problem to solve definitely helps.
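
For reference, the "hardware tells Xen" part could be decoded roughly as
below on Intel. This is only a sketch: the bit positions follow the
EPT-violation exit qualification layout in the Intel SDM as I understand
it, and the macro and function names are made up for illustration (they
are not Xen identifiers).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for EPT-violation exit qualification bits. */
#define EPT_QUAL_WRITE     (1ULL << 1) /* violation was a data write       */
#define EPT_QUAL_GLA_VALID (1ULL << 7) /* guest linear address is valid    */
#define EPT_QUAL_GLA_XLAT  (1ULL << 8) /* access was the final translation */

/*
 * True when the violation was caused by an access to a guest
 * paging-structure entry (page walk or A/D update) rather than by the
 * access to the translated address itself.
 */
static bool ept_violation_in_guest_walk(uint64_t qual)
{
    return (qual & EPT_QUAL_GLA_VALID) && !(qual & EPT_QUAL_GLA_XLAT);
}

/* True when the violating access was a write. */
static bool ept_violation_is_write(uint64_t qual)
{
    return (qual & EPT_QUAL_WRITE) != 0;
}
```

Only violations for which ept_violation_in_guest_walk() holds would be
candidates for skipping the vm_event.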

>> I've forgotten a piece of information that I really should have written
>> here: we would only set the D bit if A is already set and either the
>> page is writable (has _PAGE_RW set) or CR0.WP is 0 (the latter case is
>> admittedly more tricky).
> 
> How about a new function which works similarly to guest-walk-tables, but
> only ever sets A/D bits.
> 
> Given information from hardware, we know the linear address, and that it
> was a problem with the guest pagetables, from which we explicitly know
> that it was from writing an A/D bit to a guest PTE.
> 
> While walking down the levels, set any missing A bits and remember if we
> set any.  If we set A bits, consider ourselves complete and exit back to
> the guest.  If no A bits were set, and the access was a write (which we
> know from the EPT violation information), then set the leaf D bit.

Plus taking into consideration CR0.WP and the entry's W bit, as
Razvan has said.
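
Put together, the proposal (Andrew's walk plus the CR0.WP / W bit
conditions) might look roughly like the sketch below. It is an
illustration only, not Xen code: fixup_ad_bits(), the fixup_result enum
and the flat array standing in for the entries visited by the walk are
all hypothetical. Superpages and user/supervisor checks are ignored, and
the W bit is checked on the leaf entry only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Architectural x86 PTE flag bits. */
#define PTE_PRESENT  (1ULL << 0)
#define PTE_RW       (1ULL << 1)
#define PTE_ACCESSED (1ULL << 5)
#define PTE_DIRTY    (1ULL << 6)

/* Hypothetical result type: what the fixup did, if anything. */
enum fixup_result { FIXUP_NONE, FIXUP_SET_A, FIXUP_SET_D };

/*
 * Hypothetical helper. ptes[0] is the top-level entry on the walk path,
 * ptes[levels - 1] is the leaf. Only A/D bits are ever modified.
 */
static enum fixup_result fixup_ad_bits(uint64_t *ptes, unsigned int levels,
                                       bool is_write, bool cr0_wp)
{
    bool set_a = false;
    unsigned int i;

    assert(levels > 0);

    /* Walk down the levels, setting any missing A bits. */
    for (i = 0; i < levels; i++) {
        if (!(ptes[i] & PTE_PRESENT))
            return FIXUP_NONE;          /* not an A/D problem */
        if (!(ptes[i] & PTE_ACCESSED)) {
            ptes[i] |= PTE_ACCESSED;
            set_a = true;
        }
    }

    /* If any A bit was set, consider the fixup complete. */
    if (set_a)
        return FIXUP_SET_A;

    /* Otherwise set the leaf D bit, but only for a permitted write. */
    if (is_write && !(ptes[levels - 1] & PTE_DIRTY) &&
        ((ptes[levels - 1] & PTE_RW) || !cr0_wp)) {
        ptes[levels - 1] |= PTE_DIRTY;
        return FIXUP_SET_D;
    }

    return FIXUP_NONE;
}
```

A write through a path with no A bits set would thus take two passes:
the first returns FIXUP_SET_A, the second FIXUP_SET_D.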

> This should be architecturally correct as it is exclusively derived from
> information provided by the VMExit, and won't cause dirty bits to be
> written in cases where the hardware wouldn't have written them
> (speculative or otherwise).  It does mean that an instruction which
> would need to set A and D bits in the walk will take two EPT violations
> to achieve the end result, but it probably is still quicker than sending
> the vm_event out.

I'm afraid this is going to be only mostly correct: the atomicity of the
page table write is going to be lost. This could become an actual problem
if the guest used racing PTE accesses. Such racing accesses might not
be a bug - simply consider the OS scanning for set A and/or D bits
(and clearing them when they're set), or an entity temporarily clearing
(parts of) PTEs, with recovery logic in place to restore them when
needed for a synchronous access. At the very least there's then the
risk of a livelock within the guest.
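
To make the race concrete: a software fixup that reads a PTE and later
writes it back can overwrite a concurrent guest update. A
compare-and-exchange, as sketched below with C11 atomics, would at least
refuse to write when the entry changed since it was observed. This is a
hypothetical illustration (pte_set_ad_if_unchanged() is not a Xen
function), and it does not by itself address the livelock risk, since a
guest that keeps clearing the bits can make the fixup repeat.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define PTE_PRESENT  (1ULL << 0)
#define PTE_ACCESSED (1ULL << 5)
#define PTE_DIRTY    (1ULL << 6)

/*
 * Hypothetical helper: set A/D flags in *pte only if the entry still
 * holds the value observed during the walk. Returns false when the
 * guest changed the entry in the meantime, in which case nothing is
 * written and the caller has to re-walk or give up.
 */
static bool pte_set_ad_if_unchanged(_Atomic uint64_t *pte,
                                    uint64_t observed, uint64_t flags)
{
    /* Only A/D bits may be set by this helper. */
    assert((flags & ~(PTE_ACCESSED | PTE_DIRTY)) == 0);

    return atomic_compare_exchange_strong(pte, &observed,
                                          observed | flags);
}
```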

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel