
Re: [Xen-devel] PML (Page Modification Logging) design for Xen



On Fri, Feb 13, 2015 at 11:28 PM, Andrew Cooper
<andrew.cooper3@xxxxxxxxxx> wrote:
> On 13/02/15 14:32, Kai Huang wrote:
>> On Fri, Feb 13, 2015 at 6:57 PM, Andrew Cooper
>> <andrew.cooper3@xxxxxxxxxx> wrote:
>>> On 13/02/15 02:11, Kai Huang wrote:
>>>
>>>
>>> On 02/12/2015 10:10 PM, Andrew Cooper wrote:
>>>
>>> On 12/02/15 06:54, Tian, Kevin wrote:
>>>
>>> which presumably
>>> means that the PML buffer flush needs to be aware of which gfns are
>>> mapped by superpages to be able to correctly set a block of bits in the
>>> logdirty bitmap.
>>>
>>> Unfortunately PML itself can't tell us if the logged GPA comes from
>>> superpage or not, but even in PML we still need to split superpages to
>>> 4K page, just like traditional write protection approach does. I think
>>> this is because live migration should be based on 4K page granularity.
>>> Marking all 512 bits of a 2M page to be dirty by a single write doesn't
>>> make sense in both write protection and PML cases.
>>>
>>> Agreed. Extending one write to a whole superpage enlarges the dirty set
>>> unnecessarily. Since the spec doesn't say superpage logging is unsupported,
>>> I'd expect a 4K-aligned entry to be logged even within a superpage.
>>>
>>> The spec states that a gfn is appended to the log strictly on the
>>> transition of the D bit from 0 to 1.
>>>
>>> In the case of a 2M superpage, there is a single D bit for the entire 2M
>>> range.
>>>
>>>
>>> The plausible (working) scenarios I can see are:
>>>
>>> 1) superpages are not supported (not indicated by the whitepaper).
>>>
>>> A better description would be: PML doesn't check whether it is a superpage,
>>> it just operates on the D bit, whatever the page size.
>>>
>>> 2) a single entry will be written which must be taken to cover the
>>> entire 2M range.
>>> 3) an individual entry is written for every access.
>>>
>>> Below is the reply from our hardware colleague about PML on superpages. It
>>> should answer this accurately.
>>>
>>> "As noted in Section 1.3, logging occurs whenever the CPU would set an EPT D
>>> bit.
>>>
>>> It does not matter whether the D bit is in an EPT PTE (4KB page), EPT PDE
>>> (2MB page), or EPT PDPTE (1GB page).
>>>
>>> In all cases, the GPA written to the PML log will be the address of the
>>> write that causes the D bit in question to be updated, with bits 11:0
>>> cleared.
>>>
>>> This means that, in the case in which the D bit is in an EPT PDE or an EPT
>>> PDPTE, the log entry will communicate which 4KB region within the larger
>>> page was being written.
>>>
>>> Once the D bit is set in one of these entries, a subsequent write to the
>>> larger page will not generate a log entry, even if that write is to a
>>> different 4KB region within the larger page.  This is because log entries
>>> are created only when a D bit is being set and a write will not cause a D
>>> bit to be set if the page's D bit is already set.
>>>
>>> The log entries do not communicate the level of the EPT paging-structure
>>> entry in which the D bit was set (i.e., it does not communicate the page
>>> size). "
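
To make the behaviour described above concrete (a purely illustrative
example with made-up addresses): suppose a 2MB EPT mapping covers GPAs
0x40000000 - 0x401fffff and its D bit is clear. A guest write to GPA
0x40003008 sets the PDE's D bit and logs 0x40003000 (bits 11:0 cleared).
A later write to 0x40105010 within the same 2MB page logs nothing, because
the D bit is already set, and the log entry itself gives no hint that a
2MB mapping was involved.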
>>>
>>>
>>> Thanks for the clarification.
>>>
>>> The result of this behaviour is that the PML flush logic is going to have to
>>> look up each gfn and check whether it is mapped by a superpage, which will
>>> add a sizeable overhead.
>> Sorry that I am replying using my personal email account, as I can't
>> access my company account.
>>
>> I don't think we need to check whether the gfn is mapped by a superpage.
>> The PML flush does a very simple thing:
>>
>> 1) read out the PML index
>> 2) loop over all valid GPAs logged in the PML buffer, according to the
>> PML index, and call paging_mark_dirty for them.
>> 3) reset the PML index to 511, which essentially resets the PML buffer to
>> be empty again.
>>
>> The above process doesn't need to know whether the GFN is mapped by a
>> superpage or not. Actually, as you can see in my design, superpages are
>> still set read-only even with PML, as we still need to split superpages
>> into 4K pages in the PML case. Therefore a superpage in logdirty mode will
>> first be split into 4K pages on EPT violation, and those 4K pages will
>> then follow the PML path (a rough sketch of the flush loop is below).
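
As a rough illustration of the flush sequence above, here is a minimal
sketch in C. The names vmx_read_pml_index(), vmx_write_pml_index(),
pml_buffer() and mark_gfn_dirty() are hypothetical placeholders rather than
existing Xen interfaces, and the sketch assumes a 512-entry buffer whose
index starts at 511 and counts down as entries are written:

    #define PML_ENTRIES      512
    #define PML_INDEX_EMPTY  (PML_ENTRIES - 1)   /* 511: buffer empty */

    /* Hypothetical sketch of steps 1)-3) above, not actual Xen code. */
    static void pml_flush_sketch(struct vcpu *v)
    {
        uint64_t *buf = pml_buffer(v);            /* placeholder accessor */
        unsigned int idx = vmx_read_pml_index(v); /* placeholder VMCS read */

        /* 1) Nothing has been logged since the last flush. */
        if ( idx == PML_INDEX_EMPTY )
            return;

        /*
         * The index is decremented after each entry is written, so the
         * valid entries live in slots idx+1 .. 511 (all 512 slots if the
         * buffer filled up and the index wrapped below zero).
         */
        idx = (idx >= PML_ENTRIES) ? 0 : idx + 1;

        /* 2) Mark every logged 4K frame dirty in the logdirty bitmap. */
        for ( ; idx < PML_ENTRIES; idx++ )
        {
            unsigned long gfn = buf[idx] >> PAGE_SHIFT;

            /*
             * Placeholder: this is where paging_mark_dirty() from step 2)
             * would be invoked, possibly after a gfn-to-mfn translation
             * depending on its interface.
             */
            mark_gfn_dirty(v->domain, gfn);
        }

        /* 3) Reset the index to 511, emptying the buffer again. */
        vmx_write_pml_index(v, PML_INDEX_EMPTY);  /* placeholder VMCS write */
    }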
>
> This will only function correctly if superpage shattering is used.
>
> As soon as a superpage D bit transitions from 0 to 1, the gfn is logged
> and the guest can make further updates in the same frame without further
> log entries being recorded. The PML flush code *must* assume that every
> other gfn mapped by the superpage is dirty, or memory corruption could
> occur when resuming on the far side of the migration.

To me, the superpage has already been split before its D bit can change
from 0 to 1: in my understanding the EPT violation happens before the D bit
is set, so it is not possible to log a gfn before the superpage is split.
Therefore PML doesn't need to assume every other gfn in the superpage range
is dirty, as by then they are already 4K pages with the D bit clear and can
be logged by PML individually. Does this sound reasonable?
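
For illustration, a minimal sketch of the ordering I have in mind, with
hypothetical helper names (mapped_by_2m_superpage() and split_2m_to_4k()
are placeholders, not actual Xen p2m interfaces):

    /* Hypothetical sketch of the write path under logdirty, not real code. */
    static void logdirty_write_fault_sketch(struct p2m_domain *p2m,
                                            unsigned long gfn)
    {
        if ( mapped_by_2m_superpage(p2m, gfn) )   /* placeholder check */
        {
            /*
             * The superpage was made read-only when logdirty was enabled,
             * so this write faults before any D bit can be set.  Split it
             * into 512 writable 4K entries, each with its D bit clear.
             */
            split_2m_to_4k(p2m, gfn & ~0x1ffUL);  /* placeholder split */
        }

        /*
         * Subsequent guest writes hit 4K entries: the first write to each
         * 4K page sets that page's D bit and PML logs its gfn, so the
         * flush loop never has to reason about superpage coverage.
         */
    }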

>
>>
>>> It is also not conducive to minimising the data transmitted in the migration
>>> stream.
>> Yes, PML itself is unlikely to minimise the data transmitted in the
>> migration stream, as how many dirty pages get transmitted is entirely up
>> to the guest. But it reduces the EPT violations caused by write-protecting
>> 4K pages, so in theory PML reduces the CPU cycles spent in hypervisor
>> context, leaving more cycles for guest mode; it is therefore reasonable
>> to expect better guest performance.
>
> "performance" is a huge amorphous blob of niceness that wants to be
> achieved.  You must be more specific than that when describing
> "performance" as "better".

Yes, I will gather some benchmark results before sending the patch out for
review. It would also be helpful if you or others could suggest how best to
measure the performance, for example which benchmarks should be run.

I will have to read the rest of your reply tomorrow morning, as it is
midnight in my time zone :)

Thanks,
-Kai

>
> Without superpage shattering, the use of PML can trade off a reduction
> in guest VMexits vs more data needing to be sent in the migration
> stream.  This might be nice from the point of view of the guest
> administrator, but is quite possibly disastrous for the host
> administrator, if their cloud is network-bound.
>
> With superpage shattering, the use of PML can trade off a reduction in
> guest VMexits vs greater host ram usage and slower system runtime
> performance because of increased TLB pressure.
>
>
> Any claim of a change in performance must always consider the tradeoffs.  In
> this PML example, it is not a simple case that a new hardware feature
> strictly makes everything better, if used.
>
>>
>>>
>>> One future option might be to shatter all the EPT superpages when logdirty
>>> is enabled.
>> This is what I designed originally.
>
> This is acceptable as a design constraint, especially given the limits
> of the hardware, but it is important to know as a restriction.
>
> Now that I reread your original email I do spot that in there.  I admit
> that it was not immediately clear to me the first time around.
>
> This does highlight the usefulness of design review to get everyone's
> understanding (i.e. mine) up to scratch before starting to argue over
> the finer details of an implementation.
>
>>
>>> This would be ok for a domain which is being migrated away, but
>>> would be suboptimal for snapshot operations; Xen currently has no ability
>>> to coalesce pages back into superpages.
>> Doesn't this issue exist in current log-dirty implementation anyway?
>
> I believe it is an issue.
>
>> So although PML doesn't solve this issue, it doesn't bring any regression
>> either. To me, coalescing pages back into superpages is a separate
>> optimization, not directly related to PML.
>
> Agreed.
>
>>
>>> It also interacts poorly with HAP
>>> vram tracking which enables logdirty mode itself.
>> Why would PML interact with HAP vram tracking poorly?
>
> I was referring to the shattering aspect, rather than PML itself.
> Shattering all superpages would be overkill to just track vram, which
> only needs to cover a small region.
>
> I have to admit that the current vram tracking infrastructure is a bit
> of a mess.  It has different semantics depending on whether HAP or
> shadow is in use (HAP VRAM tracking enables logdirty mode, shadow VRAM
> tracking doesn't), and causes problems for the qemu/libxc interaction at
> the beginning of live migration.  These problems are compounded by
> XenServer's habit of constantly tweaking the shadow allocation, and have
> been further compounded by XSA-97 introducing -EBUSY into the mix.
>
> I have tried once to sort the interface out, but didn't get very far.  I
> really need to see about trying again.
>
> ~Andrew
>



-- 
Thanks,
-Kai
