[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH 1/4] expand x86 arch_shared_info to support linear p2m list



On 18/11/14 05:33, Juergen Gross wrote:
> On 11/14/2014 05:08 PM, Andrew Cooper wrote:
>> On 14/11/14 15:32, Juergen Gross wrote:
>>> On 11/14/2014 03:59 PM, Andrew Cooper wrote:
>>>> On 14/11/14 14:14, Jürgen Groß wrote:
>>>>> On 11/14/2014 02:56 PM, Andrew Cooper wrote:
>>>>>> On 14/11/14 12:53, Juergen Gross wrote:
>>>>>>> On 11/14/2014 12:41 PM, Andrew Cooper wrote:
>>>>>>>> On 14/11/14 09:37, Juergen Gross wrote:
>>>>>>>>> The x86 struct arch_shared_info field pfn_to_mfn_frame_list_list
>>>>>>>>> currently contains the mfn of the top level page frame of the 3
>>>>>>>>> level
>>>>>>>>> p2m tree, which is used by the Xen tools during saving and
>>>>>>>>> restoring
>>>>>>>>> (and live migration) of pv domains and for crash dump analysis.
>>>>>>>>> With
>>>>>>>>> three levels of the p2m tree it is possible to support up to 512
>>>>>>>>> GB of
>>>>>>>>> RAM for a 64 bit pv domain.
>>>>>>>>>
>>>>>>>>> A 32 bit pv domain can support more, as each memory page can hold
>>>>>>>>> 1024
>>>>>>>>> instead of 512 entries, leading to a limit of 4 TB.
>>>>>>>>>
>>>>>>>>> To be able to support more RAM on x86-64 switch to a virtual
>>>>>>>>> mapped
>>>>>>>>> p2m list.
>>>>>>>>>
>>>>>>>>> This patch expands struct arch_shared_info with a new p2m list
>>>>>>>>> virtual
>>>>>>>>> address and the mfn of the page table root. The new
>>>>>>>>> information is
>>>>>>>>> indicated by the domain to be valid by storing ~0UL into
>>>>>>>>> pfn_to_mfn_frame_list_list. The hypervisor indicates usability of
>>>>>>>>> this
>>>>>>>>> feature by a new flag XENFEAT_virtual_p2m.
>>>>>>>>
>>>>>>>> How do you envisage this being used?  Are you expecting the tools
>>>>>>>> to do
>>>>>>>> manual pagetable walks using xc_map_foreign_xxx() ?
>>>>>>>
>>>>>>> Yes. Not very different compared to today's mapping via the 3 level
>>>>>>> p2m tree. Just another entry format, 4 instead of 3 levels and
>>>>>>> starting
>>>>>>> at an offset.
>>>>>>
>>>>>> Yes - David and I were discussing this over lunch, and it is not
>>>>>> actually very different.
>>>>>>
>>>>>> In reality, how likely is it that the pages backing this virtual
>>>>>> linear
>>>>>> array change?
>>>>>
>>>>> Very unlikely, I think. But not impossible.
>>>>>
>>>>>> One issue currently is that, during the live part of migration, the
>>>>>> toolstack has no way of working out whether the structure of the
>>>>>> p2m has
>>>>>> changed (intermediate leaves rearranged, or the length increasing).
>>>>>>
>>>>>> In the case that the VM does change the structure of the p2m
>>>>>> under the
>>>>>> feet of the toolstack, migration will either blow up in a
>>>>>> non-subtle way
>>>>>> with a p2m/m2p mismatch, or in a subtle way with the receiving side
>>>>>> copying the new p2m over the wrong part of the new domain.
>>>>>>
>>>>>> I am wondering whether, with this new p2m method, we can take
>>>>>> sufficient
>>>>>> steps to be able to guarantee mishaps like this can't occur.
>>>>>
>>>>> This should be easy: I could add a counter in arch_shared_info
>>>>> which is
>>>>> incremented whenever a p2m mapping is being changed. The toolstack
>>>>> could
>>>>> compare the counter values before start and at end of migration and
>>>>> redo
>>>>> the migration (or fail) if they are different. In order to avoid
>>>>> races
>>>>> I would have to increment the counter before and after changing the
>>>>> mapping.
>>>>>
>>>>
>>>> That is insufficient I believe.
>>>>
>>>> Consider:
>>>>
>>>> * Toolstack walks pagetables and maps the frames containing the
>>>> linear p2m
>>>> * Live migration starts
>>>> * VM remaps a frame in the middle of the linear p2m
>>>> * Live migration continues, but the toolstack has a stale frame in the
>>>> middle of its view of the p2m.
>>>
>>> This would be covered by my suggestion. At the end of the memory
>>> transfer (with some bogus contents) the toolstack would discover the
>>> change of the p2m structure and either fail the migration or start it
>>> from the beginning and thus overwriting the bogus frames.
>>
>> Checking after pause is too late.  The content of the p2m is used verify
>> each frame being sent on the wire, so is in active use for the entire
>> duration of live migration.
>>
>> If the toolstack starts verifying frames being sent using information
>> from a stale p2m, the best that can be hoped for is that the toolstack
>> declares that the p2m and m2p are inconsistent and abort the migrate.
>>
>>>
>>>> As the p2m is almost never expected to change, I think it might be
>>>> better to have a flag the toolstack can set to say "The toolstack is
>>>> peeking at your p2m behind your back - you must not change its
>>>> structure."
>>>
>>> Be careful here: changes of the structure can be due to two scenarios:
>>> - ballooning (invalid entries being populated): this is no problem, as
>>>    we can stop the ballooning during live migration.
>>> - mapping of grant pages e.g. in a stub domain (first map in an area
>>>    former marked as invalid): you can't stop this, as the stub domain
>>>    has to do some work. Here a restart of the migration should work, as
>>>    the p2m structure change can only happen once for each affected p2m
>>>    page.
>>
>> Migration is not at all possible with a domain referencing foreign
>> frames.
>>
>> The live part can cope with foreign frames referenced in the ptes.  As
>> part of the pause handling in the VM, the frontends must unmap any
>> grants they have.  After pause, any remaining foreign frames cause a
>> migration failure.
>>
>>>
>>>> Having just thought this through, I think there is also a race
>>>> condition
>>>> between a VM changing an entry in the p2m, and the toolstack doing
>>>> verifications of frames being sent.
>>>
>>> Okay, so the flag you mentioned should just prohibit changes in the
>>> p2m list related to memory frames of the affected domain: ballooning
>>> up or down, or rearranging the memory layout (does this happen today?).
>>> Mapping and unmapping of grant pages should be still allowed.
>>
>> HVM guests doesn't have any of their p2m updates represented in the
>> logdirty bitmap, so ballooning an HVM guest during migrate leads to
>> unexpected holes or lack of holes on the resuming side, leading to a
>> very confused balloon driver.
>>
>> At the time I had not found a problem with PV guests, but it is now
>> clear that there is a period of time when a guest is altering its p2m
>> where the p2m and m2p are out of sync, which will cause a migration
>> failure if the toolstack observes this artefact.
>
> So ballooning should be disabled during migration. I think this should
> be handled via callbacks triggered by xenstore: one at start of
> migration to stop ballooning and one at end to restart it. I wouldn't
> want to tie this functionality to the p2m list structure, as it is
> not related to it.

It is not just ballooning.  It is any change to the p2m whatsoever. 
This includes mapping/unmapping grants, XENMEM_exchange, and the guest
simply changing the p2m layout.

I suspect that the only reason this has not been encountered in practice
is that noone has attempted migrating a domain which makes use of
foreign mappings.  It is typically only the backend drivers which map
frontend memory, and dom0 doesn't migrate.

~Andrew


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.