
Re: [Xen-devel] Buggy interaction of live migration and p2m updates



On 21/11/14 05:41, Juergen Gross wrote:
> On 11/20/2014 07:28 PM, Andrew Cooper wrote:
>> Hello,
>>
>> Tim, David and I were discussing this over lunch.  This email is a
>> (hopefully accurate) account of our findings, and potential solutions.
>> (If I have messed up, please shout.)
>>
>> Currently, correct live migration of PV domains relies on the toolstack
>> (which has a live mapping of the guests p2m) not observing stale values
>> when the guest updates its p2m, and the race condition between a p2m
>> update and an m2p update.  Realistically, this means no updates to the
>> p2m at all, due to several potential race conditions.  Should any race
>> conditions happen (e.g. ballooning while live migrating), the effects
>> could be anything from an aborted migration to VM memory corruption.
>>
>> It should be noted that migrationv2 does not fix any of this.  It alters
>> the way in which some race conditions might be observed.  During
>> development of migrationv2, there was an explicit non-requirement of
>> fixing the existing Ballooning+LiveMigration issues we were aware of,
>> although at the time, we were not aware of this specific set of issues.
>> Our goal was to simply make migrationv2 work in the same circumstances
>> as previously, but with a bitness-agnostic wire format and
>> forward-extensible protocol.
>>
>>
>> As far as these issues are concerned, there are two distinct p2m
>> modifications which we care about:
>> 1) p2m structure changes (rearranging the layout of the p2m)
>> 2) p2m content changes (altering entries in the p2m)
>>
>> There is no possible way for the toolstack to prevent a domain from
>> altering its p2m.  At the moment, ballooning typically only occurs when
>> requested by the toolstack, but the underlying operations
>> (increase/decrease_reservation, mem_exchange, etc) can be used by the
>> guest at any point.  This includes Wei's guest memory fragmentation
>> changes.  Changes to the content of the p2m also occur for grant map and
>> unmap operations.
>>
>>
>> Currently in PV guests, the p2m is implemented using a 3-level tree,
>> with its root in the guests shared_info page.  It provides a hard VM
>> memory limit of 4TB for 32bit PV guests (which is far higher than the
>> 128GB limit from the compat p2m mappings), or 512GB for 64bit PV guests.
>>
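
[Editorial aside: the 4TB/512GB figures fall straight out of the tree geometry — one page of entries per level, 4-byte entries for 32bit guests, 8-byte for 64bit. A minimal sketch of the arithmetic; the helper name is hypothetical:]

```c
#include <stdint.h>

#define PAGE_SIZE 4096ULL

/* Maximum guest memory addressable by a 3-level p2m tree: each level is
 * one page of entries, and each leaf entry maps one 4K frame. */
static uint64_t p2m_tree_limit(uint64_t entry_size)
{
    uint64_t entries_per_page = PAGE_SIZE / entry_size;

    /* root page -> mid pages -> leaf pages -> one 4K frame per entry */
    return entries_per_page * entries_per_page * entries_per_page
           * PAGE_SIZE;
}
```

[With 4-byte entries (32bit PV) this gives 1024^3 * 4K = 4TB; with 8-byte entries (64bit PV), 512^3 * 4K = 512GB, matching the limits above.]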
>> Juergen has a proposed new p2m interface using a virtual linear
>> mapping.  This is conceptually similar to the previous implementation
>> (which is fine from the toolstacks point of view), but far less
>> complicated from the guests point of view, and removes the memory limits
>> imposed by the p2m structure.
>>
>> The new virtual linear mapping suffers from the same interaction issues
>> as the old 3-level tree did, but the introduction of the new interface
>> affords us an opportunity to make all API modifications at once to
>> reduce churn.
>>
>>
>> During live migration, the toolstack maps the guests p2m into a linear
>> mapping in the toolstacks virtual address space.  This is done once at
>> the start of migration, and never subsequently altered.  During live
>> migration, the p2m is cross-verified with the m2p, and frames are sent
>> using pfns as a reference, as they will be located in different frames
>> on the receiving side.
>>
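
[Editorial aside: the cross-verification amounts to checking that the p2m and m2p agree on a pfn before its frame is sent. A sketch of the idea — not the actual libxc code, and the names are hypothetical:]

```c
#include <stdbool.h>
#include <stdint.h>

#define INVALID_MFN UINT64_MAX

/* Cross-check one pfn against the m2p before sending it: the two tables
 * must agree, or the frame is stale and must be deferred. */
static bool pfn_is_consistent(const uint64_t *p2m, const uint64_t *m2p,
                              uint64_t pfn)
{
    uint64_t mfn = p2m[pfn];

    if (mfn == INVALID_MFN)   /* ballooned out: nothing to send */
        return false;
    return m2p[mfn] == pfn;   /* mismatch means the guest raced an update */
}
```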
>> Should the guest change the p2m structure during live migration, the
>> toolstack ends up with a stale p2m with a non-p2m frame in the middle,
>> resulting in bogus cross-referencing.  Should the guest change an entry
>> in the p2m, the p2m frame itself will be resent as it would be marked as
>> dirty in the logdirty bitmap, but the target pfn will remain unsent and
>> probably stale on the receiving side.
>>
>>
>> Another factor which needs to be taken into account is Remus/COLO, which
>> run the domains under live migration conditions for the duration of
>> their lifetime.
>>
>> During the live part of migration, the toolstack already has to be able
>> to tolerate failures to normalise the pagetables, which arise as a
>> consequence of the pagetables being actively in use.  These failures are fatal
>> on the final iteration after the guest has been paused, but the same
>> logic could be extended to p2m/m2p issues, if needed.
>>
>>
>> There are several potential solutions to these problems.
>>
>> 1) Freeze the guests p2m during live migrate
>>
>> This is the simplest sounding option, but is quite problematic from the
>> point of view of the guest.  It is essentially a shared spinlock between
>> the toolstack and the guest kernel.  It would prevent any grant
>> map/unmap operations from occurring, and might interact badly with
>> certain p2m updates in the guest which would previously be expected to
>> unconditionally succeed.
>>
>> Pros) (Can't think of any)
>> Cons) Not easy to implement (even conceptually), requires invasive guest
>> changes, will cripple Remus/COLO
>>
>>
>> 2) Deep p2m dirty tracking
>>
>> In the case that a p2m frame is discovered dirty in the logdirty bitmap,
>> we can be certain that a write has occurred to it, and in the common
>> case, means that the mapping has changed.  The toolstack could maintain
>> a non-live copy of the p2m which is updated as new frames are sent.
>> When a dirty p2m frame is found, the live and non-live copies can be
>> consulted to find which pfn mappings have changed, and locally mark all
>> the altered pfns for retransmit.
>>
>> Pros) No guest changes required
>> Cons) Toolstack needs to keep an additional copy of the guests p2m on
>> the sending side
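
[Editorial aside: the diffing step in option 2 could look something like the sketch below — a hypothetical helper, assuming a 64bit guest with 8-byte p2m entries:]

```c
#include <stdint.h>

#define PAGE_SIZE 4096
#define ENTRIES_PER_FRAME (PAGE_SIZE / sizeof(uint64_t))  /* 512 */

/* When a p2m leaf frame shows up dirty, compare it against the copy
 * taken when its pfns were last sent.  Returns the number of changed
 * entries and appends their pfns to resend[]; the shadow copy is
 * updated so the next iteration only sees new changes. */
static unsigned int diff_p2m_frame(const uint64_t *live, uint64_t *shadow,
                                   uint64_t first_pfn, uint64_t *resend)
{
    unsigned int i, n = 0;

    for (i = 0; i < ENTRIES_PER_FRAME; i++) {
        if (live[i] != shadow[i]) {
            resend[n++] = first_pfn + i;
            shadow[i] = live[i];
        }
    }
    return n;
}
```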
>>
>> 3) Eagerly check for p2m structure changes.
>>
>> p2m structure changes are rare after boot, but not impossible.  Each
>> iteration of live migration, the toolstack can check for dirty
>> higher-level p2m frames in the dirty bitmap.  In the case that a
>> structure update occurs, the toolstack can use information it already
>> has to calculate a subset of pfns affected by the update, and mark them
>> for resending.  (This can currently be done to the frame granularity
>> given the p2m frame list, but in combination with 2), could result in
>> fewer pfns needing resending.)
>>
>> Pros) No guest changes required.
>> Cons) Moderately high toolstack overhead; possibility of resending far
>> more pfns than strictly required.
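
[Editorial aside: the conservative, frame-granularity resend in option 3 amounts to invalidating every pfn a dirty mid-level frame covers. A sketch under the 64bit 512-entries-per-page assumption; the names are hypothetical:]

```c
#include <stdint.h>

/* A dirty mid-level frame may have redirected any of its 512 leaf
 * pointers, so conservatively every pfn it covers must be rechecked.
 * One mid frame covers 512 leaf frames of 512 pfns each (1GB of
 * guest memory for a 64bit guest). */
static void mid_frame_pfn_range(unsigned int mid_index,
                                uint64_t *first, uint64_t *last)
{
    uint64_t pfns_per_mid = 512ULL * 512;   /* 262144 pfns */

    *first = (uint64_t)mid_index * pfns_per_mid;
    *last  = *first + pfns_per_mid - 1;
}
```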
>>
>> 4) Request p2m structure change updates from the guest
>>
>> The guest could provide a "p2m generation count" to allow the toolstack
>> to evaluate whether the structure had changed.  This would allow the
>> live part of migration to periodically re-evaluate whether it should
>> remap the p2m to avoid stale mappings.
>>
>> Pros) Easy to implement alongside the virtual linear mapping support.
>> Easy for toolstack and guest
>> Cons) Only works with new virtual linear guests.
>>
>>
>> Proposed solution:  A combination of 2, 3 and 4.
>>
>> For legacy 3-level p2m guests, the toolstack can detect p2m structure
>> updates by tracking the p2m top and mid levels in the logdirty bitmap,
>> and invalidating the modified subset of pfns.  It has to eagerly check
>> the p2m frame list list mfn entry in the shared info to see whether the
>> guest has swapped onto a completely new p2m.
>>
>> For a virtual linear map, the intermediate levels are not available to
>> track, but we can require that the guest increment a p2m generation clock
>> in the shared info.  When the structure changes, the toolstack can remap
>> the p2m and calculate the altered subset of pfns, and mark for resend.
>>
>> The toolstack must also track changes in the p2m itself, and compare to
>> a local copy showing the mapping at the time at which the pfn was last
>> sent.  This can be used to work out which p2m mappings have changed, and
>> also be used to confirm whether the pfns on the receiving side are stale
>> or not.
>>
>> I believe this covers all cases and race conditions.  In the case that
>> the p2m is updated before the m2p, the p2m frame will be marked dirty in
>> the bitmap, and discoverable on the next iteration.  At that point, if
>> the p2m and m2p are inconsistent, the pfn will be deferred until the
>> final iteration.  If not, the frame is sent and everything is all ok.
>> In the case that the p2m is updated after the m2p, the p2m/m2p will be
>> consistent when the dirty bitmap is acted on.
>>
>>
>> Thoughts? (for anyone who has made it this far :)  I think I have
>> covered everything.)
>
> Sounds okay.
>
> Two remarks regarding the virtual linear map:
> - The intermediate levels could be tracked, as they are memory pages as
>   well. It is not practical to do so, however, as there might be lots of
>   changes not related to the p2m.

The intermediate levels are just pagetables, are they not? Or is there a
separate tracking structure?

> - The generation count is being checked by the tools in a lazy manner.
>   This will require an increment of the count by the guest only after
>   changing the structure of the p2m map, I think.

On further consideration, I think this needs to be a lockless
producer/consumer interface, with increment once at start, and once
again at the end.  The toolstack needs some ability to confirm that it
has got a consistent mapping of the virtual p2m, as it can't practically
detect updates via the logdirty bitmap.

It also occurs to me that the toolstack code needs to gain some use of
ACCESS_ONCE() when reading the live p2m.
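
[Editorial aside: a seqlock-style protocol would give the toolstack exactly that consistency check. The sketch below is one possible shape, assuming a shared generation field — all names are hypothetical — with the kernel-style ACCESS_ONCE() macro standing in for the volatile accesses both sides would need. Real code would also need write barriers between the increments and the p2m updates:]

```c
#include <stdint.h>

#define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

/* Hypothetical shared field: the guest bumps it to an odd value before
 * changing the p2m structure, and to the next even value afterwards. */
struct shared_p2m_info {
    uint64_t p2m_generation;
};

/* Guest side: bracket a structure change. */
static void p2m_change_begin(struct shared_p2m_info *s)
{
    ACCESS_ONCE(s->p2m_generation)++;   /* now odd: change in progress */
}

static void p2m_change_end(struct shared_p2m_info *s)
{
    ACCESS_ONCE(s->p2m_generation)++;   /* even again: stable */
}

/* Toolstack side: sample the generation before and after reading the
 * virtual linear map; a changed or odd value means retry. */
static int p2m_read_was_consistent(uint64_t gen_before, uint64_t gen_after)
{
    return gen_before == gen_after && !(gen_before & 1);
}
```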

~Andrew


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel
