
[Xen-devel] Buggy interaction of live migration and p2m updates



Hello,

Tim, David and I were discussing this over lunch.  This email is a
(hopefully accurate) account of our findings, and potential solutions. 
(If I have messed up, please shout.)

Currently, correct live migration of PV domains relies on the toolstack
(which has a live mapping of the guest's p2m) never observing stale
values while the guest updates its p2m, and on never hitting the race
condition between a p2m update and the corresponding m2p update.
Realistically, this means no updates to the p2m at all, due to several
potential race conditions.  Should any of these races occur
(e.g. ballooning while live migrating), the effects could be anything
from an aborted migration to VM memory corruption.

It should be noted that migration v2 does not fix any of this.  It only
alters the way in which some of the race conditions might be observed.
During development of migration v2, there was an explicit non-requirement
of fixing the existing ballooning + live migration issues we were aware
of, although at the time we were not aware of this specific set of
issues.  Our goal was simply to make migration v2 work in the same
circumstances as before, but with a bitness-agnostic wire format and a
forward-extensible protocol.


As far as these issues are concerned, there are two distinct p2m
modifications which we care about:
1) p2m structure changes (rearranging the layout of the p2m)
2) p2m content changes (altering entries in the p2m)

There is no possible way for the toolstack to prevent a domain from
altering its p2m.  At the moment, ballooning typically only occurs when
requested by the toolstack, but the underlying operations
(increase/decrease_reservation, mem_exchange, etc) can be used by the
guest at any point.  This includes Wei's guest memory fragmentation
changes.  Changes to the content of the p2m also occur for grant map and
unmap operations.


Currently in PV guests, the p2m is implemented using a 3-level tree,
with its root in the guest's shared_info page.  It imposes a hard VM
memory limit of 4TB for 32bit PV guests (which is far higher than the
128GB limit from the compat p2m mappings), or 512GB for 64bit PV guests.
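
(As a back-of-the-envelope check, those limits fall straight out of the
tree geometry.  The arithmetic below is purely illustrative, not
toolstack code.)

#include <stdio.h>
#include <stdint.h>

#define PAGE_SIZE 4096ULL

int main(void)
{
    /* 32bit PV guests use 4-byte p2m entries, 64bit guests 8-byte. */
    uint64_t entries_32 = PAGE_SIZE / 4;   /* 1024 entries per p2m page */
    uint64_t entries_64 = PAGE_SIZE / 8;   /*  512 entries per p2m page */

    /* Three levels of pages; each leaf entry maps one 4k frame. */
    uint64_t limit_32 = entries_32 * entries_32 * entries_32 * PAGE_SIZE;
    uint64_t limit_64 = entries_64 * entries_64 * entries_64 * PAGE_SIZE;

    printf("32bit 3-level p2m limit: %llu TB\n",
           (unsigned long long)(limit_32 >> 40));   /* 4 TB */
    printf("64bit 3-level p2m limit: %llu GB\n",
           (unsigned long long)(limit_64 >> 30));   /* 512 GB */
    return 0;
}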

Juergen has proposed a new p2m interface using a virtual linear
mapping.  This is conceptually similar to the previous implementation
(which is fine from the toolstack's point of view), but far less
complicated from the guest's point of view, and it removes the memory
limits imposed by the p2m structure.
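
To illustrate the conceptual difference from the toolstack's side, here
is a deliberately simplified sketch.  The names and types are made up
for the example, and it pretends each level has already been mapped into
the toolstack's address space; it does not match either real
implementation.

#include <stdint.h>

typedef uint64_t xen_pfn_t;

#define P2M_ENTRIES_PER_PAGE 512          /* 64bit guest: 8-byte entries */

/* 3-level tree: walk top and mid levels to reach the leaf page holding
 * the entry for a given pfn. */
xen_pfn_t lookup_3level(xen_pfn_t ***p2m_top, xen_pfn_t pfn)
{
    unsigned int top = pfn / (P2M_ENTRIES_PER_PAGE * P2M_ENTRIES_PER_PAGE);
    unsigned int mid = (pfn / P2M_ENTRIES_PER_PAGE) % P2M_ENTRIES_PER_PAGE;
    unsigned int low = pfn % P2M_ENTRIES_PER_PAGE;

    return p2m_top[top][mid][low];
}

/* Virtual linear mapping: the p2m appears as one contiguous virtual
 * array, so a lookup is a plain index. */
xen_pfn_t lookup_linear(const xen_pfn_t *p2m_linear, xen_pfn_t pfn)
{
    return p2m_linear[pfn];
}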

The new virtual linear mapping suffers from the same interaction issues
as the old 3-level tree did, but the introduction of the new interface
affords us an opportunity to make all API modifications at once to
reduce churn.


During live migration, the toolstack maps the guest's p2m into a linear
mapping in the toolstack's virtual address space.  This is done once at
the start of migration and never subsequently altered.  During the live
phase, the p2m is cross-verified against the m2p, and frames are sent
using pfns as references, as they will be located in different frames on
the receiving side.
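
For concreteness, the cross-check amounts to something like the
following simplified sketch.  p2m[], m2p[] and max_pfn stand in for the
real mapped structures; this is not the actual libxc code.

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t xen_pfn_t;
#define INVALID_MFN (~0ULL)

/* A pfn is only safe to send by reference if the guest's p2m and the
 * hypervisor's m2p agree about which frame backs it. */
bool pfn_is_consistent(const xen_pfn_t *p2m, const xen_pfn_t *m2p,
                       xen_pfn_t pfn, xen_pfn_t max_pfn)
{
    if ( pfn >= max_pfn )
        return false;

    xen_pfn_t mfn = p2m[pfn];

    if ( mfn == INVALID_MFN )
        return false;          /* e.g. ballooned out */

    return m2p[mfn] == pfn;    /* stale if the two views disagree */
}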

Should the guest change the p2m structure during live migration, the
toolstack ends up with a stale p2m with a non-p2m frame in the middle,
resulting in bogus cross-referencing.  Should the guest change an entry
in the p2m, the p2m frame itself will be resent as it would be marked as
dirty in the logdirty bitmap, but the target pfn will remain unsent and
probably stale on the receiving side.


Another factor which needs to be taken into account is Remus/COLO, which
run the domains under live migration conditions for the duration of
their lifetime.

During the live part of migration, the toolstack already has to be able
to tolerate failures to normalise the pagetables, which arise as a
consequence of the pagetables being in active use.  These failures are
fatal on the final iteration, after the guest has been paused, but the
same logic could be extended to p2m/m2p inconsistencies if needed.


There are several potential solutions to these problems.

1) Freeze the guest's p2m during live migration

This is the simplest-sounding option, but it is quite problematic from
the guest's point of view.  It is essentially a shared spinlock between
the toolstack and the guest kernel.  It would prevent any grant
map/unmap operations from occurring, and might interact badly with
certain p2m updates in the guest which would previously have been
expected to succeed unconditionally.

Pros) (Can't think of any)
Cons) Not easy to implement (even conceptually), requires invasive guest
changes, will cripple Remus/COLO


2) Deep p2m dirty tracking

If a p2m frame is discovered dirty in the logdirty bitmap, we can be
certain that a write has occurred to it, and in the common case this
means that a mapping has changed.  The toolstack could maintain a
non-live copy of the p2m which is updated as new frames are sent.  When
a dirty p2m frame is found, the live and non-live copies can be compared
to find which pfn mappings have changed, and the altered pfns can be
locally marked for retransmission.  (A sketch of this comparison follows
the pros/cons below.)

Pros) No guest changes required
Cons) Toolstack needs to keep an additional copy of the guest's p2m on
the sending side
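
A rough sketch of the comparison in 2), with placeholder names
(live_p2m/shadow_p2m and mark_pfn_dirty() are not real libxc
interfaces):

#include <stdint.h>

typedef uint64_t xen_pfn_t;
#define P2M_ENTRIES_PER_FRAME 512   /* 8-byte entries in a 4k frame */

extern void mark_pfn_dirty(xen_pfn_t pfn);   /* hypothetical */

/* When a p2m frame shows up dirty, diff the live p2m against the local
 * copy and queue every changed pfn for retransmission. */
void handle_dirty_p2m_frame(const xen_pfn_t *live_p2m, xen_pfn_t *shadow_p2m,
                            unsigned long p2m_frame_idx)
{
    xen_pfn_t base = (xen_pfn_t)p2m_frame_idx * P2M_ENTRIES_PER_FRAME;

    for ( unsigned int i = 0; i < P2M_ENTRIES_PER_FRAME; ++i )
    {
        xen_pfn_t pfn = base + i;

        if ( live_p2m[pfn] != shadow_p2m[pfn] )
        {
            /* Mapping changed since this pfn was last sent: resend it
             * and update the local copy. */
            mark_pfn_dirty(pfn);
            shadow_p2m[pfn] = live_p2m[pfn];
        }
    }
}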

3) Eagerly check for p2m structure changes

p2m structure changes are rare after boot, but not impossible.  On each
iteration of live migration, the toolstack can check for dirty
higher-level p2m frames in the dirty bitmap.  If a structure update has
occurred, the toolstack can use information it already has to calculate
the subset of pfns affected by the update, and mark them for resending.
(This can currently be done at frame granularity using the p2m frame
list, but in combination with 2) it could result in fewer pfns needing
to be resent.  See the sketch after the pros/cons below.)

Pros) No guest changes required.
Cons) Moderately high toolstack overhead; possibility of resending far
more pfns than strictly required.
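
A rough sketch of the eager check in 3), again with placeholder names
(mid_frame_pfns[], test_bit() and mark_pfn_dirty() are illustrative
only):

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t xen_pfn_t;
#define P2M_ENTRIES_PER_FRAME 512
/* One mid-level frame covers 512 leaf frames of 512 entries each. */
#define PFNS_PER_MID_FRAME (P2M_ENTRIES_PER_FRAME * P2M_ENTRIES_PER_FRAME)

extern bool test_bit(xen_pfn_t pfn, const unsigned long *logdirty); /* hypothetical */
extern void mark_pfn_dirty(xen_pfn_t pfn);                          /* hypothetical */

/* If the page holding part of the mid level is dirty, the leaf frames
 * it points to may have been replaced, so conservatively mark every
 * guest pfn covered by that mid frame for resend. */
void check_p2m_structure(const unsigned long *logdirty,
                         const xen_pfn_t *mid_frame_pfns,
                         unsigned int nr_mid_frames, xen_pfn_t max_pfn)
{
    for ( unsigned int i = 0; i < nr_mid_frames; ++i )
    {
        if ( !test_bit(mid_frame_pfns[i], logdirty) )
            continue;

        xen_pfn_t start = (xen_pfn_t)i * PFNS_PER_MID_FRAME;
        xen_pfn_t end = start + PFNS_PER_MID_FRAME;

        for ( xen_pfn_t pfn = start; pfn < end && pfn < max_pfn; ++pfn )
            mark_pfn_dirty(pfn);
    }
}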

4) Request p2m structure change updates from the guest

The guest could provide a "p2m generation count" to allow the toolstack
to evaluate whether the structure had changed.  This would allow the
live part of migration to periodically re-evaluate whether it should
remap the p2m to avoid stale mappings.

Pros) Easy to implement alongside the virtual linear mapping support.
Easy for both toolstack and guest.
Cons) Only works with new virtual-linear-map guests.
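
A sketch of what 4) might look like from the toolstack's side; the
p2m_generation field and the helper names are assumptions for the
example, not an existing ABI:

#include <stdint.h>

struct shared_info_p2m {            /* illustrative only */
    volatile uint64_t p2m_generation;
};

extern void remap_guest_p2m(void);  /* hypothetical toolstack helper */
extern void scan_p2m(void);         /* hypothetical: one pass over the p2m */

/* Sample the guest-maintained generation count before and after a pass;
 * if it moved, the structure changed under us, so remap and retry. */
void scan_p2m_stable(const struct shared_info_p2m *si)
{
    uint64_t gen;

    do {
        gen = si->p2m_generation;   /* sample before the pass */
        remap_guest_p2m();          /* pick up any structure changes */
        scan_p2m();
    } while ( si->p2m_generation != gen );
}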


Proposed solution:  A combination of 2, 3 and 4.

For legacy 3-level p2m guests, the toolstack can detect p2m structure
updates by tracking the p2m top- and mid-level frames in the logdirty
bitmap, and invalidating the modified subset of pfns.  It also has to
eagerly check the p2m frame-list-list mfn entry in the shared info to
see whether the guest has swapped to a completely new p2m.
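
Something along these lines (a simplified illustration; error handling
and the actual remap/diff work are hand-waved away):

#include <stdbool.h>
#include <stdint.h>

struct arch_shared_info_view {        /* cut-down view for the example */
    unsigned long max_pfn;
    unsigned long pfn_to_mfn_frame_list_list;
};

extern void remap_and_diff_p2m(void); /* hypothetical: remap, diff, mark pfns */

/* Re-read the mfn of the p2m frame-list-list each iteration and compare
 * it against the value recorded when the p2m was first mapped. */
bool check_p2m_root(const struct arch_shared_info_view *arch,
                    unsigned long *last_fll_mfn)
{
    if ( arch->pfn_to_mfn_frame_list_list == *last_fll_mfn )
        return false;                 /* structure root unchanged */

    /* The guest has switched to a (partially) new p2m tree. */
    *last_fll_mfn = arch->pfn_to_mfn_frame_list_list;
    remap_and_diff_p2m();
    return true;
}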

For a virtual linear map, there are no intermediate levels available to
track, but we can require that the guest increment a p2m generation
count in the shared info.  When the structure changes, the toolstack can
remap the p2m, calculate the altered subset of pfns, and mark them for
resending.

The toolstack must also track changes in the p2m itself, and compare to
a local copy showing the mapping at the time at which the pfn was last
sent.  This can be used to work out which p2m mappings have changed, and
also be used to confirm whether the pfns on the receiving side are stale
or not.

I believe this covers all the cases and race conditions.  If the p2m is
updated before the m2p, the p2m frame will be marked dirty in the bitmap
and discovered on the next iteration.  At that point, if the p2m and m2p
are inconsistent, the pfn will be deferred until the final iteration; if
they are consistent, the frame is sent and all is well.  If the p2m is
updated after the m2p, the two will already be consistent by the time
the dirty bitmap is acted on.
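
Pulling the pieces together, the per-pfn decision during a live
iteration would look roughly like this (send_pfn(), defer_pfn() and
abort_migration() are placeholders, not real libxc interfaces):

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t xen_pfn_t;
#define INVALID_MFN (~0ULL)

extern void send_pfn(xen_pfn_t pfn);    /* hypothetical */
extern void defer_pfn(xen_pfn_t pfn);   /* hypothetical: retry on final pass */
extern void abort_migration(void);      /* hypothetical */

void process_dirty_pfn(const xen_pfn_t *p2m, const xen_pfn_t *m2p,
                       xen_pfn_t pfn, bool final_iteration)
{
    xen_pfn_t mfn = p2m[pfn];

    /* p2m updated before m2p (or vice versa): the two views disagree. */
    if ( mfn == INVALID_MFN || m2p[mfn] != pfn )
    {
        if ( final_iteration )
            /* Guest is paused: a persistent mismatch is a real error. */
            abort_migration();
        else
            defer_pfn(pfn);
        return;
    }

    send_pfn(pfn);
}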


Thoughts? (for anyone who has made it this far :)  I think I have
covered everything.)

~Andrew


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

