Re: [Xen-devel] [PATCH v4 1/2] docs/designs: Add a design document for non-cooperative live migration



> -----Original Message-----
> From: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
> Sent: 29 January 2020 19:47
> To: Durrant, Paul <pdurrant@xxxxxxxxxxxx>; xen-devel@xxxxxxxxxxxxxxxxxxxx
> Cc: George Dunlap <George.Dunlap@xxxxxxxxxxxxx>; Ian Jackson
> <ian.jackson@xxxxxxxxxxxxx>; Jan Beulich <jbeulich@xxxxxxxx>; Julien Grall
> <julien@xxxxxxx>; Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>; Stefano
> Stabellini <sstabellini@xxxxxxxxxx>; Wei Liu <wl@xxxxxxx>
> Subject: Re: [PATCH v4 1/2] docs/designs: Add a design document for non-
> cooperative live migration
> 
> On 29/01/2020 14:47, Paul Durrant wrote:
> > diff --git a/docs/designs/non-cooperative-migration.md
> b/docs/designs/non-cooperative-migration.md
> > new file mode 100644
> > index 0000000000..5db3939db5
> > --- /dev/null
> > +++ b/docs/designs/non-cooperative-migration.md
> > @@ -0,0 +1,272 @@
> > +# Non-Cooperative Migration of Guests on Xen
> > +
> > +## Background
> > +
> > +The normal model of migration in Xen is driven by the guest because it
> was
> > +originally implemented for PV guests, where the guest must be aware it
> is
> > +running under Xen and is hence expected to co-operate.
> 
> For PV guests, is more than "expected to co-operate".
> 
> Migrating a PV guest involves rewriting every pagetable entry with a
> different MFN, so even before you consider things like the PV protocols,
> there is no way this could be done without the cooperation of the guest.

Yes, the P2M will change and this is visible to the guest, but does a PV guest 
need to take action when this occurs? I'm not sure.

> 
> Sadly, this fact was depended upon for migration of the PV protocols,
> and has migrated (excuse the pun) into the HVM world as well.
> 

Alas yes.

> > This model dates from
> > +an era when it was assumed that the host administrator had control of
> at least
> > +the privileged software running in the guest (i.e. the guest kernel)
> which may
> > +still be true in an enterprise deployment but is not generally true in
> a cloud
> > +environment.
> 
> I haven't seen it discussed elsewhere, but even enterprise environments
> have problems.
> 
> Having host admin == guest admin doesn't mean that guest drivers aren't
> buggy, or that the VM doesn't explode on migrate.

No, but at least the host admin has a chance to test and update guest software 
to be 'reasonably' confident that migration will work before employing it en 
masse.

> 
> The simple fact is that involving the guest kernel adds unnecessary
> moving parts which can (and do with a non-zero probability) go wrong.
> 

Yes. Having written the frontend side of migration in the Windows drivers, I 
can say it is *very* hard to get right, particularly in Windows where one has 
to deal with the complex and asynchronous PnP subsystem colliding with a 
migration. The network driver also requires a multi-reader/single-writer lock 
with odd semantics (w.r.t. IRQL) which I had to code myself 
(https://xenbits.xen.org/gitweb/?p=pvdrivers/win/xenvif.git;a=blob;f=src/xenvif/mrsw.h).
It took years of fixing subtle races (in that and elsewhere) to get to the 
(AFAIK) reliable code we have now. 
Avoiding execution of code like this (in any OS) certainly avoids the 
opportunity for subtle bugs to manifest themselves.

> >  The aim of this design is to provide a model which is purely host
> > +driven, requiring no co-operation from the software running in the
> > +guest, and is thus suitable for cloud scenarios.
> > +
> > +PV guests are out of scope for this project because, as is outlined
> above, they
> > +have a symbiotic relationship with the hypervisor and therefore a
> certain level
> > +of co-operation can be assumed.
> 
> If nothing else, I'd at least suggest s/can be assumed/is necessary/.

Ok. I'll make that modification.

> 
> > +Because the service domain’s domid is used directly by the guest in
> setting
> > +up grant entries and event channels, the backend drivers in the new
> host
> > +environment must be provided by service domain with the same domid.
> Also,
> > +because the guest can sample its own domid from the frontend area and
> use it in
> > +hypercalls (e.g. HVMOP_set_param) rather than DOMID_SELF, the guest
> domid must
> > +also be preserved to maintain the ABI.
> 
> Has this been true since forever?  The grant and event APIs took some
> care to avoid the guest needing to know its own domid.
> 

The guest doesn't need to know its domid; DOMID_SELF will work, but the guest 
*can* use its own domid in this case (whereas I think grant and event ops will 
insist on DOMID_SELF unless referring to another domain). As far as I know this 
has been the case since forever and so I don't think it is something we can 
change now unless we move to a new ABI.
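
To make the domid point concrete, here is a minimal sketch in Python. It is 
not Xen code: resolve_target() and targets_self() are hypothetical helpers, 
and only the DOMID_SELF value (0x7FF0) is taken from the public headers. It 
assumes an op that accepts either DOMID_SELF or an explicit domid, as 
described above for HVMOP_set_param.

```python
DOMID_SELF = 0x7FF0  # value of DOMID_SELF in Xen's public headers


def resolve_target(caller_domid, requested_domid):
    """Hypothetical helper: which domain does an op end up acting on?"""
    if requested_domid == DOMID_SELF:
        return caller_domid
    return requested_domid


def targets_self(caller_domid, requested_domid):
    """True if the op acts on the calling domain itself."""
    return resolve_target(caller_domid, requested_domid) == caller_domid


# A guest that sampled its domid (say 5) before migration keeps using it.
cached_domid = 5
print(targets_self(5, cached_domid))  # same domid on the new host
print(targets_self(7, cached_domid))  # domid changed on the new host
print(targets_self(7, DOMID_SELF))    # DOMID_SELF is unaffected
```

In this model a guest using DOMID_SELF is indifferent to the domid it is given 
on the destination, whereas a guest using a cached domid only keeps working if 
that domid is preserved.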

> > +
> > +Furthermore, it will necessary to modify backend drivers to re-
> establish
> > +communication with frontend drivers without perturbing the content of
> the
> > +backend area or requiring any changes to the values of the xenstore
> state nodes.
> > +
> > +## Other Para-Virtual State
> > +
> > +### Shared Rings
> > +
> > +Because the console and store protocol shared pages are actually part
> of the
> > +guest memory image (in an E820 reserved region just below 4G)
> 
> Typically*.
> 
> Their exact location is entirely up to the domain builder, and tend not
> to be there for PVH guests which aren't trying to fit the two frames
> into a BAR.

Ok, I'll add the 'typically' in there. The exact detail is not that important.

> 
> > then the content
> > +will get migrated as part of the guest memory image. Hence no
> additional code
> > +is require to prevent any guest visible change in the content.
> 
> I do agree with this conclusion however.
> 

Good :-)

> > +### Shared Info
> > +
> > +There is already a record defined in *libxenctrl Domain Image Format*
> [3]
> > +called `SHARED_INFO` which simply contains a complete copy of the
> domain’s
> > +shared info page. It is not currently incuded in an HVM (type `0x0002`)
> > +migration stream. It may be feasible to include it as an optional
> record
> > +but it is not clear that the content of the shared info page ever needs
> > +to be preserved for an HVM guest.
> > +
> > +For a PV guest the `arch_shared_info` sub-structure contains important
> > +information about the guest’s P2M, but this information is not relevant
> for
> > +an HVM guest where the P2M is not directly manipulated via the guest.
> The other
> > +state contained in the `shared_info` structure relates the domain wall-
> clock
> > +(the state of which should already be transferred by the `RTC` HVM
> context
> > +information which contained in the `HVM_CONTEXT` save record) and some
> event
> > +channel state (particularly if using the *2l* protocol). Event channel
> state
> > +will need to be fully transferred if we are not going to require the
> guest
> > +co-operation to re-open the channels and so it should be possible to
> re-build a
> > +shared info page for an HVM guest from such other state.
> > +
> > +Note that the shared info page also contains an array of
> `XEN_LEGACY_MAX_VCPUS`
> > +(32) `vcpu_info` structures. A domain may nominate a different guest
> physical
> > +address to use for the vcpu info. This is mandatory for if a domain
> wants to
> > +use more than 32 vCPUs and optional for legacy vCPUs. This mapping is
> not
> > +currently transferred in the migration state so this will either need
> to be
> > +added into an existing save record, or an additional type of save
> record will
> > +be needed.
> 
> For non-cooperative migration in the current ABI, a minimum is to know
> where the shared info frame is mapped, so it can be re-mapped on behalf
> of the guest on the destination side.
> 

True, and the same for the grant tables, although it occurs to me that 
turning these into domheap pages (as part of getting rid of shared xenheap 
pages... for other reasons) means the content should be migrated anyway, so 
we'll only need save records for the GFNs themselves.
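
As a rough illustration of what such a GFN-only record might carry, here is a 
sketch in Python. The layout is purely an assumption made up for this example 
(a 64-bit shared info GFN, a 32-bit count, then one 64-bit GFN per grant table 
frame); it is not an existing libxenctrl record type, and the function names 
are hypothetical.

```python
import struct

# Hypothetical layout, little-endian:
#   uint64 shared_info_gfn, uint32 nr_grant_frames, uint64 grant_gfn[nr]
_HEADER = "<QI"


def pack_gfn_record(shared_info_gfn, grant_frame_gfns):
    """Serialise only the GFNs; page contents travel with guest memory."""
    header = struct.pack(_HEADER, shared_info_gfn, len(grant_frame_gfns))
    body = struct.pack("<%dQ" % len(grant_frame_gfns), *grant_frame_gfns)
    return header + body


def unpack_gfn_record(data):
    """Recover the GFNs so the destination can re-map the frames there."""
    shared_info_gfn, count = struct.unpack_from(_HEADER, data, 0)
    offset = struct.calcsize(_HEADER)
    gfns = list(struct.unpack_from("<%dQ" % count, data, offset))
    return shared_info_gfn, gfns


record = pack_gfn_record(0xFEFF0, [0xFEFF1, 0xFEFF2])
print(unpack_gfn_record(record))
```

The record stays small because it names locations only; the destination side 
would use the recovered GFNs to re-establish the mappings on behalf of the 
guest.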

> The rest of this section will be very good evidence in the "new guest
> ABI" design.
> 
> > +### Grant table
> > +
> > +The grant table is essentially the para-virtual equivalent of an IOMMU.
> 
> TBH, I think "shared memory" is a much better analogy than an IOMMU.
> OTOH, perhaps that doesn't cope with the grant copy aspect quite as well
> as I'd like.
> 

Well, the table allows the guest to create a 'mapping' into a grant ref address 
space, and then those addresses are passed to 'PV devices', so the IOMMU 
analogy seemed most appropriate.
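
A toy model of what I mean, in Python. This is only a sketch of the analogy: 
the GrantTable class and its methods are hypothetical, and real grant entries 
carry more state (and grant copy is not modelled at all).

```python
class GrantTable:
    """Toy per-domain grant table: a grant ref is an index into it."""

    def __init__(self):
        self.entries = []  # each entry: (granted_to_domid, gfn, readonly)

    def grant_access(self, domid, gfn, readonly=False):
        """Guest side: publish a frame to `domid` and return the grant ref."""
        self.entries.append((domid, gfn, readonly))
        return len(self.entries) - 1

    def translate(self, ref, requester_domid, write=False):
        """Backend side: resolve a grant ref to a frame, IOMMU-style."""
        if not 0 <= ref < len(self.entries):
            raise LookupError("invalid grant ref")
        domid, gfn, readonly = self.entries[ref]
        if domid != requester_domid:
            raise PermissionError("grant not issued to this domain")
        if write and readonly:
            raise PermissionError("grant is read-only")
        return gfn


table = GrantTable()
ref = table.grant_access(domid=0, gfn=0x1234, readonly=True)
print(hex(table.translate(ref, requester_domid=0)))
```

The 'PV device' only ever sees the ref and has it translated and 
permission-checked, which is the IOMMU-like part. Note that the entry names 
the backend's domid, which is also why that domid has to be the same on the 
new host.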

  Paul

> ~Andrew