[Xen-devel] [RFC PATCH v3 15/22] Start documenting the live update handover



From: David Woodhouse <dwmw@xxxxxxxxxxxx>

Signed-off-by: David Woodhouse <dwmw@xxxxxxxxxxxx>
---
 docs/specs/libxc-migration-stream.pandoc |  19 +-
 docs/specs/live-update-handover.pandoc   | 371 +++++++++++++++++++++++
 2 files changed, 388 insertions(+), 2 deletions(-)
 create mode 100644 docs/specs/live-update-handover.pandoc

diff --git a/docs/specs/libxc-migration-stream.pandoc b/docs/specs/libxc-migration-stream.pandoc
index a7a8a08936..9a6679f3de 100644
--- a/docs/specs/libxc-migration-stream.pandoc
+++ b/docs/specs/libxc-migration-stream.pandoc
@@ -227,12 +227,18 @@ type         0x00000000: END
 
              0x0000000F: CHECKPOINT_DIRTY_PFN_LIST (Secondary -> Primary)
 
-             0x00000010 - 0x7FFFFFFF: Reserved for future _mandatory_
+             0x00000010 - 0x3FFFFFFF: Reserved for future _mandatory_
              records.
 
-             0x80000000 - 0xFFFFFFFF: Reserved for future _optional_
+             0x40000000 - 0x7FFFFFFF: Reserved for future _mandatory_
+             live update records.
+
+             0x80000000 - 0xBFFFFFFF: Reserved for future _optional_
              records.
 
+             0xC0000000 - 0xFFFFFFFF: Reserved for future _optional_
+             live update records.
+
 body_length  Length in octets of the record body.
 
 body         Content of the record.
@@ -246,6 +252,15 @@ Records may be _mandatory_ or _optional_.  Optional records have bit
 unsupported mandatory record must fail.  The contents of optional
 records may be ignored during a restore.
 
+Note: This basic record format, and some of the record types defined here,
+are also used for Live Update, as discussed in the Live Update Handover
+document: `docs/specs/live-update-handover.pandoc`.
+
+Records defined for live update have bit 30 set in their type value,
+are defined in that document, and are out of scope for this document.
+Such records shall not appear in the Domain Image Format as defined by
+this document.
+
 The following sub-sections specify the record body format for each of
 the record types.
 
diff --git a/docs/specs/live-update-handover.pandoc b/docs/specs/live-update-handover.pandoc
new file mode 100644
index 0000000000..31d23c7c90
--- /dev/null
+++ b/docs/specs/live-update-handover.pandoc
@@ -0,0 +1,371 @@
+% Live Update Handover Protocol
+% David Woodhouse <<dwmw@xxxxxxxxxxxx>>
+% Revision 1
+
+Introduction
+============
+
+Purpose
+-------
+
+Live update performs a _kexec_ from one running version of Xen to
+another, preserving all running domains in a form of guest-transparent
+live migration.
+
+This document outlines the memory layout requirements and data stream
+used in the handover protocol, to ensure that pages used by running
+domains are preserved during the transition from one version of Xen
+to the next.
+
+
+Compatibility
+-------------
+
+It cannot be repeated often enough that information passed over live
+update is an ABI. It is expected that live update can be performed from
+one major version of Xen to another, or even hypothetically to a system
+which is not Xen at all.
+
+It is necessary that some data are handed over "in place"; in
+particular the memory pages of the running domains. However, no
+internal Xen data structures may be transferred in this fashion; at
+least not without retrospectively declaring them to be ABI, with the
+restrictions that places on subsequent changes.
+
+
+
+Handover
+========
+
+
+Memory Usage Restrictions
+-------------------------
+
+The new Xen must take care not to use any memory pages which already
+belong to guests. To facilitate this, a contiguous region of memory
+is reserved for the boot allocator, known as _live update bootmem_.
+
+This region is reserved by the original Xen during its own boot, and
+the location made available to the _kexec(8)_ user space tool
+through the `kexec_get_range` hypercall using a new region type
+`KEXEC_RANGE_MA_LIVEUPDATE`. It is passed to the new Xen on the
+command line, using the `liveupdate=` parameter.
+
+The new Xen must not use any pages outside this region until it has
+consumed the live update data stream and determined which pages are
+already in use by running domains.
+
+At run time, Xen may use memory from the reserved region for any
+purpose that does not require preservation over a live update; in
+particular it must not be mapped to a domain.
+
+The new Xen executable image must be loaded by kexec to the same
+physical location as the running Xen, since that region of memory is
+known to be available. For that reason, freed init memory from the
+Xen image is also treated as reserved _live update bootmem_.
+
+
+Live Update Data Stream
+-----------------------
+
+During handover, the running Xen pauses all domains and creates a
+_live update data stream_ containing all the information required by
+the new Xen to restore them. This is largely the same as
+guest-transparent live migration.
+
+Data pages for this stream may be allocated anywhere in physical
+memory outside the _live update bootmem_ regions.
+
+Xen creates a physically contiguous array of MFNs of the allocated
+data pages, suitable for passing to `vmap()` to obtain a virtually
+contiguous mapping of the whole data stream.
+
+
+Breadcrumb
+----------
+
+Since the live update data stream is created during the final `kexec_exec`
+hypercall, its address cannot be passed on the command line to the
+new Xen since the command line needs to have been set up by `kexec(8)`
+in userspace long beforehand.
+
+Thus, to allow the new Xen to find the data stream, the old Xen places
+a _breadcrumb_ in the first words of the _live update bootmem_, containing
+the number of data pages, and the physical address of the contiguous MFN
+array.
+
+The breadcrumb is written as the last action of the `kexec_reloc()`
+routine during the `kexec` handover, so it cannot overwrite anything
+important by virtue of the existing guarantee that Xen will not place
+any data in that region which needs to survive across a live update.
+
+A restriction of the `kexec_reloc()` mechanism for writing the breadcrumb
+is that the values are host-endian and are masked with PAGE_MASK; the low
+bits are zeroed. This is actually perfect for the magic value used
+to recognise a live update breadcrumb, since it neatly prevents any attempt
+to live update to a Xen which uses a different endianness or page size.
+
+For the physical address of the MFN list it's perfectly fine, since
+that list is page-aligned anyway. For the number of pages, it means
+the value must be shifted accordingly. Hence the use of `shifted_nr_pages`
+in the breadcrumb structure below:
+
+
+    0     1     2     3     4     5     6     7 octet
+    +-------------------------------------------------+
+    | live_update_magic                               |
+    +-------------------------------------------------+
+    | mfn_array_physaddr                              |
+    +-------------------------------------------------+
+    | shifted_nr_pages                                |
+    +-------------------------------------------------+
+
+--------------------------------------------------------------------
+Field               Description
+------------------- ------------------------------------------------
+live_update_magic   "LiveUpda" (0x4c69766555706461) stored in the host
+                    endianness and masked with PAGE_MASK.
+                    For example on x86_64: `00 60 70 55 65 76 69 4c`.
+
+mfn_array_physaddr  Machine address of the MFN list for the data stream.
+
+shifted_nr_pages    Number of data pages, shifted left by PAGE_SHIFT to
+                    avoid the limitation of kexec_reloc().
+--------------------------------------------------------------------
+
+
+IOMMU
+-----
+
+Where devices are passed through to domains, it may not be possible
+to quiesce those devices for the purpose of performing the update.
+
+If performing live update with assigned devices, the original Xen will
+leave the IOMMU mappings active during the handover (thus implying
+that IOMMU page tables may not be allocated in the _live update
+bootmem_ region either).
+
+The new Xen must resume control of the IOMMU without causing those mappings
+to become invalid even for a short period of time. On hardware which does not
+support Posted Interrupts, interrupts may need to be generated on resume.
+
+_This section will be expanded once we actually have it working._
+
+\clearpage
+
+Data Stream Overview
+====================
+
+Once discovered and mapped, the live update data stream forms a
+virtually contiguous stream of records following the basic form
+documented in the LibXenCtrl Domain Image Format at
+`docs/specs/libxc-migration-stream.pandoc`.
+
+Some record types from the LibXenCtrl Domain Image format are used
+as-is, such as the `X86_PV_INFO`, `X86_PV_VCPU_BASIC`, `HVM_CONTEXT`
+and other records containing domain-specific data.
+
+The Domain Header from that document is not used in that form; instead, a
+new record of type `LU_DOMAIN_INFO` is defined below.
+
+Other new record types specific to the live update process are defined in
+this document. Of those, some contain global state such as the M2P table
+information, while others are domain-specific.
+
+The live update data stream starts with records containing global
+information. These are followed by any number of groups, each consisting
+of a `LU_DOMAIN_INFO` record and the subsequent domain-specific records
+for that domain.
+
+There is a single `END` record at the end of the live update data stream,
+indicating that no more `LU_DOMAIN_INFO` records are present.
+
+\clearpage
+
+As defined in the LibXenCtrl Domain Image format document, a record
+has the following structure. Record type values defined for live update
+have bit 30 set, and are thus in the range 0x40000000-0x7FFFFFFF for
+mandatory live update records, and 0xC0000000-0xFFFFFFFF for optional
+live update records _(of which there are none at the present time)_.
+
+
+    0     1     2     3     4     5     6     7 octet
+    +-----------------------+-------------------------+
+    | type                  | body_length             |
+    +-----------+-----------+-------------------------+
+    | body...                                         |
+    ...
+    |           | padding (0 to 7 octets)             |
+    +-----------+-------------------------------------+
+
+--------------------------------------------------------------------
+Field        Description
+-----------  -------------------------------------------------------
+type         0x40000000: LU_VERSION
+
+             0x40000001: LU_M2P
+
+             0x40000002: LU_M2P_COMPAT
+
+             0x40000003: LU_DOMAIN_INFO
+
+             0x40000004 - 0x7FFFFFFF: Reserved for future _mandatory_
+             live update records.
+
+             0xC0000000 - 0xFFFFFFFF: Reserved for future _optional_
+             live update records.
+
+body_length  Length in octets of the record body.
+
+body         Content of the record.
+
+padding      0 to 7 octets of zeros to pad the whole record to a multiple
+             of 8 octets.
+--------------------------------------------------------------------
+
+
+\clearpage
+
+Global Records
+==============
+
+LU_VERSION
+----------
+
+The version field indicates the version of Xen from which the system
+is live updating. In theory this should never be relevant, but it
+allows for version-specific workarounds to be implemented in the receiving
+Xen should they become necessary.
+
+    0     1     2     3     4     5     6     7 octet
+    +-----------------------+-------------------------+
+    | xen_major             | xen_minor               |
+    +-----------------------+-------------------------+
+
+
+--------------------------------------------------------------------
+Field       Description
+----------- --------------------------------------------------------
+xen_major   The Xen major version from which the system is updating.
+
+xen_minor   The Xen minor version from which the system is updating.
+--------------------------------------------------------------------
+
+\clearpage
+
+LU_M2P / LU_M2P_COMPAT
+----------------------
+
+The M2P and compatibility M2P records contain a scatter/gather list of
+pages containing native or 32-bit M2P data.
+
+
+     0     1     2     3     4     5     6     7 octet
+    +-----------------------+-------------------------+
+    | m2p_page_data[0]...                             |
+    ...
+    +-------------------------------------------------+
+    | m2p_page_data[N-1]...                           |
+    ...
+    +-------------------------------------------------+
+
+--------------------------------------------------------------------
+Field           Description
+--------------- --------------------------------------------------------
+m2p_page_data   A 64-bit value containing the physical address of the
+                next page of M2P data, encoding the _order_ of the page
+                into the low 12 bits. Thus, a 1GiB page at 0x4C0000000
+                would be encoded as 0x4C000001E.
+
+                In case the M2P does not contiguously cover pages starting
+                from MFN zero, a discontiguity is indicated by a field
+                with order set to zero. The high bits of the field then
+                provide the MFN for which the subsequent M2P data page
+                provides data.
+
+--------------------------------------------------------------------
+
+\clearpage
+
+Domain Specific Records
+=======================
+
+
+LU_DOMAIN_INFO
+--------------
+
+The domain info record contains general properties necessary to
+recreate a domain in the receiving Xen, and marks the start of a set
+of other domain-specific records pertaining to that domain.
+
+    0     1     2     3     4     5     6     7 octet
+    +-----------------------+-----------+-------------+
+    | type                  | page_shift| domain_id   |
+    +-----------------------+-----------+-------------+
+    | domain_handle[0-7]                              |
+    +-------------------------------------------------+
+    | domain_handle[8-15]                             |
+    +-----------------------+-------------------------+
+    | ssidref               | flags                   |
+    +-----------------------+-------------------------+
+    | max_vcpus             | emulation_flags         |
+    +-----------------------+-------------------------+
+    | extra_flags           | (padding)               |
+    +-----------------------+-------------------------+
+
+
+--------------------------------------------------------------------
+Field           Description
+--------------- --------------------------------------------------------
+type            0x0000: Reserved.
+
+                0x0001: x86 PV.
+
+                0x0002: x86 HVM.
+
+                0x0003 - 0xFFFFFFFF: Reserved.
+
+page_shift      Size of a guest page as a power of two.
+
+                i.e., page size = 2 ^page_shift^.
+
+domain_id       Domain ID
+
+
+domain_handle   UUID domain handle.
+
+ssidref         Security Identifier Index
+
+flags           Domain flags, using the `XEN_DOMCTL_CDF_` values.
+
+max_vcpus       Maximum vCPUs for domain.
+
+emulation_flags Emulation flags using `XEN_X86_EMU_`
+
+extra_flags     Additional flags:
+
+                0x00000001: Is privileged
+
+--------------------------------------------------------------------
+
+\clearpage
+
+Future Extensions
+=================
+
+All changes to this specification should bump the revision number in
+the title block.
+
+All changes to the image or domain headers require the image version
+to be increased.
+
+The format may be extended by adding additional record types.
+
+Extending an existing record type must be done by adding a new record
+type.  This allows old images with the old record to still be
+restored.
+
+The image header may only be extended by _appending_ additional
+fields.  In particular, the `marker`, `id` and `version` fields must
+never change size or location.
+
+
-- 
2.21.0
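
As a reading aid for the Breadcrumb section of live-update-handover.pandoc in the patch above, here is a minimal sketch of how the three breadcrumb words might be filled in and checked. It is not code from this series: the struct, the function names and the `LU_` macros are hypothetical, and the page shift of 12 is an assumption that only holds for x86_64.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed x86_64 values; a real build would take these from Xen's headers. */
#define LU_PAGE_SHIFT 12
#define LU_PAGE_MASK  (~((UINT64_C(1) << LU_PAGE_SHIFT) - 1))

/* "LiveUpda" as a 64-bit integer, before masking. */
#define LU_MAGIC UINT64_C(0x4c69766555706461)

/* Hypothetical struct mirroring the three breadcrumb words. */
struct lu_breadcrumb {
    uint64_t live_update_magic;
    uint64_t mfn_array_physaddr;
    uint64_t shifted_nr_pages;
};

/* Fill in a breadcrumb so that every word has its low bits clear. */
static inline void lu_breadcrumb_fill(struct lu_breadcrumb *bc,
                                      uint64_t mfn_array_physaddr,
                                      uint64_t nr_pages)
{
    /* The MFN array must be page-aligned, or masking would corrupt it. */
    assert((mfn_array_physaddr & ~LU_PAGE_MASK) == 0);
    /* The page count must survive the left shift without overflow. */
    assert(nr_pages <= (UINT64_MAX >> LU_PAGE_SHIFT));

    bc->live_update_magic = LU_MAGIC & LU_PAGE_MASK;
    bc->mfn_array_physaddr = mfn_array_physaddr;
    bc->shifted_nr_pages = nr_pages << LU_PAGE_SHIFT;
}

/* Check the magic and recover the values; returns false on a mismatch. */
static inline bool lu_breadcrumb_parse(const struct lu_breadcrumb *bc,
                                       uint64_t *mfn_array_physaddr,
                                       uint64_t *nr_pages)
{
    if (bc->live_update_magic != (LU_MAGIC & LU_PAGE_MASK))
        return false;
    *mfn_array_physaddr = bc->mfn_array_physaddr;
    *nr_pages = bc->shifted_nr_pages >> LU_PAGE_SHIFT;
    return true;
}
```

Because the expected magic is itself computed with the receiver's page mask, a Xen built with a different page size computes a different expected value and rejects the breadcrumb, which is the property the document relies on.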

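The record layout and type ranges described in the Data Stream Overview above can be sketched in the same way. The helper below classifies a type value by bits 31 and 30 and steps from one record to the next using the 8-octet padding rule. All names are hypothetical, and reading the header fields in host byte order is an assumption (the stream is produced and consumed on the same machine).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define REC_TYPE_OPTIONAL    UINT32_C(0x80000000) /* bit 31 */
#define REC_TYPE_LIVE_UPDATE UINT32_C(0x40000000) /* bit 30 */

/* Hypothetical in-memory copy of the 8-octet record header. */
struct rec_hdr {
    uint32_t type;
    uint32_t body_length;
};

static inline bool rec_is_optional(uint32_t type)
{
    return (type & REC_TYPE_OPTIONAL) != 0;
}

static inline bool rec_is_live_update(uint32_t type)
{
    return (type & REC_TYPE_LIVE_UPDATE) != 0;
}

/* Header plus body, with the body rounded up to a multiple of 8 octets. */
static inline uint64_t rec_total_size(uint32_t body_length)
{
    return 8 + (((uint64_t)body_length + 7) & ~UINT64_C(7));
}

/*
 * Read the header at offset 'off' and return the offset of the next
 * record, or 0 if the record does not fit inside the stream.
 */
static inline size_t rec_next(const uint8_t *stream, size_t len, size_t off,
                              struct rec_hdr *hdr)
{
    uint64_t total;

    assert(stream != NULL && hdr != NULL);
    if (off > len || len - off < 8)
        return 0;
    /* memcpy avoids alignment assumptions; fields are host-endian here. */
    memcpy(&hdr->type, stream + off, 4);
    memcpy(&hdr->body_length, stream + off + 4, 4);
    total = rec_total_size(hdr->body_length);
    if (total > len - off)
        return 0;
    return off + (size_t)total;
}
```

A consumer would call `rec_next()` in a loop, dispatching on `hdr.type`, and stop when it sees the `END` record (type 0x00000000).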

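For the `m2p_page_data` field of LU_M2P / LU_M2P_COMPAT, a small sketch of splitting a 64-bit entry into its high bits and its low 12 bits may help. It deliberately does not interpret the low bits further, since the document only gives one worked example (0x1E for a 1GiB page), nor does it decide how the high bits of a discontiguity marker map to an MFN. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* The low 12 bits of each entry carry the order field. */
#define M2P_LOW_BITS UINT64_C(0xfff)

/* Hypothetical decoded view of one m2p_page_data entry. */
struct m2p_entry {
    uint64_t base;       /* entry with the low 12 bits cleared */
    unsigned int low;    /* raw value of the low 12 bits */
    bool discontiguity;  /* true when the low 12 bits are zero */
};

static inline struct m2p_entry m2p_entry_split(uint64_t v)
{
    struct m2p_entry e;

    e.low = (unsigned int)(v & M2P_LOW_BITS);
    e.base = v & ~M2P_LOW_BITS;
    e.discontiguity = (e.low == 0);
    return e;
}

/* Inverse of m2p_entry_split() for a page-aligned base. */
static inline uint64_t m2p_entry_pack(uint64_t base, unsigned int low)
{
    assert((base & M2P_LOW_BITS) == 0);
    assert(low <= M2P_LOW_BITS);
    return base | low;
}
```

Splitting the example value from the document, 0x4C000001E, yields a base of 0x4C0000000 and low bits of 0x1E.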