[Xen-devel] [PATCH RFC v2 00/23] Design document and performance evaluation for post-copy live migration
From: Joshua Otto <jtotto@xxxxxxxxxxxx>

Hi,

A little over a year ago, I posted a patch series implementing support for post-copy live migration via xenpaging [1]. Following Andrew and Wei's review of the initial refactoring patches, I promised to follow up with revised patches, a design document and an experimental performance evaluation. It took a lot longer than I thought, but I've finally prepared all of those things now - hopefully better late than never :)

The patches are the v2 of the series from [1], rebased against "3fafdc2 xen/arm: p2m: Fix incorrect mapping of superpages", the tip of master when I performed the rebase and experiments: late May 2017. They're accessible on GitHub at [2].

Changes from v1:
- addressed the feedback received from the first round
- fixed bugs discovered during performance experiments
- based on results from the performance experiments, added a paging op to populate pages directly into the evicted state

Though I haven't actually tried to do so myself, a quick look at the relevant code indicates that a relatively painless rebase should still be possible.

The body of this mail is the report. It is intended to describe the purpose, design and behaviour of live migration both before and after the patches in sufficient detail to enable a future contributor or academic researcher with only general Xen familiarity to pick them up if they turn out to be useful in the future. I prepared it in plain text for the mailing list, and based its format on Haozhong Zhang's vNVDIMM design document [3].

TL;DR: These (now slightly stale) patches implement post-copy live migration using xenpaging. They provide a modest downtime reduction when used in hybrid mode with pre-copy, likely because they permit the memory migration to proceed in parallel with guest device model set-up. This benefit probably doesn't outweigh the cost in terms of increased implementation complexity.

Thanks for reading!

- Joshua Otto

[1] https://lists.xenproject.org/archives/html/xen-devel/2017-03/msg03491.html
[2] https://github.com/jtotto/xen/commits/postcopy-v2
[3] https://lists.xenproject.org/archives/html/xen-devel/2016-02/msg00006.html

Note: I've sent this from my personal e-mail account because I'm no longer able to send mail from my old school address, though I'm still able to receive mail sent to it.

Post-Copy Live Migration for Xen - Design and Performance Evaluation
====================================================================

Xen supports live migration of guests between physical hosts. Documentation of this feature can be found at [a] - summarized briefly, it enables system administrators to 'move' a running guest from one physical host running Xen to another. One of the most difficult sub-problems of live migration is the memory migration. Today, Xen's live memory migration employs an iterative pre-copy algorithm, in which all guest memory is transmitted from the migration sender to receiver _before_ execution is stopped at the sender and resumed at the receiver.

This document describes the design, implementation and performance evaluation of an alternative live memory migration algorithm, _post-copy_ live migration, that attempts to address some of the shortcomings of pre-copy migration by deferring the transmission of part or all of the guest's memory until after it is resumed at its destination.
The described design adds support for post-copy without altering the existing architecture of the migration feature, taking advantage of the xenpaging mechanism to implement post-copy paging purely in the toolstack. The experimental performance evaluation of the new feature indicates that, for the SQL database workload evaluated, post-copy in combination with some number of pre-copy iterations yields modest downtime-reduction benefits, but that pure post-copy results in unacceptable application-level guest downtime.

Content
=======
1. Background
  1.1 Implemented in Xen today: pre-copy memory migration
  1.2 Proposed enhancement: post-copy memory migration
2. Design
  2.1 Current design
    2.1.1 `xl migrate` <-> `xl migrate-receive`, Part One
    2.1.2 libxl_domain_suspend() <-> libxl_domain_create_restore(), Part One
    2.1.3 libxl__stream_write <-> libxl__stream_read, Part One
    2.1.4 xc_domain_save() <-> xc_domain_restore()
    2.1.5 libxl__stream_write <-> libxl__stream_read, Part Two
    2.1.6 libxl_domain_suspend() <-> libxl_domain_create_restore(), Part Two
    2.1.7 `xl migrate` <-> `xl migrate-receive`, Part Two
  2.2 Proposed design changes
    2.2.1 `xl migrate` <-> `xl migrate-receive`, Part One
    2.2.2 libxl_domain_live_migrate() <-> libxl_domain_create_restore(), Part One
    2.2.3 libxl__stream_write <-> libxl__stream_read, Part One
    2.2.4 xc_domain_save() <-> xc_domain_restore(), Part One
      2.2.4.1 Pre-copy policy
      2.2.4.2 Post-copy transition
    2.2.5 libxl__stream_write <-> libxl__stream_read, Part Two
    2.2.6 xc_domain_save() <-> xc_domain_restore(), Part Two: memory post-copy
      2.2.6.1 Background: xenpaging
      2.2.6.2 Post-copy paging
      2.2.6.3 Batch page-out operations
    2.2.7 libxl_domain_live_migrate() <-> libxl_domain_create_restore(), Part Two
    2.2.8 `xl migrate` <-> `xl migrate-receive`, Part Two
3. Performance evaluation
  3.1 Prior work and metrics
  3.2 Experiment: pgbench
    3.2.1 Experiment design
    3.2.2 Results
      3.2.2.1 Algorithms A vs. E: stop-and-copy vs. post-copy after iterative pre-copy
      3.2.2.2 Algorithm C: pure post-copy
      3.2.2.3 Algorithm B vs. D: post-copy after a single pre-copy iteration
  3.3 Further Experiments
4. Conclusion
5. References

1. Background

1.1 Implemented in Xen today: pre-copy memory migration

Live migration of guest memory in Xen is currently implemented using an iterative pre-copy algorithm with fixed iteration-count and remaining-page thresholds. It can be described at a high level by the following sequence of steps:

1) transmit all of the guest's pages to the migration receiver
2) while more than DIRTY_THRESHOLD pages have been dirtied since they were last transmitted and fewer than ITERATION_THRESHOLD transmission iterations have been performed...
3) transmit all guest pages modified since last transmission, goto 2)
4) pause the guest
5) transmit any remaining dirty pages, along with the guest's virtual hardware configuration and the state of its virtual devices

If the migration process can transmit pages faster than they are dirtied by the guest, the migration loop converges - each successive iteration begins with fewer dirty pages than the last. If it converges sufficiently quickly, the number of dirty pages drops below DIRTY_THRESHOLD pages in fewer than ITERATION_THRESHOLD iterations and the guest experiences minimal downtime. (The current values of DIRTY_THRESHOLD and ITERATION_THRESHOLD are 50 and 5, respectively.)

This approach has worked extremely well for the last >10 years, but has some drawbacks:

- The guest's page-dirtying rate is likely non-uniform across its pages.
  Instead, most guests will dirty a subset of their pages much more frequently than the rest (this subset is often referred to as the Writable Working Set, or WWS). If the WWS is larger than DIRTY_THRESHOLD and its pages are dirtied at a higher rate than the migration transmission rate, the migration will 'get stuck' trying to migrate these pages. In this situation:
    - All the time and bandwidth spent attempting pre-copy of these pages is wasted (no further reduction in stop-and-copy downtime can be gained)
    - The guest suffers downtime for the full duration required to transmit the WWS at the end of the migration anyway.

- Migrating guests continue to consume CPU and I/O resources at the sending host for the entire duration of the memory migration, which limits the effectiveness of migration for the purpose of load-balancing these resources.

1.2 Proposed enhancement: post-copy memory migration

Post-copy live migration is an alternative memory migration technique that (at least theoretically) addresses these problems. Under post-copy migration, execution of the guest is moved from the sending to receiving host _before_ the memory migration is complete. As the guest executes at the receiver, any attempts to access unmigrated pages are intercepted as page-faults and the guest is paused while the accessed pages are synchronously fetched from the sender. When not servicing faults for specific unmigrated pages, the sender can push the remaining unmigrated pages in the background. The technique can be employed immediately at the start of a migration, or after any amount of pre-copying (including in the middle of a pre-copy iteration).

The post-copy technique exploits the fact that the guest can make some progress at the receiver without access to all of its memory, permitting execution to proceed in parallel with the continuing memory migration and breaking up the single long stop-and-copy downtime into smaller intervals interspersed with periods of execution. Depending on the nature of the application running in the guest, this can be the difference between merely degraded performance and observable downtime. Compared to the existing pre-copy technique, post-copy also has the same total-migration-time and bandwidth-consumption advantages as outright stop-and-copy: each page is migrated exactly once, rather than arbitrarily many times according to the dirtying behaviour of the guest.

2. Design

The live migration feature is implemented almost entirely in the toolstack, by a set of cooperating dom0 processes distributed between both peers whose functionality is split across four layers:

  Layer         | Sender Process         | Receiver Process
  --------------+------------------------+------------------------------
  xl            | `xl migrate`           | `xl migrate-receive`
  libxl         | libxl_domain_suspend() | libxl_domain_create_restore()
  libxl stream  | libxl__stream_write    | libxl__stream_read
  -------- (libxl-save-helper process boundary) -----------------------
  libxc         | xc_domain_save()       | xc_domain_restore()

Section 2.1 describes the flow of control through each of these layers in the existing design for the case of a live migration of an HVM domain. Section 2.2 describes the changes to the existing design required to accommodate the introduction of post-copy memory migration.

2.1 Current design

2.1.1 `xl migrate` <-> `xl migrate-receive`, Part One

An administrator (or automation) initiates a live migration with the `xl migrate` command at the sending host, specifying the domain to be migrated and the receiving host.
`xl migrate`: - Gathers the domain's xl.cfg(5)-level configuration. - Spawns an SSH child process that launches `xl migrate-receive` at the destination host, thereby establishing a secure, bidirectional stream-oriented communication channel between the remotely cooperating migration processes. - Waits to receive the `migrate_receiver_banner` message transmitted by the `xl migrate-receive` peer, confirming the viability of the link and the readiness of the peer. - Transmits the domain configuration gathered previously. - Calls libxl_domain_suspend(). This is the handoff to the libxl API, which handles the migration of the domain's _state_ now that its configuration is taken care of. Meanwhile, the setup path in the `xl migrate-receive` peer at the destination: - Immediately transmits the `migrate_receiver_banner`. - Receives the xl.cfg(5) domain configuration in binary format and computes from it a libxl_domain_config structure. - Calls libxl_domain_create_restore() with the computed configuration and communication streams. 2.1.2 libxl_domain_suspend() <-> libxl_domain_create_restore(), Part One libxl_domain_send(): - Initializes various mechanisms required to support live migration (e.g. guest suspension and QEMU logdirty support) - Calls libxl__stream_write_start() to kick off the async stream-writing path. - Drops into the AO_INPROGRESS async event loop, which drives control flow for the remainder of the migration. Note: as a first-time reader of libxl when exploring this code path, I found the invocation of the AO_INPROGRESS event loop at [s] to be _extremely_ non-obvious, because the macro isn't function-like - at first I assumed something like '#define AO_INPROGRESS EINPROGRESS', while reality is closer to '#define AO_INPROGRESS do { poll(); dispatch_events(); } while (!done)' libxl_domain_create_restore(), meanwhile: - Validates the configuration of the domain for compatibility with the receiving host and the live migration process. - Creates an 'empty' domain via xc_domain_create(), to be subsequently filled in with the state of the migrating domain. - Prepares the new domain's XenStore hierarchy. - Calls libxl__stream_read_start() to kick off the async stream-reading path. - Drops into the AO_INPROGRESS async event loop. 2.1.3 libxl__stream_write <-> libxl__stream_read, Part One The stream writer: - Writes the stream header. - Writes the LIBXC_CONTEXT record, indicating that control of the stream is to be transferred to libxc for the migration of the guest's virtual architectural state. - Launches the libxl-save-helper, which exists to permit the synchronous execution of xc_domain_save() while keeping the libxl API asynchronous from the perspective of the library client. The stream reader: - Reads the stream header. - Reads the LIBXC_CONTEXT record, and launches its own libxl-save-helper to run xc_domain_restore(). 2.1.4 xc_domain_save() <-> xc_domain_restore() xc_domain_save(): - Writes the Image and Domain headers. - Allocates the dirty_bitmap, a bitmap tracking the set of guest pages whose up-to-date contents aren't known at the receiver. - Enables the 'logdirty' hypervisor and emulator mechanisms. - Transmits all of the pages of the guest in sequence. - Guest pages are transmitted in batches of 1024 at a time. Transmitting a batch entails mapping each page in the batch into the process via the xenforeignmemory_map() interface and collecting them into an iovec for consumption by writev(2). 
- Iteratively, until either of the conditions for termination is met:
  - Refreshes the contents of the dirty_bitmap via XEN_DOMCTL_SHADOW_OP_CLEAN, which atomically records the current state of the dirty bitmap maintained in Xen and clears it (marking all pages as 'clean' again for the next round).
  - Re-transmits each of the pages marked in the updated dirty_bitmap.
- Suspends the domain (via IPC to the controlling libxl).
- Obtains the 'final' dirty_bitmap - since the guest is now paused, it can no longer dirty pages, so transmitting the pages marked in this bitmap will ensure that the receiving peer has the up-to-date contents of every page.
- Transmits these pages.
- Collects and transmits the rest of the HVM state: at present this includes TSC info, the architectural state of each vCPU (encapsulated in a blob of 'HVM context' extracted via domctl), and the HVM_PARAMS (which describe, among other things, a set of 'magic' pages within the guest).
- Transmits the END record.

xc_domain_restore():
- Reads and validates the Image and Domain headers.
- Allocates the populated_pfns bitmap, which tracks the set of guest pages that have been 'populated' (allocated by the hypervisor for use by the guest).
- Iteratively consumes the stream of PAGE_DATA records transmitted as the sender executes the migration loop.
  - Consuming a PAGE_DATA record entails populating each of the pages in the received record that hasn't previously been populated (as recorded in populated_pfns), then mapping all of the pages in the batch and updating their contents with the data in the new record.
- After the suspension of the guest by the sender, receives the final sequence of PAGE_DATA records and the remainder of the state records. This entails installing the received HVM context, TSC info and HVM params into the guest.
- Consumes the END record.

After the sender has transmitted the END record and the receiver has consumed it, control flow passes out of libxc and back into the libxl-save-helper on each side. The result on each side is reported via IPC back to the main libxl process and both helpers terminate. These terminations are observed as asynchronous events in libxl that resume control flow at that level.

2.1.5 libxl__stream_write <-> libxl__stream_read, Part Two

The stream writer next proceeds along an asynchronous chain of record composition and transmission:
- First, emulator data maintained in XenStore is collected and transmitted.
- Next, the state of the emulator itself (the 'emulator context') is collected and transmitted.
- Finally, the libxl END record is transmitted. At this point, the libxl stream is complete and the stream completion callback is invoked.

The stream reader executes an asynchronous record receipt loop that consumes each of these records in turn.
- The emulator XenStore data is mirrored into the receiver XenStore.
- The emulator context blob is written to local storage for subsequent consumption during emulator establishment.
- When the END record is received, the completion callback of the stream reader is invoked.

2.1.6 libxl_domain_suspend() <-> libxl_domain_create_restore(), Part Two

Relatively little happens in this phase at the sender side: some teardown is carried out and then the libxl AO is marked as complete, terminating the AO_INPROGRESS event loop of libxl_domain_suspend() and returning control flow back to `xl migrate`.
In libxl_domain_create_restore(), on the other hand, the work of unpacking the rest of the received guest state and preparing it for resumption ('building it', in the vocabulary of the code) remains. To be completely honest, my understanding of exactly what this entails is a little shaky, but the flow of asynchronous control through this process following completion of the stream follows roughly this path, with the names of each step giving a reasonable hint at the work being performed: -> domcreate_stream_done() [b] -> domcreate_rebuild_done() [c] -> domcreate_launch_dm() [d] -> domcreate_devmodel_started() [e] -> domcreate_attach_devices() (re-entered iteratively for each device) [f] -> domcreate_complete() [g] domcreate_complete() marks the libxl AO as complete, terminating the AO_INPROGRESS loop of libxl_domain_create_restore() and returning control flow to `xl migrate-receive`. 2.1.7 `xl migrate` <-> `xl migrate-receive`, Part Two At this point, the guest is paused at the sending host and ready to be unpaused at the receiving host. Logic in the xl tools then carries out the following handshake to safely destroy the guest at the sender and actually unpause it at the receiver: Sender: After the return of libxl_domain_suspend(), the sender waits synchronously to receive the `migrate_receiver_ready` message. Receiver: After libxl_domain_create_restore() returns, the receiver transmits the `migrate_receiver_ready` message and synchronously waits to receive the `migrate_permission_to_go` message. Sender: After `migrate_receiver_ready` is received, the sender renames the domain with the '--migratedaway' suffix, and transmits `migrate_permission_to_go`. Receiver: After `migrate_permission_to_go` is received, the receiver renames the newly-restored domain to strip its original '--incoming' suffix. It then attempts to unpause it, and reports the success or failure of this operation as the `migrate_report`. If all has gone well up to this point, the guest is now live and executing at the receiver. Sender: If the `migrate_report` indicates success, the '--migratedaway' domain is destroyed. If any of the steps in this sequence prior to the transmission of the `migrate_permission_to_go` message fail _or_ a positive report of failure from the receiver arrives, the receiver destroys their copy of the domain and the sender recovers by unpausing its (still-valid!) copy of the guest. If, however, the sender transmits `migrate_permission_to_go` and a positive report of success from the receiver fails to arrive, the migration has fallen into the 'failed badly' scenario where the sender cannot safely recover by resuming its local copy, because the receiver's copy _may_ be executing. This is the most serious possible failure mode of the scheme described here. 2.2 Proposed design changes The proposed patch series [h] introduces support for a new post-copy phase in the live memory migration. In doing so, it makes no architectural changes to the live migration feature: it is still implemented in the user-space toolstack, and the layering of components within the toolstack is preserved. The most substantial changes are made to the core live memory migration implementation in libxc, with a few supporting changes in the libxl migration stream and even fewer higher up in libxl/xl. To carry out the transition to the new post-copy phase: - At the end of the pre-copy phase of the memory migration, the sender now transmits only the _pfns_ of the final set of dirty pages where previously it transmitted their contents. 
- The receiving save helper process registers itself as a _pager_ for the domain being restored, and marks each of the pages in the set as 'paged out'. This is the key mechanism by which post-copy's characteristic demand-faulting is implemented. - The sender next transmits the remaining guest execution context. This includes the libxl context, requiring that control of the stream be _temporarily_ handed up from libxc back to libxl. After all libxl context is transmitted, control of the stream is handed back to libxc. - The receiver installs this additional context exactly as before (requiring a symmetric temporary handoff to libxl on this side as well). - At this point, all state except the outstanding post-copy pages has been transmitted, and the guest is ready for resumption. The receiving libxl process (the parent of the receiver migration helper) then initiates the resumption process described in 2.1.6. This completes the transition to the post-copy phase. Once in the post-copy phase: - The sender iterates over the set of post-copy pages, transmitting them in batches. Between batches, it checks if any pages have been specifically requested by the receiver, and prioritizes them for transmission. - The receiver, as the pager of a now-live guest, forwards faulting pages to the sender. When batches arrive from the sender, they are installed via the page-in path. These loops terminate when all of the post-copy pages have been sent and received, respectively, after which all that remains is teardown (the paused image of the guest at the sender is destroyed, etc.). The rest of this section presents a more detailed description of the control flow of a live migration with a post-copy phase, focusing on the changes to each corresponding subsection of 2.1. 2.2.1 `xl migrate` <-> `xl migrate-receive`, Part One As before, live migration is initiated with the `xl migrate` command at the sending host. A new '--postcopy' option is added to the command, which is used to compute the value of a new 'memory_strategy' parameter to libxl_domain_suspend() (or rather, to libxl_domain_live_migrate(), a new libxl API entrypoint like libxl_domain_suspend() but with additional parameters that are only meaningful in the context of live migration). Two values for this parameter are possible: - STOP_AND_COPY specifies that upon termination of the pre-copy loop the migration should be terminated with a stop-and-copy migration of the final set of dirty pages - POSTCOPY specifies that upon termination of the pre-copy loop the migration should transition to the post-copy phase An additional boolean out-parameter, 'postcopy_transitioned', is also passed to libxl_domain_live_migrate(). This bit is set within libxl_domain_send() at the end of the post-copy transition (from the sender's point of view, this is after the libxl POSTCOPY_TRANSITION_END is sent), and is used by the caller to decide whether or not it's safe to attempt to resume the paused guest locally in the event of failure. A similar boolean out-parameter, 'postcopy_resumed', is now passed to libxl_domain_create_restore(). It is set during the post-copy phase when the domain is (or isn't) successfully unpaused at the end of the domain-building/resumption process, and is used by the caller to determine whether or not the unpause handshake should occur. 2.2.2 libxl_domain_live_migrate() <-> libxl_domain_create_restore(), Part One This stage is mostly unchanged on both sides. 
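To make the shape of the new API described in 2.2.1 concrete, here is a rough sketch of how the live-migrate entry point might be declared, modelled on the existing libxl_domain_suspend() signature and the parameters named above. The enum, the recv_fd back-channel parameter and the exact ordering are assumptions for illustration; the real prototype in the series may differ.

    #include <stdbool.h>
    #include <stdint.h>
    #include <libxl.h>   /* libxl_ctx, libxl_asyncop_how */

    /* Sketch only: assumed enum naming, not necessarily the series' own. */
    typedef enum {
        LIBXL_MEMORY_STRATEGY_STOP_AND_COPY,  /* finish pre-copy with stop-and-copy   */
        LIBXL_MEMORY_STRATEGY_POSTCOPY,       /* finish pre-copy with a post-copy phase */
    } libxl_memory_strategy;

    /* Sketch only: a libxl_domain_suspend()-like entry point with the extra
     * live-migration-only parameters described in the text. */
    int libxl_domain_live_migrate(libxl_ctx *ctx, uint32_t domid,
                                  int send_fd, int recv_fd,
                                  libxl_memory_strategy memory_strategy,
                                  bool *postcopy_transitioned, /* out: local resume safe? */
                                  const libxl_asyncop_how *ao_how);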
The new memory_strategy parameter of libxl_domain_live_migrate() is stashed in the libxl async request structure for later use by a new 'precopy policy' RPC callback for xc_domain_save(), described in section 2.2.4.

2.2.3 libxl__stream_write <-> libxl__stream_read, Part One

This stage is entirely unchanged on both sides.

2.2.4 xc_domain_save() <-> xc_domain_restore(), Part One

2.2.4.1 Pre-copy policy

The first major change to xc_domain_save() is the generalization of the 'pre-copy policy', i.e. the algorithm used to decide how long the pre-copy phase of the migration should continue before transitioning forward. As described earlier, the historical policy has been to continue until either the ITERATION_THRESHOLD is exceeded or fewer than DIRTY_THRESHOLD pages remain at the end of a round, at which point an unconditional transition to stop-and-copy has occurred.

The generalization of this policy is introduced early in the proposed patch series. It factors the decision-making logic out of the mechanism of the migration loop and into a new save_callbacks function with the following prototype:

    struct precopy_stats {
        unsigned int iteration;
        unsigned int total_written;
        int dirty_count; /* -1 if unknown */
    };

    /* Policy decision return codes. */
    #define XGS_POLICY_ABORT            (-1)
    #define XGS_POLICY_CONTINUE_PRECOPY   0
    #define XGS_POLICY_STOP_AND_COPY      1
    #define XGS_POLICY_POSTCOPY           2

    int precopy_policy(struct precopy_stats stats, void *data);

This new hook is invoked after each _batch_ of pre-copy pages is transmitted, a much finer granularity than the previous policy, which was evaluated only at iteration boundaries. This introduces a bit of extra complexity to the problem of computing the 'final' set of dirty pages: where previously it was sufficient to execute one final XEN_DOMCTL_SHADOW_OP_CLEAN after pausing the guest, now the true set of final dirty pages is the union of the results of the final CLEAN and the subset of the last CLEAN result set remaining in the interrupted final pre-copy iteration. To solve this problem, pages are cleared from the dirty_bitmap as they are added to the current migration batch, meaning that the dirty_bitmap at the point of interruption is exactly the subset not yet migrated during the previous iteration. These bits are temporarily transferred to the deferred_pages bitmap while the final CLEAN is executed, and then merged back into dirty_bitmap.

In making this change, my motivation was to permit two new sorts of policies:

1) 'Continue until some budget of network bandwidth/wallclock time is exceeded, then transition to post-copy', which seemed like it would be useful to administrators wishing to take advantage of post-copy to set a hard bound on the amount of time or bandwidth allowed for a migration while still offering the best-effort liveness of post-copy.

2) 'Continue until the live human operator decides to initiate post-copy', which was explicitly to match the equivalent QEMU post-copy feature [i].

In retrospect, I should probably have focused more on making the post-copy mechanism work under the existing policy and left the broader issue for later discussion. It's not really required, and the patches in their current state implement exactly the previous policy in this hook, simply returning POSTCOPY rather than STOP_AND_COPY at the old termination point based on the libxl_domain_live_migrate() memory_strategy (which itself simply reflects whether or not the user specified '--postcopy' to `xl migrate`).
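As an illustration of how a policy plugs into this hook, here is a minimal sketch that reproduces the historical thresholds and then returns a configured terminal decision. The function name and the way the memory strategy reaches the callback through 'data' are illustrative, not the series' actual plumbing.

    /* Sketch only: the historical DIRTY_THRESHOLD/ITERATION_THRESHOLD policy
     * expressed against the new hook.  'data' is assumed to point at the
     * desired terminal decision (XGS_POLICY_STOP_AND_COPY or
     * XGS_POLICY_POSTCOPY). */
    #define DIRTY_THRESHOLD     50
    #define ITERATION_THRESHOLD  5

    static int threshold_precopy_policy(struct precopy_stats stats, void *data)
    {
        int terminal_decision = *(int *)data;

        /* Few enough dirty pages remain: stop pre-copying now. */
        if ( stats.dirty_count >= 0 && stats.dirty_count < DIRTY_THRESHOLD )
            return terminal_decision;

        /* Too many iterations without convergence: give up on pre-copy. */
        if ( stats.iteration >= ITERATION_THRESHOLD )
            return terminal_decision;

        return XGS_POLICY_CONTINUE_PRECOPY;
    }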
2.2.4.2 Post-copy transition For the sender, the transition from the pre-copy to the post-copy phase begins by: 1) Suspending the guest and collecting the final dirty bitmap, just as for stop-and-copy. 2) Transmitting to the receiver a new POSTCOPY_BEGIN record, to prime them for subsequent records whose handling differs between stop-and-copy and post-copy. 3) Transmitting the set of 'end-of-checkpoint' records (e.g. TSC_INFO, HVM_CONTEXT and HVM_PARAMS in the case of HVM domains) 4) Transmitting a POSTCOPY_PFNS_BEGIN record, followed by a sequence of POSTCOPY_PFNS records enumerating the set of pfns to post-copy migrated. Post-copy PFNs are transmitted in POSTCOPY_PFNS records, which are like PAGE_DATA records but without the trailing actual page contents. Each batch can hold up to 512k 64-bit pfns while staying within the stream protocol's 4mb record size cap. At this point, the only state needed to resume the guest not yet available at the receiver is the higher-level libxl context. Control of the stream must therefore be handed back to the libxl stream writer. This is done by: 5) Writing a new POSTCOPY_TRANSITION record, to co-ordinate a symmetric hand-off at the receiver. 6) Executing the synchronous postcopy_transition RPC, to which the libxl parent will reply when the libxl stream is finished. At the receiver (numbered to match corresponding sender steps): 2) When the POSTCOPY_BEGIN record arrives, only the restore context 'postcopy' bit is set. 3) The end-of-checkpoint records arrive next and are handled as in stop-and-copy, with one exception: when the HVM_PARAMS record arrives, the magic page parameters (HVM_PARAM_*_PFN) are explicitly populated, as they may not yet have been. This is to ensure that the magic pages that must be cleared can be, and in the case of the PAGING_RING so that the immediately following pager setup succeeds. 4) When the POSTCOPY_PFNS_BEGIN record arrives, the receiving helper enables paging on the migrating guest by establishing itself as its pager. As the subsequent POSTCOPY_PFNS records arrive, each of the pages in the post-copy set are marked as 'paged out' (the paging component of the change is described in greater detail in section 2.2.6). 6) When the POSTCOPY_TRANSITION record arrives, the synchronous receive-side postcopy_transition RPC is executed, transferring control of the stream back to the receiving libxl parent. 2.2.5 libxl__stream_write <-> libxl__stream_read, Part Two The postcopy_transition() RPC from the libxc save helper is plumbed to libxl__stream_write_start_postcopy_transition(), which records in the stream writer context that it's executing a new SWS_PHASE_POSTCOPY_TRANSITION callback chain and then kicks off exactly the same chain as before, starting with the emulator XenStore record. At the end of the chain, a POSTCOPY_TRANSITION_END record is written, indicating to the receiver that control of the migration stream is to be transferred back to libxc in the helpers for the duration of the post-copy memory migration phase. This transfer is then carried out by signalling the completion of the postcopy_transition() RPC to the libxc save helper. The postcopy_transition RPC from the libxc _receiver_ helper is plumbed to libxl__stream_read_start_postcopy_transition(), which is symmetric in spirit and implementation to its companion at the sender. At the end of the libxl post-copy transition, two concurrent stages of the migration begin: the libxc post-copy memory migration, and the receiver libxl domain resumption procedure. 
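Referring back to step 4 of the transition: since POSTCOPY_PFNS records are described as PAGE_DATA records minus the trailing page contents, their body can be pictured roughly as below. This is a sketch inferred from that description and the existing PAGE_DATA header layout in the migration stream specification, not a copy of the series' actual definition.

    #include <stdint.h>

    /* Assumed body layout of a POSTCOPY_PFNS record: the PAGE_DATA header
     * with no page contents following the pfn array. */
    struct rec_postcopy_pfns {
        uint32_t count;     /* number of pfns in this record */
        uint32_t _res1;     /* padding/reserved */
        uint64_t pfn[];     /* 'count' 64-bit pfns; at 8 bytes each, the 4MB
                             * record-size cap yields the ~512k-pfn batch
                             * limit mentioned above */
    };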
2.2.6 xc_domain_save() <-> xc_domain_restore(), Part Two: memory post-copy

This stage implements the key functionality of the post-copy memory migration: it permits the building, resumption and execution of the migrating guest at the receiver before all of its memory is migrated. To achieve this, the receiver must intercept _all_ accesses to the unmigrated pages of the guest as they occur and fetch their contents from the sender before allowing them to proceed. Fortunately, the fundamental supporting mechanism - guest paging - already exists! It's documented fairly lightly [k] and the description given there doesn't do much to inspire confidence in its stability, but it is completely sufficient in its current state for the purpose of intercepting accesses to unmigrated pages during the post-copy phase.

2.2.6.1 Background: xenpaging

Paging for a given guest is managed by a 'pager', a process in a privileged foreign domain that a) identifies and evicts pages to be paged out and b) services faults for evicted pages. To facilitate this, the hypervisor provides a) a family of paging operations under the `memory_op` hypercall and b) an event ring, into which it _produces_ events when paged pages are accessed (pausing the accessing vCPU at the same time) and from which it _consumes_ pager responses indicating that accessed pages were loaded (and correspondingly unpausing the accessing vCPU).

Evicting a page requires the pager to perform two operations:

1) When a page is first selected for eviction by the pager's policy it is marked with the `nominate` operation, which sets up the page to trap upon writes to detect modifications during page-out. The pager then maps and writes the page's contents to its backing store.

2) After the page's contents are saved, the pager tries to complete the process with the `evict` operation. If the page was not modified since its nomination the eviction succeeds and its memory can be freed. If it was, however, the eviction has failed.

Re-installing a paged page is performed in a single `prep` operation, which atomically allocates and copies in the content of the paged page.

The protocol for the paging event ring consists of a request and response:

1) The hypervisor emits a request into the event ring when it intercepts an access to a paged page. There are two classes of accesses that can occur, though they are indistinguishable from the point of view of the ring protocol:
   a) accesses from within the guest, which result in the accessing vCPU being paused
   b) mappings of paged pages by foreign domains, which are made to fail (with the expectation that the mapper retry after some delay)
   In either case, the request made to the ring communicates the faulting pfn. In the former case, it also communicates the faulting vCPU.

2) The pager consumes these requests, obtains the contents of the faulting pfns by its own unique means, and after performing the `prep` operation to install them, emits back into the ring a response containing exactly the information in the original request.

2.2.6.2 Post-copy paging

Post-copy paging setup occurs when the POSTCOPY_PFNS_BEGIN record arrives. The libxc restore helper begins by registering itself as the new guest's pager, enabling paging and setting up the event ring. As subsequent POSTCOPY_PFNS records arrive, the pfns they contain must all be marked as paged out at the hypervisor level. Doing so naively can be prohibitively costly when the number of post-copy pages is large; this problem, and its solution, are described in 2.2.6.3.
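Before describing the steady state, here is a minimal sketch of the conventional two-step eviction protocol from 2.2.6.1, as a pager such as xenpaging would drive it for a single gfn. The libxc calls named here (xc_mem_paging_nominate/evict, xc_map_foreign_range) exist, but their exact signatures should be checked against xenctrl.h; 'backing_fd' is a hypothetical file descriptor for the pager's backing store.

    #include <unistd.h>
    #include <sys/mman.h>
    #include <xenctrl.h>

    static int evict_one_page(xc_interface *xch, uint32_t domid, uint64_t gfn,
                              int backing_fd)
    {
        void *page;

        /* 1) Nominate: arm write-detection so that modifications made while
         *    the page is being saved can be caught. */
        if ( xc_mem_paging_nominate(xch, domid, gfn) )
            return -1;

        /* Save the page contents to the pager's backing store. */
        page = xc_map_foreign_range(xch, domid, XC_PAGE_SIZE, PROT_READ, gfn);
        if ( !page )
            return -1;
        if ( pwrite(backing_fd, page, XC_PAGE_SIZE,
                    (off_t)gfn * XC_PAGE_SIZE) != XC_PAGE_SIZE )
        {
            munmap(page, XC_PAGE_SIZE);
            return -1;
        }
        munmap(page, XC_PAGE_SIZE);

        /* 2) Evict: succeeds only if the guest didn't touch the page since
         *    nomination; on success the backing memory is freed. */
        return xc_mem_paging_evict(xch, domid, gfn);
    }

Section 2.2.6.3 explains why the post-copy transition cannot afford to walk this per-page sequence for every outstanding pfn.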
After control of the stream is returned at the end of the post-copy transition, the steady-state of the post-copy phase begins. Crucially, this occurs even while the libxl 'building' of the guest proceeds. This is important because domain building can and does access guest memory - in particular, QEMU maps guest memory. For the duration of the post-copy phase, the receiver maintains a simple state-machine for each post-copy pfn, described in this source-code comment at the declaration of its storage: /* * Prior to the receipt of the first POSTCOPY_PFNS record, all * pfns are 'invalid', meaning that we don't (yet) believe that * they need to be migrated as part of the postcopy phase. * * Pfns received in POSTCOPY_PFNS records become 'outstanding', * meaning that they must be migrated but haven't yet been * requested, received or dropped. * * A pfn transitions from outstanding to requested when we * receive a request for it on the paging ring and request it * from the sender, before having received it. There is at * least one valid entry in pending_requests for each requested * pfn. * * A pfn transitions from either outstanding or requested to * ready when its contents are received. Responses to all * previous pager requests for this pfn are pushed at this time, * and subsequent pager requests for this pfn can be responded * to immediately. * * A pfn transitions from outstanding to dropped if we're * notified on the ring of the drop. We track this explicitly * so that we don't panic upon subsequently receiving the * contents of this page from the sender. * * In summary, the per-pfn postcopy state machine is: * * invalid -> outstanding -> requested -> ready * | ^ * +------------------------+ * | * +------ -> dropped * * The state of each pfn is tracked using these four bitmaps. */ unsigned long *outstanding_pfns; unsigned long *requested_pfns; unsigned long *ready_pfns; unsigned long *dropped_pfns; A given pfn's state is defined by the set it's in (set memberships are mutually exclusive). The receiver's post-copy loop can be expressed in pseudo-code as: outstanding_pfns = { final dirty pages } requested_pfns = { } ready_pfns = { } while (outstanding_pfns is not empty) { /* * Wait for a notification on the paging ring event channel, or for * data to arrive on the migration stream. */ wait_for_events(); /* * Consume any new faults generated by the guest and forward them to * the sender so that their transmission is prioritized. */ faults = {} while (!empty(paging_ring)) { fault = take(paging_ring) if (fault in ready_pfns) { /* * It's possible that the faulting page may have arrived and * been loaded after the fault occurred but before we got * around to consuming the event from the ring. In this case, * reply immediately. */ notify(paging_ring) } else { faults += fault } } outstanding_pfns -= faults requested_pfns += faults send(faults) /* * Consume incoming page data records by installing their contents into * the guest. If a guest vCPU is paused waiting for the arrival of a * given page, unpause it now that it's safe to continue. */ while (record = read_record(migration_stream)) { paging_load(record.pfn, record.data) if (record.pfn in requested_pfns) { notify(paging_ring) } requested_pfns -= record.pfn ready_pfns += record.pfn } } The sender's companion post-copy loop is simpler: remaining_pages = { final dirty pages } while (remaining_pages) { transmission_batch = {} /* Service new faults. 
*/ faults = recv_faults() if (faults) { transmission_batch += faults } /* Fill out the rest of the batch with background pages. */ remainder = take(remaining_pages, BATCH_SIZE - count(faults)) transmission_batch += remainder remaining_pages -= remainder send(transmission_batch) } One interesting problem is in deciding which not-yet-requested pages should be pushed in the next background batch. Ideally, they should be sent in the order they'll be accessed, to minimize the faults at the receiver. In practice, the general problem of predicting the guest's page access stream is _extremely_ difficult - this is the well-known pre-paging problem, which has been explored by decades of academic and industrial research. The current version of the patch series exploits spatial locality in the physical page access stream by starting at the next unsent pfn after the last faulting pfn and proceeding forward. The sender terminates the libxc stream with a POSTCOPY_COMPLETE record so that the receiver can flush (i.e. consume) all in-flight POSTCOPY_PAGE_DATA records before control of the stream is handed back to libxl on both sides. 2.2.6.3 Batch page-out operations When a batch of POSTCOPY_PFNS arrives during the post-copy transition, all of the pfns in the batch must be marked paged-out as quickly as possible to minimize the downtime required. Doing so using the existing paging primitives requires: 1) 'Populating' (i.e. allocating a backing physical page for) any pages in the batch that aren't already, because only populated pages can be 'paged out'. In a migration that transitions to post-copy before the end of the first iteration, any pages not sent during the partial first round will be unpopulated during the post-copy transition. In the special case of an instant post-copy migration, this will be _all_ of the guest's pages. 2) Performing the `nominate` and `evict` operations individually on each page in turn, because only nominated pages can be evicted. There are a few obvious inefficiencies here: - Populating the pages is unnecessary. - The `nominate` operation is entirely unnecessary when the page's contents are already available from the pager's backing store and can't be invalidated by modification by the guest or other foreign domains. - The `evict` operation acts on a single page at a time even when many pages are known to need eviction up front. Together, these inefficiencies make the combined operation of 'evicting' many pfns at a time during the critical post-copy transition downtime phase quite costly. Quantitatively, in my experimental set-up (described in detail in section 3.2.1) I measured the time required to evict a batch of 512k pfns at 8.535s, which is _enormous_ for outright downtime. To solve this problem, the last patches in the series introduce a new memory paging op designed to address specifically this situation, called `populate_evicted`. This operation takes a _batch_ of pfns and, for each one: - de-populates it if populated - transitions it directly to the paged-out state, skipping the nomination step With a further patch rewriting the POSTCOPY_PFNS handler to use this new primitive, I measured a 512k-pfn eviction time of 1.590s, a 5.4x improvement. 2.2.7 libxl_domain_live_migrate() <-> libxl_domain_create_restore(), Part Two The sender side of this stage is unchanged: libxl_domain_live_migrate() returns almost immediately back up to `xl migrate`. 
The receiver side becomes more complicated, however, as it now has to manage the completion of two asynchronous operations: 1) its libxc helper, which terminates when all of the outstanding post-copy pages have arrived and been installed 2) the domain-building/resumption process detailed in section 2.1.6 (the functional sequence is unchanged - the fact that some guest memory remains unmigrated is made completely transparent by the libxc pager) These operations (referred to in code as the 'stream' and 'resume' operations, respectively) can complete in either order, and either can fail. A pair of state machines encoding the progress of each are therefore introduced to the libxl domain create state context structure, with three possible states each: INPROGRESS --+--> FAILED (rc) | +--> SUCCESS (rc == 0) In a healthy migration, the first operation to make the INPROGRESS -> SUCCESS transition simply records its final state as such and waits for the completion of the second. The second operation to complete then finds the other already complete, and calls the overall completion callback to report success. For example, in the case of a long memory post-copy phase, the resume operation is expected to complete first. When it does, it finds that the stream operation is still running, so it simply transitions to SUCCESS. When the post-copy migration is finished and the libxc helper terminates, the new domcreate_postcopy_stream_done() callback finds the resume successfully completed and reports the completion of the entire operation. The 'resume' operation is initiated from the domcreate_postcopy_transition_callback(), kicking off the same callback sequence as started by domcreate_stream_done() in the non-post-copy case. All termination points along this callback sequence are hooked by the new domcreate_report_result(), which when given a successful result to report also unpauses the guest to begin true post-copy execution. If the resume fails and the stream isn't yet complete, we latch the error, actively abort the stream and then wait for the failure completion of the stream to complete the overall operation. The 'stream' operation's completion is signalled by domcreate_postcopy_stream_done(), which is wired up to the libxc helper's SIGCHLD in the way that domcreate_stream_done() was previously. If the stream fails, its error is simply stashed (and no other action taken) on the assumption that the resumption will eventually complete one way or another and find it. 2.2.8 `xl migrate` <-> `xl migrate-receive`, Part Two If everything has gone according to plan up to this point, the migration is effectively complete - the guest is now unpaused and executing at the receiver with all of its memory migrated. The cautious final unpause handshake is therefore no longer necessary, so the sender simply destroys its copy of the domain, the receiver strips the migration suffix from the name of its copy and the entire process is complete! The receiver does still send a completion message to the sender, however, simply to signal to an interactive user at the sender exactly when the operation has completed. If, however, something has gone awry, the penalty of post-copy's extra vulnerability to failure is paid. 
At the receiver, any failure reported by libxl_domain_create_restore() results in the prompt destruction of the local copy of the domain, even if it's already executing as part of the post-copy phase and some of the guest's state exists only in this copy, because no sane recovery mode exists with other parts of its state locally unavailable.

At the sender, the new postcopy_transitioned out-parameter of libxl_domain_live_migrate() is examined:

- if the transition record wasn't transmitted (postcopy_transitioned == false), there's no way that the guest could possibly have begun executing at the receiver, so it's safe to recover by unpausing the original copy of the domain
- if it was, however (postcopy_transitioned == true), it's _possible_ (though not certain) that the guest may have executed (or may even still _be_ executing) at the destination, so unpausing it locally isn't safe

In the latter case, the policy is essentially the same as in the existing 'failed_badly' scenario of normal pre-copy migration in which the sender fails to receive the migration success report after transmitting the `migrate_permission_to_go` message: the local copy of the domain is suffixed with --postcopy-inconsistent, and a diagnostic message is printed explaining the failure.

One major possible improvement to this scheme is especially worth noting: at the sender, the current postcopy_transitioned bit is a very conservative indication of whether it's safe to attempt local recovery. There are many ways in which the attempted resumption of the domain at the receiver could fail without rendering the communication stream between the sender and receiver inoperable (e.g. one of the domain's disks might be unavailable at the receiver), and in these scenarios the receiver could send a message explicitly indicating to the sender that it should attempt recovery.

3. Performance evaluation

3.1 Prior work and metrics

"Post-copy live migration of virtual machines" [u] describes an earlier implementation of post-copy live migration in Xen and evaluates its performance. Although the details of their implementation - written nearly a decade ago and requiring in-guest kernel support - are vastly different than what's proposed here, their metrics and approach to performance evaluation are still useful.

Section 3 of the paper enumerates the following performance metrics:

> 1. Preparation Time: This is the time between initiating migration and transferring the VM’s processor state to the target node, during which the VM continues to execute and dirty its memory. For pre-copy, this time includes the entire iterative memory copying phase, whereas it is negligible for post-copy.
> 2. Downtime: This is time during which the migrating VM’s execution is stopped. At the minimum this includes the transfer of processor state. For pre-copy, this transfer also includes any remaining dirty pages. For post-copy this includes other minimal execution state, if any, needed by the VM to start at the target.
> 3. Resume Time: This is the time between resuming the VM’s execution at the target and the end of migration altogether, at which point all dependencies on the source must be eliminated. For pre-copy, one needs only to re-schedule the target VM and destroy the source copy. On the other hand, majority of our postcopy approach operates in this period.
> 4. Pages Transferred: This is the total count of memory pages transferred, including duplicates, across all of the above time periods.
> Pre-copy transfers most of its pages during preparation time, whereas post-copy transfers most during resume time.
> 5. Total Migration Time: This is the sum of all the above times from start to finish. Total time is important because it affects the release of resources on both participating nodes as well as within the VMs on both nodes. Until the completion of migration, we cannot free the source VM’s memory.
> 6. Application Degradation: This is the extent to which migration slows down the applications running in the VM. Pre-copy must track dirtied pages by trapping write accesses to each page, which significantly slows down write-intensive workloads. Similarly, postcopy needs to service network faults generated at the target, which also slows down VM workloads.

The performance of a memory migration algorithm with respect to these metrics will vary significantly with the workload running in the guest. More specifically, it will vary according to the behaviour of the memory access stream - the pace, read/write mix, and locality of accesses (within pages and between pages). See "Downtime Analysis of Virtual Machine Live Migration" [v] for a quantitative investigation of the effect of these workload parameters on pre-copy migration in Xen and other hypervisors.

The relative importance of these metrics obviously varies with deployment context, but in my opinion the most common ordering is likely:

- Downtime
- Application Degradation
- Preparation Time
- Total Migration Time
- Resume Time
- Pages Transferred

Stop-and-copy and pure post-copy schemes, which transmit each guest page exactly once, will obviously outperform pre-copy at Preparation Time, Pages Transferred and Total Migration Time, but because of this practical preference ordering it's post-copy's potential to reduce Downtime by trading it for Application Degradation that makes it the most interesting. Because write-heavy workloads with large writable working sets experience the greatest downtime under pre-copy, I decided to investigate them first.

3.2 Experiment: pgbench

When selecting a particular application workload to represent the class of pre-copy-resistant workloads with large writable working sets, I looked for a few other desirable properties:

1) It should involve some amount of I/O, which could help the guest make progress even during synchronous page faults (as the I/O could proceed in parallel to the servicing of a subsequent fault).

2) It should be possible to sample instantaneous application performance for the Application Degradation metric, and to perform such sampling reasonably frequently over the course of the migration.

3) It should be reasonably representative of an interesting real-world application, to avoid being confounded by differences in behaviour between purely-synthetic workloads and the ones we're actually interested in. For example, the 'dirty page generators' commonly found in the live migration literature aren't very useful for evaluating any mechanism that attempts pre-paging based on the memory access stream, because their memory access streams are generally nothing like real applications (often being either perfectly sequential or perfectly random).

With these properties and the 'large writable working set' criterion in mind, I eventually decided upon the pgbench [x] benchmark for PostgreSQL:

> pgbench is a simple program for running benchmark tests on PostgreSQL.
> It runs the same sequence of SQL commands over and over, possibly in multiple concurrent database sessions, and then calculates the average transaction rate (transactions per second). By default, pgbench tests a scenario that is loosely based on TPC-B, involving five SELECT, UPDATE, and INSERT commands per transaction.

<snip>

> The default built-in transaction script (also invoked with -b tpcb-like) issues seven commands per transaction over randomly chosen aid, tid, bid and balance. The scenario is inspired by the TPC-B benchmark, but is not actually TPC-B, hence the name.
> 1. BEGIN;
> 2. UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
> 3. SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
> 4. UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
> 5. UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
> 6. INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
> 7. END;

I evaluated the performance of five live migration algorithm variants:

A) traditional five-iteration pre-copy (the status quo today)
B) single-iteration pre-copy followed by stop-and-copy
C) direct post-copy
D) single-iteration pre-copy followed by post-copy (often called 'hybrid' migration)
E) five-iteration pre-copy followed by post-copy

3.2.1 Experiment design

The physical test-bed was composed of:

- Two Intel NUC5CPYH [z] mini PCs, each with 8GB of RAM and a 120GB SSD (this was the cheapest Intel hardware with EPT support I could easily obtain two identical units of)
- a Cisco Meraki MS220-8P [l] gigabit ethernet switch
- my personal laptop computer

One NUC PC was chosen to be the sender (S), and the other the receiver (R). The test-bed configuration was:

    S - Switch - R
          |
        Laptop

I.e. each host had a gigabit link to all the others.

See [m] and [n] for the full output of `xl info` on S and R; the subsets that seem relevant to me are:

S:
    release                : 3.16.0-4-amd64
    version                : #1 SMP Debian 3.16.36-1+deb8u2 (2016-10-19)
    machine                : x86_64
    nr_cpus                : 2
    max_cpu_id             : 1
    nr_nodes               : 1
    cores_per_socket       : 2
    threads_per_core       : 1
    cpu_mhz                : 1599
    virt_caps              : hvm
    total_memory           : 8112
    xen_version            : 4.9-rc
    xen_scheduler          : credit
    xen_pagesize           : 4096
    xen_changeset          : Fri May 12 23:17:29 2017 -0400 git:c6ed26e
    xen_commandline        : placeholder altp2m=1
    cc_compiler            : gcc (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609
    cc_compile_by          : jtotto
    cc_compile_date        : Sat May 27 18:29:17 EDT 2017

R:
    release                : 3.16.0-4-amd64
    version                : #1 SMP Debian 3.16.39-1+deb8u2 (2017-03-07)
    machine                : x86_64
    nr_cpus                : 2
    max_cpu_id             : 1
    nr_nodes               : 1
    cores_per_socket       : 2
    threads_per_core       : 1
    cpu_mhz                : 1599
    virt_caps              : hvm
    total_memory           : 8112
    xen_version            : 4.9-rc
    xen_scheduler          : credit
    xen_pagesize           : 4096
    xen_changeset          : Fri May 12 23:17:29 2017 -0400 git:c6ed26e
    xen_commandline        : placeholder no-real-mode edd=off
    cc_compiler            : gcc (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609
    cc_compile_by          : jtotto
    cc_compile_date        : Sat May 27 18:29:17 EDT 2017

Particularly noteworthy as a potential experimental confound is the relatively old dom0 kernel - I couldn't get anything newer to boot on the NUC hardware, unfortunately. I'm aware that the privcmd driver has changed since then, and experimented with back-porting a more recent version of the kernel module to the base kernel I was experimenting with when evaluating the performance of the batch page-out operation, but it didn't appear to make any difference.
To evaluate each algorithm variant, I migrated a guest running a PostgreSQL server from S to R while running the pgbench client against it from my laptop. My laptop was also configured as an NFS server and hosted the guest's storage.

The xl.cfg(5) configuration of the test domain was:

    builder='hvm'
    vcpus=4
    memory=2048
    shadow_memory=512
    name='debvm'
    disk=['file:/mnt/mig/debvm-1.img,xvda,w']
    boot="c"
    vif=['bridge=xenbr0']
    sdl=0
    stdvga=1
    serial='pty'
    usbdevice='tablet'
    on_poweroff='destroy'
    on_reboot='restart'
    on_crash='restart'
    vnc=1
    vnclisten=""
    vncpasswd=""
    vfb=['type=vnc']
    altp2m="external"

The exact experiment shell script can be found inline at [o]. The procedure, executed from my laptop, was basically:

    repeat 5 times:
        for each algorithm variant:
            `xl create` the test domain at S and wait for it to boot
            re-initialize the test database
            launch the pgbench client
            wait for 20 seconds to let the benchmark warm up
            initiate migration of the test domain to R
            shut down the test domain from within

To measure Preparation Time, Downtime, Resume Time and Total Migration Time, I added some simple timestamp printf()s at key points in the migration sequence:

Preparation Time (recorded at the sender)
    Start: Upon entry to save()
    End:   In suspend_domain() immediately _before_ the libxl suspension hook

Downtime (recorded at the receiver)
    For pre-copy migrations (variants A/B):
        Start: Since the end of the downtime period can only be recorded at the receiver, a way to record the beginning of the period was needed. A new dummy record type, PERF_STOP_AND_COPY, was added for this purpose, which is emitted by the sender immediately after the suspension of the domain. The receiver records the time at which this record is received as the beginning of the period.
        End:   In `xl migrate-receive` immediately after libxl_domain_unpause()
    For post-copy and hybrid migrations (variants C/D/E):
        Start: Upon receipt of the existing POSTCOPY_BEGIN record
        End:   In domcreate_report_result() immediately after libxl_domain_unpause()

Resume Time
    Start: Exactly where Downtime ends, after libxl_domain_unpause()
    End:   In postcopy_restore() after all pages have been loaded

All timestamps were recorded via `clock_gettime(CLOCK_MONOTONIC)`. The additional patches implementing this tracing can be found at [p].

To measure instantaneous Application Degradation, I ran pgbench in its `--log --aggregate-interval` mode with an interval of 1 second, thus each second sampling:

- the total number of transactions committed in that second
- the sum of the latencies of these transactions (with which the mean latency can be computed)
- the minimum and maximum latencies of these transactions

Of these, I think that 'transactions/second' is the easiest to interpret, and is what I'll use throughout my analysis.

3.2.2 Results

A plot of the raw phase duration measurements for each run:

Figure 1: https://github.com/jtotto/xen-postcopy-figures/blob/master/figure1.pdf

Given that the results from run to run for each algorithm variant were relatively stable, they can more easily be considered in aggregate via their arithmetic means:

Figure 2: https://github.com/jtotto/xen-postcopy-figures/blob/master/figure2.pdf

For fun, I re-rendered the above plot using gnuplot's 'dumb' terminal driver:
3.2.2 Results

A plot of the raw phase duration measurements for each run:

Figure 1: https://github.com/jtotto/xen-postcopy-figures/blob/master/figure1.pdf

Given that the results from run to run for each algorithm variant were
relatively stable, they can more easily be considered in aggregate via their
arithmetic means:

Figure 2: https://github.com/jtotto/xen-postcopy-figures/blob/master/figure2.pdf

For fun, I re-rendered the above plot using gnuplot's 'dumb' terminal driver:

    Migration algorithm variant vs. average phase and total durations
    [ASCII bar chart of the Figure 2 data: for each algorithm variant A-E,
     the average Preparing (*), Down (#) and Resuming ($) durations, with a
     y-axis running from 0 to 90 seconds.]

--
N.B. Before getting into comparisons between the algorithm variants, these
results reveal an interesting property of the test set-up common to all of
them: the network is _not_ the bottleneck in the migration page transfer.
iperf between S and R measured the bandwidth as roughly the gigabit limit
advertised by the switch and NICs, but the effective 'bandwidth' actually
observed during the pre-copy iterations can be computed as
(524357 pages / 63.47s) ~= 8261 pages/s, or ~271Mbps.  I didn't collect
timings of the sender's batch-mapping or the receiver's batch-installation
routines, but I suspect one of those two is the limiting factor in this
set-up.  Interpret all of these timing measurements accordingly.
--

3.2.2.1 Algorithms A vs. E: stop-and-copy vs. post-copy after iterative pre-copy

I'm going to focus first on the two multi-iteration pre-copy variants, A and
E (the current five-iteration pre-copy algorithm and five-iteration pre-copy
+ post-copy, respectively).  We can see that they:

a) Require the longest preparation time, as expected.
b) Actually still achieve the best downtime, despite my expectation that the
   workload would be write-heavy enough to cause problems.  A's downtime is
   only slightly worse than D's, and is almost twice as good as C's!  More on
   this shortly.

Most interestingly, we can see that E required 30% less downtime than A on
average, and always completed its post-copy phase _before_ the guest unpaused
(recall that the post-copy phase proceeds in parallel with libxl domain
creation and can complete before the guest is ready to unpause).  Why?

Focusing first on Algorithm A, we can plot the number of pages transmitted
during each pre-copy iteration:

Figure 3: https://github.com/jtotto/xen-postcopy-figures/blob/master/figure3.pdf

We can see that the migration made substantial progress toward convergence,
with strong successive decreases between iterations and an average final
dirty set of ~7.5k pages (30.1 MiB).
The downtime period during which these pages are transferred can be divided
into two interesting sub-phases: the phase during which the memory migration
and libxc stream are completed (ending with the termination of the libxc
helper), and the phase during which the higher-level libxl domain
establishment occurs (ending with the unpausing of the domain).  Plotting the
measurements of these phases:

Figure 4: https://github.com/jtotto/xen-postcopy-figures/blob/master/figure4.pdf

This shows that around 2/3 of the total downtime is incurred _after_ the
memory migration is complete.

We can make similar plots for Algorithm E.  Here is the number of pages
transmitted during each iteration:

Figure 5: https://github.com/jtotto/xen-postcopy-figures/blob/master/figure5.pdf

We can see that the E-runs transmitted slightly fewer pages on average during
the pre-copy iterations than the A-runs.  This is presumably experimental
noise - there's no reason to expect them to be different.  Significantly,
however, the E-runs actually needed to post-copy slightly _more_ pages on
average than the A-runs needed to stop-and-copy.  This means the 30% downtime
reduction isn't just experimental noise in favour of E, but evidence of an
algorithmic advantage.

I think the explanation for this advantage is clear: because the post-copy
phase can proceed in parallel with the libxl domain set-up procedure, the
downtime is no longer the _sum_ of the durations of these two processes - the
guest no longer has to wait for the memory transfer before unpausing, so the
downtime is bounded by the libxl set-up alone.  This effect can be seen
clearly in the corresponding plot of the libxc/libxl sub-phase breakdown:

Figure 6: https://github.com/jtotto/xen-postcopy-figures/blob/master/figure6.pdf

As further evidence in support of this explanation, on average only 9 faults
were incurred during each brief post-copy phase, indicating that the pages
required by QEMU etc. to proceed with domain creation mostly weren't in the
working set (and so had already been sent during the pre-copy iterations).

Another important question is: what impact do the two approaches have on the
Application Degradation metric?  This plot shows the number of benchmark
transactions committed each second over the course of each of the Algorithm A
migrations:

Figure 7: https://github.com/jtotto/xen-postcopy-figures/blob/master/figure7.pdf

The dotted line at the 20-second mark indicates roughly where the migration
was initiated.  We can see that:

- Throughout each phase of the migration, there are occasional severe
  single-second degradations in tps.  I'm not sure what caused these, but
  they appear to be distributed roughly evenly and occur during every phase,
  so I think they can safely be ignored for the rest of the discussion.
- During the 20-second warmup period at host S, the benchmark measures a
  relatively consistent ~325tps.
- Once the migration starts, benchmark performance quickly degrades to
  roughly 200tps.  This clearly indicates an interaction of some kind between
  the application and the migration process, though which resource they're
  contending for isn't clear.  CPU or network seem like the most likely
  candidates.
- At roughly the 80-second mark, performance 'degrades' to 0tps as the domain
  is suspended for the stop-and-copy phase of the migration.  This is where
  things get interesting: although the internal measurements from the
  previous set of plots indicate that the guest is only truly paused for
  around 2.5s, from a network observer's point of view the actual application
  is completely unavailable for around 9s.  Not great.
- When the application recovers and begins to make progress, it rebounds to
  only ~275tps rather than ~325tps, indicating some kind of asymmetry between
  S and R that I can't entirely account for.

How do these measurements look for Algorithm E?  Here's the plot:

Figure 8: https://github.com/jtotto/xen-postcopy-figures/blob/master/figure8.pdf

The behaviour appears to be much the same, with a slight improvement in
average application-visible downtime because of particularly good results in
rounds 2 and 3, with 4s and 5s application downtime measurements
respectively.

3.2.2.2 Algorithm C: pure post-copy

Turning our attention next to pure post-copy, we can make a few interesting
observations directly from the Figure 2 phase-timing measurements:

- At 4.9s, C's average Downtime is almost twice that of A!  For an algorithm
  intended to trade outright Downtime for Application Degradation, that's not
  great.
- At 69.2s, C's Total Migration Time is the lowest of any of the algorithms,
  ~17% lower than A's.
- C's post-copy phase takes roughly as long as a single pre-copy iteration.

What's ballooning C's downtime?  Figure 9 breaks down the sub-phases:

Figure 9: https://github.com/jtotto/xen-postcopy-figures/blob/master/figure9.pdf

There are three major contributors to the Downtime here:

1) It takes ~0.7s for the sender to gather and transmit the set of all
   post-copy pfns.  This makes some sense, as the sender needs to check the
   _type_ of every pfn in the guest's addressable range to see which ones are
   unpopulated and which must actually be migrated.
2) It takes ~1.7s to populate-and-evict all of these pages, even with the
   batching hypercall introduced at the end of the patch series.
3) It takes ~2.4s to complete all of the libxl-level domain set-up after the
   libxc stream helper is ready to enter the post-copy phase.  I think this
   is really interesting - recall that this step took only ~1.8s for
   Algorithm A.  The explanation is that the device model (i.e. QEMU) needs
   to map guest pages immediately during its set-up, and these mappings
   immediately cause post-copy faults!  Algorithm E didn't encounter this
   because these pages apparently aren't in the working set and so were
   covered by earlier pre-copy iterations.

What application degradation do we observe during the post-copy phase?

Figure 10: https://github.com/jtotto/xen-postcopy-figures/blob/master/figure10.pdf

Ouch!  The actual application-observed downtime is frankly horrific.  Round 3
has what appears to be the closest thing to degraded execution at around the
40-second mark, but only briefly.  In general, the application suffers around
_50_ seconds of observable downtime before recovering to a state of
reasonable (but still degraded) performance.  This recovery occurs at the ~75
second mark, where throughput climbs back to ~175tps.  The migration finishes
and the application recovers to ~275tps at roughly 90 seconds.

To get a clearer picture of why the migration behaves this way, we can plot
the faulting behaviour of the resuming guest during the post-copy phase.

Figure 11: https://github.com/jtotto/xen-postcopy-figures/blob/master/figure11.pdf
Figure 12: https://github.com/jtotto/xen-postcopy-figures/blob/master/figure12.pdf
Figure 13: https://github.com/jtotto/xen-postcopy-figures/blob/master/figure13.pdf

Looking first at the fault counts, we can see that the first ~5-10 seconds
are relatively quiet.
Then there's a sudden burst to ~100-200 faults/s for a few seconds, followed
by a decline to a stable rate of ~50 faults/s until the 50-second mark, where
they essentially stop.  Latencies are high during the initial period,
averaging 50-100ms per fault, and decline to 10-20ms per fault during the
later steady-state.

I'm not really sure how to account for the burst around the 10-second mark,
or the comparatively lower steady-state rate.  Because it occurs so long
after the first observed fault, I don't think it can be bulk device model
mappings - the guest is already unpaused at this point.  But if the vCPUs
were capable of generating faults this rapidly (i.e. if the post-copy stack
was capable of servicing faults quickly enough to _let_ them be generated
this quickly), why the subsequent decline to a lower rate for the rest of the
phase?

One possible explanation is that the guest actually isn't generating faults
at its maximum possible rate at steady-state.  Instead, it could be
alternating between faulting and making non-trivial progress.  If the
post-copy background push scheme consistently selected the wrong background
pages to push, the time required to commit the first application transaction
of the post-copy phase would then be the _sum_ of the time spent executing
this non-trivial work and the time required to synchronously fault each of
the non-predicted pages in sequence.  Since the guest transitions relatively
suddenly from 0tps to 175tps (65% of the full 275tps after recovery), I infer
that this set of poorly-predicted necessary pages is common between
transactions.  As a result, once it has been faulted over during the first
transaction or two, all subsequent transactions can proceed quickly.

This discussion raises several questions:

1) Does it make sense that the background pre-paging scheme made poor
   predictions in the context of the application's actual memory access
   stream?
2) Would alternative pre-paging schemes have made better predictions?
3) How much would those better predictions improve application performance?

To answer 1), I conducted a further experiment: I prepared a more invasive
additional tracing patch that disabled the background push logic for the
first 90 seconds of the post-copy phase and logged every individual faulting
pfn, to visualize the guest memory access stream over time.  I obtained these
traces:

Figure 14 (i-v):
https://github.com/jtotto/xen-postcopy-figures/blob/master/figure14-1.pdf
https://github.com/jtotto/xen-postcopy-figures/blob/master/figure14-2.pdf
https://github.com/jtotto/xen-postcopy-figures/blob/master/figure14-3.pdf
https://github.com/jtotto/xen-postcopy-figures/blob/master/figure14-4.pdf
https://github.com/jtotto/xen-postcopy-figures/blob/master/figure14-5.pdf

These traces permit a number of observations:

a) Over the first ~5 seconds we can very clearly see a large number of
   physically-clustered high PFNs faulting in rapid succession.  I believe
   this is the device model establishment.
b) For the next ~10 seconds, we can see lots of faults in three physical PFN
   regions: very low memory, 150000-250000, and 475000-525000, with the last
   of these seeing the most faults.
c) At the ~15 second mark the pattern shifts: the 475000-525000 region
   continues to fill in, and a long, descending physical page scan begins,
   either starting from 500000 or starting from 175000 and 'wrapping around'
   at the bottom.

This reveals that there _is_ reasonable physical locality in the access
stream available for exploitation.
In particular, I speculate that the long descending scan corresponds to a
database table scan.  However, the scheme implemented in the current version
of the patch - simply scanning _forward_ from the last physical pfn to fault
- is perhaps not clever enough to fully take advantage of it.  In "Post-Copy
Live Migration of Virtual Machines" [u], a number of more clever 'bubbling'
pre-paging strategies are discussed that I imagine would have done better.
The approach described in "A Novel Hybrid-Copy Algorithm for Live Migration
of Virtual Machines" [q] also seems like it might have worked well (though
it's not as easy to eyeball).  In principle, I think this answers question 2)
in the affirmative.

Although I didn't have time to implement and experimentally evaluate these
alternatives, I think it's fairly safe to say that even with perfect
prediction the application-level downtime entailed by this approach would
still be worse than for Algorithm A, since such a large set of pages appears
to be necessary to permit the application to make even one increment of
externally-visible progress.

--
Aside: having collected all of the same timing data in this further
experiment as I did in the first set, I decided to take a look at the fault
stats and application performance plots, and was able to make some
interesting observations:

Figure 15: https://github.com/jtotto/xen-postcopy-figures/blob/master/figure15.pdf
Figure 16: https://github.com/jtotto/xen-postcopy-figures/blob/master/figure16.pdf
Figure 17: https://github.com/jtotto/xen-postcopy-figures/blob/master/figure17.pdf

It may not be obvious unless you line them up next to the corresponding plots
from the first experiment, but:

- all of the per-fault latencies are reduced by an order of magnitude, with
  the mean falling from ~10ms to ~3ms
- the fault service rate is increased massively, by a factor of 4 to as much
  as 24

This makes some sense: in disabling all background pushing, I also disabled
the logic that 'filled out' the remainder of a batch servicing a fault with
other pages in its locality, so I'd expect each individual fault request to
be serviced more quickly, and consequently that the guest would be able to
generate subsequent ones faster.  However, neither of these translates into
better application performance:

Figure 18: https://github.com/jtotto/xen-postcopy-figures/blob/master/figure18.pdf

So, the batching and weak prediction logic implemented in the patch in its
current state are clearly worth _something_.
--

3.2.2.3 Algorithm B vs. D: post-copy after a single pre-copy iteration

The final post-copy variant, Algorithm D, appears to have performed
reasonably according to its phase timings in Figure 2:

- At 2.45s, its raw Downtime is slightly less than that of A.
- At 63.47s, its Preparation Time is ~20% less than A's.
- With 10.26s of Resume Time, its Total Migration Time is ~6% less than A's.

Algorithm B, its pre-copy-only counterpart, fared less well, with the same
Preparation Time as D but the worst outright Downtime of any variant at 11s.
Judging by Figures 3 and 5, they both ended their single pre-copy iteration
with ~90k pages (~351 MiB) dirty.
The real question, of course, is how their Application Degradation results
compare to those of A and E:

Figure 19: https://github.com/jtotto/xen-postcopy-figures/blob/master/figure19.pdf
Figure 20: https://github.com/jtotto/xen-postcopy-figures/blob/master/figure20.pdf

These results show that Algorithm D's Application Degradation is only
moderately worse than A's, while B's is almost twice as bad.  Algorithm D
therefore seems like a potentially useful alternative to A in situations
where it makes sense to trade a moderate increase in Application Degradation
for a more significant decrease in Preparation Time.

3.3 Further Experiments

I only had time to conduct the experiment described in the previous section,
but there are a number of further experiments that I think would be worth
conducting given more time.

Collecting the same data against workloads other than pgbench would be my
first priority.  Although I tried to choose a workload that was as
write-heavy as possible while still being realistic, the results clearly
demonstrated that it was fairly amenable to pre-copy anyway.  If a
non-synthetic workload with an even heavier write mix were evaluated,
post-copy might enjoy a clearer advantage.  Identifying a workload with a
more granular increment of progress might also better demonstrate the ability
of post-copy to trade outright Downtime for Application Degradation.  As the
experiment showed, in the case of pgbench even a single transaction required
a large subset of all guest pages to complete.

Moving beyond evaluating the patch in its current state, there are many
possible post-copy pre-paging schemes in the literature that it could be
augmented to implement, and it would be interesting to evaluate each of them
in the same way.

For all of the above experiments, as well as the pgbench one, it would also
be _very_ interesting to conduct them on more production-realistic hardware,
to see how shifting the bottleneck from the CPUs of the migrating hosts to
the network would affect the results.

4. Conclusion

In this document, I've described the design and implementation of a proposed
change to introduce post-copy memory migration support for Xen.  I then
presented and interpreted the results of performance measurements I collected
during experiments testing the change.  In my opinion, the data so far
suggest that:

1) Pure post-copy is probably only useful in situations where you would be
   okay with performing an outright stop-and-copy, but would like to try to
   do a little better if possible.
2) Hybrid post-copy does seem to perform marginally better than pre-copy
   alone, but this comes at the cost of both additional code complexity and
   worse reliability characteristics.  The costs of the latter seem to me
   like they probably outweigh the former benefit.

I think it would be very interesting to further investigate how much downtime
is spent on device model establishment in production set-ups.  If it turns
out to be as significant as it was in my experiments, a very limited form of
post-copy that only permits the memory migration to proceed in parallel with
the device model/etc. set-up, without actually unpausing the guest until the
memory migration completes, could be worth investigating, as it could reduce
downtime without adversely affecting reliability.
5. References

[a] migration.pandoc
    https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/features/migration.pandoc
[b] domcreate_stream_done()
    https://github.com/xen-project/xen/blob/0a0dcdcd20e9711cbfb08db5b21af5299ee1eb8b/tools/libxl/libxl_create.c#L1127-L1204
[c] domcreate_rebuild_done()
    https://github.com/xen-project/xen/blob/0a0dcdcd20e9711cbfb08db5b21af5299ee1eb8b/tools/libxl/libxl_create.c#L1206-L1234
[d] domcreate_launch_dm()
    https://github.com/xen-project/xen/blob/0a0dcdcd20e9711cbfb08db5b21af5299ee1eb8b/tools/libxl/libxl_create.c#L1236-L1405
[e] domcreate_devmodel_started()
    https://github.com/xen-project/xen/blob/0a0dcdcd20e9711cbfb08db5b21af5299ee1eb8b/tools/libxl/libxl_create.c#L1489-L1519
[f] domcreate_attach_devices()
    https://github.com/xen-project/xen/blob/0a0dcdcd20e9711cbfb08db5b21af5299ee1eb8b/tools/libxl/libxl_create.c#L1446-L1487
[g] domcreate_complete()
    https://github.com/xen-project/xen/blob/0a0dcdcd20e9711cbfb08db5b21af5299ee1eb8b/tools/libxl/libxl_create.c#L1521-L1569
[h] Post-copy patches v2
    https://github.com/jtotto/xen/commits/postcopy-v2
[i] QEMU Post-Copy Live Migration
    https://wiki.qemu.org/Features/PostCopyLiveMigration
[k] xenpaging
    https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/misc/xenpaging.txt
[s] AO_INPROGRESS
    https://github.com/xen-project/xen/blob/0a0dcdcd20e9711cbfb08db5b21af5299ee1eb8b/tools/libxl/libxl_domain.c#L520
[t] Live Migration of Virtual Machines
    http://www.cl.cam.ac.uk/research/srg/netos/papers/2005-migration-nsdi-pre.pdf
[u] Post-Copy Live Migration of Virtual Machines
    https://kartikgopalan.github.io/publications/hines09postcopy_osr.pdf
[v] Downtime Analysis of Virtual Machine Live Migration
    https://citemaster.net/get/e61b2d78-b400-11e3-91be-00163e009cc7/salfner11downtime.pdf
[x] pgbench
    https://www.postgresql.org/docs/10/static/pgbench.html
[z] Intel NUC Kit NUC5CPYH
    https://ark.intel.com/products/85254/Intel-NUC-Kit-NUC5CPYH
[l] Cisco Meraki MS220-8P
    https://meraki.cisco.com/products/switches/ms220-8
[q] A Novel Hybrid-Copy Algorithm for Live Migration of Virtual Machine
    http://www.mdpi.com/1999-5903/9/3/37

[m] $ sudo xl info
    [sudo] password for fydp:
    host                 : fydp
    release              : 3.16.0-4-amd64
    version              : #1 SMP Debian 3.16.36-1+deb8u2 (2016-10-19)
    machine              : x86_64
    nr_cpus              : 2
    max_cpu_id           : 1
    nr_nodes             : 1
    cores_per_socket     : 2
    threads_per_core     : 1
    cpu_mhz              : 1599
    hw_caps              : bfebfbff:43d8e3bf:28100800:00000101:00000000:00002282:00000000:00000100
    virt_caps            : hvm
    total_memory         : 8112
    free_memory          : 2066
    sharing_freed_memory : 0
    sharing_used_memory  : 0
    outstanding_claims   : 0
    free_cpus            : 0
    xen_major            : 4
    xen_minor            : 9
    xen_extra            : -rc
    xen_version          : 4.9-rc
    xen_caps             : xen-3.0-x86_64 xen-3.0-x86_32p hvm-3.0-x86_32 hvm-3.0-x86_32p hvm-3.0-x86_64
    xen_scheduler        : credit
    xen_pagesize         : 4096
    platform_params      : virt_start=0xffff800000000000
    xen_changeset        : Fri May 12 23:17:29 2017 -0400 git:c6ed26e
    xen_commandline      : placeholder altp2m=1
    cc_compiler          : gcc (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609
    cc_compile_by        : jtotto
    cc_compile_domain    :
    cc_compile_date      : Sat May 27 18:29:17 EDT 2017
    build_id             : fc017c8cf375bbe7464c5be8fff2d3fd2e08cbaa
    xend_config_format   : 4

[n] $ sudo xl info
    [sudo] password for fydp:
    host                 : fydp
    release              : 3.16.0-4-amd64
    version              : #1 SMP Debian 3.16.39-1+deb8u2 (2017-03-07)
    machine              : x86_64
    nr_cpus              : 2
    max_cpu_id           : 1
    nr_nodes             : 1
    cores_per_socket     : 2
    threads_per_core     : 1
    cpu_mhz              : 1599
    hw_caps              : bfebfbff:43d8e3bf:28100800:00000101:00000000:00002282:00000000:00000100
    virt_caps            : hvm
    total_memory         : 8112
    free_memory          : 128
    sharing_freed_memory : 0
    sharing_used_memory  : 0
    outstanding_claims   : 0
    free_cpus            : 0
    xen_major            : 4
    xen_minor            : 9
    xen_extra            : -rc
    xen_version          : 4.9-rc
    xen_caps             : xen-3.0-x86_64 xen-3.0-x86_32p hvm-3.0-x86_32 hvm-3.0-x86_32p hvm-3.0-x86_64
    xen_scheduler        : credit
    xen_pagesize         : 4096
    platform_params      : virt_start=0xffff800000000000
    xen_changeset        : Fri May 12 23:17:29 2017 -0400 git:c6ed26e
    xen_commandline      : placeholder no-real-mode edd=off
    cc_compiler          : gcc (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609
    cc_compile_by        : jtotto
    cc_compile_domain    :
    cc_compile_date      : Sat May 27 18:29:17 EDT 2017
    build_id             : fc017c8cf375bbe7464c5be8fff2d3fd2e08cbaa
    xend_config_format   : 4

[o] experiment.sh

# Repeat each experiment 5 times.
for i in {1..5}; do
    echo "Experiment iteration $i"

    for experiment in a b c d e
    do
        echo "Conducting experiment $experiment"

        # First, spin up the test VM to be migrated.
        while true
        do
            ssh -p 1337 -i ~/.ssh/waterloo root@192.168.2.64 \
                'xl create /home/fydp/vms/multideb.cfg' && break
            sleep 5
        done

        # Wait for the test VM to become accessible.
        echo 'Booting test VM...'
        while true
        do
            ssh -i ~/.ssh/waterloo root@192.168.2.67 echo && break
            sleep 1
        done

        # Initialize the test database.
        pgbench -h 192.168.2.67 -U postgres -i bench -s 70

        # Begin running the test in the background.
        pgbench -h 192.168.2.67 -U postgres -c 4 -j 1 -T 180 -l \
            --aggregate-interval 1 bench &

        # After 20 seconds...
        sleep 20

        # Initiate the migration.
        echo "Starting the migration..."
        case $experiment in
        a)
            ssh -p 1337 -i ~/.ssh/waterloo root@192.168.2.64 \
                'xl migrate debvm 192.168.2.63' \
                > pgbench-$experiment-$i.log 2>&1
            ;;
        b)
            ssh -p 1337 -i ~/.ssh/waterloo root@192.168.2.64 \
                'xl migrate --precopy-iterations 1 debvm 192.168.2.63' \
                > pgbench-$experiment-$i.log 2>&1
            ;;
        c)
            ssh -p 1337 -i ~/.ssh/waterloo root@192.168.2.64 \
                'xl migrate --precopy-iterations 0 --postcopy debvm 192.168.2.63' \
                > pgbench-$experiment-$i.log 2>&1
            ;;
        d)
            ssh -p 1337 -i ~/.ssh/waterloo root@192.168.2.64 \
                'xl migrate --precopy-iterations 1 --postcopy debvm 192.168.2.63' \
                > pgbench-$experiment-$i.log 2>&1
            ;;
        e)
            ssh -p 1337 -i ~/.ssh/waterloo root@192.168.2.64 \
                'xl migrate --precopy-iterations 5 --postcopy debvm 192.168.2.63' \
                > pgbench-$experiment-$i.log 2>&1
            ;;
        esac

        # Wait for the benchmark to complete.
        echo "Migration complete."
        wait

        # Rename the benchmark log to something more useful.
        mv pgbench_log.* pgbench-perf-$experiment-$i.log

        # Shut down the test VM.
        echo "Cleaning up..."
        ssh -i ~/.ssh/waterloo root@192.168.2.67 \
            '(sleep 10 && shutdown -h now) < /dev/null > /dev/null 2>&1 &'

        # Wait for it to really be down.
        sleep 20
    done
done

echo 'All done'

[p] Post-copy tracing patches
    https://github.com/jtotto/xen/commits/postcopy-tracing

Joshua Otto (23):
  tools: rename COLO 'postcopy' to 'aftercopy'
  libxc/xc_sr: parameterise write_record() on fd
  libxc/xc_sr_restore.c: use write_record() in send_checkpoint_dirty_pfn_list()
  libxc/xc_sr: naming correction: mfns -> gfns
  libxc/xc_sr_restore: introduce generic 'pages' records
  libxc/xc_sr_restore: factor helpers out of handle_page_data()
  libxc/migration: tidy the xc_domain_save()/restore() interface
  libxc/migration: defer precopy policy to a callback
  libxl/migration: wire up the precopy policy RPC callback
  libxc/xc_sr_save: introduce save batch types
  libxc/migration: correct hvm record ordering specification
  libxc/migration: specify postcopy live migration
  libxc/migration: add try_read_record()
  libxc/migration: implement the sender side of postcopy live migration
  libxc/migration: implement the receiver side of postcopy live migration
  libxl/libxl_stream_write.c: track callback chains with an explicit phase
  libxl/libxl_stream_read.c: track callback chains with an explicit phase
  libxl/migration: implement the sender side of postcopy live migration
  libxl/migration: implement the receiver side of postcopy live migration
  tools: expose postcopy live migration support in libxl and xl
  xen/mem_paging: move paging op arguments into a union
  xen/mem_paging: add a populate_evicted paging op
  libxc/xc_sr_restore.c: use populate_evicted()

 docs/specs/libxc-migration-stream.pandoc |  182 +++-
 docs/specs/libxl-migration-stream.pandoc |   19 +-
 tools/libxc/include/xenctrl.h            |    2 +
 tools/libxc/include/xenguest.h           |  237 +++---
 tools/libxc/xc_mem_paging.c              |   39 +-
 tools/libxc/xc_nomigrate.c               |   16 +-
 tools/libxc/xc_private.c                 |   21 +-
 tools/libxc/xc_private.h                 |    2 +
 tools/libxc/xc_sr_common.c               |  116 ++-
 tools/libxc/xc_sr_common.h               |  170 +++-
 tools/libxc/xc_sr_common_x86.c           |    2 +-
 tools/libxc/xc_sr_restore.c              | 1321 +++++++++++++++++++++++++-----
 tools/libxc/xc_sr_restore_x86_hvm.c      |   41 +-
 tools/libxc/xc_sr_save.c                 |  903 ++++++++++++++++----
 tools/libxc/xc_sr_save_x86_hvm.c         |   18 +-
 tools/libxc/xc_sr_save_x86_pv.c          |   17 +-
 tools/libxc/xc_sr_stream_format.h        |   15 +-
 tools/libxc/xg_save_restore.h            |   16 +-
 tools/libxl/libxl.h                      |   40 +-
 tools/libxl/libxl_colo_restore.c         |    2 +-
 tools/libxl/libxl_colo_save.c            |    2 +-
 tools/libxl/libxl_create.c               |  191 ++++-
 tools/libxl/libxl_dom_save.c             |   71 +-
 tools/libxl/libxl_domain.c               |   33 +-
 tools/libxl/libxl_internal.h             |   80 +-
 tools/libxl/libxl_remus.c                |    2 +-
 tools/libxl/libxl_save_callout.c         |   12 +-
 tools/libxl/libxl_save_helper.c          |   60 +-
 tools/libxl/libxl_save_msgs_gen.pl       |   10 +-
 tools/libxl/libxl_sr_stream_format.h     |   13 +-
 tools/libxl/libxl_stream_read.c          |  136 ++-
 tools/libxl/libxl_stream_write.c         |  161 ++--
 tools/ocaml/libs/xl/xenlight_stubs.c     |    2 +-
 tools/xl/xl.h                            |    7 +-
 tools/xl/xl_cmdtable.c                   |    3 +
 tools/xl/xl_migrate.c                    |   65 +-
 tools/xl/xl_vmcontrol.c                  |    8 +-
 xen/arch/x86/mm.c                        |    5 +-
 xen/arch/x86/mm/mem_paging.c             |   40 +-
 xen/arch/x86/mm/p2m.c                    |  101 +++
 xen/arch/x86/x86_64/compat/mm.c          |    6 +-
 xen/arch/x86/x86_64/mm.c                 |    6 +-
 xen/include/asm-x86/mem_paging.h         |    3 +-
 xen/include/asm-x86/p2m.h                |    2 +
 xen/include/public/memory.h              |   25 +-
 45 files changed, 3489 insertions(+), 734 deletions(-)

--
2.7.4