
Re: [Xen-devel] [RFC Patch 00/25] COarse-grain LOck-stepping Virtual Machines for Non-stop Service



At 07/18/2014 07:38 PM, Wen Congyang Wrote:
> Virtual machine (VM) replication is a well-known technique for providing
> application-agnostic, software-implemented hardware fault tolerance, i.e.
> "non-stop service". Currently, Remus provides this function, but it buffers
> all output packets, and the resulting latency is unacceptable.
> 
> At Xen Summit 2012, we introduced a new VM replication solution: COLO
> (COarse-grain LOck-stepping virtual machine). The presentation is at
> the following URL:
> http://www.slideshare.net/xen_com_mgr/colo-coarsegrain-lockstepping-virtual-machines-for-nonstop-service
> 
> Here is a summary of the solution:
> From the client's point of view, as long as the client observes identical
> responses from the primary and secondary VMs, according to the service
> semantics, then the secondary VM is a valid replica of the primary
> VM, and can successfully take over when a hardware failure of the
> primary VM is detected.
> 
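
To make the summary above concrete, here is a minimal sketch in C of the
output comparison it implies. It is not part of this patchset (nic/disk
replication is still on the TODO list), and every name in it (struct packet,
enum colo_action, colo_compare) is hypothetical. It also assumes that
"identical" means byte-for-byte equality and that a mismatch is handled by
taking a checkpoint before anything is released to the client.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical packet type, not a Xen or COLO structure. */
struct packet {
    const unsigned char *data;
    size_t len;
};

/* Hypothetical decision returned by the comparison step. */
enum colo_action {
    COLO_RELEASE,    /* outputs match: forward the primary's packet */
    COLO_CHECKPOINT  /* outputs differ: resynchronize the secondary first */
};

/* Two outputs count as identical only if they match byte for byte. */
static bool packets_equal(const struct packet *a, const struct packet *b)
{
    if (a->len != b->len)
        return false;
    return a->len == 0 || memcmp(a->data, b->data, a->len) == 0;
}

/* Decide what to do with one pair of outputs from the two VMs. */
static enum colo_action colo_compare(const struct packet *primary,
                                     const struct packet *secondary)
{
    return packets_equal(primary, secondary) ? COLO_RELEASE
                                             : COLO_CHECKPOINT;
}
```

The point of the design is that, unlike Remus, output does not have to be
buffered until the next checkpoint as long as both VMs keep producing the
same responses.
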
> This patchset is an RFC and implements the framework of COLO:
> 1. Both the primary VM and the secondary VM are running
> 2. Do a checkpoint
> 
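
The two steps above can be read as a small cycle. The sketch below is only
an illustration of that reading: the phase names and colo_next_phase() are
hypothetical and do not appear in libxl or libxc, and the split of the
checkpoint into "suspend" and "sync" is an assumption based on the patch
titles (suspend/get_dirty_pfn/resume/checkpoint).

```c
/* Hypothetical phases of one checkpoint round; names are illustrative only. */
enum colo_phase {
    COLO_BOTH_RUNNING,   /* step 1: primary and secondary run in parallel */
    COLO_BOTH_SUSPENDED, /* step 2a: both VMs are paused for the checkpoint */
    COLO_STATE_SYNCED    /* step 2b: primary state copied to the secondary */
};

/* Advance to the next phase; after syncing, both VMs are resumed. */
static enum colo_phase colo_next_phase(enum colo_phase p)
{
    switch (p) {
    case COLO_BOTH_RUNNING:   return COLO_BOTH_SUSPENDED;
    case COLO_BOTH_SUSPENDED: return COLO_STATE_SYNCED;
    case COLO_STATE_SYNCED:   return COLO_BOTH_RUNNING;
    }
    return COLO_BOTH_RUNNING; /* not reached for valid input */
}
```
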
> This patchset is based on remus-v15 and uses migration v1. Only HVM
> guests are supported for now.
> 
> TODO list:
> 1. rebase to remus-v17 or newer
> 2. support migration v2
> 3. nic/disk replication
> 4. support pvm
> 
> Patch 1-3: bugfixes
> Patch 4-6: temporarily update Remus to reuse the Remus device code
> Patch 7-14: update some APIs which will be used by COLO
> Patch 15-22: COLO-related code
> Patch 23: hack patch, just for testing
> Patch 24-25: bugfixes. We found this bug before rebasing COLO onto the
>           newest Xen, but we no longer trigger it.
> Patch 26: a patch for qemu-xen

I have also put the code on GitHub:
https://github.com/wencongyang/xen/tree/colo

> 
> Hong Tao (1):
>   copy the correct page to memory
> 
> Wen Congyang (24):
>   csum the correct page
>   don't zero out ioreq page
>   don't touch remus in remus_device
>   rename remus device to checkpoint device
>   adjust the indentation
>   Refactor domain_suspend_callback_common()
>   Update libxl__domain_resume() for colo
>   Update libxl__domain_suspend_common_switch_qemu_logdirty() for colo
>   Introduce a new internal API libxl__domain_unpause()
>   Update libxl__domain_unpause() to support qemu-xen
>   support to resume uncooperative HVM guests
>   update datecopier to support sending data only
>   introduce a new API to aync read data from fd
>   Update libxl_save_msgs_gen.pl to support return data from xl to xc
>   Allow slave sends data to master
>   secondary vm suspend/resume/checkpoint code
>   primary vm suspend/get_dirty_pfn/resume/checkpoint code
>   xc_domain_save: flush cache before calling callbacks->postcopy() in
>     colo mode
>   COLO: xc related codes
>   send store mfn and console mfn to xl before resuming secondary vm
>   implement the cmdline for COLO
>   HACK: do checkpoint per 20ms
>   fix vm entry fail
>   sync mmu before resuming secondary vm
> 
>  docs/man/xl.pod.1                                  |   9 +-
>  tools/libxc/xc_domain.c                            |   9 +
>  tools/libxc/xc_domain_restore.c                    |  74 +-
>  tools/libxc/xc_domain_save.c                       |  66 +-
>  tools/libxc/xc_resume.c                            |  20 +-
>  tools/libxc/xenctrl.h                              |   2 +
>  tools/libxc/xenguest.h                             |  40 +
>  tools/libxl/Makefile                               |   3 +-
>  tools/libxl/libxl.c                                | 102 ++-
>  tools/libxl/libxl.h                                |   3 +-
>  tools/libxl/libxl_aoutils.c                        |  81 +-
>  ...xl_remus_device.c => libxl_checkpoint_device.c} | 266 ++++---
>  tools/libxl/libxl_colo.h                           |  48 ++
>  tools/libxl/libxl_colo_restore.c                   | 882 +++++++++++++++++++++
>  tools/libxl/libxl_colo_save.c                      | 602 ++++++++++++++
>  tools/libxl/libxl_create.c                         | 131 ++-
>  tools/libxl/libxl_dom.c                            | 424 ++++++----
>  tools/libxl/libxl_internal.h                       | 262 ++++--
>  tools/libxl/libxl_netbuffer.c                      |  85 +-
>  tools/libxl/libxl_nonetbuffer.c                    |  14 +-
>  tools/libxl/libxl_qmp.c                            |  10 +
>  tools/libxl/libxl_remus_disk_drbd.c                |  54 +-
>  tools/libxl/libxl_save_callout.c                   |  37 +-
>  tools/libxl/libxl_save_helper.c                    |  17 +
>  tools/libxl/libxl_save_msgs_gen.pl                 |  74 +-
>  tools/libxl/libxl_types.idl                        |  12 +-
>  tools/libxl/xl_cmdimpl.c                           |  54 +-
>  tools/libxl/xl_cmdtable.c                          |   3 +-
>  xen/arch/x86/domctl.c                              |  15 +
>  xen/arch/x86/hvm/save.c                            |   6 +
>  xen/arch/x86/hvm/vmx/vmcs.c                        |   8 +
>  xen/arch/x86/hvm/vmx/vmx.c                         |   8 +
>  xen/include/asm-x86/hvm/hvm.h                      |   1 +
>  xen/include/asm-x86/hvm/vmx/vmcs.h                 |   1 +
>  xen/include/public/domctl.h                        |   1 +
>  xen/include/xen/hvm/save.h                         |   2 +
>  36 files changed, 2895 insertions(+), 531 deletions(-)
>  rename tools/libxl/{libxl_remus_device.c => libxl_checkpoint_device.c} (47%)
>  create mode 100644 tools/libxl/libxl_colo.h
>  create mode 100644 tools/libxl/libxl_colo_restore.c
>  create mode 100644 tools/libxl/libxl_colo_save.c
> 




 

