On Mon, May 22, 2017 at 11:18 AM, George Dunlap
<george.dunlap@xxxxxxxxxx> wrote:
        
          On 22/05/17 07:35, Hao, Xudong wrote:
          
            Bug detailed description:
----------------
Create one RHEL7.3 HVM and do live migration continuously, while doing the 200+ or 300+ times live-migration, tool stack report error and migration failed.
Environment :
----------------
HW: Skylake server
Xen: Xen 4.9.0 RC4
Dom0: Linux 4.11.0
Reproduce steps:
----------------
1.      Compile Xen 4.9 Rc4 and dom0 kernel 4.11.0, boot to dom0
2.      Boot RHEL7.3 HVM guest
3.      Migrate guest to localhost, sleep 10 seconds
4.      Repeat doing the step3.
Current result:
----------------
VM Migration fail.
Base error log:
----------------
xl migrate 24hrs_lm_guest_2 localhost
root@localhost's password:
migration target: Ready to receive domain.
Saving to migration stream new xl format (info 0x3/0x0/1761)
Loading new save file <incoming migration stream> (new xl fmt info 0x3/0x0/1761)
Savefile contains xl domain config in JSON format
Parsing config from <saved>
xc: info: Saving domain 273, type x86 HVM
xc: info: Found x86 HVM domain from Xen 4.9
xc: info: Restoring domain
xc: error: set HVM param 12 = 0x00000000feffe000 (85 = Interrupted system call should ): Internal error
xc: error: Restore failed (85 = Interrupted system call should ): Internal error
          
          Interesting -- it appears that setting HVM_PARAM_IDENT_PT (#12) can fail
with -ERESTART.  But the comment for ERESTART makes it explicit that it
should be internal only -- it should cause a hypercall continuation (so
that the hypercall restarts automatically), rather than returning to the
guest.
But the hypercall continuation code seems to have disappeared from
do_hvm_op() at some point?
/me digs a bit more...
        
        The problem turns out to be commit ae20ccf ("dm_op: convert
HVMOP_set_mem_type"), which says:
    This patch removes the need for handling HVMOP restarts, so that
    infrastructure is removed.
While it's true that there are no more operations which need iteration
information restored, but there are two operations which may still
need to be restarted to avoid deadlocks with other operations.
Attached is a patch which restores hypercall continuation checking.
Xudong, can you give it a test?
Thanks,
 -George
       
      
        From 3d4ce135ea3b396bb63752c39e6234366d590c16 Mon Sep 17 00:00:00 2001
From: George Dunlap <george.dunlap@xxxxxxxxxx>
Date: Mon, 22 May 2017 11:38:31 +0100
Subject: [PATCH] Restore HVM_OP hypercall continuation (partial revert of
 ae20ccf)
Commit ae20ccf removed the hypercall continuation logic from the end
of do_hvm_op(), claiming:
"This patch removes the need for handling HVMOP restarts, so that
infrastructure is removed."
That turns out to be false.  The removal of HVMOP_set_mem_type removed
the need to store a start iteration value in the hypercall
continuation, but a grep through hvm.c for ERESTART turns up at least
two places where do_hvm_op() may still need a hypercall continuation:
 * HVMOP_set_hvm_param can return -ERESTART when setting
HVM_PARAM_IDENT_PT in the event that it fails to acquire the domctl
lock
 * HVMOP_flush_tlbs can return -ERESTART if several vcpus call it at
   the same time
In both cases, a simple restart (with no stored iteration information)
is necessary.
Add a check for -ERESTART again, along with a comment at the top of
the function regarding the lack of decoding any information from the
op value.
Reported-by: Xudong Hao <xudong.hao@xxxxxxxxx>
Signed-off-by: George Dunlap <george.dunlap@xxxxxxxxxx>
---
CC: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
CC: Jan Beulich <jbeulich@xxxxxxxx>
CC: Paul Durrant <paul.durrant@xxxxxxxxxx>