
Re: [Xen-devel] [xen-unstable test] 123379: regressions - FAIL


  • To: Jan Beulich <JBeulich@xxxxxxxx>, Ian Jackson <ian.jackson@xxxxxxxxxx>
  • From: Juergen Gross <jgross@xxxxxxxx>
  • Date: Thu, 7 Jun 2018 13:30:20 +0200
  • Autocrypt: addr=jgross@xxxxxxxx; prefer-encrypt=mutual
  • Cc: Ian Jackson <Ian.Jackson@xxxxxxxxxxxxx>, Wei Liu <wei.liu2@xxxxxxxxxx>, xen-devel <xen-devel@xxxxxxxxxxxxxxxxxxxx>
  • Delivery-date: Thu, 07 Jun 2018 11:30:33 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>
  • Openpgp: preference=signencrypt

On 06/06/18 11:40, Juergen Gross wrote:
> On 06/06/18 11:35, Jan Beulich wrote:
>>>>> On 05.06.18 at 18:19, <ian.jackson@xxxxxxxxxx> wrote:
>>>>>  test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 14 guest-saverestore.2 
>>>
>>> I thought I would reply again with the key point from my earlier mail
>>> highlighted, and go a bit further.  The first thing to go wrong in
>>> this was:
>>>
>>> 2018-05-30 22:12:49.320+0000: xc: Failed to get types for pfn batch (14 = Bad address): Internal error
>>> 2018-05-30 22:12:49.483+0000: xc: Save failed (14 = Bad address): Internal error
>>> 2018-05-30 22:12:49.648+0000: libxl-save-helper: complete r=-1: Bad address
>>>
>>> You can see similar messages in the other logfile:
>>>
>>> 2018-05-30 22:12:49.650+0000: libxl: libxl_stream_write.c:350:libxl__xc_domain_save_done: Domain 3:saving domain: domain responded to suspend request: Bad address
>>>
>>> All of these are reports of the same thing: xc_get_pfn_type_batch at
>>> xc_sr_save.c:133 failed with EFAULT.  I'm afraid I don't know why.
>>>
>>> There is no corresponding message in the host's serial log nor the
>>> dom0 kernel log.
>>
>> I vaguely recall from the time when I had looked at the similar Windows
>> migration issues that the guest is already in the process of being cleaned
>> up when these occur. Commit 2dbe9c3cd2 ("x86/mm: silence a pointless
>> warning") intentionally suppressed a log message here, and the
>> immediately following debugging code (933f966bcd x86/mm: add
>> temporary debugging code to get_page_from_gfn_p2m()) was reverted
>> a little over a month later. This wasn't a follow-up to another patch
>> (fix), but followed the discussion rooted at
>> https://lists.xenproject.org/archives/html/xen-devel/2017-06/msg00324.html
> 
> That was -ESRCH, not -EFAULT.

I've looked a little bit more into this.

As we are seeing EFAULT being returned by the hypervisor, this means
either that the tools are specifying an invalid address (quite unlikely)
or that the buffers are not as MAP_LOCKED as we wish them to be.
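For illustration, a minimal sketch (not the actual libxencall code) of
an allocation using MAP_LOCKED; the flag asks the kernel to prefault
and lock the range, but as quoted below that is not a hard guarantee:

#include <stddef.h>
#include <sys/mman.h>

/* Illustrative only: allocate a nominally locked buffer.  mmap() can
 * succeed even if populating the range failed, so major faults may
 * still happen later on. */
static void *alloc_locked_buffer(size_t size)
{
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_LOCKED, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}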

Is there a way to see whether the host was experiencing some memory
shortage, so the buffers might have been swapped out?

man mmap tells me: "This implementation will try to populate (prefault)
the whole range but the mmap call doesn't fail with ENOMEM if this
fails. Therefore major faults might happen later on."

And: "One should use mmap(2) plus mlock(2) when major faults are not
acceptable after the initialization of the mapping."
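That suggests a combination of roughly this shape (a sketch following
the man page's advice; the function name is mine):

#include <stddef.h>
#include <sys/mman.h>

/* Sketch: unlike MAP_LOCKED, a failing mlock() returns an error here
 * instead of leaving the range to major-fault later on. */
static void *alloc_hard_locked(size_t size)
{
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    if (mlock(p, size)) {
        munmap(p, size);
        return NULL;
    }
    return p;
}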

Since osdep_alloc_pages() in tools/libs/call/linux.c touches all the
hypercall buffer pages before doing the hypercall, I'm not sure this
could be an issue.
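The touch loop there has roughly this shape (a sketch only, not the
tree's code; PAGE_SIZE stands in for the real page size):

#include <stddef.h>

#define PAGE_SIZE 4096

/* Writing one byte per page forces every page to be present (and
 * un-CoWed) before the hypercall is issued. */
static void prefault_pages(unsigned char *buf, size_t npages)
{
    size_t i;

    for ( i = 0; i < npages; i++ )
        buf[i * PAGE_SIZE] = 0;
}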

Any thoughts on that?


Juergen

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel

 

