
Re: domU reboot claim failed


  • To: Jason Andryuk <jason.andryuk@xxxxxxx>
  • From: Jan Beulich <jbeulich@xxxxxxxx>
  • Date: Fri, 12 Sep 2025 08:15:03 +0200
  • Cc: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Xen-devel <xen-devel@xxxxxxxxxxxxxxxxxxxx>
  • Delivery-date: Fri, 12 Sep 2025 06:15:24 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 11.09.2025 23:20, Jason Andryuk wrote:
> Thanks, everyone.
> 
> On 2025-09-10 17:57, Andrew Cooper wrote:
>> On 10/09/2025 7:58 pm, Jason Andryuk wrote:
>>> Hi,
>>>
>>> We're running Android as a guest and it's running the Compatibility
>>> Test Suite.  During the CTS, the Android domU is rebooted multiple times.
>>>
>>> In the middle of the CTS, we've seen reboot fail.  xl -vvv shows:
>>> domainbuilder: detail: Could not allocate memory for HVM guest as we
>>> cannot claim memory!
>>> xc: error: panic: xg_dom_boot.c:119: xc_dom_boot_mem_init: can't
>>> allocate low memory for domain: Out of memory
>>> libxl: error: libxl_dom.c:581:libxl__build_dom: xc_dom_boot_mem_init
>>> failed: Cannot allocate memory
>>> domainbuilder: detail: xc_dom_release: called
>>>
>>> So the claim failed.  The system has enough memory since we're just
>>> rebooting the same VM.  As a workaround, I added sleep(1) + retry,
>>> which works.
>>>
>>> The curious part is the memory allocation.  For d2 to d5, we have:
>>> domainbuilder: detail: range: start=0x0 end=0xf0000000
>>> domainbuilder: detail: range: start=0x100000000 end=0x1af000000
>>> xc: detail: PHYSICAL MEMORY ALLOCATION:
>>> xc: detail:   4KB PAGES: 0x0000000000000000
>>> xc: detail:   2MB PAGES: 0x00000000000006f8
>>> xc: detail:   1GB PAGES: 0x0000000000000003
>>>
>>> But when we have to retry the claim for d6, there are no 1GB pages used:
>>> domainbuilder: detail: range: start=0x0 end=0xf0000000
>>> domainbuilder: detail: range: start=0x100000000 end=0x1af000000
>>> domainbuilder: detail: HVM claim failed! attempt 0
>>> xc: detail: PHYSICAL MEMORY ALLOCATION:
>>> xc: detail:   4KB PAGES: 0x0000000000002800
>>> xc: detail:   2MB PAGES: 0x0000000000000ce4
>>> xc: detail:   1GB PAGES: 0x0000000000000000
>>>
>>> But subsequent reboots for d7 and d8 go back to using 1GB pages.
>>>
>>> Does the change in memory allocation stick out to anyone?
>>>
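
The two page-count listings above can be cross-checked against the RAM
ranges. A minimal sketch, assuming the counts are hexadecimal numbers of
4KB, 2MB and 1GB pages; the helper names are made up for illustration
and are not part of libxc:

```c
#include <stdint.h>

/* Size in bytes of a guest-physical range [start, end). */
static uint64_t range_bytes(uint64_t start, uint64_t end)
{
    return end - start;
}

/* Bytes covered by a mix of 4KB, 2MB and 1GB pages. */
static uint64_t alloc_bytes(uint64_t n4k, uint64_t n2m, uint64_t n1g)
{
    return (n4k << 12) + (n2m << 21) + (n1g << 30);
}

/* Guest RAM described by the two "range:" lines in the log. */
static uint64_t guest_ram_bytes(void)
{
    return range_bytes(0x0ULL, 0xf0000000ULL) +
           range_bytes(0x100000000ULL, 0x1af000000ULL);
}
```

Both alloc_bytes(0x0, 0x6f8, 0x3) (d2 to d5) and
alloc_bytes(0x2800, 0xce4, 0x0) (d6) come out equal to guest_ram_bytes(),
i.e. 6640 MiB. So d6 received the same amount of memory, only built from
smaller pages.
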
>>> Unfortunately, I don't have insight into what the failing test is doing.
>>>
>>> Xen doesn't seem set up to track the claim across reboot.  Retrying
>>> the claim works in our scenario since we have a controlled configuration.
>>
>> This looks to me like a known phenomenon.  Ages back, a change was made
>> in how Xen scrubs memory, from being synchronous in domain_kill(), to
>> being asynchronous in the idle loop.
>>
>> The consequence being that, on an idle system, you can shut down and
>> reboot the domain faster, but on a busy system you end up trying to
>> allocate the new domain while memory from the old domain is still dirty.
>>
>> It is a classic example of a false optimisation, which looks great on an
>> idle system only because the idle CPUs are swallowing the work.
>>
>> This impacts the ability to find a 1G aligned block of free memory to
>> allocate a superpage with, and by the sounds of it, claims (which
>> predate this behaviour change) aren't aware of the "to be scrubbed"
>> queue and fail instead.
> 
> Claims check total_avail_pages and outstanding_claims.  It looks like 
> free_heap_pages() sets PGC_need_scrub and then increments 
> total_avail_pages.  But then it's not getting through the accounting far 
> enough to stake a claim?
> 
> Also free_heap_pages() looks like it's trying to merge chunks - I thought 
> that would handle larger allocations.  Are they not truly usable until 
> they've been scrubbed, which leads to the lack of 1GB pages?
> 
> Clearly I need to learn more here.
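
The accounting described above can be modelled in a few lines. This is a
toy sketch built only from that description, not Xen's actual code: the
struct and the model_* functions are hypothetical, and only the two
counter names are taken from the text.

```c
#include <stdbool.h>

/* Toy model of the two counters named above. */
struct heap_model {
    unsigned long total_avail_pages;  /* free pages, scrubbed or not */
    unsigned long outstanding_claims; /* pages already promised to claims */
};

/* Stake a claim only if the unclaimed free pages cover the request. */
static bool model_claim(struct heap_model *h, unsigned long pages)
{
    if (h->outstanding_claims > h->total_avail_pages ||
        pages > h->total_avail_pages - h->outstanding_claims)
        return false;

    h->outstanding_claims += pages;
    return true;
}

/* Freeing makes the pages count as available straight away. */
static void model_free(struct heap_model *h, unsigned long pages)
{
    h->total_avail_pages += pages;
}
```

In this model a claim for N pages fails for as long as fewer than N
unclaimed pages have been handed back by model_free(), and succeeds as
soon as they have, whether or not they have been scrubbed.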

I rather expect that this is then not scrubbing-related, but that domain cleanup
for the earlier instance hasn't progressed quickly enough.
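
If cleanup is merely lagging, the sleep(1) + retry workaround mentioned
earlier amounts to something like the following. A minimal sketch only:
claim_with_retry(), the claim_fn callback type and stub_claim() are
hypothetical names, and the real claim call would have to be wrapped to
fit the callback.

```c
#include <unistd.h>

/* Hypothetical callback: returns 0 once the claim succeeds, non-zero otherwise. */
typedef int (*claim_fn)(void *ctx);

/* Try the claim up to max_attempts times, sleeping delay_s seconds in between. */
static int claim_with_retry(claim_fn try_claim, void *ctx,
                            unsigned int max_attempts, unsigned int delay_s)
{
    int rc = -1;

    for (unsigned int attempt = 0; attempt < max_attempts; attempt++) {
        rc = try_claim(ctx);
        if (rc == 0)
            return 0;
        /* Give the old domain's teardown time to release more memory. */
        if (attempt + 1 < max_attempts)
            sleep(delay_s);
    }

    return rc;
}

/* Stand-in for the real claim: fails until the counter in ctx reaches 0. */
static int stub_claim(void *ctx)
{
    int *failures_left = ctx;

    if (*failures_left > 0) {
        (*failures_left)--;
        return -1;
    }
    return 0;
}
```

A bounded attempt count keeps a genuine out-of-memory condition from
turning into an endless loop.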

Jan



 

