Xen project Mailing List

Re: domU reboot claim failed

To: Jason Andryuk <jason.andryuk@xxxxxxx>

Date: Fri, 12 Sep 2025 08:15:03 +0200

Cc: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Xen-devel <xen-devel@xxxxxxxxxxxxxxxxxxxx>

Delivery-date: Fri, 12 Sep 2025 06:15:24 +0000

List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 11.09.2025 23:20, Jason Andryuk wrote:
> Thanks, everyone.
>
> On 2025-09-10 17:57, Andrew Cooper wrote:
>> On 10/09/2025 7:58 pm, Jason Andryuk wrote:
>>> Hi,
>>>
>>> We're running Android as a guest and it's running the Compatibility
>>> Test Suite.  During the CTS, the Android domU is rebooted multiple times.
>>>
>>> In the middle of the CTS, we've seen reboot fail.  xl -vvv shows:
>>> domainbuilder: detail: Could not allocate memory for HVM guest as we
>>> cannot claim memory!
>>> xc: error: panic: xg_dom_boot.c:119: xc_dom_boot_mem_init: can't
>>> allocate low memory for domain: Out of memory
>>> libxl: error: libxl_dom.c:581:libxl__build_dom: xc_dom_boot_mem_init
>>> failed: Cannot allocate memory
>>> domainbuilder: detail: xc_dom_release: called
>>>
>>> So the claim failed.  The system has enough memory since we're just
>>> rebooting the same VM.  As a work around, I added sleep(1) + retry,
>>> which works.
>>>
>>> The curious part is the memory allocation.  For d2 to d5, we have:
>>> domainbuilder: detail: range: start=0x0 end=0xf0000000
>>> domainbuilder: detail: range: start=0x100000000 end=0x1af000000
>>> xc: detail: PHYSICAL MEMORY ALLOCATION:
>>> xc: detail:   4KB PAGES: 0x0000000000000000
>>> xc: detail:   2MB PAGES: 0x00000000000006f8
>>> xc: detail:   1GB PAGES: 0x0000000000000003
>>>
>>> But when we have to retry the claim for d6, there are no 1GB pages used:
>>> domainbuilder: detail: range: start=0x0 end=0xf0000000
>>> domainbuilder: detail: range: start=0x100000000 end=0x1af000000
>>> domainbuilder: detail: HVM claim failed! attempt 0
>>> xc: detail: PHYSICAL MEMORY ALLOCATION:
>>> xc: detail:   4KB PAGES: 0x0000000000002800
>>> xc: detail:   2MB PAGES: 0x0000000000000ce4
>>> xc: detail:   1GB PAGES: 0x0000000000000000
>>>
>>> But subsequent reboots for d7 and d8 go back to using 1GB pages.
>>>
>>> Does the change in memory allocation stick out to anyone?
>>>
>>> Unfortunately, I don't have insight into what the failing test is doing.
>>>
>>> Xen doesn't seem set up to track the claim across reboot.  Retrying
>>> the claim works in our scenario since we have a controlled configuration.
>>
>> This looks to me like a known phenomenon.  Ages back, a change was made
>> in how Xen scrubs memory, from being synchronous in domain_kill(), to
>> being asynchronous in the idle loop.
>>
>> The consequence being that, on an idle system, you can shutdown and
>> reboot the domain faster, but on a busy system you end up trying to
>> allocate the new domain while memory from the old domain is still dirty.
>>
>> It is a classic example of a false optimisation, which looks great on an
>> idle system only because the idle CPUs are swallowing the work.
>>
>> This impacts the ability to find a 1G aligned block of free memory to
>> allocate a superpage with, and by the sounds of it, claims (which
>> predate this behaviour change) aren't aware of the "to be scrubbed"
>> queue and fail instead.
>
> Claims check total_avail_pages and outstanding_claims.  It looks like
> free_heap_pages() sets PGC_need_scrub and then increments
> total_avail_pages.  But then it's not getting through the accounting far
> enough to stake a claim?
>
> Also free_heap_page() looks like it's trying to merge chunks - I thought
> that would handle larger allocations.  Are they not truly usable until
> they've been scrubbed, which leads to the lack of 1GB pages?
>
> Clearly I need to learn more here.

I rather expect this then may not be scrubbing related, but domain cleanup
hasn't progressed quickly enough for the earlier instance.

Jan
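
For reference, the page counts quoted above can be cross-checked against the two guest memory ranges with a few lines of Python. This is only an arithmetic sketch: the helper names (range_bytes, allocated_bytes) and the dictionary layout are made up for illustration and are not part of xl or libxc. The numbers are copied from the xl -vvv output in the thread.

```python
MIB = 1 << 20

# Page sizes as labelled in the "PHYSICAL MEMORY ALLOCATION" output.
PAGE_SIZES = {"4KB": 4 << 10, "2MB": 2 << 20, "1GB": 1 << 30}


def range_bytes(ranges):
    """Total size in bytes of a list of (start, end) address ranges."""
    return sum(end - start for start, end in ranges)


def allocated_bytes(counts):
    """Total bytes covered by a {page size label: page count} mapping."""
    return sum(PAGE_SIZES[label] * n for label, n in counts.items())


# Ranges reported by the domain builder (identical for d2..d5 and d6).
ranges = [(0x0, 0xF0000000), (0x100000000, 0x1AF000000)]

# Page counts for the normal builds (d2 to d5) and for the d6 retry.
normal = {"4KB": 0x0, "2MB": 0x6F8, "1GB": 0x3}
after_retry = {"4KB": 0x2800, "2MB": 0xCE4, "1GB": 0x0}

if __name__ == "__main__":
    print("ranges     :", range_bytes(ranges) // MIB, "MiB")
    print("d2..d5     :", allocated_bytes(normal) // MIB, "MiB")
    print("d6 (retry) :", allocated_bytes(after_retry) // MIB, "MiB")
```

All three figures come to 6640 MiB, so the d6 build received the same amount of memory as the others. Only the page-size mix differs: no 1GB pages, and 40 MiB of the guest backed by 4KB pages.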
