|
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: [PATCH v3] xen/mm: move adjustment of claimed pages counters on allocation
On 08.01.2026 09:44, Roger Pau Monné wrote:
> On Thu, Jan 08, 2026 at 09:24:51AM +0100, Jan Beulich wrote:
>> On 07.01.2026 18:56, Roger Pau Monne wrote:
>>> The current logic splits the update of the amount of available memory in
>>> the system (total_avail_pages) and pending claims into two separately
>>> locked regions. This leads to a window between counters adjustments where
>>> the result of total_avail_pages - outstanding_claims doesn't reflect the
>>> real amount of free memory available, and can return a negative value due
>>> to total_avail_pages having been updated ahead of outstanding_claims.
>>>
>>> Fix by adjusting outstanding_claims and d->outstanding_pages in the same
>>> place where total_avail_pages is updated. Note that accesses to
>>> d->outstanding_pages is protected by the global heap_lock, just like
>>> total_avail_pages or outstanding_claims. Add a comment to the field
>>> declaration, and also adjust the comment at the top of
>>> domain_set_outstanding_pages() to be clearer in that regard.
>>>
>>> Note that failures in assign_pages() causes the claimed amount that has
>>> been allocated to be lost, as the amount is not added back to the domain
>>> quota once pages are freed. Given the intended usage of claims is limited
>>> to initial physmap populate, and the current failure paths in
>>> assign_pages() would lead to the domain being destroyed anyway, don't
>>> add extra logic to recover the claimed amount on failure - it's just adding
>>> complexity for no real benefit.
>>>
>>> Fixes: 65c9792df600 ("mmu: Introduce XENMEM_claim_pages (subop of memory
>>> ops)")
>>> Signed-off-by: Roger Pau Monné <roger.pau@xxxxxxxxxx>
>>> ---
>>> Changes since v2:
>>> - Revert back to the approach in v1.
>>
>> You didn't fully go back to v1. While ...
>>
>>> @@ -548,9 +524,10 @@ int domain_set_outstanding_pages(struct domain *d,
>>> unsigned long pages)
>>> unsigned long claim, avail_pages;
>>>
>>> /*
>>> - * take the domain's page_alloc_lock, else all d->tot_page adjustments
>>> - * must always take the global heap_lock rather than only in the much
>>> - * rarer case that d->outstanding_pages is non-zero
>>> + * Two locks are needed here:
>>> + * - d->page_alloc_lock: protects accesses to
>>> d->{tot,max,extra}_pages.
>>> + * - heap_lock: protects accesses to d->outstanding_pages,
>>> total_avail_pages
>>> + * and outstanding_claims.
>>> */
>>> nrspin_lock(&d->page_alloc_lock);
>>> spin_lock(&heap_lock);
>>
>> ... the comment improvement is of course okay to keep, ...
>>
>>> @@ -999,7 +976,7 @@ static struct page_info *alloc_heap_pages(
>>> {
>>> nodeid_t node;
>>> unsigned int i, buddy_order, zone, first_dirty;
>>> - unsigned long request = 1UL << order;
>>> + unsigned int request = 1UL << order;
>>
>> ... this I'm less certain about (and if it was to be kept, it should also
>> become 1U). For one, this bounds check:
>>
>> if ( (outstanding_claims + request > total_avail_pages) &&
>>
>> ends up still being okay (perhaps except on Arm32, but the wrapping issue
>> there is pre-existing, albeit possibly benign due to other constraints),
>> but just because outstanding_claims is "long" (and it's unclear to me why
>> it isn't "unsigned long").
>>
>> And then, what exactly is it that you want the more narrow type for (the
>> description says nothing in that regard)? The other relevant uses of the
>> variable are
>>
>> avail[node][zone] -= request;
>> total_avail_pages -= request;
>>
>> where both avail[][] and total_avail_pages are (unsigned) long (again
>> unclear to me why for total_avail_pages it's plain long).
>>
>> Oh, wait, it is ...
>>
>>> @@ -1071,6 +1050,30 @@ static struct page_info *alloc_heap_pages(
>>> total_avail_pages -= request;
>>> ASSERT(total_avail_pages >= 0);
>>>
>>> + if ( d && d->outstanding_pages && !(memflags & MEMF_no_refcount) )
>>> + {
>>> + /*
>>> + * Adjust claims in the same locked region where total_avail_pages
>>> is
>>> + * adjusted, not doing so would lead to a window where the amount
>>> of
>>> + * free memory (avail - claimed) would be incorrect.
>>> + *
>>> + * Note that by adjusting the claimed amount here it's possible for
>>> + * pages to fail to be assigned to the claiming domain while
>>> already
>>> + * having been subtracted from d->outstanding_pages. Such claimed
>>> + * amount is then lost, as the pages that fail to be assigned to
>>> the
>>> + * domain are freed without replenishing the claim. This is fine
>>> given
>>> + * claims are only to be used during physmap population as part of
>>> + * domain build, and any failure in assign_pages() there will
>>> result in
>>> + * the domain being destroyed before creation is finished. Losing
>>> part
>>> + * of the claim makes no difference.
>>> + */
>>> + unsigned int outstanding = min(d->outstanding_pages, request);
>>
>> ... the desire to avoid use of min_t() here which wants "request" to be
>> "unsigned int". At some point we'll want to change the struct domain fields
>> to unsigned long anyway, at which point the above would need adjustment. It's
>> well possible that such an adjustment would end up being to then use min_t().
>> Imo we'd be better off using e.g.
>>
>> unsigned int outstanding = min(d->outstanding_pages + 0UL, request);
>>
>> or even
>>
>> typeof(d->outstanding_pages) outstanding =
>> min(d->outstanding_pages + 0UL, request);
>>
>> right away. In the latter case the decl wouldn't even need touching when the
>> struct domain fields are promoted.
>
> My preference would be:
>
> unsigned long outstanding = min(d->outstanding_pages + 0UL, request);
>
> If that's fine with you.
It is.
>>> + BUG_ON(outstanding > d->outstanding_pages);
>>
>> Unlike in v1, where the min() was different, this is now dead code.
>
> Oh, I need to adjust this so it's outstanding > outstanding_claims
> instead.
And then:
Reviewed-by: Jan Beulich <jbeulich@xxxxxxxx>
Jan
|
![]() |
Lists.xenproject.org is hosted with RackSpace, monitoring our |