
Re: [PATCH v3 12/20] xen/riscv: implement p2m_set_range()


  • To: Oleksii Kurochko <oleksii.kurochko@xxxxxxxxx>
  • From: Jan Beulich <jbeulich@xxxxxxxx>
  • Date: Fri, 15 Aug 2025 14:50:37 +0200
  • Cc: Alistair Francis <alistair.francis@xxxxxxx>, Bob Eshleman <bobbyeshleman@xxxxxxxxx>, Connor Davis <connojdavis@xxxxxxxxx>, Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Anthony PERARD <anthony.perard@xxxxxxxxxx>, Michal Orzel <michal.orzel@xxxxxxx>, Julien Grall <julien@xxxxxxx>, Roger Pau Monné <roger.pau@xxxxxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>, xen-devel@xxxxxxxxxxxxxxxxxxxx
  • Delivery-date: Fri, 15 Aug 2025 12:50:47 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 15.08.2025 11:52, Oleksii Kurochko wrote:
> On 8/5/25 6:04 PM, Jan Beulich wrote:
>> On 31.07.2025 17:58, Oleksii Kurochko wrote:
>>> This patch introduces p2m_set_range() and its core helper p2m_set_entry() for
>> Nit: This patch doesn't introduce p2m_set_range(); it merely fleshes it out.
>>
>>> --- a/xen/arch/riscv/include/asm/p2m.h
>>> +++ b/xen/arch/riscv/include/asm/p2m.h
>>> @@ -7,11 +7,13 @@
>>>   #include <xen/rwlock.h>
>>>   #include <xen/types.h>
>>>   
>>> +#include <asm/page.h>
>>>   #include <asm/page-bits.h>
>>>   
>>>   extern unsigned int p2m_root_order;
>>>   #define P2M_ROOT_ORDER  p2m_root_order
>>>   #define P2M_ROOT_PAGES  BIT(P2M_ROOT_ORDER, U)
>>> +#define P2M_ROOT_LEVEL  HYP_PT_ROOT_LEVEL
>> I think I commented on this before, and I would have hoped for at least
>> a remark in the description to appear (perhaps even a comment here): It's
>> okay(ish) to tie these together for now, but in the longer run I don't
>> expect this is going to be wanted. If e.g. we ran Xen in Sv57 mode, there
>> would be no reason at all to force all P2Ms to use 5 levels of page tables.
> 
> Do you mean that for the G-stage any SvXX mode could be chosen, to limit
> the number of page tables necessary for the G-stage? If yes, then I agree
> that at least a comment should be added or, probably, a
> "#warning optimize the number of p2m root levels for MMU mode > Sv48"
> (or maybe >=).

Yes.

> Or do you mean that if we set hgatp.mode=Sv57 it is possible to limit the
> number of page-table levels in use? In that case I think the hardware will
> still expect to see 5 levels of page tables.

No.
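
To illustrate the former: the G-stage paging mode, and hence the P2M root
level, can be chosen via hgatp.MODE independently of the paging mode Xen uses
for its own page tables. A minimal sketch (gstage_mode and p2m_root_level()
are made-up names, not something this patch needs to introduce; the MODE
values are the ones from the privileged spec):

/*
 * Sketch only: choose the G-stage paging mode (and hence the P2M root
 * level) separately from the mode Xen itself runs in.
 */
#define HGATP_MODE_SV39X4  8UL
#define HGATP_MODE_SV48X4  9UL
#define HGATP_MODE_SV57X4  10UL

static unsigned long gstage_mode = HGATP_MODE_SV48X4;

static inline unsigned int p2m_root_level(void)
{
    /* Sv39x4 -> 2, Sv48x4 -> 3, Sv57x4 -> 4 (0-based root level). */
    return (gstage_mode - HGATP_MODE_SV39X4) + 2;
}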

>>> +static inline void p2m_write_pte(pte_t *p, pte_t pte, bool clean_pte)
>>> +{
>>> +    write_pte(p, pte);
>>> +    if ( clean_pte )
>>> +        clean_dcache_va_range(p, sizeof(*p));
>> Not necessarily for right away, but if multiple adjacent PTEs are
>> written without releasing the lock, this then-redundant cache flushing
>> can be a performance issue.
> 
> Can't it be resolved on the caller side? Something like:
>    p2m_write_pte(p1, pte1, false);
>    p2m_write_pte(p2, pte2, false);
>    p2m_write_pte(p3, pte3, false);
>    p2m_write_pte(p4, pte4, true);
> where p1-p4 are adjacent.

No. You wouldn't know whether the last write flushes what the earlier
three have written. There may be a cacheline boundary in between. Plus
I didn't really think of back-to-back writes, but e.g. a loop doing
many of them, where a single wider flush may then be more efficient.
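
For illustration only (a sketch, with p2m_write_ptes() being a made-up name,
not a request for this exact shape): a run of adjacent writes could then be
followed by one wider clean, e.g.

static inline void p2m_write_ptes(pte_t *p, const pte_t *ptes,
                                  unsigned int nr, bool clean_pte)
{
    unsigned int i;

    for ( i = 0; i < nr; i++ )
        write_pte(&p[i], ptes[i]);

    /* A single clean over the range covers every cacheline touched. */
    if ( clean_pte )
        clean_dcache_va_range(p, nr * sizeof(*p));
}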

>>> +#define P2M_TABLE_MAP_NONE 0
>>> +#define P2M_TABLE_MAP_NOMEM 1
>>> +#define P2M_TABLE_SUPER_PAGE 2
>>> +#define P2M_TABLE_NORMAL 3
>>> +
>>> +/*
>>> + * Take the currently mapped table, find the corresponding the entry
>>> + * corresponding to the GFN, and map the next table, if available.
>> Nit: Double "corresponding".
>>
>>> + * The previous table will be unmapped if the next level was mapped
>>> + * (e.g P2M_TABLE_NORMAL returned).
>>> + *
>>> + * `alloc_tbl` parameter indicates whether intermediate tables should
>>> + * be allocated when not present.
>>> + *
>>> + * Return values:
>>> + *  P2M_TABLE_MAP_NONE: a table allocation isn't permitted.
>>> + *  P2M_TABLE_MAP_NOMEM: allocating a new page failed.
>>> + *  P2M_TABLE_SUPER_PAGE: The next entry points to a superpage.
>>> + *  P2M_TABLE_NORMAL: next level or leaf mapped normally.
>>> + */
>>> +static int p2m_next_level(struct p2m_domain *p2m, bool alloc_tbl,
>>> +                          unsigned int level, pte_t **table,
>>> +                          unsigned int offset)
>>> +{
>>> +    panic("%s: hasn't been implemented yet\n", __func__);
>>> +
>>> +    return P2M_TABLE_MAP_NONE;
>>> +}
>>> +
>>> +/* Free pte sub-tree behind an entry */
>>> +static void p2m_free_subtree(struct p2m_domain *p2m,
>>> +                             pte_t entry, unsigned int level)
>>> +{
>>> +    panic("%s: hasn't been implemented yet\n", __func__);
>>> +}
>>> +
>>> +/*
>>> + * Insert an entry in the p2m. This should be called with a mapping
>>> + * equal to a page/superpage.
>>> + */
>>> +static int p2m_set_entry(struct p2m_domain *p2m,
>>> +                           gfn_t gfn,
>>> +                           unsigned long page_order,
>>> +                           mfn_t mfn,
>>> +                           p2m_type_t t)
>>> +{
>>> +    unsigned int level;
>>> +    unsigned int target = page_order / PAGETABLE_ORDER;
>>> +    pte_t *entry, *table, orig_pte;
>>> +    int rc;
>>> +    /* A mapping is removed if the MFN is invalid. */
>>> +    bool removing_mapping = mfn_eq(mfn, INVALID_MFN);
>> Comment and code don't fit together. Many MFNs are invalid (any for which
>> mfn_valid() returns false), yet you only check for INVALID_MFN here.
> 
> Probably it makes sense to add an ASSERT() here for the case when
> mfn_valid(mfn) is false but the MFN is not explicitly equal to INVALID_MFN.
> This would indicate that someone attempted to perform a mapping with an
> incorrect MFN, which, IMO, is entirely wrong.

No, and we've been there before. MMIO can live anywhere, and mappings for
such still will need to be permitted. It is correct to check only for
INVALID_MFN here imo; it's just the comment which also needs to reflect
that.
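
I.e. maybe something along these lines (wording merely a suggestion):

    /*
     * A mapping is removed only if the caller passes INVALID_MFN.  MFNs
     * for which mfn_valid() is false (e.g. MMIO ones) still need to be
     * mappable, hence no other "invalid" MFN is special-cased here.
     */
    bool removing_mapping = mfn_eq(mfn, INVALID_MFN);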

>>> +    /*
>>> +     * If we are here with level > target, we must be at a leaf node,
>>> +     * and we need to break up the superpage.
>>> +     */
>>> +    if ( level > target )
>>> +    {
>>> +        panic("Shattering isn't implemented\n");
>>> +    }
>>> +
>>> +    /*
>>> +     * We should always be there with the correct level because all the
>>> +     * intermediate tables have been installed if necessary.
>>> +     */
>>> +    ASSERT(level == target);
>>> +
>>> +    orig_pte = *entry;
>>> +
>>> +    if ( removing_mapping )
>>> +        p2m_clean_pte(entry, p2m->clean_pte);
>>> +    else
>>> +    {
>>> +        pte_t pte = p2m_pte_from_mfn(mfn, t);
>>> +
>>> +        p2m_write_pte(entry, pte, p2m->clean_pte);
>>> +
>>> +        p2m->max_mapped_gfn = gfn_max(p2m->max_mapped_gfn,
>>> +                                      gfn_add(gfn, BIT(page_order, UL) - 1));
>>> +        p2m->lowest_mapped_gfn = gfn_min(p2m->lowest_mapped_gfn, gfn);
>>> +    }
>>> +
>>> +    p2m->need_flush = true;
>>> +
>>> +    /*
>>> +     * Currently, the infrastructure required to enable CONFIG_HAS_PASSTHROUGH
>>> +     * is not ready for RISC-V support.
>>> +     *
>>> +     * When CONFIG_HAS_PASSTHROUGH=y, iommu_iotlb_flush() should be done
>>> +     * here.
>>> +     */
>>> +#ifdef CONFIG_HAS_PASSTHROUGH
>>> +#   error "add code to flush IOMMU TLB"
>>> +#endif
>>> +
>>> +    rc = 0;
>>> +
>>> +    /*
>>> +     * Free the entry only if the original pte was valid and the base
>>> +     * is different (to avoid freeing when permission is changed).
>>> +     */
>>> +    if ( pte_is_valid(orig_pte) &&
>>> +         !mfn_eq(pte_get_mfn(*entry), pte_get_mfn(orig_pte)) )
>> I'm puzzled by this 2nd check: A permission change would - I expect - only
>> occur to a leaf entry. If the new entry is a super-page one and the old
>> wasn't, don't you still need to free the sub-tree, no matter whether the
>> MFNs are the same?
> 
> I expect the MFNs to differ in this scenario, so the old sub-tree will be 
> freed.

You expecting something isn't a good criterion. If it's possible, even if
unexpected (by you), it needs dealing with correctly.

> Based on your example (new entry is a super-page and the old one isn't):
> For the old mapping (let's say a 4 KiB leaf) p2m_set_entry() walks all
> levels down to L0, so we will have the following MMU page table walk:
>    L2 PTE -> L1 PTE (MFN of L0 page table) -> L0 PTE -> RAM
> 
> When the new mapping (let's say a 2 MiB superpage) is requested,
> p2m_set_entry() will stop at L1 (the superpage level):
>   L2 PTE -> L1 PTE (at this moment, the L1 PTE points to the L0 page table,
>                     which points to RAM)
> Then the old L1 PTE is saved in 'orig_pte', and 'entry' is written with
> the RAM MFN for the 2 MiB mapping. The walk becomes:
>    L2 PTE -> L1 PTE -> RAM
> 
> Therefore, 'entry' now holds an MFN pointing to RAM (superpage leaf).
> 'orig_pte' still holds the MFN pointing to the L0 table (the old sub-tree).
> Since these MFNs differ, the code calls p2m_free_subtree(p2m, orig_pte, …)
> and frees the old L0 sub-tree.

A particular example doesn't help. All possible cases need handling correctly.

>>   Plus consider the special case of MFN 0: If you clear
>> an entry using MFN 0, you will find the old and new PTEs both having the
>> same MFN.
> 
> Doesn't this happen only when a mapping removal is explicitly requested?
> In the case of a mapping removal it seems to me it is enough to just
> clear the PTE to all zeroes.

Correct. Which means original MFN (PPN) and new MFN (PPN) would match.
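
Just to sketch a direction (pte_is_mapping() is an assumed predicate here,
whether something like it is available at this point is an open question):
the freeing condition would then also need to cover the table -> superpage
and the MFN-0 removal cases, e.g.

    if ( pte_is_valid(orig_pte) &&
         (removing_mapping ||
          !pte_is_mapping(orig_pte, level) ||
          !mfn_eq(pte_get_mfn(*entry), pte_get_mfn(orig_pte))) )
        p2m_free_subtree(p2m, orig_pte, level);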

Jan



 

