[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [RFC 18/22] xen/arm: p2m: Introduce p2m_set_entry and __p2m_set_entry



Hi Stefano,

On 06/09/16 02:08, Stefano Stabellini wrote:
On Thu, 28 Jul 2016, Julien Grall wrote:
The ARM architecture mandates to use of a break-before-make sequence
when changing translation entries if the page table is shared between
multiple CPUs whenever a valid entry is replaced by another valid entry
(see D4.7.1 in ARM DDI 0487A.j for more details).

The break-before-make sequence can be divided in the following steps:
    1) Invalidate the old entry in the page table
    2) Issue a TLB invalidation instruction for the address associated
    to this entry
    3) Write the new entry

The current P2M code implemented in apply_one_level does not respect
this sequence and may result to break coherency on some processors.

Adapting the current implementation to use the break-before-make
sequence would imply some code duplication and more TLBs invalidation
than necessary. For instance, if we are replacing a 4KB page and the
current mapping in the P2M is using a 1GB superpage, the following steps
will happen:
    1) Shatter the 1GB superpage into a series of 2MB superpages
    2) Shatter the 2MB superpage into a series of 4KB pages
    3) Replace the 4KB page

As the current implementation is shattering while descending and install
the mapping, Xen would need to issue 3 TLB invalidation instructions
which is clearly inefficient.

Furthermore, all the operations which modify the page table are using
the same skeleton. It is more complicated to maintain different code paths
than having a generic function that set an entry and take care of the
break-before-make sequence.

The new implementation is based on the x86 EPT one which, I think,
fits quite well for the break-before-make sequence whilst keeping
the code simple.

The main function of the new implementation is __p2m_get_entry. It will
only work on mapping that are aligned to a block entry in the page table
(i.e 1GB, 2MB, 4KB when using a 4KB granularity).

Another function, p2m_get_entry, is provided to break down is region
into mapping that is aligned to a block entry.

Note that to keep this patch "small", there are no caller of those
functions added in this patch (they will be added in follow-up patches).

Signed-off-by: Julien Grall <julien.grall@xxxxxxx>

---
    I need to find the impact of this new implementation on ARM32 because
    the domheap is not always mapped. This means that Xen needs to map/unmap
    everytime the page associated to the page table. It might be possible
    to re-use some caching structure as the current implementation does
    or rework the way a domheap is mapped/unmapped.

    Also, this code still contain few TODOs mostly to add sanity
    checks and few optimization. The IOMMU is not yet supported.
---
 xen/arch/arm/p2m.c | 335 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 335 insertions(+)

diff --git a/xen/arch/arm/p2m.c b/xen/arch/arm/p2m.c
index c93e554..297b176 100644
--- a/xen/arch/arm/p2m.c
+++ b/xen/arch/arm/p2m.c
@@ -750,6 +750,341 @@ static void p2m_put_l3_page(mfn_t mfn, p2m_type_t type)
     }
 }

+#if 0
+/* Free lpae sub-tree behind an entry */
+static void p2m_free_entry(struct p2m_domain *p2m,
+                           lpae_t entry, unsigned int level)
+{
+    unsigned int i;
+    lpae_t *table;
+    mfn_t mfn;
+
+    /* Nothing to do if the entry is invalid or a super-page */

Why are we not freeing superpages? It looks like this function could be
correctly called on a superpage (with level == target).

The p2m code on ARM does not take any reference on mapping except for foreign one. So there is nothing to do in the case of superpage (it is not possible to have foreign superpage).



+    if ( !p2m_valid(entry) || p2m_is_superpage(entry, level) )
+        return;
+
+    if ( level == 3 )
+    {
+        p2m_put_l3_page(_mfn(entry.p2m.base), entry.p2m.type);
+        return;
+    }
+
+    table = map_domain_page(_mfn(entry.p2m.base));
+    for ( i = 0; i < LPAE_ENTRIES; i++ )
+        p2m_free_entry(p2m, *(table + i), level + 1);
+
+    unmap_domain_page(table);
+
+    /*
+     * Make sure all the references in the TLB have been removed before
+     * freing the intermediate page table.
+     * XXX: Should we defer the free of the page table to avoid the
+     * flush?

Yes, I would probably move the p2m flush out of this function.

We would need to also defer the free of the intermediate table (see free_domheap_page below) that means keeping track of the pages to free later.

The flush here will only happen once and iff a sub-tree is removed (i.e unlikely in most of the cases).

So I am not sure it is worth to keep a list of page to free later. Any opinions?



+     */
+    if ( p2m->need_flush )
+        p2m_flush_tlb_sync(p2m);
+
+    mfn = _mfn(entry.p2m.base);
+    ASSERT(mfn_valid(mfn_x(mfn)));
+
+    free_domheap_page(mfn_to_page(mfn_x(mfn)));
+}
+
+static bool p2m_split_superpage(struct p2m_domain *p2m, lpae_t *entry,
+                                unsigned int level, unsigned int target,
+                                const unsigned int *offsets)
+{
+    struct page_info *page;
+    unsigned int i;
+    lpae_t pte, *table;
+    bool rv = true;
+
+    /* Convenience aliases */
+    p2m_type_t t = entry->p2m.type;
+    mfn_t mfn = _mfn(entry->p2m.base);
+
+    /* Convenience aliases */
+    unsigned int next_level = level + 1;
+    unsigned int level_order = level_orders[next_level];
+
+    /*
+     * This should only be called with target != level and the entry is
+     * a superpage.
+     */
+    ASSERT(level < target);
+    ASSERT(p2m_is_superpage(*entry, level));
+
+    page = alloc_domheap_page(NULL, 0);
+    if ( !page )
+        return false;
+
+    page_list_add(page, &p2m->pages);
+    table = __map_domain_page(page);
+
+    /*
+     * We are either splitting a first level 1G page into 512 second level
+     * 2M pages, or a second level 2M page into 512 third level 4K pages.
+     */
+    for ( i = 0; i < LPAE_ENTRIES; i++ )
+    {
+        lpae_t *new_entry = table + i;
+
+        pte = mfn_to_p2m_entry(mfn, t, p2m->default_access);
+
+        mfn = mfn_add(mfn, (1UL << level_order));
+
+        /*
+         * First and second level pages set p2m.table = 0, but third
+         * level entries set p2m.table = 1.
+         */
+        if ( next_level < 3 )
+            pte.p2m.table = 0;
+
+        write_pte(new_entry, pte);
+    }
+
+    /*
+     * Shatter superpage in the page to the level we want to make the
+     * changes.
+     * This is done outside the loop to avoid checking the offset to
+     * know whether the entry should be shattered for every entry.
+     */
+    if ( next_level != target )
+        rv = p2m_split_superpage(p2m, table + offsets[next_level],
+                                 level + 1, target, offsets);
+
+    if ( p2m->clean_pte )
+        clean_dcache_va_range(table, PAGE_SIZE);
+
+    unmap_domain_page(table);
+
+    pte = mfn_to_p2m_entry(_mfn(page_to_mfn(page)), p2m_invalid,
+                           p2m->default_access);
+
+    p2m_write_pte(entry, pte, p2m->clean_pte);
+
+    /*
+     * Even if we failed, we should install the newly allocated LPAE
+     * entry. The caller will be in charge to free the sub-tree.
+     * XXX: See if we can free entry here.
+     */
+    *entry = pte;

Isn't the call to p2m_write_pte just above enough?

Hmmm yes. And it is wrong because not safe. I will drop it.



+
+    return rv;
+}
+
+/*
+ * Insert an entry in the p2m. This should be called with a mapping
+ * equal to a page/superpage (4K, 2M, 1G).
+ */
+static int __p2m_set_entry(struct p2m_domain *p2m,
+                           gfn_t sgfn,
+                           unsigned int page_order,
+                           mfn_t smfn,
+                           p2m_type_t t,
+                           p2m_access_t a)
+{
+    paddr_t addr = pfn_to_paddr(gfn_x(sgfn));
+    unsigned int level = 0;
+    unsigned int target = 3 - (page_order / LPAE_SHIFT);
+    lpae_t *entry, *table, orig_pte;
+    int rc;
+
+    /* Convenience aliases */
+    const unsigned int offsets[4] = {
+        zeroeth_table_offset(addr),
+        first_table_offset(addr),
+        second_table_offset(addr),
+        third_table_offset(addr)
+    };
+
+    /* TODO: Check the validity for the address */
+
+    ASSERT(p2m_is_write_locked(p2m));
+
+    /*
+     * Check if the level target is valid: we only support
+     * 4K - 2M - 1G mapping.
+     */
+    ASSERT(target > 0 && target <= 3);
+
+    table = p2m_get_root_pointer(p2m, sgfn);
+    if ( !table )
+        return -EINVAL;
+
+    for ( level = P2M_ROOT_LEVEL; level < target; level++ )
+    {
+        rc = p2m_next_level(p2m, false, &table, offsets[level]);

If smfn is INVALID_MFN, does it make sense to create intermediate table
entries?

I asked myself the same question at the beginning and decided to avoid caring about it. The caller is supposed to remove a page that has a entry in the p2m, so for me it is a caller mistake.

But I agree, that we allocate intermediate table entries. I will give a look if I can find an non-intrusive way to do it.



+        if ( rc == GUEST_TABLE_MAP_FAILED )
+        {
+            rc = -ENOENT;
+            goto out;
+        }
+        else if ( rc != GUEST_TABLE_NORMAL_PAGE )
+            break;
+    }
+
+    entry = table + offsets[level];
+
+    /*
+     * If we are here with level < target, we must be at a leaf node,
+     * and we need to break up the superpage.
+     */
+    if ( level < target )
+    {
+        /* We need to split the original page. */
+        lpae_t split_pte = *entry;
+
+        ASSERT(p2m_is_superpage(*entry, level));
+
+        if ( !p2m_split_superpage(p2m, &split_pte, level, target, offsets) )
+        {
+            p2m_free_entry(p2m, split_pte, level);
+            rc = -ENOMEM;
+            goto out;
+        }
+
+        /*
+         * Follow the break-before-sequence to update the entry.
+         * For more details see (D4.7.1 in ARM DDI 0487A.j).
+         * XXX: Can we flush by address?
+         */
+        p2m_remove_pte(entry, p2m->clean_pte);
+        p2m_flush_tlb_sync(p2m);
+
+        p2m_write_pte(entry, split_pte, p2m->clean_pte);
+
+        /* then move to the level we want to make real changes */
+        for ( ; level < target; level++ )
+        {
+            rc = p2m_next_level(p2m, true, &table, offsets[level]);
+
+            /*
+             * The entry should be found and either be a table
+             * or a superpage if level 3 is not targeted
+             */
+            ASSERT(rc == GUEST_TABLE_NORMAL_PAGE ||
+                   (rc == GUEST_TABLE_SUPER_PAGE && target < 3));
+        }
+        entry = table + offsets[level];
+    }
+
+    /*
+     * We should always be there with the correct level because
+     * all the intermediate tables have been installed if necessary.
+     */
+    ASSERT(level == target);
+
+    orig_pte = *entry;
+
+    /*
+     * The radix-tree can only work on 4KB. This is only used when
+     * memaccess is enabled.

"This" what? This function? The radix-tree?

The radix-tree. I didn't want to repeat the word and though "this" would make sense.

+     */
+    ASSERT(!p2m->mem_access_enabled || page_order == 0);
+    /*
+     * The access type should always be p2m_access_rwx when the mapping
+     * is removed.
+     */
+    ASSERT(!mfn_eq(INVALID_MFN, smfn) || (a == p2m_access_rwx));
+    /*
+     * Update the mem access permission before update the P2M. So we
+     * don't have to revert the mapping if it has failed.
+     */
+    rc = p2m_mem_access_radix_set(p2m, sgfn, a);
+    if ( rc )
+        goto out;
+
+    /*
+     * Always remove the entry in order to follow the break-before-make
+     * sequence when updating the translation table (D4.7.1 in ARM DDI
+     * 0487A.j).
+     */
+    if ( p2m_valid(orig_pte) )
+        p2m_remove_pte(entry, p2m->clean_pte);
+
+    if ( mfn_eq(smfn, INVALID_MFN) )
+        /* Flush can be deferred if the entry is removed */
+        p2m->need_flush |= !!p2m_valid(orig_pte);
+    else
+    {
+        lpae_t pte;
+
+        /*
+         * Flush the TLB before write the new one to keep coherency.
+         * XXX: Can we flush by address?

I was asking myself the same question

It is not trivial. On ARMv7, there is no way to invalidate by IPA, so we still need to do a full flush.

In the case of ARMv8, it is possible to do a flush by IPA with the following sequence (see D4-1739 in ARM DDI 0487A.i):
 tlbi ipa2e1, xt
 dsb
 tlbi vmalle1

So I was wondering if we could leave that for a future optimization.



+         */
+        if ( p2m_valid(orig_pte) )
+            p2m_flush_tlb_sync(p2m);
+
+        pte = mfn_to_p2m_entry(smfn, t, a);
+        if ( level < 3 )
+            pte.p2m.table = 0; /* Superpage entry */
+
+        p2m_write_pte(entry, pte, p2m->clean_pte);
+
+        p2m->max_mapped_gfn = gfn_max(p2m->max_mapped_gfn,
+                                      gfn_add(sgfn, 1 << page_order));
+        p2m->lowest_mapped_gfn = gfn_min(p2m->lowest_mapped_gfn, sgfn);
+    }
+
+    /*
+     * Free the entry only if the original pte was valid and the base
+     * is different (to avoid freeing when permission is changed).
+     */
+    if ( p2m_valid(orig_pte) && entry->p2m.base != orig_pte.p2m.base )
+        p2m_free_entry(p2m, orig_pte, level);
+
+    /* XXX: Flush iommu */

Please do, it would be best if we had iommu flushing in this patch
before committing it.

It was my plan, hence the "RFC". I just wanted to send a first version to get feedback before refining the code.



+    rc = 0;
+
+out:
+    unmap_domain_page(table);
+
+    return rc;
+}
+
+static int p2m_set_entry(struct p2m_domain *p2m,
+                         gfn_t sgfn,
+                         unsigned long todo,
+                         mfn_t smfn,
+                         p2m_type_t t,
+                         p2m_access_t a)
+{
+    int rc = 0;
+
+    while ( todo )
+    {
+        unsigned long mask = gfn_x(sgfn) | mfn_x(smfn) | todo;
+        unsigned long order;
+
+        /* Always map 4k by 4k when memaccess is enabled */
+        if ( unlikely(p2m->mem_access_enabled) )
+            order = THIRD_ORDER;

Missing an "else" here

I will fix it in the next version.



+        if ( !(mask & ((1UL << FIRST_ORDER) - 1)) )
+            order = FIRST_ORDER;
+        else if ( !(mask & ((1UL << SECOND_ORDER) - 1)) )
+            order = SECOND_ORDER;
+        else
+            order = THIRD_ORDER;
+
+        rc = __p2m_set_entry(p2m, sgfn, order, smfn, t, a);
+        if ( rc )
+            break;

Shouldn't we be checking that "(1<<order)" doesn't exceed "todo" before
calling __p2m_set_entry? Otherwise we risk creating a mapping bigger
than requested.

The mask is defined as gfn_x(sgfn) | mfn_x(smfn) | todo

So we will never have the order exceeding "todo".



+        sgfn = gfn_add(sgfn, (1 << order));
+        if ( !mfn_eq(smfn, INVALID_MFN) )
+           smfn = mfn_add(smfn, (1 << order));
+
+        todo -= (1 << order);
+    }
+
+    return rc;
+}
+#endif


Regards,

--
Julien Grall

--
Julien Grall

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel

 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.