
Re: [Xen-devel] [PATCH v2 15/30] xen/x86: populate PVHv2 Dom0 physical memory map



On Tue, Oct 04, 2016 at 05:16:09AM -0600, Jan Beulich wrote:
> >>> On 04.10.16 at 11:12, <roger.pau@xxxxxxxxxx> wrote:
> > On Fri, Sep 30, 2016 at 09:52:56AM -0600, Jan Beulich wrote:
> >> >>> On 27.09.16 at 17:57, <roger.pau@xxxxxxxxxx> wrote:
> >> > @@ -336,7 +343,8 @@ static unsigned long __init compute_dom0_nr_pages(
> >> >          avail -= dom0_paging_pages(d, nr_pages);
> >> >      }
> >> >  
> >> > -    if ( (parms->p2m_base == UNSET_ADDR) && (dom0_nrpages <= 0) &&
> >> > +    if ( is_pv_domain(d) &&
> >> > +         (parms->p2m_base == UNSET_ADDR) && (dom0_nrpages <= 0) &&
> >> 
> >> Perhaps better to simply force parms->p2m_base to UNSET_ADDR
> >> earlier on?
> > 
> > p2m_base is already unconditionally set to UNSET_ADDR for PVHv2 domains, 
> > hence the added is_pv_domain check in order to make sure that PVHv2 guests 
> > don't get into that branch, which AFAICT is only relevant to PV guests.
> 
> This reads contradictory: If it's set to UNSET_ADDR, why the extra
> check?

On PVHv2 p2m_base == UNSET_ADDR already, so the extra check is to make sure 
PVHv2 guests don't execute the code inside the if condition, which is only 
relevant to classic PV guests (note that the condition tests for == UNSET_ADDR, 
not != UNSET_ADDR, so PVHv2 would otherwise match it). Or maybe I'm missing 
something?
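
To spell it out, the intended shape is (a sketch only, with the rest of the
condition and the branch body elided):

    /*
     * PVHv2 domains always have p2m_base set to UNSET_ADDR, so without the
     * is_pv_domain() check they would wrongly take this PV-only branch.
     */
    if ( is_pv_domain(d) &&
         (parms->p2m_base == UNSET_ADDR) && (dom0_nrpages <= 0) /* && ... */ )
    {
        /* PV-only adjustment of nr_pages (elided). */
    }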

> >> > @@ -1657,15 +1679,238 @@ out:
> >> >      return rc;
> >> >  }
> >> >  
> >> > +/* Populate an HVM memory range using the biggest possible order. */
> >> > +static void __init hvm_populate_memory_range(struct domain *d, uint64_t start,
> >> > +                                             uint64_t size)
> >> > +{
> >> > +    static unsigned int __initdata memflags = MEMF_no_dma|MEMF_exact_node;
> >> > +    unsigned int order;
> >> > +    struct page_info *page;
> >> > +    int rc;
> >> > +
> >> > +    ASSERT(IS_ALIGNED(size, PAGE_SIZE) && IS_ALIGNED(start, PAGE_SIZE));
> >> > +
> >> > +    order = MAX_ORDER;
> >> > +    while ( size != 0 )
> >> > +    {
> >> > +        order = min(get_order_from_bytes_floor(size), order);
> >> > +        page = alloc_domheap_pages(d, order, memflags);
> >> > +        if ( page == NULL )
> >> > +        {
> >> > +            if ( order == 0 && memflags )
> >> > +            {
> >> > +                /* Try again without any memflags. */
> >> > +                memflags = 0;
> >> > +                order = MAX_ORDER;
> >> > +                continue;
> >> > +            }
> >> > +            if ( order == 0 )
> >> > +                panic("Unable to allocate memory with order 0!\n");
> >> > +            order--;
> >> > +            continue;
> >> > +        }
> >> 
> >> Is it not possible to utilize alloc_chunk() here?
> > 
> > Not really: alloc_chunk will do a ceil calculation of the number of needed 
> > pages, which means we'd end up allocating bigger chunks than needed and then 
> > freeing them. I prefer this approach, since we always request exactly the 
> > memory that's required, so there's no need to free leftover pages.
> 
> Hmm, in that case at least some of the shared logic would be nice to
> get abstracted out.

TBH, I don't see much benefit in that. alloc_chunk works fine for PV guests 
because Xen ends up walking the list of returned pages and mapping them one by 
one. IMHO that's not what should be done for PVH guests; instead the caller 
needs to know the actual order of the allocated chunk, so it can pass it to 
guest_physmap_add_page. ATM it's a very simple function that's easy to follow; 
if I start mixing in bits of alloc_chunk it's going to get more complex, and I 
would like to avoid that, so that the PVH Dom0 build doesn't end up like the 
current PV Dom0 build.
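
For context, the remainder of the loop body consumes that order more or less
like this (a sketch, not the exact hunk; it assumes guest_physmap_add_page()
taking a gfn, an mfn and a page order, and trims the error handling):

        /*
         * Map the chunk at the next guest frame using the order the
         * allocation actually succeeded with, then advance the range.
         */
        rc = guest_physmap_add_page(d, _gfn(PFN_DOWN(start)),
                                    _mfn(page_to_mfn(page)), order);
        if ( rc != 0 )
            panic("Failed to populate memory at %#" PRIx64 ": %d\n",
                  start, rc);

        start += PAGE_SIZE << order;
        size -= PAGE_SIZE << order;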

> >> > +        if ( (size & 0xffffffff) == 0 )
> >> > +            process_pending_softirqs();
> >> 
> >> That's 4Gb at a time - isn't that a little too much?
> > 
> > Hm, it's the same that's used in pvh_add_mem_mapping AFAICT. I could reduce 
> > it to 0xfffffff, but I'm also wondering if it makes sense to just call it on 
> > each iteration, since we are possibly mapping regions with big orders here.
> 
> Iteration count is all that matters here really; the size of the mapping
> wouldn't normally matter (as long as it's one of the hardware supported sizes).
> Doing the check on every iteration may be a little much (you may want
> to check whether there's noticeable extra overhead), but doing the
> check on like every 64 iterations may limit overhead enough to be
> acceptable without making this more sophisticated.

Ack, I've done it using a define, so we can always change that 64 to 
something else if the need arises.
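
I.e. roughly the following (the define's name here is only illustrative, not
necessarily what will end up in the patch):

#define SOFTIRQ_ITER_INTERVAL 64   /* illustrative name */

    unsigned int i = 0;

    while ( size != 0 )
    {
        /* ... allocate and map a chunk as in the hunk above ... */

        /* Avoid hogging the CPU: drain softirqs every 64 iterations. */
        if ( (++i % SOFTIRQ_ITER_INTERVAL) == 0 )
            process_pending_softirqs();
    }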

> >> > +static int __init hvm_setup_p2m(struct domain *d)
> >> > +{
> >> > +    struct vcpu *saved_current, *v = d->vcpu[0];
> >> > +    unsigned long nr_pages;
> >> > +    int i, rc, preempted;
> >> > +
> >> > +    printk("** Preparing memory map **\n");
> >> 
> >> Debugging leftover again?
> > 
> > Should this be merged with the next label, so that it reads as "Preparing 
> > and populating the memory map"?
> 
> No, I'd rather see the other(s) gone too.

Done, I've just left the initial "Building a PVH Dom0" message.
 
> >> > +    /*
> >> > +     * Subtract one page for the EPT identity page table and two pages
> >> > +     * for the MADT replacement.
> >> > +     */
> >> > +    nr_pages = compute_dom0_nr_pages(d, NULL, 0) - 3;
> >> 
> >> How do you know the MADT replacement requires two pages? Isn't
> >> that CPU-count dependent? And doesn't the partial page used for
> >> the TSS also need accounting for here?
> > 
> > Yes, it's CPU-count dependent. This is just an approximation: since we only 
> > support up to 256 CPUs on HVM guests, and each Processor Local APIC entry is 
> > 8 bytes, the CPU-related data is going to use up to 2048 bytes, which still 
> > leaves plenty of space for the IO APIC and the Interrupt Source Override 
> > entries. We request two pages in case the original MADT crosses a page 
> > boundary. FWIW, I could also fetch the original MADT size earlier and use 
> > that as the upper bound here for memory reservation.
> 
> That wouldn't help in case someone wants more vCPU-s than there
> are pCPU-s. And baking in another assumption of there being <=
> 128 vCPU-s when there's already work being done to eliminate that
> limit is likely not too good an idea.

Right, I've now removed all this, since for the MADT we are going to steal 
RAM pages.
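
(For the record, the arithmetic behind the now-dropped estimate, using the
numbers above:

    256 vCPUs * 8 bytes per Processor Local APIC entry = 2048 bytes,
    which fits in a single 4096-byte page together with the IO APIC and
    Interrupt Source Override entries; 2 pages allow for the MADT crossing
    a page boundary, plus 1 page for the EPT identity page table, giving
    the "- 3" in the hunk above.)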

> > In the RFC series we also spoke about placing the MADT in a different 
> > position than the native one, which would mean that we would end up stealing 
> > some space from a RAM region in order to place it, so that we wouldn't have 
> > to do this accounting.
> 
> Putting the new MADT at the same address as the old one won't work
> anyway, again because possibly vCPU-s > pCPU-s.

Yes, although from my tests CPU over-allocation on HVM guests behaves very 
badly.

> >> > +    process_pending_softirqs();
> >> 
> >> Why, outside of any loop?
> > 
> > It's the same that's done in construct_dom0_pv, so I thought it was a good 
> > idea to drain any pending softirqs before starting the domain build.
> 
> Perhaps in that case it should be pulled out of there into the
> wrapper?

Yes, good point.
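
Something along these lines in the common entry point then (a sketch; the PVH
builder's name and the exact shape of the wrapper are placeholders):

int __init construct_dom0(struct domain *d, const module_t *image,
                          unsigned long image_headroom, module_t *initrd,
                          void *(*bootstrap_map)(const module_t *),
                          char *cmdline)
{
    /* Drain pending softirqs once here rather than in each builder. */
    process_pending_softirqs();

    return is_hvm_domain(d) ?
           construct_dom0_hvm(d, image, image_headroom, initrd,
                              bootstrap_map, cmdline) :
           construct_dom0_pv(d, image, image_headroom, initrd,
                             bootstrap_map, cmdline);
}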

Thanks, Roger.


 

