[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Xen-devel] Re: RFC Patch for the "x86-64, mm: Put early page table high"



On Fri, Apr 29, 2011 at 01:58:33PM -0700, Jeremy Fitzhardinge wrote:
> On 04/29/2011 08:46 AM, Konrad Rzeszutek Wilk wrote:
> > Without a fix for this 2.6.39-rc5 fails during bootup.
> >
> > It fails such early, you only Xen telling you that the domain crashed.
> >
> > There is one patch to fix this, and the last word was here:
> >  http://marc.info/?i=4DAF0ECB.8060009@xxxxxxxx
> >
> > But after nothing had been heard from Peter.
> >
> > So I decided to look at this from a different perspective: why not
> > contain the workaround inside Xen early bootup code.
> >
> > I've tested this code as DomU (1G, 2G, 3G), Dom0 (8G, 4G, 1000M)
> > and they all booted fine. Hadn't compile tested it on 32-bit and
> > I think it will actually complain about it. Let me fix that.
> >
> > See attached patch (also present in stable/bug-fixes-for-rc5) which also
> > has the "xen: mask_rw_pte mark RO all pagetable pages up to pgt_buf_top"
> > integrated in.
> 
> Well, if we can hide the fix away in our code, then that has obvious
> advantages.  But I worry that this change is pretty closely dependent on
> how the other code works, and would be fragile in the face of further
> changes to that code (esp since there's no obvious coupling between the
> two, so anyone changing the arch/x86 code wouldn't have any clues to
> look at the corresponding Xen code).

True. I am hoping to remove this in 2.6.40 though...
> 
> > diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
> > index aef7af9..a54c7c2 100644
> > --- a/arch/x86/xen/mmu.c
> > +++ b/arch/x86/xen/mmu.c
> > @@ -1463,6 +1463,115 @@ static int xen_pgd_alloc(struct mm_struct *mm)
> >     return ret;
> >  }
> >  
> > +#ifdef CONFIG_X86_64
> > +static __initdata u64 __last_pgt_set_rw = 0;
> > +static __initdata u64 __pgt_buf_start = 0;
> > +static __initdata u64 __pgt_buf_end = 0;
> > +static __initdata u64 __pgt_buf_top = 0;
> > +/*
> > + * As a consequence of the commit:
> > + * 
> > + * commit 4b239f458c229de044d6905c2b0f9fe16ed9e01e
> > + * Author: Yinghai Lu <yinghai@xxxxxxxxxx>
> > + * Date:   Fri Dec 17 16:58:28 2010 -0800
> > + * 
> > + *     x86-64, mm: Put early page table high
> > + * 
> > + * at some point init_memory_mapping is going to reach the pagetable pages
> > + * area and map those pages too (mapping them as normal memory that falls
> > + * in the range of addresses passed to init_memory_mapping as argument).
> > + * Some of those pages are already pagetable pages (they are in the range
> > + * pgt_buf_start-pgt_buf_end) therefore they are going to be mapped RO and
> > + * everything is fine.
> > + * Some of these pages are not pagetable pages yet (they fall in the range
> > + * pgt_buf_end-pgt_buf_top; for example the page at pgt_buf_end) so they
> > + * are going to be mapped RW.  When these pages become pagetable pages and
> > + * are hooked into the pagetable, xen will find that the guest has already
> > + * a RW mapping of them somewhere and fail the operation.
> > + * The reason Xen requires pagetables to be RO is that the hypervisor needs
> > + * to verify that the pagetables are valid before using them. The 
> > validation
> > + * operations are called "pinning" (more details in arch/x86/xen/mmu.c).
> > + * 
> > + * In order to fix the issue we mark all the pages in the entire range
> > + * pgt_buf_start-pgt_buf_top as RO, however when the pagetable allocation
> > + * is completed only the range pgt_buf_start-pgt_buf_end is reserved by
> > + * init_memory_mapping. Hence the kernel is going to crash as soon as one
> > + * of the pages in the range pgt_buf_end-pgt_buf_top is reused (b/c those
> > + * ranges are RO).
> > + * 
> > + * For this reason, this function is introduced which is called _after_
> 
> I think name "mark_rw_past_pg" explicitly here, since "this" is somewhat
> ambiguous.

<nods>
> 
> > + * the init_memory_mapping has completed (in a perfect world we would
> > + * call this function from init_memory_mapping, but lets ignore that).
> > + * 
> > + * Because we are called _after_ init_memory_mapping the pgt_buf_[start,
> > + * end,top] have all changed to new values (b/c another init_memory_mapping
> > + * is called). Hence, the first time we enter this function, we save
> > + * away the pgt_buf_start value and update the pgt_buf_[end,top].
> > + * 
> > + * When we detect that the "old" pgt_buf_start through pgt_buf_end
> > + * PFNs have been reserved (so memblock_x86_reserve_range has been called),
> > + * we immediately set out to RW the "old" pgt_buf_end through pgt_buf_top.
> > + * 
> > + * And then we update those "old" pgt_buf_[end|top] with the new ones
> > + * so that we can redo this on the next pagetable.
> > + */
> > +static __init void mark_rw_past_pgt(unsigned long pfn) {
> > +
> > +   if (pfn && pgt_buf_end > pgt_buf_start) {
> > +           u64 addr, size;
> > +
> > +           /* Save it away. */
> > +           if (!__pgt_buf_start) {
> > +                   __pgt_buf_start = pgt_buf_start;
> > +                   __pgt_buf_end = pgt_buf_end;
> > +                   __pgt_buf_top = pgt_buf_top;
> > +                   return;
> > +           }
> > +           /* If we get the range that starts at __pgt_buf_end that means
> > +            * the range is reserved, and that in 'init_memory_mapping'
> > +            * the 'memblock_x86_reserve_range' has been called with the
> > +            * outdated __pgt_buf_start, __pgt_buf_end (the "new"
> > +            * pgt_buf_[start|end|top] refer now to a new pagetable.
> > +            * Note: we are called _after_ the pgt_buf_[..] have been
> > +            * updated.*/
> > +
> > +           addr = 
> > memblock_x86_find_in_range_size(PFN_PHYS(__pgt_buf_start), &size, 
> > PAGE_SIZE);
> > +
> > +           /* Still not reserved, meaning 'memblock_x86_reserve_range'
> > +            * hasn't been called yet. Update the _end and _top.*/
> > +           if (addr == PFN_PHYS(__pgt_buf_start)) {
> > +                   __pgt_buf_end = pgt_buf_end;
> > +                   __pgt_buf_top = pgt_buf_top;
> > +                   return;
> > +           }
> > +
> > +           /* OK, the area is reserved, meaning it is time for us to
> > +            * set RW for the old end->top PFNs. */
> > +
> > +           /* ..unless we had already done this. */
> > +           if (__pgt_buf_end == __last_pgt_set_rw)
> > +                   return;
> > +
> > +           addr = PFN_PHYS(__pgt_buf_end);
> > +           
> > +           /* set as RW the rest */
> > +           printk(KERN_DEBUG "xen: setting RW the range %llx - %llx\n",
> > +                   PFN_PHYS(__pgt_buf_end), PFN_PHYS(__pgt_buf_top));
> > +           
> > +           while (addr < PFN_PHYS(__pgt_buf_top)) {
> > +                   make_lowmem_page_readwrite(__va(addr));
> > +                   addr += PAGE_SIZE;
> > +           }
> > +           /* And update everything so that we are ready for the next
> > +            * pagetable (the one created for regions past 4GB) */
> > +           __last_pgt_set_rw = __pgt_buf_end;
> > +           __pgt_buf_start = pgt_buf_start;
> > +           __pgt_buf_end = pgt_buf_end;
> > +           __pgt_buf_top = pgt_buf_top;
> > +   }
> > +   return;
> > +}
> > +#endif
> >  static void xen_pgd_free(struct mm_struct *mm, pgd_t *pgd)
> >  {
> >  #ifdef CONFIG_X86_64
> > @@ -1488,6 +1597,7 @@ static __init pte_t mask_rw_pte(pte_t *ptep, pte_t 
> > pte)
> >  {
> >     unsigned long pfn = pte_pfn(pte);
> >  
> > +   mark_rw_past_pgt(pfn);
> >     /*
> >      * If the new pfn is within the range of the newly allocated
> >      * kernel pagetable, and it isn't being mapped into an
> > @@ -1495,7 +1605,7 @@ static __init pte_t mask_rw_pte(pte_t *ptep, pte_t 
> > pte)
> >      * it is RO.
> >      */
> >     if (((!is_early_ioremap_ptep(ptep) &&
> > -                   pfn >= pgt_buf_start && pfn < pgt_buf_end)) ||
> > +                   pfn >= pgt_buf_start && pfn < pgt_buf_top)) ||
> >                     (is_early_ioremap_ptep(ptep) && pfn != (pgt_buf_end - 
> > 1)))
> >             pte = pte_wrprotect(pte);
> >  
> > @@ -1997,6 +2107,8 @@ __init void xen_ident_map_ISA(void)
> >  
> >  static __init void xen_post_allocator_init(void)
> >  {
> > +   mark_rw_past_pgt(1);
> 
> Why '1'?  Perhaps this should be a different function rather than
> overloading a single one?

The 'mask_rw_pte' gets called quite often during startup. And a lot of times
for the pte(0), so that was an optimization to not do the memblock search.

But that can as well be expressed in 'mask_rw_pte' by

 if (pfn)
     mark_rw_past_pgt();

Let me do that.

> 
>     J

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.