Re: [Xen-devel] [PATCH 3 of 5] Rework locking in the PoD layer



> On Wed, 2012-02-01 at 19:56 +0000, Andres Lagar-Cavilla wrote:
>> @@ -114,15 +114,20 @@ p2m_pod_cache_add(struct p2m_domain *p2m
>>          unmap_domain_page(b);
>>      }
>>
>> +    /* First, take all pages off the domain list */
>>      lock_page_alloc(p2m);
>> -
>> -    /* First, take all pages off the domain list */
>>      for(i=0; i < 1 << order ; i++)
>>      {
>>          p = page + i;
>>          page_list_del(p, &d->page_list);
>>      }
>>
>> +    /* Ensure that the PoD cache has never been emptied.
>> +     * This may cause "zombie domains" since the page will never be freed. */
>> +    BUG_ON( d->arch.relmem != RELMEM_not_started );
>> +
>> +    unlock_page_alloc(p2m);
>> +
>
> This assert should stay where it is.  The idea is to verify
> after-the-fact that the page_list_add_tail()s *have not* raced with
> emptying the PoD cache.  Having the assert before cannot guarantee that
> they *will not* race with emptying the PoD cache.  Alternately, if we're
> positive that condition can never happen again, we should just remove
> the BUG_ON().

As it stands, the code in cache_add mutually excludes p2m_pod_empty_cache
until pod_lock is released by the caller, so the page_list_add_tail()s cannot
race with emptying the PoD cache.
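
To spell out the ordering I'm relying on (sketch only; callers abbreviated,
names as in the patch):

    /* Callers of p2m_pod_cache_add() (decrease_reservation, zero_check, ...) */
    pod_lock(p2m);
    ...
    p2m_pod_cache_add(p2m, page, order);   /* pages land on pod.super/single */
    ...
    pod_unlock(p2m);

    /* p2m_pod_empty_cache() starts by draining that same lock */
    spin_barrier(&p2m->pod.lock.lock);     /* cannot pass while any caller   *
                                            * above still holds pod_lock     */
    /* ...only then are the cached pages handed back to the heap... */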

On a closer look, the BUG_ON can go: relmem moves out of _not_started only
after p2m_pod_empty_cache has finished. So, transitively, we're safe.
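
Schematically, the relinquish-path ordering I'm relying on (paraphrased, not
the literal call chain):

    /* d->arch.relmem is still RELMEM_not_started here... */
    p2m_pod_empty_cache(d);          /* barrier + drain, as above            */
    /* ...and only once that has returned: */
    d->arch.relmem = RELMEM_xen;     /* first step out of RELMEM_not_started */

Any cache_add the barrier waited for therefore ran while relmem was still
_not_started, and nothing new can start afterwards.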

Emptying the PoD cache does the spin barrier. Question: should it not also
take the pod lock after the spin barrier, to be super safe?
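
Concretely, the variant I'm asking about would look roughly like this inside
p2m_pod_empty_cache() (sketch; today only the barrier is there):

    spin_barrier(&p2m->pod.lock.lock);  /* wait out current pod_lock holders */
    pod_lock(p2m);                      /* the extra step in question        */
    /* ...return pod.super/single pages to the domheap... */
    pod_unlock(p2m);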

Thanks!
Andres

>
> If I recall correctly, I put this in here because something ended up
> calling cache_add after empty_cache(), potentially with the p2m lock
> already held; that's why I put the BUG_ON() there to begin with.
>
>> @@ -922,6 +929,12 @@ p2m_pod_emergency_sweep(struct p2m_domai
>>      limit = (start > POD_SWEEP_LIMIT) ? (start - POD_SWEEP_LIMIT) : 0;
>>
>>      /* FIXME: Figure out how to avoid superpages */
>> +    /* NOTE: Promote to globally locking the p2m. This will get complicated
>> +     * in a fine-grained scenario. Even if we're to lock each gfn
>> +     * individually we must be careful about recursion limits and
>> +     * POD_SWEEP_STRIDE. This is why we don't enforce deadlock constraints
>> +     * between p2m and pod locks */
>> +    p2m_lock(p2m);
>>      for ( i=p2m->pod.reclaim_single; i > 0 ; i-- )
>>      {
>>          p2m_access_t a;
>> @@ -940,7 +953,7 @@ p2m_pod_emergency_sweep(struct p2m_domai
>>          /* Stop if we're past our limit and we have found *something*.
>>           *
>>           * NB that this is a zero-sum game; we're increasing our cache size
>> -         * by re-increasing our 'debt'.  Since we hold the p2m lock,
>> +         * by re-increasing our 'debt'.  Since we hold the pod lock,
>>           * (entry_count - count) must remain the same. */
>>          if ( p2m->pod.count > 0 && i < limit )
>>              break;
>> @@ -949,6 +962,7 @@ p2m_pod_emergency_sweep(struct p2m_domai
>>      if ( j )
>>          p2m_pod_zero_check(p2m, gfns, j);
>>
>> +    p2m_unlock(p2m);
>>      p2m->pod.reclaim_single = i ? i - 1 : i;
>>
>>  }
>> @@ -965,8 +979,9 @@ p2m_pod_demand_populate(struct p2m_domai
>>      int i;
>>
>>      ASSERT(gfn_locked_by_me(p2m, gfn));
>> +    pod_lock(p2m);
>>
>> -    /* This check is done with the p2m lock held.  This will make sure that
>> +    /* This check is done with the pod lock held.  This will make sure that
>>       * even if d->is_dying changes under our feet, p2m_pod_empty_cache()
>>       * won't start until we're done. */
>>      if ( unlikely(d->is_dying) )
>> @@ -977,6 +992,7 @@ p2m_pod_demand_populate(struct p2m_domai
>>       * 1GB region to 2MB chunks for a retry. */
>>      if ( order == PAGE_ORDER_1G )
>>      {
>> +        pod_unlock(p2m);
>>          gfn_aligned = (gfn >> order) << order;
>>          /* Note that we are supposed to call set_p2m_entry() 512 times to
>>           * split 1GB into 512 2MB pages here. But We only do once here because
>> @@ -1000,11 +1016,15 @@ p2m_pod_demand_populate(struct p2m_domai
>>
>>          /* If we're low, start a sweep */
>>          if ( order == PAGE_ORDER_2M && page_list_empty(&p2m->pod.super) )
>> +            /* Note that sweeps scan other ranges in the p2m. In an scenario
>> +             * in which p2m locks are order-enforced wrt pod lock and p2m
>> +             * locks are fine grained, this will result in deadlock panic */
>>              p2m_pod_emergency_sweep_super(p2m);
>>
>>          if ( page_list_empty(&p2m->pod.single) &&
>>               ( ( order == PAGE_ORDER_4K )
>>                || (order == PAGE_ORDER_2M && page_list_empty(&p2m->pod.super) ) ) )
>> +            /* Same comment regarding deadlock applies */
>>              p2m_pod_emergency_sweep(p2m);
>>      }
>>
>

You know what, these comments are *old*. I really need to cut down on the
FUD factor :) As is, there is no risk of deadlock.

With a (hypothetical as of today) fine-grained p2m locking approach, I want
to leave a "be careful" reminder, and to document your trylock suggestion,
which I believe to be accurate.
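
For the record, the trylock idea in a fine-grained world would look roughly
like this (entirely hypothetical; gfn_trylock() doesn't exist today):

    for ( i = p2m->pod.reclaim_single; i > 0; i-- )
    {
        if ( !gfn_trylock(p2m, i, 0) )    /* hypothetical primitive   */
            continue;                     /* contended: skip this gfn */
        /* ...query the gfn and collect zero-check candidates as before... */
        gfn_unlock(p2m, i, 0);
    }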

Thanks for the detailed look!
Andres

> Regarding locking on emergency sweeps: I think it should suffice if we
> could do the equivalent of a spin_trylock() on each gpfn, and just move
> on (not checking that gfn) if it fails.  What do you think?
>
> Other than that, I don't see anything wrong with this locking at first
> glance; but it's complicated enough that I don't think I've quite
> grokked it yet.  I'll look at it again tomorrow and see if things are
> more clear. :-)
>
> -George
>
>
>> @@ -1012,8 +1032,6 @@ p2m_pod_demand_populate(struct p2m_domai
>>      if ( q == p2m_guest && gfn > p2m->pod.max_guest )
>>          p2m->pod.max_guest = gfn;
>>
>> -    lock_page_alloc(p2m);
>> -
>>      if ( p2m->pod.count == 0 )
>>          goto out_of_memory;
>>
>> @@ -1026,8 +1044,6 @@ p2m_pod_demand_populate(struct p2m_domai
>>
>>      BUG_ON((mfn_x(mfn) & ((1 << order)-1)) != 0);
>>
>> -    unlock_page_alloc(p2m);
>> -
>>      gfn_aligned = (gfn >> order) << order;
>>
>>      set_p2m_entry(p2m, gfn_aligned, mfn, order, p2m_ram_rw,
>> p2m->default_access);
>> @@ -1038,7 +1054,7 @@ p2m_pod_demand_populate(struct p2m_domai
>>          paging_mark_dirty(d, mfn_x(mfn) + i);
>>      }
>>
>> -    p2m->pod.entry_count -= (1 << order); /* Lock: p2m */
>> +    p2m->pod.entry_count -= (1 << order);
>>      BUG_ON(p2m->pod.entry_count < 0);
>>
>>      if ( tb_init_done )
>> @@ -1056,20 +1072,24 @@ p2m_pod_demand_populate(struct p2m_domai
>>          __trace_var(TRC_MEM_POD_POPULATE, 0, sizeof(t), &t);
>>      }
>>
>> +    pod_unlock(p2m);
>>      return 0;
>>  out_of_memory:
>> -    unlock_page_alloc(p2m);
>> +    pod_unlock(p2m);
>>
>>      printk("%s: Out of populate-on-demand memory! tot_pages %" PRIu32 " pod_entries %" PRIi32 "\n",
>>             __func__, d->tot_pages, p2m->pod.entry_count);
>>      domain_crash(d);
>>  out_fail:
>> +    pod_unlock(p2m);
>>      return -1;
>>  remap_and_retry:
>>      BUG_ON(order != PAGE_ORDER_2M);
>> -    unlock_page_alloc(p2m);
>> +    pod_unlock(p2m);
>>
>>      /* Remap this 2-meg region in singleton chunks */
>> +    /* NOTE: In a p2m fine-grained lock scenario this might
>> +     * need promoting the gfn lock from gfn->2M superpage */
>>      gfn_aligned = (gfn>>order)<<order;
>>      for(i=0; i<(1<<order); i++)
>>          set_p2m_entry(p2m, gfn_aligned+i, _mfn(0), PAGE_ORDER_4K,
>> @@ -1137,9 +1157,11 @@ guest_physmap_mark_populate_on_demand(st
>>          rc = -EINVAL;
>>      else
>>      {
>> -        p2m->pod.entry_count += 1 << order; /* Lock: p2m */
>> +        pod_lock(p2m);
>> +        p2m->pod.entry_count += 1 << order;
>>          p2m->pod.entry_count -= pod_count;
>>          BUG_ON(p2m->pod.entry_count < 0);
>> +        pod_unlock(p2m);
>>      }
>>
>>      gfn_unlock(p2m, gfn, order);
>> diff -r fb9c06df05f2 -r 56ceab0118cb xen/arch/x86/mm/p2m-pt.c
>> --- a/xen/arch/x86/mm/p2m-pt.c
>> +++ b/xen/arch/x86/mm/p2m-pt.c
>> @@ -954,6 +954,7 @@ long p2m_pt_audit_p2m(struct p2m_domain
>>      struct domain *d = p2m->domain;
>>
>>      ASSERT(p2m_locked_by_me(p2m));
>> +    ASSERT(pod_locked_by_me(p2m));
>>
>>      test_linear = ( (d == current->domain)
>>                      && !pagetable_is_null(current->arch.monitor_table) );
>> diff -r fb9c06df05f2 -r 56ceab0118cb xen/arch/x86/mm/p2m.c
>> --- a/xen/arch/x86/mm/p2m.c
>> +++ b/xen/arch/x86/mm/p2m.c
>> @@ -72,6 +72,7 @@ boolean_param("hap_2mb", opt_hap_2mb);
>>  static void p2m_initialise(struct domain *d, struct p2m_domain *p2m)
>>  {
>>      mm_lock_init(&p2m->lock);
>> +    mm_lock_init(&p2m->pod.lock);
>>      INIT_LIST_HEAD(&p2m->np2m_list);
>>      INIT_PAGE_LIST_HEAD(&p2m->pages);
>>      INIT_PAGE_LIST_HEAD(&p2m->pod.super);
>> @@ -587,8 +588,10 @@ guest_physmap_add_entry(struct domain *d
>>              rc = -EINVAL;
>>          else
>>          {
>> -            p2m->pod.entry_count -= pod_count; /* Lock: p2m */
>> +            pod_lock(p2m);
>> +            p2m->pod.entry_count -= pod_count;
>>              BUG_ON(p2m->pod.entry_count < 0);
>> +            pod_unlock(p2m);
>>          }
>>      }
>>
>> @@ -1372,6 +1375,7 @@ p2m_flush_table(struct p2m_domain *p2m)
>>      /* "Host" p2m tables can have shared entries &c that need a bit more
>>       * care when discarding them */
>>      ASSERT(p2m_is_nestedp2m(p2m));
>> +    /* Nested p2m's do not do pod, hence the asserts (and no pod lock)*/
>>      ASSERT(page_list_empty(&p2m->pod.super));
>>      ASSERT(page_list_empty(&p2m->pod.single));
>>
>> @@ -1529,6 +1533,7 @@ void audit_p2m(struct domain *d,
>>      P2M_PRINTK("p2m audit starts\n");
>>
>>      p2m_lock(p2m);
>> +    pod_lock(p2m);
>>
>>      if (p2m->audit_p2m)
>>          pmbad = p2m->audit_p2m(p2m);
>> @@ -1589,6 +1594,7 @@ void audit_p2m(struct domain *d,
>>      }
>>      spin_unlock(&d->page_alloc_lock);
>>
>> +    pod_unlock(p2m);
>>      p2m_unlock(p2m);
>>
>>      P2M_PRINTK("p2m audit complete\n");
>> diff -r fb9c06df05f2 -r 56ceab0118cb xen/include/asm-x86/p2m.h
>> --- a/xen/include/asm-x86/p2m.h
>> +++ b/xen/include/asm-x86/p2m.h
>> @@ -261,25 +261,12 @@ struct p2m_domain {
>>      unsigned long max_mapped_pfn;
>>
>>      /* Populate-on-demand variables
>> -     * NB on locking.  {super,single,count} are
>> -     * covered by d->page_alloc_lock, since they're almost always used in
>> -     * conjunction with that functionality.  {entry_count} is covered by
>> -     * the domain p2m lock, since it's almost always used in conjunction
>> -     * with changing the p2m tables.
>> -     *
>> -     * At this point, both locks are held in two places.  In both,
>> -     * the order is [p2m,page_alloc]:
>> -     * + p2m_pod_decrease_reservation() calls p2m_pod_cache_add(),
>> -     *   which grabs page_alloc
>> -     * + p2m_pod_demand_populate() grabs both; the p2m lock to avoid
>> -     *   double-demand-populating of pages, the page_alloc lock to
>> -     *   protect moving stuff from the PoD cache to the domain page list.
>> -     *
>> -     * We enforce this lock ordering through a construct in mm-locks.h.
>> -     * This demands, however, that we store the previous lock-ordering
>> -     * level in effect before grabbing the page_alloc lock. The unlock
>> -     * level is stored in the arch section of the domain struct.
>> -     */
>> +     * All variables are protected with the pod lock. We cannot rely on
>> +     * the p2m lock if it's turned into a fine-grained lock.
>> +     * We only use the domain page_alloc lock for additions and
>> +     * deletions to the domain's page list. Because we use it nested
>> +     * within the PoD lock, we enforce it's ordering (by remembering
>> +     * the unlock level in the arch_domain sub struct). */
>>      struct {
>>          struct page_list_head super,   /* List of superpages                */
>>                           single;       /* Non-super lists                   */
>> @@ -288,6 +275,8 @@ struct p2m_domain {
>>          unsigned         reclaim_super; /* Last gpfn of a scan */
>>          unsigned         reclaim_single; /* Last gpfn of a scan */
>>          unsigned         max_guest;    /* gpfn of max guest demand-populate */
>> +        mm_lock_t        lock;         /* Locking of private pod structs,   *
>> +                                        * not relying on the p2m lock.      */
>>      } pod;
>>  };
>>
>
>
>



_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel


 

