Re: [Xen-devel] [PATCH 3 of 5] Rework locking in the PoD layer



> On Wed, 2012-02-01 at 19:56 +0000, Andres Lagar-Cavilla wrote:
>> @@ -114,15 +114,20 @@ p2m_pod_cache_add(struct p2m_domain *p2m
>>          unmap_domain_page(b);
>>      }
>>
>> +    /* First, take all pages off the domain list */
>>      lock_page_alloc(p2m);
>> -
>> -    /* First, take all pages off the domain list */
>>      for(i=0; i < 1 << order ; i++)
>>      {
>>          p = page + i;
>>          page_list_del(p, &d->page_list);
>>      }
>>
>> +    /* Ensure that the PoD cache has never been emptied.
>> +     * This may cause "zombie domains" since the page will never be freed. */
>> +    BUG_ON( d->arch.relmem != RELMEM_not_started );
>> +
>> +    unlock_page_alloc(p2m);
>> +
>
> This assert should stay where it is.  The idea is to verify
> after-the-fact that the page_list_add_tail()s *have not* raced with
> emptying the PoD cache.  Having the assert before cannot guarantee that
> they *will not* race with emptying the PoD cache.  Alternately, if we're
> positive that condition can never happen again, we should just remove
> the BUG_ON().

As it stands, the code in cache_add mutually excludes p2m_pod_empty_cache
until pod_lock is released by the caller, so the page_list_add_tail()s cannot
race with emptying the PoD cache.
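
To spell out the ordering I'm relying on (sketch only; callers abbreviated,
names as in the patch):

    /* Callers of p2m_pod_cache_add() (decrease_reservation, zero_check, ...) */
    pod_lock(p2m);
    ...
    p2m_pod_cache_add(p2m, page, order);   /* pages land on pod.super/single */
    ...
    pod_unlock(p2m);

    /* p2m_pod_empty_cache() starts by draining that same lock */
    spin_barrier(&p2m->pod.lock.lock);     /* cannot pass while any caller   *
                                            * above still holds pod_lock     */
    /* ...only then are the cached pages handed back to the heap... */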

On a closer look, the BUG_ON can go: relmem moves out of _not_started only
after p2m_pod_empty_cache has finished. So, transitively, we're safe.
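
Schematically, the relinquish-path ordering I'm relying on (paraphrased, not
the literal call chain):

    /* d->arch.relmem is still RELMEM_not_started here... */
    p2m_pod_empty_cache(d);          /* barrier + drain, as above            */
    /* ...and only once that has returned: */
    d->arch.relmem = RELMEM_xen;     /* first step out of RELMEM_not_started */

Any cache_add the barrier waited for therefore ran while relmem was still
_not_started, and nothing new can start afterwards.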

Emptying the PoD cache does the spin barrier. Question: should it not also
take the pod lock after the spin barrier, to be super safe?
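
Concretely, the variant I'm asking about would look roughly like this inside
p2m_pod_empty_cache() (sketch; today only the barrier is there):

    spin_barrier(&p2m->pod.lock.lock);  /* wait out current pod_lock holders */
    pod_lock(p2m);                      /* the extra step in question        */
    /* ...return pod.super/single pages to the domheap... */
    pod_unlock(p2m);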

Thanks!
Andres

>
> If I recall correctly, I put this in here because something ended up
> calling cache_add after empty_cache(), potentially with the p2m lock
> already held; that's why I put the BUG_ON() there to begin with.
>
>> @@ -922,6 +929,12 @@ p2m_pod_emergency_sweep(struct p2m_domai
>>      limit = (start > POD_SWEEP_LIMIT) ? (start - POD_SWEEP_LIMIT) : 0;
>>
>>      /* FIXME: Figure out how to avoid superpages */
>> +    /* NOTE: Promote to globally locking the p2m. This will get complicated
>> +     * in a fine-grained scenario. Even if we're to lock each gfn
>> +     * individually we must be careful about recursion limits and
>> +     * POD_SWEEP_STRIDE. This is why we don't enforce deadlock constraints
>> +     * between p2m and pod locks */
>> +    p2m_lock(p2m);
>>      for ( i=p2m->pod.reclaim_single; i > 0 ; i-- )
>>      {
>>          p2m_access_t a;
>> @@ -940,7 +953,7 @@ p2m_pod_emergency_sweep(struct p2m_domai
>>          /* Stop if we're past our limit and we have found *something*.
>>           *
>>           * NB that this is a zero-sum game; we're increasing our cache size
>> -         * by re-increasing our 'debt'.  Since we hold the p2m lock,
>> +         * by re-increasing our 'debt'.  Since we hold the pod lock,
>>           * (entry_count - count) must remain the same. */
>>          if ( p2m->pod.count > 0 && i < limit )
>>              break;
>> @@ -949,6 +962,7 @@ p2m_pod_emergency_sweep(struct p2m_domai
>>      if ( j )
>>          p2m_pod_zero_check(p2m, gfns, j);
>>
>> +    p2m_unlock(p2m);
>>      p2m->pod.reclaim_single = i ? i - 1 : i;
>>
>>  }
>> @@ -965,8 +979,9 @@ p2m_pod_demand_populate(struct p2m_domai
>>      int i;
>>
>>      ASSERT(gfn_locked_by_me(p2m, gfn));
>> +    pod_lock(p2m);
>>
>> -    /* This check is done with the p2m lock held.  This will make sure that
>> +    /* This check is done with the pod lock held.  This will make sure that
>>       * even if d->is_dying changes under our feet, p2m_pod_empty_cache()
>>       * won't start until we're done. */
>>      if ( unlikely(d->is_dying) )
>> @@ -977,6 +992,7 @@ p2m_pod_demand_populate(struct p2m_domai
>>       * 1GB region to 2MB chunks for a retry. */
>>      if ( order == PAGE_ORDER_1G )
>>      {
>> +        pod_unlock(p2m);
>>          gfn_aligned = (gfn >> order) << order;
>>          /* Note that we are supposed to call set_p2m_entry() 512 times to
>>           * split 1GB into 512 2MB pages here. But We only do once here because
>> @@ -1000,11 +1016,15 @@ p2m_pod_demand_populate(struct p2m_domai
>>
>>          /* If we're low, start a sweep */
>>          if ( order == PAGE_ORDER_2M && page_list_empty(&p2m->pod.super) )
>> +            /* Note that sweeps scan other ranges in the p2m. In an scenario
>> +             * in which p2m locks are order-enforced wrt pod lock and p2m
>> +             * locks are fine grained, this will result in deadlock panic */
>>              p2m_pod_emergency_sweep_super(p2m);
>>
>>          if ( page_list_empty(&p2m->pod.single) &&
>>               ( ( order == PAGE_ORDER_4K )
>>                || (order == PAGE_ORDER_2M && page_list_empty(&p2m->pod.super) ) ) )
>> +            /* Same comment regarding deadlock applies */
>>              p2m_pod_emergency_sweep(p2m);
>>      }
>>
>

You know what, these comments are *old*. I really need to cut down on the
FUD factor :) As is, there is no risk of deadlock.

With a (hypothetical as of today) fine-grained p2m locking approach, I want
to leave a "be careful" reminder, and to document your trylock suggestion,
which I believe to be accurate.
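
For the record, the trylock idea in a fine-grained world would look roughly
like this (entirely hypothetical; gfn_trylock() doesn't exist today):

    for ( i = p2m->pod.reclaim_single; i > 0; i-- )
    {
        if ( !gfn_trylock(p2m, i, 0) )    /* hypothetical primitive   */
            continue;                     /* contended: skip this gfn */
        /* ...query the gfn and collect zero-check candidates as before... */
        gfn_unlock(p2m, i, 0);
    }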

Thanks for the detailed look!
Andres

> Regarding locking on emergency sweeps: I think it should suffice if we
> could do the equivalent of a spin_trylock() on each gpfn, and just move
> on (not checking that gfn) if it fails.  What do you think?
>
> Other than that, I don't see anything wrong with this locking at first
> glance; but it's complicated enough that I don't think I've quite
> grokked it yet.  I'll look at it again tomorrow and see if things are
> more clear. :-)
>
> -George
>
>
>> @@ -1012,8 +1032,6 @@ p2m_pod_demand_populate(struct p2m_domai
>>      if ( q == p2m_guest && gfn > p2m->pod.max_guest )
>>          p2m->pod.max_guest = gfn;
>>
>> -    lock_page_alloc(p2m);
>> -
>>      if ( p2m->pod.count == 0 )
>>          goto out_of_memory;
>>
>> @@ -1026,8 +1044,6 @@ p2m_pod_demand_populate(struct p2m_domai
>>
>>      BUG_ON((mfn_x(mfn) & ((1 << order)-1)) != 0);
>>
>> -    unlock_page_alloc(p2m);
>> -
>>      gfn_aligned = (gfn >> order) << order;
>>
>>      set_p2m_entry(p2m, gfn_aligned, mfn, order, p2m_ram_rw,
>> p2m->default_access);
>> @@ -1038,7 +1054,7 @@ p2m_pod_demand_populate(struct p2m_domai
>>          paging_mark_dirty(d, mfn_x(mfn) + i);
>>      }
>>
>> -    p2m->pod.entry_count -= (1 << order); /* Lock: p2m */
>> +    p2m->pod.entry_count -= (1 << order);
>>      BUG_ON(p2m->pod.entry_count < 0);
>>
>>      if ( tb_init_done )
>> @@ -1056,20 +1072,24 @@ p2m_pod_demand_populate(struct p2m_domai
>>          __trace_var(TRC_MEM_POD_POPULATE, 0, sizeof(t), &t);
>>      }
>>
>> +    pod_unlock(p2m);
>>      return 0;
>>  out_of_memory:
>> -    unlock_page_alloc(p2m);
>> +    pod_unlock(p2m);
>>
>>      printk("%s: Out of populate-on-demand memory! tot_pages %" PRIu32 " pod_entries %" PRIi32 "\n",
>>             __func__, d->tot_pages, p2m->pod.entry_count);
>>      domain_crash(d);
>>  out_fail:
>> +    pod_unlock(p2m);
>>      return -1;
>>  remap_and_retry:
>>      BUG_ON(order != PAGE_ORDER_2M);
>> -    unlock_page_alloc(p2m);
>> +    pod_unlock(p2m);
>>
>>      /* Remap this 2-meg region in singleton chunks */
>> +    /* NOTE: In a p2m fine-grained lock scenario this might
>> +     * need promoting the gfn lock from gfn->2M superpage */
>>      gfn_aligned = (gfn>>order)<<order;
>>      for(i=0; i<(1<<order); i++)
>>          set_p2m_entry(p2m, gfn_aligned+i, _mfn(0), PAGE_ORDER_4K,
>> @@ -1137,9 +1157,11 @@ guest_physmap_mark_populate_on_demand(st
>>          rc = -EINVAL;
>>      else
>>      {
>> -        p2m->pod.entry_count += 1 << order; /* Lock: p2m */
>> +        pod_lock(p2m);
>> +        p2m->pod.entry_count += 1 << order;
>>          p2m->pod.entry_count -= pod_count;
>>          BUG_ON(p2m->pod.entry_count < 0);
>> +        pod_unlock(p2m);
>>      }
>>
>>      gfn_unlock(p2m, gfn, order);
>> diff -r fb9c06df05f2 -r 56ceab0118cb xen/arch/x86/mm/p2m-pt.c
>> --- a/xen/arch/x86/mm/p2m-pt.c
>> +++ b/xen/arch/x86/mm/p2m-pt.c
>> @@ -954,6 +954,7 @@ long p2m_pt_audit_p2m(struct p2m_domain
>>      struct domain *d = p2m->domain;
>>
>>      ASSERT(p2m_locked_by_me(p2m));
>> +    ASSERT(pod_locked_by_me(p2m));
>>
>>      test_linear = ( (d == current->domain)
>>                      && !pagetable_is_null(current->arch.monitor_table) );
>> diff -r fb9c06df05f2 -r 56ceab0118cb xen/arch/x86/mm/p2m.c
>> --- a/xen/arch/x86/mm/p2m.c
>> +++ b/xen/arch/x86/mm/p2m.c
>> @@ -72,6 +72,7 @@ boolean_param("hap_2mb", opt_hap_2mb);
>>  static void p2m_initialise(struct domain *d, struct p2m_domain *p2m)
>>  {
>>      mm_lock_init(&p2m->lock);
>> +    mm_lock_init(&p2m->pod.lock);
>>      INIT_LIST_HEAD(&p2m->np2m_list);
>>      INIT_PAGE_LIST_HEAD(&p2m->pages);
>>      INIT_PAGE_LIST_HEAD(&p2m->pod.super);
>> @@ -587,8 +588,10 @@ guest_physmap_add_entry(struct domain *d
>>              rc = -EINVAL;
>>          else
>>          {
>> -            p2m->pod.entry_count -= pod_count; /* Lock: p2m */
>> +            pod_lock(p2m);
>> +            p2m->pod.entry_count -= pod_count;
>>              BUG_ON(p2m->pod.entry_count < 0);
>> +            pod_unlock(p2m);
>>          }
>>      }
>>
>> @@ -1372,6 +1375,7 @@ p2m_flush_table(struct p2m_domain *p2m)
>>      /* "Host" p2m tables can have shared entries &c that need a bit more
>>       * care when discarding them */
>>      ASSERT(p2m_is_nestedp2m(p2m));
>> +    /* Nested p2m's do not do pod, hence the asserts (and no pod lock)*/
>>      ASSERT(page_list_empty(&p2m->pod.super));
>>      ASSERT(page_list_empty(&p2m->pod.single));
>>
>> @@ -1529,6 +1533,7 @@ void audit_p2m(struct domain *d,
>>      P2M_PRINTK("p2m audit starts\n");
>>
>>      p2m_lock(p2m);
>> +    pod_lock(p2m);
>>
>>      if (p2m->audit_p2m)
>>          pmbad = p2m->audit_p2m(p2m);
>> @@ -1589,6 +1594,7 @@ void audit_p2m(struct domain *d,
>>      }
>>      spin_unlock(&d->page_alloc_lock);
>>
>> +    pod_unlock(p2m);
>>      p2m_unlock(p2m);
>>
>>      P2M_PRINTK("p2m audit complete\n");
>> diff -r fb9c06df05f2 -r 56ceab0118cb xen/include/asm-x86/p2m.h
>> --- a/xen/include/asm-x86/p2m.h
>> +++ b/xen/include/asm-x86/p2m.h
>> @@ -261,25 +261,12 @@ struct p2m_domain {
>>      unsigned long max_mapped_pfn;
>>
>>      /* Populate-on-demand variables
>> -     * NB on locking.  {super,single,count} are
>> -     * covered by d->page_alloc_lock, since they're almost always used in
>> -     * conjunction with that functionality.  {entry_count} is covered by
>> -     * the domain p2m lock, since it's almost always used in conjunction
>> -     * with changing the p2m tables.
>> -     *
>> -     * At this point, both locks are held in two places.  In both,
>> -     * the order is [p2m,page_alloc]:
>> -     * + p2m_pod_decrease_reservation() calls p2m_pod_cache_add(),
>> -     *   which grabs page_alloc
>> -     * + p2m_pod_demand_populate() grabs both; the p2m lock to avoid
>> -     *   double-demand-populating of pages, the page_alloc lock to
>> -     *   protect moving stuff from the PoD cache to the domain page list.
>> -     *
>> -     * We enforce this lock ordering through a construct in mm-locks.h.
>> -     * This demands, however, that we store the previous lock-ordering
>> -     * level in effect before grabbing the page_alloc lock. The unlock
>> -     * level is stored in the arch section of the domain struct.
>> -     */
>> +     * All variables are protected with the pod lock. We cannot rely on
>> +     * the p2m lock if it's turned into a fine-grained lock.
>> +     * We only use the domain page_alloc lock for additions and
>> +     * deletions to the domain's page list. Because we use it nested
>> +     * within the PoD lock, we enforce it's ordering (by remembering
>> +     * the unlock level in the arch_domain sub struct). */
>>      struct {
>>          struct page_list_head super,   /* List of superpages                */
>>                           single;       /* Non-super lists                   */
>> @@ -288,6 +275,8 @@ struct p2m_domain {
>>          unsigned         reclaim_super; /* Last gpfn of a scan */
>>          unsigned         reclaim_single; /* Last gpfn of a scan */
>>          unsigned         max_guest;    /* gpfn of max guest demand-populate */
>> +        mm_lock_t        lock;         /* Locking of private pod structs,   *
>> +                                        * not relying on the p2m lock.      */
>>      } pod;
>>  };
>>
>
>
>



_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel


 

