
Re: [Xen-devel] [RFC] [PATCH] x86/mm: use wait queues for mem_paging


  • To: "Tim Deegan" <tim@xxxxxxx>
  • From: "Andres Lagar-Cavilla" <andres@xxxxxxxxxxxxxxxx>
  • Date: Thu, 23 Feb 2012 07:55:36 -0800
  • Cc: olaf@xxxxxxxxx, xen-devel@xxxxxxxxxxxxxxxxxxx, andres@xxxxxxxxxxxxxx, adin@xxxxxxxxxxxxxx
  • Delivery-date: Thu, 23 Feb 2012 15:56:05 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xen.org>

> Hi,
>
> I'm about to repost this patch, with a few other changes and fixups.
> Further comments below.
>
> At 12:57 -0800 on 17 Feb (1329483430), Andres Lagar-Cavilla wrote:
>> > At 08:05 -0800 on 17 Feb (1329465959), Andres Lagar-Cavilla wrote:
>> >> Well, thanks for taking a stab at it. Looks fairly good. My main
>> >> observation is that we do not want to unconditionally go on a wait
>> >> queue everywhere. For example, the grant table code pretty reliably
>> >> unwinds itself and correctly tells the grant mapper to retry, in the
>> >> presence of a paged page.
>> >
>> > The grant table code is unlikely to hit a paged-out gfn in the
>> > caller's address space.
>>
>> Oh yeah they do. At least for the target gfns behind the grants.
>> That's why we need the retry loops in the PV backends in dom0.
>
> The target gfns aren't in the caller's p2m, though.
>
>> > The balloon code should not be trying to populate at all!
>>
>> Agreed. get_gfn_query is needed there. Nothing bad happens because we
>> immediately send drop_page. But it's wrong-ish.
>
> Fixed
>
>> >> Hence I had proposed introducing get_gfn_sleep, and I'd be happy if
>> >> the wait queue code is invoked only there. And then we can call
>> >> get_gfn_sleep in vmx_load_pdptrs, in guest page table walks, and in
>> >> __hvm_copy.
>> >
>> > I see.  In the longer term I definitely do not want get_gfn()'s
>> > callers to have to deal with p2m_is_paged(), and PoD, and sharing,
>> > and every other p2m hack we come up with.  I want all that stuff
>> > hidden behind get_gfn_*(), with at most a flag to say 'fix up magic
>> > MM tricks'.
>> >
>> > I don't like get_gfn_sleep() very much but I'll see if I can find a
>> > way to do the equivalent with a flag to get_gfn_*().  But I suspect
>> > it will end up with a tri-state argument of:
>> >  a) just give me the current state;
>> >  b) try (without sleeping) to fix up paging, PoD and sharing but
>> >     still maybe make me handle it myself by error propagation; and
>> >  c) just fix it up.
>> >
>> > 'b' is a pretty weird interface.  Especially given that in corner
>> > cases, 'make me handle it' might involve hitting a wait queue anyway.
>> > If we intend in the fullness of time to get rid of it (as I hope we
>> > would) then probably we shouldn't introduce it.  (And if that means
>> > pushing this whole change out past 4.2, we should consider doing
>> > that.)
>> >
>> It's weird. b) is an evolutionary side-effect. It should disappear. I
>> guess what I'm arguing for is "it should disappear in the long term, not
>> on a patch so close to 4.2"
>
> Well, I tried making the 'but don't go to a waitq' flag; the
> implementation is trivial on top of the patches I'm about to post but
> the question of when to set it was so ugly I gave up.
>
> I have hit another stumbling block though - get_two_gfns() can't
> safely ask get_gfn to populate or unshare if that might involve a
> waitq, since by definition one of its two calls is made with a lock
> held.  I can't see a nice way of having it retry (not one that's
> guaranteed to make progress) without pulling the fixup-and-retry up
> into that function.  I might do that in the next revision.

One way to do this is to use query_unlocked on the second gfn. If we
sense danger, drop the lock on the first, then do the right thing for
the second (taking and then dropping its lock), and retry at the top.
Front-runner for the ugliest-thing-in-the-hypervisor award.
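To make the shape of that retry concrete, here is a toy model of the
scheme (nothing below is the real Xen API; p2m_entry, fixup and
get_two_gfns_retry are invented names, and locks are plain booleans):

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { P2M_RAM, P2M_PAGED_OUT } p2m_type;

struct p2m_entry {
    p2m_type type;
    bool locked;
};

static int retries;

/* Stand-in for kicking off a page-in (and possibly sleeping on a
 * wait queue) for a paged-out entry. */
static void fixup(struct p2m_entry *e)
{
    e->type = P2M_RAM;
}

/* Only return with both entries in RAM state and the first locked:
 * locked lookup of the first, unlocked query of the second; if the
 * second looks dangerous, drop the first lock, fix up the second
 * under its own lock, then retry from the top. */
static void get_two_gfns_retry(struct p2m_entry *a, struct p2m_entry *b)
{
 again:
    a->locked = true;               /* locked lookup of the first gfn */
    if (b->type == P2M_PAGED_OUT) { /* query_unlocked on the second */
        a->locked = false;          /* drop the lock on the first */
        b->locked = true;           /* do the right thing for the second */
        fixup(b);
        b->locked = false;          /* drop the lock for the second */
        retries++;
        goto again;                 /* retry at the top */
    }
}
```

The point of the model is only the lock discipline: no lock is ever
held across the fixup, so the fixup is free to sleep.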

hvm_task_switch also gets two gfns, although in a manner that prevents
easily using get_two_gfns (hvm_map_entry on gdt offsets). It's unlikely
this will ever be a problem (it's somewhat reasonable to expect both gdt
entries to fall in the same gfn) but it's something worth keeping in mind.
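The "both gdt entries in the same gfn" expectation is easy to state as
a predicate; this hypothetical helper (same_gfn is not a real Xen
function) just checks whether two 8-byte descriptors share a 4 KiB
guest frame:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12  /* 4 KiB guest frames */

/* Do the descriptors selected by sel1 and sel2 (byte offsets into the
 * GDT, low 3 bits masked off as in a selector) live in the same guest
 * frame?  If so, one get_gfn would cover both hvm_map_entry calls. */
static bool same_gfn(uint64_t gdt_base, uint16_t sel1, uint16_t sel2)
{
    uint64_t gfn1 = (gdt_base + (sel1 & ~7u)) >> PAGE_SHIFT;
    uint64_t gfn2 = (gdt_base + (sel2 & ~7u)) >> PAGE_SHIFT;
    return gfn1 == gfn2;
}
```

The failure case is a GDT that straddles a page boundary with the two
descriptors on opposite sides, which is exactly the unlikely-but-legal
layout worth keeping in mind.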

>
>> > Can't be soon enough for me.  I've been sharpening the axe for that
>> > one for years now. :)
>>
>> What's stopping the death blow?
>
> People said they still need it.
>
>> >> You want to make sure the mfn is not valid. For p2m_ram_paging_out we
>> >> still have the mfn in place.
>> >
>> > Doesn't p2m_mem_paging_populate already DTRT in all these cases?  If
>> > not, it really should.
>>
>> It's not about populate, it's about killing the domain unnecessarily in
>> the in_atomic check.
>
> Oh I see, thanks.  Fixed.
>
>> >> >> +
>> >> >> +        /* Safety catch: it _should_ be safe to wait here but if
>> >> >> +         * it's not, crash the VM, not the host */
>> >> >> +        if ( in_atomic() )
>> >> >> +        {
>> >> >> +            WARN();
>> >> >> +            domain_crash(p2m->domain);
>> >> >> +        }
>> >> >> +        else
>> >> >> +        {
>> >> >> +            current->arch.mem_paging_gfn = gfn;
>> >> >> +            wait_event(current->arch.mem_paging_wq, ({
>> >> >> +                        mfn = p2m->get_entry(p2m, gfn, t, a,
>> >> >> +                                             p2m_query, page_order);
>> >> >> +                        (*t != p2m_ram_paging_in);
>> >>
>> >> Well, first of all mfn->get_entry will not happen in a locked
>> >> context, so you will exit the wait loop not holding the p2m lock
>> >> and crash later.
>> >
>> > Yes - we always exit the loop without the lock; then we 'goto again'
>> > and do things properly.  I realise it's a bit inefficient, but on the
>> > page-in path the p2m lookups shouldn't be the bottleneck. :)  I did
>> > write the version that handled the locking in this loop but it got a
>> > bit twisty, plus we need the 'goto again' for other reasons (see
>> > below).
>>
>> Yes, but, you'll be reading the p2m in a racy manner.
>
> I don't think there are any real races here.  Either we see
> p2m_paging_in (and someone will eventually wake us when it pages in)
> or we don't (and we go back and do the lookup safely).
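The wait-then-redo shape being argued about can be modelled in a few
lines. This is a toy sketch only (names invented, wait_event modelled
as a spin loop): the wake-up condition reads the entry without the
lock, and on wake-up the whole lookup is redone properly.

```c
#include <assert.h>

typedef enum { RAM, PAGING_IN } p2m_t;

static p2m_t entry = PAGING_IN;
static int unlocked_reads, locked_lookups;

/* Racy, lock-free read of the entry, as in the wait_event condition.
 * After a couple of reads the "pager" finishes, standing in for the
 * page-in completing. */
static p2m_t read_entry_unlocked(void)
{
    unlocked_reads++;
    if (unlocked_reads >= 2)
        entry = RAM;
    return entry;
}

/* Locked lookup; if the gfn is paging in, wait (racily) for it, then
 * 'goto again' and redo the lookup safely under the lock. */
static p2m_t get_gfn(void)
{
 again:
    locked_lookups++;           /* the safe, locked lookup */
    if (entry == PAGING_IN) {
        while (read_entry_unlocked() == PAGING_IN)
            ;                   /* wait_event() stand-in */
        goto again;             /* never trust the racy read: redo */
    }
    return entry;
}
```

The racy read only decides *when to wake up*; every value actually
used comes from the repeated locked lookup, which is why the race is
harmless.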
>
>> > As a follow-up I ought to make the page-sharing fixup code that's just
>> > below this use 'goto again' as well; we shouldn't really leave this
>> > function until we get a lookup where all these cases are dealt with.
>
> On closer inspection the page-sharing version is OK because it doesn't
> release the lock, and the unshare function won't leave us with a
> paged-out page.
>
>> I think we won't be able to have a magic
>> get_gfn_solve_everything_silently() call until we find a solution to the
>> fact that get_gfn happens in locked contexts.
>
> Yes, I agree, and I'm coming round to the opinion that we might be too
> late to fix that in 4.2. :|
>
>> I've thought about a wacky idea that
>> 1. schedules a sleep-n-retry softirq
>> 2. fails the call
>> 3. avoids crashing the domain on the unwind path
>> 4. when executing softirqs pauses the vcpu as per 1.
>> 5. When woken up retries the last failed hypercall/vmexit/whatever
>>
>> I shouldn't have said that out loud...
>
> The first time I approached the idea of paging guest RAM (before Patrick
> started on the current system) I prototyped something not a million
> miles from this (basically, a sort of setjmp() on hypervisor entry) but
> the question of making all hypercalls and VMEXITs idempotent scared me
> off. :)
>

We've used longjmp-like stuff. It's fragile, but it carried us through
the day. It's a non-starter here, I hope.
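For what it's worth, the setjmp()-on-entry idea is a few lines of
control flow; this toy sketch (all names invented, nothing here is
Xen code) also shows why idempotency is the scary part: the whole
operation re-runs from the top after the fixup.

```c
#include <setjmp.h>
#include <stdbool.h>

static jmp_buf entry_ctx;
static bool paged_out = true;
static int attempts;

/* Deep in some lookup: on a paged-out gfn, unwind straight back to
 * the hypercall entry point instead of propagating an error. */
static void deep_lookup(void)
{
    if (paged_out)
        longjmp(entry_ctx, 1);
}

/* Stand-in for waiting until the pager brings the page back. */
static void page_in(void)
{
    paged_out = false;
}

/* "setjmp() on hypervisor entry": on re-entry via longjmp, fix up and
 * run the operation again from scratch - which is only correct if the
 * operation is idempotent up to the point of the longjmp. */
static int handle_hypercall(void)
{
    if (setjmp(entry_ctx))
        page_in();
    attempts++;
    deep_lookup();
    return attempts;
}
```

Here the operation completes on the second attempt; in real code every
side effect before the longjmp would have to be safe to repeat, which
is the part that scales badly across all hypercalls and VMEXITs.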

Looking forward to the new patch.
Andres

>> >> >> void p2m_mem_paging_populate(struct domain *d, unsigned long gfn)
>> >> >> {
>> >> >>     struct vcpu *v = current;
>> >> >> -    mem_event_request_t req;
>> >> >> +    mem_event_request_t req = {0};
>> >> >>     p2m_type_t p2mt;
>> >> >>     p2m_access_t a;
>> >> >>     mfn_t mfn;
>> >> >>     struct p2m_domain *p2m = p2m_get_hostp2m(d);
>> >> >> +    int send_request = 0;
>> >> >>
>> >> >>     /* We're paging. There should be a ring */
>> >> >>     int rc = mem_event_claim_slot(d, &d->mem_event->paging);
>> >>
>> >> Given all the optimizations around the send_request flag, it might
>> >> be worth moving the claim_slot call to an if(send_request) block at
>> >> the bottom, and get rid of cancel_slot altogether. Claim_slot could
>> >> potentially go (or cause someone else to go) to a wait queue, and
>> >> it's best to not be that rude.
>> >
>> > Sure.  It might be tricky to make sure we unwind properly in the
>> > failure case, but I'll have a go at a follow-up patch that looks at
>> > that.
>> >
>> > (This is not really a regression in this patch AFAICS - in the
>> > existing code the get_gfn() callers end up calling this function
>> > anyway.)
>>
>> *After* they've put gfn. Not a regression, just cleaner code imo.
>
> This is also called after putting the gfn - but in any case I've
> reshuffled the mem_event_claim_slot in the new version.
>
> Cheers,
>
> Tim.
>



_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel
