[Xen-devel] Re: blktap: Sync with XCP, dropping zero-copy.

To: Daniel Stodden <daniel.stodden@xxxxxxxxxx>
Subject: [Xen-devel] Re: blktap: Sync with XCP, dropping zero-copy.
From: Jeremy Fitzhardinge <jeremy@xxxxxxxx>
Date: Wed, 17 Nov 2010 18:29:42 -0800
Cc: "Xen-devel@xxxxxxxxxxxxxxxxxxx" <Xen-devel@xxxxxxxxxxxxxxxxxxx>
Delivery-date: Wed, 17 Nov 2010 18:30:28 -0800
Envelope-to: www-data@xxxxxxxxxxxxxxxxxxx
In-reply-to: <1290040898.11102.1709.camel@xxxxxxxxxxxxxxxxxxxxxxx>
List-help: <mailto:xen-devel-request@lists.xensource.com?subject=help>
List-id: Xen developer discussion <xen-devel.lists.xensource.com>
List-post: <mailto:xen-devel@lists.xensource.com>
List-subscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
List-unsubscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
References: <1289604707-13378-1-git-send-email-daniel.stodden@xxxxxxxxxx> <4CDDE0DA.2070303@xxxxxxxx> <1289620544.11102.373.camel@xxxxxxxxxxxxxxxxxxxxxxx> <4CE17B80.7080606@xxxxxxxx> <1289898792.23890.214.camel@ramone> <4CE2C5B1.1050806@xxxxxxxx> <1289942932.11102.802.camel@xxxxxxxxxxxxxxxxxxxxxxx> <4CE41853.1010000@xxxxxxxx> <1290025317.11102.1216.camel@xxxxxxxxxxxxxxxxxxxxxxx> <4CE442EA.1090708@xxxxxxxx> <1290031020.11102.1410.camel@xxxxxxxxxxxxxxxxxxxxxxx> <4CE453E2.1000308@xxxxxxxx> <1290035201.11102.1577.camel@xxxxxxxxxxxxxxxxxxxxxxx> <4CE46A03.3010104@xxxxxxxx> <1290040898.11102.1709.camel@xxxxxxxxxxxxxxxxxxxxxxx>
Sender: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.12) Gecko/20101027 Fedora/3.1.6-1.fc13 Lightning/1.0b3pre Thunderbird/3.1.6
On 11/17/2010 04:41 PM, Daniel Stodden wrote:
> On Wed, 2010-11-17 at 18:49 -0500, Jeremy Fitzhardinge wrote:
>> On 11/17/2010 03:06 PM, Daniel Stodden wrote:
>>>> So we're back to needing a way of mapping from a random mfn to a pfn so
>>>> we can find the corresponding struct page.  I would be tempted to put a
>>>> layer over m2p to allow local m2p mappings to override the global m2p 
>>>> table.
>>> I think a local m2p lookup on a slow path is a superior option, iff you
>>> do think it's doable. Without, e.g., risking bloating some inline stuff, I
>>> mean.
>>>
>>> Where in the call stack down into the pvops code do you think the
>>> lookup would go? Asking because I'd expect the kernel to potentially learn
>>> more tricks with that.
>> I don't think m2p is all that performance critical.  p2m is used way more.
>>
>> p2m is already a radix tree; 
> Yes, but pfns are dense plus holes, aren't they?

Yes.

>> I think m2p could be done somewhat
>> similarly, where undefined entries fall through to the global m2p.  The
>> main problem is probably making sure the memory allocation for new m2p
>> entries can be done in a blocking context, so we don't have to rely on
>> GFP_ATOMIC.
> Whatever the index is going to be, all the backends I'm aware of run
> their rings on a kthread. Sounds to me like __GFP_WAIT followed by an
> rwlock is perfectly sufficient. Only the reversal commonly ends up in
> interrupt context.
>
>> That particular m2p lookup would be in xen_pte_val(), but I think that's
>> the callsite for pretty much all m2p lookups.
>>
>>> An mfn->gref mapping would obsolete blkback-pagemap. Well, iff the
>>> kernel-blktap zero-copy stuff wants to be resurrected.
>>>
>>> It would also be a cheap way for the current blktap to do virt->gref
>>> lookups for tapdisks. Some tapdisk filters want this; the major example
>>> at present is the memshr patch, and it's somewhat nicer than a ring
>>> message hack.
>>>
>>> Wouldn't storing the handle allow unmapping grant ptes on the normal
>>> user PT teardown path? I think we always had this .zap_pte vma-operation
>>> in blktap, iirc. MMU notifier replacement? Maybe not a good one.
>> I think mmu notifiers are fine; this is exactly what they're for after
>> all.  Very few address spaces have granted pages mapped into them, so
>> keeping the normal pte operations as fast as possible and using more
>> expensive notifiers for the afflicted mms seems like the right tradeoff.
>>
>> Hm.  Before Gerd highlighted mmu notifiers as the right way to deal with
>> granted pages in gntdev, I had the idea of allocating a shadow page for
>> each pte page and storing grant refs there, with the shadow hung off
>> the pte page's struct page so that set_pte could use it to do the special
>> thing if needed.
>>
>> I wonder if we could do something similar here, where we store the pfn
>> for a given pte in the shadow?  But how would it get used?  There's no
>> "read pte" pvop, and pte_val gets the pte value, not its address, so
>> there'd be no way to find the shadow.  So I guess that doesn't work.
> I'm not sure about the radix variant. All the backends do order-0
> allocations, as discussed above. Depending on the driver pair behavior,
> the mfn ranges can get arbitrarily sparse. The real M2P makes completely
> different assumptions about density and size, or not?

Well, there are two use cases for the local m2p idea.  One is for granted
pages, which are going to be all over the place; the other is for
hardware mfns, which are more likely to be densely clustered.

A radix tree for grant mfns is likely to be pretty inefficient - but the
worst case is one radix page per mfn, which isn't too bad, since we're
not expecting that many granted mfns to be around.  But perhaps a hash
or rbtree would be a better fit.  Or we could insist on making the mfns
contiguous.
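
Very roughly, the lookup I have in mind looks like the sketch below.  This
is only a sketch: the m2p_override_* names, the rbtree and the locking are
made up for illustration.  The only pieces taken from the discussion above
are the fall-through to the global machine_to_phys_mapping table and doing
the insertion with GFP_KERNEL from a backend kthread rather than GFP_ATOMIC.

#include <linux/rbtree.h>
#include <linux/spinlock.h>
#include <linux/slab.h>
#include <asm/xen/page.h>       /* machine_to_phys_mapping */

/* Sketch only, not a real interface: a small local override table,
 * consulted before the global m2p; absent entries fall through. */
struct m2p_override_entry {
    struct rb_node node;
    unsigned long  mfn;
    unsigned long  pfn;
};

static struct rb_root m2p_override = RB_ROOT;
static DEFINE_RWLOCK(m2p_override_lock);

/* Slow-path m2p: local overrides win, otherwise use the global table. */
static unsigned long m2p_override_lookup(unsigned long mfn)
{
    struct rb_node *n;
    unsigned long pfn = ~0UL;

    read_lock(&m2p_override_lock);
    for (n = m2p_override.rb_node; n; ) {
        struct m2p_override_entry *e =
            rb_entry(n, struct m2p_override_entry, node);

        if (mfn < e->mfn)
            n = n->rb_left;
        else if (mfn > e->mfn)
            n = n->rb_right;
        else {
            pfn = e->pfn;
            break;
        }
    }
    read_unlock(&m2p_override_lock);

    if (pfn == ~0UL)
        pfn = machine_to_phys_mapping[mfn];     /* global m2p */
    return pfn;
}

/* Backends run their rings on a kthread, so insertion can block. */
static int m2p_override_add(unsigned long mfn, unsigned long pfn)
{
    struct m2p_override_entry *e, *cur;
    struct rb_node **p, *parent = NULL;

    e = kzalloc(sizeof(*e), GFP_KERNEL);
    if (!e)
        return -ENOMEM;
    e->mfn = mfn;
    e->pfn = pfn;

    write_lock(&m2p_override_lock);
    for (p = &m2p_override.rb_node; *p; ) {
        parent = *p;
        cur = rb_entry(parent, struct m2p_override_entry, node);
        p = (mfn < cur->mfn) ? &parent->rb_left : &parent->rb_right;
    }
    rb_link_node(&e->node, parent, p);
    rb_insert_color(&e->node, &m2p_override);
    write_unlock(&m2p_override_lock);
    return 0;
}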

> Well, maybe one could shadow (cow, actually) just that? Saves the index.
> Likely a dumb idea. :)

I guess Xen won't let us map over the m2p, but maybe we could alias it. 
But that's going to waste lots of address space in a 32-bit dom0.

> What numbers of grant refs do we run? I remember times when the tables
> were bumped up massively, for stuff like pvfb, a long time ago. I guess
> rest remained rather conservative.

We only really need to worry about mfns which are actually going to be
mapped into userspace and gup'ed.  We could even propose a (new?) mmu
notifier to prepare for gup so that it can be deferred as late as possible.
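
If it helps, this is roughly what I mean; purely a sketch.  The gntdev_*
names are invented for the example, and .prepare_gup is the hypothetical
new callback: nothing like it exists in mmu_notifier_ops today.  The
invalidate_range_start handler just stands in for the existing teardown path.

#include <linux/mmu_notifier.h>

/* Sketch only.  The gntdev_* names are invented for illustration and
 * .prepare_gup is the proposed (non-existent) hook discussed above. */
struct gntdev_priv {
    struct mmu_notifier mn;
    /* ... per-mm list of granted mappings ... */
};

static void gntdev_mn_invl_range_start(struct mmu_notifier *mn,
                                       struct mm_struct *mm,
                                       unsigned long start,
                                       unsigned long end)
{
    /* revoke/unmap any granted ptes covering [start, end) */
}

static const struct mmu_notifier_ops gntdev_mmu_ops = {
    .invalidate_range_start = gntdev_mn_invl_range_start,
    /* .prepare_gup         = gntdev_mn_prepare_gup,
     *   proposed: populate local m2p entries for the range just
     *   before get_user_pages() walks it, instead of up front. */
};

static int gntdev_track_mm(struct gntdev_priv *priv, struct mm_struct *mm)
{
    priv->mn.ops = &gntdev_mmu_ops;
    return mmu_notifier_register(&priv->mn, mm);    /* may sleep */
}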

> The shadow idea sounds pretty cool, mainly because the vaddr spaces are
> often contiguous. At least for userland pages.
>
> The thing that would bother me is that settling on a single shadow page
> already limits map entries to sizeof(pte), so ideas like additionally
> mapping to grefs/handles/flags already go out of the window. Or grow a
> bigger leaf, but on average that's also more space overhead.

At least we can always fit a pointer into sizeof(pte_t), so there's some
scope for having more storage if needed.  But I don't see how it can
help for gup...
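
To spell out what I mean by that, a sketch only: the gnt_shadow_* names and
the use of page_private() on the pte page are invented for the example.
Each shadow slot is pointer-sized, which always fits in sizeof(pte_t), and
can point at a larger descriptor when a pte needs more than one word of
side data.  But none of this answers the gup question.

#include <linux/mm.h>
#include <asm/pgtable.h>
#include <xen/grant_table.h>

/* Sketch only: one shadow page per pte page, hung off the pte page's
 * struct page via page_private() (an assumption, not current code).
 * Each slot is a pointer, so it can reference a larger descriptor
 * carrying pfn, gref, handle and flags for that pte. */
struct gnt_shadow_entry {
    unsigned long  pfn;
    grant_ref_t    gref;
    grant_handle_t handle;
    unsigned int   flags;
};

static struct gnt_shadow_entry **gnt_shadow_table(struct page *pte_page)
{
    return (struct gnt_shadow_entry **)page_private(pte_page);
}

static struct gnt_shadow_entry *gnt_shadow_slot(struct page *pte_page,
                                                unsigned long addr)
{
    /* PTRS_PER_PTE pointer-sized slots fit in one shadow page */
    return gnt_shadow_table(pte_page)[pte_index(addr)];
}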

    J

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel