I'll throw an idea out there and you can educate me on why it's lame.
Going back to the primary issue of dropping zero-copy, you want the block
backend (tapdev w/AIO or otherwise) to operate on regular dom0 pages, because
you run into all sorts of quirkiness otherwise: magical VM_FOREIGN incantations
to back granted mfns with fake page structs that make get_user_pages happy,
quirky grant PTEs, etc.
Ok, so how about something along the lines of GNTTABOP_swap? Eerily reminiscent
of (maligned?) GNTTABOP_transfer, but hear me out.
The observation is that for a blkfront read, you could do the read all along on
a regular dom0 frame, and when stuffing the response into the ring, swap the
dom0 frame (mfn) you used with the domU frame provided as a buffer. Then the
algorithm unfolds as follows:
1. The block backend, instead of calling get_empty_pages_and_pagevec at init
time, creates a pool of reserved regular pages via get_free_page(s). These
pages have their refcount pumped, so no one else in dom0 will ever touch them.
2. When extracting a blkfront write from the ring, call GNTTABOP_swap
immediately. One of the backend-reserved mfns is swapped with the domU mfn.
Pfns and page structs on both ends remain untouched.
3. For blkfront reads, call swap when stuffing the response back into the ring.
4. Because of 1, dom0 can a) calmly fix its p2m (and kvaddr) after the swap,
much like balloon and others do, without fear of races. More importantly, b)
you don't have a weirdo granted PTE, nor are you working with a frame from
another domain. It's your page all along, dom0.
5. One assumption for domU is that pages allocated as blkfront buffers won't
be touched by anybody, so a) it's safe for them to swap asynchronously with
another frame with undefined contents, and b) domU can fix its p2m (and
kvaddr) when pulling responses from the ring (the new mfn should be put in
the response by dom0, either directly or through an opaque handle).
6. Scatter-gather vectors in ring requests give you natural multicall
batching for these GNTTABOP_swaps; i.e., the hypercalls won't happen as often,
or at the fine granularity, that skbuffs demanded of GNTTABOP_transfer (see
the sketch after this list).
7. Potentially domU may want to use the contents of a blkfront write buffer
later for something else, so it's not really zero-copy. But the approach
opens a window for an async memcpy: from the point of the swap when pulling
the request to the point of pushing the response, you can do the memcpy at
any time. I don't know how practical that is, though.
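To make 2 and 6 concrete, here's roughly what I imagine the blkback side
could look like. GNTTABOP_swap, its argument struct and blkback_swap_request
are all made up for illustration -- none of this exists in the grant table
ABI today -- only the surrounding primitives (HYPERVISOR_grant_table_op,
set_phys_to_machine, the blkif request layout) are existing interfaces:

/* Hypothetical interface -- nothing below exists in the grant table ABI
 * today; the fields are guesses at what a GNTTABOP_swap would need. */
struct gnttab_swap_grant_ref {
    grant_ref_t ref;      /* frontend's grant ref for its buffer frame     */
    domid_t     domid;    /* the frontend domain                           */
    xen_pfn_t   mfn;      /* backend-reserved mfn to give away; on return, */
                          /* holds the mfn the backend now owns            */
    int16_t     status;   /* GNTST_* result                                */
};

/* Swap every segment of a write request against one of our reserved dom0
 * frames, batched into a single hypercall per request. */
static int blkback_swap_request(struct blkif_request *req,
                                struct page **reserved_pages,
                                domid_t otherend)
{
    struct gnttab_swap_grant_ref op[BLKIF_MAX_SEGMENTS_PER_REQUEST];
    int i, nseg = req->nr_segments;

    for (i = 0; i < nseg; i++) {
        op[i].ref   = req->seg[i].gref;
        op[i].domid = otherend;
        op[i].mfn   = pfn_to_mfn(page_to_pfn(reserved_pages[i]));
    }

    /* One hypercall per request, not one per frame. */
    if (HYPERVISOR_grant_table_op(GNTTABOP_swap, op, nseg))
        BUG();

    for (i = 0; i < nseg; i++) {
        if (op[i].status != GNTST_okay)
            return -EIO;
        /* Adopt the frame we received: fix dom0's p2m, much like the
         * balloon driver does (kernel vaddr fixup omitted, see problem
         * 2 below). */
        set_phys_to_machine(page_to_pfn(reserved_pages[i]), op[i].mfn);
    }
    return 0;
}

The read path would be symmetric: do the I/O into the reserved frame first
and issue the same batch right before pushing the response.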
Problems at first glance:
1. To support GNTTABOP_swap you need to add more if(version) checks to
blkfront and blkback.
2. The kernel vaddr will need to be managed as well by dom0/domU, much like
balloon and others do it: the hypercall, fixing the p2m, and fixing the kvaddr
all need to be taken care of. domU will probably need to neuter its kvaddr
before granting, and then re-establish it when the response arrives (a sketch
follows this list). Weren't all these hypercalls ultimately more expensive
than memcpy for GNTTABOP_transfer in netback?
3. Managing the pool of backend-reserved pages may be a problem?
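For 2 on the domU side, I'd expect the fixups to mirror what the balloon
driver does when it gives frames back and takes them again. A minimal sketch,
assuming a lowmem page with a permanent kernel mapping and assuming the
response carries the new mfn (neither of which blkfront has today; both
helper names are made up):

/* Before putting the request on the ring: drop the kernel mapping so
 * nothing in domU touches the frame while its mfn changes under it. */
static void blkfront_neuter_buffer(struct page *page)
{
    unsigned long kvaddr = (unsigned long)page_address(page);

    if (HYPERVISOR_update_va_mapping(kvaddr, __pte(0), UVMF_INVLPG))
        BUG();
}

/* When pulling the response: adopt the mfn dom0 handed back, fix the
 * p2m, and re-establish the kernel mapping. */
static void blkfront_adopt_buffer(struct page *page, unsigned long new_mfn)
{
    unsigned long pfn = page_to_pfn(page);
    unsigned long kvaddr = (unsigned long)page_address(page);

    set_phys_to_machine(pfn, new_mfn);

    if (HYPERVISOR_update_va_mapping(kvaddr,
                                     mfn_pte(new_mfn, PAGE_KERNEL),
                                     UVMF_INVLPG))
        BUG();
}

That is exactly the cost being asked about above: per buffer you pay the
va_mapping updates and the p2m fixups on top of the swap itself.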
So in the end, this is perhaps more of an academic exercise than a palatable
answer, but nonetheless I'd like to hear what other problems people may find
with this approach.
Andres
> Message: 3
> Date: Tue, 16 Nov 2010 13:28:51 -0800
> From: Daniel Stodden <daniel.stodden@xxxxxxxxxx>
> Subject: [Xen-devel] Re: blktap: Sync with XCP, dropping zero-copy.
> To: Jeremy Fitzhardinge <jeremy@xxxxxxxx>
> Cc: "Xen-devel@xxxxxxxxxxxxxxxxxxx" <Xen-devel@xxxxxxxxxxxxxxxxxxx>
> Message-ID: <1289942932.11102.802.camel@xxxxxxxxxxxxxxxxxxxxxxx>
> Content-Type: text/plain; charset="UTF-8"
>
> On Tue, 2010-11-16 at 12:56 -0500, Jeremy Fitzhardinge wrote:
>> On 11/16/2010 01:13 AM, Daniel Stodden wrote:
>>> On Mon, 2010-11-15 at 13:27 -0500, Jeremy Fitzhardinge wrote:
>>>> On 11/12/2010 07:55 PM, Daniel Stodden wrote:
>>>>>> Surely this can be dealt with by replacing the mapped granted page with
>>>>>> a local copy if the refcount is elevated?
>>>>> Yeah. We briefly discussed this when the problem started to pop up
>>>>> (again).
>>>>>
>>>>> I had a patch, for blktap1 in XS 5.5 iirc, which would fill mapping with
>>>>> a dummy page mapped in. You wouldn't need a copy, a R/O zero map easily
>>>>> does the job.
>>>> Hm, I'd be a bit concerned that that might cause problems if used
>>>> generically.
>>> Yeah. It wasn't a problem because all the network backends are on TCP,
>>> where one can be rather sure that the dups are going to be properly
>>> dropped.
>>>
>>> Does this hold everywhere ..? -- As mentioned below, the problem is
>>> rather in AIO/DIO than being Xen-specific, so you can see the same
>>> behavior on bare metal kernels too. A userspace app seeing an AIO
>>> complete and then reusing that buffer elsewhere will occasionally
>>> resend garbage over the network.
>>
>> Yeah, that sounds like a generic security problem. I presume the
>> protocol will just discard the excess retransmit data, but it might mean
>> a usermode program ends up transmitting secrets it never intended to...
>>
>>> There are some important parts which would go missing. Such as
>>> ratelimiting gntdev accesses -- 200 thundering tapdisks each trying to
>>> gntmap 352 pages simultaneously isn't so good, so there still needs to
>>> be some bridge arbitrating them. I'd rather keep that in kernel space,
>>> okay to cram stuff like that into gntdev? It'd be much more
>>> straightforward than IPC.
>>
>> What's the problem? If you do nothing then it will appear to the kernel
>> as a bunch of processes doing memory allocations, and they'll get
>> blocked/rate-limited accordingly if memory is getting short.
>
> The problem is that just letting the page allocator work through
> allocations isn't going to scale anywhere.
>
> The worst case memory requested under load is <number-of-disks> * (32 *
> 11 pages). As a (conservative) rule of thumb, N will be 200 or rather
> better.
>
> The number of I/O actually in-flight at any point, in contrast, is
> derived from the queue/sg sizes of the physical device. For a simple
> disk, that's about a ring or two.
>
>> There's
>> plenty of existing mechanisms to control that sort of thing (cgroups,
>> etc) without adding anything new to the kernel. Or are you talking
>> about something other than simple memory pressure?
>>
>> And there's plenty of existing IPC mechanisms if you want them to
>> explicitly coordinate with each other, but I'd tend to think that's
>> premature unless you have something specific in mind.
>>
>>> Also, I was absolutely certain I once saw VM_FOREIGN support in gntdev..
>>> Can't find it now, what happened? Without, there's presently still no
>>> zero-copy.
>>
>> gntdev doesn't need VM_FOREIGN any more - it uses the (relatively
>> new-ish) mmu notifier infrastructure which is intended to allow a device
>> to sync an external MMU with usermode mappings. We're not using it in
>> precisely that way, but it allows us to wrangle grant mappings before
>> the generic code tries to do normal pte ops on them.
>
> The mmu notifiers were for safe teardown only. They are not sufficient
> for DIO, which wants gup() to work. If you want zcopy on gntdev, we'll
> need to back those VMAs with page structs. Or bounce again (gulp, just
> mentioning it). As with the blktap2 patches, note there is no difference
> in the dom0 memory bill, it takes page frames.
>
> This is pretty much exactly the pooling stuff in current drivers/blktap.
> The interface could look as follows ([] designates users).
>
> * [toolstack]
> Calling some ctls to create/destroy pools of frames.
> (Blktap currently does this in sysfs.)
>
> * [toolstack]
> Optionally resize them, according to the physical queue
> depth [estimate] of the underlying storage.
>
> * [tapdisk]
> A backend instance, when starting up, opens a gntdev, then
> uses a ctl to bind its gntdev handle to a frame pool.
>
> * [tapdisk]
> The .mmap call now will allocate frames to back the VMA.
> This operation can fail/block under congestion. Neither
> is desirable, so we need a .poll.
>
> * [tapdisk]
> To integrate grant mappings with a single-threaded event loop,
> use .poll. The handle fires as soon as a request can be mapped.
>
> Under congestion, the .poll code will queue waiting disks and wake
> them round-robin, once VMAs are released.
>
> (A [tapdisk] doesn't mean to dismiss a potential [qemu].)
>
> Still confident we want that? (Seriously asking). A lot of the code to
> do so has been written for blktap, it wouldn't be hard to bend into a
> gntdev extension.
>
>>> Once the issues were solved, it'd be kinda nice. Simplifies stuff like
>>> memshr for blktap, which depends on getting hold of original grefs.
>>>
>>> We'd presumably still need the tapdev nodes, for qemu, etc. But those
>>> can stay non-xen aware then.
>>>
>>>>>> The only caveat is the stray unmapping problem, but I think gntdev can
>>>>>> be modified to deal with that pretty easily.
>>>>> Not easier than anything else in kernel space, but when dealing only
>>>>> with the refcounts, that's as good a place as anywhere else, yes.
>>>> I think the refcount test is pretty straightforward - if the refcount is
>>>> 1, then we're the sole owner of the page and we don't need to worry
>>>> about any other users. If its > 1, then somebody else has it, and we
>>>> need to make sure it no longer refers to a granted page (which is just a
>>>> matter of doing a set_pte_atomic() to remap from present to present).
>>> [set_pte_atomic over grant ptes doesn't work, or does it?]
>>
>> No, I forgot about grant PTEs' magic properties. But there is the hypercall.
>
> Yup.
>
>>>> Then we'd have a set of frames whose lifetimes are being determined by
>>>> some other subsystem. We can either maintain a list of them and poll
>>>> waiting for them to become free, or just release them and let them be
>>>> managed by the normal kernel lifetime rules (which requires that the
>>>> memory attached to them be completely normal, of course).
>>> The latter sounds like a good alternative to polling. So an
>>> unmap_and_replace, and giving up ownership thereafter. Next run of the
>>> dispatcher thread can just refill the foreign pfn range via
>>> alloc_empty_pages(), to rebalance.
>>
>> Do we actually need a "foreign page range"? Won't any pfn do? If we
>> start with a specific range of foreign pfns and then start freeing those
>> pfns back to the kernel, we won't have one for long...
>
> I guess we've been meaning the same thing here, unless I'm
> misunderstanding you. Any pfn does, and the balloon pagevec allocations
> default to order 0 entries indeed. Sorry, you're right, that's not a
> 'range'. With a pending re-xmit, the backend can find a couple (or all)
> of the request frames have count>1. It can flip and abandon those as
> normal memory. But it will need those lost memory slots back, straight
> away or next time it's running out of frames. As order-0 allocations.
>
> Foreign memory is deliberately short. Blkback still defaults to 2 rings
> worth of address space, iirc, globally. That's what that mempool sysfs
> stuff in the later blktap2 patches aimed at -- making the size
> configurable where queue length matters, and isolate throughput between
> physical backends, where the toolstack wants to care.
>
> Daniel
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel