On Tue, 2010-11-16 at 12:56 -0500, Jeremy Fitzhardinge wrote:
> On 11/16/2010 01:13 AM, Daniel Stodden wrote:
> > On Mon, 2010-11-15 at 13:27 -0500, Jeremy Fitzhardinge wrote:
> >> On 11/12/2010 07:55 PM, Daniel Stodden wrote:
> >>>> Surely this can be dealt with by replacing the mapped granted page with
> >>>> a local copy if the refcount is elevated?
> >>> Yeah. We briefly discussed this when the problem started to pop up
> >>> (again).
> >>> I had a patch, for blktap1 in XS 5.5 iirc, which would fill mapping with
> >>> a dummy page mapped in. You wouldn't need a copy, a R/O zero map easily
> >>> does the job.
> >> Hm, I'd be a bit concerned that that might cause problems if used
> >> generically.
> > Yeah. It wasn't a problem because all the network backends are on TCP,
> > where one can be rather sure that the dups are going to be properly
> > dropped.
> > Does this hold everywhere ..? -- As mentioned below, the problem is
> > rather in AIO/DIO than being Xen-specific, so you can see the same
> > behavior on bare metal kernels too. A userspace app seeing an AIO
> > complete and then reusing that buffer elsewhere will occasionally
> > resend garbage over the network.
> Yeah, that sounds like a generic security problem. I presume the
> protocol will just discard the excess retransmit data, but it might mean
> a usermode program ends up transmitting secrets it never intended to...
> > There are some important parts which would go missing. Such as
> > ratelimiting gntdev accesses -- 200 thundering tapdisks each trying to
> > gntmap 352 pages simultaneously isn't so good, so there still needs to
> > be some bridge arbitrating them. I'd rather keep that in kernel space,
> > okay to cram stuff like that into gntdev? It'd be much more
> > straightforward than IPC.
> What's the problem? If you do nothing then it will appear to the kernel
> as a bunch of processes doing memory allocations, and they'll get
> blocked/rate-limited accordingly if memory is getting short.
The problem is that just letting the page allocator work through
allocations isn't going to scale anywhere.
The worst-case memory requested under load is N * (32 * 11) pages, with
N the number of disks. As a (conservative) rule of thumb, N will be 200
or rather more.
The number of I/Os actually in flight at any point, in contrast, is
derived from the queue/sg sizes of the physical device. For a simple
disk, that's about a ring or two.
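To put numbers on it, a small sketch, assuming the classic blkif
geometry (32 requests per ring, 11 segments per request, 4 KiB pages);
the helper names are mine, not an existing API:

```c
#include <assert.h>

#define RING_REQS    32   /* requests per blkif ring */
#define SEGS_PER_REQ 11   /* BLKIF_MAX_SEGMENTS_PER_REQUEST */
#define PAGE_KIB     4    /* assuming 4 KiB pages */

/* Worst case: every disk maps a full ring's worth of segments. */
static unsigned long worst_case_pages(unsigned long ndisks)
{
    return ndisks * RING_REQS * SEGS_PER_REQ;
}

/* In flight: bounded by the physical queue depth, modeled here
 * as a couple of rings for a simple disk. */
static unsigned long in_flight_pages(unsigned long nrings)
{
    return nrings * RING_REQS * SEGS_PER_REQ;
}

static unsigned long pages_to_mib(unsigned long pages)
{
    return pages * PAGE_KIB / 1024;
}
```

For N = 200 that is 70400 pages, some 275 MiB pinned in the worst
case, against 704 pages for two rings' worth actually in flight.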
> plenty of existing mechanisms to control that sort of thing (cgroups,
> etc) without adding anything new to the kernel. Or are you talking
> about something other than simple memory pressure?
> And there's plenty of existing IPC mechanisms if you want them to
> explicitly coordinate with each other, but I'd tend to think that's
> premature unless you have something specific in mind.
> > Also, I was absolutely certain I once saw VM_FOREIGN support in gntdev..
> > Can't find it now, what happened? Without, there's presently still no
> > zero-copy.
> gntdev doesn't need VM_FOREIGN any more - it uses the (relatively
> new-ish) mmu notifier infrastructure which is intended to allow a device
> to sync an external MMU with usermode mappings. We're not using it in
> precisely that way, but it allows us to wrangle grant mappings before
> the generic code tries to do normal pte ops on them.
The mmu notifiers were for safe teardown only. They are not sufficient
for DIO, which wants gup() to work. If you want zcopy on gntdev, we'll
need to back those VMAs with page structs. Or bounce again (gulp, just
mentioning it). As with the blktap2 patches, note there is no difference
in the dom0 memory bill; it takes page frames either way.
This is pretty much exactly the pooling stuff in current drivers/blktap.
The interface could look as follows ([tapdisk] designates users):

 * [tapdisk] calls some ioctls to create/destroy pools of frames.
   (Blktap currently does this in sysfs.) Optionally resizes them,
   according to the physical queue depth [estimate] of the
   underlying storage.

 * A backend instance, when starting up, opens a gntdev, then
   uses an ioctl to bind its gntdev handle to a frame pool.

 * The .mmap call will then allocate frames to back the VMA.
   This operation can fail or block under congestion. Neither
   is desirable, so we need a .poll.

 * To integrate grant mappings with a single-threaded event loop,
   [tapdisk] uses .poll. The handle fires as soon as a request can
   be mapped.

 * Under congestion, the .poll code will queue waiting disks and
   wake them round-robin, once VMAs are released.

(A [tapdisk] doesn't mean to dismiss a potential [qemu].)
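A userland toy of the arbitration proposed above. All names here
(pool_alloc, pool_release, the waiter FIFO) are invented for
illustration; in the real thing these would be the gntdev .mmap and
.poll paths:

```c
#include <assert.h>

/* Hypothetical model of a gntdev frame pool with round-robin
 * wakeup under congestion. Not an existing kernel interface. */
#define POOL_FRAMES 8
#define MAX_DISKS   4

struct pool {
    int free_frames;
    int waiters[MAX_DISKS];   /* FIFO of disk ids blocked in .mmap */
    int head, tail, nwaiting;
};

/* .mmap path: succeed if frames are available, else queue the disk. */
static int pool_alloc(struct pool *p, int disk, int frames)
{
    if (p->free_frames >= frames) {
        p->free_frames -= frames;
        return 0;                        /* mapped */
    }
    p->waiters[p->tail] = disk;          /* congested: wait for .poll */
    p->tail = (p->tail + 1) % MAX_DISKS;
    p->nwaiting++;
    return -1;
}

/* VMA teardown: return frames, wake the longest waiter (round robin).
 * Returns the woken disk id, or -1 if nobody was waiting; the woken
 * disk would then retry its .mmap. */
static int pool_release(struct pool *p, int frames)
{
    p->free_frames += frames;
    if (!p->nwaiting)
        return -1;
    int disk = p->waiters[p->head];
    p->head = (p->head + 1) % MAX_DISKS;
    p->nwaiting--;
    return disk;                         /* this disk's .poll fires */
}
```

With an 8-frame pool, two disks mapping 4 frames each saturate it; a
third disk's .mmap queues, and the next release wakes exactly that
disk.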
Still confident we want that? (Seriously asking.) A lot of the code to
do so has been written for blktap; it wouldn't be hard to bend it into a
gntdev extension.
> > Once the issues were solved, it'd be kinda nice. Simplifies stuff like
> > memshr for blktap, which depends on getting hold of original grefs.
> > We'd presumably still need the tapdev nodes, for qemu, etc. But those
> > can stay non-xen aware then.
> >>>> The only caveat is the stray unmapping problem, but I think gntdev can
> >>>> be modified to deal with that pretty easily.
> >>> Not easier than anything else in kernel space, but when dealing only
> >>> with the refcounts, that's as good a place as anywhere else, yes.
> >> I think the refcount test is pretty straightforward - if the refcount is
> >> 1, then we're the sole owner of the page and we don't need to worry
> >> about any other users. If its > 1, then somebody else has it, and we
> >> need to make sure it no longer refers to a granted page (which is just a
> >> matter of doing a set_pte_atomic() to remap from present to present).
> > [set_pte_atomic over grant ptes doesn't work, or does it?]
> No, I forgot about grant ptes magic properties. But there is the hypercall.
> >> Then we'd have a set of frames whose lifetimes are being determined by
> >> some other subsystem. We can either maintain a list of them and poll
> >> waiting for them to become free, or just release them and let them be
> >> managed by the normal kernel lifetime rules (which requires that the
> >> memory attached to them be completely normal, of course).
> > The latter sounds like a good alternative to polling. So an
> > unmap_and_replace, and giving up ownership thereafter. Next run of the
> > dispatcher thread can just refill the foreign pfn range via
> > alloc_empty_pages(), to rebalance.
> Do we actually need a "foreign page range"? Won't any pfn do? If we
> start with a specific range of foreign pfns and then start freeing those
> pfns back to the kernel, we won't have one for long...
I guess we've been meaning the same thing here, unless I'm
misunderstanding you. Any pfn does, and the balloon pagevec allocations
default to order 0 entries indeed. Sorry, you're right, that's not a
'range'. With a pending re-xmit, the backend can find a couple (or all)
of the request frames have count>1. It can flip and abandon those as
normal memory. But it will need those lost memory slots back, straight
away or next time it's running out of frames. As order-0 allocations.
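A toy model of that flip-and-abandon step; the helpers (page_abandon,
page_alloc0) are stand-ins for the kernel's normal page lifetime rules
and the order-0 allocator, invented for illustration:

```c
#include <assert.h>
#include <stdlib.h>

/* A frame slot in the foreign pool: a backing page plus the
 * reference count observed at I/O completion. */
struct frame { void *page; int count; };

/* Stand-in for surrendering the page to normal lifetime rules;
 * the toy just frees it, the kernel would let the remaining
 * holder (e.g. a pending re-xmit) drop the last reference. */
static void page_abandon(void *page) { free(page); }

/* Stand-in for an order-0 replacement allocation. */
static void *page_alloc0(void) { return calloc(1, 4096); }

/* Walk the pool: any frame still held elsewhere (count > 1) is
 * flipped to normal memory and its slot refilled straight away.
 * Returns how many slots were refilled. */
static int pool_refill(struct frame *pool, int n)
{
    int refilled = 0;
    for (int i = 0; i < n; i++) {
        if (pool[i].count > 1) {          /* stray reference */
            page_abandon(pool[i].page);   /* flip, give it up */
            pool[i].page = page_alloc0(); /* order-0 replacement */
            pool[i].count = 1;
            refilled++;
        }
    }
    return refilled;
}
```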
Foreign memory is deliberately short. Blkback still defaults to 2 rings
worth of address space, iirc, globally. That's what that mempool sysfs
stuff in the later blktap2 patches aimed at -- making the size
configurable where queue length matters, and isolate throughput between
physical backends, where the toolstack wants to care.
Xen-devel mailing list