On Mon, 2010-11-15 at 13:27 -0500, Jeremy Fitzhardinge wrote:
> On 11/12/2010 07:55 PM, Daniel Stodden wrote:
> >> Surely this can be dealt with by replacing the mapped granted page with
> >> a local copy if the refcount is elevated?
> > Yeah. We briefly discussed this when the problem started to pop up
> > (again).
> > I had a patch, for blktap1 in XS 5.5 iirc, which would fill mapping with
> > a dummy page mapped in. You wouldn't need a copy, a R/O zero map easily
> > does the job.
> Hm, I'd be a bit concerned that that might cause problems if used
Yeah. It wasn't a problem because all the network backends are on TCP,
where one can be rather sure that the dups are going to be properly
Does this hold everywhere ..? -- As mentioned below, the problem is
rather in AIO/DIO than being Xen-specific, so you can see the same
behavior on bare metal kernels too. A userspace app seeing an AIO
complete and then reusing that buffer elsewhere will occassionally
resend garbage over the network.
How often that would happen in practice probably depends on the
popularity of DIO on NFS for normal apps. Not too many, so yeah, very
generically one would maybe want to memcpy().
> If the page is being used RO, then replacing with a copy
> shouldn't be a problem, but getting a consistent snapshot of a mutable
> page is obviously going to be a problem.
I think it's safe to assume the problem is limited to writes, so the
buffers aren't really mutable.
> > On UP that'd be just a matter of disabling interrupts for
> > a while.
> Disable for what purpose? You mean to do an exchange of the mapping, or
> something else?
[I was referring to the prospect of doing a gref unmap/map sequence in
dom0. Since we do have a swap operation in Xen, just asking flipping the
PTE entry and flushing across all vcpus sounds sufficient to me,
> > I dropped it after it became clear that XS was moving to SMP, where one
> > would end up with a full barrier to orchestrate the TLB flushes
> > everywhere. Now, the skb runs prone to crash all run in softirq context,
> > I wouldn't exactly expect a huge performance win from syncing on that
> > kind of thing across all nodes, compared to local memcpy. Nor would I
> > want storage stuff to touch locks shared with TCP, that's just not our
> > business. Correct me if I'm mistaken.
> I don't follow what your high-level concern is here. If we update the
> pte to unmap the granted page, then its up to Xen to arrange for any TLB
> flushes to make sure the page is no longer accessible to the domain. We
> don't need to do anything explicit, and its independent of whether we're
> simply unmapping the page or replacing the mapping with another one (ie,
> any TLB flushing necessary is already going on).
> I was also under the impression that this is a relatively rare event
> rather than something that's going to be necessary for every page, so
> the overhead should be minimal.
> >>> If zero-copy becomes more attractive again, the plan would be to
> >>> rather use grantdev in userspace, such as a filter driver for tapdisk
> >>> instead. Until then, there's presumably a notable difference in L2
> >>> cache footprint. Then again, there's also a whole number of cycles not
> >>> spent in redundant hypercalls now, to deal with the pseudophysical
> >>> map.
> >> Frankly, I think the idea of putting blkback+tapdisk entirely in
> >> usermode is all upside with no (obvious) downsides. It:
> >> 1. avoids having to upstream anything
> >> 2. avoids having to upstream anything
> >> 3. avoids having to upstream anything
> >> 4. gets us back zero-copy (if that's important)
> > (No, unfortunately. DIO on a granted frame under blktap would be as
> > vulnerable as DIO on a granted frame under a userland blkback, userland
> > won't buy us anthing as far as the zcopy side of things is concerned).
> Why's that? Are you talking about the stray page reference
> vulnerability, or something else? You're right that it doesn't really
> help with stray references because its still kernel code which ends up
> doing the dereference rather than usermode.
[Still was talking about the stray reference issues].
> >> 5. makes the IO path nice and straightforward
> >> 6. seems to address all the other problems you mentioned
> > I'm not at all against a userland blkback. Just wouldn't go as far as
> > considering this a silver bullet.
> > The main thing I'm scared of is ending up hacking cheesy stuff into the
> > user ABI to take advantage of things immediately available to FSes on
> > the bio layer, but harder (or at least slower) to get made available to
> > userland.
> But that hasn't been a problem for tapdisk so far. Given that it is
> using DIO then any completed write has at least been submitted to a
> storage device, so then its just a question of how to make sure that any
> buffers are fully flushed, which would be fdatasync() I guess.
> Besides, our requirements are hardly unique; if there's some clear need
> for a new API, then the course of action is to work with broader Linux
> community to work out what it should be and how it should be
> implemented. There's no need for "hacking cheesy stuff" - there's been
> enough of that already.
> > DISCARD support is one currently hot example, do you see that in AIO
> > somewhere? Ok, probably a good thing for everybody anyway, so maybe
> > patching that is useful work. But it's extra work right now and probably
> > no more fun to maintain than blktap is.
> fallocate() is being extended to allow hole-punching in files. I don't
> know what work is being done to do discard on random block devices, but
> that's clearly a generically useful thing to have.
Hole punching for userland being in the works is actually great news.
Just saw the patchset on ext4-devel flying by, great.
> > The second issue I see is the XCP side of things. XenServer got a lot of
> > benefit out of blktap2, and particularly because of the tapdevs. It
> > promotes a fairly rigorous split between a blkback VBD, controlled by
> > the agent, and tapdevs, controlled by XS's storage manager.
> > That doesn't prevent blkback to go into userspace, but it better won't
> > share a process with some libblktap, which in turn would better not be
> > controlled under the same xenstore path.
> Could you elaborate on this? What was the benefit?
It's been mainly a matter of who controls what. Blktap1 was basically a
VBD, controlled by the agent. Blktap2 is a VDI represented as a block
device. Leaving management of that to XCP's storage manager, which just
hands that device node over to Xapi simplified many things. Before, the
agent had to understand a lot about the type of storage, then talk to
the right backend accordingly. Worse, in order to have storage
management control a couple datapath features, you'd basically have to
talk to Xapi, which would talk though xenstore to blktap, which was a
bit tedious. :)
Merging the VDI side of things with VBDs again just didn't sound so
attrative at first sight.
But I've thinking a little longer about the issues in the meantime. I
think it's actually not so hard to do that without collateral damage.
Let's say we create an extension to tapdisk which speaks blkback's
datapath in userland. We'd basically put one of those tapdisks on every
storage node, independent of the image type, such as a bare LUN or a
VHD. We add a couple additional IPC calls to make it directly
connect/disconnect to/from (ring-ref,event-channel) pairs.
Means it doesn't even need to talk xenstore, the control plane could all
be left to some single daemon, which knows how to instruct the right
tapdev (via libblktapctl) by looking at the physical-device node. I
guess getting the control stuff out of the kernel is always a good idea.
There are some important parts which would go missing. Such as
ratelimiting gntdev accesses -- 200 thundering tapdisks each trying to
gntmap 352 pages simultaneously isn't so good, so there still needs to
be some bridge arbitrating them. I'd rather keep that in kernel space,
okay to cram stuff like that into gntdev? It'd be much more
straightforward than IPC.
Also, I was absolutely certain I once saw VM_FOREIGN support in gntdev..
Can't find it now, what happened? Without, there's presently still no
Once the issues were solved, it'd be kinda nice. Simplifies stuff like
memshr for blktap, which depends on getting hold of original grefs.
We'd presumably still need the tapdev nodes, for qemu, etc. But those
can stay non-xen aware then.
> >> The only caveat is the stray unmapping problem, but I think gntdev can
> >> be modified to deal with that pretty easily.
> > Not easier than anything else in kernel space, but when dealing only
> > with the refcounts, that's as as good a place as anwhere else, yes.
> I think the refcount test is pretty straightforward - if the refcount is
> 1, then we're the sole owner of the page and we don't need to worry
> about any other users. If its > 1, then somebody else has it, and we
> need to make sure it no longer refers to a granted page (which is just a
> matter of doing a set_pte_atomic() to remap from present to present).
[set_pte_atomic over grant ptes doesn't work, or does it?]
> Then we'd have a set of frames whose lifetimes are being determined by
> some other subsystem. We can either maintain a list of them and poll
> waiting for them to become free, or just release them and let them be
> managed by the normal kernel lifetime rules (which requires that the
> memory attached to them be completely normal, of course).
The latter sounds like a good alternative to polling. So an
unmap_and_replace, and giving up ownership thereafter. Next run of the
dispatcher thread can can just refill the foreign pfn range via
alloc_empty_pages(), to rebalance.
Xen-devel mailing list