This is an archived copy of the Xen.org mailing list, which we have preserved to ensure that existing links to archives are not broken. The live archive, which contains the latest emails, can be found at http://lists.xen.org/


[Xen-devel] Re: blktap: Sync with XCP, dropping zero-copy.

To: Jeremy Fitzhardinge <jeremy@xxxxxxxx>
Subject: [Xen-devel] Re: blktap: Sync with XCP, dropping zero-copy.
From: Daniel Stodden <daniel.stodden@xxxxxxxxxx>
Date: Fri, 12 Nov 2010 19:56:20 -0800
Cc: "xen-devel@xxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxx>
Delivery-date: Fri, 12 Nov 2010 19:57:05 -0800
Envelope-to: www-data@xxxxxxxxxxxxxxxxxxx
In-reply-to: <4CDDE0DA.2070303@xxxxxxxx>
List-help: <mailto:xen-devel-request@lists.xensource.com?subject=help>
List-id: Xen developer discussion <xen-devel.lists.xensource.com>
List-post: <mailto:xen-devel@lists.xensource.com>
List-subscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
List-unsubscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
References: <1289604707-13378-1-git-send-email-daniel.stodden@xxxxxxxxxx> <4CDDE0DA.2070303@xxxxxxxx>
Sender: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx

On Fri, 2010-11-12 at 19:50 -0500, Jeremy Fitzhardinge wrote:
> On 11/12/2010 03:31 PM, Daniel Stodden wrote:
> > It's a fairly big change in how I/O buffers are managed. Prior to this
> > series, we had zero-copy I/O down to userspace. Unfortunately, blktap2
> > always had to jump through a couple of extra hoops to do so. The
> > present state is that we dropped that, so all tapdev I/O is bounced
> > to/from a bunch of normal pages, essentially replacing the old VMA
> > management with a couple of insert/zap VM calls.
> Do you have any performance results comparing the two approaches?

No. One could probably try large ramdisks or an AIO backend on tmpfs.
All the storage I'm concerned with here terminates either on the NIC or
on a local spindle, and neither is fast enough for the extra copy's
cache bandwidth to become the bottleneck.

> > One issue was that the kernel can't cope with recursive I/O:
> > submitting an iovec on a tapdev, passing it to userspace, and then
> > reissuing the same vector via AIO apparently doesn't fit well with
> > the lock protocol applied to those pages. This is the main reason why
> > blktap had to deal with grant refs so much, about as much as blkback
> > already does before passing requests on. What happens there is that
> > it aliases the granted pages under a different PFN, and thereby in a
> > separate page struct. Not pretty, but it worked, so it's not the
> > reason we chose to drop it.
> >
> > The more prevalent problem was network storage, especially anything
> > involving TCP. That includes VHD on both NFS and iSCSI. The problem
> > with those is that retransmits (by the transport) and I/O op
> > completion (on the application layer) are never synchronized.  With
> > sufficiently bad timing and a bit of jitter on the network, it's
> > perfectly common for the kernel to complete an AIO request with a late
> > ack on the input queue just as the retransmission timer is about to fire
> > underneath. The completion will unmap the granted frame, crashing any
> > uncanceled retransmission on an empty page frame. There are different
> > ways to deal with that. Page destructors might be one way, but as far
> > as I heard they are not particularly popular upstream. Issuing the
> > block I/O on dom0 memory is straightforward and avoids the hassle. One
> > could go argue that retransmits after DIO completion are still a
> > potential privacy problem (I did), but it's not Xen's problem after
> > all.
> Surely this can be dealt with by replacing the mapped granted page with
> a local copy if the refcount is elevated?

Yeah. We briefly discussed this when the problem started to pop up.

I had a patch, for blktap1 in XS 5.5 iirc, which would fill the mapping
with a dummy page instead. You wouldn't need a copy; an R/O zero map
easily does the job. On UP that'd be just a matter of disabling
interrupts for a while.

I dropped it after it became clear that XS was moving to SMP, where one
would end up with a full barrier to orchestrate the TLB flushes
everywhere. The skb paths prone to crashing all run in softirq context,
and I wouldn't exactly expect a huge performance win from syncing on
that kind of thing across all nodes, compared to a local memcpy. Nor
would I want storage code to touch locks shared with TCP; that's just
not our business. Correct me if I'm mistaken.

I'd rather like to see stuff like node affinity on NUMA get a bit more
work. I think the patch presently just fills in the queue node, but
that didn't see much testing, and one would have to correlate it.

>   Then that can catch any stray
> residual references while we can still return the granted page to its
> owner.  And obviously, not reuse that pfn for grants until the refcount
> is zero...

> > If zero-copy becomes more attractive again, the plan would be to
> > rather use grantdev in userspace, such as a filter driver for tapdisk
> > instead. Until then, there's presumably a notable difference in L2
> > cache footprint. Then again, there's also a whole number of cycles not
> > spent in redundant hypercalls now, to deal with the pseudophysical
> > map.
> Frankly, I think the idea of putting blkback+tapdisk entirely in
> usermode is all upside with no (obvious) downsides.  It:
>    1. avoids having to upstream anything
>    2. avoids having to upstream anything
>    3. avoids having to upstream anything
>    4. gets us back zero-copy (if that's important)

(No, unfortunately. DIO on a granted frame under blktap would be as
vulnerable as DIO on a granted frame under a userland blkback; userland
won't buy us anything as far as the zero-copy side of things is
concerned.)

>    5. makes the IO path nice and straightforward
>    6. seems to address all the other problems you mentioned

I'm not at all against a userland blkback. Just wouldn't go as far as
considering this a silver bullet.

The main thing I'm scared of is ending up hacking cheesy stuff into the
user ABI to take advantage of things immediately available to FSes on
the bio layer, but harder (or at least slower) to get made available to
userspace.

DISCARD support is one currently hot example; do you see that in AIO
somewhere? OK, it's probably a good thing for everybody anyway, so
maybe patching that in is useful work. But it's extra work right now,
and probably no more fun to maintain than blktap is.

The second issue I see is the XCP side of things. XenServer got a lot of
benefit out of blktap2, and particularly because of the tapdevs. It
promotes a fairly rigorous split between a blkback VBD, controlled by
the agent, and tapdevs, controlled by XS's storage manager.

That doesn't prevent blkback from going into userspace, but it had
better not share a process with some libblktap, which in turn had
better not be controlled under the same xenstore path.

So for XCP it'd be AIO on tapdevs for the time being, and with that
whatever the syscall interface lets you do.

> The only caveat is the stray unmapping problem, but I think gntdev can
> be modified to deal with that pretty easily.

Not easier than anything else in kernel space, but when dealing only
with the refcounts, that's as good a place as anywhere else, yes.

> qemu has usermode blkback support already, and an actively improving
> block-IO infrastructure, so one approach might be to consider putting
> (parts of) tapdisk into qemu - which makes it pretty natural to reuse it
> with non-Xen guests via virtio-block, emulated devices, etc.  But I'm
> not sold on that; having a standalone tapdisk w/ blkback makes sense to
> me as well.
> On the other hand, I don't think we're going to be able to get away with
> putting netback in usermode, so we still need to deal with that - but I
> think an all-copying version will be fine to get started with at least.

> > Please pull upstream/xen/dom0/backend/blktap2 from
> > git://xenbits.xensource.com/people/dstodden/linux.git
> OK, I've pulled it, but I haven't had a chance to test it yet.


