This is an archived copy of the Xen.org mailing list, which we have preserved to ensure that existing links to archives are not broken. The live archive, which contains the latest emails, can be found at http://lists.xen.org/


[Xen-devel] Re: blktap: Sync with XCP, dropping zero-copy.

To: Daniel Stodden <daniel.stodden@xxxxxxxxxx>
Subject: [Xen-devel] Re: blktap: Sync with XCP, dropping zero-copy.
From: Jeremy Fitzhardinge <jeremy@xxxxxxxx>
Date: Fri, 12 Nov 2010 16:50:34 -0800
Cc: Xen <xen-devel@xxxxxxxxxxxxxxxxxxx>
Delivery-date: Fri, 12 Nov 2010 16:51:31 -0800
Envelope-to: www-data@xxxxxxxxxxxxxxxxxxx
In-reply-to: <1289604707-13378-1-git-send-email-daniel.stodden@xxxxxxxxxx>
References: <1289604707-13378-1-git-send-email-daniel.stodden@xxxxxxxxxx>
Sender: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv: Gecko/20101027 Fedora/3.1.6-1.fc13 Lightning/1.0b3pre Thunderbird/3.1.6

On 11/12/2010 03:31 PM, Daniel Stodden wrote:
> It's a fairly big change in how I/O buffers are managed. Prior to this
> series, we had zero-copy I/O down to userspace. Unfortunately, blktap2
> always had to jump through a couple of extra hoops to do so. The present
> state is that we dropped that, so all tapdev I/O is bounced
> to/from a bunch of normal pages, essentially replacing the old VMA
> management with a couple of insert/zap VM calls.

Do you have any performance results comparing the two approaches?

> One issue was that the kernel can't cope with recursive
> I/O. Submitting an iovec on a tapdev, passing it to userspace and then
> reissuing the same vector via AIO apparently doesn't fit well with the
> lock protocol applied to those pages. This is the main reason why
> blktap had to deal a lot with grant refs. About as much as blkback
> already does before passing requests on. What happens there is that
> it's aliasing those granted pages under a different PFN, thereby in a
> separate page struct. Not pretty, but it worked, so it's not the
> reason why we chose to drop that at some point.
>
> The more prevalent problem was network storage, especially anything
> involving TCP. That includes VHD on both NFS and iSCSI. The problem
> with those is that retransmits (by the transport) and I/O op
> completion (on the application layer) are never synchronized.  With
> sufficiently bad timing and a bit of jitter on the network, it's
> perfectly common for the kernel to complete an AIO request with a late
> ack on the input queue just when the retransmission timer is about to
> fire underneath. The completion will unmap the granted frame, crashing
> any uncanceled retransmission on an empty page frame. There are different
> ways to deal with that. Page destructors might be one way, but as far
> as I heard they are not particularly popular upstream. Issuing the
> block I/O on dom0 memory is straightforward and avoids the hassle. One
> could argue that retransmits after DIO completion are still a
> potential privacy problem (I did), but it's not Xen's problem after
> all.

Surely this can be dealt with by replacing the mapped granted page with
a local copy if the refcount is elevated?  Then that can catch any stray
residual references while we can still return the granted page to its
owner.  And obviously, not reuse that pfn for grants until the refcount
is zero...
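
A toy model of that suggestion, in Python rather than kernel code. It is only a sketch under stated assumptions: `GrantSlot`, `get_ref`, `put_ref`, `unmap` and `can_reuse` are hypothetical names and do not correspond to any gntdev or grant-table API. It shows the two rules from the paragraph above: on unmap, a slot with stray references gets a private copy while the granted frame goes back to its owner, and the pfn is not reused for grants until the refcount drops to zero.

```python
class GrantSlot:
    """Toy model of one dom0 pfn that maps a granted guest frame (hypothetical)."""

    def __init__(self, granted_frame: bytearray):
        self.frame = granted_frame  # aliases guest memory while mapped
        self.granted = True
        self.refcount = 0           # references beyond the mapping itself

    def get_ref(self) -> None:
        # e.g. a network stack holding the page for a possible retransmit
        self.refcount += 1

    def put_ref(self) -> None:
        assert self.refcount > 0
        self.refcount -= 1

    def read(self) -> bytes:
        return bytes(self.frame)

    def unmap(self) -> None:
        """I/O completion: the granted frame always returns to its owner."""
        if self.refcount > 0:
            # Stray references remain: leave them a private local copy.
            self.frame = bytearray(self.frame)
        else:
            self.frame = bytearray()
        self.granted = False

    def can_reuse(self) -> bool:
        # The pfn may only take a new grant once nothing references it.
        return not self.granted and self.refcount == 0


guest_frame = bytearray(b"payload")
slot = GrantSlot(guest_frame)
slot.get_ref()                 # late retransmit still holds the page
slot.unmap()                   # completion hands the frame back
guest_frame[:] = b"scratch"    # guest is free to reuse its memory
print(slot.read(), slot.can_reuse())
slot.put_ref()
print(slot.can_reuse())
```

In the model, the stray reference keeps seeing the data that was there at completion time, so a late retransmit neither faults nor reads whatever the guest wrote afterwards.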

> If zero-copy becomes more attractive again, the plan would be to
> rather use grantdev in userspace, such as a filter driver for tapdisk
> instead. Until then, there's presumably a notable difference in L2
> cache footprint. Then again, there's also a whole number of cycles not
> spent in redundant hypercalls now, to deal with the pseudophysical
> map.

Frankly, I think the idea of putting blkback+tapdisk entirely in
usermode is all upside with no (obvious) downsides.  It:

   1. avoids having to upstream anything
   2. avoids having to upstream anything
   3. avoids having to upstream anything

   4. gets us back zero-copy (if that's important)
   5. makes the IO path nice and straightforward
   6. seems to address all the other problems you mentioned

The only caveat is the stray unmapping problem, but I think gntdev can
be modified to deal with that pretty easily.

qemu has usermode blkback support already, and an actively improving
block-IO infrastructure, so one approach might be to consider putting
(parts of) tapdisk into qemu, which would also make it pretty natural to
reuse it with non-Xen guests via virtio-block, emulated devices, etc.  But I'm
not sold on that; having a standalone tapdisk w/ blkback makes sense to
me as well.

On the other hand, I don't think we're going to be able to get away with
putting netback in usermode, so we still need to deal with that - but I
think an all-copying version will be fine to get started with at least.

> There are also benefits or non-issues.
>  - This blktap is rather xen-independent. Certainly depends on the
>    common ring macros, but lacking grant stuff it compiles on bare
>    metal Linux with no CONFIG_XEN. Not consummated here, because
>    that's going to move the source tree out of drivers/xen. But I'd
>    like to post a new branch proposing to do so.
>  - Blktap's size in dom0 didn't really change. Frames (now pages) were
>    always pooled. We used to balloon memory to claim space for
>    redundant grant mappings. Now we reserve, by default, the same
>    volume in normal memory.
>  - The previous code would run all I/O on a single pool. Typically
>    two rings worth of requests. Sufficient for a whole lot of systems,
>    especially with single storage backends, but not so nice when I/O
>    on a number of otherwise independent filers or volumes collides.
>    Pools are refcounted kobjects in sysfs. Toolstacks using the new
>    code can thereby choose to eliminate bottlenecks by grouping taps
>    on different buffer pools. Pools can also be resized, to accommodate
>    greater queue depths. [Note that blkback still has the same issue,
>    so guests won't take advantage of that before that's resolved as
>    well.]
>  - XCP started to make some use of stacking tapdevs. Think pointing
>    the image chain of a bunch of "leaf" taps to a shared parent
>    node. That works fairly well, but definitely takes independent
>    resource pools to avoid deadlock by parent starvation then.
>
> Please pull upstream/xen/dom0/backend/blktap2 from
> git://xenbits.xensource.com/people/dstodden/linux.git

OK, I've pulled it, but I haven't had a chance to test it yet.
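
On the stacked-tapdev point quoted above, the parent-starvation deadlock can be illustrated with a small toy model. This is a sketch only, assuming each leaf request holds one buffer while it waits for a parent request that needs a buffer of its own; `Pool` and `run` are hypothetical names and say nothing about how blktap's sysfs pools are actually implemented.

```python
class Pool:
    """Fixed-size buffer pool (hypothetical stand-in for a blktap pool)."""

    def __init__(self, size: int):
        self.free = size

    def try_alloc(self) -> bool:
        if self.free == 0:
            return False
        self.free -= 1
        return True

    def release(self) -> None:
        self.free += 1


def run(leaf_pool: Pool, parent_pool: Pool, n_requests: int) -> int:
    """Return how many leaf requests complete."""
    # A burst of leaf I/O grabs its buffers first.
    held = sum(1 for _ in range(n_requests) if leaf_pool.try_alloc())
    completed = 0
    for _ in range(held):
        # Each leaf request needs the parent to service one request.
        if not parent_pool.try_alloc():
            break  # parent starved: no leaf can finish and free a buffer
        parent_pool.release()
        leaf_pool.release()
        completed += 1
    return completed


shared = Pool(4)
print(run(shared, shared, 4))     # leaf and parent on one pool
print(run(Pool(4), Pool(4), 4))   # independent pools
```

With one shared pool the leaves consume every buffer and the parent can never allocate, so nothing completes; with independent pools every request finishes.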

