
[Xen-devel] Re: [PATCH 00/17] Netchannel2 for a modern git kernel

> >> BTW, do you see this is something as a candidate for merging upstream?
> >>     
> > I've mostly been defining "upstream" as you, but, yes, sending it
> > further would be good.
> OK, but that's a fair bit more work.
Yes, indeed.  This is very much a long-term goal.

It might make sense to send an initial version which doesn't support
receiver-map mode first, because that avoids the whole PG_foreign
issue.  It'd be a bit slow, but it would work, and it'd be properly
cross-compatible with a receiver-map capable version.

> > The NC2 approach is basically similar to the NC1 approach, but
> > generalised so that NC1 and NC2 can cooperate in a reasonably sane
> > way.  It still uses the PG_foreign bit to identify foreign pages, and
> > the page->private and page->mapping fields for various bits of
> > information.
> Unfortunately the PG_foreign approach is a non-starter for upstream,
> mainly because adding new page flags is strongly frowned upon unless
> there's a very compelling reason.  Unless we can find some other kernel
> subsystems which can make use of a page destructor, we probably won't
> make the cut.  (It doesn't help that there are no more page flags left
> on 32-bit.)
Yeah, I didn't think that was going to go very far.

It might be possible to do something like:

1) Create a special struct address_space somewhere.  This wouldn't
   really do anything, but would just act as a placeholder.
2) Whenever we would normally set PG_foreign, set page->mapping to
   point at the placeholder address_space.
3) Rather than testing PG_foreign, test page->mapping == &placeholder.
4) Somehow move all of the Xen-specific bits which currently use
   ->mapping to use ->private instead.

Then we wouldn't need the page bit.  It's not even that much of an
abuse; foreign memory is arguably a special kind of address space, so
creating a struct address_space for it isn't insane.
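
The four steps above can be sketched roughly as follows.  This is a
userspace mock-up, not kernel code: struct page and struct
address_space are cut-down stand-ins, and foreign_placeholder,
mark_page_foreign(), page_is_foreign() and clear_page_foreign() are
hypothetical names which don't exist in any tree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Cut-down stand-ins for the kernel types; only the fields used here. */
struct address_space { int unused; };
struct page {
    struct address_space *mapping;
    unsigned long private;
};

/* Step 1: a placeholder which does nothing except have an address. */
static struct address_space foreign_placeholder;

/* Step 2: instead of setting PG_foreign, point ->mapping at the
 * placeholder.  Step 4: the Xen-specific information goes in ->private. */
static void mark_page_foreign(struct page *pg, unsigned long info)
{
    assert(pg->mapping == NULL);   /* must not already be in a mapping */
    pg->mapping = &foreign_placeholder;
    pg->private = info;
}

/* Step 3: replaces the PG_foreign test. */
static bool page_is_foreign(const struct page *pg)
{
    return pg->mapping == &foreign_placeholder;
}

/* Undo mark_page_foreign() once the foreign mapping is torn down. */
static void clear_page_foreign(struct page *pg)
{
    assert(page_is_foreign(pg));
    pg->mapping = NULL;
    pg->private = 0;
}
```

The sketch assumes that nothing else needs ->mapping while the page is
marked foreign, which is exactly what step 4 would have to guarantee.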

> The approach I'm trying at the moment is to use the skb destructor
> mechanism to grab the pages out of the skb as it's freed.  To deal with
> skb_clone, I'm adding a flag to the skb to force a clone to do a
> complete copy so there are no more aliases to the pages (skb_clone
> should be rare in the common case).
Yeah, that would work.  There needs to be some way for netback to get
grant references and so forth related to netchannel2-mapped pages, and
vice versa, but that shouldn't be too hard.

> > The basic idea is that everything which can map foreign pages and
> > expose them to the rest of Linux needs to allocate a foreign page
> > tracker (effectively an array of (domid, grant_ref, void *ctxt)
> > tuples), and to register mapped pages with that tracker.  It then uses
> > the top few bits of page->private to identify the tracker, and the
> > rest to index into the array.  This allows you to forward packets from
> > a foreign domain without knowing whether it was received by NC1 or
> > NC2.
> Well, if it's wrapped by an skb, we can get the skb destructor to handle
> the cleanup phase.  So long as we get the callback, I don't think it
> should affect the rest of the mechanism.
Cleanup isn't the tricky part.  The problem is that you can't forward
a packet unless you know which domain it came from and the relevant
grant references, because Xen won't let you create grant references on
a mapping of another domain's memory.  You therefore need some way of
translating a struct page in an skb into a (domid_t, grant_ref_t)
pair.  netback currently handles this with some private lookup tables,
but that only works if it's the only thing which can inject foreign
mappings into the stack.  The foreign map tracker stuff was an attempt
to generalise this to work with multiple netback-like drivers.
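
To make the translation concrete, here is a rough userspace sketch of
the tracker lookup.  It is not the code from the patch series: the
4-bit tracker field, the struct layouts and the function names
(register_tracker, track_page, lookup_foreign_page) are all
assumptions made for illustration, and struct page is a stand-in.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint16_t domid_t;       /* same widths as Xen's public headers */
typedef uint32_t grant_ref_t;

/* Assumption: 4 bits select the tracker; the rest index its array. */
#define TRACKER_BITS 4
#define MAX_TRACKERS (1u << TRACKER_BITS)
#define INDEX_BITS   (sizeof(unsigned long) * 8 - TRACKER_BITS)
#define INDEX_MASK   ((1ul << INDEX_BITS) - 1)

struct page { unsigned long private; };   /* stand-in for the kernel type */

struct foreign_entry {
    domid_t domid;
    grant_ref_t gref;
    void *ctxt;
    bool in_use;
};

struct foreign_tracker {
    unsigned int id;
    size_t nr_entries;
    struct foreign_entry *entries;
};

static struct foreign_tracker *trackers[MAX_TRACKERS];

/* Each driver which maps foreign pages registers one tracker. */
static int register_tracker(struct foreign_tracker *t)
{
    unsigned int i;

    for (i = 0; i < MAX_TRACKERS; i++) {
        if (trackers[i] == NULL) {
            trackers[i] = t;
            t->id = i;
            return (int)i;
        }
    }
    return -1;
}

/* Record where a mapped page came from and encode (tracker, index)
 * into page->private. */
static int track_page(struct foreign_tracker *t, struct page *pg, size_t idx,
                      domid_t domid, grant_ref_t gref, void *ctxt)
{
    if (idx >= t->nr_entries || idx > INDEX_MASK || t->entries[idx].in_use)
        return -1;
    t->entries[idx] = (struct foreign_entry){ domid, gref, ctxt, true };
    pg->private = ((unsigned long)t->id << INDEX_BITS) | idx;
    return 0;
}

/* Translate a page back into a (domid, gref) pair, regardless of which
 * driver mapped it.  The caller must already know the page is foreign. */
static int lookup_foreign_page(const struct page *pg, domid_t *domid,
                               grant_ref_t *gref)
{
    struct foreign_tracker *t = trackers[pg->private >> INDEX_BITS];
    size_t idx = pg->private & INDEX_MASK;

    if (t == NULL || idx >= t->nr_entries || !t->entries[idx].in_use)
        return -1;
    *domid = t->entries[idx].domid;
    *gref = t->entries[idx].gref;
    return 0;
}
```

A forwarding driver would call lookup_foreign_page() on each fragment
page and use the result instead of trying to grant the page as if it
were local memory.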

> > Arguably, blkback should be using this mechanism as well, but since
> > we've gotten away with it so far I thought it'd be best to let
> > sleeping dogs lie.  The only time it'd make any difference would be
> > when pages out of a block request somehow get attached to network
> > packets, which seems unlikely.
> Block lifetimes are simpler because there's no cloning and bios have an
> end_io callback which is more or less equivalent to the skb destructor.
Yes, that's true, the cleanup bit is much easier for block requests,
but you still potentially have a forwarding issue.  There are a couple
of potentially problematic scenarios:

1) You might have nested block devices.  Suppose you have three
domains (domA, domB, and domC), and a physical block device sdX in
domA.  DomA could then be configured to run a blkback exposing sdX to
domB as xvdY.  DomB might then itself run a blkback exposing xvdY to
domC as xvdZ.  This won't work.  Requests issued by domC will be
mapped by domB's blkback and injected into its local storage stack,
and will eventually reach domB's xvdY blkfront.  This will try to
grant domA access to the relevant memory, but, because it doesn't know
about foreign mappings, it'll grant as if the memory was owned by
domB.  Xen will then reject domA's attempts to map these domB grants,
and every request on xvdZ will fail.

Admittedly, that'd be a rather stupid configuration, but it's not
currently blocked by the tools (and it'd be rather difficult to block,
even if we wanted to).

2) I've not actually checked this, but I suspect we have a problem if
you're running an iSCSI initiator in dom0 against a target running in
a domU, and then try to expose the SCSI device in dom0 as a block
device in some other domU.  When requests come in from the blkfront,
the dom0 blkback will map them as foreign pages, and then pass them
off to the iSCSI initiator.  It would make sense for the pages in the
block request to get attached to the skb as fragment pages, rather
than copied.  When the skb eventually reaches netback, netback will
try to do a grant copy into the receiving netfront's buffers (because
PG_foreign isn't set), which will fail, because dom0 doesn't actually
own the pages.

As I say, I've not actually checked whether that's how the initiators
work, but it would be a sane implementation if you're talking to a NIC
with jumbogram support.

Thinking some more, there's another variant of this bug which doesn't
involve block devices at all: bridging between a netfront and a
netback.  If you have a single bridge with both netfront and netback
devices attached to it, and you're not in ALWAYS_COPY_SKB mode,
forwarding packets from the netback interface to the netfront one
won't work.  Packets received by netback will be foreign mappings, but
netfront doesn't know that, so when it sends packets to the backend
it'll set up grants as if they were in local memory, which won't work.
I'm not sure what the right fix for that is; probably just copying
the packet in netfront.
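
A copy-in-netfront fallback might look something like the sketch
below.  Again this is a userspace mock-up under stated assumptions:
struct page carries an explicit foreign flag in place of whatever test
ends up replacing PG_foreign, struct frag is a simplified stand-in for
an skb fragment, and make_frag_local() is a hypothetical helper.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096

/* Stand-in for struct page; "foreign" replaces the real foreign test. */
struct page {
    bool foreign;
    unsigned char data[SKETCH_PAGE_SIZE];
};

/* Simplified stand-in for one skb fragment. */
struct frag {
    struct page *page;
    size_t offset;
    size_t len;
};

/*
 * Ensure the fragment refers to local memory before it is granted to
 * the backend.  Local pages are left alone; foreign pages are copied
 * into a freshly allocated local page.  Returns 0 on success, -1 if
 * the allocation fails.  Dropping the reference on the original
 * foreign page and freeing the copy later are left to the caller.
 */
static int make_frag_local(struct frag *f)
{
    struct page *copy;

    if (!f->page->foreign)
        return 0;

    assert(f->offset + f->len <= SKETCH_PAGE_SIZE);

    copy = calloc(1, sizeof(*copy));   /* zeroed, so copy->foreign is false */
    if (copy == NULL)
        return -1;

    memcpy(copy->data + f->offset, f->page->data + f->offset, f->len);
    f->page = copy;
    return 0;
}
```

This only pays the copy cost for packets which really are foreign, so
the ordinary local transmit path would be unaffected.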

