
Re: [Xen-devel] Re: Interdomain comms

  • To: Harry Butterworth <harry@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx>
  • From: Eric Van Hensbergen <ericvh@xxxxxxxxx>
  • Date: Sat, 7 May 2005 19:57:04 -0500
  • Cc: Mike Wray <mike.wray@xxxxxx>, xen-devel@xxxxxxxxxxxxxxxxxxx, "Ronald G. Minnich" <rminnich@xxxxxxxx>, Eric Van Hensbergen <ericvh@xxxxxxxxxxxxxxxxxxxxx>
  • Delivery-date: Sun, 08 May 2005 00:56:54 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xensource.com>

On 5/7/05, Harry Butterworth <harry@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx> wrote:
> If you could go through with 9p the same concrete example I did I think
> I'd find that useful.  Also, I should probably spend another 10 mins on
> the docs ;-)

It would be way better if we could step you through a demo in the
environment.  The documentation for Plan 9 has always been spartan at
best -- so getting a good idea of how things work without looking over
the shoulder of someone experienced has always been somewhat
difficult.  I've been trying to correct this while pushing some of the
bits into Linux - my Freenix v9fs paper spent quite a bit of time
talking about how 9P actually works and showing traces for typical
file system operations.  The OLS paper which I'm supposed to be
writing right now covers the same sort of thing for the dynamic
private name spaces.

The reason why I didn't post a counter-example was because I didn't
see much difference between our two ideas for the scenario you lay out
(except you obviously understand a lot more about some of the details
that I haven't gotten into yet):

(From the looks of things, you had already established an
authenticated connection between the FE and BE, the top-level had
already been established, and whatever file system you are using to
read file data had already traversed to the file and opened it.)

A quick summary of how we do the above with 9P:
a) Establish the connection to the BE (there are various semantics
possible here; in my current stuff I pass a reference to the Channel
around.  Alternatively you could use socket-like semantics to connect
to the other partition).
b) Issue a protocol negotiation packet to establish buffer sizes,
protocol version, etc. (t_version)
c) Attach to the remote resource, providing authentication information
if necessary (t_attach).  This will also create an initial piece of
shared meta-data referencing the root resource of the BE.  (For devices
there may only be a single resource, or one resource such as a block
device may have multiple nodes such as partitions; in Plan 9, devices
also present different aspects of their interface (ctl, data, stat,
etc.) as different nodes in a hierarchical fashion.)  The reference to
a node in the tree is called a FID (think of it as a file
descriptor) - it contains information about who has attached to the
resource and where in the hierarchy they are.  A key thing to remember
is that in Plan 9, every device is a file server.
d) You would then traverse the object tree to the resource you wanted
to use (in your case it sounded like a file in a file system, so the
metaphor is straightforward).  The client would issue a t_walk message
to perform this traversal.
e) The object would then need to be opened (t_open) with information
about what type of operation will be executed (read, write, both) and
can include additional information about the type of transactions
(append only, exclusive access, etc.) that may be beneficial to
managing the underlying resource.  The BE could use information cached
in the Fid from the attach to check the FE's permission to be
accessing this resource with that mode.

In our world, this would result in you holding a Fid pointing to the
open object.  The Fid is a pointer to meta-data and is considered
state on both the FE and the BE.  (This has downsides in terms of
reliability and the ability to recover sessions or fail over to
different BEs -- one of our summer students will be addressing the
reliability problem this summer.)
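
Steps b) through e) above correspond to a fixed sequence of 9P requests.
Here is a minimal sketch in C, assuming the 9P2000 message-type numbering
(where the requests are spelled Tversion, Tattach, and so on rather than
t_version, t_attach).  The names setup_sequence, setup_len and reply_type
are hypothetical, chosen only for illustration:

```c
#include <assert.h>
#include <stddef.h>

/* 9P2000 request (T-message) type codes used in the walkthrough. */
enum {
    Tversion = 100,  /* step b: negotiate version and buffer size */
    Tauth    = 102,  /* optional authentication before attach */
    Tattach  = 104,  /* step c: attach, yields a fid for the root */
    Twalk    = 110,  /* step d: traverse to the wanted node */
    Topen    = 112,  /* step e: open with a mode */
    Tread    = 116,
    Tclunk   = 120   /* release a fid when done */
};

/* Hypothetical: the order in which an FE issues requests for steps b) to e). */
static const int setup_sequence[] = { Tversion, Tattach, Twalk, Topen };

static size_t setup_len(void)
{
    return sizeof setup_sequence / sizeof setup_sequence[0];
}

/* In 9P2000 every request code is even and its reply is the next odd code. */
static int reply_type(int request)
{
    assert((request & 1) == 0);
    return request + 1;
}
```

The connection itself (step a) is outside the protocol, which is why it
does not appear in the sequence.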

The FE performs a read operation passing it the necessary bits:
  ret = read( fd, *buf, count );

This would actually be translated (in collaboration with local
meta-data) into a t_read message:
  t_read tag fid offset count (where offset is determined by local fid metadata)
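
To make the role of the local fid metadata concrete, here is a rough sketch
of how the FE side might track the offset.  The struct and function names
(fe_fid, t_read_req, make_t_read, finish_read) are hypothetical and not
taken from any existing code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical FE-side metadata for one open fid. */
struct fe_fid {
    uint32_t fid;     /* handle known to both FE and BE */
    uint64_t offset;  /* current file position, tracked on the FE */
};

/* Hypothetical in-memory form of a t_read request. */
struct t_read_req {
    uint16_t tag;
    uint32_t fid;
    uint64_t offset;
    uint32_t count;
};

/* Build the request for read(fd, buf, count) from the local fid state. */
static struct t_read_req make_t_read(const struct fe_fid *f, uint16_t tag,
                                     uint32_t count)
{
    assert(f != NULL);
    struct t_read_req req = { tag, f->fid, f->offset, count };
    return req;
}

/* Once the reply reports n bytes, advance the offset so that the next
 * read() continues where this one stopped. */
static void finish_read(struct fe_fid *f, uint32_t n)
{
    assert(f != NULL);
    f->offset += n;
}
```

The caller of read() never passes an offset; it falls out of the state
kept next to the fid.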

The BE receives the read request, and based on state information kept
in the Fid (basically your metadata), it finds the file contents in
the buffer cache.  It sends a response packet with a pointer to its
local buffer cache entry:

  r_read tag count *data

There are a couple ways we could go when the FE receives the response:
a) It could memcpy the data to the user buffer *buf.  This is the
way things currently work, and isn't very efficient -- but may be
the way to go for the ultra-paranoid who don't like sharing memory
references between partitions.

b) We could have registered the memory pointed to by *buf and passed
that reference along the path -- but then it probably would just
amount to the BE doing the copy rather than the front end.  Perhaps
this approximates what you were talking about doing?

c) As long as the buffers in question (both *buf and the buffer cache
entry) were page-aligned, etc. -- we could play clever VM games
marking the page as shared RO between the two partitions and alias the
virtual memory pointed to by *buf to the shared page.  This is very
sketchy and high level and I need to delve into all sorts of details
-- but the idea would be to use virtual memory as your friend for
this sort of shared read-only buffer cache.  It would also require
careful allocation of buffers of the right size on the right alignment
-- but driver writers are used to that sort of thing.
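
The alignment precondition for option c) is easy to state in code.  A
minimal sketch, assuming 4 KiB pages; PAGE_SIZE_ASSUMED, page_shareable
and pick_transfer are hypothetical names, and the actual page sharing
(which would go through the hypervisor) is not shown:

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE_ASSUMED 4096u  /* assumption: 4 KiB pages */

/* True when the range starts on a page boundary and covers whole pages. */
static int page_shareable(uintptr_t addr, size_t len)
{
    return len != 0
        && (addr % PAGE_SIZE_ASSUMED) == 0
        && (len % PAGE_SIZE_ASSUMED) == 0;
}

enum xfer { XFER_COPY, XFER_ALIAS };

/* Use aliasing (option c) only when both buffers qualify; otherwise
 * fall back to copying (option a). */
static enum xfer pick_transfer(uintptr_t dst, uintptr_t src, size_t len)
{
    if (page_shareable(dst, len) && page_shareable(src, len))
        return XFER_ALIAS;
    return XFER_COPY;
}
```

Falling back to a copy keeps unaligned callers working, at the cost of
the efficiency that option c) is after.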

To do this sort of thing, we'd need to do the exact same sort of
accounting you describe:
>The implementation of local_buffer_reference_copy for that specific
>combination of buffer types maps the BE pages into the FE address space
>incrementing their reference counts and also unmaps the old FE pages and
>decrements their reference counts, returning them to the free pool if

When the FE was done with the BE, it would close the resources
(issuing t_clunk on any fids associated with the BE).

The above looks complicated, but to an FE writer it would be as simple as:
  channel = dial("net!BE"); /* establish connection */
  /* in my current code, channel is passed as an argument to the FE as a
     boot arg */
  root = fsmount(channel, NULL); /* this does the t_version, auth, & attach */
  fd = open(root, "/some/path/file", OREAD);
  ret = read(fd, *buf, sizeof(buf));

If you want to get fancy, you could get rid of the root arg to open
and use a private name space (after fsmount):
  bind(root, "/mnt/be", MREPL); /* bind the back end to a well known place */
then it would be:
  fd=open("/mnt/be/some/path/file", OREAD);

There's also all sorts of cool stuff you can do on the domain
controller to provision child partitions using dynamic name space and
then just export the custom-fashioned environment using 9P -- but
that's higher level organization stuff again.  There's all sorts of
cool tricks you can play with 9P (similar to the stuff that the FUSE
and FiST user-space file system packages provide) like copy-on-write
file systems, COW block devices, muxed ttys, etc. etc.

The reality is that I'm not sure I'd actually want to use a BE to
implement a file system, but it's quite reasonable to implement a
buffer cache that way.  In all likelihood this would result in the FE
opening up a connection (and a single object) on the BE buffer cache,
then using different offsets to grab specific blocks from the BE
buffer cache using the t_read operation.
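
As a rough illustration of the block-at-an-offset idea, the FE could turn
a block number into the offset and count of a t_read.  This sketch assumes
a fixed 4 KiB block size; BLOCK_SIZE_ASSUMED, block_read and block_request
are hypothetical names:

```c
#include <assert.h>
#include <stdint.h>

#define BLOCK_SIZE_ASSUMED 4096u  /* assumption: fixed-size cache blocks */

/* Offset and count to place in a t_read for a run of blocks. */
struct block_read {
    uint64_t offset;
    uint32_t count;
};

static struct block_read block_request(uint64_t block_no, uint32_t nblocks)
{
    /* The byte count must fit in the request's 32-bit count field. */
    assert(nblocks <= UINT32_MAX / BLOCK_SIZE_ASSUMED);

    struct block_read r;
    r.offset = block_no * (uint64_t)BLOCK_SIZE_ASSUMED;
    r.count  = nblocks * BLOCK_SIZE_ASSUMED;
    return r;
}
```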

I've described it in terms of a file system, using your example as a
basis, but the same sort of thing would be true for a block device or
a network connection (with some slightly different semantic rules on
the network connection).  The main point is to keep things simple for
the FE and BE writers, and deal with all the accounting and magic you
describe within the infrastructure (no small task).

Another difference would involve what would happen if you did have to
bridge a cluster network - the 9P network encapsulation is well
defined, all you would need to do (at the I/O partition bridging the
network) is marshal the data according to the existing protocol spec.
For more intelligent networks using RDMA and such things, you could
keep the scatter/gather style semantics and send pointers into the
RDMA space for buffer references.
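
For the plain network case, marshalling a read request could look roughly
like the sketch below.  It assumes the 9P2000 wire layout for Tread
(size[4] type[1] tag[2] fid[4] offset[8] count[4], all little-endian);
put_le and marshal_tread are hypothetical helper names:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum {
    TREAD      = 116,                    /* 9P2000 Tread type code */
    TREAD_SIZE = 4 + 1 + 2 + 4 + 8 + 4   /* 23 bytes on the wire */
};

/* Store the low nbytes of v at p in little-endian order; return the
 * position just past the written bytes. */
static uint8_t *put_le(uint8_t *p, uint64_t v, int nbytes)
{
    assert(nbytes >= 1 && nbytes <= 8);
    for (int i = 0; i < nbytes; i++)
        *p++ = (uint8_t)(v >> (8 * i));
    return p;
}

/* Encode one Tread into buf.  Returns the number of bytes written,
 * or 0 if buf is NULL or too small. */
static size_t marshal_tread(uint8_t *buf, size_t buflen, uint16_t tag,
                            uint32_t fid, uint64_t offset, uint32_t count)
{
    if (buf == NULL || buflen < TREAD_SIZE)
        return 0;

    uint8_t *p = buf;
    p = put_le(p, TREAD_SIZE, 4);  /* size field counts the whole message */
    p = put_le(p, TREAD, 1);
    p = put_le(p, tag, 2);
    p = put_le(p, fid, 4);
    p = put_le(p, offset, 8);
    p = put_le(p, count, 4);
    return (size_t)(p - buf);
}
```

Within a single machine the same fields could travel as an in-memory
structure; only the bridging partition would need the byte-level form.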

As I said before, there's lots of similarities in what we are talking
about, I'm just gluing a slightly more abstract interface on top,
which has some benefits in some additional organizational and security
mechanisms (and a well-established (but not widely used yet) network
protocol encapsulation).

There are plenty of details I know I'm glossing over, and I'm sure
I'll need lots of help getting things right.  I'd have preferred
staying quiet until I had my act together a little more, but Orran and
Ron convinced me that it was important to let people know the
direction I'm planning on exploring.

