
Re: [Xen-devel] Inter-domain Communication using Virtual Sockets (high-level design)



On 06/13/2013 12:27 PM, Tim Deegan wrote:
Hi,

At 19:07 +0100 on 11 Jun (1370977636), David Vrabel wrote:
This is a high-level design document for an inter-domain communication
system under the virtual sockets API (AF_VSOCK) recently added to Linux.

This document covers a lot of ground (transport, namespace&c), and I'm
not sure where the AF_VSOCK interface comes in that.  E.g., are
communications with the 'connection manager' done by the application
(like DNS lookups) or by the kernel (like routing)?

Purpose
-------

In the Windsor architecture for XenServer, dom0 is disaggregated into
several _service domains_.  Examples of service domains include
network and storage driver domains, and qemu (stub) domains.

To allow the toolstack to manage service domains there needs to be a
communication mechanism between the toolstack running in one domain and
all the service domains.

The principle focus of this new transport is control-plane traffic

<nit>principal</nit>

(low latency and low data rates) but consideration is given to future
uses requiring higher data rates.
[...]
Design Map
----------

The Linux kernel requires a Xen-specific virtual socket transport and
front and back drivers.

The connection manager is a new user space daemon running in the
backend domain.

One in every domain that runs backends, or one for the whole system?

[...]
Linux's virtual sockets are used as the interface to applications.
Virtual sockets were introduced in Linux 3.9 and provide a
hypervisor-independent[^1] interface to user space applications for
inter-domain communication.

[^1]: The API and address format are hypervisor independent, but the
address values are not.
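
For reference, the application-facing side of this is just the normal
sockets API with an AF_VSOCK address.  Below is a minimal sketch of a
service listening on a vsock port; the port number is illustrative,
and struct sockaddr_vm and VMADDR_CID_ANY come from the kernel's
linux/vm_sockets.h header.

    /* Minimal sketch: a service listening for vsock connections.
     * The port number is illustrative only. */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <linux/vm_sockets.h>

    #ifndef AF_VSOCK
    #define AF_VSOCK 40             /* value from <linux/socket.h> */
    #endif

    int main(void)
    {
        struct sockaddr_vm addr;
        int fd, conn;

        fd = socket(AF_VSOCK, SOCK_STREAM, 0);
        if (fd < 0) {
            perror("socket");
            return 1;
        }

        memset(&addr, 0, sizeof(addr));
        addr.svm_family = AF_VSOCK;
        addr.svm_cid = VMADDR_CID_ANY;  /* accept on the local CID */
        addr.svm_port = 5000;           /* illustrative service port */

        if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
            listen(fd, 1) < 0) {
            perror("bind/listen");
            return 1;
        }

        conn = accept(fd, NULL, NULL);  /* blocks until a peer connects */
        if (conn >= 0)
            close(conn);
        close(fd);
        return 0;
    }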

An internal API is provided to implement a low-level virtual socket
transport.  This will be implemented within a pair of front and back
drivers.  The use of the standard front/back driver method allows the
toolstack to handle the suspend, resume and migration in a similar way
to the existing drivers.
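
For context, the internal API in question is the vsock core's transport
interface: a transport registers a table of callbacks with
vsock_core_init().  A very rough sketch of what the Xen transport's
registration might look like follows; struct vsock_transport and
vsock_core_init() are the in-kernel interface as of Linux 3.9, while
everything named xen_vsock_* is a hypothetical placeholder for the
proposed front/back drivers.

    /* Very rough sketch: registering a Xen-specific transport with the
     * Linux vsock core.  struct vsock_transport and vsock_core_init()
     * are the in-kernel interface (as of Linux 3.9); the xen_vsock_*
     * names are hypothetical placeholders for the proposed drivers. */
    #include <linux/module.h>
    #include "af_vsock.h"   /* vsock core header; location varies by kernel version */

    static u32 xen_vsock_get_local_cid(void)
    {
        /* Proposed mapping: CID == local domain ID (placeholder value). */
        return 42;
    }

    static const struct vsock_transport xen_vsock_transport = {
        .get_local_cid = xen_vsock_get_local_cid,
        /* .connect, .stream_enqueue, .stream_dequeue, ... would be
         * implemented by the front/back drivers over shared rings. */
    };

    static int __init xen_vsock_init(void)
    {
        return vsock_core_init(&xen_vsock_transport);
    }
    module_init(xen_vsock_init);

    MODULE_LICENSE("GPL");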

What does that look like at the socket interface?  Would an AF_VSOCK
socket transparently stay open across migrate but connect to a different
backend?  Or would it be torn down and the application need to DTRT
about re-connecting?

The front/back pair provides a point-to-point link between the two
domains.  This is used to communicate between applications on those
hosts and between the frontend domain and the _connection manager_
running on the backend.

The connection manager allows domUs to request direct connections to
peer domains.  Without the connection manager, peers have no mechanism
to exchange the information necessary for setting up the direct
connections.

Sure they do -- they can use any existing shared namespace.  Xenstore
is the obvious candidate, but there's always DNS, or twitter. :P

The toolstack sets the policy in the connection manager
to allow connection requests.  The default policy is to deny
connection requests.

Hmmm.  Since the underlying transports use their own ACLs (e.g. grant
tables), the connection manager can't actually stop two domains from
communicating.  You'd need to use XSM for that.

High Level Design
=================

Virtual Sockets
---------------

The AF_VSOCK socket address family in the Linux kernel has a two part
address format: a uint32_t _context ID_ (_CID_) identifying the domain
and a uint32_t port for the specific service in that domain.

The CID shall be the domain ID and some CIDs have a specific meaning.

CID                     Purpose
-------------------     -------
0x7FF0 (DOMID_SELF)     The local domain.
0x7FF1                  The backend domain (where the connection
                        manager is).
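
For illustration, an application connecting to a service in the backend
domain under this scheme would look roughly like the sketch below;
0x7FF1 is the value proposed in the table above, not an existing kernel
constant, and the port number is made up.

    /* Sketch: connecting to a service in the backend domain using the
     * proposed CID scheme.  0x7FF1 is the draft's value, not an
     * existing kernel constant; the port number is illustrative. */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <linux/vm_sockets.h>

    #ifndef AF_VSOCK
    #define AF_VSOCK 40
    #endif

    int main(void)
    {
        struct sockaddr_vm addr;
        int fd = socket(AF_VSOCK, SOCK_STREAM, 0);
        if (fd < 0) {
            perror("socket");
            return 1;
        }

        memset(&addr, 0, sizeof(addr));
        addr.svm_family = AF_VSOCK;
        addr.svm_cid = 0x7FF1;          /* backend domain, per the table above */
        addr.svm_port = 5000;           /* illustrative service port */

        if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
            perror("connect");
            close(fd);
            return 1;
        }
        /* ... exchange data over fd ... */
        close(fd);
        return 0;
    }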

OK, so there's only one connection manager.  And the connection manager
has an address at the socket interface -- does that mean application
code should connect to it and send it requests?  But the information in
those requests is only useful to the code below the socket interface.

Connection Manager
------------------

The connection manager has two main purposes.

1. Checking that two domains are permitted to connect.

As I said, I don't think that can work.

2. Providing a mechanism for two domains to exchange the grant
    references and event channels needed for them to setup a shared
    ring transport.

If they already want to talk to each other, they can communicate all
that in a single grant ref (which is the same size as an AF_VSOCK port).

So I guess the purpose is multiplexing connection requests: some sort of
listener in the 'backend' must already be talking to the manager (and
because you need the manager to broker new connections, so must the
frontend).

Wait, is this connection manager just xenstore in a funny hat?  Or could
it be implemented by adding a few new node/permission types to xenstore?

Domains communicate with the connection manager over the front-back
transport link.  The connection manager must be in the same domain as
the virtual socket backend driver.

The connection manager opens a virtual socket and listens on a
well-defined port (port 1).

The following messages are defined.

Message          Purpose
-------          -------
CONNECT_req      Request connection to another peer.
CONNECT_rsp      Response to a connection request.
CONNECT_ind      Indicate that a peer is trying to connect.
CONNECT_ack      Acknowledge a connection request.
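
The draft does not define a wire format for these messages; purely as
an illustration, they might carry something like the following (the
structure and field names here are hypothetical):

    /* Hypothetical layout for the connection manager messages; the
     * draft does not define a wire format, so the structure and field
     * names here are illustrative only. */
    #include <stdint.h>

    enum conn_mgr_msg_type {
        CONNECT_req = 1,    /* request connection to another peer */
        CONNECT_rsp,        /* response to a connection request */
        CONNECT_ind,        /* indicate that a peer is trying to connect */
        CONNECT_ack,        /* acknowledge a connection request */
    };

    struct conn_mgr_msg {
        uint32_t type;      /* enum conn_mgr_msg_type */
        uint32_t peer_cid;  /* domain the caller wants to reach */
        uint32_t peer_port; /* service port in that domain */
        uint32_t gnt_ref;   /* grant reference for the new shared ring */
        uint32_t evtchn;    /* event channel for the new ring */
        uint32_t status;    /* result code, used in CONNECT_rsp */
    };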

Again, are these messages carried in a socket connection, or done under
the hood on a non-socket channel?  Or some mix of the two?  I think I
must be missing some key part of the picture. :)

V4V
---
### Advantages

* Does not use grant table resources.  If shared rings are used then a
   busy guest with hundreds of peers will require more grant table
   entries than the current default.

### Disadvantages

* Any changes or extensions to the protocol or ring format would
   require a hypervisor change.  This is more difficult than making
   changes to guests.

In practice, it's often easier to upgrade the hypervisor than the guest
kernels, but I agree that it's bad to have mechanism in the hypervisor.

* The connection-less, "shared-bus" model of v4v is unsuitable for
   untrusted peers.  This requires layering a connection model on top
   and much of the simplicity of the v4v ABI is lost.

I think that if v4v can't manage a listen/connect model, then that's a
bug in v4v rather than a design-level drawback.  My understanding was
that the shared-receiver ring was intended to serve this purpose, and
that v4vtables would be used to silence over-loud peers (much like the
ACL you propose for the connection manager).  Ross?

We are looking into enhancing this. For one thing, we need some level of control over connection management in the core code for it to work cleanly with AF_VSOCK. We also plan to allow the v4vtables to be managed by guests, as part of a significant overhaul to improve them.


* The mechanism for handling full destination rings will not scale up
   on busy domains.  The event channel only indicates that some ring
   may have space -- it does not identify which ring has space.

That's a fair point, which you raised on the v4v thread, and one that I
expect Ross to address.

We are investigating ways to improve this, i.e. to relieve the guests of the burden of scanning all rings to find what changed.


I'd be very interested to hear the v4v authors' opinions on this VSOCK
draft, btw -- in particular if it (or something similar) can provide all
v4v's features without new hypervisor code, I'd very much prefer it.

I guess I cannot be 100% sure just by reading the part of the spec on the low-level transport mechanism. We originally tried to use a grant-based model and ran into issues. Two of the most pronounced were:

- Failure of grantees to release grants would cause hung domains in certain situations. This was discussed early in the V4V RFC work that Jean G. did. I am not sure if this has been fixed and, if so, how. There was a suggestion about a fix in a reply from Daniel a while back.

- Synchronization between guests was very complicated without a central arbitrator like the hypervisor.

Also, this solution may have some scaling issues. If I understand the model being proposed here, each ring (which I guess corresponds to a connection) consumes an event channel. With a large number of connections, is this not a scaling problem? I may not fully understand the proposed low-level transport spec.


Cheers,

Tim.

