
Re: [Xen-devel] V4V

On 05/30/2012 07:41 AM, Stefano Stabellini wrote:
> On Tue, 29 May 2012, Daniel De Graaf wrote:
>> On 05/24/2012 01:23 PM, Jean Guyader wrote:
>>> As I'm going through the code to clean up XenClient's inter-VM communication
>>> (V4V), I thought it would be a good idea to start a thread to talk about the
>>> fundamental differences between V4V and libvchan. I believe the two systems
>>> are not clones of each other and they serve different purposes.
>>> Disclaimer: I'm not an expert in libvchan; most of the assertions I'm making
>>> about libvchan come from my reading of the code. If some of the facts
>>> are wrong it's only due to my ignorance about the subject.
>> I'll try to fill in some of these points with my understanding of libvchan;
>> I have correspondingly less knowledge of V4V, so I may be wrong in
>> assumptions there.
>>> 1. Why V4V?
>>> About the time when we started XenClient (3 years ago) we were looking for a
>>> lightweight inter VM communication scheme. We started working on a system
>>> based on netchannel2 at the time called V2V (VM to VM). The system
>>> was very similar to what libvchan is today, and we started to hit some
>>> roadblocks:
>>>     - The setup relied on a broker in dom0 to prepare the xenstore node
>>>       permissions when a guest wanted to create a new connection. The code
>>>       to do this setup was a single point of failure. If the
>>>       broker was down you couldn't create any more connections.
>> libvchan avoids this by allowing the application to determine the xenstore
>> path and adjusts permissions itself; the path /local/domain/N/data is
>> suitable for a libvchan server in domain N to create the nodes in question.
> Let's say that the frontend lives in domain A and that the backend lives
> in domain N.
> Usually the frontend has a node:
> /local/domain/A/device/<devicename>/<number>/backend
> that points to the backend, in this case:
> /local/domain/N/backend/<devicename>/A/<number>
> The backend is not allowed to write to the frontend path, so it cannot write
> its own path in the backend node. Clearly the frontend doesn't know that
> information so it cannot fill it in. So the toolstack (typically in
> dom0) helps with the initial setup by writing the location of the backend
> under the frontend path.
> How does libvchan solve this issue?

Libvchan requires both endpoints to know the domain ID of the peer they are
communicating with - this could be communicated during domain build or through
a name service. The application then defines a path such as
"/local/domain/$server_domid/data/example-app/$client_domid" which is writable
by the server; the server creates nodes here that are readable by the client.
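
As a rough illustration of the path convention above, here is a minimal Python
sketch that builds the per-client path. The helper name `vchan_xs_path` and the
`example-app` component are hypothetical; libvchan leaves the choice of path to
the application, so only the layout is taken from the description above.

```python
def vchan_xs_path(server_domid: int, client_domid: int,
                  app: str = "example-app") -> str:
    """Return the xenstore path a server would use for one client.

    Hypothetical helper for illustration; not part of libvchan.
    The server owns /local/domain/<server_domid>/data, so it can create
    nodes under this path and make them readable by the client.
    """
    if server_domid < 0 or client_domid < 0:
        raise ValueError("domain IDs must be non-negative")
    return f"/local/domain/{server_domid}/data/{app}/{client_domid}"
```

Both endpoints can compute the same string independently once they know each
other's domain ID, which is why no broker in dom0 is needed.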

>>>     - Symmetric communications were a nightmare. Take the case where A is a
>>>       backend for B and B is a backend for A. If one of the domains crashed,
>>>       the other one couldn't be destroyed because it had some pages mapped
>>>       from the dead domain. This specific issue is probably fixed today.
>> This is mostly taken care of by improvements in the hypervisor's handling of
>> grant mappings. If one domain holds grant mappings open, the domain whose
>> grants are held can't be fully destroyed, but if both domains are being
>> destroyed then cycles of grant mappings won't stop them from going away.
> However under normal circumstances the domain holding the mappings (which
> I guess would be the domain running the backend, correct?) would
> recognize that the other domain is gone and therefore unmap the grants
> and close the connection, right?
> I hope that if the frontend crashes and dies, it doesn't necessarily
> become a zombie because the backend holds some mappings.

The mapping between frontend/backend and vchan client/server may be backwards:
the server must be initialized first and provides the pages for the client to
map. It looks like you are considering the frontend to be the server.

The vchan client domain maps grants provided by the server. If the server's
domain crashes, it may become a zombie until the client application notices the
crash. This will happen if the client uses the vchan and gets an error when
sending an event notification (in this case, a well-behaved client will close
the vchan). If the client does not often send data on the vchan, it can set a
watch on the server's xenstore node and close the vchan when the node is deleted.

A client that does not notice the server's destruction will leave a zombie
domain. A system administrator can resolve this by killing the client process.

>>>     - The PV connect/disconnect state-machine is poorly implemented.
>>>       There's no trivial mechanism to synchronize disconnecting/reconnecting
>>>       and dom0 must also allow the two domains to see parts of xenstore
>>>       belonging to the other domain in the process.
>> No interaction from dom0 is required to allow two domUs to communicate using
>> xenstore (assuming the standard xenstored; more restrictive xenstored
>> daemons may add such restrictions, intended to be used in conjunction with
>> XSM policies preventing direct communication via event channels/grants). The
>> connection state machine is weak; an external mechanism (perhaps the standard
>> xenbus "state" entry) could be used to coordinate this better in the user of
>> libvchan.
> I am curious to know what the "connection state machine" is in libvchan.

There are two bytes in the shared page which are set to \1 when the vchan is
connected and changed to \0 when one side is closed (either by libvchan_close or
by Linux if the process exits or crashes). The server has the option of ignoring
the close and allowing the client to reconnect, which is useful if the client
application is to be restarted. Since the rings remain intact, no data is lost
across a restart (although a crashing client may lose data it has already pulled
off the ring).
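
To make the reconnect behaviour concrete, here is a toy Python model of the two
liveness bytes. All names (`SharedPage`, `server_live`, `client_live`, and the
helper functions) are made up for this sketch and do not reflect the real shared
page layout; a deque stands in for the server-to-client ring.

```python
from collections import deque


class SharedPage:
    """Toy stand-in for the shared page: two liveness bytes plus one ring."""

    def __init__(self):
        self.server_live = 0
        self.client_live = 0
        self.to_client = deque()  # stands in for the server-to-client ring


def server_open(page):
    page.server_live = 1


def client_open(page):
    # A client can only attach while the server side is marked live.
    if page.server_live != 1:
        raise ConnectionError("server side is closed")
    page.client_live = 1


def client_close(page):
    # Set on an explicit close, or by the kernel when the process dies.
    page.client_live = 0


def server_on_client_close(page, allow_reconnect):
    """Server policy once it sees client_live drop to 0."""
    if not allow_reconnect:
        # Tear the channel down; pending data is discarded.
        page.server_live = 0
        page.to_client.clear()
    # Otherwise leave the ring untouched so a restarted client can resume.
```

With `allow_reconnect` set, data queued before the client went away is still in
the ring when a new client attaches; anything the old client had already read
is gone.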

>>>     - Using the grant-ref model and having to map grant pages on each
>>>       transfer cause updates to V->P memory mappings and thus leads to
>>>       TLB misses and flushes (TLB flushes being expensive operations).
>> This mapping only happens once at the open of the channel, so this cost
>> becomes unimportant for a long-running channel. The cost is far higher for
>> HVM domains that use PCI devices since the grant mapping causes an IOMMU flush.
> So I take it that you are not passing grant refs through the connection,
> unlike blkfront and blkback.

Not directly. All data is passed through the rings, which by default are sized
1024 and 2048 bytes. Larger multi-page rings are supported (in powers of two), in
which case the initial shared page has a static list of grants provided by the
server which are all mapped by the client. Data transfer speeds are 
improved with larger rings, although this levels off when both ends are able to
avoid excessive context switches waiting for a ring to be filled or emptied.
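
For readers unfamiliar with this kind of ring, here is a minimal Python sketch
of a power-of-two byte ring with free-running producer and consumer indices.
The class and method names are invented for illustration and this is not the
libvchan implementation; it only shows why power-of-two sizes are convenient
(the index wraps with a mask) and that data is copied in by the writer and
copied out by the reader.

```python
class Ring:
    """Byte ring whose size must be a power of two (e.g. 1024 or 2048)."""

    def __init__(self, size):
        if size <= 0 or size & (size - 1):
            raise ValueError("ring size must be a power of two")
        self.buf = bytearray(size)
        self.mask = size - 1
        self.prod = 0  # total bytes ever written
        self.cons = 0  # total bytes ever read

    def used(self):
        return self.prod - self.cons

    def free(self):
        return len(self.buf) - self.used()

    def write(self, data):
        """Copy as much of data as fits; return the number of bytes written."""
        n = min(len(data), self.free())
        for i in range(n):
            self.buf[(self.prod + i) & self.mask] = data[i]
        self.prod += n
        return n

    def read(self, maxlen):
        """Copy out up to maxlen bytes and advance the consumer index."""
        n = min(maxlen, self.used())
        out = bytes(self.buf[(self.cons + i) & self.mask] for i in range(n))
        self.cons += n
        return out
```

A writer that gets a short count from `write` has to wait for the reader to
drain the ring, which is where the context switches mentioned above come from;
a larger ring makes those waits less frequent.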

>> [followup from Stefano's replies]
>> I would not expect much difference even on a NUMA system, assuming each domU
>> is fully contained within a NUMA node: one of the two copies made by libvchan
>> will be confined to a single node, while the other copy will be cross-node.
>> With domUs not properly confined to nodes, the hypervisor might be able to do
>> better in cases where libvchan would have required two inter-node copies.
> Right, I didn't realize that libvchan uses copies rather than grant refs
> to transfer the actual data.
