1. Overview
===========

This package is a Request For Comments on a design for accelerated networking
in Xen using "smart NICs".  The package consists of this README, a diagram of
the proposed architecture, and a PowerPoint presentation of the architecture.
The code for a proof-of-concept implementation is also included for reference.
The primary objective at this stage is to begin discussions on the approach
itself, as opposed to our implementation, or even the APIs used between the
various modules that make up our implementation.  However, all criticism is
welcomed, including of the code itself.

Please direct all comments and questions to the xen-devel mailing list.

Section 2 outlines the architecture -- this is the crux of this request for
comments.  Section 3 describes (at a high level) our proof-of-concept
hardware, and gives an overview of the proof-of-concept code which is
packaged with this README.  Since our implementation is a prototype, there are
elements of it that would be unsuitable for inclusion in Xen, and it is
lacking in some aspects: section 4 outlines the changes we plan to make for
the future implementation.

As mentioned above, at this stage we are more interested in feedback from
the Xen community on our current approach and future direction than in the
proof-of-concept code, but all feedback is welcome.


2. Architecture
===============

This proposal is for Xen support for virtualisable NICs that can be programmed
directly from untrusted entities without risking system integrity (this
probably implies the presence of an IOMMU).

The proposed architecture adds a direct fast path for network traffic
between the hardware and domU frontend driver.  The original shared-memory
path (via the dom0 backend driver) is retained for unaccelerated traffic
and to support migration.  The frontend driver has been divided into two
parts: (i) the OS Driver (OSD), and (ii) the Semi-Virtualised Hardware (SVH)
driver.  The latter is the device driver for the virtualised hardware as it
appears in the guest.

Please see attached diagram: xen_networking.pdf.  A number of acronyms are
used in this diagram that relate to those used in the source: OSD (Operating
System Driver), SVH (Semi-Virtualised Hardware), BEC (BackEnd Communication),
FEC (FrontEnd Communication).

Hardware:
---------

The design presented here is intended to be sufficiently general that various
designs of smart NICs with various capabilities may be supported.  All that
is assumed is that the smart NIC is capable of being mapped directly into
the guest without sacrificing system integrity (such a mapping is known as
a Virtual Interface, VI), and
that it presents a layer 2 interface.  Specifically, this proposal does not
address the challenges of integrating TCP Offload into the architecture,
however we note that this proposal may form part of the support required for
such devices.

Boot:
-----

When a new domU is brought up, and the frontend driver loaded, the backend
driver is notified using xenbus, and a channel of communication is
established between the two.  This consists of a message fifo constructed
using shared memory pages and an event channel IRQ.  As with the existing
Xen drivers, this channel supports transfer of packets and control
information.  It has been extended with two new functions: 1) The back-end
may inform the front-end that a smart NIC is present.  2) The front-end (or
rather the SVH driver) may request the resources required to communicate
directly with the smart NIC.  Such resources may include a mapping onto a
virtual interface, and IOMMU mappings for DMA buffers.

When a front-end driver is notified that a smart NIC is available, it
attempts to load a module that implements the SVH driver for the smart NIC.
If it succeeds, then the SVH driver communicates with dom0 to set up a
virtual interface.  If no suitable driver is available in the guest, then
it defaults to using the unaccelerated path.

Receive:
--------

Where possible the smart NIC should deliver packets directly to the virtual
interface of the guest to which the packet is addressed.  The SVH driver in
domU will then be invoked and will pass the packet into the OS stack in the
normal
way.  The SVH driver may be invoked directly via an MSI-X interrupt, or via
the event channel when MSI-X is not available.

It is likely that a smart NIC will have some means to demultiplex incoming
packets to virtual interfaces.  This architecture is intended to be
completely agnostic with respect to the mechanism used.  For example, the
NIC might have a mapping from destination MAC address to the guest's
virtual interface, or might use VLAN tags, or might use IP addresses.

When the NIC is not able to deliver a packet to the appropriate virtual
interface, it should deliver the packet to the default interface (usually
managed by dom0) from where it will be forwarded to the guest via the
slower software path.  For example, broadcast packets may be delivered in
this way without requiring that the NIC be able to DMA a copy of the packet
to each virtual interface.

Transmit:
---------

On transmit, the frontend driver in domU receives the packet from the OS in
the normal way.  The packet is passed to the SVH driver, which arranges for
the packet to be transmitted via the virtual interface.

Some smart NICs may be able to transfer packets from one virtual interface
to another, providing a means for guests to communicate with one another
directly.  For NICs that do not support this feature, the front-end driver
maintains a table that contains the MAC addresses of all local guests.
Packets addressed to other guests in the same machine are passed through
the shared-memory channel to dom0 in the usual way (slow path), rather than
passing them to the SVH driver.  This mechanism may also be used for
broadcast and multicast packets.

Migration:
----------

The key to supporting migration is that it is always possible to fall back
to the slow path.  Before migration the guest is asked to close down its
virtual interface, and revert to using the slow path for all transmits.
The driver in dom0 arranges that received packets are also delivered via
the slow path.

The guest no longer has any dependency on the smart NIC, and can be
migrated in the usual way.  If there is no smart NIC on the remote host
then nothing further need be done.  If there is, then an appropriate SVH
driver will be loaded on demand, and a virtual interface will be created,
as described in the "Boot" section above.

In this way it is possible to support migration between systems with and
without smart NICs, and between systems with different types of smart NICs.


3. Proof-of-concept implementation
==================================

The prototype code in this RFC is for driving Solarflare EF1 NICs, which
can present up to 4096 Virtual Interfaces (VI).  Each EF1 VI consists of a
pair of rx/tx DMA descriptor rings.  The NIC also has an onboard IOMMU, and
a filter table which directs incoming packets to a given DMA ring, based on
destination IP address and port number.  Filters are inserted into the
NIC's filter table on demand by the driver in dom0.

Our current implementation is built separately from the Xen tree, which
allowed us to prototype rapidly.  Our intention is that it be merged with
the existing front- and back-end drivers in the future, together with the
many changes necessary to allow that (see Section 4).  As a consequence,
others cannot currently build our source; but since it cannot be used
without our hardware and drivers, this is not a problem in itself.  This
implementation should be seen as a request for comments and a work in
progress.
The source is attached for your reference.  We also welcome comments about our
use of Xen APIs.

The src directory contains our current work-in-progress prototype for these
drivers.  Those files marked "***" are suggested good starting points for
anyone interested in following the code in more detail.

xen_gen                  - Code that's shared by back and front end drivers
xen_gen/ef_cuckoo_hash.c - Implementation of cuckoo hash table.
xen_gen/ef_misc_util.c   - Our versions of logging, fail etc.
xen_gen/ef_msg_iface.c   - Interface to the messaging fifo
xen_gen/ef_xen_util.c    - Wrappers around xen calls (mostly concerned with
                           memory, grants etc for IO pages).

xen_bend                      - The backend driver implementation
xen_bend/ef_bend_accel.[c,h]  - Interface to generic hardware acceleration ***
xen_bend/ef_bend.[c,h]        - Core of backend driver, module code, xenbus
                                callbacks etc.
xen_bend/ef_bend_fwd.[c,h]    - Forwarding packets from backend to frontend
                                driver
xen_bend/ef_bend_netdev.[c,h] - Netdev for tx slow path to connect to the
                                bridge
xen_bend/ef_bend_solarflare.c - Interface from generic parts to solarflare
                                specific parts of driver.
xen_bend/ef_bend_vnic.[c,h]   - FEC (front end communication) layer.
xen_bend/ef_char_bend.[c,h]   - Interface to solarflare's resource manager.
xen_bend/ef_filter.[c,h]      - Managing NIC filters.
xen_bend/ef_iputil.h          - Util for IP filter management
xen_bend/ef_mcast.[c,h]       - Multicast handling

xen_fend                      - The frontend driver implementation
xen_fend/ef_vnic_bufs.[c,h]   - Buffer manager for frontend driver ***
xen_fend/ef_vnic.[c,h]        - Core of frontend driver, module code, xenbus
                                callbacks etc.
xen_fend/ef_vnic_netdev.[c,h] - Slowpath TX
xen_fend/ef_vnic_osd.c        - Frontend driver interface to OS
xen_fend/ef_vnic_svh.c        - Semi-virtualised hardware, generic wrapper
                                around hardware functions
xen_fend/svh_ef1.c            - Semi-virtualised hardware functions to access
                                Solarflare EF1
xen_fend/svh_null.c           - Dummy semi-virtualised hardware functions 

include/ci/xen/ef_msg_iface.h   - frontend <-> backend message interface
include/ci/xen/ef_cuckoo_hash.h - Cuckoo hash table
include/ci/xen/ef_shared_fifo.h - Shared fifo used to communicate between
                                  frontend and backend
include/ci/xen/ef_xen_util.h    - Wrapper around Xen APIs 

include/ci/driver/virtual/ef_hyperops.h     - Generic wrapper around
                                              hypervisor operations
include/ci/driver/virtual/ef_hyperops_xen.h - Wrapper around Xen hypervisor
                                              operations ***
include/ci/driver/virtual/vnic.h            - Frontend driver data structures
                                              and function calls


4. Intended Future Architecture & Implementation
================================================

We plan to expand the current prototype into a solution that is suitable for
inclusion in Xen.

Although logically the frontend module in the current implementation is
divided into two, it is linked as a single module.  Our intention is to split
this into two frontend modules: a generic hardware-independent module, and a
hardware-dependent SVH module.  We expect the generic front-end module to be
derived from the existing netfront driver, although it could easily be
implemented as a separate module, were that to be desired.  (In our prototype
we have implemented stand-alone frontend and backend drivers; this was done
for convenience of development only, and should not be interpreted as our
desire for the final architecture.)

The final solution clearly needs to present APIs between the backend and
generic frontend driver, and between the generic frontend driver and the
hardware-dependent frontend driver.  This API must be suitably generic and
expressive that it can be used by different "smart NICs" from different
hardware vendors.  We therefore need input from others when specifying the
API, as while we can make intelligent guesses, there will inevitably be
details of other hardware that we're not aware of.

The local MAC address lookup on the TX path may not be needed when using a
smart NIC that can loop packets back to a local virtual interface.

It would be desirable to add support for delivering interrupts directly to
the frontend driver in the guest where possible (using MSI-X).  It may be
that this support can go entirely in the hardware-specific drivers.

Although the current architecture has been designed to support live
relocation, this has not yet been implemented in our proof-of-concept.  We
intend to implement it shortly.
