# HG changeset patch # User Robb Romans <3r@xxxxxxxxxx> # Node ID 4a0a59c835362fad757002e884ea4795bb63fd5e # Parent 8eddf18dd1a469ed62d6c310f9dec496db33a36a Separate file for interface/devices. Signed-off-by: Robb Romans <3r@xxxxxxxxxx> diff -r 8eddf18dd1a4 -r 4a0a59c83536 docs/src/interface.tex --- a/docs/src/interface.tex Thu Sep 15 20:56:53 2005 +++ b/docs/src/interface.tex Thu Sep 15 21:13:42 2005 @@ -93,182 +93,8 @@ %% chapter Memory moved to memory.tex \include{src/interface/memory} - -\chapter{Devices} -\label{c:devices} - -Devices such as network and disk are exported to guests using a -split device driver. The device driver domain, which accesses the -physical device directly also runs a {\em backend} driver, serving -requests to that device from guests. Each guest will use a simple -{\em frontend} driver, to access the backend. Communication between these -domains is composed of two parts: First, data is placed onto a shared -memory page between the domains. Second, an event channel between the -two domains is used to pass notification that data is outstanding. -This separation of notification from data transfer allows message -batching, and results in very efficient device access. - -Event channels are used extensively in device virtualization; each -domain has a number of end-points or \emph{ports} each of which -may be bound to one of the following \emph{event sources}: -\begin{itemize} - \item a physical interrupt from a real device, - \item a virtual interrupt (callback) from Xen, or - \item a signal from another domain -\end{itemize} - -Events are lightweight and do not carry much information beyond -the source of the notification. Hence when performing bulk data -transfer, events are typically used as synchronization primitives -over a shared memory transport. Event channels are managed via -the {\tt event\_channel\_op()} hypercall; for more details see -Section~\ref{s:idc}. - -This chapter focuses on some individual device interfaces -available to Xen guests. - -\section{Network I/O} - -Virtual network device services are provided by shared memory -communication with a backend domain. From the point of view of -other domains, the backend may be viewed as a virtual ethernet switch -element with each domain having one or more virtual network interfaces -connected to it. - -\subsection{Backend Packet Handling} - -The backend driver is responsible for a variety of actions relating to -the transmission and reception of packets from the physical device. -With regard to transmission, the backend performs these key actions: - -\begin{itemize} -\item {\bf Validation:} To ensure that domains do not attempt to - generate invalid (e.g. spoofed) traffic, the backend driver may - validate headers ensuring that source MAC and IP addresses match the - interface that they have been sent from. - - Validation functions can be configured using standard firewall rules - ({\small{\tt iptables}} in the case of Linux). - -\item {\bf Scheduling:} Since a number of domains can share a single - physical network interface, the backend must mediate access when - several domains each have packets queued for transmission. This - general scheduling function subsumes basic shaping or rate-limiting - schemes. - -\item {\bf Logging and Accounting:} The backend domain can be - configured with classifier rules that control how packets are - accounted or logged. For example, log messages might be generated - whenever a domain attempts to send a TCP packet containing a SYN. 
-\end{itemize} - -On receipt of incoming packets, the backend acts as a simple -demultiplexer: Packets are passed to the appropriate virtual -interface after any necessary logging and accounting have been carried -out. - -\subsection{Data Transfer} - -Each virtual interface uses two ``descriptor rings'', one for transmit, -the other for receive. Each descriptor identifies a block of contiguous -physical memory allocated to the domain. - -The transmit ring carries packets to transmit from the guest to the -backend domain. The return path of the transmit ring carries messages -indicating that the contents have been physically transmitted and the -backend no longer requires the associated pages of memory. - -To receive packets, the guest places descriptors of unused pages on -the receive ring. The backend will return received packets by -exchanging these pages in the domain's memory with new pages -containing the received data, and passing back descriptors regarding -the new packets on the ring. This zero-copy approach allows the -backend to maintain a pool of free pages to receive packets into, and -then deliver them to appropriate domains after examining their -headers. - -% -%Real physical addresses are used throughout, with the domain performing -%translation from pseudo-physical addresses if that is necessary. - -If a domain does not keep its receive ring stocked with empty buffers then -packets destined to it may be dropped. This provides some defence against -receive livelock problems because an overload domain will cease to receive -further data. Similarly, on the transmit path, it provides the application -with feedback on the rate at which packets are able to leave the system. - - -Flow control on rings is achieved by including a pair of producer -indexes on the shared ring page. Each side will maintain a private -consumer index indicating the next outstanding message. In this -manner, the domains cooperate to divide the ring into two message -lists, one in each direction. Notification is decoupled from the -immediate placement of new messages on the ring; the event channel -will be used to generate notification when {\em either} a certain -number of outstanding messages are queued, {\em or} a specified number -of nanoseconds have elapsed since the oldest message was placed on the -ring. - -% Not sure if my version is any better -- here is what was here before: -%% Synchronization between the backend domain and the guest is achieved using -%% counters held in shared memory that is accessible to both. Each ring has -%% associated producer and consumer indices indicating the area in the ring -%% that holds descriptors that contain data. After receiving {\it n} packets -%% or {\t nanoseconds} after receiving the first packet, the hypervisor sends -%% an event to the domain. - -\section{Block I/O} - -All guest OS disk access goes through the virtual block device VBD -interface. This interface allows domains access to portions of block -storage devices visible to the the block backend device. The VBD -interface is a split driver, similar to the network interface -described above. A single shared memory ring is used between the -frontend and backend drivers, across which read and write messages are -sent. - -Any block device accessible to the backend domain, including -network-based block (iSCSI, *NBD, etc), loopback and LVM/MD devices, -can be exported as a VBD. Each VBD is mapped to a device node in the -guest, specified in the guest's startup configuration. 
- -Old (Xen 1.2) virtual disks are not supported under Xen 2.0, since -similar functionality can be achieved using the more complete LVM -system, which is already in widespread use. - -\subsection{Data Transfer} - -The single ring between the guest and the block backend supports three -messages: - -\begin{description} -\item [{\small {\tt PROBE}}:] Return a list of the VBDs available to this guest - from the backend. The request includes a descriptor of a free page - into which the reply will be written by the backend. - -\item [{\small {\tt READ}}:] Read data from the specified block device. The - front end identifies the device and location to read from and - attaches pages for the data to be copied to (typically via DMA from - the device). The backend acknowledges completed read requests as - they finish. - -\item [{\small {\tt WRITE}}:] Write data to the specified block device. This - functions essentially as {\small {\tt READ}}, except that the data moves to - the device instead of from it. -\end{description} - -% um... some old text -%% In overview, the same style of descriptor-ring that is used for -%% network packets is used here. Each domain has one ring that carries -%% operation requests to the hypervisor and carries the results back -%% again. - -%% Rather than copying data, the backend simply maps the domain's buffers -%% in order to enable direct DMA to them. The act of mapping the buffers -%% also increases the reference counts of the underlying pages, so that -%% the unprivileged domain cannot try to return them to the hypervisor, -%% install them as page tables, or any other unsafe behaviour. -%% %block API here +%% chapter Devices moved to devices.tex +\include{src/interface/devices} \chapter{Further Information} diff -r 8eddf18dd1a4 -r 4a0a59c83536 docs/src/interface/devices.tex --- /dev/null Thu Sep 15 20:56:53 2005 +++ b/docs/src/interface/devices.tex Thu Sep 15 21:13:42 2005 @@ -0,0 +1,178 @@ +\chapter{Devices} +\label{c:devices} + +Devices such as network and disk are exported to guests using a split +device driver. The device driver domain, which accesses the physical +device directly, also runs a \emph{backend} driver, serving requests to +that device from guests. Each guest will use a simple \emph{frontend} +driver to access the backend. Communication between these domains is +composed of two parts: First, data is placed onto a shared memory page +between the domains. Second, an event channel between the two domains +is used to pass notification that data is outstanding. This +separation of notification from data transfer allows message batching, +and results in very efficient device access. + +Event channels are used extensively in device virtualization; each +domain has a number of end-points or \emph{ports}, each of which may be +bound to one of the following \emph{event sources}: +\begin{itemize} + \item a physical interrupt from a real device, + \item a virtual interrupt (callback) from Xen, or + \item a signal from another domain +\end{itemize} + +Events are lightweight and do not carry much information beyond the +source of the notification. Hence when performing bulk data transfer, +events are typically used as synchronization primitives over a shared +memory transport. Event channels are managed via the {\tt + event\_channel\_op()} hypercall; for more details see +Section~\ref{s:idc}. + +This chapter focuses on some individual device interfaces available to +Xen guests.
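
The following C fragment sketches the two-step request path described
above: a request is first written into a descriptor on the shared
memory page, and only afterwards is the backend notified through the
event channel. All of the names here ({\tt struct shared\_ring},
{\tt evtchn\_notify()} and so on) are invented for illustration and
are not taken from the real Xen interface headers.

\begin{verbatim}
#include <stdint.h>

#define RING_SIZE 256                /* illustrative ring capacity */

struct request {                     /* opaque request descriptor  */
    uint64_t id;
    uint64_t data;
};

struct shared_ring {                 /* lives on the page shared   */
    uint32_t req_prod;               /* between the two domains    */
    struct request ring[RING_SIZE];
};

/* Stub: a real guest would perform the event_channel_op()
 * hypercall here to signal the bound remote port. */
static void evtchn_notify(int port) { (void)port; }

static void send_request(struct shared_ring *sring, int port,
                         const struct request *req)
{
    /* Step 1: data transfer -- place the request on the shared page. */
    sring->ring[sring->req_prod % RING_SIZE] = *req;
    __sync_synchronize();            /* publish the data first        */
    sring->req_prod++;

    /* Step 2: notification -- a lightweight event with no payload.   */
    evtchn_notify(port);
}
\end{verbatim}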
+ + +\section{Network I/O} + +Virtual network device services are provided by shared memory +communication with a backend domain. From the point of view of other +domains, the backend may be viewed as a virtual ethernet switch +element, with each domain having one or more virtual network interfaces +connected to it. + +\subsection{Backend Packet Handling} + +The backend driver is responsible for a variety of actions relating to +the transmission and reception of packets from the physical device. +With regard to transmission, the backend performs these key actions: + +\begin{itemize} +\item {\bf Validation:} To ensure that domains do not attempt to + generate invalid (e.g. spoofed) traffic, the backend driver may + validate headers, ensuring that source MAC and IP addresses match the + interface that they have been sent from. + + Validation functions can be configured using standard firewall rules + ({\small{\tt iptables}} in the case of Linux). + +\item {\bf Scheduling:} Since a number of domains can share a single + physical network interface, the backend must mediate access when + several domains each have packets queued for transmission. This + general scheduling function subsumes basic shaping or rate-limiting + schemes. + +\item {\bf Logging and Accounting:} The backend domain can be + configured with classifier rules that control how packets are + accounted or logged. For example, log messages might be generated + whenever a domain attempts to send a TCP packet containing a SYN. +\end{itemize} + +On receipt of incoming packets, the backend acts as a simple +demultiplexer: Packets are passed to the appropriate virtual interface +after any necessary logging and accounting have been carried out. + +\subsection{Data Transfer} + +Each virtual interface uses two ``descriptor rings'', one for +transmit, the other for receive. Each descriptor identifies a block +of contiguous physical memory allocated to the domain. + +The transmit ring carries packets to transmit from the guest to the +backend domain. The return path of the transmit ring carries messages +indicating that the contents have been physically transmitted and the +backend no longer requires the associated pages of memory. + +To receive packets, the guest places descriptors of unused pages on +the receive ring. The backend will return received packets by +exchanging these pages in the domain's memory with new pages +containing the received data, and passing back descriptors regarding +the new packets on the ring. This zero-copy approach allows the +backend to maintain a pool of free pages to receive packets into, and +then deliver them to appropriate domains after examining their +headers. + +% Real physical addresses are used throughout, with the domain +% performing translation from pseudo-physical addresses if that is +% necessary. + +If a domain does not keep its receive ring stocked with empty buffers, +then packets destined to it may be dropped. This provides some +defence against receive livelock problems because an overloaded domain +will cease to receive further data. Similarly, on the transmit path, +it provides the application with feedback on the rate at which packets +are able to leave the system. + +Flow control on rings is achieved by including a pair of producer +indexes on the shared ring page. Each side will maintain a private +consumer index indicating the next outstanding message. In this +manner, the domains cooperate to divide the ring into two message +lists, one in each direction. Notification is decoupled from the +immediate placement of new messages on the ring; the event channel +will be used to generate notification when {\em either} a certain +number of outstanding messages are queued, {\em or} a specified number +of nanoseconds have elapsed since the oldest message was placed on the +ring.
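
As a concrete illustration of this scheme, the consumer side of such a
ring might be structured as in the C sketch below. The producer
indices live on the shared page, while the consumer index is private
to the consuming domain; the field and function names are hypothetical
rather than the real Xen ring definitions.

\begin{verbatim}
#include <stdint.h>

#define RING_SIZE 256

struct message { uint64_t payload; };

struct shared_ring_page {
    /* The pair of producer indices lives on the shared page. */
    volatile uint32_t req_prod;      /* advanced by the frontend */
    volatile uint32_t resp_prod;     /* advanced by the backend  */
    struct message ring[RING_SIZE];
};

/* Consumer index kept privately by the frontend; the backend keeps
 * its own private req_cons in the same way. */
static uint32_t resp_cons;

static void handle_response(const struct message *m) { (void)m; }

/* Run by the frontend when an event arrives (or a timeout expires):
 * consume everything the backend has produced since the last pass. */
static void frontend_drain_responses(struct shared_ring_page *sp)
{
    while (resp_cons != sp->resp_prod) {
        handle_response(&sp->ring[resp_cons % RING_SIZE]);
        resp_cons++;
    }
    /* Because notification is decoupled from message placement,
     * further responses may already be queued when the next event
     * is delivered; they are picked up on the following pass.    */
}
\end{verbatim}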
+ +%% Not sure if my version is any better -- here is what was here +%% before: Synchronization between the backend domain and the guest is +%% achieved using counters held in shared memory that is accessible to +%% both. Each ring has associated producer and consumer indices +%% indicating the area in the ring that holds descriptors that contain +%% data. After receiving {\it n} packets or {\t nanoseconds} after +%% receiving the first packet, the hypervisor sends an event to the +%% domain. + + +\section{Block I/O} + +All guest OS disk access goes through the virtual block device VBD +interface. This interface allows domains access to portions of block +storage devices visible to the block backend device. The VBD +interface is a split driver, similar to the network interface +described above. A single shared memory ring is used between the +frontend and backend drivers, across which read and write messages are +sent. + +Any block device accessible to the backend domain, including +network-based block (iSCSI, *NBD, etc.), loopback and LVM/MD devices, +can be exported as a VBD. Each VBD is mapped to a device node in the +guest, specified in the guest's startup configuration. + +Old (Xen 1.2) virtual disks are not supported under Xen 2.0, since +similar functionality can be achieved using the more complete LVM +system, which is already in widespread use. + +\subsection{Data Transfer} + +The single ring between the guest and the block backend supports three +messages: + +\begin{description} +\item [{\small {\tt PROBE}}:] Return a list of the VBDs available to + this guest from the backend. The request includes a descriptor of a + free page into which the reply will be written by the backend. + +\item [{\small {\tt READ}}:] Read data from the specified block + device. The frontend identifies the device and location to read + from and attaches pages for the data to be copied to (typically via + DMA from the device). The backend acknowledges completed read + requests as they finish. + +\item [{\small {\tt WRITE}}:] Write data to the specified block + device. This functions essentially as {\small {\tt READ}}, except + that the data moves to the device instead of from it. +\end{description} + +%% um... some old text: In overview, the same style of descriptor-ring +%% that is used for network packets is used here. Each domain has one +%% ring that carries operation requests to the hypervisor and carries +%% the results back again. + +%% Rather than copying data, the backend simply maps the domain's +%% buffers in order to enable direct DMA to them. The act of mapping +%% the buffers also increases the reference counts of the underlying +%% pages, so that the unprivileged domain cannot try to return them to +%% the hypervisor, install them as page tables, or any other unsafe +%% behaviour. +%% +%% % block API here
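
To make the message format concrete, the sketch below shows one
possible C layout for the request and response descriptors passed over
this ring, covering the three operations above. The structure layout,
field names and constants are illustrative only; the real
block-interface headers define their own equivalents.

\begin{verbatim}
#include <stdint.h>

enum blk_op {
    BLK_OP_PROBE,   /* return the list of VBDs available to the guest */
    BLK_OP_READ,    /* read from a VBD into the attached guest pages  */
    BLK_OP_WRITE    /* write to a VBD from the attached guest pages   */
};

#define BLK_MAX_SEGMENTS 11          /* illustrative per-request limit */

struct blk_request {
    uint64_t id;                     /* echoed in the response so the
                                        frontend can match completions */
    uint8_t  op;                     /* one of enum blk_op             */
    uint16_t device;                 /* which VBD is addressed         */
    uint64_t sector;                 /* start sector (READ/WRITE)      */
    uint8_t  nr_segments;            /* pages attached to the request;
                                        a PROBE attaches the single
                                        free page for the reply        */
    uint64_t frame[BLK_MAX_SEGMENTS]; /* page frames for the data      */
};

struct blk_response {
    uint64_t id;                     /* matches blk_request.id         */
    uint8_t  op;                     /* operation being acknowledged   */
    int16_t  status;                 /* 0 on success, negative on error */
};
\end{verbatim}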