
Re: [Xen-devel] Scalable Event Channel ABI design (draft A)



On Tue, 2013-02-05 at 18:18 +0000, David Vrabel wrote:
> On 05/02/13 16:10, Ian Campbell wrote:
> > On Mon, 2013-02-04 at 17:52 +0000, David Vrabel wrote:
> >> The pages within the event array need not be physically or virtually
> >> contiguous, but the guest or Xen may make them virtually contiguous
> >> for ease of implementation, e.g. by using vmap() in Xen or vmalloc()
> >> in Linux.  Pages are added by the guest as required by the number of
> >> bound event channels.
> > 
> > Strictly speaking the size is related to the maximum bound evtchn, which
> > need not be the same as the number of bound evtchns.
> 
> Yes.
> 
> >> The state of an event is tracked using 3 bits within the event word:
> >> the M (masked), P (pending) and L (linked) bits.  Only state
> >> transitions that change a single bit are valid.
> > 
> > The diagram shows transitions B<->P and P<->L, which involve two bits in
> > each case. So do BM<->PM and LM<->BM now I think about it.
> 
> B == 000b, P == 100b, L == 101b.
> 
> Similarly for the masked transitions:
> 
> BM == 010b, PM == 110b, LM == 111b.

Ah, it wasn't clear that the states in the diagram didn't relate to the
bits directly; reusing the letters in that way is a bit confusing.
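
For reference, this is how I now read the encoding (the actual bit
positions are my assumption, only the state values above are given):

    /* Assumed layout of the 3 state bits in the event word:
     * bit 2 = P (pending), bit 1 = M (masked), bit 0 = L (linked). */
    #define EVT_LINKED   (1 << 0)
    #define EVT_MASKED   (1 << 1)
    #define EVT_PENDING  (1 << 2)

    /* B = 000b, P = 100b, L = 101b; BM = 010b, PM = 110b, LM = 111b. */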

> 
> > Is the L bit redundant with the LINK field == 0 or != 0?
> 
> LINK == 0 is the end of queue marker.  We could use all ones to mean
> 'unlinked' but using the L bit allows the guest to remove the event from
> the queue by clearing a single bit, instead of writing the LINK field.

So you can use test/clear-bit instead of cmpxchg?
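
i.e. something like the following on the guest side (just a sketch,
using Linux-style atomic bitops; LINKED_BIT and LINK_MASK stand in for
whatever the final ABI defines):

    /* With a dedicated L bit: a single atomic bit op on the event word. */
    clear_bit(LINKED_BIT, (unsigned long *)&E[p]);

    /* Without it: a cmpxchg loop to rewrite the LINK field in place. */
    uint32_t old, new;
    do {
        old = E[p];
        new = old & ~LINK_MASK;
    } while (cmpxchg(&E[p], old, new) != old);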

> >> A guest should call this during initial VCPU bring up (and on resume?).
> >>
> >>     struct evtchnop_init {
> >>         uint32_t vcpu;
> >>         uint64_t array_pfn;
> >>     };
> > 
> > I think this will have a different layout on 32 and 64 bit x86, if you
> > care (because uint64_t is align(4) on 32-bit and align(8) on 64-bit).
> 
> Ok.  I thought uint64_t was always 8 byte aligned on both.

Sadly not, it's the cause of a fair few 32- vs 64-bit ABI
differences :-(
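
For example, on 32-bit x86 the struct above is 12 bytes with array_pfn
at offset 4, while on 64-bit it is 16 bytes with array_pfn at offset 8.
Explicit padding would keep the two layouts identical:

    struct evtchnop_init {
        uint32_t vcpu;
        uint32_t pad;        /* explicit padding so the layout is the */
        uint64_t array_pfn;  /* same for 32-bit and 64-bit guests     */
    };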

> >> Memory Usage
> >> ------------
> >>
> >> Alternatively, we can consider a system with $D$ driver domains, each
> >> of which requires $E_D$ events, and a dom0 using the maximum number of
> >> pages (128).
> >>
> >> \begin{eqnarray*}
> >> V & = & P - \left(128 + D \times \frac{E_D}{E_P}\right)
> >> \end{eqnarray*}
> >>
> >> With, for example, 16 driver domains each using the maximum number of 
> >> pages:
> >> \begin{eqnarray*}
> >> V  & = & (262144/2) - (128 + 16 \times \frac{2^{17}}{1024}) \\
> >>    & = & 129 \times 10^3\text{ VMs}
> >> \end{eqnarray*}
> > 
> > This accounts for the driver domains and dom0 but not the domains which
> > they are serving, doesn't it?
> 
> This is calculating the number of pages left once those used for dom0
> and the driver domains are used.  This is the same as the number of
> supportable VMs since the other VMs only require a page.

Ah, makes sense.
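
(For the record, that's 131072 - (128 + 16*128) = 131072 - 2176 = 128896
pages left, one page per remaining VM, hence the ~129 x 10^3.)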

> >> In summary, there is space to map the event arrays for over 100,000
> >> VMs.  This is more than the limit imposed by the 16 bit domain ID
> >> (~32,000 VMs).
> > 
> > Is there scope to reduce the maximum then?
> 
> Maximum of what?

I suppose the maximum number of pages a domain can use for evtchn
queues.

> 
> >> ### Control Block
> >>
> >> With $L$ priority levels and two 32-bit words for the head and tail
> >> indexes, the amount of space ($S$) required in the `struct vcpu_info`
> >> for the control block is:
> >> \begin{eqnarray*}
> >> S & = & L \times 2 \times 4 \\
> >>   & = & 16 \times 2 \times 4 \\
> >>   & = & 128\text{ bytes}
> >> \end{eqnarray*}
> >>
> >> There is more than enough space within `struct vcpu_info` for the
> >> control block while still keeping plenty of space for future use.
> > 
> > There is? I can see 7 bytes of padding and 4 bytes of evtchn_pending_sel
> > which I suppose becomes redundant now.
> > 
> > I don't think it would be a problem to predicate use of this new
> > interface on the use of the VCPU_placement API and therefore give scope
> > to expand the vcpu_info.
> 
> I should have taken a proper look at where the vcpu_info comes from...
> Oops.
> 
> We can add a new VCPUOP_register_vcpu_info_and_control_block which will
> add a struct vcpu_info and the control block.

Better to either say that using the new ABI requires the guest to have
reserved space after the vcpu_info, or to have a separate call for the
control block (which could have the constraint that it is on the same
page as the vcpu_info).
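
Something along these lines is what I have in mind for the latter
(names and layout are purely illustrative, not a concrete proposal):

    /* Illustrative only: register the 128-byte control block
     * separately, constrained to the same page as the vcpu_info. */
    struct evtchn_control_block {
        uint32_t head[16];   /* per-priority queue head index */
        uint32_t tail[16];   /* per-priority queue tail index */
    };

    struct evtchnop_register_control_block {
        uint32_t vcpu;
        uint32_t offset;     /* offset of the block within the
                                vcpu_info page */
    };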

> >> Low Level Design
> >> ================
> >>
> >> In the pseudo code in this section, all memory accesses are atomic,
> >> including those to bit-fields within the event word.
> > 
> >> These variables have a standard meaning:
> >>
> >> Variable  Purpose
> >> --------  -------
> >> E         Event array.
> >> p         A specific event.
> >> H         The head index for a specific priority.
> >> T         The tail index for a specific priority.
> >>
> >> These memory barriers are required:
> >>
> >> Function  Purpose
> >> --------  -------
> >> mb()      Full (read/write) memory barrier
> >>
> >> Raising an Event
> >> ----------------
> >>
> >> When Xen raises an event it marks it pending and (if it is not masked)
> >> adds it to the tail of the event queue.
> > 
> > What are the conditions for actually performing the upcall when
> > returning from the hypervisor to the guest?
> 
> Any H != 0 for that VCPU.  This means we should have a per-VCPU bitmap
> of non-empty queues.

You need to be careful here: if two evtchns are raised near-simultaneously
then you will return to the guest with two events on the queue and do an
upcall. If in the course of handling the first event you make a hypercall,
you will return to the guest with one event still queued, but you won't
want to reinject the upcall.

Which makes me wonder about the handling of IRQ priority: is the intention
that if a higher-priority interrupt occurs while you are processing an
existing upcall then the handler will be interrupted for the new
higher-priority one? In which case we need an IRQL concept which allows
the domain to re-enable interrupts while masking interrupts at the current
or lower priority.

Or perhaps it is acceptable to say that a domain will always finish
handling the current IRQ before it takes the higher priority one.

Should we have a bitmask of pending priorities so the guest doesn't have
to check all the queues? It's only 15/16 queues but since most evtchns
probably remain at level 7 that's still a bit wasteful, especially if
you need to restart the scan after every interrupt to implement
prioritisation.
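
To make that concrete, what I am imagining is roughly the following
(all names invented: 'ready_priorities' would be a per-VCPU word in
which Xen sets bit q when it appends to the priority-q queue):

    for (;;) {
        uint32_t ready = read_atomic(&v->ready_priorities);
        if (!ready)
            break;
        int q = ffs(ready) - 1;      /* assuming bit 0 = highest priority */
        consume_one_event(v, q);     /* handle a single event from queue q */
        /* Re-reading 'ready' each iteration gives the prioritisation
         * without rescanning all 16 queue heads after every event. */
    }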

> >> Concurrent access by Xen to the event queue must be protected by a
> >> per-event queue spin lock.
> >>
> >> Consuming Events
> >> ----------------
> >>
> >> The guest consumes events starting at the head until it reaches the
> >> tail.  Events in the queue that are not pending or are masked are
> >> consumed but not handled.
> >>
> >>     while H != 0
> >>         p = H
> >>         H = E[p].link
> >>         if H == 0
> >>             mb()
> >>             H = E[p].link
> >>         E[H].linked = 0
> > 
> > Did you mean E[p].linked here?
> > 
> > If at this point the interrupt is reraised then the if in the raising
> > pseudocode becomes true, linked will be set again, and don't we also
> > race with the clearing of E[p].pending below?
> 
> The event will be correctly linked and then we clear P,

By P you mean E[p].pending in the pseudo-code terminology?

>  when we consume
> the linked event it will be ignored as P is already clear.

So we will miss the second interrupt?

> Perhaps we could also avoid linking the event if it is already pending?
> 
> >>> Note: When the event queue contains a single event, removing the
> >>> event may race with Xen appending another event because the load of
> >>> `E[p].link` and the store of `H` is not atomic.  To avoid this race,
> >>> the guest must recheck `E[p].link` if the list appeared empty.
> > 
> > It appears that both "Raising an Event" and "Consuming Events" can write
> > H? Is that correct? Likewise for the linked bit.
> 
> Xen only writes H when the queue is empty, the VM only writes H if it is
> non-empty.
> 
> The linked bit is set by Xen and cleared by the guest.
> 
> Do you see a problem with this?

You'd need to be careful about the barriers and think hard about the
case where the VM is emptying the queue at the same time as Xen is
injecting a new interrupt and therefore unemptying it. And vice-versa.
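
For what it's worth, my reading of the consume loop in C, with
E[p].linked rather than E[H].linked and with the recheck, is roughly
this (the accessor macros are stand-ins for whatever the final ABI
defines):

    while ((p = H) != 0) {
        H = LINK_OF(E[p]);
        if (H == 0) {
            /* Queue looks empty, but Xen may be appending right now:
             * re-read the link after a full barrier. */
            mb();
            H = LINK_OF(E[p]);
        }
        CLEAR_LINKED(E[p]);
        if (PENDING(E[p]) && !MASKED(E[p])) {
            CLEAR_PENDING(E[p]);
            handle_event(p);
        }
    }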

> > It's a bit unclear because the pseudo-code doesn't make explicit
> > which variables are part of the shared data structure and which are
> > private to the local routine.
> 
> Capital variables are in the shared structures.  Lower-case are local.
> 
> >> Masking Events
> >> --------------
> >>
> >> Events are masked by setting the masked bit.  If the event is pending
> >> and linked it does not need to be unlinked.
> >>
> >>     E[p].masked = 1
> >>
> >> Unmasking Events
> >> ----------------
> >>
> >> Events are unmasked by the guest by clearing the masked bit.  If the
> >> event is pending the guest must call the event channel unmask
> >> hypercall so Xen can link the event into the correct event queue.
> >>
> >>     E[p].masked = 0
> >>     if E[p].pending
> >>         hypercall(EVTCHN_unmask)
> > 
> > Can the hypercall do the E[p].masked = 0?
> 
> Sure.

Perhaps I should have said "Should..." :-)
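
In that case the guest side would reduce to something like (sketch only,
with Xen clearing M and relinking the event as part of the call):

    /* If the unmask hypercall also clears E[p].masked (and relinks the
     * event if it is pending), the guest never writes M itself and the
     * clear-then-hypercall window above goes away. */
    static inline void evtchn_unmask(unsigned int p)
    {
        hypercall(EVTCHN_unmask, p);
    }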

Ian.


