
Re: [Xen-devel] [Draft C] Xen on ARM vITS Handling



On 01/06/15 14:12, Ian Campbell wrote:
> On Fri, 2015-05-29 at 14:40 +0100, Julien Grall wrote:
>> Hi Ian,

Hi Ian,

>> NIT: You used my Linaro email which I think is de-activated now :).
> 
> I keep finding new address books with that address  in them!
> 
>>> ## ITS Translation Table
>>>
>>> Message signalled interrupts are translated into an LPI via an ITS
>>> translation table which must be configured for each device which can
>>> generate an MSI.
>>
>> I'm not sure what is the ITS Table Table. Did you mean Interrupt
>> Translation Table?
> 
> I don't think I wrote Table Table anywhere.

Sorry, I meant "ITS translation table".

> I'm referring to the tables which are established by e.g. the MAPD
> command and friends, e.g. the thing shown in "4.9.12 Notional ITS Table
> Structure".

In the previous paragraph you are referring specifically to the "Interrupt
Translation Table". This is the only table which is configured per device.

[..]

>>> XXX there are other aspects to virtualising the ITS (LPI collection
>>> management, assignment of LPI ranges to guests, device
>>> management). However these are not currently considered here. XXX
>>> Should they be/do they need to be?
>>
>> I think we began to cover these aspect with the section "command emulation".
> 
> Some aspects, yes. I went with:
> 
>         There are other aspects to virtualising the ITS (LPI collection
>         management, assignment of LPI ranges to guests, device
>         management). However these are only considered here to the extent
>         needed for describing the vITS emulation.
> 
>>> XXX In the context of virtualised device ids this may not be the case,
>>> e.g. we can arrange for (mostly) contiguous device ids and we know the
>>> bound is significantly lower than 2^32
>>
>> Well, the deviceID is computed from the BDF and some DMA alias. As the
>> algorithm can't be tweaked, it's very likely that we will have
>> non-contiguous Device ID. See pci_for_each_dma_alias in Linux
>> (drivers/pci/search.c).
> 
> The implication here is that deviceID is fixed in hardware and is used
> by driver domain software in contexts where we do not get the
> opportunity to translate is that right? What contexts are those?

No, the driver domain software will always use a virtual DeviceID (based
on the vBDF and other things). The problem I wanted to raise is how to
translate the vDeviceID back to a physical DeviceID/BDF.

> Note that the BDF is also something which we could in principal
> virtualise (we already do for domU). Perhaps that is infeasible for dom0
> though?

For DOM0 the virtual BDF is equal to the physical BDF, so both DeviceIDs
(physical and virtual) will be the same.

We may decide to use vBDF == pBDF for guests too in order to simplify the
code.

> That gives me two thoughts.
> 
> The first is that although device identifiers are not necessarily
> contiguous, they are generally at least grouped and not allocated at
> random through the 2^32 options. For example a PCI Host bridge typically
> has a range of device ids associated with it and each device has a
> device id derived from that.

Usually it's one per (device, function).

> 
> I'm not sure if we can leverage that into a more useful data structure
> than an R-B tree, or for example to arrange for the R-B to allow for the
> translation of a device within a span into the parent span and from
> there do the lookup. Specifically when looking up a device ID
> corresponding to a PCI device we could arrange to find the PCI host
> bridge and find the actual device from there. This would keep the RB
> tree much smaller and therefore perhaps quicker? Of course that depends
> on what the lookup from PCI host bridge to a device looked like.

I'm not sure why you are talking about the PCI host bridge. AFAIK, the
guest doesn't have a physical host bridge.

Anyway, this is an optimization that we can think about later. The
R-B tree will already be fast enough for a first implementation. My main
point was about the vDeviceID => pDeviceID translation.
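
To make the lookup concrete, something along these lines is what I have
in mind (a minimal sketch; the structure and field names are made up for
illustration, they are not part of the draft):

    /* Per-domain R-B tree node keyed on the virtual DeviceID. */
    struct vits_device {
        struct rb_node node;
        uint32_t vdevid;      /* virtual DeviceID (derived from the vBDF) */
        uint32_t pdevid;      /* physical DeviceID (derived from the pBDF) */
    };

    static struct vits_device *vits_find_device(struct rb_root *root,
                                                uint32_t vdevid)
    {
        struct rb_node *n = root->rb_node;

        while ( n )
        {
            struct vits_device *dev = rb_entry(n, struct vits_device, node);

            if ( vdevid < dev->vdevid )
                n = n->rb_left;
            else if ( vdevid > dev->vdevid )
                n = n->rb_right;
            else
                return dev;
        }

        return NULL;
    }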

> The second is that perhaps we can do something simpler for the domU
> case, if we were willing to tolerate it being different from dom0.
> 
>>> Possible efficient data structures would be:
>>>
>>> 1. List: The lookup/deletion is in O(n) and the insertion will depend
>>>     if the device should be sorted following their identifier. The
>>>     memory overhead is 18 bytes per element.
>>> 2. Red-black tree: All the operations are O(log(n)). The memory
>>>     overhead is 24 bytes per element.
>>>
>>> A Red-black tree seems the more suitable for having fast deviceID
>>> validation even though the memory overhead is a bit higher compare to
>>> the list.
>>>
>>> ### Event ID (`vID`)
>>>
>>> This is the per-device Interrupt identifier (i.e. the MSI index). It
>>> is configured by the device driver software.
>>>
>>> It is not necessary to translate a `vID`, however they may need to be
>>> represented in various data structures given to the pITS.
>>>
>>> XXX is any of this true?
>>
>>
>> Right, the vID will always be equal to the pID. Although you will need
>> to associate a physical LPI for every pair (vID, DevID).
> 
> I think in the terms defined by this document that is (`ID`, `vID`) =>
> an LPI. Right?

Well, by `ID` do you mean the DeviceID? That reminds me that the second
paragraph of "# ITS Command Translation" is confusing (copied below):

"The ITS provides 12 commands in order to manage interrupt collections,
devices and interrupts. Possible command parameters are device ID
(`ID`), Event ID (`vID`), Collection ID (`vCID`), Target Address
(`vTA`) parameters."

The spec uses `ID` for "interrupt ID" and `device` for the DeviceID.

In general, the "command emulation" section uses 'v' to refer to the
virtual identifier and 'p' to the physical identifier (see pCID vs
vCID: both are Collection IDs).

> 
> Have we considered how this mapping will be tracked?

I don't think so. We need to track the list of `event ID`s enabled for
this DeviceID and the associated `vLPI` and `LPI`.

Although, if we allocate a contiguous chunk of `LPI`s it won't be
necessary to track the latter.
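
Roughly, the per-device tracking could look like the sketch below (the
names are only illustrative; with a contiguous pLPI chunk a single base
is enough for the physical side):

    struct vits_vlpi_map {
        uint32_t vlpi;        /* virtual LPI programmed by the guest */
        bool     enabled;
    };

    struct its_device {
        uint32_t vdevid;
        uint32_t pdevid;
        unsigned int nr_events;       /* derived from the ITT size */
        uint32_t plpi_base;           /* contiguous chunk: pLPI = plpi_base
                                         + event ID, so no per-event pLPI
                                         field is needed */
        struct vits_vlpi_map *events; /* indexed by event ID */
    };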

>>> ### Interrupt Collection (`vCID`)
>>>
>>> This parameter is used in commands which manage collections and
>>> interrupt in order to move them for one CPU to another. The ITS is
>>> only mandated to implement N + 1 collections where N is the number of
>>> processor on the platform (i.e max number of VCPUs for a given
>>> guest). Furthermore, the identifiers are always contiguous.
>>>
>>> If we decide to implement the strict minimum (i.e N + 1), an array is
>>> enough and will allow operations in O(1).
>>>
>>> XXX Could forgo array and go straight to vcpu_info/domain_info.
>>
>> Not really, the number of collection is always one higher than the
>> number of VCPUs. How would you store the last collection?
> 
> In domain_info. What I meant was:
> 
>     if ( vcid == domain->nr_vcpus )
>          return domain->interrupt_collection
>     else if ( vcid < domain->nr_vcpus )
>          return domain->vcpus[vcid]->interrupt_collection
>     else
>          invalid vcid.
> 
> Similar to how SPI vs PPI interrupts are handled.

Sorry, I didn't understand your suggestion that way.

I think that can work, although the resulting code may be difficult to
read/understand because a collection can be moved from one vCPU to another.
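
In C your suggestion would be something along these lines (the
vits_collection fields are hypothetical):

    static struct its_collection *vcid_to_collection(struct domain *d,
                                                     uint32_t vcid)
    {
        if ( vcid < d->max_vcpus )
            return &d->vcpu[vcid]->arch.vits_collection;
        if ( vcid == d->max_vcpus )
            return &d->arch.vits_collection;

        return NULL;    /* invalid vCID */
    }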

> 
>>> ## Command Translation
>>>
>>> Of the existing GICv3 ITS commands, `MAPC`, `MAPD`, `MAPVI`/`MAPI` are
>>> potentially time consuming commands as these commands creates entry in
>>> the Xen ITS structures, which are used to validate other ITS commands.
>>>
>>> `INVALL` and `SYNC` are global and potentially disruptive to other
>>> guests and so need consideration.
>>
>> INVALL and SYNC are not global. They both take a parameter: vCID for
>> INVALL and vTarget for SYNC.
> 
> By global I meant not associated with a specific device. I went with:
> 
>         `INVALL` and `SYNC` are not specific to a given device (they are per
>         collection per target respectively) and are therefore potentially
>         disruptive to other guests and so need consideration.

Thanks for the clarification.

> 
>> INVALL ensures that any interrupts in the specified collection are
>> re-load. SYNC ensures that all the previous command, and all outstanding
>> physical actions relating to the specified re-distributor are completed.
> 
>>
>>> All other ITS command like `MOVI`, `DISCARD`, `INV`, `INT`, `CLEAR`
>>> just validate and generate physical command.
>>>
>>> ### `MAPC` command translation
>>>
>>> Format: `MAPC vCID, vTA`
>>>
>>> - `MAPC pCID, pTA` physical ITS command is generated
>>
>> We should not send any MAPC command to the physical ITS. The collection
>> is already mapped during Xen boot.
> 
> What is the plan for this start of day mapping? One collection per pCPU
> and ignore the rest?

To map one collection per pCPU. We can look at improvements later.

> It seems (section 4.9.2) that there are two potential kinds of
> collections, ones internal to the ITS and others where data is held in
> external memory. The numbers of both are limited by the hardware.

They are all the same. The only difference is that for external
collections the memory has to be allocated by the software.

> I suppose the internal ones will be faster.

No idea.

> Supposing that a guest is likely to use collections to map interrupts to
> specific vcpus, and that the physical collections will be mapped to
> pcpus, I suppose this means we will need to do some relatively expensive
> remapping (corresponding to moving the IRQ to another collection) in
> arch_move_irqs? Is that the best we can do?

If we use the same solution as for SPIs/PPIs (i.e. arch_move_irqs), yes.
But I'm not in favor of this solution, which is already expensive for
SPIs/PPIs (after talking with Dario, a vCPU may move often between pCPUs).

We could decide to implement a lazy solution (i.e. moving the interrupt
only when it fires) but that would require sending ITS commands from Xen.

Another idea is to never move the interrupt collection from one pCPU to
another. The vCID would always map to the same pCID, but the interrupt
would be injected to a different vCPU depending on the associated vTA.

> 
>> This command should only assign a pCID to the vCID.
> 
> Does it not also need to remap some interrupts to that new pCID?

That would be very expensive. Another solution would be to always keep
the same mapping vCID <=> pCID (allocated at boot) and only update the
vTA. We would lose the possibility of having LPIs follow a vCPU, but
that would be a small drawback compared to the cost of the ITS commands.

>>>
>>> ### `MAPD` Command translation
>>>
>>> Format: `MAPD device, Valid, ITT IPA, ITT Size`
>>>
>>> `MAPD` is sent with `Valid` bit set if device needs to be added and reset
>>> when device is removed.
>>
>> Another case: The ITT is replaced. This use case needs more care because
>> we need to ensure that all the interrupt are disabled before switching
>> to the new ITT.
> 
> I've added a note since I think this is going to be a discussion in the
> other sub thread.
> 
>>
>>> If `Valid` bit is set:
>>>
>>> - Allocate memory for `its_device` struct
>>> - Validate ITT IPA & ITT size and update its_device struct
>>> - Find number of vectors(nrvecs) for this device by querying PCI
>>>    helper function
>>> - Allocate nrvecs number of LPI XXX nrvecs is a function of `ITT Size`?
>>> - Allocate memory for `struct vlpi_map` for this device. This
>>>    `vlpi_map` holds mapping of Virtual LPI to Physical LPI and ID.
>>> - Find physical ITS node with which this device is associated
>>
>> XXX: The MAPD command is using a virtual DevID which is different that
>> the pDevID (because the BDF is not the same). How do you find the
>> corresponding translation?
> 
> Not sure, do we need a per-domain thing mapping vBDF to $something? Do
> we alreayd have such a thing, e.g. in the SMMU code?

The current SMMU code doesn't have anything related to it. I need to
think about it.

> I've added a note.

Thanks.

>>
>>> - Call `p2m_lookup` on ITT IPA addr and get physical ITT address
>>> - Validate ITT Size
>>> - Generate/format physical ITS command: `MAPD, ITT PA, ITT Size`
>>
>> I had some though about the other validation problem with the ITT. The
>> region will be used by the ITS to store the mapping between the ID and
>> the LPI as long as some others information.
> 
> ITYM "as well as some other information"?

Yes.

> 
>> I guess that if the OS is playing with the ITT (such as writing in it)
>> the ITS will behave badly. We have to ensure to the guest will never
>> write in it and by the same occasion that the same region is not passed
>> to 2 devices.
> 
> I don't think we will be exposing the physical ITT to the guest, will
> we? That will live in memory which Xen owns and controls and doesn't
> share with any guest.

That needs to be clarified. The RFC sent by Vijay lets the guest feed
the physical ITT.

But I just thought that this won't work well: the ITT has to be
contiguous in physical memory.

> In fact, I don't know that a vITS will need an ITT memory at all, i.e.
> most of our GITS_BASERn will be unimplemented.
> In theory we could use these registers to offload some of the data
> structure storage requirements to the guest, but that would require
> great care to validate any time we touched it (or perhaps just
> p2m==r/o), I think it is probably not worth the stress if we can just
> use regular hypervisor side data structures instead? (This stuff is just
> there for the h/w ITS which doesn't have the luxury of xmalloc).

GITS_BASERn is only used to feed internal ITS tables such as Collections,
Devices and Virtual Processors.

For the ITT, the address is passed via MAPD and has to be allocated by
the guest.

We could ignore the address provided by the guest, but that would mean
the ITT is allocated twice (once by the guest, unused, and once by Xen,
used to feed the pITS). Also, as pointed out above, the ITT memory
region allocated by the guest may not be contiguous in physical memory.

Furthermore, if we allocate the ITT in Xen, this command will be quick
to emulate (we could send the MAPD command when the device is attached).

So there are fewer possible security issues.
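
As a rough sketch of what I mean (its_send_mapd() and the its_device
fields are hypothetical, only _xmalloc() is the real allocator; the
GICv3 spec requires the ITT to be 256-byte aligned):

    static int vits_attach_device(struct its_device *dev,
                                  unsigned int itt_size)
    {
        /* The pITS requires the ITT to be physically contiguous, so let
         * Xen allocate it rather than trusting a guest-provided region. */
        dev->itt = _xmalloc(itt_size, 256);
        if ( !dev->itt )
            return -ENOMEM;

        /* Hypothetical helper queueing a physical MAPD with Valid = 1. */
        its_send_mapd(dev->pdevid, virt_to_maddr(dev->itt), itt_size, true);

        return 0;
    }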

>>> Here the overhead is with memory allocation for `its_device` and `vlpi_map`
>>>
>>> XXX Suggestion was to preallocate some of those at device passthrough
>>> setup time?
>>
>> Some of the informations can even be setup when the PCI device is added
>> to Xen (such as the number of MSI supported and physical LPIs chunk).
> 
> Yes, assuming there are sufficient LPIs to allocate in this way. That's
> not clear though, is it?

IMHO it's not an issue. If the ITS doesn't provide enough LPIs, a
baremetal OS would likely not be able to use the device either.

FWIW, Linux always allocates the LPIs when the device is added, based
on the PCI config space.

> 
>>> If Validation bit is not set:
>>>
>>> - Validate if the device exits by checking vITS device list
>>> - Clear all `vlpis` assigned for this device
>>> - Remove this device from vITS list
>>> - Free memory
>>>
>>> XXX If preallocation presumably shouldn't free here either.
>>
>> Right. We could use a field to say if the device is activated or not.
>>
>>>
>>> ### `MAPVI`/`MAPI` Command translation
>>>
>>> Format: `MAPVI device, ID, vID, vCID`
>>
>> Actually the 2 commands are completely different:
>>      - MAPI maps a (DevID, ID) to a collection
>>      - MAVI maps a (DevID, ID) to a collection and an LPI.
> 
> MAPVI for the second one I think?

Right.

> 
> The difference is that MAPI lacks the vID argument?

Yes. And MAPI doesn't bind the (DevID, ID) to an LPI. MAPI can be used
to move an IRQ from one collection to another.

>> The process described below is only about MAPVI.
> 
> OK. I've left a placeholder for `MAPI`.
> 
>> Also what about interrupt re-mapping?
> 
> I don't know, what about it?

It was an open question. I answered it in the mail to Vijay.

> 
>>> - Validate vCID and get pCID by searching cid_map
>>>
>>> - if vID does not have entry in `vlpi_entries` of this device allocate
>>>    a new pID from `vlpi_map` of this device and update `vlpi_entries`
>>>    with new pID
>>
>> What if the vID is already used by another
> 
> I think Vijay's updates already addressed this.

I will take a look.

> 
>>> - Allocate irq descriptor and add to RB tree
>>> - call `route_irq_to_guest()` for this pID
>>> - Generate/format physical ITS command: `MAPVI device ID, pID, pCID`
>>
>>
>>> Here the overhead is allocating physical ID, allocate memory for irq
>>> descriptor and routing interrupt.
>>>
>>> XXX Suggested to preallocate?
>>
>> Right. We may also need to have a separate routing for LPIs as the
>> current function is quite long to execute.
>>
>> I was thinking into routing the interrupt at device assignation
>> (assuming we allocate the pLPIs at that time). And only set the mapping
>> to vLPIs when the MAPI is called.
> 
> Please propose concrete modifications to the text, since I can't figure
> out what you mean to change here.

It's not yet fully clear in my mind.

If we assume that we have a chunk of LPIs allocated for the device when
it's assigned to the guest, the only missing information is the
corresponding vLPIs.

We could decide to set up the internal routing structure (allocation of
the irq_guest) and defer the setting of the vLPI. This would work
because the interrupts are disabled at boot time and won't be enabled
until a MAPVI command is sent.
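
To make the split concrete (reusing the illustrative its_device fields
from earlier; vits_route_plpi() is a hypothetical wrapper around
route_irq_to_guest()):

    /* At device assignment time: the pLPI chunk is already allocated,
     * so route all of it to the guest now. */
    static int vits_assign_device(struct domain *d, struct its_device *dev)
    {
        unsigned int i;
        int rc;

        for ( i = 0; i < dev->nr_events; i++ )
        {
            rc = vits_route_plpi(d, dev->plpi_base + i);
            if ( rc )
                return rc;
        }

        return 0;
    }

    /* At MAPVI emulation time: the interrupt is still disabled, so only
     * the vLPI translation has to be recorded here. */
    static void vits_emulate_mapvi(struct its_device *dev,
                                   uint32_t event_id, uint32_t vid)
    {
        dev->events[event_id].vlpi = vid;
        dev->events[event_id].enabled = false;
    }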



> 
>>
>>>
>>> ### `INVALL` Command translation
>>
>> The format of INVALL is INVALL collection
>>
>>> A physical `INVALL` is only generated if the LPI dirty bitmap has any
>>> bits set. Otherwise it is skipped.
>>>
>>> XXX Perhaps bitmap should just be a simple counter?
>>
>> We would need to handle it per collection.
> 
> Hrm, this complicates things a bit. Don't we need to invalidate any pCID
> which has a routing of an interrupt to vCID? i.e. potentially multiple
> INVALL?

Not necessarily. If we keep the interrupts of a vCID always in the same
pCID, we would only need to send one INVALL.
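
For instance, with a per-collection dirty counter (all names made up,
its_send_invall() is a hypothetical helper):

    struct vits_collection {
        uint16_t pcid;             /* fixed vCID <=> pCID mapping */
        unsigned int dirty_count;  /* LPI config writes since last INVALL */
    };

    static void vits_emulate_invall(struct vits_collection *col)
    {
        if ( !col->dirty_count )
            return;                 /* nothing changed: skip the pINVALL */

        its_send_invall(col->pcid);
        col->dirty_count = 0;
    }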

>>
>>>
>>> ### `SYNC` Command translation
>>
>> The format of SYNC is SYNC target. It's only ensure the completion for a
>> re-distributor.
>> Although, the pseudo-code (see perform_sync in 5.13.22 in
>> PRD03-GENC-010745 24.0) seems to say it waits for all re-distributor...
>> I'm not sure what to trust.
> 
> Yes, it's confusing but the first sentence of 5.13.22 says:
>         This command specifies that the ITS must wait for completion of
>         internal effects of all previous commands, and all
>         outstanding physical actions relating to the specified
>         re-distributor.
>         
> So, by my reading, all redistributors need to have seen the effect of
> any command issued to the given redistributor (not all commands given to
> any redistributor).
> 
> Example: given command cA issued to redistributor rA and command cB
> issued to redistrubutor rB and then issuing SYNC(rA) must ensure that cA
> is visible to _both_ rA and rB, but doesn't say anything regarding cB at
> all.

Thanks for the explanation.

Regards,

-- 
Julien Grall

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

