
Re: [Xen-devel] PCI Passthrough Design - Draft 3



On Wed, Aug 12, 2015 at 01:03:07PM +0530, Manish Jaggi wrote:
> Below are the comments. I will also send a Draft 4 taking the comments into
> account.
> 
> 
> On Wednesday 12 August 2015 02:04 AM, Konrad Rzeszutek Wilk wrote:
> >On Tue, Aug 04, 2015 at 05:57:24PM +0530, Manish Jaggi wrote:
> >>              -----------------------------
> >>             | PCI Pass-through in Xen ARM |
> >>              -----------------------------
> >>             manish.jaggi@xxxxxxxxxxxxxxxxxx
> >>             -------------------------------
> >>
> >>                      Draft-3
> >>...
> >>[snip]
> >>2.2    PHYSDEVOP_pci_host_bridge_add hypercall
> >>----------------------------------------------
> >>Xen code accesses PCI configuration space based on the sbdf received from
> >>the
> >>guest. The order in which the pci device tree node appear may not be the
> >>same
> >>order of device enumeration in dom0. Thus there needs to be a mechanism to
> >>bind
> >>the segment number assigned by dom0 to the pci host controller. The
> >>hypercall
> >>is introduced:
> >Why can't we extend the existing hypercall to have the segment value?
> >
> >Oh wait, PHYSDEVOP_manage_pci_add_ext does it already!
> It doesn't pass the cfg_base and size to Xen.

cfg_base is the BAR? Or the MMIO ?

> >
> >And have the hypercall (and Xen) be able to deal with introduction of PCI
> >devices that are out of sync?
> >
> >Maybe I am confused but aren't PCI host controllers also 'uploaded' to
> >Xen?
> I need to add one more line here to be more descriptive. The binding is
> between the segment number (domain number in Linux)
> used by dom0 and the PCI config space address in the PCI node of the device
> tree (reg property).
> The hypercall was introduced to cater for the fact that dom0 may process PCI
> nodes in the device tree in any order.

I still don't follow - sorry.

Why would it matter that the PCI nodes are processed in any order?

> By this binding it is a clear ABI.
> >>#define PHYSDEVOP_pci_host_bridge_add    44
> >>struct physdev_pci_host_bridge_add {
> >>     /* IN */
> >>     uint16_t seg;
> >>     uint64_t cfg_base;
> >>     uint64_t cfg_size;
> >>};
> >>
> >>This hypercall is invoked before dom0 invokes the PHYSDEVOP_pci_device_add
> >>hypercall. The handler code invokes to update segment number in
> >>pci_hostbridge:
> >>
> >>int pci_hostbridge_setup(uint32_t segno, uint64_t cfg_base, uint64_t
> >>cfg_size);
> >>
> >>Subsequent calls to pci_conf_read/write are completed by the
> >>pci_hostbridge_ops
> >>of the respective pci_hostbridge.
> >This design sounds like it is added to deal with having to pre-allocate the
> >amount host controllers structure before the PCI devices are streaming in?
> >
> >Instead of having the PCI devices and PCI host controllers be updated
> >as they are coming in?
> >
> >Why can't the second option be done?
> If you are referring to ACPI, we have to add the support.
> PCI Host controllers are pci nodes in device tree.

I think what you are saying is that the PCI devices are being uploaded
during ACPI parsing. The PCI host controllers are done via
device tree.

But what difference does that make? Why can't Xen deal with these
being in any order? Can't it re-organize its internal representation
of PCI host controllers and PCI devices based on new data?



> >>2.3    Helper Functions
> >>------------------------
> >>a) pci_hostbridge_dt_node(pdev->seg);
> >>Returns the device tree node pointer of the pci node from which the pdev got
> >>enumerated.
> >>
> >>3.    SMMU programming
> >>-------------------
> >>
> >>3.1.    Additions for PCI Passthrough
> >>-----------------------------------
> >>3.1.1 - add_device in iommu_ops is implemented.
> >>
> >>This is called when PHYSDEVOP_pci_add_device is called from dom0.
> >Or for PHYSDEVOP_manage_pci_add_ext ?
> Not sure but it seems logical for this also.
> >>.add_device = arm_smmu_add_dom0_dev,
> >>static int arm_smmu_add_dom0_dev(u8 devfn, struct device *dev)
> >>{
> >>         if (dev_is_pci(dev)) {
> >>             struct pci_dev *pdev = to_pci_dev(dev);
> >>             return arm_smmu_assign_dev(pdev->domain, devfn, dev);
> >>         }
> >>         return -1;
> >>}
> >>
> >What about removal?
> >
> >What if the device is removed (hot-unplugged??
> .remove_device = arm_smmu_remove_device() would be called.
> Will update in Draft4

Also please mention what hypercall you would use.

> 
> >>3.1.2 dev_get_dev_node is modified for pci devices.
> >>-------------------------------------------------------------------------
> >>The function is modified to return the dt_node of the pci hostbridge from
> >>the device tree. This is required as non-dt devices need a way to find on
> >>which smmu they are attached.
> >>
> >>static struct arm_smmu_device *find_smmu_for_device(struct device *dev)
> >>{
> >>         struct device_node *dev_node = dev_get_dev_node(dev);
> >>....
> >>
> >>static struct device_node *dev_get_dev_node(struct device *dev)
> >>{
> >>         if (dev_is_pci(dev)) {
> >>                 struct pci_dev *pdev = to_pci_dev(dev);
> >>                 return pci_hostbridge_dt_node(pdev->seg);
> >>         }
> >>...
> >>
> >>
> >>3.2.    Mapping between streamID - deviceID - pci sbdf - requesterID
> >>---------------------------------------------------------------------
> >>For a simpler case all should be equal to BDF. But there are some devices
> >>that
> >>use the wrong requester ID for DMA transactions. Linux kernel has pci quirks
> >>for these. Whether the same can be implemented in Xen or a different approach has to
> >s/pci/PCI/
> >>be
> >>taken is TODO here.
> >>Till that time, for basic implementation it is assumed that all are equal to
> >>BDF.
> >>
> >>
> >>4.    Assignment of PCI device
> >>---------------------------------
> >>
> >>4.1    Dom0
> >>------------
> >>All PCI devices are assigned to dom0 unless hidden by pci-hide bootargs in
> >>dom0.
> >'pci-hide' in dom0? Grepping in Documentation/kernel-parameters.txt I don't
> >see anything.
> %s/pci-hide/pciback.hide/
> >>Dom0 enumerates the PCI devices. For each device the MMIO space has to be
> >>mapped
> >>in the Stage2 translation for dom0. For dom0 xen maps the ranges from dt pci
> >s/xen/Xen/
> >s/pci/PCI/
> >>nodes in stage 2 translation during boot.
> >>4.1.1    Stage 2 Mapping of GITS_ITRANSLATER space (64k)
> >>------------------------------------------------------
> >>
> >>GITS_ITRANSLATER space (64k) must be programmed in Stage2 translation so
> >>that SMMU
> >>can translate MSI(x) from the device using the page table of the domain.
> >>
> >>4.1.1.1 For Dom0
> >>-----------------
> >>GITS_ITRANSLATER address space is mapped 1:1 during dom0 boot. For dom0 this
> >>mapping is done in the vgic driver. For domU the mapping is done by
> >>toolstack.
> >>
> >>4.1.1.2    For DomU
> >>-----------------
> >>For domU, while creating the domain, the toolstack reads the IPA from the
> >>macro GITS_ITRANSLATER_SPACE from xen/include/public/arch-arm.h. The PA is
> >>read from a new hypercall which returns the PA of the
> >>GITS_ITRANSLATER_SPACE.
> >>Subsequently the toolstack sends a hypercall to create a stage 2 mapping.
> >>
> >>Hypercall Details: XEN_DOMCTL_get_itranslater_space
> >>
> >>/* XEN_DOMCTL_get_itranslater_space */
> >>struct xen_domctl_get_itranslater_space {
> >>     /* OUT variables. */
> >>     uint64_aligned_t start_addr;
> >>     uint64_aligned_t size;
> >>};
> >>typedef struct xen_domctl_get_itranslater_space
> >>xen_domctl_get_itranslater_space;
> >>DEFINE_XEN_GUEST_HANDLE(xen_domctl_get_itranslater_space);
> >>
> >>4.2    DomU
> >>------------
> >>There are two ways a device is assigned
> >>In the flow of pci-attach device, the toolstack will read the pci
> >>configuration
> >>space BAR registers. The toolstack has the guest memory map and the
> >>information
> >>of the MMIO holes.
> >>
> >>When the first pci device is assigned to domU, toolstack allocates a virtual
> >s/pci/PCI/
> >
> >first? What about the other ones?
> %s/the first/a/
> Typo
> >
> >>BAR region from the MMIO hole area. toolstack then sends domctl
> >s/sends/invokes/
> >>xc_domain_memory_mapping to map in stage2 translation.
> >What if there are more than one device? How will the MMIO and BAR regions
> >picked? Based on first-come first-serve?
> >>4.2.1    Reserved Areas in guest memory space
> >>--------------------------------------------
> >>Parts of the guest address space is reserved for mapping assigned pci
> >>device's
> >s/pci/PCI/
> >>BAR regions. Toolstack is responsible for allocating ranges from this area
> >>and
> >>creating stage 2 mapping for the domain.
> >>
> >>/* For 32bit */
> >>GUEST_MMIO_BAR_BASE_32, GUEST_MMIO_BAR_SIZE_32
> >>
> >>/* For 64bit */
> >>
> >>GUEST_MMIO_BAR_BASE_64, GUEST_MMIO_BAR_SIZE_64
> in public/arch-arm.h
> 
> /* For 32bit */
> #define GUEST_MMIO_BAR_BASE_32 <<>>
> #define GUEST_MMIO_BAR_SIZE_32 <<>>
> 
> /* For 64bit */
> 
> #define GUEST_MMIO_BAR_BASE_64 <<>>
> #define GUEST_MMIO_BAR_SIZE_64 <<>>
> 
> 
> >Not sure what this means.
> Will add more description.
> The idea is to map the PCI BAR regions into the guest Stage2 translation, so a
> predefined area in the guest address
> space is reserved for this.
> If a BAR region address is 32-bit, the BASE_32 area would be used; otherwise
> the BASE_64 one.

What if you have both? 32-bit and 64-bit?

> >>Note: For 64bit systems, PCI BAR regions should be mapped from
> >>GUEST_MMIO_BAR_BASE_64.
> >>
> >>IPA is allocated from the {GUEST_MMIO_BAR_BASE_64, GUEST_MMIO_BAR_SIZE_64}
> %s/{GUEST_MMIO_BAR_BASE_64, GUEST_MMIO_BAR_SIZE_64}/
> 
> (GUEST_MMIO_BAR_BASE_64 ... GUEST_MMIO_BAR_BASE_64+GUEST_MMIO_BAR_SIZE_64) 
> region
> 
> >>range and PA is the values read from the BAR registers.
> >Is the BAR size dynamic?
> see above
> >What happens when the device is unplugged? And then plugged back in?
> >How do you choose where in the GUEST_MMIO_.. it is going to be in?
> >What is the hypercall you are going to use for unplugging it?
> >
> >>4.2.2    New entries in xenstore for device BARs
> >s/xenstore/XenStore/
> >
> >>-----------------------------------------------
> >>toolstack also updates the xenstore information for the device
> >s/toolstack/Toolstack
> >
> >>(virtualbar:physical bar).This information is read by xenpciback and
> >s/xenpciback/xen-pciback/
> >
> >No segment value?
> Where? I didn't get you.

The Xen PCI back can also deal with segment values (domain).

> >>returned
> >>to the pcifront driver's configuration space reads for the BARs.
> >>
> >>Entries created are as follows:
> >>/local/domain/0/backend/pci/1/0
> >>vdev-N
> >>     BDF = ""
> >>     BAR-0-IPA = ""
> >>     BAR-0-PA = ""
> >>     BAR-0-SIZE = ""
> >>     ...
> >>     BAR-M-IPA = ""
> >>     BAR-M-PA = ""
> >>     BAR-M-SIZE = ""
> >>
> >>Note: Is BAR M SIZE is 0, it is not a valied entry.
> >s/valied/valid/
> >
> >s/Is/If/ ?
> >
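For illustration, a filled-in set of entries for a hypothetical device 0000:00:01.0 with a single 16k BAR might look like the following (every value here is invented, including the addresses):

```
/local/domain/0/backend/pci/1/0
vdev-0
    BDF = "0000:00:01.0"
    BAR-0-IPA = "0x8000000000"
    BAR-0-PA = "0xe0000000"
    BAR-0-SIZE = "0x4000"
```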
> >>4.2.4    Hypercall Modification for bdf mapping notification to xen
> >s/xen/Xen/
> >>-------------------------------------------------------------------
> >>Guest devfn generation currently done by xen-pciback to be done by toolstack
> >>only. Guest devfn is generated at the time of domain creation (if pci
> >>devices
> >>are specified in cfg file) or using xl pci-attach call.
> >What is 'devfn generation'? It sounds to me that you are saying that
> >xen-pciback should follow the XenStore keys and use those.
> Yes, that is what Ian / Julien suggested. x86 should follow the same, as guest
> devfn generation should be in the toolstack and not in pciback.
> >
> >But the title talks about 'hypercall modifications' - while this
> >talks about bdf mapping?
> the xc_assign_device call will include the guest devfn
> >>5. DomU FrontEnd Bus Changes
> >>-------------------------------------------------------------------------------
> >>
> >>5.1    Change in Linux PCI ForntEnd - backend driver for MSI/X programming
> >s/ForntEnd/Frontend/
> >
> >And I would say 'Linux Xen PCI frontend'.
> >
> >>---------------------------------------------------------------------------
> >>FrontEnd backend communication for MSI is removed in XEN ARM. It would be
> >>handled by the gic-its driver in guest kernel and trapped in xen.
> >s/xen/Xen/
> >
> >s/removed/disabled/
> >
> >>5.2    Frontend bus and interrupt parent vITS
> >>-----------------------------------------------
> >>On the Pci frontend bus msi-parent gicv3-its is added. As there is a single
> >s/Pci/PCI/
> >
> >>virtual its for a domU, as there is only a single virtual pci bus in domU.
> >its?
> >ITS perhaps?
> >
> >We could have multiple segments too in Xen pci-frontend..
> >
> >>This
> >>ensures that the config_msi calls are handled by the gicv3 its driver in
> >s/its/ITS/
> >s/gicv3/GICV3/
> >
> >>domU
> >>kernel and not utilising frontend-backend communication between dom0-domU.
> >utilising? Utilizing.
> >
> >>It is required to have a gicv3-its node in guest device tree.
> >OK, you totally lost me. You said earlier that we do not want to use
> >Xen pcifrontend for MSI. But here you talk about 'PCI frontend'? So
> >what is it?
> The PCI frontend bus is a virtual bus in domU on which assigned devices are
> enumerated, while the frontend-backend communication is limited to config
> space access.

It can also do MSI and MSI-X.
> >
> >And how do you keep the vITS segment:bus:devfn mapping in sync
> >with Xen PCI backend? I presume you need to update the vITS in
> >the hypervisor with the proper segment:bus:devfn values?
> I will add a reference to the vITS design.
> see above. assign_device will have a guest devfn.
> >Is there an hypercall for that?
> We earlier had a hypercall map_sbdf, but removed it after adding the guest
> devfn to the assign_device call.


However I don't see anything in xen_domctl_assign_device mentioning
the guest sbdf. What if you want the sbdfs in the guest to start at
a different segment or bus than they do on the physical machine?

> >>6.    NUMA domU and vITS
> >>--------------------------
> >>a) On NUMA systems domU still have a single its node.
> >s/its/ITS/
> >
> >>b) How can xen identify the ITS on which a device is connected.
> >s/xen/Xen/
> >
> >>- Using segment number query using api which gives pci host controllers
> >>device node
> >s/api/API/
> >s/pci/PCI/
> >
> >Which is ? I only see one hypercall mentioned here.
> >
> >>struct dt_device_node* pci_hostbridge_dt_node(uint32_t segno)
> >Oh, this is INTERNAL to the hypervisor. Sorry, you lost me a bit
> >with the domU part so I thought it meant the domU should be able
> >to query it.
> I will add a bit more of description in Draft 4 .
> >>c) Query the interrupt parent of the pci device node to find out the its.
> >>
> >s/its/ITS/
> >
> >?
> >>_______________________________________________
> >>Xen-devel mailing list
> >>Xen-devel@xxxxxxxxxxxxx
> >>http://lists.xen.org/xen-devel
> 


 

