
Re: [Xen-devel] PCI Passthrough Design - Draft 3

To: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>
From: Manish Jaggi <mjaggi@xxxxxxxxxxxxxxxxxx>
Date: Wed, 12 Aug 2015 13:03:07 +0530
Cc: "Prasun.kapoor@xxxxxxxxxx" <Prasun.kapoor@xxxxxxxxxx>, Ian Campbell <Ian.Campbell@xxxxxxxxxx>, Stefano Stabellini <stefano.stabellini@xxxxxxxxxxxxx>, "Kumar, Vijaya" <Vijaya.Kumar@xxxxxxxxxxxxxxxxxx>, Julien Grall <julien.grall@xxxxxxxxxx>, Xen Devel <xen-devel@xxxxxxxxxxxxx>
Delivery-date: Wed, 12 Aug 2015 07:33:15 +0000
List-id: Xen developer discussion <xen-devel.lists.xen.org>

My replies to the comments are below. I will also send a Draft 4 that takes them into account.

On Wednesday 12 August 2015 02:04 AM, Konrad Rzeszutek Wilk wrote:

On Tue, Aug 04, 2015 at 05:57:24PM +0530, Manish Jaggi wrote:

             -----------------------------
            | PCI Pass-through in Xen ARM |
             -----------------------------
            manish.jaggi@xxxxxxxxxxxxxxxxxx
            -------------------------------

                     Draft-3
...
[snip]
2.2    PHYSDEVOP_pci_host_bridge_add hypercall
----------------------------------------------
Xen code accesses PCI configuration space based on the sbdf received from the
guest. The order in which the pci device tree nodes appear may not be the same
as the order of device enumeration in dom0. Thus there needs to be a mechanism
to bind the segment number assigned by dom0 to the pci host controller. The
hypercall is introduced:

Why can't we extend the existing hypercall to have the segment value?

Oh wait, PHYSDEVOP_manage_pci_add_ext does it already!

It doesn't pass the cfg_base and size to Xen.


And have the hypercall (and Xen) be able to deal with introduction of PCI
devices that are out of sync?

Maybe I am confused but aren't PCI host controllers also 'uploaded' to
Xen?

I need to add one more line here to be more descriptive. The binding is between the segment number (the domain number in Linux)
used by dom0 and the PCI config space address in the reg property of the PCI node in the device tree.
The hypercall was introduced to cater for the fact that dom0 may process the PCI nodes in the device tree in any order.
This binding gives a clear ABI.

#define PHYSDEVOP_pci_host_bridge_add    44
struct physdev_pci_host_bridge_add {
    /* IN */
    uint16_t seg;
    uint64_t cfg_base;
    uint64_t cfg_size;
};

This hypercall is invoked before dom0 invokes the PHYSDEVOP_pci_device_add
hypercall. The handler code invokes the following to update the segment
number in pci_hostbridge:

int pci_hostbridge_setup(uint32_t segno, uint64_t cfg_base, uint64_t cfg_size);

Subsequent calls to pci_conf_read/write are completed by the pci_hostbridge_ops
of the respective pci_hostbridge.
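
To make the binding concrete, here is a minimal sketch of how a handler could tie the dom0 segment number to a host bridge by matching cfg_base and cfg_size. Everything prefixed with sketch_ (the structure, the fixed-size table, the register and find helpers) is a hypothetical illustration and not the draft's or Xen's actual code; only the (segno, cfg_base, cfg_size) parameter list comes from the prototype above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified bridge record; the field names are assumptions. */
struct sketch_hostbridge {
    uint64_t cfg_base;  /* config space base, from the DT node's reg property */
    uint64_t cfg_size;  /* config space size, from the same property */
    uint16_t seg;       /* segment number chosen by dom0 */
    int      bound;     /* non-zero once dom0 has supplied a segment */
};

#define SKETCH_MAX_BRIDGES 4
static struct sketch_hostbridge bridges[SKETCH_MAX_BRIDGES];
static size_t nr_bridges;

/* Record a bridge found while walking the device tree (before dom0 runs). */
static int sketch_hostbridge_register(uint64_t cfg_base, uint64_t cfg_size)
{
    assert(cfg_size != 0);
    if (nr_bridges == SKETCH_MAX_BRIDGES)
        return -1;
    bridges[nr_bridges].cfg_base = cfg_base;
    bridges[nr_bridges].cfg_size = cfg_size;
    bridges[nr_bridges].seg = 0;
    bridges[nr_bridges].bound = 0;
    nr_bridges++;
    return 0;
}

/* Handler side of the hypercall: bind segno to the bridge at cfg_base. */
static int sketch_hostbridge_setup(uint32_t segno, uint64_t cfg_base,
                                   uint64_t cfg_size)
{
    size_t i;

    if (segno > UINT16_MAX)
        return -1;
    for (i = 0; i < nr_bridges; i++) {
        if (bridges[i].cfg_base == cfg_base &&
            bridges[i].cfg_size == cfg_size) {
            bridges[i].seg = (uint16_t)segno;
            bridges[i].bound = 1;
            return 0;
        }
    }
    return -1;  /* no device tree node matches this config space */
}

/* Lookup used by later config space accesses keyed on the segment in sbdf. */
static struct sketch_hostbridge *sketch_hostbridge_find(uint16_t seg)
{
    size_t i;

    for (i = 0; i < nr_bridges; i++)
        if (bridges[i].bound && bridges[i].seg == seg)
            return &bridges[i];
    return NULL;
}
```

With this shape, dom0 can call the setup routine for the bridges in any order, and a later lookup by segment still lands on the bridge whose config space address matches.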

This design sounds like it was added to deal with having to pre-allocate the
host controller structures before the PCI devices start streaming in?

Instead of having the PCI devices and PCI host controllers be updated
as they are coming in?

Why can't the second option be done?

If you are referring to ACPI, we still have to add that support.
PCI host controllers are PCI nodes in the device tree.

2.3    Helper Functions
------------------------
a) pci_hostbridge_dt_node(pdev->seg);
Returns the device tree node pointer of the pci node from which the pdev got
enumerated.

3.    SMMU programming
-------------------

3.1.    Additions for PCI Passthrough
-----------------------------------
3.1.1 - add_device in iommu_ops is implemented.

This is called when PHYSDEVOP_pci_add_device is called from dom0.

Or for PHYSDEVOP_manage_pci_add_ext ?

Not sure, but it seems logical for this one as well.

.add_device = arm_smmu_add_dom0_dev,
static int arm_smmu_add_dom0_dev(u8 devfn, struct device *dev)
{
        if (dev_is_pci(dev)) {
            struct pci_dev *pdev = to_pci_dev(dev);
            return arm_smmu_assign_dev(pdev->domain, devfn, dev);
        }
        return -1;
}

What about removal?

What if the device is removed (hot-unplugged)?

.remove_device = arm_smmu_remove_device would be called.
I will update this in Draft 4.

3.1.2 dev_get_dev_node is modified for pci devices.
-------------------------------------------------------------------------
The function is modified to return the dt_node of the pci hostbridge from
the device tree. This is required as non-dt devices need a way to find on
which smmu they are attached.

static struct arm_smmu_device *find_smmu_for_device(struct device *dev)
{
        struct device_node *dev_node = dev_get_dev_node(dev);
....

static struct device_node *dev_get_dev_node(struct device *dev)
{
        if (dev_is_pci(dev)) {
                struct pci_dev *pdev = to_pci_dev(dev);
                return pci_hostbridge_dt_node(pdev->seg);
        }
...


3.2.    Mapping between streamID - deviceID - pci sbdf - requesterID
---------------------------------------------------------------------
For a simpler case all should be equal to BDF. But there are some devices
that
use the wrong requester ID for DMA transactions. Linux kernel has pci quirks
for these. How the same be implemented in Xen or a diffrent approach has to

s/pci/PCI/

be
taken is TODO here.
Till that time, for basic implementation it is assumed that all are equal to
BDF.
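
As a small illustration of the "all equal to BDF" assumption, the sketch below packs bus, device and function into the standard 16-bit BDF layout (bus in bits 15:8, device in 7:3, function in 2:0) and returns it unchanged as the stream ID and device ID. The function names are hypothetical, and the identity mapping only models the basic case described above; it does not handle devices with requester ID quirks.

```c
#include <assert.h>
#include <stdint.h>

/* Pack bus/device/function into a 16-bit BDF: bus[15:8] dev[7:3] fn[2:0]. */
static uint16_t sketch_pci_bdf(uint8_t bus, uint8_t dev, uint8_t fn)
{
    assert(dev < 32 && fn < 8);
    return (uint16_t)((bus << 8) | (dev << 3) | fn);
}

/* Basic-case assumption from the draft: stream ID equals the BDF. */
static uint32_t sketch_stream_id(uint16_t bdf)
{
    return bdf;
}

/* Basic-case assumption from the draft: ITS device ID equals the BDF. */
static uint32_t sketch_device_id(uint16_t bdf)
{
    return bdf;
}
```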


4.    Assignment of PCI device
---------------------------------

4.1    Dom0
------------
All PCI devices are assigned to dom0 unless hidden by pci-hide bootargs in
dom0.

'pci-hide' in dom0? Grepping in Documentation/kernel-parameters.txt I don't
see anything.

%s/pci-hide/pciback.hide/

Dom0 enumerates the PCI devices. For each device the MMIO space has to be
mapped
in the Stage2 translation for dom0. For dom0 xen maps the ranges from dt pci

s/xen/Xen/
s/pci/PCI/

nodes in stage 2 translation during boot.

4.1.1    Stage 2 Mapping of GITS_ITRANSLATER space (64k)
------------------------------------------------------

GITS_ITRANSLATER space (64k) must be programmed in Stage2 translation so
that SMMU
can translate MSI(x) from the device using the page table of the domain.

4.1.1.1 For Dom0
-----------------
GITS_ITRANSLATER address space is mapped 1:1 during dom0 boot. For dom0 this
mapping is done in the vgic driver. For domU the mapping is done by
toolstack.

4.1.1.2    For DomU
-----------------
For domU, while creating the domain, the toolstack reads the IPA from the
macro GITS_ITRANSLATER_SPACE from xen/include/public/arch-arm.h. The PA is
read from a new hypercall which returns the PA of the
GITS_ITRANSLATER_SPACE.
Subsequently the toolstack sends a hypercall to create a stage 2 mapping.

Hypercall Details: XEN_DOMCTL_get_itranslater_space

/* XEN_DOMCTL_get_itranslater_space */
struct xen_domctl_get_itranslater_space {
    /* OUT variables. */
    uint64_aligned_t start_addr;
    uint64_aligned_t size;
};
typedef struct xen_domctl_get_itranslater_space xen_domctl_get_itranslater_space;
DEFINE_XEN_GUEST_HANDLE(xen_domctl_get_itranslater_space);

4.2    DomU
------------
There are two ways a device is assigned
In the flow of pci-attach device, the toolstack will read the pci
configuration
space BAR registers. The toolstack has the guest memory map and the
information
of the MMIO holes.

When the first pci device is assigned to domU, toolstack allocates a virtual

s/pci/PCI/

first? What about the other ones?

%s/the first/a/
Typo

BAR region from the MMIO hole area. toolstack then sends domctl

s/sends/invokes/

xc_domain_memory_mapping to map in stage2 translation.

What if there is more than one device? How will the MMIO and BAR regions be
picked? On a first-come, first-served basis?

4.2.1    Reserved Areas in guest memory space
--------------------------------------------
Parts of the guest address space is reserved for mapping assigned pci
device's

s/pci/PCI/

BAR regions. Toolstack is responsible for allocating ranges from this area
and
creating stage 2 mapping for the domain.

/* For 32bit */
GUEST_MMIO_BAR_BASE_32, GUEST_MMIO_BAR_SIZE_32

/* For 64bit */

GUEST_MMIO_BAR_BASE_64, GUEST_MMIO_BAR_SIZE_64

in public/arch-arm.h

/* For 32bit */
#define GUEST_MMIO_BAR_BASE_32 <<>>
#define GUEST_MMIO_BAR_SIZE_32 <<>>

/* For 64bit */

#define GUEST_MMIO_BAR_BASE_64 <<>>
#define GUEST_MMIO_BAR_SIZE_64 <<>>

Not sure what this means.

Will add more description.
The idea is to map the PCI BAR regions into the guest Stage 2 translation, so a predefined area in the guest address
space is reserved for this.
If a BAR is 32-bit, the BASE_32 area is used; otherwise the BASE_64 area is used.
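
One way the toolstack could hand out IPAs from the two reserved areas is a simple first-come, first-served bump allocator, sketched below. This is only an assumption about a possible policy, not something the draft specifies. The SKETCH_ base and size values are made-up placeholders (the draft leaves the real macros as <<>>), and the allocator does not deal with freeing on unplug.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder values only: the draft leaves the real macros undefined. */
#define SKETCH_BAR_BASE_32 0x20000000ULL
#define SKETCH_BAR_SIZE_32 0x10000000ULL
#define SKETCH_BAR_BASE_64 0x4000000000ULL
#define SKETCH_BAR_SIZE_64 0x1000000000ULL

struct sketch_bar_window {
    uint64_t base;  /* first IPA of the reserved area */
    uint64_t size;  /* length of the reserved area */
    uint64_t next;  /* first IPA not yet handed out */
};

static struct sketch_bar_window win32 = {
    SKETCH_BAR_BASE_32, SKETCH_BAR_SIZE_32, SKETCH_BAR_BASE_32
};
static struct sketch_bar_window win64 = {
    SKETCH_BAR_BASE_64, SKETCH_BAR_SIZE_64, SKETCH_BAR_BASE_64
};

/*
 * Pick an IPA for one BAR: 32-bit BARs come from the 32-bit area, 64-bit
 * BARs from the 64-bit area. Returns 0 when the area is exhausted.
 * bar_size must be a power of two, as PCI BAR sizes are.
 */
static uint64_t sketch_bar_alloc_ipa(int is_64bit, uint64_t bar_size)
{
    struct sketch_bar_window *w = is_64bit ? &win64 : &win32;
    uint64_t ipa;

    assert(bar_size != 0 && (bar_size & (bar_size - 1)) == 0);
    ipa = (w->next + bar_size - 1) & ~(bar_size - 1);  /* natural alignment */
    if (ipa + bar_size > w->base + w->size)
        return 0;
    w->next = ipa + bar_size;
    return ipa;
}
```

The returned IPA would then be paired with the PA read from the BAR register when the stage 2 mapping is created.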

Note: For 64bit systems, PCI BAR regions should be mapped from
GUEST_MMIO_BAR_BASE_64.

IPA is allocated from the {GUEST_MMIO_BAR_BASE_64, GUEST_MMIO_BAR_SIZE_64}

%s/{GUEST_MMIO_BAR_BASE_64, GUEST_MMIO_BAR_SIZE_64}/(GUEST_MMIO_BAR_BASE_64 ... GUEST_MMIO_BAR_BASE_64+GUEST_MMIO_BAR_SIZE_64) region/

range and PA is the values read from the BAR registers.

Is the BAR size dynamic?

see above

What happens when the device is unplugged? And then plugged back in?
How do you choose where in the GUEST_MMIO_.. it is going to be in?
What is the hypercall you are going to use for unplugging it?

4.2.2    New entries in xenstore for device BARs

s/xenstore/XenStore/

-----------------------------------------------
toolstack also updates the xenstore information for the device

s/toolstack/Toolstack

(virtualbar:physical bar).This information is read by xenpciback and

s/xenpciback/xen-pciback/

No segment value?

Where? I didn't get you.

returned
to the pcifront driver configuration space reads for BAR.

Entries created are as follows:
/local/domain/0/backend/pci/1/0
vdev-N
    BDF = ""
    BAR-0-IPA = ""
    BAR-0-PA = ""
    BAR-0-SIZE = ""
    ...
    BAR-M-IPA = ""
    BAR-M-PA = ""
    BAR-M-SIZE = ""

Note: Is BAR M SIZE is 0, it is not a valied entry.

s/valied/valid/

s/Is/If/ ?
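
For illustration, the sketch below formats the three per-BAR keys listed above and skips a BAR whose size is 0. The hexadecimal value format, the "vdev-N/" key prefix and the helper name are assumptions: the draft shows the values only as empty strings and does not say how they are encoded. The function only builds text and does not call any XenStore API.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* One BAR as the toolstack would record it (hypothetical structure). */
struct sketch_bar {
    uint64_t ipa;   /* guest address chosen from the reserved area */
    uint64_t pa;    /* physical address read from the BAR register */
    uint64_t size;  /* 0 means the BAR is not a valid entry */
};

/*
 * Write the three keys for one BAR into buf. Returns 0 if the BAR is
 * skipped because its size is 0, otherwise the snprintf() result.
 */
static int sketch_format_bar(char *buf, size_t len, unsigned int vdev,
                             unsigned int bar, const struct sketch_bar *e)
{
    assert(buf != NULL && e != NULL);
    if (e->size == 0)
        return 0;
    return snprintf(buf, len,
                    "vdev-%u/BAR-%u-IPA = \"0x%" PRIx64 "\"\n"
                    "vdev-%u/BAR-%u-PA = \"0x%" PRIx64 "\"\n"
                    "vdev-%u/BAR-%u-SIZE = \"0x%" PRIx64 "\"\n",
                    vdev, bar, e->ipa, vdev, bar, e->pa, vdev, bar, e->size);
}
```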

4.2.4    Hypercall Modification for bdf mapping notification to xen

s/xen/Xen/

-------------------------------------------------------------------
Guest devfn generation currently done by xen-pciback to be done by toolstack
only. Guest devfn is generated at the time of domain creation (if pci
devices
are specified in cfg file) or using xl pci-attach call.

What is 'devfn generation'? It sounds to me that you are saying that
xen-pciback should follow the XenStore keys and use those.

Yes, that is what Ian / Julien suggested. x86 should follow the same approach, as guest devfn generation should be
done in the toolstack and not in pciback.


But the title talks about 'hypercall modifications' - while this
talks about bdf mapping?

The xc_assign_device call will include the guest devfn.

5. DomU FrontEnd Bus Changes
-------------------------------------------------------------------------------

5.1    Change in Linux PCI ForntEnd - backend driver for MSI/X programming

s/ForntEnd/Frontend/

And I would say 'Linux Xen PCI frontend'.

---------------------------------------------------------------------------
FrontEnd backend communication for MSI is removed in XEN ARM. It would be
handled by the gic-its driver in guest kernel and trapped in xen.

s/xen/Xen/

s/removed/disabled/

5.2    Frontend bus and interrupt parent vITS
-----------------------------------------------
On the Pci frontend bus msi-parent gicv3-its is added. As there is a single

s/Pci/PCI/

virtual its for a domU, as there is only a single virtual pci bus in domU.

its?
ITS perhaps?

We could have multiple segments too in Xen pci-frontend..

This
ensures that the config_msi calls are handled by the gicv3 its driver in

s/its/ITS/
s/gicv3/GICV3/

domU
kernel and not utilising frontend-backend communication between dom0-domU.

utilising? Utilizing.

It is required to have a gicv3-its node in guest device tree.

OK, you totally lost me. You said earlier that we do not want to use
Xen pcifrontend for MSI. But here you talk about 'PCI frontend'? So
what is it?

The PCI frontend bus is a virtual bus in domU on which assigned devices are enumerated.
The PCI frontend-backend communication, however, is limited to config space access.


And how do you keep the vITS segment:bus:devfn mapping in sync
with Xen PCI backend? I presume you need to update the vITS in
the hypervisor with the proper segment:bus:devfn values?

I will add a reference to the vITS design.
See above: assign_device will carry the guest devfn.

Is there an hypercall for that?

We earlier had a map_sbdf hypercall, but removed it because the guest devfn was added to the assign_device call.

6.    NUMA domU and vITS
--------------------------
a) On NUMA systems domU still have a single its node.

s/its/ITS/

b) How can xen identify the ITS on which a device is connected.

s/xen/Xen/

- Using segment number query using api which gives pci host controllers
device node

s/api/API/
s/pci/PCI/

Which is ? I only see one hypercall mentioned here.

struct dt_device_node* pci_hostbridge_dt_node(uint32_t segno)

Oh, this is INTERNAL to the hypervisor. Sorry, you lost me a bit
with the domU part so I thought it meant the domU should be able
to query it.

I will add a bit more description in Draft 4.

c) Query the interrupt parent of the pci device node to find out the its.

s/its/ITS/

?

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel

