
Re: [Xen-devel] [RFC PATCH 06/24] ARM: GICv3 ITS: introduce host LPI array



On Thu, 10 Nov 2016, Andre Przywara wrote:
> Hi,
> 
> On 27/10/16 23:59, Stefano Stabellini wrote:
> > On Wed, 28 Sep 2016, Andre Przywara wrote:
> >> The number of LPIs on a host can be potentially huge (millions),
> >> although in practice it will mostly be reasonable. So prematurely allocating
> >> an array of struct irq_desc's for each LPI is not an option.
> >> However Xen itself does not care about LPIs, as every LPI will be injected
> >> into a guest (Dom0 for now).
> >> Create a dense data structure (8 bytes) for each LPI which holds just
> >> enough information to determine the virtual IRQ number and the VCPU into
> >> which the LPI needs to be injected.
> >> Also, to avoid artificially limiting the number of LPIs, we create a 2-level
> >> table for holding those structures.
> >> This patch introduces functions to initialize these tables and to
> >> create, lookup and destroy entries for a given LPI.
> >> We allocate and access LPI information in a way that does not require
> >> a lock.
> >>
> >> Signed-off-by: Andre Przywara <andre.przywara@xxxxxxx>
> >> ---
> >>  xen/arch/arm/gic-its.c        | 154 ++++++++++++++++++++++++++++++++++++++++++
> >>  xen/include/asm-arm/gic-its.h |  18 +++++
> >>  2 files changed, 172 insertions(+)
> >>
> >> diff --git a/xen/arch/arm/gic-its.c b/xen/arch/arm/gic-its.c
> >> index 88397bc..2140e4a 100644
> >> --- a/xen/arch/arm/gic-its.c
> >> +++ b/xen/arch/arm/gic-its.c
> >> @@ -18,18 +18,31 @@
> >>  
> >>  #include <xen/config.h>
> >>  #include <xen/lib.h>
> >> +#include <xen/sched.h>
> >>  #include <xen/err.h>
> >>  #include <xen/device_tree.h>
> >>  #include <xen/libfdt/libfdt.h>
> >>  #include <asm/p2m.h>
> >> +#include <asm/domain.h>
> >>  #include <asm/io.h>
> >>  #include <asm/gic.h>
> >>  #include <asm/gic_v3_defs.h>
> >>  #include <asm/gic-its.h>
> >>  
> >> +/* LPIs on the host always go to a guest, so no struct irq_desc for them. */
> >> +union host_lpi {
> >> +    uint64_t data;
> >> +    struct {
> >> +        uint64_t virt_lpi:32;
> >> +        uint64_t dom_id:16;
> >> +        uint64_t vcpu_id:16;
> >> +    };
> >> +};
> > 
> > Why not the following?
> > 
> >   union host_lpi {
> >       uint64_t data;
> >       struct {
> >           uint32_t virt_lpi;
> >           uint16_t dom_id;
> >           uint16_t vcpu_id;
> >       };
> >   };
> 
> I am not sure that gives me a guarantee of stuffing everything into a
> u64 (as per the C standard). It probably will on arm64 with gcc, but I
> thought better safe than sorry.

I am pretty sure that it is covered by the standard, also see
IHI0055A_aapcs64. Additionally I don't think the union with "data" is
actually required either.
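
For what it's worth, the layout question can be settled at compile time
either way. A minimal sketch, assuming Xen's BUILD_BUG_ON placed in any
function (say, an init function):

    union host_lpi {
        uint64_t data;
        struct {
            uint32_t virt_lpi;
            uint16_t dom_id;
            uint16_t vcpu_id;
        };
    };

    /* Catch unexpected padding: the union must stay 8 bytes so that
     * accesses to .data remain single naturally-aligned loads/stores. */
    BUILD_BUG_ON(sizeof(union host_lpi) != sizeof(uint64_t));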


> >>  /* Global state */
> >>  static struct {
> >>      uint8_t *lpi_property;
> >> +    union host_lpi **host_lpis;
> >>      int host_lpi_bits;
> >>  } lpi_data;
> >>  
> >> @@ -43,6 +56,26 @@ static DEFINE_PER_CPU(void *, pending_table);
> >>  #define MAX_HOST_LPI_BITS                                                \
> >>          min_t(unsigned int, lpi_data.host_lpi_bits, CONFIG_HOST_LPI_BITS)
> >>  #define MAX_HOST_LPIS   (BIT(MAX_HOST_LPI_BITS) - 8192)
> >> +#define HOST_LPIS_PER_PAGE      (PAGE_SIZE / sizeof(union host_lpi))
> >> +
> >> +static union host_lpi *gic_find_host_lpi(uint32_t lpi, struct domain *d)
> > 
> > I take "lpi" is the physical lpi here. Maybe we would rename it to "plpi"
> > for clarity.
> 
> Indeed.
> 
> > 
> >> +{
> >> +    union host_lpi *hlpi;
> >> +
> >> +    if ( lpi < 8192 || lpi >= MAX_HOST_LPIS + 8192 )
> >> +        return NULL;
> >> +
> >> +    lpi -= 8192;
> >> +    if ( !lpi_data.host_lpis[lpi / HOST_LPIS_PER_PAGE] )
> >> +        return NULL;
> >> +
> >> +    hlpi = &lpi_data.host_lpis[lpi / HOST_LPIS_PER_PAGE][lpi % HOST_LPIS_PER_PAGE];
> > 
> > I realize I am sometimes obsessive about this, but division operations
> > are expensive and this is on the hot path, so I would do:
> > 
> > #define HOST_LPIS_PER_PAGE      (PAGE_SIZE >> 3)
> 
> to replace
> #define HOST_LPIS_PER_PAGE      (PAGE_SIZE / sizeof(union host_lpi))?
> 
> This should be computed by the compiler, as it's constant.
> 
> > unsigned int table = lpi / HOST_LPIS_PER_PAGE;
> 
> So I'd rather replace this by ">> (PAGE_SHIFT - 3)".

This is actually what I meant, thanks.


> But again the compiler would do this for us, as replacing a "constant
> division by a power of two" with a "right shift" is a textbook example
> of an easy optimization, if I remember my compiler class at uni correctly ;-)

Yet, we found instances where this didn't happen in the common Xen
scheduler code on x86.


> > then use table throughout this function.
> 
> I see your point (though this is ARMv8, which always has udiv).
> But to prove your paranoia wrong: I don't see any divisions in the
> disassembly, but an lsr #3 and an lsr #9 and various other clever and
> cheap ARMv8 instructions ;-)
> Compilers have really come a long way in 2016 ...

Fair enough, thanks for checking. That is enough for me.
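
For reference, this is roughly how the lookup reads with the shift
spelled out (HOST_LPI_SHIFT is an illustrative name, not from the patch;
with 4K pages, PAGE_SHIFT - 3 == 9, i.e. 512 eight-byte entries per page):

    #define HOST_LPI_SHIFT    (PAGE_SHIFT - 3)   /* log2(entries per page) */

    unsigned int table = lpi >> HOST_LPI_SHIFT;
    unsigned int entry = lpi & (BIT(HOST_LPI_SHIFT) - 1);

    hlpi = &lpi_data.host_lpis[table][entry];

As the disassembly shows, the compiler emits the same code for the
division form anyway.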


> >> +    if ( d && hlpi->dom_id != d->domain_id )
> >> +        return NULL;
> > 
> > I think this function is very useful so I would avoid making any domain
> > checks here: one day we might want to retrieve hlpi even if hlpi->dom_id
> > != d->domain_id. I would move the domain check outside.
> 
> That's why I have "d && ..." in front. If you pass in NULL for the
> domain, it will skip this check. That saves us from coding the check in
> every caller.
> Is that not good enough?

There is a simple solution to this: write two functions, one without the
check and a wrapper to it with the check.
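
Something like this, as a sketch (gic_get_host_lpi is an illustrative
name, and the bodies just rearrange the code quoted above):

    /* Plain lookup: bounds check and 2-level walk, no domain check. */
    static union host_lpi *gic_get_host_lpi(uint32_t plpi)
    {
        if ( plpi < 8192 || plpi >= MAX_HOST_LPIS + 8192 )
            return NULL;

        plpi -= 8192;
        if ( !lpi_data.host_lpis[plpi / HOST_LPIS_PER_PAGE] )
            return NULL;

        return &lpi_data.host_lpis[plpi / HOST_LPIS_PER_PAGE]
                                  [plpi % HOST_LPIS_PER_PAGE];
    }

    /* Wrapper adding the owning-domain check, for callers that want it. */
    static union host_lpi *gic_find_host_lpi(uint32_t plpi, struct domain *d)
    {
        union host_lpi *hlpi = gic_get_host_lpi(plpi);

        if ( hlpi && hlpi->dom_id != d->domain_id )
            return NULL;

        return hlpi;
    }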


> > 
> >> +    return hlpi;
> >> +}
> >>  
> >>  #define ITS_COMMAND_SIZE        32
> >>  
> >> @@ -96,6 +129,33 @@ static int its_send_cmd_sync(struct host_its *its, int cpu)
> >>      return its_send_command(its, cmd);
> >>  }
> >>  
> >> +static int its_send_cmd_discard(struct host_its *its,
> >> +                                uint32_t deviceid, uint32_t eventid)
> >> +{
> >> +    uint64_t cmd[4];
> >> +
> >> +    cmd[0] = GITS_CMD_DISCARD | ((uint64_t)deviceid << 32);
> >> +    cmd[1] = eventid;
> >> +    cmd[2] = 0x00;
> >> +    cmd[3] = 0x00;
> >> +
> >> +    return its_send_command(its, cmd);
> >> +}
> >> +
> >> +static int its_send_cmd_mapti(struct host_its *its,
> >> +                              uint32_t deviceid, uint32_t eventid,
> >> +                              uint32_t pintid, uint16_t icid)
> >> +{
> >> +    uint64_t cmd[4];
> >> +
> >> +    cmd[0] = GITS_CMD_MAPTI | ((uint64_t)deviceid << 32);
> >> +    cmd[1] = eventid | ((uint64_t)pintid << 32);
> >> +    cmd[2] = icid;
> >> +    cmd[3] = 0x00;
> >> +
> >> +    return its_send_command(its, cmd);
> >> +}
> >> +
> >>  static int its_send_cmd_mapc(struct host_its *its, int collection_id, int cpu)
> >>  {
> >>      uint64_t cmd[4];
> >> @@ -330,15 +390,109 @@ uint64_t gicv3_lpi_get_proptable()
> >>      return reg;
> >>  }
> >>  
> >> +/* Allocate the 2nd level array for host LPIs. This one holds pointers
> >> + * to the page with the actual "union host_lpi" entries. Our LPI limit
> >> + * avoids excessive memory usage.
> >> + */
> >>  int gicv3_lpi_init_host_lpis(int lpi_bits)
> >>  {
> >> +    int nr_lpi_ptrs;
> >> +
> >>      lpi_data.host_lpi_bits = lpi_bits;
> >>  
> >> +    nr_lpi_ptrs = MAX_HOST_LPIS / (PAGE_SIZE / sizeof(union host_lpi));
> >> +
> >> +    lpi_data.host_lpis = xzalloc_array(union host_lpi *, nr_lpi_ptrs);
> >> +    if ( !lpi_data.host_lpis )
> >> +        return -ENOMEM;
> > 
> > Why are we not allocating the 2nd level right away? To save memory? If
> > so, I would like some numbers in a real use case scenario written either
> > here on in the commit message.
> 
> LPIs can be allocated sparsely. Each LPI uses 8 bytes, and chances are we
> never use more than a few dozen on a real system, so we just use two
> pages with this scheme.
> 
> Allocating memory for all 1 << 20 LPIs (the default) takes 8 MB (probably for
> nothing); extending this to 24 bits uses 128 MB already.
> The problem is that Xen cannot know how many LPIs Dom0 will use, so I'd
> rather make this number generous here - hence the allocation scheme.
> 
> Not sure if this is actually overkill or paranoid and we would get away
> with a much smaller single level allocation, driven by a config option
> or runtime parameter, though.

All right. Please write an in-code comment explaining this reasoning,
with a sample number of LPIs used by Dom0 in a real-world scenario.
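
For reference, the rough numbers behind the 2-level scheme (assuming 4K
pages and the 8-byte entry, i.e. 512 entries per 2nd-level page):

    1st level, 20 LPI bits: (1 << 20) / 512 =  2048 pointers =  16 KB
    1st level, 24 LPI bits: (1 << 24) / 512 = 32768 pointers = 256 KB
    2nd level: one 4K page per 512 allocated LPIs, so a Dom0 that maps
    only a few dozen LPIs touches one or two pages instead of 8-128 MB.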


> >>      printk("GICv3: using at most %ld LPIs on the host.\n", MAX_HOST_LPIS);
> >>  
> >>      return 0;
> >>  }
> >>  
> >> +/* Allocates a new host LPI to be injected as "virt_lpi" into the specified
> >> + * VCPU. Returns the host LPI ID or a negative error value.
> >> + */
> >> +int gicv3_lpi_allocate_host_lpi(struct host_its *its,
> >> +                                uint32_t devid, uint32_t eventid,
> >> +                                struct vcpu *v, int virt_lpi)
> >> +{
> >> +    int chunk, i;
> >> +    union host_lpi hlpi, *new_chunk;
> >> +
> >> +    /* TODO: handle some kind of preassigned LPI mapping for DomUs */
> >> +    if ( !its )
> >> +        return -EPERM;
> >> +
> >> +    /* TODO: This could be optimized by storing some "next available" hint and
> >> +     * only iterate if this one doesn't work. But this function should be
> >> +     * called rarely.
> >> +     */
> > 
> > Yes please. Even a trivial pointer to last would be far better than this.
> > It would be nice to run some numbers and prove that in realistic
> > scenarios finding an empty plpi doesn't take more than 5-10 ops, which
> > should be the case unless we have to wrap around and the initial chunks
> > are still fully populated, causing Xen to scan 512 entries at a time.
> > We definitely want to avoid that, except in rare worst-case scenarios.
> 
> I can try, though keep in mind that this code is really only called on
> allocating a host LPI, which would only happen when an LPI gets mapped.
> And this is done only upon a Dom0 driver initializing a device. Normally
> you wouldn't expect this during the actual guest runtime.

It would happen during actual guest runtime with device assignment,
wouldn't it?

A simple pointer to last would be a good start, plus an in-code comment
about how many times we loop to find an empty plpi in the normal case (no
device assignment).
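
A minimal sketch of what I mean, assuming an all-zero union host_lpi
marks a free slot (next_free_chunk and the function name are
illustrative, not from the patch):

    /* Hint: 2nd-level page in which the last allocation succeeded. */
    static unsigned int next_free_chunk;

    static int find_free_host_lpi(void)
    {
        const unsigned int nr_chunks = MAX_HOST_LPIS / HOST_LPIS_PER_PAGE;
        unsigned int n, i, chunk = next_free_chunk;

        /* Start where we last succeeded and wrap around at most once. */
        for ( n = 0; n < nr_chunks; n++, chunk = (chunk + 1) % nr_chunks )
        {
            union host_lpi *page = lpi_data.host_lpis[chunk];

            if ( !page )    /* 2nd level not allocated yet: allocate it
                             * and take slot 0; omitted for brevity. */
                continue;

            for ( i = 0; i < HOST_LPIS_PER_PAGE; i++ )
                if ( !page[i].data )
                {
                    next_free_chunk = chunk;
                    return chunk * HOST_LPIS_PER_PAGE + i + 8192;
                }
        }

        return -ENOSPC;
    }

Without device assignment Dom0 maps its LPIs once at boot, so the hint
keeps the inner scan to the current, mostly-empty page.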
