[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH v3 1/9] xen/vpci: introduce basic handlers to trap accesses to the PCI config space



On Fri, May 19, 2017 at 05:35:47AM -0600, Jan Beulich wrote:
> >>> On 27.04.17 at 16:35, <roger.pau@xxxxxxxxxx> wrote:
> > --- a/tools/libxl/libxl_x86.c
> > +++ b/tools/libxl/libxl_x86.c
> > @@ -11,7 +11,7 @@ int libxl__arch_domain_prepare_config(libxl__gc *gc,
> >      if (d_config->c_info.type == LIBXL_DOMAIN_TYPE_HVM) {
> >          if (d_config->b_info.device_model_version !=
> >              LIBXL_DEVICE_MODEL_VERSION_NONE) {
> > -            xc_config->emulation_flags = XEN_X86_EMU_ALL;
> > +            xc_config->emulation_flags = (XEN_X86_EMU_ALL & 
> > ~XEN_X86_EMU_VPCI);
> 
> I can see why you need this, but I'm not sure this is a good model.
> Ideally for ordinary HVM guests you'd never have to change this
> line. Therefore perhaps it might be a better idea to use a "negative"
> flag here.

I would expect that at some point HVM guests are also going to use the
Xen internal PCI emulation for IOREQs (XEN_DMOP_IO_RANGE_PCI), and for
PCI passthrough, so having VPCI disabled is only temporary (long term
HVM guests should again use XEN_X86_EMU_ALL).

> > --- /dev/null
> > +++ b/tools/tests/vpci/Makefile
> > @@ -0,0 +1,45 @@
> > +
> > +XEN_ROOT=$(CURDIR)/../../..
> > +include $(XEN_ROOT)/tools/Rules.mk
> > +
> > +TARGET := test_vpci
> > +
> > +.PHONY: all
> > +all: $(TARGET)
> > +
> > +.PHONY: run
> > +run: $(TARGET)
> > +   ./$(TARGET) > $(TARGET).out
> > +
> > +$(TARGET): vpci.c vpci.h rbtree.c rbtree.h
> > +   $(HOSTCC) -g -o $@ vpci.c main.c rbtree.c
> > +
> > +.PHONY: clean
> > +clean:
> > +   rm -rf $(TARGET) $(TARGET).out *.o *~ vpci.h vpci.c rbtree.c rbtree.h
> > +
> > +.PHONY: distclean
> > +distclean: clean
> > +
> > +.PHONY: install
> > +install:
> > +
> > +vpci.h: $(XEN_ROOT)/xen/include/xen/vpci.h
> > +   sed -e '/#include/d' <$< >$@
> > +
> > +vpci.c: $(XEN_ROOT)/xen/drivers/vpci/vpci.c
> > +   # Trick the compiler so it doesn't complain about missing symbols
> > +   sed -e '/#include/d' \
> > +       -e '1s;^;#include "emul.h"\
> > +                const vpci_register_init_t __start_vpci_array[1]\;\
> > +                const vpci_register_init_t __end_vpci_array[1]\;\
> > +                ;' <$< >$@
> > +
> > +rbtree.h: $(XEN_ROOT)/xen/include/xen/rbtree.h
> > +   sed -e '/#include/d' <$< >$@
> > +
> > +rbtree.c: $(XEN_ROOT)/xen/common/rbtree.c
> > +   sed -e "/#include/d" \
> > +       -e '1s;^;#include "emul.h"\
> > +                ;' <$< >$@
> 
> Plain symlinking and __XEN__ conditionals in the files may be the
> easier to follow variant. But I'm no heavily opposed to this one,
> I'm merely afraid that further adjustments may end up becoming
> necessary down the road, resulting in the rules here to become
> more convoluted.

Yes, I'm not opposed to doing that, I just think the code is cleaner
using this rather than adding __XEN__ conditionals, but I agree that
if this becomes too convoluted at some point I would be in favor of
using conditionals instead.

> > --- /dev/null
> > +++ b/tools/tests/vpci/emul.h
> > @@ -0,0 +1,107 @@
> > +/*
> > + * Unit tests for the generic vPCI handler code.
> > + *
> > + * Copyright (C) 2017 Citrix Systems R&D
> > + *
> > + * This program is free software; you can redistribute it and/or
> > + * modify it under the terms and conditions of the GNU General Public
> > + * License, version 2, as published by the Free Software Foundation.
> > + *
> > + * This program is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > + * General Public License for more details.
> > + *
> > + * You should have received a copy of the GNU General Public
> > + * License along with this program; If not, see 
> > <http://www.gnu.org/licenses/>.
> > + */
> > +
> > +#ifndef _TEST_VPCI_
> > +#define _TEST_VPCI_
> > +
> > +#include <stdlib.h>
> > +#include <stdio.h>
> > +#include <stddef.h>
> > +#include <stdint.h>
> > +#include <stdbool.h>
> > +#include <errno.h>
> > +#include <assert.h>
> > +
> > +#define container_of(ptr, type, member) ({                      \
> > +        typeof( ((type *)0)->member ) *__mptr = (ptr);          \
> > +        (type *)( (char *)__mptr - offsetof(type,member) );})
> 
> There are a couple of stray blanks (immediately inside parentheses)
> here, and a missing one after the comma in offsetof().

Sorry, I've picked this as-is from the Xen headers (kernel.h). I've
changed it to:

#define container_of(ptr, type, member) ({                      \
        typeof(((type *)0)->member) *__mptr = (ptr);            \
        (type *)((char *)__mptr - offsetof(type, member));})

> > +#include "rbtree.h"
> > +
> > +struct pci_dev {
> > +    struct domain *domain;
> > +    struct vpci *vpci;
> > +};
> > +
> > +struct domain {
> > +    struct pci_dev pdev;
> > +};
> > +
> > +struct vcpu
> > +{
> > +    struct domain *domain;
> > +};
> > +
> > +extern struct vcpu v;
> 
> This is odd. With ...
> 
> > +#define spin_lock(x)
> > +#define spin_unlock(x)
> > +#define spin_is_locked(x) true
> > +
> > +#define current (&v)
> 
> ... this, why don't you simply have
> 
> extern struct vcpu *current;
> 
> keeping v (or however you mean to name it) static?

Done.

> > +#define has_vpci(d) true
> > +
> > +#include "vpci.h"
> > +
> > +#define xzalloc(type) (type *)calloc(1, sizeof(type))
> 
> Missing an outer pair of parentheses.

Done.

> > +#define xfree(p) free(p)
> > +
> > +#define EXPORT_SYMBOL(x)
> 
> I think we should rather get rid of them from rbtree.c.
> 
> > +#define pci_get_pdev_by_domain(d, ...) &(d)->pdev
> 
> Missing an outer pair of parentheses again, whereas ...

OK, I'm adding a pre-patch then to get rid of them.

> > +#define atomic_read(x) 1
> > +
> > +/* Dummy native helpers. Writes are ignored, reads return 1's. */
> > +#define pci_conf_read8(...) (0xff)
> > +#define pci_conf_read16(...) (0xffff)
> > +#define pci_conf_read32(...) (0xffffffff)
> 
> ... here they're pointless.

Thanks, removed.

> > +/* Dummy hooks, write stores data, read fetches it. */
> > +static int vpci_read8(struct pci_dev *pdev, unsigned int reg,
> > +                      union vpci_val *val, void *data)
> > +{
> > +    uint8_t *priv = data;
> > +
> > +    val->half_word = *priv;
> 
> Half word? Half a word is at least 16 bits on any reasonable
> architecture nowadays. Using it for a byte is simply confusing. I'd
> suggest naming the fields what they are - u8, u16, and u32.

Right, I can do that. I was using this nomenclature because that's
what the PCI specification uses:

half word: 8b
word: 16b
double word: 32b

Will change it to u8, u16 and u32.

> > +#define VPCI_READ(reg, size, data) \
> > +    assert(!xen_vpci_read(0, 0, 0, reg, size, data))
> > +
> > +#define VPCI_READ_CHECK(reg, size, expected) ({ \
> > +    uint32_t val;                               \
> > +    VPCI_READ(reg, size, &val);                 \
> > +    assert(val == expected);                    \
> > +    })
> 
> Bad indentation - either the }) needs to move left, of the body needs
> to move right.

I will move the }) left.

> > +#define VPCI_WRITE(reg, size, data) \
> > +    assert(!xen_vpci_write(0, 0, 0, reg, size, data))
> 
> You using fixed SBDF here, ...
> 
> > +#define VPCI_CHECK_REG(reg, size, data) ({      \
> > +    VPCI_WRITE(reg, size, data);                \
> > +    VPCI_READ_CHECK(reg, size, data);           \
> > +    })
> > +
> > +#define VPCI_ADD_REG(fread, fwrite, off, size, store)                      
> >    \
> > +    assert(!xen_vpci_add_register(&d.pdev, fread, fwrite, off, size, 
> > &store)) \
> 
> ... why do you have this strange &d.pdev here? The assumption
> that a (fake) domain has a single (fake) PCI device looks pretty odd
> anyway - why can't you simply have a global (fake) PCI device?

Yes, I can do that. Now the root is the pci_dev itself instead of the
domain.

> > +#define VPCI_ADD_INVALID_REG(fread, fwrite, off, size)                     
> >  \
> > +    assert(xen_vpci_add_register(&d.pdev, fread, fwrite, off, size, NULL)) 
> >  \
> > +
> > +int
> > +main(int argc, char **argv)
> > +{
> > +    /* Index storage by offset. */
> > +    uint32_t r0 = 0xdeadbeef;
> > +    uint8_t r5 = 0xef;
> > +    uint8_t r6 = 0xbe;
> > +    uint8_t r7 = 0xef;
> > +    uint16_t r12 = 0x8696;
> > +    int rc;
> > +
> > +    VPCI_ADD_REG(vpci_read32, vpci_write32, 0, 4, r0);
> > +    VPCI_READ_CHECK(0, 4, 0xdeadbeef);
> > +    VPCI_CHECK_REG(0, 4, 0xbcbcbcbc);
> 
> In the context here the macro name is pretty confusing: I'd expect
> it to check the register holds the specified value, without also doing
> a write. How about VPCI_WRITE_CHECK()?

Completely agree, the read macro is already VPCI_READ_CHECK, so it
makes sense for the write one to be VPCI_WRITE_CHECK IMHO.

> > +    VPCI_ADD_REG(vpci_read8, vpci_write8, 5, 1, r5);
> > +    VPCI_READ_CHECK(5, 1, 0xef);
> > +    VPCI_CHECK_REG(5, 1, 0xba);
> > +
> > +    VPCI_ADD_REG(vpci_read8, vpci_write8, 6, 1, r6);
> > +    VPCI_READ_CHECK(6, 1, 0xbe);
> > +    VPCI_CHECK_REG(6, 1, 0xba);
> > +
> > +    VPCI_ADD_REG(vpci_read8, vpci_write8, 7, 1, r7);
> > +    VPCI_READ_CHECK(7, 1, 0xef);
> > +    VPCI_CHECK_REG(7, 1, 0xbd);
> > +
> > +    VPCI_ADD_REG(vpci_read16, vpci_write16, 12, 2, r12);
> > +    VPCI_READ_CHECK(12, 2, 0x8696);
> > +    VPCI_READ_CHECK(12, 4, 0xffff8696);
> > +
> > +    /*
> > +     * At this point we have the following layout:
> > +     *
> > +     * 32    24    16     8     0
> > +     *  +-----+-----+-----+-----+
> > +     *  |          r0           | 0
> > +     *  +-----+-----+-----+-----+
> > +     *  | r7  |  r6 |  r5 |/////| 32
> > +     *  +-----+-----+-----+-----|
> > +     *  |///////////////////////| 64
> > +     *  +-----------+-----------+
> > +     *  |///////////|    r12    | 96
> > +     *  +-----------+-----------+
> > +     *             ...
> > +     *  / = empty.
> > +     */
> > +
> > +    /* Try to add an overlapping register handler. */
> > +    VPCI_ADD_INVALID_REG(vpci_read32, vpci_write32, 4, 4);
> > +
> > +    /* Try to add a non-aligned register. */
> > +    VPCI_ADD_INVALID_REG(vpci_read16, vpci_write16, 15, 2);
> > +
> > +    /* Try to add a register with wrong size. */
> > +    VPCI_ADD_INVALID_REG(vpci_read16, vpci_write16, 8, 3);
> > +
> > +    /* Try to add a register with missing handlers. */
> > +    VPCI_ADD_INVALID_REG(vpci_read16, NULL, 8, 2);
> > +    VPCI_ADD_INVALID_REG(NULL, vpci_write16, 8, 2);
> 
> Is that something which really is wrong in all cases? What about e.g.
> r/o registers?

My initial though was that r/o registers should register something
like a noop write handler, that could be shared between all of them. I
can certainly allow registration of handlers without a read or write
function (but not both), and make it a noop (writes will be ignored,
reads will return 1's).

> > +    /* Read/write of unset register. */
> > +    VPCI_READ_CHECK(8, 4, 0xffffffff);
> > +    VPCI_READ_CHECK(8, 2, 0xffff);
> > +    VPCI_READ_CHECK(8, 1, 0xff);
> > +    VPCI_WRITE(10, 2, 0xbeef);
> > +    VPCI_READ_CHECK(10, 2, 0xffff);
> > +
> > +    /* Read of multiple registers */
> > +    VPCI_CHECK_REG(7, 1, 0xbd);
> > +    VPCI_READ_CHECK(4, 4, 0xbdbabaff);
> 
> I think a variant accessing mixed size registers would also be
> desirable here. Perhaps it would be best to exhaustively test
> all possible variations (there aren't that many after all). Same
> for writes and partial accesses (below) then.

So you mean to scan the whole space (from 0 to 128 on this test) using
all possible register sizes for both read and write? That would indeed
be feasible.

> > @@ -256,6 +257,152 @@ void register_g2m_portio_handler(struct domain *d)
> >      handler->ops = &g2m_portio_ops;
> >  }
> >  
> > +/* Do some sanity checks. */
> > +static int vpci_access_check(unsigned int reg, unsigned int len)
> > +{
> > +    /* Check access size. */
> > +    if ( len != 1 && len != 2 && len != 4 )
> > +    {
> > +        gdprintk(XENLOG_WARNING, "invalid length (reg: %#x, len: %u)\n",
> > +                 reg, len);
> 
> I think many of such gdprintk()s want to go away before this series
> gets committed.

OK, I've found them useful while developing, but I guess they are not
really useful outside from that context. I guess there's no way to
leave them in place, maybe a Kconfig option?

> > +/* vPCI config space IO ports handlers (0xcf8/0xcfc). */
> > +static bool_t vpci_portio_accept(const struct hvm_io_handler *handler,
> 
> Plain bool please.

Sadly struct hvm_io_ops (which is where this function is used) expects
a bool_t as return.

> 
> > +                                 const ioreq_t *p)
> > +{
> > +    return (p->addr == 0xcf8 && p->size == 4) || (p->addr & 0xfffc) == 
> > 0xcfc;
> > +}
> > +
> > +static int vpci_portio_read(const struct hvm_io_handler *handler,
> > +                            uint64_t addr, uint32_t size, uint64_t *data)
> > +{
> > +    struct domain *d = current->domain;
> > +    unsigned int bus, devfn, reg;
> > +    uint32_t data32;
> > +    int rc;
> > +
> > +    vpci_lock(d);
> > +    if ( addr == 0xcf8 )
> > +    {
> > +        ASSERT(size == 4);
> > +        *data = d->arch.hvm_domain.pci_cf8;
> > +        vpci_unlock(d);
> > +        return X86EMUL_OKAY;
> > +    }
> > +    else if ( !CF8_ENABLED(d->arch.hvm_domain.pci_cf8) )
> 
> Pointless "else".
> 
> > +    {
> > +        vpci_unlock(d);
> > +        return X86EMUL_OKAY;
> 
> You need to write to *data here, or else you need to return
> false from vpci_portio_accept() already in this case (but then
> you'd need to follow the stdvga model and take the lock
> there, releasing it in a .complete handler).

Right, this should return 1's in this case then.

> > +    }
> > +
> > +    /* Decode the PCI address. */
> > +    hvm_pci_decode_addr(d->arch.hvm_domain.pci_cf8, addr, &bus, &devfn, 
> > &reg);
> > +
> > +    if ( vpci_access_check(reg, size) || reg >= 0xff )
> 
> > 0xff or >= 0x100, but the check is pointless as
> hvm_pci_decode_addr() wont return larger values.

Right, this check is pointless.

> > +    {
> > +        vpci_unlock(d);
> > +        return X86EMUL_UNHANDLEABLE;
> 
> I don't think this matches real hardware behavior. If this "fails"
> at all, surely by returning all ones.

Yes, I've changed this to return 1's and X86EMUL_OKAY.

> > +    }
> > +
> > +    rc = xen_vpci_read(0, bus, devfn, reg, size, &data32);
> > +    if ( !rc )
> > +        *data = data32;
> > +    vpci_unlock(d);
> 
> Please set *data outside the locked region.
> 
> And since there's no best place to make this other remark - I'd
> prefer if you either kept together SBDF in one value when passing
> this as arguments to functions, or alternatively pass this as four
> values rather than keeping devfn artificially together.

I guess I prefer the 4 values, that was my first approach until I
realized pci_dev internally stores devfn in a single variable.

> > +     return rc ? X86EMUL_UNHANDLEABLE : X86EMUL_OKAY;
> > +}
> 
> Again the question - what's the bare hardware equivalent of
> returning X86EMUL_UNHANDLEABLE here?

All 1's I assume (or other random garbage). Would you be OK with me
adding a "fail" label that sets data to all 1's and returns X86EMUL_OKAY?

> > --- a/xen/arch/x86/hvm/ioreq.c
> > +++ b/xen/arch/x86/hvm/ioreq.c
> > @@ -1177,6 +1177,9 @@ struct hvm_ioreq_server 
> > *hvm_select_ioreq_server(struct domain *d,
> >           CF8_ENABLED(cf8) )
> >      {
> >          uint32_t sbdf, x86_fam;
> > +        unsigned int bus, devfn, reg;
> > +
> > +        hvm_pci_decode_addr(cf8, p->addr, &bus, &devfn, &reg);
> >  
> >          /* PCI config data cycle */
> >  
> > @@ -1186,9 +1189,7 @@ struct hvm_ioreq_server 
> > *hvm_select_ioreq_server(struct domain *d,
> >                                   PCI_FUNC(CF8_BDF(cf8)));
> 
> Any reason you don't use bus and devfn (really dev/slot and func)
> in the expression the tail of which is visible here?

Good catch, thanks.

> > --- a/xen/arch/x86/xen.lds.S
> > +++ b/xen/arch/x86/xen.lds.S
> > @@ -224,6 +224,9 @@ SECTIONS
> >         __start_schedulers_array = .;
> >         *(.data.schedulers)
> >         __end_schedulers_array = .;
> > +       __start_vpci_array = .;
> > +       *(.data.vpci)
> > +       __end_vpci_array = .;
> 
> With vpci.c declaring these const, they should go into .rodata.
> With the type name further being vpci_register_init_t it may even
> be next to .init.rodata where they belong.

Done, I've placed them in .rodata now.

I don't think it makes sense to place them in .init.rodata, the _init_
tag means these functions are supposed to be used to initialize the
vPCI handlers for devices, but they could be run after Xen has
finished the initialization, for example if MMCFG areas are discovered
by Dom0 with new devices (I know this is not yet implemented).

> > @@ -1041,6 +1042,8 @@ static void setup_one_hwdom_device(const struct 
> > setup_hwdom *ctxt,
> >          devfn += pdev->phantom_stride;
> >      } while ( devfn != pdev->devfn &&
> >                PCI_SLOT(devfn) == PCI_SLOT(pdev->devfn) );
> > +
> > +    xen_vpci_add_handlers(pdev);
> > }
> 
> You're losing an error code here.

Fixed, thanks.

> > --- /dev/null
> > +++ b/xen/drivers/vpci/Makefile
> > @@ -0,0 +1 @@
> > +obj-y += vpci.o
> 
> Without having seen further patches it's not clear whether this really
> needs its own directory.

vPCI emulation handlers (for PCI headers, capabilities, MSI...) will
also go into this folder.

> > --- /dev/null
> > +++ b/xen/drivers/vpci/vpci.c
> > @@ -0,0 +1,469 @@
> > +/*
> > + * Generic functionality for handling accesses to the PCI configuration 
> > space
> > + * from guests.
> > + *
> > + * Copyright (C) 2017 Citrix Systems R&D
> > + *
> > + * This program is free software; you can redistribute it and/or
> > + * modify it under the terms and conditions of the GNU General Public
> > + * License, version 2, as published by the Free Software Foundation.
> > + *
> > + * This program is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > + * General Public License for more details.
> > + *
> > + * You should have received a copy of the GNU General Public
> > + * License along with this program; If not, see 
> > <http://www.gnu.org/licenses/>.
> > + */
> > +
> > +#include <xen/sched.h>
> > +#include <xen/vpci.h>
> > +
> > +extern const vpci_register_init_t __start_vpci_array[], __end_vpci_array[];
> > +#define NUM_VPCI_INIT (__end_vpci_array - __start_vpci_array)
> > +#define vpci_init __start_vpci_array
> 
> What is this last one good for?

It's used by xen_vpci_add_handlers in order to call the init
functions, I can drop it and call __start_vpci_array[i](...) if that's
better.

> > +/* Internal struct to store the emulated PCI registers. */
> > +struct vpci_register {
> > +    vpci_read_t read;
> > +    vpci_write_t write;
> 
> These two are pointers - please change the typedefs so that they're
> visibly pointers here. That'll then also allow the typedef to be used to
> declare actual handlers, should any such declarations be needed (e.g.
> if the same handler can be used by two different source files).
> 
> > +    unsigned int size;
> > +    unsigned int offset;
> > +    void *priv_data;
> 
> "private" (shorter and hence easier to type)?

Done to both the above comments.

> > +    struct rb_node node;
> > +};
> > +
> > +int xen_vpci_add_handlers(struct pci_dev *pdev)
> 
> __hwdom_init (I notice setup_one_hwdom_device() wrongly isn't
> annotated so).

OK, so you really want the init handlers to be inside of the
.init.rodata section then.

> > +{
> > +    int i, rc = 0;
> 
> i wants to be unsigned.

Rightfully.

> > +    if ( !has_vpci(pdev->domain) )
> > +        return 0;
> > +
> > +    pdev->vpci = xzalloc(struct vpci);
> > +    if ( !pdev->vpci )
> > +        return -ENOMEM;
> > +
> > +    pdev->vpci->handlers = RB_ROOT;
> > +
> > +    for ( i = 0; i < NUM_VPCI_INIT; i++ )
> > +    {
> > +        rc = vpci_init[i](pdev);
> > +        if ( rc )
> > +            break;
> > +    }
> > +
> > +    if ( rc )
> > +    {
> > +        struct rb_node *node = rb_first(&pdev->vpci->handlers);
> > +        struct vpci_register *r;
> 
> Please move this into the more narrow scope below.

Done.

> > +        /* Iterate over the tree and cleanup. */
> > +        while ( node != NULL )
> > +        {
> > +            r = container_of(node, struct vpci_register, node);
> > +            node = rb_next(node);
> > +            rb_erase(&r->node, &pdev->vpci->handlers);
> > +            xfree(r);
> > +        }
> > +        xfree(pdev->vpci);
> > +    }
> > +
> > +    return rc;
> > +}
> > +
> > +static bool vpci_register_overlap(const struct vpci_register *r,
> > +                                  unsigned int offset)
> > +{
> > +    if ( offset >= r->offset && offset < r->offset + r->size )
> > +        return true;
> > +
> > +    return false;
> 
> This can be one single return statement.
>
> > +}
> > +
> > +
> 
> Stray double blank lines.

Fixed both.

> > +static int vpci_register_cmp(const struct vpci_register *r1,
> > +                             const struct vpci_register *r2)
> > +{
> > +    /* Make sure there's no overlap between registers. */
> > +    if ( vpci_register_overlap(r1, r2->offset) ||
> > +         vpci_register_overlap(r1, r2->offset + r2->size - 1) ||
> > +         vpci_register_overlap(r2, r1->offset) ||
> > +         vpci_register_overlap(r2, r1->offset + r1->size - 1) )
> 
> Overlap checks can generally be done with just two comparisons,
> so I guess the parameters chosen for vpci_register_overlap()
> aren't optimal. I guess you don't need the function at all, as you
> could do all that's needed here:
> 
>     if ( r1->offset < r2->offset + r2->size &&
>          r2->offset < r1->offset + r1->size )
>         return 0;
> 
> The comment of course is somewhat misleading here too, as
> returning zero isn't really an error indication.

I've fixed this and changed the comment to "Return 0 if registers
overlap".

> > +        return 0;
> > +
> > +    if (r1->offset < r2->offset)
> > +        return -1;
> > +    else if (r1->offset > r2->offset)
> > +        return 1;
> 
> Coding style.

Fixed (and removed the pointless else).

> > +    ASSERT_UNREACHABLE();
> > +    return 0;
> > +}
> > +
> > +static struct vpci_register *vpci_find_register(const struct pci_dev *pdev,
> > +                                                const unsigned int reg,
> > +                                                const unsigned int size)
> > +{
> > +    struct rb_node *node;
> 
> const
> 
> > +    struct vpci_register r = {
> > +        .offset = reg,
> > +        .size = size,
> > +    };
> > +
> > +    ASSERT(vpci_locked(pdev->domain));
> > +
> > +    node = pdev->vpci->handlers.rb_node;
> > +    while ( node )
> > +    {
> > +        struct vpci_register *t =
> 
> const

Making both of them const means the return must also be const, and
that's not suitable by some of the consumers (ie:
xen_vpci_remove_register for example).

> > +int xen_vpci_add_register(struct pci_dev *pdev, vpci_read_t read_handler,
> > +                          vpci_write_t write_handler, unsigned int offset,
> > +                          unsigned int size, void *data)
> > +{
> > +    struct rb_node **new, *parent;
> > +    struct vpci_register *r;
> > +
> > +    /* Some sanity checks. */
> > +    if ( (size != 1 && size != 2 && size != 4) || offset >= 0xFFF ||
> 
> Off by one again in the offset check.

Fixed to be > 0xfff. Should this maybe be added to pci_regs.h as
PCI_MAX_REGISTER? 

> > +         offset & (size - 1) || read_handler == NULL || write_handler == 
> > NULL )
> 
> As said, I'm not convinced either of the read or write handlers
> being NULL is really a mistake. Both of them being NULL surely
> is.

Right (see above reply).

> > +        return -EINVAL;
> > +
> > +    r = xzalloc(struct vpci_register);
> 
> Looks like xmalloc() would be fine here - you initialize all fields.

Yes.

> > +    if ( !r )
> > +        return -ENOMEM;
> > +
> > +    r->read = read_handler;
> > +    r->write = write_handler;
> > +    r->size = size;
> > +    r->offset = offset;
> > +    r->priv_data = data;
> > +
> > +    vpci_lock(pdev->domain);
> > +    new = &pdev->vpci->handlers.rb_node;
> > +    parent = NULL;
> > +
> > +    while (*new) {
> 
> Coding style.
> 
> > +        struct vpci_register *this =
> 
> const

Done for both.

> > +int xen_vpci_remove_register(struct pci_dev *pdev, unsigned int offset)
> > +{
> > +    struct vpci_register *r;
> > +
> > +    vpci_lock(pdev->domain);
> > +    r = vpci_find_register(pdev, offset, 1 /* size doesn't matter here. 
> > */);
> 
> I'm not sure about this - is there anything wrong with the caller,
> knowing the size, also passing it? You could then even refuse
> requests to remove a register where (offset,size) doesn't match
> the recorded values (as vpci_find_register() will return any
> overlapping one).

Well, I think the important bit is to check that what
vpci_find_register returns matches what the called expects to
remove.

> > +    if ( !r )
> > +    {
> > +        vpci_unlock(pdev->domain);
> > +        return -ENOENT;
> > +    }
> > +
> > +    rb_erase(&r->node, &pdev->vpci->handlers);
> > +    xfree(r);
> > +    vpci_unlock(pdev->domain);
> 
> Please swap xfree() and unlock.
> 
> > +static void vpci_read_hw(unsigned int seg, unsigned int bus,
> > +                         unsigned int devfn, unsigned int reg, uint32_t 
> > size,
> > +                         uint32_t *data)
> 
> Instead of passing a pointer to the result, please consider returning
> the value, as the function doesn't return anything at present.
> 
> > +{
> > +    switch ( size )
> > +    {
> > +    case 4:
> > +        *data = pci_conf_read32(seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn),
> > +                                reg);
> > +        break;
> > +    case 3:
> > +        /*
> > +         * This is possible because a 4byte read can have 1byte trapped and
> > +         * the rest passed-through.
> > +         */
> > +        *data = pci_conf_read16(seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn),
> > +                                reg + 1) << 8;
> > +        *data |= pci_conf_read8(seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn),
> > +                               reg);
> 
> Which of the two parts to read with read16() should depend on the
> low bit of reg. Also for maximum compatibility I'd strongly suggest
> reading the low part before the high one.

Right. Changed it to:

if ( reg & 1 )
{
    data = pci_conf_read8(seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn),
                          reg);
    data |= pci_conf_read16(seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn),
                            reg + 1) << 8;
}
else
{
    data = pci_conf_read16(seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn),
                           reg);
    data |= pci_conf_read8(seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn),
                           reg + 2) << 16;
}

> > +/* Helper macros for the read/write handlers. */
> > +#define GENMASK_BYTES(e, s) GENMASK((e) * 8, (s) * 8)
> 
> What do e and s stand for here?

e = end, s = start (in bytes).

> > +#define SHIFT_RIGHT_BYTES(d, o) d >>= (o) * 8
> 
> And at least o here?

o = offset (in bytes)

> > +#define ADD_RESULT(r, d, s, o) r |= ((d) & GENMASK_BYTES(s, 0)) << ((o) * 
> > 8)
> 
> And d, s, and o here?
> 
> Also I can't see what addition you would want to perform below.
> All you ought to do are ANDs and ORs.

It's adding new data to a partial output (ie: because the output might
be split across several registers).

r = result to update
d = new data to add
s = size of the new data to add
o = offset of the newly added data in the partial result.

I will add all this as a comment to the macros:

/*
 * Helper macros for the read/write handlers (all input values are in bytes).
 *
 * GENMASK_BYTES: generate a bitmask from the byte pointed by s to the byte
 * pointed by e.
 *
 * SHIFT_RIGHT_BYTES: shift a value pointed by d the number of bytes
 * specified in o.
 *
 * ADD_RESULT: append a result to another variable pointed by r. d is the
 * variable that contains the data to be added, s contains the size of the
 * data to add, and finally o is the offset of such data in the destination
 * (r).
 */

I can rename ADD_RESULT to APPEND_RESULT or something more descriptive
if you think it's going to make it easier to understand.

> > +int xen_vpci_read(unsigned int seg, unsigned int bus, unsigned int devfn,
> 
> The function being other than void, same question as earlier:
> What's the bare hardware equivalent of this returning other
> than zero?

I though it would be useful to have some more fine-grained error
reporting if that's ever needed, although as you say, from a hardware
point of view any errors will be simply reported as the value obtained
being all 1's.

The question is, should this already return all 1's, or should the
called translate failures into all 1's?

> > +                  unsigned int reg, uint32_t size, uint32_t *data)
> > +{
> > +    struct domain *d = current->domain;
> > +    struct pci_dev *pdev;
> > +    const struct vpci_register *r;
> > +    union vpci_val val = { .double_word = 0 };
> > +    unsigned int data_rshift = 0, data_lshift = 0, data_size;
> > +    uint32_t tmp_data;
> > +    int rc;
> > +
> > +    ASSERT(vpci_locked(d));
> > +
> > +    *data = 0;
> > +
> > +    /* Find the PCI dev matching the address. */
> > +    pdev = pci_get_pdev_by_domain(d, seg, bus, devfn);
> 
> What about the global PCI devices lock here? While VT-d code,
> perhaps wrongly, doesn't acquire that lock prior to calling the
> function, all callers in passthrough/pci.c do or verify it is being
> held.

Right, I will add an assert then.

> > +    if ( !pdev )
> > +        goto passthrough;
> > +
> > +    /* Find the vPCI register handler. */
> > +    r = vpci_find_register(pdev, reg, size);
> 
> With the overlap handling in vpci_find_register() I can't see how
> this would reliably return the correct (lowest) register when the
> request spans multiple ones.

It doesn't need to, if there's a lower register it will be found by
the recursive call to xen_vpci_read done below (before calling into
the handler pointed by r).

> > +    if ( !r )
> > +        goto passthrough;
> > +
> > +    if ( r->offset > reg )
> > +    {
> > +        /*
> > +         * There's a heading gap into the emulated register.
> > +         * NB: it's possible for this recursive call to have a size of 3.
> > +         */
> > +        rc = xen_vpci_read(seg, bus, devfn, reg, r->offset - reg, 
> > &tmp_data);
> 
> I'm not particularly happy to see recursion being used here, even if
> that's not going to be very deep. Both qemu and pciback get away
> without, iirc, and while it's not the neatest code I find qemu's easier
> to follow than the apparently written from scratch variant here. Is
> there a particular reason you didn't at least take what is there as a
> basis?

I've been thinking about this, and I've used a RB tree because I
wanted the searches to be fast, but due to RB properties if we want to
emulate a region that expends across multiple registers Xen needs to
issue several vpci_find_register, which will always start from the
root of the Rb tree, making this slower in this case.

Using a sorted linked list will allow Xen to only perform the search
once, and then continue from the first register until all registers
needed in order to complete the emulation have been found.

In resume, I think that using a sorted linked list to store the
registers here would be better indeed, and it would also get rid of
the recursion.

> > +        if ( rc )
> > +            return rc;
> > +
> > +        /* Add the head read to the partial result. */
> > +        ADD_RESULT(*data, tmp_data, r->offset - reg, 0);
> > +        data_lshift = r->offset - reg;
> > +
> > +        /* Account for the read. */
> > +        size -= data_lshift;
> > +        reg += data_lshift;
> > +    }
> > +    else if ( r->offset < reg )
> > +        /* There's an offset into the emulated register */
> > +        data_rshift = reg - r->offset;
> 
> This could be a plain else, avoiding another conditional branch.

This is likely going to change in any case...

> > +    ASSERT(data_lshift == 0 || data_rshift == 0);
> > +    data_size = min(size, r->size - data_rshift);
> > +    ASSERT(data_size != 0);
> > +
> > +    /* Perform the read of the register. */
> > +    rc = r->read(pdev, r->offset, &val, r->priv_data);
> > +    if ( rc )
> > +        return rc;
> > +
> > +    val.double_word >>= data_rshift * 8;
> > +    ADD_RESULT(*data, val.double_word, data_size, data_lshift);
> > +
> > +    /* Account for the read */
> > +    size -= data_size;
> > +    reg += data_size;
> > +
> > +    /* Read the remaining, if any. */
> > +    if ( size > 0 )
> > +    {
> > +        /*
> > +         * Read tailing data.
> 
> trailing?

Yes.

> > +static int vpci_write_helper(struct pci_dev *pdev,
> > +                             const struct vpci_register *r, unsigned int 
> > size,
> > +                             unsigned int offset, uint32_t data)
> > +{
> > +    union vpci_val val = { .double_word = data };
> > +    int rc;
> > +
> > +    ASSERT(size <= r->size);
> > +    if ( size != r->size )
> > +    {
> > +        rc = r->read(pdev, r->offset, &val, r->priv_data);
> > +        if ( rc )
> > +            return rc;
> > +        val.double_word &= ~GENMASK_BYTES(size + offset, offset);
> > +        data &= GENMASK_BYTES(size, 0);
> > +        val.double_word |= data << (offset * 8);
> > +    }
> > +
> > +    return r->write(pdev, r->offset, val, r->priv_data);
> > +}
> 
> I'm not sure that writing back the value read is correct in all cases
> (think of write-only or rw1c registers or even offsets where reads
> and writes access different registers altogether). I think the write
> handlers will need to be made capable of dealing with partial writes.

That seems to be what pciback does fro writes: read, merge value,
write back (drivers/xen/xen-pciback/conf_space.c
xen_pcibk_config_write):

err = conf_space_read(dev, cfg_entry, field_start,
                      &tmp_val);
if (err)
        break;

tmp_val = merge_value(tmp_val, value, get_mask(size),
                      offset - field_start);

err = conf_space_write(dev, cfg_entry, field_start,
                       tmp_val);

I would prefer to do it this way in order to avoid adding more
complexity to the handlers themselves. So far I haven't found any such
registers (rw1c) in the PCI config space, do you have references to
any of them?

> > +int xen_vpci_write(unsigned int seg, unsigned int bus, unsigned int devfn,
> > +                   unsigned int reg, uint32_t size, uint32_t data)
> > +{
> > +    struct domain *d = current->domain;
> > +    struct pci_dev *pdev;
> > +    const struct vpci_register *r;
> > +    unsigned int data_size, data_offset = 0;
> > +    int rc;
> > +
> > +    ASSERT(vpci_locked(d));
> > +
> > +    /* Find the PCI dev matching the address. */
> > +    pdev = pci_get_pdev_by_domain(d, seg, bus, devfn);
> > +    if ( !pdev )
> > +        goto passthrough;
> > +
> > +    /* Find the vPCI register handler. */
> > +    r = vpci_find_register(pdev, reg, size);
> > +    if ( !r )
> > +        goto passthrough;
> > +
> > +    else if ( r->offset > reg )
> 
> Pointless "else" again, even more so with the blank line in between.
> 
> > --- a/xen/include/xen/pci.h
> > +++ b/xen/include/xen/pci.h
> > @@ -13,6 +13,7 @@
> >  #include <xen/irq.h>
> >  #include <xen/pci_regs.h>
> >  #include <xen/pfn.h>
> > +#include <xen/rbtree.h>
> 
> Why? All you add to this file is ...
> 
> > @@ -88,6 +89,9 @@ struct pci_dev {
> >  #define PT_FAULT_THRESHOLD 10
> >      } fault;
> >      u64 vf_rlen[6];
> > +
> > +    /* Data for vPCI. */
> > +    struct vpci *vpci;
> 
> ... this. I guess you really want to add the #include ...
> 
> > --- /dev/null
> > +++ b/xen/include/xen/vpci.h
> > @@ -0,0 +1,66 @@
> > +#ifndef _VPCI_
> > +#define _VPCI_
> > +
> > +#include <xen/pci.h>
> > +#include <xen/types.h>
> 
> ... here.

Right, the RB is going to be replaced with a linked-list anyway, but
the list header should be included here.

> > +/* Helpers for locking/unlocking. */
> > +#define vpci_lock(d) spin_lock(&(d)->arch.hvm_domain.vpci_lock)
> > +#define vpci_unlock(d) spin_unlock(&(d)->arch.hvm_domain.vpci_lock)
> > +#define vpci_locked(d) spin_is_locked(&(d)->arch.hvm_domain.vpci_lock)
> 
> While for the code layering you don't need recursive locks, did you
> consider using them nevertheless so that spin_is_locked() return
> values are actually meaningful for your purposes?

I'm not sure I follow, spin_is_locked already return meaningful values
for my purpose AFAICT.

> > +#define REGISTER_VPCI_INIT(x) \
> > +  static const vpci_register_init_t x##_entry __used_section(".data.vpci") 
> > = x
> 
> To match up with the type name and assuming "REGISTER" here
> means the PCI register rather than "registration", I think this
> would better be VPCI_REGISTER() (I don't really mind the _INIT
> suffix, but I think it's relatively pointless).

OK, that's fine.

> > +/* Add vPCI handlers to device. */
> > +int xen_vpci_add_handlers(struct pci_dev *dev);
> > +
> > +/* Add/remove a register handler. */
> > +int xen_vpci_add_register(struct pci_dev *pdev, vpci_read_t read_handler,
> > +                          vpci_write_t write_handler, unsigned int offset,
> > +                          unsigned int size, void *data);
> > +int xen_vpci_remove_register(struct pci_dev *pdev, unsigned int offset);
> > +
> > +/* Generic read/write handlers for the PCI config space. */
> > +int xen_vpci_read(unsigned int seg, unsigned int bus, unsigned int devfn,
> > +                  unsigned int reg, uint32_t size, uint32_t *data);
> > +int xen_vpci_write(unsigned int seg, unsigned int bus, unsigned int devfn,
> > +                   unsigned int reg, uint32_t size, uint32_t data);
> 
> Along the lines of what I've said in a few places about return values,
> please carefully consider where they're needed. Once you decide
> they are really needed, the respective functions would likely want to
> become __must_check.

From a emulation PoV all those errors are going to be reported to the
VM as ignored writes or reads returning all 1's.

The question is where/who should do this translation. I though it
would be helpful to do this in the trap handlers themselves, and let
the vpci code return more meaningful error value. If you think it's
not going to be helpful I can remove them.

Roger.

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel

 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.