Re: [PATCH v3 5/7] vpci: add SR-IOV support for PVH Dom0
On Tue, May 12, 2026 at 10:28:05AM +0000, Volodymyr Babchuk wrote:
> Hi Roger,
>
> Roger Pau Monné <roger.pau@xxxxxxxxxx> writes:
>
> > On Tue, May 12, 2026 at 07:32:20AM +0000, Mykyta Poturai wrote:
> >>
> >>
> >> On 5/12/26 09:20, Jan Beulich wrote:
> >> > On 11.05.2026 16:10, Volodymyr Babchuk wrote:
> >> >> Hi Jan,
> >> >>
> >> >> Jan Beulich <jbeulich@xxxxxxxx> writes:
> >> >>
> >> >>> On 07.05.2026 22:40, Volodymyr Babchuk wrote:
> >> >>>> Jan Beulich <jbeulich@xxxxxxxx> writes:
> >> >>>>> On 06.05.2026 11:39, Mykyta Poturai wrote:
> >> >>>>>> On 5/4/26 08:37, Jan Beulich wrote:
> >> >>>>>>> On 23.04.2026 12:12, Mykyta Poturai wrote:
> >> >>>>>>>> On 4/21/26 17:43, Jan Beulich wrote:
> >> >>>>>>>>> On 09.04.2026 16:01, Mykyta Poturai wrote:
> >> >>>>>>>>>> From: Stewart Hildebrand <stewart.hildebrand@xxxxxxx>
> >> >>>>>>>>>>
> >> >>>>>>>>>> This code is expected to only be used by privileged domains,
> >> >>>>>>>>>> unprivileged domains should not get access to the SR-IOV
> >> >>>>>>>>>> capability.
> >> >>>>>>>>>>
> >> >>>>>>>>>> Implement RW handlers for the PCI_SRIOV_CTRL register to
> >> >>>>>>>>>> dynamically map/unmap VF BARs. Recalculate BAR sizes before
> >> >>>>>>>>>> mapping VFs to account for possible changes in the system page
> >> >>>>>>>>>> size register. Also force VFs to always use emulated reads for
> >> >>>>>>>>>> the command register; this is needed to prevent some drivers
> >> >>>>>>>>>> accidentally unmapping BARs.
> >> >>>>>>>>>
> >> >>>>>>>>> This apparently refers to the change to vpci_init_header().
> >> >>>>>>>>> Writes are already intercepted. How would a read lead to an
> >> >>>>>>>>> accidental BAR unmap? Even for writes I don't see how a VF
> >> >>>>>>>>> driver could accidentally unmap BARs, as the memory decode bit
> >> >>>>>>>>> there is hardwired to 0.
> >> >>>>>>>>>
> >> >>>>>>>>>> Discovery of VFs is done by Dom0, which must register them
> >> >>>>>>>>>> with Xen.
> >> >>>>>>>>>
> >> >>>>>>>>> If we intercept control register writes, why would we still
> >> >>>>>>>>> require Dom0 to report the VFs that appear?
> >> >>>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>> Sorry, I don't understand this question. You specifically
> >> >>>>>>>> requested this to be done this way in V2. Quoting your reply
> >> >>>>>>>> from V2 below.
> >> >>>>>>>>
> >> >>>>>>>> > Aren't you effectively busy-waiting for these 100ms, by simply
> >> >>>>>>>> > returning "true" from vpci_process_pending() until the time
> >> >>>>>>>> > has passed? This imo is a no-go. You want to set a timer and
> >> >>>>>>>> > put the vCPU to sleep, to wake it up again when the timer has
> >> >>>>>>>> > expired. That'll then eliminate the need for the not-so-nice
> >> >>>>>>>> > patch 4.
> >> >>>>>>>>
> >> >>>>>>>> > Question is whether we need to actually go this far (right
> >> >>>>>>>> > away). I expect you don't mean to hand PFs to DomU-s. As long
> >> >>>>>>>> > as we keep them in the hardware domain, can't we trust it to
> >> >>>>>>>> > set things up correctly, just like we trust it in a number of
> >> >>>>>>>> > other aspects?
> >> >>>>>>>
> >> >>>>>>> How's any of this related to the question I raised here, or your
> >> >>>>>>> reply thereto? If we intercept PCI_SRIOV_CTRL, we know when VFs
> >> >>>>>>> are created. Why still demand Dom0 to report them then?
> >> >>>>>>>
> >> >>>>>>
> >> >>>>>> The spec states that VFs can take up to 100ms after the VF_ENABLE
> >> >>>>>> bit is set to become alive. We discussed in V2 that it is not
> >> >>>>>> acceptable to do a required 100ms wait in Xen while blocking a
> >> >>>>>> domain. And not doing that blocking would require some mechanism
> >> >>>>>> to only allow a domain to run for precisely 99(or more?)ms. You
> >> >>>>>> yourself suggested that we can trust the hardware domain with
> >> >>>>>> registering VFs if we already trust it with other PCI-related
> >> >>>>>> stuff. Did you change your mind, or am I completely
> >> >>>>>> misunderstanding this question?
> >> >>>>>
> >> >>>>> No, I still think that we can trust hwdom enough. Nevertheless we
> >> >>>>> should aim at being independent of it where possible. And I seem
> >> >>>>> to recall that I had also outlined an approach for how to avoid
> >> >>>>> spin-waiting for 100ms in the hypervisor.
> >> >>>>
> >> >>>> I want to clarify: you are saying that Xen should not wait for
> >> >>>> hwdom to report VFs and instead should create them by itself. Is
> >> >>>> this correct?
> >> >>>
> >> >>> If that's technically possible, yes.
> >> >>
> >> >> Okay, so let's clear this up. If I remember correctly, you discussed
> >> >> this with Mykyta in the previous version and suggested putting the
> >> >> vCPU to sleep for 100ms.
> >> >
> >> > I don't think I did (except perhaps from a very abstract perspective),
> >> > precisely because of ...
> >> >
> >> >> I don't think that this is a good idea, because the guest
> >> >> kernel will not be happy about that.
> >> >
> >> > ... this. Instead iirc I suggested to refuse (short-circuit) handling
> >> > VF register accesses for the next 100ms.
> >> >
> >> > Jan
> >>
> >> Do you have any suggestions on how to ensure that we accurately catch
> >> the window where 100ms have already passed, but guests haven’t tried to
> >> read anything yet, to flip this back? As I mentioned in the previous
> >> version, Linux, for example, doesn’t attempt to re-read anything if the
> >> first read failed after 100ms. So it appears to me that this approach
> >> would be prone to racing with the guest for getting to the VF first. One
> >> approach I can think of is to somehow swap the register handlers back
> >> in-flight during the first read by the guest if 100ms have already
> >> passed. However, this would still depend on Dom0 for registering VFs,
> >> but in a more convoluted way. We also can’t add the VFs before 100ms
> >> have passed and add timing checks to all register handlers, because
> >> pci_add_device and everything below it expects the device to be
> >> functional at the moment of addition.
> >>
> >> Maybe you see some other way to avoid these problems that I am missing?
> >
> > We could maybe do some middle ground here, kind of similar to what
> > Linux does. The overall idea would be to put on hold any accesses to
> > the device(s) PCI config space for 100ms, that would include the PF
> > and any VFs. At the point when VF enable is set Xen already knows the
> > position of the VFs in the PCI config space.
> >
> > Any PCI config space access attempts to the PF or VFs during that
> > 100ms window would cause the guest vCPU to be put on hold, and the
> > access would only be retried once the 100ms window has passed and Xen
> > has registered the VFs with vPCI. This approach needs extra logic to
> > put vPCI accesses on hold, similar to what Xen does when mapping a BAR
> > into the p2m, and a timer to defer the adding of the VFs and the
> > unlocking of the affected PCI config space region.
> >
> > That would be a middle ground IMO, as the guest vCPUs could be running
> > freely, unless accesses to the affected PCI config space were attempted
> > before the 100ms window expired, at which point they would be blocked
> > waiting for the timeout to expire. A well-behaved domain shouldn't try
> > to access the PCI config space ahead of the 100ms window expiring
> > either.
>
> This approach seems reasonable to me, but it would require big changes
> in the vPCI logic, as pci_add_device() would now need the ability to
> defer all config space accesses until the VFs are ready, and in the
> meantime we'd have to deal with a half-initialized pdev. The PCI/vPCI
> logic is already convoluted enough, and adding more intermediate states
> that need to be dealt with in different places will make things even
> worse. Unless I'm missing some easy fix, of course...
You could just defer the pci_add_device() call until the 100ms window
from having enabled the VFs has passed? The only requirement would be
blocking PCI config space accesses to the VFs during those 100ms. You
could use a bitmap to signal which SBDFs should be rejected in
vpci_{read,write}. There's already logic in those functions to reject
accesses.
> What I am trying to say is that your suggestion is technically doable,
> but requires lots of work, and we don't have the resources for this
> right now. So, what's your opinion on the existing approach? Is relying
> on a domain to introduce VFs such a bad idea?
IMO yes, requiring the usage of a hypercall (or any other side-band
interface) when not strictly required is just adding a handicap for
OSes that then need to be ported to run on Xen. It might seem easier
at first, but adding and maintaining such side-band interfaces in OSes
tends to be cumbersome and prone to errors.
We could have avoided introducing vPCI at the cost of adding a
completely new side-band interface to manage PCI config space
accesses and interrupts on x86 PVH, yet we didn't do it because,
albeit easier to implement from the Xen side, it would have had a huge
cost for OSes that want to run in PVH mode.
Thanks, Roger.