
Re: [PATCH v3 5/7] vpci: add SR-IOV support for PVH Dom0


  • To: Volodymyr Babchuk <Volodymyr_Babchuk@xxxxxxxx>
  • From: Roger Pau Monné <roger.pau@xxxxxxxxxx>
  • Date: Tue, 12 May 2026 15:22:25 +0200
  • Cc: Mykyta Poturai <Mykyta_Poturai@xxxxxxxx>, Jan Beulich <jbeulich@xxxxxxxx>, "Daniel P. Smith" <dpsmith@xxxxxxxxxxxxxxxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>, Stewart Hildebrand <stewart.hildebrand@xxxxxxx>
  • Delivery-date: Tue, 12 May 2026 13:22:51 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On Tue, May 12, 2026 at 10:28:05AM +0000, Volodymyr Babchuk wrote:
> Hi Roger,
> 
> Roger Pau Monné <roger.pau@xxxxxxxxxx> writes:
> 
> > On Tue, May 12, 2026 at 07:32:20AM +0000, Mykyta Poturai wrote:
> >> 
> >> 
> >> On 5/12/26 09:20, Jan Beulich wrote:
> >> > On 11.05.2026 16:10, Volodymyr Babchuk wrote:
> >> >> Hi Jan,
> >> >>
> >> >> Jan Beulich <jbeulich@xxxxxxxx> writes:
> >> >>
> >> >>> On 07.05.2026 22:40, Volodymyr Babchuk wrote:
> >> >>>> Jan Beulich <jbeulich@xxxxxxxx> writes:
> >> >>>>> On 06.05.2026 11:39, Mykyta Poturai wrote:
> >> >>>>>> On 5/4/26 08:37, Jan Beulich wrote:
> >> >>>>>>> On 23.04.2026 12:12, Mykyta Poturai wrote:
> >> >>>>>>>> On 4/21/26 17:43, Jan Beulich wrote:
> >> >>>>>>>>> On 09.04.2026 16:01, Mykyta Poturai wrote:
> >> >>>>>>>>>> From: Stewart Hildebrand <stewart.hildebrand@xxxxxxx>
> >> >>>>>>>>>>
> >> >>>>>>>>>> This code is expected to only be used by privileged domains,
> >> >>>>>>>>>> unprivileged domains should not get access to the SR-IOV 
> >> >>>>>>>>>> capability.
> >> >>>>>>>>>>
> >> >>>>>>>>>> Implement RW handlers for PCI_SRIOV_CTRL register to dynamically
> >> >>>>>>>>>> map/unmap VF BARS. Recalculate BAR sizes before mapping VFs to 
> >> >>>>>>>>>> account
> >> >>>>>>>>>> for possible changes in the system page size register. Also 
> >> >>>>>>>>>> force VFs to
> >> >>>>>>>>>> always use emulated reads for command register, this is needed 
> >> >>>>>>>>>> to
> >> >>>>>>>>>> prevent some drivers accidentally unmapping BARs.
> >> >>>>>>>>>
> >> >>>>>>>>> This apparently refers to the change to vpci_init_header(). 
> >> >>>>>>>>> Writes are
> >> >>>>>>>>> already intercepted. How would a read lead to accidental BAR 
> >> >>>>>>>>> unmap? Even
> >> >>>>>>>>> for writes I don't see how a VF driver could accidentally unmap 
> >> >>>>>>>>> BARs, as
> >> >>>>>>>>> the memory decode bit there is hardwired to 0.
> >> >>>>>>>>>
> >> >>>>>>>>>> Discovery of VFs is
> >> >>>>>>>>>> done by Dom0, which must register them with Xen.
> >> >>>>>>>>>
> >> >>>>>>>>> If we intercept control register writes, why would we still 
> >> >>>>>>>>> require
> >> >>>>>>>>> Dom0 to report the VFs that appear?
> >> >>>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>> Sorry, I don't understand this question. You specifically 
> >> >>>>>>>> requested this
> >> >>>>>>>> to be done this way in V2. Quoting your reply from V2 below.
> >> >>>>>>>>
> >> >>>>>>>>    > Aren't you effectively busy-waiting for these 100ms, by 
> >> >>>>>>>> simply
> >> >>>>>>>> returning "true"
> >> >>>>>>>>    > from vpci_process_pending() until the time has passed? This 
> >> >>>>>>>> imo is a
> >> >>>>>>>> no-go. You
> >> >>>>>>>>    > want to set a timer and put the vCPU to sleep, to wake it up 
> >> >>>>>>>> again
> >> >>>>>>>> when the
> >> >>>>>>>>    > timer has expired. That'll then eliminate the need for the
> >> >>>>>>>> not-so-nice patch 4.
> >> >>>>>>>>
> >> >>>>>>>>    > Question is whether we need to actually go this far (right 
> >> >>>>>>>> away). I
> >> >>>>>>>> expect you
> >> >>>>>>>>    > don't mean to hand PFs to DomU-s. As long as we keep them in 
> >> >>>>>>>> the hardware
> >> >>>>>>>>    > domain, can't we trust it to set things up correctly, just 
> >> >>>>>>>> like we
> >> >>>>>>>> trust it in
> >> >>>>>>>>    > a number of other aspects?
> >> >>>>>>>
> >> >>>>>>> How's any of this related to the question I raised here, or your 
> >> >>>>>>> reply
> >> >>>>>>> thereto? If we intercept PCI_SRIOV_CTRL, we know when VFs are 
> >> >>>>>>> created.
> >> >>>>>>> Why still demand Dom0 to report them then?
> >> >>>>>>>
> >> >>>>>>
> >> >>>>>> The spec states that VFs can take up to 100ms after the VF_ENABLE 
> >> >>>>>> bit is
> >> >>>>>> set to become alive. We discussed in the V2 that it is not 
> >> >>>>>> acceptable to
> >> >>>>>> do a required 100ms wait in Xen while blocking a domain. And not 
> >> >>>>>> doing
> >> >>>>>> that blocking would require some mechanism to only allow a domain 
> >> >>>>>> to run
> >> >>>>>> for precisely 99(or more?)ms. You yourself suggested that we can 
> >> >>>>>> trust
> >> >>>>>> the hardware domain with registering VFs if we already trust it with
> >> >>>>>> other PCI-related stuff. Did you change your mind, or am I 
> >> >>>>>> completely
> >> >>>>>> misunderstanding this question?
> >> >>>>>
> >> >>>>> No, I still think that we can trust hwdom enough. Nevertheless we 
> >> >>>>> should
> >> >>>>> aim at being independent of it where possible. And I seem to recall 
> >> >>>>> that
> >> >>>>> I had also outlined an approach how to avoid spin-waiting for 100ms 
> >> >>>>> in
> >> >>>>> the hypervisor.
> >> >>>>
> >> >>>> I want to clarify: you are telling that Xen should not wait for hwdom 
> >> >>>> to
> >> >>>> report VFs and instead create them by itself. Is this correct?
> >> >>>
> >> >>> If that's technically possible, yes.
> >> >>
> >> >> Okay, so let's clear this. If I remember correct, you discussed this
> >> >> with Mykyta in the previous version and suggested to put the vCPU to
> >> >> sleep for 100ms.
> >> > 
> >> > I don't think I did (except perhaps from a very abstract perspective),
> >> > precisely because of ...
> >> > 
> >> >> I don't think that this is a good idea, because guest
> >> >> kernel will not be happy about that.
> >> > 
> >> > ... this. Instead iirc I suggested to refuse (short-circuit) handling
> >> > VF register accesses for the next 100ms.
> >> > 
> >> > Jan
> >> 
> >> Do you have any suggestions on how to ensure that we accurately catch 
> >> the window where 100ms have already passed, but guests haven’t tried to 
> >> read anything yet, to flip this back? As I mentioned in the previous 
> >> version, Linux, for example, doesn’t attempt to re-read anything if the 
> >> first read failed after 100ms. So it appears to me that this approach 
> >> would be prone to racing with the guest for getting to the VF first. One 
> >> approach I can think of is to somehow swap the register handlers back 
> >> in-flight during the first read by the guest if 100ms have already 
> >> passed. However, this would still depend on Dom0 for registering VFs, 
> >> but in a more convoluted way. We also can’t add the VFs before 100ms 
> >> have passed and add timing checks to all register handlers, because 
> >> pci_add_device and everything below it expects the device to be 
> >> functional at the moment of addition.
> >> 
> >> 
> >> 
> >> Maybe you see some other way to avoid these problems that I am missing?
> >
> > We could maybe do some middle ground here, kind of similar to what
> > Linux does.  The overall idea would be to put on hold any accesses to
> > the device(s)' PCI config space for 100ms; that would include the PF
> > and any VFs.  At the point when VF Enable is set Xen already knows the
> > position of the VFs in the PCI config space.
> >
> > Any PCI config space access attempts to the PF or VFs during that
> > 100ms window would cause the guest vCPU to be put on hold, and the
> > access would only be retried once the 100ms window has passed and Xen
> > has registered the VFs with vPCI.  This approach needs extra logic to
> > put vPCI accesses on hold, similar to what Xen does when mapping a BAR
> > into the p2m, and a timer to defer the adding of the VFs and the
> > unlocking of the affected PCI config space region.
> >
> > That would be a middle ground IMO, as the guest vCPUs could keep
> > running freely, unless accesses to the affected PCI config space were
> > attempted before the 100ms window has passed, at which point they would
> > be blocked waiting for the timeout to expire.  A well-behaved domain
> > shouldn't attempt to access that PCI config space ahead of the 100ms
> > window expiring anyway.
> 
> This approach seems reasonable to me, but it would require big changes
> in the vPCI logic, as pci_add_device() now needs the ability to defer
> all config space accesses until the VFs are ready, and in the meantime
> we'll have to deal with a half-initialized pdev.  The PCI/vPCI logic is
> already convoluted enough, and adding more intermediate states that need
> to be dealt with in different places will make things even worse.
> Unless I'm missing some easy fix, of course...

You could just defer the pci_add_device() call until after the 100ms
window that starts when VF Enable is set?  The only requirement would be
blocking PCI config space accesses to the VFs during those 100ms.  You
could use a bitmap to signal which SBDFs should be rejected in
vpci_{read,write}().  There's already logic in those functions to reject
accesses.
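
Something along these lines, perhaps (completely untested sketch; the
helper names vf_blocked_map, sriov_vf_enable() and sriov_defer_timeout()
are made up for illustration, and locking and segment handling are
omitted):

/* Bitmap of BDFs whose config space accesses are temporarily rejected. */
static DECLARE_BITMAP(vf_blocked_map, 1U << 16);

/* Assumed to have been initialized with init_timer() during vPCI setup. */
static struct timer sriov_defer_timer;

/* VFs discovered from the PF SR-IOV capability, pending registration. */
static pci_sbdf_t pending_vf[8];
static unsigned int nr_pending_vf;

/* Called from the PCI_SRIOV_CTRL write handler when VF Enable is set. */
static void sriov_vf_enable(void)
{
    unsigned int i;

    for ( i = 0; i < nr_pending_vf; i++ )
        __set_bit(pending_vf[i].bdf, vf_blocked_map);

    /* PCIe mandates waiting up to 100ms before VFs can be accessed. */
    set_timer(&sriov_defer_timer, NOW() + MILLISECS(100));
}

/* Timer callback: VFs are now usable, register and unblock them. */
static void sriov_defer_timeout(void *data)
{
    unsigned int i;

    for ( i = 0; i < nr_pending_vf; i++ )
    {
        /* Registration of the VF with vPCI would go here. */
        __clear_bit(pending_vf[i].bdf, vf_blocked_map);
    }
}

/* Checked in vpci_{read,write}() next to the existing rejection logic. */
static bool vf_access_blocked(pci_sbdf_t sbdf)
{
    return test_bit(sbdf.bdf, vf_blocked_map);
}

The VF SBDFs themselves can be derived from the PF's SR-IOV capability
(First VF Offset, VF Stride and NumVFs) at the point where VF Enable is
written, so no side-band reporting from Dom0 would be needed to learn
them.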

> What I am trying to say is that your suggestion is technically doable,
> but it requires a lot of work, and we don't have the resources for this
> right now.  So, what's your opinion on the existing approach?  Is
> relying on a domain to introduce VFs such a bad idea?

IMO yes, requiring the use of a hypercall (or any other side-band
interface) when not strictly required just adds a handicap for OSes
that then need to be ported to run on Xen.  It might seem easier at
first, but adding and maintaining such side-band interfaces in OSes
tends to be cumbersome and error-prone.

We could have avoided introducing vPCI at the cost of adding a
completely new side-band interface to manage PCI config space
accesses and interrupts on x86 PVH, yet we didn't, because albeit
easier to implement on the Xen side, it would have imposed a huge cost
on OSes that want to run in PVH mode.

Thanks, Roger.



 

