Re: [PATCH v3 5/7] vpci: add SR-IOV support for PVH Dom0
On Tue, May 12, 2026 at 07:32:20AM +0000, Mykyta Poturai wrote:
> On 5/12/26 09:20, Jan Beulich wrote:
> > On 11.05.2026 16:10, Volodymyr Babchuk wrote:
> >> Hi Jan,
> >>
> >> Jan Beulich <jbeulich@xxxxxxxx> writes:
> >>
> >>> On 07.05.2026 22:40, Volodymyr Babchuk wrote:
> >>>> Jan Beulich <jbeulich@xxxxxxxx> writes:
> >>>>> On 06.05.2026 11:39, Mykyta Poturai wrote:
> >>>>>> On 5/4/26 08:37, Jan Beulich wrote:
> >>>>>>> On 23.04.2026 12:12, Mykyta Poturai wrote:
> >>>>>>>> On 4/21/26 17:43, Jan Beulich wrote:
> >>>>>>>>> On 09.04.2026 16:01, Mykyta Poturai wrote:
> >>>>>>>>>> From: Stewart Hildebrand <stewart.hildebrand@xxxxxxx>
> >>>>>>>>>>
> >>>>>>>>>> This code is expected to only be used by privileged domains;
> >>>>>>>>>> unprivileged domains should not get access to the SR-IOV
> >>>>>>>>>> capability.
> >>>>>>>>>>
> >>>>>>>>>> Implement RW handlers for the PCI_SRIOV_CTRL register to
> >>>>>>>>>> dynamically map/unmap VF BARs. Recalculate BAR sizes before
> >>>>>>>>>> mapping VFs to account for possible changes in the system
> >>>>>>>>>> page size register. Also force VFs to always use emulated
> >>>>>>>>>> reads for the command register; this is needed to prevent
> >>>>>>>>>> some drivers from accidentally unmapping BARs.
> >>>>>>>>>
> >>>>>>>>> This apparently refers to the change to vpci_init_header().
> >>>>>>>>> Writes are already intercepted. How would a read lead to an
> >>>>>>>>> accidental BAR unmap? Even for writes I don't see how a VF
> >>>>>>>>> driver could accidentally unmap BARs, as the memory decode
> >>>>>>>>> bit there is hardwired to 0.
> >>>>>>>>>
> >>>>>>>>>> Discovery of VFs is done by Dom0, which must register them
> >>>>>>>>>> with Xen.
> >>>>>>>>>
> >>>>>>>>> If we intercept control register writes, why would we still
> >>>>>>>>> require Dom0 to report the VFs that appear?
> >>>>>>>>
> >>>>>>>> Sorry, I don't understand this question. You specifically
> >>>>>>>> requested this to be done this way in V2. Quoting your reply
> >>>>>>>> from V2 below.
> >>>>>>>>
> >>>>>>>> > Aren't you effectively busy-waiting for these 100ms, by
> >>>>>>>> > simply returning "true" from vpci_process_pending() until
> >>>>>>>> > the time has passed? This imo is a no-go. You want to set a
> >>>>>>>> > timer and put the vCPU to sleep, to wake it up again when
> >>>>>>>> > the timer has expired. That'll then eliminate the need for
> >>>>>>>> > the not-so-nice patch 4.
> >>>>>>>>
> >>>>>>>> > Question is whether we need to actually go this far (right
> >>>>>>>> > away). I expect you don't mean to hand PFs to DomU-s. As
> >>>>>>>> > long as we keep them in the hardware domain, can't we trust
> >>>>>>>> > it to set things up correctly, just like we trust it in a
> >>>>>>>> > number of other aspects?
> >>>>>>>
> >>>>>>> How's any of this related to the question I raised here, or
> >>>>>>> your reply thereto? If we intercept PCI_SRIOV_CTRL, we know
> >>>>>>> when VFs are created. Why still demand Dom0 to report them
> >>>>>>> then?
> >>>>>>
> >>>>>> The spec states that VFs can take up to 100ms after the
> >>>>>> VF_ENABLE bit is set to become alive. We discussed in V2 that it
> >>>>>> is not acceptable to do a required 100ms wait in Xen while
> >>>>>> blocking a domain, and not doing that blocking would require
> >>>>>> some mechanism to only allow a domain to run for precisely 99
> >>>>>> (or more?) ms. You yourself suggested that we can trust the
> >>>>>> hardware domain with registering VFs if we already trust it with
> >>>>>> other PCI-related stuff. Did you change your mind, or am I
> >>>>>> completely misunderstanding this question?
> >>>>>
> >>>>> No, I still think that we can trust hwdom enough. Nevertheless we
> >>>>> should aim at being independent of it where possible. And I seem
> >>>>> to recall that I had also outlined an approach for how to avoid
> >>>>> spin-waiting for 100ms in the hypervisor.
> >>>>
> >>>> I want to clarify: you are saying that Xen should not wait for
> >>>> hwdom to report VFs and should instead create them by itself. Is
> >>>> this correct?
> >>>
> >>> If that's technically possible, yes.
> >>
> >> Okay, so let's clear this up. If I remember correctly, you discussed
> >> this with Mykyta in the previous version and suggested putting the
> >> vCPU to sleep for 100ms.
> >
> > I don't think I did (except perhaps from a very abstract
> > perspective), precisely because of ...
> >
> >> I don't think that this is a good idea, because the guest kernel
> >> will not be happy about that.
> >
> > ... this. Instead IIRC I suggested to refuse (short-circuit) handling
> > of VF register accesses for the next 100ms.
> >
> > Jan
>
> Do you have any suggestions on how to ensure that we accurately catch
> the window where 100ms have already passed, but guests haven't tried to
> read anything yet, to flip this back? As I mentioned in the previous
> version, Linux, for example, doesn't attempt to re-read anything if the
> first read failed after 100ms. So it appears to me that this approach
> would be prone to racing with the guest for getting to the VF first.
> One approach I can think of is to somehow swap the register handlers
> back in-flight during the first read by the guest if 100ms have already
> passed. However, this would still depend on Dom0 for registering VFs,
> but in a more convoluted way. We also can't add the VFs before 100ms
> have passed and add timing checks to all register handlers, because
> pci_add_device() and everything below it expects the device to be
> functional at the moment of addition.
>
> Maybe you see some other way to avoid these problems that I am missing?

We could maybe do some middle ground here, kind of similar to what Linux
does.

The overall idea would be to put on hold any accesses to the device's
PCI config space for 100ms; that would include the PF and any VFs. At
the point when VF enable is set, Xen already knows the position of the
VFs in the PCI config space. Any PCI config space access attempt to the
PF or VFs during that 100ms window would cause the guest vCPU to be put
on hold, and the access would only be retried once the 100ms window has
passed and Xen has registered the VFs with vPCI.

This approach needs extra logic to put vPCI accesses on hold, similar to
what Xen does when mapping a BAR into the p2m, and a timer to defer the
adding of the VFs and the unlocking of the affected PCI config space
region.

That would be a middle ground IMO, as the guest vCPUs could run freely
unless an access to the affected PCI config space was attempted before
the 100ms window expired, at which point they would be blocked waiting
for the timeout to expire. A well-behaved domain shouldn't try to access
the PCI config space ahead of the 100ms window expiring anyway.

Roger.
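[Editor's note: a minimal sketch of the "middle ground" Roger describes above. This is not Xen code; all identifiers are hypothetical. It models only the timing decision: a write to PCI_SRIOV_CTRL setting VF_ENABLE records a timestamp (real code would also arm a one-shot timer), and any PF/VF config space access inside the 100ms window reports that the vCPU must be put on hold.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VF_READY_DELAY_MS 100  /* PCIe SR-IOV: VFs may need up to 100ms */

/* Hypothetical per-PF state; not an actual Xen structure. */
struct sriov_state {
    bool vf_enable;            /* mirrored VF_ENABLE bit */
    uint64_t enable_stamp_ms;  /* time of the VF_ENABLE write */
    bool vfs_registered;       /* set by the deferred timer callback */
};

/* Dom0 writes PCI_SRIOV_CTRL with VF_ENABLE set. */
static void sriov_ctrl_write_enable(struct sriov_state *s, uint64_t now_ms)
{
    s->vf_enable = true;
    s->enable_stamp_ms = now_ms;
    /* Real code: arm a one-shot timer for now_ms + VF_READY_DELAY_MS,
     * whose callback registers the VFs with vPCI and wakes any vCPUs
     * paused in config_access_must_defer(). */
}

/* One-shot timer callback, fired once the 100ms window has expired. */
static void vf_ready_timeout(struct sriov_state *s)
{
    s->vfs_registered = true;
}

/* Called on any PF/VF config space access: true => hold the vCPU and
 * retry the access after the timer has run. */
static bool config_access_must_defer(const struct sriov_state *s,
                                     uint64_t now_ms)
{
    if ( !s->vf_enable || s->vfs_registered )
        return false;
    return now_ms - s->enable_stamp_ms < VF_READY_DELAY_MS;
}
```

This avoids both the spin-wait objected to earlier in the thread and the race Mykyta describes: the guest can run freely, and only an access landing inside the window blocks until the timer has registered the VFs.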