
Re: [Xen-devel] [PATCH 2/2] libxl_pci: Fix guest shutdown with PCI PT attached



On Tue, Oct 15, 2019 at 06:59:37PM +0200, Sander Eikelenboom wrote:
>On 14/10/2019 17:03, Chao Gao wrote:
>> On Thu, Oct 10, 2019 at 06:13:43PM +0200, Sander Eikelenboom wrote:
>>> On 01/10/2019 12:35, Anthony PERARD wrote:
>>>> Rewrite of the commit message:
>>>>
>>>> Before the problematic commit, libxl used to ignore errors when
>>>> destroying (force == true) a passthrough device, especially errors
>>>> that happen when dealing with the DM.
>>>>
>>>> Since fae4880c45fe, if the DM fails to detach the PCI device within
>>>> the allowed time, the timeout error raised not only skips part of
>>>> pci_remove_*, but is also propagated up to the caller of
>>>> libxl__device_pci_destroy_all, libxl__destroy_domid, and thus the
>>>> destruction of the domain fails.
>>>>
>>>> In this patch, if the DM hasn't confirmed that the device is removed,
>>>> we print a warning and keep going if force=true.  The patch reorders
>>>> the functions so that pci_remove_timeout() calls pci_remove_detatched()
>>>> just as is done when the DM calls are successful.
>>>>
>>>> We also clean up the QMP states and associated timeouts earlier, as
>>>> soon as they are no longer needed.
>>>>
>>>> Reported-by: Sander Eikelenboom <linux@xxxxxxxxxxxxxx>
>>>> Fixes: fae4880c45fe015e567afa223f78bf17a6d98e1b
>>>> Signed-off-by: Anthony PERARD <anthony.perard@xxxxxxxxxx>
>>>>
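(For illustration, here is a rough sketch of the control flow the commit
message describes. This is not the actual libxl code: the real
implementation is asynchronous and built on libxl's event machinery, and
apart from pci_remove_timeout() and pci_remove_detatched() every name
below is made up.)

#include <stdbool.h>
#include <stdio.h>

/* Invented state structure: the real code keeps its state in libxl's
 * internal structures and runs through the libxl event machinery. */
struct pci_remove_state {
    bool force;   /* true when the domain is being destroyed */
};

/* Stub standing in for the function that continues the teardown once
 * the DM has (or is assumed to have) released the device. */
static void pci_remove_detatched(struct pci_remove_state *prs)
{
    printf("continuing PCI teardown (force=%d)\n", prs->force);
}

/* Called when the DM does not confirm the removal within the allowed
 * time. */
static void pci_remove_timeout(struct pci_remove_state *prs)
{
    fprintf(stderr, "warning: timed out waiting for DM to remove device\n");

    /* Previously, this timeout skipped the rest of pci_remove_* and
     * propagated an error up through libxl__device_pci_destroy_all(),
     * failing the whole domain destruction.  With the fix, a forced
     * removal only warns and then takes the same path as a successful
     * DM call. */
    if (prs->force)
        pci_remove_detatched(prs);
}

int main(void)
{
    struct pci_remove_state prs = { .force = true };
    pci_remove_timeout(&prs);
    return 0;
}
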
>>>
>>> Hi Anthony / Chao,
>>>
>>> I have to come back to this, because perhaps there is an underlying
>>> issue.
>>> It had occurred to me earlier that the VM to which I pass through most
>>> PCI devices (8 to be exact) became very slow to shut down, but I didn't
>>> investigate it further.
>>>
>>> After your commit message for this patch it kept nagging me, so today I
>>> did some testing and bisecting.
>>>
>>> The difference in tear-down time, at least judging by what the IOMMU
>>> code logs, is quite large:
>>>
>>> xen-4.12.0
>>>     Setup:      7.452 s
>>>     Tear-down:  7.626 s
>>>
>>> xen-unstable-ee7170822f1fc209f33feb47b268bab35541351d
>>>     Setup:      7.468 s
>>>     Tear-down: 50.239 s
>>>
>>> Bisection turned up:
>>>     commit c4b1ef0f89aa6a74faa4618ce3efed1de246ec40
>>>     Author: Chao Gao <chao.gao@xxxxxxxxx>
>>>     Date:   Fri Jul 19 10:24:08 2019 +0100
>>>     libxl_qmp: wait for completion of device removal
>>>
>>> Which makes me wonder if there is something going wrong in QEMU?
> 
>> Hi Sander,
>Hi Chao,
>
>> 
>> Thanks for your testing and the bisection.
>> 
>> I tried on my machine; the destruction time of a guest with 8 pass-through
>> devices increased from 4s to 12s after applying the commit above.
>
>To which patch are you referring, Anthony's or
>c4b1ef0f89aa6a74faa4618ce3efed1de246ec40?

The latter.

>
>> In my understanding, I guess you might get the error message "timed out
>> waiting for DM to remove...". There might be some issue with your assigned
>> devices' drivers. You could first unbind the devices from their drivers in
>> the VM and then tear down the VM, and check whether the VM teardown gets
>> much faster.
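(A minimal sketch of what "unbind the devices from their drivers in the
VM" could look like when done from inside the guest via sysfs before
shutting it down. The BDF used below is only an example; the guest-visible
address of each passed-through device would be used instead.)

#include <stdio.h>

/* Write the device's BDF to its driver's "unbind" file in sysfs. */
static int unbind_pci_device(const char *bdf)
{
    char path[128];
    snprintf(path, sizeof(path),
             "/sys/bus/pci/devices/%s/driver/unbind", bdf);

    FILE *f = fopen(path, "w");
    if (!f) {
        perror(path);
        return -1;
    }
    fputs(bdf, f);
    fclose(f);
    return 0;
}

int main(void)
{
    /* Example guest-visible address; repeat for every assigned device. */
    return unbind_pci_device("0000:00:05.0") ? 1 : 0;
}
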
>
>I get that error message when I test with Anthony's patch applied; the
>destruction time with that patch is low.
>
>However, my point was whether that patch is correct, in the sense that
>there seems to be an underlying issue which causes the removal to take so
>long. That issue was uncovered by c4b1ef0f89aa6a74faa4618ce3efed1de246ec40,
>so I'm not saying that commit is wrong in any sense; it just uncovered
>another issue that was already present, but hard to detect, because we
>simply didn't wait at destruction time (and thus got the same effect as a
>timeout).

Actually, the long teardown time is introduced by c4b1ef0f89, though that
commit did fix another issue.
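(To make the discussion concrete, this is roughly the kind of waiting that
c4b1ef0f89 "libxl_qmp: wait for completion of device removal" added,
reduced to a synchronous sketch. The real libxl code is asynchronous and
event-driven; dm_confirmed_removal() below is a stand-in, not a real
function, and the timeout value is only illustrative.)

#include <stdbool.h>
#include <unistd.h>

#define REMOVE_TIMEOUT_SECONDS 10   /* illustrative value only */

/* Stand-in for asking the device model whether the hot-unplug of the
 * given device has completed; always "not yet" in this sketch. */
static bool dm_confirmed_removal(const char *devid)
{
    (void)devid;
    return false;
}

/* Before c4b1ef0f89 libxl did not wait at all, so a guest driver that is
 * slow to release the device went unnoticed.  With the waiting in place,
 * that same slowness shows up directly as teardown time. */
static int wait_for_device_removal(const char *devid)
{
    for (int waited = 0; waited < REMOVE_TIMEOUT_SECONDS; waited++) {
        if (dm_confirmed_removal(devid))
            return 0;       /* DM confirmed the unplug */
        sleep(1);
    }
    return -1;              /* timed out: the path Anthony's patch now
                               tolerates when force=true */
}

int main(void)
{
    return wait_for_device_removal("example-device-id") ? 1 : 0;
}
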

>
>One way or the other, that was just a minor issue until
>fae4880c45fe015e567afa223f78bf17a6d98e1b, where the long destruction time
>caused the domain destruction to stall, which was then fixed by Anthony's
>patch. But that patch uses a timeout, which kind of circumvents the issue
>instead of finding out where it comes from and solving it there (if that
>is possible, of course).
>
>And I wonder whether Anthony's patch doesn't interfere with the case you
>made c4b1ef0f89aa6a74faa4618ce3efed1de246ec40 for: if you get the timeout
>error message as well, then that is kind of not waiting for the removal to
>finish, isn't it?
>
>Chao,
>could you perhaps test Xen for me with
>ee7170822f1fc209f33feb47b268bab35541351d as the latest commit?
>That is before Anthony's patch series, but after your
>c4b1ef0f89aa6a74faa4618ce3efed1de246ec40.

That's actually what I did. VM teardown with 8 pass-through devices on my
side takes 12s, whereas it only took 4s without my patch.

Thanks
Chao

 

