
Re: [Xen-devel] panic("queue invalidate wait descriptor was not executed\n")




> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@xxxxxxxx]
> Sent: Thursday, May 12, 2016 5:49 AM
> To: Zytaruk, Kelly
> Cc: Feng Wu; Kevin Tian; xen-devel@xxxxxxxxxxxxx
> Subject: Re: [Xen-devel] panic("queue invalidate wait descriptor was not
> executed\n")
> 
> >>> On 11.05.16 at 15:51, <Kelly.Zytaruk@xxxxxxx> wrote:
> > During Xen boot I am seeing the panic in the subject line from
> > .../xen/drivers/passthrough/vgt/qinval.c
> 
> And this is with current staging, or some much older version of Xen?
> (ISTR some issue with the invalidation request getting sent to the wrong
> IOMMU, leading to a timeout.)

No, this is not current Xen; it is 4.2.

Can you tell me more about the problem of invalidation requests being sent to 
the wrong IOMMU, and approximately when it was fixed?  If you could identify the 
patch, I could backport it into my copy of Xen for testing.

This is a NUMA system with 2 IOMMUs.
I have 4 devices on 2 PCIe cards (2 per card).
They reside at the following locations: 3:0.0, 5:0.0, 83:0.0 and 85:0.0.
From what I understand about NUMA, based on the BDFs, 2 devices should be on 
one IOMMU and the other 2 should be on the other IOMMU.

I put in some more print statements last night and discovered that during boot 
Xen attaches all 4 devices to the same IOMMU structure.  Xen sends out a flush 
to all 4 devices on the first IOMMU and then follows it with a Wait 
invalidation packet to the same IOMMU.  Below is what I am seeing:

(XEN) IOMMU LIST - List of defined IOMMU structures
(XEN) iommu[00] @ ffff83103fffa5c0, Q=2060c04002, HEAD=90, TAIL=90
(XEN)     Seq Num = 0, pt_levels = 4, cap = 0x00d2078c106f0466, ecap = 0x0000000000f020df, domid_bitmap = 1, domid_map=0x0
(XEN) iommu[01] @ ffff83103fffa790, Q=103ffec002, HEAD=bd0, TAIL=bd0
(XEN)     Seq Num = 1, pt_levels = 4, cap = 0x00d2078c106f0466, ecap = 0x0000000000f020df, domid_bitmap = 1, domid_map=0x0

(XEN) gen_dev_iotlb_inv_dsc - DEVICE IOTLB Descriptor 0x7ffffffffffff001 0x0000830000000003 for 83:00.0 (index = 9), iommu = ffff83103fffa5c0, fault = 0x00000000
(XEN) gen_dev_iotlb_inv_dsc - DEVICE IOTLB Descriptor 0x7ffffffffffff001 0x0000810000000003 for 81:00.0 (index = 10), iommu = ffff83103fffa5c0, fault = 0x00000000
(XEN) gen_dev_iotlb_inv_dsc - DEVICE IOTLB Descriptor 0x7ffffffffffff001 0x0000050000000003 for 05:00.0 (index = 11), iommu = ffff83103fffa5c0, fault = 0x00000000
(XEN) gen_dev_iotlb_inv_dsc - DEVICE IOTLB Descriptor 0x7ffffffffffff001 0x0000030000000003 for 03:00.0 (index = 12), iommu = ffff83103fffa5c0, fault = 0x00000000
(XEN) queue_invalidate_wait (iommu = ffff83103fffa5c0)
(XEN) ****************************************
(XEN) Panic on CPU 0:
(XEN) queue invalidate wait descriptor was not executed
(XEN) ****************************************
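
As a cross-check on the log above, the HEAD value printed for iommu[00] can be related to the descriptor indexes. This is only a sketch: it assumes that HEAD and TAIL are raw hexadecimal values of the invalidation queue head/tail registers, and that the queue index sits in bits 18:4 of those registers (16-byte descriptors), as I read the VT-d specification. The names `QINVAL_INDEX_SHIFT` and `queue_index` are chosen here for illustration.

```python
# Sketch: convert a raw invalidation-queue head/tail register value into a
# descriptor index. Assumes 128-bit (16-byte) descriptors, so the index
# occupies bits 18:4 of the register value.

QINVAL_INDEX_SHIFT = 4      # log2 of the 16-byte descriptor size
QINVAL_INDEX_MASK = 0x7FFF  # bits 18:4 give a 15-bit index


def queue_index(reg_value):
    """Return the descriptor index encoded in a head/tail register value."""
    return (reg_value >> QINVAL_INDEX_SHIFT) & QINVAL_INDEX_MASK


# HEAD/TAIL values taken from the IOMMU list in the log (assumed to be hex).
for name, head, tail in (("iommu[00]", 0x90, 0x90), ("iommu[01]", 0xBD0, 0xBD0)):
    print(name, "head index =", queue_index(head),
          "tail index =", queue_index(tail),
          "(empty)" if head == tail else "(pending)")
```

Under these assumptions HEAD=90 on iommu[00] corresponds to index 9, which is the index reported for the first DEVICE IOTLB descriptor, so the descriptors do appear to be queued on the first IOMMU's queue starting at its head.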

Is it a bug to have all 4 devices on the same IOMMU?  Is this why the Wait 
Invalidation is failing?
Actually, I am not sure whether Xen is attaching all 4 devices to the same IOMMU 
or generating the dev iotlb descriptors incorrectly.
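
To check the descriptors themselves, here is a minimal sketch that decodes the two 64-bit values printed for each DEVICE IOTLB descriptor. It assumes the second printed value is the low quadword (it carries type 0x3 and a source-id matching the BDF) and that the fields sit where the VT-d specification places them for a Device-TLB invalidate descriptor: type in bits 3:0, Max Invs Pend in bits 20:16 and source-id in bits 47:32 of the low quadword, with the S bit in bit 0 and the address in bits 63:12 of the high quadword. The function names are hypothetical.

```python
# Sketch: decode a VT-d Device-TLB invalidate descriptor as printed in the log.
# Field positions are assumptions taken from the VT-d specification layout.

def decode_dev_iotlb_dsc(hi, lo):
    """Decode the high/low quadwords of a Device-TLB invalidate descriptor."""
    sid = (lo >> 32) & 0xFFFF              # source-id: bus[15:8] dev[7:3] fn[2:0]
    return {
        "type": lo & 0xF,                  # 0x3 = Device-TLB invalidate
        "max_invs_pend": (lo >> 16) & 0x1F,
        "bus": sid >> 8,
        "dev": (sid >> 3) & 0x1F,
        "fn": sid & 0x7,
        "size": hi & 0x1,                  # S bit: 1 = range invalidation
        "addr": hi & 0xFFFFFFFFFFFFF000,   # address bits 63:12
    }


def bdf(d):
    """Format the decoded source-id as bus:dev.fn."""
    return "%02x:%02x.%x" % (d["bus"], d["dev"], d["fn"])


# (high, low) pairs copied from the log output.
LOG = [
    (0x7ffffffffffff001, 0x0000830000000003),
    (0x7ffffffffffff001, 0x0000810000000003),
    (0x7ffffffffffff001, 0x0000050000000003),
    (0x7ffffffffffff001, 0x0000030000000003),
]

for hi, lo in LOG:
    d = decode_dev_iotlb_dsc(hi, lo)
    print(bdf(d), "type=%#x size=%d addr=%#x" % (d["type"], d["size"], d["addr"]))
```

If these assumptions hold, each descriptor decodes to the BDF shown next to it in the log, so the descriptors look internally consistent. That says nothing about which IOMMU each device actually sits behind, which remains the open question.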

> 
> > From the Fault Status Register (= 0x40 (ITE)). I am seeing a hardware
> > timeout on the invalidate
> >
> > Disabling queued invalidation is not an option.  I need to find out
> > why the operation is timing out and fix it.
> >
> > I found two timeouts; one in software and one in hardware.
> > After the invalidate is submitted there is a wait packet submitted and
> > the boot software waits for the wait packet to complete in a loop with
> > a software timeout.  At the end of the software timeout it issues the
> > panic.  I can increase the software timeout but it still doesn't solve
> > the problem.  Just before the panic I dump the value of the Fault
> > Status Register and I see that the hardware has already timed out
> > (FSTS_REG = 0x40 = ITE = "Invalidation Timeout Error").  As a first
> > step in solving this I would like to increase the hardware timeout value.
> >
> > I have the Intel spec and I was reading from the spec...
> >
> > " Hardware starts an invalidation completion timer for this ITag, and
> > issues the invalidation request message to the specified endpoint. If
> > the invalidation command from software is for a first-level mapping,
> > the invalidation request message is generated with the appropriate
> > PASID prefix to identify the target PASID. The invalidation completion
> > time-out value is recommended to be sufficiently larger than the
> > PCI-Express read completion time-outs. "
> >
> > The above leads me to believe that there should be some way of setting
> > the invalidation completion time-out value.  Unfortunately I couldn't
> > find anything in the Intel spec that tells me how to set the "invalidation
> > completion time-out".   Can someone point me in the right direction to
> > setting the completion timer?
> 
> For this I guess you should have Cc-ed the VT-d maintainers, which I have now
> done.
> 
> Jan
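
For reference, the Fault Status Register value quoted above (0x40) can be decoded bit by bit. This is a small sketch that assumes the low status bits are laid out as in the VT-d specification (PFO, PPF, AFO, APF, IQE, ICE, ITE in bits 0 to 6); the `decode_fsts` helper is hypothetical and ignores the higher bits of the register.

```python
# Sketch: name the status bits set in a VT-d Fault Status Register value.
# Bit positions are assumed from the VT-d specification.

FSTS_BITS = {
    0: "PFO",  # Primary Fault Overflow
    1: "PPF",  # Primary Pending Fault
    2: "AFO",  # Advanced Fault Overflow
    3: "APF",  # Advanced Pending Fault
    4: "IQE",  # Invalidation Queue Error
    5: "ICE",  # Invalidation Completion Error
    6: "ITE",  # Invalidation Time-out Error
}


def decode_fsts(value):
    """Return the names of the status bits set in value, lowest bit first."""
    return [name for bit, name in sorted(FSTS_BITS.items()) if value & (1 << bit)]


print(decode_fsts(0x40))
```

With that layout, 0x40 has only bit 6 set, which matches the ITE reading given in the quoted text.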

