
Re: [Xen-devel] Xen-unstable-staging: Xen BUG at iommu_map.c:455



On 11/04/15 18:42, Sander Eikelenboom wrote:
> Saturday, April 11, 2015, 7:35:57 PM, you wrote:
>
>> On 11/04/15 18:25, Sander Eikelenboom wrote:
>>> Saturday, April 11, 2015, 6:38:17 PM, you wrote:
>>>
>>>> On 11/04/15 17:32, Andrew Cooper wrote:
>>>>> On 11/04/15 17:21, Sander Eikelenboom wrote:
>>>>>> Saturday, April 11, 2015, 4:21:56 PM, you wrote:
>>>>>>
>>>>>>> On 11/04/15 15:11, Sander Eikelenboom wrote:
>>>>>>>> Friday, April 10, 2015, 8:55:27 PM, you wrote:
>>>>>>>>
>>>>>>>>> On 10/04/15 11:24, Sander Eikelenboom wrote:
>>>>>>>>>> Hi Andrew,
>>>>>>>>>>
>>>>>>>>>> Finally got some time to figure this out .. and I have narrowed it
>>>>>>>>>> down to:
>>>>>>>>>> git://xenbits.xen.org/staging/qemu-upstream-unstable.git
>>>>>>>>>> commit 7665d6ba98e20fb05c420de947c1750fd47e5c07 "Xen: Use the
>>>>>>>>>> ioreq-server API when available"
>>>>>>>>>> A straight revert of this commit prevents the issue from happening.
>>>>>>>>>>
>>>>>>>>>> The reasons I had a hard time figuring this out were:
>>>>>>>>>> - I wasn't aware of this earlier, since git pulling the main xen
>>>>>>>>>>   tree doesn't auto-update the qemu-* trees.
>>>>>>>>> This has caught me out so many times.  It is very non-obvious 
>>>>>>>>> behaviour.
>>>>>>>>>> - So I happened to hit this when I cloned a fresh tree to try to
>>>>>>>>>>   figure out the other issue I was seeing.
>>>>>>>>>> - After that, checking out previous versions of the main xen tree
>>>>>>>>>>   didn't resolve this new issue, because the qemu tree doesn't get
>>>>>>>>>>   auto-updated and stays set to "master".
>>>>>>>>>> - Cloning xen-stable-4.5.0 made it go away .. because that pins a
>>>>>>>>>>   specific git://xenbits.xen.org/staging/qemu-upstream-unstable.git
>>>>>>>>>>   tag which is not master.
>>>>>>>>>>
>>>>>>>>>> *sigh* 
>>>>>>>>>>
>>>>>>>>>> This was tested with the main xen tree at last commit
>>>>>>>>>> 3a28f760508fb35c430edac17a9efde5aff6d1d5
>>>>>>>>>> (normal xen-unstable, not the staging branch).
>>>>>>>>>>
>>>>>>>>>> OK, so I have added some extra debug info (see attached diff), and
>>>>>>>>>> this is the output when it crashes due to something the commit
>>>>>>>>>> above triggered: the level is out of bounds, and the pfn looks
>>>>>>>>>> fishy too.
>>>>>>>>>> Complete serial logs from both the bad and the good (specific
>>>>>>>>>> commit reverted) runs are attached.
>>>>>>>>> Just to confirm, you are positively identifying a qemu changeset as
>>>>>>>>> causing this crash?
>>>>>>>>> If so, the qemu change has discovered a pre-existing issue in the
>>>>>>>>> toolstack pci-passthrough interface.  Whatever qemu is or isn't doing,
>>>>>>>>> it should not be able to cause a crash like this.
>>>>>>>>> With this in mind, I need to brush up on my AMD-Vi details.
>>>>>>>>> In the meantime, can you run with the following patch to identify what
>>>>>>>>> is going on, domctl-wise?  I assume it is the assign_device which is
>>>>>>>>> failing, but it will be useful to observe the differences between the
>>>>>>>>> working and failing cases, which might offer a hint.
>>>>>>>> Hrrm, with your patch I end up with a fatal page fault in
>>>>>>>> iommu_do_pci_domctl:
>>>>>>>>
>>>>>>>> (XEN) [2015-04-11 14:03:31.833] ----[ Xen-4.6-unstable  x86_64  debug=y  Tainted:    C ]----
>>>>>>>> (XEN) [2015-04-11 14:03:31.857] CPU:    5
>>>>>>>> (XEN) [2015-04-11 14:03:31.868] RIP:    e008:[<ffff82d08014c52c>] iommu_do_pci_domctl+0x2dc/0x740
>>>>>>>> (XEN) [2015-04-11 14:03:31.894] RFLAGS: 0000000000010256   CONTEXT: hypervisor
>>>>>>>> (XEN) [2015-04-11 14:03:31.915] rax: 0000000000000008   rbx: 0000000000000800   rcx: ffffffffffebe5ed
>>>>>>>> (XEN) [2015-04-11 14:03:31.942] rdx: 0000000000000800   rsi: 0000000000000000   rdi: ffff830256ef7e38
>>>>>>>> (XEN) [2015-04-11 14:03:31.968] rbp: ffff830256ef7c98   rsp: ffff830256ef7c08   r8:  00000000deadbeef
>>>>>>>> (XEN) [2015-04-11 14:03:31.995] r9:  00000000deadbeef   r10: ffff82d08024e500   r11: 0000000000000282
>>>>>>>> (XEN) [2015-04-11 14:03:32.022] r12: 0000000000000000   r13: 0000000000000008   r14: 0000000000000000
>>>>>>>> (XEN) [2015-04-11 14:03:32.049] r15: 0000000000000000   cr0: 0000000080050033   cr4: 00000000000006f0
>>>>>>>> (XEN) [2015-04-11 14:03:32.076] cr3: 00000002336a6000   cr2: 0000000000000000
>>>>>>>> (XEN) [2015-04-11 14:03:32.096] ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e010   cs: e008
>>>>>>>> (XEN) [2015-04-11 14:03:32.121] Xen stack trace from rsp=ffff830256ef7c08:
>>>>>>>> (XEN) [2015-04-11 14:03:32.141]    ffff830256ef7c78 ffff82d08012c178 ffff830256ef7c28 ffff830256ef7c28
>>>>>>>> (XEN) [2015-04-11 14:03:32.168]    0000000000000010 0000000000000000 0000000000000000 0000000000000000
>>>>>>>> (XEN) [2015-04-11 14:03:32.195]    00000000000006f0 00007fe300000000 ffff830256eb7790 ffff83025cc6d300
>>>>>>>> (XEN) [2015-04-11 14:03:32.222]    ffff82d080330c60 00007fe396bab004 0000000000000000 00007fe396bab004
>>>>>>>> (XEN) [2015-04-11 14:03:32.249]    0000000000000000 0000000000000005 ffff830256ef7ca8 ffff82d08014900b
>>>>>>>> (XEN) [2015-04-11 14:03:32.276]    ffff830256ef7d98 ffff82d080161f2d 0000000000000010 0000000000000000
>>>>>>>> (XEN) [2015-04-11 14:03:32.303]    0000000000000000 ffff830256ef7ce8 ffff82d08018b655 ffff830256ef7d48
>>>>>>>> (XEN) [2015-04-11 14:03:32.330]    ffff830256ef7cf8 ffff82d08018b66a ffff830256ef7d38 ffff82d08012925e
>>>>>>>> (XEN) [2015-04-11 14:03:32.357]    ffff830256efc068 0000000800000001 800000022e12c167 0000000000000000
>>>>>>>> (XEN) [2015-04-11 14:03:32.384]    0000000000000002 ffff830256ef7e38 0000000800000000 800000022e12c167
>>>>>>>> (XEN) [2015-04-11 14:03:32.411]    0000000000000003 ffff830256ef7db8 0000000000000000 00007fe396780eb0
>>>>>>>> (XEN) [2015-04-11 14:03:32.439]    0000000000000202 ffffffffffffffff 0000000000000000 00007fe396bab004
>>>>>>>> (XEN) [2015-04-11 14:03:32.466]    0000000000000000 0000000000000005 ffff830256ef7ef8 ffff82d08010497f
>>>>>>>> (XEN) [2015-04-11 14:03:32.493]    0000000000000001 0000000000100001 800000022e12c167 ffff88001f7ecc00
>>>>>>>> (XEN) [2015-04-11 14:03:32.520]    00007fe396780eb0 ffff88001c849508 0000000e00000007 ffffffff8105594a
>>>>>>>> (XEN) [2015-04-11 14:03:32.547]    000000000000e033 0000000000000202 ffff88001ece3d40 000000000000e02b
>>>>>>>> (XEN) [2015-04-11 14:03:32.574]    ffff830256ef7e28 ffff82d080194933 000000000000beef ffffffff81bd6c85
>>>>>>>> (XEN) [2015-04-11 14:03:32.601]    ffff830256ef7f08 ffff82d080193edd 0000000b0000002d 0000000000000001
>>>>>>>> (XEN) [2015-04-11 14:03:32.628]    0000000100000800 00007fe3962abbd0 ffff000a81050001 00007fe39656ce6e
>>>>>>>> (XEN) [2015-04-11 14:03:32.655]    00007ffdf2a654f0 00007fe39656d0c9 00007fe39656ce6e 00007fe3969a9a55
>>>>>>>> (XEN) [2015-04-11 14:03:32.682] Xen call trace:
>>>>>>>> (XEN) [2015-04-11 14:03:32.695]    [<ffff82d08014c52c>] iommu_do_pci_domctl+0x2dc/0x740
>>>>>>>> (XEN) [2015-04-11 14:03:32.718]    [<ffff82d08014900b>] iommu_do_domctl+0x17/0x1a
>>>>>>>> (XEN) [2015-04-11 14:03:32.739]    [<ffff82d080161f2d>] arch_do_domctl+0x2469/0x26e1
>>>>>>>> (XEN) [2015-04-11 14:03:32.762]    [<ffff82d08010497f>] do_domctl+0x1a1f/0x1d60
>>>>>>>> (XEN) [2015-04-11 14:03:32.783]    [<ffff82d080234c6b>] syscall_enter+0xeb/0x145
>>>>>>>> (XEN) [2015-04-11 14:03:32.804]
>>>>>>>> (XEN) [2015-04-11 14:03:32.813] Pagetable walk from 0000000000000000:
>>>>>>>> (XEN) [2015-04-11 14:03:32.831]  L4[0x000] = 0000000234075067 000000000001f2a8
>>>>>>>> (XEN) [2015-04-11 14:03:32.852]  L3[0x000] = 0000000229ad4067 0000000000014c49
>>>>>>>> (XEN) [2015-04-11 14:03:32.873]  L2[0x000] = 0000000000000000 ffffffffffffffff
>>>>>>>> (XEN) [2015-04-11 14:03:32.894]
>>>>>>>> (XEN) [2015-04-11 14:03:32.903] ****************************************
>>>>>>>> (XEN) [2015-04-11 14:03:32.922] Panic on CPU 5:
>>>>>>>> (XEN) [2015-04-11 14:03:32.935] FATAL PAGE FAULT
>>>>>>>> (XEN) [2015-04-11 14:03:32.948] [error_code=0000]
>>>>>>>> (XEN) [2015-04-11 14:03:32.961] Faulting linear address: 0000000000000000
>>>>>>>> (XEN) [2015-04-11 14:03:32.981] ****************************************
>>>>>>>> (XEN) [2015-04-11 14:03:33.000]
>>>>>>>> (XEN) [2015-04-11 14:03:33.009] Reboot in five seconds...
>>>>>>>>
>>>>>>>> The RIP resolves to the printk added by your patch in:
>>>>>>>>
>>>>>>>>     case XEN_DOMCTL_test_assign_device:
>>>>>>>>         ret = xsm_test_assign_device(XSM_HOOK, domctl->u.assign_device.machine_sbdf);
>>>>>>>>         if ( ret )
>>>>>>>>             break;
>>>>>>>>
>>>>>>>>         seg = domctl->u.assign_device.machine_sbdf >> 16;
>>>>>>>>         bus = (domctl->u.assign_device.machine_sbdf >> 8) & 0xff;
>>>>>>>>         devfn = domctl->u.assign_device.machine_sbdf & 0xff;
>>>>>>>>
>>>>>>>>         printk("*** %pv->d%d: test_assign_device({%04x:%02x:%02x.%u})\n",
>>>>>>>>                current, d->domain_id,
>>>>>>>>                seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn));
>>>>>>>>
>>>>>>>>         if ( device_assigned(seg, bus, devfn) )
>>>>>>>>         {
>>>>>>>>             printk(XENLOG_G_INFO
>>>>>>>>                    "%04x:%02x:%02x.%u already assigned, or non-existent\n",
>>>>>>>>                    seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn));
>>>>>>>>             ret = -EINVAL;
>>>>>>>>         }
>>>>>>>>         break;
>>>>>>> Hmm - 'd' is NULL.  This ought to work better.
>>>>>>> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
>>>>>>> index 9f3413c..85ff1fc 100644
>>>>>>> --- a/xen/drivers/passthrough/pci.c
>>>>>>> +++ b/xen/drivers/passthrough/pci.c
>>>>>>> @@ -1532,6 +1532,11 @@ int iommu_do_pci_domctl(
>>>>>>>          max_sdevs = domctl->u.get_device_group.max_sdevs;
>>>>>>>          sdevs = domctl->u.get_device_group.sdev_array;
>>>>>>>  
>>>>>>> +        printk("*** %pv->d%d: get_device_group({%04x:%02x:%02x.%u, %u})\n",
>>>>>>> +               current, d->domain_id,
>>>>>>> +               seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn),
>>>>>>> +               max_sdevs);
>>>>>>> +
>>>>>>>          ret = iommu_get_device_group(d, seg, bus, devfn, sdevs, max_sdevs);
>>>>>>>          if ( ret < 0 )
>>>>>>>          {
>>>>>>> @@ -1558,6 +1563,9 @@ int iommu_do_pci_domctl(
>>>>>>>          bus = (domctl->u.assign_device.machine_sbdf >> 8) & 0xff;
>>>>>>>          devfn = domctl->u.assign_device.machine_sbdf & 0xff;
>>>>>>>  
>>>>>>> +        printk("*** %pv: test_assign_device({%04x:%02x:%02x.%u})\n",
>>>>>>> +               current, seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn));
>>>>>>> +
>>>>>>>          if ( device_assigned(seg, bus, devfn) )
>>>>>>>          {
>>>>>>>              printk(XENLOG_G_INFO
>>>>>>> @@ -1582,6 +1590,10 @@ int iommu_do_pci_domctl(
>>>>>>>          bus = (domctl->u.assign_device.machine_sbdf >> 8) & 0xff;
>>>>>>>          devfn = domctl->u.assign_device.machine_sbdf & 0xff;
>>>>>>>  
>>>>>>> +        printk("*** %pv->d%d: assign_device({%04x:%02x:%02x.%u})\n",
>>>>>>> +               current, d->domain_id,
>>>>>>> +               seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn));
>>>>>>> +
>>>>>>>          ret = device_assigned(seg, bus, devfn) ?:
>>>>>>>                assign_device(d, seg, bus, devfn);
>>>>>>>          if ( ret == -ERESTART )
>>>>>>> @@ -1604,6 +1616,10 @@ int iommu_do_pci_domctl(
>>>>>>>          bus = (domctl->u.assign_device.machine_sbdf >> 8) & 0xff;
>>>>>>>          devfn = domctl->u.assign_device.machine_sbdf & 0xff;
>>>>>>>  
>>>>>>> +        printk("*** %pv->d%d: deassign_device({%04x:%02x:%02x.%u})\n",
>>>>>>> +               current, d->domain_id,
>>>>>>> +               seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn));
>>>>>>> +
>>>>>>>          spin_lock(&pcidevs_lock);
>>>>>>>          ret = deassign_device(d, seg, bus, devfn);
>>>>>>>          spin_unlock(&pcidevs_lock);
>>>>>> Hi Andrew,
>>>>>>
>>>>>> Attached are the serial logs good (with revert) and bad (without):
>>>>>>
>>>>>> Some things that seem strange to me:
>>>>>> - The numerous calls to get the device 08:00.0 assigned ... for 0a:00.0
>>>>>>   there was only one call each to test_assign and assign.
>>>>>> - However, these numerous calls are there in both the good and the bad
>>>>>>   case, so perhaps it's strange and wrong .. but not the cause ..
>>>>>> - I had a hunch it could be due to 08:00.0 using MSI-X, but when only
>>>>>>   passing through 0a:00.0, I get the same numerous calls, now for
>>>>>>   0a:00.0, which uses INTx.  So I think it is more related to being the
>>>>>>   *first* device to be passed through to a guest.
>>>>> I have also observed this behaviour, but have not had time to
>>>>> investigate.  It doesn't appear problematic in the long run, but it is
>>>>> probably a toolstack issue which wants fixing (if only in the name of
>>>>> efficiency).
>>>> And just after I sent this email, I realised why.
>>>> The first device assignment has to build the IO pagetables, which is a
>>>> long operation and subject to hypercall continuations (sketched below).
>>>> The second device reuses the same pagetables, so is quick to complete.
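
[ As a minimal sketch of that preemption: this is the shape of the
XEN_DOMCTL_assign_device handling in xen/drivers/passthrough/pci.c,
trimmed and simplified here rather than quoted verbatim.

    /* The first assignment may need to build the IO pagetables: a long
     * operation, so it returns -ERESTART and the domctl is re-issued
     * as a hypercall continuation until the work completes. */
    ret = device_assigned(seg, bus, devfn) ?:
          assign_device(d, seg, bus, devfn);
    if ( ret == -ERESTART )
        ret = hypercall_create_continuation(__HYPERVISOR_domctl,
                                            "h", u_domctl);
]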
>>> So .. is the ioreq patch from Paul involved in providing something used
>>> in building the pagetables .. and could it have, say, some off-by-one
>>> resulting in the 0xffffffffffff .. which could lead to the pagetable
>>> building going berserk, requiring a paging_mode far greater than would
>>> normally be required .. which gets set, since that isn't checked
>>> properly .. leading to things breaking a bit further on, when it does
>>> get checked?
>> A -1 is slipping in somewhere and ending up in the gfn field.
>> The result is that update_paging_mode() attempts to construct IO
>> pagetables to cover a 76-bit address space, which is how level ends up
>> at 8.  (Note that a level of 7 is reserved, and a level of anything
>> greater than 4 is implausible on your system.)
>> I think the crash is collateral damage following on from
>> update_paging_mode() not properly sanitising its input, but that there
>> is still some other issue causing -1 to be passed in the first place.
>> I am still trying to locate where a -1 might plausibly be coming from.
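
[ To make the arithmetic concrete: each AMD-Vi page-table level resolves
9 bits of gfn on top of the 12-bit page offset, so a gfn of -1 (64 set
bits) implies a 64 + 12 = 76 bit address space.  A minimal sketch, using
an illustrative helper name rather than the hypervisor's own code:

    /* Illustrative only: how many 9-bit levels cover gfns up to
     * max_gfn.  levels_needed(~0UL) == 8, matching the out-of-bounds
     * level seen in the crash; a sane 40-bit system needs only 4. */
    static unsigned int levels_needed(unsigned long max_gfn)
    {
        unsigned int level = 1;

        /* Level 1 resolves gfn bits 0-8, level 2 bits 9-17, ... */
        while ( (9 * level) < 64 && (max_gfn >> (9 * level)) != 0 )
            level++;

        return level;
    }
]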
> I have just added some extra debug code to store the values from the start
> of update_paging_mode() .. so I can print them at the end if the
> paging_mode gets out of bounds, and do a dump_stack() as well.  Hopefully
> it will confirm this.

Right - arch_iommu_populate_page_table() is falling over a page
allocated to the domain which doesn't have a valid gfn.

The ioreq server allocates itself some guest pages and then shoots them
out of the physmap as part of setting the server up.  (This is a kludge
to work around the fact that Xen doesn't have an interface for device
models etc. to allocate memory on behalf of the domain which will
strictly never find its way into the guest physmap.)
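
In outline the sequence is as follows (simplified -- see
hvm_map_ioreq_page() and hvm_remove_ioreq_gmfn() in
xen/arch/x86/hvm/hvm.c for the real code):

    /* 1. The ioreq page is accounted to the domain and starts out with
     *    a gfn in the guest physmap.
     * 2. During server setup, that gfn mapping is shot out: */
    guest_physmap_remove_page(d, iorp->gmfn, page_to_mfn(iorp->page), 0);
    /* 3. The page still sits on the domain's page list, but
     *    mfn_to_gmfn() on its mfn no longer yields a valid gfn -- which
     *    is exactly what arch_iommu_populate_page_table() trips over
     *    when building IO pagetables for the first assigned device. */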

Can you try this patch and see whether some of the numbers printed out
start matching up?

diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index bfde380..d1adfa7 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -534,6 +534,10 @@ static int hvm_map_ioreq_page(
 static void hvm_remove_ioreq_gmfn(
     struct domain *d, struct hvm_ioreq_page *iorp)
 {
+    printk("*** %s() d%d, page %p, mfn %lx, gfn %lx, va %p\n",
+           __func__, d->domain_id,
+           iorp->page, page_to_mfn(iorp->page), iorp->gmfn, iorp->va);
+
     guest_physmap_remove_page(d, iorp->gmfn,
                               page_to_mfn(iorp->page), 0);
     clear_page(iorp->va);
diff --git a/xen/drivers/passthrough/x86/iommu.c b/xen/drivers/passthrough/x86/iommu.c
index 9eb8d33..048a1a9 100644
--- a/xen/drivers/passthrough/x86/iommu.c
+++ b/xen/drivers/passthrough/x86/iommu.c
@@ -59,7 +59,16 @@ int arch_iommu_populate_page_table(struct domain *d)
         if ( has_hvm_container_domain(d) ||
             (page->u.inuse.type_info & PGT_type_mask) == PGT_writable_page )
         {
-            BUG_ON(SHARED_M2P(mfn_to_gmfn(d, page_to_mfn(page))));
+            unsigned long mfn = page_to_mfn(page);
+            unsigned long gfn = mfn_to_gmfn(d, mfn);
+
+            BUG_ON(SHARED_M2P(gfn));
+
+            if ( gfn == INVALID_MFN )
+            {
+                printk("*** %s() d%d, page %p, mfn %lx, gfn %lx - about to break\n",
+                       __func__, d->domain_id, page, mfn, gfn);
+            }
             rc = hd->platform_ops->map_page(
                 d, mfn_to_gmfn(d, page_to_mfn(page)), page_to_mfn(page),
                 IOMMUF_readable|IOMMUF_writable);
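
If they do line up, then purely as a sketch (not a proposed fix --
whether quietly skipping such pages is safe is part of the open
question), the population loop could refuse to hand an all-ones gfn to
map_page():

    /* Sketch: skip domain pages with no valid gfn, rather than letting
     * a gfn of -1 push update_paging_mode() to an out-of-bounds level. */
    if ( gfn != INVALID_MFN )
        rc = hd->platform_ops->map_page(d, gfn, mfn,
                                        IOMMUF_readable|IOMMUF_writable);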


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel