
Re: [Xen-devel] crash on boot with 4.6.1 on fedora 24



On 05/10/2016 03:23 AM, Jan Beulich wrote:
>>>> On 09.05.16 at 20:40, <boris.ostrovsky@xxxxxxxxxx> wrote:
>> On 05/09/2016 01:22 PM, Kevin Moraga wrote:
>>> On 05/09/2016 11:15 AM, Boris Ostrovsky wrote:
>>>> On 05/09/2016 12:40 PM, Kevin Moraga wrote:
>>>>> On 05/09/2016 09:53 AM, Jan Beulich wrote:
>>>>>>>>> On 09.05.16 at 16:52, <kmoragas@xxxxxxxxxx> wrote:
>>>>>>> On 05/09/2016 04:08 AM, Jan Beulich wrote:
>>>>>>>>>>> On 09.05.16 at 00:51, <kmoragas@xxxxxxxxxx> wrote:
>>>>>>>>> I'm trying to compile kernel 4.4.8 (using Fedora 23) to run with Xen
>>>>>>>>> 4.6.0 and an Intel Skylake processor (Intel Core i7-6600U).
>>>>>>>>>
>>>>>>>>> This kernel is crashing almost in the same way as explained in this
>>>>>>>>> thread... But my problem is mainly with Skylake, because the same
>>>>>>>>> configuration works on another machine with a different processor
>>>>>>>>> (Intel Core i5-3340M). Attached are the boot logs.
>>>>>>>> The address the fault occurs on (ffff8000006bdee0) is bogus, so
>>>>>>>> from the register and stack dump alone I don't think we can derive
>>>>>>>> much. What we'd need is access to the kernel binary used (or
>>>>>>>> really the vmlinux accompanying the vmlinuz that was used), in
>>>>>>>> order to see where exactly the kernel died, and hence where this
>>>>>>>> bogus address originates from. As I understand it this is a kernel
>>>>>>>> you built yourself - can you make said binary from exactly that
>>>>>>>> build available somewhere? 
>>>>>>> Yes I have it. But I get the same crash on various 4.4.X and also with
>>>>>>> 4.5.3.
>>>>>>>
>>>>>>> https://drive.google.com/open?id=0B6Ol0ob95UxXQV9HM1BWMmhCZ0E
>>>>>> Well, this doesn't contain the file I'm after (vmlinux), and taking
>>>>>> apart vmlinuz would be quite cumbersome.
>>>>>>
>>>>>> Jan
>>>>>>
>>>>> Oh sorry, here is the link to vmlinux
>>>>>
>>>>>
>> https://drive.google.com/file/d/0B6Ol0ob95UxXN0dDMWM1a29vMEk/view?usp=sharing
>>  
>>>> This is still vmlinuz but the failure is at
>>>>
>>>> ffffffff81007ef3:       48 3b 1d 4e 2e ec 00    cmp    0xec2e4e(%rip),%rbx        # 0xffffffff81ecad48
>>>> ffffffff81007efa:       73 51                   jae    0xffffffff81007f4d
>>>> ffffffff81007efc:       31 c0                   xor    %eax,%eax
>>>> ffffffff81007efe:       48 8b 15 03 d2 c0 00    mov    0xc0d203(%rip),%rdx        # 0xffffffff81c15108
>>>> ffffffff81007f05:       90                      nop
>>>> ffffffff81007f06:       90                      nop
>>>> ffffffff81007f07:       90                      nop
>>>> ffffffff81007f08:       4c 8b 2c da             mov    (%rdx,%rbx,8),%r13    <======
>>>> ffffffff81007f0c:       90                      nop
>>>> ffffffff81007f0d:       90                      nop
>>>> ffffffff81007f0e:       90                      nop
>>>> ffffffff81007f0f:       85 c0                   test   %eax,%eax
>>>> ffffffff81007f11:       78 3a                   js     0xffffffff81007f4d
>>>> ffffffff81007f13:       48 8b 05 ee 11 d2 00    mov    0xd211ee(%rip),%rax        # 0xffffffff81d29108
>>>> ffffffff81007f1a:       49 39 c5                cmp    %rax,%r13
>>>> ffffffff81007f1d:       73 6f                   jae    0xffffffff81007f8e
>>>> ffffffff81007f1f:       48 8b 05 ea 11 d2 00    mov    0xd211ea(%rip),%rax        # 0xffffffff81d29110
>>>> ffffffff81007f26:       4a 8b 04 e8             mov    (%rax,%r13,8),%rax
>>>>
>>>> Any chance you could provide an un-stripped binary or System.map?
>>> Here is the link for System.map
>>>
>>>
>> https://drive.google.com/file/d/0B6Ol0ob95UxXYVE4SzdMcENsWWs/view?usp=sharing
>>  
>>
>> So my semi-educated guess at your stack is
>> __early_ioremap
>>   -> __early_set_fixmap
>>     -> set_pte
>>       -> xen_set_pte_init
>>         -> mask_rw_pte
>>           -> pte_pfn
>>             -> pte_val
>>               -> xen_pte_val
>>                 -> pte_mfn_to_pfn
>>                   -> mfn_to_pfn_no_overrides
>>                     -> ret = xen_safe_read_ulong(&machine_to_phys_mapping[mfn], &pfn)
>>
>>
>> With ffffffff81007f08 being the faulting instruction, the last one looks
>> plausible:
>>
>>
>> ffffffff81007efe:       48 8b 15 03 d2 c0 00    mov    0xc0d203(%rip),%rdx        # 0xffffffff81c15108
>> ffffffff81007f05:       90                      nop
>> ffffffff81007f06:       90                      nop
>> ffffffff81007f07:       90                      nop
>> ffffffff81007f08:       4c 8b 2c da       mov    (%rdx,%rbx,8),%r13
>>
>> since
>>
>> ostr@workbase> grep ffffffff81c15108 /tmp/System.map-4.4.8-9.pvops.qubes.x86_64
>> ffffffff81c15108 D machine_to_phys_mapping
>> ostr@workbase>
>>
>> But %rdx is not ffffffff81c15108, it is ffff800000000000:
>>
>> (XEN) rax: 0000000000000000   rbx: 00000000000d7bdc   rcx: ffff880002059000
>> (XEN) rdx: ffff800000000000   rsi: 80000000d7bdc063   rdi: 80000000d7bdc063
> But that's a MOV above, i.e. %rdx = [0xffffffff81c15108], which
> sensibly is MACH2PHYS_VIRT_START. 

<facepalm> of course!
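
For anyone following along, the arithmetic can be checked with a short sketch. It assumes the M2P is an array of 8-byte entries (as the scale factor in the faulting mov suggests) and takes all constants from the register dump and disassembly quoted in this thread; the helper names are made up for illustration.

```python
# Values copied from the register dump and disassembly quoted in this thread.
M2P_BASE = 0xFFFF800000000000    # %rdx: the value loaded from machine_to_phys_mapping
MFN = 0xD7BDC                    # %rbx
FAULT_ADDR = 0xFFFF8000006BDEE0  # address shown in the pagetable walk

def m2p_entry_address(base, mfn, entry_size=8):
    """Address touched by `mov (%rdx,%rbx,8),%r13` (hypothetical helper)."""
    return base + mfn * entry_size

def rip_relative_target(insn_addr, insn_len, disp):
    """Address of a RIP-relative operand: next instruction address + displacement."""
    return insn_addr + insn_len + disp

# The M2P slot for this MFN should be the faulting address.
print(hex(m2p_entry_address(M2P_BASE, MFN)))
# The 7-byte mov at ffffffff81007efe reads its operand from this address,
# which System.map names machine_to_phys_mapping.
print(hex(rip_relative_target(0xFFFFFFFF81007EFE, 7, 0xC0D203)))
```

The first value should equal the fault address, and the second the machine_to_phys_mapping symbol address, so %rdx holds the contents of that variable rather than its address.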

> And the MFN in %rbx
> would then match with the value in %cr2. Question is - where
> does MFN 0xd7bdc come from (it's in a reserved range, and hence
> can only be MMIO, which shouldn't be subject to M2P translation),
> and why is this a problem only on Skylake (or maybe that's not
> CPU related at all, but just dependent on the memory layout
> produced by the firmware).
>
> Obviously, accesses to the sparse[!] M2P prior to a proper #PF
> handler being established can't end well. With no RAM present in
> the range 0xc0000000-0xffffffff, the 4th 2MB M2P page doesn't get
> populated, i.e. this page walk
>
> (XEN) Pagetable walk from ffff8000006bdee0:
> (XEN)  L4[0x100] = 000000081daf9067 ffffffffffffffff
> (XEN)  L3[0x000] = 000000081daf7067 ffffffffffffffff
> (XEN)  L2[0x003] = 0000000000000000 ffffffffffffffff 
>
> is to be expected.
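
A small sketch of why the walk stops at L2[0x003], and why that slot corresponds to the 0xc0000000-0xffffffff range. It assumes a standard x86-64 4-level walk with 9 index bits per level, 8-byte M2P entries, 4KB machine frames, and an M2P that starts at L2 index 0; the function names are illustrative only.

```python
FAULT_ADDR = 0xFFFF8000006BDEE0  # from the pagetable walk above
PAGE_SHIFT = 12                  # 4KB machine frames (assumption)
ENTRY_SIZE = 8                   # bytes per M2P entry (assumption)

def walk_indices(vaddr):
    """Return the (L4, L3, L2, L1) table indices for a 4-level x86-64 walk."""
    return tuple((vaddr >> shift) & 0x1FF for shift in (39, 30, 21, 12))

def m2p_superpage_coverage(l2_index):
    """Machine address range whose M2P entries live in the given 2MB page,
    assuming the M2P begins at L2 index 0."""
    mfns_per_page = (2 << 20) // ENTRY_SIZE
    first_mfn = l2_index * mfns_per_page
    end_mfn = first_mfn + mfns_per_page
    return first_mfn << PAGE_SHIFT, (end_mfn << PAGE_SHIFT) - 1

l4, l3, l2, _ = walk_indices(FAULT_ADDR)
print(hex(l4), hex(l3), hex(l2))

start, end = m2p_superpage_coverage(l2)
print(hex(start), hex(end))
```

Each 2MB M2P page holds entries for 1GB of machine memory under these assumptions, so the fourth one (index 3) is exactly the page that would describe the RAM-less hole below 4GB.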
>
> Anyway, Kevin, it would really make things a lot easier if you
> provided the vmlinux matching the vmlinuz, which you should
> have (assuming my understanding is correct that this is a kernel
> you built yourself). After all what we may need to figure out is
> the caller of __early_ioremap() in the call stack Boris deduced.

I didn't finish unwrapping the stack yesterday. Here it is:

setup_arch -> dmi_scan_machine -> dmi_walk_early -> early_ioremap

-boris


