
Re: [Xen-devel] PAT-related crash booting Linux 4.4 + Xen 4.5 on VMware ESXi



Yes, we're just now moving to 4.4 stable, and will be there for a
while, so backporting would be very helpful.

--Ed


On Tue, May 24, 2016 at 7:53 AM, Kani, Toshimitsu <toshi.kani@xxxxxxx> wrote:
> On Mon, 2016-05-23 at 15:52 -0700, Ed Swierk wrote:
>> Good question. I ran my tests again, and found I'd misinterpreted the
>> Fusion behavior.
>>
>> On Fusion 8.1.1, MSR_IA32_CR_PAT returns a reasonable value:
>>
>> (XEN) Freed 308kB init memory.
>> mapping kernel into physical memory
>> cpu_has_pat=0 cpuid_edx(1)=f89cbf5 pat=65536
>> pat_init_cache_modes pat=50100070406
>> pat_init_cache_modes i=7 pat_val=0 cache=3
>> pat_init_cache_modes ok
>> pat_init_cache_modes i=6 pat_val=0 cache=3
>> pat_init_cache_modes ok
>> pat_init_cache_modes i=5 pat_val=5 cache=5
>> pat_init_cache_modes ok
>> pat_init_cache_modes i=4 pat_val=1 cache=1
>> pat_init_cache_modes ok
>> pat_init_cache_modes i=3 pat_val=0 cache=3
>> pat_init_cache_modes ok
>> pat_init_cache_modes i=2 pat_val=7 cache=2
>> pat_init_cache_modes ok
>> pat_init_cache_modes i=1 pat_val=4 cache=4
>> pat_init_cache_modes ok
>> pat_init_cache_modes i=0 pat_val=6 cache=0
>> pat_init_cache_modes ok
>> pat_init_cache_modes pat_msg=WB  WT  UC- UC  WC  WP  UC  UC
>> about to get started...
>> [    0.000000] x86/PAT: Configuration [0-7]: WB  WT  UC- UC  WC  WP  UC  UC
>>
>> On ESXi 5.5.0, MSR_IA32_CR_PAT returns 0, and we are indeed hitting
>> the BUG_ON in update_cache_mode_entry():
>>
>> (XEN) Freed 312kB init memory.
>> mapping kernel into physical memory
>> cpu_has_pat=0 cpuid_edx(1)=f89cbf5 pat=65536
>> pat_init_cache_modes pat=0
>> pat_init_cache_modes i=7 pat_val=0 cache=3
>> pat_init_cache_modes ok
>> pat_init_cache_modes i=6 pat_val=0 cache=3
>> pat_init_cache_modes ok
>> pat_init_cache_modes i=5 pat_val=0 cache=3
>> pat_init_cache_modes ok
>> pat_init_cache_modes i=4 pat_val=0 cache=3
>> pat_init_cache_modes ok
>> pat_init_cache_modes i=3 pat_val=0 cache=3
>> pat_init_cache_modes ok
>> pat_init_cache_modes i=2 pat_val=0 cache=3
>> pat_init_cache_modes ok
>> pat_init_cache_modes i=1 pat_val=0 cache=3
>> pat_init_cache_modes ok
>> pat_init_cache_modes i=0 pat_val=0 cache=3
>> (XEN) traps.c:459:d0v0 Unhandled invalid opcode fault/trap [#6] on VCPU 0 [ec=0000]
>> (XEN) domain_crash_sync called from entry.S: fault at ffff82d0802276c3 create_bounce_frame+0x12b/0x13a
>>
>> In both cases, the PAT CPUID feature bit is set, and cpu_has_pat is
>> always 0 at this early point (so my RFC patch is wrong). The simplest
>> fix is to call pat_init_cache_modes(pat) only if pat != 0.
>>
>> This is starting to look like the same logic that's in pat_bsp_init(),
>> which doesn't seem to be called when booting on Xen. Should it be? Was
>> Xen deliberately excluded from this PAT emulation change?
>> https://groups.google.com/d/msg/linux.kernel/JoJKbCOxV0U/PM0I9d1v60kJ
>
> Calling pat_init() requires the CPU rendezvous handler in MTRR, which is
> disabled in Xen.  This PAT initialization has been problematic, and the
> following patches addressed it in 4.6.  This will fix your problem as
> well.
> https://lkml.org/lkml/2016/3/23/500
>
> In particular, patch 6/7 removed the Xen code in question.
> https://lkml.org/lkml/2016/3/23/503
>
> Do you need to fix this issue in 4.4?  If so, we should be able to request
> backporting the patches to 4.4 stable.
>
> -Toshi
>
>
>>
>> --Ed
>>
>>
>> On Mon, May 23, 2016 at 1:13 PM, Boris Ostrovsky
>> <boris.ostrovsky@xxxxxxxxxx> wrote:
>> >
>> > On 05/23/2016 10:15 AM, Konrad Rzeszutek Wilk wrote:
>> > >
>> > > On Fri, May 20, 2016 at 04:58:09PM -0700, Ed Swierk wrote:
>> > > >
>> > > > (XEN) traps.c:459:d0v0 Unhandled invalid opcode fault/trap [#6] on VCPU 0 [ec=0000]
>> > > > (XEN) domain_crash_sync called from entry.S: fault at ffff82d0802286c3 create_bounce_frame+0x12b/0x13a
>> > > > (XEN) Domain 0 (vcpu#0) crashed on cpu#0:
>> > > > (XEN) ----[ Xen-4.5.4-pre  x86_64  debug=n  Not tainted ]----
>> > > > (XEN) CPU:    0
>> > > > (XEN) RIP:    e033:[<ffffffff81053cbd>]
>> > > > (XEN) RFLAGS: 0000000000000206   EM: 1   CONTEXT: pv guest (d0v0)
>> > > > (XEN) rax: 0000000000000022   rbx: 00000000ffffffff   rcx: 0000000000000000
>> > > > (XEN) rdx: 0000000000000022   rsi: 0000000000000003   rdi: 0000000000000000
>> > > > (XEN) rbp: ffffffff81b67ea8   rsp: ffffffff81b67e68   r8:  0000000000000001
>> > > > (XEN) r9:  0000000000000001   r10: ffffffff81b67f20   r11: 6c61765f74617020
>> > > > (XEN) r12: 0000000000000000   r13: 0000000000000003   r14: 0000000000000000
>> > > > (XEN) r15: ffffffff81b67ebb   cr0: 000000008005003b   cr4: 00000000001526b0
>> > > > (XEN) cr3: 00000001b16eb000   cr2: 0000000000000000
>> > > > (XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e02b   cs: e033
>> > > > (XEN) Guest stack trace from rsp=ffffffff81b67e68:
>> > > > (XEN)    0000000000000000 6c61765f74617020 ffffffff81053cbd 000000010000e030
>> > > > (XEN)    0000000000010006 ffffffff81b67ea8 000000000000e02b ffffffff81b67f20
>> > > > (XEN)    ffffffff81b67f10 ffffffff8105b339 55ffffff81b67f10 5520204355202043
>> > > > (XEN)    5520204355202043 5520204355202043 0020204355202043 0000000000000000
>> > > > (XEN)    0000000000000000 ffffffff81b67f38 0000000000000000 0000000000000000
>> > > > (XEN)    0000000000000000 ffffffff81b67ff0 ffffffff82010d0a 0000000000000000
>> > > > (XEN)    000306f200000000 fed8320300010800 0000000000000000 0000000000000000
>> > > > (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
>> > > > (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
>> > > > (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
>> > > > (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
>> > > > (XEN)    0000000000000000 ffffffff81b68008 0000000000000000 0000000000000000
>> > > > (XEN)    0000000000000000 0000000000000000 00000000fffedb08
>> > > > (XEN) Domain 0 crashed: rebooting machine in 5 seconds.
>> > > > (XEN) Resetting with ACPI MEMORY or I/O RESET_REG.
>> > > >
>> > > > The crash occurs in pat_init_cache_modes(), called by
>> > > > xen_start_kernel().  The pat value from MSR_IA32_CR_PAT is 0.
>> > > > Strangely, the same kernel and Xen boot just fine on VMware Fusion
>> > > > 8.1.1, even though the MSR is 0 there as well.
>> > Are you hitting BUG_ON in update_cache_mode_entry()? I don't think I can
>> > see how you can avoid it when MSR read returns 0.
>> >
>> >
>> > >
>> > > >
>> > > >
>> > > > Anyway, guessing that it's pointless to call pat_init_cache_modes()
>> > > > when the CPU doesn't support PAT, I added a check for cpu_has_pat.
>> > > > This resolves the problem on ESXi and doesn't seem to break real
>> > > > hardware, though I'm not sure how to verify PAT functionality.  So
>> > > > this is just an RFC.
>> > Can you start an HVM guest in Xen after your patch below?
>> >
>> > >
>> > > Cc-ing maintainers.
>> > > >
>> > > > diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
>> > > > index 9a29803..209f680 100644
>> > > > --- a/arch/x86/xen/enlighten.c
>> > > > +++ b/arch/x86/xen/enlighten.c
>> > > > @@ -1633,8 +1633,12 @@ asmlinkage __visible void __init xen_start_kernel(void)
>> > > >       * Modify the cache mode translation tables to match Xen's PAT
>> > > >       * configuration.
>> > > >       */
>> > > > -    rdmsrl(MSR_IA32_CR_PAT, pat);
>> > > > -    pat_init_cache_modes(pat);
>> > > > +    if (cpu_has_pat) {
>> > > > +            rdmsrl(MSR_IA32_CR_PAT, pat);
>> > > > +            pat_init_cache_modes(pat);
>> > > > +    } else {
>> > > > +            xen_raw_console_write("CPU does not support PAT\n");
>> > > > +    }
>> > > >
>> > > >      /* keep using Xen gdt for now; no urgent need to change it */
>> > > >
>> > This looks OK to me but I think we should first understand why you don't
>> > crash on Fusion.
>> >
>> > Also, PAT initialization code has been rewritten in Linux (for 4.5?) so
>> > I suspect this problem is only observed on earlier kernels.
>> >
>> > -boris
>> >
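
Note on the logs in this thread: the per-entry pat_val numbers are the eight
bytes of the MSR_IA32_CR_PAT value, low byte first. The sketch below is a
minimal user-space illustration of that decoding, not kernel code. The helper
names (pat_type_name, pat_entry, pat_entry0_is_wb, pat_dump) are hypothetical
and do not exist in Linux or Xen. The check that entry 0 must decode to WB is
an assumption about what the BUG_ON in update_cache_mode_entry() enforces on
4.4, inferred from the ESXi log stopping at i=0 with pat_val=0.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Architectural PAT memory-type encodings; values 2 and 3 are reserved. */
const char *pat_type_name(unsigned int type)
{
    switch (type) {
    case 0: return "UC";
    case 1: return "WC";
    case 4: return "WT";
    case 5: return "WP";
    case 6: return "WB";
    case 7: return "UC-";
    default: return "reserved";
    }
}

/* Memory type of PAT entry i: the low 3 bits of byte i of the MSR value. */
unsigned int pat_entry(uint64_t pat, unsigned int i)
{
    assert(i < 8);
    return (unsigned int)((pat >> (i * 8)) & 0x7);
}

/*
 * Hypothetical guard. Assumption: the 4.4 translation tables require
 * entry 0 to be WB, so a PAT value of 0 (entry 0 = UC) cannot be accepted.
 */
bool pat_entry0_is_wb(uint64_t pat)
{
    return pat_entry(pat, 0) == 6;
}

/* Hypothetical helper: print the memory types of entries 0..7 on one line. */
void pat_dump(uint64_t pat)
{
    printf("pat=%llx:", (unsigned long long)pat);
    for (unsigned int i = 0; i < 8; i++)
        printf(" %s", pat_type_name(pat_entry(pat, i)));
    printf("\n");
}
```

With the Fusion value 0x50100070406, pat_dump() lists the entries in the same
order as the "Configuration [0-7]" line in that log, and pat_entry0_is_wb()
holds. With the ESXi value 0, every entry decodes to UC and the entry-0 check
fails, which is consistent with the crash reported at i=0.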
