
Re: [Xen-devel] debian stretch dom0 + xen 4.9 fails to boot



On 07/06/2017 10:05, Paul Durrant wrote:
>> -----Original Message-----
>> From: Juergen Gross [mailto:jgross@xxxxxxxx]
>> Sent: 07 June 2017 10:03
>> To: Jan Beulich <JBeulich@xxxxxxxx>; Paul Durrant
>> <Paul.Durrant@xxxxxxxxxx>
>> Cc: Julien Grall (julien.grall@xxxxxxx) <julien.grall@xxxxxxx>; xen-devel
>> (xen-devel@xxxxxxxxxxxxxxxxxxxx) <xen-devel@xxxxxxxxxxxxxxxxxxxx>; 'Boris
>> Ostrovsky' <boris.ostrovsky@xxxxxxxxxx>
>> Subject: Re: [Xen-devel] debian stretch dom0 + xen 4.9 fails to boot
>>
>> On 07/06/17 10:27, Jan Beulich wrote:
>>>>>> On 07.06.17 at 10:07, <Paul.Durrant@xxxxxxxxxx> wrote:
>>>>>  -----Original Message-----
>>>>> From: Boris Ostrovsky [mailto:boris.ostrovsky@xxxxxxxxxx]
>>>>> Sent: 06 June 2017 18:00
>>>>> To: Paul Durrant <Paul.Durrant@xxxxxxxxxx>; 'Jan Beulich'
>>>>> <JBeulich@xxxxxxxx>
>>>>> Cc: xen-devel (xen-devel@xxxxxxxxxxxxxxxxxxxx) <xen-devel@xxxxxxxxxxxxxxxxxxxx>
>>>>> Subject: Re: [Xen-devel] debian stretch dom0 + xen 4.9 fails to boot
>>>>>
>>>>> On 06/06/2017 12:28 PM, Paul Durrant wrote:
>>>>>>> -----Original Message-----
>>>>>>> From: Xen-devel [mailto:xen-devel-bounces@xxxxxxxxxxxxx] On Behalf Of
>>>>>>> Paul Durrant
>>>>>>> Sent: 06 June 2017 16:52
>>>>>>> To: 'Jan Beulich' <JBeulich@xxxxxxxx>
>>>>>>> Cc: xen-devel (xen-devel@xxxxxxxxxxxxxxxxxxxx) <xen-devel@xxxxxxxxxxxxxxxxxxxx>
>>>>>>> Subject: Re: [Xen-devel] debian stretch dom0 + xen 4.9 fails to boot
>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Jan Beulich [mailto:JBeulich@xxxxxxxx]
>>>>>>>> Sent: 06 June 2017 16:11
>>>>>>>> To: Paul Durrant <Paul.Durrant@xxxxxxxxxx>
>>>>>>>> Cc: xen-devel (xen-devel@xxxxxxxxxxxxxxxxxxxx) <xen-devel@xxxxxxxxxxxxxxxxxxxx>
>>>>>>>> Subject: Re: [Xen-devel] debian stretch dom0 + xen 4.9 fails to boot
>>>>>>>>
>>>>>>>>>>> On 06.06.17 at 16:32, <Paul.Durrant@xxxxxxxxxx> wrote:
>>>>>>>>> I've been having fun setting up a new test rig...
>>>>>>>>>
>>>>>>>>> I have a skull canyon NUC and I put debian stretch (rc4) on it (so
>>>>>>>>> that's a 4.9 kernel) and then tried building and installing the latest
>>>>>>>>> Xen staging-4.9 code. The system failed to boot... basically it got
>>>>>>>>> stuck before even managing to get sufficiently into Xen to spit out
>>>>>>>>> anything on the console.
>>>>>>>>> Xen 4.8 OTOH booted just fine so I started bisecting and after 14
>>>>>>>>> iterations I got down to the following commit as being the problem:
>>>>>>>>>
>>>>>>>>> commit c0655e492e6b33e26ec9cd33f59725d0db89cdd0
>>>>>>>>> Author: Juergen Gross <jgross@xxxxxxxx>
>>>>>>>>> Date:   Fri Mar 24 14:18:54 2017 +0100
>>>>>>>>>
>>>>>>>>>     x86: split boot trampoline into permanent and temporary part
>>>>>>>>>
>>>>>>>>>     The hypervisor needs a trampoline in low memory for early boot and
>>>>>>>>>     later for bringing up cpus and during wakeup from suspend. Today this
>>>>>>>>>     trampoline is kept completely even if most of it isn't needed later.
>>>>>>>>>     Split the trampoline into a permanent part and a temporary part needed
>>>>>>>>>     at early boot only. Introduce a new entry at the boundary.
>>>>>>>>>
>>>>>>>>>     Reduce the stack for wakeup code in order for the permanent
>>>>>>>>>     trampoline to fit in a single page. 4k of stack seems excessive, about
>>>>>>>>>     3k should be more than enough.
>>>>>>>>>
>>>>>>>>>     Add an ASSERT() to the linker script to ensure the wakeup stack is
>>>>>>>>>     always at least 3k.
>>>>>>>>>
>>>>>>>>>     Signed-off-by: Juergen Gross <jgross@xxxxxxxx>
>>>>>>>>>     Reviewed-by: Jan Beulich <jbeulich@xxxxxxxx>
>>>>>>>>>
>>>>>>>>> To verify this I checked out master, reverted that commit, and tried
>>>>>>>>> again. The NUC still booted fine.
>>>>>>>> Well, interesting, but I don't think it is very realistic to expect any
>>>>>>>> fix with just the information you supply. There must be something
>>>>>>>> rather special about that system, and likely it would help if we
>>>>>>>> knew what that is. E.g. an unusual E820 map. Worse would be if
>>>>>>>> they used memory outside of properly marked E820 regions in a
>>>>>>>> way colliding with what we do.
>>>>>>>>
>>>>>>>> Otherwise I'm afraid we need to hope for you to debug the issue.
>>>>>>>>
>>>>>>> Yes, I was posting this more as a heads-up for the moment, so that 4.9
>>>>>>> does not go out with this regression.
>>>>>>>
>>>>>>> I will try to figure out what is going on... My initial thoughts on
>>>>>>> looking at what the patch does are that it may be something to do with
>>>>>>> the fact I am using a vga console rather than a serial one. I need to
>>>>>>> try another 4.9 on another system (gigabyte brix) to see if the problem
>>>>>>> manifests there too. I'll also have to play with the BIOS settings on
>>>>>>> the skull canyon.
>>>>>>>
>>>>>> The problem definitely doesn't manifest on the brix, so the next theory
>>>>>> is that it is something to do with the BIOS of the skull canyon.
>>>>>
>>>>> FWIW, one of the machines in our test farm choked on this very patch. I
>>>>> don't remember details now but essentially it turned out that syslinux
>>>>> (we are pxe-booting) could not handle changes in ELF sections layout
>>>>> (the way syslinux calculated how to load the binary into memory
>>>>> resulted in overlap of some sort).
>>>>>
>>>>> I hacked it (mboot.c32 specifically) to work around this but never came
>>>>> up with a proper solution.
>>>>>
>>>> In my case it was grub2... and thinking about it I am running an older
>>>> version on the brix so I guess it may still manifest there if I update.
>>>> Either way it sounds like it may be better to revert the patch until the
>>>> issue is better understood.
>>> I'm not sure if we could simply revert this one patch - it's the first of a
>>> 3-patch series. At first glance I can't really see any dependency
>>> of the later two patches on it, but then again I seem to recall that the
>>> split was a prereq. Adding Jürgen.
>> I think it could be reverted. It was a prerequisite for another patch I
>> prepared but didn't send as it was quite late in the 4.9 cycle and it
>> depended on the other patches of Daniel.
>>
>> TBH: I really can't see what is wrong with that patch. The only change
>> which should be able to break something seems to be the reduction of the
>> wakeup stack size to 3kB, but this shouldn't affect booting the system
>> at all...
>>
> Yeah, my next test is going to be increasing the size of the wakeup stack 
> again, but there is really nothing obviously wrong with the patch.

My gut feeling is that there is some path through boot (tickled by these
two machines) which is clobbering the wrong piece of memory, which was
previously safe and is now not, because of the rearrangements here.

Debugging these machines is very tricky, because they have no serial or
IPMI whatsoever.

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel

 

