
Re: [Xen-devel] HVM domains crash after upgrade from XEN 4.5.1 to 4.5.2



On 18/11/2015 22:51, Atom2 wrote:
> On 17.11.15 at 00:10, Atom2 wrote:
>> On 17.11.15 at 00:01, Andrew Cooper wrote:
>>> On 16/11/2015 19:16, Atom2 wrote:
>>>>
>>>> On 16.11.15 at 16:31, Konrad Rzeszutek Wilk wrote:
>>>>>>>> Your analysis was absolutely spot on. After re-thinking this for a
>>>>>>>> moment, I thought going down that route first would make a lot of
>>>>>>>> sense
>>>>>>>> as PV guests still do work and one of the differences to HVM
>>>>>>>> domUs is
>>>>>>>> that the former do _not_ require SeaBIOS. Looking at my log
>>>>>>>> files of
>>>>>>>> installed packages confirmed an upgrade from SeaBIOS 1.7.5 to
>>>>>>>> 1.8.2 in
>>>>>>>> the relevant timeframe which obviously had not made it to the
>>>>>>>> hvmloader
>>>>>>>> of xen-4.5.1 as I did not re-compile xen after the upgrade of
>>>>>>>> SeaBIOS.
>>>>>>>>
>>>>>>>> So I re-compiled xen-4.5.1 (obviously now using the installed
>>>>>>>> SeaBIOS
>>>>>>>> 1.8.2) and the same error as with xen-4.5.2 popped up - and that
>>>>>>>> seemed
>>>>>>>> to strongly indicate that there indeed might be an issue with
>>>>>>>> SeaBIOS as
>>>>>>>> this probably was the only variable that had changed from the
>>>>>>>> original
>>>>>>>> install of xen-4.5.1.
>>>>> I recall seeing this way back in the Fedora 20 days. I narrowed it
>>>>> down to the standalone SeaBIOS package having been built without
>>>>> CONFIG_XEN.
>>>>>
>>>>> Having that fixed in the SeaBIOS package fixed it.
>>>> Hi Konrad, Doug, Andrew (specifically added to this part of the
>>>> thread)!
>>>> Konrad, you might have found an interesting point. I did have a look
>>>> at the ebuild for the failing version and in there I found the
>>>> following comment:
>>>> ====== comment from ebuild =======
>>>>      # Upstream hasn't released a new binary.  We snipe ours from
>>>> Fedora for now.
>>>>      #
>>>> http://code.coreboot.org/p/seabios/downloads/get/bios.bin-${PV}.gz
>>>> ====== end comment from ebuild =======
>>>> which might in fact underline that there might be an issue similar to
>>>> what you described above.
>>>>
>>>> What is also pretty interesting is the fact that the old (working)
>>>> SeaBIOS version 1.7.5 installed as "bios.bin" under /usr/share/seabios
>>>> is actually 262,144 bytes in size whereas the new (invalid) SeaBIOS
>>>> 1.8.2 installed in the same location is only half as big: 131,072
>>>> bytes.
>>>>
>>>> I checked at the download site and the 1.8.2 binary version is indeed
>>>> not available from http://code.coreboot.org/p/seabios/downloads/. But
>>>> both the binary versions for 1.7.5 and 1.8.0 are available and both
>>>> are actually 262,144 bytes in size, so I'd be very surprised if the
>>>> 1.8.2 version is really only half that size. By the way, the old
>>>> working version (according to the ebuild) was directly downloaded from
>>>> the above url and also shows an identical SHA1 digest to that version
>>>> available for download there.
>>>>
>>>> To me this looks as if something is really wrong here. If any of
>>>> you has access to a 1.8.2 version, could you please confirm whether
>>>> there's really that big a size difference between the 1.7.5 and the
>>>> 1.8.2 versions? Or is that difference probably attributable to the
>>>> missing CONFIG_XEN option?
>>>>
>>>> Andrew: I haven't gotten around to running the debug version of the
>>>> hypervisor again, but if the current suspicion turns out to be true,
>>>> there's probably not much value in that anyways. Would you agree?
>>> Sadly not.
>> Fair enough. I'll try to get things done, hopefully sometime tomorrow
>> or, in case that doesn't work out, on Wednesday and will send you the
>> requested information.
>>
>> Many thanks for your support, Atom2
>>> I accept that this issue is possibly fixed in newer SeaBIOS by working
>>> around the issue.
>>>
>>> However, I stand by my original point.  *There is no way the guest
>>> should be able to get into this situation in the first place*, and its
>>> implication of *there is a genuine hypervisor bug which we should track
>>> down*, irrespective of whether the issue has been worked around some other way.
> Hi Andrew,
> as promised I have again tried with a debug build and the results are
> very mixed. I initially tried to better understand what the debug USE
> flag actually does in gentoo and my understanding (after reading the
> so-called ebuilds) is now that the XEN hypervisor will be built by
> adding a gcc option of "debug=y" - and that's what should compile a
> debug build - right?

Yes indeed.

> So I went on and again enabled the debug USE flag plus gdb symbols and
> rebuilt the hypervisor in the hope that this created a valid and
> working debug build.
>
> It, however, seems there's another problem lurking somewhere which
> only manifests itself when I boot from the debug build of the hypervisor.

You did manage to get at least one decent log from a proper debug build.

However, all we need is the hvm_debug output.  This patch:

---8<---
diff --git a/xen/include/asm-x86/hvm/support.h b/xen/include/asm-x86/hvm/support.h
index 05ef5c5..7a8fbb5 100644
--- a/xen/include/asm-x86/hvm/support.h
+++ b/xen/include/asm-x86/hvm/support.h
@@ -28,7 +28,7 @@

 #define HVM_DELIVER_NO_ERROR_CODE  -1

-#ifndef NDEBUG
+#if 1
 #define DBG_LEVEL_0                 (1 << 0)
 #define DBG_LEVEL_1                 (1 << 1)
 #define DBG_LEVEL_2                 (1 << 2)
---8<---

will enable hvm_debug in a non-debug build of the hypervisor.  Can you try
that please?
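
For context, here is a minimal sketch of what that one-line change does.
All names in it (DEMO_DBG_LOG, demo_debug_mask, demo_handle_io) are made
up for illustration and are not Xen code. The debug levels and the
logging macro only exist when the preprocessor condition holds, so forcing
the condition to 1 compiles the logging into a release build as well.  The
runtime mask is assumed to play the role of the hvm_debug setting.

```c
#include <stdio.h>

/* Counts messages actually emitted; illustrative only. */
int demo_emitted;

/* The patch swaps "#ifndef NDEBUG" for "#if 1" so that this branch is
 * compiled in even when NDEBUG is defined (i.e. in a non-debug build). */
#if 1 /* was: #ifndef NDEBUG */
#define DBG_LEVEL_0 (1 << 0)
#define DBG_LEVEL_1 (1 << 1)
#define DBG_LEVEL_2 (1 << 2)

/* Assumed runtime mask selecting which levels are printed. */
unsigned int demo_debug_mask = DBG_LEVEL_0 | DBG_LEVEL_1;

#define DEMO_DBG_LOG(level, ...)              \
    do {                                      \
        if ( demo_debug_mask & (level) )      \
        {                                     \
            printf(__VA_ARGS__);              \
            putchar('\n');                    \
            demo_emitted++;                   \
        }                                     \
    } while ( 0 )
#else
/* Non-debug build without the patch: logging compiles to nothing. */
#define DEMO_DBG_LOG(level, ...) ((void)0)
#endif

/* Hypothetical handler that logs at two different levels. */
int demo_handle_io(unsigned int port)
{
    demo_emitted = 0;
    DEMO_DBG_LOG(DBG_LEVEL_1, "io access, port=%#x", port);
    DEMO_DBG_LOG(DBG_LEVEL_2, "verbose detail, port=%#x", port); /* masked out */
    return demo_emitted;
}
```

With the mask above, demo_handle_io(0x3f8) prints only the DBG_LEVEL_1
message and returns 1.  If the #else branch were selected instead, both
calls would compile away and it would return 0.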


> The system crashes early on with a DOUBLE FAULT in doIRQ - we have had
> this already earlier in that thread. I am however a step further as
> the disass in gdb now seems to provide not just an empty page full of
> NULL values but rather something that might give you a hint why it
> crashes that early on: Please see the attached disass file (doIRQ)
> together with the serial console output (serial.dbg). The old NULL
> value file was probably because I did not include gdb symbols in the
> debug build at that time - my bad.

The fact that it is completely consistent is useful from a debugging
point of view.

The disassembly of do_IRQ now looks like a plausible function, but the
consistently faulting address has no plausible way of generating a
double fault.  I suspect therefore that something has caused memory
corruption in Xen's .text section.

As an experiment, could you try booting with the minimum available
command line options, which look to be just "com1=115200,8n1,0x3f8,4
console=com1,vga dom0_mem=4G,max:4G", to see whether the problem is an
interaction of the options you have enabled?

If the issue still reproduces, I will rework the previous debugging
patch I gave you to definitely dump the actual code being run at the
time of the fault.
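
To illustrate the idea only (this is not the debugging patch itself, and
demo_dump_code is a hypothetical helper), dumping the code at the fault
amounts to printing the raw bytes around the faulting instruction pointer
so they can be compared against the disassembly of the binary on disk.  A
minimal sketch, assuming the bytes are already readable through a plain
pointer:

```c
#include <stdio.h>
#include <stddef.h>

/*
 * Format `len` bytes starting at `code` as space-separated hex into `out`,
 * wrapping the byte at offset `fault_off` in <..> to mark the faulting
 * address.  Returns the number of characters written, or -1 if `out` is
 * too small.
 */
int demo_dump_code(char *out, size_t out_len,
                   const unsigned char *code, size_t len, size_t fault_off)
{
    size_t used = 0;

    if ( out_len == 0 )
        return -1;
    out[0] = '\0';

    for ( size_t i = 0; i < len; i++ )
    {
        const char *sep = i ? " " : "";
        int n;

        if ( i == fault_off )
            n = snprintf(out + used, out_len - used, "%s<%02x>", sep, code[i]);
        else
            n = snprintf(out + used, out_len - used, "%s%02x", sep, code[i]);

        /* snprintf reports the length it wanted; treat truncation as failure. */
        if ( n < 0 || (size_t)n >= out_len - used )
            return -1;
        used += (size_t)n;
    }

    return (int)used;
}
```

If the dumped bytes differ from what the on-disk image contains at the
same address, that would support the theory of .text corruption.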

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

