Re: [Xen-users] Fatal Trap 18 (convincing hardware engineer)

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Robbie Dinn wrote:
> No one else seems to have taken the bite, so I will even
> though I may not be best qualified to do so.

And I thank you for that. ;-)

> Matthew Baker wrote:
>> Hi all,
>>
>> I have 2 servers with identical hardware (lspci at the bottom of this
>> email).
> Two identical servers is good. But I wasn't clear from your description
> whether they behaved the same.

Well, both servers have exhibited "problems", but I've only been able to
capture the panic on one machine. So my assumption that it is the same
cause may be wrong.

> Assuming they behave differently then that might mean you have one
> substandard component in one of the machines. Record all the serial
> numbers of the components, or label them yourself, then begin
> swapping them between the machines. If you can get the fault to
> move from one machine to the other, you can maybe pin it on one component.

I'm going to be able to get both of these boxes out of the rack and into
a place where I can do some better diagnosis from this angle. I'm
beginning to believe it may be related to one box more than the other.

> Your hardware guy may have already tried the above. If you have two
> machines and they both show the fault, that's more tricky.

Fingers crossed.

>> Our hardware engineer is convinced it's either a Xen or driver issue.
> 
> I can see why he might think so or want to say so.

Yes, as can I.

>> I've seen the thread at
>> http://lists.xensource.com/archives/html/xen-users/2006-08/msg00792.html
>> and have directed the engineer at this.
>>
>> My questions to the list are:
>>
>> 1. Can this be caused by anything else (other than hardware)?
>> 2. Is there anything I can do to debug this further to confirm what part
>> of the system is failing (e.g. either CPU/RAM or PCI/BUS timeout)?
> 
> Grasping at straws: could you try running a memory test program, e.g. memtest86?

Yes, we've run some diagnostics on one of the boxes and all seems well.
However, we still need to compare them.

> Is this a server class machine with ECC memory? If so, is it possible
> to get the linux kernel to report any soft memory errors that get corrected
> via the ECC hardware?
> 
> Is there anything in linux/Documentation/drivers/edac/edac.txt
> that might help? (I have not used this myself). There may be
> non-fatal errors that are happening before the fatal one.
> That might give you or your hardware engineer a clue as to
> where else to look.

Ah, this looks good. The edac modules were loaded already (by udev I
presume). I've enabled the logging features via /sys. Thanks for the tip.
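
For anyone following along, here is a rough sketch of how the EDAC
counters could be polled from a script. It assumes the usual sysfs
layout described in edac.txt (one mcN directory per memory controller
under /sys/devices/system/edac/mc, each with ce_count and ue_count
files). The function name read_edac_counts and its root parameter are
my own invention for illustration, not part of any EDAC tooling.

```python
import os


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce": n, "ue": n}} for each mcN directory under root.

    "ce" is the corrected-error count and "ue" the uncorrected-error count.
    A value of None means the counter file was missing or unreadable.
    """
    counts = {}
    if not os.path.isdir(root):
        # EDAC modules not loaded, or the sysfs layout differs (assumption).
        return counts
    for name in sorted(os.listdir(root)):
        mc_dir = os.path.join(root, name)
        if not (name.startswith("mc") and os.path.isdir(mc_dir)):
            continue
        entry = {}
        for key, fname in (("ce", "ce_count"), ("ue", "ue_count")):
            try:
                with open(os.path.join(mc_dir, fname)) as fh:
                    entry[key] = int(fh.read().strip())
            except (OSError, ValueError):
                entry[key] = None
        counts[name] = entry
    return counts


if __name__ == "__main__":
    for mc, c in read_edac_counts().items():
        print(f"{mc}: corrected={c['ce']} uncorrected={c['ue']}")
```

Run from cron on both boxes, a rising corrected count on only one of
them would be a useful hint before the next panic.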

> How about building a linux kernel with some form of debugging
> turned on? This might help you to see if something is
> scribbling on memory when it shouldn't be.

Yes, we've thought about enabling the gdb-stub as described in
http://wiki.xensource.com/xenwiki/XenPPC/Debug/XenGDBStub. I'm presuming
this will work for architectures other than ppc. I see this as a last
resort, as kernel debugging can be quite time-consuming!

Thanks for your help; it has given me some ideas on how to approach this.

Matt

- --
 Matthew Baker, UNIX Systems Administrator
 ----------------------------------------------------
 Institute for Learning and Research Technology (ILRT)
 A: University of Bristol,
    8-10 Berkeley Square,
    Bristol.
    BS8 1HH
 W: http://www.ilrt.bristol.ac.uk
 E: matt.baker@xxxxxxxxxx
 T: +44 (0)117 928 7121
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHX8QsLvm7pB/aicMRAuGeAJ4mb4NSPj6YeRSC48iKz2N0U3jm3gCfZM1d
Pr3mJfQZsO0bvCvtUoqjwT8=
=XSUr
-----END PGP SIGNATURE-----

_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-users