WARNING - OLD ARCHIVES

This is an archived copy of the Xen.org mailing list, which we have preserved to ensure that existing links to archives are not broken. The live archive, which contains the latest emails, can be found at http://lists.xen.org/
   
 
 
xen-devel

Re: [Xen-devel] XCP: Crashes on dual Xeon HP ProLiant systems

To: Pasi Kärkkäinen <pasik@xxxxxx>
Subject: Re: [Xen-devel] XCP: Crashes on dual Xeon HP ProLiant systems
From: "dwight at supercomputer.org" <dwight@xxxxxxxxxxxxxxxxx>
Date: Sat, 1 May 2010 14:06:25 -0700
Cc: xen-devel@xxxxxxxxxxxxxxxxxxx
Delivery-date: Sat, 01 May 2010 14:12:32 -0700
Envelope-to: www-data@xxxxxxxxxxxxxxxxxxx
In-reply-to: <20100430182007.GA17817@xxxxxxxxxxx>
References: <201004300932.37495.dwight@xxxxxxxxxxxxxxxxx> <20100430182007.GA17817@xxxxxxxxxxx>
Sender: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx

On Friday 30 April 2010 11:20:07 am Pasi Kärkkäinen wrote:
> On Fri, Apr 30, 2010 at 09:32:37AM -0700, dwight at supercomputer.org wrote:
> > Is anyone else running the latest XCP on HP ProLiant DL380
> > systems? Or a similar dual Xeon 8-core system? I'm seeing
> > spontaneous reboots when under a load. ...
>
> Uhm.. the compiler really shouldn't crash.
>
> Are you sure your hardware is OK? If the stock EL5.4 Xen also
> crashes, it could be broken hardware?
>
> Did you try running memtest86+ ?
>
> Is baremetal Linux stable, if you run for example
> "make -j8 bzImage && make -j8 modules && make clean" kernel build
> in a loop?
>
> -- Pasi

Thank you for your reply, Pasi.

I agree that the compiler shouldn't crash. That's definitely
rude behavior.

It might well be broken hardware. Until now I had thought it
more likely that the problem lay between the older CentOS Xen
and this much newer Xeon hardware, so that the "hardware or OS
problem" that gcc was complaining about was really an issue
with the virtualized hardware.

But yesterday I ran into a different issue, which leads me to
believe that it is either a physical hardware or Dom0 OS issue.

On the machine which was running XCP, I tried installing
64-bit CentOS 5.4. The installation crashed, two separate times.
The first time I didn't have a log file (since it was a
video-based installation). The second time I used the iLO
virtualized serial port, and I could see that the installation
crashed about halfway through. Again it was a spontaneous
reboot, the same as XCP experienced.

I talked to one of the guys in the lab, who has done far more
installations of these ProLiant (and Dell) boxes than I have,
and he was quite familiar with this. He said that on some of
these boxes (both HP and Dell), the 64-bit CentOS 5.4 install
will crash. But supposedly the 32-bit installation will work.

He also said that CentOS 5.3, both 32- and 64-bit, works fine.
I realize that this is anecdotal, and I don't have any more
information here (as to the CPUs and hardware), but I thought
that this was interesting.

At this point, I don't trust either the hardware or the OS,
so I'm going to start a full diagnostics run using a suite
that I've put together over the past 15 years, which has
served me very well in qualifying boxes.

memtest86 is one of these. I mentioned earlier that I had
started an overnight run of this on both boxes. I can now
report that both have passed. After 12+ hours, they had gone
successfully through two separate runs without error.

Next up is prime95, with the torture test. Nothing else comes
close to exercising the CPU as hard, as indicated by the heat
given off during this test. This will also be a test of the
cooling.

If that passes, then I'm going to exercise the disk subsystem.
One of those tests is very similar to what you suggested:
multiple rebuilds of the kernel, but from scratch each time.
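
For reference, a rebuild loop of this kind could be sketched roughly as
below. This is only a minimal sketch, not the actual script from my
suite: the names run_in_loop and KERNEL_BUILD are made up for
illustration, and it assumes it is pointed at an already configured
kernel source tree. The make commands are the ones Pasi quoted, with
"make clean" moved to the front so that each pass starts from scratch.
Each line of progress is flushed immediately, since a spontaneous
reboot would otherwise lose whatever was still buffered.

```python
import subprocess
import sys
import time


def run_in_loop(commands, iterations, cwd=None, log=sys.stdout):
    """Run each command in order, `iterations` times; stop at the first failure.

    Returns the number of iterations that completed without error.
    """
    for i in range(1, iterations + 1):
        started = time.time()
        for cmd in commands:
            result = subprocess.run(cmd, cwd=cwd)
            if result.returncode != 0:
                # Flush at once so the message survives a sudden reboot.
                print(f"iteration {i}: {cmd!r} exited with {result.returncode}",
                      file=log, flush=True)
                return i - 1
        print(f"iteration {i}: ok in {time.time() - started:.0f}s",
              file=log, flush=True)
    return iterations


# The build steps from the quoted suggestion; "make clean" comes first so
# every pass rebuilds from scratch. -j8 matches the 8 cores mentioned above.
KERNEL_BUILD = [
    ["make", "clean"],
    ["make", "-j8", "bzImage"],
    ["make", "-j8", "modules"],
]

# Hypothetical usage (the path is an assumption, adjust to your tree):
# completed = run_in_loop(KERNEL_BUILD, 50, cwd="/usr/src/linux")
```

Sending the output over the serial console, or to a file on another
machine, means the last completed iteration is still visible after a
crash.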

Frankly, though, I'm going to see if I can get a different
ProLiant box. Nonetheless, I want the data on this one.
I'm hoping that I can detect a box which will fail, before I
run XCP on it.

I'll post the results when I have them, hopefully in a
couple of days.

   -dwight-




