This is an archived copy of the Xen.org mailing list, which we have preserved to ensure that existing links to archives are not broken. The live archive, which contains the latest emails, can be found at http://lists.xen.org/


RE: [Xen-devel] State of current Xen debugger

To: "Tim Deegan" <Tim.Deegan@xxxxxxxxxx>
Subject: RE: [Xen-devel] State of current Xen debugger
From: "Roger Cruz" <roger.cruz@xxxxxxxxxxxxxxxxxxx>
Date: Tue, 14 Sep 2010 11:08:58 -0500
Cc: xen-devel@xxxxxxxxxxxxxxxxxxx
Delivery-date: Tue, 14 Sep 2010 09:09:55 -0700
Envelope-to: www-data@xxxxxxxxxxxxxxxxxxx
List-help: <mailto:xen-devel-request@lists.xensource.com?subject=help>
List-id: Xen developer discussion <xen-devel.lists.xensource.com>
List-post: <mailto:xen-devel@lists.xensource.com>
List-subscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
List-unsubscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
References: <EACA7CA90354A849B1315959042A052C26F50A@xxxxxxxxxxxxxxxxxxxxx> <20100914143002.GD29761@xxxxxxxxxxxxxxxxxxxxxxx> <EACA7CA90354A849B1315959042A052C26F50B@xxxxxxxxxxxxxxxxxxxxx> <20100914152019.GE29761@xxxxxxxxxxxxxxxxxxxxxxx>
Sender: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
Thread-index: ActUIGY9UpJcgn5+R7CsnAahGjw8sgABUCSU
Thread-topic: [Xen-devel] State of current Xen debugger

I am using Xen 3.4.2 with some modifications.

I added printks to nmi_watchdog_tick as shown below.  I don't break the console lock, but I am convinced that the printk lock is not the problem: I also tested with an empty (no-op) printk routine and it still hangs, so it felt pretty safe not to break the lock.  I also tried console_start_sync/console_end_sync to make sure I was seeing all the messages when it hung.

void nmi_watchdog_tick(struct cpu_user_regs * regs)
{
    unsigned int sum = this_cpu(nmi_timer_ticks);

    if ( (this_cpu(last_irq_sums) == sum) &&
         !atomic_read(&watchdog_disable_count) )
    {
      /* Added debug output: no timer ticks since the previous NMI. */
      if (sum > 20) {
        //      console_start_sync();
        printk("**** CPU%d, counter=%d, last_sum=%d, curr_sum=%d, hz=%d, nmis=%d\n",
               smp_processor_id(), this_cpu(alert_counter), this_cpu(last_irq_sums), sum, 5*nmi_hz,  nmi_count(smp_processor_id()) );
        //      console_end_sync();
      }
        /*
         * Ayiee, looks like this CPU is stuck ... wait a few IRQs (5 seconds)
         * before doing the oops ...
         */
        this_cpu(alert_counter)++;
        if ( this_cpu(alert_counter) == 5*nmi_hz )
        {
            printk("Watchdog timer detects that CPU%d is stuck!\n",
                   smp_processor_id());
            fatal_trap(TRAP_nmi, regs);
        }
    }
    else
    {
      /* Added debug output: ticks advanced, so the CPU is making progress. */
      if (sum > 20) {
        //      console_start_sync();
        printk("*CPU%d, counter=%d, last_sum=%d, curr_sum=%d, nmis=%d\n",
               smp_processor_id(), this_cpu(alert_counter), this_cpu(last_irq_sums), sum, nmi_count(smp_processor_id()) );
      }
        this_cpu(last_irq_sums) = sum;
        this_cpu(alert_counter) = 0;
    }
}
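
For readers following the logic, here is a minimal user-space sketch of the same progress check. It is not Xen code: struct watchdog_state, watchdog_tick and the threshold parameter are hypothetical stand-ins for the per-CPU variables and 5*nmi_hz above. It only illustrates that a CPU is declared stuck when several consecutive NMIs see an unchanged tick count, and that any progress resets the count.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the per-CPU watchdog state used above. */
struct watchdog_state {
    unsigned int last_sum;      /* tick count seen at the previous NMI */
    unsigned int alert_counter; /* consecutive NMIs with no new ticks  */
};

/*
 * Model of one watchdog NMI.  'sum' is the current timer tick count and
 * 'threshold' plays the role of 5*nmi_hz.  Returns true on the NMI at
 * which the CPU would be declared stuck.
 */
static bool watchdog_tick(struct watchdog_state *ws, unsigned int sum,
                          unsigned int threshold)
{
    if (ws->last_sum == sum) {
        /* No timer ticks since the last NMI: count towards the alert. */
        ws->alert_counter++;
        return ws->alert_counter == threshold;
    }

    /* Ticks advanced: remember the new count and start over. */
    ws->last_sum = sum;
    ws->alert_counter = 0;
    return false;
}
```

In this model the check can only fire if NMIs keep arriving after the ticks stop. If the NMIs themselves stop, nothing calls the function and nothing is printed, which would look the same from the outside as the silent hang described in this message.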

My messages stop printing and I get a hard hang.  The performance counter NMI appears to arrive about once every 4 seconds.  However, I have observed instances where they are about 10 seconds apart.  I am not sure what makes the NMIs arrive at uneven intervals.  As a test, I turned on SpeedStep and the power management functions in the BIOS, and it still hangs.

(XEN) *CPU0, counter=0, last_sum=974, curr_sum=977, nmis=391
(XEN) *CPU0, counter=0, last_sum=977, curr_sum=979, nmis=392
(XEN) *CPU0, counter=0, last_sum=979, curr_sum=981, nmis=393
(XEN) *CPU0, counter=0, last_sum=981, curr_sum=984, nmis=394
(XEN) *CPU0, counter=0, last_sum=984, curr_sum=986, nmis=395
(XEN) *CPU0, counter=0, last_sum=986, curr_sum=988, nmis=396
(XEN) *CPU0, counter=0, last_sum=988, curr_sum=991, nmis=397
(XEN) *CPU0, counter=0, last_sum=991, curr_sum=993, nmis=398
(XEN) *CPU0, counter=0, last_sum=993, curr_sum=995, nmis=399
(XEN) *CPU0, counter=0, last_sum=995, curr_sum=997, nmis=400
(XEN) *CPU0, counter=0, last_sum=997, curr_sum=1000, nmis=401
(XEN) *CPU0, counter=0, last_sum=1000, curr_sum=1002, nmis=402
(XEN) *CPU0, counter=0, last_sum=1002, curr_sum=1005, nmis=403
(XEN) *CPU0, counter=0, last_sum=1005, curr_sum=1008, nmis=404
(XEN) *CPU0, counter=0, last_sum=1008, curr_sum=1010, nmis=405
(XEN) *CPU0, counter=0, last_sum=1010, curr_sum=1013, nmis=406
(XEN) *CPU0, counter=0, last_sum=1013, curr_sum=1015, nmis=407
(XEN) *CPU0, counter=0, last_sum=1015, curr_sum=1018, nmis=408
(XEN) *CPU0, counter=0, last_sum=1018, curr_sum=1020, nmis=409
(XEN) *CPU0, counter=0, last_sum=1020, curr_sum=1023, nmis=410
(XEN) *CPU0, counter=0, last_sum=1023, curr_sum=1026, nmis=411
(XEN) *CPU0, counter=0, last_sum=1026, curr_sum=1029, nmis=412
(XEN) *CPU0, counter=0, last_sum=1029, curr_sum=1031, nmis=413
(XEN) *CPU0, counter=0, last_sum=1031, curr_sum=1033, nmis=414
(XEN) *CPU0, counter=0, last_sum=1033, curr_sum=1035, nmis=415
(XEN) *CPU0, counter=0, last_sum=1035, curr_sum=1038, nmis=416
(XEN) *CPU0, counter=0, last_sum=1038, curr_sum=1041, nmis=417
(XEN) *CPU0, counter=0, last_sum=1041, curr_sum=1043, nmis=418
(XEN) *CPU0, counter=0, last_sum=1043, curr_sum=1046, nmis=419
(XEN) *CPU0, counter=0, last_sum=1046, curr_sum=1049, nmis=420
(XEN) *CPU0, counter=0, last_sum=1049, curr_sum=1051, nmis=421
(XEN) *CPU0, counter=0, last_sum=1051, curr_sum=1055, nmis=422
(XEN) *CPU0, counter=0, last_sum=1055, curr_sum=1058, nmis=423
(XEN) *CPU0, counter=0, last_sum=1058, curr_sum=1061, nmis=424
(XEN) *CPU0, counter=0, last_sum=1061, curr_sum=1064, nmis=425
(XEN) *CPU0, counter=0, last_sum=1064, curr_sum=1067, nmis=426
(XEN) *CPU0, counter=0, last_sum=1067, curr_sum=1070, nmis=427
(XEN) *CPU0, counter=0, last_sum=1070, curr_sum=1073, nmis=428
(XEN) *CPU0, counter=0, last_sum=1073, curr_sum=1076, nmis=429
 __  __            _____ _  _    ____
 \ \/ /___ _ __   |___ /| || |  |___ \
  \  // _ \ '_ \    |_ \| || |_   __) |
  /  \  __/ | | |  ___) |__   _| / __/
 /_/\_\___|_| |_| |____(_) |_|(_)_____|
(XEN) Xen version 3.4.2 (rcruz@) (gcc version 4.4.3 (Ubuntu 4.4.3-4ubuntu5) ) Mon Sep 13 23:06:17 UTC 2010
(XEN) Latest ChangeSet: Mon Sep 13 16:12:14 2010 -0400 132:a499dd8fcb55

-----Original Message-----
From: Tim Deegan [mailto:Tim.Deegan@xxxxxxxxxx]
Sent: Tue 9/14/2010 11:20 AM
To: Roger Cruz
Cc: xen-devel@xxxxxxxxxxxxxxxxxxx
Subject: Re: [Xen-devel] State of current Xen debugger

At 15:56 +0100 on 14 Sep (1284479787), Roger Cruz wrote:
> I had a pretty good inkling that one of you hardcore developers would
> say that :-) Yes, it is pretty well wedged.  I can cause the problem
> more rapidly by dropping to a single CPU.  When the hang happens, the
> Xen console is completely dead.  None of the special keys work.

If the 'd' key doesn't work then the serial irq isn't getting handled,
so the CPU is wedged at a higher TPR (at least).  Usually in that case
the CPU is spinning so the NMI watchdog timer kicks in OK; possibly if
it was idle with a high TPR it wouldn't.
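
As a rough illustration of the TPR point, the sketch below models how an x86 local APIC compares a vector's priority class with the task priority. The function names are made up for this example, and it is a simplification: a real APIC compares against the processor priority, which also takes in-service interrupts into account. NMIs are not subject to this check, which is why the watchdog would normally still fire on a spinning CPU.

```c
#include <stdbool.h>
#include <stdint.h>

/* Priority class of an x86 interrupt vector: its upper four bits. */
static unsigned int vector_priority_class(uint8_t vector)
{
    return vector >> 4;
}

/*
 * Simplified check: a fixed interrupt is delivered only if its priority
 * class is strictly higher than the class held in TPR bits 7:4.
 */
static bool tpr_allows(uint8_t vector, uint8_t tpr)
{
    return vector_priority_class(vector) > (unsigned int)(tpr >> 4);
}
```

For example, with a hypothetical serial vector of 0x31, a TPR of 0x20 lets the interrupt through, while a TPR of 0x30 or higher holds it pending, so the debug keys would appear dead.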

What version of Xen are you using? 

It might be worth trying a boot with MSI disabled (there were reports at
one stage of MSIs not being EOI'd because the timer interrupt that would
remind Xen to EOI them was at a lower priority than the MSI).

> I do have hopes a BIOS upgrade could fix this as a last resort but I
> want to see if at least I can understand the problem.  We have a few
> different machines that are exhibiting similar symptoms so I have to
> see if I can find a work-around without requiring every user to
> upgrade their BIOS :-(
> Just in case, what debugger have you been using?  Are there recent
> instructions on how to set it up that you can point me to?

I don't use a debugger on Xen.  I usually find that by the time the
debugger kicks in it's too late to help, so I end up finding bugs by
code inspection and printks. :)

Mukesh Rathor at Oracle has done some debugger work, though, including
an in-Xen debugger.  There's a gdb stub too but I suspect it's rotted
quite badly.



> Thanks
> Roger
> -----Original Message-----
> From: Tim Deegan [mailto:Tim.Deegan@xxxxxxxxxx]
> Sent: Tue 9/14/2010 10:30 AM
> To: Roger Cruz
> Cc: xen-devel@xxxxxxxxxxxxxxxxxxx
> Subject: Re: [Xen-devel] State of current Xen debugger
> Hi,
> At 15:22 +0100 on 14 Sep (1284477779), Roger Cruz wrote:
> > I am trying to debug a problem where the hypervisor is hanging hard.
> > Not even the NMI watchdog is triggering a reboot.  So I wanted to hook
> > up a debugger.
> Sorry to bring a counsel of despair but if the NMI watchdog isn't
> working then your chances of getting a working debugger are slim.  It's
> likely that at least one CPU is very very stuck.  Does the 'd' debug key
> work on the serial line when the machine is wedged?
> On a more cheerful note, I've twice seen hard hangs like this that
> turned out to be hardware issues, fixable with BIOS upgrades.
> Cheers,
> Tim.
> > What is the state of the current debuggers out there?
> > Any input on how I should set it up (kdb, gdb, etc) and pointers to a
> > good wiki page are much appreciated.  I did perform a Google search
> > and found some links but I want to hear from the current developers as
> > to what is most stable and useful for debugging this type of hard
> > hang.  I only have a serial port PCI-express card to use as the laptop
> > has no built in port.
> --
> Tim Deegan <Tim.Deegan@xxxxxxxxxx>
> Principal Software Engineer, XenServer Engineering
> Citrix Systems UK Ltd.  (Company #02937203, SL9 0BG)
> No virus found in this incoming message.
> Checked by AVG - www.avg.com
> Version: 9.0.851 / Virus Database: 271.1.1/3119 - Release Date: 09/14/10 02:35:00

Tim Deegan <Tim.Deegan@xxxxxxxxxx>
Principal Software Engineer, XenServer Engineering
Citrix Systems UK Ltd.  (Company #02937203, SL9 0BG)

