
[Xen-devel] NMI with SMP domain causing machine to reboot



  I have spent most of the last few weeks trying to nail down a nasty bug
  that is preventing me from releasing xenoprof for SMP domains.
  The bug is non-deterministic, and when it happens the machine simply
  reboots with no message or warning on the serial console.
  This has made the debugging process painful and slow.

  I started removing specific components of the xenoprof code to find
  which component is causing the problem. After removing almost all of
  the code, it seems the bug is associated with NMIs.
  Right now the only code left is the code that programs a hardware
  performance counter to count "non-halted" clock cycles (hard-coded)
  and the code that handles NMIs. All other logic was removed, and I am
  still seeing the machine reboot on its own at some non-deterministic
  time.
  I am starting to suspect this might be a Xen bug, and I will probably
  need some help from the Xen core team to nail this down.
  
  I have attached a patch that enables Xen to program the performance
  counter and handle the NMIs it generates. I have also attached a patch
  for a new user-level test tool that starts the performance counter.
  I hope these patches enable others to reproduce the behavior I am
  observing.

  I only see this bug when running SMP domains (either dom0 or domU)
  with NMIs being generated. My machine has two CPUs with hyperthreading
  disabled. When I boot an SMP domain0 (with 2 VCPUs) I only see the
  bug when NMIs are generated for CPU 1. Surprisingly, I have never seen
  the reboot when NMIs are generated on CPU 0 only. Since the bug is
  non-deterministic, it is possible that the bug is still there but for
  some reason is not triggered by NMIs on CPU 0.

  Here is the sequence of steps that I use to trigger the bug (on an SMP
  dom0 with 2 VCPUs):

  1) initialize the performance counter
     > xenpmc -i
  2) start the counter
     > xenpmc -g
  3) verify that NMIs are being generated 
     > xenpmc -s
     This causes a count of NMIs for [CPU0,CPU1] to be printed.
     This command was originally intended to stop the counters
     (and NMI generation), but it was modified to just return
     without stopping the counters. As a side effect, the number
     of NMIs is printed on the Xen console and can be used to
     verify that NMIs are being generated.
  
  To trigger the bug I execute the command "xm dmesg" in a loop,
  and eventually the machine reboots on its own (usually after a
  few minutes). I use the following shell script to execute
  "xm dmesg" in a loop.

    #!/bin/bash
    while true;
    do xm dmesg;
    sleep 1;
    done
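
  Since the machine reboots without any console output, it may help to
  record how far the loop got before each reboot. The following is only
  a sketch of one way to do that, not part of the attached patches: the
  function name stress_loop, the STRESS_DELAY variable and the log path
  in the usage line are all hypothetical. It runs a given command
  repeatedly, appends a timestamped iteration count to a log file, and
  calls sync so that the last entry has a chance of surviving the reboot.

```shell
#!/bin/bash
# Hypothetical helper (not from the attached patches).
# Usage: stress_loop LOG_FILE MAX_ITER COMMAND [ARGS...]
#   MAX_ITER = 0 means loop forever.
#   STRESS_DELAY (optional) is the pause between iterations, default 1 second.
stress_loop() {
    local log_file=$1
    local max_iter=$2
    shift 2
    local i=0
    while [ "$max_iter" -eq 0 ] || [ "$i" -lt "$max_iter" ]; do
        i=$((i + 1))
        # Run the command under test; its output is not needed here.
        "$@" > /dev/null 2>&1
        # Record progress, then flush to disk so the entry is not lost
        # if the machine reboots right afterwards.
        echo "$(date +%s) iteration $i" >> "$log_file"
        sync
        sleep "${STRESS_DELAY:-1}"
    done
}
```

  With the counter initialized and started as in steps 1 and 2, the loop
  would be invoked as, for example:
  stress_loop /var/tmp/nmi_repro.log 0 xm dmesg
  After a reboot, the last line of the log gives the approximate time
  and iteration at which the machine went down.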

  Does anybody have an idea of what could be causing this behavior and
  how we could nail it down?

  Thanks

  Renato
  

Attachment: nmitest_xen.patch
Description: nmitest_xen.patch

Attachment: nmitest_tools.patch
Description: nmitest_tools.patch

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel

 

