[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH] xen: add support for automatic debug key actions in case of crash



On 29.10.20 15:22, Jan Beulich wrote:
On 22.10.2020 16:39, Juergen Gross wrote:
When the host crashes it would sometimes be nice to have additional
debug data available which could be produced via debug keys, but
halting the server for manual intervention might be impossible due to
the need to reboot/kexec rather sooner than later.

Add support for automatic debug key actions in case of crashes which
can be activated via boot- or runtime-parameter.

While I can certainly see this possibly being a useful thing in
certain situations, I'm uncertain it's going to be helpful in at
least a fair set of cases. What output to request very much
depends on the nature of the problem one is running into, and
the more keys one adds "just in case", the longer the reboot
latency, and the higher the risk (see also below) of the output
generation actually causing further problems.

The obvious case is a watchdog induced crash: at least 2 sets of dom0
state will help in many cases.

IOW I'm neither fully convinced that we want this, nor fully
opposed.

Depending on the type of crash the desired data might be different, so
support different settings for the possible types of crashes.

The parameter is "crash-debug" with the following syntax:

   crash-debug-<type>=<string>

with <type> being one of:

   panic, hwdom, watchdog, kexeccmd, debugkey

and <string> a sequence of debug key characters with '.' having the
special semantics of a 1 second pause.

1 second is a whole lot of time. To get two successive sets
of data, a much shorter delay (if any) would normally suffice.

Yes, I'd be fine to trade that for a shorter period of time.

Also, while '.' may seem like a good choice right now, with the
shortage of characters we may want to put a real handler behind
it at come point. The one character that clearly won't make much
sense to use in this context is 'h', but that's awful as a (kind
of) separator. Could we perhaps replace 'h' by '?', freeing up
'h' and allowing '?' to be used for this purpose here?

Fine with me. Another possibility would be to add '\' as an escape
character with '\.' meaning "debug-key .".

--- a/docs/misc/xen-command-line.pandoc
+++ b/docs/misc/xen-command-line.pandoc
@@ -574,6 +574,29 @@ reduction of features at Xen's disposal to manage guests.
  ### cpuinfo (x86)
  > `= <boolean>`
+### crash-debug-debugkey
+### crash-debug-hwdom
+### crash-debug-kexeccmd
+### crash-debug-panic
+### crash-debug-watchdog
+> `= <string>`
+
+> Can be modified at runtime
+
+Specify debug-key actions in cases of crashes. Each of the parameters applies
+to a different crash reason. The `<string>` is a sequence of debug key
+characters, with `.` having the special meaning of a 1 second pause.
+
+So e.g. `crash-debug-watchdog=0.0r` would dump dom0 state twice with a
+second between the two state dumps, followed by the run queues of the
+hypervisor, if the system crashes due to a watchdog timeout.
+
+These parameters should be used carefully, as e.g. specifying
+`crash-debug-debugkey=C` would result in an endless loop. Depending on the
+reason of the system crash it might happen that triggering some debug key
+action will result in a hang instead of dumping data and then doing a
+reboot or crash dump.

I think it would be useful if the flavors were (briefly)
explained: At the very least "debugkey" doesn't directly fit "in
cases of crashes", as there's no crash involved. kexec_crash()
instead gets invoked without there having been any crash.

Yes, and having some additional state generate for this case might
help diagnosis.


You may also want to point out that this is a best effort thing
only - system state at the point of a crash may be such that the
attempt of handling one or the debug keys would have further bad
effects on the system, including that the actual kexec may then
never occur.

True.


@@ -507,6 +509,41 @@ void __init initialize_keytable(void)
      }
  }
+#define CRASHACTION_SIZE 32
+static char crash_debug_panic[CRASHACTION_SIZE];
+static char crash_debug_hwdom[CRASHACTION_SIZE];
+static char crash_debug_watchdog[CRASHACTION_SIZE];
+static char crash_debug_kexeccmd[CRASHACTION_SIZE];
+static char crash_debug_debugkey[CRASHACTION_SIZE];
+
+static char *crash_action[CRASHREASON_N] = {
+    [CRASHREASON_PANIC] = crash_debug_panic,
+    [CRASHREASON_HWDOM] = crash_debug_hwdom,
+    [CRASHREASON_WATCHDOG] = crash_debug_watchdog,
+    [CRASHREASON_KEXECCMD] = crash_debug_kexeccmd,
+    [CRASHREASON_DEBUGKEY] = crash_debug_debugkey,
+};
+
+string_runtime_param("crash-debug-panic", crash_debug_panic);
+string_runtime_param("crash-debug-hwdom", crash_debug_hwdom);
+string_runtime_param("crash-debug-watchdog", crash_debug_watchdog);
+string_runtime_param("crash-debug-kexeccmd", crash_debug_kexeccmd);
+string_runtime_param("crash-debug-debugkey", crash_debug_debugkey);

In general I'm not in favor of this (or similar) needing
five new command line options, instead of just one. The problem
with e.g.

crash-debug=panic:rq,watchdog:0d

is that ',' (or any other separator chosen) could in principle
also be a debug key. It would still be possible to use

crash-debug=panic:rq crash-debug=watchdog:0d

though. Thoughts?

OTOH the runtime parameters are much easier addressable that way.


+void keyhandler_crash_action(enum crash_reason reason)
+{
+    const char *action = crash_action[reason];
+    struct cpu_user_regs *regs = get_irq_regs() ? : guest_cpu_user_regs();
+
+    while ( *action ) {

Misplaced brace.

Will fix.


+        if ( *action == '.' )
+            mdelay(1000);
+        else
+            handle_keypress(*action, regs);
+        action++;
+    }
+}

I think only diagnostic keys should be permitted here.

While I understand that other keys could produce nonsense or do harm,
I'm not sure we should really prohibit them. Allowing them would e.g.
allow to do just a reboot without kdump for one type of crashes.


--- a/xen/include/xen/kexec.h
+++ b/xen/include/xen/kexec.h
@@ -1,6 +1,8 @@
  #ifndef __XEN_KEXEC_H__
  #define __XEN_KEXEC_H__
+#include <xen/keyhandler.h>

Could we go without this, utilizing the gcc extension of forward
declared enums? Otoh ...

@@ -82,7 +84,11 @@ void vmcoreinfo_append_str(const char *fmt, ...)
  #define kexecing 0
static inline void kexec_early_calculations(void) {}
-static inline void kexec_crash(void) {}
+static inline void kexec_crash(enum crash_reason reason)
+{
+    keyhandler_crash_action(reason);
+}

... if this is to be an inline function and not just a #define,
it'll need the declaration of the function to have been seen.

And even being a #define all users of kexec_crash() would need to
#include keyhandler.h (and I'm not sure there are any source files
including kexec.h which don't use kexec_crash()).


Juergen



 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.