[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH v3] xen: add support for automatic debug key actions in case of crash



On 11.12.2020 11:22, Julien Grall wrote:
> Hi,
> 
> On 11/12/2020 07:24, Jan Beulich wrote:
>> On 11.12.2020 08:02, Jürgen Groß wrote:
>>> On 10.12.20 21:51, Julien Grall wrote:
>>>> Hi Jan,
>>>>
>>>> On 09/12/2020 14:29, Jan Beulich wrote:
>>>>> On 09.12.2020 13:11, Julien Grall wrote:
>>>>>> On 26/11/2020 11:20, Jan Beulich wrote:
>>>>>>> On 26.11.2020 09:03, Juergen Gross wrote:
>>>>>>>> When the host crashes it would sometimes be nice to have additional
>>>>>>>> debug data available which could be produced via debug keys, but
>>>>>>>> halting the server for manual intervention might be impossible due to
>>>>>>>> the need to reboot/kexec rather sooner than later.
>>>>>>>>
>>>>>>>> Add support for automatic debug key actions in case of crashes which
>>>>>>>> can be activated via boot- or runtime-parameter.
>>>>>>>>
>>>>>>>> Depending on the type of crash the desired data might be different, so
>>>>>>>> support different settings for the possible types of crashes.
>>>>>>>>
>>>>>>>> The parameter is "crash-debug" with the following syntax:
>>>>>>>>
>>>>>>>>      crash-debug-<type>=<string>
>>>>>>>>
>>>>>>>> with <type> being one of:
>>>>>>>>
>>>>>>>>      panic, hwdom, watchdog, kexeccmd, debugkey
>>>>>>>>
>>>>>>>> and <string> a sequence of debug key characters with '+' having the
>>>>>>>> special semantics of a 10 millisecond pause.
>>>>>>>>
>>>>>>>> So "crash-debug-watchdog=0+0qr" would result in special output in case
>>>>>>>> of watchdog triggered crash (dom0 state, 10 ms pause, dom0 state,
>>>>>>>> domain info, run queues).
>>>>>>>>
>>>>>>>> Signed-off-by: Juergen Gross <jgross@xxxxxxxx>
>>>>>>>> ---
>>>>>>>> V2:
>>>>>>>> - switched special character '.' to '+' (Jan Beulich)
>>>>>>>> - 10 ms instead of 1 s pause (Jan Beulich)
>>>>>>>> - added more text to the boot parameter description (Jan Beulich)
>>>>>>>>
>>>>>>>> V3:
>>>>>>>> - added const (Jan Beulich)
>>>>>>>> - thorough test of crash reason parameter (Jan Beulich)
>>>>>>>> - kexeccmd case should depend on CONFIG_KEXEC (Jan Beulich)
>>>>>>>> - added dummy get_irq_regs() helper on Arm
>>>>>>>>
>>>>>>>> Signed-off-by: Juergen Gross <jgross@xxxxxxxx>
>>>>>>>
>>>>>>> Except for the Arm aspect, where I'm not sure using
>>>>>>> guest_cpu_user_regs() is correct in all cases,
>>>>>>
>>>>>> I am not entirely sure to understand what get_irq_regs() is supposed to
>>>>>> returned on x86. Is it the registers saved from the most recent
>>>>>> exception?
>>>>>
>>>>> An interrupt (not an exception) sets the underlying per-CPU
>>>>> variable, such that interested parties will know the real
>>>>> context is not guest or "normal" Xen code, but an IRQ.
>>>>
>>>> Thanks for the explanation. I am a bit confused to why we need to give a
>>>> regs to handle_keypress() because no-one seems to use it. Do you have an
>>>> explanation?
>>>
>>> dump_registers() (key 'd') is using it.
>>>
>>>>
>>>> To add to the confusion, it looks like that get_irqs_regs() may return
>>>> NULL. So sometimes we may pass guest_cpu_regs() (which may contain
>>>> garbagge or a set too far).
>>>
>>> I guess this is a best effort approach.
>>
>> Indeed. If there are ways to make it "more best", we should of
>> course follow them. (Except before Dom0 starts, I'm afraid I
>> don't see though where garbage would come from. And even then,
>> just like for the idle vCPU-s, it shouldn't really be garbage,
>> or else this suggests missing initialization somewhere.)
> 
> So I decided to mimick what 'd' does to see what happen if this is 
> called early boot.

But this isn't really relevant here: If you need to deal with a
crash during boot, just don't specify these command line options
(and that's on top of Jürgen's indication that 'd' may not be
very useful to specify here anyway, albeit now that I think
about this I'm not so sure anymore - panic() only logs the local
CPUs registers iirc, while 'd' would log everyone's). Of course
Jürgen could go and limit honoring of the option to sufficiently
high SYS_STATE_*. In particular at least the x86 crash you've
observed is - afaict - from the is_idle_vcpu(current) check in
dump_execstate(), which requires init_idle_domain() to have run
before.

Jan



 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.