
Re: handle_pio looping during domain shutdown, with qemu 4.2.0 in stubdom



On 05.06.2020 14:01, Marek Marczykowski-Górecki wrote:
> On Fri, Jun 05, 2020 at 11:22:46AM +0200, Jan Beulich wrote:
>> On 05.06.2020 11:09, Jan Beulich wrote:
>>> On 04.06.2020 16:25, Marek Marczykowski-Górecki wrote:
>>>> (XEN) hvm.c:1620:d6v0 All CPUs offline -- powering off.
>>>> (XEN) d3v0 handle_pio port 0xb004 read 0x0000
>>>> (XEN) d3v0 handle_pio port 0xb004 read 0x0000
>>>> (XEN) d3v0 handle_pio port 0xb004 write 0x0001
>>>> (XEN) d3v0 handle_pio port 0xb004 write 0x2001
>>>> (XEN) d4v0 XEN_DMOP_remote_shutdown domain 3 reason 0
>>>> (XEN) d4v0 domain 3 domain_shutdown vcpu_id 0 defer_shutdown 1
>>>> (XEN) d4v0 XEN_DMOP_remote_shutdown domain 3 done
>>>> (XEN) hvm.c:1620:d5v0 All CPUs offline -- powering off.
>>>> (XEN) d1v0 handle_pio port 0xb004 read 0x0000
>>>> (XEN) d1v0 handle_pio port 0xb004 read 0x0000
>>>> (XEN) d1v0 handle_pio port 0xb004 write 0x0001
>>>> (XEN) d1v0 handle_pio port 0xb004 write 0x2001
>>>> (XEN) d2v0 XEN_DMOP_remote_shutdown domain 1 reason 0
>>>> (XEN) d2v0 domain 1 domain_shutdown vcpu_id 0 defer_shutdown 1
>>>> (XEN) d2v0 XEN_DMOP_remote_shutdown domain 1 done
>>>> (XEN) grant_table.c:3702:d0v0 Grant release 0x3 ref 0x11d flags 0x2 d6
>>>> (XEN) grant_table.c:3702:d0v0 Grant release 0x4 ref 0x11e flags 0x2 d6
>>>> (XEN) d3v0 handle_pio port 0xb004 read 0x0000
>>>
>>> Perhaps in this message could you also log
>>> v->domain->is_shutting_down, v->defer_shutdown, and
>>> v->paused_for_shutdown?
>>
>> And v->domain->is_shut_down please.
> 
> Here it is:
> 
> (XEN) hvm.c:1620:d6v0 All CPUs offline -- powering off.
> (XEN) d3v0 handle_pio port 0xb004 read 0x0000 is_shutting_down 0 
> defer_shutdown 0 paused_for_shutdown 0 is_shut_down 0
> (XEN) d3v0 handle_pio port 0xb004 read 0x0000 is_shutting_down 0 
> defer_shutdown 0 paused_for_shutdown 0 is_shut_down 0
> (XEN) d3v0 handle_pio port 0xb004 write 0x0001 is_shutting_down 0 
> defer_shutdown 0 paused_for_shutdown 0 is_shut_down 0
> (XEN) d3v0 handle_pio port 0xb004 write 0x2001 is_shutting_down 0 
> defer_shutdown 0 paused_for_shutdown 0 is_shut_down 0
> (XEN) d4v0 XEN_DMOP_remote_shutdown domain 3 reason 0
> (XEN) d4v0 domain 3 domain_shutdown vcpu_id 0 defer_shutdown 1
> (XEN) d4v0 XEN_DMOP_remote_shutdown domain 3 done
> (XEN) hvm.c:1620:d5v0 All CPUs offline -- powering off.
> (XEN) d1v0 handle_pio port 0xb004 read 0x0000 is_shutting_down 0 
> defer_shutdown 0 paused_for_shutdown 0 is_shut_down 0
> (XEN) d1v0 handle_pio port 0xb004 read 0x0000 is_shutting_down 0 
> defer_shutdown 0 paused_for_shutdown 0 is_shut_down 0
> (XEN) d1v0 handle_pio port 0xb004 write 0x0001 is_shutting_down 0 
> defer_shutdown 0 paused_for_shutdown 0 is_shut_down 0
> (XEN) d1v0 handle_pio port 0xb004 write 0x2001 is_shutting_down 0 
> defer_shutdown 0 paused_for_shutdown 0 is_shut_down 0
> (XEN) d2v0 XEN_DMOP_remote_shutdown domain 1 reason 0
> (XEN) d2v0 domain 1 domain_shutdown vcpu_id 0 defer_shutdown 1
> (XEN) d2v0 XEN_DMOP_remote_shutdown domain 1 done
> (XEN) grant_table.c:3702:d0v1 Grant release 0x3 ref 0x125 flags 0x2 d6
> (XEN) grant_table.c:3702:d0v1 Grant release 0x4 ref 0x126 flags 0x2 d6
> (XEN) d1v0 handle_pio port 0xb004 read 0x0000 is_shutting_down 1 
> defer_shutdown 1 paused_for_shutdown 0 is_shut_down 0

To me this is a clear indication that we did exit to guest context
with ->defer_shutdown set.

What I'm missing from your debugging patch is logging for when the
default case of the first switch() in hvmemul_do_io() gets hit. I
think I said yesterday that I consider this a fair candidate for
where the X86EMUL_UNHANDLEABLE is coming from.
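
To illustrate the kind of logging meant here, a minimal standalone sketch
follows. It is not Xen code: the enum values, mock_do_io() and the hit
counter are all hypothetical stand-ins, and the real switch() in
hvmemul_do_io() may look different. It only shows a default case that
reports itself before returning an "unhandleable" result.

```c
#include <stdio.h>

/* Hypothetical stand-ins; the real state values and return codes
 * come from Xen's own headers and are not reproduced here. */
enum mock_io_state { MOCK_IO_NONE, MOCK_IO_RESP_READY, MOCK_IO_OTHER };
enum mock_rc { MOCK_OKAY, MOCK_UNHANDLEABLE };

/* Counts how often the default case was taken. */
static unsigned int default_case_hits;

static enum mock_rc mock_do_io(enum mock_io_state state)
{
    switch (state) {
    case MOCK_IO_NONE:
    case MOCK_IO_RESP_READY:
        /* Expected states: carry on with emulation. */
        return MOCK_OKAY;

    default:
        /* Unexpected state: make the otherwise silent failure visible. */
        default_case_hits++;
        fprintf(stderr, "mock_do_io: unexpected I/O state %d, "
                "returning UNHANDLEABLE\n", (int)state);
        return MOCK_UNHANDLEABLE;
    }
}
```

In a real debugging patch the fprintf() would presumably be a printk() or
gprintk() placed in the existing default case.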

On top of that, with what we've sort of learned today, could you
log (or worse) any instances of handle_pio() getting called with
->defer_shutdown set? Afaict this should never happen, but you
may hit this case earlier than for the call out of the VMEXIT
handler, which would then move us closer to the root of the issue.

With "(or worse)" I mean it could also be as heavy as BUG(), ...

> Regarding BUG/WARN - do you think I could get any more info then? I
> really don't mind crashing that system, it's a virtual machine
> currently used only for debugging this issue.

... and the selection here really depends on what overall impact
you expect. I.e. I'm with Andrew that BUG() may be the construct
of choice if you would otherwise get an excessive amount of output.
In other cases plain logging may allow you to hit the same case
again, with perhaps slightly changed other state, giving further
hints on where the issue starts.
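
The log-versus-BUG() choice for the ->defer_shutdown check could be
sketched as below. This is a standalone model under stated assumptions,
not a patch: struct mock_vcpu, check_pio_entry() and the two modes are
hypothetical names, and abort() merely stands in for BUG().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for the few vCPU fields discussed in this thread. */
struct mock_vcpu {
    int domain_id;
    int vcpu_id;
    bool defer_shutdown;
};

/* CHECK_LOG keeps running so the case can be hit again;
 * CHECK_BUG stops at the first hit (abort() stands in for BUG()). */
enum check_mode { CHECK_LOG, CHECK_BUG };

/* Call on entry to the PIO handler. Returns true if the state that
 * should never occur (defer_shutdown set) was observed. */
static bool check_pio_entry(const struct mock_vcpu *v, unsigned int port,
                            enum check_mode mode)
{
    assert(v != NULL);

    if (!v->defer_shutdown)
        return false;

    fprintf(stderr, "d%dv%d handle_pio port %#x entered with "
            "defer_shutdown set\n", v->domain_id, v->vcpu_id, port);

    if (mode == CHECK_BUG)
        abort();

    return true;
}
```

With CHECK_LOG the check only reports and returns, so later occurrences
are still seen; with CHECK_BUG the first occurrence ends the run, which
keeps the output short.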

One thing that's not clear to me here: In the title you say
handle_pio() loops, but with the domain getting crashed I can't
seem to see that happening. Of course it may be a
misunderstanding or misinterpretation on my part that it is the
guest doing repeated I/O from/to port 0xb004.

Jan



 

