
Re: [Xen-devel] Dom0 kernel 4.14 with SMP randomly crashing





On Wed, Nov 7, 2018 at 12:16 AM Rishi <2rushikeshj@xxxxxxxxx> wrote:


On Tue, Nov 6, 2018 at 10:41 PM Rishi <2rushikeshj@xxxxxxxxx> wrote:


On Tue, Nov 6, 2018 at 5:47 PM Wei Liu <wei.liu2@xxxxxxxxxx> wrote:
On Tue, Nov 06, 2018 at 03:31:31PM +0530, Rishi wrote:
>
> So after examining the stack trace, it appears that the CPU was getting stuck
> in xen_hypercall_xen_version

That hypercall is used when a PV kernel (re-)enables interrupts. See
xen_irq_enable. The purpose is to force the kernel to switch into the
hypervisor.

>
> [30569.582740] watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [swapper/0:0]
> [30569.588186] Kernel panic - not syncing: softlockup: hung tasks
> [30569.591307] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G             L    4.19.1 #1
> [30569.595110] Hardware name: Xen HVM domU, BIOS 4.4.1-xs132257 12/12/2016
> [30569.598356] Call Trace:
> [30569.599597]  <IRQ>
> [30569.600920]  dump_stack+0x5a/0x73
> [30569.602998]  panic+0xe8/0x249
> [30569.604806]  watchdog_timer_fn+0x200/0x230
> [30569.607029]  ? softlockup_fn+0x40/0x40
> [30569.609246]  __hrtimer_run_queues+0x133/0x270
> [30569.611712]  hrtimer_interrupt+0xfb/0x260
> [30569.613800]  xen_timer_interrupt+0x1b/0x30
> [30569.616972]  __handle_irq_event_percpu+0x69/0x1a0
> [30569.619831]  handle_irq_event_percpu+0x30/0x70
> [30569.622382]  handle_percpu_irq+0x34/0x50
> [30569.625048]  generic_handle_irq+0x1e/0x30
> [30569.627216]  __evtchn_fifo_handle_events+0x163/0x1a0
> [30569.629955]  __xen_evtchn_do_upcall+0x41/0x70
> [30569.632612]  xen_evtchn_do_upcall+0x27/0x50
> [30569.635136]  xen_do_hypervisor_callback+0x29/0x40
> [30569.638181] RIP: e030:xen_hypercall_xen_version+0xa/0x20

What is the asm code for this RIP?


Wei.

The crash is resolved by appending "noirqbalance" to the Xen command line. This way all dom0 CPUs remain available, but IRQs are not balanced by Xen.

Even though I'm running the irqbalance service in dom0, the IRQs do not seem to be moving. (That is the dom0 perspective; I do not yet know whether it follows Xen's IRQ placement.)
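For anyone reproducing this on a grub2-based dom0, the option typically goes into /etc/default/grub. A minimal sketch, assuming the Debian-style 20_linux_xen helper (variable and paths differ on other distros):

        # /etc/default/grub -- append noirqbalance to the Xen command line
        GRUB_CMDLINE_XEN_DEFAULT="noirqbalance"

Then regenerate the config (update-grub, or grub2-mkconfig -o /boot/grub2/grub.cfg) and reboot.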

I tried objdump; the function is present in the output, but there is no asm code for it, just "...":

ffffffff81001220 <xen_hypercall_xen_version>:
        ...

ffffffff81001240 <xen_hypercall_console_io>:
        ...

All the hypercall entries appear the same way. (That is expected: the hypercall page is populated at boot by the hypervisor, so the static kernel image contains only zero bytes there, which objdump collapses to "...".)


How frequent can that hypercall/xen_irq_enable() be? Several times per second, or only once in a while?
During my tests, the system runs stably unless I'm downloading a large file. Files around 1 GB in size download without a crash, but the system crashes above that. I'm using a 2.1 GB file and wget to download it.

Is there a way I can simulate the PV kernel's (re-)enabling of interrupts from a kernel module, in a controlled fashion?
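One rough way (a sketch, not a tested module; the name and parameter are made up) is to toggle local interrupts in a tight loop, since on a PV guest every local_irq_enable() goes through pv_irq_ops to xen_irq_enable():

        /* irqtoggle.c -- illustrative module: repeatedly disable/re-enable
         * local interrupts so the paravirt irq-enable path (xen_irq_enable
         * on Xen PV) is exercised in a controlled way. */
        #include <linux/module.h>
        #include <linux/kernel.h>
        #include <linux/irqflags.h>

        static int cycles = 1000000;
        module_param(cycles, int, 0444);
        MODULE_PARM_DESC(cycles, "number of irq disable/enable cycles");

        static int __init irqtoggle_init(void)
        {
                int i;

                for (i = 0; i < cycles; i++) {
                        local_irq_disable();  /* pv op: xen_irq_disable */
                        local_irq_enable();   /* pv op: xen_irq_enable  */
                }
                pr_info("irqtoggle: completed %d cycles\n", cycles);
                return 0;
        }

        static void __exit irqtoggle_exit(void)
        {
        }

        module_init(irqtoggle_init);
        module_exit(irqtoggle_exit);
        MODULE_LICENSE("GPL");

Note this only exercises the enable path itself; whether evtchn_upcall_pending happens to be set when it runs still depends on event traffic.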

If this is on the right track:

ffffffff8101ab70 <xen_force_evtchn_callback>:
ffffffff8101ab70:       31 ff                   xor    %edi,%edi
ffffffff8101ab72:       31 f6                   xor    %esi,%esi
ffffffff8101ab74:       e8 a7 66 fe ff          callq  ffffffff81001220 <xen_hypercall_xen_version>
ffffffff8101ab79:       c3                      retq
ffffffff8101ab7a:       66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)
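That disassembly matches the C source in arch/x86/xen/irq.c, where xen_force_evtchn_callback() issues a harmless xen_version hypercall (cmd 0, NULL argument) purely to trap into the hypervisor so that any pending event is delivered on the way back:

        void xen_force_evtchn_callback(void)
        {
                (void)HYPERVISOR_xen_version(0, NULL);
        }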


It seems I'm hitting the following code from xen_irq_enable:

        barrier(); /* unmask then check (avoid races) */
        if (unlikely(vcpu->evtchn_upcall_pending))
                xen_force_evtchn_callback();
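For context, the surrounding function looks roughly like this in 4.x kernels (arch/x86/xen/irq.c; a sketch from memory, so check your exact tree):

        asmlinkage __visible void xen_irq_enable(void)
        {
                struct vcpu_info *vcpu = this_cpu_read(xen_vcpu);

                /* Clear the mask so Xen may deliver events to this vCPU again. */
                vcpu->evtchn_upcall_mask = 0;

                barrier(); /* unmask then check (avoid races) */

                /*
                 * If Xen flagged an event while we were masked, force a trap
                 * into the hypervisor so it is delivered now -- this is the
                 * xen_hypercall_xen_version seen at the top of the stack.
                 */
                if (unlikely(vcpu->evtchn_upcall_pending))
                        xen_force_evtchn_callback();
        }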

The code marks that branch unlikely, yet it is being taken. I also found the following structure:

struct vcpu_info {
        /*
         * 'evtchn_upcall_pending' is written non-zero by Xen to indicate
         * a pending notification for a particular VCPU. It is then cleared
         * by the guest OS /before/ checking for pending work, thus avoiding
         * a set-and-check race. Note that the mask is only accessed by Xen
         * on the CPU that is currently hosting the VCPU. This means that the
         * pending and mask flags can be updated by the guest without special
         * synchronisation (i.e., no need for the x86 LOCK prefix).
         */
        uint8_t evtchn_upcall_pending;
        uint8_t evtchn_upcall_mask;
        ...
};

Let me know if these intermediate updates are spamming the thread.