
Re: [Xen-devel] crash in csched_load_balance after xl vcpu-pin



On Thu, 2018-04-12 at 17:38 +0200, Dario Faggioli wrote:
> On Thu, 2018-04-12 at 15:15 +0200, Dario Faggioli wrote:
> > On Thu, 2018-04-12 at 14:45 +0200, Olaf Hering wrote:
> > > 
> > > dies after the first iteration.
> > > 
> > >         BUG_ON(!test_bit(_VPF_migrating, &prev->pause_flags));
> > > 
> 
> Update. I replaced this:
> 
Olaf, new patch! :-)

FTR, a previous version of this patch (one that did not yet print
smp_processor_id() and prev->is_running) produced the output pasted
below.

It looks to me like the crashing CPU is here [*]:

void context_saved(struct vcpu *prev)
{
    ...
    if ( unlikely(prev->pause_flags & VPF_migrating) )
    {
        unsigned long flags;
        spinlock_t *lock = vcpu_schedule_lock_irqsave(prev, &flags);

        if (vcpu_runnable(prev) || !test_bit(_VPF_migrating, &prev->pause_flags))
            printk("CPU %u: d%uv%d isr=%u runnbl=%d proc=%d pf=%lu orq=%d csf=%u\n",
                   smp_processor_id(), prev->domain->domain_id, prev->vcpu_id,
                   prev->is_running, vcpu_runnable(prev),
                   prev->processor, prev->pause_flags,
                   SCHED_OP(vcpu_scheduler(prev), onrunq, prev),
                   SCHED_OP(vcpu_scheduler(prev), csflags, prev));

        [*]

        if ( prev->runstate.state == RUNSTATE_runnable )
            vcpu_runstate_change(prev, RUNSTATE_offline, NOW());
        BUG_ON(curr_on_cpu(prev->processor) == prev);
        SCHED_OP(vcpu_scheduler(prev), sleep, prev);

        vcpu_schedule_unlock_irqrestore(lock, flags, prev);

        vcpu_migrate(prev);
    }
}

On the "other CPU", we might be around here [**]:

static void vcpu_migrate(struct vcpu *v)
{
    ...
    if ( v->is_running ||
         !test_and_clear_bit(_VPF_migrating, &v->pause_flags) )
    {
        sched_spin_unlock_double(old_lock, new_lock, flags); 
        return; 
    } 
 
    vcpu_move_locked(v, new_cpu); 
 
    sched_spin_unlock_double(old_lock, new_lock, flags); 

    [**] 

    if ( old_cpu != new_cpu ) 
        sched_move_irqs(v); 
 
    /* Wake on new CPU. */ 
    vcpu_wake(v); 
}
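
To make the interleaving easier to reason about, here is a minimal
user-space sketch of the flag handling only. It is not Xen code: it models
test_bit() and test_and_clear_bit() with C11 atomics, and the names
model_test_bit(), model_test_and_clear_bit() and MODEL_PF_MIGRATING are
hypothetical stand-ins invented for this illustration. What it shows is
that once one CPU has passed the test_and_clear_bit() in vcpu_migrate(), a
later check of the same flag on another CPU reads it as clear, which is
the condition the debug printk() in context_saved() is looking for.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical bit number standing in for _VPF_migrating (illustrative only). */
#define MODEL_PF_MIGRATING 3

/* Atomically clear bit nr in *flags; return true if it was set beforehand. */
static bool model_test_and_clear_bit(int nr, atomic_ulong *flags)
{
    assert(nr >= 0 && nr < (int)(sizeof(unsigned long) * 8));

    unsigned long mask = 1UL << nr;
    unsigned long old = atomic_fetch_and(flags, ~mask);

    return (old & mask) != 0;
}

/* Read bit nr of *flags without modifying it. */
static bool model_test_bit(int nr, atomic_ulong *flags)
{
    assert(nr >= 0 && nr < (int)(sizeof(unsigned long) * 8));

    return ((atomic_load(flags) >> nr) & 1UL) != 0;
}
```

With the flag initially set, the first model_test_and_clear_bit() call
returns true and clears it. Any model_test_bit() or second
model_test_and_clear_bit() call after that returns false, whichever
"CPU" issues it.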

(XEN) d10v1 runnbl=0 proc=22 pf=1 orq=0 csf=4
(XEN) d10v0 runnbl=1 proc=20 pf=0 orq=0 csf=4
(XEN) d10v0 runnbl=1 proc=25 pf=0 orq=0 csf=4
(XEN) d10v2 runnbl=1 proc=31 pf=0 orq=0 csf=4
(XEN) d10v2 runnbl=1 proc=10 pf=0 orq=1 csf=0
(XEN) d10v0 runnbl=1 proc=30 pf=0 orq=0 csf=4
(XEN) d10v0 runnbl=1 proc=15 pf=0 orq=0 csf=4
(XEN) d10v3 runnbl=1 proc=13 pf=0 orq=1 csf=0
(XEN) d10v2 runnbl=1 proc=39 pf=0 orq=0 csf=4
(XEN) d10v3 runnbl=1 proc=32 pf=0 orq=0 csf=4
(XEN) d10v2 runnbl=1 proc=20 pf=0 orq=0 csf=4
(XEN) d10v2 runnbl=1 proc=20 pf=0 orq=0 csf=4
(XEN) d10v1 runnbl=0 proc=26 pf=1 orq=0 csf=4
(XEN) d10v3 runnbl=1 proc=16 pf=0 orq=0 csf=4
(XEN) Xen BUG at sched_credit.c:877
(XEN) ----[ Xen-4.11.20180411T100655.82540b66ce-180412155659  x86_64  debug=y   Not tainted ]----
(XEN) CPU:    16
(XEN) RIP:    e008:[<ffff82d08022c84d>] sched_credit.c#csched_vcpu_migrate+0x52/0x54
(XEN) RFLAGS: 0000000000010006   CONTEXT: hypervisor (d6v0)
(XEN) rax: ffff8300779c9000   rbx: 0000000000000012   rcx: ffff830adac719f0
(XEN) rdx: 0000000000000012   rsi: ffff8300779b2000   rdi: 00000033ff8bb000
(XEN) rbp: ffff83087cfb7ce8   rsp: ffff83087cfb7ce8   r8:  0000000000000010
(XEN) r9:  0000ffff0000ffff   r10: 00ff00ff00ff00ff   r11: 0f0f0f0f0f0f0f0f
(XEN) r12: ffff83047fe82188   r13: ffff83047fe70188   r14: ffff82d0805c7180
(XEN) r15: ffff8300779b2000   cr0: 000000008005003b   cr4: 00000000000026e0
(XEN) cr3: 0000000f8404b000   cr2: 00007f18dfeca000
(XEN) fsb: 0000000000000000   gsb: 0000000000000000   gss: 0000000000000000
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0000   cs: e008
(XEN) Xen code around <ffff82d08022c84d> (sched_credit.c#csched_vcpu_migrate+0x52/0x54):
(XEN)  5d c3 0f 0b 0f 0b 0f 0b <0f> 0b 55 48 89 e5 48 8d 05 26 a9 39 00 48 8b 57
(XEN) Xen stack trace from rsp=ffff83087cfb7ce8:
(XEN)    ffff83087cfb7cf8 ffff82d080239419 ffff83087cfb7d68 ffff82d08023a8d8
(XEN)    ffff82d0805c7160 ffff82d0805c7180 01ff83087cfb7d78 0000001200000010
(XEN)    0000000000000092 0000000000000296 0000000000000003 ffff8300779b2000
(XEN)    ffff83047fe82188 0000000000000292 0000000000000004 ffff82d0805b2520
(XEN)    ffff83087cfb7db8 ffff82d08023c795 ffff83087cfb7d98 ffff8300779b2000
(XEN)    ffff83087cfb7db8 ffff8300779c9000 ffff8300779b2000 ffff830ad6463000
(XEN)    0000000000000010 ffff830adad26000 ffff83087cfb7e08 ffff82d08027a538
(XEN)    ffff83087cfb7dd8 ffff82d0802a8510 ffff83087cfb7e08 ffff8300779b2000
(XEN)    ffff8300779c9000 ffff83047fe82188 0000008405ba3022 0000000000000003
(XEN)    ffff83087cfb7e98 ffff82d0802397a9 ffff8300779b2560 ffff83047fe821a0
(XEN)    0000001000fb7e58 ffff83047fe82180 ffff82d080328ba1 ffff8300779b2000
(XEN)    ffff830adad26000 ffff8300779c9000 0000000001c9c380 ffff82d080302000
(XEN)    ffff8300779b2000 ffff82d08059c480 ffff82d08059bc80 ffffffffffffffff
(XEN)    ffff83087cfb7fff ffff82d0805a3c80 ffff83087cfb7ed8 ffff82d08023d552
(XEN)    ffff82d080328ba1 ffff8300779b2000 ffff8300779c9000 ffff830adad26000
(XEN)    0000000000000010 ffff830ad6463000 ffff83087cfb7ee8 ffff82d08023d5c5
(XEN)    ffff83087cfb7db8 ffff82d080328d6b ffffffff81c00000 ffffffff81c00000
(XEN)    ffffffff81c00000 0000000000000000 0000000000000000 ffffffff81d4c180
(XEN)    0000000000000008 000000470cb96de6 0000000000000001 0000000000000000
(XEN)    ffffffff81020e50 0000000000000000 0000000000000000 0000000000000000
(XEN) Xen call trace:
(XEN)    [<ffff82d08022c84d>] sched_credit.c#csched_vcpu_migrate+0x52/0x54
(XEN)    [<ffff82d080239419>] schedule.c#vcpu_move_locked+0x42/0xcc
(XEN)    [<ffff82d08023a8d8>] schedule.c#vcpu_migrate+0x210/0x23b
(XEN)    [<ffff82d08023c795>] context_saved+0x21e/0x461
(XEN)    [<ffff82d08027a538>] context_switch+0xe9/0xf67
(XEN)    [<ffff82d0802397a9>] schedule.c#schedule+0x306/0x6ab
(XEN)    [<ffff82d08023d552>] softirq.c#__do_softirq+0x71/0x9a
(XEN)    [<ffff82d08023d5c5>] do_softirq+0x13/0x15
(XEN)    [<ffff82d080328d6b>] vmx_asm_do_vmentry+0x2b/0x30
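
For anyone who wants to filter a long console log for the interesting
samples, the debug lines above follow a fixed key=value layout, so they
can be picked apart with sscanf(). The sketch below assumes the older
line format shown in this output (without the "CPU %u:" prefix and the
isr= field). struct dbg_sample and parse_dbg_line() are hypothetical
names made up for this example and are not part of Xen or of the patch.

```c
#include <assert.h>
#include <stdio.h>

/* One parsed debug sample; field names mirror the keys in the log line. */
struct dbg_sample {
    unsigned int  dom;          /* dN                         */
    int           vcpu;         /* vN                         */
    int           runnable;     /* runnbl=                    */
    int           processor;    /* proc=                      */
    unsigned long pause_flags;  /* pf= (printed with %lu)     */
    int           on_runq;      /* orq=                       */
    unsigned int  cs_flags;     /* csf=                       */
};

/* Parse one "(XEN) dXvY runnbl=.. proc=.. pf=.. orq=.. csf=.." line.
 * Returns 0 on success, -1 if the line does not match that layout. */
static int parse_dbg_line(const char *line, struct dbg_sample *s)
{
    assert(line != NULL && s != NULL);

    int n = sscanf(line,
                   "(XEN) d%uv%d runnbl=%d proc=%d pf=%lu orq=%d csf=%u",
                   &s->dom, &s->vcpu, &s->runnable, &s->processor,
                   &s->pause_flags, &s->on_runq, &s->cs_flags);

    return n == 7 ? 0 : -1;
}
```

Lines that are not debug samples, such as the "Xen BUG at" line or the
register dump, make parse_dbg_line() return -1, so they can simply be
skipped.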

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Software Engineer @ SUSE https://www.suse.com/

Attachment: context-save-race-debug.patch
Description: Text Data

Attachment: signature.asc
Description: This is a digitally signed message part

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel

 

