
Re: [Xen-devel] crash in csched_load_balance after xl vcpu-pin


  • To: Dario Faggioli <dfaggioli@xxxxxxxx>
  • From: George Dunlap <George.Dunlap@xxxxxxxxxx>
  • Date: Fri, 13 Apr 2018 09:03:02 +0000
  • Accept-language: en-GB, en-US
  • Cc: Andrew Cooper <Andrew.Cooper3@xxxxxxxxxx>, Olaf Hering <olaf@xxxxxxxxx>, "xen-devel@xxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxx>
  • Delivery-date: Fri, 13 Apr 2018 09:03:07 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>
  • Thread-index: AQHT0Kn8h9q9Rs4uXU6WSZ84IkzkxKP5m4eAgAAQTgCAABgpgP//95qAgABStoCAAAzIAIAAOZmAgAAnhQCAAI8GAIAAKacAgABUrwCAAAbcgIAAH5mAgAA4woCAAA1mgIAAyuSAgAAKvgCAACm4gIAACCmAgAAoE4CAAB3zgIABBeGA
  • Thread-topic: [Xen-devel] crash in csched_load_balance after xl vcpu-pin


> On Apr 12, 2018, at 6:25 PM, Dario Faggioli <dfaggioli@xxxxxxxx> wrote:
> 
> On Thu, 2018-04-12 at 17:38 +0200, Dario Faggioli wrote:
>> On Thu, 2018-04-12 at 15:15 +0200, Dario Faggioli wrote:
>>> On Thu, 2018-04-12 at 14:45 +0200, Olaf Hering wrote:
>>>> 
>>>> dies after the first iteration.
>>>> 
>>>>        BUG_ON(!test_bit(_VPF_migrating, &prev->pause_flags));
>>>> 
>> 
>> Update. I replaced this:
>> 
> Olaf, new patch! :-)
> 
> FTR, a previous version of this (where I was not printing
> smp_processor_id() and prev->is_running), produced the output that I am
> attaching below.
> 
> Looks to me like, while on the crashing CPU, we are here [*]:
> 
> void context_saved(struct vcpu *prev)
> {
>    ...
>    if ( unlikely(prev->pause_flags & VPF_migrating) )
>    {
>        unsigned long flags;
>        spinlock_t *lock = vcpu_schedule_lock_irqsave(prev, &flags);
> 
>        if (vcpu_runnable(prev) ||
>            !test_bit(_VPF_migrating, &prev->pause_flags))
>            printk("CPU %u: d%uv%d isr=%u runnbl=%d proc=%d pf=%lu orq=%d csf=%u\n",
>                   smp_processor_id(), prev->domain->domain_id, prev->vcpu_id,
>                   prev->is_running, vcpu_runnable(prev),
>                   prev->processor, prev->pause_flags,
>                   SCHED_OP(vcpu_scheduler(prev), onrunq, prev),
>                   SCHED_OP(vcpu_scheduler(prev), csflags, prev));
> 
>        [*]
> 
>        if ( prev->runstate.state == RUNSTATE_runnable )
>            vcpu_runstate_change(prev, RUNSTATE_offline, NOW());
>        BUG_ON(curr_on_cpu(prev->processor) == prev);
>        SCHED_OP(vcpu_scheduler(prev), sleep, prev);
> 
>        vcpu_schedule_unlock_irqrestore(lock, flags, prev);
> 
>        vcpu_migrate(prev);
>    }
> }
> 
> On the "other CPU", we might be around here [**]:
> 
> static void vcpu_migrate(struct vcpu *v)
> {
>    ...
>    if ( v->is_running ||
>         !test_and_clear_bit(_VPF_migrating, &v->pause_flags) )

I think the bottom line is: for this test to be valid, at this point 
test_bit(_VPF_migrating) *must* imply !vcpu_on_runqueue(v); but it doesn't. If 
someone else has come by, cleared the bit, done the migration, and woken the 
vcpu up, and then someone *else* has set the bit again without taking the vcpu 
off the runqueue, it may still be on the runqueue.
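
To make that interleaving concrete, here is a toy model in plain C11 (this is 
not Xen code; `migrating' and `on_runq' just stand in for _VPF_migrating and 
runqueue membership, and the "race" is replayed sequentially for clarity):

    /* Toy model: "flag set" stops implying "off the runqueue". */
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdio.h>

    static atomic_bool migrating;  /* stands in for _VPF_migrating          */
    static atomic_bool on_runq;    /* stands in for "vcpu is on a runqueue" */

    /* What the other CPU(s) get done inside the window. */
    static void other_cpus(void)
    {
        if (atomic_exchange(&migrating, false))  /* test_and_clear_bit()  */
            atomic_store(&on_runq, true);        /* migrate + vcpu_wake() */
        atomic_store(&migrating, true);          /* bit set again, with
                                                  * no dequeue this time  */
    }

    int main(void)
    {
        atomic_store(&migrating, true); /* state first seen in context_saved() */
        other_cpus();
        if (atomic_load(&migrating) && atomic_load(&on_runq))
            printf("VPF_migrating set, yet the vcpu is still on a runqueue\n");
        return 0;
    }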

My series, which calls vcpu_sleep_nosync_locked() after setting VPF_migrating, 
should help with this.
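
For reference, the shape of that change is roughly this (a sketch, not the 
literal patch; see the series for the real thing):

    spinlock_t *lock = vcpu_schedule_lock_irq(v);

    set_bit(_VPF_migrating, &v->pause_flags);
    vcpu_sleep_nosync_locked(v);   /* dequeue v under the same lock */

    vcpu_schedule_unlock_irq(lock, v);

That way, setting the flag and taking the vcpu off the runqueue become a 
single atomic step as far as other CPUs are concerned.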

Or, alternately, instead of baking all this implicit knowledge about credit 
into the common scheduling code, we should just implement credit_vcpu_migrate(), 
and have it remove the vcpu from one runqueue and put it on another.
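
Something along these lines (just a sketch against sched_credit.c's existing 
helpers; I'm hand-waving the locking, which the caller would have to arrange 
as it does for credit2's migrate hook):

    static void
    csched_vcpu_migrate(const struct scheduler *ops, struct vcpu *vc,
                        unsigned int new_cpu)
    {
        struct csched_vcpu * const svc = CSCHED_VCPU(vc);

        /* Assumes the relevant runqueue lock(s) are held by the caller. */
        if ( __vcpu_on_runq(svc) )
        {
            __runq_remove(svc);         /* off the old pcpu's runqueue... */
            vc->processor = new_cpu;
            __runq_insert(svc);         /* ...and onto the new one        */
        }
        else
            vc->processor = new_cpu;
    }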

 -George
