[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] crash in csched_load_balance after xl vcpu-pin


  • To: Dario Faggioli <dfaggioli@xxxxxxxx>
  • From: George Dunlap <George.Dunlap@xxxxxxxxxx>
  • Date: Fri, 13 Apr 2018 11:29:05 +0000
  • Accept-language: en-GB, en-US
  • Cc: Andrew Cooper <Andrew.Cooper3@xxxxxxxxxx>, Olaf Hering <olaf@xxxxxxxxx>, "xen-devel@xxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxx>
  • Delivery-date: Fri, 13 Apr 2018 11:29:24 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>
  • Thread-index: AQHT0Kn8h9q9Rs4uXU6WSZ84IkzkxKP5m4eAgAAQTgCAABgpgP//95qAgABStoCAAAzIAIAAOZmAgAAnhQCAAI8GAIAAKacAgABUrwCAAAbcgIAAH5mAgAA4woCAAA1mgIAAyuSAgAAKvgCAACm4gIAACCmAgAAoE4CAAB3zgIABBeGAgAAGVwCAACJ+gA==
  • Thread-topic: [Xen-devel] crash in csched_load_balance after xl vcpu-pin


> On Apr 13, 2018, at 10:25 AM, Dario Faggioli <dfaggioli@xxxxxxxx> wrote:
> 
> On Fri, 2018-04-13 at 09:03 +0000, George Dunlap wrote:
>>> On Apr 12, 2018, at 6:25 PM, Dario Faggioli <dfaggioli@xxxxxxxx>
>>> wrote:
>>> 
>> I think the bottom line is, for this test to be valid, then at this
>> point test_bit(VPF_migrating) *must* imply !vcpu_on_runqueue(v), but
>> at this point it doesn’t: If someone else has come by and cleared the
>> bit, done migration, and woken it up, and then someone *else* set the
>> bit again without taking it off the runqueue, it may still be on the
>> runqueue.
>> 
>> My series which calls vcpu_sleep_nosync_locked() after setting
>> VPF_migrating should help with this.
>> 
> Yes. In fact, Olaf, I still think that doing a run with George's RFC
> applied, would be useful, if only as a data point.
> 
>> Or, alternately, instead of baking all this implicit  knowledge about
>> credit into the scheduler, we should just implement
>> credit_vcpu_migrate(), and have it remove it from one runqueue and
>> put it on another.
>> 
> But it's not really "baking Credit implicit knowledge", IMO. It is that
> we have an invariant which we are failing to enforce.

Which invariant is that?  That a vcpu is not on a runqueue when switching 
v->processor.  But “on a runqueue” is a scheduler-specific construct that the 
main scheduling code doesn’t know about.  Otherwise we could make the late 
bail-out clause in vcpu_migrate() something like this:

if ( !v->is_running ||
     vcpu_on_runq(v) ||
     !test_and_clear_bit(VPF_migrating) )
{
     /* unlock and return */
}

All this stuff with vcpu_sleep_nosync() and vcpu_wake() is just indirectly 
making sure that the Credit1-specific invariant — that switching v->processor 
removes it from one runqueue and adds it to another — actually happens; but it 
does it in an opaque way.  And the main reason the migrate() callback was 
introduced (IIRC) is because credit2’s migration invariants didn’t really 
correspond to the invariants implicitly defined by schedule.c for credit1.

I think as far as backports go, my current RFC would be fine.  Another 
possibility, though, would be to simply add a migrate() callback to remove the 
vcpu from the runqueue before switching v->processor, *without* removing any of 
the current song and dance about vcpu_sleep_nosync().  That should be fairly 
simple and straightforward to backport, and won’t make anything worse (since in 
theory it should have been removed by that point anyway).  Then for 4.12 we can 
figure out what we want to do going forward.

 -George
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel

 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.