
Re: [Xen-devel] crash in csched_load_balance after xl vcpu-pin



On Tue, 2018-04-10 at 11:59 +0100, George Dunlap wrote:
> On 04/10/2018 11:33 AM, Dario Faggioli wrote:
> > On Tue, 2018-04-10 at 09:34 +0000, George Dunlap wrote:
> > > Assuming the bug is this one:
> > > 
> > > BUG_ON( cpu != snext->vcpu->processor );
> > > 
> > 
> > Yes, it is that one.
> > 
> > Another stack trace, this time from a debug=y built hypervisor, of
> > what we think is the same bug (although reproduced in a slightly
> > different way), is this:
> > 
> > (XEN) ----[ Xen-4.7.2_02-36.1.12847.11.PTF  x86_64  debug=y  Not tainted ]----
> > (XEN) CPU:    45
> > (XEN) RIP:    e008:[<ffff82d08012508f>] sched_credit.c#csched_schedule+0x361/0xaa9
> > ...
> > (XEN) Xen call trace:
> > (XEN)    [<ffff82d08012508f>] sched_credit.c#csched_schedule+0x361/0xaa9
> > (XEN)    [<ffff82d08012c233>] schedule.c#schedule+0x109/0x5d6
> > (XEN)    [<ffff82d08012fb5f>] softirq.c#__do_softirq+0x7f/0x8a
> > (XEN)    [<ffff82d08012fbb4>] do_softirq+0x13/0x15
> > (XEN)    [<ffff82d0801fd5c5>] vmx_asm_do_vmentry+0x25/0x2a
> > 
> > (I can provide it all, if necessary.)
> > 
> > I've done some analysis, although at a time when we were still not
> > entirely sure that changing the affinities was the actual cause
> > (or, at least, the trigger of the whole thing).
> > 
> > In the specific case of this stack trace, the current vcpu running
> > on CPU 45 is d3v11. It is not in the runqueue, because it has been
> > removed and not added back, and the reason is that it is not
> > runnable (it has VPF_migrating set in pause_flags).
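
Side note, for reference: any bit set in pause_flags makes a vcpu
non-runnable. From memory (so double check against the tree),
vcpu_runnable() in xen/include/xen/sched.h is roughly:

    /* Any pause flag (e.g., VPF_migrating) or pending pause count
     * makes the vcpu non-runnable, keeping it off the runqueue. */
    static inline int vcpu_runnable(struct vcpu *v)
    {
        return !(v->pause_flags |
                 atomic_read(&v->pause_count) |
                 atomic_read(&v->domain->pause_count));
    }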
> > 
> > The runqueue of pcpu 45 looks fine (i.e., it is not corrupt or
> > anything like that); it has d3v10, d9v1, d32767v45 in it (in this
> > order).
> > 
> > d3v11->processor is 45, so that is also fine.
> > 
> > Basically, d3v11 wants to move away from pcpu 45, and this might
> > (but that's not certain) be the reason why we're rescheduling. The
> > fact that there are vcpus wanting to migrate can very well be due
> > to the affinity being changed.
> > 
> > Now, the problem is that, looking into the runqueue, I found out
> > that d3v10->processor=32. I.e., d3v10 is queued in pcpu 45's
> > runqueue, with processor=32, which really shouldn't happen.
> > 
> > This leads to the bug triggering as, in csched_schedule(), we read
> > the head of the runqueue with:
> > 
> >     snext = __runq_elem(runq->next);
> > 
> > and then we pass snext to csched_load_balance(), where the BUG_ON
> > is.
> > 
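To make the path concrete, here is the shape of the code involved,
abridged from sched_credit.c (only the lines relevant to the crash
kept):

    /* In csched_schedule(), abridged: */
    snext = __runq_elem(runq->next);   /* head of this pcpu's runq */
    /* ... */
    snext = csched_load_balance(prv, cpu, snext, &ret.migrated);

    /* ... and csched_load_balance() opens with: */
    BUG_ON( cpu != snext->vcpu->processor );

I.e., the BUG_ON fires as soon as the ->processor of the vcpu at the
head of the runqueue has been changed behind our back (here: d3v10,
queued on pcpu 45 with processor=32).
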
> > Another thing that I've found out is that all "misplaced" vcpus
> > (i.e., in this and also in other manifestations of this bug) have
> > their csched_vcpu.flags=4, which is CSCHED_FLAGS_VCPU_MIGRATING.
> > 
> > This, basically, is again a sign of vcpu_migrate() having been
> > called on d3v10 as well, which in turn has called csched_vcpu_pick().
> > 
> > > a nasty race condition… a vcpu has just been taken off the
> > > runqueue of the current pcpu, but it's apparently been assigned
> > > to a different cpu.
> > > 
> > 
> > Nasty indeed. I've been looking into this on and off, but so far I
> > haven't found the root cause.
> > 
> > Now that we know for sure that it is changing affinity that
> > triggers it, the field of the investigation can be narrowed a
> > little bit... But I still find it hard to spot where the race
> > happens.
> > 
> > I'll look more into this later in the afternoon. I'll let you know
> > if something comes to mind.
> 
> Actually, it looks quite simple: schedule.c:vcpu_move_locked() is
> supposed to actually do the moving; if vcpu_scheduler()->migrate is
> defined, it calls that; otherwise, it just sets v->processor.
> Credit1 doesn't define migrate.  So when changing the vcpu affinity
> on credit1, v->processor is simply modified without it changing
> runqueues.
> 
> The real question is why it's so hard to actually trigger any
> problems!
> 
Wait, but when vcpu_move_locked() is called, the vcpu being moved
should not be in any runqueue.
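
For reference, this is the dispatch George is talking about: a
simplified sketch of schedule.c:vcpu_move_locked(), leaving out the
urgency-count accounting that the real function also does:

    static void vcpu_move_locked(struct vcpu *v, unsigned int new_cpu)
    {
        /* Schedulers that define a migrate hook (e.g., Credit2) get a
         * chance to fix up their runqueues; Credit1 does not define
         * one, so for it we fall through to the bare assignment. */
        if ( vcpu_scheduler(v)->migrate )
            SCHED_OP(vcpu_scheduler(v), migrate, v, new_cpu);
        else
            v->processor = new_cpu;
    }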

In fact, it is called from vcpu_migrate() which, in turn, is always
preceded by a call to vcpu_sleep_nosync(), which removes the vcpu from
the runqueue.

The only exception is when it is called from context_saved(). But then
again, the vcpu on which it is called is not on the runqueue, because
it was found not runnable.
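
E.g., the typical call site, like the tail of vcpu_set_affinity() in
schedule.c, looks roughly like this (trimmed, locking elided):

    if ( v->pause_flags & VPF_migrating )
    {
        vcpu_sleep_nosync(v);  /* dequeues v, as it is not runnable */
        vcpu_migrate(v);       /* ... which ends in vcpu_move_locked() */
    }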

That is why things work... well, apart from this bug. :-)

I mean, the root cause of this bug may very well be that there is a
code path that leads to calling vcpu_move_locked() on a vcpu that is
still in a runqueue... but have you actually identified it?

> But as a quick fix, implementing csched_vcpu_migrate() is probably
> the best solution.  Do you want to pick that up, or should I?
> 
And what should csched_vcpu_migrate() do, apart from changing
vc->processor?
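
I.e., as far as I can tell, it could only be something as trivial as
the following, which is exactly what the generic code does already
(hypothetical sketch; name and signature guessed from the other
sched_credit.c callbacks):

    static void
    csched_vcpu_migrate(const struct scheduler *ops, struct vcpu *vc,
                        unsigned int new_cpu)
    {
        /* The vcpu is (supposedly!) off any runqueue by now, so
         * there is nothing scheduler specific left to do here. */
        vc->processor = new_cpu;
    }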

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Software Engineer @ SUSE https://www.suse.com/



 

