
Re: [Xen-devel] [xen-unstable test] 145796: tolerable FAIL - PUSHED

On Sun, 2020-02-02 at 12:57 +0000, Julien Grall wrote:
> Hi Dario,

> Apologies for the late answer.
No problem, I have not had any more time to look into this yet either.

> On 22/01/2020 03:40, Dario Faggioli wrote:
> > On Fri, 2020-01-10 at 18:24 +0000, Julien Grall wrote:
> > > 
> > You have a 2 vCPUs dom0, and how many other vCPUs from other
> > domains?
> > Or do you only have those 2 dom0 vCPUs and you are actually pausing
> > dom0?
> Only dom0 with 2 vCPUs is running. On every hypercall, it will try to
> pause/unpause itself.
Ok, that was my understanding, but I wasn't 100% sure. Thanks for
confirming.

> This is to roughly match the behavior of the Arm guest atomic
> helpers.
Yep, makes sense.

> > If you just have the 2 dom0's vCPUs around (and we call them vCPU A
> > and vCPU B), the only case for which I can imagine runq_pick()
> > returning A on CPU1 would be if CPU0 were running vCPU B (and
> > invoked the hypercall from it) and CPU1 was idle... is this the case?
> This is indeed the case. The schedule() on CPU1 has happened because
> vCPU A was woken up (e.g. an interrupt was received and injected to
> the vCPU).

> > In fact, I'm starting to think that patch 7c7b407e777 "xen/sched:
> > introduce unit_runnable_state()", which added the 'q_remove(snext)'
> > in rt_schedule() might not be correct.
> I have tested Xen before this commit and didn't manage to reproduce
> the crash. As soon as I had the commit applied, it crashed quite
> quickly.
Ok, thanks for checking this as well. That's very useful.

> > In fact, if runq_pick() returns a vCPU which is in the runqueue,
> > but is not runnable (e.g., because we're racing with
> > do_domain_pause(), which already set pause_count), it's not
> > rt_schedule()'s job to dequeue it from anything.
> > 
> > We probably should just ignore it and pick another vCPU, if any
> > (and idle otherwise). Then, after we release the lock, it will be
> > rt_unit_sleep(), called by do_domain_pause() in this case, that
> > will finish the job of properly dequeueing it...
> > 
> > Another strange thing is that, as the code looks right now,
> > runq_pick() returns the first unit in the runq (i.e., the one with
> > the earliest deadline), without checking whether it is runnable.
> > Then, in rt_schedule(), if the unit is not runnable, we (only
> > partially, as you figured out) dequeue it, and use idle instead, as
> > our candidate for being the next scheduled unit... But what if there
> > were other *runnable* units in the runqueue?
> My knowledge of the scheduler is quite limited. Maybe Meng would be
> able to answer this question?
Yes, indeed, here I was pretty much thinking out loud, and trying to
trigger comments from Meng.

Anyway, I'll see about putting together a quick test patch that
implements what I described (next week), and let's see if it works.
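
To make the idea discussed above a bit more concrete, here is a minimal,
self-contained sketch of a pick function that skips non-runnable units
and leaves them in the runqueue, so that the sleep path can dequeue them
later. This is plain C for illustration only, not Xen code and not the
actual patch: demo_unit, demo_runq_insert() and demo_runq_pick() are
made-up names, and the 'runnable' flag is an assumed stand-in for
whatever check the real scheduler would do under its lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a scheduling unit; NOT Xen's struct rt_unit. */
struct demo_unit {
    int id;
    int64_t deadline;        /* absolute deadline; the runq is sorted by it */
    bool runnable;           /* false, e.g., while a pause is in progress */
    struct demo_unit *next;  /* next unit in the deadline-ordered runq */
};

/* Insert a unit, keeping the runq ordered by earliest deadline first. */
static void demo_runq_insert(struct demo_unit **runq, struct demo_unit *u)
{
    assert(runq != NULL && u != NULL);

    while (*runq != NULL && (*runq)->deadline <= u->deadline)
        runq = &(*runq)->next;

    u->next = *runq;
    *runq = u;
}

/*
 * Return the earliest-deadline unit that is runnable, or NULL if there is
 * none (the caller would then schedule idle). Units that are not runnable
 * are skipped but deliberately left in the runq: dequeueing them is the
 * job of whoever is putting them to sleep.
 */
static struct demo_unit *demo_runq_pick(struct demo_unit *runq)
{
    for (struct demo_unit *u = runq; u != NULL; u = u->next)
        if (u->runnable)
            return u;

    return NULL;
}
```

With two units queued, where the one with the earliest deadline is not
runnable, demo_runq_pick() returns the second one and the queue itself
is left unchanged.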

Dario Faggioli, Ph.D
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)
