Xen project Mailing List

Re: [Xen-devel] [PATCH v10 09/11] x86/ctxt: Issue a speculation barrier between vcpu contexts

From: Dario Faggioli <dfaggioli@xxxxxxxx>

Date: Sat, 27 Jan 2018 02:27:49 +0100

Cc: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, David Woodhouse <dwmw@xxxxxxxxxxxx>, Xen-devel <xen-devel@xxxxxxxxxxxxx>

Delivery-date: Sat, 27 Jan 2018 01:28:41 +0000

List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

> On 25/01/18 16:09, Andrew Cooper wrote:
> > On 25/01/18 15:57, Jan Beulich wrote:
> > >
> > > For the record, the overwhelming majority of calls to
> > > __sync_local_execstate() being responsible for the behavior
> > > come from invalidate_interrupt(), which suggests to me that
> > > there's a meaningful number of cases where a vCPU is migrated
> > > to another CPU and then back, without another vCPU having
> > > run on the original CPU in between. If I'm not wrong with this,
> > > I have to question why the vCPU is migrated then in the first
> > > place.
> >
So, about this. I haven't applied Jan's measurement patch yet (I'm
doing some reshuffling of my dev and test hardware here), but I have
had a look at traces.

So, Jan, a question: why are you saying "migrated to another CPU **and
then back**"? I'm asking because, AFAICT, the fact that
__sync_local_execstate() is called from invalidate_interrupt() means
that:
* a vCPU is running on a pCPU
* the vCPU is migrated, and the pCPU becomes idle
* the vCPU starts to run where it was migrated, while its 'original'
  pCPU is still idle ==> inv. IPI ==> sync state.

So there seems to me to be no need for the vCPU to actually "go back",
is there?

Anyway, looking at traces, I observed the following:

     28.371352689 --|------x------ d32767v9 csched:schedule cpu 9, idle
     28.371354095 --|------x------ d32767v9 sched_switch prev d32767v9, run for 3412.789us
     28.371354475 --|------x------ d32767v9 sched_switch next d3v8, was runnable for 59.917us, next slice 30000.0us
     28.371354752 --|------x------ d32767v9 sched_switch prev d32767v9 next d3v8
     28.371355267 --|------x------ d32767v9 runstate_change d32767v9 running->runnable
(1)  28.371355728 --|------x------ d?v? runstate_change d3v8 runnable->running
     ............ ................ ...
(2)  28.375501037 -----|||-x----|- d3v8 vcpu_wake d3v5
     28.375501540 -----|||-x----|- d3v8 runstate_change d3v5 blocked->runnable
(3)  28.375502300 -----|||-x----|- d3v8 csched:runq_tickle, cpu 8
     ............ ................ ...
     28.375509472 --|--|||x|----|- d32767v8 csched:schedule cpu 8, idle
     28.375510682 --|--|||x|----|- d32767v8 sched_switch prev d32767v8, run for 724.165us
     28.375511034 --|--|||x|----|- d32767v8 sched_switch next d3v5, was runnable for 7.396us, next slice 30000.0us
     28.375511300 --|--|||x|----|- d32767v8 sched_switch prev d32767v8 next d3v5
     28.375511640 --|--|||x|----|- d32767v8 runstate_change d32767v8 running->runnable
(4)  28.375512060 --|--|||x|----|- d?v? runstate_change d3v5 runnable->running
     ............ ................ ...
(5)  28.375624977 ----|-|||x----|- d3v8 csched: d3v8 unboosted
(6)  28.375628208 ----|-|||x----|- d3v8 csched:pick_cpu 11 for d3v8

At (1) d3v8 starts running on CPU 9. Then, at (2), d3v5 wakes up, and
at (3) CPU 8 (which is idle) is tickled, as a consequence of that. At
(4), CPU 8 picks up d3v5 and runs it (this may seem unrelated, but bear
with me a little).

At (5), a periodic tick arrives on CPU 9. Periodic ticks are a core
part of the Credit1 algorithm, and are used for accounting and load
balancing. In fact, csched_tick() calls csched_vcpu_acct() which, at
(6), calls _csched_cpu_pick().

Pick realizes that d3v8 is running on CPU 9, and that CPU 8 is also
busy. Now, since CPU 8 and 9 are hyperthreads of the same core, and
since there are fully idle cores, Credit1 decides that it's better to
kick d3v8 to one of those fully idle cores, so both d3v5 and d3v8
itself can run at full "core speed". In fact, we see that CPU 11 is
picked, as both the hyperthreads --CPU 10 and CPU 11 itself-- are idle.
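
The reasoning at (6) can be sketched with a toy model. This is only an illustration in Python, not Xen code and not what _csched_cpu_pick() literally does: the function names are hypothetical, the topology (CPUs 2k and 2k+1 being hyperthreads of the same core) is assumed from the sibling pairs mentioned above, and the busy set is read off the per-CPU column of the trace at (6), on the assumption that '|' marks a busy CPU and 'x' the CPU logging the record.

```python
NR_CPUS = 16  # assumed from the 16-character per-CPU column in the trace


def sibling(cpu):
    # Assumed topology: CPUs 2k and 2k+1 are hyperthreads of the same core.
    return cpu ^ 1


def fully_idle_cpus(busy):
    """Return the CPUs that are idle and whose hyperthread sibling is idle too."""
    return [c for c in range(NR_CPUS)
            if c not in busy and sibling(c) not in busy]


def pick_should_migrate(cpu, busy):
    """True if the vCPU on `cpu` shares its core with a busy sibling
    while at least one core in the system is fully idle."""
    return sibling(cpu) in busy and bool(fully_idle_cpus(busy))


# State at (6): CPUs 4, 6, 7, 8 and 14 are busy, and d3v8 runs on CPU 9.
busy_at_6 = {4, 6, 7, 8, 9, 14}
candidates = fully_idle_cpus(busy_at_6)
migrate = pick_should_migrate(9, busy_at_6)
```

With that state, `migrate` is true and CPU 11 is among `candidates`, which is consistent with what the trace shows. The model does not try to explain why 11 is chosen over the other fully idle CPUs.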
(To be continued, below)

(7)  28.375630686 ----|-|||x----|- d3v8 csched:schedule cpu 9, busy
(*)  28.375631619 ----|-|||x----|- d3v8 csched:load_balance skipping 14
     28.375632094 ----|-|||x----|- d3v8 csched:load_balance skipping 8
     28.375632612 ----|-|||x----|- d3v8 csched:load_balance skipping 4
     28.375633004 ----|-|||x----|- d3v8 csched:load_balance skipping 6
     28.375633364 ----|-|||x----|- d3v8 csched:load_balance skipping 7
     28.375633960 ----|-|||x------ d3v8 csched:load_balance skipping 8
     28.375634470 ----|-|||x------ d3v8 csched:load_balance skipping 4
     28.375634820 ----|-|||x------ d3v8 csched:load_balance skipping 6
(**) 28.375635067 ----|-|||x------ d3v8 csched:load_balance skipping 7
     28.375635560 ----|-|||x------ d3v8 sched_switch prev d3v8, run for 4288.140us
     28.375635988 ----|-|||x------ d3v8 sched_switch next d32767v9, was runnable for 4288.140us
     28.375636233 ----|-|||x------ d3v8 sched_switch prev d3v8 next d32767v9
     28.375636615 ----|-|||x------ d3v8 runstate_change d3v8 running->offline
(8)  28.375637015 ----|-|||x------ d?v? runstate_change d32767v9 runnable->running
     28.375638146 ----|-x||------- d3v2 vcpu_block d3v2
     ............ ................ ...
     28.375645627 ----|--||x------ d32767v9 csched:pick_cpu 11 for d3v8
     28.375647138 ----|--||x------ d32767v9 vcpu_wake d3v8
     28.375647640 ----|--||x------ d32767v9 runstate_change d3v8 offline->runnable
(9)  28.375648353 ----|--||x------ d32767v9 csched:runq_tickle, cpu 11
     ............ ................ ...
     28.375709505 ----|--||--x---- d32767v11 sched_switch prev d32767v11, run for 2320182.912us
     28.375709778 ----|--||--x---- d32767v11 sched_switch next d3v8, was runnable for 59.670us, next slice 30000.0us
     28.375710001 ----|--||--x---- d32767v11 sched_switch prev d32767v11 next d3v8
     28.375710501 ----|--||--x---- d32767v11 runstate_change d32767v11 running->runnable
(10) 28.375710858 ----|--||--x---- d?v? runstate_change d3v8 runnable->running

At (7) we see that CPU 9 re-schedules, as a consequence of pick
deciding to migrate d3v8. As a side note, all the "load_balance
skipping xx" lines between (*) and (**) show that work stealing
attempts are actually prevented on all those CPUs, because they have
only 1 runnable (either running or ready to do so) vCPU. I.e., my patch
works and achieves its goal of avoiding even trying to steal (which
means avoiding having to take a lock!), when there's no need. :-)

Anyway, at (8) d3v8 is gone, and CPU 9 eventually becomes idle. At (9),
another call to pick_cpu() confirms that d3v8 will land on CPU 11, and
at (10) we see it starting to run there. It should be at this point
that the invalidate IPI is sent, which causes the state sync request
(note, in fact, that CPU 9 is still idle).

Now, this is just _one_ example, but I am quite convinced that this may
actually be one of the most prominent causes of the behavior Jan
reported.

The problem, as I was expecting, is not work stealing, the problem is,
well... Credit1! :-/

In fact, when d3v5 wakes up, why, at point (3), is CPU 8 tickled,
instead of, for instance, CPU 10 (or 11, or 12, or 13)? CPU 8 and CPU 9
are hyperthread siblings, and CPU 9 is busy, so it would have been
better to try to leave CPU 8 alone. And that would have been possible,
as both the core of CPUs 10 and 11, and that of CPUs 12 and 13, are
fully idle.

Well, the point is, tickling in Credit1 does not check/consider
hyperthreading. Can it then start doing so? Not easily, IMO, and at an
added cost --which will be paid on the vCPU wakeup path (which is
already quite convoluted and complex, in that scheduler).

Credit2, for instance, does not suffer from this issue. In fact,
hyperthreading, there, is considered during wakeup/tickling already.
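
To make the contrast concrete, here is a toy comparison of a tickle that ignores hyperthreading and one that considers it. Again, this is an illustrative Python sketch under stated assumptions, not the Credit1 or Credit2 implementation: both functions are hypothetical, CPUs 2k and 2k+1 are assumed to be siblings, the woken vCPU is assumed to prefer CPU 8 (the trace only shows that CPU 8 got tickled), and the busy set is read off the per-CPU column at (3), taking '|' as busy and 'x' as the CPU logging the record.

```python
def sibling(cpu):
    # Assumed topology: CPUs 2k and 2k+1 are hyperthreads of the same core.
    return cpu ^ 1


def tickle_smt_unaware(preferred, busy):
    """Tickle the preferred CPU whenever it is idle, ignoring its sibling."""
    return preferred if preferred not in busy else None


def tickle_smt_aware(preferred, busy, nr_cpus=16):
    """Prefer an idle CPU whose sibling is idle as well, scanning cyclically
    from the preferred CPU; fall back to the SMT-unaware choice."""
    for offset in range(nr_cpus):
        cpu = (preferred + offset) % nr_cpus
        if cpu not in busy and sibling(cpu) not in busy:
            return cpu
    return tickle_smt_unaware(preferred, busy)


# State at (3): CPUs 5, 6, 7 and 14 are busy, and d3v8 runs on CPU 9.
busy_at_3 = {5, 6, 7, 9, 14}
unaware = tickle_smt_unaware(8, busy_at_3)
aware = tickle_smt_aware(8, busy_at_3)
```

In this model the unaware variant lands on CPU 8, next to the busy CPU 9, while the aware variant lands on CPU 10, whose core is fully idle. The extra scan over siblings is a small hint of the added cost on the wakeup path mentioned above.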

Hope this helps clarify things a bit,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Software Engineer @ SUSE https://www.suse.com/


