
Re: [Xen-devel] xen: credit2: credit2 can’t reach the throughput as expected



Hey, I think you've dropped the xen-devel mailing list, in this and the
other replies.

I'll forward them to there, so they leave trace in the archives.
Please, re-add it, and try to avoid dropping it again.

Thanks
---
> > Hi, George,
> >
> Hi (although I'm not George :-D),
>
 
Hi, Dario,
 
> > I found that Credit2 can't reach the expected throughput under my test
> > workload, compared to Credit and CFS. It is easy to reproduce, and I
> > think the problem really exists.
> > It took me a long time to find out why, due to my lack of knowledge,
> > and I cannot find a good way to solve it.
> > Please do help to take a look at it. Thx.
> >
> Ok, thanks for your testing, and for reporting this to us.
>
> A few questions.
>
 
Thank you for your reply :)
 
> > ***************
> > [How to reproduce]
> > ***************
> > I use openSUSE Tumbleweed with Xen 4.11.
> > Here is what the test workload looks like:
> > I have guest_1 with 4 vCPUs and guest_2 with 8 vCPUs running on 4
> > pCPUs, that is, the pCPU:vCPU ratio is 1:3.
> > Then I add 20% CPU load on each vCPU, which results in 240% total
> > pCPU usage.
> > The 20% pressure model is: I start one process on each vCPU, which
> > repeatedly runs for 20ms and then sleeps for 80ms, within a period of
> > 100ms.
> > I use xentop to observe guest CPU usage in dom0; as I expect, the
> > guest CPU usage should be 80% and 160% for guest_1 and guest_2,
> > respectively.
> >
> Do you have the sources for this somewhere, so that we can try to
> reproduce it ourselves? I'm thinking of the source code for the periodic
> apps (if you used a custom made one), or the repository (if you used one
> from any), or the name of the benchmarking suite --and the parameters
> used to create this scenario?
>
 
I have put the test demo in the attachment; please run it as follows:
1. compile it
  gcc upress.c -o upress
2. calculate the loops in dom0 first
  ./upress -l 100
  For example, the output is
  cpu khz : 2200000
  calculate loops: 4472.
  We get 4472.
3. apply the 20% pressure to each vCPU in the guest with
  ./upress -l 20 -z 4472 &
  It is better to bind each pressure task to a vCPU with taskset.
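
For reference, here is a minimal sketch of the run-20ms/sleep-80ms
pressure model described above. It is an illustration only, not the
attached upress.c, which uses a calibrated loop count (the -z value)
instead of reading the clock:

/*
 * Minimal sketch of a 20%-duty-cycle load: busy-loop for 20ms, then
 * sleep for 80ms, i.e. a 100ms period.  Illustration only.
 */
#include <time.h>
#include <unistd.h>

static long long now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

int main(void)
{
    const long long run_ns = 20 * 1000000LL;   /* 20ms busy */

    for (;;) {
        long long start = now_ns();
        while (now_ns() - start < run_ns)
            ;                                  /* burn CPU */
        usleep(80 * 1000);                     /* 80ms sleep */
    }
    return 0;
}

Pinning works the same either way, e.g.: taskset -c 0 ./upress -l 20 -z 4472 &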
 
> > **************
> > [Why it happens]
> > **************
> > Over the long term, the test workload looks like polling.
> > As shown in the figures below, '- - - -' means the vcpu is running
> > and '————' means it is idle.
> > As we can see from Fig.1, if vcpu_1 and vcpu_2 run staggered, the
> > throughput looks fine; however, if vcpu_1 and vcpu_2 run at the same
> > time, they compete for the pCPU, which results in poor throughput.
> >
> > vcpu_1      - - - - - - - ————————  - - - - - - - ————————  - - - - - - -
> > vcpu_2                    - - - - - - - ————————  - - - - - - - ————————
> >             |   vcpu1    |   vcpu2   |   vcpu1    |   vcpu2   |   vcpu1   |
> > cpu usage   - - - - - - - - - - - - - ———— - - - - - - - - - - ———— - - - -
> >                                                                       Fig.1
> >
> > vcpu_1      - - - - - - - ————————                 - - - - - - - ————————
> > vcpu_2      - - - - - - - ————————                 - - - - - - - ————————
> >             |  compete running  |    both sleep    |  compete running  |  both sleep  |
> > cpu usage   - - - - - - - - - - ————————           - - - - - - - - - - ————————
> >                                                                       Fig.2
> >
> Ok, I'm not entirely sure I follow all this, but let's put it aside for
> a second. The question I have is, is this analysis coming from looking
> at actual traces? If yes, can you upload somewhere/share the trace
> files?
>
 
Sorry for the messy picture; please see the figure below.

The green one means the vcpu is running while the red one means idle.
In Fig.1, vcpu1 and vcpu2 run staggered: vcpu1 runs for 20ms, and then
vcpu2 runs for 20ms while vcpu1 is sleeping.
In Fig.2, vcpu1 and vcpu2 run at the same time: they compete for the
pCPU, and then go to sleep at the same time.
Obviously, the smaller the time-slice is, the worse the competition gets.
As you mentioned, Credit2 does not have a real timeslice; a vcpu can be
preempted dynamically based on the difference in credit
(+ sched_ratelimit_us).
If the difference in credit between vcpus is not big enough, the
scenario of Fig.2 happens most of the time in my test case.
As a result, the pCPU cannot be used effectively.
 
> > We do reset_credit() when snext->credit is negative, which makes the
> > credit values of the vcpus very close to each other.
> > As a result, observed over the long term, the time-slice of each vcpu
> > becomes smaller, and they compete for the pCPU at the same time, just
> > like in Fig.2 above.
> > Thus, I think the reason why it can't reach the expected throughput is
> > that reset_credit() for all vcpus makes the time-slice smaller, which
> > is different from Credit and CFS.
> >
> Ok, so you're saying this drop of "throughput" can be caused by
> scheduling happening too frequently in Credit2.
>
> Well, I guess that is a possibility, although, as I said above, I'd need
> to think a bit more about this, as well as trying to reproduce it, and
> look at actual traces.
>
> Perhaps, one thing that can be done to try to confirm this analysis,
> would be to make the scheduling less frequent in Credit2 and, on the
> other hand, to make it more frequent in Credit1. In theory, if the
> analysis is correct, you would observe the behavior of this specific
> workload improving on Credit2 and degrading in Credit1, when doing so.
>
> If you fancy trying that, for Credit1, you can play with the
> sched_credit_tslice_ms Xen boot time parameter (e.g., try pushing it
> down to 1ms).
>
> For Credit2, it's a little trickier, as the scheduler does not have a
> real timeslice. So, either you alter CSCHED2_MIN_TIMER, in the code, or
> you "mimic" the timeslice increase by setting sched_ratelimit_us to a
> higher value (like, e.g., 10ms).
>
 
Here are the further test results:
i. Interestingly, it still works well if I set the Credit1 timeslice to
1ms with xl sched-credit -s -t 1:
linux-sodv:~ # xl sched-credit
Cpupool Pool-0: tslice=1ms ratelimit=1000us migration-delay=0us
Name                                ID Weight  Cap
Domain-0                             0    256    0
Xenstore                             1    256    0
guest_1                              2    256    0
guest_2                              3    256    0
 
xentop - 13:34:02   Xen 4.11.0_02-1
4 domains: 3 running, 1 blocked, 0 paused, 0 crashed, 0 dying, 0 shutdown
Mem: 67079796k total, 67078956k used, 840k free    CPUs: 32 @ 2600MHz
      NAME  STATE   CPU(sec) CPU(%)     MEM(k) MEM(%)  MAXMEM(k) MAXMEM(%) VCPUS NETS NETTX(k) NETRX(k) VBDS   VBD_OO   VBD_RD   VBD_WR  VBD_RSECT  VBD_WSECT SSID
  Domain-0 -----r        127    1.7   64050536   95.5   no limit       n/a    32    0        0        0    0        0        0        0          0          0    0
   guest_1 -----r         85   82.5    1048832    1.6    1049600       1.6     4    1      343        2    1        0     4144     2168     191469      10364    0
   guest_2 -----r        137  164.5    1048832    1.6    1049600       1.6     8    1      297        4    1        0     4115      246     191637      10323    0
  Xenstore --b---          0    0.0      32760    0.0     670720       1.0     1    0        0        0    0        0        0        0          0          0    0
 
ii. It works well if sched_ratelimit_us is set to 30ms or above:
linux-sodv:~ # xl sched-credit2 -s -p Pool-0
Cpupool Pool-0: ratelimit=30000us
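
(For reference, and assuming xl sched-credit2 supports the
-r/--ratelimit_us option, a ratelimit like the one above can be set at
run time with something like:

xl sched-credit2 -s -p Pool-0 -r 30000

or at Xen boot time via the sched_ratelimit_us command line parameter
mentioned earlier.)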
 
xentop - 13:54:42   Xen 4.11.0_02-1
4 domains: 2 running, 2 blocked, 0 paused, 0 crashed, 0 dying, 0 shutdown
Mem: 67079796k total, 67078968k used, 828k free    CPUs: 32 @ 2600MHz
      NAME  STATE   CPU(sec) CPU(%)     MEM(k) MEM(%)  MAXMEM(k) MAXMEM(%) VCPUS NETS NETTX(k) NETRX(k) VBDS   VBD_OO   VBD_RD   VBD_WR  VBD_RSECT  VBD_WSECT SSID
  Domain-0 -----r        113    2.7   64050452   95.5   no limit       n/a    32    0        0        0    0        0        0        0          0          0    0
   guest_1 --b---         66   82.8    1048832    1.6    1049600       1.6     4    1      449        2    1        0     4089     2177     192131      10476    0
   guest_2 -----r         97  165.8    1048832    1.6    1049600       1.6     8    1      438        5    1        0     4160     1146     192068      10409    0
 
However, sched_ratelimit_us is not very elegant or flexible, since it
enforces one specific, fixed time-slice.
It is very likely to degrade other scheduling criteria, such as
scheduling latency.
As far as I know, CFS can adjust the time-slice according to the number
of tasks in the runqueue (in __sched_period()).
Would it be possible for Credit2 to have a similar ability to adjust the
time-slice automatically?
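
For reference, the idea in CFS is roughly the following. This is a
minimal sketch of the __sched_period() logic, not the kernel code
itself; the constants are the common defaults of sysctl_sched_latency
and sysctl_sched_min_granularity:

#include <stdio.h>
#include <stdint.h>

#define SCHED_LATENCY_NS          6000000ULL  /* 6ms target period (default)    */
#define SCHED_MIN_GRANULARITY_NS   750000ULL  /* 0.75ms minimum slice (default) */
#define SCHED_NR_LATENCY (SCHED_LATENCY_NS / SCHED_MIN_GRANULARITY_NS)

/* Sketch of the __sched_period() idea: with few runnable tasks the
 * period is fixed; with many, it is stretched so that each task still
 * gets at least the minimum granularity. */
static uint64_t sched_period(unsigned int nr_running)
{
    if (nr_running > SCHED_NR_LATENCY)
        return (uint64_t)nr_running * SCHED_MIN_GRANULARITY_NS;
    return SCHED_LATENCY_NS;
}

int main(void)
{
    for (unsigned int nr = 1; nr <= 16; nr *= 2)
        printf("nr_running=%2u -> period=%llums\n",
               nr, (unsigned long long)(sched_period(nr) / 1000000ULL));
    return 0;
}

Something similar in Credit2 could, in principle, scale a minimum
preemption interval with the number of vcpus in the runqueue, rather
than relying on one fixed sched_ratelimit_us value.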
 
Looking forward to hearing your opinion on this issue.
 
Best Regards.
 
> It's not a conclusive test, but I think it is a good enough one for
> gaining some more understanding of the issue.
>
> Regards,
> Dario
> --
> <<This happens because I choose it to happen!>> (Raistlin Majere)
> -----------------------------------------------------------------
> Dario Faggioli, Ph.D, http://about.me/dario.faggioli
> Software Engineer @ SUSE
> https://www.suse.com/
 
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Software Engineer @ SUSE https://www.suse.com/


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel

 

