
Re: [Xen-devel] xen: credit2: credit2 can’t reach the throughput as expected



Hey, I think you've dropped the xen-devel mailing list, in this and the
other replies.

I'll forward them to there, so they leave trace in the archives.
Please, re-add it, and try to avoid dropping it again.

Thanks
---
> > Hi, George,
> >
> Hi (although I'm not George :-D),
>
 
Hi, Dario,
 
> > I found that Credit2 can't reach the expected throughput under my test
> > workload, compared to Credit and CFS. It is easy to reproduce, and I
> > think the problem really exists.
> > It took me a long time to find out why, due to my lack of knowledge,
> > and I cannot find a good way to solve it.
> > Please do help to take a look at it. Thx.
> >
> Ok, thanks for your testing, and for reporting this to us.
>
> A few questions.
>
 
Thank you for your reply :)
 
> > ***************
> > [How to reproduce]
> > ***************
> > I use openSUSE Tumbleweed with Xen 4.11.
> > Here is what the test workload looks like:
> > I have guest_1 with 4 vCPUs and guest_2 with 8 vCPUs running on 4
> > pCPUs, that is, the pCPU:vCPU ratio is 1:3.
> > Then I add 20% CPU load on each vCPU, which results in 240% total
> > pCPU usage.
> > The 20% pressure model is: I start one process on each vCPU, which
> > repeatedly runs for 20ms and then sleeps for 80ms, within a period of
> > 100ms.
> > I use xentop to observe guest CPU usage in dom0; as I expect, the
> > guest CPU usage should be 80% and 160% for guest_1 and guest_2,
> > respectively.
> >
> Do you have the sources for this somewhere, so that we can try to
> reproduce it ourselves? I'm thinking of the source code for the periodic
> apps (if you used a custom made one), or the repository (if you used one
> from any), or the name of the benchmarking suite --and the parameters
> used to create this scenario?
>
 
I have put the test demo in the attachment; please run it as follows:
1. compile it
  gcc upress.c -o upress
2. calculate the loops in dom0 first
  ./upress -l 100
  For example, the output is
  cpu khz : 2200000
  calculate loops: 4472.
  We get 4472.
3. apply the 20% pressure to each vCPU in the guest with
  ./upress -l 20 -z 4472 &
  It is better to bind each pressure task to a vCPU with taskset.
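
For reference, here is a minimal sketch of the run-20ms/sleep-80ms
pressure model described above. It is an illustration only, not the
attached upress.c, which uses a calibrated loop count (the -z value)
instead of reading the clock:

/*
 * Minimal sketch of a 20%-duty-cycle load: busy-loop for 20ms, then
 * sleep for 80ms, i.e. a 100ms period.  Illustration only.
 */
#include <time.h>
#include <unistd.h>

static long long now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

int main(void)
{
    const long long run_ns = 20 * 1000000LL;   /* 20ms busy */

    for (;;) {
        long long start = now_ns();
        while (now_ns() - start < run_ns)
            ;                                  /* burn CPU */
        usleep(80 * 1000);                     /* 80ms sleep */
    }
    return 0;
}

Pinning works the same either way, e.g.: taskset -c 0 ./upress -l 20 -z 4472 &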
 
> > **************
> > [Why it happens]
> > **************
> > Over the long term, the test workload looks like polling.
> > As shown in the figures below, '- - - -' means the vcpu is running
> > and '————' means it is idle.
> > As we can see from Fig.1, if vcpu_1 and vcpu_2 run staggered, the
> > throughput looks fine; however, if vcpu_1 and vcpu_2 run at the same
> > time, they compete for the pCPU, which results in poor throughput.
> >
> > vcpu_1      - - - - - - - ————————  - - - - - - - ————————  - - - - - - -
> > vcpu_2                    - - - - - - - ————————  - - - - - - - ————————
> >             |   vcpu1    |   vcpu2   |   vcpu1    |   vcpu2   |   vcpu1   |
> > cpu usage   - - - - - - - - - - - - - ———— - - - - - - - - - - ———— - - - -
> >                                                                       Fig.1
> >
> > vcpu_1      - - - - - - - ————————                 - - - - - - - ————————
> > vcpu_2      - - - - - - - ————————                 - - - - - - - ————————
> >             |  compete running  |    both sleep    |  compete running  |  both sleep  |
> > cpu usage   - - - - - - - - - - ————————           - - - - - - - - - - ————————
> >                                                                       Fig.2
> >
> Ok, I'm not entirely sure I follow all this, but let's put it aside for
> a second. The question I have is, is this analysis coming from looking
> at actual traces? If yes, can you upload somewhere/share the trace
> files?
>
 
Sorry for the messy picture; please see the figure below.

The green one means the vcpu is running while the red one means idle.
In Fig.1, vcpu1 and vcpu2 run staggered: vcpu1 runs for 20ms, and then
vcpu2 runs for 20ms while vcpu1 is sleeping.
In Fig.2, vcpu1 and vcpu2 run at the same time: they compete for the
pCPU, and then go to sleep at the same time.
Obviously, the smaller the time-slice is, the worse the competition gets.
As you mentioned, Credit2 does not have a real timeslice; a vcpu can be
preempted dynamically based on the difference in credit
(+ sched_ratelimit_us).
If the difference in credit between vcpus is not big enough, the
scenario of Fig.2 happens most of the time in my test case.
As a result, the pCPU cannot be used effectively.
 
> > We do reset_credit() when snext->credit is negative, which makes the
> > credit values of the vcpus very close to each other.
> > As a result, observed over the long term, the time-slice of each vcpu
> > becomes smaller, and they compete for the pCPU at the same time, just
> > like in Fig.2 above.
> > Thus, I think the reason why it can't reach the expected throughput is
> > that reset_credit() for all vcpus makes the time-slice smaller, which
> > is different from Credit and CFS.
> >
> Ok, so you're saying this drop of "throughput" can be caused by
> scheduling happening too frequently in Credit2.
>
> Well, I guess that is a possibility, although, as I said above, I'd need
> to think a bit more about this, as well as trying to reproduce it, and
> look at actual traces.
>
> Perhaps, one thing that can be done to try to confirm this analysis,
> would be to make the scheduling less frequent in Credit2 and, on the
> other hand, to make it more frequent in Credit1. In theory, if the
> analysis is correct, you would observe the behavior of this specific
> workload improving on Credit2 and degrading in Credit1, when doing so.
>
> If you fancy trying that, for Credit1, you can play with the
> sched_credit_tslice_ms Xen boot time parameter (e.g., try pushing it
> down to 1ms).
>
> For Credit2, it's a little trickier, as the scheduler does not have a
> real timeslice. So, either you alter CSCHED2_MIN_TIMER, in the code, or
> you "mimic" the timeslice increase by setting sched_ratelimit_us to a
> higher value (like, e.g., 10ms).
>
 
Here are the further test results:
i. Interestingly, it still works well if I set the Credit1 timeslice to
1ms with xl sched-credit -s -t 1:
linux-sodv:~ # xl sched-credit
Cpupool Pool-0: tslice=1ms ratelimit=1000us migration-delay=0us
Name                                ID Weight  Cap
Domain-0                             0    256    0
Xenstore                             1    256    0
guest_1                              2    256    0
guest_2                              3    256    0
 
xentop - 13:34:02   Xen 4.11.0_02-1
4 domains: 3 running, 1 blocked, 0 paused, 0 crashed, 0 dying, 0 shutdown
Mem: 67079796k total, 67078956k used, 840k free    CPUs: 32 @ 2600MHz
      NAME  STATE   CPU(sec) CPU(%)     MEM(k) MEM(%)  MAXMEM(k) MAXMEM(%) VCPUS NETS NETTX(k) NETRX(k) VBDS   VBD_OO   VBD_RD   VBD_WR  VBD_RSECT  VBD_WSECT SSID
  Domain-0 -----r        127    1.7   64050536   95.5   no limit       n/a    32    0        0        0    0        0        0        0          0          0    0
   guest_1 -----r         85   82.5    1048832    1.6    1049600       1.6     4    1      343        2    1        0     4144     2168     191469      10364    0
   guest_2 -----r        137  164.5    1048832    1.6    1049600       1.6     8    1      297        4    1        0     4115      246     191637      10323    0
  Xenstore --b---          0    0.0      32760    0.0     670720       1.0     1    0        0        0    0        0        0        0          0          0    0
 
ii. It works well if sched_ratelimit_us is set to 30ms or above:
linux-sodv:~ # xl sched-credit2 -s -p Pool-0
Cpupool Pool-0: ratelimit=30000us
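
(For reference, and assuming xl sched-credit2 supports the
-r/--ratelimit_us option, a ratelimit like the one above can be set at
run time with something like:

xl sched-credit2 -s -p Pool-0 -r 30000

or at Xen boot time via the sched_ratelimit_us command line parameter
mentioned earlier.)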
 
xentop - 13:54:42   Xen 4.11.0_02-1
4 domains: 2 running, 2 blocked, 0 paused, 0 crashed, 0 dying, 0 shutdown
Mem: 67079796k total, 67078968k used, 828k free    CPUs: 32 @ 2600MHz
      NAME  STATE   CPU(sec) CPU(%)     MEM(k) MEM(%)  MAXMEM(k) MAXMEM(%) VCPUS NETS NETTX(k) NETRX(k) VBDS   VBD_OO   VBD_RD   VBD_WR  VBD_RSECT  VBD_WSECT SSID
  Domain-0 -----r        113    2.7   64050452   95.5   no limit       n/a    32    0        0        0    0        0        0        0          0          0    0
   guest_1 --b---         66   82.8    1048832    1.6    1049600       1.6     4    1      449        2    1        0     4089     2177     192131      10476    0
   guest_2 -----r         97  165.8    1048832    1.6    1049600       1.6     8    1      438        5    1        0     4160     1146     192068      10409    0
 
However, sched_ratelimit_us is not very elegant or flexible, since it
enforces one specific, fixed time-slice.
It is very likely to degrade other scheduling criteria, such as
scheduling latency.
As far as I know, CFS can adjust the time-slice according to the number
of tasks in the runqueue (in __sched_period()).
Would it be possible for Credit2 to have a similar ability to adjust the
time-slice automatically?
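
For reference, the idea in CFS is roughly the following. This is a
minimal sketch of the __sched_period() logic, not the kernel code
itself; the constants are the common defaults of sysctl_sched_latency
and sysctl_sched_min_granularity:

#include <stdio.h>
#include <stdint.h>

#define SCHED_LATENCY_NS          6000000ULL  /* 6ms target period (default)    */
#define SCHED_MIN_GRANULARITY_NS   750000ULL  /* 0.75ms minimum slice (default) */
#define SCHED_NR_LATENCY (SCHED_LATENCY_NS / SCHED_MIN_GRANULARITY_NS)

/* Sketch of the __sched_period() idea: with few runnable tasks the
 * period is fixed; with many, it is stretched so that each task still
 * gets at least the minimum granularity. */
static uint64_t sched_period(unsigned int nr_running)
{
    if (nr_running > SCHED_NR_LATENCY)
        return (uint64_t)nr_running * SCHED_MIN_GRANULARITY_NS;
    return SCHED_LATENCY_NS;
}

int main(void)
{
    for (unsigned int nr = 1; nr <= 16; nr *= 2)
        printf("nr_running=%2u -> period=%llums\n",
               nr, (unsigned long long)(sched_period(nr) / 1000000ULL));
    return 0;
}

Something similar in Credit2 could, in principle, scale a minimum
preemption interval with the number of vcpus in the runqueue, rather
than relying on one fixed sched_ratelimit_us value.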
 
Looking forward to hearing your opinion on this issue.
 
Best Regards.
 
> It's not a conclusive test, but I think it is a good enough one for
> gaining some more understanding of the issue.
>
> Regards,
> Dario
> --
> <<This happens because I choose it to happen!>> (Raistlin Majere)
> -----------------------------------------------------------------
> Dario Faggioli, Ph.D, http://about.me/dario.faggioli
> Software Engineer @ SUSE
> https://www.suse.com/
 
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Software Engineer @ SUSE https://www.suse.com/


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel

 

