> -----Original Message-----
> From: dunlapg@xxxxxxxxx [mailto:dunlapg@xxxxxxxxx] On Behalf Of George
> Dunlap
> Sent: Tuesday, October 25, 2011 12:17 AM
> To: Lv, Hui
> Cc: xen-devel@xxxxxxxxxxxxxxxxxxx; Duan, Jiangang; Tian, Kevin;
> keir@xxxxxxx; Dong, Eddie
> Subject: Re: [Xen-devel] [PATCH] scheduler rate controller
>
> On Mon, Oct 24, 2011 at 4:36 AM, Lv, Hui <hui.lv@xxxxxxxxx> wrote:
> >
> > As one of the topics presented at Xen Summit 2011 in SC, we proposed a
> > scheduler rate controller (SRC) to limit excessive scheduling frequency
> > under certain conditions. You can find the slides at
> > http://www.slideshare.net/xen_com_mgr/9-hui-lvtacklingthemanagementchallengesofserverconsolidationonmulticoresystems
> >
> > We have tested it through many rounds on a 2-socket multi-core system
> > and obtained consistently positive results: it greatly improves
> > performance both with the consolidation workload SPECvirt_sc2010 and
> > with smaller workloads such as sysbench and SPECjbb. So I am posting it
> > here for review.
> >
> > In the Xen scheduling mechanism, the hypervisor kicks the related VCPUs
> > by raising the schedule softirq while processing external interrupts.
> > Therefore, if the number of IRQs is very large, scheduling happens more
> > frequently. Frequent scheduling will
> > 1) bring more overhead for the hypervisor and
> > 2) increase the cache miss rate.
> >
> > In our consolidation workload, SPECvirt_sc2010, an SR-IOV & iSCSI
> > solution is adopted to bypass software emulation, but it brings heavy
> > network traffic. Correspondingly, 15k schedules happen per second on
> > each physical core, which means the average running time is very
> > short, only 60us. We proposed SRC in Xen to mitigate this problem.
> > The performance benefit brought by this patch is very large at peak
> > throughput, with no influence when system load is low.
> >
> > SRC improved SPECvirt performance by 14%:
> > 1) It reduced CPU utilization, which allows more load to be added.
> > 2) Response time (QoS) became better at the same CPU %.
> > 3) The better response time allowed us to push the CPU % at peak
> >    performance to an even higher level (the CPU was not saturated in
> >    SPECvirt).
> > SRC also reduced the context switch rate significantly, resulting in
> > 1) a smaller path length,
> > 2) fewer cache misses and thus a lower CPI, and
> > 3) better performance on both the guest and hypervisor sides.
> >
> > With this patch, our SPECvirt_sc2010 results show that Xen's
> > performance catches up with the other open-source hypervisor.
>
> Hui,
>
> Thanks for the patch, and the work you've done testing it. There are
> a couple of things to discuss.
>
> * I'm not sure I like the idea of doing this at the generic level rather
> than at the specific scheduler level -- e.g., inside of credit1. For
> better or for worse, all aspects of scheduling work together, and even
> small changes tend to have a significant effect on the emergent
> behavior. I understand why you'd want this in the generic scheduling
> code; but it seems like it would be better for each scheduler to
> implement a rate control independently.
>
> * The actual algorithm you use here isn't described. It seems to be
> as follows (please correct me if I've made a mistake
> reverse-engineering the algorithm):
>
> Every 10ms, check to see if there have been more than 50 schedules.
> If so, disable pre-emption entirely for 10ms, allowing processes to
> run without being interrupted (unless they yield).
>
Sorry for the lack of description. Your reverse-engineering of the control
process is correct.
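
To make the control process concrete, the logic is roughly as follows. This
is only a simplified sketch in plain C, not the patch itself; the constant
and function names are illustrative (the real threshold is the
"opt_sched_rate_high" parameter described below):

#include <stdbool.h>
#include <stdint.h>

#define SRC_WINDOW_NS       10000000ULL  /* 10ms measurement window               */
#define SRC_HIGH_THRESHOLD  50           /* schedules per window deemed excessive */

struct src_state {
    uint64_t window_start;    /* start time (ns) of the current window    */
    unsigned int sched_count; /* scheduling requests seen in this window  */
    bool throttling;          /* is preemption suppressed in this window? */
};

/* Called for every scheduling request on a physical CPU; returns true when
 * the request should be suppressed (i.e. the current vcpu keeps running). */
static bool src_should_throttle(struct src_state *s, uint64_t now)
{
    if (now - s->window_start >= SRC_WINDOW_NS) {
        /* New window: throttle it if the previous window was too busy. */
        s->throttling   = (s->sched_count > SRC_HIGH_THRESHOLD);
        s->window_start = now;
        s->sched_count  = 0;
    }

    s->sched_count++;
    return s->throttling;
}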
> It seems like we should be able to do better. For one, it means in
> the general case you will flip back and forth between really frequent
> schedules and less frequent schedules. For two, turning off
> preemption entirely will mean that whatever vcpu happens to be running
> could, if it wished, run for the full 10ms; and which one got elected
> to do that would be really random. This may work well for SPECvirt,
> but it's the kind of algorithm that is likely to have some workloads
> on which it works very poorly. Finally, there's the chance that this
> algorithm could be "gamed" -- i.e., if a rogue VM knew that most other
> VMs yielded frequently, it might be able to arrange that there would
> always be more than 50 context switches per 10ms window, while it runs
> without preemption and takes up more than its fair share.
>
Yes, I agree that there is more to do to make this a more refined solution in
the next step. For example, we could consider per-VM status when deciding
whether to turn the control on or off, to make it fairer, as in your third
point. However, as a first step, the current solution is straightforward and
effective:
1) Most importantly, it only takes effect when the scheduling frequency is
excessive. Users can decide what counts as excessive by setting the parameter
"opt_sched_rate_high" (default 50). If the system is dedicated to
latency-sensitive tasks, you can choose a higher value so that this patch has
little impact on it; users can decide which value is right for their
environment. In our experience, however, when the scheduling frequency is
excessive it also impairs the QoS of latency-sensitive tasks, due to frequent
interruptions by other VMs.
2) Under the excessive-scheduling condition, preemption is turned off
entirely. If the currently running vcpu yields frequently, it cannot run for
the full 10ms; if it does not yield frequently, it can possibly run for up to
10ms. That means the algorithm roughly protects a non-yielding vcpu so it can
run a long time slice without being preempted (a small sketch of this
behavior follows below). This is somewhat similar to your point 3, but in a
rough way. :)
3) Finally, this patch aims to solve the problem when the scheduling
frequency is excessive, without influencing the normal (lower-frequency)
case. We should treat these two cases separately, since the
excessive-scheduling case can guarantee neither performance nor QoS.
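
To illustrate point 2 concretely, the intended behavior is roughly the
following. Again, this is only a simplified sketch, not the patch itself, and
the names are illustrative:

#include <stdbool.h>

/* How the throttle interacts with yields: only preemption-driven reschedules
 * are suppressed while the control is active. */
enum sched_reason { SCHED_PREEMPT, SCHED_YIELD, SCHED_BLOCK };

static bool should_reschedule(enum sched_reason reason, bool rate_limited)
{
    /* A vcpu that yields or blocks always gives up the CPU, so a frequently
     * yielding vcpu never holds the core for the full 10ms. */
    if (reason == SCHED_YIELD || reason == SCHED_BLOCK)
        return true;

    /* Preemption-driven reschedules are dropped while the control is on, so
     * a vcpu that does not yield can keep running for up to the 10ms window. */
    return !rate_limited;
}

So the long, uninterrupted time slice only goes to vcpus that do not
voluntarily give up the CPU.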
> Have you tried just making it give each vcpu a minimum amount of
> scheduling time, say, 500us or 1ms?
>
> Now a couple of stylistic comments:
> * src tends to make me think of "source". I think sched_rate[_*]
> would fit the existing naming convention better.
> * src_controller() shouldn't call continue_running() directly.
> Instead, scheduler() should call src_controller(); and only call
> sched->do_schedule() if src_controller() returns false (or something
> like that).
> * Whatever the algorithm is, it should have comments describing what it
> does and how it's supposed to work.
> * Your patch is malformed; you need to have it apply at the top level,
> not from within the xen/ subdirectory. The easiest way to get a patch
> is to use either mercurial queues, or "hg diff". There are some good
> suggestions for making and posting patches here:
> http://wiki.xensource.com/xenwiki/SubmittingXenPatches
>
Thanks for the helpful information; I think the next version will be better. :)
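
Regarding the comment about src_controller() and continue_running(): if I
understand the suggestion correctly, the next version would be structured
roughly like the sketch below (simplified C, with stand-in names for
continue_running() and sched->do_schedule()):

#include <stdbool.h>
#include <stdint.h>

bool sched_rate_check(unsigned int cpu, uint64_t now);     /* role of src_controller()        */
void keep_current_vcpu_running(unsigned int cpu);          /* stand-in for continue_running() */
void run_scheduler_policy(unsigned int cpu, uint64_t now); /* stand-in for do_schedule()      */

void schedule_entry(unsigned int cpu, uint64_t now)
{
    /* The rate controller only reports a decision; the generic code decides
     * whether to keep the current vcpu or invoke the per-scheduler policy. */
    if (sched_rate_check(cpu, now)) {
        keep_current_vcpu_running(cpu);
        return;
    }

    /* Only consult the scheduler-specific policy when not rate limited. */
    run_scheduler_policy(cpu, now);
}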
> Thanks again for all your work on this -- we definitely want Xen to
> beat the other open-source hypervisor. :-)
>
> -George
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel