Re: [Xen-devel] [PATCH] scheduler rate controller

To:	"Lv, Hui" <hui.lv@xxxxxxxxx>
Subject:	Re: [Xen-devel] [PATCH] scheduler rate controller
From:	George Dunlap <George.Dunlap@xxxxxxxxxxxxx>
Date:	Mon, 24 Oct 2011 17:17:04 +0100
Cc:	"Duan, Jiangang" <jiangang.duan@xxxxxxxxx>, "Tian, Kevin" <kevin.tian@xxxxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxx>, "keir@xxxxxxx" <keir@xxxxxxx>, "Dong, Eddie" <eddie.dong@xxxxxxxxx>

On Mon, Oct 24, 2011 at 4:36 AM, Lv, Hui <hui.lv@xxxxxxxxx> wrote:
>
> As one of the topics presented at Xen Summit 2011 in SC, we proposed a
> method, the scheduler rate controller (SRC), to limit high-frequency
> scheduling under some conditions. You can find the slides at
> http://www.slideshare.net/xen_com_mgr/9-hui-lvtacklingthemanagementchallengesofserverconsolidationonmulticoresystems
>
> We have tested it over many rounds on a 2-socket multi-core system and
> obtained positive results: it greatly improves performance both with the
> consolidation workload SPECvirt_sc2010 and with smaller workloads such as
> sysbench and SPECjbb. So I am posting it here for review.
>
> In Xen's scheduling mechanism, the hypervisor kicks the relevant VCPUs by
> raising the schedule softirq while processing external interrupts.
> Therefore, if the number of IRQs is very large, scheduling happens more
> frequently. Frequent scheduling will
> 1) bring more overhead for the hypervisor, and
> 2) increase the cache miss rate.
>
> In our consolidation workload, SPECvirt_sc2010, SR-IOV and iSCSI solutions
> are adopted to bypass software emulation, but they bring heavy network
> traffic. Correspondingly, about 15k schedules happened per second on each
> physical core, which means the average running time is very short, only
> about 60us. We proposed SRC in Xen to mitigate this problem.
> The performance benefit brought by this patch is very large at peak
> throughput, with no impact when system load is low.
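For reference, the arithmetic behind the run-time figure, assuming the ~15k schedules per second are spread evenly over one physical core's time:

```latex
\bar{t}_{\mathrm{run}} \approx \frac{1\ \mathrm{s}}{15\,000} \approx 66.7\ \mu\mathrm{s}
```

which is in line with the roughly 60us quoted.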
>
> SRC improved SPECvirt performance by 14%:
> 1) It reduced CPU utilization, which allows more load to be added.
> 2) Response time (QoS) became better at the same CPU %.
> 3) The better response time allowed us to push the CPU % at peak performance
> to an even higher level (CPU was not saturated in SPECvirt).
> SRC reduced the context switch rate significantly, resulting in
> 2) Smaller path length
> 3) Fewer cache misses, and thus lower CPI
> 4) Better performance on both the guest and hypervisor sides.
>
> With this patch, from our SPECvirt_sc2010 results, the performance of Xen
> catches up with the other open-source hypervisor.

Hui,

Thanks for the patch, and the work you've done testing it.  There are
a couple of things to discuss.

* I'm not sure I like the idea of doing this at the generic level rather
than at the specific scheduler level (e.g., inside credit1).  For
better or for worse, all aspects of scheduling work together, and even
small changes tend to have a significant effect on the emergent
behavior.  I understand why you'd want this in the generic scheduling
code; but it seems like it would be better for each scheduler to
implement a rate control independently.

* The actual algorithm you use here isn't described.  It seems to be
as follows (please correct me if I've made a mistake
reverse-engineering the algorithm):

Every 10ms, check to see if there have been more than 50 schedules.
If so, disable pre-emption entirely for 10ms, allowing processes to
run without being interrupted (unless they yield).
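To make that behaviour concrete, here is a minimal C sketch of the algorithm as reverse-engineered above. It is not code from the patch: struct sched_rate_state, sched_rate_suppress() and the SCHED_RATE_* constants are hypothetical, and it assumes that softirqs suppressed during a no-preemption window do not count toward the next window. Note that 50 schedules per 10ms is 5,000 schedules per second, i.e. an average run time of 200us.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical constants matching the behaviour described above. */
#define SCHED_RATE_WINDOW_NS 10000000ULL /* 10ms observation window */
#define SCHED_RATE_THRESHOLD 50U         /* max schedules per window */

/* Hypothetical per-physical-CPU state; not the patch's real layout. */
struct sched_rate_state {
    uint64_t window_start; /* start of the current window, in ns */
    unsigned int count;    /* real schedules seen in this window */
    bool no_preempt;       /* true while preemption is disabled */
};

/*
 * Called whenever the schedule softirq runs at time 'now' (ns).
 * Returns true if the current vcpu should simply keep running.
 */
static bool sched_rate_suppress(struct sched_rate_state *s, uint64_t now)
{
    if (now - s->window_start >= SCHED_RATE_WINDOW_NS) {
        /* Window over: disable preemption for the whole next window
         * only if this one saw more than the threshold. */
        s->no_preempt = s->count > SCHED_RATE_THRESHOLD;
        s->window_start = now;
        s->count = 0;
    }
    if (s->no_preempt)
        return true;  /* no context switch; not counted */
    s->count++;       /* a real schedule will happen */
    return false;
}
```

A busy window causes the entire next window to run without preemption; that window then records zero schedules, so the one after it is unthrottled again.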

It seems like we should be able to do better.  First, it means that in
the general case you will flip back and forth between very frequent
schedules and less frequent schedules.  Second, turning off
preemption entirely means that whatever vcpu happens to be running
could, if it wished, run for the full 10ms; and which one got elected
to do that would be essentially random.  This may work well for SPECvirt,
but it's the kind of algorithm that is likely to have some workloads
on which it works very poorly.  Finally, there's the chance that this
algorithm could be "gamed": if a rogue VM knew that most other
VMs yielded frequently, it might be able to arrange that there would
always be more than 50 context switches every 10ms, while it runs
without preemption and takes up more than its fair share.

Have you tried just making it give each vcpu a minimum amount of
scheduling time, say, 500us or 1ms?
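A minimal C sketch of that alternative, assuming each vcpu records when it was last scheduled in and assuming a 1ms minimum (500us works the same way). Rather than switching, the scheduler would keep a runnable, non-idle vcpu on the CPU until it has run for the minimum, and re-arm its timer for the remainder. The struct and function names are hypothetical, not existing Xen interfaces.

```c
#include <stdbool.h>
#include <stdint.h>

#define MIN_RUN_NS 1000000ULL /* hypothetical minimum run time: 1ms */

/* Hypothetical summary of the vcpu currently on this CPU. */
struct cur_vcpu {
    uint64_t start_time; /* when it was last scheduled in, in ns */
    bool runnable;       /* still wants the CPU */
    bool idle;           /* the idle vcpu is never protected */
};

/*
 * Returns 0 if the scheduler may pick a new vcpu now; otherwise the
 * number of ns the current vcpu must still run before preemption.
 * Assumes 'now' is monotonic, i.e. now >= v->start_time.
 */
static uint64_t min_run_remaining(const struct cur_vcpu *v, uint64_t now)
{
    uint64_t ran = now - v->start_time;

    if (!v->runnable || v->idle || ran >= MIN_RUN_NS)
        return 0;
    return MIN_RUN_NS - ran;
}
```

Unlike the window-based scheme, the protection here is bounded per vcpu and does not depend on how often other VMs yield, which should make it harder to game.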

Now a couple of stylistic comments:
* src tends to make me think of "source".  I think sched_rate[_*]
would fit the existing naming convention better.
* src_controller() shouldn't call continue_running() directly.
Instead, scheduler() should call src_controller(), and only call
sched->do_schedule() if src_controller() returns false (or something
like that).
* Whatever the algorithm is should have comments describing what it
does and how it's supposed to work.
* Your patch is malformed; you need to have it apply at the top level,
not from within the xen/ subdirectory.  The easiest way to generate a patch
is to use either Mercurial queues or "hg diff".  There are some good
suggestions for making and posting patches here:
http://wiki.xensource.com/xenwiki/SubmittingXenPatches
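As an illustration of the src_controller() point, here is a self-contained C sketch with mock types. The generic entry point consults the rate-control hook first and only falls through to the scheduler-specific do_schedule() when the hook returns false, so the hook itself never calls continue_running(). schedule_next(), the rate_check hook and this simplified struct scheduler are hypothetical; the real Xen structures and signatures differ.

```c
#include <stdbool.h>

/* Mock stand-in for Xen's vcpu; the real structure is far larger. */
struct vcpu {
    int id;
};

/* Simplified scheduler ops; the real do_schedule() signature differs. */
struct scheduler {
    struct vcpu *(*do_schedule)(struct vcpu *curr);
};

/*
 * Generic entry point: ask the rate-control hook first. If it says
 * "keep running", return the current vcpu without a context switch;
 * otherwise let the scheduler-specific code pick the next vcpu.
 */
static struct vcpu *schedule_next(const struct scheduler *sched,
                                  bool (*rate_check)(struct vcpu *),
                                  struct vcpu *curr)
{
    if (rate_check(curr))
        return curr;
    return sched->do_schedule(curr);
}
```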

Thanks again for all your work on this -- we definitely want Xen to
beat the other open-source hypervisor. :-)

 -George

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel
