
Re: [Xen-devel] [RFC] Scheduler work, part 1: High-level goals and interface.


  • To: "George Dunlap" <George.Dunlap@xxxxxxxxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxx>
  • From: "Zhiyuan Shao" <zyshao.maillist@xxxxxxxxx>
  • Date: Fri, 10 Apr 2009 10:28:22 +0800
  • Cc:
  • Delivery-date: Thu, 09 Apr 2009 19:29:21 -0700
  • List-id: Xen developer discussion <xen-devel.lists.xensource.com>

Hi all,
 
Actually, I think I/O responsiveness is an important property for the scheduling algorithm to control. This is especially true for a virtualized desktop/client environment, since in such environments there are many I/O events to handle, unlike the server-consolidation case, where most tasks are CPU-intensive.
I would like to illustrate this point with a simple scheduling algorithm, attached to this mail, that I wrote last winter (Jan. 2009). At that time I was visiting Intel OTC, and I thank the Intel folks (Disheng, Kevin Tian, and others) for their help.
The scheduler is named SDP (boot with the "sched=sdp" parameter on the Xen command line); the intent is to use dynamic-priority ideas to make virtualized clients meet their usage needs. So far, however, it is only a simple proof-of-concept prototype, and I have not implemented the dynamic-priority mechanisms yet. The solution used in this simple scheduler is largely ad hoc, and I hope it can contribute something to the development of the next-generation Xen scheduler. BTW, I borrowed a large portion of the code from the Credit scheduler.
 
This patch should work well on a VT-d platform (it does not do well in an emulated-device environment, since device emulation, especially video, produces too much overhead for the scheduler to handle). We tested the scheduler (thanks again to Intel OTC for the VT-d platform) on a 3.0GHz 2-core system, ran 2 HVM guest domains (one primary and one auxiliary, each with two VCPUs), and pinned each VCPU of each domain to a different PCPU (the VCPUs of domain0 are pinned as well), since I have not yet implemented a proper VCPU migration mechanism in SDP (sorry for that; I do not think the aggressive migration mechanism of Credit is appropriate for virtualized clients, and I am working on this now and hope to find a suitable mechanism for Xen in the near future). The sound and video cards are assigned directly to the primary domain, while the auxiliary domain uses emulated ones.
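(For reference, the pinning can be reproduced with the standard xm tool; the domain IDs below are only an example and should be replaced with whatever "xm list" reports on your system:)

    xm vcpu-pin 0 0 0    # dom0    VCPU0 -> PCPU0
    xm vcpu-pin 0 1 1    # dom0    VCPU1 -> PCPU1
    xm vcpu-pin 1 0 0    # primary VCPU0 -> PCPU0
    xm vcpu-pin 1 1 1    # primary VCPU1 -> PCPU1
    (and likewise for the auxiliary domain)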
 
Set the "priority" (should be named as "I/O responsiveness", I think i had make a mistake on this, since the initial objectiveness is to use dynamic priority ideas) value of domain0 to 91, and set that of primary domain to 90. regarding the auxiliary domain, you can left it as default (80). Please used the attached domain0 command line tool (i.e., sched_sdp_set [domain_id] [new_priority]) to set the new priority, I am not good at Python, sorry for that!
We tested a scenario in which the primary domain plays a DivX video while very large files are copied in the auxiliary domain at the same time. The video plays well! In this use case the result beats BCredit, no matter how we adjust BCredit's parameters.
 
Some explanations of SDP:
The "priority" parameter is actually used to control I/O responsiveness. If a VCPU is woken by an I/O event during the runtime, and at the same time, and its "priority" value happened to be higher than the current VCPU, the current VCPU will be preempted, and the woken VCPU is scheduled. A "bonus" will be given to the woken VCPU to leave "enough possible" time for it to complete its I/O handling. The bonus value is computed by substracting the "priority" parameter of the two (the woken and the current ) VCPUs. This strategy actually inhibits the preemption of a currently running VCPU with high "priority" by another VCPU with lower one, while permits preemption vice vesa, and i think this method fits well for the asymmetric domain role scenario of virtualized clients and desktops.
Regarding compute resource allocation, the simple scheduler shares the CPU in a round-robin fashion. An I/O event arriving at a high-"priority" VCPU gives it a little "bonus"; after using that up, the VCPU falls back into the round-robin scheduling ring. In this way it maintains some degree of fairness even in a virtualized client environment. E.g., in our test scenario we found that the file copy in the auxiliary domain proceeds well (although a little slowly) while the primary domain plays a DivX video, which generates a high volume of I/O events.
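The accounting side, sketched equally minimally (and again with hypothetical names): the bonus is drained first, after which the VCPU competes in the plain round-robin ring like any other, which is what preserves the long-term fairness mentioned above.

    /* Illustrative only: drain the I/O bonus first; once it is gone the
     * VCPU is rotated like every other round-robin VCPU. */
    static int sdp_time_slice_expired(struct sdp_vcpu *v)
    {
        if ( v->bonus > 0 )
        {
            v->bonus--;   /* still running on its I/O bonus            */
            return 0;     /* keep running for now                      */
        }
        return 1;         /* bonus used up: rotate in round-robin ring */
    }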
 
From this experience, I think I/O responsiveness is an important parameter to add in the development of the new scheduler: different platforms have their own performance metrics, and the user can adjust the I/O responsiveness parameter of the domains to make them work well.
 
Moreover, I think some characteristics of the Credit scheduler do not fit virtualized clients/desktops well (for further discussion if possible). For the virtualized client/desktop scenario, the worst aspect of Credit is its small state space for marking VCPUs (i.e., BOOST, UNDER and OVER). This makes it very inconvenient, at the least, to differentiate the VCPUs of different domains running different kinds of tasks, although the small state space does work well for consolidated servers.

The second inconvenient aspect of the Credit scheduler is the way it "boosts" VCPUs. In the original Credit, a woken VCPU must still have enough credits (the UNDER state) to be promoted to the BOOST state. However, a domain (VCPU) may have used up its credit and, at the same time, still have a critical task to run. At that moment fairness is a secondary consideration and should be restored in later phases. BCredit changes this somewhat, although it may, unfortunately, give fairness too little consideration.
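For reference, the credit1 behaviour I am referring to boils down to a check of roughly this shape in the wake-up path (a simplified paraphrase of csched_vcpu_wake() in xen/common/sched_credit.c, wrapped in an illustrative helper; not the verbatim source):

    /* Simplified paraphrase of the credit1 wake-up boost discussed above. */
    static void boost_on_wake(struct csched_vcpu *svc)
    {
        if ( svc->pri == CSCHED_PRI_TS_UNDER )   /* still has credit left */
            svc->pri = CSCHED_PRI_TS_BOOST;      /* promote to BOOST      */
        /* A VCPU that has exhausted its credit is at CSCHED_PRI_TS_OVER
         * and is never boosted, even if it wakes for latency-critical
         * I/O -- exactly the limitation described above. */
    }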
 
 
Thanks,
 
 
2009-04-10

Zhiyuan Shao

From: George Dunlap
Sent: 2009-04-09 23:59:18
To: xen-devel@xxxxxxxxxxxxxxxxxxx
Cc:
Subject: [Xen-devel] [RFC] Scheduler work, part 1: High-level goals and interface.
In the interest of openness (as well as in the interest of taking
advantage of all the smart people out there), I'm posting a very early
design prototype of the credit2 scheduler.  We've had a lot of
contributors to the scheduler recently, so I hope that those with
interest and knowledge will take a look and let me know what they
think at a high level.
 
This first e-mail will discuss the overall goals: the target "sweet
spot" use cases to consider, measurable goals for the scheduler, and
the target interface / features.  This is for general comment.
 
The subsequent e-mail(s?) will include some specific algorithms and
changes currently in consideration, as well as some bleeding-edge
patches.  This will be for people who have a specific interest in the
details of the scheduling algorithms.
 
Please feel free to comment / discuss / suggest improvements.
 
1. Design targets
 
We have three general use cases in mind: Server consolidation, virtual
desktop providers, and clients (e.g. XenClient).
 
For servers, our target "sweet spot" for which we will optimize is a
system with 2 sockets, 4 cores each socket, and SMT (16 logical cpus).
Ideal performance is expected to be reached at about 80% total system
cpu utilization; but the system should function reasonably well up to
a utilization of 800% (e.g., a load of 8).
 
For virtual desktop systems, we will have a large number of
interactive VMs with a lot of shared memory.  Most of these will be
single-vcpu, or at most 2 vcpus.
 
For client systems, we expect to have 3-4 VMs (including dom0).
Systems will probably have a single socket with 2 cores and SMT (4
logical cpus).  Many VMs will be using PCI pass-through to access
network, video, and audio cards.  They'll also be running video and
audio workloads, which are extremely latency-sensitive.
 
2. Design goals
 
For each of the target systems and workloads above, we have some
high-level goals for the scheduler:
 
* Fairness.  In this context, we define "fairness" as the ability to
get cpu time proportional to weight.
 
We want to try to make this true even for latency-sensitive workloads
such as networking, where long scheduling latency can reduce the
throughput, and thus the total amount of time the VM can effectively
use.
 
* Good scheduling for latency-sensitive workloads.
 
To the degree we are able, we want this to be true even for those which
use a significant amount of cpu power: That is, my audio shouldn't
break up if I start a cpu hog process in the VM playing the audio.
 
* HT-aware.
 
Running on a logical processor with an idle peer thread is not the
same as running on a logical processor with a busy peer thread.  The
scheduler needs to take this into account when deciding "fairness".
 
* Power-aware.
 
Using as many sockets / cores as possible can increase the total cache
size available to VMs, and thus (in the absence of inter-VM sharing)
increase total computing power; but keeping multiple sockets and
cores powered up also increases the electrical power used by the
system.  We want a configurable way to balance between maximizing
processing power vs minimizing electrical power.
 
3. Target interface:
 
The target interface will be similar to credit1:
 
* The basic unit is the VM "weight".  When competing for cpu
resources, VMs will get a share of the resources proportional to their
weight.  (e.g., two cpu-hog workloads with weights of 256 and 512 will
get 33% and 67% of the cpu, respectively).
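(A purely illustrative helper, not part of the proposed interface, just to spell out the arithmetic: a VM's share is its weight divided by the sum of the weights of all competing VMs.)

    /* Illustration only: proportional share from weights, truncated to an
     * integer percentage.  256 / (256 + 512) = 1/3 (~33%);
     * 512 / (256 + 512) = 2/3 (~67%, returned as 66 after truncation). */
    static unsigned int share_pct(unsigned int weight, unsigned int total_weight)
    {
        return weight * 100 / total_weight;
    }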
 
* Additionally, we will be introducing a "reservation" or "floor".
  (I'm open to name changes on this one.)  This will be a minimum
  amount of cpu time that a VM can get if it wants it.
 
For example, one could give dom0 a "reservation" of 50%, but leave the
weight at 256.  No matter how many other VMs run with a weight of 256,
dom0 will be guaranteed to get 50% of one cpu if it wants it.
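(One plausible reading, purely illustrative since the exact composition of floor and weight is not pinned down here: a runnable VM receives whichever is larger, its weight-proportional share or its reservation.)

    /* Illustration only: the max(weight share, floor) composition below is
     * an assumption, not something this RFC specifies. */
    static unsigned int effective_share_pct(unsigned int weight_share_pct,
                                            unsigned int reservation_pct)
    {
        return (weight_share_pct > reservation_pct) ? weight_share_pct
                                                    : reservation_pct;
    }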
 
* The "cap" functionality of credit1 will be retained.
 
This is a maximum amount of cpu time that a VM can get: i.e., a VM
with a cap of 50% will only get half of one cpu, even if the rest of
the system is completely idle.
 
* We will also have an interface to the cpu-vs-electrical-power tradeoff.
 
This is yet to be defined.  At the hypervisor level, it will probably
be a number representing the "badness" of powering up extra cpus /
cores.  At the tools level, there will probably be the option of
either specifying the number, or of using one of 2/3 pre-defined
values {power, balance, green/battery}.
 

Attachment: sdp-ctl.tar.gz
Description: Binary data

Attachment: sdp_09.1.8.patch
Description: Binary data

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel

 

