

[Xen-devel] Re: The caculation of the credit in credit_scheduler

To: "Zhang, Xiantao" <xiantao.zhang@xxxxxxxxx>
Subject: [Xen-devel] Re: The caculation of the credit in credit_scheduler
From: George Dunlap <George.Dunlap@xxxxxxxxxxxxx>
Date: Tue, 09 Nov 2010 14:16:33 +0000
Cc: "Jiang, Yunhong" <yunhong.jiang@xxxxxxxxx>, "Dong, Eddie" <eddie.dong@xxxxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxx>
In-reply-to: <BC00F5384FCFC9499AF06F92E8B78A9E1C06EC6BF5@xxxxxxxxxxxxxxxxxxxxxxxxxxxx>
References: <789F9655DD1B8F43B48D77C5D30659732FD0A5C9@xxxxxxxxxxxxxxxxxxxxxxxxxxxx> <BC00F5384FCFC9499AF06F92E8B78A9E1C06EC6BF5@xxxxxxxxxxxxxxxxxxxxxxxxxxxx>
Sender: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv: Gecko/20101027 Thunderbird/3.0.10

Thanks for your comments. All of the things you pointed out are things I'm trying to address in credit2. In fact, a large share of them can be attributed to the fact that credit1 divides tasks into 3 priorities (OVER, UNDER, and BOOST) and schedules tasks "round-robin" within each priority. Round-robin is known to discriminate against tasks which yield (such as tasks that do frequent I/O) in favor of tasks that don't yield (such as cpu "burners").
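To make the round-robin effect concrete, here is a minimal sketch, not taken from the Xen source. It assumes two tasks in the same priority: a "burner" that always uses its full 30ms timeslice, and an I/O task that runs for a hypothetical 1ms, yields, and is immediately runnable again. The function and task names are made up for illustration.

```python
from collections import deque

TIMESLICE_MS = 30  # credit1 timeslice mentioned later in this thread

def round_robin_share(run_before_yield_ms, total_ms):
    """Return each task's CPU share under plain round-robin.

    run_before_yield_ms maps a task name to how long it runs before it
    yields voluntarily; a task is preempted after TIMESLICE_MS at the latest.
    """
    queue = deque(run_before_yield_ms)
    used = {name: 0 for name in run_before_yield_ms}
    clock = 0
    while clock < total_ms:
        name = queue.popleft()
        ran = min(run_before_yield_ms[name], TIMESLICE_MS, total_ms - clock)
        used[name] += ran
        clock += ran
        queue.append(name)  # a yielding task rejoins at the tail
    return {name: used[name] / total_ms for name in used}

# Hypothetical workload: the I/O task only ever runs 1ms at a time.
shares = round_robin_share({"burner": TIMESLICE_MS, "io": 1}, total_ms=3100)
print(shares)
```

Under these assumptions the burner ends up with roughly 97% of the CPU and the I/O task with roughly 3%, even though both are always runnable.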

In credit2, I hope to address these issues in a couple of ways:
* Always sort the runqueue by order of credit. This addresses issues in all of 1, 2, and 3.
* When a VM wakes up, update the credit of all the running VMs to see if any of them should be preempted (addressing #3).
* When selecting how long to run, I have a mechanism to look at the next VM in the runqueue, and calculate how long it would take for the current VM's credit to equal the next VM's credit. I.e., if the one chosen to run has 10ms of credit, and the next one on the runqueue has 7ms of credit, set the schedule time to 3ms. This is limited by a "minimum schedule time" (currently 500us) and a "maximum schedule time" (currently 10ms). This could probably use some more tweaking, but it seems to work pretty well.
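The timeslice rule in the last bullet can be sketched roughly as follows. This is an illustration only, not the credit2 code: it assumes credit is expressed in microseconds of CPU time and is burned 1:1 while running, and the function and constant names are made up.

```python
MIN_SLICE_US = 500      # "minimum schedule time" quoted above
MAX_SLICE_US = 10_000   # "maximum schedule time" quoted above

def slice_us(current_credit_us, next_credit_us):
    """Run until the current VM's credit would equal the next VM's credit,
    clamped to the minimum and maximum schedule times."""
    delta = current_credit_us - next_credit_us
    return max(MIN_SLICE_US, min(MAX_SLICE_US, delta))

# Example from the mail: 10ms of credit vs. 7ms of credit gives a 3ms slice.
print(slice_us(10_000, 7_000))  # 3000
```

With nearly equal credits the minimum applies (e.g. 10ms vs. 9.9ms gives 500us), and with a very large gap the maximum caps the slice at 10ms.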

It's not clear to me how to address a lot of the issues you bring up without doing a big redesign -- which is what I'm already working on.

If you're interested in helping test / develop credit2, let me know, I'd love some help. :-)


On 05/11/10 07:26, Zhang, Xiantao wrote:
Maybe idlers shouldn't produce credits at the calculation points.  In an 
earlier experiment, unfairness was reduced when idlers did not produce credit.

Apart from this issue, I also have some findings to share with you guys to 
get more input about the credit scheduler.

1. Interrupt delivery for assigned devices is done in a tasklet, and the 
tasklet runs in the idle vcpu's context, but the scheduler's behavior when 
scheduling the idle vcpu looks very strange. Ideally, when we switch to the 
idle vcpu to execute a tasklet, the previous vcpu should be switched back in 
after the tasklet is done, but the current policy is to choose another vcpu 
from the runq.  That is to say, when one interrupt happens on a CPU, that CPU 
may do a real task switch. This may not be acceptable when the interrupt 
frequency is high, and according to our experiments it also introduces some 
performance bugs.  Even if we can switch back to the previous vcpu after 
executing the tasklet, how to determine its timeslice for its next run is also 
a key issue, and this is not addressed. If we still give it 30ms when it 
resumes, it may trigger some fairness issues, I think.

2.  Another issue was found during our experiments, and it is a very 
interesting one (likely to be a bug).  In the experiment, we first pinned three 
guests (two CPU-intensive and one IO-intensive) to two logical processors. 
Each guest is configured with two virtual CPUs, and the CPU utilization share 
is ~90% for each CPU-intensive guest and ~20% for the IO-intensive guest.  But 
a magic thing happens after we introduce an additional idle guest which doesn't 
do any real workload and just idles.  The CPU utilization share changes to ~50% 
for each CPU-intensive guest and ~100% for the IO-intensive guest.  After 
analyzing the scheduling data, we found the change comes from virtual timer 
interrupt delivery to the idle guest. Although the guest is idle, there are 
still 1000 timer interrupts for each vcpu per second. The current credit 
scheduler will boost the idle vcpu from the blocked state and trigger 1000 
schedule events on the target physical processor, and the IO-intensive guest 
may benefit from the frequent schedule events and get a larger CPU utilization 
share.  The even more magic thing is that after 'xm pause' and 'xm unpause' of 
the idle guest, each of the three guests is allocated ~66% CPU share.
This finding tells us some facts:  (1) The current credit scheduler is not fair 
to IO-intensive guests. (2) IO-intensive guests have the ability to acquire a 
fair CPU share when competing with CPU-intensive guests. (3) The current 
timeslice (30ms) is meaningless, since the average timeslice is far smaller 
than 1ms under real workloads (this may bring performance issues). (4) The 
boost mechanism is too aggressive, and an idle guest shouldn't be boosted when 
it is woken from the halt state.  (5) There is no policy in credit to determine 
how long the boosted vcpu can run, or how to handle the preempted vcpu.

3.  Credit is not really used for determining key scheduling policies. For 
example, when choosing a candidate task, credit is not well used to evaluate 
the task's priority, and this may not be fair to IO-intensive guests. 
Additionally, a task's priority is not calculated in time; it is only updated 
every 30ms. In this case, even if a task's credit is negative, its priority may 
still be TS_UNDER or TS_BOOST due to the delayed update, so maybe when the vcpu 
is scheduled out, its priority should be updated after the credit change.  In 
addition, when a boosted vCPU is scheduled out, its priority is always set to 
TS_UNDER, and credit is not considered either. If the credit becomes negative, 
it may be better to set the priority to TS_OVER?
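The suggestion in point 3 (recompute the priority from the credit when a vcpu is scheduled out, instead of waiting for the 30ms update or always falling back to TS_UNDER) could be sketched like this. It is only an illustration of the proposal, not code from the credit scheduler; the priority names come from this mail, while the function name and the string representation are made up.

```python
TS_BOOST, TS_UNDER, TS_OVER = "TS_BOOST", "TS_UNDER", "TS_OVER"

def priority_on_deschedule(credit):
    """Pick the priority of a vcpu at the moment it is scheduled out.

    A boosted vcpu loses its boost here; whether it lands in TS_UNDER or
    TS_OVER depends on its credit after the time it just ran was charged.
    """
    return TS_OVER if credit < 0 else TS_UNDER

# A boosted vcpu that overdrew its credit is demoted to TS_OVER right away.
print(priority_on_deschedule(-40))  # TS_OVER
print(priority_on_deschedule(120))  # TS_UNDER
```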

Any comments ?


Jiang, Yunhong wrote:
While reading the credit scheduler code and doing experiments, I noticed
one interesting thing in the current credit scheduler. For example, in
the following situation:

A powerful system with 64 CPUs.

Xen Environment:
Dom0 with 8 vCPU bound to CPU (0, 16~24)

3 HVM domains, each with 2 vCPUs, all bound as vcpu0->pcpu1,
vcpu1->pcpu2. Among them, 2 are CPU intensive while 1 is I/O intensive.

The result shows that the I/O intensive domain will occupy more than
100% cpu, while the two CPU intensive domains each occupy 50%.

IMHO it should be 66% for every domain.

The reason is how the credit is calculated. Although the 3 HVM domains
are pinned to 2 PCPUs and share those 2 CPUs, they will each get 2 * 300
credit at credit accounting. That means the I/O intensive HVM domain
will never be under credit, thus it will preempt the CPU intensive
domains whenever it is boosted (i.e. after I/O access to QEMU); it is set
to TS_UNDER only at tick time, and then boosted again.

I'm not sure if this is a meaningful usage model that needs a fix, but I
think it is helpful to show this to the list.

I didn't try credit2, so I have no idea if this will happen with credit2 too.
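The arithmetic behind this can be written out as a small sketch. The 2 * 300 grant per domain is from the description above; the assumption that one PCPU can burn at most 300 credits per accounting period is mine, and the names are hypothetical.

```python
CREDITS_PER_ACCT = 300  # per-vCPU grant quoted above ("2 * 300")

def credit_balance(vcpus_per_domain, shared_pcpus):
    """Compare credit granted per accounting period with what can be burned."""
    granted = sum(v * CREDITS_PER_ACCT for v in vcpus_per_domain)
    # Assumption: a PCPU burns at most CREDITS_PER_ACCT per period.
    burnable = shared_pcpus * CREDITS_PER_ACCT
    fair_share_pct = 100 * shared_pcpus / len(vcpus_per_domain)
    return granted, burnable, fair_share_pct

# Three 2-vCPU domains pinned to the same two PCPUs.
granted, burnable, fair = credit_balance([2, 2, 2], shared_pcpus=2)
print(granted, burnable, round(fair, 1))  # 1800 600 66.7
```

Under that assumption the three domains are granted three times as much credit as the two PCPUs can consume, so a domain that uses less than its grant never runs out of credit, while the fair split would be about 66% of one CPU per domain.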

