[Xen-devel] [ED] New credit, experimental, v1

To: xen-devel@xxxxxxxxxxxxxxxxxxx
Subject: [Xen-devel] [ED] New credit, experimental, v1
From: George Dunlap <dunlapg@xxxxxxxxx>
Date: Thu, 18 Jun 2009 23:01:21 +0100
A couple of things are clear from the previous discussion, and from
other work on the scheduler: the 10ms granularity of credit charging
is much too rough for latency-sensitive workloads.  The next version
of the scheduler will be keeping track of credits more precisely.

That alone won't fix it, however.  The credit scheduler is still
designed to cross between UNDER and OVER on a regular basis, and to
discriminate against VMs which go into OVER.

For a while I worked on different algorithms that were similar in
nature.  However, I kept running up against some fundamental limitations.

Credit1's division of VMs into "active" and "inactive" works well if
VMs fall cleanly into those two camps and are not interactive: active
VMs burn all the credits they get, and inactive VMs get no credits but
hardly need any cpu.  But a number of VMs fall somewhere in the middle.

(Starting now is where I'm a lot less sure of what I'm saying, and
expect to hear that I'm mistaken, or I've missed something.)

If VMs are allowed to accumulate credit, then some VMs will
continually gain credit.  Credit1 solved this by capping credit, and
actually discarding it when a VM reached the cap.  The problems with
this have been discussed before.  If you don't discard the credit,
however, then the "stable state" of the system is for all VMs to have
max credit, and cpu scheduling effectively falls back to round-robin.
(Briefly, if all VMs have MAX_CREDITS, and we assign up to 30ms of
credit every 30ms, allowing extra credit from one VM to flow into
another VM, then after burning some combination of credits, all VMs
will be filled up to their max.)

I experimented for a while with the idea of discarding credits and
assigning new credits from scratch every 100ms.  The problem with this
was, again, how to assign the credits not burned by some VMs "fairly"
(i.e., by weight) to other VMs.  The fundamental limitation is that we
can't know ahead of time how much cpu a VM will want to use.  Even
past performance isn't a good measure: just because a VM *didn't* run
doesn't mean it *wouldn't* have run given the chance.

* Dynamic retroactive earning

The solution I'm presenting now is actually very simple
computationally.  However, it has some very subtle side-effects.

The algorithm is this (a rough C sketch follows the list):

* Set everyone's credits according to their weights, scaled so that
the total amount of assigned credit can be burned up in 50-100ms.

* Then, burn credits at a fixed rate, running the VM with the
highest amount of credits first.

* If a vcpu with zero credits comes to the front of the queue (i.e.,
if there are no runnable VMs with credits left), then assign credits
again; but do not *add* credits, merely *set* them in proportion to
weight; any unused credit is effectively discarded.
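
To make the mechanics concrete, here is a minimal, illustrative sketch
in C of the rule described above.  This is my own simplified
restatement, not code from the attached patches; all the names (struct
vcpu, CREDIT_RESET_US, pick_next, and so on) are made up, and the real
thing obviously has to deal with runqueues, locking and multiple cpus.

/* Illustrative sketch only -- hypothetical names, not the attached patch. */
#include <stddef.h>

#define CREDIT_RESET_US 50000L  /* credit per earning period: ~50ms total */

struct vcpu {
    int  weight;    /* administrator-assigned weight */
    long credit;    /* remaining credit, in microseconds of cpu time */
    int  runnable;
};

/* *Set* (do not add) every vcpu's credit in proportion to its weight,
 * scaled so the credits sum to CREDIT_RESET_US.  Whatever was left
 * over from the previous earning period is overwritten, i.e. discarded.
 * Assumes at least one vcpu has a non-zero weight. */
static void reset_credit(struct vcpu *v, size_t n)
{
    long total_weight = 0;
    size_t i;

    for (i = 0; i < n; i++)
        total_weight += v[i].weight;
    for (i = 0; i < n; i++)
        v[i].credit = CREDIT_RESET_US * v[i].weight / total_weight;
}

/* Run the runnable vcpu with the most credit.  If the best candidate
 * has no credit left, no runnable vcpu has credit, so start a new
 * earning period and pick again. */
static struct vcpu *pick_next(struct vcpu *v, size_t n)
{
    struct vcpu *best = NULL;
    size_t i;

    for (i = 0; i < n; i++)
        if (v[i].runnable && (best == NULL || v[i].credit > best->credit))
            best = &v[i];

    if (best != NULL && best->credit <= 0) {
        reset_credit(v, n);
        best = NULL;
        for (i = 0; i < n; i++)
            if (v[i].runnable && (best == NULL || v[i].credit > best->credit))
                best = &v[i];
    }
    return best;
}

/* Burn credit at a fixed rate: one microsecond of cpu time costs one
 * unit of credit, charged for however long the vcpu actually ran. */
static void burn_credit(struct vcpu *v, long ran_for_us)
{
    v->credit -= ran_for_us;
}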

Here are some of the attributes of this algorithm:

* The credit assignment will effectively divide time into "earning
periods".

* The amount of credit added into the system each earning period is
equal to the amount burned.  So overall system credit and overall
system burn will be the same.

* Assigning a fixed amount of credit but varying the time over which
it is burned effectively changes the rate of earning.  And because the
event that varies the rate happens at the end of the period rather than
the beginning, the time at which credits are next assigned retroactively
sets the rate of earning for the earning period just finished.
(Think carefully about this for a bit.)
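
(A concrete example, with numbers of my own rather than anything from
the patch: two equal-weight VMs sharing one cpu each get 25ms out of a
50ms pool.  If both burn flat out, the reset happens at the 50ms mark,
so each effectively earned at the rate of half a cpu.  If one of them
sleeps instead, the busy VM hits zero at about the 25ms mark and
triggers the reset early; it effectively earned at the rate of a full
cpu for that period, and the sleeper's unused credit is discarded at
the reset.  The moment of the reset -- which we only discover at the
end -- is what set the rate.)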

So this addresses the two problems we had before: adding too much
credit into the system (because earn == burn, always), and not knowing
in advance how much credit a VM would want to use (because we set the
effective rate of earning at the end, not at the beginning).

* Credit is assigned only when we first have a runnable vcpu with no
credit at the front of the queue.  This scales everyone's earning rate,
discards the credits of those who don't use all of theirs, and makes
sure that those who do use all of their credits get time proportional
to their weight.

A couple of conditions are required to make this algorithm work
properly with a mix of burn and wake-and-sleep workloads.  First, the
"time slice" given to a "burn" VM must be a fraction (1/2 at most,
probably preferably less) of the amount of credit that VM has.
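(For example, with numbers of my own: a burn VM holding 20ms of credit
should get slices of at most 10ms -- presumably so that the scheduler
re-evaluates, and can switch to a higher-credit, newly-woken VM, at
least once before the burner's credit is gone.)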

Anyway, take a look, give it a spin, and let me know what you think.
I'm still working on the wake latency properties.  But scp to an HVM
domU, in competition with spinners, works pretty well for me, and
scales well with setting weight.  Test some other latency-sensitive
workloads and let me know how it fares.  I still have some
more things to tweak, but I wanted to get the basic idea out there so
people could discuss it / ask questions / give feedback / turn up
workloads or situations where it fails miserably.

I should make it clear that this is very much an early design
prototype.  It still needs work: there are some ugly hacks, scant
attention has been paid to efficient algorithms, and promised features
(such as cap and reservation) are missing.

 -George

Attachment: credit2-hypervisor.diff
Description: Text Data

Attachment: credit2-tools.diff
Description: Text Data

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel