On Friday, 11 May 2007 at 07:55, Keir Fraser wrote:
> On 11/5/07 00:00, "Daniel P. Berrange" <berrange@xxxxxxxxxx> wrote:
>
> > It would be interesting to know what aspect of the xenstore interaction
> > is responsible for the slowdown. In particular, whether it is a
> > fundamental architectural constraint, or whether it is merely due to the
> > poor performance of the current impl. We already know from previous
> > tests that the XenD impl of transactions absolutely kills performance of
> > various XenD operations due to the vast amount of unnecessary I/O it does.
> >
> > If fixing the xenstored transaction code were to help suspend performance
> > too, it might be a better option than rewriting all the code that touches
> > xenstore. A quick test of putting /var/lib/xenstored on a ramdisk would
> > be a way of testing whether it's the I/O that is hurting suspend time.
>
> Yes. We could go either way -- it wouldn't be too bad to add support via
> dynamic VIRQ_DOM_EXC for example, or add other things to get xenstore off
> the critical path for save/restore. But if the problem is that xenstored
> sucks, it is probably worth investing a bit of time to tackle the problem
> directly and see where the time is going. We could end up with optimisations
> which have benefits beyond just save/restore.

I'm sure xenstore could be made significantly faster, but barring a
redesign, perhaps it's better to reserve it for low-frequency transactions
with fairly loose latency expectations. Routing the suspend notification
through xenstore to xend and finally back to xc_save (as the current code
does) seems convoluted, and is bound to create opportunities for bad
scheduling compared with notifying xc_save directly.
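
To make that concrete, the receive side could look roughly like the sketch
below: xc_save binds an interdomain event channel to a port the guest has
advertised and simply blocks until the guest signals that it has suspended.
This is only an illustration against the Xen 3.x-era libxc event-channel
helpers (xc_evtchn_open, xc_evtchn_bind_interdomain, xc_evtchn_pending);
exact handle types and signatures vary between versions, the port-discovery
step is assumed rather than shown, and it is not the attached patch itself.

/*
 * Sketch only: block in xc_save on an interdomain event channel for the
 * "domain has suspended" signal, instead of waiting for the notification
 * to travel xenstore -> xend -> xc_save.  The suspend port is assumed to
 * have been advertised by the guest beforehand (e.g. one xenstore read at
 * setup time); that discovery step is hypothetical and omitted here.
 */
#include <xenctrl.h>

static int await_suspended(int domid, evtchn_port_t remote_port)
{
    int xce, local, fired, rc = -1;

    xce = xc_evtchn_open();                 /* handle on /dev/xen/evtchn */
    if (xce < 0)
        return -1;

    local = xc_evtchn_bind_interdomain(xce, domid, remote_port);
    if (local < 0)
        goto out;

    /* Blocks until the guest kicks the channel -- no xend in the loop. */
    fired = xc_evtchn_pending(xce);
    if (fired == local) {
        xc_evtchn_unmask(xce, local);
        rc = 0;
    }

    xc_evtchn_unbind(xce, local);
out:
    xc_evtchn_close(xce);
    return rc;
}

The whole wait then amounts to a single blocking read in the same process
that runs the save loop, which is where the scheduling win comes from.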

In case there's interest, I'll attach the two patches I'm using to speed
up checkpointing (and to reduce live migration downtime). As I mentioned
earlier, the first patch should be semantically equivalent to the existing
code, and cuts downtime to about 30-35 ms. The second notifies xend
asynchronously that the domain has been suspended, so that the final round
of memory copying can begin before stage 2 of device migration. This is a
semantic change, but I can't think of a concrete drawback. It's a little
rough-and-ready -- suggestions for improvement are welcome.
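
For the trigger direction (roughly the territory of the first patch), the
idea is to kick a pre-arranged suspend event channel instead of writing
the control/shutdown node in xenstore and waiting for the guest's watch to
fire. Again, this is just an illustrative sketch using the same
assumed-version libxc calls, with the port exchange taken as given; the
attached patch is the authoritative version.

/*
 * Sketch only: ask the guest to suspend by notifying a pre-arranged event
 * channel.  'remote_port' is assumed to have been published by the guest
 * at boot (hypothetical setup step, not shown).
 */
#include <xenctrl.h>

static int trigger_suspend(int domid, evtchn_port_t remote_port)
{
    int xce, local, rc = -1;

    xce = xc_evtchn_open();
    if (xce < 0)
        return -1;

    local = xc_evtchn_bind_interdomain(xce, domid, remote_port);
    if (local < 0)
        goto out;

    /* One notify replaces the xenstore write plus watch round trip. */
    rc = xc_evtchn_notify(xce, local);

    xc_evtchn_unbind(xce, local);
out:
    xc_evtchn_close(xce);
    return rc;
}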

Here are some stats on final-round time over 100 runs (all times in ms):

xen 3.1:
    avg: 93.40, min: 72.59, max: 432.46, median: 85.10
patch 1 (trigger suspend via event channel):
    avg: 43.69, min: 35.21, max: 409.50, median: 37.21
patch 1, /var/lib/xenstored on tmpfs:
    avg: 33.88, min: 27.01, max: 369.21, median: 28.34
patch 2 (receive suspended notification via event channel):
    avg: 4.95, min: 3.46, max: 14.73, median: 4.63

[Attachment: suspend-evtchn.patch]
[Attachment: subscribe-suspend.patch]