On Mon, Nov 30, 2009 at 10:02 PM, Dave Scott <Dave.Scott@xxxxxxxxxxxxx> wrote:
The observation that speeding up xenstore reduces the frequency of crashes is interesting. Perhaps the failure happens when a concurrent transaction causes an abort? Maybe you could provoke it by running 'xm create' in a loop while also writing somewhere in xenstore? IIRC (although I could be mistaken), the standard C xenstore considers all concurrent transactions to be conflicting, even if they operate on disjoint parts of the tree, so provoking an abort would be easy.
Hey Dave,
Thanks for responding! This actually sounds quite plausible.
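For what it's worth, here is a rough, untested sketch of how that experiment could be scripted. It keeps one thread busy with 'xenstore-write' while the main thread repeatedly runs 'xm create' and 'xm destroy', and it records which create attempts fail. The scratch key, the config file name and the domain name are placeholders I made up, and the command runner is passed in as a parameter only so the loop can be dry-run without a real Xen host.

```python
import subprocess
import threading

# Hypothetical scratch key; any path that nothing else depends on would do.
SCRATCH_KEY = "/tool/abort-test/counter"


def write_cmd(i):
    """argv for one xenstore write of the counter value i."""
    return ["xenstore-write", SCRATCH_KEY, str(i)]


def hammer_xenstore(stop, run):
    """Write to the scratch key as fast as possible until 'stop' is set."""
    i = 0
    while not stop.is_set():
        run(write_cmd(i))
        i += 1


def provoke(config, name, attempts, run=subprocess.call):
    """Create and destroy a guest 'attempts' times while xenstore is busy.

    Returns the indexes of the attempts where 'xm create' exited non-zero.
    """
    stop = threading.Event()
    writer = threading.Thread(target=hammer_xenstore, args=(stop, run))
    writer.start()
    failures = []
    try:
        for n in range(attempts):
            if run(["xm", "create", config]) != 0:
                failures.append(n)
                continue  # nothing to tear down if the create failed
            run(["xm", "destroy", name])
    finally:
        stop.set()
        writer.join()
    return failures
```

Something like provoke("guest.cfg", "guest", 50) would then be the actual run; a non-empty result (or a dead xend) while the writer is active would support the abort theory.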
Caveats:
1. We don't have an 'xm'... instead there's a CLI called 'xe' which can do almost everything the API can do but the syntax is different to 'xm'. You'd either have to port your scripts ('xe vm-start' rather than 'xm create'?) or write some kind of wrapper.
That shouldn't be too difficult :)
We rewrote xenstored because we used xenstore to report periodic guest performance stats to dom0. By doing this we accidentally created a horrible scalability bottleneck where, somewhere around 30 or 40 guests, every transaction aborted and the system livelocked. The new xenstored is smart enough to realize that these separate transactions are not conflicting and can be committed together.
We also have a couple of scripts that periodically collect statistics from xenstore. We haven't seen any livelocks, but perhaps the xend crashes are caused by the same limitation. The xend crashes don't seem to happen until we actually have some (20+?) domUs running.
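To check that I understand the difference between the two behaviours, here is a toy model. It is not how either xenstored is actually implemented, just my reading of the descriptions above: a "coarse" store aborts a transaction if anything at all was committed since it started, while a "path-aware" store aborts only if a commit in between touched the same path or an ancestor or descendant of it. The class names and the per-guest "stats" keys are made up for illustration.

```python
class Txn:
    """A transaction remembers when it started and which paths it touched."""

    def __init__(self, start_gen):
        self.start_gen = start_gen
        self.touched = set()


def overlaps(paths_a, paths_b):
    """True if any path in one set equals, contains, or lies under one in the other."""
    return any(
        a == b or a.startswith(b + "/") or b.startswith(a + "/")
        for a in paths_a
        for b in paths_b
    )


class ToyStore:
    def __init__(self, path_aware):
        self.path_aware = path_aware
        self.generation = 0
        self.log = []  # (generation, touched paths) for every successful commit

    def begin(self):
        return Txn(self.generation)

    def commit(self, txn):
        """Return True if the commit succeeds, False if it must be retried."""
        later = [paths for gen, paths in self.log if gen > txn.start_gen]
        if self.path_aware:
            conflict = any(overlaps(txn.touched, paths) for paths in later)
        else:
            conflict = bool(later)  # any commit in between counts as a conflict
        if conflict:
            return False
        self.generation += 1
        self.log.append((self.generation, set(txn.touched)))
        return True


def two_guests_report(store):
    """Two guests update their own (hypothetical) stats keys concurrently."""
    t1, t2 = store.begin(), store.begin()
    t1.touched.add("/local/domain/1/stats")
    t2.touched.add("/local/domain/2/stats")
    return store.commit(t1), store.commit(t2)
```

In this model the coarse store aborts the second guest's commit even though the two guests never touch the same key, and the path-aware store lets both through. With enough guests and scripts all writing periodically, I can see how the coarse policy would leave nearly every transaction retrying.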