
Re: [Xen-devel] Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions



> From: Tim Deegan [mailto:tim@xxxxxxx]
> Subject: Re: [Xen-devel] Proposed XENMEM_claim_pages hypercall: Analysis of 
> problem and alternate
> solutions

Hi Tim --

Thanks for the response.

> At 13:38 -0800 on 02 Jan (1357133898), Dan Magenheimer wrote:
> > > The discussion ought to be around the actual problem, which is (as far
> > > as I can see) that in a system where guests are ballooning without
> > > limits, VM creation failure can happen after a long delay.  In
> > > particular it is the delay that is the problem, rather than the failure.
> > > Some solutions that have been proposed so far:
> > >  - don't do that, it's silly (possibly true but not helpful);
> > >  - this reservation hypercall, to pull the failure forward;
> > >  - make allocation faster to avoid the delay (a good idea anyway,
> > >    but can it be made fast enough?);
> > >  - use max_pages or similar to stop other VMs using all of RAM.
> >
> > Good summary.  So, would you agree that the solution selection
> > comes down to: "Can max_pages or similar be used effectively to
> > stop other VMs using all of RAM? If so, who is implementing that?
> > Else the reservation hypercall is a good solution." ?
> 
> Not quite.  I think there are other viable options, and I don't
> particularly like the reservation hypercall.

Are you suggesting an alternative option other than the max_pages
toolstack-based proposal that Ian and I are discussing in a parallel
subthread?  Just checking, in case I am forgetting an alternative
that you (or someone else) proposed.

Are there reasons other than "incompleteness" (see below) that
you dislike the reservation hypercall?  To me, it seems fairly
elegant in that it uses the same locks for capacity-allocation
as for page allocation, thus guaranteeing no races can occur.
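
To illustrate what I mean, here is a rough sketch I'm making up for
this email (it is NOT the actual Xen implementation; the names, the
pthread mutex standing in for the heap lock, and the single global
claim counter are all simplifications):

/* Sketch: why doing the claim check under the same lock as the page
 * allocator leaves no window for races.  The "claimed but not yet
 * allocated" count and the free-page count are only ever examined
 * and updated together, under one lock. */
#include <stdbool.h>
#include <pthread.h>

static pthread_mutex_t heap_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long free_pages = 1000000;  /* host free memory, in pages */
static unsigned long outstanding_claims;    /* promised but not yet allocated */

/* Stake a claim for nr pages up front; fail immediately if the host
 * could not honor it.  (Real code would track claims per-domain.) */
bool claim_pages(unsigned long nr)
{
    bool ok = false;

    pthread_mutex_lock(&heap_lock);
    if (free_pages - outstanding_claims >= nr) {
        outstanding_claims += nr;
        ok = true;
    }
    pthread_mutex_unlock(&heap_lock);
    return ok;
}

/* Ordinary single-page allocation: a claim holder draws down its
 * claim; everyone else may only take genuinely unclaimed pages. */
bool alloc_page(bool caller_has_claim)
{
    bool ok = false;

    pthread_mutex_lock(&heap_lock);
    if (caller_has_claim && outstanding_claims > 0) {
        outstanding_claims--;
        free_pages--;
        ok = true;
    } else if (!caller_has_claim && free_pages > outstanding_claims) {
        free_pages--;
        ok = true;
    }
    pthread_mutex_unlock(&heap_lock);
    return ok;
}

The point is simply that, because both checks happen under the same
lock, a successful claim can never be silently invalidated by a
racing allocation.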

> I can still see something like max_pages working well enough.  AFAICS
> the main problem with that solution is something like this: because it
> limits the guests individually rather than collectively, it prevents
> memory transfers between VMs even if they wouldn't clash with the VM
> being built.

Indeed, you are commenting on one of the same differences
I observed today in the subthread with Ian, where I said
that the hypervisor-based solution is only "max-of-sums"-
constrained whereas the toolstack-based solution is
"sum-of-maxes"-constrained.  With tmem/selfballooning active,
what you call "memory transfers between VMs" can be happening
constantly.  (To clarify for others, it is not the contents
of the memory that is being transferred, just the capacity...
i.e. VM A frees a page and VM B allocates a page.)

So thanks for reinforcing this point as I think it is subtle
but important.
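
To make the distinction concrete, here is a tiny made-up example
(not Xen code; units are pages and all the numbers are invented):

#include <stdio.h>
#include <stdbool.h>

struct guest { unsigned long pages, max_pages; };

/* Toolstack-style ("sum-of-maxes") check: guest i may grow only up
 * to its own individual cap, so the host must be provisioned for the
 * sum of all the per-guest maxima. */
static bool ok_individually(const struct guest *g, int i, unsigned long nr)
{
    return g[i].pages + nr <= g[i].max_pages;
}

/* Hypervisor-style ("max-of-sums") check: any guest may grow as long
 * as the instantaneous sum of allocations fits on the host. */
static bool ok_collectively(const struct guest *g, int n, unsigned long nr,
                            unsigned long host_pages)
{
    unsigned long total = 0;

    for (int j = 0; j < n; j++)
        total += g[j].pages;
    return total + nr <= host_pages;
}

int main(void)
{
    /* Host with 100 pages.  A has (self)ballooned down to 10; B is
     * at its cap of 50 and wants 20 more. */
    struct guest g[] = { { 10, 50 }, { 50, 50 } };

    /* Collectively there is plenty of room (60 of 100 pages in use),
     * but B's individual cap says no -- the capacity A freed cannot
     * be transferred to B unless the toolstack raises B's cap. */
    printf("individually: %d  collectively: %d\n",
           ok_individually(g, 1, 20), ok_collectively(g, 2, 20, 100));
    return 0;
}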

> That could be worked around with an upcall to a toolstack
> agent that reshuffles things on a coarse granularity based on need.  I
> agree that's slower than having the hypervisor make the decisions but
> I'm not convinced it'd be unmanageable.

"Based on need" begs a number of questions, starting with how
"need" is defined and how conflicting needs are resolved.
Tmem balances need as a self-adapting system. For your upcalls,
you'd have to convince me that, even if "need" could be communicated
to an guest-external entity (i.e. a toolstack), that the entity
would/could have any data to inform a policy to intelligently resolve
conflicts.  I also don't see how it could be done without either
significant hypervisor or guest-kernel changes.
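
Just so we are talking about the same thing, I imagine the agent you
describe looking something like the loop below (entirely
hypothetical; none of these helpers exist, and everything hard hides
inside get_need(), which is exactly the policy I don't know how to
write):

/* Hypothetical coarse-grained toolstack "memory reshuffler" driven
 * by guest upcalls.  The helpers are invented for illustration:
 * get_need() would be some channel by which a guest reports its
 * memory need, and set_max() would be a setmaxmem-style toolstack
 * operation. */
#include <stdint.h>

#define NR_DOMS 8

extern uint64_t get_need(int domid);            /* hypothetical */
extern void set_max(int domid, uint64_t pages); /* hypothetical */

void reshuffle(uint64_t host_pages)
{
    uint64_t need[NR_DOMS], total_need = 0;

    for (int d = 0; d < NR_DOMS; d++) {
        need[d] = get_need(d);
        total_need += need[d];
    }

    for (int d = 0; d < NR_DOMS; d++) {
        uint64_t grant = need[d];

        /* When needs exceed the host, scale everyone back
         * proportionally -- one of many possible, and debatable,
         * conflict-resolution policies. */
        if (total_need > host_pages)
            grant = need[d] * host_pages / total_need;
        set_max(d, grant);
    }
}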

> Or, how about actually moving towards a memory scheduler like you
> suggested -- for example by integrating memory allocation more tightly
> with tmem.  There could be an xsm-style hook in the allocator for
> tmem-enabled domains.  That way tmem would have complete control over
> all memory allocations for the guests under its control, and it could
> implement a shared upper limit.  Potentially in future the tmem
> interface could be extended to allow it to force guests to give back
> more kinds of memory, so that it could try to enforce fairness (e.g. if
> two VMs are busy, why should the one that spiked first get to keep all
> the RAM?) or other nice scheduler-like properties.

Tmem (plus selfballooning), unchanged, already does some of this.
While I would be interested in discussing better solutions, the
now four-year odyssey of pushing what I thought were relatively
simple changes upstream into Linux has left a rather sour taste
in my mouth, so rather than consider any solution that requires
more guest kernel changes, I'd first prefer to ensure that you
thoroughly understand what tmem already does, and how and why.
Would you be interested in that?   I would be very happy to see
other core members of the Xen community (outside Oracle) understand
tmem, as I'd like to see the whole community benefit rather than
just Oracle.

> Or, you could consider booting the new guest pre-ballooned so it doesn't
> have to allocate all that memory in the build phase.  It would boot much
> quicker (solving the delayed-failure problem), and join the scramble for
> resources on an equal footing with its peers.

I'm not positive I understand "pre-ballooned", but IIUC all Linux
guests already boot pre-ballooned, in that only "mem=" from the
vm.cfg file is allocated at boot, not "maxmem=".  If you mean
something less than "mem=", you'd have to explain to me how Xen
guesses how much memory a guest kernel needs when even the guest
kernel doesn't know.
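
In xl/xm-style config terms, my understanding of "pre-ballooned" is
simply something like the following (numbers purely illustrative):

# Guest boots with 1024 MiB allocated ("pre-ballooned")...
memory = 1024
# ...but may balloon up to 4096 MiB later.
maxmem = 4096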

Tmem, with selfballooning, launches the guest with "mem=", and the
guest kernel then "self-adapts" to (dramatically) reduce its usage
soon after boot.  It can be fun to watch this happen with watch(1),
e.g. with the Linux command "watch -d 'head -1 /proc/meminfo'".

> > > My own position remains that I can live with the reservation hypercall,
> > > as long as it's properly done - including handling PV 32-bit and PV
> > > superpage guests.
> >
> > Tim, would you at least agree that "properly" is a red herring?
> 
> I'm not quite sure what you mean by that.  To the extent that this isn't
> a criticism of the high-level reservation design, maybe.  But I stand by
> it as a criticism of the current implementation.

Sorry, I was just picking on word usage.  IMHO, the hypercall
does work "properly" for the classes of domains it was designed
to work on (which I'd estimate in the range of 98% of domains
these days).  I do agree that it doesn't work for 2%, so I'd
claim that the claim hypercall is "properly done", but maybe
not "completely done".  Clearly, one would prefer a solution that
handles 100%, but I'd rather have a solution that solves 98%
(and doesn't make the other 2% any worse), than no solution at all.

Dan

