
Re: [Xen-devel] Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions



Heya,

Much appreciate your input, and below are my responses.
> >>> A) In Linux, a privileged user can write to a sysfs file which writes
> >>> to the balloon driver which makes hypercalls from the guest kernel to
> >> 
> >> A fairly bizarre limitation of a balloon-based approach to memory 
> >> management. Why on earth should the guest be allowed to change the size of 
> >> its balloon, and therefore its footprint on the host. This may be 
> >> justified with arguments pertaining to the stability of the in-guest 
> >> workload. What they really reveal are limitations of ballooning. But the 
> >> inadequacy of the balloon in itself doesn't automatically translate into 
> >> justifying the need for a new hypercall.
> > 
> > Why is this a limitation? Why shouldn't the guest be allowed to change
> > its memory usage? It can go up and down as it sees fit.
> 
> No no. Can the guest change its cpu utilization outside scheduler 
> constraints? NIC/block dev quotas? Why should an unprivileged guest be able 
> to take a massive s*it over the host controller's memory allocation, at the 
> guest's whim?

There is a limit to what it can do. It is not an uncontrolled guest
going haywire - it does its stuff within the parameters of the guest config.
Those parameters, in my mind, also cover the 'tmem' doing its extra things in the hypervisor.

> 
> I'll be happy with a balloon the day I see an OS that can't be rooted :)
> 
> Obviously this points to a problem with sharing & paging. And this is why I 
> still spam this thread. More below.
>  
> > And if it goes down and it gets better performance - well, why shouldn't
> > it do it?
> > 
> > I concur it is odd - but it has been like that for decades.
> 
> Heh. Decades - one?

Still - a decade.
> > 
> > 
> >> 
> >>> the hypervisor, which adjusts the domain memory footprint, which changes 
> >>> the number of free pages _without_ the toolstack knowledge.
> >>> The toolstack controls constraints (essentially a minimum and maximum)
> >>> which the hypervisor enforces.  The toolstack can ensure that the
> >>> minimum and maximum are identical to essentially disallow Linux from
> >>> using this functionality.  Indeed, this is precisely what Citrix's
> >>> Dynamic Memory Controller (DMC) does: enforce min==max so that DMC always 
> >>> has complete control and, so, knowledge of any domain memory
> >>> footprint changes.  But DMC is not prescribed by the toolstack,
> >> 
> >> Neither is enforcing min==max. This was my argument when previously 
> >> commenting on this thread. The fact that you have enforcement of a maximum 
> >> domain allocation gives you an excellent tool to keep a domain's 
> >> unsupervised growth at bay. The toolstack can choose how fine-grained, how 
> >> often to be alerted and stall the domain.

That would also do the trick - but there are penalties to it.

If one just wants to launch multiple guests and "freeze" all the other guests
from using the balloon driver - that can certainly be done.

But that is a half-way solution (in my mind). Dan's idea is that you wouldn't
even need that and can just allocate without having to worry about the other
guests at all - b/c you have reserved enough memory in the hypervisor (host) to
launch the guest.
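
To make that concrete, here is a rough sketch of what a claim-first launch
path could look like from the toolstack side. It is only an illustration of
the idea under discussion: the xc_domain_claim_pages() wrapper and the
build/fallback helpers are hypothetical names, not a settled interface.

/* Hypothetical sketch of a claim-first launch; xc_domain_claim_pages() and
 * the helpers below are illustrative names, not a final interface. */
typedef void xc_interface;
typedef unsigned short domid_t;

extern int xc_domain_claim_pages(xc_interface *xch, domid_t domid,
                                 unsigned long nr_pages);
extern int build_domain(xc_interface *xch, domid_t domid,
                        unsigned long nr_pages);
extern int build_domain_slow_path(xc_interface *xch, domid_t domid,
                                  unsigned long nr_pages);

int launch_guest(xc_interface *xch, domid_t domid, unsigned long nr_pages)
{
    /* Ask the hypervisor to reserve the whole footprint up front.  If the
     * claim succeeds, other guests ballooning up/down can no longer eat
     * this memory out from under the domain build. */
    if (xc_domain_claim_pages(xch, domid, nr_pages) == 0)
        return build_domain(xch, domid, nr_pages);

    /* Claim failed: fall back to clamping/waiting on the other guests. */
    return build_domain_slow_path(xch, domid, nr_pages);
}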

> > 
> > There is a down-call (so events) to the tool-stack from the hypervisor when
> > the guest tries to balloon in/out? So the need to handle this problem arose
> > but the mechanism to deal with it has been shifted to user-space
> > then? What to do when the guest does this in/out ballooning at frequent
> > intervals?
> > 
> > I am missing actually the reasoning behind wanting to stall the domain?
> > Is that to compress/swap the pages that the guest requests? Meaning
> > an user-space daemon that does "things" and has ownership
> > of the pages?
> 
> The (my) reasoning is that this enables control over unsupervised growth. I 
> was being facetious a couple lines above. Paging and sharing also have the 
> same problem with badly behaved guests. So this is where you stop these guys, 
> allow the toolstack to catch a breath, and figure out what to do with this 
> domain (more RAM? page out? foo?).

But what if we do not even need the toolstack to catch a breath? The goal
here is for it not to be involved in this at all and to let the hypervisor deal
with unsupervised growth, as it is better equipped to do so - and it is the
ultimate judge of whether the guest can grow wildly or not.

I mean, why make the toolstack become CPU bound when you can just have
the hypervisor take this extra information into account and avoid
the CPU-bound problem altogether?
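
A hedged sketch of what "the hypervisor takes this extra information into
account" could mean on the allocation path; the names here (free_pages,
outstanding_claims, alloc_pages_for_balloon) are illustrative, not the actual
Xen internals.

/* Illustrative only: the allocator refuses balloon growth that would eat
 * into memory already claimed for a domain being built. */
static unsigned long free_pages;          /* pages on the free lists */
static unsigned long outstanding_claims;  /* pages promised to builds in flight */

int alloc_pages_for_balloon(unsigned long nr_pages)
{
    /* The claim is the one value the toolstack handed down; every
     * allocation is checked against it here, not by a user-space daemon. */
    if (nr_pages > free_pages || free_pages - nr_pages < outstanding_claims)
        return -1;   /* would dip into reserved memory: refuse */

    free_pages -= nr_pages;
    return 0;
}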

> 
> All your questions are very valid, but they are policy in toolstack-land. 
> Luckily the hypervisor needs no knowledge of that.

My thinking is that some policy (say, how much the guests can grow) is something
that the host sets. And the hypervisor is the engine that takes these values
into account and runs with them.

I think you are advocating that the "engine" and the policy should both live
in user-land.

.. snip..
> >> Great care has been taken for this statement to not be exactly true. The 
> >> hypervisor discards one of two pages that the toolstack tells it to (and 
> >> patches the physmap of the VM previously pointing to the discard page). It 
> >> doesn't merge, nor does it look into contents. The hypervisor doesn't care 
> >> about the page contents. This is deliberate, so as to avoid spurious 
> >> claims of "you are using technique X!"
> >> 
> > 
> > Is the toolstack (or a daemon in userspace) doing this? I would
> > have thought that there would be some optimization to do this
> > somewhere?
> 
> You could optimize but then you are baking policy where it does not belong. 
> This is what KSM did, which I dislike. Seriously, does the kernel need to 
> scan memory to find duplicates? Can't something else do it given suitable 
> interfaces? Now any other form of sharing policy that tries to use 
> VMA_MERGEABLE is SOL. Tim, Gregor and I, at different points in time, tried 
> to avoid this. I don't know that it was a conscious or deliberate effort, but 
> it worked out that way.

OK, I think I understand you - you are advocating for user-space
because the combination of policy/engine can be done there.

Dan's and my thinking is to piggyback on the hypervisor's MM engine
and just provide a means of tweaking one value. In some ways that
is similar to adding sysctls in the kernel to tell the MM how to
behave.
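
As a concrete (non-Xen) instance of that analogy: a single sysctl value is
handed to the kernel, and the kernel's MM engine does all the enforcement.
Linux's vm.overcommit_memory knob is used below purely as an illustration.

/* Illustration of the sysctl analogy only: one value is written, and the
 * kernel's MM engine enforces the resulting policy on every allocation. */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/sys/vm/overcommit_memory", "w");
    if (!f)
        return 1;
    fputs("2\n", f);   /* 2 = strict accounting; the kernel MM enforces it */
    return fclose(f) ? 1 : 0;
}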

.. snip..
> > That code makes certain assumptions - that the guest will not go up/down
> > in the ballooning once the toolstack has decreed how much
> > memory the guest should use. It also assumes that the operations
> > are semi-atomic - and to make it so as much as it can - it executes
> > these operations in serial.
> > 
> > This goes back to the problem statement - if we try to parallelize
> > this we run into the problem that the amount of memory we thought
> > was free is not true anymore. The start of this email has a good
> > description of some of the issues.
> 
> Just set max_pages (bad name...) everywhere as needed to make room. Then kick 
> tmem (everywhere, in parallel) to free memory. Wait until enough is free.
> Allocate your domain(s, in parallel). If any vcpus become stalled because a 
> tmem guest driver is trying to allocate beyond max_pages, you need to adjust 
> your allocations. As usual.


Versus just one "reserve" that would remove the need for most of this.
That is - if we cannot "reserve" we would fall back to the mechanism you
stated, but if there is enough memory we do not have to play the "wait"
game (which on a 1TB host takes forever and makes launching guests sometimes
take minutes) - and can launch the guest without having to worry
about the slow path.
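
For contrast, a rough sketch of that "wait game" fallback (all helper names
below are made up for illustration): clamp max_pages everywhere, kick tmem,
then poll until enough memory is actually free before the build can start.

/* Illustrative fallback path: freeze balloon growth, ask tmem to release
 * freeable memory, then poll free memory until the build can proceed.
 * On a 1TB host this polling phase is what can take minutes. */
#include <unistd.h>

extern void clamp_all_guests_max_pages(void);   /* stop balloons growing */
extern void kick_tmem_to_release_memory(void);  /* flush freeable tmem pools */
extern unsigned long host_free_pages(void);

void wait_for_memory(unsigned long needed_pages)
{
    clamp_all_guests_max_pages();
    kick_tmem_to_release_memory();

    while (host_free_pages() < needed_pages)
        sleep(1);   /* the "wait game" */
}
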
.. snip..

> >> 
> > 
> > I believe what Dan is saying is that it is not enabled by default.
> > Meaning it does not get started by /etc/init.d/xencommons and
> > as such it never gets run (or does it now?) - unless one knows
> > about it - or it is enabled by default in a product. But perhaps
> > we are both mistaken? Is it enabled by default now on xen-unstable?
> 
> I'm a bit lost - what is supposed to be enabled? A sharing daemon? A paging 
> daemon? Neither daemon requires wait queue work, batch allocations, etc. I 
> can't figure out what this portion of the conversation is about.

The xenshared daemon.
> 
> Having said that, thanks for the thoughtful follow-up

Thank you for your response!

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel

 

