
Re: [Xen-devel] Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions



> From: Tim Deegan [mailto:tim@xxxxxxx]
> Subject: Re: [Xen-devel] Proposed XENMEM_claim_pages hypercall: Analysis of 
> problem and alternate
> solutions

Hi Tim --

Thanks for the response.

> At 13:38 -0800 on 02 Jan (1357133898), Dan Magenheimer wrote:
> > > The discussion ought to be around the actual problem, which is (as far
> > > as I can see) that in a system where guests are ballooning without
> > > limits, VM creation failure can happen after a long delay.  In
> > > particular it is the delay that is the problem, rather than the failure.
> > > Some solutions that have been proposed so far:
> > >  - don't do that, it's silly (possibly true but not helpful);
> > >  - this reservation hypercall, to pull the failure forward;
> > >  - make allocation faster to avoid the delay (a good idea anyway,
> > >    but can it be made fast enough?);
> > >  - use max_pages or similar to stop other VMs using all of RAM.
> >
> > Good summary.  So, would you agree that the solution selection
> > comes down to: "Can max_pages or similar be used effectively to
> > stop other VMs using all of RAM? If so, who is implementing that?
> > Else the reservation hypercall is a good solution." ?
> 
> Not quite.  I think there are other viable options, and I don't
> particularly like the reservation hypercall.

Are you suggesting an alternative option other than the max_pages
toolstack-based proposal that Ian and I are discussing in a parallel
subthread?  Just checking, in case I am forgetting an alternative
that you (or someone else) proposed.

Are there reasons other than "incompleteness" (see below) that
you dislike the reservation hypercall?  To me, it seems fairly
elegant in that it uses the same locks for capacity-allocation
as for page allocation, thus guaranteeing no races can occur.
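
To illustrate what I mean, here is a rough sketch I'm making up for
this email (it is NOT the actual Xen implementation; the names, the
pthread mutex standing in for the heap lock, and the single global
claim counter are all simplifications):

/* Sketch: why doing the claim check under the same lock as the page
 * allocator leaves no window for races.  The "claimed but not yet
 * allocated" count and the free-page count are only ever examined
 * and updated together, under one lock. */
#include <stdbool.h>
#include <pthread.h>

static pthread_mutex_t heap_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long free_pages = 1000000;  /* host free memory, in pages */
static unsigned long outstanding_claims;    /* promised but not yet allocated */

/* Stake a claim for nr pages up front; fail immediately if the host
 * could not honor it.  (Real code would track claims per-domain.) */
bool claim_pages(unsigned long nr)
{
    bool ok = false;

    pthread_mutex_lock(&heap_lock);
    if (free_pages - outstanding_claims >= nr) {
        outstanding_claims += nr;
        ok = true;
    }
    pthread_mutex_unlock(&heap_lock);
    return ok;
}

/* Ordinary single-page allocation: a claim holder draws down its
 * claim; everyone else may only take genuinely unclaimed pages. */
bool alloc_page(bool caller_has_claim)
{
    bool ok = false;

    pthread_mutex_lock(&heap_lock);
    if (caller_has_claim && outstanding_claims > 0) {
        outstanding_claims--;
        free_pages--;
        ok = true;
    } else if (!caller_has_claim && free_pages > outstanding_claims) {
        free_pages--;
        ok = true;
    }
    pthread_mutex_unlock(&heap_lock);
    return ok;
}

The point is simply that, because both checks happen under the same
lock, a successful claim can never be silently invalidated by a
racing allocation.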

> I can still see something like max_pages working well enough.  AFAICS
> the main problem with that solution is something like this: because it
> limits the guests individually rather than collectively, it prevents
> memory transfers between VMs even if they wouldn't clash with the VM
> being built.

Indeed, you are commenting on one of the same differences
I observed today in the subthread with Ian, where I said
that the hypervisor-based solution is only "max-of-sums"-
constrained whereas the toolstack-based solution is
"sum-of-maxes"-constrained.  With tmem/selfballooning active,
what you call "memory transfers between VMs" can be happening
constantly.  (To clarify for others, it is not the contents
of the memory that is being transferred, just the capacity...
i.e. VM A frees a page and VM B allocates a page.)

So thanks for reinforcing this point as I think it is subtle
but important.
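
To make the distinction concrete, here is a tiny made-up example
(not Xen code; units are pages and all the numbers are invented):

#include <stdio.h>
#include <stdbool.h>

struct guest { unsigned long pages, max_pages; };

/* Toolstack-style ("sum-of-maxes") check: guest i may grow only up
 * to its own individual cap, so the host must be provisioned for the
 * sum of all the per-guest maxima. */
static bool ok_individually(const struct guest *g, int i, unsigned long nr)
{
    return g[i].pages + nr <= g[i].max_pages;
}

/* Hypervisor-style ("max-of-sums") check: any guest may grow as long
 * as the instantaneous sum of allocations fits on the host. */
static bool ok_collectively(const struct guest *g, int n, unsigned long nr,
                            unsigned long host_pages)
{
    unsigned long total = 0;

    for (int j = 0; j < n; j++)
        total += g[j].pages;
    return total + nr <= host_pages;
}

int main(void)
{
    /* Host with 100 pages.  A has (self)ballooned down to 10; B is
     * at its cap of 50 and wants 20 more. */
    struct guest g[] = { { 10, 50 }, { 50, 50 } };

    /* Collectively there is plenty of room (60 of 100 pages in use),
     * but B's individual cap says no -- the capacity A freed cannot
     * be transferred to B unless the toolstack raises B's cap. */
    printf("individually: %d  collectively: %d\n",
           ok_individually(g, 1, 20), ok_collectively(g, 2, 20, 100));
    return 0;
}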

> That could be worked around with an upcall to a toolstack
> agent that reshuffles things on a coarse granularity based on need.  I
> agree that's slower than having the hypervisor make the decisions but
> I'm not convinced it'd be unmanageable.

"Based on need" begs a number of questions, starting with how
"need" is defined and how conflicting needs are resolved.
Tmem balances need as a self-adapting system. For your upcalls,
you'd have to convince me that, even if "need" could be communicated
to an guest-external entity (i.e. a toolstack), that the entity
would/could have any data to inform a policy to intelligently resolve
conflicts.  I also don't see how it could be done without either
significant hypervisor or guest-kernel changes.
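
Just so we are talking about the same thing, I imagine the agent you
describe looking something like the loop below (entirely
hypothetical; none of these helpers exist, and everything hard hides
inside get_need(), which is exactly the policy I don't know how to
write):

/* Hypothetical coarse-grained toolstack "memory reshuffler" driven
 * by guest upcalls.  The helpers are invented for illustration:
 * get_need() would be some channel by which a guest reports its
 * memory need, and set_max() would be a setmaxmem-style toolstack
 * operation. */
#include <stdint.h>

#define NR_DOMS 8

extern uint64_t get_need(int domid);            /* hypothetical */
extern void set_max(int domid, uint64_t pages); /* hypothetical */

void reshuffle(uint64_t host_pages)
{
    uint64_t need[NR_DOMS], total_need = 0;

    for (int d = 0; d < NR_DOMS; d++) {
        need[d] = get_need(d);
        total_need += need[d];
    }

    for (int d = 0; d < NR_DOMS; d++) {
        uint64_t grant = need[d];

        /* When needs exceed the host, scale everyone back
         * proportionally -- one of many possible, and debatable,
         * conflict-resolution policies. */
        if (total_need > host_pages)
            grant = need[d] * host_pages / total_need;
        set_max(d, grant);
    }
}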

> Or, how about actually moving towards a memory scheduler like you
> suggested -- for example by integrating memory allocation more tightly
> with tmem.  There could be an xsm-style hook in the allocator for
> tmem-enabled domains.  That way tmem would have complete control over
> all memory allocations for the guests under its control, and it could
> implement a shared upper limit.  Potentially in future the tmem
> interface could be extended to allow it to force guests to give back
> more kinds of memory, so that it could try to enforce fairness (e.g. if
> two VMs are busy, why should the one that spiked first get to keep all
> the RAM?) or other nice scheduler-like properties.

Tmem (plus selfballooning), unchanged, already does some of this.
While I would be interested in discussing better solutions, the
now four-year odyssey of pushing what I thought were relatively
simple changes upstream into Linux has left a rather sour taste
in my mouth, so rather than consider any solution that requires
more guest kernel changes, I'd first prefer to ensure that you
thoroughly understand what tmem already does, and how and why.
Would you be interested in that?   I would be very happy to see
other core members of the Xen community (outside Oracle) understand
tmem, as I'd like to see the whole community benefit rather than
just Oracle.

> Or, you could consider booting the new guest pre-ballooned so it doesn't
> have to allocate all that memory in the build phase.  It would boot much
> quicker (solving the delayed-failure problem), and join the scramble for
> resources on an equal footing with its peers.

I'm not positive I understand "pre-ballooned", but IIUC all Linux
guests already boot pre-ballooned, in that only "mem=" from the
vm.cfg file is allocated at boot, not "maxmem=".  If you mean
something less than "mem=", you'd have to explain to me how Xen
guesses how much memory a guest kernel needs when even the guest
kernel doesn't know.
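
In xl/xm-style config terms, my understanding of "pre-ballooned" is
simply something like the following (numbers purely illustrative):

# Guest boots with 1024 MiB allocated ("pre-ballooned")...
memory = 1024
# ...but may balloon up to 4096 MiB later.
maxmem = 4096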

Tmem, with selfballooning, launches the guest with "mem=", and the
guest kernel then "self-adapts" to (dramatically) reduce its usage
soon after boot.  It can be fun to watch this happen with watch(1),
e.g. with the Linux command "watch -d 'head -1 /proc/meminfo'".

> > > My own position remains that I can live with the reservation hypercall,
> > > as long as it's properly done - including handling PV 32-bit and PV
> > > superpage guests.
> >
> > Tim, would you at least agree that "properly" is a red herring?
> 
> I'm not quite sure what you mean by that.  To the extent that this isn't
> a criticism of the high-level reservation design, maybe.  But I stand by
> it as a criticism of the current implementation.

Sorry, I was just picking on word usage.  IMHO, the hypercall
does work "properly" for the classes of domains it was designed
to work on (which I'd estimate in the range of 98% of domains
these days).  I do agree that it doesn't work for 2%, so I'd
claim that the claim hypercall is "properly done", but maybe
not "completely done".  Clearly, one would prefer a solution that
handles 100%, but I'd rather have a solution that solves 98%
(and doesn't make the other 2% any worse), than no solution at all.

Dan

