Re: [Xen-devel] Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions
On 02/01/13 21:59, Konrad Rzeszutek Wilk wrote:
> Thanks for the clarification. I am not that fluent in the OCaml code.

I'm not fluent in OCaml either; I'm mainly going from memory, based on the discussions I had with the author when it was being designed, as well as discussions with the xapi team when dealing with bugs at later points.

>> When a request comes in for a certain amount of memory, it will go and set each VM's max_pages, and the max tmem pool size. It can then check whether there is enough free memory to complete the allocation or not (since there's a race between checking how much memory a guest is using and setting max_pages). If that succeeds, it can return "success". If, while that VM is being built, another request comes in, it can again go around and set the max sizes lower. It has to know how much of the memory is "reserved" for the first guest being built, but if there's enough left after that, it can return "success" and allow the second VM to start being built. After the VMs are built, the toolstack can remove the limits again if it wants, again allowing the free flow of memory.
>
> This sounds to me like what Xapi does?

No, AFAIK xapi always sets max_pages to what it wants the guest to be using at any given time. I talked about removing the limits (and about operating without limits in the normal case) because it seems like something that Oracle wants (having to do with tmem).

Do you see any problems with this scheme? All it requires is for the toolstack to be able to temporarily set limits both on guests ballooning up and on tmem allocating more than a certain amount of memory. We already have mechanisms for the first, so if we had a "max_pages" for tmem, then you'd have all the tools you need to implement it.

So when you say "tmem freeze", are you specifically talking about not allowing tmem to allocate more memory (what I called a "max_pages" for tmem)? Or is there more to it? Secondly, just to clarify: when a guest is using memory from the tmem pool, is that added to tot_pages?

I'm not sure what "gives a definite yes or no" is supposed to mean -- the scheme I described also gives a definite yes or no. In any case, your point about ballooning is taken: if we set max_pages for a VM and just leave it there while VMs are being built, then VMs cannot balloon up, even if there is "free" memory (i.e., memory that will not be used for the currently-building VM), and memory cannot be moved *between* VMs either (i.e., by ballooning down one and ballooning the other up). Both of these could be done by extending the toolstack with a memory model (see below), but that adds an extra level of complication.

What do you mean by the extra 'reserved' space? And what potential issues are there with PCI passthrough? To be accepted, the reservation hypercall will certainly have to be extended to handle superpages and 32-bit guests, so that's the case we should be considering.

Wouldn't the same argument apply to the reservation hypercall? Suppose that there was enough domain memory but not enough Xen heap memory, or not enough of some other resource -- the hypercall might succeed, but the domain build could still fail at some later point when the other resource allocation fails.
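To make the "temporarily set limits" step concrete, here is a rough Python sketch of the freeze/thaw I have in mind. The helpers -- set_max_pages(), get_tot_pages(), tmem_set_max_pages() -- are hypothetical wrappers around whatever toolstack interface you'd actually use; tmem_set_max_pages() in particular is the "max_pages for tmem" that doesn't exist today:

# Rough sketch only.  set_max_pages(), get_tot_pages() and
# tmem_set_max_pages() are hypothetical toolstack wrappers; the tmem cap
# is exactly the interface that is missing at the moment.

UNLIMITED = 2**63 - 1          # stand-in for "no limit"

def freeze_allocations(domains, tmem_cap_pages):
    """Clamp every guest (and tmem) to roughly its current usage, so that
    free host memory can only grow, not shrink, behind our back."""
    for dom in domains:
        set_max_pages(dom, get_tot_pages(dom))   # guest can't balloon up
    tmem_set_max_pages(tmem_cap_pages)           # hypothetical tmem limit

def thaw_allocations(domains):
    """Remove the limits again once all pending builds have finished,
    restoring the free flow of memory between guests and tmem."""
    for dom in domains:
        set_max_pages(dom, UNLIMITED)
    tmem_set_max_pages(UNLIMITED)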
Hmm, I don't think what you wrote about mine is quite right. Here's what I had in mind for mine (let me call it "limit-and-check"; a rough sketch of the loop follows the steps below):

[serial]
1) Set limits on all guests, and on tmem, and see how much memory is left.
2) Read free memory.
[parallel]
2a) Claim memory for each guest from the freshly-calculated pool of free memory.
3) For each claim that can be satisfied, launch a guest.
4) If there are guests that can't be satisfied with the current free memory, then:
[serial]
4a) Round-robin existing guests to decrease their memory consumption if allowed. Go to 2.
5) Remove limits on guests.
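In rough Python, steps 2-5 might look like the sketch below (it already includes the 4a/4b refinement discussed further down). free_host_pages(), balloon_down_round(), wait_for_builds() and build_domain() are hypothetical wrappers named only for this example, and refresh_snapshot() is assumed to have been called once after step 1:

import threading

# Sketch of steps 2-5.  Claims are serialized under a toolstack lock; the
# builds themselves (step 3) run in parallel, e.g. one thread per request.

claim_lock = threading.Lock()
free_snapshot = 0    # free pages, measured while no build was in flight

def refresh_snapshot():
    """Step 2: only meaningful in a quiescent (or grow-only) system."""
    global free_snapshot
    free_snapshot = free_host_pages()

def try_claim(request_pages):
    """Step 2a: claim against the snapshot, serialized with other claims."""
    global free_snapshot
    with claim_lock:
        if free_snapshot >= request_pages:
            free_snapshot -= request_pages   # "I had 64GiB free; I started a
            return True                      #  16GiB build, so 48GiB left"
        return False

def create_domain(request_pages, config):
    """One creation request (one thread per request)."""
    while not try_claim(request_pages):
        balloon_down_round()     # step 4a: shrink existing guests if allowed
        wait_for_builds()        # step 4b: wait for in-flight builds to finish,
        with claim_lock:         #          then re-measure and try again
            refresh_snapshot()
    build_domain(config)         # step 3: proceeds in parallel with other builds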
Note that 1 would only be done for the first such "request", and 5 would only be done after all such requests have succeeded or failed. Also note that steps 1 and 5 are only necessary if you want to go without such limits -- xapi doesn't do them, because it always keeps max_pages set to what it wants the guest to be using.

Also, note that the "claiming" (2a for mine above and 1 for yours) has to be serialized with other "claims" in both cases (in the reservation hypercall case, with a lock inside the hypervisor), but the building can begin in parallel with the "claiming" in both cases.

But I think I do see what you're getting at. The "free memory" measurement has to be taken when the system is in a "quiescent" state -- or at least a "grow only" state -- otherwise it's meaningless. So #4a should really be:

4a) Round-robin existing guests to decrease their memory consumption if allowed.
4b) Wait for currently-building guests to finish building (if any), then go to #2.

So suppose the following cases, in which several requests for guest creation come in over a short period of time (not necessarily all at once):

A. There is enough memory for all requested VMs to be built without ballooning / something else.
B. There is enough for some, but not all, of the VMs to be built without ballooning / something else.

In case A, I think "limit-and-check" and "reservation hypercall" should perform the same. For each new request that comes in, the toolstack can say, "Well, when I checked I had 64GiB free; then I started to build a 16GiB VM. So I should have 48GiB left, enough to build this 32GiB VM", or "Well, when I checked I had 64GiB free; then I started to build a 16GiB VM and a 32GiB VM, so I should have 16GiB left, enough to be able to build this 16GiB VM."

The main difference comes in case B. The "reservation hypercall" method will not have to wait until all existing guests have finished building to be able to start subsequent guests; but "limit-and-check" would have to wait until the currently-building guests are finished before doing another check.

This limitation doesn't apply to xapi, because it doesn't use the hypervisor's free memory as a measure of the memory it has available to it. Instead, it keeps an internal model of the free memory the hypervisor has available. This is based on MAX(current_target, tot_pages) of each guest (where "current_target" for a domain in the process of being built is the amount of memory it will have eventually). We might call this the "model" approach.

We could extend "limit-and-check" to "limit-check-and-model" (i.e., estimate how much memory is really free after ballooning based on the guests' tot_pages), or "limit-model" (basically, fully switch to a xapi-style "model" approach while you're doing domain creation). That would be significantly more complicated. On the other hand, a lot of the work has already been done by the XenServer team, and (I believe) the code in question is all GPL'ed, so Oracle could just take the algorithms and adapt them with a bit of tweaking (and a bit of code translation).
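The core of that model is tiny -- something like the sketch below (field names are illustrative, not xapi's actual ones, and Xen's own overhead is ignored for simplicity):

# Sketch of a xapi-style model of free host memory: charge each domain
# MAX(current_target, tot_pages), where current_target for a domain that
# is still being built is the amount of memory it will eventually have.
# (Field names are illustrative; Xen's own overhead is ignored here.)

def modeled_free_pages(host_total_pages, domains):
    """domains: iterable of dicts with 'current_target' and 'tot_pages'."""
    charged = sum(max(d["current_target"], d["tot_pages"]) for d in domains)
    return host_total_pages - charged

# A new request for N pages can be admitted as soon as
# modeled_free_pages(...) >= N, without waiting for other builds (or
# balloon operations) to finish -- which is what lets xapi avoid the
# "wait for current guests to finish building" limitation above.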
It seems to me that the "model" approach brings a lot of other benefits as well. But at any rate -- without debating the value or cost of the "model" approach, would you agree with my analysis and conclusions? Namely:

a. "limit-and-check" and "reservation hypercall" are similar wrt guest creation when there is enough memory currently free to build all requested guests.
b. "limit-and-check" may be slower if some guests can succeed in being built but others must wait for memory to be freed up, since the "check" has to wait for current guests to finish building.
c. (From further back) One downside of a pure "limit-and-check" approach is that while VMs are being built, VMs cannot increase in size, even if there is "free" memory (not being used to build the currently-building domain(s)) or if another VM could be ballooned down.
d. "model"-based approaches can mitigate b and c, at the cost of a more complicated algorithm.

I'm sorry, what race / fudge factor are you talking about?

 -George

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel