xen-devel

RE: [Xen-devel] [PATCH 0/5] [POST-4.0]: RFC: HVM NUMA guest support

To: Andre Przywara <andre.przywara@xxxxxxx>
Subject: RE: [Xen-devel] [PATCH 0/5] [POST-4.0]: RFC: HVM NUMA guest support
From: "Cui, Dexuan" <dexuan.cui@xxxxxxxxx>
Date: Thu, 25 Feb 2010 21:14:32 +0800
Accept-language: zh-CN, en-US
Cc: "Nakajima, Jun" <jun.nakajima@xxxxxxxxx>, Ian Pratt <Ian.Pratt@xxxxxxxxxxxxx>, xen-devel <xen-devel@xxxxxxxxxxxxxxxxxxx>, Keir Fraser <keir.fraser@xxxxxxxxxxxxx>
Delivery-date: Thu, 25 Feb 2010 05:15:31 -0800
Envelope-to: www-data@xxxxxxxxxxxxxxxxxxx
In-reply-to: <4B83A58D.4000901@xxxxxxx>
List-help: <mailto:xen-devel-request@lists.xensource.com?subject=help>
List-id: Xen developer discussion <xen-devel.lists.xensource.com>
List-post: <mailto:xen-devel@lists.xensource.com>
List-subscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
List-unsubscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
References: <4B6B4126.2050508@xxxxxxx> <ED3036A092A28F4C91B0B4360DD128EABD9D6362@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx> <4B83A58D.4000901@xxxxxxx>
Sender: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
Thread-index: Acq0bhmDtPMFnu3+RxqxNuHE9wHKagBnGicA
Thread-topic: [Xen-devel] [PATCH 0/5] [POST-4.0]: RFC: HVM NUMA guest support
Andre Przywara wrote:
> Cui, Dexuan wrote:
>> Hi Andre,
>> I'm also looking into hvm guest's numa support and I'd like to share
>> my thoughts and my understanding of your patches. 
>> 
>> 1) Besides SRAT, I think we should also build guest SLIT according
>> to host SLIT. 
> That is probably right, though currently low priority. Let's get the
> basics first upstream.
I think the goal of guest NUMA is to reflect the hardware configuration
properly. Nitin's patch (which exposes host SLIT info to dom0's user space)
would help here. I think adding hvm guest SLIT support should not be complex,
and we can do that together once Nitin's patch is in (after Xen 4.0.0 is
released).
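Just to make this concrete: the guest SLIT is basically the host SLIT
restricted to the host nodes backing the guest nodes. A rough sketch of mine
(not code from your patches; the real version would sit next to hvmloader's
SRAT code and also fill in the ACPI table header and checksum):

#include <stdint.h>

#define LOCAL_DISTANCE 10  /* ACPI-defined distance of a locality to itself */

/* guest_to_host_node[g] = host node backing guest node g. */
static void build_guest_slit_entries(uint8_t *guest_slit,      /* g x g matrix */
                                     const uint8_t *host_slit, /* h x h matrix */
                                     const unsigned int *guest_to_host_node,
                                     unsigned int guest_nodes,
                                     unsigned int host_nodes)
{
    unsigned int i, j;

    for ( i = 0; i < guest_nodes; i++ )
        for ( j = 0; j < guest_nodes; j++ )
        {
            unsigned int hi = guest_to_host_node[i];
            unsigned int hj = guest_to_host_node[j];

            /* Two guest nodes backed by the same host node are local to
             * each other; otherwise copy the host-reported distance. */
            guest_slit[i * guest_nodes + j] = (hi == hj) ?
                LOCAL_DISTANCE : host_slit[hi * host_nodes + hj];
        }
}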

> 
>> 2) I agree we should give the user a way to specify which guest
>> node should have how much memory, namely, the "nodemem"
>> parameter in your patch02. However, I can't find where it is assigned
>> a value in your patches. I guess you missed it in image.py.
> Omitted for now. I wanted to keep the first patches clean and had some
> hard time propagating arrays from the config files down to libxc. Is
> there a good explanation of the different kinds of config file
> options? I see different classes (like HVM only) along with
> some legacy parts that appear quite confusing to me.
I also feel it needs quite some effort to cleanly pass the necessary info from
the guest config file to libxc (and to hvmloader, and possibly to the
hypervisor). :-)

If the "nodemem" option is not specified by a user, looks your patches equally 
distribute guest memory into guest nodes. I think this is not good -- I think 
we should require the user to explicitly specify how the guest memory should be 
distributed, e.g., assuming there are 2 host nodes in a platform and the user 
can know hNode0 has 3G memory available and hNode1 has 8G(a user can know 
easily this by "xm info"): now the user needs to create a guest with 10G 
memory. If we enable guest numa and equally distribute guest memory, the guest 
would think there are 2 nodes and the first 5G memory is on gNode0 and the 
second 5G is on gNode1, and we 1:1 map gNodes to hNodes -- but actually 40% of 
gNode0's memory is not on hNode0! In this case, not enabling guest numa may 
achive better guest performance.
I mean: equally distributing guest memory into guest nodes would make the guest 
performance very unpredictable to the user. I think the policy can be: only 
enable guest numa iff "nodemem" is specified.
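Spelling out the arithmetic of the example above (a throwaway illustration of
mine, not toolstack code):

#include <stdio.h>

int main(void)
{
    /* Free memory per host node (MB) and the equal-split plan for a 10G
     * guest, with gNode0->hNode0 and gNode1->hNode1. */
    unsigned long host_free[2]  = { 3072, 8192 };  /* hNode0: 3G, hNode1: 8G */
    unsigned long guest_node[2] = { 5120, 5120 };  /* 10G split equally */
    unsigned int i;

    for ( i = 0; i < 2; i++ )
    {
        unsigned long spill = guest_node[i] > host_free[i] ?
                              guest_node[i] - host_free[i] : 0;
        printf("gNode%u: %lu MB (%lu%%) must come from other host nodes\n",
               i, spill, spill * 100 / guest_node[i]);
    }
    /* Prints "2048 MB (40%)" for gNode0 -- exactly the case where an
     * explicit nodemem (e.g. 3G/7G), or no guest NUMA at all, would be
     * more predictable. */
    return 0;
}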

> 
>>      And what if xen can't allocate memory from the specified host
>> node (e.g., not enough free memory on that host node)? -- currently xen
>> *silently* tries to allocate memory from other host
>> nodes -- this would hurt guest performance
>> while the user doesn't know that at all! I think we should add an
>> option in the guest config file: if it's set,
>> the guest creation should fail if xen cannot allocate memory from
>> the specified host node.
> I had exactly that scenario in mind, too: provide some kind of
> numa=auto option in the config file to let Xen automatically
> split up the memory allocation from different nodes if needed. I think
> we need an upper limit here, or maybe something like:
> numa={force,allow,deny}
I think the policy could be:
1) if no NUMA config is specified by the user, we should try our best to make
guest creation succeed, even if the guest would have bad performance;
2) if "nodemem" is specified, we should try our best to satisfy it; when we
can't: if "numa=force" is specified, we fail the guest creation; otherwise we
try our best to make guest creation succeed, even if the guest would have bad
performance.
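In (hypothetical) code the policy would look roughly like below;
try_populate_on_node() and populate_anywhere() are made-up names standing in
for the real exact-node and fallback allocation paths in libxc, not existing
functions:

/* Hypothetical helpers: return 0 on success, non-zero on failure. */
int try_populate_on_node(unsigned int host_node, unsigned long mem_mb);
int populate_anywhere(unsigned long mem_mb);

enum numa_mode { NUMA_OFF, NUMA_BEST_EFFORT, NUMA_FORCE };

static int allocate_guest_memory(unsigned int guest_nodes,
                                 const unsigned long *nodemem_mb,  /* from config */
                                 const unsigned int *guest_to_host_node,
                                 enum numa_mode mode)
{
    unsigned int g;

    for ( g = 0; g < guest_nodes; g++ )
    {
        if ( try_populate_on_node(guest_to_host_node[g], nodemem_mb[g]) == 0 )
            continue;              /* satisfied from the intended host node */

        if ( mode == NUMA_FORCE )
            return -1;             /* rule 2: fail the guest creation */

        /* best effort: keep the guest creation alive even though the
         * placement (and hence performance) is no longer guaranteed */
        if ( populate_anywhere(nodemem_mb[g]) != 0 )
            return -1;             /* genuinely out of memory */
    }
    return 0;
}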

> numanodes=2
> the numa=allow option would only allocate up to 2 nodes if no single
> node can satisfy the memory request.
I don't think it's good to require the user to specify the number of guest
nodes. It's not straightforward for a user at all.

In my mind, the typical scenario would be:
One day, a user needs to create a "powerful" guest that has 32 vcpus and 64G
of memory.
By running "xm info", the user learns there are 3 hNodes (this is just an
example I made up :-)
hNode0: 8   logical processors, 20G memory available;
hNode1: 24 logical processors, 40G memory available;
hNode2: 8   logical processors, 40G memory available.
After thinking for a few seconds, the user decides to allocate 4/24/4 vcpus
and 10G/40G/14G of memory from hNode0/1/2, respectively, to the guest; or the
user may decide to allocate 24/8 vcpus and 40G/24G of memory from hNode1/2,
respectively, to the guest.

I mean: we should be able to deduce the number of guest nodes from the user's
explicit configuration. Without that, the guest performance would be
unpredictable if we simply require the user to supply "numanodes" and then try
to figure out the "best" vcpu/memory distribution for the user (I think that's
difficult and not flexible at all).
It looks like Ian Pratt also leans toward this idea in his reply to the other
thread "Host Numa informtion in dom0".

>> 3) In your patch02:
>> +        for (i = 0; i < numanodes; i++)
>> +            numainfo.guest_to_host_node[i] = i % 2;
>> As you said in the mail "[PATCH 5/5]", at present it "simply round
>> robin until the code for automatic allocation is in place", 
>> I think "simply round robin" is not acceptable and we should
>> implement 
> "automatic allocation".
> Right, but this depends on the one part I missed. The first part of
> this is the xc_nodeload() function. I will try to provide
> the missing part this week.
As I replied above, I think it's better to ask the user to give an explicit 
configuration. It's difficult to make an always-wise-enough algorithm to figure 
out the best solution and the user will lose flexibility.

> 
>> 4) Your patches try to sort the host nodes using a node load
>> evaluation algorithm, require the user to specify how many
>> guest nodes the guest should see, and distribute guest vcpus equally
>> across the guest nodes. I don't think the algorithm can be wise
>> enough every time, and it's not flexible. Requiring the user to
>> specify the number of guest nodes and distributing vcpus equally into
>> each guest node also doesn't sound wise enough or flexible.
> Another possible extension. I had some draft with "node_cpus=[1,2,1]"
> to put one vCPU in the first and third node and two vCPUs in the
> second node, although I omitted them from the first "draft" release.
As I replied above, the info "nodemem" and "cpus" (the vcpu affinity info)
should be enough, and the "node_cpus" here would be redundant.

> 
>>    Since guest numa needs vcpu pinning to work as expected, how
>> about my thoughts below? 
>> 
>>    a) ask the user to use the "cpus" option to pin each vcpu to a
>>    physical cpu (or node); b) find out how many physical nodes (host
>>    nodes) are involved and use that number as the number of guest
>> nodes; c) each guest node corresponds to a host node found in
>> step b); use this info to fill in the numainfo.guest_to_host_node[]
>> from 3).
> My idea is:
> 1) use xc_nodeload() to get a list of host nodes with the respective
> amount of free memory
> 2) either use the user-provided number of guest nodes or determine
> the number based on memory availability (=n)
> 3) select the <n> best nodes from the list (algorithm still to be
> discussed, but a simple approach is sufficient for the first time)
> 4) populate numainfo.guest_to_host_node accordingly
> 5) pin vCPUs based on this array
> 
> This is basically the missing function (TM) I described earlier.
Please see my above reply.
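To make the a)/b)/c) deduction quoted above concrete, here is a rough sketch
(illustrative names only; vcpu_to_host_node[] would be derived by the
toolstack from the "cpus" affinity masks plus the host cpu-to-node map that
"xm info" already shows):

static unsigned int
build_guest_to_host_node(const unsigned int *vcpu_to_host_node,
                         unsigned int nr_vcpus,
                         unsigned int *guest_to_host_node,
                         unsigned int *vcpu_to_guest_node)
{
    unsigned int guest_nodes = 0, v, g;

    for ( v = 0; v < nr_vcpus; v++ )
    {
        unsigned int hnode = vcpu_to_host_node[v];

        /* b) every distinct host node used by the pinning becomes one
         *    guest node ... */
        for ( g = 0; g < guest_nodes; g++ )
            if ( guest_to_host_node[g] == hnode )
                break;
        if ( g == guest_nodes )
            guest_to_host_node[guest_nodes++] = hnode;

        /* c) ... and each vcpu lands in the guest node that maps to the
         *    host node it is pinned to. */
        vcpu_to_guest_node[v] = g;
    }

    return guest_nodes;  /* number of guest nodes to expose in the SRAT */
}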

>> 5) I think we also need to present the NUMA guest with a virtual cpu
>> topology, e.g., through the initial APIC ID. In current xen,
>> apic_id = vcpu_id * 2; even if we have the guest SRAT support and
>> use 2 guest nodes for a vcpus=n guest,
>> the guest would still think it's on one package with n cores, without
>> any knowledge of the vcpu and cache
>> topology, and this would harm guest performance.
>>    I think we can treat each guest node as a guest package and, by
>> giving the guest proper APIC IDs (consisting of guest
>> SMT_ID/Core_ID/Package_ID), show the vcpu topology to the guest.
>> This needs changes to hvmloader's SRAT/MADT APIC ID fields and
>> xen's cpuid/vlapic emulation.
> The APIC ID scenario does not work on AMD CPUs, which don't have a bit
> field based association between compute units and APIC IDs. For NUMA
> purposes SRAT should be sufficient, as it overrides APIC based
Sorry, I'm not familiar with the APIC ID on AMD's CPUs.
My thought is: assuming an hNode corresponds to a host package, a package has
some cores, and a core has 2 threads, then if we could expose this info (and
the related host cache topology) to the guest, the guest OS could
intentionally try to schedule "related" processes onto the threads of the same
core, and as a result we could achieve better guest performance.
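For the Intel-style layout I have in mind, the initial APIC ID would be
composed from the three fields, SMT_ID in the low bits, then Core_ID, then
Package_ID (= guest node), instead of the fixed "apic_id = vcpu_id * 2". A
rough sketch (mine, not from the patches):

#include <stdint.h>

static unsigned int bits_needed(unsigned int count)
{
    unsigned int bits = 0;

    while ( (1u << bits) < count )
        bits++;
    return bits;
}

static uint32_t vcpu_apic_id(unsigned int package_id,  /* = guest node */
                             unsigned int core_id,
                             unsigned int smt_id,
                             unsigned int cores_per_package,
                             unsigned int threads_per_core)
{
    unsigned int smt_bits  = bits_needed(threads_per_core);
    unsigned int core_bits = bits_needed(cores_per_package);

    /* [ Package_ID | Core_ID | SMT_ID ], SMT_ID in the low bits. */
    return (package_id << (core_bits + smt_bits)) |
           (core_id << smt_bits) | smt_id;
}

hvmloader's MADT/SRAT code and xen's cpuid/vlapic emulation would then all
have to use the same function so the IDs stay consistent.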

> decisions. But you are right in that it needs more CPUID / ACPI
> tweaking to get the topology right, although this should be addressed
> in separate patches:
> Currently(?) it is very cumbersome to inject a specific "cores per
> socket" number into Xen (by tweaking those ugly CPUID bit masks). For
I looked into the current Xen code and agree it's not easy to make the
injection clean; however, it may be worth the effort if it can improve guest
performance to a notable degree. I'm trying to obtain some data now.
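For reference (Intel-defined leaves only; I haven't checked how this maps onto
AMD), the fields such an injection has to keep consistent are
CPUID.1:EBX[23:16] (logical processors per physical package) and
CPUID.4:EAX[31:26] (cores per package, minus 1). Roughly:

#include <stdint.h>

/* Sketch only: rewrite the topology-related CPUID fields for a given
 * cores/threads setting instead of hand-crafting the bit masks. */
static void adjust_topology_leaves(uint32_t leaf, uint32_t *eax, uint32_t *ebx,
                                   unsigned int cores, unsigned int threads)
{
    if ( leaf == 1 )
    {
        *ebx &= ~(0xffu << 16);                     /* logical CPUs/package */
        *ebx |= ((cores * threads) & 0xffu) << 16;
    }
    else if ( leaf == 4 )
    {
        *eax &= ~(0x3fu << 26);                     /* cores/package - 1 */
        *eax |= ((cores - 1) & 0x3fu) << 26;
    }
}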


> QEMU/KVM I introduced an easy config scheme (smp=8,cores=2,threads=2)
> to allow this (purely CPUID based). If only I had time for this I
> would do this for Xen, too.
> 
>> 
>> 6) HVM vcpu hot add/remove functionality was added to xen
>> recently. The guest numa support should take this into
>> consideration.
> Are you volunteering? ;-)
Yes, I'm looking into this.

>> 7) I don't see live migration support in your patches. It looks like
>> it's hard for an hvm numa guest to do live migration, as the
>> src/dest hosts could be very different in HW configuration.
> I don't think this is a problem. We need to separate guest specific
> options (like VCPUs to guest nodes or guest memory to guest nodes
> mapping) from host specific parts (guest nodes to host nodes). I
> haven't tested it yet, but I assume that the config file options to
> specify the guest specific parts should be sent already right now,
> resulting in the new guest setting up with the proper guest config.
> The guest node to host node association is determined by the new host
> dynamically depending on the current host's resources. This can turn
> out to be sub-optimal, like migrating a "4 guest node on 4 host
> nodes" guest on a dual node host, but this would currently map to
> 0-1-0-1 setup, where two guest nodes are assigned the same host node.
> I don't see much of an problem here.
A concern of mine is: after the migration, does the user expect that the guest
performance could change a lot?
E.g., assume there are 2 identical hosts with many guests running on each;
after migrating a numa guest from host A to host B, the underlying memory
distribution may vary (because the amount of memory available on each node
differs between A and B: on A, all of gNode0's memory can be on hNode0, but on
B, gNode0's memory may end up spread across hNode0 and hNode1), and the guest
performance would be degraded.
Another thing is: if we change the current mapping "apic_id = vcpu_id * 2",
we'll have a compatibility issue.

Thanks,
-- Dexuan
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel