[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH 08 of 10 [RFC]] xl: Introduce First Fit memory-wise placement of guests on nodes

To: Dario Faggioli <raistlin@xxxxxxxx>
From: George Dunlap <george.dunlap@xxxxxxxxxxxxx>
Date: Tue, 1 May 2012 16:45:35 +0100
Cc: Andre Przywara <andre.przywara@xxxxxxx>, Ian Campbell <Ian.Campbell@xxxxxxxxxx>, Stefano Stabellini <Stefano.Stabellini@xxxxxxxxxxxxx>, Juergen Gross <juergen.gross@xxxxxxxxxxxxxx>, Ian Jackson <Ian.Jackson@xxxxxxxxxxxxx>, "xen-devel@xxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxx>, Jan Beulich <JBeulich@xxxxxxxx>
Delivery-date: Tue, 01 May 2012 15:47:29 +0000
List-id: Xen developer discussion <xen-devel.lists.xen.org>

Hey Dario,

Thanks for the work on this -- this is obviously a very complicated areato try to make sense of. Of course, that makes it even more complicatedto review -- sorry this has taken so long. (comments inline)


On 11/04/12 14:17, Dario Faggioli wrote:

+
+=item B<auto>
+
+use a not better specified (xl-implementation dependant) algorithm
+to try automatically fitting the guest on the host's NUMA nodes

Uum, you mean, "Use the default placement algorithm"? :-) We shouldspecify which one this will be here.

+
+/* Automatic placement policies symbols, with the following meaning:
+ *  NONE   : no automatic placement at all;
+ *  STATIC : explicit nodes specification.
+ *  FFIT   : use the First Fit algorithm for automatic placement;
+ *  AUTO   : an not better specified automatic placement, likely
+ *           to be implemented as a short circuit to (one of)
+ *           the above(s);
+ */
+#define NODES_POLICY_NONE    0
+#define NODES_POLICY_STATIC  1
+#define NODES_POLICY_FFIT    2
+#define NODES_POLICY_AUTO    3

This is minor, but it seems like "auto" should be 1, so if we addanother policy, the policies are all together, without having to moveAUTO around every time.

+
+/* Store the policy for the domain while parsing */
+static int nodes_policy = NODES_POLICY_DEFAULT;
+
+/* Store the number of nodes to be used while parsing */
+static int num_nodes_policy = 0;

Why are "nodes_policy" and "num_nodes_policy" not passed in along withb_info?

+static int nodes_policy_parse(const char *pol, long int *policy)
+{
+    int i;
+    const char *n;
+
+    for (i = 0; i<  sizeof(nodes_policy_names) / 
sizeof(nodes_policy_names[0]); i++) {

I personally prefer an explicit NODES_POLICY_MAX, but I'll let the libxlmaintainers decide on that.

+
+        /* Determine how much free memory we want on each of the nodes
+         * that will end up in suitable_nodes. Either adding or ignoring
+         * the modulus of the integer division should be fine (the total
+         * number of nodes should be in the order of tens of them), so
+         * let's add it as it should be more safe.
+         */
+        memb_node = (memb / (*nodes)) + (memb % (*nodes));

Wouldn't it just be easier to do a "round-up" function here, like this:
 memb_node = ( (memb + *nodes -1) / (*nodes) )

Also, is it really necessary for a VM to have an equal amount of memoryon every node? It seems like it would be better to have 75% on one nodeand 25% on a second node than to have 25% on four nodes, for example.Furthermore, insisting on an even amount fragments the node memoryfurther -- i.e., if we chose to have 25% on four nodes instead of 75% onone and 25% on another, that will make it harder for another VM to fiton a single node as well.

The need for NODES_POLICY_RETRY_DEC is an artifact of the above; ifnodes were allowed to be assymetric, you wouldn't need to *decrease* thenumber of nodes to find *more* memory.

+        /* Apply the asked retry policy for the next round. Notice
+         * that it would be pointless to increase nodes without also
+         * adding some nodes to the map! */
+        *nodes += retry;
+        if (retry == NODES_POLICY_RETRY_INC)
+            __add_nodes_to_nodemap(nodemap, numa, nr_nodes, retry);

Hmm -- if I'm reading this right, the only time the nodemap won't be allnodes is if (1) the user specified nodes, or (2) there's a cpumask ineffect. If we're going to override that setting, wouldn't it make senseto just expand to all numa nodes?

Hmm -- though I suppose what you'd really want to try is adding eachnode in turn, rather than one at a time (i.e., if the cpus are pinned tonodes 2 and 3, and [2,3] doesn't work, try [1,2,3] and [2,3,4] beforetrying [1,2,3,4]. But that's starting to get really complicated -- Iwonder if it's better to just fail and let the user change the pinnings/ configuration node mapping.

+
+    /* If a nodemap has been specified, just comply with that.
+     * OTOH, if there's no nodemap, rely on cpumap (if any), and
+     * fallback to all node if even that is empty. Also consider
+     * all the nodes to be valid if only the number (i.e., _how_
+     * _many_ of them instead of _which_ of them) of nodes the
+     * domain wants is provided.

I feel like this comment could be made more clear...

+     *
+     * Concering retries, if a specific set of nodes is specified,
+     * just try that and give up if we fail. If, instead, a set of
+     * CPUs onto which to pin the domain is specified, try first
+     * the exact nodes those CPUs belongs to, then also try both
+     * a smaller and a bigger number of nodes. The same happens if
+     * we just have the number of nodes to be used. Finally, if
+     * there is no node-affinity, no cpu-pinning, no number of nodes,
+     * start trying with one node and increase at the configured
+     * rate (considering all the nodes to be suitable).

This should be above the "do {} while()" loop below. (Also, see commenton increasing node map above.)

+
+        if (use_cpus>= b_info->max_vcpus) {
+            rc = 0;
+            break;
+        }

Hmm -- there's got to be a better way to find out the minimum number ofnodes to house a given number of vcpus than just starting at 1 andre-trying until we have enough.

+        /* Add one more node and retry fitting the domain */
+        __add_nodes_to_nodemap(&new_nodemap, numa, nr_nodes, 1);

Same comment as above.


+    ret = place_domain(&d_config.b_info);
+    if (ret == ESRCH || ret == ENOSPC) {
+        fprintf(stderr, "failed to place the domain, fallback to all nodes\n");
+        libxl_nodemap_set_any(&d_config.b_info.nodemap);

Hmm -- is this the right fallback? I think in general, if the user asksfor X to be done, and X cannot be done for whatever reason, the rightthing is to tell the user that X cannot be done and let them sort itout, rather than resorting to Y (unless that's been set).

Is it the case that if any node placement fails, then they'll all fail?Or to put it the other way, is it the case that if there is *some*placement, that any placement algorithm will eventually find it? If so,then I think this fallback may make sense, as there's nothing for theuser to really do.

diff --git a/xen/arch/x86/numa.c b/xen/arch/x86/numa.c
--- a/xen/arch/x86/numa.c
+++ b/xen/arch/x86/numa.c
@@ -365,10 +365,11 @@ static void dump_numa(unsigned char key)

         for_each_online_node(i) {
                 paddr_t pa = (paddr_t)(NODE_DATA(i)->node_start_pfn + 1)<<  
PAGE_SHIFT;
-               printk("idx%d ->  NODE%d start->%lu size->%lu\n",
+               printk("idx%d ->  NODE%d start->%lu size->%lu free->%lu\n",
                           i, NODE_DATA(i)->node_id,
                           NODE_DATA(i)->node_start_pfn,
-                         NODE_DATA(i)->node_spanned_pages);
+                         NODE_DATA(i)->node_spanned_pages,
+                          avail_node_heap_pages(i));
                 /* sanity check phys_to_nid() */
                 printk("phys_to_nid(%"PRIpaddr") ->  %d should be %d\n", pa, 
phys_to_nid(pa),
                           NODE_DATA(i)->node_id);

This should be in its own patch.

  -George

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel

Follow-Ups:
- Re: [Xen-devel] [PATCH 08 of 10 [RFC]] xl: Introduce First Fit memory-wise placement of guests on nodes
  - From: Dario Faggioli

Prev by Date: Re: [Xen-devel] [PATCH 1/7] vgabios: Output to Xen debug port instead of using Bochs one
Next by Date: Re: [Xen-devel] [PATCH] auto balloon initial domain and fix dom0_mem=X inconsistencies (v5).
Previous by thread: [Xen-devel] VGABIOS patches
Next by thread: Re: [Xen-devel] [PATCH 08 of 10 [RFC]] xl: Introduce First Fit memory-wise placement of guests on nodes
Index(es):
- Date
- Thread

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.