[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH 08 of 10 [RFC]] xl: Introduce First Fit memory-wise placement of guests on nodes

Hey Dario,

Thanks for the work on this -- this is obviously a very complicated area to try to make sense of. Of course, that makes it even more complicated to review -- sorry this has taken so long. (comments inline)

On 11/04/12 14:17, Dario Faggioli wrote:
+=item B<auto>
+use a not better specified (xl-implementation dependant) algorithm
+to try automatically fitting the guest on the host's NUMA nodes
Uum, you mean, "Use the default placement algorithm"? :-) We should specify which one this will be here.
+/* Automatic placement policies symbols, with the following meaning:
+ *  NONE   : no automatic placement at all;
+ *  STATIC : explicit nodes specification.
+ *  FFIT   : use the First Fit algorithm for automatic placement;
+ *  AUTO   : an not better specified automatic placement, likely
+ *           to be implemented as a short circuit to (one of)
+ *           the above(s);
+ */
+#define NODES_POLICY_NONE    0
+#define NODES_POLICY_FFIT    2
+#define NODES_POLICY_AUTO    3
This is minor, but it seems like "auto" should be 1, so if we add another policy, the policies are all together, without having to move AUTO around every time.
+/* Store the policy for the domain while parsing */
+static int nodes_policy = NODES_POLICY_DEFAULT;
+/* Store the number of nodes to be used while parsing */
+static int num_nodes_policy = 0;
Why are "nodes_policy" and "num_nodes_policy" not passed in along with b_info?
+static int nodes_policy_parse(const char *pol, long int *policy)
+    int i;
+    const char *n;
+    for (i = 0; i<  sizeof(nodes_policy_names) / 
sizeof(nodes_policy_names[0]); i++) {
I personally prefer an explicit NODES_POLICY_MAX, but I'll let the libxl maintainers decide on that.
+        /* Determine how much free memory we want on each of the nodes
+         * that will end up in suitable_nodes. Either adding or ignoring
+         * the modulus of the integer division should be fine (the total
+         * number of nodes should be in the order of tens of them), so
+         * let's add it as it should be more safe.
+         */
+        memb_node = (memb / (*nodes)) + (memb % (*nodes));
Wouldn't it just be easier to do a "round-up" function here, like this:
 memb_node = ( (memb + *nodes -1) / (*nodes) )

Also, is it really necessary for a VM to have an equal amount of memory on every node? It seems like it would be better to have 75% on one node and 25% on a second node than to have 25% on four nodes, for example. Furthermore, insisting on an even amount fragments the node memory further -- i.e., if we chose to have 25% on four nodes instead of 75% on one and 25% on another, that will make it harder for another VM to fit on a single node as well.

The need for NODES_POLICY_RETRY_DEC is an artifact of the above; if nodes were allowed to be assymetric, you wouldn't need to *decrease* the number of nodes to find *more* memory.
+        /* Apply the asked retry policy for the next round. Notice
+         * that it would be pointless to increase nodes without also
+         * adding some nodes to the map! */
+        *nodes += retry;
+        if (retry == NODES_POLICY_RETRY_INC)
+            __add_nodes_to_nodemap(nodemap, numa, nr_nodes, retry);
Hmm -- if I'm reading this right, the only time the nodemap won't be all nodes is if (1) the user specified nodes, or (2) there's a cpumask in effect. If we're going to override that setting, wouldn't it make sense to just expand to all numa nodes?

Hmm -- though I suppose what you'd really want to try is adding each node in turn, rather than one at a time (i.e., if the cpus are pinned to nodes 2 and 3, and [2,3] doesn't work, try [1,2,3] and [2,3,4] before trying [1,2,3,4]. But that's starting to get really complicated -- I wonder if it's better to just fail and let the user change the pinnings / configuration node mapping.
+    /* If a nodemap has been specified, just comply with that.
+     * OTOH, if there's no nodemap, rely on cpumap (if any), and
+     * fallback to all node if even that is empty. Also consider
+     * all the nodes to be valid if only the number (i.e., _how_
+     * _many_ of them instead of _which_ of them) of nodes the
+     * domain wants is provided.
I feel like this comment could be made more clear...
+     *
+     * Concering retries, if a specific set of nodes is specified,
+     * just try that and give up if we fail. If, instead, a set of
+     * CPUs onto which to pin the domain is specified, try first
+     * the exact nodes those CPUs belongs to, then also try both
+     * a smaller and a bigger number of nodes. The same happens if
+     * we just have the number of nodes to be used. Finally, if
+     * there is no node-affinity, no cpu-pinning, no number of nodes,
+     * start trying with one node and increase at the configured
+     * rate (considering all the nodes to be suitable).
This should be above the "do {} while()" loop below. (Also, see comment on increasing node map above.)
+        if (use_cpus>= b_info->max_vcpus) {
+            rc = 0;
+            break;
+        }
Hmm -- there's got to be a better way to find out the minimum number of nodes to house a given number of vcpus than just starting at 1 and re-trying until we have enough.
+        /* Add one more node and retry fitting the domain */
+        __add_nodes_to_nodemap(&new_nodemap, numa, nr_nodes, 1);
Same comment as above.

+    ret = place_domain(&d_config.b_info);
+    if (ret == ESRCH || ret == ENOSPC) {
+        fprintf(stderr, "failed to place the domain, fallback to all nodes\n");
+        libxl_nodemap_set_any(&d_config.b_info.nodemap);
Hmm -- is this the right fallback? I think in general, if the user asks for X to be done, and X cannot be done for whatever reason, the right thing is to tell the user that X cannot be done and let them sort it out, rather than resorting to Y (unless that's been set).

Is it the case that if any node placement fails, then they'll all fail? Or to put it the other way, is it the case that if there is *some* placement, that any placement algorithm will eventually find it? If so, then I think this fallback may make sense, as there's nothing for the user to really do.
diff --git a/xen/arch/x86/numa.c b/xen/arch/x86/numa.c
--- a/xen/arch/x86/numa.c
+++ b/xen/arch/x86/numa.c
@@ -365,10 +365,11 @@ static void dump_numa(unsigned char key)

         for_each_online_node(i) {
                 paddr_t pa = (paddr_t)(NODE_DATA(i)->node_start_pfn + 1)<<  
-               printk("idx%d ->  NODE%d start->%lu size->%lu\n",
+               printk("idx%d ->  NODE%d start->%lu size->%lu free->%lu\n",
                           i, NODE_DATA(i)->node_id,
-                         NODE_DATA(i)->node_spanned_pages);
+                         NODE_DATA(i)->node_spanned_pages,
+                          avail_node_heap_pages(i));
                 /* sanity check phys_to_nid() */
                 printk("phys_to_nid(%"PRIpaddr") ->  %d should be %d\n", pa, 
This should be in its own patch.


Xen-devel mailing list



Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.