
Re: [Xen-devel] [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity



On 19/12/12 19:07, Dario Faggioli wrote:
+static inline int
+__csched_vcpu_should_migrate(int cpu, cpumask_t *mask, cpumask_t *idlers)
+{
+    /*
+     * Consent to migration if cpu is one of the idlers in the VCPU's
+     * affinity mask. In fact, if that is not the case, it just means it
+     * was some other CPU that was tickled and should hence come and pick
+     * VCPU up. Migrating it to cpu would only make things worse.
+     */
+    return cpumask_test_cpu(cpu, idlers) && cpumask_test_cpu(cpu, mask);
  }

  static int
@@ -493,85 +599,98 @@ static int
      cpumask_t idlers;
      cpumask_t *online;
      struct csched_pcpu *spc = NULL;
+    int ret, balance_step;
      int cpu;

-    /*
-     * Pick from online CPUs in VCPU's affinity mask, giving a
-     * preference to its current processor if it's in there.
-     */
      online = cpupool_scheduler_cpumask(vc->domain->cpupool);
-    cpumask_and(&cpus, online, vc->cpu_affinity);
-    cpu = cpumask_test_cpu(vc->processor, &cpus)
-            ? vc->processor
-            : cpumask_cycle(vc->processor, &cpus);
-    ASSERT( !cpumask_empty(&cpus) && cpumask_test_cpu(cpu, &cpus) );
+    for_each_csched_balance_step( balance_step )
+    {
+        /* Pick an online CPU from the proper affinity mask */
+        ret = csched_balance_cpumask(vc, balance_step, &cpus);
+        cpumask_and(&cpus, &cpus, online);

-    /*
-     * Try to find an idle processor within the above constraints.
-     *
-     * In multi-core and multi-threaded CPUs, not all idle execution
-     * vehicles are equal!
-     *
-     * We give preference to the idle execution vehicle with the most
-     * idling neighbours in its grouping. This distributes work across
-     * distinct cores first and guarantees we don't do something stupid
-     * like run two VCPUs on co-hyperthreads while there are idle cores
-     * or sockets.
-     *
-     * Notice that, when computing the "idleness" of cpu, we may want to
-     * discount vc. That is, iff vc is the currently running and the only
-     * runnable vcpu on cpu, we add cpu to the idlers.
-     */
-    cpumask_and(&idlers, &cpu_online_map, CSCHED_PRIV(ops)->idlers);
-    if ( vc->processor == cpu && IS_RUNQ_IDLE(cpu) )
-        cpumask_set_cpu(cpu, &idlers);
-    cpumask_and(&cpus, &cpus, &idlers);
-    cpumask_clear_cpu(cpu, &cpus);
+        /* If present, prefer vc's current processor */
+        cpu = cpumask_test_cpu(vc->processor, &cpus)
+                ? vc->processor
+                : cpumask_cycle(vc->processor, &cpus);
+        ASSERT( !cpumask_empty(&cpus) && cpumask_test_cpu(cpu, &cpus) );

-    while ( !cpumask_empty(&cpus) )
-    {
-        cpumask_t cpu_idlers;
-        cpumask_t nxt_idlers;
-        int nxt, weight_cpu, weight_nxt;
-        int migrate_factor;
+        /*
+         * Try to find an idle processor within the above constraints.
+         *
+         * In multi-core and multi-threaded CPUs, not all idle execution
+         * vehicles are equal!
+         *
+         * We give preference to the idle execution vehicle with the most
+         * idling neighbours in its grouping. This distributes work across
+         * distinct cores first and guarantees we don't do something stupid
+         * like run two VCPUs on co-hyperthreads while there are idle cores
+         * or sockets.
+         *
+         * Notice that, when computing the "idleness" of cpu, we may want to
+         * discount vc. That is, iff vc is the currently running and the only
+         * runnable vcpu on cpu, we add cpu to the idlers.
+         */
+        cpumask_and(&idlers, &cpu_online_map, CSCHED_PRIV(ops)->idlers);
+        if ( vc->processor == cpu && IS_RUNQ_IDLE(cpu) )
+            cpumask_set_cpu(cpu, &idlers);
+        cpumask_and(&cpus, &cpus, &idlers);
+        /* If there are idlers and cpu is still not among them, pick one */
+        if ( !cpumask_empty(&cpus) && !cpumask_test_cpu(cpu, &cpus) )
+            cpu = cpumask_cycle(cpu, &cpus);

This seems to be an addition to the algorithm -- and it's particularly well hidden in this kind of "indent a big section that's almost exactly the same" diff. I think it at least needs to be called out in the changelog message, and perhaps put in a separate patch.

Can you comment on why you think it's necessary? Was there a particular problem you were seeing?

+        cpumask_clear_cpu(cpu, &cpus);

-        nxt = cpumask_cycle(cpu, &cpus);
+        while ( !cpumask_empty(&cpus) )
+        {
+            cpumask_t cpu_idlers;
+            cpumask_t nxt_idlers;
+            int nxt, weight_cpu, weight_nxt;
+            int migrate_factor;

-        if ( cpumask_test_cpu(cpu, per_cpu(cpu_core_mask, nxt)) )
-        {
-            /* We're on the same socket, so check the busy-ness of threads.
-             * Migrate if # of idlers is less at all */
-            ASSERT( cpumask_test_cpu(nxt, per_cpu(cpu_core_mask, cpu)) );
-            migrate_factor = 1;
-            cpumask_and(&cpu_idlers, &idlers, per_cpu(cpu_sibling_mask, cpu));
-            cpumask_and(&nxt_idlers, &idlers, per_cpu(cpu_sibling_mask, nxt));
-        }
-        else
-        {
-            /* We're on different sockets, so check the busy-ness of cores.
-             * Migrate only if the other core is twice as idle */
-            ASSERT( !cpumask_test_cpu(nxt, per_cpu(cpu_core_mask, cpu)) );
-            migrate_factor = 2;
-            cpumask_and(&cpu_idlers, &idlers, per_cpu(cpu_core_mask, cpu));
-            cpumask_and(&nxt_idlers, &idlers, per_cpu(cpu_core_mask, nxt));
+            nxt = cpumask_cycle(cpu, &cpus);
+
+            if ( cpumask_test_cpu(cpu, per_cpu(cpu_core_mask, nxt)) )
+            {
+                /* We're on the same socket, so check the busy-ness of threads.
+                 * Migrate if # of idlers is less at all */
+                ASSERT( cpumask_test_cpu(nxt, per_cpu(cpu_core_mask, cpu)) );
+                migrate_factor = 1;
+                cpumask_and(&cpu_idlers, &idlers, per_cpu(cpu_sibling_mask,
+                            cpu));
+                cpumask_and(&nxt_idlers, &idlers, per_cpu(cpu_sibling_mask,
+                            nxt));
+            }
+            else
+            {
+                /* We're on different sockets, so check the busy-ness of cores.
+                 * Migrate only if the other core is twice as idle */
+                ASSERT( !cpumask_test_cpu(nxt, per_cpu(cpu_core_mask, cpu)) );
+                migrate_factor = 2;
+                cpumask_and(&cpu_idlers, &idlers, per_cpu(cpu_core_mask, cpu));
+                cpumask_and(&nxt_idlers, &idlers, per_cpu(cpu_core_mask, nxt));
+            }
+
+            weight_cpu = cpumask_weight(&cpu_idlers);
+            weight_nxt = cpumask_weight(&nxt_idlers);
+            /* smt_power_savings: consolidate work rather than spreading it */
+            if ( sched_smt_power_savings ?
+                 weight_cpu > weight_nxt :
+                 weight_cpu * migrate_factor < weight_nxt )
+            {
+                cpumask_and(&nxt_idlers, &cpus, &nxt_idlers);
+                spc = CSCHED_PCPU(nxt);
+                cpu = cpumask_cycle(spc->idle_bias, &nxt_idlers);
+                cpumask_andnot(&cpus, &cpus, per_cpu(cpu_sibling_mask, cpu));
+            }
+            else
+            {
+                cpumask_andnot(&cpus, &cpus, &nxt_idlers);
+            }
          }

-        weight_cpu = cpumask_weight(&cpu_idlers);
-        weight_nxt = cpumask_weight(&nxt_idlers);
-        /* smt_power_savings: consolidate work rather than spreading it */
-        if ( sched_smt_power_savings ?
-             weight_cpu > weight_nxt :
-             weight_cpu * migrate_factor < weight_nxt )
-        {
-            cpumask_and(&nxt_idlers, &cpus, &nxt_idlers);
-            spc = CSCHED_PCPU(nxt);
-            cpu = cpumask_cycle(spc->idle_bias, &nxt_idlers);
-            cpumask_andnot(&cpus, &cpus, per_cpu(cpu_sibling_mask, cpu));
-        }
-        else
-        {
-            cpumask_andnot(&cpus, &cpus, &nxt_idlers);
-        }
+        /* Stop if cpu is idle (or if csched_balance_cpumask() says we can) */
+        if ( cpumask_test_cpu(cpu, &idlers) || ret )
+            break;

Right -- OK, I think everything looks good here, except the "return -1 from csched_balance_cpumask" thing. I think it would be better if we explicitly checked cpumask_full(...->node_affinity_cpumask) and skipped the NODE step if that's the case.
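
Something like this is what I have in mind -- just an untested sketch, and I'm guessing at both the name of the NODE-step constant and where the node-affinity mask lives, so substitute whatever the series actually calls them:

    for_each_csched_balance_step( balance_step )
    {
        /*
         * Skip the NODE step when the domain's node affinity is "all
         * cpus", since it can't filter anything in that case. (The step
         * constant and the mask's location are my guesses.)
         */
        if ( balance_step == CSCHED_BALANCE_NODE_AFFINITY
             && cpumask_full(vc->domain->node_affinity_cpumask) )
            continue;

        csched_balance_cpumask(vc, balance_step, &cpus);
        cpumask_and(&cpus, &cpus, online);
        ...
    }

That way csched_balance_cpumask() wouldn't need a special return value at all.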

Also -- and sorry to have to ask this kind of thing, but after sorting through the placement algorithm my head hurts -- under what circumstances would "cpumask_test_cpu(cpu, &idlers)" be false at this point? It seems like the only possibility would be if:
( (vc->processor was not in the original &cpus [1])
  || !IS_RUNQ_IDLE(vc->processor) )
&& (there are no idlers in the original &cpus)

Which, I suppose, is probably exactly the point at which we want to stop looking at NODE affinity and move on to plain CPU affinity.

[1] This could happen either if the vcpu/node affinity has changed, or if we're currently running outside our node affinity and we're doing the NODE step.
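
In other words, roughly (the mask name here is made up, just to pin the condition down):

    /* cpumask_test_cpu(cpu, &idlers) can be false here only if vc's
     * current processor didn't count as idle... */
    ( !cpumask_test_cpu(vc->processor, &affinity_cpus)
      || !IS_RUNQ_IDLE(vc->processor) )
    /* ...and nothing else in the affinity mask was idle either. */
    && !cpumask_intersects(&affinity_cpus, &idlers)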

OK -- I think I've convinced myself that this is OK as well (apart from the hidden check). I'll come back to look at your response to the load balancing thing tomorrow.

 -George

