
Re: [Xen-devel] [PATCH 3/4] xen: sched: reorganize cpu_disable_scheduler()



On 07/03/2015 05:49 PM, Dario Faggioli wrote:
The function is called both when we want to remove a cpu
from a cpupool, and during cpu teardown, for suspend or
shutdown. If, however, the boot cpu (cpu 0, most of the
time) is not present in the default cpupool, Xen crashes
during suspend or shutdown like this:

   root@Zhaman:~# xl cpupool-cpu-remove Pool-0 0
   root@Zhaman:~# shutdown -h now
   (XEN) ----[ Xen-4.6-unstable  x86_64  debug=y  Tainted:    C ]----
   ...
   (XEN) Xen call trace:
   (XEN)    [<ffff82d0801238de>] _csched_cpu_pick+0x156/0x61f
   (XEN)    [<ffff82d080123db5>] csched_cpu_pick+0xe/0x10
   (XEN)    [<ffff82d08012de3c>] vcpu_migrate+0x18e/0x321
   (XEN)    [<ffff82d08012e4f8>] cpu_disable_scheduler+0x1cf/0x2ac
   (XEN)    [<ffff82d08018bb8d>] __cpu_disable+0x313/0x36e
   (XEN)    [<ffff82d080101424>] take_cpu_down+0x34/0x3b
   (XEN)    [<ffff82d08013097a>] stopmachine_action+0x70/0x99
   (XEN)    [<ffff82d0801325f0>] do_tasklet_work+0x78/0xab
   (XEN)    [<ffff82d080132926>] do_tasklet+0x5e/0x8a
   (XEN)    [<ffff82d08016478c>] idle_loop+0x56/0x6b
   (XEN)
   (XEN)
   (XEN) ****************************************
   (XEN) Panic on CPU 15:
   (XEN) Assertion 'cpu < nr_cpu_ids' failed at ...URCES/xen/xen/xen.git/xen/include/xen/cpumask.h:97
   (XEN) ****************************************

There are also problems when we try to suspend or shut down
with a cpupool configured with just one cpu (no matter, in
this case, whether that is the boot cpu or not):

   root@Zhaman:~# xl create /etc/xen/test.cfg
   root@Zhaman:~# xl cpupool-migrate test Pool-1
   root@Zhaman:~# xl cpupool-list -c
   Name               CPU list
   Pool-0             0,1,2,3,4,5,6,7,8,9,10,11,13,14,15
   Pool-1             12
   root@Zhaman:~# shutdown -h now
   (XEN) ----[ Xen-4.6-unstable  x86_64  debug=y  Tainted:    C ]----
   (XEN) CPU:    12
   ...
   (XEN) Xen call trace:
   (XEN)    [<ffff82d08018bb91>] __cpu_disable+0x317/0x36e
   (XEN)    [<ffff82d080101424>] take_cpu_down+0x34/0x3b
   (XEN)    [<ffff82d08013097a>] stopmachine_action+0x70/0x99
   (XEN)    [<ffff82d0801325f0>] do_tasklet_work+0x78/0xab
   (XEN)    [<ffff82d080132926>] do_tasklet+0x5e/0x8a
   (XEN)    [<ffff82d08016478c>] idle_loop+0x56/0x6b
   (XEN)
   (XEN)
   (XEN) ****************************************
   (XEN) Panic on CPU 12:
   (XEN) Xen BUG at smpboot.c:895
   (XEN) ****************************************

In both cases, the problem is that the scheduler is unable
to:
  - in the former, move all the vcpus to the boot cpu (as
    the boot cpu is not in the cpupool);
  - in the latter, move the vcpus away from the cpu at all
    (as it is the only cpu in the cpupool).

The solution is to distinguish, inside cpu_disable_scheduler(),
between the two cases of cpupool manipulation and teardown.
For cpupool manipulation, it is correct to ask the scheduler
to take action, as pathological situations (like there being
no cpu in the pool to which to send vcpus) are already taken
care of (i.e., forbidden!). For suspend and shutdown, we
don't want the scheduler to be involved at all, as the final
goal is pretty simple: "send all the vcpus to the boot cpu
ASAP", so we just go for it.
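
In a nutshell, for a vcpu found on the cpu being removed, the
new logic boils down to the following (a condensed sketch of
the code below, with locking and flag handling omitted):

    if ( system_state == SYS_STATE_suspend )
    {
        /*
         * Teardown: don't involve the scheduler at all; just send the
         * vcpu to some other still online pcpu (cpumask_first() will
         * usually hand us the boot pcpu right away).
         */
        cpumask_andnot(&online_affinity, &cpu_online_map, cpumask_of(cpu));
        vcpu_move(v, cpumask_first(&online_affinity));
    }
    else
    {
        /*
         * Cpupool manipulation: ask the scheduler to re-evaluate the
         * placement of the vcpu; if the vcpu can't be migrated right
         * now, the caller will retry (-EAGAIN).
         */
        vcpu_migrate(v);
        if ( v->processor == cpu )
            ret = -EAGAIN;
    }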

Signed-off-by: Dario Faggioli <dario.faggioli@xxxxxxxxxx>

Just 2 minor nits (see below), otherwise:

Acked-by: Juergen Gross <jgross@xxxxxxxx>

---
Cc: George Dunlap <george.dunlap@xxxxxxxxxxxxx>
Cc: Juergen Gross <jgross@xxxxxxxx>
---
  xen/common/schedule.c |  109 ++++++++++++++++++++++++++++++++++++++++++-------
  1 file changed, 93 insertions(+), 16 deletions(-)

diff --git a/xen/common/schedule.c b/xen/common/schedule.c
index e83c666..eac8804 100644
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -455,8 +455,8 @@ void vcpu_unblock(struct vcpu *v)
   * Do the actual movement of a vcpu from old to new CPU. Locks for *both*
   * CPUs need to have been taken already when calling this!
   */
-static void vcpu_move(struct vcpu *v, unsigned int old_cpu,
-                      unsigned int new_cpu)
+static void _vcpu_move(struct vcpu *v, unsigned int old_cpu,
+                       unsigned int new_cpu)
  {
      /*
       * Transfer urgency status to new CPU before switching CPUs, as
@@ -479,6 +479,35 @@ static void vcpu_move(struct vcpu *v, unsigned int old_cpu,
          v->processor = new_cpu;
  }

+static void vcpu_move(struct vcpu *v, unsigned int new_cpu)
+{
+    unsigned long flags;
+    unsigned int cpu = v->processor;
+    spinlock_t *lock, *new_lock;
+
+    /*
+     * When here, the vcpu should be ready to be moved. This means:
+     *  - both its original and target processor must be quiet;
+     *  - it must not be marked as currently running;
+     *  - the proper flag must be set (i.e., no one must have had any
+     *    chance to reset it).
+     */
+    ASSERT(is_idle_vcpu(curr_on_cpu(cpu)) &&
+           is_idle_vcpu(curr_on_cpu(new_cpu)));
+    ASSERT(!v->is_running && test_bit(_VPF_migrating, &v->pause_flags));
+
+    lock = per_cpu(schedule_data, v->processor).schedule_lock;
+    new_lock = per_cpu(schedule_data, new_cpu).schedule_lock;
+
+    sched_spin_lock_double(lock, new_lock, &flags);
+    ASSERT(new_cpu != v->processor);
+    _vcpu_move(v, cpu, new_cpu);
+    sched_spin_unlock_double(lock, new_lock, flags);
+
+    sched_move_irqs(v);
+    vcpu_wake(v);
+}
+
  static void vcpu_migrate(struct vcpu *v)
  {
      unsigned long flags;
@@ -543,7 +572,7 @@ static void vcpu_migrate(struct vcpu *v)
          return;
      }

-    vcpu_move(v, old_cpu, new_cpu);
+    _vcpu_move(v, old_cpu, new_cpu);

      sched_spin_unlock_double(old_lock, new_lock, flags);

@@ -616,7 +645,8 @@ int cpu_disable_scheduler(unsigned int cpu)
      struct vcpu *v;
      struct cpupool *c;
      cpumask_t online_affinity;
-    int    ret = 0;
+    unsigned int new_cpu;
+    int ret = 0;

      c = per_cpu(cpupool, cpu);
      if ( c == NULL )
@@ -645,25 +675,72 @@ int cpu_disable_scheduler(unsigned int cpu)
                  cpumask_setall(v->cpu_hard_affinity);
              }

-            if ( v->processor == cpu )
+            if ( v->processor != cpu )
              {
-                set_bit(_VPF_migrating, &v->pause_flags);
+                /* This vcpu is not on cpu, so we can move on. */
                  vcpu_schedule_unlock_irqrestore(lock, flags, v);
-                vcpu_sleep_nosync(v);
-                vcpu_migrate(v);
+                continue;
              }
-            else
-                vcpu_schedule_unlock_irqrestore(lock, flags, v);

              /*
-             * A vcpu active in the hypervisor will not be migratable.
-             * The caller should try again after releasing and reaquiring
-             * all locks.
+             * If we're here, it means that the vcpu is on cpu. Let's see how
+             * best to send it away, depending on whether we are shutting
+             * down/suspending, or doing cpupool manipulations.
               */
-            if ( v->processor == cpu )
-                ret = -EAGAIN;
-        }
+            set_bit(_VPF_migrating, &v->pause_flags);
+            vcpu_schedule_unlock_irqrestore(lock, flags, v);
+            vcpu_sleep_nosync(v);
+
+            /*
+             * In case of shutdown/suspend, it is not necessary to ask the
+             * scheduler to chime in. In fact:
+             *  * there is no reason for it: the end result we are after is
+             *    just 'all the vcpus on the boot pcpu, and no vcpu anywhere
+             *    else', so let's just go for it;
+             *  * it's wrong, when dealing with a cpupool with only non-boot
+             *    pcpus, as the scheduler will always fail to send the vcpus
+             *    away from the last online (non-boot) pcpu!

I'd add a comment that in the shutdown/suspend case all domains are
being paused, so we can only be active in dom0/Pool-0.
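
Something along these lines, perhaps (the exact wording is just a
sketch):

             *    Note that, in the shutdown/suspend case, all domains
             *    are already being paused, so the only vcpus we can
             *    still find active here belong to dom0, i.e., we are
             *    in Pool-0.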

+             *
+             * Therefore, in the shutdown/suspend case, let's just pick one
+             * of the still online pcpus, and send everyone there. Ideally,
+             * we would pick the boot pcpu directly, but we don't know
+             * which one it is.
+             *
+             * OTOH, if the system is still live, and we are here because of
+             * cpupool manipulations:
+             *  * it would be best to call the scheduler, as that would serve
+             *    as a re-evaluation of the placement of the vcpu, taking into
+             *    account the modified cpupool configuration;
+             *  * the scheduler will always find a suitable solution, or
+             *    things would have failed before getting in here.
+             *
+             * Therefore, in the cpupool manipulation case, let's just ask the
+             * scheduler to do its job, via calling vcpu_migrate().
+             */
+            if ( unlikely(system_state == SYS_STATE_suspend) )
+            {
+                /*
+                 * The boot pcpu is, usually, pcpu #0, so using cpumask_first()
+                 * actually helps us to achieve our ultimate goal quicker.
+                 */
+                cpumask_andnot(&online_affinity, &cpu_online_map, cpumask_of(cpu));

What about an ASSERT/BUG regarding a non-empty online_affinity?
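
Something as simple as this, right after the cpumask_andnot(), would
do, I guess (just a sketch):

                ASSERT(!cpumask_empty(&online_affinity));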


Juergen

+                new_cpu = cpumask_first(&online_affinity);
+                vcpu_move(v, new_cpu);
+            }
+            else
+            {
+                /*
+                 * The only caveat, in this case, is that, if the vcpu is
+                 * active in the hypervisor, it won't be migratable. In that
+                 * case, the caller should try again after releasing and
+                 * reacquiring all locks.
+                 */
+                vcpu_migrate(v);
+
+                if ( v->processor == cpu )
+                    ret = -EAGAIN;
+            }
+        }
          domain_update_node_affinity(d);
      }
