Scalable SMP Scheduling Proposal
--------------------------------

In a multiprocessor server running virtualized environments, it is common
to virtualize the physical processors, providing multiprocessor
environments for guest operating systems. Modern operating systems use
locks to ensure exclusive access to critical regions of code or data.
Inefficiencies occur when one virtual processor of a multiprocessor domain
is preempted while holding a lock, while other virtual processors of the
same domain continue to run on other processors, waiting for the lock to
be released.

This problem can affect user applications as well as the kernel. Important
subsystems (databases, for example) or large-scale scientific applications
assume exclusive access to processors and tune their synchronization
methods accordingly. Some operating systems provide gang scheduling for
their applications for just this reason. Other systems (e.g., AIX) provide
interfaces by which application threads can indicate that they're holding
locks and should not be preempted. These operating system mechanisms are
ineffective in a virtualized environment in which the system itself (and
whatever application it happens to be running) can be preempted on
individual processors at arbitrary times.

To address these issues, we propose a three-level approach (the levels are
largely independent and can be implemented in any order):

1) Where possible, reconfigure domains dynamically so that their virtual
   processor counts match the processor resources they are actually
   receiving.

2) Where possible, gang-schedule multiprocessor domains.

3) Provide a preemption notification mechanism for multiprocessor domains
   that cannot be gang-scheduled.

We discuss each of these levels in turn and then lay out a concrete
implementation proposal.

Dynamic reconfiguration.
------------------------

Some domains are capable of accepting or releasing virtual processors as
their internal workloads change or as the overall workload on the
multiprocessor changes. This addition or removal can take the form of
activating or deactivating virtual processors internally, or it can
involve a heavier-weight hot plug/unplug of virtual processors. Many
domains will execute more efficiently on a small number of virtual
processors that get to run continuously than on a large number of
processors that run only occasionally. The hypervisor should make it easy
for domains to adjust their virtual processor counts to match the
processor resources they are getting, and should encourage them to do so.
Effective space-sharing of the machine will lessen the problems that have
to be addressed with gang scheduling and preemption notification.

Gang scheduling.
----------------

There will obviously be domains that need and get more than one
processor's worth of CPU resources, but that's not the only reason that
multiprocessor domains will exist. Some operating systems or the
applications they support will not be amenable to dynamic reconfiguration.
Applications may be written and compiled for inflexible programming
environments, they may be aware of the memory topologies of the machines
on which they are running and not be capable of adapting to new
topologies, or they may simply depend on the expanded cache resources of a
multiprocessor in addition to the processor cycles. For these reasons, the
hypervisor will have to support time-shared multiprocessor domains, and
such domains will execute most efficiently if they are gang-scheduled.

If the operating system is gang-scheduled on its virtual multiprocessor,
it will never have problems with running virtual processors waiting for
kernel locks held by suspended virtual processors, and it will be able to
provide gang scheduling and other multiprocessor scheduling support for
its multiprocessor applications.

We believe that a gang scheduler can be implemented above the existing
proportional-share scheduler. It would adjust the weights of the virtual
processors it manages according to a predetermined schedule, independently
on each physical processor. The schedule would be changed periodically in
response to workload changes.
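As a rough illustration of this layering, the sketch below boosts the
weight of the scheduled gang member on each physical processor according
to a shared, precomputed slot table. It is only a sketch under assumed
interfaces: gang_table, set_vcpu_weight(), vcpu_domid(),
for_each_vcpu_on_cpu() and the constants are hypothetical names, not
existing Xen code.

    /*
     * Sketch only: a gang-scheduling layer driven from the per-CPU
     * scheduler tick.  All names below are hypothetical.
     */
    #define NR_SLOTS     8        /* slots in one scheduling frame         */
    #define SLOT_LEN_MS  10       /* length of each slot, in milliseconds  */
    #define GANG_WEIGHT  10000    /* weight for the gang member in a slot  */
    #define OTHER_WEIGHT 1        /* weight for everyone else              */

    /*
     * gang_table[slot][cpu] names the domain whose virtual processor
     * should own that physical processor during that slot; -1 leaves the
     * slot to the ordinary proportional-share mix.  The table would be
     * recomputed periodically as the workload changes.
     */
    static int gang_table[NR_SLOTS][NR_CPUS];

    /*
     * Called from each physical processor's scheduler tick.  Every
     * processor consults the same precomputed table, so the members of a
     * gang are boosted at the same time without any cross-processor
     * communication on the fast path.
     */
    static void gang_tick(unsigned int cpu, unsigned long now_ms)
    {
        unsigned int slot = (now_ms / SLOT_LEN_MS) % NR_SLOTS;
        int owner = gang_table[slot][cpu];
        struct vcpu *v;

        for_each_vcpu_on_cpu(cpu, v) {
            int w = (owner >= 0 && vcpu_domid(v) == owner)
                        ? GANG_WEIGHT : OTHER_WEIGHT;
            set_vcpu_weight(v, w);
        }
    }

Whether simple weight boosting keeps gang members co-scheduled tightly
enough in practice would have to be measured; the point of the sketch is
only that no new scheduler core would be required.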
Preemption notification.
------------------------

Gang scheduling is ideal to the extent that it is possible, but it cannot
solve all multiprocessor scheduling problems. Consider, for example, a
two-processor machine running a two-processor guest domain together with
an I/O-hosting domain that uses part of just one processor. A pure gang
scheduler would leave the second processor idle whenever the I/O domain
needs the first processor, even though the guest domain may still be able
to use the cycles on the second processor.

We believe the hypervisor should schedule multiprocessor domains as gangs
whenever possible, but should also have a fall-back mechanism for use when
gang scheduling is hard or impossible. For one thing, a fall-back
mechanism takes the pressure off the gang scheduler. It can "cherry pick"
the multiprocessor domains that have special requirements or are easy to
schedule, and leave the rest to fend for themselves.

We propose to let the standard proportional-share scheduler handle all the
virtual processors of non-gang-scheduled domains independently, and to
provide those domains with preemption notifications to give them a chance
to avoid the inefficiencies associated with preempting holders of kernel
or user-level locks.

We discuss possible optimizations in a later section, but as a first cut
we propose that the hypervisor post a preemption interrupt to a running
virtual processor whose physical processor is needed for something else.
The virtual processor will be allowed to continue but is expected to yield
the processor voluntarily within a short amount of time. (Uhlig et al. [1]
found that in Linux 2.4.20, locks are held for less than 20 microseconds
in 98% of lock acquisitions, so a time extension on the order of 50
microseconds might be reasonable.) If the virtual processor doesn't yield
quickly enough, the hypervisor will preempt it in the usual manner. The
proportional-share scheduler will account for any extra time given to a
virtual processor and will factor it into its scheduling algorithm.

Operating systems are free to use the preemption notification interrupt as
appropriate for their own needs. AIX could use the interrupt to drive its
mechanism for avoiding user-level lock-holder preemption. For Linux, we
propose that the interrupt handler simply create (or awaken) a
high-priority kernel thread that yields once to the hypervisor as soon as
it runs and then dies (or goes back to sleep). Under the "preemptable"
variant of the Linux kernel, the thread will run as soon as the kernel
reaches a state in which no kernel locks are held (which may be
immediately). From the Linux scheduler's viewpoint, the virtual processor
is busy running the kernel thread, so whatever application process
happened to be running on the virtual processor will be put to bed cleanly
and will be available to run on other virtual processors. Behavior with
respect to user-level locks will be no worse than it is today for Linux
running natively.

The high-priority thread idea is similar in some respects to the notion of
time-ballooning introduced by Uhlig et al. [1]. The key idea is that an
artificial workload, representing the time that a virtual processor is NOT
running, is introduced to the Linux scheduler, causing it to properly
allocate the resources it actually controls.
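To make the Linux proposal concrete, here is a minimal guest-side sketch.
It assumes the notification arrives as an ordinary interrupt (the handler
signature follows current kernels); preempt_notify_irq(), yield_thread_fn()
and hypervisor_yield() are hypothetical names, the last standing in for
whatever yield hypercall the hypervisor provides, and details such as
interrupt registration and thread priority are omitted.

    /*
     * Sketch only: guest-side handling of the preemption notification.
     * hypervisor_yield() is a hypothetical wrapper for the actual yield
     * hypercall.
     */
    #include <linux/interrupt.h>
    #include <linux/kthread.h>
    #include <linux/sched.h>

    extern void hypervisor_yield(void);   /* hypothetical yield hypercall */

    static struct task_struct *yield_task;

    /*
     * Interrupt handler: just wake the high-priority yield thread.  The
     * actual yield happens in thread context, once the (preemptable)
     * kernel reaches a point where no kernel locks are held.
     */
    static irqreturn_t preempt_notify_irq(int irq, void *dev_id)
    {
        wake_up_process(yield_task);
        return IRQ_HANDLED;
    }

    /*
     * Body of the high-priority kernel thread: each time it is woken it
     * yields the virtual processor back to the hypervisor once, then goes
     * back to sleep.  A wakeup lost while the thread is already running
     * only costs one forced preemption, so no further synchronization is
     * needed.
     */
    static int yield_thread_fn(void *unused)
    {
        while (!kthread_should_stop()) {
            set_current_state(TASK_INTERRUPTIBLE);
            schedule();             /* sleep until the interrupt wakes us */
            hypervisor_yield();     /* give the physical processor back   */
        }
        return 0;
    }

    /*
     * Creation, e.g. in the paravirtualized setup code; the thread would
     * also be given real-time priority so that it preempts whatever
     * process happens to be running:
     *
     *     yield_task = kthread_run(yield_thread_fn, NULL, "preempt_yield");
     */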
Possible optimizations.
-----------------------

The obvious downside to always posting preemption notification interrupts
is the extra round-trip from hypervisor to kernel and back on each
preemption. Dynamic reconfiguration and gang scheduling should reduce the
number of domains that require preemption notification. It's not clear
that the extra round-trip for the remaining multiprocessor domains will
have a measurable impact on performance, but if it does, there are steps
we might take to lessen the impact further.

If the preempting domain is expected to run for a very short time (because
it's an I/O-hosting domain, for example), the hypervisor might delay
notifying the interrupted domain, in the hope that it can be resumed
without suffering the notification overhead. The notification can be
delivered if the preempting domain runs longer than expected.

We might also consider a scheme like that described by Uhlig et al. [1],
in which the kernel sets a flag (in space visible to the hypervisor) to
request that preemption be delayed. In our case, the flag would simply
indicate whether or not preemption notification is needed. The kernel
would set the flag whenever it holds a lock or is running an application
process that requires special processing. For Linux, the flag would
probably be tied to the state variable that controls kernel thread
preemption (again assuming the "preemptable" variant of the Linux kernel).
If the flag is set when the hypervisor wants to take the processor, it
posts a preemption notification interrupt as before and gives the virtual
processor a bit more time. Otherwise the hypervisor is free to preempt the
virtual processor without notification. (A sketch of the hypervisor-side
check appears at the end of this section.)

The "delay preemption" flag idea could be extended to user-level
applications if we learn that the kernel-only flag is too coarse-grained
to support preemption-sensitive applications efficiently.

If it turns out that operating systems are not always able to respond to
preemption notifications in a timely manner and are therefore sometimes
being preempted involuntarily, we might consider a scheme (supported by
IBM's production hypervisor [2] and used by AIX and Linux) in which a
virtual processor that cannot acquire a lock yields its time-slice
explicitly to the lock holder. Such a scheme requires tracking of lock
owners, an overhead that may not otherwise be necessary and may not be
justified.
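Returning to the "delay preemption" flag, the sketch below shows how the
hypervisor-side preemption path might consult it. It is a sketch only:
delay_preempt_flag(), post_preempt_notification(), set_grace_timer() and
force_preempt() are hypothetical names, and the grace period is just the
50-microsecond figure suggested by the measurements cited above.

    /*
     * Sketch only: the hypervisor-side preemption path with the "delay
     * preemption" flag.  All function names are hypothetical.
     */
    #define PREEMPT_GRACE_US 50

    static void preempt_vcpu(struct vcpu *v)
    {
        if (!delay_preempt_flag(v)) {
            /* The guest holds no locks and asked for no warning:
             * preempt immediately, with no notification round-trip. */
            force_preempt(v);
            return;
        }

        /* The guest asked to be warned: post the notification interrupt
         * and give it a short grace period in which to yield on its own.
         * The extra time is charged back to the domain by the
         * proportional-share scheduler.  If the grace period expires,
         * the timer callback preempts the virtual processor anyway. */
        post_preempt_notification(v);
        set_grace_timer(v, PREEMPT_GRACE_US, force_preempt);
    }

Without the flag optimization, the same path would simply always take the
second branch.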
Implementation plan and request for comments.
---------------------------------------------

We propose to begin by implementing a preemption-notification mechanism,
because such a mechanism will be needed in any case, and because it will
address some of the preemption inefficiencies that are occurring on
small-scale systems today. Once a basic notification mechanism is working,
we can evaluate its performance impact and decide whether to work on
optimizations or on larger issues like gang scheduling and dynamic
reconfiguration.

In short, we propose to:

1) Add a preemption notification mechanism to Xen, with an interrupt to be
   posted whenever Xen wants to preempt a virtual processor belonging to a
   multiprocessor domain. Include a timeout to keep domains from running
   too long without yielding. Make sure the Xen scheduler properly
   accounts for the extra time given to virtual processors and factors it
   into its scheduling decisions.

2) Add a preemption notification interrupt handler to Linux, and have it
   schedule a high-priority thread that yields back to the hypervisor.

3) Study the performance impact of preemption notification, and evaluate
   the need for any of the optimizations that might lessen the impact.

4a) Work on optimizations.

4b) Investigate possibilities for dynamic reconfiguration.

4c) Investigate adding a gang-scheduling capability to Xen's scheduling
    infrastructure.

Comments, anyone?

References.
-----------

[1] Volkmar Uhlig, Joshua LeVasseur, Espen Skoglund, and Uwe Dannowski.
    Towards Scalable Multiprocessor Virtual Machines. May 2004.

[2] IBM hypervisor specification, to be published soon. For now, see the
    Linux/PowerPC code.