Scalable SMP Scheduling Proposal
--------------------------------

In a multiprocessor server running virtualized environments, it is common
to virtualize the physical processors, providing multiprocessor
environments for guest operating systems. Modern operating systems use
locks to ensure exclusive access to critical regions of code or data.
Inefficiencies occur when one virtual processor of a multiprocessor domain
is preempted while holding a lock, while other virtual processors of the
same domain continue to run on other processors, waiting for the lock to
be released.

This problem can affect user applications as well as the kernel. Important
subsystems (databases, for example) or large-scale scientific applications
assume exclusive access to processors and tune their synchronization
methods accordingly. Some operating systems provide gang scheduling for
their applications for just this reason. Other systems (e.g., AIX) provide
interfaces by which application threads can indicate that they're holding
locks and should not be preempted. These operating system mechanisms are
ineffective in a virtualized environment in which the system itself (and
whatever application it happens to be running) can be preempted on
individual processors at arbitrary times.

To address these issues, we propose a three-level approach (the levels are
largely independent and can be implemented in any order):

1) Where possible, reconfigure domains dynamically so that their virtual
   processor counts match the processor resources they are actually
   receiving.

2) Where possible, gang-schedule multiprocessor domains.

3) Provide a preemption notification mechanism for multiprocessor domains
   that cannot be gang-scheduled.

We discuss each of these levels in turn and then lay out a concrete
implementation proposal.

Dynamic reconfiguration.
------------------------

Some domains are capable of accepting or releasing virtual processors as
their internal workloads change or as the overall workload on the
multiprocessor changes. This addition or removal can take the form of
activating or deactivating virtual processors internally, or it can
involve a heavier-weight hot plug/unplug of virtual processors. Many
domains will execute more efficiently on a small number of virtual
processors that get to run continuously than on a large number of
processors that run only occasionally. The hypervisor should make it easy
for domains to adjust their virtual processor counts to match the
processor resources they are getting, and should encourage them to do so.
Effective space-sharing of the machine will lessen the problems that have
to be addressed with gang scheduling and preemption notification.

Gang scheduling.
----------------

There will obviously be domains that need and get more than one
processor's worth of CPU resources, but that's not the only reason that
multiprocessor domains will exist. Some operating systems or the
applications they support will not be amenable to dynamic reconfiguration.
Applications may be written and compiled for inflexible programming
environments, they may be aware of the memory topologies of the machines
on which they are running and not be capable of adapting to new
topologies, or they may simply depend on the expanded cache resources of a
multiprocessor in addition to the processor cycles. For these reasons, the
hypervisor will have to support time-shared multiprocessor domains, and
such domains will execute most efficiently if they are gang-scheduled.

If the operating system is gang-scheduled on its virtual multiprocessor,
it will never have problems with running virtual processors waiting for
kernel locks held by suspended virtual processors, and it will be able to
provide gang scheduling and other multiprocessor scheduling support for
its multiprocessor applications.

We believe that a gang scheduler can be implemented above the existing
proportional-share scheduler. It would adjust the weights of the virtual
processors it manages according to a predetermined schedule, independently
on each physical processor. The schedule would be changed periodically in
response to workload changes.
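As a rough illustration of this layering, the sketch below boosts the
weight of the scheduled gang member on each physical processor according
to a shared, precomputed slot table. It is only a sketch under assumed
interfaces: gang_table, set_vcpu_weight(), vcpu_domid(),
for_each_vcpu_on_cpu() and the constants are hypothetical names, not
existing Xen code.

    /*
     * Sketch only: a gang-scheduling layer driven from the per-CPU
     * scheduler tick.  All names below are hypothetical.
     */
    #define NR_SLOTS     8        /* slots in one scheduling frame         */
    #define SLOT_LEN_MS  10       /* length of each slot, in milliseconds  */
    #define GANG_WEIGHT  10000    /* weight for the gang member in a slot  */
    #define OTHER_WEIGHT 1        /* weight for everyone else              */

    /*
     * gang_table[slot][cpu] names the domain whose virtual processor
     * should own that physical processor during that slot; -1 leaves the
     * slot to the ordinary proportional-share mix.  The table would be
     * recomputed periodically as the workload changes.
     */
    static int gang_table[NR_SLOTS][NR_CPUS];

    /*
     * Called from each physical processor's scheduler tick.  Every
     * processor consults the same precomputed table, so the members of a
     * gang are boosted at the same time without any cross-processor
     * communication on the fast path.
     */
    static void gang_tick(unsigned int cpu, unsigned long now_ms)
    {
        unsigned int slot = (now_ms / SLOT_LEN_MS) % NR_SLOTS;
        int owner = gang_table[slot][cpu];
        struct vcpu *v;

        for_each_vcpu_on_cpu(cpu, v) {
            int w = (owner >= 0 && vcpu_domid(v) == owner)
                        ? GANG_WEIGHT : OTHER_WEIGHT;
            set_vcpu_weight(v, w);
        }
    }

Whether simple weight boosting keeps gang members co-scheduled tightly
enough in practice would have to be measured; the point of the sketch is
only that no new scheduler core would be required.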
Preemption notification.
------------------------

Gang scheduling is ideal to the extent that it is possible, but it cannot
solve all multiprocessor scheduling problems. Consider, for example, a
two-processor machine running a two-processor guest domain together with
an I/O-hosting domain that uses part of just one processor. A pure gang
scheduler would leave the second processor idle whenever the I/O domain
needs the first processor, even though the guest domain may still be able
to use the cycles on the second processor.

We believe the hypervisor should schedule multiprocessor domains as gangs
whenever possible, but should also have a fall-back mechanism for use when
gang scheduling is hard or impossible. For one thing, a fall-back
mechanism takes the pressure off the gang scheduler. It can "cherry pick"
the multiprocessor domains that have special requirements or are easy to
schedule, and leave the rest to fend for themselves.

We propose to let the standard proportional-share scheduler handle all the
virtual processors of non-gang-scheduled domains independently, and to
provide those domains with preemption notifications to give them a chance
to avoid the inefficiencies associated with preempting holders of kernel
or user-level locks.

We discuss possible optimizations in a later section, but as a first cut
we propose that the hypervisor post a preemption interrupt to a running
virtual processor whose physical processor is needed for something else.
The virtual processor will be allowed to continue but is expected to yield
the processor voluntarily within a short amount of time. (Uhlig et al. [1]
found that in Linux 2.4.20, locks are held for less than 20 microseconds
in 98% of lock acquisitions, so a time extension on the order of 50
microseconds might be reasonable.) If the virtual processor doesn't yield
quickly enough, the hypervisor will preempt it in the usual manner. The
proportional-share scheduler will account for any extra time given to a
virtual processor and will factor it into its scheduling algorithm.

Operating systems are free to use the preemption notification interrupt as
appropriate for their own needs. AIX could use the interrupt to drive its
mechanism for avoiding user-level lock-holder preemption. For Linux, we
propose that the interrupt handler simply create (or awaken) a
high-priority kernel thread that yields once to the hypervisor as soon as
it runs and then dies (or goes back to sleep). Under the "preemptable"
variant of the Linux kernel, the thread will run as soon as the kernel
reaches a state in which no kernel locks are held (which may be
immediately). From the Linux scheduler's viewpoint, the virtual processor
is busy running the kernel thread, so whatever application process
happened to be running on the virtual processor will be put to bed cleanly
and will be available to run on other virtual processors. Behavior with
respect to user-level locks will be no worse than it is today for Linux
running natively.

The high-priority thread idea is similar in some respects to the notion of
time-ballooning introduced by Uhlig et al. [1]. The key idea is that an
artificial workload, representing the time that a virtual processor is NOT
running, is introduced to the Linux scheduler, causing it to properly
allocate the resources it actually controls.
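To make the Linux proposal concrete, here is a minimal guest-side sketch.
It assumes the notification arrives as an ordinary interrupt (the handler
signature follows current kernels); preempt_notify_irq(), yield_thread_fn()
and hypervisor_yield() are hypothetical names, the last standing in for
whatever yield hypercall the hypervisor provides, and details such as
interrupt registration and thread priority are omitted.

    /*
     * Sketch only: guest-side handling of the preemption notification.
     * hypervisor_yield() is a hypothetical wrapper for the actual yield
     * hypercall.
     */
    #include <linux/interrupt.h>
    #include <linux/kthread.h>
    #include <linux/sched.h>

    extern void hypervisor_yield(void);   /* hypothetical yield hypercall */

    static struct task_struct *yield_task;

    /*
     * Interrupt handler: just wake the high-priority yield thread.  The
     * actual yield happens in thread context, once the (preemptable)
     * kernel reaches a point where no kernel locks are held.
     */
    static irqreturn_t preempt_notify_irq(int irq, void *dev_id)
    {
        wake_up_process(yield_task);
        return IRQ_HANDLED;
    }

    /*
     * Body of the high-priority kernel thread: each time it is woken it
     * yields the virtual processor back to the hypervisor once, then goes
     * back to sleep.  A wakeup lost while the thread is already running
     * only costs one forced preemption, so no further synchronization is
     * needed.
     */
    static int yield_thread_fn(void *unused)
    {
        while (!kthread_should_stop()) {
            set_current_state(TASK_INTERRUPTIBLE);
            schedule();             /* sleep until the interrupt wakes us */
            hypervisor_yield();     /* give the physical processor back   */
        }
        return 0;
    }

    /*
     * Creation, e.g. in the paravirtualized setup code; the thread would
     * also be given real-time priority so that it preempts whatever
     * process happens to be running:
     *
     *     yield_task = kthread_run(yield_thread_fn, NULL, "preempt_yield");
     */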
Possible optimizations.
-----------------------

The obvious downside to always posting preemption notification interrupts
is the extra round-trip from hypervisor to kernel and back on each
preemption. Dynamic reconfiguration and gang scheduling should reduce the
number of domains that require preemption notification. It's not clear
that the extra round-trip for the remaining multiprocessor domains will
have a measurable impact on performance, but if it does, there are steps
we might take to lessen the impact further.

If the preempting domain is expected to run for a very short time (because
it's an I/O-hosting domain, for example), the hypervisor might delay
notifying the interrupted domain, in the hope that it can be resumed
without suffering the notification overhead. The notification can be
delivered if the preempting domain runs longer than expected.

We might also consider a scheme like that described by Uhlig et al. [1],
in which the kernel sets a flag (in space visible to the hypervisor) to
request that preemption be delayed. In our case, the flag would simply
indicate whether or not preemption notification is needed. The kernel
would set the flag whenever it holds a lock or is running an application
process that requires special processing. For Linux, the flag would
probably be tied to the state variable that controls kernel thread
preemption (again assuming the "preemptable" variant of the Linux kernel).
If the flag is set when the hypervisor wants to take the processor, it
posts a preemption notification interrupt as before and gives the virtual
processor a bit more time. Otherwise the hypervisor is free to preempt the
virtual processor without notification. (A sketch of the hypervisor-side
check appears at the end of this section.)

The "delay preemption" flag idea could be extended to user-level
applications if we learn that the kernel-only flag is too coarse-grained
to support preemption-sensitive applications efficiently.

If it turns out that operating systems are not always able to respond to
preemption notifications in a timely manner and are therefore sometimes
being preempted involuntarily, we might consider a scheme (supported by
IBM's production hypervisor [2] and used by AIX and Linux) in which a
virtual processor that cannot acquire a lock yields its time-slice
explicitly to the lock holder. Such a scheme requires tracking of lock
owners, an overhead that may not otherwise be necessary and may not be
justified.
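Returning to the "delay preemption" flag, the sketch below shows how the
hypervisor-side preemption path might consult it. It is a sketch only:
delay_preempt_flag(), post_preempt_notification(), set_grace_timer() and
force_preempt() are hypothetical names, and the grace period is just the
50-microsecond figure suggested by the measurements cited above.

    /*
     * Sketch only: the hypervisor-side preemption path with the "delay
     * preemption" flag.  All function names are hypothetical.
     */
    #define PREEMPT_GRACE_US 50

    static void preempt_vcpu(struct vcpu *v)
    {
        if (!delay_preempt_flag(v)) {
            /* The guest holds no locks and asked for no warning:
             * preempt immediately, with no notification round-trip. */
            force_preempt(v);
            return;
        }

        /* The guest asked to be warned: post the notification interrupt
         * and give it a short grace period in which to yield on its own.
         * The extra time is charged back to the domain by the
         * proportional-share scheduler.  If the grace period expires,
         * the timer callback preempts the virtual processor anyway. */
        post_preempt_notification(v);
        set_grace_timer(v, PREEMPT_GRACE_US, force_preempt);
    }

Without the flag optimization, the same path would simply always take the
second branch.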
Implementation plan and request for comments.
---------------------------------------------

We propose to begin by implementing a preemption-notification mechanism,
because such a mechanism will be needed in any case, and because it will
address some of the preemption inefficiencies that are occurring on
small-scale systems today. Once a basic notification mechanism is working,
we can evaluate its performance impact and decide whether to work on
optimizations or on larger issues like gang scheduling and dynamic
reconfiguration.

In short, we propose to:

1) Add a preemption notification mechanism to Xen, with an interrupt to be
   posted whenever Xen wants to preempt a virtual processor belonging to a
   multiprocessor domain. Include a timeout to keep domains from running
   too long without yielding. Make sure the Xen scheduler properly
   accounts for the extra time given to virtual processors and factors it
   into its scheduling decisions.

2) Add a preemption notification interrupt handler to Linux, and have it
   schedule a high-priority thread that yields back to the hypervisor.

3) Study the performance impact of preemption notification, and evaluate
   the need for any of the optimizations that might lessen the impact.

4a) Work on optimizations.

4b) Investigate possibilities for dynamic reconfiguration.

4c) Investigate adding a gang-scheduling capability to Xen's scheduling
    infrastructure.

Comments, anyone?

References.
-----------

[1] Volkmar Uhlig, Joshua LeVasseur, Espen Skoglund, and Uwe Dannowski.
    Towards Scalable Multiprocessor Virtual Machines. May 2004.

[2] IBM hypervisor specification, to be published soon. For now, see the
    Linux/PowerPC code.