
Re: [Xen-devel] [PATCH RFC] xen: if on Xen, "flatten" the scheduling domain hierarchy

On August 18, 2015 8:55:32 AM PDT, Dario Faggioli <dario.faggioli@xxxxxxxxxx> wrote:
>Hey everyone,
>So, as a follow-up to what we were discussing in this thread:
> [Xen-devel] PV-vNUMA issue: topology is misinterpreted by the guest
>I started looking in more detail at scheduling domains in the Linux
>kernel. Now, that thread was about CPUID and vNUMA, and their weird way
>of interacting, while what I'm proposing here is completely
>independent of both.
>In fact, no matter whether vNUMA is supported and enabled, and no
>matter whether CPUID is reporting accurate, random, meaningful or completely
>misleading information, I think that we should do something about how
>scheduling domains are built.
>Fact is, unless we use 1:1 pinning that stays immutable for the whole
>lifetime of the guest, scheduling domains in Linux should not be
>constructed by looking at *any* topology information, because that just
>does not make sense when vCPUs move around.
>Let me state this again (hoping to make myself as clear as possible):
>no matter how good we make CPUID support, and no matter how
>beautifully and consistently it interacts with vNUMA,
>licensing requirements and whatever else, it will always be possible
>for vCPU #0 and vCPU #3 to be scheduled on two SMT threads at time t1, and
>on two different NUMA nodes at time t2. Hence, the Linux scheduler
>should really not skew its load balancing logic toward either of those two
>situations, as neither of them can be considered correct (since
>nothing is!).
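To make the t1/t2 point concrete, here is a small toy model in Python (not Xen or Linux code). The host layout, the `PCpu` type, the `relation()` helper and the two placements are all hypothetical; the sketch only shows that the same pair of vCPUs can look like SMT siblings under one placement and like remote NUMA nodes under another.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PCpu:
    node: int    # NUMA node
    core: int    # physical core, unique across the host
    thread: int  # SMT thread within the core


def relation(a: PCpu, b: PCpu) -> str:
    """Classify how two pCPUs are related in the host topology."""
    if a == b:
        return "same pCPU"
    if a.core == b.core:
        return "SMT siblings"
    if a.node == b.node:
        return "same NUMA node"
    return "different NUMA nodes"


# Hypothetical host: 2 NUMA nodes, 2 cores per node, 2 threads per core.
HOST = [PCpu(node=n, core=n * 2 + c, thread=t)
        for n in range(2) for c in range(2) for t in range(2)]

# Hypothetical vCPU -> pCPU placements picked by the hypervisor scheduler.
placement_t1 = {0: HOST[0], 3: HOST[1]}  # two threads of the same core
placement_t2 = {0: HOST[0], 3: HOST[7]}  # cores on opposite NUMA nodes

for label, placement in (("t1", placement_t1), ("t2", placement_t2)):
    print(label, "vCPU #0 vs vCPU #3:", relation(placement[0], placement[3]))
```

Whatever relationship the guest is told about at boot, the model shows it can silently stop being true at the next migration.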

What about Windows guests?

>For now, this only covers the PV case. The HVM case shouldn't be any
>different, but I haven't looked at how to make the same thing happen
>there as well.
>What this RFC patch does is, in the Xen PV case, configure scheduling
>domains in such a way that there is only one of them, spanning all the
>vCPUs of the guest.
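A rough way to picture the difference is a toy model of per-CPU domain spans, sketched below in Python under stated assumptions: it is not kernel code, the 4-vCPU layout and the `smt_mask()`/`build_domains()` helpers are hypothetical, and the two-level table is only loosely modeled on Linux's default x86 levels. Only the single 'VCPU' level name comes from this email.

```python
from typing import Callable, Dict, FrozenSet, List, Tuple

# A level pairs a name with a function mapping a CPU to the CPUs its domain spans.
Level = Tuple[str, Callable[[int], FrozenSet[int]]]

CPUS: FrozenSet[int] = frozenset(range(4))


def smt_mask(cpu: int) -> FrozenSet[int]:
    """Hypothetical guest-visible SMT pairs: {0, 1} and {2, 3}."""
    first = cpu - cpu % 2
    return frozenset({first, first + 1})


def all_cpus_mask(cpu: int) -> FrozenSet[int]:
    """Every CPU of the guest, regardless of topology."""
    return CPUS


# Topology-derived hierarchy, innermost level first.
HIERARCHICAL: List[Level] = [("SMT", smt_mask), ("DIE", all_cpus_mask)]
# Flattened hierarchy: a single level spanning all (v)CPUs.
FLAT: List[Level] = [("VCPU", all_cpus_mask)]


def build_domains(levels: List[Level]) -> Dict[int, List[Tuple[str, FrozenSet[int]]]]:
    """For each CPU, list (level name, span), innermost level first."""
    return {cpu: [(name, mask(cpu)) for name, mask in levels] for cpu in sorted(CPUS)}


for title, levels in (("hierarchical", HIERARCHICAL), ("flat", FLAT)):
    print(title)
    for cpu, doms in build_domains(levels).items():
        print(f"  cpu{cpu}:", [(name, sorted(span)) for name, span in doms])
```

With the flat table every CPU ends up with exactly one domain covering all CPUs, so there is no topology-based inner level for the load balancer to favour.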

Wow. That is a pretty simple patch!!

>Note that the patch deals directly with scheduling domains, and there is
>no need to alter the masks that will then be used for building and
>reporting the topology (via CPUID, /proc/cpuinfo, sysfs, etc.). That is
>the main difference between it and the patch proposed by Juergen here:
>This means that when, in future, we fix CPUID handling and make it
>comply with whatever logic or requirements we want, that won't have
>unexpected side effects on scheduling domains.
>Information about how the scheduling domains are being constructed
>during boot is available in `dmesg', if the kernel is booted with the
>'sched_debug' parameter. It is also possible to look
>at /proc/sys/kernel/sched_domain/cpu*, and at /proc/schedstat.
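As a quick check of the /proc tree mentioned above, the following Python sketch counts the `domainN` subdirectories under each `cpuN` directory. It assumes the `cpuN/domainM` layout described in this email (exposed on kernels built with CONFIG_SCHED_DEBUG); later kernels moved this information out of /proc, so the root path is a parameter and the script only reports when the tree is missing. `domains_per_cpu()` is a hypothetical helper name.

```python
import os
import re
import sys
from typing import Dict

DEFAULT_ROOT = "/proc/sys/kernel/sched_domain"


def domains_per_cpu(root: str = DEFAULT_ROOT) -> Dict[str, int]:
    """Map each cpuN directory under root to its number of domainM subdirectories."""
    cpus = [c for c in os.listdir(root)
            if re.fullmatch(r"cpu\d+", c) and os.path.isdir(os.path.join(root, c))]
    counts: Dict[str, int] = {}
    # Sort numerically so that cpu10 comes after cpu9, not after cpu1.
    for cpu in sorted(cpus, key=lambda c: int(c[3:])):
        cpu_dir = os.path.join(root, cpu)
        counts[cpu] = sum(
            1 for d in os.listdir(cpu_dir)
            if re.fullmatch(r"domain\d+", d) and os.path.isdir(os.path.join(cpu_dir, d))
        )
    return counts


if __name__ == "__main__":
    root = sys.argv[1] if len(sys.argv) > 1 else DEFAULT_ROOT
    if not os.path.isdir(root):
        print(f"{root} not found (kernel without this sysctl tree?)")
    else:
        counts = domains_per_cpu(root)
        for cpu, n in counts.items():
            print(f"{cpu}: {n} domain(s)")
        # With the RFC patch applied, each CPU should show only domain0.
        flat = bool(counts) and all(n == 1 for n in counts.values())
        print("single-level hierarchy" if flat else "multi-level (or empty) hierarchy")
```

Run it with an optional root directory as the only argument; on a guest with the patch applied, each CPU is expected to report a single domain.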
>With the patch applied, only one scheduling domain is created, called
>the 'VCPU' domain, spanning all the guest's (or Dom0's) vCPUs. You can
>tell that from the fact that every cpu* folder
>in /proc/sys/kernel/sched_domain/ has only one subdirectory
>('domain0'), with all the tweaks and the tunables for our scheduling
>domain.
>Basically, the kind of feedback I'd be really glad to hear is:
> - what you guys think of the approach,
> - whether you think, looking at this preliminary set of numbers, that
>   this is something worth investigating further,
> - if yes, what other workloads and benchmarks it would make sense to
>   throw at it.

The thing that I was worried about is that we would be modifying the generic 
code, but your changes are all in Xen code!


In terms of workloads, I am CCing Herbert, who I hope can provide advice on this.

Herbert, the full email is here: 

>Thanks and Regards,

Xen-devel mailing list


