Re: [Xen-devel] [PATCH RFC] xen: if on Xen, "flatten" the scheduling domain hierarchy
On 08/18/2015 04:55 PM, Dario Faggioli wrote:
> Hey everyone,
>
> So, as a followup of what we were discussing in this thread:
>
>   [Xen-devel] PV-vNUMA issue: topology is misinterpreted by the guest
>   http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg03241.html
>
> I started looking in more detail at scheduling domains in the Linux
> kernel. Now, that thread was about CPUID and vNUMA, and their weird way
> of interacting, while this thing I'm proposing here is completely
> independent from them both.
>
> In fact, no matter whether vNUMA is supported and enabled, and no matter
> whether CPUID is reporting accurate, random, meaningful or completely
> misleading information, I think that we should do something about how
> scheduling domains are built.
>
> Fact is, unless we use 1:1, and immutable (across all the guest
> lifetime) pinning, scheduling domains should not be constructed, in
> Linux, by looking at *any* topology information, because that just does
> not make any sense, when vcpus move around.
>
> Let me state this again (hoping to make myself as clear as possible): no
> matter in how much good shape we put CPUID support, no matter how
> beautifully and consistently that will interact with both vNUMA,
> licensing requirements and whatever else. It will be always possible for
> vCPU #0 and vCPU #3 to be scheduled on two SMT threads at time t1, and
> on two different NUMA nodes at time t2. Hence, the Linux scheduler
> should really not skew its load balancing logic toward any of those two
> situations, as neither of them could be considered correct (since
> nothing is!).
>
> For now, this only covers the PV case. HVM case shouldn't be any
> different, but I haven't looked at how to make the same thing happen in
> there as well.
>
> OVERALL DESCRIPTION
> ===================
> What this RFC patch does is, in the Xen PV case, configure scheduling
> domains in such a way that there is only one of them, spanning all the
> vCPUs of the guest.
>
> Note that the patch deals directly with scheduling domains, and there is
> no need to alter the masks that will then be used for building and
> reporting the topology (via CPUID, /proc/cpuinfo, /sysfs, etc.). That is
> the main difference between it and the patch proposed by Juergen here:
> http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg05088.html
>
> This means that when, in future, we will fix CPUID handling and make it
> comply with whatever logic or requirements we want, that won't have any
> unexpected side effects on scheduling domains.
>
> Information about how the scheduling domains are being constructed
> during boot is available in `dmesg', if the kernel is booted with the
> 'sched_debug' parameter. It is also possible to look
> at /proc/sys/kernel/sched_domain/cpu*, and at /proc/schedstat.
>
> With the patch applied, only one scheduling domain is created, called
> the 'VCPU' domain, spanning all the guest's (or Dom0's) vCPUs. You can
> tell that from the fact that every cpu* folder
> in /proc/sys/kernel/sched_domain/ only has one subdirectory
> ('domain0'), with all the tweaks and the tunables for our scheduling
> domain.
>
> EVALUATION
> ==========
> I've tested this with UnixBench, and by looking at Xen build time, on
> 16, 24 and 48 pCPU hosts. I've run the benchmarks in Dom0 only, for
> now, but I plan to re-run them in DomUs soon (Juergen may be doing
> something similar to this in DomU already, AFAUI).
>
> I've run the benchmarks with and without the patch applied ('patched'
> and 'vanilla', respectively, in the tables below), and with different
> numbers of build jobs (in case of the Xen build) or of parallel copies of
> the benchmarks (in the case of UnixBench).
>
> What I get from the numbers is that the patch almost always brings
> benefits, in some cases even huge ones.
> There are a couple of cases
> where we regress, but always only slightly so, especially if comparing
> that to the magnitude of some of the improvement that we get.
>
> Bear also in mind that these results are gathered from Dom0, and without
> any overcommitment at the vCPU level (i.e., nr. vCPUs == nr pCPUs). If
> we move things in DomU and do overcommit at the Xen scheduler level, I
> am expecting even better results.
>
> RESULTS
> =======
> To have a quick idea of how a benchmark went, look at the '%
> improvement' row of each table.
>
> I'll put these results online, in a googledoc spreadsheet or something
> like that, to make them easier to read, as soon as possible.
>
> *** Intel(R) Xeon(R) E5620 @ 2.40GHz
> *** pCPUs      16        DOM0 vCPUS   16
> *** RAM        12285 MB  DOM0 Memory  9955 MB
> *** NUMA nodes 2
> ============================================================================================================
> MAKE XEN (lower == better)
> ============================================================================================================
> # of build jobs             -j1               -j6               -j8            -j16**              -j24
> vanilla/patched     vanilla  patched  vanilla  patched  vanilla  patched  vanilla  patched  vanilla  patched
> ------------------------------------------------------------------------------------------------------------
>                      153.72   152.41    35.33    34.93     30.7    30.33    26.79    25.97    26.88    26.21
>                      153.81   152.76    35.37    34.99    30.81    30.36    26.83    26.08       27    26.24
>                      153.93   152.79    35.37    35.25    30.92    30.39    26.83    26.13    27.01    26.28
>                      153.94   152.94    35.39    35.28    31.05    30.43     26.9    26.14    27.01    26.44
>                      153.98   153.06    35.45    35.31    31.17     30.5    26.95    26.18    27.02    26.55
>                      154.01   153.23     35.5    35.35     31.2    30.59    26.98     26.2    27.05    26.61
>                      154.04   153.34    35.56    35.42    31.45    30.76    27.12    26.21    27.06    26.78
>                      154.16    153.5    37.79    35.58    31.68    30.83    27.16    26.23    27.16    26.78
>                      154.18   153.71    37.98    35.61    33.73     30.9    27.49    26.32    27.16     26.8
>                       154.9   154.67    38.03    37.64    34.69    31.69    29.82    26.38     27.2    28.63
> ------------------------------------------------------------------------------------------------------------
> Avg.                154.067  153.241   36.177   35.536    31.74   30.678   27.287   26.184   27.055   26.732
> ------------------------------------------------------------------------------------------------------------
> Std. Dev.             0.325    0.631    1.215    0.771    1.352    0.410    0.914    0.116    0.095    0.704
> ------------------------------------------------------------------------------------------------------------
> % improvement             0.536             1.772             3.346             4.042             1.194
> ============================================================================================================
> =================================================================================================================================
> UNIXBENCH
> =================================================================================================================================
> # parallel copies                            1 parallel        6 parallel        8 parallel     16 parallel**       24 parallel
> vanilla/patched                          vanilla  patched  vanilla  patched  vanilla  patched  vanilla  patched  vanilla  patched
> ---------------------------------------------------------------------------------------------------------------------------------
> Dhrystone 2 using register variables      2302.2   2302.1  13157.8  12262.4  15691.5  15860.1  18927.7  19078.5  18654.3  18855.6
> Double-Precision Whetstone                 620.2    620.2   3481.2   3566.9   4669.2   4551.5   7610.1   7614.3  11558.9  11561.3
> Execl Throughput                           184.3    186.7    884.6    905.3   1168.4   1213.6   2134.6   2210.2   2250.9     2265
> File Copy 1024 bufsize 2000 maxblocks      780.8    783.3   1243.7   1255.5   1250.6   1215.7   1080.9   1094.2   1069.8   1062.5
> File Copy 256 bufsize 500 maxblocks        479.8    482.8    781.8    803.6    806.4      781    682.9    707.7    698.2    694.6
> File Copy 4096 bufsize 8000 maxblocks     1617.6   1593.5   2739.7   2943.4   2818.3   2957.8   2389.6   2412.6   2371.6   2423.8
> Pipe Throughput                            363.9    361.6   2068.6   2065.6     2622   2633.5   4053.3   4085.9   4064.7   4076.7
> Pipe-based Context Switching                70.6    207.2    369.1   1126.8    623.9   1431.3   1970.4   2082.9   1963.8     2077
> Process Creation                           103.1      135      503    677.6    618.7    855.4     1138   1113.7   1195.6     1199
> Shell Scripts (1 concurrent)               723.2    765.3   4406.4   4334.4   5045.4   5002.5   5861.9   5844.2   5958.8   5916.1
> Shell Scripts (8 concurrent)              2243.7   2715.3   5694.7   5663.6   5694.7   5657.8   5637.1   5600.5   5582.9   5543.6
> System Call Overhead                         330    330.1   1669.2   1672.4   2028.6   1996.6   2920.5   2947.1   2923.9   2952.5
> System Benchmarks Index Score              496.8    567.5   1861.9     2106   2220.3   2441.3   2972.5   3007.9   3103.4   3125.3
> ---------------------------------------------------------------------------------------------------------------------------------
> % increase (of the Index Score)               14.231            13.110             9.954             1.191             0.706
> =================================================================================================================================
>
> *** Intel(R) Xeon(R) X5650 @ 2.67GHz
> *** pCPUs      24        DOM0 vCPUS   16
> *** RAM        36851 MB  DOM0 Memory  9955 MB
> *** NUMA nodes 2
> ============================================================================================================
> MAKE XEN (lower == better)
> ============================================================================================================
> # of build jobs             -j1               -j8              -j12            -j24**              -j32
> vanilla/patched     vanilla  patched  vanilla  patched  vanilla  patched  vanilla  patched  vanilla  patched
> ------------------------------------------------------------------------------------------------------------
>                      119.49   119.47    23.37    23.29    20.12    19.85    17.99     17.9    17.82     17.8
>                      119.59   119.64    23.52    23.31    20.16    19.99    18.19    18.05    18.23    17.89
>                      119.59   119.65    23.53    23.35    20.19    20.08    18.26    18.09    18.35    17.91
>                      119.72   119.75    23.63    23.41     20.2    20.14    18.54     18.1     18.4    17.95
>                      119.95   119.86    23.68    23.42    20.24    20.19    18.57    18.15    18.44    18.03
>                      119.97    119.9    23.72    23.51    20.38    20.31    18.61    18.21    18.49    18.03
>                      119.97   119.91    25.03    23.53    20.38    20.42    18.75    18.28    18.51    18.08
>                      120.01   119.98    25.05    23.93    20.39    21.69    19.99    18.49    18.52     18.6
>                      120.24   119.99    25.12    24.19    21.67    21.76    20.08    19.74    19.73    19.62
>                      120.66   121.22    25.16    25.36    21.94    21.85    20.26     20.3    19.92    19.81
> ------------------------------------------------------------------------------------------------------------
> Avg.                119.919  119.937   24.181    23.73   20.567   20.628   18.924   18.531   18.641   18.372
> ------------------------------------------------------------------------------------------------------------
> Std. Dev.             0.351    0.481    0.789    0.642    0.663    0.802    0.851    0.811    0.658    0.741
> ------------------------------------------------------------------------------------------------------------
> % improvement            -0.015             1.865            -0.297             2.077             1.443
> ============================================================================================================
> =================================================================================================================================
> UNIXBENCH
> =================================================================================================================================
> # parallel copies                            1 parallel        8 parallel       12 parallel     24 parallel**       32 parallel
> vanilla/patched                          vanilla  patched  vanilla  patched  vanilla  patched  vanilla  patched  vanilla  patched
> ---------------------------------------------------------------------------------------------------------------------------------
> Dhrystone 2 using register variables      2650.1   2664.6  18967.8  19060.4  27534.1  27046.8  30077.9  30110.6  30542.1  30358.7
> Double-Precision Whetstone                 713.7    713.5   5463.6   5455.1   7863.9   7923.8  12725.1  12727.8  17474.3  17463.3
> Execl Throughput                           280.9    283.8   1724.4   1866.5   2029.5   2367.6     2370   2521.3     2453   2506.8
> File Copy 1024 bufsize 2000 maxblocks      891.1    894.2     1423   1457.7   1385.6   1482.2   1226.1   1224.2   1235.9   1265.5
> File Copy 256 bufsize 500 maxblocks        546.9    555.4      949    972.1    882.8    878.6    821.9    817.7    784.7    810.8
> File Copy 4096 bufsize 8000 maxblocks     1743.4   1722.8   3406.5   3438.9   3314.3   3265.9   2801.9   2788.3   2695.2   2781.5
> Pipe Throughput                            426.8    423.4   3207.9     3234   4635.1   4708.9     7326   7335.3   7327.2   7319.7
> Pipe-based Context Switching               110.2    223.5    680.8   1602.2    998.6   2324.6   3122.1   3252.7   3128.6   3337.2
> Process Creation                           130.7    224.4   1001.3   1043.6     1209   1248.2   1337.9   1380.4   1338.6   1280.1
> Shell Scripts (1 concurrent)              1140.5   1257.5   5462.8   6146.4   6435.3   7206.1   7425.2   7636.2   7566.1   7636.6
> Shell Scripts (8 concurrent)                3492   3586.7   7144.9     7307     7258   7320.2   7295.1   7296.7   7248.6   7252.2
> System Call Overhead                       387.7    387.5   2398.4     2367   2793.8   2752.7   3735.7   3694.2   3752.1   3709.4
> System Benchmarks Index Score              634.8    712.6   2725.8   3005.7   3232.4   3569.7   3981.3   4028.8   4085.2   4126.3
> ---------------------------------------------------------------------------------------------------------------------------------
> % increase (of the Index Score)               12.256            10.269            10.435             1.193             1.006
> =================================================================================================================================
>
> *** Intel(R) Xeon(R) X5650 @ 2.67GHz
> *** pCPUs      48         DOM0 vCPUS   16
> *** RAM        393138 MB  DOM0 Memory  9955 MB
> *** NUMA nodes 2
> ============================================================================================================
> MAKE XEN (lower == better)
> ============================================================================================================
> # of build jobs             -j1              -j20              -j24            -j48**              -j62
> vanilla/patched     vanilla  patched  vanilla  patched  vanilla  patched  vanilla  patched  vanilla  patched
> ------------------------------------------------------------------------------------------------------------
>                      267.78   233.25    36.53    35.53    35.98    34.99    33.46    32.13    33.57    32.54
>                      268.42   233.92    36.82    35.56    36.12     35.2    34.24    32.24    33.64    32.56
>                      268.85   234.39    36.92    35.75    36.15    35.35    34.48    32.86    33.67    32.74
>                      268.98   235.11    36.96    36.01    36.25    35.46    34.73    32.89    33.97    32.83
>                      269.03   236.48    37.04    36.16    36.45    35.63    34.77    32.97    34.12    33.01
>                      269.54   237.05    40.33    36.59    36.57    36.15    34.97    33.09    34.18    33.52
>                      269.99   238.24    40.45    36.78    36.58    36.22    34.99    33.69    34.28    33.63
>                      270.11   238.48    41.13    39.98    40.22    36.24       38    33.92    34.35    33.87
>                      270.96   239.07    41.66    40.81    40.59    36.35    38.99    34.19    34.49    37.24
>                      271.84   240.89    42.07    41.24    40.63    40.06    39.07    36.04    34.69    37.59
> ------------------------------------------------------------------------------------------------------------
> Avg.                 269.55  236.688   38.991   37.441   37.554   36.165    35.77   33.402   34.096   33.953
> ------------------------------------------------------------------------------------------------------------
> Std. Dev.             1.213    2.503    2.312    2.288    2.031    1.452    2.079    1.142    0.379    1.882
> ------------------------------------------------------------------------------------------------------------
> % improvement            12.191             3.975             3.699             6.620             0.419
> ============================================================================================================

I'm a bit confused here as to why, if dom0 has 16 vcpus in all of your
tests, you change the -j number (apparently) based on the number of
pcpus available to Xen. Wouldn't it make more sense to stick with
1/6/8/16/24? That would allow us to have actually comparable numbers.
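
For anyone cross-checking the tables above: the '% improvement' and '% increase' rows appear to be the relative difference between the vanilla and patched averages (or Index Scores). A minimal sketch of that arithmetic follows; the function name and the `lower_is_better` flag are illustrative and not part of the original post.

```python
def pct_improvement(vanilla, patched, lower_is_better=True):
    """Relative gain of 'patched' over 'vanilla', in percent.

    For build times a lower value is better, so the gain is the time saved;
    for UnixBench scores a higher value is better, so the gain is the
    score increase. Both are expressed relative to the vanilla value.
    """
    if lower_is_better:
        return (vanilla - patched) / vanilla * 100.0
    return (patched - vanilla) / vanilla * 100.0


if __name__ == "__main__":
    # Values copied from the 16 pCPU host tables quoted above.
    make_xen_j1 = pct_improvement(154.067, 153.241)
    unixbench_1 = pct_improvement(496.8, 567.5, lower_is_better=False)
    print(f"{make_xen_j1:.3f} {unixbench_1:.3f}")  # prints "0.536 14.231"
```

The same formula gives the small negative entries, e.g. -0.015 for -j1 on the 24 pCPU host, where the patched average is marginally higher than the vanilla one.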
But in any case, it seems to me that the numbers do show a uniform
improvement and no regressions -- I think this approach looks really
good, particularly as it is so small and well-contained.

 -George
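
As a side note on the check described in the quoted text (every cpu* folder under /proc/sys/kernel/sched_domain/ having only a 'domain0' subdirectory), here is a rough sketch of how that could be scripted. The path is the one given in the quoted mail and may be absent depending on kernel version and configuration; the helper names `domains_per_cpu` and `is_flat` are hypothetical.

```python
from pathlib import Path


def domains_per_cpu(root="/proc/sys/kernel/sched_domain"):
    """Map each cpu* directory under 'root' to its sorted domain* subdirectories.

    Returns an empty dict if 'root' does not exist (assumption: the path
    from the quoted mail may not be present on every kernel).
    """
    base = Path(root)
    result = {}
    if not base.is_dir():
        return result
    for cpu_dir in sorted(base.glob("cpu*")):
        if cpu_dir.is_dir():
            result[cpu_dir.name] = sorted(
                d.name for d in cpu_dir.glob("domain*") if d.is_dir()
            )
    return result


def is_flat(domains):
    """True if at least one CPU was found and each has exactly 'domain0'."""
    return bool(domains) and all(names == ["domain0"] for names in domains.values())


if __name__ == "__main__":
    found = domains_per_cpu()
    for cpu, names in found.items():
        print(cpu, names)
    print("flat hierarchy:", is_flat(found))
```

On a kernel with the patch applied, the expectation from the quoted description is one 'domain0' entry per CPU; more than one entry would indicate that a topology-based hierarchy is still being built.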