Re: [Xen-devel] [PATCH RFC] xen: if on Xen, "flatten" the scheduling domain hierarchy
On Thu, Aug 27, 2015 at 11:24 AM, George Dunlap <george.dunlap@xxxxxxxxxx> wrote:
> On 08/18/2015 04:55 PM, Dario Faggioli wrote:
>> Hey everyone,
>>
>> So, as a followup of what we were discussing in this thread:
>>
>>  [Xen-devel] PV-vNUMA issue: topology is misinterpreted by the guest
>>  http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg03241.html
>>
>> I started looking in more detail at scheduling domains in the Linux
>> kernel. Now, that thread was about CPUID and vNUMA, and their weird way
>> of interacting, while this thing I'm proposing here is completely
>> independent from them both.
>>
>> In fact, no matter whether vNUMA is supported and enabled, and no matter
>> whether CPUID is reporting accurate, random, meaningful or completely
>> misleading information, I think that we should do something about how
>> scheduling domains are built.
>>
>> Fact is, unless we use 1:1, and immutable (across all the guest
>> lifetime) pinning, scheduling domains should not be constructed, in
>> Linux, by looking at *any* topology information, because that just does
>> not make any sense, when vcpus move around.
>>
>> Let me state this again (hoping to make myself as clear as possible): no
>> matter in how much good shape we put CPUID support, no matter how
>> beautifully and consistently that will interact with both vNUMA,
>> licensing requirements and whatever else, it will always be possible for
>> vCPU #0 and vCPU #3 to be scheduled on two SMT threads at time t1, and
>> on two different NUMA nodes at time t2. Hence, the Linux scheduler
>> should really not skew its load balancing logic toward any of those two
>> situations, as neither of them could be considered correct (since
>> nothing is!).
>>
>> For now, this only covers the PV case. HVM case shouldn't be any
>> different, but I haven't looked at how to make the same thing happen in
>> there as well.
>>
>> OVERALL DESCRIPTION
>> ===================
>> What this RFC patch does is, in the Xen PV case, configure scheduling
>> domains in such a way that there is only one of them, spanning all the
>> vCPUs of the guest.
>>
>> Note that the patch deals directly with scheduling domains, and there is
>> no need to alter the masks that will then be used for building and
>> reporting the topology (via CPUID, /proc/cpuinfo, /sysfs, etc.). That is
>> the main difference between it and the patch proposed by Juergen here:
>> http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg05088.html
>>
>> This means that when, in future, we will fix CPUID handling and make it
>> comply with whatever logic or requirements we want, that won't have any
>> unexpected side effects on scheduling domains.
>>
>> Information about how the scheduling domains are being constructed
>> during boot is available in `dmesg', if the kernel is booted with the
>> 'sched_debug' parameter. It is also possible to look
>> at /proc/sys/kernel/sched_domain/cpu*, and at /proc/schedstat.
>>
>> With the patch applied, only one scheduling domain is created, called
>> the 'VCPU' domain, spanning all the guest's (or Dom0's) vCPUs. You can
>> tell that from the fact that every cpu* folder
>> in /proc/sys/kernel/sched_domain/ has only one subdirectory
>> ('domain0'), with all the tweaks and the tunables for our scheduling
>> domain.
>>
>> EVALUATION
>> ==========
>> I've tested this with UnixBench, and by looking at Xen build time, on
>> 16, 24 and 48 pCPU hosts. I've run the benchmarks in Dom0 only, for
>> now, but I plan to re-run them in DomUs soon (Juergen may be doing
>> something similar to this in DomU already, AFAUI).
>>
>> I've run the benchmarks with and without the patch applied ('patched'
>> and 'vanilla', respectively, in the tables below), and with different
>> numbers of build jobs (in case of the Xen build) or of parallel copies of
>> the benchmarks (in the case of UnixBench).
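The "only one subdirectory per cpu*" check described above can be scripted. What follows is a minimal sketch, not part of the patch: the default path is the one named in the mail, while the helper names domains_per_cpu() and is_flat() are made up for illustration. It assumes a kernel that exposes that directory; other kernel versions may expose the same information elsewhere.

```python
from pathlib import Path


def domains_per_cpu(root="/proc/sys/kernel/sched_domain"):
    """Map each cpuN directory under `root` to the names of its domainM subdirectories.

    `root` defaults to the path mentioned in the mail; it is a parameter so the
    function can be pointed at any directory with the same layout.
    """
    base = Path(root)
    result = {}
    if not base.is_dir():
        # Directory not exposed on this kernel/config: nothing to report.
        return result
    for cpu_dir in sorted(base.glob("cpu*")):
        if cpu_dir.is_dir():
            result[cpu_dir.name] = sorted(
                d.name for d in cpu_dir.glob("domain*") if d.is_dir()
            )
    return result


def is_flat(domains):
    """True if at least one CPU was found and every CPU has exactly one domain, 'domain0'."""
    return bool(domains) and all(names == ["domain0"] for names in domains.values())


if __name__ == "__main__":
    doms = domains_per_cpu()
    for cpu, names in doms.items():
        print(cpu, names)
    print("flat hierarchy" if is_flat(doms) else "not flat (or no data available)")
```

With the patch applied, every cpu* entry would be expected to list only 'domain0'; on a vanilla kernel that builds several levels, some CPUs would list more than one domain.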
>>
>> What I get from the numbers is that the patch almost always brings
>> benefits, in some cases even huge ones. There are a couple of cases
>> where we regress, but always only slightly so, especially if comparing
>> that to the magnitude of some of the improvement that we get.
>>
>> Bear also in mind that these results are gathered from Dom0, and without
>> any overcommitment at the vCPU level (i.e., nr. vCPUs == nr pCPUs). If
>> we move things in DomU and do overcommit at the Xen scheduler level, I
>> am expecting even better results.
>>
>> RESULTS
>> =======
>> To have a quick idea of how a benchmark went, look at the '%
>> improvement' row of each table.
>>
>> I'll put these results online, in a googledoc spreadsheet or something
>> like that, to make them easier to read, as soon as possible.
>>
>> *** Intel(R) Xeon(R) E5620 @ 2.40GHz
>> *** pCPUs      16        DOM0 vCPUS   16
>> *** RAM        12285 MB  DOM0 Memory  9955 MB
>> *** NUMA nodes 2
>> =========================================================================================================
>> MAKE XEN (lower == better)
>> =========================================================================================================
>> # of build jobs       -j1               -j6               -j8              -j16**             -j24
>>                  vanilla  patched  vanilla  patched  vanilla  patched  vanilla  patched  vanilla  patched
>> ---------------------------------------------------------------------------------------------------------
>>                   153.72   152.41    35.33    34.93     30.7    30.33    26.79    25.97    26.88    26.21
>>                   153.81   152.76    35.37    34.99    30.81    30.36    26.83    26.08       27    26.24
>>                   153.93   152.79    35.37    35.25    30.92    30.39    26.83    26.13    27.01    26.28
>>                   153.94   152.94    35.39    35.28    31.05    30.43     26.9    26.14    27.01    26.44
>>                   153.98   153.06    35.45    35.31    31.17     30.5    26.95    26.18    27.02    26.55
>>                   154.01   153.23     35.5    35.35     31.2    30.59    26.98     26.2    27.05    26.61
>>                   154.04   153.34    35.56    35.42    31.45    30.76    27.12    26.21    27.06    26.78
>>                   154.16    153.5    37.79    35.58    31.68    30.83    27.16    26.23    27.16    26.78
>>                   154.18   153.71    37.98    35.61    33.73     30.9    27.49    26.32    27.16     26.8
>>                    154.9   154.67    38.03    37.64    34.69    31.69    29.82    26.38     27.2    28.63
>> ---------------------------------------------------------------------------------------------------------
>> Avg.             154.067  153.241   36.177   35.536    31.74   30.678   27.287   26.184   27.055   26.732
>> ---------------------------------------------------------------------------------------------------------
>> Std. Dev.          0.325    0.631    1.215    0.771    1.352    0.410    0.914    0.116    0.095    0.704
>> ---------------------------------------------------------------------------------------------------------
>> % improvement               0.536             1.772             3.346             4.042             1.194
>> =========================================================================================================
>> ================================================================================================================================
>> UNIXBENCH
>> ================================================================================================================================
>> # parallel copies                         1 parallel        6 parallel        8 parallel      16 parallel**      24 parallel
>>                                         vanilla  patched  vanilla  patched  vanilla  patched  vanilla  patched  vanilla  patched
>> --------------------------------------------------------------------------------------------------------------------------------
>> Dhrystone 2 using register variables     2302.2   2302.1  13157.8  12262.4  15691.5  15860.1  18927.7  19078.5  18654.3  18855.6
>> Double-Precision Whetstone                620.2    620.2   3481.2   3566.9   4669.2   4551.5   7610.1   7614.3  11558.9  11561.3
>> Execl Throughput                          184.3    186.7    884.6    905.3   1168.4   1213.6   2134.6   2210.2   2250.9     2265
>> File Copy 1024 bufsize 2000 maxblocks     780.8    783.3   1243.7   1255.5   1250.6   1215.7   1080.9   1094.2   1069.8   1062.5
>> File Copy 256 bufsize 500 maxblocks       479.8    482.8    781.8    803.6    806.4      781    682.9    707.7    698.2    694.6
>> File Copy 4096 bufsize 8000 maxblocks    1617.6   1593.5   2739.7   2943.4   2818.3   2957.8   2389.6   2412.6   2371.6   2423.8
>> Pipe Throughput                           363.9    361.6   2068.6   2065.6     2622   2633.5   4053.3   4085.9   4064.7   4076.7
>> Pipe-based Context Switching               70.6    207.2    369.1   1126.8    623.9   1431.3   1970.4   2082.9   1963.8     2077
>> Process Creation                          103.1      135      503    677.6    618.7    855.4     1138   1113.7   1195.6     1199
>> Shell Scripts (1 concurrent)              723.2    765.3   4406.4   4334.4   5045.4   5002.5   5861.9   5844.2   5958.8   5916.1
>> Shell Scripts (8 concurrent)             2243.7   2715.3   5694.7   5663.6   5694.7   5657.8   5637.1   5600.5   5582.9   5543.6
>> System Call Overhead                        330    330.1   1669.2   1672.4   2028.6   1996.6   2920.5   2947.1   2923.9   2952.5
>> System Benchmarks Index Score             496.8    567.5   1861.9     2106   2220.3   2441.3   2972.5   3007.9   3103.4   3125.3
>> --------------------------------------------------------------------------------------------------------------------------------
>> % increase (of the Index Score)                   14.231            13.110             9.954             1.191             0.706
>> ================================================================================================================================
>>
>> *** Intel(R) Xeon(R) X5650 @ 2.67GHz
>> *** pCPUs      24        DOM0 vCPUS   16
>> *** RAM        36851 MB  DOM0 Memory  9955 MB
>> *** NUMA nodes 2
>> =========================================================================================================
>> MAKE XEN (lower == better)
>> =========================================================================================================
>> # of build jobs       -j1               -j8               -j12             -j24**             -j32
>>                  vanilla  patched  vanilla  patched  vanilla  patched  vanilla  patched  vanilla  patched
>> ---------------------------------------------------------------------------------------------------------
>>                   119.49   119.47    23.37    23.29    20.12    19.85    17.99     17.9    17.82     17.8
>>                   119.59   119.64    23.52    23.31    20.16    19.99    18.19    18.05    18.23    17.89
>>                   119.59   119.65    23.53    23.35    20.19    20.08    18.26    18.09    18.35    17.91
>>                   119.72   119.75    23.63    23.41     20.2    20.14    18.54     18.1     18.4    17.95
>>                   119.95   119.86    23.68    23.42    20.24    20.19    18.57    18.15    18.44    18.03
>>                   119.97    119.9    23.72    23.51    20.38    20.31    18.61    18.21    18.49    18.03
>>                   119.97   119.91    25.03    23.53    20.38    20.42    18.75    18.28    18.51    18.08
>>                   120.01   119.98    25.05    23.93    20.39    21.69    19.99    18.49    18.52     18.6
>>                   120.24   119.99    25.12    24.19    21.67    21.76    20.08    19.74    19.73    19.62
>>                   120.66   121.22    25.16    25.36    21.94    21.85    20.26     20.3    19.92    19.81
>> ---------------------------------------------------------------------------------------------------------
>> Avg.             119.919  119.937   24.181    23.73   20.567   20.628   18.924   18.531   18.641   18.372
>> ---------------------------------------------------------------------------------------------------------
>> Std. Dev.          0.351    0.481    0.789    0.642    0.663    0.802    0.851    0.811    0.658    0.741
>> ---------------------------------------------------------------------------------------------------------
>> % improvement              -0.015             1.865            -0.297             2.077             1.443
>> =========================================================================================================
>> ================================================================================================================================
>> UNIXBENCH
>> ================================================================================================================================
>> # parallel copies                         1 parallel        8 parallel        12 parallel     24 parallel**      32 parallel
>>                                         vanilla  patched  vanilla  patched  vanilla  patched  vanilla  patched  vanilla  patched
>> --------------------------------------------------------------------------------------------------------------------------------
>> Dhrystone 2 using register variables     2650.1   2664.6  18967.8  19060.4  27534.1  27046.8  30077.9  30110.6  30542.1  30358.7
>> Double-Precision Whetstone                713.7    713.5   5463.6   5455.1   7863.9   7923.8  12725.1  12727.8  17474.3  17463.3
>> Execl Throughput                          280.9    283.8   1724.4   1866.5   2029.5   2367.6     2370   2521.3     2453   2506.8
>> File Copy 1024 bufsize 2000 maxblocks     891.1    894.2     1423   1457.7   1385.6   1482.2   1226.1   1224.2   1235.9   1265.5
>> File Copy 256 bufsize 500 maxblocks       546.9    555.4      949    972.1    882.8    878.6    821.9    817.7    784.7    810.8
>> File Copy 4096 bufsize 8000 maxblocks    1743.4   1722.8   3406.5   3438.9   3314.3   3265.9   2801.9   2788.3   2695.2   2781.5
>> Pipe Throughput                           426.8    423.4   3207.9     3234   4635.1   4708.9     7326   7335.3   7327.2   7319.7
>> Pipe-based Context Switching              110.2    223.5    680.8   1602.2    998.6   2324.6   3122.1   3252.7   3128.6   3337.2
>> Process Creation                          130.7    224.4   1001.3   1043.6     1209   1248.2   1337.9   1380.4   1338.6   1280.1
>> Shell Scripts (1 concurrent)             1140.5   1257.5   5462.8   6146.4   6435.3   7206.1   7425.2   7636.2   7566.1   7636.6
>> Shell Scripts (8 concurrent)               3492   3586.7   7144.9     7307     7258   7320.2   7295.1   7296.7   7248.6   7252.2
>> System Call Overhead                      387.7    387.5   2398.4     2367   2793.8   2752.7   3735.7   3694.2   3752.1   3709.4
>> System Benchmarks Index Score             634.8    712.6   2725.8   3005.7   3232.4   3569.7   3981.3   4028.8   4085.2   4126.3
>> --------------------------------------------------------------------------------------------------------------------------------
>> % increase (of the Index Score)                   12.256            10.269            10.435             1.193             1.006
>> ================================================================================================================================
>>
>> *** Intel(R) Xeon(R) X5650 @ 2.67GHz
>> *** pCPUs      48        DOM0 vCPUS   16
>> *** RAM        393138 MB DOM0 Memory  9955 MB
>> *** NUMA nodes 2
>> =========================================================================================================
>> MAKE XEN (lower == better)
>> =========================================================================================================
>> # of build jobs       -j1               -j20              -j24             -j48**             -j62
>>                  vanilla  patched  vanilla  patched  vanilla  patched  vanilla  patched  vanilla  patched
>> ---------------------------------------------------------------------------------------------------------
>>                   267.78   233.25    36.53    35.53    35.98    34.99    33.46    32.13    33.57    32.54
>>                   268.42   233.92    36.82    35.56    36.12     35.2    34.24    32.24    33.64    32.56
>>                   268.85   234.39    36.92    35.75    36.15    35.35    34.48    32.86    33.67    32.74
>>                   268.98   235.11    36.96    36.01    36.25    35.46    34.73    32.89    33.97    32.83
>>                   269.03   236.48    37.04    36.16    36.45    35.63    34.77    32.97    34.12    33.01
>>                   269.54   237.05    40.33    36.59    36.57    36.15    34.97    33.09    34.18    33.52
>>                   269.99   238.24    40.45    36.78    36.58    36.22    34.99    33.69    34.28    33.63
>>                   270.11   238.48    41.13    39.98    40.22    36.24       38    33.92    34.35    33.87
>>                   270.96   239.07    41.66    40.81    40.59    36.35    38.99    34.19    34.49    37.24
>>                   271.84   240.89    42.07    41.24    40.63    40.06    39.07    36.04    34.69    37.59
>> ---------------------------------------------------------------------------------------------------------
>> Avg.              269.55  236.688   38.991   37.441   37.554   36.165    35.77   33.402   34.096   33.953
>> ---------------------------------------------------------------------------------------------------------
>> Std. Dev.          1.213    2.503    2.312    2.288    2.031    1.452    2.079    1.142    0.379    1.882
>> ---------------------------------------------------------------------------------------------------------
>> % improvement              12.191             3.975             3.699             6.620             0.419
>> =========================================================================================================
>
> I'm a bit confused here as to why, if dom0 has 16 vcpus in all of your
> tests, you change the -j number (apparently) based on the number of
> pcpus available to Xen. Wouldn't it make more sense to stick with
> 1/6/8/16/24? That would allow us to have actually comparable numbers.
>
> But in any case, it seems to me that the numbers do show a uniform
> improvement and no regressions -- I think this approach looks really
> good, particularly as it is so small and well-contained.

That said, it's probably a good idea to make this optional somehow, so
that if people do decide to do a pinning / partitioning approach, the
guest scheduler actually can take advantage of topological information.

 -George

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel
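
For anyone re-running the comparison with a fixed set of -j values, the '% improvement' and '% increase' rows can be recomputed from the averages. The mail does not state the formula; the sketch below assumes relative change against the vanilla figure, which reproduces the reported rows for the values checked here. The function names are illustrative only.

```python
def pct_improvement(vanilla, patched):
    """Build-time tables (lower == better): percent reduction relative to vanilla.

    Assumed formula, not stated in the mail: (vanilla - patched) / vanilla * 100.
    A negative result means the patched kernel was slower.
    """
    return (vanilla - patched) / vanilla * 100.0


def pct_increase(vanilla, patched):
    """UnixBench index score (higher == better): percent gain relative to vanilla."""
    return (patched - vanilla) / vanilla * 100.0


if __name__ == "__main__":
    # Averages copied from the 16-pCPU MAKE XEN table: (vanilla, patched).
    make_xen_avg = {
        "-j1": (154.067, 153.241),
        "-j6": (36.177, 35.536),
        "-j8": (31.74, 30.678),
        "-j16": (27.287, 26.184),
        "-j24": (27.055, 26.732),
    }
    for jobs, (vanilla, patched) in make_xen_avg.items():
        print(f"{jobs:>5}: {pct_improvement(vanilla, patched):.3f}% improvement")

    # Index scores for '1 parallel' copy on the same host.
    print(f"UnixBench x1: {pct_increase(496.8, 567.5):.3f}% increase")
```

Under this reading, the two negative entries in the 24-pCPU build table (-j1 and -j12) are cases where the patched average is slightly higher than the vanilla one.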