
Re: [Xen-devel] [PATCH RESEND v7 0/9] vnuma introduction



On Thu, Aug 21, 2014 at 1:08 AM, Elena Ufimtseva <ufimtseva@xxxxxxxxx> wrote:
> vNUMA introduction
>
> This series of patches introduces vNUMA topology awareness and
> provides interfaces and data structures to enable vNUMA for
> PV guests. There is a plan to extend this support to dom0 and
> HVM domains.
>
> vNUMA topology must also be supported by the PV guest kernel;
> the corresponding kernel patches need to be applied.
>
> Introduction
> -------------
>
> vNUMA topology is exposed to the PV guest to improve performance when running
> workloads on NUMA machines. vNUMA-enabled guests may also run on non-NUMA
> machines, in which case the virtual NUMA topology is still visible to the
> guest. The Xen vNUMA implementation provides a way to run vNUMA-enabled
> guests on both NUMA and UMA machines and to flexibly map the vNUMA topology
> onto the physical NUMA topology.
>
> Mapping to the physical NUMA topology can be done manually or automatically.
> By default, every PV domain has one vNUMA node. It is populated with default
> parameters and does not affect performance. To use the automatic way of
> initializing the vNUMA topology, the configuration file only needs to define
> the number of vNUMA nodes; any vNUMA topology parameters left undefined are
> initialized to defaults.
>
> vNUMA topology is currently defined as a set of parameters such as:
>     number of vNUMA nodes;
>     distance table;
>     vnodes memory sizes;
>     vcpus to vnodes mapping;
>     vnode to pnode map (for NUMA machines).
>
> This set of patches introduces two hypercall subops: XEN_DOMCTL_setvnumainfo
> and XENMEM_get_vnuma_info.
>
>     XEN_DOMCTL_setvnumainfo is used by the toolstack to populate the
> domain's vNUMA topology with a user-defined configuration or with the
> default parameters. vNUMA is defined for every PV domain: if no vNUMA
> configuration is found, one vNUMA node is initialized, all cpus are assigned
> to it, and all other parameters are set to their default values.
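>
> As an illustration, a toolstack-side call might look like this (a minimal
> sketch: the exact xc_domain_setvnuma() prototype is defined by the
> "libxc: Introduce xc_domain_setvnuma" patch of this series and may differ):
>
>     /* sketch: 2 vnodes, 2 vcpus; prototype and argument order assumed */
>     unsigned int vdistance[4]      = { 10, 20, 20, 10 }; /* 2x2 table */
>     unsigned int vcpu_to_vnode[2]  = { 0, 1 };  /* vcpu i -> vnode */
>     unsigned int vnode_to_pnode[2] = { 0, 1 };  /* vnode i -> pnode */
>     int rc;
>     /* vmemrange[] (per-vnode memory ranges) is built by libxc, see below */
>     rc = xc_domain_setvnuma(xch, domid, 2 /* vnodes */, 2 /* vcpus */,
>                             vmemrange, vdistance,
>                             vcpu_to_vnode, vnode_to_pnode);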
>
>     XENMEM_get_vnuma_info is used by the PV domain to obtain the vNUMA
> topology from the hypervisor. The guest passes in the sizes of the buffers
> it has allocated for the various vNUMA parameters, and the hypervisor fills
> them with the topology. Future work is required in the toolstack and in the
> hypervisor to allow HVM guests to use these hypercalls.
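>
> Schematically, from the guest side (field and handle names here are
> illustrative; the real structure is defined in the patched
> xen/include/public/memory.h of this series):
>
>     struct vnuma_topology_info info = { .domid = DOMID_SELF };
>     info.nr_vnodes = nr_vnodes_allocated; /* buffer sizes the guest offers */
>     info.nr_vcpus  = nr_vcpus_allocated;
>     set_xen_guest_handle(info.vdistance.h,     vdistance_buf);
>     set_xen_guest_handle(info.vcpu_to_vnode.h, vcpu_to_vnode_buf);
>     set_xen_guest_handle(info.vmemrange.h,     vmemrange_buf);
>     rc = HYPERVISOR_memory_op(XENMEM_get_vnuma_info, &info);
>     /* on success the hypervisor fills the buffers and writes the actual
>      * nr_vnodes/nr_vcpus back */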
>
> libxl
>
> libxl allows us to define the vNUMA topology in the domain configuration
> file and verifies that the configuration is correct. libxl also verifies the
> mapping of vnodes to pnodes and uses it on NUMA machines when automatic
> placement is disabled. In case of an incorrect or insufficient
> configuration, one vNUMA node will be initialized and populated with default
> values.
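>
> Conceptually, the vnode to pnode check is a bounds check; a minimal sketch
> (not the actual libxl code; nr_pnodes would come from libxl_get_physinfo()):
>
>     /* return 1 if every vnode maps to an existing physical node */
>     static int vnodemap_is_valid(const unsigned int *vnode_to_pnode,
>                                  unsigned int nr_vnodes,
>                                  unsigned int nr_pnodes)
>     {
>         for (unsigned int i = 0; i < nr_vnodes; i++)
>             if (vnode_to_pnode[i] >= nr_pnodes)
>                 return 0; /* invalid: fall back to one default vNUMA node */
>         return 1;
>     }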
>
> libxc
>
> libxc builds the vnode memory addresses for the guest and applies the
> necessary alignment to them. It also takes the guest e820 memory map
> configuration into account. The domain memory is allocated, and the vnode to
> pnode mapping is used to determine the target physical node for each vnode.
> If this mapping was not defined, if the host is not a NUMA machine, or if
> automatic NUMA placement is enabled, the default non-node-specific
> allocation is used.
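>
> In outline, the per-vnode allocation does something like the following (a
> simplified sketch, not the actual xc_dom_x86.c code; the helper arrays are
> illustrative, while XENMEMF_exact_node() and
> xc_domain_populate_physmap_exact() are existing interfaces):
>
>     unsigned int i;
>     int rc = 0;
>     for (i = 0; i < nr_vnodes; i++) {
>         unsigned int memflags = 0;
>         if (have_vnodemap && host_is_numa && !auto_placement)
>             memflags = XENMEMF_exact_node(vnode_to_pnode[i]);
>         /* populate this vnode's pfn range with 4k pages (order 0) */
>         rc = xc_domain_populate_physmap_exact(xch, domid, vnode_pages[i],
>                                               0, memflags,
>                                               &p2m[vnode_first_pfn[i]]);
>         if (rc)
>             break;
>     }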
>
> hypervisor vNUMA initialization
>
> PV guest
>
> As of now, only a PV guest can take advantage of the vNUMA functionality.
> Such a guest allocates memory for the NUMA topology and sets the number of
> nodes and cpus, so the hypervisor knows how much memory the guest has
> preallocated for the vNUMA topology. The guest then makes the
> XENMEM_get_vnuma_info subop hypercall. If for some reason the vNUMA topology
> cannot be initialized, a Linux guest will come up with a single NUMA node
> (standard Linux behavior). To enable all of this, the vNUMA supporting
> patches must be applied to the PV guest kernel.
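>
> A sketch of the guest-side flow (illustrative; the real code is in the Linux
> patches linked below, and xen_get_vnuma_info() stands in for a wrapper
> around the hypercall shown earlier; numa_add_memblk() is the existing x86
> NUMA interface):
>
>     static int __init xen_numa_init(void)
>     {
>         struct vnuma_topology_info info;  /* see the sketch above */
>         unsigned int i;
>         if (xen_get_vnuma_info(&info))  /* hypercall failed */
>             return -EINVAL;             /* Linux falls back to one node */
>         for (i = 0; i < info.nr_vnodes; i++)
>             numa_add_memblk(i, vmemrange[i].start, vmemrange[i].end);
>         /* the distance table and the vcpu to vnode mapping are applied
>          * similarly */
>         return 0;
>     }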
>
> The Linux kernel patches are available here:
> https://git.gitorious.org/vnuma/linux_vnuma.git
> git://gitorious.org/vnuma/linux_vnuma.git
>
> Automatic vNUMA placement
>
> vNUMA automatic placement will be enabled if numa automatic placement is
> not in enabled or, if disabled, if vnode to pnode mapping is incorrect. If
> vnode to pnode mapping is correct and automatic NUMA placement disabled,
> vNUMA nodes will be allocated on nodes as it was specified in the guest
> config file.
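>
> Schematically (a sketch of the decision only, with hypothetical helper
> names, not the actual libxl code):
>
>     if (numa_auto_placement_enabled ||
>         !vnodemap_is_valid(vnode_to_pnode, nr_vnodes, nr_pnodes))
>         use_automatic_vnuma_placement(); /* Xen/libxl picks the pnodes */
>     else
>         place_vnodes_as_configured(vnode_to_pnode); /* honor vnuma_vnodemap */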
>
> Xen patchset is available here:
> https://git.gitorious.org/vnuma/xen_vnuma.git
> git://gitorious.org/vnuma/xen_vnuma.git
>
>
> Examples of booting a vNUMA-enabled PV Linux guest on a real NUMA machine:
>
> memory = 4000
> vcpus = 2
> # The name of the domain, change this if you want more than 1 VM.
> name = "null"
> vnodes = 2
> #vnumamem = [3000, 1000]
> #vnumamem = [4000,0]
> vdistance = [10, 20]
> vnuma_vcpumap = [1, 0]
> vnuma_vnodemap = [1]
> vnuma_autoplacement = 0
> #e820_host = 1
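>
> In this example, vnodes = 2 splits the domain into two vNUMA nodes,
> vdistance = [10, 20] sets the local distance to 10 and the remote distance
> to 20 (matching the numactl output below), vnuma_vcpumap assigns each vcpu
> to a vnode, and vnuma_vnodemap requests the vnode to pnode placement.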
>
> [    0.000000] Linux version 3.15.0-rc8+ (assert@superpipe) (gcc version 
> 4.7.2 (Debian 4.7.2-5) ) #43 SMP Fri Jun 27 01:23:11 EDT 2014
> [    0.000000] Command line: root=/dev/xvda1 ro earlyprintk=xen debug 
> loglevel=8 debug print_fatal_signals=1 loglvl=all guest_loglvl=all LOGLEVEL=8 
> earlyprintk=xen sched_debug
> [    0.000000] ACPI in unprivileged domain disabled
> [    0.000000] e820: BIOS-provided physical RAM map:
> [    0.000000] Xen: [mem 0x0000000000000000-0x000000000009ffff] usable
> [    0.000000] Xen: [mem 0x00000000000a0000-0x00000000000fffff] reserved
> [    0.000000] Xen: [mem 0x0000000000100000-0x00000000f9ffffff] usable
> [    0.000000] bootconsole [xenboot0] enabled
> [    0.000000] NX (Execute Disable) protection: active
> [    0.000000] DMI not present or invalid.
> [    0.000000] e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
> [    0.000000] e820: remove [mem 0x000a0000-0x000fffff] usable
> [    0.000000] No AGP bridge found
> [    0.000000] e820: last_pfn = 0xfa000 max_arch_pfn = 0x400000000
> [    0.000000] Base memory trampoline at [ffff88000009a000] 9a000 size 24576
> [    0.000000] init_memory_mapping: [mem 0x00000000-0x000fffff]
> [    0.000000]  [mem 0x00000000-0x000fffff] page 4k
> [    0.000000] init_memory_mapping: [mem 0xf9e00000-0xf9ffffff]
> [    0.000000]  [mem 0xf9e00000-0xf9ffffff] page 4k
> [    0.000000] BRK [0x019c8000, 0x019c8fff] PGTABLE
> [    0.000000] BRK [0x019c9000, 0x019c9fff] PGTABLE
> [    0.000000] init_memory_mapping: [mem 0xf8000000-0xf9dfffff]
> [    0.000000]  [mem 0xf8000000-0xf9dfffff] page 4k
> [    0.000000] BRK [0x019ca000, 0x019cafff] PGTABLE
> [    0.000000] BRK [0x019cb000, 0x019cbfff] PGTABLE
> [    0.000000] BRK [0x019cc000, 0x019ccfff] PGTABLE
> [    0.000000] BRK [0x019cd000, 0x019cdfff] PGTABLE
> [    0.000000] init_memory_mapping: [mem 0x80000000-0xf7ffffff]
> [    0.000000]  [mem 0x80000000-0xf7ffffff] page 4k
> [    0.000000] init_memory_mapping: [mem 0x00100000-0x7fffffff]
> [    0.000000]  [mem 0x00100000-0x7fffffff] page 4k
> [    0.000000] RAMDISK: [mem 0x01dd8000-0x035c5fff]
> [    0.000000] Nodes received = 2
> [    0.000000] NUMA: Initialized distance table, cnt=2
> [    0.000000] Initmem setup node 0 [mem 0x00000000-0x7cffffff]
> [    0.000000]   NODE_DATA [mem 0x7cfd9000-0x7cffffff]
> [    0.000000] Initmem setup node 1 [mem 0x7d000000-0xf9ffffff]
> [    0.000000]   NODE_DATA [mem 0xf9828000-0xf984efff]
> [    0.000000] Zone ranges:
> [    0.000000]   DMA      [mem 0x00001000-0x00ffffff]
> [    0.000000]   DMA32    [mem 0x01000000-0xffffffff]
> [    0.000000]   Normal   empty
> [    0.000000] Movable zone start for each node
> [    0.000000] Early memory node ranges
> [    0.000000]   node   0: [mem 0x00001000-0x0009ffff]
> [    0.000000]   node   0: [mem 0x00100000-0x7cffffff]
> [    0.000000]   node   1: [mem 0x7d000000-0xf9ffffff]
> [    0.000000] On node 0 totalpages: 511903
> [    0.000000]   DMA zone: 64 pages used for memmap
> [    0.000000]   DMA zone: 21 pages reserved
> [    0.000000]   DMA zone: 3999 pages, LIFO batch:0
> [    0.000000]   DMA32 zone: 7936 pages used for memmap
> [    0.000000]   DMA32 zone: 507904 pages, LIFO batch:31
> [    0.000000] On node 1 totalpages: 512000
> [    0.000000]   DMA32 zone: 8000 pages used for memmap
> [    0.000000]   DMA32 zone: 512000 pages, LIFO batch:31
> [    0.000000] SFI: Simple Firmware Interface v0.81 http://simplefirmware.org
> [    0.000000] smpboot: Allowing 2 CPUs, 0 hotplug CPUs
> [    0.000000] nr_irqs_gsi: 16
> [    0.000000] PM: Registered nosave memory: [mem 0x000a0000-0x000fffff]
> [    0.000000] e820: [mem 0xfa000000-0xffffffff] available for PCI devices
> [    0.000000] Booting paravirtualized kernel on Xen
> [    0.000000] Xen version: 4.5-unstable (preserve-AD)
> [    0.000000] setup_percpu: NR_CPUS:20 nr_cpumask_bits:20 nr_cpu_ids:2 
> nr_node_ids:2
> [    0.000000] PERCPU: Embedded 28 pages/cpu @ffff88007ac00000 s85888 r8192 
> d20608 u2097152
> [    0.000000] pcpu-alloc: s85888 r8192 d20608 u2097152 alloc=1*2097152
> [    0.000000] pcpu-alloc: [0] 0 [1] 1
> [    0.000000] xen: PV spinlocks enabled
> [    0.000000] Built 2 zonelists in Node order, mobility grouping on.  Total 
> pages: 1007882
> [    0.000000] Policy zone: DMA32
> [    0.000000] Kernel command line: root=/dev/xvda1 ro earlyprintk=xen debug 
> loglevel=8 debug print_fatal_signals=1 loglvl=all guest_loglvl=all LOGLEVEL=8 
> earlyprintk=xen sched_debug
> [    0.000000] Memory: 3978224K/4095612K available (4022K kernel code, 769K 
> rwdata, 1744K rodata, 1532K init, 1472K bss, 117388K reserved)
> [    0.000000] Enabling automatic NUMA balancing. Configure with 
> numa_balancing= or the kernel.numa_balancing sysctl
> [    0.000000] installing Xen timer for CPU 0
> [    0.000000] tsc: Detected 2394.276 MHz processor
> [    0.004000] Calibrating delay loop (skipped), value calculated using timer 
> frequency.. 4788.55 BogoMIPS (lpj=9577104)
> [    0.004000] pid_max: default: 32768 minimum: 301
> [    0.004179] Dentry cache hash table entries: 524288 (order: 10, 4194304 
> bytes)
> [    0.006782] Inode-cache hash table entries: 262144 (order: 9, 2097152 
> bytes)
> [    0.007216] Mount-cache hash table entries: 8192 (order: 4, 65536 bytes)
> [    0.007288] Mountpoint-cache hash table entries: 8192 (order: 4, 65536 
> bytes)
> [    0.007935] CPU: Physical Processor ID: 0
> [    0.007942] CPU: Processor Core ID: 0
> [    0.007951] Last level iTLB entries: 4KB 512, 2MB 8, 4MB 8
> [    0.007951] Last level dTLB entries: 4KB 512, 2MB 32, 4MB 32, 1GB 0
> [    0.007951] tlb_flushall_shift: 6
> [    0.021249] cpu 0 spinlock event irq 17
> [    0.021292] Performance Events: unsupported p6 CPU model 45 no PMU driver, 
> software events only.
> [    0.022162] NMI watchdog: disabled (cpu0): hardware events not enabled
> [    0.022625] installing Xen timer for CPU 1
>
> root@heatpipe:~# numactl --ha
> available: 2 nodes (0-1)
> node 0 cpus: 0
> node 0 size: 1933 MB
> node 0 free: 1894 MB
> node 1 cpus: 1
> node 1 size: 1951 MB
> node 1 free: 1926 MB
> node distances:
> node   0   1
>   0:  10  20
>   1:  20  10
>
> root@heatpipe:~# numastat
>                            node0           node1
> numa_hit                   52257           92679
> numa_miss                      0               0
> numa_foreign                   0               0
> interleave_hit              4254            4238
> local_node                 52150           87364
> other_node                   107            5315
>
> root@superpipe:~# xl debug-keys u
>
> (XEN) Domain 7 (total: 1024000):
> (XEN)     Node 0: 1024000
> (XEN)     Node 1: 0
> (XEN)     Domain has 2 vnodes, 2 vcpus
> (XEN)         vnode 0 - pnode 0, 2000 MB, vcpu nums: 0
> (XEN)         vnode 1 - pnode 0, 2000 MB, vcpu nums: 1
>
>
> memory = 4000
> vcpus = 8
> # The name of the domain, change this if you want more than 1 VM.
> name = "null1"
> vnodes = 8
> #vnumamem = [3000, 1000]
> vdistance = [10, 40]
> #vnuma_vcpumap = [1, 0, 3, 2]
> vnuma_vnodemap = [1, 0, 1, 1, 0, 0, 1, 1]
> vnuma_autoplacement = 1
> e820_host = 1
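>
> Here the eight vnodes are spread over the host's two physical nodes via
> vnuma_vnodemap, with a flat remote distance of 40; e820_host = 1 exposes the
> host e820 layout to the guest, which is why the boot log below shows the
> host's reserved regions and memory holes.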
>
> [    0.000000] Freeing ac228-fa000 pfn range: 318936 pages freed
> [    0.000000] 1-1 mapping on ac228->100000
> [    0.000000] Released 318936 pages of unused memory
> [    0.000000] Set 343512 page(s) to 1-1 mapping
> [    0.000000] Populating 100000-14ddd8 pfn range: 318936 pages added
> [    0.000000] e820: BIOS-provided physical RAM map:
> [    0.000000] Xen: [mem 0x0000000000000000-0x000000000009ffff] usable
> [    0.000000] Xen: [mem 0x00000000000a0000-0x00000000000fffff] reserved
> [    0.000000] Xen: [mem 0x0000000000100000-0x00000000ac227fff] usable
> [    0.000000] Xen: [mem 0x00000000ac228000-0x00000000ac26bfff] reserved
> [    0.000000] Xen: [mem 0x00000000ac26c000-0x00000000ac57ffff] unusable
> [    0.000000] Xen: [mem 0x00000000ac580000-0x00000000ac5a0fff] reserved
> [    0.000000] Xen: [mem 0x00000000ac5a1000-0x00000000ac5bbfff] unusable
> [    0.000000] Xen: [mem 0x00000000ac5bc000-0x00000000ac5bdfff] reserved
> [    0.000000] Xen: [mem 0x00000000ac5be000-0x00000000ac5befff] unusable
> [    0.000000] Xen: [mem 0x00000000ac5bf000-0x00000000ac5cafff] reserved
> [    0.000000] Xen: [mem 0x00000000ac5cb000-0x00000000ac5d9fff] unusable
> [    0.000000] Xen: [mem 0x00000000ac5da000-0x00000000ac5fafff] reserved
> [    0.000000] Xen: [mem 0x00000000ac5fb000-0x00000000ac6b5fff] unusable
> [    0.000000] Xen: [mem 0x00000000ac6b6000-0x00000000ac7fafff] ACPI NVS
> [    0.000000] Xen: [mem 0x00000000ac7fb000-0x00000000ac80efff] unusable
> [    0.000000] Xen: [mem 0x00000000ac80f000-0x00000000ac80ffff] ACPI data
> [    0.000000] Xen: [mem 0x00000000ac810000-0x00000000ac810fff] unusable
> [    0.000000] Xen: [mem 0x00000000ac811000-0x00000000ac812fff] ACPI data
> [    0.000000] Xen: [mem 0x00000000ac813000-0x00000000ad7fffff] unusable
> [    0.000000] Xen: [mem 0x00000000b0000000-0x00000000b3ffffff] reserved
> [    0.000000] Xen: [mem 0x00000000fed20000-0x00000000fed3ffff] reserved
> [    0.000000] Xen: [mem 0x00000000fed50000-0x00000000fed8ffff] reserved
> [    0.000000] Xen: [mem 0x00000000fee00000-0x00000000feefffff] reserved
> [    0.000000] Xen: [mem 0x00000000ffa00000-0x00000000ffa3ffff] reserved
> [    0.000000] Xen: [mem 0x0000000100000000-0x000000014ddd7fff] usable
> [    0.000000] NX (Execute Disable) protection: active
> [    0.000000] DMI not present or invalid.
> [    0.000000] e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
> [    0.000000] e820: remove [mem 0x000a0000-0x000fffff] usable
> [    0.000000] No AGP bridge found
> [    0.000000] e820: last_pfn = 0x14ddd8 max_arch_pfn = 0x400000000
> [    0.000000] e820: last_pfn = 0xac228 max_arch_pfn = 0x400000000
> [    0.000000] Base memory trampoline at [ffff88000009a000] 9a000 size 24576
> [    0.000000] init_memory_mapping: [mem 0x00000000-0x000fffff]
> [    0.000000]  [mem 0x00000000-0x000fffff] page 4k
> [    0.000000] init_memory_mapping: [mem 0x14da00000-0x14dbfffff]
> [    0.000000]  [mem 0x14da00000-0x14dbfffff] page 4k
> [    0.000000] BRK [0x019cd000, 0x019cdfff] PGTABLE
> [    0.000000] BRK [0x019ce000, 0x019cefff] PGTABLE
> [    0.000000] init_memory_mapping: [mem 0x14c000000-0x14d9fffff]
> [    0.000000]  [mem 0x14c000000-0x14d9fffff] page 4k
> [    0.000000] BRK [0x019cf000, 0x019cffff] PGTABLE
> [    0.000000] BRK [0x019d0000, 0x019d0fff] PGTABLE
> [    0.000000] BRK [0x019d1000, 0x019d1fff] PGTABLE
> [    0.000000] BRK [0x019d2000, 0x019d2fff] PGTABLE
> [    0.000000] init_memory_mapping: [mem 0x100000000-0x14bffffff]
> [    0.000000]  [mem 0x100000000-0x14bffffff] page 4k
> [    0.000000] init_memory_mapping: [mem 0x00100000-0xac227fff]
> [    0.000000]  [mem 0x00100000-0xac227fff] page 4k
> [    0.000000] init_memory_mapping: [mem 0x14dc00000-0x14ddd7fff]
> [    0.000000]  [mem 0x14dc00000-0x14ddd7fff] page 4k
> [    0.000000] RAMDISK: [mem 0x01dd8000-0x0347ffff]
> [    0.000000] Nodes received = 8
> [    0.000000] NUMA: Initialized distance table, cnt=8
> [    0.000000] Initmem setup node 0 [mem 0x00000000-0x1f3fffff]
> [    0.000000]   NODE_DATA [mem 0x1f3d9000-0x1f3fffff]
> [    0.000000] Initmem setup node 1 [mem 0x1f800000-0x3e7fffff]
> [    0.000000]   NODE_DATA [mem 0x3e7d9000-0x3e7fffff]
> [    0.000000] Initmem setup node 2 [mem 0x3e800000-0x5dbfffff]
> [    0.000000]   NODE_DATA [mem 0x5dbd9000-0x5dbfffff]
> [    0.000000] Initmem setup node 3 [mem 0x5e000000-0x7cffffff]
> [    0.000000]   NODE_DATA [mem 0x7cfd9000-0x7cffffff]
> [    0.000000] Initmem setup node 4 [mem 0x7d000000-0x9c3fffff]
> [    0.000000]   NODE_DATA [mem 0x9c3d9000-0x9c3fffff]
> [    0.000000] Initmem setup node 5 [mem 0x9c800000-0x10f5d7fff]
> [    0.000000]   NODE_DATA [mem 0x10f5b1000-0x10f5d7fff]
> [    0.000000] Initmem setup node 6 [mem 0x10f800000-0x12e9d7fff]
> [    0.000000]   NODE_DATA [mem 0x12e9b1000-0x12e9d7fff]
> [    0.000000] Initmem setup node 7 [mem 0x12f000000-0x14ddd7fff]
> [    0.000000]   NODE_DATA [mem 0x14ddad000-0x14ddd3fff]
> [    0.000000] Zone ranges:
> [    0.000000]   DMA      [mem 0x00001000-0x00ffffff]
> [    0.000000]   DMA32    [mem 0x01000000-0xffffffff]
> [    0.000000]   Normal   [mem 0x100000000-0x14ddd7fff]
> [    0.000000] Movable zone start for each node
> [    0.000000] Early memory node ranges
> [    0.000000]   node   0: [mem 0x00001000-0x0009ffff]
> [    0.000000]   node   0: [mem 0x00100000-0x1f3fffff]
> [    0.000000]   node   1: [mem 0x1f400000-0x3e7fffff]
> [    0.000000]   node   2: [mem 0x3e800000-0x5dbfffff]
> [    0.000000]   node   3: [mem 0x5dc00000-0x7cffffff]
> [    0.000000]   node   4: [mem 0x7d000000-0x9c3fffff]
> [    0.000000]   node   5: [mem 0x9c400000-0xac227fff]
> [    0.000000]   node   5: [mem 0x100000000-0x10f5d7fff]
> [    0.000000]   node   6: [mem 0x10f5d8000-0x12e9d7fff]
> [    0.000000]   node   7: [mem 0x12e9d8000-0x14ddd7fff]
> [    0.000000] On node 0 totalpages: 127903
> [    0.000000]   DMA zone: 64 pages used for memmap
> [    0.000000]   DMA zone: 21 pages reserved
> [    0.000000]   DMA zone: 3999 pages, LIFO batch:0
> [    0.000000]   DMA32 zone: 1936 pages used for memmap
> [    0.000000]   DMA32 zone: 123904 pages, LIFO batch:31
> [    0.000000] On node 1 totalpages: 128000
> [    0.000000]   DMA32 zone: 2000 pages used for memmap
> [    0.000000]   DMA32 zone: 128000 pages, LIFO batch:31
> [    0.000000] On node 2 totalpages: 128000
> [    0.000000]   DMA32 zone: 2000 pages used for memmap
> [    0.000000]   DMA32 zone: 128000 pages, LIFO batch:31
> [    0.000000] On node 3 totalpages: 128000
> [    0.000000]   DMA32 zone: 2000 pages used for memmap
> [    0.000000]   DMA32 zone: 128000 pages, LIFO batch:31
> [    0.000000] On node 4 totalpages: 128000
> [    0.000000]   DMA32 zone: 2000 pages used for memmap
> [    0.000000]   DMA32 zone: 128000 pages, LIFO batch:31
> [    0.000000] On node 5 totalpages: 128000
> [    0.000000]   DMA32 zone: 1017 pages used for memmap
> [    0.000000]   DMA32 zone: 65064 pages, LIFO batch:15
> [    0.000000]   Normal zone: 984 pages used for memmap
> [    0.000000]   Normal zone: 62936 pages, LIFO batch:15
> [    0.000000] On node 6 totalpages: 128000
> [    0.000000]   Normal zone: 2000 pages used for memmap
> [    0.000000]   Normal zone: 128000 pages, LIFO batch:31
> [    0.000000] On node 7 totalpages: 128000
> [    0.000000]   Normal zone: 2000 pages used for memmap
> [    0.000000]   Normal zone: 128000 pages, LIFO batch:31
> [    0.000000] SFI: Simple Firmware Interface v0.81 http://simplefirmware.org
> [    0.000000] smpboot: Allowing 8 CPUs, 0 hotplug CPUs
> [    0.000000] nr_irqs_gsi: 16
> [    0.000000] PM: Registered nosave memory: [mem 0x000a0000-0x000fffff]
> [    0.000000] PM: Registered nosave memory: [mem 0xac228000-0xac26bfff]
> [    0.000000] PM: Registered nosave memory: [mem 0xac26c000-0xac57ffff]
> [    0.000000] PM: Registered nosave memory: [mem 0xac580000-0xac5a0fff]
> [    0.000000] PM: Registered nosave memory: [mem 0xac5a1000-0xac5bbfff]
> [    0.000000] PM: Registered nosave memory: [mem 0xac5bc000-0xac5bdfff]
> [    0.000000] PM: Registered nosave memory: [mem 0xac5be000-0xac5befff]
> [    0.000000] PM: Registered nosave memory: [mem 0xac5bf000-0xac5cafff]
> [    0.000000] PM: Registered nosave memory: [mem 0xac5cb000-0xac5d9fff]
> [    0.000000] PM: Registered nosave memory: [mem 0xac5da000-0xac5fafff]
> [    0.000000] PM: Registered nosave memory: [mem 0xac5fb000-0xac6b5fff]
> [    0.000000] PM: Registered nosave memory: [mem 0xac6b6000-0xac7fafff]
> [    0.000000] PM: Registered nosave memory: [mem 0xac7fb000-0xac80efff]
> [    0.000000] PM: Registered nosave memory: [mem 0xac80f000-0xac80ffff]
> [    0.000000] PM: Registered nosave memory: [mem 0xac810000-0xac810fff]
> [    0.000000] PM: Registered nosave memory: [mem 0xac811000-0xac812fff]
> [    0.000000] PM: Registered nosave memory: [mem 0xac813000-0xad7fffff]
> [    0.000000] PM: Registered nosave memory: [mem 0xad800000-0xafffffff]
> [    0.000000] PM: Registered nosave memory: [mem 0xb0000000-0xb3ffffff]
> [    0.000000] PM: Registered nosave memory: [mem 0xb4000000-0xfed1ffff]
> [    0.000000] PM: Registered nosave memory: [mem 0xfed20000-0xfed3ffff]
> [    0.000000] PM: Registered nosave memory: [mem 0xfed40000-0xfed4ffff]
> [    0.000000] PM: Registered nosave memory: [mem 0xfed50000-0xfed8ffff]
> [    0.000000] PM: Registered nosave memory: [mem 0xfed90000-0xfedfffff]
> [    0.000000] PM: Registered nosave memory: [mem 0xfee00000-0xfeefffff]
> [    0.000000] PM: Registered nosave memory: [mem 0xfef00000-0xff9fffff]
> [    0.000000] PM: Registered nosave memory: [mem 0xffa00000-0xffa3ffff]
> [    0.000000] PM: Registered nosave memory: [mem 0xffa40000-0xffffffff]
> [    0.000000] e820: [mem 0xb4000000-0xfed1ffff] available for PCI devices
> [    0.000000] Booting paravirtualized kernel on Xen
> [    0.000000] Xen version: 4.5-unstable (preserve-AD)
> [    0.000000] setup_percpu: NR_CPUS:20 nr_cpumask_bits:20 nr_cpu_ids:8 
> nr_node_ids:8
> [    0.000000] PERCPU: Embedded 28 pages/cpu @ffff88001e800000 s85888 r8192 
> d20608 u2097152
> [    0.000000] pcpu-alloc: s85888 r8192 d20608 u2097152 alloc=1*2097152
> [    0.000000] pcpu-alloc: [0] 0 [1] 1 [2] 2 [3] 3 [4] 4 [5] 5 [6] 6 [7] 7
> [    0.000000] xen: PV spinlocks enabled
> [    0.000000] Built 8 zonelists in Node order, mobility grouping on.  Total 
> pages: 1007881
> [    0.000000] Policy zone: Normal
> [    0.000000] Kernel command line: root=/dev/xvda1 ro console=hvc0 debug  
> kgdboc=hvc0 nokgdbroundup  initcall_debug debug
> [    0.000000] PID hash table entries: 4096 (order: 3, 32768 bytes)
> [    0.000000] xsave: enabled xstate_bv 0x7, cntxt size 0x340
> [    0.000000] Checking aperture...
> [    0.000000] No AGP bridge found
> [    0.000000] Memory: 3976748K/4095612K available (4022K kernel code, 769K 
> rwdata, 1744K rodata, 1532K init, 1472K bss, 118864K reserved)
>
> root@heatpipe:~# numactl --ha
> maxn: 7
> available: 8 nodes (0-7)
> node 0 cpus: 0
> node 0 size: 458 MB
> node 0 free: 424 MB
> node 1 cpus: 1
> node 1 size: 491 MB
> node 1 free: 481 MB
> node 2 cpus: 2
> node 2 size: 491 MB
> node 2 free: 482 MB
> node 3 cpus: 3
> node 3 size: 491 MB
> node 3 free: 485 MB
> node 4 cpus: 4
> node 4 size: 491 MB
> node 4 free: 485 MB
> node 5 cpus: 5
> node 5 size: 491 MB
> node 5 free: 484 MB
> node 6 cpus: 6
> node 6 size: 491 MB
> node 6 free: 486 MB
> node 7 cpus: 7
> node 7 size: 476 MB
> node 7 free: 471 MB
> node distances:
> node   0   1   2   3   4   5   6   7
>   0:  10  40  40  40  40  40  40  40
>   1:  40  10  40  40  40  40  40  40
>   2:  40  40  10  40  40  40  40  40
>   3:  40  40  40  10  40  40  40  40
>   4:  40  40  40  40  10  40  40  40
>   5:  40  40  40  40  40  10  40  40
>   6:  40  40  40  40  40  40  10  40
>   7:  40  40  40  40  40  40  40  10
>
> root@heatpipe:~# numastat
>                            node0           node1           node2           node3
> numa_hit                  182203           14574           23800           17017
> numa_miss                      0               0               0               0
> numa_foreign                   0               0               0               0
> interleave_hit              1016            1010            1051            1030
> local_node                180995           12906           23272           15338
> other_node                  1208            1668             528            1679
>
>                            node4           node5           node6           node7
> numa_hit                   10621           15346            3529            3863
> numa_miss                      0               0               0               0
> numa_foreign                   0               0               0               0
> interleave_hit              1026            1017            1031            1029
> local_node                  8941           13680            1855            2184
> other_node                  1680            1666            1674            1679
>
> root@superpipe:~# xl debug-keys u
>
> (XEN) Domain 6 (total: 1024000):
> (XEN)     Node 0: 321064
> (XEN)     Node 1: 702936
> (XEN)     Domain has 8 vnodes, 8 vcpus
> (XEN)         vnode 0 - pnode 1, 500 MB, vcpu nums: 0
> (XEN)         vnode 1 - pnode 0, 500 MB, vcpu nums: 1
> (XEN)         vnode 2 - pnode 1, 500 MB, vcpu nums: 2
> (XEN)         vnode 3 - pnode 1, 500 MB, vcpu nums: 3
> (XEN)         vnode 4 - pnode 0, 500 MB, vcpu nums: 4
> (XEN)         vnode 5 - pnode 0, 1841 MB, vcpu nums: 5
> (XEN)         vnode 6 - pnode 1, 500 MB, vcpu nums: 6
> (XEN)         vnode 7 - pnode 1, 500 MB, vcpu nums: 7
>
> Current problems:
>
> This was marked as a separate problem but is left here for reference.
> Warning on CPU bringup on another node
>
>     The cpus in the guest which belong to different NUMA nodes are
>     configured to share the same l2 cache and are thus considered siblings,
>     although siblings are expected to be on the same node. One can see the
>     following WARNING during boot:
>
> [    0.022750] SMP alternatives: switching to SMP code
> [    0.004000] ------------[ cut here ]------------
> [    0.004000] WARNING: CPU: 1 PID: 0 at arch/x86/kernel/smpboot.c:303 
> topology_sane.isra.8+0x67/0x79()
> [    0.004000] sched: CPU #1's smt-sibling CPU #0 is not on the same node! 
> [node: 1 != 0]. Ignoring dependency.
> [    0.004000] Modules linked in:
> [    0.004000] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.15.0-rc8+ #43
> [    0.004000]  0000000000000000 0000000000000009 ffffffff813df458 
> ffff88007abe7e60
> [    0.004000]  ffffffff81048963 ffff88007abe7e70 ffffffff8102fb08 
> ffffffff00000100
> [    0.004000]  0000000000000001 ffff8800f6e13900 0000000000000000 
> 000000000000b018
> [    0.004000] Call Trace:
> [    0.004000]  [<ffffffff813df458>] ? dump_stack+0x41/0x51
> [    0.004000]  [<ffffffff81048963>] ? warn_slowpath_common+0x78/0x90
> [    0.004000]  [<ffffffff8102fb08>] ? topology_sane.isra.8+0x67/0x79
> [    0.004000]  [<ffffffff81048a13>] ? warn_slowpath_fmt+0x45/0x4a
> [    0.004000]  [<ffffffff8102fb08>] ? topology_sane.isra.8+0x67/0x79
> [    0.004000]  [<ffffffff8102fd2e>] ? set_cpu_sibling_map+0x1c9/0x3f7
> [    0.004000]  [<ffffffff81042146>] ? numa_add_cpu+0xa/0x18
> [    0.004000]  [<ffffffff8100b4e2>] ? cpu_bringup+0x50/0x8f
> [    0.004000]  [<ffffffff8100b544>] ? cpu_bringup_and_idle+0x1d/0x28
> [    0.004000] ---[ end trace 0e2e2fd5c7b76da5 ]---
> [    0.035371] x86: Booted up 2 nodes, 2 CPUs
>
> The workaround is to specify cpuid in the config file so that SMT is not
> used. A better solution is being worked on and will be posted separately.
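>
> For reference, a hypothetical xl config line for this workaround could use
> the libxl cpuid syntax (see xl.cfg(5)) to hide hyperthreading from the
> guest:
>
> cpuid = "host,htt=0"  # sketch: clear the HTT flag so no SMT siblings are reported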
>
> Incorrect amount of memory for nodes in debug-keys output
>
>     Since the per-domain node ranges are stored as guest addresses, the
>     calculated memory is incorrect for some nodes due to the holes in the
>     guest e820 memory map.
>
> TODO:
>     - some modifications to the automatic vNUMA placement may be needed;
>     - an extended vdistance configuration parser needs to be put in place;
>     - the SMT siblings problem (see above) needs a solution (different
>       series);
>
> Changes since v6:
>     - added a limit on the number of vNUMA nodes per domain (32) on the Xen
>       side. This will be raised in the next version, as the limit does not
>       seem big enough;
>     - added a read/write lock to the domain structure to synchronize access
>       to the vnuma structure;
>     - added copying of the actual number of vcpus back to the guest;
>     - added example xsm policies;
>     - reorganized the series so that the xl implementation comes after the
>       libxl definitions;
>     - changed the idl names for the vnuma structure members in libxc;
>     - changed the failure path in Xen when setting the vnuma topology: fail
>       instead of creating a default node, so as not to introduce different
>       views of vnuma between the toolstack and Xen;
>     - changed the failure path when parsing the vnuma config to simply fail
>       instead of creating a single default node;
>
> Changes since v5:
>     - reorganized patches;
>     - modified the domctl hypercall and added locking;
>     - added XSM hypercalls with basic policies;
>     - verified 32-bit compatibility;
>
> Elena Ufimtseva (9):
>   xen: vnuma topology and subop hypercalls
>   xsm bits for vNUMA hypercalls
>   vnuma hook to debug-keys u
>   libxc: Introduce xc_domain_setvnuma to set vNUMA
>   libxl: vnuma types declararion
>   libxl: build numa nodes memory blocks
>   libxc: allocate domain memory for vnuma enabled
>   libxl: vnuma nodes placement bits
>   libxl: vnuma topology configuration parser and doc
>
>  docs/man/xl.cfg.pod.5                        |   77 +++++
>  tools/flask/policy/policy/modules/xen/xen.if |    3 +-
>  tools/flask/policy/policy/modules/xen/xen.te |    2 +-
>  tools/libxc/xc_dom.h                         |   13 +
>  tools/libxc/xc_dom_x86.c                     |   76 ++++-
>  tools/libxc/xc_domain.c                      |   63 ++++
>  tools/libxc/xenctrl.h                        |    9 +
>  tools/libxl/libxl_create.c                   |    1 +
>  tools/libxl/libxl_dom.c                      |  148 +++++++++
>  tools/libxl/libxl_internal.h                 |    9 +
>  tools/libxl/libxl_numa.c                     |  193 ++++++++++++
>  tools/libxl/libxl_types.idl                  |    7 +-
>  tools/libxl/libxl_vnuma.h                    |   13 +
>  tools/libxl/libxl_x86.c                      |    3 +-
>  tools/libxl/xl_cmdimpl.c                     |  425 ++++++++++++++++++++++++++
>  xen/arch/x86/numa.c                          |   30 +-
>  xen/common/domain.c                          |   15 +
>  xen/common/domctl.c                          |  122 ++++++++
>  xen/common/memory.c                          |  106 +++++++
>  xen/include/public/arch-x86/xen.h            |    9 +
>  xen/include/public/domctl.h                  |   29 ++
>  xen/include/public/memory.h                  |   47 ++-
>  xen/include/xen/domain.h                     |   11 +
>  xen/include/xen/sched.h                      |    4 +
>  xen/include/xsm/dummy.h                      |    6 +
>  xen/include/xsm/xsm.h                        |    7 +
>  xen/xsm/dummy.c                              |    1 +
>  xen/xsm/flask/hooks.c                        |   10 +
>  xen/xsm/flask/policy/access_vectors          |    4 +
>  29 files changed, 1425 insertions(+), 18 deletions(-)
>  create mode 100644 tools/libxl/libxl_vnuma.h
>
> --
> 1.7.10.4
>

Hello

I am re-sending this series as the previous posting of v7 was not reviewed,
except by Daniel for the xsm part.
It includes the changes mentioned in the change log of patch 0/9.

Please review this series and send your comments.

-- 
Elena

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel
