
[Xen-devel] [PATCH v4 0/7] vNUMA introduction



vNUMA introduction

This series of patches introduces vNUMA topology awareness and
provides the interfaces and data structures needed to enable vNUMA
for PV guests. There is a plan to extend this support to dom0 and
HVM domains.

vNUMA topology support must also be present in the PV guest kernel;
the corresponding patches should be applied.

Introduction
-------------

vNUMA topology is exposed to the PV guest to improve performance when running
workloads on NUMA machines.
The Xen vNUMA implementation provides a way to create vNUMA-enabled guests on
NUMA/UMA machines and to map the vNUMA topology to the physical NUMA topology
in an optimal way.

Xen vNUMA support

The current set of patches introduces a subop hypercall that is available
to enlightened PV guests with the vNUMA patches applied.

The domain structure was modified to reflect the per-domain vNUMA topology,
for use by other vNUMA-aware subsystems (e.g. ballooning).
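
For illustration only, the kind of per-domain vNUMA description this series
deals with (number of vnodes, per-vnode memory ranges, vcpu-to-vnode
assignment, vnode-to-pnode mapping and a distance table) can be sketched in C
as below. All type and field names here are hypothetical and are not taken
from the actual patches:

#include <stdint.h>

/* Hypothetical sketch of a per-domain vNUMA description; the names are
 * illustrative only and do not claim to match the structures added by
 * this series. */
struct vnuma_memblk {
    uint64_t start;                 /* first byte of the vnode's memory range */
    uint64_t end;                   /* last byte + 1 */
};

struct vnuma_topology {
    unsigned int nr_vnodes;         /* number of virtual NUMA nodes */
    unsigned int nr_vcpus;          /* number of vcpus in the domain */
    unsigned int *vdistance;        /* nr_vnodes * nr_vnodes distance table */
    unsigned int *vcpu_to_vnode;    /* nr_vcpus entries: vcpu -> vnode */
    unsigned int *vnode_to_pnode;   /* nr_vnodes entries: vnode -> physical node */
    struct vnuma_memblk *vmemblks;  /* nr_vnodes per-vnode memory ranges */
};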

libxc

libxc provides interfaces to build PV guests with vNUMA support and, in the
case of NUMA machines, performs the initial memory allocation on the physical
NUMA nodes. This is implemented by utilizing the nodemap formed by automatic
NUMA placement. Details are in patch #3.
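
As a rough illustration of that step (not the actual libxc code), the sketch
below splits the guest's pages evenly across the vnodes and assigns each vnode
a physical node taken round-robin from the placement nodemap; the function and
parameter names are made up for this example:

/* Illustrative only: distribute a guest's pages over its vnodes and pick a
 * physical node for each vnode from the nodemap produced by automatic
 * placement.  Not the actual libxc implementation. */
static void sketch_vnuma_layout(unsigned long total_pages,
                                unsigned int nr_vnodes,
                                const unsigned int *nodemap,    /* candidate pnodes */
                                unsigned int nodemap_len,
                                unsigned long *pages_per_vnode, /* out: nr_vnodes entries */
                                unsigned int *vnode_to_pnode)   /* out: nr_vnodes entries */
{
    unsigned int i;

    for ( i = 0; i < nr_vnodes; i++ )
    {
        /* Spread pages evenly; the last vnode takes the remainder. */
        pages_per_vnode[i] = total_pages / nr_vnodes;
        if ( i == nr_vnodes - 1 )
            pages_per_vnode[i] += total_pages % nr_vnodes;

        /* Round-robin the vnodes over the physical nodes chosen by placement. */
        vnode_to_pnode[i] = nodemap[i % nodemap_len];
    }
}

Each vnode's range of guest pfns would then be populated with memory requested
from its assigned physical node.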

libxl

libxl provides a way to predefine the vNUMA topology in the VM config: the
number of vnodes, the memory arrangement, the vcpu-to-vnode assignment, and
the distance map.

PV guest

As of now, only PV guests can take advantage of the vNUMA functionality. The
vNUMA Linux patches should be applied and NUMA support should be compiled into
the kernel.

This patchset can be pulled from https://git.gitorious.org/xenvnuma/xenvnuma.git:v6
Linux patchset: https://git.gitorious.org/xenvnuma/linuxvnuma.git:v6

Examples of booting a vNUMA-enabled PV Linux guest on a real NUMA machine:

1. Automatic vNUMA placement on h/w NUMA machine:

VM config:

memory = 16384
vcpus = 4
name = "rcbig"
vnodes = 4
vnumamem = [10,10]
vnuma_distance = [10, 30, 10, 30]
vcpu_to_vnode = [0, 0, 1, 1]

Xen:

(XEN) Memory location of each domain:
(XEN) Domain 0 (total: 2569511):
(XEN)     Node 0: 1416166
(XEN)     Node 1: 1153345
(XEN) Domain 5 (total: 4194304):
(XEN)     Node 0: 2097152
(XEN)     Node 1: 2097152
(XEN)     Domain has 4 vnodes
(XEN)         vnode 0 - pnode 0  (4096) MB
(XEN)         vnode 1 - pnode 0  (4096) MB
(XEN)         vnode 2 - pnode 1  (4096) MB
(XEN)         vnode 3 - pnode 1  (4096) MB
(XEN)     Domain vcpu to vnode:
(XEN)     0 1 2 3

dmesg on pv guest:

[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x00001000-0x0009ffff]
[    0.000000]   node   0: [mem 0x00100000-0xffffffff]
[    0.000000]   node   1: [mem 0x100000000-0x1ffffffff]
[    0.000000]   node   2: [mem 0x200000000-0x2ffffffff]
[    0.000000]   node   3: [mem 0x300000000-0x3ffffffff]
[    0.000000] On node 0 totalpages: 1048479
[    0.000000]   DMA zone: 56 pages used for memmap
[    0.000000]   DMA zone: 21 pages reserved
[    0.000000]   DMA zone: 3999 pages, LIFO batch:0
[    0.000000]   DMA32 zone: 14280 pages used for memmap
[    0.000000]   DMA32 zone: 1044480 pages, LIFO batch:31
[    0.000000] On node 1 totalpages: 1048576
[    0.000000]   Normal zone: 14336 pages used for memmap
[    0.000000]   Normal zone: 1048576 pages, LIFO batch:31
[    0.000000] On node 2 totalpages: 1048576
[    0.000000]   Normal zone: 14336 pages used for memmap
[    0.000000]   Normal zone: 1048576 pages, LIFO batch:31
[    0.000000] On node 3 totalpages: 1048576
[    0.000000]   Normal zone: 14336 pages used for memmap
[    0.000000]   Normal zone: 1048576 pages, LIFO batch:31
[    0.000000] SFI: Simple Firmware Interface v0.81 http://simplefirmware.org
[    0.000000] smpboot: Allowing 4 CPUs, 0 hotplug CPUs
[    0.000000] No local APIC present
[    0.000000] APIC: disable apic facility
[    0.000000] APIC: switched to apic NOOP
[    0.000000] nr_irqs_gsi: 16
[    0.000000] PM: Registered nosave memory: [mem 0x000a0000-0x000fffff]
[    0.000000] e820: cannot find a gap in the 32bit address range
[    0.000000] e820: PCI devices with unassigned 32bit BARs may break!
[    0.000000] e820: [mem 0x400100000-0x4004fffff] available for PCI devices
[    0.000000] Booting paravirtualized kernel on Xen
[    0.000000] Xen version: 4.4-unstable (preserve-AD)
[    0.000000] setup_percpu: NR_CPUS:512 nr_cpumask_bits:512 nr_cpu_ids:4 
nr_node_ids:4
[    0.000000] PERCPU: Embedded 28 pages/cpu @ffff8800ffc00000 s85376 r8192 
d21120 u2097152
[    0.000000] pcpu-alloc: s85376 r8192 d21120 u2097152 alloc=1*2097152
[    0.000000] pcpu-alloc: [0] 0 [1] 1 [2] 2 [3] 3


pv guest: numactl --hardware:

root@heatpipe:~# numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0
node 0 size: 4031 MB
node 0 free: 3997 MB
node 1 cpus: 1
node 1 size: 4039 MB
node 1 free: 4022 MB
node 2 cpus: 2
node 2 size: 4039 MB
node 2 free: 4023 MB
node 3 cpus: 3
node 3 size: 3975 MB
node 3 free: 3963 MB
node distances:
node   0   1   2   3
  0:  10  20  20  20
  1:  20  10  20  20
  2:  20  20  10  20
  3:  20  20  20  10

Comments:
None of the configuration options above are correct, so default values were
used. Since the machine is a NUMA machine and no vcpu pinning is defined, the
automatic NUMA node selection mechanism is used, and you can see how the
vnodes were split across the physical nodes.

2. Example with e820_host = 1 (32GB real NUMA machine, two nodes).

pv config:
memory = 4000
vcpus = 8
# The name of the domain, change this if you want more than 1 VM.
name = "null"
vnodes = 4
#vnumamem = [3000, 1000]
vdistance = [10, 40]
#vnuma_vcpumap = [1, 0, 3, 2]
vnuma_vnodemap = [1, 0, 1, 0]
#vnuma_autoplacement = 1
e820_host = 1 

guest boot:

[    0.000000] Initializing cgroup subsys cpuset
[    0.000000] Initializing cgroup subsys cpu
[    0.000000] Initializing cgroup subsys cpuacct
[    0.000000] Linux version 3.12.0+ (assert@superpipe) (gcc version 4.7.2 (Debi
an 4.7.2-5) ) #111 SMP Tue Dec 3 14:54:36 EST 2013
[    0.000000] Command line: root=/dev/xvda1 ro earlyprintk=xen debug loglevel=8
 debug print_fatal_signals=1 loglvl=all guest_loglvl=all LOGLEVEL=8 earlyprintk=
xen sched_debug
[    0.000000] ACPI in unprivileged domain disabled
[    0.000000] Freeing ac228-fa000 pfn range: 318936 pages freed
[    0.000000] 1-1 mapping on ac228->100000
[    0.000000] Released 318936 pages of unused memory
[    0.000000] Set 343512 page(s) to 1-1 mapping
[    0.000000] Populating 100000-14ddd8 pfn range: 318936 pages added
[    0.000000] e820: BIOS-provided physical RAM map:
[    0.000000] Xen: [mem 0x0000000000000000-0x000000000009ffff] usable
[    0.000000] Xen: [mem 0x00000000000a0000-0x00000000000fffff] reserved
[    0.000000] Xen: [mem 0x0000000000100000-0x00000000ac227fff] usable
[    0.000000] Xen: [mem 0x00000000ac228000-0x00000000ac26bfff] reserved
[    0.000000] Xen: [mem 0x00000000ac26c000-0x00000000ac57ffff] unusable
[    0.000000] Xen: [mem 0x00000000ac580000-0x00000000ac5a0fff] reserved
[    0.000000] Xen: [mem 0x00000000ac5a1000-0x00000000ac5bbfff] unusable
[    0.000000] Xen: [mem 0x00000000ac5bc000-0x00000000ac5bdfff] reserved
[    0.000000] Xen: [mem 0x00000000ac5be000-0x00000000ac5befff] unusable
[    0.000000] Xen: [mem 0x00000000ac5bf000-0x00000000ac5cafff] reserved
[    0.000000] Xen: [mem 0x00000000ac5cb000-0x00000000ac5d9fff] unusable
[    0.000000] Xen: [mem 0x00000000ac5da000-0x00000000ac5fafff] reserved
[    0.000000] Xen: [mem 0x00000000ac5fb000-0x00000000ac6b6fff] unusable
[    0.000000] Xen: [mem 0x00000000ac6b7000-0x00000000ac7fafff] ACPI NVS
[    0.000000] Xen: [mem 0x00000000ac7fb000-0x00000000ac80efff] unusable
[    0.000000] Xen: [mem 0x00000000ac80f000-0x00000000ac80ffff] ACPI data
[    0.000000] Xen: [mem 0x00000000ac810000-0x00000000ac810fff] unusable
[    0.000000] Xen: [mem 0x00000000ac811000-0x00000000ac812fff] ACPI data
[    0.000000] Xen: [mem 0x00000000ac813000-0x00000000ad7fffff] unusable
[    0.000000] Xen: [mem 0x00000000b0000000-0x00000000b3ffffff] reserved
[    0.000000] Xen: [mem 0x00000000fed20000-0x00000000fed3ffff] reserved
[    0.000000] Xen: [mem 0x00000000fed50000-0x00000000fed8ffff] reserved
[    0.000000] Xen: [mem 0x00000000fee00000-0x00000000feefffff] reserved
[    0.000000] Xen: [mem 0x00000000ffa00000-0x00000000ffa3ffff] reserved
[    0.000000] Xen: [mem 0x0000000100000000-0x000000014ddd7fff] usable
[    0.000000] bootconsole [xenboot0] enabled
[    0.000000] NX (Execute Disable) protection: active
[    0.000000] DMI not present or invalid.
[    0.000000] e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
[    0.000000] e820: remove [mem 0x000a0000-0x000fffff] usable
[    0.000000] No AGP bridge found
[    0.000000] e820: last_pfn = 0x14ddd8 max_arch_pfn = 0x400000000
[    0.000000] e820: last_pfn = 0xac228 max_arch_pfn = 0x400000000
[    0.000000] Base memory trampoline at [ffff88000009a000] 9a000 size 24576
[    0.000000] init_memory_mapping: [mem 0x00000000-0x000fffff]
[    0.000000]  [mem 0x00000000-0x000fffff] page 4k
[    0.000000] init_memory_mapping: [mem 0x14da00000-0x14dbfffff]
[    0.000000]  [mem 0x14da00000-0x14dbfffff] page 4k
[    0.000000] BRK [0x019bd000, 0x019bdfff] PGTABLE
[    0.000000] BRK [0x019be000, 0x019befff] PGTABLE
[    0.000000] init_memory_mapping: [mem 0x14c000000-0x14d9fffff]
[    0.000000]  [mem 0x14c000000-0x14d9fffff] page 4k
[    0.000000] BRK [0x019bf000, 0x019bffff] PGTABLE
[    0.000000] BRK [0x019c0000, 0x019c0fff] PGTABLE
[    0.000000] BRK [0x019c1000, 0x019c1fff] PGTABLE
[    0.000000] BRK [0x019c2000, 0x019c2fff] PGTABLE
[    0.000000] init_memory_mapping: [mem 0x100000000-0x14bffffff]
[    0.000000]  [mem 0x100000000-0x14bffffff] page 4k
[    0.000000] init_memory_mapping: [mem 0x00100000-0xac227fff]
[    0.000000]  [mem 0x00100000-0xac227fff] page 4k
[    0.000000] init_memory_mapping: [mem 0x14dc00000-0x14ddd7fff]
[    0.000000]  [mem 0x14dc00000-0x14ddd7fff] page 4k
[    0.000000] RAMDISK: [mem 0x01dc8000-0x0346ffff]
[    0.000000] NUMA: Initialized distance table, cnt=4
[    0.000000] Initmem setup node 0 [mem 0x00000000-0x3e7fffff]
[    0.000000]   NODE_DATA [mem 0x3e7d9000-0x3e7fffff]
[    0.000000] Initmem setup node 1 [mem 0x3e800000-0x7cffffff]
[    0.000000]   NODE_DATA [mem 0x7cfd9000-0x7cffffff]
[    0.000000] Initmem setup node 2 [mem 0x7d000000-0x10f5dffff]
[    0.000000]   NODE_DATA [mem 0x10f5b9000-0x10f5dffff]
[    0.000000] Initmem setup node 3 [mem 0x10f800000-0x14ddd7fff]
[    0.000000]   NODE_DATA [mem 0x14ddad000-0x14ddd3fff]
[    0.000000] Zone ranges:
[    0.000000]   DMA      [mem 0x00001000-0x00ffffff]
[    0.000000]   DMA32    [mem 0x01000000-0xffffffff]
[    0.000000]   Normal   [mem 0x100000000-0x14ddd7fff]
[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x00001000-0x0009ffff]
[    0.000000]   node   0: [mem 0x00100000-0x3e7fffff]
[    0.000000]   node   1: [mem 0x3e800000-0x7cffffff]
[    0.000000]   node   2: [mem 0x7d000000-0xac227fff]
[    0.000000]   node   2: [mem 0x100000000-0x10f5dffff]
[    0.000000]   node   3: [mem 0x10f5e0000-0x14ddd7fff]
[    0.000000] On node 0 totalpages: 255903
[    0.000000]   DMA zone: 56 pages used for memmap
[    0.000000]   DMA zone: 21 pages reserved
[    0.000000]   DMA zone: 3999 pages, LIFO batch:0
[    0.000000]   DMA32 zone: 3444 pages used for memmap
[    0.000000]   DMA32 zone: 251904 pages, LIFO batch:31
[    0.000000] On node 1 totalpages: 256000
[    0.000000]   DMA32 zone: 3500 pages used for memmap
[    0.000000]   DMA32 zone: 256000 pages, LIFO batch:31
[    0.000000] On node 2 totalpages: 256008
[    0.000000]   DMA32 zone: 2640 pages used for memmap
[    0.000000]   DMA32 zone: 193064 pages, LIFO batch:31
[    0.000000]   Normal zone: 861 pages used for memmap
[    0.000000]   Normal zone: 62944 pages, LIFO batch:15
[    0.000000] On node 3 totalpages: 255992
[    0.000000]   Normal zone: 3500 pages used for memmap
[    0.000000]   Normal zone: 255992 pages, LIFO batch:31
[    0.000000] SFI: Simple Firmware Interface v0.81 http://simplefirmware.org
[    0.000000] smpboot: Allowing 8 CPUs, 0 hotplug CPUs

root@heatpipe:~# numactl --ha
available: 4 nodes (0-3)
node 0 cpus: 0 4
node 0 size: 977 MB
node 0 free: 947 MB
node 1 cpus: 1 5
node 1 size: 985 MB
node 1 free: 974 MB
node 2 cpus: 2 6
node 2 size: 985 MB
node 2 free: 973 MB
node 3 cpus: 3 7
node 3 size: 969 MB
node 3 free: 958 MB
node distances:
node   0   1   2   3 
  0:  10  40  40  40 
  1:  40  10  40  40 
  2:  40  40  10  40 
  3:  40  40  40  10 

root@heatpipe:~# numastat -m

Per-node system memory usage (in MBs):
                          Node 0          Node 1          Node 2          Node 3           Total
                 --------------- --------------- --------------- --------------- ---------------
MemTotal                  977.14          985.50          985.44          969.91         3917.99

hypervisor: xl debug-keys u

(XEN) 'u' pressed -> dumping numa info (now-0x2A3:F7B8CB0F)
(XEN) Domain 2 (total: 1024000):
(XEN)     Node 0: 415468
(XEN)     Node 1: 608532
(XEN)     Domain has 4 vnodes
(XEN)         vnode 0 - pnode 1 1000 MB, vcpus: 0 4 
(XEN)         vnode 1 - pnode 0 1000 MB, vcpus: 1 5 
(XEN)         vnode 2 - pnode 1 2341 MB, vcpus: 2 6 
(XEN)         vnode 3 - pnode 0 999 MB, vcpus: 3 7 

This size discrepancy is caused by the way the size is calculated from guest
pfns (end - start), so the e820 hole, ~1.3 GB in this case, is included in the
size. For example, vnode 2 spans the hole between pfn 0xac228 and 0x100000
(343512 pages, ~1341 MB), which added to its nominal 1000 MB accounts for the
2341 MB reported above.

3. Default (zero) vNUMA configuration for a PV domain. There will be at least
one vnuma node if no vnuma topology was specified.

pv config:

memory = 4000
vcpus = 8
# The name of the domain, change this if you want more than 1 VM.
name = "null"
#vnodes = 4
vnumamem = [3000, 1000]
vdistance = [10, 40]
vnuma_vcpumap = [1, 0, 3, 2]
vnuma_vnodemap = [1, 0, 1, 0]
vnuma_autoplacement = 1
e820_host = 1

boot:
[    0.000000] init_memory_mapping: [mem 0x14dc00000-0x14ddd7fff]
[    0.000000]  [mem 0x14dc00000-0x14ddd7fff] page 4k
[    0.000000] RAMDISK: [mem 0x01dc8000-0x0346ffff]
[    0.000000] NUMA: Initialized distance table, cnt=1
[    0.000000] Initmem setup node 0 [mem 0x00000000-0x14ddd7fff]
[    0.000000]   NODE_DATA [mem 0x14ddad000-0x14ddd3fff]
[    0.000000] Zone ranges:
[    0.000000]   DMA      [mem 0x00001000-0x00ffffff]
[    0.000000]   DMA32    [mem 0x01000000-0xffffffff]
[    0.000000]   Normal   [mem 0x100000000-0x14ddd7fff]
[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x00001000-0x0009ffff]
[    0.000000]   node   0: [mem 0x00100000-0xac227fff]
[    0.000000]   node   0: [mem 0x100000000-0x14ddd7fff]

root@heatpipe:~# numactl --ha
maxn: 0
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 3918 MB
node 0 free: 3853 MB
node distances:
node   0 
  0:  10 

root@heatpipe:~# numastat -m

Per-node system memory usage (in MBs):
                          Node 0           Total
                 --------------- ---------------
MemTotal                 3918.74         3918.74

hypervisor: xl debug-keys u

(XEN) Memory location of each domain:
(XEN) Domain 0 (total: 6787432):
(XEN)     Node 0: 3485706
(XEN)     Node 1: 3301726
(XEN) Domain 3 (total: 1024000):
(XEN)     Node 0: 512000
(XEN)     Node 1: 512000
(XEN)     Domain has 1 vnodes
(XEN)         vnode 0 - pnode any 5341 MB, vcpus: 0 1 2 3 4 5 6 7


Notes:

To enable vNUMA in a PV guest, the corresponding patch set should be
applied - https://git.gitorious.org/xenvnuma/linuxvnuma.git:v5
or
https://www.gitorious.org/xenvnuma/linuxvnuma/commit/deaa014257b99f57c76fbba12a28907786cbe17d.


Issues:

The most important issue right now is the automatic NUMA balancing in the
Linux PV kernel, as it corrupts user-space memory. Since v3 of this patch
series, Linux kernel 3.13 seemed to perform correctly, but with recent changes
the issue is back. See https://lkml.org/lkml/2013/10/31/133 for the urgent
patch which presumably had NUMA balancing working. Since 3.12 there have been
multiple changes to automatic NUMA balancing. I am currently back to
investigating whether anything should be done on the hypervisor side and will
work with the kernel maintainers.

Elena Ufimtseva (7):
  xen: vNUMA support for PV guests
  libxc: Plumb Xen with vNUMA topology for domain
  xl: vnuma memory parsing and supplement functions
  xl: vnuma distance, vcpu and pnode masks parser
  libxc: vnuma memory domain allocation
  libxl: vNUMA supporting interface
  xen: adds vNUMA info debug-key u

 docs/man/xl.cfg.pod.5        |   60 +++++++
 tools/libxc/xc_dom.h         |   10 ++
 tools/libxc/xc_dom_x86.c     |   63 +++++--
 tools/libxc/xc_domain.c      |   64 +++++++
 tools/libxc/xenctrl.h        |    9 +
 tools/libxc/xg_private.h     |    1 +
 tools/libxl/libxl.c          |   18 ++
 tools/libxl/libxl.h          |   20 +++
 tools/libxl/libxl_arch.h     |    6 +
 tools/libxl/libxl_dom.c      |  158 ++++++++++++++++--
 tools/libxl/libxl_internal.h |    6 +
 tools/libxl/libxl_numa.c     |   49 ++++++
 tools/libxl/libxl_types.idl  |    6 +-
 tools/libxl/libxl_vnuma.h    |   11 ++
 tools/libxl/libxl_x86.c      |  123 ++++++++++++++
 tools/libxl/xl_cmdimpl.c     |  380 ++++++++++++++++++++++++++++++++++++++++++
 xen/arch/x86/numa.c          |   30 +++-
 xen/common/domain.c          |   10 ++
 xen/common/domctl.c          |   79 +++++++++
 xen/common/memory.c          |   96 +++++++++++
 xen/include/public/domctl.h  |   29 ++++
 xen/include/public/memory.h  |   17 ++
 xen/include/public/vnuma.h   |   59 +++++++
 xen/include/xen/domain.h     |    8 +
 xen/include/xen/sched.h      |    1 +
 25 files changed, 1282 insertions(+), 31 deletions(-)
 create mode 100644 tools/libxl/libxl_vnuma.h
 create mode 100644 xen/include/public/vnuma.h

-- 
1.7.10.4

