[RFC PATCH v1 00/10] Xen flamegraph (hypervisor stacktrace profile) support
I've long wanted to get stacktraces when profiling Xen. Without them all you see is e.g. the address of memcpy, and without knowing which function called it you can't optimize it.

Once you have stacktraces, even a simple timer-based profile at a low (prime) frequency can show hotspots that would be optimization candidates, aka flamegraphs. (Even if the samples don't always hit within the same function, and individually the functions would be too small to be noticeable, the samples should hit in one of the parents if that parent is a bottleneck.)

Example flamegraphs produced using these patches:
* workload: an otherwise idle VM migrated on localhost by XAPI in a loop:
  https://cdn.jsdelivr.net/gh/edwintorok/xen@pmustack-coverletter/docs/tmp/migrate-localhost.svg?x=473.2&y=2197&s=null
* workload: VM migrated between 2 hosts by XAPI (NFS storage):
  https://cdn.jsdelivr.net/gh/edwintorok/xen@pmustack-coverletter/docs/tmp/migrate-send.svg?x=950.6&y=2197
  https://cdn.jsdelivr.net/gh/edwintorok/xen@pmustack-coverletter/docs/tmp/migrate-receive.svg?x=906.6&y=869

There might be other approaches that could be tried in the future, e.g. Last Branch Record, but:
* although both Intel and AMD support it, AFAIK Xen doesn't support it on AMD yet
* there is a hardware limit to how deep it can be (~32?)
* LBR may need some additional configuration to enable it to trace the hypervisor
* Intel PMU is completely broken on the system I tried it on, so I would've had to first fix that

This is some very early experimental work; I thought I'd share it to get feedback on:
* the desired ABI additions in pmu.h and arch-x86/pmu.h
* any bugs you may spot
* whether anyone wants to port the python symbol lookup to perf itself (actually, the latest perf ships a flamegraph.py too)

It is also starting to become useful enough to spot performance hotspots in Xen, e.g. the rwlock.c scaling issue with large CONFIG_NR_CPUS, or unexpected page faults in 'unmap_page_range' (spotted by Andrew).
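To illustrate the point above about samples landing in a parent, here is a minimal Python sketch. It is not part of the patches: the function names and the five samples are made up, and the `a;b;c` folded-stack form is only assumed here as a convenient key, since it is the text form flamegraph tools commonly take as input.

```python
from collections import Counter

# Hypothetical samples, one per timer tick, outermost frame first.
# All function names here are invented for illustration.
samples = [
    ["do_domctl", "copy_pages", "memcpy"],
    ["do_domctl", "copy_pages", "map_page"],
    ["do_domctl", "copy_pages", "unmap_page"],
    ["do_domctl", "log_dirty", "memcpy"],
    ["idle_loop"],
]


def fold(stacks):
    """Count identical stacks, keyed in the 'a;b;c' folded form."""
    return Counter(";".join(stack) for stack in stacks)


def leaf(stacks):
    """Count samples by innermost frame only (a profile without stacktraces)."""
    return Counter(stack[-1] for stack in stacks)


def inclusive(stacks):
    """Count the samples in which each function appears anywhere on the stack."""
    counts = Counter()
    for stack in stacks:
        counts.update(set(stack))  # set(): count a recursive frame once per sample
    return counts


if __name__ == "__main__":
    for folded, count in sorted(fold(samples).items()):
        print(folded, count)
    print("leaf:", leaf(samples).most_common())
    print("inclusive:", inclusive(samples).most_common())
```

In this made-up data `copy_pages` never appears as a leaf, yet it is on the stack in 3 of the 5 samples, which is the kind of parent-level hotspot a flamegraph makes visible.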
It builds on top of:
* the existing VPMU support, documented by Boris Ostrovsky in this thread:
  https://lists.xenproject.org/archives/html/xen-devel/2016-08/msg03244.html
* a python script by Andriy to post-process the perf output

Steps to enable:

1. Ensure that you've got a build of Xen with CONFIG_FRAME_POINTER=y. Debug builds would have it, but for performance testing, creating a release build with frame pointers enabled is recommended.

2. Apply both the Linux and Xen patches. I tested on top of Linux ~6.6.22 and Xen 4.21+ (5c798ac8854af3528a78ca5a622c9ea68399809b).

3. Ensure that VPMU is enabled in Xen, e.g. a GRUB line like:
```
multiboot2 /boot/xen.efi dom0_mem=4288M,max:4288M crashkernel=256M,below=4G console=vga vga=mode-0x0311 watchdog=0 vpmu=on dom0_vcpus_pin
```
On a XenServer system that can be achieved by:
```
/opt/xensource/libexec/xen-cmdline --set-xen watchdog=0
/opt/xensource/libexec/xen-cmdline --set-xen vpmu=on
/opt/xensource/libexec/xen-cmdline --delete-xen dom0_max_vcpus=1-16
/opt/xensource/libexec/xen-cmdline --set-xen dom0_vcpus_pin
reboot
```

4. On every boot, enable the desired vPMU features:
```
echo 9 >/sys/hypervisor/pmu/pmu_features
echo all >/sys/hypervisor/pmu/pmu_mode
```

5. Record a trace, e.g. a simple timer-based stacktrace, useful for initial investigation with a flamegraph:
```
perf kvm --host --guest record -a -F 97 -g
```
Or if you also want to trace userspace:
```
perf kvm --host --guest record -a -F 97 --call-graph=dwarf
```

6. Look at the report: perf kvm --host --guest report. This will contain hex addresses for now, but a script can be used to resolve them.

7. Use the provided python script, and look at the symbolized output.

Caveats:
* x86-only for now
* only tested on AMD EPYC 8124P
* Xen PMU support was broken to begin with on Xeon Silver 4514Y, so I wasn't able to test there ('perf top' fails to parse samples). I'll try to figure out what is wrong there separately
* for now I edit the release config in xen.spec to enable frame pointers. Eventually it might be useful to have a 3rd build variant: release-fp. Or teach Xen to produce/parse ORC or SFrame formats without requiring frame pointers.
* perf produces raw hex addresses, and a python script is used to post-process it and obtain symbols. Eventually perf should be updated to do this processing itself (there was an old patch for Linux 3.12 by Borislav Petkov)
* I've only tested capturing Dom0 stack traces. Linux doesn't support guest stacktraces yet (it can only look up the guest RIP)
* the Linux patch will need to be forward-ported to master before submission
* All the caveats for using regular VPMU apply, except for the lack of stacktraces, which is fixed here!
  * Dom0 must run hard pinned on all host CPUs
  * Watchdog must be disabled
  * not security supported
  * x86 only
  * secureboot needs to be disabled

Edwin Török (10):
  pmu.h: add a BUILD_BUG_ON to ensure it fits within one page
  arch-x86/pmu.h: document current memory layout for VPMU
  arch-x86/pmu.h: convert ascii art drawing to Unicode
  vpmu.c: factor out register conversion
  pmu.h: introduce a stacktrace area
  arch-x86/pmu.h: convert ascii art diagram to Unicode
  arch-x86/vpmu.c: store guest registers when domain_id == DOMID_XEN
  pmu.h: expose a hypervisor stacktrace feature
  vpmu.c hypervisor stacktrace support in vPMU
  xen/tools/pyperf.py: example script to parse perf output

 xen/arch/x86/cpu/vpmu.c           | 130 ++++++++++++++++++++------
 xen/arch/x86/cpu/vpmu_amd.c       |   2 +-
 xen/arch/x86/cpu/vpmu_intel.c     |   2 +-
 xen/arch/x86/include/asm/vpmu.h   |   1 +
 xen/include/public/arch-arm.h     |   1 +
 xen/include/public/arch-ppc.h     |   1 +
 xen/include/public/arch-riscv.h   |   1 +
 xen/include/public/arch-x86/pmu.h | 101 ++++++++++++++++++++-
 xen/include/public/pmu.h          |  41 ++++++++-
 xen/tools/pyperf.py               | 146 ++++++++++++++++++++++++++++++
 10 files changed, 395 insertions(+), 31 deletions(-)
 create mode 100644 xen/tools/pyperf.py

--
2.47.1
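For readers unfamiliar with the symbol lookup step mentioned in the caveats, here is a rough Python sketch of one way raw hex addresses can be mapped to `symbol+offset`. This is an illustration only and not the code in xen/tools/pyperf.py: it assumes a symbol table in the three-column `address type name` layout that nm prints, and the addresses and the `load_symbols`/`resolve` helpers below are hypothetical.

```python
import bisect

# Made-up symbol table in "address type name" layout; the addresses are
# invented for illustration and do not come from a real Xen build.
EXAMPLE_SYMBOLS = """\
ffff82d040200000 T _start
ffff82d040201000 T memcpy
ffff82d040202000 T do_domctl
"""


def load_symbols(text):
    """Parse 'address type name' lines into parallel, address-sorted lists."""
    table = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 3:
            table.append((int(parts[0], 16), parts[2]))
    table.sort()
    addrs = [addr for addr, _ in table]
    names = [name for _, name in table]
    return addrs, names


def resolve(addr, addrs, names):
    """Return 'name+0xoff' for the closest symbol at or below addr."""
    i = bisect.bisect_right(addrs, addr) - 1
    if i < 0:
        return hex(addr)  # below the first symbol: leave the address raw
    return f"{names[i]}+{addr - addrs[i]:#x}"


addrs, names = load_symbols(EXAMPLE_SYMBOLS)

if __name__ == "__main__":
    for raw in ("ffff82d040201010", "ffff82d040202abc"):
        print(raw, "->", resolve(int(raw, 16), addrs, names))
```

A binary search over sorted start addresses keeps each lookup logarithmic in the number of symbols. Note that this sketch has no notion of symbol sizes, so an address past the end of the last symbol would still be attributed to it.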