
RE: [Xen-devel] swiotlb=force in Konrad's xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough



 

-----Original Message-----
From: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx 
[mailto:xen-devel-bounces@xxxxxxxxxxxxxxxxxxx] On Behalf Of Dante Cinco
Sent: Thursday, November 18, 2010 10:44 AM
To: Konrad Rzeszutek Wilk
Cc: Jeremy Fitzhardinge; Xen-devel; mathieu.desnoyers@xxxxxxxxxx; 
andrew.thomas@xxxxxxxxxx; keir.fraser@xxxxxxxxxxxxx; chris.mason@xxxxxxxxxx
Subject: Re: [Xen-devel] swiotlb=force in Konrad's xen-pcifront-0.8.2 pvops 
domU kernel with PCI passthrough

On Thu, Nov 18, 2010 at 9:19 AM, Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx> 
wrote:
> Keir, Dan, Mathieu, Chris, Mukesh,
>
> This fellow is passing in a PCI device to his Xen PV guest and trying 
> to get high IOPS. The kernel he is using is a 2.6.36 with tglx's 
> sparse_irq rework.
>
>> I wanted to confirm that bounce buffering was indeed occurring so I 
>> modified swiotlb.c in the kernel and added printks in the following
>> functions:
>> swiotlb_bounce
>> swiotlb_tbl_map_single
>> swiotlb_tbl_unmap_single
>> Sure enough, all three were being called five times per I/O. We took
>> your suggestion and replaced pci_map_single with pci_pool_alloc. The
>> swiotlb calls were gone, but the I/O performance improved by only 6%
>> (29k IOPS to 31k IOPS), which is still abysmal.
>
> Hey! 6% - that is nothing to sneeze at.

When we were using an HVM kernel (2.6.32.15+drm33.5), our IOPS was at least
20x higher (~700k IOPS).
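
For reference, the pci_pool change was along these lines - a minimal sketch
with made-up names and sizes, not our actual driver code:

#include <linux/pci.h>

#define CMD_BUF_SIZE 512                /* hypothetical per-command buffer size */

static struct pci_pool *cmd_pool;       /* pool created once at probe time */

static int example_setup_pool(struct pci_dev *pdev)
{
        /* 512-byte buffers, 64-byte alignment, no boundary restriction */
        cmd_pool = pci_pool_create("example-cmd", pdev, CMD_BUF_SIZE, 64, 0);
        return cmd_pool ? 0 : -ENOMEM;
}

/* Per I/O: grab a buffer that already has a bus address ... */
static void *example_get_cmd_buf(dma_addr_t *dma)
{
        return pci_pool_alloc(cmd_pool, GFP_ATOMIC, dma);
}

/* ... and hand it back on completion (replaces pci_unmap_single()). */
static void example_put_cmd_buf(void *vaddr, dma_addr_t dma)
{
        pci_pool_free(cmd_pool, vaddr, dma);
}

static void example_teardown_pool(void)
{
        pci_pool_destroy(cmd_pool);
}

The per-command buffers now come out of a DMA pool set up at probe time, so
there is no per-I/O pci_map_single() left for swiotlb to intercept.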

>
>>
>> Any suggestions on where to look next? I have one question about the
>
> So since you are talking IOPS I figured you must be using fio to run 
> those numbers. And since you mentioned HVM at some point, you are not 
> running this PV domain as a back-end for another PV guest. You are 
> probably going to run some form of iSCSI target and stuff those down the PCI 
> device.

Our setup is pure Fibre Channel. We're using a physically separate system
(also Linux-based) to initiate the SCSI I/Os.

>
> A couple of things pop into my head, but let's first address your question.
>
>> P2M array: Does the P2M lookup occur every DMA or just during the 
>> allocation? What I'm getting at is this: Is the Xen-SWIOTLB a central
>
> It only occurs during allocation. Also, since you are bypassing the
> bounce buffer, those calls are done without any spinlock. The P2M
> lookup is just bit-shifting and division - constant-time, so O(1).
>
>> resource that could be a bottleneck?
>
> Doubt it. Your best bet to figure this out is to play with ftrace, or 
> perf trace. But I don't know how well they work with Xen nowadays - 
> Jeremy and Mathieu Desnoyers poked it a bit and I think I overheard 
> that Mathieu got it working?
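
Good to know the P2M lookup is constant-time. For my own notes, a two-level
O(1) lookup of the kind you describe looks roughly like this - a user-space
toy with made-up names and sizes, not the actual p2m code:

#include <stdio.h>
#include <stdlib.h>

#define ENTRIES_PER_LEAF 1024UL
#define TOP_ENTRIES      1024UL
#define INVALID_ENTRY    (~0UL)

static unsigned long *top[TOP_ENTRIES];   /* leaf tables, allocated on demand */

static unsigned long lookup(unsigned long pfn)
{
        unsigned long topidx = pfn / ENTRIES_PER_LEAF;  /* one division */
        unsigned long idx    = pfn % ENTRIES_PER_LEAF;  /* one modulo   */

        if (topidx >= TOP_ENTRIES || top[topidx] == NULL)
                return INVALID_ENTRY;
        return top[topidx][idx];          /* two array dereferences, no locks */
}

int main(void)
{
        top[0] = calloc(ENTRIES_PER_LEAF, sizeof(unsigned long));
        if (!top[0])
                return 1;
        top[0][5] = 0x1234;               /* pretend pfn 5 maps to mfn 0x1234 */
        printf("pfn 5     -> %#lx\n", lookup(5));
        printf("pfn 99999 -> %#lx\n", lookup(99999));   /* unpopulated leaf */
        free(top[0]);
        return 0;
}

So each lookup is a division, a modulo and two array dereferences - nothing
that should show up as a bottleneck on the I/O path.
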
>
> So the next couple of possibilities are:
>  1). You are hitting the spinlock issues on 'struct request' or any of
>     the paths on the I/O. Oracle did a lot of work on those - and one
>     way to find this out is to look at tracing and see where the
>     contention is. I don't know where or if those patches have been
>     posted upstream.. but as said, if you are seeing high spinlock
>     usage - that might be it.
>  1b). Spinlocks - make sure you have CONFIG_PVOPS_SPINLOCK enabled.
>     Otherwise

I checked the config file and it is enabled: CONFIG_PARAVIRT_SPINLOCKS=y

The platform we're running on has an Intel Xeon E5540 with the X58 chipset.
Here is the processor-related part of the kernel configuration. Is there
anything we could tune to improve performance?

#
# Processor type and features
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y
CONFIG_GENERIC_CLOCKEVENTS_BUILD=y
CONFIG_SMP=y
CONFIG_SPARSE_IRQ=y
CONFIG_NUMA_IRQ_DESC=y
CONFIG_X86_MPPARSE=y
# CONFIG_X86_EXTENDED_PLATFORM is not set
CONFIG_X86_SUPPORTS_MEMORY_FAILURE=y
CONFIG_SCHED_OMIT_FRAME_POINTER=y
CONFIG_PARAVIRT_GUEST=y
CONFIG_XEN=y
CONFIG_XEN_PVHVM=y
CONFIG_XEN_MAX_DOMAIN_MEMORY=8
CONFIG_XEN_SAVE_RESTORE=y
CONFIG_XEN_DEBUG_FS=y
CONFIG_KVM_CLOCK=y
CONFIG_KVM_GUEST=y
CONFIG_PARAVIRT=y
CONFIG_PARAVIRT_SPINLOCKS=y
CONFIG_PARAVIRT_CLOCK=y
# CONFIG_PARAVIRT_DEBUG is not set
CONFIG_NO_BOOTMEM=y
# CONFIG_MEMTEST is not set
# CONFIG_MK8 is not set
# CONFIG_MPSC is not set
# CONFIG_MCORE2 is not set
# CONFIG_MATOM is not set
CONFIG_GENERIC_CPU=y
CONFIG_X86_CPU=y
CONFIG_X86_INTERNODE_CACHE_SHIFT=7
CONFIG_X86_CMPXCHG=y
CONFIG_X86_L1_CACHE_SHIFT=6
CONFIG_X86_XADD=y
CONFIG_X86_WP_WORKS_OK=y
CONFIG_X86_TSC=y
CONFIG_X86_CMPXCHG64=y
CONFIG_X86_CMOV=y
CONFIG_X86_MINIMUM_CPU_FAMILY=64
CONFIG_X86_DEBUGCTLMSR=y
CONFIG_CPU_SUP_INTEL=y
CONFIG_CPU_SUP_AMD=y
CONFIG_CPU_SUP_CENTAUR=y
CONFIG_HPET_TIMER=y
CONFIG_HPET_EMULATE_RTC=y
CONFIG_DMI=y
CONFIG_GART_IOMMU=y
CONFIG_CALGARY_IOMMU=y
CONFIG_CALGARY_IOMMU_ENABLED_BY_DEFAULT=y
CONFIG_AMD_IOMMU=y
CONFIG_AMD_IOMMU_STATS=y
CONFIG_SWIOTLB=y
CONFIG_IOMMU_HELPER=y
CONFIG_IOMMU_API=y
# CONFIG_MAXSMP is not set
CONFIG_NR_CPUS=32
CONFIG_SCHED_SMT=y
CONFIG_SCHED_MC=y
# CONFIG_PREEMPT_NONE is not set
CONFIG_PREEMPT_VOLUNTARY=y
# CONFIG_PREEMPT is not set
CONFIG_X86_LOCAL_APIC=y
CONFIG_X86_IO_APIC=y
CONFIG_X86_REROUTE_FOR_BROKEN_BOOT_IRQS=y
CONFIG_X86_MCE=y
CONFIG_X86_MCE_INTEL=y
CONFIG_X86_MCE_AMD=y
CONFIG_X86_MCE_THRESHOLD=y
CONFIG_X86_MCE_INJECT=y
CONFIG_X86_THERMAL_VECTOR=y
# CONFIG_I8K is not set
CONFIG_MICROCODE=y
CONFIG_MICROCODE_INTEL=y
CONFIG_MICROCODE_AMD=y
CONFIG_MICROCODE_OLD_INTERFACE=y
CONFIG_X86_MSR=y
CONFIG_X86_CPUID=y
CONFIG_ARCH_PHYS_ADDR_T_64BIT=y
CONFIG_DIRECT_GBPAGES=y
CONFIG_NUMA=y
CONFIG_K8_NUMA=y
CONFIG_X86_64_ACPI_NUMA=y
CONFIG_NODES_SPAN_OTHER_NODES=y
# CONFIG_NUMA_EMU is not set
CONFIG_NODES_SHIFT=6
CONFIG_ARCH_PROC_KCORE_TEXT=y
CONFIG_ARCH_SPARSEMEM_DEFAULT=y
CONFIG_ARCH_SPARSEMEM_ENABLE=y
CONFIG_ARCH_SELECT_MEMORY_MODEL=y
CONFIG_ILLEGAL_POINTER_VALUE=0xdead000000000000
CONFIG_SELECT_MEMORY_MODEL=y
CONFIG_SPARSEMEM_MANUAL=y
CONFIG_SPARSEMEM=y
CONFIG_NEED_MULTIPLE_NODES=y
CONFIG_HAVE_MEMORY_PRESENT=y
CONFIG_SPARSEMEM_EXTREME=y
CONFIG_SPARSEMEM_VMEMMAP_ENABLE=y
CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER=y
CONFIG_SPARSEMEM_VMEMMAP=y
# CONFIG_MEMORY_HOTPLUG is not set
CONFIG_PAGEFLAGS_EXTENDED=y
CONFIG_SPLIT_PTLOCK_CPUS=4
# CONFIG_COMPACTION is not set
CONFIG_MIGRATION=y
CONFIG_PHYS_ADDR_T_64BIT=y
CONFIG_ZONE_DMA_FLAG=1
CONFIG_BOUNCE=y
CONFIG_VIRT_TO_BUS=y
# CONFIG_KSM is not set
CONFIG_DEFAULT_MMAP_MIN_ADDR=4096
CONFIG_ARCH_SUPPORTS_MEMORY_FAILURE=y
# CONFIG_MEMORY_FAILURE is not set
CONFIG_X86_CHECK_BIOS_CORRUPTION=y
CONFIG_X86_BOOTPARAM_MEMORY_CORRUPTION_CHECK=y
CONFIG_X86_RESERVE_LOW_64K=y
CONFIG_MTRR=y
# CONFIG_MTRR_SANITIZER is not set
CONFIG_X86_PAT=y
CONFIG_ARCH_USES_PG_UNCACHED=y
CONFIG_EFI=y
CONFIG_SECCOMP=y
# CONFIG_CC_STACKPROTECTOR is not set
CONFIG_HZ_100=y
# CONFIG_HZ_250 is not set
# CONFIG_HZ_300 is not set
# CONFIG_HZ_1000 is not set
CONFIG_HZ=100
CONFIG_SCHED_HRTICK=y
CONFIG_KEXEC=y
CONFIG_CRASH_DUMP=y
CONFIG_KEXEC_JUMP=y
CONFIG_PHYSICAL_START=0x1000000
CONFIG_RELOCATABLE=y
CONFIG_PHYSICAL_ALIGN=0x1000000
CONFIG_HOTPLUG_CPU=y
# CONFIG_COMPAT_VDSO is not set
# CONFIG_CMDLINE_BOOL is not set
CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y
CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID=y
CONFIG_USE_PERCPU_NUMA_NODE_ID=y


>     you are going to hit dreadful conditions.
>  2). You are hitting the 64-bit syscall wall. Basically your user-mode
>     application (fio) is doing a write(), which used to be int 0x80 but
>     is now a syscall. The syscall gets trapped in the hypervisor, which
>     has to call into your PV kernel. You get hit with two context
>     switches for each 'write()' call. The solution is to use a 32-bit
>     DomU, as there the guest user application and guest kernel run in
>     different rings.

There is no user-space application involved in the I/O; it is all handled by
kernel driver code.

>  3). Xen CPU pools. You didn't say where the application that sends the
>     I/Os is located, but if it is in a separate domain then you will
>     want to use Xen CPU pools. Basically this way you can get
>     gang-scheduling, where the guest that submits the I/O and the guest
>     that picks it up run right after each other. I don't know many more
>     details, but that is my understanding of what it does.
>  4). CPU/MSI-X affinity. I think you already did this, but make sure
>     you pin your guest to specific CPUs and also pin the MSI-X vectors
>     to the proper destination. You can use 'xm debug-keys i' to see the
>     MSI-X affinity - it is a mask, so basically check whether it
>     overlays the CPUs you are running your guest on. Not sure how to
>     actually set the MSI-X affinity, now that I think about it. Keir or
>     some of the Intel folks might know better.

There are 16 (multi-function) devices PCI-passed through to domU. There are 16
VCPUs in domU, all pinned to individual PCPUs (on a 24-CPU platform). Each IRQ
in domU is affinitized to a CPU. This strategy has worked well for us with the
HVM kernel. Here's the output of 'xm debug-keys i':
(XEN)    IRQ:  67 affinity:ffffffff,ffffffff,ffffffff,ffffffff vec:7a
type=PCI-MSI         status=00000010 in-flight=0
domain-list=1:127(----),
(XEN)    IRQ:  68 affinity:00000000,00000000,00000000,00000200 vec:43
type=PCI-MSI         status=00000010 in-flight=0
domain-list=1:126(----),
(XEN)    IRQ:  69 affinity:00000000,00000000,00000000,00000400 vec:83
type=PCI-MSI         status=00000010 in-flight=0
domain-list=1:125(----),
(XEN)    IRQ:  70 affinity:00000000,00000000,00000000,00000800 vec:4b
type=PCI-MSI         status=00000010 in-flight=0
domain-list=1:124(----),
(XEN)    IRQ:  71 affinity:00000000,00000000,00000000,00001000 vec:8b
type=PCI-MSI         status=00000010 in-flight=0
domain-list=1:123(----),
(XEN)    IRQ:  72 affinity:00000000,00000000,00000000,00002000 vec:53
type=PCI-MSI         status=00000010 in-flight=0
domain-list=1:122(----),
(XEN)    IRQ:  73 affinity:00000000,00000000,00000000,00004000 vec:93
type=PCI-MSI         status=00000010 in-flight=0
domain-list=1:121(----),
(XEN)    IRQ:  74 affinity:00000000,00000000,00000000,00008000 vec:5b
type=PCI-MSI         status=00000010 in-flight=0
domain-list=1:120(----),
(XEN)    IRQ:  75 affinity:00000000,00000000,00000000,00010000 vec:9b
type=PCI-MSI         status=00000010 in-flight=0
domain-list=1:119(----),
(XEN)    IRQ:  76 affinity:00000000,00000000,00000000,00020000 vec:63
type=PCI-MSI         status=00000010 in-flight=0
domain-list=1:118(----),
(XEN)    IRQ:  77 affinity:00000000,00000000,00000000,00040000 vec:a3
type=PCI-MSI         status=00000010 in-flight=0
domain-list=1:117(----),
(XEN)    IRQ:  78 affinity:00000000,00000000,00000000,00080000 vec:6b
type=PCI-MSI         status=00000010 in-flight=0
domain-list=1:116(----),
(XEN)    IRQ:  79 affinity:00000000,00000000,00000000,00100000 vec:ab
type=PCI-MSI         status=00000010 in-flight=0
domain-list=1:115(----),
(XEN)    IRQ:  80 affinity:00000000,00000000,00000000,00200000 vec:73
type=PCI-MSI         status=00000010 in-flight=0
domain-list=1:114(----),
(XEN)    IRQ:  81 affinity:00000000,00000000,00000000,00400000 vec:b3
type=PCI-MSI         status=00000010 in-flight=0
domain-list=1:113(----),
(XEN)    IRQ:  82 affinity:00000000,00000000,00000000,00800000 vec:7b
type=PCI-MSI         status=00000010 in-flight=0
domain-list=1:112(----),

>  5). Andrew, Mukesh, Keir, Dan, any other ideas?
>

We're also working through Chris's four things to try and will consider
Mathieu's LTT suggestion.

- Dante

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel
