
Re: [Xen-devel] [Question] PARSEC benchmark has smaller execution time in VM than in native?



Hi Elena,


On Tue, Mar 1, 2016 at 3:39 PM, Elena Ufimtseva
<elena.ufimtseva@xxxxxxxxxx> wrote:
> On Tue, Mar 01, 2016 at 02:52:14PM -0500, Meng Xu wrote:
>> Hi Elena,
>>
>> Thank you very much for sharing this! :-)
>>
>> On Tue, Mar 1, 2016 at 1:20 PM, Elena Ufimtseva
>> <elena.ufimtseva@xxxxxxxxxx> wrote:
>> >
>> > On Tue, Mar 01, 2016 at 08:48:30AM -0500, Meng Xu wrote:
>> > > On Mon, Feb 29, 2016 at 12:59 PM, Konrad Rzeszutek Wilk
>> > > <konrad.wilk@xxxxxxxxxx> wrote:
>> > > >> > Hey!
>> > > >> >
>> > > >> > CC-ing Elena.
>> > > >>
>> > > >> I think you forgot you cc.ed her..
>> > > >> Anyway, let's cc. her now... :-)
>> > > >>
>> > > >> >
>> > > >> >> We are measuring the execution time between native machine environment
>> > > >> >> and xen virtualization environment using PARSEC Benchmark [1].
>> > > >> >>
>> > > >> >> In virtualization environment, we run a domU with three VCPUs, each of
>> > > >> >> them pinned to a core; we pin the dom0 to another core that is not
>> > > >> >> used by the domU.
>> > > >> >>
>> > > >> >> Inside the Linux in domU in virtualization environment and in native
>> > > >> >> environment, we used the cpuset to isolate a core (or VCPU) for the
>> > > >> >> system processes and to isolate a core for the benchmark processes.
>> > > >> >> We also configured the Linux boot command line with the isolcpus=
>> > > >> >> option to isolate the core for benchmark from other unnecessary
>> > > >> >> processes.
>> > > >> >
>> > > >> > You may want to just offline them and also boot the machine with NUMA
>> > > >> > disabled.
>> > > >>
>> > > >> Right, the machine is booted up with NUMA disabled.
>> > > >> We will offline the unnecessary cores then.
>> > > >>
>> > > >> >
>> > > >> >>
>> > > >> >> We expect that execution time of benchmarks in xen virtualization
>> > > >> >> environment is larger than the execution time in native machine
>> > > >> >> environment. However, the evaluation gave us an opposite result.
>> > > >> >>
>> > > >> >> Below is the evaluation data for the canneal and streamcluster benchmarks:
>> > > >> >>
>> > > >> >> Benchmark: canneal, input=simlarge, conf=gcc-serial
>> > > >> >> Native: 6.387s
>> > > >> >> Virtualization: 5.890s
>> > > >> >>
>> > > >> >> Benchmark: streamcluster, input=simlarge, conf=gcc-serial
>> > > >> >> Native: 5.276s
>> > > >> >> Virtualization: 5.240s
>> > > >> >>
>> > > >> >> Is there anything wrong with our evaluation that leads to the abnormal
>> > > >> >> performance results?
>> > > >> >
>> > > >> > Nothing is wrong. Virtualization is naturally faster than baremetal!
>> > > >> >
>> > > >> > :-)
>> > > >> >
>> > > >> > No clue sadly.
>> > > >>
>> > > >> Ah-ha. This is really surprising to me.... Why would adding one more
>> > > >> layer speed up the system? Unless the virtualization disabled
>> > > >> some services that occur in native and interfere with the benchmark.
>> > > >>
>> > > >> If virtualization is faster than baremetal by nature, why do some
>> > > >> experiments show that virtualization introduces overhead?
>> > > >
>> > > > Elena told me that there was a weird regression in Linux 4.1, where
>> > > > CPU-burning workloads were _slower_ on baremetal than as guests.
>> > >
>> > > Hi Elena,
>> > > Would you mind sharing with us some of your experience of how you
>> > > found the real reason? Did you use some tool or some methodology to
>> > > pin down the reason (i.e., CPU-burning workloads are _slower_
>> > > on baremetal than as guests)?
>> > >
>> >
>> > Hi Meng
>> >
>> > Yes, sure!
>> >
>> > While working on performance tests for the SMT-exposing patches from Joao,
>> > I ran a CPU-bound workload in an HVM guest and then ran the same test on
>> > baremetal using the same kernel.
>> > On baremetal Linux (4.1.0-rc2) the same test took several times longer
>> > to complete than under the HVM guest.
>> > I tried the test both with kernel threads pinned to cores and without
>> > pinning.
>> > Execution times were usually about twice as long, and sometimes 4 times
>> > as long, as in the HVM case.
>> >
>> > What is interesting is not only that it sometimes takes 3-4 times longer
>> > than the HVM guest, but also that the test with threads bound to cores
>> > takes almost 3 times longer to execute than the same CPU-bound test
>> > under HVM (in all configurations).
>>
>>
>> wow~ I didn't expect the native performance can be so "bad".... ;-)
>
> Yes, quite a surprise :)
>>
>> >
>> >
>> > I run each test 5 times and here are the execution times (seconds):
>> >
>> > -------------------------------------------------
>> >         baremetal           |
>> > thread_bind | thread unbind | HVM pinned to cores
>> > ----------- |---------------|---------------------
>> >      74     |     83        |        28
>> >      74     |     88        |        28
>> >      74     |     38        |        28
>> >      74     |     73        |        28
>> >      74     |     87        |        28
>> >
>> > Sometimes the unbound tests gave better times, but not often enough
>> > to present here. Some results are much worse and reach up to 120
>> > seconds.
>> >
>> > Each test has 8 kernel threads. In baremetal case I tried the following:
>> > - numa off,on;
>> > - all cpus are on;
>> > - isolate cpus from first node;
>> > - set intel_idle.max_cstate=1;
>> > - disable intel_pstate;
>> >
>> > I don't think I have exhausted all the options here, but it looked like
>> > the last two changes did improve performance, though it was still not
>> > comparable to the HVM case.
>> > I am trying to find where the regression happened. Performance on a newer
>> > kernel (I tried 4.5.0-rc4+) was close to or better than HVM.
>> >
>> > I am trying to find out if there were some relevant regressions, to
>> > understand the reason for this.
>>
>>
>> I see. If this only happens with SMT, it may be caused by the
>> SMT-related load balancing in the Linux scheduler.
>> However, I have disabled HT on my machine. That is probably also
>> the reason why I didn't see such a big difference in performance.
>
> I did enable tracing to see if maybe there is extensive migration.
> The test machine has two nodes, 8 cores each, 2 threads per core,
> 32 logical cpus in total.
>
> Kernel threads are not bound, and here is the output for the life of one
> of the threads:
>
> cat ./t-komp_trace |grep t-kompressor|grep 18883
>
>     t-kompressor-18883 [028] d... 69458.596403: sched_switch: prev_comm=kthreadd prev_pid=18883 prev_prio=120 prev_state=D ==> next_comm=swapper/28 next_pid=0 next_prio=120
>           insmod-18875 [027] dN.. 69458.669180: sched_migrate_task: comm=t-kompressor pid=18883 prio=120 orig_cpu=28 dest_cpu=9
>           <idle>-0     [009] d... 69458.669205: sched_switch: prev_comm=swapper/9 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=t-kompressor next_pid=18883 next_prio=120
>     t-kompressor-18883 [009] d... 69486.997626: sched_switch: prev_comm=t-kompressor prev_pid=18883 prev_prio=120 prev_state=R ==> next_comm=migration/9 next_pid=52 next_prio=0
>      migration/9-52    [009] d... 69486.997632: sched_migrate_task: comm=t-kompressor pid=18883 prio=120 orig_cpu=9 dest_cpu=25
>           <idle>-0     [025] d... 69486.997641: sched_switch: prev_comm=swapper/25 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=t-kompressor next_pid=18883 next_prio=120
>     t-kompressor-18883 [025] d... 69486.997710: sched_switch: prev_comm=t-kompressor prev_pid=18883 prev_prio=120 prev_state=x ==> next_comm=swapper/25 next_pid=0 next_prio=120
>           insmod-18875 [011] .N.. 69503.119960: sched_kthread_stop: comm=t-kompressor pid=18883
>
>
> Threads are being spawned from two cores, then some of the threads migrate
> to other cores.
> In the example above the thread is spawned on cpu 27 and, when woken up,
> runs on cpu 009.
> Later it migrates to 025, which is the second thread of the same core (009).
> While I am not sure why this migration happens, it does not seem to
> contribute a lot.
> Anyway, this picture repeats for some other threads (some stay where they
> were woken up):
>
>     t-kompressor-18880 [014] dNh. 69485.802729: sched_migrate_task: comm=hald pid=3820 prio=120 orig_cpu=14 dest_cpu=11
>     migration/13-72    [013] d... 69486.707459: sched_migrate_task: comm=t-kompressor pid=18878 prio=120 orig_cpu=13 dest_cpu=29
>     migration/14-77    [014] d... 69486.783818: sched_migrate_task: comm=t-kompressor pid=18880 prio=120 orig_cpu=14 dest_cpu=30
>      migration/8-47    [008] d... 69486.792667: sched_migrate_task: comm=t-kompressor pid=18882 prio=120 orig_cpu=8 dest_cpu=24
>     migration/15-82    [015] d... 69486.796429: sched_migrate_task: comm=t-kompressor pid=18881 prio=120 orig_cpu=15 dest_cpu=31
>     migration/10-57    [010] d... 69486.857848: sched_migrate_task: comm=t-kompressor pid=18884 prio=120 orig_cpu=10 dest_cpu=26
>      migration/9-52    [009] d... 69486.997632: sched_migrate_task: comm=t-kompressor pid=18883 prio=120 orig_cpu=9 dest_cpu=25
>     migration/28-147   [028] d... 69503.073577: sched_migrate_task: comm=t-kompressor pid=18876 prio=120 orig_cpu=28 dest_cpu=10
>
> All threads are running on their own cores and some migrate to the second
> smt-thread over time.
> I probably should have traced some other scheduling events, but I have not
> found any relevant ones yet.
>
>>
>> >
>> >
>> >
>> > What kernel do you guys use?
>>
>>
>> I'm using a quite old kernel, 3.10.31. The reason I'm using this kernel
>> is that I want to use LITMUS^RT [1], which is a Linux testbed for
>> real-time scheduling research. (It has a new version though, and I can
>> upgrade to the latest version to see if the "problem" still occurs.)
>
> Yes, it will be interesting to see the outcome.
>
> What difference in numbers do you see?

Below is the evaluation data for the canneal and streamcluster
benchmarks, which are in the PARSEC benchmark:

Benchmark: canneal, input=simlarge, conf=gcc-serial
Native: 6.387s
Virtualization: 5.890s

Benchmark: streamcluster, input=simlarge, conf=gcc-serial
Native: 5.276s
Virtualization: 5.240s
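
To put the gap in relative terms, here is a small sketch that computes how
much faster the virtualized run is as a fraction of the native time. The
timings are the ones listed above; the helper name relative_speedup is only
illustrative.

```python
# Relative difference between native and virtualized execution times.
# Timings (seconds) are the ones reported in this mail; the function
# name is a hypothetical helper, not part of PARSEC or Xen.

def relative_speedup(native_s: float, virt_s: float) -> float:
    """Return the fraction by which the virtualized run beats native."""
    return (native_s - virt_s) / native_s


results = {
    "canneal": (6.387, 5.890),
    "streamcluster": (5.276, 5.240),
}

for name, (native_s, virt_s) in results.items():
    pct = relative_speedup(native_s, virt_s) * 100
    print(f"{name}: {pct:.2f}% faster in the VM")
```

This gives roughly 7.8% for canneal and 0.7% for streamcluster, so the
streamcluster gap in particular is small.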

> What machines are you seeing it on?

Below is the CPU info from /proc/cpuinfo:

processor : 7
vendor_id : GenuineIntel
cpu family : 6
model : 58
model name : Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz
stepping : 9
microcode : 0x12
cpu MHz : 1600.000
cache size : 8192 KB
physical id : 0
siblings : 8
core id : 3
cpu cores : 4
apicid : 7
initial apicid : 7
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx
rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology
nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl
vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt
tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ida arat epb
xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase
smep erms
bogomips : 6784.70
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:


> Is your workload purely cpu-bound?

Nope. The canneal and streamcluster benchmarks are cache-sensitive (or
memory-sensitive) tasks. The execution time of these two benchmarks
depends on how much cache and memory they can get.

Under the current kernel, IIRC, I didn't see the "abnormal performance
behavior" for cpu-bound tasks.
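
One way to check whether a gap this small exceeds run-to-run noise is to
repeat each run several times and compare the mean and standard deviation in
both environments. Below is a minimal sketch, assuming Linux and Python 3.
The command in the example is only a placeholder, not the actual PARSEC
invocation, and the function names are hypothetical.

```python
import os
import statistics
import subprocess
import sys
import time


def time_runs(cmd, runs=5, cpu=None):
    """Run cmd `runs` times; return wall-clock durations in seconds.

    If `cpu` is given and the platform offers os.sched_setaffinity
    (Linux), each child process is pinned to that CPU before exec.
    """
    can_pin = cpu is not None and hasattr(os, "sched_setaffinity")

    def pin_child():
        # Runs in the child between fork and exec; pid 0 means "self".
        os.sched_setaffinity(0, {cpu})

    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True,
                       preexec_fn=pin_child if can_pin else None)
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Return (mean, sample standard deviation) of the durations."""
    return statistics.mean(durations), statistics.stdev(durations)


if __name__ == "__main__":
    # Placeholder workload; substitute the real benchmark command.
    cmd = [sys.executable, "-c", "sum(range(10**6))"]
    mean, stdev = summarize(time_runs(cmd, runs=5))
    print(f"mean={mean:.3f}s stdev={stdev:.3f}s over 5 runs")
```

If the standard deviation on either side is comparable to the difference
between the two means, the difference is probably not meaningful.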

Thanks and Best Regards,

Meng

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel

 

