
Re: [Xen-devel] Xen ARM community call - meeting minutes and date for the next one



Hi Stefano,

On 30 March 2017 at 21:52, Stefano Stabellini <sstabellini@xxxxxxxxxx> wrote:
> On Thu, 30 Mar 2017, Volodymyr Babchuk wrote:
>> Hi Julien,
>>
>> 5pm UTC+1 will be fine for me.
>>
>> I just finished my EL0 PoC and want to share benchmark results.
>>
>> My benchmark setup is primitive, but the results are reproducible. I
>> wrote a small driver that calls SMC 10,000,000 times in a loop. So the
>> benchmark looks like this:
>>
>> root@xenaarch64:~# time cat /proc/smc_test
>>  Will call SMC 10000000 time(s)
>> Done!
>>
>> real 1m51.428s
>> user 0m0.020s
>> sys 1m51.240s
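
For reference, the driver is basically a proc file that issues the SMC in a
tight loop, so `time cat /proc/smc_test` measures the round-trip cost of the
trap. A minimal sketch of such a module follows; the proc interface details
and the SMC function ID are illustrative, not my exact code:

#include <linux/module.h>
#include <linux/proc_fs.h>
#include <linux/seq_file.h>
#include <linux/arm-smccc.h>

#define SMC_TEST_CALLS 10000000UL
#define SMC_TEST_FID   0x82000000UL   /* hypothetical ID; the trap to the
                                         hypervisor happens regardless */

static int smc_test_show(struct seq_file *m, void *v)
{
        struct arm_smccc_res res;
        unsigned long i;

        seq_printf(m, "Will call SMC %lu time(s)\n", SMC_TEST_CALLS);

        /* Each call traps to EL2 (or EL3) and comes straight back. */
        for (i = 0; i < SMC_TEST_CALLS; i++)
                arm_smccc_smc(SMC_TEST_FID, 0, 0, 0, 0, 0, 0, 0, &res);

        seq_puts(m, "Done!\n");
        return 0;
}

static int smc_test_open(struct inode *inode, struct file *file)
{
        return single_open(file, smc_test_show, NULL);
}

static const struct file_operations smc_test_fops = {
        .owner   = THIS_MODULE,
        .open    = smc_test_open,
        .read    = seq_read,
        .llseek  = seq_lseek,
        .release = single_release,
};

static int __init smc_test_init(void)
{
        /* Error handling omitted for brevity; this is only a benchmark aid. */
        proc_create("smc_test", 0444, NULL, &smc_test_fops);
        return 0;
}

static void __exit smc_test_exit(void)
{
        remove_proc_entry("smc_test", NULL);
}

module_init(smc_test_init);
module_exit(smc_test_exit);
MODULE_LICENSE("GPL");
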
>>
>> I compared three types of SMC "handlers":
>>
>> 1. Right in the hypervisor. I just put `advance_pc` into
>> `do_trap_smc()`. This is the baseline.
>> 2. Handling in MiniOS, using `monitor.c` framework.
>> 3. Handling in EL0 XEN application.
>>
>> In all three cases there was no actual handling, just variants of
>> `return 0;`.
>> Results:
>>
>> 1. Hypervisor:
>>     real 0m10.757s
>>     user 0m0.000s
>>     sys 0m10.752s
>>
>>     10.757s = 1x (base value)
>>     1.07us per call
>
> 1u is incredibly good actually

Yep. But the handler was right in traps.c, so there is a very short path:
guest_sync (in entry.S) -> do_trap_hypervisor() -> do_trap_smc().
I assume that most simple traps are handled within 1-2us.
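
For reference, that baseline variant boils down to something like the sketch
below (written in the style of xen/arch/arm/traps.c; not my exact patch):

static void do_trap_smc(struct cpu_user_regs *regs, const union hsr hsr)
{
    /* No real handling for the benchmark: just step the guest PC over
       the trapped SMC instruction so the call "returns" immediately. */
    advance_pc(regs, hsr);
}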

>
>> 2. MiniOs:
>>     real 1m51.428s
>>     user 0m0.020s
>>     sys 1m51.240s
>>
>>     111.428s = 10.35x slower than hypervisor handler
>>     11.1us per call
>>
>> 3. EL0:
>>     real 1m3.525s
>>     user 0m0.000s
>>     sys 1m3.516s
>>
>>     63.525s = 5.9x slower than hypervisor handler
>>     6.35us per call
>>
>> As you can see, handling in hypervisor mode is obviously the fastest
>> way. In my first implementation the EL0 approach was as slow as MiniOS.
>> Then I simplified the context switching (for example, we don't need to
>> save/restore the GIC context). That gave me a 1.7x boost. Now the
>> profiler shows that the hypervisor spends its time in spinlocks and p2m
>> code. But anyway, 10,000,000 traps in about 60 seconds is a good result :)
>>
>> Testing was done on a Renesas H3 board with 4x A57 cores.
>
> This is a very good result! I am also quite happy with your scientific
> approach.
Thanks!

> I wish more people started their projects doing experiments
> like this one, before putting their head down to write thousands of
> lines of code. I encourage you to submit a talk to Xen Summit about this
> and/or write a blog post for https://blog.xenproject.org/.
Yep. Actually Artem is pushing me to talk at Xen Summit :)

> I think that 6u per call is very good. Seeing that in the hypervisor
> case the time is only 1u per call, it makes me think we might be able to
> go down a bit more. I also assume that it is quite reliable (having
> vcpus pinned to pcpus), while in the MiniOs case, unless you have enough
> free pcpus to guarantee that MiniOs is already running, we should see
> spikes of latency, right? Because if MiniOs is not running, or it cannot
> run immediately, it should take much longer?  In fact, where was MiniOs
> running in your setup?

Yes, there was a MiniOS for aarch64 domain running. I published links to
the corresponding patches on the ML some time ago.

I didn't pin vCPUs to pCPUs because I had only 3 domains running on 4 CPUs:
dom0, the domU which I used for benchmarking, and MiniOS. I'm sure that at
least two cores were idle. But you are right: under heavy load there
will be latency spikes, because we can't guarantee that MiniOS will be
scheduled immediately to serve the handling request.
My design of the EL0 app is better in this case, because to the hypervisor
it looks like just another function call. It can still be preempted, though,
but at least entry into the EL0 handler happens immediately.

And yes, my profiler shows that there are ways to decrease the latency
further. The most obvious one is to get rid of the 2nd stage translation
and thus eliminate the p2m code from the call chain. Currently the
hypervisor spends about 20% of its time in spinlock code and about 10-15%
in p2m. So there definitely are areas to improve :)
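In per-call terms (taking the 6.35us figure above) that is roughly 1.3us per
call in spinlocks and another 0.6-1.0us in p2m, so dropping the stage-2
lookup alone could save close to a microsecond per call.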
>
>> On 28 March 2017 at 18:23, Julien Grall <julien.grall@xxxxxxx> wrote:
>> > Hi all,
>> >
>> > Apologies for the late sending; you will find at the end of the e-mail a
>> > summary of the discussion from the previous call. Feel free to reply if I
>> > missed some parts.
>> >
>> > I suggest having the next call on the 5th of April at 5PM UTC. Any opinions?
>> >
>> > Also, do you have any specific topics you would like to discuss during
>> > this call?
>> >
>> > Cheers,
>> >
>> > == Attendees ==
>> >
>> > Apologies if I misspelled any name.
>> >
>> > Stefano, Aporeto
>> > Julien, ARM
>> > Oleksandr, EPAM
>> > Artem, EPAM
>> > Thanasis, OnApp
>> > Volodymyr, EPAM
>> >
>> > == Xen on ARM status ==
>> >
>> > Over 100 patches in-flight for Xen on ARM:
>> >     - PV protocols: Some are already accepted
>> >     - NUMA support
>> >     - GICv3 ITS support
>> >     - Exposing and emulating a PL011 for guests
>> >     - Guest SMC forwarding for the Xilinx platform
>> >     - Interrupt latency improvements
>> >
>> > == PV protocols ==
>> >
>> > * PV protocols written by Stefano were merged after 10 months
>> >
>> > Stefano: PV protocol reviews are moving faster
>> > Attendees agreed
>> >
>> > * Audio protocol: close to being accepted
>> > * Display protocol: minor issue, a bit more design is required
>> >
>> > Hopefully both will be ready for Xen 4.9
>> >
>> > Oleksandr: What to do when the backend dies?
>> >
>> > (I cannot find any notes on it, so I am not sure if we answered the
>> > question during the call. I suspect it was asked to bring up the subject
>> > on the ML.)
>> >
>> > == Interrupt latency ==
>> >
>> > Stefano: Some improvements have been made, but it is not possible to know
>> > whether they are good enough. Do you have any specific IRQ latency
>> > requirements?
>> >
>> > Artem: There are no hard latency requirements in automotive, although many
>> > requirements depend on latency. For example:
>> >     * Scheduling
>> >     * GPU (the implementation is sensitive to interrupt latency)
>> >
>> > Automotive uses a set of benchmarks to measure the virtualization
>> > overhead. This should be low.
>> >
>> > ACTION: Artem to send a list of benchmarks
>> >
>> > == SMC/HVC handling in Xen ==
>> >
>> > Artem: Please review the proposal on the mailing list. See:
>> >
>> > https://lists.xenproject.org/archives/html/xen-devel/2017-03/msg00430.html
>> >
>> > == Deprivilege mode ==
>> >
>> > EPAM are working on adding support for OP-TEE in Xen to allow multiple
>> > guests to access the trusted firmware.
>> >
>> > During the discussion on the design, it was suggested to move the SMC
>> > handling into a separate domain. This was tested using the VM event API
>> > and Mini-OS (upstream with Chen Baozi's series to support ARM64). The
>> > first results show it is 10 times slower than handling SMC calls directly
>> > in the hypervisor.
>> >
>> > Volodymyr is working on another approach to deprivilege the execution by
>> > implementing Xen EL0 applications.
>> >
>> > == big.LITTLE support ==
>> >
>> > Thanasis: Document discussed on the ML. Xen will split CPUs at boot time
>> > (big vs little). A series will be sent to the ML soon.
>> >
>> > --
>> > Julien Grall
>>
>>
>>
>> --
>> WBR Volodymyr Babchuk aka lorc [+380976646013]
>> mailto: vlad.babchuk@xxxxxxxxx
>>



-- 
WBR Volodymyr Babchuk aka lorc [+380976646013]
mailto: vlad.babchuk@xxxxxxxxx

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel

 

