
Re: [Xen-devel] Xen on ARM IRQ latency and scheduler overhead



On Fri, 10 Feb 2017, Dario Faggioli wrote:
> On Thu, 2017-02-09 at 16:54 -0800, Stefano Stabellini wrote:
> > Hi all,
> > 
> Hi,
> 
> > I have run some IRQ latency measurements on Xen on ARM on a Xilinx
> > ZynqMP board (four Cortex A53 cores, GICv2).
> > 
> > Dom0 has 1 vcpu pinned to cpu0, DomU has 1 vcpu pinned to cpu2.
> > Dom0 is Ubuntu. DomU is an ad-hoc baremetal app to measure interrupt
> > latency: https://github.com/edgarigl/tbm
> > 
> Right, interesting use case. I'm glad to see there's some interest in
> it, and am happy to help investigate and try to make things better.

Thank you!


> > I modified the app to use the phys_timer instead of the virt_timer. 
> > You
> > can build it with:
> > 
> > make CFG=configs/xen-guest-irq-latency.cfg 
> > 
> Ok, do you (or anyone) mind explaining in a little more detail
> what the app tries to measure and how it does that?

Have a look at app/xen/guest_irq_latency/apu.c:

https://github.com/edgarigl/tbm/blob/master/app/xen/guest_irq_latency/apu.c

This is my version which uses the phys_timer (instead of the virt_timer):

https://github.com/sstabellini/tbm/blob/phys-timer/app/xen/guest_irq_latency/apu.c

Edgar can jump in to add more info if needed (he is the author of the
app), but as you can see from the code, the app is very simple. It sets
a timer event in the future, then, after receiving the event, it checks
the current time and compares it with the deadline.
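
Roughly, the flow looks like this (a simplified sketch with made-up
helper names, not the actual code; the real thing is in apu.c and uses
helpers such as a64_write_timer_cval):

    /* Program the physical timer to fire a bit in the future. */
    deadline = read_cntpct() + ticks_ahead;
    write_cntp_cval(deadline);          /* compare value for the phys timer */
    enable_timer_interrupt();
    wfi();                              /* or spin, in the no-WFI variant */

    /* In the interrupt handler: */
    now = read_cntpct();
    latency_ns = (now - deadline) * freq_k;   /* freq_k converts ticks to ns */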


> As a matter of fact, I'm quite familiar with the scenario (I've spent a
> lot of time playing with cyclictest,
> https://rt.wiki.kernel.org/index.php/Cyclictest ), but I don't
> immediately understand the way the timer is programmed, what is
> supposed to be in the various variables/registers, what 'freq'
> actually is, etc.

The timer is programmed by writing the compare value to the cntp_cval
system register, see a64_write_timer_cval. The counter is read by
reading the cntpct system register, see
arch-aarch64/aarch64-excp.c:aarch64_irq. freq is the frequency of the
timer (which is lower than the cpu frequency). freq_k is the
multiplication factor to convert timer counter values into
nanoseconds; on my platform it's 10.
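
In other words, freq is the counter frequency (presumably read from
CNTFRQ_EL0) and freq_k is just the tick period in nanoseconds. A
minimal helper, assuming (as on my board) that 10^9 divides evenly by
the frequency:

    /* Here CNTFRQ_EL0 = 100 MHz, so 1000000000 / 100000000 = 10 ns per tick. */
    static inline uint64_t ticks_to_ns(uint64_t ticks, uint64_t freq)
    {
        return ticks * (1000000000ULL / freq);   /* ticks * freq_k */
    }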

If you want more info on the timer, have a look at "Generic Timer" in
the ARM Architecture Reference Manual.


> > These are the results, in nanosec:
> > 
> >                         AVG     MIN     MAX     WARM MAX
> > 
> > NODEBUG no WFI          1890    1800    3170    2070
> > NODEBUG WFI             4850    4810    7030    4980
> > NODEBUG no WFI credit2  2217    2090    3420    2650
> > NODEBUG WFI credit2     8080    7890    10320   8300
> > 
> > DEBUG no WFI            2252    2080    3320    2650
> > DEBUG WFI               6500    6140    8520    8130
> > DEBUG WFI, credit2      8050    7870    10680   8450
> > 
> > DEBUG means Xen DEBUG build.
> >
> Mmm, and Credit2 (with WFI) behaves almost the same (and even a bit
> better in some cases) with debug enabled, while in Credit1, debug on
> or off makes quite a bit of difference, AFAICT, especially in the WFI
> case.
> 
> That looks a bit strange, as I'd have expected the effect to be similar
> (there are actually quite a few debug checks in Credit2, maybe even
> more than in Credit1).
> 
> > WARM MAX is the maximum latency, taking out the first few interrupts
> > to
> > warm the caches.
> > WFI is the ARM and ARM64 sleeping instruction, trapped and emulated
> > by
> > Xen by calling vcpu_block.
> > 
> > As you can see, depending on whether the guest issues a WFI or not
> > while
> > waiting for interrupts, the results change significantly.
> > Interestingly,
> > credit2 does worse than credit1 in this area.
> > 
> This is with current staging right? 

That's right.


> If yes: in Credit1, on ARM you
> never stop the scheduler tick, like we do on x86. This means the system
> is, in general, "more awake" than Credit2, which does not have a
> periodic tick (and FWIW, also "more awake" than Credit1 on x86, as far
> as the scheduler is concerned, at least).
> 
> Whether or not this impacts your measurements significantly, I don't
> know, as it depends on a bunch of factors. What we know is that this
> has enough impact to trigger the RCU bug Julien discovered (in a
> different scenario, I know), so I would not rule it out.
> 
> I can try sending a quick patch for disabling the tick when a CPU is
> idle, but I'd need your help in testing it.

That might be useful. However, if I understand this right, we don't
actually want a periodic timer in Xen just to make the system more
responsive, do we?


> > Trying to figure out where those 3000-4000ns of difference between
> > the
> > WFI and non-WFI cases come from, I wrote a patch to zero the latency
> > introduced by xen/arch/arm/domain.c:schedule_tail. That saves about
> > 1000ns. There are no other arch specific context switch functions
> > worth
> > optimizing.
> > 
> Yeah. It would be interesting to see a trace, but we still don't have
> that for ARM. :-(

indeed


> > We are down to 2000-3000ns. Then, I started investigating the
> > scheduler.
> > I measured how long it takes to run "vcpu_unblock": 1050ns, which is
> > significant. 
> >
> How you measured, if I can ask.

Simple. I added a timer counter read before and after the function call:

    uint64_t n1 = 0, n2 = 0;

    n1 = READ_SYSREG64(CNTPCT_EL0);        /* physical counter before */

    function_call_to_measure();

    n2 = READ_SYSREG64(CNTPCT_EL0);        /* physical counter after */
    printk("DEBUG %s %d ns=%lu\n", __func__, __LINE__, (n2 - n1) * 10);

Where 10 is the calculated freq_k for the platform I have.
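
(If it helps, the same pattern can be wrapped in a throwaway macro; this
is just how I would write it, nothing that exists in the tree:)

    /* Measure one call in timer ticks and print it in ns (freq_k = 10 here). */
    #define TIME_CALL(call) do {                                        \
        uint64_t t1_ = READ_SYSREG64(CNTPCT_EL0);                       \
        call;                                                           \
        printk("DEBUG %s %d ns=%lu\n", __func__, __LINE__,              \
               (READ_SYSREG64(CNTPCT_EL0) - t1_) * 10);                 \
    } while ( 0 )

That is how I measured vcpu_unblock, i.e. TIME_CALL(vcpu_unblock(v)).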



> > I don't know what is causing the remaining 1000-2000ns, but
> > I bet on another scheduler function. Do you have any suggestions on
> > which one?
> > 
> Well, when a vcpu is woken up, it is put in a runqueue, and a pCPU is
> poked to go get and run it. The other thing you may want to try to
> measure is how much time passes between when the vCPU becomes runnable
> and is added to the runqueue, and when it is actually put to run.
> 
> Again, this would be visible in tracing. :-/

I could do that if you tell me where to add the two
'READ_SYSREG64(CNTPCT_EL0)' reads.
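
Something along these lines, maybe? (Just a sketch to make sure I
understand the suggestion: 'wake_cycles' is a field I would add to
struct vcpu for the experiment, the exact spots in vcpu_wake() and in
the context switch path are a guess, and in common code NOW() might be
simpler than the raw counter.)

    /* xen/common/schedule.c:vcpu_wake(), once the vcpu is on a runqueue */
    v->wake_cycles = READ_SYSREG64(CNTPCT_EL0);

    /* xen/arch/arm/domain.c, when the vcpu is actually scheduled in */
    if ( next->wake_cycles )
    {
        uint64_t now = READ_SYSREG64(CNTPCT_EL0);

        printk("DEBUG wake-to-run d%dv%d ns=%lu\n",
               next->domain->domain_id, next->vcpu_id,
               (now - next->wake_cycles) * 10 /* freq_k */);
        next->wake_cycles = 0;
    }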


> > Assuming that the problem is indeed the scheduler, one workaround
> > that
> > we could introduce today would be to avoid calling vcpu_unblock on
> > guest
> > WFI and call vcpu_yield instead. This change makes things
> > significantly
> > better:
> > 
> >                                      AVG     MIN     MAX     WARM MAX
> > DEBUG WFI (yield, no block)          2900    2190    5130    5130
> > DEBUG WFI (yield, no block) credit2  3514    2280    6180    5430
> > 
> > Is that a reasonable change to make? Would it cause significantly
> > more
> > power consumption in Xen (because xen/arch/arm/domain.c:idle_loop
> > might
> > not be called anymore)?
> > 
> Exactly. So, I think that, as Linux has 'idle=poll', it is conceivable
> to have something similar in Xen, and if we do, I guess it can be
> implemented as you suggest.
> 
> But, no, I don't think this is satisfying as a default, not before
> trying to figure out what is going on, and whether we can improve
> things in other ways.

OK. Should I write a patch for that? I guess it would be ARM-specific
initially. What do you think would be a good name for the option?
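
Just to make the idea concrete, it would be something like this
(completely hypothetical sketch, option name included, on top of the
current WFI trap handling in xen/arch/arm/traps.c):

    /* New ARM-only boolean option, name to be decided. */
    static bool_t __read_mostly vwfi_yield;
    boolean_param("vwfi-yield", vwfi_yield);

    /* In the WFI trap handler: */
    if ( vwfi_yield )
        vcpu_yield();       /* stay runnable, give up the pcpu briefly */
    else
        vcpu_block();       /* current behaviour: block until an event arrives */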


> > If we wanted to zero the difference between the WFI and non-WFI
> > cases,
> > would we need a new scheduler? A simple "noop scheduler" that
> > statically
> > assigns vcpus to pcpus, one by one, until they run out, then returns
> > an error? 
> >
> Well, writing such a scheduler would at least be useful as a reference.
> As in, the latency that you measure on it is the minimum possible
> latency the scheduler is responsible for, and we can compare that with
> what we get from 'regular' schedulers.
> 
> As a matter of fact, it may also turn out useful for a couple of other
> issues/reasons, so I may indeed give this a go.

Thank you! If you write it, I'll help you test it on ARM64 :-)


> But it would not be much more useful than that, IMO.

Why? Actually, I know of several potential users of Xen on ARM
interested in exactly this use case. They only have a statically
defined number of guests, with a total number of vcpus lower than or
equal to the number of pcpus in the system. Wouldn't a scheduler like
that help in this scenario?
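
To illustrate what I mean, the whole placement policy of such a
scheduler would boil down to something like this toy example
(standalone code, not Xen code, just to show the idea):

    #include <stdio.h>

    #define NR_PCPUS 4

    static int pcpu_owner[NR_PCPUS] = { -1, -1, -1, -1 };  /* -1 = free */

    /* Assign a vcpu to the first free pcpu, forever; fail when none is left. */
    static int assign_vcpu(int vcpu_id)
    {
        for ( int cpu = 0; cpu < NR_PCPUS; cpu++ )
            if ( pcpu_owner[cpu] == -1 )
            {
                pcpu_owner[cpu] = vcpu_id;
                return cpu;
            }
        return -1;   /* out of pcpus: refuse the vcpu */
    }

    int main(void)
    {
        for ( int v = 0; v < 6; v++ )
            printf("vcpu%d -> pcpu %d\n", v, assign_vcpu(v));
        return 0;
    }

No runqueues, no migrations, no tick: the only scheduling decision
happens once, when the vcpu is created.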

 

