
Re: [Xen-devel] [RFC v2 00/16] Old GIC (gic-vgic) optimizations for GICV2


  • To: Andrii Anisov <andrii.anisov@xxxxxxxxx>, xen-devel@xxxxxxxxxxxxxxxxxxxx
  • From: André Przywara <andre.przywara@xxxxxxx>
  • Date: Wed, 2 Jan 2019 18:33:45 +0000
  • Cc: Julien Grall <julien.grall@xxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>, Andrii Anisov <andrii_anisov@xxxxxxxx>
  • Delivery-date: Wed, 02 Jan 2019 18:35:08 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>
  • Openpgp: preference=signencrypt

On 26/12/2018 11:20, Andrii Anisov wrote:
> From: Andrii Anisov <andrii_anisov@xxxxxxxx>
> 
> This patch series is an attempt to reduce IRQ latency with the
> old GIC implementation (gic-vgic). These patches were originally
> based on the XEN 4.10 release. The motivation was to improve
> benchmark results of a system given to a customer for evaluation.
> This patch series is tailored for GICv2 on RCAR H3. Several of the
> most controversial patches (i.e. LRs shadowing) were not shared
> with the customer, and are included here for comments and
> discussion. I hope several patches from here could be upstreamed,
> some as is, others with modifications.
> 
> There are several simple ideas behind these changes:
>     - reduce excessive code (condition checks)
>     - drop excessive peripheral register accesses
>     - if code cannot be reduced, move it out of spinlocks
>     - if register accesses cannot be dropped, move them
>       out of spinlocks
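
(To make the last idea concrete, here is a minimal sketch of hoisting an
MMIO register read out of a lock. All names are invented for
illustration: gich_read_misr(), update_pending_state() and vgic_lock are
placeholders, not actual Xen symbols.

    /* Assumed Xen-like primitives: spinlock_t, spin_lock()/spin_unlock(). */
    static spinlock_t vgic_lock;

    /* Before: the slow MMIO read happens under the lock. */
    static void sync_before(void)
    {
        uint32_t misr;

        spin_lock(&vgic_lock);
        misr = gich_read_misr();          /* MMIO read inside the lock */
        update_pending_state(misr);
        spin_unlock(&vgic_lock);
    }

    /* After: the register is read first; the lock only covers the update. */
    static void sync_after(void)
    {
        uint32_t misr = gich_read_misr(); /* MMIO read outside the lock */

        spin_lock(&vgic_lock);
        update_pending_state(misr);       /* only the state update is locked */
        spin_unlock(&vgic_lock);
    }

This shortens the lock hold time, at the price of possibly acting on a
slightly stale register value, which the caller must be able to tolerate.)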
> 
> This is a v2 of the original RFC series [1]. From that series, patches
> [2] and [3] have already reached mainline. Here a few patches are
> reworked, addressing comments or splitting them into clearer pieces;
> more patches are taken from RFC v1 as is.
> 
> The main intention of this version of the RFC series is to reveal
> the patch-by-patch IRQ latency impact.
> The measurement is performed with TBM [4], so the use case is
> trivial: a single IRQ is delivered twice per second. Thus neither
> lock contention nor the delivery of more than one interrupt to a
> guest at a time is exercised.
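
(As a rough illustration of the measurement idea, not TBM's actual code:
the EL1 physical timer is programmed, and the IRQ handler computes the
latency as the current counter value minus the programmed compare value.
record_sample() and TICKS_PER_HALF_SECOND are placeholders.

    static uint64_t cval;    /* programmed compare value, in counter ticks */

    static inline uint64_t read_cntpct(void)
    {
        uint64_t v;
        asm volatile("mrs %0, cntpct_el0" : "=r"(v));
        return v;
    }

    static void timer_irq_handler(void)
    {
        /* latency = time the IRQ was observed - time it was due */
        uint64_t latency_ticks = read_cntpct() - cval;

        record_sample(latency_ticks);            /* placeholder helper */

        /* re-arm the timer to fire again half a second from now */
        cval = read_cntpct() + TICKS_PER_HALF_SECOND;
        asm volatile("msr cntp_cval_el0, %0" : : "r"(cval));
    }
)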
> 
> The series is based on the current xenbits/staging, commit 7f28661f6a7.
> XEN is built with no DEBUG and no GICv3 support, for the staging HEAD
> and for each commit. Four runtime configurations are evaluated for each
> commit:
>     - sched=credit2 vwfi=trap
>     - sched=credit2 vwfi=native
>     - sched=credit vwfi=trap
>     - sched=credit vwfi=native
> 
> Each commit is incrementally cherry-picked for the latency evaluation
> in the order they appear in the table. The table can also be found
> shared as a Google spreadsheet here [5].

Many thanks for generating these numbers; this is very useful.

But: could you make any sense out of them? I plotted them, but they
don't seem to be very conclusive. I am scratching my head over the
following issues:
- The behaviour seems to be somewhat different between the four cases.
Which one do you care about, in particular? Is vwfi=native an option for
your use case? Which scheduler do you need or want to use?
- Which of the four columns (max, warm_max, min, avg) are you actually
interested in? For a hard realtime system I would expect max, or maybe
warm_max?
- Some of the patches seem to *increase* the latency. Patch 08/16, for
instance, sticks out here. Most of the time the latency decreases again
afterwards, with a later patch, but I wonder if you could just pick the
patches that help, or somehow explain those outliers.
- Can you give some more background on how you generated the numbers? I
haven't looked into the benchmark in much detail, but I wonder about:
 * What is the runtime of your test? It seems to run forever and updates
   the stats twice a second, if I get this correctly. So for how long
   did you let it run?
 * Do we have any idea how reliable the values are? Can we somehow
   calculate the standard deviation, for instance (see the sketch
   below this list)? That would help to get an idea of the error
   range we are looking at.
 * Is this still the same system as in [4]? The resolution in there is
   only 120ns, and some of the values for two patches differ by exactly
   that amount. So is this actually within the error margin?
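
One way, for instance, would be to keep a running mean and standard
deviation with Welford's online algorithm, so no sample buffer is
needed. A minimal sketch (lat_stats is an assumed helper, not TBM code;
sqrt() is from math.h):

    struct lat_stats {
        uint64_t n;
        double mean;
        double m2;        /* sum of squared deviations from the mean */
    };

    static void lat_stats_add(struct lat_stats *s, double sample)
    {
        double d1 = sample - s->mean;     /* deviation from the old mean */

        s->n++;
        s->mean += d1 / s->n;             /* update the running mean */
        s->m2 += d1 * (sample - s->mean); /* accumulate squared deviation */
    }

    /* Sample standard deviation; needs at least two samples. */
    static double lat_stats_stddev(const struct lat_stats *s)
    {
        return s->n > 1 ? sqrt(s->m2 / (s->n - 1)) : 0.0;
    }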

Also it seems that this test is the only thing running on the system,
besides the idle VCPU. Is that a reasonable way to assess real-world
interrupt latency? For instance, it means that the IRQ handler mostly
hits in the L1 cache, which I wouldn't expect in actual systems.
The method to separate warm_max from max seems to support this. Do you
have some numbers from that 3D benchmark you mentioned earlier, to put
this into perspective and show that improvements in one benefit the
other as well?

Also I looked at some code and started to cook up something myself [6].
The first two patches there should replace your patch 01/16 in a much
cleaner and easier way, along the lines I mentioned before in a reply to
a former post of yours.

Then I looked at the IRQ handler and stumbled upon the function pointers
we are using. I was eyeing them before, because my hunch is they are
costly, especially on big cores, as it might be hard for the CPU to
speculate correctly. Basically something like a call to
gic_hw_ops->gic_read_irq() translates into:
        ldr     x0, [x20]       // load the gic_hw_ops pointer
        ldr     x0, [x0, #72]   // load the gic_read_irq function pointer
        blr     x0              // indirect call through it
That chain contains two dependent loads. If one of them misses in the
cache, the whole pipeline is stalled unless the CPU speculates both
loads correctly (which it might, but we don't know).
This is extra annoying since those function pointers never change, and
always point to the GICv2 functions if CONFIG_GICV3 is not defined. On
top of this, some functions are really trivial, so we pay a big price
for something that might be a single MMIO read or even a system
register access. I think the proper solution (TM) would be to patch
these accesses once we know which GIC we are running on. Linux does
something to this effect, which benefits GICv3 in particular. From
quickly looking at it, Xen seems to lack the infrastructure (jump
labels and more sophisticated runtime patching) to easily copy this
method.
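
To illustrate the direction, a rough sketch of short-circuiting the
hook at compile time when only GICv2 is configured (not necessarily
what the patches in [6] do; the hook name follows
gic_hw_ops->gic_read_irq() above, while the surrounding types and the
gicv2_read_irq() backend name are simplified assumptions):

    struct gic_hw_operations {
        unsigned int (*gic_read_irq)(void);
        /* ... further hooks ... */
    };

    extern const struct gic_hw_operations *gic_hw_ops;
    unsigned int gicv2_read_irq(void);    /* assumed GICv2 backend name */

    static inline unsigned int gic_hw_read_irq(void)
    {
    #ifdef CONFIG_GICV3
        /* Several backends are possible: keep the indirect call. */
        return gic_hw_ops->gic_read_irq();
    #else
        /* Only GICv2 is built in: a direct, inlinable call with no
         * dependent pointer loads. */
        return gicv2_read_irq();
    #endif
    }

Runtime patching would achieve the same even with CONFIG_GICV3 enabled,
but needs the infrastructure mentioned above.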

So the top three patches in [6] address this in a somewhat hackish way,
just to show whether this method improves something. I have only
compile-tested them, so I would be curious if you could give them a try
and test both functionality and performance. A nice side effect is that
those patches benefit both the old and the new VGIC. The effect on the
TBM benchmark might not be too visible, though, due to the hot caches.

Cheers,
Andre.

[6] https://github.com/Andre-ARM/xen/commits/vgic-opt
> 
> Results (in ns) per commit, for each of the four configurations:
>
> 7f28661f6a7ce3d82f881b9afedfebca7f2cf116
>   sched=credit2 vwfi=trap:   max=9480 warm_max=7200 min=6600 avg=6743
>   sched=credit2 vwfi=native: max=4680 warm_max=3240 min=3000 avg=3007
>   sched=credit  vwfi=trap:   max=9480 warm_max=7920 min=6720 avg=7009
>   sched=credit  vwfi=native: max=4560 warm_max=3000 min=2880 avg=2979
>
> gic:gic-vgic: separate GIV3 code more thoroughly
>   sched=credit2 vwfi=trap:   max=9720 warm_max=6960 min=6600 avg=6617
>   sched=credit2 vwfi=native: max=5040 warm_max=3840 min=2880 avg=2905
>   sched=credit  vwfi=trap:   max=9480 warm_max=7200 min=6600 avg=6871
>   sched=credit  vwfi=native: max=4560 warm_max=3000 min=2880 avg=2887
>
> gic-vgic:vgic: avoid excessive conversions
>   sched=credit2 vwfi=trap:   max=9360 warm_max=6720 min=6480 avg=6578
>   sched=credit2 vwfi=native: max=4800 warm_max=3120 min=2880 avg=2895
>   sched=credit  vwfi=trap:   max=9480 warm_max=7080 min=6600 avg=6804
>   sched=credit  vwfi=native: max=4800 warm_max=3120 min=2880 avg=2887
>
> gic:vgic:gic-vgic: introduce non-atomic bitops
>   sched=credit2 vwfi=trap:   max=9120 warm_max=6600 min=6480 avg=6546
>   sched=credit2 vwfi=native: max=4920 warm_max=3000 min=2760 avg=2872
>   sched=credit  vwfi=trap:   max=9120 warm_max=6720 min=6480 avg=6574
>   sched=credit  vwfi=native: max=4200 warm_max=3120 min=2760 avg=2798
>
> gic: drop interrupts enabling on interrupts processing
>   sched=credit2 vwfi=trap:   max=9240 warm_max=7080 min=6360 avg=6492
>   sched=credit2 vwfi=native: max=5040 warm_max=3240 min=2760 avg=2767
>   sched=credit  vwfi=trap:   max=9240 warm_max=6720 min=6480 avg=6491
>   sched=credit  vwfi=native: max=4440 warm_max=3000 min=2760 avg=2809
>
> gic-vgic: skip irqs locking in gic_restore_pending_irqs()
>   sched=credit2 vwfi=trap:   max=9000 warm_max=6720 min=6360 avg=6430
>   sched=credit2 vwfi=native: max=4320 warm_max=3120 min=2640 avg=2671
>   sched=credit  vwfi=trap:   max=9240 warm_max=6720 min=6360 avg=6459
>   sched=credit  vwfi=native: max=4440 warm_max=2880 min=2640 avg=2668
>
> vgic: move pause_flags check out of vgic spinlock
>   sched=credit2 vwfi=trap:   max=9240 warm_max=6720 min=6360 avg=6431
>   sched=credit2 vwfi=native: max=4800 warm_max=2880 min=2640 avg=2675
>   sched=credit  vwfi=trap:   max=9360 warm_max=6600 min=6360 avg=6435
>   sched=credit  vwfi=native: max=4440 warm_max=2760 min=2640 avg=2647
>
> vgic: move irq_to_pending out of lock
>   sched=credit2 vwfi=trap:   max=8520 warm_max=7440 min=6360 avg=6444
>   sched=credit2 vwfi=native: max=4680 warm_max=3000 min=2640 avg=2753
>   sched=credit  vwfi=trap:   max=9480 warm_max=6720 min=6360 avg=6445
>   sched=credit  vwfi=native: max=4200 warm_max=3000 min=2640 avg=2667
>
> gic-vgic:vgic: do not keep disabled IRQs in any of queues
>   sched=credit2 vwfi=trap:   max=9120 warm_max=7920 min=6360 avg=6447
>   sched=credit2 vwfi=native: max=4440 warm_max=2760 min=2760 avg=2767
>   sched=credit  vwfi=trap:   max=10440 warm_max=7560 min=6360 avg=6459
>   sched=credit  vwfi=native: max=4440 warm_max=3840 min=2640 avg=2669
>
> xen/arm: Re-enable interrupt later in the trap path
>   sched=credit2 vwfi=trap:   max=9720 warm_max=9120 min=6360 avg=6441
>   sched=credit2 vwfi=native: max=4440 warm_max=2880 min=2760 avg=2767
>   sched=credit  vwfi=trap:   max=9360 warm_max=6960 min=6360 avg=6451
>   sched=credit  vwfi=native: max=4680 warm_max=2880 min=2640 avg=2675
>
> gic-vgic: skip irqs locking in vgic_sync_from_lrs
>   sched=credit2 vwfi=trap:   max=9240 warm_max=7080 min=6360 avg=6431
>   sched=credit2 vwfi=native: max=4920 warm_max=3120 min=2640 avg=2678
>   sched=credit  vwfi=trap:   max=9480 warm_max=6960 min=6360 avg=6443
>   sched=credit  vwfi=native: max=4680 warm_max=2880 min=2640 avg=2667
>
> gic-v2: Write HCR only on change
>   sched=credit2 vwfi=trap:   max=9840 warm_max=6600 min=6360 avg=6459
>   sched=credit2 vwfi=native: max=4440 warm_max=2760 min=2520 avg=2527
>   sched=credit  vwfi=trap:   max=9480 warm_max=7920 min=6360 avg=6445
>   sched=credit  vwfi=native: max=4320 warm_max=2760 min=2520 avg=2527
>
> gic-v2: avoid HCR reading for GICv2
>   sched=credit2 vwfi=trap:   max=9480 warm_max=7680 min=6360 avg=6443
>   sched=credit2 vwfi=native: max=4320 warm_max=2880 min=2520 avg=2527
>   sched=credit  vwfi=trap:   max=9360 warm_max=7080 min=6720 avg=6750
>   sched=credit  vwfi=native: max=3960 warm_max=2640 min=2400 avg=2487
>
> hack: arm/domain: simplify context restore from idle vcpu
>   sched=credit2 vwfi=trap:   max=9360 warm_max=6720 min=6000 avg=6214
>   sched=credit2 vwfi=native: max=5040 warm_max=2640 min=2520 avg=2527
>   sched=credit  vwfi=trap:   max=9480 warm_max=7080 min=6240 avg=6367
>   sched=credit  vwfi=native: max=4080 warm_max=2880 min=2400 avg=2527
>
> hack: move gicv2 LRs reads and writes out of spinlocks
>   sched=credit2 vwfi=trap:   max=9480 warm_max=6840 min=6600 avg=6612
>   sched=credit2 vwfi=native: max=4800 warm_max=2760 min=2640 avg=2739
>   sched=credit  vwfi=trap:   max=9000 warm_max=7200 min=6600 avg=6636
>   sched=credit  vwfi=native: max=4560 warm_max=2760 min=2520 avg=2619
>
> gic: vgic: align frequently accessed data by cache line size
>   sched=credit2 vwfi=trap:   max=9840 warm_max=6600 min=6240 avg=6288
>   sched=credit2 vwfi=native: max=4440 warm_max=2880 min=2640 avg=2682
>   sched=credit  vwfi=trap:   max=8280 warm_max=6720 min=6360 avg=6488
>   sched=credit  vwfi=native: max=4080 warm_max=2880 min=2640 avg=2678
> 
> gic: separate ppi processing
>       NOT SUITABLE FOR EVALUATION WITH TBM
> 
> [1] https://lists.xenproject.org/archives/html/xen-devel/2018-11/msg03328.html
> [2] https://lists.xenproject.org/archives/html/xen-devel/2018-11/msg03291.html
> [3] https://lists.xenproject.org/archives/html/xen-devel/2018-11/msg03285.html
> [4] https://lists.xenproject.org/archives/html/xen-devel/2018-12/msg00881.html
> [5] https://docs.google.com/spreadsheets/d/1J_u9UDowaonnaKtgiugXqtIFT-c2E4Ss2vxgL6NnbNo/edit?usp=sharing
> 
> Andrii Anisov (15):
>   gic:gic-vgic: separate GIV3 code more thoroughly
>   gic-vgic:vgic: avoid excessive conversions
>   gic:vgic:gic-vgic: introduce non-atomic bitops
>   gic: drop interrupts enabling on interrupts processing
>   gic-vgic: skip irqs locking in gic_restore_pending_irqs()
>   vgic: move pause_flags check out of vgic spinlock
>   vgic: move irq_to_pending out of lock
>   gic-vgic:vgic: do not keep disabled IRQs in any of queues
>   gic-vgic: skip irqs locking in vgic_sync_from_lrs
>   gic-v2: Write HCR only on change
>   gic-v2: avoid HCR reading for GICv2
>   hack: arm/domain: simplify context restore from idle vcpu
>   hack: move gicv2 LRs reads and writes out of spinlocks
>   gic: vgic: align frequently accessed data by cache line size
>   gic: separate ppi processing
> 
> Julien Grall (1):
>   xen/arm: Re-enable interrupt later in the trap path
> 
> 
>  xen/arch/arm/arm64/entry.S   | 11 +++---
>  xen/arch/arm/domain.c        | 25 +++++++-----
>  xen/arch/arm/gic-v2.c        | 82 +++++++++++++++++++++++++-------------
>  xen/arch/arm/gic-v3-its.c    |  2 +
>  xen/arch/arm/gic-v3-lpi.c    |  2 +
>  xen/arch/arm/gic-v3.c        |  4 +-
>  xen/arch/arm/gic-vgic.c      | 87 +++++++++++++++++++++++++----------------
>  xen/arch/arm/gic.c           | 32 +++++++++++++--
>  xen/arch/arm/irq.c           | 32 +++++++++++++++
>  xen/arch/arm/traps.c         |  6 +++
>  xen/arch/arm/vgic-v3-its.c   |  2 +-
>  xen/arch/arm/vgic.c          | 93 +++++++++++++++++++++++++++++++++++++-------
>  xen/arch/arm/vgic/vgic.c     |  2 +
>  xen/include/asm-arm/config.h |  2 +-
>  xen/include/asm-arm/gic.h    | 10 ++---
>  xen/include/asm-arm/irq.h    |  3 ++
>  xen/include/asm-arm/vgic.h   | 24 ++++++++----
>  xen/include/xen/sched.h      |  1 +
>  18 files changed, 310 insertions(+), 110 deletions(-)
> 

