
Re: AMD EPYC virtual network performances



On 09.07.24 10:36, Andrei Semenov wrote:
Hello,

As reported by David Morel (mail of 4 Jan 2024), our customers experience very
poor virtual network performance in HVM guests on AMD EPYC platforms.

After some investigation we noticed a huge performance drop (throughput divided
by a factor of 5) starting from Linux kernel 5.10.88 on AMD EPYC platforms. The
patch introduced in this kernel version that lets us pinpoint the buggy
behavior is:

  “xen/netfront: harden netfront against event channel storms”
d31b3379179d64724d3bbfa87bd4ada94e3237de

The patch basically binds the network frontend to the `xen_lateeoi_chip`
irq_chip (instead of `xen_dynamic_chip`), which allows its clients to inform
the chip when spurious interrupts are detected; the chip then introduces a
delay in interrupt handling.
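A rough way to see why this hurts: once events keep getting flagged spurious, the lateeoi chip holds back the end-of-interrupt, so new events are delayed. The snippet below is a simplified model of that throttling (exponential back-off past a threshold), not the kernel's actual bookkeeping; the function name and the exact growth formula are illustrative.

```python
def eoi_delay(spurious_run: int, threshold: int = 1) -> int:
    """Model delay (in arbitrary ticks) before the next EOI.

    `spurious_run` is the count of consecutive events the driver
    flagged spurious; a useful event resets the run to 0.  Once the
    run exceeds `threshold`, the delay grows exponentially -- a
    simplified stand-in for the kernel's lateeoi throttling.
    """
    if spurious_run <= threshold:
        return 0
    return 1 << (spurious_run - threshold - 1)

# Strictly alternating spurious/useful events never exceed the
# default threshold of 1, so the observed slowdown suggests bursts
# of consecutive spurious events rather than a perfect 1:1 pattern.
```

Raising the threshold (as suggested below in this thread) pushes the point where the back-off starts, at the cost of tolerating longer spurious bursts.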

We tried to measure how many spurious interrupts (no work for the driver to do)
are raised. We used `iperf2` to bench the network bandwidth on an AMD EPYC 7262
(8-core):

Dom0> iperf -s

DomU> iperf -c $DOM0_IP_ADDRESS

From our observations, we see approximately 1 spurious interrupt for every 1
“useful” interrupt (frontend TX interrupts) in HVM guests.

We ran the same bench on the same platform with PV and PVH guests, and the
spurious/useful interrupt ratio was much lower: 1 to 20 (so network performance
is much better).

We also ran this bench on an Intel platform (Intel Xeon Bronze 3106 CPU). The
spurious/useful interrupt ratio was about 1 to 30 for HVM guests.

This makes us think the buggy behavior is related to an abnormal number of
spurious interrupts. The spurious/useful interrupt ratio is particularly high
in HVM guests on AMD platforms, so virtual network bandwidth is heavily
penalized: in our bench we get 1.5 Gbps instead of 7 Gbps (when no slowdown is
introduced by the irq_chip).

Does anybody see this behavior on their side? Can we do something about it?

In the guest you could raise the spurious event threshold by writing a
higher number to /sys/devices/vif-0/xenbus/spurious_threshold (the default
is 1).

There is a similar file on the backend side, where raising the value might
be interesting, too.

In both directories you can see the number of spurious events by looking
into the spurious_events file.
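Concretely, that could look like the commands below. The frontend path follows the sysfs layout described above (`vif-0` assumes the first vif in the guest); the backend device name depends on the guest's domid and the vif handle, so the path in the comment is only a guess at the usual layout.

```shell
# In the guest: check how many events were flagged spurious so far,
# then raise the threshold at which throttling kicks in.
cat /sys/devices/vif-0/xenbus/spurious_events
echo 4 > /sys/devices/vif-0/xenbus/spurious_threshold

# In dom0 (backend side): the matching files live under the vif
# backend device; substitute the real domid and handle, e.g.:
#   /sys/bus/xen-backend/devices/vif-<domid>-<handle>/xenbus/spurious_events
```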

In the end the question is why so many spurious events are happening. Finding
the reason might be hard, though.


Juergen



