
Re: Possible bug? DOM-U network stopped working after fatal error reported in DOM0



On Mon, Jan 10, 2022 at 10:54 PM Roger Pau Monné <roger.pau@xxxxxxxxxx> wrote:
>
> On Sat, Jan 08, 2022 at 01:14:26AM +0800, G.R. wrote:
> > On Wed, Jan 5, 2022 at 10:33 PM Roger Pau Monné <roger.pau@xxxxxxxxxx> wrote:
> > >
> > > On Wed, Jan 05, 2022 at 12:05:39AM +0800, G.R. wrote:
> > > > > > > > > But seems like this patch is not stable enough yet and has its
> > > > > > > > > own issue -- memory is not properly released?
> > > > > > >
> > > > > > > I know. I've been working on improving it this morning and I'm
> > > > > > > attaching an updated version below.
> > > > > > >
> > > > > > Good news.
> > > > > > With this  new patch, the NAS domU can serve iSCSI disk without OOM
> > > > > > panic, at least for a little while.
> > > > > > I'm going to keep it up and running for a while to see if it's 
> > > > > > stable over time.
> > > > >
> > > > > Thanks again for all the testing. Do you see any difference
> > > > > performance wise?
> > > > I'm still on a *debug* kernel build to capture any potential panic --
> > > > none so far -- no performance testing yet.
> > > > Since I'm a home user with a relatively lightweight workload, so far I
> > > > didn't observe any difference in daily usage.
> > > >
> > > > I did some quick iperf3 testing just now.
> > >
> > > Thanks for doing this.
> > >
> > > > 1. between nas domU <=> Linux dom0 running on an old i7-3770 based box.
> > > > The peak is roughly 12 Gbits/s when domU is the server.
> > > > But I do see regression down to ~8.5 Gbits/s when I repeat the test in
> > > > a short burst.
> > > > The regression can recover when I leave the system idle for a while.
> > > >
> > > > When dom0 is the iperf3 server, the transfer rate is much lower, down
> > > > all the way to 1.x Gbits/s.
> > > > Sometimes, I can see the following kernel log repeats during the
> > > > testing, likely contributing to the slowdown.
> > > >              interrupt storm detected on "irq2328:"; throttling 
> > > > interrupt source
> > >
> > > I assume the message is in the domU, not the dom0?
> > Yes, in the TrueNAS domU.
> > BTW, I rebooted back to the stock kernel and the message is no longer
> > observed.
> >
> > With the stock kernel, the transfer rate from dom0 to nas domU can be
> > as high as 30Gbps.
> > The variation is still observed, sometimes down to ~19Gbps. There is
> > no retransmission in this direction.
> >
> > For the reverse direction, the observed low transfer rate still exists.
> > It's still within the range of 1.x Gbps, but should still be better
> > than the previous test.
> > The huge number of re-transmission is still observed.
> > The same behavior can be observed on a stock FreeBSD 12.2 image, so
> > this is not specific to TrueNAS.
>
> So that's domU sending the data, and dom0 receiving it.
Correct.
>
> >
> > According to the packet capture, the re-transmission appears to be
> > caused by packet reorder.
> > Here is one example incident:
> > 1. dom0 sees a sequence jump in the incoming stream and begins to send out
> >    SACKs
> > 2. When the SACK shows up at domU, it begins to re-transmit the lost frames
> >    (the re-transmit looks weird since it shows up as a mixed stream of
> >    1448-byte and 12-byte packets, instead of always 1448 bytes)
> > 3. Suddenly the packets that were believed to be lost show up, and dom0
> >    accepts them as if they were re-transmissions
>
> Hm, so there seems to be some kind of issue with ordering I would say.
Agree.

>
> > 4. The actual re-transmission finally shows up in dom0...
> > Should we expect packet reorder on a direct virtual link? Sounds fishy
> > to me.
> > Any chance we can get this re-transmission fixed?
>
> Does this still happen with all the extra features disabled? (-rxcsum
> -txcsum -lro -tso)
No obvious impact I would say.
After disabling all extra features:
xn0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
    ether 00:18:3c:51:6e:4c
    inet 192.168.1.9 netmask 0xffffff00 broadcast 192.168.1.255
    media: Ethernet manual
    status: active
    nd6 options=9<PERFORMNUD,IFDISABLED>
The iperf3 result:
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  2.04 GBytes  1.75 Gbits/sec  12674             sender
[  5]   0.00-10.14  sec  2.04 GBytes  1.73 Gbits/sec                  receiver
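
For scale, here is a rough back-of-the-envelope estimate of what fraction of segments got retransmitted in the run above. It is only a sketch: it assumes every segment carries the 1448-byte payload seen in the packet capture and that iperf3's "GBytes" means 2^30 bytes. The function name is made up for illustration.

```python
def retransmit_ratio(transfer_gbytes, retransmits, payload_bytes=1448):
    """Estimate the fraction of segments that were retransmitted.

    Assumption: all segments are full-size (payload_bytes each), which
    slightly underestimates the segment count if short segments occur.
    """
    total_bytes = transfer_gbytes * 2**30      # iperf3 GBytes taken as GiB
    segments = total_bytes / payload_bytes     # approximate segment count
    return retransmits / segments


# Numbers from the iperf3 sender line: 2.04 GBytes, 12674 retransmits.
ratio = retransmit_ratio(2.04, 12674)
print(f"{ratio:.2%} of segments retransmitted")
```

So under these assumptions a bit under 1% of the segments are retransmitted.
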
BTW, those extra features have a huge impact on the dom0 => domU direction.
Without them, throughput drops all the way from ~30 / 18 Gbps to
3.5 / 1.8 Gbps (variation range).
But there is no retransmission at all in either config for this direction.
I wonder why the difference is so huge, since the NIC is purely virtual
without any HW acceleration?

Any further suggestions on this retransmission issue?
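
To make the reorder pattern from the capture easier to discuss, here is a toy model of how the receiver side could label arriving segments. It is a simplified sketch and not how the TCP stack or any capture tool actually classifies packets; the function name, the labels and the sequence numbers in the example are all made up for illustration.

```python
def classify_segments(segments):
    """Label each (seq, length) arrival from the receiver's point of view.

    'in-order'  : starts exactly at the next expected byte
    'gap'       : starts beyond it, leaving a hole (this would trigger SACKs)
    'late'      : fills (part of) a hole left by an earlier gap
    'duplicate' : covers only bytes that were already received
    """
    expected = segments[0][0] if segments else 0
    holes = []    # (start, end) byte ranges still missing
    labels = []
    for seq, length in segments:
        end = seq + length
        if seq == expected:
            labels.append("in-order")
            expected = end
        elif seq > expected:
            holes.append((expected, seq))
            labels.append("gap")
            expected = end
        else:
            # Older data: see whether it overlaps any outstanding hole.
            remaining, filled = [], False
            for h_start, h_end in holes:
                if end <= h_start or seq >= h_end:
                    remaining.append((h_start, h_end))   # hole untouched
                    continue
                filled = True
                if h_start < seq:
                    remaining.append((h_start, seq))     # left part still missing
                if end < h_end:
                    remaining.append((end, h_end))       # right part still missing
            holes = remaining
            labels.append("late" if filled else "duplicate")
    return labels


# Hypothetical arrival order: the segment at 2896 is overtaken by two later
# ones, then shows up late, then its retransmission arrives as well.
arrivals = [(0, 1448), (1448, 1448), (4344, 1448), (5792, 1448),
            (2896, 1448), (2896, 1448)]
labels = classify_segments(arrivals)
print(labels)
```

In this model the "late" arrival corresponds to step 3 of the incident (the supposedly lost packets showing up), and the "duplicate" to step 4 (the actual re-transmission reaching dom0).
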

>
> > So looks like at least the imbalance between two directions are not
> > related to your patch.
> > Likely the debug build is a bigger contributor to the perf difference
> > in both directions.
> >
> > I also tried your patch on a release build, and didn't observe any
> > major difference in iperf3 numbers.
> > Roughly match the 30Gbps and 1.xGbps number on the stock release kernel.
>
> Thanks a lot, will try to get this upstream then.
>
> Roger.



 

