xen-devel

RE: [Xen-devel] network hang trigger

To: "Keir Fraser" <Keir.Fraser@xxxxxxxxxxxx>
Subject: RE: [Xen-devel] network hang trigger
From: "James Harper" <JamesH@xxxxxxxxxxxxxxxx>
Date: Thu, 16 Sep 2004 20:02:37 +1000
Cc: "Bin Ren" <br260@xxxxxxxxx>, <xen-devel@xxxxxxxxxxxxxxxxxxxxx>

I've tried this, and I see the first fragment of the ping get sent and then a complete hang. That is what originally made me suspect a race when packets are sent with very little time between one and the next.

 

It could be that Bin's patch changed the timing on his machine enough that the bug goes away for him. I can make the bug come and go by placing printk calls in network_start_xmit, as described in my previous email.

 

This is a dump from a normal-size ping.

Xen0

listening on vif13.0, link-type EN10MB (Ethernet), capture size 96 bytes
19:02:48.276397 IP 192.168.200.200 > xen2.int.sbss.com.au: icmp 64: echo request seq 1
19:02:48.306646 IP xen2.int.sbss.com.au > 192.168.200.200: icmp 64: echo reply seq 1
19:02:49.275931 IP 192.168.200.200 > xen2.int.sbss.com.au: icmp 64: echo request seq 2
19:02:49.276033 IP xen2.int.sbss.com.au > 192.168.200.200: icmp 64: echo reply seq 2

XenU

listening on eth0, link-type EN10MB (Ethernet), capture size 96 bytes
19:02:48.270125 IP 192.168.200.200 > xen2.int.sbss.com.au: icmp 64: echo request seq 1
19:02:48.277577 IP xen2.int.sbss.com.au > 192.168.200.200: icmp 64: echo reply seq 1
19:02:49.275460 IP 192.168.200.200 > xen2.int.sbss.com.au: icmp 64: echo request seq 2
19:02:49.276848 IP xen2.int.sbss.com.au > 192.168.200.200: icmp 64: echo reply seq 2

 

This is from a large ping (with printk calls in network_start_xmit, so it works).

Xen0

19:10:33.502706 IP 192.168.200.200 > xen2.int.sbss.com.au: icmp 1480: echo request seq 1
19:10:33.502711 IP 192.168.200.200 > xen2.int.sbss.com.au: icmp
19:10:33.502966 IP xen2.int.sbss.com.au > 192.168.200.200: icmp 1480: echo reply seq 1
19:10:33.502992 IP xen2.int.sbss.com.au > 192.168.200.200: icmp
19:10:34.496713 IP 192.168.200.200 > xen2.int.sbss.com.au: icmp 1480: echo request seq 2
19:10:34.496717 IP 192.168.200.200 > xen2.int.sbss.com.au: icmp
19:10:34.496872 IP xen2.int.sbss.com.au > 192.168.200.200: icmp 1480: echo reply seq 2
19:10:34.496895 IP xen2.int.sbss.com.au > 192.168.200.200: icmp

XenU

19:10:33.496431 IP 192.168.200.200 > xen2.int.sbss.com.au: icmp 1480: echo request seq 1
19:10:33.498042 IP 192.168.200.200 > xen2.int.sbss.com.au: icmp
19:10:33.507890 IP xen2.int.sbss.com.au > 192.168.200.200: icmp 1480: echo reply seq 1
19:10:33.507953 IP xen2.int.sbss.com.au > 192.168.200.200: icmp
19:10:34.492920 IP 192.168.200.200 > xen2.int.sbss.com.au: icmp 1480: echo request seq 2
19:10:34.494703 IP 192.168.200.200 > xen2.int.sbss.com.au: icmp
19:10:34.501604 IP xen2.int.sbss.com.au > 192.168.200.200: icmp 1480: echo reply seq 2
19:10:34.501639 IP xen2.int.sbss.com.au > 192.168.200.200: icmp
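
For reference, the "icmp 64" and "icmp 1480" lengths in these dumps are what IPv4 fragmentation arithmetic predicts. The following is a minimal sketch in Python, assuming a 1500-byte Ethernet MTU and a 20-byte IPv4 header without options. The function name is made up for illustration, and the 2000-byte payload is a hypothetical value, since the exact size used for the large ping is not given here.

```python
def fragment_payloads(icmp_data_len, mtu=1500, ip_header_len=20, icmp_header_len=8):
    """Return the IP payload length carried by each fragment of one echo request.

    Assumes IPv4 with no IP options (20-byte header) and an 8-byte ICMP header.
    """
    # The IP payload is the ICMP header plus the ping data.
    remaining = icmp_header_len + icmp_data_len
    # Every fragment except the last must carry a multiple of 8 bytes.
    per_fragment = (mtu - ip_header_len) // 8 * 8
    sizes = []
    while remaining > per_fragment:
        sizes.append(per_fragment)
        remaining -= per_fragment
    sizes.append(remaining)
    return sizes


if __name__ == "__main__":
    # Default ping payload of 56 bytes: one unfragmented packet of 64 bytes.
    print(fragment_payloads(56))
    # Hypothetical 2000-byte payload: a full 1480-byte first fragment plus a remainder.
    print(fragment_payloads(2000))
```

Any ping payload above 1472 bytes under these assumptions produces two or more packets sent back to back, which is the condition that seems to trigger the hang.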

 

This is from the same large ping (with the printk calls removed, so it hangs).

Xen0

listening on vif14.0, link-type EN10MB (Ethernet), capture size 96 bytes
19:23:25.125927 IP 192.168.200.200 > xen2.int.sbss.com.au: icmp 1480: echo request seq 1
19:23:55.122574 IP xen2.int.sbss.com.au > 192.168.200.200: icmp 556: ip reassembly time exceeded
19:23:55.122726 IP 192.168.200.200 > xen2.int.sbss.com.au: icmp
19:23:55.122732 IP 192.168.200.200 > xen2.int.sbss.com.au: icmp 1480: echo request seq 2
19:23:55.122734 IP 192.168.200.200 > xen2.int.sbss.com.au: icmp
19:23:55.122735 IP 192.168.200.200 > xen2.int.sbss.com.au: icmp 1480: echo request seq 3
19:23:55.122737 IP 192.168.200.200 > xen2.int.sbss.com.au: icmp
19:23:55.122739 IP 192.168.200.200 > xen2.int.sbss.com.au: icmp 1480: echo request seq 4
19:23:55.122741 IP 192.168.200.200 > xen2.int.sbss.com.au: icmp
19:23:55.123850 IP xen2.int.sbss.com.au > 192.168.200.200: icmp 1480: echo reply seq 2
19:23:55.123873 IP xen2.int.sbss.com.au > 192.168.200.200: icmp
19:23:55.123955 IP xen2.int.sbss.com.au > 192.168.200.200: icmp 1480: echo reply seq 3
19:23:55.123977 IP xen2.int.sbss.com.au > 192.168.200.200: icmp
19:23:55.124050 IP xen2.int.sbss.com.au > 192.168.200.200: icmp 1480: echo reply seq 4
19:23:55.124070 IP xen2.int.sbss.com.au > 192.168.200.200: icmp

XenU

listening on eth0, link-type EN10MB (Ethernet), capture size 96 bytes
19:23:25.126797 IP 192.168.200.200 > 192.168.200.204: icmp 1480: echo request seq 1
19:23:25.129472 IP 192.168.200.200 > 192.168.200.204: icmp
19:23:26.143609 IP 192.168.200.200 > 192.168.200.204: icmp 1480: echo request seq 2
19:23:26.143622 IP 192.168.200.200 > 192.168.200.204: icmp
19:23:27.143643 IP 192.168.200.200 > 192.168.200.204: icmp 1480: echo request seq 3
19:23:27.143660 IP 192.168.200.200 > 192.168.200.204: icmp
19:23:28.143643 IP 192.168.200.200 > 192.168.200.204: icmp 1480: echo request seq 4
19:23:28.143658 IP 192.168.200.200 > 192.168.200.204: icmp
19:23:55.124352 IP 192.168.200.204 > 192.168.200.200: icmp 556: ip reassembly time exceeded
19:23:55.126145 IP 192.168.200.204 > 192.168.200.200: icmp 1480: echo reply seq 2
19:23:55.126170 IP 192.168.200.204 > 192.168.200.200: icmp
19:23:55.126201 IP 192.168.200.204 > 192.168.200.200: icmp 1480: echo reply seq 3
19:23:55.126208 IP 192.168.200.204 > 192.168.200.200: icmp
19:23:55.126224 IP 192.168.200.204 > 192.168.200.200: icmp 1480: echo reply seq 4
19:23:55.126230 IP 192.168.200.204 > 192.168.200.200: icmp

 

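The clocks are in sync between the two domains, so you can see that dom0 sees only the first fragment of the first ping (19:23:25.125927), then nothing for about 30 seconds, and then everything else comes through at once at 19:23:55. The XenU capture, by contrast, shows both fragments of the first ping within 3 ms of each other and the following pings at one-second intervals.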
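
The stall can also be picked out of a capture mechanically. This is a rough Python sketch that relies only on tcpdump's default HH:MM:SS.microseconds timestamp prefix. The function names and the 5-second threshold are illustrative choices, not part of any existing tool, and the sample input is the first three lines of the dom0 capture from the hanging case.

```python
import re

# tcpdump's default timestamp prefix, e.g. "19:23:25.125927 IP ..."
STAMP = re.compile(r"^(\d{2}):(\d{2}):(\d{2})\.(\d{6})\s")


def to_microseconds(line):
    """Return the line's timestamp in microseconds since midnight, or None."""
    match = STAMP.match(line)
    if match is None:
        return None
    hours, minutes, seconds, micros = (int(g) for g in match.groups())
    return ((hours * 60 + minutes) * 60 + seconds) * 1_000_000 + micros


def find_gaps(lines, threshold_us=5_000_000):
    """Return (line_before, line_after, gap_us) for each silence >= threshold_us.

    Lines without a timestamp (such as "listening on ...") are skipped.
    Captures that cross midnight are not handled.
    """
    gaps = []
    previous = None
    for line in lines:
        stamp = to_microseconds(line)
        if stamp is None:
            continue
        if previous is not None and stamp - previous[0] >= threshold_us:
            gaps.append((previous[1], line, stamp - previous[0]))
        previous = (stamp, line)
    return gaps


DOM0_CAPTURE = """
listening on vif14.0, link-type EN10MB (Ethernet), capture size 96 bytes
19:23:25.125927 IP 192.168.200.200 > xen2.int.sbss.com.au: icmp 1480: echo request seq 1
19:23:55.122574 IP xen2.int.sbss.com.au > 192.168.200.200: icmp 556: ip reassembly time exceeded
19:23:55.122726 IP 192.168.200.200 > xen2.int.sbss.com.au: icmp
"""

if __name__ == "__main__":
    for before, after, gap_us in find_gaps(DOM0_CAPTURE.splitlines()):
        print(f"{gap_us / 1_000_000:.6f} s of silence before: {after}")
```

Running the same function over the XenU capture with a lower threshold would show the one-second spacing of the requests there, which makes the difference between the two ends easy to compare.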

 

Is it possible that there is a synchronisation problem in interdomain communications?

 

James

 

> -----Original Message-----
> From: Keir Fraser [mailto:Keir.Fraser@xxxxxxxxxxxx]
> Sent: Thursday, 16 September 2004 17:24
> To: James Harper
> Cc: Bin Ren; xen-devel@xxxxxxxxxxxxxxxxxxxxx
> Subject: Re: [Xen-devel] network hang trigger
>
> > When I was thinking about this problem, I was imagining a deadlock
> > condition where rapid back to back packets (eg a fragmented icmp packet
> > from ping or a fragmented udp packet from nfs) was causing a hang until
> > part of the deadlock timed itself out and the packets started flowing
> > again. I couldn't see 1 packet causing a buffer exhaustion unless it got
> > itself into a loop where it kept retrying to send the second fragment
> > and didn't free the buffer each time, but even then the buffer bug would
> > be a side effect.
> >
> > The deadlock would have to be caused in the transmit from xenU to xen0,
> > and something about the difference between sending a ping and responding
> > to a ping is the difference between always causing a lockup and only
> > sometimes causing a lockup.
>
> Try tcpdumping each end of the connection.
>
> I find that for ping 0->U, the 'seizure' is entirely within DOM0 --
> ping responses are still received, but for some reason they don't make
> it up to the ping application.
>
> For ping U->0, it does look as though the network seizes up -- I see
> no packets in either direction.
>
>  -- Keir