RE: [Xen-devel] MPI benchmark performance gap between native linux anddomU

To: "Nivedita Singhvi" <niv@xxxxxxxxxx>, "Bin Ren" <bin.ren@xxxxxxxxx>, "Andrew Theurer" <habanero@xxxxxxxxxx>
Subject: RE: [Xen-devel] MPI benchmark performance gap between native linux anddomU
From: "Santos, Jose Renato G (Jose Renato Santos)" <joserenato.santos@xxxxxx>
Date: Tue, 5 Apr 2005 17:17:51 -0700
Cc: "Turner, Yoshio" <yoshio_turner@xxxxxx>, Aravind Menon <aravind.menon@xxxxxxx>, Xen-devel@xxxxxxxxxxxxxxxxxxx, G John Janakiraman <john@xxxxxxxxxxxxxxxxxxx>
Delivery-date: Wed, 06 Apr 2005 00:17:55 +0000
Envelope-to: www-data@xxxxxxxxxxxxxxxxxxx
List-help: <mailto:xen-devel-request@lists.xensource.com?subject=help>
List-id: Xen developer discussion <xen-devel.lists.xensource.com>
List-post: <mailto:xen-devel@lists.xensource.com>
List-subscribe: <http://lists.xensource.com/cgi-bin/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
List-unsubscribe: <http://lists.xensource.com/cgi-bin/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
Sender: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
Thread-index: AcU6LfZxUXcb0dHaQO2H6qEYW9ST+gADZVtA
Thread-topic: [Xen-devel] MPI benchmark performance gap between native linux anddomU
  Nivedita, Bin, Andrew, and all interested in Xenoprof,

  We should be posting the Xenoprof patches in a few days.
  We are doing some final cleanup of the code, so please be a little
more patient.

  Thanks

  Renato 

>> -----Original Message-----
>> From: Nivedita Singhvi [mailto:niv@xxxxxxxxxx] 
>> Sent: Tuesday, April 05, 2005 3:23 PM
>> To: Santos, Jose Renato G (Jose Renato Santos)
>> Cc: xuehai zhang; Xen-devel@xxxxxxxxxxxxxxxxxxx; Turner, 
>> Yoshio; Aravind Menon; G John Janakiraman
>> Subject: Re: [Xen-devel] MPI benchmark performance gap 
>> between native linux anddomU
>> 
>> 
>> Santos, Jose Renato G (Jose Renato Santos) wrote:
>> 
>> >   Hi,
>> > 
>> >   We had a similar network problem in the past. We were using a
>> > TCP benchmark instead of MPI, but I believe your problem is
>> > probably the same as the one we encountered.
>> >   It took us a while to get to the bottom of this, and we only
>> > identified the reason for this behavior after we ported oprofile
>> > to Xen and did some performance profiling experiments.
>> 
>> Hello! Was this on the 2.6 kernel? Would you be able to
>> share the oprofile port? It would be very handy indeed
>> right now. (I was told by a few people that someone was
>> porting oprofile, and I believe a status update went by
>> on the list, but I haven't seen it yet...)
>> 
>> >   Here is a brief explanation of the problem we found and the
>> > solution that worked for us.
>> >   Xenolinux allocates a full page (4KB) to store socket buffers
>> > instead of using just MTU bytes as in traditional Linux. This is
>> > necessary to enable page exchanges between the guest and the I/O
>> > domains. The side effect is that memory space is not used very
>> > efficiently for socket buffers. Even if packets have the maximum
>> > MTU size (typically 1500 bytes for Ethernet), the total buffer
>> > utilization is very low (at most just slightly higher than 35%).
>> > If packets arrive faster than they are processed at the receiver
>> > side, they will exhaust the receive buffer
>> 
>> Most small connections (say, up to 3-4KB) involve only 3 to 5
>> segments, and so the TCP window never really opens fully.
>> On longer-lived connections, it does help very much to have
>> a large buffer.
>> 
>> > before the TCP advertised window is reached. (By default, Linux
>> > uses a TCP advertised window equal to 75% of the receive buffer
>> > size. In standard Linux, this is typically sufficient to stop
>> > packet transmission at the sender before running out of receive
>> > buffers. The same is not true in Xen, due to the inefficient use
>> > of socket buffers.) When a packet arrives and there is no receive
>> > buffer available, TCP tries to free socket buffer space by
>> > eliminating socket buffer fragmentation (i.e., eliminating wasted
>> > buffer space). This is done at the cost of an extra copy of all
>> > receive buffers into new, compacted socket buffers. This
>> > introduces overhead and reduces throughput when the CPU is the
>> > bottleneck, which seems to be your case.
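
To make the arithmetic above concrete, here is a rough back-of-the-envelope
sketch (my numbers, assuming a 1500-byte MTU and a 4096-byte page, before
any skb metadata overhead):

    # best-case data-to-buffer ratio for one full-size Ethernet frame
    echo "scale=2; 1500/4096" | bc     # ~= .36, i.e. roughly the 35% above
    # Each packet is charged a full page against the receive buffer, so
    # the buffer can be exhausted while the 75% advertised window is
    # still open -- which is when the pruning/copying described above
    # kicks in.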
>> 
>> /proc/net/netstat will show a counter of just how many times
>> this happens (RcvPruned). It would be interesting to see
>> whether that count is significant.
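
Something along these lines should pull those counters out of
/proc/net/netstat; field positions differ between kernel versions, so
match on the names rather than on fixed columns:

    awk '/^TcpExt:/ {
             if (!seen) { split($0, name); seen = 1; next }
             for (i = 2; i <= NF; i++)
                 if (name[i] == "PruneCalled" || name[i] == "RcvPruned")
                     print name[i], $i
         }' /proc/net/netstat
    # "netstat -s" reports the same information in prose, e.g. the
    # "packets pruned from receive queue" line.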
>> 
>> > This problem is not very frequent, because modern CPUs are fast
>> > enough to receive packets at Gigabit speeds and the receive
>> > buffer does not fill up. However, the problem may arise when
>> > using slower machines and/or when the workload consumes a lot of
>> > CPU cycles, as with scientific MPI applications. In your case,
>> > you have both factors against you.
>> 
>> 
>> > The solution to this problem is trivial. You just have to change
>> > the TCP advertised window of your guest to a lower value. In our
>> > case, we used 25% of the receive buffer size, and that was
>> > sufficient to eliminate the problem. This can be done using the
>> > following command:
>> > 
>> >   echo -2 > /proc/sys/net/ipv4/tcp_adv_win_scale
>> 
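
For anyone trying this, the same knob can be set through sysctl so that
it survives a reboot (standard sysctl mechanics, nothing Xen-specific):

    sysctl net.ipv4.tcp_adv_win_scale         # show the current value
    sysctl -w net.ipv4.tcp_adv_win_scale=-2   # set it for the running kernel
    # or add "net.ipv4.tcp_adv_win_scale = -2" to /etc/sysctl.conf and
    # run "sysctl -p" to make it persistent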
>> How much did this improve your results by? And wouldn't
>> making the default and max socket buffers larger by, say,
>> 5 times be more effective (except for applications that
>> already use setsockopt() to set their buffers to some
>> size, but not a large enough one)?
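
A sketch of that alternative, with purely illustrative values (not tuned
recommendations):

    cat /proc/sys/net/ipv4/tcp_rmem            # "min default max" in bytes
    echo "4096 262144 1048576" > /proc/sys/net/ipv4/tcp_rmem
    # net.core.rmem_max caps what applications can request via
    # setsockopt(SO_RCVBUF), so raise it as well if needed:
    echo 1048576 > /proc/sys/net/core/rmem_max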
>> 
>> > (The default of 2 corresponds to 75% of the receive buffer, and
>> > -2 corresponds to 25%.)
>> > 
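
As I read the kernel's window calculation (paraphrased from memory, so
please double-check against your kernel source), the mapping is:

    #   scale > 0 :  window = space - space / 2^scale    (  2 -> 3/4 = 75% )
    #   scale <= 0:  window = space / 2^(-scale)         ( -2 -> 1/4 = 25% )
    # quick sanity check with bc, using a 256KB receive buffer:
    echo "x = 262144; x - x/2^2; x/2^2" | bc   # prints 196608 and 65536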
>> > Please let me know if this improves your results. You should
>> > still see a degradation in throughput when comparing Xen to
>> > traditional Linux, but hopefully you will see better throughput.
>> > You should also try running your experiments in domain 0. This
>> > will give better throughput, although still lower than
>> > traditional Linux. I am curious to know whether this has any
>> > effect on your experiments. Please post the new results if it
>> > does.
>> 
>> Yep, me too.
>> 
>> thanks,
>> Nivedita
>> 
>> 
>> 

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel
