[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] Xen disk write slowness in kernel 3.8.x

To: Felipe Franciosi <felipe.franciosi@xxxxxxxxxx>
From: Steven Haigh <netwiz@xxxxxxxxx>
Date: Wed, 17 Apr 2013 17:52:45 +1000
Cc: "xen-devel@xxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxx>
Delivery-date: Wed, 17 Apr 2013 07:53:45 +0000
List-id: Xen developer discussion <xen-devel.lists.xen.org>

On 04/17/2013 12:39 AM, Felipe Franciosi wrote:

Hi Steven, sorry for the delay (doing some training followed by holidays).

Using "direct" from the guest does speed things up. I'm not entirely sure why.
Especially since this is also true (at least with the guests I tested) in XCP/XenServer
where the datapath includes a blktap2/tapdisk model where the virtual disks are always
opened with O_DIRECT and the only valid operations are READ or WRITE (i.e. there is no
special handling for guest operations such as barriers, etc).

Have a look at http://support.citrix.com/article/CTX136861 for an understanding
of how things work in XCP 1.6 / XenServer 6.1.0.

My guess is that "direct" allows you to control the exact size of the requests
(creating a consistent data flow). If you omit it and shove all your requests through the
guest's buffer cache, what you get in the backend will vary largely in terms of request
sizes. This will particularly affect the blkback/blkfront ring utilisation. This
hypothesis can hopefully be supported by running iostat in dom0 (e.g. iostat -xm 1) and
observing the request sizes. Let me know what you see in your environment. If anyone else
in the list have other ideas on this I'd like to hear them as this has been puzzling me
too.

Another thing you can try is to use "bs=11M" along i/oflag=direct. The "11M"
magic number comes from the ring size: with single-page rings, you can have up to 32 requests of
44KiB each in the ring.
32 * 44KiB = 1408 KiB
If you try that and observe iostat in dom0, you'll see that the average request size
often drops below 44KiB (88 sectors). To keep it consistent I decided to use multiples of
that number. The first "whole" number in the order of mebibytes you'll find is
8 rings: 32 * 44 * 1024 * 8 = 11 M.


# dd if=/dev/zero of=/mnt/fileshare/output.zero bs=11M count=300
300+0 records in
300+0 records out
3460300800 bytes (3.5 GB) copied, 73.2253 s, 47.3 MB/s

Dom0 iostat -xm 1:
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00    9.30    0.00    1.16   89.53

Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/savgrq-sz avgqu-sz await svctm %utilsde 1267.44 6526.74 100.00 545.35 5.31 27.24103.28 10.28 14.86 1.75 113.02sdc 1316.28 6348.84 105.81 546.51 5.69 26.94102.45 4.56 7.02 1.37 89.30sdd 1139.53 6269.77 110.47 544.19 5.11 26.8599.99 5.34 8.30 1.35 88.37sdf 1286.05 6436.05 96.51 548.84 5.50 27.29104.06 3.52 5.47 1.34 86.51md2 0.00 0.00 0.00 1572.09 0.00 51.1166.59 0.00 0.00 0.00 0.00

# dd if=/dev/zero of=/mnt/fileshare/output.zero bs=11M count=300oflag=direct

300+0 records in
300+0 records out
3460300800 bytes (3.5 GB) copied, 67.5844 s, 51.2 MB/s

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.14    0.00   12.50    0.00    0.00   86.36

Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/savgrq-sz avgqu-sz await svctm %utilsde 1189.77 7005.68 89.77 478.41 5.00 29.35123.81 5.62 9.90 1.72 97.50sdc 1040.91 6852.27 77.27 480.68 4.37 28.76121.60 2.84 5.09 1.11 61.93sdd 911.36 7057.95 67.05 477.27 3.66 29.46124.63 4.31 7.51 1.51 82.39sdf 1185.23 7227.27 72.73 495.45 4.90 30.28126.80 7.79 13.73 1.82 103.64md2 0.00 0.00 0.00 1727.27 0.00 56.5467.04 0.00 0.00 0.00 0.00

It seems that using a block size on dd actually reduces the total writespeed.


Compare this to using 1M block sizes:

# dd if=/dev/zero of=/mnt/fileshare/output.zero bs=1M count=3500oflag=direct

3500+0 records in
3500+0 records out
3670016000 bytes (3.7 GB) copied, 27.6453 s, 133 MB/s

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00   27.78    0.00    0.00   72.22

Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/savgrq-sz avgqu-sz await svctm %utilsde 0.00 19040.00 0.00 303.33 0.00 75.83512.00 1.28 4.22 2.81 85.22sdc 0.00 19040.00 0.00 302.22 0.00 75.56512.00 0.50 1.67 1.10 33.33sdd 0.00 19040.00 0.00 302.22 0.00 75.56512.00 0.50 1.66 1.10 33.33sdf 0.00 19040.00 0.00 303.33 0.00 75.83512.00 1.28 4.22 2.81 85.22md2 0.00 0.00 0.00 4684.44 0.00 151.1166.06 0.00 0.00 0.00 0.00


# dd if=/dev/zero of=/mnt/fileshare/output.zero bs=1M count=3500
3500+0 records in
3500+0 records out
3670016000 bytes (3.7 GB) copied, 81.3315 s, 45.1 MB/s

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00   10.47    0.00    0.00   89.53

Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/savgrq-sz avgqu-sz await svctm %utilsde 1116.28 5995.35 101.16 555.81 5.02 25.8496.20 5.99 9.31 1.52 100.00sdc 1209.30 6019.77 111.63 536.05 5.19 25.0795.70 5.66 8.58 1.45 93.95sdd 1236.05 5967.44 97.67 546.51 5.08 25.5497.34 8.16 12.67 1.60 103.02sdf 1203.49 5939.53 97.67 554.65 5.12 25.6396.54 4.76 7.48 1.44 93.95md2 0.00 0.00 0.00 1486.05 0.00 48.3166.59 0.00 0.00 0.00 0.00

Out of interest, trying 40k block sizes seems to show roughly the sameresults as the 11M tests... Maybe this is not related - but I'm notquite sure.

Verify if that gives you more consistency in your results.

Last but not least, you should definitely try two other things:
1) oprofile.
2) Roger's kernel with persistent grants and indirect I/O reqs (see below).

Using oprofile you should be able to get more directions/ideas in the sense of 
where your throughput is getting killed.

I'll have to look at how to use oprofile to give you useful informationfrom it... I assume you mean running this on the Dom0?

Finally, high throughput is usually achieved with large bandwidth and low 
latency. Persistent granting should help a lot with reducing latency. The 
indirect I/O operations that have recently been discussed in this list should 
help you to achieve larger bandwidth. This should be particularly beneficial in 
arrays such as yours, where the stripe size is larger than 88K.

Link to Roger's blog post and kernel code:
http://blog.xen.org/index.php/2012/11/23/improving-block-protocol-scalability-with-persistent-grants/#more-5711
http://xenbits.xen.org/gitweb/?p=people/royger/linux.git;a=summary 
(xen-block-indirect branch)

I'll do some research into this - if someone has a binary RPM packagealready built from this, I might be able to get results to you muchsooner...

I am very interested in your results, please share them.

Cheers,
Felipe


-----Original Message-----
From: Steven Haigh [mailto:netwiz@xxxxxxxxx]
Sent: 09 April 2013 07:54
To: Felipe Franciosi
Cc: xen-devel@xxxxxxxxxxxxx
Subject: Re: Xen disk write slowness in kernel 3.8.x

Resending this - hopefully for comment...

On 04/04/13 01:30, Felipe Franciosi wrote:

I think you should use i/oflag=direct on your "dd"s to bypass the buffer cache 
in the guest. Might provide more consistent results.


Ah - you are correct. I usually don't forget that. This has however discovered 
something interesting...

Entirely from the DomU:
# dd if=/dev/zero of=/mnt/fileshare/output.zero bs=1M count=4096 oflag=direct
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 45.2753 s, 94.9 MB/s # rm 
/mnt/fileshare/output.zero # dd if=/dev/zero of=/mnt/fileshare/output.zero 
bs=1M count=4096
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 80.5808 s, 53.3 MB/s

I did this 3-4 times with the same trend...

# dd if=/dev/zero of=/mnt/fileshare/output.zero bs=1M count=4096 oflag=direct
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 40.693 s, 106 MB/s # rm 
/mnt/fileshare/output.zero # sync # dd if=/dev/zero 
of=/mnt/fileshare/output.zero bs=1M count=4096
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 82.1511 s, 52.3 MB/s

So, when the DomU is using write buffers, write speed is halved. This I don't 
understand.

I haven't tweaked the array at all yet (as I've just been experimenting with 
this problem) - so with a bit of time, I can probably even increase its write 
speed some more - but without this speed decrease solved, it isn't really worth 
it yet...

The result of oprofile is intriguing. Did you have a chance to try Roger's 
persistent grant implementation?

He mentions it on his 08/03 e-mail. It's here:
git://xenbits.xen.org/people/royger/linux.git xen-block-indirect


I can't say I have - Although I package Xen for EL6, we're getting to the edge 
of my knowledge here.


Felipe

-----Original Message-----
From: Steven Haigh [mailto:netwiz@xxxxxxxxx]
Sent: 03 April 2013 14:27
To: xen-devel@xxxxxxxxxxxxx
Cc: Felipe Franciosi
Subject: Re: Xen disk write slowness in kernel 3.8.x

On 03/04/13 23:29, Felipe Franciosi wrote:

Do you know the size of your requests?
You used "iostat -m". Perhaps "iostat -xm" could be more meaningful as it will 
tell you the average request size in sectors.


Good call. I started just a sequential write with dd from /dev/zero on the DomU:
# dd if=/dev/zero of=/mnt/fileshare/output.zero bs=1M count=4096
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 91.269 s, 47.1 MB/s

iostat -xm shows:
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
              0.11    0.00    2.71    0.23    0.56   96.38

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s
avgrq-sz avgqu-sz   await  svctm  %util
sdc            1208.40  5337.80  150.00  645.20     5.35    23.39
74.00     6.47    8.15   0.91  72.72
sdf            1199.80  5350.80  148.00  643.40     5.30    23.34
74.12     6.88    8.68   0.96  76.06
sdd            1203.40  5304.60  148.80  638.60     5.28    23.21
74.11     5.42    6.88   0.78  61.38
sde            1213.80  5382.40  148.40  652.40     5.37    23.59
74.08     6.40    8.00   0.94  75.20
md3               0.00     0.00    1.60 1312.20     0.05    42.68
66.60     0.00    0.00   0.00   0.00

Shutting down the DomU, mounting /dev/md3 in and doing this directly from the 
Dom0 shows:
# dd if=/dev/zero of=/mnt/fileshare/output.zero bs=1M count=4096
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 44.2801 s, 97.0 MB/s

details from iostat -xm:
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
              0.00    0.00    2.16    0.00    0.57   97.27

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s
avgrq-sz avgqu-sz   await  svctm  %util
sdc            1175.40  5220.60  143.40  633.60     5.13    22.85
73.76     6.01    7.73   0.87  67.60
sdf            1175.20  5154.00  147.40  624.40     5.14    22.55
73.46     6.87    8.88   0.97  74.90
sdd            1183.40  5133.80  145.40  625.20     5.19    22.50
73.60     5.19    6.73   0.77  59.60
sde            1176.40  5229.60  146.00  637.00     5.16    22.99
73.62     7.39    9.51   0.99  77.90
md3               0.00     0.00    0.60 1277.40     0.02    41.56
66.63     0.00    0.00   0.00   0.00

This about ties in with what bonnie++ gets as the write speed:
Version  1.96       ------Sequential Output------ --Sequential Input-
--Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block--
--Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP
/sec %CP
xenhost.lan.crc. 2G   635  91 97526  15 67864  16   952  97 295833  35
409.6  10

bonnie++ maxes out at about the same as dd in the DomU.

I still didn't understand your environment. I think this is the first time you 
mention NFS/SMB.


/dev/md3 is passed through as a phy device to a DomU. This DomU runs both samba 
and NFS shares. It doesn't seem to matter what method is the source of the 
writes to the drive (dd, bonnie++, sending a file via NFS or Samba), the max 
write speed seems to be bottlenecking at ~50Mb/sec - which I believe is purely 
the write speed to the array via the DomU.

Weren't you testing copies from /dev/zero to your device?


I have tested many ways - this being just one of them.

I'd also recommend you run "oprofile" to see if there are any time sinks that 
we are overlooking. You can find out how to use it here:
http://wiki.xen.org/wiki/Xen_Profiling:_oprofile_and_perf


I ran 'perf top' on the Dom0, then started a write (dd if=/dev/zero 
of=/mnt/fileshare/output.zero bs=1M count=4096) on the DomU. This is what I'd 
see as the relevant output:

Samples: 574K of event 'cpu-clock', Event count (approx.): 60028
    88.37%  [kernel]             [k] xen_hypercall_sched_op
     4.73%  [kernel]             [k] xen_hypercall_xen_version
     1.00%  [kernel]             [k] xen_hypercall_grant_table_op
     0.99%  [raid456]            [k] handle_stripe
     0.77%  [raid6_pq]           [k] raid6_sse24_gen_syndrome
     0.66%  [raid456]            [k] ops_run_io
     0.57%  [kernel]             [k] memcpy
     0.21%  [kernel]             [k] xen_restore_fl_direct
     0.18%  [raid456]            [k] raid5_end_write_request
     0.18%  [raid456]            [k] __raid_run_ops
     0.14%  [kernel]             [k] xen_hypercall_event_channel_op
     0.11%  [kernel]             [k] get_phys_to_machine
     0.09%  [raid456]            [k] schedule_reconstruction

Now repeated, with the same on the DomU:
Samples: 300K of event 'cpu-clock', Event count (approx.): 84845

    97.63%  [kernel]            [k] hypercall_page
     0.44%  [kernel]            [k] copy_user_generic_string
     0.41%  [kernel]            [k] __clear_user
     0.08%  [kernel]            [k] __wake_up_bit
     0.07%  [kernel]            [k] xen_restore_fl_direct
     0.07%  [kernel]            [k] __mem_cgroup_commit_charge

Now, interestingly enough... The root drive (xvda) of the DomU in question is 
on a separate RAID1. This is passed as an LV.

# dd if=/dev/zero of=/output.bin bs=1M count=4096
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 69.7036 s, 61.6 MB/s

Then from the Dom0 to the same RAID1:
# dd if=/dev/zero of=output.zero bs=1M count=4096
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 48.4407 s, 88.7 MB/s

I don't really know what else I can do to try and see where the
slowdown is here - I am open to suggestions though :)

--
Steven Haigh

Email: netwiz@xxxxxxxxx
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897
Fax: (03) 8338 0299



--
Steven Haigh

Email: netwiz@xxxxxxxxx
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897
Fax: (03) 8338 0299



--
Steven Haigh

Email: netwiz@xxxxxxxxx
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897
Fax: (03) 8338 0299

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel

Follow-Ups:
- Re: [Xen-devel] Xen disk write slowness in kernel 3.8.x
  - From: Felipe Franciosi

References:
- [Xen-devel] Xen disk write slowness in kernel 3.8.x
  - From: Steven Haigh
- Re: [Xen-devel] Xen disk write slowness in kernel 3.8.x
  - From: Felipe Franciosi
- Re: [Xen-devel] Xen disk write slowness in kernel 3.8.x
  - From: Steven Haigh
- Re: [Xen-devel] Xen disk write slowness in kernel 3.8.x
  - From: Felipe Franciosi
- Re: [Xen-devel] Xen disk write slowness in kernel 3.8.x
  - From: Steven Haigh
- Re: [Xen-devel] Xen disk write slowness in kernel 3.8.x
  - From: Felipe Franciosi
- Re: [Xen-devel] Xen disk write slowness in kernel 3.8.x
  - From: Steven Haigh
- Re: [Xen-devel] Xen disk write slowness in kernel 3.8.x
  - From: Felipe Franciosi

Prev by Date: [Xen-devel] [xen-unstable test] 17710: tolerable FAIL
Next by Date: [Xen-devel] Attempting to checksum a non-TCP/UDP packet, dropping a protocol 1 packet
Previous by thread: Re: [Xen-devel] Xen disk write slowness in kernel 3.8.x
Next by thread: Re: [Xen-devel] Xen disk write slowness in kernel 3.8.x
Index(es):
- Date
- Thread

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.