Xen project Mailing List

Re: [Xen-devel] Xen disk write slowness in kernel 3.8.x

To: 'Steven Haigh' <netwiz@xxxxxxxxx>

From: Felipe Franciosi <felipe.franciosi@xxxxxxxxxx>

Date: Thu, 18 Apr 2013 13:10:32 +0000

Accept-language: en-GB, en-US

Cc: "xen-devel@xxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxx>

Delivery-date: Thu, 18 Apr 2013 13:10:53 +0000

List-id: Xen developer discussion <xen-devel.lists.xen.org>

Thread-index: AQHOMG7wf8NU0enKK06IUXPv8CFxR5jEjZeggAjerQCAC4ruoIABGESAgAH6m3A=

Thread-topic: Xen disk write slowness in kernel 3.8.x

Hi Steven, I'm a bit surprised with the results of your "dd" tests using bs=11M. I'd expect to see a constant avgrq-sz of 88 (44K). You seem to be experiencing a varying number near 66. I noticed your "dd" was done on top of a mount point. Could you repeat that test on top of the raw /dev/xvd* device? And yes, oprofile should be ran on dom0. If you provide the kernel symbols for your domU, it will be able to give you detailed information on that as well. I'm afraid I don't have RPMs for Roger's kernel tree. It shouldn't be too hard to clone and build it, though. Let me know if you have problems with that. Cheers, Felipe -----Original Message----- From: Steven Haigh [mailto:netwiz@xxxxxxxxx] Sent: 17 April 2013 08:53 To: Felipe Franciosi Cc: xen-devel@xxxxxxxxxxxxx Subject: Re: Xen disk write slowness in kernel 3.8.x On 04/17/2013 12:39 AM, Felipe Franciosi wrote: > Hi Steven, sorry for the delay (doing some training followed by holidays). > > Using "direct" from the guest does speed things up. I'm not entirely sure > why. Especially since this is also true (at least with the guests I tested) > in XCP/XenServer where the datapath includes a blktap2/tapdisk model where > the virtual disks are always opened with O_DIRECT and the only valid > operations are READ or WRITE (i.e. there is no special handling for guest > operations such as barriers, etc). > > Have a look at http://support.citrix.com/article/CTX136861 for an > understanding of how things work in XCP 1.6 / XenServer 6.1.0. > > My guess is that "direct" allows you to control the exact size of the > requests (creating a consistent data flow). If you omit it and shove all your > requests through the guest's buffer cache, what you get in the backend will > vary largely in terms of request sizes. This will particularly affect the > blkback/blkfront ring utilisation. This hypothesis can hopefully be supported > by running iostat in dom0 (e.g. iostat -xm 1) and observing the request > sizes. Let me know what you see in your environment. If anyone else in the > list have other ideas on this I'd like to hear them as this has been puzzling > me too. > > Another thing you can try is to use "bs=11M" along i/oflag=direct. The "11M" > magic number comes from the ring size: with single-page rings, you can have > up to 32 requests of 44KiB each in the ring. > 32 * 44KiB = 1408 KiB > If you try that and observe iostat in dom0, you'll see that the average > request size often drops below 44KiB (88 sectors). To keep it consistent I > decided to use multiples of that number. The first "whole" number in the > order of mebibytes you'll find is 8 rings: 32 * 44 * 1024 * 8 = 11 M. # dd if=/dev/zero of=/mnt/fileshare/output.zero bs=11M count=300 300+0 records in 300+0 records out 3460300800 bytes (3.5 GB) copied, 73.2253 s, 47.3 MB/s Dom0 iostat -xm 1: avg-cpu: %user %nice %system %iowait %steal %idle 0.00 0.00 9.30 0.00 1.16 89.53 Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util sde 1267.44 6526.74 100.00 545.35 5.31 27.24 103.28 10.28 14.86 1.75 113.02 sdc 1316.28 6348.84 105.81 546.51 5.69 26.94 102.45 4.56 7.02 1.37 89.30 sdd 1139.53 6269.77 110.47 544.19 5.11 26.85 99.99 5.34 8.30 1.35 88.37 sdf 1286.05 6436.05 96.51 548.84 5.50 27.29 104.06 3.52 5.47 1.34 86.51 md2 0.00 0.00 0.00 1572.09 0.00 51.11 66.59 0.00 0.00 0.00 0.00 # dd if=/dev/zero of=/mnt/fileshare/output.zero bs=11M count=300 oflag=direct 300+0 records in 300+0 records out 3460300800 bytes (3.5 GB) copied, 67.5844 s, 51.2 MB/s avg-cpu: %user %nice %system %iowait %steal %idle 1.14 0.00 12.50 0.00 0.00 86.36 Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util sde 1189.77 7005.68 89.77 478.41 5.00 29.35 123.81 5.62 9.90 1.72 97.50 sdc 1040.91 6852.27 77.27 480.68 4.37 28.76 121.60 2.84 5.09 1.11 61.93 sdd 911.36 7057.95 67.05 477.27 3.66 29.46 124.63 4.31 7.51 1.51 82.39 sdf 1185.23 7227.27 72.73 495.45 4.90 30.28 126.80 7.79 13.73 1.82 103.64 md2 0.00 0.00 0.00 1727.27 0.00 56.54 67.04 0.00 0.00 0.00 0.00 It seems that using a block size on dd actually reduces the total write speed. Compare this to using 1M block sizes: # dd if=/dev/zero of=/mnt/fileshare/output.zero bs=1M count=3500 oflag=direct 3500+0 records in 3500+0 records out 3670016000 bytes (3.7 GB) copied, 27.6453 s, 133 MB/s avg-cpu: %user %nice %system %iowait %steal %idle 0.00 0.00 27.78 0.00 0.00 72.22 Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util sde 0.00 19040.00 0.00 303.33 0.00 75.83 512.00 1.28 4.22 2.81 85.22 sdc 0.00 19040.00 0.00 302.22 0.00 75.56 512.00 0.50 1.67 1.10 33.33 sdd 0.00 19040.00 0.00 302.22 0.00 75.56 512.00 0.50 1.66 1.10 33.33 sdf 0.00 19040.00 0.00 303.33 0.00 75.83 512.00 1.28 4.22 2.81 85.22 md2 0.00 0.00 0.00 4684.44 0.00 151.11 66.06 0.00 0.00 0.00 0.00 # dd if=/dev/zero of=/mnt/fileshare/output.zero bs=1M count=3500 3500+0 records in 3500+0 records out 3670016000 bytes (3.7 GB) copied, 81.3315 s, 45.1 MB/s avg-cpu: %user %nice %system %iowait %steal %idle 0.00 0.00 10.47 0.00 0.00 89.53 Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util sde 1116.28 5995.35 101.16 555.81 5.02 25.84 96.20 5.99 9.31 1.52 100.00 sdc 1209.30 6019.77 111.63 536.05 5.19 25.07 95.70 5.66 8.58 1.45 93.95 sdd 1236.05 5967.44 97.67 546.51 5.08 25.54 97.34 8.16 12.67 1.60 103.02 sdf 1203.49 5939.53 97.67 554.65 5.12 25.63 96.54 4.76 7.48 1.44 93.95 md2 0.00 0.00 0.00 1486.05 0.00 48.31 66.59 0.00 0.00 0.00 0.00 Out of interest, trying 40k block sizes seems to show roughly the same results as the 11M tests... Maybe this is not related - but I'm not quite sure. > Verify if that gives you more consistency in your results. > > Last but not least, you should definitely try two other things: > 1) oprofile. > 2) Roger's kernel with persistent grants and indirect I/O reqs (see below). > > Using oprofile you should be able to get more directions/ideas in the sense > of where your throughput is getting killed. I'll have to look at how to use oprofile to give you useful information from it... I assume you mean running this on the Dom0? > Finally, high throughput is usually achieved with large bandwidth and low > latency. Persistent granting should help a lot with reducing latency. The > indirect I/O operations that have recently been discussed in this list should > help you to achieve larger bandwidth. This should be particularly beneficial > in arrays such as yours, where the stripe size is larger than 88K. > > Link to Roger's blog post and kernel code: > http://blog.xen.org/index.php/2012/11/23/improving-block-protocol-scal > ability-with-persistent-grants/#more-5711 > http://xenbits.xen.org/gitweb/?p=people/royger/linux.git;a=summary > (xen-block-indirect branch) I'll do some research into this - if someone has a binary RPM package already built from this, I might be able to get results to you much sooner... > I am very interested in your results, please share them. > > Cheers, > Felipe > > > -----Original Message----- > From: Steven Haigh [mailto:netwiz@xxxxxxxxx] > Sent: 09 April 2013 07:54 > To: Felipe Franciosi > Cc: xen-devel@xxxxxxxxxxxxx > Subject: Re: Xen disk write slowness in kernel 3.8.x > > Resending this - hopefully for comment... > > On 04/04/13 01:30, Felipe Franciosi wrote: >> I think you should use i/oflag=direct on your "dd"s to bypass the buffer >> cache in the guest. Might provide more consistent results. > > Ah - you are correct. I usually don't forget that. This has however > discovered something interesting... > > Entirely from the DomU: > # dd if=/dev/zero of=/mnt/fileshare/output.zero bs=1M count=4096 > oflag=direct > 4096+0 records in > 4096+0 records out > 4294967296 bytes (4.3 GB) copied, 45.2753 s, 94.9 MB/s # rm > /mnt/fileshare/output.zero # dd if=/dev/zero > of=/mnt/fileshare/output.zero bs=1M count=4096 > 4096+0 records in > 4096+0 records out > 4294967296 bytes (4.3 GB) copied, 80.5808 s, 53.3 MB/s > > I did this 3-4 times with the same trend... > > # dd if=/dev/zero of=/mnt/fileshare/output.zero bs=1M count=4096 > oflag=direct > 4096+0 records in > 4096+0 records out > 4294967296 bytes (4.3 GB) copied, 40.693 s, 106 MB/s # rm > /mnt/fileshare/output.zero # sync # dd if=/dev/zero > of=/mnt/fileshare/output.zero bs=1M count=4096 > 4096+0 records in > 4096+0 records out > 4294967296 bytes (4.3 GB) copied, 82.1511 s, 52.3 MB/s > > So, when the DomU is using write buffers, write speed is halved. This I don't > understand. > > I haven't tweaked the array at all yet (as I've just been experimenting with > this problem) - so with a bit of time, I can probably even increase its write > speed some more - but without this speed decrease solved, it isn't really > worth it yet... > >> The result of oprofile is intriguing. Did you have a chance to try Roger's >> persistent grant implementation? >> >> He mentions it on his 08/03 e-mail. It's here: >> git://xenbits.xen.org/people/royger/linux.git xen-block-indirect > > I can't say I have - Although I package Xen for EL6, we're getting to the > edge of my knowledge here. > >> >> Felipe >> >> -----Original Message----- >> From: Steven Haigh [mailto:netwiz@xxxxxxxxx] >> Sent: 03 April 2013 14:27 >> To: xen-devel@xxxxxxxxxxxxx >> Cc: Felipe Franciosi >> Subject: Re: Xen disk write slowness in kernel 3.8.x >> >> On 03/04/13 23:29, Felipe Franciosi wrote: >>> Do you know the size of your requests? >>> You used "iostat -m". Perhaps "iostat -xm" could be more meaningful as it >>> will tell you the average request size in sectors. >> >> Good call. I started just a sequential write with dd from /dev/zero on the >> DomU: >> # dd if=/dev/zero of=/mnt/fileshare/output.zero bs=1M count=4096 >> 4096+0 records in >> 4096+0 records out >> 4294967296 bytes (4.3 GB) copied, 91.269 s, 47.1 MB/s >> >> iostat -xm shows: >> avg-cpu: %user %nice %system %iowait %steal %idle >> 0.11 0.00 2.71 0.23 0.56 96.38 >> >> Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s >> avgrq-sz avgqu-sz await svctm %util >> sdc 1208.40 5337.80 150.00 645.20 5.35 23.39 >> 74.00 6.47 8.15 0.91 72.72 >> sdf 1199.80 5350.80 148.00 643.40 5.30 23.34 >> 74.12 6.88 8.68 0.96 76.06 >> sdd 1203.40 5304.60 148.80 638.60 5.28 23.21 >> 74.11 5.42 6.88 0.78 61.38 >> sde 1213.80 5382.40 148.40 652.40 5.37 23.59 >> 74.08 6.40 8.00 0.94 75.20 >> md3 0.00 0.00 1.60 1312.20 0.05 42.68 >> 66.60 0.00 0.00 0.00 0.00 >> >> Shutting down the DomU, mounting /dev/md3 in and doing this directly from >> the Dom0 shows: >> # dd if=/dev/zero of=/mnt/fileshare/output.zero bs=1M count=4096 >> 4096+0 records in >> 4096+0 records out >> 4294967296 bytes (4.3 GB) copied, 44.2801 s, 97.0 MB/s >> >> details from iostat -xm: >> avg-cpu: %user %nice %system %iowait %steal %idle >> 0.00 0.00 2.16 0.00 0.57 97.27 >> >> Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s >> avgrq-sz avgqu-sz await svctm %util >> sdc 1175.40 5220.60 143.40 633.60 5.13 22.85 >> 73.76 6.01 7.73 0.87 67.60 >> sdf 1175.20 5154.00 147.40 624.40 5.14 22.55 >> 73.46 6.87 8.88 0.97 74.90 >> sdd 1183.40 5133.80 145.40 625.20 5.19 22.50 >> 73.60 5.19 6.73 0.77 59.60 >> sde 1176.40 5229.60 146.00 637.00 5.16 22.99 >> 73.62 7.39 9.51 0.99 77.90 >> md3 0.00 0.00 0.60 1277.40 0.02 41.56 >> 66.63 0.00 0.00 0.00 0.00 >> >> This about ties in with what bonnie++ gets as the write speed: >> Version 1.96 ------Sequential Output------ --Sequential Input- >> --Random- >> Concurrency 1 -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- >> --Seeks-- >> Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP >> /sec %CP >> xenhost.lan.crc. 2G 635 91 97526 15 67864 16 952 97 295833 35 >> 409.6 10 >> >> bonnie++ maxes out at about the same as dd in the DomU. >> >>> I still didn't understand your environment. I think this is the first time >>> you mention NFS/SMB. >> >> /dev/md3 is passed through as a phy device to a DomU. This DomU runs both >> samba and NFS shares. It doesn't seem to matter what method is the source of >> the writes to the drive (dd, bonnie++, sending a file via NFS or Samba), the >> max write speed seems to be bottlenecking at ~50Mb/sec - which I believe is >> purely the write speed to the array via the DomU. >> >>> Weren't you testing copies from /dev/zero to your device? >> >> I have tested many ways - this being just one of them. >> >>> I'd also recommend you run "oprofile" to see if there are any time sinks >>> that we are overlooking. You can find out how to use it here: >>> http://wiki.xen.org/wiki/Xen_Profiling:_oprofile_and_perf >> >> I ran 'perf top' on the Dom0, then started a write (dd if=/dev/zero >> of=/mnt/fileshare/output.zero bs=1M count=4096) on the DomU. This is what >> I'd see as the relevant output: >> >> Samples: 574K of event 'cpu-clock', Event count (approx.): 60028 >> 88.37% [kernel] [k] xen_hypercall_sched_op >> 4.73% [kernel] [k] xen_hypercall_xen_version >> 1.00% [kernel] [k] xen_hypercall_grant_table_op >> 0.99% [raid456] [k] handle_stripe >> 0.77% [raid6_pq] [k] raid6_sse24_gen_syndrome >> 0.66% [raid456] [k] ops_run_io >> 0.57% [kernel] [k] memcpy >> 0.21% [kernel] [k] xen_restore_fl_direct >> 0.18% [raid456] [k] raid5_end_write_request >> 0.18% [raid456] [k] __raid_run_ops >> 0.14% [kernel] [k] xen_hypercall_event_channel_op >> 0.11% [kernel] [k] get_phys_to_machine >> 0.09% [raid456] [k] schedule_reconstruction >> >> Now repeated, with the same on the DomU: >> Samples: 300K of event 'cpu-clock', Event count (approx.): 84845 >> >> 97.63% [kernel] [k] hypercall_page >> 0.44% [kernel] [k] copy_user_generic_string >> 0.41% [kernel] [k] __clear_user >> 0.08% [kernel] [k] __wake_up_bit >> 0.07% [kernel] [k] xen_restore_fl_direct >> 0.07% [kernel] [k] __mem_cgroup_commit_charge >> >> Now, interestingly enough... The root drive (xvda) of the DomU in question >> is on a separate RAID1. This is passed as an LV. >> >> # dd if=/dev/zero of=/output.bin bs=1M count=4096 >> 4096+0 records in >> 4096+0 records out >> 4294967296 bytes (4.3 GB) copied, 69.7036 s, 61.6 MB/s >> >> Then from the Dom0 to the same RAID1: >> # dd if=/dev/zero of=output.zero bs=1M count=4096 >> 4096+0 records in >> 4096+0 records out >> 4294967296 bytes (4.3 GB) copied, 48.4407 s, 88.7 MB/s >> >> I don't really know what else I can do to try and see where the >> slowdown is here - I am open to suggestions though :) >> >> -- >> Steven Haigh >> >> Email: netwiz@xxxxxxxxx >> Web: https://www.crc.id.au >> Phone: (03) 9001 6090 - 0412 935 897 >> Fax: (03) 8338 0299 >> > > > -- > Steven Haigh > > Email: netwiz@xxxxxxxxx > Web: https://www.crc.id.au > Phone: (03) 9001 6090 - 0412 935 897 > Fax: (03) 8338 0299 > -- Steven Haigh Email: netwiz@xxxxxxxxx Web: https://www.crc.id.au Phone: (03) 9001 6090 - 0412 935 897 Fax: (03) 8338 0299 _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxx http://lists.xen.org/xen-devel

©2013 Xen Project, A Linux Foundation Collaborative Project. All Rights Reserved.
Linux Foundation is a registered trademark of The Linux Foundation.
Xen Project is a trademark of The Linux Foundation.