xen-users

[Xen-users] disk I/O problems under load? (Xen-3.4.1/x86_64)

To: xen-users@xxxxxxxxxxxxxxxxxxx
Subject: [Xen-users] disk I/O problems under load? (Xen-3.4.1/x86_64)
From: Luca Lesinigo <luca@xxxxxxxxxxxxx>
Date: Thu, 1 Oct 2009 01:33:57 +0200
Delivery-date: Wed, 30 Sep 2009 16:34:43 -0700
Envelope-to: www-data@xxxxxxxxxxxxxxxxxxx
List-help: <mailto:xen-users-request@lists.xensource.com?subject=help>
List-id: Xen user discussion <xen-users.lists.xensource.com>
List-post: <mailto:xen-users@lists.xensource.com>
List-subscribe: <http://lists.xensource.com/mailman/listinfo/xen-users>, <mailto:xen-users-request@lists.xensource.com?subject=subscribe>
List-unsubscribe: <http://lists.xensource.com/mailman/listinfo/xen-users>, <mailto:xen-users-request@lists.xensource.com?subject=unsubscribe>
Sender: xen-users-bounces@xxxxxxxxxxxxxxxxxxx
I'm getting problems whenever the load on the system increases, but IMHO it should be well within hardware capabilities.

My configuration:
- HP Proliant DL160G5, with a single quad-core E5405, 14GiB RAM, 2x 1TB SATA disks (Hitachi 7K1000.B) on the onboard SATA controller (Intel chipset)
- Xen-3.4.1 64bit hypervisor, compiled from gentoo portage, with default command-line settings (I just specify the serial console and nothing else)
- Domain-0 with gentoo's xen-sources 2.6.21 (the xen 2.6.18 tarball didn't have networking; I think the HP Tigon3 gigabit driver is too old, but I haven't had time to look into that)
- Domain-0 is using the CFQ I/O scheduler and works from a software RAID-1, no tickless kernel, HZ=100. It has all the free RAM (currently some 5.x GiB)
- the rest of the disks is also mirrored in a RAID-1 device, and I use LVM2 on top of that
- 6x paravirt 64bit DomU with 2.6.29-gentoo-r5 kernel, with NOOP I/O scheduler, tickless kernel, 1 - 1.5GiB of RAM each
- 1x HVM 32bit Windows XP DomU, without any paravirt driver, 512MiB RAM
- I use logical volumes as storage space for the DomU's; the linux ones also have 0.5GiB of swap space (unused, no DomU is swapping)
- all the linux DomU's are on ext3 (noatime), and all DomU's are single-cpu (just one vcpu each)
- network is bridged (one lan and one wan interface on the physical system and the same for each domU), no jumbo frames (a rough sketch of one PV DomU config follows this list)
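
For reference, each PV domU is defined with something roughly like the following (volume, bridge and kernel names here are placeholders, not my exact ones):

    # /etc/xen/domu1.cfg -- sketch only, names and sizes are examples
    name    = "domu1"
    memory  = 1024
    vcpus   = 1
    kernel  = "/boot/vmlinuz-2.6.29-gentoo-r5"
    root    = "/dev/xvda1 ro"
    extra   = "elevator=noop"
    disk    = [ 'phy:/dev/vg0/domu1-root,xvda1,w',
                'phy:/dev/vg0/domu1-swap,xvda2,w' ]
    vif     = [ 'bridge=xenbr0', 'bridge=xenbr1' ]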

Usually load on the system is very low. But when there is some I/O-related load (I can easily trigger it by rsync'ing lots of stuff between domU's, or from a different system to one of the domU's or to the dom0), load gets very high and I often see domU's spending all their cpu time in "wait" [for I/O] state. When that happens, load on Domain-0 gets high (jumps from <1 to >5) and loads on the DomU's get high too, probably because of processes waiting for I/O to happen. Sometimes iostat will even show exactly 0.00 tps on all the dm-X devices (domU storage backends) and some activity on the physical devices, as if all domU I/O activity froze up while dom0 was busy flushing caches or doing some other stuff.
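
For what it's worth, this is roughly how I watch it while triggering the load, nothing exotic, just the standard tools:

    ls -l /dev/mapper    # map the dm-X names back to the LVs backing each domU
    iostat -x 2          # per-device view: dm-* backends vs sda/sdb and the md arrays
    vmstat 2             # overall iowait ('wa') and context switches ('cs') in dom0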

vmstat in Dom0 shows one or two cores (25% or 50% cpu time) busy in 'iowait' state, and context switches go into the thousands, but not into the hundreds of thousands that http://wiki.xensource.com/xenwiki/KnownIssues talks about.
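
(Those percentages are what I read off vmstat's aggregate 'wa' column across dom0's four vcpus; if a per-vcpu breakdown would help, I can collect it with something like sysstat's mpstat:)

    mpstat -P ALL 2    # per-cpu %iowait in dom0, 2-second interval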

I tried pinning CPUs: Domain-0 had its four VCPUs pinned to CPUs 0 and 1, some domU's pinned to CPU 2, and some domU's pinned to CPU 3. As far as I can tell it did not make any difference. I also (briefly) tested with all linux DomU's running the CFQ scheduler; while it didn't seem to make any difference either, it also was too short a test to trust it much.
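
For completeness, the pinning was along these lines (the domU names here are just examples):

    # dom0: restrict its four vcpus to physical CPUs 0-1
    xm vcpu-pin Domain-0 0 0-1
    xm vcpu-pin Domain-0 1 0-1
    xm vcpu-pin Domain-0 2 0-1
    xm vcpu-pin Domain-0 3 0-1
    # domU's: single vcpu each, split between CPUs 2 and 3
    xm vcpu-pin domu1 0 2
    xm vcpu-pin domu4 0 3
    xm vcpu-list           # verify the CPU Affinity column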

What's worse, sometimes I get qemu-dm processes (for the HVM domU) in zombie state. It also happened that the HVM domU crashed and I wasn't able to restart it: I got the "hotplug scripts not working" error from xm create, and looking at xenstore-ls I saw instances of the crashed domU with all its resources (which probably was the cause of the error?). I had to reboot the whole system to be able to start that domain again.
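
In case it matters, this is where the leftovers showed up (I did not try removing the stale nodes by hand, though I assume something like xenstore-rm on the dead domain's path would be the manual cleanup):

    xm list                      # check whether the crashed domU is still listed
    xenstore-ls /local/domain    # this is where I saw the stale entries for it
    # presumably: xenstore-rm /local/domain/<old domid>   (untested, just a guess)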

Normally iostat in Domain-0 shows a more or less high tps (200~300 under normal load, even higher if I play around with rsync to artificially trigger the problems) on the md device where all the DomU's reside, and much less (usually just 10-20% of that value) on the two physical disks sda and sdb that compose the mirror. I guess I see fewer tps there because the scheduler/elevator in Dom0 is doing its job and merging requests before they hit the disks.
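
If it helps, I can also post the output of these quick sanity checks, to rule out a resync or an unexpected elevator on the physical disks:

    cat /proc/mdstat                      # make sure the mirror isn't resyncing during the tests
    cat /sys/block/sda/queue/scheduler    # confirm which elevator is actually active on the physical disks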

I don't know if the load problems and the HVM problem are linked or not, but I also don't know where to look to solve either of them.

Any help would be appreciated, thank you very much. Also, what are the ideal/recommended settings in dom0 and domU regarding I/O schedulers and tickless or not? Is there any reason to leave the hypervisor some extra free RAM, or is it ok to just let xend shrink dom0 when needed and leave free just the minimum? If I sum up the memory (currently) used by the domains, I get 14146MiB; xm info says 14335MiB total_memory and 10MiB free_memory.
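
(If pinning dom0's memory turns out to be the recommended setup, I guess it would be something along these lines; the 2048M figure is just an example and I haven't tried this yet:)

    # grub: fix dom0's memory on the hypervisor command line
    kernel /boot/xen.gz dom0_mem=2048M <existing serial console options>

    # /etc/xen/xend-config.sxp: keep xend from ballooning dom0 below that
    (dom0-min-mem 2048)
    (enable-dom0-ballooning no)    # if this xend version supports the option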

--
Luca Lesinigo

_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-users
