On Wed, May 5, 2010 at 7:02 PM, Andrew Lyon <andrew.lyon@xxxxxxxxx> wrote:
> On Tue, May 4, 2010 at 2:09 PM, Heiko Wundram <modelnine@xxxxxxxxxxxxx> wrote:
>> Hey all!
>>
>> I'm currently in the process of migrating a (Gentoo-based) Xen-server to use
>> Xen 4.0.0 (where I'm using the Xen ebuilds from bugs.gentoo.org), and I'm
>> having severe problems with tapdisk2 (which I wish to use to do I/O
>> prioritizing using CFQ on the LVM-based backing storage of a virtual
>> server).
>>
>> It seems that after a while of heavy I/O in the virtual domain, the
>> communication between the (paravirtualized) DomU and Dom0 (the
>> tapdisk2-process) breaks, in that no more interrupts are delivered to Dom0
>> for I/O requests from the virtual domain, and as such the virtual host
>> "loses" its hard disk (but does not "break" besides not responding). The
>> network front-/backend is not affected by this communication loss, AFAICT.
>>
>> The virtual host can be destroyed by an xm destroy, but the created blktap2
>> interface does not disappear until the next reboot, and cannot be removed by
>> the respective sysfs accesses (rather, a process echoing a 1 into "remove"
>> blocks, too, and is "unkillable", i.e. stays in kernel space). After a blktap2
>> device has entered this broken state, no more hosts can be created by xm
>> create (that blocks, too), and the host system must be rebooted to enter a
>> usable state again.
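
For anyone trying to confirm that a stuck process really is parked in the kernel, here is a minimal sketch. It assumes the blocked process would show up in uninterruptible sleep (state "D" in /proc/<pid>/stat), which is typical for an "unkillable" process but is not confirmed anywhere in this thread. The function names and the sample line are made up for illustration.

```python
def parse_proc_state(stat_line):
    """Extract the one-letter process state from a /proc/<pid>/stat line."""
    # The command name sits in parentheses and may itself contain spaces
    # or ')', so split on the LAST ')' rather than on whitespace.
    rest = stat_line[stat_line.rfind(")") + 1:].split()
    return rest[0]


def is_uninterruptible(pid):
    """True if the given PID is currently in state 'D' (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        return parse_proc_state(f.read()) == "D"


# Hypothetical sample line, shortened; not captured from a real system.
sample = "4242 (sh) D 1 4242 4242 0 -1 4194560 0 0 0 0"
state = parse_proc_state(sample)
```

Calling is_uninterruptible() with the PID of the hung echo (or of the blocked xm create) would tell you whether signals can reach it at all.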
>>
>> I've not been able to provoke this breakage by "normal" I/O (i.e., when the
>> hosts run normally), but I have been able to provoke it by using bonnie,
>> which after a short period of sustained read/write I/O at 120+ MB/s will
>> freeze the blktap2 device.
>>
>> The Dom0 and the DomU kernels that are being used are xen-sources-2.6.32-r1
>> (which are the xen-stable 2.6.32.10 [11?] based OpenSuSE Xen-kernel sources,
>> AFAIK) from the official portage tree; the kernel configuration that's in
>> use is attached.
>>
>> I've tried iommu=off for xen (the mobo doesn't support VT-d anyway, so Xen
>> never turns it on), and I've also looked for any signs of errors appearing
>> when setting verbosity 9 for the blktap2 module and loglvl=all and
>> guest_loglvl=all for Xen, but there are no errors that I've seen so far.
>>
>> Strace-ing the tapdisk2 process reveals that it's blocked on select(), and
>> none of the descriptors it's polling on ever becomes readable (which is
>> the condition that tapdisk2 queries); instead, the call always times out
>> after 600s.
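
The select() behaviour described here can be reproduced in miniature. This is only a sketch of the general pattern, not tapdisk2's code: a pipe stands in for the descriptors tapdisk2 waits on, the helper name is hypothetical, and the timeout is shortened from 600s to 0.1s.

```python
import os
import select


def wait_readable(fds, timeout_s):
    """Return the subset of fds that became readable within timeout_s."""
    readable, _, _ = select.select(fds, [], [], timeout_s)
    return readable


# Stand-in descriptor pair (an assumption for illustration only).
r, w = os.pipe()

# Nothing has been written: select() times out and reports no descriptors,
# which matches what the strace output shows when no event ever arrives.
idle = wait_readable([r], 0.1)

# Once the other side signals, the same call returns promptly.
os.write(w, b"x")
ready = wait_readable([r], 0.1)

os.close(r)
os.close(w)
```

An empty result on every call means the waiter is healthy but is never being notified, which points at the side that should be delivering the events rather than at the loop itself.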
>>
>> Thanks in advance for any hint as to what is causing this, or if there's
>> anything I might try to get things working...
>>
>> PS: I have to boot with acpi=off, as the mobo won't reboot when ACPI is
>> turned on for Dom0 (not even when disabling ACPI reboots), but booting with
>> ACPI enabled doesn't change the fact that blktap2 blocks.
>>
>> --- Heiko.
>>
>>
>>
>> _______________________________________________
>> Xen-users mailing list
>> Xen-users@xxxxxxxxxxxxxxxxxxx
>> http://lists.xensource.com/xen-users
>>
>
> I have had exactly the same problem and ended up going back to tapdisk1.
>
> I was able to replicate the problem using the entire SLE11-SP1 kernel
> source patch set, which shows that the bug exists upstream.
> Unfortunately I am very busy on other projects at the moment, so I did
> not have time to debug it at all.
>
> The SLE11-SP1 tree has been updated since xen-sources-2.6.32-r1. I
> will make an updated set of patches for you to try, but it will take me
> a couple of days.
>
> Andy
>
Hi,
I have uploaded updated 2.6.32 patches and ebuild to
http://code.google.com/p/gentoo-xen-kernel/downloads/list. Note that the
patches should be applied to 2.6.32.13.
They should be added to portage in a few days' time, provided no
problems are found.
Andy