
Re: [Xen-devel] NFS related netback hang



On Fri, Apr 12, 2013 at 1:38 AM, Wei Liu <wei.liu2@xxxxxxxxxx> wrote:
> On Thu, Apr 11, 2013 at 02:55:48PM +0100, G.R. wrote:
>> Hi,
>> I'm suffering from strange NFS related network issue for a while.
>>
>> The issue shows up when copying from dom0 to domU through a NFS mount.
>> After a short while, the transfer suddenly freezes and the domU
>> network simply stops any response. Force mounting the NFS mount
>> generally resolves the freeze. But some times you can really be in
>> bad luck that the trick does not work.
>>
>> Lucky enough, I captured the following log in a recent instance. It
>> appears to be a dead-lock when the netback tries to get some free
>> pages from NFS. I'm not sure if this is the whole story. Any
>> suggestion how to solve the issue?
>>
>
> BTW xen_netbk_alloc_page tries to allocate page from generic page pool.
> It is not specific to NFS.
>
>> Thanks,
>> Timothy
>>
>> Apr 11 21:22:27 gaia kernel: [429242.015643] INFO: task netback/0:2255
>> blocked for more than 120 seconds.
>> Apr 11 21:22:27 gaia kernel: [429242.015665] "echo 0 >
>> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> Apr 11 21:22:27 gaia kernel: [429242.015690] netback/0       D
>> ffff880210213900     0  2255      2 0x00000000
>> Apr 11 21:22:27 gaia kernel: [429242.015693]  ffff8801fee04ea0
>> 0000000000000246 0000000000000000 ffffffff818133f0
>> Apr 11 21:22:27 gaia kernel: [429242.015697]  0000000000013900
>> ffff8801fed87fd8 ffff8801fed87fd8 ffff8801fee04ea0
>> Apr 11 21:22:27 gaia kernel: [429242.015700]  ffff8801fed87488
>> ffff880210213900 ffff8801fee04ea0 ffff8801fed87488
>> Apr 11 21:22:27 gaia kernel: [429242.015703] Call Trace:
>> Apr 11 21:22:27 gaia kernel: [429242.015711]  [<ffffffff810c1bb5>] ?
>> __lock_page+0x66/0x66
>> Apr 11 21:22:27 gaia kernel: [429242.015715]  [<ffffffff814d06cb>] ?
>> io_schedule+0x55/0x6b
>> Apr 11 21:22:27 gaia kernel: [429242.015718]  [<ffffffff810c1bbc>] ?
>> sleep_on_page+0x7/0xc
>> Apr 11 21:22:27 gaia kernel: [429242.015720]  [<ffffffff814cf6c0>] ?
>> __wait_on_bit_lock+0x3c/0x85
>> Apr 11 21:22:27 gaia kernel: [429242.015723]  [<ffffffff810c3f7a>] ?
>> find_get_pages+0xea/0x100
>> Apr 11 21:22:27 gaia kernel: [429242.015726]  [<ffffffff810c1bb0>] ?
>> __lock_page+0x61/0x66
>> Apr 11 21:22:27 gaia kernel: [429242.015729]  [<ffffffff81058364>] ?
>> autoremove_wake_function+0x2a/0x2a
>> Apr 11 21:22:27 gaia kernel: [429242.015732]  [<ffffffff810cd110>] ?
>> truncate_inode_pages_range+0x28b/0x2f8
>> Apr 11 21:22:27 gaia kernel: [429242.015737]  [<ffffffff811c91d2>] ?
>> nfs_evict_inode+0x12/0x23
>> Apr 11 21:22:27 gaia kernel: [429242.015740]  [<ffffffff8111cdae>] ?
>> evict+0xa3/0x153
>> Apr 11 21:22:27 gaia kernel: [429242.015743]  [<ffffffff8111ce85>] ?
>> dispose_list+0x27/0x31
>> Apr 11 21:22:27 gaia kernel: [429242.015746]  [<ffffffff8111db6b>] ?
>> evict_inodes+0xe7/0xf4
>> Apr 11 21:22:27 gaia kernel: [429242.015749]  [<ffffffff8110b3af>] ?
>> generic_shutdown_super+0x3e/0xc5
>> Apr 11 21:22:27 gaia kernel: [429242.015752]  [<ffffffff8110b49e>] ?
>> kill_anon_super+0x9/0x11
>> Apr 11 21:22:27 gaia kernel: [429242.015755]  [<ffffffff811ca7b0>] ?
>> nfs_kill_super+0xd/0x16
>> Apr 11 21:22:27 gaia kernel: [429242.015758]  [<ffffffff8110b717>] ?
>> deactivate_locked_super+0x2c/0x5c
>> Apr 11 21:22:27 gaia kernel: [429242.015761]  [<ffffffff811c901d>] ?
>> __put_nfs_open_context+0xbf/0xe1
>> Apr 11 21:22:27 gaia kernel: [429242.015764]  [<ffffffff811d07db>] ?
>> nfs_commitdata_release+0x10/0x19
>> Apr 11 21:22:27 gaia kernel: [429242.015766]  [<ffffffff811d0f8c>] ?
>> nfs_initiate_commit+0xd9/0xe4
>> Apr 11 21:22:27 gaia kernel: [429242.015769]  [<ffffffff811d1bae>] ?
>> nfs_commit_inode+0x81/0x111
>> Apr 11 21:22:27 gaia kernel: [429242.015772]  [<ffffffff811c86f4>] ?
>> nfs_release_page+0x40/0x4f
>> Apr 11 21:22:27 gaia kernel: [429242.015775]  [<ffffffff810d0940>] ?
>> shrink_page_list+0x4f5/0x6d8
>> Apr 11 21:22:27 gaia kernel: [429242.015780]  [<ffffffff810d0f03>] ?
>> shrink_inactive_list+0x1dd/0x33f
>> Apr 11 21:22:27 gaia kernel: [429242.015783]  [<ffffffff810d15fa>] ?
>> shrink_lruvec+0x2e0/0x44d
>> Apr 11 21:22:27 gaia kernel: [429242.015787]  [<ffffffff810d17ba>] ?
>> shrink_zone+0x53/0x8a
>> Apr 11 21:22:27 gaia kernel: [429242.015790]  [<ffffffff810d1bcd>] ?
>> do_try_to_free_pages+0x1c6/0x3f4
>> Apr 11 21:22:27 gaia kernel: [429242.015794]  [<ffffffff810d20a3>] ?
>> try_to_free_pages+0xc4/0x11e
>> Apr 11 21:22:27 gaia kernel: [429242.015797]  [<ffffffff810c9018>] ?
>> __alloc_pages_nodemask+0x440/0x72f
>> Apr 11 21:22:27 gaia kernel: [429242.015801]  [<ffffffff810f592d>] ?
>> alloc_pages_current+0xb2/0xcd
>
> Judging from the stack trace above, it looks like the system is trying
> to squeeze some memory out from NFS. Probably it is just that your
> system is suffering from OOM? Then NFS failed to commit its changes to
> disk for some reason and hung.
>

Yes, it's not specific to NFS pages, but I'm just unlucky enough.
I agree with your suspicion; the chance of hitting it depends on the memory pressure in dom0.
So here is a setup that reproduces the issue:
1. dom0 with swap disabled and limited memory allocated.
2. domU serves storage and exports it via NFS.
3. dom0 mounts the domU export and writes to it.
4. The transfer must be fast enough to expose the issue.
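To make the recipe concrete, the write step I use is roughly the following (the hostname, export path, mount point, and sizes are just examples from my setup and may need adjusting):

```shell
# On dom0: mount the NFS export served by the domU.
mount -t nfs domu:/export /mnt/domu

# Generate sustained write traffic; with a fast enough source
# (USB3 disk, or plain /dev/zero) the dom0 page cache fills up
# and direct reclaim kicks in, which is when the hang appears.
dd if=/dev/zero of=/mnt/domu/testfile bs=1M count=4096

# If the transfer freezes, a lazy force-unmount usually unwedges it:
umount -f -l /mnt/domu
```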

In my case, domU owns a dedicated SATA controller, so there is no
blkback overhead. I'm not sure whether that is an important factor in
achieving high speed.
The transfer is a normal file copy rather than O_SYNC / O_DIRECT
access, so the pages can be cached on the client side for a short period.
Finally, the combination of transfer speed and memory size is crucial.

With 4GB of memory allocated to dom0, I can copy a file (> 2GB) from a
USB2 port without problems at about 32MB/s.
But using a USB3 port, the same copy generally gets stuck at around
1.2GB, and a 'dd if=/dev/zero' write gets stuck even sooner.
With around 1-2GB of memory for dom0, the freeze happens much earlier,
but I did not note the exact point.

I'm running a custom build of the Xen 4.2.1 testing release (built
around January this year?), with some patches related to graphics
pass-through, but I guess those patches are not relevant.
The dom0 kernel is 3.6.11, 64-bit.

One thing I forgot to mention is a possible sign of a memory leak.
I'm not very sure about it, but my dom0 reported OOM several days ago.
I typically don't use dom0 for any purpose other than serving
backends. The allocated memory should be around 2GB, which should be
plenty for a dom0.
Are there any known leak bugs out there?
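In case it helps, this is roughly how I spotted the earlier OOM events (the exact log path depends on the distro):

```shell
# Look for OOM-killer activity in the dom0 kernel log
dmesg | grep -i "out of memory"
grep -i oom /var/log/kern.log

# Cross-check how much memory is actually assigned to dom0
xl list
free -m
```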

Thanks,
Timothy

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel

 

