To: Jeremy Fitzhardinge <jeremy@xxxxxxxx>, Scott Garron <xen-devel@xxxxxxxxxxxxxxxxxx>
Subject: RE: [Xen-devel] Making snapshot of logical volumes handling HVM domU causes OOPS and instability
From: "Xu, Dongxiao" <dongxiao.xu@xxxxxxxxx>
Date: Tue, 31 Aug 2010 14:59:40 +0800
Accept-language: en-US
Cc: Daniel Stodden <daniel.stodden@xxxxxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxx>
Delivery-date: Tue, 31 Aug 2010 00:02:50 -0700
Envelope-to: www-data@xxxxxxxxxxxxxxxxxxx
In-reply-to: <4C7BE1C6.5030602@xxxxxxxx>
List-help: <mailto:xen-devel-request@lists.xensource.com?subject=help>
List-id: Xen developer discussion <xen-devel.lists.xensource.com>
List-post: <mailto:xen-devel@lists.xensource.com>
List-subscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
List-unsubscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
References: <4C7864BB.1010808@xxxxxxxxxxxxxxxxxx> <4C7BE1C6.5030602@xxxxxxxx>
Sender: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
Thread-index: ActIY72wxlB0sMDHRcOWz8l3SC+NLwAdg9Uw
Thread-topic: [Xen-devel] Making snapshot of logical volumes handling HVM domU causes OOPS and instability

Jeremy Fitzhardinge wrote:
>  On 08/27/2010 06:22 PM, Scott Garron wrote:
>> I use LVM volumes for domU disks.  To create backups, I create a
>> snapshot of the volume, mount the snapshot in the dom0, mount an
>> equally-sized backup volume from another physical storage source,
>> run an rsync from one to the other, unmount both, then remove the
>> snapshot.  This includes creating snapshots of and mounting NTFS
>> volumes from Windows-based HVM guests.
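(For reference, the per-volume backup cycle described above amounts to
roughly the following; the volume group, volume names, and mount points
are made-up placeholders, not the actual setup:)

    #!/bin/bash
    # Hedged sketch of one backup cycle: snapshot an LV used by a domU,
    # mount it read-only in dom0, rsync it to a backup LV, then clean up.
    set -e
    VG=vg0                                        # assumed volume group name
    LV=guest-disk                                 # assumed LV backing the domU
    BACKUP_DEV=/dev/backupvg/guest-disk-backup    # assumed backup LV on other storage

    lvcreate --snapshot --size 1G --name ${LV}-snap /dev/${VG}/${LV}
    mkdir -p /mnt/snap /mnt/backup
    mount -o ro /dev/${VG}/${LV}-snap /mnt/snap   # add -t ntfs-3g for NTFS guests
    mount ${BACKUP_DEV} /mnt/backup
    rsync -a --delete /mnt/snap/ /mnt/backup/
    umount /mnt/snap /mnt/backup
    lvremove -f /dev/${VG}/${LV}-snap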
>> 
>> This practice may not be perfect, but it has worked fine for me for
>> a couple of years - while I was running Xen 3.2.1 and a
>> linux-2.6.18.8-xen dom0 (and the same kernel for the domUs).  After
>> newer versions of udev started complaining that the kernel was too
>> old, I thought it was well past time to transition to a newer
>> version of Xen and a newer dom0 kernel.  That transition has been a
>> gigantic learning experience, let me tell you.
>> 
>> After that transition, here's the problem I've been wrestling with
>> and can't seem to find a solution for: any time I start manipulating
>> a volume group to add or remove a snapshot of a logical volume
>> that's used as a disk for a running HVM guest, new calls to LVM2
>> and/or Xen's storage layer lock up and spin forever.  The first time
>> I ran across the problem, there was no indication of trouble other
>> than that any command touching LVM would freeze and could not be
>> signaled to do anything.  In other words, no error messages, nothing
>> in dmesg, nothing in syslog...  The commands would just hang and
>> never return.  That was with the 2.6.31.14 kernel, which is what you
>> currently get if you check out xen-4.0-testing.hg and just do a
>> make dist.
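(For anyone following along, that build amounts to roughly the
following; the xenbits repository URL is my recollection rather than
something stated in this thread, so verify it before use:)

    # Fetch and build Xen 4.0-testing, which at the time pulled in a
    # 2.6.31.14 dom0 kernel as part of "make dist".
    # Repository URL assumed from memory; verify before use.
    hg clone http://xenbits.xensource.com/xen-4.0-testing.hg
    cd xen-4.0-testing.hg
    make dist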
>> 
>> I have since checked out and compiled 2.6.32.18, which is what you
>> get from doing git checkout -b xen/stable-2.6.32.x
>> origin/xen/stable-2.6.32.x, as described on the Wiki page here:
>> http://wiki.xensource.com/xenwiki/XenParavirtOps
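(That pvops dom0 kernel is built along these lines; the git URL is what
I believe that wiki page pointed at, so treat it as an assumption:)

    # Clone the pvops tree and build the 2.6.32.x dom0 kernel.
    # The git URL is assumed from the wiki page and may have moved.
    git clone git://git.kernel.org/pub/scm/linux/kernel/git/jeremy/xen.git linux-2.6-xen
    cd linux-2.6-xen
    git checkout -b xen/stable-2.6.32.x origin/xen/stable-2.6.32.x
    make menuconfig        # enable the Xen dom0 options
    make -j4 bzImage modules && make modules_install install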
>> 
>> If I run that kernel for dom0, but continue to use 2.6.31.14 for the
>> paravirtualized domUs, everything works fine until I try to
>> manipulate the snapshots of the HVM volumes.  Today, I got this
>> kernel OOPS:
> 
> That's definitely bad.  Something is causing udevd to end up with bad
> pagetables, which are causing a kernel crash on exit.  I'm not sure if
> it's *the* udevd or some transient child, but either way it's bad.
> 
> Any thoughts on this, Daniel?
> 
>> 
>> ---------------------------
>> 
>> [78084.004530] BUG: unable to handle kernel paging request at ffff8800267c9010
>> [78084.004710] IP: [<ffffffff810382ff>] xen_set_pmd+0x24/0x44
>> [78084.004886] PGD 1002067 PUD 1006067 PMD 217067 PTE 80100000267c9065
>> [78084.005065] Oops: 0003 [#1] SMP
>> [78084.005234] last sysfs file: /sys/devices/virtual/block/dm-32/removable
>> [78084.005256] CPU 1
>> [78084.005256] Modules linked in: tun xt_multiport fuse dm_snapshot nf_nat_tftp nf_conntrack_tftp nf_nat_pptp nf_conntrack_pptp nf_conntrack_proto_gre nf_nat_proto_gre ntfs parport_pc parport k8temp floppy forcedeth [last unloaded: scsi_wait_scan]
>> [78084.005256] Pid: 22814, comm: udevd Tainted: G        W  2.6.32.18 #1 H8SMI
>> [78084.005256] RIP: e030:[<ffffffff810382ff>]  [<ffffffff810382ff>] xen_set_pmd+0x24/0x44
>> [78084.005256] RSP: e02b:ffff88002e2e1d18  EFLAGS: 00010246
>> [78084.005256] RAX: 0000000000000000 RBX: ffff8800267c9010 RCX: ffff880000000000
>> [78084.005256] RDX: dead000000100100 RSI: 0000000000000000 RDI: 0000000000000004
>> [78084.005256] RBP: ffff88002e2e1d28 R08: 0000000001993000 R09: dead000000100100
>> [78084.005256] R10: 800000016e90e165 R11: 0000000000000000 R12: 0000000000000000
>> [78084.005256] R13: ffff880002d8f580 R14: 0000000000400000 R15: ffff880029248000
>> [78084.005256] FS:  00007fa07d87f7a0(0000) GS:ffff880002d81000(0000) knlGS:0000000000000000
>> [78084.005256] CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
>> [78084.005256] CR2: ffff8800267c9010 CR3: 0000000001001000 CR4: 0000000000000660
>> [78084.005256] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> [78084.005256] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>> [78084.005256] Process udevd (pid: 22814, threadinfo ffff88002e2e0000, task ffff880019491e80)
>> [78084.005256] Stack:
>> [78084.005256]  0000000000600000 000000000061e000 ffff88002e2e1de8 ffffffff810fb8a5
>> [78084.005256] <0> 00007fff13ffffff 0000000100000206 ffff880003158003 0000000000000000
>> [78084.005256] <0> 0000000000000000 000000000061dfff 000000000061dfff 000000000061dfff
>> [78084.005256] Call Trace:
>> [78084.005256]  [<ffffffff810fb8a5>] free_pgd_range+0x27c/0x45e
>> [78084.005256]  [<ffffffff810fbb2b>] free_pgtables+0xa4/0xc7
>> [78084.005256]  [<ffffffff810ff1fd>] exit_mmap+0x107/0x13f
>> [78084.005256]  [<ffffffff8107714b>] mmput+0x39/0xda
>> [78084.005256]  [<ffffffff8107adff>] exit_mm+0xfb/0x106
>> [78084.005256]  [<ffffffff8107c86d>] do_exit+0x1e8/0x6ff
>> [78084.005256]  [<ffffffff815c228b>] ? do_page_fault+0x2cd/0x2fd
>> [78084.005256]  [<ffffffff8107ce0d>] do_group_exit+0x89/0xb3
>> [78084.005256]  [<ffffffff8107ce49>] sys_exit_group+0x12/0x16
>> [78084.005256]  [<ffffffff8103cc82>] system_call_fastpath+0x16/0x1b
>> [78084.005256] Code: 48 83 c4 28 5b c9 c3 55 48 89 e5 41 54 49 89 f4 53 48 89 fb e8 fc ee ff ff 48 89 df ff 05 52 8f 9e 00 e8 78 e4 ff ff 84 c0 75 05 <4c> 89 23 eb 16 e8 e0 ee ff ff 4c 89 e6 48 89 df ff 05 37 8f 9e
>> [78084.005256] RIP  [<ffffffff810382ff>] xen_set_pmd+0x24/0x44
>> [78084.005256]  RSP <ffff88002e2e1d18>
>> [78084.005256] CR2: ffff8800267c9010
>> [78084.005256] ---[ end trace 4eaa2a86a8e2da24 ]---
>> [78084.005256] Fixing recursive fault but reboot is needed!
>> 
>> ---------------------------
>> 
>> After that was printed on the console, any command that interacts
>> with Xen (xentop, xm) would freeze and never return.  After I tried
>> to do a sane shutdown of the guests, the whole dom0 locked up
>> completely.  Even the alt-sysrq combinations stopped working after I
>> had looked at a couple of them.
>> 
>> I feel it's probably necessary to mention that this happens after
>> several fairly rapid-fire creations and deletions of snapshot
>> volumes.  I have it scripted to make a snapshot, mount it, mount a
>> backup volume, rsync from one to the other, unmount both volumes,
>> and delete the snapshot, for 19 volumes in a row.  In other words,
>> there's a lot of disk I/O going on around the time of the lockup.
>> It always seems to coincide with when the script gets to the volumes
>> that are used as disks for active, running Windows Server 2008 HVM
>> guests.  That may just be coincidence, though, because those are the
>> last ones on the list.  The 15 volumes used by active, running
>> paravirtualized Linux guests are at the top of the list.
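(In script form, that loop is essentially the following; the volume
names and the "backup-one-lv.sh" wrapper around the cycle sketched
earlier are hypothetical placeholders, not the actual script:)

    #!/bin/bash
    # Hedged sketch: run the per-volume snapshot/rsync/cleanup cycle
    # for each domU volume in turn, PV guests first, HVM guests last.
    # backup-one-lv.sh and the volume names are invented placeholders.
    for LV in pv-guest01 pv-guest02 win2008-guest01 win2008-guest02; do
        ./backup-one-lv.sh vg0 "$LV"
    done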
>> 
>> 
>> Another issue that comes up is that if I run the 2.6.32.18 pvops
>> kernel for my Linux domUs, after a while (usually only about an hour
>> or so) the network interfaces stop responding.  I don't know whether
>> the problem is related, but it was something else I noticed.  The
>> only way to get network access back is to reboot the domU.  When I
>> reverted the domU kernel to 2.6.31.14, the problem went away.
> 
> That's a separate problem in netfront that appears to be a bug in the
> "smartpoll" code.  I think Dongxiao is looking into it. 

Yes, I have been trying to reproduce this over the past few days, but I could
not catch it locally.  I ran both netperf and ping for a long time, and the
bug was never triggered.  What workload were you running when you hit the bug?

Thanks,
Dongxiao

> 
>> I'm not 100%
>> sure, but I think this issue also prevents xm console from accepting
>> any typing on the console that you connect to.  If I connect to a
>> console and then issue an xm shutdown for the same domU from another
>> terminal, all of the console messages showing the play-by-play of
>> the shutdown process are displayed, but my keyboard input doesn't
>> seem to make it through.
> 
> Hm, not familiar with this problem.  Perhaps it's just something wrong
> with your console settings for the domain?  Do you have "console=" on
> the kernel command line?
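(For what it's worth, a quick way to check that is below; the config
file path and the extra= line are examples, not taken from the actual
domain configuration:)

    # Inside the domU, confirm the kernel was booted with a console=
    # argument pointing at the Xen PV console:
    grep -o 'console=[^ ]*' /proc/cmdline

    # If it is missing, one common approach for a PV guest is to pass it
    # via the domain config, e.g. in /etc/xen/guest.cfg (example path):
    #   extra = "console=hvc0"
    # and make sure a getty runs on hvc0 so the console accepts input.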
> 
>> Since I'm not a developer, I don't know if these questions are better
>> suited for the xen-users list, but since it generated an OOPS with
>> the word "BUG" in capital letters, I thought I'd post it here.  If
>> that assumption was incorrect, just give me a gentle nudge and I'll
>> redirect the inquiry to somewhere more appropriate.  :)
> 
> Nope, they're both xen-devel fodder.  Thanks for posting.
> 
>     J


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel