
Re: [Xen-devel] error in xen/arch/x86/mm.c:get_page during migration



At 09:52 -0500 on 25 Feb (1361785966), Andres Lagar-Cavilla wrote:
> >>>> On 22.02.13 at 21:07, Olaf Hering <olaf@xxxxxxxxx> wrote:
> >> On Fri, Feb 22, Jan Beulich wrote:
> >> 
> >>>>>> On 21.02.13 at 18:31, Olaf Hering <olaf@xxxxxxxxx> wrote:
> >>>> It did not happen with xl.
> >>> 
> >>> But the same guest and Dom0 kernel, and the same hypervisor?
> >> 
> >> Yes, same sles11sp2 dom0, and 3.7.9 pvops guest.
> >> 
> >>>> Here is the output while doing xm migrate:
> >>>> 
> >>>> (XEN) HVM2 restore: VMCE_VCPU 0
> >>>> (XEN) HVM2 restore: VMCE_VCPU 1
> >>>> (XEN) HVM2 restore: TSC_ADJUST 0
> >>>> (XEN) HVM2 restore: TSC_ADJUST 1
> >>>> (XEN) mm.c:1983:d0 Error pfn 4112c5: rd=ffff83036ffef000, od=0000000000000000, caf=180000000000000, taf=7400000000000001
> >>> 
> >>> Didn't even notice yesterday that this is apparently after restore
> >>> has already started. Which makes me curious whether the domain
> >>> that is being referenced with rd= is the old or the new one (would
> >>> require printing the domain ID; honestly I never understood what
> >>> use printing of the domain pointer is).
> >>> 
> >>> I'm also confused by the domain pointer always being the same;
> >>> I would expect it to at least toggle between two values, but
> >>> probably even be different between every instance of the guest.
> >>> But you're not having a stubdom configured for the guest either,
> >>> according to the config you sent earlier...
> >> 
> >> The rd->domain_id is DOMID_COW in both cases.
> > 
> > Which suggests that memory sharing is in use. At least I'm unaware
> > of other uses of that pseudo domain.
> 
> There are none.
> 
> There seems to be something else amiss though. Unless I am parsing
> this incorrectly, taf == PGT_writable | PGT_pae_xen_l2? And caf == PAT
> | PCD? Looks like a very unlikely combination

By my reading, 

taf = 0x7400000000000001 = typecount 1, PGT_writable_page | PGT_validated
caf = 0x0180000000000000 = refcount 0, PGC_state_free

In other words, this is a free page (refcount 0) that has somehow ended
up with a typecount.  The zero refcount explains why get_page() failed.
Presumably the failing call is one of the various
get_page[_and_type](page, dom_cow) calls in mem_sharing.c.
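
A minimal sketch of that decoding in C, for anyone who wants to check
it.  It assumes the 64-bit x86 layout of this era: flags counted down
from bit 63, a 4-bit type field in type_info, and the two PGC_state
bits sitting directly above a 55-bit count.  The upper-case macro names
and the helper functions are illustrative, not Xen's own definitions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* ASSUMPTION: 64-bit layout, flag positions counted down from bit 63. */
#define PG_SHIFT(idx)    (64 - (idx))
#define PG_MASK(x, idx)  ((uint64_t)(x) << PG_SHIFT(idx))

/* type_info ("taf") fields, as assumed above. */
#define PGT_TYPE_MASK      PG_MASK(15, 4)
#define PGT_WRITABLE_PAGE  PG_MASK(7, 4)
#define PGT_PINNED         PG_MASK(1, 5)
#define PGT_VALIDATED      PG_MASK(1, 6)
#define PGT_PAE_XEN_L2     PG_MASK(1, 7)
#define PGT_COUNT_MASK     ((UINT64_C(1) << PG_SHIFT(9)) - 1)

/* count_info ("caf") fields, as assumed above. */
#define PGC_STATE_MASK     PG_MASK(3, 9)
#define PGC_STATE_FREE     PG_MASK(3, 9)
#define PGC_COUNT_MASK     ((UINT64_C(1) << PG_SHIFT(9)) - 1)

static inline uint64_t taf_type(uint64_t taf)      { return taf & PGT_TYPE_MASK; }
static inline uint64_t taf_typecount(uint64_t taf) { return taf & PGT_COUNT_MASK; }
static inline uint64_t caf_refcount(uint64_t caf)  { return caf & PGC_COUNT_MASK; }

static inline int caf_is_free(uint64_t caf)
{
    return (caf & PGC_STATE_MASK) == PGC_STATE_FREE;
}

/* Print a one-line summary of both words for a page. */
static inline void describe(uint64_t caf, uint64_t taf)
{
    printf("caf: refcount=%llu free=%d\n",
           (unsigned long long)caf_refcount(caf), caf_is_free(caf));
    printf("taf: typecount=%llu writable=%d validated=%d pae_xen_l2=%d\n",
           (unsigned long long)taf_typecount(taf),
           taf_type(taf) == PGT_WRITABLE_PAGE,
           (taf & PGT_VALIDATED) != 0,
           (taf & PGT_PAE_XEN_L2) != 0);
}

/* Check the two values from the log line against the reading above. */
static inline void check_thread_values(void)
{
    const uint64_t taf = UINT64_C(0x7400000000000001);
    const uint64_t caf = UINT64_C(0x0180000000000000);

    assert(taf_type(taf) == PGT_WRITABLE_PAGE);
    assert((taf & PGT_VALIDATED) != 0);
    assert(taf_typecount(taf) == 1);
    assert(caf_is_free(caf));
    assert(caf_refcount(caf) == 0);
}
```

Under those assumptions, the 0x4 nibble in taf is the validated bit,
not pae_xen_l2, and the 0x018 prefix in caf is the page state field
rather than a cache attribute.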

Since free_domheap_pages() has a BUG_ON(typecount != 0), it seems like
something's gone badly off the rails here. 

One place I can see that tinkers with typecount without holding a
ref is share_xen_page_with_guest(), which sets exactly this typecount,
but then calls page_set_owner(page, d).

There's some hairy code in __gnttab_map_grant_ref() too, but I _think_
it can't end up taking typecounts without refcounts.

__acquire_grant_for_copy() looks pretty hairy too, in particular this
call, which discards the result and so never notices if the reference
wasn't taken:

        (void)page_get_owner_and_reference(*page);

but presumably the matching put_page() would have crashed if that were
the problem.  Does anyone understand the grant code well enough to dig
into that?

If you can repro this, it might be worth tracing all the refcount ops
into a large buffer and dumping the history of this MFN on failure.
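
A rough sketch of what such a trace could look like, as standalone C
rather than hypervisor code.  Everything here is hypothetical: the
names trace_ref_op() and dump_mfn_history(), the event layout and the
buffer size are made up for illustration, and the single shared index
is not SMP-safe, so a real version would want per-CPU buffers or an
atomic index.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define TRACE_SLOTS 4096u   /* hypothetical size; oldest events are overwritten */

enum ref_op { OP_GET_PAGE, OP_PUT_PAGE, OP_GET_TYPE, OP_PUT_TYPE };

static const char *const op_name[] = {
    "get_page", "put_page", "get_page_type", "put_page_type"
};

struct ref_event {
    uint64_t seq;        /* global sequence number of the operation */
    uint64_t mfn;        /* frame the operation was applied to */
    enum ref_op op;
    uint64_t caf, taf;   /* count_info / type_info after the operation */
};

static struct ref_event trace_ring[TRACE_SLOTS];
static uint64_t trace_next;   /* total number of events recorded so far */

/* Record one refcount/typecount operation (not SMP-safe as written). */
static inline void trace_ref_op(uint64_t mfn, enum ref_op op,
                                uint64_t caf, uint64_t taf)
{
    struct ref_event *e = &trace_ring[trace_next % TRACE_SLOTS];

    e->seq = trace_next++;
    e->mfn = mfn;
    e->op  = op;
    e->caf = caf;
    e->taf = taf;
}

/* Print the surviving history of one MFN, oldest first; returns hit count. */
static inline unsigned int dump_mfn_history(uint64_t mfn, FILE *out)
{
    uint64_t first = trace_next > TRACE_SLOTS ? trace_next - TRACE_SLOTS : 0;
    unsigned int hits = 0;

    assert(out != NULL);
    for (uint64_t s = first; s < trace_next; s++) {
        const struct ref_event *e = &trace_ring[s % TRACE_SLOTS];

        if (e->mfn != mfn)
            continue;
        fprintf(out, "#%llu mfn %llx %s caf=%llx taf=%llx\n",
                (unsigned long long)e->seq, (unsigned long long)e->mfn,
                op_name[e->op], (unsigned long long)e->caf,
                (unsigned long long)e->taf);
        hits++;
    }
    return hits;
}
```

The idea would be to call the recording function from each get/put
path and the dump function from the error path that prints the
"Error pfn" message, passing the failing MFN.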

Cheers,

Tim.
