
Re: [Xen-devel] Linux 4.1 reports wrong number of pages to toolstack





On 04/09/15 12:35, Wei Liu wrote:
On Fri, Sep 04, 2015 at 10:35:52AM +0100, Andrew Cooper wrote:
On 04/09/15 09:28, Jan Beulich wrote:
On 04.09.15 at 05:38, <JGross@xxxxxxxx> wrote:
On 09/04/2015 02:40 AM, Wei Liu wrote:
This issue is exposed by the introduction of migration v2. The symptom is that
a guest with a 32-bit 4.1 kernel can't be restored because it asks for too
many pages.

Note that all guests have 512MB memory, which means they have 131072 pages.

Both 3.14 tests [2] [3] report the correct number of pages. For example:

     xc: detail: max_pfn 0x1ffff, p2m_frames 256
     ...
     xc: detail: Memory: 2048/131072    1%
     ...

However, in both 4.1 tests [0] [1] the number of pages is quite wrong.

4.1 32 bit:

     xc: detail: max_pfn 0xfffff, p2m_frames 1024
     ...
     xc: detail: Memory: 11264/1048576    1%
     ...

It thinks it has 4096MB memory.

4.1 64 bit:

     xc: detail: max_pfn 0x3ffff, p2m_frames 512
     ...
     xc: detail: Memory: 3072/262144    1%
     ...

It thinks it has 1024MB memory.
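
(Both figures follow directly from max_pfn: 0xfffff + 1 = 1048576 pages * 4KiB
= 4096MB, and 0x3ffff + 1 = 262144 pages * 4KiB = 1024MB.)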

The total number of pages is determined in libxc by calling
xc_domain_nr_gpfns, which yanks shared_info->arch.max_pfn from the
hypervisor. And that value is clearly touched by Linux in some way.
Sure. shared_info->arch.max_pfn holds the number of pfns the p2m list
can handle. This is not the memory size of the domain.
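
To make this concrete, here is a minimal, illustrative sketch of the caller
side (not code from this thread; xc_interface_open(), xc_domain_nr_gpfns()
and xc_interface_close() are real libxc calls, but the exact signatures shown
are taken from the 4.6-era tree and the domid is made up):

    #include <stdio.h>
    #include <inttypes.h>
    #include <xenctrl.h>

    int main(void)
    {
        xc_interface *xch = xc_interface_open(NULL, NULL, 0);
        uint32_t domid = 1;           /* example domid, not from the thread */
        xen_pfn_t nr_gpfns = 0;

        if ( !xch )
            return 1;

        /* For a PV guest this is derived from shared_info->arch.max_pfn,
         * i.e. the size of the p2m the guest advertises, not the amount of
         * memory actually populated. */
        if ( xc_domain_nr_gpfns(xch, domid, &nr_gpfns) < 0 )
            return 1;

        /* The 32-bit 4.1 guest above reports max_pfn 0xfffff, so this would
         * print 1048576 pages (4096MB) even though only 512MB is populated. */
        printf("nr_gpfns = %" PRIu64 " (%" PRIu64 "MB)\n",
               (uint64_t)nr_gpfns, (uint64_t)nr_gpfns * 4 / 1024);

        xc_interface_close(xch);
        return 0;
    }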

I now think this is a bug in the Linux kernel. The biggest suspect is the
introduction of the linear P2M.  If you think this is a bug in the toolstack,
please let me know.
I absolutely think it is a toolstack bug. Even without the linear p2m
things would go wrong if a ballooned-down guest were migrated,
as shared_info->arch.max_pfn would hold the upper limit of the guest
in this case and not its current size.
I don't think this necessarily is a tool stack bug, at least not in
the sense implied above - since (afaik) migrating ballooned guests
(at least PV ones) has been working before, there ought to be
logic to skip ballooned pages (and I certainly recall having seen
migration slowly move up to e.g. 50% and then skip the other
half due to being ballooned, albeit that recollection certainly is
from before v2). And pages above the highest populated one
ought to be considered ballooned just as much. With the
information provided by Wei I don't think we can judge
this, since it only shows the values the migration process starts
from, not when, why, or how it fails.
Max pfn reported by migration v2 is max pfn, not the number of pages of RAM
in the guest.

I understand that from looking at the code. It's just that the log itself
is very confusing.

I propose we rename the log a bit. Maybe change "Memory" to "P2M" or
something else?

"P2M" would be wrong for HVM guests. "Memory" was the term used by the legacy code, IIRC.

"Frames" is probably the best term.


The max pfn is used to size the bitmaps used by migration v2, including those
passed in the logdirty op calls.

All frames between 0 and max pfn will have their type queried, and acted
upon appropriately, including doing nothing if the frame was ballooned out.
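
For reference, the shape of that per-frame handling is roughly the following
(a simplified sketch, not the real write_batch(); xc_get_pfn_type_batch() and
the XEN_DOMCTL_PFINFO_* values are real libxc/Xen definitions, the loop around
them is illustrative):

    #include <xenctrl.h>

    static int scan_frames(xc_interface *xch, uint32_t domid, xen_pfn_t max_pfn)
    {
        enum { BATCH = 1024 };
        xen_pfn_t entries[BATCH];

        for ( xen_pfn_t pfn = 0; pfn <= max_pfn; pfn += BATCH )
        {
            unsigned int i, n = (max_pfn - pfn + 1 < BATCH)
                                ? (unsigned int)(max_pfn - pfn + 1) : BATCH;

            for ( i = 0; i < n; ++i )
                entries[i] = pfn + i;

            /* Fills in the type bits of each entry in place. */
            if ( xc_get_pfn_type_batch(xch, domid, n, entries) )
                return -1;

            for ( i = 0; i < n; ++i )
            {
                switch ( entries[i] & XEN_DOMCTL_PFINFO_LTAB_MASK )
                {
                case XEN_DOMCTL_PFINFO_XTAB:    /* ballooned out: nothing to do */
                case XEN_DOMCTL_PFINFO_BROKEN:  /* broken page: skip as well */
                    continue;
                default:
                    /* ... map, normalise and write the page ... */
                    break;
                }
            }
        }
        return 0;
    }
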
In short, do you think this is a bug in migration v2?

There is insufficient information in this thread to say either way. Maybe. Maybe a Linux kernel bug.


When I looked at write_batch() I found some snippets that I thought were
wrong, but I didn't want to make the judgement when I didn't have a
clear head.

write_batch() is a complicated function but it can't usefully be split any further. I would be happy to explain bits or expand the existing comments, but it is also possible that it is buggy.

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

