
Re: [Xen-devel] Xen 4.7 crash



(CC Ian, Stefano and Wei)

Hello Aaron,

On 06/06/16 14:58, Aaron Cornelius wrote:
On 6/2/2016 5:07 AM, Julien Grall wrote:
Hello Aaron,

On 02/06/2016 02:32, Aaron Cornelius wrote:
This is with a custom application; we use the libxl APIs to interact
with Xen.  Domains are created using the libxl_domain_create_new()
function, and domains are destroyed using the libxl_domain_destroy()
function.

The test in this case creates a domain, waits a minute, then
deletes/creates the next domain, waits a minute, and so on.  So I
wouldn't be surprised to see the VMID occasionally indicate there are 2
active domains since there could be one being created and one being
destroyed in a very short time.  However, I wouldn't expect to ever have
256 domains.
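Why 256 is the significant number: assuming the ARM32 guests use an 8-bit VMID (256 possible values), a domain whose VMID is never released permanently consumes one slot. The C sketch below is a hypothetical, simplified allocator built on a 256-bit bitmap. It is not Xen's own code. It only shows that if one VMID leaks per create/destroy cycle, allocation fails on the 257th domain even though at most one or two domains are ever alive at the same time.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified model of an 8-bit VMID allocator (256 IDs in a
 * bitmap). NOT Xen's implementation; it only illustrates pool exhaustion. */
#define MAX_VMID 256
#define INVALID_VMID (-1)

static uint8_t vmid_mask[MAX_VMID / 8];

static bool vmid_in_use(int vmid)
{
    return vmid_mask[vmid / 8] & (1u << (vmid % 8));
}

/* Return the lowest free VMID, or INVALID_VMID when all 256 are taken. */
static int alloc_vmid(void)
{
    for (int vmid = 0; vmid < MAX_VMID; vmid++) {
        if (!vmid_in_use(vmid)) {
            vmid_mask[vmid / 8] |= (uint8_t)(1u << (vmid % 8));
            return vmid;
        }
    }
    return INVALID_VMID;
}

/* Release a VMID; silently ignoring INVALID_VMID is the kind of guard
 * discussed for p2m_teardown() further down the thread. */
static void free_vmid(int vmid)
{
    if (vmid == INVALID_VMID)
        return;
    vmid_mask[vmid / 8] &= (uint8_t)~(1u << (vmid % 8));
}

/* Run `cycles` create/destroy cycles. If `leak` is true, destroying a
 * domain never returns its VMID. Returns the index of the first cycle whose
 * allocation fails, or -1 if every allocation succeeds. */
static int simulate(int cycles, bool leak)
{
    for (int i = 0; i < MAX_VMID / 8; i++)
        vmid_mask[i] = 0;
    for (int i = 0; i < cycles; i++) {
        int vmid = alloc_vmid();
        if (vmid == INVALID_VMID)
            return i;
        if (!leak)
            free_vmid(vmid);
    }
    return -1;
}
```

For example, simulate(1000, false) never fails and returns -1, while simulate(1000, true) returns 256, the zero-based index of the 257th cycle.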

Your log has:

(XEN) grant_table.c:3288:d0v1 Grant release (0) ref:(9) flags:(2) dom:(0)
(XEN) grant_table.c:3288:d0v1 Grant release (1) ref:(11) flags:(2) dom:(0)

This suggests that some grants are still mapped in DOM0.
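To see how many of these lines appear after each destroy cycle, one option is to extract them from the hypervisor log, for example the output of `xl dmesg`. Below is a minimal C sketch. It assumes every line has exactly the shape shown above. The struct and function names are hypothetical. The flags field is only extracted, not decoded, and parsing it as hexadecimal is an assumption.

```c
#include <stdbool.h>
#include <stdio.h>

/* Fields of one "Grant release" line, as printed in the log above. */
struct grant_release {
    unsigned int index;   /* the "(0)", "(1)", ... counter */
    unsigned int ref;     /* grant reference */
    unsigned int flags;   /* raw flags value (assumed hex), not decoded */
    unsigned int dom;     /* value printed as dom:(N) */
};

/* Parse one log line; returns true only if the whole pattern matched. */
static bool parse_grant_release(const char *line, struct grant_release *out)
{
    int consumed = 0;   /* set by %n only if the closing ")" matched */
    int n = sscanf(line,
                   "(XEN) grant_table.c:%*u:d%*uv%*u Grant release (%u) "
                   "ref:(%u) flags:(%x) dom:(%u)%n",
                   &out->index, &out->ref, &out->flags, &out->dom,
                   &consumed);
    return n == 4 && consumed > 0;
}
```

Applied to the second line above, it fills in index 1, ref 11, flags 2 and dom 0. Counting matches per cycle is then a simple loop over the log.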


The CubieTruck only has 2GB of RAM, and I allocate 512MB for dom0, which
means that only 48 of the Mirage domains (with 32MB of RAM each) could
run at the same time anyway.  And that doesn't account for the various
inter-domain resources or the RAM used by Xen itself.
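The 48 figure comes from a simple calculation: (2048MB - 512MB) / 32MB = 1536 / 32 = 48. The C helper below is hypothetical and shown only for illustration. Like the estimate above, it ignores Xen's own memory and inter-domain resources, so the result is only an upper bound.

```c
/* Upper bound on concurrently running guests (all sizes in MiB).
 * Ignores Xen's own footprint and inter-domain resources. */
static unsigned int max_guests(unsigned int total_mib, unsigned int dom0_mib,
                               unsigned int guest_mib)
{
    if (guest_mib == 0 || dom0_mib >= total_mib)
        return 0;   /* no room for guests, or invalid guest size */
    return (total_mib - dom0_mib) / guest_mib;
}
```

With the CubieTruck numbers, max_guests(2048, 512, 32) returns 48.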

All the pages that belong to the domain could have been freed except the
ones referenced by DOM0, so the footprint of such a domain would be
limited at that time.

I would recommend checking how many domains are running at this time
and whether DOM0 actually released all the resources.

If the p2m_teardown() function checked for NULL, it would prevent the
crash, but I suspect Xen would be just as broken, since all of my
resources have leaked away.  More broken, in fact: if the board
reboots, at least the applications will restart and domains can be
recreated.

It certainly appears that some resources are leaking when domains are
deleted (possibly only on the ARM or ARM32 platforms).  We will try to
add some debug prints and see if we can discover exactly what is
going on.

The leakage could also come from DOM0. FWIW, I have been able to cycle
2000 guests overnight on an ARM platform.


We've done some more testing regarding this issue, and it shows that
it doesn't matter whether we delete the vchans before the domains are
deleted.  Those appear to be cleaned up correctly when the domain is
destroyed.

What does stop this issue from happening (using the same version of Xen
that the issue was detected on) is removing any non-standard xenstore
references before deleting the domain.  In this case our application
grants created domains permissions on non-standard xenstore paths.
Making sure to remove those domain permissions before deleting the
domain prevents this issue from happening.
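A sketch of one reading of this, under stated assumptions: before destroying a domain, the application drops that domain's entries from the ACLs of its extra xenstore nodes. With libxenstore, this would mean reading each node's list with xs_get_permissions(), filtering it, and writing it back with xs_set_permissions(). The C sketch below models only the filtering step on a plain array. The struct and function names are hypothetical. Entry 0 is always kept because, in xenstore, it names the node's owner and the default access for unlisted domains. If the destroyed domain owns the node, removing the node with xs_rm() would be the alternative.

```c
#include <stddef.h>

/* Hypothetical stand-in for one xenstore ACL entry (domain id plus access
 * bits); libxenstore's real type is struct xs_permissions. */
struct perm_entry {
    unsigned int domid;
    unsigned int access;
};

/* Drop every entry for `domid` except entry 0 (owner/default entry).
 * Compacts the array in place and returns the new number of entries. */
static size_t drop_domain_perms(struct perm_entry *perms, size_t n,
                                unsigned int domid)
{
    size_t kept = 0;
    for (size_t i = 0; i < n; i++) {
        if (i > 0 && perms[i].domid == domid)
            continue;               /* skip this domain's extra grant */
        perms[kept++] = perms[i];   /* keep everything else, in order */
    }
    return kept;
}
```

For an ACL listing domains 0, 5, 7 and 5, dropping domain 5 leaves the entries for 0 and 7.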

I am not sure I understand what you mean here. Could you give a quick example?


It does not appear to matter if we delete the standard domain xenstore
path (/local/domain/<id>) since libxl handles removing this path when
the domain is destroyed.

Based on this, I would guess that xenstore is hanging onto the VMID.

Regards,

--
Julien Grall

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel

 

