
Re: [Xen-devel] several domU crashes after 4.1->4.8 live migration



On 02/02/17 13:30, Vincent Legout wrote:
> On Thu, Feb 02, 2017 at 12:05:09PM +0000, Andrew Cooper wrote :
>> On 02/02/17 07:58, Vincent Legout wrote:
>>> Hello,
>>>
>>> We had some issues after live migrating several domU from xen 4.1 to xen
>>> 4.8. We migrated around 200 domU and 5 crashed, from a few hours up to
>>> several days after the migration. All the domU had more than 1 year of
>>> uptime, and for example one crashed several days after the migration
>>> during a high load period.
>>>
>>> All 5 domU are running a 3.10 kernel (from 3.10.44 to 3.10.103). They
>>> have between 2GB and 16GB of RAM, and between 1 and 4 vCPUS.
>>>
>>> We use 3 types of machines (several PowerEdge C6100 (Intel L5640) and
>>> R710 (Intel L5520), and one C8220 (Intel E5-2650)). The C6100 and R710
>>> have 24 logical cores with HT enabled. The PowerEdge C8220 is only used
>>> for Xen 4.8, and has 32 logical cores with HT enabled. The C6100 and the
>>> R710 have 50GB of RAM, and the C8220 128GB. The xen 4.1 dom0 is running
>>> a 3.4.69 kernel, and the xen 4.8 one a 4.1.37 kernel.
>>>
>>> I've attached the most relevant parts of the domU kernel logs we could
>>> get. It seems the crashes came from different components of the kernel,
>>> though most of them seem to be related to memory.
>>>
>>> Would anyone have any idea if that's something that could be fixed? Or
>>> is it just that migrating from 4.1 to 4.8 is not supported?
>> Do the VMs migrate normally in production?
> Thanks for the comment. Yes, the migrations took place normally, at
> least we didn't see anything wrong then. These crashes happened randomly
> on a few VMs only, and at least a few hours after the migration.
>
>> This looks like a Linux kernel bug in the Xen suspend/resume paths, and
>> unlikely to be related to the version of the hypervisor in use.
> I agree about the Linux kernel bug. But I still think it should also be
> related to the migration because we never had anything like that without
> migration, on either xen 4.1 or 4.8.

Ok, so it does look like a change in behaviour between 4.1 and 4.8.

Have you observed any further crashes from migrations on 4.8 after the
upgrade?

I am not aware of any alterations to the hypervisor side of things which
would be relevant.

The toolstack however has changed quite a lot.

The first change is with the migration stream itself.  A migration like
that will be piped through tools/python/scripts/convert-legacy-stream to
convert from the old format to the new.  It is certainly possible that
there is a bug in that process, although we test it extensively in
XenServer and have never encountered a crash looking like this in any
vintage of PV guest (all the way back to the RHEL 4 days).  Also, the
fact that it didn't abort midway through is a good sign that nothing
unexpected was encountered during the conversion process.

Another area which has changed is the semantics of how the toolstack
returns from the suspend call.  There have been various changes to
support fast resume, all of which revolve around modifying the return
value from the hypercall as observed by the guest.
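
To make the return-value point concrete, here is a minimal sketch of the
guest-side view. It assumes the convention documented in Linux's
drivers/xen/manage.c, where the suspend hypercall returns 1 if the suspend
was cancelled (or the domain was merely checkpointed) and 0 if the guest is
resuming in a new domain. The function and step names are hypothetical
labels for illustration, not kernel symbols, and the step lists are
deliberately simplified.

```python
# Simplified model of how a PV guest might branch on the suspend return value.
# All names below are hypothetical; they are not taken from the kernel source.

SUSPEND_RESUMED_NEW_DOMAIN = 0  # guest is running in a freshly built domain
SUSPEND_CANCELLED = 1           # suspend cancelled or checkpoint only: same domain


def post_suspend_steps(suspend_rc):
    """Return the recovery steps the guest would run for a given return value."""
    if suspend_rc == SUSPEND_CANCELLED:
        # Fast resume: hypervisor-side state is still valid, so nothing
        # needs to be rebuilt before devices are thawed.
        return ["thaw_devices", "resume_vcpus"]
    if suspend_rc == SUSPEND_RESUMED_NEW_DOMAIN:
        # Full resume: state tied to the old domain is gone and must be
        # re-established before devices can be used again.
        return [
            "remap_shared_info",
            "rebind_event_channels",
            "reinit_grant_tables",
            "reconnect_xenstore",
            "reconnect_pv_devices",
            "resume_vcpus",
        ]
    raise ValueError("unexpected suspend return value: %r" % (suspend_rc,))


if __name__ == "__main__":
    for rc in (SUSPEND_RESUMED_NEW_DOMAIN, SUSPEND_CANCELLED):
        print(rc, post_suspend_steps(rc))
```

Under this model, a guest that observed 1 after a real migration would skip
the reconnection steps. That mismatch is only one possibility to rule out
here, not a confirmed cause of these crashes.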

~Andrew
