
Re: [Xen-devel] Xen 4.6.1 crash with altp2m enabled by default


  • To: <andrew.cooper3@xxxxxxxxxx>, <JBeulich@xxxxxxxx>
  • From: <Kevin.Mayer@xxxxxxxx>
  • Date: Wed, 7 Sep 2016 08:35:42 +0000
  • Accept-language: de-DE, en-US
  • Cc: xen-devel@xxxxxxxxxxxxx
  • Delivery-date: Wed, 07 Sep 2016 08:35:58 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xen.org>
  • Thread-index: AQHR8V/Bf9t6T5jYSE6HlCkrclbAiaBQGgBggAS6loCAGQkB8A==
  • Thread-topic: Re: [Xen-devel] Xen 4.6.1 crash with altp2m enabled by default

Hi

I took the time to write a small script that restores and destroys domains
from provided state files.
Just apply the attached patch to Xen 4.6.1, provide some images and state
files, and start the script:

python VmStarter.py -FILE /path/to/domU-0.state -FILE /path/to/domU-1.state 
--loggingLevel DEBUG

You can provide an arbitrary number of state files, and the script will
start an additional thread for each one.
Each thread restores one guest domain from its state file, waits for a
random interval between 20 and 30 seconds (sleepTime = random.randint(20, 30)),
destroys the domain, and then starts the cycle again.

The guest domains and the corresponding state files need to have the same
name, since the script extracts the domain name from the state file name
(a sketch of the per-thread loop follows below).
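
For reference, here is a minimal sketch of the loop each worker thread runs.
This is an illustration only, not the attached VmStarter.py; in particular,
the use of the xl toolstack (xl restore / xl destroy) and the helper name
cycle_domain are my assumptions:

import os
import random
import subprocess
import threading
import time

def cycle_domain(state_file):
    # The domain name is derived from the state file name,
    # e.g. /path/to/domU-0.state -> domU-0 (naming convention above).
    name = os.path.splitext(os.path.basename(state_file))[0]
    while True:
        # Restore the guest from its saved state file.
        subprocess.check_call(["xl", "restore", state_file])
        # Wait 20-30 seconds, matching sleepTime = random.randint(20, 30).
        time.sleep(random.randint(20, 30))
        # Tear the guest down again, then start the cycle over.
        subprocess.check_call(["xl", "destroy", name])

# One worker thread per provided state file.
for f in ["/path/to/domU-0.state", "/path/to/domU-1.state"]:
    threading.Thread(target=cycle_domain, args=(f,)).start()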

When starting roughly one guest domain per physical CPU core, the crash
should occur within 5 to 10 minutes. Since the crashes are fairly random,
the hypervisor sometimes panics almost instantly and sometimes takes a
while, but the time to crash seems to correlate with the number of started
guest domains.
More domains => faster crash

Kevin

> -----Original Message-----
> From: Andrew Cooper [mailto:andrew.cooper3@xxxxxxxxxx]
> Sent: Monday, 22 August 2016 13:58
> To: Mayer, Kevin <Kevin.Mayer@xxxxxxxx>; JBeulich@xxxxxxxx
> Cc: xen-devel@xxxxxxxxxxxxx
> Subject: Re: [Xen-devel] Xen 4.6.1 crash with altp2m enabled by default
> 
> On 19/08/16 11:01, Kevin.Mayer@xxxxxxxx wrote:
> > Hi
> >
> > I took another look at Xen and a new crashdump.
> > The last successful __vmwrite should be in static void
> > vmx_vcpu_update_vmfunc_ve(struct vcpu *v) [...]
> >     __vmwrite(SECONDARY_VM_EXEC_CONTROL,
> >               v->arch.hvm_vmx.secondary_exec_control);
> > [...]
> > After this, altp2m_vcpu_destroy wakes up the vcpu and then finishes.
> >
> > In nestedhvm_vcpu_destroy (nvmx_vcpu_destroy) the vmcs can be
> > overwritten (but this path is not reached in our case, as far as I can
> > see):
> >     if ( nvcpu->nv_n1vmcx )
> >         v->arch.hvm_vmx.vmcs = nvcpu->nv_n1vmcx;
> >
> > In conclusion:
> > When destroying a domain, the altp2m_vcpu_destroy(v) path seems to
> > mess up the vmcs, which (only) sometimes leads to a failed __vmwrite
> > in vmx_fpu_leave.
> > That is as far as I can get with my understanding of the Xen code.
> >
> > Do you guys have any additional ideas what I could test / analyse?
> 
> Do you have easy reproduction instructions you could share?  Sadly, this is
> looking like an issue which isn't viable to debug over email.
> 
> ~Andrew


Attachment: xen-altp2menable.patch
Description: xen-altp2menable.patch

Attachment: VmStarter.py
Description: VmStarter.py
