
[Xen-devel] RE: mem_sharing: summarized problems when domain is dying



Hi:
 
       I found another bug while testing memory sharing.
       In this test, I start 24 Linux HVMs, each of which reboots via "xm reboot" every 30 minutes.
       After several hours, some of the HVMs crash. All of the crashed HVMs stop during boot.
       The bug still occurs even when I forbid page sharing by tricking tapdisk into believing that
       xc_memshr_nominate_gref() returned failure.
 
       No unusual log messages were found.
 
       I was able to dump the crash stack (below).
       What could cause this?
       Thanks.
 
PID: 2307   TASK: ffff810014166100  CPU: 0   COMMAND: "setfont"
 #0 [ffff8100123cd900] xen_panic_event at ffffffff88001d28
 #1 [ffff8100123cd920] notifier_call_chain at ffffffff80066eaa
 #2 [ffff8100123cd940] panic at ffffffff8009094a
 #3 [ffff8100123cda30] oops_end at ffffffff80064fca
 #4 [ffff8100123cda40] do_page_fault at ffffffff80066dc0
 #5 [ffff8100123cdb30] error_exit at ffffffff8005dde9
    [exception RIP: vgacon_do_font_op+363]
    RIP: ffffffff800515e5  RSP: ffff8100123cdbe8  RFLAGS: 00010203
    RAX: 0000000000000000  RBX: ffffffff804b3740  RCX: ffff8100000a03fc
    RDX: 00000000000003fd  RSI: ffff810011cec000  RDI: ffffffff803244c4
    RBP: ffff810011cec000   R8: d0d6999996000000   R9: 0000009090b0b0ff
    R10: 0000000000000000  R11: 0000000000000000  R12: 0000000000000004
    R13: 0000000000000001  R14: 0000000000000001  R15: 000000000000000e
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #6 [ffff8100123cdc20] vgacon_font_set at ffffffff8016bec5
 #7 [ffff8100123cdc60] con_font_op at ffffffff801aa86b
 #8 [ffff8100123cdcd0] vt_ioctl at ffffffff801a5af4
 #9 [ffff8100123cdd70] tty_ioctl at ffffffff80038a2c
#10 [ffff8100123cdeb0] do_ioctl at ffffffff800420d9
#11 [ffff8100123cded0] vfs_ioctl at ffffffff800302ce
#12 [ffff8100123cdf40] sys_ioctl at ffffffff8004c766
#13 [ffff8100123cdf80] tracesys at ffffffff8005d28d (via system_call)
    RIP: 00000039294cc557  RSP: 00007fff54c4aec8  RFLAGS: 00000246
    RAX: ffffffffffffffda  RBX: ffffffff8005d28d  RCX: ffffffffffffffff
    RDX: 00007fff54c4aee0  RSI: 0000000000004b72  RDI: 0000000000000003
    RBP: 000000001d747ab0   R8: 0000000000000010   R9: 0000000000800000
    R10: 0000000000000000  R11: 0000000000000246  R12: 0000000000000010
    R13: 0000000000000200  R14: 0000000000000008  R15: 0000000000000008
    ORIG_RAX: 0000000000000010  CS: 0033  SS: 002b

 
> Date: Fri, 21 Jan 2011 14:45:14 -0500
> Subject: Re: mem_sharing: summarized problems when domain is dying
> From: juihaochiang@xxxxxxxxx
> To: Tim.Deegan@xxxxxxxxxx
> CC: tinnycloud@xxxxxxxxxxx; xen-devel@xxxxxxxxxxxxxxxxxxx
>
> Hi
>
> On Fri, Jan 21, 2011 at 11:19 AM, Jui-Hao Chiang <juihaochiang@xxxxxxxxx> wrote:
> > Hi, Tim:
> >
> > Based on tinnycloud's results, here is a summary of the current problems
> > and findings in mem_sharing caused by domain dying.
> > (1) When a domain is dying, alloc_domheap_page() and
> > set_shared_p2m_entry() simply fail. So shr_lock is not enough
> > to ensure that the domain won't die in the middle of the mem_sharing code.
> > As tinnycloud's code shows, would it be better to call
> > rcu_lock_domain_by_id() before calling the above two functions?
> >
>
> There seems to be no good locking to protect a domain from changing its
> is_dying state. So the unshare function could fail midway at
> several points, e.g., in alloc_domheap_page and set_shared_p2m_entry.
> If that's the case, we need to add some checks, and probably roll back
> what we have already done if is_dying changes in the middle.
>
> Any comments?
>
> Jui-Hao
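The check-and-roll-back approach Jui-Hao describes can be sketched as a small, self-contained C model of the unshare path. Everything in it is hypothetical: struct domain, model_alloc_page(), model_set_p2m() and model_unshare() are stand-ins, not Xen's real alloc_domheap_page()/set_shared_p2m_entry() code, and the die_at_step field exists only to simulate domain teardown racing with the operation. The reference count models what rcu_lock_domain_by_id() provides; as we understand it, that keeps the domain structure from being freed but does not stop is_dying from being set, so each step still checks for failure and undoes earlier work.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model types; NOT Xen's struct domain or p2m code. */
struct domain {
    bool is_dying;     /* may be set by teardown at any point */
    int refcnt;        /* references held by in-flight operations */
    int pages_owned;   /* pages currently allocated to the domain */
    long p2m_entry;    /* one-entry stand-in for the p2m (-1 = still shared) */
    int die_at_step;   /* test hook: mark dying just before this step (0 = never) */
};

/* Test hook simulating teardown racing with the unshare path. */
static void maybe_die(struct domain *d, int step)
{
    if (d->die_at_step == step)
        d->is_dying = true;
}

/* Models alloc_domheap_page(): refuses to allocate for a dying domain. */
static long model_alloc_page(struct domain *d)
{
    if (d->is_dying)
        return -1;
    d->pages_owned++;
    return 1000 + d->pages_owned;   /* fake frame number */
}

static void model_free_page(struct domain *d, long mfn)
{
    (void)mfn;
    assert(d->pages_owned > 0);
    d->pages_owned--;
}

/* Models set_shared_p2m_entry(): refuses to update a dying domain. */
static int model_set_p2m(struct domain *d, long mfn)
{
    if (d->is_dying)
        return -EINVAL;
    d->p2m_entry = mfn;
    return 0;
}

/* Unshare one page, undoing partial work if the domain starts dying midway. */
static int model_unshare(struct domain *d)
{
    long mfn;
    int rc;

    maybe_die(d, 1);
    if (d->is_dying)            /* early check before doing any work */
        return -EINVAL;
    d->refcnt++;                /* stands in for rcu_lock_domain_by_id() */

    maybe_die(d, 2);
    mfn = model_alloc_page(d);
    if (mfn < 0) {
        rc = -ENOMEM;
        goto out;
    }

    maybe_die(d, 3);
    rc = model_set_p2m(d, mfn);
    if (rc != 0)
        model_free_page(d, mfn);  /* roll back the allocation */

out:
    d->refcnt--;                /* stands in for rcu_unlock_domain() */
    return rc;
}
```

With die_at_step = 3 the allocation succeeds but the p2m update fails, and the sketch frees the new page before returning, so the domain's page count and p2m entry are left unchanged. A real fix would also have to undo anything else the unshare path touches (for example reference counts on the shared page), which this model leaves out.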
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel
