
Re: [Xen-devel] pvops: Does PVOPS guest os support online "suspend/resume"



On Thu, Aug 08, 2013 at 02:23:06PM +0000, Gonglei (Arei) wrote:
> Hi all,
> 
> When we suspend and resume a running PVOPS guest OS, its block/net I/O
> gets stuck. Non-PVOPS guest OSes do not have this problem.
> 

With what version of Linux is this? Have you tried with v3.10?

Thanks.
> How reproducible:
> -------------------
> 1/1
> 
> Steps to reproduce:
> ------------------
>   1) Suspend the guest OS.
>      Note: do not migrate or shut down the guest OS.
>   2) Resume the guest OS.
> 
> (Consider rolling back (resuming) a guest while it is being core-dumped
> (suspended): this problem would leave the guest OS inoperable.)
> 
> ====================================================================
> We found these warning messages in the guest OS:
> --------------------------------------------------------------------
> Aug  2 10:17:34 localhost kernel: [38592.985159] platform pcspkr: resume
> Aug  2 10:17:34 localhost kernel: [38592.989890] platform vesafb.0: resume
> Aug  2 10:17:34 localhost kernel: [38592.996075] input input0: type resume
> Aug  2 10:17:34 localhost kernel: [38593.001330] input input1: type resume
> Aug  2 10:17:34 localhost kernel: [38593.005496] vbd vbd-51712: legacy resume
> Aug  2 10:17:34 localhost kernel: [38593.011506] WARNING: g.e. still in use!
> Aug  2 10:17:34 localhost kernel: [38593.016909] WARNING: leaking g.e. and page still in use!
> Aug  2 10:17:34 localhost kernel: [38593.026204] xen vbd-51760: legacy resume
> Aug  2 10:17:34 localhost kernel: [38593.033070] vif vif-0: legacy resume
> Aug  2 10:17:34 localhost kernel: [38593.039327] WARNING: g.e. still in use!
> Aug  2 10:17:34 localhost kernel: [38593.045304] WARNING: leaking g.e. and page still in use!
> Aug  2 10:17:34 localhost kernel: [38593.052101] WARNING: g.e. still in use!
> Aug  2 10:17:34 localhost kernel: [38593.057965] WARNING: leaking g.e. and page still in use!
> Aug  2 10:17:34 localhost kernel: [38593.066795] serial8250 serial8250: resume
> Aug  2 10:17:34 localhost kernel: [38593.073556] input input2: type resume
> Aug  2 10:17:34 localhost kernel: [38593.079385] platform Fixed MDIO bus.0: resume
> Aug  2 10:17:34 localhost kernel: [38593.086285] usb usb1: type resume
> ------------------------------------------------------
> 
> which means that we are ending access to a grant-table entry while it is still in use.
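To make the two WARNING lines concrete, here is a simplified, hypothetical model (not the Linux or Xen source) of what they indicate: the frontend tries to revoke a grant while the backend still has the page mapped for reading or writing, so the entry cannot be released and is leaked instead. The struct, the flag values and the function name are assumptions for illustration; the flags only echo the spirit of Xen's GTF_* bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical, simplified grant-table entry. Flag values are illustrative. */
#define MODEL_GTF_PERMIT_ACCESS 0x1u  /* frontend has granted access */
#define MODEL_GTF_READING       0x8u  /* backend is currently reading */
#define MODEL_GTF_WRITING       0x10u /* backend is currently writing */

struct model_grant_entry {
    uint16_t flags;
};

/* Try to revoke a grant. Returns true if access was ended. If the backend
 * still uses the page, print warnings like the log above and return false:
 * the entry (and its page) is leaked rather than reused. */
static bool model_end_foreign_access(struct model_grant_entry *e)
{
    if (e->flags & (MODEL_GTF_READING | MODEL_GTF_WRITING)) {
        printf("WARNING: g.e. still in use!\n");
        printf("WARNING: leaking g.e. and page still in use!\n");
        return false;
    }
    e->flags = 0; /* entry is free again */
    return true;
}
```

An idle entry is released cleanly; an entry the backend is still writing to triggers both warnings and stays allocated.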
> 
> The cause lies in the suspend/resume code:
> --------------------------------------------------------
> //drivers/xen/manage.c
> static void do_suspend(void)
> {
>       int err;
>       struct suspend_info si;
> 
>       shutting_down = SHUTDOWN_SUSPEND;
> 
>       ...
>       err = dpm_suspend_start(PMSG_FREEZE);
>       ...
>       dpm_resume_start(si.cancelled ? PMSG_THAW : PMSG_RESTORE);
> 
>       if (err) {
>               pr_err("failed to start xen_suspend: %d\n", err);
>               si.cancelled = 1;
>       }
> //NOTE: si.cancelled = 1
> 
> out_resume:
>       if (!si.cancelled) {
>               xen_arch_resume();   
>               xs_resume();
>       } else
>               xs_suspend_cancel();
> 
>       dpm_resume_end(si.cancelled ? PMSG_THAW : PMSG_RESTORE);  //blkfront device got resumed here.
> 
> out_thaw:
> #ifdef CONFIG_PREEMPT
>       thaw_processes();
> out:
> #endif
>       shutting_down = SHUTDOWN_INVALID;
> }
> ------------------------------------
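The branch at out_resume can be summarised with a small, hypothetical model (plain C, not kernel code): whichever way si.cancelled is set, dpm_resume_end() is reached; only the xenstore call and the PM message (PMSG_THAW vs PMSG_RESTORE) change. The enum, struct and function names are invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the calls made after out_resume in do_suspend(). */
enum model_xs_call { MODEL_XS_RESUME, MODEL_XS_SUSPEND_CANCEL };
enum model_pm_msg { MODEL_PMSG_RESTORE, MODEL_PMSG_THAW };

struct model_resume_trace {
    bool arch_resume;           /* xen_arch_resume() would run */
    enum model_xs_call xs_call; /* xs_resume() or xs_suspend_cancel() */
    bool dpm_resume_end;        /* dpm_resume_end() would run */
    enum model_pm_msg pm_msg;   /* message passed to dpm_resume_end() */
};

/* Mirror the out_resume branch of the excerpt for a given si.cancelled. */
static struct model_resume_trace model_out_resume(bool cancelled)
{
    struct model_resume_trace t = {0};

    if (!cancelled) {
        t.arch_resume = true;
        t.xs_call = MODEL_XS_RESUME;
    } else {
        t.xs_call = MODEL_XS_SUSPEND_CANCEL;
    }

    /* Unconditional in the excerpt: the device resume pass runs either way. */
    t.dpm_resume_end = true;
    t.pm_msg = cancelled ? MODEL_PMSG_THAW : MODEL_PMSG_RESTORE;
    return t;
}
```

With cancelled == true the model still reports dpm_resume_end as called, which matches the comment in the excerpt that blkfront is resumed at that point.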
> 
> dpm_suspend_start() suspends devices, and dpm_resume_end() resumes them.
> However, we found that the blkfront driver has a RESUME method but no
> SUSPEND method.
> 
> -------------------------------------
> //drivers/block/xen-blkfront.c
> static DEFINE_XENBUS_DRIVER(blkfront, ,
>       .probe = blkfront_probe,
>       .remove = blkfront_remove,
>       .resume = blkfront_resume,  // only RESUME method found here.
>       .otherend_changed = blkback_changed,
>       .is_ready = blkfront_is_ready,
> );
> --------------------------------------
> 
> So the blkfront device is resumed even though it was never suspended,
> which causes the problem above.
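A minimal, hypothetical model of this asymmetry: a PM core that invokes each callback only when the driver provides it, and a device that counts resumes arriving without a preceding suspend. With a driver shaped like blkfront (resume only), one suspend/resume cycle produces such a resume; with paired callbacks it does not. All names here are invented; this is not the kernel's dpm_* code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device that tracks whether it was really suspended. */
struct model_dev {
    bool suspended;
    int bad_resumes; /* resumes that arrived without a prior suspend */
};

/* Hypothetical driver ops; either callback may be NULL. */
struct model_driver {
    void (*suspend)(struct model_dev *d);
    void (*resume)(struct model_dev *d);
};

static void model_suspend(struct model_dev *d)
{
    d->suspended = true;
}

static void model_resume(struct model_dev *d)
{
    if (!d->suspended)
        d->bad_resumes++; /* the situation described above */
    d->suspended = false;
}

/* Model PM core: each phase calls the callback only if it exists. */
static void model_pm_cycle(const struct model_driver *drv, struct model_dev *d)
{
    if (drv->suspend)
        drv->suspend(d);
    if (drv->resume)
        drv->resume(d);
}
```

For example, `struct model_driver blkfront_like = { NULL, model_resume };` leaves bad_resumes at 1 after one cycle, while `{ model_suspend, model_resume }` leaves it at 0.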
> 
> 
> =========================================
> To check whether the problem lies in PVOPS or in the hypervisor (Xen)/dom0,
> we suspended and resumed other, non-PVOPS guest OSes; no such problem occurred.
> 
> Non-PVOPS guests use their own Xen drivers, such as this code in
> https://github.com/jpaton/xen-4.1-LJX1/blob/master/unmodified_drivers/linux-2.6/platform-pci/machine_reboot.c
> 
> int __xen_suspend(int fast_suspend, void (*resume_notifier)(int))
> {
>     int err, suspend_cancelled, nr_cpus;
>     struct ap_suspend_info info;
> 
>     xenbus_suspend();
> 
>     ...
>     preempt_enable();
> 
>     if (!suspend_cancelled)
>         xenbus_resume();     //when the guest OS is resumed, suspend_cancelled == 1, so it doesn't enter xenbus_resume_uvp here.
>     else
>         xenbus_suspend_cancel();  //It gets here, so blkfront is not resumed.
> 
>     return 0;
> }
> 
> 
> In non-PVOPS guest OSes, blkfront has no SUSPEND method either, but their
> Xen driver does not resume the blkfront device in this case, so they have
> no problem after suspend/resume.
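Side by side, the difference described in this mail reduces to one condition. This hypothetical sketch only encodes the behaviour as reported here (it is not derived from either driver beyond the excerpts above): the PVOPS path reaches dpm_resume_end() on both branches, while __xen_suspend() calls xenbus_resume() only when the suspend was not cancelled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical: does the blkfront resume callback run at the end of the
 * PVOPS do_suspend() path, as reported in this mail? */
static bool model_pvops_resumes_blkfront(bool cancelled)
{
    (void)cancelled; /* dpm_resume_end() is reached on both branches */
    return true;
}

/* Hypothetical: same question for the non-PVOPS __xen_suspend() path,
 * which calls xenbus_resume() only if the suspend was not cancelled. */
static bool model_classic_resumes_blkfront(bool cancelled)
{
    return !cancelled;
}
```

The two models agree when the suspend completes and differ only in the cancelled case, which is exactly the suspend-then-resume-in-place scenario reported above.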
> 
> 
> I'm wondering why the two kinds of driver (PVOPS and non-PVOPS) differ
> here. Is it because:
> 1) the PVOPS kernel doesn't take this situation into account and has a bug here?
> or
> 2) PVOPS has some other way to avoid this problem?
> 
> Thank you in advance.
> 
> -Gonglei
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@xxxxxxxxxxxxx
> http://lists.xen.org/xen-devel
