
Re: [Xen-devel] pvops: Does PVOPS guest os support online "suspend/resume"



Hi,

> -----Original Message-----
> From: Konrad Rzeszutek Wilk [mailto:konrad.wilk@xxxxxxxxxx]
> Sent: Monday, August 12, 2013 8:50 PM
> To: Gonglei (Arei)
> Cc: xen-devel@xxxxxxxxxxxxx; Zhangbo (Oscar); Luonengjun;
> ian.campbell@xxxxxxxxxx; stefano.stabellini@xxxxxxxxxxxxx; rjw@xxxxxxx;
> rshriram@xxxxxxxxx; Yanqiangjun; Jinjian (Ken)
> Subject: Re: [Xen-devel] pvops: Does PVOPS guest os support online
> "suspend/resume"
> 
> On Sat, Aug 10, 2013 at 08:29:43AM +0000, Gonglei (Arei) wrote:
> >
> >
> > > -----Original Message-----
> > > From: Konrad Rzeszutek Wilk [mailto:konrad.wilk@xxxxxxxxxx]
> > > Sent: Friday, August 09, 2013 3:17 AM
> > > To: Gonglei (Arei)
> > > Cc: xen-devel@xxxxxxxxxxxxx; Zhangbo (Oscar); Luonengjun; Hanweidong
> > > Subject: Re: [Xen-devel] pvops: Does PVOPS guest os support online
> > > "suspend/resume"
> > >
> > > On Thu, Aug 08, 2013 at 02:23:06PM +0000, Gonglei (Arei) wrote:
> > > > Hi all,
> > > >
> > > > While suspend and resume a PVOPS guest os while it's running, we found that
> > > > it would get its block/net io stucked. However, non-PVOPS guest os has no such
> > > > problem.
> > > >
> > >
> > > With what version of Linux is this? Have you tried with v3.10?
> >
> > Thanks for responding. We've tried kernel "3.5.0-17 generic" (ubuntu 12.10),
> > the problem still exists.
> 
> So you have not tried v3.10. v3.5 is ancient from the upstream perspective.
> 
Thank you, I hadn't noticed that. I will try 3.10 later.

> > Although we are not sure about the result about kernel 3.10, but suspiciously
> > it would also have the same problem.
> 
> Potentially. There were fixes added in 3.5:
> 
> commit 569ca5b3f94cd0b3295ec5943aa457cf4a4f6a3a
> Author: Jan Beulich <JBeulich@xxxxxxxx>
> Date:   Thu Apr 5 16:10:07 2012 +0100
> 
>     xen/gnttab: add deferred freeing logic
> 
>     Rather than just leaking pages that can't be freed at the point where
>     access permission for the backend domain gets revoked, put them on a
>     list and run a timer to (infrequently) retry freeing them. (This can
>     particularly happen when unloading a frontend driver when devices are
>     still present, and the backend still has them in non-closed state or
>     hasn't finished closing them yet.)
> 
> and that seems to be triggered.

I tried applying this patch, but it did not fix the problem: the kernel keeps
retrying to free the leaked pages and never succeeds. The following message
keeps being printed every second or so:
"WARNING: leaking g.e. and page still in use!"
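
To illustrate why the message never stops, here is a toy Python model of a
deferred-free list with a retry timer. It is only a sketch, not the kernel
code: the Grant class, retry_deferred() and the in_use_by_backend flag are
hypothetical names, and the model simply assumes that the backend never drops
its mapping after the checkpoint, which is exactly what we have not confirmed.

```python
from collections import deque


class Grant:
    """Hypothetical stand-in for a grant entry; not a kernel structure."""

    def __init__(self, ref, in_use_by_backend):
        self.ref = ref
        self.in_use_by_backend = in_use_by_backend


def retry_deferred(deferred, log):
    """One timer tick: drop entries that can be freed, requeue the rest."""
    still_pending = deque()
    while deferred:
        grant = deferred.popleft()
        if grant.in_use_by_backend:
            # Assumption: the backend still maps the page, so we warn and retry later.
            log.append("WARNING: leaking g.e. and page still in use!")
            still_pending.append(grant)
        # Otherwise the page would be freed here and the entry forgotten.
    return still_pending


log = []
pending = deque([Grant(8, True), Grant(9, False)])
for _ in range(3):  # three timer ticks
    pending = retry_deferred(pending, log)
```

In this model grant 9 is released on the first tick, while grant 8 stays on the
list and produces one warning per tick for as long as the backend holds it.
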
> >
> > Xen version:  4.3.0
> >
> > Another method to reproduce:
> > 1) xl create dom1.cfg
> > 2) xl save -c dom1 /path/to/save/file
> >    (-c  Leave domain running after creating the snapshot.)
> >
> > As I mentioned before, the problem occurs because PVOPS guest os
> > RESUMEes blkfront when the guest resumes.
> > The "blkfront_resume" method seems unnecessary here.
> 
> It has to do that otherwise it can't replay the I/Os that might not have
> hit the platter when it migrated from the original host.
> 
> But you are exercising the case where it does a checkpoint,
> not a full save/restore cycle.
> 
> In which case you might be indeed hitting a bug.

If we added a suspend method to blkfront that moves the frontend/backend block
device states from {XenbusStateConnected, XenbusStateConnected} to
{XenbusStateInitialising, XenbusStateInitWait} when the guest OS is suspended,
would that cause any problems? We found that the Windows Xen PV driver does
this, and we hope the same approach would solve the problem here.
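
To make the proposal concrete, here is a toy Python model of the transition we
have in mind. It is a sketch only: the enum values follow the XenbusState
definitions in the Xen public headers, but proposed_suspend() is a hypothetical
helper, not an existing blkfront function. In a real driver the frontend would
only write its own state and the backend would react to it; the function just
returns the target pair.

```python
from enum import IntEnum


class XenbusState(IntEnum):
    """Mirror of the XenbusState values from the Xen public xenbus header."""
    Unknown = 0
    Initialising = 1
    InitWait = 2
    Initialised = 3
    Connected = 4
    Closing = 5
    Closed = 6
    Reconfiguring = 7
    Reconfigured = 8


def proposed_suspend(frontend, backend):
    """Hypothetical suspend hook: return the (frontend, backend) target states."""
    if (frontend, backend) != (XenbusState.Connected, XenbusState.Connected):
        raise ValueError("device pair is not connected")
    return XenbusState.Initialising, XenbusState.InitWait


front, back = proposed_suspend(XenbusState.Connected, XenbusState.Connected)
```
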
> 
> > non-PVOPS guest os doesn't RESUME blkfront, thus they works fine.
> 
> Potentially. The non-PVOPS guests are based on an ancient kernels and
> the upstream logic in the generic suspend/resume machinery has also
> changed.
> 
> >
> > So, here comes the 2 questions, is the problem caused because:
> > 1) PVOPS kernel doesn't take this situation into accont, and has a bug here?
> > or
> > 2) PVOPS has other ways to avoid such problem?
> 
> Just to make sure I am not confused here. The problem does not
> appear if you do NOT use -c, correct?

Yes. The purpose of using "-c" here is to do an ONLINE suspend/resume. The
problem occurs only with ONLINE suspend/resume, not with OFFLINE
suspend/resume. To be precise, here are two examples:

  <1>
  1) xl create dom1.cfg
  2) xl save -c dom1 /opt/dom1.save
     After this, the dom1 guest OS has its I/O stuck, which means ONLINE
     suspend/resume is broken.
  3) xl destroy dom1
  4) xl restore /opt/dom1.save
     The restored dom1 works fine, which means OFFLINE suspend/resume is OK.

  <2>
  1) xl create dom1.cfg
  2) xl save dom1 /opt/dom1.save
     No "-c" here, so the guest dom1 is destroyed automatically.
  3) xl restore /opt/dom1.save
     The restored dom1 works fine, which means OFFLINE suspend/resume is OK.
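
In case it helps anyone reproduce this, the two sequences above can be written
as a small Python helper. This is just a sketch: repro_commands() and run() are
hypothetical helper names, only the xl subcommands listed above are used, and
by default nothing is executed (dry run), so the commands are only printed.

```python
import subprocess


def repro_commands(cfg, domain, save_file, online):
    """Build the xl command sequence for the online (<1>) or offline (<2>) case."""
    cmds = [["xl", "create", cfg]]
    if online:
        # "-c" leaves the domain running; its I/O should be checked after this step.
        cmds.append(["xl", "save", "-c", domain, save_file])
        cmds.append(["xl", "destroy", domain])
    else:
        # Without "-c" the domain is destroyed automatically after the save.
        cmds.append(["xl", "save", domain, save_file])
    cmds.append(["xl", "restore", save_file])
    return cmds


def run(cmds, dry_run=True):
    """Print the commands, or execute them when dry_run is False."""
    for cmd in cmds:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


run(repro_commands("dom1.cfg", "dom1", "/opt/dom1.save", online=True))
```
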

-Gonglei
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

