Re: [Xen-devel] [PATCH 07/19] xen: credit2: prevent load balancing to go mad if time goes backwards
On Mon, Jun 20, 2016 at 9:02 AM, Jan Beulich <JBeulich@xxxxxxxx> wrote:
>>>> On 18.06.16 at 01:12, <dario.faggioli@xxxxxxxxxx> wrote:
>> This really should not happen, but:
>>  1. it does happen! Investigation is ongoing here:
>>     http://lists.xen.org/archives/html/xen-devel/2016-06/msg00922.html
>>  2. even when 1 will be fixed it makes sense and is easy enough
>>     to have a 'safety catch' for it.
>>
>> The reason why this is particularly bad for Credit2 is that
>> negative values of delta mean out of scale high load (because
>> of the conversion to unsigned). This, for instance in the
>> case of runqueue load, results in a runqueue having its load
>> updated to values of the order of 10000% or so, which in turns
>> means that the load balancer will migrate everything off from
>> the pCPUs in the runqueue, and leave them idle until the load
>> gets back to something sane... which may indeed take a while!
>>
>> This is not a fix for the problem of time going backwards. In
>> fact, if that happens a lot, load tracking accuracy is still
>> compromized, but at least the effect is a lot less bad than
>> before.
>>
>> Signed-off-by: Dario Faggioli <dario.faggioli@xxxxxxxxxx>
>> ---
>> Cc: George Dunlap <george.dunlap@xxxxxxxxxx>
>> Cc: Anshul Makkar <anshul.makkar@xxxxxxxxxx>
>> Cc: David Vrabel <david.vrabel@xxxxxxxxxx>
>> ---
>>  xen/common/sched_credit2.c | 12 ++++++++++++
>>  1 file changed, 12 insertions(+)
>>
>> diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
>> index 50f8dfd..b73d034 100644
>> --- a/xen/common/sched_credit2.c
>> +++ b/xen/common/sched_credit2.c
>> @@ -404,6 +404,12 @@ __update_runq_load(const struct scheduler *ops,
>>      else
>>      {
>>          delta = now - rqd->load_last_update;
>> +        if ( unlikely(delta < 0) )
>> +        {
>> +            d2printk("%s: Time went backwards? now %"PRI_stime" llu
>> %"PRI_stime"\n",
>> +                     __func__, now, rqd->load_last_update);
>> +            delta = 0;
>> +        }
>>
>>          rqd->avgload =
>>              ( ( delta * ( (unsigned long long)rqd->load <<
>> prv->load_window_shift ) )
>> @@ -455,6 +461,12 @@ __update_svc_load(const struct scheduler *ops,
>>      else
>>      {
>>          delta = now - svc->load_last_update;
>> +        if ( unlikely(delta < 0) )
>> +        {
>> +            d2printk("%s: Time went backwards? now %"PRI_stime" llu
>> %"PRI_stime"\n",
>> +                     __func__, now, svc->load_last_update);
>> +            delta = 0;
>> +        }
>>
>>          svc->avgload =
>>              ( ( delta * ( (unsigned long long)vcpu_load <<
>> prv->load_window_shift ) )
>
> Do the absolute times really matter here? I.e. wouldn't it be more
> useful to simply log the value of delta?
>
> Also, may I ask you to use the L modifier in favor of the ll one, for
> being one byte shorter (and hence, even if just very slightly,
> reducing both image size and cache pressure)?
>
> And finally, instead of logging function names, could the two
> messages be made distinguishable by other means resulting in less
> data issued to the log (and potentially needing transmission over
> a slow serial line)?

The reason this is under a "d2printk" is because it's really only to
help developers in debugging. In-tree this warning isn't even on with
debug=y; you have to go to the top of the file and change the #define
to make it even exist.

Given that, I don't think the quibbles over the code size or the
length of what's logged really matter. I think we should just take it
as it is.

Reviewed-by: George Dunlap <george.dunlap@xxxxxxxxxx>

 -George

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel
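
For readers following the thread, the "conversion to unsigned" problem from the patch description can be illustrated outside of Xen. This is only a minimal sketch, not the scheduler code: `stime_sketch_t`, `delta_unchecked()` and `delta_clamped()` are hypothetical names made up for this example, and `int64_t` is assumed as a stand-in for the signed time type used in the patch.

```c
#include <stdint.h>

/* Hypothetical stand-in for the signed time type in the patch
 * (assumed here to be a signed 64-bit nanosecond count). */
typedef int64_t stime_sketch_t;

/* Without the check: a negative difference, once it takes part in
 * unsigned arithmetic, is reinterpreted as an enormous positive value,
 * which is what makes the computed load go far out of scale. */
unsigned long long delta_unchecked(stime_sketch_t now, stime_sketch_t last)
{
    return (unsigned long long)(now - last);
}

/* With the 'safety catch' from the patch: if time appears to have gone
 * backwards, treat the elapsed time as zero instead. */
stime_sketch_t delta_clamped(stime_sketch_t now, stime_sketch_t last)
{
    stime_sketch_t delta = now - last;

    if (delta < 0)
        delta = 0;

    return delta;
}
```

With `now = 1000` and `last = 1500`, the unchecked version yields a value close to the top of the unsigned range, while the clamped version yields 0. As the patch description says, this does not fix time going backwards; it only limits the damage to the load average.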