
Re: [Xen-devel] Regression, host crash with 4.5rc1



Jan-

No, I have no knowledge of any unpublished erratum related to C-state issues.

--
Don Dugger
"Censeo Toto nos in Kansa esse decisse." - D. Gale
Ph: 303/443-3786

-----Original Message-----
From: Jan Beulich [mailto:JBeulich@xxxxxxxx] 
Sent: Thursday, November 27, 2014 2:28 AM
To: Steve Freitas; Dugger, Donald D; Nakajima, Jun
Cc: xen-devel@xxxxxxxxxxxxx; Don Slutz
Subject: Re: [Xen-devel] Regression, host crash with 4.5rc1

>>> On 27.11.14 at 06:29, <sflist@xxxxxxxxx> wrote:
> On 11/25/2014 03:00 AM, Jan Beulich wrote:
>> Okay, so it's not really the mwait-idle driver causing the 
>> regression, but it is C-state related. Hence we're now down to seeing 
>> whether all or just the deeper C states are affected, i.e. I now need 
>> to ask you to play with "max_cstate=". For that you'll have to 
>> remember that the option's effect differs between the ACPI and the MWAIT 
>> idle drivers.
>> In the spirit of bisection I'd suggest using "max_cstate=2" first no 
>> matter which of the two scenarios you pick. If that still hangs, 
>> "max_cstate=1" obviously is the only other thing to try. Should that 
>> not hang (and you left out "mwait-idle=0"), trying "max_cstate=3"
>> in that same scenario would be the other case to check.
>>
>> No need for 'd' and 'a' output for the time being, but 'c' output 
>> would be much appreciated for all cases where you observe hangs.
>>
> 
> Okay, working through that now. I tried max_cstate=2 and got no hangs, 
> whether with or without mwait-idle=0. However, I was puzzled by this:
> 
> (XEN) 'c' pressed -> printing ACPI Cx structures
> (XEN) ==cpu0==
> (XEN) active state:             C0
> (XEN) max_cstate:               C2
> (XEN) states:
> (XEN)     C1:   type[C1] latency[003] usage[12219860] method[  FFH] 
> duration[1190961948551]
> (XEN)     C2:   type[C1] latency[010] usage[10205554] method[  FFH] 
> duration[2015393965907]
> (XEN)     C3:   type[C2] latency[020] usage[50926286] method[  FFH] 
> duration[30527997858148]
> (XEN)    *C0:   usage[73351700] duration[9974627547595]
> (XEN) max=0 pwr=0 urg=0 nxt=0
> (XEN) PC2[0] PC3[8589642315848] PC6[0] PC7[0]
> (XEN) CC3[28794734145697] CC6[0] CC7[0]
> (XEN) ==cpu1==
> (XEN) active state:             C3
> (XEN) max_cstate:               C2
> (XEN) states:
> (XEN)     C1:   type[C1] latency[003] usage[10699950] method[  FFH] 
> duration[1141422044112]
> (XEN)     C2:   type[C1] latency[010] usage[06382904] method[  FFH] 
> duration[1329739264322]
> (XEN)    *C3:   type[C2] latency[020] usage[44630764] method[  FFH] 
> duration[31676618425954]
> (XEN)     C0:   usage[61713618] duration[9561201640320]
> (XEN) max=0 pwr=0 urg=0 nxt=0
> (XEN) PC2[0] PC3[8589642315848] PC6[0] PC7[0]
> (XEN) CC3[30066495105056] CC6[0] CC7[0] [...]
> 
> Why would some of the cores be in C3 even though they list max_cstate as C2?

This is precisely why I told you that the numbering differs (and is confusing,
and has nothing to do with actual C-state numbers): what max_cstate refers to
in the mwait-idle driver is what is listed above as type[Cx], i.e. the state at
index 1 is C1, at index 2 we've got C1E, and at index 3 we've got C2. Those
still aren't in line with the numbering the CPU documentation uses; they are
rather meant to refer to the ACPI numbering (but probably don't fully match
that either).
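
To make the mapping concrete, here is a minimal Python sketch (not Xen code) that reads the state lines of a 'c' dump and lists which entries stay usable for a given max_cstate. It assumes the rule described above, namely that the mwait-idle driver compares max_cstate against the type[Cx] value rather than the index; the comparison "type <= max_cstate" is an interpretation of that description and has not been checked against the Xen source. The names STATE_RE and states_allowed are hypothetical.

```python
import re

# Matches state lines of the 'c' debug-key output, for example:
#   (XEN)     C3:   type[C2] latency[020] usage[50926286] method[  FFH]
# The C0 line has no type[...] field and is therefore skipped.
STATE_RE = re.compile(r"\*?C(\d+):\s+type\[C(\d+)\]")


def states_allowed(dump_lines, max_cstate):
    """Return (index, type) pairs whose type[Cx] value is <= max_cstate.

    Assumption: max_cstate limits the type[Cx] value, not the index.
    """
    allowed = []
    for line in dump_lines:
        m = STATE_RE.search(line)
        if m:
            index, ctype = int(m.group(1)), int(m.group(2))
            if ctype <= max_cstate:
                allowed.append((index, ctype))
    return allowed


# State lines for cpu0, copied from the dump quoted earlier in this mail.
sample = [
    "(XEN)     C1:   type[C1] latency[003] usage[12219860] method[  FFH]",
    "(XEN)     C2:   type[C1] latency[010] usage[10205554] method[  FFH]",
    "(XEN)     C3:   type[C2] latency[020] usage[50926286] method[  FFH]",
    "(XEN)    *C0:   usage[73351700] duration[9974627547595]",
]

print(states_allowed(sample, 2))  # [(1, 1), (2, 1), (3, 2)]
print(states_allowed(sample, 1))  # [(1, 1), (2, 1)]
```

Under that assumption, max_cstate=2 still admits the entry at index 3 because its type is C2, which would explain a CPU showing "active state: C3" next to "max_cstate: C2".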

So max_cstate=2 working suggests a problem with what the CPU calls C6, which
presumably isn't all that surprising considering the many errata (BD35, BD38,
BD40, BD59, BD87, and BD104). Not sure how to proceed from here - I suppose you
already made sure you run with the latest available BIOS. And with 6 errata
documented it's not all that unlikely that there's a 7th one affecting
MONITOR/MWAIT behavior. The commit you bisected to (and which you had verified
to be the culprit by just forcing arch_skip_send_event_check() to always return
false) could reasonably be assumed to be broken only when MWAIT use for all C
states didn't work.

Don, Jun - is there anything known but not yet publicly documented for Family 6 
Model 44 Xeons?

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

