
Re: [Xen-devel] [PATCH] x86/nmi: lower initial watchdog frequency to avoid boot hangs

>>> On 07.02.18 at 14:24, <andrew.cooper3@xxxxxxxxxx> wrote:
> On 07/02/18 13:08, Jan Beulich wrote:
>>>>> On 07.02.18 at 14:01, <igor.druzhinin@xxxxxxxxxx> wrote:
>>> So far the issue confirmed:
>>> Dell PowerEdge R740, Huawei systems based on Xeon Gold 6152 (the one
>>> that it was tested on), Intel S2600XX, etc.
>>> Also see:
>>> https://bugs.xenserver.org/browse/XSO-774 
>>> Well, no-watchdog is what we currently recommend in that case but we
>>> hoped there is a general solution here from Xen side. You have your
>>> point that they should fix this on their side because it's their fault
>>> indeed. But the user experience is also important for us I think.
>> Of course, hence the suggestion of possible alternative workarounds.
>> Impacting everyone is, as said, not a desirable approach in a case
>> like this one. I also continue to dislike the seemingly random division
>> by 10.
> Xen's usability is crap, which is in large part due to attitude like
> this.  It is not ok to expect the end user to know how to diagnose/debug
> issues like this, and it is entirely unreasonable to expect the end user
> to have to manually work around it.

Excuse me? The watchdog is off by default. Anyone turning it on
ought to know what they are doing. You (iirc) turning it on unilaterally
in XenServer puts the burden of sparing users from having to diagnose
the issue on you.

> This particular issue does want feeding back to Intel so they can try
> and fix it, but whatever is wrong is present in a large amount of
> Skylake systems in the field.  Xen needs to be able to cope.

But in a reasonable way.

> Finally, as to boot times, your argument is backwards seeing as you care
> about elapsed boot time.  Slowing the frequency will speed everything
> up, as we aren't executing a large chunk of the BSP boot path with 100hz
> NMI constantly interrupting.

How long does handling a single NMI take? Microseconds, I assume.
Contrast this with waiting for two NMIs to arrive in wait_for_nmis(),
which goes up from 20ms to 200ms with this change.
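
To spell out the arithmetic behind those two figures, here is a minimal
sketch. It assumes the wait is simply the number of NMIs waited for times
the NMI period; nmi_wait_ms() is a hypothetical helper made up for
illustration, not a function in the Xen tree, and the real
wait_for_nmis() may well differ in detail.

```python
def nmi_wait_ms(nmi_hz: int, nmis_needed: int = 2) -> float:
    """Approximate time (ms) to see `nmis_needed` NMIs at `nmi_hz`.

    Hypothetical helper: models the wait as nmis_needed * period only.
    """
    if nmi_hz <= 0:
        raise ValueError("nmi_hz must be positive")
    period_ms = 1000.0 / nmi_hz  # one NMI period in milliseconds
    return nmis_needed * period_ms


if __name__ == "__main__":
    # Current initial rate (100Hz) vs. the rate proposed by the patch (10Hz).
    for hz in (100, 10):
        print(f"{hz}Hz: {nmi_wait_ms(hz):.0f}ms")
```

Under that assumption each further reduction of the frequency lengthens
this wait by the same factor.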

Also you completely ignore my argument against the seemingly
random division by 10, including the resulting question of what you
mean to do once 10Hz also turns out too high a frequency.

I wouldn't, btw, mind an attempt to avoid the high rate NMIs
during early boot (if they occur in the first place; after two
successive replies by Igor yesterday I'm no longer sure they do),
but that's independent of the issue at hand.

