
Re: x86/vmx: Don't spuriously crash the domain when INIT is received


  • To: Marek Marczykowski-Górecki <marmarek@xxxxxxxxxxxxxxxxxxxxxx>
  • From: Jan Beulich <jbeulich@xxxxxxxx>
  • Date: Fri, 25 Feb 2022 15:18:23 +0100
  • Cc: Andrew Cooper <Andrew.Cooper3@xxxxxxxxxx>, Roger Pau Monne <roger.pau@xxxxxxxxxx>, Wei Liu <wl@xxxxxxx>, Jun Nakajima <jun.nakajima@xxxxxxxxx>, Kevin Tian <kevin.tian@xxxxxxxxx>, Thiner Logoer <logoerthiner1@xxxxxxx>, Xen-devel <xen-devel@xxxxxxxxxxxxxxxxxxxx>
  • Delivery-date: Fri, 25 Feb 2022 14:18:34 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 25.02.2022 14:51, Marek Marczykowski-Górecki wrote:
> On Fri, Feb 25, 2022 at 02:19:39PM +0100, Jan Beulich wrote:
>> On 25.02.2022 13:28, Andrew Cooper wrote:
>>> On 25/02/2022 08:44, Jan Beulich wrote:
>>>> On 24.02.2022 20:48, Andrew Cooper wrote:
>>>>> In VMX operation, the handling of INIT IPIs is changed.  EXIT_REASON_INIT
>>>>> has nothing to do with the guest in question; it simply signals that an
>>>>> INIT was received.
>>>>>
>>>>> Ignoring the INIT is probably the wrong thing to do, but is helpful for
>>>>> debugging.  Crashing the domain which happens to be in context is
>>>>> definitely wrong.  Print an error message and continue.
>>>>>
>>>>> Discovered as collateral damage from when an AP triple faults on S3
>>>>> resume on Intel TigerLake platforms.
>>>> I'm afraid I don't follow the scenario, which was (only) outlined in
>>>> patch 1: Why would the BSP receive INIT in this case?
>>>
>>> SHUTDOWN is a signal emitted by a core when it can't continue.  Triple
>>> fault is one cause, but other sources include a double #MC, etc.
>>>
>>> Some external component, in the PCH I expect, needs to turn this into a
>>> platform reset, because one malfunctioning core can't.  It is why a
>>> triple fault on any logical processor brings the whole system down.
>>
>> I'm afraid this doesn't answer my question. Clearly the system didn't
>> shut down. Hence I still don't see why the BSP would see INIT in the
>> first place.
>>
>>>> And it also cannot be that the INIT was received by the vCPU while
>>>> running on another CPU:
>>>
>>> It's nothing (really) to do with the vCPU.  INIT is an external signal to
>>> the (real) APIC, just like NMI/etc.
>>>
>>> It is the next VMEntry on a CPU which received INIT that suffers a
>>> VMEntry failure, and the VMEntry failure has nothing to do with the
>>> contents of the VMCS.
>>>
>>> Importantly for Xen however, this isn't applicable when scheduling PV
>>> vCPUs, which is why dom0 wasn't the one that crashed.  This actually
>>> meant that dom0 was alive and usable, albeit sharing all vCPUs on a
>>> single CPU.
>>>
>>>
>>> The change in INIT behaviour exists for TXT, where it is critical that
>>> software can clear secrets from RAM before resetting.  I'm not wanting
>>> to get into any of that because it's far more complicated than I have
>>> time to fix.
>>
>> I guess there's something hidden behind what you say here, like INIT
>> only being latched, but this latched state then causing the VM entry
>> failure. Which would mean that really the INIT was a signal for the
>> system to shut down / shutting down. In which case arranging to
>> continue by ignoring the event in VMX looks wrong. Simply crashing
>> the guest would then be wrong as well, of course. We should shut
>> down instead.
> 
> A shutdown could be an alternative here, with the remark that it would make
> debugging such issues significantly harder. Note the INIT is delivered
> to the BSP, but the actual reason (in this case) is on some AP. A shutdown
> (crash) in this case would prevent the (still functioning) BSP from showing
> you the message (unless you have a serial console, which is rather rare in
> laptops - a significant target for Qubes OS).

Well, I didn't necessarily mean shutting down silently. I fully
appreciate the usefulness of getting state dumped out for debugging
of an issue.

>> But I don't think I see the full picture here yet, unless your
>> mentioning of TXT was actually implying that TXT was active at the
>> point of the crash (which I don't think was said anywhere).
> 
> No, TXT wasn't (intentionally) active. I think Andrew mentioned it as
> explanation why VMX behaves this way: to let the OS do something _if_ TXT
> is active (the check for TXT is the OS responsibility here). But that's
> my guess only...

One part here that I don't understand: How would the OS become
aware of the INIT if it didn't try to enter a VMX guest (i.e.
non-root mode)?

Jan
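
For readers following the thread, the behaviour the patch proposes - treating EXIT_REASON_INIT as a log-and-continue event rather than crashing whichever domain happens to be in context - can be sketched as a toy exit-reason dispatcher. This is a minimal illustration, not Xen's actual code: `domain_crash()` and the dispatcher shape are hypothetical stand-ins, though the exit-reason values match the Intel SDM's basic exit reasons.

```c
#include <stdio.h>

/* Illustrative subset of VMX basic exit reasons (values per the Intel SDM). */
#define EXIT_REASON_EXCEPTION_NMI  0
#define EXIT_REASON_TRIPLE_FAULT   2
#define EXIT_REASON_INIT           3

/* Hypothetical stand-in for Xen's domain-crash path. */
static int domain_crashed;
static void domain_crash(void) { domain_crashed = 1; }

/*
 * Toy dispatcher: an INIT exit only says that the physical CPU latched an
 * INIT signal.  It is unrelated to the VMCS currently in context, so
 * crashing the in-context domain would punish an innocent guest.  Log and
 * continue instead (whether ignoring is ultimately right is the open
 * question in this thread).
 */
static void handle_vmexit(unsigned int exit_reason)
{
    switch ( exit_reason )
    {
    case EXIT_REASON_INIT:
        printf("Unexpected INIT received - ignoring\n");
        break;  /* do NOT crash the in-context domain */

    case EXIT_REASON_TRIPLE_FAULT:
        domain_crash();  /* genuinely the guest's own fault */
        break;

    default:
        break;
    }
}
```

The point of the sketch is the asymmetry: a triple fault is attributable to the guest in context, while an INIT exit is a platform-level event that merely interrupted whichever VMEntry came next.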
