[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Error during update_runstate_area with KPTI activated

  • To: Julien Grall <julien@xxxxxxx>
  • From: Bertrand Marquis <Bertrand.Marquis@xxxxxxx>
  • Date: Fri, 15 May 2020 12:39:34 +0000
  • Accept-language: en-GB, en-US
  • Arc-authentication-results: i=1; mx.microsoft.com 1; spf=pass smtp.mailfrom=arm.com; dmarc=pass action=none header.from=arm.com; dkim=pass header.d=arm.com; arc=none
  • Arc-message-signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector9901; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-SenderADCheck; bh=5Cw4bNmTVnyczlmiJYHCEsAFL4zrUO8yHCFANvKZuxI=; b=MY8mzdv+C9m86Yy9iXtF65Jnj1+ZrMwWcybSN2pXNOHpBWLgny9ZZUI/ch1n3DscOz2BMcbB7fa0tnMPO2pfD0El2Z2RS/ODRhTGZsP7uDs3z7uO3Lmi4D8X8IUL8TSbqHIInOXN1U/ZD1SZDqmRK5RJYDSa4vWWv4qxszOnIMmbWlS1D1sGSFIcX3JtrNAyJNGd49FdnanixxoIESPLjP8bjIycHZ0+mcENe7KIB2wLfRAI8lAjLqFZwLJ6pI0SL0XoyW6S3GiAbzswiZs739ISUlLNHVNfCxkdVy/B1nHhAdQ0FZ+WpJouJMRQ+gM9ZFguLlPl9Xso9gzJvoJTpg==
  • Arc-seal: i=1; a=rsa-sha256; s=arcselector9901; d=microsoft.com; cv=none; b=oIsn+in9M84C4pJUJ332oMkCjtdq7k4sTgUsapgnI/0zalxaig38oYjq8XNFN6GgWEU4O9jai7amqrn1lZJ6s9yZ/KnGMY+O1FTK/PBpBOO6pWm0Ae/ECATUB2IbhRG6DY4txfkkGpS3yzW0ICbyMhEiPcE7RCU6uJ0SIfUe/HguvQeO+qmvNsDcsc2F5wNI+V9uasCgagW/P0WZxRWW2BiAtCyl4whAjAF0qUfIxD8kcns1QjRpvVdvBvFHLVAfT0iNjZf8lOPxxrWC4jIDq2RF3anPPIFEDaPuN/OsNcsz35wGzgK4LtvucWOV44MuXJa7WGsV77AfNyZbBXaQqw==
  • Authentication-results-original: xen.org; dkim=none (message not signed) header.d=none;xen.org; dmarc=none action=none header.from=arm.com;
  • Cc: Hongyan Xia <hx242@xxxxxxx>, Stefano Stabellini <stefano.stabellini@xxxxxxxxxx>, Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Roger Pau Monné <roger.pau@xxxxxxxxxx>, xen-devel <xen-devel@xxxxxxxxxxxxxxxxxxxx>, nd <nd@xxxxxxx>, Julien Grall <julien.grall.oss@xxxxxxxxx>
  • Delivery-date: Fri, 15 May 2020 12:39:52 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>
  • Nodisclaimer: true
  • Original-authentication-results: xen.org; dkim=none (message not signed) header.d=none;xen.org; dmarc=none action=none header.from=arm.com;
  • Thread-topic: Error during update_runstate_area with KPTI activated

> On 15 May 2020, at 13:07, Julien Grall <julien@xxxxxxx> wrote:
> Hi,
> On 15/05/2020 11:10, Bertrand Marquis wrote:
>>> On 15 May 2020, at 11:00, Julien Grall <julien@xxxxxxx> wrote:
>>> Hi Bertrand,
>>> On 15/05/2020 10:21, Bertrand Marquis wrote:
>>>>> On 15 May 2020, at 10:10, Roger Pau Monné <roger.pau@xxxxxxxxxx 
>>>>> <mailto:roger.pau@xxxxxxxxxx>> wrote:
>>>>> On Fri, May 15, 2020 at 09:53:54AM +0100, Julien Grall wrote:
>>>>>> [CAUTION - EXTERNAL EMAIL] DO NOT reply, click links, or open 
>>>>>> attachments unless you have verified the sender and know the content is 
>>>>>> safe.
>>>>>> Hi,
>>>>>> On 15/05/2020 09:38, Roger Pau Monné wrote:
>>>>>>> On Fri, May 15, 2020 at 07:39:16AM +0000, Bertrand Marquis wrote:
>>>>>>>> On 14 May 2020, at 20:13, Julien Grall <julien.grall.oss@xxxxxxxxx 
>>>>>>>> <mailto:julien.grall.oss@xxxxxxxxx><mailto:julien.grall.oss@xxxxxxxxx>>
>>>>>>>>  wrote:
>>>>>>>> On Thu, 14 May 2020 at 19:12, Andrew Cooper <andrew.cooper3@xxxxxxxxxx 
>>>>>>>> <mailto:andrew.cooper3@xxxxxxxxxx><mailto:andrew.cooper3@xxxxxxxxxx>> 
>>>>>>>> wrote:
>>>>>>>> On 14/05/2020 18:38, Julien Grall wrote:
>>>>>>>> Hi,
>>>>>>>> On 14/05/2020 17:18, Bertrand Marquis wrote:
>>>>>>>> On 14 May 2020, at 16:57, Julien Grall <julien@xxxxxxx 
>>>>>>>> <mailto:julien@xxxxxxx><mailto:julien@xxxxxxx>> wrote:
>>>>>>>> On 14/05/2020 15:28, Bertrand Marquis wrote:
>>>>>>>> Hi,
>>>>>>>> Hi,
>>>>>>>> When executing linux on arm64 with KPTI activated (in Dom0 or in a
>>>>>>>> DomU), I have a lot of walk page table errors like this:
>>>>>>>> (XEN) p2m.c:1890: d1v0: Failed to walk page-table va
>>>>>>>> 0xffffff837ebe0cd0
>>>>>>>> After implementing a call trace, I found that the problem was
>>>>>>>> coming from the update_runstate_area when linux has KPTI activated.
>>>>>>>> I have the following call trace:
>>>>>>>> (XEN) p2m.c:1890: d1v0: Failed to walk page-table va
>>>>>>>> 0xffffff837ebe0cd0
>>>>>>>> (XEN) backtrace.c:29: Stacktrace start at 0x8007638efbb0 depth 10
>>>>>>>> (XEN)    [<000000000027780c>] get_page_from_gva+0x180/0x35c
>>>>>>>> (XEN)    [<00000000002700c8>] guestcopy.c#copy_guest+0x1b0/0x2e4
>>>>>>>> (XEN)    [<0000000000270228>] raw_copy_to_guest+0x2c/0x34
>>>>>>>> (XEN)    [<0000000000268dd0>] domain.c#update_runstate_area+0x90/0xc8
>>>>>>>> (XEN)    [<000000000026909c>] domain.c#schedule_tail+0x294/0x2d8
>>>>>>>> (XEN)    [<0000000000269524>] context_switch+0x58/0x70
>>>>>>>> (XEN)    [<00000000002479c4>] core.c#sched_context_switch+0x88/0x1e4
>>>>>>>> (XEN)    [<000000000024845c>] core.c#schedule+0x224/0x2ec
>>>>>>>> (XEN)    [<0000000000224018>] softirq.c#__do_softirq+0xe4/0x128
>>>>>>>> (XEN)    [<00000000002240d4>] do_softirq+0x14/0x1c
>>>>>>>> Discussing this subject with Stefano, he pointed me to a discussion
>>>>>>>> started a year ago on this subject here:
>>>>>>>> https://lists.xenproject.org/archives/html/xen-devel/2018-11/msg03053.html
>>>>>>>> And a patch was submitted:
>>>>>>>> https://lists.xenproject.org/archives/html/xen-devel/2019-05/msg02320.html
>>>>>>>> I rebased this patch on current master and it is solving the
>>>>>>>> problem I have seen.
>>>>>>>> It sounds to me like a good solution to introduce a
>>>>>>>> VCPUOP_register_runstate_phys_memory_area to not depend on the area
>>>>>>>> actually being mapped in the guest when a context switch is being
>>>>>>>> done (which is actually the problem happening when a context switch
>>>>>>>> is trigger while a guest is running in EL0).
>>>>>>>> Is there any reason why this was not merged at the end ?
>>>>>>>> I just skimmed through the thread to remind myself the state.
>>>>>>>> AFAICT, this is blocked on the contributor to clarify the intended
>>>>>>>> interaction and provide a new version.
>>>>>>>> What do you mean here by intended interaction ? How the new hyper
>>>>>>>> call should be used by the guest OS ?
>>>>>>>> From what I remember, Jan was seeking clarification on whether the two
>>>>>>>> hypercalls (existing and new) can be called together by the same OS
>>>>>>>> (and make sense).
>>>>>>>> There was also the question of the handover between two pieces of
>>>>>>>> sotfware. For instance, what if the firmware is using the existing
>>>>>>>> interface but the OS the new one? Similar question about Kexecing a
>>>>>>>> different kernel.
>>>>>>>> This part is mostly documentation so we can discuss about the approach
>>>>>>>> and review the implementation.
>>>>>>>> I am still in favor of the new hypercall (and still in my todo list)
>>>>>>>> but I haven't yet found time to revive the series.
>>>>>>>> Would you be willing to take over the series? I would be happy to
>>>>>>>> bring you up to speed and provide review.
>>>>>>>> Sure I can take it over.
>>>>>>>> I ported it to master version of xen and I tested it on a board.
>>>>>>>> I still need to do a deep review of the code myself but I have an
>>>>>>>> understanding of the problem and what is the idea.
>>>>>>>> Any help to get on speed would be more then welcome :-)
>>>>>>>> I would recommend to go through the latest version (v3) and the
>>>>>>>> previous (v2). I am also suggesting v2 because I think the split was
>>>>>>>> easier to review/understand.
>>>>>>>> The x86 code is probably what is going to give you the most trouble as
>>>>>>>> there are two ABIs to support (compat and non-compat). If you don't
>>>>>>>> have an x86 setup, I should be able to test it/help write it.
>>>>>>>> Feel free to ask any questions and I will try my best to remember the
>>>>>>>> discussion from last year :).
>>>>>>>> At risk of being shouted down again, a new hypercall isn't necessarily
>>>>>>>> necessary, and there are probably better ways of fixing it.
>>>>>>>> The underlying ABI problem is that the area is registered by virtual
>>>>>>>> address.  The only correct way this should have been done is to 
>>>>>>>> register
>>>>>>>> by guest physical address, so Xen's updating of the data doesn't
>>>>>>>> interact with the guest pagetable settings/restrictions.  x86 suffers
>>>>>>>> the same kind of problems as ARM, except we silently squash the 
>>>>>>>> fallout.
>>>>>>>> The logic in Xen is horrible, and I would really rather it was deleted
>>>>>>>> completely, rather than to be kept for compatibility.
>>>>>>>> The runstate area is always fixed kernel memory and doesn't move.  I
>>>>>>>> believe it is already restricted from crossing a page boundary, and we
>>>>>>>> can calculate the va=>pa translation when the hypercall is made.
>>>>>>>> Yes - this is a technically ABI change, but nothing is going to break
>>>>>>>> (AFAICT) and the cleanup win is large enough to make this a *very*
>>>>>>>> attractive option.
>>>>>>>> I suggested this approach two years ago [1] but you were the one
>>>>>>>> saying that buffer could cross page-boundary on older Linux [2]:
>>>>>>>> "I'd love to do this, but we cant.  Older Linux used to have a virtual
>>>>>>>> buffer spanning a page boundary.  Changing the behaviour under that 
>>>>>>>> will
>>>>>>>> cause older setups to explode."
>>>>>>> Sorry this was long time ago, and details have faded. IIRC there was
>>>>>>> even a proposal (or patch set) that took that into account and allowed
>>>>>>> buffers to span across a page boundary by taking a reference to two
>>>>>>> different pages in that case.
>>>>>> I am not aware of a patch set. Juergen suggested a per-domain mapping but
>>>>>> there was no details how this could be done (my e-mail was left 
>>>>>> unanswered
>>>>>> [1]).
>>>>>> If we were using the vmap() then we would need up 1MB per domain 
>>>>>> (assuming
>>>>>> 128 vCPUs). This sounds quite a bit and I think we need to agree whether 
>>>>>> it
>>>>>> would be an acceptable solution (this was also left unanswered [1]).
>>>>> Could we map/unmap the runtime area on domain switch at a per-cpu
>>>>> based linear space area? There's no reason to have all the runtime
>>>>> areas mapped all the time, you just care about the one from the
>>>>> running vcpu.
>>>>> Maybe the overhead of that mapping and unmapping would be
>>>>> too high? But seeing that we are aiming at a secret-free Xen we would
>>>>> have to eventually go that route anyway.
>>>> Maybe the new hypercall should be a bit different:
>>>> - we have this area allocated already inside Xen and we do a copy of it on 
>>>> any context switch
>>>> - the guest is not supposed to modify any data in this area
>>>> We could introduce a new hypercall:
>>>> - Xen allocate the runstate area using a page aligned address and size
>>> At the moment the runstate is 40 bytes. If we were going to follow this 
>>> proposal, I would recommend to try to have as many runstate as possible in 
>>> your page.
>>> Otherewise, you would waste 4056 bytes per vCPU in both Xen and the guest 
>>> OS. This would even be worse for 64KB kernel.
>> Agree, so it should be one call to have an area with the runstate for all 
>> vCPUs, ensure a vCPU runstate has a size and an address which are cache line 
>> size aligned to prevent coherency stress.
> One 4KB region only going to cover 64 vCPUs. So you would need multiple 
> pages. This brings more question on how this would work for vCPU 
> online/offline or even hotplug.
> The code required in the guest to track them is going to be more complex 
> either in Xen or the guest.

Ok so the guest will have to allocate the runstate area and give this area to 
Xen (using virtual or guest physical address)

>>>> - the guest provide a free guest physical space to the hypercall
>>> This part is the most tricky part. How does the guest know what is free in 
>>> its physical address space?
>>> I am not aware of any way to do this in Linux. So the best you could do 
>>> would be to allocate a page from the RAM and tell Xen to replace it with 
>>> the runstate mapping.
>>> However, this also means you are going to possibly shatter a superpage in 
>>> the P2M. This may affect the performance in long-run.
>> Very true, Linux does not have a way to do that.
>> What about going the other way around: Xen can provide the physical address 
>> to the guest.
> Right now, the hypervisor doesn't have an easy to know the layout. Only the 
> toolstack has. So you would need a way to tell Xen where the region has been 
> reserved.
> This region would have to be allocated below 4GB to cater all the type of 
> guest. A guest may not use them at all (time accounting is not mandatory).
> Even if this is a few pages, this is not very ideal. It would be best if you 
> let the guest allocate some RAM and then pass the address to Xen.


>>>> - Xen maps read-only its own area to the guest at the provided address
>>>> - Xen shall not modify any data in the runstate area of other cores/guests 
>>>> (should already be the case)
>>>> - We keep the current hypercall for backward compatibility and map the 
>>>> areal during the hypercall and keep the area mapped at all time, we keep 
>>>> doing the copy during context switches
>>>> This would highly reduce the overhead by removing the mapping/unmapping.
>>> I don't think the overhead is going to be significant with 
>>> domain_map_page()/domain_unmap_page().
>>> On Arm64, the memory is always mapped so map/unmap is a NOP. On Arm32, we 
>>> have a fast map/unmap implementation.
>>> On x86, without SH, most of the memory is also always mapped. So this 
>>> operation is mostly a NOP. For the SH case, the map/unmap will be used in 
>>> any access to the guest memory (such as hypercalls access) but it is quite 
>>> optimized.
>>> Note that the current overhead is much more important today as you need to 
>>> walk the guest PT and P2M (we are talking at multiple map/unmap). So moving 
>>> to one map/unmap is already going to be a major improvement.
>> Agree
>>>> Regarding the secret free I do not really think this is something 
>>>> problematic here as we already have a copy of this internally anyway
>>> The secret free work is still under review, so what is done in Xen today 
>>> shouldn't dictate the future.
>>> The question to answer is whether we believe leaking the content may be a 
>>> problem. If the answer is yes, then most likely we will want the internal 
>>> representation to be mapped on demand or just mapped for Xen PT associated 
>>> for that domain.
>>> My gut feeling is the runstate content is not critical. But I haven't fully 
>>> thought through yet.
>> The runstate information is stored inside xen and then copied to the guest 
>> memory during context switch. So even if the guest area is not mapped, this 
>> information is still available inside the xen internal copy.
> Again, Xen is not secret free today. So the fact Xen has an internal copy 
> doesn't mean it is fine to leak the content. For instance, the guests' memory 
> regions are always mapped, does it mean the content is not sensitive? 
> Definitely not. Hence why the effort behind Secret Hiding.
> As I wrote in my previous e-mail, *if* we consider the leaking a problem, 
> then we would want the *internal* representation to be mapped on demande or 
> just mapped for Xen PT associated for that domain.
> But... the runstate area doesn't use up a full page. Today the page may also 
> contain secret from a domain. If you always map it, then there is a risk to 
> leak that content. This would have to be taken into consideration if we 
> follow Roger's approach.

So end point is the area is allocated by the guest and we need to do a copy 
from our internal version to the guest version each time we restore a vCPU 


> Cheers,
> -- 
> Julien Grall



Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.