Xen project Mailing List

Re: Error during update_runstate_area with KPTI activated

From: Bertrand Marquis <Bertrand.Marquis@xxxxxxx>

Date: Fri, 15 May 2020 12:39:34 +0000

Accept-language: en-GB, en-US

Arc-authentication-results: i=1; mx.microsoft.com 1; spf=pass smtp.mailfrom=arm.com; dmarc=pass action=none header.from=arm.com; dkim=pass header.d=arm.com; arc=none

Arc-message-signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector9901; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-SenderADCheck; bh=5Cw4bNmTVnyczlmiJYHCEsAFL4zrUO8yHCFANvKZuxI=; b=MY8mzdv+C9m86Yy9iXtF65Jnj1+ZrMwWcybSN2pXNOHpBWLgny9ZZUI/ch1n3DscOz2BMcbB7fa0tnMPO2pfD0El2Z2RS/ODRhTGZsP7uDs3z7uO3Lmi4D8X8IUL8TSbqHIInOXN1U/ZD1SZDqmRK5RJYDSa4vWWv4qxszOnIMmbWlS1D1sGSFIcX3JtrNAyJNGd49FdnanixxoIESPLjP8bjIycHZ0+mcENe7KIB2wLfRAI8lAjLqFZwLJ6pI0SL0XoyW6S3GiAbzswiZs739ISUlLNHVNfCxkdVy/B1nHhAdQ0FZ+WpJouJMRQ+gM9ZFguLlPl9Xso9gzJvoJTpg==

Arc-seal: i=1; a=rsa-sha256; s=arcselector9901; d=microsoft.com; cv=none; b=oIsn+in9M84C4pJUJ332oMkCjtdq7k4sTgUsapgnI/0zalxaig38oYjq8XNFN6GgWEU4O9jai7amqrn1lZJ6s9yZ/KnGMY+O1FTK/PBpBOO6pWm0Ae/ECATUB2IbhRG6DY4txfkkGpS3yzW0ICbyMhEiPcE7RCU6uJ0SIfUe/HguvQeO+qmvNsDcsc2F5wNI+V9uasCgagW/P0WZxRWW2BiAtCyl4whAjAF0qUfIxD8kcns1QjRpvVdvBvFHLVAfT0iNjZf8lOPxxrWC4jIDq2RF3anPPIFEDaPuN/OsNcsz35wGzgK4LtvucWOV44MuXJa7WGsV77AfNyZbBXaQqw==

Authentication-results-original: xen.org; dkim=none (message not signed) header.d=none;xen.org; dmarc=none action=none header.from=arm.com;

Cc: Hongyan Xia <hx242@xxxxxxx>, Stefano Stabellini <stefano.stabellini@xxxxxxxxxx>, Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Roger Pau Monné <roger.pau@xxxxxxxxxx>, xen-devel <xen-devel@xxxxxxxxxxxxxxxxxxxx>, nd <nd@xxxxxxx>, Julien Grall <julien.grall.oss@xxxxxxxxx>

Delivery-date: Fri, 15 May 2020 12:39:52 +0000

List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

Nodisclaimer: true

Original-authentication-results: xen.org; dkim=none (message not signed) header.d=none;xen.org; dmarc=none action=none header.from=arm.com;

Thread-index: AQHWKfvm5I6PbMAmoUakTwbdgqMBQqinvLCAgAAF7oCAABaDgIAACWGAgAARCgCAANBZgIAAEJgAgAAERACAAASVAIAAAySAgAAKyACAAALvAIAAIIgAgAAJFQA=

Thread-topic: Error during update_runstate_area with KPTI activated

> On 15 May 2020, at 13:07, Julien Grall <julien@xxxxxxx> wrote: > > Hi, > > On 15/05/2020 11:10, Bertrand Marquis wrote: >>> On 15 May 2020, at 11:00, Julien Grall <julien@xxxxxxx> wrote: >>> >>> Hi Bertrand, >>> >>> On 15/05/2020 10:21, Bertrand Marquis wrote: >>>>> On 15 May 2020, at 10:10, Roger Pau Monné <roger.pau@xxxxxxxxxx >>>>> <mailto:roger.pau@xxxxxxxxxx>> wrote: >>>>> >>>>> On Fri, May 15, 2020 at 09:53:54AM +0100, Julien Grall wrote: >>>>>> [CAUTION - EXTERNAL EMAIL] DO NOT reply, click links, or open >>>>>> attachments unless you have verified the sender and know the content is >>>>>> safe. >>>>>> >>>>>> Hi, >>>>>> >>>>>> On 15/05/2020 09:38, Roger Pau Monné wrote: >>>>>>> On Fri, May 15, 2020 at 07:39:16AM +0000, Bertrand Marquis wrote: >>>>>>>> >>>>>>>> >>>>>>>> On 14 May 2020, at 20:13, Julien Grall <julien.grall.oss@xxxxxxxxx >>>>>>>> <mailto:julien.grall.oss@xxxxxxxxx><mailto:julien.grall.oss@xxxxxxxxx>> >>>>>>>> wrote: >>>>>>>> >>>>>>>> On Thu, 14 May 2020 at 19:12, Andrew Cooper <andrew.cooper3@xxxxxxxxxx >>>>>>>> <mailto:andrew.cooper3@xxxxxxxxxx><mailto:andrew.cooper3@xxxxxxxxxx>> >>>>>>>> wrote: >>>>>>>> >>>>>>>> On 14/05/2020 18:38, Julien Grall wrote: >>>>>>>> Hi, >>>>>>>> >>>>>>>> On 14/05/2020 17:18, Bertrand Marquis wrote: >>>>>>>> >>>>>>>> >>>>>>>> On 14 May 2020, at 16:57, Julien Grall <julien@xxxxxxx >>>>>>>> <mailto:julien@xxxxxxx><mailto:julien@xxxxxxx>> wrote: >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On 14/05/2020 15:28, Bertrand Marquis wrote: >>>>>>>> Hi, >>>>>>>> >>>>>>>> Hi, >>>>>>>> >>>>>>>> When executing linux on arm64 with KPTI activated (in Dom0 or in a >>>>>>>> DomU), I have a lot of walk page table errors like this: >>>>>>>> (XEN) p2m.c:1890: d1v0: Failed to walk page-table va >>>>>>>> 0xffffff837ebe0cd0 >>>>>>>> After implementing a call trace, I found that the problem was >>>>>>>> coming from the update_runstate_area when linux has KPTI activated. >>>>>>>> I have the following call trace: >>>>>>>> (XEN) p2m.c:1890: d1v0: Failed to walk page-table va >>>>>>>> 0xffffff837ebe0cd0 >>>>>>>> (XEN) backtrace.c:29: Stacktrace start at 0x8007638efbb0 depth 10 >>>>>>>> (XEN) [<000000000027780c>] get_page_from_gva+0x180/0x35c >>>>>>>> (XEN) [<00000000002700c8>] guestcopy.c#copy_guest+0x1b0/0x2e4 >>>>>>>> (XEN) [<0000000000270228>] raw_copy_to_guest+0x2c/0x34 >>>>>>>> (XEN) [<0000000000268dd0>] domain.c#update_runstate_area+0x90/0xc8 >>>>>>>> (XEN) [<000000000026909c>] domain.c#schedule_tail+0x294/0x2d8 >>>>>>>> (XEN) [<0000000000269524>] context_switch+0x58/0x70 >>>>>>>> (XEN) [<00000000002479c4>] core.c#sched_context_switch+0x88/0x1e4 >>>>>>>> (XEN) [<000000000024845c>] core.c#schedule+0x224/0x2ec >>>>>>>> (XEN) [<0000000000224018>] softirq.c#__do_softirq+0xe4/0x128 >>>>>>>> (XEN) [<00000000002240d4>] do_softirq+0x14/0x1c >>>>>>>> Discussing this subject with Stefano, he pointed me to a discussion >>>>>>>> started a year ago on this subject here: >>>>>>>> https://lists.xenproject.org/archives/html/xen-devel/2018-11/msg03053.html >>>>>>>> >>>>>>>> And a patch was submitted: >>>>>>>> https://lists.xenproject.org/archives/html/xen-devel/2019-05/msg02320.html >>>>>>>> >>>>>>>> I rebased this patch on current master and it is solving the >>>>>>>> problem I have seen. >>>>>>>> It sounds to me like a good solution to introduce a >>>>>>>> VCPUOP_register_runstate_phys_memory_area to not depend on the area >>>>>>>> actually being mapped in the guest when a context switch is being >>>>>>>> done (which is actually the problem happening when a context switch >>>>>>>> is trigger while a guest is running in EL0). >>>>>>>> Is there any reason why this was not merged at the end ? >>>>>>>> >>>>>>>> I just skimmed through the thread to remind myself the state. >>>>>>>> AFAICT, this is blocked on the contributor to clarify the intended >>>>>>>> interaction and provide a new version. >>>>>>>> >>>>>>>> What do you mean here by intended interaction ? How the new hyper >>>>>>>> call should be used by the guest OS ? >>>>>>>> >>>>>>>> From what I remember, Jan was seeking clarification on whether the two >>>>>>>> hypercalls (existing and new) can be called together by the same OS >>>>>>>> (and make sense). >>>>>>>> >>>>>>>> There was also the question of the handover between two pieces of >>>>>>>> sotfware. For instance, what if the firmware is using the existing >>>>>>>> interface but the OS the new one? Similar question about Kexecing a >>>>>>>> different kernel. >>>>>>>> >>>>>>>> This part is mostly documentation so we can discuss about the approach >>>>>>>> and review the implementation. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> I am still in favor of the new hypercall (and still in my todo list) >>>>>>>> but I haven't yet found time to revive the series. >>>>>>>> >>>>>>>> Would you be willing to take over the series? I would be happy to >>>>>>>> bring you up to speed and provide review. >>>>>>>> >>>>>>>> Sure I can take it over. >>>>>>>> >>>>>>>> I ported it to master version of xen and I tested it on a board. >>>>>>>> I still need to do a deep review of the code myself but I have an >>>>>>>> understanding of the problem and what is the idea. >>>>>>>> >>>>>>>> Any help to get on speed would be more then welcome :-) >>>>>>>> I would recommend to go through the latest version (v3) and the >>>>>>>> previous (v2). I am also suggesting v2 because I think the split was >>>>>>>> easier to review/understand. >>>>>>>> >>>>>>>> The x86 code is probably what is going to give you the most trouble as >>>>>>>> there are two ABIs to support (compat and non-compat). If you don't >>>>>>>> have an x86 setup, I should be able to test it/help write it. >>>>>>>> >>>>>>>> Feel free to ask any questions and I will try my best to remember the >>>>>>>> discussion from last year :). >>>>>>>> >>>>>>>> At risk of being shouted down again, a new hypercall isn't necessarily >>>>>>>> necessary, and there are probably better ways of fixing it. >>>>>>>> >>>>>>>> The underlying ABI problem is that the area is registered by virtual >>>>>>>> address. The only correct way this should have been done is to >>>>>>>> register >>>>>>>> by guest physical address, so Xen's updating of the data doesn't >>>>>>>> interact with the guest pagetable settings/restrictions. x86 suffers >>>>>>>> the same kind of problems as ARM, except we silently squash the >>>>>>>> fallout. >>>>>>>> >>>>>>>> The logic in Xen is horrible, and I would really rather it was deleted >>>>>>>> completely, rather than to be kept for compatibility. >>>>>>>> >>>>>>>> The runstate area is always fixed kernel memory and doesn't move. I >>>>>>>> believe it is already restricted from crossing a page boundary, and we >>>>>>>> can calculate the va=>pa translation when the hypercall is made. >>>>>>>> >>>>>>>> Yes - this is a technically ABI change, but nothing is going to break >>>>>>>> (AFAICT) and the cleanup win is large enough to make this a *very* >>>>>>>> attractive option. >>>>>>>> >>>>>>>> I suggested this approach two years ago [1] but you were the one >>>>>>>> saying that buffer could cross page-boundary on older Linux [2]: >>>>>>>> >>>>>>>> "I'd love to do this, but we cant. Older Linux used to have a virtual >>>>>>>> buffer spanning a page boundary. Changing the behaviour under that >>>>>>>> will >>>>>>>> cause older setups to explode." >>>>>>> >>>>>>> Sorry this was long time ago, and details have faded. IIRC there was >>>>>>> even a proposal (or patch set) that took that into account and allowed >>>>>>> buffers to span across a page boundary by taking a reference to two >>>>>>> different pages in that case. >>>>>> >>>>>> I am not aware of a patch set. Juergen suggested a per-domain mapping but >>>>>> there was no details how this could be done (my e-mail was left >>>>>> unanswered >>>>>> [1]). >>>>>> >>>>>> If we were using the vmap() then we would need up 1MB per domain >>>>>> (assuming >>>>>> 128 vCPUs). This sounds quite a bit and I think we need to agree whether >>>>>> it >>>>>> would be an acceptable solution (this was also left unanswered [1]). >>>>> >>>>> Could we map/unmap the runtime area on domain switch at a per-cpu >>>>> based linear space area? There's no reason to have all the runtime >>>>> areas mapped all the time, you just care about the one from the >>>>> running vcpu. >>>>> >>>>> Maybe the overhead of that mapping and unmapping would be >>>>> too high? But seeing that we are aiming at a secret-free Xen we would >>>>> have to eventually go that route anyway. >>>> Maybe the new hypercall should be a bit different: >>>> - we have this area allocated already inside Xen and we do a copy of it on >>>> any context switch >>>> - the guest is not supposed to modify any data in this area >>>> We could introduce a new hypercall: >>>> - Xen allocate the runstate area using a page aligned address and size >>> >>> At the moment the runstate is 40 bytes. If we were going to follow this >>> proposal, I would recommend to try to have as many runstate as possible in >>> your page. >>> >>> Otherewise, you would waste 4056 bytes per vCPU in both Xen and the guest >>> OS. This would even be worse for 64KB kernel. >> Agree, so it should be one call to have an area with the runstate for all >> vCPUs, ensure a vCPU runstate has a size and an address which are cache line >> size aligned to prevent coherency stress. > > One 4KB region only going to cover 64 vCPUs. So you would need multiple > pages. This brings more question on how this would work for vCPU > online/offline or even hotplug. > > The code required in the guest to track them is going to be more complex > either in Xen or the guest. Ok so the guest will have to allocate the runstate area and give this area to Xen (using virtual or guest physical address) > >>> >>> >>>> - the guest provide a free guest physical space to the hypercall >>> >>> This part is the most tricky part. How does the guest know what is free in >>> its physical address space? >>> >>> I am not aware of any way to do this in Linux. So the best you could do >>> would be to allocate a page from the RAM and tell Xen to replace it with >>> the runstate mapping. >>> >>> However, this also means you are going to possibly shatter a superpage in >>> the P2M. This may affect the performance in long-run. >> Very true, Linux does not have a way to do that. >> What about going the other way around: Xen can provide the physical address >> to the guest. > > Right now, the hypervisor doesn't have an easy to know the layout. Only the > toolstack has. So you would need a way to tell Xen where the region has been > reserved. > > This region would have to be allocated below 4GB to cater all the type of > guest. A guest may not use them at all (time accounting is not mandatory). > > Even if this is a few pages, this is not very ideal. It would be best if you > let the guest allocate some RAM and then pass the address to Xen. Ok > >>> >>>> - Xen maps read-only its own area to the guest at the provided address >>>> - Xen shall not modify any data in the runstate area of other cores/guests >>>> (should already be the case) >>>> - We keep the current hypercall for backward compatibility and map the >>>> areal during the hypercall and keep the area mapped at all time, we keep >>>> doing the copy during context switches >>>> This would highly reduce the overhead by removing the mapping/unmapping. >>> >>> I don't think the overhead is going to be significant with >>> domain_map_page()/domain_unmap_page(). >>> >>> On Arm64, the memory is always mapped so map/unmap is a NOP. On Arm32, we >>> have a fast map/unmap implementation. >>> >>> On x86, without SH, most of the memory is also always mapped. So this >>> operation is mostly a NOP. For the SH case, the map/unmap will be used in >>> any access to the guest memory (such as hypercalls access) but it is quite >>> optimized. >>> >>> Note that the current overhead is much more important today as you need to >>> walk the guest PT and P2M (we are talking at multiple map/unmap). So moving >>> to one map/unmap is already going to be a major improvement. >> Agree >>> >>>> Regarding the secret free I do not really think this is something >>>> problematic here as we already have a copy of this internally anyway >>> >>> The secret free work is still under review, so what is done in Xen today >>> shouldn't dictate the future. >>> >>> The question to answer is whether we believe leaking the content may be a >>> problem. If the answer is yes, then most likely we will want the internal >>> representation to be mapped on demand or just mapped for Xen PT associated >>> for that domain. >>> >>> My gut feeling is the runstate content is not critical. But I haven't fully >>> thought through yet. >> The runstate information is stored inside xen and then copied to the guest >> memory during context switch. So even if the guest area is not mapped, this >> information is still available inside the xen internal copy. > Again, Xen is not secret free today. So the fact Xen has an internal copy > doesn't mean it is fine to leak the content. For instance, the guests' memory > regions are always mapped, does it mean the content is not sensitive? > Definitely not. Hence why the effort behind Secret Hiding. > > As I wrote in my previous e-mail, *if* we consider the leaking a problem, > then we would want the *internal* representation to be mapped on demande or > just mapped for Xen PT associated for that domain. > > But... the runstate area doesn't use up a full page. Today the page may also > contain secret from a domain. If you always map it, then there is a risk to > leak that content. This would have to be taken into consideration if we > follow Roger's approach. So end point is the area is allocated by the guest and we need to do a copy from our internal version to the guest version each time we restore a vCPU context. Cheers Bertrand > Cheers, > > -- > Julien Grall

©2013 Xen Project, A Linux Foundation Collaborative Project. All Rights Reserved.
Linux Foundation is a registered trademark of The Linux Foundation.
Xen Project is a trademark of The Linux Foundation.