Xen project Mailing List

Re: Error during update_runstate_area with KPTI activated

From: Bertrand Marquis <Bertrand.Marquis@xxxxxxx>

Date: Fri, 15 May 2020 10:10:39 +0000

Cc: Hongyan Xia <hx242@xxxxxxx>, Stefano Stabellini <stefano.stabellini@xxxxxxxxxx>, Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Roger Pau Monné <roger.pau@xxxxxxxxxx>, xen-devel <xen-devel@xxxxxxxxxxxxxxxxxxxx>, nd <nd@xxxxxxx>, Julien Grall <julien.grall.oss@xxxxxxxxx>

Delivery-date: Fri, 15 May 2020 10:11:00 +0000

List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

> On 15 May 2020, at 11:00, Julien Grall <julien@xxxxxxx> wrote:
>
> Hi Bertrand,
>
> On 15/05/2020 10:21, Bertrand Marquis wrote:
>>> On 15 May 2020, at 10:10, Roger Pau Monné <roger.pau@xxxxxxxxxx> wrote:
>>>
>>> On Fri, May 15, 2020 at 09:53:54AM +0100, Julien Grall wrote:
>>>> Hi,
>>>>
>>>> On 15/05/2020 09:38, Roger Pau Monné wrote:
>>>>> On Fri, May 15, 2020 at 07:39:16AM +0000, Bertrand Marquis wrote:
>>>>>>
>>>>>> On 14 May 2020, at 20:13, Julien Grall <julien.grall.oss@xxxxxxxxx> wrote:
>>>>>>
>>>>>> On Thu, 14 May 2020 at 19:12, Andrew Cooper <andrew.cooper3@xxxxxxxxxx> wrote:
>>>>>>
>>>>>> On 14/05/2020 18:38, Julien Grall wrote:
>>>>>> Hi,
>>>>>>
>>>>>> On 14/05/2020 17:18, Bertrand Marquis wrote:
>>>>>>
>>>>>> On 14 May 2020, at 16:57, Julien Grall <julien@xxxxxxx> wrote:
>>>>>>
>>>>>> On 14/05/2020 15:28, Bertrand Marquis wrote:
>>>>>> Hi,
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> When executing Linux on arm64 with KPTI activated (in Dom0 or in a
>>>>>> DomU), I have a lot of walk page table errors like this:
>>>>>> (XEN) p2m.c:1890: d1v0: Failed to walk page-table va 0xffffff837ebe0cd0
>>>>>> After implementing a call trace, I found that the problem was
>>>>>> coming from the update_runstate_area when Linux has KPTI activated.
>>>>>> I have the following call trace:
>>>>>> (XEN) p2m.c:1890: d1v0: Failed to walk page-table va 0xffffff837ebe0cd0
>>>>>> (XEN) backtrace.c:29: Stacktrace start at 0x8007638efbb0 depth 10
>>>>>> (XEN) [<000000000027780c>] get_page_from_gva+0x180/0x35c
>>>>>> (XEN) [<00000000002700c8>] guestcopy.c#copy_guest+0x1b0/0x2e4
>>>>>> (XEN) [<0000000000270228>] raw_copy_to_guest+0x2c/0x34
>>>>>> (XEN) [<0000000000268dd0>] domain.c#update_runstate_area+0x90/0xc8
>>>>>> (XEN) [<000000000026909c>] domain.c#schedule_tail+0x294/0x2d8
>>>>>> (XEN) [<0000000000269524>] context_switch+0x58/0x70
>>>>>> (XEN) [<00000000002479c4>] core.c#sched_context_switch+0x88/0x1e4
>>>>>> (XEN) [<000000000024845c>] core.c#schedule+0x224/0x2ec
>>>>>> (XEN) [<0000000000224018>] softirq.c#__do_softirq+0xe4/0x128
>>>>>> (XEN) [<00000000002240d4>] do_softirq+0x14/0x1c
>>>>>> Discussing this subject with Stefano, he pointed me to a discussion
>>>>>> started a year ago on this subject here:
>>>>>> https://lists.xenproject.org/archives/html/xen-devel/2018-11/msg03053.html
>>>>>>
>>>>>> And a patch was submitted:
>>>>>> https://lists.xenproject.org/archives/html/xen-devel/2019-05/msg02320.html
>>>>>>
>>>>>> I rebased this patch on current master and it is solving the
>>>>>> problem I have seen.
>>>>>> It sounds to me like a good solution to introduce a
>>>>>> VCPUOP_register_runstate_phys_memory_area to not depend on the area
>>>>>> actually being mapped in the guest when a context switch is being
>>>>>> done (which is actually the problem happening when a context switch
>>>>>> is triggered while a guest is running in EL0).
>>>>>> Is there any reason why this was not merged in the end?
>>>>>>
>>>>>> I just skimmed through the thread to remind myself of the state.
>>>>>> AFAICT, this is blocked on the contributor to clarify the intended
>>>>>> interaction and provide a new version.
>>>>>>
>>>>>> What do you mean here by intended interaction?
>>>>>> How should the new hypercall be used by the guest OS?
>>>>>>
>>>>>> From what I remember, Jan was seeking clarification on whether the two
>>>>>> hypercalls (existing and new) can be called together by the same OS
>>>>>> (and make sense).
>>>>>>
>>>>>> There was also the question of the handover between two pieces of
>>>>>> software. For instance, what if the firmware is using the existing
>>>>>> interface but the OS the new one? Similar question about kexecing a
>>>>>> different kernel.
>>>>>>
>>>>>> This part is mostly documentation so we can discuss the approach
>>>>>> and review the implementation.
>>>>>>
>>>>>> I am still in favor of the new hypercall (and still in my todo list)
>>>>>> but I haven't yet found time to revive the series.
>>>>>>
>>>>>> Would you be willing to take over the series? I would be happy to
>>>>>> bring you up to speed and provide review.
>>>>>>
>>>>>> Sure I can take it over.
>>>>>>
>>>>>> I ported it to the master version of Xen and I tested it on a board.
>>>>>> I still need to do a deep review of the code myself but I have an
>>>>>> understanding of the problem and what the idea is.
>>>>>>
>>>>>> Any help to get up to speed would be more than welcome :-)
>>>>>>
>>>>>> I would recommend to go through the latest version (v3) and the
>>>>>> previous (v2). I am also suggesting v2 because I think the split was
>>>>>> easier to review/understand.
>>>>>>
>>>>>> The x86 code is probably what is going to give you the most trouble as
>>>>>> there are two ABIs to support (compat and non-compat). If you don't
>>>>>> have an x86 setup, I should be able to test it/help write it.
>>>>>>
>>>>>> Feel free to ask any questions and I will try my best to remember the
>>>>>> discussion from last year :).
>>>>>>
>>>>>> At risk of being shouted down again, a new hypercall isn't necessarily
>>>>>> necessary, and there are probably better ways of fixing it.
>>>>>>
>>>>>> The underlying ABI problem is that the area is registered by virtual
>>>>>> address. The only correct way this should have been done is to register
>>>>>> by guest physical address, so Xen's updating of the data doesn't
>>>>>> interact with the guest pagetable settings/restrictions. x86 suffers
>>>>>> the same kind of problems as ARM, except we silently squash the fallout.
>>>>>>
>>>>>> The logic in Xen is horrible, and I would really rather it was deleted
>>>>>> completely, rather than to be kept for compatibility.
>>>>>>
>>>>>> The runstate area is always fixed kernel memory and doesn't move. I
>>>>>> believe it is already restricted from crossing a page boundary, and we
>>>>>> can calculate the va=>pa translation when the hypercall is made.
>>>>>>
>>>>>> Yes - this is technically an ABI change, but nothing is going to break
>>>>>> (AFAICT) and the cleanup win is large enough to make this a *very*
>>>>>> attractive option.
>>>>>>
>>>>>> I suggested this approach two years ago [1] but you were the one
>>>>>> saying that buffer could cross page-boundary on older Linux [2]:
>>>>>>
>>>>>> "I'd love to do this, but we can't. Older Linux used to have a virtual
>>>>>> buffer spanning a page boundary. Changing the behaviour under that will
>>>>>> cause older setups to explode."
>>>>>
>>>>> Sorry this was a long time ago, and details have faded. IIRC there was
>>>>> even a proposal (or patch set) that took that into account and allowed
>>>>> buffers to span across a page boundary by taking a reference to two
>>>>> different pages in that case.
>>>>
>>>> I am not aware of a patch set. Juergen suggested a per-domain mapping but
>>>> there were no details on how this could be done (my e-mail was left
>>>> unanswered [1]).
>>>>
>>>> If we were using the vmap() then we would need up to 1MB per domain
>>>> (assuming 128 vCPUs). This sounds quite a bit and I think we need to agree
>>>> whether it would be an acceptable solution (this was also left unanswered
>>>> [1]).
>>>
>>> Could we map/unmap the runtime area on domain switch at a per-cpu
>>> based linear space area? There's no reason to have all the runtime
>>> areas mapped all the time, you just care about the one from the
>>> running vcpu.
>>>
>>> Maybe the overhead of that mapping and unmapping would be
>>> too high? But seeing that we are aiming at a secret-free Xen we would
>>> have to eventually go that route anyway.
>> Maybe the new hypercall should be a bit different:
>> - we have this area allocated already inside Xen and we do a copy of it on
>> any context switch
>> - the guest is not supposed to modify any data in this area
>> We could introduce a new hypercall:
>> - Xen allocates the runstate area using a page aligned address and size
>
> At the moment the runstate is 40 bytes. If we were going to follow this
> proposal, I would recommend to try to have as many runstates as possible in
> your page.
>
> Otherwise, you would waste 4056 bytes per vCPU in both Xen and the guest OS.
> This would even be worse for 64KB kernel.

Agreed. It should then be a single call that sets up one area holding the
runstate of all vCPUs, with each vCPU's runstate having a size and an address
aligned to the cache line size to prevent coherency stress.

>
>> - the guest provides a free guest physical space to the hypercall
>
> This part is the most tricky part. How does the guest know what is free in
> its physical address space?
>
> I am not aware of any way to do this in Linux. So the best you could do would
> be to allocate a page from the RAM and tell Xen to replace it with the
> runstate mapping.
>
> However, this also means you are going to possibly shatter a superpage in the
> P2M. This may affect the performance in the long run.

Very true, Linux does not have a way to do that.

What about going the other way around: Xen can provide the physical address
to the guest.
>
>> - Xen maps read-only its own area to the guest at the provided address
>> - Xen shall not modify any data in the runstate area of other cores/guests
>> (should already be the case)
>> - We keep the current hypercall for backward compatibility and map the area
>> during the hypercall and keep the area mapped at all times, we keep doing the
>> copy during context switches
>> This would highly reduce the overhead by removing the mapping/unmapping.
>
> I don't think the overhead is going to be significant with
> domain_map_page()/domain_unmap_page().
>
> On Arm64, the memory is always mapped so map/unmap is a NOP. On Arm32, we
> have a fast map/unmap implementation.
>
> On x86, without SH, most of the memory is also always mapped. So this
> operation is mostly a NOP. For the SH case, the map/unmap will be used in any
> access to the guest memory (such as hypercalls access) but it is quite
> optimized.
>
> Note that the current overhead is much more important today as you need to
> walk the guest PT and P2M (we are talking about multiple map/unmap). So moving
> to one map/unmap is already going to be a major improvement.

Agreed.

>
>> Regarding the secret free I do not really think this is something
>> problematic here as we already have a copy of this internally anyway
>
> The secret free work is still under review, so what is done in Xen today
> shouldn't dictate the future.
>
> The question to answer is whether we believe leaking the content may be a
> problem. If the answer is yes, then most likely we will want the internal
> representation to be mapped on demand or just mapped for Xen PT associated
> for that domain.
>
> My gut feeling is the runstate content is not critical. But I haven't fully
> thought it through yet.

The runstate information is stored inside Xen and then copied to the guest
memory during a context switch. So even if the guest area is not mapped, this
information is still available in Xen's internal copy.
Cheers
Bertrand

>
> Cheers,
>
> --
> Julien Grall
