Re: [RFC PATCH 00/10] Preemption in hypervisor (ARM only)
Julien Grall writes:

> On Wed, 24 Feb 2021 at 20:58, Volodymyr Babchuk
> <Volodymyr_Babchuk@xxxxxxxx> wrote:
>>
>> Hi Julien,
>>
>> Julien Grall writes:
>>
>>> On 23/02/2021 12:06, Volodymyr Babchuk wrote:
>>>> Hi Julien,
>>>
>>> Hi Volodymyr,
>>>
>>>> Julien Grall writes:
>>>>> On 23/02/2021 02:34, Volodymyr Babchuk wrote:
>>>>> ... just rescheduling the vCPU. It will also give the opportunity
>>>>> for the guest to handle interrupts.
>>>>>
>>>>> If you don't return to the guest, then you risk getting an RCU
>>>>> sched stall on that vCPU (some hypercalls can take really, really
>>>>> long).
>>>>
>>>> Ah yes, you are right. I only wish that the hypervisor saved the
>>>> context of the hypercall on its side...
>>>>
>>>> I have the example of OP-TEE before my eyes. They have a special
>>>> return code, "task was interrupted", and they use a separate call,
>>>> "continue execution of interrupted task", which takes an opaque
>>>> context handle as a parameter. With this approach the state of an
>>>> interrupted call never leaks to the rest of the system.
>>>
>>> Feel free to suggest a new approach for the hypercalls.
>>
>> I believe I suggested it right above. There are some corner cases
>> that should be addressed, of course.
>
> If we wanted a clean break, then possibly yes. But I meant one that
> doesn't break all the existing users and doesn't put Xen at risk.
>
> I don't believe your approach fulfills that.

Of course, we can't touch any hypercalls that are part of the stable
ABI. But if I got this right, domctls and sysctls are not stable, so
one can change their behavior quite drastically between major releases.

>>>>>> This approach itself has obvious problems: code that executes a
>>>>>> hypercall is responsible for preemption, preemption checks are
>>>>>> infrequent (because they are costly by themselves), hypercall
>>>>>> execution state is stored in a guest-controlled area, and we rely
>>>>>> on the guest's good will to continue the hypercall.
>>>>>
>>>>> Why is it a problem to rely on the guest's good will? The
>>>>> hypercalls should be preempted at a boundary that is safe to
>>>>> continue.
>>>>
>>>> Yes, and it imposes restrictions on how to write a hypercall
>>>> handler. In other words, there are many more places in hypercall
>>>> handler code where it can be preempted than where a hypercall
>>>> continuation can be used. For example, you can preempt a hypercall
>>>> that holds a mutex, but of course you can't create a continuation
>>>> point in such a place.
>>>
>>> I disagree, you can create a continuation point in such a place,
>>> although it will be more complex because you have to make sure you
>>> break the work at a restartable place.
>>
>> Maybe there is some misunderstanding. You can't create a hypercall
>> continuation point in a place where you are holding a mutex, because
>> there is absolutely no guarantee that the guest will restart the
>> hypercall.
>
> I don't think we are disagreeing here. My point is you should rarely
> need to hold a mutex for a long period, so you could break your work
> into smaller chunks. In which case, you can use a hypercall
> continuation.

Let's put it this way: generally you can hold a mutex much longer than
you can hold a spinlock, and nothing catastrophic will happen if you
are preempted while holding a mutex. Better to avoid it, of course.

>> But you can preempt a vCPU while it is holding a mutex, because Xen
>> owns the scheduler and it can guarantee that the vCPU will eventually
>> be scheduled again to continue the work and release the mutex.
>
> The problem is the "eventually". If you are accounting the time spent
> in the hypervisor to vCPU A, then there is a possibility that it has
> exhausted its time slice. In which case, your vCPU A may be sleeping
> for a while with a mutex held.
>
> If another vCPU B needs the mutex, it will either have to wait,
> potentially for a long time, or we need to force vCPU A to run on
> borrowed time.

Yes, of course.

>>> I would also like to point out that preemption also has some
>>> drawbacks. With RT in mind, you have to deal with priority inversion
>>> (e.g. a lower priority vCPU holds a mutex that is required by a
>>> higher priority one).
>>
>> Of course. This is not as simple as "just call the scheduler when we
>> want to".
>
> Your e-mail made it sound like it was easy to add preemption in
> Xen. ;)

I'm sorry for that :) Actually, there is lots of work to do. It appears
to me that "current" needs to be reworked, preempt_enable/disable needs
to be reworked, and per-cpu variables should also be reworked. And this
is just to ensure consistency of the already existing code. And I am
not even mentioning x86 support...

>>> Outside of RT, you have to be careful where mutexes are held. In
>>> your earlier answer, you suggested holding a mutex for the memory
>>> allocation. If you do that, then it means a domain A can block
>>> allocation for domain B as it holds the mutex.
>>
>> As long as we do not exit to EL1 with a mutex held, domain A can't
>> block anything. Of course, we have to deal with priority inversion,
>> but a malicious domain can't cause a DoS.
>
> It is not really a priority inversion problem outside of RT because
> all the tasks will have the same priority. It is more a time
> accounting problem because each vCPU may have a different number of
> credits.

Speaking of that, RTDS does not use the concept of priority. And
neither, of course, does ARINC.

>>>>> I am really surprised that these are the only changes necessary in
>>>>> Xen. For a first approach, we may want to be conservative about
>>>>> when the preemption is happening, as I am not convinced that all
>>>>> the places are safe to preempt.
>>>>
>>>> Well, I can't say that I ran extensive tests, but I played with
>>>> this for some time and it seemed quite stable. Of course, I had
>>>> some problems with RTDS...
>>>>
>>>> As I see it, Xen already supports SMP, so all places where races
>>>> are possible should already be covered by spinlocks or taken into
>>>> account by some other means.
>>>
>>> That's correct for shared resources. I am more worried about any
>>> hypercalls that are expected to run more or less continuously
>>> (e.g. not taking interrupts into account) on the same pCPU.
>>
>> Are there many such hypercalls? They can disable preemption if they
>> really need to run on the same pCPU. As I understand it, they should
>> be relatively fast, because they can't create continuations anyway.
>
> Well, I never tried to make Xen preemptible... My comment is based on
> the fact that the use of preempt_{enable, disable}() was mostly done
> on a best-effort basis.
>
> The usual suspects are anything using this_cpu() or interacting with
> the per-CPU HW registers.
>
> From a quick look, here are a few things (I only looked at Arm):
>  * map_domain_page(), in particular on arm32, because this is using
>    per-CPU page-tables
>  * guest_atomics_*, as this uses this_cpu()
>  * virt_to_mfn(), in particular the failure path
>  * Incorrect use of (or missing) RCU locking (hopefully Juergen's
>    recent work on RCU mitigates the risk)
>
> I can provide guidance, but you will have to go through the code and
> check what's happening.

Thank you for the list. Of course, I need to go through all of the
code. I already had a bunch of problems with per_cpu variables...

-- 
Volodymyr Babchuk at EPAM
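To make the "usual suspects" in the list above more concrete, here is a
minimal sketch of the this_cpu() hazard and of the preempt_disable() /
preempt_enable() pattern mentioned earlier in the thread. The struct,
function and variable names are hypothetical illustrations only;
DEFINE_PER_CPU(), this_cpu(), preempt_disable() and preempt_enable()
are used the way I understand Xen's existing helpers to work, not as
verified code from the tree.

#include <xen/percpu.h>   /* DEFINE_PER_CPU(), this_cpu() (assumed header) */
#include <xen/preempt.h>  /* preempt_disable()/preempt_enable() (assumed)  */

/*
 * Hypothetical per-CPU statistics. Without preemption control, a vCPU
 * preempted between the two this_cpu() accesses below could resume on
 * a different pCPU and update another CPU's copy of the variable.
 */
struct op_stats {
    unsigned long count;
    unsigned long last_op;
};

static DEFINE_PER_CPU(struct op_stats, op_stats);

static void record_op(unsigned long op)
{
    /*
     * Assumed mitigation: keep the whole read-modify-write on one pCPU
     * by disabling preemption around the this_cpu() uses.
     */
    preempt_disable();

    this_cpu(op_stats).last_op = op;
    this_cpu(op_stats).count++;   /* still on the same pCPU as above */

    preempt_enable();
}

The same reasoning applies to per-CPU HW registers and per-CPU
page-tables: any sequence that assumes "the pCPU does not change under
my feet" becomes a preemption-free region once the hypervisor itself
can be preempted.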
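And, going back to the top of the thread, here is a rough sketch of the
OP-TEE-style flow described there, where the hypervisor keeps the state
of an interrupted call on its side and hands the guest only an opaque
handle. Every identifier below is hypothetical; it mirrors the shape of
the described interface (a "task was interrupted" return code plus a
"continue interrupted task" call), not any existing Xen or OP-TEE ABI.

#include <stdint.h>

/* Hypothetical return codes, mirroring the description above. */
#define HV_RET_OK           0L
#define HV_RET_INTERRUPTED  1L   /* "task was interrupted" */

/* Opaque handle: the interrupted-call state stays in the hypervisor. */
typedef uint64_t hv_task_handle_t;

/* Hypothetical guest-side wrappers around the two calls. */
long hv_do_long_op(const void *args, hv_task_handle_t *handle);
long hv_continue_task(hv_task_handle_t handle);

/* Guest-side loop: keep resuming until the hypervisor reports completion. */
static long run_long_op(const void *args)
{
    hv_task_handle_t handle = 0;
    long ret = hv_do_long_op(args, &handle);

    while ( ret == HV_RET_INTERRUPTED )
        ret = hv_continue_task(handle);

    return ret;
}

The contrast with today's hypercall continuations is that nothing about
the partially completed work is encoded in guest-visible arguments,
which is exactly the property pointed out in the thread: the state of
the interrupted call never leaks to the rest of the system.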