[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [RFC PATCH 00/10] Preemption in hypervisor (ARM only)

  • To: Julien Grall <julien.grall.oss@xxxxxxxxx>
  • From: Volodymyr Babchuk <Volodymyr_Babchuk@xxxxxxxx>
  • Date: Wed, 24 Feb 2021 23:58:54 +0000
  • Accept-language: en-US
  • Arc-authentication-results: i=1; mx.microsoft.com 1; spf=pass smtp.mailfrom=epam.com; dmarc=pass action=none header.from=epam.com; dkim=pass header.d=epam.com; arc=none
  • Arc-message-signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector9901; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-SenderADCheck; bh=9KEyaSnnXSQi47ErcDhJ1QhWjuqGllib1++JC4xzFzU=; b=UqTz2EJw3EruXKwF33tlpXdys9peb3kT9+jJ4ILxsNxVsnIe9Gs0K0W2GaIZa/V7fnlvD4nVg5uLn7ppRLFz8n885cCyvm56le41B9irBdEQUqyOB2To9oSMkADIkwsG2uv09aROT4wq9x0PR94gt0ZyF/kGR76Z52FkS5aJhIJlAdOjoFb+D3WLLIaqYjuVLA18ekak2o6XBHhcnm6o+eTEJNX6M8IV1qonMbH//ElNf0LrVQBsuTsl9gMRtrY5mfkAM6E2VtNA5owB0XEf0THrFiTwMiOQU4WpCCbMJvEm2/nEnH6Ed9BBcMeLIY6pnBU49//9ZOLGrUMA/NvcPw==
  • Arc-seal: i=1; a=rsa-sha256; s=arcselector9901; d=microsoft.com; cv=none; b=exGuiaUfxqIsnnrijaEofJi8EoDccZj5aBfPJ9kr2uIjB1aqaGS49OAcBZFfi6EBNA8yC6UHdpc1+xtc4+wlJOjsBd0CmeQoK2i3+CsN/mVNh0CQVdz74gomHQEI8vSedGUAC4VDnILH0MT5GlAfwbRYkh/IKEkVEp8BqFq9boL12TjhrQRajjxZYoj1rmekIJWbgNPE76rAMOE2DX8SfY2b4S+e45db5rqNEbH5CrsPomvdZbVO4f2hkfdo69HhT1bPjU0TNyg6aulWpBzPxtbwos/K/GJLcz7FyETHMnfAz+WKeJnjQlFbvPcJSxUtG3wiv+p/vD6NvpIw58QDZw==
  • Authentication-results: gmail.com; dkim=none (message not signed) header.d=none;gmail.com; dmarc=none action=none header.from=epam.com;
  • Cc: "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>, George Dunlap <george.dunlap@xxxxxxxxxx>, Dario Faggioli <dfaggioli@xxxxxxxx>, Meng Xu <mengxu@xxxxxxxxxxxxx>, Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Ian Jackson <iwj@xxxxxxxxxxxxxx>, Jan Beulich <jbeulich@xxxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>, Wei Liu <wl@xxxxxxx>
  • Delivery-date: Wed, 24 Feb 2021 23:59:12 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>
  • Thread-index: AQHXCYx4A6OUUHr1gkqxWv1TEOLkuqplchSAgAAzdYCAAXE/AIAAtX0AgAAaNQCAABhvgA==
  • Thread-topic: [RFC PATCH 00/10] Preemption in hypervisor (ARM only)

Julien Grall writes:

> On Wed, 24 Feb 2021 at 20:58, Volodymyr Babchuk
> <Volodymyr_Babchuk@xxxxxxxx> wrote:
>> Hi Julien,
>> Julien Grall writes:
>>> On 23/02/2021 12:06, Volodymyr Babchuk wrote:
>>>> Hi Julien,
>>> Hi Volodymyr,
>>>> Julien Grall writes:
>>>>> On 23/02/2021 02:34, Volodymyr Babchuk wrote:
>>>>> ... just rescheduling the vCPU. It will also give the opportunity for
>>>>> the guest to handle interrupts.
>>>>> If you don't return to the guest, then risk to get an RCU sched stall
>>>>> on that the vCPU (some hypercalls can take really really long).
>>>> Ah yes, you are right. I'd only wish that hypervisor saved context
>>>> of
>>>> hypercall on it's side...
>>>> I have example of OP-TEE before my eyes. They have special return
>>>> code
>>>> "task was interrupted" and they use separate call "continue execution of
>>>> interrupted task", which takes opaque context handle as a
>>>> parameter. With this approach state of interrupted call never leaks to > 
>>>> rest of the system.
>>> Feel free to suggest a new approach for the hypercals.
>> I believe, I suggested it right above. There are some corner cases, that
>> should be addressed, of course.
> If we wanted a clean break, then possibly yes.  But I meant one that doesn't
> break all the existing users and doesn't put Xen at risk.
> I don't believe your approach fulfill it.

Of course, we can't touch any hypercalls that are part of stable
ABI. But if I got this right, domctls and sysctls are not stable, so one
can change theirs behavior quite drastically in major releases.

>>>>>> This approach itself have obvious
>>>>>> problems: code that executes hypercall is responsible for preemption,
>>>>>> preemption checks are infrequent (because they are costly by
>>>>>> themselves), hypercall execution state is stored in guest-controlled
>>>>>> area, we rely on guest's good will to continue the hypercall.
>>>>> Why is it a problem to rely on guest's good will? The hypercalls
>>>>> should be preempted at a boundary that is safe to continue.
>>>> Yes, and it imposes restrictions on how to write hypercall
>>>> handler.
>>>> In other words, there are much more places in hypercall handler code
>>>> where it can be preempted than where hypercall continuation can be
>>>> used. For example, you can preempt hypercall that holds a mutex, but of
>>>> course you can't create an continuation point in such place.
>>> I disagree, you can create continuation point in such place. Although
>>> it will be more complex because you have to make sure you break the
>>> work in a restartable place.
>> Maybe there is some misunderstanding. You can't create hypercall
>> continuation point in a place where you are holding mutex. Because,
>> there is absolutely not guarantee that guest will restart the
>> hypercall.
> I don't think we are disagreeing here. My point is you should rarely
> need to hold a mutex for a long period, so you could break your work
> in smaller chunk. In which cases, you can use hypercall continuation.

Let's say in this way: generally you can hold a mutex much longer than
you can hold a spinlock. And nothing catastrophic will happen if you are
preempted while holding a mutex. Better to avoid, this of course.

>> But you can preempt vCPU while holding mutex, because xen owns scheduler
>> and it can guarantee that vCPU will be scheduled eventually to continue
>> the work and release mutex.
> The problem is the "eventually". If you are accounting the time spent
> in the hypervisor to the vCPU A, then there is a possibility that it
> has exhausted its time slice. In which case, your vCPU A may be
> sleeping for a while with a mutex held.
> If another vCPU B needs the mutex, it will either have to wait
> potentially for a long time or we need to force vCPU A to run on
> borrowed time.

Yes, of course.

>>> I would also like to point out that preemption also have some drawbacks.
>>> With RT in mind, you have to deal with priority inversion (e.g. a
>>> lower priority vCPU held a mutex that is required by an higher
>>> priority).
>> Of course. This is not as simple as "just call scheduler when we want
>> to".
> Your e-mail made it sounds like it was easy to add preemption in
> Xen. ;)

I'm sorry for that :)
Actually, there is lots of work to do. It appeared to me that "current"
needs to be reworked, preempt_enable/disable needs to be reworked, per-cpu
variables also should be reworked. And this is just to ensure
consistency of already existing code.

And I am not mentioning x86 support there...

>>> Outside of RT, you have to be careful where mutex are held. In your
>>> earlier answer, you suggested to held mutex for the memory
>>> allocation. If you do that, then it means a domain A can block
>>> allocation for domain B as it helds the mutex.
>> As long as we do not exit to a EL1 with mutex being held, domain A can't
>> block anything. Of course, we have to deal with priority inversion, but
>> malicious domain can't cause DoS.
> It is not really a priority inversion problem outside of RT because
> all the tasks will have the same priority. It is more a time
> accounting problem because each vCPU may have a different number of
> credits.

Speaking of that, RTDS does not use concept of priority. And ARINC of
course neither.

>>>>> I am really surprised that this is the only changes necessary in
>>>>> Xen. For a first approach, we may want to be conservative when the
>>>>> preemption is happening as I am not convinced that all the places are
>>>>> safe to preempt.
>>>> Well, I can't say that I ran extensive tests, but I played with this
>>>> for
>>>> some time and it seemed quite stable. Of course, I had some problems
>>>> with RTDS...
>>>> As I see it, Xen is already supports SMP, so all places where races
>>>> are
>>>> possible should already be covered by spinlocks or taken into account by
>>>> some other means.
>>> That's correct for shared resources. I am more worried for any
>>> hypercalls that expected to run more or less continuously (e.g not
>>> taking into account interrupt) on the same pCPU.
>> Are there many such hypercalls? They can disable preemption if they
>> really need to run on the same pCPU. As I understand, they should be
>> relatively fast, because they can't create continuations anyway.
> Well, I never tried to make Xen preemptible... My comment is based on
> the fact that the use preempt_{enable, disable}() was mostly done on a
> best effort basis.
> The usual suspects are anything using this_cpu() or interacting with
> the per-CPU HW registers.
> From a quick look here a few things (only looked at Arm):
>   * map_domain_page() in particular on arm32 because this is using
> per-CPU page-tables
>   * guest_atomics_* as this uses this_cpu()
>   * virt_to_mfn() in particular the failure path
>   * Incorrect use (or missing) rcu locking. (Hopefully Juergen's
> recent work in the RCU mitigate the risk)
> I can provide guidance, but you will have to go through the code and
> check what's happening.

Thank you for the list. Of course, I need to see thru all the code. I
already had a bunch of problems with per_cpu variables...

Volodymyr Babchuk at EPAM



Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.