[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [RFC PATCH 00/10] Preemption in hypervisor (ARM only)

  • To: Julien Grall <julien@xxxxxxx>
  • From: Volodymyr Babchuk <Volodymyr_Babchuk@xxxxxxxx>
  • Date: Wed, 24 Feb 2021 20:57:39 +0000
  • Accept-language: en-US
  • Arc-authentication-results: i=1; mx.microsoft.com 1; spf=pass smtp.mailfrom=epam.com; dmarc=pass action=none header.from=epam.com; dkim=pass header.d=epam.com; arc=none
  • Arc-message-signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector9901; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-SenderADCheck; bh=lrqVEyAHN62wb+g3QOCdMW1S4p9d6yIwI0LuCXq8sQw=; b=IVlBbVZAVXBLXftjmQ47R0iFScA5uGylTiJPW3vMBHShDdAMcmk5KJamTK8OQheTNDwm6eTySCZvFO2dGUouyDneophadVfDIh9A+W/r2cjQ+sEZdJZ36mELX14sjx4rH8QNWU7YnrxMPyTEHanAs4I7sDfVde8B6fBRni7Rd/MCh88TviCLsRmMXLLYgfIGLw4lse1ZQxUED9BvwQx9F/OZ4gDKUwfIV1PZEN9iaSFWyBP/QPLikz9Hbhjln0Ppbmz074E5c/A8Sdj8QA4GZohz9qpnBZAiqXRzrzG/ksDY9uJ5sPT1ID89aezI78TO1+IV7RJyyDmWqWuE66/0HA==
  • Arc-seal: i=1; a=rsa-sha256; s=arcselector9901; d=microsoft.com; cv=none; b=IFRFkL72iBYUKqZO950N0cjWciICggOPsf3aMDMAHWa6E5PSJNP3oGCCuc/SIzhX0T4aC0jZUmFL0c/gXyZq0PxC4UOe50d8SvvDNz4QDoz6zf70SzD3KmG9pV1REA//lGEY4F51/V37Pib49YDm0pSyjJkszQaGwnt9f1CzlbLYGKnGar58G18g67qmHPgolMQz9u5vihHUbJO394sPsYxvWlBE//XOFAVzyp4F2mKuLBsxMfETMaWfXvEvuz3FdyNkRhxLjw37/ZpnIXgyB19QFrH4RQqcNy/Gvg8sy17rR8g+7INLXwwMtW5TJ8rf1ZoacBXcgt36c6o1KJdxlQ==
  • Authentication-results: xen.org; dkim=none (message not signed) header.d=none;xen.org; dmarc=none action=none header.from=epam.com;
  • Cc: "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>, George Dunlap <george.dunlap@xxxxxxxxxx>, Dario Faggioli <dfaggioli@xxxxxxxx>, Meng Xu <mengxu@xxxxxxxxxxxxx>, Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Ian Jackson <iwj@xxxxxxxxxxxxxx>, Jan Beulich <jbeulich@xxxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>, Wei Liu <wl@xxxxxxx>
  • Delivery-date: Wed, 24 Feb 2021 20:58:16 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>
  • Thread-index: AQHXCYx4A6OUUHr1gkqxWv1TEOLkuqplchSAgAAzdYCAAXE/AIAAtX0A
  • Thread-topic: [RFC PATCH 00/10] Preemption in hypervisor (ARM only)

Hi Julien,

Julien Grall writes:

> On 23/02/2021 12:06, Volodymyr Babchuk wrote:
>> Hi Julien,
> Hi Volodymyr,
>> Julien Grall writes:
>>> On 23/02/2021 02:34, Volodymyr Babchuk wrote:
>>> ... just rescheduling the vCPU. It will also give the opportunity for
>>> the guest to handle interrupts.
>>> If you don't return to the guest, then risk to get an RCU sched stall
>>> on that the vCPU (some hypercalls can take really really long).
>> Ah yes, you are right. I'd only wish that hypervisor saved context
>> of
>> hypercall on it's side...
>> I have example of OP-TEE before my eyes. They have special return
>> code
>> "task was interrupted" and they use separate call "continue execution of
>> interrupted task", which takes opaque context handle as a
>> parameter. With this approach state of interrupted call never leaks to > 
>> rest of the system.
> Feel free to suggest a new approach for the hypercals.

I believe, I suggested it right above. There are some corner cases, that
should be addressed, of course.

>>>> This approach itself have obvious
>>>> problems: code that executes hypercall is responsible for preemption,
>>>> preemption checks are infrequent (because they are costly by
>>>> themselves), hypercall execution state is stored in guest-controlled
>>>> area, we rely on guest's good will to continue the hypercall.
>>> Why is it a problem to rely on guest's good will? The hypercalls
>>> should be preempted at a boundary that is safe to continue.
>> Yes, and it imposes restrictions on how to write hypercall
>> handler.
>> In other words, there are much more places in hypercall handler code
>> where it can be preempted than where hypercall continuation can be
>> used. For example, you can preempt hypercall that holds a mutex, but of
>> course you can't create an continuation point in such place.
> I disagree, you can create continuation point in such place. Although
> it will be more complex because you have to make sure you break the
> work in a restartable place.

Maybe there is some misunderstanding. You can't create hypercall
continuation point in a place where you are holding mutex. Because,
there is absolutely not guarantee that guest will restart the

But you can preempt vCPU while holding mutex, because xen owns scheduler
and it can guarantee that vCPU will be scheduled eventually to continue
the work and release mutex.

> I would also like to point out that preemption also have some drawbacks.
> With RT in mind, you have to deal with priority inversion (e.g. a
> lower priority vCPU held a mutex that is required by an higher
> priority).

Of course. This is not as simple as "just call scheduler when we want

> Outside of RT, you have to be careful where mutex are held. In your
> earlier answer, you suggested to held mutex for the memory
> allocation. If you do that, then it means a domain A can block
> allocation for domain B as it helds the mutex.

As long as we do not exit to a EL1 with mutex being held, domain A can't
block anything. Of course, we have to deal with priority inversion, but
malicious domain can't cause DoS.

> This can lead to quite serious problem if domain A cannot run (because
> it exhausted its credit) for a long time.

I believe, this problem is related to a priority inversion problem and
they should be addressed together.

>>>> All this
>>>> imposes restrictions on which hypercalls can be preempted, when they
>>>> can be preempted and how to write hypercall handlers. Also, it
>>>> requires very accurate coding and already led to at least one
>>>> vulnerability - XSA-318. Some hypercalls can not be preempted at all,
>>>> like the one mentioned in [1].
>>>> Absence of hypervisor threads/vCPUs. Hypervisor owns only idle
>>>> vCPUs,
>>>> which are supposed to run when the system is idle. If hypervisor needs
>>>> to execute own tasks that are required to run right now, it have no
>>>> other way than to execute them on current vCPU. But scheduler does not
>>>> know that hypervisor executes hypervisor task and accounts spent time
>>>> to a domain. This can lead to domain starvation.
>>>> Also, absence of hypervisor threads leads to absence of high-level
>>>> synchronization primitives like mutexes, conditional variables,
>>>> completions, etc. This leads to two problems: we need to use spinlocks
>>>> everywhere and we have problems when porting device drivers from linux
>>>> kernel.
>>>> Proposed solution
>>>> =================
>>>> It is quite obvious that to fix problems above we need to allow
>>>> preemption in hypervisor mode. I am not familiar with x86 side, but
>>>> for the ARM it was surprisingly easy to implement. Basically, vCPU
>>>> context in hypervisor mode is determined by its stack at general
>>>> purpose registers. And __context_switch() function perfectly switches
>>>> them when running in hypervisor mode. So there are no hard
>>>> restrictions, why it should be called only in leave_hypervisor() path.
>>>> The obvious question is: when we should to try to preempt running
>>>> vCPU?  And answer is: when there was an external event. This means
>>>> that we should try to preempt only when there was an interrupt request
>>>> where we are running in hypervisor mode. On ARM, in this case function
>>>> do_trap_irq() is called. Problem is that IRQ handler can be called
>>>> when vCPU is already in atomic state (holding spinlock, for
>>>> example). In this case we should try to preempt right after leaving
>>>> atomic state. This is basically all the idea behind this PoC.
>>>> Now, about the series composition.
>>>> Patches
>>>>     sched: core: save IRQ state during locking
>>>>     sched: rt: save IRQ state during locking
>>>>     sched: credit2: save IRQ state during locking
>>>>     preempt: use atomic_t to for preempt_count
>>>>     arm: setup: disable preemption during startup
>>>>     arm: context_switch: allow to run with IRQs already disabled
>>>> prepare the groundwork for the rest of PoC. It appears that not all
>>>> code is ready to be executed in IRQ state, and schedule() now can be
>>>> called at end of do_trap_irq(), which technically is considered IRQ
>>>> handler state. Also, it is unwise to try preempt things when we are
>>>> still booting, so ween to enable atomic context during the boot
>>>> process.
>>> I am really surprised that this is the only changes necessary in
>>> Xen. For a first approach, we may want to be conservative when the
>>> preemption is happening as I am not convinced that all the places are
>>> safe to preempt.
>> Well, I can't say that I ran extensive tests, but I played with this
>> for
>> some time and it seemed quite stable. Of course, I had some problems
>> with RTDS...
>> As I see it, Xen is already supports SMP, so all places where races
>> are
>> possible should already be covered by spinlocks or taken into account by
>> some other means.
> That's correct for shared resources. I am more worried for any
> hypercalls that expected to run more or less continuously (e.g not 
> taking into account interrupt) on the same pCPU.

Are there many such hypercalls? They can disable preemption if they
really need to run on the same pCPU. As I understand, they should be
relatively fast, because they can't create continuations anyway.

>> Places which may not be safe to preempt are clustered around task
>> management code itself: schedulers, xen entry/exit points, vcpu
>> creation/destruction and such.
>> For example, for sure we do not want to destroy vCPU which was
>> preempted
>> in hypervisor mode. I didn't covered this case, by the way.
>>>> Patches
>>>>     preempt: add try_preempt() function
>>>>     sched: core: remove ASSERT_NOT_IN_ATOMIC and disable preemption[!]
>>>>     arm: traps: try to preempt before leaving IRQ handler
>>>> are basically the core of this PoC. try_preempt() function tries to
>>>> preempt vCPU when either called by IRQ handler and when leaving atomic
>>>> state. Scheduler now enters atomic state to ensure that it will not
>>>> preempt self. do_trap_irq() calls try_preempt() to initiate preemption.
>>> AFAICT, try_preempt() will deal with the rescheduling. But how about
>>> softirqs? Don't we want to handle them in try_preempt() as well?
>> Well, yes and no. We have the following softirqs:
>>   TIMER_SOFTIRQ - should be called, I believe
>>   RCU_SOFTIRQ - I'm not sure about this, but probably no
> When would you call RCU callback then?

I'm not sure there. But I think, they should be called in the same place
as always: while leaving hypervisor. But I'm not very familiar with
RCU, so I may talk nonsense. 

>>   NEW_TLBFLUSH_CLOCK_PERIOD_SOFTIRQ - should be moved to a separate
>>   thread, maybe?
>>   TASKLET_SOFTIRQ - should be moved to a separate thread >
>> So, looks like only timers should be handled for sure.
>>> [...]
>>>> Conclusion
>>>> ==========
>>>> My main intention is to begin discussion of hypervisor
>>>> preemption. As
>>>> I showed, it is doable right away and provides some immediate
>>>> benefits. I do understand that proper implementation requires much
>>>> more efforts. But we are ready to do this work if community is
>>>> interested in it.
>>>> Just to reiterate main benefits:
>>>> 1. More controllable latency. On embedded systems customers care
>>>> about
>>>> such things.
>>> Is the plan to only offer preemptible Xen?
>> Sorry, didn't get the question.
> What's your plan for the preemption support? Will an admin be able to
> configure Xen to be either preemptible or not?

Honestly, it would be much easier to enable it unconditionally. But I
understand, that this is not feasible. So, I'm looking at a build-time

Volodymyr Babchuk at EPAM



Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.