Xen project Mailing List

Re: [RFC PATCH 00/10] Preemption in hypervisor (ARM only)

To: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>

From: Volodymyr Babchuk <Volodymyr_Babchuk@xxxxxxxx>

Date: Wed, 24 Feb 2021 23:37:35 +0000

Cc: "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>, George Dunlap <george.dunlap@xxxxxxxxxx>, Dario Faggioli <dfaggioli@xxxxxxxx>, Meng Xu <mengxu@xxxxxxxxxxxxx>, Ian Jackson <iwj@xxxxxxxxxxxxxx>, Jan Beulich <jbeulich@xxxxxxxx>, Julien Grall <julien@xxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>, Wei Liu <wl@xxxxxxx>

Delivery-date: Wed, 24 Feb 2021 23:38:00 +0000

List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

Hi Andrew,

Andrew Cooper writes:

> On 23/02/2021 02:34, Volodymyr Babchuk wrote:
>> Hello community,
>>
>> Subject of this cover letter is quite self-explanatory. This patch
>> series implements PoC for preemption in hypervisor mode.
>>
>> This is the sort of follow-up to recent discussion about latency
>> ([1]).
>>
>> Motivation
>> ==========
>>
>> It is well known that Xen is not preemptable. On other words, it is
>> impossible to switch vCPU contexts while running in hypervisor
>> mode. Only one place where scheduling decision can be made and one
>> vCPU can be replaced with another is the exit path from the hypervisor
>> mode. The one exception are Idle vCPUs, which never leaves the
>> hypervisor mode for obvious reasons.
>>
>> This leads to a number of problems. This list is not comprehensive. It
>> lists only things that I or my colleagues encountered personally.
>>
>> Long-running hypercalls. Due to nature of some hypercalls they can
>> execute for arbitrary long time. Mostly those are calls that deal with
>> long list of similar actions, like memory pages processing. To deal
>> with this issue Xen employs most horrific technique called "hypercall
>> continuation". When code that handles hypercall decides that it should
>> be preempted, it basically updates the hypercall parameters, and moves
>> guest PC one instruction back. This causes guest to re-execute the
>> hypercall with altered parameters, which will allow hypervisor to
>> continue hypercall execution later. This approach itself have obvious
>> problems: code that executes hypercall is responsible for preemption,
>> preemption checks are infrequent (because they are costly by
>> themselves), hypercall execution state is stored in guest-controlled
>> area, we rely on guest's good will to continue the hypercall. All this
>> imposes restrictions on which hypercalls can be preempted, when they
>> can be preempted and how to write hypercall handlers.
>> Also, it requires very accurate coding and already led to at least one
>> vulnerability - XSA-318. Some hypercalls can not be preempted at all,
>> like the one mentioned in [1].
>>
>> Absence of hypervisor threads/vCPUs. Hypervisor owns only idle vCPUs,
>> which are supposed to run when the system is idle. If hypervisor needs
>> to execute own tasks that are required to run right now, it have no
>> other way than to execute them on current vCPU. But scheduler does not
>> know that hypervisor executes hypervisor task and accounts spent time
>> to a domain. This can lead to domain starvation.
>>
>> Also, absence of hypervisor threads leads to absence of high-level
>> synchronization primitives like mutexes, conditional variables,
>> completions, etc. This leads to two problems: we need to use spinlocks
>> everywhere and we have problems when porting device drivers from linux
>> kernel.
>
> You cannot reenter a guest, even to deliver interrupts, if pre-empted at
> an arbitrary point in a hypercall. State needs unwinding suitably.
>

Yes, Julien pointed this out to me already. So it looks like hypercall
continuations are still needed.

> Xen's non-preemptible-ness is designed to specifically force you to not
> implement long-running hypercalls which would interfere with timely
> interrupt handling in the general case.

What if long-running hypercalls are still required? There are other
options, like async calls, for example.

> Hypervisor/virt properties are different to both a kernel-only-RTOS, and
> regular usespace. This was why I gave you some specific extra scenarios
> to do latency testing with, so you could make a fair comparison of
> "extra overhead caused by Xen" separate from "overhead due to
> fundamental design constraints of using virt".

I can't see any fundamental constraints there. I see how the
virtualization architecture can influence context switch time: how many
actions you need to perform to switch from one vCPU to another.
I have low-level things in mind there: reprogramming the MMU to use
another set of tables, reprogramming the interrupt controller, the
timer, etc. Of course, you can't get latency lower than the context
switch time. This is the only fundamental constraint I can see. But all
other things are debatable.

As for latency testing, I'm not interested in absolute times per se. I
have already determined that the time needed to switch vCPU context on
my machine is about 9us. That is fine for me. I am interested in a
(semi-)guaranteed reaction time. And Xen is doing quite well in most
cases. But there are other cases, where long-lasting hypercalls cause
spikes in reaction time.

> Preemption like this will make some benchmarks look better, but it also
> introduces the ability to create fundamental problems, like preventing
> any interrupt delivery into a VM for seconds of wallclock time while
> each vcpu happens to be in a long-running hypercall.
>
> If you want timely interrupt handling, you either need to partition your
> workloads by the long-running-ness of their hypercalls, or not have
> long-running hypercalls.

... or do long-running tasks asynchronously. I believe that for most
domctls and sysctls there is no need to hold the calling vCPU in
hypervisor mode at all.

> I remain unconvinced that preemption is an sensible fix to the problem
> you're trying to solve.

Well, this is the purpose of this little experiment. I want to discuss
different approaches and to estimate the amount of effort required.

By the way, from the x86 point of view, how hard is it to switch vCPU
context while it is running in hypervisor mode?

-- 
Volodymyr Babchuk at EPAM
