Xen project Mailing List

Re: [Xen-devel] RFC: HVM de-privileged mode scheduling considerations

To: Ben Catterall <Ben.Catterall@xxxxxxxxxx>, <george.dunlap@xxxxxxxxxxxxx>, <dario.faggioli@xxxxxxxxxx>

From: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>

Date: Mon, 3 Aug 2015 14:54:51 +0100

Delivery-date: Mon, 03 Aug 2015 13:55:06 +0000

List-id: Xen developer discussion <xen-devel.lists.xen.org>

On 03/08/15 14:35, Ben Catterall wrote:
> Hi all,
>
> I am working on an x86 proof-of-concept to evaluate if it is feasible
> to move device models and x86 emulation code for HVM guests into a
> de-privileged context.
>
> I was hoping to get feedback from relevant maintainers on scheduling
> considerations for this system to mitigate potential DoS attacks.
>
> Many thanks in advance,
> Ben
>
> This is intended as a proof-of-concept, with the aim of determining if
> this idea is feasible within performance constraints.
>
> Motivation
> ----------
> The motivation for moving the device models and x86 emulation code
> into ring 3 is to mitigate a system compromise due to a bug in any of
> these systems. These systems are currently part of the hypervisor and,
> consequently, a bug in any of these could allow an attacker to gain
> control of (or perform a DoS on) Xen and/or guests.
>
> Migrating between PCPUs
> -----------------------
> There is a need to support migration between pcpus so that the
> scheduler can still perform this operation. However, there is an issue
> to resolve. Currently, I have a per-vcpu copy of the Xen ring 0 stack
> up to the point of entering the de-privileged mode. This allows us to
> restore this stack and then continue from the entry point when we have
> finished in de-privileged mode. There will be per-pcpu data on these
> per-vcpu stacks such as saved stack frame pointers for the per-pcpu
> stack, smp_processor_id() responses etc.
>
> Therefore, it will be necessary to lock the vcpu to the current pcpu
> when it enters this user mode so that it does not wake up on a
> different pcpu where such pointers and other data are invalid. We can
> do this by setting a hard affinity to the pcpu that the vcpu is
> executing on. See common/wait.c which does something similar to what I
> am doing.
>
> However, needing to have hard affinity to a pcpu leads to the
> following problem:
> - An attacker could lock multiple vcpus to a single pcpu, leading to a
>   DoS. This could be achieved by spinning in a loop in Xen
>   de-privileged mode (assuming a bug in this mode) and performing this
>   operation on multiple vcpus at once. The attacker could wait until all
>   of their vcpus were on the same pcpu and then execute this attack.
>   This could cause the pcpu to, effectively, lock up, as it will be
>   under heavy load, and we would be unable to move work elsewhere.
>
> A solution to the DoS would be to force migration to another pcpu if,
> after, say, 100 quanta have passed, the vcpu has remained in
> de-privileged mode. This forcing of migration would require us to
> forcibly complete the de-privileged operation, and then, just before
> returning into the guest, force a cpu change. We could not just force
> a migration at the schedule call point as the Xen stack needs to
> unwind to free up resources. We would reset this count each time we
> completed a de-privileged mode operation.
>
> A legitimate long-running de-privileged operation would trigger this
> forced migration mechanism. However, it is unlikely that such
> operations will be needed and the count can be adjusted appropriately
> to mitigate this.
>
> Any suggestions or feedback would be appreciated!

I don't see why any scheduling support is needed.

Currently all operations like this are run synchronously in the vmexit
context of the vcpu. Any current DoS is already a real issue.

In any reasonable situation, emulation of a device is a small state
mutation, occasionally kicking off a further action to perform. (The
far bigger risk from this kind of emulation is following bad
pointers/etc, rather than long loops.)

I think it would be entirely reasonable to have a deadline for a single
execution of depriv mode, after which the domain is declared malicious
and killed.

We already have this for host pcpus - the watchdog defaults to 5
seconds. Having a similar cutoff for depriv mode should be fine.

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel
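For readers comparing the two policies discussed in this thread, here is a minimal sketch of how they could be expressed side by side: the quanta counter with forced migration from the RFC, and the per-execution deadline suggested in the reply. Every name here (depriv_state, depriv_tick, DEPRIV_DEADLINE_NS and so on) is hypothetical and is not existing Xen code. The 5 second deadline is an assumption borrowed from the watchdog default mentioned above, and the 100 quanta limit is the "say, 100" figure from the RFC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* All identifiers below are illustrative assumptions, not Xen APIs. */

/* Assumed deadline for one depriv execution, mirroring the 5 s watchdog. */
#define DEPRIV_DEADLINE_NS (5ULL * 1000000000ULL)
/* Quanta budget before forcing a migration ("say, 100" in the RFC). */
#define DEPRIV_MAX_QUANTA  100U

enum depriv_verdict {
    DEPRIV_CONTINUE,      /* within both budgets: keep running */
    DEPRIV_FORCE_MIGRATE, /* quanta budget used up: complete the op, then change pcpu */
    DEPRIV_KILL_DOMAIN    /* deadline exceeded: declare the domain malicious */
};

struct depriv_state {
    uint64_t entry_ns;    /* time at which this depriv execution started */
    unsigned int quanta;  /* quanta spent in depriv mode since the last completion */
};

/* Record the start time when a vcpu enters de-privileged mode. */
static void depriv_enter(struct depriv_state *s, uint64_t now_ns)
{
    assert(s != NULL);
    s->entry_ns = now_ns;
}

/* Reset the quanta count when a de-privileged operation completes. */
static void depriv_complete(struct depriv_state *s)
{
    assert(s != NULL);
    s->quanta = 0;
}

/*
 * Called once per scheduling quantum while the vcpu is still in depriv
 * mode. The deadline is checked first because it is the stronger action.
 */
static enum depriv_verdict depriv_tick(struct depriv_state *s, uint64_t now_ns)
{
    assert(s != NULL);
    s->quanta++;
    if (now_ns - s->entry_ns > DEPRIV_DEADLINE_NS)
        return DEPRIV_KILL_DOMAIN;
    if (s->quanta >= DEPRIV_MAX_QUANTA)
        return DEPRIV_FORCE_MIGRATE;
    return DEPRIV_CONTINUE;
}
```

In this sketch a caller would invoke depriv_enter() on entry to de-privileged mode, depriv_tick() from whatever periodic hook is available, and depriv_complete() once the operation finishes. If only the deadline policy from the reply is wanted, the quanta check can simply be dropped, leaving no scheduler involvement.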
