Xen project Mailing List

Re: [Xen-devel] [PATCH] x86/vm_event: fix race between vmx_vmexit_handler() and vm_event_resume()

From: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>

Date: Thu, 27 Apr 2017 18:31:39 +0100

Cc: tamas@xxxxxxxxxxxxx, Razvan Cojocaru <rcojocaru@xxxxxxxxxxxxxxx>, xen-devel@xxxxxxxxxxxxx

Delivery-date: Thu, 27 Apr 2017 17:34:53 +0000

List-id: Xen developer discussion <xen-devel.lists.xen.org>

On 27/04/17 09:52, Jan Beulich wrote: >>>> On 27.04.17 at 10:34, <andrew.cooper3@xxxxxxxxxx> wrote: >> On 27/04/2017 09:01, Jan Beulich wrote: >>>>>> On 27.04.17 at 09:22, <rcojocaru@xxxxxxxxxxxxxxx> wrote: >>>> The introspection agent can reply to a vm_event faster than >>>> vmx_vmexit_handler() can complete in some cases, where it is then >>>> not safe for vm_event_set_registers() to modify v->arch.user_regs. >>> This needs more explanation: I cannot see the connection between >>> vmx_vmexit_handler() completing and (un)safety of modification of >>> v->arch.user_regs. The latter is unsafe as long as the vCPU hasn't >>> gone through __context_switch(), and the former doesn't call that >>> function directly or indirectly (i.e. I think it is more than just >>> vmx_vmexit_handler() which needs to be completed by the time >>> register modification becomes safe to do). >> The test scenario here was to step over an int3 breakpoint by setting >> rip += 1. This is a very quick reply and tended to complete before the >> vcpu triggering the introspection event had properly paused and been >> descheduled. >> >> If the reply occurs before __context_switch() happens, >> __context_switch() clobbers the reply by overwriting v->arch.user_regs >> from the stack. If the reply occurs after __context_switch(), but the >> pcpu has gone idle and keeps v in context, v->arch.user_regs is not >> restored to the stack, and the update is similarly missed. > This second case not properly described, I think: v won't be kept in > context once we've gone through __context_switch(). If the pCPU > goes idle, __context_switch() simply will be deferred until another > vCPU gets scheduled onto this pCPU, and be avoided altogether if > it's the original vCPU that gets scheduled back onto this pCPU. But > I do get the point. I think much if not all of the above (suitably > adjusted) needs to go into the commit message. Yes. I did make a slight error. The latter case is that we don't pass through __context_switch() when transitioning to idle. > >> (There is a >> very narrow race condition where both the reply and __context_switch() >> are updating v->arch.user_regs, and who knows what will happen, given >> our memcpy implementation.) > Right, but with the proper serialization of events done now this > isn't a problem anymore either, aiui. Correct. > >>>> The patch additionally removes the sync_vcpu_execstate(v) call from >>>> vm_event_resume(), which is no longer necessary, which removes the >>>> associated broadcast TLB flush (read: performance improvement). >>> Depending on the better explanation above, it may or may not be >>> further necessary to also better explain the "no longer necessary" >>> part here. >> As I understand, it was a previous attempt to fix this bug. >> >> There is nothing (now) in vm_event_resume() which does anything other >> than new updates into v->arch.vm_event and drop the pause reference. >> All updated are applied in context, in the return-to-guest path, which >> is race free. >> >> There is no need for the IPI, or to force the target vcpu out of context >> if the pcpu is currently idle. > I agree, and (as indicated) the better explanation earlier on is > probably sufficient then. > >>>> --- a/xen/common/vm_event.c >>>> +++ b/xen/common/vm_event.c >>>> @@ -357,6 +357,13 @@ void vm_event_resume(struct domain *d, struct >>>> vm_event_domain *ved) >>>> { >>>> vm_event_response_t rsp; >>>> >>>> + /* >>>> + * vm_event_resume() runs either from XEN_DOMCTL_VM_EVENT_OP_*, or >>>> + * EVTCHN_send from the introspection consumer. Both contexts are >>>> + * guaranteed not to be the subject of vm_event responses. >>>> + */ >>>> + ASSERT(d != current->domain); >>> What is this meant to guard against? >> Documentation, as much as anything else. It took a long time to >> untangle the contexts involved here; I'm not convinced the logic is safe >> to run in context, and it certainly doesn't need to. > Well, as said - I think it is too strict now: You only need the vCPU > not be current afaict, and it really matters here which of the two > "current"-s you actually mean (and it looks to me as if you mean > the other one, guaranteeing register state to no longer be on the > stack). We don't know the vcpu at this point; that information comes out of the replies on the ring. We also may process multiple replies in this one loop. In the worse case, we process one reply for every vcpu in d. If the ASSERT() were to be deferred until the middle of the request loop, we could ASSERT(v != current) for every vcpu in d, but that is identical to this single assertion. ~Andrew _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxx https://lists.xen.org/xen-devel

©2013 Xen Project, A Linux Foundation Collaborative Project. All Rights Reserved.
Linux Foundation is a registered trademark of The Linux Foundation.
Xen Project is a trademark of The Linux Foundation.