Xen project Mailing List

[Xen-devel] Re: CPU offlining patch xen-unstable:21049

To: "Jiang, Yunhong" <yunhong.jiang@xxxxxxxxx>

From: Keir Fraser <keir.fraser@xxxxxxxxxxxxx>

Date: Thu, 15 Apr 2010 12:04:01 +0100

Cc: "xen-devel@xxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxx>

Delivery-date: Thu, 15 Apr 2010 04:05:18 -0700

List-id: Xen developer discussion <xen-devel.lists.xensource.com>

Thread-index: AcrcdgaxyxAXcjWt2U2B4gzh49/iiwAAddGwAAM8tEEAAA/mEAABkodd

Thread-topic: CPU offlining patch xen-unstable:21049

I decided to keep the spin_trylock as they quell my paranoia about other possible deadlock scenarios inside those complicated hypercall functions. But I have modified the comments appropriately, in xen-unstable:21179. Note that this also depends on xen-unstable:21178 (we mustn't execute the hypercall continuation immediately, in the context of the caller of c_h_o_c()). Thanks. But, here's a more subtle and more tricky deadlock scenario for you. You'll like this one :-): stop_machine_run() schedules a softirq on every CPU. Let's say CPU A enters our softirq handler, interrupting some guest VCPU X which is still scheduled on CPU A. But some other CPU B could be waiting for X to be descheduled (one obvious example is hvmop_flush_tlb_all, which is a good one because some HVM guest can call that at any time). So we never get full softirq rendezvous because CPU B is spinning in hvmop_flush_tlb_all(), while CPU A spins in the stop_machine softirq handler. Deadlock! What do you think of that? :-D -- Keir On 15/04/2010 11:19, "Jiang, Yunhong" <yunhong.jiang@xxxxxxxxx> wrote: > Aha, yes, you are right. So do I need create a patch, or you can simply revert > some chunks? > > --jyh > >> -----Original Message----- >> From: Keir Fraser [mailto:keir.fraser@xxxxxxxxxxxxx] >> Sent: Thursday, April 15, 2010 6:17 PM >> To: Jiang, Yunhong >> Cc: xen-devel@xxxxxxxxxxxxxxxxxxx >> Subject: Re: CPU offlining patch xen-unstable:21049 >> >> On 15/04/2010 09:50, "Jiang, Yunhong" <yunhong.jiang@xxxxxxxxx> wrote: >> >>> I think the try_lock is not for the cpu_down(). The point is, if another CPU >>> is trying the get the lock. >>> >>> Considering following scnerio: >>> 1) cpu_down() in CPU A, and get the xenpf_lock, then call to >>> stop_machine_run(), trying to bring all CPU to stop_machine_run context. >>> 2) At the same time, another vcpu in CPU B do a xenpf hypercall, and try to >>> get the xenpf_lock. If ther is no retyr for this lock, it can't get >>> xenpf_lock, it will never go to the softirq >>> So the system will hang. >>> >>> Hope this make thing clear. >> >> But CPU A doesn't hold the xenpf_lock when it calls stop_machine_run(). It >> dropped it before cpu_down() got invoked, because that gets executed via >> continue_hypercall_on_cpu(). >> >> -- Keir >> > _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxxxxxxxx http://lists.xensource.com/xen-devel

©2013 Xen Project, A Linux Foundation Collaborative Project. All Rights Reserved.
Linux Foundation is a registered trademark of The Linux Foundation.
Xen Project is a trademark of The Linux Foundation.