
[Xen-devel] Re: CPU offlining patch xen-unstable:21049



I decided to keep the spin_trylock() calls, as they quell my paranoia about
other possible deadlock scenarios inside those complicated hypercall functions.
But I have modified the comments appropriately, in xen-unstable:21179. Note
that this also depends on xen-unstable:21178 (we mustn't execute the
hypercall continuation immediately, in the context of the caller of
continue_hypercall_on_cpu()). Thanks.
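
The try-lock-and-retry pattern under discussion can be sketched roughly as
below. This is a portable single-file model, not the Xen source: demo_lock,
softirq_pending, demo_hypercall() and the DEMO_* results are hypothetical
stand-ins for xenpf_lock, the pending-softirq check and the hypercall
continuation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins; none of these names exist in Xen. */
static atomic_flag demo_lock = ATOMIC_FLAG_INIT;
static bool softirq_pending;   /* set when this CPU has softirq work queued */

enum demo_result { DEMO_DONE, DEMO_RETRY };

/* Returns true if the lock was free and is now held by the caller. */
static bool demo_trylock(void)
{
    return !atomic_flag_test_and_set(&demo_lock);
}

static void demo_unlock(void)
{
    atomic_flag_clear(&demo_lock);
}

/*
 * Model of a hypercall that must not spin unconditionally: if the lock is
 * contended and a softirq is pending, back out and ask to be restarted so
 * the softirq (e.g. a stop_machine rendezvous) can run first.
 */
static enum demo_result demo_hypercall(int *work_done)
{
    assert(work_done);

    while (!demo_trylock()) {
        if (softirq_pending)
            return DEMO_RETRY;   /* models creating a hypercall continuation */
    }

    (*work_done)++;              /* the protected operation */
    demo_unlock();
    return DEMO_DONE;
}
```

A caller that receives DEMO_RETRY would service its pending softirq and then
re-issue the call, so a CPU waiting for the lock never blocks the rendezvous.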

But, here's a more subtle and more tricky deadlock scenario for you. You'll
like this one :-): stop_machine_run() schedules a softirq on every CPU.
Let's say CPU A enters our softirq handler, interrupting some guest VCPU X
which is still scheduled on CPU A. But some other CPU B could be waiting for
X to be descheduled (one obvious example is hvmop_flush_tlb_all, which is a
good one because some HVM guest can call that at any time). So we never get
full softirq rendezvous because CPU B is spinning in hvmop_flush_tlb_all(),
while CPU A spins in the stop_machine softirq handler. Deadlock!
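
To make the circular wait concrete, here is a small deterministic model of the
scenario above. It is only a sketch under simplifying assumptions (two CPUs,
one VCPU, no real scheduler), and struct model, step() and
rendezvous_completes() are hypothetical names, not Xen functions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model state, not Xen code. */
struct model {
    bool a_in_softirq;    /* CPU A is inside the stop_machine softirq handler */
    bool b_in_softirq;    /* CPU B is inside the stop_machine softirq handler */
    bool x_running_on_a;  /* VCPU X is still the current VCPU on CPU A */
    bool b_waits_for_x;   /* CPU B spins until X is descheduled */
};

/* Advance the model by one step; return true if any state changed. */
static bool step(struct model *m)
{
    assert(m);

    /* CPU B stops spinning only once X has been descheduled. */
    if (m->b_waits_for_x && !m->x_running_on_a) {
        m->b_waits_for_x = false;
        return true;
    }

    /* CPU B can enter the softirq handler only when it is not spinning. */
    if (!m->b_waits_for_x && !m->b_in_softirq) {
        m->b_in_softirq = true;
        return true;
    }

    /*
     * CPU A stays in the handler, with X still current, until every CPU
     * has arrived, so it contributes no progress here.
     */
    return false;
}

/* Returns true if both CPUs reach the rendezvous, false on deadlock. */
static bool rendezvous_completes(struct model m)
{
    for (int i = 0; i < 16; i++) {
        if (m.a_in_softirq && m.b_in_softirq)
            return true;
        if (!step(&m))
            return false;   /* no CPU can make progress */
    }
    return false;
}
```

With a_in_softirq, x_running_on_a and b_waits_for_x all set, no step can make
progress and rendezvous_completes() reports the deadlock; clearing either
x_running_on_a or b_waits_for_x lets the rendezvous finish.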

What do you think of that? :-D

 -- Keir

On 15/04/2010 11:19, "Jiang, Yunhong" <yunhong.jiang@xxxxxxxxx> wrote:

> Aha, yes, you are right. So do I need to create a patch, or can you simply
> revert some chunks?
> 
> --jyh
> 
>> -----Original Message-----
>> From: Keir Fraser [mailto:keir.fraser@xxxxxxxxxxxxx]
>> Sent: Thursday, April 15, 2010 6:17 PM
>> To: Jiang, Yunhong
>> Cc: xen-devel@xxxxxxxxxxxxxxxxxxx
>> Subject: Re: CPU offlining patch xen-unstable:21049
>> 
>> On 15/04/2010 09:50, "Jiang, Yunhong" <yunhong.jiang@xxxxxxxxx> wrote:
>> 
>>> I think the try_lock is not for the cpu_down() itself. The point is what
>>> happens if another CPU is trying to get the lock.
>>> 
>>> Consider the following scenario:
>>> 1) cpu_down() runs on CPU A and gets the xenpf_lock, then calls
>>> stop_machine_run(), trying to bring all CPUs into stop_machine_run context.
>>> 2) At the same time, another vcpu on CPU B does a xenpf hypercall and tries
>>> to get the xenpf_lock. If there is no retry for this lock, it can't get
>>> xenpf_lock, and it will never get to the softirq.
>>> So the system will hang.
>>> 
>>> Hope this makes things clear.
>> 
>> But CPU A doesn't hold the xenpf_lock when it calls stop_machine_run(). It
>> dropped it before cpu_down() got invoked, because that gets executed via
>> continue_hypercall_on_cpu().
>> 
>> -- Keir
>> 
> 





 

