Re: [Xen-devel] High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x



On 27/03/2013 18:16, Marek Marczykowski wrote:
> On 27.03.2013 17:27, Andrew Cooper wrote:
>> On 27/03/2013 15:51, Marek Marczykowski wrote:
>>> On 27.03.2013 15:49, Marek Marczykowski wrote:
>>>> On 27.03.2013 15:46, Andrew Cooper wrote:
>>>>> As for locating the cause of the legacy vectors, it might be a good idea
>>>>> to stick a printk at the top of do_IRQ() which indicates an interrupt
>>>>> with vector between 0xe0 and 0xef.  This might at least indicate whether
>>>>> legacy vectors are genuinely being delivered, or whether we have some
>>>>> memory corruption causing these effects.
>>>> Ok, will try something like this.
>>> Nothing interesting here...
>>> Only vector 0xf1 for irq 4 and 0xf0 for irq 0 (which match irq dump 
>>> information).
>>>
>> Even in the case where we hit the original assertion?
> Yes, even then.
>
>> If so, then all I can think is that the move_pending flag for that
>> specific GSI has been corrupted in memory somehow.
> I guess this isn't the case, see below.
>
>> I wonder if hexdumping irq_desc[9] after setup, before sleep, on resume
>> and in the case of the assertion failure might give some hints.
> I've tried something like this. Detailed log here:
> http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-suspend-irq9-dump.log

This is concerning, unless I am getting utterly confused.  Jan: Do you
mind double checking my reasoning?

irq 0 through 15 should be the PIC irqs, set up in init_IRQ() in
arch/x86/i8259.c

irq9 should be the irq for the PIC vector which is set up as 0xe9, and
its vector should never change.

Could you put in extra checks for the sanity of per_cpu(vector_irq,
cpu)[v] for each legacy vector v from 0xe0 through 0xef?
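
A minimal sketch of such a check, written as a standalone C model rather
than as a Xen patch. It assumes the identity mapping described above
(vector 0xe0 + n belongs to irq n). The helper name check_legacy_vectors
and the plain int table standing in for per_cpu(vector_irq, cpu) are
hypothetical, chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define NR_VECTORS          256
#define FIRST_LEGACY_VECTOR 0xe0
#define LAST_LEGACY_VECTOR  0xef

/*
 * Hypothetical helper: walk the legacy range of one CPU's vector->irq
 * table and report every entry that breaks the assumed identity mapping
 * (vector 0xe0 + n <-> irq n).  Returns the number of bad entries.
 */
static int check_legacy_vectors(const int *vector_irq, unsigned int cpu)
{
    int bad = 0;

    assert(vector_irq != NULL);

    for (int vector = FIRST_LEGACY_VECTOR; vector <= LAST_LEGACY_VECTOR; vector++) {
        int expected_irq = vector - FIRST_LEGACY_VECTOR;

        if (vector_irq[vector] != expected_irq) {
            printf("cpu%u: vector 0x%02x -> irq %d (expected %d)\n",
                   cpu, (unsigned int)vector, vector_irq[vector], expected_irq);
            bad++;
        }
    }
    return bad;
}
```

In the hypervisor the same loop would use printk and run once per online
CPU, for example after setup, before sleep and on resume.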

>
> Some interesting parts:
> after system startup:
> (XEN) irq_cfg of IRQ 9:
> (XEN)   vector: 138
> (XEN)   move_cleanup_count: 0x0
> (XEN)   move_in_progress: 0x0
> (XEN) irq_desc of IRQ 9:
> (XEN)   status: 80 (IRQ_GUEST | IRQ_PENDING)
>
> Isn't this wrong (status vs move_in_progress)?

This here looks fine.  What do you think is wrong about it?

>
> Then I've run pm-suspend, intentionally failed at the end to prevent actual
> suspend, but run all its hooks. After that:
> (XEN) irq_cfg of IRQ 9:
> (XEN)   vector: 181
> (XEN)   move_cleanup_count: 0x0
> (XEN)   move_in_progress: 0x1
> (XEN) irq_desc of IRQ 9:
> (XEN)   status: 80
>
> So now move_in_progress is consistent with status.
> I waited a few seconds, and move_in_progress was still 0x1. Isn't it
> supposed to be only a temporary state?

move_in_progress gets set by __assign_irq_vector() when the scheduler
decides to move the IRQ.  It can stay set for a long time.

On the next interrupt from this source, the move_in_progress bit being
set causes the IRQ source to be reprogrammed to the new destination.
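
As a rough illustration of why the flag can stay set for a long time, here
is a toy C model of the behaviour described above. It is not Xen's code:
struct toy_irq_cfg, toy_assign_vector and toy_handle_interrupt are
hypothetical names, and the model assumes the move only completes when the
source next raises an interrupt.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the fields shown in the dump; not Xen's struct irq_cfg. */
struct toy_irq_cfg {
    int  vector;            /* vector the IRQ should use from now on */
    int  programmed_vector; /* vector the interrupt source currently targets */
    bool move_in_progress;  /* a move was requested but has not completed */
};

/* Request a move: record the new vector and flag the move as pending. */
static void toy_assign_vector(struct toy_irq_cfg *cfg, int new_vector)
{
    assert(cfg != NULL);

    if (cfg->vector == new_vector)
        return;                     /* nothing to move */

    cfg->vector = new_vector;
    cfg->move_in_progress = true;   /* stays set until an interrupt arrives */
}

/*
 * Called when the source next fires.  Completes a pending move by
 * retargeting the source, and returns true if a move was completed.
 */
static bool toy_handle_interrupt(struct toy_irq_cfg *cfg)
{
    assert(cfg != NULL);

    if (!cfg->move_in_progress)
        return false;

    cfg->programmed_vector = cfg->vector;
    cfg->move_in_progress = false;
    return true;
}
```

With the values from the log, toy_assign_vector(&cfg, 181) on an IRQ at
vector 138 leaves move_in_progress set until toy_handle_interrupt() is
next called, however long that takes on a quiet interrupt line.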

>
> Then I suspended, and at resume hit that bug. The state was:
> (XEN) irq_cfg of IRQ 9:
> (XEN)   vector: 60
> (XEN)   move_cleanup_count: 0x0
> (XEN)   move_in_progress: 0x0
> (XEN) irq_desc of IRQ 9:
> (XEN)   status: 16
>
> move_in_progress==0, ok. But move_cleanup_count==0, even though
> move_in_progress was 1 at least once. Isn't that wrong?
>

move_cleanup_count is only set in send_cleanup_vector, for the specific
vector which is being cleaned up.

However, as the IPI handler cleans up all vectors which are outstanding,
the move_cleanup_count can be 0 for most vectors which are actually
cleaned up.

This is in an attempt to reduce the number of IPIs required to clean up
all moving irqs.  As the scheduler currently has a habit of moving vcpus
at every scheduling opportunity, this means that irqs are constantly moving.

~Andrew
