
[Xen-devel] Strange PVM spinlock case revisited (nailed down)



I think I can finally prove why allowing interrupts for the duration of poll_irq
in xen_spin_lock_slow is bad! \o/
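
For reference, the slow path in question looks roughly like this in the
v3.2-era arch/x86/xen/spinlock.c (paraphrased and heavily trimmed, so treat it
as a sketch rather than a verbatim quote):

static noinline int xen_spin_lock_slow(struct arch_spinlock *lock, bool irq_enable)
{
        int irq = __this_cpu_read(lock_kicker_irq);
        int ret;

        /* announce ourselves as a spinner of this lock (lock_spinners) */

        do {
                unsigned long flags;

                xen_clear_irq_pending(irq);

                /* recheck: the lock may have been released meanwhile */
                ret = xen_spin_trylock(lock);
                if (ret)
                        goto out;

                flags = arch_local_save_flags();
                if (irq_enable)
                        raw_local_irq_enable();   /* <-- the re-enable in question */

                /* block until the per-cpu kicker irq becomes pending */
                xen_poll_irq(irq);

                raw_local_irq_restore(flags);
        } while (!xen_test_irq_pending(irq));
out:
        /* drop ourselves from lock_spinners again */
        return ret;
}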

So with another run/dump:

VCPUs 1, 2, 5 and 6 are spinning on the runq lock of VCPU 1. The number of
spinners is 4 (which matches) and the lock itself is released.
According to the dom0 debug info, VCPUs 2, 5 and 6 are polling; VCPU 1 is not,
but appears to be running (has=T). The event channel for spinlock1 is pending.
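
For reference, "spinners" and lock_spinners are the bookkeeping of the old
byte-lock implementation in arch/x86/xen/spinlock.c, which roughly looks like
this (paraphrased, not quoted verbatim):

struct xen_spinlock {
        unsigned char lock;             /* 0 -> free; 1 -> locked */
        unsigned short spinners;        /* count of waiting cpus */
};

/* which lock (if any) each cpu is currently blocked on */
static DEFINE_PER_CPU(struct xen_spinlock *, lock_spinners);

So "4 spinners, lock released" translates to spinners == 4 while lock == 0.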

Checking the interrupt stack of VCPU 1 in the guest dump shows:

hypercall_page+938
xen_poll_irq+16
xen_spin_lock_slow+151
xen_spin_lock_flags+99
_raw_spin_lock_irqsave+46
update_shares+156
rebalance_domains+72
run_rebalance_domains+72
__do_softirq+168
call_softirq+99
do_softirq+101
irq_exit+142
xen_evtchn_do_upcall+53
xen_do_hypervisor_callback+101

So right before finishing a previous callback, irq_exit triggers softirq
processing (which runs with interrupts enabled), and while updating the shares
this tries to grab the runq lock of VCPU 1, which is what we see in lock_spinners.
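
Two details make this possible: softirq actions run with interrupts enabled,
and the flags saved by spin_lock_irqsave() consequently still allow the pv
slow path to re-enable them. Roughly, paraphrased from the v3.2 sources:

void irq_exit(void)
{
        sub_preempt_count(IRQ_EXIT_OFFSET);
        if (!in_interrupt() && local_softirq_pending())
                invoke_softirq();               /* ends up in __do_softirq() */
        /* trimmed */
}

asmlinkage void __do_softirq(void)
{
        /* trimmed */
        local_irq_enable();     /* softirq actions run with interrupts on */
        /* run pending actions, incl. SCHED_SOFTIRQ -> run_rebalance_domains() */
        local_irq_disable();
        /* trimmed */
}

/*
 * Because the softirq ran with interrupts enabled, the flags saved by
 * spin_lock_irqsave() still have IF set, so the pv lock slow path is
 * allowed to re-enable interrupts while it blocks:
 */
static void xen_spin_lock_flags(struct arch_spinlock *lock, unsigned long flags)
{
        __xen_spin_lock(lock, !raw_irqs_disabled_flags(flags));
}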

Since irq_count is 2, there is one more interrupt executing right now (though
irq_regs won't show it). So I proceeded upwards in the interrupt stack and
found:
try_to_wake_up+57
default_wake_function+18
autoremove_wake_function+22
wake_bit_function+59
__wake_up_common+88
__wake_up+72
__wake_up_bit+49
end_page_writeback+64
put_io_page+36
ext4_end_bio+338
bio_endio+29
req_bio_endio.isra.45+163
blk_update_request+245
blk_update_bidi_request+49
__blk_end_bidi_request+32
__blk_end_request_all+31
blkif_interrupt+460
handle_irq_event_percpu+85
handle_irq_event+78
handle_edge_irq+132
__xen_evtchn_do_upcall+409
xen_evtchn_do_upcall+47
xen_do_hypervisor_callback+101

There was a bit more on the stack, but try_to_wake_up seemed a believable
current path, even more so after looking into the function:

#ifdef CONFIG_SMP
        /*
         * If the owning (remote) cpu is still in the middle of schedule() with
         * this task as prev, wait until its done referencing the task.
         */
        while (p->on_cpu) {
#ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW
                /*
                 * In case the architecture enables interrupts in
                 * context_switch(), we cannot busy wait, since that
                 * would lead to deadlocks when an interrupt hits and
                 * tries to wake up @prev. So bail and do a complete
                 * remote wakeup.
                 */
                if (ttwu_activate_remote(p, wake_flags))
                        goto stat;
#else
                cpu_relax();
#endif
        }

Prying the task in question out of the stack, it was one currently accounted
to VCPU 6 and still marked on_cpu. VCPU 6 is one of the other waiters for
the runq lock of VCPU 1, which would get picked up as soon as this callback is
done. But the callback cannot finish: the busy-wait above keeps spinning until
VCPU 6 is done referencing the task, and VCPU 6 will not get there while it is
itself stuck waiting for that same runq lock. Unfortunately we "stay awhile...
stay forever!".

Ok, as an experiment I defined __ARCH_WANT_INTERRUPTS_ON_CTXSW for x86, and a
kernel built that way no longer locks up in the PVM guest.
The problem there is that this define is gone since v3.7 (f3e9478674). Also, the
same testcase is not able to trigger the issue since v3.6 (611ae8e3f5). The
TLB flush change there is unlikely to be directly related; rather it changes the
number of IPIs and thereby the likelihood of observing this.

So for v3.6+ kernels, the question is whether it can be ruled out that during
softirq handling rebalance_domains->update_blocked_averages (formerly update_shares)
is called, which can then get into spin_lock_slow and enable interrupts.
I have not seen it myself, nor am I aware of reports about it, but it is likely
one of those rare occurrences.

As for a solution, the simplest one would be never to re-enable interrupts
before poll_irq (sketched below)... Anything else feels rather complicated at
the moment (like trying to make the scheduler disable interrupts explicitly and
use plain spin_lock variants instead of the spin_lock_irq ones... weird hidden
implications in common code "just" for PVM).
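
To illustrate, against the v3.2-era code that would more or less amount to no
longer honouring the caller's saved flags (purely illustrative, not a tested
patch):

static void xen_spin_lock_flags(struct arch_spinlock *lock, unsigned long flags)
{
        /* was: __xen_spin_lock(lock, !raw_irqs_disabled_flags(flags)); */
        __xen_spin_lock(lock, false);   /* keep interrupts off across xen_poll_irq() */
}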

-Stefan

Note: comparing the number of db records created in a 15min run of the testcase
yields similar results with interrupts enabled (via that arch define in v3.2)
or disabled, though I do not have a large sample size and that PVM guest is the
only one running on the host.


