
Re: [Xen-devel] Strange PVM spinlock case revisited



On 11.02.2013 18:29, Ian Campbell wrote:
> On Fri, 2013-02-08 at 15:33 +0000, Stefan Bader wrote:
>> A while ago I had been reporting an issue which causes Xen PVM guests
>> to hang (apparently related to spinlocks). Since then I was able to
>> gather a few more facts which I try to provide below. For the real
>> reasons, I am still puzzled and would be thankful for any hints or
>> direction.
>>
>> - Happens with Xen 4.1.2 and Xen 4.2 host-side.
>> - PVM guest with 8 VCPUs is running 800 threads doing DB updates.
> 
> This is on a host with >= 8 PCPUs, correct?

Maybe only >= 4, but I cannot remember for sure anymore.
> 
>> - When hanging at least 2 VCPUs in xen_spin_lock_slow with the lock free.
>>   IIRC the weird lock always belongs to one VCPU runqueue.
>> - Testcase shows the problem for guest kernel versions v3.2..v3.5 (not
>>   verified farther back). Since v3.6 it cannot be reproduced. Bisect
>>   ends up with: 611ae8e3f5204f7480b3b405993b3352cfa16662
>>     "x86/tlb: enable tlb flush range support for x86"
> 
> That seems like a very odd candidate for impacting on this issue.
> 
>> - The other potential fix for this was to prevent xen_spin_lock_slow()
>>   from unmasking events (enabling interrupts) if those were enabled
>>   before the spinlock call.
>>
>> Maybe allowing or not allowing the interrupts in xen_spin_lock_slow
>> just manages to tip the likelihood of getting nested deeper.
> 
> Either that or the bug is in the nesting aspect of xen_spin_lock_slow(),
> which seems to be a rather complex dance judging from the comments in
> that function (and one I confess I don't really follow).
> 

I am suspecting something like this, or maybe generally a nesting issue that is
revealed by enabling interrupts here.
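
For reference, this is roughly how the slow path looks to me (condensed from
arch/x86/xen/spinlock.c of that era; paraphrased, not verbatim). The interesting
bits are the per-cpu prev handling and the window where events get unmasked
before the poll:

/* Condensed/paraphrased from arch/x86/xen/spinlock.c (v3.2-ish); not verbatim. */
static noinline int xen_spin_lock_slow(struct arch_spinlock *lock, bool irq_enable)
{
	struct xen_spinlock *xl = (struct xen_spinlock *)lock;
	struct xen_spinlock *prev;
	int irq = __this_cpu_read(lock_kicker_irq);
	int ret;

	/* No kicker irq set up yet: fall back to plain spinning. */
	if (irq == -1)
		return 0;

	/* Announce we are spinning: publish xl in the per-cpu lock_spinners
	 * slot and remember what was there before (the nesting dance). */
	prev = spinning_lock(xl);

	do {
		unsigned long flags;

		xen_clear_irq_pending(irq);

		/* The lock may have been freed while we set up. */
		ret = xen_spin_trylock(lock);
		if (ret) {
			/* If we interrupted an outer slow-path spin, leave the
			 * irq pending so the outer poll re-checks its lock
			 * instead of blocking. */
			if (prev != NULL)
				xen_set_irq_pending(irq);
			goto out;
		}

		flags = arch_local_save_flags();
		if (irq_enable)
			raw_local_irq_enable();	/* <-- events unmasked here */

		/* Block in SCHEDOP_poll until the kicker irq becomes pending.
		 * With events unmasked above, an upcall can arrive at this
		 * point and recursively enter this function for another lock. */
		xen_poll_irq(irq);

		raw_local_irq_restore(flags);
	} while (!xen_test_irq_pending(irq));	/* filter spurious wakeups */

out:
	/* Restore the outer lock (or NULL) into lock_spinners. */
	unspinning_lock(xl, prev);
	return ret;
}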

>>  The same
>> way that the changed TLB flush may reduce it by causing lesser IPIs.
> 
> Ah, that might explain why that patch is relevant, yes.
> 
>> From the static information I can extract from a dump it is hard to
>> tell what exactly is going on. VCPU2 seems to be doing something but
>> looks to be rather deep in event handling. The irq_count is at 2, but
>> irq_count seems to be used in a few more places than just xen_hypervisor
>> callback. Though that at least seems to be the entry on the stack I can
>> see for VCPU2. The contents of the per-cpu variables for irq_reqs and
>> irq_stack_ptr at least seem consistent as if a callback came in while
>> being on the sched_op hypercall. But then the bt command seems to be
>> operating on a completely different stack.
>> VCPU2 seems to be active, not polling (event processing could be initiated
>> from enable_irqs via force callback or before finally exiting the sched_op
>> hypercall), there seems to be a pending upcall but upcalls are masked.
>> The next thing I would try is to instrument the sites incrementing and
>> decrementing irq_count...
>> ---
>>
>> Below some of the info seen from dom0 debugging keys and looking
>> at a dump with crash:
>>
>> Dom0 Info:
>>
>> Polling vCPUs: {1,5-7}
>> VCPU0: CPU0 [has=F] poll=0  upcall_pend = 00, upcall_mask = 00
>> VCPU1: CPU1 [has=F] poll=10 upcall_pend = 00, upcall_mask = 00
>> VCPU2: CPU6 [has=T] poll=0  upcall_pend = 01, upcall_mask = 01
>> VCPU3: CPU2 [has=F] poll=0  upcall_pend = 00, upcall_mask = 00
>> VCPU4: CPU4 [has=F] poll=0  upcall_pend = 00, upcall_mask = 00
>> VCPU5: CPU3 [has=F] poll=0  upcall_pend = 00, upcall_mask = 00
>> VCPU6: CPU7 [has=F] poll=40 upcall_pend = 01, upcall_mask = 01
>> VCPU7: CPU4 [has=F] poll=46 upcall_pend = 01, upcall_mask = 01
>>
>> 10 [0/1]: s=6 n=1 x=0
>> 40 [0/1]: s=6 n=6 x=0
>> 46 [0/1]: s=6 n=7 x=0
>>
>> Guest Info:
>>
>> lock_spinners:
>> [0] (struct xen_spinlock *) 0x
>> [1] (struct xen_spinlock *) 0xffff8803bffc8ae8 = { lock=1, spinners=2 }
>> [2] (struct xen_spinlock *) 0xffff8803bfc93700 = { lock=0, spinners=2 }
> 
> If I understand correctly this means that 2 threads (VCPU2 & 6, I
> suppose) are spinning on this lock, but no one is holding it?

To my reading both went into xen_spin_lock_slow(). VCPU2 may either be in the
raw_local_irq_enable() before doing poll_irq, or maybe in handling some event
after receiving the wakeup irq but before returning from the hypercall. So the
VCPU that held the lock has released it and has likely sent the wakeup already.
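
The force-callback path I mean is the one that raw_local_irq_enable() ends up
in via the pvops, xen_irq_enable() (paraphrased from arch/x86/xen/irq.c, not
verbatim):

/* Paraphrased from arch/x86/xen/irq.c; not verbatim. */
static void xen_irq_enable(void)
{
	struct vcpu_info *vcpu = this_cpu_read(xen_vcpu);

	/* Unmask event delivery for this VCPU... */
	vcpu->evtchn_upcall_mask = 0;

	barrier();	/* unmask, then check (avoid races) */

	/* ...and if an upcall became pending while we were masked, force the
	 * callback now instead of waiting for the next event to arrive. */
	if (unlikely(vcpu->evtchn_upcall_pending))
		xen_force_evtchn_callback();
}

So simply unmasking events in the slow path can pull us straight into event
processing before we ever reach the poll.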

> 
> An interesting hack^Wexperiment might be to make xen_poll_irq use a
> timeout and see if that unwedges things -- this would help confirm that
> the issue is on nested wakeup.

I could do that. It would at least re-activate the other waiters (in the case
shown, VCPU6), and if something in the path of execution on VCPU2 deadlocks,
this would move things ahead.
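
Something like the following should do for that experiment, replacing the plain
xen_poll_irq(irq) inside the retry loop of xen_spin_lock_slow().
xen_poll_irq_timeout() already exists (it even shows up in the backtrace below);
I am assuming its second argument is the absolute SCHEDOP_poll timeout in ns,
with 0 meaning no timeout, so the deadline calculation would need
double-checking:

	/* Experiment sketch: bound the poll so a lost wakeup cannot wedge the
	 * VCPU forever.  1ms is an arbitrary choice; the deadline base
	 * (xen_clocksource_read(), i.e. Xen system time in ns) is an
	 * assumption to verify. */
	xen_poll_irq_timeout(irq, xen_clocksource_read() + NSEC_PER_MSEC);

If the wakeup really got lost, the timeout just drops us back into the retry
loop, which re-tries the lock before polling again, so a wedged waiter would
recover.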

> 
> I suppose xen_spin_unlock could also be instrumented to indicate who it
> last kicked and for which lock and that might show us something?

IIRC I had done this but it did not really show much (a sketch of that
instrumentation is further below). What I have done in the meantime is to
instrument the IRQ service functions in arch/x86/kernel/entry_64.S to give me a
history of callbacks. This shows (in another experiment where VCPUs 1,2,5,6 wait
on the same lock and 2,5,6 are still polling) that the last events on the VCPU
that is not polling are:

xen_do_hypervisor_callback+127
call_softirq+29
call_softirq+125
call_softirq+29
call_softirq+125
call_softirq+29
call_softirq+125
xen_do_hypervisor_callback+28
call_softirq+29
xen_do_hypervisor_callback+28

The lower offsets are where irq_count gets incremented and the higher offsets
are where irq_count gets decremented before ending the callback. So at least in
this case there was an upcall, then a softirq and then another upcall triggered
without the previous one having finished. In another experiment I saw softirq,
upcall, upcall, but it seems there is always a three-level stack.
I believe that softirq followed by upcall would be ok, since softirq processing
enables interrupts, but is it expected to have an upcall interrupted by another
one?
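
For completeness, the xen_spin_unlock instrumentation mentioned above was along
these lines, i.e. a per-cpu record updated where xen_spin_unlock_slow() sends
the kick (names are made up, this is not the exact patch):

/* Hypothetical instrumentation sketch; names are made up, not the real patch. */
struct spin_kick_record {
	struct xen_spinlock	*lock;		/* lock the kick was for */
	int			target;		/* VCPU that was kicked */
	u64			count;		/* kicks sent from this VCPU */
};
static DEFINE_PER_CPU(struct spin_kick_record, last_kick);

/* Called from xen_spin_unlock_slow() right where the wakeup is sent. */
static inline void record_kick(struct xen_spinlock *xl, int cpu)
{
	struct spin_kick_record *r = this_cpu_ptr(&last_kick);

	r->lock = xl;
	r->target = cpu;
	r->count++;
}

Reading those per-cpu records from the dump then shows, for each VCPU, which
lock it last kicked and whom, but as said it did not reveal much.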

> 
> Can someone explain why the non-locked update of lock_spinners in
> spinning_lock() is safe against xen_spin_unlock reading the field from
> another VCPU? I suspect it isn't, so what happens if the other VCPU
> kicks the lock which is just about to become prev? maybe this is handled
> in xen_spin_lock_slow somehow?

Isn't that safe because lock_spinners is percpu?
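
For reference, those helpers look roughly like this (paraphrased from
arch/x86/xen/spinlock.c, not verbatim). Only the owning VCPU ever writes its own
lock_spinners slot; remote VCPUs just read it in xen_spin_unlock_slow() to
decide whom to kick:

/* Paraphrased from arch/x86/xen/spinlock.c (v3.2-ish); not verbatim. */
static DEFINE_PER_CPU(struct xen_spinlock *, lock_spinners);

static inline struct xen_spinlock *spinning_lock(struct xen_spinlock *xl)
{
	struct xen_spinlock *prev;

	/* Save the lock we were already spinning on (if any) and publish
	 * the new lock of interest in our per-cpu slot. */
	prev = __this_cpu_read(lock_spinners);
	__this_cpu_write(lock_spinners, xl);

	wmb();		/* publish the lock of interest before bumping spinners */

	asm(LOCK_PREFIX " incw %0"
	    : "+m" (xl->spinners) : : "memory");

	return prev;
}

static inline void unspinning_lock(struct xen_spinlock *xl,
				   struct xen_spinlock *prev)
{
	asm(LOCK_PREFIX " decw %0"
	    : "+m" (xl->spinners) : : "memory");
	wmb();		/* drop the spinner count before restoring prev */
	__this_cpu_write(lock_spinners, prev);
}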

> 
> In a similar vein do we not need a barrier in unspinning_lock to ensure
> that the remote cpu sees prev getting restored?
> 
>> [3] (struct xen_spinlock *) 0x
>> [4] (struct xen_spinlock *) 0x
>> [5] (struct xen_spinlock *) 0xffff8803bffc8ae8 -> [1]
>> [6] (struct xen_spinlock *) 0xffff8803bfc93700 -> [2]
>> [7] (struct xen_spinlock *) 0xffffffff81f15ef0 = { lock=1, spinners=1 }
>>
>> VCPU[2,6] -> lock of runqueue[2] = (struct rq *) 0xffff8803bfc93700
>>
>> irq_count[1] = 1
>> irq_count[2] = 2
>> It is -1 for the rest.
>>
>> *(struct pt_regs *) irq_regs[2] = {
>>      .ip = 0xffffffff810013aa == hypercall_page+938 -> sched_op
>>      .sp = 0xffff8803bfc83ce8
>> }
>>
>> char *irq_stack_ptr[2] = 0xffff8803bfc83fc0
>>
>> #> bt -T ffff88033d65c4a0
>> PID: 2050   TASK: ffff88033d65c4a0  CPU: 2   COMMAND: "postgres"
>>   [ffff88033d72d530] get_page_from_freelist at ffffffff8111e6df
>>   [ffff88033d72d600] __alloc_pages_nodemask at ffffffff8111ebec
>>   [ffff88033d72d620] check_preempt_curr at ffffffff81050254
>>   [ffff88033d72d640] pull_task at ffffffff8105e33e
>>   [ffff88033d72d670] balance_tasks at ffffffff8105e4b6
>>   [ffff88033d72d680] radix_tree_lookup_slot at ffffffff8130db5e
>>   [ffff88033d72d690] find_get_page at ffffffff811161ee
>>   [ffff88033d72d6b0] find_lock_page at ffffffff81116286
>>   [ffff88033d72d6e0] shmem_getpage_gfp at ffffffff8112dd10
>>   [ffff88033d72d748] pte_pfn_to_mfn at ffffffff81005a78
>>   [ffff88033d72d7a0] get_page_from_freelist at ffffffff8111e6df
>>   [ffff88033d72d7c0] _raw_spin_lock at ffffffff81655e4e
>>   [ffff88033d72d7d0] ext4_ext_check_cache at ffffffff81239248
>>   [ffff88033d72d820] ext4_ext_map_blocks at ffffffff8123e269
>>   [ffff88033d72d870] _raw_spin_lock at ffffffff81655e4e
>>   [ffff88033d72d880] enqueue_to_backlog at ffffffff8153e55f
>>   [ffff88033d72d890] __wake_up_common at ffffffff8104c1c8
>>   [ffff88033d72d8e0] netif_rx.part.82 at ffffffff8153f2de
>>   [ffff88033d72d920] netif_rx at ffffffff8153f400
>>   [ffff88033d72d960] loopback_xmit at ffffffff8146fb1c
>>   [ffff88033d72d990] dev_hard_start_xmit at ffffffff8153fea6
>>   [ffff88033d72d9d0] do_softirq at ffffffff81016284
>>   [ffff88033d72d9f0] local_bh_enable at ffffffff8106d614
>>   [ffff88033d72da00] dev_queue_xmit at ffffffff81540309
>>   [ffff88033d72da50] ip_finish_output at ffffffff815769ab
>>   [ffff88033d72da90] ip_output at ffffffff81577518
>>   [ffff88033d72dac0] ip_local_out at ffffffff81576c09
>>   [ffff88033d72dae0] ip_send_skb at ffffffff81577f4b
>>   [ffff88033d72db00] udp_send_skb at ffffffff8159aaa1
>>   [ffff88033d72db40] ip_generic_getfrag at ffffffff81575030
>>   [ffff88033d72db50] udp_sendmsg at ffffffff8159bd38
>>   [ffff88033d72db90] irq_to_desc at ffffffff810d70b7
>>   [ffff88033d72dbd0] notify_remote_via_irq at ffffffff813a74c1
>>   [ffff88033d72dc10] ttwu_queue at ffffffff81052672
>>   [ffff88033d72dc38] _raw_spin_unlock_irqrestore at ffffffff8165605e
>>   [ffff88033d72dc90] inet_sendmsg at ffffffff815a6474
>>   --- bt -t would end here
>>   [ffff88033d72dcb0] apparmor_socket_sendmsg at ffffffff812d43f7
>>   [ffff88033d72dcd0] sock_sendmsg at ffffffff815266fe
>>   [ffff88033d72dd70] md_make_request at ffffffff814e0606
>>   [ffff88033d72ddd0] kmem_cache_free at ffffffff811605cf
>>   [ffff88033d72de10] mempool_free_slab at ffffffff81118507
>>   [ffff88033d72de20] mempool_free at ffffffff811185b7
>>   [ffff88033d72de50] sys_sendto at ffffffff8152974d
>>   [ffff88033d72dee0] ext4_sync_file at ffffffff8120f07b
>>   [ffff88033d72df00] vfs_write at ffffffff811764b0
>>   [ffff88033d72df50] xen_hypervisor_callback at ffffffff816606db
>>     RIP: 00007f6852c7b39a  RSP: 00007fff431454e0  RFLAGS: 00000212
>>     RAX: 00007f6852fb38b8  RBX: 00007f6852fb3720  RCX: 000000000000014a
>>     RDX: 0000000000000150  RSI: 0000000000000140  RDI: 00007f6852fb3720
>>     RBP: 0000000000000150   R8: 0000000000000015   R9: 0000000000000000
>>     R10: 0000000000000000  R11: 00007f6852c9174a  R12: 00007f6852fb3778
>>     R13: 0000000000000140  R14: 00007f6852fb38b8  R15: 0000000000000001
>>     ORIG_RAX: ffffffffffffffff  CS: e033  SS: e02b
>>
>> #> bt -T ffff88033b812dc0
>> PID: 2069   TASK: ffff88033b812dc0  CPU: 6   COMMAND: "postgres"
>>   [ffff88033b897530] get_page_from_freelist at ffffffff8111e6df
>>   [ffff88033b897600] __alloc_pages_nodemask at ffffffff8111ebec
>>   [ffff88033b897640] radix_tree_lookup at ffffffff8130db6b
>>   [ffff88033b897650] irq_to_desc at ffffffff810d70b7
>>   [ffff88033b897660] irq_get_irq_data at ffffffff810d9f4e
>>   [ffff88033b897670] balance_tasks at ffffffff8105e493
>>   [ffff88033b897680] radix_tree_lookup_slot at ffffffff8130db5e
>>   [ffff88033b897690] find_get_page at ffffffff811161ee
>>   [ffff88033b8976b0] find_lock_page at ffffffff81116286
>>   [ffff88033b8976e0] shmem_getpage_gfp at ffffffff8112dd10
>>   [ffff88033b897748] pte_pfn_to_mfn at ffffffff81005a78
>>   [ffff88033b897770] xen_set_pte_at at ffffffff81008e29
>>   [ffff88033b897788] __raw_callee_save_xen_make_pte at ffffffff810052cd
>>   [ffff88033b8977b0] unlock_page at ffffffff8111589a
>>   [ffff88033b8977d0] __do_fault at ffffffff81138ac9
>>   [ffff88033b8978a0] pvclock_clocksource_read at ffffffff8103dd15
>>   [ffff88033b8978f0] xen_clocksource_read at ffffffff8100a850
>>   [ffff88033b897900] sched_clock at ffffffff8101bd79
>>   [ffff88033b897910] blk_rq_init at ffffffff812ec262
>>   [ffff88033b897930] get_request at ffffffff812f004e
>>   [ffff88033b8979c0] cpumask_next_and at ffffffff81308c66
>>   [ffff88033b8979e0] find_busiest_group at ffffffff8105a061
>>   --- bt -t ends here
>>   [ffff88033b897a30] radix_tree_lookup at ffffffff8130db6b
>>   [ffff88033b897a40] irq_to_desc at ffffffff810d70b7
>>   [ffff88033b897a50] irq_get_irq_data at ffffffff810d9f4e
>>   [ffff88033b897a60] info_for_irq at ffffffff813a636e
>>   [ffff88033b897a80] xen_poll_irq_timeout at ffffffff813a696e
>>   [ffff88033b897ac0] xen_poll_irq at ffffffff813a8450
>>   [ffff88033b897ad0] xen_spin_lock_slow at ffffffff8163ad77
>>   [ffff88033b897b20] xen_spin_lock at ffffffff810121da
>>   [ffff88033b897b40] _raw_spin_lock at ffffffff81655e4e
>>   [ffff88033b897b50] double_rq_lock at ffffffff8105119c
>>   [ffff88033b897b80] load_balance at ffffffff8105e788
>>   [ffff88033b897b88] _raw_spin_unlock_irqrestore at ffffffff8165605e
>>   [ffff88033b897c10] idle_balance at ffffffff8163d57c
>>   [ffff88033b897c60] __schedule at ffffffff81653ea9
>>   [ffff88033b897cc0] sleep_on_page at ffffffff81115810
>>   [ffff88033b897ce0] schedule at ffffffff81653fcf
>>   [ffff88033b897cf0] io_schedule at ffffffff8165407f
>>   [ffff88033b897d10] sleep_on_page at ffffffff8111581e
>>   [ffff88033b897d20] __wait_on_bit at ffffffff8165489f
>>   [ffff88033b897d70] wait_on_page_bit at ffffffff81115988
>>   [ffff88033b897da0] wake_bit_function at ffffffff8108a140
>>   [ffff88033b897dc0] filemap_fdatawait_range at ffffffff81115a9c
>>   [ffff88033b897e60] do_writepages at ffffffff81120ee1
>>   [ffff88033b897eb0] filemap_write_and_wait_range at ffffffff81117398
>>   [ffff88033b897ee0] ext4_sync_file at ffffffff8120ef9f
>>   [ffff88033b897f00] vfs_write at ffffffff811764b0
>>   [ffff88033b897f40] do_fsync at ffffffff811a4a36
>>   [ffff88033b897f70] sys_fdatasync at ffffffff811a4d83
>>   [ffff88033b897f80] system_call_fastpath at ffffffff8165e442
>>     RIP: 00007f6852ce8240  RSP: 00007fff43145618  RFLAGS: 00000246
>>     RAX: 000000000000004b  RBX: ffffffff8165e442  RCX: 00007f6852ce1900
>>     RDX: 00000000000000c5  RSI: 000000000000000c  RDI: 000000000000000c
>>     RBP: 00000000000000c5   R8: 000000000073a550   R9: 0000000000000000
>>     R10: 0000000000000004  R11: 0000000000000246  R12: ffffffff811a4d83
>>     R13: ffff88033b897f78  R14: 0000000000000001  R15: 0000000000af7488
>>     ORIG_RAX: 000000000000004b  CS: e033  SS: e02b
>>




_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel