Re: BUG in 1f3d87c75129 ("x86/vpt: do not take pt_migrate rwlock in some cases")
On Mon, Jun 14, 2021 at 01:53:09PM +0200, Jan Beulich wrote:
> On 14.06.2021 13:15, Igor Druzhinin wrote:
> > Hi, Boris, Stephen, Roger,
> >
> > We have stress tested recent changes on staging-4.13 which include a
> > backport of the subject. Since the backport is identical to the
> > master branch and all of the pre-reqs are in place, we have no reason
> > to believe the issue is not the same on master.
> >
> > Here is what we got by running heavy stress testing including multiple
> > repeated VM lifecycle operations with storage and network load:
> >
> > Assertion 'timer->status >= TIMER_STATUS_inactive' failed at timer.c:287
> > ----[ Xen-4.13.3-10.7-d  x86_64  debug=y   Not tainted ]----
> > CPU:    17
> > RIP:    e008:[<ffff82d080246b65>] common/timer.c#active_timer+0xc/0x1b
> > RFLAGS: 0000000000010046   CONTEXT: hypervisor (d675v0)
> > rax: 0000000000000000   rbx: ffff83137a8ed300   rcx: 0000000000000000
> > rdx: ffff83303fff7fff   rsi: ffff83303fff2549   rdi: ffff83137a8ed300
> > rbp: ffff83303fff7cf8   rsp: ffff83303fff7cf8   r8:  0000000000000001
> > r9:  0000000000000000   r10: 0000000000000011   r11: 0000168b0cc08083
> > r12: 0000000000000000   r13: ffff82d0805cf300   r14: ffff82d0805cf300
> > r15: 0000000000000292   cr0: 0000000080050033   cr4: 00000000000426e0
> > cr3: 00000013c1a32000   cr2: 0000000000000000
> > fsb: 0000000000000000   gsb: 0000000000000000   gss: 0000000000000000
> > ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0000   cs: e008
> > Xen code around <ffff82d080246b65> (common/timer.c#active_timer+0xc/0x1b):
> >  0f b6 47 2a 84 c0 75 02 <0f> 0b 3c 04 76 02 0f 0b 3c 02 0f 97 c0 5d c3 55
> > Xen stack trace from rsp=ffff83303fff7cf8:
> >    ffff83303fff7d48 ffff82d0802479f1 0000168b0192b846 ffff83137a8ed328
> >    000000001d0776eb ffff83137a8ed2c0 ffff83133ee47568 ffff83133ee47000
> >    ffff83133ee47560 ffff832b1a0cd000 ffff83303fff7d78 ffff82d08031e74e
> >    ffff83102d898000 ffff83133ee47000 ffff83102db8d000 0000000000000011
> >    ffff83303fff7dc8 ffff82d08027df19 0000000000000000 ffff83133ee47060
> >    ffff82d0805d0088 ffff83102d898000 ffff83133ee47000 0000000000000011
> >    0000000000000001 0000000000000011 ffff83303fff7e08 ffff82d0802414e0
> >    ffff83303fff7df8 0000168b0192b846 ffff83102d8a4660 0000168b0192b846
> >    ffff83102d8a4720 0000000000000011 ffff83303fff7ea8 ffff82d080241d6c
> >    ffff83133ee47000 ffff831244137a50 ffff83303fff7e48 ffff82d08031b5b8
> >    ffff83133ee47000 ffff832b1a0cd000 ffff83303fff7e68 ffff82d080312b65
> >    ffff83133ee47000 0000000000000000 ffff83303fff7ee8 ffff83102d8a4678
> >    ffff83303fff7ee8 ffff82d0805d6380 ffff82d0805d5b00 ffffffffffffffff
> >    ffff83303fff7fff 0000000000000000 ffff83303fff7ed8 ffff82d0802431f5
> >    ffff83133ee47000 0000000000000000 0000000000000000 0000000000000000
> >    ffff83303fff7ee8 ffff82d08024324a 00007ccfc00080e7 ffff82d08033930b
> >    ffffffffb0ebd5a0 000000000000000d 0000000000000062 0000000000000097
> >    000000000000001e 000000000000001f ffffffffb0ebd5ad 0000000000000000
> >    0000000000000005 000000000003d91d 0000000000000000 0000000000000000
> >    00000000000003d5 000000000000001e 0000000000000000 0000beef0000beef
> > Xen call trace:
> >    [<ffff82d080246b65>] R common/timer.c#active_timer+0xc/0x1b
> >    [<ffff82d0802479f1>] F stop_timer+0xf5/0x188
> >    [<ffff82d08031e74e>] F pt_save_timer+0x45/0x8a
> >    [<ffff82d08027df19>] F context_switch+0xf9/0xee0
> >    [<ffff82d0802414e0>] F common/schedule.c#sched_context_switch+0x146/0x151
> >    [<ffff82d080241d6c>] F common/schedule.c#schedule+0x28a/0x299
> >    [<ffff82d0802431f5>] F common/softirq.c#__do_softirq+0x85/0x90
> >    [<ffff82d08024324a>] F do_softirq+0x13/0x15
> >    [<ffff82d08033930b>] F vmx_asm_do_vmentry+0x2b/0x30
> >
> > ****************************************
> > Panic on CPU 17:
> > Assertion 'timer->status >= TIMER_STATUS_inactive' failed at timer.c:287
> > ****************************************
>
> Since this suggests a timer was found on the list without ever having
> been initialized, I've spotted a case where this indeed could now
> happen. Could you give the patch below a try?
>
> Jan
>
> x86/vpt: fully init timers before putting onto list
>
> With pt_vcpu_lock() no longer acquiring the pt_migrate lock, parties
> iterating the list and acting on the timers of the list entries will no
> longer be kept from entering their loops by create_periodic_time()'s
> holding of that lock. Therefore at least init_timer() needs calling
> ahead of list insertion, but keep this and set_timer() together.
>
> Fixes: 8113b02f0bf8 ("x86/vpt: do not take pt_migrate rwlock in some cases")
> Reported-by: Igor Druzhinin <igor.druzhinin@xxxxxxxxxx>
> Signed-off-by: Jan Beulich <jbeulich@xxxxxxxx>

Thanks for looking into this so quickly, and sorry for not realizing it
myself when relaxing the locking. Adding the timer to the list without
it being fully initialized was a latent issue even if protected by the
lock initially.

Provided testing shows the issue is fixed:

Reviewed-by: Roger Pau Monné <roger.pau@xxxxxxxxxx>

Roger.
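To illustrate the class of bug being discussed, here is a minimal, single-threaded C sketch of the "initialize fully before publishing" ordering that Jan's patch restores. The names (`struct timer`, `TIMER_STATUS_*`, `init_timer`, `walk_timers`, `create_periodic_time_fixed`) are simplified stand-ins modeled on Xen's `common/timer.c` and `vpt.c`, not the actual hypervisor code; the "concurrent" walker is simulated inline.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for Xen's timer status values: anything below
 * TIMER_STATUS_inactive means the entry was never init_timer()'d. */
enum timer_status {
    TIMER_STATUS_invalid,   /* uninitialized/zeroed memory */
    TIMER_STATUS_inactive,  /* initialized, not armed */
    TIMER_STATUS_active,    /* armed */
};

struct timer {
    enum timer_status status;
    struct timer *next;     /* intrusive singly linked list */
};

static struct timer *timer_list;

static void init_timer(struct timer *t)
{
    t->status = TIMER_STATUS_inactive;
    t->next = NULL;
}

/* Stand-in for an iterator such as pt_save_timer() running on another
 * CPU: like the ASSERT at timer.c:287, it requires every list entry it
 * sees to be initialized. */
static void walk_timers(void)
{
    for (struct timer *t = timer_list; t != NULL; t = t->next)
        assert(t->status >= TIMER_STATUS_inactive);
}

/* Fixed ordering per the patch: the timer is fully initialized before
 * it becomes reachable from the list, so a walker can never observe a
 * TIMER_STATUS_invalid entry. */
static void create_periodic_time_fixed(struct timer *t)
{
    init_timer(t);           /* must happen first ...            */
    t->next = timer_list;    /* ... only then publish the entry  */
    timer_list = t;
    walk_timers();           /* simulated concurrent iteration   */
}
```

Note that in the real multi-CPU case correct ordering alone is not the whole story: visibility is ensured by the locking that the iterating parties still take (`pt_vcpu_lock()`), which is why the patch only needs to move `init_timer()` ahead of the insertion rather than add barriers.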