Re: [Xen-devel] Live-Patch application failure in core-scheduling mode
On 06.02.20 15:02, Sergey Dyasli wrote:
> On 06/02/2020 11:05, Sergey Dyasli wrote:
>> On 06/02/2020 09:57, Jürgen Groß wrote:
>>> On 05.02.20 17:03, Sergey Dyasli wrote:
>>>> Hello,
>>>>
>>>> I'm currently investigating a Live-Patch application failure in
>>>> core-scheduling mode and this is an example of what I usually get:
>>>> (it's easily reproducible)
>>>>
>>>> (XEN) [ 342.528305] livepatch: lp: CPU8 - IPIing the other 15 CPUs
>>>> (XEN) [ 342.558340] livepatch: lp: Timed out on semaphore in CPU quiesce phase 13/15
>>>> (XEN) [ 342.558343] bad cpus: 6 9
>>>>
>>>> (XEN) [ 342.559293] CPU: 6
>>>> (XEN) [ 342.559562] Xen call trace:
>>>> (XEN) [ 342.559565]    [<ffff82d08023f304>] R common/schedule.c#sched_wait_rendezvous_in+0xa4/0x270
>>>> (XEN) [ 342.559568]    [<ffff82d08023f8aa>] F common/schedule.c#schedule+0x17a/0x260
>>>> (XEN) [ 342.559571]    [<ffff82d080240d5a>] F common/softirq.c#__do_softirq+0x5a/0x90
>>>> (XEN) [ 342.559574]    [<ffff82d080278ec5>] F arch/x86/domain.c#guest_idle_loop+0x35/0x60
>>>>
>>>> (XEN) [ 342.559761] CPU: 9
>>>> (XEN) [ 342.560026] Xen call trace:
>>>> (XEN) [ 342.560029]    [<ffff82d080241661>] R _spin_lock_irq+0x11/0x40
>>>> (XEN) [ 342.560032]    [<ffff82d08023f323>] F common/schedule.c#sched_wait_rendezvous_in+0xc3/0x270
>>>> (XEN) [ 342.560036]    [<ffff82d08023f8aa>] F common/schedule.c#schedule+0x17a/0x260
>>>> (XEN) [ 342.560039]    [<ffff82d080240d5a>] F common/softirq.c#__do_softirq+0x5a/0x90
>>>> (XEN) [ 342.560042]    [<ffff82d080279db5>] F arch/x86/domain.c#idle_loop+0x55/0xb0
>>>>
>>>> The first HT sibling is waiting for the second in the LP-application
>>>> context while the second waits for the first in the scheduler context.
>>>>
>>>> Any suggestions on how to improve this situation are welcome.
>>>
>>> Can you test the attached patch, please? It is only tested to boot, so
>>> I did no livepatch tests with it.
>>
>> Thank you for the patch! It seems to fix the issue in my manual testing.
>> I'm going to submit automatic LP testing for both thread/core modes.
>
> Andrew suggested to test late ucode loading as well and so I did.
> It uses stop_machine() to rendezvous cpus and it failed with a similar
> backtrace for a problematic CPU. But in this case the system crashed
> since there is no timeout involved:
>
> (XEN) [ 155.025168] Xen call trace:
> (XEN) [ 155.040095]    [<ffff82d0802417f2>] R _spin_unlock_irq+0x22/0x30
> (XEN) [ 155.069549]    [<ffff82d08023f3c2>] S common/schedule.c#sched_wait_rendezvous_in+0xa2/0x270
> (XEN) [ 155.109696]    [<ffff82d08023f728>] F common/schedule.c#sched_slave+0x198/0x260
> (XEN) [ 155.145521]    [<ffff82d080240e1a>] F common/softirq.c#__do_softirq+0x5a/0x90
> (XEN) [ 155.180223]    [<ffff82d0803716f6>] F x86_64/entry.S#process_softirqs+0x6/0x20
>
> It looks like your patch provides a workaround for the LP case, but other
> cases like stop_machine() remain broken since the underlying issue with
> the scheduler is still there.

Ah, that was actually a very good hint!

When analyzing your initial problems with reboot and cpu offlining I
looked into those cases in detail and concluded that stop_machine_run()
was called inside a tasklet in those cases (which is true). Unfortunately
there are some cases like ucode loading which don't do that, so those
cases need to be considered as well.

Writing another patch...

Juergen

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel
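
The circular wait described in this thread (one HT sibling spinning in a global rendezvous such as livepatch or stop_machine(), the other spinning in the scheduler rendezvous) can be pictured as a cycle in a wait-for graph. The following is only a minimal sketch of that idea and is not Xen code: `wait_ctx`, `cpu_model`, `rendezvous_deadlocked` and `sibling_example` are hypothetical names, and the global rendezvous is simplified to "waits for one specific CPU" rather than "waits for all CPUs".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model, not Xen code: where a CPU is currently spinning. */
enum wait_ctx {
    CTX_RUNNING,            /* not waiting on anyone */
    CTX_GLOBAL_RENDEZVOUS,  /* e.g. livepatch quiesce or stop_machine() */
    CTX_SCHED_RENDEZVOUS    /* core-scheduling sibling rendezvous */
};

struct cpu_model {
    enum wait_ctx ctx;
    size_t waits_for;       /* CPU index waited for; ignored if CTX_RUNNING */
};

/*
 * Follow the wait-for chain starting at 'start'. Returns true only if the
 * chain leads back to 'start', i.e. 'start' itself sits on a cycle.
 */
bool rendezvous_deadlocked(const struct cpu_model *cpus, size_t n, size_t start)
{
    assert(start < n);
    size_t cur = start;

    for (size_t step = 0; step < n; step++) {
        if (cpus[cur].ctx == CTX_RUNNING)
            return false;           /* chain ends at a CPU that can progress */
        cur = cpus[cur].waits_for;
        assert(cur < n);
        if (cur == start)
            return true;            /* came back around: circular wait */
    }
    return false;
}

/*
 * The two-sibling situation from the thread: CPU 0 waits for CPU 1 in the
 * global rendezvous while CPU 1 waits for CPU 0 in the scheduler rendezvous.
 */
bool sibling_example(void)
{
    const struct cpu_model cpus[2] = {
        { CTX_GLOBAL_RENDEZVOUS, 1 },
        { CTX_SCHED_RENDEZVOUS,  0 },
    };
    return rendezvous_deadlocked(cpus, 2, 0);
}
```

In this model `sibling_example()` reports a cycle, while the same setup with the second CPU marked `CTX_RUNNING` does not. The livepatch case in the thread escaped the cycle through its timeout; the stop_machine() case had no timeout, which matches the reported crash.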