|
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: Hit ASSERT in credit2 code with NR_CPUS=1 build
On Tue, 2021-03-09 at 17:24 +0100, Roger Pau Monné wrote:
> Hello,
>
Hey,
> While looking at the NR_CPUS == 1 build I realized I could reliable
> trigger the following ASSERT by creating a guest (note that dom0
> seems
> to be fine):
>
Yes, I'm (somewhat, not sure if exactly though) able to reproduce.
> (XEN) Assertion 'i != cpu' failed at credit2.c:1725
> (XEN) ----[ Xen-4.15.0-rc x86_64 debug=y Tainted: C ]----
> (XEN) CPU: 0
> (XEN) RIP: e008:[<ffff82d040249399>]
> common/sched/credit2.c#runq_tickle+0x469/0x571
> (XEN) RFLAGS: 0000000000010046 CONTEXT: hypervisor (d4v0)
> (XEN) rax: ffffffffffffffff rbx: 0000000000000000 rcx:
> 0000000000000000
> (XEN) rdx: ffff83086c62feb0 rsi: 0000012774fba66c rdi:
> ffff8307e11d5d40
> (XEN) rbp: ffff83008c8c7cf8 rsp: ffff83008c8c7c68 r8:
> ffff83086c66d6c0
> (XEN) r9: ffff82d0405d1218 r10: 0000000000000000 r11:
> ffff83086c631000
> (XEN) r12: ffff83086c6437c0 r13: 0000000000000000 r14:
> ffff83086c62fe20
> (XEN) r15: ffff82d0405d0320 cr0: 0000000080050033 cr4:
> 00000000003526e0
> (XEN) cr3: 00000007e130d000 cr2: ffff88826910cb38
> (XEN) fsb: 00007efee038b780 gsb: ffff888273400000 gss:
> 0000000000000000
> (XEN) ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: e010 cs: e008
> (XEN) Xen code around <ffff82d040249399>
> (common/sched/credit2.c#runq_tickle+0x469/0x571):
> (XEN) ac ff 75 3d 0f 0b 0f 0b <0f> 0b c7 45 ac 00 00 00 00 48 8d 05
> 6f 7e 38 00
> (XEN) Xen stack trace from rsp=ffff83008c8c7c68:
> [...]
> (XEN)
> (XEN) ****************************************
> (XEN) Panic on CPU 0:
> (XEN) Assertion 'i != cpu' failed at credit2.c:1725
> (XEN) ****************************************
>
Interesting... So, how do cpumasks look like/work, with NR_CPUS=1
(sorry, I couldn't follow all the aspects of it too closely) ?
I'm asking because, what we're doing here is the following. First of
all we put together a cpumask (in `mask`) out of the intersection of
the CPUs that are in the vcpu's hard/soft affinity, are part of this
runqueue, are idle and have not been tickled (where tickled == they've
been poked and will go through schedule() soon):
cpumask_andnot(&mask, &rqd->active, &rqd->idle);
cpumask_andnot(&mask, &mask, &rqd->tickled);
cpumask_and(&mask, &mask, cpumask_scratch_cpu(cpu));
Now, I would very much expect for `mask` to have at most one bit set
(i.e., the one of our only CPU). Actually, considering how unlikely it
would be that our only CPU is both idle and not-tickled, I expect mask
to be empty most of the times.
Anyway, let's say the cpumask has 1 bit set (in which case, it must be
the one associated to CPU 0, I presume?). What we do now is this:
if ( __cpumask_test_and_clear_cpu(cpu, &mask) )
{
...
}
Which I think means that, no matter whether or not we enter the loop,
we clear the bit. Of course, which bit depends on the value of `cpu`...
But with NR_CPUS=1, I don't see how `cpu` can have a value different
than the ID of the one and only CPU we have.
So, in my mind, now `mask` is empty. Therefore, I'm currently clueless
about why we enter this loop...
> for_each_cpu(i, &mask)
> {
> s_time_t score;
>
> /* Already looked at this one above */
> ASSERT(i != cpu); <====
>
... and we reach this point.
I tried to build staging here (with NR_CPUS=1), and I think the code
for this ASSERT(), for me, is:
test %ebx,%ebx
je ffff82d040245ac5 <runq_tickle+0x48a>
(and ffff82d040245ac5 is of course ud2.)
Snf this kind of makes sense. Or, at least, I'm not surprised that, if
we are inside this loop, `i` is actually equal to `cpu`.
What I'm surprised about is that we are inside the loop in the first
place...
I guess I need to think more about it. Any bright ideas that explain
what is going on would be more than appreciated.
> In runq_tickle. I'm afraid I have no clue of what's going on. FTR
> using a non-debug build with NR_CPUS == 1 does seem to work fine and
> I
> don't see any ill effects.
>
Well, yes, special casing `cpu` and dealing with it outside of the loop
is just an optimization, for when soft-affinity is defined for the
vcpu. So it makes sense that things work without the ASSERT().
However, the ASSERT() was there as a consistency check, and it looks to
me to be a valid one, even with NR_CPUS=1, so I really don't know why
it triggers...
Thanks and Regards
--
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)
Attachment:
signature.asc
|
![]() |
Lists.xenproject.org is hosted with RackSpace, monitoring our |