
Re: [Xen-devel] [xen-unstable test] 6374: regressions - FAIL



>>> On 14.03.11 at 11:52, Tim Deegan <Tim.Deegan@xxxxxxxxxx> wrote:
> At 10:39 +0000 on 14 Mar (1300099174), Jan Beulich wrote:
>> > I think this hang comes because although this code:
>> > 
>> >             cpu = cycle_cpu(CSCHED_PCPU(nxt)->idle_bias, nxt_idlers);
>> >             if ( commit )
>> >                CSCHED_PCPU(nxt)->idle_bias = cpu;
>> >             cpus_andnot(cpus, cpus, per_cpu(cpu_sibling_map, cpu));
>> > 
>> > removes the new cpu and its siblings from cpus, cpu isn't guaranteed to
>> > have been in cpus in the first place, and none of its siblings are
>> > either since nxt might not be its sibling.
>> 
>> I had originally spent quite a while verifying that the loop this is
>> in can't be infinite (i.e. there will always be at least one bit
>> removed from "cpus"), and did so again during the last half hour
>> or so.
> 
> I'm pretty sure there are possible passes through this loop that don't
> remove any cpus, though I haven't constructed the full history that gets
> you there.

Actually, while I don't think that this can happen, something else is
definitely broken here: the logic can select a CPU that's not in the
vCPU's affinity mask. How I managed not to notice this when I
originally put the change together, I can't tell. I'll send a patch in
a moment, and I think with that patch it's also easier to see that
each iteration removes at least one bit.

>> > which guarantees that nxt will be removed from cpus, though I suspect
>> > this means that we might not pick the best HT pair in a particular core.
>> > Scheduler code is twisty and hurts my brain so I'd like George's
>> > opinion before checking anything in.
>> 
>> No - that was deliberately done the other way round, to get
>> better symmetry of load across all CPUs. With what you propose,
>> idle_bias would become meaningless.
> 
> I don't see why it would.  As I said, having picked a core we
> might not iterate to pick the best cpu within that core, but the
> round-robining effect is still there.  And even if not, I figured a
> hypervisor crash is worse than a suboptimal scheduling decision. :)

Sure. Just that this code has been there for quite a long time, and
it would be really strange to only now see it start producing hangs
(which apparently aren't that difficult to reproduce - iirc a similar
one was sent around by Ian a few days earlier).

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel
