
Re: [Xen-devel] PV-vNUMA issue: topology is misinterpreted by the guest



On Fri, 2015-07-24 at 12:28 +0200, Juergen Gross wrote:
> On 07/23/2015 04:07 PM, Dario Faggioli wrote:

> > FWIW, I was thinking that the kernel was a better place, as Juergen is
> > saying, while now I'm more convinced that tools would be more
> > appropriate, as Boris is saying.
> 
> I've collected some information from the linux kernel sources as a base
> for the discussion:
> 
That's great, thanks for this!

> The complete numa information (cpu->node and memory->node relations) is
> taken from the acpi tables (srat, slit for "distances").
> 
Ok. And I already have a question (as I lost track of things a bit).
What you just said about ACPI tables is certainly true for baremetal and
HVM guests, but what about PV? At the time I was looking into it, together
with Elena, there were Linux patches being produced for the PV case, which
makes sense.
However, ISTR both Wei and Elena mentioning recently that those patches
have not been upstreamed in Linux yet... Is that the case? Or are at least
some of them there, even if not all? Because if not, I'm not sure I see
how a PV guest would even see a vNUMA topology (which it does).

Of course, I can go and check, but since you just looked, you may have
it fresh and clear already. :-)

> The topology information is obtained via:
> - intel:
>    + cpuid leaf b with subleafs, leaf 4
>    + cpuid leaf 2 and/or leaf 1 if leaf b and/or 4 isn't available
> - amd:
>    + cpuid leaf 8000001e, leaf 8000001d, leaf 4
>    + msr c001100c
>    + cpuid leaf 2 and/or leaf 1 if leaf b and/or 4 isn't available
> 
> The scheduler is aware of:
> - smt siblings (from topology)
> - last-level-cache siblings (from topology)
> - node siblings (from numa information)
>
Right. So, this confirms what we were guessing: we need to "reconcile"
these two sources of information (from the guest point of view).
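
Just to make the "two sources" concrete, here is a minimal userspace
sketch (plain sysfs reads; cpu0/node0 are just examples) of comparing
what the guest kernel derived from CPUID (the topology sibling masks)
with what it derived from the NUMA info (the per-node cpulists):

  /* Compare the topology the guest kernel derived from CPUID with the
   * node layout it derived from SRAT (or vNUMA info, in our case). */
  #include <stdio.h>

  static void show(const char *path)
  {
      char buf[256];
      FILE *f = fopen(path, "r");

      if (f && fgets(buf, sizeof(buf), f))
          printf("%-58s %s", path, buf);   /* fgets keeps the '\n' */
      else
          printf("%-58s <unavailable>\n", path);

      if (f)
          fclose(f);
  }

  int main(void)
  {
      show("/sys/devices/system/cpu/cpu0/topology/thread_siblings_list");
      show("/sys/devices/system/cpu/cpu0/topology/core_siblings_list");
      show("/sys/devices/system/node/node0/cpulist");
      return 0;
  }

If the sibling masks end up spanning more than one vnode, that is exactly
the kind of mismatch we are talking about.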

Both the 'in kernel' and 'in toolstack' approaches should have all the
necessary information to make things match, I think. In fact, in the
toolstack, we know what the vNUMA topology is (we're parsing it and
actually putting it in place!). In the kernel, we know it as we read it
from the tables or from hypercalls (isn't that so, for PV guests?).

In fact, I think that it is the topology, i.e., what comes from CPUID
(and, on AMD, MSRs), that needs to adapt and follow vNUMA as much as
possible. Do we agree on this?

IMO, the thing boils down to these:

 1) from where (kernel vs. toolstack) is it easiest and most effective
    to enact the CPUID fiddling? As in, can we do that in the toolstack
    (see the xl snippet right after this list)? (Andrew was not so sure,
    and Boris found issues, although Jan seems to think they're no show
    stopper.)
    I'm quite certain that we can do it from inside the kernel, although,
    how early would we need to be doing it? Do we have the vNUMA info
    already by then?

 2) when tweaking the values of CPUID leaves and MSRs, are there other
    vNUMA (and topology in general) constraints and requirements we
    should take into account? For instance, do we want, for licensing
    reasons, all (or most) of the vcpus to be siblings, rather than full
    sockets? Etc.
     2a) if yes, how and where are these constraints specified?
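
Re 1), and modulo my memory of the exact syntax (xl.cfg(5) has the
details), xl does already let a config override individual CPUID leaves,
along these lines:

  # xend-style cpuid override in the domain config (syntax from memory):
  # force leaf 0x1 EBX bits 23:16 (logical processor count) to 2, and
  # leave everything else at its default.
  cpuid = [ "0x1:ebx=xxxxxxxx00000010xxxxxxxxxxxxxxxx" ]

So a mechanism is there; the open question is whether it is enough (and
early enough) for what we need.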

Looking at 1) only, it still seems to me that doing things within the
kernel would be the way to go.

When looking at 2), OTOH, the toolstack variants start to be more
appealing, especially depending on our answer to 2a). In fact, if we want
to give the user a way to specify this siblings-vs-cores-vs-sockets
information, it would IMO be good to deal with that in the tools, rather
than having to involve Xen or Linux!
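
Something like this, just to give an idea (the knob below is purely made
up here for illustration, nothing of the sort exists today):

  # Hypothetical, non-existing config option: how the guest's vcpus
  # should be presented, topology-wise (e.g., for 8 vcpus).
  vtopology = "sockets=1,cores=4,threads=2"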

> It will especially move tasks from one cpu to another first between smt
> siblings, second between llc siblings, third between node siblings and
> last all cpus.
> 
Yep, this part, I knew.

Maybe there is room for "fixing" this at this level, by hooking into the
scheduler code... but I'm shooting in the dark here, without having
checked whether and how this could really be feasible. Should I go and
check?

One thing I don't like about this approach is that it would potentially
solve the vNUMA-related (and other) scheduling anomalies, but...

> cpuid instruction is available for user mode as well.
> 
...it would not do any good for other subsystems, or for user-level code
and apps.
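
For instance, any application can execute cpuid itself and draw its own
conclusions about the topology, no matter what the scheduler does. A
quick sketch (userspace, GCC's <cpuid.h>):

  #include <stdio.h>
  #include <cpuid.h>

  int main(void)
  {
      unsigned int eax, ebx, ecx, edx;

      if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
          return 1;

      /* EBX[23:16]: maximum number of addressable logical processor IDs
       * in this package (meaningful when the HTT flag, EDX[28], is set).
       * Licensing checks and thread-pool sizing code look at exactly
       * this kind of thing. */
      printf("logical cpus per package: %u, HTT: %u\n",
             (ebx >> 16) & 0xff, (edx >> 28) & 1);
      return 0;
  }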

Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
