
Re: [Xen-devel] [PATCH RFC] xen: if on Xen, "flatten" the scheduling domain hierarchy



On 09/23/2015 10:30 AM, Dario Faggioli wrote:
On Wed, 2015-09-23 at 06:36 +0200, Juergen Gross wrote:

On 09/22/2015 06:22 PM, George Dunlap wrote:
Juergen / Dario, could one of you summarize your two approaches, and
the (alleged) advantages and disadvantages of each one?

Okay, I'll have a try:

Thanks for this! ;-)

The problem we want to solve:
-----------------------------

The Linux kernel is gathering cpu topology data during boot via the
CPUID instruction on each processor coming online. This data is
primarily used in the scheduler to decide to which cpu a thread should
be migrated when this seems to be necessary. There are other users of
the topology information in the kernel (e.g. some drivers try to do
optimizations like core-specific queues/lists).

When started in a virtualized environment the obtained data is next to
useless or even wrong, as it is reflecting only the status at the time
of booting the system. Scheduling of the (v)cpus done by the hypervisor
is changing the topology beneath the feet of the Linux kernel without
reflecting this in the gathered topology information. So any decisions
taken based on that data will be clueless and possibly just wrong.

Exactly.

The minimal solution is to change the topology data in the kernel in a
way that all cpus are regarded as equal regarding their relation to
each other (e.g. when migrating a thread to another cpu no cpu is
preferred as a target).

The topology information of the CPUID instruction is, however, even
accessible from user mode and might be used for licensing purposes of
any user program (e.g. by limiting the software to run on a specific
number of cores or sockets). So just mangling the data returned by
CPUID in the hypervisor seems not to be a general solution, while we
might want to do it at least optionally in the future.
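
To illustrate the point about user mode access, here is a minimal
sketch of how an unprivileged program could read one topology-related
CPUID field. It assumes GCC or Clang on x86 (the <cpuid.h> helper
__get_cpuid()); the function name cpuid_logical_count() is made up for
this example. The field read is CPUID leaf 1, EBX bits 23:16, the
maximum number of addressable logical processor IDs per package.

```c
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/Clang wrapper around the CPUID instruction */
#endif

/*
 * Hypothetical helper, not taken from any patch in this thread.
 * Returns 1 and stores CPUID.1:EBX[23:16] in *count on success,
 * returns 0 if leaf 1 is not available or the build is not x86.
 */
static int cpuid_logical_count(unsigned int *count)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;

    /* Bits 23:16 of EBX: logical processor IDs per physical package. */
    *count = (ebx >> 16) & 0xff;
    return 1;
#else
    (void)count;
    return 0;
#endif
}
```

A license check built on such a value would see whatever the
hypervisor chooses to report, which is why changing the CPUID output
affects more than the guest kernel's scheduler.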

Yep. It turned out that, although being what started all this, CPUID
handling is a somewhat related but mostly independent problem. :-)

In the future we might want to support either dynamic topology updates
or be able to tell the kernel to use some of the topology data, e.g.
when pinning vcpus.

Indeed. At least for the latter. Dynamic looks really difficult to me,
but indeed it would be ideal. Let's see.

Solution 1 (Dario):
-------------------

Don't use the CPUID derived topology information in the Linux
scheduler, but let it use a simple "flat" topology by setting own
scheduler domain data under Xen.
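
As a rough illustration of this idea, the following is a small
userspace model, not kernel code. It assumes a table of topology
levels similar in spirit to what the kernel scheduler accepts through
set_sched_topology(); all names in it (topo_level, set_topology(),
first_shared_level(), the "VCPU" level name, the 8-cpu layout) are
invented for the example.

```c
#include <stddef.h>
#include <stdint.h>

#define NR_CPUS 8
typedef uint32_t cpumask_t;              /* model only: one bit per cpu */

typedef cpumask_t (*mask_fn)(int cpu);

struct topo_level {                      /* hypothetical stand-in for a topology level */
    mask_fn mask;
    const char *name;
};

/* Assumed boot-time view: 2 threads per core, 4 cpus per socket. */
static cpumask_t smt_mask(int cpu) { return (cpumask_t)0x3 << (cpu & ~1); }
static cpumask_t mc_mask(int cpu)  { return (cpumask_t)0xf << (cpu & ~3); }
static cpumask_t all_mask(int cpu) { (void)cpu; return (1u << NR_CPUS) - 1; }

static const struct topo_level default_levels[] = {
    { smt_mask, "SMT" }, { mc_mask, "MC" }, { all_mask, "DIE" },
    { NULL, NULL },
};

/* The "flat" table: a single level spanning all cpus. */
static const struct topo_level flat_levels[] = {
    { all_mask, "VCPU" },
    { NULL, NULL },
};

static const struct topo_level *active_levels = default_levels;

/* Replace the whole table, as an architecture hook might do. */
static void set_topology(const struct topo_level *tl) { active_levels = tl; }

/* Lowest level at which cpus a and b share a domain, or -1 if none. */
static int first_shared_level(int a, int b)
{
    for (int i = 0; active_levels[i].mask; i++)
        if (active_levels[i].mask(a) & (1u << b))
            return i;
    return -1;
}
```

With default_levels, cpus 0 and 1 meet at level 0 while cpus 0 and 4
only meet at level 2; after set_topology(flat_levels) every pair meets
at level 0, so no cpu is preferred as a migration target.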

Advantages:
+ very clean solution regarding the scheduler interface

Yes, this is, I think, one of the main advantages of the patch. The
scheduler is offering an interface to architectures to define their
topology requirements and I'm using it, for specifying our topology
requirements: the tool for the job. :-D

+ scheduler decisions are based on a minimal data set
+ small patch

Disadvantages:
- covers the scheduler only, drivers still use the "wrong" data

This is a good point. It was the patch's purpose, TBH, but it's
certainly true that, if we need something similar elsewhere, we need to
do more.

- a little bit hacky regarding some NUMA architectures (needs either a
    hook in the code dealing with that architecture or multiple
    scheduler domain data overwrites)

As I said in my other email, I'll double check (yes, I also think this
is about AMD boxes with intra-socket NUMA nodes).

- future enhancements will make the solution less clean (either need
    duplicating scheduler domain data or some new hooks in scheduler
    domain interface)

This one, I'm not sure I understand.

What would you do for keeping the topology information of one level,
e.g. hyperthreads, in case we'd have a gang-scheduler in Xen? Either
you would copy the line:

{ cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },

from kernel/sched/core.c into your topology array, or you would add a
way in kernel/sched/core.c to remove all but this entry and add your
entry on top of it.


Solution 2 (Juergen):
---------------------

When booted as a Xen guest modify the topology data built during boot
resulting in the same simple "flat" topology as in Dario's solution.

Advantages:
+ the simple topology is seen by all consumers of topology data as the
    data itself is modified accordingly

Yep, that's a good point.

+ small patch

+ future enhancements rather easy by selecting which data to modify

As for the '-' above about this, I'm not really sure what this means.

In the case mentioned above I just wouldn't zap the
topology_sibling_cpumask in my patch.
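
The difference can be sketched with a small userspace model, not the
actual patch. It assumes that "zapping" a mask means reducing it to
the cpu itself; the struct, the function names and the 4-cpu layout
are invented for the example, and the two fields only stand in for
topology_sibling_cpumask and topology_core_cpumask.

```c
#include <stdbool.h>
#include <stdint.h>

#define NR_CPUS 4
typedef uint32_t cpumask_t;              /* model only: one bit per cpu */

struct cpu_topo {                        /* hypothetical per-cpu topology record */
    cpumask_t thread_siblings;           /* stands in for topology_sibling_cpumask */
    cpumask_t core_siblings;             /* stands in for topology_core_cpumask */
};

static struct cpu_topo topo[NR_CPUS];

/* Assumed boot-time data: 2 threads per core, one socket. */
static void topo_boot_init(void)
{
    for (int cpu = 0; cpu < NR_CPUS; cpu++) {
        topo[cpu].thread_siblings = (cpumask_t)0x3 << (cpu & ~1);
        topo[cpu].core_siblings = (1u << NR_CPUS) - 1;
    }
}

/*
 * Modify the data itself so that every consumer sees the same view.
 * With keep_threads set, the hyperthread level is left untouched.
 */
static void topo_flatten(bool keep_threads)
{
    for (int cpu = 0; cpu < NR_CPUS; cpu++) {
        if (!keep_threads)
            topo[cpu].thread_siblings = 1u << cpu;
        topo[cpu].core_siblings = 1u << cpu;
    }
}
```

Calling topo_flatten(true) after topo_boot_init() leaves cpus 0 and 1
as thread siblings, which corresponds to the gang-scheduler case
discussed above; topo_flatten(false) removes that relation as well.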


Disadvantages:
- interface to scheduler not as clean as in Dario's approach
- scheduler decisions are based on multiple layers of topology data
    where one layer would be enough to describe the topology

This is not too big of a deal, IMO. Not at runtime, at least, as far as
my investigation went for now. Initialization (of scheduling domains)
is a bit clumsy in this case, as scheduling domains are created and
then destroyed/collapsed, but after they are set up, the net effect is
that there's only one scheduling domain with Juergen's patch too,
exactly as with mine.

Dario, are you okay with this summary?

To most of it, yes, and thanks again for it.

Allow me to add a few points, off the top of my head:

  * we need to check whether the two approaches have the same
    performance. In principle, they really should, and early results
    seem to confirm that, but I'd like to run the full set of benches
    (and I'll do that ASAP);

Thanks.

  * I think we want to run even more benchmarks, and run them in
    different (over)load conditions to better assess the effect of the
    change;
  * both our patches provide a solution for Xen (for Xen PV guests, at
    least for now, to be more precise). It is very likely that, e.g.,
    KVM is in a similar situation, hence it may be worth looking for a
    more general solution, especially if that buys us something (e.g.,
    HVM support made easy?)

I wanted to look at this as soon as we've decided which way to go.

I had a discussion with a KVM guy last week and he did not seem
convinced that they need anything other than mangling CPUID (which
they already do).


Thanks and Regards,
Dario

PS. BTW, Juergen, you're not on IRC, on #xendevel, are you?

I'd like to, but I'd need an invitation. My user name is juergen_gross.


Juergen
