
Re: [Xen-devel] Notes on stubdoms and latency on ARM



Hi George,

On 17/07/17 12:28, George Dunlap wrote:
On 07/17/2017 11:04 AM, Julien Grall wrote:
Hi,

On 17/07/17 10:25, George Dunlap wrote:
On 07/12/2017 07:14 AM, Dario Faggioli wrote:
On Fri, 2017-07-07 at 14:12 -0700, Stefano Stabellini wrote:
On Fri, 7 Jul 2017, Volodymyr Babchuk wrote:

Since you are using Credit, can you try to disable context switch rate
limiting?

Yep. You are right. In the environment described above (Case 2) I now
get much better results:

    real    1.85
    user    0.00
    sys     1.85

From 113 to 1.85 -- WOW!

Obviously I am no scheduler expert, but shouldn't we do a better job of
advertising a scheduler configuration option that makes things _one
hundred times faster_?!

So, to be fair, so far we've only been bitten this hard by this in
artificially constructed test cases, where either some extreme
assumptions were made (e.g., that all the vCPUs except one always run at
100% load) or pinning was used in a weird and suboptimal way. And there
are workloads where it has been verified to make performance better
(poor SpecVIRT results without it were the main motivation for having it
upstream, and on by default).
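
For reference, the knob in question is the scheduler's context-switch
ratelimit. A minimal example of disabling it, assuming Xen's standard
tooling -- on the Xen command line at boot:

    # disable context-switch rate limiting for all schedulers
    sched_ratelimit_us=0

or at runtime, for the credit scheduler:

    # set the credit scheduler's ratelimit to 0 (off)
    xl sched-credit -s -r 0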

That being said, I personally have never liked rate-limiting; it has
always looked to me like the wrong solution.

In fact, I *think* the only reason it may have been introduced is that
there was a bug in the credit2 code at the time such that it always had
a single runqueue no matter what your actual pcpu topology was.

FWIW, we don't yet parse the pCPU topology on ARM. AFAIU, we always tell
Xen that each CPU is in its own core. Will this have any implications
for the scheduler?

Just checking -- you do mean its own core, as opposed to its own socket?
(Or NUMA node?)

I don't know much about the scheduler, so I might say something stupid here :). Below is the code we have for ARM:

/* XXX these seem awfully x86ish... */
/* representing HT siblings of each logical CPU */
DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_sibling_mask);
/* representing HT and core siblings of each logical CPU */
DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_core_mask);

static void setup_cpu_sibling_map(int cpu)
{
    if ( !zalloc_cpumask_var(&per_cpu(cpu_sibling_mask, cpu)) ||
         !zalloc_cpumask_var(&per_cpu(cpu_core_mask, cpu)) )
        panic("No memory for CPU sibling/core maps");

    /* A CPU is a sibling with itself and is always on its own core. */
    cpumask_set_cpu(cpu, per_cpu(cpu_sibling_mask, cpu));
    cpumask_set_cpu(cpu, per_cpu(cpu_core_mask, cpu));
}

#define cpu_to_socket(_cpu) (0)

After setup_cpu_sibling_map() is called, we never touch cpu_sibling_mask or cpu_core_mask again for a given pCPU. So I would say that each logical CPU is in its own core, but they are all in the same socket at the moment.
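
As an illustration, here is a rough sketch of what deriving these maps
from the MPIDR affinity fields could look like. This is not the actual
Xen implementation: mpidr_of() is a hypothetical helper standing in for
however the MPIDR of a pCPU is recorded at boot, the AFF*/MT masks are
defined locally for the example, and it would run after the
self-initialisation in setup_cpu_sibling_map() above.

/* Illustrative sketch only -- not real Xen code. */
#define AFF0_MASK     0x00ffULL     /* Aff0: thread ID when MT is set */
#define AFF1_MASK     0xff00ULL     /* Aff1: core ID when MT is set   */
#define MPIDR_MT_BIT  (1ULL << 24)  /* Aff0 denotes hardware threads  */

static void update_sibling_map(int cpu)
{
    int i;
    /* mpidr_of() is a placeholder returning the CPU's MPIDR value. */
    uint64_t self = mpidr_of(cpu);

    for_each_online_cpu ( i )
    {
        uint64_t other = mpidr_of(i);

        /* On multithreaded parts (MT bit set), Aff0 is the thread ID:
         * threads of the same core agree on everything above Aff0. */
        if ( (self & MPIDR_MT_BIT) &&
             (self & ~AFF0_MASK) == (other & ~AFF0_MASK) )
            cpumask_set_cpu(i, per_cpu(cpu_sibling_mask, cpu));

        /* CPUs in the same cluster agree on everything above Aff1;
         * a real implementation would honour the MT bit here too. */
        if ( (self & ~(AFF0_MASK | AFF1_MASK)) ==
             (other & ~(AFF0_MASK | AFF1_MASK)) )
            cpumask_set_cpu(i, per_cpu(cpu_core_mask, cpu));
    }
}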


On any system without hyperthreading (or with HT disabled), that's what
an x86 system will see as well.

Most schedulers have one runqueue per logical cpu.  Credit2 has the
option of having one runqueue per logical cpu, one per core (i.e.,
hyperthreads share a runqueue), one per socket (i.e., all cores on the
same socket share a runqueue), or one runqueue across the whole system.
I *think* we made one runqueue per core the default a while back to deal
with hyperthreading, but I may not be remembering correctly.
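
For reference, this is controlled by Credit2's credit2_runqueue boot
parameter; a minimal example, assuming a Xen recent enough to accept all
of the values:

    # one runqueue per socket; other accepted values
    # include cpu, core, node and all
    credit2_runqueue=socket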

In any case, if you don't have threads, then reporting each logical cpu
as its own core is the right thing to do.

The architecture doesn't disallow HT on ARM, though I am not aware of any cores using it today.


If you're mis-reporting sockets, then the scheduler will be unable to
take that into account.  But that's not usually going to be a major
issue, mainly because the scheduler is rarely in a position to determine
which configuration is optimal.  If two vcpus are communicating a lot,
then the optimal configuration is to put them on different cores of the
same socket (so they can share an L3 cache); if two vcpus are computing
independently, then the optimal configuration is to put them on
different sockets, so they can each have their own L3 cache.  Xen isn't
in a position to know which one is more important, so it just assumes
each vcpu is independent.
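
When the administrator does know which case applies, affinity can
express it explicitly. A minimal example with xl, assuming a
hypothetical domain "guest1" and pCPUs 0-3 and 4-7 sitting on two
different sockets:

    # keep vcpu 0 and vcpu 1 on different sockets
    xl vcpu-pin guest1 0 0-3
    xl vcpu-pin guest1 1 4-7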

All that to say: It shouldn't be a major issue if you are mis-reporting
sockets. :-)

Good to know, thank you for the explanation! We might want to parse the bindings correctly to get a bit of an improvement. I will add a task on Jira.

Cheers,

--
Julien Grall

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel

 

