
Re: [PATCH] x86: make "dom0_nodes=" work with credit2


  • To: Dario Faggioli <dfaggioli@xxxxxxxx>
  • From: Jan Beulich <jbeulich@xxxxxxxx>
  • Date: Fri, 8 Apr 2022 14:36:26 +0200
  • Cc: "ohering@xxxxxxx" <ohering@xxxxxxx>, "george.dunlap@xxxxxxxxxx" <george.dunlap@xxxxxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>
  • Delivery-date: Fri, 08 Apr 2022 12:36:45 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 08.04.2022 13:20, Dario Faggioli wrote:
> On Thu, 2022-04-07 at 15:27 +0200, Jan Beulich wrote:
>> ---
>> The Fixes: tag isn't very precise - it's rather the commit exposing the
>> issue by default. I haven't been able to identify the actual commit
>> which did introduce the problem; it may well be that it has always been
>> there since the introduction of credit2.
>>
>> Credit2 moving the vCPU-s off of their initially assigned ones right
>> away of course renders sched_select_initial_cpu()'s use of
>> cpumask_cycle() pretty useless.
>>
> Mmm... you mean that Credit2 is moving the vCPUs off their assigned
> ones _right_now_, or that it will, with this patch?

Right now. On a 4-node (6 cores each) system with "dom0_nodes=0" I've
observed the 6 vCPU-s to reliably be put on CPUs 13, 14, etc. The
patch is actually undoing that questionable movement.

Since I have a large number of other patches in place (none of which
I would assume to have such an effect), Olaf has meanwhile confirmed
that the change helps for him as well.

> If you mean the former, I'm not sure it is. In fact, when
> sched_select_initial_cpu() is called for dom0, dom0's node affinity is
> just all nodes, isn't it? So, we can pick any CPU in the cpupool,
> and we use cycle to try to spread the vCPUs kind of evenly.

The CPU mask used in the function is 0x3f for the example given above.
I did not check which of the constituent parts of the calculation has
this effect. But the result is that all vCPUs will be put on CPU 0
first, as cpumask_cycle(..., 13) for a mask of 0x3f produces 0.
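
To illustrate the wrap-around, here is a minimal standalone sketch (not
Xen code; next_cpu_wrap() is a made-up analogue of cpumask_cycle(),
operating on a plain 64-bit mask):

#include <stdint.h>
#include <stdio.h>

/* Simplified stand-in for cpumask_cycle(): return the next set bit after
 * 'prev', wrapping around the (64-bit, simplified) mask. */
static unsigned int next_cpu_wrap(unsigned int prev, uint64_t mask)
{
    for ( unsigned int i = 1; i <= 64; i++ )
    {
        unsigned int cpu = (prev + i) % 64;

        if ( mask & (1ULL << cpu) )
            return cpu;
    }
    return 64; /* empty mask */
}

int main(void)
{
    uint64_t dom0_mask = 0x3f; /* CPUs 0-5, i.e. node 0 in the example above */

    /* Starting from CPU 13 (outside the mask) wraps back to CPU 0 ... */
    printf("after 13 -> %u\n", next_cpu_wrap(13, dom0_mask));
    /* ... while cycling inside the mask spreads picks: 2 -> 3, etc. */
    printf("after 2  -> %u\n", next_cpu_wrap(2, dom0_mask));
    return 0;
}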

> If you mean after this patch, well, sure, but that's normal. Again,
> when we pick the initial CPU, we still don't know that the vCPUs have
> an affinity. Or, actually, we know that in sched_setup_dom0_vcpus(),
> but there's no direct way to tell it to sched_init_vcpu() (and hence
> sched_select_initial_cpu()).
> 
> That's because, by default, affinity is just "all cpus" when we create
> the vCPUs, and we change it later if an affinity was specified (in the
> config file, or via the dom0_nodes parameter).

But that's what I'm talking about a little further down, where you
reply that you don't think using the narrower set would hide the
issue.

> Something that *maybe* we can try, since we're handling dom0 vCPUs
> specially anyway, is to directly set dom0's node affinity to the nodes
> of the CPUs we have in dom0_cpus, before calling vcpu_create() (in
> sched_setup_dom0_vcpus(), of course).
> 
> This should make sched_select_initial_cpu() pick one of the "correct"
> CPUs in the first place. But I don't know whether it's worth it, nor
> whether we'll still need this patch anyway (I have to check more
> thoroughly).

As per above - sched_select_initial_cpu() behaves as I would expect
it. It's credit2 which subsequently overrides that decision.

>>  But I guess that's still useful for other schedulers.
>> I wonder though whether sched_init_vcpu() shouldn't use the CPU map
>> calculated by sched_select_initial_cpu() for its call to
>> sched_set_affinity() in the non-pinned case, rather than setting
>> "all".
>>
> If we do that, and there's no affinity configured for the guest, or no
> "dom0_nodes=", when will we reset the affinity to all, which is what it
> should be in such a case?

Why "reset"? When no restrictions are in place, afaict
sched_select_initial_cpu() will calculate a mask of "all".

> Also, if I'm right in my reasoning above, when we come from
> sched_setup_dom0_vcpus(), the mask calculated by
> sched_select_initial_cpu() is basically cpupool0's cpus_valid, so this
> wouldn't really change anything for the problem we're trying to solve
> here.
> 
>> (I guess doing so might mask the issue at hand, but I think the change
>> here would still be applicable at least from an abstract pov.)
>>
> I don't think it would mask it, but I do think that, yes, the change
> you're making would still be applicable.
> 
> And, about it, one thing...
> 
>> --- a/xen/common/sched/core.c
>> +++ b/xen/common/sched/core.c
>> @@ -3403,9 +3403,15 @@ void __init sched_setup_dom0_vcpus(struc
>>      {
>>          for_each_sched_unit ( d, unit )
>>          {
>> +            spinlock_t *lock = unit_schedule_lock_irq(unit);
>> +
>>              if ( !opt_dom0_vcpus_pin && !dom0_affinity_relaxed )
>>                  sched_set_affinity(unit, &dom0_cpus, NULL);
>>              sched_set_affinity(unit, NULL, &dom0_cpus);
>> +
>> +            sched_unit_migrate_start(unit);
>> +            unit_schedule_unlock_irq(lock, unit);
>> +            sched_unit_migrate_finish(unit);
>>          }
>>      }
>>  
> ... The result of this (also considering the call to
> domain_update_node_affinity()) ends up looking very similar to what
> we'd get if, instead of calling sched_set_affinity(), we called
> vcpu_set_affinity().

Funny you should say this: it is what I had done first. It works,
but it is less efficient than the approach above, first and foremost
when there are multiple vCPU-s per unit.

> I'm therefore wondering if we should try to just do that... But I'm not
> sure, mostly because that would mean calling
> domain_update_node_affinity() for each of dom0's vCPUs, which is clearly
> less efficient than just calling it once at the end.

Indeed, that's the other reason why I changed to the approach above.
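
To put numbers on that, a toy standalone model (not Xen code; the
counters merely stand in for the work done by sched_set_affinity() and
domain_update_node_affinity(), and VCPUS_PER_UNIT is an assumed
granularity):

#include <stdio.h>

#define DOM0_VCPUS     6
#define VCPUS_PER_UNIT 2   /* assumed core-scheduling granularity */

static unsigned int affinity_ops, node_affinity_updates;

/* Stand-ins counting calls rather than doing real scheduler work. */
static void set_affinity(void)         { affinity_ops++; }
static void update_node_affinity(void) { node_affinity_updates++; }

int main(void)
{
    /* vcpu_set_affinity()-like: per vCPU, node affinity recomputed
     * each time. */
    affinity_ops = node_affinity_updates = 0;
    for ( unsigned int v = 0; v < DOM0_VCPUS; v++ )
    {
        set_affinity();
        update_node_affinity();
    }
    printf("per-vCPU: %u affinity ops, %u node-affinity recomputations\n",
           affinity_ops, node_affinity_updates);

    /* Patch-like: per scheduling unit, node affinity recomputed once
     * at the end. */
    affinity_ops = node_affinity_updates = 0;
    for ( unsigned int u = 0; u < DOM0_VCPUS / VCPUS_PER_UNIT; u++ )
        set_affinity();
    update_node_affinity();
    printf("per-unit: %u affinity ops, %u node-affinity recomputation\n",
           affinity_ops, node_affinity_updates);

    return 0;
}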

> So I'm thinking that we can indeed do it like this, and add a comment.

A comment to what effect?

Jan
