
Re: [Xen-devel] [PATCH 0/4] mitigate the per-pCPU blocking list may be too long



(Chao Gao got lost from the recipients list again; re-adding)

>>> On 08.05.17 at 11:13, <george.dunlap@xxxxxxxxxx> wrote:
> On 08/05/17 17:15, Chao Gao wrote:
>> On Wed, May 03, 2017 at 04:21:27AM -0600, Jan Beulich wrote:
>>>>>> On 03.05.17 at 12:08, <george.dunlap@xxxxxxxxxx> wrote:
>>>> On 02/05/17 06:45, Chao Gao wrote:
>>>>> On Wed, Apr 26, 2017 at 05:39:57PM +0100, George Dunlap wrote:
>>>>>> On 26/04/17 01:52, Chao Gao wrote:
>>>>>>> I compared the maximum of #entry in one list and #event (adding entry to
>>>>>>> PI blocking list) with and without the three latter patches. Here
>>>>>>> is the result:
>>>>>>> -------------------------------------------------------------
>>>>>>> |               |                      |                    |
>>>>>>> |    Items      |   Maximum of #entry  |      #event        |
>>>>>>> |               |                      |                    |
>>>>>>> -------------------------------------------------------------
>>>>>>> |               |                      |                    |
>>>>>>> |W/ the patches |         6            |       22740        |
>>>>>>> |               |                      |                    |
>>>>>>> -------------------------------------------------------------
>>>>>>> |               |                      |                    |
>>>>>>> |W/O the patches|        128           |       46481        |
>>>>>>> |               |                      |                    |
>>>>>>> -------------------------------------------------------------
>>>>>>
>>>>>> Any chance you could trace how long the list traversal took?  It would
>>>>>> be good for future reference to have an idea what kinds of timescales
>>>>>> we're talking about.
>>>>>
>>>>> Hi.
>>>>>
>>>>> I made a simple test to get the time consumed by the list traversal.
>>>>> Apply the below patch and create one hvm guest with 128 vcpus and a
>>>>> passthrough 40 NIC. All guest vcpus are pinned to one pcpu. Collect
>>>>> data by 'xentrace -D -e 0x82000 -T 300 trace.bin' and decode it with
>>>>> xentrace_format. When the list length is about 128, the traversal time
>>>>> is in the range of 1750 cycles to 39330 cycles. The physical cpu's
>>>>> frequency is 1795.788MHz, therefore the time consumed is in the range
>>>>> of 1us to 22us. If 0.5ms is the upper bound the system can tolerate,
>>>>> at most 2900 vcpus can be added into the list.
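
As a sanity check on the figures above, the cycle counts convert to time
as follows; the short stand-alone C snippet below is illustrative only
(not part of the patch referenced above) and assumes, as the mail does,
that traversal cost scales linearly with list length:

#include <stdio.h>

int main(void)
{
    const double tsc_mhz   = 1795.788; /* TSC frequency quoted above      */
    const double cycles    = 39330.0;  /* worst case for ~128 entries     */
    const double entries   = 128.0;
    const double budget_us = 500.0;    /* 0.5ms tolerated upper bound     */

    /* MHz == cycles per microsecond, so cycles / MHz gives microseconds. */
    double worst_us    = cycles / tsc_mhz;                    /* ~21.9us  */
    double max_entries = budget_us / (worst_us / entries);    /* ~2900    */

    printf("worst-case traversal: %.1f us\n", worst_us);
    printf("entries fitting in %.0f us: ~%.0f\n", budget_us, max_entries);
    return 0;
}

This reproduces the ~22us worst case and the ~2900-entry bound quoted in
the mail.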
>>>>
>>>> Great, thanks Chao Gao, that's useful.
>>>
>>> Looks like Chao Gao has been dropped ...
>>>
>>>>  I'm not sure a fixed latency --
>>>> say 500us -- is the right thing to look at; if all 2900 vcpus arranged
>>>> to have interrupts staggered at 500us intervals it could easily lock up
>>>> the cpu for nearly a full second.  But I'm having trouble formulating a
>>>> good limit scenario.
>>>>
>>>> In any case, 22us should be safe from a security standpoint*, and 128
>>>> should be pretty safe from a "make the common case fast" standpoint:
>>>> i.e., if you have 128 vcpus on a single runqueue, the IPI wake-up
>>>> traffic will be the least of your performance problems I should think.
>>>>
>>>>  -George
>>>>
>>>> * Waiting for Jan to contradict me on this one. :-)
>>>
>>> 22us would certainly be fine, if this was the worst case scenario.
>>> I'm not sure the value measured for 128 list entries can be easily
>>> scaled to several thousands of them, due to cache and/or NUMA
>>> effects. I continue to think that we primarily need theoretical
>>> proof of an upper boundary on list length being enforced, rather
>>> than any measurements or randomized balancing. And just to be
>>> clear - if someone overloads their system, I do not see a need to
>>> have a guaranteed maximum list traversal latency here. All I ask
>>> for is that list traversal time scales with total vCPU count divided
>>> by pCPU count.
>> 
>> Thanks, Jan & George.
>> 
>> I think it is clearer to me now what I should do as the next step.
>> 
>> In my understanding, we should distribute the wakeup interrupts like
>> this:
>> 1. By default, distribute it to the local pCPU ('local' means the pCPU
>> the vCPU is on) to make the common case fast.
>> 2. Even when the list grows to a point where we think traversing it may
>> consume too much time, still distribute the wakeup interrupt to the
>> local pCPU, ignoring the case where the admin intentionally overloads
>> their system.
>> 3. When the list length reaches the theoretical average maximum (i.e.
>> the maximal vCPU count divided by the pCPU count), distribute the wakeup
>> interrupt to another, underutilized pCPU.
> 
> By "maximal vCPU count" do you mean, "total number of active vcpus on
> the system"?  Or some other theoretical maximum vcpu count (e.g., 32k
> domains * 512 vcpus each or something)?

The former.

> What about saying that the limit of vcpus for any given pcpu will be:
>  (v_tot / p_tot) + K
> where v_tot is the total number of vcpus on the system, p_tot is the
> total number of pcpus in the system, and K is a fixed number (such as
> 128) such that 1) the additional time walking the list is minimal, and
> 2) in the common case we should never come close to reaching that number?
> 
> Then the algorithm for choosing which pcpu to have the interrupt
> delivered to would be:
>  1. Set p = current_pcpu
>  2. If len(list(p)) < v_tot / p_tot + K, choose p
>  3. Otherwise, choose another p and goto 2
> 
> The "choose another p" could be random / pseudorandom selection, or it
> could be some other mechanism (rotate, look for pcpus nearby on the
> topology, choose the lowest one, &c).  But as long as we check the
> length before assigning it, it should satisfy Jan.

Right.

Jan
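
For concreteness, here is a minimal stand-alone sketch of the selection
rule discussed above: cap each pCPU's blocking list at v_tot / p_tot + K,
prefer the local pCPU, otherwise pick another one. The names, the toy
sizes and the rotation policy are illustrative assumptions, not the
actual Xen implementation:

#include <stdio.h>

#define NR_PCPUS 8
#define K        128   /* fixed slack so the common case stays local */

static unsigned int list_len[NR_PCPUS]; /* current blocking-list lengths */
static unsigned int v_tot = 1024;       /* total active vCPUs on the system */

/* Per-pCPU cap: average load plus a constant, per George's formula. */
static unsigned int list_cap(void)
{
    return v_tot / NR_PCPUS + K;
}

/* Prefer the local pCPU; otherwise rotate until a list under the cap
 * is found. */
static unsigned int choose_pcpu(unsigned int local)
{
    unsigned int cpu = local;

    do {
        if ( list_len[cpu] < list_cap() )
            return cpu;
        cpu = (cpu + 1) % NR_PCPUS;     /* "choose another p" by rotation */
    } while ( cpu != local );

    return local; /* every list is at the cap: fall back to the local pCPU */
}

int main(void)
{
    list_len[3] = list_cap();           /* pretend pCPU 3 is already full */
    printf("cap = %u, pCPU 3 overflows to pCPU %u\n",
           list_cap(), choose_pcpu(3));
    return 0;
}

Rotation is only one possible way to "choose another p"; a pseudorandom
pick or a topology-aware search would fit the same structure.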
