
Re: [PATCH 1/5] x86/time: deal with negative deltas in get_s_time_fixed()


  • To: Anton Markov <akmarkov45@xxxxxxxxx>
  • From: Jan Beulich <jbeulich@xxxxxxxx>
  • Date: Mon, 12 Jan 2026 17:08:38 +0100
  • Cc: andrew.cooper3@xxxxxxxxxx, roger.pau@xxxxxxxxxx, xen-devel@xxxxxxxxxxxxxxxxxxxx
  • Delivery-date: Mon, 12 Jan 2026 16:08:49 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 12.01.2026 15:51, Anton Markov wrote:
>> That's if IPIs are sent sequentially. In the most common case, they aren't,
>> though - we use the all-but-self shorthand.
> 
>> Actually, even if IPIs are sent sequentially, I can't see where you spot
>> this effect: Both callers of time_calibration_rendezvous_tail() signal all
>> secondary CPUs to continue at the same time. Hence they'll all execute
>> time_calibration_rendezvous_tail() in parallel.
> 
> In parallel, but with a slight delay.
> 
>> Are they? I fear I don't know which part of the code you're talking about.
> 
> In the function time_calibration() (xen/arch/x86/time.c). Sorry, I didn't
> take into account that you may not retain the context while being
> distracted by other threads.

That calls on_selected_cpus(), but send_IPI_mask() may then still resort to
the all-but-self shorthand. In that case all IPIs are sent in one go.
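
To illustrate the distinction (a rough sketch only, with made-up names -
not the actual Xen code):

    /* Sketch: a mask covering all CPUs but the sender collapses to a
     * single shorthand write; only otherwise is delivery per-CPU. */
    static void send_ipi_sketch(const cpumask_t *mask, int vector)
    {
        unsigned int cpu;

        if ( mask_is_all_but_self(mask) )
        {
            apic_send_all_but_self(vector); /* all IPIs go out in one go */
            return;
        }

        for_each_cpu ( cpu, mask )          /* sequential, per-destination */
            apic_send_one(cpu, vector);
    }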

Plus, as said, how IPIs are sent doesn't matter for the invocation of
time_calibration_rendezvous_tail(). They'll all run at the same time, not
one after the other.
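
Roughly (again just a sketch; all names here are invented):

    static atomic_t release_gen;

    /* Every secondary spins on the same generation count ... */
    static void secondary_wait(struct calibration_rendezvous *r)
    {
        int gen = atomic_read(&release_gen);

        while ( atomic_read(&release_gen) == gen )
            cpu_relax();

        run_tail(r); /* stand-in for time_calibration_rendezvous_tail() */
    }

    /* ... and a single write on the initiating CPU releases them all at
     * once, so the tail runs everywhere in parallel, not serially. */
    static void master_release(void)
    {
        atomic_inc(&release_gen);
    }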

Since further down you build upon that "IPI lag", I fear we first need to
settle on this aspect of your theory.
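
For clarity, the two computations being contrasted come down to roughly
this (a simplified sketch: the stamp/scale field names only approximate
the real ones, and boot_tsc_scale is made up to stand for whatever fixed
reference your scale_delta(tsc) shorthand implies):

    /* Current: build on the previously stored per-CPU stamp, so whatever
     * error that stamp already carries is carried forward as well. */
    s_time_t via_get_s_time_fixed(uint64_t at_tsc)
    {
        const struct cpu_time *t = &this_cpu(cpu_time);

        return t->stamp.local_stime
               + scale_delta(at_tsc - t->stamp.local_tsc, &t->tsc_scale);
    }

    /* Suggested: scale the full TSC value every time - no dependency on
     * the previous stamp, but the (slightly imprecise) scale factor now
     * gets applied to a very large value. */
    s_time_t via_scale_delta(uint64_t at_tsc)
    {
        return scale_delta(at_tsc, &boot_tsc_scale);
    }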

Jan

>> One of the reasons we (iirc) don't do that is that since the scaling factor
>> is also slightly imprecise, we'd prefer to avoid scaling very big values.
>> IOW by changing as you suggest we'd trade one accumulating error for
>> another.
> 
> As I wrote above, there will be an error when using scale_delta, but it
> won't accumulate between calls to time_calibration_rendezvous_tail. In the
> current version, the old error (IPI lag + rounding error) persists due to
> the use of the old local_stime in the get_s_time_fixed function, and it's
> added to the new error and accumulates with each call.
> If
> 
> c->local_stime = get_s_time_fixed(old_tsc ?: new_tsc);
> 
> is replaced with:
> 
> c->local_stime = scale_delta(old_tsc ?: new_tsc);
> 
> then we'll only be dealing with the error due to the current IPI lag and
> rough rounding, within a single call.
> 
> If I understand you correctly, you fixed the rough rounding of scale_delta
> by reducing the values using get_s_time_fixed. But the problem is, that
> didn't help. The error now accumulates gradually because we're relying on
> old calculations. Furthermore, while the old rounding error was limited,
> now the error grows without bound, albeit very slowly. If this is so,
> then the solution to the problem becomes less obvious.
> 
> Roughly speaking, my servers start to go crazy after a week of continuous
> operation, as the time lag between cores reaches 1 millisecond or more.
> 
> On Mon, Jan 12, 2026 at 5:13 PM Jan Beulich <jbeulich@xxxxxxxx> wrote:
> 
>> On 12.01.2026 13:49, Anton Markov wrote:
>>>> Btw, your prior response was too hard to properly read, due to excess
>>>> blank lines with at the same time squashed leading blanks. Together with
>>>> your apparent inability to avoid top-posting, I think you really want to
>>>> adjust your mail program's configuration.
>>>
>>> I suggest we skip the discussion of formatting and the number of spaces
>>> in messages and instead focus on the topic of the thread. I have a very
>>> difficult time troubleshooting hard-to-reproduce bugs, and the fact that
>>> their descriptions are difficult to read due to the number of spaces is
>>> probably the least of the difficulties.
>>
>> Perhaps, yet it still makes dealing with things more difficult.
>>
>>>> That invocation of get_s_time_fixed() reduces to scale_delta() (without
>>>> further rdtsc_ordered()), as non-zero at_tsc is passed in all cases. IOW
>>>> it's not quite clear to me what change you are suggesting (that would
>>>> actually make a functional difference).
>>>
>>> Replacing get_s_time_fixed with scale_delta will remove the calculation
>>> dependency on the previous local_stime value, which accumulates lag
>>> between cores. This is because rdtsc_ordered is not called synchronously
>>> on the cores, but at points offset from one another by the IPI delivery
>>> latency (ipi_speed below). Therefore, we get:
>>>
>>> core0: current_rdtsc;
>>> core1: current_rdtsc + ipi_speed;
>>> coreN: current_rdtsc + ipi_speed * N;
>>
>> That's if IPIs are sent sequentially. In the most common case, they aren't,
>> though - we use the all-but-self shorthand.
>>
>> Actually, even if IPIs are sent sequentially, I can't see where you spot
>> this effect: Both callers of time_calibration_rendezvous_tail() signal all
>> secondary CPUs to continue at the same time. Hence they'll all execute
>> time_calibration_rendezvous_tail() in parallel.
>>
>>> Since the IPIs are sent to the cores one after another in a loop,
>>
>> Are they? I fear I don't know which part of the code you're talking about.
>>
>>> in the version
>>> with get_s_time_fixed, we get the following local_stime calculation
>>> format:
>>>
>>> coreN: local_stime = local_stime +
>>>        scale_delta((current_rdtsc + (ipi_speed * N)) - local_rdtsc);
>>
>> One of the reasons we (iirc) don't do that is that since the scaling factor
>> is also slightly imprecise, we'd prefer to avoid scaling very big values.
>> IOW by changing as you suggest we'd trade one accumulating error for
>> another.
>>
>> Jan
>>
>>> This means the time on each core will differ by ipi_speed * N. And since
>>> we're using the values of the previous local_stime, the difference will
>>> accumulate because the previous local_stime was also offset. In the
>>> version with scale_delta, we get:
>>>
>>> coreN: local_stime = scale_delta(current_rdtsc + (ipi_speed * N));
>>>
>>> This means there will still be a difference, but it won't accumulate, and
>>> the offsets will remain within normal limits.
>>>
>>> If it's still unclear: If your local_stime in get_s_time_fixed is offset
>>> relative to other cores, then the fact that rdtsc_ordered and local_tsc
>>> are not offset doesn't change anything, since you're using the delta
>>> relative to local_stime.
>>>
>>> core0_local_stime + (rdtsc_ordered() - local_tsc) != core1_local_stime +
>>> (rdtsc_ordered() - local_tsc); // Even if rdtsc_ordered() and local_tsc
>>> are equal across cores.
>>>
>>> On 96-core configurations, up to a millisecond of lag can accumulate in
>>> local_stime over a week of operation, and this is a significant
>>> difference. This is due to the fact that I use cpufreq=xen:performance
>>> max_cstate=1, meaning that in my configuration, local_stime is never
>>> overwritten by master_stime.
>>>
>>> Thanks.
>>
> 
