
Re: [PATCH 1/5] x86/time: deal with negative deltas in get_s_time_fixed()


  • To: Anton Markov <akmarkov45@xxxxxxxxx>
  • From: Jan Beulich <jbeulich@xxxxxxxx>
  • Date: Mon, 12 Jan 2026 17:08:38 +0100
  • Cc: andrew.cooper3@xxxxxxxxxx, roger.pau@xxxxxxxxxx, xen-devel@xxxxxxxxxxxxxxxxxxxx
  • Delivery-date: Mon, 12 Jan 2026 16:08:49 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 12.01.2026 15:51, Anton Markov wrote:
>> That's if IPIs are sent sequentially. In the most common case, they aren't,
>> though - we use the all-but-self shorthand.
> 
>> Actually, even if IPIs are sent sequentially, I can't see where you spot
>> this effect: Both callers of time_calibration_rendezvous_tail() signal all
>> secondary CPUs to continue at the same time. Hence they'll all execute
>> time_calibration_rendezvous_tail() in parallel.
> 
> In parallel, but with a slight delay.
> 
>> Are they? I fear I don't know which part of the code you're talking about.
> 
> In the function time_calibration() (xen/arch/x86/time.c). Sorry, I didn't
> take into account that you may not retain the context while being
> distracted by other threads.

That calls on_selected_cpus(), but send_IPI_mask() may then still resort to
the all-but-self shorthand. In that case all IPIs are sent in one go.
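
To illustrate the distinction (a rough sketch only, with made-up names -
not the actual Xen code):

    /* Sketch: a mask covering all CPUs but the sender collapses to a
     * single shorthand write; only otherwise is delivery per-CPU. */
    static void send_ipi_sketch(const cpumask_t *mask, int vector)
    {
        unsigned int cpu;

        if ( mask_is_all_but_self(mask) )
        {
            apic_send_all_but_self(vector); /* all IPIs go out in one go */
            return;
        }

        for_each_cpu ( cpu, mask )          /* sequential, per-destination */
            apic_send_one(cpu, vector);
    }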

Plus, as said, how IPIs are sent doesn't matter for the invocation of
time_calibration_rendezvous_tail(). They'll all run at the same time, not
one after the other.
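
Roughly (again just a sketch; all names here are invented):

    static atomic_t release_gen;

    /* Every secondary spins on the same generation count ... */
    static void secondary_wait(struct calibration_rendezvous *r)
    {
        int gen = atomic_read(&release_gen);

        while ( atomic_read(&release_gen) == gen )
            cpu_relax();

        run_tail(r); /* stand-in for time_calibration_rendezvous_tail() */
    }

    /* ... and a single write on the initiating CPU releases them all at
     * once, so the tail runs everywhere in parallel, not serially. */
    static void master_release(void)
    {
        atomic_inc(&release_gen);
    }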

Since further down you build upon that "IPI lag", I fear we first need to
settle on this aspect of your theory.
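
For clarity, the two computations being contrasted come down to roughly
this (a simplified sketch: the stamp/scale field names only approximate
the real ones, and boot_tsc_scale is made up to stand for whatever fixed
reference your scale_delta(tsc) shorthand implies):

    /* Current: build on the previously stored per-CPU stamp, so whatever
     * error that stamp already carries is carried forward as well. */
    s_time_t via_get_s_time_fixed(uint64_t at_tsc)
    {
        const struct cpu_time *t = &this_cpu(cpu_time);

        return t->stamp.local_stime
               + scale_delta(at_tsc - t->stamp.local_tsc, &t->tsc_scale);
    }

    /* Suggested: scale the full TSC value every time - no dependency on
     * the previous stamp, but the (slightly imprecise) scale factor now
     * gets applied to a very large value. */
    s_time_t via_scale_delta(uint64_t at_tsc)
    {
        return scale_delta(at_tsc, &boot_tsc_scale);
    }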

Jan

>> One of the reasons we (iirc) don't do that is that since the scaling factor
>> is also slightly imprecise, we'd prefer to avoid scaling very big values.
>> IOW by changing as you suggest we'd trade one accumulating error for
>> another.
> 
> As I wrote above, there will be an error when using scale_delta, but it
> won't accumulate between calls to time_calibration_rendezvous_tail. In the
> current version, the old error (IPI lag + rounding error) persists due to
> the use of the old local_stime in the get_s_time_fixed function, and it's
> added to the new error and accumulates with each call.
> If
> 
> c->local_stime = get_s_time_fixed(old_tsc ?: new_tsc);
> 
> is replaced with:
> 
> c->local_stime = scale_delta(old_tsc ?: new_tsc);
> 
> then we'll only be dealing with the error due to the current IPI lag and
> rough rounding, within a single call.
> 
> If I understand you correctly, you fixed the rough rounding of scale_delta
> by reducing the values using get_s_time_fixed. But the problem is, that
> didn't help. The error now accumulates gradually because we're relying on
> old calculations. Furthermore, while the old rounding error was limited,
> now the error grows without bound, albeit very slowly. If this is so,
> then the solution to the problem becomes less obvious.
> 
> Roughly speaking, my servers start to go crazy after a week of continuous
> operation, as the time lag between cores reaches 1 millisecond or more.
> 
> On Mon, Jan 12, 2026 at 5:13 PM Jan Beulich <jbeulich@xxxxxxxx> wrote:
> 
>> On 12.01.2026 13:49, Anton Markov wrote:
>>>> Btw, your prior response was too hard to properly read, due to excess
>>>> blank lines with at the same time squashed leading blanks. Together with
>>>> your apparent inability to avoid top-posting, I think you really want to
>>>> adjust your mail program's configuration.
>>>
>>> I suggest we skip the discussion of formatting and the number of spaces
>>> in messages and instead focus on the topic of the thread. I have a very
>>> difficult time troubleshooting hard-to-reproduce bugs, and the fact that
>>> their descriptions are difficult to read due to the number of spaces is
>>> probably the least of the difficulties.
>>
>> Perhaps, yet it still makes dealing with things more difficult.
>>
>>>> That invocation of get_s_time_fixed() reduces to scale_delta() (without
>>>> further rdtsc_ordered()), as non-zero at_tsc is passed in all cases. IOW
>>>> it's not quite clear to me what change you are suggesting (that would
>>>> actually make a functional difference).
>>>
>>> Replacing get_s_time_fixed with scale_delta will remove the calculation
>>> dependency on the previous local_stime value, which accumulates lag
>>> between cores. This is because rdtsc_ordered is not called synchronously
>>> on the cores, but at points offset from one another by the IPI delivery
>>> latency (ipi_speed below). Therefore, we get:
>>>
>>> core0: current_rdtsc;
>>> core1: current_rdtsc + ipi_speed;
>>> coreN: current_rdtsc + ipi_speed * N;
>>
>> That's if IPIs are sent sequentially. In the most common case, they aren't,
>> though - we use the all-but-self shorthand.
>>
>> Actually, even if IPIs are sent sequentially, I can't see where you spot
>> this effect: Both callers of time_calibration_rendezvous_tail() signal all
>> secondary CPUs to continue at the same time. Hence they'll all execute
>> time_calibration_rendezvous_tail() in parallel.
>>
>>> Since the IPIs are sent to the cores one after another in a loop,
>>
>> Are they? I fear I don't know which part of the code you're talking about.
>>
>>> in the version
>>> with get_s_time_fixed, we get the following local_stime calculation
>>> format:
>>>
>>> coreN: local_stime = local_stime +
>>>        scale_delta((current_rdtsc + (ipi_speed * N)) - local_rdtsc);
>>
>> One of the reasons we (iirc) don't do that is that since the scaling factor
>> is also slightly imprecise, we'd prefer to avoid scaling very big values.
>> IOW by changing as you suggest we'd trade one accumulating error for
>> another.
>>
>> Jan
>>
>>> This means the time on each core will differ by ipi_speed * N. And since
>>> we're using the values of the previous local_stime, the difference will
>>> accumulate because the previous local_stime was also offset. In the
>>> version with scale_delta, we get:
>>>
>>> coreN: local_stime = scale_delta(current_rdtsc + (ipi_speed * N));
>>>
>>> This means there will still be a difference, but it won't accumulate, and
>>> the offsets will remain within normal limits.
>>>
>>> If it's still unclear: If your local_stime in get_s_time_fixed is offset
>>> relative to other cores, then the fact that rdtsc_ordered and local_tsc
>>> are not offset doesn't change anything, since you're using the delta
>>> relative to local_stime.
>>>
>>> core0_local_stime + (rdtsc_ordered() - local_tsc) != core1_local_stime +
>>> (rdtsc_ordered() - local_tsc); // Even if rdtsc_ordered() and local_tsc
>>> are equal across cores.
>>>
>>> On 96-core configurations, up to a millisecond of lag can accumulate in
>>> local_stime over a week of operation, and this is a significant
>>> difference. This is due to the fact that I use cpufreq=xen:performance
>>> max_cstate=1, meaning that in my configuration, local_stime is never
>>> overwritten by master_stime.
>>>
>>> Thanks.
>>
> 
