Re: [PATCH 1/5] x86/time: deal with negative deltas in get_s_time_fixed()
On 12.01.2026 15:51, Anton Markov wrote:
>> That's if IPIs are sent sequentially. In the most common case, they
>> aren't, though - we use the all-but-self shorthand.
>>
>> Actually, even if IPIs are sent sequentially, I can't see where you
>> spot this effect: Both callers of time_calibration_rendezvous_tail()
>> signal all secondary CPUs to continue at the same time. Hence they'll
>> all execute time_calibration_rendezvous_tail() in parallel.
>
> In parallel, but with a slight delay.
>
>> Are they? I fear I don't know which part of the code you're talking
>> about.
>
> In the function "time_calibration" (xen/arch/x86/time.c). Sorry, I
> didn't take into account that you don't stay in context, being
> distracted by other threads.

That calls on_selected_cpus(), but send_IPI_mask() may then still
resort to all-but-self. In that case all IPIs are sent in one go. Plus,
as said, how IPIs are sent doesn't matter for the invocation of
time_calibration_rendezvous_tail(). They'll all run at the same time,
not one after the other. Since further down you build upon that "IPI
lag", I fear we first need to settle on this aspect of your theory.

Jan

>> One of the reasons we (iirc) don't do that is that since the scaling
>> factor is also slightly imprecise, we'd prefer to avoid scaling very
>> big values. IOW by changing as you suggest we'd trade one
>> accumulating error for another.
>
> As I wrote above, there will be an error when using scale_delta, but
> it won't accumulate between calls to
> time_calibration_rendezvous_tail. In the current version, the old
> error (IPI lag + rounding error) persists due to the use of the old
> local_stime in the get_s_time_fixed function, and it's added to the
> new error and accumulates with each call. If
>
>     c->local_stime = get_s_time_fixed(old_tsc ?: new_tsc);
>
> is replaced with:
>
>     c->local_stime = scale_delta(old_tsc ?: new_tsc);
>
> then we'll only be dealing with the error due to the current IPI lag
> and rough rounding, within a single call.
> If I understand you correctly, you fixed the rough rounding of
> scale_delta by reducing the values using get_s_time_fixed. But the
> problem is, that didn't help. The error now accumulates gradually
> because we're relying on old calculations. Furthermore, while the old
> rounding error was limited, now the error accumulates infinitely,
> albeit very slowly. If this is so, then the solution to the problem
> becomes less obvious.
>
> Roughly speaking, my servers start to go crazy after a week of
> continuous operation, as the time lag between cores reaches 1
> millisecond or more.
>
> On Mon, Jan 12, 2026 at 5:13 PM Jan Beulich <jbeulich@xxxxxxxx> wrote:
>
>> On 12.01.2026 13:49, Anton Markov wrote:
>>>> Btw, your prior response was too hard to properly read, due to
>>>> excess blank lines with at the same time squashed leading blanks.
>>>> Together with your apparent inability to avoid top-posting, I
>>>> think you really want to adjust your mail program's configuration.
>>>
>>> I suggest we skip the discussion of formatting and the number of
>>> spaces in messages and instead focus on the topic of the thread. I
>>> have a very difficult time troubleshooting difficult-to-reproduce
>>> bugs, and the fact that their descriptions are difficult to read
>>> due to the number of spaces is probably the least of the
>>> difficulties.
>>
>> Perhaps, yet it still makes dealing with things more difficult.
>>
>>>> That invocation of get_s_time_fixed() reduces to scale_delta()
>>>> (without further rdtsc_ordered()), as non-zero at_tsc is passed in
>>>> all cases. IOW it's not quite clear to me what change you are
>>>> suggesting (that would actually make a functional difference).
>>>
>>> Replacing get_s_time_fixed with scale_delta will remove the
>>> calculation dependency on the previous local_stime value, which
>>> accumulates lag between cores. This is because rdtsc_ordered is not
>>> called synchronously on the cores, but at times offset by the IPI
>>> speed.
>>> Therefore, we get:
>>>
>>> core0: current_rdtsc;
>>> core1: current_rdtsc + ipi_speed;
>>> coreN: current_rdtsc + ipi_speed * N;
>>
>> That's if IPIs are sent sequentially. In the most common case, they
>> aren't, though - we use the all-but-self shorthand.
>>
>> Actually, even if IPIs are sent sequentially, I can't see where you
>> spot this effect: Both callers of time_calibration_rendezvous_tail()
>> signal all secondary CPUs to continue at the same time. Hence
>> they'll all execute time_calibration_rendezvous_tail() in parallel.
>>
>>> Since ipi values are sent alternately in a loop to core0,
>>
>> Are they? I fear I don't know which part of the code you're talking
>> about.
>>
>>> in the version with get_s_time_fixed, we get the following
>>> local_stime calculation format:
>>>
>>> coreN: local_stime = local_stime +
>>>        scale_delta((current_rdtsc + (ipi_speed * N)) - local_rdtsc);
>>
>> One of the reasons we (iirc) don't do that is that since the scaling
>> factor is also slightly imprecise, we'd prefer to avoid scaling very
>> big values. IOW by changing as you suggest we'd trade one
>> accumulating error for another.
>>
>> Jan
>>
>>> This means the time on each core will differ by ipi_speed * N. And
>>> since we're using the values of the previous local_stime, the
>>> difference will accumulate because the previous local_stime was
>>> also offset. In the version with scale_delta, we get:
>>>
>>> coreN: local_stime = scale_delta(current_rdtsc + (ipi_speed * N));
>>>
>>> This means there will still be a difference, but it won't
>>> accumulate, and the offsets will remain within normal limits.
>>>
>>> If it's still unclear: if your local_stime in get_s_time_fixed is
>>> offset relative to other cores, then the fact that rdtsc_ordered
>>> and local_tsc are not offset doesn't change anything, since you're
>>> using the delta relative to local_stime.
>>> core0_local_stime + (rdtsc_ordered() - local_tsc) !=
>>> core1_local_stime + (rdtsc_ordered() - local_tsc);
>>> // Even if rdtsc_ordered() and local_tsc are equal across cores.
>>>
>>> On 96-core configurations, up to a millisecond of latency can
>>> accumulate in local_stime over a week of operation, and this is a
>>> significant difference. This is due to the fact that I use
>>> cpufreq=xen:performance max_cstate=1, meaning that in my
>>> configuration, local_stime is never overwritten by master_stime.
>>>
>>> Thanks.