[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH v2 3/3] x86/time: don't move TSC backwards in time_calibration_tsc_rendezvous()


  • To: Jan Beulich <jbeulich@xxxxxxxx>
  • From: Roger Pau Monné <roger.pau@xxxxxxxxxx>
  • Date: Mon, 8 Feb 2021 10:38:58 +0100
  • Arc-authentication-results: i=1; mx.microsoft.com 1; spf=pass smtp.mailfrom=citrix.com; dmarc=pass action=none header.from=citrix.com; dkim=pass header.d=citrix.com; arc=none
  • Arc-message-signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector9901; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-SenderADCheck; bh=jxQUzOQovPq+4OVo/Jsj2TfPDsijD3JtXeQggZxcGwc=; b=VK3W/YP0WiDdE+Ob3xQ0au2uxtZWW3bSK8g6fPl9c+1nXzqbXiYnWPM1MXxYEAP+fFMCQDdP6LLt8YjPrmp7LBAxOayOur4R79O/zo2ftzfT8u8GAENyLO+k/qibwcN/ZoVev2P78Gt/jCypAfgINzz0QsL01ud4AIR2n0vTsEzTdGKs8BPEbQxX2owpnP8quTY/kjPVN3zvY036UopszlWEkugxfinYxIHmlPdzkuD03rcGvHe8mfYc13C7dL6UrDpl9fEJEPvvqbP4oKzuE4sDavEuG8XVEEgz6GVdBgtFNcRxjWP6+NagTwy4bVQ+ORaYiQTNMvwDj4J+2/85Gg==
  • Arc-seal: i=1; a=rsa-sha256; s=arcselector9901; d=microsoft.com; cv=none; b=MQmf9JddjhhK0aswZCGs4kp0V3vVTw3mk5R0tNPFvkeWl81baYZ79CSJ06QdGpRreaORq0Ujf6N+0vQRUxtnwGlcdds5UuNjlGbLaBntRhuNUdHo5p0CLtMP/Eol7PtERBnq6YO3UDSIz4XksC2g4AqGel6PwZTPJgVgGSq3X9RCHiFEQGEmP4KOip08v5axBcD91A8yoYIul+sGH5NmoAuENSRmn7utwXA0HWoisBZFPK9J/Ubl/M8++qnvhV8ffcRcbgzs9PotncgFexBHAzmQsM3lz/Npc/mYg8V29Axsm4LQEPbOnsUrnVbzXjdaIxQ2Ab2s0vv35wGPP6r4uA==
  • Authentication-results: esa5.hc3370-68.iphmx.com; dkim=pass (signature verified) header.i=@citrix.onmicrosoft.com
  • Cc: "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>, Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Wei Liu <wl@xxxxxxx>, Claudemir Todo Bom <claudemir@xxxxxxxxxxx>
  • Delivery-date: Mon, 08 Feb 2021 09:39:10 +0000
  • Ironport-sdr: XbaO5SN1pumLcMYO50PiMWGTIURoI0NGLwT/oun//Sg83ZtDBePyXefsxon9Mtt37cnSuuvEFD HwntF1g5d5NFp96XhqpbmZkm41O1ZiLY2jltV5dE70SBbb+KIZUMsHmaUK8k0A0zE9dKvrWB+O aZ6Au8TTNjAnVMygtU6xgzQpfYH5RN0GPyeeUCT3V8unio73mAz6+1Ne3KadhdUnpHnL/03DD2 nMH4V3TeTRipDNAP/p1IUSIryT6NT88GdZtLHzlkkeI1v3FDdMDefNA/Q+msex1jRRBkyyQGrO 9xA=
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On Mon, Feb 01, 2021 at 01:43:28PM +0100, Jan Beulich wrote:
> While doing this for small amounts may be okay, the unconditional use
> of CPU0's value here has been found to be a problem when the boot time
> TSC of the BSP was behind that of all APs by more than a second. In
> particular because of get_s_time_fixed() producing insane output when
> the calculated delta is negative, we can't allow this to happen.
> 
> On the first iteration have all other CPUs sort out the highest TSC
> value any one of them has read. On the second iteration, if that
> maximum is higher than CPU0's, update its recorded value from that
> taken in the first iteration. Use the resulting value on the last
> iteration to write everyone's TSCs.
> 
> To account for the possible discontinuity, have
> time_calibration_rendezvous_tail() record the newly written value, but
> extrapolate local stime using the value read.
> 
> Reported-by: Claudemir Todo Bom <claudemir@xxxxxxxxxxx>
> Signed-off-by: Jan Beulich <jbeulich@xxxxxxxx>
> ---
> v2: Don't update r->master_stime by calculation. Re-base over new
>     earlier patch. Make time_calibration_rendezvous_tail() take two TSC
>     values.
> ---
> Since CPU0 reads its TSC last on the first iteration, if TSCs were
> perfectly sync-ed there shouldn't ever be a need to update. However,
> even on the TSC-reliable system I first tested this on (using
> "tsc=skewed" to get this rendezvous function into use in the first
> place) updates by up to several thousand clocks did happen. I wonder
> whether this points at some problem with the approach that I'm not (yet)
> seeing.

I'm confused by this, so on a system that had reliable TSCs, which
you forced to remove the reliable flag, and then you saw big
differences when doing the rendezvous?

That would seem to indicate that such system doesn't really have
reliable TSCs?

> Considering the sufficiently modern CPU it's using, I suspect the
> reporter's system wouldn't even need to turn off TSC_RELIABLE, if only
> there wasn't the boot time skew. Hence another approach might be to fix
> this boot time skew. Of course to recognize whether the TSCs then still
> aren't in sync we'd need to run tsc_check_reliability() sufficiently
> long after that adjustment. Which is besides the need to have this
> "fixing" be precise enough for the TSCs to not look skewed anymore
> afterwards.

Maybe it would make sense to do a TSC counter sync after APs are up
and then disable the rendezvous if the next calibration rendezvous
shows no skew?

I also wonder, we test for skew just after the APs have been booted,
and decide at that point whether we need a calibration rendezvous.

Maybe we could do a TSC sync just after APs are up (to hopefully bring
them in sync), and then do the tsc_check_reliability just before Xen
ends booting (ie: before handing control to dom0?)

What we do right now (ie: do the tsc_check_reliability so early) is
also likely to miss small skews that will only show up after APs have
been running for a while?

> As per the comment ahead of it, the original purpose of the function was
> to deal with TSCs halted in deep C states. While this probably explains
> why only forward moves were ever expected, I don't see how this could
> have been reliable in case CPU0 was deep-sleeping for a sufficiently
> long time. My only guess here is a hidden assumption of CPU0 never being
> idle for long enough.
> 
> --- a/xen/arch/x86/time.c
> +++ b/xen/arch/x86/time.c
> @@ -1658,17 +1658,17 @@ struct calibration_rendezvous {
>      cpumask_t cpu_calibration_map;
>      atomic_t semaphore;
>      s_time_t master_stime;
> -    u64 master_tsc_stamp;
> +    uint64_t master_tsc_stamp, max_tsc_stamp;
>  };
>  
>  static void
>  time_calibration_rendezvous_tail(const struct calibration_rendezvous *r,
> -                                 uint64_t tsc)
> +                                 uint64_t old_tsc, uint64_t new_tsc)
>  {
>      struct cpu_time_stamp *c = &this_cpu(cpu_calibration);
>  
> -    c->local_tsc    = tsc;
> -    c->local_stime  = get_s_time_fixed(c->local_tsc);
> +    c->local_tsc    = new_tsc;
> +    c->local_stime  = get_s_time_fixed(old_tsc ?: new_tsc);
>      c->master_stime = r->master_stime;
>  
>      raise_softirq(TIME_CALIBRATE_SOFTIRQ);
> @@ -1683,6 +1683,7 @@ static void time_calibration_tsc_rendezv
>      int i;
>      struct calibration_rendezvous *r = _r;
>      unsigned int total_cpus = cpumask_weight(&r->cpu_calibration_map);
> +    uint64_t tsc = 0;
>  
>      /* Loop to get rid of cache effects on TSC skew. */
>      for ( i = 4; i >= 0; i-- )
> @@ -1692,8 +1693,15 @@ static void time_calibration_tsc_rendezv
>              while ( atomic_read(&r->semaphore) != (total_cpus - 1) )
>                  cpu_relax();
>  
> -            if ( r->master_tsc_stamp == 0 )
> -                r->master_tsc_stamp = rdtsc_ordered();
> +            if ( tsc == 0 )
> +                r->master_tsc_stamp = tsc = rdtsc_ordered();
> +            else if ( r->master_tsc_stamp < r->max_tsc_stamp )
> +                /*
> +                 * We want to avoid moving the TSC backwards for any CPU.
> +                 * Use the largest value observed anywhere on the first
> +                 * iteration.
> +                 */
> +                r->master_tsc_stamp = r->max_tsc_stamp;
>              else if ( i == 0 )
>                  r->master_stime = read_platform_stime(NULL);
>  
> @@ -1712,6 +1720,16 @@ static void time_calibration_tsc_rendezv
>              while ( atomic_read(&r->semaphore) < total_cpus )
>                  cpu_relax();
>  
> +            if ( tsc == 0 )
> +            {
> +                uint64_t cur;
> +
> +                tsc = rdtsc_ordered();
> +                while ( tsc > (cur = r->max_tsc_stamp) )
> +                    if ( cmpxchg(&r->max_tsc_stamp, cur, tsc) == cur )
> +                        break;

I think you could avoid reading cur explicitly for each loop and
instead do?

cur = ACCESS_ONCE(r->max_tsc_stamp)
while ( tsc > cur )
    cur = cmpxchg(&r->max_tsc_stamp, cur, tsc);

> +            }
> +
>              if ( i == 0 )
>                  write_tsc(r->master_tsc_stamp);
>  
> @@ -1719,9 +1737,12 @@ static void time_calibration_tsc_rendezv
>              while ( atomic_read(&r->semaphore) > total_cpus )
>                  cpu_relax();
>          }
> +
> +        /* Just in case a read above ended up reading zero. */
> +        tsc += !tsc;

Won't that be worthy of an ASSERT_UNREACHABLE? I'm not sure I see how
tsc could be 0 on a healthy system after the loop above.


Thanks, Roger.



 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.