
Re: [Xen-devel] [PATCH v3 5/6] x86/time: implement PVCLOCK_TSC_STABLE_BIT



>>> On 26.08.16 at 17:44, <joao.m.martins@xxxxxxxxxx> wrote:
> On 08/25/2016 11:37 AM, Jan Beulich wrote:
>>>>> On 24.08.16 at 14:43, <joao.m.martins@xxxxxxxxxx> wrote:
>>> This patch proposes relying on host TSC synchronization and
>>> passthrough to the guest, when running on a TSC-safe platform. On
>>> time_calibration we retrieve the platform time in ns and the counter
>>> read by the clocksource that was used to compute system time. We
>>> introduce a new rendezvous function which doesn't require
>>> synchronization between master and slave CPUs and just reads the
>>> calibration_rendezvous struct and writes the stime and stamp into
>>> the cpu_calibration struct to be used later on. We can guarantee that
>>> on a platform with a constant and reliable TSC, the time read on
>>> vcpu B right after A is greater, independently of the vCPU calibration
>>> error. Since the pvclock time info is monotonic as seen by any vCPU, set
>>> PVCLOCK_TSC_STABLE_BIT, which then enables usage of the VDSO on Linux.
>>> IIUC, this is similar to how it's implemented on KVM.
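[Editor's aside: a minimal, self-contained sketch of the "no sync" rendezvous
described above. Struct and field names are inferred from the patch fragment
quoted later in this thread, so treat them as illustrative rather than the
actual Xen code.]

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative only: names follow the patch fragment quoted in this
 * thread, not the real Xen definitions. */
struct calibration_rendezvous {
    uint64_t master_stime;      /* platform time in ns, read by the master */
    uint64_t master_tsc_stamp;  /* clocksource counter at that instant */
};

struct cpu_time_stamp {
    uint64_t local_tsc;
    uint64_t local_stime;
};

/* The "no sync" rendezvous: every CPU simply copies the master's stamp,
 * so all vCPUs extrapolate from one (tsc, stime) pair and the pvclock
 * data they publish is mutually consistent -- which is what makes the
 * monotonicity argument above work. */
static void time_calibration_nosync_rendezvous(
    const struct calibration_rendezvous *r, struct cpu_time_stamp *c)
{
    c->local_tsc   = r->master_tsc_stamp;
    c->local_stime = r->master_stime;
}
```

Note that this only yields monotonic readings if the TSCs themselves hold
the same values across CPUs, which is exactly the point debated below.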
>> 
>> Without any tools side change, how is it guaranteed that a guest
>> which observed the stable bit won't get migrated to a host not
>> providing that guarantee?
> Do you want to prevent migration in such cases? The worst that can
> happen is that the guest might need to fall back to a system call if
> this bit is 0, and would keep doing so while the bit remains 0.

Whether migration needs preventing I'm not sure; all I was trying
to indicate is that there seem to be pieces missing wrt migration.
As to the guest falling back to a system call - are guest kernels and
(as far as affected) applications required to cope with the flag
changing from 1 to 0 behind their back?
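[Editor's aside: for what it's worth, pvclock-style guests do sample the
flag inside the version loop on every read, so a 1 -> 0 transition is
picked up on the next read and the caller falls back to a system call.
A rough sketch follows; the struct layout is abridged and not the actual
kernel or Xen definition.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PVCLOCK_TSC_STABLE_BIT (1u << 0)

/* Abridged stand-in for the shared pvclock record; real layout differs. */
struct pvclock_info {
    volatile uint32_t version;  /* odd while the hypervisor updates it */
    volatile uint8_t flags;
};

/* Sample version and flags coherently; returns false when the stable
 * bit is clear, telling the caller to fall back to a system call.
 * Because the flag is re-read on every call, a 1 -> 0 transition
 * (e.g. after migrating to a host without the guarantee) is observed
 * on the very next read rather than "behind the guest's back". */
static bool pvclock_fast_read_ok(const struct pvclock_info *ti)
{
    uint32_t version;
    uint8_t flags;

    do {
        version = ti->version;
        flags = ti->flags;
        /* real code also reads tsc stamp/scale here, under the same seqlock */
    } while ( (version & 1) || version != ti->version );

    return (flags & PVCLOCK_TSC_STABLE_BIT) != 0;
}
```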

>>>  {
>>>      struct cpu_time_stamp *c = &this_cpu(cpu_calibration);
>>>  
>>> -    c->local_tsc    = rdtsc_ordered();
>>> -    c->local_stime  = get_s_time_fixed(c->local_tsc);
>>> +    if ( master_tsc )
>>> +    {
>>> +        c->local_tsc    = r->master_tsc_stamp;
>> 
>> Doesn't this require the TSCs to run in perfect sync (not even off
>> wrt each other by a single cycle)? Is such even possible on multi
>> socket systems? I.e. how would multiple chips get into such a
>> mode in the first place, considering their TSCs can't possibly start
>> ticking at exactly the same (perhaps even sub-)nanosecond?
> They do require to be in sync with multi-sockets, otherwise this
> wouldn't work.

"In sync" may mean two things: Ticking at exactly the same rate, or
(more strict) holding the exact same values at all times.

> Invariant TSC only refers to cores in a package, but multi-socket is up
> to board vendors (no manuals mention this guarantee across sockets).
> That's one of the reasons TSC is such a burden :(
> 
> Looking at datasheets (on the oldest processor I was testing this on)
> it mentions this note:
> 
> "In order to ensure Timestamp Counter (TSC) synchronization across
> sockets in multi-socket systems, the RESET# deassertion edge should
> arrive at the same BCLK rising edge at both sockets and should meet
> the Tsu and Th requirement of 600ps relative to BCLK, as outlined in
> Table 2-26."

Hmm, a dual socket system is certainly still one of the easier ones to
deal with. 600ps means 18cm difference in signaling paths, which on
larger systems (and namely ones composed of mostly independent
nodes) I could easily see getting exceeded. That can certainly be
compensated (e.g. by deasserting RESET# at different times for
different sockets), but I'd then still question the accuracy.

> [0] Intel Xeon Processor 5600 Series Datasheet Vol 1, Page 63,
> http://www.intel.com/content/dam/www/public/us/en/documents/datasheets/xeon-5600-vol-1-datasheet.pdf
> 
> The BCLK looks to be the global reference clock shared across sockets,
> IIUC used by the PLLs in the individual packages (to generate the
> signal the TSC is derived from). (Please take this with a grain of
> salt, as I may be misreading these datasheets.) But if it was a box
> with TSC skewed among sockets, wouldn't we see that at boot time in
> the tsc warp test? Or maybe the TSC sync check isn't fast enough to
> catch any oddities?

That's my main fear: The check can't possibly determine whether TSCs
are in perfect sync, it can only check an approximation. Perhaps even
worse than the multi-node consideration here is hyper-threading, as
that makes it fundamentally impossible that all threads within one core
execute the same operation at exactly the same time. Not to speak of
the various odd cache effects which I did observe while doing the
measurements for my series (e.g. the second thread speculating the
TSC reads much farther than the primary ones, presumably because
the primary ones first needed to get the I-cache populated).

> Our docs (https://xenbits.xen.org/docs/unstable/misc/tscmode.txt) also
> seem to mention something along these lines on multi-socket systems.
> And the Linux tsc code seems to assume that Intel boxes have
> synchronized TSCs across sockets [1], and that the exception cases
> should mark tsc=skewed (we also have such a parameter).
> 
> [1] arch/x86/kernel/tsc.c#L1094

Referring back to what I've said above: Does this mean exact same
tick rate, or exact same values?

> As reassurance I've been running tests for days (currently almost a
> week on a 2-socket system) and I'll keep them running to see if they
> catch any issues or time going backwards. I could also run on the
> biggest boxes we have, with 8 sockets. But still, that would represent
> only a tiny fraction of what x86 has available these days.

A truly interesting case would be, as mentioned, a box composed of
individual nodes. Not sure whether that 8-socket one you mention
would meet that.

> Other than the things above I am not sure how to go about this :(
> Should we start adjusting the TSCs if we find disparities, or if skew
> is observed in the long run? Or only allow vCPUs running on the same
> package to expose this flag? Hmm, what's your take on this?
> Appreciate your feedback.

At least as an initial approach, requiring affinities to be limited to a
single socket would seem like a good compromise, provided HT
aspects don't have a bad effect (in which case also excluding HT
may be required). I'd also be fine with command line options
allowing that to be further relaxed, but a simple "clocksource=tsc"
should imo result in a setup which, from all we can tell, will work as
intended.

Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel
