[Xen-devel] [RFC] Correct/fast timestamping in apps under Xen [1 of 4]: Reliable TSC

Premise 1:  A large and growing percentage of servers
running Xen have a "reliable" TSC and Xen can determine
conclusively whether a server does or does not have a
reliable TSC.

The truth of this statement has been vociferously
challenged in other threads, so I'd LOVE TO GET

The rest of this is long though hopefully educational,
but if you have no interest in the rdtsc instruction
or timestamping, please move on to [2 of 4].

Since my overall premise is a bit vague, I need to
first very clearly define my terms.  And to define
those terms clearly, I need to provide some more
background.  As far as I can find, there is no
publication which clearly describes all of these

The rdtsc instruction was at one time the easiest
and cheapest and most precise method for "approximating
the passage of time"; as such rdtsc was widely
used by x86 performance practitioners and high-end
apps that needed to provide extensive metrics.  When
commodity SMP x86 systems emerged, rdtsc fell into
disfavor because: (a) it was difficult for
different CPU packages to share a crystal or
ensure different crystals were synchronized or
increasing at precisely the same rate, and
(b) SMP apps were oblivious to which CPU their
thread(s) were running on so two rdtsc instructions
in the same thread might execute on different
CPUs and thus unwittingly use different crystals,
resulting in strange things like the appearance that
time went backwards (sometimes by a large amount)
or events appearing to take different amounts of
time depending on whether they were running on
processor A or processor B.  We will call this
the "inconsistent TSC" problem.

Processor and system vendors attempted to fix the
inconsistent TSC problem by providing a new class
of "platform timers" (e.g. HPET), but these proved
to be slow and difficult to use, especially for
apps that required frequent fine metrics.

Processor and system vendors eventually figured out
how to drive all TSCs from the same crystal, but
then a new set of problems emerged: Power features
sometimes caused the clock on one processor to
slow down or even stop, thus destroying the synchrony
with other processors.  This was fixed first
by ensuring that the tick rate did not change
("constant TSC") and later that it did not stop
("nonstop TSC"), unless ALL of the TSCs on all of
the processors stopped.  Nearly all of the most recent
generations of server processors support these
capabilities, so as a result on most recent servers,
the TSC on all processors/cores/sockets is driven by
the same crystal, always ticks at the same rate,
and doesn't stop independently of other processors'
TSCs.  This is what we call a "reliable TSC".

But we're not done yet.  What does a reliable TSC
provide?  We need to define a few more terms.

A "perfect TSC" would be one where a magic logic
analyzer with a cesium clock could confirm that
the TSCs on every processor increment at precisely
the same femtosecond.  Both the speed of light
and the pricing models of commodity processors
make a perfect TSC unlikely :-)

How close is good enough?  We define two TSCs
as being "unobservably different" if code running
on the two processors can never see time going
backwards, because the difference between their
TSCs is smaller than the memory access overhead
due to cache synchronization. (This is sometimes
called a "cache bounce".) To wit, suppose processor
A does a rdtsc and writes the result into memory;
meanwhile processor B is spinning until it sees that the
memory location has changed, reads A's value
from memory and then does its own rdtsc.  If
B's rdtsc is NEVER less than OR equal to A's rdtsc,
the two TSCs are unobservably different, and we
will call this an "optimal TSC".
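
The A/B test just described can be sketched roughly as
follows.  This is only an illustration, not the GPL code
used by Linux: every name here (clock_fn, tsc_clock,
counter_clock, count_backwards_events) is made up for the
example.  It assumes C11 atomics and POSIX threads, and it
does NOT pin the two threads to different processors,
which a real test would have to do.  The clock is passed
in as a function pointer so the logic can be exercised
with a stand-in counter that is optimal by construction.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

typedef uint64_t (*clock_fn)(void);

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
/* Raw TSC read for real measurements on x86 (rdtsc is not serializing). */
uint64_t tsc_clock(void) { return __rdtsc(); }
#endif

/* Hypothetical stand-in clock: one shared atomic counter. */
static _Atomic uint64_t fake_ticks;
uint64_t counter_clock(void) { return atomic_fetch_add(&fake_ticks, 1) + 1; }

struct bounce {
    clock_fn clk;
    uint64_t rounds;
    uint64_t stamp;      /* A's timestamp, published through 'ready' */
    _Atomic int ready;   /* 1 = stamp is valid, 0 = B has consumed it */
};

/* "Processor A": read the clock and write the result into memory. */
static void *processor_a(void *arg)
{
    struct bounce *b = arg;
    for (uint64_t i = 0; i < b->rounds; i++) {
        while (atomic_load(&b->ready))
            ;                        /* wait until B has consumed */
        b->stamp = b->clk();
        atomic_store(&b->ready, 1);
    }
    return NULL;
}

/* "Processor B": spin until memory changes, then read its own clock.
 * Returns how many times B's value was less than or equal to A's. */
uint64_t count_backwards_events(clock_fn clk, uint64_t rounds)
{
    struct bounce b = { .clk = clk, .rounds = rounds };
    pthread_t a;
    uint64_t bad = 0;

    if (pthread_create(&a, NULL, processor_a, &b) != 0)
        return UINT64_MAX;           /* could not start the test */
    for (uint64_t i = 0; i < rounds; i++) {
        while (!atomic_load(&b.ready))
            ;                        /* spin on the cache line */
        uint64_t a_stamp = b.stamp;
        uint64_t b_stamp = clk();
        if (b_stamp <= a_stamp)
            bad++;                   /* time did not move forward */
        atomic_store(&b.ready, 0);
    }
    pthread_join(a, NULL);
    return bad;
}
```

On x86 one would call count_backwards_events(tsc_clock, N)
with the threads pinned to the two processors under test;
a return value of zero over many rounds is consistent with
an optimal TSC for that pair, though it proves nothing
about pairs or moments that were not tested.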

A reliable TSC is not guaranteed to be optimal;
it may just be very close to optimal, meaning
the difference between two TSCs may sometimes
be observable but it will always be very small.
(As far as I know, processor and server vendors
will not guarantee exactly how small.)  To simulate
an optimal TSC with a reliable TSC, a software
wrapper can be placed around the reads from a
reliable TSC to catch and "fix" the rare
circumstances where time goes backwards.
If this wrapper ensures that time never goes
backwards AND ensures that time always moves
forward, we call this a monotonically-increasing
wrapper.  If it instead ensures that time never
goes backwards BUT allows time to appear to stop,
we call this a monotonically-non-decreasing wrapper.
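
To make the two wrapper flavors concrete, here is a
minimal sketch, assuming C11 atomics and a single shared
"last value returned" word.  The function names are
hypothetical, and the raw TSC value is passed in as a
parameter so the logic does not depend on how rdtsc is
actually read.  Other designs (per-CPU state, locks) are
possible; this is just one way to do it.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Monotonically-non-decreasing: never goes backwards, but two
 * callers may see the same value (time may appear to stop). */
uint64_t tsc_nondecreasing(_Atomic uint64_t *last, uint64_t raw)
{
    uint64_t prev = atomic_load(last);

    while (raw > prev) {
        /* Try to publish our newer value; on failure 'prev' is
         * reloaded with whatever another thread published. */
        if (atomic_compare_exchange_weak(last, &prev, raw))
            return raw;
    }
    return prev;    /* raw was stale: reuse the newest known value */
}

/* Monotonically-increasing: every returned value is strictly
 * greater than every value returned before it. */
uint64_t tsc_increasing(_Atomic uint64_t *last, uint64_t raw)
{
    uint64_t prev = atomic_load(last);

    for (;;) {
        /* If raw is not ahead of the last value, nudge forward by 1. */
        uint64_t next = raw > prev ? raw : prev + 1;
        if (atomic_compare_exchange_weak(last, &prev, next))
            return next;
    }
}
```

Note the cost: both wrappers turn every timestamp read
into a write to one shared cache line, so the wrapper
itself causes cache bounces among processors that read
time frequently.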

Note also that a reliable TSC is not guaranteed
to never stop; it is just guaranteed that if
the TSC on one processor is stopped, the TSC on
all other processors will also be stopped.  As
a result, a reliable TSC cannot be used as
a wallclock, at least without other software
support that can properly adjust the TSC on all
processors when all processors awaken.
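
As a rough illustration of what such software support
might look like, the sketch below takes a slightly
different route from the one described above: instead of
rewriting the TSC on every processor, it keeps a software
base (a TSC value paired with a wallclock value) and
re-bases it from some external time source after all
processors awaken.  The struct, the function names, and
the assumption that the TSC frequency in Hz is known and
nonzero are all hypothetical.

```c
#include <stdint.h>

#define NS_PER_SEC 1000000000ull

struct tsc_wallclock {
    uint64_t tsc_hz;    /* assumed known, constant tick rate */
    uint64_t base_tsc;  /* TSC value at the last (re)base */
    uint64_t base_ns;   /* wallclock nanoseconds at the last (re)base */
};

/* Convert a TSC reading to wallclock nanoseconds.  The delta is
 * split into whole seconds and a remainder so that the multiply
 * by NS_PER_SEC does not overflow for realistic tick rates. */
uint64_t tsc_to_wall_ns(const struct tsc_wallclock *wc, uint64_t tsc)
{
    uint64_t delta = tsc - wc->base_tsc;

    return wc->base_ns
         + (delta / wc->tsc_hz) * NS_PER_SEC
         + (delta % wc->tsc_hz) * NS_PER_SEC / wc->tsc_hz;
}

/* Call once after all processors awaken: the TSCs stopped together,
 * so pair the current TSC with time taken from an external source. */
void tsc_wallclock_rebase(struct tsc_wallclock *wc,
                          uint64_t tsc_now, uint64_t external_ns_now)
{
    wc->base_tsc = tsc_now;
    wc->base_ns = external_ns_now;
}
```

Because all TSCs stop and restart together on a reliable
TSC, a single re-base is enough for every processor; with
an inconsistent TSC each processor would need its own.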

Last, there is the issue of whether or not Xen can
conclusively determine if the TSC is reliable.
This is still an open challenge.  There exists
a CPUID bit which purports to do this, but it
is not known with certainty if there are exceptions.
Notably, there is concern about whether certain
newer, larger NUMA servers will truly provide a reliable
TSC across all system processors even if the
CPUID bit on each CPU package says the package
does provide a reliable TSC.  One large server vendor
claims that this is not a problem anymore, but
ideally we would like to test this dynamically
and there is GPL code available to do exactly
that.  This code is used in Linux in some
circumstances once at boot-time to test for
an "optimal TSC".  But in some cases the CPUID
bit disables this test.  And in any case a
boot-time test may not catch all problems, such as
a power event that doesn't handle TSC quite properly.
So without some form of ongoing post-boot-time
test, we just don't know.
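
For reference, reading such a bit looks roughly like this.
I am ASSUMING here that the bit in question is the
"Invariant TSC" flag, CPUID leaf 0x80000007, EDX bit 8;
the text above does not name it.  The function names are
made up, and __get_cpuid() is the helper from the
GCC/Clang <cpuid.h> header, so this is a sketch for those
compilers only.

```c
#include <stdbool.h>
#include <stdint.h>

#define INVARIANT_TSC_LEAF     0x80000007u  /* assumed leaf */
#define INVARIANT_TSC_EDX_BIT  8            /* assumed bit in EDX */

/* Pure helper: decode the flag from an EDX value. */
bool edx_reports_invariant_tsc(uint32_t edx)
{
    return (edx >> INVARIANT_TSC_EDX_BIT) & 1u;
}

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>

/* Ask the processor this code happens to be running on. */
bool cpu_reports_invariant_tsc(void)
{
    unsigned int eax, ebx, ecx, edx;

    /* __get_cpuid() returns 0 if the leaf is not supported. */
    if (!__get_cpuid(INVARIANT_TSC_LEAF, &eax, &ebx, &ecx, &edx))
        return false;
    return edx_reports_invariant_tsc(edx);
}
#endif
```

This only reports what one CPU package says about itself.
It says nothing about whether TSCs on different packages
agree with each other, which is exactly the open question
for large NUMA servers.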

Xen-devel mailing list


