I'm experimenting with clock_gettime(), gettimeofday(),
and rdtsc on a 2.6.30 64-bit PV guest. I have tried
with kernel.vsyscall64 set to both 0 and 1 (but haven't
seen any significant difference between the two), and I
have confirmed from sysfs that clocksource=xen.
I have yet to get a measurement of either syscall that
is better than 2.5x WORSE than emulating rdtsc. On
my dual-core Conroe (Intel E6850) with 64-bit Xen and
32-bit dom0, I get approximately:
rdtsc native:                 22ns
softtsc (rdtsc emulated):    360ns
gettime syscall w/softtsc:  1400ns
gettime syscall native tsc:  980ns
gettimeofday w/softtsc:     1750ns
gettimeofday native tsc:     900ns
I'm hoping this is either a bug in the 2.6.30 xen
pvclock implementation or in my measurement methodology,
so would welcome others measuring this.
A couple other minor observations:
1) The syscalls seem to be somewhat slower when usermode
rdtscs are being emulated, by approximately the cost
of emulating an rdtsc. I suppose this makes sense,
since vsyscalls are executed in userland and
vgettimeofday does an rdtsc. However, it complicates
strategy if emulating rdtsc is the default.
2) The syscall clock_getres() does not seem to reflect
the fact that
> -----Original Message-----
> From: Dan Magenheimer
> Sent: Saturday, August 29, 2009 11:52 AM
> To: Jeremy Fitzhardinge
> Cc: Alan Cox; Xen-Devel (E-mail); Keir Fraser
> Subject: RE: [Xen-devel] write_tsc in a PV domain?
>
>
> (Reordered with most important points first...)
>
> > You are talking about three different cases:
>
> I agree with your analysis for case 1 and case 3.
>
> > So, there's case 2: pv usermode. There are four
> > classes of apps worth considering here:
>
> I agree with your classification. But a key point
> is that VMware provides correctness for all
> of these classes. AND provides it at much better
> performance than trap-and-emulate. AND provides
> correctness+performance regardless of the underlying
> OS (e.g. even "old" OS's such as RHEL4 and RHEL5).
> AND provides it regardless whether the guest OS is
> 32-bit or 64-bit. AND, by the way, provides it for
> your case 1 (PV OS) and case 3 (HVM) as well.
>
> > So if you want to address these problems, it seems to me
> > you'll get most
> > bang for the buck by fixing (v)gettimeofday to use pvclock, and
> > convincing app writers to trust in gettimeofday.
>
> (Partially irrelevant point, but gettimeofday returns
> microseconds, which is not enough resolution for many
> cases where rdtsc has been used in apps. clock_gettime()
> is the relevant API, I think.)
>
> If we can come up with a way for a kernel-loadable module
> to handle some equivalent of clock_gettime so that
> the most widely used shipping PV OS's can provide a
> pvclock interface to apps, this might be workable.
> If we tell app providers and customers: "You
> can choose either performance OR correctness but
> not both, unless you upgrade to a new OS (that is
> not even available yet)", I don't think that will
> be acceptable.
>
> Any ideas on how pvclock might be provided through
> a module that could be added to, eg. RHEL4 or RHEL5?
>
> > > There ARE guaranteed properties specified by
> > > the Intel SDM for any _single_ processor...
> >
> > Yes, but those are fairly weak guarantees. It does not
> > guarantee that the tsc won't change rate arbitrarily,
> > or stop outright between reads.
>
> They are weak guarantees only if one uses rdtsc
> to accurately track wallclock time. They are
> perfectly useful guarantees if one simply wants to
> either timestamp data to record ordering (e.g.
> for journaling or transaction replay), or
> approximate the passing of time to provide
> approximate execution metrics (e.g. for
> performance tools).
>
> > > What is NOT guaranteed, but is widely and
> > > incorrectly assumed to be implied and has
> > > gotten us into this mess, is that
> > > the same properties applies across multiple
> > > processors.
> >
> > Yes, Linux offers even weaker guarantees than Intel. Aside from the
> > processor migration issue, the tsc can jump arbitrarily as a result
> > of suspend/resume (ie, it can be non-monotonic).
>
> Please explain. Suspend/resume is an S state isn't
> it? Is it possible to suspend/resume one processor
> in an SMP system and not another processor? I think
> not. Your point is valid for C-states and P-states
> but those are what Intel/AMD has fixed in the most
> recent families of multi-core processors.
>
> So I don't see how (in the most recent families of
> processors) tsc can be non-monotonic.
>
> > Even very recent processors with "constant" tscs (ie, they don't
> > change rate with the core frequency) stop in certain power states.
>
> For the most recent families of processors, the TSC
> continues to run at a fixed rate even for all the
> P-states and C-states. We should confirm this with
> Intel and AMD.
>
> > Any motherboard design which runs packages in different
> > clock-domains will lose tsc-sync between those packages,
> > regardless of what's in the packages.
>
> I'm told this is not true for recent multi-socket systems
> where the sockets are on the same motherboard. And at
> least one large vendor that ships a new one-socket-per-
> motherboard NUMA-ish system claims that it is not even
> true when the sockets are on different motherboards.
>
> Dan
>
> (no further replies below, remaining original text retained
> for context)
>
> > You are talking about three different cases:
> >
> > 1. the reliability of the tsc in a PV guest in kernel mode
> > 2. the reliability of the tsc in a PV guest in user mode
> > 3. the reliability of the tsc in an HVM guest
> >
> > I don't think 1. needs any attention. The current scheme works fine.
> >
> > The only option for 3 is to try to make a best-effort of tsc
> > quality, which ranges from trapping every rdtsc to make them all
> > give globally monotonic results, or use the other VT/SVM features
> > to apply an offset from the raw tsc to a guest tsc, etc. Either
> > way the situation isn't much different from running native (ie,
> > apps will see basically the same tsc behaviour as in the native
> > case, to some degree of approximation).
> >
> > So, there's case 2: pv usermode. There are four classes of apps
> > worth considering here:
> >
> > 1. Old apps which make unwarranted assumptions about the
> >    behaviour of the tsc. They assume they're basically running on
> >    some equivalent of a P54, and so will get junk on any modernish
> >    system with SMP and/or power management. If people are still
> >    using such apps, it probably means their performance isn't
> >    critically dependent on the tsc.
> > 2. More sophisticated apps which know the tsc has some limitations
> >    and try to mitigate them by filtering discontinuities, using
> >    rdtscp, etc. They're best-effort, but they inherently lack
> >    enough information to do a complete job (they have to guess at
> >    where power transitions occurred, etc).
> > 3. New apps which know about modern processor capabilities, and
> >    attempt to rely on constant_tsc, forgoing all the best-effort
> >    filtering, etc.
> > 4. Apps which use gettimeofday() and/or clock_gettime() for all
> >    time measurement. They're guaranteed to get consistent time
> >    results, perhaps at the cost of a syscall. On systems which
> >    support it, they'll get vsyscall implementations which avoid
> >    the syscall while still using the best-possible clocksource.
> >    Even if they don't, a syscall will outperform an emulated
> >    rdtsc.
> >
> > Class 1 apps are just broken. We can try to emulate a UP, no-PM
> > processor for them, and that's probably best done in an HVM domain.
> > There's no need to go to extraordinary efforts for them because the
> > native hardware certainly won't.
> >
> > Class 2 apps will work as well as ever in a Xen PV domain as-is. If
> > they use rdtscp then they will be able to correlate the tsc to the
> > underlying pcpu and manage consistency that way. If they pin
> > threads to
> > VCPUs, then they may also require VCPUs to be pinned to PCPUs. But
> > there's no need to make deep changes to Xen's tsc handling to
> > accommodate them.
> >
> > Class 3 apps will get a bit of a rude surprise in a PV Xen domain.
> > But they're also new enough to use another mechanism to get time.
> > They're new enough to "know" that gettimeofday can be very
> > efficient, and should not be going down the rathole of using rdtsc
> > directly. And unless they're going to be restricted to a very
> > narrow class of machines (for example, not my relatively new Core2
> > laptop which stops the "constant" tsc in deep sleep modes), they
> > need to fall back to being a class 2 or 4 app anyway.
> >
> > Class 4 apps are not well-served under Xen. I think the vsyscall
> > mechanism will be disabled and they'll always end up doing a real
> > syscall. However, I think it would be relatively easy to add a new
> > vgettimeofday implementation which directly uses the pvclock
> > mechanism from usermode (the same code would work equally well for
> > Xen and KVM). There's no need to add a new usermode ABI to get
> > quick, high-quality time in usermode. Performance-wise it would be
> > more or less indistinguishable from using a raw rdtsc, but it has
> > the benefit of getting full cooperation from the kernel and Xen,
> > and can take into account all tsc variations (if any).
> >
> >
> > So if you want to address these problems, it seems to me
> > you'll get most
> > bang for the buck by fixing (v)gettimeofday to use pvclock, and
> > convincing app writers to trust in gettimeofday.
> >
> > J
> >
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel