I'm experimenting with clock_gettime(), gettimeofday(),
and rdtsc on a 2.6.30 64-bit PV guest. I have tried
with kernel.vsyscall64 set to both 0 and 1 (but haven't
seen any significant difference between the two), and I
have confirmed from sysfs that clocksource=xen.
I have yet to get a measurement of either syscall that
is better than 2.5x WORSE than emulating rdtsc. On
my dual-core Conroe (Intel E6850) with 64-bit Xen and
32-bit dom0, I get approximately:
rdtsc native:                 22ns
softtsc (rdtsc emulated):    360ns
gettime syscall w/softtsc:  1400ns
gettime syscall native tsc:  980ns
gettimeofday w/softtsc:     1750ns
gettimeofday native tsc:     900ns
I'm hoping this is either a bug in the 2.6.30 xen
pvclock implementation or in my measurement methodology,
so would welcome others measuring this.
A couple other minor observations:
1) The syscalls seem to be somewhat slower when usermode
rdtscs are being emulated, by approximately the cost
of emulating an rdtsc. I suppose this makes sense,
since vsyscalls are executed in userland and
vgettimeofday does an rdtsc. However, it complicates
strategy if emulating rdtsc is the default.
2) The syscall clock_getres() does not seem to reflect
the fact that
> -----Original Message-----
> From: Dan Magenheimer
> Sent: Saturday, August 29, 2009 11:52 AM
> To: Jeremy Fitzhardinge
> Cc: Alan Cox; Xen-Devel (E-mail); Keir Fraser
> Subject: RE: [Xen-devel] write_tsc in a PV domain?
>
>
> (Reordered with most important points first...)
>
> > You are talking about three different cases:
>
> I agree with your analysis for case 1 and case 3.
>
> > So, there's case 2: pv usermode. There are four
> > classes of apps worth considering here:
>
> I agree with your classification. But a key point
> is that VMware provides correctness for all
> of these classes. AND provides it at much better
> performance than trap-and-emulate. AND provides
> correctness+performance regardless of the underlying
> OS (e.g. even "old" OS's such as RHEL4 and RHEL5).
> AND provides it regardless whether the guest OS is
> 32-bit or 64-bit. AND, by the way, provides it for
> your case 1 (PV OS) and case 3 (HVM) as well.
>
> > So if you want to address these problems, it seems to me
> > you'll get most
> > bang for the buck by fixing (v)gettimeofday to use pvclock, and
> > convincing app writers to trust in gettimeofday.
>
> (Partially irrelevant point, but gettimeofday returns
> microseconds, which is not enough resolution for many
> cases where rdtsc has been used in apps. clock_gettime()
> is the relevant API, I think.)
>
> If we can come up with a way for a kernel-loadable module
> to handle some equivalent of clock_gettime so that
> the most widely used shipping PV OS's can provide a
> pvclock interface to apps, this might be workable.
> If we tell app providers and customers: "You
> can choose either performance OR correctness but
> not both, unless you upgrade to a new OS (that is
> not even available yet)", I don't think that will
> be acceptable.
>
> Any ideas on how pvclock might be provided through
> a module that could be added to, eg. RHEL4 or RHEL5?
>
> > > There ARE guaranteed properties specified by
> > > the Intel SDM for any _single_ processor...
> >
> > Yes, but those are fairly weak guarantees. It does not
> > guarantee that the tsc won't change rate arbitrarily,
> > or stop outright between reads.
>
> They are weak guarantees only if one uses rdtsc
> to accurately track wallclock time. They are
> perfectly useful guarantees if one simply wants to
> either timestamp data to record ordering (e.g.
> for journaling or transaction replay), or
> approximate the passing of time to provide
> approximate execution metrics (e.g. for
> performance tools).
>
> > > What is NOT guaranteed, but is widely and
> > > incorrectly assumed to be implied and has
> > > gotten us into this mess, is that
> > > the same properties applies across multiple
> > > processors.
> >
> > Yes, Linux offers even weaker guarantees than Intel. Aside from the
> > processor migration issue, the tsc can jump arbitrarily as a result
> > of suspend/resume (ie, it can be non-monotonic).
>
> Please explain. Suspend/resume is an S state isn't
> it? Is it possible to suspend/resume one processor
> in an SMP system and not another processor? I think
> not. Your point is valid for C-states and P-states
> but those are what Intel/AMD has fixed in the most
> recent families of multi-core processors.
>
> So I don't see how (in the most recent families of
> processors) tsc can be non-monotonic.
>
> > Even very recent processors with "constant" tscs (ie, they don't
> > change rate with the core frequency) stop in certain power states.
>
> For the most recent families of processors, the TSC
> continues to run at a fixed rate even for all the
> P-states and C-states. We should confirm this with
> Intel and AMD.
>
> > Any motherboard design which runs packages in different
> > clock-domains will lose tsc-sync between those packages,
> > regardless of what's in the packages.
>
> I'm told this is not true for recent multi-socket systems
> where the sockets are on the same motherboard. And at
> least one large vendor that ships a new one-socket-per-
> motherboard NUMA-ish system claims that it is not even
> true when the sockets are on different motherboards.
>
> Dan
>
> (no further replies below, remaining original text retained
> for context)
>
> > You are talking about three different cases:
> >
> > 1. the reliability of the tsc in a PV guest in kernel mode
> > 2. the reliability of the tsc in a PV guest in user mode
> > 3. the reliability of the tsc in an HVM guest
> >
> > I don't think 1. needs any attention. The current scheme works fine.
> >
> > The only option for 3 is to try to make a best-effort of tsc
> > quality, which ranges from trapping every rdtsc to make them all
> > give globally monotonic results, or use the other VT/SVM features
> > to apply an offset from the raw tsc to a guest tsc, etc. Either
> > way the situation isn't much different from running native (ie,
> > apps will see basically the same tsc behaviour as in the native
> > case, to some degree of approximation).
> >
> > So, there's case 2: pv usermode. There are four classes of apps
> > worth considering here:
> >
> > 1. Old apps which make unwarranted assumptions about the
> >    behaviour of the tsc. They assume they're basically running on
> >    some equivalent of a P54, and so will get junk on any modernish
> >    system with SMP and/or power management. If people are still
> >    using such apps, it probably means their performance isn't
> >    critically dependent on the tsc.
> > 2. More sophisticated apps which know the tsc has some limitations
> >    and try to mitigate them by filtering discontinuities, using
> >    rdtscp, etc. They're best-effort, but they inherently lack
> >    enough information to do a complete job (they have to guess at
> >    where power transitions occurred, etc).
> > 3. New apps which know about modern processor capabilities, and
> >    attempt to rely on constant_tsc, forgoing all the best-effort
> >    filtering, etc.
> > 4. Apps which use gettimeofday() and/or clock_gettime() for all
> >    time measurement. They're guaranteed to get consistent time
> >    results, perhaps at the cost of a syscall. On systems which
> >    support it, they'll get vsyscall implementations which avoid
> >    the syscall while still using the best-possible clocksource.
> >    Even if they don't, a syscall will outperform an emulated
> >    rdtsc.
> >
> > Class 1 apps are just broken. We can try to emulate a UP, no-PM
> > processor for them, and that's probably best done in an HVM domain.
> > There's no need to go to extraordinary efforts for them because the
> > native hardware certainly won't.
> >
> > Class 2 apps will work as well as ever in a Xen PV domain as-is. If
> > they use rdtscp then they will be able to correlate the tsc to the
> > underlying pcpu and manage consistency that way. If they pin
> > threads to
> > VCPUs, then they may also require VCPUs to be pinned to PCPUs. But
> > there's no need to make deep changes to Xen's tsc handling to
> > accommodate them.
> >
> > Class 3 apps will get a bit of a rude surprise in a PV Xen domain.
> > But they're also new enough to use another mechanism to get time.
> > They're new enough to "know" that gettimeofday can be very
> > efficient, and should not be going down the rathole of using rdtsc
> > directly. And unless they're going to be restricted to a very
> > narrow class of machines (for example, not my relatively new Core2
> > laptop which stops the "constant" tsc in deep sleep modes), they
> > need to fall back to being a class 2 or 4 app anyway.
> >
> > Class 4 apps are not well-served under Xen. I think the vsyscall
> > mechanism will be disabled and they'll always end up doing a real
> > syscall. However, I think it would be relatively easy to add a new
> > vgettimeofday implementation which directly uses the pvclock
> > mechanism from usermode (the same code would work equally well for
> > Xen and KVM). There's no need to add a new usermode ABI to get
> > quick, high-quality time in usermode. Performance-wise it would be
> > more or less indistinguishable from using a raw rdtsc, but it has
> > the benefit of getting full cooperation from the kernel and Xen,
> > and can take into account all tsc variations (if any).
> >
> >
> > So if you want to address these problems, it seems to me
> > you'll get most
> > bang for the buck by fixing (v)gettimeofday to use pvclock, and
> > convincing app writers to trust in gettimeofday.
> >
> > J
> >
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel