(Reordered with most important points first...)
> You are talking about three different cases:
I agree with your analysis for case 1 and case 3.
> So, there's case 2: pv usermode. There are four
> classes of apps worth considering here:
I agree with your classification. But a key point
is that VMware provides correctness for all
of these classes. AND provides it at much better
performance than trap-and-emulate. AND provides
correctness+performance regardless of the underlying
OS (e.g. even "old" OSes such as RHEL4 and RHEL5).
AND provides it regardless of whether the guest OS is
32-bit or 64-bit. AND, by the way, provides it for
your case 1 (PV OS) and case 3 (HVM) as well.
> So if you want to address these problems, it seems to me you'll get
> most bang for the buck by fixing (v)gettimeofday to use pvclock, and
> convincing app writers to trust in gettimeofday.
(Partially irrelevant point, but gettimeofday returns
microseconds, which is not enough resolution for many
cases where rdtsc has been used in apps. clock_gettime,
which returns nanoseconds, is the relevant API I think.)
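To illustrate the resolution difference, here is a minimal sketch of
the two calls side by side. The helper names now_ns and now_us are
made up for this example, and CLOCK_MONOTONIC is just one reasonable
choice of clock for interval measurement.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <sys/time.h>
#include <time.h>

/* Hypothetical helper: monotonic time in nanoseconds via
 * clock_gettime(). Returns 0 if the call fails. */
uint64_t now_ns(void)
{
    struct timespec ts;

    if (clock_gettime(CLOCK_MONOTONIC, &ts) != 0)
        return 0;
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Hypothetical helper: wallclock time in microseconds via
 * gettimeofday(). struct timeval cannot express anything finer
 * than one microsecond. Returns 0 if the call fails. */
uint64_t now_us(void)
{
    struct timeval tv;

    if (gettimeofday(&tv, NULL) != 0)
        return 0;
    return (uint64_t)tv.tv_sec * 1000000ull + (uint64_t)tv.tv_usec;
}
```

Note that the unit of the interface says nothing about the accuracy
of the underlying clocksource; it only bounds what the caller can see.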
If we can come up with a way for a kernel-loadable module
to handle some equivalent of clock_gettime so that
the most widely used shipping PV OSes can provide a
pvclock interface to apps, this might be workable.
If we tell app providers and customers: "You
can choose either performance OR correctness but
not both, unless you upgrade to a new OS (that is
not even available yet)", I don't think that will
be acceptable.
Any ideas on how pvclock might be provided through
a module that could be added to, e.g., RHEL4 or RHEL5?
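To make the question concrete, the app-side half of such an interface
could be quite small. The following is only a sketch, modeled on the
pvclock scheme (a version counter, a TSC snapshot, a system time
snapshot, and a multiplier/shift pair). The struct name pv_time_info
and the function names are hypothetical, the layout is simplified
rather than the real ABI, and how a module would map the record into
an app's address space is exactly the open part. It relies on the
GCC/Clang unsigned __int128 extension for the 96-bit product.

```c
#include <stdint.h>

/* Hypothetical, simplified time record published by the hypervisor. */
struct pv_time_info {
    uint32_t version;           /* odd while an update is in progress */
    uint64_t tsc_timestamp;     /* TSC value at the last update */
    uint64_t system_time;       /* nanoseconds at the last update */
    uint32_t tsc_to_system_mul; /* 32.32 fixed-point ns per shifted tick */
    int8_t   tsc_shift;         /* pre-shift applied to the TSC delta */
};

/* Convert a TSC delta to nanoseconds: shift, then multiply by the
 * 32.32 fixed-point factor and drop the fractional bits. */
static uint64_t scale_delta(uint64_t delta, uint32_t mul, int8_t shift)
{
    if (shift < 0)
        delta >>= -shift;
    else
        delta <<= shift;
    return (uint64_t)(((unsigned __int128)delta * mul) >> 32);
}

/* Compute current time in ns from the record and a TSC reading.
 * A real reader would execute rdtsc inside the loop; tsc_now is a
 * parameter here so the arithmetic can be checked in isolation. */
uint64_t pv_time_ns(const volatile struct pv_time_info *ti, uint64_t tsc_now)
{
    uint32_t ver;
    uint64_t ns;

    do {
        ver = ti->version;
        __sync_synchronize();   /* read version before the payload */
        ns = ti->system_time +
             scale_delta(tsc_now - ti->tsc_timestamp,
                         ti->tsc_to_system_mul, ti->tsc_shift);
        __sync_synchronize();   /* read payload before re-checking */
    } while ((ver & 1) || ver != ti->version);

    return ns;
}
```

The retry loop is what lets the hypervisor change the rate or offset
at any time without the app ever seeing a torn update.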
> > There ARE guaranteed properties specified by
> > the Intel SDM for any _single_ processor...
>
> Yes, but those are fairly weak guarantees. It does not guarantee that
> the tsc won't change rate arbitrarily, or stop outright between reads.
They are weak guarantees only if one uses rdtsc
to accurately track wallclock time. They are
perfectly useful guarantees if one simply wants to
either timestamp data to record ordering (e.g.
for journaling or transaction replay), or
approximate the passing of time to provide
approximate execution metrics (e.g. for
performance tools).
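As a rough illustration of those two uses, here is a sketch using the
__rdtsc() intrinsic from GCC/Clang on x86. The record type and the
function names are invented for the example, and it assumes all reads
happen on a single processor, which is the only case the SDM
guarantees discussed here cover.

```c
#include <stdint.h>
#include <x86intrin.h>  /* __rdtsc(); GCC/Clang on x86 only */

/* Hypothetical journal record. The TSC value is used only to order
 * records written on one processor, not as wallclock time. */
struct stamped_record {
    uint64_t tsc;
    int payload;
};

struct stamped_record stamp(int payload)
{
    struct stamped_record r = { __rdtsc(), payload };
    return r;
}

/* Approximate cost of fn() in TSC ticks. This is a rough execution
 * metric, not a precise duration. */
uint64_t ticks_spent(void (*fn)(void))
{
    uint64_t start = __rdtsc();

    fn();
    return __rdtsc() - start;
}

/* Trivial workload used to exercise ticks_spent(). */
void spin_briefly(void)
{
    for (volatile int i = 0; i < 1000; i++)
        ;
}
```

Neither use needs to know the tick rate, which is why the single
processor guarantees are sufficient for them.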
> > What is NOT guaranteed, but is widely and
> > incorrectly assumed to be implied and has
> > gotten us into this mess, is that
> > the same properties apply across multiple
> > processors.
>
> Yes, Linux offers even weaker guarantees than Intel. Aside from the
> processor migration issue, the tsc can jump arbitrarily as a result of
> suspend/resume (ie, it can be non-monotonic).
Please explain. Suspend/resume is an S-state, isn't
it? Is it possible to suspend/resume one processor
in an SMP system and not another processor? I think
not. Your point is valid for C-states and P-states,
but those are what Intel and AMD have fixed in the most
recent families of multi-core processors.
So I don't see how (in the most recent families of
processors) tsc can be non-monotonic.
> Even very recent processors with "constant" tscs (ie, they don't
> change rate with the core frequency) stop in certain power states.
For the most recent families of processors, the TSC
continues to run at a fixed rate across all
P-states and C-states. We should confirm this with
Intel and AMD.
> Any motherboard design which runs packages in different
> clock-domains will lose tsc-sync between those packages,
> regardless of what's in the packages.
I'm told this is not true for recent multi-socket systems
where the sockets are on the same motherboard. And at
least one large vendor that ships a new NUMA-ish system
with one socket per motherboard claims that it is not
even true when the sockets are on different motherboards.
Dan
(no further replies below, remaining original text retained
for context)
> You are talking about three different cases:
>
> 1. the reliability of the tsc in a PV guest in kernel mode
> 2. the reliability of the tsc in a PV guest in user mode
> 3. the reliability of the tsc in an HVM guest
>
> I don't think 1. needs any attention. The current scheme works fine.
>
> The only option for 3 is to try to make a best-effort of tsc quality,
> which ranges from trapping every rdtsc to make them all give globally
> monotonic results, to using the other VT/SVM features to apply an
> offset from the raw tsc to a guest tsc, etc. Either way the situation
> isn't much different from running native (ie, apps will see basically
> the same tsc behaviour as in the native case, to some degree of
> approximation).
>
> So, there's case 2: pv usermode. There are four classes of apps worth
> considering here:
>
> 1. Old apps which make unwarranted assumptions about the behaviour
>    of the tsc. They assume they're basically running on some
>    equivalent of a P54, and so will get junk on any modernish system
>    with SMP and/or power management. If people are still using such
>    apps, it probably means their performance isn't critically
>    dependent on the tsc.
> 2. More sophisticated apps which know the tsc has some limitations
>    and try to mitigate them by filtering discontinuities, using
>    rdtscp, etc. They're best-effort, but they inherently lack enough
>    information to do a complete job (they have to guess at where
>    power transitions occurred, etc).
> 3. New apps which know about modern processor capabilities, and
>    attempt to rely on constant_tsc, forgoing all the best-effort
>    filtering, etc.
> 4. Apps which use gettimeofday() and/or clock_gettime() for all time
>    measurement. They're guaranteed to get consistent time results,
>    perhaps at the cost of a syscall. On systems which support it,
>    they'll get vsyscall implementations which avoid the syscall
>    while still using the best-possible clocksource. Even if they
>    don't, a syscall will outperform an emulated rdtsc.
>
> Class 1 apps are just broken. We can try to emulate a UP, no-PM
> processor for them, and that's probably best done in an HVM domain.
> There's no need to go to extraordinary efforts for them because the
> native hardware certainly won't.
>
> Class 2 apps will work as well as ever in a Xen PV domain as-is. If
> they use rdtscp then they will be able to correlate the tsc to the
> underlying pcpu and manage consistency that way. If they pin threads
> to VCPUs, then they may also require VCPUs to be pinned to PCPUs. But
> there's no need to make deep changes to Xen's tsc handling to
> accommodate them.
>
> Class 3 apps will get a bit of a rude surprise in a PV Xen domain. But
> they're also new enough to use another mechanism to get time. They're
> new enough to "know" that gettimeofday can be very efficient, and
> should not be going down the rathole of using rdtsc directly. And
> unless they're going to be restricted to a very narrow class of
> machines (for example, not my relatively new Core2 laptop which stops
> the "constant" tsc in deep sleep modes), they need to fall back to
> being a class 2 or 4 app anyway.
>
> Class 4 apps are not well-served under Xen. I think the vsyscall
> mechanism will be disabled and they'll always end up doing a real
> syscall. However, I think it would be relatively easy to add a new
> vgettimeofday implementation which directly uses the pvclock mechanism
> from usermode (the same code would work equally well for Xen and
> KVM). There's no need to add a new usermode ABI to get quick,
> high-quality time in usermode. Performance-wise it would be more or
> less indistinguishable from using a raw rdtsc, but it has the benefit
> of getting full cooperation from the kernel and Xen, and can take
> into account all tsc variations (if any).
>
>
> So if you want to address these problems, it seems to me you'll get
> most bang for the buck by fixing (v)gettimeofday to use pvclock, and
> convincing app writers to trust in gettimeofday.
>
> J
>
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel