xen-devel

RE: [Xen-devel] write_tsc in a PV domain?

To: dan.magenheimer@xxxxxxxxxx, Jeremy Fitzhardinge <jeremy@xxxxxxxx>
Subject: RE: [Xen-devel] write_tsc in a PV domain?
From: Dan Magenheimer <dan.magenheimer@xxxxxxxxxx>
Date: Mon, 31 Aug 2009 11:11:50 -0700 (PDT)
Cc: "Xen-Devel \(E-mail\)" <xen-devel@xxxxxxxxxxxxxxxxxxx>, Keir Fraser <keir.fraser@xxxxxxxxxxxxx>, Alan Cox <alan@xxxxxxxxxxxxxxxxxxx>
I'm experimenting with clock_gettime(), gettimeofday(),
and rdtsc with a 2.6.30 64-bit pvguest.  I have tried both
with kernel.vsyscall64 equal to 0 and 1 (but haven't seen
any significant difference between the two).  I have
confirmed from sysfs that clocksource=xen.

I have yet to get a measurement of either syscall that
is better than 2.5x WORSE than emulating rdtsc. On
my dual-core Conroe (Intel E6850) with 64-bit Xen and
32-bit dom0, I get approximately:

rdtsc native:                  22ns
softtsc (rdtsc emulated):     360ns
gettime syscall w/softtsc:   1400ns
gettime syscall native tsc:   980ns
gettimeofday w/softtsc:      1750ns
gettimeofday native tsc:      900ns

I'm hoping this is either a bug in the 2.6.30 xen
pvclock implementation or in my measurement methodology,
so would welcome others measuring this.
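
For anyone who wants to repeat the measurement, here is a minimal
sketch of one way to time the two calls.  It is not the harness behind
the figures above.  The helper names (elapsed_ns, clock_gettime_cost_ns,
gettimeofday_cost_ns) and the choice of CLOCK_MONOTONIC as both the
stopwatch and the clock under test are assumptions made for
illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>
#include <time.h>

/* Elapsed nanoseconds between two CLOCK_MONOTONIC samples (a before b). */
static double elapsed_ns(const struct timespec *a, const struct timespec *b)
{
    return (double)(b->tv_sec - a->tv_sec) * 1e9 +
           (double)(b->tv_nsec - a->tv_nsec);
}

/* Average cost in ns of one clock_gettime(CLOCK_MONOTONIC) call. */
static double clock_gettime_cost_ns(long iters)
{
    struct timespec start, end, scratch;

    assert(iters > 0);
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < iters; i++)
        clock_gettime(CLOCK_MONOTONIC, &scratch);
    clock_gettime(CLOCK_MONOTONIC, &end);
    return elapsed_ns(&start, &end) / (double)iters;
}

/* Average cost in ns of one gettimeofday() call. */
static double gettimeofday_cost_ns(long iters)
{
    struct timespec start, end;
    struct timeval scratch;

    assert(iters > 0);
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < iters; i++)
        gettimeofday(&scratch, NULL);
    clock_gettime(CLOCK_MONOTONIC, &end);
    return elapsed_ns(&start, &end) / (double)iters;
}
```

Calling each helper with a large iteration count (say one million)
amortizes the two stopwatch reads.  The result still includes loop
overhead, so it is an upper bound on the per-call cost rather than an
exact figure.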

A couple of other minor observations:
1) The syscalls seem to be somewhat slower when usermode
   rdtscs are being emulated, by approximately the cost
   of emulating an rdtsc.  I suppose this makes
   sense since vsyscalls are executed in userland
   and since vgettimeofday does an rdtsc.  However, it
   complicates strategy if emulating rdtsc is the default.
2) The syscall clock_getres() does not seem to reflect
   the fact that 

> -----Original Message-----
> From: Dan Magenheimer 
> Sent: Saturday, August 29, 2009 11:52 AM
> To: Jeremy Fitzhardinge
> Cc: Alan Cox; Xen-Devel (E-mail); Keir Fraser
> Subject: RE: [Xen-devel] write_tsc in a PV domain?
> 
> 
> (Reordered with most important points first...)
> 
> > You are talking about three different cases:
> 
> I agree with your analysis for case 1 and case 3.
> 
> > So, there's case 2: pv usermode.  There are four
> > classes of apps worth considering here:
> 
> I agree with your classification.  But a key point
> is that VMware provides correctness for all
> of these classes.  AND provides it at much better
> performance than trap-and-emulate.  AND provides
> correctness+performance regardless of the underlying
> OS (e.g. even "old" OS's such as RHEL4 and RHEL5).
> AND provides it regardless whether the guest OS is
> 32-bit or 64-bit.  AND, by the way, provides it for
> your case 1 (PV OS) and case 3 (HVM) as well.
> 
> > So if you want to address these problems, it seems to me you'll get most
> > bang for the buck by fixing (v)gettimeofday to use pvclock, and
> > convincing app writers to trust in gettimeofday.
> 
> (Partially irrelevant point, but gettimeofday returns
> microseconds which is not enough resolution for many
> cases where rdtsc has been used in apps.  Clock_gettime
> is the relevant API I think.)
> 
> If we can come up with a way for a kernel-loadable module
> to handle some equivalent of clock_gettime so that
> the most widely used shipping PV OS's can provide a
> pvclock interface to apps, this might be workable.
> If we tell app providers and customers: "You
> can choose either performance OR correctness but
> not both, unless you upgrade to a new OS (that is
> not even available yet)", I don't think that will
> be acceptable.
> 
> Any ideas on how pvclock might be provided through
> a module that could be added to, eg. RHEL4 or RHEL5?
> 
> > > There ARE guaranteed properties specified by
> > > the Intel SDM for any _single_ processor...
> > 
> > Yes, but those are fairly weak guarantees.  It does not guarantee that
> > the tsc won't change rate arbitrarily, or stop outright between reads.
> 
> They are weak guarantees only if one uses rdtsc
> to accurately track wallclock time.  They are
> perfectly useful guarantees if one simply wants to
> either timestamp data to record ordering (e.g.
> for journaling or transaction replay), or
> approximate the passing of time to provide
> approximate execution metrics (e.g. for
> performance tools).
> 
> > > What is NOT guaranteed, but is widely and
> > > incorrectly assumed to be implied and has
> > > gotten us into this mess, is that
> > > the same properties apply across multiple
> > > processors.
> > 
> > Yes, Linux offers even weaker guarantees than Intel.  Aside from the
> > processor migration issue, the tsc can jump arbitrarily as a result of
> > suspend/resume (ie, it can be non-monotonic).
> 
> Please explain.  Suspend/resume is an S-state, isn't
> it?  Is it possible to suspend/resume one processor
> in an SMP system and not another processor?  I think
> not.  Your point is valid for C-states and P-states
> but those are what Intel/AMD have fixed in the most
> recent families of multi-core processors.
> 
> So I don't see how (in the most recent families of
> processors) tsc can be non-monotonic.
> 
> > Even very recent processors with "constant" tscs (ie, they don't change
> > rate with the core frequency) stop in certain power states.
> 
> For the most recent families of processors, the TSC
> continues to run at a fixed rate even for all the
> P-states and C-states.  We should confirm this with
> Intel and AMD.
> 
> > Any motherboard design which runs packages in different
> > clock-domains will lose tsc-sync between those packages,
> > regardless of what's in the packages.
> 
> I'm told this is not true for recent multi-socket systems
> where the sockets are on the same motherboard.  And at
> least one large vendor that ships a new one-socket-per-
> motherboard NUMA-ish system claims that it is not even
> true when the sockets are on different motherboards.
> 
> Dan
> 
> (no further replies below, remaining original text retained
> for context)
> 
> > You are talking about three different cases:
> > 
> >    1. the reliability of the tsc in a PV guest in kernel mode
> >    2. the reliability of the tsc in a PV guest in user mode
> >    3. the reliability of the tsc in an HVM guest
> > 
> > I don't think 1. needs any attention.  The current scheme works fine.
> > 
> > The only option for 3 is to try to make a best-effort of tsc quality, which
> > ranges from trapping every rdtsc to make them all give globally
> > monotonic results, to using the other VT/SVM features to apply an offset
> > from the raw tsc to a guest tsc, etc.  Either way the situation isn't
> > much different from running native (ie, apps will see basically the same
> > tsc behaviour as in the native case, to some degree of approximation).
> > 
> > So, there's case 2: pv usermode.  There are four classes of apps worth
> > considering here:
> > 
> >    1. Old apps which make unwarranted assumptions about the behaviour of
> >       the tsc.  They assume they're basically running on some equivalent
> >       of a P54, and so will get junk on any modernish system with SMP
> >       and/or power management.  If people are still using such apps, it
> >       probably means their performance isn't critically dependent on the
> >       tsc.
> >    2. More sophisticated apps which know the tsc has some limitations
> >       and try to mitigate them by filtering discontinuities, using
> >       rdtscp, etc.  They're best-effort, but they inherently lack enough
> >       information to do a complete job (they have to guess at where
> >       power transitions occurred, etc).
> >    3. New apps which know about modern processor capabilities, and
> >       attempt to rely on constant_tsc, forgoing all the best-effort
> >       filtering, etc.
> >    4. Apps which use gettimeofday() and/or clock_gettime() for all time
> >       measurement.  They're guaranteed to get consistent time results,
> >       perhaps at the cost of a syscall.  On systems which support it,
> >       they'll get vsyscall implementations which avoid the syscall while
> >       still using the best-possible clocksource.  Even if they don't, a
> >       syscall will outperform an emulated rdtsc.
> > 
> > Class 1 apps are just broken.  We can try to emulate a UP, no-PM
> > processor for them, and that's probably best done in an HVM domain. 
> > There's no need to go to extraordinary efforts for them because the
> > native hardware certainly won't.
> > 
> > Class 2 apps will work as well as ever in a Xen PV domain as-is.  If
> > they use rdtscp then they will be able to correlate the tsc to the
> > underlying pcpu and manage consistency that way.  If they pin threads to
> > VCPUs, then they may also require VCPUs to be pinned to PCPUs.  But
> > there's no need to make deep changes to Xen's tsc handling to
> > accommodate them.
> > 
> > Class 3 apps will get a bit of a rude surprise in a PV Xen domain.  But
> > they're also new enough to use another mechanism to get time.  They're
> > new enough to "know" that gettimeofday can be very efficient, and should
> > not be going down the rathole of using rdtsc directly.  And unless
> > they're going to be restricted to a very narrow class of machines (for
> > example, not my relatively new Core2 laptop which stops the "constant"
> > tsc in deep sleep modes), they need to fall back to being a class 2 or 4
> > app anyway.
> > 
> > Class 4 apps are not well-served under Xen.  I think the vsyscall
> > mechanism will be disabled and they'll always end up doing a real
> > syscall.  However, I think it would be relatively easy to add a new
> > vgettimeofday implementation which directly uses the pvclock mechanism
> > from usermode (the same code would work equally well for Xen and KVM).
> > There's no need to add a new usermode ABI to get quick, high-quality
> > time in usermode.  Performance-wise it would be more or less
> > indistinguishable from using a raw rdtsc, but it has the benefit of
> > getting full cooperation from the kernel and Xen, and can take into
> > account all tsc variations (if any).
> > 
> > 
> > So if you want to address these problems, it seems to me you'll get most
> > bang for the buck by fixing (v)gettimeofday to use pvclock, and
> > convincing app writers to trust in gettimeofday.
> > 
> >     J
> >
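
The quoted message proposes a vgettimeofday that reads pvclock
directly from usermode.  As a rough illustration of what that read
involves, here is a minimal sketch.  The struct name time_snapshot and
the functions scale_delta and snapshot_to_ns are hypothetical.  The
field names are modeled on Xen's public vcpu_time_info record and
should be checked against the actual headers.  The current TSC value
is passed in as a parameter rather than read with rdtsc, and the
fences use the GCC/Clang __atomic_thread_fence builtin.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mirror of the per-VCPU time record; verify the layout
 * against the real hypervisor headers before relying on it. */
struct time_snapshot {
    uint32_t version;            /* odd while an update is in progress */
    uint64_t tsc_timestamp;      /* TSC value when system_time was sampled */
    uint64_t system_time;        /* nanoseconds at tsc_timestamp */
    uint32_t tsc_to_system_mul;  /* 32.32 fixed-point ns-per-tick factor */
    int8_t   tsc_shift;          /* pre-shift applied to the TSC delta */
};

/* Scale a TSC delta to ns: ((delta << shift) * mul) >> 32. */
static uint64_t scale_delta(uint64_t delta, uint32_t mul, int8_t shift)
{
    assert(shift > -64 && shift < 64);
    if (shift >= 0)
        delta <<= shift;
    else
        delta >>= -shift;
    /* Split the multiply so the high half of the product is not lost. */
    return (delta >> 32) * mul + (((delta & 0xffffffffu) * mul) >> 32);
}

/* Lock-free read: retry if an update was in flight or raced with us. */
static uint64_t snapshot_to_ns(const volatile struct time_snapshot *s,
                               uint64_t tsc_now)
{
    uint32_t before, after;
    uint64_t ns;

    do {
        before = s->version;
        __atomic_thread_fence(__ATOMIC_ACQUIRE);
        ns = s->system_time +
             scale_delta(tsc_now - s->tsc_timestamp,
                         s->tsc_to_system_mul, s->tsc_shift);
        __atomic_thread_fence(__ATOMIC_ACQUIRE);
        after = s->version;
    } while ((before & 1) || before != after);
    return ns;
}
```

With a multiplier of 0x80000000 (0.5 ns per tick) and a shift of 0, a
delta of 2000 ticks adds 1000 ns to system_time.  The version check is
what lets the reader avoid taking a lock or making a syscall.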
