WARNING - OLD ARCHIVES

This is an archived copy of the Xen.org mailing list, which we have preserved to ensure that existing links to archives are not broken. The live archive, which contains the latest emails, can be found at http://lists.xen.org/
   
 
 

xen-devel


To: Jeremy Fitzhardinge <jeremy@xxxxxxxx>
Subject: RE: [Xen-devel] rdtscP and xen (and maybe the app-tsc answer I've been looking for)
From: Dan Magenheimer <dan.magenheimer@xxxxxxxxxx>
Date: Mon, 21 Sep 2009 15:20:25 -0700 (PDT)
Cc: kurt.hackel@xxxxxxxxxx, "Xen-Devel \(E-mail\)" <xen-devel@xxxxxxxxxxxxxxxxxxx>, Keir Fraser <keir.fraser@xxxxxxxxxxxxx>, Jan Beulich <JBeulich@xxxxxxxxxx>
Delivery-date: Mon, 21 Sep 2009 15:21:00 -0700
Envelope-to: www-data@xxxxxxxxxxxxxxxxxxx
In-reply-to: <4AB7C79B.50709@xxxxxxxx>
Sender: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
> > However, I do need one special case to indicate
> > emulation vs non-emulation, so wraparound is
> > still a problem.
> 
> I was assuming you'd just repurpose the existing version number scheme
> which is always even, and therefore can never equal -1.

That wasn't my plan but if it can be made to work (see
below), it probably saves code in Xen.
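
A minimal sketch of the version-number point, assuming the usual pvclock convention that the writer makes the version odd while an update is in progress and even once it is complete. Under that assumption a stable version is never all ones, so -1 stays free as the "emulated" marker. The struct layout, PV_AUX_EMULATED, and the function names are hypothetical and not part of any Xen ABI.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical marker: all ones is odd, so no stable (even) version equals it. */
#define PV_AUX_EMULATED UINT32_MAX

/* Hypothetical shared record; the field names are illustrative only. */
struct pv_shared {
    _Atomic uint32_t version;        /* odd while the writer is mid-update */
    _Atomic uint64_t tsc_timestamp;
    _Atomic uint64_t system_time_ns;
};

struct pv_snapshot {
    uint32_t version;
    uint64_t tsc_timestamp;
    uint64_t system_time_ns;
};

static bool pv_aux_is_emulated(uint32_t aux)
{
    return aux == PV_AUX_EMULATED;
}

/* Seqlock-style read: retry while the version is odd or changes under us.
 * Gives up after max_tries attempts and returns false. */
static bool pv_read_snapshot(struct pv_shared *s, struct pv_snapshot *out,
                             int max_tries)
{
    for (int i = 0; i < max_tries; i++) {
        uint32_t v1 = atomic_load_explicit(&s->version, memory_order_acquire);
        if (v1 & 1u)
            continue;                /* update in progress */
        uint64_t ts = atomic_load_explicit(&s->tsc_timestamp,
                                           memory_order_relaxed);
        uint64_t st = atomic_load_explicit(&s->system_time_ns,
                                           memory_order_relaxed);
        atomic_thread_fence(memory_order_acquire);
        uint32_t v2 = atomic_load_explicit(&s->version, memory_order_relaxed);
        if (v1 == v2) {
            out->version = v1;
            out->tsc_timestamp = ts;
            out->system_time_ns = st;
            return true;
        }
    }
    return false;
}
```

The bounded retry count is only there so that the sketch cannot spin forever; a real reader would normally just loop.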

> What's the full algorithm for detecting this feature?  Usermode has to
> establish:
> 
>    1. It is running under Xen (or not, if you expect this to be
>       implemented on multiple hypervisors)
>    2. rdtscp is available
>    3. the ABI is actually being implemented, ie:
>          1. the tsc_aux value actually has the correct meaning
>          2. it has a working mechanism for getting the tsc scaling
>             parameters
>          3. (accommodate ways to evolve the ABI in a back-compatible way)
> before it can do anything else.

Yes, that's what I was thinking.  I was planning on prototyping
these checks with "userland-rdmsr" but userland-hypercall or
userland-shared-page could work also.
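
The three checks in the quoted list could be ordered roughly as below. This is a sketch under assumptions: the caller has already executed cpuid and passes in the register values, the Xen signature is the "XenVMMXenVMM" string from the hypervisor leaf, rdtscp is advertised in leaf 0x80000001 EDX bit 27, and abi_probe_ok stands in for whichever ABI probe is eventually chosen (rdmsr, hypercall, or shared page), which is not defined anywhere yet. The enum and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* "XenVMMXenVMM" as returned in ebx/ecx/edx by Xen's hypervisor cpuid leaf. */
#define XEN_SIG_EBX 0x566e6558u  /* "XenV" */
#define XEN_SIG_ECX 0x65584d4du  /* "MMXe" */
#define XEN_SIG_EDX 0x4d4d566eu  /* "nVMM" */

/* cpuid leaf 0x80000001, EDX bit 27 advertises rdtscp. */
#define CPUID_EXT_EDX_RDTSCP (1u << 27)

enum pv_detect {
    PV_DETECT_OK,
    PV_DETECT_NOT_XEN,     /* step 1 failed */
    PV_DETECT_NO_RDTSCP,   /* step 2 failed */
    PV_DETECT_NO_ABI       /* step 3 failed */
};

/* abi_probe_ok is the result of a hypothetical, not-yet-specified probe. */
static enum pv_detect pv_detect(uint32_t hv_ebx, uint32_t hv_ecx,
                                uint32_t hv_edx, uint32_t ext_edx,
                                bool abi_probe_ok)
{
    if (hv_ebx != XEN_SIG_EBX || hv_ecx != XEN_SIG_ECX ||
        hv_edx != XEN_SIG_EDX)
        return PV_DETECT_NOT_XEN;
    if (!(ext_edx & CPUID_EXT_EDX_RDTSCP))
        return PV_DETECT_NO_RDTSCP;
    if (!abi_probe_ok)
        return PV_DETECT_NO_ABI;
    return PV_DETECT_OK;
}
```

If the rdtscp flag ends up hidden from the guest cpuid, step 2 would have to be replaced by whatever other detection mechanism is used instead.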

> If nothing else, it's probably worth removing the rdtscp feature from the
> logical guest cpuid, so that nothing else tries to use it for its own
> purposes; in other words, you're exclusively claiming rdtscp for this
> ABI.  Or you could disable this ABI if a guest kernel tries to set
> TSC_AUX.

I was thinking that setting pvrdtscp=1 would override
any kernel use of rdtscp/TSC_AUX, but disabling the
cpuid has_rdtscp flag and using a different userland
detection mechanism (than checking cpuid for has_rdtscp)
would be a better way to avoid possible conflict.

> > I've restricted the scheme to constant_tsc as I think
> > it breaks down due to nasty races if running on a
> > machine where the pvclock parameters differ across
> > different pcpus.  I think the races can only be
> > avoided if Xen sets the TSC_AUX for all of the
> > pcpus running a pvrdtscp domain while all are idle.
> >
> > Is there a scheme that avoids the races? 
> 
> rdtscp makes it quite easy to avoid races because you get the tsc and
> metadata about the tsc atomically.  You just need to encode enough info
> in the metadata to do the conversion.

Yes, but I don't think there are enough bits to encode
it all (32 bits in TSC_AUX, right?).
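
To make the size concern concrete, here is a sketch of a pvclock-style conversion, assuming parameters shaped like those in Xen's vcpu_time_info (a 64-bit tsc timestamp, a 64-bit system time, a 32-bit multiplier and an 8-bit shift). That is well over 32 bits of state, so it cannot all ride in TSC_AUX. The struct and function names here are hypothetical.

```c
#include <stdint.h>

/* Hypothetical parameter record, modelled on the pvclock fields. */
struct pv_params {
    uint64_t tsc_timestamp;   /* tsc value when the record was written */
    uint64_t system_time_ns;  /* system time in ns at tsc_timestamp */
    uint32_t mul;             /* 32-bit fixed-point fraction (mul / 2^32) */
    int8_t   shift;           /* shift applied before the multiply */
};

/* ns = system_time + (((tsc - tsc_timestamp) << shift) * mul) >> 32 */
static uint64_t pv_tsc_to_ns(const struct pv_params *p, uint64_t tsc)
{
    uint64_t delta = tsc - p->tsc_timestamp;

    if (p->shift >= 0)
        delta <<= p->shift;
    else
        delta >>= -p->shift;

    /* Split the 64x32 multiply so the intermediate product is not lost:
     * (hi * 2^32 + lo) * mul >> 32 == hi * mul + ((lo * mul) >> 32). */
    uint64_t hi = delta >> 32;
    uint64_t lo = delta & 0xffffffffu;

    return p->system_time_ns + hi * p->mul + ((lo * p->mul) >> 32);
}
```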

> The obvious thing to do is to pack a version number and pcpu number into
> TSC_AUX.  Usermode would maintain an array of pv_clock parameters, one
> for each pcpu.  If the version number matches, then it uses the
> parameters it has; if not it fetches new parameters and repeats the
> rdtscp.  There's no need to worry about either thread or vcpu context
> switches because you get the (tsc,params) tuple atomically, which is the
> tricky bit without rdtscp.
> 
> (The version number would be truncated wrt the normal pvclock version
> number, but it just needs to be large enough to avoid aliasing from
> wrapping; I'm assuming something like 24 bits version and 8 bits cpu
> number.)

I think a race occurs if the vcpu switches pcpu TWICE,
from pcpu-A to pcpu-B and back to pcpu-A, and does rdtscp
each time on pcpu-A but reads one or more pvclock parameters
(that are too big to be encoded in TSC_AUX) on pcpu-B.
If Xen can atomically bump/change TSC_AUX on *all* pcpus
running a guest vcpu, the race can be avoided.  But I suspect
that is too expensive (some kind of rendezvous required for
each bump on any processor).
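
For reference, the packing suggested in the quoted text (24 bits of version, 8 bits of cpu number) and the per-pcpu cache lookup might look like the sketch below. All names are hypothetical, and the code only shows the encoding and the match test; it does not settle whether the double-migration case above is a real race. A NULL result means "fetch fresh parameters for that pcpu and repeat the rdtscp".

```c
#include <stddef.h>
#include <stdint.h>

#define PV_CPU_BITS 8u
#define PV_NR_PCPUS (1u << PV_CPU_BITS)
#define PV_VER_MASK 0x00ffffffu          /* version truncated to 24 bits */

static uint32_t pv_aux_pack(uint32_t version, uint32_t pcpu)
{
    return ((version & PV_VER_MASK) << PV_CPU_BITS) |
           (pcpu & (PV_NR_PCPUS - 1u));
}

static uint32_t pv_aux_version(uint32_t aux)
{
    return aux >> PV_CPU_BITS;
}

static uint32_t pv_aux_pcpu(uint32_t aux)
{
    return aux & (PV_NR_PCPUS - 1u);
}

/* One cache slot per pcpu; the scaling parameters would live here too. */
struct pv_slot {
    uint32_t version;   /* full version the cached parameters belong to */
    int      valid;     /* zero until the slot has been filled once */
};

/* Returns the cached slot if it matches the aux value read by rdtscp,
 * or NULL if the caller must refetch and retry. cache must have
 * PV_NR_PCPUS entries. */
static const struct pv_slot *pv_lookup(const struct pv_slot *cache,
                                       uint32_t aux)
{
    const struct pv_slot *s = &cache[pv_aux_pcpu(aux)];

    if (s->valid && (s->version & PV_VER_MASK) == pv_aux_version(aux))
        return s;
    return NULL;
}
```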

> > Fortunately, this also has the effect of greatly
> > reducing the version increase frequency.
> 
> I don't think that's going to be a huge issue; fetching time parameters
> with a syscall/hypercall would be on the same order as doing an emulated
> rdtsc, and would only need to happen, say, once per timeslice (100Hz?)
> at the outside.

Even if my assumption of the race (above) is incorrect,
32 bits is not very much time at 100Hz.  But the version
bump needs to occur synchronously with every P/C-state
transition for pvclock to work on non_constant_tsc machines,
doesn't it?  How frequently can those transitions occur?
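
The wraparound arithmetic can be checked with a small helper. This is only a back-of-envelope sketch: the function name is hypothetical, and the step parameter is an assumption to cover a version that advances by 2 per update (as an always-even version would).

```c
#include <stdint.h>

/* Whole seconds until a counter of `bits` bits wraps, if it advances by
 * `step` on each bump and is bumped `hz` times per second.
 * Requires step > 0 and hz > 0. */
static uint64_t pv_wrap_seconds(unsigned bits, uint64_t step, uint64_t hz)
{
    uint64_t span = (bits >= 64) ? UINT64_MAX : ((uint64_t)1 << bits);

    return span / (step * hz);
}
```

At 100 bumps per second with a step of 1, a full 32-bit counter lasts about 42.9 million seconds (roughly 497 days), while a 24-bit version field lasts about 167,772 seconds (under two days). A step of 2 halves both figures, and a higher bump rate from frequent P/C-state transitions shortens them proportionally.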
 
> > The rate is synced but the values may not be.  Since
> > software (BIOS or Xen) sets tsc on each processor
> > it is essentially impossible to ensure they are
> > identical.  The rendezvous algorithm should be able
> > to set them so that they are "unobservably" different,
> > but I keep hearing "within 2usec".  (It would be
> > interesting to measure this across a broad set
> > of machines.)  So it's probably prudent to recommend
> > that apps be prepared for the possibility even if
> > it never happens.
> 
> You don't need to guarantee anything stronger than they'd see on bare
> hardware.  You also need to be more precise about exactly what you're
> guaranteeing.
> 
> Are you saying that a single thread will never see regressing tscs?
> That just requires making sure that Xen gets the tscs synced closer than
> the context switch time of a thread between cpus, which should be
> possible.
> 
> Or are you making the stronger guarantee that two threads running
> concurrently on different cpus doing rdtsc will see monotonically
> increasing tscs with respect to the ordering of all their operations? 
> That would require arbitrarily close syncing (well, within the time it
> takes a cacheline to bounce I guess).

I guess this all depends on what Xen is capable of
guaranteeing.  If Xen can provide a "cacheline
bounce guarantee", the app shouldn't have to care.

Linux now seems to provide a cacheline bounce guarantee for
itself, but afaik has no way to communicate that to an app
using raw rdtsc{,p} and all the relevant syscalls have a
monotonicity option and/or have insufficient resolution
to matter.
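
If an app does want to be prepared for small cross-cpu tsc differences, one option is to clamp readings to the largest value seen so far. The sketch below is an assumption about how an app might do that, not something Xen or Linux provides; the names are hypothetical and the raw value would come from rdtsc or rdtscp.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Largest value handed out so far, shared by all threads. */
static _Atomic uint64_t pv_last_seen;

/* Returns raw if it is newer than anything seen so far, otherwise the
 * previous maximum, so callers never observe time going backwards. */
static uint64_t pv_monotonic(uint64_t raw)
{
    uint64_t last = atomic_load(&pv_last_seen);

    while (raw > last) {
        /* On failure, last is reloaded with the current maximum. */
        if (atomic_compare_exchange_weak(&pv_last_seen, &last, raw))
            return raw;
    }
    return last;
}
```

The cost is that every reading touches one shared cacheline, which is exactly the kind of bounce the raw instruction avoids, so this only makes sense where strict ordering matters more than speed.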
