xen-devel

Re: [Xen-devel] rdtscP and xen (and maybe the app-tsc answer I've been looking for)

To: Dan Magenheimer <dan.magenheimer@xxxxxxxxxx>
Subject: Re: [Xen-devel] rdtscP and xen (and maybe the app-tsc answer I've been looking for)
From: Jeremy Fitzhardinge <jeremy@xxxxxxxx>
Date: Mon, 21 Sep 2009 15:50:29 -0700
Cc: kurt.hackel@xxxxxxxxxx, "Xen-Devel \(E-mail\)" <xen-devel@xxxxxxxxxxxxxxxxxxx>, Keir Fraser <keir.fraser@xxxxxxxxxxxxx>, Jan Beulich <JBeulich@xxxxxxxxxx>
Delivery-date: Mon, 21 Sep 2009 15:51:02 -0700
Envelope-to: www-data@xxxxxxxxxxxxxxxxxxx
In-reply-to: <ca1f4760-ac59-4385-8daa-0e1dc2cb3c07@default>
List-help: <mailto:xen-devel-request@lists.xensource.com?subject=help>
List-id: Xen developer discussion <xen-devel.lists.xensource.com>
List-post: <mailto:xen-devel@lists.xensource.com>
List-subscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
List-unsubscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
References: <ca1f4760-ac59-4385-8daa-0e1dc2cb3c07@default>
Sender: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.1) Gecko/20090814 Fedora/3.0-2.6.b3.fc11 Lightning/1.0pre Thunderbird/3.0b3
On 09/21/09 15:20, Dan Magenheimer wrote:
>>> However, I do need one special case to indicate
>>> emulation vs non-emulation, so wraparound is
>>> still a problem.
>>>       
>> I was assuming you'd just repurpose the existing version number scheme
>> which is always even, and therefore can never equal -1.
>>     
> That wasn't my plan but if it can be made to work (see
> below), it probably saves code in Xen.
>
>   
>> What's the full algorithm for detecting this feature?  Usermode has to
>> establish:
>>
>>    1. It is running under Xen (or not, if you expect this to be
>>       implemented on multiple hypervisors)
>>    2. rdtscp is available
>>    3. the ABI is actually being implemented, ie:
>>          1. the tsc_aux value actually has the correct meaning
>>          2. it has a working mechanism for getting the tsc scaling
>>             parameters
>>          3. (accommodate ways to evolve the ABI in a 
>> back-compatible way)
>> before it can do anything else.
>>     
> Yes, that's what I was thinking.  I was planning on prototyping
> these checks with "userland-rdmsr" but userland-hypercall or
> userland-shared-page could work also.
>
>   
>> If nothing else, it's probably worth removing the rdtscp 
>> feature from the
>> logical guest cpuid, so that nothing else tries to use it for its own
>> purposes; in other words, you're exclusively claiming rdtscp for this
>> ABI.  Or you could disable this ABI if a guest kernel tries 
>> to set TSC_AUX.
>>     
> I was thinking that setting pvrdtscp=1 would override
> any kernel use of rdtscp/TSC_AUX, but disabling the
> cpuid has_rdtscp flag and using a different userland
> detection mechanism (than checking cpuid for has_rdtscp)
> would be a better way to avoid possible conflict.
>
>   
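
A rough sketch of the detection sequence being discussed, for illustration only.  The Xen CPUID signature leaf (0x40000000) is the standard way to establish check 1; checks 2 and 3 are deliberately left as a stub, since, as noted above, the rdtscp CPUID bit may well be masked from the guest and the pvrdtscp ABI advertised through some other mechanism (rdmsr, hypercall, or shared page) that this thread has not yet settled on.

/* Illustrative sketch only -- not part of any proposed patches. */
#include <stdint.h>
#include <string.h>

static inline void cpuid(uint32_t leaf, uint32_t *a, uint32_t *b,
                         uint32_t *c, uint32_t *d)
{
    asm volatile("cpuid"
                 : "=a" (*a), "=b" (*b), "=c" (*c), "=d" (*d)
                 : "a" (leaf), "c" (0));
}

/* Check 1: the hypervisor CPUID leaves start at 0x40000000 and Xen
 * answers with the "XenVMMXenVMM" signature in ebx:ecx:edx. */
static int running_on_xen(void)
{
    uint32_t eax, sig[3];
    cpuid(0x40000000, &eax, &sig[0], &sig[1], &sig[2]);
    return memcmp(sig, "XenVMMXenVMM", 12) == 0;
}

/* Checks 2 and 3: whether rdtscp is usable and the pvrdtscp ABI is
 * actually implemented.  The mechanism (userland-rdmsr, hypercall,
 * shared page) is exactly what is being discussed, so this is a stub. */
static int pvrdtscp_abi_present(void)
{
    return 0;
}

int pvrdtscp_usable(void)
{
    return running_on_xen() && pvrdtscp_abi_present();
}
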
>>> I've restricted the scheme to constant_tsc as I think
>>> it breaks down due to nasty races if running on a
>>> machine where the pvclock parameters differ across
>>> different pcpus.  I think the races can only be
>>> avoided if Xen sets the TSC_AUX for all of the
>>> pcpus running a pvrdtscp domain while all are idle.
>>>
>>> Is there a scheme that avoids the races? 
>>>       
>> rdtscp makes it quite easy to avoid races because you get the tsc and
>> metadata about the tsc atomically.  You just need to encode 
>> enough info
>> in the metadata to do the conversion.
>>     
> Yes, but I don't think there are enough bits for encoding
> it all (32-bits in TSC_AUX, right?).
>
>   
>> The obvious thing to do is to pack a version number and pcpu 
>> number into
>> TSC_AUX.  Usermode would maintain an array of pv_clock parameters, one
>> for each pcpu.  If the version number matches, then it uses the
>> parameters it has; if not it fetches new parameters and repeats the
>> rdtscp.  There's no need to worry about either thread or vcpu context
>> switches because you get the (tsc,params) tuple atomically, 
>> which is the
>> tricky bit without rdtscp.
>>
>> (The version number would be truncated wrt the normal pvclock version
>> number, but it just needs to be large enough to avoid aliasing from
>> wrapping; I'm assuming something like 24 bits version and 8 bits cpu
>> number.)
>>     
> I think a race occurs if the vcpu switches pcpu TWICE
> from pcpu-A to pcpu-B and back to pcpu-A and does rdtscp
> each time on pcpu-A but reads one or more pvclock parameters
> (that are too big to be encoded in TSC_AUX) on pcpu-B.
>   

That shouldn't matter.  Once the process has the (tsc, cpu, version)
triple it can use its own local copy of that cpu's pvclock parameters to
compute the tsc->ns conversion.  After that point it doesn't matter if
it gets context-switched; the time computation doesn't depend on which
CPU is currently running.

It only needs to iterate if it gets a version mismatch.  You can
potentially get a livelock if the version is constantly changing between
the rdtscp and the get-pvclock-params, exacerbated if the process keeps
bouncing between cpus in between.  But given that the
rdtscp+get-params sequence should take no more than a couple of
microseconds, it seems very unlikely the process is sustaining a
megahertz CPU migration rate.

And even if it fails, the process always has to be prepared to go to
some other time source.
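
To make that concrete, here is a minimal sketch of the usermode fast path: rdtscp hands back (tsc, TSC_AUX) atomically, TSC_AUX is assumed to pack an 8-bit pcpu number in the low bits with a truncated 24-bit parameter version above it (the exact layout is an assumption), and the per-pcpu pvclock-style parameters live in a local cache.  How the cache gets refreshed is still open, so refresh_params() is only a stub, and the loop falls back to clock_gettime() if it cannot make progress.

#include <stdint.h>
#include <time.h>

#define MAX_PCPUS 256

/* Local cache of pvclock-style scaling parameters, one entry per pcpu. */
struct pv_params {
    uint32_t version;            /* full version as fetched from Xen */
    uint64_t tsc_timestamp;
    uint64_t system_time;        /* ns at tsc_timestamp */
    uint32_t tsc_to_system_mul;
    int8_t   tsc_shift;
    int      valid;
};

static struct pv_params cache[MAX_PCPUS];

/* Fetch fresh parameters for 'cpu' from Xen; mechanism TBD (stub). */
static int refresh_params(unsigned int cpu)
{
    (void)cpu;
    return -1;
}

static inline uint64_t rdtscp(uint32_t *aux)
{
    uint32_t lo, hi;
    asm volatile("rdtscp" : "=a" (lo), "=d" (hi), "=c" (*aux));
    return ((uint64_t)hi << 32) | lo;
}

/* Standard pvclock scaling: (delta << shift) * mul >> 32. */
static uint64_t scale_delta(uint64_t delta, uint32_t mul, int8_t shift)
{
    if (shift >= 0)
        delta <<= shift;
    else
        delta >>= -shift;
    return (uint64_t)(((unsigned __int128)delta * mul) >> 32);
}

/* Returns system time in ns; falls back to another time source if the
 * parameters keep changing under us or the ABI is not there. */
uint64_t pvrdtscp_now(void)
{
    int tries;

    for (tries = 0; tries < 8; tries++) {
        uint32_t aux;
        uint64_t tsc = rdtscp(&aux);
        unsigned int cpu = aux & 0xff;       /* assumed layout */
        uint32_t ver = aux >> 8;             /* truncated 24-bit version */
        struct pv_params *p = &cache[cpu];

        if (p->valid && (p->version & 0xffffff) == ver)
            return p->system_time +
                   scale_delta(tsc - p->tsc_timestamp,
                               p->tsc_to_system_mul, p->tsc_shift);

        if (refresh_params(cpu) < 0)
            break;
    }

    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
    }
}

The retry bound is arbitrary; as said above, a real implementation always needs a fallback time source anyway.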

> If Xen can atomically bump/change
> TSC_AUX on *all* pcpus running a guest vcpu, the race
> can be avoided.  But I suspect that is too expensive (some
> kind of rendezvous required for each bump on any processor).
>   

Right.  Any synchronized cross-cpu call is going to be very expensive,
and it can't be done atomically without some kind of stop-the-world,
which is even worse.

> Even if my assumption of the race (above) is incorrect,
> 32-bits is not very much time at 100Hz.  But the version
> bump needs to occur synchronously with every P/C-state
> transition for pvclock to work on non_constant_tsc machines
> doesn't it?  How frequent can those transitions occur?
>   

24 bits at 100Hz is 46ish hours.  So there's a potential alias problem
if the program reads the tsc at precisely 46.603 (ish) hours after its
previous read.  One workaround would be to force a re-read of the timing
parameters every X secs/mins/hours to guarantee that there's no wrap for
some expected rate of param updates.
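
For what it's worth, the arithmetic and the forced-refresh workaround look roughly like this; the one-hour refresh period is an arbitrary illustrative choice, not a recommendation.

#include <stdint.h>
#include <time.h>

/* 2^24 versions consumed at 100 updates/sec:
 *   16777216 / 100 = 167772 s, i.e. roughly 46.6 hours to wrap. */
#define VERSION_WRAP_SECS   (16777216 / 100)
#define FORCED_REFRESH_SECS 3600    /* well inside the wrap window */

static time_t last_param_refresh;

/* Treat the cached parameters as stale after FORCED_REFRESH_SECS even
 * if the truncated version in TSC_AUX still happens to match. */
static int params_stale(void)
{
    time_t now = time(NULL);
    if (now - last_param_refresh > FORCED_REFRESH_SECS) {
        last_param_refresh = now;
        return 1;
    }
    return 0;
}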

That said, the standard pvclock algorithm is only 128 times better than
that, and I don't think it has ever been considered a problem.  I've
never seen an update rate of more than once every few seconds.

Also Xen need only update the version number if something has actually
read that version; if nobody has read the current parameters, there's no
need to update the version when updating them to a new value.  That
would help mitigate the case of rapid param updates and a low rate of
reading.
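
A sketch of that bump-only-if-read idea, in illustrative C rather than anything resembling actual Xen code; the structure and field names here are made up, and the odd/even in-progress convention of the real pvclock protocol is glossed over.

#include <stdint.h>
#include <stdbool.h>

struct pub_params {
    uint32_t version;
    bool     version_read;   /* set whenever a guest fetches the params */
    /* ... tsc_to_system_mul, tsc_shift, system_time, tsc_timestamp ... */
};

/* If nobody has fetched the currently published parameters, no cached
 * copy of them exists anywhere, so they can be replaced without bumping
 * the version; readers that come along later fetch the new values. */
static void publish_new_params(struct pub_params *p /*, new values */)
{
    if (p->version_read) {
        p->version++;
        p->version_read = false;
    }
    /* write the new scaling parameters into *p here */
}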

> I guess this all depends on what Xen is capable of
> guaranteeing.  If Xen can provide a "cacheline
> bounce guarantee", the app shouldn't have to care.
>   

It can't, in principle, sync the tscs at a finer grain than the app can
measure.  It only has the same resources to play with, and there'll
always be some error margin.

> Linux now seems to provide a cacheline bounce guarantee for
> itself, but afaik has no way to communicate that to an app
> using raw rdtsc{,p}, and all the relevant syscalls have a
> monotonicity option and/or have insufficient resolution
> to matter.
>   

It's a detail that a usermode app can't rely on anyway.

    J


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel
