
Re: [Xen-devel] [PATCH RFC 0/6] x86/time: PVCLOCK_TSC_STABLE_BIT support




On 12/29/2015 05:37 PM, Joao Martins wrote:
> 
> 
> On 12/29/2015 02:58 PM, Andrew Cooper wrote:
>> On 28/12/2015 16:59, Joao Martins wrote:
>>> Hey!
>>>
>>> I've been working on pvclock vdso support in Linux guests, and came
>>> across Xen lacking support for the PVCLOCK_TSC_STABLE_BIT flag, which is
>>> required for the vdso as of the latest pvclock ABI shared with KVM.
>>
>> , and originally borrowed from Xen.
>>
>> Please be aware that all of this was originally the Xen ABI (c/s 
>> 1271b793, 2005) and was added to Linux (c/s 7af192c, 2008) for joint use 
>> with KVM.  In particular, Linux c/s 424c32f1a (which introduces 'flags') 
>> and every subsequent change in pvclock-abi.h violates the comment at the 
>> top reminding people that the hypervisors must be kept in sync.
>>
> OK, I was indeed aware that this came originally from Xen. Should I perhaps
> include a comment similar to what you state above?
> 
>> By the looks of things, the structures are still compatible, and having 
>> the two in sync is in everyone's best interest. The first steps here need 
>> to be Linux upstreaming their local modifications, and further effort 
>> made to ensure that ABI changes don't go unnoticed as far as Xen is 
>> concerned (an entry in the MAINTAINERS file with xen-devel listed?).
> Good point. Fortunately the changed part was the padding region (which was
> always zeroed out in update_vcpu_system_time()), so compatibility was kept.
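> 
> For reference, the Linux side of the shared structure looks roughly like this
> (a sketch from memory of pvclock-abi.h, so treat the exact layout as
> approximate); the byte Xen still carried as padding is what got repurposed as
> 'flags':
> 
>     struct pvclock_vcpu_time_info {
>             u32 version;            /* odd while an update is in progress */
>             u32 pad0;
>             u64 tsc_timestamp;      /* TSC value when the record was written */
>             u64 system_time;        /* system time (ns) at that same instant */
>             u32 tsc_to_system_mul;  /* fixed-point TSC-to-ns scale factor */
>             s8  tsc_shift;
>             u8  flags;              /* PVCLOCK_TSC_STABLE_BIT lives here */
>             u8  pad[2];
>     } __attribute__((__packed__));
> 
>     #define PVCLOCK_TSC_STABLE_BIT  (1 << 0)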
> 
>>
>>> In addition, I've found some problems which aren't necessarily visible
>>> to the kernel, as the pvclock algorithm in Linux keeps the highest pvti
>>> time read among all CPUs. But as is, a process using vdso gettimeofday
>>> observes a significant number of warps (i.e. time going backwards),
>>> which could be due to 1) time calibration skew in the Xen rendezvous
>>> algorithm, 2) the Xen clocksource not being in sync with the TSC, and
>>> 3) situations where the guest is unaware of vCPUs moving to a different
>>> pCPU. The warps are seen more frequently on PV guests (potentially
>>> because vcpu time infos are only updated when the guest is in kernel
>>> mode, and perhaps because of the lack of a tsc_offset?), and on rare
>>> occasions on HVM guests. It is worth noting that with guest vCPUs
>>> pinned, only PV guests see these warps. On HVM guests specifically,
>>> such warps only occur when one of the guest vCPUs is pinned to CPU0.
>>
>> These are all concerning findings (especially the pinning on cpu0). 
>> Which version of Xen have you been developing on?
>>
> When I started working on this it was based on Xen 4.5 (Linux <= 4.4), but
> for the last ~1 month I have been working on xen-unstable.
> 
>> Haozhong Zhang (CC'd) found and fixed several timing related bugs as 
>> part of his VMX TSC Scaling support series (Message root at 
>> <1449435529-12989-1-git-send-email-haozhong.zhang@xxxxxxxxx>). I would 
>> be surprised if your identified bugs and his identified bugs didn't at 
>> least partially overlap.  (Note that most of the series has yet to be 
>> applied).
> I am afraid this might be a different bug: I have been basing my patches on
> top of his work and also testing with a non-native vtsc_khz (as part of the
> testing requested by Boris). But I think these sorts of issues might not be
> visible in the kernel, since pvclock in Linux takes care of ensuring
> monotonicity even when the pvtis aren't guaranteed to be.
> 
>>
>> As to the specific options you identify, the time calibration rendezvous 
>> is (or should be) a nop on modern boxes with constant/invariant/etc. 
>> TSCs.  I have a low-priority TODO item to investigate conditionally 
>> disabling it, as it is a scalability concern on larger boxes.  Option 2 
>> seems more likely.
> Hm, from what I understood it's not a nop: it's just that there is a different
> rendezvous function depending on whether the TSC is safe or not. And I believe
> that to be where the problem lies. The tsc_timestamp written in the pvti that
> is used to calculate the delta on the guest takes into account the time at
> which the slave CPU rendezvous with the master, and thus could mark the TSC at
> a later instant depending on the CPU. This can lead to a situation such as
> CPU 3 writing an earlier tsc_timestamp than, say, CPU 2: for the same instant
> the guest then computes a larger delta (and hence a later time) on CPU 3, even
> though the TSC itself is guaranteed to be monotonic, so a read done on CPU 3
> right before one on CPU 2 sees time going backwards. Patch 5 expands on that,
> and I followed a similar approach to KVM's.
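> 
> To illustrate with made-up numbers (assuming the same tsc_to_system_mul and
> tsc_shift on both vCPUs, scaled to exactly 1 ns per TSC tick for simplicity):
> 
>     CPU2 pvti: system_time = 1000 ns, tsc_timestamp = 5000  (stamped later)
>     CPU3 pvti: system_time = 1000 ns, tsc_timestamp = 4900  (stamped earlier)
> 
>     read on CPU3 at TSC 6000: 1000 + (6000 - 4900) = 2100 ns
>     read on CPU2 at TSC 6010: 1000 + (6010 - 5000) = 2010 ns   <- ~90 ns warp
> 
> Both reads used a monotonically increasing TSC, yet the second result is
> smaller than the first purely because of the skew between the two
> tsc_timestamp values.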
> 
> But your suggestion of it basically being a nop could make things simpler (and
> perhaps cleaner?), and would eliminate potential scalability issues when the
> host has a large number of CPUs: on a system with a safe TSC, every epoch the
> master would read the platform time and "request" that each vCPU update its
> pvti the next time its schedule() or continue_running() is called? This way we
> could remove the CPUs spinning for the master to write stime and tsc_timestamp.
> 
>>
>> As for option 3, on a modern system it shouldn't make any difference.  
>> On an older system, it can presumably only be fixed by the guest 
>> performing its rdtsc between the two version checks, to ensure that it 
>> sees a consistent timestamp and scale, along with the hypervisor bumping 
>> the version on deschedule, and again on schedule. However, this might be 
>> contrary to the proposed plan to have one global pv wallclock.
>>
> Indeed, and the vdso gettimeofday/clock_gettime implementations already do
> that, i.e. version checks around the rdtsc/pvti reads.
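> 
> Roughly what that read path looks like (a simplified sketch, not the exact
> Linux code, and the helper names are approximate):
> 
>     do {
>             version = pvti->version;
>             smp_rmb();
>             tsc   = rdtsc_ordered();
>             delta = tsc - pvti->tsc_timestamp;
>             ns    = pvti->system_time +
>                     pvclock_scale_delta(delta, pvti->tsc_to_system_mul,
>                                         pvti->tsc_shift);
>             flags = pvti->flags;
>             smp_rmb();
>     } while ((version & 1) || version != pvti->version);
> 
> The version check only protects against reading a pvti that is concurrently
> being updated; it cannot by itself guarantee monotonicity across different
> vCPUs' pvtis, which is exactly what PVCLOCK_TSC_STABLE_BIT is meant to
> promise.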


>>>
>>> PVCLOCK_TSC_STABLE_BIT is the flag telling the guest that the
>>> vcpu_time_info (pvti) structures are monotonic as seen by any CPU,
>>> and supporting it could help fix the issues mentioned above.
>>
>> Surely fixing some of these bugs is a prerequisite to supporting 
>> TSC_STABLE_BIT?  Either way, we should see about doing both.
>>
> I think so too.
> 
>>>   This
>>> series aims to propose a solution to that and is divided as
>>> follows:
>>>
>>>     * Patch 1: Adds the missing flags field to vcpu_time_info.
>>>     * Patch 2: Adds a new clocksource based on the TSC.
>>>     * Patches 3, 4: Cleanups for patch 5.
>>>     * Patch 5: Implements PVCLOCK_TSC_STABLE_BIT.
>>>     * Patch 6: Same as patch 5 but for the other platform timers.
>>>
>>> I have some doubts about the correctness of Patch 6, but it was the only
>>> solution I found for supporting PVCLOCK_TSC_STABLE_BIT when using
>>> other clocksources (i.e. != tsc). The test run was time-warp-test,
>>> which constantly calls clock_gettime/gettimeofday on every CPU. It
>>> measures the delta with the previously returned value and marks a warp
>>> if it's negative. I measured over periods of 1h and 6h and checked how
>>> many warps occurred and their values (alongside the amount of time skew).
>>> Measurements are included in the individual patches.
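>>>
>>> The per-CPU check is essentially the following (a minimal sketch of the
>>> methodology, not the actual time-warp-test code; needs <stdatomic.h>,
>>> <stdint.h> and <time.h>):
>>>
>>>     static _Atomic uint64_t last_ns;    /* highest timestamp seen so far */
>>>
>>>     /* One such loop runs pinned on every CPU. */
>>>     uint64_t warps = 0, max_warp = 0;
>>>     for (;;) {
>>>             struct timespec ts;
>>>             clock_gettime(CLOCK_MONOTONIC, &ts);   /* vdso path under test */
>>>             uint64_t now  = ts.tv_sec * 1000000000ull + ts.tv_nsec;
>>>             uint64_t prev = atomic_load(&last_ns);
>>>             if (now < prev) {                      /* time went backwards */
>>>                     warps++;
>>>                     if (prev - now > max_warp)
>>>                             max_warp = prev - now; /* largest warp, in ns */
>>>             }
>>>             while (prev < now &&
>>>                    !atomic_compare_exchange_weak(&last_ns, &prev, now))
>>>                     ;                              /* publish new maximum */
>>>     }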
>>>
>>> Note that most of the testing has been done with Linux 4.4, in which
>>> these warps/skews could easily be observed as the vdso would use each
>>> vCPU's pvti. Linux 4.5 will change this behaviour and use only the
>>> vCPU0 pvti, though it will still require PVCLOCK_TSC_STABLE_BIT.
>>>
>>> I've been using it for a while over the past weeks with no issues so far,
>>> though it still requires testing on machines with a bad TSC. I would like
>>> to get folks' comments/suggestions on whether this is the correct approach
>>> to implementing this flag.
>>
>> If you have some test instructions, I can probably find some bad TSC 
>> machines amongst the selection of older hardware in the XenServer test pool.
> Thanks! I have a bad TSC machine for testing already, and these tests would be
> to make sure there aren't regressions, since with a bad TSC
> PVCLOCK_TSC_STABLE_BIT wouldn't be set. Most of my tests were on a
> single-socket Broadwell CPU.
Ping: or would you prefer that I resubmit addressing the comments so far? I
just didn't do so because I didn't get comments on the main parts of this
series (patches 2, 5 and 6). Thanks, and sorry for the noise!

Joao

>>
>> ~Andrew
>>

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

