Re: [Xen-devel] [PATCH] per-cpu timer changes

On Tue, May 24, 2005 at 02:20:36AM +0100, Ian Pratt wrote:
> 
> Don,
> 
> This is looking good. To help other people review the patch, it might be
> a good idea to post some of the design discussion we had off list as I
> think the approach will be new to most people. (Perhaps put some of the
> text in a comment in the hypervisor interface).
> 
> As regards the time going backwards messages, if you're seeing small
> negative deltas, I'm not surprised -- you need to round to some precision
> as we won't be nanosecond accurate. Experience suggests we'll be good to
> within a few tens of ns with any kind of decent crystal. We could round to
> e.g. 512ns or 1024ns to make sure.
> 
> Best,
> Ian
> 
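
As a rough illustration of the rounding suggested above, here is a minimal
sketch. The function name round_to_precision is borrowed from the formula
later in this thread; the 1024 ns granule and the sample timestamps are
assumptions chosen only for the example.

```python
GRANULE_NS = 1024  # assumed granule; the thread suggests 512 ns or 1024 ns


def round_to_precision(t_ns, granule_ns=GRANULE_NS):
    """Round a non-negative nanosecond timestamp down to a power-of-two granule."""
    assert granule_ns & (granule_ns - 1) == 0, "granule must be a power of two"
    return t_ns & ~(granule_ns - 1)


# Two raw readings where the second is 30 ns "earlier" than the first.
first = round_to_precision(1_000_000_500)
second = round_to_precision(1_000_000_470)
# Both land in the same granule, so the small negative delta is hidden.
```

Readings that straddle a granule boundary would still differ by one granule,
so this shows the masking step only, not a monotonicity guarantee.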

I am including the email that we exchanged off-list.  I started to edit
it, but decided that something I thought unimportant might be vital to
others, so I am including all of it.

Time went backwards only occasionally, but when it did it was a BIG jump
backwards.  I tracked it down yesterday to a problem with doing 32-bit
arithmetic in Linux on the tsc values.  For some reason, every 5-20
minutes Xen seems to pause for about 5 seconds.  This causes the tsc to
wrap if only 32 bits are used, and the 'time went backwards' message is
printed.  I changed to use 64-bit tsc deltas and have been running since
yesterday afternoon without any 'time went backwards' messages.  I want
to do some more cleanup (remove my debugging code) and will post all my
changes to the list this afternoon.
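
To make the wrap concrete, here is a minimal sketch comparing a delta taken
from only the low 32 bits of the tsc with a full 64-bit delta. The 2 GHz
frequency, the starting tsc value and the helper names are assumptions for
illustration; this is not the patch code.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1


def delta32(now_tsc, last_tsc):
    """Tick delta using only the low 32 bits of each TSC reading."""
    return ((now_tsc & MASK32) - (last_tsc & MASK32)) & MASK32


def delta64(now_tsc, last_tsc):
    """Tick delta using the full 64-bit TSC readings."""
    return (now_tsc - last_tsc) & MASK64


TSC_HZ = 2_000_000_000            # assumed 2 GHz CPU
last = 123_456_789                # arbitrary TSC value at the last time record
at_2_0s = last + 4_000_000_000    # 2.0 s after the record
at_2_2s = last + 4_400_000_000    # 2.2 s after the record (past the 32-bit wrap)
at_5_0s = last + 10_000_000_000   # after a 5 s pause

readings = (at_2_0s, at_2_2s, at_5_0s)
short = [delta32(t, last) / TSC_HZ for t in readings]  # elapsed seconds, 32-bit
full = [delta64(t, last) / TSC_HZ for t in readings]   # elapsed seconds, 64-bit
```

With these assumed numbers the 32-bit delta wraps between the 2.0 s and 2.2 s
readings, so the second reading appears earlier than the first, while the
64-bit deltas keep increasing.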


----- Forwarded by Don Fry/Beaverton/IBM on 05/26/2005 09:29 AM -----

Bruce Jones/Beaverton/IBM wrote on 04/21/2005 09:07:26 AM:

> John, can you provide some additional technical guidance here?
>
> Ian, Keir: John is the implementor of our Linux changes for Summit
> and understands these issues better than anyone.
>
> I've added Don to the cc: list but he's on vacation this week and
> not reading email.
>
>  -- brucej
>
> Ian Pratt <Ian.Pratt@xxxxxxxxxxxx> wrote on 04/20/2005 05:42:47 PM:
>
> > > "Ian Pratt" <m+Ian.Pratt@xxxxxxxxxxxx> wrote on 04/20/2005 04:47:44 PM:
> > > > Please could Don write a paragraph explaining why cyclone timer support
> > > > is useful. Do summit systems have different frequency CPUs in the same
> > > > box?
> > > Bruce writes:
> > > I can write that paragraph myself.  IBM's high end xSeries systems are
> > > NUMA systems, each node is a separate machine with its own front side
> > > bus, I/O buses, etc...  The chipset provides a cache-coherent interconnect
> > > to allow them to be cabled together into one big system.
> >
> > OK, so even the FSB clocks come from different crystals.
>
> Yes, and the hardware intentionally skews their frequencies, for reasons
> only the chipset designers understand. :)
>
> > > We had a boatload of problems with Linux when we first shipped it, with
> > > time moving around forward and backward for applications.  The processors
> > > in the various nodes run at different frequencies and the on-processor
> > > timers do not run in sync.  We needed to modify Linux to use a system-wide
> > > timer.  Our chipset (code-named Cyclone) provides one; for newer systems
> > > Intel has defined the HPET that we can use.  We need to make similar
> > > changes to Xen.
> >
> > This needs some agreement on the design.
> >
> > My gut feeling is that it should still be possible for guests to use
> > the TSC to calculate the time offset relative to the published
> > Xen system time record (which is updated every couple of
> > seconds). The TSC calibration should be good enough to mean that
> > the relative drift over the period between records is tiny (and
> > errors can't accumulate beyond the period).
>
> My gut feeling is that your gut feeling is wrong.  We can't ever
> use the TSC on these systems - even a tiny amount of relative drift
> causes problems.
>
> But I'm no expert.  John, this is your cue.  Please join in.
>
> > The 'TSC when time record created' and 'TSC frequency' will have
> > to be per VCPU and updated to reflect the real CPU that the VCPU
> > is running on.
>
> As long as these are virtual and not read using the readTSC instruction,
> we may be OK.
>
> >
> > Ian
> >
> >
> >

----- Forwarded by Don Fry/Beaverton/IBM on 05/26/2005 09:29 AM -----

"Ian Pratt" <m+Ian.Pratt@xxxxxxxxxxxx> wrote on 04/21/2005 09:24:54 AM: 

> > Yes, and the hardware intentionally skews their frequencies,
> > for reasons only the chipset designers understand. :)
>
> It's to be sneaky as regards FCC EMC emissions regulations.
>
> Some systems even modulate the PCI bus frequency.
>
> > > My gut feeling is that it should still be possible for guests to use
> > > the TSC to calculate the time offset relative to the published Xen
> > > system time record (which is updated every couple of seconds). The TSC
> > > calibration should be good enough to mean that the relative drift over
> > > the period between records is tiny (and errors can't accumulate beyond
> > > the period).
> >
> > My gut feeling is that your gut feeling is wrong.  We can't
> > ever use the TSC on these systems - even a tiny amount of
> > relative drift causes problems.
>
> It depends on the crystal stability, the accuracy with which the
> calibration is done, and the frequency of publishing new absolute time
> records.
>
> The latter can be made quite frequent if need be.
>
> I'd much prefer avoiding having to expose Linux to the HPET/cyclone by
> hiding it in Xen, and having the guest use TSC extrapolation from the
> time record published by Xen.
> We'd just need to update the current interface to have per-CPU records
> (and TSC frequency calibration).
>
> > But I'm no expert.  John, this is your cue.  Please join in.
> >
> > > The 'TSC when time record created' and 'TSC frequency' will have to be
> > > per VCPU and updated to reflect the real CPU that the VCPU is running
> > > on.
> >
> > As long as these are virtual and not read using the readTSC
> > instruction, we may be OK.
>
> Using readTSC should be fine, since we're only using it to extrapolate
> from the last Xen supplied time record, and we've calibrated the
> frequency of the particular CPU we're running on. We only have to worry
> about rapid clock drift due to sudden temperature changes etc, but even
> then we can just update the calibration frequency periodically. Using
> this approach we get to keep gettimeofday very fast, and avoid
> complicating the hypervisor API -- it's exactly what we need for
> migrating a domain between physical servers with different frequency
> CPUs.
>
> Ian
>
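
The extrapolation described above (last Xen-supplied time record plus scaled
TSC ticks since that record, using the calibration of the CPU we are running
on) might be sketched as follows. The TimeRecord type, its field names and the
sample values are hypothetical and are not part of any Xen interface.

```python
from dataclasses import dataclass


@dataclass
class TimeRecord:
    """Hypothetical per-CPU record published by the hypervisor."""
    system_time_ns: int   # system time when the record was made
    tsc_stamp: int        # this CPU's TSC at that same moment
    ns_per_tick: float    # calibration for this CPU only


def extrapolate_ns(record, tsc_now):
    """Recorded time plus the scaled TSC ticks elapsed since the record."""
    ticks = tsc_now - record.tsc_stamp
    return record.system_time_ns + round(ticks * record.ns_per_tick)


# Assumed values: a 2 GHz CPU (0.5 ns per tick), read 3,000,000 ticks
# after the record was published.
rec = TimeRecord(system_time_ns=10_000_000_000,
                 tsc_stamp=50_000_000_000,
                 ns_per_tick=0.5)
now_ns = extrapolate_ns(rec, 50_000_000_000 + 3_000_000)
```

Because the delta is always taken against the most recent record, any
calibration error can only accumulate over one update period.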

----- Forwarded by Don Fry/Beaverton/IBM on 05/26/2005 09:29 AM -----

"Ian Pratt" <m+Ian.Pratt@xxxxxxxxxxxx> wrote on 04/21/2005 01:12:51 PM: 

> > First, forgive my lack of knowledge about Xen. Since I don't
> > know the details of what you're proposing, let me make a
> > straw-man and let you correct my assumptions.
> >
> > Let's say you're proposing that time be calculated with the
> > following formula:
> >
> > timeofday = xen_time_base + rdtsc() - xen_last_tsc[CPUNUM]
> >
> > Given a guest domain with two cpus, the issue is managing
> > xen_last_tsc[] and xen_time_base. For the equation to work,
> > xen_last_tsc[0] must hold the TSC value from CPU0 at exactly
> > the time stored in xen_time_base. Additionally the same is
> > true with xen_last_tsc[1].
>
> I'm proposing:
>
> timeofday = round_to_precision( last_xen_time_base[VCPU] +
>             ( rdtsc() - last_xen_tsc[VCPU] ) * xen_tsc_calibrate[VCPU] )
>
> We update last_xen_time_base and last_xen_tsc on each CPU every second
> or so.
> xen_tsc_calibrate is calculated for each CPU at start of day. For
> completeness, we could recalculate the calibration every 30s or so to 
> cope with crystal temperature drift if we wanted ultimate precision.
>
> > The difficult question is how do you ensure that the two
> > values in xen_last_tsc[] are linked with the time in
> > xen_time_base? This requires reading the TSC on two cpus at
> > the exact same time. Additionally, this sync point must
> > happen frequently enough so that the continuing drift between
> > cpus isn't noticed.
>
> Nope, we would set the time_base on each CPU independently, but relative
> to the same timer.
> This could be the cyclone, HPET, or even the PIT if it's possible to read
> the same PIT from any node (though I'm guessing you probably have a PIT
> per node and can't read the remote one).
>
> > Then you'll have to weigh that solution against just using an
> > alternate global timesource like HPET/Cyclone.
>
> I'd prefer to avoid this as it would mean that there'd be a different 
> hypervisor API for guests on cyclone/hpet systems vs. normal synchronous
> CPU systems.
> Using the TSC will probably give a lower cost gettimeofday, we can also
> trap it and emulate if we want to lie to guests about the progress of
> time.
>
> Best,
> Ian
>
>
>
>
>
>
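
The point about setting the time base on each CPU independently, but relative
to the same timer, can be illustrated with a small simulation. Everything here
is an assumption made for the example: the two CPU frequencies, the TSC
offsets, the record layout and the helper names. The shared reference clock
stands in for the cyclone or HPET.

```python
def make_record(ref_ns, tsc_at_ref, tsc_hz):
    """Per-CPU record: shared-reference time, this CPU's TSC, ns per tick."""
    return {"base_ns": ref_ns, "tsc": tsc_at_ref, "ns_per_tick": 1e9 / tsc_hz}


def timeofday_ns(rec, tsc_now):
    """Extrapolate from this CPU's own record using this CPU's own TSC."""
    return rec["base_ns"] + round((tsc_now - rec["tsc"]) * rec["ns_per_tick"])


def tsc_at(ref_ns, tsc_hz, tsc_offset):
    """Simulated TSC value of one CPU at a given reference time."""
    return tsc_offset + ref_ns * tsc_hz // 1_000_000_000


# Two simulated CPUs: (frequency in Hz, unrelated TSC offset).
CPU0 = (2_000_000_000, 111)
CPU1 = (2_000_100_000, 999_999)

# Records are taken at different moments, each against the same reference.
rec0 = make_record(1_000_000_000, tsc_at(1_000_000_000, *CPU0), CPU0[0])
rec1 = make_record(1_250_000_000, tsc_at(1_250_000_000, *CPU1), CPU1[0])

# Both CPUs are asked for the time at the same reference instant.
t = 1_500_000_000
t0 = timeofday_ns(rec0, tsc_at(t, *CPU0))
t1 = timeofday_ns(rec1, tsc_at(t, *CPU1))
```

In this sketch the two TSCs never have to be read at the same instant; each
CPU only needs its own TSC paired with a reading of the common timer.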

----- Forwarded by Don Fry/Beaverton/IBM on 05/26/2005 09:29 AM -----

John Stultz/Beaverton/IBM wrote on 04/21/2005 01:49:54 PM:

> I'm just resending this with proper addresses as something got futzed up
> in the CC list on that last mail.
>
> "Ian Pratt" <m+Ian.Pratt@xxxxxxxxxxxx> wrote on 04/21/2005 01:12:51 PM:
>
> > > First, forgive my lack of knowledge about Xen. Since I don't
> > > know the details of what you're proposing, let me make a
> > > straw-man and let you correct my assumptions.
> > >
> > > Let's say you're proposing that time be calculated with the
> > > following formula:
> > >
> > > timeofday = xen_time_base + rdtsc() - xen_last_tsc[CPUNUM]
> > >
> > > Given a guest domain with two cpus, the issue is managing
> > > xen_last_tsc[] and xen_time_base. For the equation to work,
> > > xen_last_tsc[0] must hold the TSC value from CPU0 at exactly
> > > the time stored in xen_time_base. Additionally the same is
> > > true with xen_last_tsc[1].
> >
> > I'm proposing:
> >
> > timeofday = round_to_precision( last_xen_time_base[VCPU] +
> >             ( rdtsc() - last_xen_tsc[VCPU] ) * xen_tsc_calibrate[VCPU] )
> >
> > We update last_xen_time_base and last_xen_tsc on each CPU every second
> > or so.
>
> Or possibly more frequently, as on a 4GHz CPU the 32-bit TSC will wrap
> each second. Alternatively you could use the full 64 bits.
>
> > xen_tsc_calibrate is calculated for each CPU at start of day. For
> > completeness, we could recalculate the calibration every 30s or so to
> > cope with crystal temperature drift if we wanted ultimate precision.
> >
> > > The difficult question is how do you ensure that the two
> > > values in xen_last_tsc[] are linked with the time in
> > > xen_time_base? This requires reading the TSC on two cpus at
> > > the exact same time. Additionally, this sync point must
> > > happen frequently enough so that the continuing drift between
> > > cpus isn't noticed.
> >
> > Nope, we would set the time_base on each CPU independently, but relative
> > to the same timer.
>
> Hmmm. That sounds like it could work. Just be sure that preempt won't
> bite you in the timeofday calculation. The bit about still using the
> cyclone/HPET to sync the different xen_time_base[] values is the real key.
>
> > This could be the cyclone, HPET, or even the PIT if it's possible to read
> > the same PIT from any node (though I'm guessing you probably have a PIT
> > per node and can't read the remote one).
>
> The ioport space is unified by the BIOS so there is one global PIT shared
> by all nodes. Although as you'll need a continuous timesource that doesn't
> loop between xen_time_base updates, the PIT would not work.
>
> thanks
> -john
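
The wrap periods mentioned above can be checked with simple arithmetic. This
is a sketch under stated assumptions: the 4 GHz figure comes from the message,
while the nominal PIT input clock of 1,193,182 Hz and the 16-bit counter width
are the standard PC values rather than anything stated in the thread.

```python
def wrap_seconds(counter_bits, hz):
    """Seconds until a free-running counter of the given width wraps."""
    return (1 << counter_bits) / hz


PIT_HZ = 1_193_182  # nominal PC PIT input clock (assumed, not from the thread)

tsc32 = wrap_seconds(32, 4_000_000_000)  # low 32 bits of the TSC at 4 GHz
tsc64 = wrap_seconds(64, 4_000_000_000)  # full 64-bit TSC at 4 GHz
pit16 = wrap_seconds(16, PIT_HZ)         # 16-bit PIT counter
```

Under these assumptions the low 32 bits of the TSC wrap roughly once per
second, the PIT counter wraps many times per second, and the full 64-bit TSC
does not wrap on any practical timescale.
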
----- Forwarded by Don Fry/Beaverton/IBM on 05/26/2005 09:29 AM -----

"Ian Pratt" <m+Ian.Pratt@xxxxxxxxxxxx> wrote on 04/28/2005 07:08:05 PM:

>
> > First I apologize for not being involved in this email
> > exchange last week.
> > I am also just learning about Xen so my questions may be
> > obvious to others.
> >
> > What is the last_xen_time_base referred to in Ian's email? Is
> > this the stime_irq or wc_sec,wc_usec or something else?
>
> I was referring to the wc_ wall clock and system time values.
> We'll need to make these per VCPU, or perhaps slightly more cleanly,
> store an offset in ns for each VCPU.
>
> > When would the last_xen_tsc[VCPU] values be captured by Xen?
> > Currently, the tsc for cpu 0 is obtained during
> > timer_interrupt as full_tsc_irq.
>
> These just need to be captured periodically on each real CPU -- every
> couple of seconds would do it, though more frequently wouldn't hurt.
>
> > When updating the domain's shared_info structure mapping the
> > physical CPU to the domain's view of the CPU would need to be
> > done. For example if domain2 was running on CPU3 and CPU2 and
> > the domain's view was cpu0 and cpu1, the saved tsc value for
> > CPU3 would be copied to last_xen_tsc[0] and CPU2 to
> > last_xen_tsc[1] before sending the interrupt to the domain.
>
> Yep, this shouldn't be hard -- there's already some code to spot when
> they need to be updated.
>
> > From the last algorithm from Ian, I don't see anything that
> > refers to the Cyclone/HPET value directly. Is that because
> > Xen is the only thing that reads the Cyclone/HPET counter and
> > the domain just uses the TSC?
>
> Yep, we don't want to expose the cyclone/hpet to guests. There's no
> need, and it would have implications for migrating VMs between different
> systems.
>
> Strictly speaking, Xen wouldn't even need support for the hpet/cyclone
> as it could just use the shared PIT, though I have no objection to
> adding such support.
>
> Are you happy with this design? It's a little more work, but I believe
> better in the long run. We need to get the hypervisor interface change
> incorporated ASAP.
>
> Cheers,
> Ian
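
The physical-to-virtual CPU mapping discussed above might look like the
following sketch. The record contents, the dictionary layout and the name
publish_records are hypothetical; only the CPU3/CPU2 example comes from the
thread.

```python
def publish_records(per_cpu_records, vcpu_to_pcpu):
    """Build the guest-visible list of records, indexed by VCPU number."""
    return [per_cpu_records[pcpu] for pcpu in vcpu_to_pcpu]


# Hypothetical per-physical-CPU records: (system_time_ns, tsc_stamp).
per_cpu = {
    0: (5_000_000_000, 10_000),
    1: (5_000_000_100, 20_000),
    2: (5_000_000_200, 30_000),
    3: (5_000_000_300, 40_000),
}

# The example from the thread: the domain's cpu0 runs on CPU3, cpu1 on CPU2.
vcpu_to_pcpu = [3, 2]
shared = publish_records(per_cpu, vcpu_to_pcpu)
```

The guest then indexes the published list by its own VCPU number and never
needs to know which physical CPU it is on.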

----- Forwarded by Don Fry/Beaverton/IBM on 05/26/2005 09:29 AM -----

"Ian Pratt" <m+Ian.Pratt@xxxxxxxxxxxx> wrote on 04/30/2005 12:04:57 AM:

> > It sounds like the per-cpu changes should be sufficient.
> >
> > Having a time base and ns deltas for each CPU sounds
> > interesting, but wouldn't you have to do a subtraction to
> > generate the delta in Xen, and then add it back in, in the
> > domain? Just saving the per-cpu value would save the extra
> > add and subtract.
>
> Sure, but the add/subtract won't cost much, and it saves some space in
> the shared info page, which might be an issue if we have a lot of VCPUs.
>
> Not a big deal either way.
>
> > The bottom line is that it can all be done with the TSC,
> > without needing to use the Cyclone or HPET hardware, which
> > isn't available on all systems like the TSC.
>
> Great, we're in agreement. I think the first stage is just to do the per
> [V]CPU calibration and time vals. Could you work something up?
>
> Thanks,
> Ian


-- 
Don Fry
brazilnut@xxxxxxxxxx

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel