
Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy

Hi Dan,

I am running with hpet=1 and timer_mode=2. I don't see where timer_mode is checked for
hpet timekeeping but I set it nevertheless.
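
For reference, the two settings mentioned above would look roughly like this in the HVM guest config. This is only a fragment, and it assumes the Python-style syntax of an xm guest config file; every other setting in the file is omitted.

```python
# Fragment of an HVM guest config (assumed xm-style, Python syntax).
hpet = 1        # expose a virtual hpet to the guest
timer_mode = 2  # the value used for this test run
```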


Dan Magenheimer wrote:

Hi Dave and Ben --
When running tests on xen-unstable (without your patch), please ensure that
hpet=1 is set in the hvm config. Also, I think that when hpet is the clocksource
on RHEL4-32, the clock IS resilient to missed ticks, so timer_mode should be 2
(whereas when pit is the clocksource on RHEL4-32, all clock ticks must be
delivered, so timer_mode should be 0). Per
http://lists.xensource.com/archives/html/xen-devel/2008-06/msg00098.html it's my
intent to clean this up, but I won't get to it until next week. Thanks,

    -----Original Message-----
    *From:* xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
    [mailto:xen-devel-bounces@xxxxxxxxxxxxxxxxxxx]*On Behalf Of *Dave
    *Sent:* Friday, June 06, 2008 4:46 AM
    *To:* Keir Fraser; Ben Guthro; xen-devel
    *Cc:* dan.magenheimer@xxxxxxxxxx; Dave Winchell
    *Subject:* RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy


    I think the changes are required. We'll run some tests today so
    that we have some data to talk about.


    -----Original Message-----
    From: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx on behalf of Keir Fraser
    Sent: Fri 6/6/2008 4:58 AM
    To: Ben Guthro; xen-devel
    Cc: dan.magenheimer@xxxxxxxxxx
    Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy

    Are these patches needed now the timers are built on Xen system time
    rather than host TSC? Dan has reported much better time-keeping with his
    patch checked in, and it's for sure a lot less invasive than this patchset.

     -- Keir

    On 5/6/08 15:59, "Ben Guthro" <bguthro@xxxxxxxxxxxxxxx> wrote:

    > 1. Introduction
    > This patch improves the hpet based guest clock in terms of drift and
    > monotonicity.
    > Prior to this work the drift with hpet was greater than 2%, far above
    > the .05% limit for ntp to synchronize. With this code, the drift ranges
    > from .001% to .0033% depending on guest and physical platform.
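
To put those percentages in more familiar units, the arithmetic below converts a drift percentage into parts per million and into seconds gained or lost per day. Only the percentages come from the text above; the helper names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def drift_ppm(percent):
    """Convert a drift given in percent to parts per million (1% = 10,000 ppm)."""
    return percent * 10_000

def seconds_per_day(percent):
    """Clock error accumulated over one day at the given drift percentage."""
    return percent / 100 * SECONDS_PER_DAY

if __name__ == "__main__":
    figures = [
        ("drift before the patch (lower bound)", 2.0),
        ("ntp synchronization limit", 0.05),
        ("best drift reported with the patch", 0.001),
    ]
    for label, pct in figures:
        print(f"{label}: {drift_ppm(pct):.0f} ppm, "
              f"{seconds_per_day(pct):.2f} s/day")
```

In these units the ntp limit of .05% is 500 ppm, or about 43 seconds per day, while a 2% drift is roughly 1728 seconds (almost half an hour) per day.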
    > Using hpet allows guest operating systems to provide monotonic time to
    > their applications. Time sources other than hpet are not monotonic
    > because of their reliance on tsc, which is not synchronized across
    > physical processors.
    > Windows 2k8-64 and many Linux guests are supported with two policies,
    > one for guests that handle missed clock interrupts and the other for
    > guests that require the correct number of interrupts.
    > Guests may use hpet for the timing source even if the physical platform
    > has no visible hpet. Migration is supported between physical machines
    > which differ in physical hpet visibility.
    > Most of the changes are in hpet.c. Two general facilities are added to
    > track interrupt progress. The ideas here and the facilities would be
    > useful in vpt.c, for other time sources, though no attempt is made here
    > to improve vpt.c.
    > The following sections discuss hpet dependencies, interrupt delivery
    > policies, live migration, test results, and relation to recent work
    > with monotonic time.
    > 2. Virtual Hpet dependencies
    > The virtual hpet depends on the ability to read the physical or
    > simulated (see discussion below) hpet. For timekeeping, the virtual
    > hpet also depends on two new interrupt notification facilities to
    > implement its policies for interrupt delivery.
    > 2.1. Two modes of low-level hpet main counter reads.
    > In this implementation, the virtual hpet reads, through a function
    > exported by time.c, either the real physical hpet main counter register
    > directly or a "simulated" hpet main counter.
    > The simulated mode uses a monotonic version of get_s_time() (NOW()),
    > where the last time value is returned whenever the current time value
    > is less than the last time value. In simulated mode, since it is
    > layered on s_time, the hardware can be hpet or some other device. The
    > frequency of the main counter in simulated mode is the same as the
    > standard physical hpet frequency, allowing live migration between nodes
    > that are configured differently.
    > If the physical platform does not have an hpet device, or if xen is
    > configured not to use the device, then the simulated method is used. If
    > there is a physical hpet device, and xen has initialized it, then
    > either simulated or physical mode can be used. This is governed by a
    > boot time option, hpet-avoid. Setting this option to 1 gives the
    > simulated mode and 0 the physical mode. The default is physical.
    > A disadvantage of the physical mode is that it may take longer to read
    > the device than in simulated mode. On some platforms the cost is about
    > the same (less than 250 nsec) for physical and simulated modes, while
    > on others physical cost is much higher than simulated.
    > A disadvantage of the simulated mode is that it can return the same
    > value for the counter in consecutive calls.
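
The simulated mode described in 2.1 can be modelled in a few lines. This is a minimal sketch and not the patch's code: the class name, the injected clock function and the 14.318180 MHz rate are assumptions chosen for illustration (the text only says the simulated counter runs at "the standard physical hpet frequency").

```python
import time

# Assumed tick rate for the simulated counter; not taken from the patch.
HPET_FREQ_HZ = 14_318_180


class SimulatedMainCounter:
    """Monotonic counter layered on a nanosecond system-time source."""

    def __init__(self, now_ns=time.monotonic_ns):
        self._now_ns = now_ns   # stand-in for get_s_time() / NOW()
        self._last_ns = 0

    def read(self):
        now = self._now_ns()
        # If the time source steps backwards, return the last value again.
        if now < self._last_ns:
            now = self._last_ns
        self._last_ns = now
        # Scale nanoseconds to counter ticks at the assumed frequency.
        return now * HPET_FREQ_HZ // 1_000_000_000
```

The clamp is also what produces the disadvantage noted above: when the underlying time steps backwards, or two reads fall within one tick, consecutive reads return the same value.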
    > 2.2. Interrupt notification facilities.
    > Two interrupt notification facilities are introduced, one is
    > hvm_isa_irq_assert_cb() and the other hvm_register_intr_en_notif().
    > The vhpet uses hvm_isa_irq_assert_cb() to deliver interrupts to the
    > vioapic. hvm_isa_irq_assert_cb() allows a callback to be passed along
    > to vioapic_deliver() and this callback is called with a mask of the
    > vcpus which will get the interrupt. This callback is made before any
    > vcpus receive an interrupt.
    > Vhpet uses hvm_register_intr_en_notif() to register a handler for a
    > particular vector that will be called when that vector is injected in
    > [vmx,svm]_intr_assist() and also when the guest finishes handling the
    > interrupt. Here finished is defined as the point when the guest
    > re-enables interrupts or lowers the tpr value. EOI is not used as the
    > end of interrupt as this is sometimes returned before the interrupt
    > handler has done its work. A flag is passed to the handler indicating
    > whether this is the injection point (post = 1) or the interrupt
    > finished (post = 0) point.
    > The need for the finished point callback is discussed in the missed
    > ticks policy section.
    > To prevent a possible early trigger of the finished callback, the logic
    > has a two stage arm, the first at injection (hvm_intr_en_notif_arm())
    > and the second when interrupts are seen to be disabled
    > (hvm_intr_en_notif_disarm()). Once fully armed, re-enabling interrupts
    > will cause hvm_intr_en_notif_disarm() to make the end of interrupt
    > callback. hvm_intr_en_notif_arm() and hvm_intr_en_notif_disarm() are
    > called by [vmx,svm]_intr_assist().
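
The two stage arm can be pictured as a small state machine. The sketch below is an illustrative model only: the class, its method names and the state names are hypothetical, and it reduces "re-enables interrupts or lowers the tpr value" to a single boolean. The point it shows is that the guest necessarily has interrupts enabled at injection time, so the finished callback must not fire until interrupts have first been observed disabled.

```python
IDLE, INJECTED, ARMED = range(3)


class IntrEnNotif:
    """Toy model of the injection / finished notification for one vector."""

    def __init__(self, handler):
        self.handler = handler   # called as handler(post=1) or handler(post=0)
        self.state = IDLE

    def arm(self):
        """First stage: the vector has just been injected."""
        self.state = INJECTED
        self.handler(post=1)

    def disarm(self, guest_irqs_enabled):
        """Polled on later passes through interrupt assist (assumed)."""
        if self.state == INJECTED and not guest_irqs_enabled:
            # Second stage: the guest is now inside its handler.
            self.state = ARMED
        elif self.state == ARMED and guest_irqs_enabled:
            # Interrupts re-enabled after being seen disabled: finished.
            self.state = IDLE
            self.handler(post=0)
```

With only a single stage, the first poll after injection would see interrupts still enabled and report the interrupt as finished before the guest handler had run.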
    > 3. Interrupt delivery policies
    > The existing hpet interrupt delivery is preserved. This includes
    > vcpu round robin delivery used by Linux and broadcast delivery used by
    > Windows.
    > There are two policies for interrupt delivery, one for Windows 2k8-64
    > and the other for Linux. The Linux policy takes advantage of the
    > (guest) Linux missed tick and offset calculations and does not attempt
    > to deliver the right number of interrupts. The Windows policy delivers
    > the correct number of interrupts, even if sometimes much closer to each
    > other than the period. The policies are similar to those in vpt.c,
    > though there are some important differences.
    > Policies are selected with an HVMOP_set_param hypercall with index
    > Two new values are added, HVM_HPET_guest_computes_missed_ticks and
    > HVM_HPET_guest_does_not_compute_missed_ticks. The reason that two new
    > ones are added is that in some guests (32bit Linux) a no-missed policy
    > is needed for clock sources other than hpet and a missed ticks policy
    > for hpet. It was felt that there would be less confusion by simply
    > introducing the two hpet policies.
    > 3.1. The missed ticks policy
    > The Linux clock interrupt handler for hpet calculates missed ticks and
    > offset using the hpet main counter. The algorithm works well when the
    > time since the last interrupt is greater than or equal to a period and
    > poorly otherwise.
    > The missed ticks policy ensures that no two clock interrupts are
    > delivered to the guest at a time interval less than a period. A time
    > stamp (hpet main counter value) is recorded (by a callback registered
    > with hvm_register_intr_en_notif) when Linux finishes handling the clock
    > interrupt. Then, ensuing interrupts are delivered to the vioapic only
    > if the current main counter value is at least a period greater than
    > when the last interrupt was handled.
    > Tests showed a significant improvement in clock drift with end of
    > interrupt time stamps versus beginning of interrupt[1]. It is believed
    > that the reason for the improvement is that the clock interrupt handler
    > goes for a spinlock and can therefore be delayed in its processing.
    > Furthermore, the main counter is read by the guest under the lock. The
    > net effect is that if we time stamp injection, we can get the
    > difference in time between successive interrupt handler lock
    > acquisitions to be less than the period.
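
The delivery rule in 3.1 amounts to a gate on the main counter. The following is a minimal sketch under stated assumptions: the class and method names are hypothetical, the first interrupt is assumed to be delivered unconditionally, and counter wrap-around is ignored.

```python
class MissedTicksGate:
    """Deliver a clock interrupt only if a full period has passed since
    the guest finished handling the previous one."""

    def __init__(self, period):
        self.period = period        # in main counter ticks
        self.last_handled = None    # time stamp taken at the post=0 callback

    def on_interrupt_finished(self, main_counter):
        """Finished-point (post = 0) callback: record the time stamp."""
        self.last_handled = main_counter

    def may_deliver(self, main_counter):
        """True if an interrupt may be passed on to the vioapic now."""
        if self.last_handled is None:
            return True
        return main_counter - self.last_handled >= self.period
```

For example, with a period of 100 ticks and a handler that finished at tick 30, a timer expiry at tick 100 is held back (only 70 ticks have passed), while one at tick 130 is delivered.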
    > 3.2. The no-missed ticks policy
    > Windows 2k8-64 keeps very poor time with the missed ticks policy. So
    > the no-missed ticks policy was developed. In the no-missed ticks policy
    > we deliver the correct number of interrupts, even if they are spaced
    > less than a period apart (when catching up).
    > Windows 2k8-64 uses a broadcast mode in the interrupt routing such that
    > all vcpus get the clock interrupt. The best Windows drift performance
    > was achieved when the policy code ensured that all the previous
    > interrupts (on the various vcpus) had been injected before injecting
    > the next interrupt to the vioapic.
    > The policy code works as follows. It uses the hvm_isa_irq_assert_cb()
    > to record the vcpus to be interrupted in h->hpet.pending_mask. Then, in
    > the callback registered with hvm_register_intr_en_notif() at post=1
    > time it clears the current vcpu in the pending_mask. When the
    > pending_mask is clear it decrements hpet.intr_pending_nr and if
    > intr_pending_nr is still
    > non-zero posts another interrupt to the ioapic with
    > Intr_pending_nr is incremented in
    > The missed ticks policy intr_en_notif callback also uses the
    > method. So even though
    > Linux does not broadcast its interrupts, the code could handle it if it
    > did. In this case the end of interrupt time stamp is made when the
    > pending_mask is clear.
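
The bookkeeping in 3.2 can be modelled as follows. This is a sketch, not the patch's code. The names pending_mask and intr_pending_nr come from the text; everything else is hypothetical. In particular, the archived text does not say where intr_pending_nr is incremented or which call posts the next interrupt, so the sketch assumes the counter is incremented on each timer expiry and uses a placeholder `_post()` method.

```python
class NoMissedTicks:
    """Toy model: every expired tick is eventually delivered, and the next
    interrupt is posted only after all target vcpus took the previous one."""

    def __init__(self, target_vcpus):
        self.target_mask = sum(1 << v for v in target_vcpus)
        self.pending_mask = 0      # vcpus that still have to be injected
        self.intr_pending_nr = 0   # ticks not yet fully delivered
        self.posted = 0            # interrupts posted so far (for inspection)

    def _post(self):
        # Placeholder for asserting the irq; the assert callback is what
        # would report the mask of vcpus about to be interrupted.
        self.pending_mask = self.target_mask
        self.posted += 1

    def on_timer_expiry(self):
        # Assumption: this is where intr_pending_nr is incremented.
        self.intr_pending_nr += 1
        if self.intr_pending_nr == 1:   # nothing in flight, post right away
            self._post()

    def on_injected(self, vcpu):
        """Injection-point (post = 1) callback on one vcpu."""
        bit = 1 << vcpu
        if not self.pending_mask & bit:
            return                      # not waiting on this vcpu
        self.pending_mask &= ~bit
        if self.pending_mask == 0:
            self.intr_pending_nr -= 1
            if self.intr_pending_nr:
                self._post()            # catch up with the next tick
```

If three ticks expire while a two-vcpu guest is descheduled, only one interrupt is posted at first; the second and third follow as soon as both vcpus have been injected, so the guest still sees three interrupts in total.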
    > 4. Live Migration
    > Live migration with hpet preserves the current offset of the guest
    > clock with respect to ntp. This is accomplished by migrating all of the
    > state in the h->hpet data structure in the usual way. The hp->mc_offset
    > is recalculated on the receiving node so that the guest sees a
    > continuous hpet main counter.
    > Code has been added to xc_domain_save.c to send a small message after
    > the domain context is sent. The content of the message is the physical
    > tsc timestamp, last_tsc, read just before the message is sent. When the
    > last_tsc message is received in xc_domain_restore.c, another physical
    > tsc timestamp, cur_tsc, is read. The two timestamps are loaded into the
    > domain structure as last_tsc_sender and first_tsc_receiver with
    > hypercalls. Then xc_domain_hvm_setcontext is called so that hpet_load
    > has access to these time stamps. Hpet_load uses the timestamps to
    > account for the time spent saving and loading the domain context. With
    > this technique, the only neglected time is the time spent sending a
    > small network message.
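
One way to picture the mc_offset recalculation is sketched below. This is a guess at the arithmetic, not the patch's code. The names last_tsc_sender and first_tsc_receiver come from the text; the function names, the remaining parameters, the 14.318180 MHz counter rate and the idea that each host's tsc delta is converted with that host's own tsc frequency are all assumptions. The two tsc values are never compared with each other, since they come from different machines.

```python
# Assumed main counter rate; not taken from the patch.
HPET_HZ = 14_318_180


def tsc_to_hpet_ticks(tsc_delta, tsc_hz):
    """Convert a tsc interval measured on one host into main counter ticks."""
    return tsc_delta * HPET_HZ // tsc_hz


def recompute_mc_offset(guest_mc_at_save, tsc_at_save, last_tsc_sender,
                        sender_tsc_hz, first_tsc_receiver, tsc_at_load,
                        receiver_tsc_hz, host_mc_at_load):
    """Return an offset such that host counter + offset continues the
    guest's main counter across the migration (hypothetical helper)."""
    # Time spent saving the context, measured entirely on the sender.
    save_ticks = tsc_to_hpet_ticks(last_tsc_sender - tsc_at_save,
                                   sender_tsc_hz)
    # Time spent loading the context, measured entirely on the receiver.
    load_ticks = tsc_to_hpet_ticks(tsc_at_load - first_tsc_receiver,
                                   receiver_tsc_hz)
    guest_mc_now = guest_mc_at_save + save_ticks + load_ticks
    return guest_mc_now - host_mc_at_load
```

The interval between reading last_tsc on the sender and cur_tsc on the receiver is not covered by either delta, which corresponds to the neglected network time mentioned above.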
    > 5. Test Results
    > Some recent test results are:
    > 5.1 Linux 4u664 and Windows 2k864 load test.
    >       Duration: 70 hours.
    >       Test date: 6/2/08
    >       Loads: usex -b48 on Linux; burn-in on Windows
    >       Guest vcpus: 8 for Linux; 2 for Windows
    >       Hardware: 8 physical cpu AMD
    >       Clock drift : Linux: .0012% Windows: .009%
    > 5.2 Linux 4u664, Linux 4u464 , and Windows 2k864 no-load test
    >       Duration: 23 hours.
    >       Test date: 6/3/08
    >       Loads: none
    >       Guest vcpus: 8 for each Linux; 2 for Windows
    >       Hardware: 4 physical cpu AMD
    >       Clock drift : Linux: .033% Windows: .019%
    > 6. Relation to recent work in xen-unstable
    > There is a similarity between hvm_get_guest_time() in xen-unstable and
    > read_64_main_counter() in this code. However, read_64_main_counter() is
    > more tuned to the needs of hpet.c. It has no "set" operation, only the
    > get. It isolates the mode, physical or simulated, in
    > read_64_main_counter() itself. It uses no vcpu or domain state as it is
    > a physical entity, in either mode. And it provides a real
    > physical mode for every read for those applications that desire
    > 7. Conclusion
    > The virtual hpet is improved by this patch in terms of accuracy and
    > monotonicity.
    > Tests performed to date verify this and more testing is under way.
    > 8. Future Work
    > Testing with Windows Vista will be performed soon. The reason for
    > accuracy variations on different platforms using the physical hpet
    > device will be
    > Additional overhead measurements on simulated vs physical hpet mode
    > will be made.
    > Footnotes:
    > 1. I don't recall the accuracy improvement with end of interrupt
    > stamping, but it was significant, perhaps better than two to one
    > improvement. It would be a very simple matter to re-measure the
    > improvement as the facility can call back at injection time as well.
    > Signed-off-by: Dave Winchell <dwinchell@xxxxxxxxxxxxxxx>
    > Signed-off-by: Ben Guthro <bguthro@xxxxxxxxxxxxxxx>
    > _______________________________________________
    > Xen-devel mailing list
    > Xen-devel@xxxxxxxxxxxxxxxxxxx
    > http://lists.xensource.com/xen-devel
