 Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
 Are these patches needed now the timers are built on Xen system time rather than host TSC? Dan has reported much better time-keeping with his patch checked in, and it’s for sure a lot less invasive than this patchset.
 
 -- Keir
 
 On 5/6/08 15:59, "Ben Guthro" <bguthro@xxxxxxxxxxxxxxx> wrote:
 
 
 1. Introduction
 
This patch improves the hpet-based guest clock in terms of drift and monotonicity.
Prior to this work the drift with hpet was greater than 2%, far above the 0.05% limit
for ntp to synchronize. With this code, the drift ranges from 0.001% to 0.0033%,
depending on guest and physical platform.
 
 Using hpet allows guest operating systems to provide monotonic time to their
 applications. Time sources other than hpet are not monotonic because
 of their reliance on tsc, which is not synchronized across physical processors.
 
Windows 2k8-64 and many Linux guests are supported with two policies: one for guests
that handle missed clock interrupts and the other for guests that require the
correct number of interrupts.
 
 Guests may use hpet for the timing source even if the physical platform has no visible
 hpet. Migration is supported between physical machines which differ in physical
 hpet visibility.
 
 Most of the changes are in hpet.c. Two general facilities are added to track interrupt
 progress. The ideas here and the facilities would be useful in vpt.c, for other time
 sources, though no attempt is made here to improve vpt.c.
 
 The following sections discuss hpet dependencies, interrupt delivery policies, live migration,
 test results, and relation to recent work with monotonic time.
 
 
 2. Virtual Hpet dependencies
 
 The virtual hpet depends on the ability to read the physical or simulated
 (see discussion below) hpet.  For timekeeping, the virtual hpet also depends
 on two new interrupt notification facilities to implement its policies for
 interrupt delivery.
 
 2.1. Two modes of low-level hpet main counter reads.
 
In this implementation, the virtual hpet reads, via read_64_main_counter() exported by
time.c, either the real physical hpet main counter register directly or a "simulated"
hpet main counter.
 
 The simulated mode uses a monotonic version of get_s_time() (NOW()), where the last
 time value is returned whenever the current time value is less than the last time
 value. In simulated mode, since it is layered on s_time, the underlying hardware
 can be hpet or some other device. The frequency of the main counter in simulated
 mode is the same as the standard physical hpet frequency, allowing live migration
 between nodes that are configured differently.
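The monotonic wrapper described above can be sketched as follows. This is an illustrative stand-in, not the patch's code: the raw nanosecond input stands in for get_s_time()/NOW(), and the 14318180 Hz scale factor is an assumed value for the standard hpet frequency.

```c
#include <stdint.h>

/* Sketch of the "simulated" main counter: a monotonic wrapper over a raw
 * nanosecond clock, scaled to an assumed standard hpet frequency. If the
 * raw clock ever reads backwards, the last value is returned again. */

#define SIMULATED_HPET_HZ 14318180ULL   /* assumed standard hpet frequency */

static uint64_t last_s_time;            /* last value handed out */

/* Monotonic read of the underlying s_time source. */
static uint64_t monotonic_s_time(uint64_t raw_now_ns)
{
    if (raw_now_ns < last_s_time)
        raw_now_ns = last_s_time;       /* clamp: never go backwards */
    last_s_time = raw_now_ns;
    return raw_now_ns;
}

/* Convert monotonic nanoseconds to simulated hpet main counter ticks. */
uint64_t simulated_main_counter(uint64_t raw_now_ns)
{
    return monotonic_s_time(raw_now_ns) * SIMULATED_HPET_HZ / 1000000000ULL;
}
```

Because the counter frequency is fixed regardless of the underlying hardware, two hosts in simulated mode present the same tick rate, which is what makes migration between differently configured nodes safe.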
 
 If the physical platform does not have an hpet device, or if xen is configured not
 to use the device, then the simulated method is used. If there is a physical hpet device,
 and xen has initialized it, then either simulated or physical mode can be used.
 This is governed by a boot time option, hpet-avoid. Setting this option to 1 gives the
 simulated mode and 0 the physical mode. The default is physical mode.
 
A disadvantage of the physical mode is that it may take longer to read the device
than in simulated mode. On some platforms the cost is about the same (less than 250 nsec) for
physical and simulated modes, while on others the physical cost is much higher than simulated.
A disadvantage of the simulated mode is that it can return the same value
for the counter in consecutive calls.
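The mode selection above amounts to a simple dispatch, sketched below. The helper names and flag variables are stand-ins for illustration; only read_64_main_counter() and the hpet-avoid option come from the patch description.

```c
#include <stdint.h>

/* Sketch of the mode dispatch in read_64_main_counter(): the hpet-avoid
 * boot option, or a missing/uninitialized physical hpet, selects simulated
 * mode; otherwise the physical main counter register is read. The two
 * reader functions below are test stand-ins, not real device accesses. */

static int hpet_avoid;                  /* boot option: 1 = simulated */
static int physical_hpet_usable;        /* device present and initialized */

static uint64_t physical_mc, simulated_mc_val;   /* test stand-ins */
static uint64_t read_physical_hpet_mc(void) { return physical_mc; }
static uint64_t simulated_mc(void)          { return simulated_mc_val; }

uint64_t read_64_main_counter(void)
{
    if (!physical_hpet_usable || hpet_avoid)
        return simulated_mc();          /* simulated mode */
    return read_physical_hpet_mc();     /* physical mode (the default) */
}
```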
 
 2.2. Interrupt notification facilities.
 
Two interrupt notification facilities are introduced: one is hvm_isa_irq_assert_cb()
and the other hvm_register_intr_en_notif().
 
 The vhpet uses hvm_isa_irq_assert_cb to deliver interrupts to the vioapic.
 hvm_isa_irq_assert_cb allows a callback to be passed along to vioapic_deliver()
 and this callback is called with a mask of the vcpus which will get the
 interrupt. This callback is made before any vcpus receive an interrupt.
 
 Vhpet uses hvm_register_intr_en_notif() to register a handler for a particular
 vector that will be called when that vector is injected in [vmx,svm]_intr_assist()
 and also when the guest finishes handling the interrupt. Here finished is defined
 as the point when the guest re-enables interrupts or lowers the tpr value.
 EOI is not used as the end of interrupt as this is sometimes returned before
 the interrupt handler has done its work. A flag is passed to the handler indicating
 whether this is the injection point (post = 1) or the interrupt finished (post = 0) point.
 The need for the finished point callback is discussed in the missed ticks policy section.
 
 To prevent a possible early trigger of the finished callback, intr_en_notif logic
 has a two stage arm, the first at injection (hvm_intr_en_notif_arm()) and the second when
 interrupts are seen to be disabled (hvm_intr_en_notif_disarm()). Once fully armed, re-enabling
 interrupts will cause hvm_intr_en_notif_disarm() to make the end of interrupt
 callback. hvm_intr_en_notif_arm() and hvm_intr_en_notif_disarm() are called by
 [vmx,svm]_intr_assist().
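The two-stage arm can be pictured as a small state machine, sketched below. The function names mirror the patch, but the bodies and the interrupt-enable-flag parameter are illustrative assumptions; the real code lives in the [vmx,svm]_intr_assist() paths.

```c
#include <stddef.h>

/* Illustrative sketch of the two-stage intr_en_notif arming: the callback
 * fires with post=1 at injection, and with post=0 only after the vector has
 * been armed AND interrupts have been seen disabled, then re-enabled. */

typedef void (*intr_en_cb)(int post);

enum notif_state { NOTIF_IDLE, NOTIF_ARMED, NOTIF_FULLY_ARMED };

static enum notif_state state = NOTIF_IDLE;
static intr_en_cb registered_cb;

void hvm_register_intr_en_notif(intr_en_cb cb) { registered_cb = cb; }

/* Called at injection time from [vmx,svm]_intr_assist(). */
void hvm_intr_en_notif_arm(void)
{
    state = NOTIF_ARMED;
    if (registered_cb)
        registered_cb(1);               /* post = 1: injection point */
}

/* Called from [vmx,svm]_intr_assist() with the guest's interrupt-enable
 * state; completes the second stage, then fires the finished callback. */
void hvm_intr_en_notif_disarm(int guest_irqs_enabled)
{
    if (state == NOTIF_ARMED && !guest_irqs_enabled)
        state = NOTIF_FULLY_ARMED;      /* stage 2: interrupts seen disabled */
    else if (state == NOTIF_FULLY_ARMED && guest_irqs_enabled) {
        state = NOTIF_IDLE;
        if (registered_cb)
            registered_cb(0);           /* post = 0: interrupt finished */
    }
}
```

The point of the second stage is exactly the early-trigger problem described above: a callback keyed only on "interrupts enabled" could fire before the guest handler has even started.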
 
 3. Interrupt delivery policies
 
 The existing hpet interrupt delivery is preserved. This includes
 vcpu round robin delivery used by Linux and broadcast delivery used by Windows.
 
 There are two policies for interrupt delivery, one for Windows 2k8-64 and the other
 for Linux. The Linux policy takes advantage of the (guest) Linux missed tick and offset
 calculations and does not attempt to deliver the right number of interrupts.
 The Windows policy delivers the correct number of interrupts, even if sometimes much
 closer to each other than the period. The policies are similar to those in vpt.c, though
 there are some important differences.
 
 Policies are selected with an HVMOP_set_param hypercall with index HVM_PARAM_TIMER_MODE.
 Two new values are added, HVM_HPET_guest_computes_missed_ticks and
 HVM_HPET_guest_does_not_compute_missed_ticks.  The reason that two new ones are added is that
 in some guests (32bit Linux) a no-missed policy is needed for clock sources other than hpet
 and a missed ticks policy for hpet. It was felt that there would be less confusion by simply
 introducing the two hpet policies.
 
 3.1. The missed ticks policy
 
 The Linux clock interrupt handler for hpet calculates missed ticks and offset using the hpet
 main counter. The algorithm works well when the time since the last interrupt is greater than
 or equal to a period and poorly otherwise.
 
 The missed ticks policy ensures that no two clock interrupts are delivered to the guest at
 a time interval less than a period. A time stamp (hpet main counter value) is recorded (by a
 callback registered with hvm_register_intr_en_notif) when Linux finishes handling the clock
 interrupt. Then, ensuing interrupts are delivered to the vioapic only if the current main
 counter value is a period greater than when the last interrupt was handled.
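The gating rule reduces to a single comparison against the end-of-interrupt time stamp. A minimal sketch, with hypothetical names (the patch's state lives in the hpet structure):

```c
#include <stdint.h>

/* Sketch of the missed-ticks delivery gate: an interrupt is forwarded to
 * the vioapic only if at least one period of the main counter has elapsed
 * since the guest finished handling the previous one. The time stamp is
 * updated by the end-of-interrupt (post = 0) callback. */

static uint64_t last_handled_mc;        /* set by the intr_en_notif callback */
static uint64_t period_mc;              /* tick period in main counter units */

/* End-of-interrupt callback: record when the guest finished the handler. */
void missed_ticks_eoi_cb(uint64_t main_counter_now)
{
    last_handled_mc = main_counter_now;
}

/* Return nonzero if a clock interrupt may be delivered now. */
int missed_ticks_may_deliver(uint64_t main_counter_now)
{
    return main_counter_now - last_handled_mc >= period_mc;
}
```

Held-back ticks are not lost: the guest's own missed-tick arithmetic accounts for them from the main counter, which is why this policy suits Linux.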
 
Tests showed a significant improvement in clock drift with end of interrupt time stamps
versus beginning of interrupt[1]. It is believed that the reason for the improvement
is that the clock interrupt handler takes a spinlock and can therefore be delayed in its
processing. Furthermore, the main counter is read by the guest under the lock. The net
effect is that if we time stamp at injection, the difference in time between successive
interrupt handler lock acquisitions can be less than the period.
 
 3.2. The no-missed ticks policy
 
Windows 2k8-64 keeps very poor time with the missed ticks policy, so the no-missed ticks
policy was developed. In the no-missed ticks policy we deliver the correct number of interrupts,
even if they are spaced less than a period apart (when catching up).
 
Windows 2k8-64 uses a broadcast mode in its interrupt routing such that
all vcpus get the clock interrupt. The best Windows drift performance was achieved when the
policy code ensured that all the previous interrupts (on the various vcpus) had been injected
before injecting the next interrupt to the vioapic.
 
 The policy code works as follows. It uses the hvm_isa_irq_assert_cb() to record
 the vcpus to be interrupted in h->hpet.pending_mask. Then, in the callback registered
 with hvm_register_intr_en_notif() at post=1 time it clears the current vcpu in the pending_mask.
 When the pending_mask is clear it decrements hpet.intr_pending_nr and if intr_pending_nr is still
 non-zero posts another interrupt to the ioapic with hvm_isa_irq_assert_cb().
 Intr_pending_nr is incremented in hpet_route_decision_not_missed_ticks().
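The pending_mask bookkeeping can be sketched as below. This is a hedged illustration: post_next_interrupt() stands in for hvm_isa_irq_assert_cb(), and the instrumentation counter is for the example only.

```c
#include <stdint.h>

/* Sketch of the no-missed-ticks bookkeeping: the assert callback records
 * the target vcpus in pending_mask; each vcpu clears its bit at injection
 * (post = 1); when the mask empties, intr_pending_nr is decremented and,
 * if more interrupts are owed, another one is posted. */

static uint32_t pending_mask;
static unsigned int intr_pending_nr;
static unsigned int interrupts_posted;  /* instrumentation for the example */

static void post_next_interrupt(uint32_t vcpu_mask)
{
    pending_mask = vcpu_mask;           /* recorded by the assert callback */
    interrupts_posted++;
}

/* Called when a new tick is due (hpet_route_decision_not_missed_ticks()). */
void tick_due(uint32_t vcpu_mask)
{
    intr_pending_nr++;
    if (pending_mask == 0)              /* nothing in flight: post at once */
        post_next_interrupt(vcpu_mask);
}

/* post = 1 callback for one vcpu: clear its bit; once every vcpu has been
 * injected, retire this tick and post the next if more are owed. */
void vcpu_injected(unsigned int vcpu, uint32_t vcpu_mask)
{
    pending_mask &= ~(1u << vcpu);
    if (pending_mask == 0) {
        intr_pending_nr--;
        if (intr_pending_nr != 0)
            post_next_interrupt(vcpu_mask);
    }
}
```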
 
 The missed ticks policy intr_en_notif callback also uses the pending_mask method. So even though
 Linux does not broadcast its interrupts, the code could handle it if it did.
 In this case the end of interrupt time stamp is made when the pending_mask is clear.
 
 4. Live Migration
 
 Live migration with hpet preserves the current offset of the guest clock with respect
 to ntp. This is accomplished by migrating all of the state in the h->hpet data structure
 in the usual way. The hp->mc_offset is recalculated on the receiving node so that the
 guest sees a continuous hpet main counter.
 
Code has been added to xc_domain_save.c to send a small message after the
domain context is sent. The content of the message is the physical tsc timestamp, last_tsc,
read just before the message is sent. When the last_tsc message is received in xc_domain_restore.c,
another physical tsc timestamp, cur_tsc, is read. The two timestamps are loaded into the domain
structure as last_tsc_sender and first_tsc_receiver with hypercalls. Then xc_domain_hvm_setcontext
is called so that hpet_load has access to these time stamps. Hpet_load uses the timestamps
to account for the time spent saving and loading the domain context. With this technique,
the only neglected time is the time spent sending a small network message.
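The offset accounting amounts to arithmetic of roughly the following shape. This is an assumption-laden sketch, not the patch's code: the function name, parameters, and scaling (tsc stamps assumed comparable at one nominal frequency, hpet frequency assumed 14318180 Hz) are all illustrative.

```c
#include <stdint.h>

/* Sketch of hpet_load's offset recalculation: the gap between the sender's
 * last_tsc and the receiver's first tsc approximates migration downtime;
 * converted to main counter ticks and folded into mc_offset, it lets the
 * guest see a continuous main counter across the migration. */

#define HPET_HZ 14318180ULL             /* assumed standard hpet frequency */

uint64_t hpet_load_mc_offset(uint64_t saved_guest_mc,    /* guest mc at save */
                             uint64_t receiver_mc_now,   /* host mc at load */
                             uint64_t last_tsc_sender,
                             uint64_t first_tsc_receiver,
                             uint64_t tsc_hz)
{
    uint64_t lost_tsc = first_tsc_receiver - last_tsc_sender;
    uint64_t lost_mc  = lost_tsc * HPET_HZ / tsc_hz;     /* downtime, ticks */

    /* Guest main counter resumes at the saved value plus the downtime. */
    return saved_guest_mc + lost_mc - receiver_mc_now;
}
```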
 
 5. Test Results
 
 Some recent test results are:
 
5.1 Linux 4u6-64 and Windows 2k8-64 load test.
Duration: 70 hours.
Test date: 6/2/08
Loads: usex -b48 on Linux; burn-in on Windows
Guest vcpus: 8 for Linux; 2 for Windows
Hardware: 8 physical cpu AMD
Clock drift: Linux: 0.0012%; Windows: 0.009%
 
5.2 Linux 4u6-64, Linux 4u4-64, and Windows 2k8-64 no-load test
Duration: 23 hours.
Test date: 6/3/08
Loads: none
Guest vcpus: 8 for each Linux; 2 for Windows
Hardware: 4 physical cpu AMD
Clock drift: Linux: 0.033%; Windows: 0.019%
 
 6. Relation to recent work in xen-unstable
 
 There is a similarity between hvm_get_guest_time() in xen-unstable and read_64_main_counter()
 in this code. However, read_64_main_counter() is more tuned to the needs of hpet.c. It has no
 "set" operation, only the get. It isolates the mode, physical or simulated, in read_64_main_counter()
 itself. It uses no vcpu or domain state as it is a physical entity, in either mode. And it provides a real
 physical mode for every read for those applications that desire this.
 
 7. Conclusion
 
 The virtual hpet is improved by this patch in terms of accuracy and monotonicity.
 Tests performed to date verify this and more testing is under way.
 
 8. Future Work
 
 Testing with Windows Vista will be performed soon. The reason for accuracy variations
 on different platforms using the physical hpet device will be investigated.
 Additional overhead measurements on simulated vs physical hpet mode will be made.
 
 Footnotes:
 
 1. I don't recall the accuracy improvement with end of interrupt stamping, but it was
 significant, perhaps better than two to one improvement. It would be a very simple matter
 to re-measure the improvement as the facility can call back at injection time as well.
 
 
Signed-off-by: Dave Winchell <dwinchell@xxxxxxxxxxxxxxx>
Signed-off-by: Ben Guthro <bguthro@xxxxxxxxxxxxxxx>
 
 
 _______________________________________________
 Xen-devel mailing list
 Xen-devel@xxxxxxxxxxxxxxxxxxx
 http://lists.xensource.com/xen-devel
 
 