xen-ia64-devel

[Xen-ia64-devel] RE: Timer merge

To: "Magenheimer, Dan \(HP Labs Fort Collins\)" <dan.magenheimer@xxxxxx>, "Tian, Kevin" <kevin.tian@xxxxxxxxx>, <xen-ia64-devel@xxxxxxxxxxxxxxxxxxx>
Subject: [Xen-ia64-devel] RE: Timer merge
From: "Dong, Eddie" <eddie.dong@xxxxxxxxx>
Date: Fri, 26 Aug 2005 14:34:04 +0800
Cc: "Mallick, Asit K" <asit.k.mallick@xxxxxxxxx>

Dan:
        Wonderful discussion between you and Kevin. Just adding a few
more comments here.

Magenheimer, Dan (HP Labs Fort Collins) wrote:
> Thanks Kevin for your thoughtful answer.
> 
>>> The current (non-VTI) code is not perfect but it *is* fast.  This
>>> is due a great deal to the fact that all the operations are
>>> handled either with "fast reflection" or "fast hyperprivop"
>>> code.  This code reflects/emulates operations without requiring
>>> all the state save/restore and stack switching that is necessary
>>> to call C.  Indeed it is done without even turning on psr.ic
>>> because all data that is accessed is pinned by TRs.  As a
>>> result each operation takes on the order of 100 cycles, as
>>> opposed to 1000-2000 cycles if C must be called.  And there are
>>> several (8-10 IIRC) operations per guest timer tick.
>> 
>> Yes, this is a fast path to reflect guest timer interrupt,
>> which we didn't note before.
> 
> Not just the fast path for reflection (once per tick).  Also
> the fast path for reading ivr (twice per tick), setting tpr
> (once per tick), reading tpr (twice per tick), setting eoi
> (once per tick), and (of course) setting itm (once per tick).
For TPR and IVR/EOI, yes, they can be handled in ASM at some cost in
readability, and we had better do it that way.
But we are also paying in security here, because the permission check
is skipped in the ASM path.
For example, if a guest application issues a hypercall through the
break instruction (with PSR.ic=1), the hypervisor does not block that
kind of operation. Once you add the permission-check code to each fast
hypercall, the overhead is much higher.
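
To make that concrete, here is a minimal C sketch of the kind of check
the ASM fast path skips today. The field and constant names
(regs->cr_ipsr, IA64_PSR_CPL0_BIT) follow the Linux ia64 headers; treat
the whole thing as an illustration, not the actual Xen/ia64 code:

#include <asm/ptrace.h>   /* struct pt_regs, cr_ipsr */
#include <asm/kregs.h>    /* IA64_PSR_CPL0_BIT */

/* Return non-zero only if the break came from the guest kernel. */
static inline int break_from_guest_kernel(struct pt_regs *regs)
{
    /* psr.cpl of the interrupted context, from the saved ipsr. */
    unsigned long cpl = (regs->cr_ipsr >> IA64_PSR_CPL0_BIT) & 3;

    /* Assumption: guest applications run at cpl 3.  A break from an
     * application should be reflected back to the guest as an ordinary
     * break fault instead of being emulated as a hyperprivop. */
    return cpl != 3;
}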

> 
> These nine operations total ~1000 cycles when using the fast
> path and ~15000 when using the slow path.  Multiplied by
> 1024 hz, the slow path uses an additional (above what
> Linux uses) ~1.5% of the total CPU just processing clock ticks.

The previous proposal only targets the machine ITM programming, so the
performance degradation is much smaller than 1.5%.
> 
>> But considering that IPF Linux
>> can catch up losing ticks based on ITC as a monotonic
>> increasing timer source, the requirement for accuracy of
>> virtual timer injection may not look so strictly.
> 
> Isn't this a requirement of all operating systems on
> IPF since a long PAL call can happen asynchronously?

We have booted both Linux and Windows Server 2003 on the VMM, and both
of them are OK with that. Also, as the number of VMs increases, it
becomes a must for the guest to catch up by itself.
On IA32, the PIT IRQs are stacked while a domain is switched out and
injected one by one when the domain is switched back in. On IPF, we can
instead take advantage of a batched interrupt-processing mechanism.
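
A rough C sketch of the batching idea on switch-in (all field and
function names here are made up for illustration, not existing Xen/ia64
code):

/* Called when a vcpu is switched back onto a physical cpu. */
void catch_up_guest_ticks(struct vcpu *v)
{
    /* Guest view of the ITC: machine ITC plus a per-vcpu offset. */
    unsigned long now = ia64_get_itc() + v->arch.itc_offset;

    if ((long)(now - v->arch.itm) >= 0) {
        /* One pending virtual timer interrupt is enough: IPF Linux's
         * lost-tick logic advances jiffies by the number of missed
         * periods it computes from the ITC. */
        vcpu_pend_timer_interrupt(v);

        /* Push the virtual ITM past 'now' so we do not fire once per
         * missed period, unlike the stacked PIT IRQs on IA32. */
        while ((long)(now - v->arch.itm) >= 0)
            v->arch.itm += v->arch.itm_delta;
    }
}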

> 
>> To some
>> extent, to let guest catch up may have better performance
>> than triggering as many machine interrupts as what guest
>> wants. Because you can save much cycles to do context
>> switches in that way.
> 
> Delivering all ticks to all guests is certainly not
> scalable.  Say there are 1000 lightly-loaded guests sharing
> a single processor server.  The entire processor would be
> utilized just delivering all the ticks to each guest!
> 
> Is this what Xen/x86 does?
> 
>> Drawback of this way may let guest
>> application observe time stagnant within small time slot.
> 
> Hmmm.... can you explain?  Are you talking about a guest
> application that is making system calls to count jiffies
> (which I think is a Linux-wide problem) or a guest application
> that is reading the itc?  In the current model, the itc
> is always monotonically increasing unless the guest operating
> system sets itc.
> 
>> Of course actual performance difference needs future benchmark
>> data. But this is a factor we need to balance. ;-)
> 
> Agreed.  Perhaps we should set a system-wide quota, e.g. no
> more than 0.2% total system overhead for the hypervisor processing
> guest clock ticks.  (I'm not proposing that 0.2% is the right
> number, just using it as an example.)
Definitely we should keep the fast hypercalls; the proposal only moves
the itm programming into ac_timer. With that, the performance
difference is only 1.5% * 2/9 ~= 0.3% :-)
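
Just to show where that number comes from (using Dan's ~1000 vs ~15000
cycle totals and 1024Hz quoted above, assuming the nine operations cost
roughly the same, and assuming a roughly 1GHz part -- the clock speed
is my assumption, not something stated in this thread):

    (15000 - 1000) cycles/tick * 1024 ticks/s ~= 14.3M cycles/s
                                              ~= 1.5% of a 1GHz cpu
    2 of the 9 ops on the slow path: 1.5% * 2/9 ~= 0.33%
    1 of the 9 ops on the slow path: 1.5% * 1/9 ~= 0.17%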


> 
>>> The core Xen code for handling timers is all in C so using
>>> any of it will slow every guest timer tick substantially,
>>> thus slowing the entire guest substantially.  It may be
>>> possible to write the semantic equivalent of the Xen ac_timer
>>> code in ia64 assembly, but this defeats the purpose of sharing
>>> the core Xen code.  Also, I'm doubtful that walking timer
>>> queues can be done with psr.ic off.
>> 
>> It's cleaner to consolidate all places modifying machine itm
>> into one uniform interface (Ac_timer here). This conforms to
>> common interface and also benefits merge process. If using
>> above policy to inject less interrupt, the benefit of
>> assembly is a bit reduced and instead show more error-prone.
> 
> As discussed in a different thread on xen-devel some time ago,
> I believe the ac_timer queue mechanism is an elegant interface
> that is overkill for how it is used.  It was pointed out
> (by Rolf I believe) that it is used more heavily in SMP.
> I was skeptical but couldn't argue because Xen/ia64 doesn't
> do SMP yet.
> 
> Without changing core code, the ac_timer queue mechanism MUST
> be used for scheduling domains.  Since this is less performance
> critical, I am OK with that.
> 
>>> Note that hypervisor ticks (primarily used for scheduling
>>> timeouts) are much less frequent (32 hz?) so not as
>>> performance-sensitive.  The current code does call C for
>>> these ticks.
>> 
>> Now HZ is defined as 100 in config.h, however current itm
>> modification policy actually makes this periodic value
>> useless. Even when itm_delta is added and set into itm, an
>> immediately following ac_timer softirq will reprogram the itm
>> to the closest time point in the ac timer list.
> 
> This sounds like a bug (but on the path for scheduling domains,
> not delivering guest ticks, correct?)
> 
>>> In short, I am open to rearchitecting the timer code to
>>> better merge with VTI.  However the changes should not have
>>> a significant effect on performance.  And anything that calls
>>> C code multiple times at 1024hz almost certainly will.
>> 
>> Agree. Actually this area is the one missing enough
>> discussion for a long time. We need to make progress without
>> breaking anything. Since we begin this discussion, broader
>> usage model should also be considered for future support:
>>      - When guest is presented with multiple vcpus, current
>> guest linux smp boot code will try to sync itc and thus write to itc.
> 
> The current model should handle this just fine using a
> delta.  This delta is not currently implemented, but that's
> only because setting itc hasn't been an issue yet.
So we have reached a common point here: keep a guest ITC via a delta.
Another thing is that I think we also need to keep a guest ITM. Suppose
Dom1 programs its ITM into the machine itm, as in the current
implementation, and some time later, before that timer expires, an I/O
operation in Dom1 causes a domain switch (do_block). Control is
switched back to Dom0, and the machine ITM must be reprogrammed to
Dom0's next ITM. Without a saved per-guest ITM, I am not sure how that
can be achieved.
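
A minimal sketch of what I mean on the context-switch path (again, the
names are illustrative assumptions, not actual Xen/ia64 code): keep the
guest ITM per vcpu and, when switching a vcpu in, program the machine
cr.itm to the earlier of that guest's deadline and Xen's own next
ac_timer deadline:

/* Reprogram the machine timer for the vcpu being switched in. */
void timer_context_switch(struct vcpu *next)
{
    /* The outgoing vcpu's deadline already lives in its own arch.itm
     * (kept in guest ITC units), so there is nothing to save here. */

    /* Translate the incoming guest's virtual ITM back to machine ITC
     * units via its per-vcpu ITC offset ... */
    unsigned long guest_deadline = next->arch.itm - next->arch.itc_offset;

    /* ... and take the earlier of that and Xen's next ac_timer
     * deadline on this cpu, so neither the guest tick nor the
     * hypervisor's own timeouts are lost across do_block(). */
    unsigned long xen_deadline = this_cpu_next_ac_timer_deadline();

    ia64_set_itm((long)(guest_deadline - xen_deadline) < 0
                 ? guest_deadline : xen_deadline);
}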

If we keep a guest ITM, the performance difference calculated above is
further decreased to 1.5% * 1/9 ~= 0.17% :-)
We have achieved the goal now!

> 
>>      - When vcpu is allowed to be scheduled to different
>> physical cpu (host smp), itc on different physical cpu is
>> unlikely to be exactly same even after sync.
> 
> This is much less frequent so doing extra work here is OK.
> 
>>      - For large machine like NUMA, the itc is completely
>> un-synchronized driven by different ratio. People need to
>> access global platform timer for different cpus to have a
>> base monotonic time base.
> 
> Agreed.  But this is an operating system problem that
> is currently being discussed and solved in the Linux
> community.  I would prefer to see the problem solved
> by Linux and then (to the extent possible) leverage
> that solution.
> 
>> All these cases in my head just pose the importance of a
>> scalable and well-organized time mechanism for both system
>> time keep and virtual time emulation. To implement all in
>> assembly code seems frighten me. Without virtualized itc (by
>> offset) and itm, it's difficult to handle above cases. This
>> is why Eddie gave the proposal as the below of this thread.
> 
> I'm not proposing that *everything* be implemented in assembly,
> just that the architecture and design assume that the
> most frequent paths can be implemented in assembly
> (and with psr.ic off).  I think this will be hard to do
> using Xen core ac_timer queues.
> 
>> However, current assembly approach is also a good research
>> direction to consider. Whether we can short-circuit in some
>> special case is also the way to gain maximum performance. We
>> just need balance, but let's draw out a achievable goal first. ;-)
> 
> Well, it's hard to call it a research direction if its already
> implemented and working :-)
> 
> As I said, I'm not against a new time architecture/design.  I'm
> simply proposing that performance is more important than utilizing
> elegant-but-overcomplicated existing core Xen code.
> 
> Oh, and of course, that any new architecture/design works properly.
> As we've seen from the recent changes to Xen/x86, getting time
> working properly is not always easy.
Sure
> 
> Dan
> 

Any more comments?
Eddie

_______________________________________________
Xen-ia64-devel mailing list
Xen-ia64-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-ia64-devel
