
[Xen-devel] RE: TSC scaling and softtsc reprise, and PROPOSAL



> >> I've informally heard that certain version of the JVM and
> >> Oracle Db have a habit of pounding rdtsc hard from user
> >> space, but I don't know what rates.
> > 
> > Indeed they do and they use it for timestamping
> > events/transactions, so these are the very same
> > apps that need to guarantee SMP timestamp ordering.
> 
> Why would you expect host TSC consistency running on Xen to 
> be worse than
> when running on a native OS?

In short, it is because a new class of machine
is emerging in the virtualization space: it is
really a NUMA machine that tries to look like
an SMP (non-NUMA) machine by making memory
access fast enough that the NUMA-ness can be
ignored, but, for the purposes of time, it is
still a NUMA machine.

Let's consider three physical platforms:

SMALL = single socket (multi-core)
MEDIUM = multiple sockets, same motherboard
LARGE = multiple sockets, multiple motherboards

LARGE machines are becoming more widely available
(e.g. the HP DL785) because multiple motherboards
are very convenient for field upgradeability (which
has a major impact on support costs).  They also
make a very nice consolidation target for
virtualizing a bunch of SMALL machines.  However,
SMALL and MEDIUM are much less expensive, so much
more prevalent (especially as development machines!).

On SMALL, TSC is always consistent between cores
(at least on all but the first dual-core processors).

On MEDIUM, some claim that TSC is always consistent
between cores on different sockets because the
sockets share a motherboard crystal.  I don't
know if this is true; if it is, MEDIUM can be
considered the same as SMALL, and if not, the
same as LARGE.  So ignore MEDIUM as a subcase
of one of the others.
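
As a concrete illustration of what "consistent"
means here, the quick-and-dirty check below (plain
Linux/pthreads user space, nothing to do with Xen's
own code) ping-pongs TSC readings between two pinned
threads and counts how often a "later" reading is
smaller than an "earlier" one.  On a well-behaved
SMALL box it should print zero.

/* Hypothetical spot check, not a rigorous test: share the most recent
 * TSC reading between two threads pinned to different CPUs and count
 * how often a later reading appears to be smaller.  RDTSCP is used so
 * the counter read cannot drift ahead of the load of the shared value.
 * Build with:  gcc -O2 -pthread tsc_check.c   (file name is made up) */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

static _Atomic uint64_t last_tsc;    /* most recent reading, either CPU */
static _Atomic long     went_back;   /* apparent backwards steps        */

static void *worker(void *arg)
{
    cpu_set_t set;
    unsigned int aux;

    CPU_ZERO(&set);
    CPU_SET((int)(long)arg, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    for (long i = 0; i < 10000000; i++) {
        uint64_t prev = atomic_load(&last_tsc);
        uint64_t now  = __rdtscp(&aux);
        if (now < prev)                      /* time appeared to go back */
            atomic_fetch_add(&went_back, 1);
        atomic_store(&last_tsc, now);
    }
    return NULL;
}

int main(void)
{
    pthread_t t0, t1;

    pthread_create(&t0, NULL, worker, (void *)0L);   /* pinned to CPU 0 */
    pthread_create(&t1, NULL, worker, (void *)1L);   /* pinned to CPU 1 */
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    printf("apparent backwards steps: %ld\n", atomic_load(&went_back));
    return 0;
}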

On LARGE, the motherboards are connected by
HT or QPI, but neither has any form of clock
synchronization.  So, from a clock perspective,
LARGE needs to be "partitioned"; OR there has
to be sophisticated system software that does
its best to synchronize TSC across all of
the cores (which enterprise OSes like HP-UX
and AIX have, Linux is working on, and Xen
has... though it remains to be seen whether any
of these is "good enough"); OR TSC has to
be abandoned altogether by all software that
relies on it (OR TSC has to be emulated).
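
Roughly, all of that sync software has to start the
same way: estimate the offset between two CPUs'
TSCs with a round trip, then correct for it (by
writing the TSC, or by carrying the offset in
software).  A toy user-space version of just the
estimation step -- the thread scaffolding is made
up for illustration; this is not how HP-UX, AIX,
Linux or Xen actually do it:

/* Toy offset estimate between the TSCs of CPU 0 and CPU 1.  Real code
 * does this from the OS/hypervisor with IPIs, repeats it many times,
 * and keeps the sample with the shortest round trip. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

static _Atomic int      armed, request, ready;
static _Atomic uint64_t remote_tsc;

static void pin(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *remote(void *arg)            /* runs pinned to CPU 1 */
{
    unsigned int aux;
    pin(1);
    atomic_store(&armed, 1);              /* tell the master we're ready */
    while (!atomic_load(&request))
        ;                                 /* wait to be asked */
    atomic_store(&remote_tsc, __rdtscp(&aux));
    atomic_store(&ready, 1);
    return arg;
}

int main(void)                            /* "master", pinned to CPU 0 */
{
    pthread_t t;
    unsigned int aux;
    uint64_t t0, t1;

    pin(0);
    pthread_create(&t, NULL, remote, NULL);
    while (!atomic_load(&armed))
        ;                                 /* wait for the remote to spin up */

    t0 = __rdtscp(&aux);
    atomic_store(&request, 1);            /* kick the other CPU */
    while (!atomic_load(&ready))
        ;
    t1 = __rdtscp(&aux);
    pthread_join(t, NULL);

    /* Assume the remote sample landed mid round trip; the error is
     * bounded by half the round trip. */
    printf("offset ~ %lld cycles (+/- %llu)\n",
           (long long)(atomic_load(&remote_tsc) - (t0 + (t1 - t0) / 2)),
           (unsigned long long)((t1 - t0) / 2));
    return 0;
}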

This problem on LARGE machines is obscure enough
that software is developed (on SMALL machines)
that has a hidden timebomb if TSC is not perfectly
consistent.  Admittedly, all such software should
have a switch that abandons TSC altogether in favor
of an OS "gettimeofday", but that either depends
on TSC as well or on a verrryyy sllloooowwww
platform timer which, if used frequently, probably
has a performance impact as bad as or worse than
emulating TSC.
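
(For concreteness, the pattern inside such an app
usually boils down to something like the sketch
below -- the names and the switch are invented for
illustration; the point is just what the two paths
cost.)

/* Illustrative only: the shape of the timestamp path in an app that
 * uses rdtsc heavily but offers a switch to fall back to the OS clock.
 * The flag and function names are made up for this sketch. */
#include <stdint.h>
#include <time.h>
#include <x86intrin.h>

static int use_tsc = 1;         /* cleared by the app's "TSC-is-bad" option */

static uint64_t timestamp(void)
{
    if (use_tsc)
        return __rdtsc();       /* a few tens of cycles, no syscall */

    /* The fallback.  On Linux this is usually a cheap vDSO call, but
     * when the kernel has decided the TSC is unusable it ends up on a
     * slow platform timer (HPET/ACPI PM) -- the verrryyy sllloooowwww
     * case described above. */
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}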

So what is "good enough"?  If Xen's existing
algorithm works poorly on LARGE systems (or
even on older SMALL systems), applications
should abandon TSC.  If Xen's existing algorithm
works "well", then applications can and should
use TSC.  But unless "good enough" can be carefully
defined and agreed upon between Xen and the
applications AND Xen can communicate "YES
this platform is good enough or NOT" to any
software that cares, we are caught in a gray
area.  Unfortunately, neither is true:  "good
enough" is not defined, AND there is no clean
way to communicate it even if it were.
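
(About the only thing an application can check today
is the "invariant TSC" CPUID bit, CPUID leaf
0x80000007, EDX bit 8.  That only promises a
constant-rate, always-running TSC; it says nothing
about cross-socket or cross-motherboard consistency,
which is exactly the gap described above.  For
example:)

/* Check the "invariant TSC" bit, CPUID.80000007H:EDX[8].  Note this
 * says the TSC ticks at a constant rate and keeps running in low-power
 * states; it does NOT say anything about SMP consistency. */
#include <cpuid.h>
#include <stdio.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (__get_cpuid(0x80000007, &eax, &ebx, &ecx, &edx) && (edx & (1u << 8)))
        printf("invariant TSC: yes (still silent on SMP consistency)\n");
    else
        printf("invariant TSC: no\n");
    return 0;
}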

And living in the gray area means some very
infrequent, very bizarre bugs can arise because
sometimes, unbeknownst to that application,
rarely and irreproducibly, time will appear to
go backwards.  And if timestamps are used,
for example, to replay transactions, data
corruption occurs.

So the choices are:
1) Ignore the problem and hope it never happens (or
   if it does that Xen doesn't get blamed)
2) Tell all Xen users that TSC should not be used
   as a timestamp.  (In other words, fix your apps
   or always turn on the app's TSC-is-bad option when
   running virtualized on a "bad" physical machine.)
3) Always emulate TSC and let the heavy TSC users
   pay the performance cost.  (What "emulate" means
   is sketched below.)
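
(A sketch of the idea behind choice 3 -- not Xen's
actual code: the hypervisor keeps the guest from
executing RDTSC directly (CR4.TSD for PV guests, the
RDTSC-exiting control for HVM), takes a trap on every
RDTSC, and hands back a value derived from a single
monotonic system time, so every vcpu sees one
consistent timebase no matter which physical CPU or
board it lands on.  The cost is the trap itself,
which is why the heavy rdtsc users pay.)

/* Sketch of RDTSC emulation.  The struct fields and the helper are
 * made up for this illustration; they are not Xen's data structures. */
#include <stdint.h>

struct vcpu {
    uint64_t tsc_offset;      /* per-domain offset, fixed at domain start */
    uint64_t tsc_khz;         /* the rate the guest was told it has       */
    uint64_t rax, rdx, rip;   /* the guest registers we touch             */
};

/* Hypothetical helper: the hypervisor's single, monotonic system time. */
extern uint64_t read_system_time_ns(void);

/* Called when a guest RDTSC traps (CR4.TSD fault or RDTSC VM exit). */
void emulate_rdtsc(struct vcpu *v)
{
    /* Derive the guest TSC from one shared timebase.  (Real code uses
     * a pre-computed fixed-point scale factor so this multiply cannot
     * overflow after a few thousand seconds of uptime.) */
    uint64_t tsc = v->tsc_offset +
                   read_system_time_ns() * v->tsc_khz / 1000000ull;

    v->rax = (uint32_t)tsc;   /* RDTSC returns the count in EDX:EAX */
    v->rdx = tsc >> 32;
    v->rip += 2;              /* skip the two-byte RDTSC opcode     */
}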

Last, as Intel has pointed out, a related kind of
issue occurs when live migration moves a running
VM from a machine with one TSC rate to another machine
with a different TSC rate (or when the TSC rate varies
on the same machine, e.g. for power-saving reasons).
It would be nice if our choice (above) solved this
problem too.
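
For what it's worth, the arithmetic is the same
whichever way we go: keep a per-guest (ratio, offset)
pair so that

  guest_tsc = ((host_tsc * ratio) >> 32) + offset

where ratio is guest_khz/host_khz in fixed point,
recomputed whenever the VM lands on a host with a
different rate.  A rough, self-contained illustration
(the 32.32 format and the names are mine, not
anything in the tree):

/* Toy model of TSC scaling across a migration: after retarget(), the
 * guest keeps seeing its original TSC frequency and the count stays
 * continuous.  Uses GCC's unsigned __int128 for the wide multiply. */
#include <stdint.h>
#include <stdio.h>

struct guest_tsc_state {
    uint64_t guest_khz;   /* rate the guest booted with                  */
    uint64_t ratio;       /* guest_khz / host_khz, as 32.32 fixed point  */
    uint64_t offset;      /* keeps the count continuous at migration     */
};

static uint64_t guest_tsc(const struct guest_tsc_state *s, uint64_t host_tsc)
{
    return (uint64_t)(((unsigned __int128)host_tsc * s->ratio) >> 32) + s->offset;
}

/* Recompute ratio/offset when the VM lands on a host with a new rate. */
static void retarget(struct guest_tsc_state *s, uint64_t new_host_khz,
                     uint64_t host_tsc_now, uint64_t guest_tsc_now)
{
    s->ratio  = (s->guest_khz << 32) / new_host_khz;
    s->offset = 0;
    s->offset = guest_tsc_now - guest_tsc(s, host_tsc_now);
}

int main(void)
{
    /* Guest started on a 3.0 GHz box, migrates to a 2.0 GHz box. */
    struct guest_tsc_state s = { .guest_khz = 3000000 };

    retarget(&s, 2000000, 1000000000ull, 5000000000ull);
    printf("at migration:          %llu\n",              /* 5000000000 */
           (unsigned long long)guest_tsc(&s, 1000000000ull));
    printf("2e9 host cycles later: %llu\n",              /* 8000000000 */
           (unsigned long long)guest_tsc(&s, 3000000000ull));
    return 0;
}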



 

