
[Xen-devel] RE: TSC scaling and softtsc reprise, and PROPOSAL



Hi Dan,
        Sorry for the late reply!  See my comments below.
> 
> Thanks very much for the additional detail on the 10%
> performance loss.  What is this oltp benchmark?  Is
> it available for others to run?  Also is the rdtsc
> rate 120000/sec on EACH processor?

The OLTP benchmark is a test case of sysbench; you can get it through the 
following link:
http://sysbench.sourceforge.net/

We configured only one virtual processor per VM, and I don't know whether 
the OLTP test can use two virtual processors.

> 
> Assuming a 3GHz machine, your results seem to show that
> emulating a rdtsc with softtsc takes about 2500 cycles.
> This agrees with my approximation of about 1 usec.
> 
> Have you analyzed where this 2500 cycles is being used?
> My suggestion about performance optimization was not
> to try a different algorithm but to see if it is possible
> to code the existing algorithm much faster using a
> special trap path and assembly code. (We called this
> a "fast path" on Xen/ia64.)  Even if the 2500 cycles
> can be cut in half, that would be a big win.

There is no fast path for emulating rdtsc on the x86 side; the main cost 
should come from the hardware context switch (the VM exit/entry around the 
emulation).  Since I used an old machine when running this benchmark, the 
cost should be much lower on the latest processors, but I haven't tested 
those yet.
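
For what it's worth, a minimal user-space probe like the sketch below 
(illustrative only, not part of the patch) can approximate the per-rdtsc 
cost: run it on bare metal and inside a guest with rdtsc exiting enabled 
and compare the averages.  The arithmetic also matches your estimate: 
~10% loss at 120,000 rdtsc/s is about 0.10s / 120,000 ~= 0.83us, or 
roughly 2500 cycles at 3GHz.

#include <stdint.h>
#include <stdio.h>

/* Read the TSC directly; each call traps when rdtsc exiting is enabled. */
static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
    const int iters = 1000000;
    uint64_t start = rdtsc();
    for (int i = 0; i < iters; i++)
        (void)rdtsc();
    uint64_t cycles = rdtsc() - start;
    printf("avg cycles per rdtsc: %llu\n",
           (unsigned long long)(cycles / iters));
    return 0;
}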

> Am I correct in reading that your patch is ONLY for
> HVM guests?  If so, since some (maybe most) workloads
> that rely on tsc for transaction timestamps will be
> PV, your patch doesn't solve the whole problem.

Yes, this patch is only for HVM guests, because only HVM guests can use the 
TSC-offset feature (one of the VT features), and I also don't think PV 
guests need it.
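
To illustrate the idea, here is a minimal sketch (the names are made up, 
not the actual Xen/VMX code): with rdtsc exiting disabled, the hardware 
returns host TSC + TSC_OFFSET on a guest rdtsc at native speed, and the 
offset is recomputed on restore/migration so the guest's TSC appears to 
continue from where it left off.  A PV guest reads the host TSC directly, 
so the offset field does not apply there.

#include <stdint.h>
#include <stdio.h>

struct vcpu_tsc {
    uint64_t tsc_offset;   /* what would be written to the VMCS TSC_OFFSET field */
};

/* Recompute the offset (e.g. after save/restore or migration) so the
 * guest's TSC appears to continue from 'guest_tsc'. */
static void set_guest_tsc(struct vcpu_tsc *v, uint64_t guest_tsc,
                          uint64_t host_tsc_now)
{
    v->tsc_offset = guest_tsc - host_tsc_now;   /* 64-bit wraparound is intended */
}

/* What the hardware effectively computes on a guest rdtsc. */
static uint64_t guest_rdtsc(const struct vcpu_tsc *v, uint64_t host_tsc_now)
{
    return host_tsc_now + v->tsc_offset;
}

int main(void)
{
    struct vcpu_tsc v;
    set_guest_tsc(&v, 1000000, 5000000);   /* guest should resume at 1e6 */
    printf("%llu\n", (unsigned long long)guest_rdtsc(&v, 5000100));
    return 0;                              /* prints 1000100 */
}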

> Can someone at Intel confirm or deny that VMware ESX
> always traps rdtsc?  If so, it is probably not hard
> to write an application that works on VMware ESX (on
> certain hardware) but fails on Xen.
> 
>> -----Original Message-----
>> From: Zhang, Xiantao [mailto:xiantao.zhang@xxxxxxxxx]
>> Sent: Tuesday, July 21, 2009 11:05 PM
>> To: Keir Fraser; Dan Magenheimer; Xen-Devel (E-mail)
>> Cc: John Levon; Ian Pratt; Dong, Eddie
>> Subject: RE: TSC scaling and softtsc reprise, and PROPOSAL
>> 
>> 
>> Keir Fraser wrote:
>>> On 20/07/2009 21:02, "Dan Magenheimer" <dan.magenheimer@xxxxxxxxxx>
>>> wrote: 
>>> 
>>>> I agree that if the performance is *really bad*, the default
>>>> should not change.  But I think we are still flying on rumors
>>>> of data collected years ago in a very different world, and
>>>> the performance data should be re-collected to prove that
>>>> it is still *really bad*.  If the degradation is a fraction
>>>> of a percent even in worst case analysis, I think the default
>>>> should be changed so that correctness prevails.
>>>> 
>>>> Why now?  Because more and more real-world applications are
>>>> built on top of multi-core platforms where TSC is reliable
>>>> and (by far) the best timesource.  And I think(?) we all agree
>>>> now that softtsc is the only way to guarantee correctness
>>>> in a virtual environment.
>>> 
>>> So how bad is the non-softtsc default mode anyway? Our default
>>> timer_mode has guest TSCs track host TSC (plus a fixed per-vcpu
>>> offset that defaults to having all vcpus of a domain aligned to
>>> vcpu0 boot = zero tsc). 
>>> 
>>> Looking at the email thread you cited, all I see is someone from
>>> Intel saying something about how their code to improve TSC
>>> consistency across migration avoids RDTSC exiting where possible
>>> (which I do not see -- if the TSC rates across the hosts do not
>>> match closely then RDTSC exiting is enabled forever for that
>>> domain), and, most bizarrely, that their 'solution' may have a tsc
>>> drift >10^5 cycles. Where did this huge number come from? What
>>> solution is being talked about, and under what conditions might the
>>> claim hold? Who knows! 
>> 
>> We ran an experiment to measure the performance impact of
>> softtsc using the oltp workload, and we saw ~10% performance
>> loss when the rdtsc rate is more than 120,000/second.  We also
>> ran some other tests, and the results show roughly 1%
>> performance loss per 10,000 rdtsc instructions per second.  So if
>> the rdtsc rate is not that high (no more than ~10,000/second), the
>> performance impact can be ignored.
>> 
>> We also introduced some performance optimization solutions, but
>> as we said before, they may bring some TSC drift (10^5~10^6
>> cycles) between virtual processors in SMP cases.  One solution is
>> described below.  Suppose, for example, that the guest is migrated
>> from a machine with a low TSC frequency (low_freq) to one with a
>> high TSC frequency (high_freq); the low frequency is the guest's
>> expected frequency (exp_freq), and any optimization solution should
>> let the guest believe it is still running on a machine with an
>> exp_freq TSC, to avoid possible issues caused by the faster TSC.
>> 
>> 1. In this solution, we only guarantee that the guest's TSC
>> increases monotonically and that its average frequency equals the
>> guest's expected frequency (exp_freq) over a fixed time slot (e.g. ~1ms).
>> 2. To keep it simple, let the guest run on the high_freq TSC (using
>> the hardware TSC-offset feature, with no performance loss) for 1ms,
>> then enable rdtsc exiting and use the trap-and-emulation method
>> (which suffers the performance loss) to let the guest run on a
>> *VERY VERY* low-frequency TSC (e.g. 0.2GHz) for some time.  The
>> length of that slow slot can be calculated with the formula below
>> (frequencies in GHz, time in ms; see the sketch after this list),
>> which guarantees average TSC frequency == exp_freq:
>>              time = (high_freq - low_freq) / (low_freq - 0.2).
>> 
>> 3. If the guest migrates from a 2.4GHz machine to a 3.0GHz machine,
>> the guest only suffers the performance loss for (3.0-2.4)/(2.4-0.2) ==
>> ~0.273ms out of the total 1ms+0.273ms period; that is to say, for most
>> of the time the guest can leverage the hardware's TSC-offset feature
>> and avoid the performance loss.
>> 
>> 4.  Over that 1.273ms, the guest's TSC frequency is emulated to its
>> expected value through this hardware/software co-emulation, and the
>> performance loss is very minor compared with the pure softtsc solution.
>> 5.  At the same time, since each vcpu's TSC is emulated independently
>> for an SMP guest, a drift may build up between vcpus; its range
>> should be 10^5~10^6 cycles, and we don't know whether such drift
>> between vcpus can bring other side effects.  At least one side-effect
>> case we can identify: an application running on one vcpu may see a
>> backward TSC value after it migrates to another vcpu.  Not sure this
>> is a real problem, but it should exist in theory.
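>> 
>> As a sanity check on the arithmetic in steps 2-3, here is a rough
>> sketch (variable names are illustrative, not from the patch) that
>> computes the slow-slot length and verifies the average frequency:
>> 
>> #include <stdio.h>
>> 
>> int main(void)
>> {
>>     /* frequencies in GHz, times in ms; 1 GHz * 1 ms = 1e6 cycles */
>>     double high_freq = 3.0;   /* host TSC frequency after migration */
>>     double exp_freq  = 2.4;   /* guest's expected (pre-migration) frequency */
>>     double slow_freq = 0.2;   /* emulated frequency while rdtsc exiting is on */
>>     double fast_ms   = 1.0;   /* slot run at native speed via the TSC offset */
>> 
>>     /* Choose the slow slot so the average over (fast + slow) is exp_freq. */
>>     double slow_ms = fast_ms * (high_freq - exp_freq) / (exp_freq - slow_freq);
>>     double avg = (high_freq * fast_ms + slow_freq * slow_ms)
>>                  / (fast_ms + slow_ms);
>> 
>>     printf("slow slot = %.3f ms, average freq = %.3f GHz\n", slow_ms, avg);
>>     return 0;   /* prints: slow slot = 0.273 ms, average freq = 2.400 GHz */
>> }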
>> 
>> Attached is a draft patch implementing this solution, based on an
>> old changeset (#Cset19591).
>> 
>> Xiantao


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel


 

