
Re: x86: memset() / clear_page() / page scrubbing


  • To: Jan Beulich <jbeulich@xxxxxxxx>
  • From: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
  • Date: Thu, 15 Apr 2021 17:21:09 +0100
  • Cc: Roger Pau Monné <roger.pau@xxxxxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>
  • Delivery-date: Thu, 15 Apr 2021 16:21:30 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 14/04/2021 09:12, Jan Beulich wrote:
> On 13.04.2021 15:17, Andrew Cooper wrote:
>> Do you have actual numbers from these experiments?
> Attached is the collected raw output from a number of systems.

Wow, Tulsa is vintage.  Is that new enough to have nonstop_tsc?

It's also quite possibly old enough to fail Linux's REP_GOOD check, which
is something we're missing in Xen.

>>   I've seen your patch
>> from the thread, but at a minimum its missing some hunks adding new
>> CPUID bits.
> It's not missing hunks - these additions are in a prereq patch that
> I meant to post together with whatever this analysis would lead to.
> If you think I should submit the prereqs ahead of time, I can of
> course do so.

Well - it's necessary for anyone wanting to compile your patch.

All the bits seem like they're ones we ought to mirror through to guests
irrespective of optimisations taken in Xen, so I don't see a problem
with such a patch going in stand-alone.

>>   I do worry however whether the testing is likely to be
>> realistic for non-idle scenarios.
> Of course it's not going to be - in non-idle scenarios we'll always
> be somewhere in the middle. Therefore I wanted to have numbers at
> the edges (hot and cold cache respectively), as any other numbers
> are going to be much harder to obtain in a way that they would
> actually be meaningful (and hence reasonably stable).

In non-idle scenarios, all numbers can easily be worse across the board
than as measured.

Cacheline snoops, and in particular repeated snoops during the
clear/copy operation, will cause far worse overheads than simply being
cache-cold to begin with.

>> It is very little surprise that AVX-512 on Skylake is poor.  The
>> frequency hit from using %zmm is staggering.  IceLake is expected to be
>> better, but almost certainly won't exceed REP MOVSB, which is optimised
>> in microcode for the data width of the CPU.
> Right, much like AVX has improved but didn't get anywhere near
> REP MOVS.

The other thing I forgot to mention is the legacy/VEX pipeline stall
penalty, which will definitely dwarf short operations, and on some CPUs
doesn't even amortise itself over a 4k operation.

Whether a vector algorithm suffers a stall or not typically depends on
the instructions last executed in guest context.

Furthermore, while on the current CPU you might manage to get a vector
algorithm to be faster than an integer one, you will be forcing a
frequency drop on every other CPU in the socket, and the net hit to the
system can be worse than just using the integer algorithm to begin with.


IMO, vector optimised algorithms are a minefield we don't want to go
wandering in, particularly as ERMS is common these days, and here to stay.

>> For memset(), please don't move in the direction of memcpy().  memcpy()
>> is problematic because the common case is likely to be a multiple of 8
>> bytes, meaning that we feed 0 into the REP MOVSB, and this is a hit worth
>> avoiding.
> And you say this despite me having pointed out that REP STOSL may
> be faster in a number of cases? Or do you mean to suggest we should
> branch around the trailing REP {MOV,STO}SB?
>
>>   The "Fast Zero length $FOO" bits on future parts indicate
>> when passing %ecx=0 is likely to be faster than branching around the
>> invocation.
> IOW down the road we could use alternatives patching to remove such
> branches. But this of course is only if we don't end up using
> exclusively REP MOVSB / REP STOSB there anyway, as you seem to be
> suggesting ...
>
>> With ERMS/etc, our logic should be a REP MOVSB/STOSB only, without any
>> cleverness about larger word sizes.  The Linux forms do this fairly well
>> already, and probably better than Xen, although there might be some room
>> for improvement IMO.
> ... here.
>
> As to the Linux implementations - for memcpy_erms() I don't think
> I see any room for improvement in the function itself. We could do
> alternatives patching somewhat differently (and I probably would).
> For memset_erms() the tiny bit of improvement over Linux'es code
> that I would see is to avoid the partial register access when
> loading %al. But to be honest - in both cases I wouldn't have
> bothered looking at their code anyway, if you hadn't pointed me
> there.

Answering several of the points together.

Yes, the partial register access on %al was one thing I spotted, and
movzbl would be an improvement.  The alternatives are a bit weird, but
they're best as they are IMO.  It makes a useful enough difference to
backtraces/etc, and unconditional jmp's are about as close to free as
you can get these days.
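
For concreteness, here is roughly the shape I have in mind - a sketch
only, not the in-tree code, and the function name is mine.  Doing the
zero-extension in C should let the compiler emit a movzbl rather than a
partial write of %al:

    static void *memset_erms_sketch(void *s, int c, size_t n)
    {
        void *d = s;
        /* Zero-extend the fill byte up front (movzbl), instead of the
         * partial-register write which Linux's memset_erms performs. */
        unsigned int fill = (unsigned char)c;

        asm volatile ( "rep stosb"
                       : "+D" (d), "+c" (n)
                       : "a" (fill)
                       : "memory" );

        return s;
    }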

On an ERMS system, we want to use REP MOVSB unilaterally.  It is my
understanding that it is faster across the board than any algorithm
variation trying to use wider accesses.

For non-ERMS systems, the split MOVSQ/MOVSB is still a win, but my
expectation is that conditionally jumping over the latter MOVSB would be
a win.

The "Fast zero length" CPUID bits don't exist for no reason, and while
passing 0 into memcpy/cmp() is exceedingly rare - possibly non-existent
- and not worth optimising, passing a multiple of 8 in probably is worth
optimising.  (Obviously, this depends on the underlying mem*() functions
seeing a multiple of 8 for a meaningful number of their inputs, but I'd
expect this to be the case).
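
To sketch that shape (illustrative names, not a patch): bulk the aligned
part through REP STOSQ, and only fall into the trailing REP STOSB when
there is a remainder.  On parts with the fast-zero-length behaviour,
alternatives patching could drop the branch and simply execute the
zero-length REP STOSB:

    static void *memset_split_sketch(void *s, int c, size_t n)
    {
        void *d = s;
        unsigned long fill = (unsigned char)c * 0x0101010101010101UL;
        unsigned long qwords = n / 8, trail = n % 8;

        asm volatile ( "rep stosq"
                       : "+D" (d), "+c" (qwords)
                       : "a" (fill)
                       : "memory" );

        if ( trail )    /* the branch which fast-zero-length parts make unnecessary */
            asm volatile ( "rep stosb"
                           : "+D" (d), "+c" (trail)
                           : "a" (fill)
                           : "memory" );

        return s;
    }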

>> It is worth noting that we have extra variations of memset/memcpy where
>> __builtin_memcpy() gets expanded inline, and the result is a
>> compiler-chosen sequence, and doesn't hit any of our optimised
>> sequences.  I'm not sure what to do about this, because there is surely
>> a larger win from the cases which can be turned into a single mov, or an
>> elided store/copy, than using a potentially inefficient sequence in the
>> rare cases.  Maybe there is room for a fine-tuning option to say "just
>> call memset() if you're going to expand it inline".
> You mean "just call memset() instead of expanding it inline"?

I think what I really mean is "if the result of optimising memset() is
going to result in a REP instruction, call memset() instead".

You want the compiler to do the conversion to single mov's/etc, but what
you don't want is ...

> If the inline expansion is merely REP STOS, I'm not sure we'd
> actually gain anything from keeping the compiler from expanding it
> inline. But if the inline construct was more complicated (as I
> observe e.g. in map_vcpu_info() with gcc 10), then it would likely
> be nice if there was such a control. I'll take note to see if I
> can find anything.

... this.  What GCC currently expands inline is a REP MOVS{L,Q}, with
the first and final element done manually ahead of the REP, presumably
for prefetching/pagewalk reasons.

The exact sequence varies due to the surrounding code, and while it's
probably a decent stab for -O2/3 on "any arbitrary 64bit CPU", it's not
helpful when we've got a system-optimised mem*() to hand.
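
Purely as illustration of the distinction (not proposed code): a small
fixed-size clear is the case we want the compiler to keep folding into
plain stores, while the page-sized case is where its generic inline
expansion competes with a system-tuned memset()/clear_page():

    #include <stdint.h>
    #include <string.h>

    struct small { uint64_t a, b; };

    void zap_small(struct small *s)
    {
        memset(s, 0, sizeof(*s));   /* typically folds to two 8-byte stores */
    }

    void zap_page(void *p)
    {
        memset(p, 0, 4096);         /* the case we'd rather route out of line */
    }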

> But this isn't relevant for {clear,copy}_page().
>
>> For all set/copy operations, whether you want non-temporal or not
>> depends on when/where the lines are next going to be consumed.  Page
>> scrubbing in idle context is the only example I can think of where we
>> aren't plausibly going to consume the destination imminently.  Even
>> clear/copy page in a hypercall doesn't want to be non-temporal, because
>> chances are good that the vcpu is going to touch the page on return.
> I'm afraid the situation isn't as black-and-white. Take HAP or
> IOMMU page table allocations, for example: They need to clear the
> full page, yes. But often this is just to then insert one single
> entry, i.e. re-use exactly one of the cache lines.

I consider this an orthogonal problem.  When we're not double-scrubbing
most memory Xen uses, most of this goes away.

Even if we do need to scrub a pagetable before use, we're never(?)
complete at the end of the scrub, and need to make further writes
imminently.  These never want non-temporal accesses, because you never
want to write into a recently-evicted line, and there's no plausible way
that trying to mix and match temporal and non-temporal stores is going
to be a worthwhile optimisation to try.
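
For reference, the cache-cold variant under discussion looks roughly like
this (a sketch, assuming PAGE_SIZE from the usual headers; the function
name is mine, and Xen's existing SSE2-based page clear differs in
detail).  The MOVNTI stores bypass the cache, and the trailing SFENCE
orders them ahead of subsequent normal stores:

    static void clear_page_nt_sketch(void *page)
    {
        unsigned long *p = page;
        unsigned int i;

        for ( i = 0; i < PAGE_SIZE / sizeof(*p); i += 4 )
        {
            asm volatile ( "movnti %1, %0" : "=m" (p[i + 0]) : "r" (0UL) );
            asm volatile ( "movnti %1, %0" : "=m" (p[i + 1]) : "r" (0UL) );
            asm volatile ( "movnti %1, %0" : "=m" (p[i + 2]) : "r" (0UL) );
            asm volatile ( "movnti %1, %0" : "=m" (p[i + 3]) : "r" (0UL) );
        }

        asm volatile ( "sfence" ::: "memory" );
    }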

> Or take initial
> population of guest RAM: The larger the guest, the less likely it
> is for every individual page to get accessed again before its
> contents get evicted from the caches. Judging from what Ankur said,
> once we get to around L3 capacity, MOVNT / CLZERO may be preferable
> there.

Initial population of guests doesn't matter at all, because nothing
(outside of the single-threaded toolstack process issuing the
construction hypercalls) is going to touch the pages until the VM is
unpaused.  The only async accesses I can think of are xenstored and
xenconsoled starting up, and those are after the RAM is populated.

In cases like this, current might be a good way of choosing between
temporal and non-temporal accesses.

As before, not double scrubbing will further improve things.

> I think in cases where we don't know how the page is going to be
> used subsequently, we ought to favor latency over cache pollution
> avoidance.

I broadly agree.  I think the cases where it's reasonably safe to use the
pollution-avoidance are fairly obvious, and there is a steep cost to
wrongly using non-temporal accesses.

> But in cases where we know the subsequent usage pattern,
> we may want to direct scrubbing / zeroing accordingly. Yet of
> course it's not very helpful that there's no way to avoid
> polluting caches and still have reasonably low latency, so using
> some heuristics may be unavoidable.

I don't think any heuristics beyond current, or possibly
d->creation_finished are going to be worthwhile, but I think these alone
can net us a decent win.
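
In code terms, the heuristic I have in mind is no more than this (a
sketch; clear_page_nt_sketch is the hypothetical non-temporal variant
above, while is_idle_vcpu(), current and d->creation_finished are the
existing hooks):

    static void scrub_one_page_sketch(const struct domain *d, void *va)
    {
        if ( is_idle_vcpu(current) || (d && !d->creation_finished) )
            clear_page_nt_sketch(va);   /* nothing will consume this imminently */
        else
            clear_page(va);             /* likely touched again on return */
    }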

> And of course another goal of mine would be to avoid double zeroing
> of pages: When scrubbing uses clear_page() anyway, there's no point
> in the caller then calling clear_page() again. IMO, just like we
> have xzalloc(), we should also have MEMF_zero. Internally the page
> allocator can know whether a page was already scrubbed, and it
> does know for sure whether scrubbing means zeroing.

I think we've discussed this before.  I'm in favour, but I'm absolutely
certain that that wants to be spelled MEMF_dirty (or equiv), so forgetting
it fails safe, and code which is using dirty allocations is clearly
identified and can be audited easily.
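
To make the fail-safe shape concrete (MEMF_dirty and page_needs_zeroing()
are hypothetical; clear_domain_page() and page_to_mfn() are the existing
helpers), the allocator-side handling would be along these lines:

    static void maybe_zero_allocation(struct page_info *pg, unsigned int order,
                                      unsigned int memflags)
    {
        unsigned int i;

        if ( memflags & MEMF_dirty )
            return;                     /* caller explicitly accepts dirty pages */

        for ( i = 0; i < (1u << order); i++ )
            if ( page_needs_zeroing(pg + i) )   /* not already scrubbed to zero */
                clear_domain_page(page_to_mfn(pg + i));
    }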

~Andrew




 

