[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH 08/11] x86/shadow: reduce effort of hash calculation


  • To: Jan Beulich <jbeulich@xxxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>
  • From: Andrew Cooper <Andrew.Cooper3@xxxxxxxxxx>
  • Date: Fri, 6 Jan 2023 02:03:07 +0000
  • Accept-language: en-GB, en-US
  • Arc-authentication-results: i=1; mx.microsoft.com 1; spf=pass smtp.mailfrom=citrix.com; dmarc=pass action=none header.from=citrix.com; dkim=pass header.d=citrix.com; arc=none
  • Arc-message-signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector9901; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1; bh=ViXsOb5Flt6Rc1xOWmrEZV/r88DR+Cz+8dlxDG5k1LQ=; b=OAhUO2Nnxts/0Jpnsm3DxWLAdIvnHRHlmoun9CjuLTXoWXxPHE4axLyshJDTwS3L7CsZ64zrhKbQGSOc9gac4qm+ZCQec8hMbEGADscQMSNi3Phv5Zla5YuKUW3dW4zKKEjnrbfpcywNtY5vEpN/UwKbjPX/DQMIDowLoM24K9JPEQF1sdp5ddLmWOmYvpaez4WKeO6xJbFUrOn3AwJJDXQYi2As+X5cAgSLVxDIJQ+78+LutqFekRRsDinxWP2QRw3j1sUX5/GvyLYdksOA/dzteEG4+rds5X7p9qoALSRTuzXfTTGUi/Ap3rlMTeJ2b3bDqKJcd7pdoojA4PYuUw==
  • Arc-seal: i=1; a=rsa-sha256; s=arcselector9901; d=microsoft.com; cv=none; b=jA73pZut7Y+iTb1LnGgibzfH8YEh+g9uaFPV4cpbbfSkhbTKlEqifJfPt7OQm9WkPXkz1tByncsU3BvlDK7WtnWTepxOw4y2sfbyWwrqDQxBAaNOxmLg8nbejVxqTSCajjdQliRt+4hThLNkMqwnxpDddIHzLr4Sbn9glwhk+oOyC4kMrl/yyLABmSw+Bf48cRqFSLOdwZ+sPuJencjYoCzDRkOhtPN8XzDesKYMcwBv+saeHKE3wuSNqiYZPvphcJdHwu/9hKhXSYg0h5RfVCpAWf8sDTSLfXyHilTAXxzI2GuVRmfPvP8s2mjSVe7NjvLVW3YhAbz88JhuEcW0Qw==
  • Authentication-results: dkim=none (message not signed) header.d=none;dmarc=none action=none header.from=citrix.com;
  • Cc: Wei Liu <wl@xxxxxxx>, Roger Pau Monne <roger.pau@xxxxxxxxxx>, "Tim (Xen.org)" <tim@xxxxxxx>, George Dunlap <George.Dunlap@xxxxxxxxxx>
  • Delivery-date: Fri, 06 Jan 2023 02:03:17 +0000
  • Ironport-data: A9a23:/bRAp6qqAkcLlqOrsYvR30hrLBheBmIpZBIvgKrLsJaIsI4StFCzt garIBnTa/nba2f0e9h3b9vi8k4Dv5HQyYVjTgo++yw2FihH85uZCYyVIHmrMnLJJKUvbq7FA +Y2MYCccZ9uHhcwgj/3b9ANeFEljfngqoLUUbKCYWYpAFc+E0/NsDo788YhmIlknNOlNA2Ev NL2sqX3NUSsnjV5KQr40YrawP9UlKm06W1wUmAWP6gR5weHzyRNV/rzGInqR5fGatgMdgKFb 76rIIGRpgvx4xorA9W5pbf3GmVirmn6ZFXmZtJ+AsBOszAazsAA+v9T2Mk0MC+7vw6hjdFpo OihgLTrIesf0g8gr8xGO/VQO3kW0aSrY9YrK1Dn2SCY5xWun3cBX5yCpaz5VGEV0r8fPI1Ay RAXAC4UaRuljOOT+aKAafVGlOIDKtjlF5xK7xmMzRmBZRonabbqZvySoPN9gnI3jM0IGuvCb c0EbzYpdA7HfxBEJlYQDtQ5gfusgX78NTZfrTp5p4JuuzSVkFM3jeiraYKPEjCJbZw9ckKwj 2TK5WnmRDodM8SS02Gt+XOwnO7f2yj8Xer+EZXpr64w2wLKmgT/DjURaWCln9eh1neccMl+M FQ4+wc1qbIboRnDot7VGkfQTGS/lhwWVsdUEuY6wBqQ0aeS6AGcbkAUQzgEZNE4ucseQT0xy kTPj97vHSZosrCeVTSa7Lj8hSy2ETgYKykFfyBsZRcE5vHzrYd1iQjAJuuPC4awh9zxXD31n TaDqXFmg61J1JZRkaKm4VrAnjSg4IDTSRI47RnWWWTj6R5lYImiZMqj7l2zAet8Ebt1h2Kp5 BAs8/VyJshXZX1RvERhmNkwIYw=
  • Ironport-hdrordr: A9a23:8CjS0apxrdXtxfT3aHuefk8aV5oaeYIsimQD101hICG9E/bo9f xG+c5xvyMc5wx9ZJheo6HlBEDtex/hHOdOkO4s1NSZLWrbUQmTTb2KhLGKqwEIfReQygc378 ddmsZFZuEZ8jBB/KPHCQODYrAdKHDuytHQuQ7W9QYUcT1X
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>
  • Thread-index: AQHZIR+fQDluUga6GUWdpW8jT2W5ga6Qo8kA
  • Thread-topic: [PATCH 08/11] x86/shadow: reduce effort of hash calculation

On 05/01/2023 4:05 pm, Jan Beulich wrote:
> The "n" input is a GFN value and hence bounded by the physical address
> bits in use on a system.

The one case where this isn't obviously true is in sh_audit().  It comes
from a real MFN in the system, not a GFN, which will have the same
property WRT PADDR_BITS.

>  The hash quality won't improve by also
> including the upper always-zero bits in the calculation. To keep things
> as compile-time-constant as they were before, use PADDR_BITS (not
> paddr_bits) for loop bounding. This reduces loop iterations from 8 to 5.

While this is all true, you'll get a much better improvement by not
forcing 'n' onto the stack just to access it bytewise.  Right now, the
loop looks like:

<shadow_hash_insert>:
    48 83 ec 10                 sub    $0x10,%rsp
    49 89 c9                    mov    %rcx,%r9
    41 89 d0                    mov    %edx,%r8d
    48 8d 44 24 08              lea    0x8(%rsp),%rax
    48 8d 4c 24 10              lea    0x10(%rsp),%rcx
    48 89 74 24 08              mov    %rsi,0x8(%rsp)
    0f 1f 80 00 00 00 00        nopl   0x0(%rax)
/-> 0f b6 10                    movzbl (%rax),%edx
|   48 83 c0 01                 add    $0x1,%rax
|   45 69 c0 3f 00 01 00        imul   $0x1003f,%r8d,%r8d
|   41 01 d0                    add    %edx,%r8d
|   48 39 c1                    cmp    %rax,%rcx
\-- 75 ea                       jne    ffff82d0402efda0
<shadow_hash_insert+0x20>


which doesn't even have a compile-time constant loop bound.  It's
runtime calculated by the second lea constructing the upper pointer bound.

Given this further delta:

diff --git a/xen/arch/x86/mm/shadow/common.c
b/xen/arch/x86/mm/shadow/common.c
index 4a8bcec10fe8..902c749f2724 100644
--- a/xen/arch/x86/mm/shadow/common.c
+++ b/xen/arch/x86/mm/shadow/common.c
@@ -1397,13 +1397,12 @@ static unsigned int shadow_get_allocation(struct
domain *d)
 typedef u32 key_t;
 static inline key_t sh_hash(unsigned long n, unsigned int t)
 {
-    unsigned char *p = (unsigned char *)&n;
     key_t k = t;
     int i;
 
     BUILD_BUG_ON(PADDR_BITS > BITS_PER_LONG + PAGE_SHIFT);
-    for ( i = 0; i < (PADDR_BITS - PAGE_SHIFT + 7) / 8; i++ )
-        k = p[i] + (k << 6) + (k << 16) - k;
+    for ( i = 0; i < (PADDR_BITS - PAGE_SHIFT + 7) / 8; i++, n >>= 8 )
+        k = (uint8_t)n + (k << 6) + (k << 16) - k;
 
     return k % SHADOW_HASH_BUCKETS;
 }

the code gen becomes:

<shadow_hash_insert>:
    41 89 d0                    mov    %edx,%r8d
    49 89 c9                    mov    %rcx,%r9
    b8 05 00 00 00              mov    $0x5,%eax
/-> 45 69 c0 3f 00 01 00        imul   $0x1003f,%r8d,%r8d
|   40 0f b6 d6                 movzbl %sil,%edx
|   48 c1 ee 08                 shr    $0x8,%rsi
|   41 01 d0                    add    %edx,%r8d
|   83 e8 01                    sub    $0x1,%eax
\-- 75 e9                       jne    ffff82d0402efd8b
<shadow_hash_insert+0xb>

with an actual constant loop bound, and not a memory operand in sight. 
This form (even at 8 iterations) will easily execute faster than the
stack-spilled form.

~Andrew

 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.