Re: [PATCH 2/7] x86: introduce ioremap_wc()


  • To: Jan Beulich <jbeulich@xxxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>
  • From: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
  • Date: Tue, 27 Apr 2021 18:13:46 +0100
  • Cc: Wei Liu <wl@xxxxxxx>, Roger Pau Monné <roger.pau@xxxxxxxxxx>
  • Delivery-date: Tue, 27 Apr 2021 17:14:04 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 27/04/2021 13:54, Jan Beulich wrote:
> In order for a to-be-introduced ERMS form of memcpy() to not regress
> boot performance on certain systems when video output is active, we
> first need to arrange for avoiding further dependency on firmware
> setting up MTRRs in a way we can actually further modify. On many
> systems, due to the continuously growing amounts of installed memory,
> MTRRs get configured with at least one huge WB range, and with MMIO
> ranges below 4Gb then forced to UC via overlapping MTRRs. mtrr_add(), as
> it is today, can't deal with such a setup. Hence on such systems we
> presently leave the frame buffer mapped UC, leading to significantly
> reduced performance when using REP STOSB / REP MOVSB.
>
> On post-PentiumII hardware (i.e. any that's capable of running 64-bit
> code), an effective memory type of WC can be achieved without MTRRs, by
> simply referencing the respective PAT entry from the PTEs. While this
> will leave the switch to ERMS forms of memset() and memcpy() with
> largely unchanged performance, the change here on its own improves
> performance on affected systems quite significantly: Measuring just the
> individual affected memcpy() invocations yielded a speedup by a factor
> of over 250 on my initial (Skylake) test system. memset() isn't getting
> improved by as much there, but still by a factor of about 20.
>
> While adding {__,}PAGE_HYPERVISOR_WC, also add {__,}PAGE_HYPERVISOR_WT
> to, at the very least, make clear what PTE flags this memory type uses.
>
> Signed-off-by: Jan Beulich <jbeulich@xxxxxxxx>
> ---

Seeing as MTRRs are full of firmware issues, shouldn't we also
cross-check that the VRAM is marked WC?  Otherwise we'll still fall into
bad performance from combining down to UC.  (Obviously follow-up work if
so.)

> TBD: Both callers are __init, so in principle ioremap_wc() could be so,
>      too, at least for the time being.

I don't see us making use of this at runtime.  Uses of WC are few and
far between.

> TBD: If the VGA range is WC in the fixed range MTRRs, reusing the low
>      1st Mb mapping (like ioremap() does) would be an option.

It might be fine to do that unconditionally.  The low VRAM has had known
settings for 2 decades now.

That said, the low 1MB does use UC- mappings, which means we're entirely
dependent on MTRRs specifying WC for sensible performance.
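
To make the UC- dependence concrete, here is a minimal sketch of how a
PAT type and an MTRR type combine, based on my reading of the vendor
combining tables (worth checking against the manuals).  The enum and
function names are made up for illustration and are not Xen code; only
the PAT rows relevant to this thread are modelled.  The WC row follows
the patch description's statement that WC can be had without MTRRs.

```c
#include <assert.h>

/* Memory types as named in this thread; the values are arbitrary. */
enum memtype { MT_UC, MT_UC_MINUS, MT_WC, MT_WT, MT_WP, MT_WB };

/*
 * Effective type for a PTE whose PAT entry selects 'pat', over a region
 * whose MTRRs specify 'mtrr'.  Only the UC, UC- and WC PAT rows are
 * modelled; any other PAT row trips the assertion.
 */
static enum memtype effective_type(enum memtype pat, enum memtype mtrr)
{
    assert(mtrr != MT_UC_MINUS);    /* UC- exists only in the PAT */

    switch ( pat )
    {
    case MT_UC:
        return MT_UC;               /* strong UC always wins */
    case MT_UC_MINUS:
        /* UC- defers to the MTRR only when the MTRR says WC. */
        return mtrr == MT_WC ? MT_WC : MT_UC;
    case MT_WC:
        return MT_WC;               /* PAT WC applies whatever the MTRR says */
    default:
        assert(!"PAT row not modelled in this sketch");
        return MT_UC;
    }
}
```

Under this model a UC- mapping of the low 1MB only ends up WC when the
fixed-range MTRRs say WC, and is UC in every other case.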

> --- a/xen/arch/x86/mm.c
> +++ b/xen/arch/x86/mm.c
> @@ -5883,6 +5883,20 @@ void __iomem *ioremap(paddr_t pa, size_t
>      return (void __force __iomem *)va;
>  }
>  
> +void __iomem *ioremap_wc(paddr_t pa, size_t len)
> +{
> +    mfn_t mfn = _mfn(PFN_DOWN(pa));
> +    unsigned int offs = pa & (PAGE_SIZE - 1);
> +    unsigned int nr = PFN_UP(offs + len);
> +    void *va;
> +
> +    WARN_ON(page_is_ram_type(mfn_x(mfn), RAM_TYPE_CONVENTIONAL));
> +
> +    va = __vmap(&mfn, nr, 1, 1, PAGE_HYPERVISOR_WC, VMAP_DEFAULT);

This doesn't look correct.  granularity and nr are passed the wrong way
around, but maybe that's related to the fact that only a single mfn is
passed.  I'm confused.
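
For reference, a toy model of the arithmetic in the hunk, and of what I
assume the argument order to be, namely __vmap(mfn, granularity, nr,
align, flags, type), with "nr" chunks of "granularity" contiguous pages
each.  That ordering is an assumption to verify against vmap.h; all
demo_* names are hypothetical and this is not the real implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT 12
#define DEMO_PAGE_SIZE  (1UL << DEMO_PAGE_SHIFT)

/* Page frame containing physical address pa. */
static uint64_t demo_first_frame(uint64_t pa)
{
    return pa >> DEMO_PAGE_SHIFT;
}

/* Number of pages needed to cover [pa, pa + len), done in 64 bits. */
static uint64_t demo_nr_pages(uint64_t pa, uint64_t len)
{
    uint64_t offs = pa & (DEMO_PAGE_SIZE - 1);

    return (offs + len + DEMO_PAGE_SIZE - 1) >> DEMO_PAGE_SHIFT;
}

/*
 * Assumed semantics: 'nr' chunks, each of 'granularity' physically
 * contiguous pages, chunk i starting at frame mfn[i].  Returns the
 * frame backing the idx-th page of the resulting virtual range.
 */
static uint64_t demo_frame_at(const uint64_t *mfn, unsigned int granularity,
                              unsigned int nr, unsigned int idx)
{
    assert(granularity && idx / granularity < nr);

    return mfn[idx / granularity] + idx % granularity;
}
```

If that reading holds, passing a single mfn with granularity set to the
page count and nr set to 1 describes one physically contiguous run,
which would match only one mfn being supplied.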

Also, several truncations will occur for a framebuffer > 4G, both with
calculations here, and the types of __vmap()'s parameters.

> +
> +    return (void __force __iomem *)(va + offs);
> +}
> +
>  int create_perdomain_mapping(struct domain *d, unsigned long va,
>                               unsigned int nr, l1_pgentry_t **pl1tab,
>                               struct page_info **ppg)
> --- a/xen/drivers/video/vesa.c
> +++ b/xen/drivers/video/vesa.c
> @@ -9,9 +9,9 @@
>  #include <xen/param.h>
>  #include <xen/xmalloc.h>
>  #include <xen/kernel.h>
> +#include <xen/mm.h>
>  #include <xen/vga.h>
>  #include <asm/io.h>
> -#include <asm/page.h>
>  #include "font.h"
>  #include "lfb.h"
>  
> @@ -103,7 +103,7 @@ void __init vesa_init(void)
>      lfbp.text_columns = vlfb_info.width / font->width;
>      lfbp.text_rows = vlfb_info.height / font->height;
>  
> -    lfbp.lfb = lfb = ioremap(lfb_base(), vram_remap);
> +    lfbp.lfb = lfb = ioremap_wc(lfb_base(), vram_remap);
>      if ( !lfb )
>          return;
>  
> @@ -179,8 +179,7 @@ void __init vesa_mtrr_init(void)
>  
>  static void lfb_flush(void)
>  {
> -    if ( vesa_mtrr == 3 )
> -        __asm__ __volatile__ ("sfence" : : : "memory");
> +    __asm__ __volatile__ ("sfence" : : : "memory");

wmb(), seeing as that is the operation we mean here?
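
Roughly what I have in mind, as a standalone sketch.  demo_wmb() is a
hypothetical stand-in for the tree's wmb(); I'm assuming the x86 one
amounts to an SFENCE with a compiler memory clobber, and the non-x86
fallback is only there so the sketch builds elsewhere.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the hypervisor's write barrier. */
#if defined(__x86_64__) || defined(__i386__)
# define demo_wmb() __asm__ __volatile__ ( "sfence" ::: "memory" )
#else
# define demo_wmb() __atomic_thread_fence(__ATOMIC_RELEASE)
#endif

/* The flush names the operation it wants rather than the instruction. */
static void demo_lfb_flush(void)
{
    demo_wmb();
}

/* Fill a (pretend) frame buffer line, then flush; returns bytes written. */
static size_t demo_put_line(volatile uint8_t *fb, size_t len, uint8_t val)
{
    for ( size_t i = 0; i < len; i++ )
        fb[i] = val;

    demo_lfb_flush();

    return len;
}
```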

>  }
>  
>  void __init vesa_endboot(bool_t keep)
> --- a/xen/drivers/video/vga.c
> +++ b/xen/drivers/video/vga.c
> @@ -79,7 +79,7 @@ void __init video_init(void)
>      {
>      case XEN_VGATYPE_TEXT_MODE_3:
>          if ( page_is_ram_type(paddr_to_pfn(0xB8000), RAM_TYPE_CONVENTIONAL) ||
> -             ((video = ioremap(0xB8000, 0x8000)) == NULL) )
> +             ((video = ioremap_wc(0xB8000, 0x8000)) == NULL) )
>              return;
>          outw(0x200a, 0x3d4); /* disable cursor */
>          columns = vga_console_info.u.text_mode_3.columns;
> @@ -164,7 +164,11 @@ void __init video_endboot(void)
>      {
>      case XEN_VGATYPE_TEXT_MODE_3:
>          if ( !vgacon_keep )
> +        {
>              memset(video, 0, columns * lines * 2);
> +            iounmap(video);
> +            video = ZERO_BLOCK_PTR;
> +        }
>          break;

Shouldn't this hunk be in patch 5?

>      case XEN_VGATYPE_VESA_LFB:
>      case XEN_VGATYPE_EFI_LFB:
> --- a/xen/include/asm-x86/mm.h
> +++ b/xen/include/asm-x86/mm.h
> @@ -615,6 +615,8 @@ void destroy_perdomain_mapping(struct do
>                                 unsigned int nr);
>  void free_perdomain_mappings(struct domain *);
>  
> +void __iomem *ioremap_wc(paddr_t, size_t);

I'm not sure we want to add a second prototype.  ARM has ioremap_wc()
too, and we absolutely don't want them to get out of sync, and we have
two new architectures on the horizon.

Perhaps a new xen/ioremap.h which includes asm/ioremap.h?  (Although,
thinking forward to encrypted RAM, we might want something which can
also encompass the memremap*() variants.)

ARM can #define ioremap_wc ioremap_wc and provide their inline wrapper. 
x86 can fall back to the common prototype.  Other architectures can do
whatever is best for them.
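
A single-file sketch of that pattern, purely illustrative.  The demo_*
names, the attribute enum and the DEMO_ARCH_HAS_INLINE_WC switch are all
made up, __iomem is dropped so it builds standalone, and the comments
mark which part would live in which (proposed, not yet existing) header.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t demo_paddr_t;

enum demo_attr { DEMO_ATTR_NONE, DEMO_ATTR_UC, DEMO_ATTR_WC };

static enum demo_attr demo_last_attr = DEMO_ATTR_NONE;
static unsigned char demo_window[64];

/* Stand-in for an arch's generic "map with attributes" primitive. */
static void *demo_ioremap_attr(demo_paddr_t pa, size_t len,
                               enum demo_attr attr)
{
    (void)pa;

    if ( len > sizeof(demo_window) )
        return NULL;

    demo_last_attr = attr;

    return demo_window;
}

/* --- what an arch header with an inline wrapper might contain --- */
#define DEMO_ARCH_HAS_INLINE_WC 1
#if DEMO_ARCH_HAS_INLINE_WC
static inline void *ioremap_wc(demo_paddr_t pa, size_t len)
{
    return demo_ioremap_attr(pa, len, DEMO_ATTR_WC);
}
#define ioremap_wc ioremap_wc   /* tell the common header it is provided */
#endif

/* --- what the common header might contain --- */
#ifndef ioremap_wc
/* Single shared prototype; the arch supplies an out-of-line definition. */
void *ioremap_wc(demo_paddr_t pa, size_t len);
#endif
```

With the switch set, the common prototype is skipped and callers get the
inline wrapper; with it clear, every such architecture shares the one
prototype, so the declarations cannot drift apart.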

~Andrew