
Re: [Xen-devel] [PATCH] Xen: Spread boot time page scrubbing across all available CPU's (v6)



On 13/06/14 19:05, Konrad Rzeszutek Wilk wrote:
> From: Malcolm Crossley <malcolm.crossley@xxxxxxxxxx>
>
> The page scrubbing is done in 128MB chunks in lockstep across all the
> non-SMT CPUs. This allows the boot CPU to hold the heap_lock whilst each
> chunk is being scrubbed, and then release the heap_lock once the CPUs have
> finished scrubbing their individual chunks. This means the heap_lock is
> not held continuously and pending softirqs are serviced periodically
> across the CPUs.
>
> The page scrub memory chunks are allocated to the CPUs in a NUMA-aware
> fashion to reduce socket interconnect overhead and improve performance.
> Specifically in the first phase we scrub at the same time on all the
> NUMA nodes that have CPUs - we also weed out the SMT threads so that
> we only use cores (that gives a 50% boost). The second phase is for NUMA
> nodes that have no CPUs - for that we use the closest NUMA node's CPUs
> (non-SMT again) to do the job.
>
> This patch reduces the boot page scrub time on a 128GB 64 core AMD Opteron
> 6386 machine from 49 seconds to 3 seconds.
> On an IvyBridge-EX 8 socket box with 1.5TB it cuts it down from 15 minutes
> to 63 seconds.
>
> Signed-off-by: Malcolm Crossley <malcolm.crossley@xxxxxxxxxx>
> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>
> Reviewed-by: Tim Deegan <tim@xxxxxxx>

Functionally, Reviewed-by: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>

A few minor nits...

>
> ---
>
> v2
>  - Reduced default chunk size to 128MB
>  - Added code to scrub NUMA nodes with no active CPU linked to them
>  - Be robust to boot CPU not being linked to a NUMA node
>
> v3:
>  - Don't use SMT threads
>  - Take care of remainder if the number of CPUs (or memory) is odd
>  - Restructure the worker thread
>  - s/u64/unsigned long/
>
> v4:
>  - Don't use all CPUs on non-CPU NUMA nodes, just closest one
>  - Syntax, and docs updates
>  - Compile on ARM
>  - Fix bug when NUMA node has 0 pages
>
> v5:
>  - Properly figure out best NUMA node.
>  - Fix comments to be proper style.
>
> v6:
>  - Missing page shift on default values, optimize cpumask usage.
>  - Fix case of NODE having one page and below first_valid_mfn
>  - Add Ack
> ---
>  docs/misc/xen-command-line.markdown |   10 ++
>  xen/common/page_alloc.c             |  211 ++++++++++++++++++++++++++++++++---
>  xen/include/asm-arm/numa.h          |    1 +
>  3 files changed, 204 insertions(+), 18 deletions(-)
>
> diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown
> index 25829fe..509f462 100644
> --- a/docs/misc/xen-command-line.markdown
> +++ b/docs/misc/xen-command-line.markdown
> @@ -198,6 +198,16 @@ Scrub free RAM during boot.  This is a safety feature to prevent
>  accidentally leaking sensitive VM data into other VMs if Xen crashes
>  and reboots.
>  
> +### bootscrub_chunk

_ needs escaping with a backslash

> +> `= <size>`
> +
> +> Default: `134217728`

Due to the implicit 'k' suffix on sizes, this is not a useful hint as to
how to change the default.

Furthermore, `128M` is substantially clearer.

> +
> +Maximum size of the RAM chunks scrubbed whilst holding the page heap lock
> +and not running softirqs. Reduce this if softirqs are not being run
> +frequently enough. Setting this to a high value may cause boot failure,
> +particularly if the NMI watchdog is also enabled.
> +
>  ### cachesize
>  > `= <size>`
>  
> diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
> ...
> - * Scrub all unallocated pages in all heap zones. This function is more
> - * convoluted than appears necessary because we do not want to continuously
> - * hold the lock while scrubbing very large memory areas.
> + * Scrub all unallocated pages in all heap zones. This function uses all
> + * online CPUs to scrub the memory in parallel.
>   */
>  void __init scrub_heap_pages(void)
>  {
> -    unsigned long mfn;
> -    struct page_info *pg;
> +    cpumask_t node_cpus, all_worker_cpus;
> +    unsigned int i, j;
> +    unsigned long offset, max_per_cpu_sz = 0;
> +    unsigned long start, end;
> +    unsigned long rem = 0;
> +    int last_distance, best_node;
> +    int cpus;
>  
>      if ( !opt_bootscrub )
>          return;
>  
> -    printk("Scrubbing Free RAM: ");
> +    cpumask_clear(&all_worker_cpus);
> +    /* Scrub block size. */
> +    chunk_size = opt_bootscrub_chunk >> PAGE_SHIFT;
> +    if ( chunk_size == 0 )
> +        chunk_size = MB(128) >> PAGE_SHIFT;
> +
> +    /* Round #0 - figure out amounts and which CPUs to use. */
> +    for_each_online_node ( i )
> +    {
> +        if ( !node_spanned_pages(i) )
> +            continue;
> +        /* Calculate Node memory start and end address. */
> +        start = max(node_start_pfn(i), first_valid_mfn);
> +        end = min(node_start_pfn(i) + node_spanned_pages(i), max_page);
> +        /* Just in case NODE has 1 page and starts below first_valid_mfn. */
> +        end = max(end, start);
> +        /* CPUs that are online and on this node (if none, that is OK). */
> +        cpus = find_non_smt(i, &node_cpus);
> +        cpumask_or(&all_worker_cpus, &all_worker_cpus, &node_cpus);
> +        if ( cpus <= 0 )
> +        {
> +            /* No CPUs on this node. Round #2 will take care of it. */
> +            rem = 0;
> +            region[i].per_cpu_sz = (end - start);
> +        }
> +        else
> +        {
> +            rem = (end - start) % cpumask_weight(&node_cpus);
> +            region[i].per_cpu_sz = (end - start) / cpumask_weight(&node_cpus);

The 'cpus' variable still holds cpumask_weight(&node_cpus), and is
rather more efficient to use.

> +            if ( region[i].per_cpu_sz > max_per_cpu_sz )
> +                max_per_cpu_sz = region[i].per_cpu_sz;
> +        }
> +        region[i].start = start;
> +        region[i].rem = rem;
> +        cpumask_copy(&region[i].cpus, &node_cpus);
> +    }
> +
> +    printk("Scrubbing Free RAM on %u nodes using %u CPUs\n",
> +           num_online_nodes(), cpumask_weight(&all_worker_cpus));

Both of these are signed quantities, rather than unsigned.

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel
