
Re: IRQ latency measurements in hypervisor


  • To: Julien Grall <julien@xxxxxxx>
  • From: Volodymyr Babchuk <Volodymyr_Babchuk@xxxxxxxx>
  • Date: Wed, 20 Jan 2021 23:03:47 +0000
  • Accept-language: en-US
  • Cc: Stefano Stabellini <stefano.stabellini@xxxxxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>, Julien Grall <jgrall@xxxxxxxxxx>, Dario Faggioli <dario.faggioli@xxxxxxxx>, "Bertrand.Marquis@xxxxxxx" <Bertrand.Marquis@xxxxxxx>, "andrew.cooper3@xxxxxxxxxx" <andrew.cooper3@xxxxxxxxxx>
  • Delivery-date: Wed, 20 Jan 2021 23:04:06 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>
  • Thread-index: AQHW6T1firbO0EOYpEKUiW6vsavsPaonyNIAgADL04CAAEPVAIAAGKmAgAg9ZQA=
  • Thread-topic: IRQ latency measurements in hypervisor

Hi Julien,


Julien Grall writes:

> On 15/01/2021 15:45, Volodymyr Babchuk wrote:
>> Hi Julien,
>> Julien Grall writes:
>> 
>>> Hi Volodymyr, Stefano,
>>>
>>> On 14/01/2021 23:33, Stefano Stabellini wrote:
>>>> + Bertrand, Andrew (see comment on alloc_heap_pages())
>>>
>>> Long running hypercalls are usually considered security issues.
>>>
>>> In this case, only the control domain can issue a large memory
>>> allocation (2GB at a time). A guest would only be able to allocate 2MB
>>> at a time, so from the numbers below, it would only take 1ms max.
>>>
>>> So I think we are fine here. Next time you find a large loop, please
>>> provide an explanation why it is not a security issue (e.g. cannot
>>> be used by guests) or send an email to the Security Team if in doubt.
>> Sure. In this case I took into account that only the control domain can
>> issue this call, I just didn't state this explicitly. Next time I will
>> do so.
>
> I am afraid that's not correct. The guest can request to populate a
> region. This is used for instance in the ballooning case.
>
> The main difference is that a non-privileged guest will not be able to
> do an allocation larger than 2MB.

I did some measurements. According to them, an allocation of order 9
takes about 265us on my HW. I cover this in detail at the end of this
email.

>>>> This is very interesting too. Did you get any spikes with the period
>>>> set to 100us? It would be fantastic if there were none.
>>>>
>>>>> 3. Huge latency spike during domain creation. I conducted some
>>>>>      additional tests, including use of PV drivers, but this didn't
>>>>>      affect the latency in my "real time" domain. But an attempt to
>>>>>      create another domain with a relatively large memory size of 2GB
>>>>>      led to a huge spike in latency. Debugging led to this call path:
>>>>>
>>>>>      XENMEM_populate_physmap -> populate_physmap() ->
>>>>>      alloc_domheap_pages() -> alloc_heap_pages()-> huge
>>>>>      "for ( i = 0; i < (1 << order); i++ )" loop.
>>>
>>> There are two for loops in alloc_heap_pages() using this syntax. Which
>>> one are you referring to?
>> I did some tracing with Lauterbach. It pointed to the first loop and
>> especially to the flush_page_to_ram() call, if I remember correctly.
>
> Thanks, I am not entirely surprised because we are cleaning and
> invalidating the region line by line and across all the CPUs.
>
> Assuming a 128-byte cacheline, we will need to issue 32 cache
> instructions per page. This is going to involve quite a bit of traffic
> on the system.
>
> One possibility would be to defer the cache flush when the domain is
> created and use the hypercall XEN_DOMCTL_cacheflush to issue the
> flush.

Can we flush caches on the first access to a page? What I mean is: do not
populate the stage 2 tables with the allocated memory right away; flush
the caches in the abort handler and only then populate the tables.
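
Something along these lines, purely a sketch of the idea and not existing
Xen code (handle_guest_stage2_fault(), lookup_deferred_page() and
map_deferred_page() are hypothetical helpers):

/*
 * Hypothetical sketch: defer the cache flush until the guest first
 * touches a page.  Ownership is assigned at allocation time, but the
 * stage 2 mapping is left empty; the translation fault handler then
 * flushes and maps on demand.
 */
static void handle_guest_stage2_fault(struct domain *d, gfn_t gfn)   /* hypothetical */
{
    /* Look up the page that was reserved for this gfn at allocation time. */
    struct page_info *pg = lookup_deferred_page(d, gfn);             /* hypothetical */

    if ( !pg )
        return; /* a genuine fault, handle it as before */

    /* Flush only the page that is actually being accessed. */
    flush_page_to_ram(mfn_x(page_to_mfn(pg)), true);

    /* Install the stage 2 mapping so further accesses do not fault. */
    map_deferred_page(d, gfn, pg);                                    /* hypothetical */
}

That would spread the flushing cost over the runtime of the new domain
instead of paying it all at once under the heap lock.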

> Note that XEN_DOMCTL_cacheflush would need some modification to be
> preemptible. But at least, it will work on a GFN which is easier to
> track.
>


>>>>> I managed to overcome issue #3 by commenting out all calls to
>>>>> populate_one_size() except the populate_one_size(PFN_4K_SHIFT) in
>>>>> xg_dom_arm.c. This lengthened domain construction, but my "RT" domain
>>>>> didn't experience such big latency issues. Apparently all other
>>>>> hypercalls which are used during domain creation are either fast or
>>>>> preemptible. No doubt my hack leads to page table inflation and an
>>>>> overall performance drop.
>>>> I think we need to follow this up and fix this. Maybe just by adding
>>>> a hypercall continuation to the loop.
>>>
>>> When I read "hypercall continuation", I read we will return to the
>>> guest context so it can process interrupts and potentially switch to
>>> another task.
>>>
>>> This means that the guest could issue a second populate_physmap() from
>>> the vCPU. Therefore any restart information should be part of the
>>> hypercall parameters. So far, I don't see how this would be possible.
>>>
>>> Even if we overcome that part, this can be easily abused by a guest as
>>> the memory is not yet accounted to the domain. Imagine a guest that
>>> never requests the continuation of the populate_physmap(). So we would
>>> need to block the vCPU until the allocation is finished.
>> Moreover, most of alloc_heap_pages() sits under a spinlock, so the
>> first step would be to split this function into smaller atomic parts.
>
> Do you have any suggestion how to split it?
>

Well, it is quite a complex function and I can't tell right away.
At this time I don't quite understand why spin_unlock() is called only
after the first (1 << order) loop, for instance.

Also, this function does many different things for such a simple name.
>> 
>>> I think the first step is we need to figure out which part of the
>>> allocation is slow (see my question above). From there, we can figure
>>> out if there is a way to reduce the impact.
>> I'll do more tracing and will return with more accurate numbers.
>> But as far as I can see, any loop over 262144 pages will take some time.
>
>
> It really depends on the content of the loop. On any modern
> processor, you are very likely not going to notice a loop that updates
> just a flag.
>
> However, you are likely going to see an impact if your loop is
> going to clean & invalidate the cache for each page.
>

Totally agree. I used the Xen tracing subsystem to do the measurements,
and I can confirm that the call to flush_page_to_ram() causes most of
the impact.
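
Just to put the cacheline estimate above into numbers, here is a quick
back-of-the-envelope calculation (a standalone sketch, assuming 4 KiB
pages and the 128-byte cacheline mentioned above; this is not Xen code):

#include <stdio.h>

/* Clean+invalidate operations per allocation, assuming a 4 KiB page
 * and a 128-byte cacheline. */
int main(void)
{
    const unsigned long lines_per_page = 4096 / 128;   /* 32 */
    const unsigned int orders[] = { 0, 9, 18 };
    unsigned int i;

    for ( i = 0; i < 3; i++ )
    {
        unsigned long pages = 1UL << orders[i];

        printf("order %2u: %8lu pages -> ~%lu cache maintenance ops\n",
               orders[i], pages, pages * lines_per_page);
    }

    return 0;
}

So the order 18 case issues on the order of eight million clean and
invalidate operations while holding the heap lock, which is consistent
with the ~133ms observed below.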


Here are the details:


I added a number of trace points to the function:

static struct page_info *alloc_heap_pages(
    unsigned int zone_lo, unsigned int zone_hi,
    unsigned int order, unsigned int memflags,
    struct domain *d)
{
    nodeid_t node;
    unsigned int i, buddy_order, zone, first_dirty;
    unsigned long request = 1UL << order;
    struct page_info *pg;
    bool need_tlbflush = false;
    uint32_t tlbflush_timestamp = 0;
    unsigned int dirty_cnt = 0;

    /* Make sure there are enough bits in memflags for nodeID. */
    BUILD_BUG_ON((_MEMF_bits - _MEMF_node) < (8 * sizeof(nodeid_t)));

    ASSERT(zone_lo <= zone_hi);
    ASSERT(zone_hi < NR_ZONES);

    if ( unlikely(order > MAX_ORDER) )
        return NULL;

    spin_lock(&heap_lock);

    TRACE_1D(TRC_PGALLOC_PT1, order); // <=================================

    /*
     * Claimed memory is considered unavailable unless the request
     * is made by a domain with sufficient unclaimed pages.
     */
    if ( (outstanding_claims + request > total_avail_pages) &&
          ((memflags & MEMF_no_refcount) ||
           !d || d->outstanding_pages < request) )
    {
        spin_unlock(&heap_lock);
        return NULL;
    }

    pg = get_free_buddy(zone_lo, zone_hi, order, memflags, d);
    /* Try getting a dirty buddy if we couldn't get a clean one. */
    if ( !pg && !(memflags & MEMF_no_scrub) )
        pg = get_free_buddy(zone_lo, zone_hi, order,
                            memflags | MEMF_no_scrub, d);
    if ( !pg )
    {
        /* No suitable memory blocks. Fail the request. */
        spin_unlock(&heap_lock);
        return NULL;
    }

    TRACE_0D(TRC_PGALLOC_PT2); // <=================================

    node = phys_to_nid(page_to_maddr(pg));
    zone = page_to_zone(pg);
    buddy_order = PFN_ORDER(pg);

    first_dirty = pg->u.free.first_dirty;

    /* We may have to halve the chunk a number of times. */
    while ( buddy_order != order )
    {
        buddy_order--;
        page_list_add_scrub(pg, node, zone, buddy_order,
                            (1U << buddy_order) > first_dirty ?
                            first_dirty : INVALID_DIRTY_IDX);
        pg += 1U << buddy_order;

        if ( first_dirty != INVALID_DIRTY_IDX )
        {
            /* Adjust first_dirty */
            if ( first_dirty >= 1U << buddy_order )
                first_dirty -= 1U << buddy_order;
            else
                first_dirty = 0; /* We've moved past original first_dirty */
        }
    }

    TRACE_0D(TRC_PGALLOC_PT3); // <=================================

    ASSERT(avail[node][zone] >= request);
    avail[node][zone] -= request;
    total_avail_pages -= request;
    ASSERT(total_avail_pages >= 0);

    check_low_mem_virq();

    if ( d != NULL )
        d->last_alloc_node = node;

    for ( i = 0; i < (1 << order); i++ )
    {
        /* Reference count must continuously be zero for free pages. */
        if ( (pg[i].count_info & ~PGC_need_scrub) != PGC_state_free )
        {
            printk(XENLOG_ERR
                   "pg[%u] MFN %"PRI_mfn" c=%#lx o=%u v=%#lx t=%#x\n",
                   i, mfn_x(page_to_mfn(pg + i)),
                   pg[i].count_info, pg[i].v.free.order,
                   pg[i].u.free.val, pg[i].tlbflush_timestamp);
            BUG();
        }

        /* PGC_need_scrub can only be set if first_dirty is valid */
        ASSERT(first_dirty != INVALID_DIRTY_IDX ||
               !(pg[i].count_info & PGC_need_scrub));

        /* Preserve PGC_need_scrub so we can check it after lock is dropped. */
        pg[i].count_info = PGC_state_inuse |
                           (pg[i].count_info & PGC_need_scrub);

        if ( !(memflags & MEMF_no_tlbflush) )
            accumulate_tlbflush(&need_tlbflush, &pg[i],
                                &tlbflush_timestamp);

        /* Initialise fields which have other uses for free pages. */
        pg[i].u.inuse.type_info = 0;
        page_set_owner(&pg[i], NULL);

        /* Ensure cache and RAM are consistent for platforms where the
         * guest can control its own visibility of/through the cache.
         */
        flush_page_to_ram(mfn_x(page_to_mfn(&pg[i])),
                          !(memflags & MEMF_no_icache_flush));
    }

    TRACE_0D(TRC_PGALLOC_PT4); // <=================================
    spin_unlock(&heap_lock);

    if ( first_dirty != INVALID_DIRTY_IDX ||
         (scrub_debug && !(memflags & MEMF_no_scrub)) )
    {
        for ( i = 0; i < (1U << order); i++ )
        {
            if ( test_bit(_PGC_need_scrub, &pg[i].count_info) )
            {
                if ( !(memflags & MEMF_no_scrub) )
                    scrub_one_page(&pg[i]);

                dirty_cnt++;

                spin_lock(&heap_lock);
                pg[i].count_info &= ~PGC_need_scrub;
                spin_unlock(&heap_lock);
            }
            else if ( !(memflags & MEMF_no_scrub) )
                check_one_page(&pg[i]);
        }

        if ( dirty_cnt )
        {
            spin_lock(&heap_lock);
            node_need_scrub[node] -= dirty_cnt;
            spin_unlock(&heap_lock);
        }
    }

    TRACE_0D(TRC_PGALLOC_PT5); // <=================================
    if ( need_tlbflush )
        filtered_flush_tlb_mask(tlbflush_timestamp);

    TRACE_0D(TRC_PGALLOC_PT6); // <=================================

    return pg;
}


And I wrote a simple Python script that parses the output of
xentrace. Here are the results for different orders:

  46.842032: page_alloc trace point 1. Order: 18
  46.842035: page_alloc trace point 2 (+   0.000003)
  46.842035: page_alloc trace point 3 (+   0.000000)
  46.975105: page_alloc trace point 4 (+   0.133069)
  46.975106: page_alloc trace point 5 (+   0.000001)
  46.975106: page_alloc trace point 6 (+   0.000000): total:    0.133074

  46.998536: page_alloc trace point 1. Order: 9
  46.998538: page_alloc trace point 2 (+   0.000002)
  46.998540: page_alloc trace point 3 (+   0.000001)
  46.998799: page_alloc trace point 4 (+   0.000259)
  46.998800: page_alloc trace point 5 (+   0.000000)
  46.998800: page_alloc trace point 6 (+   0.000000): total:    0.000264

  46.835802: page_alloc trace point 1. Order: 3
  46.835803: page_alloc trace point 2 (+   0.000000)
  46.835803: page_alloc trace point 3 (+   0.000000)
  46.835812: page_alloc trace point 4 (+   0.000009)
  46.835813: page_alloc trace point 5 (+   0.000000)
  46.835813: page_alloc trace point 6 (+   0.000001): total:    0.000011

  46.998815: page_alloc trace point 1. Order: 0
  46.998816: page_alloc trace point 2 (+   0.000002)
  46.998817: page_alloc trace point 3 (+   0.000000)
  46.998818: page_alloc trace point 4 (+   0.000002)
  46.998819: page_alloc trace point 5 (+   0.000001)
  46.998819: page_alloc trace point 6 (+   0.000000): total:    0.000005

Then I commented out the call to flush_page_to_ram() and got the
following results:

 149.561902: page_alloc trace point 1. Order: 18
 149.561905: page_alloc trace point 2 (+   0.000003)
 149.561905: page_alloc trace point 3 (+   0.000000)
 149.569450: page_alloc trace point 4 (+   0.007545)
 149.569451: page_alloc trace point 5 (+   0.000001)
 149.569452: page_alloc trace point 6 (+   0.000000): total:    0.007550

 149.592624: page_alloc trace point 1. Order: 9
 149.592626: page_alloc trace point 2 (+   0.000003)
 149.592627: page_alloc trace point 3 (+   0.000001)
 149.592639: page_alloc trace point 4 (+   0.000012)
 149.592639: page_alloc trace point 5 (+   0.000000)
 149.592640: page_alloc trace point 6 (+   0.000000): total:    0.000016

All time units are seconds, by the way.
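
For reference, that works out to roughly 0.133s / 262144 pages ≈ 0.5us
per page for the order 18 allocation with the flush in place, versus
about 0.0075s / 262144 ≈ 29ns per page with flush_page_to_ram()
commented out. The order 9 numbers show about the same per-page cost, so
the time scales pretty much linearly with the number of pages.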


-- 
Volodymyr Babchuk at EPAM


 

