
Re: [Xen-devel] [PATCH RFC 1/4] x86/mm: Shadow and p2m changes for PV mem_access



>>>> +        default:
>>>> +            *flags &= ~(_PAGE_NX_BIT);
>>>> +            *flags |= (_PAGE_RW|_PAGE_PRESENT);
>>>> +            break;
>>>> +    }
>>>
>>>Tim will have the final say here, but I'd suggest dropping all these
>>>needless parentheses (I see only one case where they're really
>>>needed).
>>
>> OK, I will drop all the needless ones. Could you please point out the
>> case where they are needed? I couldn't find anything in CODING_STYLE
>> that dictates what should be done.
>
>Convention is to use them when operator precedence isn't considered
>obvious. What "obvious" means here differs between people, but at least all
>precedence rules defined by school mathematics should generally not require
>parenthesizing. Which in the case above means no parentheses around the
>right side expression of assignment operators.

OK, I understand and will fix them appropriately. 

>>>> +
>>>> +    if ( page_get_owner(mfn_to_page(mfn)) != d )
>>>> +        return -ENOENT;
>>>> +
>>>> +    gfn = get_gpfn_from_mfn(mfn_x(mfn));
>>>
>>>You get "gfn" passed in and blindly overwrite it here? _If_ you need
>>>to do this lookup, shouldn't you instead check it matches the passed-in
>>>one?
>>
>> The "gfn" and "mfn" passed in to the function should be the same as it
>> is a PV guest. That is why I am overwriting "gfn" to get the "gpfn"
>> from the machine_to_phys_mapping.
>
>Since this overwrites/discards a function parameter, this surely is worth a
>comment (or a pseudo-comment like ASSERT(gfn == mfn_x(mfn))).

I will add the ASSERT and a comment for it.

>>>> +
>>>> +    /*
>>>> +     * Values with the MSB set denote MFNs that aren't really part of the
>>>> +     * domain's pseudo-physical memory map (e.g., the shared info frame).
>>>> +     * Nothing to do here.
>>>> +     */
>>>> +    if ( unlikely(!VALID_M2P(gfn)) )
>>>> +        return 0;
>>>> +
>>>> +    if ( gfn > (d->tot_pages - 1) )
>>>> +        return -EINVAL;
>>>
>>>Hardly - a PV domain can easily make its address space sparse, and
>>>iirc pv-ops Linux even does so by default to avoid PFN/MFN collisions
>>>on MMIO space. (And as a side note, this would better be "gfn >=
>>>d->tot_pages".)
>>
>> I was using this as a guard against going out of bounds in the access
>> lookup table, which is created based on the domain's tot_pages. Please
>> let me know what I should be using instead of tot_pages when doing this.
>
>Afaict, for PV guests there's absolutely nothing you can use to bound the
>range.

>>>> +    paging_lock(d);
>>>> +
>>>> +    table_idx = MEM_ACCESS_TABLE_IDX(gfn);
>>>> +    page_idx = MEM_ACCESS_PAGE_IDX(gfn);
>>>> +    access_table_page = map_domain_page(mfn_x(access_lookup_table[table_idx]));
>>>> +    access_table_page[page_idx] = p2ma;
>>>> +    unmap_domain_page(access_table_page);
>>>> +
>>>> +    if ( sh_remove_all_mappings(d->vcpu[0], mfn) )
>>>
>>>Is there anything guaranteeing d->vcpu and d->vcpu[0] being non- NULL?
>>
>> I am not sure. I see shadow_track_dirty_vram() calling this without a check.
>> Maybe I should check and call it only if d->vcpu is not null.
>
>You ought to check both, at least via ASSERT().

OK, I will add the ASSERTs.

>>>> +    table_idx = MEM_ACCESS_TABLE_IDX(gfn);
>>>> +    page_idx = MEM_ACCESS_PAGE_IDX(gfn);
>>>> +
>>>> +    access_table_page = map_domain_page(mfn_x(access_lookup_table[table_idx]));
>>>> +
>>>> +    /* This is a hint to take the default permissions */
>>>> +    if ( access_table_page[page_idx] == p2m_access_n )
>>>> +        access_table_page[page_idx] = p2m->default_access;
>>>
>>>We're in "get" here - why does that modify any global state?
>>
>> I do not initialize the access lookup table to the default value. But
>> the table is zeroed as part of initialization when each page is
>> allocated using shadow_alloc_p2m_page(). I then use the "p2m_access_n"
>> / 0 value as hint that the default value should be returned as
>> p2m_access_n is not valid for PV domains. If you prefer that I
>> initialize the table to the default value, I will gladly do so.
>
>I think implying _n to mean "default" is at least a latent bug anyway (since
>that precludes setting that type). And _if_ "get" really does any modification,
>that minimally would require a big comment.

For PV domains, p2m_access_n is precluded anyway.

>Furthermore, with it being possible for the default type to change, the net
>effect might easily be different between setting the default type at
>initialization time or when the first lookup happens.

I am not sure I follow your concern here. Setting the default access does not
mean that the access table gets cleared. So if a particular MFN entry's access
value was changed and the default value was also changed, the MFN should retain
the access value it was explicitly set to and not the new default. The new
default value will only be returned for MFNs on which set_entry was never
called, and I think that is correct.
 
>>>> +int p2m_mem_access_init(struct p2m_domain *p2m)
>>>> +{
>>>> +    struct domain *d = p2m->domain;
>>>> +    mfn_t *access_lookup_table;
>>>> +    uint32_t nr_access_table_pages;
>>>> +    uint32_t ctr;
>>>> +
>>>> +    nr_access_table_pages = get_domain_nr_access_table_pages(d);
>>>> +    access_lookup_table = xzalloc_array(mfn_t, nr_access_table_pages);
>>>
>>>This surely is going to be an order > 0 allocation, and you even risk
>>>it being an order > MAX_ORDER one. The former is disallowed at runtime
>>>by convention/agreement, the latter is an outright bug.
>>
>> Sorry, I was not aware of the convention. What should I be doing here
>> instead?
>
>Especially with the above note about GFN space being effectively unbounded
>for PV guests, you need to find a different data structure to represent the
>information you need. Perhaps something similar to what log-dirty mode
>uses, except that you would need more levels in the extreme case, and hence
>it might be worthwhile making the actually used number of levels dynamic?

I am a little confused here. In my initial discussions about this, I was told 
that PV guests have bounded and contiguous memory, which is why I took the 
approach of an indexable array.

" OTOH, all you need is a byte per pfn, and the great thing is that in PV 
domains, the physmap is bounded and continuous. Unlike HVM and its PCI holes, 
etc, which demand the sparse tree structure. So you can allocate an easily 
indexable array, notwithstanding super page concerns (I think/hope)."

http://lists.xenproject.org/archives/html/xen-devel/2013-11/msg03860.html

Did I misunderstand something here? 
PS: I have CCed Andres too so that he can clarify.

>>>> @@ -1414,6 +1419,8 @@ long p2m_set_mem_access(struct domain *d, unsigned long pfn, uint32_t nr,
>>>>      if ( pfn == ~0ul )
>>>>      {
>>>>          p2m->default_access = a;
>>>> +        if ( is_pv_domain(d) )
>>>> +            return p2m_mem_access_set_default(p2m);
>>>
>>>Is it really correct to set p2m->default_access _before_ calling that
>>>function? Perhaps it wouldn't be correct doing it the other way around
>>>either - I suppose you need to hold the necessary lock across both
>>>operations.
>>
>> I do not see the problem here. p2m_mem_access_set_default() just blows
>> the shadows away after taking the paging_lock. There is no actual
>> setting of default permission for all pages in the PV case.
>
>But the function having a return value means it might fail, in which case you
>shouldn't have set the new default.

Good point, I will fix that.

>>>> @@ -2443,7 +2443,8 @@ int sh_remove_all_mappings(struct vcpu *v, mfn_t gmfn)
>>>>          if ( !(shadow_mode_external(v->domain)
>>>>                 && (page->count_info & PGC_count_mask) <= 3
>>>>                 && ((page->u.inuse.type_info & PGT_count_mask)
>>>> -                   == !!is_xen_heap_page(page))) )
>>>> +                       == !!is_xen_heap_page(page)))
>>>> +                    && !(shadow_mode_mem_access(v->domain)) )
>>>
>>>You're breaking indentation, there are pointless parentheses again,
>>>but most importantly - why?
>>
>> Sorry, I meant to ask a question about this in my patch message and
>> mark this as a workaround. I am seeing the message on all
>> sh_remove_all_mappings() calls and I was not able to figure out why this
>> was happening. I just added this as a workaround. I was hoping you or Tim
>> would shed more light on this.
>
>I'm afraid that without detail on which pages this triggers on, and you at
>least having spent some time finding out where the stray/extra references
>may be coming from, it's going to be hard to help.

Here are some of the prints I saw. I typically saw it for every set_entry() 
call.

(XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 33960: c=8000000000000002 t=7400000000000000 <most common>
(XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 134b8: c=8000000000000038 t=7400000000000001
(XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 13344: c=8000000000000003 t=7400000000000001
(XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 13449: c=8000000000000028 t=7400000000000001
(XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 1394b: c=8000000000000013 t=7400000000000001
(XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 137b5: c=8000000000000003 t=7400000000000001
(XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 33fa6: c=8000000000000002 t=7400000000000000
(XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 1345d: c=8000000000000028 t=7400000000000001
(XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 33410: c=8000000000000034 t=7400000000000001
(XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 33412: c=8000000000000039 t=7400000000000001
(XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 33953: c=8000000000000011 t=7400000000000001
(XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 13a63: c=8000000000000028 t=7400000000000001
(XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 393f2: c=800000000000003a t=7400000000000001
(XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 1335b: c=8000000000000028 t=7400000000000001
(XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 134bc: c=8000000000000036 t=7400000000000001
(XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 393f0: c=8000000000000014 t=7400000000000001
(XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 134bd: c=8000000000000008 t=7400000000000001
(XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 1354e: c=8000000000000010 t=7400000000000001
(XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 33443: c=8000000000000013 t=7400000000000001
(XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 39f74: c=8000000000000008 t=7400000000000001

>>>> +            /*
>>>> +             * Do not police writes to guest memory emanating from the Xen
>>>> +             * kernel. Trying to do so will cause the same pagefault to occur
>>>> +             * over and over again with an event being sent to the access
>>>> +             * listener for each fault. If the access listener's vcpu is not
>>>> +             * scheduled during this time, the violation is never resolved
>>>> +             * and will eventually end with the host crashing.
>>>> +             */
>>>> +            if ( (violation && access_w) &&
>>>> +                 (regs->eip >= XEN_VIRT_START && regs->eip <= XEN_VIRT_END) )
>>>> +            {
>>>> +                violation = 0;
>>>> +                rc = p2m->set_entry(p2m, gfn_x(gfn), gmfn, PAGE_ORDER_4K,
>>>> +                                    p2m_ram_rw, p2m_access_rw);
>>>> +            }
>>>
>>>This looks more like a hack than a proper solution - shouldn't the
>>>listener be allowed to know of hypervisor side accesses?
>>
>> Ideally the listener should know of the hypervisor side accesses but I
>> ran in to cases where the listener would not get to run and the
>> pagefault would be tried over and over, eventually causing the host to
>> crash. I guess more detail about the problem is needed. I will send
>> out a separate email regarding this, CCing you and Tim.
>
>And if it's intended as a workaround until proper resolution only, you probably
>should say so explicitly in the comment, avoiding the need for reviewers to
>comment on this being a problem.

Sorry, I will be more explicit about such things in the future.

>>>> +/* Number of access table pages for a PV domain */
>>>> +#define get_domain_nr_access_table_pages(d) \
>>>> +        DIV_ROUND_UP(P2M_ACCESS_SIZE * (d->tot_pages - 1), PAGE_SIZE)
>>>
>>>And once again. I wonder whether this code was tested on a suitably
>>>big pv-ops PV guest.
>>
>> The largest PV domain I tried this with was 28GB.
>
>Which ought to have had GFNs larger than ->tot_pages, so I wonder how this
>worked.

It would have had MFNs larger than tot_pages. I don't understand how it would 
have had GFNs larger than tot_pages.

Thanks,
Aravindh


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel