
Re: [Xen-devel] [PATCH RFC 1/4] x86/mm: Shadow and p2m changes for PV mem_access



>> +void p2m_access_to_flags(u32 *flags, u32 gflags, p2m_access_t access)
>> +{
>> +
>> +    /* Restrict with access permissions */
>> +    switch (access)
>> +    {
>> +        case p2m_access_r:
>> +            *flags &= ~(_PAGE_RW);
>> +            *flags |= (_PAGE_NX_BIT|_PAGE_PRESENT);
>> +            break;
>> +        case p2m_access_rx:
>> +        case p2m_access_rx2rw:
>> +            *flags &= ~(_PAGE_NX_BIT|_PAGE_RW);
>> +            *flags |= _PAGE_PRESENT;
>> +            break;
>> +        case p2m_access_rw:
>> +            *flags |= (_PAGE_NX_BIT|_PAGE_RW|_PAGE_PRESENT);
>> +            break;
>> +        case p2m_access_rwx:
>> +        default:
>> +            *flags &= ~(_PAGE_NX_BIT);
>> +            *flags |= (_PAGE_RW|_PAGE_PRESENT);
>> +            break;
>> +    }
>
>Tim will have the final say here, but I'd suggest dropping all these
>needless parentheses (I see only one case where they're really
>needed).

OK, I will drop all the needless ones. Could you please point out the case 
where they are needed? I couldn't find anything in CODING_STYLE that dictates 
what should be done.
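
Is it the "~(_PAGE_NX_BIT|_PAGE_RW)" line, where the parentheses around the 
OR are required by operator precedence? If so, the trimmed version would look 
something like this (sketch):

    case p2m_access_r:
        *flags &= ~_PAGE_RW;
        *flags |= _PAGE_NX_BIT | _PAGE_PRESENT;
        break;
    case p2m_access_rx:
    case p2m_access_rx2rw:
        /* Parentheses kept here: ~ binds tighter than | */
        *flags &= ~(_PAGE_NX_BIT | _PAGE_RW);
        *flags |= _PAGE_PRESENT;
        break;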

>> +
>> +    // Allow more restrictive guest flags to be propagated instead of access
>> +    // permissions
>
>Coding style (there are more of these, which I'm not going to
>individually comment on).

Sorry, I let the C++ style comments slip in. I will convert all of them to 
C style comments.
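
For example, the comment quoted above would become:

    /*
     * Allow more restrictive guest flags to be propagated instead of access
     * permissions.
     */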

>> +static int
>> +p2m_mem_access_set_entry(struct p2m_domain *p2m, unsigned long gfn, mfn_t mfn,
>> +                         unsigned int page_order, p2m_type_t p2mt,
>> +                         p2m_access_t p2ma)
>> +{
>> +    struct domain *d = p2m->domain;
>> +    mfn_t *access_lookup_table = p2m->access_lookup_table;
>> +    uint table_idx;
>> +    uint page_idx;
>> +    uint8_t *access_table_page;
>> +
>> +    ASSERT(shadow_mode_mem_access(d) && access_lookup_table != NULL);
>
>Better split into two ASSERT()s, so that if it triggers one would know
>which of the two conditions wasn't met?

You are right. I should have done that in the first place. 
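
I will split it:

    ASSERT(shadow_mode_mem_access(d));
    ASSERT(access_lookup_table != NULL);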

>> +
>> +    /* For PV domains we only support rw, rx, rx2rw, rwx access permissions */
>> +    if ( unlikely(p2ma != p2m_access_r &&
>> +                  p2ma != p2m_access_rw &&
>> +                  p2ma != p2m_access_rx &&
>> +                  p2ma != p2m_access_rwx &&
>> +                  p2ma != p2m_access_rx2rw) )
>> +        return -EINVAL;
>
>Perhaps better done as a switch() statement.

Sure. That will definitely be cleaner.
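
Something along these lines (sketch; the list matches the current checks, 
including p2m_access_r):

    switch ( p2ma )
    {
    case p2m_access_r:
    case p2m_access_rw:
    case p2m_access_rx:
    case p2m_access_rwx:
    case p2m_access_rx2rw:
        break;
    default:
        return -EINVAL;
    }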

>> +
>> +    if ( page_get_owner(mfn_to_page(mfn)) != d )
>> +        return -ENOENT;
>> +
>> +    gfn = get_gpfn_from_mfn(mfn_x(mfn));
>
>You get "gfn" passed in and blindly overwrite it here? _If_ you need to
>do this lookup, shouldn't you instead check it matches the passed in one?

The "gfn" and "mfn" passed in to the function should be the same as it is a PV 
guest. That is why I am overwriting "gfn" to get the "gpfn" from the 
machine_to_phys_mapping.
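
If you would rather see a sanity check of that assumption here, I could add 
something like this (sketch):

    /* For PV guests the incoming "gfn" is really an MFN */
    ASSERT(gfn == mfn_x(mfn));
    gfn = get_gpfn_from_mfn(mfn_x(mfn));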

>> +
>> +    /*
>> +     * Values with the MSB set denote MFNs that aren't really part of the
>> +     * domain's pseudo-physical memory map (e.g., the shared info frame).
>> +     * Nothing to do here.
>> +     */
>> +    if ( unlikely(!VALID_M2P(gfn)) )
>> +        return 0;
>> +
>> +    if ( gfn > (d->tot_pages - 1) )
>> +        return -EINVAL;
>
>Hardly - a PV domain can easily make its address space sparse, and
>iirc pv-ops Linux even does so by default to avoid PFN/MFN collisions
>on MMIO space. (And as a side note, this would better be "gfn >=
>d->tot_pages".)

I was using this as a guard against going out of bounds in the access lookup 
table, which is created based on the domain's tot_pages. Please let me know 
what I should be using instead of tot_pages when doing this.
 
>> +    paging_lock(d);
>> +
>> +    table_idx = MEM_ACCESS_TABLE_IDX(gfn);
>> +    page_idx = MEM_ACCESS_PAGE_IDX(gfn);
>> +    access_table_page = map_domain_page(mfn_x(access_lookup_table[table_idx]));
>> +    access_table_page[page_idx] = p2ma;
>> +    unmap_domain_page(access_table_page);
>> +
>> +    if ( sh_remove_all_mappings(d->vcpu[0], mfn) )
>
>Is there anything guaranteeing d->vcpu and d->vcpu[0] being non-
>NULL?

I am not sure. I see shadow_track_dirty_vram() calling this without a check. 
Maybe I should check and call it only if d->vcpu and d->vcpu[0] are not NULL.
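
Something like this (sketch; whatever follows the call today stays unchanged):

    if ( d->vcpu != NULL && d->vcpu[0] != NULL &&
         sh_remove_all_mappings(d->vcpu[0], mfn) )
    {
        /* ... existing handling ... */
    }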

>> +static mfn_t
>> +p2m_mem_access_get_entry(struct p2m_domain *p2m, unsigned long gfn,
>> +                         p2m_type_t *t, p2m_access_t *a, p2m_query_t q,
>> +                         unsigned int *page_order)
>> +{
>> +    struct domain *d = p2m->domain;
>> +    mfn_t *access_lookup_table = p2m->access_lookup_table;
>> +    uint table_idx;
>> +    uint page_idx;
>> +    uint8_t *access_table_page;
>> +    mfn_t mfn = _mfn(gfn); // For PV guests mfn == gfn
>> +
>> +    ASSERT(shadow_mode_mem_access(d) && access_lookup_table != NULL);
>> +
>> +    /* Not necessarily true, but for non-translated guests, we claim
>> +     * it's the most generic kind of memory */
>
>I think you copied this comment verbatim from elsewhere, but I don't
>think it's correct as is.

Yes, indeed. I copied this from __get_gfn_type_access(). I will get rid of the 
comment.

>> +    *t = p2m_ram_rw;
>> +
>> +    if ( page_get_owner(mfn_to_page(mfn)) != d )
>> +        return _mfn(INVALID_MFN);
>> +
>> +    gfn = get_gpfn_from_mfn(mfn_x(mfn));
>
>Same comment as earlier.
>
>> +    if ( gfn > (d->tot_pages - 1) )
>
>Dito.

I have the same reasoning here as I did above.

>> +    table_idx = MEM_ACCESS_TABLE_IDX(gfn);
>> +    page_idx = MEM_ACCESS_PAGE_IDX(gfn);
>> +
>> +    access_table_page = map_domain_page(mfn_x(access_lookup_table[table_idx]));
>> +
>> +    /* This is a hint to take the default permissions */
>> +    if ( access_table_page[page_idx] == p2m_access_n )
>> +        access_table_page[page_idx] = p2m->default_access;
>
>We're in "get" here - why does that modify any global state?

I do not initialize the access lookup table to the default value, but the table 
is zeroed as part of initialization when each page is allocated using 
shadow_alloc_p2m_page(). I then use the "p2m_access_n" / 0 value as a hint that 
the default value should be returned, since p2m_access_n is not valid for PV 
domains. If you prefer that I initialize the table to the default value, I will 
gladly do so.
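
That would look roughly like this inside the allocation loop of 
p2m_mem_access_init() (sketch):

    /* Pre-fill the freshly allocated page with the default access type */
    uint8_t *access_table_page =
        map_domain_page(mfn_x(access_lookup_table[ctr]));

    memset(access_table_page, p2m->default_access, PAGE_SIZE);
    unmap_domain_page(access_table_page);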

>> +void p2m_mem_access_teardown(struct p2m_domain *p2m)
>> +{
>> +    struct domain *d = p2m->domain;
>> +    mfn_t *access_lookup_table = p2m->access_lookup_table;
>> +    uint32_t nr_access_table_pages;
>> +    uint32_t ctr;
>> +
>> +    /* Reset the set_entry and get_entry function pointers */
>> +    p2m_pt_init(p2m);
>> +
>> +    if ( !access_lookup_table  )
>> +        return;
>> +
>> +    nr_access_table_pages = get_domain_nr_access_table_pages(d);
>> +
>> +    for ( ctr = 0; ctr < nr_access_table_pages; ctr++ )
>
>No new effectively unbounded loops please.

OK, I will add a check for preemption and hypercall continuation logic here.
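
Roughly like this (sketch; this assumes p2m_mem_access_teardown() gains a 
return value so the caller can arrange the continuation):

    for ( ctr = 0; ctr < nr_access_table_pages; ctr++ )
    {
        /* Bail out periodically so the hypercall can be continued */
        if ( ctr && !(ctr & 0xff) && hypercall_preempt_check() )
            return -EAGAIN;
        /* ... free access_lookup_table[ctr] as before ... */
    }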

>> +int p2m_mem_access_init(struct p2m_domain *p2m)
>> +{
>> +    struct domain *d = p2m->domain;
>> +    mfn_t *access_lookup_table;
>> +    uint32_t nr_access_table_pages;
>> +    uint32_t ctr;
>> +
>> +    nr_access_table_pages = get_domain_nr_access_table_pages(d);
>> +    access_lookup_table = xzalloc_array(mfn_t, nr_access_table_pages);
>
>This surely is going to be an order > 0 allocation, and you even risk
>it being an order > MAX_ORDER one. The former is disallowed at
>runtime by convention/agreement, the latter is an outright bug.

Sorry, I was not aware of the convention. What should I be doing here instead?

>(And again just as a side note - you don't appear to need the zero
>initialization here.)
>
>> +    if ( !access_lookup_table )
>> +        return -ENOMEM;
>> +
>> +    p2m->access_lookup_table = access_lookup_table;
>> +
>> +    for ( ctr = 0; ctr < nr_access_table_pages; ctr++ )
>
>Same comment regarding the effective unboundedness of the loop.

OK, I will add a check for preemption and hypercall continuation logic here 
too, along the lines of the sketch above.

>> @@ -1414,6 +1419,8 @@ long p2m_set_mem_access(struct domain *d, unsigned long pfn, uint32_t nr,
>>      if ( pfn == ~0ul )
>>      {
>>          p2m->default_access = a;
>> +        if ( is_pv_domain(d) )
>> +            return p2m_mem_access_set_default(p2m);
>
>Is it really correct to set p2m->default_access _before_ calling that
>function? Perhaps it wouldn't be correct doing it the other way
>around either - I suppose you need to hold the necessary lock across
>both operations.

I do not see the problem here. p2m_mem_access_set_default() just blows the 
shadows away after taking the paging_lock. There is no actual setting of 
default permission for all pages in the PV case.
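
That said, if you would rather see both operations done under one lock, the 
assignment could move into p2m_mem_access_set_default() itself, e.g. (sketch):

    paging_lock(d);
    p2m->default_access = a;
    /* ... existing code that blows the shadows away ... */
    paging_unlock(d);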

>> @@ -585,9 +585,17 @@ int paging_domctl(struct domain *d, xen_domctl_shadow_op_t *sc,
>>      {
>>
>>      case XEN_DOMCTL_SHADOW_OP_ENABLE:
>> +        /*
>> +         * Shadow mem_access mode should only be enabled when mem_access is
>> +         * enabled in XEN_DOMCTL_MEM_EVENT_OP_ACCESS_ENABLE.
>> +         */
>> +        if ( sc->mode & XEN_DOMCTL_SHADOW_ENABLE_MEM_ACCESS )
>> +            return -EINVAL;
>> +
>>          if ( !(sc->mode & XEN_DOMCTL_SHADOW_ENABLE_LOG_DIRTY) )
>>              break;
>>          /* Else fall through... */
>> +
>>      case XEN_DOMCTL_SHADOW_OP_ENABLE_LOGDIRTY:
>
>Stray blank line - the fall through here makes it preferable to keep the
>two case blocks un-separated.

OK, I will keep them un-separated.

>> @@ -2443,7 +2443,8 @@ int sh_remove_all_mappings(struct vcpu *v, mfn_t gmfn)
>>          if ( !(shadow_mode_external(v->domain)
>>                 && (page->count_info & PGC_count_mask) <= 3
>>                 && ((page->u.inuse.type_info & PGT_count_mask)
>> -                   == !!is_xen_heap_page(page))) )
>> +                       == !!is_xen_heap_page(page)))
>> +                    && !(shadow_mode_mem_access(v->domain)) )
>
>You're breaking indentation, there are pointless parentheses again,
>but most importantly - why?

Sorry, I meant to ask a question about this in my patch message and mark it as 
a workaround. I am seeing the message on all sh_remove_all_mappings() calls and 
was not able to figure out why it was happening, so I added this as a 
workaround. I was hoping you or Tim could shed more light on it.

>> @@ -625,6 +627,20 @@ _sh_propagate(struct vcpu *v,
>>              }
>>      }
>>
>> +    /* Propagate access permissions */
>> +    if ( unlikely((level == 1) &&
>> +                  mem_event_check_ring(&d->mem_event->access) &&
>> +                  !sh_mfn_is_a_page_table(target_mfn)) )
>
>Just a general remark here - unlikely() used like this is pointless, it
>ought to be used on the individual constituents of && or ||.

OK, I will keep unlikely() just on the mem_event_check_ring() constituent.
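
i.e.:

    if ( level == 1 &&
         unlikely(mem_event_check_ring(&d->mem_event->access)) &&
         !sh_mfn_is_a_page_table(target_mfn) )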

>> +    {
>> +        struct p2m_domain *p2m = p2m_get_hostp2m(d);
>> +        p2m_access_t a;
>> +        p2m_type_t t;
>> +        mfn_t mfn;
>> +        mfn = p2m->get_entry(p2m, mfn_x(target_mfn), &t, &a, 0, NULL);
>
>Missing blank line between declarations and statements. Also this
>statement could instead be an initializer.

I will make it an initializer.
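
i.e.:

        p2m_access_t a;
        p2m_type_t t;
        mfn_t mfn = p2m->get_entry(p2m, mfn_x(target_mfn), &t, &a, 0, NULL);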

>> @@ -2822,6 +2838,8 @@ static int sh_page_fault(struct vcpu *v,
>>      int r;
>>      fetch_type_t ft = 0;
>>      p2m_type_t p2mt;
>> +    p2m_access_t p2ma;
>
>This variable ...
>
>> @@ -3009,7 +3027,80 @@ static int sh_page_fault(struct vcpu *v,
>>
>>      /* What mfn is the guest trying to access? */
>>      gfn = guest_l1e_get_gfn(gw.l1e);
>> -    gmfn = get_gfn(d, gfn, &p2mt);
>> +    if ( likely(!mem_event_check_ring(&d->mem_event->access)) )
>> +        gmfn = get_gfn(d, gfn, &p2mt);
>> +    /*
>> +     * A mem_access listener is present, so we will first check if a violation
>> +     * has occurred.
>> +     */
>> +    else
>> +    {
>> +        struct p2m_domain *p2m = p2m_get_hostp2m(v->domain);
>
>... seems to be used only in this scope, and with you already adding
>scope restricted variables it should go here.

OK, I will do that.
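
i.e. (sketch):

    else
    {
        struct p2m_domain *p2m = p2m_get_hostp2m(v->domain);
        p2m_access_t p2ma;

        gmfn = get_gfn_type_access(p2m, gfn_x(gfn), &p2mt, &p2ma, 0, NULL);
        /* ... */
    }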

>> +
>> +        gmfn = get_gfn_type_access(p2m, gfn_x(gfn), &p2mt, &p2ma, 0, NULL);
>> +        if ( mfn_valid(gmfn) && !sh_mfn_is_a_page_table(gmfn)
>> +             && (regs->error_code & PFEC_page_present)
>> +             && !(regs->error_code & PFEC_reserved_bit) )
>> +        {
>> +            int violation = 0;
>> +            bool_t access_w = !!(regs->error_code & PFEC_write_access);
>> +            bool_t access_x = !!(regs->error_code & PFEC_insn_fetch);
>> +            bool_t access_r = access_x ? 0 : !(access_w);
>
>Stray parentheses again. I'm not going to repeat this again - please
>just check your code before submitting.

I am really sorry. I will fix this in the next patch series.

>> +            /*
>> +             * Do not police writes to guest memory emanating from the Xen
>> +             * kernel. Trying to do so will cause the same pagefault to occur
>> +             * over and over again with an event being sent to the access
>> +             * listener for each fault. If the access listener's vcpu is not
>> +             * scheduled during this time, the violation is never resolved and
>> +             * will eventually end with the host crashing.
>> +             */
>> +            if ( (violation && access_w) &&
>> +                 (regs->eip >= XEN_VIRT_START && regs->eip <= XEN_VIRT_END) )
>> +            {
>> +                violation = 0;
>> +                rc = p2m->set_entry(p2m, gfn_x(gfn), gmfn, PAGE_ORDER_4K,
>> +                                    p2m_ram_rw, p2m_access_rw);
>> +            }
>
>This looks more like a hack than a proper solution - shouldn't the
>listener be allowed to know of hypervisor side accesses?

Ideally the listener should know of the hypervisor-side accesses, but I ran 
into cases where the listener would not get to run and the pagefault would be 
retried over and over, eventually causing the host to crash. I guess more 
detail about the problem is needed; I will send out a separate email regarding 
this, CCing you and Tim.

>> +/* Number of access table pages for a PV domain */
>> +#define get_domain_nr_access_table_pages(d) \
>> +        DIV_ROUND_UP(P2M_ACCESS_SIZE * (d->tot_pages - 1), PAGE_SIZE)
>
>And once again. I wonder whether this code was tested on a suitably
>big pv-ops PV guest.

The largest PV domain I tried this with was 28GB.

Thanks,
Aravindh

