WARNING - OLD ARCHIVES

This is an archived copy of the Xen.org mailing list, which we have preserved to ensure that existing links to archives are not broken. The live archive, which contains the latest emails, can be found at http://lists.xen.org/
   
 
 

[Xen-devel] Re: [PATCH 13/13] Nested Virtualization: hap-on-hap

To: Christoph Egger <Christoph.Egger@xxxxxxx>
Subject: [Xen-devel] Re: [PATCH 13/13] Nested Virtualization: hap-on-hap
From: Tim Deegan <Tim.Deegan@xxxxxxxxxx>
Date: Fri, 10 Sep 2010 11:00:19 +0100
Cc: "xen-devel@xxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxx>, "Dong, Eddie" <eddie.dong@xxxxxxxxx>
Delivery-date: Fri, 10 Sep 2010 03:01:15 -0700
Envelope-to: www-data@xxxxxxxxxxxxxxxxxxx
In-reply-to: <201009091601.04316.Christoph.Egger@xxxxxxx>
List-help: <mailto:xen-devel-request@lists.xensource.com?subject=help>
List-id: Xen developer discussion <xen-devel.lists.xensource.com>
List-post: <mailto:xen-devel@lists.xensource.com>
List-subscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
List-unsubscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
References: <201009011718.44873.Christoph.Egger@xxxxxxx> <20100908140451.GA23487@xxxxxxxxxxxxxxxxxxxxxxx> <201009091601.04316.Christoph.Egger@xxxxxxx>
Sender: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
User-agent: Mutt/1.5.18 (2008-05-17)
At 15:01 +0100 on 09 Sep (1284044463), Christoph Egger wrote:
> > This comment is confusing to me.  The nested guest gpa isn't a virtual
> > address, and certainly not the same as an L1-guest VA.  Can you reword
> > it, maybe just to say what the call to nestedhvm_hap_nested_page_fault()
> > is going to do?
> 
> New wording is:
> 
>   /* The vcpu is in guest mode and the l1 guest
>    * uses hap. That means 'gpa' is in l2 guest
>    * physical address space.
>    * Fix the nested p2m or inject a nested page fault
>    * into the l1 guest if it is not fixable. The algorithm
>    * is the same as for shadow paging.
>    */
> 
> Is this fine with you?

Sure, that's better. 

> > > -            gfn = paging_gva_to_gfn(curr, addr, &pfec);
> > > +            gfn = paging_p2m_ga_to_gfn(curr, p2m, mode, cr3, addr, &pfec);
> >
> > Do other callers of paging_gva_to_gfn need to handle nested-npt mode as
> > well?
> 
> If the other callers can be reached when the vcpu is in guest mode, then yes.

It looks like most of them can. 

> > If so, would it be better to update paging_gva_to_gfn() itself?
> 
> This is definitely easier, but the call to p2m_get_p2m() might become very
> expensive and should be cached by passing it through as a function argument.

I'd rather it was cached somewhere in the per-vcpu state if
p2m_get_p2m() is expensive, and not pass any more arguments.  Callers of
paging_gva_to_gfn() shouldn't need to know about the internals of the
p2m code - they should just be able to pass a VA and get a (L1) gfn.

> > > +     p2m = vcpu_nestedhvm(v).nh_p2m;
> > > +     if (p2m == NULL)
> > > +             /* p2m has either been invalidated or not yet assigned. */
> > > +             return;
> > > +
> > > +     cpu_set(v->processor, p2m->p2m_dirty_cpumask);
> >
> > Is this arch-specific code?  (and likewise the cpu_clear that follows)
> 
> No. I don't see how.

It's only used by NPT-on-NPT.  IIRC the EPT-on-EPT code uses a soft-TLB
model so it won't need this cpumask.  But it's not a big deal.

> > This function looks like it duplicates a lot of logic from an old
> > version of the normal p2m next-level function.
> 
> Yes, it is redundant. The point is that the *_write_p2m_entry() functions
> are not aware of nested virtualization. I am having trouble understanding
> the monitor p2m tables.
> 
> > Would it be easy to merge them?
> 
> Maybe, maybe not. I can answer the question once I understand what the
> monitor tables are used for and how they work.

OK.  I'll wait for the next rev then. 

> > > +int
> > > +nestedhvm_hap_nested_page_fault(struct vcpu *v, paddr_t L2_gpa)
> > > +{
> > > +    int rv;
> > > +    paddr_t L1_gpa, L0_gpa;
> > > +    struct domain *d = v->domain;
> > > +    struct p2m_domain *p2m, *nested_p2m;
> > > +
> > > +    p2m = p2m_get_hostp2m(d); /* L0 p2m */
> > > +    nested_p2m = p2m_get_nestedp2m(v, vcpu_nestedhvm(v).nh_vm_hostcr3);
> > > +
> > > +    /* Walk the L1 P2M table. Note we have to pass p2m
> > > +     * and not nested_p2m here, or otherwise we fail
> > > +     * the walk forever. */
> >
> > Can you explain that more fully?  It looks like you're walking the L0
> > table to translate an L2 gfn into an L1 gfn, which surely can't be right.
> 
> The l1 guest hostcr3 is in l1 guest physical address space. To walk the
> l1 guest's page table I need to translate it into host physical address space
> by walking the l0 p2m table.
> 
> > > +    rv = nestedhap_walk_L1_p2m(v, p2m, L2_gpa, &L1_gpa);

OK, I understand now, thanks.

> > > +static int
> > > +p2m_flush_locked(struct p2m_domain *p2m)
> > > +{
> > > +    struct page_info * (*alloc)(struct p2m_domain *);
> > > +    void (*free)(struct p2m_domain *, struct page_info *);
> > > +
> > > +    alloc = p2m->alloc_page;
> > > +    free = p2m->free_page;
> > > +
> > > +    if (p2m->cr3 == 0)
> > > +        /* Microoptimisation: p2m is already empty.
> > > +         * => about 0.3% speedup of overall system performance.
> > > +         */
> > > +        return 0;
> >
> > What happens if a malicious guest uses 0 as its actual CR3 value?
> 
> When the l1 guest uses hap and the l2 guest uses shadow paging, this code
> path never runs.
> When the l1 guest uses hap and the l2 guest uses hap, this can only be the
> guest's hostcr3.
>
> So your question is what happens if a malicious guest uses 0 to point to
> its nested page table?

Yes.

> I think this guest crashes sooner or later anyway.

It might, but in the meantime we've missed a flush.  If we fail to
properly flush a nested P2M when we should, the L2 guest gets to use
stale entries to write to memory it shouldn't see.

My point is that you can't use cr3==0 as an indicator that this slot
isn't in use because the guest can cause it to be true.  Maybe -1 would
be a better choice, since that's not a valid cr3 value.

> > > +    /* All p2m's are or were in use. We know the least recently used one.
> > > +     * Destroy and re-initialize it.
> > > +     */
> > > +    for (i = 0; i < MAX_NESTEDP2M; i++) {
> > > +        p2m = p2m_getlru_nestedp2m(d, NULL);
> > > +        rv = p2m_flush_locked(p2m);
> >
> > Is this enough?  If this p2m might be in host_vmcb->h_cr3 somewhere on
> > another vcpu, then you need to make sure that vcpu gets reset not to use
> > it any more.
> 
> There are three possibilities:
> Another vcpu is in VMRUN emulation before a nestedp2m is assigned.
>    In this case, it will get a new nestedp2m as it won't find its 'old'
>    nestedp2m anymore.
>
> Another vcpu is in VMRUN emulation after a nestedp2m is assigned.
>    It will VMEXIT with a nested page fault.

Why?

> Another vcpu is already running the l2 guest.
>    It will VMEXIT with a nested page fault immediately.

Hmm.  It will exit for the TLB shootdown IPI, but I think you need to
clear vcpu_nestedhvm(v).nh_p2m on the other vcpu to make sure it doesn't
re-enter with the p2m you've just recycled.

Cheers,

Tim.

-- 
Tim Deegan <Tim.Deegan@xxxxxxxxxx>
Principal Software Engineer, XenServer Engineering
Citrix Systems UK Ltd.  (Company #02937203, SL9 0BG)

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel