[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel][PATCH][RFC] Supporting Enlightened Windows 2008Server



> > Would you mind generating diffs with -p, please?  It makes them a fair
> > bit easier to read.
> Will do.
Thanks.

> >  static void svm_vmexit_do_cpuid(struct cpu_user_regs *regs)
> >  {
> > -    unsigned int eax, ebx, ecx, edx, inst_len;
> > +    unsigned int eax, ebx, ecx, edx, inst_len, input;
> >  
> >      inst_len = __get_instruction_length(current, INSTR_CPUID, NULL);
> >      if ( inst_len == 0 ) 
> > ...
> > @@ -968,6 +970,7 @@
> >      regs->ecx = ecx;
> >      regs->edx = edx;
> >  
> > +    hyperx_intercept_do_cpuid(input, regs);
> >      __update_guest_eip(regs, inst_len);
> >  }
> >
>   
> > Would it be easier to put this bit in hvm_cpuid, or maybe
> > cpuid_hypervisor_leaves?  That would avoid needing to make the same
> > change to both the VMX and SVM CPUID infrastructure.
> Good point. I will look into it.
Thanks.


> > +static int
> > +hv_get_vp_registers(paddr_t input, paddr_t output)
> > +{
> > +hv_vcpu_t        *vcpup, *targetp;
> > +hv_partition_t   *curp = hv_get_current_partition();
> > +get_vp_registers_input_t*inbuf;
> > +get_vp_registers_output_t*outbuf;
> > +struct vcpu_guest_context*vcpu_ctx;
> > +u32*reg_indexp;
> > +get_vp_registers_output_t*out_regp;
> > +u32num_output_bytes = 0;
> > +
> > +        vcpup = &curp->vcpu_state[hv_get_current_vcpu_index()];
> > +inbuf = vcpup->input_buffer;
> > +outbuf = vcpup->output_buffer;
> > +out_regp = outbuf;
> > +/*
> > + * Copy the input data to the per-cpu input buffer.
> > + * This may be an overkill; obviously it is better to only
> > + * copy what we need. XXXKYS: Check with Mike.
> > + */
> > Who's Mike?  What did he say?
> Mike is  the MSFT contact. 
Ah, okay.  I think we'd rather not have comments which are meaningless
to other people reading the code, so it'd be better to get rid of that
sort of thing.

> > +if (hvm_copy_from_guest_phys(inbuf, input, PAGE_SIZE)) {
> > +return (HV_STATUS_INVALID_ALIGNMENT);
> > +}
> ...
> > +/*
> > + * Now that we have the register state; select what we want and
> > + * populate the output buffer.
> > + */
> > +reg_indexp = &inbuf->reg_index;
> > +while (*reg_indexp != 0) {
> > +switch(*reg_indexp) {
> > +/*
> > + * XXXKYS: need mapping code here; populate
> > + * outbuf.
> > + */
> > +panic("hv_get_vp_registers not supported\n");
> 
> > Yeah, that's not really good enough.  It would be better to just
> > not support this hypercall at all than have an implementation which
> > always crashes Xen.
> Early on, I decided what Hypercalls I would support; and from this
> list I wanted to only support those that were used by windows 2008
> server. I had put in a bunch of panics to trap all the hypercalls
> that the guest would invoke. This hypercall is not used. Thus the
> implementation is incomplete. I will clean this up.
If the hypercall is never made, it'd probably be best to just remove
it rather than trying to clean it up.

> > +static int
> > +hv_switch_va(paddr_t input)
> > +{
> > +hv_partition_t   *curp = hv_get_current_partition();
> > +        hv_vcpu_t *vcpup = &curp->vcpu_state[hv_get_current_vcpu_index()];
> > +
> > +/*
> > + * XXXKYS: the spec sys the asID is passed via memory at offset 0 of 
> > + * the page whose GPA is in the input register. However, it appears 
> > + * the current build of longhorn (longhorn-2007-02-06-x86_64-fv-02)
> > + * passes the asID in the input register instead. Need to check if 
> > + * future builds do this.
> > + */
> > +hvm_set_cr3(input); 
> > +HV_STATS_COLLECT(HV_CSWITCH, &vcpup->stats);
> > +return (HV_STATUS_SUCCESS);
> > +}
> > Do you have any evidence that this is actually faster than just
> > doing the CR3 write in the guest?  It's a win on real Viridian
> > because you avoid blowing the shadow page table cache, but I don't
> > think that applies on Xen's current shadow page table
> > implementation.
> As I noted earlier, initially I identified all the hypercalls that
> made for a win2k8 guest on our platform. From this list I
> implemented those that the guest actually invoked. I implemented all
> the hypercalls that the guest was invoking without regard to the
> performance value of the enlightenment. I agree with you that this
> may be more important on Veridian than on our platform.
Generally, small patches are easier to merge than big ones.  That
suggests that you should start with a minimal implementation and then
add things which have demonstrated value, rather than trying to get
everything working in the initial merge.

> > +static int
> > +hv_flush_va(paddr_t input)
> > +{
> ...
> > +/*
> > + * Now operate on what we are given
> > + * XXXKYS: For now we are ignoring as_id and fushGlobal flag.
> > + * May have to revisit this. But first stash away the processed 
> > + * parameters for subsequent use.
> > + */
> > +flush_argp->as_handle = as_id;
> > +flush_argp->flags = flush_global;
> > +flush_argp->v_mask = vcpu_mask;
> > +
> > +
> > +ret_val = hv_build_hcall_retval(HV_STATUS_SUCCESS, 0);
> > +hv_set_syscall_retval(guest_cpu_user_regs(),
> > +                                   curp->long_mode_guest,
> > +                                   ret_val);
> > +HV_STATS_COLLECT(HV_FLUSH_VA_STAT, &cur_vcpup->stats);
> > +panic("hv_flush_va not supported\n"); 
> 
> > So if this hypercall is ever correctly used by the guest, Xen will
> > crash?
> In the original patches that I posted, this hypercall was
> implemented. As part of my first round of cleanup, I removed the
> support for the TLB flush hypercalls. As part of this cleanup, I
> also disabled TLB related enlightenments. This will be cleaned up.
Thank you.

> > +static inline unsigned long
> > +get_mfn_from_gva(unsigned long va)
> > +{
> > +uint32_t pfec = PFEC_page_present;
> > +unsigned long gfn;
> > +gfn = paging_gva_to_gfn(current, va, &pfec);
> > +return (gmfn_to_mfn((current->domain), gfn));
> > +}
> 
> > That's only valid if you can guarantee that the resulting MFN will
> > never be written to.
> Yes. I use this to figure out if the hypercall is from the PV
> drivers in the guest or the hyparcall from the guest.
If it's only used in one place, it might be easier to remove the
foot-shooting potential completely and inline it at its call site.

> > +curp->hypercall_msr = msr_content;
> > +spin_unlock(&curp->lock);
> > +cur_vcpu->flags |= HV_VCPU_UP;
> > +}
> > +
> > +
> > +static inline void hv_write_sx_msr(uint32_t idx, hv_partition_t *curp,
> > +  hv_vcpu_t*cur_vcpu,
> > +  u64msr_content)
> > +{
> ...
> > +sx_page = get_virt_from_gmfn(d, gmfn);
> > +if (sx_page == NULL) {
> > +/*
> > + * The guest specified a bogus GPA; inject a GP fault
> > + * into the guest.
> > + */
> > +hv_inject_exception(TRAP_gp_fault);
> 
> > The spec says tha SIEFP can be outside the physical address space,
> > and you're just supposed to deal with it.  The guest will be unable
> > to access the frame, but the MSR write is supposed to succeed
> > (section 14.6.3 of version 0.83).  Likewise the SIMP (14.6.4).
>  You are right. As it turns out, the win2k8 guest on our platform
> never runs into this situation.  I will fix this.
Thank you.

> > +static inline u32
> > +hv_get_recommendations(void)
> > +{
> > +/*
> > + *For now we recommend all the features. Need to validate.
> > + */
> > +if ( paging_mode_hap(current->domain)) {
> > +/*
> > + * If HAP is enabled; the guest should not use TLB flush
> > + * related enlightenments.
> > + */
> > +return (0x19);
> > +} else {
> > +/*
> > + * For now disable TLB flush enlightenments.
> > + */
> > +return (0x19); 
> > +}
> > +}
> > So you recommend the use of hypercalls for address switches, MSRs
> > for APIC registers, and MSRs for rebooting?  The first two make
> > sense, but I'm not so sure that enlightened reboot is a useful
> > optimisation.
> Originally, I had implemented TLB flush enlightenments as
> well. Based on some of the performance analysis we are currently
> doing, I may re-introduce them.
Sure.

> As I noted earlier, I just wanted to implement all the features that
> made sense for the guest on our platform - hence I implemented
> reboot as well!
I think the interesting question, when deciding whether to implement a
feature, is less ``does this feature make sense?'', and more ``does
this feature need to be accelerated?''.  We're never going to be able
to remove the un-enlightened reboot support, and so there needs to be
some compelling advantage to justify duplicating the functionality
like this.  For hot-path operations, like page table updates, that
justification is performance, but it's not worth the extra code, with
all its attendant bug potential and maintenance cost, just to shave a
few milliseconds off reboot times.


> > +static inline void 
> > +hv_set_partition_features(hv_partition_t *hvpp)
> > +{
> > +hvpp->supported_features = 0x1f;
> > +}
> >i.e. VP_RUNTIME | PARTITION_REF_COUNT | SYNIC MSRS | SYNTHETIC_TIMERS |
> >     APIC_MSRS | HYPERCALL_MSRS.
> 
> > It's kind of surprising that you recommend guests use the reboot MSRs,
> > but then don't claim to support it.  Was that deliberate?
> 
> > Actually, looking some more, it doesn't look like you ever use the
> > supported_features field, and have instead hardcoded the value to
> > 0xff everywhere you need it.
> Will fix this.
Thanks.

> 
> > +
> > +static inline u16 
> > +hv_get_guest_major(void)
> > +{
> > +//KYS: Check!
> > +return (0);
> > +}
> > +static inline u16
> > +hv_get_guest_minor(void)
> > +{
> > +//KYS: Check!
> > +return (0);
> > +}
> > +static inline u32
> > +hv_get_guest_service_pack(void)
> > +{
> > +//KYS: Check!
> > +return (0);
> > +}
> > Err... are these supposed to return the guest major/minor/sp
> > reported in the identification MSR?
> Yep. Was not quite sure what numbers made sense!
The guest should tell you what numbers to use when it does interface
discovery.  See section 3.6 of the spec.

Of course, it really isn't obvious why the guest would ever need to
ask the hypervisor what operating system it's running, but that's a
problem with the specification rather than with this particular
implementation.

> > +static inline void
> > +hv_read_icr(u64 *icr_content)
> ...
> > +static inline void
> > +hv_read_tpr(u64 *tpr_content)
> ...
> > +static inline void
> > +hv_write_eoi(u64 msr_content)
> ...
> > +static inline void
> > +hv_write_icr(u64 msr_content)
> ...
> > +static inline void
> > +hv_write_tpr(u64 msr_content)
> ...
> 
> > These all have really weird implementations and lots of gratuitous
> > use of hardcoded magic numbers (have a look at apicdef.h).
> Will fix it.
Thanks.

> > +int
> > +hyperv_do_hypercall(struct cpu_user_regs *pregs)
> > +{
> > +hv_partition_t*curp = hv_get_current_partition();
> > +hv_vcpu_t        *vcpup;
> > +intlong_mode_guest = curp->long_mode_guest;
> > +unsigned long hypercall_mfn;
> > +unsigned long gmfn;
> > +gmfn = (curp->hypercall_msr >> 12);
> > +
> > +hypercall_mfn = get_mfn_from_gva(pregs->eip);
> 
> > That's exciting.  Do you expect that guests will make hypercalls
> > from anywhere other than the Viridian hypercall page?
> We support PV drivers for win2k8 guests. Given that both HyperV and
> Xen use the same hypercall numbers to implement different
> functionality, I needed a way to differentiate HyperV hypercalls. I
> use the mfn of the hypercall page to figure out who should handle
> the hypercall.
Is there any reason you can't make that decision based on eax or some
such?

The Viridian spec only specifies the interface between the guest and
the hypercall stub page, and not the interface between the stub page
and the hypervisor proper.  There's no reason why your stub page
couldn't do something like this:

orl $0x80000000, %eax
vmmcall

rather than just doing a vmmcall directly.  That would allow you to
distinguish Viridian hypercalls from Xen ones just by looking at
eax[1], which seems a bit cleaner to me.


[1] Assuming that neither Xen nor Viridian ever define hypercall
numbers above 0x80000000, which seems likely.

> > +int
> > +hyperv_vcpu_initialize(struct vcpu *v)
> > +{
> ...
> > +/*
> > + * Setup the input page for handling hypercalls.
> > + *
> > + */
> > +vcpup->input_buffer_page = 
> > +alloc_domheap_page(NULL);
> ...
> > +vcpup->input_buffer =
> > +get_virt_from_page_ptr(vcpup->input_buffer_page);
> ...
> > +vcpup->output_buffer_page = 
> > +alloc_domheap_page(NULL);
> ...
> > +vcpup->output_buffer =
> > +get_virt_from_page_ptr(vcpup->output_buffer_page);
> 
> > What exactly are these used for?  The only place I can see it
> > referenced are immediately before a panic(), which seems a bit
> > pointless.
> These were used in implementing TLB flush enlightenments. I got rid
> of those hypercalls in the current patch set. I will clean these up.
Thanks.

> > +static int
> > +hv_preprocess_cpuid_leaves(unsigned int input, struct cpu_user_regs *regs)
> > +{
> > +uint32_t idx;
> > +struct domain*d = current->domain;
> > +intextid = d->arch.hvm_domain.params[HVM_PARAM_EXTEND_HYPERVISOR];
> > +
> > +if (extid == 1) {
> > +/*
> > + * Enlightened Windows guest; need to remap and handle 
> > + * leaves used by PV front-end drivers.
> > + */
> > +if ((input >= 0x40000000) && (input <= 0x40000005)) {
> > +return (0);
> > +}
> > +/*
> > + * PV drivers use cpuid to query the hypervisor for details. On
> > + * Windows we will use the following leaves for this:
> > + *
> > + * 4096: VMM Sinature (corresponds to 0x40000000 on Linux)
> > + * 4097: VMM Version (corresponds to 0x40000001 on Linux)
> > + * 4098: Hypercall details (corresponds to 0x40000002 on Linux)
> > + */
> 
> > I think it would be better to just unconditionally duplicate the
> > Xen leaves at 0x40001000, regardless of whether the viridian
> > enlightenments are enabled.  That would make it much easier to
> > write PV drivers which work regardless of whether the
> > enlightenments are enabled, and if we can do that then it might (in
> > a few years' time) be possible to default to enlightments-on for
> > all domains.  That isn't really an option when there's no common
> > Xen API.
> 
> > This would also mean that you don't need to duplicate the whole of
> > hypervisor_cpuid_leaves() in the Viridian implementation.
> 
> I will look into cleaning this up.
Thanks.

> > +}
> > +switch (idx) {
> ...
> > +case HV_MSR_TIMER0_COUNT:
> > +timer = 0;
> > +goto process_timer_count_read;
> > +case HV_MSR_TIMER1_COUNT:
> > +timer = 1;
> > +goto process_timer_count_read;
> > +case HV_MSR_TIMER2_COUNT:
> > +timer = 2;
> > +goto process_timer_count_read;
> > +case HV_MSR_TIMER3_COUNT:
> > +timer = 3;
> > +process_timer_count_read:
> 
> > How much does timer support actually buy you?  It's pretty complicated
> > all by itself, and it also seems to be the only reason you need SYNIC
> > support.
> The feature is not used currently by the win2k8 guest. I may choose
> to get rid of this body of code.
Please do.  If it's not used then it can't possibly be buying you
anything, and it's a lot of code to carry just on the off chance that
it'll be useful later.

Steven.

Attachment: signature.asc
Description: Digital signature

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel

 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.