
Re: [Xen-devel] Is: paravirt docs and diet. Was:Re: [Lguest] lguest: unhandled trap 13 and CONFIG_MICROCODE_INTEL_EARLY

On 05/10/2013 07:51 AM, Konrad Rzeszutek Wilk wrote:
> First off, thank you for enumerating the issues and clarifying them.

Thank you for the writeup, and for all the work you personally have done
in making things *much* smoother than they have been in the past.  I
don't want to make it sound like I'm just bellyaching here.

In the end, pvops is effectively driverizing the CPU (and a bunch of
other things, some of which should be separately driverized... that work
is happening too, of course.)  Driverizing the CPU is really, really
hard, although it can be done well; it just needs to be documented down
to the level of detail the CPU itself is.  The current CPU documentation
is 3044 pages long ;)

> My understanding (until this email conversation) was that the
> lack of hypercall docs was hindering you. However, you are not
> interested in that - you are after documentation for the paravirt
> interface, specifically how it differs from what bare metal would do.
> That changes my priorities!
> In that case lguest provides an excellent source of documentation on
> how the paravirt interfaces should work. It is almost like reading a book.
> Perhaps moving them to paravirt.h will help. There are of course
> some differences between the platforms that use it - but that would help
> identify them. 

Sort of, but not really.  Unless each pvop is documented to the same
extent as the native operations they implement, there is no way to
deduce the necessary conditions from an example, no matter how
well-written.  That means that the pvop users themselves have to be
comprehensible, too.  This is particularly important when we want to
change the pvops themselves, as that is a kernel-internal interface as
opposed to the hypervisor interface which is an external ABI.

[... lots of good stuff snipped ...]

> Your specific example is interesting as a quick 10 second lookup confirmed
> that the implementation for different paravirt platforms all point to
> native_read_cr4().  There are no PV hypercalls involved. Axe!

Lguest uses these pvops to avoid touching cr4 at all:

/* cr4 is used to enable and disable PGE, but we don't care. */
static unsigned long lguest_read_cr4(void)
{
        return 0;
}

static void lguest_write_cr4(unsigned long val)
{
}
Lguest doesn't even seem to have an implementation for read_cr4_safe; I
don't know what will happen in that case.

> I hope it is clear that there is no resistance from me to this diet program.
> The issues I face in helping you are the same ones you do: priorities - those
> merge windows, darn bugs, regressions, etc.

Of course.  Ultimately, we don't want to break things.  I would like to
see the Xen community more committed to a roadmap toward killing PV in
favor of HVM or what you call hybrid.

> And until this conversation I was under the impression the documentation
> for hypercalls was #1 - but that is more of a #3 (#2 being the
> diet program, #1 being the paravirt documentation).

That probably is a fair statement.

>>> 3. "Let's add another hook" becomes a far too easy solution to new problems.
> I would hope that with your help in brainstorming these problems one can
> figure out the right way of doing it. Perhaps instead of posting the patches
> first, one can solicit you for help to design it properly from the
> start?

Absolutely, and that is happening more and more.  More generally:
"show me the code" in all honour, but "show me the design" first.

>>> 4. Performance and maintainability impact of having to support multiple
>>> code flows with different semantics.  The semantics of the Xen MMU, in
>>> particular, is actually quite different from the x86 MMU.
> You along with Thomas designed a nice way of not only fixing this code
> but also cleaning it up. And I hope we all learned more
> about the Xen RO/pinning/pagetables protection mechanism. The semantics
> are understood now.
> In terms of development however, I recall that you as an excellent
> maintainer were able to delegate the work to an Oracle employee and
> everybody pitched in to help review the code and test. We did not test
> a forced branch which had some extra patches and that bit us all
> in the merge window.
> The after-effect is quite nice code and design.

Yes, it ended up being a really nice piece of teamwork.
Kudos to everyone involved.

>>> 5. Performance and maintainability impact of a maze of twisty little
>>> functions, all different.  For example, in the case of some of the MSR
>>> functions, we actually end up telling the compiler to combine and break
>>> up the two 32-bit halves into a 64-bit register multiple times, because
>>> the functions don't actually match up.  I still don't understand why we
>>> don't just patch out the rdmsr/wrmsr instructions, using those registers.
> My understanding is that there are two variants of paravirt patching:
> inline patching (for a selective bunch of them), and patching in a call.
> Inline patching is the more efficient one, and the functions that
> were chosen for it were the performance-sensitive ones. I think at the
> time this was designed the rdmsr/wrmsr instructions were not terribly
> fast.

Yes.  The vast majority of MSRs are not performance-critical;
however, there is a small handful that are (and there are reasons to
believe the number will increase).

> I think that some of these paravirt interfaces could be optimized now.
> One idea that I had been toying with, and asked the ksplice team to
> prototype, was the use of the asm goto mechanism. It only did it for
> the pte_modify call, but I never got to run the performance numbers
> on all the different architectures.
> In summary I think (and please correct me if I misunderstood you):
>  - The paravirt interface documentation should be the #1 priority,
>  - Fixes, diet and optimization for the paravirt infrastructure are needed.
> And since we don't want to hinder your development, and I would need
> to reschedule work items, I have to ask - what development work
> is being hindered by this? So I have an idea of the timelines (if this
> needs to be an NDA discussion, pls email me privately your manager's
> email so I can start the discussion on that).

Here is the funny bit... it actually hinders small pieces of work more
than big pieces of work.  For someone who is already doing a bunch of
heavy lifting, it is much easier to amortize the cost of dealing with
any PV funnies that come into play, but if you were hoping to do a
10-line patch to fix a problem you're not going to prefix it with a
patch series to fix PV issues raised by the patch if you can at all
avoid it.

> Lastly, a bit of a separate topic, but close: I thought that the paravirt
> interface was under the x86 maintainers' ownership, but the MAINTAINERS
> file says otherwise!
> Jeremy, Rusty, are you OK with this? Rusty, would you have the time to
> review the patches, or would it be easier if I was the sole person?

Yes, I suspect that is one of the problems we have had.  The paravirt
interface dates back to a time when there was no "de jure" x86
maintainer (Andi Kleen was maintaining x86-64, and Linus was
theoretically maintaining i386 as the "reference port" but didn't really
have the bandwidth) which didn't really help the situation any.

At the same time, I think we have realized over time just how tightly
coupled this is.  This is a classic hook problem (and part of why hooks
in general are frowned upon in the kernel)... they are highly unobtrusive
when first created, but then they end up being fixed points for the code
around them, and people naturally like to avoid messing with the fixed
points... so the surrounding code grows in unnatural ways.

That being said, the single biggest issue is definitely the MMU.
That is the one area where PV and non-PV differ dramatically in
semantics, and it is also one of the most performance-critical parts of
the machine.


Xen-devel mailing list

