This is an archived copy of the Xen.org mailing list, which we have preserved to ensure that existing links to archives are not broken. The live archive, which contains the latest emails, can be found at http://lists.xen.org/


Re: [Xen-devel] Questioning the Xen Design of the VMM

To: "Petersson, Mats" <Mats.Petersson@xxxxxxx>
Subject: Re: [Xen-devel] Questioning the Xen Design of the VMM
From: Al Boldi <a1426z@xxxxxxxxx>
Date: Wed, 9 Aug 2006 15:53:06 +0300
Cc: xen-devel@xxxxxxxxxxxxxxxxxxx
Delivery-date: Thu, 10 Aug 2006 01:52:05 -0700
Envelope-to: www-data@xxxxxxxxxxxxxxxxxx
In-reply-to: <907625E08839C4409CE5768403633E0BA7FE0E@xxxxxxxxxxxxxxxxx>
List-help: <mailto:xen-devel-request@lists.xensource.com?subject=help>
List-id: Xen developer discussion <xen-devel.lists.xensource.com>
List-post: <mailto:xen-devel@lists.xensource.com>
List-subscribe: <http://lists.xensource.com/cgi-bin/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
List-unsubscribe: <http://lists.xensource.com/cgi-bin/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
References: <907625E08839C4409CE5768403633E0BA7FE0E@xxxxxxxxxxxxxxxxx>
Sender: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
User-agent: KMail/1.5
Petersson, Mats wrote:
> > > Al Boldi wrote:
> > > > I may be missing something, but why should the Xen design
> > > > require the guest to be patched?
> The main reason to use a para-virtual kernel is that it performs better
> than the fully virtualized version.
> > So HVM solves the problem, but why can't this layer be implemented in
> > software?
> It CAN, and has been done.

You mean full virtualization using binary translation in software?

My understanding was that HVM implies full virtualization without the need 
for binary translation in software.

> It is, however, a little difficult to
> cover some of the "strange" corner cases, as the x86 processor wasn't
> really designed to handle virtualization natively [until these
> extensions were added].

You mean the AMD-V/Intel VT extensions?

If so, then these extensions don't actively participate in the act of 
virtualization, but rather fix some x86-arch shortcomings that make it 
easier for software (i.e. Xen) to virtualize, thus obviating the need for 
binary translation.  Is this a correct reading?

> This is why you end up with binary translation
> in VMware, for example. Let's say that we use the method of
> "ring compression" (which is when the guest-OS is moved from Ring 0
> [full privileges] to Ring 1 [less than full privileges]), and the
> hypervisor wants to have full control of interrupt flags:
> some_function:
>       ...
>       pushf                   // Save interrupt flag.
>       cli                     // Disable interrupts
>       ...
>       ...
>       ...
>       popf                    // Restore interrupt flag.
>       ...
> In Ring 0, all this works just fine - but of course, we don't know that
> the guest-OS tried to disable interrupts, so we have to change
> something. In Ring 1, the guest can't disable interrupts, so the CLI
> instruction can be intercepted. Great. But pushf/popf are valid
> instructions in all four rings - they just don't change the interrupt
> enable flag in the flags register if you're not allowed to use the
> CLI/STI instructions! So, that means that interrupts are disabled
> forever after [until an STI instruction gets found by chance, at least].
> And if the next bit of code is:
>       mov     someaddress, eax        // someaddress is updated by an interrupt!
> $1:
>       cmp     someaddress, eax                // Check it...
>       jz      $1
> Then we'd very likely never get out of there, since the actual interrupt
> causing someaddress to change is believed by the VMM to be disabled.
> There is no real way to make popf trap [other than supplying it with
> invalid arguments in virtual 8086 mode, which isn't really a practical
> thing to do here!]
> Another problem is "hidden bits" in registers.
> Let's say this:
>       mov     cr0, eax
>       mov     eax, ecx
>       or      $1, eax
>       mov     eax, cr0
>       mov     $0x10, eax
>       mov     eax, fs
>       mov     ecx, cr0
>       mov     $0xF000000, eax
>       mov     $10000, ecx
> $1:
>       mov     $0, fs:eax
>       add     $4, eax
>       dec     ecx
>       jnz     $1
> Let's now say that we have an interrupt that the hypervisor would handle
> in the loop in the above code. The hypervisor itself uses FS for some
> special purpose, and thus needs to save/restore the FS register. When it
> returns, the system will crash (GP fault) because the FS register limit
> is 0xFFFF (64KB) and eax is greater than the limit - but the limit of FS
> was set to 0xFFFFFFFF before we took the interrupt... Incorrect
> behaviour like this is terribly difficult to deal with, and there really
> isn't any good way to solve these issues [other than not allowing the
> code to run when it does "funny" things like this - or to run the
> affected code in "translation mode" - i.e. emulate each instruction ->
> slow(ish)].

Or by introducing the AMD-V/Intel VT extensions?

> > I'm sure there can't be a performance issue, as this
> > virtualization doesn't
> > occur on the physical resource level, but is (should be)
> > rather implemented
> > as some sort of a multiplexed routing algorithm, I think :)
> I'm not entirely sure what this statement is trying to say, but as I
> understand the situation, performance is entirely the reason why the Xen
> paravirtual model was implemented - all other VMMs are slower [although
> it's often hard to prove that, since for example VMware requires that
> permission be granted before benchmarks of their product are published,
> and of course that permission would only be given in cases
> where there is some benefit to them].
> One of the obvious reasons for para-virtual being better than full
> virtualization is that it can be used in a "batched" mode. Let's say we
> have some code that does this:
> ...
>       p = malloc(2000 * 4096);
> ...
> Let's then say that the guts of malloc ends up in something like this:
> map_pages_to_user(...)
> {
>       for (v = random_virtual_address, p = start_page; p < end_page; p++, v += 4096)
>               map_one_page_to_user(p, v);
> }
> In full virtualization, we have no way to understand that someone is
> mapping 2000 pages to the same user-process in one guest, we'd just see
> writes to the page-table one page at a time.
> In the para-virtual case, we could do something like:
> map_pages_to_user(...)
> {
>       hypervisor_map_pages_to_user(current_process, start_page,
>               end_page, random_virtual_address);
> }
> Now, the hypervisor knows "the full story" and can map all those pages
> in one go - much quicker, I would say. There's still more work than in
> the native case, but it's much closer to the native case.

Sure, but wouldn't this come at the price of losing guest-OS transparency?



Xen-devel mailing list