RE: [Xen-devel] Questioning the Xen Design of the VMM

 

> -----Original Message-----
> From: Al Boldi [mailto:a1426z@xxxxxxxxx] 
> Sent: 08 August 2006 15:10
> To: Petersson, Mats
> Cc: xen-devel@xxxxxxxxxxxxxxxxxxx
> Subject: Re: [Xen-devel] Questioning the Xen Design of the VMM
> 
> Petersson, Mats wrote:
> > Al Boldi wrote:
> > > I hoped Xen would be a bit more
> > > transparent, by simply exposing native hw tunneled thru some
> > > multiplexed Xen patched host-kernel driver.
> >
> > On the other hand, to reduce the size of the actual hypervisor (VMM),
> > the approach of Xen is to use Linux as a driver-domain (commonly
> > combined as the management "domain" of Dom0). This means that Xen
> > hypervisor itself can be driver-less, but of course also relies on
> > having another OS on top of itself to make up for this. Currently Linux
> > is the only available option for a driver-domain, but there's nothing in
> > the interface between Xen and the driver domain that says it HAS to be
> > so - it's just much easier to do with a well-known, open-source,
> > driver-rich kernel, than with a closed-source or driver-poor kernel...
> 
> Ok, you are probably describing the state of the host-kernel, which I agree
> needs to be patched for performance reasons.

Yes, but you could have more than one driver domain, each isolated in
all aspects from the other driver domains ("host-kernel" implies, to me,
that it also manages the other domains).

Why would you want to have more than one driver domain? For separation,
of course...
1. Competing Company A and Company B are sharing the same hardware - you
don't want Company A to have even the remotest chance of seeing any data
that belongs to B, or the other way around, so you definitely want them
to be separated in as many ways as possible.
2. Let's assume that someone finds a way to "hack" into a system by
sending some particular pattern on the network (TCP/IP to a particular
port, causing a buffer overflow, seems to have been popular on Windows at
least). If you have multiple driver domains, only ONE domain would get
broken (into) by this approach - of course, if the attack is widespread
it would still break all ports, but if it's targeted at one particular
domain, the others will survive [let's say one of your client companies
is hit by a targeted attack - the other companies will then be
unaffected].
> 
> > > I maybe missing something, but why should the Xen-design
> > > require the guest to be patched?
> >
> > There are two flavours of Xen guests:
> > Para-virtual guests. Those are patched kernels, and have (in past
> > versions of Xen) been implemented for Linux 2.4, Linux 2.6, Windows,
> > <some version of>BSD and perhaps other versions that I don't know of.
> > Current Xen is "Linux only" supplied with the Xen kernel. Other kernels
> > are being worked on.
> 
> This is the part I am questioning.

The main reason to use a para-virtual kernel is that it performs better
than the fully virtualized version.
> 
> > HVM guests. These are fully virtualized guests, where the guest contains
> > the same binary as you would use on a non-virtual system. You can run
> > Windows or Linux, or most other OS's on this. It does require "new"
> > hardware that has virtualization support in hardware (AMD's AMDV (SVM)
> > or Intel VT) to use this flavour of guest though, so the older model is
> > still maintained.
> 
> So HVM solves the problem, but why can't this layer be implemented in 
> software?

It CAN, and has been done. It is, however, a little bit difficult to
cover some of the "strange" corner cases, as the x86 processor wasn't
really designed to handle virtualization natively [until these
extensions were added]. This is why you end up with binary translation
in VMware, for example. To illustrate, let's say that we use the method
of "ring compression" (which is when the guest-OS is moved from Ring 0
[full privileges] to Ring 1 [less than full privileges]), and the
hypervisor wants to have full control of interrupt flags:

some_function:
        ...
        pushf                   // Save interrupt flag.
        cli                     // Disable interrupts
        ... 
        ...
        ...
        popf                    // Restore interrupt flag. 
        ...

In Ring 0, all this works just fine - but of course the hypervisor then
doesn't know that the guest-OS tried to disable interrupts, so we have to
change something. In Ring 1, the guest can't disable interrupts, so the
CLI instruction faults and can be intercepted. Great. But pushf/popf are
valid instructions in all four rings - popf just doesn't change the
interrupt enable flag in the flags register if you're not allowed to use
the CLI/STI instructions! So the popf never reaches the hypervisor, which
means that, as far as the hypervisor can tell, interrupts stay disabled
forever after [until an STI instruction gets found by chance, at least].


And if the next bit of code is:

        // someaddress is updated by an interrupt!
        mov     someaddress, eax
$1:
        cmp     someaddress, eax                // Check it... 
        jz      $1

Then we'd very likely never get out of there, since the actual interrupt
causing someaddress to change is believed by the VMM to be disabled. 

There is no real way to make popf trap [other than supplying it with
invalid arguments in virtual 8086 mode, which isn't really a practical
thing to do here!]
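
To make the pushf/cli/popf problem a bit more concrete, here is a rough
toy model in C. It is only a sketch of the privilege rules described
above, not Xen or VMM code, and every name in it (struct vcpu, guest_cli,
guest_pushf, guest_popf) is made up for the illustration. It assumes the
guest runs in Ring 1 with IOPL 0, so CLI traps but POPF does not.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the interrupt-flag rules (hypothetical names, not VMM code). */
struct vcpu {
    int  cpl;          /* ring the guest kernel currently runs in */
    int  iopl;         /* I/O privilege level held in the flags register */
    bool hw_if;        /* interrupt flag as the hardware sees it */
    bool vmm_guest_if; /* what the hypervisor believes the guest's IF is */
};

/* CLI faults when CPL > IOPL, so the hypervisor sees it and records it.
 * Returns true if the instruction trapped to the hypervisor. */
bool guest_cli(struct vcpu *c)
{
    if (c->cpl > c->iopl) {
        c->vmm_guest_if = false;   /* intercepted: virtual IF is cleared */
        return true;
    }
    c->hw_if = false;              /* privileged enough: really clears IF */
    return false;
}

/* PUSHF does not fault in protected mode; it just hands back the flags. */
bool guest_pushf(const struct vcpu *c)
{
    return c->hw_if;
}

/* POPF does not fault in protected mode either. With CPL > IOPL the IF
 * bit is silently left alone, so the hypervisor is never told that the
 * guest wanted its interrupts back. Returns true if it trapped (never). */
bool guest_popf(struct vcpu *c, bool saved_if)
{
    if (c->cpl <= c->iopl)
        c->hw_if = saved_if;
    return false;
}
```

Running the pushf / cli / popf sequence through this model with cpl = 1
and iopl = 0 leaves vmm_guest_if false after the popf, which is exactly
the "disabled forever after" state.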

Another problem is "hidden bits" in registers. 

Let's say this:

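        mov     cr0, eax                // Read CR0
        mov     eax, ecx                // Keep a copy of the original value
        or      $1, eax                 // Set PE (protected mode enable)
        mov     eax, cr0                // Enter protected mode
        mov     $0x10, eax
        mov     eax, fs                 // Load FS (hidden limit 0xFFFFFFFF)
        mov     ecx, cr0                // Restore CR0; FS keeps hidden limit
        
        mov     $0xF000000, eax         // Offset well above 64KB
        mov     $10000, ecx             // Loop count
$1:
        mov     $0, fs:eax              // Store through FS at that offset
        add     $4, eax
        dec     ecx
        jnz     $1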

Let's now say that an interrupt arrives while the loop in the above code
is running, and that the hypervisor handles it. The hypervisor itself
uses FS for some special purpose, and thus needs to save/restore the FS
register. When it returns, the system will crash (GP fault) because the
FS register limit is now 0xFFFF (64KB) and eax is greater than the limit
- but the limit of FS was set to 0xFFFFFFFF before we took the
interrupt... Incorrect behaviour like this is terribly difficult to deal
with, and there really isn't any good way to solve these issues [other
than not allowing the code to run when it does "funny" things like this
- or running the affected code in "translation mode" - i.e. emulating
each instruction -> slow(ish)].
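
The "hidden bits" issue can be sketched with another small toy model in
C. Again, this is only an illustration under my own assumptions: the
names (struct segreg, limit_after_restore, save_restore_fs, access_ok)
are invented, and the 64KB limit after the restore is simply taken from
the scenario above rather than derived from any real descriptor table.

```c
#include <assert.h>
#include <stdint.h>

/* A segment register has a visible selector and a hidden, cached limit. */
struct segreg {
    uint16_t selector;  /* visible part: all that software can read back */
    uint32_t limit;     /* hidden part: cached when the register was loaded */
};

/* Hypothetical: the limit FS ends up with when it is reloaded from the
 * saved selector. The scenario in this message gives 64KB (0xFFFF). */
uint32_t limit_after_restore(uint16_t selector)
{
    (void)selector;
    return 0xFFFFu;
}

/* Save/restore through the selector alone, as an interrupt handler in
 * the hypervisor might do it. The hidden limit cannot be saved this way. */
void save_restore_fs(struct segreg *fs)
{
    uint16_t saved = fs->selector;
    /* ... hypervisor uses FS for its own purposes here ... */
    fs->selector = saved;
    fs->limit = limit_after_restore(saved); /* re-derived, not restored */
}

/* Returns 1 if a 4-byte access at `offset` passes the limit check. */
int access_ok(const struct segreg *fs, uint32_t offset)
{
    return offset <= fs->limit && fs->limit - offset >= 3;
}
```

Before the save/restore, an access at offset 0xF000000 passes the check;
afterwards the selector reads back the same, but the same access fails,
which corresponds to the GP fault in the guest.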

> 
> I'm sure there can't be a performance issue, as this virtualization doesn't
> occur on the physical resource level, but is (should be) rather implemented
> as some sort of a multiplexed routing algorithm, I think :)

I'm not entirely sure what this statement is trying to say, but as I
understand the situation, performance is entirely the reason why the Xen
paravirtual model was implemented - all other VMMs are slower [although
it's often hard to prove that, since for example VMware has the rule
that it has to give permission before benchmarks of its product are
published, and of course that permission would only be given in cases
where there is some benefit to VMware].

One of the obvious reasons for para-virtual being better than full
virtualization is that it can be used in a "batched" mode. Let's say we
have some code that does this:

...
        p = malloc(2000 * 4096);
... 

Let's then say that the guts of malloc end up in something like this:

map_pages_to_user(...)
{
        for (v = random_virtual_address, p = start_page; p < end_page;
             p++, v += 4096)
                map_one_page_to_user(p, v);
}

In full virtualization, we have no way to understand that someone is
mapping 2000 pages to the same user-process in one guest, we'd just see
writes to the page-table one page at a time. 

In the para-virtual case, we could do something like:
map_pages_to_user(...)
{
        hypervisor_map_pages_to_user(current_process, start_page,
                                     end_page, random_virtual_address);
}

Now, the hypervisor knows "the full story" and can map all those pages
in one go - much quicker, I would say. There's still more work than in
the native case, but it's much closer to the native case. 
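
As a rough way to see the difference, the two variants can be modelled
in C by counting how often the guest ends up in the hypervisor. This is
a toy sketch, not the real Xen interface: struct hv, hv_trap_pte_write
and hv_map_pages are hypothetical names, and the "work" done per call is
reduced to bumping counters.

```c
#include <assert.h>

#define PAGE_SIZE 4096ul

/* Toy hypervisor state: guest->hypervisor transitions and pages mapped. */
struct hv {
    unsigned long transitions;
    unsigned long pages_mapped;
};

/* Full virtualization: every page-table write is seen on its own. */
void hv_trap_pte_write(struct hv *h, unsigned long page, unsigned long vaddr)
{
    (void)page;
    (void)vaddr;
    h->transitions++;
    h->pages_mapped++;
}

/* Hypothetical batched call: the whole range arrives in one request. */
void hv_map_pages(struct hv *h, unsigned long start_page,
                  unsigned long end_page, unsigned long vaddr)
{
    assert(start_page <= end_page);
    (void)vaddr;
    h->transitions++;
    h->pages_mapped += end_page - start_page;
}

/* Guest side, unmodified kernel: one page-table write per page. */
void map_pages_full_virt(struct hv *h, unsigned long start_page,
                         unsigned long end_page, unsigned long vaddr)
{
    unsigned long p, v;

    for (p = start_page, v = vaddr; p < end_page; p++, v += PAGE_SIZE)
        hv_trap_pte_write(h, p, v);
}

/* Guest side, para-virtual kernel: one call covering the whole range. */
void map_pages_paravirt(struct hv *h, unsigned long start_page,
                        unsigned long end_page, unsigned long vaddr)
{
    hv_map_pages(h, start_page, end_page, vaddr);
}
```

For the 2000-page example, the first variant enters the hypervisor 2000
times and the second once, with the same number of pages mapped in both
cases.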


> 
> > I hope this is of use to you.
> > 
> > Please feel free to ask any further questions...
> 
> Thanks a lot for your detailed response!
> 
> --
> Al
> 
> 
> 
> 



_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel