On Wed, Dec 24, 2008 at 1:46 AM, Tian, Kevin <kevin.tian@xxxxxxxxx> wrote:
>>* When the balloon driver loads, it inflates the balloon size to
>>(maxmem - target), giving the memory back to Xen.  When this is
>>accomplished, the "populate-on-demand" portion of boot is effectively
>>finished.
>>
>
> Another tricky point could be with VT-d. If one guest page is used as
> DMA target before balloon driver is installed, and no early access on
> that page (like start-of-day scrubber), then PoD action will not be 
> triggered...
> Not sure the possibility of such condition, but you may need to have
> some thought or guard on that. em... after more thinking, actually PoD
> pages may be alive even after balloon driver is installed. I guess before
> coming up a solution you may add a check on whether target domain
> has passthrough device to decide whether this feature is on on-the-fly.

Hmm, I haven't looked at VT-d integration; it at least requires some
examination.  How are gfns translated to mfns for the VT-d hardware?
Does it use the hardware EPT tables?  Is the transaction re-startable
if we get an EPT fault and then fix the EPT table?

Any time gfn_to_mfn() is called, unless it's specifically called with
the "query" type, the gfn is populated.  That's why qemu, the domain
builder, &c work currently without any modifications.  But if VT-d
uses the EPT tables to translate requests for a guest in hardware, and
the device requests can't be easily re-started after an EPT fault,
then this won't work.
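
To make the distinction concrete, here is a toy Python model (not Xen code; ToyP2M, hardware_walk() and the mfn numbers are made up for illustration, and I'm assuming a device-side lookup would read the table without populating it). A software caller going through gfn_to_mfn() gets the entry populated from the cache, while anything that walks the table directly just sees the unpopulated entry:

```python
class ToyP2M:
    """Toy gfn -> mfn table in which every entry starts out populate-on-demand."""

    def __init__(self, num_gfns, cache_mfns):
        # None stands for a PoD entry that has no backing page yet.
        self.table = {gfn: None for gfn in range(num_gfns)}
        # Free machine frames available to back PoD entries.
        self.cache = list(cache_mfns)

    def gfn_to_mfn(self, gfn, query=False):
        """Software lookup: populates a PoD entry unless this is a query."""
        mfn = self.table[gfn]
        if mfn is None and not query:
            if not self.cache:
                raise MemoryError("PoD cache is empty")
            mfn = self.cache.pop()
            self.table[gfn] = mfn
        return mfn

    def hardware_walk(self, gfn):
        """Hypothetical device-side lookup: reads the table, never populates."""
        return self.table[gfn]


p2m = ToyP2M(num_gfns=4, cache_mfns=[100, 101])
print(p2m.hardware_walk(0))  # device lookup on an untouched gfn
print(p2m.gfn_to_mfn(0))     # software access populates the entry
print(p2m.hardware_walk(0))  # device lookup after the entry is populated
```

If the hardware behaves like hardware_walk() and can't retry after a fault, a DMA to an untouched gfn has nowhere to go.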

A second issue is with the emergency sweep: if a page which happens to
be zero ends up being the target of a DMA, we may get:
* Device request to write to gfn X, which translates to mfn Y.
* Demand-fault on gfn Z, with no pages in the cache.
* Emergency sweep scans through gfn space, finds that mfn Y is empty
  (all zeroes).  It replaces gfn X with a PoD entry, and puts mfn Y
  behind gfn Z.
* The request finishes.  Either the request then fails (because the EPT
  translation for gfn X is not valid anymore), or it silently succeeds
  in writing to mfn Y, which is now behind gfn Z instead of gfn X.

If we can't tell that there's an outstanding I/O on the page, then we
can't do an emergency sweep.  If we have some way of knowing that
there's *some* outstanding I/O to *some* page, we could pause the
guest until the I/O completes, then do the sweep.
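
The emergency-sweep race can be replayed in a small Python simulation. This is only a sketch: ToyDomain, the outstanding_io counter and the guard_io switch are hypothetical, and the guarded sweep simply refuses to run where a real implementation would presumably pause the guest and wait for the I/O to complete.

```python
X, Z, Y = 10, 20, 500  # gfn X, gfn Z, mfn Y (arbitrary numbers)


class ToyDomain:
    def __init__(self, guard_io=False):
        self.p2m = {}            # gfn -> mfn; a missing gfn is a PoD entry
        self.memory = {}         # mfn -> page contents (0 means a zero page)
        self.cache = []          # free mfns available to back PoD entries
        self.outstanding_io = 0  # hypothetical count of in-flight DMAs
        self.guard_io = guard_io

    def emergency_sweep(self):
        """Reclaim zero pages by turning their gfns back into PoD entries."""
        if self.guard_io and self.outstanding_io:
            return  # unsafe to sweep while a DMA may be in flight
        for gfn, mfn in list(self.p2m.items()):
            if self.memory[mfn] == 0:
                del self.p2m[gfn]
                self.cache.append(mfn)

    def demand_fault(self, gfn):
        if not self.cache:
            self.emergency_sweep()
        if not self.cache:
            raise MemoryError("no memory to back PoD page")
        self.p2m[gfn] = self.cache.pop()


def replay(guard_io):
    dom = ToyDomain(guard_io)
    dom.p2m[X] = Y
    dom.memory[Y] = 0
    dom.outstanding_io += 1
    dma_target = dom.p2m[X]        # device translates gfn X to mfn Y
    try:
        dom.demand_fault(Z)        # cache is empty, so this triggers a sweep
        swept = True
    except MemoryError:
        swept = False
    dom.memory[dma_target] = 0xAB  # DMA completes with the old translation
    dom.outstanding_io -= 1
    return dom, swept


for guard in (False, True):
    dom, swept = replay(guard)
    print(guard, swept, dom.p2m, dom.memory)
```

Without the guard, the data written for gfn X ends up visible through gfn Z; with it, the sweep is skipped and the demand-fault has to be handled some other way.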

At any rate, until we have that worked out, we should probably add
some "seatbelt" code to make sure that people don't use PoD for a VT-d
enabled domain.  I know absolutely nothing about the VT-d code; could
you either write a patch to do this check, or give me an idea of the
simplest thing to check?
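
Roughly, the shape of the check I have in mind is below, written as a Python sketch rather than real Xen or xend code. DomainConfig, its fields and check_pod_config() are all made-up names, and I'm assuming a domain would use PoD exactly when its target is below its maxmem.

```python
from dataclasses import dataclass, field


@dataclass
class DomainConfig:
    """Hypothetical domain description; not a real Xen structure."""
    name: str
    maxmem_mb: int
    target_mb: int
    passthrough_devices: list = field(default_factory=list)


def check_pod_config(cfg):
    """Refuse to combine PoD with passthrough devices."""
    # Assumption: PoD is only in play when target < maxmem.
    wants_pod = cfg.target_mb < cfg.maxmem_mb
    if wants_pod and cfg.passthrough_devices:
        raise ValueError(
            f"{cfg.name}: populate-on-demand is not supported "
            "with passthrough devices"
        )


check_pod_config(DomainConfig("plain-hvm", maxmem_mb=1024, target_mb=256))
try:
    check_pod_config(
        DomainConfig("vtd-hvm", 1024, 256, passthrough_devices=["0000:01:00.0"])
    )
except ValueError as err:
    print(err)
```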

>>NB that this code is designed to work only in conjunction with a
>>balloon driver.  If the balloon driver is not loaded, eventually all
>>pages will be dirtied (non-zero), the emergency sweep will fail, and
>>there will be no memory to back outstanding PoD pages.  When this
>>happens, the domain will crash.
>
> In that case, is it better to increase PoD target to configured max mem?
> It looks uncomfortable to crash a domain just because some optimization
> doesn't apply. :-)

If this happened, it wouldn't be because an optimization didn't apply,
but because we purposely tried to use a feature for which a key
component failed or wasn't properly in place.  If we set up a domain
with VT-d access on a box with no VT-d hardware, it would fail as well
-- just during boot, not 5 minutes after it. :-)

We could try to allocate a new page at that point; but it's likely that
the allocation will fail unless there happens to be memory lying
around somewhere, not used by dom0 or any other domain.  And if that
were the case, why not just start it with that much memory to begin
with?

The only way to make this more robust would be to pause the domain,
send a message back to xend, have it try to balloon down domain 0 (or
possibly other domains), increase the PoD cache size, and then unpause
the domain again.  This is not only a lot of work, but many of the
failure modes will be really hard to handle; e.g., if qemu makes a
hypercall that ends up doing a gfn_to_mfn() translation which fails,
we would need to make that whole operation re-startable.  I did look
at this, but it's a ton of work, and a lot of code changes (including
interface changes between Xen and dom0 components), for a situation
which really should never happen in a properly configured system.
There's no reason that with a balloon driver which loads during boot,
and a properly configured target (i.e., not unreasonably small), the
driver shouldn't be able to quickly reach its target.
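
The accounting behind this can be put in a few lines of Python. This is a back-of-the-envelope sketch only: pod_shortfall() is a made-up helper, the 1024MB/256MB figures are example values, and I'm assuming the PoD cache holds exactly "target" worth of pages and that the balloon driver returns exactly (maxmem - target).

```python
def pod_shortfall(maxmem_pages, target_pages, balloon_loaded):
    """Pages the guest may touch that have no memory to back them."""
    # Assumption: the PoD cache holds exactly target_pages real pages.
    cache_pages = target_pages
    # A loaded balloon driver hands (maxmem - target) gfns back to Xen,
    # so those gfns never need backing.
    ballooned_pages = maxmem_pages - target_pages if balloon_loaded else 0
    may_be_touched = maxmem_pages - ballooned_pages
    return max(0, may_be_touched - cache_pages)


PAGES_PER_MB = 256  # 4 KiB pages
maxmem = 1024 * PAGES_PER_MB
target = 256 * PAGES_PER_MB

print(pod_shortfall(maxmem, target, balloon_loaded=True))
print(pod_shortfall(maxmem, target, balloon_loaded=False))
```

With the balloon driver loaded the shortfall is zero; without it, every page beyond the target is one the emergency sweep has to find, and once the guest has dirtied them all there is nothing left to find.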

> Last, do you have any performance data on how this patch may impact
> the boot process, or even some workload after login?

I do not have any solid numbers.  Perceptually, I haven't noticed
anything too slow.  I'll do some simple benchmarks.
 -George
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel