>From: George Dunlap
>Sent: Wednesday, December 24, 2008 10:43 PM
>> Another tricky point could be with VT-d. If one guest page is used as
>> DMA target before balloon driver is installed, and no early access on
>> that page (like start-of-day scrubber), then PoD action will not be
>> triggered...
>> Not sure the possibility of such condition, but you may need to have
>> some thought or guard on that. Hmm... after more thinking, such
>> pages may be alive even after the balloon driver is installed. One
>> solution: you may add a check on whether the target domain has a
>> passthrough device to decide whether this feature is on.
>Hmm, I haven't looked at VT-d integration; it at least requires some
>examination. How are gfns translated to mfns for the VT-d hardware?
>Does it use the hardware EPT tables? Is the transaction re-startable
>if we get an EPT fault and then fix the EPT table?
There's a VT-d page table walked by the VT-d engine, which is similar
in content to the EPT. When a device DMA request is intercepted by the
VT-d engine, the VT-d page table corresponding to that device is walked
for a valid mapping. Unlike an EPT fault, which is restartable, a VT-d
page fault is only logged, since the PCI bus doesn't support I/O restart
yet (although the PCI-SIG is looking at this possibility). That said, if
we can't find a chance to trigger a CPU page fault before a PoD page is
used as a DMA target, one of the two should be disabled if both are
configured.
>A second issue is with the emergency sweep: if a page which happens to
>be zero ends up being the target of a DMA, we may get:
>* Device request to write to gfn X, which translates to mfn Y.
>* Demand-fault on gfn Z, with no pages in the cache.
>* Emergency sweep scans through gfn space, finds that mfn Y is empty.
>It replaces gfn X with a PoD entry, and puts mfn Y behind gfn Z.
>* The request finishes. Either the request then fails (because EPT
>translation for gfn X is not valid anymore), or it silently succeeds
>in writing to mfn Y, which is now behind gfn Z instead of gfn X.
Yes, this is also an issue. The request will fail, since the DMA address
written to the device is a gfn, while the X->Y mapping is cut off by the
sweep.
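To make the sequence above concrete, here is a toy Python model of it.
It is not Xen code: the P2M class, the POD marker and the
emergency_sweep/dma_write helpers are hypothetical names made up for
illustration, and a real device would be translated through the VT-d
page table rather than a Python dict.

```python
POD = "pod"  # hypothetical marker for a populate-on-demand entry


class P2M:
    """Toy gfn -> mfn table, used here by both the CPU and the device."""

    def __init__(self):
        self.table = {}    # gfn -> mfn, or POD
        self.memory = {}   # mfn -> page contents (0 means an all-zero page)

    def emergency_sweep(self, faulting_gfn):
        """Reclaim the first zero page found and put it behind faulting_gfn."""
        for gfn, mfn in list(self.table.items()):
            if mfn != POD and self.memory[mfn] == 0:
                self.table[gfn] = POD            # gfn X loses its backing
                self.table[faulting_gfn] = mfn   # mfn Y now backs gfn Z
                return mfn
        return None  # sweep failed: no zero page left

    def dma_write(self, gfn, value):
        """Device write addressed by gfn; no restart if the mapping is gone."""
        mfn = self.table.get(gfn)
        if mfn is None or mfn == POD:
            return False   # the fault is only logged, the write is lost
        self.memory[mfn] = value
        return True


def demo():
    p2m = P2M()
    X, Y, Z = 0x10, 0x99, 0x20
    p2m.table[X] = Y       # guest programmed the device with gfn X
    p2m.memory[Y] = 0      # the DMA buffer is still all zeroes
    p2m.table[Z] = POD     # gfn Z has not been populated yet
    stolen = p2m.emergency_sweep(Z)   # demand fault on Z, empty cache
    ok = p2m.dma_write(X, 0xAB)       # the in-flight request completes
    return stolen, ok, p2m.table[X], p2m.table[Z]
```

Calling demo() shows the sweep taking mfn Y away from gfn X while the
device still holds gfn X as its DMA address, so the later write has no
valid translation.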
>If we can't tell that there's an outstanding I/O on the page, then we
>can't do an emergency sweep. If we have some way of knowing that
>there's *some* outstanding I/O to *some* page, we could pause the
>guest until the I/O completes, then do the sweep.
One possibility is to have a PV DMA engine or a virtual VT-d engine
within the guest, but that's another story.
>At any rate, until we have that worked out, we should probably add
>some "seatbelt" code to make sure that people don't use PoD for a VT-d
>enabled domain. I know absolutely nothing about the VT-d code; could
>you either write a patch to do this check, or give me an idea of the
>simplest thing to check?
Weidong works on VT-d and could comment on the exact point to check.
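Until then, the shape of the check could be as small as the Python
sketch below. It is only an illustration: DomainConfig, its fields and
check_pod_config are hypothetical names, not xend or Xen interfaces,
and it assumes PoD is in use whenever the boot target is below maxmem.

```python
from dataclasses import dataclass, field


@dataclass
class DomainConfig:
    """Hypothetical domain description; not a real Xen or xend structure."""
    name: str
    maxmem_mb: int
    target_mb: int
    passthrough_devices: list = field(default_factory=list)


def check_pod_config(cfg):
    """Return True if the domain would use PoD.

    Raises ValueError when PoD would be combined with device passthrough,
    so the failure shows up at domain creation instead of minutes later.
    """
    uses_pod = cfg.target_mb < cfg.maxmem_mb   # assumption: this implies PoD
    if uses_pod and cfg.passthrough_devices:
        raise ValueError(
            f"domain {cfg.name}: PoD cannot be used with passthrough "
            f"devices {cfg.passthrough_devices}"
        )
    return uses_pod
```

Failing at creation time gives the user the same immediate feedback as
the missing-VT-d-hardware case.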
>>>NB that this code is designed to work only in conjunction with a
>>>balloon driver. If the balloon driver is not loaded, eventually all
>>>pages will be dirtied (non-zero), the emergency sweep will fail, and
>>>there will be no memory to back outstanding PoD pages. When this
>>>happens, the domain will crash.
>> In that case, is it better to increase PoD target to configured
>> max mem? It looks uncomfortable to crash a domain just because some
>> optimization doesn't apply. :-)
>If this happened, it wouldn't be because an optimization didn't apply,
>but because we purposely tried to use a feature for which a key
>component failed or wasn't properly in place. If we set up a domain
>with VT-d access on a box with no VT-d hardware, it would fail as well
>-- just during boot, not 5 minutes after it. :-)
The VT-d case is a different story: as you said, domain creation
will fail for lack of VT-d support, so the user sees what's
happening immediately and can make the appropriate change to the
configuration file. Nothing is impacted.
In the PoD case, however, the emergency sweep may fail 5 minutes
after boot, or even later if the guest doesn't use much memory,
and then... crash. This is a bad user experience, especially
since unsynced data could be lost.
Anyway, PoD looks like a nice-to-have feature, just like
superpages. In both cases, as long as there's a chance to fall
back, we'd better fall back instead of crashing. For example, as
long as there are enough free domheap pages, use 4k pages when a
superpage allocation fails, and expand PoD to max mem for a domain
which doesn't install a balloon driver successfully. In an
environment with such over-commitment support, not all VMs are
expected to join that party. :-)
A side question is how an emergency sweep failure could be
detected and reported to the user...
>We could allocate a new page at that point; but it's likely that
>the allocation will fail unless there happens to be memory lying
>around somewhere, not used by dom0 or any other domain. And if that
>were the case, why not just start it with that much memory to begin
>with?
The point is that a user's willingness to use PoD doesn't mean it
will always succeed. You wouldn't expect the user to disable PoD
and use that much memory only after several rounds of crashes.
>The only way to make this more robust would be to pause the domain,
>send a message back to xend, have it try to balloon down domain 0 (or
>possibly other domains), increase the PoD cache size, and then unpause
>the domain again. This is not only a lot of work, but many of the
>failure modes will be really hard to handle; e.g., if qemu makes a
>hypercall that ends up doing a gfn_to_mfn() translation which fails,
>we would need to make that whole operation re-startable. I did look
>at this, but it's a ton of work, and a lot of code changes (including
>interface changes between Xen and dom0 components), for a situation
>which really should never happen in a properly configured system.
>There's no reason that with a balloon driver which loads during boot,
>and a properly configured target (i.e., not unreasonably small), the
>driver shouldn't be able to quickly reach its target.
So I think a simple fallback to expand PoD to maxmem automatically
can avoid such complexity.
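As a rough sketch of the order I have in mind (Python used only as
pseudocode that happens to run; populate_on_demand and its arguments
are hypothetical, not existing Xen functions): try the PoD cache, then
the emergency sweep, then free host memory, and crash only when all
three are exhausted.

```python
def populate_on_demand(cache, sweep, host_free):
    """Pick a backing page for one PoD fault.

    cache     -- list of mfns already held in the PoD cache
    sweep     -- callable returning a reclaimed mfn, or None if it fails
    host_free -- list of free host mfns (a stand-in for the domheap)

    Returns (mfn, source). Raises MemoryError only when every option
    is gone, which is the point where the domain would have to crash.
    """
    if cache:
        return cache.pop(), "cache"
    mfn = sweep()
    if mfn is not None:
        return mfn, "sweep"
    if host_free:
        # Fallback: grow the domain toward maxmem instead of crashing.
        return host_free.pop(), "fallback"
    raise MemoryError("no memory left to back the PoD page")
```

The returned source string is also a natural place to hook a warning,
so that a domain running on the fallback path gets reported to the
user before anything crashes.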
>> Last, do you have any performance data on how this patch may impact
>> the boot process, or even some workload after login?
>I do not have any solid numbers. Perceptually, I haven't noticed
>anything too slow. I'll do some simple benchmarks.
Thanks for your good work.
Xen-devel mailing list