[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] (v2) Design proposal for RMRR fix

> From: George Dunlap
> Sent: Monday, January 12, 2015 10:20 PM
> On Mon, Jan 12, 2015 at 12:28 PM, Tian, Kevin <kevin.tian@xxxxxxxxx> wrote:
> >> From: George Dunlap
> >> Sent: Monday, January 12, 2015 8:14 PM
> >>
> >> On Mon, Jan 12, 2015 at 11:22 AM, Tian, Kevin <kevin.tian@xxxxxxxxx>
> wrote:
> >> >> From: Jan Beulich [mailto:JBeulich@xxxxxxxx]
> >> >> Sent: Monday, January 12, 2015 6:23 PM
> >> >>
> >> >> >>> On 12.01.15 at 11:12, <kevin.tian@xxxxxxxxx> wrote:
> >> >> >>  From: Jan Beulich [mailto:JBeulich@xxxxxxxx]
> >> >> >> Sent: Monday, January 12, 2015 6:09 PM
> >> >> >>
> >> >> >> >>> On 12.01.15 at 10:56, <kevin.tian@xxxxxxxxx> wrote:
> >> >> >> > the result is related to another open whether we want to block
> guest
> >> >> >> > boot for such problem. If 'warn' in domain builder is acceptable, 
> >> >> >> > we
> >> >> >> > don't need to change lowmem for such rare 1GB case, just throws
> >> >> >> > a warning for unnecessary conflictions (doesn't hurt if user 
> >> >> >> > doesn't
> >> >> >> > assign it).
> >> >> >>
> >> >> >> And how would you then deal with the one guest needing that
> >> >> >> range reserved?
> >> >> >
> >> >> > if guest needs the range, then report-all or report-sel doesn't 
> >> >> > matter.
> >> >> > domain builder throws the warning, and later device assignment will
> >> >> > fail (or warn w/ override). In reality I think 1GB is rare. Making 
> >> >> > such
> >> >> > assumption to simplify implementation is reasonable.
> >> >>
> >> >> One of my main problems with all you recent argumentation here
> >> >> is the arbitrary use of the 1Gb boundary - there's nothing special
> >> >> in this discussion with where the boundary is. Everything revolves
> >> >> around the (undue) effect of report-all on domains not needing all
> >> >> of the ranges found on the host.
> >> >>
> >> >
> >> > I'm not sure which part of my argument is not clear here. report-all
> >> > would be a problem here only if we want to fix all the conflictions
> >> > (then pulling unnecessary devices increases the confliction possibility)
> >> > in the domain builder. but if we only fix reasonable ones (e.g. >3GB)
> >> > while warn other conflictions (e.g. <3G) in domain builder (let later
> >> > assignment path to actually fail if confliction does matter), then we
> >> > don't need to solve all conflictions in domain builder (if say 1G example
> >> > fixing it may instead reduce lowmem greatly) and then report-all
> >> > may just add more warnings than report-sel for unused devices.
> >>
> >> You keep saying "report-all" or "report-sel", but I'm not 100% clear
> >> what you mean by those.  In any case, the naming has got to be a bit
> >> misleading: the important questions at the moment, AFAICT, are:
> >
> > I explained them in original proposal
> Yes, I read it and didn't understand it there either. :-)

sorry for that.

> >> 1. Whether we make holes at boot time for all RMRRs on the system, or
> >> whether only make RMRRs for some subset (or potentially some other
> >> arbitrary range, which may include RMRRs on other hosts to which we
> >> may want to migrate).
> >
> > I use 'report-all' to stand for making holes for all RMRRs on the system,
> > while 'report-sel' for specified subset.
> >
> > including other RMRRs (from admin for migration) is orthogonal to
> > above open.
> Right; so the "report" in this case is "report to the guest".
> As I said, I think that's confusing terminology; after all, we want to
> report to the guest all holes that we make, and only the holes that we
> make.  The question isn't then which ones we report, but which ones we
> make holes for. :-)

originally I use 'report' to describe the hypercall which hypervisor composes
the actual information about RMRR, so it can be 'report to libxl' or 'report
to the guest' regarding to who invokes that hypercall.

but yes here we more care about what's reported to the guest.

> So for this discussion, maybe "rmrr-host" (meaning, copy RMRRs from
> the host) or "rmrr-sel" (meaning, specify a selection of RMRRs, which
> may be from this host, or even another host)?

the counterpart of 'rmrr-host' gives me feeling of 'rmrr-guest'. :-)

> Given that the ranges may be of arbitrary size, and that we may want
> to specify additional ranges for migration to other hosts, then I
> think we need at some level we need the machinery to be in place to
> specify the RMRRs that will be reserved for a specific guest.
> At the xl level, there should of course be a way to specify "use all
> host RMRRs"; but what should happen then is that xl / libxl should
> query Xen for the host RMRRs and then pass those down to the next
> layer of the library.
> >> 2. Whether those holes are made by the domain builder in libxc, or by
> >> hvmloader
> >
> > based on current discussion, whether to make holes in hvmloader
> > doesn't bring fundamental difference. as long as domain builder
> > still need to populate memory (even minimal for hvmloader to boot),
> > it needs to check conflict and may ideally make hole too (though we
> > may make assumption not doing that)
> Well it will have an impact on the overall design of the code; but
> you're right, if RMRRs really can (and will) be anywhere in memory,
> then the domain builder will need to know what RMRRs are going to be
> reserved for this VM and avoid populating those.  If, on the other
> hand, we can make some fairly small assumptions about where there will
> not be any RMRRs, then we can get away with handling everything in
> hvmloader.

I'm not sure such fairly small assumptions can be made. For example,
host RMRR may include one or several regions in host PCI MMIO
space (say >3G), then hvmloader has to understand such knowledge
to avoid allocating them for guest PCI MMIO.

> >>
> >> 3. What happens if Xen is asked to assign a device and it finds that
> >> the required RMRR is not empty:
> >>  a. during guest creation
> >>  b. after the guest has booted
> >
> > for Xen we don't need differentiate a/b. by default it's clear failure
> > should be returned as it implies a security/correctness issue if
> > moving forward. but based on discussion an override to 'warn' only
> > is preferred, so admin can make decision (remains an open on
> > whether to do global override or per-device override)
> Well I think part of our confusion here is what "fail" vs "warn" means.
> Fail / warn might be "Do we refuse to assign the device, or do we go
> ahead and assign the device, knowing that it may act buggy?"
> Or it might be, "Do we fail domain creation if at some step we
> discover an RMRR conflict?  Or do we let the domain create succeed but
> warn that the device has not been attached?"
> I think in any case, failing to *assign* the device is the right thing
> to do (except perhaps with a per-device override option).


> libxl already has a policy of what happens when pci assignment fails
> during domain creation.  If I'm reading the code right, libxl will
> destroy the domain if libxl__device_pci_add() fails during domain
> creation; I think that's the right thing to do.  If you want to change
> that policy, that's a different discussion.

not my intention. as I said, the policy for Xen is clear. Just fail the
assignment hypercall (or warn w/ override). keep whatever policy
defined by libxl.

> But if the device assignment fails due to an unspecified RMRR, that's
> a bug in the toolstack -- it should have looked at the device list,
> found out what RMRRs were necessary, and reserved those ranges before
> we got to that point.
> The only time I would expect device assignment might fail during
> domain creation is if one of the devices had an RMRR shared with a
> device already assigned to another VM.
> >> Obviously at some point some part of the toolstack needs to identify
> >> which RMRRs go with what device, so that either libxc or hvmloader can
> >> make the appropriate holes in the address space; but at that point,
> >> "report" is not so much the right word as "query".  (Obviously we want
> >> to "report" in the e820 map all RMRRs that we've made holes for in the
> >> guest.)
> >
> > yes, using 'report' doesn't catch all the changes we need to make. Just
> > use them to simplify discussion in case all are on the same page. However
> > clearly my original explanation didn't make it. :/
> >
> > and state my major intention again. I don't think the preparation (i.e.
> > detect confliction and make holes) for device assignment should be a
> > a blocking failure.  Throw warning should be enough (i.e. in libxc). We
> > should let actual device assignment path to make final call based on
> > admin's configuration (default 'fail' w/ 'warn' override). Based on that
> > policy I think 'report-all' (making holes for all host RMRRs) is an
> > acceptable approach, w/ small impact on possibly more warning
> > messages (actually not bad to help admin understand the hotplug
> > possibility on this platform) and show more reserved regions to the
> > end user (but he shouldn't make any assumption here). :-)
> I don't really understand what you're talking about here.
> When the libxc domain builder runs, there is *no* guest memory mapped.
> So if it has the RMRRs, then it can *avoid* conflict; and if it
> doesn't have the RMRRs, it can't even *detect* conflict.  So there is
> no reason for libxc to either give a warning, or cause a failure.

not all the conflicts can or will be avoided. e.g. USB may report a 
region conflicting with guest BIOS which is a hard conflict. another 
example (from one design option) is that we may want to keep 
current cross-component structure (one lowmem + one highmem)
so conflict in the middle (e.g. 2G) is a problem (avoiding it will break
lowmem or make lowmem too small).

as long as we agree some conflicts may not be avoided, then it
comes to the open about whether to give a warning, or cause a failure.
I view making holes in domain builder as a preparation for later device
assignment, so gives warning should be sufficient here since Xen will
fail the assignment hypercall later when it actually happens and then
libxl will react according to defined policy like you described above.

> So I'm not sure why you think making holes for all RMRRs would have
> more warning messages.

based on the fact that not all RMRRs can or will be avoided, definitely
making holes for all RMRRs on the host can potentially lead to more 
conflicts than just making holes for RMRRs belonging to assigned
devices. When we agree warning is OK in domain builder, it means
more warning messages. 

> And when you say "show more reserved regions to the end user", I take
> it you mean the guest kernel (via the e820 map)?

yes since all reserved regions have to be marked in the e820, so guest
OS itself won't allocate the hole e.g. when doing PCI re-configuration.

> I'm also not clear what assumptions "he" may be making: you mean, the
> existence of an RMRR in the e820 map shouldn't be taken to mean that
> he has a specific device assigned?  No, indeed, he should not make
> such an assumption. :-)

I meant 'he' shouldn't make assumption on how many reserved regions
should exist in e820 based on exposed devices. Jan has a concern exposing
more reserved regions in e820 than necessary is not a good thing. I'm
trying to convince him it should be fine. :-)

> Again -- I think that the only place "rmrr-host" and "rmrr-sel" is
> important is at the very top level -- in xl, and possibly at a high
> level in libxl.  By the time things reach libxc and hvmloader, they
> should simply be told, "These are the RMRRs for this domain", and they
> should avoid conflicts and report those in the e820 map.

Having libxl to centrally manage RMRR at a high level is a good idea,
which however need help from you and Ian on what're detail tasks
in libxl to achieve such goal. We're not tool expert (especially to
Tiejun) so definitely more suggestion in this area is welcomed. :-)

Then I hope you understand now about our discussion in libxl/xen/
hvmloader, based on the fact that conflict may not be avoided. 
That's the major open in original discussion with Jan. I'd like to 
give an example of the flow here per Jan's suggestion, starting 
from domain builder after reserved regions have been specified 
by high level libxl.

Let's take an synthetic platform w/ two devices each reported 
with one RMRR reserved region:
        (D1): [0xe0000, 0xeffff] in <1MB area
        (D2): [0xa0000000, 0xa37fffff] in ~2.75G area

The guest is configured with 4G memory, and assigned with D2.
due to libxl policy (say for migration and hotplug) in total 3
ranges are reported:
        (hotplug): [0xe0000, 0xeffff] in <1MB area in this node
        (migration): [0x40000000, 0x40003fff] in ~1G area in another node
        (static-assign): [0xa0000000, 0xa37fffff] in ~2.75G area in this node

let's use xenstore to save such information (assume accessible to both
domain builder and hvmloader?)

STEP-1. domain builder

say the default layout w/o reserved regions would be:
        lowmem:         [0, 0xbfffffff]
        mmio hole:      [0xc0000000, 0xffffffff]
        highmem:        [0x100000000, 0x140000000]

domain builder then queries reserved regions from xenstore, 
and tries to avoid conflicts.

For [0xad000000, 0xaf7fffff], it can be avoided by reducing
lowmem to 0xad000000 and increase highmem:
        lowmem:         [0, 0x9fffffff]
        mmio hole:      [0xa0000000, 0xffffffff]
        highmem:        [0x100000000, 0x160000000]

For [0x40000000, 0x40003fff], leave it as a conflict since either
reducing lowmem to 1G is not nice to guest which doesn't use
highmem or we have to break lowmem into two trunks so more 
structure changes are required.

For [0xe0000, 0xeffff], leave it as a conflict (w/ guest BIOS)

w/ libxl centrally managed mode, domain builder doesn't know
whether a conflict will lead to an immediate error or not, so 
the best policy here is to throw warning and then move forward.
conflicts will be caught in later steps when a region is actually

STEP-2. static device assignment

after domain builder, libxl will request Xen hypervisor to complete
actual device assignment. Because D2 is statically assigned to
this guest, Xen will setup identity mapping for [0xa0000000, 
0xa37fffff] with conflict detection in gfn space. Since domain builder
has making hole for this region, there'll no conflict and device will
be assigned to the guest successfully.

STEP-3. hvmloader boot

hvmloader also needs to query reserved regions (still thru xenstore?)
due to two reasons:
        - mark all reported reserved regions in guest e820
        - make holes to avoid conflict in dynamic allocation (e.g. PCI
BAR, ACPI opregion, etc.)

hvmloader can avoid making holes for guest RAM again (even there
are potential conflicts w/ guest RAM they would be acceptable otherwise
libxl will fail the boot before reaching here). So hvmloader will just 
add a new reserved e820 entry and make hole for [0xa0000000, 
0xa37fffff] in this example, which doesn't have a guest RAM confliction.

STEP-4. D2 hotplug

After guest has been booted, user decides to hotplug D1 so libxl
will raise another device assignment request to Xen hypervisor. At
this point, a conflict is detected on [0xe0000, 0xeffff] and then a 
failure is returned by default. an override is provide in Xen to allow 
warn and return success, and user understands doing so implies 
an unsecure environment.

Out of this example hotplug will succeed if no conflict on RMRRs of 
the hotplug device.

STEP-5. migration

after guest has been booted, user decides to migrate the guest
to another node and the guest will be assigned with a new device
w/ a [0x40000000, 0x40003fff] reserved region on the new node. 
After the guest is migrated, Xen on the new node will detect conflict
when new device is hotplugged and failure (or warn override so 
success) will be returned accordingly.

Out of this example hotplug will succeed if no conflict on RMRRs of 
the hotplug device.


After completing this long thread, I actually like this high level libxl
management idea, which avoids complexity in domain builder/
hvmloader to understand device/RMRR association and then uses
different policy according to whether it's statically assigned or 
hotplug or for other purpose. :-)

Xen-devel mailing list



Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.