This is an archived copy of the Xen.org mailing list, which we have preserved to ensure that existing links to archives are not broken. The live archive, which contains the latest emails, can be found at http://lists.xen.org/


Re: [Xen-devel] [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm)

To: Ian Campbell <Ian.Campbell@xxxxxxxxxx>
Subject: Re: [Xen-devel] [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm)
From: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>
Date: Tue, 16 Nov 2010 10:50:16 -0500
Cc: Jeremy Fitzhardinge <jeremy@xxxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxx>, Keir Fraser <keir@xxxxxxx>, Stefano Stabellini <Stefano.Stabellini@xxxxxxxxxxxxx>, "bruce.edge@xxxxxxxxx" <bruce.edge@xxxxxxxxx>, Gianni Tedesco <gianni.tedesco@xxxxxxxxxx>
Delivery-date: Tue, 16 Nov 2010 07:53:19 -0800
Envelope-to: www-data@xxxxxxxxxxxxxxxxxxx
In-reply-to: <1289899586.31507.717.camel@xxxxxxxxxxxxxxxxxxxxxx>
List-help: <mailto:xen-devel-request@lists.xensource.com?subject=help>
List-id: Xen developer discussion <xen-devel.lists.xensource.com>
List-post: <mailto:xen-devel@lists.xensource.com>
List-subscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
List-unsubscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
References: <4CE18AD6.5070102@xxxxxxxx> <C907413B.A0AD%keir@xxxxxxx> <20101115231133.GA12364@xxxxxxxxxxxx> <4CE1D921.2010703@xxxxxxxx> <1289899586.31507.717.camel@xxxxxxxxxxxxxxxxxxxxxx>
Sender: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
User-agent: Mutt/1.5.20 (2009-06-14)
This email got a bit lengthy - so make sure you've got a cup of coffee when you
read this.

> On an unrelated note I think if we do go down the route of having the
> guest kernel punch the holes itself and such we should do so iff
> XENMEM_memory_map returns either ENOSYS or nr_entries == 1 to leave open

When would that actually happen? Is that return value returned when the
hypervisor does not implement it (and in what version was it implemented)?

> the possibility of cunning tricks on the tools side in the future.


I think we have three options in regards to this RFC patch I posted:
 1). Continue with this and have the toolstack punch the PCI hole. It would
     fill the PCI hole area with INVALID_MFN. The toolstack determines where
     the PCI hole starts.
 2). Do this in the guest, where the guest calls both XENMEM_machine_memory_map
     and XENMEM_memory_map to get an idea of the host side PCI hole and set it up.
     Requires changes in the hypervisor to allow a non-privileged PV guest to make
     the XENMEM_machine_memory_map call. The Linux kernel decides where the PCI
     hole starts, and the PCI hole is filled with INVALID_MFN.
 3). Unconditionally make a PCI hole, starting at 3GB, with the PCI hole filled
     with INVALID_MFN.
 4). Another one I didn't think of?

For all of those cases, when devices show up we populate the P2M array on demand
with the MFNs. For the first two proposals the BARs we read off
the PCI devices are going to be written to the P2M array as identity (so
mfn_list[0xc0000] == 0xc0000). Code has not been written.

For the third proposal, we would have non-identity mappings in the P2M array, as
during the migration we could move from a device with BARs at 0xc0000 to one at
0x20000. So mfn_list[0xc0000] = 0x20000.

But for the third case I am unsure how we would get the "real" MFNs. We
initially get the BARs via 0xcf8 calls, and if we don't filter them, the value
gets passed to ioremap. Say the host side BAR is at 0x20000 and our PCI hole
starts at 0xc0000: ioremap gets called with 0x20000, and in its E820 that
region is 'System RAM'.

        /* from __ioremap_caller() in arch/x86/mm/ioremap.c */
        last_pfn = last_addr >> PAGE_SHIFT;
        for (pfn = phys_addr >> PAGE_SHIFT; pfn <= last_pfn; pfn++) {
                int is_ram = page_is_ram(pfn);

                /* refuse to ioremap pages the E820 marks as usable RAM */
                if (is_ram && pfn_valid(pfn) && !PageReserved(pfn_to_page(pfn)))
                        return NULL;
        }

Ugh, and it will think (correctly) that it falls within RAM.

If we filter the 0xcf8 calls, which we can do in the Xen PCI backend case, we can
provide BARs that always start at 0xc0000. But that does not help the PV guest
know the "real" MFNs, which it needs so it can program the P2M array. So the Xen
PCI front would have to do this - which it could, though it adds complexity
to it.

We also need to make all of this work with domain zero, and here 1) or 2) can
easily be used, as the Xen hypervisor has given us the E820 nicely peppered with
holes. (I wonder what happens if dom0 makes a XENMEM_memory_map call - what does
it get back?)

If we then go with 3), we would need to instrument the code that reads the BARs
so that it can filter them properly. That would be the low-level Linux
pci_conf_read, and that is not going to happen - so we would have to make the
Xen hypervisor aware of this, so that when it traps the in/out calls it provides
new BAR values starting at 0xc0000.

I am not comfortable maintaining this filter/keep-state code in both the Xen
hypervisor and the Xen PCI front module, so I think 3) would not work that well,
unless there are better ways that I have missed?

Back to 1) and 2). Migration would work if we unplug the PCI devices before
suspend and plug them back in on resume - otherwise the PCI BARs might have
changed between migrations. When the guest gets recreated - how does it iterate
over the E820 to create the P2M list? Or is that something that is not done,
and we just save the P2M list and restore it as-is on the other side?
Naturally, since we would unplug the PCI device, the entries in the E820 gaps
would be INVALID_MFN...

If we consult the E820 during resume, I think doing the PCI hole in the
toolstack has merits - simply b/c the user can set the PCI hole to an arbitrary
address that is low enough (0x2000, say) to cover all of the machines that
he/she would migrate to. If we do it in the Linux kernel we do not have that
information. Even if we consult the E820, the toolstack still has merits - as
the PCI hole start address might be different between the migration machines,
and we might have started on a box with the PCI hole way up (3.9GB) while the
other machines might have it at 1.2GB.

The other thing I don't know is how all of this works with 32-bit kernels?

I've done the testing of 1) with 64-bit, w/ and w/o ballooning, and it worked.