This is an archived copy of the Xen.org mailing list, which we have preserved to ensure that existing links to archives are not broken. The live archive, which contains the latest emails, can be found at http://lists.xen.org/


Re: [Xen-devel] [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm)

To: Ian Campbell <Ian.Campbell@xxxxxxxxxx>
Subject: Re: [Xen-devel] [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm)
From: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>
Date: Tue, 16 Nov 2010 10:50:16 -0500
Cc: Jeremy Fitzhardinge <jeremy@xxxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxx>, Keir Fraser <keir@xxxxxxx>, Stefano Stabellini <Stefano.Stabellini@xxxxxxxxxxxxx>, "bruce.edge@xxxxxxxxx" <bruce.edge@xxxxxxxxx>, Gianni Tedesco <gianni.tedesco@xxxxxxxxxx>
Delivery-date: Tue, 16 Nov 2010 07:53:19 -0800
Envelope-to: www-data@xxxxxxxxxxxxxxxxxxx
In-reply-to: <1289899586.31507.717.camel@xxxxxxxxxxxxxxxxxxxxxx>
List-help: <mailto:xen-devel-request@lists.xensource.com?subject=help>
List-id: Xen developer discussion <xen-devel.lists.xensource.com>
List-post: <mailto:xen-devel@lists.xensource.com>
List-subscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
List-unsubscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
References: <4CE18AD6.5070102@xxxxxxxx> <C907413B.A0AD%keir@xxxxxxx> <20101115231133.GA12364@xxxxxxxxxxxx> <4CE1D921.2010703@xxxxxxxx> <1289899586.31507.717.camel@xxxxxxxxxxxxxxxxxxxxxx>
Sender: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
User-agent: Mutt/1.5.20 (2009-06-14)
This email got a bit lengthy - so make sure you've got a cup of coffee when you
read this.

> On an unrelated note I think if we do go down the route of having the
> guest kernel punch the holes itself and such we should do so iff
> XENMEM_memory_map returns either ENOSYS or nr_entries == 1 to leave open

When would that actually happen? Is that return value returned when the
hypervisor does not implement it (and in what version was it implemented)?

> the possibility of cunning tricks on the tools side in the future.


I think we have three options in regards to this RFC patch I posted:
 1). Continue with this and have the toolstack punch the PCI hole. It would
     fill the PCI hole area with INVALID_MFN. The toolstack determines where
     the PCI hole starts.
 2). Do this in the guest, where the guest calls both XENMEM_machine_memory_map
     and XENMEM_memory_map to get an idea of the host side PCI hole and set it up.
     Requires changes in the hypervisor to allow a non-privileged PV guest to make
     the XENMEM_machine_memory_map call. The Linux kernel decides where the PCI
     hole starts, and the PCI hole is filled with INVALID_MFN.
 3). Unconditionally make a PCI hole, starting at 3GB, with the PCI hole filled
     with INVALID_MFN.
 4). Another one I didn't think of?

For all of those cases, when devices show up we populate the P2M array on demand
with the MFNs. For the first two proposals the BARs we read off
the PCI devices are going to be written to the P2M array as identity (so
mfn_list[0xc0000] == 0xc0000). Code has not been written.

For the third proposal, we would have non-identity mappings in the P2M array, as
during the migration we could move from a device with BARs at 0xc0000 to one at
0x20000. So mfn_list[0xc0000] = 0x20000.

But for the third case I am unsure how we would get the "real" MFNs. We
initially get the BARs via 0xcf8 calls, and if we don't filter them, the value
gets passed to ioremap. Say the host side BAR is at 0x20000 and our PCI hole
starts at 0xc0000: ioremap gets called with 0x20000, and in its E820 that
region is 'System RAM'.

        /* from __ioremap_caller() in arch/x86/mm/ioremap.c */
        last_pfn = last_addr >> PAGE_SHIFT;
        for (pfn = phys_addr >> PAGE_SHIFT; pfn <= last_pfn; pfn++) {
                int is_ram = page_is_ram(pfn);

                /* refuse to ioremap pages the E820 marks as usable RAM */
                if (is_ram && pfn_valid(pfn) && !PageReserved(pfn_to_page(pfn)))
                        return NULL;
        }

Ugh, and it will think (correctly) that it falls within RAM.

If we filter the 0xcf8 calls, which we can do in the Xen PCI backend case, we can
provide BARs that always start at 0xc0000. But that does not help the PV guest
know the "real" MFNs, which it needs so it can program the P2M array. So the Xen
PCI front would have to do this - which it could, though it adds complexity
to it.

We also need to make all of this work with domain zero, and here 1) or 2) can
easily be used, as the Xen hypervisor has given us the E820 nicely peppered with
holes. (I wonder what happens if dom0 makes a XENMEM_memory_map call - what does
it get back?)

If we then go with 3), we would need to instrument the code that reads the BARs
so that it can filter them properly. That would be the low-level Linux
pci_conf_read, and that is not going to happen - so we would have to make the
Xen hypervisor aware of this, so that when it traps the in/out calls it provides
new BAR values starting at 0xc0000.

I am not comfortable maintaining this filter/keep-state code in both the Xen
hypervisor and the Xen PCI front module, so I think 3) would not work that well,
unless there are better ways that I have missed?

Back to 1) and 2). Migration would work if we unplug the PCI devices before
suspend and plug them back in on resume - otherwise the PCI BARs might have
changed between migrations. When the guest gets recreated - how does it iterate
over the E820 to create the P2M list? Or is that something that is not done,
and we just save the P2M list and restore it as-is on the other side?
Naturally, since we would unplug the PCI device, the entries in the E820 gaps
would be INVALID_MFN...

If we consult the E820 during resume, I think doing the PCI hole in the
toolstack has merits - simply b/c the user can set the PCI hole to an arbitrary
address that is low enough (0x2000, say) to cover all of the machines that
he/she would migrate to. If we do it in the Linux kernel we do not have that
information. Even if we consult the E820, the toolstack still has merits - as
the PCI hole start address might be different between the migration machines,
and we might have started on a box with the PCI hole way up (3.9GB) while the
other machines might have it at 1.2GB.

The other thing I don't know is how all of this works with 32-bit kernels?

I've done the testing of 1) with 64-bit, w/ and w/o ballooning, and it worked.