Re: [Xen-devel] [RFC PATCH 07/12] hvmloader: allocate MMCONFIG area in the MMIO hole + minor code refactoring



On Tue, 27 Mar 2018 09:45:30 +0100
Roger Pau Monné <roger.pau@xxxxxxxxxx> wrote:

>On Tue, Mar 27, 2018 at 05:42:11AM +1000, Alexey G wrote:
>> On Mon, 26 Mar 2018 10:24:38 +0100
>> Roger Pau Monné <roger.pau@xxxxxxxxxx> wrote:
>>   
>> >On Sat, Mar 24, 2018 at 08:32:44AM +1000, Alexey G wrote:  
>> [...]  
>> >> In fact, the emulated chipset (NB+SB combo without supplemental
>> >> devices) itself is a small part of required emulation. It's
>> >> relatively easy to provide own analogs of for eg. 'mch' and
>> >> 'ICH9-LPC' QEMU PCIDevice's, the problem is to glue all remaining
>> >> parts together.
>> >> 
>> >> I assume the final goal in this case is to have only a set of
>> >> necessary QEMU PCIDevice's for which we will be providing I/O,
>> >> MMIO and PCI conf trapping facilities. Only devices such as
>> >> rtl8139, ich9-ahci and few others.
>> >> 
>> >> Basically, this means a new, chipset-less QEMU machine type.
>> >> Well, in theory it is possible with a bit of effort I think. The
>> >> main question is where will be the NB/SB/PCIbus emulating part
>> >> reside in this case.    
>> >
>> >Mostly inside of Xen. Of course the IDE/SATA/USB/Ethernet... part of
>> >the southbridge will be emulated by a device model (ie: QEMU).
>> >
>> >As you mention above, I also took a look and it seems like the
>> >amount of registers that we should emulate for Q35 DRAM controller
>> >(D0:F0) is fairly minimal based on current QEMU implementation. We
>> >could even possibly get away by just emulating PCIEXBAR.  
>> 
>> MCH emulation alone might be not an option. Besides, some
>> southbridge-specific features like emulating ACPI PM facilities for
>> domain power management (basically, anything at PMBASE) will be
>> preferable to implement on Xen side, especially considering the fact
>> that ACPI tables are already provided by Xen's libacpi/hvmloader, not
>> the device model.  
>
>Likely, but AFAICT this is kind of already broken, because PM1a and
>TMR is already emulated by Xen at hardcoded values. See
>xen/arch/x86/hvm/pmtimer.c.

Yes, that should be an argument to try to implement PMBASE emulation in
Xen too. Although this needs to be checked against dependencies in
QEMU first, especially with ACPI-related code.

This way we gain the flexibility to use an arbitrary PMBASE value
instead of having to hardcode it to ACPI_PM1A_EVT_BLK_ADDRESS_V1
in all related components.

>> I think the feature may require to cover at least the NB+SB
>> combination, at least Q35 MCH + ICH9 for start, ideally 82441FX+PIIX4
>> as well. Also, Xen should control emulated/PT PCI device placement.  
>
>Q35 MCH (D0:F0) is required in order to trap access to PCIEXBAR.

Absolutely.


BTW, another somewhat related problem at the moment is that Xen knows
nothing about chipset-specific MMIO hole(s). Because of this, a guest
can map PT BARs outside the MMIO hole, leading to errors like this:

(XEN) memory_map:remove: dom4 gfn=c8000 mfn=c8000 nr=2000
(XEN) memory_map:add: dom4 gfn=ffffffffc8000 mfn=c8000 nr=2000
(XEN) p2m.c:1121:d0v5 p2m_set_entry: 0xffffffffc8000:9 -> -22 (0xc8000)
(XEN) memory_map:fail: dom4 gfn=ffffffffc8000 mfn=c8000 nr=2000 ret:-22
(XEN) memory_map:remove: dom4 gfn=ffffffffc8000 mfn=c8000 nr=2000
(XEN) p2m.c:1228:d0v5 gfn_to_mfn failed! gfn=ffffffffc8000 type:4
(XEN) memory_map: error -22 removing dom4 access to [c8000,c9fff]
(XEN) memory_map:remove: dom4 gfn=ffffffffc8000 mfn=c8000 nr=2000
(XEN) p2m.c:1228:d0v5 gfn_to_mfn failed! gfn=ffffffffc8000 type:4
(XEN) memory_map: error -22 removing dom4 access to [c8000,c9fff]
(XEN) memory_map:add: dom4 gfn=c8000 mfn=c8000 nr=2000

Note that it was merely a lame BAR sizing attempt from the guest-side SW
(a PCI config space viewing tool) -- writing F's to the high part of the
MMIO BAR first.

If we know the guest's MMIO hole bounds, we can adapt to this
behavior and avoid erroneous mapping attempts to a wrong address
outside the MMIO hole. Only the designated MMIO hole range can be used
to map PT device BARs.
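
To show where the bogus gfn in the log above comes from, here is a
minimal sketch in Python. It only reproduces the arithmetic: the BAR
values are taken from the log, while the helper names and the MMIO hole
bounds (HOLE_START/HOLE_END) are made up for illustration and do not
correspond to any existing Xen code.

```python
PAGE_SHIFT = 12  # 4 KiB pages

def bar64_address(low, high):
    """Combine both dwords of a 64-bit memory BAR into a base address."""
    # The low 4 bits of a memory BAR are type/prefetch flags, not address.
    return ((high & 0xFFFFFFFF) << 32) | (low & 0xFFFFFFF0)

def gfn_of(addr):
    """Guest frame number of a guest physical address."""
    return addr >> PAGE_SHIFT

def fits_in_hole(base, size, hole_start, hole_end):
    """True if [base, base + size) lies inside [hole_start, hole_end)."""
    return hole_start <= base and base + size <= hole_end

# Values from the log: gfn=c8000, nr=2000 (hex number of pages).
BAR_LOW = 0xC8000000
BAR_SIZE = 0x2000 << PAGE_SHIFT

# Hypothetical low MMIO hole bounds, for illustration only.
HOLE_START = 0xC0000000
HOLE_END = 0xFC000000

normal = bar64_address(BAR_LOW, 0x00000000)
sizing = bar64_address(BAR_LOW, 0xFFFFFFFF)  # F's written to the high part

for addr in (normal, sizing):
    ok = fits_in_hole(addr, BAR_SIZE, HOLE_START, HOLE_END)
    print(f"gfn={gfn_of(addr):x} map={'yes' if ok else 'skip'}")
```

With known hole bounds, the intermediate address produced by the sizing
write fails the range check and the mapping attempt can simply be
skipped.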

So, if we actually emulate the MCH's MMIO hole related registers in Xen
as well -- we can use them as scratchpad registers (write-once, of
course) to pass this kind of information between Xen and other
involved parties, as an alternative to eg. a dedicated hypercall.

>Could you be more concise about ICH9?
>
>The ICH9 spec contains multiple devices, for example it includes an
>ethernet controller and a SATA controller, which we should not emulate
>inside of Xen.

ICH built-in devices can, from our PoV, be considered distinct PCI
devices (as long as they're actually distinct devices in PCI config
space).
That's QEMU's approach to them -- these devices can be added to a q35
machine optionally. Only a minimal set of devices is provided
initially, like MCH/LPC/AHCI. The SMBus controller (0:1F.3) is added by
default too, but it's not of much use at the moment.

So, out of all the devices provided by a real ICH SB, we mostly need to
consider only the LPC bridge (0:1F.0) for emulation.

>> II. (a new feature) Move chipset emulation to Xen directly.
>> 
>> In this case no separate notification necessary as Xen will be
>> emulating the chosen chipset itself. MMCONFIG location will be known
>> from own PCIEXBAR emulation.
>> 
>> QEMU will be used only to emulate a minimal set of unrelated devices
>> (eg. storage/network/vga). Less dependency on QEMU overall.
>> 
>> More freedom to implement some specific features in the future like
>> smram support for EFI firmware needs. Chipset remapping (aka reclaim)
>> functionality for memory relocation may be implemented under complete
>> Xen control, avoiding usage of unsafe add_to_physmap hypercalls.
>> 
>> In future this will allow to move passthrough-supporting code from
>> QEMU (hw/xen/xen-pt*.c) to Xen, merging it with Roger's vpci series.
>> This will improve eg. the PT + stubdomain situation a lot -- PCI
>> config space accesses for PT devices will be handled in a uniform
>> way without Dom0 interaction.
>> This particular feature can be implemented for the previous approach
>> as well, still it is easier to do when Xen controls the emulated
>> machine
>> 
>> In general, this is a good long-term direction.
>> 
>> What this approach will require:
>> --------------------------------
>> 
>> - Changes in QEMU code to support a new chipset-less machine(s). In
>>   theory might be possible to implement on top of the "null" machine
>>   concept  
>
>Not all parts of the chipset should go inside of Xen, ATM I only
>foresee Q35 MCH being implemented inside of Xen. So I'm not sure
>calling this a chipset-less machine is correct from QEMU PoV.

Emulating only the MCH in Xen will still require a lot of changes, but
the overall benefit becomes unclear -- basically, we just move
PCIEXBAR emulation from QEMU to Xen.

>> - Major changes in Xen code to implement the actual chipset emulation
>>   there
>> 
>> - Changes on the toolstack side as the emulated machine will be
>>   selected and used differently
>> 
>> - Moving passthrough support from QEMU to Xen will likely require to
>>   re-divide areas of responsibility for PCI device passthrough
>>   between xen-pciback and the hypervisor. It might be more convenient
>>   to perform some tasks of xen-pciback in Xen directly  
>
>Moving pci-passthrough from QEMU to Xen is IMO a separate project, and
>by the text you provide I'm not sure how is that related to the Q35
>chipset implementation.

Yes, it's more of a separate feature on top of that approach.

>> - strong dependency between Xen/libxl/QEMU/etc versions -- any
>>   outdated component will be a major problem. Can be resolved by
>>   providing some compatibility code  
>
>Well, you would only be able to use the Q35 feature with the right
>version of the components.
>
>> - longer implementation time
>> 
>> Risks:
>> ------
>> 
>> - A major architecture change with possible issues encountered during
>>   the implementation
>> 
>> - Moving the emulation of the machine to Xen creates a non-zero risk
>>   of introducing a security issue while extending the emulation support
>>   further. As all emulation will take place on a most trusted level,
>>   any exploitable bug in the chipset emulation code may compromise the
>>   whole system
>> 
>> - there is a risk to encounter some dependency on missing chipset
>>   devices in QEMU. Some of QEMU devices (which depend on QEMU chipset
>>   devices/properties) might not work without extra patches. In theory
>>   this may be addressed by leaving the dummy MCH/LPC/pci-host devices
>>   in place while not forwarding any IO/MMIO/PCI conf accesses to them
>>   (using simply as compat placeholders)
>> 
>> - risk of incompatibility with future QEMU versions
>> 
>> In both cases, for security concerns PCIEXBAR and other MCH registers
>> can be made write-once (RO on all further accesses, similar to a
>> TXT-locked system).  
>
>I think option II is the right way to move forward.

Agreed, it's a good long-term direction.
The problem is that option 1 can be implemented in a matter of 1-3
days. It will allow MMCONFIG to work with multiple device emulators
while being very light on requirements -- no big code changes
necessary, easy to test/review, etc.

OTOH, option 2 will require some research first, as the change is
non-trivial and may produce all kinds of incompatibility issues
with QEMU.

Emulating just the MCH in Xen while still leaving everything else to
QEMU does not show an obvious advantage. Without extending the
chipset emulation in Xen further, it will be just an overcomplicated
emulation of the PCIEXBAR register. If this is to be the only initial
objective for the feature, then we need some strong justification for
why moving the emulation of the guest's PCIEXBAR from QEMU to Xen is
mandatory.

We need to be extra sure that having the MCH emulated in Xen, while
ICH9 and all the rest remain emulated by QEMU, is a good solution for
PCIEXBAR emulation. Otherwise, splitting the chipset emulation between
Xen and QEMU just to handle the Q35's PCIEXBAR register is overkill.

I would personally prefer to implement option 1 first, while
researching and implementing option 2 in the near term.

There is nothing special about PCIEXBAR; it's just one of the emulated
chipset registers, holding the address of an emulated MMIO area. This
register doesn't differ much from eg. AHCI ABAR. In fact, it's actually
more harmless -- for MMCONFIG MMIO we merely forward accesses for PCI
config read/write emulation (same thing as for emulated CF8/CFC I/O),
while handling AHCI ABAR MMIO means that we do serious things like
initiating real block I/O with the host. For PT devices, MMCONFIG
accesses still go through hw/xen-pt*.c for filtering or emulation.
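
As an illustration of why MMCONFIG handling is just another form of PCI
config access forwarding, here is a small Python sketch of the standard
PCIe ECAM offset decoding next to the legacy CF8 address encoding. The
bit layouts are the standard ones; the function names are hypothetical
and are not taken from Xen or QEMU.

```python
def ecam_decode(offset):
    """Split an offset inside the MMCONFIG window into (bus, dev, fn, reg).

    Standard ECAM layout: bus in bits 27:20, device in 19:15,
    function in 14:12, register in 11:0.
    """
    bus = (offset >> 20) & 0xFF
    dev = (offset >> 15) & 0x1F
    fn = (offset >> 12) & 0x7
    reg = offset & 0xFFF
    return bus, dev, fn, reg

def cf8_encode(bus, dev, fn, reg):
    """Build the legacy CF8 address dword for the same target.

    The legacy mechanism can only reach the first 256 bytes of
    config space, dword-aligned; the byte lane is selected via CFC..CFF.
    """
    if reg > 0xFF:
        raise ValueError("register not reachable via CF8/CFC")
    return 0x80000000 | (bus << 16) | (dev << 11) | (fn << 8) | (reg & 0xFC)

# An access at this MMCONFIG offset targets 00:1f.3, register 0x10 ...
offset = (0x00 << 20) | (0x1F << 15) | (3 << 12) | 0x10
target = ecam_decode(offset)

# ... which is the same config register a CF8/CFC access would reach.
cf8 = cf8_encode(*target)
print(target, hex(cf8))
```

Both paths end up as the same (bus, dev, fn, reg) tuple, so whichever
component emulates PCIEXBAR only has to decode the offset and hand the
access to the existing config space handling.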

>> It is somewhat related to the chipset because memory/MMIO layout
>> inconsistency can be solved more, well, naturally on Q35.
>> 
>> Basically, we have a non-standard MMIO hole layout where the
>> start of the high MMIO hole does not match the top of addressable RAM
>> (due to invisible ranges of the device model).  
>
>But that's a device model issue then? I'm not sure I'm getting what
>you mean here.

Currently, where we can place the start of the high MMIO hole depends
on the device model. This also badly affects memory relocation support,
which is required for MMIO hole auto-sizing. There are multiple options
for resolving this problem, eg. placing VRAM at addresses far above
4Gb, but this approach is not ideal either, as the device model cannot
know where 64-bit BARs will be allocated. It is, however, the simplest
way to avoid overlaps and to have the high MMIO hole base equal to the
max guest RAM address.

>> Q35 initially have facilities to allow firmware to modify (via
>> emulation) or discover such MMIO hole setup which can be used for
>> safe MMIO BAR allocation to avoid overlaps with QEMU-owned invisible
>> ranges.  
>
>IMO a single entity should be in control of the memory layout, and
>that's the toolstack.
>
>Ideally we should not allow the firmware to change the layout at all.

This approach is terribly wrong; I don't know why opinions like this
are so common at Citrix. The toolstack is the least informed party. If
the MMIO/memory layout is to be immutable, it must be calculated
considering all factors, like chipset-specific MMIO ranges or ranges
which cannot be used for the MMIO hole.

We need to know all resource requirements of the device model's and PT
PCI devices, all chipset-specific MMIO ranges (which belong to a device
model), all RMRRs (a host property) and all device-model invisible
ranges like the VRAM backing store (another device model property).
And we need to know in which manner hvmloader will be allocating BARs
in the MMIO hole -- eg. either in a forward direction starting from
some base, or moving backwards from the end of 4Gb (minus hardcoded
ranges). Basically, this means that the toolstack has to depend on the
hvmloader code/version too, which is wrong on its own -- we should have
the freedom to modify the BAR allocation algo in hvmloader at any time.
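
A toy Python sketch of the point about allocation direction: the same
set of BARs lands at different addresses depending on whether the
allocator walks the hole forwards or backwards. This is not hvmloader's
actual algorithm; the function names, the BAR sizes and the hole bounds
are all made up for illustration.

```python
def align_up(x, a):
    return (x + a - 1) & ~(a - 1)

def align_down(x, a):
    return x & ~(a - 1)

def alloc_forward(sizes, hole_start, hole_end):
    """Place BARs upwards from hole_start, largest first, naturally aligned."""
    placed, cur = [], hole_start
    for size in sorted(sizes, reverse=True):
        base = align_up(cur, size)
        if base + size > hole_end:
            raise MemoryError("MMIO hole too small")
        placed.append((base, size))
        cur = base + size
    return placed

def alloc_backward(sizes, hole_start, hole_end):
    """Place BARs downwards from hole_end, largest first, naturally aligned."""
    placed, cur = [], hole_end
    for size in sorted(sizes, reverse=True):
        base = align_down(cur - size, size)
        if base < hole_start:
            raise MemoryError("MMIO hole too small")
        placed.append((base, size))
        cur = base
    return placed

# Hypothetical BAR sizes (32 MiB and 4 KiB) and hole bounds.
SIZES = [0x1000, 0x2000000]
HOLE = (0xC0000000, 0xFC000000)

fwd = alloc_forward(SIZES, *HOLE)
bwd = alloc_backward(SIZES, *HOLE)
for (fb, sz), (bb, _) in zip(fwd, bwd):
    print(f"size={sz:#x} forward={fb:#x} backward={bb:#x}")
```

A toolstack that wants to precompute an immutable layout would have to
replicate exactly one of these policies, which is the version
dependency described above.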

At the moment all this information can be discovered only from
the firmware side. A lot of changes are needed to gather all the
required information from the toolstack side.

>What are specifically the registers that you mention?

Write-once emulation of TOLUD/TOUUD/REMAPBASE/REMAPLIMIT registers for
hvmloader to use. That's the approach I'm actually using to make
'hvmloader/allow-memory-relocate=1' work: memory relocation without
hvmloader relying on the add_to_physmap hypercall (as it currently
does), while keeping the MMIO/memory layout synchronized between all
parties. There are multiple benefits (mostly for PT needs), including
MMIO hole auto-sizing support, but I'm afraid this approach won't be
well received given the "toolstack should do everything" attitude.
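
For clarity, this is roughly what "write-once" means here, as a minimal
Python sketch. The class and the example value are hypothetical; the
real register widths, encodings and lock semantics of the Q35 MCH are
not modeled.

```python
class WriteOnceReg:
    """A register that accepts a single write and is read-only afterwards."""

    def __init__(self, name, reset=0):
        self.name = name
        self.value = reset
        self.locked = False

    def write(self, value):
        """Return True if the write was accepted, False if it was dropped."""
        if self.locked:
            return False          # later writes are silently ignored
        self.value = value
        self.locked = True        # stays locked from now on
        return True

    def read(self):
        return self.value

# Firmware (hvmloader in this scenario) programs the layout once ...
tolud = WriteOnceReg("TOLUD")
accepted_first = tolud.write(0xC0000000)   # illustrative value only

# ... and any later attempt to change it has no effect.
accepted_second = tolud.write(0x80000000)
print(hex(tolud.read()), accepted_first, accepted_second)
```

After the first write, every other party only ever reads the register,
so they all see the same layout without needing a separate
notification mechanism.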

>> It doesn't really matter which registers to pick for this task, but
>> for Q35 this approach is at least consistent with what a real system
>> does (PV/PVH people will find this peculiarity pointless I
>> suppose :) ).  

>Right, but I don't think we aim to emulate a fully complete Q35 MCH or
>ICH9 for example, which has tons of registers, not even QEMU is trying
>to do that. The main goal is to emulate the registers we know are
>required for OSes to work.
