
[Xen-devel] CPUID improvements (phase 2) Design Doc



Hello.

Please find inline the design doc for further CPUID improvements, planned for
development during the 4.9 timeframe.

A PDF version can be obtained from:

  http://xenbits.xen.org/people/andrewcoop/cpuid-improvements-phase-2-v1.pdf

Comments welcome.

~Andrew

---8<---

% CPUID improvements, phase 2

\clearpage

# Executive Summary

Previous CPUID work in the Xen 4.7 time-frame provided remedial improvements
for feature levelling, by having Xen decide what it would tolerate as the
maximum featureset for each type of guest and offer this information to the
toolstack.

The purpose of that work was to allow a toolstack to be able to safely migrate
a guest between two non-identical servers; first by the toolstack having an
accurate idea of the hardware capabilities, and second by ensuring that the
guest can only see the features permitted.

There were known issues at the start which were deliberately not addressed,
and further issues became evident during the work.  In particular, Xen's
internal handling of guests' CPUID information is in need of improvement.

The purpose of the phase 2 work is to improve the internals of CPUID handling
in Xen and the lower levels of the toolstack.


# Identified Issues

## Privileged domain CPUID policies

The control domain and hardware domain have no policy data in Xen.  This is a
consequence of how the guest CPUID logic was developed.  The lack of policy
data is worked around in `pv_cpuid()` with logic looking like:

    if ( !is_control_domain(currd) && !is_hardware_domain(currd) )
        /* Regular domain: use its toolstack-provided CPUID policy. */
        domain_cpuid(currd, leaf, subleaf, &a, &b, &c, &d);
    else
        /* Control/hardware domain: pass through raw hardware CPUID. */
        cpuid_count(leaf, subleaf, &a, &b, &c, &d);

but this is flawed for several reasons.

There is nothing special about these domains when it comes to CPUID; they
should still see CPUID data appropriate for their VM container, rather than
having all hardware state leaked into them.

The default leaking of hardware state thus far has caused anti-features to
enter the ABI, most notably OSXSAVE and MTRR handling for PV guests.

There is also no equivalent logic for HVM control/hardware domains.

## Domain construction

Currently, only PV domains are viable for use as the control domain in
production scenarios.  As the control domain has previously had full hardware
CPUID information leaked into it, the domain construction logic has become
reliant on this leakage.

In particular, the domain construction logic expects, by default, to seed xsave
state information from what it can find in native CPUID.  As PV guests are
more feature restricted than HVM guests, this fails to construct an HVM guest
with all available features if the control domain's CPUID view is properly
restricted.

A workaround left over from the feature levelling work avoids enabling CPUID
Faulting for the control domain until this issue is resolved.

## Internal CPUID representation

Xen's internal representation of CPUID policy information for a guest is an
unordered list.  This list is searched for every leaf lookup, and contains
unaudited toolstack-provided values.

Because the unordered list contains unaudited toolstack data, the main CPUID
handling functions attempt to retrofit some sanity to the contents of the
list.  This is recalculated for every leaf lookup.

Because the toolstack-provided data isn't in a flat format, logic in Xen needs
to make calls into the CPUID handling functions to find information.  This
causes searching and recalculation even for information such as maxphysaddr or
feature flags which should be accessible in O(1) time.

Some CPUID information depends on other CPUID information.  As the data isn't
in a flat format, handling of things like xsave state or max_leaf boundary
checking requires recursion to obtain, and extra special care to not end up
with a hypervisor stack overflow.
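
For illustration, the existing representation behaves conceptually like the
following (a sketch with hypothetical names, not the actual Xen code):

    /* Conceptual sketch of the current representation: an unordered array of
     * toolstack-provided leaves, searched linearly for every lookup. */
    struct cpuid_entry {
        uint32_t leaf, subleaf;
        uint32_t a, b, c, d;
    };

    static const struct cpuid_entry *find_entry(
        const struct cpuid_entry *list, unsigned int nr,
        uint32_t leaf, uint32_t subleaf)
    {
        unsigned int i;

        for ( i = 0; i < nr; ++i )
            if ( list[i].leaf == leaf && list[i].subleaf == subleaf )
                return &list[i];

        return NULL; /* Not present; callers fall back to other logic. */
    }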

## Topology information

Xen has very little concept of topology information in the CPUID policy.  Some
values (e.g. APIC ID) are hard-coded based on VCPU ID, while some values
(e.g. the cache detail in leaf 4) are blindly propagated from the static
toolstack-provided values, without understanding that they should vary by
vcpu.

The result is very confusing to a VM, as the values it reads across its vcpus
do not match the expectations set out by the Intel/AMD software manuals.
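
As an illustration of the kind of per-vcpu adjustment this implies (a sketch,
not existing Xen code; leaf 1 EBX bits 31:24 carry the initial APIC ID):

    /* Sketch only: derive a vcpu's leaf 1 EBX from a single policy template
     * by substituting the per-vcpu initial APIC ID into bits 31:24. */
    static uint32_t leaf1_ebx_for_vcpu(uint32_t policy_ebx, uint8_t apic_id)
    {
        return (policy_ebx & 0x00ffffffu) | ((uint32_t)apic_id << 24);
    }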

# Proposed Remediation

## Flat internal representation

The following structure is proposed for containing a domain's CPUID policy:

    struct cpuid_leaf { uint32_t a, b, c, d; };

    struct cpuid_policy {
        union {
            struct cpuid_leaf raw[CPUID_GUEST_NR_BASIC];
            struct {
                struct {                        /* Leaf 0x0 */
                    uint32_t max_leaf, vendor_ebx, vendor_ecx, vendor_edx;
                };
                struct {                        /* Leaf 0x1 */
                    uint32_t stepping: 4, model: 4, family: 4,
                        type: 2, ext_model: 4, ext_family: 8;
                    uint8_t brand, clflush, cpu_count, apic_id;
                    union {                     /* Leaf 0x1 ECX feature flags. */
                        uint32_t _1c;
                        struct {
                            bool sse3: 1, pclmul: 1, dtes64: 1, monitor: 1,
                            ...
                        };
                    };
                    union {                     /* Leaf 0x1 EDX feature flags. */
                        uint32_t _1d;
                        struct {
                            bool fpu: 1, vme: 1, de: 1, pse: 1, tsc: 1,
                            ...
                        };
                    };
                };
                ...
            };
        } basic;

        union {
            struct cpuid_leaf raw[CPUID_GUEST_NR_EXTD];
            struct {
                ...
            };
        } extd;

        ...
    };

This is a complicated structure, but it has two key advantages.

1.  Individual leaves can be accessed as opaque information via their index
    (e.g. `d->arch.cpuid->basic.raw[0]`).  This is expected to be useful
    for marshalling data to/from the toolstack, and responding to cpuid
    requests from the guest or instruction emulator.

2.  Individual pieces of data can be accessed via mnemonic
    (e.g. `d->arch.cpuid->basic.max_leaf` or
    `d->arch.cpuid->extd.maxphysaddr`).  This access is O(1) and can be used
    to replace existing logic which needs to query CPUID information in the
    same manner as a guest.  In particular, it removes the need for Xen's
    CPUID handling to recurse back into itself to find `max_leaf`.

To use the proposed structure in this way, Xen must rely on the contents of
the CPUID policy always being accurate.
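
For illustration, both access patterns might look as follows (a sketch against
the proposed structure; the helper names are hypothetical):

    /* Opaque access by index, e.g. when marshalling leaves to the toolstack. */
    static struct cpuid_leaf get_basic_leaf(const struct domain *d,
                                            unsigned int idx)
    {
        return d->arch.cpuid->basic.raw[idx];
    }

    /* O(1) access by mnemonic: no recursion back into the CPUID handlers. */
    static bool basic_leaf_in_range(const struct domain *d, uint32_t leaf)
    {
        return leaf <= d->arch.cpuid->basic.max_leaf;
    }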

## Default policies

The existing boot-time featureset calculations will be extended to calculate
maximal full CPUID policies, not just maximum featuresets.  During domain
creation, the appropriate policy shall be duplicated and set as the domain's
default policy.

*   This change alone will provide a usable policy for all domains, even
    privileged ones.  The special casing for the control domain can therefore
    be dropped.

Further modifications to the policy will occur as subsequent construction
actions are taken.
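
A sketch of the intended creation-time flow (the policy object names, e.g.
`pv_max_cpuid_policy` and `hvm_max_cpuid_policy`, are illustrative placeholders
for the boot-time maximal policies):

    /* Sketch only: duplicate the appropriate boot-time maximal policy and
     * install it as the domain's default at creation time. */
    static int init_domain_cpuid_policy(struct domain *d)
    {
        const struct cpuid_policy *max =
            is_pv_domain(d) ? &pv_max_cpuid_policy : &hvm_max_cpuid_policy;
        struct cpuid_policy *p = xmalloc(struct cpuid_policy);

        if ( !p )
            return -ENOMEM;

        *p = *max;
        d->arch.cpuid = p;

        return 0;
    }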

## Guest CPUID handling

The two primary CPUID functions are `pv_cpuid()` and `hvm_cpuid()`, with the
former containing the extra logic to cope with privileged domains.  These
functions serve the same purpose and have a lot of duplicated code in common.
However, they also contain differences; some are deliberate, while others are
accidental consequences of their separate implementations.

There are some differences between PV and HVM guests in terms of CPUID policy,
but they are all about which features are available to the guest.  Given a
suitable `struct cpuid_policy` to start with, there should be no difference in
the handling of CPUID instructions on behalf of the guest.

A further limitation of the existing functions is that they depend on being
run in `current` context.  This limitation can also be lifted.

Therefore, a single new function will be introduced:

    void guest_cpuid(const struct vcpu *v, unsigned int leaf,
                     unsigned int subleaf, struct cpuid_leaf *res);

which will replace existing functions, and handle all aspects of completing a
guest CPUID request.

*   The use of a single function will remove all accidental inconsistencies
    between different guest types.  It will also remove all differences
    between privileged and non-privileged domains; there is nothing special
    (CPUID-wise) about them.
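
For illustration, a hypothetical caller (e.g. the instruction emulator
completing a CPUID request) would look like this; note the explicit vcpu
parameter rather than a dependency on `current`:

    /* Sketch only: complete a CPUID request on behalf of an arbitrary vcpu,
     * with no dependency on being run in 'current' context. */
    static void complete_cpuid(const struct vcpu *v, uint32_t leaf,
                               uint32_t subleaf, uint32_t regs[4])
    {
        struct cpuid_leaf res;

        guest_cpuid(v, leaf, subleaf, &res);

        regs[0] = res.a;
        regs[1] = res.b;
        regs[2] = res.c;
        regs[3] = res.d;
    }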

## Changes in hypercall behaviour

During domain construction, some pieces of information critical to the
determination of the domain's maximum acceptable CPUID policy are available
right from the very start (most notably, the HVM and HAP flags from
`XEN_DOMCTL_createdomain`).

However, some other parameters are not available at convenient points.

1.  The disable flag from `XEN_DOMCTL_disable_migrate` is used to set
    `d->disable_migrate`, whose only purpose is to avoid the unconditional
    clobbering of the Invariant TSC flag.  This flag cannot even be queried by
    the toolstack once set.

    There are other facilities which should be restricted based on whether a
    VM might migrate or not, e.g. the use of LBR, whose record format is
    hardware-specific.

    As using `XEN_DOMCTL_disable_migrate` implies that more features should be
    available by default, it would be more convenient to have a migration hint
    alongside the other flags in `XEN_DOMCTL_createdomain` than to reintroduce
    the features later, which would introduce an ordering problem with respect
    to `XEN_DOMCTL_set_cpuid`.

2.  The use of `XEN_DOMCTL_set_address_size` switches a PV guest between
    native (64bit) and compat (32bit) mode.  The current behaviour for 32bit
    PV guests is to hide all 64bit-only features.

    Using this hypercall once to switch from native to compat is fairly easy
    to cope with, feature-wise, but using the hypercall a second time to
    switch back causes the same ordering problems with respect to
    `XEN_DOMCTL_set_cpuid`.

    The preferred option here is to stop hiding 64bit-only features from
    32bit PV guests.  This is more architecturally correct, as a 32bit kernel
    running on 64bit-capable hardware will see 64bit-only features.  Other
    options would be to modify
    the API to make `XEN_DOMCTL_set_address_size` a single-shot hypercall
    (which, given the existing restrictions, shouldn't impact any usecase), or
    to require a compat flag for PV guests in `XEN_DOMCTL_createdomain`,
    (which would definitely impact current usecases).

Once the maximum applicable CPUID policy is determined, the toolstack may
specify a preferred CPUID policy, via the `XEN_DOMCTL_set_cpuid` hypercall.
Currently the toolstack-provided CPUID data is taken verbatim, and edited
towards reality every time a value is requested.

To start with, there is no method for the toolstack to query a domain's CPUID
policy.  One shall be added.

In addition, CPUID-policy auditing logic will be introduced and run as part of
`XEN_DOMCTL_set_cpuid`, to ensure that the data in `struct cpuid_policy` is
always accurate.  This also introduces the expectation that
`XEN_DOMCTL_set_cpuid` should fail if it attempts to load a policy which is
outside of what Xen can provide.
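
A sketch of the intended audit-then-commit flow (the helper `policy_is_within()`
and the `cpuid_max` field are hypothetical, and only basic leaves are shown for
brevity):

    /* Sketch only: apply a toolstack-provided leaf to a scratch copy, audit it
     * against the domain's maximum acceptable policy, and commit on success. */
    static int set_cpuid_basic_leaf(struct domain *d, unsigned int leaf,
                                    const struct cpuid_leaf *val)
    {
        struct cpuid_policy new = *d->arch.cpuid;

        if ( leaf >= CPUID_GUEST_NR_BASIC )
            return -EINVAL;

        new.basic.raw[leaf] = *val;

        if ( !policy_is_within(&new, d->arch.cpuid_max) )  /* hypothetical */
            return -EINVAL;     /* Outside of what Xen can provide. */

        *d->arch.cpuid = new;

        return 0;
    }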


# Miscellanea

## Feature support statement

A problem which has surfaced several times is the difference between feature
development and production use of the feature.  The current Xen/libxl
interaction only has the notion of a feature being available or not, resulting
in all experimental features becoming available by default to guests.

Instead, it would be better for Xen to calculate and advertise a maximum
policy which includes experimental features, and a default policy, which is
what the toolstack chooses by default.

Updates to the CPUID policy will be checked against the maximum set, allowing
users to explicitly opt in to using the experimental features if they wish.
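
For illustration, the maximum/default split might be calculated along these
lines (a sketch; `known_features` and `experimental_features` are illustrative
masks following the phase 1 featureset conventions):

    /* Sketch only: derive both a maximum and a default featureset at boot. */
    static uint32_t max_featureset[FSCAPINTS], default_featureset[FSCAPINTS];

    static void calculate_featuresets(void)
    {
        unsigned int i;

        for ( i = 0; i < FSCAPINTS; ++i )
        {
            max_featureset[i] = boot_cpu_data.x86_capability[i] &
                                known_features[i];

            /* The default excludes experimental features; opting in to them
             * requires an explicit toolstack request. */
            default_featureset[i] = max_featureset[i] &
                                    ~experimental_features[i];
        }
    }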
