[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: [PATCH] x86: retrieve and log CPU frequency information
On Wed, Apr 15, 2020 at 01:55:24PM +0200, Jan Beulich wrote: > While from just a single Skylake system it is already clear that we > can't base any of our logic on CPUID leaf 15 [1] (leaf 16 is > documented to be used for display purposes only anyway), logging this > information may still give us some reference in case of problems as well > as for future work. Additionally on the AMD side it is unclear whether > the deviation between reported and measured frequencies is because of us > not doing well, or because of nominal and actual frequencies being quite > far apart. > > The chosen variable naming in amd_log_freq() has pointed out a naming > problem in rdmsr_safe(), which is being taken care of at the same time. > Symmetrically wrmsr_safe(), being an inline function, also gets an > unnecessary underscore dropped from one of its local variables. > > [1] With a core crystal clock of 24MHz and a ratio of 216/2, the > reported frequency nevertheless is 2600MHz, rather than the to be > expected (and calibrated by both us and Linux) 2592MHz. > > Suggested-by: Andrew Cooper <andrew.cooper3@xxxxxxxxxx> > Signed-off-by: Jan Beulich <jbeulich@xxxxxxxx> Reviewed-by: Roger Pau Monné <roger.pau@xxxxxxxxxx> I have one question below about P-state limits. > --- > TBD: The node ID retrieval using extended leaf 1E implies it won't work > on older hardware (pre-Fam15 I think). Besides the Node ID MSR, > which doesn't get advertised on my Fam10 box (and it's zero on all > processors despite there being two nodes as per the PCI device > map), and which isn't even documented for Fam11, Fam12, and Fam14, > I didn't find any other means to retrieve the node ID a CPU is > associated with - the NodeId register in PCI config space depends > on one already knowing the node ID for doing the access, as the > device to be used is a function of the node ID. I there a real chance of the boost states being different between nodes? Won't Xen explode elsewhere due to possibly diverging features between nodes? > > --- a/xen/arch/x86/cpu/amd.c > +++ b/xen/arch/x86/cpu/amd.c > @@ -532,6 +532,102 @@ static void amd_get_topology(struct cpui > : c->cpu_core_id); > } > > +void amd_log_freq(const struct cpuinfo_x86 *c) > +{ > + unsigned int idx = 0, h; > + uint64_t hi, lo, val; > + > + if (c->x86 < 0x10 || c->x86 > 0x19 || > + (c != &boot_cpu_data && > + (!opt_cpu_info || (c->apicid & (c->x86_num_siblings - 1))))) > + return; > + > + if (c->x86 < 0x17) { > + unsigned int node = 0; > + uint64_t nbcfg; > + > + /* > + * Make an attempt at determining the node ID, but assume > + * symmetric setup (using node 0) if this fails. > + */ > + if (c->extended_cpuid_level >= 0x8000001e && > + cpu_has(c, X86_FEATURE_TOPOEXT)) { > + node = cpuid_ecx(0x8000001e) & 0xff; > + if (node > 7) > + node = 0; > + } else if (cpu_has(c, X86_FEATURE_NODEID_MSR)) { > + rdmsrl(0xC001100C, val); > + node = val & 7; > + } > + > + /* > + * Enable (and use) Extended Config Space accesses, as we > + * can't be certain that MCFG is available here during boot. > + */ > + rdmsrl(MSR_AMD64_NB_CFG, nbcfg); > + wrmsrl(MSR_AMD64_NB_CFG, > + nbcfg | (1ULL << AMD64_NB_CFG_CF8_EXT_ENABLE_BIT)); > +#define PCI_ECS_ADDRESS(sbdf, reg) \ > + (0x80000000 | ((sbdf).bdf << 8) | ((reg) & 0xfc) | (((reg) & 0xf00) << > 16)) > + > + for ( ; ; ) { > + pci_sbdf_t sbdf = PCI_SBDF(0, 0, 0x18 | node, 4); > + > + switch (pci_conf_read32(sbdf, PCI_VENDOR_ID)) { > + case 0x00000000: > + case 0xffffffff: > + /* No device at this SBDF. */ > + if (!node) > + break; > + node = 0; > + continue; > + > + default: > + /* > + * Core Performance Boost Control, family > + * dependent up to 3 bits starting at bit 2. I would add: "Note that boot states operate at a frequency above the base one, and thus need to be accounted for in order to correctly fetch the nominal frequency of the processor." > + */ > + switch (c->x86) { > + case 0x10: idx = 1; break; > + case 0x12: idx = 7; break; > + case 0x14: idx = 7; break; > + case 0x15: idx = 7; break; > + case 0x16: idx = 7; break; > + } > + idx &= pci_conf_read(PCI_ECS_ADDRESS(sbdf, > + 0x15c), > + 0, 4) >> 2; > + break; > + } > + break; > + } > + > +#undef PCI_ECS_ADDRESS > + wrmsrl(MSR_AMD64_NB_CFG, nbcfg); > + } > + > + lo = 0; /* gcc may not recognize the loop having at least 5 iterations > */ > + for (h = c->x86 == 0x10 ? 5 : 8; h--; ) > + if (!rdmsr_safe(0xC0010064 + h, lo) && (lo >> 63)) > + break; > + if (!(lo >> 63)) > + return; Should you also take the P-state limit here into account (from MSR 0xC0010061)? I assume the firmware could set a minimum P-state higher than the possible ones present in the list of P-states, effectively preventing switching to lowest-performance P-states? The rest LGTM, Thanks.
|
Lists.xenproject.org is hosted with RackSpace, monitoring our |