[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: [PATCH] x86: retrieve and log CPU frequency information
On 15.05.2020 10:32, Roger Pau Monné wrote: > On Wed, Apr 15, 2020 at 01:55:24PM +0200, Jan Beulich wrote: >> While from just a single Skylake system it is already clear that we >> can't base any of our logic on CPUID leaf 15 [1] (leaf 16 is >> documented to be used for display purposes only anyway), logging this >> information may still give us some reference in case of problems as well >> as for future work. Additionally on the AMD side it is unclear whether >> the deviation between reported and measured frequencies is because of us >> not doing well, or because of nominal and actual frequencies being quite >> far apart. >> >> The chosen variable naming in amd_log_freq() has pointed out a naming >> problem in rdmsr_safe(), which is being taken care of at the same time. >> Symmetrically wrmsr_safe(), being an inline function, also gets an >> unnecessary underscore dropped from one of its local variables. >> >> [1] With a core crystal clock of 24MHz and a ratio of 216/2, the >> reported frequency nevertheless is 2600MHz, rather than the to be >> expected (and calibrated by both us and Linux) 2592MHz. >> >> Suggested-by: Andrew Cooper <andrew.cooper3@xxxxxxxxxx> >> Signed-off-by: Jan Beulich <jbeulich@xxxxxxxx> > > Reviewed-by: Roger Pau Monné <roger.pau@xxxxxxxxxx> Thanks, but please clarify whether this is with or without the two suggested changes (breaking out intel_log_freq() and introducing local variables for (uint8_t)(msrval >> NN)), or whether you mean to leave it to me whether to make them. And if I'm to make the change, whether you'd trust me to not screw up things, i.e. whether I can keep you R-b in that case. > I have one question below about P-state limits. > >> --- >> TBD: The node ID retrieval using extended leaf 1E implies it won't work >> on older hardware (pre-Fam15 I think). Besides the Node ID MSR, >> which doesn't get advertised on my Fam10 box (and it's zero on all >> processors despite there being two nodes as per the PCI device >> map), and which isn't even documented for Fam11, Fam12, and Fam14, >> I didn't find any other means to retrieve the node ID a CPU is >> associated with - the NodeId register in PCI config space depends >> on one already knowing the node ID for doing the access, as the >> device to be used is a function of the node ID. > > I there a real chance of the boost states being different between > nodes? Probably not, but doing things properly would still have been nice. > Won't Xen explode elsewhere due to possibly diverging features > between nodes? For many features - yes, but for boost states being different I don't think it would. >> --- a/xen/arch/x86/cpu/amd.c >> +++ b/xen/arch/x86/cpu/amd.c >> @@ -532,6 +532,102 @@ static void amd_get_topology(struct cpui >> : c->cpu_core_id); >> } >> >> +void amd_log_freq(const struct cpuinfo_x86 *c) >> +{ >> + unsigned int idx = 0, h; >> + uint64_t hi, lo, val; >> + >> + if (c->x86 < 0x10 || c->x86 > 0x19 || >> + (c != &boot_cpu_data && >> + (!opt_cpu_info || (c->apicid & (c->x86_num_siblings - 1))))) >> + return; >> + >> + if (c->x86 < 0x17) { >> + unsigned int node = 0; >> + uint64_t nbcfg; >> + >> + /* >> + * Make an attempt at determining the node ID, but assume >> + * symmetric setup (using node 0) if this fails. >> + */ >> + if (c->extended_cpuid_level >= 0x8000001e && >> + cpu_has(c, X86_FEATURE_TOPOEXT)) { >> + node = cpuid_ecx(0x8000001e) & 0xff; >> + if (node > 7) >> + node = 0; >> + } else if (cpu_has(c, X86_FEATURE_NODEID_MSR)) { >> + rdmsrl(0xC001100C, val); >> + node = val & 7; >> + } >> + >> + /* >> + * Enable (and use) Extended Config Space accesses, as we >> + * can't be certain that MCFG is available here during boot. >> + */ >> + rdmsrl(MSR_AMD64_NB_CFG, nbcfg); >> + wrmsrl(MSR_AMD64_NB_CFG, >> + nbcfg | (1ULL << AMD64_NB_CFG_CF8_EXT_ENABLE_BIT)); >> +#define PCI_ECS_ADDRESS(sbdf, reg) \ >> + (0x80000000 | ((sbdf).bdf << 8) | ((reg) & 0xfc) | (((reg) & 0xf00) << >> 16)) >> + >> + for ( ; ; ) { >> + pci_sbdf_t sbdf = PCI_SBDF(0, 0, 0x18 | node, 4); >> + >> + switch (pci_conf_read32(sbdf, PCI_VENDOR_ID)) { >> + case 0x00000000: >> + case 0xffffffff: >> + /* No device at this SBDF. */ >> + if (!node) >> + break; >> + node = 0; >> + continue; >> + >> + default: >> + /* >> + * Core Performance Boost Control, family >> + * dependent up to 3 bits starting at bit 2. > > > I would add: > > "Note that boot states operate at a frequency above the base one, and > thus need to be accounted for in order to correctly fetch the nominal > frequency of the processor." Done. >> + */ >> + switch (c->x86) { >> + case 0x10: idx = 1; break; >> + case 0x12: idx = 7; break; >> + case 0x14: idx = 7; break; >> + case 0x15: idx = 7; break; >> + case 0x16: idx = 7; break; >> + } >> + idx &= pci_conf_read(PCI_ECS_ADDRESS(sbdf, >> + 0x15c), >> + 0, 4) >> 2; >> + break; >> + } >> + break; >> + } >> + >> +#undef PCI_ECS_ADDRESS >> + wrmsrl(MSR_AMD64_NB_CFG, nbcfg); >> + } >> + >> + lo = 0; /* gcc may not recognize the loop having at least 5 iterations >> */ >> + for (h = c->x86 == 0x10 ? 5 : 8; h--; ) >> + if (!rdmsr_safe(0xC0010064 + h, lo) && (lo >> 63)) >> + break; >> + if (!(lo >> 63)) >> + return; > > Should you also take the P-state limit here into account (from MSR > 0xC0010061)? > > I assume the firmware could set a minimum P-state higher than the > possible ones present in the list of P-states, effectively preventing > switching to lowest-performance P-states? We're not after permitted P-states here - these would matter only if we were meaning to alter the current P-state by direct MSR accesses. Here we're only after logging capabilities (and the P-state limits can, aiui, in principle also change at runtime). Jan
|
Lists.xenproject.org is hosted with RackSpace, monitoring our |