Re: [PATCH v3 03/20] xen/riscv: introduce VMID allocation and manegement
On 8/4/25 5:19 PM, Jan Beulich wrote:
> On 31.07.2025 17:58, Oleksii Kurochko wrote:
>> Current implementation is based on x86's way to allocate VMIDs:
>>
>> VMIDs partition the physical TLB. In the current implementation VMIDs
>> are introduced to reduce the number of TLB flushes. Each time the
>> guest's virtual address space changes,
>
> virtual?

I assumed that originally it meant that, from Xen's point of view, it could
be called the guest's virtual address space, as the guest doesn't really
work with real physical addresses. But it seems it would be clearer to use
guest-physical, as you suggested below.

>> instead of flushing the TLB, a new VMID is assigned. This reduces the
>> number of TLB flushes to at most 1/#VMIDs. The biggest advantage is that
>> hot parts of the hypervisor's code and data retain in the TLB.
>>
>> VMIDs are a hart-local resource. As preemption of VMIDs is not possible,
>> VMIDs are assigned in a round-robin scheme. To minimize the overhead of
>> VMID invalidation, at the time of a TLB flush, VMIDs are tagged with a
>> 64-bit generation. Only on a generation overflow the code needs to
>> invalidate all VMID information stored at the VCPUs with are run on the
>> specific physical processor. This overflow appears after about 2^80 host
>> processor cycles,
>
> Where's this number coming from? (If you provide numbers, I think they
> will want to be "reproducible" by the reader. Which I fear isn't the case
> here.)

The 2^80 cycles (based on x86-related numbers) result from:
 1. A VM-Entry/-Exit cycle takes at least 1800 cycles (approximated by 2^10).
 2. We have 64 ASIDs (2^6).
 3. 2^64 generations.
i.e. 2^10 * 2^6 * 2^64 = 2^80 cycles.

I removed this part of the comment earlier because I assumed that the first
item is quite CPU-dependent, even for x86, let alone for other architectures,
which may have a different number (?). However, this part of the comment was
reintroduced during one of the merges.

>> @@ -148,6 +149,8 @@ void __init noreturn start_xen(unsigned long bootcpu_id,
>>
>>      console_init_postirq();
>>
>> +    vmid_init();
>
> This lives here only temporarily, I assume? Every hart will need to
> execute it, and hence (like we have it on x86) this may want to be a
> central place elsewhere.

I haven't checked how it is done on x86; I probably should. I planned to
call it for each hart separately during secondary hart bring-up, since
accessing the

>> --- /dev/null
>> +++ b/xen/arch/riscv/vmid.c
>> @@ -0,0 +1,165 @@
>> +/* SPDX-License-Identifier: GPL-2.0-only */
>> +
>> +#include <xen/domain.h>
>> +#include <xen/init.h>
>> +#include <xen/sections.h>
>> +#include <xen/lib.h>
>> +#include <xen/param.h>
>> +#include <xen/percpu.h>
>> +
>> +#include <asm/atomic.h>
>> +#include <asm/csr.h>
>> +#include <asm/flushtlb.h>
>> +
>> +/* Xen command-line option to enable VMIDs */
>> +static bool __read_mostly opt_vmid_enabled = true;
>
> __ro_after_init ?

Agree, __ro_after_init would be better.

>> +boolean_param("vmid", opt_vmid_enabled);
>> +
>> +/*
>> + * VMIDs partition the physical TLB. In the current implementation VMIDs are
>> + * introduced to reduce the number of TLB flushes. Each time the guest's
>> + * virtual address space changes, instead of flushing the TLB, a new VMID is
>
> The same odd "virtual" again? All the code here is about guest-physical,
> isn't it?

Answered above.

>> + * assigned. This reduces the number of TLB flushes to at most 1/#VMIDs.
>> + * The biggest advantage is that hot parts of the hypervisor's code and data
>> + * retain in the TLB.
>> + *
>> + * Sketch of the Implementation:
>> + *
>> + * VMIDs are a hart-local resource. As preemption of VMIDs is not possible,
>> + * VMIDs are assigned in a round-robin scheme. To minimize the overhead of
>> + * VMID invalidation, at the time of a TLB flush, VMIDs are tagged with a
>> + * 64-bit generation. Only on a generation overflow the code needs to
>> + * invalidate all VMID information stored at the VCPUs with are run on the
>> + * specific physical processor. This overflow appears after about 2^80
>
> And the same interesting number again.

Answered above.

>> + * host processor cycles, so we do not optimize this case, but simply disable
>> + * VMID useage to retain correctness.
>> + */
>> +
>> +/* Per-Hart VMID management. */
>> +struct vmid_data {
>> +    uint64_t hart_vmid_generation;
>
> Any reason not to simply use "generation"?

No specific reason for that, it could be renamed to "generation".

>> +    uint16_t next_vmid;
>> +    uint16_t max_vmid;
>> +    bool disabled;
>> +};
>> +
>> +static DEFINE_PER_CPU(struct vmid_data, vmid_data);
>> +
>> +static unsigned long vmidlen_detect(void)
>
> __init ? Or wait, are you (deliberately) permitting different VMIDLEN
> across harts?

All that I was able to find in the RISC-V spec is:

  The number of VMID bits is UNSPECIFIED and may be zero. The number of
  implemented VMID bits, termed VMIDLEN, may be determined by writing one
  to every bit position in the VMID field, then reading back the value in
  hgatp to see which bit positions in the VMID field hold a one. The
  least-significant bits of VMID are implemented first: that is, if
  VMIDLEN > 0, VMID[VMIDLEN-1:0] is writable. The maximal value of VMIDLEN,
  termed VMIDMAX, is 7 for Sv32x4 or 14 for Sv39x4, Sv48x4, and Sv57x4.

And I couldn't find it stated explicitly that VMIDLEN will be the same
across harts. Therefore, IMO, while the specification doesn't say that
VMIDLEN will differ, its "unspecified" nature and the per-hart discovery
mechanism of VMIDLEN in the hgatp CSR allow VMIDLEN to be different on
different harts of an implementation without violating the RISC-V
privileged specification.

>> +{
>> +    unsigned long vmid_bits;
>
> Why "long" (also for the function return type)?

Because csr_read() returns unsigned long, as the HGATP register has
'unsigned long' length. But it could be done in this way:

  csr_write(CSR_HGATP, old | HGATP_VMID_MASK);
  vmid_bits = MASK_EXTR(csr_read(CSR_HGATP), HGATP_VMID_MASK);
  vmid_bits = ffs_g(vmid_bits);
  csr_write(CSR_HGATP, old);

And then use uint16_t for vmid_bits and uint16_t as the return type.

>> +    unsigned long old;
>> +
>> +    /* Figure-out number of VMID bits in HW */
>> +    old = csr_read(CSR_HGATP);
>> +
>> +    csr_write(CSR_HGATP, old | HGATP_VMID_MASK);
>> +    vmid_bits = csr_read(CSR_HGATP);
>> +    vmid_bits = MASK_EXTR(vmid_bits, HGATP_VMID_MASK);
>
> Nit: Stray blank.

>> +    vmid_bits = flsl(vmid_bits);
>> +    csr_write(CSR_HGATP, old);
>> +
>> +    /*
>> +     * We polluted local TLB so flush all guest TLB as
>> +     * a speculative access can happen at any time.
>> +     */
>> +    local_hfence_gvma_all();
>
> There's no guest running. If you wrote hgatp.MODE as zero, as per my
> understanding no new TLB entries could even purely theoretically appear.

It could be an issue (or, at least, it is recommended) when hgatp.MODE is
changed:

  If hgatp.MODE is changed for a given VMID, an HFENCE.GVMA with rs1=x0
  (and rs2 set to either x0 or the VMID) must be executed to order
  subsequent guest translations with the MODE change—even if the old MODE
  or new MODE is Bare.

On the other hand, it is guaranteed, at least on Reset (and so I assume for
power-on), that:

  If the hypervisor extension is implemented, the hgatp.MODE and vsatp.MODE
  fields are reset to 0.

So it seems that if no guest has run, there is no need even to write
hgatp.MODE as zero, but it might make sense to do that explicitly, just to
be sure.
I thought it was possible to have a running guest and perform a CPU hotplug. In that case, I expect that during the hotplug,
> In fact, with no guest running (yet) I'm having a hard time seeing why you
> shouldn't be able to simply write the register with just HGATP_VMID_MASK,
> i.e. without OR-ing in "old". It's even questionable whether "old" needs
> restoring; writing plain zero afterwards ought to suffice. You're in
> charge of the register, after all.

It makes sense (but I don't know if it is a possible case) to be sure that
HGATP.MODE remains the same, so that there is no need for a TLB flush. If
HGATP.MODE is changed, then a TLB flush will be needed, as I mentioned above.

If we agree to keep local_hfence_gvma_all(), then I think there isn't really
any point in restoring 'old' or OR-ing it with HGATP_VMID_MASK.

Generally, if 'old' is guaranteed to be zero (and it probably makes sense to
check that in vmidlen_detect() and panic if it isn't) and if vmidlen_detect()
is called before any guest domain(s) are run, then I could agree that we
don't need local_hfence_gvma_all() here. As an option, we could do
local_hfence_gvma_all() only if the 'old' value wasn't zero. Does it make
sense?

>> +    return vmid_bits;
>> +}
>> +
>> +void vmid_init(void)
>> +{
>> +    static bool g_disabled = false;
>> +    unsigned long vmid_len = vmidlen_detect();
>> +    struct vmid_data *data = &this_cpu(vmid_data);
>> +    unsigned long max_availalbe_bits = sizeof(data->max_vmid) << 3;
>
> Nit: Typo in "available". Also now that we have it, better use
> BITS_PER_BYTE here?

Sure, I will use BITS_PER_BYTE.

>> +    if ( vmid_len > max_availalbe_bits )
>> +        panic("%s: VMIDLEN is bigger then a type which represent VMID: %lu(%lu)\n",
>> +              __func__, vmid_len, max_availalbe_bits);
>
> This shouldn't be a runtime check imo. What you want to check (at build
> time) is that the bits set in HGATP_VMID_MASK can be held in ->max_vmid.

Oh, I just noticed that this check isn't even really correct, because
data->max_vmid is initialized after this check. Anyway, a build-time check
would be better.

>> +    data->max_vmid = BIT(vmid_len, U) - 1;
>> +    data->disabled = !opt_vmid_enabled || (vmid_len <= 1);
>
> Actually, what exactly does it mean that "VMIDs are disabled"? There's no
> enable bit that I could find anywhere. Isn't it rather that in this case
> you need to arrange to flush always on VM entry (or always after a P2M
> change, depending how the TLB is split between guest and host use)?

"VMIDs are disabled" here means that a TLB flush will happen each time the
p2m is changed.

> If you look at vmx_vmenter_helper(), its flipping of
> SECONDARY_EXEC_ENABLE_VPID tweaks CPU behavior, such that the flush would
> be implicit (when the bit is off). I don't expect RISC-V has any such
> "implicit" flushing behavior?

RISC-V relies on explicit software-managed fence instructions for TLB
flushing. It seems like vmid_handle_vmenter() should then be updated to
return true if VMIDs are disabled:

  bool vmid_handle_vmenter(struct vcpu_vmid *vmid)
  {
      struct vmid_data *data = &this_cpu(vmid_data);

      ...

      /*
       * When we assign VMID 1, flush all TLB entries as we are starting a
       * new generation, and all old VMID allocations are now stale.
       */
      return (vmid->vmid == 1);

   disabled:
      vmid->vmid = 0;
  -   return 0;
  +   return true;
  }

>> +    if ( g_disabled != data->disabled )
>> +    {
>> +        printk("%s: VMIDs %sabled.\n", __func__,
>> +               data->disabled ? "dis" : "en");
>> +        if ( !g_disabled )
>> +            g_disabled = data->disabled;
>
> This doesn't match x86 code. g_disabled is a tristate there, which only
> the boot CPU would ever write to.

Why is g_disabled written only by the boot CPU? Does x86 have only two
options, i.e. either VMIDs are enabled for all CPUs or they are disabled
for all of them?
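If I follow the tristate idea, the hunk above would become something like
this (just a sketch, assuming an int8_t where -1 means "not yet decided"):

  static int8_t g_disabled = -1;

  if ( g_disabled != data->disabled )
  {
      printk("%s: VMIDs %sabled.\n", __func__,
             data->disabled ? "dis" : "en");
      if ( g_disabled < 0 )   /* only the first caller (boot hart) writes it */
          g_disabled = data->disabled;
  }

Is that what you mean? Then later harts would only log when they disagree
with the boot hart, without changing the recorded state.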
For RISC-V, as I mentioned above, it is necessary to check all harts, as the
spec doesn't explicitly say that VMIDLEN is equal for all harts...

> A clear shortcoming of the x86 code (that you copied) is that the log
> message doesn't identify the CPU in question. A sequence of "disabled" and
> "enabled" could thus result, without the last one (or in fact any one)
> making clear what the overall state is. I think you want to avoid this
> from the beginning.

... Thereby it seems that the declaration of g_disabled should be moved
outside the vmid_init() function, and a new function added which returns the
g_disabled value (or just make g_disabled non-static and rename it to
something like g_vmids_disabled).
And print the message once, after all harts have been initialized, somewhere
in setup.c in start_xen() after:
    for_each_present_cpu ( i )
    {
        if ( (num_online_cpus() < nr_cpu_ids) && !cpu_online(i) )
        {
            int ret = cpu_up(i);
            if ( ret != 0 )
                printk("Failed to bring up CPU %u (error %d)\n", i, ret);
        }
    }
(RISC-V doesn't have such code at the moment)
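Roughly what I have in mind (just a sketch; the names and the exact policy
are tentative):

  /* vmid.c */
  static bool g_vmids_disabled;

  bool vmids_disabled(void)
  {
      return g_vmids_disabled;
  }

  void vmid_init(void)
  {
      ...
      /* One possible policy: any hart without usable VMIDs disables them globally. */
      if ( data->disabled )
          g_vmids_disabled = true;
      ...
  }

  /* setup.c, start_xen(), after the loop above */
  printk("VMIDs are %sabled\n", vmids_disabled() ? "dis" : "en");

That way the message is printed exactly once and reflects the state of all
harts.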
~ Oleksii