
Re: [PATCH v3 03/20] xen/riscv: introduce VMID allocation and management




On 8/4/25 5:19 PM, Jan Beulich wrote:
On 31.07.2025 17:58, Oleksii Kurochko wrote:
Current implementation is based on x86's way to allocate VMIDs:
  VMIDs partition the physical TLB. In the current implementation VMIDs are
  introduced to reduce the number of TLB flushes.  Each time the guest's
  virtual address space changes,
virtual?
I assumed that originally it meant that, from Xen's point of view, it could be
called the guest's virtual address space, as the guest doesn't really work with
real physical addresses; but it seems it would be clearer to use guest-physical,
as you suggested below.
instead of flushing the TLB, a new VMID is
  assigned.  This reduces the number of TLB flushes to at most 1/#VMIDs.
  The biggest advantage is that hot parts of the hypervisor's code and data
  retain in the TLB.

  VMIDs are a hart-local resource.  As preemption of VMIDs is not possible,
  VMIDs are assigned in a round-robin scheme.  To minimize the overhead of
  VMID invalidation, at the time of a TLB flush, VMIDs are tagged with a
  64-bit generation. Only on a generation overflow the code needs to
  invalidate all VMID information stored at the VCPUs which are run on the
  specific physical processor.  This overflow appears after about 2^80
  host processor cycles,
Where's this number coming from? (If you provide numbers, I think they will
want to be "reproducible" by the reader. Which I fear isn't the case here.)
The 2^80 cycles (based on x86-related numbers) result from:
  1. A VM-Entry/-Exit cycle takes at least 1800 cycles (approximated by 2^10)
  2. We have 64 ASIDs (2^6)
  3. 2^64 generations.
I removed this part of the comment earlier because I assumed that the first
item is quite CPU-dependent, even for x86, let alone for other architectures,
which may have a different number (?).
However, this part of the comment was reintroduced during one of the merges.

@@ -148,6 +149,8 @@ void __init noreturn start_xen(unsigned long bootcpu_id,
 
     console_init_postirq();
 
+    vmid_init();
This lives here only temporarily, I assume? Every hart will need to execute
it, and hence (like we have it on x86) this may want to be a central place
elsewhere.
I haven’t checked how it is done on x86; I probably should.
I planned to call it for each hart separately during secondary hart bring-up,
since accessing the hgatp register of a hart is required to detect VMIDLEN.
Therefore, vmid_init() should be called for secondary harts when their
initialization code starts executing.

--- /dev/null
+++ b/xen/arch/riscv/vmid.c
@@ -0,0 +1,165 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+
+#include <xen/domain.h>
+#include <xen/init.h>
+#include <xen/sections.h>
+#include <xen/lib.h>
+#include <xen/param.h>
+#include <xen/percpu.h>
+
+#include <asm/atomic.h>
+#include <asm/csr.h>
+#include <asm/flushtlb.h>
+
+/* Xen command-line option to enable VMIDs */
+static bool __read_mostly opt_vmid_enabled = true;
__ro_after_init ?
Agree, __ro_after_init would be better.
+boolean_param("vmid", opt_vmid_enabled);
+
+/*
+ * VMIDs partition the physical TLB. In the current implementation VMIDs are
+ * introduced to reduce the number of TLB flushes.  Each time the guest's
+ * virtual address space changes, instead of flushing the TLB, a new VMID is
The same odd "virtual" again? All the code here is about guest-physical, isn't
it?
Answered above.
+ * assigned. This reduces the number of TLB flushes to at most 1/#VMIDs.
+ * The biggest advantage is that hot parts of the hypervisor's code and data
+ * retain in the TLB.
+ *
+ * Sketch of the Implementation:
+ *
+ * VMIDs are a hart-local resource.  As preemption of VMIDs is not possible,
+ * VMIDs are assigned in a round-robin scheme.  To minimize the overhead of
+ * VMID invalidation, at the time of a TLB flush, VMIDs are tagged with a
+ * 64-bit generation. Only on a generation overflow the code needs to
+ * invalidate all VMID information stored at the VCPUs which are run on the
+ * specific physical processor.  This overflow appears after about 2^80
And the same interesting number again.
Answered above.
+ * host processor cycles, so we do not optimize this case, but simply disable
+ * VMID usage to retain correctness.
+ */
+
+/* Per-Hart VMID management. */
+struct vmid_data {
+   uint64_t hart_vmid_generation;
Any reason not to simply use "generation"?
No specific reason for that, it could be renamed to "generation".

+   uint16_t next_vmid;
+   uint16_t max_vmid;
+   bool disabled;
+};
+
+static DEFINE_PER_CPU(struct vmid_data, vmid_data);
+
+static unsigned long vmidlen_detect(void)
__init ? Or wait, are you (deliberately) permitting different VMIDLEN
across harts?
All I was able to find in the RISC-V spec is that:
  The number of VMID bits is UNSPECIFIED and may be zero. The number of
  implemented VMID bits, termed VMIDLEN, may be determined by writing one
  to every bit position in the VMID field, then reading back the value in
  hgatp to see which bit positions in the VMID field hold a one. The least-
  significant bits of VMID are implemented first: that is, if VMIDLEN > 0,
  VMID[VMIDLEN-1:0] is writable. The maximal value of VMIDLEN, termed
  VMIDMAX, is 7 for Sv32x4 or 14 for Sv39x4, Sv48x4, and Sv57x4.
And I couldn't find explicitly that VMIDLEN will be the same across harts.

Therefore, IMO, while the specification doesn't state this explicitly, the
"unspecified" nature and the per-hart discovery mechanism of VMIDLEN in
the hgatp CSR allow VMIDLEN to differ across harts in an implementation
without violating the RISC-V privileged specification.




+{
+    unsigned long vmid_bits;
Why "long" (also for the function return type)?
Because csr_read() returns unsigned long, as the HGATP register is
'unsigned long' wide.

But it could be done in this way:
    csr_write(CSR_HGATP, old | HGATP_VMID_MASK);
    vmid_bits = MASK_EXTR(csr_read(CSR_HGATP), HGATP_VMID_MASK);
    vmid_bits = flsl(vmid_bits);
    csr_write(CSR_HGATP, old);
And then use uint16_t for vmid_bits and as the return type.


+    unsigned long old;
+
+    /* Figure-out number of VMID bits in HW */
+    old = csr_read(CSR_HGATP);
+
+    csr_write(CSR_HGATP, old | HGATP_VMID_MASK);
+    vmid_bits = csr_read(CSR_HGATP);
+    vmid_bits =  MASK_EXTR(vmid_bits, HGATP_VMID_MASK);
Nit: Stray blank.

+    vmid_bits = flsl(vmid_bits);
+    csr_write(CSR_HGATP, old);
+
+    /*
+     * We polluted local TLB so flush all guest TLB as
+     * a speculative access can happen at any time.
+     */
+    local_hfence_gvma_all();
There's no guest running. If you wrote hgatp.MODE as zero, as per my
understanding no new TLB entries could appear, even purely theoretically.
It could be an issue (or, at least, a flush is recommended) when hgatp.MODE
is changed:
 If hgatp.MODE is changed for a given VMID, an HFENCE.GVMA with rs1=x0
 (and rs2 set to either x0 or the VMID) must be executed to order subsequent
 guest translations with the MODE change—even if the old MODE or new MODE
 is Bare.
On the other hand, it is guaranteed, at least on Reset (and so I assume
for power-on too), that:
 If the hypervisor extension is implemented, the hgatp.MODE and vsatp.MODE
 fields are reset to 0.

So it seems that if no guest is run, there is no need even to write
hgatp.MODE as zero, but it might make sense to do that explicitly just to
be sure.

I thought it was possible to have a running guest and perform a CPU hotplug.
In that case, I expect that during the hotplug vmidlen_detect() will be
called and return the vmid_bits value, which is then used as VMIDLEN.
At that moment the local TLB could be speculatively polluted, I think.
It likely makes sense to call vmidlen_detect() only once for each hart
during initial bringup.


In fact, with no guest running (yet) I'm having a hard time seeing why
you shouldn't be able to simply write the register with just
HGATP_VMID_MASK, i.e. without OR-ing in "old". It's even questionable
whether "old" needs restoring; writing plain zero afterwards ought to
suffice. You're in charge of the register, after all.
It makes sense (though I don't know if it is a possible case) to ensure that
HGATP.MODE remains the same, so no TLB flush is needed. If HGATP.MODE is
changed, then a TLB flush will be needed, as I mentioned above.

If we agree to keep local_hfence_gvma_all(), then I don't think it makes
sense to restore 'old' or to OR it with HGATP_VMID_MASK.

Generally, if 'old' is guaranteed to be zero (and it probably makes sense
to check that in vmidlen_detect() and panic if it isn't) and if
vmidlen_detect() is called before any guest domain(s) are run, then I could
agree that we don't need local_hfence_gvma_all() here.

As an option, we could do local_hfence_gvma_all() only if the 'old' value
wasn't zero.

Does it make sense?

+    return vmid_bits;
+}
+
+void vmid_init(void)
+{
+    static bool g_disabled = false;

+    unsigned long vmid_len = vmidlen_detect();
+    struct vmid_data *data = &this_cpu(vmid_data);
+    unsigned long max_availalbe_bits = sizeof(data->max_vmid) << 3;
Nit: Typo in "available". Also now that we have it, better use
BITS_PER_BYTE here?
Sure, I will use BITS_PER_BYTE.


+    if ( vmid_len > max_availalbe_bits )
+        panic("%s: VMIDLEN is bigger then a type which represent VMID: %lu(%lu)\n",
+              __func__, vmid_len, max_availalbe_bits);
This shouldn't be a runtime check imo. What you want to check (at build
time) is that the bits set in HGATP_VMID_MASK can be held in ->max_vmid.
Oh, I just noticed that this check isn't even really correct, because
data->max_vmid is initialized after this check.

Anyway, build time check would be better.


+    data->max_vmid = BIT(vmid_len, U) - 1;
+    data->disabled = !opt_vmid_enabled || (vmid_len <= 1);
Actually, what exactly does it mean that "VMIDs are disabled"? There's
no enable bit that I could find anywhere. Isn't it rather that in this
case you need to arrange to flush always on VM entry (or always after a
P2M change, depending how the TLB is split between guest and host use)?
"VMIDs are disabled" here means that TLB flush will happen each time p2m
is changed.
If you look at vmx_vmenter_helper(), its flipping of
SECONDARY_EXEC_ENABLE_VPID tweaks CPU behavior, such that the flush
would be implicit (when the bit is off). I don't expect RISC-V has any
such "implicit" flushing behavior?
RISC-V relies on explicit software-managed fence instructions for TLB
flushing.

It seems like vmid_handle_vmenter() should be updated then to return
true if VMIDs are disabled:
  bool vmid_handle_vmenter(struct vcpu_vmid *vmid)
  {
      struct vmid_data *data = &this_cpu(vmid_data);
  
  ...
  
      /*
       * When we assign VMID 1, flush all TLB entries as we are starting a new
       * generation, and all old VMID allocations are now stale.
       */
      return (vmid->vmid == 1);
  
   disabled:
      vmid->vmid = 0;
-      return 0;
+      return true;
  }


+    if ( g_disabled != data->disabled )
+    {
+        printk("%s: VMIDs %sabled.\n", __func__,
+               data->disabled ? "dis" : "en");
+        if ( !g_disabled )
+            g_disabled = data->disabled;
This doesn't match x86 code. g_disabled is a tristate there, which only
the boot CPU would ever write to.
Why is g_disabled written only by the boot CPU? Does x86 have only two
options, i.e. VMIDs are either enabled for all CPUs or disabled for all of
them?

For RISC-V, as I mentioned above, all harts need to be checked, since the
spec doesn't explicitly say that VMIDLEN is equal across harts...


A clear shortcoming of the x86 code (that you copied) is that the log
message doesn't identify the CPU in question. A sequence of "disabled"
and "enabled" could thus result, without the last one (or in fact any
one) making clear what the overall state is. I think you want to avoid
this from the beginning.
... Therefore it seems the declaration of g_disabled should be moved outside
the vmid_init() function, and a new function added that returns the
g_disabled value (or g_disabled could simply be made non-static and renamed
to something like g_vmids_disabled).

And print the message once, after all harts have been initialized, somewhere
in setup.c in start_xen() after:
 
    for_each_present_cpu ( i )
    {
        if ( (num_online_cpus() < nr_cpu_ids) && !cpu_online(i) )
        {
            int ret = cpu_up(i);
            if ( ret != 0 )
                printk("Failed to bring up CPU %u (error %d)\n", i, ret);
        }
    }
(RISC-V doesn't have such code at the moment)

~ Oleksii
