[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH v3 03/20] xen/riscv: introduce VMID allocation and manegement

To: Demi Marie Obenour <demiobenour@xxxxxxxxx>, Jan Beulich <jbeulich@xxxxxxxx>
From: Oleksii Kurochko <oleksii.kurochko@xxxxxxxxx>
Date: Thu, 7 Aug 2025 10:43:13 +0200
Cc: Alistair Francis <alistair.francis@xxxxxxx>, Bob Eshleman <bobbyeshleman@xxxxxxxxx>, Connor Davis <connojdavis@xxxxxxxxx>, Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Anthony PERARD <anthony.perard@xxxxxxxxxx>, Michal Orzel <michal.orzel@xxxxxxx>, Julien Grall <julien@xxxxxxx>, Roger Pau Monné <roger.pau@xxxxxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>, xen-devel@xxxxxxxxxxxxxxxxxxxx
Delivery-date: Thu, 07 Aug 2025 08:43:25 +0000
List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 8/6/25 6:50 PM, Demi Marie Obenour wrote:

On 8/6/25 12:24, Oleksii Kurochko wrote:

On 8/6/25 2:05 PM, Jan Beulich wrote:

On 06.08.2025 13:33, Oleksii Kurochko wrote:

On 8/4/25 5:19 PM, Jan Beulich wrote:

On 31.07.2025 17:58, Oleksii Kurochko wrote:

@@ -148,6 +149,8 @@ void __init noreturn start_xen(unsigned long bootcpu_id,
   
       console_init_postirq();
   
+    vmid_init();

This lives here only temporarily, I assume? Every hart will need to execute
it, and hence (like we have it on x86) this may want to be a central place
elsewhere.

I haven’t checked how it is done on x86; I probably should.

I planned to call it for each hart separately during secondary hart bring-up,
since accessing the|hgatp| register of a hart is required to detect|VMIDLEN|.
Therefore,|vmid_init()| should be called for secondary harts when their
initialization code starts executing.

But is this going to be the only per-hart thing that will need doing? Otherwise
the same larger "container" function may want calling instead.

Yes, it is going to be the only per-hart operation.

There is|__cpu_up()| (not yet upstreamed [1]), which calls
|sbi_hsm_hart_start(hartid, boot_addr, hsm_data)| to start a hart, and I planned
to place|vmid_init()| somewhere in the code executed at|boot_addr|.

[1]https://gitlab.com/xen-project/people/olkur/xen/-/blob/latest/xen/arch/riscv/smpboot.c#L40

+{
+    unsigned long vmid_bits;

Why "long" (also for the function return type)?

Because csr_read() returns unsigned long as HGATP register has
'unsigned long' length.

Oh, right, I should have commented on the function return type only.
Yet then I also can't resist stating that this kind of use of a variable,
which initially is assigned a value that doesn't really fit its name, is
easily misleading towards giving such comments.

But it could be done in this way:
      csr_write(CSR_HGATP, old | HGATP_VMID_MASK);
      vmid_bits =  MASK_EXTR(csr_read(CSR_HGATP), HGATP_VMID_MASK);
      vmid_bits = ffs_g(vmid_bits);
      csr_write(CSR_HGATP, old);
And then use uint16_t for vmid_bits and use uin16_t as a return type.

Please check ./CODING_STYLE again as to the use of fixed-width types.

I meant unsigned short, uint16_t was just short to write. I'll try to be
more specific.

+    unsigned long old;
+
+    /* Figure-out number of VMID bits in HW */
+    old = csr_read(CSR_HGATP);
+
+    csr_write(CSR_HGATP, old | HGATP_VMID_MASK);
+    vmid_bits = csr_read(CSR_HGATP);
+    vmid_bits =  MASK_EXTR(vmid_bits, HGATP_VMID_MASK);

Nit: Stray blank.

+    vmid_bits = flsl(vmid_bits);
+    csr_write(CSR_HGATP, old);
+
+    /*
+     * We polluted local TLB so flush all guest TLB as
+     * a speculative access can happen at any time.
+     */
+    local_hfence_gvma_all();

There's no guest running. If you wrote hgat.MODE as zero, as per my
understanding now new TLB entries could even purely theoretically appear.

It could be an issue (or, at least, it is recommended) when hgatp.MODE is
changed:
   If hgatp.MODE is changed for a given VMID, an HFENCE.GVMA with rs1=x0
   (and rs2 set to either x0 or the VMID) must be executed to order subsequent
   guest translations with the MODE change—even if the old MODE or new MODE
   is Bare.
On other hand it is guaranteed that, at least, on Reset (and so I assume
for power on) that:
   If the hypervisor extension is implemented, the hgatp.MODE and vsatp.MODE
   fields are reset to 0.

So it seems like if no guest is ran then there is no need even to write
hgatp.MODE as zero, but it might be sense to do that explicitly just to
be sure.

I thought it was possible to have a running guest and perform a CPU hotplug.

But that guest will run on another hart.

In that case, I expect that during the hotplug,|vmidlen_detect()| will be
called and return the|vmid_bits| value, which is used as the active VMID.
At that moment, the local TLB could be speculatively polluted, I think.
Likely, it makes sense to call vmidlen_detect() only once for each hart
during initial bringup.

That may bring you more problems than it solves. You'd need to stash away
the value originally read somewhere. And that somewhere isn't per-CPU data.

In fact, with no guest running (yet) I'm having a hard time seeing why
you shouldn't be able to simply write the register with just
HGATP_VMID_MASK, i.e. without OR-ing in "old". It's even questionable
whether "old" needs restoring; writing plain zero afterwards ought to
suffice. You're in charcge of the register, after all.

It make sense (but I don't know if it is a possible case) to be sure that
HGATP.MODE remains the same, so there is no need to have TLB flush. If
HGATP.MODE is changed then it will be needed to do TLB flush as I mentioned
above.

If we agreed to keep local_hfence_gvma_all() then I think it isn't really
any sense to restore 'old' or OR-ing it with HGATP_VMID_MASK.

Generally if 'old' is guaranteed to be zero (and, probably, it makes sense
to check that in vmidlen_detect() and panic if it isn't zero) and if
vmidlen_detect() function will be called before any guest domain(s) will
be ran then I could agree that we don't need local_hfence_gvma_all() here.

As an option we can do local_hfence_gvma_all() only if 'old' value wasn't
set to zero.

Does it make sense?

Well - I'd like the pre-conditions to be understood better. For example, can
a hart really speculate into guest mode, when the hart is only in the
process of being brought up?

I couldn't explicit words that a hart can't speculate into guest mode
either on bring up or during its work.

But there are some moments in the spec which tells:
   Implementations with virtual memory are permitted to perform address
   translations speculatively and earlier than required by an explicit
   memory access, and are permitted to cache them in address translation
   cache structures—including possibly caching the identity mappings from
   effective address to physical address used in Bare translation modes and
   M-mode.
And here:
   Implementations may also execute the address-translation algorithm
   speculatively at any time, for any virtual address, as long as satp is
   active (as defined in Section 10.1.11). Such speculative executions have
   the effect of pre-populating the address-translation cache.
Where it is explicitly mentioned that speculation can happen in *any time*.
And at the same time:
   Speculative executions of the address-translation algorithm behave as
   non-speculative executions of the algorithm do, except that they must
   not set the dirty bit for a PTE, they must not trigger an exception,
   and they must not create address-translation cache entries if those
   entries would have been invalidated by any SFENCE.VMA instruction
   executed by the hart since the speculative execution of the algorithm began.
What I read as if TLB was empty before it will stay empty.

I read that as "flushing the TLB invalidates entries created by speculative
execution before the TLB flush".

But this part:
  they must not create address-translation cache entries if those entries
  would have been invalidated by any SFENCE.VMA instruction

Doesn't it mean that entries which was invalidated by SFENCE.VMA can't be
inserted into the TLB during speculative execution?

So, if the speculative page walk started before SFENCE.VMA, SFENCE.VMA
indicates: “All previous TLB entries might be invalid". Therefore, any
speculative TLB entry that started before must not be inserted into the
TLB afterward.

So, hardware tracks if a SFENCE.VMA occurred after speculation started.
If so, any speculative address translations must be discarded or
not committed.

  That is the bare minimum needed for TLB
flushing to work.  You have to do the TLB flush *after* changing the PTEs,
not before.

This is true on at least x86 but I expect it to hold in general.

Agree with that.

~ Oleksii

References:
- Re: [PATCH v3 03/20] xen/riscv: introduce VMID allocation and manegement
  - From: Jan Beulich
- Re: [PATCH v3 03/20] xen/riscv: introduce VMID allocation and manegement
  - From: Oleksii Kurochko
- Re: [PATCH v3 03/20] xen/riscv: introduce VMID allocation and manegement
  - From: Jan Beulich
- Re: [PATCH v3 03/20] xen/riscv: introduce VMID allocation and manegement
  - From: Oleksii Kurochko
- Re: [PATCH v3 03/20] xen/riscv: introduce VMID allocation and manegement
  - From: Demi Marie Obenour

Prev by Date: Re: [PATCH] xen/domctl: add domain_lock in XEN_DOMCTL_setvcpucontext
Next by Date: Re: [PATCH 2/2] x86/mm: correct PG_log_dirty definition
Previous by thread: Re: [PATCH v3 03/20] xen/riscv: introduce VMID allocation and manegement
Next by thread: Re: [PATCH v3 03/20] xen/riscv: introduce VMID allocation and manegement
Index(es):
- Date
- Thread

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.