[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [XEN v2] xen/arm64: io: Decode 32-bit ldr/str post-indexing instructions



On Mon, 10 Jan 2022 14:33:11 +0000
Ayan Kumar Halder <ayan.kumar.halder@xxxxxxxxxx> wrote:

Hi Ayan,

> Many thanks for your inputs. It is making better sense now. Much 
> appreciated.
> 
> A few questions/clarifications :-
> 
> On 06/01/2022 15:33, Andre Przywara wrote:
> > On Wed, 5 Jan 2022 16:55:11 +0000
> > Ayan Kumar Halder<ayan.kumar.halder@xxxxxxxxxx>  wrote:
> >
> > Hi,
> >  
> >> Thank you so much for your feedback.
> >>
> >> I need a couple of clarifications before I can start with the v3 patch.
> >>
> >> On 08/12/2021 12:00, Andre Przywara wrote:  
> >>> On Mon, 6 Dec 2021 19:31:06 +0000
> >>> Julien Grall<julien@xxxxxxx>  wrote:
> >>>
> >>> Hi,
> >>>     
> >>>> On 01/12/2021 08:41, Bertrand Marquis wrote:  
> >>>>> Hi Ayan,
> >>>>>         
> >>>>>> On 30 Nov 2021, at 19:13, Ayan Kumar 
> >>>>>> Halder<ayan.kumar.halder@xxxxxxxxxx>  wrote:
> >>>>>>
> >>>>>> Hi Andre,
> >>>>>>
> >>>>>> Thanks for your comments. They are useful.
> >>>>>>
> >>>>>> On 30/11/2021 09:49, Andre Przywara wrote:  
> >>>>>>> On Mon, 29 Nov 2021 19:16:38 +0000
> >>>>>>> Ayan Kumar Halder<ayan.kumar.halder@xxxxxxxxxx>  wrote:
> >>>>>>> Hi,  
> >>>>>>>> At the moment, Xen is only handling data abort with valid syndrome 
> >>>>>>>> (i.e.
> >>>>>>>> ISV=0). Unfortunately, this doesn't cover all the instructions a 
> >>>>>>>> domain
> >>>>>>>> could use to access MMIO regions.
> >>>>>>>>
> >>>>>>>> For instance, Xilinx baremetal OS will use:
> >>>>>>>>
> >>>>>>>>            volatile u32 *LocalAddr = (volatile u32 *)Addr;
> >>>>>>>>            *LocalAddr = Value;
> >>>>>>>>
> >>>>>>>> This leave the compiler to decide which store instructions to use.  
> >>>>>>> As mentioned in the other email, this is wrong, if this points to 
> >>>>>>> MMIO:
> >>>>>>> don't let the compiler do MMIO accesses. If a stage 2 fault isn't in
> >>>>>>> an MMIO area, you should not see traps that you cannot handle already.
> >>>>>>> So I don't think it's a good idea to use that as an example. And since
> >>>>>>> this patch only seems to address this use case, I would doubt its
> >>>>>>> usefulness in general.  
> >>>>>> Yes, I should have fixed the comment.
> >>>>>>
> >>>>>> Currently, I am testing with baremetal app which uses inline assembly 
> >>>>>> code with post indexing instructions, to access the MMIO.
> >>>>>>
> >>>>>> ATM, I am testing with 32 bit MMIO only.
> >>>>>>
> >>>>>> On the usefulness, I am kind of torn as it is legitimate for post 
> >>>>>> indexing instructions to be used in an inline-assembly code for 
> >>>>>> accessing MMIO. However, that may not be something commonly seen.
> >>>>>>
> >>>>>> @Stefano/Bertrand/Julien/Volodymyr :- As you are the Arm mantainers, 
> >>>>>> can you comment if we should have decoding logic or not ?  
> >>>>> Andre gave you the official statement from Arm and there is nothing 
> >>>>> more I can say.  
> >>>> I think this would be handy for other hypervisor and OS developper to
> >>>> know what they can expect when running in a virtualized environment. So
> >>>> would it be possible to update the Arm Arm reflecting this statement?  
> >>> I don't think it's within the scope of the ARM ARM to say that. It just
> >>> says that "there is no syndrome information", and your mileage may vary in
> >>> working around that.
> >>>
> >>> Personally I would say that if you expect your software to work nicely
> >>> under a hypervisor, then just avoid those instructions. The Linux kernel
> >>> certainly did so.
> >>>
> >>> You can try to do instruction emulation, but doing this right can get
> >>> tricky quickly: think about I$/D$ coherency, MMU on or off, etc.  
> >> I am trying to get all the restrictions that need to be checked. I have
> >> referred
> >> https://developer.arm.com/documentation/dui0802/a/A64-General-Instructions/Register-restrictions-for-A64-instructions?lang=en
> >>   
> >> and "Arm A64 Instruction Set Architecture - DDI 0596" - LDR (immediate).
> >>
> >> So far I only see the following restrictions:-
> >>
> >> Rn -ne Rt
> >>
> >> Rt -ne SP
> >>
> >> You had mentioned the following cases :-
> >>
> >> 1. XZR vs SP - I see that both these refer to register no 31. Xen gets
> >> the register number (for Rn/Rt) only, so I am not sure what is to be
> >> done here.  
> > But the emulation code in Xen needs to know whether number 31 refers to the
> > stack pointer or to the zero register. This depends on the context of the
> > instruction. It's nothing magical about it, you just need to figure out
> > which it is in your case and make sure that the code uses the right
> > value. In the LDR case n==31 means SP, but t==31 means XZR. The emulation
> > would need to consider this.  
> 
> I see what you mean. I had a look at 
> xen/tools/qemu-xen-dir-remote/target/arm/cpu.h, CPUARMState {}. I don't 
> see anywhere we need to refer register by its name. IIUC, the 
> instruction decoding logic needs to know the register number only.

But there is of course still a difference between them: they refer to two
separate registers, they just share the encoding of x31 in the instruction.
Each instruction then *knows* what register it needs to go to when it sees
number 31.
By its very nature, you don't need to save XZR, so GPR[31] is typically
the SP, if you have an array holding register values, and you
probably handle the XZR case in code.

> In fact, Xen would only read CPUARMState->xregs[31] for SP/XZR in 
> Aarch64 context. Let me know what I might be missing here.
> 
> Beata/Richard - I see you have committed most recently on the file. 
> Please feel free to add your inputs.
> 
> >  
> >> 2. MMU on or off - As I see in try_handle_mmio(), one gets the physical
> >> address in gpa. So I am not sure what is to be done here.  
> > You *might* be fine, if Xen takes care of that, I don't know.
> > But it needs to explicitly consider those two cases - and also make
> > sure that caching effects are cared for.
> >  
> >> 3. I/D coherency - I don't understand how this affects instruction 
> >> decoding.  
> > In the ARM architecture the I cache and D cache are not necessarily
> > coherent. So whatever the CPU reads via the data bus does not need to be
> > the same data that the instruction fetcher read a while ago. And while
> > there is a well-known sequence to make the current data side visible to
> > the instruction side, there is nothing architectural for the other way
> > around.
> > For example the CPU fetches a bunch of instructions into the I cache, then
> > consumes it from there. Now some code changes those instruction words in
> > "memory". Without explicit D-cache-clean-to-the-PoU and
> > I-cache-invalidate+ISB the CPU might still execute the old instructions.  
> 
> This makes sense. Referring 
> https://developer.arm.com/documentation/den0024/a/Caches/Cache-maintenance, 
> I assume that the following instructions need to be executed before the 
> instruction is decoded.
> 
> Xn <--- contains the address of the instruction to be decoded.
> 
> STR Wt, [Xn]
> DC CVAU, Xn // Clean datacache by VA to point of unification (PoU)
> DSB ISH // Ensure visibility of the datacleaned from cache
> IC IVAU, Xn // Invalidate instruction cache by VA to PoU
> DSB ISH // Ensure completion of the invalidations
> ISB // Synchronize the fetched instruction stream

That is what I sketched above: the typical sequence to make sure the
instruction fetcher reads the right instructions, after you updated them.
BUT this is not what we need here, we need to other way around: read the
instruction that the CPU executed, but via the data bus, from memory. And
I wouldn't know of a neat architectural way to ensure this.

> > But if your emulation code tries to read the instruction from memory, it
> > fetches the data-visible side of it - which is not what the CPU executed,
> > so end up emulating the wrong thing.  
> 
> I am guessing that this should not be a very unusual case. Isn't this 
> similar to a host processor loading and modifying a firmware in the 
> memory. The firmware can then be executed by a different 
> micro-controller or some programmable engine. I mean the data pipeline 
> or instruction pipeline should read the same contents from memory.

... if they are coherent. Otherwise you need explicit maintenance, see
above. I mean it's the same situation with an external processor: this
core reads instructions from memory and is probably buffering/caching them
(crucial for performance). Now you need to tell if that buffer might
contain stale data, and if the data in memory has changed meanwhile.

> > Please also keep in mind that the architecture does not *guarantee*
> > coherency, but in reality it might actually be coherent - either because
> > of the particular silicon implementation, or because of side effects in
> > the system (data cache line being evicted, ISB executed). So seeing the
> > effects of coherency in your testing does not mean you are safe, it might
> > just happen by chance.  
> 
> Is there some good way to test this to ensure the best possible 
> reliability ?

That's the problem, I think: You can reason on the basis of the
architecture as described in the ARM ARM, but this brings you to the
problems above. You can now try to hand wave away the problems, by stating
that this will never happen in practice, but you then leave the steady
ground the architecture gives you. That's why I was asking before if this
is really necessary code to have, or if you would just be better off
fixing the guests.

Cheers,
Andre

> >> Please help me to understand further.
> >>
> >> - Ayan
> >>  
> >>>     
> >>>>> I will leave this decision to Stefano and Julien.  
> >>>> I have had a chat on IRC with Stefano about this. I think the main
> >>>> sticking point is the Arm Arm doesn't clearly state those instructions
> >>>> should not be used by a virtualized OS on MMIO regions.  
> >>> I don't understand why the ARM ARM would need to say that. Certainly you
> >>> realise that immediately when trying to use them, and apparently it was 
> >>> not
> >>> a problem in the last 8ish years of Xen/ARM's existence.
> >>>
> >>> So it's your decision on having the emulation, I personally would only do
> >>> it when there is a *good* use case.
> >>> And please apply the demanded scrutiny on the review - including all the
> >>> corner cases like Rn=Rt, XZR vs. SP (as Jan said) and possibly MMU status.
> >>>
> >>> Cheers,
> >>> Andre
> >>>     
> >>>> To me, this topic looks similar to the set/way instruction dilemma. They
> >>>> are a pain to virtualize (and the Arm Arm clearly hint it) but we had to
> >>>> do it because some OSes relied on them.
> >>>>
> >>>> I think the main difference is the Arm Arm doesn't hint they should not
> >>>> be used (it only says a valid syndrome is not provided) and the
> >>>> implementation should hopefully be smaller and self-contained.
> >>>>
> >>>> So I would be inclined to allow Xen to decode post-indexing instructions
> >>>> (pending the review).
> >>>>
> >>>> Cheers,
> >>>>     




 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.