
Re: [Xen-devel] [RFC] xen/arm: Handling cache maintenance instructions by set/way



On 12/11/2017 11:10 AM, Andre Przywara wrote:
> Hi,
> 
> On 08/12/17 10:56, George Dunlap wrote:
>> On 12/07/2017 07:21 PM, Marc Zyngier wrote:
>>> On 07/12/17 18:06, George Dunlap wrote:
>>>> On 12/07/2017 04:58 PM, Marc Zyngier wrote:
>>>>> On 07/12/17 16:44, George Dunlap wrote:
>>>>>> On 12/07/2017 04:04 PM, Julien Grall wrote:
>>>>>>> Hi Jan,
>>>>>>>
>>>>>>> On 07/12/17 15:45, Jan Beulich wrote:
>>>>>>>>>>> On 07.12.17 at 15:53, <marc.zyngier@xxxxxxx> wrote:
>>>>>>>>> On 07/12/17 13:52, Julien Grall wrote:
>>>>>>>>> There is exactly one case where set/way makes sense, and that's when
>>>>>>>>> you're the only CPU left in the system, your MMU is off, and you're
>>>>>>>>> about to go down.
>>>>>>>>
>>>>>>>> With this and ...
>>>>>>>>
>>>>>>>>> On top of bypassing the coherency, S/W CMOs do not prevent lines from
>>>>>>>>> migrating from one CPU to another. So you could happily be flushing by
>>>>>>>>> S/W, and still end up with dirty lines in your cache. Success!
>>>>>>>>
>>>>>>>> ... this I wonder what value emulating those insns then has in the
>>>>>>>> first place. Can't you as well simply skip and ignore them, with the same
>>>>>>>> (bad) result?
>>>>>>>
>>>>>>> The result will be much, much worse. Here is a concrete example with
>>>>>>> 32-bit Linux on Arm:
>>>>>>>
>>>>>>>     1) Cache enabled
>>>>>>>     2) Decompress
>>>>>>>     3) Nuke cache (S/W)
>>>>>>>     4) Cache off
>>>>>>>     5) Access new kernel
>>>>>>>
>>>>>>> If you skip #3, the decompressed data may not have reached memory, so
>>>>>>> you would access stale data.
>>>>>>>
>>>>>>> This would effectively mean we don't support Linux Arm 32-bit.
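(For reference, the "nuke cache (S/W)" step boils down to a loop roughly
like the sketch below -- a minimal version for a single cache level with an
assumed fixed geometry; the real decompressor walks CLIDR/CCSIDR to discover
the number of sets and ways for each level:)

    /* Clean & invalidate one data cache level by set/way (ARMv7, privileged).
     * NSETS/NWAYS/LINE_SHIFT are assumed constants purely for illustration;
     * real code derives them from CCSIDR. */
    #define NSETS       256
    #define NWAYS       4
    #define LINE_SHIFT  6                  /* log2(64-byte cache line) */
    #define WAY_SHIFT   (32 - 2)           /* 32 - log2(NWAYS) */

    static void dcache_clean_inv_level(unsigned int level)
    {
        unsigned int set, way;

        for (way = 0; way < NWAYS; way++)
            for (set = 0; set < NSETS; set++) {
                unsigned int sw = (way << WAY_SHIFT) |
                                  (set << LINE_SHIFT) |
                                  (level << 1);
                /* DCCISW: clean & invalidate data cache line by set/way */
                asm volatile("mcr p15, 0, %0, c7, c14, 2" : : "r" (sw));
            }
        asm volatile("dsb" : : : "memory");
    }
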
>>>>>>
>>>>>> So Marc said that #3 "doesn't make sense", since although it might be
>>>>>> the only cpu on in the system, you're not "about to go down"; but Linux
>>>>>> 32-bit is doing that anyway.
>>>>>
>>>>> "Doesn't make sense" on an ARMv7+ with SMP. That code dates back to
>>>>> ARMv4, and has been left untouched ever since. "If it ain't broke..."
>>>>>
>>>>>> It sounds like from the slides the purpose of #3 might be to get stuff
>>>>>> out of the D-cache into the I-cache.  But why is the cache turned off?
>>>>>
>>>>> Linux mandates that the kernel is entered with the MMU off, which has
>>>>> the effect of disabling the caches too (VIVT caches and all that jazz).
>>>>>
>>>>>> And why doesn't Linux use the VA-based flushes rather than the S/W 
>>>>>> flushes?
>>>>>
>>>>> Linux/arm64 does. Changing the 32bit port to use VA CMOs would probably
>>>>> break stuff from the late 90s, so that's not going to happen. These
>>>>> days, I tend to pick my battles... ;-)
>>>>
>>>> OK, so let me try to state this "forwards" for those of us not familiar
>>>> with the situation:
>>>>
>>>> 1. Linux expects to start in 'linear' mode, with the MMU disabled.
>>>>
>>>> 2. On ARM, disabling the MMU disables caching (!).  But disabling
>>>> caching doesn't flush the cache; it just means the cache is bypassed (!).
>>>>
>>>> 3. Which means for Linux on ARM, after unzipping the kernel image, you
>>>> need to flush the cache before disabling the MMU and starting Linux proper
>>>>
>>>> 4. For historical reasons, 32-bit ARM Linux uses the S/W instructions to
>>>> flush the cache.  This still works on 32-bit hardware, and so the Linux
>>>> maintainers are loath to change it, even though more reliable VA-based
>>>> instructions are available (?).
>>>
>>> It also works on 64bit HW. It is just not easily virtualizable, which is
>>> why we've removed all S/W from the 64bit Linux port a while ago.
>>
>> From the diagram in your talk, it looked like the "flush the cache"
>> operation *doesn't* work anywhere that has a "system cache", even on
>> bare metal.
> 
> What Marc probably meant is that they still work *within the
> architectural limits* that s/w operations provide:
> - S/W CMOs are not broadcast, so in a live SMP system they are
> probably not doing what you expect them to do. This isn't an issue for the
> 32-bit Linux kernel decompressor, because the system is still UP at that point.
> - S/W CMOs are optional to implement for system caches. As Marc
> mentioned, there are not many 32-bit systems with a system cache out
> there.

Right, that's what I said -- on any 32-bit system with a system cache
that doesn't implement the S/W operations, using S/W to flush the cache
won't work, even on bare metal.

> And on those systems you can still boot an uncompressed kernel or
> use a gzip-ed kernel and let the bootloader (grub, U-Boot) decompress it.
> On the other hand there seem to be a substantial number of (older)
> 32-bit systems where VA CMOs have issues.

OK, good to know.

> The problem now is that for the "32-bit kernel on a 64-bit hypervisor"
> case those two assumptions are not true: the system already has multiple
> CPUs running, and 64-bit hardware is much more likely to have system
> caches.
> So this is mostly a virtualization problem and thus should be solved here.

Right.

> To help assess the benefits of adding PoD to Xen:

Can we come up with a different terminology for this functionality than
'PoD'?  On x86 populate-on-demand is quite different in functionality
and in target goal than what Julien is describing.

The goal of PoD on x86 is being able to boot a guest that actually uses
(say) 1GiB of RAM, but allow it to balloon up later to use 2GiB of RAM,
in circumstances where memory hotplug is not
available.  This means telling a guest it has 2GiB of RAM, but only
allocating 1GiB of host RAM for it, and shuffling memory around
behind-the-scenes until the balloon driver can come up and "free" 1GiB
of empty space back to Xen.

On x86 in PoD, the p2m table is initialized with entries which are
'empty' from the hardware point of view (no mfn).  Memory is allocated
to a per-domain "PoD pool" on domain creation, then assigned to the p2m
as it's used.  If the memory remains zero, then it may be reclaimed
under certain circumstances and moved somewhere else.  Once the memory
becomes non-zero, it must never be moved.  If a guest ever "dirties" all
of its initial allocation (i.e., makes it non-zero), then Xen will crash
it rather than allocate more memory.
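(A hand-waved sketch of that demand-populate path, in pseudo-C rather than
the actual Xen code -- pod_pool_get() and p2m_set_entry_mfn() are made-up
names, and the real logic also has to deal with superpages, the zero-page
reclaimer and locking:)

    /* Simplified populate-on-demand fault handling -- illustrative only. */
    static int pod_demand_populate(struct domain *d, gfn_t gfn)
    {
        struct page_info *pg = pod_pool_get(d);    /* hypothetical helper */

        if ( !pg )
        {
            /*
             * The guest has dirtied more memory than its initial
             * allocation and nothing could be reclaimed: crash it rather
             * than silently allocating more host memory.
             */
            domain_crash(d);
            return -ENOMEM;
        }

        clear_domain_page(page_to_mfn(pg));        /* guest sees zeroed RAM */
        p2m_set_entry_mfn(d, gfn, page_to_mfn(pg), p2m_ram_rw); /* hypothetical */
        return 0;
    }
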

What Julien is describing is different.  For one thing, for many dom0s,
it's not appropriate to put memory in arbitrary places; you need a 1:1
mapping, so the "populate with random memory from a pool" isn't
appropriate.  For another, Julien will (I think?) want a way to detect
reads and writes to memory pages which have non-zero data.  This is not
something that the current PoD code has anything to do with.

It also seems like in the future, ARM may want something like the x86
PoD (i.e., the ability to boot a guest with 1GiB of RAM and then balloon
it up to 2GiB).  So keeping the 'PoD' name reserved for that
functionality makes more sense.

In fact, this sounds an awful lot like 'logdirty', except that you
apparently want to log read accesses in addition to write accesses (to
determine what might be in the cache).  Maybe 'logaccess' mode?
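(Roughly what I have in mind, as an illustrative sketch with hypothetical
helper names: while the guest runs with its caches disabled, leave the p2m
entries with neither read nor write permission, take a fault on the first
access to each page, clean/invalidate that page by VA, and then restore the
entry so later accesses run at full speed:)

    /* "logaccess" idea, roughly -- helper names are hypothetical. */
    static int logaccess_fault(struct domain *d, gfn_t gfn)
    {
        void *va = map_guest_page(d, gfn);         /* hypothetical mapping helper */

        if ( !va )
            return -EFAULT;

        /* This page's contents may be sitting in the cache: flush by VA. */
        clean_and_invalidate_dcache_va_range(va, PAGE_SIZE);
        unmap_guest_page(va);                      /* hypothetical */

        /* Stop trapping accesses to this page. */
        p2m_set_access(d, gfn, p2m_access_rwx);    /* hypothetical */
        return 0;
    }
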

> But on the other hand we had PoD naturally already in KVM, so this came
> at no cost.

As I've said in another thread, it's not accurate to say that KVM uses
PoD.  In PoD, the memory is pre-allocated to the domain before the guest
starts; I assume on KVM the memory isn't allocated until it's used (like
a normal process).  In PoD, if the total amount of non-zero memory in the
guest exceeds the pre-allocated amount, then Xen will crash the guest.  In KVM, I
assume that there is no implicit limit: if it doesn't have free host ram
when the allocation happens, then it evicts something from a buffer or
swaps some process / VM memory out to disk.

Hope I'm not being too pedantic here, but "the devil is in the details",
so I think it's important when comparing KVM and Xen's solutions to be
aware of the differences. :-)

In any case, if Julien wants to emulate the S/W instructions, it seems
like having 'logaccess' functionality in Xen is probably the only
reasonable way to accomplish that (as 'full VA flush' will quickly
become unworkable as the guest size grows).
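(To put a rough number on that: with a typical 64-byte cache line, cleaning
a 2GiB guest by VA means on the order of 2GiB / 64B ~= 33 million individual
cache maintenance operations -- and the 32-bit decompressor issues a whole
loop of S/W instructions, each of which would trigger that work if emulated
naively.)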

> So I believe it would be worth investigating what the actual impact on
> booting a 32-bit kernel is, when emulating s/w ops like KVM does (see
> below) but cleaning the *whole VA space*. If this is somewhat
> acceptable (I assume we have no more than 2GB for a typical ARM32
> guest), it might be worth ignoring PoD, at least for now, and solving
> this problem (and the IOMMU consequences).
> 
> This assumes that a single "full VA flush" cannot be abused as a DOS by
> a malicious guest, which should be investigated independently (as this
> applies to a PoD implementation as well).

Well the flush itself would need to be preemptible.  And it sounds like
you'd need to handle migrating specially somehow too.  For one you'd
need to make sure at least that all the cache on the current pcpu was
"cleaned" before running a vcpu anywhere else; and you'd also need to
make sure that any pcpu on which the vcpu had ever run had its entries
"invalidated" before the vcpu was run there again.

> Somewhat optional reading, for background on how KVM optimized this ([1]):
> 
> KVM's solution to this problem works under the assumption that s/w
> operations with the caches (and MMU) on are not really meaningful, so we
> don't bother emulating them to the letter.

Right -- so even on KVM, you're not actually following the ARM spec wrt
the S/W instructions: you're only handling the case that's fairly common
(i.e., flushing the cache with the MMU off).
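(Based purely on the description above, the shape of such an emulation is
something like the sketch below -- not KVM's actual code, and all names are
hypothetical: trap each S/W op, do one clean+invalidate of the whole guest
RAM by VA instead of emulating the op itself, and remember that it has been
done until the guest next toggles its MMU/caches:)

    /* Illustrative only: approximate S/W emulation by flushing everything. */
    static void handle_sw_cmo_trap(struct vcpu *v)
    {
        if ( !v->domain->arch.sw_flush_done )
        {
            flush_guest_ram_by_va(v->domain);  /* hypothetical: clean+inv all guest RAM */
            v->domain->arch.sw_flush_done = true;
        }
        /* Further S/W ops in the guest's flush loop are then no-ops. */
    }

    static void handle_cache_toggle_trap(struct vcpu *v)
    {
        /* When the guest toggles its MMU/caches, the relationship between
         * cache and memory changes again, so forget the earlier flush. */
        v->domain->arch.sw_flush_done = false;
    }
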

Thanks,
 -George


 

