
Re: [Xen-devel] [PATCH RFC v2 00/12] xen/x86: use per-vcpu stacks for 64 bit pv domains



On 01/22/2018 07:02 PM, Andrew Cooper wrote:
> On 22/01/18 18:48, George Dunlap wrote:
>> On 01/22/2018 06:39 PM, Andrew Cooper wrote:
>>> On 22/01/18 16:51, Jan Beulich wrote:
>>>>>>> On 22.01.18 at 16:00, <jgross@xxxxxxxx> wrote:
>>>>> On 22/01/18 15:48, Jan Beulich wrote:
>>>>>>>>> On 22.01.18 at 15:38, <jgross@xxxxxxxx> wrote:
>>>>>>> On 22/01/18 15:22, Jan Beulich wrote:
>>>>>>>>>>> On 22.01.18 at 15:18, <jgross@xxxxxxxx> wrote:
>>>>>>>>> On 22/01/18 13:50, Jan Beulich wrote:
>>>>>>>>>>>>> On 22.01.18 at 13:32, <jgross@xxxxxxxx> wrote:
>>>>>>>>>>> As a preparation for doing page table isolation in the Xen
>>>>>>>>>>> hypervisor in order to mitigate "Meltdown", use dedicated stacks,
>>>>>>>>>>> GDT and TSS for 64 bit PV domains, mapped to the per-domain
>>>>>>>>>>> virtual area.
>>>>>>>>>>>
>>>>>>>>>>> The per-vcpu stacks are used for early interrupt handling only.
>>>>>>>>>>> After saving the domain's registers, stacks are switched back to
>>>>>>>>>>> the normal per physical cpu ones in order to be able to address
>>>>>>>>>>> on-stack data from other cpus, e.g. while handling IPIs.
>>>>>>>>>>>
>>>>>>>>>>> Adding %cr3 switching between saving of the registers and
>>>>>>>>>>> switching the stacks will make it possible to run guest code
>>>>>>>>>>> without any per physical cpu mapping, i.e. avoiding the threat
>>>>>>>>>>> of a guest being able to access other domains' data.
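
For illustration only, a rough C-level sketch of the flow described above;
every name below is made up, and the real code would be assembly in entry.S:

    /* Hypothetical sketch, not the actual Xen entry code.               */
    struct cpu_user_regs;                           /* saved guest frame  */

    void write_cr3(unsigned long cr3);              /* stand-ins for the  */
    unsigned long xen_full_cr3(void);               /* real low-level     */
    void switch_to_percpu_stack(void);              /* primitives         */
    void handle_exception_or_irq(struct cpu_user_regs *regs);

    /* Entered on the small per-vcpu stack, with only the minimal mappings
     * (entry stack, GDT/TSS, IDT, entry code) in place; the guest's
     * registers have already been pushed onto that stack.                */
    void pv_entry(struct cpu_user_regs *guest_regs)
    {
        /* Switch %cr3 to the full Xen page tables; this is the step the
         * cover letter says can be added between saving the registers
         * and switching stacks.                                          */
        write_cr3(xen_full_cr3());

        /* Move to the normal per-physical-cpu stack so on-stack data of
         * other cpus (e.g. for IPI handling) is addressable again.       */
        switch_to_percpu_stack();

        /* From here on, normal event handling.                           */
        handle_exception_or_irq(guest_regs);
    }
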
>>>>>>>>>>>
>>>>>>>>>>> Without any further measures it will still be possible for e.g. a
>>>>>>>>>>> guest's user program to read stack data of another vcpu of the same
>>>>>>>>>>> domain, but this can be easily avoided by a little PV-ABI
>>>>>>>>>>> modification introducing per-cpu user address spaces.
>>>>>>>>>>>
>>>>>>>>>>> This series is meant as a replacement for Andrew's patch series:
>>>>>>>>>>> "x86: Prerequisite work for a Xen KAISER solution".
>>>>>>>>>> Considering in particular the two reverts, what I'm missing here
>>>>>>>>>> is a clear description of the meaningful additional protection this
>>>>>>>>>> approach provides over the band-aid. For context see also
>>>>>>>>>> https://lists.xenproject.org/archives/html/xen-devel/2018-01/msg01735.html
>>>>>>>>>>  
>>>>>>>>> My approach supports mapping only the following data while the
>>>>>>>>> guest is running (apart from the guest's own data, of course):
>>>>>>>>>
>>>>>>>>> - the per-vcpu entry stacks of the domain which will contain only the
>>>>>>>>>   guest's registers saved when an interrupt occurs
>>>>>>>>> - the per-vcpu GDTs and TSSs of the domain
>>>>>>>>> - the IDT
>>>>>>>>> - the interrupt handler code (arch/x86/x86_64/[compat/]entry.S)
>>>>>>>>>
>>>>>>>>> All other hypervisor data and code can be completely hidden from the
>>>>>>>>> guests.
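
Put differently, the per-vcpu footprint that stays mapped would amount to
roughly the following (a sketch; sizes and names are purely illustrative):

    #define PAGE_SIZE   4096
    #define GDT_BYTES   PAGE_SIZE                   /* illustrative size  */

    typedef unsigned long long seg_desc_t;          /* 8-byte descriptor  */
    struct tss64 { unsigned int raw[26]; };         /* 104-byte 64-bit TSS */

    /* Everything a 64 bit PV vcpu can still see while it runs, mapped in
     * the per-domain virtual area; all other Xen data and code would be
     * unmapped.  The IDT and the entry code from
     * arch/x86/x86_64/[compat/]entry.S are shared by all vcpus.          */
    struct pv_guest_visible {
        char         entry_stack[PAGE_SIZE];        /* saved guest regs only */
        seg_desc_t   gdt[GDT_BYTES / sizeof(seg_desc_t)];
        struct tss64 tss;
    };
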
>>>>>>>> I understand that. What I'm not clear about is: Which parts of
>>>>>>>> the additionally hidden data are actually necessary (or at least
>>>>>>>> very desirable) to hide?
>>>>>>> Necessary:
>>>>>>> - other guests' memory (e.g. physical memory 1:1 mapping)
>>>>>>> - data from other guests, e.g. in stack pages, debug buffers,
>>>>>>>   I/O buffers, code emulator buffers
>>>>>>> - other guests' register values e.g. in vcpu structure
>>>>>> All of this is already being made invisible by the band-aid (with the
>>>>>> exception of leftovers on the hypervisor stacks across context
>>>>>> switches, which we've already said could be taken care of by
>>>>>> memset()ing that area). I'm asking about the _additional_ benefits
>>>>>> of your approach.
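
For reference, the memset() mitigation mentioned here amounts to something
like the following sketch (names are illustrative):

    #include <string.h>

    /* Stacks grow down, so everything below the current stack pointer is
     * dead data, possibly left over from the previously scheduled
     * context; scrub it before switching to the next vcpu.               */
    void scrub_hypervisor_stack(char *stack_bottom, const char *current_sp)
    {
        memset(stack_bottom, 0, (size_t)(current_sp - stack_bottom));
    }
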
>>>>> I'm quite sure the performance will be much better as it doesn't require
>>>>> per physical cpu L4 page tables, but just a shadow L4 table for each
>>>>> guest L4 table, similar to the Linux kernel KPTI approach.
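
The shadow L4 idea, sketched in C with made-up names and a deliberately
simplified slot split (the real hypervisor slot layout differs):

    #define L4_ENTRIES      512
    #define XEN_SLOT_FIRST  256     /* simplified: treat the upper half   */
                                    /* as the hypervisor-owned slots      */
    typedef unsigned long l4_pgentry_t;

    /* Rebuild a guest L4's shadow: guest-owned slots are copied verbatim,
     * hypervisor slots get only the minimal mappings (entry stacks,
     * GDT/TSS, IDT, entry code) instead of the full Xen mappings.  The
     * hard part, as discussed below, is keeping this in sync with guest
     * L4 updates.                                                        */
    void sync_shadow_l4(l4_pgentry_t *shadow,
                        const l4_pgentry_t *guest_l4,
                        const l4_pgentry_t *minimal_xen_slots)
    {
        unsigned int i;

        for ( i = 0; i < L4_ENTRIES; i++ )
            shadow[i] = (i >= XEN_SLOT_FIRST)
                        ? minimal_xen_slots[i - XEN_SLOT_FIRST]
                        : guest_l4[i];
    }
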
>>>> But doesn't that model have the same synchronization issues upon
>>>> guest L4 updates which Andrew was fighting with?
>>> (Condensing a lot of threads down into one)
>>>
>>> All the methods have L4 synchronisation update issues, until we have a
>>> PV ABI which guarantees that L4's don't get reused.  Any improvements to
>>> the shadowing/synchronisation algorithm will benefit all approaches.
>>>
>>> Juergen: you're now adding an LTR into the context switch path, which
>>> tends to be very slow.  I.e. as currently presented, this series
>>> necessarily has a higher runtime overhead than Jan's XPTI.
>>>
>>> One of my concerns is that this patch series moves further away from the
>>> secondary goal of my KAISER series, which was to have the IDT and GDT
>>> mapped at the same linear addresses on every CPU so a) SIDT/SGDT don't
>>> leak which CPU you're currently scheduled on into PV guests and b) the
>>> context switch code can drop a load of its slow instructions like LGDT
>>> and the VMWRITEs to update the VMCS.
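
For illustration, the saving being described might look roughly like this
(made-up helper names; only a sketch of the idea, not the actual context
switch code):

    struct desc_ptr { unsigned short limit; unsigned long base; };

    void lgdt(const struct desc_ptr *gdtr);          /* stand-ins for the */
    void vmwrite_host_gdtr_base(unsigned long base); /* real primitives   */

    /* Today: the GDT sits at a per-cpu linear address, so a context
     * switch may have to reload GDTR (slow) and refresh the copy of the
     * host state kept in the VMCS.                                       */
    void ctxt_switch_gdt_today(unsigned long gdt_va)
    {
        struct desc_ptr gdtr = { 0xffff, gdt_va };
        lgdt(&gdtr);
        vmwrite_host_gdtr_base(gdt_va);
    }

    /* With every cpu's GDT aliased at one fixed linear address, the base
     * never changes: neither reload is needed, and SGDT returns the same
     * value on every cpu, so it no longer tells a PV guest where it is
     * scheduled.                                                         */
    void ctxt_switch_gdt_fixed_alias(void)
    {
        /* nothing to do */
    }
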
>>>
>>> Jan: As to the things not covered by the current XPTI, hiding most of
>>> the .text section is important to prevent fingerprinting or ROP
>>> scanning.  This is a defence-in-depth argument, but a guest being easily
>>> able to identify whether certain XSAs are fixed or not is quite bad. 
>> I'm afraid we have a fairly different opinion of what is "quite bad".
> 
> I suggest you try talking to some real users then.
> 
>> Suppose we handed users a knob and said, "If you flip this switch,
>> attackers won't be able to tell if you've fixed XSAs or not without
>> trying them; but it will slow down your guests 20%."  How many do you
>> think would flip it, and how many would reckon that an attacker could
>> probably find out that information anyway?
> 
> Nonsense.  The performance hit is already taken. 

You just said:

"Juergen: you're now adding a LTR into the context switch path which
tends to be very slow.  I.e. As currently presented, this series
necessarily has a higher runtime overhead than Jan's XPTI."

And:

"As to the things not covered by the current XPTI, hiding most of
the .text section is important..."

You've previously said that the overhead for your KAISER series was much
higher than Jan's "bandaid" XPTI series, and implied that Juergen's
approach would suffer the same fate.

This led me to infer:

1. The .text segment is not hidden in XPTI, but would be under your and
Juergen's approaches

2. The cost of hiding the .text segment, over and above XPTI stage 1,
according to our current best efforts, is significant (making up 20% as
a reasonable strawman).

In which case the performance hit is most certainly *not* already taken.

> The argument is "do
> you want an attacker able to trivially evaluate security weaknesses in
> your hypervisor", a process which usually has to be done by guesswork
> and knowing the exact binary under attack.  Having .text fully readable
> lowers the barrier to entry substantially.

And I can certainly see that some users would want to protect against
that.  But faced with an even higher performance hit, a significant
number of users would probably pass.

 -George
