Re: [PATCH 22/22] x86/mm: zero stack on stack switch or reset
 On Mon, Jul 29, 2024 at 04:40:24PM +0100, Andrew Cooper wrote:
> On 26/07/2024 4:22 pm, Roger Pau Monne wrote:
> > With the stack mapped on a per-CPU basis there's no risk of other CPUs being
> > able to read the stack contents, but vCPUs running on the current pCPU could
> > read stack rubble from operations of previous vCPUs.
> >
> > The #DF stack is not zeroed because handling of #DF results in a panic.
> >
> > Signed-off-by: Roger Pau Monné <roger.pau@xxxxxxxxxx>
> > ---
> >  xen/arch/x86/include/asm/current.h | 30 +++++++++++++++++++++++++++++-
> >  1 file changed, 29 insertions(+), 1 deletion(-)
> >
> > diff --git a/xen/arch/x86/include/asm/current.h b/xen/arch/x86/include/asm/current.h
> > index 75b9a341f814..02b4118b03ef 100644
> > --- a/xen/arch/x86/include/asm/current.h
> > +++ b/xen/arch/x86/include/asm/current.h
> > @@ -177,6 +177,14 @@ unsigned long get_stack_dump_bottom (unsigned long sp);
> >  # define SHADOW_STACK_WORK ""
> >  #endif
> >  
> > +#define ZERO_STACK                                              \
> > +    "test %[stk_size], %[stk_size];"                            \
> > +    "jz .L_skip_zeroing.%=;"                                    \
> > +    "std;"                                                      \
> > +    "rep stosb;"                                                \
> > +    "cld;"                                                      \
> > +    ".L_skip_zeroing.%=:"
> > +
> >  #if __GNUC__ >= 9
> >  # define ssaj_has_attr_noreturn(fn) __builtin_has_attribute(fn, __noreturn__)
> >  #else
> > @@ -187,10 +195,24 @@ unsigned long get_stack_dump_bottom (unsigned long sp);
> >  #define switch_stack_and_jump(fn, instr, constr)                        \
> >      ({                                                                  \
> >          unsigned int tmp;                                               \
> > +        bool zero_stack = current->domain->arch.asi;                    \
> >          BUILD_BUG_ON(!ssaj_has_attr_noreturn(fn));                      \
> > +        ASSERT(IS_ALIGNED((unsigned long)guest_cpu_user_regs() -        \
> > +                          PRIMARY_STACK_SIZE +                          \
> > +                          sizeof(struct cpu_info), PAGE_SIZE));         \
> > +        if ( zero_stack )                                               \
> > +        {                                                               \
> > +            unsigned long stack_top = get_stack_bottom() &              \
> > +                                      ~(STACK_SIZE - 1);                \
> > +                                                                        \
> > +            clear_page((void *)stack_top + IST_MCE * PAGE_SIZE);        \
> > +            clear_page((void *)stack_top + IST_NMI * PAGE_SIZE);        \
> > +            clear_page((void *)stack_top + IST_DB  * PAGE_SIZE);        \
> > +        }                                                               \
> >          __asm__ __volatile__ (                                          \
> >              SHADOW_STACK_WORK                                           \
> >              "mov %[stk], %%rsp;"                                        \
> > +            ZERO_STACK                                                  \
> >              CHECK_FOR_LIVEPATCH_WORK                                    \
> >              instr "[fun]"                                               \
> >              : [val] "=&r" (tmp),                                        \
> > @@ -201,7 +223,13 @@ unsigned long get_stack_dump_bottom (unsigned long sp);
> >                ((PRIMARY_SHSTK_SLOT + 1) * PAGE_SIZE - 8),               \
> >                [stack_mask] "i" (STACK_SIZE - 1),                        \
> >                _ASM_BUGFRAME_INFO(BUGFRAME_bug, __LINE__,                \
> > -                                 __FILE__, NULL)                        \
> > +                                 __FILE__, NULL),                       \
> > +              /* For stack zeroing. */                                  \
> > +              "D" ((void *)guest_cpu_user_regs() - 1),                  \
> > +              [stk_size] "c"                                            \
> > +              (zero_stack ? PRIMARY_STACK_SIZE - sizeof(struct cpu_info)\
> > +                          : 0),                                         \
> > +              "a" (0)                                                   \
> >              : "memory" );                                               \
> >          unreachable();                                                  \
> >      })
> 
> This looks very expensive.
> 
> For starters, switch_stack_and_jump() is used twice in a typical context
> switch; once in the schedule tail, and again out of hvm_do_resume().
Right, it's the reset_stack_and_call_ind() at the end of context
switch and then the reset_stack_and_jump() in the HVM tail context
switch handlers.

One option would be to only do the stack zeroing from the
reset_stack_and_call_ind() call in context_switch().

I've got no idea how expensive this is; I might try to run some
benchmarks to get some figures.  I was planning on running two VMs
with 1 vCPU each, both pinned to the same pCPU.
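The pinning setup described above could be done with the xl toolstack roughly as follows (domain names and the chosen pCPU are illustrative):

```
# Pin vCPU 0 of each of two 1-vCPU guests to pCPU 2, so every context
# switch on that pCPU crosses a guest boundary and triggers the zeroing.
xl vcpu-pin vm1 0 2
xl vcpu-pin vm2 0 2
```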
> 
> Furthermore, #MC happen never (to many many significant figures), #DB
> happens never for HVM guests (but does happen for PV), and NMIs are
> either ~never, or 2Hz which is far less often than the 30ms default
> timeslice.
> 
> So, the overwhelming majority of the time, those 3 calls to clear_page()
> will be re-zeroing blocks of zeroes.
> 
> This can probably be avoided by making use of ist_exit (held in %r12) to
> only zero an IST stack when leaving it.  This leaves the IRET frame able
> to be recovered, but with e.g. RFDS, you can do that irrespective, and
> it's not terribly sensitive.
I could look into that.  TBH I was borderline about clearing the IST
stacks, as I wasn't convinced there could be anything sensitive there,
but equally I couldn't convince myself that there's nothing sensitive
now, or that there never will be in the future.
> What about shadow stacks?  You're not zeroing those, and while they're
> less sensitive than the data stack, there ought to be some reasoning
> about them.
I've assumed that shadow stacks only contain the expected return
addresses, and hence wouldn't be considered sensitive information, but
maybe I was too lax.

An attacker could get execution traces of the previous vCPU, and that
might be useful for some exploits?
Thanks, Roger.