Re: [Xen-ia64-devel] EFI Mapping Windows Install Crash Bug

On Tue, Jul 01, 2008 at 09:19:24PM +0900, Isaku Yamahata wrote:
> On Tue, Jul 01, 2008 at 09:20:27PM +1000, Simon Horman wrote:
> > On Tue, Jul 01, 2008 at 08:04:16PM +0900, Isaku Yamahata wrote:
> > > On Tue, Jul 01, 2008 at 05:34:42PM +1000, Simon Horman wrote:
> > > > On Tue, Jul 01, 2008 at 04:07:53PM +0900, Isaku Yamahata wrote:
> > > > > On Tue, Jul 01, 2008 at 11:03:28AM +1000, Simon Horman wrote:
> > > > > > Hi,
> > > > > > 
> > > > > > I'm a bit hesitant to jump the gun, but I think that I might have
> > > > > > isolated the cause of win2k3-sp2 crashing during install when my EFI
> > > > > > > Mapping patches are applied. Well, perhaps not the cause, but I
> > > > > > > think I know where it is dying.
> > > > > > 
> > > > > > >     Quickly as background, the EFI Mapping patches move the mapping
> > > > > > >     that EFI is taught at boot time to map memory where Linux places
> > > > > > >     it ( basically pa + (0xe << 60) ) instead of where Xen usually
> > > > > > >     places it ( basically pa + (0xf << 60) ). In order to protect this
> > > > > >     mapping from HVM domains a special region id is used. The
> > > > > >     hypervisor switches to that region id just before making any
> > > > > >     PAL, SAL or EFI calls, and switches back to the previous region
> > > > > >     id once the call completes.  As region 7 has to be changed,
> > > > > >     entries that are pinned into the TLB have to be repinned. And
> > > > > >     that is roughly where the fun begins.
> > > > > > 
> > > > > > > As for the problem? It seems to be caused by
> > > > > > > ia64_mca_cpe_int_caller() calling ia64_log_queue() which calls
> > > > > > > ia64_sal_get_state_info(). I believe that the hypervisor dies in
> > > > > > > ia64_log_queue() somewhere after ia64_sal_get_state_info()
> > > > > > > completes. That is, I am suspecting that the call to
> > > > > > > ia64_sal_get_state_info() is returning bogus data.
> > > > > 
> > > > > Is ia64_mca_cpe_int_caller() called in interrupt context?
> > > > > If so, ia64_log_queue() calls xmalloc() which can't be called
> > > > > > from interrupt context. Then the Xen VMM crashes at ASSERT(!in_irq())
> > > > > in _xmalloc().
> > > > 
> > > > > That is a good point. Although xmalloc() is only called if
> > > > > ia64_sal_get_state_info() returns a non-zero value, which,
> > > > > according to tracing that I have done this afternoon, does
> > > > > not seem to be the case (when ia64_log_queue() is called
> > > > > from other places in mca.c).
> > > > 
> > > > How can I check if the call is being made in interrupt context?
> > > 
> > > in_irq()?
> > > > Anyway I noticed ia64_mca_cpe_int_caller() is an irq handler, so it is
> > > > always called from interrupt context. So ia64_log_queue() has to be
> > > > fixed in case ia64_sal_get_state_info() returns > 0.
> > 
> > I'm actually not sure that code path ever gets exercised,
> > because as you say, if it did then the ASSERT(!in_irq()) in
> > > _xmalloc() would be triggered.
> > 
> > This seems to imply that ia64_sal_get_state_info() always returns 0
> > > if called from an interrupt context - my debugging backs this up.
> 
> > I suppose fault injection or something like that might be needed to
> test the execution path.

I have done some more investigations and it does really
seem that calling ia64_sal_get_state_info() via ia64_log_queue()
in ia64_mca_cpe_int_caller() causes the hypervisor to lock
up when my EFI RR patches are applied.

As you point out, if xmalloc() was ever called by ia64_log_queue()
in this context then a BUG would be triggered. As we are not
seeing that in the wild, then that case must not occur (or occur
so rarely that no one has seen and reported it yet). This means
that ia64_sal_get_state_info() must be returning zero.

If I understand correctly, ia64_log_queue() does more or less nothing
if ia64_sal_get_state_info() returns zero. In other words, if
ia64_sal_get_state_info() returns zero, then for one reason or another
there is no information available at that time. We know that because,
if information were available, xmalloc() would be called and a BUG
would be triggered.
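
To make the reasoning above concrete, here is a simplified model of the
control flow in C. It is only a sketch and not the Xen source: the
function model_log_queue() and the enum are hypothetical names,
state_info_size stands in for the value returned by
ia64_sal_get_state_info(), and in_irq_context stands in for in_irq().

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified model, NOT the Xen source: all names here are hypothetical. */
enum log_result {
    LOG_NOTHING_PENDING,  /* SAL reported zero bytes: nothing to queue  */
    LOG_QUEUED,           /* record would be copied; allocation allowed */
    LOG_WOULD_ASSERT      /* allocation attempted in interrupt context  */
};

/*
 * state_info_size models the return value of ia64_sal_get_state_info();
 * in_irq_context models in_irq().
 */
static enum log_result model_log_queue(size_t state_info_size,
                                       bool in_irq_context)
{
    if (state_info_size == 0)
        return LOG_NOTHING_PENDING;  /* early exit, allocator never reached */

    if (in_irq_context)
        return LOG_WOULD_ASSERT;     /* models ASSERT(!in_irq()) in _xmalloc() */

    return LOG_QUEUED;
}
```

In this model the only combination that reaches the assertion is a
non-zero size in interrupt context, which is the case that has not been
observed in practice.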


Given that without the EFI RR patches the call to ia64_log_queue()
in ia64_mca_cpe_int_caller() seems to do nothing, and with the patches
a crash occurs, I wonder if the best way forward is to simply
remove the call.

The section on SAL_GET_STATE (==ia64_sal_get_state_info()) in the System
Abstraction Layer Specification (Dec 2003) does state "In response to
the MCA, Processor CMC, or Corrected Platform event, the operating
system must call the procedure to obtain all the pending processor and
platform error information that triggered the event."

Does that apply to situations when ia64_mca_cpe_int_caller() is called?
And if so, can calling ia64_log_queue() be deferred?
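
As a rough illustration of what deferring could look like, the sketch
below has the handler only record that work is pending, and the logging
happens later outside interrupt context. Everything here is a
hypothetical model: the names are made up, and it does not use Xen's
actual softirq or tasklet interfaces.

```c
#include <stdbool.h>

/* Hypothetical names throughout; this is not Xen's softirq/tasklet API. */
static volatile bool cpe_log_pending;  /* set in irq context, cleared later */
static int cpe_logs_processed;         /* counts deferred runs (illustration) */

/* Stand-in for the irq handler: note that work is pending and return. */
static void model_cpe_irq_handler(void)
{
    cpe_log_pending = true;
}

/*
 * Stand-in for code that runs later, outside interrupt context.
 * Returns true if it performed the deferred work.
 */
static bool model_cpe_deferred_work(void)
{
    if (!cpe_log_pending)
        return false;

    cpe_log_pending = false;
    cpe_logs_processed++;  /* real code would call ia64_log_queue() here */
    return true;
}
```

Whether such a delay is compatible with the SAL requirement quoted
above is exactly the open question.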

> > As for the EFI RID related problem that I am seeing. I am getting some
> > good results by translating the log_buffer argument to
> > ia64_sal_get_state_info() to an EFI virtual address (basically 0xe...
> > instead of 0xf...). I am sure that I tried this before and it failed.
> > But this time it seems to be working, so perhaps it is a combination of
> > this change and other changes.
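
For reference, the translation described above amounts to swapping the
top nibble of the address. A minimal sketch in C, assuming the two
mappings are plain offsets of (0xf << 60) and (0xe << 60) from the
physical address as in the background description; the macro and
function names are hypothetical and are not taken from the patches:

```c
#include <stdint.h>

/* Assumed bases, following "pa + (0xf << 60)" and "pa + (0xe << 60)". */
#define MODEL_XEN_VA_BASE 0xf000000000000000ULL
#define MODEL_EFI_VA_BASE 0xe000000000000000ULL

/* Physical address to the mapping Xen usually uses. */
static inline uint64_t model_pa_to_xen_va(uint64_t pa)
{
    return pa + MODEL_XEN_VA_BASE;
}

/* Physical address to the mapping EFI is taught at boot (Linux style). */
static inline uint64_t model_pa_to_efi_va(uint64_t pa)
{
    return pa + MODEL_EFI_VA_BASE;
}

/* Rebase a Xen virtual address (0xf...) to an EFI virtual address (0xe...). */
static inline uint64_t model_xen_va_to_efi_va(uint64_t va)
{
    return (va - MODEL_XEN_VA_BASE) + MODEL_EFI_VA_BASE;
}
```

With these definitions a log_buffer at 0xf000000000001000 would be
passed to the SAL call as 0xe000000000001000.
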
> 
> As I'm reviewing the patches, I noticed only xen/arch/ia64/xen/ivt.S
> is patched, but xen/arch/ia64/xen/vmx_ivt.S isn't patched.
> Isn't a similar change to vmx_ivt.S necessary?

[ As per our discussion off-line. ]

That is a good question, and one that I wasn't aware of until
you brought it up. I think that the answer is likely no, as
otherwise the code would be very broken, and as it stands it does
work most of the time. However, this could just be by chance - for
instance the TLB might be seeded with the entries it needs. I
will look into this further.


_______________________________________________
Xen-ia64-devel mailing list
Xen-ia64-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-ia64-devel