
Re: [Xen-devel] [Xen on ARM] Possible unhandled SGI bug.



On Sun, 2013-04-28 at 20:02 +0100, Sander Bogaert wrote:
> Hi,
> 
> all previous information can be found in this thread:
> http://lists.xen.org/archives/html/xen-devel/2013-04/msg02772.html
> 
> I've been trying to reproduce this behaviour for the last 2 days,
> crashme has been running on the Arndale board for a total of at least
> 20 hours. I restarted the process once in a while with the seed I saw
> crashing Xen ( 'crashme +2000.4 666 50 2:00:00 2' ).
> 
> The version of crashme is 2.4, the one from the Debian Wheezy
> repository. The last seed logged ( needs a SD card write so I don't
> know when the last sync was before the crash ) was 43166
> 
> I have not been able to reproduce the crash. However I'm quite sure I
> wasn't imagining things, I really did see Xen crash with the "SGI 2
> Unhandled" error when I was running crashme from dom0 userspace.

It could be that running crashme was just incidental, and the crash just
happened independently. There really ought to be no way for a guest to
directly generate a host level SGI and certainly no way for it to
generate one with a number of its choosing.

> This seems like a big deal and not being able to reproduce it is kind
> of frustrating. So I was wondering if there were any ideas on how this
> could have happened? When it did happen I just rebooted the board so
> it was in a 'clean' state.
> 
> Maybe some speculations on a cause could help me reproduce it? A small
> explanation on when exactly it should issue sgi's? I would really
> really like to get to the bottom of this :-)

The xen.git hypervisor uses two SGIs, GIC_SGI_EVENT_CHECK (==0) and
GIC_SGI_DUMP_STATE (==1). Both are issued only via calls to one of
send_SGI_{mask,self,allbutself} (or their various wrappers). In practice
this means smp_send_event_check_mask() or smp_send_state_dump(). You can
verify this by looking at the call chains which lead to one of the small
number of writes to GICD[GICD_SGIR].
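
For reference, here is a rough sketch of what the three send_SGI_* variants
end up writing to GICD_SGIR. The field layout is the GICv2 one
(TargetListFilter in bits [25:24], CPUTargetList in bits [23:16], SGI number
in bits [3:0]). The macro and function names are made up for illustration
and are not the ones in xen.git.

```c
#include <stdint.h>

/* GICv2 GICD_SGIR fields. Names here are illustrative, not Xen's. */
#define SGIR_FILTER_SHIFT   24   /* TargetListFilter, bits [25:24] */
#define SGIR_TARGETS_SHIFT  16   /* CPUTargetList,    bits [23:16] */
#define SGIR_SGI_MASK       0xfu /* SGIINTID,         bits [3:0]   */

#define SGIR_FILTER_LIST    0u   /* deliver to CPUs in CPUTargetList */
#define SGIR_FILTER_OTHERS  1u   /* deliver to all CPUs but the sender */
#define SGIR_FILTER_SELF    2u   /* deliver to the sender only */

/* Value a send_SGI_mask()-style call would write. */
static uint32_t sgir_mask(uint8_t cpu_mask, unsigned int sgi)
{
    return (SGIR_FILTER_LIST << SGIR_FILTER_SHIFT) |
           ((uint32_t)cpu_mask << SGIR_TARGETS_SHIFT) |
           (sgi & SGIR_SGI_MASK);
}

/* Value a send_SGI_allbutself()-style call would write. */
static uint32_t sgir_allbutself(unsigned int sgi)
{
    return (SGIR_FILTER_OTHERS << SGIR_FILTER_SHIFT) | (sgi & SGIR_SGI_MASK);
}

/* Value a send_SGI_self()-style call would write. */
static uint32_t sgir_self(unsigned int sgi)
{
    return (SGIR_FILTER_SELF << SGIR_FILTER_SHIFT) | (sgi & SGIR_SGI_MASK);
}
```

The point is that the SGI number is chosen by whoever builds this value, so
every source of a host SGI should be findable by grepping for those writes.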

Julien added a new SGI in his Arndale tree to call a function on another
CPU (not sure what he called it without looking it up, it's #2 though).
This would be exercised via smp_call_function() and friends.

My only theory for how you could have seen a spurious host level
SGI==2 is a partial rebuild error -- i.e. make b0rked the build and you
got the new version of smp_call_function et al but not the new version
of do_sgi(). Unless of course Julien's tree temporarily had code with
that behaviour (i.e. added the smp_call stuff before the handler)?

TBH, there probably isn't going to be much we can do about this until we
get a repro, so I'd be tempted to ignore it and move on and hope we
never see it again.

About the only useful things we could do in case it does happen again
would be to print othercpu in the panic from do_sgi and to add asserts
to send_SGI_* to check that it is sending an SGI which we have defined
(not just one which the hardware defines, as it asserts now). Could you
whip up a patch to do those?
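
Something along these lines is what I have in mind. It is an untested,
standalone sketch and not a patch against xen.git: GIC_SGI_CALL_FUNCTION
and GIC_SGI_NR_DEFINED are placeholder names (I haven't looked up what
Julien called #2), and the helper functions are hypothetical stand-ins for
the checks which would go into send_SGI_* and do_sgi. The source CPU is
taken from bits [12:10] of the acknowledge register value, as GICv2 defines
for SGIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

enum gic_sgi {
    GIC_SGI_EVENT_CHECK   = 0,
    GIC_SGI_DUMP_STATE    = 1,
    GIC_SGI_CALL_FUNCTION = 2,   /* placeholder name for Julien's SGI #2 */
    GIC_SGI_NR_DEFINED           /* placeholder: number of SGIs we handle */
};

/* The hardware accepts SGIs 0..15; we only want the ones with a handler. */
static bool sgi_is_defined(unsigned int sgi)
{
    return sgi < GIC_SGI_NR_DEFINED;
}

/* Check to add at the top of each send_SGI_* variant. */
static void check_outgoing_sgi(unsigned int sgi)
{
    assert(sgi_is_defined(sgi));
}

/* Split a GICC_IAR value into the interrupt ID and, for SGIs, the sender. */
static void decode_iar(uint32_t iar, unsigned int *sgi, unsigned int *othercpu)
{
    *sgi = iar & 0x3ffu;            /* interrupt ID, bits [9:0] */
    *othercpu = (iar >> 10) & 0x7u; /* source CPU,   bits [12:10] */
}

/* Message do_sgi could pass to panic() for an SGI it has no handler for. */
static int format_unhandled_sgi(char *buf, size_t len,
                                unsigned int sgi, unsigned int othercpu)
{
    return snprintf(buf, len, "Unhandled SGI %u from CPU%u", sgi, othercpu);
}
```

With the sender in the panic message we would at least know which CPU
issued the SGI the next time this happens.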

Ian.





 

