RE: [Xen-devel] segfault in VM - FIXED!

My system doesn't have any IDE devices; it's SCSI only. The SCSI driver is aic7xxx, and I'm still having crashes even with the latest checkout. For the first time I noticed some SCSI errors in the logs in amongst all the others, but given the nature of the crash I don't know if that means anything.

Is this the same problem that we thought was in the network code? I could not readily induce the crash without creating lots of network traffic.

James

From: Keir Fraser
Sent: Sat 24/07/2004 2:01 AM
To: Keir Fraser
Cc: James Harper; Derek Glidden; xen-devel@xxxxxxxxxxxxxxxxxxxxx
Subject: Re: [Xen-devel] segfault in VM - FIXED!

Okay, so I found that the problem is due to overly-aggressive merging
of block requests in the IDE driver. The code assumes that if buffers
are adjacent in virtual or physical address space then they can be
merged --- this isn't always the case over Xen since those physical
addresses may map to different real machine pages.
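
To illustrate the point above, here is a minimal sketch in C of why adjacency in the guest's physical address space does not imply adjacency in machine memory. It is not the checked-in fix: the table `p2m`, the helpers `phys_to_machine`, `phys_adjacent` and `machine_adjacent`, and the frame numbers are all hypothetical, chosen only to show the idea. It checks only the boundary between two buffers and assumes each buffer is itself machine-contiguous.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)
#define NUM_PAGES  4   /* hypothetical, tiny guest for illustration */

/*
 * Hypothetical pseudo-physical -> machine frame table.
 * Guest frames 0 and 1 are backed by scattered machine frames;
 * guest frames 2 and 3 happen to be machine-contiguous.
 */
static const uint64_t p2m[NUM_PAGES] = { 0x120, 0x7f3, 0x400, 0x401 };

/* Translate a guest physical address to a machine address. */
static uint64_t phys_to_machine(uint64_t phys)
{
    uint64_t pfn = phys >> PAGE_SHIFT;
    assert(pfn < NUM_PAGES);
    return (p2m[pfn] << PAGE_SHIFT) | (phys & (PAGE_SIZE - 1));
}

/* The naive test: buffer b starts exactly where buffer a ends. */
static bool phys_adjacent(uint64_t a, size_t a_len, uint64_t b)
{
    return a + a_len == b;
}

/*
 * The stricter test: the last byte of a and the first byte of b
 * must also be neighbours in machine memory.
 */
static bool machine_adjacent(uint64_t a, size_t a_len, uint64_t b)
{
    if (a_len == 0 || !phys_adjacent(a, a_len, b))
        return false;
    return phys_to_machine(a + a_len - 1) + 1 == phys_to_machine(b);
}
```

With this table, the buffers at guest addresses 0x0000 and 0x1000 pass the naive test but fail the machine test, so merging them would hand the device an invalid segment. The buffers at 0x2000 and 0x3000 pass both.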

I've checked in a fix that I think is safe for IDE --- in the
occasional instances that a merged scatter-gather list is invalid, we
should now cause IDE to fall back to a super-safe mode (basically
PIO). On my system this happens so occasionally that performance
shouldn't be affected.
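
The fallback described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the driver code: `struct sg_entry`, its `valid` flag, `check_sg_list` and `choose_xfer_mode` are hypothetical names. The point it shows is that the caller has to look at the error return and pick the slow path, which is the step a driver must not skip.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum xfer_mode { XFER_DMA, XFER_PIO };

/* Hypothetical scatter-gather entry, for illustration only. */
struct sg_entry {
    uint64_t machine_addr;
    size_t   len;
    bool     valid;  /* false if the merge spans non-contiguous machine pages */
};

/* Return the number of entries if all are usable, or -1 on any bad entry. */
static int check_sg_list(const struct sg_entry *sg, int n)
{
    for (int i = 0; i < n; i++) {
        if (!sg[i].valid || sg[i].len == 0)
            return -1;
    }
    return n;
}

/* The caller honours the error return and falls back to the safe mode. */
static enum xfer_mode choose_xfer_mode(const struct sg_entry *sg, int n)
{
    if (check_sg_list(sg, n) < 0)
        return XFER_PIO;   /* slow but safe */
    return XFER_DMA;       /* list is valid, use it */
}
```

A caller that ignored the -1 and carried on with DMA would transfer to or from the wrong machine pages.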

If this also turns out to be a problem for SCSI then we may need to do
some more work --- our safety check will still trigger and we will
still fail the scatter-gather list, but it doesn't look as though many
SCSI drivers pick up the error return code and do anything sane. This
is a bug in those drivers, but this is small comfort to us in our aim
to work with the full range of Linux SCSI drivers.

What we need now is some more checking, particularly with SCSI block
devices, to see whether there are any more bugs to shake out.

 -- Keir


> 
> Yeah, it turns out I can reproduce this bug trivially by md5summing a
> file just slightly bigger than dom0's memory allocation, while
> floodpinging dom1.
> 
> I'm trying out a few things right now, so hopefully I'll be able to
> report progress on this evil bug r.s.n. :-)
> 
>  -- Keir
> 
> > I just made a change so that the skbuf is always copied in netif_be_start_xmit, but it still crashes, which most likely means that bit is fine, or at least isn't the only code containing bugs.
> > 
> > As another test I also put the 'goto done;' after the 'if ( skb_shared(skb) || skb_cloned(skb) || ...' block (still block the receive, but do it later) and there were no crashes, so I'm comfortable that we've exhausted netif_be_start_xmit as a source of bugs.
> > 
> > So I guess that leaves net_rx_action. I'm unsure about one thing though: how and where do the pages that get passed from dom0 to domU get recycled back to dom0, if at all? Is it possible that domU could still write to a page that dom0 thought it had free to use for something else? If so, where would that be?
> > 
> > Keir: have you been able to reproduce these errors at all?
> > 
> > James
> > 
> > 
> > 
> > 
> > From: Keir Fraser
> > Sent: Fri 23/07/2004 3:48 AM
> > To: Derek Glidden
> > Cc: xen-devel@xxxxxxxxxxxxxxxxxxxxx
> > Subject: Re: [Xen-devel] segfault in VM
> > 
> > 
> > It's useful to have the extra data points -- it adds to our confidence
> > that it's the network driver that is somehow at fault here.
> > 
> > Quite how to proceed in narrowing down the problem is
> > unclear. One approach is to perturb the backend driver's data path
> > (e.g., always copying packets into a known-safe page-sized buffer, as
> > a check that our current copy-avoidance checks are not at fault; and
> > replacing the current high-performance but convoluted code for
> > batching hypercalls with something slower but easier to grok). The
> > latter is useful because if the bug goes away then we have a smaller
> > chunk of code to look at; if the bug remains then we end up with a
> > less complex data path that is easier to instrument and bughunt.
> > 
> > If anyone is interested in pursuing this bug independently, the
> > functions most under suspicion are netif_be_start_xmit and
> > net_rx_action, both in linux/arch/xen/drivers/netif/backend/main.c.
> > These two form the data path for packets getting sent to guest OSes.
> > 
> >  -- Keir
> > 
> > 
> > > 
> > > On Jul 22, 2004, at 7:22 AM, Keir Fraser wrote:
> > > >
> > > > Anyway - currently sounds like the bug resides in the most complex
> > > > half of the most complex driver. Who'd've thought it? ;-)
> > > 
> > > At this point this data is surely redundant but...
> > > 
> > > When I went to sleep last night I let my box run dom0 and four VMs 
> > > doing md5sum checks on a couple of large files, hammering the heck out 
> > > of the block i/o drivers and CPU but with all the ifaces/vifs on the 
> > > machine down.  When I woke up, all compares had been correct for the 
> > > six hours or so it ran.  I re-upped the ifaces and started to ping dom0 
> > > and the VMs and within a minute of the pings starting dom0 started to 
> > > report incorrect md5sums.
> > > 
> > > -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
> > > "We all enter this world in the    | Support Electronic Freedom
> > > same way: naked; screaming; soaked |        http://www.eff.org/
> > > in blood. But if you live your     |  http://www.anti-dmca.org/
> > > life right, that kind of thing     |---------------------------
> > > doesn't have to stop there." -- Dana Gould
> > > 
> > > 
> > > 
> > > -------------------------------------------------------
> > > This SF.Net email is sponsored by BEA Weblogic Workshop
> > > FREE Java Enterprise J2EE developer tools!
> > > Get your free copy of BEA WebLogic Workshop 8.1 today.
> > > http://ads.osdn.com/?ad_id=4721&alloc_id=10040&op=click
> > > _______________________________________________
> > > Xen-devel mailing list
> > > Xen-devel@xxxxxxxxxxxxxxxxxxxxx
> > > https://lists.sourceforge.net/lists/listinfo/xen-devel
> > 
> > 
> > 
