
Re: [Xen-devel] Radeon DRM dom0 issues



Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx> wrote on 02/19/2014 03:30:07 PM:

> From: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>
> To: Michael Labriola <michael.d.labriola@xxxxxxxxx>,
> Cc: Michael D Labriola <mlabriol@xxxxxxxx>, Konrad Rzeszutek Wilk
> <konrad@xxxxxxxxxx>, xen-devel@xxxxxxxxxxxxx,
> xen-devel-bounces@xxxxxxxxxxxxx
> Date: 02/19/2014 03:30 PM
> Subject: Re: [Xen-devel] Radeon DRM dom0 issues
> 
> On Wed, Feb 19, 2014 at 03:08:08PM -0500, Michael Labriola wrote:
> > On Wed, Feb 19, 2014 at 2:57 PM, Konrad Rzeszutek Wilk
> > <konrad.wilk@xxxxxxxxxx> wrote:
> > > On Wed, Feb 19, 2014 at 02:33:26PM -0500, Michael Labriola wrote:
> > >> On Wed, Feb 19, 2014 at 12:04 PM, Konrad Rzeszutek Wilk
> > >> <konrad.wilk@xxxxxxxxxx> wrote:
> > >> > On Tue, Feb 11, 2014 at 10:35:18AM -0500, Michael D Labriola wrote:
> > >> >> Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx> wrote on 01/24/2014
> > >> >> 09:49:38 AM:
> > >> >>
> > >> >> > From: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>
> > >> >> > To: Michael D Labriola <mlabriol@xxxxxxxx>,
> > >> >> > Cc: Konrad Rzeszutek Wilk <konrad@xxxxxxxxxx>,
> > >> >> > michael.d.labriola@xxxxxxxxx, xen-devel@xxxxxxxxxxxxx,
> > >> >> > xen-devel-bounces@xxxxxxxxxxxxx
> > >> >> > Date: 01/24/2014 09:50 AM
> > >> >> > Subject: Re: [Xen-devel] Radeon DRM dom0 issues
> > >> >> >
> > >> >> > On Thu, Jan 23, 2014 at 11:54:37AM -0500, Michael D Labriola wrote:
> > >> >> > > xen-devel-bounces@xxxxxxxxxxxxx wrote on 01/21/2014 04:59:05 PM:
> > >> >> > >
> > >> >> > > > From: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>
> > >> >> > > > To: Michael D Labriola <mlabriol@xxxxxxxx>,
> > >> >> > > > Cc: Konrad Rzeszutek Wilk <konrad@xxxxxxxxxx>,
> > >> >> > > > michael.d.labriola@xxxxxxxxx, xen-devel@xxxxxxxxxxxxx
> > >> >> > > > Date: 01/21/2014 04:59 PM
> > >> >> > > > Subject: Re: [Xen-devel] Radeon DRM dom0 issues
> > >> >> > > > Sent by: xen-devel-bounces@xxxxxxxxxxxxx
> > >> >> > > >
> > >> >> > > > On Mon, Jan 20, 2014 at 03:15:24PM -0500, Michael D Labriola wrote:
> > >> >> > > > > Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx> wrote on 01/20/2014
> > >> >> > > > > 10:38:27 AM:
> > >> >> > > > >
> > >> >> > > > > > From: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>
> > >> >> > > > > > To: Michael D Labriola <mlabriol@xxxxxxxx>,
> > >> >> > > > > > Cc: Konrad Rzeszutek Wilk <konrad@xxxxxxxxxx>,
> > >> >> > > > > > michael.d.labriola@xxxxxxxxx, xen-devel@xxxxxxxxxxxxx
> > >> >> > > > > > Date: 01/20/2014 10:38 AM
> > >> >> > > > > > Subject: Re: [Xen-devel] Radeon DRM dom0 issues
> > >> >> > > > > >
> > >> >> > > > > > On Mon, Jan 20, 2014 at 10:26:22AM -0500, Michael D Labriola wrote:
> > >> >> > > > > > > Konrad Rzeszutek Wilk <konrad@xxxxxxxxxx> wrote on 01/20/2014
> > >> >> > > > > > > 10:14:36 AM:
> > >> >> > > > > > >
> > >> >> > > > > > > > From: Konrad Rzeszutek Wilk <konrad@xxxxxxxxxx>
> > >> >> > > > > > > > To: Michael D Labriola <mlabriol@xxxxxxxx>,
> > >> >> > > > > > > > Cc: xen-devel@xxxxxxxxxxxxx, michael.d.labriola@xxxxxxxxx
> > >> >> > > > > > > > Date: 01/20/2014 10:14 AM
> > >> >> > > > > > > > Subject: Re: [Xen-devel] Radeon DRM dom0 issues
> > >> >> > > > > > > >
> > >> >> > > > > > > > On Mon, Jan 20, 2014 at 09:58:32AM -0500, Michael D Labriola wrote:
> > >> >> > > > > > > > > Anyone here running a dom0 w/ Radeon DRM?  I'm having
> > >> >> > > > > > > > > consistent crashes with multiple older R600 series (HD 6470
> > >> >> > > > > > > > > and HD 6570) and unusably slow graphics with a newer HD7000
> > >> >> > > > > > > > > (can see each line refresh individually on radeonfb tty).
> > >> >> > > > > > > > > All 3 systems seem to work fine bare metal.
> > >> >> > > > > > > >
> > >> >> > > > > > > > I hadn't been using DRM, just Xserver. Is that what you mean?
> > >> >> > > > > > >
> > >> >> > > > > > > The R600 problems happen when in X, using OpenGL, on my dom0.
> > >> >> > > > > > > The RadeonSI sluggishness is when using the KMS framebuffer
> > >> >> > > > > > > device for a plain text console login.
> > >> >> > > > > >
> > >> >> > > > > > So the sluggishness is probably due to the PAT not being
> > >> >> > > > > > enabled. This patch should be applied:
> > >> >> > > > > >
> > >> >> > > > > > lkml.org/lkml/2011/11/8/406
> > >> >> > > > > >
> > >> >> > > > > > (or http://marc.info/?l=linux-kernel&m=132888833209874&w=2)
> > >> >> > > > > >
> > >> >> > > > > > and these two reverted:
> > >> >> > > > > >
> > >> >> > > > > >  "xen/pat: Disable PAT support for now."
> > >> >> > > > > >  "xen/pat: Disable PAT using pat_enabled value."
> > >> >> > > > > >
> > >> >> > > > > > Which is to say do:
> > >> >> > > > > >
> > >> >> > > > > > git revert c79c49826270b8b0061b2fca840fc3f013c8a78a
> > >> >> > > > > > git revert 8eaffa67b43e99ae581622c5133e20b0f48bcef1
> > >> >> > > > >
> > >> >> > > > > Thanks!  I cherry-picked that patch out of your testing tree,
> > >> >> > > > > reverted those 2 commits, recompiled and installed.  Definitely
> > >> >> > > > > fixed the HD 7000 sluggishness and appears to have fixed the R600
> > >> >> > > > > crashes (although it's only been running a few hours).
> > >> >> > > > >
> > >> >> > > > > How come that patch didn't get into mainline?  It looks pretty
> > >> >> > > > > innocuous to me...
> > >> >> > > >
> > >> >> > > > <Sigh> the x86 maintainers wanted a different route. And I hadn't
> > >> >> > > > had the chance nor time to implement it.
> > >> >> > >
> > >> >> > > I see.  Well, I've got a handful of boxes in my lab that need that
> > >> >> > > patch to be usable.  If you do come up with a more mainline-able
> > >> >> > > solution, I'd gladly test it for you.  ;-)
> > >> >> >
> > >> >> > Thank you!
> > >> >>
> > >> >> Uh, oh.  Looks like those reverts and patches didn't entirely fix my
> > >> >> problem.  My box with the HD5450 (r600 gallium3d) started going bonkers
> > >> >> again yesterday.  After being solid as a rock for 2 weeks as my primary
> > >> >> workstation, X has crashed a half dozen or so times so far this week.
> > >> >> I've been in Xen with 2 paravirtual linux guests running almost
> > >> >> constantly for this whole period.  I don't understand what's changed,
> > >> >> but my system has been entirely unstable now.  I did recompile my
> > >> >> kernel... but all I did was merge the v3.13.1 stable commit into my
> > >> >> working tree and turn a few things on (netfilter, wifi, a couple
> > >> >> drivers turned on here and there).  I just went and verified that those
> > >> >> patches are still applied in my tree (i.e., I didn't accidentally undo
> > >> >> them).  I'm scratching my head (and staring at a TTY login).
> > >> >>
> > >> >> When X crashes, my kernel log prints a couple dozen iterations of this.
> > >> >> 3d acceleration no longer functions unless I reboot.  If memory serves,
> > >> >> the unpatched behavior upon X crash was that the kernel continued to
> > >> >> spew these errors until the whole box locked up.  At least that's not
> > >> >> happening any more... ;-)
> > >> >>
> > >> >> [  702.070084] [TTM] radeon 0000:01:00.0: Unable to get page 2
> > >> >> [  702.075971] [TTM] radeon 0000:01:00.0: Failed to fill cached pool (r:-12)!
> > >> >> [  704.720699] [TTM] radeon 0000:01:00.0: Unable to get page 0
> > >> >> [  704.726635] [TTM] radeon 0000:01:00.0: Failed to fill cached pool (r:-12)!
> > >> >> [  704.733910] [drm:radeon_gem_object_create] *ERROR* Failed to allocate GEM object (8192, 2, 4096, -12)
> > >> >>
> > >> >> and here's a slightly different variant that happened while I was
> > >> >> typing this email (on a different machine, luckily):
> > >> >>
> > >> >> [ 3107.713039] sdf: detected capacity change from 31625052160 to 0
> > >> >> [ 3114.491717] usb 9-1: USB disconnect, device number 2
> > >> >> [64348.271534] [TTM] radeon 0000:01:00.0: Unable to get page 3
> > >> >> [64348.277312] [TTM] radeon 0000:01:00.0: Failed to fill cached pool (r:-12)!
> > >> >> [64348.284470] [TTM] radeon 0000:01:00.0: Unable to get page 0
> > >> >> [64348.290257] [TTM] radeon 0000:01:00.0: Failed to fill cached pool (r:-12)!
> > >> >> [64348.297561] [TTM] Buffer eviction failed
> > >> >> [64349.550518] [TTM] radeon 0000:01:00.0: Unable to get page 0
> > >> >> [64349.556417] [TTM] radeon 0000:01:00.0: Failed to fill cached pool (r:-12)!
> > >> >> [64349.563714] [drm:radeon_gem_object_create] *ERROR* Failed to allocate GEM object (16384, 2, 4096, -12)
> > >> >>
> > >> >> Any ideas?
> > >> >
> > >> > Yes. I believe you have a memory leak. As in, some driver (or X) is
> > >> > eating up the memory and not giving enough of it back. That means the
> > >> > TTM layer is hitting its ceiling of how much memory it can allocate.
> > >> >
> > >> > Now finding the culprit is going to be a bit hard.
> > >> >
> > >> > You could use:
> > >> >
> > >> > [root@phenom 1]# cat /sys/kernel/debug/dri/1/ttm_dma_page_pool
> > >> >          pool      refills   pages freed    inuse available name
> > >> >            wc          259           224      808        4 nouveau 0000:05:00.0
> > >> >        cached      3403058      13561071    51158        3 radeon 0000:01:00.0
> > >> >        cached           25             0       96        4 nouveau 0000:05:00.0
> > >> >
> > >> > to figure out if my thinking is really true. You should have a huge
> > >> > 'inuse' count and almost no 'available'.
> > >>
> > >> My /sys/kernel/debug/dri directory has a 0 and a 64 entry, which appear
> > >> to always have the same contents.  Is that normal?
> > >
> > > Yes.
> > >>
> > >> My /sys/kernel/debug/dri/0/ttm_dma_page_pool file doesn't exist bare
> > >> metal... only in Xen.  Is that normal?
> > >
> > > It would show up on baremetal if you boot with 'iommu=soft'
> > >
> > >>
> > >>          pool      refills   pages freed    inuse available name
> > >>        cached        15190         59551     1205        4 radeon 0000:01:00.0
> > >>
> > >> If I watch that file while creating xterms, moving them around, etc, I
> > >> can see the number available fluctuate between 3 and 6.  This is true,
> > >> even on my box w/ the newer R7 card in it, which hasn't gotten that GEM
> > >> error message (yet?).
> > >
> > > OK, so let's see what happens when the error shows. Incidentally - what
> > > amount of memory does your initial domain have? And is it different than
> > > when you boot it as baremetal?
> > 
> > I've got the problem very reproducible on 3 boxes.  All three are
> > booting the dom0 with as much RAM as Xen will give them, then giving
> > up some of their RAM as needed when I create domUs.  The 3 boxes have
> > 4G, 8G, and 16G if memory serves.

Actually, they're 6G, 8G, and 16G... and I've got a box that I can't 
reproduce the problem on even though it's got the same video card... and 
it only has 2G of RAM.  Could this be a PAE/HIGHMEM issue?  I'm running 
32bit with CONFIG_HIGHMEM64G on all my boxes.


> > 
> > Does the amount of RAM on the actual video cards matter?  All the
> > older cards (that crash all the time) have 2G, whereas the R7 that
> > hasn't crashed yet only has 1G.
> 
> The TTM pool has a limit (a hard one). It is pretty simple:
> 
> 
>        pr_info("Zone %7s: Available graphics memory: %llu kiB\n",
>                zone->name, (unsigned long long)zone->max_mem >> 10);
>        }
>        ttm_page_alloc_init(glob, glob->zone_kernel->max_mem/(2*PAGE_SIZE));
>        ttm_dma_page_alloc_init(glob, glob->zone_kernel->max_mem/(2*PAGE_SIZE));
> 
> so 1/4 of your memory. Which means that when you boot dom0 with as much
> memory as possible and then balloon down you might confuse it
> (as the initial memory assumption is done during bootup).
> 
> If you boot the troubled dom0s with 'dom0_mem_max' set to some good
> number - that might shed some light on this.

Ok, I've got one of the problematic boxes booted with dom0_mem=5G and it 
doesn't seem to be crashing.  Fingers crossed!
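
To put rough numbers on the limit Konrad describes above, here is a minimal
sketch of the arithmetic, not the kernel's actual code path.  It assumes 4 KiB
pages and that the kernel zone's max_mem is half of the RAM dom0 sees at boot
(which is what makes the quoted max_mem/(2*PAGE_SIZE) come out to "1/4 of your
memory").  It ignores any lowmem/highmem distinction on 32bit, and the function
names are made up for illustration.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages
GIB = 2 ** 30


def kernel_zone_max_mem(boot_ram_bytes):
    """Assumed kernel-zone cap: half of the RAM visible at boot."""
    return boot_ram_bytes // 2


def ttm_pool_limit_pages(boot_ram_bytes):
    """Pool limit in pages, mirroring max_mem / (2 * PAGE_SIZE)."""
    return kernel_zone_max_mem(boot_ram_bytes) // (2 * PAGE_SIZE)


def describe(boot_ram_gib):
    """One-line summary of the pool limit for a given boot-time RAM size."""
    pages = ttm_pool_limit_pages(boot_ram_gib * GIB)
    limit_gib = pages * PAGE_SIZE / GIB
    return f"{boot_ram_gib}G at boot -> {pages} pages ({limit_gib:.2f} GiB)"


if __name__ == "__main__":
    # The limit is fixed from the boot-time size, so a dom0 that boots with
    # 16G and later balloons down keeps the larger limit.
    for ram_gib in (5, 16):
        print(describe(ram_gib))
```

If those assumptions hold, a dom0 booted with all 16G keeps a pool limit sized
for 16G even after it balloons down, whereas one booted with dom0_mem=5G gets
a limit sized for the memory it will actually keep.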


> 
> 
> > 
> > I've been reproducing the crash by just logging in and out of fluxbox
> > via XDM over and over again right after booting my dom0 in Xen w/ no
> > guests running.  That makes it happen within a few minutes.  Otherwise
> > it randomly crashes while I'm in the middle of trying to work... ;-)
> 
> HA!
> 
> Does fluxbox use a lot of graphics? I mean does it do a lot of fancy
> things when it starts and shuts itself down?

Negative.  It does next to nothing.  Super light weight, pretty much just 
gets rid of the login box and puts a taskbar-type-thing on the bottom of 
the screen.  I'd say the majority of my crashes have happened in 
Enlightenment (with plenty of extra fancy things), but it HAS happened in 
fluxbox doing next to nothing.  Which was pretty surprising.
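
For watching the pool while running that login/logout loop, something like the
sketch below could parse the ttm_dma_page_pool table quoted earlier in the
thread.  It assumes the column layout shown there (pool, refills, pages freed,
inuse, available, name), and it needs root plus a mounted debugfs to read the
real file.  The PoolRow and parse_pool_table names are hypothetical.

```python
from typing import List, NamedTuple


class PoolRow(NamedTuple):
    pool: str
    refills: int
    pages_freed: int
    inuse: int
    available: int
    name: str


def parse_pool_table(text: str) -> List[PoolRow]:
    """Parse the table printed by ttm_dma_page_pool into rows."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and the header ("pool refills pages freed ...").
        if len(fields) < 6 or fields[0] == "pool":
            continue
        try:
            refills, freed, inuse, available = (int(f) for f in fields[1:5])
        except ValueError:
            continue  # not a data row
        # The device name contains a space ("radeon 0000:01:00.0").
        rows.append(PoolRow(fields[0], refills, freed, inuse, available,
                            " ".join(fields[5:])))
    return rows


# Sample copied from the output posted earlier in this thread.
SAMPLE = """\
         pool      refills   pages freed    inuse available name
       cached        15190         59551     1205        4 radeon 0000:01:00.0
"""

if __name__ == "__main__":
    # On a live dom0, read /sys/kernel/debug/dri/0/ttm_dma_page_pool instead.
    for row in parse_pool_table(SAMPLE):
        print(f"{row.name} [{row.pool}]: inuse={row.inuse} "
              f"available={row.available}")
```

Sampling this every second or so during the XDM login/logout loop would show
whether 'inuse' keeps climbing before the GEM allocation errors appear.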


---
Michael D Labriola
Electric Boat
mlabriol@xxxxxxxx
401-848-8871 (desk)
401-848-8513 (lab)
401-316-9844 (cell)






_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

