
Re: [Xen-devel] Radeon DRM dom0 issues



On Wed, Feb 19, 2014 at 12:04 PM, Konrad Rzeszutek Wilk
<konrad.wilk@xxxxxxxxxx> wrote:
> On Tue, Feb 11, 2014 at 10:35:18AM -0500, Michael D Labriola wrote:
>> Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx> wrote on 01/24/2014
>> 09:49:38 AM:
>>
>> > From: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>
>> > To: Michael D Labriola <mlabriol@xxxxxxxx>,
>> > Cc: Konrad Rzeszutek Wilk <konrad@xxxxxxxxxx>,
>> > michael.d.labriola@xxxxxxxxx, xen-devel@xxxxxxxxxxxxx,
>> > xen-devel-bounces@xxxxxxxxxxxxx
>> > Date: 01/24/2014 09:50 AM
>> > Subject: Re: [Xen-devel] Radeon DRM dom0 issues
>> >
>> > On Thu, Jan 23, 2014 at 11:54:37AM -0500, Michael D Labriola wrote:
>> > > xen-devel-bounces@xxxxxxxxxxxxx wrote on 01/21/2014 04:59:05 PM:
>> > >
>> > > > From: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>
>> > > > To: Michael D Labriola <mlabriol@xxxxxxxx>,
>> > > > Cc: Konrad Rzeszutek Wilk <konrad@xxxxxxxxxx>,
>> > > > michael.d.labriola@xxxxxxxxx, xen-devel@xxxxxxxxxxxxx
>> > > > Date: 01/21/2014 04:59 PM
>> > > > Subject: Re: [Xen-devel] Radeon DRM dom0 issues
>> > > > Sent by: xen-devel-bounces@xxxxxxxxxxxxx
>> > > >
>> > > > On Mon, Jan 20, 2014 at 03:15:24PM -0500, Michael D Labriola wrote:
>> > > > > Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx> wrote on 01/20/2014
>> > > > > 10:38:27 AM:
>> > > > >
>> > > > > > From: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>
>> > > > > > To: Michael D Labriola <mlabriol@xxxxxxxx>,
>> > > > > > Cc: Konrad Rzeszutek Wilk <konrad@xxxxxxxxxx>,
>> > > > > > michael.d.labriola@xxxxxxxxx, xen-devel@xxxxxxxxxxxxx
>> > > > > > Date: 01/20/2014 10:38 AM
>> > > > > > Subject: Re: [Xen-devel] Radeon DRM dom0 issues
>> > > > > >
>> > > > > > On Mon, Jan 20, 2014 at 10:26:22AM -0500, Michael D Labriola wrote:
>> > > > > > > Konrad Rzeszutek Wilk <konrad@xxxxxxxxxx> wrote on 01/20/2014
>> > > > > > > 10:14:36 AM:
>> > > > > > >
>> > > > > > > > From: Konrad Rzeszutek Wilk <konrad@xxxxxxxxxx>
>> > > > > > > > To: Michael D Labriola <mlabriol@xxxxxxxx>,
>> > > > > > > > Cc: xen-devel@xxxxxxxxxxxxx, michael.d.labriola@xxxxxxxxx
>> > > > > > > > Date: 01/20/2014 10:14 AM
>> > > > > > > > Subject: Re: [Xen-devel] Radeon DRM dom0 issues
>> > > > > > > >
>> > > > > > > > On Mon, Jan 20, 2014 at 09:58:32AM -0500, Michael D Labriola wrote:
>> > > > > > > > > Anyone here running a dom0 w/ Radeon DRM?  I'm having consistent
>> > > > > > > > > crashes with multiple older R600 series (HD 6470 and HD 6570) and
>> > > > > > > > > unusably slow graphics with a newer HD7000 (can see each line
>> > > > > > > > > refresh individually on radeonfb tty).  All 3 systems seem to work
>> > > > > > > > > fine bare metal.
>> > > > > > > >
>> > > > > > > > I hadn't been using DRM, just Xserver. Is that what you mean?
>> > > > > > >
>> > > > > > > The R600 problems happen when in X, using OpenGL, on my dom0.  The
>> > > > > > > RadeonSI sluggishness is when using the KMS framebuffer device for a
>> > > > > > > plain text console login.
>> > > > > >
>> > > > > > So the sluggishness is probably due to PAT not being enabled. This
>> > > > > > patch should be applied:
>> > > > > >
>> > > > > > lkml.org/lkml/2011/11/8/406
>> > > > > >
>> > > > > > (or http://marc.info/?l=linux-kernel&m=132888833209874&w=2)
>> > > > > >
>> > > > > > and these two reverted:
>> > > > > >
>> > > > > >  "xen/pat: Disable PAT support for now."
>> > > > > >  "xen/pat: Disable PAT using pat_enabled value."
>> > > > > >
>> > > > > > Which is to say do:
>> > > > > >
>> > > > > > git revert c79c49826270b8b0061b2fca840fc3f013c8a78a
>> > > > > > git revert 8eaffa67b43e99ae581622c5133e20b0f48bcef1
>> > > > >
>> > > > > Thanks!  I cherry-picked that patch out of your testing tree, reverted
>> > > > > those 2 commits, recompiled and installed.  Definitely fixed the HD 7000
>> > > > > sluggishness and appears to have fixed the R600 crashes (although it's
>> > > > > only been running a few hours).
>> > > > >
>> > > > > How come that patch didn't get into mainline?  It looks pretty innocuous
>> > > > > to me...
>> > > >
>> > > > <Sigh> the x86 maintainers wanted a different route. And I haven't had
>> > > > the chance or the time to implement it.
>> > >
>> > > I see.  Well, I've got a handful of boxes in my lab that need that patch
>> > > to be usable.  If you do come up with a more mainline-able solution, I'd
>> > > gladly test it for you.  ;-)
>> >
>> > Thank you!
>>
>> Uh, oh.  Looks like those reverts and patches didn't entirely fix my
>> problem.  My box with the HD5450 (r600 gallium3d) started going bonkers
>> again yesterday.  After being solid as a rock for 2 weeks as my primary
>> workstation, X has crashed a half dozen or so times so far this week. I've
>> been in Xen with 2 paravirtual linux guests running almost constantly for
>> this whole period.  I don't understand what's changed, but my system has
>> been entirely unstable now.  I did recompile my kernel... but all I did
>> was merge the v3.13.1 stable commit into my working tree and turn a few
>> things on (netfilter, wifi, a couple drivers turned on here and there).  I
>> just went and verified that those patches are still applied in my tree
>> (i.e., I didn't accidentally undo them).  I'm scratching my head (and
>> staring at a TTY login).
>>
>> When X crashes, my kernel log prints a couple dozen iterations of this. 3d
>> acceleration no longer functions unless I reboot.  If memory serves, the
>> unpatched behavior upon X crash was that the kernel continued to spew
>> these errors until the whole box locked up.  At least that's not happening
>> any more... ;-)
>>
>> [  702.070084] [TTM] radeon 0000:01:00.0: Unable to get page 2
>> [  702.075971] [TTM] radeon 0000:01:00.0: Failed to fill cached pool (r:-12)!
>> [  704.720699] [TTM] radeon 0000:01:00.0: Unable to get page 0
>> [  704.726635] [TTM] radeon 0000:01:00.0: Failed to fill cached pool (r:-12)!
>> [  704.733910] [drm:radeon_gem_object_create] *ERROR* Failed to allocate GEM object (8192, 2, 4096, -12)
>>
>> and here's a slightly different variant that happened while I was typing
>> this email (on a different machine, luckily):
>>
>> [ 3107.713039] sdf: detected capacity change from 31625052160 to 0
>> [ 3114.491717] usb 9-1: USB disconnect, device number 2
>> [64348.271534] [TTM] radeon 0000:01:00.0: Unable to get page 3
>> [64348.277312] [TTM] radeon 0000:01:00.0: Failed to fill cached pool (r:-12)!
>> [64348.284470] [TTM] radeon 0000:01:00.0: Unable to get page 0
>> [64348.290257] [TTM] radeon 0000:01:00.0: Failed to fill cached pool (r:-12)!
>> [64348.297561] [TTM] Buffer eviction failed
>> [64349.550518] [TTM] radeon 0000:01:00.0: Unable to get page 0
>> [64349.556417] [TTM] radeon 0000:01:00.0: Failed to fill cached pool (r:-12)!
>> [64349.563714] [drm:radeon_gem_object_create] *ERROR* Failed to allocate GEM object (16384, 2, 4096, -12)
>>
>> Any ideas?
>
> Yes. I believe you have a memory leak. As in, some driver (or X) is
> eating up the memory and not giving enough of it back. That means the TTM
> layer is hitting its ceiling of how much memory it can allocate.
>
> Now finding the culprit is going to be a bit hard.
>
> You could use:
>
> [root@phenom 1]# cat /sys/kernel/debug/dri/1/ttm_dma_page_pool
>          pool      refills   pages freed    inuse available     name
>            wc          259           224      808        4 nouveau 0000:05:00.0
>        cached      3403058      13561071    51158        3 radeon 0000:01:00.0
>        cached           25             0       96        4 nouveau 0000:05:00.0
>
> to figure out if my thinking is really true. You should have a huge
> 'inuse' count and almost no 'available'.

My /sys/kernel/debug/dri directory has a 0 and a 64 entry, which appear to
always have the same contents.  Is that normal?

My /sys/kernel/debug/dri/0/ttm_dma_page_pool file doesn't exist bare
metal... only in Xen.  Is that normal?

         pool      refills   pages freed    inuse available     name
       cached        15190         59551     1205        4 radeon 0000:01:00.0

If I watch that file while creating xterms, moving them around, etc, I can
see the number available fluctuate between 3 and 6.  This is true, even on
my box w/ the newer R7 card in it, which hasn't gotten that GEM error
message (yet?).
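
For anyone who wants to log that file over time rather than eyeball it, here is a minimal Python sketch that parses the table format shown above. The names (PoolRow, parse_pool_table, suspicious_pools) and the default thresholds are my own assumptions for illustration, not anything taken from the TTM code; "huge inuse, almost no available" is only roughly translated into numbers here.

```python
from typing import List, NamedTuple


class PoolRow(NamedTuple):
    """One data line of ttm_dma_page_pool (column names follow the header)."""
    pool: str
    refills: int
    pages_freed: int
    inuse: int
    available: int
    name: str


def parse_pool_table(text: str) -> List[PoolRow]:
    """Parse the text of ttm_dma_page_pool into rows.

    Assumes each pool sits on a single line, as the file prints it
    (the mail-wrapped copy earlier in the thread is not handled).
    """
    rows = []
    for line in text.splitlines():
        # Five leading columns, then the rest of the line is the device name.
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # blank or truncated line
        if not all(f.isdigit() for f in fields[1:5]):
            continue  # header line ("pool refills pages freed ...")
        pool, refills, freed, inuse, available, name = fields
        rows.append(PoolRow(pool, int(refills), int(freed),
                            int(inuse), int(available), name.strip()))
    return rows


def suspicious_pools(rows: List[PoolRow],
                     min_inuse: int = 50000,
                     max_available: int = 4) -> List[PoolRow]:
    """Return pools with many pages in use and few available.

    Both thresholds are arbitrary placeholders; tune them per machine.
    """
    return [r for r in rows
            if r.inuse >= min_inuse and r.available <= max_available]
```

To use it, read the debugfs file as root (for example /sys/kernel/debug/dri/0/ttm_dma_page_pool on my box) and pass the contents to parse_pool_table(). Calling it periodically and recording inuse per device should show whether the count keeps climbing before the GEM allocation errors appear.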


>
> But that will get us just to confirm that yes - you have a big usage
> of memory and it is hitting the ceiling.
>
> Now to actually figure out which application is hanging on to these - that
> I am not sure about. I think there is some drm info tool to investigate
> how many pages each application is using. You can leave it running and
> see which app is gulping up the memory. But I am not sure which
> tool that is (if there is one).
>
> Well, let's do one step at a time - see if my theory is correct first.



-- 
Michael D Labriola
21 Rip Van Winkle Cir
Warwick, RI 02886
401-316-9844 (cell)
401-848-8871 (work)
401-234-1306 (home)

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

