
Re: [Xen-devel] Xen-unstable Linux 3.14-rc3 and 3.13 Network troubles



Thursday, February 27, 2014, 4:15:39 PM, you wrote:

> On Thu, Feb 27, 2014 at 03:43:51PM +0100, Sander Eikelenboom wrote:
> [...]
>> 
>> > As far as I can tell netfront has a pool of grant references and it
>> > will BUG_ON() if there are no grefs in the pool when you request one.
>> > Since your DomU didn't crash, I suspect the book-keeping is still
>> > intact.
>> 
>> >> > Domain 1 seems to have increased its nr_grant_entries from 2048 to
>> >> > 3072 somewhere during the night.
>> >> > Domain 7 is the domain that happens to give the netfront messages.
>> >> 
>> >> > I also don't get why it is reporting the "Bad grant reference" for
>> >> > domain 0, which seems to have 0 active entries ..
>> >> > Also, is this amount of grant entries "normal", or could it be a leak
>> >> > somewhere?
>> >> 
>> 
>> > I suppose Dom0 expanding its maptrack is normal. I see it as well when I
>> > increase the number of domains. But if it keeps increasing while the
>> > number of DomUs stays the same, then it is not normal.
>> 
>> It keeps increasing (without (re)starting domains), although eventually it
>> looks like it settles at around a maptrack size of 31/256 frames.
>> 

> Then I guess that's reasonable. You have 15 DomUs after all...

>> 
>> > Presumably you only have netfront and blkfront using the grant table, and
>> > your workload as described below involved both, so it would be hard to
>> > tell which one is faulty.
>> 
>> > There are no immediate functional changes regarding slot counting in this
>> > dev cycle for the network driver. But there are some changes to
>> > blkfront/back which seem interesting (memory related).
>> 
>> Hmm, all the times I get a "Bad grant reference" it is related to that one
>> specific guest.
>> And it's not doing much blkback/front I/O (it provides webdav and rsync
>> access to network based storage (glusterfs)).
>> 

> OK. I misunderstood that you were rsync'ing from / to your VM disk.

> What does webdav do anyway? Does it have a specific traffic pattern?

The VM is a webdav store, and the storage for it is network based (at the
moment glusterfs in dom0).
Remote backup solutions use this to store backups with duplicity.

Besides that, an rsync script runs in the guest that syncs the storage with a
remote location.

So the webdav network traffic from outside to the VM causes about the same
amount of traffic from the VM to dom0,
so yes .. that path gets stretched and tested ;-)

>> Added some more printk's:
>> 
>> @@ -2072,7 +2076,11 @@ __gnttab_copy(
>>                                        &s_frame, &s_pg,
>>                                        &source_off, &source_len, 1);
>>          if ( rc != GNTST_okay )
>> -            goto error_out;
>> +            PIN_FAIL(error_out, GNTST_general_error,
>> +                     "?!?!? src_is_gref: aquire grant for copy failed 
>> current_dom_id:%d src_dom_id:%d dest_dom_id:%d\n",
>> +                     current->domain->domain_id, op->source.domid, 
>> op->dest.domid);
>> +
>> +
>>          have_s_grant = 1;
>>          if ( op->source.offset < source_off ||
>>               op->len > source_len )
>> @@ -2096,7 +2104,11 @@ __gnttab_copy(
>>                                        current->domain->domain_id, 0,
>>                                        &d_frame, &d_pg, &dest_off, 
>> &dest_len, 1);
>>          if ( rc != GNTST_okay )
>> -            goto error_out;
>> +            PIN_FAIL(error_out, GNTST_general_error,
>> +                     "?!?!? dest_is_gref: aquire grant for copy failed 
>> current_dom_id:%d src_dom_id:%d dest_dom_id:%d\n",
>> +                     current->domain->domain_id, op->source.domid, 
>> op->dest.domid);
>> +
>> +
>>          have_d_grant = 1;
>> 
>> 
>> this comes out:
>> 
>> (XEN) [2014-02-27 02:34:37] grant_table.c:2109:d0 ?!?!? dest_is_gref: acquire grant for copy failed current_dom_id:0 src_dom_id:32752 dest_dom_id:7
>> 

> If it fails in gnttab_copy then I very much suspect this is a network
> driver problem, since persistent grants in the blk driver don't use grant
> copy.

Does the dest_is_gref or src_is_gref case by any chance give some sort of direction?

>> 
>> > My suggestion is, if you have a working baseline, you can try to set up
>> > different frontend / backend combinations to help narrow down the
>> > problem.
>> 
>> Will see what I can do after the weekend.
>> 

> Thanks

>> > Wei.
>> 
>> <snip>
>> 



_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel
