
Re: [Xen-devel] [Xen-users] kernel 3.9.2 - xen 4.2.2/4.3rc1 => BUG unable to handle kernel paging request netif_poll+0x49c/0xe8



On 07/04/2013 05:01 PM, Wei Liu wrote:
>
>> I am running into this issue as well with the openSUSE 12.3
>> distribution. This is with their 3.7.10-1.16-xen kernel and Xen version
>> 4.2.1_12-1.12.10. On the net I see some discussion of people hitting
>> this issue but not that much.  E.g., one of the symptoms is that a guest
>> crashes when running zypper install or zypper update when the Internet
>> connection is fast enough.
>>
> Do you have references to other reports?
I will gather them and post them later.

>
>> OpenSUSE 3.4.x kernels run fine as guests on top of the openSUSE 12.3
>> Xen distribution, but apparently from 3.7.10 onwards this issue
>> appears.
>>
>> I spent already quite some time in getting grip on the issue. I added a
>> bug to bugzilla.novell.com but no response. See
>> https://bugzilla.novell.com/show_bug.cgi?id=826374 for details.
>> Apparently for hitting this bug (i.e. make it all the way to the crash),
>> it is required to use some hardware which performs not too slow. With
>> this I mean it is easy to find hardware which is unable to reproduce the
>                                                     able?
>> issue.
>>
> I'm not quite sure what you mean. Do you mean this bug can only be
> triggered when your receive path has a real hardware NIC involved?
>
> And reading your test case below it doesn't seem so. Dom0 to DomU
> transmission crashes the guest per your example.
Yes, a physical network card is not required. If you do send data to the
guest over a physical Ethernet card, it needs to run in 1 GbE mode; with
a 100 Mbit/s link I am unable to crash the guest.

If you use vif interfaces only, the data rate is high enough to crash it.
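
In case it helps anyone reproduce without zypper: any sustained bulk TCP
stream from dom0 (or another domU) into the guest should generate a
comparable load over the vif. A minimal sender along the lines of the
sketch below is the kind of thing I have in mind; the guest address and
port are placeholders, and the sink in the guest can simply be a netcat
listener redirected to /dev/null.

/*
 * Quick-and-dirty bulk sender: opens a TCP connection to the guest and
 * writes 64 KiB chunks in a tight loop until the connection breaks
 * (for instance because the guest has crashed).  Address and port are
 * placeholders.
 */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	const char *host = argc > 1 ? argv[1] : "192.168.1.10"; /* placeholder guest IP */
	int port = argc > 2 ? atoi(argv[2]) : 5001;             /* placeholder port */

	int fd = socket(AF_INET, SOCK_STREAM, 0);
	if (fd < 0) {
		perror("socket");
		return 1;
	}

	struct sockaddr_in sa;
	memset(&sa, 0, sizeof(sa));
	sa.sin_family = AF_INET;
	sa.sin_port = htons(port);
	if (inet_pton(AF_INET, host, &sa.sin_addr) != 1) {
		fprintf(stderr, "bad address: %s\n", host);
		return 1;
	}
	if (connect(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
		perror("connect");
		return 1;
	}

	static char buf[64 * 1024];  /* large writes -> large GSO frames on the vif */
	for (;;) {
		if (write(fd, buf, sizeof(buf)) < 0) {
			perror("write");
			break;
		}
	}
	close(fd);
	return 0;
}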

However, I also have openSUSE 12.3 Xen configurations running that do
not show this issue. My impression is that smaller systems (fewer CPU
cores and/or less memory bandwidth) do not reveal it.

>
>> In one of my recent experiments I changed the SLAB allocator to SLUB
>> which provides more detailed kernel logging. Here is the log output
>> after the first detected issue regarding xennet:
>>
> But the log below is not about SLUB. I cannot understand why SLAB vs.
> SLUB makes a difference.
I switched from SLAB to SLUB for its debugging functionality; the
openSUSE stock kernel uses SLAB.



>
>> Too many frags
>> 2013-07-03T23:51:27.092147+02:00 domUA kernel: [  108.094615] netfront:
>> Too many frags
>> 2013-07-03T23:51:27.492112+02:00 domUA kernel: [  108.494255] netfront:
>> Too many frags
>> 2013-07-03T23:51:27.520194+02:00 domUA kernel: [  108.522445]
> "Too many frags" means your frontend is generating malformed packets.
> This is not normal. And apparently you didn't use the latest kernel in
> tree because the log message should be "Too many slots" in the latest
> OpenSuSE kernel.
Yes, I have seen that, but I used the latest openSUSE kernel that ships
with openSUSE 12.3.


>> network_alloc_rx_buffers+0x76/0x5f0 [xennet]
>> 2013-07-03T23:51:27.679476+02:00 domUA kernel: [  108.671781]  
>> netif_poll+0xcf4/0xf30 [xennet]
>> 2013-07-03T23:51:27.679478+02:00 domUA kernel: [  108.671783]  
>> net_rx_action+0xf0/0x2e0
>>
> Seems like there's memory corruption in guest RX path.
As Jan already mentioned, it could be related to the kernel panics I
get, though it may also turn out to be a separate issue.

>>
>>
>> I am happy to assist in more kernel probing. It is even possible for me
>> to set up access to this machine for someone.
>>
> Excellent. Last time Jan suspected that we potentially overrun the frag
> list of a skb (which would corrupt memory) but it has not been verified.
>
> I also skimmed your bug report on novell bugzilla which did suggest
> memory corruption.
>
> I wrote a patch to crash the kernel immediately when looping over the
> frag list, probably we could start from there? (You might need to adjust
> context, but it is only a one-liner which should be easy).
>
>
> Wei.
>
> ======
> diff --git a/drivers/xen/netfront/netfront.c b/drivers/xen/netfront/netfront.c
> index 6e5d233..9583011 100644
> --- a/drivers/xen/netfront/netfront.c
> +++ b/drivers/xen/netfront/netfront.c
> @@ -1306,6 +1306,7 @@ static RING_IDX xennet_fill_frags(struct netfront_info *np,
>         struct sk_buff *nskb;
>
>         while ((nskb = __skb_dequeue(list))) {
> +               BUG_ON(nr_frags >= MAX_SKB_FRAGS);
>                 struct netif_rx_response *rx =
>                         RING_GET_RESPONSE(&np->rx, ++cons);
>
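
If I understand the idea correctly: the frags[] array in skb_shinfo(skb)
only has MAX_SKB_FRAGS entries, so once nr_frags reaches that limit the
loop keeps storing past the end of the array and scribbles over whatever
happens to live behind it in memory, and the BUG_ON makes the guest stop
at exactly that point instead of crashing later in an unrelated place.
As a toy illustration (userspace only, with made-up names and layout,
not the kernel's):

/*
 * Userspace toy model of the overrun the BUG_ON is meant to catch.
 * The layout below is invented purely to make the corruption visible;
 * the real struct skb_shared_info looks different.
 */
#include <stdio.h>

#define FAKE_MAX_FRAGS 17                 /* stand-in for MAX_SKB_FRAGS */

struct fake_shinfo {
	unsigned int  nr_frags;
	unsigned long frags[FAKE_MAX_FRAGS];
	unsigned long field_a;            /* placed behind the array on purpose ... */
	unsigned long field_b;            /* ... so an overrun visibly clobbers them */
};

/* Fill loop roughly analogous in spirit to xennet_fill_frags(): append
 * one entry per backend response without checking that frags[] still
 * has room.  The patch's BUG_ON corresponds to failing fast here as
 * soon as nr_frags reaches the array size. */
static void fill_frags(struct fake_shinfo *si, int responses)
{
	int i;

	for (i = 0; i < responses; i++)
		si->frags[si->nr_frags++] = 0xdeadbeefUL;
}

int main(void)
{
	struct fake_shinfo si = { .field_a = 0x1111, .field_b = 0x2222 };

	fill_frags(&si, FAKE_MAX_FRAGS + 2);   /* two more responses than slots */

	printf("field_a = %#lx (was 0x1111)\n", si.field_a);
	printf("field_b = %#lx (was 0x2222)\n", si.field_b);
	return 0;
}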

I integrated the patch and obtained a crash dump; the log in it does not
show this BUG_ON firing. Here is the relevant section from the log:
var/lib/xen/dump/domUA # crash /root/vmlinux-p1 2013-0705-1347.43-domUA.1.core

[    7.670132] Adding 4192252k swap on /dev/xvda1.  Priority:-1 extents:1 across:4192252k SS
[   10.204340] NET: Registered protocol family 17
[  481.534979] netfront: Too many frags
[  487.543946] netfront: Too many frags
[  491.049458] netfront: Too many frags
[  491.491153] ------------[ cut here ]------------
[  491.491628] kernel BUG at drivers/xen/netfront/netfront.c:1295!
[  491.492056] invalid opcode: 0000 [#1] SMP
[  491.492056] Modules linked in: af_packet autofs4 xennet xenblk cdrom
[  491.492056] CPU 0
[  491.492056] Pid: 1471, comm: sshd Not tainted 3.7.10-1.16-dbg-p1-xen #8 
[  491.492056] RIP: e030:[<ffffffffa0023aef>]  [<ffffffffa0023aef>] netif_poll+0xe4f/0xf90 [xennet]
[  491.492056] RSP: e02b:ffff8801f5803c60  EFLAGS: 00010202
[  491.492056] RAX: ffff8801f5803da0 RBX: ffff8801f1a082c0 RCX: 0000000180200010
[  491.492056] RDX: ffff8801f5803da0 RSI: ffff8801fe83ec80 RDI: ffff8801f03b2900
[  491.492056] RBP: ffff8801f5803e20 R08: 0000000000000001 R09: 0000000000000000
[  491.492056] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801f03b3400
[  491.492056] R13: 0000000000000011 R14: 000000000004327e R15: ffff8801f06009c0
[  491.492056] FS:  00007fc519f3d7c0(0000) GS:ffff8801f5800000(0000) knlGS:0000000000000000
[  491.492056] CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
[  491.492056] CR2: 00007fc51410c400 CR3: 00000001f1430000 CR4: 0000000000002660
[  491.492056] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  491.492056] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  491.492056] Process sshd (pid: 1471, threadinfo ffff8801f1264000, task ffff8801f137bf00)
[  491.492056] Stack:
[  491.492056]  ffff8801f5803d60 ffffffff8008503e ffff8801f0600a40 ffff8801f0600000
[  491.492056]  0004328000000040 0000001200000000 ffff8801f5810570 ffff8801f0600a78
[  491.492056]  0000000000000000 ffff8801f0601fb0 0004326e00000012 ffff8801f5803d00
[  491.492056] Call Trace:
[  491.492056]  [<ffffffff8041ee35>] net_rx_action+0xd5/0x250
[  491.492056]  [<ffffffff800376d8>] __do_softirq+0xe8/0x230
[  491.492056]  [<ffffffff8051151c>] call_softirq+0x1c/0x30
[  491.492056]  [<ffffffff80008a75>] do_softirq+0x75/0xd0
[  491.492056]  [<ffffffff800379f5>] irq_exit+0xb5/0xc0
[  491.492056]  [<ffffffff8036c225>] evtchn_do_upcall+0x295/0x2d0
[  491.492056]  [<ffffffff8051114e>] do_hypervisor_callback+0x1e/0x30
[  491.492056]  [<00007fc519f97700>] 0x7fc519f976ff
[  491.492056] Code: ff 0f 1f 00 e8 a3 c1 40 e0 85 c0 90 75 69 44 89 ea 4c 89 f6 4c 89 ff e8 f0 cb ff ff c7 85 80 fe ff ff ea ff ff ff e9 7c f4 ff ff <0f> 0b ba 12 00 00 00 48 01 d0 48 39 c1 0f 82 bd fc ff ff e9 e9
[  491.492056] RIP  [<ffffffffa0023aef>] netif_poll+0xe4f/0xf90 [xennet]
[  491.492056]  RSP <ffff8801f5803c60>
[  491.511975] ---[ end trace c9e37475f12e1aaf ]---
[  491.512877] Kernel panic - not syncing: Fatal exception in interrupt

In the meantime Jan has taken the bug in bugzilla
(https://bugzilla.novell.com/show_bug.cgi?id=826374) and created a first
patch. I propose we continue the discussion there and post the
conclusion to this list to close this thread as well.


Dion
