This bug seems most visible in the Debian kernel, but I was able to
reproduce it on every kernel I had available (SUSE 2.6.34 and RHEL
2.6.18). The only solution I found that stops the OOM killer from
going after innocent processes is to disable memory overcommitment
(a minimal sketch follows below):
1) Set up swap equal to 50% of RAM or more
2) Set vm.overcommit_memory = 2
With this configuration only the Debian Lenny kernel is still buggy
(forget it and throw it away); all other kernels work fine: they NEVER
enter an OOM state (though they can still raise MemoryError when
memory is genuinely exhausted).
If you disable the swap file, all overcommitted memory must come from
real memory, so you hit the MemoryError state before real memory
actually runs out.
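For reference, a minimal sketch of that setup as I would apply it in
dom0 (the swap file path and size below are only examples, adjust them
to your RAM; vm.overcommit_ratio=50 is just the kernel default made
explicit):

  # Create swap sized at ~50% of RAM (example: 8 GB RAM -> 4 GB swap).
  # The path and size are illustrative only.
  dd if=/dev/zero of=/swapfile bs=1M count=4096
  mkswap /swapfile
  swapon /swapfile

  # Disable overcommit: committed memory is capped at swap +
  # overcommit_ratio% of RAM, so allocations fail cleanly (MemoryError
  # in Python) instead of waking the OOM killer.
  sysctl -w vm.overcommit_memory=2
  sysctl -w vm.overcommit_ratio=50

  # Persist across reboots.
  echo "vm.overcommit_memory = 2" >> /etc/sysctl.conf
  echo "vm.overcommit_ratio = 50" >> /etc/sysctl.conf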
On Fri, 12/11/2010 at 23:57 -0800, John Weekes wrote:
> On machines running many HVM (stubdom-based) domains, I often see errors
> like this:
>
> [77176.524094] qemu-dm invoked oom-killer: gfp_mask=0xd0, order=0, oom_adj=0
> [77176.524102] Pid: 7478, comm: qemu-dm Not tainted 2.6.32.25-g80f7e08 #2
> [77176.524109] Call Trace:
> [77176.524123] [<ffffffff810897fd>] ? T.413+0xcd/0x290
> [77176.524129] [<ffffffff81089ad3>] ? __out_of_memory+0x113/0x180
> [77176.524133] [<ffffffff81089b9e>] ? out_of_memory+0x5e/0xc0
> [77176.524140] [<ffffffff8108d1cb>] ? __alloc_pages_nodemask+0x69b/0x6b0
> [77176.524144] [<ffffffff8108d1f2>] ? __get_free_pages+0x12/0x60
> [77176.524152] [<ffffffff810c94e7>] ? __pollwait+0xb7/0x110
> [77176.524161] [<ffffffff81262b93>] ? n_tty_poll+0x183/0x1d0
> [77176.524165] [<ffffffff8125ea42>] ? tty_poll+0x92/0xa0
> [77176.524169] [<ffffffff810c8a92>] ? do_select+0x362/0x670
> [77176.524173] [<ffffffff810c9430>] ? __pollwait+0x0/0x110
> [77176.524178] [<ffffffff810c9540>] ? pollwake+0x0/0x60
> [77176.524183] [<ffffffff810c9540>] ? pollwake+0x0/0x60
> [77176.524188] [<ffffffff810c9540>] ? pollwake+0x0/0x60
> [77176.524193] [<ffffffff810c9540>] ? pollwake+0x0/0x60
> [77176.524197] [<ffffffff810c9540>] ? pollwake+0x0/0x60
> [77176.524202] [<ffffffff810c9540>] ? pollwake+0x0/0x60
> [77176.524207] [<ffffffff810c9540>] ? pollwake+0x0/0x60
> [77176.524212] [<ffffffff810c9540>] ? pollwake+0x0/0x60
> [77176.524217] [<ffffffff810c9540>] ? pollwake+0x0/0x60
> [77176.524222] [<ffffffff810c8fb5>] ? core_sys_select+0x215/0x350
> [77176.524231] [<ffffffff810100af>] ? xen_restore_fl_direct_end+0x0/0x1
> [77176.524236] [<ffffffff8100c48d>] ? xen_mc_flush+0x8d/0x1b0
> [77176.524243] [<ffffffff81014ffb>] ? xen_hypervisor_callback+0x1b/0x20
> [77176.524251] [<ffffffff814b0f5a>] ? error_exit+0x2a/0x60
> [77176.524255] [<ffffffff8101485d>] ? retint_restore_args+0x5/0x6
> [77176.524263] [<ffffffff8102fd3d>] ? pvclock_clocksource_read+0x4d/0xb0
> [77176.524268] [<ffffffff8102fd3d>] ? pvclock_clocksource_read+0x4d/0xb0
> [77176.524276] [<ffffffff810663d1>] ? ktime_get_ts+0x61/0xd0
> [77176.524281] [<ffffffff810c9354>] ? sys_select+0x44/0x120
> [77176.524286] [<ffffffff81013f02>] ? system_call_fastpath+0x16/0x1b
> [77176.524290] Mem-Info:
> [77176.524293] DMA per-cpu:
> [77176.524296] CPU 0: hi: 0, btch: 1 usd: 0
> [77176.524300] CPU 1: hi: 0, btch: 1 usd: 0
> [77176.524303] CPU 2: hi: 0, btch: 1 usd: 0
> [77176.524306] CPU 3: hi: 0, btch: 1 usd: 0
> [77176.524310] CPU 4: hi: 0, btch: 1 usd: 0
> [77176.524313] CPU 5: hi: 0, btch: 1 usd: 0
> [77176.524316] CPU 6: hi: 0, btch: 1 usd: 0
> [77176.524318] CPU 7: hi: 0, btch: 1 usd: 0
> [77176.524322] CPU 8: hi: 0, btch: 1 usd: 0
> [77176.524324] CPU 9: hi: 0, btch: 1 usd: 0
> [77176.524327] CPU 10: hi: 0, btch: 1 usd: 0
> [77176.524330] CPU 11: hi: 0, btch: 1 usd: 0
> [77176.524333] CPU 12: hi: 0, btch: 1 usd: 0
> [77176.524336] CPU 13: hi: 0, btch: 1 usd: 0
> [77176.524339] CPU 14: hi: 0, btch: 1 usd: 0
> [77176.524342] CPU 15: hi: 0, btch: 1 usd: 0
> [77176.524345] CPU 16: hi: 0, btch: 1 usd: 0
> [77176.524348] CPU 17: hi: 0, btch: 1 usd: 0
> [77176.524351] CPU 18: hi: 0, btch: 1 usd: 0
> [77176.524354] CPU 19: hi: 0, btch: 1 usd: 0
> [77176.524358] CPU 20: hi: 0, btch: 1 usd: 0
> [77176.524364] CPU 21: hi: 0, btch: 1 usd: 0
> [77176.524367] CPU 22: hi: 0, btch: 1 usd: 0
> [77176.524370] CPU 23: hi: 0, btch: 1 usd: 0
> [77176.524372] DMA32 per-cpu:
> [77176.524374] CPU 0: hi: 186, btch: 31 usd: 81
> [77176.524377] CPU 1: hi: 186, btch: 31 usd: 66
> [77176.524380] CPU 2: hi: 186, btch: 31 usd: 49
> [77176.524385] CPU 3: hi: 186, btch: 31 usd: 67
> [77176.524387] CPU 4: hi: 186, btch: 31 usd: 93
> [77176.524390] CPU 5: hi: 186, btch: 31 usd: 73
> [77176.524393] CPU 6: hi: 186, btch: 31 usd: 50
> [77176.524396] CPU 7: hi: 186, btch: 31 usd: 79
> [77176.524399] CPU 8: hi: 186, btch: 31 usd: 21
> [77176.524402] CPU 9: hi: 186, btch: 31 usd: 38
> [77176.524406] CPU 10: hi: 186, btch: 31 usd: 0
> [77176.524409] CPU 11: hi: 186, btch: 31 usd: 75
> [77176.524412] CPU 12: hi: 186, btch: 31 usd: 1
> [77176.524414] CPU 13: hi: 186, btch: 31 usd: 4
> [77176.524417] CPU 14: hi: 186, btch: 31 usd: 9
> [77176.524420] CPU 15: hi: 186, btch: 31 usd: 0
> [77176.524423] CPU 16: hi: 186, btch: 31 usd: 56
> [77176.524426] CPU 17: hi: 186, btch: 31 usd: 35
> [77176.524429] CPU 18: hi: 186, btch: 31 usd: 32
> [77176.524432] CPU 19: hi: 186, btch: 31 usd: 39
> [77176.524435] CPU 20: hi: 186, btch: 31 usd: 24
> [77176.524438] CPU 21: hi: 186, btch: 31 usd: 0
> [77176.524441] CPU 22: hi: 186, btch: 31 usd: 35
> [77176.524444] CPU 23: hi: 186, btch: 31 usd: 51
> [77176.524447] Normal per-cpu:
> [77176.524449] CPU 0: hi: 186, btch: 31 usd: 29
> [77176.524453] CPU 1: hi: 186, btch: 31 usd: 1
> [77176.524456] CPU 2: hi: 186, btch: 31 usd: 30
> [77176.524459] CPU 3: hi: 186, btch: 31 usd: 30
> [77176.524463] CPU 4: hi: 186, btch: 31 usd: 30
> [77176.524466] CPU 5: hi: 186, btch: 31 usd: 31
> [77176.524469] CPU 6: hi: 186, btch: 31 usd: 0
> [77176.524471] CPU 7: hi: 186, btch: 31 usd: 0
> [77176.524474] CPU 8: hi: 186, btch: 31 usd: 30
> [77176.524477] CPU 9: hi: 186, btch: 31 usd: 28
> [77176.524480] CPU 10: hi: 186, btch: 31 usd: 0
> [77176.524483] CPU 11: hi: 186, btch: 31 usd: 30
> [77176.524486] CPU 12: hi: 186, btch: 31 usd: 0
> [77176.524489] CPU 13: hi: 186, btch: 31 usd: 0
> [77176.524492] CPU 14: hi: 186, btch: 31 usd: 0
> [77176.524495] CPU 15: hi: 186, btch: 31 usd: 0
> [77176.524498] CPU 16: hi: 186, btch: 31 usd: 0
> [77176.524501] CPU 17: hi: 186, btch: 31 usd: 0
> [77176.524504] CPU 18: hi: 186, btch: 31 usd: 0
> [77176.524507] CPU 19: hi: 186, btch: 31 usd: 0
> [77176.524510] CPU 20: hi: 186, btch: 31 usd: 0
> [77176.524513] CPU 21: hi: 186, btch: 31 usd: 0
> [77176.524516] CPU 22: hi: 186, btch: 31 usd: 0
> [77176.524518] CPU 23: hi: 186, btch: 31 usd: 0
> [77176.524524] active_anon:5675 inactive_anon:4676 isolated_anon:0
> [77176.524526] active_file:146373 inactive_file:153543 isolated_file:480
> [77176.524527] unevictable:0 dirty:167539 writeback:322 unstable:0
> [77176.524528] free:5017 slab_reclaimable:15640 slab_unreclaimable:8972
> [77176.524529] mapped:1114 shmem:7 pagetables:1908 bounce:0
> [77176.524536] DMA free:9820kB min:32kB low:40kB high:48kB
> active_anon:4kB inactive_anon:0kB active_file:616kB inactive_file:2212kB
> unevictable:0kB isolated(anon):0kB isolated(file):0kB present:12740kB
> mlocked:0kB dirty:2292kB writeback:0kB mapped:0kB shmem:0kB
> slab_reclaimable:72kB slab_unreclaimable:108kB kernel_stack:0kB
> pagetables:12kB unstable:0kB bounce:0kB writeback_tmp:0kB
> pages_scanned:3040 all_unreclaimable? no
> [77176.524541] lowmem_reserve[]: 0 1428 2452 2452
> [77176.524551] DMA32 free:7768kB min:3680kB low:4600kB high:5520kB
> active_anon:22696kB inactive_anon:18704kB active_file:584580kB
> inactive_file:608508kB unevictable:0kB isolated(anon):0kB
> isolated(file):1920kB present:1462496kB mlocked:0kB dirty:664128kB
> writeback:1276kB mapped:4456kB shmem:28kB slab_reclaimable:62076kB
> slab_unreclaimable:32292kB kernel_stack:5120kB pagetables:7620kB
> unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1971808
> all_unreclaimable? yes
> [77176.524556] lowmem_reserve[]: 0 0 1024 1024
> [77176.524564] Normal free:2480kB min:2636kB low:3292kB high:3952kB
> active_anon:0kB inactive_anon:0kB active_file:296kB inactive_file:3452kB
> unevictable:0kB isolated(anon):0kB isolated(file):0kB present:1048700kB
> mlocked:0kB dirty:3736kB writeback:12kB mapped:0kB shmem:0kB
> slab_reclaimable:412kB slab_unreclaimable:3488kB kernel_stack:80kB
> pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB
> pages_scanned:8192 all_unreclaimable? yes
> [77176.524569] lowmem_reserve[]: 0 0 0 0
> [77176.524574] DMA: 4*4kB 25*8kB 11*16kB 7*32kB 8*64kB 8*128kB 8*256kB
> 3*512kB 0*1024kB 0*2048kB 1*4096kB = 9832kB
> [77176.524587] DMA32: 742*4kB 118*8kB 3*16kB 3*32kB 2*64kB 0*128kB
> 0*256kB 1*512kB 1*1024kB 1*2048kB 0*4096kB = 7768kB
> [77176.524600] Normal: 1*4kB 1*8kB 2*16kB 13*32kB 14*64kB 2*128kB
> 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 1612kB
> [77176.524613] 302308 total pagecache pages
> [77176.524615] 1619 pages in swap cache
> [77176.524617] Swap cache stats: add 40686, delete 39067, find 24687/26036
> [77176.524619] Free swap = 10141956kB
> [77176.524621] Total swap = 10239992kB
> [77176.577607] 793456 pages RAM
> [77176.577611] 436254 pages reserved
> [77176.577613] 308627 pages shared
> [77176.577615] 49249 pages non-shared
> [77176.577620] Out of memory: kill process 5755 (python2.6) score 110492
> or a child
> [77176.577623] Killed process 5757 (python2.6)
>
> Depending on what gets nuked by the OOM-killer, I am frequently left
> with an unusable system that needs to be rebooted.
>
> The machine always has plenty of memory available (1.5 GB devoted to
> dom0, of which >1 GB is always just in "cached" state). For instance,
> right now, on this same machine:
>
> # free
> total used free shared buffers cached
> Mem: 1536512 1493112 43400 0 10284 1144904
> -/+ buffers/cache: 337924 1198588
> Swap: 10239992 74444 10165548
>
> I have seen this OOM problem on a wide range of Xen versions, stretching
> as far back as I can remember, including the most recent 4.1-unstable
> and 2.6.32 pvops kernel (from yesterday, tested in the hope that they
> would fix this). I haven't found a way to reliably reproduce it yet,
> but I suspect that the problem relates to reasonably heavy disk or
> network activity -- during this last one, I see that a domain was
> briefly doing ~200 Mbps of downloads.
>
> Anyone have any ideas on what this could be? Is RAM getting
> spontaneously filled because a buffer somewhere grows too quickly, or
> something like that? What can I try here?
>
> -John
>
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel