This bug seems most visible in the Debian kernel, but I was able to
reproduce it on every kernel I had available (SUSE 2.6.34 and RHEL
2.6.18). The only solution I found that stops the OOM killer from
going after innocent processes is to disable memory overcommitment
(a minimal sketch follows below):
1) Set up swap equal to 50% of RAM or more
2) Set vm.overcommit_memory = 2
With this configuration only the Debian Lenny kernel is still buggy
(forget it and throw it away); all other kernels work fine: they NEVER
enter an OOM state (though they can still raise MemoryError when
memory is genuinely exhausted).
If you disable the swap file, all overcommitted memory must come from
real memory, so you hit the MemoryError state before real memory
actually runs out.
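For reference, a minimal sketch of that setup as I would apply it in
dom0 (the swap file path and size below are only examples, adjust them
to your RAM; vm.overcommit_ratio=50 is just the kernel default made
explicit):

  # Create swap sized at ~50% of RAM (example: 8 GB RAM -> 4 GB swap).
  # The path and size are illustrative only.
  dd if=/dev/zero of=/swapfile bs=1M count=4096
  mkswap /swapfile
  swapon /swapfile

  # Disable overcommit: committed memory is capped at swap +
  # overcommit_ratio% of RAM, so allocations fail cleanly (MemoryError
  # in Python) instead of waking the OOM killer.
  sysctl -w vm.overcommit_memory=2
  sysctl -w vm.overcommit_ratio=50

  # Persist across reboots.
  echo "vm.overcommit_memory = 2" >> /etc/sysctl.conf
  echo "vm.overcommit_ratio = 50" >> /etc/sysctl.conf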
On Fri, 12/11/2010 at 23:57 -0800, John Weekes wrote:
> On machines running many HVM (stubdom-based) domains, I often see errors
> like this:
>
> [77176.524094] qemu-dm invoked oom-killer: gfp_mask=0xd0, order=0, oom_adj=0
> [77176.524102] Pid: 7478, comm: qemu-dm Not tainted 2.6.32.25-g80f7e08 #2
> [77176.524109] Call Trace:
> [77176.524123] [<ffffffff810897fd>] ? T.413+0xcd/0x290
> [77176.524129] [<ffffffff81089ad3>] ? __out_of_memory+0x113/0x180
> [77176.524133] [<ffffffff81089b9e>] ? out_of_memory+0x5e/0xc0
> [77176.524140] [<ffffffff8108d1cb>] ? __alloc_pages_nodemask+0x69b/0x6b0
> [77176.524144] [<ffffffff8108d1f2>] ? __get_free_pages+0x12/0x60
> [77176.524152] [<ffffffff810c94e7>] ? __pollwait+0xb7/0x110
> [77176.524161] [<ffffffff81262b93>] ? n_tty_poll+0x183/0x1d0
> [77176.524165] [<ffffffff8125ea42>] ? tty_poll+0x92/0xa0
> [77176.524169] [<ffffffff810c8a92>] ? do_select+0x362/0x670
> [77176.524173] [<ffffffff810c9430>] ? __pollwait+0x0/0x110
> [77176.524178] [<ffffffff810c9540>] ? pollwake+0x0/0x60
> [77176.524183] [<ffffffff810c9540>] ? pollwake+0x0/0x60
> [77176.524188] [<ffffffff810c9540>] ? pollwake+0x0/0x60
> [77176.524193] [<ffffffff810c9540>] ? pollwake+0x0/0x60
> [77176.524197] [<ffffffff810c9540>] ? pollwake+0x0/0x60
> [77176.524202] [<ffffffff810c9540>] ? pollwake+0x0/0x60
> [77176.524207] [<ffffffff810c9540>] ? pollwake+0x0/0x60
> [77176.524212] [<ffffffff810c9540>] ? pollwake+0x0/0x60
> [77176.524217] [<ffffffff810c9540>] ? pollwake+0x0/0x60
> [77176.524222] [<ffffffff810c8fb5>] ? core_sys_select+0x215/0x350
> [77176.524231] [<ffffffff810100af>] ? xen_restore_fl_direct_end+0x0/0x1
> [77176.524236] [<ffffffff8100c48d>] ? xen_mc_flush+0x8d/0x1b0
> [77176.524243] [<ffffffff81014ffb>] ? xen_hypervisor_callback+0x1b/0x20
> [77176.524251] [<ffffffff814b0f5a>] ? error_exit+0x2a/0x60
> [77176.524255] [<ffffffff8101485d>] ? retint_restore_args+0x5/0x6
> [77176.524263] [<ffffffff8102fd3d>] ? pvclock_clocksource_read+0x4d/0xb0
> [77176.524268] [<ffffffff8102fd3d>] ? pvclock_clocksource_read+0x4d/0xb0
> [77176.524276] [<ffffffff810663d1>] ? ktime_get_ts+0x61/0xd0
> [77176.524281] [<ffffffff810c9354>] ? sys_select+0x44/0x120
> [77176.524286] [<ffffffff81013f02>] ? system_call_fastpath+0x16/0x1b
> [77176.524290] Mem-Info:
> [77176.524293] DMA per-cpu:
> [77176.524296] CPU 0: hi: 0, btch: 1 usd: 0
> [77176.524300] CPU 1: hi: 0, btch: 1 usd: 0
> [77176.524303] CPU 2: hi: 0, btch: 1 usd: 0
> [77176.524306] CPU 3: hi: 0, btch: 1 usd: 0
> [77176.524310] CPU 4: hi: 0, btch: 1 usd: 0
> [77176.524313] CPU 5: hi: 0, btch: 1 usd: 0
> [77176.524316] CPU 6: hi: 0, btch: 1 usd: 0
> [77176.524318] CPU 7: hi: 0, btch: 1 usd: 0
> [77176.524322] CPU 8: hi: 0, btch: 1 usd: 0
> [77176.524324] CPU 9: hi: 0, btch: 1 usd: 0
> [77176.524327] CPU 10: hi: 0, btch: 1 usd: 0
> [77176.524330] CPU 11: hi: 0, btch: 1 usd: 0
> [77176.524333] CPU 12: hi: 0, btch: 1 usd: 0
> [77176.524336] CPU 13: hi: 0, btch: 1 usd: 0
> [77176.524339] CPU 14: hi: 0, btch: 1 usd: 0
> [77176.524342] CPU 15: hi: 0, btch: 1 usd: 0
> [77176.524345] CPU 16: hi: 0, btch: 1 usd: 0
> [77176.524348] CPU 17: hi: 0, btch: 1 usd: 0
> [77176.524351] CPU 18: hi: 0, btch: 1 usd: 0
> [77176.524354] CPU 19: hi: 0, btch: 1 usd: 0
> [77176.524358] CPU 20: hi: 0, btch: 1 usd: 0
> [77176.524364] CPU 21: hi: 0, btch: 1 usd: 0
> [77176.524367] CPU 22: hi: 0, btch: 1 usd: 0
> [77176.524370] CPU 23: hi: 0, btch: 1 usd: 0
> [77176.524372] DMA32 per-cpu:
> [77176.524374] CPU 0: hi: 186, btch: 31 usd: 81
> [77176.524377] CPU 1: hi: 186, btch: 31 usd: 66
> [77176.524380] CPU 2: hi: 186, btch: 31 usd: 49
> [77176.524385] CPU 3: hi: 186, btch: 31 usd: 67
> [77176.524387] CPU 4: hi: 186, btch: 31 usd: 93
> [77176.524390] CPU 5: hi: 186, btch: 31 usd: 73
> [77176.524393] CPU 6: hi: 186, btch: 31 usd: 50
> [77176.524396] CPU 7: hi: 186, btch: 31 usd: 79
> [77176.524399] CPU 8: hi: 186, btch: 31 usd: 21
> [77176.524402] CPU 9: hi: 186, btch: 31 usd: 38
> [77176.524406] CPU 10: hi: 186, btch: 31 usd: 0
> [77176.524409] CPU 11: hi: 186, btch: 31 usd: 75
> [77176.524412] CPU 12: hi: 186, btch: 31 usd: 1
> [77176.524414] CPU 13: hi: 186, btch: 31 usd: 4
> [77176.524417] CPU 14: hi: 186, btch: 31 usd: 9
> [77176.524420] CPU 15: hi: 186, btch: 31 usd: 0
> [77176.524423] CPU 16: hi: 186, btch: 31 usd: 56
> [77176.524426] CPU 17: hi: 186, btch: 31 usd: 35
> [77176.524429] CPU 18: hi: 186, btch: 31 usd: 32
> [77176.524432] CPU 19: hi: 186, btch: 31 usd: 39
> [77176.524435] CPU 20: hi: 186, btch: 31 usd: 24
> [77176.524438] CPU 21: hi: 186, btch: 31 usd: 0
> [77176.524441] CPU 22: hi: 186, btch: 31 usd: 35
> [77176.524444] CPU 23: hi: 186, btch: 31 usd: 51
> [77176.524447] Normal per-cpu:
> [77176.524449] CPU 0: hi: 186, btch: 31 usd: 29
> [77176.524453] CPU 1: hi: 186, btch: 31 usd: 1
> [77176.524456] CPU 2: hi: 186, btch: 31 usd: 30
> [77176.524459] CPU 3: hi: 186, btch: 31 usd: 30
> [77176.524463] CPU 4: hi: 186, btch: 31 usd: 30
> [77176.524466] CPU 5: hi: 186, btch: 31 usd: 31
> [77176.524469] CPU 6: hi: 186, btch: 31 usd: 0
> [77176.524471] CPU 7: hi: 186, btch: 31 usd: 0
> [77176.524474] CPU 8: hi: 186, btch: 31 usd: 30
> [77176.524477] CPU 9: hi: 186, btch: 31 usd: 28
> [77176.524480] CPU 10: hi: 186, btch: 31 usd: 0
> [77176.524483] CPU 11: hi: 186, btch: 31 usd: 30
> [77176.524486] CPU 12: hi: 186, btch: 31 usd: 0
> [77176.524489] CPU 13: hi: 186, btch: 31 usd: 0
> [77176.524492] CPU 14: hi: 186, btch: 31 usd: 0
> [77176.524495] CPU 15: hi: 186, btch: 31 usd: 0
> [77176.524498] CPU 16: hi: 186, btch: 31 usd: 0
> [77176.524501] CPU 17: hi: 186, btch: 31 usd: 0
> [77176.524504] CPU 18: hi: 186, btch: 31 usd: 0
> [77176.524507] CPU 19: hi: 186, btch: 31 usd: 0
> [77176.524510] CPU 20: hi: 186, btch: 31 usd: 0
> [77176.524513] CPU 21: hi: 186, btch: 31 usd: 0
> [77176.524516] CPU 22: hi: 186, btch: 31 usd: 0
> [77176.524518] CPU 23: hi: 186, btch: 31 usd: 0
> [77176.524524] active_anon:5675 inactive_anon:4676 isolated_anon:0
> [77176.524526] active_file:146373 inactive_file:153543 isolated_file:480
> [77176.524527] unevictable:0 dirty:167539 writeback:322 unstable:0
> [77176.524528] free:5017 slab_reclaimable:15640 slab_unreclaimable:8972
> [77176.524529] mapped:1114 shmem:7 pagetables:1908 bounce:0
> [77176.524536] DMA free:9820kB min:32kB low:40kB high:48kB
> active_anon:4kB inactive_anon:0kB active_file:616kB inactive_file:2212kB
> unevictable:0kB isolated(anon):0kB isolated(file):0kB present:12740kB
> mlocked:0kB dirty:2292kB writeback:0kB mapped:0kB shmem:0kB
> slab_reclaimable:72kB slab_unreclaimable:108kB kernel_stack:0kB
> pagetables:12kB unstable:0kB bounce:0kB writeback_tmp:0kB
> pages_scanned:3040 all_unreclaimable? no
> [77176.524541] lowmem_reserve[]: 0 1428 2452 2452
> [77176.524551] DMA32 free:7768kB min:3680kB low:4600kB high:5520kB
> active_anon:22696kB inactive_anon:18704kB active_file:584580kB
> inactive_file:608508kB unevictable:0kB isolated(anon):0kB
> isolated(file):1920kB present:1462496kB mlocked:0kB dirty:664128kB
> writeback:1276kB mapped:4456kB shmem:28kB slab_reclaimable:62076kB
> slab_unreclaimable:32292kB kernel_stack:5120kB pagetables:7620kB
> unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1971808
> all_unreclaimable? yes
> [77176.524556] lowmem_reserve[]: 0 0 1024 1024
> [77176.524564] Normal free:2480kB min:2636kB low:3292kB high:3952kB
> active_anon:0kB inactive_anon:0kB active_file:296kB inactive_file:3452kB
> unevictable:0kB isolated(anon):0kB isolated(file):0kB present:1048700kB
> mlocked:0kB dirty:3736kB writeback:12kB mapped:0kB shmem:0kB
> slab_reclaimable:412kB slab_unreclaimable:3488kB kernel_stack:80kB
> pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB
> pages_scanned:8192 all_unreclaimable? yes
> [77176.524569] lowmem_reserve[]: 0 0 0 0
> [77176.524574] DMA: 4*4kB 25*8kB 11*16kB 7*32kB 8*64kB 8*128kB 8*256kB
> 3*512kB 0*1024kB 0*2048kB 1*4096kB = 9832kB
> [77176.524587] DMA32: 742*4kB 118*8kB 3*16kB 3*32kB 2*64kB 0*128kB
> 0*256kB 1*512kB 1*1024kB 1*2048kB 0*4096kB = 7768kB
> [77176.524600] Normal: 1*4kB 1*8kB 2*16kB 13*32kB 14*64kB 2*128kB
> 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 1612kB
> [77176.524613] 302308 total pagecache pages
> [77176.524615] 1619 pages in swap cache
> [77176.524617] Swap cache stats: add 40686, delete 39067, find 24687/26036
> [77176.524619] Free swap = 10141956kB
> [77176.524621] Total swap = 10239992kB
> [77176.577607] 793456 pages RAM
> [77176.577611] 436254 pages reserved
> [77176.577613] 308627 pages shared
> [77176.577615] 49249 pages non-shared
> [77176.577620] Out of memory: kill process 5755 (python2.6) score 110492
> or a child
> [77176.577623] Killed process 5757 (python2.6)
>
> Depending on what gets nuked by the OOM-killer, I am frequently left
> with an unusable system that needs to be rebooted.
>
> The machine always has plenty of memory available (1.5 GB devoted to
> dom0, of which >1 GB is always just in "cached" state). For instance,
> right now, on this same machine:
>
> # free
> total used free shared buffers cached
> Mem: 1536512 1493112 43400 0 10284 1144904
> -/+ buffers/cache: 337924 1198588
> Swap: 10239992 74444 10165548
>
> I have seen this OOM problem on a wide range of Xen versions, stretching
> as far back as I can remember, including the most recent 4.1-unstable
> and 2.6.32 pvops kernel (from yesterday, tested in the hope that they
> would fix this). I haven't found a way to reliably reproduce it yet,
> but I suspect that the problem relates to reasonably heavy disk or
> network activity -- during this last one, I see that a domain was
> briefly doing ~200 Mbps of downloads.
>
> Anyone have any ideas on what this could be? Is RAM getting
> spontaneously filled because a buffer somewhere grows too quickly, or
> something like that? What can I try here?
>
> -John
>
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel