
Re: [xen-unstable test] 164996: regressions - FAIL


  • To: Stefano Stabellini <sstabellini@xxxxxxxxxx>, Ian Jackson <iwj@xxxxxxxxxxxxxx>
  • From: Jan Beulich <jbeulich@xxxxxxxx>
  • Date: Wed, 22 Sep 2021 09:34:46 +0200
  • Cc: xen-devel@xxxxxxxxxxxxxxxxxxxx, dpsmith@xxxxxxxxxxxxxxxxxxxx
  • Delivery-date: Wed, 22 Sep 2021 07:35:08 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 22.09.2021 01:38, Stefano Stabellini wrote:
> On Mon, 20 Sep 2021, Ian Jackson wrote:
>> Jan Beulich writes ("Re: [xen-unstable test] 164996: regressions - FAIL"):
>>> As per
>>>
>>> Sep 15 14:44:55.502598 [ 1613.322585] Mem-Info:
>>> Sep 15 14:44:55.502643 [ 1613.324918] active_anon:5639 inactive_anon:15857 isolated_anon:0
>>> Sep 15 14:44:55.514480 [ 1613.324918]  active_file:13286 inactive_file:11182 isolated_file:0
>>> Sep 15 14:44:55.514545 [ 1613.324918]  unevictable:0 dirty:30 writeback:0 unstable:0
>>> Sep 15 14:44:55.526477 [ 1613.324918]  slab_reclaimable:10922 slab_unreclaimable:30234
>>> Sep 15 14:44:55.526540 [ 1613.324918]  mapped:11277 shmem:10975 pagetables:401 bounce:0
>>> Sep 15 14:44:55.538474 [ 1613.324918]  free:8364 free_pcp:100 free_cma:1650
>>>
>>> the system doesn't look to really be out of memory; as per
>>>
>>> Sep 15 14:44:55.598538 [ 1613.419061] DMA32: 2788*4kB (UMEC) 890*8kB (UMEC) 497*16kB (UMEC) 36*32kB (UMC) 1*64kB (C) 1*128kB (C) 9*256kB (C) 7*512kB (C) 0*1024kB 0*2048kB 0*4096kB = 33456kB
>>>
>>> there even look to be a number of higher order pages available (albeit
>>> without digging I can't tell what "(C)" means). Nevertheless order-4
>>> allocations aren't really nice.
>>
>> The host history suggests this may be related to a qemu update.
>>
>> http://logs.test-lab.xenproject.org/osstest/results/host/rochester0.html

Stefano - given some of the investigation you detail further down, I
wonder whether you had seen this part of Ian's reply. (The question
then, of course, is how that qemu update managed to get pushed.)

>> The grub cfg has this:
>>
>>  multiboot /xen placeholder conswitch=x watchdog noreboot async-show-all console=dtuart dom0_mem=512M,max:512M ucode=scan  ${xen_rm_opts}
>>
>> It's not clear to me whether xen_rm_opts is "" or "no-real-mode edd=off".
> 
> I definitely recommend increasing dom0 memory, especially as I guess
> the box has a significant amount of it, far more than 4GB. I would
> set it to 2GB. Also, the syntax on ARM is simpler, so it should be
> just: dom0_mem=2G

Ian - I guess that's a relatively easy adjustment to make? I wonder,
though, whether we wouldn't want to address the underlying issue first.
Presumably not, because such a fix would likely take quite some time to
propagate suitably. But in that case we will want some way of verifying
that an eventual fix there would indeed have helped here.
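
Concretely - and purely as a sketch, assuming the remaining options
(including ${xen_rm_opts}) stay untouched and that the plain ARM syntax
Stefano quotes is all that's needed - the grub line would become:

  multiboot /xen placeholder conswitch=x watchdog noreboot async-show-all console=dtuart dom0_mem=2G ucode=scan ${xen_rm_opts}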

> In addition, I did some investigation, just in case there is actually
> a bug in the code and it is not a simple OOM problem.

I think the actual issue is quite clear; what I'm struggling with is
why we weren't hit by it earlier.

Imo, as always, non-order-0 allocations (perhaps excluding those made
while bringing up the kernel, or whichever other entity) are to be
avoided if at all possible. The offender in this case looks to be
privcmd's alloc_empty_pages(): for it to request what ends up being an
order-4 allocation through kcalloc(), the original
IOCTL_PRIVCMD_MMAPBATCH must specify a pretty large chunk of guest
memory to get mapped. That may in turn be questionable, but I'm afraid
I don't have the time to drill down into where that request is coming
from and whether it, too, wouldn't better be split up.
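
To put a rough number on "pretty large", assuming 4KiB pages and
8-byte pointers: an order-3 (32KiB) allocation already holds 4096
struct page pointers, covering 16MiB of guest memory. So for kcalloc()
to end up with an order-4 (64KiB) request, the ioctl must have asked
for more than 4096 pages, i.e. something over 16MiB (and up to 32MiB)
mapped in one go.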

The solution looks simple enough: convert from kcalloc() to kvcalloc().
I can certainly spin up a patch to Linux to this effect. Yet that still
won't answer the question of why this issue has popped up all of a
sudden (and hence whether there are things wanting changing elsewhere
as well).
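
For illustration only - a minimal sketch of the conversion I have in
mind, modelled loosely on what alloc_empty_pages() in
drivers/xen/privcmd.c looks like today (the actual patch may well
differ, and the kfree() of this array elsewhere in the file would need
to become kvfree() to match):

  static int alloc_empty_pages(struct vm_area_struct *vma, int numpgs)
  {
          struct page **pages;
          int rc;

          /*
           * kvcalloc() tries a physically contiguous kmalloc() first
           * and transparently falls back to vmalloc() for larger
           * sizes, so the order-4 page allocation seen in the log
           * can't occur any more.
           */
          pages = kvcalloc(numpgs, sizeof(pages[0]), GFP_KERNEL);
          if (pages == NULL)
                  return -ENOMEM;

          rc = xen_alloc_unpopulated_pages(numpgs, pages);
          if (rc != 0) {
                  kvfree(pages);  /* handles both allocation paths */
                  return -ENOMEM;
          }
          vma->vm_private_data = pages;

          return 0;
  }

One nice property of the kvcalloc()/kvfree() pair is that callers don't
need to know which of the two underlying allocators ended up satisfying
the request.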

Jan
