Xen project Mailing List

Re: Linux: balloon_process() causing workqueue lockups?

To: Jan Beulich <jbeulich@xxxxxxxx>, Boris Ostrovsky <boris.ostrovsky@xxxxxxxxxx>

From: Juergen Gross <jgross@xxxxxxxx>

Date: Fri, 27 Aug 2021 11:58:03 +0200

Cc: "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>

Delivery-date: Fri, 27 Aug 2021 09:58:18 +0000

List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 27.08.21 11:44, Jan Beulich wrote:

On 27.08.2021 11:29, Juergen Gross wrote:

On 27.08.21 11:01, Jan Beulich wrote:

Ballooning down Dom0 by about 16G in one go once in a while causes:

BUG: workqueue lockup - pool cpus=6 node=0 flags=0x0 nice=0 stuck for 64s!
Showing busy workqueues and worker pools:
workqueue events: flags=0x0
    pwq 12: cpus=6 node=0 flags=0x0 nice=0 active=2/256 refcnt=3
      in-flight: 229:balloon_process
      pending: cache_reap
workqueue events_freezable_power_: flags=0x84
    pwq 12: cpus=6 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
      pending: disk_events_workfn
workqueue mm_percpu_wq: flags=0x8
    pwq 12: cpus=6 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
      pending: vmstat_update
pool 12: cpus=6 node=0 flags=0x0 nice=0 hung=64s workers=3 idle: 2222 43

I've tried to double check that this isn't related to my IOMMU work
in the hypervisor, and I'm pretty sure it isn't. Looking at the
function I see it has a cond_resched(), but aiui this won't help
with further items in the same workqueue.

Thoughts?


I'm seeing two possible solutions here:

1. After some time (1 second?) in balloon_process(), set up a new
     workqueue activity and return (similar to EAGAIN, but without
     increasing the delay).

2. Don't use a workqueue for the ballooning activity, use a kernel
     thread instead.
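
Option 1 can be illustrated with a small user-space model. This is only
a sketch under stated assumptions, not kernel code: one simulated "tick"
stands for releasing one page, a single FIFO served by one worker stands
for the stuck pool, and all names (balloon_model, short_work, simulate,
QUEUE_CAP) are made up for the illustration. The point is only that a
work item which re-queues itself after a budget lets pending items such
as cache_reap run in between.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space model; no kernel API is used here. */
enum { QUEUE_CAP = 16 };

struct work;
typedef void (*work_fn)(struct work *w);

struct work {
    work_fn fn;
    long started_at;          /* tick of the first run, -1 if never run */
};

/* One FIFO served by a single worker, like the pool in the report. */
static struct work *queue[QUEUE_CAP];
static size_t q_head, q_len;
static long now;              /* simulated clock, in ticks */
static long pages_left;       /* pages still to be released */
static long budget;           /* max ticks per invocation, 0 = unlimited */

static void queue_work(struct work *w)
{
    assert(q_len < QUEUE_CAP);
    queue[(q_head + q_len++) % QUEUE_CAP] = w;
}

/* Releases pages until done or until the budget for this run is used up. */
static void balloon_model(struct work *w)
{
    long deadline = now + budget;

    while (pages_left > 0) {
        pages_left--;
        now++;
        if (budget > 0 && now >= deadline && pages_left > 0) {
            queue_work(w);    /* re-arm at the tail and free the worker */
            return;
        }
    }
}

/* Stand-in for a short item such as cache_reap. */
static void short_work(struct work *w)
{
    (void)w;
    now++;
}

static void run_worker(void)
{
    while (q_len > 0) {
        struct work *w = queue[q_head];

        q_head = (q_head + 1) % QUEUE_CAP;
        q_len--;
        if (w->started_at < 0)
            w->started_at = now;
        w->fn(w);
    }
}

/* Returns the tick at which the short item first got to run. */
long simulate(long pages, long max_ticks)
{
    struct work balloon = { balloon_model, -1 };
    struct work reap = { short_work, -1 };

    q_head = q_len = 0;
    now = 0;
    pages_left = pages;
    budget = max_ticks;

    queue_work(&balloon);
    queue_work(&reap);
    run_worker();
    return reap.started_at;
}
```

In this model, simulate(pages, 0) makes the short item wait for the whole
ballooning operation, while a non-zero budget bounds its wait by that
budget. It also shows the weak spot of this option: the delay seen by
other items is exactly the chosen budget, so the value has to be picked
carefully.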

I have a slight preference for 2, even if the resulting patch will
be larger. 1 is only working around the issue and it is hard to
find a really good timeout value.
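
For option 2, the rough shape of a dedicated thread can be sketched in
user space with POSIX threads. Again this is only an illustration under
assumptions: struct balloon, CHUNK and the three functions are
hypothetical names, sched_yield() merely plays the role that
cond_resched() has in the driver, and a real kernel thread would use the
kernel's own thread and wait primitives instead.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for a dedicated balloon thread. */
struct balloon {
    pthread_mutex_t lock;
    pthread_cond_t changed;   /* broadcast on every relevant state change */
    long current_pages;
    long target_pages;
    bool stop;
};

enum { CHUNK = 256 };         /* pages adjusted per step, arbitrary value */

void *balloon_thread(void *arg)
{
    struct balloon *b = arg;

    pthread_mutex_lock(&b->lock);
    while (!b->stop) {
        long diff = b->target_pages - b->current_pages;

        if (diff == 0) {
            /* Nothing to do: sleep until target or stop flag changes. */
            pthread_cond_wait(&b->changed, &b->lock);
            continue;
        }
        if (diff > CHUNK)
            diff = CHUNK;
        if (diff < -CHUNK)
            diff = -CHUNK;
        b->current_pages += diff;
        if (b->current_pages == b->target_pages)
            pthread_cond_broadcast(&b->changed);

        /* Yield between chunks; nobody else queues behind this thread. */
        pthread_mutex_unlock(&b->lock);
        sched_yield();
        pthread_mutex_lock(&b->lock);
    }
    pthread_mutex_unlock(&b->lock);
    return NULL;
}

/* Set a new target and block until the thread has reached it. */
void balloon_set_target(struct balloon *b, long target)
{
    pthread_mutex_lock(&b->lock);
    b->target_pages = target;
    pthread_cond_broadcast(&b->changed);
    while (b->current_pages != b->target_pages)
        pthread_cond_wait(&b->changed, &b->lock);
    pthread_mutex_unlock(&b->lock);
}

/* Ask the thread to exit and wait for it. */
void balloon_stop(struct balloon *b, pthread_t tid)
{
    int rc;

    pthread_mutex_lock(&b->lock);
    b->stop = true;
    pthread_cond_broadcast(&b->changed);
    pthread_mutex_unlock(&b->lock);
    rc = pthread_join(tid, NULL);
    assert(rc == 0);
    (void)rc;
}
```

A caller would initialise the struct with PTHREAD_MUTEX_INITIALIZER and
PTHREAD_COND_INITIALIZER, start balloon_thread() with pthread_create(),
and then use balloon_set_target() and balloon_stop(). Since the loop
owns its thread, a long ballooning run cannot hold up unrelated work
items, and no timeout value has to be chosen.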

I'd be fine to write a patch, but would prefer some feedback on which
way to go.


Was there a particular reason that a workqueue was used in the first
place? Otherwise using a kernel thread would look like the way to
go, indeed. The presence of cond_resched() kind of indicates such an
intention already anyway.

The workqueue approach has been there ever since the balloon driver
was added. I guess the cond_resched() was just needed to avoid
scheduling starvation; it was part of the initial driver, too.

So basically I don't know why a workqueue was chosen instead of a
kernel thread.


Juergen

