
Re: [PATCH v2 2/4] block: Avoid processing BDS twice in bdrv_set_aio_context_ignore()


  • To: Kevin Wolf <kwolf@xxxxxxxxxx>
  • From: Vladimir Sementsov-Ogievskiy <vsementsov@xxxxxxxxxxxxx>
  • Date: Thu, 17 Dec 2020 17:01:03 +0300
  • Cc: Sergio Lopez <slp@xxxxxxxxxx>, Fam Zheng <fam@xxxxxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>, qemu-block@xxxxxxxxxx, Paul Durrant <paul@xxxxxxx>, "Michael S. Tsirkin" <mst@xxxxxxxxxx>, qemu-devel@xxxxxxxxxx, Max Reitz <mreitz@xxxxxxxxxx>, Stefan Hajnoczi <stefanha@xxxxxxxxxx>, Paolo Bonzini <pbonzini@xxxxxxxxxx>, Anthony Perard <anthony.perard@xxxxxxxxxx>, xen-devel@xxxxxxxxxxxxxxxxxxxx
  • Delivery-date: Thu, 17 Dec 2020 14:01:28 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

17.12.2020 16:06, Kevin Wolf wrote:
Am 17.12.2020 um 13:50 hat Vladimir Sementsov-Ogievskiy geschrieben:
17.12.2020 13:58, Kevin Wolf wrote:
Am 17.12.2020 um 10:37 hat Sergio Lopez geschrieben:
On Wed, Dec 16, 2020 at 07:31:02PM +0100, Kevin Wolf wrote:
Am 16.12.2020 um 15:55 hat Sergio Lopez geschrieben:
On Wed, Dec 16, 2020 at 01:35:14PM +0100, Kevin Wolf wrote:
Anyway, trying to reconstruct the block graph with BdrvChild pointers
annotated at the edges:

BlockBackend
        |
        v
    backup-top ------------------------+
        |   |                          |
        |   +-----------------------+  |
        |            0x5655068b8510 |  | 0x565505e3c450
        |                           |  |
        | 0x565505e42090            |  |
        v                           |  |
      qcow2 ---------------------+  |  |
        |                        |  |  |
        | 0x565505e52060         |  |  | ??? [1]
        |                        |  |  |  |
        v         0x5655066a34d0 |  |  |  | 0x565505fc7aa0
      file                       v  v  v  v
                               qcow2 (backing)
                                      |
                                      | 0x565505e41d20
                                      v
                                    file

[1] This seems to be a BdrvChild with a non-BDS parent. Probably a
      BdrvChild directly owned by the backup job.

So it seems this is happening:

backup-top (5e48030) <---------| (5)
     |    |                      |
     |    | (6) ------------> qcow2 (5fbf660)
     |                           ^    |
     |                       (3) |    | (4)
     |-> (1) qcow2 (5e5d420) -----    |-> file (6bc0c00)
     |
     |-> (2) file (5e52060)

backup-top (5e48030), the BDS that was passed as argument in the first
bdrv_set_aio_context_ignore() call, is re-entered when qcow2 (5fbf660)
is processing its parents, and the latter is also re-entered when the
first one starts processing its children again.

Yes, but look at the BdrvChild pointers: it is through different edges
that we come back to the same node. No BdrvChild is used twice.

If backup-top had added all of its children to the ignore list before
calling into the overlay qcow2, the backing qcow2 wouldn't eventually
have called back into backup-top.

I've tested a patch that first adds every child to the ignore list,
and then processes those that weren't there before, as you suggested
on a previous email. With that, the offending qcow2 is not re-entered,
so we avoid the crash, but backup-top is still entered twice:

I think we also need to add every parent to the ignore list before
calling callbacks, though it doesn't look like this is the problem
you're currently seeing.

I agree.

bs=0x560db0e3b030 (backup-top) enter
bs=0x560db0e3b030 (backup-top) processing children
bs=0x560db0e3b030 (backup-top) calling bsaci child=0x560db0e2f450 (child->bs=0x560db0fb2660)
bs=0x560db0fb2660 (qcow2) enter
bs=0x560db0fb2660 (qcow2) processing children
bs=0x560db0fb2660 (qcow2) calling bsaci child=0x560db0e34d20 (child->bs=0x560db1bb3c00)
bs=0x560db1bb3c00 (file) enter
bs=0x560db1bb3c00 (file) processing children
bs=0x560db1bb3c00 (file) processing parents
bs=0x560db1bb3c00 (file) processing itself
bs=0x560db0fb2660 (qcow2) calling bsaci child=0x560db16964d0 (child->bs=0x560db0e50420)
bs=0x560db0e50420 (qcow2) enter
bs=0x560db0e50420 (qcow2) processing children
bs=0x560db0e50420 (qcow2) calling bsaci child=0x560db0e34ea0 (child->bs=0x560db0e45060)
bs=0x560db0e45060 (file) enter
bs=0x560db0e45060 (file) processing children
bs=0x560db0e45060 (file) processing parents
bs=0x560db0e45060 (file) processing itself
bs=0x560db0e50420 (qcow2) processing parents
bs=0x560db0e50420 (qcow2) processing itself
bs=0x560db0fb2660 (qcow2) processing parents
bs=0x560db0fb2660 (qcow2) calling set_aio_ctx child=0x560db1672860
bs=0x560db0fb2660 (qcow2) calling set_aio_ctx child=0x560db1b14a20
bs=0x560db0e3b030 (backup-top) enter
bs=0x560db0e3b030 (backup-top) processing children
bs=0x560db0e3b030 (backup-top) processing parents
bs=0x560db0e3b030 (backup-top) calling set_aio_ctx child=0x560db0e332d0
bs=0x560db0e3b030 (backup-top) processing itself
bs=0x560db0fb2660 (qcow2) processing itself
bs=0x560db0e3b030 (backup-top) calling bsaci child=0x560db0e35090 (child->bs=0x560db0e50420)
bs=0x560db0e50420 (qcow2) enter
bs=0x560db0e3b030 (backup-top) processing parents
bs=0x560db0e3b030 (backup-top) processing itself

I see that "blk_do_set_aio_context()" passes "blk->root" to
"bdrv_child_try_set_aio_context()", so it's already in the ignore
list; I'm not sure what's happening here. Is backup-top referenced
from two different BdrvChild objects, or is "blk->root" not pointing
to backup-top's BDS?

The second time that backup-top is entered, it is not as the BDS of
blk->root, but as the parent node of the overlay qcow2. Which is
interesting, because last time it was still the backing qcow2, so the
change did have _some_ effect.

The part that I don't understand is why you still get the line with
child=0x560db1b14a20, because when you add all children to the ignore
list first, that should have been put into the ignore list as one of the
first things in the whole process (when backup-top was first entered).

Is 0x560db1b14a20 a BdrvChild that has backup-top as its opaque value,
but isn't actually present in backup-top's bs->children?

Exactly, that line corresponds to this chunk of code:

<---- begin ---->
    QLIST_FOREACH(child, &bs->parents, next_parent) {
        if (g_slist_find(*ignore, child)) {
            continue;
        }
        assert(child->klass->set_aio_ctx);
        *ignore = g_slist_prepend(*ignore, child);
        fprintf(stderr, "bs=%p (%s) calling set_aio_ctx child=%p\n",
                bs, bs->drv->format_name, child);
        child->klass->set_aio_ctx(child, new_context, ignore);
    }
<---- end ---->

Do you think it's safe to re-enter backup-top, or should we look for a
way to avoid this?

I think it should be avoided, but I don't understand why putting all
children of backup-top into the ignore list doesn't already avoid it. If
backup-top is in the parents list of qcow2, then qcow2 should be in the
children list of backup-top and therefore the BdrvChild should already
be in the ignore list.

The only way I can explain this is that backup-top and qcow2 have
different ideas about which BdrvChild objects exist that connect them.
Or that the graph changes between both places, but I don't see how that
could happen in bdrv_set_aio_context_ignore().


bdrv_set_aio_context_ignore() does bdrv_drained_begin().. As I
reported recently, nothing prevents a job from finishing and modifying
the graph during another drained section. That may be the case here.

Good point, this might be the same bug then.

If everything worked correctly, a job completion could only happen on
the outer bdrv_set_aio_context_ignore(). But after that, we are already
in a drain section, so the job should be quiesced and a second drain
shouldn't cause any additional graph changes.

I would have to go back to the other discussion, but I think it was
related to block jobs that are already in the completion process and
keep moving forward even though they are supposed to be quiesced.

If I remember correctly, actually pausing them at this point looked
difficult. Maybe what we should do then is let .drained_poll return
true until they have actually fully completed?

Ah, but was this something that would deadlock because the job
completion callbacks use drain sections themselves?

Hmm.. I recently sent a good example of a deadlock in the email
"aio-poll dead-lock"..

I don't have a better idea than moving all graph modifications
(together with the corresponding drained sections) into coroutines and
protecting them with a global coroutine mutex.


If backup-top is involved, I suppose the graph modification is in
backup_clean, when we remove the filter.. Who is making the
set_aio_context call in this issue? I mean, what is the backtrace of
bdrv_set_aio_context_ignore()?

Sergio, can you provide the backtrace and also test if the theory with a
job completion in the middle of the process is what you actually hit?

Kevin



--
Best regards,
Vladimir



 

