[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH v2 5/6] tools/xenstored: partially handle domains without a shared ring


  • To: Julien Grall <julien@xxxxxxx>
  • From: Roger Pau Monné <roger.pau@xxxxxxxxxx>
  • Date: Thu, 23 Sep 2021 09:23:17 +0200
  • Arc-authentication-results: i=1; mx.microsoft.com 1; spf=pass smtp.mailfrom=citrix.com; dmarc=pass action=none header.from=citrix.com; dkim=pass header.d=citrix.com; arc=none
  • Arc-message-signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector9901; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version; bh=Z3sBADxbHmgf+gWA0mJP8l4iNdX4Q4e+8btyNSbBulU=; b=HY4BiAQ8OAcI+VwLb5wMgdmKdIJZvndB+YNqpgy8rc/Zm6tVLcWKOYqlXjSHgW+wTTQp4iBJXa4rLQjQH9oDcwjDqDHCGY8LoLKYz3u5hXLizPiR90K3fdfXilVEubDan2pAReF0RIKQFy8iIPNxKylsxmm1Gh9RgQRYtlMLE/TJXfDVSlftfBIYTAmgWFL35f1zpx5iYFaW/ig4etrCwyrgA2OzGHAHbWuW11YHcJ4u43PFVtuGGXj3m3u3ArrJPtpVJdQ0CDtVVe0oXPBx3jqmLoXek/OYnzE3BPXCK4T2RMRxHFqt73K+1EZc9Y9SZHu1zG1A5b6jm7q3dpFijQ==
  • Arc-seal: i=1; a=rsa-sha256; s=arcselector9901; d=microsoft.com; cv=none; b=Ls6zfH39QrhwpjVy7aoO1/2kPyPVbWkjfzjY9Eoeeay0EVTLtylONrkQ1nVkleDSCpoWyjJ2RTC51DDF+B4pzZuUFsggCL9k60OlUEErH2Ew56T7a8PQerW+nG3mP/Qov82wKK7mNpdeHsmAJTZOEPWMxjT+F4jVlVo8nR3R/xKPs9858zxBKpuPbkjjJc+WkFseGBZTDRGxO/ogkKFN7LupnOGCxl86rU/RlkUehSgju3bbsco+q1BmywaFzm/5CXd/Zs2zxOM7VJ++2vTqMVy15ExOgpWhlWjhDVl0MJAsgs54XDDKJNMNXsVJYA8IwBuvAX5vpfFlft2OXuFzWg==
  • Authentication-results: esa2.hc3370-68.iphmx.com; dkim=pass (signature verified) header.i=@citrix.onmicrosoft.com
  • Cc: Juergen Gross <jgross@xxxxxxxx>, <xen-devel@xxxxxxxxxxxxxxxxxxxx>, Ian Jackson <iwj@xxxxxxxxxxxxxx>, Wei Liu <wl@xxxxxxx>, <raphning@xxxxxxxxxxxx>, "Doebel, Bjoern" <doebel@xxxxxxxxx>
  • Delivery-date: Thu, 23 Sep 2021 07:23:45 +0000
  • Ironport-data: A9a23:uDKYSqyM5Sov9Cj2kbN6t+eUwSrEfRIJ4+MujC+fZmUNrF6WrkVUy 2IYUWuFOPiOZ2ryf9p+O9u39EsPuceGmtE2HlBtpSAxQypGp/SeCIXCJC8cHc8zwu4v7q5Dx 59DAjUVBJlsFhcwnvopW1TYhSEUOZugH9IQM8aZfHAsLeNYYH1500s7yrRi2tcAbeWRWGthh /uj+6UzB3f9s9JEGjp8B3Wr8U4HUFza4Vv0j3RmDRx5lAa2e0o9VfrzEZqZPXrgKrS4K8bhL wr1IBNVyUuCl/slIovNfr8W6STmSJaKVeSFoiI+t6RPHnGuD8H9u0o2HKN0VKtZt9mGt80ty dBdr66LcAtqE7/0ptkldRtDUAgraMWq+JefSZS+mcmazkmAeHrw2fR+SkoxOOX0+M4uXzsIr 6ZBbmlQMFbT3Ipaw5riIgVort4kI8TxepsWp1lrzC3DDOZgSpfGK0nPzYIDgG9r35gWdRrYT /cJSytWfT76WEJSK10ZB9UO29uJpXaqJlW0r3rK/PFqsgA/1jdZz7zFINfTPNuQSq19hE+Ap mTH+WvRCxQTJtuZjzGCtG+v7sfDmi7xVYY6Hbix5PlsxlGerkQSFx8+RVa9ueO+iEO1R5RYM UN80igzqak/8mS7Q9+7WAe3yFaGsQQbQMF4CPAh5UeGza+8yxaUAC0IQyBMbPQitdQqXno62 1mRhdTrCDdz9rqPRhqgGqy89G3of3JPdClbOHFCHVBtD8TfTJ8bk07OS8Z5IpWMgp7YCwnO/ gisrjN9ruBG5SIU7JmT8VfCijOqg5HGSA8p+wnaNl6YAhNFiJ2NPNPztAKFhRpUBMPAFADQ4 SRd8ySLxL1WVfmweDqxrPLh9V1Dz82MNiHVyXVrFoMon9hG0y/+Jd0IiN2SyUEADyrlRdMLS BOI0e+yzMUKVJdPUUORS9jqYyjN5fK8fekJrtiOMrJzjmFZLWdrBh1Ga0+KxHzKm0Mxi6w5M przWZ/yVixBUvw/l2TsFrZ1PVoXKsYWnzi7qXfTlUjP7FZjTCTNFedt3KWmNIjVE59oUC2Kq o0CZqNmOj1UUfHkYzm/zGLgBQpiEJTPPriv85Y/XrfaemJOQTh9Y9eMkeJJU9E0xMx9y7aXl kxRr2cFkTITc1Wccl7UAp2iAZuyNatCQYUTZ3dxYgr4hyZ5Me5CLs43LvMKQFXuz8Q6pdZcR PgZYcSQRPNJTzXM4TMGapfh6odlcXyWacimZkJJuRAzIMxtQRLn4Njhcle9/SUCFHPv58A/v 6ehxkXQRp9aH1ZuC8PfafSOyVKtvCdCxLIuDhWQetQDKl/x9IVKKjDqiqNlKc87NhielCCR0 BybAElEqLCV8ZM16tTAmYuNs5ytT7llBkNfEmSCteS2OCDW83CN24hFVOrULznRWHmtoPepZ PlPzuG6O/oCxQ4Yv415Grdt7KQ/+9qw+OMKklU6RC3GNg35BKlhL3+K2dh0mpdMnrIJ6xGrX k+v+8VBPenbMs3SD1NMdhEuaf6O1K9Il2CKv+g1Okjz+AR+4KGDDRdJJxCJhSFQcOl1PYciz btzscIa8VXi2B8jM9LAhSFI7WWcaHcHVvx/5J0dBYbqjCsty01DPsOAWnOnvsnXZoUeKFQuL x+VmLHG1uZVyUf1enYuEWTAgLhGjpMUtREWlFIPKjxlQDYeaiPbCPGJzQkKcw==
  • Ironport-hdrordr: A9a23:vBDNSKESjsZn6t88pLqFcpHXdLJyesId70hD6qkvc3Nom52j+/ xGws536faVslcssHFJo6HmBEClewKnyXcT2/htAV7CZnichILMFu9fBOTZsl/d8kHFh4tgPO JbAtRD4b7LfClHZKTBkXCF+r8bqbHtmsDY5pav854ud3ATV0gJ1XYGNu/xKDwReOApP+tcKH LKjfA32AZINE5nJPiTNz0gZazuttfLnJXpbVovAAMm0hCHiXeN5KThGxaV8x8CW3cXqI1SvV Ttokjc3OGOovu7whjT2yv66IlXosLozp9mCNaXgsYYBz3wgkKDZZhnWZeFoDcpydvfpWoCoZ 3pmVMNLs5z43TeciWcpgbs4RDp1HIU53rr2Taj8DLeiP28YAh/J9tKhIpffBecwVEnpstA3K VC2H/cn4ZLDDvb9R6NpuTgZlVPrA6ZsHAimekcgzh0So0FcoJcqoQZ4Qd8DIoAJiTn84oqed MeQv003MwmMm9yUkqp/FWGmLeXLzEO91a9Mwc/U/WuonhrdCsT9Tpd+CQd9k1wgq7VBaM0oN gsCZ4Y5o2mePVmGp6VNN1xMvdfNVa9NC4kEFjiaWgPR5t3cE4klfbMkcEIDaeRCdo18Kc=
  • Ironport-sdr: CW5gxOjWoC9Oi5dL8MxWCPd1ZWAvIZ+FmMuvfwqn4L++19JXCFa4JUFrkKyoUYAZ47NmVg73N5 KXFb9cMc6dvtu9HUPZAHHUMxxcl1R3r7cCFszIPEWtKw7eRapzMywJnrrR4yD2/UcN+8u0jAuw XGPf1tvwNGmGCUfBEMdgjHQ8SuzTLgQW05QbRc7pmyblsWbreUiuTtdi8fI8j5hrqwdF+9HaGw 1pQMzQMNywAVV3v8W3Z4eeBuy7KlN41/CG+mHuby8za33SpBSJPfLXHkWoSGQ66klz3nKSUkS3 FIFpSHB1dcqOUQOYh3O2DCF8
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On Wed, Sep 22, 2021 at 06:46:25PM +0500, Julien Grall wrote:
> (+ Some AWS folks)
> 
> Hi Juergen,
> 
> On 22/09/2021 17:34, Juergen Gross wrote:
> > On 22.09.21 12:23, Julien Grall wrote:
> > > Hi Roger,
> > > 
> > > On 22/09/2021 14:58, Roger Pau Monné wrote:
> > > > On Wed, Sep 22, 2021 at 02:07:44PM +0500, Julien Grall wrote:
> > > > > Hi Roger,
> > > > > 
> > > > > On 22/09/2021 13:21, Roger Pau Monne wrote:
> > > > > > Failure to map the shared ring and thus establish a xenstore
> > > > > > connection with a domain shouldn't prevent the "@introduceDomain"
> > > > > > watch from firing, likewise with "@releaseDomain".
> > > > > > 
> > > > > > In order to handle such events properly xenstored should keep track 
> > > > > > of
> > > > > > the domains even if the shared communication ring cannot be mapped.
> > > > > > This was partially the case due to the restore mode, which needs to
> > > > > > handle domains that have been destroyed between the save and restore
> > > > > > period. This patch extends on the previous limited support of
> > > > > > temporary adding a domain without a valid interface ring, and 
> > > > > > modifies
> > > > > > check_domains to keep domains without an interface on the list.
> > > > > > 
> > > > > > Handling domains without a valid shared ring is required in order to
> > > > > > support domain without a grant table, since the lack of grant table
> > > > > > will prevent the mapping of the xenstore ring grant reference.
> > > > > > 
> > > > > > Signed-off-by: Roger Pau Monné <roger.pau@xxxxxxxxxx>
> > > > > > ---
> > > > > > oxenstored will need a similar treatment once grant mapping is used
> > > > > > there. For the time being it still works correctly because it uses
> > > > > > foreign memory to map the shared ring, and that will work in the
> > > > > > absence of grant tables on the domain.
> > > > > > ---
> > > > > > Changes since v1:
> > > > > >    - New in this version.
> > > > > > ---
> > > > > >    tools/xenstore/xenstored_domain.c | 30
> > > > > > ++++++++++++++++++------------
> > > > > >    1 file changed, 18 insertions(+), 12 deletions(-)
> > > > > > 
> > > > > > diff --git a/tools/xenstore/xenstored_domain.c
> > > > > > b/tools/xenstore/xenstored_domain.c
> > > > > > index 9fb78d5772..150c6f082e 100644
> > > > > > --- a/tools/xenstore/xenstored_domain.c
> > > > > > +++ b/tools/xenstore/xenstored_domain.c
> > > > > > @@ -119,6 +119,11 @@ static int writechn(struct connection *conn,
> > > > > >        struct xenstore_domain_interface *intf =
> > > > > > conn->domain->interface;
> > > > > >        XENSTORE_RING_IDX cons, prod;
> > > > > > +    if (!intf) {
> > > > > > +        errno = ENODEV;
> > > > > > +        return -1;
> > > > > > +    }
> > > > > > +
> > > > > >        /* Must read indexes once, and before anything
> > > > > > else, and verified. */
> > > > > >        cons = intf->rsp_cons;
> > > > > >        prod = intf->rsp_prod;
> > > > > > @@ -149,6 +154,11 @@ static int readchn(struct
> > > > > > connection *conn, void *data, unsigned int len)
> > > > > >        struct xenstore_domain_interface *intf =
> > > > > > conn->domain->interface;
> > > > > >        XENSTORE_RING_IDX cons, prod;
> > > > > > +    if (!intf) {
> > > > > > +        errno = ENODEV;
> > > > > > +        return -1;
> > > > > > +    }
> > > > > > +
> > > > > >        /* Must read indexes once, and before anything
> > > > > > else, and verified. */
> > > > > >        cons = intf->req_cons;
> > > > > >        prod = intf->req_prod;
> > > > > > @@ -176,6 +186,9 @@ static bool domain_can_write(struct
> > > > > > connection *conn)
> > > > > >    {
> > > > > >        struct xenstore_domain_interface *intf =
> > > > > > conn->domain->interface;
> > > > > > +    if (!intf)
> > > > > > +        return false;
> > > > > > +
> > > > > 
> > > > > Rather than adding an extra check, how about taking
> > > > > advantage of is_ignore?
> > > > 
> > > > Right, I just need to change the order in conn_can_read and
> > > > conn_can_write so that the is_ignored check is performed before the
> > > > can_{read,write} handler is called.
> > > > 
> > > > > >        return ((intf->rsp_prod - intf->rsp_cons) !=
> > > > > > XENSTORE_RING_SIZE);
> > > > > >    }
> > > > > > @@ -183,7 +196,8 @@ static bool domain_can_read(struct
> > > > > > connection *conn)
> > > > > >    {
> > > > > >        struct xenstore_domain_interface *intf =
> > > > > > conn->domain->interface;
> > > > > > -    if (domain_is_unprivileged(conn) && conn->domain->wrl_credit < 
> > > > > > 0)
> > > > > > +    if ((domain_is_unprivileged(conn) &&
> > > > > > conn->domain->wrl_credit < 0) ||
> > > > > > +        !intf)
> > > > > >            return false;
> > > > > >        return (intf->req_cons != intf->req_prod);
> > > > > > @@ -268,14 +282,7 @@ void check_domains(void)
> > > > > >                    domain->shutdown = true;
> > > > > >                    notify = 1;
> > > > > >                }
> > > > > > -            /*
> > > > > > -             * On Restore, we may have been unable to remap the
> > > > > > -             * interface and the port. As we don't know whether
> > > > > > -             * this was because of a dying domain, we need to
> > > > > > -             * check if the interface and port are still valid.
> > > > > > -             */
> > > > > > -            if (!dominfo.dying && domain->port &&
> > > > > > -                domain->interface)
> > > > > > +            if (!dominfo.dying)
> > > > > >                    continue;
> > > > > 
> > > > > This is mostly a revert on "tools/xenstore: handle dying
> > > > > domains in live
> > > > > update". However, IIRC, this check was necessary to release
> > > > > the connection
> > > > > if the domain has died in the middle of Live-Update.
> > > > 
> > > > But if the domain has died in the middle of live update
> > > > get_domain_info will return false, and thus the code won't get here.
> > > 
> > > Hmmm... I think I am mixing up a few things... I went through the
> > > original discussion (it was on the security ML) to find out why I
> > > wrote the patch like that. When going through the archives, I
> > > noticed that I provided a different version of this patch (see [1])
> > > because there was some issue with the check here (I wrote that it
> > > would lead to zombie domain, but don't have the rationale :().
> > > 
> > > Juergen, I don't seem to find the reason why the patch was not
> > > replaced as we discussed on the security ML. Do you remember why?
> > 
> > Sorry, no, I don't.
> > 
> > You did send the new version for V6 of the LU series, but it seems at
> > least in V9 you commented on the patch requesting that a comment just
> > in the section being different between the two variants to be removed.
> > 
> > So either we both overlooked the new variant not having gone in, or we
> > agreed to use the old version (e.g. in a security meeting). In my IRC
> > logs I couldn't find anything either (the only mentioning of that patch
> > was before V6 of the series was sent, and that was before you sending
> > the new one as a reply to V6).
> > 
> > > Assuming this was a mistake, could someone take care of sending an
> > > update? If not, I could do it when I am back in October.
> > > 
> > > For the archives, the issues would appear when shutting down a
> > > domain during Live-Update.
> > 
> > Hmm, IIRC you did quite some extensive testing of LU and didn't find
> > any problem in the final version.
> 
> I did extensive testing with early series but I can't remember whether I
> tested the shutdown reproducer with the latest series.
> 
> > 
> > Are you sure there really is a problem?
> 
> I thought a bit more and looked at the code (I don't have access to a test
> machine at the moment). I think there is indeed a problem.
> 
> Some watchers of @releaseDomain (such as xenconsoled) will only remove a
> domain from their internal state when the domain is actually dead.
> 
> This is based on dominfo.dying which is only set when all the resources are
> relinquished and waiting for the other domains to release any resources for
> that domain.
> 
> The problem is Xenstore may fail to map the interface or the event channel
> long before the domain is actually dead. With the current check, we would
> free the allocated structure and therefore send @releaseDomain too early. So
> daemon like xenconsoled, would never cleanup for the domain and leave a
> zombie domain. Well... until the next @releaseDomain (or @introduceDomain
> for Xenconsoled) AFAICT.
> 
> The revised patch is meant to solve it by just ignoring the connection. With
> that approach, we would correctly notify watches when the domain is dead.

I think the patch provided by Julien is indeed better than the current
code, or else you will miss @releaseDomain events in corner cases for
dominfo.dying.

I think however the patch is missing a change in the order of the
checks in conn_can_{read,write}, so that the is_ignored check is
performed before calling can_{read,write} which will try to poke at
the interface and trigger a fault if not mapped.

Thanks, Roger.



 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.