
Re: [Xen-devel] [PATCH 2/2] domain: use PGC_extra domheap page for shared_info



Hi Paul,

On 25/02/2020 09:53, Paul Durrant wrote:
There's no particular reason shared_info need use a xenheap page.

AFAICT, a side-effect of this change is that shared_info is now going to be part of the migration stream. One issue with this is that, on the restore side, it will be accounted in the number of pages allocated to the domain.

I am fairly certain dirty logging on a page with PGC_extra set would not work properly either.

As the pages will be recreated on the restore side, I would suggest skipping them in XEN_DOMCTL_getpageframeinfo3.
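
Something along these lines in the XEN_DOMCTL_getpageframeinfo3 handler might do (untested sketch; the exact variable names depend on the surrounding code):

    /* PGC_extra pages (e.g. shared_info) are re-created on the restore
     * side, so report them as invalid and let the toolstack skip them. */
    if ( page->count_info & PGC_extra )
        type = XEN_DOMCTL_PFINFO_XTAB;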

Its
only purpose is to be mapped by the guest so use a PGC_extra domheap page
instead. This does entail freeing shared_info during
domain_relinquish_resources() rather than domain_destroy() so care is
needed to avoid de-referencing a NULL shared_info pointer hence some
extra checks of 'is_dying' are needed.
ASSERTions are added before apparently vulnerable dereferences in the
event channel code. These should not be hit because domain_kill() calls
evtchn_destroy() before calling domain_relinquish_resources() but are
added to catch any future re-ordering.

IMHO, the ASSERTions in the event channel pending/mask/... helpers will not protect against re-ordering: you may never hit them even if there is a re-ordering. It would be better to add an ASSERT()/BUG_ON() in evtchn_destroy() and possibly a comment in domain_kill() to mention the ordering.
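
Roughly what I have in mind (untested sketch, exact wording/placement up to you):

    /* In evtchn_destroy(): the 2l/fifo helpers still dereference
     * shared_info while the channels are torn down. */
    ASSERT(d->shared_info.virt);

and in domain_kill(), next to the evtchn_destroy() call, a comment along the lines of:

    /* Must precede domain_relinquish_resources(), which frees shared_info. */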

For Arm, the call to free_shared_info() in arch_domain_destroy() is left
in place since it is called in the error path for arch_domain_create().

NOTE: A modification in p2m_alloc_table() is required to avoid a false
       positive when checking for domain memory.
       A fix is also needed in dom0_construct_pv() to avoid automatically
       adding PGC_extra pages to the physmap.

I am not entirely sure how this is related to this patch. Was it a latent bug? If so, I think it would make sense to split it from this patch.


Signed-off-by: Paul Durrant <pdurrant@xxxxxxxxxx>
---
Cc: Stefano Stabellini <sstabellini@xxxxxxxxxx>
Cc: Julien Grall <julien@xxxxxxx>
Cc: Volodymyr Babchuk <Volodymyr_Babchuk@xxxxxxxx>
Cc: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
Cc: George Dunlap <george.dunlap@xxxxxxxxxx>
Cc: Ian Jackson <ian.jackson@xxxxxxxxxxxxx>
Cc: Jan Beulich <jbeulich@xxxxxxxx>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>
Cc: Wei Liu <wl@xxxxxxx>
---
  xen/arch/arm/domain.c        |  2 ++
  xen/arch/x86/domain.c        |  3 ++-
  xen/arch/x86/mm/p2m.c        |  3 +--
  xen/arch/x86/pv/dom0_build.c |  4 ++++
  xen/common/domain.c          | 25 +++++++++++++++++++------
  xen/common/event_2l.c        |  4 ++++
  xen/common/event_fifo.c      |  1 +
  xen/common/time.c            |  3 +++
  8 files changed, 36 insertions(+), 9 deletions(-)

diff --git a/xen/arch/arm/domain.c b/xen/arch/arm/domain.c
index 2cbcdaac08..3904519256 100644
--- a/xen/arch/arm/domain.c
+++ b/xen/arch/arm/domain.c
@@ -1006,6 +1006,8 @@ int domain_relinquish_resources(struct domain *d)
          BUG();
      }
+    free_shared_info(d);
+
      return 0;
  }
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index eb7b0fc51c..3ad532eccf 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -691,7 +691,6 @@ void arch_domain_destroy(struct domain *d)
          pv_domain_destroy(d);
      free_perdomain_mappings(d);
-    free_shared_info(d);
      cleanup_domain_irq_mapping(d);
 
      psr_domain_free(d);
@@ -2246,6 +2245,8 @@ int domain_relinquish_resources(struct domain *d)
      if ( is_hvm_domain(d) )
          hvm_domain_relinquish_resources(d);
+    free_shared_info(d);
+
      return 0;
  }
diff --git a/xen/arch/x86/mm/p2m.c b/xen/arch/x86/mm/p2m.c
index c5f428d67c..787d97d85e 100644
--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -695,8 +695,7 @@ int p2m_alloc_table(struct p2m_domain *p2m)
      p2m_lock(p2m);
 
-    if ( p2m_is_hostp2m(p2m)
-         && !page_list_empty(&d->page_list) )
+    if ( p2m_is_hostp2m(p2m) && domain_tot_pages(d) )
      {
          P2M_ERROR("dom %d already has memory allocated\n", d->domain_id);
          p2m_unlock(p2m);
diff --git a/xen/arch/x86/pv/dom0_build.c b/xen/arch/x86/pv/dom0_build.c
index dc16ef2e79..f8f1bbe2f4 100644
--- a/xen/arch/x86/pv/dom0_build.c
+++ b/xen/arch/x86/pv/dom0_build.c
@@ -792,6 +792,10 @@ int __init dom0_construct_pv(struct domain *d,
      {
          mfn = mfn_x(page_to_mfn(page));
          BUG_ON(SHARED_M2P(get_gpfn_from_mfn(mfn)));
+
+        if ( page->count_info & PGC_extra )
+            continue;

I would add a comment explaining why we skip pages with PGC_extra set.
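
Something along the lines of (wording up to you):

+        /*
+         * PGC_extra pages (e.g. shared_info) are not part of dom0's
+         * initial allocation, so don't add them to the physmap here.
+         */
+        if ( page->count_info & PGC_extra )
+            continue;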

+
          if ( get_gpfn_from_mfn(mfn) >= count )
          {
              BUG_ON(is_pv_32bit_domain(d));
diff --git a/xen/common/domain.c b/xen/common/domain.c
index ba7a905258..1d42fbcc0f 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -128,9 +128,9 @@ static void vcpu_info_reset(struct vcpu *v)
  {
      struct domain *d = v->domain;
-    v->vcpu_info = ((v->vcpu_id < XEN_LEGACY_MAX_VCPUS)
+    v->vcpu_info = (!d->is_dying && v->vcpu_id < XEN_LEGACY_MAX_VCPUS)
                      ? (vcpu_info_t *)&shared_info(d, vcpu_info[v->vcpu_id])
-                    : &dummy_vcpu_info);
+                    : &dummy_vcpu_info;

Without holding domain_lock(), I don't think there is any guarantee that is_dying (and therefore shared_info) is not going to change behind your back, so v->vcpu_info may end up pointing to garbage.

Looking at the callers, XEN_DOMCTL_soft_reset will not hold the lock.

As an aside, it would be good to explain in a comment why we are using dummy_vcpu_info when the domain is dying.
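
Something like this, perhaps (wording is only a suggestion):

    /*
     * If the domain is dying, shared_info may already have been freed by
     * domain_relinquish_resources(), so point at dummy_vcpu_info rather
     * than leaving a dangling pointer to the old mapping.
     */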

      v->vcpu_info_mfn = INVALID_MFN;
  }
@@ -1650,24 +1650,37 @@ int continue_hypercall_on_cpu(
  int alloc_shared_info(struct domain *d, unsigned int memflags)
  {
-    if ( (d->shared_info.virt = alloc_xenheap_pages(0, memflags)) == NULL )
+    struct page_info *pg;
+
+    pg = alloc_domheap_page(d, MEMF_no_refcount | memflags);
+    if ( !pg )
          return -ENOMEM;
-    d->shared_info.mfn = virt_to_mfn(d->shared_info.virt);
+    if ( !get_page_and_type(pg, d, PGT_writable_page) )
+        return -ENODATA;

I think the page will never be freed if this fails.
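
I would expect something like the below to be needed (untested; I haven't checked whether dropping the allocation reference alone is enough to free the page here):

    if ( !get_page_and_type(pg, d, PGT_writable_page) )
    {
        put_page_alloc_ref(pg);
        return -ENODATA;
    }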

+
+    d->shared_info.mfn = page_to_mfn(pg);
+    d->shared_info.virt = __map_domain_page_global(pg);
      clear_page(d->shared_info.virt);
-    share_xen_page_with_guest(mfn_to_page(d->shared_info.mfn), d, SHARE_rw);
 
      return 0;
  }
void free_shared_info(struct domain *d)
  {
+    struct page_info *pg;
+
      if ( !d->shared_info.virt )
          return;
-    free_xenheap_page(d->shared_info.virt);
+    unmap_domain_page_global(d->shared_info.virt);
      d->shared_info.virt = NULL;
+
+    pg = mfn_to_page(d->shared_info.mfn);
+
+    put_page_alloc_ref(pg);
+    put_page_and_type(pg);
  }
/*
diff --git a/xen/common/event_2l.c b/xen/common/event_2l.c
index e1dbb860f4..a72fe0232b 100644
--- a/xen/common/event_2l.c
+++ b/xen/common/event_2l.c
@@ -27,6 +27,7 @@ static void evtchn_2l_set_pending(struct vcpu *v, struct evtchn *evtchn)
       * others may require explicit memory barriers.
       */
 
+    ASSERT(d->shared_info.virt);
      if ( guest_test_and_set_bit(d, port, &shared_info(d, evtchn_pending)) )
          return;
@@ -54,6 +55,7 @@ static void evtchn_2l_unmask(struct domain *d, struct evtchn *evtchn)
       * These operations must happen in strict order. Based on
       * evtchn_2l_set_pending() above.
       */
+    ASSERT(d->shared_info.virt);
      if ( guest_test_and_clear_bit(d, port, &shared_info(d, evtchn_mask)) &&
           guest_test_bit(d, port, &shared_info(d, evtchn_pending)) &&
           !guest_test_and_set_bit(d, port / BITS_PER_EVTCHN_WORD(d),
@@ -67,6 +69,7 @@ static bool evtchn_2l_is_pending(const struct domain *d, evtchn_port_t port)
  {
      unsigned int max_ports = BITS_PER_EVTCHN_WORD(d) * BITS_PER_EVTCHN_WORD(d);
 
+    ASSERT(d->shared_info.virt);
      ASSERT(port < max_ports);
      return (port < max_ports &&
              guest_test_bit(d, port, &shared_info(d, evtchn_pending)));
@@ -76,6 +79,7 @@ static bool evtchn_2l_is_masked(const struct domain *d, evtchn_port_t port)
  {
      unsigned int max_ports = BITS_PER_EVTCHN_WORD(d) * BITS_PER_EVTCHN_WORD(d);
 
+    ASSERT(d->shared_info.virt);
      ASSERT(port < max_ports);
      return (port >= max_ports ||
              guest_test_bit(d, port, &shared_info(d, evtchn_mask)));
diff --git a/xen/common/event_fifo.c b/xen/common/event_fifo.c
index 230f440f14..e8c6045d72 100644
--- a/xen/common/event_fifo.c
+++ b/xen/common/event_fifo.c
@@ -497,6 +497,7 @@ static void setup_ports(struct domain *d)
          evtchn = evtchn_from_port(d, port);
 
+        ASSERT(d->shared_info.virt);
          if ( guest_test_bit(d, port, &shared_info(d, evtchn_pending)) )
              evtchn->pending = 1;
diff --git a/xen/common/time.c b/xen/common/time.c
index 58fa9abc40..938226c7b1 100644
--- a/xen/common/time.c
+++ b/xen/common/time.c
@@ -99,6 +99,9 @@ void update_domain_wallclock_time(struct domain *d)
      uint32_t *wc_version;
      uint64_t sec;
+    if ( d->is_dying )
+        return;

This is another instance where I think the use of d->is_dying is not safe. I looked at how other places in Xen dealt with d->is_dying.

We don't seem to have a common way to deal with it:
  1) It may be checked under domain_lock() -> no issue with that.
  2) It may be checked under d->page_alloc_lock (e.g. assign_pages()). The assign_pages() case is fine because it will act as a full barrier, so if the call happens after relinquish_memory() then we will surely have observed d->is_dying. I haven't checked the others.

Some of the instances use neither of the two locks above. We probably ought to investigate them (I will add a TODO to my list).

Regarding the two cases here, domain_lock() might be the solution. If we are concerned about contention (it is a spinlock), then we could switch domain_lock() from a spinlock to an rwlock.
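
For update_domain_wallclock_time() that could look roughly like this (sketch only; how it interacts with wc_lock would need checking):

    domain_lock(d);

    if ( d->is_dying )
    {
        domain_unlock(d);
        return;
    }

    /* ... existing body ... */

    domain_unlock(d);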

Cheers,

--
Julien Grall

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel

 

