
Re: [Xen-devel] [RFC][PATCH 04/13] tools/libxl: detect and avoid conflicts with RDM



On 2015/4/20 19:13, Jan Beulich wrote:
On 10.04.15 at 11:21, <tiejun.chen@xxxxxxxxx> wrote:
--- a/tools/libxc/xc_domain.c
+++ b/tools/libxc/xc_domain.c
@@ -1665,6 +1665,46 @@ int xc_assign_device(
      return do_domctl(xch, &domctl);
  }

+struct xen_reserved_device_memory
+*xc_device_get_rdm(xc_interface *xch,
+                   uint32_t flag,
+                   uint16_t seg,
+                   uint8_t bus,
+                   uint8_t devfn,
+                   unsigned int *nr_entries)
+{

So what's the point of having both this new function and
xc_reserved_device_memory_map()? Is the latter useful for
anything besides the purpose here?

My intent is that xc_reserved_device_memory_map() stays the standard interface wrapping the XENMEM_reserved_device_memory_map hypercall, while xc_device_get_rdm() additionally handles some errors (like ENOBUFS) in the current case.

I think you are hinting we just need one, right?


+    struct xen_reserved_device_memory *xrdm = NULL;
+    int rc = xc_reserved_device_memory_map(xch, flag, seg, bus, devfn, xrdm,
+                                           nr_entries);
+
+    if ( rc < 0 )
+    {
+        if ( errno == ENOBUFS )
+        {
+            if ( (xrdm = malloc(*nr_entries *
+                                sizeof(xen_reserved_device_memory_t))) == NULL )
+            {
+                PERROR("Could not allocate memory.");

Now that's exactly the kind of error message that makes no sense:
As errno will already cause PERROR() to print something along the
lines of the message you provide here, you're just creating
redundancy. Indicating the purpose of the allocation, otoh, would
add helpful context for the one inspecting the resulting log.

What about this?

PERROR("Could not allocate memory buffers to store reserved device memory entries.");


@@ -275,6 +272,7 @@ static int setup_guest(xc_interface *xch,
      elf_parse_binary(&elf);
      v_start = 0;
      v_end = args->mem_size;
+    v_lowend = lowmem_end = args->lowmem_size;

Considering that both variables get set to the same value and
don't appear to be changed subsequently - why two variables?

Yes, I really should drop one.


@@ -302,8 +300,11 @@ static int setup_guest(xc_interface *xch,

      for ( i = 0; i < nr_pages; i++ )
          page_array[i] = i;
-    for ( i = mmio_start >> PAGE_SHIFT; i < nr_pages; i++ )
-        page_array[i] += mmio_size >> PAGE_SHIFT;
+    /*
+     * This condition 'lowmem_end <= mmio_start' is always true.
+     */

For one I think you mean "The", not "This", as there's no such
condition around here. And then - why? DYM "is supposed to
always be true"? In which case you may want to check...

I always do this inside libxl__build_hvm(), before setup_guest() is called:

+    if (args.lowmem_size > mmio_start)
+        args.lowmem_size = mmio_start;

And in addition, we also have another policy for rdm:

    #1. Above a predefined boundary (default 2G)
        - move lowmem_end below reserved region to solve conflict;

This means args.lowmem_size < mmio_start is possible as well.

So here I'm saying the condition is always true.


@@ -588,9 +589,6 @@ int xc_hvm_build(xc_interface *xch, uint32_t domid,
      if ( args.mem_target == 0 )
          args.mem_target = args.mem_size;

-    if ( args.mmio_size == 0 )
-        args.mmio_size = HVM_BELOW_4G_MMIO_LENGTH;
-
      /* An HVM guest must be initialised with at least 2MB memory. */
      if ( args.mem_size < (2ull << 20) || args.mem_target < (2ull << 20) )
          return -1;
@@ -634,6 +632,8 @@ int xc_hvm_build_target_mem(xc_interface *xch,
      args.mem_size = (uint64_t)memsize << 20;
      args.mem_target = (uint64_t)target << 20;
      args.image_file_name = image_name;
+    if ( args.mmio_size == 0 )
+        args.mmio_size = HVM_BELOW_4G_MMIO_LENGTH;

Is this code motion really necessary?

Yes, and actually I'm just moving this code.

Please take a look at that original code flow:

libxl__build_hvm()
    |
    + xc_hvm_build()
    |
    + args.mmio_size = HVM_BELOW_4G_MMIO_LENGTH;
        |
        + setup_guest()
        |
        + populate guest memory


After we introduce libxl__domain_device_check_rdm() to handle our policy, it needs the final mmio_size to calculate our lowmem_size. So I moved 'args.mmio_size = HVM_BELOW_4G_MMIO_LENGTH' out of xc_hvm_build(), and call libxl__domain_device_check_rdm() before it:

libxl__build_hvm()
    |
    + args.mmio_size = HVM_BELOW_4G_MMIO_LENGTH;
    |
    + libxl__domain_device_check_rdm()
    |
    + xc_hvm_build()

This also affects xc_hvm_build_target_mem(), which also uses xc_hvm_build().


--- a/tools/libxl/libxl_dm.c
+++ b/tools/libxl/libxl_dm.c
@@ -90,6 +90,201 @@ const char *libxl__domain_device_model(libxl__gc *gc,
      return dm;
  }

+/*
+ * Check whether there exists rdm hole in the specified memory range.
+ * Returns 1 if exists, else returns 0.
+ */
+static int check_rdm_hole(uint64_t start, uint64_t memsize,
+                          uint64_t rdm_start, uint64_t rdm_size)
+{
+    if ( start + memsize <= rdm_start || start >= rdm_start + rdm_size )
+        return 0;
+    else
+        return 1;
+}

The function appears to return a boolean type, so please make its
return value express this. Also the name doesn't really imply the
meaning of "true" and "false" being returned - how about
overlaps_rdm()? And finally, while I'm not the maintainer of this code

Looks good.

and hence don't have the final say on stylistic issues, I don't see
why the above couldn't be expressed with a single return statement.

Are you suggesting something like this? Note you showed this yourself a long time ago.

static bool check_mmio_hole_conflict(uint64_t start, uint64_t memsize,
                                     uint64_t mmio_start, uint64_t mmio_size)
{
    return start + memsize > mmio_start && start < mmio_start + mmio_size;
}

But I don't think this really works for our case.


+int libxl__domain_device_check_rdm(libxl__gc *gc,
+                                   libxl_domain_config *d_config,
+                                   uint64_t rdm_mem_guard,
+                                   struct xc_hvm_build_args *args)
+{
+    int i, j, conflict;
+    libxl_ctx *ctx = libxl__gc_owner(gc);
+    struct xen_reserved_device_memory *xrdm = NULL;
+    unsigned int nr_all_rdms = 0;
+    uint64_t rdm_start, rdm_size, highmem_end = (1ULL << 32);
+    uint32_t type = d_config->b_info.rdm.type;
+    uint16_t seg;
+    uint8_t bus, devfn;
+
+    /* Might not to expose rdm. */
+    if ((type == LIBXL_RDM_RESERVE_TYPE_NONE) && !d_config->num_pcidevs)
+        return 0;
+
+    /* Collect all rdm info if exist. */
+    xrdm = xc_device_get_rdm(ctx->xch, LIBXL_RDM_RESERVE_TYPE_HOST,
+                             0, 0, 0, &nr_all_rdms);

What meaning has passing a libxl private value to a libxc function?

We intend to collect all rdm entry info in advance, and then construct d_config->rdms based on our policies as follows. We need to allocate d_config->rdms properly up front to store the rdms, but in some cases we don't know how many entries are enough. For example, the global flag may not be set while multiple PCI devices are assigned. And a shared entry worsens this situation further.

So here we set the flag to LIBXL_RDM_RESERVE_TYPE_HOST, but without any SBDF, to grab all rdms.


+    if (!nr_all_rdms)
+        return 0;
+    d_config->rdms = libxl__calloc(gc, nr_all_rdms,
+                                   sizeof(libxl_device_rdm));
+    memset(d_config->rdms, 0, sizeof(libxl_device_rdm));
+
+    /* Query all RDM entries in this platform */
+    if (type == LIBXL_RDM_RESERVE_TYPE_HOST) {
+        d_config->num_rdms = nr_all_rdms;
+        for (i = 0; i < d_config->num_rdms; i++) {
+            d_config->rdms[i].start =
+                                (uint64_t)xrdm[i].start_pfn << XC_PAGE_SHIFT;
+            d_config->rdms[i].size =
+                                (uint64_t)xrdm[i].nr_pages << XC_PAGE_SHIFT;
+            d_config->rdms[i].flag = d_config->b_info.rdm.reserve;
+        }
+    } else {
+        d_config->num_rdms = 0;
+    }
+    free(xrdm);
+
+    /* Query RDM entries per-device */
+    for (i = 0; i < d_config->num_pcidevs; i++) {
+        unsigned int nr_entries = 0;
+        bool new = true;
+        seg = d_config->pcidevs[i].domain;
+        bus = d_config->pcidevs[i].bus;
+        devfn = PCI_DEVFN(d_config->pcidevs[i].dev, d_config->pcidevs[i].func);
+        nr_entries = 0;
+        xrdm = xc_device_get_rdm(ctx->xch, LIBXL_RDM_RESERVE_TYPE_NONE,
+                                 seg, bus, devfn, &nr_entries);
+        /* No RDM to associated with this device. */
+        if (!nr_entries)
+            continue;
+
+        /* Need to check whether this entry is already saved in the array.
+         * This could come from two cases:
+         *
+         *   - user may configure to get all RMRRs in this platform, which
+         * is already queried before this point
+         *   - or two assigned devices may share one RMRR entry
+         *
+         * different policies may be configured on the same RMRR due to above
+         * two cases. We choose a simple policy to always favor stricter policy
+         */
+        for (j = 0; j < d_config->num_rdms; j++) {
+            if (d_config->rdms[j].start ==
+                                (uint64_t)xrdm[0].start_pfn << XC_PAGE_SHIFT)
+             {
+                if (d_config->rdms[j].flag != LIBXL_RDM_RESERVE_FLAG_FORCE)
+                    d_config->rdms[j].flag = d_config->pcidevs[i].rdm_reserve;
+                new = false;
+                break;
+            }
+        }
+
+        if (new) {
+            if (d_config->num_rdms > nr_all_rdms - 1) {

">= nr_all_rdms" expresses the same thing.

Okay.


+                LIBXL__LOG(CTX, LIBXL__LOG_ERROR, "Exceed rdm array boundary!\n");
+                free(xrdm);
+                return -1;
+            }
+
+            /*
+             * This is a new entry.
+             */
+            d_config->rdms[d_config->num_rdms].start =
+                                (uint64_t)xrdm[0].start_pfn << XC_PAGE_SHIFT;
+            d_config->rdms[d_config->num_rdms].size =
+                                (uint64_t)xrdm[0].nr_pages << XC_PAGE_SHIFT;
+            d_config->rdms[d_config->num_rdms].flag =
+                                d_config->pcidevs[i].rdm_reserve;
+            d_config->num_rdms++;
+        }
+        free(xrdm);
+    }
+
+    /* Fix highmem. */
+    if (args->mem_size > args->lowmem_size)
+        highmem_end += (args->mem_size - args->lowmem_size);

This looks misplaced. In particular it doesn't depend on anything done
so far in this function, so it being here is at least confusing. Is there

Sure.

any reason not to set up the variable correctly right at the beginning
of the function, or do _all_ of its initialization here?

So I think I should move that right at the beginning of the function.


+    /* Next step is to check and avoid potential conflict between RDM entries
+     * and guest RAM. To avoid intrusive impact to existing memory layout
+     * {lowmem, mmio, highmem} which is passed around various function blocks,
+     * below conflicts are not handled which are rare and handling them would
+     * lead to a more scattered layout:
+     *  - RMRR in highmem area (>4G)
+     *  - RMRR lower than a defined memory boundary (e.g. 2G)

So how is that going to do on my box having RMRRs below 1Mb?

This works just as well in the (<2G) case: we still check which policy is passed, 'strict' versus 'relaxed'.


+     * Otherwise for conflicts between boundary and 4G, we'll simply move lowmem
+     * end below reserved region to solve conflict.
+     * end below reserved region to solve conflict.
+     *
+     * If a conflict is detected on a given RMRR entry, an error will be
+     * returned.
+     * If 'force' policy is specified. Or conflict is treated as a warning if

The first sentence here appears to belong at the end of the one ending
on the previous line.

+     * 'try' policy is specified, and we also mark this as INVALID not to expose
+     * this entry to hvmloader.

What is "this" in "... also mark this as ..."? Certainly neither the conflict
nor the warning.

Sorry, this is my fault.

     * If a conflict is detected on a given RMRR entry, an error will be
     * returned if 'strict' policy is specified. Or conflict is treated as a
     * warning if 'relaxed' policy is specified, and we also mark this as
     * INVALID not to expose this entry to hvmloader.


+     *
+     * Firstly we should check the case of rdm < 4G because we may need to
+     * expand highmem_end.
+     */
+    for (i = 0; i < d_config->num_rdms; i++) {
+        rdm_start = d_config->rdms[i].start;
+        rdm_size = d_config->rdms[i].size;
+        conflict = check_rdm_hole(0, args->lowmem_size, rdm_start, rdm_size);
+
+        if (!conflict)
+            continue;
+
+        /*
+         * Just check if RDM > our memory boundary
+         */
+        if (d_config->rdms[i].start > rdm_mem_guard) {
+            /*
+             * We will move downwards lowmem_end so we have to expand
+             * highmem_end.
+             */
+            highmem_end += (args->lowmem_size - rdm_start);
+            /* Now move downwards lowmem_end. */
+            args->lowmem_size = rdm_start;

Considering that the action here doesn't depend on the specific
->rdms[] slot being looked at, I don't see why the loop needs to

I'm not sure I understand what you mean.

The rdm entries in d_config->rdms are stored unsorted, so we have to traverse all of them to make sure args->lowmem_size ends up below every rdm's start address.

continue. Also - what if rdm_mem_guard > args->lowmem_size?

We have this check before we go on to check rdm_mem_guard:

conflict = check_rdm_hole(0, args->lowmem_size, rdm_start, rdm_size);
if (!conflict)
    continue;

This means args->lowmem_size is already above rdm_start, so at this point we are actually working under the condition

(args->lowmem_size > rdm_start) && (rdm_start > rdm_mem_guard)


+        }
+    }
+
+    /*
+     * Finally we can take same policy to check lowmem(< 2G) and
+     * highmem adjusted above.
+     */

I don't follow - neither the code above nor the one below makes any
distinction between memory below 2Gb and memory below 4Gb (as
far as check_rdm_hole() calls are concerned).

As you discussed during the design phase, we just need a memory guard specific to rdm, regardless of whether args->lowmem_size is below 2G or in [2G, 4G], right?

In the [2G, 4G] case we apply the policy args->lowmem_size = rdm_start, as implemented above. But in the case of args->lowmem_size < 2G (the memory guard), we try to handle any conflict directly according to our policy. And currently we'd like to follow this same policy to handle any conflict with highmem as well.


+    for (i = 0; i < d_config->num_rdms; i++) {
+        rdm_start = d_config->rdms[i].start;
+        rdm_size = d_config->rdms[i].size;
+        /* Does this entry conflict with lowmem? */
+        conflict = check_rdm_hole(0, args->lowmem_size,
+                                  rdm_start, rdm_size);
+        /* Does this entry conflict with highmem? */
+        conflict |= check_rdm_hole((1ULL<<32), highmem_end,

The second parameter of the function is a size, not an address.

You're right. And this should be 'highmem_end - (1ULL<<32)'.


@@ -802,6 +804,8 @@ int libxl__build_hvm(libxl__gc *gc, uint32_t domid,
       * Do all this in one step here...
       */
      args.mem_size = (uint64_t)(info->max_memkb - info->video_memkb) << 10;
+    args.lowmem_size = (args.mem_size > (1ULL << 32)) ?
+                            (1ULL << 32) : args.mem_size;

Use min().

Sure.


@@ -811,6 +815,27 @@ int libxl__build_hvm(libxl__gc *gc, uint32_t domid,
          if (max_ram_below_4g < HVM_BELOW_4G_MMIO_START)
              args.mmio_size = info->u.hvm.mmio_hole_memkb << 10;
      }
+
+    if (args.mmio_size == 0)
+        args.mmio_size = HVM_BELOW_4G_MMIO_LENGTH;
+    mmio_start = (1ull << 32) - args.mmio_size;
+
+    if (args.lowmem_size > mmio_start)
+        args.lowmem_size = mmio_start;
+
+    /*
+     * We'd like to set a memory boundary to determine if we need to check
+     * any overlap with reserved device memory.
+     *
+     * TODO: we will add a config parameter for this boundary value next.

According to the titles of subsequent patches this doesn't seem to
happen within this series.

Sorry, I actually meant we would do this after the current revision is accepted. So let me just rephrase this comment:

 * TODO: It's better to provide a config parameter for this boundary value.


But overall the intentions of this patch seem right.


Really appreciate your comments.

Thanks
Tiejun

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

