
Re: [Xen-devel] [RFC][PATCH 04/13] tools/libxl: detect and avoid conflicts with RDM



On 2015/5/19 4:00, Wei Liu wrote:
On Thu, May 14, 2015 at 04:27:45PM +0800, Chen, Tiejun wrote:
On 2015/5/11 19:32, Wei Liu wrote:
On Mon, May 11, 2015 at 04:09:53PM +0800, Chen, Tiejun wrote:
On 2015/5/8 22:43, Wei Liu wrote:
Sorry for the late review. This series fell through the cracks.


Thanks for your review.

On Fri, Apr 10, 2015 at 05:21:55PM +0800, Tiejun Chen wrote:
While building a VM, the HVM domain builder provides struct hvm_info_table{}
to help hvmloader. Currently it includes two fields that hvmloader uses to
construct the guest e820 table, low_mem_pgend and high_mem_pgend. So we
should check them to fix any conflict with RAM.

An RMRR can theoretically reside in the address space beyond 4G, but we have
never seen this in the real world. So in order to avoid breaking the highmem
layout

How does this break highmem layout?

In most cases highmem is contiguous, like [4G, ...], but an RMRR can
theoretically lie beyond 4G, although that is very rare. In particular, we
have never seen it happen in the real world. So we don't want such a case to
break the highmem layout.


The problem is that taking this approach just because this rarely happens
*now* is not future proof. It needs to be clearly documented somewhere in
the manual (or other Intel docs) and referenced in the code. Otherwise we
might end up in a situation where no one knows how it is supposed to work
and no one can fix it if it breaks in the future, that is, when every single
device on earth requires RMRR > 4G overnight (I'm exaggerating).

Or you can just make it work with highmem. How much more work do you
envisage?

(If my above comment makes no sense do let me know. I only have a very
shallow understanding of RMRR.)

Maybe I'm misleading you :)

I don't mean that RMRR > 4G is not allowed to work in our implementation.
What I'm saying is that our *policy* is deliberately simple for this kind of
rare highmem case...


Yes, but this policy is hardcoded (as in, you bail when detecting a conflict
in the highmem region). I don't think we have the expertise to un-hardcode
it in the future (it might require x86-specific insider information and
specific hardware to test). So I would like to make it as future proof as
possible.

I know you're limited by hvm_info. I would accept this hardcoded policy as a
short term solution, but I would like a commitment from Intel to maintain
this piece of code and properly work out a flexible solution to deal with
the >4G case.

It looks like Kevin has already replied to this.




we don't solve highmem conflicts. Note this means a highmem RMRR could still
be supported if there is no conflict.


Like these two sentences above.


Aren't you actively trying to avoid conflict in libxl?

The RMRR is fixed by the BIOS, so we can't avoid conflicts. Here we want to
adopt some reasonable policies to handle RMRR. In the case of highmem,
shouldn't that simple policy be enough?


Whatever policy you and HV maintainers agree on. Just clearly document
it.

Do you mean I should turn this patch description into a separate document?


Either in code comment or in a separate document.

Okay.





[...]

The caller of xc_device_get_rdm always frees this.


I think I misunderstood how this function works. I thought xrdm was
passed in by caller, which is clearly not the case. Sorry!


In that case, the `if ( rc < 0 )' is not needed because the call should
always return rc < 0. An assert is good enough.

assert(rc < 0)? But we can't presume the caller always passes '0' first,
and additionally, we may have no RMRR at all.

So I guess what you want is,

assert( rc <=0 );
if ( !rc )
     goto xrdm;


Yes I was thinking about something like this.

How in this case can rc equal 0? Is it because you don't actually have any

Yes.

regions? If so, please write a comment.


Sure.

if ( errno == ENOBUFS )
...

Right?
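
Putting these fragments together, I think the overall shape is something
like the sketch below. Note rdm_map() is just a hypothetical stand-in for
the hypercall wrapper behind xc_device_get_rdm; its signature is assumed
for illustration only, not taken from the patch.

#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

struct rdm_entry { uint64_t start_pfn, nr_pages; };

/* Hypothetical stand-in for the hypercall wrapper: returns 0 when there
 * are no RDM entries at all, fills 'entries' and returns 0 when the
 * buffer is large enough, otherwise fails with errno == ENOBUFS and
 * updates *nr_entries to the required count. */
int rdm_map(struct rdm_entry *entries, unsigned int *nr_entries);

static int get_rdm(struct rdm_entry **xrdm, unsigned int *nr_entries)
{
    int rc;

    *nr_entries = 0;
    rc = rdm_map(NULL, nr_entries);   /* first pass: query the count */
    assert(rc <= 0);                  /* the wrapper never returns > 0 */
    if (!rc)
        return 0;                     /* rc == 0: no RDM entries exist */
    if (errno != ENOBUFS)
        return -1;                    /* a real error, not "buffer too small" */

    *xrdm = calloc(*nr_entries, sizeof(**xrdm));
    if (!*xrdm)
        return -1;

    rc = rdm_map(*xrdm, nr_entries);  /* second pass: fill the buffer */
    if (rc) {
        free(*xrdm);
        *xrdm = NULL;
        return -1;
    }
    return 0;                         /* on success the caller frees *xrdm */
}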



[...]


+    memset(d_config->rdms, 0, sizeof(libxl_device_rdm));
+
+    /* Query all RDM entries in this platform */
+    if (type == LIBXL_RDM_RESERVE_TYPE_HOST) {
+        d_config->num_rdms = nr_all_rdms;
+        for (i = 0; i < d_config->num_rdms; i++) {
+            d_config->rdms[i].start =
+                                (uint64_t)xrdm[i].start_pfn << XC_PAGE_SHIFT;
+            d_config->rdms[i].size =
+                                (uint64_t)xrdm[i].nr_pages << XC_PAGE_SHIFT;
+            d_config->rdms[i].flag = d_config->b_info.rdm.reserve;
+        }
+    } else {
+        d_config->num_rdms = 0;
+    }

And you should move d_config->rdms = libxl__calloc inside that `if'.
That is, don't allocate memory if you don't need it.
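
Roughly something like this (only a sketch, reusing the names from the hunk
above; gc, nr_all_rdms and xrdm are assumed to be in scope exactly as in
your patch):

    /* Sketch: allocate only when the host policy really needs all entries. */
    if (type == LIBXL_RDM_RESERVE_TYPE_HOST) {
        d_config->num_rdms = nr_all_rdms;
        d_config->rdms = libxl__calloc(gc, d_config->num_rdms,
                                       sizeof(libxl_device_rdm));
        for (i = 0; i < d_config->num_rdms; i++) {
            d_config->rdms[i].start =
                                (uint64_t)xrdm[i].start_pfn << XC_PAGE_SHIFT;
            d_config->rdms[i].size =
                                (uint64_t)xrdm[i].nr_pages << XC_PAGE_SHIFT;
            d_config->rdms[i].flag = d_config->b_info.rdm.reserve;
        }
    } else {
        d_config->num_rdms = 0;
        d_config->rdms = NULL;
    }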

We can't, since in all cases we need to preallocate this, and then we handle
it according to our policy.


How would it ever be used again if you set d_config->num_rdms to 0? How
do you know the exact size of your array again?

If we don't consider multiple devices sharing one RDM entry, our workflow
can be described as follows:

#1. We always preallocate the whole rdms[] array, but zeroed with memset().

Because the number of RDM entries used in a domain never exceeds the
number of global entries? I.e. we never risk overrunning the array?

Yes.

"global" indicates the number would count all rdm entries. If w/o "global" flag, we just count these rdm entries associated to those pci devices assigned to this domain. So this means actually we have this equation,

The number without "global" <= The number with "global"


#2. Then we have two cases for that global rule,

#2.1 If we really need all of them according to the global rule, we fill
rdms[] with the real RDM info and set d_config->num_rdms accordingly.
#2.2 If there is no such global rule requesting all of them, we just clear
d_config->num_rdms.


No, don't do this. In any failure path, if the free / dispose function
depends on num_rdms to iterate through the whole list to dispose memory
(if your rdm structure later contains pointers to allocated memory),
this method leaks memory.

The num_ field should always reflect the actual size of the array.
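
To make that concrete, the eventual dispose path would presumably follow the
usual IDL-generated pattern, something like the sketch below (the name
libxl_device_rdm_dispose is assumed here, since it would be generated from
the new IDL type). If num_rdms has been forced to 0, the loop never visits
entries that might own memory:

    /* Sketch of the usual IDL-generated dispose pattern (names assumed). */
    for (i = 0; i < d_config->num_rdms; i++)
        libxl_device_rdm_dispose(&d_config->rdms[i]);
    free(d_config->rdms);
    d_config->rdms = NULL;
    d_config->num_rdms = 0;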

#3. Then, for the per-device rule,

#3.1 Starting from #2.1, we just need to check whether a given RDM entry's
policy should be changed if that entry is associated with this device.
#3.2 Starting from #2.2, obviously we just need to fill rdms[] one by one.
Of course, it's quite possible that we don't fill all of rdms[], since the
passed-through devices might have no RMRR at all or only occupy a few
entries. But in any case, we finally sync d_config->num_rdms.


Sorry, I don't follow. But if your problem is that you don't know how many
entries you actually need, why not just use libxl__realloc?



+    free(xrdm);
+
+    /* Query RDM entries per-device */
+    for (i = 0; i < d_config->num_pcidevs; i++) {
+        unsigned int nr_entries = 0;

Maybe I should restate this,
        unsigned int nr_entries;


[...]


As I replied above, we always preallocate this at the beginning.


Ah, OK.

But please don't do this. It's hard to see that you don't overrun the
buffer. Please allocate memory only when you need it.

Sorry, I don't understand this. As I mentioned above, we don't know how many
RDM entries we really need to allocate. That is why we'd like to preallocate
the whole rdms[] array at the beginning. This covers all cases: global
policy, (global policy & per device), and per device only. Even if multiple
devices share one RDM we also need to avoid adding a duplicate...


Can you use libxl__realloc?


It looks like this can expand the size automatically each time, right? If
so, I think we can try this.
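
Something along these lines might work (just a sketch: the add_rdm() helper
name is made up, and whether the array should hang off the function's gc or
NOGC depends on how long d_config needs to outlive it):

    static void add_rdm(libxl__gc *gc, libxl_domain_config *d_config,
                        uint64_t start, uint64_t size, uint32_t flag)
    {
        /* Grow the array only when an entry is actually added, so
         * num_rdms always matches the allocated size. */
        d_config->num_rdms++;
        d_config->rdms = libxl__realloc(gc, d_config->rdms,
                             d_config->num_rdms * sizeof(libxl_device_rdm));
        d_config->rdms[d_config->num_rdms - 1].start = start;
        d_config->rdms[d_config->num_rdms - 1].size  = size;
        d_config->rdms[d_config->num_rdms - 1].flag  = flag;
    }

Before adding a per-device entry we could first scan the existing rdms[]
for the same start address and just update its flag instead, which would
also cover the case of several devices sharing one RMRR.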

Thanks
Tiejun


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

