Keir Fraser wrote:
>I mean forcibly decrement them to zero and free them right there and
>then. Of course, as you point out, the problem is that some of the
>pages are mapped in domain0. I'm not sure how we can distinguish
>tainted refcnts from genuine external references. Perhaps there's a
>proper way we should be destructing the full shadow pagetables such
>that the refcnts end up at zero.
Thanks for your comment. I have done extensive tracing through
the domain destruction code in the hypervisor in the last few
days.
The bottom line: after domain destruction code in the hypervisor
is done, all shadow pages were indeed freed up - even though
the shadow_tainted_refcnts flag was set. I now believe the
remaining pages are genuinely externally referenced (possibly
by the qemu device model still running in domain0).
Here are more details on what I have found:
Ideally, when we destroy or shut down a VMX domain, the general
page reference counts ended up at 0 in shadow mode, so that the
pages can be released properly from the domain.
I have traced quite a bit of code for different scenarios
involving Windows XP running in a VMX domain. I only
did simple operations in Windows XP, but I tried to destroy
the VMX domain at different times (e.g. during Windows XP boot,
during simple operations, after Windows XP has been shutdown, etc.)
For non-VMX (Linux) domains, after we relinquish memory in
domain_relinquish_resources(), all pages in the domain's page
list indeed had reference count of 0 and were properly freed from
the xen heap - just like we expected.
For VMX (e.g., Windows XP) domains, after we relinquish memory in
domain_relinquish_resources(), depending on how many activities
were done in Windows XP, there were anywhere from 2 to 100 pages
remaining just before the domain's structures were freed up
by the hypervisor. Most of these pages still have page
reference counts of 1, and therefore, could not be freed
from the heap by the hypervisor. This prevents the rest
of the domain's resources from being released, and therefore,
'xm list' still shows the VMX domains after they were destroyed.
In shadow mode, the following things could be reflected
in the page (general) reference counts:
(a) General stuff:
- page is allocated (PGC_allocated)
- page is pinned
- page is pointed by CR3's
(b) Shadow page tables (l1, l2, hl2, etc.)
(c) Out-of-sync entries
(d) Grant table mappings
(e) External references (not through grant table)
(f) Monitor page table references (external shadow mode)
(g) Writable PTE predictions
(h) GDTs/LDTs
So I put in a lot of instrumentation and tracing code,
and made sure that the above things were taken into
account and removed from the page reference counts
during the domain destruction code sequence in the
hypervisor. During this code sequence, we disable
shadow mode (shadow_mode_disable()) and the
shadow_tainted_refcnts flag was set. However,
much to my surprise, the page reference counts
were properly taken care of in shadow mode, and
all shadow pages (including those in l1, l2, hl2
tables and snapshots) were all freed up.
In particular, here's where each of the things
in the above list was taken into account during
the domain destruction code sequence in the
hypervisor:
(a) General stuff:
- None of remaining pages have PGC_allocated
flag set
- None of remaining pages are still pinned
- The monitor shadow ref was 0, and all
pages pointed to by CR3's were taken care
of in free_shadow_pages()
(b) All shadow pages (including those pages in
l1, l2, hl2, snapshots) were freed properly.
I implemented counters to track all shadow
page promotions/allocations and demotions/
deallocations throughout the hypervisor code,
and at the end after we relinquished all domain
memory pages, these counters did indeed
return to 0 - as we expected.
(c) out-of-sync entries -> in free_out_of_sync_state()
called by free_shadow_pages().
(d) grant table mappings -> the count of active
grant table mappings is 0 after the domain
destruction sequence in the hypervisor is
executed.
(e) external references not mapped via grant table
-> I believe that these include the qemu-dm
pages which still remain after we relinquish
all domain memory pages - as the qemu-dm may
still be active after a VMX domain has been
destroyed.
(f) external monitor page references -> all references
from monitor page table are dropped in
vmx_relinquish_resources(), and monitor table
itself is freed in domain_destruct(). In fact,
in my code traces, the monitor shadow reference
count was 0 after the domain destruction code in
the hypervisor.
(g) writable PTE predictions -> I didn't see any pages in
this category in my code traces, but if there
are, they would be freed up in free_shadow_pages().
(h) GDTs/LDTs -> these were destroyed in destroy_gdt() and
invalidate_shadow_ldt() called from domain_relinquish_
resources().
Based on the code instrumentation and tracing above, I am
pretty confident that the shadow page reference counts
were handled properly during the domain destruction code
sequence in the hypervisor. There is a problem in keeping
track of shadow page counts (domain->arch.shadow_page_count),
and I will submit a patch to fix this shortly. However, this
does not really impact how shadow pages are handled.
Consequently, the pages that still remain after the domain
destruction code sequence in the hypervisor are externally
referenced and may belong to the qemu device model running
in domain0. The fact that qemu-dm is still active for some
time after a VMX domain has been torn down in the hypervisor
is evident by examining the tools code (python). In fact,
if I forcibly free these remaining pages from the xen heap,
the system/dom0 crashed.
Am I missing anything ? Your comments, suggestions, etc.,
are welcome! Thanks for reading this rather long email :-)
Khoa H.
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel
|