
[Xen-devel] [PATCH v2 3/3] docs: remove tmem related text



Signed-off-by: Wei Liu <wei.liu2@xxxxxxxxxx>
---
 docs/man/xl.conf.pod.5              |   9 +-
 docs/man/xl.pod.1.in                |  68 ----
 docs/misc/tmem-internals.html       | 789 ------------------------------------
 docs/misc/xen-command-line.markdown |   6 -
 docs/misc/xsm-flask.txt             |  36 --
 5 files changed, 2 insertions(+), 906 deletions(-)
 delete mode 100644 docs/misc/tmem-internals.html

diff --git a/docs/man/xl.conf.pod.5 b/docs/man/xl.conf.pod.5
index 37262a7ef8..b1bde7d657 100644
--- a/docs/man/xl.conf.pod.5
+++ b/docs/man/xl.conf.pod.5
@@ -148,10 +148,8 @@ The default choice is "xvda".
 =item B<claim_mode=BOOLEAN>
 
 If this option is enabled then when a guest is created there will be an
-guarantee that there is memory available for the guest. This is an
-particularly acute problem on hosts with memory over-provisioned guests
-that use tmem and have self-balloon enabled (which is the default
-option). The self-balloon mechanism can deflate/inflate the balloon
+guarantee that there is memory available for the guest.
+The self-balloon mechanism can deflate/inflate the balloon
 quickly and the amount of free memory (which C<xl info> can show) is
 stale the moment it is printed. When claim is enabled a reservation for
 the amount of memory (see 'memory' in xl.conf(5)) is set, which is then
@@ -163,9 +161,6 @@ If the reservation cannot be meet the guest creation fails immediately
 instead of taking seconds/minutes (depending on the size of the guest)
 while the guest is populated.
 
-Note that to enable tmem type guests, one needs to provide C<tmem> on the
-Xen hypervisor argument and as well on the Linux kernel command line.
-
 Default: C<1>
 
 =over 4
diff --git a/docs/man/xl.pod.1.in b/docs/man/xl.pod.1.in
index 18006880d6..7c765dbc3c 100644
--- a/docs/man/xl.pod.1.in
+++ b/docs/man/xl.pod.1.in
@@ -1677,74 +1677,6 @@ Obtain information of USB devices connected as such via the device model
 
 =back
 
-=head1 TRANSCENDENT MEMORY (TMEM)
-
-=over 4
-
-=item B<tmem-list> I<[OPTIONS]> I<domain-id>
-
-List tmem pools.
-
-B<OPTIONS>
-
-=over 4
-
-=item B<-l>
-
-If this parameter is specified, also list tmem stats.
-
-=back
-
-=item B<tmem-freeze> I<domain-id>
-
-Freeze tmem pools.
-
-=item B<tmem-thaw> I<domain-id>
-
-Thaw tmem pools.
-
-=item B<tmem-set> I<domain-id> [I<OPTIONS>]
-
-Change tmem settings.
-
-B<OPTIONS>
-
-=over 4
-
-=item B<-w> I<WEIGHT>
-
-Weight (int)
-
-=item B<-p> I<COMPRESS>
-
-Compress (int)
-
-=back
-
-=item B<tmem-shared-auth> I<domain-id> [I<OPTIONS>]
-
-De/authenticate shared tmem pool.
-
-B<OPTIONS>
-
-=over 4
-
-=item B<-u> I<UUID>
-
-Specify uuid (abcdef01-2345-6789-1234-567890abcdef)
-
-=item B<-a> I<AUTH>
-
-0=auth,1=deauth
-
-=back
-
-=item B<tmem-freeable>
-
-Get information about how much freeable memory (MB) is in-use by tmem.
-
-=back
-
 =head1 FLASK
 
 B<FLASK> is a security framework that defines a mandatory access control policy
diff --git a/docs/misc/tmem-internals.html b/docs/misc/tmem-internals.html
deleted file mode 100644
index 9b7e70e650..0000000000
--- a/docs/misc/tmem-internals.html
+++ /dev/null
@@ -1,789 +0,0 @@
-<h1>Transcendent Memory Internals in Xen</h1>
-<P>
-by Dan Magenheimer, Oracle Corp.</p>
-<P>
-Draft 0.1 -- Updated: 20100324
-<h2>Overview</h2>
-<P>
-This document focuses on the internal implementation of
-Transcendent Memory (tmem) on Xen.  It assumes
-that the reader has a basic knowledge of the terminology, objectives, and
-functionality of tmem and also has access to the Xen source code.
-It corresponds to the Xen 4.0 release, with
-patch added to support page deduplication (V2).
-<P>
-The primary responsibilities of the tmem implementation are to:
-<ul>
-<li>manage a potentially huge and extremely dynamic
-number of memory pages from a potentially large number of clients (domains)
-with low memory overhead and proper isolation
-<li>provide quick and efficient access to these
-pages with as much concurrency as possible
-<li>enable efficient reclamation and <i>eviction</i> of pages (e.g. when
-memory is fully utilized)
-<li>optionally, increase page density through compression and/or
-deduplication
-<li>where necessary, properly assign and account for
-memory belonging to guests to avoid malicious and/or accidental unfairness
-and/or denial-of-service
-<li>record utilization statistics and make them available to management tools
-</ul>
-<h2>Source Code Organization</h2>
-
-<P>
-The source code in Xen that provides the tmem functionality
-is divided up into four files: tmem.c, tmem.h, tmem_xen.c, and tmem_xen.h.
-The files tmem.c and tmem.h are intended to
-be implementation- (and hypervisor-) independent and the other two files
-provide the Xen-specific code.  This
-division is intended to make it easier to port tmem functionality to other
-hypervisors, though at this time porting to other hypervisors has not been
-attempted.  Together, these four files
-total less than 4000 lines of C code.
-<P>
-Even ignoring the implementation-specific functionality, the
-implementation-independent part of tmem has several dependencies on
-library functionality (Xen source filenames in parentheses):
-<ul>
-<li>
-a good fast general-purpose dynamic memory
-allocator with bounded response time and efficient use of memory for a very
-large number of sub-page allocations.  To
-achieve this in Xen, the bad old memory allocator was replaced with a
-slightly-modified version of TLSF (xmalloc_tlsf.c), first ported to Linux by
-Nitin Gupta for compcache.
-<li>
-good tree data structure libraries, specifically
-<i>red-black</i> trees (rbtree.c) and <i>radix</i> trees (radix-tree.c).
-Code for these was borrowed for Linux and adapted for tmem and Xen.
-<li>
-good locking and list code.  Both of these existed in Xen and required
-little or no change.
-<li>
-optionally, a good fast lossless compression
-library.  The Xen implementation added to
-support tmem uses LZO1X (lzo.c), also ported for Linux by Nitin Gupta.
-</ul>
-<P>
-More information about the specific functionality of these
-libraries can easily be found through a search engine, via wikipedia, or in the
-Xen or Linux source logs so we will not elaborate further here.
-
-<h2>Prefixes/Abbreviations/Glossary</h2>
-
-<P>
-The tmem code uses several prefixes and abbreviations.
-Knowledge of these will improve code readability:
-<ul>
-<li>
-<i>tmh</i> ==
-transcendent memory host.  Functions or
-data structures that are defined by the implementation-specific code, i.e. the
-Xen host code
-<li>
-<i>tmemc</i>
-== transcendent memory control.
-Functions or data structures that provide management tool functionality,
-rather than core tmem operations.
-<li>
-<i>cli </i>or
-<i>client</i> == client.
-The tmem generic term for a domain or a guest OS.
-</ul>
-<P>
-When used in prose, common tmem operations are indicated
-with a different font, such as <big><kbd>put</kbd></big>
-and <big><kbd>get</kbd></big>.
-
-<h2>Key Data Structures</h2>
-
-<P>
-To manage a huge number of pages, efficient data structures
-must be carefully selected.
-<P>
-Recall that a tmem-enabled guest OS may create one or more
-pools with different attributes.  It then
-<kbd>put</kbd></big>s and <kbd>get</kbd></big>s
-pages to/from this pool, identifying the page
-with a <i>handle</i> that consists of a <i>pool_id</i>, an <i>
-object_id</i>, and a <i>page_id </i>(sometimes
-called an <i>index</i>).
-This suggests a few obvious core data
-structures:
-<ul>
-<li>
-When a guest OS first calls tmem, a <i>client_t</i> is created to contain
-and track all uses of tmem by that guest OS.  Among
-other things, a <i>client_t</i> keeps pointers
-to a fixed number of pools (16 in the current Xen implementation).
-<li>
-When a guest OS requests a new pool, a <i>pool_t</i> is created.
-Some pools are shared and are kept in a
-sharelist (<i>sharelist_t</i>) which points
-to all the clients that are sharing the pool.
-Since an <i>object_id</i> is 64-bits,
-a <i>pool_t</i> must be able to keep track
-of a potentially very large number of objects.
-To do so, it maintains a number of parallel trees (256 in the current
-Xen implementation) and a hash algorithm is applied to the <i>object_id</i>
-to select the correct tree.
-Each tree element points to an object.
-Because an <i>object_id</i> usually represents an <i>inode</i>
-(a unique file number identifier), and <i>inode</i> numbers
-are fairly random, though often &quot;clumpy&quot;, a <i>red-black tree</i>
-is used.
-<li>
-When a guest first
-<kbd>put</kbd></big>s a page to a pool with an as-yet-unused <i>object_id,</i> an
-<i>obj_t</i> is created.  Since a <i
->page_id</i> is usually an index into a file,
-it is often a small number, but may sometimes be very large (up to
-32-bits).  A <i>radix tree</i> is a good data structure to contain items
-with this kind of index distribution.
-<li>
-When a page is
-<kbd>put</kbd></big>, a page descriptor, or <i>pgp_t</i>, is created, which
-among other things will point to the storage location where the data is kept.
-In the normal case the pointer is to a <i>pfp_t</i>, which is an
-implementation-specific datatype representing a physical pageframe in memory
-(which in Xen is a &quot;struct page_info&quot;).
-When deduplication is enabled, it points to
-yet another data structure, a <i>pcd_</i>t
-(see below).  When compression is enabled
-(and deduplication is not), the pointer points directly to the compressed data.
-For reasons we will see shortly, each <i>pgp_t</i> that represents
-an <i>ephemeral</i> page (that is, a page placed
-in an <i>ephemeral</i> pool) is also placed
-into two doubly-linked linked lists, one containing all ephemeral pages
-<kbd>put</kbd></big> by the same client and one
-containing all ephemeral pages across all clients (&quot;global&quot;).
-<li>
-When deduplication is enabled, multiple <i>pgp_</i>t's may need to point to
-the same data, so another data structure (and level of indirection) is used
-called a page content descriptor, or <i>pcd_t</i>.
-Multiple page descriptors (<i>pgp_t</i>'s) may point to the same <i>pcd_t</i>.
-The <i>pcd_t</i>, in turn, points to either a <i>pfp_t</i>
-(if a full page of data), directly to a
-location in memory (if the page has been compressed or trailing zeroes have
-been eliminated), or even a NULL pointer (if the page contained all zeroes and
-trailing zero elimination is enabled).
-</ul>
-<P>
-The most apparent usage of this multi-layer web of data structures
-is &quot;top-down&quot; because, in normal operation, the vast majority of tmem
-operations invoked by a client are
-<kbd>put</kbd></big>s and <kbd>get</kbd></big>s, which require the various
-data structures to be walked starting with the <i>client_t</i>, then
-a <i>pool_t</i>, then an <i>obj_t</i>, then a <i>pgd_t</i>.
-However, there is another highly frequent tmem operation that is not
-visible from a client: memory reclamation.
-Since tmem attempts to use all spare memory in the system, it must
-frequently free up, or <i>evict</i>,
-pages.  The eviction algorithm will be
-explained in more detail later but, in brief, to free memory, ephemeral pages
-are removed from the tail of one of the doubly-linked lists, which means that
-all of the data structures associated with that page-to-be-removed must be
-updated or eliminated and freed.  As a
-result, each data structure also contains a <i>back-pointer</i>
-to its parent, for example every <i>obj_t</i>
-contains a pointer to its containing <i>pool_t</i>.
-<P>
-This complex web of interconnected data structures is updated constantly and
-thus extremely sensitive to careless code changes which, for example, may
-result in unexpected hypervisor crashes or non-obvious memory leaks.
-On the other hand, the code is fairly well
-modularized so, once understood, it is possible to relatively easily switch out
-one kind of data structure for another.
-To catch problems as quickly as possible when debug is enabled, most of
-the data structures are equipped with <i>sentinels</i>and many inter-function
-assumptions are documented and tested dynamically
-with <i>assertions</i>.
-While these clutter and lengthen the tmem
-code substantially, their presence has proven invaluable on many occasions.
-<P>
-For completeness, we should also describe a key data structure in the Xen
-implementation-dependent code: the <i>tmh_page_list</i>. For security and
-performance reasons, pages that are freed due to tmem operations (such
-as <kbd>get</kbd></big>) are not immediately put back into Xen's pool
-of free memory (aka the Xen <i>heap</i>).
-Tmem pages may contain guest-private data that must be <i>scrubbed</i> before
-those memory pages are released for the use of other guests.
-But if a page is immediately re-used inside of tmem itself, the entire
-page is overwritten with new data, so need not be scrubbed.
-Since tmem is usually the most frequent
-customer of the Xen heap allocation code, it would be a waste of time to scrub
-a page, release it to the Xen heap, and then immediately re-allocate it
-again.  So, instead, tmem maintains
-currently-unused pages of memory on its own free list, <i>tmh_page_list</i>,
-and returns the pages to Xen only when non-tmem Xen
-heap allocation requests would otherwise fail.
-
-<h2>Scalablility/Concurrency</h2>
-
-<P>Tmem has been designed to be highly scalable.
-Since tmem access is invoked similarly in
-many ways to asynchronous disk access, a &quot;big SMP&quot; tmem-aware guest
-OS can, and often will, invoke tmem hypercalls simultaneously on many different
-physical CPUs.  And, of course, multiple
-tmem-aware guests may independently and simultaneously invoke tmem
-hypercalls.  While the normal frequency
-of tmem invocations is rarely extremely high, some tmem operations such as data
-compression or lookups in a very large tree may take tens of thousands of
-cycles or more to complete.  Measurements
-have shown that normal workloads spend no more than about 0.2% (2% with
-compression enabled) of CPU time executing tmem operations.
-But those familiar with OS scalability issues
-recognize that even this limited execution time can create concurrency problems
-in large systems and result in poorly-scalable performance.
-<P>
-A good locking strategy is critical to concurrency, but also
-must be designed carefully to avoid deadlock and <i>livelock</i> problems.  For
-debugging purposes, tmem supports a &quot;big kernel lock&quot; which disables
-concurrency altogether (enabled in Xen with &quot;tmem_lock&quot;, but note
-that this functionality is rarely tested and likely has bit-rotted). Infrequent
-but invasive tmem hypercalls, such as pool creation or the control operations,
-are serialized on a single <i>read-write lock</i>, called tmem_rwlock,
-which must be held for writing.  All other tmem operations must hold this lock
-for reading, so frequent operations such as
-<kbd>put</kbd></big> and <kbd>get</kbd></big> <kbd>flush</kbd></big> can execute simultaneously
-as long as no invasive operations are occurring.
-<P>
-Once a pool has been selected, there is a per-pool
-read-write lock (<i>pool_rwlock</i>) which
-must be held for writing if any transformative operations might occur within
-that pool, such as when an<i> obj_t</i> is
-created or destroyed.  For the highly
-frequent operation of finding an<i> obj_t</i>
-within a pool, pool_rwlock must be held for reading.
-<P>
-Once an object has been selected, there is a per-object
-spinlock (<i>obj_spinlock)</i>.
-This is a spinlock rather than a read-write
-lock because nearly all of the most frequent tmem operations (e.g.
-<kbd>put</kbd></big> and <kbd>get</kbd></big> <kbd>flush</kbd></big>)
-are transformative, in
-that they add or remove a page within the object.
-This lock is generally taken whenever an
-object lookup occurs and released when the tmem operation is complete.
-<P>
-Next, the per-client and global ephemeral lists are
-protected by a single global spinlock (<i>eph_lists_</i>spinlock)
-and the per-client persistent lists are also protected by a single global
-spinlock (<i>pers_list_spinlock</i>).
-And to complete the description of
-implementation-independent locks, if page deduplication is enabled, all pages
-for which the first byte match are contained in one of 256 trees that are
-protected by one of 256 corresponding read-write locks
-(<i>pcd_tree_rwlocks</i>).
-<P>
-In the Xen-specific code (tmem_xen.c), page frames (e.g.  struct page_info)
-that have been released are kept in a list (<i>tmh_page_list</i>) that
-is protected by a spinlock (<i>tmh_page_list_lock</i>).
-There is also an &quot;implied&quot; lock
-associated with compression, which is likely the most time-consuming operation
-in all of tmem (of course, only when compression is enabled): A compression
-buffer is allocated one-per-physical-cpu early in Xen boot and a pointer to
-this buffer is returned to implementation-independent code and used without a
-lock.
-<P>
-The proper method to avoid deadlocks is to take and release
-locks in a very specific predetermined order.
-Unfortunately, since tmem data structures must simultaneously be
-accessed &quot;top-down&quot; (
-<kbd>put</kbd></big> and <kbd>get</kbd></big>)
-and &quot;bottoms-up&quot;
-(memory reclamation), more complex methods must be employed:
-A <i>trylock</i>mechanism is used (c.f. <i>tmem_try_to_evict_pgp()</i>),
-which takes the lock if it is available but returns immediately (rather than
-spinning and waiting) if the lock is not available.
-When walking the ephemeral list to identify
-pages to free, any page that belongs to an object that is locked is simply
-skipped.  Further, if the page is the
-last page belonging to an object, and the pool read-write lock for the pool the
-object belongs to is not available (for writing), that object is skipped.
-These constraints modify the LRU algorithm
-somewhat, but avoid the potential for deadlock.
-<P>
-Unfortunately, a livelock was still discovered in this approach:
-When memory is scarce and each client is
-<kbd>put</kbd></big>ting a large number of pages
-for exactly one object (and thus holding the object spinlock for that object),
-memory reclamation takes a very long time to determine that it is unable to
-free any pages, and so the time to do a
-<kbd>put</kbd></big> (which eventually fails) becomes linear to the
-number of pages in the object!  To avoid
-this situation, a workaround was added to always ensure a minimum amount of
-memory (1MB) is available before any object lock is taken for the client
-invoking tmem (see <i>tmem_ensure_avail_pages()</i>).
-Other such livelocks (and perhaps deadlocks)
-may be lurking.
-<P>
-A last issue related to concurrency is atomicity of counters.
-Tmem gathers a large number of
-statistics.  Some of these counters are
-informational only, while some are critical to tmem operation and must be
-incremented and decremented atomically to ensure, for example, that the number
-of pages in a tree never goes negative if two concurrent tmem operations access
-the counter exactly simultaneously.  Some
-of the atomic counters are used for debugging (in assertions) and perhaps need
-not be atomic; fixing these may increase performance slightly by reducing
-cache-coherency traffic.  Similarly, some
-of the non-atomic counters may yield strange results to management tools, such
-as showing the total number of successful
-<kbd>put</kbd></big>s as being higher than the number of
-<kbd>put</kbd></big>s attempted.
-These are left as exercises for future tmem implementors.
-
-<h2>Control and Manageability</h2>
-
-<P>
-Tmem has a control interface to, for example, set various
-parameters and obtain statistics.  All
-tmem control operations funnel through <i>do_tmem_control()</i>
-and other functions supporting tmem control operations are prefixed
-with <i>tmemc_</i>.
-
-<P>
-During normal operation, even if only one tmem-aware guest
-is running, tmem may absorb nearly all free memory in the system for its own
-use.  Then if a management tool wishes to
-create a new guest (or migrate a guest from another system to this one), it may
-notice that there is insufficient &quot;free&quot; memory and fail the creation
-(or migration).  For this reason, tmem
-introduces a new tool-visible class of memory -- <i>freeable</i> memory --
-and provides a control interface to access
-it.  All ephemeral memory and all pages on the <i>tmh_page_list</i>
-are freeable. To properly access freeable
-memory, a management tool must follow a sequence of steps:
-<ul>
-<li>
-<i>freeze</i>
-tmem:When tmem is frozen, all 
-<kbd>put</kbd></big>s fail, which ensures that no
-additional memory may be absorbed by tmem.
-(See <i>tmemc_freeze_pools()</i>, and
-note that individual clients may be frozen, though this functionality may be
-used only rarely.)
-<li>
-<i>query freeable MB: </i>If all freeable memory were released to the Xen
-heap, this is the amount of memory (in MB) that would be freed.
-See <i>tmh_freeable_pages()</i>.
-<li>
-<i>flush</i>:
-Tmem may be requested to flush, or relinquish, a certain amount of memory, e.g.
-back to the Xen heap.  This amount is
-specified in KB.  See <i
->tmemc_flush_mem()</i> and <i
->tmem_relinquish_npages()</i>.
-<li>
-At this point the management tool may allocate
-the memory, e.g. using Xen's published interfaces.
-<li>
-<i>thaw</i>
-tmem: This terminates the freeze, allowing tmem to accept 
-<kbd>put</kbd></big>s again.
-</ul>
-<P>
-Extensive tmem statistics are available through tmem's
-control interface (see <i>tmemc_list </i>and
-the separate source for the &quot;xm tmem-list&quot; command and the
-xen-tmem-list-parse tool).  To maximize
-forward/backward compatibility with future tmem and tools versions, statistical
-information is passed via an ASCII interface where each individual counter is
-identified by an easily parseable two-letter ASCII sequence.
-
-<h2>Save/Restore/Migrate</h2>
-
-<P>
-Another piece of functionality that has a major impact on
-the tmem code is support for save/restore of a tmem client and, highly related,
-live migration of a tmem client.
-Ephemeral pages, by definition, do not need to be saved or
-live-migrated, but persistent pages are part of the state of a running VM and
-so must be properly preserved.
-<P>
-When a save (or live-migrate) of a tmem-enabled VM is initiated, the first step
-is for the tmem client to be frozen (see the manageability section).
-Next, tmem API version information is
-recorded (to avoid possible incompatibility issues as the tmem spec evolves in
-the future).  Then, certain high-level
-tmem structural information specific to the client is recorded, including
-information about the existing pools.
-Finally, the contents of all persistent pages are recorded.
-<P>
-For live-migration, the process is somewhat more complicated.
-Ignoring tmem for a moment, recall that in
-live migration, the vast majority of the VM's memory is transferred while the
-VM is still fully operational.  During
-each phase, memory pages belonging to the VM that are changed are marked and
-then retransmitted during a later phase.
-Eventually only a small amount of memory remains, the VM is paused, the
-remaining memory is transmitted, and the VM is unpaused on the target machine.
-<P>
-The number of persistent tmem pages may be quite large,
-possibly even larger than all the other memory used by the VM; so it is
-unacceptable to transmit persistent tmem pages during the &quot;paused&quot;
-phase of live migration.  But if the VM
-is still operational, it may be making calls to tmem:
-A frozen tmem client will reject any 
-<big><kbd>put</kbd></big> operations, but tmem must
-still correctly process <big><kbd>flush</kbd></big>es
-(page and object), including implicit flushes due to duplicate 
-<big><kbd>put</kbd></big>s.
-Fortunately, these operations can only
-invalidate tmem pages, not overwrite tmem pages or create new pages.
-So, when a live-migrate has been initiated,
-the client is frozen.  Then during the
-&quot;live&quot; phase, tmem transmits all persistent pages, but also records
-the handle of all persistent pages that are invalidated.
-Then, during the &quot;paused&quot; phase,
-only the handles of invalidated persistent pages are transmitted, resulting in
-the invalidation on the target machine of any matching pages that were
-previously transmitted during the &quot;live&quot; phase.
-<P>
-For restore (and on the target machine of a live migration),
-tmem must be capable of reconstructing the internal state of the client from
-the saved/migrated data.  However, it is
-not the client itself that is <big><kbd>put</kbd></big>'ing
-the pages but the management tools conducting the restore/migration.
-This slightly complicates tmem by requiring
-new API calls and new functions in the implementation, but the code is
-structured so that duplication is minimized.
-Once all tmem data structures for the client are reconstructed, all
-persistent pages are recreated and, in the case of live-migration, all
-invalidations have been processed and the client has been thawed, the restored
-client can be resumed.
-<P>
-Finally, tmem's data structures must be cluttered a bit to
-support save/restore/migration.  Notably,
-a per-pool list of persistent pages must be maintained and, during live
-migration, a per-client list of invalidated pages must be logged.
-A reader of the code will note that these
-lists are overlaid into space-sensitive data structures as a union, which may
-be more error-prone but eliminates significant space waste.
-
-<h2>Miscellaneous Tmem Topics</h2>
-
-<P>
-<i><b>Duplicate <big><kbd>puts</kbd></big></b></i>.
-One interesting corner case that
-significantly complicates the tmem source code is the possibility
-of a <i>duplicate</i>
-<big><kbd>put</kbd></big>,
-which occurs when two
-<big><kbd>put</kbd></big>s
-are requested with the same handle but with possibly different data.
-The tmem API addresses
-<i>
-<big><kbd>put</kbd></big>-<big><kbd>put</kbd></big>-<big><kbd>get</kbd></big>
-coherence</i> explicitly: When a duplicate
-<big><kbd>put</kbd></big> occurs, tmem may react one of two ways: (1) The 
-<big><kbd>put</kbd></big> may succeed with the old
-data overwritten by the new data, or (2) the
-<big><kbd>put</kbd></big> may be failed with the original data flushed and
-neither the old nor the new data accessible.
-Tmem may <i>not</i> fail the 
-<big><kbd>put</kbd></big> and leave the old data accessible.
-<P>
-When tmem has been actively working for an extended period,
-system memory may be in short supply and it is possible for a memory allocation
-for a page (or even a data structure such as a <i>pgd_t</i>) to fail. Thus,
-for a duplicate 
-<big><kbd>put</kbd></big>, it may be impossible for tmem to temporarily
-simultaneously maintain data structures and data for both the original 
-<big><kbd>put</kbd></big> and the duplicate 
-<big><kbd>put</kbd></big>.
-When the space required for the data is
-identical, tmem may be able to overwrite <i>in place </i>the old data with
-the new data (option 1).  But in some circumstances, such as when data
-is being compressed, overwriting is not always possible and option 2 must be
-performed.
-<P>
-<i><b>Page deduplication and trailing-zero elimination.</b></i>
-When page deduplication is enabled
-(&quot;tmem_dedup&quot; option to Xen), ephemeral pages for which the contents
-are identical -- whether the pages belong
-to the same client or different clients -- utilize the same pageframe of
-memory.  In Xen environments where
-multiple domains have a highly similar workload, this can save a substantial
-amount of memory, allowing a much larger number of ephemeral pages to be
-used.  Tmem page deduplication uses
-methods similar to the KSM implementation in Linux [ref], but differences between
-the two are sufficiently great that tmem does not directly leverage the
-code.  In particular, ephemeral pages in
-tmem are never dirtied, so need never be <i>copied-on-write</i>.
-Like KSM, however, tmem avoids hashing,
-instead employing <i>red-black trees</i>
-that use the entire page contents as the <i>lookup
-key</i>.  There may be better ways to implement this.
-<P>
-Dedup'ed pages may optionally be compressed
-(&quot;tmem_compress&quot; and &quot;tmem_dedup&quot; Xen options specified),
-to save even more space, at the cost of more time.
-Additionally, <i>trailing zero elimination (tze)</i> may be applied to dedup'ed
-pages.  With tze, pages that contain a
-significant number of zeroes at the end of the page are saved without the trailing
-zeroes; an all-zero page requires no data to be saved at all.
-In certain workloads that utilize a large number
-of small files (and for which the last partial page of a file is padded with
-zeroes), a significant space savings can be realized without the high cost of
-compression/decompression.
-<P>
-Both compression and tze significantly complicate memory
-allocation.  This will be discussed more below.
-<P>
-<b><i>Memory accounting</i>.</b>
-Accounting is boring, but poor accounting may
-result in some interesting problems.  In
-the implementation-independent code of tmem, most data structures, page frames,
-and partial pages (e.g. for compresssion) are <i>billed</i> to a pool,
-and thus to a client.  Some <i>infrastructure</i> data structures, such as
-pools and clients, are allocated with <i>tmh_alloc_infra()</i>, which does not
-require a pool to be specified.  Two other
-exceptions are page content descriptors (<i>pcd_t</i>)
-and sharelists (<i>sharelist_t</i>) which
-are explicitly not associated with a pool/client by specifying NULL instead of
-a <i>pool_t</i>.
-(Note to self:
-These should probably just use the <i>tmh_alloc_infra()</i> interface too.)
-As we shall see, persistent pool pages and
-data structures may need to be handled a bit differently, so the
-implementation-independent layer calls a different allocation/free routine for
-persistent pages (e.g. <i>tmh_alloc_page_thispool()</i>)
-than for ephemeral pages (e.g. <i>tmh_alloc_page()</i>).
-<P>
-In the Xen-specific layer, we
-disregard the <i>pool_t</i> for ephemeral
-pages, as we use the generic Xen heap for all ephemeral pages and data
-structures.(Denial-of-service attacks
-can be handled in the implementation-independent layer because ephemeral pages
-are kept in per-client queues each with a counted length.
-See the discussion on weights and caps below.)
-However we explicitly bill persistent pages
-and data structures against the client/domain that is using them.
-(See the calls to the Xen routine <i>alloc_domheap_pages() </i>in tmem_xen.h; of
-the first argument is a domain, the pages allocated are billed by Xen to that
-domain.)This means that a Xen domain
-cannot allocate even a single tmem persistent page when it is currently utilizing
-its maximum assigned memory allocation!
-This is reasonable for persistent pages because, even though the data is
-not directly accessible by the domain, the data is permanently saved until
-either the domain flushes it or the domain dies.
-<P>
-Note that proper accounting requires (even for ephemeral pools) that the same
-pool is referenced when memory is freed as when it was allocated, even if the
-ownership of a pool has been moved from one client to another (c.f. <i
->shared_pool_reassign()</i>).
-The underlying Xen-specific information may
-not always enforce this for ephemeral pools, but incorrect alloc/free matching
-can cause some difficult-to-find memory leaks and bent pointers.
-<P>
-Page deduplication is not possible for persistent pools for
-accounting reasons: Imagine a page that is created by persistent pool A, which
-belongs to a domain that is currently well under its maximum allocation.
-Then the <i>pcd_t</i>is matched by persistent pool B, which is
-currently at its maximum.
-Then the domain owning pool A is destroyed.
-Is B beyond its maximum?
-(There may be a clever way around this
-problem.  Exercise for the reader!)
-<P>
-<b><i>Memory allocation.</i></b> The implementation-independent layer assumes
-there is a good fast general-purpose dynamic memory allocator with bounded
-response time and efficient use of memory for a very large number of sub-page
-allocations.  The old xmalloc memory
-allocator in Xen was not a good match for this purpose, so was replaced by the
-TLSF allocator.  Note that the TLSF
-allocator is used only for allocations smaller than a page (and, more
-precisely, no larger than <i>tmem_subpage_maxsize()</i>);
-full pages are allocated by Xen's normal heap allocator.
-<P>
-After the TLSF allocator was integrated into Xen, more work
-was required so that each client could allocate memory from a separate
-independent pool. (See the call to <i>xmem_pool_create()</i>in
-<i>tmh_client_init()</i>.) 
-This allows the data structures allocated for the
-purpose of supporting persistent pages to be billed to the same client as the
-pages themselves.  It also allows partial
-(e.g. compressed) pages to be properly billed.
-Further, when partial page allocations cause internal fragmentation,
-this fragmentation can be isolated per-client.
-And, when a domain dies, full pages can be freed, rather than only
-partial pages. One other change was
-required in the TLSF allocator: In the original version, when a TLSF memory
-pool was allocated, the first page of memory was also allocated.
-Since, for a persistent pool, this page would
-be billed to the client, the allocation of the first page failed if the domain
-was started at its maximum memory, and this resulted in a failure to create the
-memory pool.  To avoid this, the code was
-changed to delay the allocation of the first page until first use of the memory
-pool.
-<P>
-<b><i>Memory allocation interdependency.</i></b>
-As previously described,
-pages of memory must be moveable back and forth between the Xen heap and the
-tmem ephemeral lists (and page lists).
-When tmem needs a page but doesn't have one, it requests one from the
-Xen heap (either indirectly via xmalloc, or directly via Xen's <i
->alloc_domheap_pages()</i>).
-And when Xen needs a page but doesn't have
-one, it requests one from tmem (via a call to <i
->tmem_relinquish_pages()</i> in Xen's <i
->alloc_heap_pages() </i>in page_alloc.c).
-This leads to a potential infinite loop!
-To break this loop, a new memory flag (<i>MEMF_tmem</i>) was added to Xen
-to flag and disallow the loop.
-See <i>tmh_called_from_tmem()</i>
-in <i>tmem_relinquish_pages()</i>.
-Note that the <i
->tmem_relinquish_pages()</i> interface allows for memory requests of
-order &gt; 0 (multiple contiguous pages), but the tmem implementation disallows
-any requests larger than a single page.
-<P>
-<b><i>LRU page reclamation</i></b>.
-Ephemeral pages generally <i>age </i>in
-a queue, and the space associated with the oldest -- or <i
->least-recently-used -- </i>page is reclaimed when tmem needs more
-memory.  But there are a few exceptions
-to strict LRU queuing.  First is when
-removal from a queue is constrained by locks, as previously described above.
-Second, when an ephemeral pool is <i>shared,</i> unlike a private ephemeral
-pool, a
-<big><kbd>get</kbd></big>
-does not imply a
-<big><kbd>flush</kbd></big>
-Instead, in a shared pool, a 
-results in the page being promoted to the front of the queue.
-Third, when a page that is deduplicated (i.e.
-is referenced by more than one <i>pgp_</i>t)
-reaches the end of the LRU queue, it is marked as <i
->eviction attempted</i> and promoted to the front of the queue; if it
-reaches the end of the queue a second time, eviction occurs.
-Note that only the <i
->pgp_</i>t is evicted; the actual data is only reclaimed if there is no
-other <i>pgp_t </i>pointing to the data.
-<P>
-All of these modified- LRU algorithms deserve to be studied
-carefully against a broad range of workloads.
-<P>
-<b><i>Internal fragmentation</i>.</b>
-When
-compression or tze is enabled, allocations between a half-page and a full-page
-in size are very common and this places a great deal of pressure on even the
-best memory allocator.  Additionally,
-problems may be caused for memory reclamation: When one tmem ephemeral page is
-evicted, only a fragment of a physical page of memory might be reclaimed.
-As a result, when compression or tze is
-enabled, it may take a very large number of eviction attempts to free up a full
-contiguous page of memory and so, to avoid near-infinite loops and livelocks, eviction
-must be assumed to be able to fail.
-While all memory allocation paths in tmem are resilient to failure, very
-complex corner cases may eventually occur.
-As a result, compression and tze are disabled by default and should be
-used with caution until they have been tested with a much broader set of
-workloads.(Note to self: The 
-code needs work.)
-<P>
-<b><i>Weights and caps</i>.</b>
-Because
-of the just-discussed LRU-based eviction algorithms, a client that uses tmem at
-a very high frequency can quickly swamp tmem so that it provides little benefit
-to a client that uses it less frequently.
-To reduce the possibility of this denial-of-service, limits can be
-specified via management tools that are enforced internally by tmem.
-On Xen, the &quot;xm tmem-set&quot; command
-can specify &quot;weight=&lt;weight&gt;&quot; or &quot;cap=&lt;cap&gt;&quot;
-for any client.  If weight is non-zero
-for a client and the current percentage of ephemeral pages in use by the client
-exceeds its share (as measured by the sum of weights of all clients), the next
-page chosen for eviction is selected from the requesting client's ephemeral
-queue, instead of the global ephemeral queue that contains pages from all
-clients.(See <i>client_over_quota().</i>)
-Setting a cap for a client is currently a no-op.
-<P>
-<b><i>Shared pools and authentication.</i></b>
-When tmem was first proposed to the linux kernel mailing list
-(LKML), there was concern expressed about security of shared ephemeral
-pools.  The initial tmem implementation only
-required a client to provide a 128-bit UUID to identify a shared pool, and the
-linux-side tmem implementation obtained this UUID from the superblock of the
-shared filesystem (in ocfs2).  It was
-pointed out on LKML that the UUID was essentially a security key and any
-malicious domain that guessed it would have access to any data from the shared
-filesystem that found its way into tmem.
-Ocfs2 has only very limited security; it is assumed that anyone who can
-access the filesystem bits on the shared disk can mount the filesystem and use
-it.  But in a virtualized data center,
-higher isolation requirements may apply.
-As a result, management tools must explicitly authenticate (or may
-explicitly deny) shared pool access to any client.
-On Xen, this is done with the &quot;xl
-tmem-shared-auth&quot; command.
-<P>
-<b><i>32-bit implementation</i>.</b>
-There was some effort put into getting tmem working on a 32-bit Xen.
-However, the Xen heap is limited in size on
-32-bit Xen so tmem did not work very well.
-There are still 32-bit ifdefs in some places in the code, but things may
-have bit-rotted so using tmem on a 32-bit Xen is not recommended.
-
-<h2>Known Issues</h2>
-
-<p><b><i>Fragmentation.</i></b>When tmem
-is active, all physically memory becomes <i>fragmented</i>
-into individual pages.  However, the Xen
-memory allocator allows memory to be requested in multi-page contiguous
-quantities, called order&gt;0 allocations.
-(e.g. 2<sup>order</sup> so
-order==4 is sixteen contiguous pages.)
-In some cases, a request for a larger order will fail gracefully if no
-matching contiguous allocation is available from Xen.
-As of Xen 4.0, however, there are several
-critical order&gt;0 allocation requests that do not fail gracefully.
-Notably, when a domain is created, and
-order==4 structure is required or the domain creation will fail.
-And shadow paging requires many order==2
-allocations; if these fail, a PV live-migration may fail.
-There are likely other such issues.
-<P>
-But, fragmentation can occur even without tmem if any domU does
-any extensive ballooning; tmem just accelerates the fragmentation.
-So the fragmentation problem must be solved
-anyway.  The best solution is to disallow
-order&gt;0 allocations altogether in Xen -- or at least ensure that any attempt
-to allocate order&gt;0 can fail gracefully, e.g. by falling back to a sequence
-of single page allocations. However this restriction may require a major rewrite
-in some of Xen's most sensitive code.
-(Note that order&gt;0 allocations during Xen boot and early in domain0
-launch are safe and, if dom0 does not enable tmem, any order&gt;0 allocation by
-dom0 is safe, until the first domU is created.)
-<P>
-Until Xen can be rewritten to be <i>fragmentation-safe</i>, a small hack
-was added in the Xen page
-allocator.(See the comment &quot;
-memory is scarce&quot; in <i>alloc_heap_pages()</i>.)
-Briefly, a portion of memory is pre-reserved
-for allocations where order&gt;0 and order&lt;9.
-(Domain creation uses 2MB pages, but fails
-gracefully, and there are no other known order==9 allocations or order&gt;9
-allocations currently in Xen.)
-<P>
-<b><i>NUMA</i></b>.  Tmem assumes that
-all memory pages are equal and any RAM page can store a page of data for any
-client.  This has potential performance
-consequences in any NUMA machine where access to <i
->far memory</i> is significantly slower than access to <i
->near memory</i>.
-On nearly all of today's servers, however,
-access times to <i>far memory</i> is still
-much faster than access to disk or network-based storage, and tmem's primary performance
-advantage comes from the fact that paging and swapping are reduced.
-So, the current tmem implementation ignores
-NUMA-ness; future tmem design for NUMA machines is an exercise left for the
-reader.
-
-<h2>Bibliography</h2>
-
-<P>
-(needs work)<b style='mso-bidi-font-weight:>
-<P><a href="http://oss.oracle.com/projects/tmem">http://oss.oracle.com/projects/tmem</a>
diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown
index 9028bcde2e..fe891ef074 100644
--- a/docs/misc/xen-command-line.markdown
+++ b/docs/misc/xen-command-line.markdown
@@ -1993,12 +1993,6 @@ pages) must also be specified via the tbuf\_size parameter.
 ### timer\_slop
 > `= <integer>`
 
-### tmem
-> `= <boolean>`
-
-### tmem\_compress
-> `= <boolean>`
-
 ### tsc (x86)
 > `= unstable | skewed | stable:socket`
 
diff --git a/docs/misc/xsm-flask.txt b/docs/misc/xsm-flask.txt
index 62f15dde84..40e5fc845e 100644
--- a/docs/misc/xsm-flask.txt
+++ b/docs/misc/xsm-flask.txt
@@ -81,42 +81,6 @@ __HYPERVISOR_memory_op (xen/include/public/memory.h)
  * XENMEM_get_pod_target
  * XENMEM_claim_pages
 
-__HYPERVISOR_tmem_op (xen/include/public/tmem.h)
-
- The following tmem control ops, that is the sub-subops of
- TMEM_CONTROL, are covered by this statement. 
-
- Note that TMEM is also subject to a similar policy arising from
- XSA-15 http://lists.xen.org/archives/html/xen-announce/2012-09/msg00006.html.
- Due to this existing policy all TMEM Ops are already subject to
- reduced security support.
-
- * TMEMC_THAW
- * TMEMC_FREEZE
- * TMEMC_FLUSH
- * TMEMC_DESTROY
- * TMEMC_LIST
- * TMEMC_SET_WEIGHT
- * TMEMC_SET_CAP
- * TMEMC_SET_COMPRESS
- * TMEMC_QUERY_FREEABLE_MB
- * TMEMC_SAVE_BEGIN
- * TMEMC_SAVE_GET_VERSION
- * TMEMC_SAVE_GET_MAXPOOLS
- * TMEMC_SAVE_GET_CLIENT_WEIGHT
- * TMEMC_SAVE_GET_CLIENT_CAP
- * TMEMC_SAVE_GET_CLIENT_FLAGS
- * TMEMC_SAVE_GET_POOL_FLAGS
- * TMEMC_SAVE_GET_POOL_NPAGES
- * TMEMC_SAVE_GET_POOL_UUID
- * TMEMC_SAVE_GET_NEXT_PAGE
- * TMEMC_SAVE_GET_NEXT_INV
- * TMEMC_SAVE_END
- * TMEMC_RESTORE_BEGIN
- * TMEMC_RESTORE_PUT_PAGE
- * TMEMC_RESTORE_FLUSH_PAGE
-
-
 
 Setting up FLASK
 ----------------
-- 
2.11.0



 

