
Re: [Xen-devel] [RFC Design Doc v2] Add vNVDIMM support for Xen



On Mon, Jul 18, 2016 at 08:29:12AM +0800, Haozhong Zhang wrote:
> Hi,
> 

Hey!

Thanks for posting! Sorry for the late review. Below are some of my
comments.

> Following is version 2 of the design doc for supporting vNVDIMM in
> Xen. It's basically the summary of discussion on previous v1 design
> (https://lists.xenproject.org/archives/html/xen-devel/2016-02/msg00006.html).
> Any comments are welcome. The corresponding patches are WIP.
> 
> Thanks,
> Haozhong
> 
> 
> 
> vNVDIMM Design v2
> 
> Changes in v2:
>  - Rewrite the design details based on previous discussion [7].
>  - Add Section 3 Usage Example of vNVDIMM in Xen.
>  - Remove content about pcommit instruction which has been deprecated [8].
> 
> Content
> =======
> 1. Background
>  1.1 Access Mechanisms: Persistent Memory and Block Window
>  1.2 ACPI Support
>   1.2.1 NFIT
>   1.2.2 _DSM and _FIT
>  1.3 Namespace
>  1.4 clwb/clflushopt
> 2. NVDIMM/vNVDIMM Support in Linux Kernel/KVM/QEMU
>  2.1 NVDIMM Driver in Linux Kernel
>  2.2 vNVDIMM Implementation in KVM/QEMU
> 3. Usage Example of vNVDIMM in Xen
> 4. Design of vNVDIMM in Xen
>  4.1 Guest clwb/clflushopt Enabling
>  4.2 pmem Address Management
>   4.2.1 Reserve Storage for Management Structures
>   4.2.2 Detection of Host pmem Devices
>   4.2.3 Get Host Machine Address (SPA) of Host pmem Files
>   4.2.4 Map Host pmem to Guests
>   4.2.5 Misc 1: RAS
>   4.2.6 Misc 2: hotplug
>  4.3 Guest ACPI Emulation
>   4.3.1 Building Guest ACPI Tables
>   4.3.2 Emulating Guest _DSM
> References
> 
> 
> Non-Volatile DIMM or NVDIMM is a type of RAM device that provides
> persistent storage and retains data across reboot and even power
> failures. This document describes the design to provide virtual NVDIMM
> devices or vNVDIMM in Xen.
> 
> The rest of this document is organized as below.
>  - Section 1 introduces the background knowledge of NVDIMM hardware,
>    which is used by other parts of this document.
> 
>  - Section 2 briefly introduces the current/future NVDIMM/vNVDIMM
>    support in Linux kernel/KVM/QEMU. They will affect the vNVDIMM
>    design in Xen.
> 
>  - Section 3 shows the basic usage example of vNVDIMM in Xen.
> 
>  - Section 4 proposes design details of vNVDIMM in Xen.
> 
> 
> 
> 1. Background
> 
> 1.1 Access Mechanisms: Persistent Memory and Block Window
> 
>  NVDIMM provides two access mechanisms: byte-addressable persistent
>  memory (pmem) and block window (pblk). An NVDIMM can contain multiple
>  ranges and each range can be accessed through either pmem or pblk
>  (but not both).
> 
>  Byte-addressable persistent memory mechanism (pmem) maps NVDIMM or
>  ranges of NVDIMM into the system physical address (SPA) space, so
>  that software can access NVDIMM via normal memory loads and
>  stores. If the virtual address is used, then MMU will translate it to
>  the physical address.
> 
>  In a virtualized environment, we can pass through a pmem range or
>  part of it to a guest by mapping it in EPT (i.e. mapping the guest
>  vNVDIMM physical address to the host NVDIMM physical address), so that
>  guest accesses are applied directly to the host NVDIMM device without
>  interception by the hypervisor.
> 
>  Block window mechanism (pblk) provides one or multiple block windows
>  (BW).  Each BW is composed of a command register, a status register
>  and an 8 Kbyte aperture register. Software fills in the direction of
>  the transfer (read/write) and the start address (LBA) and size on the
>  NVDIMM it is going to transfer. If nothing goes wrong, the transferred
>  data can be read/written via the aperture register. The status and
>  errors of the transfer can be obtained from the status register.
>  Other vendor-specific commands and status can be implemented for BW
>  as well. Details of the block window access mechanism can be found
>  in [3].
> 
>  In a virtualized environment, different pblk regions on a single
>  NVDIMM device may be accessed by different guests, so the hypervisor
>  needs to emulate BW, which would introduce a high overhead for I/O
>  intensive workloads.
> 
>  Therefore, we are going to only implement pmem for vNVDIMM. The rest
>  of this document will mostly concentrate on pmem.
> 
> 
> 1.2 ACPI Support
> 
>  ACPI provides two forms of support for NVDIMM. First, NVDIMM
>  devices are described by firmware (BIOS/EFI) to OS via ACPI-defined
>  NVDIMM Firmware Interface Table (NFIT). Second, several functions of
>  NVDIMM, including operations on namespace labels, S.M.A.R.T and
>  hotplug, are provided by ACPI methods (_DSM and _FIT).
> 
> 1.2.1 NFIT
> 
>  NFIT is a new system description table added in ACPI v6 with
>  signature "NFIT". It contains a set of structures.
> 
>  - System Physical Address Range Structure
>    (SPA Range Structure)
> 
>    SPA range structure describes system physical address ranges
>    occupied by NVDIMMs and types of regions.
> 
>    If the Address Range Type GUID field of a SPA range structure is
>    "Byte Addressable Persistent Memory (PM) Region", then the structure
>    describes an NVDIMM region that is accessed via pmem. The System
>    Physical Address Range Base and Length fields describe the start
>    system physical address and the length that is occupied by that
>    NVDIMM region.
> 
>    A SPA range structure is identified by a non-zero SPA range
>    structure index.
> 
>    Note: [1] reserves E820 type 7: OSPM must comprehend this memory as
>          having non-volatile attributes and handle distinct from
>          conventional volatile memory (in Table 15-312 of [1]). The
>          memory region supports byte-addressable non-volatility. E820
>          type 12 (OEM defined) may be also used for legacy NVDIMM
>          prior to ACPI v6.
> 
>    Note: Besides OS, EFI firmware may also parse NFIT for booting
>          drives (Section 9.3.6.9 of [5]).
> 
>  - Memory Device to System Physical Address Range Mapping Structure
>    (Range Mapping Structure)
> 
>    An NVDIMM region described by a SPA range structure can be
>    interleaved across multiple NVDIMM devices. A range mapping
>    structure is used to describe the single mapping on each NVDIMM
>    device. It describes the size and the offset in a SPA range that an
>    NVDIMM device occupies. It may refer to an Interleave Structure
>    that contains details of the entire interleave set. That
>    information is used by the NVDIMM driver for pblk address
>    translation.
> 
>    The NVDIMM device described by the range mapping structure is
>    identified by a unique NFIT Device Handle.
> 
>  Details of NFIT and other structures can be found in Section 5.25 in [1].
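
For readers without [1] at hand, a rough sketch of the SPA Range
Structure layout from ACPI 6.0 (field widths as I remember them from the
spec; double-check against [1] before relying on this):

    #include <stdint.h>

    /* ACPI 6.0 NFIT: System Physical Address (SPA) Range Structure. */
    struct nfit_spa_range {
        uint16_t type;               /* 0 = SPA Range Structure */
        uint16_t length;             /* length of this structure in bytes */
        uint16_t range_index;        /* non-zero SPA range structure index */
        uint16_t flags;
        uint32_t reserved;
        uint32_t proximity_domain;
        uint8_t  type_guid[16];      /* e.g. the "PM Region" GUID for pmem */
        uint64_t base;               /* System Physical Address Range Base */
        uint64_t size;               /* System Physical Address Range Length */
        uint64_t mem_attr;           /* memory mapping attributes */
    } __attribute__((packed));
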
> 
> 1.2.2 _DSM and _FIT
> 
>  The ACPI namespace device uses _HID of ACPI0012 to identify the root
>  NVDIMM interface device. An ACPI namespace device is also present
>  under the root device For each NVDIMM device. Above ACPI namespace

s/For/for/

>  devices are defined in SSDT.
> 
>  _DSM methods are present under the root device and each NVDIMM
>  device. _DSM methods are used by drivers to access the label storage
>  area, get health information, perform vendor-specific commands,
>  etc. Details of all _DSM methods can be found in [4].
> 
>  The _FIT method is under the root device and is evaluated by OSPM to
>  get the NFIT of hotplugged NVDIMMs. A hotplugged NVDIMM is indicated
>  to the OS by an ACPI namespace device with the PNP ID PNP0C80 and a
>  device object notification value of 0x80. Details of NVDIMM hotplug
>  can be found in Section 9.20 of [1].
> 
> 
> 1.3 Namespace
> 
>  [2] describes a mechanism to sub-divide NVDIMMs into namespaces,
>  which are logical units of storage similar to SCSI LUNs and NVM Express
>  namespaces.
> 
>  The namespace information is described by namespace labels stored in
>  the persistent label storage area on each NVDIMM device. The label
>  storage area is excluded from the the range mapped by the SPA range

s/the the/the

>  structure and can only be accessed via _DSM methods.
> 
>  There are two types of namespaces defined in [2]: the persistent
>  memory namespace and the block namespace. Persistent memory
>  namespaces are built only for pmem NVDIMM regions, while block
>  namespaces are built only for pblk. Only one persistent memory
>  namespace is allowed for a pmem NVDIMM region.
> 
>  Besides being accessed via _DSM, namespaces are managed and
>  interpreted by software. OS vendors may decide to not follow [2] and
>  store other types of information in the label storage area.
> 
> 
> 1.4 clwb/clflushopt
> 
>  Writes to NVDIMM may be cached in CPU caches, so certain flushing
>  operations should be performed to make them persistent on
>  NVDIMM. clwb is used in favor of clflushopt and clflush to flush
>  writes from caches to memory.
> 
>  Details of clwb/clflushopt can be found in Chapter 10 of [6].

Didn't that opcode get dropped in favour of poking in some register?
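
Either way, for illustration, a minimal sketch of the flush sequence
described in 1.4, assuming a CLWB-capable CPU and a cache-line-aligned
pmem mapping (compile with -mclwb):

    #include <immintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Flush a range of a pmem mapping so the stores become persistent. */
    static void pmem_persist(const void *addr, size_t len)
    {
        uintptr_t p = (uintptr_t)addr & ~(uintptr_t)63;

        for (; p < (uintptr_t)addr + len; p += 64)
            _mm_clwb((void *)p);    /* write back the dirty cache line */
        _mm_sfence();               /* order the flushes before later stores */
    }
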
> 
> 
> 
> 2. NVDIMM/vNVDIMM Support in Linux Kernel/KVM/QEMU
> 
> 2.1 NVDIMM Driver in Linux Kernel
> 
>  Linux kernel since 4.2 has added support for ACPI-defined NVDIMM
>  devices.
> 
>  NVDIMM driver in Linux probes NVDIMM devices through ACPI (i.e. NFIT
>  and _FIT). It is also responsible to parse the namsepace labels on

s/namsepace/namespace/

>  each NVDIMM device, recover namespaces after power failure (as
>  described in [2]) and handle NVDIMM hotplug. There are also some
>  vendor drivers to perform vendor-specific operations on NVDIMMs
>  (e.g. via _DSM).
> 
>  Compared to the ordinary ram, NVDIMM is used more like a persistent

s/ram/RAM/
>  storage drive because of its persistence. For each persistent memory
>  namespace, or a label-less pmem NVDIMM range, the NVDIMM driver
>  implements a block device interface (/dev/pmemX).
> 
>  Userspace applications can mmap(2) the whole pmem into their own
>  virtual address space. The Linux kernel maps the system physical
>  address range occupied by the pmem into the virtual address space, so
>  that normal memory loads/stores, with proper flushing instructions,
>  are applied to the underlying pmem NVDIMM regions.
> 
>  Alternatively, a DAX file system can be made on /dev/pmemX. Files on
>  that file system can be used in the same way as above. As Linux
>  kernel maps the system address space range occupied by those files on
>  NVDIMM to the virtual address space, reads/writes on those files are
>  applied to the underlying NVDIMM regions as well.
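
A minimal sketch of the mmap(2) usage described above (either on
/dev/pmemX directly or on a DAX file; here /dev/pmem0, only the first
2 MiB, error handling omitted):

    #include <fcntl.h>
    #include <stddef.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        size_t len = 2UL << 20;                   /* map 2 MiB of the device */
        int fd = open("/dev/pmem0", O_RDWR);
        char *pmem = mmap(NULL, len, PROT_READ | PROT_WRITE,
                          MAP_SHARED, fd, 0);

        memcpy(pmem, "hello", 5);                 /* a plain store to NVDIMM */
        msync(pmem, len, MS_SYNC);                /* or clwb + sfence as in 1.4 */
        munmap(pmem, len);
        close(fd);
        return 0;
    }
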
> 
> 2.2 vNVDIMM Implementation in KVM/QEMU
> 
>  An overview of the vNVDIMM implementation in KVM (Linux kernel v4.2) /
>  QEMU (commit 70d1fb9 and patches in-review/future) is shown in the
>  following figure.
> 
> 
>                                        +---------------------------------+
>  Guest                             GPA |                    | /dev/pmem0 |
>                                        +---------------------------------+
>            parse        evaluate                            ^            ^
>             ACPI          _DSM                              |            |
>               |            |                                |            |
>  -------------|------------|--------------------------------|------------|----
>               V            V                                |            |
>           +-------+    +-------+                            |            |
>  QEMU     | vACPI |    | v_DSM |                            |            |
>           +-------+    +-------+                            |            |
>                            ^                                |            |
>                            | Read/Write                     |            |
>                            V                                |            |
>           +...+--------------------+...+-----------+        |            |
>     VA    |   | Label Storage Area |   |    buf    |  KVM_SET_USER_MEMORY_REGION(buf)
>           +...+--------------------+...+-----------+        |            |
>                                        ^  mmap(2)  ^        |            |
>  --------------------------------------|-----------|--------|------------|----
>                                        |           +--------~------------+
>                                        |                    |            |
>  Linux/KVM                             +--------------------+            |
>                                                             |            |
>                                                        +....+------------+
>                                                 SPA    |    | /dev/pmem0 |
>                                                        +....+------------+
>                                                                    ^
>                                                                    |
>                                                             Host NVDIMM Driver
> -------------------------------------------------------------------|---------
>                                                                    |
>  HW                                                          +------------+
>                                                              |   NVDIMM   |
>                                                              +------------+
> 
> 

Nice picture!

>  A part not shown in the above figure is enabling guest clwb/clflushopt,
>  i.e. exposing those instructions to the guest via guest cpuid.

And aren't those deprecated?
> 
>  Besides instruction enabling, there are two primary parts of vNVDIMM
>  implementation in KVM/QEMU.
> 
>  (1) Address Mapping
> 
>   As described before, the host Linux NVDIMM driver provides a block
>   device interface (/dev/pmem0 at the bottom) for a pmem NVDIMM
>   region. QEMU can then mmap(2) that device into its virtual address
>   space (buf). QEMU is responsible to find a proper guest physical
>   address space range that is large enough to hold /dev/pmem0. Then
>   QEMU passes the virtual address of mmapped buf to a KVM API
>   KVM_SET_USER_MEMORY_REGION that maps in EPT the host physical
>   address range of buf to the guest physical address space range where
>   the virtual pmem device will be.
> 
>   In this way, all guest writes/reads on the virtual pmem device are
>   applied directly to the host one.
> 
>   Besides, the above implementation also allows a virtual pmem device
>   to be backed by an mmapped regular file or a piece of ordinary RAM.
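
For concreteness, the KVM side of the mapping boils down to something
like the sketch below (not Xen code; vm_fd/slot/gpa/size/buf are assumed
to come from QEMU's setup):

    #include <linux/kvm.h>
    #include <sys/ioctl.h>

    /* Map QEMU's mmap()ed pmem buffer 'buf' at guest physical address 'gpa'. */
    static int map_pmem_into_guest(int vm_fd, unsigned int slot,
                                   __u64 gpa, __u64 size, void *buf)
    {
        struct kvm_userspace_memory_region region = {
            .slot            = slot,
            .flags           = 0,
            .guest_phys_addr = gpa,
            .memory_size     = size,
            .userspace_addr  = (__u64)(unsigned long)buf,
        };

        return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
    }
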
> 
>  (2) Guest ACPI Emulation
> 
>   As the guest system physical address and the size of the virtual
>   pmem device are determined by QEMU, QEMU is responsible for emulating
>   the guest NFIT and SSDT. Basically, it builds the guest NFIT and its
>   sub-structures that describe the virtual NVDIMM topology, and a
>   guest SSDT that defines the ACPI namespace devices of the virtual
>   NVDIMM.
> 
>   As a portion of a host pmem device or a regular file/ordinary RAM can
>   be used to back the guest pmem device, the label storage area on the
>   host pmem cannot always be passed through to the guest. Therefore,
>   guest reads/writes on the label storage area are emulated by QEMU. As
>   described before, _DSM method is utilized by OSPM to access the
>   label storage area, and therefore it is emulated by QEMU. The _DSM
>   buffer is registered as MMIO, and its guest physical address and
>   size are described in the guest ACPI. Every command/status
>   read/write from guest is trapped and emulated by QEMU.
>


And is there any need for the E820 type 7 to be exposed? I presume
not as the ACPI NFIT is sufficient?


>   Guest _FIT method will be implemented similarly in the future.
> 
> 
> 
> 3. Usage Example of vNVDIMM in Xen
> 
>  Our design is to provide virtual pmem devices to HVM domains. The
>  virtual pmem devices are backed by host pmem devices.
> 
>  The Dom0 Linux kernel can detect the host pmem devices and create
>  /dev/pmemXX for each detected device. Users in Dom0 can then create a
>  DAX file system on /dev/pmemXX and create several pre-allocated files
>  in the DAX file system.
> 
>  After setting up the file system on the host pmem, users can add the
>  following lines to the xl configuration files to assign the host pmem
>  regions to domains:
>      vnvdimm = [ 'file=/dev/pmem0' ]
>  or
>      vnvdimm = [ 'file=/mnt/dax/pre_allocated_file' ]
> 
>   The first type of configuration assigns the entire pmem device
>   (/dev/pmem0) to the domain, while the second assigns the space
>   allocated to /mnt/dax/pre_allocated_file on the host pmem device to
>   the domain.
> 
>   When the domain starts, the guest can detect the (virtual) pmem
>   devices via ACPI, and guest reads/writes on the virtual pmem devices
>   are applied directly to their host backends.

Would guest namespace (128kb) be written at offset 0 of said file (or block)?
And of course the guest can only manipulate this using ACPI _DSM methods?

> 
> 
> 
> 4. Design of vNVDIMM in Xen
> 
>  As in KVM/QEMU, our design currently only provides pmem vNVDIMM.
> 
>  Similarly to that in KVM/QEMU, enabling vNVDIMM in Xen is composed of
>  three parts:
>  (1) Guest clwb/clflushopt enabling,
>  (2) pmem address management, and
>  (3) Guest ACPI emulation.
> 
>  The rest of this section presents the design of each part
>  respectively. The basic design principle is to reuse existing code in
>  the Linux NVDIMM driver, QEMU and Xen as much as possible.
> 
> 
> 4.1 Guest clwb/clflushopt Enabling
> 
>  The instruction enabling is simple and we do the same work as in KVM/QEMU:
>  - clwb/clflushopt are exposed to guest via guest cpuid.
> 

Again, isn't that deprecated and the new mechanism (poking at some register)
has to be used?
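
For reference, the two feature bits in question live in CPUID leaf 7,
sub-leaf 0, EBX (bit 23 = CLFLUSHOPT, bit 24 = CLWB); a trivial
host-side check looks like:

    #include <cpuid.h>
    #include <stdio.h>

    int main(void)
    {
        unsigned int eax, ebx, ecx, edx;

        __cpuid_count(7, 0, eax, ebx, ecx, edx);
        printf("clflushopt: %u\n", (ebx >> 23) & 1);
        printf("clwb      : %u\n", (ebx >> 24) & 1);
        return 0;
    }
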
> 
> 4.2 pmem Address Management
> 
>  pmem address management is primarily composed of three parts:
>  (1) detection of pmem devices and their address ranges, which is
>      accomplished by Dom0 Linux pmem driver and Xen hypervisor;
>  (2) get the SPA ranges of a pmem area that will be mapped to a
>      domain, which is accomplished by xl;
>  (3) map the pmem area to a domain, which is accomplished by qemu and
s/qemu/QEMU/
>      Xen hypervisor.
> 
>  Our design intends to reuse the current memory management for normal
>  RAM in Xen to manage the mapping of pmem. Then we will come across a
>  problem: where we store the memory management data structs for pmem.

s/we store/to store/
> 
>  The rest of this section addresses above aspects respectively.

Wait. What about alternatives? Why treat it as a RAM region instead of
as an MMIO region?

> 
> 4.2.1 Reserve Storage for Management Structures
> 
>  A core data struct in Xen memory management is 'struct page_info'.
>  For normal ram, Xen creates a page_info struct for each page. For
>  pmem, we are going to do the same. However, for large capacity pmem
>  devices (e.g. several terabytes or even larger), the page_info structs
>  would occupy so much storage space that they cannot fit in normal
>  RAM.
> 
>  Our solution, as used by Linux kernel, is to reserve an area on pmem
>  and place pmem's page_info structs in that reserved area. Therefore,
>  we can always ensure there is enough space for pmem page_info
>  structs, though the access to them is slower than directly from the
>  normal ram.
> 
>  Such a pmem namespace can be created via a userspace tool ndctl and
>  then recognized by the Linux NVDIMM driver. However, they currently
>  only reserve space for the Linux kernel's page structs. Therefore, our
>  design needs to extend both the Linux NVDIMM driver and ndctl to
>  reserve an arbitrary size.

That seems .. fragile? What if Windows or FreeBSD want to use it
too? Would this 'struct page' on NVDIMM be generalized enough
to work with Linux, Xen, FreeBSD and what not?

And this ndctl is https://github.com/pmem/ndctl I presume?

And how is this structure reserved? Is it a separate namespace entry?
And QEMU knows not to access it? Or Xen needs to make sure _nobody_
except it can access it? Which means Xen may need to know the format
of the ndctl structures that are laid out in the NVDIMM region?
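
Separately, to put a number on the "too much storage space" point above,
a back-of-envelope calculation (hedged: the exact sizeof(struct
page_info) depends on the Xen build, and Linux's struct page is a
different size again):

        1 TiB pmem / 4 KiB pages              = 268,435,456 pages
        268,435,456 pages * 32 B page_info    =  8 GiB reserved (~0.8%)
        268,435,456 pages * 64 B struct page  = 16 GiB reserved (~1.6%)

i.e. the reservation grows linearly with capacity, which is why it has
to live on the pmem itself rather than in normal RAM.
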

> 
> 4.2.2 Detection of Host pmem Devices
> 
>  The detection and initialization of host pmem devices require a
>  non-trivial driver to interact with the corresponding ACPI namespace
>  devices, parse namespace labels and make necessary recovery actions.
>  Instead of duplicating the comprehensive Linux pmem driver in the Xen
>  hypervisor, our design leaves this to Dom0 Linux and lets Dom0 Linux
>  report detected host pmem devices to the Xen hypervisor.

So Xen would ignore at bootup ACPI NFIT structures?
> 
>  Our design takes the following steps to detect host pmem devices when Xen
>  boots.
>  (1) As when booting on bare metal, host pmem devices are detected by
>      the Dom0 Linux NVDIMM driver.
> 
>  (2) Our design extends the Linux NVDIMM driver to report the SPAs and
>      sizes of the pmem devices and their reserved areas to the Xen
>      hypervisor via a new hypercall.

reserved areas? That is the namespace region and the SPA <start,end>
for the ndctl areas? Are the ndctl areas guaranteed to be contiguous?

Is there some spec on the ndctl and how/where they are stuck in the NVDIMM?

> 
>  (3) Xen hypervisor then checks
>      - whether the SPA and size of the newly reported pmem device overlap
>        with any previously reported pmem devices;

Or normal RAM?

>      - whether the reserved area can fit in the pmem device and is
>        large enough to hold page_info structs for itself.

I think I know what you mean but it sounds odd.

Perhaps:

 large enough to hold page_info structs for its entire range?

Native speaker, like Ian, would know how to say this right I think.

Anyhow, wouldn't this 'sizeof(struct page_info)' depend on the ndctl
tool and what version was used to create this? What if one version
used 32-bytes for a PAGE, while another used 64-bytes for a PAGE too?
It would be a bit of catching up .. wait, this same problem MUST
exist with Linux too? How does it deal with this?

> 
>      If any checks fail, the reported pmem device will be ignored by
>      Xen hypervisor and hence will not be used by any

I hope this hypercall returns an error code too?

>      guests. Otherwise, Xen hypervisor will recorded the reported
s/recorded/record/
>      parameters and create page_info structs in the reserved area.

Ohh. You just blast it away? I guess it makes sense. Then what is the
purpose of the ndctl? Just to carve out a namespace region for this?

And what if there is something there from previous OS (say Linux)?
Just blast it away? But could Linux depend on this containing some
persistent information? Or does it also blast it away?

But those regions may be non-contiguous (or maybe not? I need to check
the spec to double-check) so how do you figure out this 'reserved area'
as it may be an N SPA's of the <start>,<end> type?

> 
>  (4) Because the reserved area is now used by Xen hypervisor, it
>      should not be accessible by Dom0 any more. Therefore, if a host
>      pmem device is recorded by Xen hypervisor, Xen will unmap its

s/recorded/usurped/? Maybe monopolized? Owned? Ah, possessed!

s/its/this/
>      reserved area from Dom0. Our design also needs to extend Linux
>      NVDIMM driver to "balloon out" the reserved area after it
>      successfully reports a pmem device to Xen hypervisor.

This "balloon out" is interesting. You are effectively telling Linux
to ignore a certain range of 'struct page_info', so that if somebody
uses /sys/debug/kernel/page_walk it won't blow up? (As the kernel
can't read the struct page_info anymore).

How would you do this? Simulate an NVDIMM unplug?

But if you do that how will SMART tools work anymore? And
who would do the _DSM checks on the health of the NVDIMM?

/me scratches his head. Perhaps the answers are later in this
design..
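
In the meantime, purely as an illustration of what step (2) above would
have to carry from Dom0 to Xen (the actual interface is not defined in
this doc, so every name below is made up):

    /* Hypothetical payload for the new "report pmem" hypercall. */
    struct xen_pmem_region {
        uint64_t spa_start;     /* start SPA of the pmem region */
        uint64_t size;          /* size of the region in bytes */
        uint64_t rsv_start;     /* start SPA of the reserved area */
        uint64_t rsv_size;      /* size of the reserved area */
    };

If the reserved area can indeed be a list of extents rather than one
range (my question above), this would presumably have to become an
array.
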

> 
> 4.2.3 Get Host Machine Address (SPA) of Host pmem Files
> 
>  Before a pmem file is assigned to a domain, we need to know the host
>  SPA ranges that are allocated to this file. We do this work in xl.
> 
>  If a pmem device /dev/pmem0 is given, xl will read
>  /sys/block/pmem0/device/{resource,size} respectively for the start
>  SPA and size of the pmem device.

Oh! How convenient!
> 
>  If a pre-allocated file /mnt/dax/file is given,
>  (1) xl first finds the host pmem device where /mnt/dax/file is. Then
>      it uses the method above to get the start SPA of the host pmem
>      device.
>  (2) xl then uses the fiemap ioctl to get the extent mappings of
>      /mnt/dax/file, and adds the corresponding physical offsets and
>      lengths in each mapping entry to the above start SPA to get the
>      SPA ranges pre-allocated for this file.

Nice !
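
For reference, a sketch of step (2), assuming the file fits in 32
extents and skipping error handling; fe_physical is the offset into
/dev/pmemX, so SPA = device start SPA + fe_physical:

    #include <fcntl.h>
    #include <linux/fiemap.h>
    #include <linux/fs.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/mnt/dax/pre_allocated_file", O_RDONLY);
        struct fiemap *fm = calloc(1, sizeof(*fm) +
                                   32 * sizeof(struct fiemap_extent));

        fm->fm_length = ~0ULL;                /* query the whole file */
        fm->fm_extent_count = 32;
        fm->fm_flags = FIEMAP_FLAG_SYNC;
        if (ioctl(fd, FS_IOC_FIEMAP, fm) == 0)
            for (unsigned int i = 0; i < fm->fm_mapped_extents; i++)
                printf("physical offset 0x%llx, length 0x%llx\n",
                       (unsigned long long)fm->fm_extents[i].fe_physical,
                       (unsigned long long)fm->fm_extents[i].fe_length);
        free(fm);
        close(fd);
        return 0;
    }
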
> 
>  The resulting host SPA ranges will be passed to QEMU which allocates
>  guest address space for vNVDIMM devices and calls Xen hypervisor to
>  map the guest address to the host SPA ranges.
> 
> 4.2.4 Map Host pmem to Guests
> 
>  Our design reuses the existing address mapping in Xen for the normal
>  ram to map pmem. We will still initiate the mapping for pmem from
>  QEMU, and there is one difference from the mapping of normal ram:
>  - For the normal ram, QEMU only needs to provide gpfn, and the actual
>    host memory where gpfn is mapped is allocated by Xen hypervisor.
>  - For pmem, QEMU provides both gpfn and mfn where gpfn is expected to
>    be mapped to. mfn is provided by xl as described in Section 4.2.3.
> 
>  Our design introduces a new XENMEM op for the pmem mapping, which
>  finally calls guest_physmap_add_page() to add the host pmem page to a
>  domain's address space.
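
Again purely illustrative, since the doc only says "gpfn + mfn"; all
names here are hypothetical:

    /* Hypothetical argument block for the new XENMEM op. */
    struct xen_pmem_map {
        domid_t  domid;         /* target domain */
        uint64_t mfn;           /* first host pmem frame (from xl, 4.2.3) */
        uint64_t gpfn;          /* first guest frame chosen by QEMU */
        uint64_t nr_mfns;       /* number of frames to map */
    };
    /* Xen's handler would essentially loop over the range calling
     * guest_physmap_add_page(d, gpfn + i, mfn + i, 0). */
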
> 
> 4.2.5 Misc 1: RAS
> 
>  Machine checks can occur from NVDIMM just as from normal RAM, so we
>  follow the current machine check handling in Xen for MC# from NVDIMM.

OK, so that is mc_memerr_dhandler.

Is there enough telemetry information for the guest to know
it is NVDIMM and handle it via the NVDIMM #MCE error handling which
is different from normal #MCE handling?

I presume this means a certain Linux guest dependency as well
for this to work?

> 
> 4.2.6 Misc 2: hotplug
> 
>  Hotplugged host NVDIMM devices are detected via the _FIT method under
>  the root ACPI namespace device for NVDIMM. We rely on Dom0 Linux
>  kernel to discover the hotplugged NVDIMM devices and follow steps in
>  Section 4.2.2 to report the hotplugged devices to Xen hypervisor.
> 
> 
> 4.3 Guest ACPI Emulation
> 
>  Guest ACPI emulation is composed of two parts: building guest NFIT
>  and SSDT that defines ACPI namespace devices for NVDIMM, and
>  emulating guest _DSM. As QEMU has already implemented ACPI support
>  for vNVDIMM on KVM, our design intends to reuse its implementation.
> 
> 4.3.1 Building Guest ACPI Tables
> 
>  Two tables for vNVDIMM need to be built:
>  - NFIT, which defines the basic parameters of vNVDIMM devices and
>    does not contain any AML code.
>  - SSDT, which defines ACPI namespace devices for vNVDIMM in AML code.
> 
>  The contents of both tables are affected by some parameters
>  (e.g. address and size of vNVDIMM devices) that cannot be determined
>  until a guest configuration is given. However, all AML code in guest
>  ACPI are currently generated at compile time fro pre-crafted .asl

s/fro/from/

>  files, which is not feasible for ACPI namespace devices for vNVDIMM.
> 
>  We could either introduce an AML builder to generate AML code at
>  runtime like what QEMU is currently doing, or pass vNVDIMM ACPI
>  tables from QEMU to Xen. In order to reduce the duplicated code (to

s/to Xen/to hvmloader/ I think?

>  AML builder in QEMU), our design takes the latter approach. Basically,
>  our design takes the following steps:
>  1) The current QEMU does not build any ACPI content when it runs as
>     the Xen device model, so we need to patch it to generate NFIT and
>     AML code of ACPI namespace devices for vNVDIMM.
> 
>  2) QEMU then copies the above NFIT and ACPI namespace devices to an
>     area at the end of guest memory below 4G. The guest physical
>     address and size of this area are written to the xenstore keys
>     (/local/domain/domid/hvmloader/dm-acpi/{address,length}). The
>     detailed format of data in this area is explained later.
> 
>  3) hvmloader reads above xenstore keys to probe the passed-in ACPI
>     tables and ACPI namespace devices, and detects the potential
>     collisions as explained later.
> 
>  4) If no collisions are found, hvmloader will
>     (1) append the passed-in ACPI tables to the end of existing guest
>         ACPI tables, like what current construct_passthrough_tables()
>         does.
>     (2) construct an SSDT for each passed-in ACPI namespace device and
>         append to the end of existing guest ACPI tables.
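
A sketch of the QEMU side of step 2) above (paths as in the doc, error
handling omitted):

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <xenstore.h>

    static void publish_dm_acpi(struct xs_handle *xsh, int domid,
                                uint64_t addr, uint64_t len)
    {
        char path[128], val[32];

        snprintf(path, sizeof(path),
                 "/local/domain/%d/hvmloader/dm-acpi/address", domid);
        snprintf(val, sizeof(val), "%" PRIu64, addr);
        xs_write(xsh, XBT_NULL, path, val, strlen(val));

        snprintf(path, sizeof(path),
                 "/local/domain/%d/hvmloader/dm-acpi/length", domid);
        snprintf(val, sizeof(val), "%" PRIu64, len);
        xs_write(xsh, XBT_NULL, path, val, strlen(val));
    }
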
> 
>  Passing arbitrary ACPI tables and AML code from QEMU could
>  introduce at least two types of collisions:
>  1) a passed-in table and a Xen-built table have the same signature
>  2) a passed-in ACPI namespace device and a Xen-built ACPI namespace
>     device have the same device name.
> 
>  Our design takes the following method to avoid and detect collisions.
>  1) The data layout of the area where QEMU copies its NFIT and ACPI
>     namespace devices is organized as below:

Why can't this be expressed in XenStore?

You could have /local/domain/domid/hvmloader/dm-acpi/<name>/{address,length,type} ?

> 
>      1 byte 4 bytes  length bytes
>     +------+--------+-----------+------+--------+-----------+-----
>     | type | length | data blob | type | length | data blob | ...
>     +------+--------+-----------+------+--------+-----------+-----
> 
>     type: 0 - data blob contains a complete ACPI table
>           1 - data blob contains AML code for an ACPI namespace device
> 
>     length: the number of bytes of data blob
> 
>     data blob: type 0 - a complete ACPI table
>                type 1 - composed as below:
> 
>                          4 bytes   (length - 4) bytes
>                       +---------+------------------+
>                       | name[4] | AML code snippet |
>                       +---------+------------------+
> 
>                         name[4]         : name of ACPI namespace device
>                       AML code snippet: AML code inside "Device(name[4])"
> 
>                e.g. for an ACPI namespace device defined by
>                    Device(NVDR)
>                    {
>                      Name (_HID, "ACPI0012")
>                      ...
>                    }
>                   QEMU builds a data blob like
>                       +--------------------+-----------------------------+
>                       | 'N', 'V', 'D', 'R' | Name (_HID, "ACPI0012") ... |
>                       +--------------------+-----------------------------+
> 
>  2) hvmloader stores signatures of its own guest ACPI tables in an
>     array builtin_table_sigs[], and names of its own guest ACPI
>     namespace devices in an array builtin_nd_names[]. Because there
>     are only a few guest ACPI tables and namespace devices built by
>     Xen, we can hardcode their signatures or names in hvmloader.
> 
>  3) When hvmloader loads a type 0 entry, it extracts the signature

s/type 0/data blob->type 0/ ?

>     from the data blob and searches for it in builtin_table_sigs[]. If
>     one is found, hvmloader will report an error and stop. Otherwise,
>     it will append the table to the end of the loaded guest ACPI.
> 
>  4) When hvmloader loads a type 1 entry, it extracts the device name
>     from the data blob and searches for it in builtin_nd_names[]. If
>     one is found, hvmloader will report an error and stop. Otherwise,
>     it will wrap the AML code snippet by "Device (name[4]) {...}" and
>     include it in a new SSDT which is then appended to the end of
>     loaded guest ACPI.
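
To make steps 3) and 4) concrete, the hvmloader walk over the area from
1) would look roughly like this (sketch only; sig_collides(),
name_collides(), append_table() and build_device_ssdt() are hypothetical
stand-ins for the real helpers):

    #include <stdint.h>

    struct dm_acpi_entry {
        uint8_t  type;          /* 0 = complete table, 1 = namespace device */
        uint32_t length;        /* bytes of data blob that follows */
        uint8_t  data[];        /* table, or name[4] followed by AML */
    } __attribute__((packed));

    /* Hypothetical helpers provided elsewhere in hvmloader. */
    int sig_collides(const uint8_t *table);
    int name_collides(const uint8_t *name);
    void append_table(const uint8_t *table, uint32_t len);
    void build_device_ssdt(const uint8_t *name, const uint8_t *aml, uint32_t len);

    static int load_dm_acpi(const uint8_t *area, uint32_t area_len)
    {
        uint32_t off = 0;

        while (off + sizeof(struct dm_acpi_entry) <= area_len) {
            const struct dm_acpi_entry *e =
                (const struct dm_acpi_entry *)(area + off);

            if (e->type == 0) {
                if (sig_collides(e->data))           /* step 3) */
                    return -1;                       /* report error and stop */
                append_table(e->data, e->length);
            } else if (e->type == 1) {
                if (name_collides(e->data))          /* step 4) */
                    return -1;
                build_device_ssdt(e->data,           /* name[4]     */
                                  e->data + 4,       /* AML snippet */
                                  e->length - 4);
            }
            off += sizeof(*e) + e->length;
        }
        return 0;
    }
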
> 
> 4.3.2 Emulating Guest _DSM
> 
>  Our design leaves the emulation of guest _DSM to QEMU. Just as what
>  it does with KVM, QEMU registers the _DSM buffer as an MMIO region with
>  Xen and then all guest evaluations of _DSM are trapped and emulated
>  by QEMU.

Sweet!
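
For reference, the QEMU side of that trap is just a pair of
MemoryRegionOps callbacks, along the lines of (sketch only; the real
handlers decode the full _DSM command/response layout):

    #include "qemu/osdep.h"
    #include "exec/memory.h"

    static uint64_t dsm_read(void *opaque, hwaddr addr, unsigned size)
    {
        /* return the status/output of the last emulated _DSM command */
        return 0;
    }

    static void dsm_write(void *opaque, hwaddr addr, uint64_t val, unsigned size)
    {
        /* decode the command the guest's _DSM AML wrote and emulate it */
    }

    static const MemoryRegionOps dsm_ops = {
        .read = dsm_read,
        .write = dsm_write,
        .endianness = DEVICE_LITTLE_ENDIAN,
    };

    /* registered once with something like:
     *   memory_region_init_io(&mr, OBJECT(dev), &dsm_ops, state,
     *                         "nvdimm-dsm", buffer_len);
     * under Xen the accesses arrive via the ioreq server rather than
     * KVM exits, but the callbacks stay the same. */
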

So one question that I am not sure has been answered: with the
'struct page_info' being removed from dom0, how will OEM _DSM methods
operate? For example some of the AML code may ask to poke
at specific SPAs, but how will Linux do this properly without
'struct page_info' being available?

Thanks!
> 
> 
> References:
> [1] ACPI Specification v6,
>     http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf
> [2] NVDIMM Namespace Specification,
>     http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf
> [3] NVDIMM Block Window Driver Writer's Guide,
>     http://pmem.io/documents/NVDIMM_Driver_Writers_Guide.pdf
> [4] NVDIMM DSM Interface Example,
>     http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
> [5] UEFI Specification v2.6,
>     http://www.uefi.org/sites/default/files/resources/UEFI%20Spec%202_6.pdf
> [6] Intel Architecture Instruction Set Extensions Programming Reference,
>     https://software.intel.com/sites/default/files/managed/07/b7/319433-023.pdf
> [7] https://lists.xenproject.org/archives/html/xen-devel/2016-02/msg00006.html
> [8] https://lists.xen.org/archives/html/xen-devel/2016-06/msg00606.html


 

