
Re: [Xen-devel] [RFC Design Doc v2] Add vNVDIMM support for Xen



Hi Konrad,

On 08/03/16 17:25, Konrad Rzeszutek Wilk wrote:
> On Mon, Jul 18, 2016 at 08:29:12AM +0800, Haozhong Zhang wrote:
> > Hi,
> > 
> 
> Hey!
> 
> Thanks for posting! Sorry for the late review. Below are some of my
> comment.
>

Thank you for the review!

[..]
> And is there any need for the E820 type 7 to be exposed? I presume
> not as the ACPI NFIT is sufficient?
>

No, NFIT is sufficient and provides more information than E820.

> 
> >   Guest _FIT method will be implemented similarly in the future.
> > 
> > 
> > 
> > 3. Usage Example of vNVDIMM in Xen
> > 
> >  Our design is to provide virtual pmem devices to HVM domains. The
> >  virtual pmem devices are backed by host pmem devices.
> > 
> >  Dom0 Linux kernel can detect the host pmem devices and create
> >  /dev/pmemXX for each detected devices. Users in Dom0 can then create
> >  DAX file system on /dev/pmemXX and create several pre-allocate files
> >  in the DAX file system.
> > 
> >  After setup the file system on the host pmem, users can add the
> >  following lines in the xl configuration files to assign the host pmem
> >  regions to domains:
> >      vnvdimm = [ 'file=/dev/pmem0' ]
> >  or
> >      vnvdimm = [ 'file=/mnt/dax/pre_allocated_file' ]
> > 
> >   The first type of configuration assigns the entire pmem device
> >   (/dev/pmem0) to the domain, while the second assigns the space
> >   allocated to /mnt/dax/pre_allocated_file on the host pmem device to
> >   the domain.
> > 
> >   When the domain starts, guest can detect the (virtual) pmem devices
> >   via ACPI and guest read/write on the virtual pmem devices are
> >   directly applied on their host backends.
> 
> Would guest namespace (128kb) be written at offset 0 of said file (or block)?
> And of course the guest can only manipulate this using ACPI _DSM methods?
>

I guess you mean the label storage area, which stores the namespace
labels. In the current QEMU implementation, the guest label storage
area is at the end of the file or block device. It's not mapped into
the guest address space (which I forgot to state here) and can be
accessed only via the guest _DSM.
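
To make that concrete, a tiny sketch of how the backend is split under
this scheme (the sizes are made up for the example; this is not the
actual QEMU code):

  #include <stdint.h>
  #include <stdio.h>

  /* The label storage area (LSA) occupies the tail of the backing
   * file/device; only the leading part is mapped into the guest, and
   * the LSA is reached exclusively through the emulated _DSM. */
  int main(void)
  {
      uint64_t backend_size = 4ULL << 30;   /* e.g. a 4 GB backing file */
      uint64_t lsa_size     = 128 << 10;    /* e.g. a 128 KB label area */
      uint64_t guest_mapped = backend_size - lsa_size;

      printf("guest-mapped pmem : [0x0, 0x%llx)\n",
             (unsigned long long)guest_mapped);
      printf("label storage area: [0x%llx, 0x%llx)\n",
             (unsigned long long)guest_mapped,
             (unsigned long long)backend_size);
      return 0;
  }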

[..] 
> > 4.2 pmem Address Management
> > 
> >  pmem address management is primarily composed of three parts:
> >  (1) detection of pmem devices and their address ranges, which is
> >      accomplished by Dom0 Linux pmem driver and Xen hypervisor;
> >  (2) get SPA ranges of an pmem area that will be mapped to domain
> >      which is accomplished by xl;
> >  (3) map the pmem area to a domain, which is accomplished by qemu and
> s/qemu/QEMU/
> >      Xen hypervisor.
> > 
> >  Our design intends to reuse the current memory management for normal
> >  RAM in Xen to manage the mapping of pmem. Then we will come across a
> >  problem: where we store the memory management data structs for pmem.
> 
> s/we store/where to/
> > 
> >  The rest of this section addresses above aspects respectively.
> 
> Wait. What about alternatives? Why treat it as a RAM region instead of
> as an MMIO region?
>

The part used as the label storage area of the vNVDIMM is treated as
MMIO, as described in a later section of this design. The other parts
of the vNVDIMM are accessed directly by the guest, so I think we can
treat them as normal RAM regions and map them to the guest, though we
definitely need to mark them as pmem regions in the virtual NFIT.

> > 
> > 4.2.1 Reserve Storage for Management Structures
> > 
> >  A core data struct in Xen memory management is 'struct page_info'.
> >  For normal ram, Xen creates a page_info struct for each page. For
> >  pmem, we are going to do the same. However, for large capacity pmem
> >  devices (e.g. several terabytes or even larger), a large amount of
> >  page_info structs will occupy too much storage space that cannot
> >  fit in the normal ram.
> > 
> >  Our solution, as used by Linux kernel, is to reserve an area on pmem
> >  and place pmem's page_info structs in that reserved area. Therefore,
> >  we can always ensure there is enough space for pmem page_info
> >  structs, though the access to them is slower than directly from the
> >  normal ram.
> > 
> >  Such a pmem namespace can be created via a userspace tool ndctl and
> >  then recognized by Linux NVDIMM driver. However, they currently only
> >  reserve space for Linux kernel's page structs. Therefore, our design
> >  need to extend both Linux NVDIMM driver and ndctl to reserve
> >  arbitrary size.
> 
> That seems .. fragile? What if Windows or FreeBSD want to use it
> too?

AFAIK, the way the current Linux NVDIMM driver does the reservation
has not been documented in any public specification yet. I'll consult
the driver developers for more information.

> Would this 'struct page' on on NVDIMM be generalized enough
> to work with Linux,Xen, FreeBSD and what not?
>

No. Different operating systems may choose different data structures
to manage NVDIMMs according to their own requirements and
considerations, so it would be hard to reach an agreement on what to
put in a generic data structure (and make it part of the ABI?).

> And this ndctl is https://github.com/pmem/ndctl I presume?

Yes. Sorry that I forgot to attach the URL.

>
> And how is this structure reserved? Is it a seperate namespace entry?

No, it does not introduce any extra namespace entry. The current
NVDIMM driver in Linux does the reservation in the way shown by the
following diagram (I omit details about alignment and padding for
simplicity):

 SPA  SPA+4K
  |      |
  V      V
  +------+-----------+-- ... ---+-----...-----+
  |      | nd_pfn_sb | reserved | free to use |
  +------+-----------+-- ... ---+-----...-----+
  |<--   nd_pfn_sb.dataoff   -->|             |
  |    (+ necessary padding)                  |
  |                                           |
  |<------------- pmem namespace ------------>|

Given a pmem namespace which starts from SPA,
 1) the driver stores a struct nd_pfn_sb at SPA+4K
 2) the reserved area is after nd_pfn_sb
 3) the free-to-use area is after the reserved area, and its location
    relative to SPA can be derived from nd_pfn_sb.dataoff
 4) only the free-to-use area is exposed as the block device
    /dev/pmemX. Access to sector N of /dev/pmemX actually goes to
    (SPA + nd_pfn_sb.dataoff + N * SECT_SIZE)
 5) nd_pfn_sb also contains a signature "NVDIMM_PFN_INFO" and a
    checksum. If the driver finds such a signature and the checksum
    matches, it knows this device contains a reserved area.
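
For reference, here is an abridged sketch of that superblock (based on
the Linux driver of that time; the real struct nd_pfn_sb has more
fields and uses little-endian types, so take this only as an
illustration):

  #include <stdint.h>

  /* Abridged sketch of the superblock stored at SPA+4K; not the literal
   * kernel definition. */
  struct nd_pfn_sb_sketch {
      uint8_t  signature[16];  /* "NVDIMM_PFN_INFO"                        */
      uint8_t  uuid[16];
      uint8_t  parent_uuid[16];
      uint32_t flags;
      uint16_t version_major;
      uint16_t version_minor;
      uint64_t dataoff;        /* offset of the free-to-use area           */
      uint64_t npfns;          /* number of page frames covered            */
      uint32_t mode;           /* where the page structs live: pmem or RAM */
      /* ... padding up to 4 KB ... */
      uint64_t checksum;       /* validates this superblock                */
  };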

> And QEMU knows not to access it?

QEMU as a userspace program can only access /dev/pmemX and hence has
no way to touch the reserved area.

> Or Xen needs to make sure _nobody_
> except it can access it? Which means Xen may need to know the format
> of the ndctl structures that are laid out in the NVDIMM region?
>

The Xen hypervisor relies on the Dom0 driver to parse the layout. At
Dom0 boot, the Dom0 NVDIMM driver reports the address/size of the area
reserved for Xen to the Xen hypervisor, which then unmaps the reserved
area from Dom0.

> > 
> > 4.2.2 Detection of Host pmem Devices
> > 
> >  The detection and initialize host pmem devices require a non-trivial
> >  driver to interact with the corresponding ACPI namespace devices,
> >  parse namespace labels and make necessary recovery actions. Instead
> >  of duplicating the comprehensive Linux pmem driver in Xen hypervisor,
> >  our designs leaves it to Dom0 Linux and let Dom0 Linux report
> >  detected host pmem devices to Xen hypervisor.
> 
> So Xen would ignore at bootup ACPI NFIT structures?

Yes, parsing the NFIT is left to Dom0, which has the correct driver.

> > 
> >  Our design takes following steps to detect host pmem devices when Xen
> >  boots.
> >  (1) As booting on bare metal, host pmem devices are detected by Dom0
> >      Linux NVDIMM driver.
> > 
> >  (2) Our design extends Linux NVDIMM driver to reports SPA's and sizes
> >      of the pmem devices and reserved areas to Xen hypervisor via a
> >      new hypercall.
> 
> reserved areas? That is the namespace region and the SPA <start,end>
> for the ndctl areas? Are the ndctl areas guarnateed to be contingous?
>

Explained above. The reserved area within an individual pmem namespace
is contiguous.
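
For illustration, the per-namespace report in step (2) could be as
simple as the following record (a hypothetical sketch; neither the
hypercall nor these names exist yet):

  #include <stdint.h>

  /* One such record would be passed to Xen per reported pmem namespace. */
  struct xen_pmem_report {
      uint64_t spa;        /* start SPA of the pmem namespace           */
      uint64_t size;       /* size of the namespace in bytes            */
      uint64_t rsv_spa;    /* start SPA of the area reserved for Xen's  */
      uint64_t rsv_size;   /*   page_info structs, within the namespace */
  };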

> Is there some spec on the ndctl and how/where they are stuck in the NVDIMM?
>

No public spec so far, as mentioned above.

> > 
> >  (3) Xen hypervisor then checks
> >      - whether SPA and size of the newly reported pmem device is overlap
> >        with any previously reported pmem devices;
> 
> Or normal RAM?
>

Yes, I missed normal RAM here.

> >      - whether the reserved area can fit in the pmem device and is
> >        large enough to hold page_info structs for itself.
> 
> I think I know what you mean but it sounds odd.
> 
> Perhaps:
> 
>  large enough to hold page_info struct's for it's entire range?
>

Yes, that is what I mean.

> Native speaker, like Ian, would know how to say this right I think.
> 
> Anyhow, wouldn't this 'sizeof(struct page_info)' depend on the ndctl
> tool and what version was used to create this? What if one version
> used 32-bytes for a PAGE, while another used 64-bytes for a PAGE too?
> It would be a bit of catching up .. wait, this same problem MUST
> be with Linux? How does it deal with this?
>

Good question. Linux reserves a larger size (64 bytes) per page than
its current sizeof(struct page) (40 bytes). We may do the same,
e.g. reserve 64 bytes per page even though Xen's page_info is 32
bytes?
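
For a sense of scale (assuming 4 KB pages): reserving 64 bytes per
page costs 64/4096, i.e. about 1.6% of the device, or roughly 16 GB
per TB of pmem; 32-byte entries would halve that to roughly 8 GB per
TB.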

> > 
> >      If any checks fail, the reported pmem device will be ignored by
> >      Xen hypervisor and hence will not be used by any
> 
> I hope this hypercall returns an error code too?
>

Definitely yes.

> >      guests. Otherwise, Xen hypervisor will recorded the reported
> s/recorded/record/
> >      parameters and create page_info structs in the reserved area.
> 
> Ohh. You just blast it away? I guess it makes sense. Then what is the
> purpose of the ndctl? Just to carve out an namespace region for this?
>

ndctl is used by, for example, a system admin to reserve space on a
host pmem namespace. If there is already data in the namespace, ndctl
will give a warning message and exit as long as the --force option is
not given. However, if --force is present, ndctl will destroy the
existing data.
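
For illustration, with the ndctl of that time a namespace whose page
structs are kept on the device itself is created roughly as below; the
ability to reserve an arbitrary (Xen-sized) amount is exactly the
extension described in Section 4.2.1 and does not exist yet:

  # Reconfigure namespace0.0 so that its page-struct array ("map") is
  # kept on the pmem device itself rather than in RAM.
  ndctl create-namespace -e namespace0.0 --mode=memory --map=dev --force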

> And what if there is something there from previous OS (say Linux)?
> Just blast it away? But could Linux depend on this containing some
> persistent information? Or does it also blast it away?
>

As above, if the Linux driver detects the signature "NVDIMM_PFN_INFO"
and a matching checksum, it knows it's safe to write to the reserved
area. Otherwise, it treats the pmem namespace as a raw device and
stores page structs in normal RAM.

> But those regions may be non-contingous (or maybe not? I need to check
> the spec to double-check) so how do you figure out this 'reserved area'
> as it may be an N SPA's of the <start>,<end> type?
>

The reserved area is per pmem namespace.

> > 
> >  (4) Because the reserved area is now used by Xen hypervisor, it
> >      should not be accessible by Dom0 any more. Therefore, if a host
> >      pmem device is recorded by Xen hypervisor, Xen will unmap its
> 
> s/recorded/usurped/? Maybe monopolized? Owned? Ah, possesed!
> 
> s/its/this/
> >      reserved area from Dom0. Our design also needs to extend Linux
> >      NVDIMM driver to "balloon out" the reserved area after it
> >      successfully reports a pmem device to Xen hypervisor.
> 
> This "balloon out" is interesting. You are effectively telling Linux
> to ignore a certain range of 'struct page_info', so that if somebody
> uses /sys/debug/kernel/page_walk it won't blow up? (As the kerne
> can't read the struct page_info anymore).
>
> How would you do this? Simulate an NVDIMM unplug?

s/page_info/page/ (struct page for Linux, struct page_info for Xen)

As in Jan's comment, "balloon out" is a confusing name here.
Basically, it means removing the reserved area from certain resource
structs in the NVDIMM driver so that it cannot be accessed outside the
driver via those structs. The NVDIMM driver does not map the reserved
area, so I think it cannot be touched via page_walk.

> 
> But if you do that how will SMART tools work anymore? And
> who would do the _DSM checks on the health of the NVDIMM?
>

A userspace SMART tool cannot access the reserved area, so I think it
can still work. I haven't looked at the implementation of any SMART
tools for NVDIMM, but I guess they would ultimately call the driver to
evaluate the ARS _DSM, which reports the bad blocks. As long as the
driver does not return bad blocks in the reserved area to SMART tools
(which I expect to be handled by the driver itself), SMART tools
should work fine.

> /me scratches his head. Perhaps the answers are later in this
> design..
>
> > 
> > 4.2.3 Get Host Machine Address (SPA) of Host pmem Files
> > 
> >  Before a pmem file is assigned to a domain, we need to know the host
> >  SPA ranges that are allocated to this file. We do this work in xl.
> > 
> >  If a pmem device /dev/pmem0 is given, xl will read
> >  /sys/block/pmem0/device/{resource,size} respectively for the start
> >  SPA and size of the pmem device.
> 
> Oh! How convient!
> > 
> >  If a pre-allocated file /mnt/dax/file is given,
> >  (1) xl first finds the host pmem device where /mnt/dax/file is. Then
> >      it uses the method above to get the start SPA of the host pmem
> >      device.
> >  (2) xl then uses fiemap ioctl to get the extend mappings of
> >      /mnt/dax/file, and adds the corresponding physical offsets and
> >      lengths in each mapping entries to above start SPA to get the SPA
> >      ranges pre-allocated for this file.
> 
> Nice !
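
To make the fiemap step concrete, here is a rough standalone sketch of
how the physical extents of a DAX file can be turned into SPA ranges
(this is not the actual xl code; the program shape is illustrative):

  /* Print the SPA ranges backing a file on a DAX file system.  The start
   * SPA of the underlying pmem device is taken as an argument; xl would
   * read it from /sys/block/pmemX/device/resource as described above. */
  #include <stdio.h>
  #include <stdlib.h>
  #include <fcntl.h>
  #include <unistd.h>
  #include <sys/ioctl.h>
  #include <linux/fs.h>
  #include <linux/fiemap.h>

  static int print_spa_ranges(int fd, unsigned long long dev_start_spa)
  {
      unsigned int max_extents = 32;  /* enough for a short example;
                                         a real tool would loop */
      size_t sz = sizeof(struct fiemap) +
                  max_extents * sizeof(struct fiemap_extent);
      struct fiemap *fm = calloc(1, sz);

      if (!fm)
          return -1;

      fm->fm_start = 0;
      fm->fm_length = FIEMAP_MAX_OFFSET;   /* whole file */
      fm->fm_flags = FIEMAP_FLAG_SYNC;
      fm->fm_extent_count = max_extents;

      if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) {
          free(fm);
          return -1;
      }

      for (unsigned int i = 0; i < fm->fm_mapped_extents; i++) {
          struct fiemap_extent *e = &fm->fm_extents[i];
          printf("SPA range: 0x%llx - 0x%llx\n",
                 (unsigned long long)(dev_start_spa + e->fe_physical),
                 (unsigned long long)(dev_start_spa + e->fe_physical +
                                      e->fe_length));
      }

      free(fm);
      return 0;
  }

  int main(int argc, char **argv)
  {
      if (argc != 3) {
          fprintf(stderr, "usage: %s <file-on-dax-fs> <dev-start-spa>\n",
                  argv[0]);
          return 1;
      }

      int fd = open(argv[1], O_RDONLY);
      if (fd < 0) {
          perror("open");
          return 1;
      }

      int rc = print_spa_ranges(fd, strtoull(argv[2], NULL, 0));
      close(fd);
      return rc ? 1 : 0;
  }
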
> > 
> >  The resulting host SPA ranges will be passed to QEMU which allocates
> >  guest address space for vNVDIMM devices and calls Xen hypervisor to
> >  map the guest address to the host SPA ranges.
> > 
> > 4.2.4 Map Host pmem to Guests
> > 
> >  Our design reuses the existing address mapping in Xen for the normal
> >  ram to map pmem. We will still initiate the mapping for pmem from
> >  QEMU, and there is one difference from the mapping of normal ram:
> >  - For the normal ram, QEMU only needs to provide gpfn, and the actual
> >    host memory where gpfn is mapped is allocated by Xen hypervisor.
> >  - For pmem, QEMU provides both gpfn and mfn where gpfn is expected to
> >    be mapped to. mfn is provided by xl as described in Section 4.2.3.
> > 
> >  Our design introduce a new XENMEM op for the pmem mapping, which
> >  finally calls guest_physmap_add_page() to add the host pmem page to a
> >  domain's address space.
> > 
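
As a sketch of what the new op's argument might look like (purely
illustrative; neither the name nor the fields are an existing Xen
interface):

  #include <stdint.h>

  /* Hypothetical argument for the new XENMEM op; Xen would walk the
   * range and call guest_physmap_add_page() for each frame. */
  struct xen_pmem_map {
      uint16_t domid;      /* target domain                           */
      uint64_t mfn;        /* first host pmem frame (from xl, 4.2.3)  */
      uint64_t gpfn;       /* first guest frame to map it at          */
      uint64_t nr_mfns;    /* number of contiguous frames to map      */
  };
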
> > 4.2.5 Misc 1: RAS
> > 
> >  Machine check can occur from NVDIMM as normal ram, so that we follow
> >  the current machine check handling in Xen for MC# from NVDIMM.
> 
> OK, so that is mc_memerr_dhandler. OK,
> 
> Is there enought telemtry information for the guest to know
> it is NVDIMM and handle it via the NVDIMM #MCE error handling which
> is different than normal #MCE?
> 
> I presume this means a certain Linux guest dependency as well
> for this to work?
>

Yes, the guest should at least know which addresses belong to the
vNVDIMM. It can then tell from the address reported with the virtual
#MC where the error comes from. Otherwise, the guest will see an #MC
for an address it doesn't know about.

> > 
> > 4.2.6 Misc 2: hotplug
> > 
> >  The hotplugged host NVDIMM devices is detected via _FIT method under
> >  the root ACPI namespace device for NVDIMM. We rely on Dom0 Linux
> >  kernel to discover the hotplugged NVDIMM devices and follow steps in
> >  Section 4.2.2 to report the hotplugged devices to Xen hypervisor.
> > 
> > 
> > 4.3 Guest ACPI Emulation
> > 
> >  Guest ACPI emulation is composed of two parts: building guest NFIT
> >  and SSDT that defines ACPI namespace devices for NVDIMM, and
> >  emulating guest _DSM. As QEMU has already implemented ACPI support
> >  for vNVDIMM on KVM, our design intends to reuse its implementation.
> > 
> > 4.3.1 Building Guest ACPI Tables
> > 
> >  Two tables for vNVDIMM need to be built:
> >  - NFIT, which defines the basic parameters of vNVDIMM devices and
> >    does not contain any AML code.
> >  - SSDT, which defines ACPI namespace devices for vNVDIMM in AML code.
> > 
> >  The contents of both tables are affected by some parameters
> >  (e.g. address and size of vNVDIMM devices) that cannot be determined
> >  until a guest configuration is given. However, all AML code in guest
> >  ACPI are currently generated at compile time fro pre-crafted .asl
> 
> s/fro/for/
> 
> >  files, which is not feasible for ACPI namespace devices for vNVDIMM.
> > 
> >  We could either introduce an AML builder to generate AML code at
> >  runtime like what QEMU is currently doing, or pass vNVDIMM ACPI
> >  tables from QEMU to Xen. In order to reduce the duplicated code (to
> 
> s/to Xen/to hvmloader/ I think?
>

Yes.

> >  AML builder in QEMU), our design takes the latter approach. Basically,
> >  our design takes the following steps:
> >  1) The current QEMU does not build any ACPI stuffs when it runs as
> >     the Xen device model, so we need to patch it to generate NFIT and
> >     AML code of ACPI namespace devices for vNVDIMM.
> > 
> >  2) QEMU then copies above NFIT and ACPI namespace devices to an area
> >     at the end of guest memory below 4G. The guest physical address
> >     and size of this area are written to xenstore keys
> >     (/local/domain/domid/hvmloader/dm-acpi/{address,length}) The
> >     detailed format of data in this area is explained later.
> > 
> >  3) hvmloader reads above xenstore keys to probe the passed-in ACPI
> >     tables and ACPI namespace devices, and detects the potential
> >     collisions as explained later.
> > 
> >  4) If no collisions are found, hvmloader will
> >     (1) append the passed-in ACPI tables to the end of existing guest
> >         ACPI tables, like what current construct_passthrough_tables()
> >         does.
> >     (2) construct a SSDT for each passed-in ACPI namespace devices and
> >         append to the end of existing guest ACPI tables.
> > 
> >  Passing arbitrary ACPI tables and AML code from QEMU could
> >  introduce at least two types of collisions:
> >  1) a passed-in table and a Xen-built table have the same signature
> >  2) a passed-in ACPI namespace device and a Xen-built ACPI namespace
> >     device have the same device name.
> > 
> >  Our design takes the following method to avoid and detect collisions.
> >  1) The data layout of area where QEMU copies its NFIT and ACPI
> >     namespace devices is organized as below:
> 
> Why can't this be expressed in XenStore?
> 
> You could have /local/domain/domid/hvmloader/dm-acpi/<name>/{address,length, 
> type}
> ?
>

If XenStore can be used, then it could save some guest memory.

This is a general mechanism for passing ACPI content and is not
limited to NVDIMM, so QEMU may pass a lot of entries. I'm not sure
whether XenStore is still a proper place when the number of entries is
large. Maybe we should put an upper limit on the number of entries.
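
For concreteness, the two xenstore keys in step 2) would look like this
(the values are made up for illustration):

  /local/domain/domid/hvmloader/dm-acpi/address = "0xfc000000"
  /local/domain/domid/hvmloader/dm-acpi/length  = "0x2000"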

> > 
> >      1 byte 4 bytes  length bytes
> >     +------+--------+-----------+------+--------+-----------+-----
> >     | type | length | data blob | type | length | data blob | ...
> >     +------+--------+-----------+------+--------+-----------+-----
> > 
> >     type: 0 - data blob contains a complete ACPI table
> >           1 - data blob contains AML code for an ACPI namespace device
> > 
> >     length: the number of bytes of data blob
> > 
> >     data blob: type 0 - a complete ACPI table
> >                type 1 - composed as below:
> > 
> >                          4 bytes   (length - 4) bytes
> >                     +---------+------------------+
> >                     | name[4] | AML code snippet |
> >                     +---------+------------------+
> > 
> >                         name[4]         : name of ACPI namespace device
> >                     AML code snippet: AML code inside "Device(name[4])"
> > 
> >                e.g. for an ACPI namespace device defined by
> >                  Device(NVDR)
> >                  {
> >                    Name (_HID, "ACPI0012")
> >                    ...
> >                  }
> >                 QEMU builds a data blob like
> >                     +--------------------+-----------------------------+
> >                     | 'N', 'V', 'D', 'R' | Name (_HID, "ACPI0012") ... |
> >                     +--------------------+-----------------------------+
> > 
> >  2) hvmloader stores signatures of its own guest ACPI tables in an
> >     array builtin_table_sigs[], and names of its own guest ACPI
> >     namespace devices in an array builtin_nd_names[]. Because there
> >     are only a few guest ACPI tables and namespace devices built by
> >     Xen, we can hardcode their signatures or names in hvmloader.
> > 
> >  3) When hvmloader loads a type 0 entry, it extracts the signature
> 
> s/type 0/data blob->type 0/ ?
>

No, the type information is outside the data blob ({type, length, data blob}).

> >     from the data blob and search for it in builtin_table_sigs[].  If
> >     found anyone, hvmloader will report an error and stop. Otherwise,
> >     it will append it to the end of loaded guest ACPI.
> > 
> >  4) When hvmloader loads a type 1 entry, it extracts the device name
> >     from the data blob and search for it in builtin_nd_names[]. If
> >     found anyone, hvmloader will report and error and stop. Otherwise,
> >     it will wrap the AML code snippet by "Device (name[4]) {...}" and
> >     include it in a new SSDT which is then appended to the end of
> >     loaded guest ACPI.
> > 
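
To make the hvmloader side of steps 3) and 4) concrete, here is a
minimal sketch of walking the {type, length, data blob} area described
above (illustrative only, not actual hvmloader code):

  #include <stdint.h>
  #include <string.h>

  #define DM_ACPI_BLOB_TABLE  0   /* data blob is a complete ACPI table  */
  #define DM_ACPI_BLOB_NSDEV  1   /* data blob is AML for an ACPI device */

  static int load_dm_acpi(const uint8_t *area, uint32_t total_len)
  {
      uint32_t off = 0;

      while (off + 5 <= total_len) {
          uint8_t type = area[off];
          uint32_t len;
          const uint8_t *blob = &area[off + 5];

          /* 4-byte length follows the 1-byte type; hvmloader runs on x86,
           * so native (little-endian) byte order is assumed here. */
          memcpy(&len, &area[off + 1], sizeof(len));
          if (len > total_len - off - 5)
              return -1;                      /* malformed entry */

          switch (type) {
          case DM_ACPI_BLOB_TABLE:
              /* blob[0..3] is the table signature: reject it if it is in
               * builtin_table_sigs[], otherwise append the table to the
               * loaded guest ACPI. */
              break;
          case DM_ACPI_BLOB_NSDEV:
              /* blob[0..3] is the device name: reject it if it is in
               * builtin_nd_names[], otherwise wrap the remaining AML in
               * Device(name) {} inside a new SSDT and append that. */
              break;
          default:
              return -1;
          }

          off += 5 + len;
      }

      return 0;
  }
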
> > 4.3.2 Emulating Guest _DSM
> > 
> >  Our design leaves the emulation of guest _DSM to QEMU. Just as what
> >  it does with KVM, QEMU registers the _DSM buffer as MMIO region with
> >  Xen and then all guest evaluations of _DSM are trapped and emulated
> >  by QEMU.
> 
> Sweet!
> 
> So one question that I am not if it has been answered, with the
> 'struct page_info' being removed from the dom0 how will OEM _DSM method
> operation? For example some of the AML code may asking to poke
> at specific SPAs, but how will Linux do this properly without
> 'struct page_info' be available?
>

(s/page_info/page/)

The current Intel NVDIMM driver in Linux does not evaluate any OEM
_DSM method, so I'm not sure whether the kernel has to access an
NVDIMM page while evaluating a _DSM.

The closest one I can think of, though not an OEM _DSM, is function 1
of the ARS _DSM, which takes a start SPA and a length in bytes as
inputs. After the kernel supplies the inputs, the scrubbing of the
specified area is done by the hardware and does not require any
mappings in the OS.

Any examples of such OEM _DSM methods?

Thanks,
Haozhong
