
Re: [Xen-devel] [RFC Design Doc v2] Add vNVDIMM support for Xen



On 08/04/16 10:51, Konrad Rzeszutek Wilk wrote:
> > > >  Such a pmem namespace can be created via a userspace tool ndctl and
> > > >  then recognized by Linux NVDIMM driver. However, they currently only
> > > >  reserve space for Linux kernel's page structs. Therefore, our design
> > > >  need to extend both Linux NVDIMM driver and ndctl to reserve
> > > >  arbitrary size.
> > > 
> > > That seems .. fragile? What if Windows or FreeBSD want to use it
> > > too?
> > 
> > AFAIK, the way used by current Linux NVDIMM driver for reservation has
> > not been documented in any public specifications yet. I'll consult
> > driver developers for more information.
> > 
> > > Would this 'struct page' on on NVDIMM be generalized enough
> > > to work with Linux,Xen, FreeBSD and what not?
> > >
> > 
> > No. Different operating systems may choose different data structures
> > to manage NVDIMM according to their own requirements and
> > consideration, so it would be hard to reach an agreement on what to
> > put in a generic data structure (and make it as part of ABI?).
> 
> Yes. As I can see different OSes having different sizes. And then
> this size of 'reserved region' ends up being too small and only
> some part of the NVDIMM can be used.
>

If the reserved area is too small for some OSes, those OSes may
choose to put their management data structures in normal RAM in
order to map the whole NVDIMM.

Possibly, a tool could be developed to adjust the reserved size
without breaking existing data (e.g. by moving data towards the end
to leave room for the reserved area).

> > 
> > > And this ndctl is https://github.com/pmem/ndctl I presume?
> > 
> > Yes. Sorry that I forgot to attach the URL.
> > 
> > >
> > > And how is this structure reserved? Is it a seperate namespace entry?
> > 
> > No, it does not introduce any extra namespace entry. The current
> > NVDIMM driver in Linux does the reservation in the way shown by the
> > following diagram (I omit details about alignment and padding for
> > simplicity):
> > 
> >  SPA  SPA+4K
> >   |      |
> >   V      V
> >   +------+-----------+-- ... ---+-----...-----+
> >   |      | nd_pfn_sb | reserved | free to use |
> >   +------+-----------+-- ... ---+-----...-----+
> >   |<--   nd_pfn_sb.dataoff   -->|             |
> >   |    (+ necessary padding)                  |
> >   |                                           |
> >   |<------------- pmem namespace ------------>|
> > 
> > Given a pmem namespace which starts from SPA,
> 
> AAAAh, so it is at start of the namespace! Thanks
> 
> >  1) the driver stores a struct nd_pfn_sb at SPA+4K
> >  2) the reserved area is after nd_pfn_sb
> >  3) the free-to-use area is after the reserved area, and its location
> >     relative to SPA can be derived from nd_pfn_sb.dataoff
> >  4) only the free-to-use area is exposed to a block device /dev/pmemX.
> >     Access to sector N of /dev/pmemX actually goes to (SPA +
> >     nd_pfn_sb.dataoff + N * SECT_SIZE)
> >  5) nd_pfn_sb also contains a signature "NVDIMM_PFN_INFO" and a
> >     checksum. If the driver finds such signature and the checksum
> >     matches, then it knows this device contains reserved area.
> 
> /me nods.
> 
> And of course this nice diagram and such is going to be in
> a public ABI document :-)
> > 
> > > And QEMU knows not to access it?
> > 
> > QEMU as a userspace program can only access /dev/pmemX and hence has
> > no way to touch the reserved area.
> 
> Rightto.
> > 
> > > Or Xen needs to make sure _nobody_
> > > except it can access it? Which means Xen may need to know the format
> > > of the ndctl structures that are laid out in the NVDIMM region?
> > >
> > 
> > Xen hypervisor relies on dom0 driver to parse the layout.  At Dom0
> > boot, Dom0 NVDIMM driver reports address/size of area reserved for Xen
> > to Xen hypervisor, which then unmaps the reserved area from Dom0.
> 
> OK, so the /dev/pmem driver would consult this when somebody is mmaping
> the area. But since this would be removed from the driver (unregistered)
> it would report an zero size?
>

The current pmem driver in Linux needs to be modified (which I'm
doing) to understand that the reserved area is unmapped and must
never be accessed.
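
The change I have in mind boils down to a range check before the
driver maps or does I/O on any part of the namespace. A minimal
sketch (not the actual patch; the struct and helper names below are
invented for illustration):

    #include <stdbool.h>
    #include <stdint.h>

    /* Illustrative only: describes the area reserved for (and
     * unmapped by) Xen. */
    struct pmem_reservation {
        uint64_t start;   /* SPA where the reserved area begins */
        uint64_t len;     /* size of the reserved area in bytes */
    };

    /* Return true iff [addr, addr + size) does not overlap the
     * reserved area. */
    static bool pmem_access_allowed(const struct pmem_reservation *rsv,
                                    uint64_t addr, uint64_t size)
    {
        if (!rsv || rsv->len == 0)
            return true;        /* raw device: nothing is reserved */
        return addr + size <= rsv->start ||
               addr >= rsv->start + rsv->len;
    }

Every path that would touch the reserved range would then fail the
request (e.g. with -EIO) instead of accessing the unmapped area.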

> Or would it "Otherwise, it will treat the pmem namespace as a raw device and
> store page struct's in the normal RAM." - which means dom0 can still
> access the SPA (except obviously the area that is for this reserved region)?
>

Yes, a raw device means there is no reserved area and the entire
pmem namespace is free to use (i.e. the free-to-use area in the
above diagram covers the entire pmem namespace).

> ..snip..
> > > >      guests. Otherwise, Xen hypervisor will recorded the reported
> > > s/recorded/record/
> > > >      parameters and create page_info structs in the reserved area.
> > > 
> > > Ohh. You just blast it away? I guess it makes sense. Then what is the
> > > purpose of the ndctl? Just to carve out an namespace region for this?
> > >
> > 
> > ndctl is used by, for example, a system admin to reserve space on a
> > host pmem namespace. If there is already data in the namespace, ndctl
> > will give a warning message and exit as long as --force option is not
> > given. However, if --force is present, ndctl will break the existing
> > data.
> > 
> > > And what if there is something there from previous OS (say Linux)?
> > > Just blast it away? But could Linux depend on this containing some
> > > persistent information? Or does it also blast it away?
> > >
> > 
> > As above, if linux driver detects the signature "NVDIMM_PFN_INFO" and
> > a matched checksum, it will know it's safe to write to the reserved
> > area. Otherwise, it will treat the pmem namespace as a raw device and
> > store page struct's in the normal RAM.
> 
> OK, so my worry is that we will have a divergence. Which is that
> the system admin creates this under ndctl v0, boots Xen uses it.
> Then moves the NVDIMM to another machine which has ndctl v1 and
> he/she boots in Linux.
> 
> Linux gets all confused b/c the region has something it can't understand
> and the user is very angry.
> 
> So it sounds like the size the ndctl reserves MUST be baked in an ABI
> and made sure to expand if needed.
>

ndctl is a management tool which passes all its requests to the
driver via sysfs, so any compatibility issue across different
versions of Linux would actually be introduced by the different
driver versions.

Newer driver versions should remain backwards compatible with
previous ones (which is the current drivers' behavior). However,
forwards compatibility is hard to preserve, e.g.
 - an old driver without reserved area support (e.g. the one in
   Linux kernel 4.2) recognizes a pmem namespace with a reserved
   area as a raw device and may write to the reserved area. If it
   is a Xen reserved area and the driver runs in dom0, the dom0
   kernel will crash.

 - the same crash would happen if an old driver with reserved area
   support but without Xen reserved area support (e.g. the one in
   Linux kernel 4.7) is used for a pmem namespace with a Xen
   reserved area.

As for cross-OS compatibility, there is an effort to standardize
the reservation. In the meantime, only Linux is capable of handling
pmem namespaces with a reserved area.
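
For illustration, the raw-vs-reserved decision described above
boils down to a check like the following. This is only a sketch:
the struct layout and field names are assumed from the earlier
diagram rather than copied from the Linux source; only the
"NVDIMM_PFN_INFO" signature and checksum come from that
description.

    #include <stdint.h>
    #include <string.h>

    /* Assumed layout, following the earlier diagram; fields and
     * sizes are illustrative only. */
    struct nd_pfn_sb_sketch {
        char     signature[16];  /* expected: "NVDIMM_PFN_INFO" */
        uint64_t checksum;
        uint64_t dataoff;        /* offset of the free-to-use area */
        /* ... further fields omitted ... */
    };

    /* Return 1 if this driver version recognizes a reserved area on
     * the namespace, 0 if it must be treated as a raw device. */
    static int has_known_reservation(const struct nd_pfn_sb_sketch *sb,
                                     uint64_t computed_checksum)
    {
        if (memcmp(sb->signature, "NVDIMM_PFN_INFO",
                   strlen("NVDIMM_PFN_INFO")) != 0)
            return 0;
        if (sb->checksum != computed_checksum)
            return 0;
        return 1;
    }

The forwards compatibility problem above is exactly the case where
an old driver fails (or never performs) this check, falls back to
the raw-device path, and then writes over the reservation.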

> ..snip..
> > > This "balloon out" is interesting. You are effectively telling Linux
> > > to ignore a certain range of 'struct page_info', so that if somebody
> > > uses /sys/debug/kernel/page_walk it won't blow up? (As the kerne
> > > can't read the struct page_info anymore).
> > >
> > > How would you do this? Simulate an NVDIMM unplug?
> > 
> > s/page_info/page/ (struct page for linux, struct page_info for xen)
> > 
> > As in Jan's comment, "balloon out" is a confusing name here.
> > Basically, it's to remove the reserved area from some resource struct
> > in nvdimm driver to avoid it's accessed out of the driver via the
> > resource struct. And the nvdimm driver does not map the reserved area,
> > so I think it cannot be touched via page_walk.
> 
> OK, I need to read the Linux code more to make sure I am
> not missing something.
> 
> Basically the question that keeps revolving in my head is:
> 
> Why is this even neccessary?
> 
> Let me expand - it feels like (and I think I am missing something
> here) that we are crippling the Linux driver so that it won't
> break - b/c if it tried to access the 'strut page_info' in this
> reserved region it would crash. So we eliminate that, and make
> the driver believe the region exists (is reserved), but it can't
> use it. And instead use the normal RAM pages to keep track
> of the NVDIMM SPAs.
> 
> Or perhaps not keep track at all and just treat the whole
> NVDIMM as opaque MMIO that is inaccessible?
>

If we trust the driver in the dom0 kernel to always do the right
thing (and we can trust it, right?), no crash will happen. However,
as Jan commented
(https://lists.xenproject.org/archives/html/xen-devel/2016-08/msg00433.html):

| Right now Dom0 isn't allowed to access any memory in use by Xen
| (and not explicitly shared), and I don't think we should deviate
| from that model for pmem.

the Xen hypervisor must explicitly disallow dom0 from accessing the
reserved area.

> But how will that work if there is a DAX filesystem on it?
> The ext4 needs some mechanism to access the files that are there.
> (Otherwise you couldn't use the fiemap ioctl to find the SPAs).
>

No, the file system does not touch the reserved area. If a reserved
area exists, the start SPA of /dev/pmem0 reported via sysfs is the
start SPA of the reserved area, so fiemap can still work.
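
For example, a userspace tool could still derive the SPA backing a
file offset roughly as below. This is a sketch under the earlier
description that byte B of /dev/pmemX lives at SPA +
nd_pfn_sb.dataoff + B; how the caller obtains that base (e.g. from
sysfs) is deliberately left abstract here.

    #include <stdint.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>
    #include <linux/fiemap.h>

    /* Translate a byte offset in a file on a DAX fs backed by
     * /dev/pmemX into an SPA. data_base_spa is the SPA of byte 0 of
     * /dev/pmemX, i.e. SPA + nd_pfn_sb.dataoff in the diagram. */
    static uint64_t file_offset_to_spa(const char *path,
                                       uint64_t file_off,
                                       uint64_t data_base_spa)
    {
        struct {
            struct fiemap fm;
            struct fiemap_extent ext[1];
        } req;
        uint64_t spa = 0;
        int fd = open(path, O_RDONLY);

        if (fd < 0)
            return 0;

        memset(&req, 0, sizeof(req));
        req.fm.fm_start = file_off;
        req.fm.fm_length = 1;
        req.fm.fm_extent_count = 1;

        /* fe_physical is the byte offset within /dev/pmemX */
        if (ioctl(fd, FS_IOC_FIEMAP, &req.fm) == 0 &&
            req.fm.fm_mapped_extents == 1)
            spa = data_base_spa + req.ext[0].fe_physical
                  + (file_off - req.ext[0].fe_logical);

        close(fd);
        return spa;
    }

Since all of this stays within the free-to-use area, it never needs
to look at the reserved area at all.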

> [see below]
> > 
> > > 
> > > But if you do that how will SMART tools work anymore? And
> > > who would do the _DSM checks on the health of the NVDIMM?
> > >
> > 
> > A userspace SMART tool cannot access the reserved area, so I think it
> > can still work. I haven't look at the implementation of any SMART
> > tools for NVDIMM, but I guess they would finally call the driver to
> > evaluate the ARS _DSM which reports the bad blocks. As long as the
> > driver does not return the bad blocks in the reserved area to SMART
> > tools (which I suppose to be handled by driver itself), SMART tools
> > should work fine.
> > 
> > > /me scratches his head. Perhaps the answers are later in this
> > > design..
> 
> So I think I figured out the issue here!!
> 
> You just want to have the Linux kernel driver to use normal RAM
> pages to keep track of the NVDIMM SPA ranges.

Yes, this is what the current driver does for a raw device.

> As in treat the NVDIMM as if it is normal RAM?

If you are talking about the location of the page structs, then
yes. The page structs for NVDIMM are put in normal RAM just like
the page structs for normal RAM. But NVDIMM can never, for example,
be allocated via the kernel memory allocators (buddy/slab/etc.).

> 
> [Or is Linux treating this area as MMIO region (in wihch case it does not
> need struct page_info)??]
>
> And then Xen can use this reserved region for its own
> purpose!
> 
> Perhaps then the section that explains this 'reserved region' could
> say something along:
> 
> "We need to keep track of the SPAs. The guest NVDIMM 'file'
> on the NVDIMM may be in the worst case be randomly and in descending
> discontingous order (say from the end of the NVDIMM), we need
> to keep track of each of the SPAs. The reason is that we need
> the SPAs when we populate the guest EPT.
> 
> As such we can store the guest SPA in memory (linear array?)
> or red-black tree, or any other - but all of them will consume
> "normal RAM". And with sufficient large enough NVDIMM we may
> not have enough 'normal RAM' to store this.
> 
> Also we only need to know these SPAs during guest creation,
> destruction, ballooning, etc - hence we may store them on the
> NVDIMM itself. Fortunatly for us the ndctl and Linux are
> available which carve out right after the namespace region (128kb)
> and 'reserved region' which the OS can use to store its
> struct page_info to cover the full range of the NVDIMM.
> 
> The complexity in this is that:
>  - We MUST make sure Linux does not try to use it while
>    we use it.
>  - That the size of this 'reserved region' is sufficiently
>    large for our 'struct page_info' structure.
>  - The layout has an ABI baked.
>  - Linux fs'es with DAX support MUST be able mlock these SPA
>    regions (so that nobody tries to remove the 'file' while
>    a guest is using it).

I need to check whether Linux currently does this.

>  - Linus fs'es with DAX support MUST be able to resize the
>    'file', hereby using more of the SPAs and rewritting the
>    properties of the file on DAX (which should then cause an
>    memory hotplug ACPI in the guest treating the new size of
>    the file as new NFIT region?)
>

Currently my plan for the first implementation is to disallow such
resizing, and possibly other changes from outside the guest, while
the file is being used by a guest (akin to a disk). This is mostly
for simplicity, and we can add it in the future. For hotplug, we
can pass another file to the guest as a new pmem namespace.

> "
> 
> I think that covers it?
> ..snip..
> > > >  Our design takes the following method to avoid and detect collisions.
> > > >  1) The data layout of area where QEMU copies its NFIT and ACPI
> > > >     namespace devices is organized as below:
> > > 
> > > Why can't this be expressed in XenStore?
> > > 
> > > You could have 
> > > /local/domain/domid/hvmloader/dm-acpi/<name>/{address,length, type}
> > > ?
> > >
> > 
> > If XenStore can be used, then it could save some guest memory.
> 
> It is also easier than relaying on the format of a blob in memory.
> > 
> > This is a general mechanism to pass ACPI which and is not limited to
> > NVDIMM, so it means QEMU may pass a lot of entries. I'm not sure if
> > XenStore is still a proper place when the number is large. Maybe we
> > should put an upper limit for the number of entries.
> 
> Why put a limit on it? It should easily handle thousands of <name>.
> And the only attributes you have under <name> are just address,
> length and type.
>

OK, if it's not a problem, I will use XenStore to pass that
information.
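
For reference, a minimal sketch of the QEMU side with libxenstore,
following the /local/domain/<domid>/hvmloader/dm-acpi/<name>/
{address,length,type} layout suggested above (the helpers and the
exact value formatting are illustrative, not an agreed interface):

    #include <inttypes.h>
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <xenstore.h>

    static bool write_key(struct xs_handle *xsh, unsigned int domid,
                          const char *name, const char *key,
                          const char *val)
    {
        char path[256];

        snprintf(path, sizeof(path),
                 "/local/domain/%u/hvmloader/dm-acpi/%s/%s",
                 domid, name, key);
        return xs_write(xsh, XBT_NULL, path, val, strlen(val));
    }

    /* Publish one entry: its guest address, length and type. */
    static bool publish_dm_acpi_entry(struct xs_handle *xsh,
                                      unsigned int domid,
                                      const char *name,
                                      uint64_t address,
                                      uint64_t length,
                                      unsigned int type)
    {
        char buf[32];
        bool ok = true;

        snprintf(buf, sizeof(buf), "0x%" PRIx64, address);
        ok = ok && write_key(xsh, domid, name, "address", buf);

        snprintf(buf, sizeof(buf), "0x%" PRIx64, length);
        ok = ok && write_key(xsh, domid, name, "length", buf);

        snprintf(buf, sizeof(buf), "%u", type);
        ok = ok && write_key(xsh, domid, name, "type", buf);

        return ok;
    }

The caller would obtain xsh via xs_open(0) and release it with
xs_close() when done.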

> .. snip..
> > > > 4.3.2 Emulating Guest _DSM
> > > > 
> > > >  Our design leaves the emulation of guest _DSM to QEMU. Just as what
> > > >  it does with KVM, QEMU registers the _DSM buffer as MMIO region with
> > > >  Xen and then all guest evaluations of _DSM are trapped and emulated
> > > >  by QEMU.
> > > 
> > > Sweet!
> > > 
> > > So one question that I am not if it has been answered, with the
> > > 'struct page_info' being removed from the dom0 how will OEM _DSM method
> > > operation? For example some of the AML code may asking to poke
> > > at specific SPAs, but how will Linux do this properly without
> > > 'struct page_info' be available?
> > >
> > 
> > (s/page_info/page/)
> > 
> > The current Intel NVDIMM driver in Linux does not evaluate any OEM
> > _DSM method, so I'm not sure whether the kernel has to access a NVDIMM
> > page during evaluating _DSM.
> > 
> > The most close one in my mind, though not an OEM _DSM, is function 1
> > of ARS _DSM, which requires inputs of a start SPA and a length in
> > bytes. After kernel gives the inputs, the scrubbing of the specified
> > area is done by the hardware and does not requires any mappings in OS.
> 
> <nods>
> > 
> > Any example of such OEM _DSM methods?
> 
> I can't think of any right now - but that is the danger of OEMs - they
> may decide to do something .. ill advisable. Hence having it work
> the same way as Linux is what we should strive for.
> 

I see: though the evaluation itself does not use any
software-maintained mappings, the driver may use them when handling
the result of the evaluation, e.g. the ARS _DSM reports bad blocks
in the reserved area and the driver may then have to access the
reserved area (though this could never happen in the current kernel
because the driver does ARS before reservation).

Currently there is no OEM _DSM support in the Linux kernel, so I
cannot think of a concrete solution. However, if such an OEM _DSM
comes along, we may add Xen-specific handling to the driver, or
introduce a way in the nvdimm driver framework to avoid accessing
the reserved area in certain circumstances (e.g. when used in Xen
dom0).

Thanks,
Haozhong


 

