
Re: [Xen-devel] [RFC Design Doc v2] Add vNVDIMM support for Xen



> > >  Such a pmem namespace can be created via a userspace tool ndctl and
> > >  then recognized by the Linux NVDIMM driver. However, they currently
> > >  only reserve space for the Linux kernel's page structs. Therefore, our
> > >  design needs to extend both the Linux NVDIMM driver and ndctl to
> > >  reserve an arbitrary size.
> > 
> > That seems .. fragile? What if Windows or FreeBSD want to use it
> > too?
> 
> AFAIK, the way used by current Linux NVDIMM driver for reservation has
> not been documented in any public specifications yet. I'll consult
> driver developers for more information.
> 
> > Would this 'struct page' on NVDIMM be generalized enough
> > to work with Linux, Xen, FreeBSD and what not?
> >
> 
> No. Different operating systems may choose different data structures
> to manage NVDIMM according to their own requirements and
> considerations, so it would be hard to reach an agreement on what to
> put in a generic data structure (and make it part of the ABI?).

Yes. I can see different OSes having different sizes. And then
the size of the 'reserved region' ends up being too small and only
part of the NVDIMM can be used.

> 
> > And this ndctl is https://github.com/pmem/ndctl I presume?
> 
> Yes. Sorry that I forgot to attach the URL.
> 
> >
> > And how is this structure reserved? Is it a separate namespace entry?
> 
> No, it does not introduce any extra namespace entry. The current
> NVDIMM driver in Linux does the reservation in the way shown by the
> following diagram (I omit details about alignment and padding for
> simplicity):
> 
>  SPA  SPA+4K
>   |      |
>   V      V
>   +------+-----------+-- ... ---+-----...-----+
>   |      | nd_pfn_sb | reserved | free to use |
>   +------+-----------+-- ... ---+-----...-----+
>   |<--   nd_pfn_sb.dataoff   -->|             |
>   |    (+ necessary padding)                  |
>   |                                           |
>   |<------------- pmem namespace ------------>|
> 
> Given a pmem namespace which starts from SPA,

AAAAh, so it is at the start of the namespace! Thanks

>  1) the driver stores a struct nd_pfn_sb at SPA+4K
>  2) the reserved area is after nd_pfn_sb
>  3) the free-to-use area is after the reserved area, and its location
>     relative to SPA can be derived from nd_pfn_sb.dataoff
>  4) only the free-to-use area is exposed as a block device /dev/pmemX.
>     Access to sector N of /dev/pmemX actually goes to (SPA +
>     nd_pfn_sb.dataoff + N * SECT_SIZE)
>  5) nd_pfn_sb also contains a signature "NVDIMM_PFN_INFO" and a
>     checksum. If the driver finds such a signature and the checksum
>     matches, then it knows this device contains a reserved area.

/me nods.

And of course this nice diagram and such is going to be in
a public ABI document :-)
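
To make 1)-5) concrete, the lookup logic boils down to something like the
sketch below. This is illustrative only - the real struct nd_pfn_sb in
drivers/nvdimm has more fields and a different layout, and
compute_checksum() here is a made-up helper:

  #include <stdbool.h>
  #include <stdint.h>
  #include <string.h>

  /* Rough stand-in for the superblock at SPA+4K; most fields elided. */
  struct nd_pfn_sb_sketch {
      char     signature[16];   /* "NVDIMM_PFN_INFO"                       */
      uint64_t dataoff;         /* offset of the free-to-use area from SPA */
      uint64_t checksum;
  };

  uint64_t compute_checksum(const struct nd_pfn_sb_sketch *sb); /* made up */

  /* 4) sector N of /dev/pmemX lands at this system physical address. */
  static uint64_t pmem_sector_to_spa(uint64_t spa, uint64_t sect_size,
                                     const struct nd_pfn_sb_sketch *sb,
                                     uint64_t sector)
  {
      return spa + sb->dataoff + sector * sect_size;
  }

  /* 5) the reservation is only trusted if signature and checksum match. */
  static bool pmem_has_reservation(const struct nd_pfn_sb_sketch *sb)
  {
      return !memcmp(sb->signature, "NVDIMM_PFN_INFO", 16) &&
             sb->checksum == compute_checksum(sb);
  }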
> 
> > And QEMU knows not to access it?
> 
> QEMU as a userspace program can only access /dev/pmemX and hence has
> no way to touch the reserved area.

Righto.
> 
> > Or Xen needs to make sure _nobody_
> > except it can access it? Which means Xen may need to know the format
> > of the ndctl structures that are laid out in the NVDIMM region?
> >
> 
> The Xen hypervisor relies on the dom0 driver to parse the layout. At Dom0
> boot, the Dom0 NVDIMM driver reports the address/size of the area reserved
> for Xen to the Xen hypervisor, which then unmaps the reserved area from Dom0.

OK, so the /dev/pmem driver would consult this when somebody is mmap'ing
the area. But since this would be removed from the driver (unregistered),
would it report a zero size?

Or would it take the "Otherwise, it will treat the pmem namespace as a raw
device and store page structs in normal RAM." path - which means dom0 can
still access the SPA range (except, obviously, the reserved region)?
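
(Just so I am sure we mean the same thing by "reports ... to Xen
hypervisor": I picture the handoff roughly like the sketch below. The
structure name is invented for illustration, and the hypercall itself is
whatever the design ends up defining - nothing here is the actual interface:)

  #include <stdint.h>

  /* Hypothetical payload the Dom0 NVDIMM driver would hand to Xen after
   * parsing nd_pfn_sb at boot - name made up for illustration. */
  struct xen_pmem_reservation_sketch {
      uint64_t rsv_spa;    /* start SPA of the area reserved for Xen */
      uint64_t rsv_size;   /* size in bytes of that area             */
  };

  /* 1. Dom0 driver fills in rsv_spa/rsv_size from nd_pfn_sb.
   * 2. Dom0 issues a (hypothetical) hypercall carrying the struct.
   * 3. Xen records it, builds its page_info array there, and unmaps
   *    the range from Dom0 so nothing in Dom0 can touch it anymore. */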

..snip..
> > >      guests. Otherwise, Xen hypervisor will recorded the reported
> > s/recorded/record/
> > >      parameters and create page_info structs in the reserved area.
> > 
> > Ohh. You just blast it away? I guess it makes sense. Then what is the
> > purpose of ndctl? Just to carve out a namespace region for this?
> >
> 
> ndctl is used by, for example, a system admin to reserve space on a
> host pmem namespace. If there is already data in the namespace, ndctl
> will give a warning message and exit as long as the --force option is
> not given. However, if --force is present, ndctl will destroy the
> existing data.
> 
> > And what if there is something there from a previous OS (say Linux)?
> > Just blast it away? But could Linux depend on this containing some
> > persistent information? Or does it also blast it away?
> >
> 
> As above, if the Linux driver detects the signature "NVDIMM_PFN_INFO" and
> a matching checksum, it will know it's safe to write to the reserved
> area. Otherwise, it will treat the pmem namespace as a raw device and
> store page structs in normal RAM.

OK, so my worry is that we will have a divergence. Which is that
the system admin creates this under ndctl v0 and boots Xen, which uses it.
Then he/she moves the NVDIMM to another machine which has ndctl v1 and
boots it under Linux.

Linux gets all confused b/c the region has something it can't understand
and the user is very angry.

So it sounds like the size that ndctl reserves MUST be baked into an ABI
and must be able to expand if needed.

..snip..
> > This "balloon out" is interesting. You are effectively telling Linux
> > to ignore a certain range of 'struct page_info', so that if somebody
> > uses /sys/debug/kernel/page_walk it won't blow up? (As the kernel
> > can't read the struct page_info anymore).
> >
> > How would you do this? Simulate an NVDIMM unplug?
> 
> s/page_info/page/ (struct page for Linux, struct page_info for Xen)
> 
> As in Jan's comment, "balloon out" is a confusing name here.
> Basically, it's to remove the reserved area from some resource struct
> in the NVDIMM driver, to prevent it from being accessed outside the
> driver via that resource struct. And the NVDIMM driver does not map the
> reserved area, so I think it cannot be touched via page_walk.

OK, I need to read the Linux code more to make sure I am
not missing something.

Basically the question that keeps revolving in my head is:

Why is this even necessary?

Let me expand - it feels like (and I think I am missing something
here) that we are crippling the Linux driver so that it won't
break - b/c if it tried to access the 'struct page_info' in this
reserved region it would crash. So we eliminate that, and make
the driver believe the region exists (is reserved), but it can't
use it. And instead use the normal RAM pages to keep track
of the NVDIMM SPAs.

Or perhaps not keep track at all and just treat the whole
NVDIMM as opaque MMIO that is inaccessible?

But how will that work if there is a DAX filesystem on it?
ext4 needs some mechanism to access the files that are there.
(Otherwise you couldn't use the fiemap ioctl to find the SPAs.)
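
(For reference, getting at those extents from userspace today is roughly a
fiemap call - a minimal sketch, error handling mostly omitted; the physical
offsets it returns are relative to /dev/pmemX, so the SPA would still be
the namespace start plus nd_pfn_sb.dataoff plus that offset:)

  #include <stdio.h>
  #include <stdlib.h>
  #include <fcntl.h>
  #include <unistd.h>
  #include <sys/ioctl.h>
  #include <linux/fs.h>
  #include <linux/fiemap.h>

  int main(int argc, char **argv)
  {
      /* Room for up to 32 extents of the file given on the command line. */
      struct fiemap *fm = calloc(1, sizeof(*fm) +
                                 32 * sizeof(struct fiemap_extent));
      int fd = open(argv[1], O_RDONLY);

      if (!fm || fd < 0)
          return 1;
      fm->fm_length = ~0ULL;        /* map the whole file */
      fm->fm_extent_count = 32;
      if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0)
          return 1;

      for (unsigned i = 0; i < fm->fm_mapped_extents; i++)
          printf("logical %llu -> physical %llu, len %llu\n",
                 (unsigned long long)fm->fm_extents[i].fe_logical,
                 (unsigned long long)fm->fm_extents[i].fe_physical,
                 (unsigned long long)fm->fm_extents[i].fe_length);
      close(fd);
      free(fm);
      return 0;
  }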

[see below]
> 
> > 
> > But if you do that how will SMART tools work anymore? And
> > who would do the _DSM checks on the health of the NVDIMM?
> >
> 
> A userspace SMART tool cannot access the reserved area, so I think it
> can still work. I haven't looked at the implementation of any SMART
> tools for NVDIMM, but I guess they would ultimately call the driver to
> evaluate the ARS _DSM, which reports the bad blocks. As long as the
> driver does not return bad blocks in the reserved area to SMART
> tools (which I suppose is handled by the driver itself), SMART tools
> should work fine.
> 
> > /me scratches his head. Perhaps the answers are later in this
> > design..

So I think I figured out the issue here!!

You just want the Linux kernel driver to use normal RAM
pages to keep track of the NVDIMM SPA ranges. As in, treat
the NVDIMM as if it is normal RAM?

[Or is Linux treating this area as an MMIO region (in which case it does
not need struct page_info)??]

And then Xen can use this reserved region for its own
purpose!

Perhaps then the section that explains this 'reserved region' could
say something along the lines of:

"We need to keep track of the SPAs. The guest NVDIMM 'file'
on the NVDIMM may be in the worst case be randomly and in descending
discontingous order (say from the end of the NVDIMM), we need
to keep track of each of the SPAs. The reason is that we need
the SPAs when we populate the guest EPT.

As such we can store the guest SPA in memory (linear array?)
or red-black tree, or any other - but all of them will consume
"normal RAM". And with sufficient large enough NVDIMM we may
not have enough 'normal RAM' to store this.

Also we only need to know these SPAs during guest creation,
destruction, ballooning, etc - hence we may store them on the
NVDIMM itself. Fortunatly for us the ndctl and Linux are
available which carve out right after the namespace region (128kb)
and 'reserved region' which the OS can use to store its
struct page_info to cover the full range of the NVDIMM.

The complexity in this is that:
 - We MUST make sure Linux does not try to use it while
   we use it.
 - That the size of this 'reserved region' is sufficiently
   large for our 'struct page_info' structure.
 - The layout has an ABI baked.
 - Linux fs'es with DAX support MUST be able mlock these SPA
   regions (so that nobody tries to remove the 'file' while
   a guest is using it).
 - Linus fs'es with DAX support MUST be able to resize the
   'file', hereby using more of the SPAs and rewritting the
   properties of the file on DAX (which should then cause an
   memory hotplug ACPI in the guest treating the new size of
   the file as new NFIT region?)

"

I think that covers it?
..snip..
> > >  Our design uses the following method to avoid and detect collisions.
> > >  1) The data layout of the area where QEMU copies its NFIT and ACPI
> > >     namespace devices is organized as below:
> > 
> > Why can't this be expressed in XenStore?
> > 
> > You could have 
> > /local/domain/domid/hvmloader/dm-acpi/<name>/{address,length, type}
> > ?
> >
> 
> If XenStore can be used, then it could save some guest memory.

It is also easier than relying on the format of a blob in memory.
> 
> This is a general mechanism to pass ACPI and is not limited to
> NVDIMM, so it means QEMU may pass a lot of entries. I'm not sure
> XenStore is still a proper place when the number is large. Maybe we
> should put an upper limit on the number of entries.

Why put a limit on it? It should easily handle thousands of <name> entries.
And the only attributes you have under <name> are just address,
length and type.
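
Something along these lines (the <name> keys and values below are made up,
just to show the shape of it):

  /local/domain/3/hvmloader/dm-acpi/nvdimm0/address = "0xfc000000"
  /local/domain/3/hvmloader/dm-acpi/nvdimm0/length  = "0x1000"
  /local/domain/3/hvmloader/dm-acpi/nvdimm0/type    = "nfit"
  /local/domain/3/hvmloader/dm-acpi/nvdimm0-dsm/address = "0xfc001000"
  /local/domain/3/hvmloader/dm-acpi/nvdimm0-dsm/length  = "0x200"
  /local/domain/3/hvmloader/dm-acpi/nvdimm0-dsm/type    = "aml"

and hvmloader just walks the dm-acpi/ directory and copies each blob in.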

.. snip..
> > > 4.3.2 Emulating Guest _DSM
> > > 
> > >  Our design leaves the emulation of guest _DSM to QEMU. Just as
> > >  it does with KVM, QEMU registers the _DSM buffer as an MMIO region
> > >  with Xen and then all guest evaluations of _DSM are trapped and
> > >  emulated by QEMU.
> > 
> > Sweet!
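
For my own understanding, I presume the QEMU<->Xen side of that boils down
to claiming the _DSM buffer's range with the ioreq-server machinery,
something like the sketch below (using the libxendevicemodel calls as I
recall them; the actual patches may well use the older libxenctrl wrappers):

  #include <stdint.h>
  #include <xendevicemodel.h>

  /* Make guest accesses to [dsm_start, dsm_start + dsm_len) trap to the
   * ioreq server owned by the device model, which then emulates the
   * _DSM evaluation. */
  static int claim_dsm_buffer(domid_t domid, ioservid_t ioservid,
                              uint64_t dsm_start, uint64_t dsm_len)
  {
      xendevicemodel_handle *dmod = xendevicemodel_open(NULL, 0);
      int rc;

      if (!dmod)
          return -1;
      rc = xendevicemodel_map_io_range_to_ioreq_server(dmod, domid,
                                                       ioservid,
                                                       1 /* MMIO */,
                                                       dsm_start,
                                                       dsm_start + dsm_len - 1);
      xendevicemodel_close(dmod);
      return rc;
  }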
> > 
> > So one question that I am not sure has been answered: with the
> > 'struct page_info' being removed from dom0, how will OEM _DSM methods
> > operate? For example some of the AML code may be asking to poke
> > at specific SPAs, but how will Linux do this properly without
> > 'struct page_info' being available?
> >
> 
> (s/page_info/page/)
> 
> The current Intel NVDIMM driver in Linux does not evaluate any OEM
> _DSM method, so I'm not sure whether the kernel has to access an NVDIMM
> page while evaluating a _DSM.
> 
> The closest one in my mind, though not an OEM _DSM, is function 1
> of the ARS _DSM, which requires inputs of a start SPA and a length in
> bytes. After the kernel provides the inputs, the scrubbing of the
> specified area is done by the hardware and does not require any
> mappings in the OS.

<nods>
> 
> Any example of such OEM _DSM methods?

I can't think of any right now - but that is the danger of OEMs - they
may decide to do something .. ill-advised. Hence having it work
the same way as Linux is what we should strive for.


> 
> Thanks,
> Haozhong
