
Re: [Xen-devel] [PATCH 4/4] hvmloader: add support to load extra ACPI tables from qemu



On Wed, Jan 27, 2016 at 03:16:59AM -0700, Jan Beulich wrote:
> >>> On 26.01.16 at 20:32, <konrad.wilk@xxxxxxxxxx> wrote:
> > On Tue, Jan 26, 2016 at 09:34:13AM -0700, Jan Beulich wrote:
> >> >>> On 26.01.16 at 16:57, <haozhong.zhang@xxxxxxxxx> wrote:
> >> > On 01/26/16 08:37, Jan Beulich wrote:
> >> >> >>> On 26.01.16 at 15:44, <konrad.wilk@xxxxxxxxxx> wrote:
> >> >> >>  Last year at Linux Plumbers Conference I attended a session
> >> >> >> dedicated to NVDIMM support. I asked the very same question and the
> >> >> >> Intel guy there told me there is indeed something like a partition
> >> >> >> table meant to describe the layout of the memory areas and their
> >> >> >> contents.
> >> >> > 
> >> >> > It is described in detail at pmem.io - look at Documents, see
> >> >> > http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf, the Namespaces
> >> >> > section.
> >> >> 
> >> >> Well, that's about how PMEM and PBLK ranges get marked, but not
> >> >> about how use of the space inside a PMEM range is coordinated.
> >> >>
> >> > 
> >> > How an NVDIMM is partitioned into PMEM and PBLK is described by the
> >> > ACPI NFIT table. A namespace is to PMEM what a partition table is to
> >> > a disk.
> >> 
> >> But I'm talking about sub-dividing the space inside an individual
> >> PMEM range.
> > 
> > The namespaces are it.
> > 
> > Once you have created them, the PMEM range shows up as, say, /dev/pmem0;
> > you can then put a filesystem on it (ext4, xfs) and enable DAX support.
> > DAX just means that the FS will bypass the page cache and write directly
> > to the virtual addresses.
> > 
> > Then one can create giant 'dd' images on this filesystem and pass them
> > to QEMU to expose as an NVDIMM to the guest. Because it is a file, the
> > blocks (or MFNs) backing the contents of the file are almost certainly
> > discontiguous.
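
To make the DAX part concrete, here is a minimal userspace sketch. It is not
anything QEMU itself does - the /mnt/pmem0/guest0.img path and the sizes are
made-up examples, and it assumes the filesystem on /dev/pmem0 was mounted
with -o dax:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* Made-up path; assume an ext4/xfs filesystem mounted with -o dax. */
    int fd = open("/mnt/pmem0/guest0.img", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    size_t len = 2UL << 20;                        /* map 2MB of the image */
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    /* With DAX the stores below land on the NVDIMM pages themselves - there
     * is no page-cache copy in between ... */
    strcpy(p, "guest data");
    /* ... they just still have to be flushed out of the CPU caches; msync()
     * is the portable way to request that. */
    msync(p, len, MS_SYNC);

    munmap(p, len);
    close(fd);
    return 0;
}

QEMU would then be pointed at such a file as the backing store for the
guest-visible NVDIMM.
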
> 
> And what's the advantage of this over PBLK? I.e. why would one
> want to separate PMEM and PBLK ranges if everything gets used
> the same way anyway?

Speed. PBLK emulates block-device hardware by exposing a sliding window onto
the DIMM. The OS can only write to a ring buffer with the system address and
the payload (64 bytes, I think?) - and the hardware (or firmware) picks it up
and does the writes to the NVDIMM.
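
Very roughly, I picture that write path like the sketch below. This is only my
reading of the block-window description in the namespace spec - the struct
layout, the command-register encoding and the chunking are invented here for
illustration, not the real driver code:

#include <stdint.h>
#include <string.h>

/* One block window as the OS sees it: a command register, a status register,
 * and a small aperture that the DIMM slides over its media. */
struct blk_window {
    volatile uint64_t *cmd;       /* target DPA, transfer size, R/W bit */
    volatile uint64_t *status;    /* error bits, checked after the move */
    volatile uint8_t  *aperture;  /* the sliding window, e.g. 8KB       */
};

#define BW_CMD_WRITE (1ULL << 48) /* assumed position of the write bit  */

static int pblk_write(struct blk_window *bw, uint64_t dpa,
                      const void *buf, size_t len)
{
    /* 1. Tell the DIMM which device address we want and that this is a
     *    write; the hardware/firmware repositions the window accordingly. */
    *bw->cmd = dpa | BW_CMD_WRITE;

    /* 2. Push the payload through the aperture (the real flow moves it in
     *    small chunks - cache lines - which is the "64 bytes" above).     */
    memcpy((void *)bw->aperture, buf, len);

    /* 3. Read the status register: unlike PMEM, an error shows up here
     *    synchronously, which is what makes PBLK easy to stack under RAID. */
    return *bw->status ? -1 : 0;
}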

The only motivation behind this is to deal with errors. Normal PMEM writes
do not report errors. As in, if the media is busted the hardware will engage
its remap logic and write somewhere else - until all of its remap blocks have
been exhausted. At that point writes (I presume, not sure) and reads will
report an error - but via an MCE (machine check).

Part of this Xen design will be how to handle that :-)

With PBLK - I presume - the hardware/firmware will read back the block after
it has written it, and if there are errors it will report them right away.
Which means you can easily hook PBLK nicely into RAID setups right away. It
will be slower than PMEM, but it does give you the normal error reporting.
That is, until the MCE->OS->fs error reporting logic gets figured out.

The MCE handling code is being developed right now by Tony Luck on LKML - and
the last I saw, the MCE carries the system address, and the MCE code would tag
the pages with some bit so that the applications would get a signal.
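
From the application side I would expect it to end up looking roughly like the
sketch below - assuming the signal is SIGBUS with BUS_MCEERR_AR/AO in si_code,
as the existing memory-failure path already does for normal RAM, and with a
made-up file path:

#define _GNU_SOURCE
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static void sigbus_handler(int sig, siginfo_t *si, void *ctx)
{
    (void)sig; (void)ctx;
    /* BUS_MCEERR_AR = poison consumed by this access, BUS_MCEERR_AO = poison
     * reported asynchronously.  fprintf() is not async-signal-safe, but this
     * is only a demo. */
    if (si->si_code == BUS_MCEERR_AR || si->si_code == BUS_MCEERR_AO)
        fprintf(stderr, "uncorrectable media error at %p\n", si->si_addr);
    _exit(EXIT_FAILURE);
}

int main(void)
{
    struct sigaction sa = { 0 };
    sa.sa_sigaction = sigbus_handler;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGBUS, &sa, NULL);

    /* Made-up path; assume it lives on a DAX-mounted filesystem. */
    int fd = open("/mnt/pmem0/data", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    /* Loads/stores hit the NVDIMM directly; touching a poisoned line is what
     * turns into #MC -> kernel memory-failure handling -> SIGBUS here. */
    memcpy(p, "hello", 5);

    munmap(p, 4096);
    close(fd);
    return 0;
}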

> 
> Jan
> 

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

