Re: [Xen-devel] [RFC Design Doc v2] Add vNVDIMM support for Xen
On 08/02/16 08:46, Jan Beulich wrote:
> >>> On 18.07.16 at 02:29, <haozhong.zhang@xxxxxxxxx> wrote:
> > 4.2.2 Detection of Host pmem Devices
> >
> > The detection and initialization of host pmem devices requires a
> > non-trivial driver to interact with the corresponding ACPI namespace
> > devices, parse namespace labels and take necessary recovery actions.
> > Instead of duplicating the comprehensive Linux pmem driver in the Xen
> > hypervisor, our design leaves this work to Dom0 Linux and lets Dom0
> > Linux report detected host pmem devices to the Xen hypervisor.
> >
> > Our design takes the following steps to detect host pmem devices when
> > Xen boots.
> > (1) As when booting on bare metal, host pmem devices are detected by
> > the Dom0 Linux NVDIMM driver.
> >
> > (2) Our design extends the Linux NVDIMM driver to report the SPAs and
> > sizes of the pmem devices and their reserved areas to the Xen
> > hypervisor via a new hypercall.
> >
> > (3) Xen hypervisor then checks
> > - whether the SPA range of the newly reported pmem device overlaps
> > with any previously reported pmem device;
>
> ... or with system RAM.
>
> > - whether the reserved area fits in the pmem device and is
> > large enough to hold the page_info structs for the device itself.
>
> So "reserved" here means available for Xen's use, but not for more
> general purposes? How would the area Linux uses for its own
> purposes get represented?
>
Reserved for Xen only. I was going to reuse the existing reservation
mechanism in the Linux pmem driver to allow reserving two areas - one
for Xen and another for Linux itself. However, I later realized that
the existing mechanism depends on huge page support, so it does not
work in Dom0. For the first implementation, I'm taking a different
approach that reserves an area only for Xen and lets Dom0 Linux put
the page structs for pmem in normal RAM. Afterwards, I'll look for a
way to allow both.
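
Just to illustrate what the new hypercall might carry (a sketch only;
the structure and names below are hypothetical, not the final
interface):

struct xen_pmem_region {         /* hypothetical argument layout */
    uint64_t spa;                /* start SPA of the pmem device */
    uint64_t size;               /* size of the device in bytes */
    uint64_t rsv_spa;            /* start SPA of the Xen-reserved area */
    uint64_t rsv_size;           /* size of the reserved area */
};

/*
 * The Xen-side handler would then, roughly:
 *  - reject the region if [spa, spa + size) overlaps system RAM or a
 *    previously reported pmem device;
 *  - reject it if [rsv_spa, rsv_spa + rsv_size) does not fall inside
 *    the device or cannot hold the page_info structs for the whole
 *    device;
 *  - otherwise record the device and use the reserved area for its
 *    frame table.
 */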
> > (4) Because the reserved area is now used by Xen hypervisor, it
> > should not be accessible by Dom0 any more. Therefore, if a host
> > pmem device is recorded by Xen hypervisor, Xen will unmap its
> > reserved area from Dom0. Our design also needs to extend Linux
> > NVDIMM driver to "balloon out" the reserved area after it
> > successfully reports a pmem device to Xen hypervisor.
>
> ... "balloon out" ... _after_? That'd be unsafe.
>
Before ballooning is done, the pmem driver does not create any device
node under /dev/, and hence nothing except the pmem driver itself can
access the reserved area on pmem, so I think it's okay to balloon out
after reporting.
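
To make the intended ordering concrete, the Dom0 probe path would look
roughly like the sketch below (the pmem_* helpers are placeholder
names, not existing functions; only the ordering matters):

static int pmem_probe_sketch(struct device *dev)
{
    int rc;

    /* 1. Report SPA/size and the reserved area to Xen. */
    rc = pmem_report_to_xen(dev);
    if (rc)
        return rc;

    /*
     * 2. Balloon the reserved area out of Dom0.  No /dev/pmemN node
     *    exists yet, so nothing else can touch the reserved area
     *    between steps 1 and 2.
     */
    rc = pmem_balloon_out_reserved(dev);
    if (rc)
        return rc;

    /* 3. Only now create the block device node under /dev/. */
    return pmem_create_blkdev(dev);
}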
> > 4.2.3 Get Host Machine Address (SPA) of Host pmem Files
> >
> > Before a pmem file is assigned to a domain, we need to know the host
> > SPA ranges that are allocated to this file. We do this work in xl.
> >
> > If a pmem device /dev/pmem0 is given, xl will read
> > /sys/block/pmem0/device/{resource,size} respectively for the start
> > SPA and size of the pmem device.
> >
> > If a pre-allocated file /mnt/dax/file is given,
> > (1) xl first finds the host pmem device where /mnt/dax/file is. Then
> > it uses the method above to get the start SPA of the host pmem
> > device.
> > (2) xl then uses the fiemap ioctl to get the extent mappings of
> > /mnt/dax/file, and adds the physical offset and length of each
> > mapping entry to the above start SPA to get the SPA ranges
> > pre-allocated for this file.
>
> Remind me again: These extents never change, not even across
> reboot? I think this would be good to be written down here explicitly.
Yes, these extents never change, not even across reboot.
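
For illustration, the xl-side lookup could be sketched as below: read
the device's base SPA from /sys/block/pmemN/device/resource, then add
each FIEMAP physical extent to it (a sketch only, not the actual xl
code):

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>        /* FS_IOC_FIEMAP */
#include <linux/fiemap.h>    /* struct fiemap, struct fiemap_extent */

/* dev_spa: base SPA read from /sys/block/pmemN/device/resource */
static void print_spa_ranges(const char *path, unsigned long long dev_spa)
{
    int fd = open(path, O_RDONLY);
    struct fiemap *fm;
    unsigned int i;

    if (fd < 0)
        return;

    /* Room for up to 64 extents; a real implementation would loop. */
    fm = calloc(1, sizeof(*fm) + 64 * sizeof(struct fiemap_extent));
    fm->fm_start = 0;
    fm->fm_length = ~0ULL;      /* map the whole file */
    fm->fm_extent_count = 64;

    if (ioctl(fd, FS_IOC_FIEMAP, fm) == 0)
        for (i = 0; i < fm->fm_mapped_extents; i++)
            printf("SPA range: 0x%llx - 0x%llx\n",
                   dev_spa + (unsigned long long)fm->fm_extents[i].fe_physical,
                   dev_spa + (unsigned long long)fm->fm_extents[i].fe_physical +
                   (unsigned long long)fm->fm_extents[i].fe_length);

    free(fm);
    close(fd);
}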
> Hadn't there been talk of using labels to be able to allow a guest to
> own the exact same physical range again after reboot or guest or
> host?
>
You mean the labels in the NVDIMM label storage area? As defined in the
Intel NVDIMM Namespace Specification, labels are used to specify
namespaces. For a pmem interleave set (which may span several DIMMs),
at most one pmem namespace (and hence at most one label) is
allowed. Therefore, labels cannot be used to partition pmem.
> > 3) When hvmloader loads a type 0 entry, it extracts the signature
> > from the data blob and searches for it in builtin_table_sigs[]. If
> > it is found, hvmloader will report an error and stop. Otherwise,
> > it will append the table to the end of the loaded guest ACPI.
>
> Duplicate table names aren't generally collisions: There can, for
> example, be many tables named "SSDT".
>
I'll exclude SSDT from the duplication check.
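Something like the following in hvmloader (a sketch; only
builtin_table_sigs[] comes from the design, the other names are
illustrative):

static int is_sig_conflict(const char *sig)
{
    unsigned int i;

    /* Multiple SSDTs are legitimate, so never treat "SSDT" as a duplicate. */
    if (!memcmp(sig, "SSDT", 4))
        return 0;

    for (i = 0; i < ARRAY_SIZE(builtin_table_sigs); i++)
        if (!memcmp(sig, builtin_table_sigs[i], 4))
            return 1;   /* collides with a table built by hvmloader */

    return 0;
}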
> > 4) When hvmloader loads a type 1 entry, it extracts the device name
> > from the data blob and searches for it in builtin_nd_names[]. If
> > it is found, hvmloader will report an error and stop. Otherwise,
> > it will wrap the AML code snippet in "Device (name[4]) {...}" and
> > include it in a new SSDT which is then appended to the end of the
> > loaded guest ACPI.
>
> But all of these could go into a single SSDT, instead of (as it sounds)
> each into its own one?
>
Yes, I meant to put them all in one SSDT.
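Roughly like this (a sketch only; the entry layout and helpers such as
wrap_in_device_scope() are hypothetical):

/* Collect every type 1 entry into one SSDT. */
static void build_nvdimm_ssdt(const struct pass_entry *e, unsigned int nr)
{
    unsigned int i, len = 0;

    for (i = 0; i < nr; i++) {
        if (e[i].type != 1)
            continue;
        if (name_in_builtin_nd_names(e[i].name))
            error_and_stop();
        /* Wrap the AML snippet in "Device (NAME) { ... }" and append it. */
        len += wrap_in_device_scope(ssdt_body + len, e[i].name,
                                    e[i].aml, e[i].aml_len);
    }

    /* One SSDT header covers all the device blocks; the whole table is
     * then appended to the loaded guest ACPI. */
    append_table("SSDT", ssdt_body, len);
}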
Thanks,
Haozhong
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel