
Re: Understanding osdep_xenforeignmemory_map mmap behaviour



(add Arnd to CC)

Juergen Gross <jgross@xxxxxxxx> writes:

> On 24.03.22 02:42, Stefano Stabellini wrote:
>> I am pretty sure the reasons have to do with old x86 PV guests, so I am
>> CCing Juergen and Boris.
>> 
>>> Hi,
>>>
>>> While we've been working on the rust-vmm virtio backends on Xen we
>>> obviously have to map guest memory into the userspace of the daemon.
>>> However, following the logic of what is going on is a little confusing.
>>> For example in the Linux backend we have this:
>>>
>>>    void *osdep_xenforeignmemory_map(xenforeignmemory_handle *fmem,
>>>                                     uint32_t dom, void *addr,
>>>                                     int prot, int flags, size_t num,
>>>                                     const xen_pfn_t arr[/*num*/],
>>>                                     int err[/*num*/])
>>>    {
>>>        int fd = fmem->fd;
>>>        privcmd_mmapbatch_v2_t ioctlx;
>>>        size_t i;
>>>        int rc;
>>>
>>>        addr = mmap(addr, num << XC_PAGE_SHIFT, prot, flags | MAP_SHARED,
>>>                    fd, 0);
>>>        if ( addr == MAP_FAILED )
>>>            return NULL;
>>>
>>>        ioctlx.num = num;
>>>        ioctlx.dom = dom;
>>>        ioctlx.addr = (unsigned long)addr;
>>>        ioctlx.arr = arr;
>>>        ioctlx.err = err;
>>>
>>>        rc = ioctl(fd, IOCTL_PRIVCMD_MMAPBATCH_V2, &ioctlx);
>>>
>>> Where the fd passed down is associated with the /dev/xen/privcmd device
>>> for issuing hypercalls on userspace's behalf. What is confusing is why
>>> the function does its own mmap - one would assume the passed addr would
>>> be associated with an anonymous or file-backed mmap region that the
>>> calling code has already set up. Applying an mmap to a special device
>>> seems a little odd.
>>>
>>> Looking at the implementation on the kernel side it seems the mmap
>>> handler only sets a few flags:
>>>
>>>    static int privcmd_mmap(struct file *file, struct vm_area_struct *vma)
>>>    {
>>>            /* DONTCOPY is essential for Xen because copy_page_range
>>>             * doesn't know how to recreate these mappings */
>>>            vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTCOPY |
>>>                             VM_DONTEXPAND | VM_DONTDUMP;
>>>            vma->vm_ops = &privcmd_vm_ops;
>>>            vma->vm_private_data = NULL;
>>>
>>>            return 0;
>>>    }
>>>
>>> So can I confirm that the mmap of /dev/xen/privcmd is being called for
>>> its side effects? Is it so that when the actual ioctl is called the
>>> correct flags are set on the pages associated with the user space
>>> virtual address range?
>>>
>>> Can I confirm there shouldn't be any limitation on where and how the
>>> userspace virtual address space is set up for the mapping of the guest
>>> memory?
>>>
>>> Is there a reason why this isn't done in the ioctl path itself?
>
> For a rather long time we were using "normal" user pages for this purpose,
> which were just locked into memory for doing the hypercall.

Was this using the normal mlock() semantics to stop pages being swapped
out of RAM?
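
(For my own understanding, I imagine the old scheme looked roughly like
the sketch below from the userspace side - an ordinary allocation that
is simply pinned so it stays resident while the hypervisor reads it.
The function name is made up for illustration; this isn't the real
libxc code.)

    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>

    /* Sketch only: a plain user buffer, locked so it can't be paged
     * out while the hypervisor accesses it during the hypercall. */
    static void *alloc_hypercall_buf(size_t len)
    {
        void *buf = NULL;

        if (posix_memalign(&buf, 4096, len))
            return NULL;

        memset(buf, 0, len);

        /* keep the pages resident for the duration of the hypercall */
        if (mlock(buf, len)) {
            free(buf);
            return NULL;
        }

        return buf;
    }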

> Unfortunately there have been very rare problems with that approach, as
> the Linux kernel can set a user page related PTE to invalid for short
> periods of time, which led to EFAULT in the hypervisor when trying to
> access the hypercall data.

I must admit I'm not super familiar with the internals of page table
handling with Linux+Xen. Doesn't the kernel need to delegate the
tweaking of page tables to the hypervisor or is it allowed to manipulate
the page tables itself?
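
(My mental model of the PV case, which may well be wrong: the kernel
never writes PTEs directly but builds requests like the one sketched
below and hands them to Xen via HYPERVISOR_mmu_update, so the
hypervisor always knows what the page tables look like. The struct
shape is mirrored from the public headers; the actual hypercall
plumbing is elided.)

    #include <stdint.h>

    /* Mirrors struct mmu_update from xen/include/public/xen.h */
    struct mmu_update {
        uint64_t ptr;   /* machine address of the PTE (+ command bits) */
        uint64_t val;   /* new PTE contents */
    };

    /* Sketch: build a single PTE-update request. A PV kernel would
     * pass an array of these to HYPERVISOR_mmu_update rather than
     * poking the page tables itself. */
    static struct mmu_update make_pte_update(uint64_t pte_machine_addr,
                                             uint64_t new_pte_val)
    {
        struct mmu_update req = {
            .ptr = pte_machine_addr,  /* MMU_NORMAL_PT_UPDATE is cmd 0 */
            .val = new_pte_val,
        };

        return req;
    }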

> In Linux this can be avoided only by using kernel memory, which is the
> reason why the hypercall buffers are allocated and mmap()-ed through the
> privcmd driver.
>
>>>
>>> I'm trying to understand the differences between Xen and KVM in the API
>>> choices here. I think the equivalent is the KVM_SET_USER_MEMORY_REGION
>>> ioctl for KVM, which brings a section of the guest physical address space
>>> into the userspace's vaddr range.
>
> The main difference is just that the consumer of the hypercall buffer is
> NOT the kernel, but the hypervisor. In the KVM case both are the same, so
> a brief period of an invalid PTE can be handled just fine in KVM, while
> the Xen hypervisor has no idea that this situation will be over very
> soon.
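
For reference, the KVM_SET_USER_MEMORY_REGION flow I was comparing
against looks roughly like the following (a from-memory sketch with a
made-up helper name, so treat the details as approximate): plain
anonymous memory becomes guest RAM with a single ioctl, and the kernel
- which is also the hypervisor here - tracks the mapping.

    #include <linux/kvm.h>
    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>

    /* Sketch: vm_fd comes from KVM_CREATE_VM; 'len' bytes of ordinary
     * anonymous memory are registered as guest RAM at 'gpa'. */
    static int map_guest_ram(int vm_fd, size_t len, uint64_t gpa)
    {
        void *host = mmap(NULL, len, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (host == MAP_FAILED)
            return -1;

        struct kvm_userspace_memory_region region = {
            .slot = 0,
            .guest_phys_addr = gpa,
            .memory_size = len,
            .userspace_addr = (uintptr_t)host,
        };

        return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
    }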

I still don't follow the details of why we have the separate mmap. Is it
purely because the VM flags of the special file can be changed in a way
that can't be done with a traditional file-backed mmap?

I can see various other devices do their own setting of VM flags, but
VM_DONTCOPY, for example, can be set with the appropriate madvise call
(see the sketch after the man page excerpt):

       MADV_DONTFORK (since Linux 2.6.16)
              Do not make the pages in this range available to the child after
              a fork(2).  This is useful to  prevent  copy-on-write  semantics
              from  changing  the  physical  location  of a page if the parent
              writes to it after a  fork(2).   (Such  page  relocations  cause
              problems for hardware that DMAs into the page.)
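
i.e. something like the following sketch, where 'addr'/'len' are
assumed to describe an existing mapping (and obviously this only
covers the VM_DONTCOPY-ish behaviour, not VM_IO/VM_PFNMAP etc.):

    #include <sys/mman.h>

    /* Sketch: request fork-don't-copy semantics on an existing,
     * ordinary mapping instead of relying on the driver's mmap hook
     * to set VM_DONTCOPY for us. */
    static int dont_copy_on_fork(void *addr, size_t len)
    {
        return madvise(addr, len, MADV_DONTFORK);
    }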

For the vhost-user work we need to be able to share the guest memory
between the xen-vhost-master (which is doing the ioctls to talk to Xen)
and the vhost-user daemon (which doesn't know about hypervisors but just
deals in memory and events).
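
(To illustrate what the daemon side expects: it is just handed a file
descriptor plus offset/size over the unix socket and mmap()s it, with
no idea a hypervisor is involved. The sketch below uses made-up names
rather than the actual rust-vmm code.)

    #include <stdint.h>
    #include <sys/mman.h>

    /* Sketch of the daemon's view of a shared memory region: a plain
     * fd it can mmap, nothing hypervisor specific. */
    static void *map_shared_region(int region_fd, uint64_t offset,
                                   size_t size)
    {
        void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                       MAP_SHARED, region_fd, (off_t)offset);

        return p == MAP_FAILED ? NULL : p;
    }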

Would it be enough to loosen the API and just have xen_remap_pfn()
verify that the kernel's VM flags are appropriately set before asking
Xen to update the page tables?

-- 
Alex Bennée



 

