
Re: Understanding osdep_xenforeignmemory_map mmap behaviour



(add Arnd to CC)

Juergen Gross <jgross@xxxxxxxx> writes:

> On 24.03.22 02:42, Stefano Stabellini wrote:
>> I am pretty sure the reasons have to do with old x86 PV guests, so I am
>> CCing Juergen and Boris.
>> 
>>> Hi,
>>>
>>> While we've been working on the rust-vmm virtio backends on Xen we
>>> obviously have to map guest memory into the userspace of the daemon.
>>> However, following the logic of what is going on is a little confusing.
>>> For example in the Linux backend we have this:
>>>
>>>    void *osdep_xenforeignmemory_map(xenforeignmemory_handle *fmem,
>>>                                     uint32_t dom, void *addr,
>>>                                     int prot, int flags, size_t num,
>>>                                     const xen_pfn_t arr[/*num*/],
>>>                                     int err[/*num*/])
>>>    {
>>>        int fd = fmem->fd;
>>>        privcmd_mmapbatch_v2_t ioctlx;
>>>        size_t i;
>>>        int rc;
>>>
>>>        addr = mmap(addr, num << XC_PAGE_SHIFT, prot, flags | MAP_SHARED,
>>>                    fd, 0);
>>>        if ( addr == MAP_FAILED )
>>>            return NULL;
>>>
>>>        ioctlx.num = num;
>>>        ioctlx.dom = dom;
>>>        ioctlx.addr = (unsigned long)addr;
>>>        ioctlx.arr = arr;
>>>        ioctlx.err = err;
>>>
>>>        rc = ioctl(fd, IOCTL_PRIVCMD_MMAPBATCH_V2, &ioctlx);
>>>
>>> Where the fd passed down is associated with the /dev/xen/privcmd device
>>> for issuing hypercalls on userspace's behalf. What is confusing is why
>>> the function does its own mmap - one would assume the passed addr would
>>> be associated with an anonymous or file-backed mmap region that the
>>> calling code has already set up. Applying an mmap to a special device
>>> seems a little odd.
>>>
>>> Looking at the implementation on the kernel side it seems the mmap
>>> handler only sets a few flags:
>>>
>>>    static int privcmd_mmap(struct file *file, struct vm_area_struct *vma)
>>>    {
>>>            /* DONTCOPY is essential for Xen because copy_page_range
>>>             * doesn't know how to recreate these mappings */
>>>            vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTCOPY |
>>>                             VM_DONTEXPAND | VM_DONTDUMP;
>>>            vma->vm_ops = &privcmd_vm_ops;
>>>            vma->vm_private_data = NULL;
>>>
>>>            return 0;
>>>    }
>>>
>>> So can I confirm that the mmap of /dev/xen/privcmd is being called for
>>> its side effects? Is it so that when the actual ioctl is called the
>>> correct flags are set on the pages associated with the user space
>>> virtual address range?
>>>
>>> Can I confirm there shouldn't be any limitation on where and how the
>>> userspace virtual address space is set up for the mapping of the guest
>>> memory?
>>>
>>> Is there a reason why this isn't done in the ioctl path itself?
>
> For a rather long time we were using "normal" user pages for this purpose,
> which were just locked into memory for doing the hypercall.

Was this using the normal mlock() semantics to stop pages being swapped
out of RAM?
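
(For my own understanding, I imagine the old scheme looked roughly like
the sketch below from the userspace side - an ordinary allocation that
is simply pinned so it stays resident while the hypervisor reads it.
The function name is made up for illustration; this isn't the real
libxc code.)

    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>

    /* Sketch only: a plain user buffer, locked so it can't be paged
     * out while the hypervisor accesses it during the hypercall. */
    static void *alloc_hypercall_buf(size_t len)
    {
        void *buf = NULL;

        if (posix_memalign(&buf, 4096, len))
            return NULL;

        memset(buf, 0, len);

        /* keep the pages resident for the duration of the hypercall */
        if (mlock(buf, len)) {
            free(buf);
            return NULL;
        }

        return buf;
    }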

> Unfortunately there have been very rare problems with that approach, as
> the Linux kernel can set a user page related PTE to invalid for short
> periods of time, which led to EFAULT in the hypervisor when trying to
> access the hypercall data.

I must admit I'm not super familiar with the internals of page table
handling with Linux+Xen. Doesn't the kernel need to delegate the
tweaking of page tables to the hypervisor or is it allowed to manipulate
the page tables itself?
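
(My mental model of the PV case, which may well be wrong: the kernel
never writes PTEs directly but builds requests like the one sketched
below and hands them to Xen via HYPERVISOR_mmu_update, so the
hypervisor always knows what the page tables look like. The struct
shape is mirrored from the public headers; the actual hypercall
plumbing is elided.)

    #include <stdint.h>

    /* Mirrors struct mmu_update from xen/include/public/xen.h */
    struct mmu_update {
        uint64_t ptr;   /* machine address of the PTE (+ command bits) */
        uint64_t val;   /* new PTE contents */
    };

    /* Sketch: build a single PTE-update request. A PV kernel would
     * pass an array of these to HYPERVISOR_mmu_update rather than
     * poking the page tables itself. */
    static struct mmu_update make_pte_update(uint64_t pte_machine_addr,
                                             uint64_t new_pte_val)
    {
        struct mmu_update req = {
            .ptr = pte_machine_addr,  /* MMU_NORMAL_PT_UPDATE is cmd 0 */
            .val = new_pte_val,
        };

        return req;
    }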

> In Linux this can be avoided only by using kernel memory, which is the
> reason why the hypercall buffers are allocated and mmap()-ed through the
> privcmd driver.
>
>>>
>>> I'm trying to understand the differences between Xen and KVM in the API
>>> choices here. I think the equivalent is the KVM_SET_USER_MEMORY_REGION
>>> ioctl for KVM, which brings a section of the guest physical address space
>>> into the userspace's vaddr range.
>
> The main difference is just that the consumer of the hypercall buffer is
> NOT the kernel, but the hypervisor. In the KVM case both are the same, so
> a brief period of an invalid PTE can be handled just fine in KVM, while
> the Xen hypervisor has no idea that this situation will be over very
> soon.
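
For reference, the KVM_SET_USER_MEMORY_REGION flow I was comparing
against looks roughly like the following (a from-memory sketch with a
made-up helper name, so treat the details as approximate): plain
anonymous memory becomes guest RAM with a single ioctl, and the kernel
- which is also the hypervisor here - tracks the mapping.

    #include <linux/kvm.h>
    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>

    /* Sketch: vm_fd comes from KVM_CREATE_VM; 'len' bytes of ordinary
     * anonymous memory are registered as guest RAM at 'gpa'. */
    static int map_guest_ram(int vm_fd, size_t len, uint64_t gpa)
    {
        void *host = mmap(NULL, len, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (host == MAP_FAILED)
            return -1;

        struct kvm_userspace_memory_region region = {
            .slot = 0,
            .guest_phys_addr = gpa,
            .memory_size = len,
            .userspace_addr = (uintptr_t)host,
        };

        return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
    }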

I still don't follow the details of why we have the separate mmap. Is it
purely because the VM flags of the special file can be changed in a way
that can't be done with a traditional file-backed mmap?

I can see various other devices do their own setting of VM flags, but
VM_DONTCOPY, for example, can be set with the appropriate madvise call
(see the sketch after the man page excerpt):

       MADV_DONTFORK (since Linux 2.6.16)
              Do not make the pages in this range available to the child after
              a fork(2).  This is useful to  prevent  copy-on-write  semantics
              from  changing  the  physical  location  of a page if the parent
              writes to it after a  fork(2).   (Such  page  relocations  cause
              problems for hardware that DMAs into the page.)
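
i.e. something like the following sketch, where 'addr'/'len' are
assumed to describe an existing mapping (and obviously this only
covers the VM_DONTCOPY-ish behaviour, not VM_IO/VM_PFNMAP etc.):

    #include <sys/mman.h>

    /* Sketch: request fork-don't-copy semantics on an existing,
     * ordinary mapping instead of relying on the driver's mmap hook
     * to set VM_DONTCOPY for us. */
    static int dont_copy_on_fork(void *addr, size_t len)
    {
        return madvise(addr, len, MADV_DONTFORK);
    }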

For the vhost-user work we need to be able to share the guest memory
between the xen-vhost-master (which is doing the ioctls to talk to Xen)
and the vhost-user daemon (which doesn't know about hypervisors but just
deals in memory and events).
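
(To illustrate what the daemon side expects: it is just handed a file
descriptor plus offset/size over the unix socket and mmap()s it, with
no idea a hypervisor is involved. The sketch below uses made-up names
rather than the actual rust-vmm code.)

    #include <stdint.h>
    #include <sys/mman.h>

    /* Sketch of the daemon's view of a shared memory region: a plain
     * fd it can mmap, nothing hypervisor specific. */
    static void *map_shared_region(int region_fd, uint64_t offset,
                                   size_t size)
    {
        void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                       MAP_SHARED, region_fd, (off_t)offset);

        return p == MAP_FAILED ? NULL : p;
    }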

Would it be enough to loosen the API and just have xen_remap_pfn()
verify that the kernel's VM flags are appropriately set before asking
Xen to update the page tables?

-- 
Alex Bennée



 

