Re: [Xen-devel] How to deal with hypercalls returning -EFAULT
On 13/06/18 16:27, Juergen Gross wrote:
> Currently the release of Xen 4.11 is blocked due to a sporadic failure
> of the OSSTEST guest-saverestore[.2]. During that test a hypercall
> issued by libxc via the Linux privcmd driver returns -EFAULT in spite
> of all hypercall buffers locked in memory via mlock() (or similar flags
> specified in a mmap() call).
>
> My analysis has revealed that modern Linux kernels might make such
> locked user pages inaccessible for very short periods of time. This can
> happen e.g. when pages are subject to compaction or migration.
>
> There are multiple ways to mitigate this problem:
>
> 1. Trying to switch page migration or compaction off in dom0.
>    Pros: - no change in Xen necessary

Pro: can likely be retrofitted to existing environments without further
code changes.

(Not that I disagree with your Cons in this case)

>    Cons: - new cases might come up in the future
>          - easy to miss, failures are really very sporadic and might
>            happen only after updating the kernel
>
> 2. Add a bandaid to Xen tools by retrying hypercalls which have failed
>    with -EFAULT (either for all or only for some hypercalls)
>    Pros: - no interface change necessary
>    Cons: - not all hypercalls might be just repeatable
>          - problem isn't solved but just worked around

We'd have to whitelist hypercalls which are safe to repeat like this.
Most won't be.  Any mutating operation which -EFAULTs can't safely be
restarted, because we can't distinguish an early fault (Xen reading the
parameters) from a late fault (Xen trying to update a userspace pointer
with the result).

>
> 3. Modify the interface to the privcmd driver to pass information about
>    used buffers to the kernel in order to lock them there. Either add a
>    new interface for hypercall buffer management or add the list of
>    buffers to the privcmd ioctl data structure.
>    Pros: - problem is really solved
>    Cons: - split solution between kernel and Xen, both must be changed

To be clear, do you mean changing libxc here, rather than the
hypervisor?

Getting this problem fixed properly would be a distinct improvement over
the whack-a-mole which has been played in the past.

>
> 4. Modify the interface between hypervisor and kernel: instead of just
>    returning -EFAULT let the hypervisor behave more like copy_to_user by
>    raising a page fault which can then be fixed up in the kernel. This
>    change must be activated by the kernel, of course.
>    Pros: - rather simple change in the kernel "doing the right thing"
>          - hypercall bounce buffer handling in libxc/libxencall can be
>            switched off for a kernel supporting this change
>    Cons: - split solution between kernel and Xen, both must be changed
>          - not sure how complex the required hypervisor change will be

Sadly, as I've just realised...

Con: Cannot be used to replace all -EFAULTs.  Faults when copying data
in can be resolved by passing #PF to the kernel, but faults when trying
to update guest state (continuation, or completion information) cannot
be safely resumed at a later point.

>
> It should be noted that we can either select only one of above solutions
> or one of 3/4 and additionally one of 1/2 as a fallback for old kernels.
>
> How to proceed?

Much as I hate to say it (as I do like this idea), I don't think idea 4
is a viable alternative to 3.

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel
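
For illustration only, the whitelisting concern raised against option 2
might be sketched roughly as below. This is a minimal sketch under
stated assumptions, not libxc or privcmd code: every name in it
(demo_op, demo_op_is_repeatable, demo_call_with_retry, demo_flaky_issue)
is hypothetical, and the thread does not establish which real
hypercalls, if any, would be safe to repeat.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical operation identifiers; these are NOT real Xen hypercalls. */
enum demo_op { DEMO_OP_QUERY_INFO, DEMO_OP_SET_STATE };

/* Hypothetical issuer: returns 0 on success or a negative errno value. */
typedef int (*demo_issue_fn)(enum demo_op op, void *arg);

/*
 * Whitelist of operations assumed safe to repeat. A mutating operation
 * is excluded because an -EFAULT may have been raised after the state
 * change already happened (a "late" fault while writing the result).
 */
static bool demo_op_is_repeatable(enum demo_op op)
{
    switch (op) {
    case DEMO_OP_QUERY_INFO:
        return true;
    default:
        return false;
    }
}

/* Issue once; retry on -EFAULT only for whitelisted operations. */
static int demo_call_with_retry(demo_issue_fn issue, enum demo_op op,
                                void *arg, unsigned int max_retries)
{
    int rc = issue(op, arg);

    if (!demo_op_is_repeatable(op))
        return rc;

    for (unsigned int i = 0; rc == -EFAULT && i < max_retries; i++)
        rc = issue(op, arg);

    return rc;
}

/* Test double: fails with -EFAULT a fixed number of times, then succeeds. */
struct demo_flaky_state {
    unsigned int faults_left;
    unsigned int calls;
};

static int demo_flaky_issue(enum demo_op op, void *arg)
{
    struct demo_flaky_state *st = arg;

    (void)op;
    st->calls++;
    if (st->faults_left > 0) {
        st->faults_left--;
        return -EFAULT;
    }
    return 0;
}
```

With a state of two pending faults, demo_call_with_retry() on
DEMO_OP_QUERY_INFO is retried until it succeeds, while the same call on
DEMO_OP_SET_STATE is issued exactly once and the -EFAULT is passed back
to the caller. The bounded retry count reflects that this is a
workaround and not a fix.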