
Re: [Xen-devel] [ARM] Bash often segfaults in Dom0 with the latest Xen



On Wed, 5 Jun 2013, Julien Grall wrote:
> On 06/05/2013 03:30 PM, Christoffer Dall wrote:
> 
> > On 5 June 2013 04:48, Julien Grall <julien.grall@xxxxxxxxxx> wrote:
> >> On 06/05/2013 02:38 AM, Christoffer Dall wrote:
> >>
> >>> On 4 June 2013 15:45, Julien Grall <julien.grall@xxxxxxxxxx> wrote:
> >>>> Hi all,
> >>>>
> >>>> For a couple of weeks I've been tracking an issue with Xen on ARM,
> >>>> with no luck.
> >>>>
> >>>> I've run out of ideas, so I'm sending this email to ask the community
> >>>> for advice.
> >>>>
> >>>> Most of the time bash aborts with random errors in dom0:
> >>>>   - page fault (data and prefetch abort)
> >>>>   - memory corruption (malloc corruption and invalid pointer)
> >>>>
> >>>> It's easy to reproduce by running ./configure in the xen tree.
> >>>>
> >>>> My environment is an arndale board:
> >>>>   - linux linaro 13.05 (using arndale_xen_dom0_defconfig and 
> >>>> exynos5250_arndale.dts)
> >>>>   - opensuse 12.03 (http://en.opensuse.org/HCL:Arndale)
> >>>>   - xen upstream
> >>>>
> >>>> The linux tree can be retrieved from 
> >>>> git://xenbits.xen.org/people/julieng/linux-arm.git
> >>>> using the branch linaro-3.10.
> >>>> This branch is based on the linaro tree with some patches for
> >>>> the dts and xen.
> >>>>
> >>>> The issue also occurs on the Versatile Express, but it's harder to
> >>>> reproduce there.
> >>>> In that case the environment is:
> >>>>   - linux linaro 13.05 (using vexpress_xen_dom0_defconfig and 
> >>>> vexpress_v2p_ca15_a7.dtb)
> >>>>   - ubuntu linaro 13.05
> >>>>   - xen upstream
> >>>>
> >>>> I have tried different distributions and Linux versions; the issue was
> >>>> the same.
> >>>> I did some testing to narrow down the bug and arrived at the following
> >>>> test case:
> >>>>
> >>>> Only dom0 is running and each VCPU is pinned to a specific CPU
> >>>> (vcpu0 -> cpu0 and vcpu1 -> cpu1).
> >>>>
> >>>> The patch below removes the WFI trap and consequently prevents a VCPU
> >>>> from moving to another physical CPU.
> >>>> =========================================
> >>>> diff --git a/xen/arch/arm/traps.c b/xen/arch/arm/traps.c
> >>>> index 6cfba1a..e89ca15 100644
> >>>> --- a/xen/arch/arm/traps.c
> >>>> +++ b/xen/arch/arm/traps.c
> >>>> @@ -62,7 +62,7 @@ void __cpuinit init_traps(void)
> >>>>      WRITE_SYSREG((vaddr_t)hyp_traps_vector, VBAR_EL2);
> >>>>
> >>>>      /* Setup hypervisor traps */
> >>>> -    WRITE_SYSREG(HCR_PTW|HCR_BSU_OUTER|HCR_AMO|HCR_IMO|HCR_VM|HCR_TWI|HCR_TSC, HCR_EL2);
> >>>> +    WRITE_SYSREG(HCR_PTW|HCR_BSU_OUTER|HCR_AMO|HCR_IMO|HCR_VM|HCR_TSC, HCR_EL2);
> >>>>      isb();
> >>>>  }
> >>>>
> >>>> =========================================
> >>>>
> >>>> If a bash process is pinned to a specific CPU with taskset, the process
> >>>> seems to always run without any issue.
> >>>>
> >>>> taskset -c 0 ./configure
> >>>>
> >>>> I guess it's a caching issue, but each time I've tried to play with the
> >>>> caching policy, Linux failed to boot.
> >>>>
> >>>> Thanks in advance for any advice.
> >>>
> >>> Some thoughts:
> >>>
> >>>  - Does dom0 run with Stage-2 translation? If so, you should be able
> >>> to disable caches in both Hyp mode and for dom0 by manipulating the
> >>> hyp registers to try and exclude caches. If Linux doesn't boot under
> >>> such configuration, something else is completely broken, as it must be
> >>> transparent to your dom0.
> >>>
> >>>  - Are you doing any swapping and/or page reclaiming? I wouldn't
> >>> assume so for dom0, but if you are, you need to maintain the icache
> >>> properly, since it can be aliasing, see
> >>> http://lxr.linux.no/linux+v3.9.4/arch/arm/kvm/mmu.c#L495 (I doubt this
> >>> is the case though)
> >>>
> >>> - All other cache accesses should be coherent across cores and are
> >>> physically indexed/physically tagged so I don't see how this could be
> >>> your issue.
> >>
> >> It was only an idea because I have noticed the memory was often corrupted.
> >>
> >>> - Do you always see the crash in user space or kernel space in dom0 or
> >>> is it all over the map?
> >>
> >>
> >> Only in user space in dom0.
> >>
> > Hmm, which kernel version is dom0 based on? Can you bisect the dom0
> > source to make sure it's not something introduced during development?
> 
> I'm using Linaro's ll_20130528.0 branch. On top of it I only have a few
> patches for the dts and some patches that are not yet in the Linaro tree.
> 
> I have the same issue with Linux 3.9-rc4 with multiple CPUs, and I can't
> really go back any further without carrying many Xen patches to try it.
> 
> I have tried different configurations for the number of CPUs in Xen
> (pCPU) and Linux (vCPU):
>   - 2 pCPU 2 vCPU : segfaulting
>   - 2 pCPU 1 vCPU : working
>   - 1 pCPU 1 vCPU : working
>   - 1 pCPU 2 vCPU : very slow but working

If you put it like that, it seems to me that the most likely
candidate is a bug in SMP support in Xen.
What happens if you have 2 pCPUs and 1 vCPU, but you keep moving the vCPU
between the two pCPUs?
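
A minimal sketch of that experiment, assuming the xl toolstack is usable
in dom0. The domain name, vCPU and pCPU numbers, round count and delay are
only illustrative, and PIN_CMD is a hypothetical override added so the loop
can be dry-run with "echo" instead of really repinning anything:

```shell
#!/bin/sh
# Sketch: bounce one vCPU of a domain between two pCPUs.
# Assumption: "xl vcpu-pin <domain> <vcpu> <cpu>" is available in dom0.
# PIN_CMD is a hypothetical knob; set it to "echo" for a dry run.
PIN_CMD=${PIN_CMD:-"xl vcpu-pin"}

# bounce_vcpu <domain> <vcpu> <cpu_a> <cpu_b> <rounds> <delay_seconds>
bounce_vcpu() {
    domain=$1
    vcpu=$2
    cpu_a=$3
    cpu_b=$4
    rounds=$5
    delay=$6
    i=0
    while [ "$i" -lt "$rounds" ]; do
        # Alternate the target pCPU on every iteration.
        if [ $((i % 2)) -eq 0 ]; then
            cpu=$cpu_a
        else
            cpu=$cpu_b
        fi
        # PIN_CMD is left unquoted on purpose so "xl vcpu-pin" splits
        # into a command and its subcommand.
        $PIN_CMD "$domain" "$vcpu" "$cpu" || return 1
        sleep "$delay"
        i=$((i + 1))
    done
}
```

Something like "bounce_vcpu Domain-0 0 0 1 1000 1 &" followed by
./configure in the xen tree would then show whether moving a single vCPU
across pCPUs is enough to trigger the segfaults.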

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

