I saw some intermittent, weird behavior on the latest
xen-ia64-unstable.hg (Rev 6461): sometimes I can log in to the xenU
shell, sometimes boot hangs after "Mounting root fs...", and sometimes
the whole system breaks as follows:
(XEN) ia64_fault: General Exception: IA-64 Reserved Register/Field fault (data access): reflecting
(XEN) $$$$$ PANIC in domain 1 (k6=f000000007fd8000): psr.ic off, delivering fault=5300,ipsr=0000121208026010,iip=a00000010000cd00,ifa=f000000007fdfd60,isr=00000a0c00000004,PSCB.iip*** ADD REGISTER DUMP HERE FOR DEBUGGING
(XEN) BUG at domain.c:311
(XEN) priv_emulate: priv_handle_op fails, isr=0000000000000000
(XEN)
Finally I found the root cause: match_dtlb should return the guest pte
instead of the machine pte, because translate_machine_pte is always
invoked after vcpu_translate. translate_machine_pte expects a guest
pte and walks the 3-level tables to get the machine frame number.
Why does this happen so rarely?
- For xen0, guest pfn == machine pfn, so nothing goes wrong.
- For xenU, there is currently only one vtlb entry, which caches the
most recently inserted TC entry. Say the current vtlb entry for VA1 has
been inserted into the machine TLB. Normally many itc instructions are
issued before the machine TC entry for VA1 is purged, and each
insertion overwrites the single vtlb entry. So in 99.99% of cases, once
the guest va is purged from the machine TLB/vhpt and triggers a TLB
miss again, match_dtlb fails.
But there is also a corner case where the vtlb entry has not been
updated yet the machine TC entry for VA1 has already been purged. In
this case, an immediate access to VA1 makes match_dtlb return true,
and the problematic code then does the damage.
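The corner case can be reproduced in a small toy model. This is only a
sketch under simplifying assumptions: every name in it (toy_vtlb,
toy_itc, toy_lookup, toy_handle_miss, toy_p2m) is hypothetical and not
taken from the Xen source, a "pte" is reduced to a bare frame number,
and the 3-level table walk is replaced by an array lookup.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model only: every name here is hypothetical, not Xen's. */
#define TOY_GUEST_PAGES 4

/* Guest pfn -> machine pfn for a xenU-like domain (the two differ). */
static const uint64_t toy_p2m[TOY_GUEST_PAGES] = {
    0x7f170, 0x7f171, 0x7f172, 0x7f173
};

/* The single cached translation, standing in for the 1-entry vtlb. */
struct toy_vtlb {
    int      valid;
    uint64_t vpn;          /* virtual page number of the cached entry */
    uint64_t guest_pfn;    /* frame number as the guest inserted it */
    uint64_t machine_pfn;  /* frame number placed in the machine TLB */
};

/* Stand-in for the table walk: accepts only guest pfns. */
static int toy_guest_to_machine(uint64_t gpfn, uint64_t *mfn)
{
    if (gpfn >= TOY_GUEST_PAGES)
        return -1;         /* "bad mpa": not a valid guest frame */
    *mfn = toy_p2m[gpfn];
    return 0;
}

/* Guest inserts a translation; it overwrites the single entry. */
static void toy_itc(struct toy_vtlb *t, uint64_t vpn, uint64_t gpfn)
{
    uint64_t mfn = 0;
    int rc = toy_guest_to_machine(gpfn, &mfn);

    assert(rc == 0);       /* the toy only inserts valid guest frames */
    (void)rc;
    t->valid = 1;
    t->vpn = vpn;
    t->guest_pfn = gpfn;
    t->machine_pfn = mfn;
}

/* Lookup in the single entry. With buggy != 0 it hands back the
 * machine frame, which is the behavior the patch removes. */
static int toy_lookup(const struct toy_vtlb *t, uint64_t vpn, int buggy,
                      uint64_t *pfn)
{
    if (!t->valid || t->vpn != vpn)
        return -1;         /* miss: the common case */
    *pfn = buggy ? t->machine_pfn : t->guest_pfn;
    return 0;
}

/* Miss path: whatever the lookup returns is translated once more. */
static int toy_handle_miss(const struct toy_vtlb *t, uint64_t vpn,
                           int buggy, uint64_t *mfn)
{
    uint64_t pfn;

    if (toy_lookup(t, vpn, buggy, &pfn) != 0)
        return -1;
    return toy_guest_to_machine(pfn, mfn);
}
```

After toy_itc(&t, 5, 2), calling toy_handle_miss(&t, 5, 0, &mfn)
succeeds, while the same call with buggy set to 1 fails because a
machine frame is fed to a routine that expects a guest frame. With an
identity toy_p2m, as for xen0, both variants would give the same
result, which is why the problem never shows there.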
For example, sometimes I saw:
(XEN) translate_domain_pte: bad mpa=000000007f170080 (> 0000000010004000),vadr=5fffff0000000080,pteval=000000007f170561,itir=0000000000000038
(XEN) lookup_domain_mpa: bad mpa 000000007f170080 (> 0000000010004000
The access above happens when vcpu_translate tries to access the guest
SVHPT. You can see that 0x7f170080 is actually a machine pfn. When
0x7f170080 is passed into translate_machine_pte, the warning is shown
and the address is finally mapped to machine pfn 0. (Maybe we should
change this error condition to a panic instead of returning an
incorrect pfn.)
Then things went completely wrong:
(XEN) translate_domain_pte: bad mpa=0000eef3f000e738 (> 0000000010004000),vadr=4000000000042738,pteval=f000eef3f000eef3,itir=0000000000026238
(XEN) lookup_domain_mpa: bad mpa 0000eef3f000e738 (> 0000000010004000
And finally the GP fault happens. This bug has actually been hiding for
a long time, but was seldom triggered.
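The suggestion above, to stop on a bad mpa instead of quietly returning
frame 0, might look roughly like the following. This is a sketch, not
the real lookup_domain_mpa: the function names, the TOY_INVALID_MPA
sentinel, and the constant-offset "translation" are all made up for
illustration. Only the "mpa > limit" check mirrors the log message.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical values, chosen only to make the toy self-contained. */
#define TOY_INVALID_MPA UINT64_MAX
#define TOY_OFFSET      UINT64_C(0x40000000)

/* Lenient variant: warns and returns 0, so the caller carries on
 * with a wrong frame and fails much later in an unrelated place. */
static uint64_t toy_lookup_mpa_lenient(uint64_t mpa, uint64_t max_mpa)
{
    if (mpa > max_mpa) {
        fprintf(stderr, "bad mpa %016" PRIx64 " (> %016" PRIx64 ")\n",
                mpa, max_mpa);
        return 0;
    }
    return mpa + TOY_OFFSET;
}

/* Strict variant: returns a sentinel that no valid translation can
 * produce, so the caller can panic the domain at the first bad input. */
static uint64_t toy_lookup_mpa_strict(uint64_t mpa, uint64_t max_mpa)
{
    if (mpa > max_mpa) {
        fprintf(stderr, "bad mpa %016" PRIx64 " (> %016" PRIx64 ")\n",
                mpa, max_mpa);
        return TOY_INVALID_MPA;
    }
    return mpa + TOY_OFFSET;
}
```

With the strict variant, the first bad address (0x7f170080 here) would
stop the domain right away, rather than leading to the unrelated GP
fault seen later.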
John, please run a test on your side with all the patches I sent out
today (including the max_page one). I believe we can call it an end now.
;-)
BTW, Dan, there are two heads on the current xen-ia64-unstable.hg.
Please do a merge.
Signed-off-by: Kevin Tian <Kevin.tian@xxxxxxxxx>
diff -r 68d8a0a1aeb7 xen/arch/ia64/xen/vcpu.c
--- a/xen/arch/ia64/xen/vcpu.c Thu Sep 1 21:51:57 2005
+++ b/xen/arch/ia64/xen/vcpu.c Fri Sep 2 21:30:01 2005
@@ -1315,7 +1315,8 @@
/* check 1-entry TLB */
if ((trp = match_dtlb(vcpu,address))) {
dtlb_translate_count++;
- *pteval = trp->page_flags;
+ //*pteval = trp->page_flags;
+ *pteval = vcpu->arch.dtlb_pte;
*itir = trp->itir;
return IA64_NO_FAULT;
}
Thanks,
Kevin
_______________________________________________
Xen-ia64-devel mailing list
Xen-ia64-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-ia64-devel