[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: [Xen-devel] [xen-unstable test] 123379: regressions - FAIL
Juergen Gross writes ("Re: [Xen-devel] [xen-unstable test] 123379: regressions - FAIL"): > Before that there was: > > 2018-05-30 22:12:49.320+0000: xc: Failed to get types for pfn batch (14 > = Bad address): Internal error This seems to be the only message about the root cause. > But looking at the messages issued some seconds before that I see some > xenstore watch related messages in: > > http://logs.test-lab.xenproject.org/osstest/logs/123379/test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm/huxelrebe1---var-log-libvirt-libxl-libxl-driver.log I think this is all a red herring. What I see happening is: 2018-05-30 22:12:44.695+0000: libxl: libxl_event.c:636:libxl__ev_xswatch_register: watch w=0xb40005e8 wpath=/local/domain/3/control/shutdown token=2/b: register slotnum=2 libxl starts watching the domain's shutdown control node. I think this is done from near libxl_dom_suspend.c:202. 2018-05-30 22:12:44.696+0000: libxl: libxl_event.c:573:watchfd_callback: watch w=0xb40005e8 wpath=/local/domain/3/control/shutdown token=2/b: event epath=/local/domain/3/control/shutdown The watch we just set triggers. This happens with every xenstore watch, after it is set up - ie, it does not mean that anything had written the node. 2018-05-30 22:12:44.696+0000: libxl: libxl_event.c:673:libxl__ev_xswatch_deregister: watch w=0xb40005e8 wpath=/local/domain/3/control/shutdown token=2/b: deregister slotnum=2 libxl stops watching the domain's shutdown control node. This is done, I think, by domain_suspend_common_pvcontrol_suspending (libxl_dom_suspend.c:222). We can conclude that if (!rc && !domain_suspend_pvcontrol_acked(state)) was not taken. It seems unlikely that rc!=0, because the node is read in xswait_xswatch_callback using libxl__xs_read_checked which I think would log a message. So probably /local/domain/3/control/shutdown was `suspend', meaning the domain had indeed acked the suspend request. 2018-05-30 22:12:44.696+0000: libxl: libxl_event.c:636:libxl__ev_xswatch_register: watch w=0xb40005f8 wpath=@releaseDomain token=2/c: register slotnum=2 This is the watch registration in domain_suspend_common_wait_guest. 2018-05-30 22:12:44.696+0000: libxl: libxl_event.c:548:watchfd_callback: watch w=0xb40005f8 epath=/local/domain/3/control/shutdown token=2/b: counter != c This is a watch event for the watch we set up at 2018-05-30 22:12:44.696+0000. You can tell because the token is the same. But that watch was cancelled within libxl at 2018-05-30 22:12:44.696+0000. libxl's watch handling machinery knows this, and discards the event. "counter != c", libxl_event.c:547. It does indeed use the same slot in the libxl xswatch data structure, but libxl can distinguish it by the counter and the event path. (In any case xs watch handlers should tolerate spurious events and be idempotent, although that does not matter here.) I think this must be the watch event from the guest actually writing its acknowledgement to the control node - we would indeed expect two such events, one generated by the watch setup, and one from the guest's write. The timing meant that here we processed the guest's written value as a result of the first watch event. This is fine. 2018-05-30 22:12:44.696+0000: libxl: libxl_event.c:573:watchfd_callback: watch w=0xb40005f8 wpath=@releaseDomain token=2/c: event epath=@releaseDomain This is the immediate-auto-firing of the @releaseDomain event set up at 2018-05-30 22:12:44.696+0000. libxl's xswatch machinery looks this up in slot 2 and finds that the counter and paths are right, so it will dispatch it to suspend_common_wait_guest_watch which is a frontend for suspend_common_wait_guest_check. In the absence of log messages from that function we can conclude that !(info.flags & XEN_DOMINF_shutdown) ie the guest has not shut down yet. 2018-05-30 22:12:44.720+0000: libxl: libxl_event.c:573:watchfd_callback: watch w=0xb2a26708 wpath=@releaseDomain token=3/0: event epath=@releaseDomain This is a watch event which was set up much earlier at 2018-05-30 21:58:02.182+0000. The surrounding context there (references to domain_death_xswatch_callback) makes it clear that this is pursuant to libxl_evenable_domain_death - ie, libvirt asked libxl to monitor for the death of the domain. 2018-05-30 22:12:44.724+0000: libxl: libxl_domain.c:816:domain_death_xswatch_callback: shutdown reporting The output here is a bit perplexing. I don't understand how we can have the message "shutdown reporting" without any previous message "Exists shutdown_reported=%d" or "[evg=%p] nentries=%d rc=%d %ld..%ld" both of which seem to precede the "shutdown reporting" message in domain_death_xswatch_callback. However, we can conclude that, at this point, libxl finds that got->flags & XEN_DOMINF_shutdown and it decides to inform libvirt that the domain has shut down, by providing a DOMAIN_SHUTDOWN libxl event. (This event is not passed to libvirt immediately yet because it lives on either (a) a queue on this thread's stack, which will be processed on return to libvirt, or (b) a queue associated with the CTX, whose lock we hold. The callback to libvirt will be reported later.) 2018-05-30 22:12:44.724+0000: libxl: libxl_domain.c:771:domain_death_xswatch_callback: [evg=0] all reported 2018-05-30 22:12:44.724+0000: libxl: libxl_domain.c:833:domain_death_xswatch_callback: domain death search done This is the end of the search ing domain_death_xswatch_callback for domains which need to be reported. libvirt was listening only for one domain. 2018-05-30 22:12:44.724+0000: libxl: libxl_event.c:573:watchfd_callback: watch w=0xb40005f8 wpath=@releaseDomain token=2/c: event epath=@releaseDomain Another xs watch event for the same domain shutdown, because libxl had set up two watches for it. (These will probably have been written very quickly together into the xs ring and/or socket, and handled within the loop in libxl_event.c:watchfd_callback. So we have not yet released the CTX lock or returned to libvirt: therefore, probably, the domain shutdown event notification to libvirt is still queued with libxl. Indeed as we will see, that occurs a bit later.) This is another watch event from the registration in domain_suspend_common_wait_guest and again it will call suspend_common_wait_guest_check. 2018-05-30 22:12:44.724+0000: libxl: libxl_event.c:673:libxl__ev_xswatch_deregister: watch w=0xb40005f8 wpath=@releaseDomain token=2/c: deregister slotnum=2 This must logically be one of the libxl__ev_xswatch_deregister calls in domain_suspend_common_guest_suspended / domain_suspend_common_done. However, looking at suspend_common_wait_guest_check, reaching either of those should have produced some kind of log message - either "guest has suspended" (DEBUG) or an ERROR of some sort. ... oh I have just spotted this logfile ... http://logs.test-lab.xenproject.org/osstest/logs/123379/test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm/huxelrebe1---var-log-libvirt-libxl-debianhvm.guest.osstest.log ... which contains the other half of the messages. So we see there: 2018-05-30 22:12:44.724+0000: libxl: libxl_dom_suspend.c:350:suspend_common_wait_guest_check: Domain 3:guest has suspended OK. 2018-05-30 22:12:44.731+0000: libxl: libxl_event.c:686:libxl__ev_xswatch_deregister: watch w=0xb40005f8: deregister unregistered This is the second call to libxl__ev_xswatch_deregister for the same watch event. That tells us that the first call must have been in domain_suspend_common_guest_suspended. So all is going well, and we called domain_suspend_common_done, whose idempotent cleanup deregisters the already-deregisterd watch. 2018-05-30 22:12:44.731+0000: libxl: libxl_event.c:1398:egc_run_callbacks: event 0xb40068f0 callback type=domain_shutdown This is libxl calling libvirt to tell libvirt that the domain has shut down. libvirt does not seem to respond. 2018-05-30 22:12:49.320+0000: xc: Failed to get types for pfn batch (14 = Bad address): Internal error 2018-05-30 22:12:49.483+0000: xc: Save failed (14 = Bad address): Internal error 2018-05-30 22:12:49.648+0000: libxl-save-helper: complete r=-1: Bad address And this is the first thing that goes wrong. You can see similar messages in the other logfile: 2018-05-30 22:12:49.650+0000: libxl: libxl_stream_write.c:350:libxl__xc_domain_save_done: Domain 3:saving domain: domain responded to suspend request: Bad address All of these are reports of the same thing: xc_get_pfn_type_batch at xc_sr_save.c:133 failed with EFAULT. 2018-05-30 22:12:49.650+0000: libxl: libxl.c:746:libxl__fd_flags_restore: fnctl F_SETFL of fd 33 to 0x8001 2018-05-30 22:12:49.650+0000: libxl: libxl_event.c:1869:libxl__ao_complete: ao 0xb4000478: complete, rc=-3 2018-05-30 22:12:49.650+0000: libxl: libxl_event.c:1838:libxl__ao__destroy: ao 0xb4000478: destroy > which make me wonder whether the libxl watch handling is really > correct: e.g. libxl__ev_xswatch_register() first registers the watch > with xenstore and only then writes the data needed for processing the > watch in the related structure. Could it be that the real suspend watch > event was interpreted as a @releaseDomain event? No. The code in libxl__ev_xswatch_register all runs with the libxl CTX lock held so it cannot be interrupted in this way. As you see above I have analysed the log and it is all operating correctly, albeit rather noisily in the debug log. Ian. _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxxxxxxxxx https://lists.xenproject.org/mailman/listinfo/xen-devel
|
Lists.xenproject.org is hosted with RackSpace, monitoring our |