Xen project Mailing List

Re: [Xen-devel] kexec -e in PVHVM guests (and in PV).

To: Vitaly Kuznetsov <vkuznets@xxxxxxxxxx>

From: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>

Date: Tue, 1 Jul 2014 11:34:01 -0400

Cc: xen-devel@xxxxxxxxxxxxxxxxxxxx, daniel.kiper@xxxxxxxxxx

Delivery-date: Tue, 01 Jul 2014 15:34:31 +0000

List-id: Xen developer discussion <xen-devel.lists.xen.org>

On Tue, Jul 01, 2014 at 10:12:58AM +0200, Vitaly Kuznetsov wrote:
> Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx> writes:
>
> > Hey,
> >
> > I had on my todo list a patch from Olaf that shuffles
> > the shared_page to be in the 0xFE700000 addr (in the "gap"
> > with newer QEMU's) which unfortunately did not work when
> > migrating on 32-bit PVHVM guests on Xen 4.1.
> >
> > The commit is 9d02b43dee0d7fb18dfb13a00915550b1a3daa9f
> > "xen PVonHVM: use E820_Reserved area for shared_info" and it
> > ended up being reverted. I dusted it off and I think I found
> > the original bug (and fixed it), but while digging into this
> > I discovered a ton more issues.
> >
> > A bit about the use case - the 'kexec -e' allows one to
> > restart the Linux kernel without a reboot. It is not a crash kernel
> > so it is just meant to restart and work, and then restart, etc.
> >
> > The 'kdump -c' (crash) is a different use case and I had not
> > thought much about it. But I think that all of the solutions
> > I am thinking of will make it also work. (So you could
> > do kexec-crash -> kexec-e -> kexec-e -> kexec-crash -> kexec-e, and
> > so on, if you wanted to).
> >
> > The problem I uncovered was that the memory region where
> > the new kernel would be executed had bits of memory changed - which
> > meant that the purgatory code in kexec would detect the SHA1SUM
> > being incorrect and not load. That led me to find out that
> > VCPUOP_register_vcpu_info was the culprit (well, the xen_vcpu_info
> > was being modified, and its PFN was in the 'new' kernel image area).
> >
> > Anyhow, the end result of that is that I think to get this
> > working we would need to have:
> >
> > 1). A symmetrical VCPUOP_register_vcpu_info call, say
> > VCPUOP_unregister_vcpu_info, which would for a provided vcpu_id
> > set 'vcpu_info' to the shared_info, and 'vcpu_info_mfn' to
> > INVALID_MFN. Naturally the vCPU has to be down (_VPF_down).
> > A prototype patch along with a naive implementation in
> > the Linux kernel made this work surprisingly well!
> >
> > On shutdown the Linux kernel had to call
> > disable_nonboot_cpus(), which would bring all the AP CPUs down.
> > Each AP CPU would call said hypercall. Also on each CPU
> > bringup we would call this (that is, the BSP would make this
> > call before bringing the AP CPUs up - on bootup paths it
> > would result in nothing, while for a kexec -c type kernel
> > it would allow us to use the CPUs).
> >
> > 2). Ditto for VCPUOP_register_runtime and
> > VCPUOP_register_runstate_memory_area. They would need a
> > similar 'unregister' call with similar semantics as the
> > one above.
> >
> > 3). The shared_info. Olaf's patch stuck the shared_info in the
> > "gaps" of the E820 or the E820_RSRV region. But the recent patches
> > for PCI passthrough are making me twitchy and I think we would
> > need to parse the E820 and /proc/ioports (so the 'resource API' in
> > the Linux kernel) to figure out a good place to stash this. Or on
> > shutdown (kexec -e) balloon out the shared region (need to
> > double check that this is possible in the first place).
> >
> > 4). Balloon memory. I am not really sure how to deal with that. The
> > guest might have ballooned out tons of memory but the new kernel
> > won't know about it until the xen/balloon driver kicks in and
> > figures this out based on XenStore. Then it will try to balloon
> > out.. and depending on its luck - balloon out memory that was
> > already ballooned out, or not. Also during the bootup of
> > the 'kexec -e' kernel it might touch pages that had been
> > ballooned out - and try to use them!
> >
> > 5). Events. Olaf had written code a long time ago that would poke the
> > events to see if they were already in use (-EEXIST) and if so
> > re-use them - it works great albeit there are tons of messages
> > in the Xen ring buffer.
> > The Linux patch I wrote did a
> > 'disable_nonboot_cpus' and also tore down the BSP interrupts - that
> > meant that all of the events were nicely torn down. This all works
> > on non-FIFO events. David Vrabel says that making this work
> > (re-use, or teardown and bring up) would be hard.
> >
> > 6). QEMU PnP type devices. Devices such as 'serial', 'i8042', and
> > 'rtc' end up going through the EVTCHNOP_bind_pirq. Somehow on the
> > 'kexec -e' kernel we end up doing OK, but the devices don't work
> > anymore. That is - the serial input does not accept any more input
> > (but it can output alright).
> >
> > 7). Grants. Andrew Cooper hinted at this and a bit of experimentation
> > shows that Xen hypervisor will indeed smack down any guest that
> > tries to re-use its "old" grants. I am not even sure if the
> > GNTTAB_setup call is returning the "old" grant frames.
> > His suggestion was 'GNTTAB_reset' to, well, reset everything.
> >
> > My thinking is that a lot of this code is shared with PV (and PVH);
> > once this is fixed we could do full scale 'kexec -e' in a PV
> > (or PVH) type guest. Doing dom0 kexec -e would be an interesting
> > experiment :-(
> >
> > I am unable to fix this for Xen 4.5 and I am not sure what other
> > issues there are present. If folks have some ideas or would like to
> > chime in (or even pick some of these up!) - please do respond.
> >
>
> I have one more issue related to kexec/kdump topic I'm investigating
> right now.

Woot!

>
> When kdump happens and new kernel boots we have /proc/vmcore
> device. There is no problem in reading from this device, however
> makedumpfile reads it with mmap() by default and that doesn't work for
> me.
>
> I figured out the following: there are several pages (2 in my case) in
> vmcore which are not ram. read_from_oldmem() calls special pfn_is_ram()
> check (which does HVMOP_get_mem_type and these pages are reported as
> HVMMEM_mmio_dm) and skips them.
> mmap_vmcore() doesn't have this check
> and we got these pages mapped. When we do memcpy() from them we get
> stuck in case we try reading more than 16 bytes (that's weird).

Ooh, would it make sense to expand 'mmap_vmcore' to have this check?

>
> I have 'quick and dirty' patch which brings pfn_is_ram() check to
> mmap_vmcore() and replaces all HVMMEM_mmio_dm pages with an empty
> page. I'm going to investigate a bit more here.

Ok.

>
> I can try looking at something from the above as well. E.g. I was able
> to solve no.6 with the following (yes, dirty again) patch:

Yeey! That would be fantastic.

Heh. I was thinking something similar, albeit to do this also from
the 'xen_kexec_shutdown' path - in case we are booting into a kernel
that does not have these patches.

See the four attached patches - two for Xen, and two for Linux.
They are very much RFC and I believe they are still buggy. If you
want to try them out and improve, please be my guest.

Thank you for your interest!

>
> commit 23a224c4ad664dfc6fe672f74f83549387efebda
> Author: Vitaly Kuznetsov <vkuznets@xxxxxxxxxx>
> Date:   Wed Jun 18 14:12:15 2014 +0200
>
>     wip: unmap all pirqs
>
>     Signed-off-by: Vitaly Kuznetsov <vkuznets@xxxxxxxxxx>
>
> diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
> index dfa12a4..16af7e4 100644
> --- a/drivers/xen/events/events_base.c
> +++ b/drivers/xen/events/events_base.c
> @@ -1658,6 +1719,35 @@ void xen_callback_vector(void) {}
>  static bool fifo_events = true;
>  module_param(fifo_events, bool, 0);
>
> +static void unmap_all_pirqs(void)
> +{
> +	struct evtchn_status status;
> +	int port, rc = -ENOENT;
> +	struct physdev_unmap_pirq unmap_irq;
> +	struct evtchn_close close;
> +
> +	memset(&status, 0, sizeof(status));
> +	for (port = 0; port < xen_evtchn_max_channels(); port++) {
> +		status.dom = DOMID_SELF;
> +		status.port = port;
> +		rc = HYPERVISOR_event_channel_op(EVTCHNOP_status, &status);
> +		if (rc < 0)
> +			continue;
> +		pr_warn("unmap_all_pirqs: port: %d, status: %d\n", status.port, status.status);
> +		if (status.status == EVTCHNSTAT_pirq) {
> +			close.port = port;
> +			if (HYPERVISOR_event_channel_op(EVTCHNOP_close, &close) != 0)
> +				pr_warn("EVTCHNSTAT_pirq: failed to close event channel %d\n", port);
> +			unmap_irq.pirq = status.u.pirq;
> +			unmap_irq.domid = DOMID_SELF;
> +			pr_warn("unmapping previously mapped pirq %d\n", unmap_irq.pirq);
> +			if (HYPERVISOR_physdev_op(PHYSDEVOP_unmap_pirq, &unmap_irq) != 0)
> +				pr_warn("failed to unmap pirq %d\n", unmap_irq.pirq);
> +		}
> +	}
> +}
> +
> +
>  void __init xen_init_IRQ(void)
>  {
>  	int ret = -EINVAL;
> @@ -1686,6 +1776,8 @@ void __init xen_init_IRQ(void)
>  	xen_callback_vector();
>
>  	if (xen_hvm_domain()) {
> +		unmap_all_pirqs();
> +
>  		native_init_IRQ();
>  		/* pci_xen_hvm_init must be called after native_init_IRQ so that
>  		 * __acpi_register_gsi can point at the right function */
>
> --
> Vitaly
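
For anyone trying to picture the semantics proposed in item 1) of the
quoted list, here is a rough userspace model. It is only a sketch and
not hypervisor code: every model_* name is made up for illustration,
the 32-slot limit and the -EBUSY/-EINVAL return values are assumptions,
and the real VCPUOP_unregister_vcpu_info (if it ever lands) may behave
differently. It shows the two rules from the mail: the vCPU must be
down, and afterwards 'vcpu_info' points back into the shared_info
while 'vcpu_info_mfn' becomes an invalid MFN.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_INVALID_MFN  (~0UL)  /* stand-in for INVALID_MFN */
#define MODEL_LEGACY_VCPUS 32      /* assumed number of shared_info slots */

struct model_vcpu_info { unsigned char evtchn_upcall_pending; };

struct model_shared_info {
    struct model_vcpu_info vcpu_info[MODEL_LEGACY_VCPUS];
};

struct model_vcpu {
    unsigned int vcpu_id;
    bool down;                          /* models _VPF_down being set */
    struct model_vcpu_info *vcpu_info;  /* where the vcpu_info lives now */
    unsigned long vcpu_info_mfn;        /* frame registered by the guest */
};

/*
 * Model of the proposed unregister call: refuse if the vCPU is still
 * running, otherwise fall back to the slot inside the shared_info.
 * Calling it on a vCPU that never registered anything is harmless,
 * which matches the "on bootup paths it would result in nothing" remark.
 */
static int model_unregister_vcpu_info(struct model_vcpu *v,
                                      struct model_shared_info *si)
{
    assert(v != NULL && si != NULL);

    if (!v->down)
        return -EBUSY;   /* assumed error code */
    if (v->vcpu_id >= MODEL_LEGACY_VCPUS)
        return -EINVAL;  /* assumed: no shared_info slot to fall back to */

    v->vcpu_info = &si->vcpu_info[v->vcpu_id];
    v->vcpu_info_mfn = MODEL_INVALID_MFN;
    return 0;
}
```

The point of the model is that after the call nothing references the
guest page that was registered earlier, so the new kernel image can be
placed over it without the checksum in purgatory changing underneath.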

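The mmap_vmcore() idea discussed in the message (apply the pfn_is_ram()
check and hand out an empty page for anything reported as
HVMMEM_mmio_dm) can be modelled in plain userspace C as below. This is
a sketch under stated assumptions, not the 'quick and dirty' patch
itself, which was not posted: the model_* names, the 16-byte "page"
and the array standing in for old memory are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_PAGE_SIZE 16  /* tiny "page" so the model stays readable */

/* Hypothetical stand-ins for what HVMOP_get_mem_type would report. */
enum model_mem_type { MODEL_MEM_RAM, MODEL_MEM_MMIO_DM };

struct model_page {
    enum model_mem_type type;
    unsigned char data[MODEL_PAGE_SIZE];
};

/* Model of the pfn_is_ram() decision: only RAM-backed pages are readable. */
static int model_pfn_is_ram(const struct model_page *old_mem, size_t pfn)
{
    return old_mem[pfn].type == MODEL_MEM_RAM;
}

/*
 * Copy npages of "old memory" into dst, substituting a zero-filled page
 * for every page that is not RAM. Returns how many pages were replaced.
 */
static size_t model_map_vmcore(const struct model_page *old_mem,
                               size_t npages, unsigned char *dst)
{
    size_t pfn, replaced = 0;

    assert(old_mem != NULL && dst != NULL);

    for (pfn = 0; pfn < npages; pfn++) {
        unsigned char *out = dst + pfn * MODEL_PAGE_SIZE;

        if (model_pfn_is_ram(old_mem, pfn)) {
            memcpy(out, old_mem[pfn].data, MODEL_PAGE_SIZE);
        } else {
            memset(out, 0, MODEL_PAGE_SIZE);  /* the "empty page" */
            replaced++;
        }
    }
    return replaced;
}
```

With this shape a reader such as makedumpfile never touches the
non-RAM pages at all; it just sees zeroes where they would have been.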
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel
