
Re: [Xen-devel] kexec -e in PVHVM guests (and in PV).



On Tue, Jul 01, 2014 at 10:12:58AM +0200, Vitaly Kuznetsov wrote:
> Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx> writes:
> 
> > Hey, 
> >
> > I had on my todo list a patch from Olaf that shuffles
> > the shared_page to be in the 0xFE700000 addr (in the "gap"
> > with newer QEMU's) which unfortunately did not work when
> > migrating on 32-bit PVHVM guests on Xen 4.1.
> >
> > The commit is 9d02b43dee0d7fb18dfb13a00915550b1a3daa9f
> > "xen PVonHVM: use E820_Reserved area for shared_info" and it
> > ended up being reverted. I dusted it off and I think I found
> > the original bug (and fixed it), but while digging into this
> > I discovered a ton more issues.
> >
> > A bit about the use case - the 'kexec -e' allows one to
> > restart the Linux kernel without a reboot. It is not a crash kernel
> > so it is just meant to restart and work, and then restart, etc.
> >
> > The 'kdump -c' (crash) is a different use case and I had not
> > thought much about it. But I think that all of the solutions
> > I am thinking of will make it also work. (so you could
> > do kexec-crash -> kexec-e -> kexec-e -> kexec-crash -> kexec-e, and
> > so on, if you wanted to).
> >
> > The problem I uncovered was that the memory region where
> > the new kernel would be executed had bits of memory changed - which
> > meant that the purgatory code in kexec would detect the SHA1SUM
> > being incorrect and not load. That led me to find out that
> > VCPUOP_register_vcpu_info was the culprit (well, the xen_vcpu_info
> > was being modified, and its PFN was in the 'new' kernel image area).
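The failure mode described above can be illustrated with a minimal userspace sketch. It is not the purgatory code: FNV-1a stands in for the real digest, and every `sketch_` name is hypothetical. It only shows why a single write into the loaded image region after the digest was recorded makes the pre-jump check fail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* 64-bit FNV-1a, used here only as a stand-in for the real digest. */
static uint64_t sketch_fnv1a64(const unsigned char *buf, size_t len)
{
    uint64_t h = 0xcbf29ce484222325ULL;   /* FNV offset basis */

    for (size_t i = 0; i < len; i++) {
        h ^= buf[i];
        h *= 0x100000001b3ULL;            /* FNV prime */
    }
    return h;
}

/*
 * Re-hash the image just before jumping into it and compare against the
 * digest recorded at load time. Any write into the region in between
 * (for example an update of a vcpu_info structure that still lives
 * there) makes this return false.
 */
static bool sketch_image_intact(const unsigned char *image, size_t len,
                                uint64_t digest_at_load)
{
    return sketch_fnv1a64(image, len) == digest_at_load;
}
```

Recording the digest at load time, flipping one bit inside the buffer and calling `sketch_image_intact()` again is enough to see the check fail.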
> >
> > Anyhow, the end result of that is that I think to get this
> > working we would need to have:
> >
> >  1). A symmetrical VCPUOP_register_vcpu_info call, say
> >      VCPUOP_unregister_vcpu_info, which would for a provided vcpu_id
> >      set 'vcpu_info' to the shared_info, and 'vcpu_info_mfn' to
> >      INVALID_MFN. Naturally the vCPU has to be down (_VPF_down).
> >      A prototype patch along with a naive implementation in
> >      the Linux kernel made this work surprisingly well!
> >
> >      On shutdown the Linux kernel had to call
> >      disable_nonboot_cpus(), which would bring all the AP CPUs down.
> >      Each AP CPU would call said hypercall. Also on each CPU
> >      bringup we would call this (that is the BSP would make this
> >      call before bringing the AP CPUs up - on bootup paths it
> >      would result in nothing, while for a kexec -c type kernel
> >      it would allow us to use the CPUs).
> >
> >  2). Ditto for VCPUOP_register_runtime and
> >      VCPUOP_register_runstate_memory_area.  They would need a
> >      similar 'unregister' call with similar semantics as the
> >      one above.
> >
> >  3). The shared_info. Olaf's patch stuck the shared_info in the
> >      "gaps" of the E820 or the E820_RSRV region. But the recent patches
> >      for PCI passthrough are making me twitchy and I think we would
> >      need to parse the E820 and /proc/ioports (so the 'resource API'
> >      in the Linux kernel) to figure out a good place to stash this.
> >      Or on shutdown (kexec -e) balloon out the shared region (need to
> >      double check that this is possible in the first place).
> >
> >  4). Balloon memory. I am not really sure how to deal with that. The
> >      guest might have ballooned out tons of memory but the new kernel
> >      won't know about it until the xen/balloon driver kicks in and
> >      figures this out based on XenStore. Then it will try to balloon
> >      out.. and depending on its luck - balloon out memory that was
> >      already ballooned out, or not.  Also during the bootup of
> >      the 'kexec -e' kernel it might touch pages that had been
> >      ballooned out - and try to use them!
> >
> >  5). Events. Olaf had written code a long time ago that would poke the
> >      events to see if they were already in use (-EEXIST) and if so
> >      re-use them - it works great albeit there are tons of messages
> >      in the Xen ring buffer. The Linux patch I wrote did a
> >      'disable_nonboot_cpus' and also tore down the BSP interrupts - that
> >      meant that all of the events were nicely torn down. This all works
> >      on non-FIFO events. David Vrabel says that making this work with
> >      FIFO events (re-use or teardown and bring up) would be hard.
> >
> >  6). QEMU PnP type devices, such as 'serial', 'i8042', and 'rtc', end up
> >      going through the EVTCHNOP_bind_pirq. Somehow on the 'kexec -e'
> >      kernel we end up doing OK, but the devices don't work anymore.
> >      That is - the serial input does not accept any more input (but
> >      it can output alright).
> >
> >  7). Grants. Andrew Cooper hinted at this and a bit of experimentation
> >      shows that Xen hypervisor will indeed smack down any guest that
> >      tries to re-use its "old" grants. I am not even sure if the
> >      GNTTAB_setup call is returning the "old" grant frames.
> >      His suggestion was 'GNTTAB_reset' to, well, reset everything.
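The semantics proposed in item 1 can be modeled in a few lines of userspace C. This is only a sketch of the description above, not the prototype patch: the `model_` types, the four-vCPU limit and the error codes are all assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_MAX_VCPUS   4
#define MODEL_INVALID_MFN (~(uint64_t)0)

/* Stand-in for the per-vCPU slot embedded in the shared_info page. */
struct model_vcpu_info { uint8_t evtchn_upcall_pending; };

struct model_shared_info {
    struct model_vcpu_info vcpu_info[MODEL_MAX_VCPUS];
};

struct model_vcpu {
    bool down;                          /* models the _VPF_down flag */
    struct model_vcpu_info *vcpu_info;  /* where the hypervisor writes */
    uint64_t vcpu_info_mfn;             /* frame of a registered area */
};

struct model_domain {
    struct model_shared_info shared;
    struct model_vcpu vcpu[MODEL_MAX_VCPUS];
};

/* Boot state: every vCPU uses its shared_info slot, only vCPU 0 is up. */
static void model_domain_init(struct model_domain *d)
{
    for (unsigned int i = 0; i < MODEL_MAX_VCPUS; i++) {
        d->vcpu[i].down = (i != 0);
        d->vcpu[i].vcpu_info = &d->shared.vcpu_info[i];
        d->vcpu[i].vcpu_info_mfn = MODEL_INVALID_MFN;
    }
}

/* Models VCPUOP_register_vcpu_info: point the vCPU at a guest page. */
static int model_register_vcpu_info(struct model_domain *d, unsigned int id,
                                    struct model_vcpu_info *area, uint64_t mfn)
{
    if (id >= MODEL_MAX_VCPUS)
        return -ENOENT;
    if (d->vcpu[id].vcpu_info_mfn != MODEL_INVALID_MFN)
        return -EINVAL;                 /* already registered */
    d->vcpu[id].vcpu_info = area;
    d->vcpu[id].vcpu_info_mfn = mfn;
    return 0;
}

/*
 * Models the proposed unregister call: only legal while the vCPU is
 * down, and a no-op (still returning 0) when nothing was registered,
 * which is what a normal boot path would see.
 */
static int model_unregister_vcpu_info(struct model_domain *d, unsigned int id)
{
    if (id >= MODEL_MAX_VCPUS)
        return -ENOENT;
    if (!d->vcpu[id].down)
        return -EBUSY;                  /* assumed error code */
    d->vcpu[id].vcpu_info = &d->shared.vcpu_info[id];
    d->vcpu[id].vcpu_info_mfn = MODEL_INVALID_MFN;
    return 0;
}
```

After a successful unregister the hypervisor no longer writes into the guest-chosen page, so the new kernel image can occupy that frame without being modified underneath it.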
> >
> > My thinking is that a lot of this code is shared with PV (and PVH), so
> > once this is fixed we could do full-scale 'kexec -e' in a PV
> > (or PVH) type guest. Doing dom0 kexec -e would be an interesting
> > experiment :-(
> >
> > I am unable to fix this for Xen 4.5 and I am not sure what other
> > issues are present. If folks have some ideas or would like to
> > chime in (or even pick some of these up!), please do respond.
> >
> 
> I have one more issue related to kexec/kdump topic I'm investigating
> right now. 

Woot!
> 
> When kdump happens and new kernel boots we have /proc/vmcore
> device. There is no problem in reading from this device, however
> makedumpfile reads it with mmap() by default and that doesn't work for
> me.
> 
> I figured out the following: there are several pages (2 in my case) in
> vmcore which are not ram. read_from_oldmem() calls special pfn_is_ram()
> check (which does HVMOP_get_mem_type and these pages are reported as
> HVMMEM_mmio_dm) and skips them. mmap_vmcore() doesn't have this check
> and we got these pages mapped. When we do memcpy() from them we get
> stuck in case we try reading more than 16 bytes (that's weird).

Ooh, would it make sense to expand 'mmap_vmcore' to have this check?
> 
> I have a 'quick and dirty' patch which brings pfn_is_ram() check to
> mmap_vmcore() and replaces all HVMMEM_mmio_dm pages with an empty
> page. I'm going to investigate a bit more here.

Ok.
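For illustration only, the idea of substituting an empty page for non-RAM pages can be sketched in userspace C. This is not the patch Vitaly mentions: the `sketch_` names are hypothetical, old memory is modeled as a flat buffer starting at PFN 0, and the predicate stands in for pfn_is_ram().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u

/* Stand-in for pfn_is_ram(): true if the old kernel's PFN is real RAM. */
typedef bool (*sketch_pfn_is_ram_fn)(unsigned long pfn);

/* Example predicate for experiments: pretend PFN 2 is an MMIO page. */
static bool sketch_example_is_ram(unsigned long pfn)
{
    return pfn != 2;
}

/*
 * Copy npages pages starting at first_pfn out of the old memory image.
 * Pages the predicate rejects are never read; the destination gets a
 * zero-filled page instead. Returns how many pages were substituted.
 */
static size_t sketch_copy_oldmem_pages(unsigned char *dst,
                                       const unsigned char *oldmem,
                                       unsigned long first_pfn,
                                       size_t npages,
                                       sketch_pfn_is_ram_fn is_ram)
{
    size_t substituted = 0;

    for (size_t i = 0; i < npages; i++) {
        unsigned long pfn = first_pfn + i;
        unsigned char *out = dst + i * SKETCH_PAGE_SIZE;

        if (is_ram(pfn)) {
            memcpy(out, oldmem + (size_t)pfn * SKETCH_PAGE_SIZE,
                   SKETCH_PAGE_SIZE);
        } else {
            memset(out, 0, SKETCH_PAGE_SIZE);
            substituted++;
        }
    }
    return substituted;
}
```

The point of the check is that the non-RAM page is never touched at all, which is what the read path already guarantees and the mmap path currently does not.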
> 
> I can try looking at something from the above as well. E.g. I was able
> to solve no.6 with the following (yes, dirty again) patch:

Yeey! That would be fantastic.

Heh. I was thinking something similar, albeit to do this also
from the 'xen_kexec_shutdown' path - in case we are booting into
a kernel that does not have these patches.

See the four attached patches - two for Xen, and two for Linux.
They are very much RFC and I believe they are still buggy. If you
want to try them out and improve, please be my guest.

Thank you for your interest!
> 
> commit 23a224c4ad664dfc6fe672f74f83549387efebda
> Author: Vitaly Kuznetsov <vkuznets@xxxxxxxxxx>
> Date:   Wed Jun 18 14:12:15 2014 +0200
> 
>     wip: unmap all pirqs
>     
>     Signed-off-by: Vitaly Kuznetsov <vkuznets@xxxxxxxxxx>
> 
> diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
> index dfa12a4..16af7e4 100644
> --- a/drivers/xen/events/events_base.c
> +++ b/drivers/xen/events/events_base.c
> @@ -1658,6 +1719,35 @@ void xen_callback_vector(void) {}
>  static bool fifo_events = true;
>  module_param(fifo_events, bool, 0);
>  
> +static void unmap_all_pirqs(void)
> +{
> +     struct evtchn_status status;
> +     int port, rc = -ENOENT;
> +     struct physdev_unmap_pirq unmap_irq;
> +     struct evtchn_close close;
> +
> +     memset(&status, 0, sizeof(status));
> +     for (port = 0; port < xen_evtchn_max_channels(); port++) {
> +             status.dom = DOMID_SELF;
> +             status.port = port;
> +             rc = HYPERVISOR_event_channel_op(EVTCHNOP_status, &status);
> +             if (rc < 0)
> +                     continue;
> +             pr_warn("unmap_all_pirqs: port: %d, status: %d\n", status.port, status.status);
> +             if (status.status == EVTCHNSTAT_pirq) {
> +                     close.port = port;
> +                     if (HYPERVISOR_event_channel_op(EVTCHNOP_close, &close) != 0)
> +                             pr_warn("EVTCHNSTAT_pirq: failed to close event channel %d\n", port);
> +                     unmap_irq.pirq = status.u.pirq;
> +                     unmap_irq.domid = DOMID_SELF;
> +                     pr_warn("unmapping previously mapped pirq %d\n", unmap_irq.pirq);
> +                     if (HYPERVISOR_physdev_op(PHYSDEVOP_unmap_pirq, &unmap_irq) != 0)
> +                             pr_warn("failed to unmap pirq %d\n", unmap_irq.pirq);
> +             }
> +     }
> +}
> +
> +
>  void __init xen_init_IRQ(void)
>  {
>       int ret = -EINVAL;
> @@ -1686,6 +1776,8 @@ void __init xen_init_IRQ(void)
>               xen_callback_vector();
>  
>       if (xen_hvm_domain()) {
> +             unmap_all_pirqs();
> +
>               native_init_IRQ();
>               /* pci_xen_hvm_init must be called after native_init_IRQ so that
>                * __acpi_register_gsi can point at the right function */
> 
> -- 
>   Vitaly

Attachment: 0001-VCPUOP_reset_vcpu_info.patch
Description: Text document

Attachment: 0002-VCPU_reset-VCPU_up-VCPU_is_up-etc-for-HVM.patch
Description: Text document

Attachment: 0001-xen-PVonHVM-use-E820_Reserved-area-for-shared_info.patch
Description: Text document

Attachment: 0002-RFC-VCPU_reset_cpu_info.patch
Description: Text document

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel
