Re: [Xen-devel] kexec -e in PVHVM guests (and in PV).
Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx> writes:
> Hey,
>
> I had on my todo list a patch from Olaf that moves the
> shared_info page to address 0xFE700000 (in the "gap"
> with newer QEMUs), which unfortunately did not work when
> migrating 32-bit PVHVM guests on Xen 4.1.
>
> The commit is 9d02b43dee0d7fb18dfb13a00915550b1a3daa9f
> "xen PVonHVM: use E820_Reserved area for shared_info" and it
> ended up being reverted. I dusted it off and I think I found
> the original bug (and fixed it), but the more I dug into this,
> the more issues I discovered.
>
> A bit about the use case - the 'kexec -e' allows one to
> restart the Linux kernel without a reboot. It is not a crash kernel
> so it is just meant to restart and work, and then restart, etc.
>
> The 'kdump -c' (crash) is a different use case and I had not
> thought much about it. But I think that all of the solutions
> I am thinking of will make it work as well (so you could
> do kexec-crash -> kexec-e -> kexec-e -> kexec-crash -> kexec-e, and
> so on, if you wanted to).
>
> The problem I uncovered was that the memory region where
> the new kernel would be executed had bits of memory changed - which
> meant that the purgatory code in kexec would detect the SHA1SUM
> being incorrect and not load. That led me to find out that
> VCPUOP_register_vcpu_info was the culprit (well, the xen_vcpu_info
> was being modified, and its PFN was in the 'new' kernel image area).
>
> Anyhow, the end result of that is that I think to get this
> working we would need to have:
>
> 1). A call symmetrical to VCPUOP_register_vcpu_info, say
> VCPUOP_unregister_vcpu_info, which would, for a provided vcpu_id,
> set 'vcpu_info' back to the shared_info and 'vcpu_info_mfn' to
> INVALID_MFN. Naturally the vCPU has to be down (_VPF_down).
> A prototype patch along with a naive implementation in
> the Linux kernel made this work surprisingly well!
>
> On shutdown the Linux kernel had to call disable_nonboot_cpus(),
> which brings all the AP CPUs down. Each AP CPU would then make
> said hypercall. We would also make the call on each CPU bringup
> (that is, the BSP would make this call before bringing the AP
> CPUs up; on normal bootup paths it would do nothing, while for a
> kexec -c type kernel it would allow us to use the CPUs).
>
> 2). Ditto for VCPUOP_register_runtime and
> VCPUOP_register_runstate_memory_area. They would need a
> similar 'unregister' call with similar semantics as the
> one above.
>
> 3). The shared_info. Olaf's patch stuck the shared_info in the
> "gaps" of the E820 or the E820_RSRV region. But the recent patches
> for PCI passthrough are making me twitchy, and I think we would
> need to parse the E820 and /proc/ioports (so the 'resource API' in
> the Linux kernel) to figure out a good place to stash this. Or on
> shutdown (kexec -e) balloon out the shared region (need to
> double check that this is possible in the first place).
>
> 4). Balloon memory. I am not really sure how to deal with that. The
> guest might have ballooned out tons of memory but the new kernel
> won't know about it until the xen/balloon driver kicks in and
> figures this out based on XenStore. Then it will try to balloon
> out.. and depending on its luck - balloon out memory that was
> already ballooned out, or not. Also during the bootup of
> the 'kexec -e' kernel it might touch pages that had been
> ballooned out - and try to use them!
>
> 5). Events. Olaf had written code a long time ago that would poke the
> events to see if they were already in use (-EEXIST) and if so
> re-use them. It works great, albeit there are tons of messages
> in the Xen ring buffer. The Linux patch I wrote did a
> 'disable_nonboot_cpus' and also tore down the BSP interrupts, which
> meant that all of the events were nicely torn down. This all works
> with non-FIFO events. David Vrabel says that making this work
> (re-use, or teardown and bring-up) would be hard.
>
> 6). QEMU PnP type devices, such as 'serial', 'i8042', and 'rtc', end
> up going through EVTCHNOP_bind_pirq. Somehow on the 'kexec -e'
> kernel the bind succeeds, but the devices don't work anymore.
> That is, the serial port does not accept any more input (but
> it can output alright).
>
> 7). Grants. Andrew Cooper hinted at this and a bit of experimentation
> shows that Xen hypervisor will indeed smack down any guest that
> tries to re-use its "old" grants. I am not even sure if the
> GNTTAB_setup call is returning the "old" grant frames.
> His suggestion was 'GNTTAB_reset' to well, reset everything.
>
> My thinking is that a lot of this code is shared with PV (and PVH),
> so once this is fixed we could do full scale 'kexec -e' in a PV
> (or PVH) type guest. Doing dom0 kexec -e would be an interesting
> experiment :-(
>
> I am unable to fix this for Xen 4.5 and I am not sure what other
> issues are present. If folks have some ideas or would like to
> chime in (or even pick some of these up!), please do respond.
>
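
To make item 1 of the quoted list concrete, here is a minimal user-space
model of what such a call might do. This is only a sketch:
VCPUOP_unregister_vcpu_info does not exist in Xen, and every model_* name,
the 4-vCPU limit, and the -EBUSY/-EINVAL return values are assumptions made
for illustration, not hypervisor code.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_MAX_VCPUS   4           /* assumption: tiny fixed-size model */
#define MODEL_INVALID_MFN UINT64_MAX  /* stand-in for Xen's INVALID_MFN */

struct model_vcpu_info {
	uint8_t evtchn_upcall_pending;
};

struct model_vcpu {
	bool down;                          /* models the _VPF_down flag */
	struct model_vcpu_info *vcpu_info;  /* where per-vCPU state is written */
	uint64_t vcpu_info_mfn;             /* frame of a registered area */
};

struct model_domain {
	/* default per-vCPU slots that live inside shared_info */
	struct model_vcpu_info shared_info_slots[MODEL_MAX_VCPUS];
	struct model_vcpu vcpus[MODEL_MAX_VCPUS];
};

/* Every vCPU starts out pointing at its slot in shared_info. */
void model_domain_init(struct model_domain *d)
{
	for (size_t i = 0; i < MODEL_MAX_VCPUS; i++) {
		d->shared_info_slots[i].evtchn_upcall_pending = 0;
		d->vcpus[i].down = false;
		d->vcpus[i].vcpu_info = &d->shared_info_slots[i];
		d->vcpus[i].vcpu_info_mfn = MODEL_INVALID_MFN;
	}
}

/* Models registration: redirect the vCPU to a guest-chosen area. */
int model_register_vcpu_info(struct model_domain *d, unsigned int id,
			     struct model_vcpu_info *area, uint64_t mfn)
{
	if (id >= MODEL_MAX_VCPUS || area == NULL)
		return -EINVAL;
	if (d->vcpus[id].vcpu_info_mfn != MODEL_INVALID_MFN)
		return -EINVAL;  /* assumption: double registration is refused */
	d->vcpus[id].vcpu_info = area;
	d->vcpus[id].vcpu_info_mfn = mfn;
	return 0;
}

/* Models the proposed unregister: only allowed while the vCPU is down. */
int model_unregister_vcpu_info(struct model_domain *d, unsigned int id)
{
	if (id >= MODEL_MAX_VCPUS)
		return -EINVAL;
	if (!d->vcpus[id].down)
		return -EBUSY;   /* assumption: error code chosen for the model */
	d->vcpus[id].vcpu_info = &d->shared_info_slots[id];
	d->vcpus[id].vcpu_info_mfn = MODEL_INVALID_MFN;
	return 0;
}
```

The point of the model: once the area is unregistered, nothing refers to
the old kernel's page any more, so the kexec'd kernel can register a fresh
area of its own.
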
I have one more issue related to the kexec/kdump topic that I'm
investigating right now.

When kdump happens and the new kernel boots, we have the /proc/vmcore
device. Reading from this device with read() works fine; however,
makedumpfile reads it with mmap() by default, and that doesn't work
for me.

I figured out the following: there are several pages (2 in my case) in
vmcore which are not RAM. read_from_oldmem() calls a special
pfn_is_ram() check (which issues HVMOP_get_mem_type; these pages are
reported as HVMMEM_mmio_dm) and skips them. mmap_vmcore() doesn't have
this check, so these pages get mapped. When we then memcpy() from them,
we get stuck if we try to read more than 16 bytes (that's weird).

I have a 'quick and dirty' patch which adds the pfn_is_ram() check to
mmap_vmcore() and replaces all HVMMEM_mmio_dm pages with an empty
page. I'm going to investigate a bit more here.
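
The idea behind that patch can be sketched in user space as follows. This
is not the actual kernel change: model_copy_oldmem_filtered(),
model_example_is_ram() and the PFN values are made up for illustration, and
the predicate merely stands in for a pfn_is_ram()-style query.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096u

typedef bool (*model_pfn_is_ram_fn)(unsigned long pfn);

/*
 * Hypothetical predicate: pretend PFNs 0x102 and 0x103 are the two
 * non-RAM (HVMMEM_mmio_dm-like) pages. The values are invented.
 */
bool model_example_is_ram(unsigned long pfn)
{
	return pfn != 0x102 && pfn != 0x103;
}

/*
 * Copy npages pages of old memory, starting at first_pfn, into dst.
 * old_mem points at the bytes for first_pfn. Pages rejected by is_ram
 * are never read; an all-zero page is substituted for them.
 * Returns the number of substituted pages.
 */
size_t model_copy_oldmem_filtered(unsigned char *dst,
				  const unsigned char *old_mem,
				  unsigned long first_pfn, size_t npages,
				  model_pfn_is_ram_fn is_ram)
{
	size_t substituted = 0;

	for (size_t i = 0; i < npages; i++) {
		unsigned char *out = dst + i * MODEL_PAGE_SIZE;

		if (is_ram == NULL || is_ram(first_pfn + i)) {
			memcpy(out, old_mem + i * MODEL_PAGE_SIZE,
			       MODEL_PAGE_SIZE);
		} else {
			memset(out, 0, MODEL_PAGE_SIZE);
			substituted++;
		}
	}
	return substituted;
}
```

With four pages starting at PFN 0x101, the two middle pages come back
zeroed and the source is never touched for them, which is the behaviour
the mmap path would need.
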

I can try looking at some of the items above as well. E.g. I was able
to solve no. 6 with the following (yes, dirty again) patch:

commit 23a224c4ad664dfc6fe672f74f83549387efebda
Author: Vitaly Kuznetsov <vkuznets@xxxxxxxxxx>
Date:   Wed Jun 18 14:12:15 2014 +0200

    wip: unmap all pirqs

    Signed-off-by: Vitaly Kuznetsov <vkuznets@xxxxxxxxxx>

diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
index dfa12a4..16af7e4 100644
--- a/drivers/xen/events/events_base.c
+++ b/drivers/xen/events/events_base.c
@@ -1658,6 +1719,35 @@ void xen_callback_vector(void) {}
 static bool fifo_events = true;
 module_param(fifo_events, bool, 0);
 
+static void unmap_all_pirqs(void)
+{
+	struct evtchn_status status;
+	int port, rc = -ENOENT;
+	struct physdev_unmap_pirq unmap_irq;
+	struct evtchn_close close;
+
+	memset(&status, 0, sizeof(status));
+	for (port = 0; port < xen_evtchn_max_channels(); port++) {
+		status.dom = DOMID_SELF;
+		status.port = port;
+		rc = HYPERVISOR_event_channel_op(EVTCHNOP_status, &status);
+		if (rc < 0)
+			continue;
+		pr_warn("unmap_all_pirqs: port: %d, status: %d\n", status.port, status.status);
+		if (status.status == EVTCHNSTAT_pirq) {
+			close.port = port;
+			if (HYPERVISOR_event_channel_op(EVTCHNOP_close, &close) != 0)
+				pr_warn("EVTCHNSTAT_pirq: failed to close event channel %d\n", port);
+			unmap_irq.pirq = status.u.pirq;
+			unmap_irq.domid = DOMID_SELF;
+			pr_warn("unmapping previously mapped pirq %d\n", unmap_irq.pirq);
+			if (HYPERVISOR_physdev_op(PHYSDEVOP_unmap_pirq, &unmap_irq) != 0)
+				pr_warn("failed to unmap pirq %d\n", unmap_irq.pirq);
+		}
+	}
+}
+
+
 void __init xen_init_IRQ(void)
 {
 	int ret = -EINVAL;
@@ -1686,6 +1776,8 @@ void __init xen_init_IRQ(void)
 	xen_callback_vector();
 
 	if (xen_hvm_domain()) {
+		unmap_all_pirqs();
+
 		native_init_IRQ();
 		/* pci_xen_hvm_init must be called after native_init_IRQ so that
 		 * __acpi_register_gsi can point at the right function */
--
Vitaly
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel