
Re: [Xen-devel] null domains after xl destroy



On 11/04/17 17:59, Juergen Gross wrote:
On 11/04/17 07:25, Glenn Enright wrote:
Hi all

We are seeing an odd issue with domU domains after xl destroy: under
recent 4.9 kernels a (null) domain is left behind.

I guess this is the dom0 kernel version?

This has occurred on a variety of hardware, with no obvious commonality.

4.4.55 does not show this behavior.

On my test machine I have the following packages installed under
CentOS 6, from https://xen.crc.id.au/:

~]# rpm -qa | grep xen
xen47-licenses-4.7.2-4.el6.x86_64
xen47-4.7.2-4.el6.x86_64
kernel-xen-4.9.21-1.el6xen.x86_64
xen47-ocaml-4.7.2-4.el6.x86_64
xen47-libs-4.7.2-4.el6.x86_64
xen47-libcacard-4.7.2-4.el6.x86_64
xen47-hypervisor-4.7.2-4.el6.x86_64
xen47-runtime-4.7.2-4.el6.x86_64
kernel-xen-firmware-4.9.21-1.el6xen.x86_64

I've also replicated the issue with 4.9.17 and 4.9.20

To replicate, on a cleanly booted dom0 with one PV VM, I run the
following on the VM:

{
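# continuous 512M writes with fdatasync, to keep disk I/O going while the domain is destroyed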
while true; do
 dd bs=1M count=512 if=/dev/zero of=test conv=fdatasync
done
}

Then on the dom0 I do this sequence to reliably get a null domain. This
occurs with both oxenstored and xenstored.

{
xl sysrq 1 s
xl destroy 1
}

xl list then renders something like ...

(null)                                       1     4     4     --p--d       9.8     0

Something is referencing the domain, e.g. some of its memory pages are
still mapped by dom0.

From what I can see it appears to be disk related. Affected VMs all use
LVM storage for their boot disk. lvdisplay of the affected LV shows that
it is being held open by something.

How are the disks configured? The backend type in particular is important.


~]# lvdisplay test/test.img | grep open
  # open                 1

I've not been able to determine what that thing is yet. I tried lsof,
dmsetup, and various LVM tools. Waiting for the disk to be released does
not work.
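
For reference, this is roughly how I have been checking whether the kernel itself still has the device open (a sketch; the device-mapper name test-test.img is inferred from the test/test.img LV above, so adjust for your naming):

# open count as device-mapper sees it (dm name inferred from VG/LV test/test.img)
dmsetup info -c test-test.img

# resolve the LV to its dm-N node and look for anything stacked on top of it
dm=$(basename "$(readlink -f /dev/test/test.img)")
ls /sys/block/$dm/holders/

# userspace openers, if any
fuser -v /dev/test/test.img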

~]# xl list
Name                                        ID   Mem VCPUs      State   Time(s)
Domain-0                                     0  1512     2     r-----      29.0
(null)                                       1     4     4     --p--d       9.8

xenstore-ls reports nothing for the null domain id that I can see.

Any qemu process related to the domain still running?

Any dom0 kernel messages related to Xen?


Juergen


Yep, 4.9 dom0 kernel

Typically we see an xl process still running, but that has already gone away in this case. The domU is a PV guest using a phy disk definition; the basic startup is like this...

xl -v create -f paramfile extra="console=hvc0 elevator=noop xen-blkfront.max=64"

There are no qemu processes or threads anywhere I can see.
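
For reference, the checks amount to roughly this (nothing exotic, just the obvious process and log greps):

# any leftover qemu or xl processes for the guest?
ps -ef | grep -E '[q]emu|[x]l '

# hypervisor log
xl dmesg | tail -n 50

# dom0 kernel log, xen/blkback/vif related entries only
dmesg | grep -iE 'xen|blkback|vif' | tail -n 50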

I don't see any meaningful messages in the Linux kernel log, and nothing at all in the hypervisor log. Here is the dom0 output from starting and then stopping a domU using the above mechanism:

br0: port 2(vif3.0) entered disabled state
br0: port 2(vif4.0) entered blocking state
br0: port 2(vif4.0) entered disabled state
device vif4.0 entered promiscuous mode
IPv6: ADDRCONF(NETDEV_UP): vif4.0: link is not ready
xen-blkback: backend/vbd/4/51713: using 2 queues, protocol 1 (x86_64-abi) persistent grants
xen-blkback: backend/vbd/4/51721: using 2 queues, protocol 1 (x86_64-abi) persistent grants
vif vif-4-0 vif4.0: Guest Rx ready
IPv6: ADDRCONF(NETDEV_CHANGE): vif4.0: link becomes ready
br0: port 2(vif4.0) entered blocking state
br0: port 2(vif4.0) entered forwarding state
br0: port 2(vif4.0) entered disabled state
br0: port 2(vif4.0) entered disabled state
device vif4.0 left promiscuous mode
br0: port 2(vif4.0) entered disabled state

... here is xl info ...

host                   : xxxxxxxxxxxx
release                : 4.9.21-1.el6xen.x86_64
version                : #1 SMP Sat Apr 8 18:03:45 AEST 2017
machine                : x86_64
nr_cpus                : 4
max_cpu_id             : 3
nr_nodes               : 1
cores_per_socket       : 4
threads_per_core       : 1
cpu_mhz                : 2394
hw_caps                : b7ebfbff:0000e3bd:20100800:00000001:00000000:00000000:00000000:00000000
virt_caps              :
total_memory           : 8190
free_memory            : 6577
sharing_freed_memory   : 0
sharing_used_memory    : 0
outstanding_claims     : 0
free_cpus              : 0
xen_major              : 4
xen_minor              : 7
xen_extra              : .2
xen_version            : 4.7.2
xen_caps               : xen-3.0-x86_64 xen-3.0-x86_32p
xen_scheduler          : credit
xen_pagesize           : 4096
platform_params        : virt_start=0xffff800000000000
xen_changeset          :
xen_commandline        : dom0_mem=1512M cpufreq=xen dom0_max_vcpus=2 dom0_vcpus_pin log_lvl=all guest_loglvl=all vcpu_migration_delay=1000
cc_compiler            : gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-17)
cc_compile_by          : mockbuild
cc_compile_domain      : (none)
cc_compile_date        : Mon Apr  3 12:17:20 AEST 2017
build_id               : 0ec32d14d7c34e5d9deaaf6e3b7ea0c8006d68fa
xend_config_format     : 4


# cat /proc/cmdline
ro root=UUID=xxxxxxxxxx rd_MD_UUID=xxxxxxxxxxxx rd_NO_LUKS KEYBOARDTYPE=pc KEYTABLE=us LANG=en_US.UTF-8 rd_MD_UUID=xxxxxxxxxxxxx SYSFONT=latarcyrheb-sun16 crashkernel=auto rd_NO_LVM rd_NO_DM rhgb quiet pcie_aspm=off panic=30 max_loop=64 dm_mod.use_blk_mq=y xen-blkfront.max=64

The domU is using an LVM volume on top of an md RAID1 array, on directly connected HDDs. Nothing special hardware-wise. The disk line for that domU looks functionally like...

disk = [ 'phy:/dev/testlv/test.img,xvda1,w' ]
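
For what it's worth, this is roughly how I check whether blkback still has backend state for the torn-down domain (the domain ID 4 below is taken from the blkback log lines above; substitute the ID of the (null) domain):

# xenstore backend nodes that should disappear once the vbd is released
xenstore-ls /local/domain/0/backend/vbd/4
xenstore-ls /local/domain/0/backend/vif/4

# corresponding xen-backend devices in sysfs
ls /sys/bus/xen-backend/devices/ | grep -- '-4-'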

I would appreciate any suggestions on how to increase the debug level in a relevant way, or where to look for more useful information on what is happening.

To clarify the actual shutdown sequence that causes problems...

# xl sysrq $id s
# xl destroy $id
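
Put together as a small script, the whole sequence looks roughly like this (the guest name 'testvm' is just a placeholder, and the dd loop from earlier is assumed to already be running inside the guest):

#!/bin/sh
# placeholder guest name; replace with the real domU name from xl list
guest=testvm

id=$(xl list | awk -v g="$guest" '$1 == g {print $2}')

xl sysrq "$id" s     # ask the guest kernel to sync its disks
sleep 5
xl destroy "$id"

xl list              # the (null) domain with state --p--d shows up here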


Regards, Glenn

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel
