
Re: Windows domu DRBD backend problem


  • To: win-pv-devel@xxxxxxxxxxxxxxxxxxxx
  • From: Kotán Attila <info@xxxxxxxx>
  • Date: Wed, 23 Apr 2025 19:28:29 +0200
  • Delivery-date: Wed, 23 Apr 2025 17:28:38 +0000
  • List-id: Developer list for the Windows PV Drivers subproject <win-pv-devel.lists.xenproject.org>

Hello Tu Dinh,  
An update:
I bought an NVMe drive today for testing, and unfortunately the problem is also present when the DRBD backend is NVMe.
My earlier NVMe test was done when the primary node was not a DELL server (I tested with a desktop-class computer).
So it definitely seems to be related only to DELL servers, or maybe to the multiprocessor environment. I use only DELL servers, so I have no information about other vendors.


Thank you for your advice.
I have tried to collect all the relevant information and output:

DRBD configs
- global_common.conf
-----
global {
        usage-count yes;
        udev-always-use-vnr; # treat implicit the same as explicit volumes
}

common {
        handlers {
        }

        startup {
        }

        options {
        }

        disk {
                on-io-error     detach;
                resync-rate         160M;
        }

        net {
        }
}
-----

- w2022_system.res
-----
resource w2022_system {
  protocol C;

  net {
  }

  syncer {
   }

  on xen18 {
    device    /dev/drbd0;
    disk      /dev/NVME01/w2022_system;
    address   172.16.16.8:7800;
    meta-disk internal;
  }

  on xen16 {
    device     /dev/drbd0;
    disk       /dev/VG02/w2022_system;
    address    172.16.16.6:7800;
    meta-disk  internal;
  }
}
-----

DomU config
- w2022.cfg
-----
name = 'w2022'
builder = 'hvm'
memory = 16384
#shadow_memory = 8
vcpus=16
uuid = 'cac0559e-06fd-42fc-a92f-fa2d8cadaff1'
vif = [ 'bridge=xenbr0, mac=00:11:6c:1c:49:17' ]
disk = [ 'drbd:w2022_system,xvda,w', ]
boot='dc'
vnc=1
vncunused=0
vnclisten = '0.0.0.0'
vncdisplay=2
stdvga=1
_on_poweroff_ = 'destroy'
_on_reboot_ = 'restart'
_on_crash_ = 'restart'
usb=1
usbdevice=['tablet']
-----

There is nothing special in the config.

I tested with Debian 10, 11, 12 and testing (maybe trixie) as Domain-0.
I tried Windows 7 and Windows 2022 as DomU.
The test environment is:
- Node1 (xen18): DELL T630 with PERC H730 (1G, BBU) or 1TB NVMe as primary
- Node2 (xen16): DELL R730XD with PERC H730mini (1G, BBU) as secondary

I also have the problem in another environment with two DELL R640 servers.

The node1 kern.log with a PERC (DELL RAID controller) virtual disk as the DRBD backend:
-----
Apr 23 12:46:22 xen18 kernel: [  574.527385] drbd w2022_system: PingAck did not arrive in time.
Apr 23 12:46:22 xen18 kernel: [  574.527464] drbd w2022_system: peer( Secondary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Apr 23 12:46:22 xen18 kernel: [  574.527600] block drbd0: new current UUID 19B04B0803CC3C87:CB60C112A5C9EA5D:EBEF8CB4948C160D:EBEE8CB4948C160D
Apr 23 12:46:22 xen18 kernel: [  574.527661] drbd w2022_system: ack_receiver terminated
Apr 23 12:46:22 xen18 kernel: [  574.527665] drbd w2022_system: Terminating drbd_a_w2022_sy
Apr 23 12:46:22 xen18 kernel: [  574.583747] drbd w2022_system: Connection closed
Apr 23 12:46:22 xen18 kernel: [  574.584035] drbd w2022_system: conn( NetworkFailure -> Unconnected )
Apr 23 12:46:22 xen18 kernel: [  574.584038] drbd w2022_system: receiver terminated
Apr 23 12:46:22 xen18 kernel: [  574.584041] drbd w2022_system: Restarting receiver thread
Apr 23 12:46:22 xen18 kernel: [  574.584043] drbd w2022_system: receiver (re)started
Apr 23 12:46:22 xen18 kernel: [  574.584052] drbd w2022_system: conn( Unconnected -> WFConnection )
-----

The node1 kern.log with NVMe as the DRBD backend:
-----
Apr 23 10:51:18 xen18 kernel: [  912.800847] drbd w2022_system: PingAck did not arrive in time.
Apr 23 10:51:18 xen18 kernel: [  912.800930] drbd w2022_system: peer( Secondary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Apr 23 10:51:18 xen18 kernel: [  912.807793] block drbd0: new current UUID 2361269D52925DF1:AB440CB0842F155D:25CE81C2C028E09E:74D2255AB30D4115
Apr 23 10:51:18 xen18 kernel: [  912.811762] drbd w2022_system: ack_receiver terminated
Apr 23 10:51:18 xen18 kernel: [  912.811768] drbd w2022_system: Terminating drbd_a_w2022_sy
Apr 23 10:51:18 xen18 kernel: [  912.853400] drbd w2022_system: Connection closed
Apr 23 10:51:18 xen18 kernel: [  912.853723] drbd w2022_system: conn( NetworkFailure -> Unconnected )
Apr 23 10:51:18 xen18 kernel: [  912.853727] drbd w2022_system: receiver terminated
Apr 23 10:51:18 xen18 kernel: [  912.853729] drbd w2022_system: Restarting receiver thread
Apr 23 10:51:18 xen18 kernel: [  912.853732] drbd w2022_system: receiver (re)started
Apr 23 10:51:18 xen18 kernel: [  912.853740] drbd w2022_system: conn( Unconnected -> WFConnection )
-----
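
To compare the PERC and NVMe cases line by line, the state changes can be pulled out of kern.log with a small script. The following is only a minimal sketch: the function name `parse_transitions` and both regular expressions are illustrative assumptions based on the log lines quoted above, not part of DRBD or any Xen tool.

```python
import re

# Assumed pattern for DRBD state changes such as
# "conn( Connected -> NetworkFailure )", as seen in the logs above.
TRANSITION = re.compile(r"(peer|conn|pdsk|disk|role)\( (\w+) -> (\w+) \)")
# Assumed pattern for the kernel uptime stamp and the DRBD resource name.
PREFIX = re.compile(r"\[\s*(\d+\.\d+)\] drbd (\S+):")


def parse_transitions(lines):
    """Yield (uptime_seconds, resource, field, old, new) per state change."""
    for line in lines:
        head = PREFIX.search(line)
        if head is None:
            # Not a "drbd <resource>:" line (e.g. "block drbd0: ..."); skip it.
            continue
        for field, old, new in TRANSITION.findall(line):
            yield float(head.group(1)), head.group(2), field, old, new


# Three lines copied from the PERC log above.
sample = [
    "Apr 23 12:46:22 xen18 kernel: [  574.527385] drbd w2022_system: PingAck did not arrive in time.",
    "Apr 23 12:46:22 xen18 kernel: [  574.527464] drbd w2022_system: peer( Secondary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )",
    "Apr 23 12:46:22 xen18 kernel: [  574.584052] drbd w2022_system: conn( Unconnected -> WFConnection )",
]

for event in parse_transitions(sample):
    print(event)
```

Running the same function over both logs should make it easy to see that the two cases go through the same sequence of connection states.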

It seems the DomU can't write back to the DRBD device, because after I destroy the DomU (there is no other way to exit), I get the following errors:
libxl: error: libxl_exec.c:117:libxl_report_child_exitstatus: /etc/xen/scripts/block-drbd remove [1380] exited with error status 1
libxl: error: libxl_device.c:1259:device_hotplug_child_death_cb: script: /etc/xen/scripts/block-drbd failed; error detected.

The DomU can't release the DRBD device, and it looks like it is not released in xenstore either:
root@xen18:~# xl list
Name                                        ID   Mem VCPUs    State    Time(s)
Domain-0                                     0  4096     4     r-----     144.7
(null)                                       1     0    16     --p--d     387.5
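
A leftover domain like the `(null)` entry above can be detected automatically from the `xl list` output. This is a rough sketch under the assumption that the output has the six columns shown above and that a `d` in the State column marks a dying domain (as described in the xl man page); the function name `find_dying_domains` is hypothetical.

```python
def find_dying_domains(xl_list_output):
    """Return (name, domid, state) for every domain whose state contains 'd'."""
    dying = []
    # Skip the header line, then split each row from the right so that
    # the five numeric/state columns are separated from the name.
    for line in xl_list_output.strip().splitlines()[1:]:
        parts = line.rsplit(None, 5)
        if len(parts) != 6:
            continue  # Not a regular domain row.
        name, domid, _mem, _vcpus, state, _time = parts
        if "d" in state:
            dying.append((name.strip(), int(domid), state))
    return dying


# Output copied from the xl list call above.
sample_output = """\
Name                                        ID   Mem VCPUs    State    Time(s)
Domain-0                                     0  4096     4     r-----     144.7
(null)                                       1     0    16     --p--d     387.5
"""

print(find_dying_domains(sample_output))
```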
 
I ran many tests with different DRBD configs, but had no luck.
Sometimes Windows survives the disconnection, but then it freezes when the secondary reconnects, just as it does on disconnect.

I don't have the problem if:
- The DomU OS is Linux with the same config.
- The Xen PV (VBD) driver is not installed in the DomU.

Does the latest (unsigned) or any other Windows driver have any debug options?

Thank you for your help.
Best regards,

Attila





On 2025-04-23 at 13:04, Tu Dinh wrote:
Hi Attila,

On 22/04/2025 21:00, Kotán Attila wrote:
Hello,

I run many Windows guests as Xen DomUs with your WinPV drivers 
installed. I see the problem in a very specific, but not unique, case.
Windows freezes when the disk backend of the DomU is DRBD and the 
DRBD secondary node (DRBD status=secondary) disconnects, for 
example when I reboot that computer.

The problem is reproducible:
- Two DELL servers with PERC RAID controllers
- RAID0,1,10 virtual disks boot side
- The DRBD backend is a RAID virtual disk
- Windows with the Xen VBD driver is installed on the primary side

If the secondary side goes offline, the DRBD status changes to 
disconnected and DRBD keeps working, but Windows freezes.

- No problem with a Linux DomU
- No problem until the WinPV VBD driver is installed
- No problem if the DRBD backend is not a RAID virtual disk (for example 
with an M.2 NVMe backend)

I tested:
- WinPV driver 9.0 (signed)
- WinPV driver latest (testsigned)
- Citrix XenServer driver (managementagent-9.4.0)
and the problem occurs with all of them.

Everything works fine until the secondary DRBD node goes offline / 
disconnects. I think I have tested many situations, and in the end the 
WinPV driver is what remains as the cause of the problem, as far as I can tell.

Do you have any tips on what I can set?
How can I debug this problem?
I tried to see what happens in a Linux DomU when the DRBD status changes, 
but I couldn't find anything.

Do you have any ideas?

Thank you and best regards.
Attila

If the problem doesn't happen with the DRBD NVMe backend then it 
suggests a problem with DRBD itself, which is not handling failovers 
correctly.

Do you have any relevant kernel debug outputs when the freeze happens, 
as opposed to during a normal failover (e.g. with DRBD NVMe)?

Best regards,
Tu Dinh


Ngoc Tu Dinh | Vates XCP-ng Developer

XCP-ng & Xen Orchestra - Vates solutions

web: https://vates.tech





 

