Re: Windows domu DRBD backend problem
Hello Tu Dinh,

Update: I bought an NVMe drive today for testing, and unfortunately the problem is present when the DRBD backend is NVMe as well. I had tested this scenario before with a primary node that is not a DELL server (a desktop-class computer). It seems to be related only to DELL servers, or maybe to the multiprocessor environment. I only use DELL servers, so I have no information about other vendors. Thank you for your advice.

I have tried to collect all the relevant info / output below.

DRBD configs:

- global_common.conf
-----
global {
        usage-count yes;
        udev-always-use-vnr; # treat implicit the same as explicit volumes
}

common {
        handlers {
        }
        startup {
        }
        options {
        }
        disk {
                on-io-error detach;
                resync-rate 160M;
        }
        net {
        }
}
-----

- w2022_system.res
-----
resource w2022_system {
        protocol C;
        net {
        }
        syncer {
        }
        on xen18 {
                device /dev/drbd0;
                disk /dev/NVME01/w2022_system;
                address 172.16.16.8:7800;
                meta-disk internal;
        }
        on xen16 {
                device /dev/drbd0;
                disk /dev/VG02/w2022_system;
                address 172.16.16.6:7800;
                meta-disk internal;
        }
}
-----

DomU config:

- w2022.cfg
-----
name = 'w2022'
builder = 'hvm'
memory = 16384
#shadow_memory = 8
vcpus = 16
uuid = 'cac0559e-06fd-42fc-a92f-fa2d8cadaff1'
vif = [ 'bridge=xenbr0, mac=00:11:6c:1c:49:17' ]
disk = [ 'drbd:w2022_system,xvda,w', ]
boot = 'dc'
vnc = 1
vncunused = 0
vnclisten = '0.0.0.0'
vncdisplay = 2
stdvga = 1
on_poweroff = 'destroy'
on_reboot = 'restart'
on_crash = 'restart'
usb = 1
usbdevice = ['tablet']
-----

Nothing special in the configs. I have tested with Domain-0 running Debian 10, 11, 12 and testing (trixie). The DomU is Windows 7 or Windows 2022.

The test environment is:
- Node1 (xen18): DELL T630 with PERC H730 (1G, BBU) or a 1TB NVMe, as primary
- Node2 (xen16): DELL R730XD with PERC H730mini (1G, BBU), as secondary

I also have the problem in another environment with two DELL R640 servers.

The node1 kern.log with a PERC (DELL RAID controller) virtual disk as DRBD backend:
-----
Apr 23 12:46:22 xen18 kernel: [ 574.527385] drbd w2022_system: PingAck did not arrive in time.
Apr 23 12:46:22 xen18 kernel: [ 574.527464] drbd w2022_system: peer( Secondary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Apr 23 12:46:22 xen18 kernel: [ 574.527600] block drbd0: new current UUID 19B04B0803CC3C87:CB60C112A5C9EA5D:EBEF8CB4948C160D:EBEE8CB4948C160D
Apr 23 12:46:22 xen18 kernel: [ 574.527661] drbd w2022_system: ack_receiver terminated
Apr 23 12:46:22 xen18 kernel: [ 574.527665] drbd w2022_system: Terminating drbd_a_w2022_sy
Apr 23 12:46:22 xen18 kernel: [ 574.583747] drbd w2022_system: Connection closed
Apr 23 12:46:22 xen18 kernel: [ 574.584035] drbd w2022_system: conn( NetworkFailure -> Unconnected )
Apr 23 12:46:22 xen18 kernel: [ 574.584038] drbd w2022_system: receiver terminated
Apr 23 12:46:22 xen18 kernel: [ 574.584041] drbd w2022_system: Restarting receiver thread
Apr 23 12:46:22 xen18 kernel: [ 574.584043] drbd w2022_system: receiver (re)started
Apr 23 12:46:22 xen18 kernel: [ 574.584052] drbd w2022_system: conn( Unconnected -> WFConnection )
-----

The node1 kern.log with the NVMe DRBD backend:
-----
Apr 23 10:51:18 xen18 kernel: [ 912.800847] drbd w2022_system: PingAck did not arrive in time.
Apr 23 10:51:18 xen18 kernel: [ 912.800930] drbd w2022_system: peer( Secondary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Apr 23 10:51:18 xen18 kernel: [ 912.807793] block drbd0: new current UUID 2361269D52925DF1:AB440CB0842F155D:25CE81C2C028E09E:74D2255AB30D4115
Apr 23 10:51:18 xen18 kernel: [ 912.811762] drbd w2022_system: ack_receiver terminated
Apr 23 10:51:18 xen18 kernel: [ 912.811768] drbd w2022_system: Terminating drbd_a_w2022_sy
Apr 23 10:51:18 xen18 kernel: [ 912.853400] drbd w2022_system: Connection closed
Apr 23 10:51:18 xen18 kernel: [ 912.853723] drbd w2022_system: conn( NetworkFailure -> Unconnected )
Apr 23 10:51:18 xen18 kernel: [ 912.853727] drbd w2022_system: receiver terminated
Apr 23 10:51:18 xen18 kernel: [ 912.853729] drbd w2022_system: Restarting receiver thread
Apr 23 10:51:18 xen18 kernel: [ 912.853732] drbd w2022_system: receiver (re)started
Apr 23 10:51:18 xen18 kernel: [ 912.853740] drbd w2022_system: conn( Unconnected -> WFConnection )
-----

It seems the DomU cannot write back to the DRBD device, because after I destroy the DomU (there is no other way to exit), I get the following error:

libxl: error: libxl_exec.c:117:libxl_report_child_exitstatus: /etc/xen/scripts/block-drbd remove [1380] exited with error status 1
libxl: error: libxl_device.c:1259:device_hotplug_child_death_cb: script: /etc/xen/scripts/block-drbd failed; error detected.

The DomU cannot release the DRBD device, and apparently it cannot be released in the xenstore either:

root@xen18:~# xl list
Name                             ID   Mem VCPUs      State   Time(s)
Domain-0                          0  4096     4     r-----     144.7
(null)                            1     0    16     --p--d     387.5

I have tried many tests with different DRBD configs, but no luck. Sometimes Windows survives the disconnection, but when the secondary reconnects it freezes just like on disconnect.

I do not have the problem if:
- the DomU OS is Linux with the same config, or
- the Xen PV (VBD) driver is not installed in the DomU.

Does the latest (unsigned) driver, or any other Windows driver, have any debug options?

Thank you for your help.

Best Regards:
Attila
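For reference, the "PingAck did not arrive in time." messages quoted above are governed by the timeout options in the DRBD net section. Below is a minimal sketch of what tuning them could look like in w2022_system.res; the option names (ping-timeout, ping-int, timeout, ko-count) are standard DRBD net options, but the values shown are purely illustrative assumptions, and raising them would only be a mitigation experiment while debugging, not a fix for the underlying freeze:

-----
resource w2022_system {
        net {
                # time to wait for a PingAck, in tenths of a second (default 5 = 500 ms)
                ping-timeout 30;
                # interval between keep-alive pings, in seconds (default 10)
                ping-int 10;
                # request timeout towards the peer, in tenths of a second
                # (must stay below ping-int and connect-int)
                timeout 90;
                # number of timed-out requests before the peer is considered dead
                ko-count 7;
        }
}
-----

Such a change could be applied on both nodes without a full restart with `drbdadm adjust w2022_system`.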
On 23/04/2025 13:04, Tu Dinh wrote:
Hi Attila,

On 22/04/2025 21:00, Kotán Attila wrote:
> Hello,
>
> I run many Windows guests as Xen DomUs and have installed your WinPV driver. I experience the problem in a very specific case, but it is not unique. Windows freezes when the disk backend of the DomU is DRBD and the DRBD secondary node (DRBD status = Secondary) gets disconnected, for example when I reboot that computer.
>
> The problem is reproducible:
> - Two DELL servers with PERC RAID controllers
> - RAID0/1/10 virtual disks on the boot side
> - The DRBD backend is a RAID virtual disk
> - On the primary side, a Windows guest with the Xen VBD driver installed
>
> If the secondary side goes offline, the DRBD status changes to Disconnected but DRBD keeps working; the Windows guest, however, freezes.
> - No problem with a Linux DomU
> - No problem until the WinPV VBD driver is installed
> - No problem if the DRBD backend is not a RAID virtual disk (for example with an M.2 NVMe backend)
>
> I tested:
> - WinPV driver 9.0 (signed)
> - WinPV driver latest (testsigned)
> - Citrix XenServer driver (managementagent-9.4.0)
> and the problem occurs with every one of them. Everything works fine until the secondary DRBD node goes offline / disconnected. I think I have tested many situations, and in the end the WinPV driver is what I believe causes the problem.
>
> Do you have any tip on what I can set? How can I debug this problem? I tried to see what happens on a Linux DomU when the DRBD status changes, but I can't find anything. Do you have any ideas?
>
> Thank you and best regards,
> Attila

If the problem doesn't happen with the DRBD NVMe backend then it suggests a problem with DRBD itself, which is not handling failovers correctly. Do you have any relevant kernel debug outputs when the freeze happens, as opposed to during a normal failover (e.g. with DRBD NVMe)?

Best regards,
Tu Dinh Ngoc

Tu Dinh | Vates XCP-ng Developer
XCP-ng & Xen Orchestra - Vates solutions
web: https://vates.tech
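As a rough sketch of how the kernel debug output requested above could be captured on dom0 at the moment of the freeze: the resource name and the block-drbd script path are taken from this thread, while the log locations (/var/log/syslog, /var/log/xen/qemu-dm-w2022.log) are assumptions for a standard Debian Xen install and may differ.

-----
# DRBD view: connection, role and disk state from the kernel module and drbd-utils
cat /proc/drbd
drbdadm cstate w2022_system
drbdadm role w2022_system
drbdadm dstate w2022_system

# Kernel messages around the freeze (DRBD and the Xen block backend)
dmesg -T | grep -Ei 'drbd|vbd|xen'

# Xen side: domain states and the hypervisor console
xl list
xl dmesg

# Hotplug script and device-model output (assumed locations; the hotplug
# scripts log via syslog, so journalctl can be used instead)
grep -i block-drbd /var/log/syslog
tail -n 50 /var/log/xen/qemu-dm-w2022.log

# Any leftover block backend nodes still referencing the DRBD disk
xenstore-ls -f /local/domain/0/backend | grep -i drbd
-----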