|
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: [PATCH 2/2] Limit retry in ControllerGetResponse
On 21/10/2025 15:07, Owen Smith wrote: > Is there a way of detecting the backend has disappeared during > ControllerGetResponse? > As currently, there is a 100ms delay between attempts to trigger the backend > - its possible the backend could take longer to respond in normal operation > > Owen I couldn't think of a clean way to do that. One way is to try and probe FrontendIsBackendOnline every timeout expiration to see if it's still working. It wouldn't work if the backend became unresponsive, but I can add that as an additional check for v2. Another problem is with XENVIF_CONTROLLER_POLL_PERIOD as passed to EvtchnWait. This is similar to other places where we call KeStallExecutionProcessor in a loop at DISPATCH_LEVEL. This really limits how long we can wait in ControllerGetResponse, and I feel like 10x10ms is already way too long. IMO a better way would be to rework the controller (and eventually other components) to not run requests full-time at DISPATCH_LEVEL, so that we're free to wait for as long as we need. This is what I've been trying to do here: https://github.com/xcp-ng/win-xenvif/tree/dnt/controller-rework, but I need to find time to continue it... > > ________________________________________ > From: win-pv-devel <win-pv-devel-bounces@xxxxxxxxxxxxxxxxxxxx> on behalf of > Tu Dinh <ngoc-tu.dinh@xxxxxxxxxx> > Sent: 16 October 2025 11:16 AM > To: win-pv-devel@xxxxxxxxxxxxxxxxxxxx > Cc: Tu Dinh > Subject: [PATCH 2/2] Limit retry in ControllerGetResponse > > During unplug, NDIS may issue a hash reconfiguration. However, the > backend may disappear before the request has had a chance to complete. > This is observable by e.g. adding a delay at the beginning of > VifReceiverSetHashAlgorithm and unplugging. If this happens, Xenvif will > get stuck waiting forever in ControllerGetResponse, at DISPATCH_LEVEL to > boot. > > Limit the EvtchnWait retry to 10 attempts (10ms each to keep the > previous 100ms limit) and return an error code to the caller if the wait > has failed. > > Signed-off-by: Tu Dinh <ngoc-tu.dinh@xxxxxxxxxx> > --- > src/xenvif/controller.c | 13 +++++++++++-- > 1 file changed, 11 insertions(+), 2 deletions(-) > > diff --git a/src/xenvif/controller.c b/src/xenvif/controller.c > index e032972..109c030 100644 > --- a/src/xenvif/controller.c > +++ b/src/xenvif/controller.c > @@ -248,7 +248,7 @@ fail1: > #define TIME_MS(_ms) (TIME_US((_ms) * 1000)) > #define TIME_RELATIVE(_t) (-(_t)) > > -#define XENVIF_CONTROLLER_POLL_PERIOD 100 // ms > +#define XENVIF_CONTROLLER_POLL_PERIOD 10 // ms > > _IRQL_requires_(DISPATCH_LEVEL) > static NTSTATUS > @@ -258,11 +258,12 @@ ControllerGetResponse( > ) > { > LARGE_INTEGER Timeout; > + ULONG Attempt; > NTSTATUS status; > > Timeout.QuadPart = > TIME_RELATIVE(TIME_MS(XENVIF_CONTROLLER_POLL_PERIOD)); > > - for (;;) { > + for (Attempt = 0; Attempt < 10; Attempt++) { > ULONG Count; > > Count = XENBUS_EVTCHN(GetCount, > @@ -284,6 +285,13 @@ ControllerGetResponse( > __ControllerSend(Controller); > } > > + // Use STATUS_TRANSACTION_TIMED_OUT as an error code since > STATUS_TIMEOUT is > + // a success code. > + if (Controller->Response.id != Controller->Request.id) { > + status = STATUS_TRANSACTION_TIMED_OUT; > + goto done; > + } > + > ASSERT3U(Controller->Response.type, ==, Controller->Request.type); > > switch (Controller->Response.status) { > @@ -311,6 +319,7 @@ ControllerGetResponse( > if (NT_SUCCESS(status) && Data != NULL) > *Data = Controller->Response.data; > > +done: > RtlZeroMemory(&Controller->Request, > sizeof (struct xen_netif_ctrl_request)); > RtlZeroMemory(&Controller->Response, > -- > 2.51.0.windows.2 > > > > -- > Ngoc Tu Dinh | Vates XCP-ng Developer > > XCP-ng & Xen Orchestra - Vates solutions > > web: https://vates.tech > > > -- Ngoc Tu Dinh | Vates XCP-ng Developer XCP-ng & Xen Orchestra - Vates solutions web: https://vates.tech
|
![]() |
Lists.xenproject.org is hosted with RackSpace, monitoring our |