[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: [PATCH] xen/watchdog: Identify which domain watchdog fired
On 13/02/2025 5:00 pm, Jan Beulich wrote: > On 13.02.2025 17:46, Andrew Cooper wrote: >> When a watchdog fires, the domain is crashed and can't dump any state. >> >> Xen allows a domain to have two separate watchdogs. Therefore, for a >> domain running multiple watchdogs (e.g. one based around network, one >> for disk), it is important for diagnostics to know which watchdog >> fired. >> >> As the printk() is in a timer callback, this is a bit awkward to >> arrange, but there are 12 spare bits in the bottom of the domain >> pointer owing to its alignment. >> >> Reuse these bits to encode the watchdog id too, so the one which fired >> is identified when the domain is crashed. >> >> Signed-off-by: Andrew Cooper <andrew.cooper3@xxxxxxxxxx> >> --- >> CC: Anthony PERARD <anthony.perard@xxxxxxxxxx> >> CC: Michal Orzel <michal.orzel@xxxxxxx> >> CC: Jan Beulich <jbeulich@xxxxxxxx> >> CC: Julien Grall <julien@xxxxxxx> >> CC: Roger Pau Monné <roger.pau@xxxxxxxxxx> >> CC: Stefano Stabellini <sstabellini@xxxxxxxxxx> > You'll eventually need a scheduler maintainer's ack, yet you didn't Cc any > of them. Oops yes. Although really SCHEDOP and the scheduler shouldn't be mixed like this. > >> --- a/xen/common/sched/core.c >> +++ b/xen/common/sched/core.c >> @@ -1534,12 +1534,17 @@ long vcpu_yield(void) >> >> static void cf_check domain_watchdog_timeout(void *data) >> { >> - struct domain *d = data; >> + /* >> + * The data parameter encodes the watchdog id in the low bits of >> + * the domain pointer. >> + */ >> + struct domain *d = _p((unsigned long)data & PAGE_MASK); >> + unsigned int id = (unsigned long)data & ~PAGE_MASK; >> >> if ( d->is_shutting_down || d->is_dying ) >> return; >> >> - printk("Watchdog timer fired for domain %u\n", d->domain_id); >> + printk("Watchdog timer %u fired for %pd\n", id, d); > And apriori knowledge will be required to associate the number with whichever > watchdog it was (network or disk in your example)? (No question that logging > the number is better than not doing so.) Indeed, but that's up to the entities in the domain requesting the watchdog. (Yes, we do have this logged.) > >> @@ -1593,7 +1598,17 @@ void watchdog_domain_init(struct domain *d) >> d->watchdog_inuse_map = 0; >> >> for ( i = 0; i < NR_DOMAIN_WATCHDOG_TIMERS; i++ ) >> - init_timer(&d->watchdog_timer[i], domain_watchdog_timeout, d, 0); >> + { >> + void *data = d; >> + >> + BUILD_BUG_ON(NR_DOMAIN_WATCHDOG_TIMERS >= PAGE_SIZE); >> + >> + /* >> + * For the timer callback parameter, encode the watchdog id in >> + * the low bits of the domain pointer. >> + */ >> + init_timer(&d->watchdog_timer[i], domain_watchdog_timeout, data + >> i, 0); >> + } > This way we'll be promising to ourselves that we're never going to alter > the allocation mechanism of struct domain instances, always requiring > them to have at least page alignment. If someone wanted to change that, > they'll have a hard time spotting the logic here. Sadly I have no good > suggestion towards improving the situation. I wasn't terribly happy either, but something has occurred to me. For both struct domain and vcpu, we could have an __aligned(PAGE_SIZE) attribute. It's accurate and unlikely to change (and helps x86 code generation at least). Then, we can use: BUILD_BUG_ON((NR_DOMAIN_WATCHDOG_TIMERS > alignof(d)); which should trigger cleanly if the precondition is violated. For watchdog specifically, we only actually need uint16_t alignment to be safe, and there's no way that's going to break in practice. ~Andrew
|
![]() |
Lists.xenproject.org is hosted with RackSpace, monitoring our |