Re: dom0 suddenly blocking on all access to md device

  • To: xen-devel@xxxxxxxxxxxxxxxxxxxx
  • From: Andy Smith <andy@xxxxxxxxxxxxxx>
  • Date: Sat, 12 Jun 2021 23:13:57 +0000
  • Delivery-date: Sat, 12 Jun 2021 23:14:08 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>
  • Openpgp: id=BF15490B; url=http://strugglers.net/~andy/pubkey.asc

Hi Rob,

On Sat, Jun 12, 2021 at 05:47:49PM -0500, Rob Townley wrote:
> mdadm.conf has email reporting capabilities to alert to failing drives.
> Test that you receive emails.

I do receive those emails, when such things occur, but the drives
are not failing.

Devices are not kicked out of the MD arrays; all IO just stalls
completely. These incidents also coincide with an upgrade of the OS
and hypervisor, and have happened on 5 different servers so far, so
it is highly unlikely that so many devices suddenly went bad.
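
One way to confirm that no member has been kicked out is to look at
/proc/mdstat. Below is a minimal Python sketch, assuming the usual
mdstat layout (an "mdX : ..." line followed by a status line ending in
something like "[2/2] [UU]"). The function name degraded_arrays is
made up for illustration and is not part of mdadm.

```python
import re

def degraded_arrays(mdstat_text):
    """Return names of md arrays that show a failed or missing member.

    Assumes the common /proc/mdstat layout; this is an illustrative
    helper, not an mdadm interface.
    """
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"(md\w+)\s*:", line)
        if header:
            current = header.group(1)
            # A failed member is flagged with "(F)" on the array line.
            failed = "(F)" in line
        else:
            # Status line, e.g. "[2/1] [U_]"; "_" marks a missing member.
            status = re.search(r"\[\d+/\d+\]\s+\[([U_]+)\]", line)
            failed = bool(status) and "_" in status.group(1)
        if failed and current is not None and current not in degraded:
            degraded.append(current)
    return degraded
```

Calling degraded_arrays(open("/proc/mdstat").read()) on a host should
return an empty list when every array still has all of its members,
which is the situation described above even while IO is stalled.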

> Use mdadm to run tests on the raid.

Weekly scrubs take place using /usr/share/mdadm/checkarray

> smartctl -a /dev/

Yep, SMART health checks and self-testing are enabled.

I've now put two test servers on linux-image-amd64/buster-backports,
and whenever one of the production servers experiences the issue I
will boot it into that kernel afterwards.