[Xen-devel] Help commissioning x86 boxes intended for builds [himrod[012]]

Sorry for the rather random CC list.

Last year we bought a variety of test boxes.  Amongst them were three
biggish Intel machines which I had primarily intended for use as
dedicated build servers.  These are himrod[012].

Unfortunately I have not been able to commission them because they
have been failing their commissioning tests.  Investigations have not
found the problem.

The symptom is that, occasionally, the network stops working for a
while.  It then comes back, spontaneously.  There are no log messages
recorded on the box itself in /var/log for this; no messages on the
serial console.
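Since nothing is logged on the box itself, one way to get timestamps
for the outages from the box's side would be a small watcher that
probes a peer over TCP and writes each down/up transition to local
disk.  The following is only a sketch under stated assumptions: the
peer host and port, the log file name and the function names are all
hypothetical and are not part of the existing test harness.

```python
import os
import socket
import time


def probe(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def outages(samples):
    """Collapse (timestamp, ok) samples into (start, end) outage intervals.

    An outage still in progress at the last sample has end == None.
    """
    result = []
    start = None
    for ts, ok in samples:
        if not ok and start is None:
            start = ts
        elif ok and start is not None:
            result.append((start, ts))
            start = None
    if start is not None:
        result.append((start, None))
    return result


def monitor(host, port, interval=1.0, logfile="netwatch.log"):
    """Probe forever, appending a line to logfile on each state change."""
    was_ok = True
    with open(logfile, "a") as log:
        while True:
            ok = probe(host, port)
            if ok != was_ok:
                state = "UP" if ok else "DOWN"
                log.write("%.3f %s\n" % (time.time(), state))
                log.flush()
                os.fsync(log.fileno())  # keep the record if the box dies
                was_ok = ok
            time.sleep(interval)
```

Running something like monitor("controller.example", 22) on the test
box during a job (the host name here is a placeholder) would give
local timestamps to line up against the controller's logs.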

The failure probability is about 10% for any one individual test job.
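With a per-job failure rate of this size, a handful of clean runs
after any change says little.  As a rough guide, and assuming jobs
fail independently with a fixed probability (an assumption, not
something the tests have established), the number of consecutive
passes needed before a fix looks convincing can be worked out as
follows.  The function name is illustrative only.

```python
import math


def passes_needed(p_fail, confidence):
    """Smallest n with (1 - p_fail)**n <= 1 - confidence.

    That is, the number of consecutive passing jobs after which an
    unfixed box would have been unlikely (at the given confidence)
    to get through without a single failure.
    """
    return math.ceil(math.log(1 - confidence) / math.log(1 - p_fail))


print(passes_needed(0.10, 0.95))  # 29
print(passes_needed(0.10, 0.99))  # 44
```

So at about 10% per job, roughly 29 clean jobs in a row are needed
for 95% confidence, and 44 for 99%.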

It seems to do it only under Xen with our own kernels (4.14.x).
For initial installation and for builds we use stock Debian
kernels (currently, jessie, so 3.16.56-1 for the installer and
3.16.57-2 for the installed system); and I haven't seen failures
there.  I have not tried other combinations (yet).

They have Intel I350 NICs.  We have the same NICs in another pair of
boxes, debina[01], which work fine.  (The himrods are set to
use UEFI; the debinas BIOS.)

I have already had the machines' firmware updated.

I know that it isn't the whole machine freezing because here is an
example where the test box itself experiences a timeout trying to talk
to the network:


Here when the test box's network connection starts working again, the
TCP carrying the ssh session eventually retransmits and then the TCP
connection is unblocked, so the test box ends up sending the whole lot
of buffered up error messages to the controller VM which duly logs
them.

The most recent failed commissioning attempt was for himrods 0 and 2
only.  himrod1 has an unrelated problem with its serial cable.
However, my notes indicate that I had previous problems with himrod1
too.  So I think we need to take this as applying to these three
machines.

Just in case, I have asked Credativ to give the machines fresh cables
to different switch ports.

I don't have a good model of what to do (or try) next.  Suggestions
welcome.  The full report from my most recent commissioning test
attempt is here:


