Re: Serious AMD-Vi(?) issue



On Mon, Jul 01, 2024 at 11:07:57AM -0700, Elliott Mitchell wrote:
> On Thu, Jun 27, 2024 at 05:18:15PM -0700, Elliott Mitchell wrote:
> > I'm rather surprised it was so long before the next system restart.  
> > Seems a quiet period as far as security updates go.  Good news is I made
> > several new observations, but I don't know how valuable these are.
> > 
> > On Mon, May 13, 2024 at 10:44:59AM +0200, Roger Pau Monné wrote:
> > > 
> > > Does booting with `iommu=no-intremap` lead to any issues being
> > > reported?
> > 
> > On boot there was in fact less.  Notably the "AMD-Vi" messages haven't
> > shown up at all.  I haven't stressed it very much yet, but on previous
> > boots a message showed up the moment the MD-RAID1 driver was loaded.
> > 
> > 
> > I am though seeing two different messages now:
> > 
> > (XEN) CPU#: No irq handler for vector # (IRQ -#, LAPIC)
> > (XEN) IRQ# a=#[#,#] v=#[#] t=PCI-MSI s=#
> > 
> > These appear to be showing up in pairs.  Multiple values show for each field,
> > though each field appears to vary between 2-3 different values.  There
> > are thousands of these messages showing up.
> 
> Some lucky timing so I've done some more experimentation and sampling.
> 
> The "(XEN) IRQ" line almost always shows up with the "(XEN) CPU" line.
> I notice it is possible to generate the first without the second, so this
> seems notable.  Every single "(XEN) CPU" line mentioned "LAPIC".
> 
> In the small number (20) of cases where "(XEN) IRQ" did not show up, the
> "(XEN) CPU" line always ended with "(IRQ -2147483648, LAPIC)".
> 
> For the "t=" value out of 316 samples, 94 listed "PCI-MSI" while 222
> listed "PCI-MSI/-X".
> 
> For the IRQ, 72 occurred 126 times.  71, 73 and 108 occurred roughly 50
> times each. 109 and 111 occurred under 10 times.  Almost no other IRQ
> values appeared.
> 
> The "s=" value was "00000030" slightly more often than "00000010".  No
> other values have been observed so far.
> 
> The other values didn't show many patterns.
> 
> Most processors were mentioned roughly equally.  Several had fewer
> mentions, but not enough to seem significant.  I discovered processor 1
> did NOT show up.  Whereas processor 0 had an above average number of
> occurrences.  This seems notable as these 2 processors are both reserved
> exclusively for domain 0.

All of the patterns continue.  There are more reports on processor 0 than
any other processor, but not enough to look particularly suspicious.
What *does* look suspicious is the complete absence of reports from
processor 1.
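
For anyone wanting to repeat this sort of tallying on their own logs, a
minimal sketch follows.  It is not necessarily how the counts above were
produced.  It assumes each "#" in the two message shapes quoted earlier
stands for a decimal or hexadecimal number, and the names tally() and
silent_cpus() are made up for illustration.

```python
import re
from collections import Counter

# Patterns modelled on the two message shapes quoted in this thread.
# Assumption: each "#" stands for a decimal or hexadecimal number.
CPU_RE = re.compile(
    r"\(XEN\) CPU(?P<cpu>\d+): No irq handler for vector (?P<vec>[0-9a-fA-F]+) "
    r"\(IRQ (?P<irq>-?\d+), LAPIC\)"
)
IRQ_RE = re.compile(
    r"\(XEN\) IRQ\s*(?P<irq>\d+) .*?t=(?P<type>\S+) s=(?P<status>[0-9a-fA-F]+)"
)


def tally(lines):
    """Count CPUs, IRQs, t= types and s= values seen in the given log lines."""
    counts = {name: Counter() for name in ("cpu", "irq", "type", "status")}
    for line in lines:
        m = CPU_RE.search(line)
        if m:
            # "(XEN) CPU" line: record which processor reported it.
            counts["cpu"][int(m.group("cpu"))] += 1
            continue
        m = IRQ_RE.search(line)
        if m:
            # "(XEN) IRQ" line: record the IRQ number, type and s= value.
            counts["irq"][int(m.group("irq"))] += 1
            counts["type"][m.group("type")] += 1
            counts["status"][m.group("status")] += 1
    return counts


def silent_cpus(counts, ncpus):
    """Return the CPU numbers in range(ncpus) that never appear in the log."""
    return [c for c in range(ncpus) if c not in counts["cpu"]]
```

Fed a saved copy of the hypervisor log (for example the output of
`xl dmesg`), tally() gives per-field counts via Counter.most_common(),
and silent_cpus() lists processors with no reports at all, which is the
check that singles out processor 1 here.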

> There have also been a few "spurious 8259A interrupt" lines.  So far
> there haven't been very many of these.  The processor and IRQ listed
> don't yet appear to show any patterns.  So far no IRQ has been listed
> twice.

IRQs 3-7 and 9-15 have each shown up once.  1-2 and 8 haven't shown up
so far.


Things look different enough to try reenabling Linux software RAID1.  I'm
going to continue monitoring closely, but so far it seems
"iommu=no-intremap" may in fact mitigate the issue with software RAID1.

This seems odd, but I'm simply reporting what I observe.  I would have
expected to see problem indications by now, yet there aren't any.


-- 
(\___(\___(\______          --=> 8-) EHM <=--          ______/)___/)___/)
 \BS (    |         ehem+sigmsg@xxxxxxx  PGP 87145445         |    )   /
  \_CS\   |  _____  -O #include <stddisclaimer.h> O-   _____  |   /  _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445





 

