
Re: [Xen-devel] [RFC PATCH 21/24] ARM: vITS: handle INVALL command



Hi Stefano,

thanks for the prompt and helpful answer!

On 10/12/16 00:30, Stefano Stabellini wrote:
> On Fri, 9 Dec 2016, Andre Przywara wrote:
>>>> I've been spending some time thinking about this, and I think we can in
>>>> fact get away without ever propagating commands from domains to the host.
>>>>
>>>> I made a list of all commands that possibly require host ITS command
>>>> propagation. There are two groups:
>>>> 1: enabling/disabling LPIs: INV, INVALL
>>>> 2: mapping/unmapping devices/events: DISCARD, MAPD, MAPTI.
>>>>
>>>> The second group can be handled by mapping all required devices up
>>>> front, I will elaborate on that in a different email.
>>>>
>>>> For the first group, read below ...
>>>>
>>>> On 01/12/16 01:19, Stefano Stabellini wrote:
>>>>> On Fri, 25 Nov 2016, Julien Grall wrote:
>>>>>> Hi,
>>>>>>
>>>>>> On 18/11/16 18:39, Stefano Stabellini wrote:
>>>>>>> On Fri, 11 Nov 2016, Stefano Stabellini wrote:
>>>>>>>> On Fri, 11 Nov 2016, Julien Grall wrote:
>>>>>>>>> On 10/11/16 20:42, Stefano Stabellini wrote:
>>>>>>>>> That's why the approach we had in the previous series was "host ITS
>>>>>>>>> commands should be limited when emulating guest ITS commands". As I
>>>>>>>>> recall, in that series the host and guest LPIs were fully separated
>>>>>>>>> (enabling a guest LPI was not enabling a host LPI).
>>>>>>>>
>>>>>>>> I am interested in reading what Ian suggested to do when the physical
>>>>>>>> ITS queue is full, but I cannot find anything specific about it in the
>>>>>>>> doc.
>>>>>>>>
>>>>>>>> Do you have a suggestion for this?
>>>>>>>>
>>>>>>>> The only things that come to mind right now are:
>>>>>>>>
>>>>>>>> 1) check if the ITS queue is full and busy loop until it is not 
>>>>>>>> (spin_lock
>>>>>>>> style)
>>>>>>>> 2) check if the ITS queue is full and sleep until it is not (mutex 
>>>>>>>> style)
>>>>>>>
>>>>>>> Another, probably better idea, is to map all pLPIs of a device when the
>>>>>>> device is assigned to a guest (including Dom0). This is what was written
>>>>>>> in Ian's design doc. The advantage of this approach is that Xen doesn't
>>>>>>> need to take any actions on the physical ITS command queue when the
>>>>>>> guest issues virtual ITS commands, therefore completely solving this
>>>>>>> problem at the root. (Although I am not sure about enable/disable
>>>>>>> commands: could we avoid issuing enable/disable on pLPIs?)
>>>>>>
>>>>>> In the previous design document (see [1]), the pLPIs are enabled when the
>>>>>> device is assigned to the guest. This means that it is not necessary
>>>>>> to send commands there. It also means we may receive a pLPI before
>>>>>> the associated vLPI has been configured.
>>>>>>
>>>>>> That said, given that LPIs are edge-triggered, there is no deactivate 
>>>>>> state
>>>>>> (see 4.1 in ARM IHI 0069C). So as soon as the priority drop is done, the 
>>>>>> same
>>>>>> LPIs could potentially be raised again. This could generate a storm.
>>>>>
>>>>> Thank you for raising this important point. You are correct.
>>>>>
>>>>>> The priority drop is necessary if we don't want to block the reception of
>>>>>> interrupt for the current physical CPU.
>>>>>>
>>>>>> What I am more concerned about is that this problem can also happen in
>>>>>> normal operation (i.e. the pLPI is bound to a vLPI and the vLPI is
>>>>>> enabled). For edge-triggered interrupts, there is no way to prevent
>>>>>> them from firing again. Maybe it is time to introduce interrupt
>>>>>> rate-limiting for ARM. Any opinions?
>>>>>
>>>>> Yes. It could be as simple as disabling the pLPI when Xen receives a
>>>>> second pLPI before the guest EOIs the first corresponding vLPI, which
>>>>> shouldn't happen in normal circumstances.
>>>>>
>>>>> We need a simple per-LPI inflight counter, incremented when a pLPI is
>>>>> received, decremented when the corresponding vLPI is EOIed (the LR is
>>>>> cleared).
>>>>>
>>>>> When the counter > 1, we disable the pLPI and request a maintenance
>>>>> interrupt for the corresponding vLPI.
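[Editor's note: the counter scheme quoted above could be sketched roughly as below. All names (struct lpi_state, lpi_received, vlpi_eoied) are invented for illustration and are not Xen's actual interfaces.]

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-LPI state for storm detection; names invented. */
struct lpi_state {
    uint32_t inflight;          /* pLPIs received, not yet EOIed as vLPI */
    bool plpi_enabled;          /* state of the physical LPI */
    bool maintenance_irq_requested;
};

/* Called when a pLPI arrives for this interrupt. */
static void lpi_received(struct lpi_state *s)
{
    s->inflight++;
    if (s->inflight > 1) {
        /* A second pLPI before the guest EOIed the first vLPI:
         * treat as a potential storm, mask the pLPI and request a
         * maintenance interrupt for the corresponding vLPI. */
        s->plpi_enabled = false;
        s->maintenance_irq_requested = true;
    }
}

/* Called when the corresponding vLPI is EOIed (the LR is cleared). */
static void vlpi_eoied(struct lpi_state *s)
{
    if (s->inflight)
        s->inflight--;
    if (s->inflight == 0 && !s->plpi_enabled) {
        /* Storm condition over: re-enable the pLPI. */
        s->plpi_enabled = true;
        s->maintenance_irq_requested = false;
    }
}
```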
>>>>
>>>> So why do we need a _counter_? This is about edge triggered interrupts,
>>>> I think we can just accumulate all of them into one.
>>>
>>> The counter is not to re-inject the same amount of interrupts into the
>>> guest, but to detect interrupt storms.
>>
>> I was wondering if an interrupt "storm" could already be defined by
>> "receiving an LPI while there is already one pending (in the guest's
>> virtual pending table) and it being disabled by the guest". I admit that
>> declaring two interrupts as a storm is a bit of a stretch, but in fact
>> the guest probably had a reason for disabling it even though it
>> fires, so Xen should just follow suit.
>> The only difference is that we don't do it _immediately_ when the guest
>> tells us (via INV), but only if needed (LPI actually fires).
> 
> Either way should work OK, I think.
> 
> 
>>>>   - If the vLPI is enabled, we EOI it on the host and inject it.
>>>>   - If the vLPI is disabled, we set the pending bit in the VCPU's
>>>>     pending table and EOI on the host - to allow other IRQs.
>>>> - On a guest INV command, we check whether that vLPI is now enabled:
>>>>   - If it is disabled now, we don't need to do anything.
>>>>   - If it is enabled now, we check the pending bit for that vLPI:
>>>>     - If it is 0, we don't do anything.
>>>>     - If it is 1, we inject the vLPI and clear the pending bit.
>>>> - On a guest INVALL command, we just need to iterate over the virtual
>>>>   LPIs.
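[Editor's note: the INV handling in the quoted list boils down to one conditional; a minimal sketch, where handle_inv and the injection stub are invented names:]

```c
#include <stdbool.h>

static int injected;            /* counts injections, for illustration */

static void inject_vlpi(void)   /* stand-in for the real injection path */
{
    injected++;
}

/* Emulating a guest INV for one vLPI: the only immediate action is
 * needed when the vLPI is enabled AND its pending bit is set. */
static void handle_inv(bool enabled_now, bool *pending)
{
    if (!enabled_now)
        return;                 /* disabled now: nothing to do */
    if (*pending) {
        inject_vlpi();          /* enabled and pending: inject now */
        *pending = false;       /* and clear the pending bit */
    }
}
```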
>>>
>>> Right, much better.
>>>
>>>
>>>> If you look at the conditions above, the only immediate action is
>>>> when a VLPI gets enabled _and_ its pending bit is set. So we can do
>>>> 64-bit read accesses over the whole pending table to find non-zero words
>>>> and thus set bits, which should be rare in practice. We can store the
>>>> highest mapped VLPI to avoid iterating over the whole of the table.
>>>> Ideally the guest has no direct control over the pending bits, since
>>>> this is what the device generates. Also we limit the number of VLPIs in
>>>> total per guest anyway.
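[Editor's note: the 64-bit scan described above might look like the sketch below; it assumes a GCC/Clang builtin for finding set bits, and all names are invented:]

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch: scan the virtual pending table 64 vLPIs per read. Returns
 * the number of pending vLPIs found, written into "out". */
static size_t scan_pending_table(const uint64_t *table, uint32_t highest_vlpi,
                                 uint32_t *out, size_t out_max)
{
    size_t n = 0;
    size_t words = highest_vlpi / 64 + 1;  /* bound by highest mapped vLPI */

    for (size_t w = 0; w < words; w++) {
        uint64_t bits = table[w];          /* one read covers 64 vLPIs */
        while (bits && n < out_max) {      /* mostly zero in practice */
            int bit = __builtin_ctzll(bits);
            out[n++] = (uint32_t)(w * 64 + bit);
            bits &= bits - 1;              /* clear lowest set bit */
        }
    }
    return n;
}
```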
>>>
>>> I wonder if we could even use a fully packed bitmask with only the
>>> pending bits, so 1 bit per vLPI, rather than 1 byte per vLPI. That would
>>> be a nice improvement.
>>
>> The _pending_ table is exactly that: one bit per VLPI.
> 
> Actually the spec says about the pending table, ch 6.1.2:
> 
> "Each Redistributor maintains entries in a separate LPI Pending table
> that indicates the pending state of each LPI when GICR_CTLR.EnableLPIs
> == 1 in the Redistributor:
>   0 The LPI is not pending.
>   1 The LPI is pending.
> 
> For a given LPI:
> • The corresponding byte in the LPI Pending table is (base address + (N / 8)).
> • The bit position in the byte is (N MOD 8)."

        ^^^

> It seems to me that each LPI is supposed to have a byte, not a bit. Am I
> looking at the wrong table?

Well, the explanation could indeed be a bit more explicit, but it's
really meant to be a bit:
1) The two lines above describe how to address a single bit in a
byte-addressed array.
2) The following paragraphs talk about "the first 1KB" when it comes to
non-LPI interrupts. This matches 8192 bits.
3) In section 6.1 the spec states: "Memory-backed storage for LPI
pending _bits_ in an LPI Pending table."
4) The actual instance, Marc's Linux driver, also speaks of a bit:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/drivers/irqchip/irq-gic-v3-its.c#n785
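[Editor's note: the byte/bit arithmetic from the quoted spec text can be written out directly; the helper names below are invented:]

```c
#include <stdbool.h>
#include <stdint.h>

/* The pending state of LPI N lives in byte (N / 8), bit (N MOD 8) of
 * the Pending table, i.e. one *bit* per LPI. */
static bool lpi_is_pending(const uint8_t *table, uint32_t n)
{
    return (table[n / 8] >> (n % 8)) & 1;
}

static void lpi_set_pending(uint8_t *table, uint32_t n)
{
    table[n / 8] |= (uint8_t)(1u << (n % 8));
}
```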

> In any case you suggested to trap the pending table, so we can actually
> write anything we want to those guest provided pages.

Indeed:
"During normal operation, the LPI Pending table is maintained solely by
the Redistributor."

>> So by doing a
>> 64-bit read we cover 64 VLPIs. And normally if an LPI fires it will
>> probably be enabled (otherwise the guest would have disabled it in the
>> device), so we inject it and don't need this table. It's
>> really just for storing the pending status should an LPI arrive while
>> the guest had _disabled_ it. I assume this is rather rare, so the table
>> will mostly be empty: that's why I expect most reads to be 0 and the
>> iteration of the table to be very quick. As an additional optimization
>> we could store the highest and lowest virtually pending LPI, to avoid
>> scanning the whole table.
>>
>> We can't do so much about the property table, though, because its layout
>> is described in the spec - in contrast to the ITS tables, which are IMPDEF.
>> But as we only need to do something if the LPI is _both_ enabled _and_
>> pending, scanning the pending table gives us a quite good filter
>> already. For the few LPIs that hit here, we can just access the right
>> byte in the property table.
> 
> OK
> 
> 
>>>> If that still sounds like a DOS vector, we could additionally rate-limit
>>>> INVALLs, and/or track additions to the pending table after the last
>>>> INVALL: if there haven't been any new pending bits since the last scan,
>>>> INVALL is a NOP.
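[Editor's note: the "INVALL is a NOP" idea could be as simple as a per-domain dirty flag; a sketch with invented names:]

```c
#include <stdbool.h>

/* Set whenever a new pending bit is recorded, cleared by the scan. */
static bool pending_dirty;

static void note_new_pending_bit(void)
{
    pending_dirty = true;
}

/* Returns true if the INVALL emulation actually needs to scan the
 * pending table, false if it can be treated as a NOP. */
static bool invall_needs_scan(void)
{
    if (!pending_dirty)
        return false;           /* nothing new since the last scan */
    pending_dirty = false;      /* the scan will observe everything so far */
    return true;
}
```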
>>>>
>>>> Does that make sense so far?
>>>
>>> It makes sense. It should be OK.
>>>
>>>
>>>> So that just leaves us with this IRQ storm issue, which I am thinking
>>>> about now. But I guess this is not a show stopper given we can disable
>>>> the physical LPI if we sense this situation.
>>>
>>> That is true and it's exactly what we should do.
>>>
>>>
>>>>> When we receive the maintenance interrupt and we clear the LR of the
>>>>> vLPI, Xen should re-enable the pLPI.
>>
>> So I was thinking why you would need to wait for the guest to actually
>> EOI it?
>> Can't we end the interrupt storm condition at the moment the guest
>> enables the interrupt? LPIs are edge triggered and so a storm in the
>> past is easily merged into a single guest LPI once the guest enables it
>> again. From there on we inject every triggered LPI into the guest.
> 
> What you describe works if the guest disables the interrupt. But what if
> it doesn't? Xen should also be able to cope with non-cooperative guests
> which might even have triggered the interrupt storm on purpose.
> 
> I think that you are asking this question because we are actually
> talking about two different issues. See below.

Indeed I was wondering about that ...

> 
>> This special handling for the interrupt storm just stems from the fact
>> that we have to keep LPIs enabled at the hardware interrupt controller
>> level, despite the guest having disabled them on its own _virtual_ GIC.
>> So once the guest enables them again, we are in line with the current
>> GICv2/GICv3, aren't we? Do we have interrupt storm detection/prevention
>> at the moment?
> 
> No, that's not the cause of the storm. Julien described well before:
> 
>   given that LPIs are edge-triggered, there is no deactivate state (see 4.1
>   in ARM IHI 0069C). So as soon as the priority drop is done, the same LPIs 
> could
>   potentially be raised again. This could generate a storm.
> 
> The problem is that Xen has to do priority drop upon receiving a pLPI,
> but, given that LPIs don't have a deactivate state, the priority drop is
> enough to let the hardware inject a second LPI, even if the guest didn't
> EOI the first one yet.
> 
> In the case of SPIs, the hardware cannot inject a second interrupt after
> Xen does priority drop, it has to wait for the guest to EOI it.

I understand that ...

>> And please keep in mind that LPIs are _always_ edge triggered: So once
>> we EOI an LPI on the host, this "very same" LPI is gone from a h/w GIC
>> point of view, the next incoming interrupt must have been triggered by a
>> new interrupt condition in the device (new network packet, for
>> instance).
> 
> This is actually the problem: what if the guest configured the device on
> purpose to keep generating LPIs without pause? Nice and simple way to
> take down the host.

I see, I just wasn't sure we were talking about the same thing: actual
interrupt storm triggered by the device vs. "virtual" interrupt storm
due to an interrupt line not being lowered by the IRQ handler.
And I was hoping for the latter, but well ...
So thanks for the clarification.

>> In contrast to GICv2 this applies to _every_ LPI.
>> So I am not sure we should really care _too_ much about this (apart from
>> the "guest has disabled it" part): Once we assign a device to a guest,
>> we lose some control over the machine anyway and at least trust the
>> device to not completely block the system.
> 
> No we don't! Hardware engineers make mistakes too!

You tell me ... ;-)

> We have to protect
> Xen from devices which purposely or mistakenly generate interrupt
> storms. This is actually a pretty common problem.

I see, though this doesn't make this whole problem easier ;-)

>> I don't see how the ITS differs in that respect from the GICv3/GICv2.
> 
> It differs because in the case of GICv2, every single interrupt has to
> be EOI'd by the guest.

Sure, though I think technically it's "deactivated" here that matters
(we EOI LPIs as well). And since LPIs have no active state, this makes
the difference.

> Therefore the Xen scheduler can still decide to
> schedule it out. In the case of the ITS, Xen could be stuck in an
> interrupt handling loop.

So I was wondering whether we might need to relax our new strict "no
(unprivileged) guest ever causes a host ITS command to be queued" rule a
bit, because:
- Disabling an LPI is a separate issue, as we can trigger this in Xen
interrupt context once we decide that this is an interrupt storm.
- But enabling it again has to both happen in a timely manner (as the
guest expects interrupts to come in) and be triggered by a guest action,
which causes the INV command to be sent when handling a guest fault.

Now this INV command (and possibly a follow-up SYNC) for enabling an LPI
would be the only critical ones, so I was wondering if we could ensure
that these commands can always be queued immediately, by making sure we
have at least two ITS command queue slots available all of the time.
Other ITS commands (triggered by device passthrough setup, for instance)
would then potentially have to wait if we foresee that they could fill
up the host command queue.
Something like QoS for ITS commands.
And I think we should map the maximum command queue size on the host
(1MB => 32768 commands) to make this scenario less likely.
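[Editor's note: the slot-reservation idea could be sketched as an occupancy check on the command queue ring; the constants and function below are illustrative, not actual Xen code:]

```c
#include <stdbool.h>
#include <stdint.h>

#define ITS_CMD_QUEUE_ENTRIES 32768u  /* 1 MB queue / 32-byte commands */
#define RESERVED_SLOTS        2u      /* kept free for a guest INV + SYNC */

/* "critical" commands (the INV/SYNC needed to re-enable an LPI) may use
 * the reserved slots; anything else must leave them free. */
static bool can_queue(uint32_t head, uint32_t tail, bool critical)
{
    uint32_t used = (tail + ITS_CMD_QUEUE_ENTRIES - head)
                    % ITS_CMD_QUEUE_ENTRIES;
    /* One slot stays unused to distinguish "full" from "empty". */
    uint32_t free_slots = ITS_CMD_QUEUE_ENTRIES - 1 - used;

    return critical ? free_slots >= 1 : free_slots > RESERVED_SLOTS;
}
```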

I will need to think about this a bit further, maybe implement something
as a proof of concept.

Cheers,
Andre.

> 
>> A quick update on my side: I implemented the scheme I described in my
>> earlier mail now and it boots to the Dom0 prompt on a fastmodel with an ITS:
> 
> Nice!
> 
> 
>> - On receiving the PHYSDEVOP_manage_pci_add hypercall in
>> xen/arch/arm/physdev.c, we MAPD the device on the host, MAPTI a bunch of
>> interrupts and enable them. We keep them unassigned in our host
>> pLPI->vLPI table, so we discard them should they fire.
>> This hypercall is issued by Dom0 Linux before bringing up any PCI
>> devices, so it works even for Dom0 without any Linux changes. For DomUs
>> with PCI passthrough Dom0 is expected to issue this hypercall on behalf
>> of the to-be-created domain.
>> - When a guest (be it Dom0 or DomU) actually maps an LPI (MAPTI), we
>> just enter the virtual LPI number and the target VCPU in our pLPI->vLPI
>> table and are done. Should it fire now, we know where to inject it, but
>> consult the enabled bit in the guest's property table before doing so.
>> - When a guest (be it Dom0 or DomU) enables or disables an interrupt, we
>> don't do much, as we consult the enable bit every time we want to
>> inject anyway. The only thing I actually do is to inject an LPI if
>> there is a virtual LPI pending and the LPI is now enabled.
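[Editor's note: the pLPI->vLPI table lookup described above might look roughly like this; the entry layout and names are invented, and 8192 is the first LPI INTID per the GICv3 spec:]

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-host-LPI translation entry: filled with INVALID_VLPI
 * when a device's interrupts are MAPTI-ed up front on
 * PHYSDEVOP_manage_pci_add, and assigned later on the guest's MAPTI. */
#define INVALID_VLPI 0u

struct plpi_entry {
    uint16_t domain_id;
    uint16_t vcpu_id;
    uint32_t vlpi;      /* INVALID_VLPI while unassigned */
};

/* On receiving pLPI "plpi": return the entry to inject into, or NULL
 * if the LPI is still unassigned and should be discarded. */
static struct plpi_entry *translate_plpi(struct plpi_entry *table,
                                         uint32_t first_plpi, uint32_t plpi)
{
    struct plpi_entry *e = &table[plpi - first_plpi];

    return e->vlpi == INVALID_VLPI ? NULL : e;
}
```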
> 
> Sounds good, well done!
> 

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel

 

