
Re: [Xen-devel] [for-xen-4.5 PATCH v2 2/2] dpci: Add ZOMBIE state to allow the softirq to finish with the dpci_pirq.



On Fri, Nov 21, 2014 at 03:55:37PM +0000, Jan Beulich wrote:
> >>> On 21.11.14 at 16:14, <konrad.wilk@xxxxxxxxxx> wrote:
> > On Fri, Nov 21, 2014 at 01:03:43PM +0100, Jan Beulich wrote:
> >> >>> On 21.11.14 at 12:50, <konrad.wilk@xxxxxxxxxx> wrote:
> >> > On November 21, 2014 2:51:33 AM EST, Jan Beulich <JBeulich@xxxxxxxx> 
> >> > wrote:
> >> >>>>> On 20.11.14 at 20:51, <konrad.wilk@xxxxxxxxxx> wrote:
> >> >>> @@ -669,7 +670,7 @@ static void hvm_dirq_assist(struct domain *d, struct hvm_pirq_dpci *pirq_dpci)
> >> >>>      ASSERT(d->arch.hvm_domain.irq.dpci);
> >> >>>  
> >> >>>      spin_lock(&d->event_lock);
> >> >>> -    if ( pirq_dpci->state )
> >> >>> +    if ( test_and_clear_bool(pirq_dpci->masked) )
> >> >>>      {
> >> >>>          struct pirq *pirq = dpci_pirq(pirq_dpci);
> >> >>>          const struct dev_intx_gsi_link *digl;
> >> >>
> >> >>So this now guards solely against the timeout enforced EOI? Why do
> >> >>you no longer need to guard against cancellation (i.e. why isn't this
> >> >>looking at both ->state and ->masked)?
> >> >>
> >> > 
> >> > The previous state check was superfluous as the dpci_softirq would
> >> > check for the valid STATE_ before calling hvm_dirq_assist (and deal
> >> > with cancellation).
> >> 
> >> I thought so first too, but then how is the guarding against
> >> cancellation working now?
> > 
> > Since there are two forms of cancellation, let's consider each one
> > separately:
> > 
> > 1). pt_irq_create_bind and pt_irq_destroy_bind. Each of them calls
> >     pt_pirq_softirq_reset in the 'error' case or in the normal path of
> >     destruction. When that happens the action handler for the IRQ is
> >     removed the moment we call pirq_guest_unbind. As such we only have
> >     to deal with the potential outstanding pirq_dpci waiting to be
> >     serviced. Since we did the 'pt_pirq_softirq_reset' we have changed
> >     the state from STATE_SCHED to zero. That means dpci_softirq will
> >     NOT go in:
> > 
> >     if ( test_and_clear_bit(STATE_SCHED, &pirq_dpci->state) )
> 
> Unless the flag got set again in the meantime. Since the event lock

I am having a hard time seeing when 'state' would be set in that scenario.

'raise_softirq_for' is called (from 'hvm_do_IRQ_dpci') only when
'pirq_dpci->flags' has HVM_IRQ_DPCI_MAPPED set. That flag gets cleared in
'pt_irq_create_bind' and 'pt_irq_destroy_bind'. Also, inside those functions
we try our best to change the 'state' from STATE_SCHED to zero.
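
As a mental model (this is not the Xen code itself, just a self-contained
C11 sketch of the 'state' protocol as described above, with atomics standing
in for Xen's bitops and the helper names being mine):

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdio.h>

    enum { STATE_SCHED = 1 << 0, STATE_RUN = 1 << 1 };

    static _Atomic unsigned int state;

    /* raise_softirq_for(): only the caller that flips STATE_SCHED from 0
     * to 1 may put the pirq_dpci on the per-cpu list. */
    static bool try_schedule(void)
    {
        return !(atomic_fetch_or(&state, STATE_SCHED) & STATE_SCHED);
    }

    /* pt_pirq_softirq_reset(): cancelling only works on a pending, not
     * yet running, entry - hence the cmpxchg from exactly STATE_SCHED
     * to zero. */
    static bool try_cancel(void)
    {
        unsigned int expect = STATE_SCHED;
        return atomic_compare_exchange_strong(&state, &expect, 0);
    }

    /* dpci_softirq(): mark as running, consume the schedule bit, and only
     * service the entry if that bit was still set. */
    static bool softirq_service(void)
    {
        bool ran = false;

        atomic_fetch_or(&state, STATE_RUN);
        if ( atomic_fetch_and(&state, ~STATE_SCHED) & STATE_SCHED )
            ran = true;                /* would call hvm_dirq_assist() */
        atomic_fetch_and(&state, ~STATE_RUN);
        return ran;
    }

    int main(void)
    {
        printf("scheduled: %d\n", try_schedule());    /* 1 - queued */
        printf("cancelled: %d\n", try_cancel());      /* 1 - reset wins */
        printf("serviced:  %d\n", softirq_service()); /* 0 - skipped */
        return 0;
    }

That is why, once pt_pirq_softirq_reset has moved 'state' from STATE_SCHED
to zero, dpci_softirq will skip the hvm_dirq_assist call.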

But let's imagine we are in 'hvm_dirq_assist', just about to take the lock,
while another CPU is calling 'pt_irq_create_bind'.

The 'pt_irq_create_bind' will spin on:

    /*
     * A crude 'while' loop with us dropping the spinlock and giving
     * the softirq_dpci a chance to run.
     * We MUST check for this condition as the softirq could be scheduled
     * and hasn't run yet. Note that this code replaced tasklet_kill which
     * would have spun forever and would do the same thing (wait to flush
     * out outstanding hvm_dirq_assist calls).
     */
    if ( pt_pirq_softirq_active(pirq_dpci) )
    {
        spin_unlock(&d->event_lock);
        cpu_relax();
        goto restart;
    }

OK, so that is not it.
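
For reference, the predicate that loop spins on presumably boils down to
checking both bits of 'state'; a sketch of the assumed shape (not a quote
of the actual function, and it relies on the Xen types):

    /* "Active" means: still queued (STATE_SCHED) or currently being
     * serviced (STATE_RUN). */
    bool_t pt_pirq_softirq_active(const struct hvm_pirq_dpci *pirq_dpci)
    {
        return !!( pirq_dpci->state &
                   ((1 << STATE_RUN) | (1 << STATE_SCHED)) );
    }

so the loop only exits once the softirq has completely finished with the
pirq_dpci.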

Let's try another case: 'pt_irq_create_bind' is in the error path:

    rc = msixtbl_pt_register(d, info, pt_irq_bind->u.msi.gtable);
    if ( unlikely(rc) )
    {
        pirq_guest_unbind(d, info);
        /*
         * Between 'pirq_guest_bind' and before 'pirq_guest_unbind'
         * an interrupt can be scheduled. No more of them are going
         * to be scheduled but we must deal with the one that may be
         * in the queue.
         */
        pt_pirq_softirq_reset(pirq_dpci);

And right at that moment the softirq is running. It has already set
STATE_RUN, just cleared STATE_SCHED, and is executing 'hvm_dirq_assist',
which is stuck spinning on the event lock that the other CPU holds.

'pt_irq_create_bind' fails in 'pt_pirq_softirq_reset' to change the 'state'
(as the state is STATE_RUN, not STATE_SCHED). 'pt_irq_create_bind' continues
on with setting '->flags = 0' and unlocks the lock.

'hvm_dirq_assist' grabs the lock and continues on (state is still STATE_RUN).
Since 'flags == 0' and 'digl_list' is empty, it thunders through
'hvm_dirq_assist' not doing anything - except that it ends up calling the
dreaded 'set_timer', which will hit another ASSERT as the timer has never
been initialized. Setting 'masked = 0' on the error paths for MSI takes care
of that, and I have just rolled that 'masked = 0' into the
'pt_pirq_softirq_reset' function.

The legacy interrupt path does not have this window, so it does not need this.
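
Again as a self-contained model rather than Xen source (the *_model names
are mine, with C11 atomics standing in for test_and_clear_bool), this is the
'masked' hand-off the patch below implements:

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdio.h>

    static atomic_bool masked;

    /* hvm_do_IRQ_dpci(): an interrupt arrived, note it and schedule. */
    static void do_IRQ_model(void)
    {
        atomic_store(&masked, true);
    }

    /* pt_pirq_softirq_reset(): cancellation clears the pending marker
     * (in the real code, under the event lock). */
    static void reset_model(void)
    {
        atomic_store(&masked, false);
    }

    /* hvm_dirq_assist(): only service (and possibly set_timer) if an
     * interrupt is actually pending. */
    static void dirq_assist_model(void)
    {
        if ( atomic_exchange(&masked, false) )
            printf("service interrupt (may call set_timer)\n");
        else
            printf("nothing pending - set_timer never reached\n");
    }

    int main(void)
    {
        do_IRQ_model();      /* interrupt comes in, softirq scheduled */
        reset_model();       /* binding torn down while softirq pending */
        dirq_assist_model(); /* late hvm_dirq_assist finds masked == 0 */
        return 0;
    }

With that, the stale hvm_dirq_assist run can no longer reach set_timer.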

> is being acquired right before the line quoted above, _that_ is

No. The event lock is acquired in 'hvm_dirq_assist', which would then be
checking 'masked', not 'state':

   spin_lock(&d->event_lock);                                                  
   if ( test_and_clear_bool(pirq_dpci->masked) )  

> what needs closely looking at (and why I was asking about the
> deletion of the [at the first glance unnecessary] checking of ->state
> in hvm_dirq_assist()).
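
To make the list corruption that started all of this concrete, the patch
below walks through the interleaving; here is a plain-C, single-threaded
replay of it (a model with a minimal list implementation, not Xen code):

    #include <stdio.h>

    #define LIST_POISON1 ((struct list_head *)0x100)

    struct list_head { struct list_head *next, *prev; };

    static void INIT_LIST_HEAD(struct list_head *h)
    {
        h->next = h->prev = h;
    }

    static void list_add_tail(struct list_head *n, struct list_head *h)
    {
        n->prev = h->prev;
        n->next = h;
        h->prev->next = n;
        h->prev = n;
    }

    /* The first half of list_del(): unlink, but do not poison yet. */
    static void __list_del_entry(struct list_head *e)
    {
        e->prev->next = e->next;
        e->next->prev = e->prev;
    }

    int main(void)
    {
        struct list_head dpci_list, entry;

        INIT_LIST_HEAD(&dpci_list);

        /* raise_softirq_for(): pirq_dpci goes on the per-cpu list. */
        list_add_tail(&entry, &dpci_list);

        /* dpci_softirq: list_del() has unlinked the entry... */
        __list_del_entry(&entry);

        /* [interrupt] 'state' was forced to 0, so raise_softirq_for()
         * believes it may re-add the entry. */
        list_add_tail(&entry, &dpci_list);

        /* [back in dpci_softirq] ...and now list_del() finishes the
         * poisoning, clobbering ->next of a live list entry. */
        entry.next = LIST_POISON1;

        /* The next dpci_softirq run walks straight into the poison. */
        printf("dpci_list.next->next = %p\n",
               (void *)dpci_list.next->next);
        return 0;
    }

The next list_del() would dereference LIST_POISON1 and blow up, which is
how the reported crash manifests.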


From 90d00db0949a8e796d7f812134753a54b2fe3d51 Mon Sep 17 00:00:00 2001
From: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>
Date: Thu, 20 Nov 2014 14:28:11 -0500
Subject: [PATCH] dpci: Add 'masked' as a gate for hvm_dirq_assist to process.

commit f6dd295381f4b6a66acddacf46bca8940586c8d8
"dpci: replace tasklet with softirq" used the 'masked' as an
two-bit state mechanism (STATE_SCHED, STATE_RUN) to communicate
between 'raise_softirq_for' and 'dpci_softirq' to determine whether
the 'struct hvm_pirq_dpci' can be re-scheduled.

However it ignored that 'pt_irq_guest_eoi' was not adhering to the proper
protocol: it was not using locked cmpxchg or test_bit operations, and ended
up setting 'state' to zero. That meant 'raise_softirq_for' was free to
schedule it while the 'struct hvm_pirq_dpci' was still on a per-cpu list,
causing list corruption.

The code would trigger the following path causing list corruption:

    \- timer_softirq_action
        pt_irq_time_out calls pt_pirq_softirq_cancel, which sets state to 0.
            pirq_dpci is still on dpci_list.
    \- dpci_softirq
        while (!list_empty(&our_list))
                list_del, but has not yet done 'entry->next = LIST_POISON1;'
    [interrupt happens]
        raise_softirq checks state, which is zero. Adds pirq_dpci to the
        dpci_list.
    [interrupt is done, back to dpci_softirq]
        finishes the entry->next = LIST_POISON1;
        .. test of STATE_SCHED returns true, so executes hvm_dirq_assist.
        ends the loop, exits.

    \- dpci_softirq
        while (!list_empty)
                list_del, but ->next already has LIST_POISON1 and we blow up.

An alternative solution was proposed (adding STATE_ZOMBIE and making
pt_irq_time_out use the cmpxchg protocol on 'state'), which fixed the above
issue but had a fatal bug: it would miss interrupts that were yet to be
scheduled!

This patch brings back the 'masked' boolean, which is used as a
communication channel between 'hvm_do_IRQ_dpci', 'hvm_dirq_assist' and
'pt_irq_guest_eoi'. When we have an interrupt we set 'masked'. Any time
'hvm_dirq_assist' or 'pt_irq_guest_eoi' executes, it clears it.

The 'state' is left as a separate mechanism for 'raise_softirq_for' and
'dpci_softirq' to communicate the scheduling state of the
'struct hvm_pirq_dpci'.

However, since we now have two separate state machines, we have to deal with
a cancellation racing with an outstanding interrupt being serviced:
'pt_irq_destroy_bind' is called while 'hvm_dirq_assist' is just about to run.
'pt_irq_destroy_bind' takes the lock first and kills the timer - and the
moment it releases the spinlock, 'hvm_dirq_assist' thunders in and calls
'set_timer', hitting an ASSERT.

By clearing 'masked' in 'pt_irq_destroy_bind' we take care of that scenario,
inhibiting 'hvm_dirq_assist' from calling 'set_timer'.

In 'pt_irq_create_bind', in the error cases, we could see a softirq
scheduled right away and being serviced (though stuck at the spinlock).
'pt_irq_create_bind' fails in 'pt_pirq_softirq_reset' to change the 'state'
(as the state is STATE_RUN, not STATE_SCHED). 'pt_irq_create_bind' continues
on with setting '->flags = 0' and unlocks the lock.

'hvm_dirq_assist' grabs the lock and continues on. Since 'flags == 0' and
'digl_list' is empty, it thunders through 'hvm_dirq_assist' not doing
anything until it hits 'set_timer', which is undefined for MSI. Adding in
'masked = 0' for the MSI case fixes that.

The legacy interrupt path does not need it, as there is no chance of
do_IRQ being called at that point.

Reported-by: Sander Eikelenboom <linux@xxxxxxxxxxxxxx>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>
v2: Deal with 'masked' in pt_irq_destroy_bind.
v3: Deal with 'masked' in pt_irq_create_bind.
---
 xen/drivers/passthrough/io.c | 12 ++++++++++--
 xen/include/xen/hvm/irq.h    |  1 +
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
index efc66dc..ae050df 100644
--- a/xen/drivers/passthrough/io.c
+++ b/xen/drivers/passthrough/io.c
@@ -129,6 +129,13 @@ static void pt_pirq_softirq_reset(struct hvm_pirq_dpci *pirq_dpci)
         pirq_dpci->dom = NULL;
         break;
     }
+    /*
+     * Inhibit 'hvm_dirq_assist' from doing anything useful and at worst
+     * calling 'set_timer' which will blow up (as we have called kill_timer
+     * or never initialized it). Note that we hold the lock that
+     * 'hvm_dirq_assist' could be spinning on.
+     */
+    pirq_dpci->masked = 0;
 }
 
 bool_t pt_irq_need_timer(uint32_t flags)
@@ -142,7 +149,7 @@ static int pt_irq_guest_eoi(struct domain *d, struct hvm_pirq_dpci *pirq_dpci,
     if ( __test_and_clear_bit(_HVM_IRQ_DPCI_EOI_LATCH_SHIFT,
                               &pirq_dpci->flags) )
     {
-        pirq_dpci->state = 0;
+        pirq_dpci->masked = 0;
         pirq_dpci->pending = 0;
         pirq_guest_eoi(dpci_pirq(pirq_dpci));
     }
@@ -610,6 +617,7 @@ int hvm_do_IRQ_dpci(struct domain *d, struct pirq *pirq)
          !(pirq_dpci->flags & HVM_IRQ_DPCI_MAPPED) )
         return 0;
 
+    pirq_dpci->masked = 1;
     raise_softirq_for(pirq_dpci);
     return 1;
 }
@@ -669,7 +677,7 @@ static void hvm_dirq_assist(struct domain *d, struct hvm_pirq_dpci *pirq_dpci)
     ASSERT(d->arch.hvm_domain.irq.dpci);
 
     spin_lock(&d->event_lock);
-    if ( pirq_dpci->state )
+    if ( test_and_clear_bool(pirq_dpci->masked) )
     {
         struct pirq *pirq = dpci_pirq(pirq_dpci);
         const struct dev_intx_gsi_link *digl;
diff --git a/xen/include/xen/hvm/irq.h b/xen/include/xen/hvm/irq.h
index 9709397..3996f1f 100644
--- a/xen/include/xen/hvm/irq.h
+++ b/xen/include/xen/hvm/irq.h
@@ -94,6 +94,7 @@ struct hvm_irq_dpci {
 struct hvm_pirq_dpci {
     uint32_t flags;
     unsigned int state;
+    bool_t masked;
     uint16_t pending;
     struct list_head digl_list;
     struct domain *dom;
-- 
1.9.3

