xen-devel

[Xen-devel] RE: VM hung after running sometime

To: <keir.fraser@xxxxxxxxxxxxx>
Subject: [Xen-devel] RE: VM hung after running sometime
From: MaoXiaoyun <tinnycloud@xxxxxxxxxxx>
Date: Tue, 21 Sep 2010 13:02:53 +0800
Cc: xen devel <xen-devel@xxxxxxxxxxxxxxxxxxx>
In-reply-to: <C8BCE982.23833%keir.fraser@xxxxxxxxxxxxx>
References: <BAY121-W55A5D5C31D990DE652359DA7E0@xxxxxxx>, <C8BCE982.23833%keir.fraser@xxxxxxxxxxxxx>
Hi Keir:
 
        I spent more time on how event channels work. I now know that an event is bound to an
irq with a call to request_irq. When an event is sent, the other side of the channel runs into
asm_do_IRQ->generic_handle_irq->generic_handle_irq_desc->handle_level_irq
(here it actually invokes desc->handle_irq, which for an evtchn is handle_level_irq).
I noticed that in handle_level_irq the event's mask and pending bits are cleared.
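 
To make the wiring concrete, here is a minimal sketch of how a pvops kernel puts an event
channel behind handle_level_irq, based on my reading of drivers/xen/events.c (the handler
name, the "my-evtchn" label and the NULL dev_id are made up for illustration):
 
------------------------------sketch: binding an event channel (illustrative)---
#include <linux/interrupt.h>
#include <xen/events.h>

/* bind_evtchn_to_irqhandler() allocates a dynamic irq, installs
 * xen_dynamic_chip with handle_level_irq as the flow handler, and then
 * calls request_irq() -- so a notification on the port travels
 * asm_do_IRQ -> generic_handle_irq -> handle_level_irq -> this handler. */
static irqreturn_t my_evtchn_interrupt(int irq, void *dev_id)
{
        /* consume the notification here */
        return IRQ_HANDLED;
}

static int bind_my_port(unsigned int evtchn)
{
        return bind_evtchn_to_irqhandler(evtchn, my_evtchn_interrupt,
                                         0, "my-evtchn", NULL);
}
------------------------------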
 
Well, I have one more piece of analysis to discuss.
 
Attached is the evtchn state from the physical server while a VM was hung. Domain 10 is the hung one.
We can see domain 10's VCPU info at the bottom of the log; it has flags = 4, which means
_VPF_blocked_in_xen.
 
(XEN) VCPU information and callbacks for domain 10:
(XEN)     VCPU0: CPU11 [has=F] flags=4 poll=0 upcall_pend = 00, upcall_mask = 00 dirty_cpus={} cpu_affinity={4-15}
(XEN)     paging assistance: shadowed 2-on-3
(XEN)     No periodic timer
(XEN)     Notifying guest (virq 1, port 0, stat 0/-1/0)
(XEN)     VCPU1: CPU9 [has=T] flags=0 poll=0 upcall_pend = 00, upcall_mask = 00 dirty_cpus={9} cpu_affinity={4-15}
(XEN)     paging assistance: shadowed 2-on-3
(XEN)     No periodic timer
(XEN)     Notifying guest (virq 1, port 0, stat 0/-1/0)

And its domain event-channel info is:
(XEN) Domain 10 polling vCPUs: {No periodic timer}
(XEN) Event channel information for domain 10:
(XEN)     port [p/m]
(XEN)        1 [0/1]: s=3 n=0 d=0 p=105 x=1
(XEN)        2 [0/1]: s=3 n=1 d=0 p=106 x=1
(XEN)        3 [0/0]: s=3 n=0 d=0 p=104 x=0
(XEN)        4 [0/1]: s=2 n=0 d=0 x=0
(XEN)        5 [0/0]: s=6 n=0 x=0
(XEN)        6 [0/0]: s=2 n=0 d=0 x=0
(XEN)        7 [0/0]: s=3 n=0 d=0 p=107 x=0
(XEN)        8 [0/0]: s=3 n=0 d=0 p=108 x=0
(XEN)        9 [0/0]: s=3 n=0 d=0 p=109 x=0
(XEN)       10 [0/0]: s=3 n=0 d=0 p=110 x=0
 
Based on our situation, we are only interested in the event channels whose consumer_is_xen
is 1, i.e. those with "x=1": ports 1 and 2. According to the log, the other side of these
channels is domain 0, ports 105 and 106.
 
Taking a look at domain 0's event channels on ports 105 and 106, I find that on port 105 the
pending bit is 1 (in "[1/0]", the first bit refers to pending and is 1; the second refers to mask and is 0).
 
(XEN)      105 [1/0]: s=3 n=2 d=10 p=1 x=0
(XEN)      106 [0/0]: s=3 n=2 d=10 p=2 x=0
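 
For reference, this is roughly how Xen sets that pending bit when the channel is notified --
a simplified sketch from my reading of evtchn_set_pending() in xen/common/event_channel.c,
not the verbatim source:
 
------------------------------sketch: evtchn_set_pending (simplified)---
/* The 'p' and 'm' columns in the debug output above are these two bit arrays. */
static void evtchn_set_pending(struct vcpu *v, int port)
{
    shared_info_t *s = v->domain->shared_info;

    /* Mark the port pending; if it already was, there is nothing more to do. */
    if ( test_and_set_bit(port, &s->evtchn_pending[0]) )
        return;

    /* Deliver an upcall only if the port is unmasked and its selector
     * bit was not already set. */
    if ( !test_bit(port, &s->evtchn_mask[0]) &&
         !test_and_set_bit(port / BITS_PER_LONG,
                           &v->vcpu_info->evtchn_pending_sel) )
        vcpu_mark_events_pending(v);
}
------------------------------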
 
In all, we have a domain U VCPU blocked on _VPF_blocked_in_xen, and the notification to
domain 0 must have set the pending bit. Given that pending is still 1, it looks like the irq
was never triggered, am I right?
If it had been triggered, it should have cleared the pending bit (line 361 below).
 
------------------------------/linux-2.6-pvops.git/kernel/irq/chip.c---
354 void
355 handle_level_irq(unsigned int irq, struct irq_desc *desc)
356 {
357         struct irqaction *action;
358         irqreturn_t action_ret;
359
360         spin_lock(&desc->lock);
361         mask_ack_irq(desc, irq);
362
363         if (unlikely(desc->status & IRQ_INPROGRESS))
364                 goto out_unlock;
365         desc->status &= ~(IRQ_REPLAY | IRQ_WAITING);
366         kstat_incr_irqs_this_cpu(irq, desc);
367
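 
For the Xen dynamic-irq chip, mask_ack_irq() at line 361 falls back to the chip's .mask and
.ack callbacks (it defines no .mask_ack), which in drivers/xen/events.c end up, as far as I
can read the source, in these two helpers -- touching exactly the 'm' and 'p' bits shown in
the debug output:
 
------------------------------drivers/xen/events.c (abridged sketch)---
static void mask_evtchn(int port)
{
        struct shared_info *s = HYPERVISOR_shared_info;
        sync_set_bit(port, &s->evtchn_mask[0]);      /* sets the 'm' bit */
}

static void clear_evtchn(int port)
{
        struct shared_info *s = HYPERVISOR_shared_info;
        sync_clear_bit(port, &s->evtchn_pending[0]); /* clears the 'p' bit */
}
------------------------------
So if handle_level_irq had run for port 105, clear_evtchn() would have cleared its 'p' bit,
which is why pending still being 1 suggests the irq never fired.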
 
BTW, qemu still works fine while the VM is hung. Below is its strace output.
There is not much difference from other healthy qemu instances, other than that every select() times out.
-------------------
select(14, [3 7 11 12 13], [], [], {0, 10000}) = 0 (Timeout)
clock_gettime(CLOCK_MONOTONIC, {673470, 59535265}) = 0
clock_gettime(CLOCK_MONOTONIC, {673470, 59629728}) = 0
clock_gettime(CLOCK_MONOTONIC, {673470, 59717700}) = 0
clock_gettime(CLOCK_MONOTONIC, {673470, 59806552}) = 0
select(14, [3 7 11 12 13], [], [], {0, 10000}) = 0 (Timeout)
clock_gettime(CLOCK_MONOTONIC, {673470, 70234406}) = 0
clock_gettime(CLOCK_MONOTONIC, {673470, 70332116}) = 0
clock_gettime(CLOCK_MONOTONIC, {673470, 70419835}) = 0
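 
Keir, your explanation below of prepare_wait_on_xen_event_channel maps onto
hvm_send_assist_req() like this, if I read xen/arch/x86/hvm/hvm.c correctly (a simplified
sketch of the ordering, not the verbatim source):
 
------------------------------sketch: hvm_send_assist_req ordering (simplified)---
void hvm_send_assist_req(struct vcpu *v)
{
    ioreq_t *p = &get_ioreq(v)->vp_ioreq;

    /* 1. Register as a waiter on the port *before* qemu-dm can possibly
     *    answer; this is what makes the wakeup race-free. */
    prepare_wait_on_xen_event_channel(v->arch.hvm_vcpu.xen_port);

    /* 2. Only now publish the request and notify qemu-dm; the prepared
     *    wait acts as an implicit barrier. */
    p->state = STATE_IOREQ_READY;
    notify_via_xen_event_channel(v->arch.hvm_vcpu.xen_port);
}
------------------------------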
 
 
 
 
> Date: Mon, 20 Sep 2010 10:35:46 +0100
> Subject: Re: VM hung after running sometime
> From: keir.fraser@xxxxxxxxxxxxx
> To: tinnycloud@xxxxxxxxxxx
> CC: xen-devel@xxxxxxxxxxxxxxxxxxx; jbeulich@xxxxxxxxxx
>
> On 20/09/2010 10:15, "MaoXiaoyun" <tinnycloud@xxxxxxxxxxx> wrote:
>
> > Thanks Keir.
> >
> > You're right, after I deeply looked into the wait_on_xen_event_channel, it is
> > smart enough
> > to avoid the race I assumed.
> >
> > How about prepare_wait_on_xen_event_channel ?
> > Currently I still don't know when it will be invoked.
> > Could you enlighten me?
>
> As you can see it is called from hvm_send_assist_req(), hence it is called
> whenever an ioreq is sent to qemu-dm. Note that it is called *before*
> qemu-dm is notified -- hence it cannot race the wakeup from qemu, as we will
> not get woken until qemu-dm has done the work, and it cannot start the work
> until it is notified, and it is not notified until after
> prepare_wait_on_xen_event_channel has been executed.
>
> -- Keir
>
> >
> >> Date: Mon, 20 Sep 2010 08:45:21 +0100
> >> Subject: Re: VM hung after running sometime
> >> From: keir.fraser@xxxxxxxxxxxxx
> >> To: tinnycloud@xxxxxxxxxxx
> >> CC: xen-devel@xxxxxxxxxxxxxxxxxxx; jbeulich@xxxxxxxxxx
> >>
> >> On 20/09/2010 07:00, "MaoXiaoyun" <tinnycloud@xxxxxxxxxxx> wrote:
> >>
> >>> When IO is not ready, domain U in VMEXIT->hvm_do_resume might invoke
> >>> wait_on_xen_event_channel
> >>> (where it is blocked in _VPF_blocked_in_xen).
> >>>
> >>> Here is my assumption of event missed.
> >>>
> >>> step 1: hvm_do_resume executes line 260, and suppose p->state is STATE_IOREQ_READY
> >>> or STATE_IOREQ_INPROCESS
> >>> step 2: then cpu_handle_ioreq, at line 547, executes line 548 so
> >>> quickly that it finishes before hvm_do_resume executes line 270.
> >>> Well, the event is missed.
> >>> In other words, the _VPF_blocked_in_xen is cleared before it is actually
> >>> set, and domain U, which is blocked,
> >>> might never get unblocked. Is this possible?
> >>
> >> Firstly, that code is very paranoid and it should never actually be the case
> >> that we see STATE_IOREQ_READY or STATE_IOREQ_INPROCESS in hvm_do_resume().
> >> Secondly, even if you do, take a look at the implementation of
> >> wait_on_xen_event_channel() -- it is smart enough to avoid the race you
> >> mention.
> >>
> >> -- Keir
> >>
> >>
> >
>
>

Attachment: hang.txt
Description: Text document
