
Re: [Xen-devel] Enabling vm_event for a guest with more VCPUs than available ring buffer slots freezes the virtual machine



On 02/08/2017 01:23 AM, Tamas K Lengyel wrote:
> 
> 
> On Tue, Feb 7, 2017 at 1:41 PM, Razvan Cojocaru
> <rcojocaru@xxxxxxxxxxxxxxx> wrote:
> 
>     On 02/07/2017 10:20 PM, Tamas K Lengyel wrote:
>     >
>     >
>     > On Tue, Feb 7, 2017 at 11:57 AM, Razvan Cojocaru
>     > <rcojocaru@xxxxxxxxxxxxxxx> wrote:
>     >
>     >     On 02/07/2017 08:39 PM, Andrew Cooper wrote:
>     >     > On 07/02/17 18:31, Razvan Cojocaru wrote:
>     >     >> On 02/07/2017 08:15 PM, Tamas K Lengyel wrote:
>     >     >>>
>     >     >>> On Tue, Feb 7, 2017 at 9:53 AM, Razvan Cojocaru
>     >     >>> <rcojocaru@xxxxxxxxxxxxxxx> wrote:
>     >     >>>
>     >     >>>     Hello,
>     >     >>>
>     >     >>>     Setting, e.g., 16 VCPUs for an HVM guest ends up
>     blocking the
>     >     guest
>     >     >>>     completely when subscribing to vm_events, apparently
>     because
>     >     of this
>     >     >>>     code in xen/common/vm_event.c:
>     >     >>>
>     >     >>>     315     /* Give this vCPU a black eye if necessary, on the
>     >     way out.
>     >     >>>     316      * See the comments above wake_blocked() for more
>     >     information
>     >     >>>     317      * on how this mechanism works to avoid
>     waiting. */
>     >     >>>     318     avail_req = vm_event_ring_available(ved);
>     >     >>>     319     if( current->domain == d && avail_req <
>     d->max_vcpus )
>     >     >>>     320         vm_event_mark_and_pause(current, ved);
>     >     >>>
>     >     >>>     It would appear that even if the guest only has 2 online
>     >     VCPUs, the
>     >     >>>     "avail_req < d->max_vcpus" condition will pause
>     current, and we
>     >     >>>     eventually end up with all the VCPUs paused.
>     >     >>>
>     >     >>>     An ugly hack ("avail_req < 2") has allowed booting a guest
>     >     with many
>     >     >>>     VCPUs (max_vcpus, the guest only brings 2 VCPUs online),
>     >     however that's
>     >     >>>     just to prove that that was the culprit - a real
>     solution to
>     >     this needs
>     >     >>>     more in-depth understanding of the issue and potential
>     >     solution. That's
>     >     >>>     basically very old code (pre-2012 at least) that got moved
>     >     around into
>     >     >>>     the current shape of Xen today - please CC anyone relevant
>     >     to the
>     >     >>>     discussion that you're aware of.
>     >     >>>
>     >     >>>     Thoughts?
>     >     >>>
>     >     >>>
>     >     >>> I think this is a side-effect of the growth of the vm_event
>     structure
>     >     and the
>     >     >>> fact that we have a single page ring. The check
>     effectively sets a
>     >     >>> threshold of having enough space for each vCPU to place at
>     least one
>     >     >>> more event on the ring, and if that's not the case it gets
>     >     paused. OTOH
>     >     >>> I think this would only have an effect on asynchronous events,
>     >     for all
>     >     >>> other events the vCPU is already paused. Is that the case
>     you have?
>     >     >> No, on the contrary, all my events are synchronous (the VCPU is
>     >     paused
>     >     >> waiting for the vm_event reply).
>     >     >>
>     >     >> I've debugged this a bit, and the problem seems to be that
>     >     >> vm_event_wake_blocked() breaks here:
>     >     >>
>     >     >> 150     /* We remember which vcpu last woke up to avoid
>     scanning
>     >     always
>     >     >> linearly
>     >     >> 151      * from zero and starving higher-numbered vcpus under
>     >     high load */
>     >     >> 152     if ( d->vcpu )
>     >     >> 153     {
>     >     >> 154         int i, j, k;
>     >     >> 155
>     >     >> 156         for (i = ved->last_vcpu_wake_up + 1, j = 0; j <
>     >     >> d->max_vcpus; i++, j++)
>     >     >> 157         {
>     >     >> 158             k = i % d->max_vcpus;
>     >     >> 159             v = d->vcpu[k];
>     >     >> 160             if ( !v )
>     >     >> 161                 continue;
>     >     >> 162
>     >     >> 163             if ( !(ved->blocked) || online >= avail_req )
>     >     >> 164                break;
>     >     >> 165
>     >     >> 166             if ( test_and_clear_bit(ved->pause_flag,
>     >     &v->pause_flags) )
>     >     >> 167             {
>     >     >> 168                 vcpu_unpause(v);
>     >     >> 169                 online++;
>     >     >> 170                 ved->blocked--;
>     >     >> 171                 ved->last_vcpu_wake_up = k;
>     >     >> 172             }
>     >     >> 173         }
>     >     >> 174     }
>     >     >>
>     >     >> at "if ( !(ved->blocked) || online >= avail_req )". At this
>     point,
>     >     >> nothing ever gets unblocked. It's hard to believe that this is
>     >     desired
>     >     >> behaviour, as I don't know what could possibly happen for that
>     >     condition
>     >     >> to become false once all the online VCPUs are stuck (especially
>     >     when the
>     >     >> guest has just started booting).
>     >
>     >
>     > Ah I see what happens. During boot vCPU 0 generates an event and gets
>     > marked blocked because the number of vCPUs is so high. The other vCPUs
>     > are still unblocked since they are idle, but this test here will still
>     > be true (online >= avail_req) and thus we can never unblock vCPU0. And
>     > then the boot process is hanging because vCPU0 never resumes. I would
>     > argue that this test should be changed to check that there is at
>     least 1
>     > spot on the ring and only break if that is not the case anymore (i.e.
>     > instead of incrementing online we should be decrementing avail_req).
> 
>     That is exactly what happens. And it can't really be fixed just by
>     increasing the ring buffer (although that definitely helps a lot and
>     would be a smart move): no matter how large it is, we can always ask the
>     domain to use more VCPUs than there are slots in the buffer.
> 
> 
> No, not increasing the ring buffer but fixing the check to unblock when
> there is at least 1 spot on the ring. So look at this comment...
> 
> /*
>  * vm_event_wake_blocked() will wakeup vcpus waiting for room in the
>  * ring. These vCPUs were paused on their way out after placing an event,
>  * but need to be resumed where the ring is capable of processing at least
>  * one event from them.
>  */
> 
> .. it seems to me that the check "online >= avail_req" was just wrong
> from the start.

I've read that to mean that there needs to be more than one available
slot in the ring: "wake vcpus [plural] [...] at least one event from them".
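
For the record, this is how I understand the change you're proposing,
as a sketch only (against the 4.7-era loop quoted above, untested,
variable names taken from that excerpt): break only when there is no
free slot left on the ring, and account for each woken vCPU by
decrementing the local count of available slots instead of comparing
against the number of vCPUs brought online:

    if ( d->vcpu )
    {
        int i, j, k;

        for ( i = ved->last_vcpu_wake_up + 1, j = 0; j < d->max_vcpus; i++, j++ )
        {
            k = i % d->max_vcpus;
            v = d->vcpu[k];
            if ( !v )
                continue;

            /* Nothing left to wake, or no room for even one more event. */
            if ( !ved->blocked || avail_req == 0 )
                break;

            if ( test_and_clear_bit(ved->pause_flag, &v->pause_flags) )
            {
                vcpu_unpause(v);
                avail_req--;    /* one slot now spoken for */
                ved->blocked--;
                ved->last_vcpu_wake_up = k;
            }
        }
    }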

>     >     >
>     >     > I wouldn't bet that this logic has ever been tested.  If you
>     >     recall, the
>     >     > addition of register state into the vm_event ring made each
>     entry far
>     >     > larger, which in turn makes it more likely to hit this
>     condition.
>     >     >
>     >     > However, simply fixing the logic to re-online the cpus isn't
>     a good
>     >     > solution either, as having $N vcpus paused at any one time
>     because of
>     >     > ring contention is not conducive to good system performance.
>     >     >
>     >     > Realistically, the ring size needs to be max_cpus *
>     sizeof(largest
>     >     > vm_event) at an absolute minimum, and I guess this is now
>     beyond 1
>     >     page?
>     >
>     >     Yes, of course the reason this triggers earlier now is the
>     growth of the
>     >     request's size. Yes, using e.g. 20 VCPUs in the guest's setup will
>     >     exceed a page's number of slots.
>     >
>     >     And yes, ideally we should have multi-page ring buffers -
>     however that
>     >     is a long-term project that requires design changes in other
>     parts of
>     >     Xen as well (Andrew, CCd here, was recently talking about one).
>     >
>     >     However, even with a one-page ring buffer, surely it's not
>     good to end
>     >     up in this situation, especially for guests such as mine,
>     which never
>     >     actually bring more than 2 VCPUs online. But even if they were
>     to use
>     >     more, blocking the guest on vm_event init is completely
>     pointless - we
>     >     might as well return some kind of error if max_vcpus >
>     available slots.
>     >
>     >     I don't follow the system performance argument. Surely completely
>     >     blocking the guest is worse.
>     >
>     >
>     > I also don't see the point in marking a vCPU blocked if it is already
>     > paused. I think this behavior of blocking vCPUs only makes sense for
>     > asynchronous events. Razvan, could you test what happens if
>     > vm_event_mark_and_pause is only called if the vCPU is unpaused?
> 
>     It works for me with this change (using Xen 4.7 sources here):
> 
>     @@ -318,7 +329,11 @@ void vm_event_put_request(struct domain *d,
>           * on how this mechanism works to avoid waiting. */
>          avail_req = vm_event_ring_available(ved);
>          if( current->domain == d && avail_req < d->max_vcpus )
>     -        vm_event_mark_and_pause(current, ved);
>     +    {
>     +        if ( !atomic_read( &current->vm_event_pause_count ) )
>     +            vm_event_mark_and_pause(current, ved);
>     +    }
> 
> 
> Good, that's what I expected. There really is no point in blocking a
> vCPU that is already paused. But I would just add that check to the if
> above as another && clause.

Quite right, this was late-night debugging and I also had a printk() in
that scope, which I quickly removed on pasting the change. The patch
will simply have the additional && condition.
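
For reference, with the extra check folded in as another && clause the
condition would look roughly like this (functionally the same as the
nested if tested above, just reshaped; again against the 4.7 sources):

    avail_req = vm_event_ring_available(ved);
    if ( current->domain == d && avail_req < d->max_vcpus &&
         !atomic_read(&current->vm_event_pause_count) )
        vm_event_mark_and_pause(current, ved);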


Thanks,
Razvan

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel

 

